Case study · April 2026
Filemender — a B2B SaaS where Claude agents run growth.
Eight scheduled Claude agents. Five blog posts a week. 50–100 cold emails a day. Hundreds of researched leads. Built solo, runs unattended.
The product
Filemender is a web-based SaaS for validating, analysing, and repairing corrupted or non-compliant media files. It's built for post-production studios, ad agencies, VFX houses, and audio engineers — anyone whose job involves video, audio, image, or document files where the wrong codec, frame rate, or naming convention will get a delivery rejected by a network or platform.
The core workflow: upload a file, Filemender runs it through a handler pipeline, identifies what's wrong (codec issues, corruption, spec violations, naming convention failures), and either flags the problems in a detailed QC report or attempts a repair. Pricing is credit-based across four tiers — Starter (30 credits/month), Pro (200), Agency (1,000), Enterprise (5,000) — and agencies can stand up branded upload portals for their own clients to submit files directly.
Stack: Vue 3 on the frontend, FastAPI + Celery + Redis on the backend, PostgreSQL via Supabase, DigitalOcean Spaces for object storage, Stripe for billing, Resend for transactional email. I built and shipped it solo.
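To make the handler-pipeline idea concrete, here is a minimal sketch of the dispatch pattern in Python. The handler names and rules are toy examples; the production pipeline runs far more checks across video, audio, image, and document formats.

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    code: str        # e.g. "BAD_NAMING", "WRONG_CONTAINER"
    detail: str
    repairable: bool

@dataclass
class QCReport:
    filename: str
    issues: list[Issue] = field(default_factory=list)

def check_naming(filename: str) -> list[Issue]:
    # Toy rule: no spaces or uppercase in a deliverable filename.
    if " " in filename or filename != filename.lower():
        return [Issue("BAD_NAMING", "spaces or uppercase in filename", repairable=True)]
    return []

def check_container(filename: str) -> list[Issue]:
    # Toy rule: this hypothetical delivery spec only accepts .mov or .mxf.
    if not filename.endswith((".mov", ".mxf")):
        return [Issue("WRONG_CONTAINER", "expected .mov or .mxf", repairable=False)]
    return []

HANDLERS = [check_naming, check_container]

def run_pipeline(filename: str) -> QCReport:
    # Each handler contributes its findings to one QC report.
    report = QCReport(filename=filename)
    for handler in HANDLERS:
        report.issues.extend(handler(filename))
    return report
```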
The problem this case study is about
Building the product is half the job. The other half is finding and converting customers — and for a solo-founded B2B SaaS with no marketing budget, that's where most projects quietly die. I had a few options: hire a marketing agency I couldn't afford, do it all manually and never have time to ship product, or build an AI-powered growth stack and treat the whole marketing function as an engineering problem.
I picked option three. This case study is about what I built, what's running, what I got wrong, and what I'd change if I were building the same thing for a client.
Architecture
The growth stack runs as a set of scheduled Claude agents orchestrated through Cowork (Anthropic's desktop scheduling layer for Claude). Each agent has a single, narrow job. They share state through the Filemender database and a small set of structured artifacts — lead tables, draft queues, content calendar entries. Nothing is "one big agent." That pattern is brittle, expensive, and impossible to debug. Everything here is small, monitored, and individually replaceable.
- Blog writer: long-form SEO articles, 3×/week. Opus drafts, Haiku fact-checks.
- Lead researcher: 25–30 new UK prospects/week. Haiku shortlists, Opus qualifies.
- Email drafter: personalised 3-email cold sequences, sector-tailored.
- Email sender: Python cron via Resend, 25/day max. Deterministic, no LLM.
- LinkedIn writer: 5 founder-voice posts/week. Voice trained on labelled samples.
- LinkedIn poster: Chrome automation, jittered timing, peak engagement windows.
- LinkedIn DMs: personalised openers referencing posts, job changes, mutuals.
- Twitter monitor: searches for venting about file corruption, queues replies.
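To show what "small and individually replaceable" looks like in practice, here is a minimal sketch of the registry pattern in Python. The schedules, model labels, and field names are illustrative only; this is not Cowork's actual configuration format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    name: str           # one narrow job per agent
    model: str          # which Claude model the agent calls ("none" = no LLM)
    schedule: str       # cron-style schedule (illustrative values)
    needs_review: bool  # True if output waits in a human approval queue

AGENTS = [
    AgentSpec("blog_writer",     "opus",  "0 7 * * 1,3,5",     needs_review=True),
    AgentSpec("lead_researcher", "haiku", "0 6 * * 1-5",       needs_review=False),
    AgentSpec("email_drafter",   "opus",  "0 8 * * 1-5",       needs_review=True),
    AgentSpec("email_sender",    "none",  "*/30 9-17 * * 1-5", needs_review=False),
    # ...one entry per agent; replacing or pausing one never touches the others.
]
```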
What's running today
- 5 blog posts published per week to filemender.com/blog
- 50–100 cold emails per day through Resend, paced to protect deliverability
- 3 LinkedIn posts per week, plus around 50 DMs
- Hundreds of newly researched leads added to the database every week
Everything runs on a schedule. Nothing requires me to be online. I check the queues daily for anything that needs human approval — anything outbound that goes out in my name has a review gate — and the rest happens on its own.
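Below is a minimal sketch of how the review gate and daily cap fit together on the sending side. fetch_approved_drafts and send_via_resend are hypothetical stand-ins for the drafts-table query and the Resend call, and the cap and delay values are illustrative.

```python
import random
import time

DAILY_CAP = 25  # illustrative ceiling per sending identity, to protect deliverability

def fetch_approved_drafts(limit: int) -> list[dict]:
    """Return only drafts a human has already approved in the review queue."""
    return []  # placeholder: replace with a real query against the drafts table

def send_via_resend(draft: dict) -> None:
    """Hypothetical thin wrapper around the Resend send call."""
    raise NotImplementedError

def run_daily_send() -> None:
    for sent, draft in enumerate(fetch_approved_drafts(limit=DAILY_CAP), start=1):
        send_via_resend(draft)
        if sent >= DAILY_CAP:
            break
        # Jitter between sends so the outbound pattern doesn't look machine-paced.
        time.sleep(random.uniform(60, 300))
```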
Three things I'd build differently for a client
This section matters more than the architecture. Most case studies are puff. Here's where the first version was wrong.
Lesson 01
Model selection: I started with Opus for everything. I shouldn't have.
The first version of the stack ran Opus across every agent. Within a week the API bill was uncomfortable enough to make me audit. About 70% of the work — initial shortlisting, headline generation, simple categorisation, sentiment checks — was being done at premium prices when Haiku would do it indistinguishably. I kept Opus for jobs where output quality was judged by a human (final email drafts, blog post bodies, LinkedIn posts) and migrated everything else to Haiku. Costs dropped by roughly an order of magnitude and I couldn't tell the difference in output quality.
For a client, I'd build the cost monitoring in from day one — per-agent budget caps, alerts on anomalies, a weekly spend report. I didn't, and got a surprise bill that taught me to.
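For illustration, here is a minimal sketch of what per-agent routing and budget caps can look like. The model labels, cap values, and agent names are placeholders, not figures from the Filemender stack.

```python
from collections import defaultdict

# Route premium models only to output a human will judge; cheap models elsewhere.
MODEL_FOR_AGENT = {
    "blog_writer":     "opus",
    "email_drafter":   "opus",
    "lead_shortlist":  "haiku",   # shortlisting and categorisation
    "sentiment_check": "haiku",
}

# Illustrative daily caps per agent, in dollars.
DAILY_BUDGET_USD = {
    "blog_writer": 5.0, "email_drafter": 3.0,
    "lead_shortlist": 1.0, "sentiment_check": 0.5,
}

_spend_today: dict[str, float] = defaultdict(float)

def record_spend(agent: str, cost_usd: float) -> None:
    """Track spend per agent and stop the agent rather than run up the bill."""
    _spend_today[agent] += cost_usd
    if _spend_today[agent] > DAILY_BUDGET_USD[agent]:
        raise RuntimeError(f"{agent} exceeded its daily budget; alert and pause")
```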
Lesson 02
Hallucinated leads: the first version of the lead researcher invented contact details.
When you ask an LLM to find a CTO's email address, it will sometimes confidently make one up. The first version of the lead researcher produced lists with a non-trivial percentage of fabricated emails, fabricated job titles, and, on at least one occasion, an entirely fabricated company. Sending cold email to fabricated addresses ruins your sender reputation immediately.
Fix: every claim the LLM makes about a prospect — email, role, company URL — gets verified against an external source before it lands in the leads table. Email goes through a verification API. Company URL gets a real-time fetch. Job title gets cross-referenced against LinkedIn. Anything that fails verification gets dropped or flagged for manual review. The agent went from a creative writer back to a researcher.
This is the lesson I'd carry into any LLM integration: you don't trust the LLM with anything that can break the world downstream. Treat its output as a candidate, verify it, then accept or reject.
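As a sketch of that candidate-then-verify shape, here is roughly what the acceptance gate looks like in Python. verify_email and title_matches_linkedin are hypothetical stand-ins for an email-verification API and a LinkedIn cross-check; only the URL check is shown end to end.

```python
import requests

def url_resolves(url: str) -> bool:
    """The claimed company URL only counts if it actually responds."""
    try:
        return requests.head(url, timeout=10, allow_redirects=True).status_code < 400
    except requests.RequestException:
        return False

def verify_email(address: str) -> bool:
    """Hypothetical call to an email-verification / deliverability API."""
    raise NotImplementedError

def title_matches_linkedin(lead: dict) -> bool:
    """Hypothetical cross-reference of the claimed role against LinkedIn."""
    raise NotImplementedError

def accept_or_reject(lead: dict) -> str:
    """LLM output is a candidate; it only reaches the leads table after checks."""
    if not url_resolves(lead["company_url"]) or not verify_email(lead["email"]):
        return "rejected"
    if not title_matches_linkedin(lead):
        return "needs_manual_review"
    return "accepted"
```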
Lesson 03
Cold email quality: the first version was templated personalisation that fooled no one.
The first email drafter used the kind of personalisation that's everywhere in cold outreach now: "{{first_name}}, I saw you work at {{company}}…" The reply rate was awful. Worse, several recipients replied just to point out the email was obviously generated.
V2 changed the order of operations. Before the drafter writes anything, a research step actually reads the prospect's website, recent LinkedIn posts, and any recent press, then writes a short paragraph of what's specifically interesting about this company right now. The drafter writes the email with that paragraph as context. Reply rates improved meaningfully. More importantly, the emails read like they were written by someone who'd actually looked at the company — because in a way, they were.
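A minimal sketch of that order of operations, using the Anthropic Python SDK. The prompts are condensed, the model is left as a parameter, and fetch_prospect_context is a hypothetical stand-in for the scraping step.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def fetch_prospect_context(prospect: dict) -> str:
    """Hypothetical scraping step: site copy, recent LinkedIn posts, press."""
    raise NotImplementedError

def research_brief(prospect: dict, model: str) -> str:
    """Step 1: read about the prospect, write a short 'why this company now' paragraph."""
    raw = fetch_prospect_context(prospect)
    msg = client.messages.create(
        model=model, max_tokens=400,
        messages=[{"role": "user", "content":
            f"In one short paragraph, what is specifically interesting about "
            f"{prospect['company']} right now?\n\n{raw}"}])
    return msg.content[0].text

def draft_email(prospect: dict, brief: str, model: str) -> str:
    """Step 2: the drafter writes with the research paragraph as context."""
    msg = client.messages.create(
        model=model, max_tokens=600,
        messages=[{"role": "user", "content":
            f"Write a short, plain cold email to {prospect['name']} at "
            f"{prospect['company']}. Ground it in this brief, no template filler:\n\n{brief}"}])
    return msg.content[0].text
```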
The lesson for clients: most "AI personalisation" is templated mad-libs and prospects can tell. If you're going to use LLMs for outbound, the LLM has to be doing the work a human would do — reading and thinking about the recipient — not slotting variables into a template.
What this maps to for B2B SaaS
Most companies I talk to don't need an LLM-powered product feature. They need an LLM-powered internal workflow: lead research, content generation, support triage, document parsing, reporting, customer onboarding emails. The pattern that works is the one I used for Filemender's growth stack — small narrow agents doing one job each, with deterministic verification layers around anything that touches the world. Not one giant agent. Not raw LLM output piped straight to production.
If you're building this kind of stack for the first time, the things that will bite you are model cost, hallucination on anything factual, and the gap between "looks impressive in a demo" and "still works for the hundredth unattended run." I've hit all three in production and have a strong opinion on how to engineer around them. That's the work I'm available to do.
Want this for your product?
Book a 15-min intro call.