The Economics of AI-Assisted Development for a WordPress Plugin Team

A Question We Couldn’t Answer Honestly In Month 1

How much does it cost to ship a WordPress plugin with Claude Code as a daily collaborator?

For our first month building Statnive, we couldn’t answer that. We knew the bill. We didn’t know what was actually driving it. Some days were $3, others were $11, and the variance had no obvious correlation with how much code shipped.

Four months later we have a much clearer picture. This post is the cost accounting: where the money goes, the four levers that together cut our spend by roughly 2–3×, and the one number that mattered more than any individual optimization.

Headline: daily spend dropped from ~$6/day to ~$2–3/day for the same engineering throughput. The savings are not additive. They multiply.

Where AI Spend Actually Goes

Anthropic’s pricing for the Claude Code-relevant models, current as of early 2026:

Model	Input / MTok	Output / MTok	Best for
Haiku 4.5	$1	$5	Read-only exploration, validation, routine fixes
Sonnet 4.6	$3	$15	Standard development work
Opus 4.7	$5	$25	Complex architecture, deep reasoning

Naively, you pick one model and pay that rate × your token consumption. The reality is finer-grained, and three of the four optimization levers below exploit it.

For reference, here are typical token budgets per operation:

Operation type	Typical tokens	Realistic time
Simple fix (Haiku)	2K–5K	1–2 min
Standard feature (Sonnet)	10K–20K	5–10 min
Complex refactor (Sonnet + extended thinking)	30K–50K	15–25 min
Architecture review (Opus)	50K–100K	20–40 min

Multiplied across a working day, that is where ~$6/day for a single engineer comes from. Multiplied across a year, the optimization conversation gets serious.
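To make the arithmetic concrete, here is a minimal sketch that prices one operation from the rates in the pricing table. The input/output split is an assumption (the token budgets above do not break it down), and `PRICES` and `operation_cost` are illustrative names, not part of any tool.

```python
# Per-million-token prices (USD) copied from the pricing table above.
PRICES = {
    "haiku": {"input": 1.0, "output": 5.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "opus": {"input": 5.0, "output": 25.0},
}


def operation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the uncached cost in USD of one operation."""
    rate = PRICES[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Assumed split: a 20K-token standard feature with 16K input and 4K output.
feature = operation_cost("sonnet", 16_000, 4_000)
print(f"${feature:.3f}")
```

In an agentic session the conversation is re-sent on every turn, so billed input tokens grow faster than a single per-operation budget suggests. That gap is what the levers below target.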

Lever 1 — Prompt Caching (~90% On The Stable Prefix)

Claude Code automatically caches the stable prefix of every conversation: the system prompt, built-in tool definitions, skill metadata, and the root CLAUDE.md. Cache reads cost 0.1× the base price — a 90% reduction on the cached portion.

The stable prefix is 50K+ tokens for most setups. Without caching, every turn re-charges those 50K tokens. With caching, every turn after the first reads them at 10% cost.

Two practical rules:

Order content static-first, dynamic-last. Cache hits require an exact prefix match. Anything dynamic at the start of a conversation breaks the cache and forces a full re-read. We keep root CLAUDE.md and skill metadata first, dynamic context (changed files, current branch state) last.
Don’t pollute the prefix with anything that changes every session. This is the real argument for keeping CLAUDE.md lean — every byte you remove is a byte that doesn’t need re-caching when the file changes, and shorter cached prefixes are cheaper to refresh when they expire.

Caching is the single biggest cost lever, and it’s mostly automatic. The work is in not breaking it.
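A rough sketch of what the cached prefix saves over one conversation. It assumes the cache never expires between turns and uses a 1.25× multiplier for the first-turn cache write, which is Anthropic's published rate for the default five-minute cache rather than a figure from this post. `prefix_cost` is a hypothetical helper.

```python
def prefix_cost(prefix_tokens: int, turns: int, input_rate: float,
                cached: bool, write_mult: float = 1.25,
                read_mult: float = 0.1) -> float:
    """Cost in USD of sending a stable prefix on each of `turns` turns.

    input_rate is USD per million input tokens. With caching, the first
    turn pays a cache write and every later turn pays a cache read.
    Assumes turns >= 1 and no cache expiry mid-conversation.
    """
    per_turn = prefix_tokens * input_rate / 1_000_000
    if not cached:
        return per_turn * turns
    return per_turn * (write_mult + read_mult * (turns - 1))


# 50K-token prefix, 20 turns, Sonnet input rate of $3/MTok.
uncached = prefix_cost(50_000, 20, 3.0, cached=False)
cached = prefix_cost(50_000, 20, 3.0, cached=True)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

Under these assumptions the prefix costs about $3.00 uncached and about $0.47 cached. The saving on the prefix approaches 90% only as the number of turns grows, because the first turn still pays for the write.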

Lever 2 — Model Routing (3× On The Routed Portion)

Haiku 4.5 costs one-third of what Sonnet 4.6 does per token, on both input and output. The capability difference for read-only and validation tasks is minimal. Anthropic suggests defaulting to Haiku for ~80% of routine tasks.

In practice we route as follows:

Task	Model	Why
Code exploration (“find all places X is called”)	Haiku (subagent)	Read-only, deterministic
Test failure triage	Haiku (subagent)	Pattern matching, low judgment
Standard feature implementation	Sonnet (main)	Default for productive work
Architecture decisions, schema design	Opus (main, occasional)	Worth the premium for hard reasoning
Code review pass	Haiku (fork subagent)	Read-heavy, summary returns

Routing is most powerful when paired with subagents, because subagents have their own model setting independent of the main session. We can run the main conversation on Sonnet while a fork subagent does the read-heavy work on Haiku. The model routing happens at the skill boundary, not the conversation boundary.
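The routing table above can be sketched as a plain lookup, along with the blended rate it produces. This is not Claude Code configuration; `ROUTES`, `pick_model`, and `blended_input_rate` are hypothetical names used only to illustrate the idea.

```python
# Hypothetical routing table mirroring the task list above.
ROUTES = {
    "exploration": "haiku",
    "test_triage": "haiku",
    "feature": "sonnet",
    "architecture": "opus",
    "code_review": "haiku",
}


def pick_model(task_type: str, default: str = "sonnet") -> str:
    """Return the model for a task type, falling back to the default."""
    return ROUTES.get(task_type, default)


def blended_input_rate(haiku_share: float, haiku_rate: float = 1.0,
                       sonnet_rate: float = 3.0) -> float:
    """Average USD per million input tokens when a share goes to Haiku."""
    return haiku_share * haiku_rate + (1 - haiku_share) * sonnet_rate


print(pick_model("code_review"))
print(blended_input_rate(0.5))
```

Routing half of the input tokens to Haiku takes the blended rate from $3 to $2 per million, a one-third cut. The 3× applies only to the routed share, which is why the lever is described that way.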

Lever 3 — Subagent Isolation (~37% Main-Context Reduction)

Subagents (also called fork agents) get their own 200K-token context window. Work done inside a subagent doesn’t pollute the main conversation. Only a summary returns.

Anthropic documents subagents returning ~500–1,000 tokens from 10,000+ of internal work — roughly a 37% main-context reduction on complex tasks.

Two cost effects:

Direct token savings. A code review that reads 30 files and returns a verdict consumes its tokens in isolation. Without isolation, those 30 file reads sit in main context, getting re-charged on every subsequent turn (modulo caching).
Quality-driven token savings. Long contexts degrade retrieval quality — research documents performance drops of 15–47% as context fills, the “lost in the middle” problem. Lower quality means more retries, more re-reads, more tokens. Keeping main context clean prevents this whole class of waste.

The trade-off is real: agent teams use approximately 7× more total tokens than standard sessions because each agent spawns a fresh instance with its own setup overhead. We accept it because we’re not optimizing for total tokens — we’re optimizing for main-context cleanliness and overall throughput. And subagents on Haiku are cheap regardless.
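A simplified sketch of the direct saving: tokens that stay in main context are re-sent on every later turn, so what matters is how many tokens stay and for how long. The turn count is an assumption, caching is ignored, and the subagent's own internal cost is deliberately left out. `carried_tokens` is a hypothetical helper.

```python
def carried_tokens(added_tokens: int, remaining_turns: int) -> int:
    """Input tokens re-sent because `added_tokens` remain in main context."""
    return added_tokens * remaining_turns


# 10,000 tokens of file reads kept inline vs. a 1,000-token summary,
# each carried for an assumed 15 further turns.
inline = carried_tokens(10_000, 15)
forked = carried_tokens(1_000, 15)
print(inline, forked)
```

The subagent still pays for its 10,000 tokens once, plus setup overhead, which is where the higher total token count comes from. The difference is that main context carries only the summary forward.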

Lever 4 — MCP Tool Deferral (~85%)

We covered this in detail in the MCP Tool Search post. The short version:

Twenty-four MCP connectors with no deferral consumed roughly 135,000 tokens of baseline overhead per session. Tool Search dropped that to ~3,000 tokens for the index, with full schemas loading on demand. 85% reduction, accuracy went up, sessions stopped feeling cramped.

Cost effect: every token of baseline overhead is part of the cached prefix, so it is read again on every turn of every session. Cutting roughly 130K tokens of overhead saves 130K × 0.1 × the per-token input rate on each of those reads. At Sonnet’s $3/MTok that is about $0.04 per read, which adds up quickly over a month of daily sessions.
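A back-of-the-envelope sketch of that overhead over a month. The number of cached reads per day and the number of working days are assumptions chosen for illustration, not measurements from our sessions, and `monthly_overhead_cost` is a hypothetical helper.

```python
def monthly_overhead_cost(overhead_tokens: int, input_rate: float,
                          reads_per_day: int, working_days: int,
                          read_mult: float = 0.1) -> float:
    """USD per month spent re-reading cached baseline overhead.

    input_rate is USD per million input tokens; read_mult is the
    cache-read discount (0.1x the base price).
    """
    per_read = overhead_tokens * input_rate * read_mult / 1_000_000
    return per_read * reads_per_day * working_days


# Assumed: 100 cached reads per day, 20 working days, Sonnet input rate.
before = monthly_overhead_cost(135_000, 3.0, reads_per_day=100, working_days=20)
after = monthly_overhead_cost(3_000, 3.0, reads_per_day=100, working_days=20)
print(f"before ${before:.2f}, after ${after:.2f}")
```

The result scales linearly with reads per day, so it is worth plugging in your own turn counts rather than trusting the assumed ones.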

The Compounding Multiplier

Here’s the part the headline numbers don’t capture. The four levers are multiplicative, not additive.

Picture a baseline session burning, say, $0.30 over a 5-minute task. Apply each lever in turn:

Lever	Effect	Running cost
Baseline	—	$0.30
Prompt caching (90% on stable prefix)	Stable prefix is ~70% of session tokens	~$0.11
Model routing (3× on Haiku-eligible work)	~50% of work is Haiku-eligible	~$0.07
Subagent isolation (37% main-context reduction)	Main context shrinks → less re-charged	~$0.05
MCP tool deferral (85% on tool-schema overhead)	Schemas no longer baseline	~$0.04

Each step is modest. The composition is dramatic. The exact percentages are not what matter. The structural insight is that each lever scales whatever is left of the bill after the previous one, so the savings multiply rather than add. Fix four of them and you approach an order-of-magnitude change.
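The running-cost table can be reproduced with a few lines of arithmetic. This sketch makes two assumptions the table does not state: the 37% subagent reduction is applied to the whole remaining cost, and tool schemas are taken to be about 20% of what is left by the last step. `apply_lever`, `LEVERS`, and `running_costs` are illustrative names.

```python
def apply_lever(cost: float, affected_share: float, reduction: float) -> float:
    """Cut `cost` by `reduction` on `affected_share` of it; the rest is unchanged."""
    return cost * (1 - affected_share * reduction)


# (name, share of remaining cost affected, reduction on that share)
LEVERS = [
    ("prompt caching", 0.70, 0.90),
    ("model routing", 0.50, 2 / 3),      # Haiku at one-third the price
    ("subagent isolation", 1.00, 0.37),  # assumed to apply to the whole bill
    ("MCP tool deferral", 0.20, 0.85),   # 20% share is an assumption
]


def running_costs(baseline: float, levers) -> list:
    """Return the cost after each lever, starting with the baseline."""
    costs = [baseline]
    for _name, share, reduction in levers:
        costs.append(apply_lever(costs[-1], share, reduction))
    return costs


for cost in running_costs(0.30, LEVERS):
    print(f"${cost:.2f}")
```

With these inputs each step lands within a cent of the table's running costs. Reordering the levers leaves the final figure unchanged, since the factors simply multiply.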

This is why our daily spend dropped from ~$6 to ~$2–3 without changing what we shipped. Same code, same tests, same release cadence. Different defaults.

What We Spend Money On Now

A typical Statnive engineering day, post-optimization:

Activity	Approximate cost
Morning code review on an open PR (forked, Haiku)	~$0.20
2 standard features in the React dashboard (Sonnet, with caching)	~$0.80
1 release-gate run (`statnive-release` skill)	~$0.30
Documentation/blog post drafts	~$0.40
Misc exploration and Q&A	~$0.50
Total	~$2.20

Across the team, our monthly Anthropic bill is well under what a single legacy SaaS subscription would cost. For context: that bill funds an AI collaborator that helps ship a WordPress plugin used by real businesses, with 248 tests and 22 release gates guarding every release.

Extended Thinking: Match The Keyword To The Task

One pricing wrinkle worth knowing: Claude’s extended thinking modes have keyword-triggered budgets:

Keyword	Thinking budget
`think`	5K–10K tokens
`think hard`	20K–50K tokens
`think harder`	50K–100K tokens
`ultrathink`	100K–128K tokens (maximum)

These tokens are billed as output tokens, the expensive column of the pricing table. Use the cheapest keyword that fits the task. We default to no thinking modifier for routine work, `think hard` for non-trivial design decisions, and `ultrathink` only for actual architectural questions where deep reasoning earns its keep. Reaching for `ultrathink` on a typo fix is the AI-development equivalent of running a 50-tab Chrome window to read one webpage.
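For a sense of scale, this sketch prices the upper end of each keyword budget from the table above. It assumes the thinking tokens are billed at Sonnet's $15/MTok output rate, which is how Anthropic's documentation describes thinking-token billing, and that the full budget is consumed, which is a worst case. `THINKING_BUDGET_MAX` and `max_thinking_cost` are hypothetical names.

```python
# Upper bounds of the keyword budgets from the table above, in tokens.
THINKING_BUDGET_MAX = {
    "think": 10_000,
    "think hard": 50_000,
    "think harder": 100_000,
    "ultrathink": 128_000,
}


def max_thinking_cost(keyword: str, output_rate: float = 15.0) -> float:
    """Worst-case USD cost of one request's thinking at `output_rate` per MTok."""
    return THINKING_BUDGET_MAX[keyword] * output_rate / 1_000_000


for keyword in THINKING_BUDGET_MAX:
    print(f"{keyword}: up to ${max_thinking_cost(keyword):.2f}")
```

Under those assumptions a single fully used `ultrathink` request can cost about as much as a whole optimized day.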

Session Hygiene Saved Us As Much As The Levers

A counter-intuitive finding: the single biggest variance in our daily cost was session length, not skill or tool configuration.

Long autonomous sessions without context resets degrade quality progressively. The “lost in the middle” effect — performance drops of 15–47% as context fills — translates directly into more retries, more re-reads, and more wasted tokens. Sessions that ran past 80% context routinely cost 2–3× what shorter sessions cost for the same work.

Two hygiene rules now baked into how we work:

/clear between unrelated tasks. Switching from a frontend bug to a release-gate change is two different conversations. Clearing context costs nothing and prevents pollution.
/compact proactively at 60–70% context, not at the auto-compact 95% threshold. Auto-compact at the limit is a panic operation that loses information. Proactive compact preserves the important state and resets the noise.

Persisting state to files instead of conversation history makes both rules safe — anything important is on disk, so a clear doesn’t lose it.
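The two rules reduce to a small decision, sketched here. The 200K-token window is assumed to match the figure quoted for subagents, the 60% trigger is the low end of the range above, and `hygiene_action` is a hypothetical helper, not a Claude Code feature.

```python
def hygiene_action(used_tokens: int, window_tokens: int = 200_000,
                   switching_task: bool = False,
                   compact_at: float = 0.60) -> str:
    """Suggest a context action based on the two hygiene rules."""
    if switching_task:
        # Unrelated work starts from a clean context.
        return "/clear"
    if used_tokens / window_tokens >= compact_at:
        # Compact well before the automatic threshold.
        return "/compact"
    return "continue"


print(hygiene_action(130_000))
print(hygiene_action(50_000))
print(hygiene_action(10_000, switching_task=True))
```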

What We Did Not Optimize

We don’t measure per-skill cost yet. We can see daily totals via Anthropic billing and we use the local /cost and /stats commands for spot-checks, but we don’t have per-skill attribution. Tools like ccusage (reads Claude’s local JSONL session files) and Claude-Code-Usage-Monitor exist; we haven’t integrated them.
Our subagent ratio is probably too low. We use fork mode aggressively for reviews and audits, but a meaningful fraction of our standard work could plausibly be forked to keep main context cleaner. Haven’t measured rigorously.
We don’t run formal A/B comparisons. The $6 → $2–3 number comes from comparing our spend before and after optimizations stabilized. There’s no controlled experiment behind it. Yours will be different.

What It Would Take To Halve It Again

Honest answer: probably nothing worth doing.

We could push harder on Haiku. We could batch more work into subagents. We could write more aggressive output-budgeting contracts into our skills. Each would shave maybe 10–20% more.

The cost of an additional engineer is many multiples of our entire Anthropic spend. Spending engineering hours optimizing AI costs past a certain point is engineering hours not spent shipping the actual product. We optimize when sessions feel cramped or bills feel surprising. Otherwise we ship.

Try Statnive

Privacy-first WordPress analytics built by a team that takes cost accounting as seriously as test coverage. Install Statnive free from WordPress.org — your data stays on your server, our spend stays on a single budget line, and the entire engineering practice is open source.