The Economics of AI-Assisted Development for a WordPress Plugin Team
Honest cost accounting from a small team shipping a WordPress analytics plugin. What the Anthropic bill looked like before optimization, what it looks like now, and where the compounding savings actually come from.
A Question We Couldn’t Answer Honestly In Month 1
How much does it cost to ship a WordPress plugin with Claude Code as a daily collaborator?
For our first month building Statnive, we couldn’t answer that. We knew the bill. We didn’t know what was actually driving it. Some days were $3, others were $11, and the variance had no obvious correlation with how much code shipped.
Four months later we have a much clearer picture. This post is the cost accounting: where the money goes, the four levers that compound to cut our spend roughly 2–3×, and the one number that mattered more than any individual optimization.
Headline: daily spend dropped from ~$6/day to ~$2–3/day for the same engineering throughput. The savings are not additive. They multiply.
Where AI Spend Actually Goes
Anthropic’s pricing for the Claude Code-relevant models, current as of early 2026:
| Model | Input / MTok | Output / MTok | Best for |
|---|---|---|---|
| Haiku 4.5 | $1 | $5 | Read-only exploration, validation, routine fixes |
| Sonnet 4.6 | $3 | $15 | Standard development work |
| Opus 4.7 | $5 | $25 | Complex architecture, deep reasoning |
Naively, you pick one model and pay that rate × your token consumption. The reality is finer-grained, and three of the four optimization levers below exploit it.
For reference, here are typical token budgets per operation:
| Operation type | Typical tokens | Realistic time |
|---|---|---|
| Simple fix (Haiku) | 2K–5K | 1–2 min |
| Standard feature (Sonnet) | 10K–20K | 5–10 min |
| Complex refactor (Sonnet + extended thinking) | 30K–50K | 15–25 min |
| Architecture review (Opus) | 50K–100K | 20–40 min |
Multiplied across a working day, that’s where ~$6/day for a single engineer comes from. Multiply that across a year and the optimization conversation gets serious.
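To make the rate × tokens arithmetic concrete, here is a minimal sketch that prices a single operation from the table above. The prices come from the post; the 12K/3K input/output split in the example is a hypothetical assumption, since the token-budget table does not break tokens down by direction, and caching is ignored.

```python
# Per-million-token prices (USD), copied from the pricing table above.
PRICES = {
    "haiku": {"input": 1.0, "output": 5.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "opus": {"input": 5.0, "output": 25.0},
}


def operation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached cost in USD of one operation on the given model."""
    rates = PRICES[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical split: a 15K-token standard feature, 12K input and 3K output.
feature_cost = operation_cost("sonnet", 12_000, 3_000)
print(f"${feature_cost:.3f}")  # $0.081
```

Output tokens dominate quickly: in this example 20% of the tokens account for more than half of the cost.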
Lever 1 — Prompt Caching (~90% On The Stable Prefix)
Claude Code automatically caches the stable prefix of every conversation: the system prompt, built-in tool definitions, skill metadata, and the root CLAUDE.md. Cache reads cost 0.1× the base price — a 90% reduction on the cached portion.
The stable prefix is 50K+ tokens for most setups. Without caching, every turn re-charges those 50K tokens. With caching, every turn after the first reads them at 10% cost.
Two practical rules:
- Order content static-first, dynamic-last. Cache hits require an exact prefix match. Anything dynamic at the start of a conversation breaks the cache and forces a full re-read. We keep root `CLAUDE.md` and skill metadata first, dynamic context (changed files, current branch state) last.
- Don’t pollute the prefix with anything that changes every session. This is the real argument for keeping `CLAUDE.md` lean: every byte you remove is a byte that doesn’t need re-caching when the file changes, and shorter cached prefixes are cheaper to refresh when they expire.
Caching is the single biggest cost lever, and it’s mostly automatic. The work is in not breaking it.
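A rough sketch of what the 0.1× read rate does to a 50K-token prefix over a session. The 1.25× multiplier for the first (cache-write) turn is not from this post; it is our understanding of Anthropic's published short-lived cache-write price, so treat it as an assumption. The sketch also assumes the cache never expires mid-session and the prefix never changes.

```python
def prefix_cost(
    prefix_tokens: int,
    turns: int,
    input_rate_per_mtok: float,
    cached: bool,
    read_mult: float = 0.1,    # cache-read multiplier quoted in the post
    write_mult: float = 1.25,  # assumed cache-write multiplier (not from the post)
) -> float:
    """Cost in USD of sending the same stable prefix on every turn."""
    if turns <= 0:
        return 0.0
    base = prefix_tokens / 1_000_000 * input_rate_per_mtok
    if not cached:
        return base * turns
    # First turn writes the cache; every later turn reads it.
    return base * write_mult + base * read_mult * (turns - 1)


# 50K-token prefix on Sonnet ($3 per MTok input) over a 20-turn session.
uncached = prefix_cost(50_000, 20, 3.0, cached=False)
cached = prefix_cost(50_000, 20, 3.0, cached=True)
print(f"${uncached:.2f} vs ${cached:.2f}")  # $3.00 vs $0.47
```

The longer the session stays on one unbroken prefix, the closer the saving gets to the full 90%.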
Lever 2 — Model Routing (3× On The Routed Portion)
Haiku 4.5 costs one-third as much as Sonnet per token, on both input and output. The capability difference for read-only and validation tasks is minimal. Anthropic suggests defaulting to Haiku for ~80% of routine tasks.
In practice we route as follows:
| Task | Model | Why |
|---|---|---|
| Code exploration (“find all places X is called”) | Haiku (subagent) | Read-only, deterministic |
| Test failure triage | Haiku (subagent) | Pattern matching, low judgment |
| Standard feature implementation | Sonnet (main) | Default for productive work |
| Architecture decisions, schema design | Opus (main, occasional) | Worth the premium for hard reasoning |
| Code review pass | Haiku (fork subagent) | Read-heavy, summary returns |
Routing is most powerful when paired with subagents, because subagents have their own model setting independent of the main session. We can run the main conversation on Sonnet while a fork subagent does the read-heavy work on Haiku. The model routing happens at the skill boundary, not the conversation boundary.
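The routing table above can be sketched as a plain lookup. This is an illustration only, not how Claude Code configures models: the task-kind keys and the `route` helper are hypothetical names, and falling back to Sonnet for anything unlisted is our reading of "default for productive work".

```python
# Task kind -> model, mirroring the routing table above.
# The keys are hypothetical labels chosen for this sketch.
ROUTES = {
    "exploration": "haiku",
    "test_triage": "haiku",
    "code_review": "haiku",
    "feature": "sonnet",
    "architecture": "opus",
}

DEFAULT_MODEL = "sonnet"  # assumed fallback for unlisted work


def route(task_kind: str) -> str:
    """Pick the model for a task kind, falling back to the default."""
    return ROUTES.get(task_kind, DEFAULT_MODEL)


print(route("exploration"))   # haiku
print(route("architecture"))  # opus
print(route("docs"))          # sonnet
```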
Lever 3 — Subagent Isolation (~37% Main-Context Reduction)
Subagents (also called fork agents) get their own 200K-token context window. Work done inside a subagent doesn’t pollute the main conversation. Only a summary returns.
Anthropic documents subagents returning ~500–1,000 tokens from 10,000+ of internal work — roughly a 37% main-context reduction on complex tasks.
Two cost effects:
- Direct token savings. A code review that reads 30 files and returns a verdict consumes its tokens in isolation. Without isolation, those 30 file reads sit in main context, getting re-charged on every subsequent turn (modulo caching).
- Quality-driven token savings. Long contexts degrade retrieval quality — research documents performance drops of 15–47% as context fills, the “lost in the middle” problem. Lower quality means more retries, more re-reads, more tokens. Keeping main context clean prevents this whole class of waste.
The trade-off is real: agent teams use approximately 7× more total tokens than standard sessions because each agent spawns a fresh instance with its own setup overhead. We accept it because we’re not optimizing for total tokens — we’re optimizing for main-context cleanliness and overall throughput. And subagents on Haiku are cheap regardless.
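A small sketch of the direct saving: tokens that land in main context are re-read on every later turn, so what matters is how many tokens land there. The 10,000 and 1,000 figures are the ones quoted above; the 15 later turns are a hypothetical session length, and the cache-read discount is ignored for simplicity.

```python
def main_context_rereads(added_tokens: int, later_turns: int) -> int:
    """Tokens re-read by the main session because of content added earlier."""
    return added_tokens * later_turns


LATER_TURNS = 15  # hypothetical number of turns after the review finishes

# Inline: all 10,000 tokens of review work stay in main context.
inline = main_context_rereads(10_000, LATER_TURNS)

# Isolated: only a 1,000-token summary returns to main context.
isolated = main_context_rereads(1_000, LATER_TURNS)

print(inline, isolated)  # 150000 15000
```

The subagent still pays for its own 10,000 tokens once, which is why total tokens can rise even as main-context cost falls.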
Lever 4 — MCP Tool Deferral (~85%)
We covered this in detail in the MCP Tool Search post. The short version:
Twenty-four MCP connectors with no deferral consumed roughly 135,000 tokens of baseline overhead per session. Tool Search dropped that to ~3,000 tokens for the index, with full schemas loading on demand. 85% reduction, accuracy went up, sessions stopped feeling cramped.
Cost effect: every token of baseline overhead is part of the cached prefix in every session. Cutting 130K tokens of overhead saves roughly 130K × 0.1 × the per-token input rate on each cache read, and the prefix is read again on every turn. Even at 10% cost, that’s hundreds of dollars per month for a working engineer.
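The per-turn saving is easy to put a number on. This sketch uses the 135,000 and 3,000 token figures from above and Sonnet's $3 per MTok input rate; the turn count is a parameter you would fill in from your own usage, and the 40 in the example is purely hypothetical.

```python
def deferral_savings(
    before_tokens: int,
    after_tokens: int,
    input_rate_per_mtok: float,
    turns: int,
    read_mult: float = 0.1,  # cache-read multiplier quoted in the post
) -> float:
    """USD saved on cache reads by shrinking the baseline tool overhead."""
    removed = before_tokens - after_tokens
    per_turn = removed / 1_000_000 * input_rate_per_mtok * read_mult
    return per_turn * turns


per_turn = deferral_savings(135_000, 3_000, 3.0, turns=1)
print(f"${per_turn:.4f} per turn")  # $0.0396 per turn

# Hypothetical 40-turn day; substitute your own turn count.
print(f"${deferral_savings(135_000, 3_000, 3.0, turns=40):.2f} per day")
```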
The Compounding Multiplier
Here’s the part the headline numbers don’t capture. The four levers are multiplicative, not additive.
Picture a baseline session burning, say, $0.30 over a 5-minute task. Apply each lever in turn:
| Lever | Effect | Running cost |
|---|---|---|
| Baseline | — | $0.30 |
| Prompt caching (90% on stable prefix) | Stable prefix is ~70% of session tokens | ~$0.11 |
| Model routing (3× on Haiku-eligible work) | ~50% of work is Haiku-eligible | ~$0.07 |
| Subagent isolation (37% main-context reduction) | Main context shrinks → less re-charged | ~$0.05 |
| MCP tool deferral (85% on tool-schema overhead) | Schemas no longer baseline | ~$0.04 |
Each step is modest. The composition is dramatic. The exact percentages are not what matter — the structural insight is that token-cost optimizations compound multiplicatively because they apply to overlapping portions of the bill. Fix four of them and you’ve changed the order of magnitude.
This is why our daily spend dropped from ~$6 to ~$2–3 without changing what we shipped. Same code, same tests, same release cadence. Different defaults.
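The compounding can be sketched as repeated multiplication: each lever removes a fraction of whatever cost is left after the previous one. The caching and routing shares come from the table above. Two inputs are assumptions made only for this illustration: that the 37% isolation effect applies to the whole remaining cost, and that tool schemas are about 15% of what remains by the last step.

```python
def apply_lever(cost: float, share: float, reduction: float) -> float:
    """Reduce `share` of the remaining cost by `reduction` (both 0..1)."""
    return cost * (1 - share * reduction)


# (name, share of remaining cost affected, reduction on that share)
LEVERS = [
    ("prompt caching", 0.70, 0.90),
    ("model routing", 0.50, 2 / 3),      # 3x cheaper means a 2/3 reduction
    ("subagent isolation", 1.00, 0.37),  # assumed to apply to all remaining cost
    ("MCP tool deferral", 0.15, 0.85),   # 15% share is a hypothetical figure
]

cost = 0.30  # baseline session cost in USD
running = []
for name, share, reduction in LEVERS:
    cost = apply_lever(cost, share, reduction)
    running.append((name, round(cost, 3)))

for name, value in running:
    print(f"{name}: ${value:.3f}")
```

Under these assumptions the running totals land within a cent of the table's rounded figures, and the final cost is roughly one-seventh of the baseline.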
What We Spend Money On Now
A typical Statnive engineering day, post-optimization:
| Activity | Approximate cost |
|---|---|
| Morning code review on an open PR (forked, Haiku) | ~$0.20 |
| 2 standard features in the React dashboard (Sonnet, with caching) | ~$0.80 |
| 1 release-gate run (`statnive-release` skill) | ~$0.30 |
| Documentation/blog post drafts | ~$0.40 |
| Misc exploration and Q&A | ~$0.50 |
| Total | ~$2.20 |
Across the team, our monthly Anthropic bill is well under what a single legacy SaaS subscription would cost. For context: that bill funds an AI collaborator that helps ship a WordPress plugin used by real businesses, with 248 tests and 22 release gates gating every release.
Extended Thinking: Match The Keyword To The Task
One pricing wrinkle worth knowing: Claude’s extended thinking modes have keyword-triggered budgets:
| Keyword | Thinking budget |
|---|---|
| `think` | 5K–10K tokens |
| `think hard` | 20K–50K tokens |
| `think harder` | 50K–100K tokens |
| `ultrathink` | 100K–128K tokens (maximum) |
These tokens are billed as output tokens, the more expensive side of the price table. Use the cheapest one that fits the task. We default to no thinking modifier for routine work, think hard for non-trivial design decisions, and ultrathink only for actual architectural questions where deep reasoning earns its keep. Reaching for ultrathink on a typo fix is the AI-development equivalent of running a 50-tab Chrome window to read one webpage.
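To see why the keyword choice matters, this sketch turns the budget table above into a worst-case cost per request. The budgets are the ones listed in the table; the rate is left as a parameter so you can pass whatever per-MTok price your thinking tokens are billed at, and the $15 used in the example is only an illustrative value.

```python
# Keyword -> (lower, upper) thinking budget in tokens, from the table above.
THINKING_BUDGETS = {
    "think": (5_000, 10_000),
    "think hard": (20_000, 50_000),
    "think harder": (50_000, 100_000),
    "ultrathink": (100_000, 128_000),
}


def thinking_cost_ceiling(keyword: str, rate_per_mtok: float) -> float:
    """Worst-case USD cost of one request's thinking budget."""
    _, upper = THINKING_BUDGETS[keyword]
    return upper / 1_000_000 * rate_per_mtok


EXAMPLE_RATE = 15.0  # illustrative per-MTok rate, not a recommendation

for keyword in THINKING_BUDGETS:
    ceiling = thinking_cost_ceiling(keyword, EXAMPLE_RATE)
    print(f"{keyword}: up to ${ceiling:.2f}")
```

At that example rate the ceiling ranges from $0.15 for `think` to $1.92 for `ultrathink`, per request.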
Session Hygiene Saved Us As Much As The Levers
A counter-intuitive finding: the single biggest variance in our daily cost was session length, not skill or tool configuration.
Long autonomous sessions without context resets degrade quality progressively. The “lost in the middle” effect — performance drops of 15–47% as context fills — translates directly into more retries, more re-reads, and more wasted tokens. Sessions that ran past 80% context routinely cost 2–3× what shorter sessions cost for the same work.
Two hygiene rules now baked into how we work:
- `/clear` between unrelated tasks. Switching from a frontend bug to a release-gate change is two different conversations. Clearing context costs nothing and prevents pollution.
- `/compact` proactively at 60–70% context, not at the auto-compact 95% threshold. Auto-compact at the limit is a panic operation that loses information. Proactive compact preserves the important state and resets the noise.
Persisting state to files instead of conversation history makes both rules safe — anything important is on disk, so a clear doesn’t lose it.
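The two rules reduce to a small decision, sketched here as a helper you might keep in your head rather than actually run. The `next_step` function is hypothetical, the 60% threshold is the lower bound of the range above, and the 200K-token window is an assumption carried over from the subagent figure quoted earlier.

```python
def next_step(
    used_tokens: int,
    switching_task: bool,
    window_tokens: int = 200_000,  # assumed main-session context window
    compact_at: float = 0.60,      # lower bound of the 60-70% rule
) -> str:
    """Suggest a hygiene action before starting the next piece of work."""
    if switching_task:
        # Unrelated work gets a fresh conversation regardless of fill level.
        return "/clear"
    if used_tokens / window_tokens >= compact_at:
        # Compact early instead of waiting for auto-compact near the limit.
        return "/compact"
    return "continue"


print(next_step(50_000, switching_task=True))    # /clear
print(next_step(130_000, switching_task=False))  # /compact
print(next_step(100_000, switching_task=False))  # continue
```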
What We Did Not Optimize
- We don’t measure per-skill cost yet. We can see daily totals via Anthropic billing and we use the local `/cost` and `/stats` commands for spot-checks, but we don’t have per-skill attribution. Tools like `ccusage` (reads Claude’s local JSONL session files) and Claude-Code-Usage-Monitor exist; we haven’t integrated them.
- Our subagent ratio is probably too low. We use fork mode aggressively for reviews and audits, but a meaningful fraction of our standard work could plausibly be forked to keep main context cleaner. Haven’t measured rigorously.
- We don’t run formal A/B comparisons. The $6 → $2–3 number comes from comparing our spend before and after optimizations stabilized. There’s no controlled experiment behind it. Yours will be different.
What It Would Take To Halve It Again
Honest answer: probably nothing worth doing.
We could push harder on Haiku. We could batch more work into subagents. We could write more aggressive output-budgeting contracts into our skills. Each would shave maybe 10–20% more.
The cost of an additional engineer is many multiples of our entire Anthropic spend. Spending engineering hours optimizing AI costs past a certain point is engineering hours not spent shipping the actual product. We optimize when sessions feel cramped or bills feel surprising. Otherwise we ship.
Try Statnive
Privacy-first WordPress analytics built by a team that takes cost accounting as seriously as test coverage. Install Statnive free from WordPress.org — your data stays on your server, our spend stays on a single budget line, and the entire engineering practice is open source.