MCP Tool Search: How We Deferred 120K Tokens of Tool Schemas

The Hidden Cost We Didn’t Notice For Weeks

When you install an MCP (Model Context Protocol) server in Claude Code, you get tools. Lots of tools. The GitHub MCP server alone provides 35 — for issues, pull requests, branches, comments, releases, workflow runs, deployments, security alerts, and more.

What you also get, by default, is every tool’s full JSON schema injected into context at session start. Names, descriptions, parameter types, enum values, examples — all of it, before you’ve typed a single prompt.

We connected 24 MCP servers over the course of building Statnive. We didn’t think about the cost until our sessions started feeling cramped. Then we ran /context for the first time.

This post is the before/after, the one flag that did 85% of the work, and the four other patterns that finished the job.

What `/context` Showed Us

Here’s the relevant slice of our /context output before any optimization:

MCP server	Tools	Tokens consumed
GitHub	35	~26,000
Slack	11	~21,000
Jira	~20	~17,000
Playwright (browser automation)	21	~13,647
Context7 (library docs)	~15	~8,000
Other 19 connectors	~190	~50,000
Total MCP overhead	~290	~135,000

That’s roughly 67% of the entire 200K context window spent on definitions for tools we might not use that session. This matches figures documented elsewhere: MCP tools average roughly 500–710 tokens each, and a moderate connector load (24 servers) routinely consumes 48,000–120,000 tokens before any work begins. The most extreme documented case is Docker’s MCP server: 135 tools, ~126,000 tokens by itself.

We had ~65K tokens left for actual conversation, system prompt, built-in tools, our CLAUDE.md, skill metadata, and the auto-compact buffer. That’s why everything felt cramped.
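The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch that uses only the numbers from the table above; the variable names are ours, and the 200,000-token window is the figure quoted in this post.

```python
# Token figures copied from the /context table above (approximate values).
mcp_overhead = {
    "GitHub": (35, 26_000),
    "Slack": (11, 21_000),
    "Jira": (20, 17_000),
    "Playwright": (21, 13_647),
    "Context7": (15, 8_000),
    "Other 19 connectors": (190, 50_000),
}
CONTEXT_WINDOW = 200_000  # window size quoted in this post

total_tools = sum(tools for tools, _ in mcp_overhead.values())
total_tokens = sum(tokens for _, tokens in mcp_overhead.values())

share_of_window = total_tokens / CONTEXT_WINDOW
tokens_per_tool = total_tokens / total_tools
remaining = CONTEXT_WINDOW - total_tokens

print(f"{total_tools} tools, {total_tokens:,} tokens")
print(f"{share_of_window:.0%} of the window, ~{tokens_per_tool:.0f} tokens per tool")
print(f"{remaining:,} tokens left for everything else")
```

The unrounded rows sum slightly above the rounded total in the table, so the share lands at roughly two thirds of the window either way.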

Tool Search: The One Flag That Did 85% Of The Work

Claude Code v2.1.7 shipped MCP Tool Search. Instead of injecting every tool schema at startup, Tool Search builds a lightweight ~5K-token index of tool names and descriptions. The full schema for any individual tool loads only when Claude actually decides to call it. Once loaded, it stays cached for the session.
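To make the idea concrete, here is a toy model of deferred loading. It is a sketch of the general technique, not Claude Code’s actual implementation: the class name, the keyword search, and the two example tools are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredToolRegistry:
    """Toy model of deferred schema loading (not Claude Code's real code)."""

    # Full schemas stay out of context until a tool is actually needed.
    full_schemas: dict[str, dict]
    # Schemas loaded so far in this session.
    loaded: dict[str, dict] = field(default_factory=dict)

    def index(self) -> list[dict]:
        """Lightweight view: names and descriptions only."""
        return [
            {"name": name, "description": schema["description"]}
            for name, schema in self.full_schemas.items()
        ]

    def search(self, query: str) -> list[str]:
        """Naive keyword match over the index (hypothetical ranking)."""
        q = query.lower()
        return [
            entry["name"]
            for entry in self.index()
            if q in entry["name"].lower() or q in entry["description"].lower()
        ]

    def load(self, name: str) -> dict:
        """Load a full schema on first use, then serve it from the cache."""
        if name not in self.loaded:
            self.loaded[name] = self.full_schemas[name]
        return self.loaded[name]


registry = DeferredToolRegistry(
    full_schemas={
        "create_issue": {
            "description": "Create an issue in a repository.",
            "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}},
        },
        "list_releases": {
            "description": "List releases for a repository.",
            "input_schema": {"type": "object", "properties": {"repo": {"type": "string"}}},
        },
    }
)

matches = registry.search("issue")     # only the index is consulted here
schema = registry.load(matches[0])     # the full schema is pulled in on demand
```

Only `create_issue` ends up in `registry.loaded`; the schema for `list_releases` never enters the session.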

Anthropic’s internal testing showed a reduction from 134K to ~5K tokens, reported as an 85% cut (the two endpoints on their own imply closer to 96%, so treat the percentage as approximate). Counter-intuitively, accuracy on MCP evaluations went up, not down: Opus 4 jumped from 49% to 74% on the same benchmark, presumably because the model wasn’t drowning in tool schemas it didn’t need.

Tool Search activates automatically when tool descriptions exceed roughly 10% of the context window (~20K tokens). Below that threshold it stays off, on the assumption that you don’t need it. We’re well above the threshold, so it’s always active for us.
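The activation rule is simple enough to express as a check. This sketch only restates the roughly-10% threshold described above; the function name and defaults are ours, and the real cutoff may not be exactly this.

```python
def tool_search_should_activate(
    tool_definition_tokens: int,
    context_window: int = 200_000,
    threshold_percent: int = 10,
) -> bool:
    """Return True when tool definitions exceed the given share of the window."""
    # Integer arithmetic avoids floating-point surprises right at the boundary.
    return tool_definition_tokens * 100 > context_window * threshold_percent


before = tool_search_should_activate(135_000)  # our load before optimization
small = tool_search_should_activate(12_000)    # a hypothetical light setup
```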

After enabling it, our /context looked dramatically different:

Source	Before	After
MCP tool schemas	~135,000	~3,000 (index only)
MCP tool schemas (during a session that used 4 tools)	~135,000	~6,500 (index + 4 loaded schemas)
Available context for work	~65,000	~190,000

Verification step we never skip: run /context at session start and confirm the line that says tool schemas are deferred or that Tool Search is active. If it’s not, you’re still paying the full schema cost for tools you may never call.

Consolidating CRUD Explosions

Tool Search was the biggest single lever, but it doesn’t help if individual tools have bloated descriptions or your servers expose dozens of near-identical tools.

We rebuilt one of our internal MCP servers using a pattern documented in research as action-parameter consolidation. The original ten-tool API for issue management:

create_issue, update_issue, delete_issue, list_issues, get_issue,
add_comment, update_comment, delete_comment, list_comments, get_comment

Became one tool:

manage_issues({ action: "create" | "update" | "delete" | "list" | "get",
                target: "issue" | "comment", ... })

Documented results from one developer who applied this pattern: 20 tools dropped from 14,214 tokens to 5,663 tokens — a 60% reduction. The model still routes correctly because the action parameter is enumerated and the tool description names every supported operation. We saw similar results on our own consolidation: roughly 9,800 tokens down to 4,100.

Even with Tool Search deferring schema loads, the on-demand load for one fat tool is much smaller than for ten lean ones, because schema overhead is dominated by repeated boilerplate (type definitions, error envelopes, pagination patterns).
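A consolidated tool of this shape might be handled as follows. This is a minimal sketch under our own assumptions: the in-memory store, the `item_id` and `fields` parameters, and the error messages are hypothetical stand-ins, not the code of our actual server.

```python
import itertools
from typing import Any, Optional

ACTIONS = ("create", "update", "delete", "list", "get")
TARGETS = ("issue", "comment")

# In-memory stand-in for a real issue tracker (hypothetical).
_store: dict[str, dict[int, dict[str, Any]]] = {target: {} for target in TARGETS}
_next_id = itertools.count(1)


def manage_issues(
    action: str,
    target: str,
    item_id: Optional[int] = None,
    fields: Optional[dict[str, Any]] = None,
) -> Any:
    """Single entry point replacing ten separate CRUD tools."""
    # Enumerated values are validated up front so bad routing fails loudly.
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {ACTIONS}")
    if target not in TARGETS:
        raise ValueError(f"unknown target {target!r}; expected one of {TARGETS}")

    records = _store[target]
    if action == "create":
        new_id = next(_next_id)
        records[new_id] = {"id": new_id, **(fields or {})}
        return records[new_id]
    if action == "list":
        return list(records.values())

    # The remaining actions all operate on one existing record.
    if item_id not in records:
        raise KeyError(f"no {target} with id {item_id!r}")
    if action == "get":
        return records[item_id]
    if action == "update":
        records[item_id].update(fields or {})
        return records[item_id]
    # Only "delete" is left at this point.
    del records[item_id]
    return {"deleted": item_id}


issue = manage_issues("create", "issue", fields={"title": "Crash on save"})
comment = manage_issues(
    "create", "comment", fields={"issue_id": issue["id"], "body": "Repro attached"}
)
```

The shared validation and record handling appear once here instead of being repeated in ten schemas, which is where most of the token saving comes from.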

Trim Descriptions Aggressively

The other side of the same coin. Marketing prose in tool descriptions costs real tokens.

A description like:

“Search the web using Tavily Search API. Best for factual queries requiring reliable sources and citations from authoritative web content. Handles complex topics with academic depth and provides comprehensive results with relevance scoring…”

Costs 87 tokens. The same routing signal in:

“Search using Tavily. Best for factual/academic topics with citations.”

Costs 12 tokens. Across 290 tools, an average saving of 50 tokens per description adds up to ~14,500 tokens.

The rule we use: a description should help Claude decide whether to call this tool, not market the tool. Cut anything that doesn’t change the routing decision.
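One way to find candidates for trimming is to flag descriptions that exceed a budget. The sketch below assumes a crude four-characters-per-token estimate, which is not a real tokenizer and will not reproduce the exact counts quoted above; the tool names and the 25-token budget are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def over_budget(descriptions: dict[str, str], budget: int = 25) -> list[tuple[str, int]]:
    """Return (tool name, estimated tokens) for descriptions above the budget."""
    flagged = [
        (name, estimate_tokens(text))
        for name, text in descriptions.items()
        if estimate_tokens(text) > budget
    ]
    # Largest offenders first, so the audit starts where the savings are.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def projected_savings(tool_count: int, avg_tokens_saved: int) -> int:
    """Total tokens saved if every description shrinks by the average amount."""
    return tool_count * avg_tokens_saved


descriptions = {
    "tavily_verbose": (
        "Search the web using Tavily Search API. Best for factual queries "
        "requiring reliable sources and citations from authoritative web content. "
        "Handles complex topics with academic depth and provides comprehensive "
        "results with relevance scoring…"
    ),
    "tavily_trimmed": "Search using Tavily. Best for factual/academic topics with citations.",
}

flagged = over_budget(descriptions)
savings = projected_savings(tool_count=290, avg_tokens_saved=50)
```

For real numbers, swap `estimate_tokens` for a proper tokenizer or compare /context output before and after the trim.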

When We Kept A Tool Eager

A small number of tools we want loaded eagerly because they fire on almost every session:

Read, Write, Edit, Grep, Glob, Bash — built-in, not MCP, but the principle is the same: high call frequency justifies always-loaded.
Two MCP servers we use multiple times per session: our internal release tooling and our docs-fetching server. Their schemas total ~3,500 tokens combined, and since we call them 6+ times in almost every session, deferring them would save nothing (the schemas end up loaded anyway) while adding a search step before the first call.

Tool Search supports a per-server eager allowlist for exactly this reason. Use it surgically — every eager tool is permanent overhead.

Cap Output, Or It’ll Cap You

Claude Code’s MAX_MCP_OUTPUT_TOKENS environment variable caps each MCP tool response and defaults to 25,000 tokens. That’s generous for a 200K window with one tool call per turn. With 24 connectors and tools that fan out across multiple calls per turn, it’s a guaranteed way to fill context with raw API responses.
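A quick worst-case calculation shows why the default matters. This sketch assumes every response hits the cap and ignores everything else in context, so it is an upper bound rather than a measurement; the function name is ours.

```python
CONTEXT_WINDOW = 200_000  # window size quoted in this post


def calls_until_full(cap_per_call: int, window: int = CONTEXT_WINDOW) -> int:
    """How many maximum-size tool responses fit in the window."""
    return window // cap_per_call


default_cap_calls = calls_until_full(25_000)  # the default per-response cap
lower_cap_calls = calls_until_full(4_000)     # a lower cap such as 4,000

print(f"default cap: {default_cap_calls} full-size responses fill the window")
print(f"lower cap:   {lower_cap_calls} full-size responses fill the window")
```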

We capped ours at 4,000 and required the most output-heavy servers to support server-side pagination and summary-first response shapes. A GitHub list-issues call now returns the first 20 with a has_more: true marker and a continuation token instead of dumping 200 issues into context. The model can ask for more if needed; usually it doesn’t.
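The response shape described above can be sketched like this. The field names other than `has_more`, the cursor encoding, and the function name are our own assumptions for illustration, not the GitHub server’s actual API.

```python
import base64
import json
from typing import Optional


def _encode_cursor(offset: int) -> str:
    """Pack the next offset into an opaque continuation token."""
    raw = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def _decode_cursor(token: str) -> int:
    """Recover the offset from a continuation token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return int(json.loads(raw)["offset"])


def list_issues_page(
    issues: list[dict], cursor: Optional[str] = None, page_size: int = 20
) -> dict:
    """Return one page, summary first, with a marker for remaining items."""
    start = _decode_cursor(cursor) if cursor else 0
    page = issues[start : start + page_size]
    next_offset = start + len(page)
    has_more = next_offset < len(issues)
    return {
        "total": len(issues),  # summary first: the model sees the scale up front
        "items": page,
        "has_more": has_more,
        "next_cursor": _encode_cursor(next_offset) if has_more else None,
    }


all_issues = [{"number": n, "title": f"Issue {n}"} for n in range(1, 201)]
first_page = list_issues_page(all_issues)
second_page = list_issues_page(all_issues, cursor=first_page["next_cursor"])
```

The model receives 20 issues and a token it can pass back, rather than all 200 at once.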

For complex multi-tool workflows, Programmatic Tool Calling (Claude writes orchestration code that runs in a sandboxed environment, with only final results entering context) showed a ~37% token reduction in Anthropic’s internal testing on research-heavy tasks. We use it for one workflow — a Q&A agent that hits 6 documentation servers — and the savings hold up.
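The principle behind that saving is that intermediate results stay inside the sandbox. The sketch below illustrates it with a stubbed search function; `fake_search`, the server names, and the payload sizes are hypothetical, and this is not how Anthropic’s feature is implemented.

```python
from typing import Callable


def fake_search(server: str, query: str) -> list[dict]:
    """Stand-in for a real MCP call; returns bulky results (hypothetical)."""
    return [
        {
            "server": server,
            "title": f"{query} guide {i}",
            "score": 1.0 / (i + 1),
            "body": "x" * 5_000,  # large payload that should not reach context
        }
        for i in range(10)
    ]


def orchestrate(
    query: str,
    servers: list[str],
    search: Callable[[str, str], list[dict]] = fake_search,
    top_n: int = 3,
) -> tuple[list[dict], int]:
    """Fan out to every server, keep only a compact summary of the best hits."""
    hits: list[dict] = []
    for server in servers:
        hits.extend(search(server, query))
    raw_chars = sum(len(hit["body"]) for hit in hits)
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    summary = [{"server": hit["server"], "title": hit["title"]} for hit in hits[:top_n]]
    return summary, raw_chars


doc_servers = [f"docs-{n}" for n in range(1, 7)]  # six hypothetical servers
summary, raw_chars = orchestrate("pagination", doc_servers)
summary_chars = sum(len(str(entry)) for entry in summary)
```

Here 300,000 characters of raw results are reduced to three short entries before anything is handed back to the model.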

CLI > MCP For Occasional-Use Tools

Counter-intuitive but real: for tools you use rarely, a shell command is cheaper than an MCP server. gh, aws, gcloud, sentry-cli, wp — they execute via the Bash tool with zero persistent context overhead. The Bash tool description (already loaded) is all the context you pay for. The CLI binary’s help text loads only if Claude reads it.

We pulled three rarely-used MCP servers and replaced them with CLI usage. That alone saved ~12,000 tokens of baseline overhead.

The break-even point is roughly: does this tool fire more than once per typical session? If yes, MCP. If no, CLI.
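Applied to a usage log, the rule of thumb looks like this. The call counts and tool names below are made-up examples, and the threshold of one call per session is our heuristic rather than a measured break-even.

```python
def choose_integration(calls_per_typical_session: float) -> str:
    """Apply the rule of thumb: more than one call per session favors MCP."""
    return "mcp" if calls_per_typical_session > 1 else "cli"


# Hypothetical average calls per session for a few tools.
usage = {"github": 6.0, "sentry": 0.2, "aws": 1.0}
plan = {tool: choose_integration(calls) for tool, calls in usage.items()}
```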

What We Didn’t Optimize

MAX_MCP_OUTPUT_TOKENS is per-call, not per-session. A misbehaving server can still flood context across many turns. We don’t have per-session caps yet — that’s a Claude Code feature request, not something we can fix locally.
Tool Search tiering is coarse. We can’t tier MCP servers the way we tier skills. Apart from the two servers we keep eager, every connector goes through Tool Search uniformly; a tool is either always loaded or fully deferred, with nothing in between.
We didn’t measure routing accuracy carefully on our own workloads. Anthropic’s 49% → 74% on Opus 4 is encouraging, and we haven’t seen routing failures in practice, but we don’t have a benchmark suite to prove deferred loading works as well for Statnive-specific tasks as eager loading would.

The Practical Steps If You’re Doing This Today

Run /context in a fresh session. See what’s actually loaded.
If MCP tool schemas exceed ~20K tokens (10% of the window), Tool Search should be active automatically. Verify it from the /context output.
Audit your three heaviest connectors. The Pareto curve is brutal: in our table, the top three servers account for about 47% of MCP overhead and the top five for about 63%.
Consolidate CRUD explosions in any MCP server you control.
Trim descriptions on any tool whose description reads like marketing copy.
Move occasional-use tools from MCP to CLI.
Cap MAX_MCP_OUTPUT_TOKENS to a reasonable per-call limit (we use 4,000).
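For the audit step, a small helper can rank connectors and report how much of the overhead the heaviest ones carry. This sketch uses the figures from the /context table earlier in this post and assumes none of the 19 unnamed connectors is heavier than the named ones; the function name is ours.

```python
# Per-server token counts from the /context table earlier in this post.
named_servers = {
    "GitHub": 26_000,
    "Slack": 21_000,
    "Jira": 17_000,
    "Playwright": 13_647,
    "Context7": 8_000,
}
other_connectors_tokens = 50_000  # the 19 remaining connectors, combined


def top_n_share(named: dict[str, int], unattributed: int, n: int) -> float:
    """Fraction of total MCP overhead consumed by the n heaviest named servers."""
    total = sum(named.values()) + unattributed
    ranked = sorted(named.values(), reverse=True)
    return sum(ranked[:n]) / total


top3 = top_n_share(named_servers, other_connectors_tokens, 3)
top5 = top_n_share(named_servers, other_connectors_tokens, 5)

print(f"top 3 servers: {top3:.0%} of MCP overhead")
print(f"top 5 servers: {top5:.0%} of MCP overhead")
```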

If you want the bigger picture — how this fits with CLAUDE.md optimization and skill tiering to deliver multiplicative savings — start with the flagship post in this series.

Try Statnive

Privacy-first WordPress analytics, shipped by a team that pays attention to where every token goes. Install Statnive free from WordPress.org — your data stays on your server.