Case Studies · Parhum Khoshbakht

Skill Tiering: The Four-Bucket Model That Keeps Our 80+ Skills Out of Context

Progressive disclosure means 80 skills can cost 3,200 tokens of metadata, or zero. Here is how we classify every skill in the Statnive repo into always-on, auto-invocable, manual-only, or fork — and the one-question test we use to decide.

Why More Skills Don’t Have To Mean Less Context

Statnive’s Claude Code setup loads more than 80 skills covering product management, backend scaffolding, QA, security auditing, WordPress-specific patterns, release packaging, and more. The framework we build on (jaan.to) ships 141. A naive reading: more skills, more context overhead, less room for actual work.

The actual math is more interesting. 80 skills can cost 3,200 tokens of permanent metadata, or zero, depending on how each one is configured. The difference comes from the four-bucket tiering model built on Claude Code’s skill system, and a one-question test for picking the right bucket.

This post walks through all four buckets with the actual Statnive skill distribution, the test we use to decide, and the three things we got wrong before we got it right.

Progressive Disclosure: The Mechanism Underneath

Before the tiering model makes sense, the loading mechanism has to. Claude Code uses three layers of progressive disclosure for skills:

| Layer | What loads | When | Cost |
|---|---|---|---|
| 1: Metadata | YAML frontmatter name + description | Always at startup | ~30–50 tokens per skill |
| 2: Body | Full SKILL.md content | When the skill is invoked | Anthropic recommends ≤ 500 lines / ~5K tokens |
| 3: Bundled resources | Referenced scripts, templates, references | Only when accessed | Zero baseline cost |

This is the lever the tiering model uses. The metadata layer is the only thing that’s always in context. Everything else is on-demand.

For a 141-skill framework, that’s roughly 4,200–7,050 tokens of permanent overhead for metadata alone (141 × 30 to 141 × 50), scaling linearly with skill count. It grows if descriptions are verbose, and shrinks, potentially to zero, if you tell certain skills to skip the metadata registry entirely.
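The per-skill arithmetic is easy to sketch. The snippet below is a minimal illustration, assuming the ~30–50 tokens-per-skill metadata figure from the table above and a ~40-token midpoint; the function name `metadata_overhead` is our own and not part of Claude Code.

```python
def metadata_overhead(skill_count, low=30, typical=40, high=50):
    """Estimate permanent metadata tokens for skills listed at startup.

    The per-skill figures are the rough ranges quoted in this post,
    not values reported by Claude Code itself.
    """
    return {
        "low": skill_count * low,
        "typical": skill_count * typical,
        "high": skill_count * high,
    }


print(metadata_overhead(80))
# {'low': 2400, 'typical': 3200, 'high': 4000}
```

Pass in only the skills that actually appear in the registry; manual-only skills contribute nothing to this number.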

A documented bug (Claude Code GitHub issue #14882) reports that some plugin skills may consume their full body at startup rather than just frontmatter. We watch for this in our own /context output. If your skill metadata line shows numbers far higher than ~50 tokens × your auto-invocable skill count, you’re hitting it.

The Four Buckets

Every skill in the Statnive repo fits into one of four buckets, each defined by a single frontmatter choice:

Bucket 1 — Always-On (default frontmatter)

---
name: simplify
description: Review changed code for reuse, quality, efficiency. Auto-fixes issues.
---

These are the core workflow skills the model should route to automatically. Standard frontmatter — no special flags. Metadata loaded at startup; body loaded on invocation.

Statnive examples: simplify, statnive-release, statnive-release-zip. These fire on most release-related work.

Cost per skill: ~40 tokens of metadata permanently in context.

Bucket 2 — Auto-Invocable (default frontmatter, concise description)

Same configuration as Always-On from Claude Code’s perspective. The distinction is editorial: these are domain skills that fire only when their trigger keywords match. The discipline is in keeping the description trigger-oriented and short.

---
name: wp-rest-api
description: Use when building REST endpoints in WordPress plugins.
---

Bad description (still works, costs more):

description: A comprehensive skill for working with the WordPress REST API,
  including endpoint registration, controller patterns, schema validation,
  authentication, response shaping, and CPT/taxonomy exposure...

Good description (above): 13 tokens. Bad description: 38 tokens. Across 60+ auto-invocable skills, the difference is roughly 1,500 tokens of permanent metadata savings.
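One way to keep descriptions honest is a small lint pass over each SKILL.md. The sketch below is hypothetical: it parses the frontmatter by hand, estimates tokens at roughly four characters each (a crude assumption, since real tokenizers differ), and flags anything over a budget we picked for illustration. None of these function names come from Claude Code.

```python
def frontmatter_description(skill_md):
    """Extract the description value from a SKILL.md frontmatter block."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    parts = []
    in_description = False
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        if line.startswith("description:"):
            parts.append(line[len("description:"):].strip())
            in_description = True
        elif in_description and line[:1] in (" ", "\t"):
            parts.append(line.strip())  # indented continuation line
        else:
            in_description = False
    return " ".join(p for p in parts if p)


def estimate_tokens(text):
    """Crude estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def over_budget(skill_md, budget=20):
    """True when the description estimate exceeds the (hypothetical) budget."""
    return estimate_tokens(frontmatter_description(skill_md)) > budget


good = """---
name: wp-rest-api
description: Use when building REST endpoints in WordPress plugins.
---
"""
print(estimate_tokens(frontmatter_description(good)), over_budget(good))
```

With this heuristic the short wp-rest-api description lands close to the 13 tokens quoted above, and a multi-line description like the verbose one gets flagged.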

Statnive examples: wp-rest-api, wp-plugin-development, wp-performance, react-best-practices, wp-block-development, all the jaan-to:* planning skills.

Cost per skill: ~30–50 tokens of metadata permanently in context.

Bucket 3 — Manual-Only (disable-model-invocation: true)

---
name: statnive-emergency-rollback
description: Emergency-only rollback procedure for a botched deploy.
disable-model-invocation: true
---

The skill exists, the slash command works (/statnive-emergency-rollback), but the metadata never enters the <available_skills> registry. Claude doesn’t know it exists unless the user explicitly invokes it.

Cost per skill: 0 tokens. This is the magic bucket.

When to use it: rare workflows, destructive operations, anything you don’t want the model auto-routing to. If marking a skill manual-only would prevent a workflow from completing, it belongs in Buckets 1 or 2 instead.

Statnive examples: Emergency rollback, manual database surgery, one-off migration scripts, anything that exists “just in case” but shouldn’t fire opportunistically.

We mark roughly 40% of our skills (32 of ~85) disable-model-invocation: true. That’s ~1,280 tokens of baseline metadata reclaimed, and routing quality on the remaining auto-invocable skills actually improved, because Claude wasn’t choosing between near-duplicates.

Bucket 4 — Fork / Subagent (context: fork)

---
name: simplify
description: Review changed code for reuse, quality, efficiency. Auto-fixes issues.
context: fork
---

Fork mode runs the skill in an isolated subagent context with its own conversation history and its own 200K-token window. The work output stays out of the main context window. Only a summary returns.

For self-contained workflows like code reviews, security audits, and multi-step research, this is transformative. Anthropic documents subagents returning ~500–1,000 tokens from 10,000+ of internal work — roughly a 37% main-context reduction on complex tasks where the subagent did substantial reading and processing.

Statnive examples: simplify (three parallel review agents, returns a summary), jaan-to:backend-pr-review, jaan-to:sec-audit-remediate, jaan-to:detect-dev. Anything that reads many files and returns a verdict.

Cost per skill: ~40 tokens of metadata, but the work itself happens in isolation.

The One-Question Test

The four buckets sound like four decisions. They’re really one: does the main conversation need to see the skill’s intermediate work?

| Answer | Bucket |
|---|---|
| Yes: the skill writes code the main session will continue editing | Always-on or Auto-invocable |
| No: the skill returns a verdict, summary, or report | Fork / subagent |
| Maybe, but it should never auto-fire (rare, destructive, weird) | Manual-only |

If “no,” set context: fork. Your main context stays clean and you get to use Haiku 4.5 ($1/$5 per MTok) for the subagent’s reading-heavy work while the main session uses Sonnet or Opus. That’s roughly a 3× cost win relative to Sonnet ($3/$15 per MTok), and more relative to Opus, on top of the context win.

If “yes,” it goes in Buckets 1 or 2. The choice between Always-On and Auto-Invocable is editorial: how confidently can Claude trigger this from natural-language cues? Strong, unambiguous triggers go in Auto-Invocable. Workflows the model should consider on most sessions go in Always-On.

If the skill exists but should never auto-fire, mark it Manual-Only and recover its metadata cost.
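The decision rules above can be written down as a small function. This is a sketch of how we read the test, not anything Claude Code provides: the `Bucket` enum, the parameter names, and the choice to check the never-auto-fire case first are all our own assumptions.

```python
from enum import Enum


class Bucket(Enum):
    ALWAYS_ON = "always-on"
    AUTO_INVOCABLE = "auto-invocable"
    MANUAL_ONLY = "manual-only"
    FORK = "fork"


def pick_bucket(needs_intermediate_work,
                must_never_auto_fire=False,
                relevant_most_sessions=False):
    """Map the one-question test (plus two tie-breakers) to a bucket."""
    if must_never_auto_fire:
        # Rare or destructive: keep it out of the registry entirely.
        return Bucket.MANUAL_ONLY
    if not needs_intermediate_work:
        # Only the verdict matters, so run it in an isolated context.
        return Bucket.FORK
    # Main session needs the work; the remaining choice is editorial.
    return Bucket.ALWAYS_ON if relevant_most_sessions else Bucket.AUTO_INVOCABLE


print(pick_bucket(needs_intermediate_work=False).value)  # fork
print(pick_bucket(True, must_never_auto_fire=True).value)  # manual-only
```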

Statnive’s Actual Skill Distribution

Here’s our current breakdown across ~85 skills:

| Bucket | Count | Total metadata cost | Notes |
|---|---|---|---|
| Always-On | 8 | ~320 tokens | Release, simplify, sprint planning, PR review |
| Auto-invocable | 38 | ~1,520 tokens | Domain skills with strong trigger keywords |
| Manual-only | 32 | 0 tokens | Slash-command-only |
| Fork / subagent | 7 | ~280 tokens | Reviews, audits, detects |
| Total | 85 | ~2,120 tokens | About 1% of context |

Without tiering — if all 85 were defaults — we’d pay roughly 3,400 tokens of permanent metadata. The 32 manual-only skills alone save ~1,280 tokens. Looks small in isolation; matters when stacked with CLAUDE.md trims and MCP Tool Search.
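For anyone who wants to rerun the numbers with their own counts, here is a minimal sketch of the arithmetic. It assumes the ~40 tokens-per-skill figure used throughout this post; the dictionary and helper names are illustrative only.

```python
TOKENS_PER_SKILL = 40  # typical metadata cost per listed skill (post's estimate)

distribution = {
    "always-on": 8,
    "auto-invocable": 38,
    "manual-only": 32,
    "fork": 7,
}


def metadata_cost(bucket, count):
    """Manual-only skills never enter the registry, so they cost nothing."""
    return 0 if bucket == "manual-only" else count * TOKENS_PER_SKILL


tiered = sum(metadata_cost(b, n) for b, n in distribution.items())
untiered = sum(distribution.values()) * TOKENS_PER_SKILL

print(tiered, untiered, untiered - tiered)  # 2120 3400 1280
```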

Body Limits: Why 500 Lines Is The Right Number

The metadata side is permanent overhead. The body side is per-invocation overhead — and equally important to control.

Anthropic recommends keeping each SKILL.md under 500 lines (~5K tokens). We treat that as the target and, following more aggressive optimization guidance, enforce a hard cap of 600 lines per skill body. Anything over the cap requires reference extraction: pull templates, long checklists, multi-stack comparisons, and anti-pattern catalogs out of the SKILL.md and into separate files referenced via clear pointers.

The pattern looks like:

.claude/skills/wp-plugin-development/
├── SKILL.md                          # 380 lines — execution core only
└── references/
    ├── activation-deactivation-patterns.md   # Loaded only when needed
    ├── settings-api-patterns.md
    ├── nonce-and-capability-checklist.md
    └── release-packaging-checklist.md

The execution core stays compact and deterministic. References load on demand, costing zero tokens until accessed. We rebuilt three of our heaviest skills this way and cut ~12,000 tokens from typical invocation budgets.
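Finding the skills that need this treatment can be automated. The sketch below assumes skills live one directory deep under a root such as .claude/skills/, as in the layout above, and counts every line of SKILL.md including frontmatter, so it slightly overstates body length. The function name and status labels are hypothetical.

```python
from pathlib import Path

TARGET_LINES = 500    # recommended ceiling
HARD_CAP_LINES = 600  # beyond this, extract references


def audit_skill_bodies(skills_root):
    """Return (skill_name, line_count, status) for each SKILL.md over target."""
    findings = []
    for skill_md in sorted(Path(skills_root).glob("*/SKILL.md")):
        line_count = len(skill_md.read_text(encoding="utf-8").splitlines())
        if line_count > HARD_CAP_LINES:
            findings.append((skill_md.parent.name, line_count, "over hard cap"))
        elif line_count > TARGET_LINES:
            findings.append((skill_md.parent.name, line_count, "over target"))
    return findings


if __name__ == "__main__":
    for name, lines, status in audit_skill_bodies(".claude/skills"):
        print(f"{name}: {lines} lines ({status})")
```

Skills at or under the target are omitted from the result, so an empty list means nothing needs extracting.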

The other body-control field worth using:

allowed-tools: ["Read", "Grep", "Glob"]

This restricts which tools the skill can access, reducing both token overhead and execution surface area. A skill that only reads code shouldn’t have Write, Bash, or MCP tool access — narrowing the toolset narrows the schema injected when the skill fires, and removes whole classes of accidental behavior.

What We Got Wrong The First Three Times

Honest caveats, same pattern as the rest of the series:

  1. We started with verbose descriptions. Our first wave of skills had 60+ token descriptions optimized for human readers. They worked, but they cost 2× what they needed to. Trim cycle one cut ~1,400 tokens of metadata across the auto-invocable bucket.
  2. We had three skills doing similar things. The auto-invocable bucket had pm-roadmap-add, pm-roadmap-update, and pm-sprint-plan with overlapping triggers. Routing got coin-flippy. We consolidated and clarified the triggers. Routing accuracy went up; metadata cost went down.
  3. We had heavy skills not using fork mode. simplify originally ran inline. It would read 30+ files, run three review passes, and return a 2,000-token report — all of that internal work polluted main context. Switching to context: fork cut typical main-context use by ~9,000 tokens per release session.

The Measurement Step

Same as the rest of the series: /context is the diagnostic. The line you watch for skill tiering is the one that shows skill metadata token count. Targets we use:

| Source | Target | Hard cap |
|---|---|---|
| Skill metadata | ≤ 2,500 tokens | 5,000 |
| Number of auto-invocable skills | ≤ 60 | |
| Any single SKILL.md body | ≤ 500 lines / ~5K tokens | 600 lines / ~8K tokens |

If your skill metadata line is way above 5K tokens and you have fewer than 100 skills, you probably have either verbose descriptions (Bucket 2 problem) or the loading bug we mentioned earlier (Bucket 1 problem).
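These targets are simple enough to encode as a check you run after reading /context. The sketch below takes the numbers as plain arguments, because we are not assuming any machine-readable /context output; the function name and warning strings are our own.

```python
METADATA_TARGET = 2500
METADATA_HARD_CAP = 5000
AUTO_INVOCABLE_TARGET = 60


def check_targets(metadata_tokens, auto_invocable_count, total_skills):
    """Return a list of warnings for the metadata targets in the table above."""
    warnings = []
    if metadata_tokens > METADATA_HARD_CAP:
        message = "skill metadata over the 5,000-token hard cap"
        if total_skills < 100:
            # Fewer than 100 skills should not need this much metadata.
            message += " (check for verbose descriptions or the startup loading bug)"
        warnings.append(message)
    elif metadata_tokens > METADATA_TARGET:
        warnings.append("skill metadata over the 2,500-token target")
    if auto_invocable_count > AUTO_INVOCABLE_TARGET:
        warnings.append("more than 60 auto-invocable skills")
    return warnings


print(check_targets(metadata_tokens=2120, auto_invocable_count=38, total_skills=85))
# []
```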

How This Connects To The Rest Of Statnive

We run 248 PHP unit tests on every commit. Releases pass 22 release gates before any version ships. The skills that orchestrate all of that — statnive-release, simplify, wp-plugin-development, the QA generators — fit into roughly 2,100 tokens of permanent metadata. The work happens, the context stays clean, and the team stays small.

The four-bucket model isn’t an academic exercise. It’s the reason we can ship a WordPress analytics plugin under a 5KB tracker budget without a five-person team behind it.

Try Statnive

Privacy-first WordPress analytics, built by a team that runs /context more often than /help. Install Statnive free from WordPress.org — your data stays on your server, our skill library stays well under its token budget.
