One agent that knows everything is the most expensive way to know anything.
Hermes today loads one ~100k-token brain — system prompt, every skill, all memory — before you type a word. The fix isn't a smaller brain. It's an org chart: a sharp orchestrator that routes work to specialist agents, each carrying only its own context, most running on cheaper models. Same answers, a fraction of the burn, and it stops rotting as memory grows.
Thesis
A monolithic agent pays for its entire brain on every single turn. A 100k-token prefix re-enters context each call; even fully cached, a hit still bills ~10% of input — so 100k cached behaves like 10k of full-price input, every turn, on your most expensive model. And the context isn't just costly, it's lossy: attention is quadratic, and past a point quality drops as context grows.
Splitting that brain across specialist agents wins on three independent axes at once — smaller per-call context, cheaper models for routine work, and cleaner attention windows. The three numbers that frame the case:
The token count goes up; the bill and the error rate go down. The rest of this memo shows why, with the math live so you can move the assumptions yourself.
The Org Model
A company doesn't hire one omniscient employee who memorizes HR policy, the codebase, every client and the books — then consults all of it for every question. It hires specialists and a director who knows who to ask. Hermes should work the same way.
Today — the monolith
- ✕ Carries sys-prompt + all skills + all memory (~100k) on every call
- ✕ Re-reads the whole brain each turn — pays for context it never uses
- ✕ Quality degrades as memory grows (context rot)
- ✕ Everything runs on one expensive model, including drudge work
- ✕ One changing token early invalidates the cache downstream
Proposed — Hermessina as director
- ✓ Orchestrator holds a lean planning prompt + cached skill catalog
- ✓ Routes each task to the worker that owns that context
- ✓ Workers carry only their domain slice — small, stable, cache-friendly
- ✓ Workers run cheap models; flagship reserved for plan + synthesis
- ✓ Each worker reasons in a clean window — out of the rot zone
Why a Cached Brain Still Isn't Cheap
"It's all cached" is the comforting lie. Caching reuses the KV state of an exact prefix, so a stable head is cheap to re-read — but a cache hit still bills ~10% of input, paid every turn, and anything dynamic placed early shatters the cache for everything after it. Output tokens never cache at all.
The exact-prefix rule
Each token's cached state depends on every token before it. Change one early token — a timestamp, a freshly-retrieved doc, a reordered tool — and the cache for everything after is invalid. A monolith that injects dynamic content high in the prompt quietly pays full price on most of its "cached" context.
Output is the silent killer
Caching only touches input. Output bills at 5–6× input and never caches. The monolith generates all output on the flagship ($25–30/M). Tiering lets workers draft on a $5/M model and only synthesize on the flagship — often the largest single saving.
The Live Model — Anchored on Hermes
Defaults reflect Hermes as described: ~100k base context, a real daily run volume, multi-turn agentic tasks. Move the controls. Caching applies to both architectures, so you're seeing the honest delta from context-splitting and model-tiering — not a caching trick.
Orchestrator vs Monolith — cost model
prompt caching: ON (applied to both)Per-run breakdown
Model assumptions — read before quoting a number
A transparent first-order model, not a billing simulator. Per turn, the monolith re-sends
its base context plus a growing transcript (~1.5k tokens/turn of tool results
+ messages) and generates ~1.2k output tokens on the flagship. With caching on, the stable
head reads at 0.1× input after a first-turn write at 1.25×; the
growing tail bills at full input (the realistic monolith case, since appended tool output
keeps moving the cache edge). Orchestrated: each of N workers carries roughly
base/N + a 5k shared kernel, runs ~⌈turns/2⌉ turns on the worker
model, and returns a ~1.5k-token summary. The orchestrator runs ~3 turns (plan + synthesis)
on the flagship over a lean 8k catalog plus the N summaries. Caching applied
identically to both sides. Output: workers ~1k/turn on the cheap model, orchestrator
~1.5k/turn on the flagship. Pricing is the verified June-2026 table below. Real results
depend on routing quality and how cleanly your domains separate — treat the % as
directional, the mechanism as solid.
The Numbers Behind It
Verified against OpenAI and Anthropic official pricing, June 2026. USD per 1M tokens. "Cache read" is the price of a cache hit — note it's ~10% of input on every model, never zero.
| Model | Input | Cache read | Output | Write 5m / 1h |
|---|---|---|---|---|
| OpenAI · automatic caching · 90% off · no write premium | ||||
| GPT-5.5 flagship | $5.00 | $0.50 | $30.00 | — / — |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | — / — |
| GPT-5.4 nano cheapest | $0.20 | $0.02 | $1.25 | — / — |
| Anthropic · explicit cache_control · read 0.1× · write 1.25×(5m)/2×(1h) | ||||
| Claude Opus 4.8 flagship | $5.00 | $0.50 | $25.00 | $6.25 / $10.00 |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | $3.75 / $6.00 |
| Claude Haiku 4.5 cheap | $1.00 | $0.10 | $5.00 | $1.25 / $2.00 |
The spread between cheapest and flagship is roughly 25–60×. That gap is the entire game: drudge work on a $0.20–$1 model, judgment on a $5 model. A monolith spends flagship rates on everything.
Where This Can Go Wrong
The honest version. The architecture wins under specific conditions — naming them up front is what makes the pitch credible.
If Hermessina only decides "which agent gets this," a frontier model is overkill — the literature is blunt: "using a frontier model to decide which model to use defeats the purpose." Frame it as the director who reads the brief, decomposes the work, and writes the final synthesis. That reasoning role justifies the flagship. Pure routing should be a cheap classifier.
Anthropic measured ~15× the token volume of a chat. Don't hide this. The dollar saving comes from those tokens running on cheap models over small windows; the quality saving comes from dodging context rot. If you tier badly (workers on the flagship), you get the 15× without the saving. Tiering is mandatory, not optional.
Anthropic's own boundary: tasks where "all agents share the same context or involve many dependencies between agents are not a good fit." Hermes' domains (Notion, GitHub, calendar, clients) separate cleanly — that's exactly the good-fit zone. But tightly-coupled tasks should stay in one agent.
The saving assumes each worker gets only its domain context. If routing leaks the full 100k into every worker, you've rebuilt the monolith N times — strictly worse. The retrieval/routing layer (hybrid semantic + keyword over chunked source, plus entity relationships) is the real engineering work — the thing you and I already flagged in the thread.
What I'd Actually Do for Hermessina
cached_tokens and per-worker token spend from day one. If cache hit rates are low, a dynamic token is sitting too early in a prefix — fixable, but only if you can see it.Built to settle one question: is splitting Hermes into an orchestrator + specialists worth it? Short version — yes, if the routing layer is real and the models are tiered. The cost model above is first-order and directional; the mechanism (smaller stable contexts + cheap workers + flagship-only synthesis) is well-supported.
SOURCESAnthropic — Building a multi-agent research system (15× tokens, 90.2%, 80% variance)
Anthropic — Effective context engineering (n² attention, context rot, sub-agent isolation)
Chroma — Context Rot (18 models, 20–50% degradation 10k→100k)
Anthropic — Pricing & Prompt caching
OpenAI — Pricing & Prompt caching (automatic, 90% off, 1024-token min)
Morph — LLM cost optimization & model routing (tiering, 40–70% savings)
Shaan · Casper · pricing verified June 2026 · figures directional, mechanism load-bearing