HERMES ARCHITECTURE / COST MEMO / 2026-06-04

One agent that knows everything is the most expensive way to know anything.

Hermes today loads one ~100k-token brain — system prompt, every skill, all memory — before you type a word. The fix isn't a smaller brain. It's an org chart: a sharp orchestrator that routes work to specialist agents, each carrying only its own context, most running on cheaper models. Same answers, a fraction of the burn, and it stops rotting as memory grows.

route → dispatch → retrieve → reason → summarize → synthesize → answer
00 /

Thesis

A monolithic agent pays for its entire brain on every single turn. A 100k-token prefix re-enters context each call; even fully cached, a hit still bills ~10% of input, so 100k cached behaves like 10k of full-price input, every turn, on your most expensive model. And the context isn't just costly, it's lossy: attention compares every token with every other, so its cost grows quadratically with context length, and past a point answer quality drops as context grows.

Splitting that brain across specialist agents wins on three independent axes at once — smaller per-call context, cheaper models for routine work, and cleaner attention windows. The three numbers that frame the case:

  • ~100k tokens of context before the first message: every run pays for all of it
  • 15× more tokens used, multi-agent vs chat (Anthropic): yet cheaper in dollars when tiered right
  • 90.2% on the research eval (Anthropic): Opus-lead + Sonnet-workers beat single-agent Opus

The token count goes up; the bill and the error rate go down. The rest of this memo shows why, with the math live so you can move the assumptions yourself.

01 /

The Org Model

A company doesn't hire one omniscient employee who memorizes HR policy, the codebase, every client and the books — then consults all of it for every question. It hires specialists and a director who knows who to ask. Hermes should work the same way.

Today — the monolith

  • Carries sys-prompt + all skills + all memory (~100k) on every call
  • Re-reads the whole brain each turn — pays for context it never uses
  • Quality degrades as memory grows (context rot)
  • Everything runs on one expensive model, including drudge work
  • A single changed token early in the prompt invalidates the cache for everything after it

Proposed — Hermessina as director

  • Orchestrator holds a lean planning prompt + cached skill catalog
  • Routes each task to the worker that owns that context
  • Workers carry only their domain slice — small, stable, cache-friendly
  • Workers run cheap models; flagship reserved for plan + synthesis
  • Each worker reasons in a clean window — out of the rot zone
Hermessina (orchestrator): flagship · plans, routes, synthesizes
  • Notion / KB: cheap · own slice
  • GitHub / code: cheap · own slice
  • Calendar / comms: cheap · own slice
  • Clients / CRM: cheap · own slice
Each worker returns only a ~1–2k token summary → orchestrator never sees the raw noise
Figure 1 / Workers explore in isolated windows. The director only ever reads distilled output.
02 /

Why a Cached Brain Still Isn't Cheap

"It's all cached" is the comforting lie. Caching reuses the KV state of an exact prefix, so a stable head is cheap to re-read — but a cache hit still bills ~10% of input, paid every turn, and anything dynamic placed early shatters the cache for everything after it. Output tokens never cache at all.

  • MONOLITH, turn 1: 100k context · cache write @ 1.25× input
  • MONOLITH, turn 2…N: 90k stable head · cache read @ 0.1× + growing transcript @ full price
  • SPECIALIST, worker turn: ~20k slice · read @ 0.1× + tail

Same caching rules apply to both — the win isn't "we turned caching on." It's that the worker's multiplicand is ~5× smaller, its prefix is stable so it actually hits, and it runs on a cheaper model. The monolith pays 0.1× · 100k on the flagship, every turn, forever.
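The three rows above can be checked with a few lines of arithmetic. This is a minimal sketch, not a billing tool: the function name `turn_input_cost` is hypothetical, the input prices are the Opus and Haiku rows from the pricing table in section 04, and the tail sizes (10k for the monolith, 2k for the worker) are illustrative assumptions, since the diagram does not give them.

```python
def turn_input_cost(cached_tokens, tail_tokens, input_price, *, first_turn=False,
                    read_mult=0.10, write_mult=1.25):
    """USD input cost of one turn: cached head plus uncached tail.

    On the first turn the head is written to cache (1.25x input);
    on later turns it is read from cache (0.1x input).
    The tail is always billed at full input price.
    """
    head_mult = write_mult if first_turn else read_mult
    billable_tokens = cached_tokens * head_mult + tail_tokens
    return billable_tokens / 1_000_000 * input_price


OPUS_INPUT = 5.00   # USD per 1M input tokens (flagship row, section 04)
HAIKU_INPUT = 1.00  # USD per 1M input tokens (cheap row, section 04)

# Tail sizes below are assumptions chosen for illustration only.
monolith_first = turn_input_cost(100_000, 0, OPUS_INPUT, first_turn=True)
monolith_later = turn_input_cost(90_000, 10_000, OPUS_INPUT)
worker_turn = turn_input_cost(20_000, 2_000, HAIKU_INPUT)

print(f"monolith turn 1:    ${monolith_first:.4f}")
print(f"monolith turn 2..N: ${monolith_later:.4f}")
print(f"worker turn:        ${worker_turn:.4f}")
```

Under these assumed tails, a later monolith turn costs $0.0950 of input and a worker turn $0.0040. The gap comes from all three factors at once: a smaller head, a smaller tail, and a cheaper model.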

The exact-prefix rule

Each token's cached state depends on every token before it. Change one early token — a timestamp, a freshly-retrieved doc, a reordered tool — and the cache for everything after is invalid. A monolith that injects dynamic content high in the prompt quietly pays full price on most of its "cached" context.
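A toy illustration of the exact-prefix rule, assuming prompts are plain lists of tokens and that the reusable span is simply the longest shared prefix. Real providers cache in blocks and enforce minimum lengths, which this sketch ignores; the helper name `cached_prefix_len` and the sample tokens are hypothetical.

```python
from itertools import takewhile


def cached_prefix_len(previous, current):
    """Count the leading tokens two prompts share (the reusable span)."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(previous, current))
    return sum(1 for _ in matching)


stable = ["SYS", "skill_a", "skill_b", "memory"]

# Dynamic token (a timestamp) placed first: nothing after it can be reused.
turn1 = ["ts=09:00"] + stable + ["user: hi"]
turn2 = ["ts=09:01"] + stable + ["user: hi", "user: next"]
early = cached_prefix_len(turn1, turn2)

# Same content, but the timestamp sits after the stable head.
turn1b = stable + ["ts=09:00", "user: hi"]
turn2b = stable + ["ts=09:01", "user: hi", "user: next"]
late = cached_prefix_len(turn1b, turn2b)

print(early, late)  # 0 reusable tokens vs 4 reusable tokens
```

Moving the timestamp below the stable head is the whole fix: the content is identical, but the shared prefix goes from nothing to the full head.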

Output is the silent killer

Caching only touches input. Output bills at 5–6× input and never caches. The monolith generates all output on the flagship ($25–30/M). Tiering lets workers draft on a $5/M model and only synthesize on the flagship — often the largest single saving.
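A small worked example of the output-tiering claim, using the Opus ($25/M) and Haiku ($5/M) output prices from the table in section 04. The split between draft and synthesis tokens is an assumption made up for illustration, not a measurement.

```python
def output_cost(tokens, price_per_million):
    """USD cost of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


OPUS_OUT = 25.00   # USD per 1M output tokens (flagship)
HAIKU_OUT = 5.00   # USD per 1M output tokens (cheap worker)

# Assumed split for one run: most output is drafting, a little is synthesis.
draft_tokens, synthesis_tokens = 8_500, 1_500

all_flagship = output_cost(draft_tokens + synthesis_tokens, OPUS_OUT)
tiered = (output_cost(draft_tokens, HAIKU_OUT)
          + output_cost(synthesis_tokens, OPUS_OUT))

print(f"all on flagship: ${all_flagship:.4f}")
print(f"tiered:          ${tiered:.4f}")
print(f"saving:          {1 - tiered / all_flagship:.0%}")
```

With this assumed split the output bill drops from $0.25 to $0.08 per run. The saving shrinks as the synthesis share grows, so the split is the number to measure first.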

03 /

The Live Model — Anchored on Hermes

Defaults reflect Hermes as described: ~100k base context, a real daily run volume, multi-turn agentic tasks. Move the controls. Caching applies to both architectures, so you're seeing the honest delta from context-splitting and model-tiering — not a caching trick.

Orchestrator vs Monolith: cost model (interactive; prompt caching ON, applied to both)

Monthly outputs:
  • Monolith (one big agent), $/mo
  • Orchestrator + specialists, $/mo
  • Saving per month, as a percentage and in $

Per-run breakdown:
  • Monolith: input (incl. cache), output (flagship), total per run
  • Orchestrated: orchestrator plan + synthesis, workers explore + draft (×N), total per run
  • Total tokens per run, monolith vs orchestrated

Model assumptions — read before quoting a number

A transparent first-order model, not a billing simulator.

  • Monolith, per turn: re-sends its base context plus a growing transcript (~1.5k tokens/turn of tool results + messages) and generates ~1.2k output tokens on the flagship.
  • Monolith caching: with caching on, the stable head reads at 0.1× input after a first-turn write at 1.25×; the growing tail bills at full input (the realistic monolith case, since appended tool output keeps moving the cache edge).
  • Workers: each of N workers carries roughly base/N + a 5k shared kernel, runs ~⌈turns/2⌉ turns on the worker model, and returns a ~1.5k-token summary.
  • Orchestrator: runs ~3 turns (plan + synthesis) on the flagship over a lean 8k catalog plus the N summaries.
  • Caching: applied identically to both sides.
  • Output: workers ~1k/turn on the cheap model, orchestrator ~1.5k/turn on the flagship.
  • Pricing: the verified June-2026 table below.

Real results depend on routing quality and how cleanly your domains separate. Treat the % as directional, the mechanism as solid.
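The assumptions above translate fairly directly into code. What follows is one possible reading, not the memo's actual calculator. Several choices are mine and labeled as such: the default of 10 turns, 4 workers (one per domain named in this memo), the Opus/Haiku pairing, a transcript tail that starts at zero on the first turn, worker transcripts growing at the same 1.5k/turn as the monolith, and worker summaries billed as uncached input on every orchestrator turn. All function and class names are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Price:
    input: float   # USD per 1M input tokens
    output: float  # USD per 1M output tokens


OPUS = Price(5.00, 25.00)   # flagship row from the pricing table
HAIKU = Price(1.00, 5.00)   # cheap row from the pricing table
READ, WRITE = 0.10, 1.25    # cache read / 5-minute cache write multipliers


def agent_cost(head, turns, out_per_turn, price, tail_per_turn=1_500, fixed_tail=0):
    """USD cost of one agent: cached head, uncached tail, uncached output.

    The head is written to cache on turn 0 and read on later turns.
    The tail grows by `tail_per_turn` each turn (assumed to start at zero)
    and `fixed_tail` is re-sent at full price on every turn.
    """
    billable_in = 0.0
    for t in range(turns):
        head_mult = WRITE if t == 0 else READ
        billable_in += head * head_mult + fixed_tail + tail_per_turn * t
    out_tokens = out_per_turn * turns
    return (billable_in * price.input + out_tokens * price.output) / 1_000_000


def monolith_run(base=100_000, turns=10):
    """One agent, full base context, ~1.2k output tokens per turn on the flagship."""
    return agent_cost(base, turns, 1_200, OPUS)


def orchestrated_run(base=100_000, turns=10, n_workers=4):
    """N cheap workers on base/N + 5k each, plus a 3-turn flagship orchestrator."""
    worker = agent_cost(base / n_workers + 5_000, ceil(turns / 2), 1_000, HAIKU)
    # Assumption: all N summaries (1.5k each) are re-sent uncached on each of 3 turns.
    lead = agent_cost(8_000, 3, 1_500, OPUS,
                      tail_per_turn=0, fixed_tail=n_workers * 1_500)
    return n_workers * worker + lead


mono, orch = monolith_run(), orchestrated_run()
print(f"monolith per run:     ${mono:.4f}")
print(f"orchestrated per run: ${orch:.4f}")
print(f"saving:               {1 - orch / mono:.0%}")
```

With these assumed defaults the sketch gives about $1.71 per monolith run against about $0.62 orchestrated. Changing `turns`, `n_workers`, or the worker model moves the result a lot, which is the point of keeping the assumptions explicit.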

04 /

The Numbers Behind It

Verified against OpenAI and Anthropic official pricing, June 2026. USD per 1M tokens. "Cache read" is the price of a cache hit — note it's ~10% of input on every model, never zero.

Model                         Input    Cache read   Output    Write 5m / 1h

OpenAI · automatic caching · 90% off · no write premium
GPT-5.5 (flagship)            $5.00    $0.50        $30.00    n/a
GPT-5.4 mini                  $0.75    $0.075       $4.50     n/a
GPT-5.4 nano (cheapest)       $0.20    $0.02        $1.25     n/a

Anthropic · explicit cache_control · read 0.1× · write 1.25× (5m) / 2× (1h)
Claude Opus 4.8 (flagship)    $5.00    $0.50        $25.00    $6.25 / $10.00
Claude Sonnet 4.6             $3.00    $0.30        $15.00    $3.75 / $6.00
Claude Haiku 4.5 (cheap)      $1.00    $0.10        $5.00     $1.25 / $2.00

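On this table, the spread between cheapest and flagship is roughly 25× within OpenAI (GPT-5.4 nano vs GPT-5.5) and 5× within Anthropic (Haiku 4.5 vs Opus 4.8). That gap is the entire game: drudge work on a $0.20–$1 model, judgment on a $5 model. A monolith spends flagship rates on everything.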
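The flagship-to-cheapest ratios can be read straight off the table. A short sketch, with prices copied from the rows above and a hypothetical helper named `spread`:

```python
PRICES = {  # USD per 1M tokens: (input, output), copied from the table above
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4 mini": (0.75, 4.50),
    "GPT-5.4 nano": (0.20, 1.25),
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def spread(flagship, cheap):
    """Return (input ratio, output ratio) of flagship price over cheap price."""
    flagship_in, flagship_out = PRICES[flagship]
    cheap_in, cheap_out = PRICES[cheap]
    return flagship_in / cheap_in, flagship_out / cheap_out


openai_spread = spread("GPT-5.5", "GPT-5.4 nano")
anthropic_spread = spread("Claude Opus 4.8", "Claude Haiku 4.5")

print(f"OpenAI    input {openai_spread[0]:.0f}x, output {openai_spread[1]:.0f}x")
print(f"Anthropic input {anthropic_spread[0]:.0f}x, output {anthropic_spread[1]:.0f}x")
```

On these rows the ratio is about 25× on input and 24× on output within OpenAI, and 5× on both within Anthropic.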

05 /

Where This Can Go Wrong

The honest version. The architecture wins under specific conditions — naming them up front is what makes the pitch credible.

01 / The orchestrator must be a planner, not a switchboard

If Hermessina only decides "which agent gets this," a frontier model is overkill — the literature is blunt: "using a frontier model to decide which model to use defeats the purpose." Frame it as the director who reads the brief, decomposes the work, and writes the final synthesis. That reasoning role justifies the flagship. Pure routing should be a cheap classifier.
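To make "pure routing should be a cheap classifier" concrete, here is a deliberately trivial stand-in: a keyword scorer that needs no model call at all. The keyword lists, worker names, and the `route` function are all hypothetical; a real deployment would more likely use a small trained classifier or a nano-tier model, with anything ambiguous falling through to the orchestrator.

```python
ROUTES = {  # hypothetical keyword lists per worker, for illustration only
    "github": ("pull request", "commit", "repo", "branch"),
    "calendar": ("meeting", "schedule", "invite"),
    "clients": ("client", "invoice", "contract"),
    "notion": ("doc", "wiki", "notes"),
}


def route(task):
    """Pick the worker whose keywords match most; fall back to the orchestrator."""
    text = task.lower()
    scores = {worker: sum(keyword in text for keyword in keywords)
              for worker, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    # No keyword matched: this needs planning, not routing.
    return best if scores[best] > 0 else "orchestrator"


print(route("Review the open pull request on the billing repo"))
print(route("What should we prioritize next quarter?"))
```

The first task goes to the `github` worker; the second matches nothing and falls back to the orchestrator, which is where the flagship's reasoning is actually needed.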

02 / Multi-agent uses more tokens; the win is dollars and quality

Anthropic measured ~15× the token volume of a chat. Don't hide this. The dollar saving comes from those tokens running on cheap models over small windows; the quality gain comes from dodging context rot. If you tier badly (workers on the flagship), you get the 15× without the saving. Tiering is mandatory, not optional.

03 / It's a poor fit when everything needs shared context

Anthropic's own boundary: tasks where "all agents share the same context or involve many dependencies between agents are not a good fit." Hermes' domains (Notion, GitHub, calendar, clients) separate cleanly — that's exactly the good-fit zone. But tightly-coupled tasks should stay in one agent.

04 / Routing only the relevant slice is the whole ballgame

The saving assumes each worker gets only its domain context. If routing leaks the full 100k into every worker, you've rebuilt the monolith N times — strictly worse. The retrieval/routing layer (hybrid semantic + keyword over chunked source, plus entity relationships) is the real engineering work — the thing you and I already flagged in the thread.
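One way to keep the full brain from leaking into a worker is a hard token budget on each slice. The sketch below assumes the retrieval layer has already chunked the source, tagged each chunk with a domain, and scored it for relevance; the `Chunk` type, the `build_slice` function, and the sample data are hypothetical, and the 20k default budget mirrors the ~20k worker slice used earlier in this memo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    domain: str    # which worker owns this chunk
    tokens: int    # token count of the chunk
    score: float   # relevance from the retriever (assumed precomputed)
    text: str


def build_slice(chunks, domain, budget_tokens=20_000):
    """Greedily pick the highest-scoring chunks for one domain within the budget."""
    picked, used = [], 0
    in_domain = (c for c in chunks if c.domain == domain)
    for chunk in sorted(in_domain, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            picked.append(chunk)
            used += chunk.tokens
    return picked, used


chunks = [
    Chunk("github", 12_000, 0.9, "PR thread"),
    Chunk("github", 9_000, 0.8, "CI config"),
    Chunk("github", 6_000, 0.7, "README"),
    Chunk("notion", 50_000, 0.99, "wiki dump"),
]
picked, used = build_slice(chunks, "github")
print([c.text for c in picked], used)
```

The GitHub worker gets the PR thread and the README (18k tokens). The CI config is skipped because it would exceed the budget, and the 50k Notion dump never reaches this worker regardless of its score.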

06 /

What I'd Actually Do for Hermessina

01 /
Keep Notion as source of truth, add a routing layer — not a summarizer. Hybrid semantic + keyword retrieval over chunked source with entity/relationship metadata (people, projects, clients, decisions). This is what lets each worker pull only its slice with citations instead of inheriting the 100k brain.
02 /
Make Hermessina the director. Flagship model, lean planning prompt + a cached skill/agent catalog. It decomposes, dispatches, and synthesizes — it never holds raw domain context itself.
03 /
Stand up specialists per domain. One worker per clean context bucket (Notion/knowledge, GitHub, calendar/comms, clients). Each on the cheapest model that clears its bar — Haiku / GPT-5.4-mini for most, escalate only on hard reasoning.
04 /
Tier the output, not just the input. Workers draft and explore cheaply; only the final synthesis runs on the flagship. This is where the biggest dollar saving hides.
05 /
Instrument it. Log cached_tokens and per-worker token spend from day one. If cache hit rates are low, a dynamic token is sitting too early in a prefix — fixable, but only if you can see it.
06 /
On the GPT sub, watch the rate limit, not a bill. The deliverable is "Hermes runs 4–6× more before hitting the cap." That's the number a Casper-funded central instance turns on.
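For step 05 (instrumentation), a minimal aggregation sketch. It assumes each model call has already been logged as a plain record with `worker`, `input_tokens`, `cached_tokens`, and `output_tokens`, where `input_tokens` is the total prompt size including the cached part. Providers report these fields differently, so the mapping from a raw API response to this record is left out. The `cache_report` function and the 0.5 alert threshold are hypothetical.

```python
from collections import defaultdict


def cache_report(calls, min_hit_rate=0.5):
    """Aggregate token spend per worker and flag low cache hit rates."""
    totals = defaultdict(lambda: {"input": 0, "cached": 0, "output": 0})
    for call in calls:
        bucket = totals[call["worker"]]
        bucket["input"] += call["input_tokens"]
        bucket["cached"] += call["cached_tokens"]
        bucket["output"] += call["output_tokens"]

    report = {}
    for worker, bucket in totals.items():
        hit_rate = bucket["cached"] / bucket["input"] if bucket["input"] else 0.0
        # A low hit rate suggests a dynamic token sitting too early in the prefix.
        report[worker] = {**bucket, "hit_rate": hit_rate,
                          "low": hit_rate < min_hit_rate}
    return report


calls = [  # made-up log records for illustration
    {"worker": "github", "input_tokens": 20_000, "cached_tokens": 0, "output_tokens": 900},
    {"worker": "github", "input_tokens": 21_000, "cached_tokens": 20_000, "output_tokens": 1_100},
    {"worker": "notion", "input_tokens": 10_000, "cached_tokens": 9_000, "output_tokens": 800},
]
report = cache_report(calls)
for worker, stats in report.items():
    print(worker, f"{stats['hit_rate']:.0%}", "LOW" if stats["low"] else "ok")
```

In this made-up log the GitHub worker lands just under the threshold because its first call was a cold write; over more calls a healthy prefix should climb well above it.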

Built to settle one question: is splitting Hermes into an orchestrator + specialists worth it? Short version — yes, if the routing layer is real and the models are tiered. The cost model above is first-order and directional; the mechanism (smaller stable contexts + cheap workers + flagship-only synthesis) is well-supported.

SOURCES
Anthropic — Building a multi-agent research system (15× tokens, 90.2%, 80% variance)
Anthropic — Effective context engineering (n² attention, context rot, sub-agent isolation)
Chroma — Context Rot (18 models, 20–50% degradation 10k→100k)
Anthropic — Pricing & Prompt caching
OpenAI — Pricing & Prompt caching (automatic, 90% off, 1024-token min)
Morph — LLM cost optimization & model routing (tiering, 40–70% savings)

Shaan · Casper · pricing verified June 2026 · figures directional, mechanism load-bearing