Yesterday I published a deep dive into Claude Code's compaction engine. At the end, I made a promise: go deeper on the caching optimizations that happen outside of compaction.
But actually, the caching rabbit hole started before that post - because of a tweet from about ten days ago.
The Tweet That Confused Me
If you're a heavy Claude Code user, you felt the 5-hour usage cap snap shut after Anthropic's two-week promotional window closed. The complaints flooded in. Someone tagged Boris - the engineer behind Claude Code, the person who built it - asking what he planned to do about it.
His answer: improvements are coming to squeeze more out of the current quota.
My first reaction: what can he possibly do? The quota is server-side. It's rate limits and token budgets. There's no client trick that changes how many tokens you're allowed per hour.
That question sat with me. Then yesterday's compaction post led me to look harder at the source - and the answer became obvious.
Cache Hit Ratio Is the Quota
Every message you send to Claude Code costs tokens. But tokens aren't flat. Cache hits are discounted significantly. Cache misses cost 1.25x - you're not just paying full price, you're paying a penalty.
If your cache hit ratio is high, you stretch the same quota dramatically further than someone whose cache keeps busting. The quota doesn't change. What you extract from it does.
This is the reframe. When Boris says improvements are coming, he's not talking about changing server limits. He's talking about recovering cache hit ratio - which is the same thing as handing quota back to users.
What Claude Code Already Does About This
When I asked Claude to analyze its own source code, what came back wasn't a simple "we cache the system prompt." It was twelve distinct mechanisms working together, each one plugging a specific leak.
Two stood out.
The 2^N problem - and how a boundary string solves it
Every time you send a message, the API receives your entire conversation - not just your latest message. The server caches this accumulated history so it doesn't reprocess everything from scratch on the next turn.
But here's the problem: every conditional in your system prompt that depends on runtime state creates a unique cache key variant. Five booleans - which tools are enabled, whether you're in interactive mode, whether skills are loaded - means 32 possible cache entries. With millions of users, most of those entries will never get a second hit.
Claude Code's solution is a literal string in the source: __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Everything before it is identical for every Claude Code user on Earth - the same coding instructions, the same safety rules, the same tone guidelines. That block gets scope: 'global', which hints at a single cache entry shared worldwide rather than per-organization. Everything runtime-dependent moves after the boundary and isn't cached at all.
The comment in the source puts it plainly: "Each conditional here is a runtime bit that would otherwise multiply the Blake2b prefix hash variants (2^N)."
cache_edits - surgery without invalidation
Over a long session, old tool results pile up inside the cached conversation history. They're taking up space you're paying for. The obvious fix is to remove them - edit the message, delete the content. But the moment any byte in any message changes, the server sees a different sequence. Cache miss. Everything from that edit point onward gets reprocessed from scratch at 1.25x cost.
cache_edits is a side-channel. Instead of editing the message content, Claude Code attaches a separate instruction to the API call: server, remove the cached KV entry for tool result with this ID. The actual message bytes - the ones that form the cache key - never change. The cached history stays valid. The server quietly drops that tool result from its cached representation.
Same conversation. Less bloat. No cache bust.
The Mindset Behind All of It
Reading through the code, one pattern keeps repeating: they don't treat cache misses as a tradeoff. They treat them as bugs.
There's a 728-line diagnostic system that monitors cache hits on every API call. When cache_read_input_tokens drops more than 5% and 2,000 tokens, it writes a .diff file to disk, attributes the root cause - which tool schema changed, which beta header flipped, whether the TTL expired - and logs the event to analytics.
There's a function literally named DANGEROUS_uncachedSystemPromptSection(). Any engineer who uses it has to pass a written reason string explaining why this section must be dynamic - a forcing function that makes the cache cost explicit and visible, like a code review justification for a known defect.
One section had ended up in this category without good reason: the token budget prompt. The old version toggled on and off depending on whether a budget was active - present when you had one, absent when you didn't. Every toggle busted ~20K tokens of cache. The code still has the comment: "Was DANGEROUS_uncached (toggled on getCurrentTurnTokenBudget()), busting ~20K tokens per budget flip."
The fix was a single sentence rewrite. The new version reads:
"When the user specifies a token target (e.g., '+500k', 'spend 2M tokens'), your output token count will be shown each turn. Keep working until you approach the target..."
It never mentions the actual current budget number. It teaches the model the behavior generically - when a target exists, here's how to act - so the text works whether a budget is active or not. The actual number gets injected elsewhere at runtime, not in the system prompt. Stable text, no toggle, no cache bust.
One sentence rewrite saved 20K tokens of cache creation on every budget flip.
The /clear Warning - What the Code Actually Says
I had a hypothesis about the one-liner suggestion Claude Code now shows - "new task? /clear to save 1.2M tokens" - that it appeared when Claude Code detected the cache TTL had expired. I had the source, so I checked.
The real mechanism is called "willow" internally. It fires on two conditions: 75 minutes idle and at least 100K tokens in the conversation. When both are true, you see either a blocking dialog or that one-liner hint, depending on which A/B variant you're in.
My hypothesis was close but not quite right. It's not reading cache state directly - it's a timer. But 75 minutes idle means the 1-hour cache TTL has definitely expired. The feature isn't inspecting cache state; it's inferring it from elapsed time and making the cost visible before you pay it. Continuing a cold, 100K+ token conversation means reprocessing everything from scratch at 1.25x. The hint is the tool telling you: this is about to be expensive - is it worth it?
What Boris Can Actually Do
He can tighten every one of those twelve mechanisms. Push more content across the dynamic boundary into the stable prefix. Find new DANGEROUS_uncached calls and rewrite the prompts until they're stable. Improve fork cache sharing. Plug the leaks the break detection system is already flagging.
Every improvement compounds. The 5-hour quota doesn't change. What you get out of it does.
Previously: Claude Code's Compaction Engine: What the Source Code Actually Reveals

