How does Claude Code token caching affect my quota usage?

Claude Code token caching is the primary lever that determines your effective quota utilization. Cache hits are billed at a discount while cache misses cost 1.25x the normal rate, making your cache hit ratio the single most important factor in how far your quota stretches. The system splits the prompt into a stable, globally-cached prefix and a dynamic section using a boundary marker, which prevents cache key explosion across different conversation states.

What is the cache hit ratio in Claude Code and why does it matter for quota?

The Claude Code cache hit ratio measures what percentage of tokens in each API call are served from Anthropic's prompt caching layer versus recomputed from scratch. A high ratio means most of your conversation context is discounted, while misses are charged at 1.25x, so even small drops in cache hits can significantly increase quota burn. Claude Code includes a 728-line diagnostic system that monitors cache hits per call and writes .diff files when drops exceed 5% and 2,000 tokens.

What is cache_edits in the Claude API and how does it preserve caching?

The cache_edits feature is a side-channel mechanism that removes old tool results from the cached key-value store without altering the actual message bytes sent in the conversation. This is critical because modifying message content would invalidate the cache and force an expensive recomputation at the 1.25x miss rate. By operating outside the message payload, cache_edits lets Claude Code prune stale context while keeping the prompt caching layer intact.

How does Anthropic prompt caching work with dynamic system prompts?

Anthropic prompt caching uses a boundary marker called __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ to split system prompts into two sections: a stable prefix that is globally cached across all users, and a dynamic section that changes per-context and is not cached. This design solves a combinatorial explosion where N independent toggleable prompt sections would create 2^N unique cache keys. Engineers must use a function called DANGEROUS_uncachedSystemPromptSection with a written justification for anything that breaks the cache.

Why does the Claude Code 5-hour usage quota feel so limiting?

The Claude Code 5-hour usage quota is heavily influenced by your cache hit ratio, not just raw usage time. Cache hits are discounted while cache misses cost 1.25x the normal rate, meaning poor caching can burn through your quota far faster than expected. After Anthropic's initial promotional window closed, users began feeling the cap because their effective quota depends on how efficiently the client reuses cached tokens.

What does the Claude Code /clear command actually do and when should I use it?

The Claude Code /clear command resets your conversation context, and it includes a built-in warning system internally called 'willow' that fires when you've been idle for 75 minutes with over 100K tokens in context. This warning exists because the cache TTL has likely expired, meaning continuing that stale conversation would reprocess all tokens at the expensive 1.25x miss rate. Using /clear strategically before resuming a cold session prevents you from paying the full uncached cost on a massive context window.

What are the best practices for Claude Code long session optimization?

Claude Code long session optimization centers on maximizing your cache hit ratio, since every improvement compounds into more usable output from the same quota. Key strategies include watching for the /clear warning after 75 minutes of inactivity with large contexts and avoiding the resumption of cold conversations with 100K+ tokens. Anthropic engineer Boris has confirmed that upcoming improvements will focus on client-side cache optimization rather than server-side quota changes.

What is DANGEROUS_uncachedSystemPromptSection and why does Claude Code use it?

DANGEROUS_uncachedSystemPromptSection is a function in Claude Code's source code that forces engineers to provide a written justification whenever they add uncached content to the system prompt. The naming convention serves as an architectural guardrail — the 'DANGEROUS' prefix signals that placing content outside the cached prefix directly increases token costs for every API call. One example: the token budget prompt was originally toggled on/off, busting roughly 20K tokens of cache per flip, until it was rewritten as a single generic instruction.

How does the Blake2b cache key hash work in Claude Code's token caching system?

Claude Code uses a Blake2b prefix hash to generate cache keys, meaning the key is derived from the exact sequence of tokens in the prompt prefix. Each runtime conditional that alters the system prompt — such as feature flags, user settings, or tool availability — creates a distinct hash variant, causing the number of cache key variants to grow exponentially at 2^N where N is the number of conditionals. This is why Claude Code uses the __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ to keep all runtime-dependent content out of the cached prefix.

The 5-Hour Quota, Boris's Tweet, and What the Source Code Actually Reveals

Yesterday I published a deep dive into Claude Code's compaction engine. At the end, I made a promise: go deeper on the caching optimizations that happen outside of compaction.

But actually, the caching rabbit hole started before that post - because of a tweet from about ten days ago.

The Tweet That Confused Me

If you're a heavy Claude Code user, you felt the 5-hour usage cap snap shut after Anthropic's two-week promotional window closed. The complaints flooded in. Someone tagged Boris - the engineer behind Claude Code, the person who built it - asking what he planned to do about it.

His answer: improvements are coming to squeeze more out of the current quota.

My first reaction: what can he possibly do? The quota is server-side. It's rate limits and token budgets. There's no client trick that changes how many tokens you're allowed per hour.

That question sat with me. Then yesterday's compaction post led me to look harder at the source - and the answer became obvious.

Cache Hit Ratio Is the Quota

Every message you send to Claude Code costs tokens. But tokens aren't flat. Cache hits are discounted significantly. Cache misses cost 1.25x - you're not just paying full price, you're paying a penalty.

If your cache hit ratio is high, you stretch the same quota dramatically further than someone whose cache keeps busting. The quota doesn't change. What you extract from it does.

This is the reframe. When Boris says improvements are coming, he's not talking about changing server limits. He's talking about recovering cache hit ratio - which is the same thing as handing quota back to users.

What Claude Code Already Does About This

When I asked Claude to analyze its own source code, what came back wasn't a simple "we cache the system prompt." It was twelve distinct mechanisms working together, each one plugging a specific leak.

Two stood out.

The 2^N problem - and how a boundary string solves it

Every time you send a message, the API receives your entire conversation - not just your latest message. The server caches this accumulated history so it doesn't reprocess everything from scratch on the next turn.

But here's the problem: every conditional in your system prompt that depends on runtime state creates a unique cache key variant. Five booleans - which tools are enabled, whether you're in interactive mode, whether skills are loaded - means 32 possible cache entries. With millions of users, most of those entries will never get a second hit.

Claude Code's solution is a literal string in the source: __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Everything before it is identical for every Claude Code user on Earth - the same coding instructions, the same safety rules, the same tone guidelines. That block gets scope: 'global', which hints at a single cache entry shared worldwide rather than per-organization. Everything runtime-dependent moves after the boundary and isn't cached at all.

The comment in the source puts it plainly: "Each conditional here is a runtime bit that would otherwise multiply the Blake2b prefix hash variants (2^N)."

cache_edits - surgery without invalidation

Over a long session, old tool results pile up inside the cached conversation history. They're taking up space you're paying for. The obvious fix is to remove them - edit the message, delete the content. But the moment any byte in any message changes, the server sees a different sequence. Cache miss. Everything from that edit point onward gets reprocessed from scratch at 1.25x cost.

cache_edits is a side-channel. Instead of editing the message content, Claude Code attaches a separate instruction to the API call: server, remove the cached KV entry for tool result with this ID. The actual message bytes - the ones that form the cache key - never change. The cached history stays valid. The server quietly drops that tool result from its cached representation.

Same conversation. Less bloat. No cache bust.

The Mindset Behind All of It

Reading through the code, one pattern keeps repeating: they don't treat cache misses as a tradeoff. They treat them as bugs.

There's a 728-line diagnostic system that monitors cache hits on every API call. When cache_read_input_tokens drops more than 5% and 2,000 tokens, it writes a .diff file to disk, attributes the root cause - which tool schema changed, which beta header flipped, whether the TTL expired - and logs the event to analytics.

There's a function literally named DANGEROUS_uncachedSystemPromptSection(). Any engineer who uses it has to pass a written reason string explaining why this section must be dynamic - a forcing function that makes the cache cost explicit and visible, like a code review justification for a known defect.

One section had ended up in this category without good reason: the token budget prompt. The old version toggled on and off depending on whether a budget was active - present when you had one, absent when you didn't. Every toggle busted ~20K tokens of cache. The code still has the comment: "Was DANGEROUS_uncached (toggled on getCurrentTurnTokenBudget()), busting ~20K tokens per budget flip."

The fix was a single sentence rewrite. The new version reads:

"When the user specifies a token target (e.g., '+500k', 'spend 2M tokens'), your output token count will be shown each turn. Keep working until you approach the target..."

It never mentions the actual current budget number. It teaches the model the behavior generically - when a target exists, here's how to act - so the text works whether a budget is active or not. The actual number gets injected elsewhere at runtime, not in the system prompt. Stable text, no toggle, no cache bust.

One sentence rewrite saved 20K tokens of cache creation on every budget flip.

The `/clear` Warning - What the Code Actually Says

I had a hypothesis about the one-liner suggestion Claude Code now shows - "new task? /clear to save 1.2M tokens" - that it appeared when Claude Code detected the cache TTL had expired. I had the source, so I checked.

The real mechanism is called "willow" internally. It fires on two conditions: 75 minutes idle and at least 100K tokens in the conversation. When both are true, you see either a blocking dialog or that one-liner hint, depending on which A/B variant you're in.

My hypothesis was close but not quite right. It's not reading cache state directly - it's a timer. But 75 minutes idle means the 1-hour cache TTL has definitely expired. The feature isn't inspecting cache state; it's inferring it from elapsed time and making the cost visible before you pay it. Continuing a cold, 100K+ token conversation means reprocessing everything from scratch at 1.25x. The hint is the tool telling you: this is about to be expensive - is it worth it?

What Boris Can Actually Do

He can tighten every one of those twelve mechanisms. Push more content across the dynamic boundary into the stable prefix. Find new DANGEROUS_uncached calls and rewrite the prompts until they're stable. Improve fork cache sharing. Plug the leaks the break detection system is already flagging.

Every improvement compounds. The 5-hour quota doesn't change. What you get out of it does.

Previously: Claude Code's Compaction Engine: What the Source Code Actually Reveals

The 5-Hour Quota, Boris's Tweet, and What the Source Code Actually Reveals

The Tweet That Confused Me

Cache Hit Ratio Is the Quota

What Claude Code Already Does About This

The Mindset Behind All of It

The `/clear` Warning - What the Code Actually Says

What Boris Can Actually Do

Frequently Asked Questions

Found this helpful? Share it!

Quick Links

Connect

The 5-Hour Quota, Boris's Tweet, and What the Source Code Actually Reveals

The Tweet That Confused Me

Cache Hit Ratio Is the Quota

What Claude Code Already Does About This

The Mindset Behind All of It

The /clear Warning - What the Code Actually Says

What Boris Can Actually Do

Frequently Asked Questions

Found this helpful? Share it!

The `/clear` Warning - What the Code Actually Says