Claude Code's Compaction Engine: What the Source Code Actually Reveals
AI Tools & Agents | March 31, 2026 | 5 min read

A few months ago I wrote about context engineering - the invisible logic that keeps AI agents from losing their minds during long sessions. I described the patterns from the outside: keep the latest file versions, trim terminal output, summarize old tool results, guard the system prompt.

I also made a prediction: naive LLM summarization was a band-aid. The real work had to be deterministic curation. Summary should be the last resort.

Then Claude Code's repository surfaced publicly. I asked Claude to analyze its own compaction source code.

The prediction held. And the implementation is more thoughtful than I expected.

Three Tiers, Not One

Claude Code's compaction system isn't a single mechanism - it's three tiers applied in sequence, each heavier than the last.

Tier 1 runs before every API call. It does lightweight cleanup: clearing old tool results, keeping only the most recent five, replacing the rest with [Old tool result content cleared]. Fast, cheap, no model involved.
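
Here is a minimal sketch of that tier 1 pass, assuming a simple list-of-dicts message format. The function name, the keep_recent parameter, and the message shape are illustrative assumptions, not Claude Code's actual internals. Only the keep-five rule and the placeholder string come from the description above.

```python
from copy import deepcopy

PLACEHOLDER = "[Old tool result content cleared]"

def microcompact(messages, keep_recent=5):
    """Return a copy of messages with all but the newest tool results cleared."""
    compacted = deepcopy(messages)
    # Indices of tool-result messages, oldest first (message shape is assumed).
    tool_idx = [i for i, m in enumerate(compacted) if m.get("type") == "tool_result"]
    # Everything except the last keep_recent entries gets the placeholder.
    stale = tool_idx[:-keep_recent] if keep_recent > 0 else tool_idx
    for i in stale:
        compacted[i]["content"] = PLACEHOLDER
    return compacted

history = [{"type": "user", "content": "fix the failing test"}]
history += [
    {"type": "tool_result", "tool_use_id": f"t{i}", "content": f"output {i}"}
    for i in range(8)
]
result = microcompact(history)
```

With eight tool results in the history, the three oldest are replaced and the five newest survive. No model call is involved, which is why this pass can run before every request.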

Tier 2 operates at the API level - server-side strategies that handle thinking blocks and tool result clearing based on token thresholds.

Tier 3 is the full LLM summarization. It produces a structured 9-section summary whose sections include intent, technical concepts, files touched, errors and fixes, all user messages, pending tasks, and current work. The model reasons through the conversation before committing to the summary - a chain-of-thought scratchpad that gets stripped afterward. It's sophisticated. It's also the last resort.
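
To make the scratchpad-stripping step concrete, here is a rough sketch. The `<analysis>` and `<summary>` tag names are assumptions chosen for illustration, since the article does not show the actual delimiters, and strip_scratchpad is a hypothetical helper.

```python
import re

def strip_scratchpad(raw: str) -> str:
    """Drop the reasoning block and return only the summary text."""
    # Remove the (hypothetical) <analysis>...</analysis> scratchpad.
    without = re.sub(r"<analysis>.*?</analysis>", "", raw, flags=re.DOTALL)
    # If a (hypothetical) <summary> wrapper is present, keep only its body.
    match = re.search(r"<summary>(.*?)</summary>", without, flags=re.DOTALL)
    return (match.group(1) if match else without).strip()

raw_output = (
    "<analysis>User wants the login bug fixed; touched auth.py</analysis>\n"
    "<summary>\n1. Intent: fix login bug\n</summary>"
)
clean = strip_scratchpad(raw_output)
```

The point of the design is that the model gets room to think before summarizing, but only the summary occupies context afterward.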

This architecture confirms exactly what the first article argued: summarization is expensive and lossy. You reach for it only when everything else has already run.

The Cache Insight That Changes Everything

Here's where it gets interesting.

When I first read about the tier 1 microcompact, my instinct was: but if the conversation is cached, deleting old messages invalidates the cache. And cache invalidation is brutally expensive - instead of a 90% discount on cached input tokens, you're paying 1.25x the base price to write the changed prefix back into the cache. You've just made compaction cost more than the tokens you saved.
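
A back-of-the-envelope calculation shows the scale of the problem. The two multipliers are the ones quoted above; the token counts are made-up illustrative numbers, and costs are expressed in units of one base-price input token.

```python
CACHE_READ_MULT = 0.10   # 90% discount on cached input tokens
CACHE_WRITE_MULT = 1.25  # premium for writing tokens into the cache

def input_cost(cached_tokens, written_tokens):
    """Relative input cost, in units of base-price tokens."""
    return cached_tokens * CACHE_READ_MULT + written_tokens * CACHE_WRITE_MULT

prefix = 100_000   # illustrative conversation size
trimmed = 20_000   # illustrative tokens removed by naive local deletion

# Warm cache, nothing modified: the whole prefix is read at the discount.
warm = input_cost(cached_tokens=prefix, written_tokens=0)

# Worst case: the deletion lands early in the prefix, so everything
# that remains has to be written to the cache again.
rewritten = input_cost(cached_tokens=0, written_tokens=prefix - trimmed)
```

Under these assumptions the trimmed request costs roughly ten times the untouched one, even though it carries 20% fewer tokens.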

The solution is elegant. When the prompt cache is warm, Claude Code doesn't modify local messages at all. Instead, it queues cache_edits blocks that ride alongside the API request - telling the server to delete specific tool result blocks by their tool_use_id, surgically, without touching the cached prefix.
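
The shape of the idea can be sketched like this. Everything here is hypothetical: the CacheEditQueue class, the edit dictionary layout, and the request structure are stand-ins, because the article names cache_edits and tool_use_id but does not show the wire format.

```python
from dataclasses import dataclass, field

@dataclass
class CacheEditQueue:
    """Collects server-side deletions without mutating local messages."""
    pending: list = field(default_factory=list)

    def delete_tool_result(self, tool_use_id: str) -> None:
        # Hypothetical edit shape; the real payload is not shown in the article.
        self.pending.append({"type": "delete", "tool_use_id": tool_use_id})

    def attach(self, request: dict) -> dict:
        """Return a new request carrying the queued edits; messages are untouched."""
        out = dict(request)
        out["cache_edits"] = list(self.pending)
        self.pending.clear()
        return out

messages = [{"type": "tool_result", "tool_use_id": "toolu_01", "content": "long output"}]
queue = CacheEditQueue()
queue.delete_tool_result("toolu_01")
request = queue.attach({"messages": messages})
```

The local message list is byte-for-byte what it was before, so the prefix the server has cached still matches. The deletion travels as a side instruction instead of as a rewrite.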

The cache stays intact. The old tool results disappear. No rewrite cost.

The Summarization Call Reuses Your Own Cache

The same logic applies to the full compaction path.

When tier 3 triggers and Claude Code needs to summarize the entire conversation, you'd expect it to spin up a fresh API call with a dedicated system prompt: "You are a summarization assistant...". Different system prompt, different cache key, entire conversation re-tokenized from scratch.

They don't do that.

Instead, the summarization call reuses the exact same system prompt, tools, model, and message prefix as the main conversation. The compaction instruction is appended as a new user message at the end. The server sees the same cache key - and hits it.
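
A toy model of prefix caching makes the difference visible. This is a deliberate simplification that treats a request as an ordered list of blocks and counts how many leading blocks two requests share; the helper names and sample strings are illustrative, and real caching works on tokens, not on whole blocks.

```python
def as_blocks(system, tools, messages):
    """Flatten a request into the ordered blocks a prefix cache would see."""
    return [("system", system), ("tools", tuple(tools))] + [("msg", m) for m in messages]

def shared_prefix(a, b):
    """Number of leading blocks two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a coding agent."
tools = ["read_file", "bash"]
convo = ["user: fix the bug", "assistant: reading files", "user: looks good"]

main = as_blocks(system, tools, convo)
# Forked call: same prefix, compaction instruction appended at the end.
forked = as_blocks(system, tools, convo + ["user: summarize the conversation so far"])
# Dedicated call: a different system prompt at position zero.
dedicated = as_blocks("You are a summarization assistant.", tools, convo)

reused_by_fork = shared_prefix(main, forked)
reused_by_dedicated = shared_prefix(main, dedicated)
```

The forked request shares every block of the main conversation, while the dedicated one diverges at the very first block and shares nothing, even though the conversation itself is identical.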

The alternative was tested. A 98% cache miss rate. Tens of billions of tokens re-processed daily, globally, for no reason other than a different system prompt on the compaction call.

What Happens After Compaction

The part I only described abstractly before: reconstruction.

After summarization, Claude Code doesn't just drop the summary into an empty context and hope for the best. It rebuilds methodically: boundary marker with pre-compaction metadata, the formatted summary, the 5 most recently read files (capped at 50K tokens total), re-injected skills sorted by recency, tool definitions re-announced, session hooks re-run, CLAUDE.md restored.
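
The file-restoration step could look something like the sketch below. The limits of five files and 50K tokens come from the description above; the function name, the read-log format, and the policy of skipping a file that would overflow the budget are assumptions.

```python
def pick_files_to_restore(read_log, max_files=5, token_budget=50_000):
    """read_log: list of (path, token_count) tuples, oldest read first."""
    chosen, used, seen = [], 0, set()
    for path, tokens in reversed(read_log):   # walk newest first
        if path in seen:
            continue                          # only the latest read of a file counts
        seen.add(path)
        if len(chosen) == max_files:
            break
        if used + tokens > token_budget:
            continue                          # assumed policy: skip files that overflow
        chosen.append(path)
        used += tokens
    return chosen, used

read_log = [
    ("a.py", 10_000),
    ("b.py", 30_000),
    ("a.py", 4_000),    # a.py re-read later, now smaller
    ("c.py", 40_000),
    ("d.py", 5_000),
]
files, tokens_used = pick_files_to_restore(read_log)
```

In this example d.py, c.py, and the latest read of a.py fit within the budget, while b.py is left out because adding it would exceed 50K tokens.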

If the agent was running autonomously before compaction hit, the continuation message tells it: you were already working, don't acknowledge the summary, don't recap, just continue. The session experience is designed to be seamless - not just for the user watching, but for the agent executing.

This is what separates a compaction system that works from one that technically runs but quietly breaks the task halfway through.

Cache Economics Shaped All of It

Looking at the whole system, there's a single thread running through every architectural decision: cache hits are everything.

The surgical cache_edits instead of message modification - cache. The forked summarization call piggybacking on the main conversation's prefix - cache. The three tiers that delay summarization as long as possible - also cache, because summarization is the one path that ends by replacing the conversation history, so the old cached prefix can't carry over to the turns that follow.

This isn't just context engineering. It's context engineering under a cost constraint. And solving both simultaneously is what gives you stable, long-running sessions without burning through your quota.

Which is a good segue: the 5-hour usage cap has been generating a lot of frustration lately. What I've described here is just one piece of what Claude Code does to stretch that window. The next article will go deeper - all the caching optimizations that happen outside of compaction, and what they actually buy you in practice.

Jonathan Barazany

Chief AI at Nayax. Previously 10 years at Microsoft building data systems and leading engineering teams. Writes about AI agents, data engineering, and technical leadership.
