How does Claude Code's compaction system manage context window limits?

Claude Code uses a three-tier compaction system to manage context window limits. Tier 1 performs lightweight cleanup by clearing old tool results and keeping only the 5 most recent. Tier 2 applies API-level server-side strategies. Tier 3 triggers a full LLM summarization that produces a structured 9-section summary, after which the conversation is reconstructed with a boundary marker, the summary, recent file contents capped at 50K tokens, skills, tools, hooks, and CLAUDE.md instructions.

What are cache_edits and how do they preserve the prompt cache during compaction?

Cache_edits are specialized blocks that Claude Code uses when the prompt cache is warm to surgically delete old tool results by their tool_use_id without invalidating the existing cache. Instead of modifying local messages and triggering an expensive cache rewrite, Claude Code queues these edits alongside the API request. The alternative approach of rebuilding the prompt from scratch was found to have a 98% cache miss rate, making cache_edits a critical efficiency mechanism.

Why does Claude Code use tiered compaction instead of always summarizing?

Claude Code's tiered compaction approach is designed to preserve cached conversation prefixes for as long as possible. Lightweight Tier 1 cleanup and Tier 2 server-side strategies handle most cases without an expensive LLM call. Full Tier 3 summarization only fires when the context window is under serious pressure, and even then the summarization call reuses the same system prompt, model, and cache key as the main conversation to keep cache hit rates high.

What does Claude Code's source code reveal about its compaction engine?

The Claude Code source code reveals a sophisticated 3-tier compaction system where cache economics drive every architectural decision. Rather than naively truncating or summarizing conversation history, the engine uses surgical cache_edits instead of message modification, piggybacks summarization calls on the main conversation's cache prefix, and deliberately delays summarization through three tiers — all to maximize cache hit rates and reduce API costs.

How does context engineering for AI agents differ from prompt engineering?

While prompt engineering focuses on crafting individual instructions, context engineering for AI agents is about architecting the entire information pipeline — deciding what context to keep, compress, or discard as conversations grow. Claude Code's compaction engine exemplifies this: its post-compaction reconstruction carefully rebuilds a boundary marker, summary, recent files capped at 50K tokens, skills, tools, and CLAUDE.md instructions rather than simply writing a better prompt.

What happens during post-compaction reconstruction in AI coding tools like Claude Code?

Post-compaction reconstruction rebuilds the agent's working context after the conversation history has been compressed. In Claude Code's system, this assembles a specific sequence: a boundary marker, the compacted summary, the 5 most recently read files with a 50K token cap, loaded skills, tool definitions, and CLAUDE.md project instructions. If the agent was running autonomously, the continuation message instructs it to proceed without acknowledging the summary, maintaining seamless operation.

What does the Claude Code /compact command actually do under the hood?

The Claude Code /compact command triggers the full Tier 3 summarization of your conversation context. It runs the most aggressive compaction pass, distilling the entire conversation into a structured 9-section summary while re-injecting the 5 most recently read files capped at 50K tokens so you retain working context for your current task. This is the same process that runs automatically when the context window fills up.

How does Claude Code's compaction engine help with the 5-hour usage cap?

The compaction engine stretches the 5-hour usage cap by dramatically reducing the token count of each request through progressive summarization. By clearing old tool results in Tier 1, applying server-side strategies in Tier 2, and only falling back to full summarization in Tier 3, Claude Code keeps each API call efficient. Fewer tokens consumed per interaction means you can accomplish more within the same usage window.

How does Claude Code vs Cursor handle context management for long coding sessions?

Claude Code uses a cache_edits mechanism that avoids prompt cache invalidation entirely, preserving a 90% discount on cached tokens compared to the 1.25x cost of cache rewrites. Its 3-tier compaction system combined with caching economics gives it a distinct advantage for extended sessions. Cursor and Windsurf handle context limits differently, but Claude Code's architecture is specifically designed around the principle that cache hits are everything.

Why do prompt caching economics matter for AI coding tools like Claude Code?

Prompt caching economics directly determine how much productive work you get from each session. Claude Code achieves a 90% discount on previously cached tokens, whereas writing new cache entries costs 1.25x the normal rate. This economic structure drives the entire compaction architecture — surgical cache_edits preserve the cache, the summarization call reuses the existing cache key, and the three tiers delay expensive summarization as long as possible.

Claude Code's Compaction Engine: What the Source Code Actually Reveals

A few months ago I wrote about context engineering - the invisible logic that keeps AI agents from losing their minds during long sessions. I described the patterns from the outside: keep the latest file versions, trim terminal output, summarize old tool results, guard the system prompt.

I also made a prediction: naive LLM summarization was a band-aid. The real work had to be deterministic curation. Summary should be the last resort.

Then Claude Code's repository surfaced publicly. I asked Claude to analyze its own compaction source code.

The prediction held. And the implementation is more thoughtful than I expected.

Three Tiers, Not One

Claude Code's compaction system isn't a single mechanism - it's three tiers applied in sequence, each heavier than the last.

Tier 1 runs before every API call. It does lightweight cleanup: clearing old tool results, keeping only the most recent five, replacing the rest with [Old tool result content cleared]. Fast, cheap, no model involved.

Tier 2 operates at the API level - server-side strategies that handle thinking blocks and tool result clearing based on token thresholds.

Tier 3 is the full LLM summarization. A structured 9-section summary: intent, technical concepts, files touched, errors and fixes, all user messages, pending tasks, current work. The model reasons through the conversation before committing to the summary - a chain-of-thought scratchpad that gets stripped afterward. It's sophisticated. It's also the last resort.

This architecture confirms exactly what the first article argued: summarization is expensive and lossy. You reach for it only when everything else has already run.

The Cache Insight That Changes Everything

Here's where it gets interesting.

When I first read about the tier 1 microcompact, my first instinct was: but if the conversation is cached, deleting old messages invalidates the cache. And cache invalidation is brutally expensive - instead of a 90% discount on tokens, you're paying 1.25x for cache writes. You've just made compaction cost more than the tokens you saved.

The solution is elegant. When the prompt cache is warm, Claude Code doesn't modify local messages at all. Instead, it queues cache_edits blocks that ride alongside the API request - telling the server to delete specific tool result blocks by their tool_use_id, surgically, without touching the cached prefix.

The cache stays intact. The old tool results disappear. No rewrite cost.

The Summarization Call Reuses Your Own Cache

The same logic applies to the full compaction path.

When tier 3 triggers and Claude Code needs to summarize the entire conversation, you'd expect it to spin up a fresh API call with a dedicated system prompt: "You are a summarization assistant...". Different system prompt, different cache key, entire conversation re-tokenized from scratch.

They don't do that.

Instead, the summarization call reuses the exact same system prompt, tools, model, and message prefix as the main conversation. The compaction instruction is appended as a new user message at the end. The server sees the same cache key - and hits it.

The alternative was tested. A 98% cache miss rate. Tens of billions of tokens re-processed daily, globally, for no reason other than a different system prompt on the compaction call.

What Happens After Compaction

The part I only described abstractly before: reconstruction.

After summarization, Claude Code doesn't just drop the summary into an empty context and hope for the best. It rebuilds methodically: boundary marker with pre-compaction metadata, the formatted summary, the 5 most recently read files (capped at 50K tokens total), re-injected skills sorted by recency, tool definitions re-announced, session hooks re-run, CLAUDE.md restored.

If the agent was running autonomously before compaction hit, the continuation message tells it: you were already working, don't acknowledge the summary, don't recap, just continue. The session experience is designed to be seamless - not just for the user watching, but for the agent executing.

This is what separates a compaction system that works from one that technically runs but quietly breaks the task halfway through.

Cache Economics Shaped All of It

Looking at the whole system, there's a single thread running through every architectural decision: cache hits are everything.

The surgical cache_edits instead of message modification - cache. The forked summarization call piggybacking on the main conversation's prefix - cache. The three tiers that delay summarization as long as possible - also cache, because summarization is the one path that can't reuse the existing key cleanly.

This isn't just context engineering. It's context engineering under a cost constraint. And solving both simultaneously is what gives you stable, long-running sessions without burning through your quota.

Which is a good segue: the 5-hour usage cap has been generating a lot of frustration lately. What I've described here is just one piece of what Claude Code does to stretch that window. The next article will go deeper - all the caching optimizations that happen outside of compaction, and what they actually buy you in practice.

Claude Code's Compaction Engine: What the Source Code Actually Reveals

Three Tiers, Not One

The Cache Insight That Changes Everything

The Summarization Call Reuses Your Own Cache

What Happens After Compaction

Cache Economics Shaped All of It

Frequently Asked Questions

Found this helpful? Share it!

Quick Links

Connect