Your AI coding assistant just spent 20 minutes refactoring a complex React component. It understood the patterns, handled edge cases, even wrote the tests. Then you ask it to update one more file.
Suddenly, it hits a wall. Maybe it throws an error: "Bad Request: input tokens too high." Or worse — the model provider silently trims the conversation, and now it's calling functions that don't exist, forgetting the component structure it just built, writing Python instead of TypeScript.
What happened? Context collapse.
Every large language model has a hard limit — its context window. Cross that threshold, and what happens next depends on your provider. Some fail fast with clear errors. Others silently trim from the start of the conversation.
Here's the part most people don't realize: when that trimming happens, the first thing to go is often the system prompt — the part that tells the agent who it is, what tools it has, and how to behave. Without that anchor, even the smartest model becomes useless.
If you've watched an AI agent go from brilliant to broken mid-task, this is your explanation.
Compacting the conversation
To avoid this collapse, most agents began doing something called "compacting the conversation." The idea was simple: whenever the token count got too high, the agent would ask the model to summarize the chat so far, keep only the summary, and drop the rest.
This did save money (fewer tokens, fewer API calls), but at a cost: it could, and very often did, drop important details. The results? Tasks that started fine but drifted off course. Agents that couldn't finish what they began.
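To make the mechanism concrete, here is a minimal sketch of naive compaction in TypeScript. The token estimate and the summarize callback are stand-ins for a real tokenizer and a real model request, and the threshold is an arbitrary illustrative value.

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

const COMPACTION_THRESHOLD = 100_000; // tokens; illustrative value

// Rough estimate: ~4 characters per token. A real agent would use the
// provider's own tokenizer instead.
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

async function maybeCompact(
  messages: Message[],
  summarize: (history: Message[]) => Promise<string>, // one extra model call
): Promise<Message[]> {
  if (estimateTokens(messages) < COMPACTION_THRESHOLD) return messages;

  const [system, ...rest] = messages;
  const summary = await summarize(rest);
  // Everything except the system prompt is replaced by the summary,
  // which is exactly where details get lost.
  return [system, { role: "assistant", content: `Summary so far: ${summary}` }];
}
```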
Between March and July 2025, this difference — between agents that compacted aggressively for savings, and those that didn't — became one of the clearest quality markers across the market.
Compacting might sound clever, but it's a band-aid. Real stability came from something else entirely: context engineering.
The engineering behind context
Context engineering isn't about summarizing. It's about curation — deciding, with deterministic logic, what deserves to stay in the model's memory and what doesn't.
And unlike compacting, this happens automatically, right before the agent hits its context limit: not continuously, not on a timer, but only when the conversation is about to breach that threshold.
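In code, the trigger might look like the sketch below, reusing the Message type and estimateTokens helper from the compaction snippet; the window size and safety margin are illustrative, not any provider's real numbers.

```typescript
const CONTEXT_WINDOW = 200_000; // tokens; depends on the model
const TRIGGER_RATIO = 0.9;      // start curating at 90% of the window

type PruningPass = (messages: Message[]) => Message[];

function curateContext(messages: Message[], passes: PruningPass[]): Message[] {
  // Below the threshold: leave the conversation untouched.
  if (estimateTokens(messages) < CONTEXT_WINDOW * TRIGGER_RATIO) return messages;

  // Above it: apply each deterministic rule in order, with no model calls involved.
  return passes.reduce((msgs, pass) => pass(msgs), messages);
}
```

The interesting part is what goes into those passes.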
Over time, the smarter agents started doing this with a mix of clever heuristics:
1. Keep only the latest version of each file
When working on code, the same file may appear multiple times — a refactor, a build fix, a test tweak. All these versions pile up in the context. The fix? Agents learned to keep only the most recent copy of each file. The older ones get dropped.
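A sketch of that rule, assuming file snapshots enter the context as messages tagged with a filePath field (a hypothetical convention; real agents track this in their own ways):

```typescript
type FileMessage = Message & { filePath?: string };

function keepLatestFileVersions(messages: FileMessage[]): FileMessage[] {
  // Record the position of the most recent snapshot of each file.
  const lastIndexByPath = new Map<string, number>();
  messages.forEach((m, i) => {
    if (m.filePath) lastIndexByPath.set(m.filePath, i);
  });
  // Drop any snapshot that a later snapshot of the same path supersedes.
  return messages.filter((m, i) => !m.filePath || lastIndexByPath.get(m.filePath) === i);
}
```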
2. Shorten terminal outputs
Some commands, like npm run build or dotnet build, dump thousands of lines of logs and warnings. Unless the user specifically asked to fix warnings, most of that text is noise.
So agents began keeping only the start and end of long outputs — where the meaningful parts usually live — and trimming the middle entirely.
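Here's a minimal sketch of that trimming; the head and tail sizes are illustrative, not any agent's real numbers.

```typescript
const HEAD_LINES = 30; // illustrative limits
const TAIL_LINES = 50;

function truncateTerminalOutput(output: string): string {
  const lines = output.split("\n");
  if (lines.length <= HEAD_LINES + TAIL_LINES) return output;

  const omitted = lines.length - HEAD_LINES - TAIL_LINES;
  return [
    ...lines.slice(0, HEAD_LINES),        // the command and its first warnings
    `[... ${omitted} lines omitted ...]`, // explicit marker so the model knows
    ...lines.slice(-TAIL_LINES),          // the final errors and exit status
  ].join("\n");
}
```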
3. Cut old tool call results, keep the meaning
Agents often call dozens or even hundreds of tools over a long session. Instead of keeping every raw JSON output forever, some agents learned to retain only the model's interpretation of that tool result. For example: "Error rate spiked on August 25th." That single sentence replaces the hundreds of lines that produced it.
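As a sketch, assuming each tool message carries an optional summary field holding the model's one-line interpretation (a hypothetical schema, not any specific agent's), the rule looks like this:

```typescript
type ToolMessage = Message & { summary?: string };

function collapseOldToolResults(messages: ToolMessage[], keepLastN = 5): ToolMessage[] {
  // Find the positions of all tool results, then keep the latest few verbatim.
  const toolIndices = messages
    .map((m, i) => (m.role === "tool" ? i : -1))
    .filter((i) => i >= 0);
  const recent = new Set(toolIndices.slice(-keepLastN));

  return messages.map((m, i) =>
    m.role === "tool" && !recent.has(i) && m.summary
      ? { ...m, content: m.summary } // e.g. "Error rate spiked on August 25th."
      : m,
  );
}
```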
4. Guard the important parts
Some agents implemented integrity checks to make sure that the system prompt — the one that defines its tools, behavior, and constraints — never gets trimmed by accident. However, this isn't always possible. For example, in environments like VS Code's LM API, the system prompt is often appended as a regular message rather than a protected one — meaning the agent itself doesn't fully control it.
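Where the agent does control its own message list, the guard can be as small as this sketch, which simply re-attaches the system prompt if a pruning pass dropped it:

```typescript
function protectSystemPrompt(messages: Message[], systemPrompt: string): Message[] {
  // If any pruning pass dropped the system message, put it back at the front.
  const hasSystem = messages.some((m) => m.role === "system");
  return hasSystem ? messages : [{ role: "system", content: systemPrompt }, ...messages];
}
```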
These are all invisible operations, but they're what separate a stable, long-lived session from one that spirals into confusion.
More than cleanup
But context engineering isn't just about shrinking the conversation. It's also about reminding the model what matters most.
Many agents now show a TODO list or "plan" in their interface. That's not just a fancy UI. It's a live piece of context that the model keeps seeing — a repeated reminder of what it's working on, what's next, and what's done.
Every few iterations, the agent re-inserts this list back into the context, reinforcing the current objective and keeping the model aligned. It's one of the simplest and most effective forms of context engineering — especially on long-lived sessions, where focus naturally starts to fade.
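As a rough sketch, reusing the Message type from earlier, that reminder loop might look like this; the interval and the message role are assumptions, since agents inject the plan in different ways.

```typescript
const REMINDER_INTERVAL = 5; // iterations between reminders; illustrative

function maybeRemindPlan(messages: Message[], todo: string[], iteration: number): Message[] {
  if (iteration % REMINDER_INTERVAL !== 0 || todo.length === 0) return messages;

  // Re-insert the current plan so the model keeps seeing its objective.
  const reminder = "Current plan:\n" + todo.map((t, i) => `${i + 1}. ${t}`).join("\n");
  return [...messages, { role: "user", content: reminder }];
}
```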
A bridge between the two worlds: Pinning
There's one idea that could dramatically improve even the naïve compact-the-conversation approach: pinning.
When agents compact a conversation, they usually keep a few fixed things — the initial user prompt that defines the problem, and sometimes the execution plan or TODO list. But that's it. Everything else is fair game for removal.
Now imagine if you, the user, could pin a message — marking it as something that must survive every compaction pass. Maybe it's a critical instruction, a specific rule, or a piece of data the agent can't operate without. Today, when compaction happens, there's no guarantee those will remain.
Adding this kind of selective persistence would make a huge difference. It wouldn't replace deterministic context engineering, but it would bridge the two worlds — letting users protect the essentials even when the conversation gets compacted.
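Built on top of the naive compaction sketch from earlier, pinning is a small change: messages carrying a hypothetical pinned flag join the system prompt in the always-keep set, and only the rest gets summarized.

```typescript
type PinnableMessage = Message & { pinned?: boolean };

async function compactWithPins(
  messages: PinnableMessage[],
  summarize: (history: PinnableMessage[]) => Promise<string>,
): Promise<PinnableMessage[]> {
  // Pinned messages and the system prompt always survive compaction.
  const kept = messages.filter((m) => m.role === "system" || m.pinned === true);
  const dropped = messages.filter((m) => m.role !== "system" && m.pinned !== true);

  const summary = await summarize(dropped);
  return [...kept, { role: "assistant", content: `Summary of earlier conversation: ${summary}` }];
}
```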
A quiet evolution
A few days ago, Anthropic demonstrated some of these same ideas in their Catan game demo, where Claude Sonnet 4.5 beat human opponents.
Some of that context-management logic has already made its way into their consumer app.
But truthfully, many agents in the ecosystem were already using similar techniques months earlier. The difference wasn't in the model itself, but in the engineering beneath it: deterministic, quiet, context-aware logic that knows what to remember, what to forget, and when to remind.
The bottom line
Context engineering isn't about bigger context windows or clever summaries. It's about keeping the model sane — trimming what doesn't matter, repeating what does, and protecting the parts that define who the agent is.
Do that well, and your agent stops losing itself halfway through a task. It starts behaving like the thing we've been chasing all along: a truly autonomous agent.
(See also: AI Agents Debug Spark Faster — where this engineering proved itself in the wild.)