{"slug":"context-compaction-safety","title":"Does compacting an agent's context put its safety rules at risk?","question":"Does compacting an agent's context put its safety rules at risk?","summary":"Context compaction is not just a lossy cost optimization: a 1,323-episode benchmark shows it can silently erase the governance constraints a long-running agent was given, and that pinning those constraints outside the compactible window brings violations back to 0%.","status":"active","cluster":"safety","cluster_label":"Safety and control","updated":"2026-07-01","audience":"strong-software-engineer","math_depth":"","sections":[{"heading":"Builder consequence","html":"<p>If your agent runs long enough to need context compaction (summarizing, evicting, or compressing older turns to stay under a token budget), the compactor is not a neutral cost optimization. It is a place where the rules you gave the agent up front (forbidden tools, approval gates, a user&#x27;s hard &quot;do not&quot;) can silently disappear. The agent will keep acting exactly as if nothing changed, because from its perspective nothing did: the constraint is simply no longer in its context.</p>"},{"heading":"Short answer","html":"<p>Context compaction can silently erase safety and governance constraints stated earlier in a long-running session, and this is not a rare edge case: across 1,323 episodes and seven model families, prohibited-action violation rises from 0% with the constraint in full context to 30% after ordinary compaction, and to as high as 59% for some models. When the constraint text survives the summary, violations stay at 0%; when it is dropped, they reach 38%. 
The fix isn&#x27;t &quot;compact less&quot;; it&#x27;s &quot;never let the compaction step touch the parts of context that carry hard rules.&quot;</p>"},{"heading":"Builder model","html":"<p>Split what lives in an agent&#x27;s context into two classes: content that can be safely lost and re-derived (task history, intermediate reasoning, prior tool outputs) and content that is load-bearing and cannot be recovered once lost (permissions, forbidden actions, hard constraints, approval gates). Ordinary summarization treats both classes the same way: it compresses for information density, with no notion of which sentence is a safety rule. Once a governance constraint gets paraphrased away or dropped for space, the agent isn&#x27;t disobeying a rule it still holds; it genuinely no longer has the rule in front of it. The prompt-injection threat model applies to your own compactor too: any step in the pipeline that rewrites context can remove instructions you rely on, whether by accident (a lossy summary) or on purpose (adversarial content steering the summarizer).</p>"},{"heading":"Mechanism","html":"<p>A long-horizon agent works within a token budget. To stay under it, agents typically evict old turns, replace them with a running summary, or combine the two. A summarization model or heuristic decides what to keep, and that decision usually optimizes for task continuity, not rule preservation. The Governance Decay study formalizes this with ConstraintRot, a benchmark of long-horizon agent scenarios with deterministic tool-call grading. It measures how often a stated policy constraint is honored after the surrounding context has been compacted. The result: violation is 0% when the constraint sits in full, uncompacted context and climbs to 30% after ordinary compaction (up to 59% depending on the model). Whether the constraint survives the summary is what separates the outcomes: when the summary retains the constraint&#x27;s wording, violation stays at 0%, but when the constraint is dropped, violation reaches 38%. 
The paper also demonstrates a Compaction-Eviction Attack: adversarial in-context content crafted to bias the summarizer toward omitting a legitimate policy. Optimized versions of this attack defeat every model the authors evaluate, which turns the compactor into an active attack surface rather than just a source of accidental loss. Their proposed fix, Constraint Pinning, is training-free: it quarantines governance constraints so the compaction step cannot touch them, and this alone restores violation to 0% in their benchmark.</p>\n<p>This mechanism is not specific to safety text. The same compaction step is where ordinary memory quality degrades too. The tiered architecture practitioners converge on (short-term working context, episodic history, long-term semantic store) moves information through a lossy transform at every tier (summarize, embed-and-retrieve, or forget). None of those transforms natively distinguishes &quot;detail that doesn&#x27;t matter anymore&quot; from &quot;detail the system depends on.&quot;</p>"},{"heading":"Evidence","html":"<ul><li>Benchmark/result-backed: Governance Decay / ConstraintRot measures constraint-violation rate across 1,323 episodes and seven model families: 0% with the policy in full context, 30% after ordinary compaction (up to 59% for some models), 0% when the constraint survives the summary, 38% when it&#x27;s dropped; an optimized Compaction-Eviction Attack defeats every evaluated model, and training-free Constraint Pinning restores violation to 0%.</li><li>Primary-doc-backed: LangChain&#x27;s practitioner guide frames agent memory as short-term (live context), episodic, and long-term/semantic tiers, and recommends a capture -&gt; analyze -&gt; update loop over trace data instead of dumping raw history into long-term memory.</li><li>Production field-report-backed: Elastic&#x27;s Atlas ships three memory categories on top of Elasticsearch, exposed to agents over MCP with per-user 
isolation, and reports a measured evaluation number (0.89 Recall@10 on a question-answering evaluation) rather than shipping the architecture as an unverified diagram. That is the same discipline this concept asks builders to apply to compaction specifically.</li><li>Editorial inference: treat any lossy transform in the memory pipeline as a place a safety-relevant fact can silently vanish, and test for it adversarially, not just on the happy path.</li></ul>"},{"heading":"How to apply","html":"<p>Identify the hard constraints your agent depends on (forbidden tools, approval gates, hard user &quot;do nots&quot;, compliance rules) and store them outside the compactible window, in a pinned system block that your summarization step is not allowed to rewrite or evict. After every compaction, re-inject the verbatim constraint text into the prompt rather than trusting the running summary to carry it forward.</p><p>Add a regression test that forces a compaction cycle mid-session, then attempts the prohibited action and asserts that the agent still refuses. The check is cheap, but it only catches the failure if you actually run it: governance decay is invisible until you specifically probe for it.</p><p>Treat the output of your compaction/summarization step as untrusted, in the same sense as an injected tool result. If an attacker can influence what enters context (a tool response, a retrieved document), assume they can try to bias the summarizer into dropping a constraint, and make sure the pinned region cannot be edited by anything the compactor reads. 
</p><p>When you evaluate any memory architecture (a tiered store, external retrieval, or a vendor-shipped memory service), require a measured evaluation number instead of accepting an unverified &quot;we added memory&quot; claim.</p>"},{"heading":"Failure modes","html":"<ul><li>Compaction as a black box: trusting a summarizer to preserve &quot;the important parts&quot; without testing whether governance-relevant text specifically survives.</li><li>Treating governance decay as rare: benchmark data says otherwise. Violation rates reach double digits under ordinary compaction, with no attacker involved.</li><li>No adversarial test: never running a Compaction-Eviction-style attack against your own pipeline, so the first adversarial constraint drop happens in production.</li><li>Same-tier assumption: managing safety rules and disposable task history with the same lossy pipeline instead of moving load-bearing content into a pinned, non-evictable region.</li><li>Shipping memory without an eval number: adding a memory layer (compaction, retrieval, or a vendor service) and calling it done without measuring whether it actually preserves what matters.</li></ul>"},{"heading":"Related","html":"<p>See <a href=\"/topic/context-compaction\">context compaction</a> for compaction techniques and their cost/latency trade-offs, <a href=\"/topic/agent-memory\">agent memory</a> for the broader tiered-memory architecture debate, and <a href=\"/topic/prompt-injection\">prompt injection</a> for the adjacent threat model, where untrusted content hijacks an agent&#x27;s behavior.</p>"}],"evidence":[{"id":"constraintrot-2026-governance-decay","kind":"benchmark-result","tier":"benchmark/result-backed","title":"Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents","note":"ConstraintRot benchmark, 1,323 episodes across seven model families: prohibited-action violation is 0% with the policy in full context, rises to 30% after ordinary compaction (up 
to 59% for some models), stays 0% when the constraint survives the summary, and reaches 38% when it is dropped. A Compaction-Eviction Attack (adversarial content that biases the summarizer to drop the policy) defeats every evaluated model; the paper's training-free Constraint Pinning mitigation restores violation to 0%.","url":"http://arxiv.org/abs/2606.22528v1"},{"id":"langchain-2026-agent-memory-guide","kind":"primary-doc","tier":"primary-doc-backed","title":"How to Build Memory into AI Agents","note":"Frames agent memory as short-term (live context), episodic, and long-term/semantic tiers, and recommends a capture-traces -> analyze -> selectively-update loop over long-term memory rather than dumping raw history into it.","url":"https://www.langchain.com/blog/how-to-give-your-agent-memory"},{"id":"elastic-atlas-2026-cognitive-memory","kind":"production-field-report","tier":"production field-report-backed","title":"Elastic Open-Sources Atlas Agent Memory Based on Cognitive Science","note":"Elastic's Atlas ships three memory categories on Elasticsearch, exposed to agents over MCP with per-user isolation, and reports 0.89 Recall@10 on a question-answering evaluation rather than shipping the architecture as an unverified diagram.","url":"https://www.infoq.com/news/2026/06/elastic-atlas-agent-memory/"},{"id":"context-compaction-safety-editorial-synthesis","kind":"editorial-inference","tier":"editorial inference","title":"LLM Digest synthesis","note":"For agent builders, any lossy transform in the memory pipeline (compaction, retrieval, forgetting) is a place a safety-relevant fact can silently vanish, and it needs the same adversarial testing discipline as prompt injection, not just a happy-path check."}],"related_topics":[{"slug":"context-compaction","title":"Context compaction: summarize, compress, and curate the working set"},{"slug":"agent-memory","title":"Agents forget across steps and sessions"},{"slug":"prompt-injection","title":"Untrusted input and tools can 
hijack an agent"}],"related_playbook_cards":["pb-pin-governance-constraints-past-compaction","pb-close-the-trace-to-memory-loop"],"related_storylines":[],"covers_evidence":["constraintrot-2026-governance-decay","langchain-2026-agent-memory-guide","elastic-atlas-2026-cognitive-memory","context-compaction-safety-editorial-synthesis"]}