LLM Digest

Agent Builder Foundations

Evidence-tiered mechanisms for agent builders

View as JSON

Agent foundations

Does compacting an agent's context put its safety rules at risk?

Concept·Safety and control·4 evidence tiers·updated 2026-07-01

If your agent runs long enough to need context compaction — summarizing, evicting, or compressing older turns to stay under a token budget — the compactor is not a neutral cost optimization. It is a place where the rules you gave the agent up front (forbidden tools, approval gates, a user's hard "do not") can silently disappear. The agent will keep acting exactly as if nothing changed, because from its perspective nothing did: the constraint is simply no longer in what it can see.

Short answer

Context compaction can silently erase safety and governance constraints stated earlier in a long-running session, and this is not a rare edge case: across 1,323 episodes and seven model families, prohibited-action violation rises from 0% with the constraint in full context to 30% after ordinary compaction, and as high as 59% for some models. When the constraint text survives the summary, violations stay at 0%. The fix isn't "compact less" — it's "never let the compaction step touch the parts of context that carry hard rules."

Builder model

Split what lives in an agent's context into two classes: content that can be safely lost and re-derived (task history, intermediate reasoning, prior tool outputs) and content that is load-bearing and irreversible if lost (permissions, forbidden actions, hard constraints, approval gates). Ordinary summarization treats both classes the same way — it compresses for information density, not for which sentence is a safety rule. Once a governance constraint gets paraphrased away or dropped for space, the agent isn't disobeying a rule it still holds; it genuinely no longer has the rule in front of it. The same threat model as prompt injection applies to your own compactor: an untrusted or adversarial step in the pipeline can remove instructions you rely on, whether by accident or on purpose.

Mechanism

A long-horizon agent keeps a token budget. To stay under it, agents typically evict old turns, replace them with a running summary, or roll both together, with a summarization model or heuristic deciding what to keep — usually optimizing for task continuity, not rule preservation. The Governance Decay study formalizes this with ConstraintRot, a benchmark of long-horizon agent scenarios with deterministic tool-call grading, and measures how often a stated policy constraint is honored after the surrounding context has been compacted. The result: violation is 0% when the constraint sits in full, uncompacted context; it climbs to 30% after ordinary compaction, up to 59% depending on the model; when the compacted summary happens to retain the constraint's wording violation stays 0%, but when it's dropped violation reaches 38%. The paper also demonstrates a Compaction-Eviction Attack: adversarial in-context content specifically crafted to bias the summarizer toward omitting a legitimate policy, and optimized versions of this attack defeat every model they evaluate — turning the compactor into an active adversarial target, not just a source of accidental loss. Their proposed fix, Constraint Pinning, is training-free: it quarantines governance constraints so they are excluded from whatever the compaction step is allowed to touch, and this alone restores violation to 0% in their benchmark.

This mechanism generalizes beyond safety text specifically. The same compaction step is where ordinary memory quality degrades too — the tiered architecture practitioners converge on (short-term working context, episodic history, long-term semantic store) moves information through a lossy transform at every tier (summarize, embed-and-retrieve, or forget), and none of those transforms natively distinguish "detail that doesn't matter anymore" from "detail the system depends on."

Evidence

Benchmark/result-backed: Governance Decay / ConstraintRot measures constraint-violation rate across 1,323 episodes and seven model families: 0% with the policy in full context, 30% after ordinary compaction (up to 59% for some models), 0% when the constraint survives the summary, 38% when it's dropped; a Compaction-Eviction Attack defeats every evaluated model, and training-free Constraint Pinning restores violation to 0%.
Primary-doc-backed: LangChain's practitioner guide frames agent memory as short-term (live context), episodic, and long-term/semantic tiers, and recommends a capture -> analyze -> update loop over trace data instead of dumping raw history into long-term memory.
Production field-report-backed: Elastic's Atlas ships three memory categories on top of Elasticsearch, exposed to agents over MCP with per-user isolation, and reports a measured evaluation number (0.89 Recall@10) rather than shipping the architecture as an unverified diagram — the same discipline this concept asks builders to apply to compaction specifically.
Editorial inference: treat any lossy transform in the memory pipeline as a place a safety-relevant fact can silently vanish, and test for it adversarially, not just on the happy path.

How to apply

Identify every hard constraint your agent depends on (forbidden tools, approval gates, hard user "do nots", compliance rules) and store them outside the compactible window — in a pinned system block your summarization step is not allowed to rewrite or evict — then re-inject the verbatim text into every post-compaction prompt rather than trusting the running summary to carry it forward. Add a regression test that forces a compaction cycle mid-session and then attempts the prohibited action, asserting the agent still refuses; this check is cheap and training-free, but only catches the failure if you actually run it, since governance decay is invisible until you specifically probe for it. Treat your compaction/summarization component as untrusted input in the same sense as an injected tool result: if an attacker can influence what enters context (a tool response, a retrieved document), assume they can try to bias the summarizer into dropping a constraint, and make sure the pinned region cannot be edited by anything the compactor reads. When you evaluate any memory architecture — a tiered store, external retrieval, or a vendor-shipped memory service — require a measured evaluation number instead of accepting an unverified "we added memory" claim.

Failure modes

Compaction as a black box: trusting a summarizer to preserve "the important parts" without testing whether governance-relevant text specifically survives.
Treating governance decay as rare: benchmark data says otherwise — violation rates hit double digits under ordinary compaction, not just adversarial conditions.
No adversarial test: never running a Compaction-Eviction-style attack against your own pipeline, so the first adversarial constraint drop happens in production.
Same-tier assumption: managing safety rules and disposable task history with the same lossy pipeline instead of splitting load-bearing content into a pinned, non-evictable region.
Shipping memory without an eval number: adding a memory layer (compaction, retrieval, or a vendor service) and calling it done without measuring whether it actually preserves what matters.

See context compaction for compaction techniques and their cost/latency trade-offs, agent memory for the broader tiered-memory architecture debate, and prompt injection for the adjacent threat model where untrusted content hijacks what an agent trusts.

Evidence · 4 sources

BenchmarkGovernance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agentsbenchmark/result-backed
ConstraintRot benchmark, 1,323 episodes across seven model families: prohibited-action violation is 0% with the policy in full context, rises to 30% after ordinary compaction (up to 59% for some models), stays 0% when the constraint survives the summary, and reaches 38% when it is dropped. A Compaction-Eviction Attack (adversarial content that biases the summarizer to drop the policy) defeats every evaluated model; the paper's training-free Constraint Pinning mitigation restores violation to 0%.
Primary docHow to Build Memory into AI Agentsprimary-doc-backed
Frames agent memory as short-term (live context), episodic, and long-term/semantic tiers, and recommends a capture-traces -> analyze -> selectively-update loop over long-term memory rather than dumping raw history into it.
Field reportElastic Open-Sources Atlas Agent Memory Based on Cognitive Scienceproduction field-report-backed
Elastic's Atlas ships three memory categories on Elasticsearch, exposed to agents over MCP with per-user isolation, and reports 0.89 Recall@10 on a question-answering evaluation rather than shipping the architecture as an unverified diagram.
EditorialLLM Digest synthesiseditorial inference
For agent builders, any lossy transform in the memory pipeline (compaction, retrieval, forgetting) is a place a safety-relevant fact can silently vanish, and it needs the same adversarial testing discipline as prompt injection, not just a happy-path check.