๐ ๏ธ Solution ยท 3 sources
Context compaction: summarize, compress, and curate the working set
TL;DR
Keep memory *inside* the context window but small: summarize old turns, compress history, and deliberately curate what stays in-context each step ("context engineering"). The agent forgets less because the working set is chosen, not just truncated.
State of the art
"Context engineering and memory management" has emerged as a discipline of its own โ treating the prompt as a managed working set rather than an append-only log. Techniques range from rolling summarization to LLM-guided compression of long-term memory (MemRefine) and memory systems that explicitly model association, forgetting, and synthesis rather than storing everything. Compaction is increasingly paired with an external store: compress the working set, offload the rest to a vector/graph KB, and rehydrate on demand.
What's new
Compression is getting smarter than naive summarization โ LLM-guided methods (MemRefine) and forgetting/synthesis-aware memory systems aim to preserve signal density rather than just shrink token count.
Trade-offs
Cheap on infra (no external store) and keeps everything the model needs in one place, but summarization is lossy and irreversible โ a detail dropped early can't be recovered later, and aggressive compaction can quietly degrade task fidelity. Best for single-session, long-horizon tasks where recency dominates and the full history isn't needed verbatim.
Why it matters for platform engineers
Often the highest-leverage first move: it directly attacks token cost and latency (the bill scales with context size) without standing up new infrastructure. The risk is silent quality loss, so it needs evaluation โ which makes it a tuning knob, not a set-and-forget fix.