🧱 Obstacle · security · 3 sources
Untrusted input and tools can hijack an agent
TL;DR
An agent treats whatever it reads — a web page, a tool result, a file, another agent's message — as instructions it might follow. Prompt injection turns that into an attack: hidden text redirects the agent to exfiltrate data, misuse its tools, or escalate privileges. Because the agent has real credentials and can act, a successful injection is not a bad answer — it's an unauthorized action.
State of the art
There is no clean fix, only layered mitigation, and each layer has known holes. Guardrail models that screen inputs and outputs are the common defense, but recent work shows the very reasoning that makes them effective also makes them a target — "From Shield to Target" demonstrates denial-of-service attacks that weaponize a guardrail against the agent it protects. Sandboxing is necessary but not sufficient: a coding-agent sandbox confines code execution yet does nothing about credential authorization — the agent inside the sandbox still holds tokens that injected instructions can abuse. The threat compounds in multi-agent systems, where one compromised agent's output is another agent's trusted input; new benchmarks (Deep-XPIA) are emerging specifically to measure cross-agent (indirect) prompt-injection exposure. The durable lesson is least privilege: scope what the agent can touch so that a hijack has a small blast radius.
What's new
Defenses are being shown to be brittle from two directions at once: guardrail models can be turned into a DoS vector, and sandboxing is reframed as not solving credential authorization at all — pushing the emphasis from "filter the prompt" toward scoping permissions and measuring multi-agent injection exposure directly.
Why it matters for platform engineers
This is the security boundary of the whole agent stack, and it maps to ordinary ops controls done right: scoped credentials, per-tool authorization, network egress limits, and human approval on high-impact actions. The mistake is treating a sandbox or a guardrail model as the answer; both are layers, and both have published bypasses. Every tool you connect (see tool use) widens the attack surface, so authorization and blast-radius limits — not prompt hygiene alone — are the real control.
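As an added example (not code from any of the three sources this card summarises), here is a per-tool authorization gate that implements the controls named in "State of the art" and above: a scoped tool set, a network egress allowlist, and human approval for high-impact actions, run over a constructed set of tool calls that an injected document might cause an agent to propose. The tool names, hosts and the approval hook are assumptions.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlparse


@dataclass
class ToolPolicy:
    allowed_tools: set[str]                   # tools this agent may call at all
    egress_allowlist: set[str]                # hosts network tools may reach
    high_impact: set[str]                     # tools that need human approval
    # approval hook (assumed); default denies, so high-impact calls never run unattended
    approver: Callable[[str, dict], bool] = lambda tool, args: False


def authorize(policy: ToolPolicy, tool: str, args: dict) -> str:
    """Decide one proposed tool call: allow, deny:<reason>, or needs_approval."""
    if tool not in policy.allowed_tools:          # scoped tool set
        return "deny:out_of_scope"
    if tool == "http_get":                         # network egress limit
        host = urlparse(args.get("url", "")).hostname
        if host not in policy.egress_allowlist:
            return "deny:egress"
    if tool in policy.high_impact and not policy.approver(tool, args):
        return "needs_approval"                    # human in the loop
    return "allow"


def dispatch(policy: ToolPolicy, calls: list[tuple[str, dict]]) -> list[str]:
    """Gate every call the agent proposes; a tool only runs on 'allow'."""
    return [f"{tool}:{authorize(policy, tool, args)}" for tool, args in calls]


# Constructed scenario: after reading a document with injected text the agent
# proposes four calls. Only the in-scope file read passes the gate.
POLICY = ToolPolicy(
    allowed_tools={"read_file", "http_get", "send_email"},
    egress_allowlist={"api.internal.example"},
    high_impact={"send_email"},
)
PROPOSED_CALLS = [
    ("read_file", {"path": "/workspace/report.md"}),
    ("http_get", {"url": "https://collector.example.net/?d=..."}),
    ("send_email", {"to": "outside@example.net", "body": "..."}),
    ("shell", {"cmd": "env"}),
]


def run_scenario() -> list[str]:
    return dispatch(POLICY, PROPOSED_CALLS)
```

The gate decides per call rather than per prompt, so it holds even when the injected text itself was never detected; the blast radius is bounded by the policy, not by prompt hygiene.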