Coordinating multiple agents adds more failure than capability — agent engineering

State of the art

The conversation is shifting from "more agents is better" to characterizing *when* multi-agent actually helps, and the recurring answer is that the communication structure dominates the agent count. DPBench studies the structural determinants of multi-agent LLM coordination directly — which topologies and role assignments make collaboration pay off versus add noise.

Cost is the second axis: Stanford's DeLM reports cutting multi-agent task cost by roughly half by *removing the central orchestrator*, evidence that a single coordinating agent is both a token bottleneck and a single point of failure.

Capacity allocation across roles is a third, less-asked variable: a study of hierarchical search agents factors the job into a delegation role (task decomposition), an execution role (retrieval and evidence extraction), and a fixed generation role, then varies model capacity per role to find where it actually matters. The result complicates "just add more agents" further — capacity isn't interchangeable between roles, so the same topology can win or lose depending on *which* role gets the bigger model, not just how many agents are in the mesh.

A fourth lever targets the assignment mechanism itself rather than the topology or the per-role capacity: Agora replaces the coarse-grained matching a main agent typically uses to route sub-tasks to expert models and tools with an auction, in which each candidate bids on a task based on its own confidence and cost and the highest bidder gets the work. That reframes "which agent handles this" as a market-clearing problem rather than a fixed routing table.
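A minimal sketch of auction-style assignment, to make the mechanism concrete. The source only says that candidates bid from their own confidence and cost and that the highest bid wins; the scoring rule (confidence minus a weighted cost penalty), the `Bid` type, and the agent names are all assumptions for illustration, not Agora's actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent: str         # hypothetical agent identifier
    confidence: float  # self-reported chance of success, 0.0 to 1.0
    cost: float        # estimated cost of doing the task, arbitrary units


def bid_value(bid: Bid, cost_weight: float = 0.1) -> float:
    """Assumed scoring rule: confidence minus a weighted cost penalty."""
    return bid.confidence - cost_weight * bid.cost


def run_auction(bids: list[Bid], cost_weight: float = 0.1) -> Bid:
    """Return the highest-scoring bid; ties go to the cheaper agent."""
    if not bids:
        raise ValueError("no bids submitted")
    return max(bids, key=lambda b: (bid_value(b, cost_weight), -b.cost))


bids = [
    Bid("large-model", confidence=0.90, cost=4.0),
    Bid("small-model", confidence=0.80, cost=1.0),
    Bid("search-tool", confidence=0.40, cost=0.2),
]
winner = run_auction(bids)
print(winner.agent)  # small-model
```

With the cost weight set to zero the same auction picks `large-model`, which shows how much the outcome depends on the assumed trade-off between confidence and cost.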

Orchestration itself is becoming dynamic rather than hand-wired: Anthropic's writeup on Claude Code's Dynamic Workflows describes generating a custom execution harness per task to coordinate sub-agents instead of committing to one fixed shape. The sharper version of that move is orchestrating sub-agents with code rather than tool calls. LangChain's dynamic subagents in Deep Agents drive fan-out and coordination from a program, so coverage is *guaranteed* by control flow rather than depending on the model to emit one tool call per worker. The coordination layer becomes ordinary (testable, deterministic) code around non-deterministic agents.
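A hedged sketch of code-driven fan-out using only the Python standard library. It does not use the Deep Agents or Claude Code APIs; `run_subagent` is a hypothetical stand-in for a real sub-agent call, and the file names are invented. The point is only that the loop, not the model, decides that every task gets a worker.

```python
import asyncio


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a real sub-agent call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"done:{task}"


async def fan_out(tasks: list[str], max_parallel: int = 4) -> dict[str, str]:
    """Run one sub-agent per task, at most max_parallel at a time."""
    limiter = asyncio.Semaphore(max_parallel)

    async def guarded(task: str) -> str:
        async with limiter:
            return await run_subagent(task)

    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(guarded(t) for t in tasks))
    return dict(zip(tasks, results))


files = ["auth.py", "billing.py", "search.py"]  # assumed unique task names
report = asyncio.run(fan_out(files))
print(sorted(report))
```

Because coverage is a property of the `for` expression over `tasks`, it can be checked with an ordinary unit test, which is not possible when dispatch depends on the model emitting the right number of tool calls.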

The flip side of caring about communication structure is that the structure is also an attack surface: the "Linguistic Firewall" work treats routing in a multi-agent system as a geometry problem and defends it, because a compromised or adversarial agent in the mesh can steer the others — so robust handoffs are a security property, not just a quality one.

Meanwhile practitioners are still hunting for frameworks where *heterogeneous* models genuinely collaborate (route refactors to one model, codegen to another), which is really a routing-and-handoff problem, not a model problem — and that hunt is now materializing as shipping tooling:

Coding agents with built-in multi-model orchestration (Kimchi routes a terminal coding agent across models)
Visual orchestration UIs that let you wire sub-agents by hand for Claude Code (rondoflow)
Transparency-first multi-agent tools (OpenOrb) that surface what each agent did

The common thread is that the hard, load-bearing work has moved out of the agents and into the *routing, wiring, and visibility* layer between them.

A sharper version of the "is it worth it" question is now visible at both ends: Sakana's Fugu *collapses* a multi-agent system into a single distilled model — trading the coordination layer away entirely once the division of labor is known — while practitioners building orchestration libraries report that the real engineering is mundane plumbing (workspaces, runtimes, directory layout for sub-agents) rather than clever agent roles.

The durable lesson: who talks to whom, in what format, and under whose control is the dominant variable — and sometimes the cheapest topology is no topology at all.

A newer thread ties coordination quality directly to uncertainty: UA-ChatDev has role-based software-development agents track and act on their own confidence, so a low-confidence step triggers deliberation or hand-off rather than confidently propagating a mistake to the next role — coordination reliability as a function of agents knowing what they don't know, not just of topology.
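The confidence gate described above can be sketched as a small routing decision. This is an illustration under stated assumptions, not UA-ChatDev's implementation: the `StepResult` type, the action names, and the 0.7 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    role: str          # e.g. "designer" or "coder" (hypothetical role names)
    output: str
    confidence: float  # the agent's own estimate, 0.0 to 1.0


def next_action(result: StepResult, threshold: float = 0.7) -> str:
    """Pass confident output on; send low-confidence output to deliberation."""
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    return "pass_on" if result.confidence >= threshold else "deliberate"


print(next_action(StepResult("coder", "patch v1", confidence=0.92)))
print(next_action(StepResult("coder", "patch v2", confidence=0.41)))
```

The gate is only as good as the confidence estimate feeding it, so in practice the threshold would need calibrating against observed error rates.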

When multiple *coding* agents work the same repo concurrently, the coordination problem becomes concrete conflict avoidance rather than abstract topology: one practitioner pattern gives each agent (Claude, Codex) its own git branch and its own sandboxed worktree so "no two agents ever touch the same branch, and no agent can reach another's files," then runs work in frozen, read-only-reviewable rounds and replays each candidate in a clean box with a neutral verifier before merging — passing tests first, smallest diff second. It's a concrete instance of the durable lesson above: isolation plus a control-flow gate, not smarter agents, is what keeps parallel coding agents from clobbering each other's work.
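The merge gate in that pattern ("passing tests first, smallest diff second") can be written as a few lines of ordinary code. This sketch assumes the neutral verifier has already replayed each candidate and recorded the outcome; the `Candidate` fields and branch names are hypothetical, and the isolation itself (separate branches and worktrees) happens outside this function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    agent: str
    branch: str         # each agent is expected to own exactly one branch
    tests_passed: bool  # result reported by the neutral verifier
    diff_lines: int     # size of the change, in changed lines


def pick_candidate(candidates: list[Candidate]) -> Optional[Candidate]:
    """Choose the passing candidate with the smallest diff, or None."""
    branches = [c.branch for c in candidates]
    if len(set(branches)) != len(branches):
        raise ValueError("two candidates share a branch; isolation is broken")
    passing = [c for c in candidates if c.tests_passed]
    if not passing:
        return None
    return min(passing, key=lambda c: c.diff_lines)


round_one = [
    Candidate("claude", "agent/claude", tests_passed=True, diff_lines=120),
    Candidate("codex", "agent/codex", tests_passed=True, diff_lines=45),
    Candidate("other", "agent/other", tests_passed=False, diff_lines=10),
]
chosen = pick_candidate(round_one)
print(chosen.agent if chosen else "no candidate passed")  # codex
```

The failing candidate has the smallest diff but is never considered, which is the ordering the pattern asks for.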

Code-driven orchestration is also generalizing across providers: Omegacode composes agent()/parallel()/pipeline()/phase() calls in plain JavaScript, and each agent() call can spawn a Codex, Claude Code, OpenCode, or pi agent from the same workflow file — so patterns like adversarial code review or a bake-off between models are one script instead of one integration per provider. That widens the earlier code-driven-fan-out move (LangChain's dynamic subagents) from guaranteeing coverage inside a single framework to letting the same coordination script mix heterogeneous agents, which is the "route refactors to one model, codegen to another" capability practitioners were still hunting for above. A second cross-provider SDK makes the same move from the Python side: h5i-python defines and executes multi-agent coding workflows across Claude Code, Codex, and other runtimes as ordinary Python programs, the same "coordination is portable code, not a per-provider integration" thesis Omegacode ships in JavaScript.

The "conflict resolution between agents" problem is getting a named pattern: an arbiter role that settles disagreement between a planning agent and a coding agent by checking the code against the plan directly, rather than trusting either agent's self-report — which only works if the plan was specified in enough detail for the arbiter to actually verify against it. The same practitioner framing packages parallel testing, review, and context-retrieval agents plus that arbiter as a governance layer (distinct credentials per agent role, visible communication over human-readable channels like GitHub or chat rather than hidden logs) — the coordination-plus-oversight bundle that turns ad hoc multi-agent use into something a platform team can run safely.
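One way to picture an arbiter that checks the code against the plan, rather than trusting either agent's self-report, is a mechanical comparison. The sketch below assumes, purely for illustration, that the plan has been reduced to a set of required function names; real plans carry far more detail, and nothing here is taken from the practitioner's actual tooling.

```python
import ast


def arbitrate(planned_functions: set[str], source: str) -> dict[str, list[str]]:
    """Compare the functions a plan requires with those the code defines."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return {
        "missing": sorted(planned_functions - defined),    # planned, not written
        "unplanned": sorted(defined - planned_functions),  # written, not planned
    }


plan = {"load_orders", "save_orders"}  # hypothetical plan contents
code = "def load_orders():\n    pass\n\ndef helper():\n    pass\n"
verdict = arbitrate(plan, code)
print(verdict)
```

The check is only possible because the plan names specific functions, which mirrors the point above: an arbiter can verify only what the plan specified in enough detail.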

At the tooling-consolidation end, low-code orchestration platforms are folding the agent loop *into* the workflow engine rather than treating agents and workflows as separate layers: one open-source platform embeds a full agent loop (model call, tool invocation, observation, next-step decision) as a drag-and-drop step that can itself trigger or be triggered by ordinary workflow steps, sharing one audit trail across agent decisions, tool calls, and human approvals — a concrete instance of the durable "put the coordination in ordinary code" lesson, expressed as a visual builder instead of a script.

A production case study puts hard numbers behind the standing "is it worth it" question: a multi-agent A2A+MCP architecture deployed in a live 5G-core security operations center cut mean time to detect and respond by 40% and compressed the human review work by 12x. That is concrete evidence that the coordination overhead this page tracks can pay for itself at production scale, not just in a benchmark.

A practitioner guide sharpens the "when does the topology matter" question from the framework side: a LangGraph field guide positions the framework by workflow-complexity fit rather than as a universal default. It walks through three recipes (SQL analytics with repair loops, RAG with evidence gating, human-in-the-loop policy review with interrupt/checkpoint recovery) that make routing, pauses, and audit trails explicit product behavior, and it names plain ReAct-style loops, schema-first tools, and DSPy as better fits for simpler jobs.

Named enterprise deployments now span industries beyond that one security-ops showcase. Jefferies, an investment bank, built a production trade assistant for front-office trading on Strands Agents (an open agent-harness SDK for building agents that reason, plan, and act by orchestrating calls to foundation models and tools), paired with Amazon Bedrock, Amazon Bedrock Knowledge Bases, and MCP for connecting to trading data sources and tools through one interface. Apollo's GTM AI Assistant runs the same pattern in a different vertical: prospecting, enrichment, outreach, and analytics on "Deep Agents" plus LangSmith, with MCP integrations of its own. These are two company-specific multi-agent systems, one in regulated finance and one in sales/GTM, each replacing a single-model assistant, so the case no longer rests on one framework or one industry.

A practitioner-scale trial adds a concrete before-the-org-commits data point to the same "does it pay off" question: a CTO's own orchestration-first publishing project — 25 agents and tools, 30 agent skills, 12 MCP/A2A-native services, processing 26 billion tokens across 318 PRs and 423 commits — was run solo, deliberately, before asking the wider engineering organization to build this way. It's a smaller-scale, individual-scoping counterpart to the Jefferies/Apollo production deployments above: proving the pattern works for one builder first, rather than committing a team to it up front.

A fourth industry joins the named-deployment roster above: an AWS reference architecture for market surveillance pairs LangGraph for workflow orchestration with Strands for agent reasoning on Amazon Bedrock AgentCore, adding checkpoint-based recovery plus AgentCore's own memory and observability primitives to the state-driven side of the "does the coordination overhead pay for itself" evidence — capital-markets surveillance alongside the existing security-ops, trading, and sales/GTM deployments.

A controlled benchmark puts a number behind "sometimes the cheapest topology is no topology at all": on local, open-weight language models, a two-call self-refinement loop beats a five-agent structured pipeline (Parishad) on the same tasks. That is evidence the coordination tax this page tracks is not an artifact of frontier-model pricing; it shows up just as sharply when the extra calls are not billed at enterprise API rates.
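For reference, a two-call self-refinement loop is small enough to sketch in full. This is a generic illustration, not the benchmark's code: the prompt wording is invented, and `stub_model` is a hypothetical stand-in for a local model client so the example runs without any model installed.

```python
from typing import Callable


def self_refine(task: str, model: Callable[[str], str]) -> str:
    """Call 1 drafts an answer; call 2 critiques and revises that draft."""
    draft = model(f"Solve the task.\nTask: {task}")
    return model(
        f"Task: {task}\nDraft: {draft}\n"
        "Critique the draft and return a corrected final answer."
    )


calls: list[str] = []  # records every prompt sent to the stub


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    calls.append(prompt)
    return f"answer-{len(calls)}"


final = self_refine("sum 2 and 3", stub_model)
print(final, len(calls))  # answer-2 2
```

The whole "topology" is two sequential calls sharing one prompt string, which is the baseline any multi-agent pipeline has to beat.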

What's new

A controlled benchmark on local, open-weight language models sharpens this page's standing "more agents isn't automatically better" finding into a specific comparison: a two-call self-refinement loop beats a five-agent structured pipeline (Parishad) on the same tasks, evidence the coordination tax tracked here doesn't require frontier-model pricing to show up.

Why it matters for platform engineers

Every extra agent is extra tokens, extra latency, and extra failure surface, so a multi-agent design has to clear a hard bar: beat a single well-prompted agent on cost *and* reliability, and it often doesn't. The engineering job is choosing a topology (orchestrator-worker vs. decentralized), writing strict handoff contracts so one agent's output is safely another's input, and budgeting the communication overhead up front. Crucially, the design needs an eval (see agent benchmarks) that proves the extra agents paid for themselves, because the default failure mode is paying N× the cost for a result a single agent could have produced.
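The bar described above can be expressed as an explicit check over eval results. This sketch makes two assumptions that are not from the source: reliability is measured as success rate over a fixed task set, and cost is compared per successful task rather than per run. The numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str
    runs: int
    successes: int
    total_cost: float  # total spend across all runs, arbitrary units

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs

    @property
    def cost_per_success(self) -> float:
        # A system that never succeeds has unbounded cost per success.
        return self.total_cost / self.successes if self.successes else float("inf")


def multi_agent_pays_off(single: EvalResult, multi: EvalResult) -> bool:
    """True only if the multi-agent design wins on reliability and on cost."""
    return (
        multi.success_rate > single.success_rate
        and multi.cost_per_success < single.cost_per_success
    )


single = EvalResult("single-agent", runs=100, successes=70, total_cost=10.0)
pricey = EvalResult("five-agent", runs=100, successes=80, total_cost=30.0)
lean = EvalResult("two-agent", runs=100, successes=95, total_cost=12.0)

print(multi_agent_pays_off(single, pricey))  # False: more reliable, but costlier per success
print(multi_agent_pays_off(single, lean))    # True: wins on both axes
```

Requiring both conditions is deliberately strict; a team might accept a higher cost for a large reliability gain, but that trade should be a stated decision rather than a default.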

Evidence · 26 sources