🧠 Agent Engineering Wiki

🛠️ Solution · 3 sources


LLM-as-judge: model-graded evaluation of traces and outputs

TL;DR

Use a model to grade a model: give an LLM the agent's output (or its full trace) plus a rubric, and have it return a structured verdict. This scales judgment to volumes human raters cannot keep up with, such as every production trace and every CI run, and it is the practical backbone of agent evaluation when answers are open-ended.
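As a minimal sketch of the pattern, assuming a hypothetical `call_model` function that takes a prompt string and returns the judge's raw text (any provider client could sit behind it), the rubric, the `Verdict` fields, and the stubbed `fake_model` below are illustrative rather than taken from any particular framework:

```python
import json
from dataclasses import dataclass
from typing import Callable

# Illustrative rubric; a real one would be specific to the task being graded.
RUBRIC = """Grade the answer against the question.
- correct: the answer is factually right
- complete: it addresses every part of the question
Return only JSON: {"passed": bool, "score": 1-5, "reason": str}"""


@dataclass
class Verdict:
    passed: bool
    score: int
    reason: str


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model for a structured verdict and parse it defensively."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"
    raw = call_model(prompt)
    try:
        data = json.loads(raw)
        return Verdict(bool(data["passed"]), int(data["score"]), str(data["reason"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fail closed: a malformed verdict counts as a failure, not a pass.
        return Verdict(False, 0, f"unparseable judge output: {raw[:80]}")


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return '{"passed": true, "score": 4, "reason": "Correct but terse."}'


verdict = judge("What is the capital of France?", "Paris.", fake_model)
print(verdict)
```

Parsing into a typed verdict, and failing closed when the judge returns something unparseable, keeps downstream aggregation from silently counting garbage as a pass.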

State of the art

The pattern is maturing from "ask GPT to rate 1–5" toward structured, trajectory-level judging: detectors that read a trace and emit categorized failures with confidence scores and causal chains (AWS's Strands Evals) rather than a single scalar. The headline cost problem is that running a frontier judge over every trace is expensive. Teams are attacking it by fine-tuning small open judges on production traces: LangChain and Fireworks report matching frontier-judge quality at roughly 1/100th the cost by mining perceived-error signals from real traffic. A related frontier is judging *before* deployment: OpenAI's deployment simulation predicts model behavior on real conversation data pre-release, using model-graded simulation as a forecasting tool rather than a post-hoc check.
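To make "categorized failures with confidence scores and causal chains" concrete, here is one possible shape for a trajectory-level verdict. This is a hypothetical schema for illustration, not the Strands Evals data model; the `Finding` fields, the category names, and the `root_causes` helper are all assumptions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    category: str      # hypothetical label, e.g. "bad_tool_args"
    step: int          # index of the trace step where the failure shows up
    confidence: float  # judge's self-reported confidence in [0, 1]
    caused_by: List[int] = field(default_factory=list)  # earlier flagged steps


def root_causes(findings: List[Finding], min_confidence: float = 0.5) -> List[Finding]:
    """Keep confident findings that are not explained by another kept finding."""
    kept = [f for f in findings if f.confidence >= min_confidence]
    flagged_steps = {f.step for f in kept}
    return [f for f in kept if not any(s in flagged_steps for s in f.caused_by)]


findings = [
    Finding("bad_tool_args", step=2, confidence=0.9),
    Finding("wrong_final_answer", step=5, confidence=0.8, caused_by=[2]),
    Finding("style_issue", step=3, confidence=0.2),
]
roots = root_causes(findings)
print([f.category for f in roots])
```

Compared with a single 1–5 score, a list like this tells you which step to fix first: the wrong final answer at step 5 is downstream of the bad tool arguments at step 2, so only the latter is reported as a root cause.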

What's new

Cheap, specialized judges: instead of paying frontier-model rates per trace, teams fine-tune a small open model on their own production traces and recover near-frontier judging quality — turning LLM-as-judge from a sampling luxury into something you can run on the whole stream.

Trade-offs

The judge is itself a non-deterministic model: it has biases (verbosity, position, self-preference), can be gamed, and needs its *own* validation against human labels, or it just launders noise. Cheap fine-tuned judges narrow the cost gap but can overfit to the trace distribution they were trained on and miss novel failure modes. The approach works best when paired with a rubric and a held-out human-labeled set, and when you care about explanations (which step failed) rather than a single opaque score.
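One common way to validate a judge against a held-out human-labeled set is chance-corrected agreement. The sketch below computes Cohen's kappa between human and judge labels; the label values and the tiny sample are made up for illustration, and what counts as an acceptable kappa is a judgment call for each team:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both raters labeled independently at their own rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohen_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")  # kappa = 0.50
```

Raw accuracy here is 0.75, but kappa is 0.50 because a judge that leans toward "pass" would agree with the humans fairly often by chance alone. Re-running this check after every judge or rubric change is one way to treat the judge as a monitored dependency.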

Why it matters for platform engineers

This is what makes continuous agent eval affordable: a judge you can run in CI and on live traffic to catch regressions a model upgrade or prompt change introduces. The cost knob (frontier vs. fine-tuned local judge) is a real budget decision, and the judge becomes a dependency you must monitor and re-validate like any other piece of infra. Pairs with agent benchmarks for the fixed-task side of evaluation.
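A CI gate built on judge verdicts can be as simple as comparing the current pass rate with a stored baseline. The following is a sketch under assumptions: the `regression_gate` name, the 5-point tolerance, and the sample verdicts are hypothetical, and a real pipeline would likely also account for judge noise and sample size:

```python
from typing import Sequence, Tuple


def regression_gate(
    baseline_pass_rate: float,
    verdicts: Sequence[bool],
    max_drop: float = 0.05,
) -> Tuple[bool, float]:
    """Return (ok, pass_rate); ok is False if the pass rate fell more than max_drop."""
    if not verdicts:
        raise ValueError("no judge verdicts to evaluate")
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate >= baseline_pass_rate - max_drop, pass_rate


# Judge verdicts for the same eval set after a prompt change (made-up data).
after_change = [True] * 8 + [False] * 2
ok, rate = regression_gate(baseline_pass_rate=0.90, verdicts=after_change)
print(f"pass rate {rate:.2f}, gate {'passed' if ok else 'FAILED'}")
```

In a CI job, a failed gate would typically exit non-zero so the model upgrade or prompt change is blocked until someone inspects the failing traces.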