Agent Builder Foundations
Evidence-tiered mechanisms for agent builders
Agent foundations
What makes a prompt reliable?
ConceptยทPrompting and instruction followingยทmath intuitionยท5 evidence tiersยทupdated 2026-06-25
Reliable prompts are interfaces, not prose. If an agent step matters, write the prompt like a contract: define the role of the step, the available inputs, the required output shape, the decision rules, and the failure behavior you will test.
A good prompt reduces ambiguity in the model's continuation while preserving the information needed to solve the task. Instructions set the goal, examples show the local pattern, schemas constrain the output channel, and evals tell you whether the contract survives model, context, and data changes.
Treat every important prompt as a small API. The input is the context you pass in, the implementation is the model's learned distribution, and the output contract is what downstream code or humans consume. Reliability improves when the prompt makes the desired continuation easier than nearby wrong continuations, then verifies that behavior with representative cases.
An autoregressive language model predicts the next token conditioned on the tokens before it. A prompt is therefore not just an instruction; it is the conditioning environment for all later tokens. Clear task framing, examples, delimiters, and output schemas change which continuations are likely.
Few-shot examples work because they place a pattern directly in context. Chain-of-thought examples can help on multi-step problems because they demonstrate an intermediate representation before the final answer. Self-consistency helps when the model can reach the same answer through multiple sampled paths. Long context is not automatically reliable context: relevant information can be harder to use when it is buried in the middle.
Think of the model as assigning probability mass across possible next-token paths. A vague prompt spreads mass across many plausible completions: explanation, refusal, partial answer, wrong format, hidden assumption. A reliable prompt concentrates mass around the acceptable region.
Examples act like local coordinates: they show the model what kind of mapping you want. A schema narrows the output subspace. Delimiters reduce accidental mixing between instructions, retrieved text, and user data. Evals estimate whether the probability mass stays in the right region across the cases you actually care about.
- Theory/paper-backed: "Language Models are Few-Shot Learners" shows that large language models can adapt to new tasks from instructions and examples placed directly in context.
- Benchmark/result-backed: chain-of-thought prompting and self-consistency report improvements on reasoning benchmarks when prompts demonstrate intermediate reasoning or sample multiple reasoning paths.
- Benchmark/result-backed: "Lost in the Middle" shows that adding more context can reduce reliability when the relevant evidence is positioned poorly.
- Editorial inference: for production agents, these findings imply that prompt quality is inseparable from interface design and evaluation.
Write the prompt contract before polishing wording. Specify the task, the input fields, what the model must ignore, the output schema, and what to do when evidence is missing. Put volatile or untrusted content behind clear delimiters. Keep decisive instructions close to the work they govern, especially when the context is long.
For agent systems, pair the prompt with a small eval set. Include clean successes, missing-information cases, adversarial or irrelevant context, and format-stability checks. Change one prompt dimension at a time, then compare outputs against the contract. If downstream code parses the answer, schema adherence is part of correctness, not formatting polish.
- Prompt-tip copying: borrowing a clever phrase without testing whether it changes behavior on your task.
- Context stuffing: adding more retrieved text until the decisive evidence is harder to find.
- Hidden contracts: expecting JSON, citations, or tool arguments without making those constraints explicit.
- No negative cases: testing only easy examples, so the prompt looks reliable until the first ambiguous production input.
- Security confusion: relying on prompt wording to control untrusted instructions instead of using sandboxing, permissions, and output validation.
Use agent evaluation to test whether prompt changes helped, context compaction to keep the working set usable, and prompt injection when the prompt includes untrusted text or tool output.
- TheoryLanguage Models are Few-Shot Learnerstheory/paper-backed
Shows that task behavior can be specified through instructions and examples in context, without gradient updates.
- BenchmarkChain-of-Thought Prompting Elicits Reasoning in Large Language Modelsbenchmark/result-backed
Reports that exemplars with intermediate reasoning improve performance on arithmetic, commonsense, and symbolic reasoning tasks.
- BenchmarkSelf-Consistency Improves Chain of Thought Reasoning in Language Modelsbenchmark/result-backed
Shows that sampling multiple reasoning paths and selecting a consistent answer can improve reasoning benchmark accuracy.
- BenchmarkLost in the Middle: How Language Models Use Long Contextsbenchmark/result-backed
Finds that long-context models can perform worse when relevant information appears in the middle of the context.
- EditorialLLM Digest synthesiseditorial inference
For agent builders, prompt reliability should be treated as an interface and evaluation problem, not a copywriting problem.