Building AI Agents That Work in Production
Key takeaways
- Production AI agents rest on four pillars: validated capability, end-to-end observability, enforced guardrails, and explicit human approval boundaries.
- Start with ReAct over a tight tool set. Move to plan-and-execute or multi-agent patterns only when the problem genuinely demands them.
- Five failure modes account for most agent incidents: hallucination, tool-use errors, prompt injection, infinite loops, and cost runaway. Each has a known mitigation.
- Evaluation in production means trajectory-level scoring on a frozen test set, replays on real traffic, and a weekly sample reviewed against a rubric — not just accuracy.
- Put humans in the loop where failure is irreversible or expensive. Automate the rest, then sample it.
What separates a demo agent from a production agent
A demo agent looks autonomous. A production agent is autonomous only inside a carefully bounded region, surrounded by monitoring, approval gates, and rollback mechanisms. The visible behaviour is often similar. The invisible scaffolding is what separates a weekend project from a system that handles thousands of real interactions per day without embarrassing its operator.
In the last eighteen months we have shipped agents into call centers, warehouse operations, internal help desks, and outbound sales workflows. The technology under each one is broadly the same: a reasoning model, a tool set, a memory store, an evaluator. What differs — and what decides whether the agent earns its keep — is the engineering around those components.
The mistake we see most often is treating an agent project as a prompt-engineering project. Prompts matter, but they are maybe twenty percent of the work. The other eighty is data pipelines, tool integration, evaluation harnesses, and the operational machinery that lets the team run the agent like any other production service. If that part is missing, the agent will be live for a quarter and then quietly switched off.
The good news: the discipline transfers. If you already run production services, you know how to do most of this. You just need to translate familiar patterns — observability, rollouts, incident response — into the specific shape that agents require.
The four pillars of production agents
Every production agent we ship rests on four pillars. Skipping any of them creates a predictable failure pattern three to six months later.
Capability
Capability is what the agent can actually do end to end, measured on real workloads rather than curated examples. At minimum, this requires a frozen evaluation set — typically 100 to 500 real tasks with known correct outcomes — run against every new prompt or model version before it reaches production. Capability without evaluation is a narrative, not a measurement.
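A capability gate of this kind fits naturally into a deploy pipeline. The sketch below is one minimal way to wire it, assuming a golden set stored as a list of input/expected pairs and an agent callable; the names and the exact-match comparison are illustrative, and real rubrics are usually richer than string equality.

```python
from typing import Callable

def capability_gate(golden_set: list[dict], agent: Callable[[str], str],
                    threshold: float = 0.9) -> tuple[bool, float]:
    """Run every golden task through the agent and compare against the
    known correct outcome. Returns (deploy_allowed, pass_rate).
    Block the release when pass_rate falls below the threshold."""
    passed = sum(1 for task in golden_set
                 if agent(task["input"]).strip() == task["expected"].strip())
    pass_rate = passed / len(golden_set)
    return pass_rate >= threshold, pass_rate
```

Run this in CI against every new prompt or model version; a failing gate stops the deploy the same way a failing unit test would.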
Observability
Observability means that any single run can be inspected after the fact: the user input, the planner output, every tool call with its arguments and response, token counts, latency, and the final answer. Tools like LangSmith, Langfuse, Arize, and Helicone all do this well at different price points. The specific tool matters less than the commitment to store every trace and make it searchable.
Guardrails
Guardrails are the enforced boundaries the agent cannot cross, regardless of what the model generates. They live outside the model — in the tool layer, the infrastructure, and the approval flow. A prompt that says "do not delete customer records" is a suggestion. A tool interface that has no delete method is a guardrail.
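The "no delete method" idea can be made concrete in the tool layer. The sketch below is a hypothetical customer-tools surface: the dangerous capability simply does not exist, and the dispatcher rejects any tool name outside an allowlist, regardless of what the model generates.

```python
class CustomerTools:
    """Tool layer exposed to the agent. The guardrail is structural:
    there is no delete method to call, whatever the model outputs."""

    def __init__(self, db: dict):
        self._db = db

    def lookup_customer(self, email: str) -> dict:
        return self._db.get(email, {})

    def update_note(self, email: str, note: str) -> bool:
        if email not in self._db:
            return False
        self._db[email].setdefault("notes", []).append(note)
        return True
    # deliberately no delete_customer: the capability does not exist

def dispatch(tools: CustomerTools, name: str, args: dict):
    """Enforce the boundary outside the model: unknown or disallowed
    tool names are rejected before anything runs."""
    allowed = {"lookup_customer", "update_note"}
    if name not in allowed:
        raise PermissionError(f"tool '{name}' is not exposed to the agent")
    return getattr(tools, name)(**args)
```

Even a fully hijacked prompt cannot reach a capability the tool layer never exposed.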
Human approval
Human approval is a design decision, not a fallback. For actions with real-world consequences above a defined threshold, the agent proposes and a human disposes. For everything else, the agent acts and a human audits a sample weekly. The line is drawn explicitly, not left to emerge.
Miss the capability pillar and the agent ships wrong answers. Miss observability and you cannot debug. Miss guardrails and you expose the business. Miss human approval and the first preventable incident becomes a board-level conversation.
Architecture patterns that ship
There are four patterns worth knowing, and the decision between them is less about sophistication than about matching the pattern to the task.
ReAct
The agent alternates between reasoning and acting: think, call a tool, observe the result, think again. It is the simplest pattern and the right default for single-role agents with a limited tool set. Most real-world agents — support triage, data lookup, internal search — should start here. ReAct is easy to trace, easy to evaluate, and easy to debug when it goes wrong.
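The think–act–observe loop is small enough to show in full. This is a minimal sketch, not a production implementation: the `model` callable stands in for an LLM call and is assumed to return either a final answer or a tool invocation.

```python
def react_loop(model, tools: dict, task: str, max_steps: int = 10) -> str:
    """Minimal ReAct skeleton: think, act, observe, repeat.
    `model` returns ("final", answer) or ("tool", name, args)."""
    history = [("task", task)]
    for _ in range(max_steps):
        decision = model(history)            # think
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision
        observation = tools[name](**args)    # act
        history.append(("observation", (name, observation)))  # observe
    return "step budget exhausted"
```

Because every step is an explicit entry in `history`, this loop is trivially traceable, which is exactly why ReAct is the right default.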
Plan-and-execute
The agent writes a multi-step plan first, then executes each step. This is appropriate when tasks are long horizon (more than five or six tool calls) or when intermediate checkpoints help reduce error compounding. The trade-off is cost and latency: the planning pass is an extra model call, and if the plan is wrong the agent may commit to a dead path before noticing.
Multi-agent swarm
A supervisor agent routes work to specialized sub-agents. This is the right pattern when the problem decomposes cleanly into distinct roles (classifier, writer, critic, researcher) and when the sub-agents benefit from different prompts, tools, or models. Be careful: swarms multiply cost, latency, and debugging surface area. Many swarm designs could be replaced by a single ReAct agent with a richer tool set and a lower bill.
Hybrid
In practice, production systems usually combine patterns. A front-door classifier routes to one of three specialized ReAct agents, which each may invoke a plan-and-execute subflow for complex cases. The hybrid is not a pattern in itself so much as a discipline: pick the simplest pattern that solves each part of the problem.
| Pattern | Best for | Typical cost | Debuggability |
|---|---|---|---|
| ReAct | Single-role tasks, limited tools | Low | High |
| Plan-and-execute | Long-horizon tasks, checkpointing | Medium | Medium |
| Multi-agent swarm | Problems with clean role decomposition | High | Low |
| Hybrid | Real systems with mixed task types | Variable | Depends on design |
Tools are the actual interface
Whatever pattern you pick, the quality of the tool definitions dominates outcomes. A tool is an API with a schema and a description. LLMs are extraordinarily sensitive to both. A minimal tool spec for a CRM lookup looks like this:
{
  "name": "lookup_customer",
  "description": "Find a customer by email or external ID. Returns customer profile, subscription status, and last 10 interactions. Use this before any action that modifies a customer account.",
  "parameters": {
    "type": "object",
    "properties": {
      "email": {"type": "string", "description": "Customer email. Optional if external_id is provided."},
      "external_id": {"type": "string", "description": "External CRM ID. Optional if email is provided."}
    },
    "required": []
  }
}
Notice what the description does: it states preconditions ("before any action that modifies"), clarifies mutual exclusivity, and names the return fields. Descriptions that read like onboarding notes for a junior teammate outperform terse technical descriptions by a wide margin. This is not a hack; it is how the model chooses tools.
The failure modes that sink agents
Five failure categories account for the vast majority of agent incidents we have investigated. Each has a predictable mitigation.
- Hallucination. The agent fabricates data that looks plausible. Mitigation: prefer retrieval and tool lookup over parametric recall; instrument a factuality check on any response grounded in documents; use a smaller, faster model for verification when the primary model is large.
- Tool-use errors. Wrong tool called, wrong arguments, wrong sequence. Mitigation: invest heavily in tool descriptions; add explicit precondition checks inside each tool; surface tool failures as first-class traces; add a retry-with-backoff policy for transient errors but not for semantic ones.
- Prompt injection. External content — emails, tickets, documents — contains instructions that hijack the agent. Mitigation: separate instructions from data using structural delimiters; strip or sanitize known injection patterns; enforce least-privilege on tool scopes so a hijacked prompt cannot trigger sensitive actions; require human approval for any action touching financial or regulated data.
- Infinite loops. The agent retries the same failing step until it exhausts the budget. Mitigation: bound the number of steps per run (typical limit: 15 to 25); detect repeat tool calls with the same arguments and force a re-plan or escalate to human; log every step so the pattern is visible.
- Cost runaway. A class of inputs causes the agent to consume 50× the expected tokens. Mitigation: per-run token and dollar budgets enforced as hard caps; per-user and per-day rate limits; alerts on cost anomalies; a fallback to a cheaper model or a deterministic path when the budget is hit.
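The loop and budget mitigations above are a few dozen lines of enforcement code sitting outside the model. Here is one hedged sketch — limits and error messages are illustrative, not prescriptive:

```python
class RunBudget:
    """Hard caps enforced outside the model: a step limit, a token
    budget, and repeat-call detection. Raising here ends the run."""

    def __init__(self, max_steps: int = 20, max_tokens: int = 50_000):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.steps = 0
        self.tokens = 0
        self.seen_calls: set[tuple] = set()

    def check_step(self, tool: str, args: dict, tokens_used: int) -> None:
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded: escalate to human")
        if self.tokens > self.max_tokens:
            raise RuntimeError("token budget exceeded: fall back to cheap path")
        # same tool with the same arguments twice is a loop signal
        call_sig = (tool, tuple(sorted(args.items())))
        if call_sig in self.seen_calls:
            raise RuntimeError("repeat tool call detected: force re-plan")
        self.seen_calls.add(call_sig)
```

Call `check_step` before every tool invocation; the exception handler decides whether to re-plan, downgrade to a cheaper model, or hand the run to a human.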
None of these are exotic. All of them have surfaced as production incidents in systems built without the corresponding mitigation. The cost of adding each mitigation upfront is hours. The cost of adding them after an incident is days plus reputation.
The observability stack
Observability is the nervous system of a production agent. Without it, you are flying a plane by listening to the engine. Four components are non-negotiable.
Tracing
Every run produces a trace with a unique ID, a tree of spans (planner calls, tool calls, sub-agent invocations), and all inputs and outputs. Store traces for at least 30 days, longer for regulated industries. Index them by user, tool, error class, and cost bucket. If the product team cannot search "last week's top five most expensive failures" in under a minute, you are not observable.
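The trace structure itself is simple. This sketch shows the shape — a unique trace ID plus an append-only list of spans serialized to JSON lines — with field names chosen for illustration; a real deployment would use an off-the-shelf tracing SDK rather than hand-rolling this.

```python
import json
import time
import uuid

class Trace:
    """One run = one trace: a unique ID plus an append-only list of
    spans. Each planner call, tool call, and sub-agent invocation
    records its inputs, outputs, latency, and token count."""

    def __init__(self, user: str):
        self.trace_id = str(uuid.uuid4())
        self.user = user
        self.spans: list[dict] = []

    def span(self, kind: str, name: str, inputs, outputs,
             latency_ms: float, tokens: int = 0) -> None:
        self.spans.append({
            "trace_id": self.trace_id, "user": self.user,
            "kind": kind, "name": name,
            "inputs": inputs, "outputs": outputs,
            "latency_ms": latency_ms, "tokens": tokens,
            "ts": time.time(),
        })

    def to_jsonl(self) -> str:
        # one JSON object per line, ready for a searchable store
        return "\n".join(json.dumps(s) for s in self.spans)
```

Indexing these records by user, tool, error class, and cost bucket is what makes the "top five most expensive failures" query a one-minute job.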
Evaluations
Evaluations run on every prompt and model change before deploy. The minimum bar is a frozen golden set: 100 to 500 tasks with known outcomes and rubrics. Below that, you are not evaluating; you are hoping. Above that, add trajectory evaluation that scores intermediate steps, not only final outputs — a correct answer reached through unsafe tool use is still a failure.
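Trajectory evaluation can be as simple as walking the trace before scoring the answer. The sketch below assumes spans shaped like the tracing examples in this post (a `kind` and a `name` per step); the hard-fail-on-unsafe-path rule is the point.

```python
def score_trajectory(spans: list[dict], expected_answer: str,
                     allowed_tools: set, final_answer: str) -> float:
    """Score intermediate steps, not only the final output: a correct
    answer reached through a disallowed tool call still fails."""
    for span in spans:
        if span["kind"] == "tool" and span["name"] not in allowed_tools:
            return 0.0  # unsafe path: hard fail regardless of the answer
    return 1.0 if final_answer.strip() == expected_answer.strip() else 0.0
```

Averaging this score over the golden set gives a single number that moves when either the answers or the paths to them degrade.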
Replays
Replays are traces you can re-run with a different prompt, model, or tool set. When a user reports a bad outcome, you want to reproduce it in five minutes on the engineer's laptop, not reconstruct it from memory. Replays are how you know if a proposed fix actually fixes the problem.
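A replay is conceptually just "same input, different version, diff the result." This sketch assumes a stored trace dict carrying the original input and final answer, and an `agent_factory` that builds an agent for a given prompt version — both names are illustrative.

```python
def replay(trace: dict, agent_factory, prompt_version: str) -> dict:
    """Re-run a stored trace's original input against a different
    prompt or model version and report whether the outcome changed."""
    agent = agent_factory(prompt_version)
    new_answer = agent(trace["input"])
    return {
        "trace_id": trace["trace_id"],
        "old_answer": trace["final_answer"],
        "new_answer": new_answer,
        "changed": new_answer != trace["final_answer"],
    }
```

Run this over every trace from a reported incident and you know in minutes whether a proposed fix actually fixes the problem.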
Rollback
Every change — prompt, model, tool definition, guardrail rule — is versioned. Any version can be reverted in under a minute. Ideally, traffic is split so a new version runs alongside the old on 5% of traffic before graduating. This is standard software practice. A surprising number of agent teams skip it because "it's a prompt change".
Tool choice is secondary, but worth naming: LangSmith, Langfuse, Arize AI, Helicone, and Braintrust all cover the core. Pick one that integrates with your existing observability stack and move on. For more on connecting agent telemetry into broader data and MLOps workflows, see our related posts on MLOps from notebook to production and RAG vs fine-tuning in 2026.
Human-in-the-loop by design
Human-in-the-loop is not a patch applied when the agent is not good enough. It is a design decision about where the business draws the line between autonomy and accountability. Drawing that line well is the difference between an agent users trust and one they fight.
A useful rule of thumb: put a human in the loop whenever a failed action would cost more than the review time, or whenever the action is irreversible. In practice, this maps to a short list.
- Actions with monetary impact above a defined threshold (commonly 50 to 500 USD, depending on risk tolerance)
- External communications to customers, regulators, or partners on sensitive topics
- Account, permission, or access changes
- Any action involving legal, medical, or safety categories
- First-time classes of decisions, before operational track record exists
For everything else, the agent acts autonomously and a human reviews a sample. Weekly review of 30 to 50 random runs plus all flagged failures is enough to catch drift early. The review should be scored against an explicit rubric, not vibes.
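The explicit boundary can live in a single routing function. In this sketch the threshold, category names, and sampling rate are assumptions to tune per business, not fixed values:

```python
import random
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 100.0   # assumed threshold; tune per risk tolerance
SENSITIVE = {"external_comms", "access_change", "legal", "medical", "safety"}

@dataclass
class Action:
    category: str
    amount_usd: float = 0.0
    reversible: bool = True

def needs_human(action: Action) -> bool:
    """Explicit boundary: propose-and-approve above the line,
    act-and-sample-audit below it."""
    if not action.reversible:
        return True
    if action.amount_usd >= APPROVAL_THRESHOLD_USD:
        return True
    return action.category in SENSITIVE

def sample_for_audit(rate: float = 0.02) -> bool:
    # autonomous runs still feed the weekly human review queue
    return random.random() < rate
```

Because the rule is code, the line between autonomy and approval is versioned, reviewable, and auditable — not an emergent property of reviewer fatigue.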
The worst design is the one most teams ship by default: review every action. This exhausts the reviewer within a month, after which the review becomes rubber-stamping. At that point the agent is effectively autonomous, but nobody has acknowledged it. Explicit boundaries are always safer than implicit ones.
A real-world pattern
The Soflex emergency-response system is a concrete example of the four pillars operating together. The domain is a 911-style dispatch call center where every second matters and mistakes are expensive. The agent classifies incoming incidents in real time, prioritizes dispatch, and assists operators with suggested protocols.
Capability was validated on a frozen evaluation set of 400 historical calls with known correct outcomes before the system handled a single live call. Observability was built before the agent was: every classification, every protocol suggestion, every escalation was traced end to end with sub-second granularity. Guardrails were enforced in the tool layer: the agent could propose, never commit; dispatch required a dispatcher's explicit confirmation. Human-in-the-loop was the product — the agent made operators faster, not absent.
The measured impact was a 42% reduction in manual work and 60% faster triage, sustained across eight weeks of shadow deployment and six months of live operation. No reported incidents from the agent itself during that window. Those numbers are real, and they are the result of the operational discipline described in this post. You can see the broader case in our case studies.
The same playbook has worked for warehouse agents, internal help desks, and outbound sales flows. The domain varies; the shape of the work does not. If you are planning something similar, related reading on why most AI pilots fail covers the upstream decisions that determine whether any of this shipping discipline gets a chance to matter.
Frequently asked questions
What is a production AI agent?
A production AI agent is an LLM-driven system that reliably completes multi-step tasks with real users, real data, and real consequences. It is distinguished from a demo agent by four traits: tested capability on real workloads, observability into every step, enforced guardrails, and a defined human approval boundary. Without all four, the system is a prototype running in a production environment.
Which agent architecture should we start with?
Start with a ReAct loop over a constrained tool set for any single-role agent. It is the simplest pattern that supports observability and debugging. Move to plan-and-execute only when tasks are long horizon or require intermediate checkpoints. Multi-agent swarms are appropriate only when the problem decomposes cleanly into specialized roles; otherwise they add cost and failure modes without proportional benefit.
How do you prevent prompt injection in production?
Treat all data retrieved by the agent as untrusted input and separate it from instructions with clear structural boundaries. Use least-privilege tool scopes so a compromised prompt cannot trigger sensitive actions. Add a classifier or regex pass on retrieved content for known injection patterns, and require human approval for any action with write access to financial, customer, or legal data. Defense in depth, not a single filter.
When should a human be in the loop?
Put a human in the loop whenever a failed action would cost more than the review time, or whenever the action is irreversible. Typical thresholds: actions with monetary impact above a small amount, external communications to customers or regulators, account or access changes, and anything involving legal or medical categories. For high-frequency low-stakes actions, sample-based review is usually enough.
How do you measure agent quality beyond accuracy?
Use a mix of task success rate on a frozen evaluation set, trajectory evaluation that scores intermediate steps, cost per successful task, latency at the 95th percentile, and operational metrics like tool-call error rate and retry rate. For user-facing agents, add a weekly sample reviewed by humans against a rubric. Accuracy alone does not catch expensive or slow behaviour.
What does observability for AI agents look like?
Log every step of every agent run with a trace identifier: the input, the planner output, each tool call with arguments and response, token counts, latency, and the final answer. Store traces for replay so a failed run can be reproduced against a new prompt or model. Add dashboards for cost, failure rate, and tool usage. Without traces, agent debugging is guesswork.
Planning AI work this quarter?
Book a 30-minute strategy call and we'll stress-test your use case before you commit.