Building AI Agents That Work in Production
Key takeaways
- Production AI agents rest on four pillars: validated capability, end-to-end observability, enforced guardrails, and explicit human approval boundaries.
- Start with ReAct over a tight tool set. Move to plan-and-execute or multi-agent patterns only when the problem genuinely demands them.
- Five failure modes account for most agent incidents: hallucination, tool-use errors, prompt injection, infinite loops, and cost runaway. Each has a known mitigation.
- Evaluation in production means trajectory-level scoring on a frozen test set, replays on real traffic, and a weekly sample reviewed against a rubric — not just accuracy.
- Put humans in the loop where failure is irreversible or expensive. Automate the rest, then sample it.
What separates a demo agent from a production agent
A demo agent looks autonomous. A production agent is autonomous only inside a carefully bounded region, surrounded by monitoring, approval gates, and rollback mechanisms. The visible behaviour is often similar. The invisible scaffolding is what separates a weekend project from a system that handles thousands of real interactions per day without embarrassing its operator.
In the last eighteen months we have shipped agents into call centers, warehouse operations, internal help desks, and outbound sales workflows. The technology under each one is broadly the same: a reasoning model, a tool set, a memory store, an evaluator. What differs — and what decides whether the agent earns its keep — is the engineering around those components.
The mistake we see most often is treating an agent project as a prompt-engineering project. Prompts matter, but they are maybe twenty percent of the work. The other eighty is data pipelines, tool integration, evaluation harnesses, and the operational machinery that lets the team run the agent like any other production service. If that part is missing, the agent will be live for a quarter and then quietly switched off.
The good news: the discipline transfers. If you already run production services, you know how to do most of this. You just need to translate familiar patterns — observability, rollouts, incident response — into the specific shape that agents require.
The four pillars of production agents
Every production agent we ship rests on four pillars. Skipping any of them creates a predictable failure pattern three to six months later.
Capability
Capability is what the agent can actually do end to end, measured on real workloads rather than curated examples. At minimum, this requires a frozen evaluation set — typically 100 to 500 real tasks with known correct outcomes — run against every new prompt or model version before it reaches production. Capability without evaluation is a narrative, not a measurement.
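A capability gate of this kind fits naturally into a deploy pipeline. The sketch below is one minimal way to wire it, assuming a golden set stored as a list of input/expected pairs and an agent callable; the names and the exact-match comparison are illustrative, and real rubrics are usually richer than string equality.

```python
from typing import Callable

def capability_gate(golden_set: list[dict], agent: Callable[[str], str],
                    threshold: float = 0.9) -> tuple[bool, float]:
    """Run every golden task through the agent and compare against the
    known correct outcome. Returns (deploy_allowed, pass_rate).
    Block the release when pass_rate falls below the threshold."""
    passed = sum(1 for task in golden_set
                 if agent(task["input"]).strip() == task["expected"].strip())
    pass_rate = passed / len(golden_set)
    return pass_rate >= threshold, pass_rate
```

Run this in CI against every new prompt or model version; a failing gate stops the deploy the same way a failing unit test would.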
Observability
Observability means that any single run can be inspected after the fact: the user input, the planner output, every tool call with its arguments and response, token counts, latency, and the final answer. Tools like LangSmith, Langfuse, Arize, and Helicone all do this well at different price points. The specific tool matters less than the commitment to store every trace and make it searchable.
Guardrails
Guardrails are the enforced boundaries the agent cannot cross, regardless of what the model generates. They live outside the model — in the tool layer, the infrastructure, and the approval flow. A prompt that says "do not delete customer records" is a suggestion. A tool interface that has no delete method is a guardrail.
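The "no delete method" idea can be made concrete in the tool layer. The sketch below is a hypothetical customer-tools surface: the dangerous capability simply does not exist, and the dispatcher rejects any tool name outside an allowlist, regardless of what the model generates.

```python
class CustomerTools:
    """Tool layer exposed to the agent. The guardrail is structural:
    there is no delete method to call, whatever the model outputs."""

    def __init__(self, db: dict):
        self._db = db

    def lookup_customer(self, email: str) -> dict:
        return self._db.get(email, {})

    def update_note(self, email: str, note: str) -> bool:
        if email not in self._db:
            return False
        self._db[email].setdefault("notes", []).append(note)
        return True
    # deliberately no delete_customer: the capability does not exist

def dispatch(tools: CustomerTools, name: str, args: dict):
    """Enforce the boundary outside the model: unknown or disallowed
    tool names are rejected before anything runs."""
    allowed = {"lookup_customer", "update_note"}
    if name not in allowed:
        raise PermissionError(f"tool '{name}' is not exposed to the agent")
    return getattr(tools, name)(**args)
```

Even a fully hijacked prompt cannot reach a capability the tool layer never exposed.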
Human approval
Human approval is a design decision, not a fallback. For actions with real-world consequences above a defined threshold, the agent proposes and a human disposes. For everything else, the agent acts and a human audits a sample weekly. The line is drawn explicitly, not left to emerge.
Miss the capability pillar and the agent ships wrong answers. Miss observability and you cannot debug. Miss guardrails and you expose the business. Miss human approval and the first preventable incident becomes a board-level conversation.
Architecture patterns that ship
There are four patterns worth knowing, and the decision between them is less about sophistication than about matching the pattern to the task.
ReAct
The agent alternates between reasoning and acting: think, call a tool, observe the result, think again. It is the simplest pattern and the right default for single-role agents with a limited tool set. Most real-world agents — support triage, data lookup, internal search — should start here. ReAct is easy to trace, easy to evaluate, and easy to debug when it goes wrong.
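The think–act–observe loop is small enough to show in full. This is a minimal sketch, not a production implementation: the `model` callable stands in for an LLM call and is assumed to return either a final answer or a tool invocation.

```python
def react_loop(model, tools: dict, task: str, max_steps: int = 10) -> str:
    """Minimal ReAct skeleton: think, act, observe, repeat.
    `model` returns ("final", answer) or ("tool", name, args)."""
    history = [("task", task)]
    for _ in range(max_steps):
        decision = model(history)            # think
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision
        observation = tools[name](**args)    # act
        history.append(("observation", (name, observation)))  # observe
    return "step budget exhausted"
```

Because every step is an explicit entry in `history`, this loop is trivially traceable, which is exactly why ReAct is the right default.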
Plan-and-execute
The agent writes a multi-step plan first, then executes each step. This is appropriate when tasks are long horizon (more than five or six tool calls) or when intermediate checkpoints help reduce error compounding. The trade-off is cost and latency: the planning pass is an extra model call, and if the plan is wrong the agent may commit to a dead path before noticing.
Multi-agent swarm
A supervisor agent routes work to specialized sub-agents. This is the right pattern when the problem decomposes cleanly into distinct roles (classifier, writer, critic, researcher) and when the sub-agents benefit from different prompts, tools, or models. Be careful: swarms multiply cost, latency, and debugging surface area. Many swarm designs could be replaced by a single ReAct agent with a richer tool set and a lower bill.
Hybrid
In practice, production systems usually combine patterns. A front-door classifier routes to one of three specialized ReAct agents, which each may invoke a plan-and-execute subflow for complex cases. The hybrid is not a pattern in itself so much as a discipline: pick the simplest pattern that solves each part of the problem.
| Pattern | Best for | Typical cost | Debuggability |
|---|---|---|---|
| ReAct | Single-role tasks, limited tools | Low | High |
| Plan-and-execute | Long-horizon tasks, checkpointing | Medium | Medium |
| Multi-agent swarm | Problems with clean role decomposition | High | Low |
| Hybrid | Real systems with mixed task types | Variable | Depends on design |
Tools are the actual interface
Whatever pattern you pick, the quality of the tool definitions dominates outcomes. A tool is an API with a schema and a description. LLMs are extraordinarily sensitive to both. A minimal tool spec for a CRM lookup looks like this:
{
  "name": "lookup_customer",
  "description": "Find a customer by email or external ID. Returns customer profile, subscription status, and last 10 interactions. Use this before any action that modifies a customer account.",
  "parameters": {
    "type": "object",
    "properties": {
      "email": {"type": "string", "description": "Customer email. Optional if external_id is provided."},
      "external_id": {"type": "string", "description": "External CRM ID. Optional if email is provided."}
    },
    "required": []
  }
}
Notice what the description does: it states preconditions ("before any action that modifies"), clarifies mutual exclusivity, and names the return fields. Descriptions that read like onboarding notes for a junior teammate outperform terse technical descriptions by a wide margin. This is not a hack; it is how the model chooses tools.
The failure modes that sink agents
Five failure categories account for the vast majority of agent incidents we have investigated. Each has a predictable mitigation.
- Hallucination. The agent fabricates data that looks plausible. Mitigation: prefer retrieval and tool lookup over parametric recall; instrument a factuality check on any response grounded in documents; use a smaller, faster model for verification when the primary model is large.
- Tool-use errors. Wrong tool called, wrong arguments, wrong sequence. Mitigation: invest heavily in tool descriptions; add explicit precondition checks inside each tool; surface tool failures as first-class traces; add a retry-with-backoff policy for transient errors but not for semantic ones.
- Prompt injection. External content — emails, tickets, documents — contains instructions that hijack the agent. Mitigation: separate instructions from data using structural delimiters; strip or sanitize known injection patterns; enforce least-privilege on tool scopes so a hijacked prompt cannot trigger sensitive actions; require human approval for any action touching financial or regulated data.
- Infinite loops. The agent retries the same failing step until it exhausts the budget. Mitigation: bound the number of steps per run (typical limit: 15 to 25); detect repeat tool calls with the same arguments and force a re-plan or escalate to human; log every step so the pattern is visible.
- Cost runaway. A class of inputs causes the agent to consume 50× the expected tokens. Mitigation: per-run token and dollar budgets enforced as hard caps; per-user and per-day rate limits; alerts on cost anomalies; a fallback to a cheaper model or a deterministic path when the budget is hit.
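The loop and budget mitigations above are a few dozen lines of enforcement code sitting outside the model. Here is one hedged sketch — limits and error messages are illustrative, not prescriptive:

```python
class RunBudget:
    """Hard caps enforced outside the model: a step limit, a token
    budget, and repeat-call detection. Raising here ends the run."""

    def __init__(self, max_steps: int = 20, max_tokens: int = 50_000):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.steps = 0
        self.tokens = 0
        self.seen_calls: set[tuple] = set()

    def check_step(self, tool: str, args: dict, tokens_used: int) -> None:
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded: escalate to human")
        if self.tokens > self.max_tokens:
            raise RuntimeError("token budget exceeded: fall back to cheap path")
        # same tool with the same arguments twice is a loop signal
        call_sig = (tool, tuple(sorted(args.items())))
        if call_sig in self.seen_calls:
            raise RuntimeError("repeat tool call detected: force re-plan")
        self.seen_calls.add(call_sig)
```

Call `check_step` before every tool invocation; the exception handler decides whether to re-plan, downgrade to a cheaper model, or hand the run to a human.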
None of these are exotic. All of them have surfaced as production incidents in systems built without the corresponding mitigation. The cost of adding each mitigation upfront is hours. The cost of adding them after an incident is days plus reputation.
The observability stack
Observability is the nervous system of a production agent. Without it, you are flying a plane by listening to the engine. Four components are non-negotiable.
Tracing
Every run produces a trace with a unique ID, a tree of spans (planner calls, tool calls, sub-agent invocations), and all inputs and outputs. Store traces for at least 30 days, longer for regulated industries. Index them by user, tool, error class, and cost bucket. If the product team cannot search "last week's top five most expensive failures" in under a minute, you are not observable.
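The trace structure itself is simple. This sketch shows the shape — a unique trace ID plus an append-only list of spans serialized to JSON lines — with field names chosen for illustration; a real deployment would use an off-the-shelf tracing SDK rather than hand-rolling this.

```python
import json
import time
import uuid

class Trace:
    """One run = one trace: a unique ID plus an append-only list of
    spans. Each planner call, tool call, and sub-agent invocation
    records its inputs, outputs, latency, and token count."""

    def __init__(self, user: str):
        self.trace_id = str(uuid.uuid4())
        self.user = user
        self.spans: list[dict] = []

    def span(self, kind: str, name: str, inputs, outputs,
             latency_ms: float, tokens: int = 0) -> None:
        self.spans.append({
            "trace_id": self.trace_id, "user": self.user,
            "kind": kind, "name": name,
            "inputs": inputs, "outputs": outputs,
            "latency_ms": latency_ms, "tokens": tokens,
            "ts": time.time(),
        })

    def to_jsonl(self) -> str:
        # one JSON object per line, ready for a searchable store
        return "\n".join(json.dumps(s) for s in self.spans)
```

Indexing these records by user, tool, error class, and cost bucket is what makes the "top five most expensive failures" query a one-minute job.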
Evaluations
Evaluations run on every prompt and model change before deploy. The minimum bar is a frozen golden set: 100 to 500 tasks with known outcomes and rubrics. Below that, you are not evaluating; you are hoping. Above that, add trajectory evaluation that scores intermediate steps, not only final outputs — a correct answer reached through unsafe tool use is still a failure.
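Trajectory evaluation can be as simple as walking the trace before scoring the answer. The sketch below assumes spans shaped like the tracing examples in this post (a `kind` and a `name` per step); the hard-fail-on-unsafe-path rule is the point.

```python
def score_trajectory(spans: list[dict], expected_answer: str,
                     allowed_tools: set, final_answer: str) -> float:
    """Score intermediate steps, not only the final output: a correct
    answer reached through a disallowed tool call still fails."""
    for span in spans:
        if span["kind"] == "tool" and span["name"] not in allowed_tools:
            return 0.0  # unsafe path: hard fail regardless of the answer
    return 1.0 if final_answer.strip() == expected_answer.strip() else 0.0
```

Averaging this score over the golden set gives a single number that moves when either the answers or the paths to them degrade.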
Replays
Replays are traces you can re-run with a different prompt, model, or tool set. When a user reports a bad outcome, you want to reproduce it in five minutes on the engineer's laptop, not reconstruct it from memory. Replays are how you know if a proposed fix actually fixes the problem.
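A replay is conceptually just "same input, different version, diff the result." This sketch assumes a stored trace dict carrying the original input and final answer, and an `agent_factory` that builds an agent for a given prompt version — both names are illustrative.

```python
def replay(trace: dict, agent_factory, prompt_version: str) -> dict:
    """Re-run a stored trace's original input against a different
    prompt or model version and report whether the outcome changed."""
    agent = agent_factory(prompt_version)
    new_answer = agent(trace["input"])
    return {
        "trace_id": trace["trace_id"],
        "old_answer": trace["final_answer"],
        "new_answer": new_answer,
        "changed": new_answer != trace["final_answer"],
    }
```

Run this over every trace from a reported incident and you know in minutes whether a proposed fix actually fixes the problem.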
Rollback
Every change — prompt, model, tool definition, guardrail rule — is versioned. Any version can be reverted in under a minute. Ideally, traffic is split so a new version runs alongside the old on 5% of traffic before graduating. This is standard software practice. A surprising number of agent teams skip it because "it's a prompt change".
Tool choice is secondary, but worth naming: LangSmith, Langfuse, Arize AI, Helicone, and Braintrust all cover the core. Pick one that integrates with your existing observability stack and move on. For more on connecting agent telemetry into broader data and MLOps workflows, see our related posts on MLOps from notebook to production and RAG vs fine-tuning in 2026.
Human-in-the-loop by design
Human-in-the-loop is not a patch applied when the agent is not good enough. It is a design decision about where the business draws the line between autonomy and accountability. Drawing that line well is the difference between an agent users trust and one they fight.
A useful rule of thumb: put a human in the loop whenever a failed action would cost more than the review time, or whenever the action is irreversible. In practice, this maps to a short list.
- Actions with monetary impact above a defined threshold (commonly 50 to 500 USD, depending on risk tolerance)
- External communications to customers, regulators, or partners on sensitive topics
- Account, permission, or access changes
- Any action involving legal, medical, or safety categories
- First-time classes of decisions, before operational track record exists
For everything else, the agent acts autonomously and a human reviews a sample. Weekly review of 30 to 50 random runs plus all flagged failures is enough to catch drift early. The review should be scored against an explicit rubric, not vibes.
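The explicit boundary can live in a single routing function. In this sketch the threshold, category names, and sampling rate are assumptions to tune per business, not fixed values:

```python
import random
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 100.0   # assumed threshold; tune per risk tolerance
SENSITIVE = {"external_comms", "access_change", "legal", "medical", "safety"}

@dataclass
class Action:
    category: str
    amount_usd: float = 0.0
    reversible: bool = True

def needs_human(action: Action) -> bool:
    """Explicit boundary: propose-and-approve above the line,
    act-and-sample-audit below it."""
    if not action.reversible:
        return True
    if action.amount_usd >= APPROVAL_THRESHOLD_USD:
        return True
    return action.category in SENSITIVE

def sample_for_audit(rate: float = 0.02) -> bool:
    # autonomous runs still feed the weekly human review queue
    return random.random() < rate
```

Because the rule is code, the line between autonomy and approval is versioned, reviewable, and auditable — not an emergent property of reviewer fatigue.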
The worst design is the one most teams ship by default: review every action. This exhausts the reviewer within a month, after which the review becomes rubber-stamping. At that point the agent is effectively autonomous, but nobody has acknowledged it. Explicit boundaries are always safer than implicit ones.
A real-world pattern
The Soflex emergency-response system is a concrete example of the four pillars operating together. The domain is a 911-style dispatch call center where every second matters and mistakes are expensive. The agent classifies incoming incidents in real time, prioritizes dispatch, and assists operators with suggested protocols.
Capability was validated on a frozen evaluation set of 400 historical calls with known correct outcomes before the system handled a single live call. Observability was built before the agent was: every classification, every protocol suggestion, every escalation was traced end to end with sub-second granularity. Guardrails were enforced in the tool layer: the agent could propose, never commit; dispatch required a dispatcher's explicit confirmation. Human-in-the-loop was the product — the agent made operators faster, not absent.
The measured impact was a 42% reduction in manual work and 60% faster triage, sustained across eight weeks of shadow deployment and six months of live operation. No reported incidents from the agent itself during that window. Those numbers are real, and they are the result of the operational discipline described in this post. You can see the broader case in our case studies.
The same playbook has worked for warehouse agents, internal help desks, and outbound sales flows. The domain varies; the shape of the work does not. If you are planning something similar, related reading on why most AI pilots fail covers the upstream decisions that determine whether any of this shipping discipline gets a chance to matter.
Frequently asked questions
What is a production AI agent?
A production AI agent is an LLM-driven system that reliably completes multi-step tasks with real users, real data, and real consequences. It is distinguished from a demo agent by four traits: tested capability on real workloads, observability into every step, enforced guardrails, and a defined human approval boundary. Without all four, the system is a prototype running in a production environment.
Which agent architecture should we start with?
Start with a ReAct loop over a constrained tool set for any single-role agent. It is the simplest pattern that supports observability and debugging. Move to plan-and-execute only when tasks are long horizon or require intermediate checkpoints. Multi-agent swarms are appropriate only when the problem decomposes cleanly into specialized roles; otherwise they add cost and failure modes without proportional benefit.
How do you prevent prompt injection in production?
Treat all data retrieved by the agent as untrusted input and separate it from instructions with clear structural boundaries. Use least-privilege tool scopes so a compromised prompt cannot trigger sensitive actions. Add a classifier or regex pass on retrieved content for known injection patterns, and require human approval for any action with write access to financial, customer, or legal data. Defense in depth, not a single filter.
When should a human be in the loop?
Put a human in the loop whenever a failed action would cost more than the review time, or whenever the action is irreversible. Typical thresholds: actions with monetary impact above a small amount, external communications to customers or regulators, account or access changes, and anything involving legal or medical categories. For high-frequency low-stakes actions, sample-based review is usually enough.
How do you measure agent quality beyond accuracy?
Use a mix of task success rate on a frozen evaluation set, trajectory evaluation that scores intermediate steps, cost per successful task, latency at the 95th percentile, and operational metrics like tool-call error rate and retry rate. For user-facing agents, add a weekly sample reviewed by humans against a rubric. Accuracy alone does not catch expensive or slow behaviour.
What does observability for AI agents look like?
Log every step of every agent run with a trace identifier: the input, the planner output, each tool call with arguments and response, token counts, latency, and the final answer. Store traces for replay so a failed run can be reproduced against a new prompt or model. Add dashboards for cost, failure rate, and tool usage. Without traces, agent debugging is guesswork.
Planning AI work this quarter?
Book a 30-minute strategy call and we'll stress-test your use case before you commit.