Why most AI agents fail in production, and what we do about it
The model isn't usually the problem. The agent system around it almost always is. Here are the failure modes we keep meeting in the wild, with the engineering moves that prevent each one.
We've shipped enough AI agents now to see the same failure modes recur, regardless of vertical. Almost none of them are the model's fault. Almost all of them are the system around the model: the retrieval, the tools, the evals, the rollout. Here's the running taxonomy we use internally, and the moves that have actually prevented each one.
Failure 1: The agent confidently makes things up
The classic hallucination. Customer asks 'what's your refund policy on annual subscriptions?' and the agent invents a 30-day policy that doesn't exist. The model isn't broken. It's doing exactly what it was trained to do, which is generate plausible text. The breakage is that the system asked a generative model to act like a database lookup.
The fix isn't 'better prompting.' The fix is structural: every customer-facing answer must be grounded in retrieved content from a vetted source, with an abstention default when retrieval comes back empty or low-confidence.
Note: Concrete pattern: if the retrieval similarity score is below 0.55, the agent says 'I don't have that information. Let me get a teammate.' This is non-negotiable on any production deployment.
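As a minimal sketch of that gate, assuming a vector search that returns scored chunks (the `searchKb` helper and its signature are placeholders for whatever your retrieval stack provides):

```ts
type RetrievedChunk = { text: string; source: string; score: number };

// Hypothetical retrieval client; swap in your vector store's search call.
declare function searchKb(query: string, topK: number): Promise<RetrievedChunk[]>;

const ABSTAIN_THRESHOLD = 0.55;
const ABSTAIN_MESSAGE = "I don't have that information. Let me get a teammate.";

export async function groundedAnswer(
  question: string,
  generate: (question: string, context: string) => Promise<string>
): Promise<string> {
  const chunks = await searchKb(question, 5);
  const best = chunks[0];

  // Abstain by default: empty or low-confidence retrieval never reaches the model.
  if (!best || best.score < ABSTAIN_THRESHOLD) {
    return ABSTAIN_MESSAGE;
  }

  // Only vetted, retrieved content goes into the prompt as grounding.
  const context = chunks
    .filter((c) => c.score >= ABSTAIN_THRESHOLD)
    .map((c) => `[${c.source}] ${c.text}`)
    .join("\n");

  return generate(question, context);
}
```

The point of the structure is that abstention is the default path, not an instruction the model is asked to follow.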
Failure 2: The agent fires a destructive action it shouldn't have
An agent with a refund tool issues an unauthorized $5,000 refund. An agent with a calendar tool double-books a CEO's afternoon. The pattern: the model 'decided' to call a tool that has real-world consequences and there was nothing in between deciding and doing.
Two layers prevent this. First, tool-level allowlists with parameter constraints: refunds capped at $X, calendar writes only to specific calendars. Second, human-in-the-loop approval for any tool call above a threshold of impact. Both are cheap to add and absolutely critical.
```ts
import { z } from "zod";

// `tool` is the agent framework's tool helper; `stripe` is the Stripe SDK client.
export const refundIssue = tool({
  name: "refund.issue",
  // Layer 1: the parameter schema itself caps refunds at $500.
  params: { amount: z.number().max(500) },
  // Layer 2: anything over $100 routes to a human before the handler runs.
  requiresApproval: (p) => p.amount > 100,
  handler: async (p) => stripe.refunds.create(p),
});
```

Failure 3: Quality degrades silently over weeks
Day 1, the agent ships and looks great. Day 30, it's quietly fielding 12% of conversations incorrectly, but no one notices because there's no measurement. Trust erodes, and by the time someone audits, the cost is real.
This is the most dangerous failure because it's invisible. The fix is an eval suite that runs continuously: a fixed set of 200-500 representative interactions scored on whether the agent's response was acceptable, run on every deploy and on a schedule against live traffic samples.
- Define a graded rubric per intent type (correctness, tone, escalation correctness)
- Pull 200-500 interactions from real production logs as your eval set
- Run the suite on every PR; block merges that drop the score
- Sample a small % of live conversations daily and grade them, to catch drift before customers do (a minimal runner is sketched below)
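A minimal sketch of the per-deploy run, assuming a fixed eval set and a grader (an LLM judge or rubric scorer; `runAgent`, `grade`, and the 0.92 bar are all placeholders for your own stack):

```ts
type EvalCase = { input: string; intent: string; expectedBehavior: string };
type Graded = { evalCase: EvalCase; score: number }; // 0..1 per the intent's rubric

// Placeholders for your agent entry point and your grader.
declare function runAgent(input: string): Promise<string>;
declare function grade(c: EvalCase, response: string): Promise<number>;

const PASS_BAR = 0.92; // block the merge if the suite average drops below this

export async function runEvalSuite(cases: EvalCase[]): Promise<boolean> {
  const results: Graded[] = [];
  for (const c of cases) {
    const response = await runAgent(c.input);
    results.push({ evalCase: c, score: await grade(c, response) });
  }

  const avg = results.reduce((sum, r) => sum + r.score, 0) / results.length;

  // Surface the worst cases so a regression points straight at its examples.
  for (const f of results.filter((r) => r.score < 0.5)) {
    console.error(`FAIL [${f.evalCase.intent}] ${f.evalCase.input} -> ${f.score}`);
  }

  console.log(`Eval suite: ${avg.toFixed(3)} over ${results.length} cases`);
  return avg >= PASS_BAR; // false fails the CI job and blocks the merge
}
```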
Failure 4: 'It works, but no one can explain why'
An executive asks why the agent told a customer X. Engineering can't answer. The model output is non-deterministic, the prompt is in five places, the retrieval results aren't logged, and the tool calls aren't traced. The agent becomes a black box no one trusts.
Observability isn't optional for production AI. Every agent invocation should log the input, retrieval results, intermediate reasoning, tool calls (with parameters), and final output. Tools like Langfuse, LangSmith, or Arize make this trivial. Without it, you have no way to debug, audit, or improve.
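Concretely, the unit of logging is one record per invocation. A vendor-neutral sketch of the shape (the tools above give you this structure plus a query UI on top; field names here are illustrative):

```ts
type ToolCall = { name: string; params: unknown; result: unknown; ms: number };

// One record per agent invocation: everything needed to answer
// "why did the agent say X?" after the fact.
type AgentTrace = {
  traceId: string;
  timestamp: string;
  input: string;
  retrieval: { query: string; chunkIds: string[]; topScore: number };
  reasoning: string[];    // intermediate model steps, verbatim
  toolCalls: ToolCall[];  // parameters always included
  output: string;
  model: string;
  latencyMs: number;
};

// Minimal sink: one structured log line per invocation. Swap for a
// Langfuse/LangSmith/Arize exporter without changing the record shape.
export function emitTrace(trace: AgentTrace): void {
  console.log(JSON.stringify(trace));
}
```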
Failure 5: Permission leaks
Internal knowledge agents will surface documents the asker shouldn't be able to see, because the retrieval layer doesn't enforce source-system permissions. We've seen finance docs leak to interns and unannounced acquisition decks surface in Slack queries.
The fix is permission-aware retrieval: every retrieval query filters against the asker's actual access rights in the source system. We model this at index time (per-doc ACLs) and enforce at query time (filtered search). It's not glamorous, but skipping it is a CISO-level mistake.
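A sketch of the query-time half, assuming each indexed chunk carries the ACL principals captured from the source system at index time (the `vectorSearch` filter syntax is illustrative; pgvector, Pinecone, Elasticsearch, and the rest all support some form of metadata-filtered search):

```ts
type Asker = { userId: string; groups: string[] };

// Hypothetical filtered-search client; the filter runs inside the engine.
declare function vectorSearch(
  query: string,
  filter: { allowedPrincipals: string[] },
  topK: number
): Promise<{ docId: string; text: string; score: number }[]>;

export async function permissionAwareRetrieve(query: string, asker: Asker) {
  // The asker's principals, exactly as the source system knows them.
  const principals = [asker.userId, ...asker.groups];

  // Enforcement happens inside the search, not in post-filtering: a chunk
  // the asker can't see never enters the candidate set at all.
  return vectorSearch(query, { allowedPrincipals: principals }, 10);
}
```

Filtering inside the engine rather than after retrieval matters: post-filtering leaks existence (and can leak content via scores and snippets) before the filter runs.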
What it looks like when these are handled
An agent that's well-engineered against these failure modes is a quiet thing. It answers what it knows. It abstains where it doesn't. It calls tools within bounds. It logs everything. Its quality stays steady or improves over time because evals catch regressions.
It's also boring to demo, which is part of why this is hard to sell to people who haven't been burned yet. Boring is the goal. The exciting agents are the ones that fail in interesting ways three months in.
The hardest thing about building production AI isn't building the prototype. It's building the boring infrastructure that makes the prototype trustworthy at scale.
If you're evaluating a vendor or your own team's work, the questions to ask aren't about the model. They're about the system: How do you handle low-confidence retrieval? Where are the human approval gates? What's your eval suite? Where do traces live? How do you handle permissions? If those answers are vague, the agent will fail in production. If they're specific, you have a real shot.