5 Things That Break AI Agents in Production

Most AI agent projects don't fail because of the model. They fail because of the engineering around it.

I've built and shipped AI systems across multiple products — including an AI co-pilot for professional working dog teams used by law enforcement K9 units and search and rescue handlers. The gap between a demo that impresses and a system that runs reliably at 3am when nobody's watching is enormous.

These are the 5 failure points I see most often, and what to do about each one.

1. No error recovery when tool calls fail

An AI agent without error recovery isn't an agent — it's a time bomb. Tool calls fail. APIs return unexpected responses. Rate limits get hit. Network timeouts happen. In a demo environment you control, none of this surfaces. In production, all of it does.

The failure mode looks like this: your agent calls an external tool, the tool returns an error or malformed data, and the agent either silently stops, returns nonsense to the user, or — worst case — proceeds anyway with bad data downstream.

I've seen agents confidently complete a multi-step workflow using the results of a failed API call. The user had no idea. The downstream data was corrupted.

This happens because most agent frameworks treat tool calls as happy-path operations. The orchestration layer doesn't have explicit handling for failure states, retries, or fallbacks.

The fix: Define failure handling before you write the first tool call. Every tool should have a retry policy, a fallback response, and a clear signal to the agent that it should stop or escalate rather than continue with bad data. Build observability around tool call success rates from day one — not after something breaks.

2. Cost overruns from unthrottled LLM calls

LLM costs are invisible until they're not. During development you're making dozens of calls. In production you're making thousands. The architecture decisions that seemed fine at low volume become expensive fast.

The most common culprits:

No caching — identical or near-identical queries hit the LLM every single time
No request throttling — a single user action triggers a cascade of LLM calls
Using a large, expensive model for tasks that a smaller model handles fine
No cost monitoring until the invoice arrives

Cost per query is a product decision, not just an engineering one. If you're not tracking it from day one, you're flying blind.

The fix: Implement semantic caching for frequent queries. Use a smaller, cheaper model for classification, routing, and simple retrieval tasks, and reserve your most capable model for generation. Set up cost alerts before you go live, not after.

3. Context window bloat

More context isn't always better. Sending too much context to the LLM is one of the most common and expensive mistakes in production RAG systems.

The naive implementation: retrieve the top 10 chunks from your vector store and dump them all into the context window. It works fine in testing. In production, it means thousands of tokens of potentially irrelevant content on every query — driving up cost, slowing response time, and often degrading output quality because the model gets confused by noise.

There's a well-documented phenomenon called "lost in the middle" — models pay less attention to content in the middle of long contexts. Past a certain point, more context makes things worse.

The fix: Add a re-ranking step between retrieval and generation. Retrieved chunks get scored for relevance to the query, and only the top 2–3 get passed to the LLM. Tools like Cohere Rerank or a simple cross-encoder model handle this well. You'll cut context by 60–70% and often improve output quality at the same time.

4. Prompt brittleness — works in dev, breaks in prod

Your prompts work perfectly in development. You've tested them against 20 examples and they nail it every time. Then you go live and real users immediately find inputs your prompt doesn't handle.

Real user inputs are far more varied, ambiguous, and unexpected than your test cases:

Users phrase the same request 10 different ways
They include irrelevant context that throws off the agent's reasoning
They ask things adjacent to but outside your intended use case
They use industry-specific terminology you didn't anticipate

A prompt that's been tested against 20 examples is a hypothesis, not a product. Production is where you find out if the hypothesis holds.

The fix: Build a prompt testing suite before launch — not just happy path examples, but adversarial inputs and edge cases. Version your prompts so you can roll back when something breaks. Log inputs and outputs in production and review them regularly. Expect to iterate on your prompts for weeks after launch.

5. Hallucinations from poor retrieval and bad chunking

Hallucinations in RAG systems are almost never the model's fault. They're a retrieval problem.

When your agent confidently states something wrong, the most likely cause is one of these:

The right information isn't in your knowledge base
The right information is there, but the chunking strategy broke it apart so it's never retrieved together
The retrieved chunks are semantically similar to the query but don't actually answer it
The model is filling gaps in the retrieved context with plausible-sounding but invented information

The default approach — fixed-size chunks of 512 tokens — is a starting point, not a solution. It's fast to implement and often produces retrievals that are just slightly off, which is worse than being obviously wrong because it's harder to catch.

An agent that says "I don't have enough information to answer that" is more trustworthy than one that answers confidently with wrong information. Build explicit uncertainty handling into your system.

The fix: Revisit your chunking strategy for your specific data type. Add metadata to your chunks — source, date, section — and use it in retrieval. Implement a relevance threshold so the agent only proceeds if retrieved context meets a minimum confidence score. Give the agent an explicit "I don't know" path rather than forcing it to always generate an answer.

The common thread

Look at all five of these failure points and you'll notice something: none of them are about the AI model. They're about engineering discipline, system design, and treating your agent like production software — because that's what it is.

The teams that ship reliable AI agents are the ones who apply the same rigour they'd apply to any backend system: observability, error handling, cost monitoring, testing, and iteration. The teams that struggle are the ones who outsource all of that to the AI and hope for the best.

Working on an AI agent project?

I take on a small number of AI agent implementation projects each quarter — from scoped single-agent builds to full multi-agent systems. Fixed price after scoping. You own it fully after handover.

→ See my AI Agent Development & Implementation service

→ Schedule a free 30-minute discovery call

→ Or email me directly