The year agents got real
Twelve months ago, most "AI agent" demos followed the same pattern: a chatbot calls a function, something happens, the audience claps. The gap between that and a production system handling thousands of requests per day — with real money and real users on the line — turned out to be enormous.
We've spent the past year working with teams deploying agent-based systems across customer support, data analysis, and internal tooling. This post distills the patterns that reliably work and the traps that look harmless in staging but break spectacularly in production.
The reliability ceiling nobody talks about
Here's the uncomfortable truth: most agent architectures top out around 85–90% task completion rates on non-trivial workflows. That sounds decent until you realize it means at least 1 in 10 users hits a failure. For a traditional software product, a 10% error rate would be a severity-1 incident.
The failure modes are rarely dramatic. They're subtle:
- The agent calls the right tool with slightly wrong parameters
- It completes 6 of 7 steps correctly, then hallucinates the last one
- It enters a retry loop that burns tokens without making progress
- It succeeds on the task but produces output in an unexpected format
The most dangerous agent failures are the ones that look like successes. The output is coherent, well-formatted, and completely wrong.
Architecture patterns that survive production
Pattern 1 — Deterministic scaffolding with LLM decision points
The most reliable agent systems we've seen don't give the LLM full autonomy. Instead, they use a deterministic workflow engine (think: state machine or DAG) and delegate specific decision points to the model.
The model decides what to do next. The code decides how to do it — including input validation, error handling, and output parsing.
This hybrid approach sacrifices some flexibility for a massive gain in reliability. You can write tests for the deterministic parts and focus your evaluation budget on the LLM decision points.
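A minimal sketch of the pattern, with a hypothetical `llm` callable standing in for your provider's API and made-up action names (`lookup_order`, `issue_refund`, `escalate`): the model picks the next step, and deterministic code validates the choice, executes it, and enforces a hard step limit.

```python
from typing import Callable

ALLOWED_ACTIONS = {"lookup_order", "issue_refund", "escalate"}

def decide_next_action(context: str, llm: Callable[[str], str]) -> str:
    """LLM decision point: the model chooses; deterministic code validates."""
    raw = llm(f"Given this context, reply with one of {sorted(ALLOWED_ACTIONS)}:\n{context}")
    action = raw.strip().lower()
    if action not in ALLOWED_ACTIONS:
        return "escalate"  # safe default when the model goes off-script
    return action

def run_workflow(context: str, llm: Callable[[str], str],
                 handlers: dict[str, Callable[[str], str]],
                 max_steps: int = 5) -> str:
    # Deterministic scaffolding: fixed loop, hard step budget, code-owned execution.
    for _ in range(max_steps):
        action = decide_next_action(context, llm)
        if action == "escalate":
            return "escalated"
        context = handlers[action](context)  # the code decides *how*
    return "escalated"  # step budget exhausted, fail safe
```

Because the loop, the action whitelist, and the handlers are plain code, they can be unit-tested without a model in the loop; only `decide_next_action` needs evaluation against real model outputs.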
Pattern 2 — Layered validation
Every tool call should pass through at least two validation layers:
- Schema validation — Does the output match the expected structure?
- Semantic validation — Does the output make sense given the context?
Schema validation is cheap and catches ~60% of malformed outputs. Semantic validation typically requires a second, smaller model call — but the cost is trivial compared to the cost of propagating a bad action through the rest of the workflow.
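The two layers can be composed in a few lines. This is a sketch under assumptions: the tool output arrives as a JSON string, and `judge` is a hypothetical callable wrapping a small, cheap model that answers yes/no.

```python
import json
from typing import Callable, Optional

def schema_validate(raw: str, required: dict[str, type]) -> Optional[dict]:
    """Layer 1: cheap structural check — valid JSON with the expected fields/types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, ftype in required.items():
        if not isinstance(data.get(field), ftype):
            return None
    return data

def semantic_validate(args: dict, judge: Callable[[str], str]) -> bool:
    """Layer 2: ask a small judge model whether the arguments make sense in context."""
    verdict = judge(f"Do these tool-call arguments look plausible? {args}")
    return verdict.strip().lower().startswith("yes")

def validate_tool_call(raw: str, required: dict[str, type],
                       judge: Callable[[str], str]) -> Optional[dict]:
    args = schema_validate(raw, required)
    if args is None:
        return None  # rejected at layer 1 — no second model call needed
    if not semantic_validate(args, judge):
        return None  # structurally fine, semantically suspect
    return args
```

Note the ordering: the free schema check runs first, so the paid semantic check only fires on outputs that are at least well-formed.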
Pattern 3 — Graceful degradation over retry loops
Retry logic is the default instinct when an agent step fails. But retrying the same prompt with the same context rarely produces a different result. Instead, design for graceful degradation: if a step fails twice, fall back to a simpler strategy or escalate to a human.
The teams with the best production outcomes are the ones that invested in good fallback paths, not better retry logic.
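The fails-twice-then-degrade rule fits in one small helper. A sketch, assuming the primary step, the simpler fallback strategy, and the human-escalation hook are all passed in as callables:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_fallback(step: Callable[[], T],
                      fallback: Callable[[], T],
                      escalate: Callable[[], T],
                      max_attempts: int = 2) -> T:
    """Try the primary step a bounded number of times, then degrade
    to a simpler strategy instead of retrying forever."""
    for _ in range(max_attempts):
        try:
            return step()
        except Exception:
            continue  # bounded retry — no infinite loop, no token burn
    try:
        return fallback()  # simpler strategy (e.g. a templated response)
    except Exception:
        return escalate()  # last resort: hand off to a human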
Memory is harder than it looks
Agent memory — the ability to retain and reference information across turns or sessions — is one of the most requested features and one of the hardest to get right.
The naive approach is to dump everything into the context window. This works until it doesn't: you hit token limits, latency spikes, or the model starts confusing old context with current instructions.
The approaches we've seen work best:
- Structured scratchpads — A JSON or key-value store that the agent reads and writes to explicitly, separate from the conversation history
- Summarization checkpoints — At regular intervals, compress the conversation history into a summary and discard the raw turns
- Scoped retrieval — Instead of carrying everything forward, retrieve only the memory entries relevant to the current step
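The structured-scratchpad idea, with scoped retrieval folded in, can be sketched as a small class (the key names here are illustrative, not prescriptive):

```python
import json

class Scratchpad:
    """Minimal key-value memory kept outside the conversation history.
    The agent reads and writes explicitly; each step injects only the
    keys it asked for, instead of carrying everything forward."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._store[key] = value

    def read(self, keys: list[str]) -> dict[str, str]:
        # Scoped retrieval: return only the entries relevant to this step.
        return {k: self._store[k] for k in keys if k in self._store}

    def as_prompt_block(self, keys: list[str]) -> str:
        # Rendered separately from the chat turns, so old context
        # can't be confused with current instructions.
        return "MEMORY:\n" + json.dumps(self.read(keys), indent=2)
```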
None of these are perfect. Memory remains the single biggest open problem in agent architecture.
Cost control is a design problem
Agent loops are inherently expensive because each step involves at least one LLM call, and complex tasks can require 10–20 steps. Without guardrails, a single runaway agent session can consume hundreds of thousands of tokens.
Practical cost controls that work:
- Token budgets per session with hard cutoffs
- Model routing — use a smaller, cheaper model for simple decisions and reserve the frontier model for complex reasoning steps
- Caching — many agent steps produce deterministic results given the same input; cache aggressively
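Two of these controls — hard token budgets and input-keyed caching — are simple enough to sketch directly (the class and decorator names here are illustrative):

```python
import functools
import hashlib

class TokenBudget:
    """Per-session token budget with a hard cutoff."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.limit:
            # Hard stop: abort the session rather than let it run away.
            raise RuntimeError("token budget exhausted; aborting session")

def cached(fn):
    """Cache deterministic step results, keyed on a hash of the input."""
    memo: dict[str, str] = {}

    @functools.wraps(fn)
    def wrapper(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in memo:
            memo[key] = fn(prompt)  # pay for the call only once per input
        return memo[key]

    return wrapper
```

In practice you would call `budget.charge()` with the token counts your provider reports per request, and apply `@cached` only to steps that are genuinely deterministic for a given input.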
What changes in the next 12 months
Agent reliability is improving fast, driven by better tool-use training, structured output guarantees, and protocols like MCP (Model Context Protocol) that standardize how models interact with external systems.
But the biggest unlock won't come from model improvements alone. It will come from better engineering practices: proper evaluation suites, staged rollouts, human-in-the-loop checkpoints, and the same rigor we apply to any production software system.
The teams that treat agents as software — not magic — will be the ones that ship products users actually trust.