AI & Automation

Shipping AI Agents to Production: What Nobody Tells You

Building an LLM prototype takes an afternoon. Getting it to production takes months. After shipping 6 agent systems this year, here's the gap between the demo and reality.


Ronald Mugisha

Head of Data & AI

2 October 2024 · 12 min read
AI · LangChain · LangGraph · Production · Architecture
[Figure: multi-agent LLM workflow with human escalation paths]

Everyone can build an LLM prototype that impresses in a demo. You give it a prompt, it produces a plausible output, stakeholders are excited. Then you try to put it in production and everything falls apart. This post is about the gap between those two states — and how to close it.

The Demo-to-Production Gap

In a demo, you hand-pick the inputs. In production, users send inputs you never imagined. The LLM's job is to handle all of them gracefully — and when it can't, your system's job is to fail safely. Most prototype architectures have no concept of graceful failure.

Reliability Is Not Optional

The first thing we add to every production agent system is structured retry logic with exponential backoff. LLM APIs have rate limits and occasionally return malformed JSON. If you're not handling both, your agent will fail silently in ways that are very hard to debug.

Python
from openai import AsyncOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

openai_client = AsyncOpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)
async def call_llm_with_retry(messages: list[dict]) -> str:
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"},
        timeout=30,
    )
    return response.choices[0].message.content

Prompt Versioning and Testing

Prompts are code. They need to be version-controlled, reviewed, and tested before shipping. We store every prompt in a dedicated prompts/ directory with a version suffix, and we maintain a golden dataset of input/output pairs that we run on every prompt change.

  • Store prompts as typed constants in a dedicated module — never inline strings
  • Build a golden test dataset of 20–50 representative inputs per agent
  • Run evals against the golden set on every prompt PR — reject if pass rate drops
  • Use structured output (JSON mode or tool calling) rather than free-text parsing
  • Log every prompt + completion pair to a database for analysis and fine-tuning
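A minimal eval gate along these lines might look as follows. The file layout, pass threshold, and function names are illustrative assumptions, not part of the original setup — the point is only that the golden set runs as a hard check on every prompt change.

```python
import json
from pathlib import Path

PASS_THRESHOLD = 0.95  # illustrative gate; tune per agent


def run_golden_evals(golden_path: str, agent_fn) -> float:
    """Run agent_fn over every case in the golden set; return the pass rate."""
    cases = json.loads(Path(golden_path).read_text())
    passed = sum(1 for case in cases if agent_fn(case["input"]) == case["expected"])
    return passed / len(cases)


def gate_prompt_change(golden_path: str, agent_fn) -> None:
    """Fail the CI check for a prompt PR if the pass rate drops."""
    rate = run_golden_evals(golden_path, agent_fn)
    if rate < PASS_THRESHOLD:
        raise SystemExit(f"Eval pass rate {rate:.2%} below {PASS_THRESHOLD:.0%}")
```

Wiring `gate_prompt_change` into CI means a prompt edit that regresses the golden set can never merge silently.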

Human-in-the-Loop Is an Architecture Decision

Every production agent system we've shipped has a human escalation path. Not because we don't trust the LLM — but because there will always be a class of inputs where the right answer is "a human needs to make this call". The architecture question is not whether to build this, but where in the flow to put it.

Python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class AgentState(TypedDict):
    document:   str
    confidence: float
    result:     str | None
    escalated:  bool

def should_escalate(state: AgentState) -> Literal["escalate", "complete"]:
    """Route to human queue if confidence below threshold."""
    return "escalate" if state["confidence"] < 0.85 else "complete"

graph = StateGraph(AgentState)
graph.add_node("process",  process_document)       # node functions defined elsewhere
graph.add_node("escalate", route_to_human_queue)
graph.add_node("complete", mark_complete)

graph.set_entry_point("process")
graph.add_conditional_edges("process", should_escalate)
graph.add_edge("escalate", END)
graph.add_edge("complete", END)

Observability for Agents

You cannot debug what you cannot observe. Every agent invocation should emit structured logs with: the input hash, model used, token counts, latency, confidence score, and whether it was escalated. We use a simple Postgres table for this — not a fancy LLMOps platform — because SQL gives us full flexibility to query and aggregate.
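A sketch of that per-invocation record, with exactly the fields listed above. The dataclass, table name, and column names are assumptions for illustration; the SQL is a plain parameterised insert in the psycopg style, not a specific library's API.

```python
import hashlib
import time
from dataclasses import dataclass, asdict

@dataclass
class AgentTrace:
    """One row per agent invocation — the fields listed in the post."""
    input_hash: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    confidence: float
    escalated: bool

def make_trace(raw_input: str, model: str, usage: dict,
               started_at: float, confidence: float, escalated: bool) -> AgentTrace:
    return AgentTrace(
        input_hash=hashlib.sha256(raw_input.encode()).hexdigest()[:16],
        model=model,
        prompt_tokens=usage["prompt_tokens"],
        completion_tokens=usage["completion_tokens"],
        latency_ms=(time.monotonic() - started_at) * 1000,
        confidence=confidence,
        escalated=escalated,
    )

# Illustrative Postgres insert, e.g. cursor.execute(INSERT_SQL, asdict(trace)):
INSERT_SQL = """
    INSERT INTO agent_traces
        (input_hash, model, prompt_tokens, completion_tokens,
         latency_ms, confidence, escalated)
    VALUES (%(input_hash)s, %(model)s, %(prompt_tokens)s, %(completion_tokens)s,
            %(latency_ms)s, %(confidence)s, %(escalated)s)
"""
```

Hashing the input rather than storing it keeps the trace table small and avoids duplicating sensitive documents; the raw prompt/completion pairs live in their own log.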

Managing Token Costs at Scale

Token costs are invisible until they're not. A document processing agent that handles 10 documents a day in testing might need to handle 10,000 a day in production. At that scale, the difference between a 2,000-token and 4,000-token prompt is significant. Instrument token usage from day one.

Cost pattern: Cache LLM responses for identical or near-identical inputs using a semantic similarity hash. For document classification tasks, we typically see 30–40% cache hit rates — effectively cutting costs by a third with no accuracy loss.
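A stripped-down version of that cache pattern, assuming nothing beyond the standard library. This sketch keys on a hash of whitespace/case-normalised text, which catches the trivially "near-identical" inputs; the semantic-similarity variant described above would key on a quantised embedding instead.

```python
import hashlib
from typing import Callable

class ResponseCache:
    """Cache LLM responses keyed by a normalised-input hash."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace and case so trivially different inputs collide.
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get_or_call(self, text: str, llm_call: Callable[[str], str]) -> str:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = llm_call(text)  # only pay for a genuine miss
        return self._store[key]
```

In production you would also bound the cache size and expire entries when the prompt version changes, since a cached answer from an old prompt is a stale answer.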

When Not to Use Agents

Not every automation problem needs an LLM. If the logic can be expressed as rules, use rules. If the data is structured, use a database query. LLMs add latency, cost, and non-determinism. They earn their place when the task genuinely requires language understanding or judgment — not just pattern matching.

  • Use an LLM when the input is unstructured and the output requires interpretation
  • Use rules when the logic is deterministic and the inputs are constrained
  • Use both when you need LLM understanding feeding into deterministic downstream processing
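The "use both" pattern above reduces to a rules-first router: deterministic checks run first, and the LLM is only invoked for the unstructured long tail. The rule patterns and labels here are made up for illustration.

```python
import re
from typing import Callable

# Illustrative rules: constrained, deterministic cases that need no LLM.
RULES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"^INV-\d+$"), "invoice"),
    (re.compile(r"^PO-\d+$"), "purchase_order"),
]

def classify(text: str, llm_fallback: Callable[[str], str]) -> str:
    """Try cheap deterministic rules first; fall back to the LLM otherwise."""
    for pattern, label in RULES:
        if pattern.match(text.strip()):
            return label
    return llm_fallback(text)
```

Every input the rules catch is one fewer LLM call: zero latency, zero tokens, fully deterministic.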