Building an LLM prototype takes an afternoon. Getting it to production takes months. After shipping 6 agent systems this year, here's the gap between the demo and reality.
Ronald Mugisha
Head of Data & AI

Anyone can build an LLM prototype that impresses in a demo. You give it a prompt, it produces a plausible output, and stakeholders are excited. Then you try to put it in production and everything falls apart. This post is about the gap between those two states — and how to close it.
In a demo, you hand-pick the inputs. In production, users send inputs you never imagined. The LLM's job is to handle all of them gracefully — and when it can't, your system's job is to fail safely. Most prototype architectures have no concept of graceful failure.
The first thing we add to every production agent system is structured retry logic with exponential backoff. LLM APIs have rate limits and occasionally return malformed JSON. If you're not handling both, your agent will fail silently in ways that are very hard to debug.
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)
async def call_llm_with_retry(messages: list[dict]) -> str:
    # openai_client is an AsyncOpenAI instance created elsewhere.
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"},
        timeout=30,
    )
    return response.choices[0].message.content

Prompts are code. They need to be version-controlled, reviewed, and tested before shipping. We store every prompt in a dedicated prompts/ directory with a version suffix, and we maintain a golden dataset of input/output pairs that we run on every prompt change.
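As a concrete illustration of the golden-dataset idea, here is a minimal sketch. The names (`load_prompt`, `GOLDEN_CASES`, `run_golden_suite`, the `prompts/` file layout) are hypothetical, not from our codebase — the point is that every prompt change gets scored against a fixed set of known-good cases before it ships.

```python
from pathlib import Path

PROMPTS_DIR = Path("prompts")

def load_prompt(name: str, version: str) -> str:
    """Load a version-suffixed prompt, e.g. prompts/classify_v3.txt."""
    return (PROMPTS_DIR / f"{name}_{version}.txt").read_text()

# Frozen input/output pairs; in practice this lives in a data file.
GOLDEN_CASES = [
    {"input": "Invoice #1042 from Acme Ltd", "expected": "invoice"},
    {"input": "Meeting notes, 2024-03-01", "expected": "notes"},
]

def run_golden_suite(classify_fn) -> float:
    """Return the fraction of golden cases classify_fn gets right."""
    passed = sum(
        1 for case in GOLDEN_CASES
        if classify_fn(case["input"]) == case["expected"]
    )
    return passed / len(GOLDEN_CASES)
```

Wire `run_golden_suite` into CI so a prompt change that drops the pass rate below a threshold fails the build, the same way a failing unit test would.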
Every production agent system we've shipped has a human escalation path. Not because we don't trust the LLM — but because there will always be a class of inputs where the right answer is "a human needs to make this call". The architecture question is not whether to build this, but where in the flow to put it.
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class AgentState(TypedDict):
    document: str
    confidence: float
    result: str | None
    escalated: bool

def should_escalate(state: AgentState) -> Literal["escalate", "complete"]:
    """Route to human queue if confidence below threshold."""
    return "escalate" if state["confidence"] < 0.85 else "complete"

graph = StateGraph(AgentState)
graph.add_node("process", process_document)
graph.add_node("escalate", route_to_human_queue)
graph.add_node("complete", mark_complete)
graph.add_conditional_edges("process", should_escalate)

You cannot debug what you cannot observe. Every agent invocation should emit structured logs with: the input hash, model used, token counts, latency, confidence score, and whether it was escalated. We use a simple Postgres table for this — not a fancy LLMOps platform — because SQL gives us full flexibility to query and aggregate.
Token costs are invisible until they're not. A document processing agent that handles 10 documents a day in testing might need to handle 10,000 a day in production. At that scale, the difference between a 2,000-token and a 4,000-token prompt roughly doubles your input costs. Instrument token usage from day one.
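The two points above can be sketched together. This is a hedged illustration, not our production schema: `build_invocation_record` and its field names are hypothetical, but they show the shape of a row you might insert into a Postgres table per invocation — input hashed rather than stored raw, with token counts captured from the start.

```python
import hashlib
import time

def build_invocation_record(
    input_text: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    confidence: float,
    escalated: bool,
) -> dict:
    """One structured log row per agent invocation.

    Hash the input instead of storing it raw, so the log stays
    queryable without retaining potentially sensitive documents.
    """
    return {
        "input_hash": hashlib.sha256(input_text.encode()).hexdigest()[:16],
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": latency_ms,
        "confidence": confidence,
        "escalated": escalated,
        "logged_at": time.time(),
    }
```

Because it's just a dict destined for a plain table, aggregating daily token spend per model is a one-line GROUP BY rather than a feature request to an LLMOps vendor.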
Cost pattern: Cache LLM responses for identical or near-identical inputs using a semantic similarity hash. For document classification tasks, we typically see 30–40% cache hit rates — effectively cutting costs by a third with no accuracy loss.
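A minimal sketch of the caching pattern, with one loud caveat: a real semantic cache keys on embeddings (e.g. a locality-sensitive hash over an embedding vector) so that paraphrases collide. Here, text normalisation stands in for that as a crude approximation, purely to show the cache-around-the-call structure; all names are illustrative.

```python
import hashlib

_cache: dict[str, str] = {}

def cache_key(text: str) -> str:
    """Collapse case and whitespace so trivially-varied inputs collide.

    Stand-in for an embedding-based similarity hash.
    """
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode()).hexdigest()

def classify_with_cache(text: str, classify_fn) -> str:
    """Only call the (expensive) classifier on a cache miss."""
    key = cache_key(text)
    if key not in _cache:
        _cache[key] = classify_fn(text)
    return _cache[key]
```

The structural point survives the simplification: the LLM call sits behind a key function, so swapping the naive normaliser for an embedding hash later changes one function, not the call sites.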
Not every automation problem needs an LLM. If the logic can be expressed as rules, use rules. If the data is structured, use a database query. LLMs add latency, cost, and non-determinism. They earn their place when the task genuinely requires language understanding or judgment — not just pattern matching.