Why retrieval accuracy drops at scale
A RAG demo can look perfect with ten questions. Production means thousands of queries a day across drifting docs, seasonal content, and ambiguous user intent. Retrieval precision degrades before accuracy does.
Architecture pattern we use
Hybrid retrieval with BM25 plus dense embeddings, then a reranker on top. Vector caching and prompt caching cut cost and latency without changing behavior. We also freeze embeddings per corpus version so ranking stays stable between releases.
Agent tool-calling patterns
Function calling works when the action surface is narrow. ReAct works better when the agent must plan across uncertain tools. We choose by action-graph complexity, not by trend.
Observability and guardrails
Every LLM call gets logged with prompt version, retrieved context ids, token usage, and latency. We run eval regressions in CI on prompt, model, or retrieval changes. One client kept 99.99% uptime after applying this workflow.
Cost controls
We enforce latency and token budgets per feature. Routing, semantic caching, and model cascades usually cut inference cost 35% to 40% versus a naive deployment.
Deployment patterns
Deploy to Modal or AWS with canary traffic, run evals before promotion, keep runbooks for retrieval index rebuilds, and treat the vector store like any other stateful dependency.