TL;DR a full working implementation in pure Python, along with benchmark results from a local setup. RAG systems do not fail only on quality. They can also become inefficient in terms of cost, often in ways that are not immediately visible. Every extra retrieved token has a cost. In my system, context over-fetching ranged from 3–8× beyond what queries actually required. In many baseline implementations, repeated queries are processed independently, with no reuse of previous results. In single-model setups, a large share of simple queries may be handled by high-cost models, even when lower-cost alternatives would be sufficient. With semantic caching (up to 98.5% hit rate in a pre-seeded, warmed cache benchmark), query routing (around 81% of requests shifted to a lower-cost model in the benchmark mix), and a token budget layer with a circuit breaker, the system achieved up to 85.8% cost reduction at 10,000 requests per day, while maintaining response quality under the evaluated setup.…