RAG Is Burning Money — I Built a Cost Control Layer to Fix It


Towards Data Science·Emmimal P Alexander·3 days ago




TL;DR: a full working implementation in pure Python, with benchmark results from a local setup. RAG systems do not fail only on quality; they can also waste money in ways that are not immediately visible. Every extra retrieved token has a cost, and in my system context over-fetching ranged from 3–8× what queries actually required. Many baseline implementations process repeated queries independently, with no reuse of previous results. In single-model setups, a large share of simple queries may be handled by a high-cost model even when a lower-cost alternative would be sufficient. With semantic caching (up to 98.5% hit rate in a pre-seeded, warmed-cache benchmark), query routing (around 81% of requests shifted to a lower-cost model in the benchmark mix), and a token budget layer with a circuit breaker, the system achieved up to 85.8% cost reduction at 10,000 requests per day while maintaining response quality under the evaluated setup.…
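To make the three mechanisms named above concrete, here is a minimal pure-Python sketch of how a semantic cache, a query router, and a token budget with a circuit breaker might fit together. It is not the article's implementation. Every name in it (`SemanticCache`, `route`, `TokenBudget`, `handle`) is hypothetical, the bag-of-words cosine similarity is a toy stand-in for a real embedding model, the word-count routing rule and the 0.8 threshold are arbitrary assumptions, and `call_model` is a placeholder for whatever LLM client is in use.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is similar enough to an old one."""

    def __init__(self, threshold=0.8):  # threshold is an illustrative value
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


def route(query, max_simple_words=8):
    """Hypothetical heuristic: short queries go to the lower-cost model."""
    return "cheap" if len(query.split()) <= max_simple_words else "expensive"


class TokenBudget:
    """Tracks spend; once the limit would be exceeded, the breaker stays open."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.spent = 0
        self.open = False  # open breaker means requests are rejected

    def charge(self, tokens):
        if self.open or self.spent + tokens > self.daily_limit:
            self.open = True
            return False
        self.spent += tokens
        return True


def handle(query, cache, budget, call_model):
    """Cache first, then route, then check the budget before calling a model."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"
    model = route(query)
    tokens = len(query.split())  # crude token estimate (assumption)
    if not budget.charge(tokens):
        return None, "rejected"
    answer = call_model(model, query)
    cache.put(query, answer)
    return answer, model
```

In this sketch a repeated query is served from the cache without touching the budget, and once a request would push spend past `daily_limit` the breaker opens and later uncached requests are rejected until the budget object is reset. A real deployment would presumably replace `embed` with an embedding model and count tokens with the model's tokenizer.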
