I recently open-sourced llm0, a Go binary that puts one OpenAI-compatible endpoint in front of OpenAI, Anthropic, Gemini, and local Ollama. It is MIT licensed and ships as a single binary plus Postgres and Redis.

The technically interesting part is how it stays fast: 3 ms p50 cache-hit latency, ~1,672 req/s sustained throughput, and 1–2 Redis round trips on the hot path, all on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet.

This post walks through the architecture decisions that got those numbers. Expect Lua scripts, a pgvector query, and an honest discussion of where I overstated things and got corrected by a Redis engineer.

The naive approach (and why it's slow)

A typical LLM gateway request needs to do six things:

1. Authenticate the API key
2. Check the per-API-key rate limit
3. Check the per-project spend cap
4. Look up the exact-match cache
5. (Maybe) check the semantic cache
6. (Maybe) forward to the upstream model

The naive approach is six serial Redis GETs.…
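To make the cost of the naive shape concrete, here is a minimal sketch of six serial lookups. It is not llm0's code. `fakeRedis`, `naiveChecks`, the key names, and the 300 µs per-call latency are all assumptions made for illustration, and the store is an in-memory stand-in so the example runs without a Redis server.

```go
package main

import (
	"fmt"
	"time"
)

// fakeRedis is a hypothetical stand-in for a Redis client. It counts
// round trips and adds an assumed fixed network cost per call.
type fakeRedis struct {
	data       map[string]string
	roundTrips int
	rtt        time.Duration // assumed per-call network latency
	elapsed    time.Duration // simulated time, accumulated rather than slept
}

// Get models one blocking GET: one round trip, one unit of latency.
func (r *fakeRedis) Get(key string) (string, bool) {
	r.roundTrips++
	r.elapsed += r.rtt
	v, ok := r.data[key]
	return v, ok
}

// naiveChecks issues one GET per key, each waiting for the previous
// reply, and returns how many keys were found.
func naiveChecks(r *fakeRedis, keys []string) int {
	hits := 0
	for _, k := range keys {
		if _, ok := r.Get(k); ok {
			hits++
		}
	}
	return hits
}

func main() {
	r := &fakeRedis{
		data: map[string]string{"auth:key123": "project42"},
		rtt:  300 * time.Microsecond, // assumption, not a measured value
	}
	// Hypothetical key names, one per step in the list above.
	keys := []string{
		"auth:key123", "ratelimit:key123", "spend:project42",
		"cache:exact:abc", "cache:semantic:abc", "upstream:config",
	}
	hits := naiveChecks(r, keys)
	fmt.Printf("hits: %d, round trips: %d, simulated network time: %v\n",
		hits, r.roundTrips, r.elapsed)
}
```

The point of the sketch is only that serial lookups pay the network latency once per call, so the total grows linearly with the number of checks. Cutting the number of round trips attacks that multiplier directly.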