From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works. Why Bother Running Local LLMs? Before we get into the how, let's address the obvious question: why not just use Claude, GPT, or Gemini? The honest answer is - for many tasks, you should. But local LLMs make sense when: Privacy matters. Code, internal documents, proprietary configs - none of it leaves your machine. Cost at scale. API calls add up fast when you're running a coding agent all day. Latency control. No network round-trips, no rate limits, no API downtime. Offline capability. Works on a plane, in a data center, behind a firewall. Experimentation. Swap models freely, tune inference parameters, benchmark to your heart's content. This guide documents a real setup - not a toy demo - built specifically to run Claude Code and pi.dev against a local model, transparently, with no API key required.…