Remember, switching to your pistol is always faster than reloading. The same idea applies to LLM workflows. Most of the time, you don't need a flagship model to scaffold a project. Boilerplate, spec drafts, and initial plans are all tasks where a smaller model can do the heavy lifting. Then you pass the result to a larger model for review.

Why this works

Prefill is usually a single forward pass (not counting advanced stuff like chunking and sequence parallelism). Fundamentally, producing the next token is just a call to model.forward(). Reading a long context is one pass, while generating costs one pass per output token.

How does this help? Say your initial prompt is 16k tokens (a rough ballpark for a Claude Code session) and you need to generate another 16k tokens of output (tool calls, reads, edits included). If your large model generates at 50 t/s, a small model can easily hit 200 t/s. That's about 80 seconds on the small model versus 320 seconds on the large one for the same 16k tokens. The concept is the same as speculative decoding.…
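The arithmetic above can be checked with a small back-of-the-envelope sketch. It reads "16k" as 16,000 tokens and uses the 50 t/s and 200 t/s decode rates from the text. The review step is an illustration only: `REVIEW_PREFILL_TPS` is a made-up placeholder, not a measurement, and the function names are hypothetical. The sketch also ignores whatever tokens the large model generates while writing its review.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def draft_then_review_seconds(
    output_tokens: int,
    small_decode_tps: float,
    review_prefill_tps: float,
) -> float:
    """Small model drafts the output, large model reads it back as prefill.

    Counts only the large model's cost of ingesting the draft; the tokens
    it generates for the review itself are left out of this estimate.
    """
    draft = decode_seconds(output_tokens, small_decode_tps)
    review = output_tokens / review_prefill_tps
    return draft + review


OUTPUT_TOKENS = 16_000       # "16k" from the text, read as 16,000
LARGE_DECODE_TPS = 50.0      # from the text
SMALL_DECODE_TPS = 200.0     # from the text
REVIEW_PREFILL_TPS = 2_000.0  # ASSUMPTION: placeholder value, not a benchmark

large_only = decode_seconds(OUTPUT_TOKENS, LARGE_DECODE_TPS)
small_only = decode_seconds(OUTPUT_TOKENS, SMALL_DECODE_TPS)
drafted = draft_then_review_seconds(
    OUTPUT_TOKENS, SMALL_DECODE_TPS, REVIEW_PREFILL_TPS
)

print(f"large model decode: {large_only:.0f} s")
print(f"small model decode: {small_only:.0f} s")
print(f"decode speedup:     {large_only / small_only:.1f}x")
print(f"draft + review:     {drafted:.0f} s (under the assumed prefill rate)")
```

Swap in your own measured rates before drawing conclusions; the only figures taken from the text are the token count and the two decode speeds.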