How we almost wrote off 3 models as broken: the thinking-mode tax

By Vilius Vystartas | May 2026

Three models scored 15% or lower in my first benchmark run. Kimi K2.5: 10%. MiniMax M2.5: 15%. Gemma 4: HTTP 400 on every call. I almost excluded them as broken. They weren't broken. I was calling them wrong. Here's what happened and how to avoid it when benchmarking your own models.

The symptoms

Kimi K2.5 (10%): Every response was empty. The model returned exactly 300 tokens of nothing, with finish_reason: length. It ran out of budget before producing any visible output.

MiniMax M2.5 (15%): Same pattern. One task ran for 88 minutes and consumed 98,000 tokens before I killed it.

Gemma 4: Every request returned HTTP 400. I had the wrong model ID and the wrong parameter name: include_thinking doesn't exist for Gemma.

Root cause: thinking mode is on by default

These models enable internal chain-of-thought reasoning by default. Every request burns tokens thinking silently before producing output.…
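
The empty-response symptom described above can be flagged automatically rather than counted as a wrong answer. Below is a minimal sketch, assuming responses arrive as OpenAI-style chat-completion JSON (choices[0].message.content and choices[0].finish_reason). The function name diagnose_response and its return labels are hypothetical and are not taken from the benchmark harness used here.

```python
from typing import Any


def diagnose_response(response: dict[str, Any]) -> str:
    """Classify one chat-completion response for benchmark triage.

    Assumes an OpenAI-style response shape; adjust the field names
    if your provider returns something different.
    """
    choice = response["choices"][0]
    # content may be missing, None, or whitespace when the model only "thought"
    content = (choice.get("message", {}).get("content") or "").strip()
    finish_reason = choice.get("finish_reason")

    if not content and finish_reason == "length":
        # Token budget spent before any visible output appeared
        return "budget_exhausted"
    if not content:
        return "empty"
    if finish_reason == "length":
        # Some visible output, but cut off by the token limit
        return "truncated"
    return "ok"


# Hypothetical response mirroring the Kimi K2.5 symptom:
# 300 completion tokens, no visible text, finish_reason "length".
example = {
    "choices": [{"message": {"content": ""}, "finish_reason": "length"}],
    "usage": {"completion_tokens": 300},
}
print(diagnose_response(example))  # prints "budget_exhausted"
```

Running a check like this over a whole benchmark run separates "the model answered wrong" from "the model never got to answer", which is the distinction that matters before excluding a model.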