I've spent weeks stress-testing Apple's on-device model, the ~3B-parameter one that runs on the Neural Engine of any Apple Silicon Mac. To test it thoroughly, I built Think Local, a macOS app that exercises every capability of the model: chat, image generation, structured output, tool calling, and parameter comparison.

My conclusion: as a chatbot, the model is terrible. As a structured output and tool calling engine, it's surprisingly good. This distinction matters because it completely changes what you should use this model for.

Chat is disappointing, and that's fine

Apple's model has a 4,096-token context window. To put this in perspective: Claude offers up to 1M tokens and GPT-4o has 128K. With Apple, add a 200-token system prompt, a 150-token schema, and three conversation turns, and you're already at 70% capacity. Free-form text quality isn't impressive either.…
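
To make the context-window arithmetic concrete, here is a minimal sketch of the budget calculation. The 4,096-token window, the 200-token system prompt, the 150-token schema, and the 70% figure come from the text above; the per-turn size of 840 tokens is an assumption, back-solved from those numbers rather than measured, and the function names are hypothetical.

```python
CONTEXT_WINDOW = 4096  # tokens, as stated above


def context_usage(system_tokens, schema_tokens, turn_tokens):
    """Return (tokens_used, fraction_of_window) for one session."""
    used = system_tokens + schema_tokens + sum(turn_tokens)
    return used, used / CONTEXT_WINDOW


def implied_tokens_per_turn(target_fraction, system_tokens, schema_tokens, turns):
    """Average tokens per turn needed to fill target_fraction of the window."""
    remaining = target_fraction * CONTEXT_WINDOW - system_tokens - schema_tokens
    return remaining / turns


if __name__ == "__main__":
    # What "three turns = 70%" implies about the size of each turn.
    per_turn = implied_tokens_per_turn(0.70, 200, 150, 3)
    print(f"implied tokens per turn: {per_turn:.0f}")

    # Assumed turn size (prompt plus response), rounded from the value above.
    used, fraction = context_usage(200, 150, [840, 840, 840])
    print(f"{used} tokens used, {fraction:.0%} of the window")
```

Under these assumptions each turn averages roughly 840 tokens, so a fourth turn of the same size would push the session past 90% of the window.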