On-device LLM on iPhone: which runtime is fastest? MLX vs llama.cpp vs LiteRT-LM vs CoreML

DEV Community: machinelearning · Daisuke Majima · about 14 hours ago

I want to run an LLM on an iPhone. But there are several runtimes, it's not obvious which one to pick, and I couldn't find many head-to-head benchmarks.

The runtimes, in a nutshell:

MLX: Apple charging into the on-device LLM scene and pushing hard.
llama.cpp: the mature, battle-tested community standard for local LLMs.
LiteRT-LM: Gemma 4 only, but it's Google's heavyweight, finally deployed.
CoreML-LLM: lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it. Can it even compete...?

Fine, let's just do it. On an iPhone 17 Pro (A19 Pro), I ran the same models on four on-device inference runtimes and measured decode speed and memory.

The conclusion: "For local LLMs on iPhone, MLX by default." "For Gemma 4 specifically, LiteRT-LM is unbeatable."

Conclusion first

Decode speed: Qwen 3.5 2B is fastest on MLX (61 tok/s). Gemma 4 E2B is a decisive win for LiteRT-LM (55 tok/s).
Memory: CoreML / ANE (Apple Neural Engine) wins by a landslide.…
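Decode speed is easy to conflate with end-to-end generation speed, because the first token also pays for prefill (processing the prompt). The excerpt does not say how the runs were timed, so the following is only a minimal sketch, assuming you log one wall-clock timestamp when the request is submitted and one each time an output token is emitted. The function name `decode_stats` and the `DecodeStats` fields are hypothetical, chosen for illustration; the sketch separates time to first token from steady-state decode tok/s and does not cover memory.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DecodeStats:
    ttft_s: float            # time to first token; dominated by prefill
    decode_tok_per_s: float  # throughput of the tokens generated after the first


def decode_stats(start_s: float, token_times_s: Sequence[float]) -> DecodeStats:
    """Split one generation run into prefill latency and decode throughput.

    start_s: wall-clock time (seconds) the request was submitted (assumed logged).
    token_times_s: wall-clock time (seconds) each output token was emitted, in order.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    if any(b < a for a, b in zip(token_times_s, token_times_s[1:])):
        raise ValueError("token timestamps must be non-decreasing")

    ttft = token_times_s[0] - start_s

    # The first token's latency includes prefill, so decode throughput counts
    # only the tokens emitted after it, over the time elapsed since it appeared.
    decode_window = token_times_s[-1] - token_times_s[0]
    if decode_window <= 0:
        raise ValueError("decode window must be positive")
    return DecodeStats(ttft, (len(token_times_s) - 1) / decode_window)


if __name__ == "__main__":
    # Synthetic run (not measured data): first token after 0.4 s,
    # then 50 more tokens spaced 20 ms apart.
    times = [0.4 + 0.02 * i for i in range(51)]
    stats = decode_stats(0.0, times)
    # roughly 0.4 s TTFT and 50 tok/s decode
    print(f"TTFT {stats.ttft_s:.2f} s, decode {stats.decode_tok_per_s:.1f} tok/s")
```

Excluding the first token is a design choice: if it were included, a long prompt would drag the reported tok/s down even when per-token generation is just as fast, which would blur a comparison between runtimes that prefill very differently.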
