On-device LLM on iPhone: which runtime is fastest? MLX vs llama.cpp vs LiteRT-LM vs CoreML

DEV Community: machinelearning·Daisuke Majima·about 14 hours ago

I want to run an LLM on iPhone. But there are several runtimes and it's not obvious which to pick. And I couldn't find many head-to-head benchmarks.

| Runtime | In a nutshell |
| --- | --- |
| MLX | Apple charging into the on-device-LLM scene and pushing hard. |
| llama.cpp | The mature, battle-tested community standard for local LLMs. |
| LiteRT-LM | Gemma-4 only, but Google's heavyweight, finally deployed. |
| CoreML-LLM | Lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it; can it even compete...? |

Fine, let's just do it.

On an iPhone 17 Pro (A19 Pro), I ran the same model on four on-device inference runtimes and measured decode speed and memory. The conclusion: "For local LLMs on iPhone, MLX by default." "For Gemma 4 specifically, LiteRT-LM is unbeatable."

Conclusion first

- Decode speed: Qwen 3.5 2B is fastest on MLX (61 tok/s). Gemma 4 E2B is a decisive win for LiteRT-LM (55 tok/s).
- Memory: CoreML / ANE (Apple Neural Engine) wins by a landslide.…
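The decode-speed figures are quoted in tokens per second. As a rough illustration of how such a number can be derived, here is a minimal, runtime-agnostic sketch in Python. It is not the benchmark harness used for these measurements: the function name `decode_tokens_per_second`, the `fake_stream` generator, and the choice to exclude the wait for the first token (prefill) are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Iterator


def decode_tokens_per_second(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decode throughput in tokens per second.

    Assumption: the wait for the first token (prefill) is excluded,
    so only the intervals between consecutive tokens are counted.
    """
    first_token_at = None
    last_token_at = None
    count = 0
    for _ in tokens:
        now = clock()  # timestamp taken when each token arrives
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    # Fewer than two tokens (or zero elapsed time) gives no measurable interval.
    if count < 2 or last_token_at == first_token_at:
        return 0.0
    return (count - 1) / (last_token_at - first_token_at)


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a runtime's streaming token generator."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    rate = decode_tokens_per_second(fake_stream(20, 0.01))
    print(f"decode speed: {rate:.1f} tok/s")
```

On a device, the same bookkeeping would wrap whatever streaming callback or iterator the chosen runtime exposes. Keeping prefill out of the measurement matters because prompt length would otherwise skew the comparison between runtimes.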
