How we catch silent NPU fallback on Snapdragon in CI (and why your eval set won't)


DEV Community·ashish-frozo·18 days ago

TL;DR: ONNX Runtime's QNN execution provider will quietly route unsupported ops to the CPU instead of the Hexagon NPU. Your accuracy is fine. Your eval set is fine. Median latency on a clean device looks fine. Then production traffic hits a different input distribution, more ops fall back, and p95 latency triples. The fix isn't more eval data; it's three CI assertions: run on real hardware, gate on median and coefficient of variation, and parse the ORT profiling output to assert what fraction of FLOPs actually ran on the NPU.

The pain

A team we work with ships a person-detection model on a robotics platform with a Snapdragon 8 Gen 3 SoC. The model is YOLOv8n, quantized to INT8 with AIMET, compiled through Qualcomm AI Hub, and exported as an ONNX graph that ORT loads with the QNN execution provider targeting the Hexagon NPU in HTP performance mode. Pre-merge they run a 5,000-image eval set on a development board. Latency: median 8.2 ms, p95 9.1 ms.…
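As a rough illustration of the two software-side assertions named in the TL;DR, here is a minimal Python sketch. It is not the team's actual CI code. It assumes the ORT profiling file is a JSON list of trace events in which per-node events have `cat` equal to `"Node"`, a name ending in `_kernel_time`, a `dur` in microseconds, and an `args.provider` field. It also uses share of kernel time as a stand-in for share of FLOPs, since the profile reports durations rather than FLOP counts. The function names and every threshold are hypothetical.

```python
import json
import statistics
from collections import defaultdict


def latency_gate(latencies_ms, max_median_ms, max_cv):
    """Return (passed, median, cv) for a list of per-inference latencies."""
    median = statistics.median(latencies_ms)
    # Coefficient of variation: sample standard deviation over the mean.
    cv = statistics.stdev(latencies_ms) / statistics.mean(latencies_ms)
    return (median <= max_median_ms and cv <= max_cv), median, cv


def provider_time_fractions(events):
    """Share of total kernel time spent in each execution provider."""
    totals = defaultdict(float)
    for ev in events:
        # Only per-node kernel events carry a provider (assumed layout).
        if ev.get("cat") != "Node":
            continue
        if not ev.get("name", "").endswith("_kernel_time"):
            continue
        provider = ev.get("args", {}).get("provider", "unknown")
        totals[provider] += ev.get("dur", 0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: t / grand_total for p, t in totals.items()}


def assert_npu_fraction(profile_path, min_fraction,
                        npu_provider="QNNExecutionProvider"):
    """Fail the CI job if too little kernel time ran on the NPU provider."""
    with open(profile_path) as f:
        events = json.load(f)
    fraction = provider_time_fractions(events).get(npu_provider, 0.0)
    assert fraction >= min_fraction, (
        f"only {fraction:.1%} of kernel time ran on {npu_provider}; "
        f"expected at least {min_fraction:.0%}"
    )
    return fraction
```

In a CI job, `assert_npu_fraction` would be pointed at the profile file that ORT writes when profiling is enabled on the session, and `latency_gate` at the latencies measured on the real device. Because kernel time is only a proxy for FLOPs, a threshold tuned this way would need recalibrating per model.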
