TL;DR — ONNX Runtime's QNN execution provider will quietly route unsupported ops to the CPU instead of the Hexagon NPU. Your accuracy is fine. Your eval set is fine. Median latency on a clean device looks fine. Then production traffic hits a different input distribution, more ops fall back, and p95 latency triples. The fix isn't more eval data — it's three CI assertions: run on real hardware, gate on median and coefficient of variation, and parse the ORT profiling output to assert what fraction of FLOPs actually ran on the NPU. The pain A team we work with ships a person-detection model on a robotics platform with a Snapdragon 8 Gen 3 SoC. The model is YOLOv8n, quantized to INT8 with AIMET, compiled through Qualcomm AI Hub, exported as an ONNX graph that ORT loads with the QNN execution provider targeting the Hexagon NPU in HTP performance mode. Pre-merge they run a 5,000-image eval set on a development board. Latency: median 8.2 ms , p95 9.1 ms .…