TL;DR: We profiled a LiDAR object detector expecting the 3D backbone to dominate. It didn't. Voxelization plus the scatter-to-pillars step ate roughly 40% of per-frame latency on an A100, and pulling them out of the Python hot path took our p50 from 31 ms down to 19 ms.

## The assumption that cost us two weeks

When I was at Valeo.ai we ran a PointPillars-style detector on nuScenes-scale point clouds, around 250k points per sweep. The mental model everyone carried was simple. The sparse conv backbone is heavy, so the backbone is where the milliseconds go. We spent a sprint trying to prune channels and fuse BatchNorm into the conv weights before anyone actually looked at a trace.

When we finally ran `torch.profiler` with CUDA activities enabled, the picture was not what we expected. The 2D CNN head and the pillar feature net were fast. The expensive part lived upstream, in the code nobody thought of as "the model."

## What the trace actually showed

To be precise, two things dominated.…