The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.

What is CUB?

If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.

While libraries like Thrust provide a high-level, “host-side” interface similar to the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives.…
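
To make the “two-phase” pattern concrete, here is a minimal host-only sketch of its calling convention. The function `mock_exclusive_sum` and the wrapper `exclusive_sum_two_phase` are hypothetical names invented for this illustration; they are not part of CUB and run entirely on the CPU. They mirror the shape that CUB device-wide primitives such as `cub::DeviceScan::ExclusiveSum` use: a first call with a null temporary-storage pointer only reports how many bytes are needed, and a second call with an allocated buffer does the work. In real CUB code the buffer would come from `cudaMalloc` rather than `std::malloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical host-only stand-in for a two-phase primitive. This is NOT a
// CUB function; it only imitates the calling convention.
//   Phase 1: temp_storage == nullptr -> report the required size and return.
//   Phase 2: temp_storage != nullptr -> run the algorithm using the buffer.
// Returns 0 on success, 1 if the supplied buffer is too small.
int mock_exclusive_sum(void* temp_storage, std::size_t& temp_storage_bytes,
                       const int* in, int* out, std::size_t num_items) {
  // Always ask for at least one byte so that a valid allocation is never
  // confused with the nullptr size query.
  std::size_t required = num_items * sizeof(int);
  if (required == 0) required = 1;

  if (temp_storage == nullptr) {
    temp_storage_bytes = required;  // size query only, no work is done
    return 0;
  }
  if (temp_storage_bytes < required) return 1;

  // Compute the prefix sums into the scratch buffer, then copy them out.
  int* scratch = static_cast<int*>(temp_storage);
  int running = 0;
  for (std::size_t i = 0; i < num_items; ++i) {
    scratch[i] = running;
    running += in[i];
  }
  for (std::size_t i = 0; i < num_items; ++i) out[i] = scratch[i];
  return 0;
}

// The boilerplate every caller repeats: query, allocate, run, free.
std::vector<int> exclusive_sum_two_phase(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::size_t temp_bytes = 0;

  // Phase 1: ask how much temporary storage is needed.
  int err = mock_exclusive_sum(nullptr, temp_bytes, in.data(), out.data(),
                               in.size());
  assert(err == 0);

  // Allocate the scratch buffer (cudaMalloc in real CUB code).
  void* temp = std::malloc(temp_bytes);
  assert(temp != nullptr);

  // Phase 2: the same call again, now with the buffer, does the work.
  err = mock_exclusive_sum(temp, temp_bytes, in.data(), out.data(), in.size());
  assert(err == 0);
  (void)err;  // silence unused-variable warnings when NDEBUG is defined

  std::free(temp);
  return out;
}
```

Calling `exclusive_sum_two_phase({3, 1, 4, 1, 5})` returns `{0, 3, 4, 8, 9}`. The four steps inside the wrapper (query, allocate, run, free) are the repetitive boilerplate that a single-call API can hide from the caller.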