Streamlining CUB with a Single-Call API

NVIDIA Technical Blog · Giannis Gonidelis · about 1 month ago

The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code.

This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.

What is CUB?

If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.

While libraries like Thrust provide a high-level, “host-side” interface similar to the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives.…
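To make the “two-phase” pattern mentioned at the top of the post concrete, here is a minimal sketch of an exclusive prefix sum using the long-standing `cub::DeviceScan::ExclusiveSum` overload that takes a temporary-storage pointer and size. The helper name `exclusive_sum_two_phase` and the host-side `std::vector` wrapper are assumptions made for illustration and do not come from the post; error checking is omitted to keep the boilerplate visible.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the post): exclusive prefix sum of `input`
// computed on the GPU with the classic two-phase CUB call pattern.
// CUDA error checking is omitted for brevity.
std::vector<int> exclusive_sum_two_phase(const std::vector<int>& input) {
    const int num_items = static_cast<int>(input.size());
    std::vector<int> output(input.size());
    if (num_items == 0) {
        return output;
    }

    const std::size_t bytes = input.size() * sizeof(int);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, input.data(), bytes, cudaMemcpyHostToDevice);

    // Phase 1: with a null temp-storage pointer, the call does no work and
    // only writes the required scratch size into temp_storage_bytes.
    void* d_temp_storage = nullptr;
    std::size_t temp_storage_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);

    // The caller is responsible for allocating the scratch buffer.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the same call, now with real scratch memory, runs the scan.
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);

    // cudaMemcpy on the default stream waits for the scan to finish.
    cudaMemcpy(output.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_temp_storage);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}
```

Calling `exclusive_sum_two_phase({1, 2, 3, 4})` should yield `{0, 1, 3, 6}`. The size query, the scratch allocation, and the repeated call are the boilerplate that, per the post, the single-call API is meant to absorb.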
