Making Softmax More Efficient with NVIDIA Blackwell Ultra


NVIDIA Technical Blog·Jamie Li·about 1 month ago


#agenticaigenerativeai #datacentercloud #cloudservices #blackwell #attention


LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI “speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function. Transcendentals are functions that cannot be expressed as the root of a polynomial equation with rational coefficients. Consequently, they “transcend” basic algebraic operations like addition and multiplication, the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function, which is executed on Special Function Units (SFUs). In NVIDIA assembly instructions (SASS), this function is invoked via the MUFU.EX2 instruction.…
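To make the link between softmax and a base-2 exponential concrete, here is a minimal sketch in plain Python. It assumes the common rewrite e^x = 2^(x · log2 e), which is how a natural exponential is typically mapped onto a base-2 exponential such as the one MUFU.EX2 provides. The function names `softmax_exp2` and `softmax_reference` are hypothetical and chosen for illustration; this is not the article's kernel, only a scalar model of the arithmetic.

```python
import math

# Scale factor so that e**x == 2**(x * LOG2_E).
LOG2_E = math.log2(math.e)


def softmax_exp2(scores):
    """Numerically stable softmax written in terms of a base-2 exponential."""
    m = max(scores)  # subtract the max so every exponent is <= 0
    # One base-2 exponential per element: this is the transcendental step.
    weights = [2.0 ** ((s - m) * LOG2_E) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_reference(scores):
    """Same computation using the natural exponential directly."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0]
    print(softmax_exp2(row))
    print(softmax_reference(row))
```

The two functions should agree to within floating-point rounding. The point of the sketch is that each attention score costs one exponential, so the exponential count grows with the number of scores rather than being absorbed by the matrix multiplications around it.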
