Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel


NVIDIA Technical Blog·Fan Yu·about 1 month ago




In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging. EP communication is essentially all-to-all, but because it is dynamic and sparse (each AI token is routed to only its top-k experts rather than to all experts), it is difficult to implement and optimize.

This post details an efficient MoE EP communication solution, Hybrid-EP, and its use in the NVIDIA Megatron family of frameworks on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also examines the effectiveness of Hybrid-EP in real-world model training.

Efficiency challenges of hyperscale MoE model training

DeepSeek-V3 is a representative model of the new generation of large-scale fine-grained MoE models.…
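To make the sparse, dynamic all-to-all pattern described in the introduction concrete, here is a minimal Python sketch. It is not Hybrid-EP or any Megatron API. The function names (`topk_experts`, `dispatch_counts`), the contiguous expert-to-rank placement, and the choice to count one copy per (token, expert) pair are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def topk_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(scores)), key=lambda e: scores[e])


def dispatch_counts(token_scores, k, num_ranks):
    """Count how many token copies one rank would send to each EP rank.

    Assumption: experts are placed contiguously, so expert e lives on
    rank e // experts_per_rank, and each (token, expert) pair is one copy.
    """
    num_experts = len(token_scores[0])
    experts_per_rank = num_experts // num_ranks
    counts = Counter()
    for scores in token_scores:
        for expert in topk_experts(scores, k):
            counts[expert // experts_per_rank] += 1
    return [counts[rank] for rank in range(num_ranks)]


# Hypothetical router scores: 3 tokens, 4 experts, 2 EP ranks, top-2 routing.
token_scores = [
    [0.9, 0.1, 0.0, 0.8],  # experts 0 and 3 -> ranks 0 and 1
    [0.2, 0.7, 0.1, 0.1],  # experts 1 and 0 -> rank 0 twice
    [0.5, 0.4, 0.3, 0.9],  # experts 3 and 0 -> ranks 1 and 0
]
send_counts = dispatch_counts(token_scores, k=2, num_ranks=2)
print(send_counts)  # [4, 2]
```

The per-destination counts depend on the router output, so they change from batch to batch and are generally uneven across ranks. That is the sense in which the exchange is all-to-all in shape but dynamic and sparse in content.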
