As AI models and datasets continue to grow, relying only on higher-precision BF16 training is no longer sufficient. Training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models.

Lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, which improves training efficiency and lowers costs.

This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion-token pretraining runs and downstream benchmarks:

- 8-bit floating point with per-tensor current scaling (FP8-CS)
- Mixed-precision training with microscaling FP8 (MXFP8)
- NVFP4 precision training

The comparison uses NVIDIA NeMo Megatron Bridge, an open source library that is part of the NVIDIA NeMo framework.

We present practical, large-scale results showing how low-precision…