Your GPU's FP32 Output Is Lying: Truncation in Tensor Cores

NVIDIA H100 and RTX4000 Tensor Cores truncate FP32 outputs to 13-bit mantissas during FP8 matmuls. See how this affects your model accuracy.

Admin
Mar 01, 2026
4 min read

Editorial Note

Reviewed and analyzed by the ScoRpii Tech Editorial Team.

Tensor Cores and GEMM: A Precision Trade-off

Modern GPUs leverage Tensor Cores to accelerate General Matrix Multiplication (GEMM) operations, delivering substantial performance gains for workloads like deep learning. The headline teraflop ratings are real, but one critical detail is often overlooked: the internal precision of these engines. Tensor Cores are optimized for speed and frequently employ accumulators with bit-depths lower than the final output format, introducing truncation errors that can affect numerical accuracy.

The Anatomy of Internal Truncation in FP8

When using FP8 (e4m3) inputs, the matrix multiplication engine performs exponent addition (4 bits + 4 bits) and significand multiplication (4 bits × 4 bits, counting the implicit leading bit). The architectural constraint arises during accumulation: NVIDIA implements an accumulator precision referred to as e8mNacc, where Nacc is less than the 23 mantissa bits of standard FP32 (e8m23).

Specifically, on H100 and RTX4000-series hardware, the accumulator uses an FP22 format (e8m13), so your results are truncated. The number of truncated bits (Ntrun) is bounded by Ntrun ≤ 23 − Nacc, which for the 13-bit accumulator means up to 10 bits are lost. According to the PyTorch Blog, truncating up to 10 of the least significant bits (LSBs) of the mantissa in a software-side FP32 reference yields results identical to the hardware output. This is because the hardware has already discarded that data to conserve power and die area.
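You can reproduce this software-side check yourself. The sketch below (my own minimal implementation, not code from the PyTorch Blog) zeroes the n least significant mantissa bits of FP32 values by bit manipulation, which is what "truncating to e8m13" amounts to numerically:

```python
import numpy as np

def truncate_mantissa_lsbs(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Zero out the n_bits least significant mantissa bits of FP32 values."""
    bits = x.astype(np.float32).view(np.uint32)
    # Build a mask whose low n_bits are zero, e.g. 0xFFFFFC00 for n_bits=10.
    mask = np.uint32(((0xFFFFFFFF >> n_bits) << n_bits) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# FP32 carries 23 mantissa bits; an e8m13 accumulator keeps only 13,
# so up to 23 - 13 = 10 LSBs are discarded.
x = np.array([1.0 + 2.0**-23, 3.1415927], dtype=np.float32)
truncated = truncate_mantissa_lsbs(x, 10)
```

If a GEMM result computed in full FP32 still matches the hardware output after this 10-bit truncation, the hardware never produced those bits in the first place.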

This behavior is particularly relevant in the FP8-to-FP32 path, where the QGMMA instruction—specific to FP8 Tensor Cores—is invoked. The choice of accumulator precision is a direct consequence of the physical limitations of transistor density and power consumption within the GPU.

Verifying Precision with Triton and NCU

You can independently verify this precision loss using the Triton language and the NVIDIA Nsight Compute profiler (ncu). Triton’s tl.dot() function allows you to decompose a matmul(A, B) into specific block sizes, such as (BLOCK_SIZE_M, BLOCK_SIZE_K) for input A and (BLOCK_SIZE_K, BLOCK_SIZE_N) for input B.
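To make the tiling concrete, here is a CPU-side numpy stand-in for that decomposition (not actual Triton code, and the block sizes are just the ones quoted in the text): each output tile of shape (BLOCK_M, BLOCK_N) is accumulated over (BLOCK_M, BLOCK_K) × (BLOCK_K, BLOCK_N) sub-tiles, exactly the shapes tl.dot() consumes per step.

```python
import numpy as np

def blocked_matmul(A, B, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    """Tile C = A @ B the way a Triton kernel decomposes tl.dot():
    each program instance owns one (BLOCK_M, BLOCK_N) output tile and
    accumulates over (BLOCK_M, BLOCK_K) x (BLOCK_K, BLOCK_N) sub-tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, BLOCK_M):
        for n in range(0, N, BLOCK_N):
            acc = np.zeros((min(BLOCK_M, M - m), min(BLOCK_N, N - n)),
                           dtype=np.float32)
            for k in range(0, K, BLOCK_K):
                a = A[m:m + BLOCK_M, k:k + BLOCK_K].astype(np.float32)
                b = B[k:k + BLOCK_K, n:n + BLOCK_N].astype(np.float32)
                acc += a @ b  # on hardware, this inner dot is one MMA step
            C[m:m + BLOCK_M, n:n + BLOCK_N] = acc
    return C
```

On the GPU, each of those inner-loop dots is what lowers to a Tensor Core instruction, which is why the per-step K extent matters for what the profiler shows.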

Experiments with a block size of 64x64x32 demonstrate that the FP8xFP8 tl.dot() call directly translates into the QGMMA instruction on Ada Lovelace architectures. Increasing the BLOCK_SIZE_K to 128, where each WGMMA instruction handles K=32, consistently reveals only 13 effective mantissa bits.

Changing your kernel configuration to num_warps = 2 alters this behavior. This configuration packs FP8 data into FP16, triggering the HMMA (FP16-TensorCore-specific) instruction instead of QGMMA. The FP16 Tensor Core accumulator is one bit shorter than the standard FP32 accumulator, but still differs from the 13-bit limit of the dedicated FP8 path.

Furthermore, for INT8xINT8 engines, truncation occurs on the most significant bits (MSBs), adding another layer of complexity to the numerical stability of low-precision workloads.
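To see why MSB-side loss is qualitatively different, consider a toy model of a deliberately narrow integer accumulator (a hypothetical width chosen for illustration, not the actual hardware parameters): once the running sum exceeds the accumulator range, the high bits wrap away and the result is silently wrong by a multiple of the accumulator span.

```python
def narrow_accumulate(products, acc_bits=16):
    """Sum integer products in a hypothetical signed acc_bits-wide
    accumulator with two's-complement wraparound. When the true sum
    exceeds the accumulator range, the MSBs are lost."""
    lo, span = -(1 << (acc_bits - 1)), 1 << acc_bits
    acc = 0
    for p in products:
        acc = ((acc + p - lo) % span) + lo  # wrap into [lo, lo + span)
    return acc

# 300 products of 127 * 127 sum to 4,838,700, far beyond what a
# 16-bit signed accumulator can hold, so the wrapped result diverges
# from the true sum by a multiple of 2**16.
true_sum = 300 * 127 * 127
wrapped = narrow_accumulate([127 * 127] * 300, acc_bits=16)
```

Unlike LSB truncation, which bounds the relative error, MSB loss can make the result arbitrarily far from the true value, which is why it poses a distinct stability hazard.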

Implications for Your Workloads

This precision gap has direct implications for model tuning. If you rely on cuBLAS, the CUBLASLT_MATMUL_DESC_FAST_ACCUM flag may activate these lower-precision paths for speed, at the cost of numerical accuracy. Autotuning processes often suggest large block sizes like 256 or 512, but the hardware-level truncation remains a fixed constraint of the H100 and RTX4000 architectures.

If your model architecture is sensitive to accumulation errors—particularly in deep networks with many successive matrix multiplications—you must account for the fact that your FP32 outputs are effectively FP22. Consider the following:

  • Numerical Stability: Monitor for divergence or unexpected behavior in long-running computations.
  • Loss Scaling: Experiment with loss scaling techniques to mitigate potential underflow issues.
  • Precision Awareness: Design your models with an understanding of the inherent precision limitations of the hardware.
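The loss-scaling point can be demonstrated in a few lines of numpy (a generic underflow illustration, not tied to any particular training framework): a gradient of 2⁻²⁶ is below FP16's smallest subnormal (2⁻²⁴) and underflows to zero, while scaling before the downcast and unscaling afterwards in FP32 recovers it exactly.

```python
import numpy as np

# A gradient of 2**-26 underflows to zero when cast to FP16, whose
# smallest subnormal is 2**-24. Scaling by 2**10 before the cast keeps
# the value representable; dividing afterwards in FP32 recovers it.
grad_fp32 = np.float32(2.0**-26)

naive = np.float16(grad_fp32)            # underflows to 0.0
scale = np.float32(2.0**10)
scaled = np.float16(grad_fp32 * scale)   # 2**-16, representable in FP16
recovered = np.float32(scaled) / scale   # exactly 2**-26 again
```

In practice the scale factor is chosen (or adapted dynamically) so that gradients stay in the representable range without overflowing at the top end.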

The Bottom Line for Developers

You must acknowledge the internal precision limitations of Tensor Cores when deploying FP8 workloads. While these accelerators offer significant performance benefits, the reduced accumulator precision can introduce subtle errors. Thorough verification, careful tuning, and an awareness of the underlying hardware constraints are essential for achieving both speed and accuracy in your applications. Ignoring these details can lead to unexpected results and reduced model reliability.

Originally reported by

PyTorch Blog
