Your GPU's FP32 Output Is Lying: Truncation in Tensor Cores

NVIDIA H100 and RTX4000 Tensor Cores truncate FP32 outputs to 13-bit mantissas during FP8 matmuls. See how this affects your model accuracy.

Admin
Mar 01, 2026
4 min read

Editorial Note

Reviewed and analyzed by the ScoRpii Tech Editorial Team.

Tensor Cores and GEMM: A Precision Trade-off

Modern GPUs leverage Tensor Cores to accelerate General Matrix Multiplication (GEMM) operations, delivering substantial performance gains for workloads like deep learning. The headline teraflop ratings are real, but one critical detail is often overlooked: the internal precision of these engines. Tensor Cores are optimized for speed and frequently employ accumulators with bit-depths lower than the final output format, introducing truncation errors that can affect numerical accuracy.

The Anatomy of Internal Truncation in FP8

When using FP8 (e4m3) inputs, the matrix multiplication engine performs exponent addition (4 bits + 4 bits) and significand multiplication (4 bits × 4 bits, counting the implicit leading bit). The architectural constraint arises during accumulation: NVIDIA implements an accumulator precision referred to as e8mNacc, where Nacc is less than the 23 mantissa bits of standard FP32 (e8m23).

Specifically, on H100 and RTX4000-series hardware, the accumulator uses an FP22 format (e8m13), so your results are truncated. The number of truncated bits (Ntrun) is bounded by Ntrun ≤ 23 − Nacc, which for the 13-bit accumulator means up to 10 bits are lost. According to the PyTorch Blog, truncating up to 10 of the least significant bits (LSBs) of the mantissa in a software-side FP32 reference yields results identical to the hardware output. This is because the hardware has already discarded that data to conserve power and die area.
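You can reproduce this software-side check yourself. The sketch below (my own minimal implementation, not code from the PyTorch Blog) zeroes the n least significant mantissa bits of FP32 values by bit manipulation, which is what "truncating to e8m13" amounts to numerically:

```python
import numpy as np

def truncate_mantissa_lsbs(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Zero out the n_bits least significant mantissa bits of FP32 values."""
    bits = x.astype(np.float32).view(np.uint32)
    # Build a mask whose low n_bits are zero, e.g. 0xFFFFFC00 for n_bits=10.
    mask = np.uint32(((0xFFFFFFFF >> n_bits) << n_bits) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# FP32 carries 23 mantissa bits; an e8m13 accumulator keeps only 13,
# so up to 23 - 13 = 10 LSBs are discarded.
x = np.array([1.0 + 2.0**-23, 3.1415927], dtype=np.float32)
truncated = truncate_mantissa_lsbs(x, 10)
```

If a GEMM result computed in full FP32 still matches the hardware output after this 10-bit truncation, the hardware never produced those bits in the first place.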

This behavior is particularly relevant in the FP8-to-FP32 path, where the QGMMA instruction—specific to FP8 Tensor Cores—is invoked. The choice of accumulator precision is a direct consequence of the physical limitations of transistor density and power consumption within the GPU.

Verifying Precision with Triton and NCU

You can independently verify this precision loss using the Triton language and the NVIDIA Nsight Compute profiler (ncu). Triton’s tl.dot() function allows you to decompose a matmul(A, B) into specific block sizes, such as (BLOCK_SIZE_M, BLOCK_SIZE_K) for input A and (BLOCK_SIZE_K, BLOCK_SIZE_N) for input B.
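To make the tiling concrete, here is a CPU-side numpy stand-in for that decomposition (not actual Triton code, and the block sizes are just the ones quoted in the text): each output tile of shape (BLOCK_M, BLOCK_N) is accumulated over (BLOCK_M, BLOCK_K) × (BLOCK_K, BLOCK_N) sub-tiles, exactly the shapes tl.dot() consumes per step.

```python
import numpy as np

def blocked_matmul(A, B, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    """Tile C = A @ B the way a Triton kernel decomposes tl.dot():
    each program instance owns one (BLOCK_M, BLOCK_N) output tile and
    accumulates over (BLOCK_M, BLOCK_K) x (BLOCK_K, BLOCK_N) sub-tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, BLOCK_M):
        for n in range(0, N, BLOCK_N):
            acc = np.zeros((min(BLOCK_M, M - m), min(BLOCK_N, N - n)),
                           dtype=np.float32)
            for k in range(0, K, BLOCK_K):
                a = A[m:m + BLOCK_M, k:k + BLOCK_K].astype(np.float32)
                b = B[k:k + BLOCK_K, n:n + BLOCK_N].astype(np.float32)
                acc += a @ b  # on hardware, this inner dot is one MMA step
            C[m:m + BLOCK_M, n:n + BLOCK_N] = acc
    return C
```

On the GPU, each of those inner-loop dots is what lowers to a Tensor Core instruction, which is why the per-step K extent matters for what the profiler shows.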

Experiments with a block size of 64x64x32 demonstrate that the FP8xFP8 tl.dot() call directly translates into the QGMMA instruction on Ada Lovelace architectures. Increasing the BLOCK_SIZE_K to 128, where each WGMMA instruction handles K=32, consistently reveals only 13 effective mantissa bits.

Changing your kernel configuration to num_warps = 2 alters this behavior. This configuration packs FP8 data into FP16, triggering the HMMA (FP16-TensorCore-specific) instruction instead of QGMMA. The FP16 Tensor Core accumulator is one bit shorter than the standard FP32 accumulator, but still differs from the 13-bit limit of the dedicated FP8 path.

Furthermore, for INT8xINT8 engines, truncation occurs on the most significant bits (MSBs), adding another layer of complexity to the numerical stability of low-precision workloads.
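To see why MSB-side loss is qualitatively different, consider a toy model of a deliberately narrow integer accumulator (a hypothetical width chosen for illustration, not the actual hardware parameters): once the running sum exceeds the accumulator range, the high bits wrap away and the result is silently wrong by a multiple of the accumulator span.

```python
def narrow_accumulate(products, acc_bits=16):
    """Sum integer products in a hypothetical signed acc_bits-wide
    accumulator with two's-complement wraparound. When the true sum
    exceeds the accumulator range, the MSBs are lost."""
    lo, span = -(1 << (acc_bits - 1)), 1 << acc_bits
    acc = 0
    for p in products:
        acc = ((acc + p - lo) % span) + lo  # wrap into [lo, lo + span)
    return acc

# 300 products of 127 * 127 sum to 4,838,700, far beyond what a
# 16-bit signed accumulator can hold, so the wrapped result diverges
# from the true sum by a multiple of 2**16.
true_sum = 300 * 127 * 127
wrapped = narrow_accumulate([127 * 127] * 300, acc_bits=16)
```

Unlike LSB truncation, which bounds the relative error, MSB loss can make the result arbitrarily far from the true value, which is why it poses a distinct stability hazard.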

Implications for Your Workloads

This precision gap has direct implications for model tuning. If you rely on cuBLAS, the CUBLASLT_MATMUL_DESC_FAST_ACCUM flag may activate these lower-precision paths for speed, at the cost of numerical accuracy. Autotuning processes often suggest large block sizes like 256 or 512, but the hardware-level truncation remains a fixed constraint of the H100 and RTX4000 architectures.

If your model architecture is sensitive to accumulation errors—particularly in deep networks with many successive matrix multiplications—you must account for the fact that your FP32 outputs are effectively FP22. Consider the following:

  • Numerical Stability: Monitor for divergence or unexpected behavior in long-running computations.
  • Loss Scaling: Experiment with loss scaling techniques to mitigate potential underflow issues.
  • Precision Awareness: Design your models with an understanding of the inherent precision limitations of the hardware.
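The loss-scaling point can be demonstrated in a few lines of numpy (a generic underflow illustration, not tied to any particular training framework): a gradient of 2⁻²⁶ is below FP16's smallest subnormal (2⁻²⁴) and underflows to zero, while scaling before the downcast and unscaling afterwards in FP32 recovers it exactly.

```python
import numpy as np

# A gradient of 2**-26 underflows to zero when cast to FP16, whose
# smallest subnormal is 2**-24. Scaling by 2**10 before the cast keeps
# the value representable; dividing afterwards in FP32 recovers it.
grad_fp32 = np.float32(2.0**-26)

naive = np.float16(grad_fp32)            # underflows to 0.0
scale = np.float32(2.0**10)
scaled = np.float16(grad_fp32 * scale)   # 2**-16, representable in FP16
recovered = np.float32(scaled) / scale   # exactly 2**-26 again
```

In practice the scale factor is chosen (or adapted dynamically) so that gradients stay in the representable range without overflowing at the top end.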

The Bottom Line for Developers

You must acknowledge the internal precision limitations of Tensor Cores when deploying FP8 workloads. While these accelerators offer significant performance benefits, the reduced accumulator precision can introduce subtle errors. Thorough verification, careful tuning, and an awareness of the underlying hardware constraints are essential for achieving both speed and accuracy in your applications. Ignoring these details can lead to unexpected results and reduced model reliability.

Originally reported by

PyTorch Blog
