The performance of modern neural network training is increasingly limited by memory bandwidth rather than raw arithmetic. Lower-precision numeric formats can reduce memory traffic and raise arithmetic throughput at the same time, provided the loss in precision does not compromise model quality.

OpenNN supports BF16 (Brain Floating Point 16) mixed precision on Ampere-class NVIDIA GPUs and newer. With one configuration call, training kernels run several times faster and memory consumption is roughly halved, without any change to model code or hyperparameters.

 

Contents:

  1. Introduction.
  2. Mixed-Precision Strategy.
  3. Enabling BF16 in OpenNN.
  4. Hardware Requirements.
  5. Querying the Active Configuration.
  6. Conclusions.

 

1. Introduction

Floating-point numbers in neural networks serve two roles: storing the model’s parameters (the weights the optimizer updates), and computing the activations that flow forward and backward through each layer.

Single-precision floating point (FP32) uses 32 bits per number (1 sign, 8 exponent, 23 mantissa) and has been the default precision for deep learning for over a decade. It is numerically robust, but every value takes twice the memory of a 16-bit format, and FP32 arithmetic leaves the BF16 tensor cores on Ampere and newer GPUs unused.

BF16 uses 16 bits (1 sign, 8 exponent, 7 mantissa), keeping FP32’s full exponent range, so values overflow and underflow at essentially the same magnitudes. Only the mantissa is shorter: relative precision drops to roughly two to three significant decimal digits, but dynamic range is preserved. This makes BF16 a much safer drop-in replacement for FP32 than the older FP16 format, which has only 5 exponent bits and overflows for any magnitude above 65,504.
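The relationship between the two formats can be made concrete with a small conversion routine. The following is a minimal sketch, not OpenNN code: the helper names float_to_bf16_bits, bf16_bits_to_float, and round_to_bf16 are hypothetical, and the conversion assumes round-to-nearest-even with NaN handling left out. Because BF16 shares FP32’s sign and exponent layout, the conversion amounts to keeping the upper 16 bits of the FP32 bit pattern.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only (not part of OpenNN).

// Convert an FP32 value to its 16-bit BF16 pattern by keeping the upper
// 16 bits, with round-to-nearest-even applied to the 16 discarded bits.
// NaN inputs are not handled specially in this sketch.
inline std::uint16_t float_to_bf16_bits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// Widen a BF16 pattern back to FP32 by appending 16 zero mantissa bits.
inline float bf16_bits_to_float(std::uint16_t half)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Round an FP32 value to the nearest value representable in BF16.
inline float round_to_bf16(float value)
{
    return bf16_bits_to_float(float_to_bf16_bits(value));
}
```

With these helpers, round_to_bf16(70000.0f) yields 70144.0f: the value survives, although it is above the FP16 maximum, and it lands on the nearest BF16 grid point. A small increment such as 1.0f + 0.001953125f rounds back to exactly 1.0f, which shows the coarser mantissa.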

 

2. Mixed-Precision Strategy

Pure BF16 training tends to be unstable: the format’s 7-bit mantissa is not precise enough to safely accumulate optimizer updates or gradient sums over many steps, because small increments are rounded away. OpenNN addresses this by keeping two copies of the model state in parallel:

  • A master copy in FP32 for parameters, gradients, and optimizer state (Adam moments, accumulations).
  • A working copy in BF16 on the GPU, used to compute the forward and backward activations.

After each parameter update, the FP32 master is automatically cast back to BF16 to refresh the working copy. There is no manual loss-scaling factor to tune — unlike FP16, BF16’s wider exponent range eliminates the need for it.
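The effect of the master copy can be illustrated on the CPU. The following is a minimal sketch under stated assumptions, not OpenNN’s implementation: MixedPrecisionWeights, apply_update, and pure_bf16_step are hypothetical names, the optimizer is plain SGD, and BF16 values are stored as FP32 numbers that have been rounded to the BF16 grid (round-to-nearest-even, no NaN handling).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Round an FP32 value to the nearest BF16-representable value
// (round-to-nearest-even; NaN handling omitted in this sketch).
inline float round_to_bf16(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    bits &= 0xFFFF0000u;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Hypothetical illustration of the two-copy scheme (not an OpenNN type).
struct MixedPrecisionWeights
{
    std::vector<float> master;   // FP32 master copy, updated by the optimizer
    std::vector<float> working;  // BF16-rounded copy used for compute

    explicit MixedPrecisionWeights(std::vector<float> initial)
        : master(std::move(initial)), working(master.size())
    {
        refresh_working_copy();
    }

    // Cast the FP32 master down to BF16 precision.
    void refresh_working_copy()
    {
        for (std::size_t i = 0; i < master.size(); ++i)
            working[i] = round_to_bf16(master[i]);
    }

    // Plain SGD step on the FP32 master, followed by the cast back to BF16.
    // Assumes gradient.size() == master.size().
    void apply_update(const std::vector<float>& gradient, float learning_rate)
    {
        for (std::size_t i = 0; i < master.size(); ++i)
            master[i] -= learning_rate * gradient[i];
        refresh_working_copy();
    }
};

// The same SGD step applied to a weight that only exists in BF16.
inline float pure_bf16_step(float weight, float gradient, float learning_rate)
{
    return round_to_bf16(weight - learning_rate * gradient);
}
```

Starting from a weight of 1.0 with a constant gradient of -1.0 and a learning rate of 0.001, one hundred calls to pure_bf16_step leave the weight at exactly 1.0, because each increment is smaller than half the BF16 spacing near 1.0 and is rounded away. The same hundred steps through apply_update move the FP32 master to about 1.1, and the working copy follows it to the nearest BF16 value.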

 

3. Enabling BF16 in OpenNN

BF16 is controlled by the Configuration singleton. Set it once, before constructing any datasets, networks, or training strategies.

#include "../../opennn/configuration.h"

using namespace opennn;

int main()
{
    // Train and infer in BF16 on the GPU
    Configuration::instance().set(Device::CUDA, Type::BF16, Type::BF16);

    // ... dataset, network, training strategy ...
}

The three arguments are the execution Device, the training numeric Type, and the inference numeric Type. All default to Auto, which on an Ampere+ GPU resolves to BF16 automatically. To let OpenNN pick the fastest valid combination on the current machine, call set() with no arguments.

 

4. Hardware Requirements

BF16 requires an NVIDIA GPU with CUDA compute capability 8.0 or higher, that is, Ampere or newer:

  • RTX 30xx and 40xx consumer GPUs.
  • A100, A40, H100, H200 data-center GPUs.

On older GPUs (Volta V100, Turing T4) or on CPU, OpenNN throws a clear error rather than silently falling back. This prevents subtle correctness issues from accidentally enabling an unsupported configuration.

runtime_error: "Configuration: BF16 training requires CUDA
                compute capability >= 8.0 (Ampere+)."
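The check behind this error can be sketched as a comparison on the device’s compute capability. This is an illustrative assumption about the logic, not OpenNN’s actual code: supports_bf16 and require_bf16_support are hypothetical names, and the major and minor numbers are passed in as plain integers. In a CUDA program they would typically come from the major and minor fields of cudaDeviceProp, filled in by cudaGetDeviceProperties.

```cpp
#include <stdexcept>

// Hypothetical helpers for illustration only (not part of OpenNN).

// BF16 needs compute capability 8.0 or higher, so only the major
// version decides; the minor version is accepted for completeness.
inline bool supports_bf16(int major, int minor)
{
    (void)minor;
    return major >= 8;
}

// Fail loudly instead of silently falling back to another precision.
inline void require_bf16_support(int major, int minor)
{
    if (!supports_bf16(major, minor))
    {
        throw std::runtime_error(
            "Configuration: BF16 training requires CUDA "
            "compute capability >= 8.0 (Ampere+).");
    }
}
```

Under this sketch, an A100 (8.0), an RTX 30xx card (8.6), and an H100 (9.0) pass the check, while a V100 (7.0) and a T4 (7.5) raise the exception.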

 

5. Querying the Active Configuration

Code that conditionally specializes for BF16 can query the resolved state via free helper functions:

if (is_bf16_training())
{
    // BF16-specific path
}

if (is_gpu())
{
    // CUDA-only kernel
}

These functions read from the same cached Configuration singleton and are safe to call from any layer of the codebase.

 

6. Conclusions

Automated mixed precision in OpenNN is a one-line change that delivers substantial speedups on modern GPUs:

  • One call — Configuration::instance().set(Device::CUDA, Type::BF16, Type::BF16) — enables it.
  • The FP32 master copy keeps optimizer math numerically stable; the BF16 mirror roughly halves memory use and unlocks Ampere tensor cores.
  • OpenNN validates the GPU at startup, so misconfigurations fail loudly instead of silently degrading accuracy.

For workloads where training speed or GPU memory is the bottleneck, BF16 is the right default on supported hardware. In practice there is typically no accuracy trade-off compared to FP32, and no manual tuning is required.