GPU CNN training speed: OpenNN vs PyTorch vs TensorFlow (MNIST)
The other notes in this series measure footprint and CPU behavior. This one measures GPU training throughput on a convolutional network: the simplest CNN that exercises the convolution, pooling, and dense paths — one convolutional and one pooling layer — trained on the repository’s MNIST images with the same configuration in all three frameworks.
The result
Training throughput (samples/second, higher is better; epoch wall time in parentheses), 40 timed epochs after a 2-epoch warmup:
| Batch 128 | Batch 1,024 | |
|---|---|---|
| OpenNN (CUDA, fp32, GPU-resident data) | 195,000–212,000 (0.047 s) | 397,259 (0.025 s) |
| PyTorch (eager, fp32, GPU-resident data) | 111,241 (0.090 s) | 298,947 (0.034 s) |
| TensorFlow (Keras fit, fp32) | 55,339 (0.181 s) | 134,307 (0.075 s) |
OpenNN’s batch-128 figure is a range over repeated runs (laptop GPU thermal variance); the other rows showed < ±5% spread.
- OpenNN trains this CNN 1.9× faster than PyTorch at the everyday batch size (128) and 1.3× faster at batch 1,024 — and 2.9–3.8× faster than TensorFlow.
- The training is real, not a no-op loop: the runs end at cross-entropy ≈ 0.10 after 12 epochs (from 1.14 at initialization), identical between OpenNN’s fast path and its legacy path to four decimals.
A model this small (≈31k parameters, 28×28 inputs) measures the framework’s per-step pipeline cost — batch assembly, host↔device traffic, kernel launches, optimizer update — more than raw conv FLOPS. That is the regime real small-CNN workloads live in, and it is where a native C++ loop can beat the Python frameworks outright.
How OpenNN got here
The first run of this benchmark had OpenNN at 103k/132k samples/s — behind PyTorch at both sizes. Profiling (OPENNN_PROFILE=1) attributed the gap to three specific costs, each now fixed in the library:
- NHWC bias gradients.
cudnnConvolutionBackwardBiastook 2.2 ms/step on NHWC deltas — 69% of the whole backward pass, for a 16-float reduction. The bias gradient is now a cudnn-frontend reduction graph (~0.1 ms). - Legacy cuDNN v7 convolution API. The convolution forward/wgrad/dgrad now run through the cudnn-frontend graph API (the engine interface PyTorch uses), with conv + bias + ReLU fused into a single forward graph when the layer requests it. The v7 path is kept as an automatic fallback (
OPENNN_CONV_LEGACY=1forces it). - Host-side batch assembly. Each batch was assembled by re-reading the image cache (one
preadper image), casting, scaling, and copying to the GPU — 1.7 ms per 128-image batch, more than the GPU math it fed.ImageDatasetnow supports the library’s GPU-resident dataset mode (OPENNN_GPU_RESIDENT_DATA=1): the scaled pixels and one-hot targets are staged once and batches are gathered device-side, the same arrangement the PyTorch script uses. MNIST resident is 31 MB.
With those fixed, the per-step pipeline that remains (native loop, device gather, cuDNN engines, fused Adam) is simply leaner than eager PyTorch’s Python dispatch — which is the whole-step overhead this benchmark measures.
Setup
Identical in all three frameworks:
| Value | |
|---|---|
| Data | the repo’s MNIST subset: 10,000 BMPs, 28×28×1, 10 classes (examples/mnist/data) |
| Network | Conv 16@3×3 (stride 1, Same, ReLU) → MaxPool 2×2 (stride 2) → Flatten → Dense 10 (softmax) |
| Loss / optimizer | cross-entropy, Adam (lr 0.001), no regularization |
| Pixels | scaled to [0, 1] once before the timed loop |
| Data residency | GPU-resident for OpenNN and PyTorch; TensorFlow streams from host NumPy via fit |
| Protocol | shuffled epochs, 2 warmup epochs (cuDNN autotune, allocator), 40 timed epochs |
| Precision | fp32 end-to-end, each framework’s default TF32 policy |
| Metric | OpenNN: mean epoch time over the timed run; PyTorch/TensorFlow: median epoch time |
Hardware/software: NVIDIA GeForce RTX 3060 Laptop GPU (driver 555.85) under WSL2 Ubuntu 24.04 on Windows 11 (i7-12700H). OpenNN built with g++ 13.3 + CUDA 12.9.86 + cuDNN 9.23; PyTorch 2.6.0 (cu124 wheels), TensorFlow 2.21.0 on CPython 3.12. The Windows driver caps the stack at the CUDA 12.x line; newer 12.x runtimes work through CUDA minor-version compatibility.
Caveats
- Single consumer laptop GPU; ratios will differ on datacenter parts and at other model sizes.
- OpenNN’s GPU-resident mode requires the dataset to fit in VRAM (it stages scaled inputs + one-hot targets as one float matrix) and is off by default; without it OpenNN measures 114k/165k samples/s — still ahead of host-streamed PyTorch (92k at batch 128).
torch.compile, XLA, and OpenNN’s bf16 path would all raise their respective numbers; this note deliberately compares the plain fp32 training loops.- OpenNN’s 40-epoch mean includes its one-time residency staging (~0.15 s) amortized across the run; PyTorch’s tensors are staged outside its timed region.
Reproducing
The data prep, the three training programs, the probe used in the analysis, and the runner are in docs/benchmarks/cnn-training-speed/:
python prepare_mnist.py ../../../examples/mnist/data # BMP folders -> npy
./run_cnn_speed.sh 40 128 # all three engines + summary
./run_cnn_speed.sh 40 1024
# or individually:
OPENNN_GPU_RESIDENT_DATA=1 ./opennn_cnn_speed <data_path> [epochs] [batch] [fp32|bf16]
python pytorch_cnn_speed.py [epochs] [batch]
python tensorflow_cnn_speed.py [epochs] [batch]