
Training the HIGGS Benchmark with OpenNN and PyTorch

OpenNN trains the classic HIGGS deep-learning benchmark 1.79x faster than an optimized PyTorch implementation using CUDA Graphs, while reaching comparable predictive quality.

This article compares the best measured GPU training run from OpenNN with the best measured PyTorch run on the same HIGGS benchmark. Both runs use the same dataset split, the same neural network topology, and CUDA Graph-based execution.


Benchmark application

The HIGGS dataset is a binary classification benchmark from high-energy physics. Each sample represents a simulated particle-collision event, and the task is to distinguish a Higgs-boson signal process from background events.

Item | Configuration
Dataset | HIGGS, 11,000,000 simulated particle-collision events
Split | 10,000,000 training / 500,000 validation / 500,000 testing
Inputs | 28 real-valued features (the first CSV column is the binary target)
Network | 5 hidden dense layers of 300 tanh units each, sigmoid output
Parameters | 370,201 trainable parameters
Optimizer | Stochastic gradient descent with momentum 0.9
Batch size | 100
Learning rate | 0.05 initial, decay 0.0202
Regularization | L2 weight decay, coefficient 1e-5
Metric | Area under the ROC curve (AUC) on the testing split
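The parameter count follows directly from the topology. The sketch below assumes every layer is fully connected with one bias per unit; the table does not say so explicitly, but this assumption reproduces the reported 370,201 parameters. The helper name is illustrative only.

```python
# Count trainable parameters of a stack of fully connected layers:
# each layer contributes fan_in * fan_out weights plus fan_out biases.

def dense_parameter_count(layer_sizes):
    """Return the number of weights and biases for consecutive dense layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total


# HIGGS topology: 28 inputs, 5 hidden layers of 300 units, 1 sigmoid output.
higgs_layers = [28] + [300] * 5 + [1]
print(dense_parameter_count(higgs_layers))  # 370201
```

Broken down, that is 8,700 parameters for the input layer, 4 × 90,300 for the hidden-to-hidden layers and 301 for the output unit.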

Reference computer

Component | Reference system
Operating system | Ubuntu Linux 22.04, kernel 6.8
CPU | 12th Gen Intel Core i9-12900K, 24 logical CPUs
RAM | 62 GiB
GPU | NVIDIA GeForce RTX 4080, 16 GB VRAM
NVIDIA driver | 555.42.02
PyTorch | 2.6.0+cu124
OpenNN | C++20 development build with CUDA FP32 backend

Methodology

The benchmark reports the best measured run for each framework. Training time excludes CSV preprocessing and covers the training loop only, including validation at the end of each epoch. Testing metrics are calculated after training and are not included in the training time.
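The timing rule can be expressed as a small harness. This is a hypothetical illustration, not code from either benchmark: `train_epoch`, `validate` and `sync` are placeholder callables. Because GPU kernel launches are asynchronous, the clock should only be read after a device synchronization (in PyTorch, `torch.cuda.synchronize()`), otherwise work still queued on the GPU is not counted.

```python
import time


def time_training(train_epoch, validate, epochs, sync=None):
    """Return wall-clock seconds for the training loop only.

    CSV preprocessing is expected to run before this call and test-set
    evaluation after it; validation at the end of every epoch is included.
    `train_epoch`, `validate` and `sync` are hypothetical callables.
    """
    if sync is not None:
        sync()  # finish any pending device work before starting the clock
    start = time.perf_counter()
    for epoch in range(epochs):
        train_epoch(epoch)
        validate(epoch)
    if sync is not None:
        sync()  # wait for queued GPU kernels so their time is counted
    return time.perf_counter() - start


# Usage with stand-in callables: 21 epochs, labelled 0 to 20.
log = []
elapsed = time_training(
    train_epoch=lambda e: log.append(("train", e)),
    validate=lambda e: log.append(("validate", e)),
    epochs=21,
)
print(len(log), log[0], log[-1])  # 42 ('train', 0) ('validate', 20)
```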

Epoch numbering starts at 0, so the run labelled epoch 20 contains 21 complete passes over the training split.
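The amount of work in one run follows from the split and batch size. A short sketch, assuming no partial final batch (10,000,000 is divisible by 100, so every batch is full):

```python
# Work performed by one benchmark run (values from the configuration table).
TRAINING_SAMPLES = 10_000_000
BATCH_SIZE = 100
LAST_EPOCH = 20  # epoch labels start at 0

passes = LAST_EPOCH + 1
steps_per_epoch = TRAINING_SAMPLES // BATCH_SIZE
total_samples = passes * TRAINING_SAMPLES
total_steps = passes * steps_per_epoch

print(passes, steps_per_epoch, total_samples, total_steps)
# 21 100000 210000000 2100000
```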

Framework | Execution path
OpenNN | CUDA Graph training path with GPU-resident dataset and device-side batch gathering
PyTorch | Optimized CUDA Graphs implementation with static tensors updated in place
The comparison below keeps only the strongest measured run from each framework: OpenNN with CUDA Graph execution and a GPU-resident dataset, and PyTorch optimized with CUDA Graphs.

Results

Figure: Training throughput on HIGGS in samples per second (RTX 4080, batch size 100, epochs 0-20; higher is better). OpenNN with CUDA Graph + GPU dataset: 1.01M samples/s. PyTorch with CUDA Graphs: 563k samples/s.
OpenNN reaches 1.79x the training throughput of the optimized PyTorch CUDA Graphs run in this benchmark.
Framework / run | Training time | Avg. epoch time | Throughput | Test AUC | Test accuracy (threshold 0.5)
OpenNN, CUDA Graph + GPU-resident dataset, final parameters at epoch 20 | 208.23 s | 9.92 s | 1.01M samples/s | 0.859371 | 0.77382
PyTorch optimized with CUDA Graphs, final parameters at epoch 20 | 372.82 s | 17.75 s | 563k samples/s | 0.857541 | 0.77301
Comparison | Result
OpenNN speed vs optimized PyTorch CUDA Graphs | 1.79x faster
Epoch-time reduction | 44.1% lower average epoch time
Throughput increase | 79.0% more samples per second
Predictive quality | Comparable test AUC (0.8594 vs 0.8575) and accuracy (0.7738 vs 0.7730)
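The derived figures can be recomputed from the two measured training times and the 21 × 10,000,000 training samples processed in each run. A sketch, assuming throughput is defined as total training samples divided by total training time (validation included), which matches the reported 1.01M and 563k samples/s:

```python
EPOCHS = 21
TRAINING_SAMPLES = 10_000_000


def summarize(training_time_s):
    """Average epoch time (s) and throughput (samples/s) for a 21-epoch run."""
    return training_time_s / EPOCHS, EPOCHS * TRAINING_SAMPLES / training_time_s


opennn_time, pytorch_time = 208.23, 372.82
opennn_epoch, opennn_rate = summarize(opennn_time)
pytorch_epoch, pytorch_rate = summarize(pytorch_time)

speedup = pytorch_time / opennn_time
epoch_time_reduction = 1 - opennn_epoch / pytorch_epoch
throughput_increase = opennn_rate / pytorch_rate - 1

print(f"OpenNN: {opennn_epoch:.2f} s/epoch, {opennn_rate / 1e6:.2f}M samples/s")
print(f"PyTorch: {pytorch_epoch:.2f} s/epoch, {pytorch_rate / 1e3:.0f}k samples/s")
print(f"{speedup:.2f}x, {epoch_time_reduction:.1%} lower epoch time, "
      f"{throughput_increase:.1%} more samples/s")
# OpenNN: 9.92 s/epoch, 1.01M samples/s
# PyTorch: 17.75 s/epoch, 563k samples/s
# 1.79x, 44.1% lower epoch time, 79.0% more samples/s
```

Because both runs process the same number of samples, the speedup and the throughput ratio are the same quantity, and the epoch-time reduction is 1 − 1/1.79.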

Discussion

The key result is that OpenNN is faster not only than a straightforward Python training loop but also than the optimized PyTorch CUDA Graphs version used for this workload. The OpenNN run completes epochs 0-20 in 208.23 seconds, compared with 372.82 seconds for PyTorch CUDA Graphs.

The reason is that the optimized OpenNN path keeps the dataset resident on the GPU, gathers batches on the device, and runs the training step as a CUDA Graph. This removes the per-batch host orchestration cost and avoids repeatedly staging the same tabular data through CPU-side batch workers. With a batch size of 100, each epoch consists of 100,000 optimizer steps, so even a small fixed per-batch overhead is paid 100,000 times per epoch and adds up quickly.
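A back-of-the-envelope way to see the per-step budget is to divide each total training time by the 2,100,000 optimizer steps in epochs 0-20. This is plain arithmetic on the reported times, not a profile: the totals include end-of-epoch validation, so the true average cost of a training step is somewhat lower than these figures. The helper name is illustrative.

```python
# Average wall-clock time per optimizer step over epochs 0-20.
STEPS = 21 * (10_000_000 // 100)  # 2,100,000 steps of batch size 100


def microseconds_per_step(training_time_s, steps=STEPS):
    """Average time per step in microseconds (validation time included)."""
    return training_time_s / steps * 1e6


for name, seconds in [("OpenNN", 208.23), ("PyTorch", 372.82)]:
    print(f"{name}: {microseconds_per_step(seconds):.1f} us/step")
# OpenNN: 99.2 us/step
# PyTorch: 177.5 us/step
```

At roughly 100 to 180 microseconds per step, any fixed per-step cost in the tens of microseconds is a substantial fraction of the budget, which is why small-batch workloads are sensitive to host-side overhead.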

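Predictive quality remains aligned. OpenNN reaches a test AUC of 0.859371 and the optimized PyTorch run 0.857541, a difference of about 0.0018; test accuracy differs by less than 0.001 (0.77382 vs 0.77301). With a single run per framework, differences of this size do not indicate a quality advantage either way, so the main conclusion concerns training speed at equivalent model quality.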
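For reference, both quality metrics fit in a few lines. This is a sketch, not either framework's implementation: it assumes 0/1 labels and sigmoid scores, computes AUC with the rank-based (Mann-Whitney U) formulation, which equals the area under the ROC curve with ties counted as half, and assumes a score of exactly 0.5 is classified as positive, which the source does not specify.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Uses the Mann-Whitney U statistic with average ranks for tied scores.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def accuracy_at(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label.

    Assumption: a score equal to the threshold is predicted as positive.
    """
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores), accuracy_at(labels, scores))  # 0.75 0.75
```

The rank formulation avoids building the ROC curve explicitly and needs only one sort, which keeps it practical for a 500,000-sample testing split.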

Conclusions

  • OpenNN trains the HIGGS 5×300 benchmark in 208.23 seconds for epochs 0-20.
  • The optimized PyTorch CUDA Graphs run takes 372.82 seconds for the same benchmark setup.
  • OpenNN is 1.79x faster, with 44.1% lower average epoch time.
  • Both frameworks reach essentially the same predictive quality on the testing split.
