OpenNN trains the classic HIGGS deep-learning benchmark faster than an optimized PyTorch CUDA Graphs implementation, while reaching the same predictive quality.
This article compares the best measured GPU training run of OpenNN with the best measured PyTorch run on the same HIGGS benchmark. Both runs use the same dataset split, the same neural network topology, and CUDA Graph execution.
Benchmark application
The HIGGS dataset is a binary classification benchmark from high-energy physics. Each sample represents a simulated particle-collision event, and the task is to distinguish a Higgs-boson signal process from background events.
| Item | Configuration |
|---|---|
| Dataset | HIGGS, 11,000,000 simulated particle-collision events |
| Split | 10,000,000 training / 500,000 validation / 500,000 testing |
| Inputs | 28 real-valued features; first column is the binary target |
| Network | 5 hidden dense layers, 300 tanh units per hidden layer, sigmoid output |
| Parameters | 370,201 trainable parameters |
| Optimizer | Stochastic gradient descent with momentum |
| Batch size | 100 |
| Learning rate | 0.05 initial learning rate, decay 0.0202, momentum 0.9 |
| Regularization | L2 weight decay coefficient 1e-5 |
| Metric | Area under the ROC curve (AUC) on the testing split |
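
The parameter count in the table can be checked from the layer sizes alone. The short Python sketch below assumes every dense layer carries a weight matrix plus one bias per unit, and that the output layer has a single unit; the function name is illustrative and not part of OpenNN or PyTorch.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, inputs first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# 28 inputs, five hidden layers of 300 units, one sigmoid output unit.
higgs_layers = [28] + [300] * 5 + [1]
print(dense_parameter_count(higgs_layers))  # 370201
```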
Reference computer
| Component | Reference system |
|---|---|
| Operating system | Ubuntu Linux 22.04, kernel 6.8 |
| CPU | 12th Gen Intel Core i9-12900K, 24 logical CPUs |
| RAM | 62 GiB |
| GPU | NVIDIA GeForce RTX 4080, 16 GB VRAM |
| NVIDIA driver | 555.42.02 |
| PyTorch | 2.6.0+cu124 |
| OpenNN | C++20 development build with CUDA FP32 backend |
Methodology
The benchmark reports the best measured run for each framework. Training time covers the training loop only: it excludes CSV preprocessing and includes the validation pass at the end of each epoch. Testing metrics are calculated after training and are not included in the training time.
Epoch numbering starts at 0, so the run labelled epoch 20 contains 21 complete passes over the training split.
| Framework | Execution path |
|---|---|
| OpenNN | CUDA Graph training path with GPU-resident dataset and device-side batch gathering |
| PyTorch | Optimized CUDA Graphs implementation with static tensors updated in-place |
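
To make "static tensors updated in-place" concrete, here is a minimal PyTorch sketch of whole-step CUDA Graph capture and replay, following the pattern in the PyTorch CUDA Graphs documentation. It is not the benchmark script. Several things are assumptions made for illustration: the binary cross-entropy loss (with the sigmoid folded into `BCEWithLogitsLoss`), a constant learning rate with no decay, mapping the L2 coefficient to `weight_decay`, keeping the whole dataset on the GPU, and the function names themselves.

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of full batches (graph replays) in one pass over the data."""
    return num_samples // batch_size


def train_one_epoch_with_cuda_graph(x_gpu, y_gpu, batch_size=100, warmup_steps=3):
    """Train a 5x300 tanh network for one epoch by replaying a captured CUDA graph.

    x_gpu: float32 CUDA tensor of shape (N, 28)
    y_gpu: float32 CUDA tensor of shape (N, 1) holding 0/1 targets
    """
    # Imported here so the helper above stays usable without PyTorch installed.
    import torch
    from torch import nn

    device = x_gpu.device
    layers, width = [], x_gpu.shape[1]
    for _ in range(5):
        layers += [nn.Linear(width, 300), nn.Tanh()]
        width = 300
    layers.append(nn.Linear(width, 1))  # logits; the sigmoid lives in the loss
    model = nn.Sequential(*layers).to(device)
    loss_fn = nn.BCEWithLogitsLoss()  # assumed loss, not stated in the article
    # Assumption: constant learning rate; the decay schedule is omitted here.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-5
    )

    # Static buffers: the captured graph always reads these same addresses.
    static_x = x_gpu[:batch_size].clone()
    static_y = y_gpu[:batch_size].clone()

    # Warm up on a side stream before capture. These are real updates on the
    # first batch and also create the optimizer's momentum buffers.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(warmup_steps):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_x), static_y).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture forward, backward and optimizer step as a single graph.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_x), static_y)
        static_loss.backward()
        optimizer.step()

    # Replay: gather each batch on the device, copy it into the static
    # buffers in place, and launch the whole training step with one call.
    perm = torch.randperm(x_gpu.shape[0], device=device)
    for step in range(steps_per_epoch(x_gpu.shape[0], batch_size)):
        idx = perm[step * batch_size:(step + 1) * batch_size]
        static_x.copy_(x_gpu[idx])
        static_y.copy_(y_gpu[idx])
        graph.replay()
    return model, static_loss
```

On a CUDA machine this could be called as `model, loss = train_one_epoch_with_cuda_graph(x_gpu, y_gpu)` with both tensors already on the GPU. With 10,000,000 training samples and batch size 100, the replay loop runs 100,000 times per epoch.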
Results
| Framework / run | Training time | Avg. epoch time | Throughput | Test AUC | Test accuracy @ 0.5 |
|---|---|---|---|---|---|
| OpenNN, CUDA Graph + GPU-resident dataset, final parameters at epoch 20 | 208.23 s | 9.92 s | 1.01M samples/s | 0.859371 | 0.77382 |
| PyTorch optimized with CUDA Graphs, final parameters at epoch 20 | 372.82 s | 17.75 s | 563k samples/s | 0.857541 | 0.77301 |
| Comparison | Result |
|---|---|
| OpenNN speed vs optimized PyTorch CUDA Graphs | 1.79x faster |
| Epoch-time reduction | 44.1% lower epoch time |
| Throughput increase | 79.0% more samples per second |
| Predictive quality | Comparable: test AUC 0.859371 vs 0.857541, test accuracy 0.77382 vs 0.77301 |
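
The derived columns follow from the two measured training times. The sketch below recomputes them in Python, assuming (as the figures suggest) that throughput is the number of training samples divided by the average epoch time, and that each run contains 21 epochs. Variable and function names are illustrative.

```python
TRAIN_SAMPLES = 10_000_000
EPOCHS = 21  # epochs 0 through 20


def derived_metrics(training_seconds):
    """Return (average epoch time in seconds, training samples per second)."""
    epoch_seconds = training_seconds / EPOCHS
    return epoch_seconds, TRAIN_SAMPLES / epoch_seconds


opennn_epoch, opennn_rate = derived_metrics(208.23)
pytorch_epoch, pytorch_rate = derived_metrics(372.82)

speedup = 372.82 / 208.23
epoch_time_reduction = 1 - opennn_epoch / pytorch_epoch
throughput_increase = opennn_rate / pytorch_rate - 1

print(f"OpenNN:  {opennn_epoch:.2f} s/epoch, {opennn_rate:,.0f} samples/s")
print(f"PyTorch: {pytorch_epoch:.2f} s/epoch, {pytorch_rate:,.0f} samples/s")
print(f"Speedup: {speedup:.2f}x")
print(f"Epoch-time reduction: {epoch_time_reduction:.1%}")
print(f"Throughput increase:  {throughput_increase:.1%}")
```

Because both runs process the same number of samples, the throughput ratio equals the speedup, so the throughput increase is simply the speedup minus one.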
Discussion
The key result is that OpenNN is faster not just than a plain Python training loop, but than the optimized PyTorch CUDA Graphs version used for this workload. The OpenNN run completes epochs 0-20 in 208.23 seconds, compared with 372.82 seconds for PyTorch CUDA Graphs.
The reason is that the optimized OpenNN path keeps the dataset resident on the GPU and uses CUDA Graph execution for the training step. This removes the per-batch host orchestration cost and avoids repeatedly staging the same tabular data through CPU-side batch workers. With batch size 100, each epoch over the 10,000,000 training samples runs 100,000 training steps, so any fixed per-step overhead is paid 100,000 times per epoch.
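A back-of-the-envelope calculation shows how tight the per-step time budget is. The Python sketch below divides each reported training time by the total number of training steps. Since the reported time also includes the validation pass at the end of each epoch, the result is an upper bound on the time per training step, not a direct measurement.

```python
TRAIN_SAMPLES = 10_000_000
BATCH_SIZE = 100
EPOCHS = 21  # epochs 0 through 20

total_steps = TRAIN_SAMPLES // BATCH_SIZE * EPOCHS


def microseconds_per_step(training_seconds):
    """Upper bound on the wall-clock time of one training step."""
    return training_seconds / total_steps * 1e6


print(f"Total training steps: {total_steps:,}")
print(f"OpenNN:  {microseconds_per_step(208.23):.1f} us per step")
print(f"PyTorch: {microseconds_per_step(372.82):.1f} us per step")
```

At this scale, a few tens of microseconds of host-side work per batch is a sizeable fraction of the whole step.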
Predictive quality is comparable. OpenNN reaches a test AUC of 0.859371, while the optimized PyTorch run reaches 0.857541, a difference of about 0.0018. The main conclusion is therefore about training speed at equivalent model quality.
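For readers who want to reproduce the two quality metrics, the following dependency-free Python sketch computes AUC through the rank-sum (Mann-Whitney) formulation and accuracy at a fixed threshold. It is an illustration of the metrics, not the evaluation code used by either framework, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (rank-sum formulation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of samples sharing the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def accuracy_at(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly when score >= threshold means 1."""
    hits = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return hits / len(labels)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))      # 0.75
print(accuracy_at(labels, scores))  # 0.75
```

AUC depends only on how the scores rank the samples, while accuracy depends on the chosen threshold, which is why the results table reports both.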
Conclusions
- OpenNN trains the HIGGS 5×300 benchmark in 208.23 seconds for epochs 0-20.
- The optimized PyTorch CUDA Graphs run takes 372.82 seconds for the same benchmark setup.
- OpenNN is 1.79x faster, with 44.1% lower average epoch time.
- Both frameworks reach essentially the same predictive quality on the testing split.
References
- HIGGS dataset, UCI Machine Learning Repository.
- P. Baldi, P. Sadowski and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nature Communications, 2014.
- PyTorch.
- OpenNN.