Peak memory: OpenNN vs PyTorch vs TensorFlow

Disk size and startup time are two costs of a heavy runtime; RAM is a third. On a constrained target — a small edge device, a memory-capped container, a function with a tight memory limit, or simply a machine running many model processes at once — what matters is how much resident memory a model process actually holds. A framework that loads a large runtime pays for it in RAM the moment it starts, before any data is touched.

The numbers

Metric                                    OpenNN   PyTorch   TensorFlow
Baseline RSS (model built, no training)   9 MB     221 MB    485 MB
Peak RSS (during training)                9 MB     295 MB    521 MB
Peak vs OpenNN                            1×       ≈32×      ≈56×

All three programs do the same thing on the same data: load sum.csv (1,000 rows × 100 numeric inputs + 1 target), build a 100 → 64 → 1 MLP, and train for 50 epochs (Adam, batch size 32, single-threaded). Each program reports its own peak RSS as recorded by the OS: getrusage in the C++ program, the resource module in the Python ones.
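For the Python programs, the peak-RSS readout mentioned above can be sketched with the standard-library resource module. This is a minimal illustration, not the benchmark's actual script: the helper name peak_rss_mb is hypothetical, and the list allocation only stands in for building and training a model.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB (hypothetical helper)."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return ru_maxrss / divisor


# Reading taken once the process is set up (stand-in for "model built").
baseline_mb = peak_rss_mb()

# Stand-in workload: allocate and fill some memory, as training would.
workload = [float(i) for i in range(500_000)]

# ru_maxrss is a high-water mark, so this reading can only be >= the first.
peak_mb = peak_rss_mb()

print(f"baseline: {baseline_mb:.1f} MB, peak: {peak_mb:.1f} MB")
```

Because ru_maxrss only ever grows, a reading taken before training is the peak so far rather than the current resident size; it serves as a baseline only if nothing larger was resident earlier in the process.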

What the numbers show

  • OpenNN holds ~9 MB and barely moves. Baseline and peak are the same to within measurement noise — the dataset and training buffers are tiny next to the already-small working set. The whole process, code and data, fits in single-digit megabytes.
  • PyTorch starts at ~221 MB before training — the Python interpreter plus the libtorch runtime (and NumPy) resident in memory — and rises to ~295 MB during training as autograd and optimizer buffers are allocated.
  • TensorFlow starts at ~485 MB and rises to ~521 MB — the Keras/TF runtime carries an even larger resident footprint than PyTorch before any training.
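To see why the model and dataset barely register in these figures, the memory they need can be estimated directly from the benchmark's dimensions. The sketch below assumes 32-bit floats and counts only parameters, gradients, Adam's two moment buffers, and the raw dataset; the helper name mlp_param_count is hypothetical, and real frameworks add allocator and bookkeeping overhead on top.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


BYTES_PER_VALUE = 4  # assumption: 32-bit floats

# The benchmark's 100 -> 64 -> 1 MLP.
params = mlp_param_count([100, 64, 1])
param_bytes = params * BYTES_PER_VALUE

# Rough training state: parameters + gradients + Adam's two moment estimates.
training_bytes = 4 * param_bytes

# sum.csv: 1,000 rows of 100 inputs + 1 target.
dataset_bytes = 1000 * (100 + 1) * BYTES_PER_VALUE

print(f"parameters:     {params}")
print(f"training state: {training_bytes / 1024:.0f} KB")
print(f"dataset:        {dataset_bytes / 1024:.0f} KB")
```

Under these assumptions the model's training state is about 100 KB and the dataset under 400 KB, so together they account for roughly half a megabyte; the rest of each process's RSS is runtime, not workload.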

Why OpenNN uses so little

The OpenNN program is a native binary with the library linked in and Eigen (header-only) for math. Its resident memory is essentially the model parameters, the data, and a small amount of code; there is no interpreter and no large general-purpose tensor runtime mapped into the process. PyTorch keeps the Python runtime and the full libtorch engine resident for the life of the process, which sets a high floor independent of model size. TensorFlow does the same with its larger Keras/TF runtime, hence its higher baseline.

Why it matters

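  • Memory-capped containers / functions: a 256 MB limit has room for the OpenNN process roughly 28 times over (256 / 9). PyTorch’s ~221 MB baseline nearly fills it before any work, its ~295 MB training peak exceeds it, and TensorFlow’s ~485 MB baseline does not fit at all.
  • Many concurrent model processes: at ~9 MB each, you can run far more OpenNN workers per machine than PyTorch ones at ~220–300 MB or TensorFlow ones at ~485–520 MB.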
  • Small edge devices: RAM is often scarcer than disk; a single-digit-MB footprint leaves room for the rest of the application.
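The capacity argument in the list above is simple division, sketched here with the peak RSS figures from the table. The 16 GB machine budget and the helper name processes_that_fit are illustrative assumptions. The estimate is deliberately crude: it ignores memory used by the OS, and summing RSS overcounts pages that several processes share (for example, shared-library code).

```python
def processes_that_fit(budget_mb: int, per_process_mb: int) -> int:
    """Whole processes that fit in a memory budget, ignoring sharing and OS overhead."""
    return budget_mb // per_process_mb


# Peak RSS during training, in MB, taken from the table above.
PEAK_RSS_MB = {"OpenNN": 9, "PyTorch": 295, "TensorFlow": 521}

# 256 MB function limit, and a hypothetical 16 GB machine.
for budget_mb in (256, 16 * 1024):
    for name, rss_mb in PEAK_RSS_MB.items():
        count = processes_that_fit(budget_mb, rss_mb)
        print(f"{budget_mb:>6} MB budget: {count:>5} x {name}")
```

A result of 0 means a single training process does not fit in the budget at its peak.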

Caveats

  • This is a memory benchmark on a small model, chosen so the numbers reflect framework overhead — the structural difference — rather than a specific large workload. A bigger model or dataset adds parameter/activation memory to both sides on top of these baselines.
  • Measured on Linux x86_64, single-threaded (OMP_NUM_THREADS=1) for all three programs, so that thread-pool arenas do not inflate RSS differently. OpenNN built with g++ 13.3 (CPU-only); PyTorch 2.12.0+cpu on CPython 3.12 with NumPy installed.
  • RSS is the OS’s peak resident size (ru_maxrss); absolute values vary with allocator, glibc, and thread settings, but the order-of-magnitude gap is structural.
  • CPU-only on both sides. A CUDA build adds GPU-side memory, which this note does not cover.
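On Linux, one way to cross-check a ru_maxrss reading is to read the kernel's own counters from /proc/self/status, where VmRSS is the current resident size and VmHWM is the high-water mark, both in kB. The sketch below is not part of the benchmark; the helper name read_status_kb is hypothetical, and it returns an empty dict on systems without /proc.

```python
from pathlib import Path


def read_status_kb(fields=("VmRSS", "VmHWM"), status_path="/proc/self/status"):
    """Return the requested kB-valued fields from a single read of the status file."""
    path = Path(status_path)
    if not path.exists():
        return {}  # not Linux, or /proc is not mounted
    values = {}
    for line in path.read_text().splitlines():
        # Lines look like "VmRSS:      9216 kB".
        name, _, rest = line.partition(":")
        if name in fields:
            values[name] = int(rest.split()[0])
    return values


rss = read_status_kb()
if rss:
    for name, kb in rss.items():
        print(f"{name}: {kb / 1024:.1f} MB")
else:
    print("/proc/self/status not available on this system")
```

Reading VmRSS right after the model is built and VmHWM at exit would give a current-versus-peak pair comparable to the baseline and peak rows, subject to the same allocator and thread-setting caveats.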

References