A typical neural network training step issues hundreds of small CUDA kernels — one matrix multiplication, one activation, one element-wise operation, and so on, in tight sequence. Each kernel costs the CPU roughly five to ten microseconds to launch, regardless of how little work the GPU then performs. On modern hardware, these launch overheads can dominate the actual computation for small layers and tight loops.

CUDA Graphs are NVIDIA’s solution to this problem: record the sequence of kernels once, then replay it many times with a single launch call. This article explains how CUDA Graphs work, when they help, and when they do not.

 

Contents:

  1. Introduction.
  2. The Three-Phase Model.
  3. When CUDA Graphs Help.
  4. When They Do Not.
  5. The Capture Cost.
  6. Graph-Friendly Kernel Design.
  7. Conclusions.

 

1. Introduction

Launching a CUDA kernel is not free. Each call crosses from user code into the CUDA runtime, then into the driver, before the GPU sees any work to do. Even with an asynchronous stream, the CPU spends real time per kernel constructing the launch parameters and submitting them.

For workloads with a small number of large kernels, this overhead is negligible. For workloads with many small kernels — common in neural network training, where each layer issues several short operations — the cumulative CPU time spent launching kernels can rival or exceed the GPU time spent executing them.

CUDA Graphs address this by collapsing many kernel launches into a single, pre-recorded structure that the driver submits in one operation.

 

2. The Three-Phase Model

A CUDA Graph is built and used in three phases:

  1. Capture. The CUDA runtime intercepts every kernel launched on a stream and records its arguments into a graph data structure. No work runs on the GPU during capture.
  2. Instantiate. The captured graph is compiled into an executable form. Dependencies are resolved, memory is reserved, and the driver decides on an efficient launch schedule.
  3. Replay. A single cudaGraphLaunch call submits the entire sequence. The CPU pays roughly one launch cost instead of hundreds, although the GPU still executes every kernel recorded in the graph.
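The three phases can be pictured with a toy host-side model. This is only an analogy written in plain Python, not the CUDA API: the class ToyGraph, its methods, and the two stand-in "kernels" are hypothetical names invented for illustration. It shows that capture records launches without running them, and that replay runs the recorded sequence with the arguments seen at capture time.

```python
# Toy host-side model of capture / instantiate / replay.
# ToyGraph and its methods are hypothetical names, not the CUDA API.

class ToyGraph:
    def __init__(self):
        self._recorded = []      # (function, args) pairs seen during capture
        self._executable = None  # frozen sequence produced by instantiate()

    def capture(self, fn, *args):
        # Phase 1: record the launch and its arguments; do not run it.
        self._recorded.append((fn, args))

    def instantiate(self):
        # Phase 2: freeze the recorded sequence into a replayable form.
        self._executable = tuple(self._recorded)

    def replay(self):
        # Phase 3: one call runs every recorded launch in order.
        if self._executable is None:
            raise RuntimeError("instantiate() must be called before replay()")
        for fn, args in self._executable:
            fn(*args)


def scale(buf, factor):
    # Stand-in for a small element-wise kernel.
    for i in range(len(buf)):
        buf[i] *= factor


def add_bias(buf, bias):
    # Stand-in for a second small kernel.
    for i in range(len(buf)):
        buf[i] += bias


data = [1.0, 2.0, 3.0]
graph = ToyGraph()
graph.capture(scale, data, 2.0)
graph.capture(add_bias, data, 1.0)
after_capture = list(data)   # unchanged: capture ran nothing
graph.instantiate()
graph.replay()
after_replay = list(data)    # each element scaled by 2.0, then 1.0 added
```

Because the recorded arguments refer to the same buffer on every replay, new inputs have to be written into that buffer before replaying. The same idea is why graph-based code tends to rely on pre-allocated buffers.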

For a training step that issues two hundred kernels, replay typically cuts CPU-side overhead from one to two milliseconds down to about ten microseconds — a meaningful margin when each step is already only a few milliseconds of actual compute.
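The figures above follow from simple multiplication. The sketch below reproduces them using the article's own numbers (two hundred kernels, five to ten microseconds per launch, about ten microseconds per graph launch). The function name is an illustrative choice, and the model assumes launch costs simply add up.

```python
def launch_overhead_us(num_kernels, per_launch_us):
    """CPU time spent launching kernels one at a time, in microseconds."""
    return num_kernels * per_launch_us


KERNELS_PER_STEP = 200
GRAPH_LAUNCH_US = 10   # the article's approximate cost of one graph launch

eager_low = launch_overhead_us(KERNELS_PER_STEP, 5)    # 1000 us = 1 ms
eager_high = launch_overhead_us(KERNELS_PER_STEP, 10)  # 2000 us = 2 ms

# Ratio of per-kernel launch overhead to a single graph launch.
reduction_low = eager_low / GRAPH_LAUNCH_US    # 100.0
reduction_high = eager_high / GRAPH_LAUNCH_US  # 200.0
```

Under these assumptions the CPU-side launch cost drops by a factor of one hundred to two hundred. Real savings depend on the hardware, the driver, and how much of the launch cost was already hidden behind GPU execution.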

 

3. When CUDA Graphs Help

CUDA Graphs deliver the largest speedups in three scenarios:

  • Tight training loops where the same forward and backward pass runs identically on every step.
  • Small kernels whose runtime is comparable to their launch cost — most element-wise operations and small layer activations fall into this category.
  • Inference servers handling fixed input shapes at high request rates, where the per-request launch savings add up across millions of requests.

In each case, the same graph is replayed many times, and the up-front capture cost is recovered quickly.

 

4. When They Do Not

CUDA Graphs are not a universal optimization. Several situations make them ineffective or impossible to use:

  • Dynamic shapes. A captured graph records fixed kernel arguments and launch dimensions, so varying batch size, sequence length, or image resolution means re-capturing.
  • Control flow. Kernels chosen on the host based on runtime data, for example early-exit logic or conditional branches, are not recorded: the graph replays only the path taken during capture.
  • Already-large kernels. If each kernel runs for milliseconds, the launch overhead is already negligible, and graphs offer little.
  • Host-side work. Calls that allocate memory with cudaMalloc or synchronize back to the CPU are not permitted during stream capture. Host callbacks can be recorded as host nodes, but their work still runs on the CPU.

A workload that mixes captured and uncaptured operations will still pay launch overhead for the uncaptured parts.
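The dynamic-shape limitation can be made concrete with a small bookkeeping sketch. It assumes a hypothetical capture_fn that stands in for capture plus instantiate and returns a replayable callable. The class GraphCache and the helper fake_capture are illustrative names, not part of CUDA or any library. The point is only that every distinct shape costs one capture.

```python
class GraphCache:
    """Capture one graph per distinct input shape and reuse it afterwards."""

    def __init__(self, capture_fn):
        # capture_fn is hypothetical: it maps a shape to a replayable callable.
        self._capture_fn = capture_fn
        self._graphs = {}
        self.captures = 0   # number of times the capture cost was paid

    def run(self, shape):
        key = tuple(shape)
        if key not in self._graphs:
            # First time this shape is seen: pay the capture cost.
            self._graphs[key] = self._capture_fn(key)
            self.captures += 1
        # Every call, including the first, replays the stored graph.
        return self._graphs[key]()


def fake_capture(shape):
    # Stand-in for capture + instantiate; returns a replay callable.
    return lambda: f"replayed graph for shape {shape}"


cache = GraphCache(fake_capture)
for batch in (32, 32, 64, 32, 64):
    cache.run((batch, 128))
# Two distinct shapes were seen, so cache.captures is 2.
```

If shapes vary freely, the number of captures grows with the number of distinct shapes, and each capture must still be recovered through enough replays of that particular shape.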

 

5. The Capture Cost

Capturing and instantiating a graph is itself an expensive operation — comparable to running the loop once normally. The technique only pays off when the same graph is replayed many times.

A useful rule of thumb is to capture once and replay at least a hundred times. For training loops that run for thousands of steps, this is trivial. For one-off scripts or rarely-executed paths, the capture cost will not be recovered.
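The break-even point can be estimated by dividing the one-time capture cost by the saving per replay. The sketch below does that. The function name and all of the numbers in the example are assumptions chosen for illustration, not measurements.

```python
import math


def break_even_replays(capture_cost_us, eager_step_us, graph_step_us):
    """Replays needed before the one-time capture cost is recovered.

    Returns None when replaying saves nothing per step.
    """
    saving_per_replay = eager_step_us - graph_step_us
    if saving_per_replay <= 0:
        return None
    return math.ceil(capture_cost_us / saving_per_replay)


# Assumed numbers: 50 ms to capture and instantiate, 1 ms of launch
# overhead per eager step, 10 us of launch overhead per replayed step.
launch_bound = break_even_replays(50_000, 1_000, 10)   # 51 replays

# Same assumed capture cost, but only 10 us saved per step.
large_kernels = break_even_replays(50_000, 20, 10)     # 5000 replays
```

With these assumed inputs, a launch-bound loop recovers the capture cost within about fifty steps, while a loop that saves only ten microseconds per step needs thousands.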

 

6. Graph-Friendly Kernel Design

Operations that cannot be captured force the application or framework to fall back to per-kernel launches for that region, undoing the benefit. Performance-oriented neural network libraries therefore design their kernels to be "graph-friendly":

  • No memory allocations in the captured region: all buffers are pre-allocated.
  • No branches that depend on host-side data.
  • No synchronization back to the CPU.
  • Fused operations rather than separate launches for sub-steps that always run together.

Even before adopting graphs, designing kernels in this style improves baseline performance — fewer launches, fewer allocations, and tighter overlap with the surrounding compute.

 

7. Conclusions

CUDA Graphs are a powerful tool for amortizing kernel launch overhead when the same sequence of GPU work runs many times:

  • They collapse hundreds of kernel launches into a single replay call.
  • They shine on workloads with many small kernels and fixed shapes; they hurt on dynamic-shape or rarely-replayed loops.
  • Designing kernels to be graph-friendly (pre-allocated, fused, and free of host-dependent branches) is a useful constraint even before graphs are adopted, because it tends to produce faster code regardless.

For training and inference workloads that are launch-overhead-bound, CUDA Graphs can be the difference between a GPU running at 30% utilization and one running near peak.