GPU utilisation in nvidia-smi measures whether any kernel is running on the GPU — not whether it’s doing useful compute. A GPU showing 95% utilisation can be doing almost no actual tensor math, while you pay full price for it.
Key takeaways
The GPU utilisation metric is one of the most widely misread numbers in ML infrastructure. Teams see 95% and assume their GPU is working hard. Often it’s idle, waiting for data or network. Understanding the difference between GPU_UTIL and SM Activity is worth real money at scale.
When you run nvidia-smi, the GPU-Util column reports the percentage of time over the sampling window that at least one CUDA kernel was active on the GPU. The operative word is “at least one.”
A tiny memory-copy kernel counts as activity for every moment it runs, however little of the chip it touches. An NCCL communication kernel shuffling data between GPUs counts as activity too. Any kernel counts, regardless of how many of the GPU’s Streaming Multiprocessors (132 on an H100) it actually uses.
This means GPU_UTIL tells you one binary thing: was the GPU awake? It tells you nothing about whether it was doing useful tensor math.
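To make the gap concrete, here is a minimal toy model of the two metrics. It is a sketch of the definitions described above, not how the driver computes them: the sampling window is split into equal time slices, and each trace (both hypothetical) records how many of the 132 SMs had work in each slice.

```python
NUM_SMS = 132  # H100 SM count quoted in the text


def gpu_util(busy_sms_per_slice):
    """Percent of time slices in which at least one SM had a kernel running."""
    active = sum(1 for n in busy_sms_per_slice if n > 0)
    return 100.0 * active / len(busy_sms_per_slice)


def sm_activity(busy_sms_per_slice, num_sms=NUM_SMS):
    """Percent of total SM-time that was busy, averaged over SMs and slices."""
    return 100.0 * sum(busy_sms_per_slice) / (num_sms * len(busy_sms_per_slice))


# Hypothetical trace: small communication kernels on 8 SMs in every slice.
comm_heavy = [8] * 100
# Hypothetical trace: large kernels on all SMs for 90 of 100 slices.
compute_heavy = [NUM_SMS] * 90 + [0] * 10

for name, trace in [("comm_heavy", comm_heavy), ("compute_heavy", compute_heavy)]:
    print(f"{name}: GPU_UTIL={gpu_util(trace):.0f}%  SM Activity={sm_activity(trace):.0f}%")
```

In this model the communication-heavy trace reports 100% GPU_UTIL with about 6% SM Activity, while the compute-heavy trace reports 90% on both.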
⚠ Real example
An 8×H100 node running an NCCL-heavy distributed training workload can show ~100% GPU_UTIL across all cards while SM Activity and SM Occupancy sit at 10–20%. The GPUs are continuously executing small communication kernels and doing almost no actual matrix multiplication.
An NVIDIA H100 has 132 Streaming Multiprocessors. Each SM contains CUDA cores and Tensor Cores — the hardware that runs matrix multiplications, attention kernels, and everything else that makes LLM training and inference fast.
SM Activity measures the fraction of time that an SM has at least one warp active, averaged across all SMs over the sampling window. It tracks real throughput far more closely than GPU_UTIL does.
On Hopper and later GPUs (H100, H200, B200), nvidia-smi dmon can report SM Activity directly through the GPU Performance Metrics (GPM) interface:
# Coarse utilisation every second (the sm column here is GPU_UTIL, not SM Activity)
nvidia-smi dmon -s u -d 1
# For Hopper+ with GPM metrics, selected by numeric ID
# (run nvidia-smi dmon -h to confirm the IDs your driver version supports)
nvidia-smi dmon --gpm-metrics 1,2,3 -d 1
# 1 = Graphics Activity (closest to GPU_UTIL)
# 2 = SM Activity
# 3 = SM Occupancy
For older architectures (A100, A10, etc.), SM Activity requires NVIDIA DCGM or Nsight Systems. The DCGM_FI_DEV_GPU_UTIL field is GPU_UTIL; for true SM Activity use DCGM_FI_PROF_SM_ACTIVE.
packet.ai dashboard
The packet.ai GPU dashboard samples nvidia-smi dmon continuously and surfaces both GPU_UTIL and SM Activity. When utilisation is ≥80% but SM Activity is <30%, you’ll see an alert: "Your GPU shows high utilisation but low compute activity. This often indicates a communication or memory bottleneck."
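The alert condition above is easy to reproduce in your own monitoring. The following is a minimal sketch using the two thresholds quoted in the text; the `GpuSample` type and function name are hypothetical, and this is not packet.ai's implementation.

```python
from dataclasses import dataclass

# Thresholds taken from the alert rule described in the text.
UTIL_ALERT_MIN = 80.0         # GPU_UTIL at or above this...
SM_ACTIVITY_ALERT_MAX = 30.0  # ...with SM Activity below this triggers the alert


@dataclass
class GpuSample:
    """One reading for one GPU, both values in percent (hypothetical container)."""
    gpu_util: float
    sm_activity: float


def is_busy_but_idle(sample: GpuSample) -> bool:
    """True when utilisation looks high but little compute is happening."""
    return (sample.gpu_util >= UTIL_ALERT_MIN
            and sample.sm_activity < SM_ACTIVITY_ALERT_MAX)


# Hypothetical readings: one communication-bound GPU, one healthy GPU.
samples = [GpuSample(gpu_util=98.0, sm_activity=14.0),
           GpuSample(gpu_util=96.0, sm_activity=82.0)]
flagged = [s for s in samples if is_busy_but_idle(s)]
```

Feeding it values from DCGM_FI_DEV_GPU_UTIL and DCGM_FI_PROF_SM_ACTIVE (the latter scaled to percent) would give the same check on older architectures.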
Increase batch size
The single biggest lever. Larger batches pack more work into each kernel dispatch, which means more SMs get assigned actual computation. For vLLM inference, increase --max-num-seqs. For training, increase per-GPU batch size before scaling to more GPUs.
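A rough way to see why batch size matters is to count how many output tiles a matrix multiplication produces. The sketch below assumes a hypothetical kernel that computes one 128×128 output tile per thread block and runs one block per SM at a time; real kernels use other tile shapes and tricks such as split-K, so treat the numbers as illustrative only.

```python
import math

NUM_SMS = 132  # H100 SM count quoted in the text


def sm_fill(m, n, tile_m=128, tile_n=128, num_sms=NUM_SMS):
    """Fraction of SMs that receive at least one output tile of an (m x n) result.

    Assumes one tile per thread block and one block per SM (a simplification).
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return min(tiles, num_sms) / num_sms


# m is the number of tokens in the batch, n a hypothetical hidden size of 4096.
small_batch = sm_fill(m=128, n=4096)   # 32 tiles, about 0.24 of the SMs
large_batch = sm_fill(m=1024, n=4096)  # 256 tiles, every SM has work (1.0)
```

Under these assumptions, going from 128 to 1,024 tokens per batch takes the same layer from roughly a quarter of the SMs to all of them.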
Fix data loading bottlenecks
If the GPU is waiting for the CPU to prepare batches, SM Activity collapses, while GPU_UTIL can stay elevated because short kernels and copies still register as activity. Set num_workers ≥ 4 and pin_memory=True on your DataLoader. Pre-load data to GPU memory where possible.
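Before tuning the loader, it helps to confirm that data loading is actually the bottleneck. Here is a minimal sketch that measures what fraction of wall-clock time a training loop spends waiting for the next batch. The helper names and the simulated loader are hypothetical; it works with any iterable, including a PyTorch DataLoader.

```python
import time


def data_wait_fraction(batches, train_step):
    """Return the fraction of loop time spent waiting for the next batch.

    With CUDA, train_step should synchronise before returning (for example
    with torch.cuda.synchronize()), because kernel launches are asynchronous
    and would otherwise make the compute time look close to zero.
    """
    wait = 0.0
    work = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        work += t2 - t1
    total = wait + work
    return wait / total if total > 0 else 0.0


def slow_batches(n, delay_s):
    """Hypothetical loader that takes delay_s seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


# Simulated loop: 20 ms to load each batch, 2 ms to "train" on it.
frac = data_wait_fraction(slow_batches(5, 0.02), lambda b: time.sleep(0.002))
```

A wait fraction well above half, as in this simulated case, suggests the loader rather than the GPU is the limit, which is when raising num_workers and setting pin_memory=True are most likely to help.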
Enable Flash Attention
Flash Attention 2 and 3 restructure attention computation to maximise SM occupancy and reduce HBM reads. For transformer inference this can raise SM Activity 2–4×. In vLLM, Flash Attention is enabled by default on H100/H200/B200. To select it explicitly, pass --attention-backend FLASHATTN (confirm the flag and value spelling against your vLLM version).
Right-size your GPU
If SM Activity is consistently below 20% on an H100, your workload may not need one. An H100 on packet.ai costs from $0.65/hr. If the same throughput is achievable on an RTX PRO 6000 at $0.66/hr — with higher single-GPU SM Activity on 30B models — that’s the better choice.
MFU — Model FLOP Utilisation — is the gold standard metric for training efficiency. It measures actual FLOPs executed versus theoretical peak FLOPs. Well-optimised LLM training on H100 SXM reaches 35–50% MFU; poorly optimised jobs often sit at 5–15%.
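As a worked example, MFU for dense transformer training can be estimated from throughput alone. The sketch below uses the common approximation of 6 FLOPs per parameter per token (forward plus backward, ignoring attention FLOPs) and the H100 SXM dense BF16 peak of roughly 989 TFLOPS from NVIDIA's datasheet. The model size and throughput figures are hypothetical.

```python
# Dense BF16 peak for H100 SXM, in FLOP/s (datasheet figure without sparsity).
H100_SXM_BF16_DENSE_PEAK_FLOPS = 989e12


def mfu(n_params, tokens_per_sec, num_gpus,
        peak_flops_per_gpu=H100_SXM_BF16_DENSE_PEAK_FLOPS):
    """Model FLOP Utilisation: achieved training FLOP/s over theoretical peak.

    Uses the 6 * parameters * tokens approximation, which ignores
    attention FLOPs and any recomputation.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)


# Hypothetical job: 7B-parameter model, 80,000 tokens/s across 8 GPUs.
example = mfu(n_params=7e9, tokens_per_sec=80_000, num_gpus=8)
print(f"MFU = {example:.1%}")
```

That hypothetical job lands at about 42% MFU, inside the well-optimised range quoted above.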
SM Activity is a fast, readily available proxy for MFU. You can’t always compute MFU in real time, but you can sample SM Activity continuously: with nvidia-smi dmon on Hopper and later, or with DCGM on older GPUs. Low SM Activity almost always means low MFU. Fix SM Activity first; verify with an MFU calculation once it is stable.
At packet.ai, the same 8×H100 cluster running at 85% SM Activity through the day costs the same per hour as one running at 15% SM Activity. Assuming throughput scales with SM Activity, the team at 85% gets roughly 5.7× as much training done on the same budget (85 / 15 ≈ 5.7). That’s the real cost of misreading the utilisation metric.
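The same arithmetic can be turned into an effective price. This is a minimal sketch that assumes useful work scales linearly with SM Activity, which is a simplification; the function names are hypothetical and the hourly price is a parameter, not a quoted rate.

```python
def relative_work(sm_activity_a, sm_activity_b):
    """How much more useful work cluster A does than cluster B per hour."""
    return sm_activity_a / sm_activity_b


def effective_hourly_cost(hourly_price, sm_activity):
    """Price paid per hour of fully busy SM time (sm_activity as a 0-1 fraction)."""
    return hourly_price / sm_activity


ratio = relative_work(0.85, 0.15)
print(f"{ratio:.1f}x as much work at the same hourly price")
```

Dividing your own hourly rate by measured SM Activity gives a quick figure for what each hour of real compute costs.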
Last reviewed: 10 June 2026.