GPU Utilization: The Lie Your Dashboard Tells You

GPU_UTIL in nvidia-smi measures if any kernel is running — not whether your GPU is computing. Here’s why SM Activity is the metric you should actually watch.

packet.ai Team
February 3, 2025

GPU utilisation in nvidia-smi measures whether any kernel is running on the GPU — not whether it’s doing useful compute. A GPU showing 95% utilisation can be doing almost no actual tensor math, while you pay full price for it.

Key takeaways

  • GPU_UTIL in nvidia-smi measures the share of time that any kernel was active: even a tiny memcpy kernel counts as 100% utilised for that moment
  • SM Activity measures what fraction of Streaming Multiprocessors are doing real compute work — this is the metric that correlates with actual throughput
  • An H100 has 132 SMs — if one SM runs at 100% while the rest idle, GPU_UTIL can still read near 100% while SM Activity is 0.76%
  • Memory-bound workloads commonly show 80–95% GPU_UTIL alongside 10–20% SM Activity — the GPU is waiting for data, not computing
  • Increasing batch size is the single biggest lever for raising SM Activity in LLM training and inference

The GPU utilisation metric is one of the most widely misread numbers in ML infrastructure. Teams see 95% and assume their GPU is working hard. Often it’s idle, waiting for data or network. Understanding the difference between GPU_UTIL and SM Activity is worth real money at scale.

What GPU_UTIL actually measures (and why it lies)

When you run nvidia-smi, the GPU-Util column reports the percentage of time over the sampling window that at least one CUDA kernel was active on the GPU. The operative word is “at least one.”

A single tiny memory copy kernel executing for 1 millisecond per second registers as activity. An NCCL communication kernel shuffling data between GPUs counts as activity. Any kernel counts, regardless of how many of the GPU’s 132 Streaming Multiprocessors (H100) it actually uses.

This means GPU_UTIL tells you one binary thing: was the GPU awake? It tells you nothing about whether it was doing useful tensor math.

⚠ Real example

An 8×H100 node running a NCCL-heavy distributed training workload can show ~100% GPU_UTIL across all cards while SM Activity and SM Occupancy sit at 10–20%. The GPUs are continuously executing small communication kernels — and doing almost no actual matrix multiplication.

SM Activity: the metric that actually matters

An NVIDIA H100 has 132 Streaming Multiprocessors. Each SM contains CUDA cores and Tensor Cores — the hardware that runs matrix multiplications, attention kernels, and everything else that makes LLM training and inference fast.

SM Activity measures the percentage of those SMs that are actively doing compute work during the sampling window. This is the number that directly correlates with throughput.
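
The gap between the two metrics can be reproduced with a toy model. The sketch below is not how the driver computes either counter; it assumes a sampling window cut into equal time slices (the `SLICES` value and the boolean busy flags are made up for illustration) and applies the two definitions from the text to one busy SM out of 132.

```python
def gpu_util(sm_busy):
    """Coarse metric: share of time slices in which at least one SM was busy."""
    slices = len(sm_busy[0])
    awake = sum(1 for t in range(slices) if any(sm[t] for sm in sm_busy))
    return 100.0 * awake / slices


def sm_activity(sm_busy):
    """Average, over all SMs, of the share of time slices each SM was busy."""
    slices = len(sm_busy[0])
    busy_slices = sum(sum(sm) for sm in sm_busy)
    return 100.0 * busy_slices / (len(sm_busy) * slices)


NUM_SMS = 132  # H100, per the text
SLICES = 100   # hypothetical number of time slices in one sampling window

# One SM busy for the whole window, the other 131 idle.
timeline = [[True] * SLICES] + [[False] * SLICES for _ in range(NUM_SMS - 1)]

print(f"GPU_UTIL    = {gpu_util(timeline):.2f}%")
print(f"SM Activity = {sm_activity(timeline):.2f}%")
```

Under these assumptions the coarse metric reads 100% while the per-SM average is 1/132, or about 0.76%.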

Scenario | GPU_UTIL | SM Activity | Diagnosis
GPU idle | 0% | 0% | Nothing running
Memory-bound workload | 80% | 15% | Waiting for data transfer
Communication-bound (NCCL) | 90% | 10% | Waiting for inter-GPU network
Small batch inference | 70% | 30% | Underutilised; increase batch size
Compute-bound (ideal) | 95% | 85% | GPU actually working

How to measure SM Activity with nvidia-smi

SM Activity is available via nvidia-smi dmon. On Hopper and later GPUs (H100, H200, B200), the GPU Performance Metrics (GPM) expose it directly:

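# Sample the coarse utilisation counters every second
# (the "sm" column here is the same any-kernel-active figure as GPU-Util)
nvidia-smi dmon -s u -d 1

# Hopper and later: add GPM metrics, selected by numeric ID
nvidia-smi dmon --gpm-metrics 1,2,3 -d 1

# 1 = Graphics Activity (closest to GPU_UTIL), 2 = SM Activity, 3 = SM Occupancy
# Run nvidia-smi dmon -h to list the metric IDs your driver supports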

For older architectures (A100, A10, etc.), SM Activity requires NVIDIA DCGM or Nsight Systems. The DCGM_FI_DEV_GPU_UTIL field is GPU_UTIL; for true SM Activity use DCGM_FI_PROF_SM_ACTIVE.
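
To log these samples rather than watch them scroll past, the dmon text can be parsed into records. The sketch below assumes the usual dmon layout (a `#` line naming the columns, a second `#` line with units, then one whitespace-separated row per GPU per sample). The sample text is made up for illustration, not captured from a real machine, and column names vary by driver version and by the metrics requested.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text into a list of {column: value} dicts."""
    columns, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first '#' line names the columns; later '#' lines
            # (units, periodically repeated headers) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            continue  # data before any header: nothing to map it to
        row = {}
        for name, value in zip(columns, line.split()):
            row[name] = None if value == "-" else float(value)  # '-' = not supported
        rows.append(row)
    return rows


# Illustrative sample only (hypothetical values in the assumed layout).
SAMPLE = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     97     12      0      0
    1     95      -      0      0
"""

for row in parse_dmon(SAMPLE):
    print(row)
```

In practice the text would come from piping the command's output into the parser, for example through `subprocess`.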

packet.ai dashboard

The packet.ai GPU dashboard samples nvidia-smi dmon continuously and surfaces both GPU_UTIL and SM Activity. When utilisation is ≥80% but SM Activity is <30%, you’ll see an alert: "Your GPU shows high utilisation but low compute activity. This often indicates a communication or memory bottleneck."
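
The alert condition is simple enough to reproduce in your own monitoring. This is a minimal sketch of the rule as described above, not the dashboard's actual implementation; the function name and parameter names are hypothetical.

```python
ALERT_MESSAGE = (
    "Your GPU shows high utilisation but low compute activity. "
    "This often indicates a communication or memory bottleneck."
)


def compute_gap_alert(gpu_util_pct, sm_activity_pct,
                      util_floor=80.0, sm_ceiling=30.0):
    """Return the alert text when utilisation is high but SM Activity is low."""
    if gpu_util_pct >= util_floor and sm_activity_pct < sm_ceiling:
        return ALERT_MESSAGE
    return None


# Rows from the scenario table: (GPU_UTIL, SM Activity)
for name, util, sm in [
    ("Memory-bound", 80, 15),
    ("Communication-bound", 90, 10),
    ("Small batch inference", 70, 30),
    ("Compute-bound", 95, 85),
]:
    print(name, "->", "ALERT" if compute_gap_alert(util, sm) else "ok")
```

With the thresholds from the text, the memory-bound and communication-bound rows trigger the alert, while the small-batch row does not because its utilisation is below 80%.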

Four ways to raise SM Activity

1. Increase batch size

The single biggest lever. Larger batches pack more work into each kernel dispatch, which means more SMs get assigned actual computation. For vLLM inference, increase --max-num-seqs. For training, increase per-GPU batch size before scaling to more GPUs.
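
A rough way to see why batch size matters is to count how many output tiles a matrix multiplication produces and compare that with the number of SMs. The sketch below is a deliberately simplified model: it assumes one 128×128 output tile per thread block and one block per SM, and the 4096-wide layer is a hypothetical example. Real kernels pick tile shapes dynamically and can split work in other ways, so treat the numbers as illustrative only.

```python
import math


def sm_fill(rows, cols, num_sms=132, tile=128):
    """Return (tile_count, fraction_of_SMs_with_work) for a rows x cols output."""
    tiles = math.ceil(rows / tile) * math.ceil(cols / tile)
    return tiles, min(1.0, tiles / num_sms)


HIDDEN = 4096  # hypothetical layer width

for batch_tokens in (1, 128, 512, 1024):
    tiles, fill = sm_fill(batch_tokens, HIDDEN)
    print(f"{batch_tokens:>5} tokens -> {tiles:>4} tiles, {fill:.0%} of SMs busy")
```

In this model a single-token batch yields 32 tiles, which leaves most of an H100's 132 SMs without work, while 512 tokens yields 128 tiles and nearly fills the chip.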

2. Fix data loading bottlenecks

If the GPU is waiting for the CPU to prepare batches, SM Activity collapses while GPU_UTIL can stay elevated from short transfer and housekeeping kernels. Set num_workers ≥ 4 and pin_memory=True on your DataLoader. Pre-load data to GPU memory where possible.
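
For PyTorch, those settings might look like the sketch below. Only `num_workers` and `pin_memory` come from the text; the batch size, `persistent_workers`, and `prefetch_factor` values are illustrative assumptions, and `build_loader` and `to_device` are hypothetical helper names. PyTorch is imported inside the helper, so it must be installed to call it.

```python
# DataLoader settings; values marked "assumption" are not from the article.
LOADER_KWARGS = {
    "batch_size": 64,            # assumption: use the largest size that fits
    "num_workers": 4,            # from the text: at least 4 CPU workers
    "pin_memory": True,          # from the text: page-locked host buffers
    "persistent_workers": True,  # assumption: keep workers alive across epochs
    "prefetch_factor": 2,        # assumption: batches queued ahead per worker
}


def build_loader(dataset):
    """Wrap a map-style dataset in a DataLoader using the settings above."""
    from torch.utils.data import DataLoader  # requires PyTorch
    return DataLoader(dataset, shuffle=True, **LOADER_KWARGS)


def to_device(batch, device="cuda"):
    """Copy a tensor batch to the GPU; with pinned memory the copy can overlap compute."""
    return batch.to(device, non_blocking=True)
```

After changing these values, check whether SM Activity rises. If it does not, the bottleneck is probably elsewhere.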

3. Enable Flash Attention

Flash Attention 2 and 3 restructure attention computation to maximise SM occupancy and reduce HBM reads. For transformer inference this can raise SM Activity 2–4×. In vLLM, Flash Attention is enabled by default on H100/H200/B200. Verify with --attention-backend FLASHATTN.

4. Right-size your GPU

If SM Activity is consistently below 20% on an H100, your workload may not need one. An H100 on packet.ai costs from $0.65/hr. If the same throughput is achievable on an RTX PRO 6000 at $0.66/hr — with higher single-GPU SM Activity on 30B models — that’s the better choice.

The relationship to MFU (Model FLOP Utilisation)

MFU — Model FLOP Utilisation — is the gold standard metric for training efficiency. It measures actual FLOPs executed versus theoretical peak FLOPs. Well-optimised LLM training on H100 SXM reaches 35–50% MFU; poorly optimised jobs often sit at 5–15%.

SM Activity is a fast proxy for MFU. You can’t always compute MFU in real time, but you can sample SM Activity continuously: with nvidia-smi dmon on Hopper and later GPUs, or with DCGM on older ones. Low SM Activity almost always means low MFU. Fix SM Activity first; verify with MFU calculation once stable.
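
For the verification step, a back-of-the-envelope MFU estimate is often enough. The sketch below assumes the common approximation of about 6 FLOPs per parameter per training token, and takes 989 TFLOPS (the dense BF16 figure on NVIDIA's H100 SXM datasheet) as peak. The 7B model and the throughput figure are hypothetical inputs, not measurements.

```python
H100_SXM_PEAK_BF16_FLOPS = 989e12  # dense BF16 peak, FLOPs per second


def mfu(params, tokens_per_second, peak_flops=H100_SXM_PEAK_BF16_FLOPS):
    """Estimate training MFU per GPU using ~6 FLOPs per parameter per token."""
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / peak_flops


# Hypothetical job: 7B-parameter model at 10,000 tokens/s per GPU.
print(f"MFU = {mfu(7e9, 10_000):.1%}")
```

That hypothetical job lands at roughly 42% MFU, inside the 35–50% band quoted above for well-optimised training.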

At packet.ai, the same 8×H100 cluster running at 85% SM Activity through the day costs the same per hour as one running at 15% SM Activity. The team with 85% SM Activity runs 5.7× more training at the same budget. That’s the real cost of misreading the utilisation metric.
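
The 5.7× figure is the ratio of the two SM Activity levels. The sketch below makes the arithmetic explicit, assuming useful work scales linearly with SM Activity (a simplification); the hourly price passed in is a placeholder, not a quoted rate.

```python
def relative_output(sm_activity_a, sm_activity_b):
    """How many times more work A does than B in the same paid hours."""
    return sm_activity_a / sm_activity_b


def cost_per_fully_active_hour(hourly_price, sm_activity):
    """Effective price of one hour of fully active compute."""
    return hourly_price / sm_activity


print(f"{relative_output(0.85, 0.15):.1f}x more training at the same budget")

PRICE = 1.00  # placeholder hourly price, for illustration only
print(f"85% SM Activity: {cost_per_fully_active_hour(PRICE, 0.85):.2f} per active hour")
print(f"15% SM Activity: {cost_per_fully_active_hour(PRICE, 0.15):.2f} per active hour")
```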

Frequently asked questions

What does GPU_UTIL in nvidia-smi measure?
GPU_UTIL in nvidia-smi measures the percentage of time over the sampling window that at least one CUDA kernel was executing on the GPU. It does not measure how many Streaming Multiprocessors were active, whether tensor math was happening, or how efficiently the GPU was being used. It is a binary activity indicator, not a compute efficiency metric.

What is the difference between GPU utilisation and SM Activity?
GPU utilisation (GPU_UTIL) measures whether any kernel was running. SM Activity measures the fraction of Streaming Multiprocessors actively executing compute work. A GPU with 95% GPU_UTIL and 10% SM Activity is mostly idle from a compute standpoint: it is running small communication or memory kernels without doing useful tensor math.

How do I measure SM Activity?
Use nvidia-smi dmon -s u -d 1 to sample the coarse utilisation metrics each second. On Hopper and later GPUs (H100, H200, B200), add --gpm-metrics with the SM Activity metric ID to expose GPU Performance Metrics. For A100 and older, SM Activity requires NVIDIA DCGM using the DCGM_FI_PROF_SM_ACTIVE field.

Why is GPU utilisation high while training is slow?
High GPU_UTIL with slow training almost always means a bottleneck outside the GPU's compute units: data loading (the CPU can't feed batches fast enough), inter-GPU communication (NCCL overhead dominates), or small batch sizes leaving most SMs idle. Check SM Activity via nvidia-smi dmon. If it's below 30%, start with increasing batch size and enabling async data loading.

What is a good SM Activity level?
Well-optimised LLM training on H100 SXM typically reaches 70–85% SM Activity, corresponding to 35–50% MFU. Production inference at high concurrency sits at 60–80%. Development and fine-tuning jobs often run at 20–40%, which is acceptable for exploratory work but worth investigating if the job runs for more than a few hours at real cost.

Last reviewed: 10 June 2026.
