What does GPU utilisation in nvidia-smi actually measure?

GPU_UTIL measures the percentage of time at least one CUDA kernel was executing. It does not measure how many Streaming Multiprocessors were active or whether tensor math was happening. It is a binary activity indicator, not a compute efficiency metric.

What is the difference between GPU utilisation and SM Activity?

GPU_UTIL measures whether any kernel was running. SM Activity measures the fraction of SMs actively executing compute work. A GPU with 95% GPU_UTIL and 10% SM Activity is mostly idle from a compute standpoint.

How do I measure SM Activity with nvidia-smi?

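On Hopper and later GPUs, run nvidia-smi dmon --gpm-metrics 2,3 -d 1 to sample SM Activity (metric 2) and SM Occupancy (metric 3) each second; nvidia-smi dmon -h lists the metric IDs your driver supports. Plain nvidia-smi dmon -s u -d 1 reports only the coarse utilisation counters. For A100 and older, SM Activity requires NVIDIA DCGM using DCGM_FI_PROF_SM_ACTIVE.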

My GPU shows 95% utilisation but training is slow. What's wrong?

High GPU_UTIL with slow training usually means a bottleneck outside the GPU: data loading, NCCL communication overhead, or small batch sizes. Check SM Activity via nvidia-smi dmon and increase batch size if it's below 30%.

What is a good SM Activity level for LLM training and inference?

Well-optimised LLM training on H100 SXM typically reaches 70-85% SM Activity (35-50% MFU). Production inference at high concurrency sits at 60-80%. Development and fine-tuning often run at 20-40%.

GPU Utilization: The Lie Your Dashboard Tells You

GPU_UTIL in nvidia-smi measures if any kernel is running — not whether your GPU is computing. Here’s why SM Activity is the metric you should actually watch.

packet.ai Team

February 3, 2025

GPU utilisation in nvidia-smi measures whether any kernel is running on the GPU — not whether it’s doing useful compute. A GPU showing 95% utilisation can be doing almost no actual tensor math, while you pay full price for it.

Key takeaways

GPU_UTIL in nvidia-smi measures time any kernel was active — even a tiny memcpy kernel counts as 100% utilised for that moment
SM Activity measures what fraction of Streaming Multiprocessors are doing real compute work — this is the metric that correlates with actual throughput
An H100 has 132 SMs — if one SM runs at 100% while the rest idle, GPU_UTIL can still read near 100% while SM Activity is 0.76%
Memory-bound workloads commonly show 80–95% GPU_UTIL alongside 10–20% SM Activity — the GPU is waiting for data, not computing
Increasing batch size is the single biggest lever for raising SM Activity in LLM training and inference

The GPU utilisation metric is one of the most widely misread numbers in ML infrastructure. Teams see 95% and assume their GPU is working hard. Often it’s idle, waiting for data or network. Understanding the difference between GPU_UTIL and SM Activity is worth real money at scale.

What GPU_UTIL actually measures (and why it lies)

When you run nvidia-smi, the GPU-Util column reports the percentage of time over the sampling window that at least one CUDA kernel was active on the GPU. The operative word is “at least one.”

A single tiny memory copy kernel executing for 1 millisecond per second registers as activity. A NCCL communication kernel shuffling data between GPUs counts as activity. Any kernel — regardless of how many of the GPU’s 132 Streaming Multiprocessors (H100) it actually uses — counts.

This means GPU_UTIL tells you one binary thing: was the GPU awake? It tells you nothing about whether it was doing useful tensor math.
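
To make "at least one kernel" concrete, here is a toy model in Python. It is a sketch, not how the driver actually samples: it assumes we already have a list of kernel (start, end) intervals in seconds, and the function name gpu_util_percent is hypothetical.

```python
def gpu_util_percent(kernel_intervals, window_s):
    """Percent of the window covered by at least one kernel interval.

    kernel_intervals: iterable of (start_s, end_s) pairs, possibly overlapping.
    This toy model ignores how many SMs each kernel uses, which is exactly
    the blind spot of the "any kernel running" definition.
    """
    covered = 0.0
    current_end = 0.0
    for start, end in sorted(kernel_intervals):
        start = max(start, current_end, 0.0)  # skip time already counted
        end = min(end, window_s)              # clip to the sampling window
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / window_s


# Ten back-to-back small kernels keep the GPU "busy" for the whole window,
# no matter how few SMs each one occupies.
small_kernels = [(i * 0.1, (i + 1) * 0.1) for i in range(10)]
print(gpu_util_percent(small_kernels, window_s=1.0))

# One kernel that runs for half the window reads as 50%.
print(gpu_util_percent([(0.0, 0.5)], window_s=1.0))  # 50.0
```

Nothing in the function takes an SM count as input, so a one-SM kernel and a 132-SM kernel score the same.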

⚠ Real example

An 8×H100 node running a NCCL-heavy distributed training workload can show ~100% GPU_UTIL across all cards while SM Activity and SM Occupancy sit at 10–20%. The GPUs are continuously executing small communication kernels — and doing almost no actual matrix multiplication.

SM Activity: the metric that actually matters

An NVIDIA H100 has 132 Streaming Multiprocessors. Each SM contains CUDA cores and Tensor Cores — the hardware that runs matrix multiplications, attention kernels, and everything else that makes LLM training and inference fast.

SM Activity measures the percentage of those SMs that are actively doing compute work during the sampling window. This is the number that directly correlates with throughput.
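
As a worked example of that definition, assume we could observe, for each SM, the fraction of the sampling window it was busy. SM Activity is then the mean across SMs. The helper sm_activity_percent and its input list are illustrative only; real tools read hardware counters.

```python
H100_SM_COUNT = 132  # SM count quoted in this article for the H100


def sm_activity_percent(busy_fraction_per_sm):
    """Mean busy fraction across all SMs, as a percentage."""
    return 100.0 * sum(busy_fraction_per_sm) / len(busy_fraction_per_sm)


# One SM busy for the whole window, the other 131 idle.
one_busy = [1.0] + [0.0] * (H100_SM_COUNT - 1)
print(round(sm_activity_percent(one_busy), 2))  # 0.76
```

This is where the 0.76% figure in the key takeaways comes from: 1 / 132 ≈ 0.0076.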

Scenario	GPU_UTIL	SM Activity	Diagnosis
GPU idle	0%	0%	Nothing running
Memory-bound workload	80%	15%	Waiting for data transfer
Communication-bound (NCCL)	90%	10%	Waiting for inter-GPU network
Small batch inference	70%	30%	Underutilised — increase batch size
Compute-bound (ideal)	95%	85%	GPU actually working

How to measure SM Activity with nvidia-smi

SM Activity is available via nvidia-smi dmon. On Hopper and later GPUs (H100, H200, B200), the GPU Performance Metrics (GPM) expose it directly:

# Sample the coarse utilisation counters every second
nvidia-smi dmon -s u -d 1

# For Hopper+ with GPM metrics (SM Activity and SM Occupancy)
nvidia-smi dmon --gpm-metrics 2,3 -d 1

# 2 = SM Activity
# 3 = SM Occupancy
# Run nvidia-smi dmon -h to list the metric IDs your driver supports

For older architectures (A100, A10, etc.), SM Activity requires NVIDIA DCGM or Nsight Systems. The DCGM_FI_DEV_GPU_UTIL field is GPU_UTIL; for true SM Activity use DCGM_FI_PROF_SM_ACTIVE.
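
With DCGM, both fields can be watched side by side using dcgmi dmon. The sketch below only builds the command line; it assumes dcgmi is installed and that field IDs 203 (DCGM_FI_DEV_GPU_UTIL) and 1002 (DCGM_FI_PROF_SM_ACTIVE) match your DCGM version, which is worth confirming against the DCGM field identifier documentation. The helper name dcgmi_dmon_command is hypothetical.

```python
import shlex

# Field IDs as documented for DCGM; treat them as assumptions and confirm
# them against the field list for your installed version.
DCGM_FI_DEV_GPU_UTIL = 203
DCGM_FI_PROF_SM_ACTIVE = 1002


def dcgmi_dmon_command(field_ids, delay_ms=1000):
    """Build a `dcgmi dmon` argument list that samples the given fields."""
    fields = ",".join(str(f) for f in field_ids)
    return ["dcgmi", "dmon", "-e", fields, "-d", str(delay_ms)]


cmd = dcgmi_dmon_command([DCGM_FI_DEV_GPU_UTIL, DCGM_FI_PROF_SM_ACTIVE])
print(shlex.join(cmd))  # dcgmi dmon -e 203,1002 -d 1000
```

Passing the list to subprocess.run(cmd) would start the sampler on a host where DCGM is running.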

packet.ai dashboard

The packet.ai GPU dashboard samples nvidia-smi dmon continuously and surfaces both GPU_UTIL and SM Activity. When utilisation is ≥80% but SM Activity is <30%, you’ll see an alert: "Your GPU shows high utilisation but low compute activity. This often indicates a communication or memory bottleneck."
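
The alert condition above is simple enough to reproduce in your own monitoring. This is a minimal sketch of the stated thresholds, not the dashboard's actual implementation; the function name low_compute_alert is hypothetical.

```python
def low_compute_alert(gpu_util_pct, sm_activity_pct,
                      util_threshold=80.0, sm_threshold=30.0):
    """True when utilisation looks high but SM Activity is low."""
    return gpu_util_pct >= util_threshold and sm_activity_pct < sm_threshold


print(low_compute_alert(90, 10))  # True: the communication-bound row of the table
print(low_compute_alert(95, 85))  # False: compute-bound
print(low_compute_alert(70, 30))  # False: utilisation is under 80%
```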

Four ways to raise SM Activity

1. Increase batch size

The single biggest lever. Larger batches pack more work into each kernel dispatch, which means more SMs get assigned actual computation. For vLLM inference, increase --max-num-seqs. For training, increase per-GPU batch size before scaling to more GPUs.
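
A rough way to see why batch size matters: a thread block runs on a single SM, so a kernel that launches fewer blocks than there are SMs cannot touch all of them. The sketch below assumes a hypothetical kernel whose grid grows linearly with batch size; the blocks_per_sample value is made up, and real kernels can place several blocks on one SM.

```python
H100_SM_COUNT = 132  # SM count quoted in this article for the H100


def max_sms_touched(num_blocks, sm_count=H100_SM_COUNT):
    """Upper bound on the number of SMs a single kernel launch can occupy."""
    return min(num_blocks, sm_count)


def sm_coverage_percent(batch_size, blocks_per_sample, sm_count=H100_SM_COUNT):
    """Percent of SMs that can receive work from one launch of a
    hypothetical kernel whose grid size scales with batch size."""
    blocks = batch_size * blocks_per_sample
    return 100.0 * max_sms_touched(blocks, sm_count) / sm_count


for batch in (1, 8, 64):
    print(batch, round(sm_coverage_percent(batch, blocks_per_sample=4), 1))
# 1 3.0
# 8 24.2
# 64 100.0
```

This is an upper bound on coverage, not a prediction of measured SM Activity, but it shows how small batches leave most of the chip with nothing to do.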

2. Fix data loading bottlenecks

If the GPU is waiting for the CPU to prepare batches, SM Activity collapses, while GPU_UTIL can stay elevated because small copy and housekeeping kernels still count as activity. Set num_workers ≥ 4 and pin_memory=True on your DataLoader. Pre-load data to GPU memory where possible.
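
The underlying idea is to overlap batch preparation with compute. The standard-library sketch below illustrates that overlap with a background thread and a bounded queue. It is not how any particular framework implements its loader, it does not forward exceptions raised by the producer, and the names prefetch and load_batches are hypothetical.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks when the buffer is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def load_batches(n):
    """Stand-in for CPU-side batch preparation."""
    for i in range(n):
        yield [i] * 4


for batch in prefetch(load_batches(3)):
    print(batch)  # the training step would run here
```

While the consumer is busy with one batch, the producer is already preparing the next, so the consumer rarely waits on an empty queue.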

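3. Enable Flash Attention

Flash Attention 2 and 3 restructure attention computation to maximise SM occupancy and reduce HBM reads. For transformer inference this can raise SM Activity 2–4×. In vLLM, Flash Attention is enabled by default on H100/H200/B200. To confirm, check which attention backend vLLM reports in its startup log. The option for forcing a backend has changed between releases (older versions read the VLLM_ATTENTION_BACKEND environment variable with the value FLASH_ATTN), so consult the documentation for your installed version.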

4. Right-size your GPU

If SM Activity is consistently below 20% on an H100, your workload may not need one. An H100 on packet.ai costs from $0.65/hr. If the same throughput is achievable on an RTX PRO 6000 at $0.66/hr — with higher single-GPU SM Activity on 30B models — that’s the better choice.

The relationship to MFU (Model FLOP Utilisation)

MFU — Model FLOP Utilisation — is the gold standard metric for training efficiency. It measures actual FLOPs executed versus theoretical peak FLOPs. Well-optimised LLM training on H100 SXM reaches 35–50% MFU; poorly optimised jobs often sit at 5–15%.
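
A common back-of-envelope way to estimate MFU for dense transformer training uses roughly 6 FLOPs per parameter per token (forward plus backward pass, ignoring attention FLOPs). The sketch below applies that estimate. The peak figure of 989 TFLOPS for dense BF16 on H100 SXM is an assumption to check against the datasheet, and the model size and token rate are hypothetical.

```python
H100_SXM_PEAK_BF16_FLOPS = 989e12  # assumed dense BF16 peak; check the datasheet


def mfu(num_params, tokens_per_second, peak_flops=H100_SXM_PEAK_BF16_FLOPS):
    """Model FLOP Utilisation per GPU, using the 6 * N FLOPs-per-token estimate."""
    achieved_flops = 6 * num_params * tokens_per_second
    return achieved_flops / peak_flops


# Hypothetical 7B-parameter model processing 9,419 tokens/s on one GPU.
print(round(mfu(7e9, 9_419), 2))  # 0.4
```

A result of 0.4 falls inside the 35–50% range quoted above for well-optimised training.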

SM Activity is a fast proxy for MFU. You can’t always compute MFU in real time, but you can sample SM Activity continuously with nvidia-smi dmon on Hopper and later GPUs, or with DCGM on older ones. Low SM Activity almost always means low MFU. Fix SM Activity first; verify with an MFU calculation once stable.

At packet.ai, the same 8×H100 cluster running at 85% SM Activity through the day costs the same per hour as one running at 15% SM Activity. The team with 85% SM Activity runs 5.7× more training at the same budget. That’s the real cost of misreading the utilisation metric.
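
The 5.7× figure is the ratio of the two SM Activity levels. The sketch below reproduces that arithmetic under the simplifying assumption that useful throughput scales linearly with SM Activity. The function names and the $1.00/hr price are hypothetical.

```python
def relative_throughput(sm_activity_a, sm_activity_b):
    """How much more work A does than B per GPU-hour, assuming throughput
    scales linearly with SM Activity (a simplification)."""
    return sm_activity_a / sm_activity_b


def cost_per_useful_hour(price_per_hour, sm_activity_pct):
    """Hourly price divided by the fraction of the GPU doing compute."""
    return price_per_hour / (sm_activity_pct / 100.0)


print(round(relative_throughput(85, 15), 1))  # 5.7

# With a hypothetical $1.00/hr GPU, compare the effective hourly cost.
print(round(cost_per_useful_hour(1.00, 85), 2))  # 1.18
print(round(cost_per_useful_hour(1.00, 15), 2))  # 6.67
```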

Frequently asked questions

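How do I measure SM Activity with nvidia-smi?

On Hopper and later GPUs (H100, H200, B200), run nvidia-smi dmon --gpm-metrics 2,3 -d 1 to sample the GPU Performance Metrics for SM Activity (2) and SM Occupancy (3) each second; nvidia-smi dmon -h lists the metric IDs your driver supports. Plain nvidia-smi dmon -s u -d 1 reports only the coarse utilisation counters. For A100 and older, SM Activity requires NVIDIA DCGM using the DCGM_FI_PROF_SM_ACTIVE field.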

My GPU shows 95% utilisation but training is slow. What's wrong?

High GPU_UTIL with slow training almost always means a bottleneck outside the GPU: data loading (the CPU can't feed batches fast enough), inter-GPU communication (NCCL overhead dominates), or small batch sizes that leave most SMs idle. Check SM Activity via nvidia-smi dmon or DCGM. If it's below 30%, start by increasing batch size and enabling async data loading.

Last reviewed: 10 June 2026.