
Properly measuring "GPU load" is something I've been wondering about, as an architect who's had to deploy ML/DL models but is still relatively new at it. With CPU workloads you can generally tell from %CPU, %mem and I/O how much load your system is under. But with a GPU I'm not sure how you can tell, other than by just measuring your model execution times. That makes it hard to get an idea of whether upgrading to a stronger GPU would help, and by how much. Are there established ways of doing this?


For kernel-level performance tuning you can use the occupancy calculator, as pointed out by jplusqualt, or you can profile your kernel with Nsight Compute, which will give you a ton of info.

But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model, and from that figure out how well your model is maxing out the GPU's capabilities (MFU/HFU, i.e. model/hardware FLOPs utilization).

Here is a more in-depth example of how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
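For illustration, a minimal MFU sketch for a transformer-style training step, assuming the common 6 * N FLOPs-per-token approximation for forward + backward; the parameter count, token count and peak-FLOPs figures below are placeholders you'd swap for your own:

    import time

    n_params = 7e9           # model parameter count (placeholder)
    tokens_per_step = 4e6    # global batch size in tokens (placeholder)
    peak_flops = 312e12      # per-GPU peak, e.g. A100 dense BF16
    n_gpus = 8

    # fwd + bwd training FLOPs, using the ~6 * params * tokens approximation
    flops_per_step = 6 * n_params * tokens_per_step

    start = time.perf_counter()
    # ... run one training step here ...
    step_time = time.perf_counter() - start

    mfu = flops_per_step / (step_time * peak_flops * n_gpus)
    print(f"MFU: {mfu:.1%}")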


It's harder than measuring CPU load, and depends a lot on context. For example, often 90% of a GPU's available flops are exclusively for low-precision matrix multiply-add operations. If you're doing full precision multiply-add operations at full speed, do you count that as 10% or 100% load? If you're doing lots of small operations and your warps are only 50% full, do you count that as 50% or 100% load? Unfortunately, there isn't really a shortcut to understanding how a GPU works and knowing how you're using it.
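To make that concrete, here's a toy calculation assuming A100-class peaks (~19.5 TFLOP/s FP32 on the CUDA cores vs ~312 TFLOP/s dense BF16 on the tensor cores) and a made-up measured throughput; the same measurement gives two very different "utilization" numbers depending on which peak you divide by:

    achieved = 18e12       # hypothetical measured FP32 FLOP/s
    fp32_peak = 19.5e12    # A100 FP32 (CUDA core) peak
    tensor_peak = 312e12   # A100 BF16 tensor core peak

    print(achieved / fp32_peak)    # ~0.92 -> "the FP32 units are nearly saturated"
    print(achieved / tensor_peak)  # ~0.06 -> "the headline FLOPs are mostly idle"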


The CUDA Toolkit comes with an occupancy calculator that can help you determine, based on your kernel launch parameters, how busy your GPU can potentially be.

For more information: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multi...
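Roughly, the calculator intersects the per-SM resource limits. A simplified sketch of that arithmetic (the per-SM limits below are illustrative, A100-like values; in practice they come from the device properties, and allocation granularity is ignored):

    # Per-SM hardware limits (illustrative, A100-like)
    max_threads_per_sm = 2048
    max_blocks_per_sm  = 32
    regs_per_sm        = 65536
    smem_per_sm        = 164 * 1024   # bytes

    # Hypothetical kernel launch parameters / resource usage
    block_size      = 256
    regs_per_thread = 64
    smem_per_block  = 16 * 1024       # bytes

    # How many blocks fit per SM under each limit
    limits = [
        max_threads_per_sm // block_size,
        max_blocks_per_sm,
        regs_per_sm // (regs_per_thread * block_size),
    ]
    if smem_per_block:
        limits.append(smem_per_sm // smem_per_block)

    resident_blocks = min(limits)
    occupancy = resident_blocks * block_size / max_threads_per_sm
    print(f"Theoretical occupancy: {occupancy:.0%}")  # 50% with these numbers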


You need to profile them. Nsight is one option; even torch can produce flamegraphs via its profiler.
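For example, a minimal torch.profiler run (the model and input here are stand-ins for your own) that prints per-op GPU time and dumps a Chrome trace you can view as a timeline:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for your model
    x = torch.randn(64, 1024, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

    # Per-op CUDA time; the exported trace opens in chrome://tracing or Perfetto
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace("trace.json")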




