
"Utilization" tells you the percentage of your GPU's SM that currently have at least one thread assigned to them.

It does not at all take into account how much that thread is actually using the core's capacity.

So if e.g. your thread is blocked waiting on some data from another GPU (NCCL) and actually doing nothing, it will still show 100% utilisation. A good way to see this is when an NCCL call times out after 30 minutes for some reason, and you can see all your GPUs (except the one that caused the failure) were at 100% util, even though they clearly did nothing but wait.
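For reference, a minimal sketch of reading that same coarse counter programmatically via NVML (the library nvidia-smi sits on top of), assuming the pynvml bindings are installed; it reports the same "was any kernel resident" percentage being discussed, not how busy the SMs actually are:

    import time
    import pynvml  # NVML Python bindings (nvidia-ml-py / pynvml)

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Same figure nvidia-smi shows: percentage of the sample period in
    # which at least one kernel was executing on the device, no more.
    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu: {util.gpu}%  mem: {util.memory}%")
        time.sleep(1)

    pynvml.nvmlShutdown()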

Another example is operations with low compute intensity: say you want to add 1 to every element of a very large tensor. You effectively have to transfer every element (let's say FP8, so 1 byte) from HBM to L2, which is a very slow operation, to then simply do an add, which is extremely fast. It takes about ~1000x more time to move that byte than it takes to actually do the add, so in effect your "true" utilization is ~0.2%, but nvidia-smi (and this tool) will show 100% for the entire duration of that add.
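To make that back-of-envelope concrete, here's a rough sketch of the calculation; the bandwidth and FLOP numbers below are illustrative assumptions (roughly current-datacenter-GPU class), not figures from the post:

    # Roofline-style estimate for "add 1 to every FP8 element".
    # All hardware numbers are assumed, order-of-magnitude only.
    hbm_bandwidth = 3.0e12   # bytes/s of HBM bandwidth (assumed)
    peak_flops    = 1.0e15   # FLOP/s the SMs could sustain (assumed)

    n = 10_000_000_000       # elements in the tensor
    bytes_moved  = 2 * n     # read 1 byte + write 1 byte per element
    flops_needed = n         # one add per element

    t_memory  = bytes_moved / hbm_bandwidth
    t_compute = flops_needed / peak_flops

    print(f"memory-bound by ~{t_memory / t_compute:.0f}x")
    print(f"'true' utilization ~{100 * t_compute / t_memory:.2f}%")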

Sadly there isn't a great general way to monitor "true" utilization during training. Generally you have to come up with an estimate of how many flops your model requires per pass, look at the time it takes to do said pass, and compare the flops/sec you get to Nvidia's spec sheet. If you get around 60% of theoretical flops for a typical transformer LLM training run you are basically at max utilization.
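A sketch of that estimate (the usual MFU calculation): the ~6 * n_params flops-per-token rule of thumb, the model size, batch size, step time and peak number below are all made-up assumptions for illustration:

    # Rough MFU ("model flops utilization") estimate for transformer training.
    # Assumes the common ~6 * n_params FLOPs per token (forward + backward).
    n_params        = 7e9      # model parameters (assumed)
    tokens_per_step = 4e6      # global batch size in tokens (assumed)
    step_time       = 6.0      # measured seconds per training step (assumed)
    peak_flops      = 989e12   # spec-sheet peak per GPU, e.g. BF16 (assumed)
    n_gpus          = 64

    achieved = 6 * n_params * tokens_per_step / step_time  # FLOP/s, whole job
    mfu = achieved / (peak_flops * n_gpus)
    print(f"MFU: {100 * mfu:.1f}%")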



What about energy consumption as a proxy for it?


Definitely a better high-level metric than nvidia-smi, and probably fine if you just want a very coarse idea of whether or not you are using the GPUs reasonably at all.

But when you get to the point where you care about a few percentage points of utilisation it's just not reliable enough, as many things can impact energy consumption both ways. E.g. we had a case where the GPU cluster we were using wasn't being cooled well enough, so you would gradually see power draw getting lower and lower as the GPUs throttled themselves to avoid overheating.

You can also find cases where energy consumption is high but MFU/HFU isn't, like memory-intensive workloads.
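For what it's worth, power draw is easy to read via NVML as well; a minimal sketch assuming the pynvml bindings (throttling reasons would need separate NVML queries, not shown here):

    import pynvml  # same NVML bindings nvidia-smi uses

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Both values are reported in milliwatts.
    draw  = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
    print(f"{draw:.0f} W of {limit:.0f} W ({100 * draw / limit:.0f}%)")

    pynvml.nvmlShutdown()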


Not a great estimator, but still roughly useful; ambient temps/neighboring cards alone might influence it more than the workload does.


iirc most of the energy comes from memory IO not arithmetic, so it's still not great. A better direction, though.


This is a great explanation, thank you!



