
"Utilization" tells you the percentage of your GPU's SM that currently have at least one thread assigned to them.

It does not at all take into account how much that thread is actually using the core's capacity.

So if e.g. your thread is blocked waiting on some data from another GPU (NCCL) and actually doing nothing, it will still show 100% utilisation. A good way to see this is when an NCCL call times out after 30 minutes for some reason, and you can see all your GPUs (except the one that caused the failure) were at 100% util, even though they clearly did nothing but wait.
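For reference, a minimal sketch of reading that same coarse counter programmatically via NVML (the library nvidia-smi sits on top of), assuming the pynvml bindings are installed; it reports the same "was any kernel resident" percentage being discussed, not how busy the SMs actually are:

    import time
    import pynvml  # NVML Python bindings (nvidia-ml-py / pynvml)

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Same figure nvidia-smi shows: percentage of the sample period in
    # which at least one kernel was executing on the device, no more.
    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu: {util.gpu}%  mem: {util.memory}%")
        time.sleep(1)

    pynvml.nvmlShutdown()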

Another example is operations with low compute intensity: say you want to add 1 to every element of a very large tensor. You effectively have to transfer every element (let's say FP8, so 1 byte) from HBM to L2, which is a very slow operation, to then simply do an add, which is extremely fast. It takes about ~1000x more time to move that byte than it takes to actually do the add, so in effect your "true" utilization is ~0.2%, but nvidia-smi (and this tool) will show 100% for the entire duration of that add.
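To make that back-of-envelope concrete, here's a rough sketch of the calculation; the bandwidth and FLOP numbers below are illustrative assumptions (roughly current-datacenter-GPU class), not figures from the post:

    # Roofline-style estimate for "add 1 to every FP8 element".
    # All hardware numbers are assumed, order-of-magnitude only.
    hbm_bandwidth = 3.0e12   # bytes/s of HBM bandwidth (assumed)
    peak_flops    = 1.0e15   # FLOP/s the SMs could sustain (assumed)

    n = 10_000_000_000       # elements in the tensor
    bytes_moved  = 2 * n     # read 1 byte + write 1 byte per element
    flops_needed = n         # one add per element

    t_memory  = bytes_moved / hbm_bandwidth
    t_compute = flops_needed / peak_flops

    print(f"memory-bound by ~{t_memory / t_compute:.0f}x")
    print(f"'true' utilization ~{100 * t_compute / t_memory:.2f}%")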

Sadly there isn't a great general way to monitor "true" utilization during training. Generally you have to come up with an estimate of how many flops your model requires per pass, look at the time it takes to do said pass, and compare the flops/sec you get to Nvidia's spec sheet. If you get around 60% of theoretical flops for a typical transformer LLM training run you are basically at max utilization.
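A sketch of that estimate (the usual MFU calculation): the ~6 * n_params flops-per-token rule of thumb, the model size, batch size, step time and peak number below are all made-up assumptions for illustration:

    # Rough MFU ("model flops utilization") estimate for transformer training.
    # Assumes the common ~6 * n_params FLOPs per token (forward + backward).
    n_params        = 7e9      # model parameters (assumed)
    tokens_per_step = 4e6      # global batch size in tokens (assumed)
    step_time       = 6.0      # measured seconds per training step (assumed)
    peak_flops      = 989e12   # spec-sheet peak per GPU, e.g. BF16 (assumed)
    n_gpus          = 64

    achieved = 6 * n_params * tokens_per_step / step_time  # FLOP/s, whole job
    mfu = achieved / (peak_flops * n_gpus)
    print(f"MFU: {100 * mfu:.1f}%")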



What about energy consumption as a proxy for it?


Definitely a better high-level metric than nvidia-smi, and probably fine if you just want a very coarse idea of whether or not you are using the GPUs reasonably at all.

But when you get to the point where you care about a few percentage points of utilisation it's just not reliable enough, as many things can impact energy consumption both ways. E.g. we had a case where the GPU cluster we were using wasn't being cooled well enough, so you would gradually see power draw getting lower and lower as the GPUs throttled themselves to avoid overheating.

You can also find cases where energy consumption is high but MFU/HFU isn't, like memory-intensive workloads.
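For what it's worth, power draw is easy to read via NVML as well; a minimal sketch assuming the pynvml bindings (throttling reasons would need separate NVML queries, not shown here):

    import pynvml  # same NVML bindings nvidia-smi uses

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Both values are reported in milliwatts.
    draw  = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
    print(f"{draw:.0f} W of {limit:.0f} W ({100 * draw / limit:.0f}%)")

    pynvml.nvmlShutdown()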


Not a great estimator, but still roughly useful; ambient temps/neighboring cards alone might influence it more than the workload does.


iirc most of the energy comes from memory IO not arithmetic, so it's still not great. A better direction, though.


This is a great explanation, thank you!



