
Properly measuring "GPU load" is something I've been wondering about, as an architect who's had to deploy ML/DL models but is still relatively new at it. With CPU workloads you can generally tell from %CPU, %mem and I/O how much load your system is under. But with a GPU I'm not sure how you can tell, other than by just measuring your model execution times. That makes it hard to get an idea of whether upgrading to a stronger GPU would help, and by how much. Are there established ways of doing this?


For kernel-level performance tuning you can use the occupancy calculator, as pointed out by jplusqualt, or you can profile your kernel with Nsight Compute, which will give you a ton of info.

But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model, and from that figure out how well your model is maxing out the GPU's capabilities (MFU/HFU, i.e. model/hardware FLOPs utilization).

Here is a more in-depth example of how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
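For illustration, a minimal MFU sketch for a transformer-style training step, assuming the common 6 * N FLOPs-per-token approximation for forward + backward; the parameter count, token count and peak-FLOPs figures below are placeholders you'd swap for your own:

    import time

    n_params = 7e9           # model parameter count (placeholder)
    tokens_per_step = 4e6    # global batch size in tokens (placeholder)
    peak_flops = 312e12      # per-GPU peak, e.g. A100 dense BF16
    n_gpus = 8

    # fwd + bwd training FLOPs, using the ~6 * params * tokens approximation
    flops_per_step = 6 * n_params * tokens_per_step

    start = time.perf_counter()
    # ... run one training step here ...
    step_time = time.perf_counter() - start

    mfu = flops_per_step / (step_time * peak_flops * n_gpus)
    print(f"MFU: {mfu:.1%}")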


It's harder than measuring CPU load, and depends a lot on context. For example, often 90% of a GPU's available flops are exclusively for low-precision matrix multiply-add operations. If you're doing full precision multiply-add operations at full speed, do you count that as 10% or 100% load? If you're doing lots of small operations and your warps are only 50% full, do you count that as 50% or 100% load? Unfortunately, there isn't really a shortcut to understanding how a GPU works and knowing how you're using it.
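To make that concrete, here's a toy calculation assuming A100-class peaks (~19.5 TFLOP/s FP32 on the CUDA cores vs ~312 TFLOP/s dense BF16 on the tensor cores) and a made-up measured throughput; the same measurement gives two very different "utilization" numbers depending on which peak you divide by:

    achieved = 18e12       # hypothetical measured FP32 FLOP/s
    fp32_peak = 19.5e12    # A100 FP32 (CUDA core) peak
    tensor_peak = 312e12   # A100 BF16 tensor core peak

    print(achieved / fp32_peak)    # ~0.92 -> "the FP32 units are nearly saturated"
    print(achieved / tensor_peak)  # ~0.06 -> "the headline FLOPs are mostly idle"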


The CUDA Toolkit comes with an occupancy calculator that can help you determine, based on your kernel launch parameters, how busy your GPU can potentially be.

For more information: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multi...
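Roughly, the calculator intersects the per-SM resource limits. A simplified sketch of that arithmetic (the per-SM limits below are illustrative, A100-like values; in practice they come from the device properties, and allocation granularity is ignored):

    # Per-SM hardware limits (illustrative, A100-like)
    max_threads_per_sm = 2048
    max_blocks_per_sm  = 32
    regs_per_sm        = 65536
    smem_per_sm        = 164 * 1024   # bytes

    # Hypothetical kernel launch parameters / resource usage
    block_size      = 256
    regs_per_thread = 64
    smem_per_block  = 16 * 1024       # bytes

    # How many blocks fit per SM under each limit
    limits = [
        max_threads_per_sm // block_size,
        max_blocks_per_sm,
        regs_per_sm // (regs_per_thread * block_size),
    ]
    if smem_per_block:
        limits.append(smem_per_sm // smem_per_block)

    resident_blocks = min(limits)
    occupancy = resident_blocks * block_size / max_threads_per_sm
    print(f"Theoretical occupancy: {occupancy:.0%}")  # 50% with these numbers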


You need to profile them. Nsight is one option; even torch can produce flamegraphs via its profiler.
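For example, a minimal torch.profiler run (the model and input here are stand-ins for your own) that prints per-op GPU time and dumps a Chrome trace you can view as a timeline:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for your model
    x = torch.randn(64, 1024, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

    # Per-op CUDA time; the exported trace opens in chrome://tracing or Perfetto
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace("trace.json")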




