Pretty sure CUDA will limit your thread count to hardware constraints? You can’t...

bassp · 2025-01-14T20:44:20 1736887460

You can request up to 1024-2048 threads per block depending on the GPU, but each SM can only execute between 32 and 128 threads at a time! So you can have a lot more threads assigned to an SM than the SM can run at once.
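
To put rough numbers on that, here is a small sketch of the arithmetic. The figures are assumptions picked for illustration (2048 resident threads per SM, 128 threads executing at once), not values for any particular GPU; only the warp size of 32 is standard on NVIDIA hardware. The function name is made up for this example.

```python
# Illustrative numbers only (assumptions, not tied to a specific GPU).
WARP_SIZE = 32                # threads per warp on NVIDIA GPUs
MAX_RESIDENT_THREADS = 2048   # assumed limit on threads resident on one SM
LANES_PER_SM = 128            # assumed number of threads the SM executes at once


def sm_oversubscription(resident_threads: int, lanes: int,
                        warp_size: int = WARP_SIZE) -> dict:
    """Compare how many warps an SM can hold with how many it can run at once."""
    resident_warps = resident_threads // warp_size
    running_warps = lanes // warp_size
    return {
        "resident_warps": resident_warps,
        "running_warps": running_warps,
        "ratio": resident_warps / running_warps,
    }


stats = sm_oversubscription(MAX_RESIDENT_THREADS, LANES_PER_SM)
print(stats)  # {'resident_warps': 64, 'running_warps': 4, 'ratio': 16.0}
```

Under those assumptions the SM holds 16 times more warps than it can issue at once, which is what lets the scheduler switch to another warp while one is waiting on memory.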

saagarjha · 2025-01-15T04:29:14 1736915354

Right, ok. So you mean a handful of warps and not like a plethora of them for no reason.

buildbot · 2025-01-14T20:47:24 1736887644

Thread counts per block are limited to 1024 (unless I’ve missed a change and Wikipedia is wrong), but total threads per kernel is 1024 * (2^31-1) * 65535 * 65535 ~= 2^73 threads

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
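
For anyone who wants to check the arithmetic, here is a quick sketch. It assumes 1024 threads per block and the grid limits the CUDA programming guide lists for compute capability 3.0 and later (2^31-1 blocks in x, 65535 in y and z). The constant and function names are made up for this example.

```python
import math

MAX_THREADS_PER_BLOCK = 1024
# Assumed grid limits (x, y, z) for compute capability 3.0 and later.
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def max_threads_per_launch(threads_per_block: int, grid_dim: tuple) -> int:
    """Upper bound on threads in one kernel launch: threads per block times blocks."""
    blocks = math.prod(grid_dim)
    return threads_per_block * blocks


total = max_threads_per_launch(MAX_THREADS_PER_BLOCK, MAX_GRID_DIM)
print(f"{total} threads, roughly 2^{round(math.log2(total))}")
```

With those limits the bound comes out just under 2^73. It is a theoretical ceiling on the launch configuration, not something a real kernel would get close to.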

saagarjha · 2025-01-15T04:32:35 1736915555

Yeah I’m talking about the limit per-block.