Yeah, we just have the 100-gig link; at the moment that's about all the GPU clusters can pull, but we'll probably expand bandwidth and storage as we scale.
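Rough back-of-envelope of why the link is about right for now (node count and per-GPU read rate below are illustrative assumptions, not our real numbers):

    # Sketch: how much of a 100 Gb/s storage link a GPU cluster can actually use.
    link_gbps = 100                      # colo <-> GPU cluster link, gigabits per second
    link_gBps = link_gbps / 8            # ~12.5 GB/s of usable sequential bandwidth

    nodes = 16                           # hypothetical number of training nodes
    gpus_per_node = 8
    read_per_gpu_MBps = 80               # assumed per-GPU dataloader read rate

    demand_gBps = nodes * gpus_per_node * read_per_gpu_MBps / 1000
    print(f"link: {link_gBps:.1f} GB/s, demand: {demand_gBps:.1f} GB/s")
    # ~10 GB/s of demand vs ~12.5 GB/s of link: the 100-gig link is roughly saturated,
    # which is why bandwidth gets expanded alongside storage as the cluster grows.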
I guess it's worth noting that we do have a bunch of 4090s in the colo, and they've been super helpful for e.g. calculating embeddings and such for data splits.
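A minimal sketch of the kind of job those cards handle, assuming a sentence-transformers model (the model name, batch size, and dedup check are illustrative, not the actual pipeline):

    # Embed documents on GPU so they can be clustered/deduplicated into data splits.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")   # fits easily on a 4090

    docs = ["first document ...", "second document ..."]             # placeholder corpus
    emb = model.encode(docs, batch_size=256, convert_to_numpy=True,
                       normalize_embeddings=True)

    # e.g. a crude near-duplicate check via cosine similarity before assigning splits
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)
    print("max cross-doc similarity:", sims.max())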
How did you arrive at the decision not to put the GPU machines in the colo? Were the power costs going to be too high? Or do you just expect to need more physical access to the GPU machines than to the storage ones?
When I was working at sfcompute prior to this, we saw multiple datacenters literally catch fire because the industry wasn't experienced with the power density of H100s. Our training chips just aren't a standard package the way JBODs are.
My info may be dated, but power density has gone up a ton over time. I'd expect a lot of datacenters to have plenty of space, but not much power. You can only retrofit so much additional power distribution and cooling into a building designed for much less power density.
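For a sense of scale, here's the ballpark rack math (approximate public figures, not measurements from any specific facility):

    # Rack-power math behind the "space but not power" problem.
    h100_sxm_tdp_kw = 0.7            # ~700 W per H100 SXM GPU
    server_overhead_kw = 4.0         # rough CPU/NIC/fans/PSU overhead for an 8-GPU HGX box
    per_server_kw = 8 * h100_sxm_tdp_kw + server_overhead_kw    # ~9.6 kW per server

    servers_per_rack = 4
    rack_kw = servers_per_rack * per_server_kw                   # ~38 kW per rack

    legacy_colo_rack_kw = 8          # many older colo racks were provisioned for ~5-10 kW
    print(f"H100 rack: ~{rack_kw:.0f} kW vs legacy rack budget: ~{legacy_colo_rack_kw} kW")
    # A JBOD-heavy storage rack fits the old budget; a dense H100 rack needs several times
    # the power and cooling the building was designed to deliver.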