So, I'd love to work on optimizing pipelines like this. How does one "get into" it? It seems a ML scientist with some C/C++ and infra knowledge just dips down into the system when required? Or is it CUDA/SIMD experts who move "up" into ML?
I know someone who works on this at Meta. His resume is computer-science heavy, with a master's in Machine Learning. On the previous-experience side, before joining Meta he had about a decade working as a Software Engineer on Machine Learning systems in multiple languages, such as Go, C++, and Python.
To get the job, he applied for a Software Engineer, Machine Learning position, went through the multi-step interview process, and once hired did a few weeks of training and interviewing with teams. One of the teams in charge of optimizing ML code at Meta picked him up, and now he works there.
Because of Meta's scale, optimizing code to save a few ms or watts has a huge impact on the bottom line.
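To make the scale argument concrete, here's a back-of-envelope sketch in Python. Every number is an illustrative assumption of mine, not a Meta figure:

```python
# Back-of-envelope: what a 2 ms saving per request is worth at scale.
# All inputs below are made-up illustrative assumptions.

requests_per_second = 1_000_000   # assumed fleet-wide request rate
ms_saved_per_request = 2          # assumed per-request compute saving
seconds_per_day = 86_400

cpu_seconds_saved_per_day = (
    requests_per_second * seconds_per_day * ms_saved_per_request / 1000
)
# Dividing by seconds/day gives "machines running flat out" equivalents.
machine_equivalents = cpu_seconds_saved_per_day / seconds_per_day

print(f"{cpu_seconds_saved_per_day:,.0f} CPU-s/day "
      f"≈ {machine_equivalents:,.0f} machine-equivalents")
# → 172,800,000 CPU-s/day ≈ 2,000 machine-equivalents
```

Under those assumed numbers, a 2 ms saving frees the equivalent of ~2,000 machines, which is why whole teams exist for this.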
In sum:
- Get a formal education in the area
- Get work experience somewhere
- Apply for a big tech Software Engineer role applied to ML
- Hope they hire you and have a spot in one of the teams in charge of optimizing stuff
This is helpful, thank you. There's always some luck.
I have a PhD in CS, and lots of experience in optimization and some in throughput/speedups (in an Amdahl sense) for planning problems. My biggest challenge is really getting something meaty with tight constraints or large compute requirements. By the time I get a pipeline set up, it's good enough and we move on. So it's tough to build up the skillset that gets you in the door where the big problems are.
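For anyone unfamiliar with the Amdahl-sense ceiling mentioned above, it's quick to compute; a minimal Python sketch (the 90%-parallel split is an assumed example, not from any real pipeline):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only part of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# If 90% of a pipeline parallelizes, 64 workers give ~8.8x, not 64x:
print(round(amdahl_speedup(0.90, 64), 1))  # → 8.8
```

The serial 10% dominates quickly, which is why shaving the non-parallel parts of a pipeline matters so much at cluster scale.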
A lot of the optimisation at this level is getting data into the right place at the right time, without killing the network.
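One common pattern behind "right place at the right time" is overlapping data transfer with compute. A minimal double-buffered prefetch sketch in Python; the `fetch` callable is a hypothetical placeholder for a storage read or host-to-device copy:

```python
import queue
import threading

def prefetching_loader(batches, fetch, depth=2):
    """Yield fetched batches while later ones load in a background thread."""
    q = queue.Queue(maxsize=depth)  # bounded: limits memory held in flight
    done = object()                 # sentinel marking end of the stream

    def worker():
        for b in batches:
            q.put(fetch(b))         # e.g. read from storage / copy to device
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not done:
        yield item                  # consumer computes on batch N while
                                    # the worker fetches batch N+1

# Usage: compute overlaps with fetching (fetch=square is a stand-in).
out = list(prefetching_loader(range(5), fetch=lambda i: i * i))
print(out)  # → [0, 1, 4, 9, 16]
```

Real dataloaders add sharding, retries, and pinned-memory device copies on top, but the overlap idea is the same.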
It's also a group effort to provide simple-to-use primitives that "normal" ML people can use, even if they've never touched hyperscale clusters before.
So you need a good scheduler that understands dependencies (no, the k8s scheduler(s) are shit for this, plus they won't scale past 1k nodes without eating all of your network bandwidth), then you need a dataloader that can provide the dataset access, then you need the IPC that allows sharing/joining of GPUs.
All of that needs to be wrapped up in a Python interface that's fairly simple to use.
Oh, and it needs to be secure, pass an FTC audit (i.e. you need to prove that no user data is being used), and have high utilisation efficiency and uptime.
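To illustrate the "scheduler that understands dependencies" part, here's a toy topological-order scheduler using Python's stdlib `graphlib`. The job names are hypothetical, and a real cluster scheduler also handles placement, preemption, and failures; this only shows the dependency-ordering core:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Toy job graph: each job maps to the set of jobs it depends on.
jobs = {
    "preprocess": set(),
    "build_vocab": {"preprocess"},
    "train": {"preprocess", "build_vocab"},
    "evaluate": {"train"},
}

order = []
ts = TopologicalSorter(jobs)
ts.prepare()                      # also raises on dependency cycles
while ts.is_active():
    for job in ts.get_ready():    # all dependencies of these are satisfied
        order.append(job)         # a real system dispatches these in parallel
        ts.done(job)

print(order)  # e.g. ['preprocess', 'build_vocab', 'train', 'evaluate']
```

Everything `get_ready()` returns in one pass could run concurrently on different nodes, which is exactly the fan-out a dependency-aware scheduler exploits.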
Can you say more about the network issues with thousands of k8s nodes? I'm regularly running 2,000-3,000 nodes in a GKE cluster, the majority with GPUs; is this something I need to be worrying about?
Only if you are paying for the network bandwidth. For example, if there are nodes spanning more than one zone and you pay for that traffic, you might want to think about moving stuff to a single zone.
For other settings, moving to something like opencue might be better (caveats apply)
At the end of the day, you are still moving, storing, and manipulating 1s and 0s, whether you are a front-end engineer, a back-end engineer, a systems engineer, an ML engineer, or an infra engineer.
Well, at least I fit my resume to match the "job description", because at the end of the day it's all hallucinations, and "real" software engineers who have core computer science skills can literally do anything.
Our group works on some of this stuff at Meta, and we have a pretty good diversity of backgrounds - high performance computing (the bulk), computer systems, compilers, ML engineers, etc. We are hiring.