So, I'd love to work on optimizing pipelines like this. How does one "get into" it? It seems an ML scientist with some C/C++ and infra knowledge just dips down into the system when required? Or is it CUDA/SIMD experts who move "up" into ML?


I know someone who works on this at Meta. His resume is computer science heavy, with a master's in Machine Learning. On the previous experience side, before getting into Meta, he had about a decade working as a Software Engineer on Machine Learning systems in multiple languages, such as Go, C++ and Python.

To get the job he applied for a Software Engineer position applied to Machine Learning, went through the multi-step interview process, and then, once hired, did a few weeks of training and interviewing with teams. One of the teams in charge of optimizing ML code at Meta picked him up and now he works there.

Because of Meta's scale, optimizing code to save a few ms or watts has a huge impact on the bottom line.

In sum:

- Get a formal education in the area
- Get work experience somewhere
- Apply for a big tech job as a Software Engineer applied to ML
- Hope they hire you and have a spot in one of the teams in charge of optimizing stuff


This is helpful, thank you. There's always some luck.

I have a PhD in CS, and lots of experience in optimization and some in throughput/speedups (in an Amdahl sense) for planning problems. My biggest challenge is really getting something meaty with high constraints or large compute requirements. By the time I get a pipeline set up, it's good enough and we move on. So it's tough to build up that skillset to get in the door where the big problems are.
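
For anyone unfamiliar with the term, speedup "in an Amdahl sense" means: if a fraction p of the runtime is accelerated by a factor s and the rest is untouched, the overall speedup is 1 / ((1 - p) + p / s). A minimal C sketch follows; the helper name `amdahl_speedup` is just for illustration.

```c
/* Amdahl's law: overall speedup when a fraction p (0..1) of the runtime
 * is made s times faster and the remaining (1 - p) is unchanged.
 * amdahl_speedup is a hypothetical helper name, not from any library. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

As an illustrative calculation, `amdahl_speedup(0.9, 8.0)` is about 4.7: even with 90% of the work running 8x faster, the untouched 10% caps the overall gain below 10x.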


A lot of the optimisation at this level is getting data into the right place at the right time, without killing the network.

It's also a group effort to provide simple-to-use primitives that "normal" ML people can use, even if they've never used hyper scale clusters before.

So you need a good scheduler that understands dependencies (no, the k8s scheduler(s) are shit for this, plus it won't scale past 1k nodes without eating all of your network bandwidth), then you need a dataloader that can provide the dataset access, then you need the IPC that allows sharing/joining of GPUs together.

All of that needs to be wrapped up into a Python interface that is fairly simple to use.

Oh, and it needs to be secure, pass an FCC audit (i.e. you need to prove that no user data is being used), and have high utilisation efficiency and uptime.
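
To make "a scheduler that understands dependencies" concrete, here is a toy sketch in C: jobs form a DAG, and a job is dispatched only after everything it depends on has been dispatched (Kahn's topological sort). The flat dependency matrix, `MAX_JOBS`, and the name `schedule_order` are assumptions for illustration; a real cluster scheduler also deals with placement, failures, and bandwidth, none of which appear here.

```c
#include <stddef.h>

#define MAX_JOBS 64   /* illustrative upper bound on the number of jobs */

/* dep is a flat n*n matrix: dep[i * n + j] != 0 means job i waits for job j.
 * Writes a valid run order into order[] (room for n entries) and returns
 * how many jobs were scheduled. A result smaller than n means the graph
 * has a cycle (or n exceeded MAX_JOBS, in which case 0 is returned). */
size_t schedule_order(size_t n, const unsigned char *dep, size_t *order)
{
    size_t pending[MAX_JOBS];   /* unmet dependency count per job */
    size_t count = 0;           /* jobs placed in order[] so far */

    if (n > MAX_JOBS)
        return 0;

    for (size_t i = 0; i < n; i++) {
        pending[i] = 0;
        for (size_t j = 0; j < n; j++)
            if (dep[i * n + j])
                pending[i]++;
    }

    /* Seed the queue with jobs that have no dependencies. */
    for (size_t i = 0; i < n; i++)
        if (pending[i] == 0)
            order[count++] = i;

    /* order[] doubles as the ready queue: head walks over dispatched jobs
     * and releases any job whose last dependency was just satisfied. */
    for (size_t head = 0; head < count; head++) {
        size_t done = order[head];
        for (size_t i = 0; i < n; i++)
            if (dep[i * n + done] && --pending[i] == 0)
                order[count++] = i;
    }
    return count;
}
```

With four jobs where 1 waits for 0, 2 waits for 1, and 3 waits for both 0 and 2, this yields the order 0, 1, 2, 3.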

the model stuff is the cherry on the top


can you say more about the network issues with thousands of k8s nodes? I'm regularly running 2-3000 nodes in a GKE cluster, majority have GPUs, is this something I need to be worrying about?


Only if you are paying for the network bandwidth. For example, if there are nodes spanning more than one zone, and you pay for that traffic, you might want to think about moving stuff to a single zone.

For other settings, moving to something like OpenCue might be better (caveats apply).


Ok, but back to my main question, how do I get into this?


It looks more like an infra problem than an ML one: "software architects" mixed with devops/infra/SRE people.


Well, since I'm not an ML engineer of any kind - that's good!


At the end of the day, you are still moving, storing and manipulating 1's and 0's, whether you are a front end engineer or a backend engineer or a systems engineer or an ML engineer or an infra engineer.


Yeah, but how do you get the hiring managers to see things in the same way? :)


Well, at least I fit my resume to match the 'job description', because at the end of the day it's all hallucinations, and 'real' software engineers who have core computer science skills can literally do anything.


I work on PyTorch Compilers at Meta, and I think folks enter ML Systems from all directions :)

Some folks start with more familiarity in ML research and dip down as far as they need.

Other folks come from a traditional distributed systems/compilers/HPC background, and apply those skills to ML systems.


Our group works on some of this stuff at Meta, and we have a pretty good diversity of backgrounds - high performance computing (the bulk), computer systems, compilers, ML engineers, etc. We are hiring.

Feel free to DM me to learn more.


I will, thank you. Any info is very helpful.


Start with something small - take some kernel function in C, and try to optimize it for your laptop's SIMD instruction set.
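
As a concrete starting point, here is a minimal sketch of that exercise, assuming an x86-64 laptop with SSE. The kernel (a float dot product) and the function names are made up for illustration. The scalar version is the baseline; the SSE version handles four floats per iteration and uses a scalar loop for the tail. On other architectures the guarded block is skipped and only the scalar loop runs.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics, baseline on x86-64 */
#endif

/* Baseline: plain scalar dot product. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* SIMD version: four lanes at a time when SSE is available. */
float dot_simd(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads: no alignment assumed */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);             /* spill the four partial sums */
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; i++)                     /* scalar tail (whole loop without SSE) */
        sum += a[i] * b[i];
    return sum;
}
```

The two versions can differ in the last bits because the additions happen in a different order. Time both on a large array, and also compare against optimized compiler builds (e.g. -O2 or -O3), since the compiler may already vectorize some loops on its own.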



