
Can you share some of your wisdom on setting up scalable inference infrastructure?



As someone who has run LLMs in production: using Ray is probably the worst idea. It isn't optimized for language models and is extremely slow. There's no KV caching, no model parallelism, and none of the other table-stakes features offered by Dynamo and other open-source inference frameworks. It's only useful if you have <1 QPS.

Use SGLang, vLLM, or text-generation-inference instead.
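
For scale, the single-node vLLM path is about this much code. Untested sketch; the model name is just a placeholder:

    # Minimal vLLM offline/batch sketch. The model name is a placeholder,
    # swap in whatever you actually serve.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # loads weights, allocates the KV cache
    params = SamplingParams(temperature=0.7, max_tokens=256)

    outputs = llm.generate(["Explain KV caching in one paragraph."], params)
    print(outputs[0].outputs[0].text)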


It really depends on the task. If you have 1 massive job, Ray sucks and doesn't provide table stakes. If you have 50M tiny jobs, Ray and KubeRay are great and serve as the backbone of several billion-dollar products (rough sketch of that pattern below).

Good for the goose, good for the gander...
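
To make the "50M tiny jobs" case concrete, the pattern is roughly this. Hand-wavy sketch: the fractional GPU request and the task body are placeholders, not anything from a real product.

    # Fan out lots of small, independent jobs as Ray tasks.
    # Assumption: 0.25 GPU per task and the run_job body are illustrative only.
    import ray

    ray.init()  # or ray.init(address="auto") against an existing KubeRay cluster

    @ray.remote(num_gpus=0.25)   # pack several small jobs onto one GPU
    def run_job(payload: str) -> str:
        # ... load a small model once per worker and run it on `payload` ...
        return payload.upper()   # stand-in for real inference

    futures = [run_job.remote(p) for p in ("job-1", "job-2", "job-3")]
    print(ray.get(futures))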


> If you have 1 massive job, Ray sucks and doesn't provide table stakes.

Can you say more?


This is probably true, but unlike every Nvidia product we tried, it did, you know, reply to inference requests with actual output. That said, you can serve vLLM with Ray Serve. https://docs.ray.io/en/latest/serve/tutorials/vllm-example.h...
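
Rough shape of that setup (not the exact code from the linked tutorial; model name and GPU count are placeholders):

    # Wrap a vLLM engine in a Ray Serve deployment. Sketch only; the linked
    # docs example differs in detail.
    from ray import serve
    from starlette.requests import Request
    from vllm import LLM, SamplingParams

    @serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
    class VLLMServer:
        def __init__(self, model: str):
            self.llm = LLM(model=model)
            self.params = SamplingParams(max_tokens=256)

        async def __call__(self, request: Request) -> dict:
            prompt = (await request.json())["prompt"]
            out = self.llm.generate([prompt], self.params)
            return {"text": out[0].outputs[0].text}

    app = VLLMServer.bind("meta-llama/Llama-3.1-8B-Instruct")
    # serve.run(app)  # then POST {"prompt": "..."} to the Serve endpoint

IIRC the actual tutorial wraps vLLM's async engine behind an OpenAI-compatible route, but the structure is the same: Serve owns replicas and routing, vLLM owns the engine.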


Ray doesn't offer anything if you use vLLM on top of Ray Serve though.


It does if you need pipeline parallelism across multiple nodes.
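
i.e. something like this, where Ray is the distributed executor that lets the engine span nodes. Sketch only: the parallel sizes are placeholders and these argument names have moved around between vLLM versions.

    # Multi-node sketch: tensor parallelism within a node, pipeline
    # parallelism across nodes, with Ray scheduling the workers.
    # Sizes and model name are placeholders.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=8,              # shard layers across GPUs in a node
        pipeline_parallel_size=2,            # split layer stages across nodes
        distributed_executor_backend="ray",  # Ray places workers on the cluster
    )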



