
Can you share some of your wisdom on setting up scalable inference infrastructure?



As someone who has run LLMs in production: using Ray is probably the worst idea. It isn't optimized for language models and is extremely slow. There's no KV caching, no model parallelism, and none of the other table-stakes features offered by Dynamo and other open-source inference frameworks. It's only useful if you have <1 QPS.

Use SGLang, vLLM, or text-generation-inference instead.
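
For scale, the single-node vLLM path is about this much code. Untested sketch; the model name is just a placeholder:

    # Minimal vLLM offline/batch sketch. The model name is a placeholder,
    # swap in whatever you actually serve.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # loads weights, allocates the KV cache
    params = SamplingParams(temperature=0.7, max_tokens=256)

    outputs = llm.generate(["Explain KV caching in one paragraph."], params)
    print(outputs[0].outputs[0].text)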


It really depends on the task. If you have 1 massive job, Ray sucks and doesn't provide table stakes. If you have 50M tiny jobs, Ray and KubeRay are great and serve as the backbone of several billion-dollar products (rough sketch of that pattern below).

Good for the goose, good for the gander...
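
To make the "50M tiny jobs" case concrete, the pattern is roughly this. Hand-wavy sketch: the fractional GPU request and the task body are placeholders, not anything from a real product.

    # Fan out lots of small, independent jobs as Ray tasks.
    # Assumption: 0.25 GPU per task and the run_job body are illustrative only.
    import ray

    ray.init()  # or ray.init(address="auto") against an existing KubeRay cluster

    @ray.remote(num_gpus=0.25)   # pack several small jobs onto one GPU
    def run_job(payload: str) -> str:
        # ... load a small model once per worker and run it on `payload` ...
        return payload.upper()   # stand-in for real inference

    futures = [run_job.remote(p) for p in ("job-1", "job-2", "job-3")]
    print(ray.get(futures))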


> If you have 1 massive job, Ray sucks and doesn't provide table stakes.

Can you say more?


This is probably true, but unlike every Nvidia product we tried, it did, you know, reply to inference requests with actual output. That said, you can serve vLLM with Ray Serve. https://docs.ray.io/en/latest/serve/tutorials/vllm-example.h...
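
Rough shape of that setup (not the exact code from the linked tutorial; model name and GPU count are placeholders):

    # Wrap a vLLM engine in a Ray Serve deployment. Sketch only; the linked
    # docs example differs in detail.
    from ray import serve
    from starlette.requests import Request
    from vllm import LLM, SamplingParams

    @serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
    class VLLMServer:
        def __init__(self, model: str):
            self.llm = LLM(model=model)
            self.params = SamplingParams(max_tokens=256)

        async def __call__(self, request: Request) -> dict:
            prompt = (await request.json())["prompt"]
            out = self.llm.generate([prompt], self.params)
            return {"text": out[0].outputs[0].text}

    app = VLLMServer.bind("meta-llama/Llama-3.1-8B-Instruct")
    # serve.run(app)  # then POST {"prompt": "..."} to the Serve endpoint

IIRC the actual tutorial wraps vLLM's async engine behind an OpenAI-compatible route, but the structure is the same: Serve owns replicas and routing, vLLM owns the engine.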


Ray doesn't offer anything if you use vLLM on top of Ray Serve though.


It does if you need pipeline parallelism across multiple nodes.
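
i.e. something like this, where Ray is the distributed executor that lets the engine span nodes. Sketch only: the parallel sizes are placeholders and these argument names have moved around between vLLM versions.

    # Multi-node sketch: tensor parallelism within a node, pipeline
    # parallelism across nodes, with Ray scheduling the workers.
    # Sizes and model name are placeholders.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=8,              # shard layers across GPUs in a node
        pipeline_parallel_size=2,            # split layer stages across nodes
        distributed_executor_backend="ray",  # Ray places workers on the cluster
    )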



