Are we talking about big data ETL here? I did not know Ray was suited for it.

thedood · on July 31, 2024

This is a specialized ETL use case - similar to taking a single SQL query and creating a dedicated distributed application tailored to run only that query. The lower-level primitives in Ray Core (tasks and actors) are general-purpose enough to make building this type of application possible, but you'll be hard-pressed to quickly (i.e., with less than 1 day of effort) make any arbitrary SQL query or dataframe operation run with better efficiency or scale on Ray than on dedicated data processing frameworks like Spark. IMO, the main value-add of frameworks like Spark lies more in unlocking "good enough" efficiency and scale for almost any ETL job relatively quickly and easily, even if they may not run your ETL job optimally.

dekhn · on July 31, 2024

Speaking as a distributed computing nerd, Ray is definitely one of the more interesting and exciting frameworks I've seen in a while. It's one of those systems where, reading the manual, I can see that I'm not going to have to learn anything new, because the mental model resembles so many distributed systems I've worked with before (I dunno about anybody else, but TensorFlow is an example of a distributed system that forced me to forget basically everything I knew before I could be even remotely productive in it).

It's unclear whether it's in the best interests of Anyscale to promote Ray as a general-purpose cluster productivity tool, even if it's good at that more general use case.

robertnishihara · on July 31, 2024

I'm glad you find it exciting!

Our intention from the start was for Ray to be general-purpose. And the core Ray APIs are quite general (basically just scheduling a Python function somewhere in a cluster or instantiating a Python class as a process somewhere in the cluster).

We had AI use cases in mind from the start, since we were grad students in AI. But the generality has really been important since AI workloads encompass a huge variety of computational patterns (allreduce style communication patterns on GPUs for training, embarrassingly parallel data processing workloads on spot instances, and so on).

dekhn · on July 31, 2024

Oh, I know all that. I used to work at Google and gave lots of money to the various groups associated with Ion Stoica at Berkeley to help stimulate more open source alternatives to Borg/MapReduce/Flume/TensorFlow. Keep up the good work.

refset · on Aug 1, 2024

Is there anybody trying to build a SQL database on Ray yet? Asking for a friend.