We have been using Luigi in production for a year now at AdRoll, to manage a graph of dozens of data processing tasks.
We have been really happy with it.
You can read more about our setup in these two blog posts:
http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-d...
http://tech.adroll.com/blog/data/2015/10/15/luigi.html
Just curious - have you guys used the Hadoop contrib stuff with Luigi? We use it almost exclusively to kick off Hadoop jobs. When I went to refactor some of my predecessor's code - which just kicked off a raw process that called 'hadoop jar' - to use the Hadoop contrib stuff, I ran into a lot of weird issues (largely around the way arguments get passed to the Hadoop job).
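For contrast, the raw-process approach being refactored away probably looked something like this (a hypothetical sketch - the function name and job arguments are made up, not the actual code). Building the argv list yourself makes the argument passing completely explicit, which is exactly the part that behaves differently once luigi.contrib.hadoop hands arguments to the job through its own machinery:

```python
import subprocess

def hadoop_jar_cmd(jar, main_class, *job_args):
    """Build the argv for a raw `hadoop jar` invocation.

    Every element is passed through verbatim, with no extra quoting or
    serialization layer in between - which is where surprises tend to
    appear when switching to a framework that passes arguments for you.
    """
    return ["hadoop", "jar", jar, main_class, *job_args]

cmd = hadoop_jar_cmd("etl.jar", "com.example.ETLJob",
                     "--input", "/data/in", "--output", "/data/out")
# subprocess.run(cmd, check=True)  # commented out: requires a Hadoop install
```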
I was just curious whether many other people are using the Hadoop contrib stuff successfully, or if I'm trying to use something that isn't well supported.
I find that a lot of the use cases end up using Hadoop anyway, and I wonder why tools like Oozie are not used. It appears as if such projects are feats of engineering and nothing more. I might be gravely mistaken, but that's how I see it. Comments that suggest otherwise would be greatly appreciated. EDIT: I also find it odd that someone would write a workflow manager because they couldn't find an equivalent one for Python.
Common complaints I've heard about Oozie are that it has a high learning curve, its UI isn't great, and people hate the fact that it is XML-based. This is a pretty decent comparison of Oozie vs Luigi (and Azkaban):
That presentation did a good job covering both the good and the bad. Do you think frameworks like Cascading or Spark make things a lot easier, as a higher abstraction on Hadoop / a different compute model?
I haven't tried Cascading, but I've started doing some work with Spark and really like it. It's usually an easier abstraction to work with, and it's much simpler to prototype and experiment with.
We have been using Luigi for a larger project and it works fine. Some people have a hard time understanding what it is for and why any scheduling software is needed at all. I find the notion of "make for data" useful.
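The "make for data" idea boils down to: a task declares upstream tasks and one output target, and it only runs if that target doesn't exist yet - just like make skips up-to-date files. A minimal stdlib sketch of those semantics (illustrating the concept, not Luigi's actual API):

```python
import os

class Task:
    requires = []          # upstream Task classes

    def output(self):      # path of the file this task produces
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

def build(task):
    """Run a task's dependencies first, then the task itself -
    but only if its output file is missing (idempotent reruns)."""
    for dep in task.requires:
        build(dep())
    if not os.path.exists(task.output()):
        task.run()

class Raw(Task):
    def output(self):
        return "raw.txt"
    def run(self):
        with open(self.output(), "w") as f:
            f.write("1\n2\n3\n")

class Summed(Task):
    requires = [Raw]
    def output(self):
        return "summed.txt"
    def run(self):
        with open("raw.txt") as f:
            total = sum(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write(str(total))

build(Summed())   # builds raw.txt, then summed.txt; rerunning is a no-op
```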
After a presentation on Luigi at a Python user group, we had a lively discussion about certain features. One issue that came up was that downstream tasks are not necessarily recomputed once you change something in the code. For that to happen, you would have to keep track of the source code as well. Similarities with Nix came up, where a change in the code leads to a different ID, so all changes can be tracked.
Shameless plug: When I started using Luigi, I missed an auto-generated-filename feature for task outputs, so I wrote a utility library for that (and a few other things): https://git.io/vg4D0