We have been using Luigi in production for a year now at AdRoll, to manage a graph of dozens of data processing tasks.
We have been really happy with it.
You can read more about our setup in these two blog posts:
http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-d...
http://tech.adroll.com/blog/data/2015/10/15/luigi.html
Just curious - have you guys used the Hadoop contrib stuff with Luigi? We use it almost exclusively to kick off Hadoop jobs. When I went to refactor some of my predecessor's code - which just kicked off a raw process that called 'hadoop jar' - to use the Hadoop contrib stuff, I ran into a lot of weird issues (largely around the way arguments get passed to the Hadoop job).
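For contrast, the raw-process approach being refactored away probably looked something like this (a hypothetical sketch - the function name and job arguments are made up, not the actual code). Building the argv list yourself makes the argument passing completely explicit, which is exactly the part that behaves differently once luigi.contrib.hadoop hands arguments to the job through its own machinery:

```python
import subprocess

def hadoop_jar_cmd(jar, main_class, *job_args):
    """Build the argv for a raw `hadoop jar` invocation.

    Every element is passed through verbatim, with no extra quoting or
    serialization layer in between - which is where surprises tend to
    appear when switching to a framework that passes arguments for you.
    """
    return ["hadoop", "jar", jar, main_class, *job_args]

cmd = hadoop_jar_cmd("etl.jar", "com.example.ETLJob",
                     "--input", "/data/in", "--output", "/data/out")
# subprocess.run(cmd, check=True)  # commented out: requires a Hadoop install
```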
I was just curious whether many other people are using the Hadoop contrib stuff successfully, or if I'm trying to use something that isn't well supported.
I find that a lot of the use cases end up using Hadoop anyway, and I wonder why tools like Oozie are not used. It appears as if such projects are feats of engineering and nothing more. I might be gravely mistaken, but that's how I see it. Comments that suggest otherwise would be greatly appreciated. EDIT: I also find it odd that someone would write a workflow manager because they couldn't find an equivalent one for Python.
Common complaints I've heard about Oozie are that it has a high learning curve, its UI isn't great, and people hate the fact that it is XML-based. This is a pretty decent comparison of Oozie vs Luigi (and Azkaban):
That presentation did a good job covering both the good and the bad. Do you think frameworks like Cascading or Spark make things a lot easier, as a higher abstraction on Hadoop / a different compute model?
I haven't tried Cascading, but I've started doing some work with Spark and really like it. It's usually an easier abstraction to work with, and it's much simpler to prototype and experiment with.
We have been using Luigi for a larger project and it works fine. Some people have a hard time understanding what it is for and why any scheduling software is needed at all. I find the notion of "make for data" useful.
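The "make for data" idea boils down to: a task declares upstream tasks and one output target, and it only runs if that target doesn't exist yet - just like make skips up-to-date files. A minimal stdlib sketch of those semantics (illustrating the concept, not Luigi's actual API):

```python
import os

class Task:
    requires = []          # upstream Task classes

    def output(self):      # path of the file this task produces
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

def build(task):
    """Run a task's dependencies first, then the task itself -
    but only if its output file is missing (idempotent reruns)."""
    for dep in task.requires:
        build(dep())
    if not os.path.exists(task.output()):
        task.run()

class Raw(Task):
    def output(self):
        return "raw.txt"
    def run(self):
        with open(self.output(), "w") as f:
            f.write("1\n2\n3\n")

class Summed(Task):
    requires = [Raw]
    def output(self):
        return "summed.txt"
    def run(self):
        with open("raw.txt") as f:
            total = sum(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write(str(total))

build(Summed())   # builds raw.txt, then summed.txt; rerunning is a no-op
```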
After a presentation on Luigi at a Python user group, we had a lively discussion about certain features. One issue that came up was that downstream tasks are not necessarily recomputed once you change something in the code. For that to happen, you would have to keep track of the source code as well. Similarities with Nix came up, where a change in the code leads to a different ID, so all changes can be tracked.
Shameless plug: When I started using Luigi, I missed an auto-generated-filename feature for task outputs, so I wrote a utility library for that (and a few other things): https://git.io/vg4D0