Data Engineer’s Lunch #18: Luigi for Scheduling

In Data Engineer’s Lunch #18: Luigi for Scheduling, we discussed using Luigi as a workflow scheduler and compared it with the schedulers we covered previously, Airflow and Jenkins. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more. The documentation proposes Luigi as a solution for handling the “plumbing” of long-running batch jobs. For this post, we will investigate whether Luigi can be used as a workflow scheduler similar to Airflow.

Luigi Tasks

Luigi tasks package a unit of work and define how its code is executed. Each task is a Python class made up of parameters and methods. The class inherits from one of several base classes provided by Luigi, which determines what type of task it is. The base task type is used in the majority of cases, but you can use special types to create a task that submits a Spark job or does other specialized work.

Parameters let values be supplied to a task at runtime. Things like dates usually go here. The requires method declares which other tasks a task depends on, and the input method gives the task access to the outputs of those dependencies. The output method declares where the task writes its results. Luigi can work with local files but also with things like S3 buckets and Hadoop (HDFS). The run method contains all of the actual processing. It reads from the inputs, does whatever processing is defined, and writes to the outputs.

Luigi Central Scheduler

The Luigi Central Scheduler facilitates and manages the running of tasks. It visualizes tasks and their dependencies and tracks which tasks have succeeded and which have failed. The scheduler does not provide a way to run tasks, or parts of tasks, from within the user interface. It also does not distribute or parallelize workflows, and it cannot schedule tasks to run at a particular time. A task is triggered once when it is run from the command line, and the scheduler manages the order in which its dependencies are run.
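
To make the idea of dependency ordering concrete, here is a toy sketch using only the Python standard library (graphlib, Python 3.9+). It is not how Luigi is implemented; it only illustrates the general principle that a task runs after everything it requires. The task names and the execution_order helper are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return task names in an order where each task follows its requirements.

    `dependencies` maps a task name to the set of task names it requires.
    """
    return list(TopologicalSorter(dependencies).static_order())


# A linear chain: Report requires Aggregate, which requires Extract.
deps = {
    "Report": {"Aggregate"},
    "Aggregate": {"Extract"},
    "Extract": set(),
}

order = execution_order(deps)
print(order)  # ['Extract', 'Aggregate', 'Report']
```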

We can use cron to run Luigi tasks repeatedly at scheduled times. We could even use one of the other scheduling platforms we have discussed, like Airflow or Jenkins, to do this. Without extra help, however, a cron job running a Luigi task would still lack the monitoring and retry/timeout capabilities found in Airflow and Jenkins.
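
As one possible form of "extra help", a small wrapper script could add a timeout and retries around the command that cron launches. The sketch below is an assumption-laden illustration, not a recommended setup: the run_with_retries helper and its defaults are hypothetical, and it assumes the wrapped command exits with a non-zero code on failure, which for the luigi command line may need to be configured.

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, timeout_s=3600, delay_s=60):
    """Run `cmd` until it exits with code 0, trying at most `retries` times.

    Returns the number of the attempt that succeeded, or raises RuntimeError
    if every attempt fails or times out.
    """
    for attempt in range(1, retries + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
            print(f"attempt {attempt} exited with code {result.returncode}")
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising this.
            print(f"attempt {attempt} timed out after {timeout_s} seconds")
        if attempt < retries:
            time.sleep(delay_s)
    raise RuntimeError(f"command failed after {retries} attempts: {cmd}")


# Hypothetical usage from a script that cron calls (module and task names
# are made up):
# run_with_retries(
#     ["python", "-m", "luigi", "--module", "my_pipeline", "MyTask",
#      "--local-scheduler"]
# )
```

A crontab entry would then call this script instead of calling Luigi directly. This covers retries and timeouts only; monitoring and alerting would still need to come from somewhere else.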

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!