In Data Engineer’s Lunch #13: Introduction to Airflow, we discussed the scheduling tool Airflow. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Airflow Overview
This week, as part of the Data Engineer’s Lunch, we discussed Airflow. We had previously covered Airflow in a presentation by a guest speaker, Will Angel, which can be found here. Airflow is a tool for scheduling tasks and chains of tasks, which it refers to as workflows or DAGs. It is especially useful for automating repeated processes such as common ETL tasks and recurring machine learning training.
Workflows are written in Python, so any tool we can interact with from Python is fair game as well. A workflow is made up of individual tasks, and we define the dependencies between them. This creates a workflow in the form of a DAG (directed acyclic graph) of tasks. The scheduler executes this DAG starting with the tasks that have no dependencies and moves on to other tasks once their dependencies have been met.
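To make the execution order concrete, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow code and not how Airflow's scheduler is implemented. The task names and the `run_in_dependency_order` helper are hypothetical. It only illustrates the idea of starting with tasks that have no dependencies and releasing others once their upstream tasks finish.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_in_dependency_order(deps):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        # Tasks whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Pretend each ready task ran successfully, unlocking downstream tasks.
        sorter.done(*ready)
    return waves


waves = run_in_dependency_order(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Tasks that land in the same wave do not depend on each other, which is what allows a scheduler to hand them to separate workers at the same time.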
Airflow can schedule workflows so that they run at a fixed interval. It is also possible to trigger runs manually if you just want to take advantage of how DAGs work. The tool also keeps track of metrics and logs for the runs that it manages.
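As a rough illustration of fixed-interval scheduling, the sketch below computes evenly spaced run times with the standard library. The `scheduled_run_times` helper, the start date, and the daily interval are all assumptions chosen for the example; Airflow's own scheduling logic is more involved than this.

```python
from datetime import datetime, timedelta


def scheduled_run_times(start, interval, count):
    """Return the first `count` run times, each exactly `interval` apart."""
    return [start + i * interval for i in range(count)]


# Hypothetical schedule: one run per day at noon, starting on a Monday.
runs = scheduled_run_times(datetime(2021, 1, 4, 12, 0), timedelta(days=1), 3)
for run in runs:
    print(run.isoformat())
```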
Airflow was already scalable, since it can send tasks out to workers so that many of them run at the same time. With Airflow 2.0, the scheduler itself can also be distributed to handle more load. Since Airflow workflows are written in Python, it is possible to write code that dynamically changes workflows based on different parameters. Airflow also has many Connectors and Operators that extend its functionality to other tools.
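The idea of generating a workflow from parameters can be sketched in plain Python. This is not Airflow's API: `build_workflow`, the table names, and the task naming scheme are hypothetical, and the workflow is represented as a simple mapping from each task to its upstream tasks. In Airflow the same loop would create operators inside a DAG definition instead.

```python
def build_workflow(tables):
    """Build a task -> upstream-tasks mapping with one extract/load pair per table."""
    deps = {}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        deps[extract] = set()    # extracts have no upstream tasks
        deps[load] = {extract}   # each load waits for its own extract
    # A final task that waits for every load to finish.
    deps["notify"] = {f"load_{table}" for table in tables}
    return deps


# Changing the parameter list changes the shape of the workflow.
workflow = build_workflow(["users", "orders"])
for task, upstream in workflow.items():
    print(task, "<-", sorted(upstream))
```

Adding a table to the list adds two tasks and one more dependency for `notify`, with no other code changes.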
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!