Airflow and Spark

Data Engineer’s Lunch #25: Airflow and Spark

In Data Engineer’s Lunch #25: Airflow and Spark, we discuss how we can use Airflow to manage Spark jobs. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

In Data Engineer’s Lunch #25: Airflow and Spark, we discuss how we can connect Airflow and Spark and manage / schedule Spark jobs using Airflow. If you are not familiar with Airflow, check out some of our previous Data Engineer’s Lunches below that cover Airflow. Additionally, you can find the rest of the Data Engineer’s Lunch YouTube playlist here!

We have a walkthrough below, which you can use to learn how to quickly spin up Airflow and Spark, connect them, and then use Airflow to run the Spark jobs. We used Gitpod as our dev environment so that you can quickly learn and test without having to worry about OS inconsistencies. You can also do the walkthrough using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough live.

Walkthrough

If you have not already opened this in Gitpod, then hit this link to get started!

1. Set up Airflow

We will be using the quick start script that Airflow provides here.

bash setup.sh

2. Start Spark in standalone mode

2.1 – Start master

./spark-3.1.1-bin-hadoop2.7/sbin/start-master.sh

2.2 – Start worker

Open port 8081 in the browser, copy the master URL, and paste in the designated spot below

./spark-3.1.1-bin-hadoop2.7/sbin/start-worker.sh <master-URL>

3. Move spark_dag.py to ~/airflow/dags

3.1 – Create ~/airflow/dags

mkdir ~/airflow/dags

3.2 – Move spark_dag.py

mv spark_dag.py ~/airflow/dags

4, Open port 8080 to see Airflow UI and check if example_spark_operator exists.

If it does not exist yet, give it a few seconds to refresh.

5. Update Spark Connection, unpause the example_spark_operator, and drill down by clicking on example_spark_operator.

5.1 – Under the Admin section of the menu, select spark_default and update the host to the Spark master URL. Save once done

5.2 – Select the DAG menu item and return to the dashboard. Unpause the example_spark_operator, and then click on the example_spark_operatorlink.

6. Trigger the DAG from the tree view and click on the graph view afterwards

7. Once the jobs have run, you can click on each task in the graph view and see their logs.

In their logs, we should see value of Pi that each job calculated, and the two numbers differing between Python and Scala

8. Trigger DAG from command line

8.1 – Open a new terminal and run airflow dags

airflow dags trigger example_spark_operator

8.2 – If we want to trigger only one task

airflow tasks run example_spark_operator python_job now

And that wraps up our basic walkthrough on using Airflow to manage Spark jobs. Again, the live recording of Data Engineer’s Lunch #25: Airflow and Spark is embedded below, so if you want to watch the walkthrough live, be sure to check it out!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!