Airflow and Cassandra

Apache Cassandra Lunch #48: Airflow and Cassandra

In Apache Cassandra Lunch #48: Airflow and Cassandra, we discussed using Airflow to manage interactions with Cassandra. Specifically this week we covered Airflow Operators, and how they could be used to interact with a Cassandra cluster on the level of the various language drivers. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Airflow Overview

Airflow is a workflow scheduling platform. It is a tool for creating and managing DAGs of tasks. Tasks are snippets of python code with defined inputs, outputs, and connections to other tasks. DAGs are directed acyclic graphs of these tasks that Airflow manages the dependencies of. Airflow as a tool is great for managing the running of repeated tasks and processes, things like ETL jobs and Machine Learning workflows. The DAGs are defined in python, meaning that any tools that can be interacted with via python can be used in your workflows.

Airflow also allows the manual running of DAGs via its UI or API. It also manages the scheduling of repeatedly run DAGs with CRON syntax, again definable either in the UI or via CLI. We can also monitor individual DAG runs using Airflow, and collects logs about them.

Airflow Features

Airflow provides a UI as well as several different command line API layers for interacting with different portions of the program. DAGs are defined in python files, and can be viewed from within the UI as code or as graphs of tasks. With Airflow 2.0 the Airflow scheduler itself has become a distributed system, able to run across multiple machines. Airflow DAGs can be scheduled in CRON syntax or with common tags like @daily and @weekly.

Airflow also contains a number of provider packages for connecting to external systems. Providers packages are maintained by the Airflow community for connection with a number of common tools. These tools included AWS systems, Docker, Apache Spark, and a number of others. Relevant to this presentation, Airflow has a provider package for Apache Cassandra. It is also possible to create your own provider packages if you need functionality not covered by existing providers.

Airflow can also be used to manage different types of connections to external systems. The provider packages manage the actual interactions between Airflow and external systems, but Airflow Connections hold the information about hosts and ports that facilitate those connections. Connection types beyond a few defaults are added by their associated provider packages. These connections may contain extra information used to facilitate the connection between Airflow and whatever system it is connected to. Underneath the Connections UI tab and API, connections are JSON strings containing the defined information that Airflow programs can load and extract relevant data from.

For defining tasks within DAGs, Airflow provides common Operator functions, and provider packages can define new ones. Users also have the option to define their own operators. Mainly, Airflow DAG definitions use PythonOperator, BashOperator, and BranchOperator to define their functionality. There are a number of other operators that perform more specialized tasks.

Airflow and Cassandra

Theoretically there are at least two ways to extract value from the combination of Airflow and Cassandra. One is to use Airflow to manage tasks that interact with the data in the Cassandra cluster. These could be part of etl pipelines, data transfer schemes, or other interactions that work with data. The other would be to use Airflow to interact with the functioning of a Cassandra cluster. This would include things like triggering compaction or flushing data to SSTables. This presentation focuses on the first, as that is what the existing Cassandra provider package is meant to work with.

Airflow Cassandra Operators

The Airflow Apache Cassandra provider package contains two operators. Those two are CassandraTableSensor and CassandraRecordSensor. The way that the Airflow Cassandra Connection works is that you can define a “Schema” or keyspace name. So, because of that, these sensors work within the given keyspace. Then they take a table name for the table sensor, and a table name and record definition. And they poke at a Cassandra table until the table/record within the table registers to the sensor. But, in my experience, these sensors don’t actually work.

Airflow Cassandra Hooks

What does work, in my experience, is to combine the underlying code underneath those operators. They work using Cassandra Hooks. By combining the Cassandra Hooks with the Python Operators we can gain access to all of the functionality contained within the cassandra hook object. Since the Cassandra hook uses the python Cassandra driver to accomplish things, and exposes the session object to the user, we can use this combination to manage with airflow any process that the python Cassandra driver is capable of. In the demo below, we replicate the CassandraTableSensor operator using this method and also read data from the cassandra cluster using the Cassandra Hook.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!