
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management

In Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management, we discussed using Airflow to schedule tasks on a Cassandra cluster beyond what could be accomplished with the Cassandra provider package. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Airflow and Cassandra

Last time, in Cassandra Lunch #48, we discussed using the Cassandra provider package in Airflow. That package consists of the Cassandra connection type, the Cassandra hook class, and two Cassandra sensors: the table sensor and the record sensor. We used these components to interact with the Cassandra cluster.

Since the Cassandra hook exposes a session object in the same way that the Python Cassandra driver does, we can create, drop, and alter tables, as well as load data into tables and query them. The Cassandra hook works in concert with one of the main Airflow operators, the PythonOperator, which runs Python code as part of a DAG. By defining a Python function or callable class that uses the hook, we can add Cassandra interactions to DAGs.
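As a minimal sketch of that pattern, assuming Airflow 2.x (2.4 or later for the `schedule` argument) with the Cassandra provider installed: the connection id `cassandra_default`, the keyspace `demo`, the table `events`, and the DAG and task names are all placeholders, and the keyspace is assumed to exist already. The Airflow imports are deferred into the functions that need them so the CQL logic can be exercised on its own.

```python
from datetime import datetime


def create_and_count(session, keyspace="demo", table="events"):
    """Create a table, insert one row, and return the row count.

    `session` is any object with a driver-style execute(query, parameters).
    """
    session.execute(
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        "(id int PRIMARY KEY, note text)"
    )
    session.execute(
        f"INSERT INTO {keyspace}.{table} (id, note) VALUES (%s, %s)",
        (1, "hello"),
    )
    rows = session.execute(f"SELECT id, note FROM {keyspace}.{table}")
    return len(list(rows))


def cassandra_task():
    """Callable for the PythonOperator: open a session through the hook."""
    from airflow.providers.apache.cassandra.hooks.cassandra import CassandraHook

    hook = CassandraHook(cassandra_conn_id="cassandra_default")  # placeholder id
    try:
        return create_and_count(hook.get_conn())
    finally:
        hook.shutdown_cluster()


def build_dag():
    """Wrap the callable in a one-task DAG (names are placeholders)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="cassandra_hook_example",
        start_date=datetime(2021, 1, 1),
        schedule=None,  # trigger manually
        catchup=False,
    ) as dag:
        PythonOperator(task_id="create_and_count", python_callable=cassandra_task)
    return dag
```

In a real DAG file, the result of `build_dag()` would need to be assigned to a module-level variable so that the scheduler can discover it.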

The sensors that come with the Cassandra provider package can be used to wait until a particular table or record has been created by an external process before triggering later tasks.
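A rough sketch of how the two sensors might be chained, again assuming Airflow 2.4 or later with the Cassandra provider installed. The table name `demo.events`, the key `{"id": 1}`, the poke interval, and the task and DAG ids are illustrative values, not ones taken from the webinar.

```python
from datetime import datetime

# Keyword arguments for the two sensors; every value here is a placeholder.
TABLE_SENSOR = {
    "task_id": "wait_for_table",
    "table": "demo.events",  # keyspace.table
    "cassandra_conn_id": "cassandra_default",
    "poke_interval": 60,  # seconds between checks
}
RECORD_SENSOR = {
    "task_id": "wait_for_record",
    "table": "demo.events",
    "keys": {"id": 1},  # primary key values identifying the record
    "cassandra_conn_id": "cassandra_default",
    "poke_interval": 60,
}


def build_dag():
    """Wait for the table first, then for a specific record in it."""
    from airflow import DAG
    from airflow.providers.apache.cassandra.sensors.record import (
        CassandraRecordSensor,
    )
    from airflow.providers.apache.cassandra.sensors.table import (
        CassandraTableSensor,
    )

    with DAG(
        dag_id="cassandra_sensor_example",
        start_date=datetime(2021, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        CassandraTableSensor(**TABLE_SENSOR) >> CassandraRecordSensor(**RECORD_SENSOR)
    return dag
```

Any task placed downstream of the record sensor would then run only once the external process has written that row.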

Airflow connections just store the data necessary for connection with external services. Cassandra connections can hold IP addresses, port numbers, usernames, and passwords for connecting to a Cassandra cluster. We use this connection to create/configure a hook object that will properly connect to the Cassandra instance.
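One way to define such a connection without the web UI is an environment variable holding a connection URI, which Airflow reads from variables named `AIRFLOW_CONN_<CONN_ID>`. The helper below is a hypothetical convenience for building that URI, not part of Airflow; the host, credentials, and keyspace in the usage note are made up.

```python
from urllib.parse import quote


def cassandra_conn_uri(host, port=9042, login=None, password=None, keyspace=""):
    """Build an Airflow connection URI for a Cassandra cluster.

    Login and password are percent-encoded so that characters such as
    '@' or '/' do not break URI parsing.
    """
    credentials = ""
    if login:
        credentials = quote(login, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"cassandra://{credentials}{host}:{port}/{keyspace}"
```

For example, `cassandra_conn_uri("10.0.0.5", 9042, "cassandra", "p@ss/word", "demo")` produces `cassandra://cassandra:p%40ss%2Fword@10.0.0.5:9042/demo`. Exporting that string as `AIRFLOW_CONN_CASSANDRA_DEFAULT` would make it available under the connection id `cassandra_default`.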

Cluster Management

There are a number of things that we may want to do in Cassandra that the Python driver is not capable of. For example, some statements are specific to cqlsh: a statement like “describe keyspaces” only works in cqlsh, and the Cassandra Python driver cannot execute it. Beyond that, Cassandra installations come with a number of useful tools outside of the functionality of the Cassandra driver. Nodetool commands are important for checking the status of a cluster and for manually triggering processes that normally run automatically. They can also change the configuration of the cluster. Cassandra installations also ship with a number of utilities for working with SSTables.

Nodetool flush writes in-memory data (the memtables) to disk in the form of SSTables. Nodetool compact manually triggers compaction, which merges duplicate copies of rows, purges tombstones that are eligible for removal, and consolidates data into fewer SSTable files. We can change cluster configurations with commands like nodetool disableautocompaction. Nodetool status gives us a general status for the cluster. We could use nodetool repair to manually trigger the repair process, resolving inconsistencies between replicas. We can trigger an SSTable dump on a schedule in order to facilitate the regular backup of our data.
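Since these are ordinary command-line invocations, it can help to build them as strings before handing them to a scheduler. The sketch below uses a hypothetical `nodetool` helper (Python 3.8 or later for `shlex.join`); the keyspace `demo` and the order of the routine are illustrative choices.

```python
import shlex
import subprocess


def nodetool(subcommand, *args, host=None):
    """Return a shell-safe nodetool command line as a string."""
    cmd = ["nodetool"]
    if host:
        cmd += ["-h", host]  # target a specific node
    cmd.append(subcommand)
    cmd += args
    return shlex.join(cmd)


# An illustrative maintenance routine using subcommands from the text above.
MAINTENANCE = [
    nodetool("status"),
    nodetool("flush", "demo"),
    nodetool("compact", "demo"),
    nodetool("repair", "demo"),
]


def run_locally(command):
    """Run one command on this machine; requires nodetool on the PATH."""
    result = subprocess.run(
        shlex.split(command), check=True, capture_output=True, text=True
    )
    return result.stdout
```

`run_locally` only works on a machine with a Cassandra installation; the command strings themselves can be passed to whatever runs them.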

Cluster Management with Airflow

Normally, running these commands requires console access to one of the Cassandra nodes. If Airflow is running on the same machine as Cassandra, we can use the BashOperator with any of the commands described above to run them locally. If Cassandra is running on a different machine, we can use the SSHOperator to connect to that machine and run Bash commands there.
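A sketch of both operators in one DAG, assuming Airflow 2.4 or later with the SSH provider installed. The SSH connection id `cassandra_node_ssh`, the weekly schedule, the keyspace `demo`, and the task ids are assumptions made for illustration; `cqlsh -e` is used to run the cqlsh-only statement non-interactively.

```python
from datetime import datetime

# Commands to run on the Airflow machine itself (Cassandra installed locally).
LOCAL_COMMANDS = {
    "flush_demo": "nodetool flush demo",
    "describe_keyspaces": 'cqlsh -e "DESCRIBE KEYSPACES"',
}
# Commands to run on a remote Cassandra node over SSH.
REMOTE_COMMANDS = {
    "remote_status": "nodetool status",
    "remote_repair": "nodetool repair demo",
}


def build_dag(ssh_conn_id="cassandra_node_ssh"):
    """Create one task per command; ids and schedule are placeholders."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="cassandra_cluster_management",
        start_date=datetime(2021, 1, 1),
        schedule="@weekly",
        catchup=False,
    ) as dag:
        for task_id, command in LOCAL_COMMANDS.items():
            BashOperator(task_id=task_id, bash_command=command)

        remote = {
            task_id: SSHOperator(
                task_id=task_id, ssh_conn_id=ssh_conn_id, command=command
            )
            for task_id, command in REMOTE_COMMANDS.items()
        }
        # Check cluster status before starting the repair.
        remote["remote_status"] >> remote["remote_repair"]
    return dag
```

Keeping the commands in plain dictionaries makes it easy to review what will run on which machine before the DAG is deployed.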

The SSHOperator connects via an SSHHook, which is configured from an Airflow connection of the SSH type. The combination allows us to run commands on any machine we have SSH access to. If the cluster is running in Docker containers, we can combine the Bash or SSH operators with docker exec to run commands inside those containers. Alternatively, we can use the DockerOperator to bring up a new node and then run commands on it afterward.
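The docker exec combination amounts to prefixing the command with the container name. A small sketch, where `in_container` is a hypothetical helper and `cassandra-node1` is a made-up container name (Python 3.8 or later for `shlex.join`):

```python
import shlex


def in_container(container, command):
    """Wrap a command so that it runs inside a named Docker container."""
    return shlex.join(["docker", "exec", container, *shlex.split(command)])


# Strings suitable for a BashOperator's bash_command or an SSHOperator's command.
STATUS = in_container("cassandra-node1", "nodetool status")
KEYSPACES = in_container("cassandra-node1", 'cqlsh -e "DESCRIBE KEYSPACES"')
```

Here `STATUS` is `docker exec cassandra-node1 nodetool status`, and `KEYSPACES` keeps the CQL statement quoted as a single argument.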

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!