Visualize Data from Cassandra in Superset

In this post, I will walk through how to visualize data from Cassandra in Superset (by means of Presto), and along the way share some observations about my experience and potential use cases that might be interesting for you.

Apache Superset has been around for a while now and has gained quite a bit of traction. I personally heard about it through an interview on the Astronomer podcast with Maxime Beauchemin, who is the creator of not only Apache Airflow, but Superset as well. Having an interest in data visualization, I’ve been looking forward to exploring Superset and seeing what it contributes to the larger data visualization ecosystem as a whole.

Quick disclaimer: I approach this topic more as a data engineer than a data scientist and will be writing from that perspective, so I will be writing from that perspective. Hopefully, it will be helpful for other data engineers who, like me, have limited experience with data visualization but want to learn more.

Introduction to Superset

Superset bills itself as “a modern data exploration and visualization platform.” It will come as no surprise to learn that it is built to work well with distributed platforms, given that its creator also created Apache Airflow. In its docs, Superset claims that it is “cloud-native and designed to be highly available. It was designed to scale out to large, distributed environments and works very well inside containers. While you can easily test drive Superset on a modest setup or simply on your laptop, there‚Äôs virtually no limit around scaling out the platform.”

Part of how it does this is through its infrastructure, which requires a webserver, a metadata db engine, a message queue, a results backend, and a caching layer. Some of these components can overlap – for example, in the setup for this tutorial, we will cover these five requirements with just these three components: Gunicorn webserver, Redis (for the message queue, results backend, and caching), and Postgres (for the metadata db).

Ingesting your Cassandra data into Superset

In regard to how you ingest your data, Superset works with most interfaces that support SQL queries. This means that while Yugabyte for example, works out of the box due to its Postgres compatibility, Cassandra does not. For this reason, if you want to visualize data from Cassandra in Superset, a post by the Senior Developer Advocate for Superset recommends using an intermediate query engine like Trino or Presto. This setup provides the SQL compatibility to Cassandra that we need, as well as other benefits that come with such a tool in the process.

What we cover in this post

First, we will connect Cassandra with Presto using docker-compose. From there, we will follow this guide to setting up Superset in docker-compose to connect Superset to Presto. You can find the code behind this post on GitHub.

Step 1: Connecting Presto with Cassandra

For connecting Presto with Cassandra, there is nothing too novel with this process, as it has been well documented already. We will be using the docker image provided by Ahana; for more information on that container, see here.

In order to avoid putting more details in here than we need to, for this part of the tutorial you can follow the README in the GitHub repo for a step-by-step guide.

The gist of it is to start up the Docker containers using docker-compose, then create a catalog in Presto for Cassandra. We will name ours cassandra (go figure).

Ingest Sample Data into Cassandra

We will also ingest some sample data at this point as explained in the GitHub README. For this, you can ingest by doing writes through Presto (using SQL) or CQLSH (using CQL).

At this point, our data is now accessible in two ways: 1) From Cassandra directly:

CQLSH results for spacecraft data from Cassandra

Or 2) Using Presto, e.g., using the Presto CLI:

Presto spacecraft query results for data from Cassandra


Again, although our main point of this exercise is to connect Cassandra to Superset, we also get other benefits of having Presto in the process, e.g., if you run some queries here, you can see a record of what queries were run in the Presto GUI Dashboard:

Presto query metrics from Cassandra queries

Step 2: Connecting Superset to Cassandra with Presto

Now that we have Cassandra hooked up to Presto, we can connect Presto with Superset.

Setup Superset

First, we will install Superset using docker compose. The recommended way to do this is through cloning the Superset repo and using their Dockerfiles and docker-compose yaml files directly. (Again, more detailed instructions can be found in our repo’s README).

This will set us up with a Superset instance with the following components

  • web server: Gunicorn
  • metadata database engine: Postgres
  • message queue: Redis
  • results backend: Redis
  • caching layer: Redis

After running docker-compose up, it is now set up here: http://localhost:8088. After you log in, you are greeted with a screen that looks like this:

Superset default view

As nice as it is to start off with a “clean slate” as it were, it is even better to have some sample datasets without having to connect anything, and Superset provides just that. For example, there is a sample dashboard with World Bank Data:

Superset sample dashboard from world bank

That won’t help us however, we want to see our data from Cassandra!

Connect Superset to Cassandra through Presto

First, add a Data connection using “Data > Connect Database”:

Adding a db connection in superset

From there, select “Presto”:

Adding Presto database in Superset

One way to access Presto from Superset would be to put them in the same docker network, e.g., by putting in same docker-compose.yml file or chaining the two docker-compose.yml files using something like:

docker-compose -f ../superset/docker-compose-non-dev.yml -f docker-compose.yml up -d

However, instead, we will just take advantage of the exposed ports on the Presto container. Get our host IP using:

hostname -I | awk '{print $2}'

This will result in the connection address looking something like this (using cassandra as path, since that is what we named our catalog that connects to our Cassandra instance):

presto://<result-from-last-step>:8080/cassandra

Insert that into the box, and test the connection to make sure it looks good:

Superset connection to Presto test for Cassandra data catalog

Test Some Queries in Superset

You should now be ready to run some SQL queries through Presto to read from your C* db. Superset includes an SQL IDE called SQL Lab expressly for this purpose.

Navigate to SQL > SQL Lab to get started:

Opening SQL Lab in Superset

Make sure to select “Presto” as your Database, and then you can run arbitrary Presto SQL in the query command box. The result should look like this:

Sample query in SQL Lab in Superset with Presto connection for Cassandra data catalog

Step 3: Visualize Your Cassandra Data

Of course, we didn’t go through all this trouble just so we could query our data! We want to visualize it!

Finally, we are ready to visualize data from Cassandra in Superset. First, we need to create a “Dataset” in Superset. Navigate to “Datasets” and click the “+ Dataset” button:

Adding a Dataset in Superset

Let’s start with demo.spacecraft_journey_catalog:

Setting a Cassandra keyspace as schema when adding a dataset in Superset

Now let’s make some charts for demo.spacecraft_journey_catalog. Adding the table as a dataset will bring us to this view automatically, so we can just start creating some charts.

Visualize Spacecraft Data

Some examples include:

A pie chart showing “Journeys per spacecraft”

We can use a COUNT(*) query to see how many unique journey_id’s there are per spacecraft name.

Pie chart of Cassandra data in Superset
A time series chart for “Journeys per year”

By using the start column, we can do some time series line charts, such as finding out how many journeys started in a given year.

Time series visualization in Superset
Speed over time

We can also do a similar thing with demo.spacecraft_speed_over_time. Make sure to set up the Dataset for this as well, just as we did with the previous C* table.

Comparing Cassandra spacecraft speed data in time series chart in Superset

Visualize Cassandra Data using Joins

Part of the advantage of using Presto is that it allows joins between tables, which is not allowed in raw CQL. Of course, depending on your data model you might have to be careful running these operations so you don’t bring down your nodes (!).

As before, make sure to create all of the needed datasets first. Then run the join in SQL Lab:

select spacecraft_journey_catalog.spacecraft_name, spacecraft_journey_catalog.summary, spacecraft_speed_over_time.speed, spacecraft_speed_over_time.reading_time from spacecraft_journey_catalog 
inner join spacecraft_speed_over_time on spacecraft_journey_catalog.journey_id = spacecraft_speed_over_time.journey_id;
Running a select join query in SQL Lab in Superset


From there, click the “Create Chart” button to visualize the results.

Create chart button in Superset

Put the visualizations together, and you have the start of a working dashboard:

Cassandra spacecraft data dashboard in Superset

Geocoding Visualizations

Using Datastax Academy Sample sensor data we can also get some cool maps going in Superset. This will require a Mapbox API token (the default public token worked great for me).

With a basic setup and the sample scripts included here, you can quickly generate some maps from your data:

Geocoding chart of sensors by location in Superset

You can limit “Clustering Radius” to 0 (so that sensors are not grouped together at all) and filter by network name to see specific locations of sensors within a given network:

Geocoding chart of sensors by location  in region.

Making a chart representing temperatures for sensors is also pretty straightforward:

Geocoding chart of heat for sensors by location in Superset

Already we see the advantages of what Superset calls their “simple no-code viz builder”. Even with such fairly generic datasets as we used here, we are able to derive insights and create visualizations without very much hassle or configuration at all. It did not matter that I have never even tried Mapbox before – all it took was a couple of columns with “latitude” and “longitude” data and we were able to create several visualizations that could be plotted directly to several different kinds of maps.

And again, using the charts we put together, we can view them together in a single dashboard:

Sensor data dashboard in Superset

Other Potential Use Case: Visualize Cassandra System Metrics

Before closing, I thought I would share another potential use case for Mapbox with Cassandra in particular. Without much effort at all, Apache Superset allows you to create custom visualizations for Cassandra system data using the system keyspace.

Cassandra system tables charts data in Superset

Of course, there are other tools made especially for Cassandra cluster metrics such as Prometheus with Grafana, but it is a possibility that might be worth exploring.

Conclusion

It is easy to tell why Superset has become so popular: it is powerful, it creates beautiful visualizations, and it is simple. Frankly, I was surprised at how seamless the whole process was. Having dabbled with various Python and Javascript charting libraries before, I found Apache Superset was extremely easy to use and well documented. There are sane defaults for what kinds of visualizations you want to make, but with quite a lot of customization allowed as well.

The initial integration process was also pretty straightforward. It would be nice to have some docker-compose files available that didn’t require cloning the whole Superset repo, but that is not even worth complaining about, and will probably come with time.

At the same time, this much infrastructure might or might not be reasonable for your project. Just to run Superset on its own, we needed to add Redis and Postgres. To visualize Cassandra data specifically in Superset, there is no direct connector at this point, unfortunately, which means users need to add an intermediate layer such as Presto or Trino in addition. If your project has some of these tools already, this might not be a problem, but it is something to consider.

Another potential concern would be if your charts need to have a lot of interaction, especially custom interaction. I did not explore thoroughly enough to say anything for certain and there is some interactivity available (e.g., see this demo from ApacheCon Asia 2022). However, in general, “no-code” means limitations for customization, so I would be careful in this regard if you have specific needs.

Credits

This post on how to visualize Cassandra data in Superset is built off of the more general post about Open Source BI Tools written by my teammate Arpan Patel.

Our spaceship dataset is based on the SparkSQL notebook from Datastax Studio. For the basic schema on which this was based, see the examples provided here. Data entries and schema were modified slightly by Arpan Patel and made available by his demo for Presto, Airflow, and Cassandra here. The data was modified again for this current demo.

We also borrowed from Datastax Academy Sample sensor data. Slight modifications were made to make sure it runs in the correct keyspace.

Leave a Comment

Your email address will not be published. Required fields are marked *