Running A Databricks Notebook Against DataStax Astra

In this blog, we will cover running a Databricks notebook against DataStax Astra. We will use Databricks Community Edition and DataStax Astra’s free tier, so you can follow along without a credit card. A YouTube video is embedded below if you want to watch a live demo of this process, so be sure to check it out.

This blog is Part 2 of our Databricks and DataStax Astra series: Running A Databricks Notebook Against DataStax Astra. If you missed Part 1, you can check it out here: Connect Databricks and DataStax Astra. We will not cover how to connect Databricks and DataStax Astra in this blog; we will focus only on running a demo scenario. The embedded video also shows some of the other features Databricks provides in its Community Edition. Don’t forget to like and subscribe while you are there!

If you have been following us here at Anant, then you know that we have been working with DataStax Astra for some time. If you are not familiar with DataStax Astra, it is a cloud-native Cassandra-as-a-Service built on Apache Cassandra. DataStax Astra eliminates the overhead of installing, operating, and scaling Cassandra and also offers a 5 GB free tier with no credit card required, so it is a perfect way to get started with Cassandra in the cloud.

Check out our content on DataStax Astra below!

If you are not familiar with Databricks, it is a unified data analytics platform in the cloud for massive-scale data engineering and collaborative data science. Databricks provides a free Community Edition, but there are some limits on features (more on that here). Databricks supports large-scale data processing for batch and streaming workloads, enables analytics on the most complete and recent data, simplifies and accelerates data science on large datasets, and standardizes the ML lifecycle from experimentation to production.

Databricks adds enterprise-grade functionality to the innovations of the open-source community. As a fully managed cloud service, Databricks handles your data security and software reliability and offers the unmatched scale and performance of the cloud. Databricks is rooted in open source: its platform includes technologies such as Apache Spark, Delta Lake, and MLflow, which were originally created by the founders of Databricks. The Databricks platform also includes TensorFlow, Redash, and R.

Now we will discuss how to run a Databricks notebook against DataStax Astra. Again, if you missed Part 1, you can check out that blog to see how to get started.

We will be using DataStax Astra’s Studio notebook, so you will need to download this Studio notebook and drag-and-drop it into the Studio interface. As mentioned above, a live demonstration is included in the embedded YouTube video below. Once the notebook is in DataStax Astra Studio, we will run cells 1-3. This will create our leaves table. If you are not familiar with the leaves table, you can check out our content list linked above. Essentially, if you visit Cassandra.Link, you will see hand-curated resources associated with Cassandra. On the left-hand side of the page there is a Tags section, which allows filtering of resources based on tags. Each tag also has a count next to it, which shows how many resources carry that tag.

We will be creating 2 new tables from our leaves table and writing them to Astra using Databricks: leaves_by_tag and tags. The leaves table is the master list of resources; we will extract its data, transform it, and load it into the 2 new tables. The leaves_by_tag table will have the resources partitioned by their tags and clustered by their titles for additional uniqueness. For example, if you click the Spark tag on Cassandra.Link, you see the resources that include the Spark tag, and under each resource you also see its other tags. The tags table will hold the count of each tag, which is displayed next to the individual tags on the tag bar.
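To make the two transformations concrete before touching Spark, here is a minimal sketch in plain Scala collections. It is only a model of the idea, not the notebook code: the Leaf and LeafByTag case classes, the TagModel object, and the sample rows are all hypothetical stand-ins for the leaves table.

```scala
// Plain-Scala model of the two transformations; no Spark required.
// Leaf, LeafByTag, and the sample rows are hypothetical stand-ins for the leaves table.
object TagModel {
  final case class Leaf(title: String, url: String, tags: Seq[String])
  final case class LeafByTag(tag: String, title: String, url: String, tags: Seq[String])

  // One output row per (resource, tag) pair, similar to exploding the tags column.
  def leavesByTag(leaves: Seq[Leaf]): Seq[LeafByTag] =
    for {
      leaf <- leaves
      tag  <- leaf.tags
    } yield LeafByTag(tag, leaf.title, leaf.url, leaf.tags)

  // Number of resources carrying each tag, similar to grouping by tag and counting.
  def tagCounts(leaves: Seq[Leaf]): Map[String, Int] =
    leavesByTag(leaves).groupBy(_.tag).map { case (tag, rows) => tag -> rows.size }

  // Made-up sample data for illustration only.
  val sample: Seq[Leaf] = Seq(
    Leaf("Intro to Cassandra", "https://example.com/a", Seq("cassandra", "spark")),
    Leaf("Spark tips", "https://example.com/b", Seq("spark"))
  )
}
```

Each row produced by leavesByTag keeps the full tags list, which is why a resource listed under the Spark tag can still show its other tags.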

We can create a notebook in Databricks and connect it to the cluster we created. Then we can input the following code to confirm that we can see our leaves table. Input your database name and keyspace name in the designated spots. We could put these values into the Spark config when we create the cluster and read them within the notebook; however, for uniformity with Part 1 of the series, we will just ask you to enter them manually.

import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import spark.implicits._

// Fill in your Astra database name and keyspace name.
val dbName = ""
val keyspace = ""

// Register the Astra database as a Spark SQL catalog, then list the tables in the keyspace.
spark.conf.set(s"spark.sql.catalog.$dbName", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.sql(s"use $dbName.$keyspace")
spark.sql("show tables").show()
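As noted above, the database and keyspace names could instead come from the cluster's Spark config. A minimal sketch of that alternative follows. The key names spark.astra.dbName and spark.astra.keyspace and the NotebookSettings helper are assumptions made up for illustration; they are not part of Databricks or the connector. The helper takes a lookup function so it can be tried without a cluster.

```scala
// Hypothetical helper for reading required settings; not part of Databricks or the connector.
object NotebookSettings {
  // Hypothetical key names: use whatever keys you add to the cluster's Spark config.
  val DbNameKey = "spark.astra.dbName"
  val KeyspaceKey = "spark.astra.keyspace"

  // Looks up a setting and fails early if it is missing or blank,
  // instead of running the later cells with an empty name.
  def required(lookup: String => Option[String], key: String): String =
    lookup(key).map(_.trim).filter(_.nonEmpty).getOrElse(
      throw new IllegalArgumentException(s"Missing required setting: $key")
    )
}

// In a notebook, the lookup could be the session's runtime config:
//   val dbName   = NotebookSettings.required(key => spark.conf.getOption(key), NotebookSettings.DbNameKey)
//   val keyspace = NotebookSettings.required(key => spark.conf.getOption(key), NotebookSettings.KeyspaceKey)
```

Failing fast here is a design choice: an empty dbName would otherwise only surface later as a confusing catalog error.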

Once we have confirmed we can see our leaves table in Databricks, we will move on to the next step. We will create a new cell and input the code block below. It reads data from our leaves table, transforms it into the leaves_by_tag and tags dataframes, creates a table for each dataframe, and then writes the transformed data back into Astra. Again, don’t forget to input your database name and keyspace name into the designated spots.

import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import spark.implicits._

// Fill in your Astra database name and keyspace name.
val dbName = ""
val keyspace = ""

spark.conf.set(s"spark.sql.catalog.$dbName", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.sql(s"use $dbName.$keyspace")

// One row per (tag, resource): explode the tags collection into a single tag column.
val leavesByTag = spark.sql("select tags as tag, title, url, tags from leaves").withColumn("tag", explode($"tag"))

// Count how many resources carry each tag.
val tagsDF = spark.sql("select tags as tag from leaves").withColumn("tag", explode($"tag")).groupBy("tag").count()

// Create leaves_by_tag partitioned by tag and clustered by title, then write the rows.
leavesByTag.createCassandraTable(keyspace, "leaves_by_tag", partitionKeyColumns = Some(Seq("tag")), clusteringKeyColumns = Some(Seq("title")))
leavesByTag.write.cassandraFormat("leaves_by_tag", keyspace).mode("append").save()

// Create the tags table (no keys are passed, so the connector's defaults apply), then write the counts.
tagsDF.createCassandraTable(keyspace, "tags")
tagsDF.write.cassandraFormat("tags", keyspace).mode("append").save()

We can then re-run the first cell and confirm that we now have 3 tables in Astra. Additionally, you can go to Astra Studio and run cells 4-5 to visualize the results of the Databricks notebook.

Now we can also expand on this demo. We will run cell 6 in the Astra notebook to add a new record to the leaves table. You can run cell 3 again to confirm that there are now 3 records in the leaves table.

Next, since both tables already exist, we need a version of the 2nd cell that does not create them again. You can either comment out the two lines that create the leaves_by_tag and tags tables in the 2nd cell, leavesByTag.createCassandraTable(...) and tagsDF.createCassandraTable(...), or copy and paste the code block below into a new cell in the Databricks notebook.

If you would rather use a new cell, use this code block, which leaves out those two lines, and don’t forget to add your database name and keyspace name in the designated spots:

import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import spark.implicits._

// Fill in your Astra database name and keyspace name.
val dbName = ""
val keyspace = ""

spark.conf.set(s"spark.sql.catalog.$dbName", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.sql(s"use $dbName.$keyspace")

// One row per (tag, resource): explode the tags collection into a single tag column.
val leavesByTag = spark.sql("select tags as tag, title, url, tags from leaves").withColumn("tag", explode($"tag"))

// Count how many resources carry each tag.
val tagsDF = spark.sql("select tags as tag from leaves").withColumn("tag", explode($"tag")).groupBy("tag").count()

// The tables already exist, so only write the rows.
leavesByTag.write.cassandraFormat("leaves_by_tag", keyspace).mode("append").save()
tagsDF.write.cassandraFormat("tags", keyspace).mode("append").save()

Once we re-run cell 2 or run cell 3 in the Databricks notebook, we should see new records in the leaves_by_tag and tags tables in Astra Studio, as well as updated counts for some tags in the tags table.
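Why do we see updated counts rather than duplicate rows? Cassandra writes are upserts: a row written with an existing primary key replaces the stored values for that key. The sketch below models that behavior with a plain Scala Map. It assumes the tags table ended up keyed by tag, which is an assumption about the table that createCassandraTable generated; UpsertModel and the sample counts are hypothetical.

```scala
// Plain-Scala model of upsert-by-primary-key; this is not the connector's code.
object UpsertModel {
  // tag -> count, standing in for a tags table whose primary key is assumed to be tag.
  type TagsTable = Map[String, Long]

  // Appending rows: existing keys are overwritten, new keys are added.
  def append(table: TagsTable, rows: Seq[(String, Long)]): TagsTable =
    table ++ rows
}

// First run writes the initial counts (made-up numbers for illustration).
val afterFirstRun = UpsertModel.append(Map.empty, Seq("spark" -> 2L, "cassandra" -> 1L))

// Second run, after a new record was added to leaves: one count changes, one new tag appears.
val afterSecondRun = UpsertModel.append(afterFirstRun, Seq("spark" -> 3L, "cassandra" -> 1L, "kafka" -> 1L))
```

In this model the second run leaves three rows, one per tag, with the newer count for spark; the same reasoning applies to leaves_by_tag, where the key is the tag and title together.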

That wraps up Part 2 of our Databricks and DataStax Astra Series: Running A Databricks Notebook Against DataStax Astra. Again, if you want to watch this demo live, you can watch it in the YouTube video embedded below. Also, don’t forget to like and subscribe!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!