In this blog, we will cover how to connect Databricks and DataStax Astra. We will use the Databricks Community Edition and DataStax Astra’s free tier, so you can follow along without a credit card. A YouTube video with a live demo of the process is embedded below, so be sure to check it out.
This is Part 1 of our series on Databricks and DataStax Astra. Here we introduce Databricks and show you how to connect it to DataStax Astra without spending a single cent. In Part 2, we will take a deeper look at Databricks notebooks and features: we will create a notebook that extracts data from our Astra database, transforms it, and writes it back into Astra, while also exploring what Databricks offers in the Community Edition.
If you have been following us here at Anant, then you know that we have been working with DataStax Astra for some time. If you are not familiar with DataStax Astra, it is cloud-native Cassandra-as-a-Service built on Apache Cassandra. DataStax Astra eliminates the overhead of installing, operating, and scaling Cassandra, and it offers a 5 GB free tier with no credit card required, so it is a perfect way to get started and/or play with Cassandra in the cloud.
Check out our content on DataStax Astra below!
- Cassandra.API Documentation Walkthrough
- Cassandra.API Blog Post: Part 1
- Cassandra.API Blog Post: Part 2
- Building a REST API with DataStax Astra using Node & Python: Part 1
- Building a REST API with DataStax Astra using Node & Python: Part 2
- Cassandra.API Live Workshop w/DataStax
- Cassandra.API Video Demo: Part 1
- Cassandra.API Video demo: Part 2
- Cassandra as a Service in the Cloud
- Exploring DataStax Astra’s REST API
- Exploring DataStax Astra’s GraphQL API
- Connect Apache Spark and DataStax Astra
- Run an Apache Spark Job on DataStax Astra
If you are not familiar with Databricks, it is a unified data analytics platform in the cloud for massive-scale data engineering and collaborative data science. Databricks provides a free Community Edition, though with some limits on features (more on that here). Databricks enables large-scale data processing for batch and streaming workloads, analytics on the most complete and recent data, simpler and faster data science on large datasets, and a standardized ML lifecycle from experimentation to production.
Databricks adds enterprise-grade functionality to the innovations of the open-source community. As a fully managed cloud service, it handles data security and software reliability while offering the scale and performance of the cloud. Databricks is rooted in open source: its platform includes technologies such as Apache Spark, Delta Lake, and MLflow, which were originally created by the founders of Databricks. The Databricks platform also includes TensorFlow, Redash, and R.
Now we will discuss how to connect Databricks and DataStax Astra. You can go to this link to create a free Databricks community edition account. Additionally, you can check out this repository to learn how to get started with DataStax Astra and download the secure connect bundle, which we will need to import into Databricks.
Once you have logged into Databricks, you will need to upload the secure connect bundle into DBFS (the Databricks File System). To do so, go to the “Data” tab on the left and choose to create a new table; from there, you can drag and drop the secure connect bundle into the “Files” box. Once the bundle is uploaded, you can go ahead and create a cluster.
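If you want to double-check that the bundle actually landed in DBFS, you can list the upload directory from a notebook cell once a cluster is running. This is a minimal sketch: dbfs:/FileStore/tables/ is the default drop location for drag-and-drop uploads, and the file name will be whatever your bundle is called in your workspace.

```python
# Run in a Databricks notebook cell attached to a running cluster.
# Lists the default upload directory so you can spot the secure connect bundle.
for f in dbutils.fs.ls("dbfs:/FileStore/tables/"):
    print(f.path, f.size)
```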
Depending on which Databricks runtime you select, you will need to check which version of DataStax’s spark-cassandra-connector is compatible. We will use a runtime with Spark 3.0.1 and Scala 2.12, for which the compatible spark-cassandra-connector version is 3.0.
Once we have selected our runtime, we will click on the “Spark” tab located at the bottom in order to update our Spark config. Connecting to DataStax Astra requires a few configs beyond what plain open-source Apache Cassandra needs. The configs, with example values, are included below.
spark.dse.continuousPagingEnabled false
spark.cassandra.auth.username username
spark.cassandra.auth.password password
spark.cassandra.connection.config.cloud.path DBFS:/FileStore/tables/secure_connect_databasename.zip
We need spark.dse.continuousPagingEnabled false to avoid potential errors when reading our data from Astra; without this config, even a simple read from a notebook may fail, as it did for me. We also need spark.cassandra.connection.config.cloud.path set to the secure connect bundle’s path in DBFS, a setting you will not come across when working with open-source Apache Cassandra. Remember to keep a single space between each config key and its value. Once you have inserted the config options above and updated them with your own values, we can go ahead and create the cluster.
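Once the cluster is up, you can also verify from a notebook that these configs were actually picked up. A minimal sketch, assuming the keys were entered exactly as shown above:

```python
# Print the Astra-specific settings the cluster was started with (run in a notebook cell).
conf = spark.sparkContext.getConf()
for key in [
    "spark.dse.continuousPagingEnabled",
    "spark.cassandra.auth.username",
    "spark.cassandra.connection.config.cloud.path",
]:
    print(key, "=", conf.get(key, "<not set>"))
```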
Once we have created the cluster, we need to add the spark-cassandra-connector to it. We can do this by going to “Libraries” and hitting “Install New”. When the dialog opens, click on the “Maven” tab; we can either copy and paste the coordinates into the designated field or hit “Search Packages”. An important thing to note: we need the assembly version of the spark-cassandra-connector due to a dependency conflict between the connector and Databricks. For a runtime on Spark 3.0.1, that means com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 rather than just com.datastax.spark:spark-cassandra-connector_2.12:3.0.0. Once the library finishes installing on the cluster, everything we need to talk to Astra from Databricks is in place.
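Before wrapping up, here is a quick way to sanity-check the connection from a notebook attached to the cluster. This is a minimal sketch, assuming the Spark configs above are already set on the cluster; the keyspace and table names are placeholders for whatever exists in your Astra database.

```python
# Read a table from Astra through the spark-cassandra-connector.
# Connection details (auth, secure connect bundle) come from the cluster's Spark config.
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholders: use your own keyspace/table
    .load()
)

df.show(5)  # a handful of rows confirms the connection works end to end
```

If the read succeeds, the cluster, connector, and secure connect bundle are all wired up correctly.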
With that, we will wrap up Part 1, as that is all you need to get started connecting Databricks and DataStax Astra. In Part 2, we will create a Databricks notebook that extracts data from our Astra database, transforms it, and writes it back into Astra, while also exploring features that Databricks provides in the Community Edition. As mentioned above, the embedded YouTube video walks through this whole process, so be sure to check it out! Don’t forget to like and subscribe while you are there!
Resources:
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link is not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!