Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra with DSBulk

In Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra, we discussed how to move data from open-source Cassandra to DataStax Astra using DSBulk. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Apache Cassandra

  • Apache Cassandra is an open-source, distributed NoSQL database designed to handle large volumes of data across many servers
  • Cassandra clusters can be scaled either by improving the hardware of current nodes (vertical scalability) or by adding more nodes (horizontal scalability)
    • Horizontal scalability is part of why Cassandra is so powerful: inexpensive machines can be added to a cluster to significantly improve its performance
  • Note: Demo runs the Open Source version of Cassandra (not DSE)
    • Works nearly identically with DSE Cassandra

DataStax Astra

  • Astra website: https://www.datastax.com/products/datastax-astra
  • DataStax Astra is a fully managed, serverless database service from DataStax, built on Apache Cassandra
  • Some additional features of Astra:
    • Stargate APIs: Makes it easy for developers to use a Cassandra-based database like Astra to work with data without deep knowledge of CQL
    • Zero Lock-In: Deploy on AWS, GCP and Azure and still maintain compatibility with open-source Cassandra
    • Global Scale: Data replication across multiple data centers, availability zones, and multiple regions.
      • Additionally, allows a user to scale an Astra database up to multiple petabytes of data without impacting speed or performance
    • 80 GB of storage and 20 million read/write operations for free every month

DSBulk

  • DSBulk: DataStax Bulk Loader for Apache Cassandra is an open-source tool used to load CSV or JSON data into, and unload it out of, supported databases
  • Supported databases:
    • DataStax Astra cloud database
    • DataStax Enterprise (DSE) 4.7 and later
    • Open source Apache Cassandra 2.1 and later
  • An introduction to DSBulk and its documentation can be found here: https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html
  • GitHub repository for the DataStax DSBulk project: https://github.com/datastax/dsbulk
  • Commands that will be used in today’s presentation/demo:
    • dsbulk load
      • Loads data from a CSV or JSON file into a Cassandra/Astra database. When no configuration file is used, the necessary parameters (listed below) must be passed on the command line.
    • dsbulk unload
      • Unloads data from a Cassandra/Astra database into CSV or JSON files. Again, when no configuration file is used, the necessary parameters must be passed on the command line.
    • dsbulk count
      • Counts the rows in a Cassandra/Astra table, which is useful for checking loaded data.
  • Some necessary parameters/flags that must be used if using these commands without a configuration file:
    • -k: keyspace
    • -t: table
    • -b: path to the secure connect bundle (only necessary when connecting to Astra)
    • -h: contact host of the cluster (for example, localhost for a local Cassandra database)
    • -u: username, -p: password (for the database)
      • Since an Astra update earlier this year, Astra connections take the Client ID and Client Secret of a generated token in place of a username and password.
      • Can be omitted when connecting to a local Cassandra database that still uses the default credentials (cassandra/cassandra)
    • -url: the URL or local path of the .csv or .json file to load, or the local directory to unload data into
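
To make the flags above concrete, here is a minimal dry-run sketch that only prints the three commands instead of executing them. The DSBULK path, the print_cmd helper, and the placeholder file and directory names are assumptions for illustration, not part of DSBulk itself.

```shell
#!/bin/sh
# Dry-run sketch: prints dsbulk command lines instead of running them.
# DSBULK, KEYSPACE and TABLE defaults are illustrative assumptions.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"
KEYSPACE="${KEYSPACE:-testkeyspace}"
TABLE="${TABLE:-video_ratings_by_user}"

# print_cmd (hypothetical helper): print a command and its arguments on one
# line, separated by spaces. For display only; arguments are not re-quoted.
print_cmd() {
    printf '%s' "$1"
    shift
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Load a CSV file (local path or URL) into a table on a local node.
print_cmd "$DSBULK" load -url ./data.csv -h localhost -k "$KEYSPACE" -t "$TABLE"

# Unload the same table into an empty local directory.
print_cmd "$DSBULK" unload -h localhost -k "$KEYSPACE" -t "$TABLE" -url ./my_data

# Count the rows currently in the table.
print_cmd "$DSBULK" count -h localhost -k "$KEYSPACE" -t "$TABLE"
```

Once the printed lines look right for your environment, they can be pasted into a terminal or the print_cmd prefix can be dropped to run them directly.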

Demo Project

For the demo project, we will be running through some sample commands based on the following GitHub repository: https://github.com/DataStax-Examples/dsbulk-to-astra/. Some notes before getting started:

  • Make sure your local Cassandra database is running. To start an open-source Cassandra database locally with Docker, use the following command:
    • docker run -p 9042:9042 --rm --name my-cassandra -d cassandra
  • Create an Astra database on the Astra website after registering for an account on their website: https://astra.datastax.com/register
    • After creating a database, generate a token whose role has permission to both read from and write to the database (for example, an administrator role). Write down the Client ID and Client Secret.
    • Additionally, download the secure connect bundle for the database. This will be necessary to allow dsbulk to connect to the Astra database.
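
A freshly started Cassandra container usually needs some time before it accepts CQL connections, so it can help to wait for it before moving on. The sketch below is one possible way to do that; wait_for is a hypothetical helper written for this post, and the attempt count and delay in the usage comment are arbitrary assumptions.

```shell
#!/bin/sh
# wait_for (hypothetical helper): retry a command until it succeeds.
# Usage: wait_for <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep only when another attempt is still to come.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (assumes the container name my-cassandra from the docker command above):
#   wait_for 30 5 docker exec my-cassandra cqlsh -e 'DESCRIBE KEYSPACES' \
#       && echo "Cassandra is accepting CQL connections"
```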

With your local Cassandra database running, we need to set up both the keyspace and the table schema for this demo. Run the following commands, which define the keyspace and table we will use, in both Astra’s CQLSH console and your local Cassandra’s CQLSH console. (On Astra, keyspaces are normally created through the Astra dashboard, so there you may only need the CREATE TABLE statement.)

CREATE KEYSPACE testkeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;
CREATE TABLE IF NOT EXISTS testkeyspace.video_ratings_by_user (
    videoid uuid,
    userid uuid,
    rating int,
    PRIMARY KEY (videoid, userid)
);
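
Because the same statements have to be run in two places, it can be convenient to keep them in a file. The following is a minimal sketch of that idea; the file name schema.cql and the commands in the trailing comments are assumptions for illustration and are not executed by the script.

```shell
#!/bin/sh
# Sketch: save the demo schema to a file so it can be replayed on each cluster.
# schema.cql is a hypothetical file name chosen for this example.
SCHEMA_FILE="${SCHEMA_FILE:-./schema.cql}"

# The quoted 'EOF' keeps the shell from expanding anything inside the CQL.
cat > "$SCHEMA_FILE" <<'EOF'
CREATE KEYSPACE testkeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
CREATE TABLE IF NOT EXISTS testkeyspace.video_ratings_by_user (
    videoid uuid,
    userid uuid,
    rating int,
    PRIMARY KEY (videoid, userid)
);
EOF

echo "Wrote $SCHEMA_FILE"

# One possible way to apply it to the local container (not executed here;
# assumes the container name my-cassandra used earlier in this post):
#   docker cp "$SCHEMA_FILE" my-cassandra:/tmp/schema.cql
#   docker exec my-cassandra cqlsh -f /tmp/schema.cql
```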

Example output from local Cassandra after running these commands, followed by a select * query:

[Screenshot of Cassandra console: commands used to create the necessary keyspace and table in local Cassandra]

Once this is done for both Astra and your local Cassandra database, we can proceed with using DSBulk. DSBulk must first be downloaded from the following URL, which also includes installation instructions: https://docs.datastax.com/en/dsbulk/doc/dsbulk/install/dsbulkInstall.html. Now we can begin running DSBulk commands.

Loading from a file at a URL into local Cassandra:

./dsbulk-1.8.0/bin/dsbulk load -url https://raw.githubusercontent.com/DataStax-Examples/dsbulk-to-astra/master/data.csv -h localhost -k testkeyspace -t video_ratings_by_user -u cassandra -p cassandra

Note that the first part of the command is the path to the dsbulk executable in your local DSBulk installation. Some sample output from the above command:

[Screenshot of Cassandra console: sample output from DSBulk load into local Cassandra]

Loading from a file at a URL into Astra:

./dsbulk-1.8.0/bin/dsbulk load -url https://raw.githubusercontent.com/DataStax-Examples/dsbulk-to-astra/master/data.csv -b ./secure-connect-testdb3.zip -k testkeyspace -t video_ratings_by_user -u IwxQhWdajNMpHisNlWeFlPYq -p AJ,pr7SG_H3P,,AZxWrYCqSkzUzjxXvbUrWH-c6GAII.h,YCK1S6ghAaItKCC-I0l27ybK6PuTusPbb_vJRz3igAdyvL1KepRF-tACkiMRSRx3jZW,xhBd3LgeIA,Dy2

Note that the parameters that come after -u and -p are not quite username and password, but rather the Client ID and Client Secret Key that are obtained by generating a token for your Astra database. Additionally, the path after the -b flag should point to the secure connect bundle for your Astra database.
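
Since the Client ID and Client Secret are credentials, one option is to read them from environment variables rather than typing them into the command. The sketch below illustrates that approach under stated assumptions: astra_load is a hypothetical helper, and the variable names ASTRA_BUNDLE, ASTRA_CLIENT_ID, and ASTRA_CLIENT_SECRET are made up for this example.

```shell
#!/bin/sh
# Path to the dsbulk executable (assumed default; override via the environment).
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"

# astra_load (hypothetical helper): load a CSV file or URL into an Astra table.
# Usage: astra_load <csv_url_or_path> <keyspace> <table>
# Reads the bundle path and token values from environment variables whose
# names are assumptions made for this example.
astra_load() {
    # ${VAR:?message} aborts the script if the variable is unset or empty.
    : "${ASTRA_BUNDLE:?path to the secure connect bundle .zip}"
    : "${ASTRA_CLIENT_ID:?Client ID from the generated token}"
    : "${ASTRA_CLIENT_SECRET:?Client Secret from the generated token}"
    "$DSBULK" load -url "$1" -b "$ASTRA_BUNDLE" -k "$2" -t "$3" \
        -u "$ASTRA_CLIENT_ID" -p "$ASTRA_CLIENT_SECRET"
}

# Example (values are placeholders):
#   export ASTRA_BUNDLE=./secure-connect-testdb3.zip
#   export ASTRA_CLIENT_ID=...
#   export ASTRA_CLIENT_SECRET=...
#   astra_load ./data.csv testkeyspace video_ratings_by_user
```

This keeps the token values out of the script itself, although they are still passed to dsbulk as command-line arguments.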

In both of the above cases, we are loading from a .csv file at a URL into either local Cassandra or Astra. To move data from local Cassandra into Astra, we will also need the DSBulk unload command. We first run the following command in Astra’s cqlsh to make sure that the Astra table is empty:

TRUNCATE testkeyspace.video_ratings_by_user;

Some sample output from running that command:

[Screenshot of Astra console: empty table in Astra after using the TRUNCATE command in CQLSH]

Now we do a two-step process to completely move data from local Cassandra into Astra:

Step 1: Unload data from local Cassandra into a .csv file:

./dsbulk-1.8.0/bin/dsbulk unload -h localhost -k testkeyspace -t video_ratings_by_user -url ./my_data

Note that the very last parameter is the path to a local folder, which must be empty.

Step 2: Load the unloaded data from that folder into Astra:

./dsbulk-1.8.0/bin/dsbulk load -url ./my_data -b ./secure-connect-testdb3.zip -k testkeyspace -t video_ratings_by_user -u IwxQhWdajNMpHisNlWeFlPYq -p AJ,pr7SG_H3P,,AZxWrYCqSkzUzjxXvbUrWH-c6GAII.h,YCK1S6ghAaItKCC-I0l27ybK6PuTusPbb_vJRz3igAdyvL1KepRF-tACkiMRSRx3jZW,xhBd3LgeIA,Dy2
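
The unload and load steps can also be chained in a small script that checks the export folder first. This is only a sketch under stated assumptions: prepare_export_dir and migrate_table are hypothetical helpers written for this post, and the local node is assumed to be reachable at localhost without credentials, as in the unload command above.

```shell
#!/bin/sh
# Path to the dsbulk executable (assumed default; override via the environment).
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"

# prepare_export_dir (hypothetical helper): create the export directory if it
# is missing and succeed only if it contains no entries.
prepare_export_dir() {
    mkdir -p "$1" || return 1
    [ -z "$(ls -A "$1")" ]
}

# migrate_table (hypothetical helper): unload a table from local Cassandra
# into a directory, then load that directory into Astra.
# Usage: migrate_table <keyspace> <table> <export_dir> <bundle> <client_id> <client_secret>
migrate_table() {
    if ! prepare_export_dir "$3"; then
        echo "export directory $3 must be empty" >&2
        return 1
    fi
    # Step 1: unload from the local node; stop if it fails.
    "$DSBULK" unload -h localhost -k "$1" -t "$2" -url "$3" || return 1
    # Step 2: load the unloaded files into Astra.
    "$DSBULK" load -url "$3" -b "$4" -k "$1" -t "$2" -u "$5" -p "$6"
}

# Example (token values are placeholders):
#   migrate_table testkeyspace video_ratings_by_user ./my_data \
#       ./secure-connect-testdb3.zip "$CLIENT_ID" "$CLIENT_SECRET"
```

Afterwards, running dsbulk count against both the local table and the Astra table is a simple way to check that the row counts match.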

With that, the migration of a table from local Cassandra into Astra is complete. For a complete run-through of the commands mentioned above, along with additional commentary, please see the recorded live session below on YouTube!

Recording of the live session is below:
