In Apache Cassandra Lunch #121: Migrating to Azure Managed Instance for Apache Cassandra, we discussed different methods for migrating data from existing Cassandra instances to Azure-hosted options. The live recording of this Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you want to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
Azure Migration Overview
Migrations from existing on-premises or cloud Cassandra instances usually target one of two Azure offerings: Azure Managed Instance for Apache Cassandra or the Cosmos DB Cassandra API. Azure Managed Instance is a Cassandra cluster with specific default configurations and built-in automation for monitoring, backup, and repair. Cosmos DB is a cloud-native NoSQL database whose internal data model can mimic the way data is managed in other databases, allowing it to integrate with tooling built for those database types; its Cassandra API works with standard Cassandra drivers and CQLSH.
Migrating data to these Azure systems from an existing Cassandra database can be done in a variety of ways. Each method has its own advantages and disadvantages, interacts differently with the specific systems of Azure Managed Instance and Cosmos DB, and requires different external systems to work. The two main methods we focus on today are the Azure Managed Instance hybrid cluster and the Cassandra Migrator. We will also briefly touch on other migration methods, but won’t go too deep into their trade-offs in this post.
Main Migration Methods
Hybrid Cluster Replication
The first method of migrating data to Azure works between open-source Cassandra and Azure Managed Instance for Apache Cassandra. It creates a hybrid cluster containing the existing nodes in one data center and the new Azure Managed Instance nodes in a new data center. Cassandra's normal cross-datacenter replication then moves the data onto the Azure nodes, at which point the old data center can be shut down. For this to work, certain security and encryption settings must be configured on the existing Cassandra nodes; otherwise, they will be incompatible with Azure Managed Instance. Specifically, node-to-node encryption must be enabled on the existing cluster in order to connect with Azure Managed Instance. Client-to-node encryption is optional, but if it is enabled, it must be taken into account when creating the Azure Managed Instance nodes.
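As a sketch of what the node-to-node encryption requirement looks like on the existing nodes, the relevant section of cassandra.yaml is shown below. The keystore paths and passwords are placeholders; use the values appropriate for your cluster.

```yaml
# cassandra.yaml on each existing node (illustrative values;
# adjust keystore/truststore paths and passwords for your cluster)
server_encryption_options:
    internode_encryption: all        # node-to-node encryption, required for the hybrid cluster
    keystore: /etc/cassandra/conf/keystore.jks
    keystore_password: <keystore-password>
    truststore: /etc/cassandra/conf/truststore.jks
    truststore_password: <truststore-password>

# Client-to-node encryption is optional, but if it is enabled here it
# must also be accounted for on the Azure Managed Instance side.
client_encryption_options:
    enabled: false
```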
The basic process for setting this up starts with creating a Virtual Network on Azure and configuring a subnet; all other components must be part of this subnet in order to communicate with Azure Managed Instance. Next, grant the extra permissions needed by Azure Managed Instance for Apache Cassandra: a role within the resource group that everything runs in must be assigned via the Azure console before Azure Managed Instance nodes can be created. Then create and configure the resource for Azure Managed Instance, retrieve the gossip certificates from the new Azure Managed Instance cluster, and install them in the existing datacenter. Finally, create the new datacenter. We will go into more detail on this process when we talk specifically about this method in another post.
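The steps above can also be scripted with the Azure CLI. The following is only a sketch: resource names, IP addresses, and IDs are placeholders, and the `managed-cassandra` commands come from an Azure CLI extension, so check the current Azure docs for exact flags and the service principal ID to use for the role assignment.

```shell
# 1. Create the Virtual Network and subnet everything will live in
az network vnet create -g my-rg -n cassandra-vnet \
  --subnet-name cassandra-subnet

# 2. Grant the Azure Managed Instance service principal access to the VNet
#    (the principal ID is documented by Azure; shown here as a placeholder)
az role assignment create \
  --assignee <azure-cassandra-service-principal-id> \
  --role "Network Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Network/virtualNetworks/cassandra-vnet"

# 3. Create the managed cluster, pointing it at the existing datacenter's seed nodes
az managed-cassandra cluster create \
  -g my-rg -c hybrid-cluster -l eastus2 \
  --delegated-management-subnet-id <subnet-resource-id> \
  --external-seed-nodes 10.0.0.4 10.0.0.5 \
  --initial-cassandra-admin-password <password>

# 4. After installing the cluster's gossip certificates on the existing
#    nodes, create the new Azure datacenter
az managed-cassandra datacenter create \
  -g my-rg -c hybrid-cluster -d azure-dc-1 -l eastus2 \
  --delegated-subnet-id <subnet-resource-id> \
  --node-count 3
```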
Azure Cassandra Migrator
The second method for migrating data runs a Spark job that copies data from an existing Cassandra instance to an Azure Cassandra instance. It requires a Spark cluster in addition to the two Cassandra clusters: one pre-existing on-premises or cloud cluster, and one in Azure. The Azure docs suggest using Azure Databricks for Spark and driving the process from a Scala notebook, but the Azure Cassandra Migrator is a JAR file and can run on any Spark cluster. It is possible to run it on a normal open-source Spark cluster via spark-submit; in that case, the configuration just needs to be supplied through a config file.
The basic process here starts with creating both Cassandra clusters, as well as a Spark cluster (either Databricks or open-source Spark will work). Then set the configs: in code if using a Scala notebook, or in the config file if using some other Spark offering. Finally, start the job, either by running the notebook or via spark-submit.
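To illustrate the same idea without the migrator JAR, here is a minimal PySpark sketch of a one-shot table copy between two Cassandra clusters using the open-source spark-cassandra-connector. The hosts, credentials, keyspace, and table names are placeholders, and the exact connector version should match your Spark/Scala versions.

```python
from pyspark.sql import SparkSession

# Spark session with the open-source spark-cassandra-connector on the classpath
spark = (SparkSession.builder
         .appName("cassandra-migration")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.3.0")
         .getOrCreate())

# Source: the existing on-premises/cloud cluster
source_opts = {
    "spark.cassandra.connection.host": "10.0.0.4",
    "keyspace": "my_keyspace",
    "table": "my_table",
}

# Target: the Azure cluster (Managed Instance expects SSL and authentication)
target_opts = {
    "spark.cassandra.connection.host": "<azure-contact-point>",
    "spark.cassandra.connection.ssl.enabled": "true",
    "spark.cassandra.auth.username": "<username>",
    "spark.cassandra.auth.password": "<password>",
    "keyspace": "my_keyspace",
    "table": "my_table",
}

# Read the full table from the source cluster, then append it to the target.
# Note this copies only data present at read time -- no ongoing sync.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(**source_opts).load())

(df.write.format("org.apache.spark.sql.cassandra")
   .options(**target_opts)
   .mode("append")
   .save())
```

Because this is a plain batch job, it inherits the same limitation as the migrator discussed below: it captures a point-in-time copy, not a live stream of changes.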
In comparison to the hybrid cluster, this method has one obvious drawback: the migration takes place once and does not continue to mirror changes over time. Once a hybrid cluster is set up, changes are continuously relayed between the data centers, while the Cassandra Migrator only runs in one direction and only migrates data that exists at the time of the migration.
Other Methods
Kafka Connect can be used to migrate data by first loading data from Cassandra into a Kafka topic, then writing that data into Azure Managed Instance for Apache Cassandra using Kafka Connect with a Cassandra sink connector.
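As a rough illustration of the sink side, a Kafka Connect worker would be given a properties file along these lines. The key names follow the DataStax Apache Kafka Connector; the topic, contact point, datacenter, keyspace, table, and column mapping are all placeholders, so treat this as a shape rather than a working config.

```properties
# Illustrative Cassandra sink connector config (placeholder values)
name=cassandra-sink
connector.class=com.datastax.oss.kafka.sink.CassandraSinkConnector
topics=cassandra-migration-topic
contactPoints=<azure-contact-point>
loadBalancing.localDc=<datacenter-name>
# Map fields of the Kafka record onto columns of the target table
topic.cassandra-migration-topic.my_keyspace.my_table.mapping=id=value.id, name=value.name
```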
DataStax’s dual-write proxy is a live migration method that does not cover historical data. Incoming application writes are sent to both the old and the new cluster, which helps define a fixed time window for a separate historical data migration.
CDC (Change Data Capture) can be used to retrieve a stream of deltas from a Cassandra database. This stream of changes can be applied to a separate database to keep it in sync with the cluster that the updates are coming from.
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!