In this blog, we will introduce Scylla's Spark Migrator and walk through how we can use it for Cassandra data operations.
The Scylla Migrator is a pre-compiled Apache Spark job built to extract data using Spark and migrate it to Scylla, typically from Cassandra. However, we can also use the Scylla Migrator for Cassandra data operations and move data between Cassandra instances.
Also, check out this blog for a series outlining general approaches to data operations in business-critical environments that leverage Cassandra and must maintain high availability in an agile development and continuous delivery environment.
The Scylla Migrator GitHub repository outlines how to build, configure, and run the migrator on either a live Spark cluster or in a local Dockerized environment. In our walkthrough below, we forked the repo and adjusted the Dockerized example it provides to fit our needs.
In simple terms, the migrator works by generating a DataFrame from the source and then writing it to the destination. It's more complex than that under the hood, of course; you can read this Scylla blog for a deep dive into how it works. It is important to note that the destination table must have the same schema as the source table; if you want to rename columns, you need to indicate that in the config.yaml file. You can set other configurations within the config.yaml file as well, and more on that can be found here.
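For example, column renames in config.yaml are expressed as a list of from/to pairs. This is just a sketch, with hypothetical column names, to illustrate the shape:
# Hypothetical rename: copy the source column "summary" into a destination column "description".
# Leave this as an empty list if no columns need renaming.
renames:
  - from: summary
    to: description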
The migrator also utilizes savepoints, which are configuration files the migrator saves as it runs. Their purpose is to skip token ranges that have already been copied; this only applies when copying from Cassandra/Scylla. You can set the path and the interval at which savepoints are created in the config.yaml file. I personally ran into some savepoint issues that, at the time of writing, I have not been able to debug. However, they do not cause any functional problems; the migration works as intended in the walkthrough.
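For reference, savepoints are configured in config.yaml with a path and an interval; the values below are illustrative, not prescriptive:
# Directory where savepoint files are written, and how often (in seconds) to write them.
savepoints:
  path: /app/savepoints
  intervalSeconds: 300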
Now we can move on to the walkthrough to demonstrate how we can use the Scylla Migrator for Cassandra data operations. We will be using this GitHub repository, which is a fork of the original scylla-migrator and is for educational purposes. We have bypassed some steps, like running build.sh, to make the walkthrough as fast as possible. If you want to go through the process of running the local Docker example Scylla provides, or of running the migrator on a live Spark cluster, you can find the instructions in the original repository linked above. Additionally, if you want to extend this demo further, follow the instructions in the original repository for rebuilding the JAR after making code changes.
Prerequisites
We will start by cloning the repo onto our local machine.
git clone https://github.com/adp8ke/scylla-migrator.git
We can then cd into our repo and get the Docker containers started. If at any point you notice 137 errors and containers crash or refuse to start, you will need to increase Docker's memory allocation. You can do this by going to your Docker settings and increasing the memory from 2.00 GB to 4.00 GB (or higher if 4.00 GB is not enough). I ran into these 137 errors myself, and increasing my memory allocation from 2.00 GB to 4.00 GB allowed this demo to work as needed.
cd scylla-migrator
docker-compose up -d
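As a side note, if you would rather cap memory per container than resize the Docker Desktop VM, Compose also supports service-level memory limits. This is a hypothetical sketch, not part of the demo's docker-compose.yml, and whether mem_limit is honored depends on your Compose version and file format:
# Hypothetical snippet; the demo's docker-compose.yml does not set this.
services:
  cass1:
    image: cassandra:3.11
    mem_limit: 2g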
We will run a few commands to copy the cql files that we will run into the source and target Cassandra containers.
docker cp source.cql scylla-migrator_cass1_1:/
docker cp target.cql scylla-migrator_cass2_1:/
Next, we will need to set up our Source and Target Cassandra Containers. Open a new terminal / tab and run the following commands for the Source Cassandra Container:
docker-compose exec cass1 cqlsh
then
source '/source.cql'
then
select count(*) from demo.spacecraft_journey_catalog;
This should return 1000 rows. These are the 1000 rows that we will transfer to the Target Cassandra Container.
Now to set up the Target Cassandra Container, we will need to open a new terminal / tab and run the following commands.
docker-compose exec cass2 cqlsh
then
source '/target.cql'
then
select count(*) from demo.spacecraft_journey_catalog;
This should return 0 rows, as we will populate this table using the migrator. Remember, the destination table must have the same schema as the source table. If you want to rename columns, you can do so in the config.yaml file; however, for our purposes, we are not doing that.
Now we can set up the config.yaml file that we will need when we run the Spark job.
Go back to the first terminal / tab and create the config.yaml file from config.yaml.example.
mv config.yaml.example config.yaml
vim config.yaml
First, edit the source host to cass1 and update the source keyspace and table values to demo and spacecraft_journey_catalog, respectively. Second, edit the target type to cassandra and the target host to cass2, and update the target keyspace and table values to demo and spacecraft_journey_catalog, respectively. Once edited, the relevant sections should look roughly like the sketch below.
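This is a trimmed sketch of the edited sections, not the full file; config.yaml.example contains additional settings (port, credentials, connection tuning, and so on) that we leave at their defaults:
# Trimmed sketch of the edited source/target sections; other settings stay at their defaults.
source:
  type: cassandra
  host: cass1
  keyspace: demo
  table: spacecraft_journey_catalog
target:
  type: cassandra
  host: cass2
  keyspace: demo
  table: spacecraft_journey_catalog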
Lastly, press Escape, then save and quit the editor (:wq).
We are now ready to run the Spark Migrator using all of the setup we have done. Once you run the command below, you may notice some savepoint issues, as mentioned above. To reduce walkthrough time, we have already provided the assembled JAR, built using sbt assembly. The spark-master container mounts the ./target/scala-2.11 directory on /jars and the repository root on /app, so all you have to do is copy and paste the command below to run the migrator for this example. As mentioned above, if you want to update the JAR with new code, you will need to re-run build.sh and then run spark-submit again.
NOTE: Make sure you are still in the first terminal
docker-compose exec spark-master /spark/bin/spark-submit \
  --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-master \
  --conf spark.scylla.config=/app/config.yaml \
  /jars/scylla-migrator-assembly-0.0.1.jar
To confirm the migration occurred, run the following command in the Target Cassandra Container terminal / tab. If you run it right away, you may see the count increase as Spark works, but the job should complete within roughly 10-15 seconds. Once it completes, there should be 1000 rows in the Target Cassandra Container.
select count(*) from demo.spacecraft_journey_catalog;
And that will wrap up the walkthrough on how we can use the Scylla Migrator for Cassandra data operations. We could use actual host addresses to move data between Cassandra clusters, but for the purposes of a quick learning exercise, we just used Docker to simulate it.
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!