In this blog, we discuss CQL Copy and how we can use it for data operations. We will include a step-by-step demo showing you how you can do data operations such as exporting a Cassandra table as a CSV and then importing data from a CSV into a Cassandra table.
CQL copy is a group of shell commands that import and export CSV to and from Cassandra. There are 2 main methods: COPY TO and COPY FROM. COPY TO exports data from a table into a CSV file. COPY FROM imports data from a CSV file into an existing table. More information and additional copy options can be found here.
Below will be a quick demo to learn how to use CQL Copy. The instructions will also be available on this GitHub repository.
Prerequisites
- Docker
Demo
We will begin by downloading/cloning this GitHub repository. Once extracted/cloned, we will need to cd into it.
cd cql-copy
Now that we are in the cql-copy
directory, we can start our Apache Cassandra docker container. We will also be mounting our cql-copy
directory to the docker container.
docker run -d -it --name cassandra -v "$(pwd)":/cql-copy cassandra:latest
We can confirm that the directory was added by opening a new terminal tab/terminal and running the following:
docker exec -it cassandra bash
ls
Now we can open another terminal tab/terminal and run the following to start a CQLSH terminal in our docker container:
docker exec -it cassandra cqlsh
Once loaded, we can use the cql
file that is in our cql-copy
directory when we mounted it.
source '/cql-copy/spacecraft_journey_catalog.cql'
In this cql
file, we create 2 tables: spacecraft_journey_catalog
and duration_by_journey_summary
. We also insert 1000 records into spacecraft_journey_catalog
. As a sneak-peek of the scenario, we will then export this table as CSV, do some “transformations” on it, and load the “transformed” data CSV into the duration_by_journey_summary
table.
To confirm that our data is in our Cassandra instance, we can run the following code block. We should get back 1000 rows.
use demo ;
describe tables ;
select count(*) from spacecraft_journey_catalog ;
As mentioned above, we can now export this data as CSV using CQL Copy. We will be using COPY TO and only export 4 of the columns. We want to take the summary and journey_id of each journey and calculate the duration in days of each journey versus just having the end time and start time. We can then load that new data into duration_by_journey_summary
, which is partitioned by the summary with a clustering column of journey_id. The use case of this kind of table could be for BI to analyze discrepancies between duration lengths of journies that had the same purpose/summary. We will point the CSV to be created at the root level of the docker container.
copy spacecraft_journey_catalog (summary, journey_id, end, start) to'../spacecraft_journey_catalog.csv' with header = true ;
We can then confirm that the CSV was created by running the following in the terminal / terminal tab that is running Docker bash:
ls
Then, we will need to export the CSV to do some “transformations” on it. We will need to run the following in the first terminal or a new terminal tab to get our Docker container ID:
docker ps
Then we can run the following command with your ID substituted for the placeholder id:
docker cp container_ID:/spacecraft_journey_catalog.csv .
This will export the CSV to our local cql-copy
directory. You can open your file manager to visualize this.
That covers how to export your Cassandra data using CQL COPY TO. Now we will cover how to import data using CQL COPY FROM.
As mentioned before, we will “transform” the CSV that we exported. However, to speed up the demo, I have already updated and calculated the duration in days for you. The “transformed” data is in the CSV file called duration_by_journey_summary.csv
. You can see this in the cql-copy
directory. Technically, we have already imported the file when we mounted it, but we will show you how to import it from the local directory into the root level of the docker container.
Make sure you are in the cql-copy
directory on your local machine. Then run the following command using the Docker container ID that we got before:
docker cp duration_by_journey_summary.csv container_ID:/
To confirm that the file was added to the root level of the Docker container, run the following in the terminal / terminal tab running Docker bash:
ls
Once you have confirmed it, we can now move onto importing the data from the CSV into our duration_by_journey_summary
table. Use the following command to do so:
copy duration_by_journey_summary (summary, journey_id, duration_in_days) from '../duration_by_journey_summary.csv' WITH HEADER = TRUE ;
We can now move the terminal / terminal tab running CQLSH and confirm that the data was transferred. We should get back 1000 rows.
select count(*) from duration_by_journey_summary ;
And with that, we have done CQL Copy for data operations by exporting and importing CSV data from and to our Cassandra tables.
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!