Apache Cassandra Lunch #38: Cassandra SSTables and Spark

In case you missed it, this blog post is a recap of Cassandra Lunch #38, covering Apache Spark projects that interact with Cassandra specifically through Cassandra’s SSTables. We discussed the nature of SSTables and their usage within Cassandra before moving on. Then we went over a few projects that make use of Cassandra SSTables for data management with Apache Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

SSTable Overview

First, we need to define what SSTables are and what place they have in the normal functioning of Cassandra. They have been previously mentioned in Apache Cassandra Lunch # 20 on Cassandra Read and Write Paths. Essentially SSTables are Cassandra’s on-disk storage method for data. Unlike Cassandra’s other forms of internal data storage, SSTables are immutable, they don’t get changed after being written. SSTables only get read from and deleted after being created. SSTable means Sorted String Table. They contain hashed string representations of data rows, sorted by a token. Several other files exist that support the Cassandra read process in telling if particular data is in a particular SSTable and in getting that data quickly from the file.

SSTable Components

  • Data.db: The actual data, i.e. the contents of rows.
  • Index.db: An index from partition keys to positions in the Data.db file. For wide partitions, this may also include an index to rows within a partition.
  • Summary.db: A sampling of (by default) every 128th entry in the Index.db file.
  • Filter.db: A Bloom Filter of the partition keys in the SSTable.
  • CompressionInfo.db: Metadata about the offsets and lengths of compression chunks in the Data.db file.
  • Statistics.db: Stores metadata about the SSTable, including information about timestamps, tombstones, clustering keys, compaction, repair, compression, TTLs, and more.
  • Digest.crc32: A CRC-32 digest of the Data.db file.
  • TOC.txt: A plain text list of the component files for the SSTable.

Cassandra Reads and SSTables

  • Check the memtable
  • Check row cache, if enabled
  • Checks Bloom filter
  • Checks partition key cache, if enabled
  • Goes directly to the compression offset map if a partition key is found in the partition key cache, or checks the partition summary if not
  • If the partition summary is checked, then the partition index is accessed
  • Locates the data on disk using the compression offset map
  • Fetches the data from the SSTable on disk

In the Cassandra read sequence, SSTables are the location of any data that is not still in the memtable, easily accessible in memory. Since accessing data on disk is slower and more costly than accessing data in memory, some checks are made to see if the data exists within the SSTables at all to avoid searching for that that isn’t there. Then several methods are used to narrow down the location of data in the SSTable’s Data.db file before it fetches that data from the hard disk.

Cassandra Writes and SSTables

  • Logging data in the commit log
  • Writing data to the memtable
  • Flushing data from the memtable
  • Storing data on disk in SSTables

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending in with a write of data to disk. The commit log just logs any data changes in order. Memtables and SSTables store data on a per-table basis. Data that gets written to memtable eventually get flushed to disk and becomes SSTables.

Compaction and SSTables

Since SSTables are immutable they don’t grow with new data after being written. This means that a single table may have many SSTables, all storing time updated copies of the same rows. Things like updates, deletions, or expiration of data all end up having the same rows in a number of SSTables. The read process deals with this by comparing the timestamps associated with each update and returning the most up-to-date information. As this process continues reads of that data start to take longer, so as data accumulates it triggers a process called compaction. In compaction SSTables are combined, updates are made to rows that have changes and a single new SSTable is written to disk before its precursors are deleted.

SSTables and Spark

Spark in a distributed analytics engine that has a lot of infrastructure for interfacing with Cassandra. The Spark-Cassandra connect, SparkSQL, and DSE are just some of the ways that that connection is handled. These methods go through the normal Cassandra read and write processes to get data into and from Cassandra. The projects we are discussing today, however, are meant to bypass the normal Cassandra read and write paths and get/write data directly from/to SSTables. This is useful because it leaves the request processing capability of the Cassandra cluster unchanged.

Spark-Cassandra-Bulk-Reader – SSTable Reader

This project reads SSTables directly into SparkSQL. It can read from Cassandra clusters as well as snapshots. It makes use of Java Cassandra classes to do the combination work of compaction (or normal Cassandra reads). The tool also can make use of several Spark workers reading replicas of the same data to read from SSTables with a specific consistency level, like the one that is set for normal Cassandra reads and writes. It has been used to successfully export a 32TB Cassandra table (46bn CQL rows) to HDFS in Parquet format in around 70 minutes, which is apparently a 20x improvement on previous solutions. The project page can be found here. The Apache Cassandra issue tracking page can be found here.

Spark2Cassandra – SSTable Writer

This project writes data from Spark directly to SSTables on a Cassandra cluster. It uses internal Java Cassandra classes as well as functionality from the Spark Cassandra Connector. The Spark Cassandra Connector is token aware in order to send reads and writes to the correct nodes. Spark2Cassandra uses this functionality to write the correct portions of each data frame to the correct location on the Cassandra cluster. The project page can be found here. An article with code snippets for a similar project can be found here.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity. We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!