Cover Slide for DSE Analytics and Parquet tables.

Apache Cassandra Lunch #56: Using Spark SQL Parquet Tables in DSEFS / DSE Analytics

In Apache Cassandra Lunch #56: Using Spark SQL Parquet Tables in DSEFS / DSE Analytics, we discuss using Spark Parquet tables in DSEFS and DSE Analytics. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Using Spark SQL Parquet Tables in DSEFS / DSE Analytics

In this blog post, we cover what Parquet tables are and how they differ from other file formats, and we link to a demo of using Parquet tables with Apache Spark in DSE.

What is a Parquet Table?

Parquet tables are stored as binary files in a columnar format. Apache Parquet is an open-source project that can be used within any project in the Hadoop ecosystem. Parquet files are self-describing: the schema and structure are stored in metadata within each file, which allows for faster reads than many traditional file formats and lets the schema change over time as columns are added or removed. Columnar storage allows unwanted data to be skipped quickly; as a result, aggregation queries can be much faster than on row-oriented storage. In addition to these features, Parquet supports multiple compression and encoding methods, which can be applied per column. Some of the encoding options available are dictionary, bit packing, and run length. Parquet is especially useful in read-heavy workloads, since a query only needs to read the specified columns in order to return results.
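To make the encoding options more concrete, here is a minimal pure-Python sketch of dictionary encoding followed by run-length encoding on a single column. It only illustrates the idea; it is not Parquet's actual on-disk format, and the function names and sample data are made up for this example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_decode(dictionary, indices):
    """Map indices back to the original values."""
    return [dictionary[i] for i in indices]


# Illustrative column with many repeated values.
country = ["US", "US", "US", "DE", "DE", "US", "FR", "FR", "FR", "FR"]

dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)

print(dictionary)  # ['US', 'DE', 'FR']
print(runs)        # [(0, 3), (1, 2), (0, 1), (2, 4)]
```

Ten strings shrink to a three-entry dictionary plus four small integer pairs, which is why columns with few distinct values or long runs tend to compress well.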

Comparison image of the schema of row storage vs. column storage.
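The difference between the two layouts can also be sketched in a few lines of Python. This is a simplified counting model, not how Parquet or any database actually stores data, and the table contents are invented for illustration. It shows that an aggregation over one column touches every field in a row layout but only that column's values in a column layout.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Austin", "temp_f": 91},
    {"id": 2, "city": "Boston", "temp_f": 68},
    {"id": 3, "city": "Denver", "temp_f": 75},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading temp_f means walking past every field of every row.
row_values_touched = sum(len(record) for record in rows)
avg_from_rows = sum(record["temp_f"] for record in rows) / len(rows)

# Column layout: only the temp_f column is read.
col_values_touched = len(columns["temp_f"])
avg_from_columns = sum(columns["temp_f"]) / len(columns["temp_f"])

print(row_values_touched, col_values_touched)  # 9 3
print(avg_from_rows, avg_from_columns)         # 78.0 78.0
```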

Use cases

Parquet is especially useful with services that charge by the amount of data stored, because its compression reduces storage size. It is also useful with services whose costs increase with query run time or the amount of data scanned, both because of the compression and because queries can target specific columns, which reduces the need for full table scans. According to Databricks, Apache Parquet works best with serverless technologies like AWS Athena, Amazon Redshift Spectrum, Google BigQuery, and Dataproc.
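As a rough back-of-the-envelope sketch of the scan-cost argument, the snippet below compares the bytes read by a full scan with the bytes read when a query selects only two columns. The column names and sizes are hypothetical, chosen only to show the arithmetic; real savings depend on the data and the service.

```python
def scanned_bytes(column_sizes, selected=None):
    """Sum the sizes of the columns a query must read (None means all columns)."""
    names = column_sizes if selected is None else selected
    return sum(column_sizes[name] for name in names)


# Hypothetical compressed on-disk size of each column, in bytes.
column_sizes = {
    "user_id": 40_000_000,
    "event": 10_000_000,
    "payload": 950_000_000,
}

full_scan = scanned_bytes(column_sizes)
projected = scanned_bytes(column_sizes, ["user_id", "event"])

print(full_scan)              # 1000000000
print(projected)              # 50000000
print(projected / full_scan)  # 0.05
```

Under these assumed sizes, a query that skips the large payload column reads 5% of the data that a full scan would.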

Demo for Spark SQL Parquet Tables in DSE and DSE Analytics

This demo can be found at https://github.com/thompson42/pyspark-dse-cookbook.
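For readers who want a feel for the workflow before opening the repository, here is a minimal PySpark sketch. It is not taken from the demo. It assumes a DSE Analytics node with DSEFS enabled and a shell started with dse pyspark, which provides a ready-made spark session. The helper names, the view name, the sample rows, and the /tmp/weather_parquet path are all hypothetical.

```python
def dsefs_uri(path):
    """Build a dsefs:/// URI from an absolute DSEFS path (hypothetical helper)."""
    return "dsefs:///" + path.lstrip("/")


def write_and_query(spark, rows, columns, path):
    """Write rows to Parquet at `path`, read them back, and query with Spark SQL.

    `spark` is an existing SparkSession, such as the one a `dse pyspark`
    shell provides.
    """
    df = spark.createDataFrame(rows, columns)

    # Write the DataFrame as Parquet files, replacing any earlier output.
    df.write.mode("overwrite").parquet(path)

    # Read the Parquet files back and expose them to Spark SQL as a temp view.
    parquet_df = spark.read.parquet(path)
    parquet_df.createOrReplaceTempView("weather")

    # Only the selected columns need to be read from the Parquet files.
    return spark.sql("SELECT city, temp_f FROM weather WHERE temp_f > 70")


# Example call inside a `dse pyspark` shell (not executed here):
# result = write_and_query(
#     spark,
#     [("Austin", 91), ("Boston", 68), ("Denver", 75)],
#     ["city", "temp_f"],
#     dsefs_uri("/tmp/weather_parquet"),
# )
# result.show()
```

The session is passed in as a parameter so the functions can be pasted into any environment that already has one; outside DSE, a local path can be used in place of the dsefs:/// URI.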

Resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!