In Data Engineer’s Lunch #33: Spark Cassandra and Elasticsearch for Data Engineering, we discuss how to use Spark jobs to load data from a CSV file, save it to Cassandra, and then write it to and read it back from Elasticsearch. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Spark Cassandra and Elasticsearch for Data Engineering
For Data Engineer’s Lunch #33: Spark Cassandra and Elasticsearch for Data Engineering, we will go over a demo project, which is linked in the demo section further down. To understand how the demo works, we first introduce each software component, along with some basic concepts about how each one deals with data:
Apache Cassandra
- Apache Cassandra is an open-source distributed NoSQL database designed to handle large volumes of data across many servers
- Cassandra clusters can be upgraded by either improving hardware on current nodes (vertical scalability) or adding more nodes (horizontal scalability)
- Horizontal scalability is part of why Cassandra is so powerful: cheap machines can be added to a cluster to significantly increase its capacity and throughput
- Note: The demo runs the DSE (DataStax Enterprise) version of Cassandra, which is not open source
- Spark is also run together with Cassandra in DSE
Elasticsearch
- Elasticsearch: https://elastic.co
- Distributed, open-source search and analytics engine built on Apache Lucene and developed in Java
- ES allows a user to store, search, and analyze large volumes of data quickly and in near real-time, returning results in milliseconds.
- Basic concepts of data in Elasticsearch:
  - Documents are the base unit of information that can be indexed in Elasticsearch (expressed in JSON)
    - Think of a document like a row in a relational database
  - An index is a collection of documents which are similar in some way
    - Documents in the same index are typically logically related
    - Vaguely like a “table” in a relational database
  - Inverted index: a mapping from content (e.g. keywords or numbers) to the documents in which that content appears
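To make the inverted index idea concrete, here is a minimal sketch in plain Scala. It is only an illustration of the concept, not how Elasticsearch or Lucene implements it; the tokenizer, the object name `InvertedIndexSketch`, and the sample documents are all made up for this example.

```scala
object InvertedIndexSketch {
  // Lowercase the text and split it into terms on non-alphanumeric characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Build a map from each term to the set of document IDs that contain it.
  def build(docs: Map[Int, String]): Map[String, Set[Int]] =
    docs.toSeq
      .flatMap { case (id, text) => tokenize(text).distinct.map(term => term -> id) }
      .groupBy { case (term, _) => term }
      .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    // Hypothetical documents keyed by document ID.
    val docs = Map(
      1 -> "Spark loads the CSV file",
      2 -> "Cassandra stores the rows",
      3 -> "Spark saves rows to Elasticsearch"
    )
    val index = build(docs)

    // Looking up a term returns the IDs of the documents that mention it.
    println(index.getOrElse("spark", Set.empty[Int]))
    println(index.getOrElse("rows", Set.empty[Int]))
  }
}
```

Because the lookup goes from term to documents, a search for a keyword does not need to scan every document, which is a large part of why full-text queries come back quickly.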
Spark
- Apache Spark: https://spark.apache.org/
- Open-source unified analytics engine for large-scale data processing
- Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
- In this demo, Spark is being run from DSE with analytics turned on
- Spark can also be run separately; you would just have to set up a master node and at least one worker to accept and perform the jobs.
- Spark DataFrames and their history:
  - RDD: Resilient Distributed Dataset (since Spark 1.0)
    - Primary user-facing API in Spark since the beginning
    - Immutable distributed collection of elements of your data, partitioned across nodes in your cluster
  - DataFrame (since Spark 1.3)
    - Immutable distributed collection of data
    - Unlike an RDD, data is organized into columns with names (much like a table in a relational database)
    - Allows a developer to impose a structure onto a distributed collection of data
  - Dataset (since Spark 1.6)
    - Unlike a DataFrame, which is a collection of generic, untyped Row objects, a Dataset is strongly typed
    - Extension of DataFrames: a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java
    - One benefit of this is that syntax errors and analysis errors are both caught at compile time rather than at runtime
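The difference between the three APIs is easiest to see side by side. The sketch below is not taken from the demo project; it assumes Spark SQL is on the classpath and uses a local master, and the `User` case class, the `SparkApiSketch` object, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class User(name: String, age: Int)

object SparkApiSketch {
  // Typed filter: the compiler checks that `age` exists on User and is numeric.
  def olderThan(users: Dataset[User], minAge: Int): Dataset[User] =
    users.filter(_.age > minAge)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; the demo itself runs Spark inside DSE.
    val spark = SparkSession.builder()
      .appName("spark-api-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val users = Seq(User("ada", 36), User("linus", 28))

    // RDD: a distributed collection of JVM objects with no schema attached.
    val rdd = spark.sparkContext.parallelize(users)
    println(rdd.filter(_.age > 30).count())

    // DataFrame: named columns; a misspelled column name would fail only at runtime.
    val df: DataFrame = users.toDF()
    df.filter($"age" > 30).show()

    // Dataset: the same data viewed as typed User objects.
    val ds: Dataset[User] = df.as[User]
    olderThan(ds, 30).show()

    spark.stop()
  }
}
```

In the DataFrame version, changing `$"age"` to a column that does not exist still compiles and only fails once the job runs. In the Dataset version, `_.agee` would be rejected by the compiler.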
SBT
- SBT is an interactive build tool for Scala and Java projects.
- A large majority of Scala projects are built using SBT, and the demo project is no exception
- This demo project uses a plugin called sbt-assembly, which creates a fat JAR: one JAR file that contains all of the class files from your code, along with all of its dependencies (libraries, etc.)
- More info on sbt-assembly can be found in the GitHub repo: https://github.com/sbt/sbt-assembly
- More info on SBT can be found at the official website: https://www.scala-sbt.org/
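As a rough idea of what an sbt-assembly setup for a project like this can look like, here is a sketch of the two SBT files involved. It is not copied from the demo repo: the project name, every version number, and the merge strategy are assumptions that you would need to align with your own cluster and with the demo's actual build files.

```scala
// project/plugins.sbt
// Registers the sbt-assembly plugin (the version is an assumption).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt
name := "spark-cassandra-es-sketch" // hypothetical project name
scalaVersion := "2.11.12"           // assumed; must match the cluster's Spark build

libraryDependencies ++= Seq(
  // Spark is supplied by the cluster at runtime, so keep it out of the fat JAR.
  "org.apache.spark" %% "spark-sql" % "2.4.8" % "provided",
  // Connector versions are assumptions, not the demo's pinned versions.
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.5.2",
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.13.0"
)

// Resolve files that appear in more than one dependency JAR.
// Discarding META-INF is a common shortcut but can drop service files some libraries need.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```

With a setup along these lines, running `sbt assembly` produces the single fat JAR that is later handed to spark-submit.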
Demo Project for Spark Cassandra and Elasticsearch for Data Engineering
Link to the GitHub repo for the demo project: https://github.com/Anant/example-cassandra-spark-elasticsearch/
The demo project contains three Scala class files, each corresponding to its own Spark job. The demo is built with SBT into a single JAR, and the --class flag of spark-submit specifies which of the three main classes to run. The three Spark jobs are as follows:
- Reading a CSV file into a Spark SQL DataFrame and saving it to Cassandra
  - This first job reads data.csv (located in /test-data/) into a Spark SQL DataFrame and then saves it to DSE Cassandra.
- Loading data from a Cassandra table into a Spark SQL DataFrame and saving that data into Elasticsearch
  - This second job reads the data that the first job inserted into DSE Cassandra into a Spark SQL DataFrame. Afterwards, it saves that data to Elasticsearch.
- Loading data from Elasticsearch into a Spark SQL DataFrame
  - The third job reads from the Elasticsearch index that was created in the second job (testuserindex) and puts this data into a Spark SQL DataFrame. Afterwards, it displays the data in the console.
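The three jobs above can be sketched roughly as follows. This is not the demo's code: the repo has one main class per job, while this sketch folds the three steps into one hypothetical object (`JobsSketch`) for brevity. The keyspace and table names are made up, the Cassandra table is assumed to exist already, and Elasticsearch is assumed to be reachable at the connector's default address. It relies on the DataFrame data sources provided by the Spark Cassandra Connector and the Elasticsearch Spark connector.

```scala
import org.apache.spark.sql.SparkSession

object JobsSketch {
  // Data source names registered by the two connectors.
  val cassandraFormat = "org.apache.spark.sql.cassandra"
  val esFormat = "org.elasticsearch.spark.sql"

  // Connector options that identify a Cassandra table.
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  // Job 1: read a CSV file into a DataFrame and append it to a Cassandra table.
  def csvToCassandra(spark: SparkSession, csvPath: String, keyspace: String, table: String): Unit = {
    val df = spark.read
      .option("header", "true")      // assumes the first row holds column names
      .option("inferSchema", "true")
      .csv(csvPath)
    df.write.format(cassandraFormat)
      .options(cassandraOptions(keyspace, table))
      .mode("append")
      .save()
  }

  // Job 2: read the Cassandra table back and index its rows into Elasticsearch.
  def cassandraToEs(spark: SparkSession, keyspace: String, table: String, index: String): Unit = {
    val df = spark.read.format(cassandraFormat)
      .options(cassandraOptions(keyspace, table))
      .load()
    df.write.format(esFormat).mode("append").save(index)
  }

  // Job 3: load the Elasticsearch index into a DataFrame and print it.
  def showEsIndex(spark: SparkSession, index: String): Unit =
    spark.read.format(esFormat).load(index).show()

  def main(args: Array[String]): Unit = {
    // The master and connection settings are expected to come from spark-submit.
    val spark = SparkSession.builder().appName("jobs-sketch").getOrCreate()

    // "demo_ks" and "demo_users" are hypothetical names; testuserindex is the demo's index.
    csvToCassandra(spark, "test-data/data.csv", "demo_ks", "demo_users")
    cassandraToEs(spark, "demo_ks", "demo_users", "testuserindex")
    showEsIndex(spark, "testuserindex")

    spark.stop()
  }
}
```

Keeping each step as its own main class, as the demo does, lets you rerun a single stage with spark-submit without repeating the earlier ones.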
The demo also contains a Docker Compose file, which is used to run two Docker containers and link them to each other through a network. The following two Docker images are run:
- DSE – DataStax Enterprise version of Cassandra with analytics enabled (Spark)
- Elasticsearch 7.13.0
For more detailed information on which commands to run and in what order, along with additional commentary and details on each Spark job, take a look at the README file in the GitHub repo for the demo!
That wraps up this blog on using Spark jobs to load data from a CSV file, save it to Cassandra, and then write it to and read it back from Elasticsearch. As mentioned above, the live recording, which includes a live walkthrough of the demo, is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Data Engineer’s Lunch #32: Converting JSON to CSV, be sure to check that out as well!
References
- https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
- https://www.elastic.co/
- https://spark.apache.org/
- https://www.knowi.com/blog/what-is-elastic-search/
- https://github.com/sbt/sbt-assembly
- https://github.com/Anant/example-cassandra-spark-elasticsearch/
- https://github.com/Anant/example-cassandra-spark-job-scala
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!