In Data Engineer’s Lunch #45: Apache Livy, we discussed Apache Livy, a REST service for interacting with Spark clusters. It also helps with submitting jobs and managing Spark Contexts and cached data. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Apache Livy Overview
Apache Livy is a tool that enables users to interact with a Spark cluster via a REST interface. It allows users to submit jobs as pre-compiled jars or as snippets of code via REST. It also provides a Java/Scala client API for interacting with the Livy server from within code. Livy manages Spark Contexts, which lets a single context be used for multiple jobs or by multiple clients. It also allows cached data to be shared across multiple jobs. This includes RDDs, DataFrames, and Datasets, as long as the jobs share a Spark Context managed by Livy.
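As a rough illustration of submitting a code snippet over REST, the sketch below builds the two requests involved: one that starts an interactive session and one that runs a statement in it. It assumes a Livy server at localhost on port 8998 (Livy's default port); the helper names are ours, not part of Livy, and the requests are only built here, not sent.

```python
import json
import urllib.request

# Assumption: Livy is reachable at this address (8998 is Livy's default port).
LIVY_URL = "http://localhost:8998"


def build_post(path, payload, base_url=LIVY_URL):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="spark"):
    """POST /sessions asks Livy to start an interactive session."""
    return build_post("/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """POST /sessions/{id}/statements runs a code snippet in that session."""
    return build_post(f"/sessions/{session_id}/statements", {"code": code})


# Example: a PySpark session and a one-line snippet for session 0.
session_req = create_session_request(kind="pyspark")
stmt_req = run_statement_request(0, "sc.parallelize(range(100)).sum()")
print(session_req.get_method(), session_req.full_url)
print(stmt_req.get_method(), stmt_req.full_url)
```

Passing either request to urllib.request.urlopen would send it. The session id used above is a placeholder; in practice it comes from the "id" field of the reply to the session request.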
Livy allows users to submit jobs without a local Spark client, so they do not need the spark-submit binaries on their own machine. Your existing Spark code does not need any modification to run with Livy.
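To make the comparison with spark-submit concrete, here is a minimal sketch of the JSON body for a batch job, which would be sent as a POST to the /batches endpoint of the Livy server. The jar path and class name are hypothetical placeholders, and the helper function is ours; the jar must live somewhere the cluster can read it.

```python
import json


def batch_payload(jar_path, class_name, args=None, conf=None):
    """Build the JSON body for POST /batches.

    Rough spark-submit equivalents:
      file      -> the application jar
      className -> --class
      args      -> application arguments
      conf      -> --conf key=value pairs
    """
    payload = {"file": jar_path, "className": class_name}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload


# Hypothetical job: the path and class below are placeholders.
body = json.dumps(
    batch_payload(
        "hdfs:///jobs/example-job.jar",
        "com.example.ExampleJob",
        args=["2021-01-01"],
        conf={"spark.executor.memory": "2g"},
    )
)
print(body)
```

The client only needs to be able to make an HTTP request; everything Spark-specific stays on the Livy side.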
The Livy server needs to run on the same machine as a Spark node in your cluster. At the time of writing, Livy only works with versions of Spark built with Scala 2.10 and 2.11. All Spark 3.0+ releases use Scala 2.12 and are thus incompatible. For our demo, we set up a Spark cluster with Spark version 2.4.8. After downloading and unzipping the Spark and Livy archives, we only had to export the SPARK_HOME environment variable, which points to the main Spark directory. Then we started the Livy server.
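The setup steps can be sketched as a small helper that prepares the environment and the start command. The directory paths are hypothetical and depend on where the archives were unzipped; the sketch assumes the bin/livy-server script that ships in the Livy archive and only builds the command rather than running it.

```python
import os


def livy_launch_plan(spark_home, livy_home, base_env=None):
    """Return the command and environment for starting the Livy server.

    spark_home and livy_home are the unzipped Spark and Livy directories.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home  # Livy locates Spark through this variable
    command = [os.path.join(livy_home, "bin", "livy-server"), "start"]
    return command, env


# Hypothetical install locations; adjust to your own layout.
command, env = livy_launch_plan("/opt/spark-2.4.8", "/opt/livy")
print("SPARK_HOME =", env["SPARK_HOME"])
print("command    =", " ".join(command))
```

To actually start the server, the pair could be passed to subprocess.run(command, env=env, check=True) on the machine that hosts the Spark node.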
The Livy server sits between clients (or client programs) and the Spark cluster. Users pass in their code snippets or the details of their jobs via REST. Because the same API can start and configure Spark Contexts, users can pass in those details as well. Livy then makes sure that each job is started with the appropriate context. Since Livy allows contexts to be created separately from the jobs that run in them, a context can outlive a job and go on to be used for future jobs. This is also how cached data can be passed between jobs.
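The idea of a context that outlives a single job can be sketched with a thin wrapper around the session endpoints. The LivySession class and share_cached_data function below are hypothetical helpers written for this post, not part of Livy; they assume a session of kind "spark" (Scala snippets) and a transport function that can be swapped out for testing. Real code would also poll the session until its state is "idle" before running statements, and poll each statement until its output is available.

```python
import json
import urllib.request


def http_send(method, url, payload=None):
    """Default transport: send JSON over HTTP and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


class LivySession:
    """Hypothetical wrapper: one Livy session backs one shared Spark context."""

    def __init__(self, base_url, send=http_send, kind="spark"):
        self.base_url = base_url.rstrip("/")
        self.send = send
        reply = self.send("POST", f"{self.base_url}/sessions", {"kind": kind})
        self.session_id = reply["id"]

    def run(self, code):
        """Submit a snippet to this session and return the statement id."""
        url = f"{self.base_url}/sessions/{self.session_id}/statements"
        return self.send("POST", url, {"code": code})["id"]

    def result(self, statement_id):
        """Fetch the current state and output of a statement."""
        url = (f"{self.base_url}/sessions/{self.session_id}"
               f"/statements/{statement_id}")
        return self.send("GET", url)

    def close(self):
        """Delete the session, which stops its Spark context."""
        self.send("DELETE", f"{self.base_url}/sessions/{self.session_id}")


def share_cached_data(session):
    """Run two separate snippets that share one cached DataFrame."""
    first = session.run('val df = spark.range(1000).toDF("n").cache(); df.count()')
    # A later request, possibly from another client, reuses df from the same context.
    second = session.run('df.filter("n % 2 = 0").count()')
    return first, second
```

Because both snippets run in the same session, the second one can refer to df without rebuilding it. Calling share_cached_data(LivySession("http://localhost:8998")) would run this against a local Livy server, assuming one is listening on the default port.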
Non-Livy Spark Job Submission
Normally, Spark interactions take place through the Spark shell (Python, Scala, or R) or through the submission of batch jobs with spark-submit. These are not very different: the spark-submit deploy mode option lets submitted jobs run in client mode, which can also give access to a REPL shell.
Both approaches rely on access to the cluster from a limited number of sources. We can use the Spark binaries on one of the nodes that is permanently part of the cluster, install the Spark binaries somewhere else and use them to communicate with the cluster, or set up a cluster gateway machine.
All three methods essentially require access to a machine that is part of the cluster, which exposes implementation and configuration details to users. With only a limited number of nodes or gateways to go through, it is difficult to allow or deny specific types of access to users in different security groups. This makes it hard to integrate with existing security and access control solutions.