In this blog post, we will discuss a number of ways of managing dependencies when running Spark scripts. This particular post is not part of any of our ongoing series. We often discuss using Spark during our Data Engineer’s Lunch events, hosted every Monday at noon EST; if you would like to attend a Data Engineer’s Lunch live, register here now! We last discussed Spark at a recent Cassandra Lunch, where the topic was ETL in Cassandra with Airflow and Spark; that discussion can be found here.
Spark is a unified analytics engine for distributed data processing. When we want to define work for Spark, we pass compiled jar files to it using spark-submit. As we have mentioned previously when discussing Spark, we can do this in one of two ways. If we compile only the source scripts and nothing else, we get what is known as a thin jar. A thin jar does not contain any of the dependencies needed to run the code, so it must do one of two things: either use no dependencies beyond those already included in Spark, or be submitted alongside other jar files that contain the dependencies.
We can also choose to compile our dependencies and main code into a single “fat” jar. The fat jar contains all of the necessary dependencies and is usually built with Maven or sbt. The focus of this post is how to get our dependencies when we want to do more casual development with Spark. If we are prototyping a process or just experimenting with Spark data, how do we ensure that all of the packages we need can be accessed in the shell?
Scala Dependency Management
When scripting in Spark we usually use either the default Scala shell or the pySpark Python shell. In the Scala shell, we can provide a jar file when opening the shell, and all of the dependencies contained within will be available for use. This does not solve the problem of needing dependencies not yet packaged into a jar file; rather, it is useful for extending or experimenting with the functionality of an already compiled project, giving access to its dependencies as well as its custom classes in the Spark shell.
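As a minimal sketch, a previously compiled jar can be put on the shell’s classpath with the --jars option of spark-shell. The jar path below is a hypothetical example, not a file from this post:

```shell
# Open the Scala Spark shell with a local jar on the classpath;
# classes and dependencies inside the jar become usable in the shell.
# The jar path is an illustrative assumption.
spark-shell --jars target/scala-2.12/my-project_2.12-0.1.0.jar
```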
Another way to get standard dependencies into the Spark shell is to use the --packages option alongside the spark-shell command. This will search Maven repositories, download the named packages, and add them to the shell. It will not give access to local objects unless they are part of a local Maven repository.
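For example, the Spark Cassandra Connector can be pulled straight from Maven Central by its coordinates; the version shown here is illustrative and should be matched to your Spark version:

```shell
# Coordinates follow the groupId:artifactId:version pattern.
# Spark resolves the package and its transitive dependencies,
# downloads them, and adds them to the shell's classpath.
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
```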
Once the Spark shell is up, we can enter whatever commands we want and use the included dependencies, whether local or external. We could also use the -i option to provide a Scala script that will run in the Spark shell once it comes up.
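Combining the two, we can resolve packages and run a startup script in one command. The script name init.scala is a hypothetical placeholder:

```shell
# Download packages, then execute init.scala in the shell at startup.
# The shell stays open for interactive work after the script finishes.
spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
  -i init.scala
```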
Python Dependency Management
Alternatively, we could use the pySpark Python shell. By default, we can get custom Python code into Spark in the form of .py files, zipped packages, and .egg files. We can set the configuration property spark.submit.pyFiles, use the --py-files option when running pySpark or spark-submit, or call pyspark.SparkContext.addPyFile() from inside other code. These only work for custom code: packages built as wheels cannot be added this way, so this method does not give us access to the third-party dependencies we may want.
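As a sketch of this option, the file names below (helpers.py, mypackage.zip, job.py) are hypothetical placeholders for your own code:

```shell
# Ship custom Python code to the executors with a batch job...
spark-submit --py-files helpers.py,mypackage.zip job.py

# ...or make the same files importable in the interactive shell.
pyspark --py-files helpers.py,mypackage.zip

# The equivalent from inside running PySpark code would be:
#   sc.addPyFile("helpers.py")
```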
The first way to get Python dependencies into the pySpark shell is to build and export a Conda environment. Conda is one of the most widely used Python package managers. We use conda create to build an environment with all of the packages that we want, activate that environment, and then use conda pack to pack it into an archive. We then pass that archive file with --archives to either spark-submit or the pySpark shell. Spark automatically unpacks anything passed with the --archives option on the executors, giving us access to our packed Conda environment, and thus, our dependencies.
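The steps above can be sketched as follows; the environment name, package list, and job.py are illustrative assumptions:

```shell
# Build and activate an environment with the desired dependencies.
conda create -y -n pyspark_env python=3.8 pandas pyarrow
conda activate pyspark_env

# Pack the environment into a relocatable archive.
pip install conda-pack
conda pack -n pyspark_env -o pyspark_env.tar.gz

# Ship the archive; the name after '#' is the directory it is
# unpacked into on the executors.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_env.tar.gz#environment job.py
```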
Another way to pass a Python environment to pySpark is to use Virtualenv, in a similar way to how we used Conda above. After starting a venv environment, we use pip as normal to install our dependencies. We then use venv-pack to create an archive file, which is passed to spark-submit or pySpark using the --archives option.
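A minimal sketch of the venv variant, again with illustrative names:

```shell
# Create a virtualenv and install dependencies with pip as usual.
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pandas pyarrow venv-pack

# Pack the environment and pass it with --archives, as with Conda.
venv-pack -o pyspark_venv.tar.gz
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment job.py
```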
The last method we will cover for passing Python dependencies to the pySpark shell is PEX, which bundles a Python environment into a single .pex file. First, we use pip to install pex as well as the dependencies, and then call pex with all of the dependencies (including pySpark), which saves them to a .pex file. We then submit the .pex file along with any .py files, with the environment variable PYSPARK_PYTHON pointing at the .pex file.
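The PEX workflow can be sketched like this; the dependency list and job.py are illustrative assumptions:

```shell
# Install pex and the dependencies, then bundle them into one
# self-contained executable environment file.
pip install pex pandas pyarrow pyspark
pex pandas pyarrow pyspark -o pyspark_pex_env.pex

# Ship the .pex file to the executors and use it as their
# Python interpreter.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./pyspark_pex_env.pex
spark-submit --files pyspark_pex_env.pex job.py
```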
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!