Notebooks are a potentially useful tool for those working with Apache Spark. The Databricks platform, built on Spark, also includes notebooks as a standard feature. As part of an ongoing series of webinars on replicating functionality that is part of the Databricks platform, we will be attempting to connect notebooks to Apache Spark ourselves.
What are notebooks?
Notebooks are a web-based interface for a document editor. They generally work with documents of a specific type, split up by the programming language that they use. Jupyter notebooks (previously called Ipython notebooks) work with files with the .ipynb extension and use the programming language python. R Markdown notebooks use the programming language R and the filetype .rmd. From these specific filetypes, notebooks also usually have the ability to export files to a number of different filetypes, including pdf, HTML, and raw JSON. They also often have the ability to combine all of the code in pieces in the notebook and export a single source file, in whatever language the notebook uses.
Notebooks store data in cells. Cells can contain a number of different types of data. The main cell type is Code, which contains the code in whatever programming language the notebooks use. Markdown cells contain formatted text and potentially images. Code cells can generate visualizations, which can be displayed directly below the cell that computes them and can be exported with the rest of the notebooks. Cell results can even contain interactive elements like text inputs and sliders/dials.
Why use notebooks?
Notebooks are useful development tools. They do not really have a place in a production environment but can be invaluable before that for how they enable quick development and clear communication. Notebooks give easy access to visualizations. Those visualizations are also persistent and do not get pushed out by new information like they might be in an IDE or when being printed to a console. Having code divided into cells that are able to run individually makes it easy to compartmentalize the different parts of your code. It also makes it easy to make changes and tweaks to those pieces as development continues.
Having access to markdown cells adds the ability to include text that is outside of the capabilities of the programming language’s comment functionality. Having code split up into cells, and alongside explanatory formatted text and images makes a much nicer looking and easier to understand document to share with others compared to showing them raw code, possibly with comments. Notebooks like Jupyter are also standard tools in data science. People who are familiar with ML work will almost certainly have used notebooks in the past, and being able to work with Spark in a familiar environment makes their job easier. Notebooks also have integrations of their own that can be useful to those working with Spark.
Onto actually connecting a notebook to Spark. The first thing that we need to do is to set up a Spark instance. In this case, we will be using the docker images from sdesilva26’s tutorial on using Spark with docker and running in network host mode in order to avoid the need to configure everything. This means that all ports in the containers will be accessible via the host network on the machine we are using. If using Docker for Mac or Windows, some Docker networking would also be necessary. If running this on a separate machine like a DO droplet, SSH tunneling would be necessary to actually see the notebook on your own machine.
docker run -dit –name spark-master –network host –entrypoint /bin/bash sdesilva26/spark_master:0.0.2
docker run -dit –name spark-worker1 –network host –entrypoint /bin/bash sdesilva26/spark_worker:0.0.2
Once we have the containers set up we want to move into the Spark master container in order to set up the notebooks. The reason that we want our notebooks to be running on the same machine or container as the Spark master node is so that connecting the two is easier. Notebooks connected to Spark when both are on separate machines will never run code in distributed mode. Having them both on the same machine gives an easy workaround where notebooks get exported to the source and then run on the Spark cluster via Spark submit.
docker exec -it spark-master /bin/bash
Once inside the container we want to use pip to install both the notebooks themselves as well as pyspark. Technically we only really need to pip install the notebooks from the console as everything else could be run from inside the notebooks. This notebook install command would also be something to add to the Dockerfile if one wanted to create their own spark master docker image with notebooks already installed.
pip3 install jupyter
pip3 install pyspark
Then we run the notebooks. Since we did not create any particular users inside the spark master container, we will need to add the argument –allow-root in order for the notebooks to run. We also add –no-browser to avoid a warning that the spark master container does not contain a web browser. When this command is run it gives us a token for access to the notebooks page and a port. If there’s a spark image on a separate machine, you would use ssh tunneling to access the correct port. Then you would input the token. In this case, we can navigate to localhost:8888 ourselves and input the token for access.
jupyter notebook –allow-root –no-browser
In order to actually create a Spark context that is connected to the spark instance we are running, we will use findspark. Findspark checks common install locations for Spark and connects to it. We will be able to prove this by checking the Spark jobs UI at localhost:4040 when we run tasks.
Our notebook file contains two examples. The first is an approximator of pi. It calculates pi using spark functionality. It uses parallelize to convert a python list into a Spark RDD, filter to cut down on the data via a condition, and count, which is straightforward. The other example downloads a common machine learning dataset from sklearn, converts it into an RDD, and then into a Dataframe. This is important because data frames are the compatible type for Sparks MLLib, some of the usage of which has been covered in a previous series. Machine Learning with Spark and Cassandra can be found here. Here is the notebook containing the examples. In order to see the proof that this is running on the Spark cluster, we can look at the spark UI.
Databricks is connected to a number of internal and external technologies that provide extra functionality. Notebooks are one of these tools. They ease development and can be integral to data science tasks. This is functionality that can be replicated without Databricks. Come back next time to see the other features of Databricks, and whether we can replicate them
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!