Machine learning is increasingly becoming part of business platforms. To make full use of it in our own platforms, we need a tool with characteristics similar to those of our database tools: it must be distributed and scalable, and it must integrate nearly seamlessly with our data store. Luckily, Spark is a great tool for this purpose. In this post and future ones, we will learn how to set up an environment for performing machine learning using Apache Spark and Cassandra, and we will also learn more about machine learning in general.
Machine Learning Overview
The general workflow for a machine learning task is split into four stages. The first stage is preparation. In this stage we must decide on the question we are trying to answer, gather the relevant data, and prepare it for later use. Though this can mean different things depending on the specific algorithm being used, prep work usually involves making sure that your data is complete and in a format that the algorithm can use.
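As a rough illustration of what "complete and in a usable format" can mean, here is a minimal sketch in plain Python. The function name `prepare`, the column names, and the sample records are all hypothetical, and dropping incomplete rows is only one possible policy (filling in missing values is another).

```python
def prepare(rows, columns):
    """Keep only complete rows and convert the chosen columns to floats."""
    prepared = []
    for row in rows:
        values = [row.get(column) for column in columns]
        if any(value is None or value == "" for value in values):
            continue  # incomplete record: skip it (imputing a value is another option)
        prepared.append([float(value) for value in values])
    return prepared


# Hypothetical raw records, as they might arrive from a CSV file or a table.
raw = [
    {"height": "1.80", "weight": "75"},
    {"height": "", "weight": "62"},       # missing height
    {"height": "1.65", "weight": None},   # missing weight
    {"height": "1.72", "weight": "68.5"},
]

print(prepare(raw, ["height", "weight"]))
# [[1.8, 75.0], [1.72, 68.5]]
```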
The second stage is splitting. Once our data is prepared, we need to decide what proportion of it to use for training our algorithm and what proportion to set aside for testing how well the algorithm might work. It is still possible to learn a model that fails to provide accurate answers to your overarching question, but testing your model on withheld data gives you some idea of how it will perform when given actual data in deployment. Sometimes the data is split three or more ways, into training, validation and testing sets, so that different model parameters can be tuned and checked before final testing is performed. There are even special methods, like k-fold cross validation, that allow you to train and test over your entire dataset without causing problems.
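The two splitting strategies described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a library API: the function names, the 60/20/20 proportions, and the fixed seed are all choices made for the example.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs; every row is tested exactly once."""
    indices = list(range(n_rows))
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most one.
    # If the rows are stored in a meaningful order, shuffle them before folding.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))
# 6 2 2

for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
```

With k-fold cross validation the model is trained k times, each time on all folds but one, and the k test scores are averaged, which is how the whole dataset ends up being used for both training and testing.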
Once your data is properly split, you can move on to the training stage. The point of this stage is to train your model on the training data and then use its performance on the validation set to judge how well it has learned. You can then change the parameters to obtain better performance and continue iterating in this fashion until you are satisfied with the results. Finally, you should test on the testing data and determine whether more iterations are necessary.
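The train, validate, adjust loop can be sketched with a deliberately tiny model. The example below uses a one-dimensional k-nearest-neighbours classifier written in plain Python purely as a stand-in; the data, the function names, and the candidate values of k are all made up for illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, held_out, k):
    """Fraction of held-out points whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return correct / len(held_out)


def tune_k(train, val, candidates):
    """Return the candidate k with the best validation accuracy (first wins ties)."""
    return max(candidates, key=lambda k: accuracy(train, val, k))


# Toy data: (feature, label) pairs. The point (1.4, "b") is deliberate label noise.
train = [(0.0, "a"), (1.0, "a"), (1.4, "b"), (2.0, "a"),
         (8.0, "b"), (9.0, "b"), (10.0, "b")]
val = [(1.5, "a"), (8.5, "b")]
test = [(0.5, "a"), (9.5, "b")]

best_k = tune_k(train, val, candidates=[1, 3, 5])
print("chosen k:", best_k)
print("test accuracy:", accuracy(train, test, best_k))
```

The test set is touched only once, after the parameter has been chosen on the validation set, so its score is not influenced by the tuning.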
If you are happy with the results, you move on to deployment. In deployment the model will encounter data that you did not train, validate, or test on and therefore you can only estimate its performance based on the results from previous tests. Depending on the type of algorithm chosen the actual usage of the finished algorithm may look different, but this stage is the one that has real world stakes.
The rest of this post deals with setting up the environment we will use for the remainder of this series: an instance of DataStax Enterprise with Jupyter notebooks on top as a web interface. We will use Docker to set this up. The setup is based on the project CaSpark on GitHub by HadesArchitect, which is also the base for DataStaxDevs developer training on the topic. First, pull the project's code from GitHub. Then navigate into the CaSpark folder containing the repository and bring up the project using docker-compose.
git clone https://github.com/HadesArchitect/CaSpark.git
cd CaSpark
docker-compose up -d
If you get an error ending in PermissionError: [Errno 13] Permission denied: ‘/home/jovyan/.local’, you will need to replace the word jovyan with the name of your local user account. Also make the same replacement in line 4 of pyspark-cassandra/Dockerfile. Then rerun docker-compose up. This may cause your Jupyter instance to come up with no files in it. In that case, you will need to upload the contents of the jupyter folder, replicating its folder structure. You should then be able to open the first file we will be looking at, kmeans.
We will look at the specifics of the k-means algorithm later, but in general k-means is a clustering algorithm. It takes data points and assigns them to clusters based on how similar the members of each cluster are to one another.
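To make the idea of clustering by similarity concrete ahead of the detailed discussion, here is a minimal plain-Python sketch of the standard k-means iteration (often called Lloyd's algorithm). It is not Spark's implementation; the function names, the sample points, and the hand-picked starting centroids are assumptions made for this example, and similarity is taken to mean Euclidean distance.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
# [0, 0, 0, 1, 1, 1]
```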
Right now, however, we want to look at which modules we need to import to work with our data. We have PySpark and the Python Cassandra driver to interact with our underlying Spark cluster and Cassandra instance. Matplotlib.pyplot gives us graphing functionality similar to what is available in MATLAB; we will use it to graph our data and gain insights into its structure. Pandas is a data manipulation library. It is commonly used in local machine learning tasks for its ability to load, save and organize data, and it can work with a variety of formats.
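If you want to confirm that these libraries are present in your notebook environment before running anything, a small check like the following may help. It is a sketch, not part of the CaSpark project: the helper name `missing_modules` is hypothetical, and the import lines in the comments are the conventional ones for each library, which may differ from what the notebook itself uses.

```python
from importlib.util import find_spec

# Top-level import names. "cassandra" is the import name of the Python
# Cassandra driver, which is distributed as the cassandra-driver package.
REQUIRED_MODULES = ["pyspark", "cassandra", "matplotlib", "pandas"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    # Conventional import lines for these libraries:
    #   from pyspark.sql import SparkSession
    #   from cassandra.cluster import Cluster
    #   import matplotlib.pyplot as plt
    #   import pandas as pd
    print("All modules are available")
```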
Next time, we will look at the specifics of the k-means algorithm. We will also learn how to prepare the dataset that we loaded this time for use with that algorithm. That post will include a discussion of the algorithm's data requirements to help you prepare your own data, and we will discuss the parameters that can be tuned to produce different outcomes from training.