Spark and Cassandra For Machine Learning: Data Pre-processing

Machine learning is increasingly becoming a part of companies’ business platforms. To make full use of machine learning in our business platforms, we need a tool with characteristics similar to those of our database tools. It needs to be distributed and scalable, and it needs to integrate near seamlessly with our data store. In this post, we will learn how to perform machine learning using Apache Spark and Cassandra, while also learning more about machine learning in general.

What is Data Pre-processing?

Data pre-processing covers a set of transformations applied to data that prepare it for use with machine learning algorithms. When we gather data, it generally comes in completely raw. Fields will come in all sorts of data types, values may be missing, and we may have redundant fields or so many fields that processing becomes more time-consuming than it needs to be. During pre-processing, we not only prepare our data for use in learning, but we also gain some personal insight into our data while ensuring that it lines up with the algorithms that we want to use.

Why do Data Pre-processing?

Different kinds of algorithms require data to be in specific formats and numbers to be within specific ranges. Most of the time, the raw data that we gather will not meet the conditions for the algorithm that we want to use. Without pre-processing, we would be unable to do any learning at all in the vast majority of cases. Features of our data such as missing values, or data that is centered on a number other than zero, can all get in the way of training a machine learning model.

Types of Pre-processing

Loading and Vectorization

Data can come in many different forms when we decide to gather it for machine learning projects. This presentation will cover how to load data from a file into a Cassandra table as well as how to get that data from the Cassandra table into a form that Spark can work with. 

Imputation

Imputation is a method for dealing with missing data in your dataset. The simplest way to handle missing data is, of course, to drop any rows that contain it. But if you want to keep rows with missing fields, imputation provides a few ways to fill them back in.

Mean Imputation

Mean imputation involves replacing each missing value with the mean of all the existing values in that field. Other statistics can stand in for the mean here, such as the median, the lowest or highest value, or a randomly chosen existing value.
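As a minimal sketch of the idea, the plain Python below fills the gaps in a single column. The `impute` helper and the `ages` column are hypothetical names made up for illustration, and missing values are assumed to be represented as `None`. In a Spark pipeline the same step would normally be handled by the library rather than written by hand.

```python
from statistics import mean, median


def impute(values, strategy=mean):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None]

imputed_mean = impute(ages)                     # gaps filled with the mean of 34, 29, 41
imputed_median = impute(ages, strategy=median)  # gaps filled with the median instead
```

Passing the statistic in as a function keeps the helper the same whether we fill with the mean, the median, `min`, or `max`.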

ML Imputation

We can use machine learning to train a predictive model on the rows of our data that have no missing fields. The model then generates a value to fill in for each row where the selected field is missing. When rows are missing values in multiple columns, we would need to train a model for each column, and deciding what data to use to train each model can become complicated. After a column is filled, we can technically use that column in future models. However, it may be better to stick to data that was gathered, rather than use data that we generated, in order to predict other missing values.
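To make this concrete, here is a small sketch that uses the simplest possible predictive model, a one-variable least-squares line, trained only on complete rows. The field names `sqft` and `price`, the sample rows, and both helper functions are assumptions made up for illustration. A real project would likely use a richer model and more than one input field.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_model(rows, feature, target):
    """Fill missing `target` values using a line fit on the complete rows only."""
    complete = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in complete],
        [r[target] for r in complete],
    )
    filled = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows are left untouched
        if new_row[target] is None:
            new_row[target] = slope * new_row[feature] + intercept
        filled.append(new_row)
    return filled


# Hypothetical rows; the last one is missing its price.
rows = [
    {"sqft": 1000, "price": 200.0},
    {"sqft": 1500, "price": 300.0},
    {"sqft": 2000, "price": 400.0},
    {"sqft": 1200, "price": None},
]

filled_rows = impute_with_model(rows, feature="sqft", target="price")
```

Note that the model is trained only on gathered values, in line with the caution above about feeding generated values into later models.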

Standardization / Rescale

Standardization is the process of centering your data around a specific value, usually zero, and scaling it to a common spread, usually a standard deviation of one. Rescaling instead maps your data into a chosen range, such as 0 to 1. Some machine learning algorithms work best with data in certain ranges, and some need data to be centered around zero. If your algorithm needs all positive values, for example, rescaling into a positive range can allow you to fulfill that need while also maintaining the relative separation between data points.
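The two transformations can be sketched in a few lines of plain Python. The function names and the sample `heights` column are hypothetical, and the sketch assumes the population standard deviation is used for standardization.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center on zero and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values into the range [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        raise ValueError("cannot rescale a constant column")
    span = vmax - vmin
    return [low + (v - vmin) * (high - low) / span for v in values]


# Hypothetical column of raw measurements.
heights = [2.0, 4.0, 6.0, 8.0]

z_scores = standardize(heights)  # mean 0, standard deviation 1
scaled = min_max_scale(heights)  # smallest value maps to 0.0, largest to 1.0
```

Both transformations are linear, so the ordering of the points and their relative spacing are preserved.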

Encoding

Binary/Categorical

Binary encoding turns values with only two possible values into integers with the value of either 0 or 1. Categorical encoding works for values with greater numbers of possible values, as long as the number is finite. Each possible value becomes a unique integer. A lot of our data comes as text strings. If those strings describe categorical values, binary or categorical encoding can put them into a form usable by machine learning algorithms.
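A rough sketch of both encodings in plain Python follows. The `build_index` helper, the sample columns, and the choice to number categories in order of first appearance are all assumptions for illustration.

```python
def build_index(values):
    """Map each distinct value to an integer, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index


# Binary encoding: a field with exactly two possible values becomes 0 or 1.
subscribed = ["no", "yes", "yes", "no"]
binary_map = {"no": 0, "yes": 1}  # explicit mapping so that "yes" is always 1
subscribed_encoded = [binary_map[v] for v in subscribed]

# Categorical encoding: each distinct string becomes a unique integer.
colors = ["red", "green", "blue", "green"]
color_index = build_index(colors)
colors_encoded = [color_index[v] for v in colors]
```

Keeping the mapping (`binary_map`, `color_index`) around matters, because the same mapping has to be applied to any new data later on.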

One-Hot

One-hot encoding is another method of categorical encoding where each possible value becomes a specific place in a string of binary digits. Some algorithms prefer one-hot encoded values to categorized integer values, since integers imply an ordering between categories that may not exist. In a one-hot encoded value, exactly one place has a value of one and all the others are zero. Which place is set to one determines which category the resulting string belongs to.
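Here is a minimal sketch of one-hot encoding in plain Python. The `one_hot` helper and the sample `colors` column are hypothetical, and sorting the categories alphabetically to fix their positions is just one possible convention.

```python
def one_hot(values, categories=None):
    """Return the category order and one 0/1 list per input value."""
    if categories is None:
        categories = sorted(set(values))  # fixed, reproducible position per category
    position = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1  # exactly one place is set to one
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
# categories is ["blue", "green", "red"], so "red" is encoded as [0, 0, 1]
```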

Multicollinearity / Correlation / PCA

Principal Component Analysis (PCA) is a method of data processing meant to cut down on the number of fields being fed into a machine learning algorithm. It is a method of analysis as well, telling us how strongly the fields in our data are correlated with one another. PCA itself never looks at the label. In the end, it transforms our data by rotating our axes so that each new axis explains as much of the remaining variance as possible, and then keeping only the number of axes that we desire.
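For just two fields, the rotation can be worked out by hand, which makes for a compact illustration. The sketch below is restricted to the two-dimensional case and uses the closed-form rotation angle computed from the covariance terms. The `pca_2d` function and the sample points are assumptions for illustration, and a general implementation would instead compute eigenvectors of the full covariance matrix.

```python
import math


def pca_2d(points):
    """Rotate 2-D data onto its principal axes and keep the first component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the axis along which the data varies the most.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    pc1 = [x * c + y * s for x, y in centered]   # coordinates on the first axis
    pc2 = [-x * s + y * c for x, y in centered]  # coordinates on the second axis

    var1 = sum(p * p for p in pc1) / n
    var2 = sum(p * p for p in pc2) / n
    return theta, pc1, var1 / (var1 + var2)


# Two strongly correlated hypothetical fields.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
theta, reduced, explained = pca_2d(points)
# `reduced` holds one number per row instead of two; `explained` is the share
# of the total variance that the single kept component retains.
```

Because the two fields here move together almost perfectly, a single component retains nearly all of the variance, which is the situation where dropping fields costs the least.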

Conclusion

Pre-processing is a necessary step in any machine learning workflow. We use various methods to shape our raw data into a form that is usable with our various algorithms. Depending on which machine learning algorithms are being used, different pre-processing methods will be necessary in order to prepare that data.

Photo by Pietro Jeng on Unsplash