Apache Spark Companion Technologies: Distributed Machine Learning Frameworks

One of Apache Spark’s core components is MLlib, a library for doing machine learning on Spark. Most data science education, however, relies on other machine learning libraries, such as scikit-learn. Retraining data scientists to use MLlib adds cost on top of the data engineering work already required just to use Spark. Databricks offers distributed versions of some of these familiar machine learning frameworks as part of the Databricks platform.


Machine learning algorithms are really just applied mathematics, but each implementation frames them slightly differently, so using a familiar machine learning library is easier than learning a new one. For ease of use, the Databricks Machine Learning Runtime includes a number of existing machine learning libraries and tools, modified to get the most out of a Spark environment. The runtime includes TensorFlow, Keras, PyTorch, MLflow, Horovod, GraphFrames, scikit-learn, XGBoost, NumPy, MLeap, and pandas. Spark already includes MLlib, but familiarity, access to the necessary algorithms, and compatibility with existing machine learning pipelines may lead data scientists to prefer some of these other frameworks to MLlib.

Distributed Machine Learning

Training machine learning models takes a substantial amount of computation. Many parts of the process repeat the same procedure several times and then combine the results. Cross-validation works this way: a model is trained repeatedly on different portions of the dataset, and the scores are then combined into a single value or analyzed as a set. Hyperparameter tuning works the same way. Steps like these benefit from distribution because they can run in parallel beyond what multithreading or GPU processing on a single machine can accomplish.
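To make the "repeat, then combine" shape concrete, here is a minimal sketch in plain Python. It is not Spark or Databricks code: a local thread pool stands in for the workers of a cluster, the "model" is a toy one-dimensional ridge regression, and every name in it (make_folds, fit_slope, score_fold, grid_search_cv) is made up for illustration. Each (hyperparameter, fold) pair is an independent task, which is what makes this kind of work easy to spread across machines.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def make_folds(n_rows, k):
    """Split row indices 0..n_rows-1 into k disjoint folds."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def fit_slope(xs, ys, l2):
    """Fit y ~ w * x with an L2 penalty (closed-form 1-D ridge, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)


def score_fold(task):
    """Train on all rows outside the held-out fold; return (l2, validation MSE)."""
    xs, ys, held_out, l2 = task
    held = set(held_out)
    train = [i for i in range(len(xs)) if i not in held]
    w = fit_slope([xs[i] for i in train], [ys[i] for i in train], l2)
    mse = mean((ys[i] - w * xs[i]) ** 2 for i in held_out)
    return l2, mse


def grid_search_cv(xs, ys, l2_grid, k=4, workers=4):
    """Score every (hyperparameter, fold) pair in parallel, then combine."""
    folds = make_folds(len(xs), k)
    tasks = [(xs, ys, fold, l2) for l2 in l2_grid for fold in folds]
    # The thread pool is only a stand-in for cluster workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score_fold, tasks))
    # Combine step: average the fold scores for each hyperparameter value.
    scores = {l2: mean(m for p, m in results if p == l2) for l2 in l2_grid}
    best = min(scores, key=scores.get)
    return best, scores


# Noise-free toy data (y = 3x), so the unpenalized model fits it exactly.
xs = [float(i) for i in range(1, 21)]
ys = [3.0 * x for x in xs]
best_l2, cv_scores = grid_search_cv(xs, ys, l2_grid=[0.0, 10.0, 1000.0])
```

On a real cluster the pool would be replaced by whatever the framework uses to schedule tasks, but the structure stays the same: build independent tasks, run them in parallel, and reduce the scores.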

In addition, sufficiently large datasets need to be stored in a distributed fashion, whether because they are too big for a single disk or because other features of the storage solution are desirable. Methods for doing machine learning that are themselves distributed are therefore useful to have.
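As a rough illustration of what "equally distributed" training can look like, the sketch below fits a straight line to data that is split into partitions. It is plain Python under simplifying assumptions, not how Spark MLlib is implemented: each partition is just a list of (x, y) rows, and the helper names (partial_stats, merge, fit_line) are hypothetical. The point is that each partition is summarized where it lives, and only a handful of numbers per partition has to be moved and combined.

```python
from functools import reduce


def partial_stats(partition):
    """Per-partition work: reduce local (x, y) rows to five running sums."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two partial summaries by adding them element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Ordinary least squares for y = slope * x + intercept from merged sums."""
    n, sx, sy, sxx, sxy = reduce(merge, map(partial_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data following y = 2x + 1, split unevenly across three "nodes".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(partitions)
```

Because the summaries simply add up, the result does not depend on how the rows are partitioned, which is the property that lets this kind of computation follow the data instead of pulling everything onto one machine.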

Distributed Solutions

There are several ways to do distributed machine learning. One is to learn Spark MLlib: its algorithms are built to run on distributed systems and come standard with every Spark installation, and its API is available in a number of programming languages. Alternatively, one could pay for Databricks and use its Machine Learning Runtime, a managed Spark cluster that comes with familiar libraries installed as standard. Finally, some distributed machine learning libraries have been published as standalone projects; spark-sklearn by Databricks and TensorFlowOnSpark by Yahoo are two examples.


Databricks is connected to a number of internal and external technologies that provide extra functionality. Distributed machine learning libraries make it possible to do machine learning on a Spark cluster without needing to learn Spark MLlib. Come back next time to see the other features of Databricks, and whether we can replicate them.

