Databricks is an analytics platform built on Apache Spark. In this post we will look at the features Databricks offers compared to base Apache Spark, and whether those capabilities can be achieved without going through Databricks.
Apache Spark is a cluster computing technology primarily designed for fast computation. It descends from Hadoop, a distributed computation framework built on MapReduce. Spark gains a speed advantage over earlier technologies of this type by keeping the results of intermediate operations in memory rather than immediately writing them back to disk. Spark is also more versatile than those earlier technologies: beyond map and reduce steps, it can execute SQL queries, process streaming data, run machine learning workloads, and perform graph-based processing. These different types of processing live in modules that sit on top of Spark Core, and together they form a powerful, fast analytics engine. Since Spark is open source, it can also connect to several external tools.
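Spark's execution model generalizes the MapReduce pattern it inherited from Hadoop. As a rough illustration in plain Python (not Spark itself), a word count can be written as a map step that emits key/value pairs and a reduce step that aggregates them; this is the same pattern Spark distributes across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark keeps intermediate results in memory",
         "spark runs map and reduce steps"]
counts = reduce_phase(map_phase(lines))
print(counts["spark"])  # 2
```

In PySpark the equivalent would be an RDD chain like `rdd.flatMap(...).map(...).reduceByKey(...)`, with the intermediate pairs held in memory across the cluster rather than written to disk between steps.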
Databricks is a managed data and analytics platform developed by the same people responsible for creating Spark. Its core is a modified Spark distribution called the Databricks Runtime, which is optimized beyond a stock Spark cluster. It also comes with several data management technologies and visualization tools, as well as some quality-of-life additions. Being a cloud platform, it can be deployed to the various cloud providers, and it's easy to create and scale clusters using Databricks' interface.
Databricks Runtime vs Apache Spark
The Databricks Runtime is a modified version of Apache Spark that serves as the foundation for the larger Databricks system. It makes several changes to optimize performance and to ease connections with tools both internal and external to Databricks. Clusters created through Databricks are on-demand, able to be brought up quickly on various cloud platforms. They are also elastic, able to scale to meet new demands.
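The elasticity described above is configured when a cluster is created. As a minimal sketch, the spec below follows the shape of a Databricks Clusters API create request; the cluster name, runtime version, and node type shown are placeholder values, not recommendations, so check the Databricks documentation for currently supported values:

```python
# Sketch of an autoscaling cluster spec in the shape used by the
# Databricks Clusters API. All concrete values here are placeholders.
cluster_spec = {
    "cluster_name": "demo-cluster",        # hypothetical name
    "spark_version": "13.3.x-scala2.12",   # a Databricks Runtime version
    "node_type_id": "i3.xlarge",           # cloud-provider instance type
    "autoscale": {                         # elastic scaling bounds
        "min_workers": 2,
        "max_workers": 8,
    },
}
print(cluster_spec["autoscale"]["max_workers"])  # 8
```

With `autoscale` set, Databricks adds or removes workers between the two bounds as load changes, which is the "elastic" behavior that is tedious to replicate on a self-managed Spark cluster.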
Notebooks are web-based interfaces for working with documents that can contain code, comments, and visualizations, as well as images and Markdown sections. These capabilities make notebooks versatile development tools, especially when combining snippets from different languages or iterating on code repeatedly to reach a workable state. A use case that often falls under this is machine learning. Databricks provides notebooks usable with your cluster. It is possible to configure standalone notebook instances to run code via a standalone Spark instance, but Databricks handles the necessary configuration, making the task much easier. We use Jupyter notebooks with Spark as part of our Machine Learning with Spark and Cassandra series, which can be found here.
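For reference, the do-it-yourself configuration that Databricks takes care of can look roughly like the sketch below, which uses the environment variables the `pyspark` launcher reads to start Jupyter as its driver front end. This is one possible local setup, not the only one:

```shell
# One way to run a notebook against a standalone Spark install;
# pip puts the pyspark launcher on PATH.
pip install pyspark jupyter

# Tell the pyspark launcher to start Jupyter as its driver front end.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Launch: a notebook server starts with a SparkContext available as `sc`.
pyspark
```

On Databricks, none of this is necessary: notebooks attach to a cluster directly from the workspace UI.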
Spark includes its own machine learning framework, Spark MLlib, which is very useful for performing machine learning tasks with Spark. There is, however, a learning curve that must be climbed before MLlib can be used efficiently. Databricks also offers, out of the box, access to several other popular machine learning frameworks. This makes it easy to undertake machine learning tasks in whatever framework a developer is most familiar with. The Databricks ML Runtime comes with TensorFlow, scikit-learn, and PyTorch, among others.
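The appeal is that code a developer already knows runs unchanged in a Databricks notebook cell. As a minimal sketch using scikit-learn (one of the frameworks bundled with the ML Runtime), fitting and querying a model is just the familiar fit/predict pattern:

```python
from sklearn.linear_model import LinearRegression

# Tiny toy dataset following y = 2x + 1.
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

model = LinearRegression()
model.fit(X, y)

print(round(float(model.coef_[0]), 2))    # learned slope, ~2.0
print(round(float(model.intercept_), 2))  # learned intercept, ~1.0
```

Expressing the same task in MLlib would instead involve assembling features into a DataFrame column and building a Pipeline, which is where much of the learning curve mentioned above comes from.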
MLflow is a machine learning management platform that comes standard with Databricks. It is a tool for managing the training and deployment of machine learning models. These tasks can be done by hand, but MLflow, or another ML management platform, gives data scientists reproducible workflows for generating, deploying, and otherwise managing machine learning models.
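To make "reproducible workflows" concrete, the sketch below is a hand-rolled illustration of what experiment tracking records; it is deliberately not MLflow's API. In real MLflow, functions like `mlflow.log_param()` and `mlflow.log_metric()` record this information against a tracking server for you:

```python
import time
import uuid

# Hand-rolled sketch (not MLflow's API) of experiment tracking:
# record the parameters and metrics of each training run so that
# results can be compared and reproduced later.
runs = []

def log_run(params, metrics):
    run = {"run_id": uuid.uuid4().hex, "time": time.time(),
           "params": params, "metrics": metrics}
    runs.append(run)
    return run

log_run({"learning_rate": 0.01, "epochs": 10}, {"accuracy": 0.93})
log_run({"learning_rate": 0.10, "epochs": 10}, {"accuracy": 0.88})

# With every run recorded, picking the best model is a lookup:
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
print(best["params"]["learning_rate"])  # 0.01
```

Maintaining, querying, and sharing records like these across a team is exactly the bookkeeping MLflow automates, alongside model packaging and deployment.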
Databricks offers integrations with various BI tools in order to make visualizing your data easy and quick. BI tools like Tableau offer powerful visualization capabilities, and Databricks makes it easy to connect your data to them. Again, this is something you can do yourself, either by managing your data so that it is compatible with these BI tools or by configuring the connections manually.
Delta Lake is a storage layer that sits atop cloud-based data lakes like AWS S3 or Azure Data Lake Storage. Data lakes can store large amounts of structured or unstructured data: structured data keeps table and column separations like database tables, while unstructured data, such as images and videos, has no fixed schema. Delta Lake adds reliability features such as ACID transactions and schema enforcement on top of this storage.
Databricks connects to a number of internal and external technologies that provide extra functionality. Things like external ML frameworks and data lake connection management make Databricks a more powerful analytics engine than base Apache Spark. Some of these connections and tweaks can be replicated without Databricks, which may be the basis for a series of other posts in the future.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!