Introduction to Databricks

Data Engineer’s Lunch #42: Introduction to Databricks

In Data Engineer’s Lunch #42: Introduction to Databricks, we introduce Databricks and discuss how we can use it for data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!


If you are not familiar with Databricks, it is a unified data analytics platform in the cloud for massive-scale data engineering and collaborative data science. Databricks offers a free Community Edition, though with some limits on features (more on that here). Databricks supports large-scale data processing for batch and streaming workloads, enables analytics on the most complete and recent data, simplifies and accelerates data science on large datasets, and standardizes the ML lifecycle from experimentation to production.

Databricks adds enterprise-grade functionality to the innovations of the open-source community. As a fully managed cloud service, it handles your data security and software reliability while offering the scale and performance of the cloud. Databricks is rooted in open source: its platform includes technologies such as Apache Spark, Delta Lake, and MLflow, which were originally created by the founders of Databricks or at the company itself. The Databricks platform also includes TensorFlow, Redash, and R.

We also discussed Databricks’ Delta Live Tables, which we had not covered in our previous discussions on Databricks. Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. You can also manage how your data is transformed based on a target schema you define for each processing step. To enforce data quality, you use Delta Live Tables expectations, which let you define the data quality you expect and specify how to handle records that fail those expectations. You can create and run a Delta Live Tables pipeline using a Databricks notebook.
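To make the idea of expectations more concrete, here is a minimal sketch in plain Python. It is not the Delta Live Tables API: in a real pipeline you would attach expectations to a table with decorators such as `@dlt.expect`, `@dlt.expect_or_drop`, and `@dlt.expect_or_fail`, and the `dlt` module is only available inside a Databricks pipeline. The `Expectation` class, the `apply_expectations` function, and the sample rows below are all hypothetical names and data made up for illustration. They only model the three ways a failing record can be handled: keep it and count the violation, drop it, or stop the run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class Expectation:
    """A named data quality rule plus the action to take when a record fails it."""
    name: str
    predicate: Callable[[Record], bool]
    on_violation: str = "warn"  # one of "warn", "drop", "fail"

    def __post_init__(self) -> None:
        if self.on_violation not in {"warn", "drop", "fail"}:
            raise ValueError(f"unknown action: {self.on_violation!r}")


def apply_expectations(
    records: List[Record], expectations: List[Expectation]
) -> Tuple[List[Record], Dict[str, int]]:
    """Return the records that survive, plus a violation count per expectation."""
    kept: List[Record] = []
    violations = {e.name: 0 for e in expectations}
    for record in records:
        keep = True
        for e in expectations:
            if e.predicate(record):
                continue
            violations[e.name] += 1
            if e.on_violation == "fail":
                # Stop processing entirely, similar in spirit to a failed update.
                raise ValueError(f"expectation {e.name!r} failed for {record!r}")
            if e.on_violation == "drop":
                keep = False  # record is excluded from the output
            # "warn": the record is kept and the violation is only counted
        if keep:
            kept.append(record)
    return kept, violations


# Made-up sample rows, for illustration only.
rows = [
    {"page": "Apache_Spark", "clicks": 12},
    {"page": None, "clicks": 3},
    {"page": "MLflow", "clicks": -1},
]
expectations = [
    Expectation("valid_page", lambda r: r["page"] is not None, "drop"),
    Expectation("non_negative_clicks", lambda r: r["clicks"] >= 0, "warn"),
]
clean, violations = apply_expectations(rows, expectations)
print(clean)
print(violations)
```

In this sketch the row with a missing page is dropped, while the row with a negative click count is kept but counted as a violation, so the quality problem is visible without losing the record.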

Check out this link for a quickstart on how to use Delta Live Tables! In this quickstart, Databricks covers:

  • Creating and running a Delta Live Tables pipeline on a dataset containing Wikipedia clickstream data to:
    • Read the raw JSON clickstream data into a table.
    • Read the records from the raw data table and use Delta Live Tables expectations to create a new table that contains cleansed data.
    • Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
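The three steps listed above (raw table, cleansed table, derived dataset) can be sketched in plain Python to show how each stage feeds the next. This is only an illustration of the data flow, not the quickstart's code: in Delta Live Tables each step would typically be a function decorated with `@dlt.table` that returns a Spark DataFrame. The function names, the field names (`prev_title`, `curr_title`, `n`), the cleansing rules, and the sample JSON lines are all assumptions made up for this example.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

# Made-up sample data standing in for the raw JSON clickstream file.
RAW_JSON_LINES = """\
{"prev_title": "Apache_Spark", "curr_title": "Databricks", "n": 40}
{"prev_title": "Delta_Lake", "curr_title": "Databricks", "n": 25}
{"prev_title": "Apache_Spark", "curr_title": "MLflow", "n": 10}
{"prev_title": "other-empty", "curr_title": null, "n": 7}
{"prev_title": "Redash", "curr_title": "Databricks", "n": 0}
"""


def clickstream_raw(text: str) -> List[Dict]:
    """Step 1: read the raw JSON lines into a 'table' (a list of dicts)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def clickstream_clean(raw: List[Dict]) -> List[Dict]:
    """Step 2: drop rows that fail the quality rules and rename the columns."""
    cleaned = []
    for row in raw:
        # Assumed rules: the current page must be present and clicks positive.
        if row["curr_title"] is None or row["n"] <= 0:
            continue
        cleaned.append(
            {
                "referrer": row["prev_title"],
                "page": row["curr_title"],
                "click_count": int(row["n"]),
            }
        )
    return cleaned


def top_referrers(clean: List[Dict], page: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Step 3: derive a dataset of the referrers sending the most clicks to a page."""
    totals: Counter = Counter()
    for row in clean:
        if row["page"] == page:
            totals[row["referrer"]] += row["click_count"]
    return totals.most_common(limit)


raw = clickstream_raw(RAW_JSON_LINES)
clean = clickstream_clean(raw)
derived = top_referrers(clean, "Databricks")
print(len(raw), len(clean))
print(derived)
```

Each function depends only on the output of the previous one. That mirrors how a pipeline defines each table in terms of the tables upstream of it, which is what lets a framework work out the order in which to run the steps.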

If you are interested in any of our other talks on Databricks, they can be found here:

If you missed Data Engineer’s Lunch #42: Introduction to Databricks live, it is embedded below! Additionally, all of our live events can be rewatched on our YouTube channel, so be sure to subscribe and turn on your notifications!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!