Apache Spark Companion Technologies: Data Lakes

Data lakes are a tool for long-term data storage. They can be implemented on-premises for use cases that require high security, or in the cloud for more accessible solutions. The Databricks runtime includes code specifically for easing the connection between Spark and data lake technologies, as well as its own companion technology, Delta Lake. Delta Lake makes interacting with data in data lakes easier and more consistent, but it is possible to work with data lakes without it, as we will see today.

Introduction

Data lakes are tools for long-term storage. They are often conceptualized as blob storage, object storage, or file storage. Data lake storage is very durable because the data is distributed, and typically replicated, across many machines. Inside a data lake, data is stored as objects or files. That data can be unstructured (raw text, PDF files, video files, etc.), semi-structured (CSV files, logs, or anything with some structure but room for arbitrary inputs or a non-strict schema), or structured (fully schema-compliant data, such as data taken directly from databases). Some data lakes include automated versioning of stored objects. Cloud data lakes also offer access solutions for querying the data they hold.
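To make the object model above more concrete, here is a minimal sketch of a versioned object store in plain Python. It is only an in-memory illustration, not how any real data lake is implemented: the class name `VersionedObjectStore`, its methods, and the example keys are all hypothetical, and a real service would persist and replicate the data rather than hold it in a dictionary.

```python
from typing import Dict, List, Optional


class VersionedObjectStore:
    """Toy in-memory stand-in for a data lake bucket with versioning turned on."""

    def __init__(self) -> None:
        # Maps an object key (a path-like string) to every version written so far.
        self._versions: Dict[str, List[bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Store data under key and return its new 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)  # older versions are kept, never overwritten
        return len(history)

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the latest version of key, or a specific 1-based version."""
        history = self._versions[key]  # raises KeyError for unknown keys
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{key} has no version {version}")
        return history[version - 1]

    def list_keys(self, prefix: str = "") -> List[str]:
        """List keys that start with prefix, the way object stores emulate folders."""
        return sorted(k for k in self._versions if k.startswith(prefix))


# Hypothetical example data: the store does not care what the bytes contain.
store = VersionedObjectStore()
store.put("raw/notes.txt", b"free-form text")  # unstructured
store.put("raw/events.csv", b"user,action\nada,login\n")  # semi-structured
store.put("raw/events.csv", b"user,action\nada,login\ngrace,login\n")  # new version
```

Calling `store.get("raw/events.csv")` returns the most recent bytes, while `store.get("raw/events.csv", version=1)` returns the original upload. Keeping every version is what lets a versioned store double as a simple backup mechanism.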

Use Cases

Archival Data Storage

Data lakes work very well for storing data that is accessed infrequently but still needs to be available for analysis. They are well suited to storing the metadata and logs generated by other data management solutions, and to holding backups of data stored in other data management systems. Data lakes with built-in versioning can handle much of the backup management themselves. Since data lakes can expand to exabytes in size, they can often act as a backup of the entire storage for your data platform.

Staging for Unprocessed Data

Data lakes can act as a place to store data that has yet to be processed in any way. The vast amount of space available means that data at different stages of processing can also be stored. This allows processing to branch in different directions, leading to different insights. The processed data can then be loaded into databases for easy access or used to feed an API. It can also be used to train machine learning models.
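The staging idea can be sketched with nothing more than folders. The example below is a rough illustration under stated assumptions: a local temporary directory stands in for the data lake, the zone names `raw` and `cleaned` are hypothetical, and the "processing" is just dropping incomplete CSV rows. The point is that the unprocessed file stays in place while each later stage is written alongside it.

```python
import csv
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, name: str, text: str) -> Path:
    """Write unprocessed data into the raw zone exactly as received."""
    target = lake_root / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


def stage_cleaned(lake_root: Path, name: str) -> Path:
    """Read a raw CSV, drop incomplete rows, and write the result to the cleaned zone."""
    source = lake_root / "raw" / name
    target = lake_root / "cleaned" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    with source.open(newline="", encoding="utf-8") as src, \
            target.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Keep only non-empty rows in which every field is filled in.
            if row and all(cell.strip() for cell in row):
                writer.writerow(row)
    return target


def run_demo() -> list:
    """Land a small hypothetical file, clean it, and return the cleaned rows."""
    with tempfile.TemporaryDirectory() as tmp:
        lake_root = Path(tmp)  # stand-in for the root of a data lake
        land_raw(lake_root, "events.csv",
                 "user,action\nada,login\n,logout\ngrace,login\n")
        cleaned = stage_cleaned(lake_root, "events.csv")
        with cleaned.open(newline="", encoding="utf-8") as f:
            return list(csv.reader(f))
```

Because the raw file is never modified, a second job could read the same `raw/events.csv` and write to a different zone, which is the branching described above.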

Conclusion

Databricks is connected to a number of internal and external technologies that provide extra functionality. Data lakes act as durable long-term storage and can be integral to data science tasks. This is functionality that can be replicated without Databricks. Come back next time to see the other features of Databricks, and whether we can replicate them.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!