Data Engineer’s Lunch #5: What is a Data Lake?

In Data Engineer’s Lunch #5: What is a Data Lake?, we discuss what data lakes are, why we need them, how to get data into and out of them, and the different ways to implement one. The recording of the live session, which includes a more in-depth discussion, is embedded below in case you were not able to attend. If you would like to attend Data Engineer’s Lunch in person, it is hosted every Monday at 12 PM EST. Register here now!

What are data lakes?

  • Data kept indefinitely in one place
  • Raw data stored as objects or files
    • Structured data exported from relational databases
      • CSV
      • TSV
    • Semi-structured data (CSV, logs, XML, JSON)
    • Unstructured data (emails, documents, PDFs)
    • Binary data (images, video, audio)
  • On-premises or cloud
    • HDFS-compatible storage (S3, HDFS, MinIO, DSEFS)
    • Ceph
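
To make the categories above concrete, here is a minimal sketch of how raw files might be sorted by type and given an object key before landing in a lake. The extension map and the `raw/<category>/<date>/<file>` key layout are assumptions made for illustration, not something prescribed in the talk. Because the list mentions CSV under both structured and semi-structured data, the sketch arbitrarily files it under structured.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical extension-to-category map, loosely based on the examples listed above.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".tsv": "structured",
    ".log": "semi_structured",
    ".xml": "semi_structured",
    ".json": "semi_structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".mp4": "binary",
    ".wav": "binary",
}


def classify(filename: str) -> str:
    """Return the data category for a file name, or 'unknown' if the extension is unmapped."""
    suffix = PurePosixPath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


def object_key(filename: str, ingest_date: date) -> str:
    """Build a key of the (assumed) form raw/<category>/<YYYY-MM-DD>/<file name>."""
    name = PurePosixPath(filename).name
    return f"raw/{classify(filename)}/{ingest_date.isoformat()}/{name}"


if __name__ == "__main__":
    for path in ["exports/orders.csv", "app/server.log", "scans/invoice.PDF", "media/intro.mp4"]:
        print(object_key(path, date(2021, 1, 4)))
```

The file itself is stored unchanged. Only the key records what kind of data it is and when it arrived, which keeps the lake "raw" while still making it navigable.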

Why do we need a data lake?

  • Can finally do cool stuff with data science
    • Get data into a data lake
    • Data engineering / wrangling to clean the data
    • Save it back to the data lake
  • From Will Angel:
    • Executive memory problem: many people don’t realize that a data lake can just be BigQuery these days. The terms “data lake” and “data warehouse” trigger a lot of PTSD in executives who have lived through bad data lake or warehouse projects and don’t realize that the cost and complexity have come down a lot.
  • Question from Will Angel
    • Garbage in, garbage out: how do we keep our data lakes from turning into data swamps?
    • Answer from Nirmal
      • Stream data in via Kafka (requires some filtration)
      • Leverage a data catalog (metadata, schema, name)
    • Other ideas
      • Separate data lakes for ingestion and for cleaner data, stopping short of a full warehouse
      • Dataset identification / governance
      • Use the Databricks bronze/silver/gold terminology
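
As a rough illustration of the catalog idea (name, schema, metadata) combined with the bronze/silver/gold labels, the sketch below keeps an in-memory registry that refuses datasets without a schema. The class names, fields, and validation rules are hypothetical and chosen for this example; a real deployment would rely on an actual catalog service.

```python
from dataclasses import dataclass
from typing import Dict, List

# Layer names borrowed from the bronze/silver/gold terminology mentioned above.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record: what the dataset is and where it lives."""
    name: str
    layer: str
    location: str
    schema: Dict[str, str]  # column name -> type name
    description: str = ""


class Catalog:
    """Toy in-memory catalog that rejects undocumented datasets."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        if entry.layer not in LAYERS:
            raise ValueError(f"unknown layer: {entry.layer!r}")
        if not entry.schema:
            raise ValueError(f"dataset {entry.name!r} has no schema")
        if entry.name in self._entries:
            raise ValueError(f"dataset {entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def by_layer(self, layer: str) -> List[str]:
        """Return the sorted names of all datasets in one layer."""
        return sorted(e.name for e in self._entries.values() if e.layer == layer)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(DatasetEntry("orders_raw", "bronze", "raw/orders/", {"id": "string"}))
    catalog.register(DatasetEntry("orders_clean", "silver", "clean/orders/", {"id": "int"}))
    print(catalog.by_layer("bronze"))
```

The point of the validation is that nothing enters the lake anonymously: every dataset has a name, a location, a schema, and a stated level of cleanliness.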

How do we get data into and out of a data lake?

  • Ingress
    • Extract Load Transform (ELT)
    • Extract Transform Load (ETL)
    • Stream into it (Kafka, Spark streaming, Flink, Alpakka)
    • Batch into it (*, Spark, MapReduce, etc.)
  • Egress
    • Integration to query engines out of the box
      • Cloud
        • Snowflake
          • Storage: S3/Azure Storage
          • Query: Snowflake Query Language
        • Google BigQuery
          • Storage: Google Cloud Storage
          • Query: BigQuery
        • Azure Data Analytics
          • Storage: Azure Storage
          • Query: Azure Data Analytics
        • Amazon Redshift Spectrum
          • Storage: S3
          • Query: SQL
        • Amazon Athena
          • AWS Glue
      • Open Source
        • Presto
          • Hive
        • SparkSQL / Spark
    • Stream out of it (Spark streaming, Flink, Kafka, Alpakka)
    • Batch out of it (*, Spark, MapReduce, etc.)
    • Extract Load Transform (ELT)
    • Extract Transform Load (ETL)
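
The difference between ETL and ELT is only where the transform step happens. The sketch below shows both orderings on the same three rows, using an in-memory SQLite database as a stand-in for the lake and its query engine. That stand-in, the table names, and the cleaning rules are assumptions made for illustration.

```python
import sqlite3

# Raw rows as they might arrive from a source system: everything is text.
RAW_ROWS = [("1", " Alice ", "10.50"), ("2", "bob", "n/a"), ("3", "Carol", "7")]


def clean(row):
    """Return (id, name, amount) with proper types, or None if the amount is not numeric."""
    id_, name, amount = row
    try:
        return int(id_), name.strip(), float(amount)
    except ValueError:
        return None


def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in application code first, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_etl (id INTEGER, name TEXT, amount REAL)")
    cleaned = [r for r in map(clean, RAW_ROWS) if r is not None]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows untouched, then transform with the query engine."""
    conn.execute("CREATE TABLE orders_raw (id TEXT, name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER) AS id,
               TRIM(name) AS name,
               CAST(amount AS REAL) AS amount
        FROM orders_raw
        WHERE amount GLOB '[0-9]*'
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    etl(conn)
    elt(conn)
    print(conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall())
    print(conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall())
```

Both paths end with the same cleaned table, but only the ELT path still has the untouched `orders_raw` data available afterwards, which is why ELT is a natural fit for a lake that is meant to keep raw data.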

Implementations

  • Original (On-Premise)
    • HDFS
    • SAN/NAS
  • Open Source
    • Object Storage
    • Structured / Formatted Files
      • Parquet
      • JSON
      • CSV
      • XML
      • Delta Lake (Parquet)
    • Structured / Databases
      • Bigtable
      • Cassandra
  • Cloud
    • S3 / Amazon Athena
    • Azure Data Lake
    • Google Cloud Storage / BigQuery
    • Snowflake
    • Databricks
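
The choice of file format matters because formats differ in how much structure they preserve. As a small sketch using only the Python standard library, the example below writes the same records as CSV and as JSON Lines (one JSON object per line, an assumption about how the JSON files would be laid out) and reads them back. Parquet and Delta Lake need third-party libraries and are not shown here.

```python
import csv
import io
import json

RECORDS = [
    {"id": 1, "name": "Alice", "amount": 10.5},
    {"id": 2, "name": "Bob", "amount": 7.0},
]


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into dicts; every value comes back as a string."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


def from_json_lines(text):
    """Parse JSON Lines text back into dicts; numbers keep their types."""
    return [json.loads(line) for line in text.splitlines() if line]


if __name__ == "__main__":
    print(from_csv(to_csv(RECORDS))[0])
    print(from_json_lines(to_json_lines(RECORDS))[0])
```

After a round trip, the CSV rows contain only strings, while the JSON rows still contain numbers. Columnar formats such as Parquet go further by storing an explicit schema alongside the data.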

Resources

If you missed last week’s Data Engineer’s Lunch #4: Airflow for Data Engineering, be sure to check it out! As mentioned above, the live recording of Data Engineer’s Lunch #5 is embedded below. Also, check out our YouTube page for more videos and the Data Engineer’s Lunch playlist here! Don’t forget to subscribe while you are there!
