Data Engineer’s Lunch #1: Data Engineering Roadmap

In case you missed it, last week we started the first of what will be weekly Data Engineering lunch discussions on Monday, November 16th. The first lunch covered the Data Engineering Road-map. We discussed how to become a proficient data engineer and the various masteries that are required. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at 12 PM EST. Register here now!

In our first Data Engineer’s Lunch, we discussed the Data Engineering roadmap. We started with a general path that covers a number of fields related to data engineering. We then moved on to looking at some specific guides on becoming a data engineer.

Overview

Starting with our own exploration of data engineering skills, we went through a number of categories, naming skills and masteries that a good data engineer needs. For programming languages, we listed Python, Scala, and Java. For scripting and automation skills we have shell scripting, cron scheduling, and cl utils like Sed, Awk, and JQ. 

Next we have databases, split into SQL and NoSQL types and linked with other tech like data lakes and data warehouses. We also cover data modeling here for structured, unstructured, and semi-structured data. Data processing skills include more shell scripting with Sed/Awk, Python, and Perl. It also includes batch processing using Spring Data/Batch and Apache Spark as well as micro batch processing with those same technologies and also Flink. Lastly it includes stream processing with Spark Streaming, Flink, and Kafka. 

Then we talked about scheduling beyond cron, using tools like Airflow. We covered cloud platforms and technologies based on them. AWS includes S3, DMS, and ElasticMapReduce (EMR). Google Cloud Platform includes Dataproc, and Azure includes Azure Databricks and HDInsight. After that we moved on to infrastructure using Terraform, Ansible, Docker, and Kubernetes. Then we briefly covered emerging technologies like Serverless functions and Deltalake.

Guides

Data Engineering — Complete Reference Guide From A-Z [2019] | by Yan Parker | Towards Data Science

The Path to Becoming a Data Engineer – DataCamp

Become a Data Engineer with this Complete List of Resources

Data Engineering: A Guide to the Who, What, and How – Talend

A Beginner’s Guide to Data Engineering – Part II

How To Become A Data Engineer: A Guide

Beginner’s Guide to Data Engineering – A 3-Part Series · Andrew Goss · Senior Data Engineer

data.plumbers – Medium

Resources

Data Engineer Spark Guide – Databricks

Streaming with Apache Kafka – guide for data engineers | Bartosz Mikulski

Data Engineering Streaming Overview

2019_Matt_Turck_Big_Data_Landscape_Final_Fullsize.png (4485×2980)

gunnarmorling/awesome-opensource-data-engineering: An Awesome List of Open-Source Data Engineering Projects

igorbarinov/awesome-data-engineering: A curated list of data engineering tools for software developers

The Top 21 Data Wrangling Open Source Projects

peerside/awesome-data-wrangling: A curated list of data wrangling resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!