ETL stands for Extract, Transform, and Load: the process of pulling data out of source systems, reshaping it, and loading it into a destination such as a database or data warehouse. ETL is one of the main skills data engineers need to master in order to do their jobs well. It was also the topic of our second ever Data Engineer’s Lunch discussion. If you missed it, or just want an overview of available ETL frameworks, keep reading. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you could not attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at 12 PM EST. Register here now!
In our second Data Engineer’s Lunch meeting, we discussed a number of commonly used ETL frameworks. Most are grouped by programming language below, but a few tools span multiple languages or are not tied to any specific one.
ETL Frameworks
ETL Frameworks by Language
We started by talking about frameworks tied to a specific programming language. In Python, Dask, Ray, Airflow (at least with some plugins), and psycopg (a PostgreSQL adapter often used for the extract and load steps) all serve as ETL tools. Dask is a distributed analytics engine that integrates with commonly used Python data manipulation tools such as pandas and NumPy. Ray is an API for building distributed applications that also provides tools for including machine learning in those applications. Airflow is mostly a scheduling and orchestration tool, but plugins make it possible to set up full ETL processes.
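To make the pattern concrete, here is a minimal Dask sketch of a single extract-transform-load step. The file name and columns are hypothetical; the dask.dataframe API mirrors pandas and only runs work when compute() is called.

```python
# A minimal ETL sketch with Dask, assuming a hypothetical sales.csv
# file with "region" and "amount" columns.
import dask.dataframe as dd

# Extract: lazily read the CSV; Dask splits it into partitions.
df = dd.read_csv("sales.csv")

# Transform: aggregate amounts per region; nothing executes yet.
totals = df.groupby("region")["amount"].sum()

# Load: compute() triggers the distributed work and returns a
# pandas Series, which we write out as the refined result.
totals.compute().to_csv("totals_by_region.csv")
```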
Java has MapReduce and Hive, as well as Apache Camel for general ETL pipelines; Apache Camel can also handle stream processing. Spring Batch and Spring Data can also be useful ETL tools in Java. Scala has Scaldi as an ETL framework, as well as Flink. Even .NET has Spring.NET and SSIS, and Node.js can also be used for ETL.
ETL Frameworks across Languages
This section covers ETL frameworks that either run across multiple languages or whose operation and infrastructure matter more than the specific language the code is written in. In the second category we have serverless functions, where processing jobs are run by a cloud provider on demand, without having to maintain a full-time server for them. These services include AWS Lambda, Google Cloud Functions, and Azure Functions.
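As an illustration, a serverless ETL job can be as small as a single function. Below is a hedged sketch of an AWS Lambda handler in Python; the bucket names, object key, and column check are hypothetical, while the boto3 S3 calls (get_object, put_object) are the real client API.

```python
# A sketch of a serverless ETL step as an AWS Lambda handler.
# Bucket names and the "amount" column are hypothetical.
import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Extract: read a raw CSV object from a source bucket.
    obj = s3.get_object(Bucket="raw-data-bucket", Key=event["key"])
    reader = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    # Transform: keep only rows with a non-empty "amount" field.
    cleaned = [row for row in reader if row.get("amount")]

    # Load: write the cleaned rows to a destination bucket.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(cleaned)
    s3.put_object(Bucket="clean-data-bucket", Key=event["key"],
                  Body=out.getvalue().encode("utf-8"))
```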
For ETL frameworks that extend across multiple languages, we have Spark. Spark is another distributed analytics engine. It is written in Scala, has a second main language in Python via PySpark, and also offers Java and R APIs; even .NET can be used with Spark. Spark is a successor to MapReduce that lets the transform part of ETL be distributed. Spark’s DataFrame API makes it easy to load, manipulate, and save data. It also includes libraries for machine learning (MLlib), graph processing (GraphX), stream processing (Spark Streaming), and SQL (Spark SQL). It can also be paired with Apache Livy to expose a REST API, so that Spark clusters can be used from a number of languages.
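A short PySpark sketch of that load-manipulate-save flow through the DataFrame API is below; the input path, column names, and output location are hypothetical.

```python
# A minimal PySpark ETL sketch using the DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV data into a distributed DataFrame.
df = spark.read.csv("raw/sales.csv", header=True, inferSchema=True)

# Transform: drop bad rows and aggregate per region.
totals = (
    df.filter(F.col("amount").isNotNull())
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result as Parquet for downstream consumers.
totals.write.mode("overwrite").parquet("curated/sales_by_region")

spark.stop()
```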
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link is not only to fill the gap left by Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!