In Data Engineer’s Lunch #40: Streaming vs. Batch for ETL, we discuss when to use real-time stream processing and when to process data in batches. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Streaming vs. Batch for ETL
The management of large amounts of data seems to be moving toward real-time streaming, but there are still cases in which performing ETL in batches is the right choice. This article outlines streaming vs. batch for ETL. We will describe each process, cover some of the technologies used, and highlight the strengths and weaknesses of each.
In batch processing, data is collected and stored over a window of time and then processed together. The schedule of these windows varies, but processing usually occurs one to two times daily. Batch tasks can be executed in whatever order the workflow designates. Batch is best suited to very large amounts of data where the entire set must be processed at once, such as sorting or calculating totals and averages.
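As a rough illustration of why totals and averages fit batch processing, the sketch below waits until a whole window of records is available and then aggregates it in one pass. The record layout (`customer`, `amount`) and the function name `summarize_batch` are assumptions made for this example, not part of any specific tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records collected during one batch window.
batch = [
    {"customer": "a", "amount": 20.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "a", "amount": 10.0},
]

def summarize_batch(records):
    """Group a complete batch by customer and compute total and average."""
    amounts = defaultdict(list)
    for record in records:
        amounts[record["customer"]].append(record["amount"])
    # Sorting and averaging both need the full set of records,
    # which is why this step runs only after the window has closed.
    return {
        customer: {"total": sum(values), "average": mean(values)}
        for customer, values in sorted(amounts.items())
    }

summary = summarize_batch(batch)
```

In a real pipeline the same shape of job would typically run on a scheduler against a much larger data set, for example as a Spark job, but the idea is the same: the computation starts only once the batch is complete.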
Streaming executes ETL tasks on data as it flows through the pipeline. Each piece of data is processed as soon as it is ingested, which makes streaming ideal for real-time data analysis. Sources that continuously produce data, and workloads that require immediate detection of anomalies such as fraud monitoring, are prime use cases for streaming.
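To make the contrast concrete, here is a minimal sketch of per-event processing: each event is checked the moment it arrives, using only running statistics rather than the full data set. The event layout, the `detect_anomalies` name, and the threshold and warm-up values are all assumptions chosen for illustration; a production fraud monitor would be considerably more involved.

```python
def detect_anomalies(events, threshold=3.0, warmup=5):
    """Yield events whose amount is far from the running mean.

    Uses Welford's online algorithm, so no past events are stored.
    """
    count = 0
    running_mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for event in events:
        value = event["amount"]
        # Only start flagging once enough events have been seen.
        if count >= warmup:
            std = (m2 / (count - 1)) ** 0.5
            if std > 0 and abs(value - running_mean) > threshold * std:
                yield event
        # Update the running statistics with the new value.
        count += 1
        delta = value - running_mean
        running_mean += delta / count
        m2 += delta * (value - running_mean)

# Hypothetical amounts, fed one at a time as a stream would deliver them.
sample_amounts = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 500.0, 11.0]
stream = ({"amount": amount} for amount in sample_amounts)
flagged = list(detect_anomalies(stream))
```

Because the function is a generator, it can consume an unbounded source and emit a flagged event as soon as it is seen, rather than waiting for a batch window to close.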
Streaming vs. Batch for ETL: Technologies
When considering streaming vs. batch for ETL, the technologies used are a major consideration. Cassandra, Spark, and Kafka are some of the principal technologies we use here at Anant. In fact, we have multiple blog posts on getting started with and using each of them. One such blog post outlines using Spark, Cassandra, and Elasticsearch for Data Processing.
Pros & Cons of Batch Processing
Pros:
- Data migration.
- Running complex algorithms that require access to the entire data set.
Cons:
- Access to current data may require multiple systems, reducing efficiency.
- May result in bottlenecks when the volume of transactions is high.
- Strain on the processing system if the amount of data in the batch becomes too large.
- Delays in availability of data may negatively impact delivery of service.
Pros & Cons of Stream Processing
Pros:
- Analytics on data in real time.
- Analysis and detection of patterns over time.
- SaaS, IoT, Machine Learning, and Web Analytics all rely on or benefit from data streams.
Cons:
- May face challenges when multiple data sources are moving through a distributed system.
- Keeping the order of data consistent requires careful consideration of the CAP theorem. A decision will have to be made between highly consistent data and highly available data that might not be fully up to date.
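The ordering challenge can be sketched with a small reorder buffer. This is only an illustration of the trade-off, not a description of how Kafka or Spark handle it internally: events are held for up to `max_delay` time units so that late arrivals can be slotted into place. The `reorder` function, the `(timestamp, payload)` tuple layout, and the `max_delay` parameter are assumptions made for this example.

```python
import heapq

def reorder(events, max_delay):
    """Yield (timestamp, payload) pairs in timestamp order.

    An event is released only once a newer event at least max_delay
    time units ahead of it has been seen, or the input ends.
    """
    buffer = []
    latest = None
    seq = 0  # tie-breaker so payloads are never compared directly
    for ts, payload in events:
        heapq.heappush(buffer, (ts, seq, payload))
        seq += 1
        latest = ts if latest is None else max(latest, ts)
        # Release everything old enough that no earlier event is expected.
        while buffer and buffer[0][0] <= latest - max_delay:
            released_ts, _, released_payload = heapq.heappop(buffer)
            yield released_ts, released_payload
    # Input finished: flush whatever is still buffered, in order.
    while buffer:
        released_ts, _, released_payload = heapq.heappop(buffer)
        yield released_ts, released_payload

# Hypothetical events arriving slightly out of order.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
ordered = list(reorder(arrivals, max_delay=2))
```

A larger `max_delay` tolerates later arrivals and gives a more consistent order, but every event waits longer before it is available downstream. A smaller value does the opposite. An event that arrives later than `max_delay` allows would still be emitted out of order in this sketch.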
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!