Data Engineer’s Lunch #57: StreamSets for Data Engineering

In Data Engineer’s Lunch #57, we discuss StreamSets and how it can be used for data engineering. The live recording of Data Engineer’s Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. Subscribe to our YouTube channel to keep up to date and watch Data Engineer’s Lunches live at 12 PM EST on Mondays!

In this session, we introduce StreamSets and walk through a demo of how it can be used for data engineering. Be sure to watch the demo in the live recording of Data Engineer’s Lunch embedded below!

StreamSets is a data integration platform built for DataOps. With StreamSets, you can build streaming, batch, CDC, ETL, and ML pipelines from a single UI and deploy data and workloads to any cloud. The DataOps Platform has a free tier (no credit card required) that includes the Data Collector Engine, Transformer Engine, and Control Hub. As shown in the demo below, it supports self-managed deployments via Docker. The free tier allows 2 active jobs, 2 active users, and 10 published pipelines, so depending on your workload, the free tier on a self-managed deployment may be all you need.
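To make the free-tier question concrete, here is a minimal sketch that checks a planned workload against the limits quoted above. The `Workload` class and `exceeded_limits` function are hypothetical names made up for illustration, not part of any StreamSets API, and the limits are simply the numbers from this post, which may change over time.

```python
from dataclasses import dataclass

# Free-tier limits as quoted in this post (assumption: they may have changed since).
FREE_TIER_LIMITS = {"active_jobs": 2, "active_users": 2, "published_pipelines": 10}


@dataclass
class Workload:
    """Hypothetical description of what a team plans to run."""
    active_jobs: int
    active_users: int
    published_pipelines: int


def exceeded_limits(workload, limits=FREE_TIER_LIMITS):
    """Return the names of the limits the workload exceeds (empty list means it fits)."""
    return [name for name, cap in limits.items() if getattr(workload, name) > cap]


small_team = Workload(active_jobs=2, active_users=1, published_pipelines=6)
growing_team = Workload(active_jobs=3, active_users=2, published_pipelines=11)

print(exceeded_limits(small_team))    # empty list: fits within the free tier
print(exceeded_limits(growing_team))  # lists the limits that are exceeded
```

A workload that sits exactly at a limit still fits, since the check only flags values above the cap.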

StreamSets DataOps Platform UI

As mentioned above, the DataOps Platform gives us the Control Hub, the Data Collector Engine, the Transformer Engine, and pre-built connectors and native integrations. The Data Collector Engine is open source, and its code can be found here: https://github.com/streamsets/datacollector-oss. The Transformer Engine can natively execute on Apache Spark, Snowflake, AWS EMR, Google Cloud Dataproc, and Databricks platforms. The pre-built connectors and native integrations allow for connection to applications, big data, SQL/NoSQL DBs, storage/warehouses, and streaming tools. Check out all the connectors here: https://streamsets.com/support/connectors/.

StreamSets Basic Architecture

As mentioned above, a demo of how we can use StreamSets for data engineering is embedded below. Don’t forget to like and subscribe! In the demo, we go through the following items:

  • Spin up Data Collector Deployment from Control Hub + Docker
  • Create Sample Pipeline and Preview Data
  • Schedule / Run Pipeline Job and View Metrics
  • Spin up Transformer Engine Deployment from Control Hub + Docker
  • Create Sample ETL Pipeline and Preview Data
  • Submit ETL Pipeline to Local Spark
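
For readers who have not used a pipeline tool before, the "create a pipeline, preview data, then run it" steps above can be sketched in plain Python. This is only a conceptual toy under stated assumptions: the `origin`, `processor`, `preview`, and `run` functions are hypothetical names invented here, not the StreamSets API, and the sample records are generated purely for illustration.

```python
from itertools import islice


def origin():
    """Hypothetical origin stage: yields generated sample records."""
    for i in range(1, 101):
        yield {"id": i, "amount": i * 1.5}


def processor(records):
    """Hypothetical processor stage: adds a derived field to each record."""
    for record in records:
        yield {**record, "is_large": record["amount"] > 100}


def build_stream(source, stages):
    """Chain the source through each stage lazily."""
    stream = source()
    for stage in stages:
        stream = stage(stream)
    return stream


def preview(source, stages, batch_size=3):
    """Pass only the first few records through the stages; nothing is written."""
    return list(islice(build_stream(source, stages), batch_size))


def run(source, stages, destination):
    """Run the whole pipeline, append records to the destination, return a record count."""
    count = 0
    for record in build_stream(source, stages):
        destination.append(record)
        count += 1
    return count


for record in preview(origin, [processor]):
    print(record)

sink = []  # stand-in for a real destination such as a file or a table
print("records written:", run(origin, [processor], sink))
```

The point of the split is that a preview touches a small sample and has no side effects, while a run processes everything and reports a simple metric, which mirrors the preview and job-metrics steps in the demo outline.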

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!