In Data Engineer’s Lunch #50: Airbyte for data engineering, we discussed Airbyte and how it can be used for data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Airbyte is an open-source data integration tool that focuses on EL(T): extracting and loading data, with transformation as an optional later step. Some of the features that Airbyte includes are:
- 140+ out-of-the-box connectors
- Custom or new connectors, built with the Connector Development Kit (CDK)
- Database replication with Change Data Capture
- Normalization and custom transformations via dbt
- Full-grade scheduler
- Real-time monitoring
- Incremental updates
- Manual full refresh
- Integration with Kubernetes and Airflow
- Cloud hosting & management
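To make the difference between "incremental updates" and "manual full refresh" in the list above more concrete, here is a minimal Python sketch. It is not Airbyte's implementation; the function names, the `updated_at` cursor field, and the in-memory lists standing in for a source and a destination are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def full_refresh(source: List[Record]) -> List[Record]:
    """Re-read every record from the source, ignoring any saved state."""
    return [dict(record) for record in source]


def incremental_sync(
    source: List[Record],
    destination: List[Record],
    cursor_field: str,
    last_cursor: Optional[Any],
) -> Optional[Any]:
    """Append only records newer than the saved cursor and return the new cursor.

    `cursor_field` is a hypothetical column (for example a timestamp) whose
    value only grows as records are added or updated.
    """
    new_records = [
        record
        for record in source
        if last_cursor is None or record[cursor_field] > last_cursor
    ]
    destination.extend(dict(record) for record in new_records)
    if new_records:
        last_cursor = max(record[cursor_field] for record in new_records)
    return last_cursor
```

The first incremental run (with `last_cursor=None`) copies everything; later runs copy only what changed since the saved cursor, which is why incremental syncs are usually cheaper than a full refresh.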
Airbyte supports all API streams and lets you select the specific ones you want to replicate. You can choose between normalized schemas and raw JSON, and either explode nested API objects into separate tables or keep them as serialized JSON. As mentioned above, Airbyte focuses on the extract and load parts of ETL; for the transform step, it lets you run data transformations using dbt. Additionally, Airbyte has an API and many recipes to help you get started.
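The choice between serialized JSON and exploded tables can be illustrated with a short Python sketch. This is a simplified, one-level illustration and not Airbyte's actual normalization logic; the function names, the `id` key, and the `<parent>_<field>` table naming are assumptions.

```python
import json
from typing import Any, Dict, List

Record = Dict[str, Any]


def as_serialized_json(record: Record) -> str:
    """Keep the whole record, nested objects included, as one JSON string."""
    return json.dumps(record, sort_keys=True)


def explode_record(record: Record, table: str, key: str = "id") -> Dict[str, List[Record]]:
    """Split one level of nested objects into child tables.

    Nested dicts and non-empty lists of dicts become rows in a child table
    named `<table>_<field>`, each carrying the parent's key so the rows can
    be joined back. Everything else stays in the parent row.
    """
    tables: Dict[str, List[Record]] = {table: []}
    parent_row: Record = {}
    parent_ref = f"{table}_{key}"
    for field, value in record.items():
        if isinstance(value, dict):
            tables[f"{table}_{field}"] = [{parent_ref: record[key], **value}]
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            tables[f"{table}_{field}"] = [{parent_ref: record[key], **v} for v in value]
        else:
            parent_row[field] = value
    tables[table].append(parent_row)
    return tables
```

For a hypothetical `users` record with a nested `address` object and an `orders` list, `explode_record` would yield three tables: `users`, `users_address`, and `users_orders`.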
In addition to running pipelines, Airbyte provides pipeline visibility in the form of real-time monitoring with error logging, notifications for failed syncs, and debugging autonomy: you can modify and debug pipelines yourself without waiting.
Airbyte offers a range of deployment options:
- Local -> Docker
  - Some users on Macs with an M1 chip have reported problems running Airbyte
- Airbyte Cloud
- AWS -> EC2
- GCP -> Compute Engine
- Azure -> VM
- Digital Ocean
- Oracle -> Cloud Infrastructure VM
As mentioned above, the live recording of Data Engineer’s Lunch #50: Airbyte for data engineering includes a demo. In this demo, we spin up Airbyte on Gitpod and build two simple E+L pipelines. The first fetches a CSV file from GitHub and stores it locally as JSON. The second extracts and loads data from one PostgreSQL instance into another. Be sure to watch the video below!
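For readers who want a feel for what the first pipeline does, here is a rough Python sketch of the same extract-and-load idea: read CSV rows and emit each one as a JSON document. It is not how Airbyte implements its connectors, and the function name and the inline sample data are assumptions; in the demo the CSV comes from GitHub and Airbyte handles the fetching and writing.

```python
import csv
import io
import json
from typing import List


def csv_to_json_lines(csv_text: str) -> List[str]:
    """Convert CSV text (with a header row) into one JSON string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # CSV carries no type information, so every value is loaded as a string.
    return [json.dumps(dict(row)) for row in reader]


if __name__ == "__main__":
    sample = "id,name\n1,Ada\n2,Grace\n"  # hypothetical stand-in for the CSV file
    for line in csv_to_json_lines(sample):
        print(line)
```

Note that no transformation happens here: the values are moved as-is, which matches the E+L focus described above.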
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!