In Data Engineer’s Lunch #28: Petl for Data Engineering, we discussed Petl as part of our ongoing series on python ETL tools. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Petl is a python etl tool that is designed to be easy and convenient to use. It is meant for processing datasets that have maximum tens to hundreds for thousands of rows, though it can theoretically scale to millions. This tool is not particularly fast or efficient and therefore has trouble with very large datasets or high-performance applications. This lack of performance combined with the ease of use means that Petl works well for prototyping processes that will be run later on high-performance platforms.
Introduction to Petl
Petl makes use of lazy evaluators and iterators, so it doesn’t process rows until the results are called for. It can support both functional and object-oriented programming styles. Petl also includes an executable that allows simple data processing from the command line.
As this series has gone on, we have been categorizing tools into two categories. One category holds etl tools that have an internal data representation and help with the actual processing of data by providing functions for common transformations. These tools need to be able to translate data into and out of their native representation and be able to manipulate that data. The other type of etl tool helps manage processes that do the actual data processing. These tools focus on connecting the pieces of processing, managing schedules, and monitoring processes. Petl fits mostly into the first category, but due to its lazy evaluation, it can also be said to be managing pipelines of data processing steps.
Petl ETL Steps
Extract / Load Step
Petl includes general to and from functions for loading data from files and databases. For the general functions, leaving the parameters empty causes the function to try and load from/write to stdin/out. It can automatically compress or decompress data in .gz or .bgz or .bz2 formats. It can also load data from a URL. Petl also includes functions for loading data from or writing data to specific formats or databases. These formats include Python objects, Delimited files, Pickle files Text files, XML files. As well as HTML files, JSON files, Excel files, HDF5 files, Bcolz ctables, Text indexes (Whoosh), Avro files (fastavro). It can interact with SQL databases via SQLAlchemy. Petl can also import/export data from/to a number of other Python data processing tools like Numpy and Pandas.
It is possible to create custom data sources as well. They must conform to the conventions for table containers and table iterators. Those conventions can be found here. Because Petl uses lazy evaluation and iterators to pass data, any data source must also pass data in these ways.
Petl includes functions for common data transformations. It divides them into a number of categories. The basic transformations described include head, tail, and slice functions for getting parts of a table. They also include concatenate and stack functions that combine two tables. Also, the add column function that can add either statically or dynamically determined columns to a table. There is also a section for header manipulation, for changing the header names for columns.
Other sections include value conversion (either type or value), row selection (with conditions). Things like regular expression transformations, splitting up compound values, row transformations, sorting, joins, set operations, deduplication, and aggregation.
Petl includes utility functions to go alongside the more central transformations. The table object has a number of getter functions for accessing just headers or just data or other pieces of the object. It includes tools for visualizations, different parsers for processing strings into other types, tools for counting data, tools for timing your processing, getting the statistics of your data like standard deviation, and tools for randomly generating data. There is also a library called Petlx that contains extensions for Petl. There is a tool for working with data from dna analysis programs, and a tool for building branching data pipelines in Petl.
The GitHub repository for the live demo in the video below can be found here.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!