In Data Engineer’s Lunch #24: Pandas for Data Engineering, we discussed using Pandas for performing Data Engineering tasks in Python. This topic is part of our ongoing series on Python ETL tools. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Pandas Overview
Pandas is an open-source data analysis and manipulation tool written in Python. The last time we talked about Python ETL tools, we noticed a dichotomy. One type of tool facilitates ETL by scheduling jobs and managing dependencies between tasks, which lets us build workflows or pipelines. The other type has its own internal data representation. It helps convert data between that representation and other formats or databases, and it tends to ship with common data transformations that are easy to apply, with the option to write more complex operations by hand. Pandas is of this second type.
Features
Pandas is a Python package built around two custom collection types, the Series and the DataFrame, and the functions that operate on them. A Series acts as a 1D array that can hold any type of Python data. It also stores an index alongside the data, which allows for custom labels and lookups. A DataFrame is a 2D data structure that arranges data into typed columns, where all values within a column should share the same type. There are a number of constructors for making DataFrames from other Python collections. In a DataFrame, the index acts as the row labels, and each column has its own column label.
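To make the two types concrete, here is a minimal sketch of building a Series and a DataFrame from ordinary Python collections. The fruit names, prices, and quantities are made-up illustration data, not something from the talk.

```python
import pandas as pd

# A Series: 1D values paired with an index (custom string labels here).
prices = pd.Series([1.50, 0.25, 3.00], index=["apple", "banana", "cherry"])

apple_price = prices["apple"]   # lookup by index label
first_price = prices.iloc[0]    # lookup by position

# A DataFrame built from a dict of lists: each key becomes a typed column,
# and the index supplies the row labels.
inventory = pd.DataFrame(
    {"price": [1.50, 0.25, 3.00], "quantity": [10, 120, 4]},
    index=["apple", "banana", "cherry"],
)

column_types = inventory.dtypes                        # one dtype per column
banana_quantity = inventory.loc["banana", "quantity"]  # row label + column label
```

Selecting a single column such as inventory["price"] gives back a Series that shares the DataFrame's index, which is why the two types are described as related.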
Pandas for ETL
Pandas I/O
The package also includes a number of methods for making DataFrames from data that is not already contained in a Python collection. Most commonly, it helps you read from and write to CSV and JSON files. It also has readers and writers for other storage formats such as SQL databases, Python pickle files, Parquet files, and HDF5. These functions help to accomplish the extract and load portions of ETL.
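As a rough sketch of the extract and load steps, the example below reads CSV text into a DataFrame, then writes it out as JSON and into a SQL table. The sample rows, the users table name, and the in-memory SQLite database are assumptions chosen to keep the example self-contained; a real pipeline would point at actual files and databases.

```python
import io
import sqlite3

import pandas as pd

# Extract: read CSV data (an in-memory string stands in for a file path).
csv_text = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
users = pd.read_csv(io.StringIO(csv_text))

# Load, option 1: serialize the rows as a JSON array of records.
json_text = users.to_json(orient="records")

# Load, option 2: write to a SQL table and read it back with a query.
conn = sqlite3.connect(":memory:")
users.to_sql("users", conn, index=False)
loaded = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
conn.close()
```

Passing a file path to read_csv or to_json works the same way. For databases other than SQLite, Pandas expects a SQLAlchemy connection rather than a raw driver connection.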
Pandas Dataframe Operations
Pandas contains a large number of functions for manipulating DataFrames. Many of these operations run faster than plain Python loops because performance-critical parts of Pandas are written in Cython (C extensions for Python) and operate on whole columns at once. Large files can also be read in chunks. Pandas supports operations between a DataFrame and a scalar as well as binary operations between DataFrames. It offers statistical methods like mean and standard deviation. Pandas also offers functions for common data cleaning operations like removing rows with null values and dropping duplicates. It can even perform some data processing, like groupby aggregations and joins, for data analysis. Often these operations are used to prepare data sets for data science applications.
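The operations described above can be sketched in a short transform step. The orders and customers tables, their column names, and their values are hypothetical illustration data.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer_id": [10, 11, 11, 10, None],   # order 4 has no customer
    "amount": [20.0, 35.0, 35.0, 5.0, 12.0],
    "shipping": [2.0, 3.0, 3.0, 1.0, 2.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ada", "Grace"]})

# Cleaning: remove rows with a null customer, then drop exact duplicate rows.
clean = (
    orders.dropna(subset=["customer_id"])
    .drop_duplicates()
    .astype({"customer_id": "int64"})   # the null had forced a float column
)

# Scalar operation (column * number) and binary operation (column + column).
clean = clean.assign(
    amount_cents=clean["amount"] * 100,
    total=clean["amount"] + clean["shipping"],
)

# Statistics over a column.
mean_amount = clean["amount"].mean()
std_amount = clean["amount"].std()   # sample standard deviation

# Group by customer, aggregate, then join against the customers table.
totals = clean.groupby("customer_id")["amount"].sum().reset_index()
report = totals.merge(customers, on="customer_id", how="left")
```

Each step returns a new DataFrame rather than modifying the input, so the cleaning, enrichment, and aggregation stages can be chained or tested separately.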
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!