In Data Engineer’s Lunch #6: Common Data Formats Used in Data Engineering, we discuss common data storage formats used in data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Data Engineer’s Lunch in person, it is hosted every Monday at 12 PM EST. Register here now!
This post covers a range of data formats used in data engineering, from text formats to binary formats. Additional resources are available at the end of the post. For a more in-depth discussion, be sure to watch the live recording of Data Engineer’s Lunch #6 embedded below!
Text/File Data Formats
- CSV – Comma-Separated Values – text file
- open it in vi / notepad / sublimetext
- fieldnames at the top <- the metadata
- every major record is one line
- CSV is a mediocre data format for software (quotation marks and commas inside values have to be escaped), but it is a great format for exchanging data between people and organizations.
- open it in Excel, Google Spreadsheets, Open Office
- TSV – Tab-Separated Values – text file
- open it in vi / notepad / sublimetext
- fieldnames at the top <- the metadata
- every major record is one line
- humans can scan it / read it like a “report”
- TSV can also be faster to parse in some cases, because tabs rarely appear inside values, so less quoting and escaping is needed.
- XML
- EBXML
- Web Services – SOAP – Simple Object Access Protocol
- XML-RPC
- SGML -> HTML -> XML
- Self Describing
- <book>
- <authors>
- <author name="Rahul Singh"/>
- <author name="Rahul Singh"/>
- </authors>
- <title name="Enterprise Consciousness"/>
- <isbn number=""/>
- </book>
- <book>
- <author name="William Angel"/>
- <title name="Virtual Power…"/>
- </book>
- XSD
- XSLT -> Reformat the data
- Hierarchy
- Heavy
- JSON
- Web Services – REST / JSON
- Hierarchy
- Readable
- Metadata in the JSON
- Not as heavy
- Self Describing
- Avro Spec
- GeoJSON
- SQL
- Text Files (What They Are & How to Open One)
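To make the trade-offs between the text formats above more concrete, here is a minimal sketch that writes the same two records as CSV, TSV, and JSON using only the Python standard library. The records, field names, and the `to_delimited` helper are made up for illustration. It shows the CSV writer quoting a value that contains a comma, while the same value needs no quoting in TSV, and JSON carrying the field names alongside every record.

```python
import csv
import io
import json

# Hypothetical sample records; the field names are made up for illustration.
records = [
    {"title": "Data Formats, Explained", "note": 'He said "hi"'},
    {"title": "Plain Title", "note": "no special characters"},
]
fieldnames = ["title", "note"]


def to_delimited(rows, delimiter):
    """Serialize rows as delimited text with the field names on the first line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(records, ",")   # a comma inside a value forces quoting
tsv_text = to_delimited(records, "\t")  # the same comma is harmless here
json_text = json.dumps(records, indent=2)  # field names repeat in every record

# Parsing each format back should give the original records.
csv_back = list(csv.DictReader(io.StringIO(csv_text)))
tsv_back = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
json_back = json.loads(json_text)

print(csv_text)
print(tsv_text)
print(json_text)
```

Using the `csv` module rather than splitting lines on commas is the main point here: the module handles the quoting and escaping rules, which is where hand-rolled CSV parsing usually goes wrong.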
Binary Data Formats
- Serialization/Deserialization formats
- PKL
- A PKL file is a file created by pickle, a Python module that enables objects to be serialized to files on disk and deserialized back into the program at runtime. It contains a byte stream that represents the objects.
- Kryo
- Compressed
- Text Files
- tar
- gz
- zip
- 7z
- Parquet
- JSON
- BSON
- Avro Compressed
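As a rough illustration of two ideas from the list above, the sketch below pickles a Python object to a file and loads it back, then gzip-compresses some repetitive text. The object, the `record.pkl` file name, and the sample text are assumptions made for the example, and the file is written to a temporary directory. Only the standard library is used.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical object to serialize; any picklable Python object would do.
record = {"id": 6, "topics": ["CSV", "JSON", "Parquet"], "binary": True}

with tempfile.TemporaryDirectory() as workdir:
    pkl_path = os.path.join(workdir, "record.pkl")  # hypothetical file name

    # Serialize: pickle produces a byte stream, so the file is opened in binary mode.
    with open(pkl_path, "wb") as f:
        pickle.dump(record, f)

    # Deserialize: rebuild the object from the byte stream at runtime.
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == record)

# Compression: repetitive text shrinks considerably under gzip.
text = ("id,topic\n" + "6,CSV\n" * 1000).encode("utf-8")
compressed = gzip.compress(text)
roundtrip = gzip.decompress(compressed)

print(len(text), len(compressed))
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so `pickle.load` should only be used on data from a trusted source. For data exchanged between systems or organizations, one of the text formats or a schema-based binary format is usually a safer choice.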
Resources
- Comparison of data-serialization formats – Wikipedia
- Data Serialization Comparison: JSON, YAML, BSON, MessagePack – SitePoint
- Apache Arrow and Distributed Compute with Kubernetes – XenonStack
If you missed last week’s Data Engineer’s Lunch #5: What is a Data Lake? be sure to check it out! As mentioned above, the live recording of Data Engineer’s Lunch #6 is embedded below. Also, check out our YouTube page for more videos and the Data Engineer’s Lunch playlist here! Don’t forget to subscribe while you are there!