In Data Engineer’s Lunch #9: Open Source & Cloud Data Catalogs, we discussed data catalogs, which help users keep track of data. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Data catalogs are a form of metadata management that supports the other data management tasks an organization may want to undertake. They help users find the data they need, act as a centralized list of all available data, and provide information for judging whether data is in a form suitable for further processing. Different data catalogs offer different features and operate on different data stores. Some only work with a specific datastore like Hadoop, while others can connect to several different data storage technologies.
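As a rough illustration of the idea, a catalog can be thought of as a searchable registry of metadata records that point at data living elsewhere. The sketch below is a toy in-memory version; the `DatasetEntry` and `Catalog` names and their fields are hypothetical and do not correspond to any of the tools discussed in this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata about one dataset; the data itself stays in its own store."""
    name: str
    store: str                      # e.g. "hadoop" or "postgres" (illustrative values)
    description: str = ""
    tags: set[str] = field(default_factory=set)


class Catalog:
    """Toy centralized list of datasets spread across several stores."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        """Add or overwrite the metadata record for a dataset."""
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of datasets whose name, description, or tags mention term."""
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if term in haystack:
                hits.append(entry.name)
        return sorted(hits)
```

Registering an entry for each dataset, whatever store it lives in, and then calling `search("sales")` would return the names of every dataset tagged or described with that term.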
Open Data Catalogs
CKAN, the world’s leading open source data portal platform, is a powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data.
Magda is designed with the flexibility to work with all of an organization’s data assets, big or small – it can be used as a catalog for big data in a data lake, an easily-searchable repository for an organization’s small data files, an aggregator for multiple external data sources, or all at once.
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data.
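To make the usage-based ranking idea concrete, here is a deliberately simplified sketch: it filters table names by a search term and orders the matches by how often each table has been queried. This is not Amundsen's actual ranking algorithm or API; the `search_by_usage` function and the `query_counts` data are assumptions made up for illustration.

```python
def search_by_usage(term: str, query_counts: dict[str, int]) -> list[str]:
    """Return matching table names, most-queried first.

    query_counts maps a table name to the number of times it was queried
    (hypothetical usage data, e.g. gathered from query logs).
    """
    term = term.lower()
    matches = [name for name in query_counts if term in name.lower()]
    # Sort by usage descending, then by name so that ties are deterministic.
    return sorted(matches, key=lambda name: (-query_counts[name], name))
```

With counts such as `{"orders_raw": 3, "orders_clean": 120}`, a search for "orders" would list `orders_clean` ahead of `orders_raw`, which mirrors the "highly queried tables show up earlier" behavior described above.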
Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around these data assets for data scientists, analysts, and the data governance team.
Kylo is an open-source enterprise-ready data lake management software platform for self-service data ingest and data preparation with integrated metadata management, governance, security, and best practices inspired by Think Big’s 150+ big data implementation projects.
Metacat is a federated service providing a unified REST/Thrift interface to access metadata of various data stores. The respective metadata stores are still the source of truth for schema metadata, so Metacat does not materialize it in its storage. It only directly stores the business and user-defined metadata about the datasets. It also publishes all of the information about the datasets to Elasticsearch for full-text search and discovery.
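The federated pattern described here can be sketched as follows: schema metadata is fetched from the owning store on every request, while only business metadata is kept locally. This is a minimal sketch of the pattern, not Metacat's REST/Thrift interface; the `FederatedCatalog` class, its methods, and the callable-based connectors are all hypothetical.

```python
from __future__ import annotations

from typing import Callable

Schema = dict[str, str]  # column name -> column type


class FederatedCatalog:
    """Toy federated catalog: schemas stay in their stores, annotations stay here."""

    def __init__(self) -> None:
        # store name -> function that returns the current schema of a table
        self._schema_sources: dict[str, Callable[[str], Schema]] = {}
        # (store, table) -> business and user-defined metadata, stored locally
        self._business_metadata: dict[tuple[str, str], dict[str, str]] = {}

    def add_store(self, store: str, fetch_schema: Callable[[str], Schema]) -> None:
        """Register a connector for a backing metadata store."""
        self._schema_sources[store] = fetch_schema

    def annotate(self, store: str, table: str, **metadata: str) -> None:
        """Attach business metadata (owner, description, ...) to a table."""
        self._business_metadata.setdefault((store, table), {}).update(metadata)

    def describe(self, store: str, table: str) -> dict[str, dict[str, str]]:
        """Combine the live schema from the source store with local annotations."""
        schema = self._schema_sources[store](table)  # never cached: the store is the source of truth
        business = dict(self._business_metadata.get((store, table), {}))
        return {"schema": schema, "business": business}
```

Because `describe` always asks the backing store for the schema, a column added in the store shows up on the next call without any synchronization step, while annotations such as an owner persist in the catalog itself.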
DataHub is LinkedIn’s generalized metadata search and discovery tool. The DataHub project has published a comparison of the architectures of different metadata systems, along with a LinkedIn Engineering blog post, a Strata presentation, and a Crunch Conference talk.
Cloud Data Catalogs
AWS Glue Catalog
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Its Data Catalog component is the persistent metadata store for the data sources Glue works with. Glue provides all of the capabilities needed for data integration, so that you can start analyzing your data and putting it to use in minutes instead of months.
Google Cloud Data Catalog
Google Cloud Data Catalog is a fully managed and highly scalable data discovery and metadata management service. It integrates with BigQuery, Pub/Sub, and Cloud Storage, offers connectors for many other sources, and provides a unified view and tagging mechanism for technical and business metadata. Any user on the team can find or tag data through a UI built on the same search technology as Gmail, or through API access. Because Data Catalog is fully managed, you can start and scale without running any infrastructure yourself.
Azure Data Catalog
Azure Data Catalog is an enterprise-wide metadata catalog that makes data asset discovery straightforward. It’s a fully managed service that lets you register, enrich, discover, understand, and consume data sources. It works for anyone, from analysts to data scientists to data developers.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!