Open Source Data Catalog Overview: Amundsen

In this blog post, the second in a series about Open Source Data Catalogs, we look at Amundsen, an Open Source Data Discovery and Metadata Engine. We cover the main idea behind Amundsen, the technologies it is built on, and the available methods of installation and development. We then walk through installing Amundsen with the docker method, including a few obstacles we ran into along the way. We also discuss the main microservices that make up Amundsen, the configuration options for them, and how to add authentication to Amundsen. Finally, we close with our conclusions on Amundsen from the perspective of a short dive into it.

Overview of Amundsen

Amundsen is an Open Source Data Discovery and Metadata Engine started by Lyft and named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole. (Website: https://www.amundsen.io/, GitHub Repo: https://github.com/amundsen-io/amundsen)

Screenshot of Amundsen's front page.

Amundsen is primarily written in Python. It uses Elasticsearch and/or Apache Atlas for search, and either Neo4j or Apache Atlas for storing and managing data/metadata. Amundsen supports integrations with three types of entities: Tables (from databases), People (from HR systems), and Dashboards. Databases supported by Amundsen include Apache Cassandra, PostgreSQL, and databases reachable through the dbapi and sql_alchemy interfaces. Amundsen has Dashboard connectors for Apache Superset, Redash, Tableau, and Mode Analytics.

Amundsen can be installed and run using docker-compose, or installed and run from source code. For development there are two options. The first is to rebuild the docker images locally each time changes are made and run all of Amundsen’s microservices together with docker-compose. The second is to modify the source code and run each microservice independently, restarting only the microservice that was changed.

In the following section, we will be going over the Quick-Start process of Amundsen and addressing some roadblocks that a potential new user of Amundsen may encounter when running the base project.

Process of Docker-Compose Installation / Look at Amundsen

In our setup, we install and run the latest version of Amundsen (as of May 2021) on a Linux Mint Ulyssa (20.1) VM using the docker-compose method detailed in the Quick-Start guide on Amundsen’s website. The Quick-Start guide followed in this section can be found here, and the same instructions can also be found in the GitHub repository here. The default quick way to run Amundsen locally uses docker-compose and a docker-compose.yml file. The following command clones the Amundsen repository:

git clone --recursive git@github.com:amundsen-io/amundsen.git

The Amundsen repository is cloned with the --recursive option so that submodules are cloned as well. This is important because the single Amundsen repository contains all of Amundsen (including all microservices), and these are included as submodules of the main repository. Next, we enter the Amundsen project directory and run docker-compose with the docker-amundsen.yml file, which uses Neo4j as part of the backend (not Apache Atlas):

docker-compose -f docker-amundsen.yml up

When this docker-compose command runs, five separate containers are started:

Screenshot of terminal output when docker containers are running.

These correspond to the five primary components of Amundsen: Amundsen’s three microservices (search, metadata, and frontend application), Neo4j, and Elasticsearch. Initially, we got errors along the lines of “max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]” and the Amundsen Elasticsearch container (default name es_amundsen) would exit. To fix this issue, we followed the instructions found here and ran the docker-compose command again. To open the Amundsen frontend, we visit localhost at port 5000 in the web browser, where we are greeted with a relatively blank Amundsen page:

http://localhost:5000/
Screenshot of what an Amundsen front-end looks like without data.
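
As a side note on the vm.max_map_count error described above, the setting can be checked before starting the containers. The following is a minimal sketch, assuming a Linux host where the kernel exposes the value at /proc/sys/vm/max_map_count; the function names are our own and are not part of Amundsen. The threshold of 262144 is the one quoted in the error message. On Linux the value is commonly raised with sudo sysctl -w vm.max_map_count=262144, which we assume is in line with the linked instructions.

```python
from pathlib import Path
from typing import Optional

# Minimum value quoted in the Elasticsearch error message above.
REQUIRED_MAX_MAP_COUNT = 262144

# Standard Linux location of the kernel setting; absent on other systems.
PROC_PATH = Path("/proc/sys/vm/max_map_count")


def read_max_map_count(path: Path = PROC_PATH) -> Optional[int]:
    """Return the current vm.max_map_count, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def is_high_enough(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """True when the setting meets the minimum Elasticsearch asks for."""
    return current >= required


current = read_max_map_count()
if current is None:
    print("Could not read vm.max_map_count (not a Linux host?)")
elif is_high_enough(current):
    print(f"vm.max_map_count = {current}: OK")
else:
    print(f"vm.max_map_count = {current}: raise to at least {REQUIRED_MAX_MAP_COUNT}")
```

Running this before docker-compose up gives an early warning instead of a crashed es_amundsen container.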

Next, we ingest some data into Amundsen using an example script from Amundsen’s Databuilder data ingestion library found here.
We first go into the databuilder directory inside the root Amundsen directory and then follow the instructions in the quick-start guide to set up a python virtual environment and run the example script sample_data_loader.py, located in the /example/scripts/ directory within databuilder. It is important to note that this data ingestion should be done in its own terminal window and that the python3 virtual env must be activated before running the script. Once this is finished, we can open the UI again and search “test” to confirm that some data has been loaded. We can then click on the first result that comes up (test_schema.test_table1 in our case) to see some information about this table:

Screenshot of the Amundsen directory.
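
Since the ingestion step above depends on the python3 virtual environment being active, a small guard can catch the case where it is not. This is a sketch using only the Python standard library; the function name in_virtualenv is our own, and the check is not something the sample script itself performs as far as we know.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if in_virtualenv():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("Warning: no virtual environment is active; activate it before ingesting data")
```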

Additionally, Neo4j’s interface can be visited by going to port 7474 of localhost. At this point, Amundsen is running locally using the latest Amundsen docker images and docker-compose with the file docker-amundsen.yml. While Amundsen is run with this docker-compose file, all data in Elasticsearch and Neo4j (i.e. the data imported with the example data importing script) is stored in docker named volumes (es_data and neo4j_data, respectively). To reset the state of Amundsen, either delete these volumes or edit the docker-amundsen.yml file to use newly named volumes. At this point Amundsen looks very simple, and the prebuilt images leave nothing for us to configure. To build and run Amundsen images with custom changes to the code, it is necessary to look into the Amundsen Development Guide.
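
The volume reset described above can be scripted. The sketch below assumes the docker CLI is on the PATH and that docker-compose prefixes the volume names with a project name (for example amundsen_es_data), which is why it matches by suffix; the helper names are our own. The containers should be stopped first, since docker refuses to remove volumes that are in use.

```python
import subprocess
from typing import Iterable, List, Sequence

# Volume names from the text; matched by suffix because docker-compose
# usually prepends a project name. Note that a suffix match can also hit
# volumes from unrelated projects, so review the list before deleting.
VOLUME_SUFFIXES = ("es_data", "neo4j_data")


def matching_volumes(names: Iterable[str],
                     suffixes: Sequence[str] = VOLUME_SUFFIXES) -> List[str]:
    """Keep only the volume names that end with one of the suffixes."""
    return [name for name in names if name.endswith(tuple(suffixes))]


def list_docker_volumes() -> List[str]:
    """Return the names of all docker volumes on this host."""
    result = subprocess.run(["docker", "volume", "ls", "-q"],
                            check=True, capture_output=True, text=True)
    return result.stdout.split()


def remove_volumes(names: Iterable[str], dry_run: bool = True) -> None:
    """Remove the given volumes; with dry_run=True only print what would happen."""
    for name in names:
        if dry_run:
            print(f"would remove volume {name}")
        else:
            subprocess.run(["docker", "volume", "rm", name], check=True)
```

A typical call would be remove_volumes(matching_volumes(list_docker_volumes())), switching to dry_run=False only after checking the printed names.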

Amundsen Development Guide

In order to make changes to the code or configuration of Amundsen and see the results of those changes, it is necessary to build and run Amundsen in a way different from the method listed in the Quick-Start guide. We looked at Amundsen’s official Developer Guide page on their website: https://www.amundsen.io/amundsen/developer_guide/.

First, we make sure that we run the proper docker commands to stop the docker containers we ran from the quick-start guide and start up the proper containers for local development:

docker-compose -f docker-amundsen.yml down
docker-compose -f docker-amundsen-local.yml up -d

We note that we are now giving docker-compose a different file. This file uses local code to build custom docker images and runs containers from those images. One issue we ran into here is that the Elasticsearch container (default name es_amundsen) kept exiting (both stopped and running containers can be listed with the command docker ps -a). This happens because Elasticsearch tries to access a local directory named .local/ inside the Amundsen directory and is not allowed access to it. The issue can be seen by running the docker logs command on the es_amundsen container, and shows up as something along the lines of this:

org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Failed to create node environment

To fix this issue, we run the following commands from within the Amundsen directory, which change the ownership of the Elasticsearch data directory and then restart docker-compose:

sudo chown -R 1000:1000 .local/elasticsearch
docker-compose -f docker-amundsen-local.yml down
docker-compose -f docker-amundsen-local.yml up -d
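
To confirm that the ownership change took effect, the directory can be inspected from Python. This is a minimal sketch; the 1000:1000 owner is taken from the chown command above, and the function name is our own.

```python
import os
from pathlib import Path
from typing import List


def paths_with_wrong_owner(root: str, uid: int = 1000, gid: int = 1000) -> List[Path]:
    """Return root and every path below it that is not owned by uid:gid."""
    root_path = Path(root)
    wrong = []
    for path in [root_path, *root_path.rglob("*")]:
        info = path.lstat()  # lstat inspects symlinks themselves, not their targets
        if (info.st_uid, info.st_gid) != (uid, gid):
            wrong.append(path)
    return wrong


data_dir = ".local/elasticsearch"  # path from the chown command above
if os.path.isdir(data_dir):
    for path in paths_with_wrong_owner(data_dir):
        print(f"not owned by 1000:1000: {path}")
else:
    print(f"{data_dir} not found; run this from the Amundsen directory")
```

An empty result means every file under the directory already has the expected owner.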

If we open localhost:5000 now, we see a default instance of Amundsen running with no data in it. Even if the named docker volumes that held our Elasticsearch and Neo4j data from the Quick-Start section were not deleted, Elasticsearch and Neo4j now read their data from the directory /amundsen/.local/ instead. To reset the databases while developing Amundsen in this way, we can use the following commands:

#  reset elasticsearch
rm -rf .local/elasticsearch

#  reset neo4j
rm -rf .local/neo4j

We decided to try changing some options in Amundsen’s frontend configuration file, which is located at the following path:

/amundsen/frontend/amundsen_application/static/js/config/config-custom.ts

The directory containing this file holds a few different configuration files. Primarily, we care about config-default.ts and config-custom.ts. Opening config-default.ts shows a large number of options and their default values for Amundsen’s frontend service. config-default.ts is loaded first, and then any values set in config-custom.ts replace the corresponding ones from config-default.ts. So, to make changes to the config, we add whatever changes we want to config-custom.ts. For this brief exploration, we add the following two lines to config-custom.ts:

logoTitle: 'ANANT',
documentTitle: 'Anant - Data Portal for Cassandra',

These values should now override the default logoTitle and documentTitle listed in config-default.ts, and we should be able to see those changes once new docker images are built. We run the following command to rebuild the custom docker images and run containers using these images (which include our changes):

docker-compose -f docker-amundsen-local.yml build \
  && docker-compose -f docker-amundsen-local.yml up -d

This command may take a while to run. To have the terminal window show logs and error/warning messages as they appear for any of the five containers started by this docker-compose file, remove the -d flag from the end of the command.

Now if we open localhost:5000, the document title of the page (shown on the browser tab for Amundsen) and the logo title (displayed in the top left corner) have been changed. This indicates that we have successfully made changes to a portion of Amundsen (the frontend JavaScript configuration in this case), built our custom Docker images properly, and run containers from those images successfully:

Screenshot of the Amundsen front-end

We also note that the developer guide talks about how each microservice that makes up Amundsen can be built straight from source and run that way (not built into a Docker image). Instructions and information about how to use that approach are on the Amundsen Developer Guide page (and pages linked therein).

Overview and Configuration of Amundsen Components

An Amundsen setup is composed of five components: the three Amundsen microservices (Amundsen Frontend, Amundsen Search, and Amundsen Metadata) plus Elasticsearch and Neo4j. An overview of Amundsen’s Architecture (assuming Neo4j is used) can be found on their website here. Below, we briefly describe what each of the microservices Amundsen Frontend, Amundsen Search, and Amundsen Metadata provides, along with the configuration options for each of them.

Amundsen Frontend

Amundsen’s frontend is a Python Flask application with a React frontend. Amundsen Frontend relies on Amundsen Search to let users search for data resources and on Amundsen Metadata for viewing and editing the metadata of a given resource. Configuration options for the frontend application are set in the config-custom.ts file, and the defaults can be found in config-default.ts (both of which are linked to here). As mentioned previously, config-default.ts is loaded first and config-custom.ts second, so any fields included in config-custom.ts override the respective fields in config-default.ts.
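
The override order described above can be illustrated with a short sketch. It uses Python dictionaries as stand-ins for the objects in the two TypeScript files, so it shows the idea rather than Amundsen’s actual loading code. The default values and the exampleOtherOption key are placeholders we made up; only the two custom values come from this post. The merge shown is a shallow, top-level one, and we have not checked how Amundsen treats nested options.

```python
# Stand-in for config-default.ts. These values are placeholders,
# not Amundsen's real defaults.
config_default = {
    "logoTitle": "DEFAULT LOGO TITLE",
    "documentTitle": "Default document title",
    "exampleOtherOption": True,  # hypothetical option left untouched by the custom file
}

# Stand-in for config-custom.ts, using the two values set earlier in this post.
config_custom = {
    "logoTitle": "ANANT",
    "documentTitle": "Anant - Data Portal for Cassandra",
}

# Defaults first, custom values second: later keys win.
app_config = {**config_default, **config_custom}

for key, value in app_config.items():
    print(f"{key}: {value}")
```

Options that are absent from the custom file keep their default value, which is why only the changed fields need to be listed there.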

The page linked previously also contains information about different features that can be enabled in Amundsen’s frontend by modifying the config-custom.ts file appropriately. These include an Announcements feature, Analytics plugins, a Mail client, and more. Some settings (e.g. where the metadata and search services are located, if not at localhost) are configured through a configuration file for the Flask application named config.py, which can be found here. After any changes are made, the frontend service needs to be rebuilt before the changes appear. Information about how to set up a standalone version of each of the three Amundsen microservices from source can be found on this page.

Amundsen Search

Amundsen Search is a microservice that serves a RESTful API and is responsible for searching metadata. It uses Elasticsearch for most of its searching capabilities. Information about the search service, along with how to start it from either source or Docker, can be found here. One important thing to note is that Amundsen Search’s API is documented using Swagger with OpenAPI. When Amundsen is running, going to localhost:5001/apidocs/ should show the generated API documentation.

Amundsen Metadata

Amundsen Metadata is a microservice that serves a RESTful API and is responsible for providing and updating metadata (such as table and column descriptions, and more). The metadata service can use either Neo4j or Apache Atlas as its persistence layer. More information about Amundsen Metadata, along with how to start it from either Docker or source, can be found here. As with Amundsen Search, documentation for the Amundsen Metadata API is generated automatically using the same technologies. The generated docs can be found at localhost:5002/apidocs/.
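
With several services on different ports, a quick reachability check can save some clicking. The sketch below uses only the Python standard library and the localhost URLs mentioned in this post; the helper names are our own, and a successful response only shows that something answers at that address, not that the service is healthy.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# URLs taken from this post (default local docker-compose setup).
ENDPOINTS = {
    "frontend": "http://localhost:5000/",
    "search API docs": "http://localhost:5001/apidocs/",
    "metadata API docs": "http://localhost:5002/apidocs/",
    "neo4j browser": "http://localhost:7474/",
}


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """True if an HTTP GET to the URL succeeds with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and bad URLs.
        return False


def check_all(endpoints: Dict[str, str] = ENDPOINTS,
              probe: Callable[[str], bool] = is_reachable) -> Dict[str, bool]:
    """Probe every endpoint and return a name -> reachable mapping."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling check_all() while the containers are up returns a dictionary that can be printed or logged; the probe argument exists so the function can be exercised without a running Amundsen.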

Amundsen Databuilder

Amundsen also has a data ingestion library called Databuilder (GitHub Repo here), which can be used to extract metadata from data sources into a format Amundsen can use and to load that metadata into Amundsen. Databuilder ships with many prewritten extractors for commonly used databases and HR systems (for users), located under the /databuilder/databuilder/extractor/ directory in the Databuilder repository. The Databuilder repository also contains many example scripts. The Databuilder library can be used with an ad-hoc python script or with an Airflow DAG.

Authentication for Deployment

One important part of properly deploying Amundsen (for actual use, not just development or testing) is authentication, which prevents anybody who can reach the deployed version of Amundsen from using the API without authorization. However, Amundsen does not ship with any form of authentication enabled by default, and adding one is not a one-step process. One way to add authentication to Amundsen is to follow the short guide on Amundsen’s website that can be found here.

The guide mentioned above discusses end-to-end authentication using OIDC with a Flask (micro web framework) wrapper. Most importantly, flaskoidc needs to be installed and configured individually for three separate Amundsen microservices: amundsenfrontend, amundsenmetadata, and amundsensearch. Instructions on installing and configuring flaskoidc can be found at flaskoidc’s GitHub Repo here. Additionally, environment variables need to be set for the three microservices, and a few other code changes are needed to add OIDC authentication, as detailed in Amundsen’s instructions.

Conclusions

Amundsen is a powerful Open Source Data Discovery and Metadata Engine written in Python and leveraging Elasticsearch for search capabilities. Setting up individual features appears to be a long process, but many options exist and they are documented. Additionally, Amundsen’s Databuilder data ingestion library appears to be powerful and already works with many popular databases. Amundsen is therefore likely a good option for a team willing to put in the time to learn it.
