Amundsen/DSE with Airflow cover slide

Data Engineer’s Lunch #36: Amundsen/DSE with Airflow

In Data Engineer’s Lunch #36 Amundsen/DSE with Airflow, we will discuss the integration of Amundsen with DSE and Airflow. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

Amundsen/DSE with Airflow

In this blog post, we take a closer look at Amundsen, an open source data discovery and metadata engine, and at how to integrate it with DataStax Enterprise (DSE) Cassandra while scheduling the automation on Airflow. If you are new to Amundsen and the concept of data catalogs, I suggest you take a few minutes to read this blog article, which will help you set up Amundsen with Docker. Also, here you can find the GitHub repository containing instructions to recreate this deployment.

Amundsen

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.

Airflow

Airflow is a platform to programmatically author, schedule, and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
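
To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library graphlib module (Python 3.9+), not Airflow's own API. The task names are hypothetical and only loosely modeled on the steps a metadata pipeline like this one might have. The sketch shows how declared dependencies alone determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks mapped to the set of tasks each one depends on.
# These names are illustrative only; they are not taken from the demo DAG.
dependencies = {
    "extract_cassandra_metadata": set(),
    "load_into_neo4j": {"extract_cassandra_metadata"},
    "publish_to_elasticsearch": {"load_into_neo4j"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In a real Airflow DAG the same idea applies at a larger scale: by default, a task becomes eligible to run only after its upstream tasks have succeeded.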

Getting started with Amundsen/DSE with Airflow

Amundsen

Before you run Amundsen, you have to choose whether to run it with Neo4j or Atlas. For this demo, I have decided to go with Neo4j. To start Amundsen with Neo4j using Docker Compose, run:

docker-compose -f docker-amundsen.yml up

This will spin up all the containers Amundsen requires.

screenshot of Docker UI

Airflow

Next, download the files in the repository and run:

cd Airflow
docker-compose up airflow-init
docker-compose up

This will spin up the Airflow Docker Compose stack together with the added dse1 database, which is where we will be crawling our data from.

screenshot of Docker UI

Install requirements

Install all the dependencies required to run the DAG:

  cd dags/req
  pip install -r requirements.txt
  pip install cassandra-driver
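
After installing, a quick sanity check can confirm that the packages are importable in the environment that runs the DAG. The sketch below is an illustration, not part of the repository. It assumes that cassandra-driver is imported as `cassandra` and that the requirements file provides the Amundsen databuilder package under the import name `databuilder`; adjust the list to match your requirements.txt.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names: cassandra-driver installs `cassandra`,
# and the Amundsen databuilder package installs `databuilder`.
required = ["cassandra", "databuilder"]
missing = missing_modules(required)
print("Missing modules:", missing or "none")
```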

Configure the DAG

In the /dags/dag.py file, you need to configure the connections for Cassandra, Neo4j, and Elasticsearch (ES). To find the addresses to use, follow these steps:

  1. List the Docker networks. You should see the amundsen_amundsennet network:

  docker network ls

  2. Inspect that network to see all the containers running on it:

  docker network inspect amundsen_amundsennet

  3. Get the IPv4Address for these 3 containers. Example:

               "Name": "airfloworiginal_dse1_1",
               "EndpointID": "3e3e13d95457c500dcf10660f0e9796b08dff4190f5893b3d1443dbff771a3f8",
               "MacAddress": "02:42:ac:15:00:09",
               "IPv4Address": "172.21.0.9/16",
               "IPv6Address": ""

               "Name": "es_amundsen",
               "EndpointID": "dfa0fc9580d97309516add337fc4b5aa1df8e8439b7e075c28c0d3d6a990a8c4",
               "MacAddress": "02:42:ac:15:00:02",
               "IPv4Address": "172.21.0.2/16",
               "IPv6Address": ""

               "Name": "neo4j_amundsen",
               "EndpointID": "c044909c033c8f82172be6c265a70e0e077825fb1b01c960a9fd5d0373f9508f",
               "MacAddress": "02:42:ac:15:00:03",
               "IPv4Address": "172.21.0.3/16",
               "IPv6Address": ""
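
Reading the addresses off the JSON by hand is error-prone, so here is a small sketch that extracts them programmatically. It relies on `docker network inspect` printing a JSON array in which each network has a "Containers" object, as in the output above. The sample input is a trimmed stand-in with placeholder container IDs, and the helper name `container_ips` is hypothetical.

```python
import json


def container_ips(inspect_output, wanted_names):
    """Map each wanted container name to its IPv4 address without the /prefix."""
    networks = json.loads(inspect_output)  # the command prints a JSON array
    ips = {}
    for network in networks:
        for container in (network.get("Containers") or {}).values():
            name = container.get("Name")
            if name in wanted_names:
                # "172.21.0.9/16" -> "172.21.0.9"
                ips[name] = container["IPv4Address"].split("/")[0]
    return ips


# Trimmed sample shaped like the output shown above; the "id-N" keys are
# placeholders for the real container IDs.
sample = json.dumps([{
    "Name": "amundsen_amundsennet",
    "Containers": {
        "id-1": {"Name": "airfloworiginal_dse1_1", "IPv4Address": "172.21.0.9/16"},
        "id-2": {"Name": "es_amundsen", "IPv4Address": "172.21.0.2/16"},
        "id-3": {"Name": "neo4j_amundsen", "IPv4Address": "172.21.0.3/16"},
    },
}])

wanted = {"airfloworiginal_dse1_1", "es_amundsen", "neo4j_amundsen"}
ips = container_ips(sample, wanted)
print(ips)
```

To run it against a live network, the sample string could be replaced with the stdout of `subprocess.run(["docker", "network", "inspect", "amundsen_amundsennet"], capture_output=True, text=True, check=True)`.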

Edit the DAG file

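Replace the IP addresses on these 3 lines of the file with the ones you found for your containers:

  1. On line 95 (the DSE container):
  'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['172.21.0.9'],
  2. On line 56 (the Neo4j container):
  NEO4J_ENDPOINT = f'bolt://172.21.0.3:{neo_port}'
  3. On line 51 (the Elasticsearch container):
  {'host': '172.21.0.2', 'port': es_port},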
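
Container IP addresses can change when the containers are recreated, so one option is to read them from environment variables instead of hard-coding them. The sketch below is an assumption-laden illustration, not the repository's code: the environment variable names and the `load_endpoints` helper are hypothetical, the default addresses are the ones from the example output in this post, and 7687 and 9200 are assumed as the usual default ports for Neo4j Bolt and Elasticsearch.

```python
import os


def load_endpoints(env=os.environ):
    """Collect the three connection settings the DAG needs from the environment."""
    # Hypothetical variable names; defaults come from the example in this post.
    neo_host = env.get("NEO4J_HOST", "172.21.0.3")
    neo_port = int(env.get("NEO4J_PORT", "7687"))  # assumed Bolt default port
    es_host = env.get("ES_HOST", "172.21.0.2")
    es_port = int(env.get("ES_PORT", "9200"))  # assumed Elasticsearch default port
    cassandra_ips = env.get("CASSANDRA_IPS", "172.21.0.9").split(",")
    return {
        "neo4j_endpoint": f"bolt://{neo_host}:{neo_port}",
        "es_hosts": [{"host": es_host, "port": es_port}],
        "cassandra_ips": [ip.strip() for ip in cassandra_ips if ip.strip()],
    }


# With an empty environment, the defaults from this post are used.
endpoints = load_endpoints({})
print(endpoints)
```

The three returned values correspond to what the DAG file sets on lines 56, 51, and 95 respectively.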

Now that you have everything set up, you can go into the Airflow UI to run and monitor the DAG.

screenshot of Airflow UI

The Amundsen UI now lets you browse all the clusters, keyspaces, and tables you have in DSE.

screenshot of Amundsen UI

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!