In Data Engineer’s Lunch #36: Amundsen/DSE with Airflow, we will discuss the integration of Amundsen with DSE and Airflow. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Amundsen/DSE with Airflow
In this blog post, we will take a closer look at Amundsen, an open-source data discovery and metadata engine, and how we can integrate it with DataStax Enterprise (DSE) Cassandra while scheduling the automated metadata crawl with Airflow. If you are new to Amundsen and the concept of data catalogs, I suggest you take a few minutes to read this blog article, which will help you set up Amundsen with Docker. You can also find the GitHub repository containing instructions to recreate this deployment here.
Amundsen
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after the Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
Airflow
Airflow is a platform to programmatically author, schedule, and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
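To make the DAG concept concrete, here is a minimal sketch of an Airflow 2.x DAG with two dependent tasks; the dag_id and task names are illustrative and not taken from this post's repository:

# Minimal sketch of an Airflow DAG: two tasks wired into a directed acyclic graph.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='hello_dag',                  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id='extract',
        python_callable=lambda: print('extracting metadata'),
    )
    load = PythonOperator(
        task_id='load',
        python_callable=lambda: print('loading metadata'),
    )
    # The >> operator declares the dependency edge: extract runs before load.
    extract >> load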
Getting started with Amundsen/DSE with Airflow
Amundsen
First, before you run Amundsen, you have to choose between backing it with Neo4j or Atlas; for this demo, I have decided to go with Neo4j. To start up the Docker Compose stack for Amundsen with Neo4j, you can just run
docker-compose -f docker-amundsen.yml up
This will spin up all of the required containers.
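Once the containers are up, you can quickly confirm the stack is reachable. Below is a minimal sketch, assuming the usual default ports (Amundsen frontend on 5000, Elasticsearch on 9200, Neo4j browser on 7474); check your docker-amundsen.yml if yours differ:

# Quick reachability check for the Amundsen stack; ports are the common
# defaults and may need adjusting to match your docker-amundsen.yml.
import requests

for name, url in [
    ('frontend', 'http://localhost:5000'),
    ('elasticsearch', 'http://localhost:9200'),
    ('neo4j', 'http://localhost:7474'),
]:
    try:
        resp = requests.get(url, timeout=5)
        print(f'{name}: HTTP {resp.status_code}')
    except requests.ConnectionError:
        print(f'{name}: not reachable at {url}')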
Airflow
Next, you need to download the files in the repository and run:
cd Airflow
docker-compose up airflow-init
docker-compose up
This will spin up the Airflow docker-compose with the added dse1 database, from which we will crawl our data.
Install requirements
Install all the required dependencies to run the DAG:
cd dags/req
pip install -r requirements.txt
pip install cassandra-driver
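To sanity-check the cassandra-driver install, here is a minimal sketch that connects to the dse1 container and lists its keyspaces; the contact point shown is the container IP we find in the next section, so substitute your own:

# Smoke test for cassandra-driver: connect to the dse1 container and list
# keyspaces. Replace the contact point with the IPv4Address you find via
# `docker network inspect` in the next section.
from cassandra.cluster import Cluster

cluster = Cluster(['172.21.0.9'], port=9042)  # default CQL port assumed
session = cluster.connect()
rows = session.execute('SELECT keyspace_name FROM system_schema.keyspaces')
for row in rows:
    print(row.keyspace_name)
cluster.shutdown()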
Configure the DAG
In the /dags/dag.py file, you need to configure the connections for Cassandra, Neo4j, and Elasticsearch (ES). To find the container IPs:
1. Run this command; you should see the Amundsen network:
docker network ls
2. With this command, you should be able to see all containers running on this network:
docker network inspect amundsen_amundsennet
3. Get the IPv4Address for these 3 containers. Example:
"Name": "airfloworiginal_dse1_1", "EndpointID": "3e3e13d95457c500dcf10660f0e9796b08dff4190f5893b3d1443dbff771a3f8", "MacAddress": "02:42:ac:15:00:09", "IPv4Address": "172.21.0.9/16", "IPv6Address": "" "Name": "es_amundsen", "EndpointID": "dfa0fc9580d97309516add337fc4b5aa1df8e8439b7e075c28c0d3d6a990a8c4", "MacAddress": "02:42:ac:15:00:02", "IPv4Address": "172.21.0.2/16", "IPv6Address": "" "Name": "neo4j_amundsen", "EndpointID": "c044909c033c8f82172be6c265a70e0e077825fb1b01c960a9fd5d0373f9508f", "MacAddress": "02:42:ac:15:00:03", "IPv4Address": "172.21.0.3/16", "IPv6Address": ""
Edit the DAG file
Change these 3 lines in the file (a consolidated sketch follows the list):
- On line 95:
'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['172.21.0.9'],
- On line 56:
NEO4J_ENDPOINT = f'bolt://172.21.0.3:{neo_port}'
- On line 51:
{'host': '172.21.0.2', 'port': es_port},
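Taken together, the edits look roughly like the sketch below. The variable names follow Amundsen's databuilder sample loader, which the repository's dag.py appears to be based on; the line numbers may drift between versions, and the IPs are the ones discovered via docker network inspect above:

# Consolidated sketch of the three dag.py edits; substitute your own IPs.
from pyhocon import ConfigFactory
from elasticsearch import Elasticsearch
from databuilder.extractor.cassandra_extractor import CassandraExtractor

es_port = 9200
neo_port = 7687

# Line ~51: Elasticsearch client pointed at the es_amundsen container
es = Elasticsearch([{'host': '172.21.0.2', 'port': es_port}])

# Line ~56: Neo4j bolt endpoint for the neo4j_amundsen container
NEO4J_ENDPOINT = f'bolt://172.21.0.3:{neo_port}'

# Line ~95: Cassandra contact points for the dse1 container
job_config = ConfigFactory.from_map({
    'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['172.21.0.9'],
})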
Now that you have everything set up, you can go into the Airflow UI to run and monitor the DAG.
And the Amundsen UI now allows you to browse all the clusters, keyspaces, and tables you have in DSE.
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!