Data Engineer’s Lunch #65: JanusGraph on Jupyter: Using Notebooks with Graph

Graph data readily lends itself to visualization, and notebooks make it easy to do just that. In this post, we will try out Graph Notebook, a tool released by the AWS team, with JanusGraph.

Introduction

One of the key selling points of graph technology is that the data model matches how we think. Many refer to this as being “whiteboard-friendly”, in that what you draw during your initial whiteboarding looks more or less like what you end up putting into your database.

That being the case, graph databases lend themselves to visualization, so it is no surprise that there are various visualization tools for graph databases such as JanusGraph and Neo4j.

The advantages of visualizing a graph are compounded when combined with the advantages of the notebook format. The notebook format facilitates data experimentation by making it easy to run commands, see (and save!) the output, and share results. It is like a REPL with a GUI that you can put into source control or publish online. This is one of the reasons why DSE Graph works so well with DSE Studio: DSE Studio provides this notebook interface to its TinkerPop-enabled graph implementation. However, DSE Studio is specific to DSE Graph; it does not work with other TinkerPop graphs, such as JanusGraph.

This is where Graph Notebook, put out by the team at AWS, comes in. Although it was understandably built with AWS Neptune in mind to a certain degree, it works with any graph that implements Apache TinkerPop, openCypher, or RDF SPARQL. Not only does it provide a notebook interface for your graph database, but it also comes prebuilt to visualize the connectivity of your graph, not just to render a bar chart or histogram.

In this post, we will explore what this looks like specifically using JanusGraph, a popular open-source TinkerPop graph database. We will start with some instructions for getting set up, then explore its visualization capabilities, and then conclude with some evaluation and first impressions.

Set Up the Environment

First, we’ll need to set up our sandbox. For this, we put together a Docker Compose setup to make it easy to get started. You can check it out here: https://github.com/Anant/janusgraph-notebook.

After you clone the repo, you just need to run this one command to get started:

docker-compose up -d

Once everything builds and connects properly, you can start playing around in your Jupyter notebooks here: http://localhost:8888

For the authentication page, use password `tensorflow` and then you’re all set.
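If the page does not load, it can help to check whether the containers are listening at all. The sketch below is a small helper of my own (the name port_open is hypothetical) that tries a plain TCP connection. Port 8888 is the Jupyter port from this post; 8182 is the default Gremlin Server port, and whether it is reachable from the host depends on whether the compose file publishes it, which is an assumption here.

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False

# 8888: Jupyter (from this post). 8182: default Gremlin Server port, only
# reachable from the host if the compose file publishes it (assumption).
for name, port in [("Jupyter", 8888), ("Gremlin Server", 8182)]:
    status = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```

An open port only shows that something is listening; the service behind it may still be starting up, so give it a little time after the containers come up.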

Set Up the Graph and the Notebook

Out of the box, the Graph Notebook library provides dozens of notebooks to try out.

Unfortunately, most of the notebooks that load seed data into your graph don’t work with JanusGraph, since they use custom string IDs, which JanusGraph did not support at the time of writing.

However, we can load in the air routes data manually using the commands below, which will allow us to use the Air-Routes-Gremlin notebook.

Disclaimer: This seed data and notebook are not my own. Some of the commands are modified slightly from what was originally in the notebook.

Load seed data

# download the seed data from Kelvin Lawrence's repository and copy into janusgraph docker container
./scripts/load-in-airroutes.sh
docker cp ./tmp jg-notebook-janusgraph:/opt/tmp/

# start console
docker exec -it jg-notebook-janusgraph ./bin/gremlin.sh

Then, in the console:

:remote connect tinkerpop.server conf/remote.yaml
g = traversal().withRemote('conf/remote-graph.properties')
path = "/opt/tmp/air-routes.xml";
g.io(path).read().iterate();

Quickly testing to get some counts of our new data confirms it worked:

screenshot of seed data test

Set Up the Air Routes Notebook

Now we’re ready to set up our notebook. This link will navigate right to it: http://localhost:8888/notebooks/02-Visualization/Air-Routes-Gremlin.ipynb

To set up a notebook, run the following configuration command:

screenshot of configuration command
%%graph_notebook_config
{
  "host": "janusgraph",
  "port": 8182,
  "ssl": false,
  "gremlin": {
    "traversal_source": "g"
  }
}

Visualize Our Graph

Now you should be set up to run through some commands. Go ahead and skip the other setup commands, like the seed data ones and other config commands, since we did that already.

There are some things worth noting here before you get too far, however. In order for Graph Notebook to visualize your Gremlin traversal, the traversal needs to fulfill some requirements.

Requirement: Needs to Return a Path

It’s important to note that not all Gremlin traversals will return a nice visualization. Our current notebook gives us some guidelines here:

“The results of any Gremlin query that returns a path can be explored visually” (emphasis added)

What this means is that if I just grab five adjacent vertices, our notebook doesn’t automatically know what to do with them, so it will just display a table of results.

screenshot of five adjacent vertices
Five adjacent vertices – note the absence of the “Graph” tab.

No visualization is possible, even if the vertices are connected to each other, since the traversal is not returning a path.
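One way to picture why a path is needed: a path spells out which element follows which, while a bare list of vertices says nothing about how they connect. The sketch below is a conceptual illustration in plain Python, not Graph Notebook's actual implementation. The function name connections and the sample airport codes are made up for the example.

```python
def connections(results):
    """Collect (source, target) pairs from a list of results.

    Each result is either a path (a list of steps) or a single bare value.
    A bare value is treated as a one-step path, which yields no pairs.
    """
    pairs = []
    for result in results:
        steps = result if isinstance(result, list) else [result]
        # Pair each step with the one that follows it in the same path.
        pairs.extend(zip(steps, steps[1:]))
    return pairs

# Bare vertices, as you would get without path(): nothing to connect.
vertex_results = ["SEA", "NRT", "PNH"]
# Path results (illustrative values): each path says what leads to what.
path_results = [["SEA", "NRT"], ["NRT", "PNH"]]

print(connections(vertex_results))
print(connections(path_results))
```

The first call has no pairs to draw, which mirrors the table-only output above; the second has one pair per path.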

Requirement for a good visualization: Needs to give some hints

This one isn’t a requirement to return a graph visualization but is a requirement to return a good graph visualization. As it is currently implemented, the graph notebook does not know what is an edge or what is a vertex just by looking at the results.

For example, take the following traversal, which goes out from an airport along an edge and into another airport, but returns only the “code” property of each airport vertex and the distance (“dist”) property of the edge:

g.V().hasLabel('airport').outE().inV().path().by('code').by('dist').limit(5)

This will return results that look like this:

Screenshot of gremlin query without visualization hint

Now, we know that this is a list of paths, each with three items, which represent an airport, a route, and the destination airport. However, the notebook just sees a string, an integer, and another string – it doesn’t know if the strings or integers represent edges or vertices. Accordingly, when we look at the visualization, it shows the connections, but edges (the circles with integers) look the same as vertices (the circles with strings):

Screenshot of visualization without visualization hint
First three results, without hint

Now suppose we add this hint in the notebook:

--path-pattern v,oute,inv

This tells the notebook that the first item in the path is a vertex (v), followed by an out edge (oute), followed by an in vertex (inv). We send in these hints using the --path-pattern argument, or -p for short. Now we get a much better result:

Screenshot of gremlin query with visualization hint
First three results, with hint
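To make the idea behind the hint concrete, here is a minimal sketch of how a pattern such as v,oute,inv could be used to sort path elements into vertices and edges. This is my own illustration, not the code Graph Notebook runs. The function classify_path is hypothetical, it only handles the three roles used in this post, and the sample distance is made up.

```python
VERTEX_ROLES = {"v", "inv"}  # roles from this post that mark a vertex
EDGE_ROLES = {"oute"}        # role from this post that marks an edge

def classify_path(path, pattern):
    """Split one path into vertices and (source, edge, target) triples."""
    roles = [role.strip().lower() for role in pattern.split(",")]
    if len(roles) != len(path):
        raise ValueError("pattern and path must have the same length")
    nodes, edges = [], []
    for i, (value, role) in enumerate(zip(path, roles)):
        if role in VERTEX_ROLES:
            nodes.append(value)
        elif role in EDGE_ROLES:
            if i == 0 or i == len(path) - 1:
                raise ValueError("an edge needs a vertex on both sides")
            # An out edge runs from the previous element to the next one.
            edges.append((path[i - 1], value, path[i + 1]))
        else:
            raise ValueError(f"unsupported role: {role}")
    return nodes, edges

# One path shaped like the results above; the distance value is made up.
nodes, edges = classify_path(["SEA", 1000, "ANC"], "v,oute,inv")
print(nodes)
print(edges)
```

Without the pattern, all three values would have to be treated alike, which is exactly what the first visualization shows.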

There’s a ton of other customization that you can do on the graph as well, but I won’t get into it here; that’s enough instructions for now. Instead, let’s try to get some cool visualizations from our graph.

Click on a Node or Edge to View Details

Screenshot of details pane in visualization of gremlin query

First 50 Routes from Seattle to Phnom Penh, grouped by country

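Define the label mapping in its own cell first, so that airport nodes are labeled with their city property:

my_node_labels = '{"airport":"city"}'

Then run the query in the next cell, since the %%gremlin cell magic has to be the first line of its cell. Here -g groups the nodes by country, -d sets the display property using the mapping above, and -l caps the label length at 20 characters:

%%gremlin -p v,oute,inv,oute,inv,oute,inv -g country -d $my_node_labels -l 20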
g.V().has('airport','code','SEA').
  repeat(outE().inV().simplePath()).
  until(has('code','PNH')).
  limit(50).
  path().
    by(valueMap(true)).
    by('dist')
Screenshot of visualization of gremlin query that finds first 50 routes from Seattle to Phnom Penh
I had to drag the nodes around to get it close to the actual geographical location, but the grouping still makes for a helpful visual
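For a rough picture of what the grouping and labeling options in the query above amount to, the sketch below does the same two things to a handful of hand-written records: bucket nodes by one property, and pick a display property per node label. It is a simplified illustration under my own assumptions, not the notebook's code. The helper names are hypothetical, the records are flattened (real valueMap(true) output wraps property values in lists), and the truncation is a plain cut that may differ from how the tool abbreviates labels.

```python
import json
from collections import defaultdict

def group_nodes(nodes, group_by):
    """Bucket node codes by the value of one property, e.g. 'country'."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.get(group_by, "unknown")].append(node["code"])
    return dict(groups)

def node_label(node, display_map, max_len):
    """Pick the display property for this node's label and cut it to max_len."""
    prop = display_map.get(node["label"], "code")
    return str(node.get(prop, ""))[:max_len]

# Simplified, hand-written records for illustration only.
nodes = [
    {"label": "airport", "code": "SEA", "city": "Seattle", "country": "US"},
    {"label": "airport", "code": "LAX", "city": "Los Angeles", "country": "US"},
    {"label": "airport", "code": "PNH", "city": "Phnom Penh", "country": "KH"},
]
display_map = json.loads('{"airport":"city"}')  # same mapping as my_node_labels

print(group_nodes(nodes, "country"))
print([node_label(n, display_map, 20) for n in nodes])
```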

Custom Icons

Screenshot of custom icons
Using custom icons, set by country

Evaluation

Overall, this is a great tool that adds a lot of value to the graph ecosystem. For one thing, a major challenge for Gremlin that many have identified is the learning curve, and having notebooks with sample commands, visualizations, and clear instructions helps to make it easier to get started. Notebooks are helpful for experienced users as well since they make it easy to experiment, create visualizations, and share with others.

There are some features that would make this even better, however. For example, it would be nice to be able to customize the length of the edges, as well as the strength of the push/pull force between nodes. It is possible to set these, but it must be done ahead of time using %%graph_notebook_vis_options. This is great, but being able to set them live within the visualization would be even better.
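As a sketch of what setting these ahead of time might look like, the snippet below builds a %%graph_notebook_vis_options cell body in Python. It assumes the cell accepts vis.js network options, which is my understanding of how the tool works. The option names (springLength, springConstant, gravitationalConstant under physics.barnesHut) come from the vis.js physics settings, the helper name is hypothetical, and the values are arbitrary examples rather than recommendations.

```python
import json

def build_vis_options_cell(spring_length, spring_constant, gravitational_constant):
    """Return the text of a %%graph_notebook_vis_options cell (hypothetical helper)."""
    options = {
        "physics": {
            "solver": "barnesHut",
            "barnesHut": {
                # Resting length of the edges.
                "springLength": spring_length,
                # How strongly edges pull connected nodes together.
                "springConstant": spring_constant,
                # Negative values push nodes apart; larger magnitude, stronger push.
                "gravitationalConstant": gravitational_constant,
            },
        }
    }
    return "%%graph_notebook_vis_options\n" + json.dumps(options, indent=2)

vis_cell = build_vis_options_cell(200, 0.02, -5000)
print(vis_cell)
```

Paste the printed text into a notebook cell and run it before the query cell whose visualization you want to affect.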

An initial comparison, based on my limited experience with DSE Studio: DSE Studio has some additional nice features and looks better out of the box (it’s a personal preference, but the light blue nodes on the dark background just work better for graphs). On the other hand, the Graph Notebook project is open source, and its visualization seems to be more customizable. It also has clearer instructions on how to create the visual you want, in contrast to DSE Studio, where it takes some trial and error to figure out which traversals return a nice-looking graph (at least for me; I had trouble finding instructions for DSE Studio). Also, in Graph Notebook you can change the settings programmatically using %%graph_notebook_vis_options, which gives you a lot of control, whereas in DSE Studio, unless I’m mistaken, you have some powerful configuration settings in the visualization itself but aren’t able to configure them programmatically. Both are powerful in their own way.

All that said, it’s a great tool, and I’m looking forward to seeing how it continues to improve and develop. The team is continually developing this tool, and the Gremlin community overall is very active and welcoming. Even while preparing this post, I received great help and support from the community in getting my environment set up. And undoubtedly, as more and more great tools for graph, like Graph Notebook, come out, the community is only going to continue to grow.

I’d love to hear if you had any thoughts as well – let me know!
