In Apache Cassandra Lunch #95: Spark Graph Operations with DSEGraphFrames Scala API, we discussed methods for processing graph database data using Apache Spark. Specifically, we discussed the DSEGraphFrames library, which allows Spark to perform operations on graph databases. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
DSE Graph is a distributed graph database built on top of Cassandra that is part of DataStax Enterprise (DSE). It keeps many of the advantages of using Cassandra/DSE, including potentially global distribution, zero downtime, and DSE security protection. It also gains many of the benefits of being a graph database, namely in the storage and analysis of complex and interrelated data sets. DSE Graph can be combined with DSE’s included Search and Analytics capabilities. It also integrates with DSE support tools like OpsCenter and DataStax Studio.
DSE Graph Operations
Most graph traversals (operations that follow the adjacency of nodes and edges within a graph) run in real time without using DSE Analytics (aka Spark) resources. Two kinds of query are the exception. Deep queries are traversals on a graph with extremely high density or a high branching factor, where nodes on average connect to a large number of other nodes. Scan queries traverse whole graphs or large parts of graphs. Either kind can require memory or computational resources beyond what normal graph query processing provides, so these queries perform better when run via DSE Analytics.
There are two methods for performing analytical queries on DSE Graph instances. OLAP queries use an alternate traversal source that relies on the SparkGraphComputer to run queries on the DSE Analytics nodes. The DSEGraphFrames library supports a subset of the Gremlin graph traversal language for use in Java and Scala applications running on Spark.
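To make the cost of a deep query concrete, here is a minimal back-of-the-envelope sketch in plain Scala. It assumes an idealized graph in which every node has the same number of outgoing edges and no node is reached twice; the object name, function names, and the branching factor of 50 are hypothetical and not part of DSE.

```scala
object DeepQueryCost {
  // Upper bound on the number of nodes reached at exactly `depth` hops,
  // assuming every node has `branching` outgoing edges and none repeat.
  def frontierSize(branching: Long, depth: Int): BigInt =
    BigInt(branching).pow(depth)

  // Total nodes touched across hops 0..depth, start node included.
  def totalVisited(branching: Long, depth: Int): BigInt =
    (0 to depth).map(d => frontierSize(branching, d)).sum

  def main(args: Array[String]): Unit = {
    // Hypothetical branching factor of 50, chosen only for illustration.
    for (depth <- 1 to 5)
      println(s"depth=$depth frontier=${frontierSize(50, depth)} total=${totalVisited(50, depth)}")
  }
}
```

Under these assumptions the frontier grows by a factor of 50 with each extra hop, which is why a traversal that is cheap at two hops can exhaust a single node's memory a few hops later.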
Normal DSE Graph queries use Online Transactional Processing (OLTP). This type consists of a large number of short transactions for processing queries quickly. It is used primarily for data entry and retrieval. OLTP processing uses filters and subgraphs to speed up access to data in specific parts of the larger graph.
Online Analytical Processing (OLAP) is a Spark-backed method for performing multidimensional data analysis. OLAP queries generally take longer than OLTP queries. OLAP processing works by interpreting the graph as a sequence of “star graphs”, each centered on a single vertex. It works best for queries that process the entire graph or at least large portions of it.
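The “star graph” view can be sketched in plain Scala. This is a simplified model, not TinkerPop's or DSE's own implementation: the `Edge` and `Star` types and the `decompose` and `outDegrees` functions are hypothetical names introduced here. Each star holds one center vertex plus its incident edges, so a whole-graph computation can be expressed as independent work per star.

```scala
object StarGraphs {
  final case class Edge(src: String, label: String, dst: String)

  // One star: a center vertex with its outgoing and incoming edges.
  final case class Star(center: String, out: List[Edge], in: List[Edge])

  // Split a graph into one star per vertex.
  def decompose(vertices: List[String], edges: List[Edge]): List[Star] = {
    val outBy = edges.groupBy(_.src)
    val inBy = edges.groupBy(_.dst)
    vertices.map(v => Star(v, outBy.getOrElse(v, Nil), inBy.getOrElse(v, Nil)))
  }

  // A whole-graph scan computed star by star: out-degree of every vertex.
  def outDegrees(stars: List[Star]): Map[String, Int] =
    stars.map(s => s.center -> s.out.size).toMap

  def main(args: Array[String]): Unit = {
    val edges = List(Edge("a", "knows", "b"), Edge("a", "knows", "c"), Edge("b", "knows", "c"))
    val stars = decompose(List("a", "b", "c"), edges)
    stars.foreach(println)
    println(outDegrees(stars))
  }
}
```

Because each star is self-contained, the stars can be distributed across workers and processed in parallel, which suits scans over all or most of a graph.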
DSEGraphFrames is a Spark API for analytics operations on DSE Graph, inspired by Databricks’ GraphFrames library for processing graph data in Spark. It supports a subset of the Gremlin graph traversal language from Apache TinkerPop. The DSEGraphFrames library is generally faster than OLAP queries for filtering and counts. A graph is presented as two virtual tables: the V() method for the vertex DataFrame and the E() method for the edge DataFrame. The library can also be used to import and export graphs into and from any format that Spark DataFrames can save to.
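The two-table view can be illustrated without a cluster. The following is a minimal sketch in plain Scala collections, not the DSE API: the row types, the sample data, and the `countByLabel` and `out` helpers are all hypothetical, and the `id`, `src`, and `dst` column names simply follow the convention used by the open-source GraphFrames library. It shows why filters and counts map naturally onto tables, and how a one-hop traversal becomes a join between the vertex and edge tables.

```scala
object TwoTableGraph {
  // Hypothetical row shapes standing in for the vertex and edge tables.
  final case class VertexRow(id: String, label: String, name: String)
  final case class EdgeRow(src: String, dst: String, label: String)

  // Made-up sample data for illustration only.
  val vertices: List[VertexRow] = List(
    VertexRow("v1", "person", "alice"),
    VertexRow("v2", "person", "bob"),
    VertexRow("v3", "talk", "graph-ops"))
  val edges: List[EdgeRow] = List(
    EdgeRow("v1", "v2", "knows"),
    EdgeRow("v1", "v3", "attended"),
    EdgeRow("v2", "v3", "attended"))

  // Filter-and-count: a single pass over the vertex table.
  def countByLabel(vs: List[VertexRow], label: String): Int =
    vs.count(_.label == label)

  // One-hop outgoing traversal as a join:
  // start vertices -> edges on src -> vertices on dst.
  def out(vs: List[VertexRow], es: List[EdgeRow], startName: String, edgeLabel: String): List[String] = {
    val byId = vs.map(v => v.id -> v).toMap
    val startIds = vs.filter(_.name == startName).map(_.id).toSet
    es.filter(e => startIds(e.src) && e.label == edgeLabel)
      .flatMap(e => byId.get(e.dst).toList)
      .map(_.name)
  }

  def main(args: Array[String]): Unit = {
    println(countByLabel(vertices, "person"))
    println(out(vertices, edges, "alice", "attended"))
  }
}
```

In a real Spark application the same operations would run as DataFrame filters and joins distributed across the Analytics nodes rather than over in-memory lists.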
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity. We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!