In this blog, we discuss the various methods of counting rows offered in Datastax Enterprise. Traditional databases maintain counts as a matter of course, but Cassandra’s architecture makes that more difficult. This architecture is the same thing that provides Cassandra its advantages, being scale-able, distributed, and having fast reads and writes. Since we still need row counts, we can use DSE’s features to get them in various circumstances.
Datastax Enterprise (DSE) is a custom build of Apache Cassandra, built and maintained by Datastax. They built DSE on Cassandra and so it shares the same node and ring structures, but also interacts with other technologies. DSE Analytics integrates Apache Spark with our Cassandra cluster and allows us access to all of Spark’s main APIs for distributed processing. Graph gives us the ability to create and traverse graph databases. DSE Search integrates with SOLR and Lucene indices to provide search capability. All of these can be combined, bringing up a single node with all three capabilities. They can also be enabled separately. Our counting methods will come from these various pieces of functionality that DSE provides.
Counting in DSE
Counting with Cassandra
Base Cassandra, without any of the extra DSE-added features, can already get counts in a few ways. Using CQL, Cassandra’s query language, the syntax for a standard count is “SELECT COUNT(*) FROM keyspace.table;”. This will work well and quickly for small clusters and small tables. By default, the result limits itself to 10,000 max rows counted. We can increase this limit by adding “LIMIT X” to the end of the select statement. Where X is the max number of rows you expect to get out. This can cause problems in bigger clusters however as it essentially triggers a full table scan. Doing it frequently can tie up cluster resources.
Another way to get a count is to use Cassandra’s counter data type and maintain one yourself. This is a more versatile approach and reading a value from a Cassandra table is significantly easier and cheaper in terms of cluster resources than calling select count(*). The way that counters work in Cassandra can cause some confusion though. Counter types can only go into a table with other counter types and a primary key. We also need to treat them differently when doing batch processing. When working with the CQL batch functionality batches that update counters need to be separate from any inserts, updates, or deletes to normal Cassandra tables. By combining external processing, counters, and occasional use of select count(*) you can keep up a relatively accurate count that won’t hog system resources no matter how often you need to read the result.
Counting with DSE Analytics (Apache Spark)
DSE Analytics turns our Cassandra node into Spark nodes as well as still retaining their Cassandra capabilities. Spark is well known for processing distributed data quickly and efficiently, making it a good candidate for getting counts without tying up the resources for Cassandra data management. Using Spark SQL and the Cassandra connectors, we can load tables as RDDs or Dataframes and use them to count how many rows we have. We could also potentially enable DSE’s always on SQL feature for an SQL shell to use to get our counts.
Counting with SOLR
We can also use DSE search to get our counts. The syntax is much the same as the standard select count(*) which can filter down to specific subsections of a table by adding a valid where clause to the end of the statement. When we create and attach a SOLR index to a table ,we can add a SOLR search query to the end to get counts for our search results.
In our live webinar covering this topic we use Datastax Astra, a managed Cassandra instance by Datastax. Astra does not have full DSE capabilities so we do not demonstrate counts via SOLR or the Always on SQL shell. Rather than having a single cluster for Cassandra and Spark like DSE with DSE Analytics, we use Astra alongside a Spark instance hosted in a Gitpod docker container, the details of which can be found in one of our previous blogs here. The code is on Github here.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!