TableAnalyzer is a tool for analyzing Cassandra (CFStats/TableStats) output that visualizes variance in metrics between nodes. We use TableAnalyzer to generate a conditionally-formatted spreadsheet that can be used to perform data model review.
TableAnalyzer Among Other Cassandra Tools
This is the first of many tools we’ve been working on that help understand what’s going on in a Cassandra cluster. It’s not supposed to replace OpsCenter, Prometheus+Grafana, or other tools out there which chart various Cassandra metrics over time.
Instead, TableAnalyzer takes the table stats from all of the nodes at a specific point in time in order to compare across nodes rather than comparing across time. This allows us to identify issues such as tombstones, wide partitions, and different kinds of skew. The goal is to eventually build intelligence in here to help people build better tables through analysis of the keys, etc.
Purpose
This tool was created because we had a hard time explaining to non-believers that data-model issues were the reason their “amazing app” crashed all the time, and that it wasn’t Cassandra’s fault. TableAnalyzer generates a conditionally formatted spreadsheet which makes it easy to not just locate potential data modeling issues, but also provide quantifiable statistics to back it up.
What TableAnalyzer Actually Does
TableAnalyzer can be broken down into two functions:
1) Collect cfstats/tablestats Output (“Receive”)
2) Visualize cfstats/tablestats in an Excel Spreadsheet and JSON file (“Transform”).
1. Collect cfstats/tablestats Output
TableAnalyzer can collect the cfstats/tablestats by:
- Using the extracted contents of a diagnostics tarball from DSE OpsCenter
- Running nodetool cfstats or nodetool tablestats from a local Cassandra/Cassandra compatible* instance
- Running nodetool cfstats or nodetool tablestats from a Cassandra/Cassandra compatible* instance through SSH
*By Cassandra compatible instance, we are referring to DSE, open source Cassandra, Elassandra, or Scylla. However, note that so far we have only tested with DSE and open source Cassandra.
2. Visualize cfstats/tablestats (Excel Spreadsheet and JSON)
Once it has the stats, TableAnalyzer parses and transforms the stats into a CSV file format, which then is transformed into a conditionally formatted Excel file (xlsx). It also creates the data set as JSON which can be then sent to ElasticSearch, New Relic, etc. for further visualization.
Collect Tablestats/CFStats Output
Table Analyzer can collect the cfstats/tablestats by:
- Using the extracted contents of a diagnostics tarball from DSE OpsCenter
- Running nodetool cfstats or nodetool tablestats from a local Cassandra/Cassandra compatible* instance
- Running nodetool cfstats or nodetool tablestats from a Cassandra/Cassandra compatible* instance through SSH
Option #1: Using the extracted contents of a diagnostics tarball from DSE OpsCenter
This option is only possible when using DSE and having access to OpsCenter.
The DSE OpsCenter tarball has the cfstats/tablestats output contained within it, so we will not need to run nodetool cfstats or nodetool tablestats, but will need to extract the cfstats/tablestats output from the tarball.
We have written a script for this as well:
# unzip tarball
tar xzvf <path-to-unarchived-diagnostics-tarball-dir>/<tarball-filename>.tar.gz -C .
# get tablestats into our data directory
# python3 cfstats.receive.diag.py <region> <cluster> <datacenter> <path-to-unarchived-diagnostics-tarball-dir> <get-all-data {True|False}> <cassandra-major-version> <debug {0|1}>
python3 cfstats.receive.diag.py uswest prod cassandra <path-to-unarchived-diagnostics-tarball-dir> True 3 1
Option #2: Using nodetool cfstats or nodetool tablestats from a local Cassandra instance
Another way to get the tablestats output is to manually run the tablestats command on a local node that is running Cassandra. This command will have to be run several times, with different a different host specified in the environments.yaml
file for each node in order to get tablestats output on a multi-node cluster.
# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
<p>python3 cfstats.receive.py uswest prod cassandra 1</p>
Option #3: Using nodetool cfstats or nodetool tablestats from a Cassandra instance through SSH
Using SSH can make this whole process much faster.
First, change settings.json so that our script knows to use SSH:
vim ./config/settings.json
# change useSSH to "True"
It should now look like this:
{
"connection":{
"useDiag" : "False",
"useSSH": "True",
"detectTopology":"False"
}
}
Now run the script (can use same arguments as when executing locally):
# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
python3 cfstats.receive.py uswest prod cassandra 1
Expected result
At this point, there should now be .txt files, one for each node, in the ./data/<region>/<cluster>/
directory (e.g., ./data/uswest/prod
).
Generate Spreadsheet
Now that we have our tablestats output written to .txt files, we want to convert these files into a csv, and then convert that csv into a .xlsx format. The advantage of using the Excel format is being able to use formatting, and in particular conditional formatting. This conditional formatting will make it easy to see differences between nodes and outliers.
Convert to CSV
# python3 cfstats.transform.py <region> <cluster> <datacenter> <cassandra-major-version> <debug {0|1}>
python3 cfstats.transform.py uswest prod cassandra 3 1
Expected result:
- There should now be .csv files, one for each node, in the
./data/uswest/prod
directory - There might also be a few other outputted csv files that aggregate for each node and json file. (e.g.,
uswest.prod.cfstats.csv
,uswest.prod.cfstats.issues.csv
,uswest.prod.cfstats.pivot.csv
for node) (e.g.,uswest.prod.cfstats.pivot.json
for json file)uswest.prod.cfstats.pivot.csv
is the most important one
Convert to spreadsheet
For this we use csv2formattedxls.py
:
# python3 ./csv2formattedxls.py ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.csv ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.xlsx
python3 ./csv2formattedxls.py data/uswest/prod/uswest.prod.cfstats.pivot.csv data/uswest/prod/uswest.prod.cfstats.xlsx
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms.If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!