Monitoring Complex Systems

Resources for Monitoring DataStax, Cassandra, Spark, & Solr Performance

This resource for monitoring DataStax, Cassandra, Spark, & Solr performance is the first iteration of a longer initiative to build the best knowledge base on real-time data platform technologies such as DataStax Enterprise (Cassandra, Spark, and Solr), as well as Kafka, Docker, and Kubernetes. Our firm, Anant, worked with Solr/Lucene for several years, picked up Spark and Cassandra along the way, and then made the logical move to become experts in, and partners with, DataStax.

DataStax OpsCenter is good, but it is only the beginning of the toolset needed to really understand what is happening under the hood in the component technologies that make up the DataStax Enterprise platform. When you monitor complex systems such as business platforms in order to scale them, you need to review all signals, not just those that come from the database.

Why should we monitor or measure these systems in the first place?

One of the most fundamental principles of life and work that I’ve adopted is that making decisions based on factual data is the best way to move in the desired direction. If the goal is to lose weight, you first have to know how much you weigh, as well as what percentage of your body weight is muscle, fat, or water. If you want to get faster at running, you have to know how long it takes you to run 1 mile, then 2, and so on.

In real-time business platforms, the technologies that make up the system are generally distributed. This means that they run on different computers, often in different parts of the world. They may also be running thousands or even millions of processes at the same time. It’s important to know what to measure, how to measure it, and which tools to measure it with.

The best way to measure progress is to measure progress. – Rahul Singh, CEO @ Anant

Distributed systems can be “simple” in terms of architecture but seem “complex” because of their many moving pieces. That complexity comes from needing to see and correlate the different types of events and metrics that occur in different parts of the system.

Different types of monitoring and their purposes

Just as there are different parts of a scalable business platform, there are different types of monitoring, and each has its purpose. One of the biggest reasons people measure and monitor their systems is to adhere to a Service Level Agreement (SLA) or an Operating Level Agreement (OLA). “Make our platform faster” may be an objective in your team’s OKRs, but be sure to have a key result that is tied to establishing a baseline, or to improving on the baseline, in relation to actual business goals around SLAs and OLAs.

  • Endpoint metrics (user browser / API)
  • Logging (event, error, info, warnings, …)
  • Tracing (interface, software, database)
  • System metrics (disk, CPU, memory, …)
  • Performance metrics (throughput, latency, …)
  • Application metrics (custom to business use cases)
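
As a rough illustration of establishing a baseline and tying it to an SLA, here is a minimal Python sketch that computes latency percentiles and compares the 95th percentile with a target. The function names, the sample latencies, and the 200 ms target are all assumptions made up for this example; they do not come from any particular tool or agreement.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_report(latencies_ms, sla_p95_ms):
    """Summarize a latency baseline and check p95 against a hypothetical SLA target."""
    p95 = percentile(latencies_ms, 95)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": p95,
        "p99_ms": percentile(latencies_ms, 99),
        "meets_sla": p95 <= sla_p95_ms,
    }


# Illustrative request latencies in milliseconds (made-up numbers, not real measurements).
sample_latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 17]

# The 200 ms p95 target is an assumed SLA value for the sake of the example.
report = baseline_report(sample_latencies_ms, sla_p95_ms=200)
print(report)
```

In this made-up data set the median looks healthy, but a single slow request pushes the 95th percentile past the assumed target. That is why a baseline expressed in percentiles is usually more useful for SLA conversations than an average.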

Over the years I’ve seen that most businesses that need to manage a business-critical platform end up centralizing their metrics and logging in one system. The most common are Splunk, ELK (Elasticsearch, Logstash/Beats, Kibana), ELG (the same, but with Grafana), Graylog, SumoLogic, Sematext, DataDog, New Relic, and more recently Prometheus or Graphite with Grafana (see the links below). These resources and our descriptions can help you decide what is right for you.

DataStax

  • DataStax: DataStax Enterprise OpsCenter – an indispensable tool for monitoring the different components of the DataStax system (Cassandra, Spark, Solr, Graph) and managing the lifecycle of clusters. It also manages backups, restores, and repairs.

Spark Monitoring & Performance Metrics

Cassandra Monitoring & Performance Metrics

Cassandra on Grafana. Courtesy Pythian Blog

Solr Monitoring & Performance Metrics

General Tools for Gathering and Visualizing Monitoring Information

  • Prometheus – A systems monitoring and alerting toolkit with a built-in time series database.
  • Graphite – A time series database that graphs data.
  • Cyanite – A drop-in replacement for Graphite, powered by Cassandra.
  • Zipkin – A distributed tracing system that allows you to visualize what’s going on.
  • Zipkin Cassandra – Zipkin Tracing Plugin for Cassandra
  • KairosDB – Fast time series database on Cassandra – Collectors, storage, REST API, web API, aggregators, tools, and client libraries to store and retrieve time series information in Cassandra.
  • Grafana – The tool for beautiful monitoring, metric analytics, and dashboards for Graphite, InfluxDB, Prometheus, and more (such as Elasticsearch or KairosDB).
  • Cassandra Exporter – Exports Cassandra metrics via JMX into Prometheus
  • Cassandra Exporter – Java agent for exporting Cassandra metrics to Prometheus
  • Trulia: Thoth – Thoth is a real-time Solr monitor and search analysis engine. It’s a set of tools that can help you collect, visualize, and leverage data coming from your Solr search infrastructure.
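
To give a feel for how a metric gets into one of these tools, here is a minimal Python sketch that formats a data point for Graphite's plaintext protocol, which takes one "path value timestamp" line per data point. The metric path, the value, the host, and the helper names are assumptions for illustration only; port 2003 is Carbon's usual plaintext port, but your installation may differ.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener (host and port are assumptions)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name and value; nothing is sent over the network here.
line = graphite_line("cassandra.node1.read_latency_ms", 4.2, timestamp=1700000000)
print(line, end="")
```

Calling send_to_graphite([line]) would push the point to a listener if one is running. Once the data is in Graphite, Grafana can chart it alongside the system and application metrics from the rest of the platform.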

These resources are a starting point for figuring out what’s going on with your Cassandra, Spark, or Solr cluster. Whatever you end up using, remember the goal: measuring and monitoring for performance and stability. If you can’t quantify what performance or stability means in something like a Service Level Agreement, then all of this is just for show. Need help with scaling a business platform built on these components? I’d love to chat and offer some thoughts. Send me an email with your question.

Photo by Chris Liverani on Unsplash