Architecting & Managing a Global Data & Analytics Platform Part 5: Monitoring & Scaling a Distributed Business Data & Communications Platform

Once a new distributed platform is implemented, any growing organization needs to be able to properly scale the system. Scaling a system for the sake of scaling it doesn’t make sense because cloud resources cost money. The best way to see when and how to scale is to properly monitor the software.

In our case, we have decided to use commercial components, which help us tremendously. DataStax OpsCenter is good, but it is only the beginning of the toolset needed to really understand what is happening under the hood in the component technologies that make up the DataStax Enterprise platform. Confluent Control Center also helps tremendously with Kafka, Kafka Connect, and related components, but it still won’t be the single pane of glass your team may want for seeing how everything is doing.

When you monitor complex systems such as business platforms in order to scale them, you need to review all signals, not just those that come from the database or the queue.

Why should we monitor or measure these systems in the first place?

In a distributed system, different components work with different strengths and at different capacities. If we don’t measure everything granularly, we may end up adding resources to one part of the framework without giving a proportionate amount to another. Before we make any changes to the scale of our platform, we should look deeply at how everything is done and how long it takes to do it.
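As a minimal sketch of measuring "how long it takes to do it" for each step, the Python below times named blocks of work and summarizes them. The step names, the in-memory `durations` store, and the `timed` and `summarize` helpers are all hypothetical and only for illustration; a real platform would send these numbers to its metrics system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: step name -> list of durations in seconds.
# A real platform would ship these to its metrics system instead.
durations = defaultdict(list)


@contextmanager
def timed(step):
    """Record how long the wrapped block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[step].append(time.perf_counter() - start)


def summarize(samples):
    """Return count, total, and worst-case duration for each step."""
    return {
        step: {"count": len(values), "total_s": sum(values), "max_s": max(values)}
        for step, values in samples.items()
    }


# Stand-ins for real work such as a queue publish or a database write.
with timed("publish_to_queue"):
    time.sleep(0.01)
with timed("write_to_database"):
    time.sleep(0.02)

summary = summarize(durations)
# The step with the largest total time is the first candidate to look at.
slowest = max(summary, key=lambda step: summary[step]["total_s"])
print(slowest, summary[slowest])
```

Comparing the totals per step is what keeps a scaling decision proportionate: resources go to the step that actually consumes the time.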

In real-time business platforms, the technologies that make up the system are distributed, and each is different. Cassandra, Spark, and Kafka all take care of different needs. Each of them tries to break up the work between all of its nodes. In a global platform, this means that the work is spread across nodes that are running on different computers and in different parts of the world. They may also be running several thousand or even millions of processes at the same time. It’s important to know what to measure, how to measure it, and which tools to measure it with.

“If you can’t measure it, you can’t improve it” – Peter Drucker

Distributed systems can be “simple” in terms of architecture but seem “complex” because of the moving pieces. That complexity comes from needing to see and correlate the different types of events and metrics that occur in different parts of the system.
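One way to picture that correlation, as a rough sketch only: if every component attaches a shared identifier to the events it emits, the events can be grouped and ordered afterwards. The `correlation_id` and `ts` fields and the sample events below are assumptions made for this example, not fields that Kafka, Spark, or Cassandra emit on their own.

```python
from collections import defaultdict


def correlate(events):
    """Group events by correlation id and sort each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["correlation_id"]].append(event)
    return {
        cid: sorted(group, key=lambda e: e["ts"])
        for cid, group in grouped.items()
    }


# Hypothetical events collected from three different components.
events = [
    {"correlation_id": "order-1", "ts": 3.0, "source": "cassandra", "msg": "row written"},
    {"correlation_id": "order-1", "ts": 1.0, "source": "kafka", "msg": "message received"},
    {"correlation_id": "order-2", "ts": 1.5, "source": "kafka", "msg": "message received"},
    {"correlation_id": "order-1", "ts": 2.0, "source": "spark", "msg": "record transformed"},
]

timeline = correlate(events)
for event in timeline["order-1"]:
    print(event["ts"], event["source"], event["msg"])
```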

Different types of monitoring and their purpose

Just as there are different parts of a scalable business platform, there are different types of monitoring, and each has a reason. One of the biggest reasons people measure and monitor their systems is to adhere to a Service Level Agreement (SLA) or an Operating Level Agreement (OLA). “Make our platform faster” may be an objective in your team’s OKRs, but be sure to have a key result that’s tied to establishing a baseline, or to improving the baseline in relation to actual business goals around SLAs and OLAs.
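To make "establishing a baseline" concrete, here is a minimal sketch that computes latency percentiles from a set of samples and compares the 95th percentile against a target. The sample values, the 250 ms target, and the helper names are invented for illustration; your own SLA or OLA defines the real numbers and which percentile matters.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline_report(latencies_ms, sla_p95_ms):
    """Summarize a latency baseline and check it against a p95 target."""
    p95 = percentile(latencies_ms, 95)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": p95,
        "sla_p95_ms": sla_p95_ms,
        "within_sla": p95 <= sla_p95_ms,
    }


# Hypothetical request latencies in milliseconds and a hypothetical target.
latencies_ms = [120, 95, 110, 300, 105, 98, 130, 101, 99, 115]
report = baseline_report(latencies_ms, sla_p95_ms=250)
print(report)
```

With a report like this, "faster" becomes a measurable key result: move the p95 from its current baseline to the agreed target.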

  • Endpoint metrics (User Browser / API)
  • Logging (Event, Error, Info, Warnings, …)
  • Tracing (Interface, Software, Database)
  • System metrics (Disk, CPU, Memory, …)
  • Performance metrics (Throughput, Latency, …)
  • Application metrics (custom to business use cases)

Over the years I’ve seen that most businesses that need to manage a business-critical platform end up centralizing their metrics and logging in one system. The most common are Splunk, ELK (Elasticsearch, Logstash/Beats, Kibana), ELG (the same stack with Grafana in place of Kibana), Graylog, SumoLogic, Sematext, DataDog, New Relic, and more recently Prometheus or Graphite with Grafana (read below for links). These resources can help you decide what is good for you.
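Whichever system you choose, centralizing tends to be easier when every signal arrives in a common shape. The sketch below wraps the monitoring types listed earlier in one JSON envelope. The field names, the `SignalType` enum, and the source name are assumptions for this example and not the ingestion format of any of the products above.

```python
import json
import time
from enum import Enum


class SignalType(Enum):
    """The monitoring categories discussed in this post."""
    ENDPOINT = "endpoint"
    LOG = "log"
    TRACE = "trace"
    SYSTEM = "system"
    PERFORMANCE = "performance"
    APPLICATION = "application"


def to_envelope(signal_type, source, name, value, timestamp=None):
    """Serialize one signal as a JSON line with a consistent set of fields."""
    return json.dumps(
        {
            "ts": time.time() if timestamp is None else timestamp,
            "type": signal_type.value,
            "source": source,
            "name": name,
            "value": value,
        },
        sort_keys=True,
    )


# Hypothetical system metric from a hypothetical node name.
line = to_envelope(SignalType.SYSTEM, "cassandra-node-1", "cpu_percent", 71.5)
print(line)
```

A shared `type` field makes it straightforward to filter one dashboard by category while still keeping every signal in the same place.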

This concludes my series on Architecting & Managing a Global Data & Analytics Platform. If you want me or our company to come and talk to your company about your Global Data & Analytics Platform, feel free to email me or my team at Anant.

  1. Part 1/5: Foundation of a Business Data, Computing, and Communication Framework
  2. Part 2/5: Foundation for Properly Managing a Business Data & Communications Framework
  3. Part 3/5: Deploy Frameworks that Scale on any Cloud (Containers, Azure, AWS, VMs, Baremetal)
  4. Part 4/5: Building a Developer-Friendly Platform on top of a World-Class Framework
  5. Part 5/5: Monitoring & Scaling a Distributed Business Data & Communications Platform