Apache Cassandra Lunch #26: Cassandra Troubleshooting with Logs

In case you missed it, this post is a recap of Cassandra Lunch #26, discussing common Cassandra log warnings and errors. We discussed the various resources that a Cassandra cluster needs, and how we can find problems with those resources via the logs generated by the cluster. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at noon EST. Register here now!

Cassandra System Resources

Disk

Since Cassandra is a database, it needs enough storage to function. Part of how Cassandra maintains speed and availability in a distributed system is by storing multiple SSTables and spending extra space on replicas and denormalized tables. One way to ensure continued operation as data grows is to split storage across several drives. The first disk holds the core Cassandra data, such as the SSTables containing table data. The second disk holds the data needed to reconstruct a node if it goes down: the commit logs, hints, and any saved caches. The third disk holds the log files, so that even if the other disks fill up we can continue to get log messages.
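As a rough way to check a layout like the one above, the following Python sketch groups directories by the device they live on and reports any that share one. The directory paths are hypothetical placeholders, not taken from any real configuration, and the check has a known limit: a different device ID means a different filesystem, which is not necessarily a different physical drive.

```python
import os
from collections import defaultdict


def group_by_device(paths):
    """Group existing paths by the device ID of the filesystem they live on."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_dev].append(path)
    return dict(groups)


def shared_devices(paths):
    """Return the groups of paths that share a device with at least one other path."""
    return [sorted(group) for group in group_by_device(paths).values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical locations for data, commit logs, hints, saved caches, and logs.
    candidate_dirs = [
        "/var/lib/cassandra/data",
        "/var/lib/cassandra/commitlog",
        "/var/lib/cassandra/hints",
        "/var/lib/cassandra/saved_caches",
        "/var/log/cassandra",
    ]
    existing = [d for d in candidate_dirs if os.path.exists(d)]
    for group in shared_devices(existing):
        print("These directories share a device:", ", ".join(group))
```

If the data, recovery, and log directories all show up in one group, a single full disk could stop writes and logging at the same time.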

Memory

A lot of Cassandra processing takes place in memory: MemTables hold data before it is written to disk, and various caches speed up reads. Cassandra is a Java application, so the amount of memory it is allocated is determined by configurable JVM settings. The heap should be 20% to 50% of all memory and must be no more than 32GB. Some memory is also needed for the off-heap cache and the file system cache. How the cluster uses memory can be tuned further via garbage collection configuration.
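The heap guideline above is simple arithmetic, and a small Python sketch makes it concrete. The function name and parameters are illustrative only. The percentages and the 32GB cap are the ones stated in this post, not values read from any Cassandra configuration.

```python
def heap_bounds_gb(total_memory_gb, low=0.20, high=0.50, cap_gb=32.0):
    """Return (lower, upper) heap size in GB for a machine with the given RAM.

    Applies the 20%-50% guideline from the text and caps both ends at 32GB.
    """
    if total_memory_gb <= 0:
        raise ValueError("total_memory_gb must be positive")
    lower = min(total_memory_gb * low, cap_gb)
    upper = min(total_memory_gb * high, cap_gb)
    return lower, upper


if __name__ == "__main__":
    for ram in (16, 64, 256):
        lower, upper = heap_bounds_gb(ram)
        print(f"{ram}GB RAM: heap between {lower:.1f}GB and {upper:.1f}GB")
```

On large machines both ends of the range hit the cap, which leaves the remaining memory for the off-heap and file system caches.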

CPU

The Cassandra operations most likely to use enough CPU to cause warnings or errors are large numbers of writes, which in turn trigger compaction. A large number of reads can cause the same problem. The repair process can also take up a lot of computing power.

Network

Network issues can also cause timeouts, which lead to failures when reading data from or writing data to a node.

Logging

Log Levels

Log levels appear in Cassandra logs to mark messages as more or less important than others. INFO is a low level of alert that denotes normal messages with no chance of affecting service for the cluster. WARN denotes messages that may lead to errors or service interruptions. They do not represent an imminent service interruption but can generally be found in the lead-up to such incidents. The last level is ERROR, which denotes significant errors that may affect the operation of the Cassandra cluster.

Search Terms

When searching through messages using something like the ELK stack, discussed here, you can look at the frequency of WARN/ERROR messages or the total number of log messages to get a general idea of cluster health. 
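Without a full log stack, the same frequency check can be approximated with a short script. The sketch below assumes each log entry starts with its level (as in a typical Cassandra system.log layout). If your logging pattern differs, the regular expression needs adjusting. The sample lines are made-up placeholders, not real Cassandra output.

```python
import re
from collections import Counter

# Assumption: the level is the first token on each log entry.
LEVEL_RE = re.compile(r"^\s*(TRACE|DEBUG|INFO|WARN|ERROR)\b")


def count_levels(lines):
    """Count log entries per level, skipping lines without a leading level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def problem_ratio(counts):
    """Fraction of counted entries that are WARN or ERROR."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["WARN"] + counts["ERROR"]) / total


if __name__ == "__main__":
    # Placeholder lines for illustration only.
    sample = [
        "INFO  placeholder startup message",
        "WARN  placeholder warning message",
        "ERROR placeholder error message",
        "    continuation line without a level",
    ]
    counts = count_levels(sample)
    print(dict(counts), f"problem ratio: {problem_ratio(counts):.2f}")
```

Tracking this ratio per node over time gives a coarse health signal. A sudden rise in WARN entries is a prompt to look at the specific messages described next.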

Specific messages to look for include hinted handoff messages. Hinted handoff happens when a node is down or too busy to receive messages from other nodes. These messages can be triggered if there is too much traffic involving a particular node, or if the cluster is too small to handle the load it is receiving. By default, hints are dropped after three hours of being unable to be written to a node, so hints dropped messages show that a node has been down or otherwise unable to communicate for three hours. Dropped mutation or dropped task messages appear when a node is unable to complete a task, which can hint at bigger problems.

Repair errors are tied to particular threads. To find them, search for "AntiEntropyStage", "ReadRepair", or "RepairTask". Searching for "compacting large partition" finds wide partitions that are causing problems. Tombstones can also cause problems, and the tombstone warning message is triggered based on the tombstone_warn_threshold configuration setting. Messages mentioning "java" point to problems in the Java application that is Cassandra, such as out of memory errors. Disk space problems can be found by searching for "unable to write" or "corrupted". Timeouts can be found by searching for connection messages or for failing reads and writes.
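The search terms above can be collected into a small lookup table and applied to log lines in one pass. The Python sketch below is one possible arrangement, not a definitive list. The repair, large partition, and disk terms come from the text. The substrings for hints, dropped messages, tombstones, out of memory errors, and timeouts are assumptions about likely wording. A bare "java" term is left out because it tends to match too broadly.

```python
# Category -> substrings to look for (matched case-insensitively).
SEARCH_TERMS = {
    "repair": ["AntiEntropyStage", "ReadRepair", "RepairTask"],
    "wide partition": ["compacting large partition"],
    "tombstones": ["tombstone"],                  # assumed substring
    "out of memory": ["OutOfMemoryError"],        # assumed substring
    "disk": ["unable to write", "corrupted"],
    "hinted handoff": ["hint"],                   # assumed substring
    "dropped messages": ["dropped"],              # assumed substring
    "timeouts": ["timeout", "timed out"],         # assumed substrings
}


def categorize(line):
    """Return the categories whose search terms appear in the line."""
    lowered = line.lower()
    return [
        category
        for category, terms in SEARCH_TERMS.items()
        if any(term.lower() in lowered for term in terms)
    ]


def scan(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        for category in categorize(line):
            counts[category] = counts.get(category, 0) + 1
    return counts


if __name__ == "__main__":
    # Placeholder lines for illustration only, not real Cassandra output.
    sample = [
        "ERROR placeholder [AntiEntropyStage] message",
        "WARN  placeholder compacting large partition message",
        "ERROR placeholder unable to write message",
    ]
    print(scan(sample))
```

The same table could be turned into saved searches in a log aggregation tool, with one query per category.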

Log Aggregation

A discussion of log aggregation methods can be found here.
