Apache Cassandra Cluster Design and Architecture

Apache Cassandra Lunch #51: Cassandra Cluster Design & Architecture

In Apache Cassandra Lunch #51: Cassandra Cluster Design & Architecture, we give an overview of Cassandra cluster architecture, not to be confused with the Cassandra database architecture. Specifically, we look at using Cassandra datacenters to isolate workloads. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Apache Cassandra

Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Use Cases

Apache Cassandra is best used in situations in which fast reads and writes of terabytes of data are required. Cassandra is also great in situations in which replication and availability of data are a global need, and in situations in which the data must be constantly available with no downtime. Cassandra’s multi-node and multi-datacenter distribution and replication allow for all of these scenarios. Cassandra is meant for BIG data.

Cassandra should not be used if the amount of data is measured in gigabytes, if the data can be comfortably housed in one data center, or if the system can allow for downtime. Especially don’t use Cassandra if you are just trying to use the latest trend in database technology. In these use cases, a relational database is likely a better option than Cassandra.

Cassandra Data Model

The Cassandra data model of keyspaces and column families may look similar to the databases and tables of SQL Server, MySQL, or PostgreSQL, but they do not behave the same way. The data model consists of keyspaces (similar to databases), column families (similar to tables in the relational model), keys, and columns. The Cassandra Query Language (CQL) supports queries by partition key and, optionally, clustering keys. CQL does not support arbitrary queries on columns, and table joins are not allowed. Also, a Cassandra cluster should not be managing more than 100 to 150 tables across any number of keyspaces.
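To make the query restriction concrete, here is a minimal Python sketch of the rule CQL applies to equality filters: the partition key columns (the first part of the primary key) must all be restricted, and clustering columns may only be restricted in their declared order. The table, its column names, and the function `is_supported_query` are hypothetical, and the check is deliberately simplified: it ignores secondary indexes, ALLOW FILTERING, and range filters.

```python
def is_supported_query(partition_key, clustering_keys, where_columns):
    """Return True if equality filters on where_columns fit the primary key.

    Simplified: ignores secondary indexes, ALLOW FILTERING, and range filters.
    """
    remaining = set(where_columns)
    # Every partition key column must be restricted so one partition is targeted.
    if not set(partition_key) <= remaining:
        return False
    remaining -= set(partition_key)
    # Clustering columns may be restricted only in declared order, with no gaps.
    for column in clustering_keys:
        if column not in remaining:
            break
        remaining.remove(column)
    # Leftovers are regular columns or clustering columns that follow a gap.
    return not remaining


# Hypothetical table: PRIMARY KEY ((sensor_id), day, reading_time)
PARTITION_KEY = ["sensor_id"]
CLUSTERING_KEYS = ["day", "reading_time"]

print(is_supported_query(PARTITION_KEY, CLUSTERING_KEYS, ["sensor_id", "day"]))
print(is_supported_query(PARTITION_KEY, CLUSTERING_KEYS, ["temperature"]))
```

The first call is accepted because it names the partition key and the first clustering column. The second is rejected because it filters on a regular column without naming the partition key, which is the kind of arbitrary column query a relational database would allow.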

Cassandra Cluster Architecture

Physical vs. Logical Datacenters

Cassandra clusters provide flexibility in how workloads are distributed. This is due in part to the ability to have both physical and logical datacenters, meaning clusters can be physically or virtually distributed.

Physical Datacenters

Physical datacenters can be physical locations or separate cloud-based datacenters. In a physical data center, racks are used to define availability zones. Racks will contain nodes, with the nodes containing the data. Physical data centers still allow for high availability and redundancy with replication factors set by the keyspace.

Logical Datacenters

As in physical datacenters, racks contain the nodes, which in turn hold the data, and the replication factor is still defined by the keyspace. The difference between the two is that in a logical datacenter the machines are located in the same physical place.
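Whether a datacenter is physical or logical, the keyspace sets a replication factor for it by name. As a rough sketch, the Python helper below builds a CQL CREATE KEYSPACE statement that uses NetworkTopologyStrategy. The keyspace name, the datacenter names, and the helper `create_keyspace_cql` are made up for illustration; in a real cluster the datacenter names have to match the names the nodes report through their snitch configuration.

```python
def create_keyspace_cql(keyspace, replication_by_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per datacenter."""
    pairs = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replication_factor in replication_by_dc.items():
        pairs.append("'" + dc_name + "': " + str(int(replication_factor)))
    options = ", ".join(pairs)
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {" + options + "};"
    )


# Hypothetical names: three replicas for transactions, two for analytics.
print(create_keyspace_cql("sales", {"dc_transactions": 3, "dc_analytics": 2}))
```

The generated statement can be run in cqlsh or passed to a driver session. Each datacenter keeps its own number of replicas, which is what lets a smaller logical datacenter carry fewer copies than the primary one.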

Availability / Performance of Data

In a single-datacenter cluster, data is replicated as defined by the keyspace. How many replicas must acknowledge each read or write is controlled by the consistency level: ONE, QUORUM, or ALL. Repair processes sync replicas across all nodes in the datacenter. In a multi-datacenter cluster, data replication is still defined by the keyspace, with a replication factor set per datacenter, and there are additional consistency levels to choose from. The options include those of the single datacenter with the addition of LOCAL_ONE, LOCAL_QUORUM, and EACH_QUORUM. The LOCAL_ levels do not change where data is replicated, since the keyspace still controls that. They only limit the required acknowledgements to replicas in the datacenter that is handling the request, while the other datacenters still receive the write without the request waiting on them. Repair processes in a multi-datacenter cluster will sync across all datacenters.
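The arithmetic behind these options is easy to sketch. The Python function below computes how many replicas must acknowledge a request for a given consistency level, using the usual quorum formula of half the replicas, rounded down, plus one. The function name `required_acks` and the datacenter names are hypothetical, and the level names follow the consistency levels Cassandra documents (ONE, QUORUM, ALL, LOCAL_ONE, LOCAL_QUORUM, EACH_QUORUM).

```python
def required_acks(level, replication_by_dc, local_dc):
    """Number of replica acknowledgements a request waits for (simplified)."""
    total_replicas = sum(replication_by_dc.values())
    local_replicas = replication_by_dc[local_dc]
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "QUORUM":
        # Majority of all replicas, counted across every datacenter.
        return total_replicas // 2 + 1
    if level == "LOCAL_QUORUM":
        # Majority of the replicas in the coordinator's datacenter only.
        return local_replicas // 2 + 1
    if level == "EACH_QUORUM":
        # A majority in every datacenter, summed.
        return sum(rf // 2 + 1 for rf in replication_by_dc.values())
    if level == "ALL":
        return total_replicas
    raise ValueError("unsupported consistency level: " + level)


# Hypothetical cluster: two datacenters with three replicas each.
replication = {"dc1": 3, "dc2": 3}
for level in ("ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_acks(level, replication, local_dc="dc1"))
```

With three replicas in each of two datacenters, QUORUM waits on four replicas, some of them necessarily remote, while LOCAL_QUORUM waits on only two local ones. That difference is why the LOCAL_ levels are the usual choice when datacenters are meant to work independently.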

Availability / Redundancy of Data

In a single-datacenter cluster, the full dataset for keyspaces and tables is distributed among all the nodes. Racks help distribute replicas of the data evenly. Additionally, racks can map to availability zones in the cloud or to physical racks. In a multi-datacenter cluster, the full dataset for keyspaces and tables is distributed among the different nodes in each datacenter. Racks still help distribute data evenly throughout a datacenter, and they can still map to availability zones in the cloud or to physical racks. Datacenters can be located in physically different locations, and they can be used to isolate workloads.
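As an illustration of how racks spread copies of the data, the sketch below walks a ring of nodes and prefers nodes on racks that do not yet hold a replica, falling back to already-used racks only when there are fewer racks than replicas. This is a simplified approximation of rack-aware placement, not Cassandra's actual implementation, and the node names, rack names, and function `pick_replicas` are hypothetical.

```python
def pick_replicas(ring, start, replication_factor):
    """Pick replica nodes from a ring, preferring racks not used yet.

    ring is a list of (node, rack) pairs in token order; start is the index
    of the first node that owns the data.
    """
    ordered = ring[start:] + ring[:start]
    chosen, used_racks, skipped = [], set(), []
    for node, rack in ordered:
        if len(chosen) == replication_factor:
            break
        if rack in used_racks:
            # Remember the node in case there are fewer racks than replicas.
            skipped.append(node)
        else:
            chosen.append(node)
            used_racks.add(rack)
    # Fall back to nodes on already-used racks if more replicas are needed.
    for node in skipped:
        if len(chosen) == replication_factor:
            break
        chosen.append(node)
    return chosen


# Hypothetical datacenter: six nodes spread over three racks.
ring = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"),
        ("n4", "rack2"), ("n5", "rack3"), ("n6", "rack3")]
print(pick_replicas(ring, start=0, replication_factor=3))
```

With three racks and a replication factor of three, each replica lands on a different rack, so losing one rack or availability zone still leaves two copies of the data.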

Distributing Workloads

Logical

Cassandra cluster design and architecture of a logical multi-datacenter workload distribution.

This image demonstrates a logical multi-datacenter workload distribution using a Cassandra cluster design and architecture in which data transactions are isolated in one datacenter. That same data is replicated to a separate, virtual datacenter that performs analysis of the data, while yet another virtual datacenter reports on the data that is being replicated. This means that the workload of one datacenter will not impact the performance of the other two datacenters. This workload distribution could also be accomplished using separate physical datacenters.
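One way to reason about this layout is to write down which datacenter each workload's clients connect to and which consistency level they request. The Python sketch below does that with made-up datacenter names and checks two conditions for isolation: no two workloads share a datacenter, and every workload uses a LOCAL_ consistency level so its requests never wait on another datacenter. The class `WorkloadProfile` and the function `is_isolated` are illustrative only. In the DataStax Python driver, the `local_dc` value would typically be passed to a load balancing policy such as `DCAwareRoundRobinPolicy(local_dc=...)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    workload: str      # what the clients do
    local_dc: str      # datacenter the clients connect to
    consistency: str   # consistency level the clients request


# Hypothetical datacenter names for the three workloads in the diagram.
PROFILES = [
    WorkloadProfile("transactions", "dc_transactions", "LOCAL_QUORUM"),
    WorkloadProfile("analytics", "dc_analytics", "LOCAL_ONE"),
    WorkloadProfile("reporting", "dc_reporting", "LOCAL_ONE"),
]


def is_isolated(profiles):
    """True if no two workloads share a datacenter and none waits on remote replicas."""
    datacenters = [p.local_dc for p in profiles]
    one_dc_per_workload = len(set(datacenters)) == len(datacenters)
    local_only = all(p.consistency.startswith("LOCAL_") for p in profiles)
    return one_dc_per_workload and local_only


print(is_isolated(PROFILES))
```

Replication between the datacenters still happens in the background because the keyspace defines it. The check only covers the client side of the picture.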

Cassandra cluster design and architecture diagram of a logical multi-datacenter workload distribution in Kubernetes.

In this image of a logical multi-datacenter workload distribution in Kubernetes, the same workload distribution is happening as above, but Kubernetes containers are being used instead of virtual machines.

Physical

Diagram of a physical/hybrid multi-datacenter cloud distributed cluster.

This is an example of using a physical/hybrid multi-datacenter cloud distribution. Cloud-separated data centers are syncing with an on-premise data center. An example would be using lightweight workloads on the cloud data centers while using the on-premise data center to handle a heavier analytics workload.

Diagram of a physical multi-datacenter cluster distribution.

This is an example of a physical multi-datacenter distribution, with each datacenter in its own physical location. Theoretically, this distribution could be taken to an interplanetary scale.

Resources

https://cassandra.apache.org/

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!