This series covers different aspects of architecting and managing a global data & analytics platform. This is not as simple as choosing a technology and installing it: the work involves properly coordinating people, processes, information, and systems to ensure that business needs are met at all times. We will cover the components of the “SMACK” stack; although many people may not use Akka or Mesos, they will still find much value in our coverage of Cassandra, Spark, and Kafka. We will also cover the Anant “STACK” set of procedures, which we use at our company to manage data & analytics platforms for our clients.
If something is already a “platform,” how can it also be a “STACK”? We’re referring to an acronym we use here at Anant to describe the people-and-process side of a platform, complementing the information and systems that make it up. It stands for Setup, Training, Administration, Customization / Configuration, and Knowledge Management.
As we see in our ongoing series on SMACK, the demands of the modern customer are forcing startups and large enterprises that want to stay relevant to adopt globally scalable technologies. Luckily for them, the software that powers the largest companies, such as Google, Facebook, and Amazon, is readily available for experimentation, research, and development. Once an idea is hashed out and proven, it needs to be operationalized. Today that means using platforms as a service available by subscription, building your own frameworks on infrastructure as a service, or some combination of commercial products, multiple cloud providers, and a managed service team. This guide is meant for data architects and engineers at large organizations, and for CTOs and CDOs of growth startups that are about to grow beyond their current scale.
Categories
- Sources
  - Servers / Devices – Data from mobile apps or systems in a network may be sending information to your platform.
  - Cloud Applications – Cloud applications may send data directly to your internal APIs or to the stream.
  - Incoming API – Other facade APIs may be using your internal APIs.
- Ingress – Raw Data
  - Batch – There is still some need for batches every now and then, whether for massive reconciliations or for importing large amounts of information.
  - Stream – Streams are becoming a common way to ingest information.
  - API – For the microservices architects, everything needs to be exposed as an API.
- Data Platform Components
  - Stream – A stream topology inside a data platform helps move data through complex pipelines.
  - Stream Processing – A stream processor takes data off the stream, processes it, and either sends the result back into another stream or puts it in a database (a minimal worker combining these pieces is sketched below).
  - Queues – With so many data operations going on in a data platform, a queue is useful as a task table for batches and similar work.
  - Schedulers – A scheduler adds work to a queue and may also take work off the queue and schedule it for execution.
  - Data Warehouse – Whatever the term of the day is (data lake, data mart, or data warehouse), there is some place where all the data is stored so that it can be further processed into usable data.
  - Database – Some data, after it has been processed, can reside in a database that other systems use natively, for example, business intelligence systems.
  - Analytics – Analytics is a loaded term, but it differs from basic stream processing in that it generally provides a framework for complex operations such as descriptive, predictive, or prescriptive analysis using machine learning or statistics.
- Egress – Processed Data
  - Data – As a by-product of data processing and analytics, some information comes back out as raw data for visualization and other uses.
  - Reports – Other information is processed into data that can be shown on reports in the form of report tables.
  - Events – Some business platforms with real-time goals send out events or messages to systems, devices, and people.
- Utility
  - Outgoing API – Pure data is served back up as an outgoing API.
  - Cloud Apps – The company may host its own cloud apps based on its own data.
  - Servers / Devices – Servers and devices can also consume the processed information.
Even without any specific technologies being mentioned, we can see that a data platform has many components and can be complex. To make a platform like this global, yet another layer of thought and action goes into bringing up and managing it at scale. Generally speaking, it takes several people to properly deploy a scalable global data platform, though it’s a little easier today thanks to technologies and companies that have made DevOps much more approachable. The reason your company may be looking at this stack is that your data needs require it. In this guide, although we only cover Apache Cassandra, Apache Spark, and Apache Kafka rather than the full SMACK stack, we want to show how and why these are often used in combination.
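To make that combination concrete before we look at each component, here is a minimal sketch in Python of the stream-processing worker described in the list above: it consumes raw events from a stream (Kafka), processes them, writes the result to a database (Cassandra), and republishes an enriched event for downstream consumers. The broker address, topic names, keyspace, table, and enrichment logic are all hypothetical placeholders, and the snippet assumes the confluent-kafka and cassandra-driver packages are installed.

```python
import json

from cassandra.cluster import Cluster
from confluent_kafka import Consumer, Producer

# Stream in: consume raw events from an ingress topic.
# Broker address and topic names are placeholders.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "enrichment-worker",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw-events"])

# Database out: processed rows land in Cassandra for other systems to use.
# The "platform" keyspace and events_by_device table are assumed to exist.
session = Cluster(["cassandra-host"]).connect("platform")
insert = session.prepare(
    "INSERT INTO events_by_device (device_id, ts, reading) VALUES (?, ?, ?)"
)

# Stream out: enriched events are republished for downstream processors.
producer = Producer({"bootstrap.servers": "broker:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Stand-in for real processing / enrichment logic.
    event["reading_f"] = event["reading"] * 9 / 5 + 32
    session.execute(insert, (event["device_id"], event["ts"], event["reading"]))
    producer.produce("enriched-events", value=json.dumps(event))
```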
Apache Cassandra
- Proven open source technology
- Multi-region, masterless (peer-to-peer) replication (see the keyspace sketch below)
- Option to run on physical, virtual, or container infrastructure on hybrid or multi-cloud
- Massively distributable
- Combined transport and communications
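As a brief illustration of the multi-region, peer-to-peer replication point above, here is a minimal sketch using the Python cassandra-driver: a keyspace replicated across two data centers, with every node able to accept reads and writes. The contact point, keyspace name, and data center names (us_east, eu_west) are placeholders that must match your cluster’s actual snitch configuration.

```python
from cassandra.cluster import Cluster

# The contact point is a placeholder; since there is no master,
# any node in any region can be used to connect.
cluster = Cluster(["cassandra-host"])
session = cluster.connect()

# NetworkTopologyStrategy keeps three replicas in each named data center,
# giving multi-region replication without a primary/replica hierarchy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS platform
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")
```

Writes accepted in one region are replicated to the others, and consistency levels such as LOCAL_QUORUM let you control how many replicas must acknowledge before a request succeeds.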
Apache Spark
- Proven open source technology
- Can use Python, Scala, Java, or R
- Option to run on physical, virtual, or container infrastructure on hybrid or multi-cloud
- Massively distributable on different schedulers
- Combined batch processing and streaming (sketched below)
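To illustrate the combined processing-and-streaming point above, here is a minimal PySpark sketch that uses the same DataFrame API for a batch job and a structured-streaming job over Kafka. The file path, broker address, and topic name are hypothetical, and reading from Kafka assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combined-demo").getOrCreate()

# Batch: aggregate historical events from the warehouse (path is a placeholder).
batch_counts = (
    spark.read.parquet("s3a://my-bucket/events/")
    .groupBy("event_type")
    .count()
)
batch_counts.show()

# Streaming: the same DataFrame operations over a live Kafka topic.
stream_counts = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "raw-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload")
    .groupBy("payload")
    .count()
)

# Print running counts to the console as new events arrive.
query = stream_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```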
Apache Kafka
- Proven open source technology
- Several APIs: Connect, Streams, KSQL, and the Producer / Consumer API (a minimal producer/consumer sketch follows this list)
- Option to run on physical, virtual, or container infrastructure on hybrid or multi-cloud
- Massively distributable on different schedulers
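Here is a minimal sketch of the Producer / Consumer API mentioned above, using the confluent-kafka Python client; the broker address, topic name, and payload are placeholders.

```python
from confluent_kafka import Consumer, Producer

BROKERS = "broker:9092"  # placeholder broker address

# Producer: publish a JSON event keyed by device id.
producer = Producer({"bootstrap.servers": BROKERS})
producer.produce("raw-events", key="device-42", value='{"reading": 21.5}')
producer.flush()  # block until the broker acknowledges delivery

# Consumer: read the event back as part of a consumer group.
consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw-events"])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```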
Given that these are proven technologies, suppose we decide to go forward with them. For each of these components, there are several things to consider. Does your team have the talent to set up and configure, train, administer, customize, and manage the knowledge for each one? We also need to decide whether to use managed services, open source versions, or commercial versions. Luckily, there are commercial versions of these tools that make life a little easier, so that even if you decide to manage the platform yourself, there will be someone to support you in your time of need.
Cassandra and Spark are supported commercially by DataStax as part of a suite called DataStax Enterprise, which also includes Solr (for indexing) and DSE Graph, an implementation of Apache TinkerPop. DataStax, as is commonly known, was founded by the team that took Apache Cassandra from the original source code released by Facebook and made it what it is today. Even though DataStax’s products are somewhat different (optimized for enterprise loads and several times faster), they share an open core with the open source components of Apache Cassandra, Apache Solr, and Apache Spark. The suite also has tools that are not available in the open source distributions of Apache Cassandra or Apache Spark.
Confluent publishes a commercial version of Kafka. Confluent was founded by the people who literally built Apache Kafka while at LinkedIn and who remain a major part of the project’s committers. Confluent provides a very easy-to-use Kafka distribution that also includes enterprise security and a schema registry.
In the next article, we’ll dig deeper into the Anant STACK process to see how it can make managing your components a little easier. If you want me or our company to come and talk to your company about global data & analytics platforms, feel free to email me or my team at Anant.
- Part 1/5: Foundation of a Business Data, Computing, and Communication Framework
- Part 2/5: Foundation for Properly Managing a Business Data & Communications Framework
- Part 3/5: Deploy Frameworks that Scale on any Cloud (Containers, Azure, AWS, VMs, Baremetal)
- Part 4/5: Building a Developer-Friendly Platform on top of a world-class Framework
- Part 5/5: Monitoring and Scaling a Distributed Business Data & Communications Platform