Distributed Real-time Data Processing with Spark in Business Platforms – Part 2/6

Real-time data processing is the current state of the art in business platform data engineering. Long gone are the days of batch processing and monolithic ETL engines that are turned on at midnight. Today’s demands come from the sheer number of mobile users and “things” on the internet. Mobile phones and tablets have steadily become more central to the customer experience as companies create mobile-only or mobile-first interfaces to their commercial systems and business processes. Similarly, there are other “things” in the “Internet of Things” (IoT), such as rental bikes and scooters, keyless home locks, and smart home thermostats.


It used to be that users would physically go to a store and pay cash for something. Then they started using credit cards, which then became networked. Eventually they started buying things on websites, then on their phones, and now they are back at the cash register using mobile pay. These are just some examples of the numerous sources of “events” that can interact with businesses.

Today, every large-scale business platform, or any platform that aspires to be one, must be not only an internet company but also a data company and an IoT company. Imagine how the millions of McDonald’s mobile application users order online and pick up their food curbside from any of the roughly 14,000 restaurants across the United States. Uber, Lyft, and Via are all IoT companies in the way they track riders and drivers and make real-time decisions on pairing them for rides.

How do they do it? The secret is not really much of a secret if you work in the industry. Unlike 20 years ago, the technologies that power these companies are a dime a dozen. The open-source movement and the proliferation of the Internet across the world have made the technology accessible to anyone who is willing to learn it and try it out.

Apache Spark

In our last post in this series, we talked at a high level about the reasons for scaling business platforms, how to find and measure areas for growth, and the technologies most companies use to scale. This post focuses on real-time data processing, specifically with a well-known technology called Apache Spark (not the same as the Capital One Spark Card or the Spark email application).


In the last 10 years or so, “Big Data” has been used and abused by companies and government institutions across the world. Sometimes the misuse doesn’t hurt anyone except the people who invested millions of dollars in something they didn’t understand. It all started with Google’s MapReduce, Bigtable, and distributed file system (Google File System) ideas, which were published as papers for the world to see. Some went on to make things like DynamoDB and Cassandra (which we’ll cover in another article in this series). Others went on to make Hadoop MapReduce, HBase, and HDFS, which became collectively known as the “Hadoop” distribution and were later commercialized by Cloudera, Hortonworks, and cloud providers such as AWS and Azure.


This first generation of big data dealt with all of the “Five Vs of Big Data,” as most people in the industry know them, or should know them:


  1. Variety – Big data allowed teams to deal with every type of data known to man: structured, unstructured, semi-structured, raw, and so on.
  2. Volume – Big data allowed teams to accumulate and process tremendous amounts of data. We’re talking about several hundred terabytes, or even petabytes, of data.
  3. Velocity – Big data allowed teams to process all of that data much faster than older methods could, because the work ran in parallel. The processing could be anything that was needed: data science, search indexing, analytics. It didn’t matter. A lot of it got done, fast.
  4. Veracity – Big data allowed the process to produce verifiable and truthful facts from the source information. It wasn’t sampling the data; it was going through all of it with sheer brute computing power.
  5. Value – Ultimately, the real goal of Big Data was to provide some sort of value to the business from the hoards of information stored in archives for decades.
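
To make the “parallel processing” behind the Velocity point a bit more concrete, here is a minimal sketch of the map-and-reduce idea in plain Python. It is only an illustration, not Hadoop or Spark code: the function names (split_into_partitions, map_partition, merge_counts, word_count) are made up for this example, and threads on one machine stand in for the worker machines of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_partition(lines):
    """Map step: count the words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(records, num_partitions=4):
    partitions = split_into_partitions(records, num_partitions)
    # Threads are a stand-in (assumption) for the workers of a real cluster.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    events = ["order placed", "order paid", "ride requested", "order placed"]
    print(word_count(events))
```

Each partition is counted without looking at any other partition, which is what lets a real cluster hand the partitions to different machines and merge the partial results at the end.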

 

So where does Spark come in? While Hadoop and other contenders like MapR were initially able to do a lot of great things as batch processes, they were hard to use for normal people in technology. (Not everyone in the technology community is a data geek.) Although tools built on top of Hadoop, such as Hive and Pig, made it slightly easier, the scripts still took a long time to execute.

Spark Streaming

Spark came along and introduced the idea of a resilient distributed dataset (RDD), an in-memory representation of data that could come from HDFS, HBase, or an outside data source. In Spark, computations were easier to create using a domain-specific language built on Scala for data processing and engineering. It let people write Google’s PageRank (the reason Google developed all the technology that started the Big Data movement) and run it at mind-blowing scale.

  • Open-sourced Spark examples: PageRank in just 70 lines. Here’s another example using Spark GraphX in just 50 lines.
  • Open-sourced SoundCloud’s PageRank (complete implementation) – claims to run iterations on “700M vertices and 15B edges” in “3-5” minutes.
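
For readers who have not seen PageRank before, the following is a minimal single-machine sketch of the iterative algorithm in plain Python. It is not the Spark example or SoundCloud’s implementation mentioned above; the function name, the dictionary-based graph format, and the default values for damping and iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iterative PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with its "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 4))
```

The inner loop over pages is the part a distributed framework spreads across machines: each page’s contribution can be computed independently and then summed per target page.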

Spark itself is very powerful, as you can see. Later versions of Spark have become even more powerful as other open-source libraries were merged into the project.

Spark includes Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX:
  • Spark SQL – Introduced as an abstraction layer on top of Spark, DataFrames make it easy for programming languages to query datasets “fluently” or with SQL via ODBC/JDBC connectors.
  • Spark Streaming – The streaming ability of Spark is what allows Spark to go beyond being fast. It allows Spark to be reactive and consume data from real-time streams such as Kafka, Flume, or Twitter.
  • Spark MLlib Machine Learning Library – Spark MLlib takes machine learning to another level, from one machine to as many machines as you want. It includes most of the common ML algorithms, so you can create your own data pipelines from a data source, through your ML process, and out to another destination, all in one framework.
  • Spark GraphX – GraphX brings a graph processing framework to Spark. It is not exactly a graph database, but it can be used to run very large graph computations in memory.
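
To give a feel for the streaming idea, here is a small sketch in plain Python of processing a stream in small batches while keeping a running result. Spark Streaming’s classic model works in a similar spirit, but it groups incoming records by time interval rather than by count, and nothing below is Spark’s API: micro_batches and running_counts are hypothetical names used only for this illustration.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event iterator into lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events, batch_size=3):
    """Yield the cumulative count per event type after each micro-batch."""
    totals = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)   # process one small batch at a time
        yield dict(totals)     # snapshot of the state so far


if __name__ == "__main__":
    stream = ["order", "ride", "order", "unlock", "order", "ride", "unlock"]
    for snapshot in running_counts(stream, batch_size=3):
        print(snapshot)
```

Because the input is consumed lazily, the same loop works whether the events come from a list, as here, or from a source that never ends.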

So what?

Apache Spark is arguably the most open, free, and powerful framework for distributed computing, data analysis, data processing, machine learning, graph processing, and stream processing. Developers from more than 300 companies actively develop Apache Spark, including heavy hitters such as Microsoft, Apple, Netflix, Uber, Facebook, Amazon, Intel, Alibaba, eBay, and one of our favorites, DataStax, just to name a few. An order of magnitude more companies use Spark on a daily basis. It has become the de facto data framework for both big and fast data.

In our next article, we’ll cover how Akka, the Scala / Java / C# Actor model framework, can be used to facilitate the “fast” in the real-time world. If you want me or our company to come and talk to your company about data modernization, real-time data platforms, or Apache Spark, feel free to email me or my team at Anant.

  1. Part 1/6: Scaling Business Platform Performance with Spark, Mesos, Akka, Cassandra, Kafka, Kubernetes
  2. Part 2/6: Distributed Real-time Data Processing with Spark in Business Platforms
  3. Part 3/6: Reactive Business Platform Applications & Services with Akka
  4. Part 4/6: Resilient & Scalable Business Platform Database with Cassandra
  5. Part 5/6: Real-time Data Pipeline & Streaming Platform with Kafka
  6. Part 6/6: Scalable Business Platform Development & Data Operations with Kubernetes & Mesos