Data Platforms: From Paper to Real-Time

Clay, Bones, and Paper

For more than 6,000 years, humans have devised ways to create and record information. Whether people used clay tablets, animal bones, or papyrus, early record-keeping media were fragile and cumbersome, which made organizing, storing, and retrieving information difficult. Furthermore, at some point in the history of every civilization, the number of records that needed to be kept grew so large that organizing and finding information became extremely laborious. We see some of the first attempts at organizing stored data in the creation of concordances (alphabetical lists of records) during the 5th century and in page numbers and indexes in books (1470) around the invention of the printing press.

Libraries and Classification Systems

As printed information became widely accessible, a growing population became hungry to extract information from books and other documents. Eventually, the amount of information people wanted to access became so large that finding anything in a timely manner was extremely labor-intensive, even with an alphabetical list. This problem was especially acute in public libraries. The next advance in data management was therefore the classification system which, when combined with a physical storage method, allows users to find data more quickly. Two such systems, the Dewey Decimal System in 1876 and the Library of Congress Classification (LOCC) system in 1897, gained wide adoption in public libraries, colleges, and universities in the United States. While somewhat different in design, both Dewey and LOCC categorize information by topic and require that information on the same topic be physically stored together. All of a sudden, a library-goer could see, all on one page, the major topics of information in the library (10 of them under Dewey). This same person could simply walk to the physical shelves where books on geography or painting were stored. People saved time and accessed information more quickly.

These classification systems, while an improvement, rely on a “call number” assigned to every book in a library. Each book had a corresponding card in a card catalog, which allowed a user to “browse” books without going to the physical shelves. It’s quite amazing to think about now: an entire library was documented on small individual cards, which took up lots of space and could easily be removed from the card catalog by library-goers seeking a book. User-friendly as it was, this system was difficult to keep up-to-date and error-free. It’s no surprise, then, that in the decades following the advent of computers, libraries began transcribing their card catalogs into digital databases. Many of us who attended grade school in the 80s can remember learning to use both the physical card catalog and, later, its digitized version.

Digital Revolution

Surprisingly, the earliest electronic databases, created in the 1960s, wouldn’t have worked well for the average library-goer. These early databases, dubbed hierarchical databases, were very basic and consisted of digital lists that were essentially copies of their physical counterparts. While hierarchical databases were an improvement over physical paper lists in space, durability, and consistency, they were not searchable beyond basic categories. Extracting information from them was difficult because a user was limited to certain “parent” categories for searches.

I asked my colleague (and rockstar software engineer), Obioma Anomnachi, about hierarchical databases. Obioma said, “In the library analogy, [with hierarchical database structure] you could still get a list of all of the books by a single author or maybe book series that are then broken down by book – the hierarchical database has a tree-like structure. However, the format can’t deal with books with 2+ authors. It would either have to have an author row containing all the authors and have that data be separated from any of the authors’ individual lists or it would have to duplicate data, storing the book in each one of the contributing authors’ lists.” Duplicated lists take up a lot of physical (disk) space and require every addition or change to be made in all applicable lists.
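The duplication problem Obioma describes can be sketched in a few lines of Python. This is only an illustration, not how any particular hierarchical database was implemented: the `catalog` layout, the author and book names, and the `correct_year` helper are all hypothetical.

```python
# Hypothetical hierarchical catalog: each author is a parent node whose
# children are that author's books. A co-authored book has no single
# parent, so it is copied under every contributing author.
catalog = {
    "Author A": [{"title": "Shared Book", "year": 1990}],
    "Author B": [
        {"title": "Shared Book", "year": 1990},
        {"title": "Solo Book", "year": 1985},
    ],
}


def correct_year(catalog, title, year):
    """Fix a book's year; every duplicated copy has to be found and changed."""
    touched = 0
    for books in catalog.values():
        for book in books:
            if book["title"] == title:
                book["year"] = year
                touched += 1
    return touched


copies_updated = correct_year(catalog, "Shared Book", 1991)
print(copies_updated)  # 2
```

One correction touches two stored copies. Miss one of them and the catalog disagrees with itself, which is the inconsistency risk the paragraph above describes.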

Relational Databases and SQL

While innovative, early database models only provided a partial solution to the information management problem. In order to be truly useful, database design needed to improve to the point that retrieval and use of the data therein became easier. In the 1970s, IBM and other companies would usher in an innovation that would do just that. With the advent of the “relational” database, a single database could be used (which made storage much more efficient), and users were able to query data. Along with this new database design, a standardized language was developed called Structured Query Language (SQL, pronounced ‘sequel’), which would become, and still is, the standard language used in relational databases. 

For decades, different flavors of relational databases thrived (e.g., Oracle Database, MariaDB, and Microsoft SQL Server). However, history would repeat itself as the amount of data produced quickly outpaced the storage method and design. In my next article, I will discuss the “big data” revolution, including how hardware innovations allowed for nearly limitless amounts of data storage which, when combined with a public accustomed to blazingly fast processing speeds, established the need for a new type of database design.

Almost Modern: Relational Databases

My previous article (link) examined the evolution of information management from manual methods such as clay tablets and handwritten manuscripts to digital methods such as hierarchical and relational databases. However, that exploration only got us through the late 1990s, which is only somewhat “modern.” To follow the latest news on modern database technology, we need to explore the major improvements and changes of the past two decades. To lay the groundwork for that discussion, it is necessary to spend a little more time understanding why relational databases were such a huge innovation and why they continue to be the “go-to” information management tool for many businesses. This article will also provide context for future exploration of why modern enterprises are offering more than SQL databases and the relational database model.

Spreadsheets Are Hierarchical

As discussed previously, the first digital databases were hierarchical in nature. We can think of hierarchical databases like old spreadsheets (before innovations that allowed linking between multiple sheets). It’s easy to imagine the joy of accountants and operations managers everywhere when they were able to switch from paper books to spreadsheets. This innovation meant that math could be programmed and information copied with a few clicks. Errors were greatly reduced and on-the-job efficiency greatly increased. But, as anyone who has tried to manage a business solely using spreadsheets can attest, this method quickly gets out of control when changes to one data item must be replicated manually across multiple copies of data and/or files.

Digital Storage Used to be Expensive

Take the above issues with spreadsheets and visualize them at an enterprise-level scale. Redundant data was not only difficult to update, but it also took up a great amount of storage space on disk. It’s easy to forget how expensive digital storage was twenty years ago. Recent internet searches suggest that one can purchase 16 terabytes (TB) of hard disk storage for around $70 USD. In mid-1999, the same amount of storage would have cost $176,000 USD ($0.011/MB), and if we go back another 10 years to 1989, it would have cost nearly $120 million USD ($7.48/MB) to purchase 16 TB of hard disk storage (check out this site for disk prices back to the 1950s).
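The arithmetic behind those figures is easy to reproduce. The short sketch below multiplies the per-megabyte prices quoted above by 16 TB; it assumes decimal units (1 TB = 1,000,000 MB), and the `cost_usd` helper is just an illustrative name.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB


def cost_usd(terabytes, usd_per_mb):
    """Total price of a given capacity at a given price per megabyte."""
    return terabytes * MB_PER_TB * usd_per_mb


cost_1999 = cost_usd(16, 0.011)  # mid-1999 price per MB quoted above
cost_1989 = cost_usd(16, 7.48)   # 1989 price per MB quoted above

print(f"1999: ${cost_1999:,.0f}")  # 1999: $176,000
print(f"1989: ${cost_1989:,.0f}")  # 1989: $119,680,000
```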

Disk storage used to be very expensive in part because the hardware was fragile and required proper care and maintenance. Do you remember the floppy disk? Periodicals and texts from the 1990s describe a floppy disk as “a delicate device that must faithfully and accurately record and play back the information stored on its recording media. Dust and scratches on the disk surface must be carefully avoided during the manufacturing process, as even the smallest imperfection can cause writing and reading errors” (source). The thin plastic sheets were coated with iron oxide (a.k.a. rust), allowing machines to leverage the magnetic properties of iron to encode information that could later be read by other machines.

Hard disks, though designed to be more durable, were also vulnerable. Impacts, strong magnetic fields, and “head crash” failures could mean data loss or complete database failure. A head crash occurs when the read/write head, which normally hovers above the disk and applies a small magnetic field to flip the magnetic polarity at a physical location (representing a change from 0 to 1 and back), instead scratches the disk’s surface. These factors and more meant that storing any quantity of data was expensive and risky.

The Join: a Major Innovation

Needless to say, as digital databases grew in popularity, so did the demand for frugality with disk storage space. Relational databases helped solve not only the problem of redundant data but also the problem of database size. They circumvented the need for redundant data by prioritizing the following rule: no piece of information should ever be present in more than one location inside an information system (database). To make this possible, relational databases introduced a new construct called a ‘join’, which acts as a connector linking relevant pieces of data. This allowed for the transition from the one-to-many model of hierarchical databases to a many-to-many model. Updates and changes to data only needed to be made in one place, and “joining” data eliminated the need to create multiple, redundant copies. These changes made databases both easier to manage and smaller to store.
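To make the join concrete, here is a minimal sketch using Python’s built-in sqlite3 module and the library example of a book with two authors. The table names, column names, and sample rows are assumptions made up for illustration. Each book title is stored exactly once, and a small link table records which authors contributed to which books.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT);
    -- Link table: one row per (book, author) pairing.
    CREATE TABLE book_authors (
        book_id   INTEGER REFERENCES books(id),
        author_id INTEGER REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO books   VALUES (10, 'Shared Book'), (11, 'Solo Book');
    INSERT INTO book_authors VALUES (10, 1), (10, 2), (11, 2);
""")


def titles_by(author_name):
    """Join the three tables to list one author's books, sorted by title."""
    rows = conn.execute(
        """
        SELECT b.title
        FROM books AS b
        JOIN book_authors AS ba ON ba.book_id = b.id
        JOIN authors AS a ON a.id = ba.author_id
        WHERE a.name = ?
        ORDER BY b.title
        """,
        (author_name,),
    ).fetchall()
    return [title for (title,) in rows]


# The title lives in one row, so a single UPDATE corrects it for every author.
conn.execute("UPDATE books SET title = 'Shared Book, 2nd ed.' WHERE id = 10")

print(titles_by("Author A"))  # ['Shared Book, 2nd ed.']
print(titles_by("Author B"))  # ['Shared Book, 2nd ed.', 'Solo Book']
```

The co-authored book appears in both authors’ results even though it was stored and updated only once; the join assembles each list at query time instead of keeping a separate copy per author.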

Millions of Dollars Saved

In the late 1980s, a large insurance company was using a hierarchical database to manage policy and claims data for its millions of customers. However, as the company’s data management needs grew, it became increasingly difficult to maintain the hierarchical database, which was prone to data redundancy and inconsistency. To address these issues, the company decided to implement a relational database.

The transition to the relational database was not without its challenges, but ultimately it allowed the company to streamline its data management processes and save money in several ways. First, the relational database eliminated the need for data redundancy, reducing the amount of storage space required and lowering the company’s storage costs. Second, the relational database improved data accuracy, reducing the risk of errors and the need for manual data correction. Finally, the relational database made it easier to access and analyze data, enabling the company to make more informed business decisions and improve its operations. Overall, the insurance company estimated that it saved millions of dollars by switching to the relational database. Thousands of companies experienced this benefit during the transition from hierarchical to relational databases.

Relational Databases Maintain Popularity

For approximately forty years, relational databases have served as the go-to data storage method for most businesses, and they are not retiring anytime soon. Even as of this month, December 2022, a list of databases published by DB-Engines categorizes seven of the top 10 ranked databases as relational databases.

Relational databases such as MySQL, which are based on SQL, are widely used for storing customer lists, product inventories, and sales transactions. However, as more and more daily activities moved online, the traditional SQL-based relational database struggled to keep up with the large volumes of data generated by modern software applications, including email and social media.

When Facebook wanted to offer its users the ability to search their inboxes, it became clear that a new approach was needed to manage the massive amounts of unstructured data generated by such applications. This need, along with others, led to the development of NoSQL databases. Unlike relational databases, NoSQL databases are designed to handle large amounts of unstructured data and are well-suited to handle the demands of modern software applications.

In this article, I will discuss various types of data, the difference between SQL and NoSQL, and specific versions of NoSQL you may hear about in the market. Lastly, I’ve included a graphic that presents use cases for both SQL and NoSQL by industry.

Structured, Semi-Structured, and Unstructured Data Types

Before we dive into the differences between SQL and NoSQL databases, it’s important to understand the types of data that are typically stored in databases. There are three main types of data: structured, semi-structured, and unstructured.

[Infographic: concise definitions of structured, semi-structured, and unstructured data, summarizing the descriptions below.]
  • Structured data is highly organized and can be easily processed by computers. Examples of structured data include customer information, transactional records, and inventory lists. This type of data is typically stored in a fixed format, such as tables or spreadsheets, and can be easily queried using tools like SQL.
  • Semi-structured data is information that doesn’t fit neatly into a structured format but still has some identifiable structure. It may contain tags or labels that provide some context, but the content may vary in its format and organization. Examples of semi-structured data include email messages, social media posts, and web pages.
  • Unstructured data is information that has no identifiable structure or organization. It may come in the form of text, images, audio, or video, and it is not easily machine-readable. Examples of unstructured data include free-form text documents, images, audio recordings, and video files.
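A small Python sketch may help make the three categories tangible. All of the records below are invented for illustration; the point is only how much a program can rely on the shape of the data in each case.

```python
import json

# Structured: every record has the same fields in the same order,
# so a calculation over a column is trivial.
structured_rows = [
    ("C001", "Ada", 42.50),
    ("C002", "Grace", 17.25),
]
total = sum(amount for _, _, amount in structured_rows)

# Semi-structured: fields are labeled, but records need not share the
# same set of fields, so code has to allow for missing ones.
semi_structured = [
    json.loads('{"from": "ada@example.com", "subject": "Invoice", "attachments": 2}'),
    json.loads('{"from": "grace@example.com", "subject": "Hello"}'),
]
with_attachments = [m["subject"] for m in semi_structured if m.get("attachments")]

# Unstructured: raw content with no fields at all. Anything beyond a
# crude keyword check needs more advanced analysis.
unstructured = "Thanks for the quick delivery, but the box arrived damaged."
mentions_damage = "damaged" in unstructured.lower()

print(total)             # 59.75
print(with_attachments)  # ['Invoice']
print(mentions_damage)   # True
```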

While structured data is easy to analyze and process, unstructured data presents a challenge to businesses and organizations. The volume of unstructured data is growing rapidly, and these massive unstructured data stores require advanced technologies such as artificial intelligence and machine learning to extract insights and value from them. However, the insights gained from analyzing unstructured data can be highly valuable, providing businesses with a deeper understanding of their customers, operations, and markets.

Differences between SQL and NoSQL

SQL (Structured Query Language) databases, also known as relational databases, have been the standard for data storage for decades. They store data in a structured way, with rows and columns that can be easily queried using SQL. These databases were designed to handle structured data and provide strong consistency guarantees. They are still widely used for transactional systems, business intelligence, and data warehousing. However, with the growth of big data and the Internet of Things, traditional SQL databases can struggle to keep up with the sheer volume and velocity of data being generated. This has led to the development of new types of databases, known as NoSQL databases, which are designed to handle large amounts of unstructured data.

NoSQL is commonly read as “not only SQL,” and as the name implies, it’s a different way of organizing data that goes beyond the traditional tables and columns of SQL databases. NoSQL databases are designed to handle large amounts of unstructured or semi-structured data, such as social media posts, web pages, and sensor data. They are more flexible than SQL databases because they don’t require a fixed schema or structure, which means they can handle data that doesn’t fit into neat rows and columns. NoSQL databases can store data using a variety of models, including document, key-value, and graph.

Types of NoSQL Databases

One of the most popular NoSQL databases is Apache Cassandra. It is a distributed database designed to handle large amounts of data across many servers, providing high availability with no single point of failure. Think of it as a filing cabinet with many drawers and no locks, accessible to anyone who needs it. It is particularly well-suited for handling large-scale data that is spread across multiple data centers and cloud availability zones.

[Illustration: a file cabinet with several open drawers, representing how a Cassandra database works.]

Cassandra utilizes a flexible data model that allows data to be stored in a denormalized way, which is particularly useful for handling wide and sparse data sets. Because its data is spread across many servers, Cassandra can scale horizontally: as the volume of data grows, more drawers can be added to the filing cabinet. It is well suited for write-heavy use cases, where data is frequently added or updated.

Cassandra was initially developed at Facebook to handle the huge amount of data generated by the social network’s inbox search feature. Facebook released it as an open-source project in 2008, and it became an Apache Incubator project in 2009. Cassandra was inspired by Amazon’s Dynamo, which is a distributed key-value store, and Google’s Bigtable, a distributed structured data store.

Cassandra’s creators aimed to create a distributed database that could scale horizontally across many commodity servers and maintain high availability, even in the face of hardware failures. Cassandra’s features, such as its decentralized architecture, ability to handle massive amounts of data, and fault tolerance, have made it a popular choice for many large-scale data-intensive applications, including social media platforms, e-commerce sites, and financial services.
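One way to picture “adding drawers” is a hash ring, the general idea behind the Dynamo-style partitioning that inspired Cassandra. The sketch below is a deliberately simplified illustration and not Cassandra’s implementation: the `Ring` class, the node names, and the use of SHA-256 are assumptions, and real deployments add features such as virtual nodes and replication that are not modeled here.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 used only for spread)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each key belongs to the next node clockwise from it."""

    def __init__(self, nodes):
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        position = token(node)
        bisect.insort(self._tokens, position)
        self._owners[position] = node

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


keys = [f"user:{n}" for n in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # add a new "drawer"
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key that changed owner went to the new node; all others stayed put.
print(all(after[k] == "node-d" for k in moved))  # True
```

The property worth noticing is that growing the cluster only relocates the keys the new node takes over, which is what makes adding commodity servers a practical way to scale.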

Is Cassandra the Only NoSQL Database?

While Apache Cassandra is one of the most popular NoSQL databases, there are other types of NoSQL databases as well. Some other examples of NoSQL databases include:

  • Document databases: these databases store data as documents, typically in formats such as JSON or XML. They are often used for content management systems or e-commerce applications. Examples include MongoDB and Couchbase.
  • Key-value stores: these databases store data as a key-value pair, similar to a dictionary or hash table. Examples include Redis and Amazon DynamoDB.
  • Graph databases: these databases store and query data as a network of nodes and edges, making them well-suited for analyzing relationships between data points in applications such as social networks or recommendation engines. Examples include Neo4j and Amazon Neptune.
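The three models in the list above can be imitated with plain Python structures. This is only a rough sketch with invented data; it does not use the API of MongoDB, Redis, Neo4j, or any other product named here, and the `within_hops` helper is hypothetical.

```python
from collections import deque

# Document model: one self-contained, nested record per entity.
product_doc = {
    "_id": "sku-1",
    "name": "Desk lamp",
    "tags": ["lighting", "office"],
    "reviews": [{"stars": 5}, {"stars": 4}],
}
average_stars = sum(r["stars"] for r in product_doc["reviews"]) / len(
    product_doc["reviews"]
)

# Key-value model: an opaque value looked up by its exact key.
session_store = {"session:abc123": "user-42"}
current_user = session_store.get("session:abc123")

# Graph model: nodes plus edges, queried by walking relationships.
follows = {
    "ann": ["bob"],
    "bob": ["cara", "dev"],
    "cara": [],
    "dev": ["ann"],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: nodes reachable from start in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}


print(average_stars)                           # 4.5
print(current_user)                            # user-42
print(sorted(within_hops(follows, "ann", 2)))  # ['bob', 'cara', 'dev']
```

Each model makes a different question cheap to answer: the whole record at once, a lookup by exact key, or a walk across relationships.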

Use Cases for SQL and NoSQL

[Infographic: a list of industries and how each might use SQL and NoSQL databases.]

While SQL is still ideal for storing structured data, such as customer information or transactional records, the shift from SQL to NoSQL databases, like Cassandra, has allowed organizations to handle large amounts of data in ways that were not previously possible. NoSQL databases are now an essential part of modern data architectures, providing the scalability and performance needed to handle today’s data demands. NoSQL databases have allowed businesses to store and process massive amounts of data more efficiently and cost-effectively, which has helped drive innovation and growth across a wide range of industries.

Conclusion

In conclusion, we have taken a journey through the history of information management and the development of databases. From early methods like clay tablets to the invention of hierarchical and relational databases, we have examined the challenges faced in organizing and accessing data. We learned about the significance of relational databases in revolutionizing data storage and retrieval, offering improved efficiency and reduced redundancy. However, with the growth of unstructured data and the need for scalability, the emergence of NoSQL databases addressed these challenges, providing flexibility and handling vast amounts of information. While relational databases remain a crucial tool, the rise of NoSQL databases demonstrates the need for modern approaches to handle the complexities of today’s data-driven world. The continuous evolution of database technology ensures that businesses can effectively manage and harness the power of data to drive innovation and growth in various industries.

Infographics by Allison Nokes for Anant Corporation

Blog image by Navneet Shanu @ Pexels.