Large amounts of data are nothing new, and neither is extracting information from them. There were pretty nice ways to store data and tools to analyze it. They had their problems, but nothing unsolvable. And everything was great, right? So what has changed? And what is that “Big Data” everyone is talking about? Is it a monster? A superhero? What are these new types of databases that big data brought with it?
Well, let’s see…
What has changed?
The internet got bigger, making the world smaller and apps global. Eric Schmidt of Google said in 2010 that “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days”. So what we used to call “large amounts of data” is not what we deal with today, and the solutions that had handled data until then could not cope with it. New solutions had to be found. So Google came up with BigTable, Facebook created Cassandra, Yahoo became the main force behind Hadoop, and so on… All of these are ways to deal with that new thing: Big Data.
Big Data is commonly defined by three V’s:
Volume – Data is growing exponentially today, and it is not only found in text. Data can be found in videos (today we can even index videos and query them. How cool is that?), in music, in images and, well, everywhere (think of IoT, where machines create data much faster than people do; a single Boeing airplane can reportedly create hundreds of gigabytes per flight). As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is analyzed again from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data.
Velocity – The speed at which data is generated is also growing. There were times when what happened yesterday was still called recent and relevant. Newspapers somehow still work that way, but even in that field most of the people I know have moved to reading news online. Nowadays we want to know what is happening right now, so data has to move in almost real time.
Variety – Data today can come in many different shapes and from many different sources, and it can represent anything you want. It can be text, a video or maybe an SMS message. Therefore multiple formats are needed to store the different kinds of data.
The motives behind NoSQL databases were big data, scalability, and data formats. But each NoSQL database compromises on some aspect in exchange for a really good answer to another.
Why is that? As the CAP theorem states, “Of three properties of a shared data system: data consistency, system availability and tolerance to network partitions, only two can be achieved at any given moment.” (The theorem was proposed as a conjecture by Prof. Eric Brewer of the University of California, Berkeley, in 2000, and proven by Seth Gilbert and Nancy Lynch of MIT in 2002.) In the same spirit, by relaxing the ACID (Atomicity, Consistency, Isolation, Durability) properties, a database can achieve higher performance and scalability.
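To make the trade-off a bit more tangible, here is a minimal sketch in Python. It is a toy model, not how any real database works: the `TinyCluster` class, the "CP" and "AP" mode labels and the `demo` helper are all made up for illustration. Two replicas hold one value; when the network between them is partitioned, the cluster either refuses the write (staying consistent but unavailable) or accepts it on one side only (staying available but inconsistent).

```python
class Replica:
    """One copy of a single stored value."""
    def __init__(self, name):
        self.name = name
        self.value = None


class TinyCluster:
    """Two replicas of one value. `mode` is "CP" or "AP" (illustrative labels)."""
    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True means the replicas cannot talk to each other

    def write(self, replica, value):
        """Try to write through `replica`; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both copies, so they stay in sync.
            replica.value = value
            other.value = value
            return True
        if self.mode == "CP":
            # Prefer consistency: refuse the write, giving up availability.
            return False
        # Prefer availability: accept the write locally, so the copies diverge.
        replica.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value


def demo(mode):
    """Write once normally, then once during a partition."""
    cluster = TinyCluster(mode)
    cluster.write(cluster.a, "v1")
    cluster.partitioned = True
    accepted = cluster.write(cluster.a, "v2")
    return accepted, cluster.consistent()
```

Calling `demo("CP")` shows a rejected write with both replicas still in agreement, while `demo("AP")` shows an accepted write with the replicas now disagreeing. Real systems are far more nuanced (they reconcile diverged copies later, for example), but the basic choice during a partition is the same.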
NoSQL Data models
NoSQL is a term that covers a variety of databases and database architectures that sometimes have very little in common (although all are referred to as NoSQL). Let’s examine the different families of NoSQL databases. Don’t worry if you don’t fully understand the data models; we will dive deeper into each model in later posts.
Key/Value Pairs – Use a map as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. Often used as a cache. Examples: Redis, Riak.
Document Based – Are essentially a subclass of the key-value store. The difference lies in the way the data is processed: the value is a structured document that the database can look inside. They are schema-free, typically support aggregations using map/reduce, and are usually indexed using B-trees. Examples: MongoDB, CouchDB.
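Column Based – Work on columns rather than rows. All columns are treated individually, and the values of a single column are stored contiguously. Examples: Cassandra (strictly speaking a wide-column store, which groups related columns into column families under a row key).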
Graph Based – Based on graph theory. They employ nodes, properties, and edges. Nodes are the things you want to keep track of. Properties are pertinent information that relates to nodes. Edges connect nodes to nodes or nodes to properties, and they represent the relationship between the two. Examples: AllegroGraph, ArangoDB.
There are more, such as tuple stores, triplestores and object databases, and the field keeps growing.
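To make the four main families above a little more concrete, here is a rough sketch that models each of them with plain Python structures. No real database is involved, and all names here (`kv_store`, `find_documents`, `average`, `neighbours` and the sample data) are invented for illustration; real products add persistence, distribution, indexing and their own query languages on top of these basic shapes.

```python
# 1. Key/value: an opaque value looked up by a unique key.
kv_store = {}
kv_store["user:1"] = "Alice"
kv_store["user:2"] = "Bob"
kv_store["user:1"] = "Alice Smith"  # same key again: the old value is replaced

# 2. Document: still key -> value, but the value is a structured document
#    and the store can look inside it. Documents need not share a schema.
documents = {
    "user:1": {"name": "Alice Smith", "city": "Paris", "tags": ["admin"]},
    "user:2": {"name": "Bob", "age": 31},
}

def find_documents(docs, field, value):
    """Return the ids of all documents whose `field` equals `value`."""
    return sorted(doc_id for doc_id, doc in docs.items() if doc.get(field) == value)

# 3. Column-oriented: the values of each column are kept together,
#    so a query over one column never has to touch the others.
columns = {
    "name": ["Alice Smith", "Bob", "Carol"],
    "age": [29, 31, 45],
}

def average(column):
    """Aggregate over a single column."""
    return sum(column) / len(column)

# 4. Graph: nodes with properties, plus edges describing relationships.
nodes = {
    "alice": {"city": "Paris"},
    "bob": {"city": "Berlin"},
    "carol": {"city": "Rome"},
}
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
]

def neighbours(node, relation):
    """Return the nodes reachable from `node` over edges labelled `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]
```

For example, `find_documents(documents, "city", "Paris")` looks inside the documents, `average(columns["age"])` reads only the age column, and `neighbours("alice", "follows")` walks the edges. The same people could be stored in any of the four shapes; what differs is which questions are cheap to ask.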
The exponential growth of data brought with it what we call Big Data, and the need for data models capable of dealing with it. There are many different NoSQL data models, and you should consider what is important about your data and what can be compromised. Only then can you choose the right model, and then a database product of the appropriate kind. The world of data is not just Relational vs NoSQL. There are many kinds of players, each with its own strengths and weaknesses.
In the next posts we will explore each of the data models more deeply and compare them with familiar technologies. Stay tuned!