A Guide to NoSQL for Developers

If you've been following trends in backend development and Big Data, you've probably noticed the buzz around NoSQL databases in recent years. Some people are excited by this approach to databases, while others suspect there is a catch: the data models differ from those of familiar relational databases, the application programming interfaces are unusual, and the use cases are often unclear.

In this article I will explain why NoSQL databases were created in the first place, what problems they solve, and why so many different databases are suddenly needed. If you're new to NoSQL, you might be particularly interested in the last part of the article, which lists the NoSQL databases that I think are worth exploring first to gain a solid understanding of the field.

Why do we suddenly need a new database?

You may well ask: what's wrong with relational databases? They worked really well for many years, but now there is a problem that they can no longer handle. According to some predictions, in 2018 humanity will generate 50,000 gigabytes of data per second. This is a colossal amount of data! Storing and processing it poses a serious engineering challenge. Even worse, this volume is constantly growing.

As it turns out, relational databases are poorly suited to working with really large volumes of data. They are designed to run on a single machine, and if you want to handle more requests, the only option is to buy a computer with more RAM and a more powerful processor. Unfortunately, the number of queries that one machine can handle is limited, and for distributed work across multiple machines we need a different database technology.

Of course, some readers will chuckle at this point and say that there are two widely used methods for running a relational database on multiple machines: replication and sharding. That is true, but these methods are not enough to cope with our tasks.

Read replication is a technique in which each database update is propagated to other machines that can only handle read requests. All changes are performed by one server, called the master node (or leader), while the other servers, called read replicas, only maintain copies of the data. The user can read from any of the machines, but can change the data only through the master node. This is a convenient and very popular method, but it only lets you process more read requests; it does nothing to help with storing larger volumes of data or handling more writes.

Figure: a leader node (reads and writes) and read replicas (read-only).
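To make the idea concrete, here is a minimal sketch of read replication in Python. The class and method names are invented for illustration, and propagation to the replicas is assumed to be instantaneous, which a real database does not guarantee.

```python
import random


class ReplicatedDatabase:
    """Toy model of read replication: one leader and several read-only replicas."""

    def __init__(self, replica_count):
        self.leader = {}
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        # Every write is applied by the leader first...
        self.leader[key] = value
        # ...and then copied to every replica, so each machine
        # still has to store and apply the full data set.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads can be served by any machine, which spreads the read load.
        node = random.choice([self.leader] + self.replicas)
        return node.get(key)


db = ReplicatedDatabase(replica_count=2)
db.write("customer:1", "Alice")
print(db.read("customer:1"))  # Alice
```

Adding replicas gives `read` more machines to choose from, but `write` still touches every machine, which is why this approach does not help with larger data volumes.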

Sharding is another popular approach that uses multiple instances of a relational database. Each of them handles write and read operations for a portion of the data. For example, if a sharded database stores information about customers, one machine can handle all requests for customers whose names start with A, another machine can store all the data for customers whose names begin with B, and so on.

Figure: several master nodes, each reading and writing its own part of the data.

Although sharding allows you to write more data, managing such a database is a real nightmare: you have to rebalance the data across machines and scale the cluster up or down as needed. It looks simple in theory, but getting it right is quite challenging.
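The name-based sharding scheme described above, and the rebalancing problem it creates, can be sketched in a few lines of Python. The function name and the sample customer names are hypothetical, and routing by the first letter modulo the shard count is just one simple assumption about how shards might be assigned.

```python
def shard_index(name, shard_count):
    """Pick a shard from the first letter of the name: A -> 0, B -> 1, and so on."""
    return (ord(name[0].upper()) - ord("A")) % shard_count


customers = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]

# Placement with four shards, then again after adding a fifth machine.
before = {name: shard_index(name, 4) for name in customers}
after = {name: shard_index(name, 5) for name in customers}

# Customers whose shard changed must be physically moved between machines.
moved = [name for name in customers if before[name] != after[name]]

print(before)  # {'Alice': 0, 'Bob': 1, 'Carol': 2, 'Dave': 3, 'Eve': 0, 'Frank': 1}
print(moved)   # ['Eve', 'Frank']
```

Even in this tiny example, adding one machine forces a third of the customers to move, and that is before considering that some letters are far more common than others.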

Can relational databases be improved?

By now you may be convinced that relational databases are not the best fit for the volume of data generated in the modern world. Still, you may be wondering why no one has yet created a "better" relational database that can run efficiently across multiple machines. It may seem that this technology simply has not been developed yet, and distributed relational databases will appear very soon. Alas, this will not happen. This is mathematically impossible, and nothing can be done about it.

To understand why, you need to look at the so-called CAP theorem (also known as Brewer's theorem). Eric Brewer put it forward as a conjecture around 2000, and it was formally proven in 2002. It concerns three properties that we would like a distributed database running on multiple machines to have:

Consistency - any read operation returns the result of the most recent corresponding write operation. If the system is consistent, then after new data is written it is impossible to read the old, already overwritten data.
Availability - the distributed system can service an incoming request at any time and return an error-free response.
Partition tolerance - the database continues to respond to read and write requests even when some of its servers are temporarily unable to communicate with each other. This temporary failure is called a network partition and can be caused by a variety of factors, ranging from a slow server to physical damage to network equipment.

All of these properties are certainly useful, and we'd really like a database to combine them all. No sane developer would want to give up, say, availability without getting anything in return. Unfortunately, the CAP theorem states that it is impossible for all three properties to hold simultaneously.

This may not be easy to see at first, but the reasoning is simple. First, if we need a distributed database, it must be partition tolerant. This is not even up for discussion: partitions happen all the time, and our database must work despite them. Now let's understand why we can't achieve both consistency and availability. Imagine we have a simple database running on two machines: A and B. Any user can write to either machine, after which the data is copied to the other.

Now imagine that these machines are temporarily unable to communicate with each other, and machine B is unable to send data to or receive data from machine A. If during this period of time machine B receives a read request from a client, it has two options:

Return its local data, even if it is not the latest. In this case, availability is preferred: the client gets at least some data, even if it is outdated.
Return an error. In this case, consistency is preferred: the client will not receive outdated data, but it will not receive any data at all.

Figure: a network partition (loss of network connectivity).
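The two-machine scenario can be played out in a small simulation. This is only a sketch under simplifying assumptions: the `Node` class, the `prefer` setting, and the `Unavailable` exception are invented for illustration, and replication is modeled as an immediate copy while the machines are connected.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk returning stale data."""


class Node:
    def __init__(self, prefer):
        self.prefer = prefer      # "availability" or "consistency"
        self.data = {}
        self.peer = None
        self.connected = True     # False while the network is partitioned

    def write(self, key, value):
        self.data[key] = value
        if self.connected:
            # Copy the new value to the other machine.
            self.peer.data[key] = value

    def read(self, key):
        if not self.connected and self.prefer == "consistency":
            raise Unavailable("cannot confirm the latest value during a partition")
        return self.data.get(key)


def read_from_b_during_partition(prefer):
    a, b = Node(prefer), Node(prefer)
    a.peer, b.peer = b, a
    a.write("balance", 100)              # replicated to B
    a.connected = b.connected = False    # the network partition begins
    a.write("balance", 50)               # B never sees this update
    return b.read("balance")


print(read_from_b_during_partition("availability"))  # 100 (stale, but an answer)
try:
    read_from_b_during_partition("consistency")
except Unavailable as error:
    print("error:", error)
```

Machine B either answers with the old balance or refuses to answer; there is no third branch in `read` that could return the new value, because B has no way to learn it.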

Relational databases strive to provide both consistency and availability, and therefore cannot tolerate partitions in a distributed environment. Trying to implement all the capabilities of a relational database in a distributed system is either impractical or simply infeasible. NoSQL databases, on the other hand, place a premium on scalability and performance. They usually lack such “basic” capabilities as joins and transactions, and their data models are quite different, perhaps even limiting in some ways. All this makes it possible to store larger volumes of data and process more queries than was previously possible.

How do NoSQL databases balance consistency and availability?

It may seem that if you choose a NoSQL database, you will always receive either outdated data or an error whenever a failure occurs. In practice, full availability and full consistency are not the only choices: there is a whole range of options in between. Relational databases don't offer these options, but many NoSQL databases let you control this trade-off for each query. One way or another, they let you set two parameters when performing write or read operations:

W - how many machines in the cluster must confirm that the data has been saved when performing a write operation. The more machines you write to, the more likely the next read is to return the most recent data, but the longer the write takes.
R - how many machines you would like to read data from. In a distributed system, propagating data to all machines in a cluster can take some time, so some servers will have the latest data while others lag behind. The more machines you read from, the higher the chances of reading current data.

Let's look at a practical example. If you have five computers in your cluster and you decide to write data to only one and then read data from one randomly selected computer, then there is an 80% chance that you will read stale data (until the write has propagated to the other machines). On the other hand, this uses a minimum of resources. So if stale data is fine with you, it's not such a bad option. In this case, the parameters W and R are both equal to 1.
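The 80% figure can be checked with a short calculation. The sketch below assumes that the write has reached exactly W machines so far and that the R machines to read from are picked uniformly at random; the function name is made up for this example.

```python
from math import comb


def stale_read_probability(n, w, r):
    """Chance that none of the r machines read is among the w machines holding the new value."""
    # Count the ways to pick r machines that all lack the write,
    # divided by the ways to pick any r machines out of n.
    return comb(n - w, r) / comb(n, r)


print(stale_read_probability(5, 1, 1))  # 0.8
print(stale_read_probability(5, 3, 3))  # 0.0
```

With W = R = 1, four of the five possible reads miss the only machine holding the new value, which gives the 80% from the example.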

On the other hand, if you write data to all five machines in a NoSQL database, you can read data from any machine and be guaranteed to get up-to-date data every time. Performing the same operation on a larger number of machines takes longer, but if up-to-date data is important to you, you can choose this option. In this case, W = 5 and R = 1.

What is the minimum number of reads and writes required for database consistency? Here is a simple formula: R + W ≥ N + 1, where N is the number of machines in the cluster. This means that with five servers you can choose, for example, R = 2 and W = 4, or R = 3 and W = 3, or R = 4 and W = 2. In this case, it does not matter which machines the data is written to: every read will include at least one machine with up-to-date data.
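The formula can be verified by brute force for a small cluster. The following sketch, with a made-up helper name, tries every possible set of machines written to and every possible set read from, and checks that they always share at least one machine exactly when R + W ≥ N + 1.

```python
from itertools import combinations


def always_overlaps(n, r, w):
    """True if every choice of r machines to read shares a machine with every choice of w machines written."""
    machines = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(machines, w)
        for read_set in combinations(machines, r)
    )


n = 5
for r in range(1, n + 1):
    for w in range(1, n + 1):
        # The brute-force answer should match the formula for every pair.
        assert always_overlaps(n, r, w) == (r + w >= n + 1)

# The cheapest settings that still guarantee an overlap for N = 5.
minimal = [(r, n + 1 - r) for r in range(1, n + 1)]
print(minimal)  # [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
```

The pairs printed at the end are (R, W) values; any larger R or W also works, but costs more per operation.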

Other databases, such as DynamoDB, have different restrictions and only allow consistent writes. Each piece of data is stored on three servers, and when any data is written, it is written to two of the three machines. But when reading data, you can choose one of two options:

Strongly consistent read, in which data is read from two of the three machines and the most recently written data is always returned.
Eventually consistent read, in which one machine is selected at random to read the data from. This may temporarily return outdated data.
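The two read modes can be imitated with a toy model of the three-machine scheme described above. This is not how DynamoDB is actually implemented: the class, its methods, and the version counter are assumptions made purely to illustrate why reading two of three machines always finds the latest write.

```python
import random


class ThreeReplicaItem:
    """Toy model: one item stored on three machines, each holding (version, value)."""

    def __init__(self):
        self.replicas = [(0, None)] * 3

    def write(self, value, rng):
        # Give the new value a higher version and store it on two of the three machines.
        version = max(v for v, _ in self.replicas) + 1
        for index in rng.sample(range(3), 2):
            self.replicas[index] = (version, value)

    def strongly_consistent_read(self, rng):
        # Ask two machines and keep the answer with the highest version.
        picked = [self.replicas[index] for index in rng.sample(range(3), 2)]
        return max(picked, key=lambda pair: pair[0])[1]

    def eventually_consistent_read(self, rng):
        # Ask a single random machine; it may not have the latest write yet.
        return rng.choice(self.replicas)[1]


rng = random.Random(42)
item = ThreeReplicaItem()
item.write("v1", rng)
item.write("v2", rng)
print(item.strongly_consistent_read(rng))    # v2
print(item.eventually_consistent_read(rng))  # may be v2, v1, or None
```

Because only one machine can be missing the latest write, any two machines include at least one that has it, which is the R + W ≥ N + 1 rule with N = 3, W = 2, and R = 2.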

Why are there so many NoSQL databases?

If you follow the latest news in the field of software development, you have probably heard about many different NoSQL databases, such as MongoDB, DynamoDB, Cassandra, Redis and many others. You might be wondering: why do we need so many different NoSQL databases? The reason is simple: different NoSQL databases are designed to solve different problems. This is why the number of competing databases is so large. NoSQL databases fall into four main categories:

Document-oriented databases

These databases let you store complex nested documents, whereas most relational databases only support flat rows. This can be useful in many cases, for example, when you need to store information about a user who has several addresses. With a document-oriented database, you can simply store one complex object that includes an array of addresses, whereas in a relational database you would have to create two tables: one for user information and one for addresses. Document-oriented databases bridge the gap between the object model and the data model. Some relational databases, such as PostgreSQL, now also support document-oriented storage, but most relational databases still lack this capability.
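The difference between the two models can be shown with plain Python data structures. The field names and sample values below are invented for illustration, and the lists of dictionaries only stand in for relational tables.

```python
import json

# Document model: the user and all of their addresses live in one nested object.
user_document = {
    "name": "Alice",
    "addresses": [
        {"city": "Berlin", "street": "Hauptstrasse 1"},
        {"city": "Paris", "street": "Rue de Rivoli 2"},
    ],
}

# Relational model: the same information split across two flat tables.
users_table = [{"id": 1, "name": "Alice"}]
addresses_table = [
    {"user_id": 1, "city": "Berlin", "street": "Hauptstrasse 1"},
    {"user_id": 1, "city": "Paris", "street": "Rue de Rivoli 2"},
]


def load_user_relational(user_id):
    """Reassemble the nested object by joining the two tables on the user id."""
    user = next(row for row in users_table if row["id"] == user_id)
    addresses = [
        {"city": row["city"], "street": row["street"]}
        for row in addresses_table
        if row["user_id"] == user_id
    ]
    return {"name": user["name"], "addresses": addresses}


# Both models hold the same data; only the document needs no join to read it.
assert load_user_relational(1) == user_document
print(json.dumps(user_document, indent=2))
```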

Key/Value Databases

Key/value databases typically implement the simplest NoSQL model. Essentially, they provide you with a distributed hash table, allowing you to write data under a given key and read it back using that key. Key/value databases are highly scalable and typically have lower latency than other types of databases.
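A distributed hash table can be approximated in a few lines. This is a minimal sketch, not the design of any particular product: the class name is hypothetical, each node is just an in-memory dictionary, and the node is chosen by hashing the key modulo the node count.

```python
import hashlib


class KeyValueCluster:
    """Toy distributed hash table: each key lives on the node chosen by its hash."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key decides which node stores it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


cluster = KeyValueCluster(node_count=3)
cluster.put("session:42", {"user": "alice"})
print(cluster.get("session:42"))   # {'user': 'alice'}
print(cluster.get("session:404"))  # None
```

Every operation goes straight to a single node with no joins or scans, which is where the low latency of this model comes from.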

Graph Databases

Many subject areas, for example, social networks or information about films and actors, can be represented as graphs. Although a graph can be stored in a relational database, doing so is difficult and inconvenient. If you need graph data, it is better to use a specialized graph database, which can store the graph in a distributed cluster and lets you run graph algorithms efficiently.
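As a small illustration of the films-and-actors case, the sketch below builds a graph in memory and finds how two actors are connected through shared films. The actor and film names are placeholders, and a breadth-first search over a Python dictionary is only a stand-in for the kind of traversal a graph database performs for you.

```python
from collections import deque

# Hypothetical data: which actor appeared in which film.
appearances = [
    ("Actor A", "Film X"),
    ("Actor B", "Film X"),
    ("Actor B", "Film Y"),
    ("Actor C", "Film Y"),
]

# Build an undirected graph as an adjacency map.
graph = {}
for actor, film in appearances:
    graph.setdefault(actor, set()).add(film)
    graph.setdefault(film, set()).add(actor)


def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of nodes from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two nodes are not connected


print(shortest_path(graph, "Actor A", "Actor C"))
# ['Actor A', 'Film X', 'Actor B', 'Film Y', 'Actor C']
```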

Columnar Databases

The main difference between columnar and other types of databases is the way the data is stored on disk. Relational databases typically store each table row by row, keeping all the values of a row together. Columnar databases instead store the values of each column together, for example in a separate file per column. This structure allows you to aggregate data and run certain queries more efficiently, but you must make sure that your data fits the limitations of such databases.
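The two layouts can be compared using ordinary Python lists. The table contents and column names are made up, and in-memory lists only imitate what a real database does with files on disk.

```python
orders = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "FR", "amount": 20},
    {"id": 3, "country": "DE", "amount": 30},
]

# Row-oriented layout: all the values of one row are stored next to each other.
row_layout = [(order["id"], order["country"], order["amount"]) for order in orders]

# Column-oriented layout: all the values of one column are stored next to each other.
column_layout = {
    name: [order[name] for order in orders]
    for name in ("id", "country", "amount")
}

# Summing one column from rows means stepping through every full row...
total_from_rows = sum(row[2] for row in row_layout)
# ...while the columnar layout only needs the single column involved.
total_from_columns = sum(column_layout["amount"])

print(total_from_rows, total_from_columns)  # 60 60
```

The flip side is that reading or updating one complete order in the columnar layout touches every column, which is one of the limitations mentioned above.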

Which database should you choose?

Choosing a database is often frustrating, and with so many options available it can seem like an overwhelming task. The good news is that there is no need to choose just one. Instead of creating a single monolithic application that implements all capabilities and has access to all system data, you can use another modern pattern called microservices: break the application into a set of independent services. Each service solves its own narrow problem and uses only its own database, the one best suited to that problem.

How are you supposed to learn all this?

With so many databases, learning them all can seem like an impossible task. Good news: you don't have to. There are only a few basic types of NoSQL databases, and if you understand how they work, the others will be much easier to understand. Also, some NoSQL databases are used much more often than others, so it's best to focus your efforts on the most popular solutions. Here is a list of the most commonly used NoSQL databases that I think you should take a look at:

MongoDB. Probably the most popular NoSQL database on the market. If a company doesn't use a relational database as its primary data store, it probably uses MongoDB. It is a flexible document store with a good set of tools. Early in its history, MongoDB had a bad reputation for losing data in some cases, but since then its stability and reliability have improved greatly.

DynamoDB. If you use Amazon Web Services (AWS), it is worth learning more about DynamoDB. It is an extremely reliable, scalable, low-latency database with a rich feature set and integration with many other AWS services. The best part is that you don't have to deploy it yourself. Setting up a scalable DynamoDB cluster that can handle thousands of queries is just a few clicks away.

Neo4j. The most common graph database. It is a scalable and stable solution suitable for those who want to use a graph data model.

Redis. While the other databases described here are used to store core application data, Redis is used primarily to implement caches and store auxiliary data. In many cases, one of the databases mentioned above is used in tandem with Redis.

NoSQL in 2018

NoSQL databases are a vast and rapidly growing field. They allow you to store and process previously unimaginable amounts of data, but this comes at a cost. These databases lack many of the features you're familiar with from relational databases, and it can be difficult to adjust to using them. But once you get the hang of them, you can create scalable, distributed databases that can handle astonishing volumes of read and write requests, which can be extremely important as larger and larger volumes of data are generated.

Original: https://simpleprogrammer.com/guide-nosql-software-developers/