Developer's Guide to NoSQL

If you follow trends in backend development and Big Data, you have probably noticed the hype around NoSQL databases in recent years. Some people find this approach to databases inspiring, while others suspect there is a catch: the data models differ from those of familiar relational databases, the application programming interfaces are unusual, and their applications are often unclear.

In this article I will explain why NoSQL databases were created in the first place, what problems they solve, and why so many different databases are suddenly needed. If you are new to NoSQL, you might be particularly interested in the last part of the article, which lists the databases of this type that I think are worth exploring first to get a solid understanding of the area.

Why do we suddenly need a new database?

You may wonder: what is wrong with relational databases? They have worked really well for years, but now there is a problem they can no longer handle. According to some predictions, in 2018 humanity will generate 50,000 gigabytes of data per second. This is a huge amount of data, and storing and processing it is a serious engineering challenge. Worse still, this volume is constantly growing.

As it turns out, relational databases are ill-suited to dealing with really large amounts of data. They are designed to run on a single machine, and if you want to handle more requests, your only option is to buy a computer with more RAM and a more powerful processor. Unfortunately, the number of requests a single machine can handle is limited, and for distributed work across multiple machines we need a different database technology.

Of course, some readers will chuckle at this point and say that there are two widely used ways to run a relational database on multiple machines: replication and sharding. That is true, but these methods are not enough for our purposes.

Read replication is a technique in which each database update is propagated to other machines that can only process read requests. All changes are performed by a single server, called the leader, while the other servers, called read replicas, only maintain copies of the data. The user can read from any of the machines but can change the data only through the leader. This is a convenient and very popular method, but it only allows you to process more read requests and does nothing to help with writing larger amounts of data.

Figure: a leader (reads and writes) with read replicas (read-only).

Sharding is another popular approach that uses multiple instances of a relational database. Each of them handles write and read operations for a piece of data. If the database stores, for example, information about customers, using sharding, one machine can process all requests for customers whose names begin with A, another can store all data about customers whose names begin with B, and so on.

Figure: multiple master nodes, each reading and writing its own part of the data.
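The name-based routing described above can be sketched in a few lines of Python. This is only an illustration: the shard names and the choice of four shards (each covering a range of letters rather than a single letter) are assumptions made for the example, not something a particular database prescribes.

```python
import string

# Hypothetical shard names; a real cluster would map these to host addresses.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(customer_name: str) -> str:
    """Pick a shard from the first letter of the customer's name."""
    first = customer_name.strip().upper()[:1]
    if not first or first not in string.ascii_uppercase:
        # Empty names, digits, and non-ASCII letters all go to the last shard.
        return SHARDS[-1]
    # Spread the 26 letters evenly over the available shards.
    index = (ord(first) - ord("A")) * len(SHARDS) // 26
    return SHARDS[index]


print(shard_for("Alice"))
print(shard_for("Zoe"))
```

Note that changing the number of entries in SHARDS changes where most names are routed, which is exactly the rebalancing problem that makes sharding hard to operate.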

Although sharding allows you to write more data, managing such a database is a real nightmare: you have to balance the data across machines and scale the cluster up and down as needed. While this sounds simple in theory, getting it right is quite a challenge.

Can relational databases be improved?

By now you are probably convinced that relational databases are poorly suited to the amount of data generated in today's world. You may still be wondering why no one has created an "improved" relational database that runs efficiently across multiple machines. It may seem that the technology is simply not mature yet and that distributed relational databases will appear very soon. Alas, this will not happen: it is mathematically impossible, and nothing can be done about it.

To understand why, we need to turn to the so-called CAP theorem (also known as Brewer's theorem). It was formulated by Eric Brewer in the late 1990s and formally proven in 2002, and it concerns three properties that a distributed database running on multiple machines can have:

Consistency - any read operation returns the result of the most recent corresponding write operation. If the system is consistent, then after new data is written it is impossible to read the old, overwritten data.
Availability - the distributed system can serve an incoming request at any time and return a non-error response.
Partition tolerance - the database continues to respond to read and write requests even if some of its servers are temporarily unable to communicate with each other. Such a temporary failure is called a network partition and can have many causes, ranging from a slow server to physical damage to network equipment.

All of these properties are certainly useful, and we would love to have a database that combines them all. No sane developer would be willing to give up, say, availability without getting something in return. Unfortunately, the CAP theorem states that it is impossible to have all three properties at the same time. This may not be obvious at first, but the reasoning is easy to follow.

First, if we want a distributed database, it must be partition tolerant. This is non-negotiable: network partitions happen all the time, and our database has to keep working despite them. Now let's see why we can't achieve both consistency and availability. Imagine that we have a simple database running on two machines, A and B. Any user can write to either machine, after which the data is copied to the other.

Now imagine that these machines are temporarily unable to communicate with each other, and machine B cannot send data to or receive data from machine A. If machine B receives a read request from a client during this period of time, it has two options:

Return its local data, even if it is not the most recent. In this case, availability is preferred (return at least some data, even if it is stale).
Return an error. In this case, consistency is preferred: the client will not receive stale data, but it will not receive any data at all.

Figure: a network partition between machines A and B.
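The two options open to machine B can be illustrated with a small Python sketch. The Node class, its prefer_availability flag, and the use of ConnectionError are all hypothetical names chosen for this example; real databases expose this choice through their own configuration.

```python
class Node:
    """One replica that may be cut off from its peer by a network partition."""

    def __init__(self, prefer_availability: bool):
        self.prefer_availability = prefer_availability
        self.data = {}
        self.partitioned = False  # True while the peer is unreachable

    def read(self, key):
        if not self.partitioned:
            # Normal operation: the replicas are in sync, so local data is current.
            return self.data[key]
        if self.prefer_availability:
            # Availability is preferred: answer with local, possibly stale, data.
            return self.data[key]
        # Consistency is preferred: refuse to answer rather than risk stale data.
        raise ConnectionError("cannot confirm the latest value during a partition")


available_b = Node(prefer_availability=True)
available_b.data["balance"] = 100
available_b.partitioned = True
print(available_b.read("balance"))  # served from local data, which may be stale
```

A node created with prefer_availability=False raises an error in the same situation, which is the second option from the list above.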

Relational databases strive to provide both consistency and availability at the same time, and therefore cannot operate in a distributed environment. Trying to implement every feature of a relational database in a distributed system is either impractical or outright impossible. NoSQL databases, on the other hand, put the emphasis on scalability and performance. They usually lack such "basic" features as joins and transactions, and their data models are quite different, in some ways even restrictive. All this makes it possible to store far more data and process far more requests than was possible before.

How do NoSQL databases balance consistency and availability?

It may seem that by choosing a NoSQL database, you will always get either stale data or an error whenever a failure occurs. In practice, full availability and full consistency are only the two extremes, and there is a wide range of options in between. Relational databases don't offer this choice, but NoSQL databases let you control how each query is executed. Typically, they let you set two parameters when performing a read or write operation:

W - how many machines in the cluster must confirm that the data is saved when performing a write. The more machines you write your data to, the more likely the next read is to return the most recent data, but the longer the write takes.
R - how many machines you would like to read data from. In a distributed system, it may take some time for data to spread across all the machines in the cluster, so some servers will have up-to-date data while others lag behind. The more machines you read from, the higher the chance of reading up-to-date data.

Let's consider a practical example. If there are five computers in your cluster, and you write data to only one of them and then read from one randomly selected computer, then until the write has spread to the other machines you will read stale data with a probability of 80%. On the other hand, this uses a minimum of resources. So, if stale data is acceptable, this is not such a bad option. In this case, W and R are both equal to 1.
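The 80% figure can be checked with a short calculation. The sketch below assumes a simplified model that is not taken from any particular database: the latest write has reached exactly W of the N machines, nothing has been replicated in the background yet, and the reader picks R distinct machines uniformly at random. The function name is made up for this example.

```python
from math import comb


def stale_read_probability(n: int, w: int, r: int) -> float:
    """Chance that none of the r machines read holds the latest write."""
    if r > n - w:
        # There are not enough outdated machines to fill the whole read set.
        return 0.0
    # Ways to pick r machines among the n - w outdated ones,
    # divided by all ways to pick r machines out of n.
    return comb(n - w, r) / comb(n, r)


print(stale_read_probability(5, 1, 1))  # 0.8
```

Raising either W or R shrinks this probability, at the cost of slower writes or reads.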

On the other hand, if you write data to all five machines in a NoSQL database, you can read from any machine and are guaranteed to get up-to-date data every time. Performing the same operation on more machines takes longer, but if up-to-date data is important to you, you can choose this option. In this case, W = 5 and a single read (R = 1) is already enough.

What is the minimum number of reads and writes required for consistency? Here is a simple formula:

R + W ≥ N + 1

where N is the number of machines in the cluster. With five servers, you can choose, for example, R = 2 and W = 4, or R = 3 and W = 3, or R = 4 and W = 2. In each case, it does not matter which machines the data is written to: every read will include at least one machine with up-to-date data.
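The formula is easy to turn into a small check. The following sketch uses hypothetical helper names and lists the smallest R and W combinations that satisfy R + W ≥ N + 1 for a given cluster size.

```python
def is_consistent(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set."""
    return r + w >= n + 1


def minimal_quorums(n: int) -> list:
    """All (R, W) pairs that satisfy the formula with no machines to spare."""
    return [(r, n + 1 - r) for r in range(1, n + 1)]


print(minimal_quorums(5))  # [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
print(is_consistent(5, 1, 1))  # False
```

The pairs (1, 5) and (5, 1) are the extreme cases: write everywhere and read once, or write once and read everywhere.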

Other databases, such as DynamoDB, have different restrictions and only allow consistent writes. Each data element is stored on three servers, and when any data is written, it is written to two of the three machines. But when reading data, you can choose one of two options:

Strongly consistent read, in which data is read from two machines out of three and the last written data is always returned.
Eventually consistent read, in which one machine is selected at random to read from. In this case, stale data may temporarily be returned.
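The scheme described above (three copies, writes to two, reads from two or one) can be modelled with a toy simulation. This is not the DynamoDB API and says nothing about how the service is implemented internally. The class and method names are invented for the example, and version numbers stand in for whatever mechanism a real database uses to tell newer data from older.

```python
import random


class ThreeReplicaItem:
    """Toy model: one item stored on three replicas, each write reaches two."""

    def __init__(self, value):
        self.version = 0
        # Each replica holds a (version, value) pair.
        self.replicas = [(0, value)] * 3

    def write(self, value):
        self.version += 1
        for index in random.sample(range(3), 2):  # W = 2
            self.replicas[index] = (self.version, value)

    def read(self, consistent: bool):
        count = 2 if consistent else 1  # R = 2 or R = 1
        picked = random.sample(self.replicas, count)
        # Among the replicas consulted, the highest version wins.
        return max(picked, key=lambda pair: pair[0])[1]


item = ThreeReplicaItem("v0")
item.write("v1")
print(item.read(consistent=True))   # always "v1", because 2 + 2 >= 3 + 1
print(item.read(consistent=False))  # "v1", or "v0" if the stale replica is picked
```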

Why are there so many NoSQL databases?

If you follow the latest software development news, you have probably heard of many different NoSQL databases, such as MongoDB, DynamoDB, Cassandra, Redis, and many more. You might be wondering: why do we need so many of them? The reason is simple: different NoSQL databases are designed to solve different problems. That is why the number of competing databases is so large. NoSQL databases fall into four main categories:

Document-oriented databases

These databases can store complex nested documents, while most relational databases only support flat rows. This feature is useful in many cases, for example, if you need to store information about a user who has multiple addresses. With a document-oriented database, you can simply store a complex object that includes an array of addresses, while in a relational database you would have to create two tables: one for user information and one for addresses. Document-oriented databases close the gap between the object model and the data model. Some relational databases, such as PostgreSQL, now also support document storage, but most relational databases still lack this capability.
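To make the difference concrete, here is the user-with-addresses example written out in Python. Plain dictionaries and lists stand in for a document and for relational tables, and every name and value is made up for illustration.

```python
# Document model: one nested object holds the user and all addresses.
user_document = {
    "name": "Alice",
    "addresses": [
        {"city": "Berlin", "street": "Example Street 1"},
        {"city": "Paris", "street": "Example Street 2"},
    ],
}

# Relational model: the same information split across two flat tables.
users_table = [{"user_id": 1, "name": "Alice"}]
addresses_table = [
    {"user_id": 1, "city": "Berlin", "street": "Example Street 1"},
    {"user_id": 1, "city": "Paris", "street": "Example Street 2"},
]


def load_user_relational(user_id: int) -> dict:
    """Rebuild the nested object by joining the two tables on user_id."""
    user = next(row for row in users_table if row["user_id"] == user_id)
    addresses = [
        {"city": row["city"], "street": row["street"]}
        for row in addresses_table
        if row["user_id"] == user_id
    ]
    return {"name": user["name"], "addresses": addresses}


print(load_user_relational(1) == user_document)  # True
```

The join step is the work that a document-oriented database saves the application from doing.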

Key/value databases

Key/value databases usually implement the simplest NoSQL model. Essentially, they provide you with a distributed hash table that lets you write data under a given key and read it back using the same key. Key/value databases are highly scalable and stand out from other databases with significantly lower latency.
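The idea of a distributed hash table can be sketched as follows. In this simplified model, each "node" is just an in-memory dictionary and the key's hash decides which node owns it. The class name and the modulo-based placement are assumptions for the example. Real systems typically use more elaborate placement schemes and add replication.

```python
import hashlib


class TinyKeyValueStore:
    """A toy hash table spread over several in-memory nodes."""

    def __init__(self, node_count: int = 3):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> dict:
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key: str, value) -> None:
        self._node_for(key)[key] = value

    def get(self, key: str, default=None):
        return self._node_for(key).get(key, default)


store = TinyKeyValueStore()
store.put("session:42", "alice")
print(store.get("session:42"))  # alice
```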

Graph databases

Many subject areas, such as social networks or information about films and actors, can be represented as graphs. Although a graph can also be represented using a relational database, it is difficult and inconvenient. If you need graph data, it is better to use a specialized graph database that can store information about the graph in a distributed cluster and allows efficient implementation of algorithms on graphs.
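As a small illustration of the films-and-actors example, the sketch below stores a graph as an adjacency map and runs a breadth-first search over it. The actor and film names are placeholders, and this is plain single-machine Python rather than the query language of any graph database.

```python
from collections import deque

# Hypothetical data: each pair means "this actor appeared in this film".
appearances = [
    ("Actor A", "Film 1"),
    ("Actor B", "Film 1"),
    ("Actor B", "Film 2"),
    ("Actor C", "Film 2"),
]


def build_graph(edges):
    """Turn a list of pairs into an undirected adjacency map."""
    graph = {}
    for left, right in edges:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def hops_between(graph, start, goal):
    """Breadth-first search; returns the number of edges or None."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


graph = build_graph(appearances)
print(hops_between(graph, "Actor A", "Actor C"))  # 4
```

Expressing the same traversal in SQL would require repeated self-joins or recursive queries, which is the inconvenience mentioned above.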

Columnar databases

The main difference between columnar and other types of databases is the way data is stored on disk. Relational databases create a file for each table and store values for all rows sequentially. Columnar databases create a file for each column in your tables. This structure allows you to aggregate data and run certain queries more efficiently, but you must ensure that the data fits within the constraints of such databases.
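The difference between the two layouts can be shown with in-memory lists standing in for files on disk. The table contents and the to_columns helper are invented for this sketch.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "FR", "amount": 25},
    {"id": 3, "country": "DE", "amount": 5},
]


def to_columns(records):
    """Column-oriented layout: one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


columns = to_columns(rows)

# An aggregate over one column only needs that column's values,
# instead of scanning every field of every row.
print(sum(columns["amount"]))  # 40
```

On disk, this means an aggregation query reads one column's file and skips the rest, which is where the efficiency gain comes from.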

Which database to choose?

Choosing a database is usually a painful problem, and with so many options available, it can seem like an impossible task. The good news is that there is no need to choose just one. Instead of creating a single monolithic application that implements all the features and has access to all the data in the system, you can use another modern pattern called "microservices": break the application into a set of independent services. Each service solves its own narrow task and uses only its own database, the one best suited to that task.

How are you going to learn all this?

With so many databases, learning them all can seem like an impossible task. The good news is that you don't have to. There are only a few basic types of NoSQL databases, and once you understand how they work, the others will be much easier to pick up. Also, some NoSQL databases are used far more than others, so it's best to focus on the most popular solutions. Here is a list of the most commonly used NoSQL databases that I think you should take a look at:

MongoDB. Probably the most popular NoSQL database on the market. If a company isn't using a relational database as its primary data store, then it is probably using MongoDB. It is a flexible document store with a good set of tools. Early in its "career" MongoDB did not have the best reputation, because in some cases it lost data, but its stability and reliability have improved a lot since then. Take a look at this MongoDB course if you want to learn more.

DynamoDB. If you're using Amazon Web Services (AWS), you'd better learn more about DynamoDB. It is an exceptionally reliable, low-latency, scalable database with a rich set of features and integration with many other AWS services. And the best part is that you don't have to deploy it yourself. You can set up a scalable DynamoDB cluster capable of handling thousands of queries with just a few clicks. If this interests you, you can take a look at this course.

Neo4j. The most common graph database. It is a scalable and stable solution suitable for those who want to use a graph data model. If you want to learn more, start with this course.

Redis. While the other databases described here are used to store an application's main data, Redis is mainly used to implement a cache and store auxiliary data. In many cases, one of the databases above is used in tandem with Redis. To learn more, check out this course.

Into 2018 with NoSQL

NoSQL databases are a vast and rapidly growing area. They let you store and process previously unheard-of amounts of data, but this comes at a price. These databases lack many of the features you are used to in relational databases, and adjusting to them can be tricky. But once you get the hang of it, you can build scalable distributed databases that handle amazing volumes of reads and writes, which can make a huge difference as more and more data is generated.

Original: https://simpleprogrammer.com/guide-nosql-software-developers/