• 沒有找到結果。

Introduction to NoSQL

Join the NoSQL movement

6.1 Introduction to NoSQL

As you’ve read, the goal of NoSQL databases isn’t only to offer a way to partition data-bases successfully over multiple nodes, but also to present fundamentally different ways to model the data at hand to fit its structure to its use case and not to how a rela-tional database requires it to be modeled.

To help you understand NoSQL, we’re going to start by looking at the core ACID principles of single-server relational databases and show how NoSQL databases rewrite them into BASE principles so they’ll work far better in a distributed fashion. We’ll also look at the CAP theorem, which describes the main problem with distributing data-bases across multiple nodes and how ACID and BASE databases approach it.

6.1.1 ACID: the core principle of relational databases

The main aspects of a traditional relational database can be summarized by the con-cept ACID:

Atomicity—The “all or nothing” principle. If a record is put into a database, it’s put in completely or not at all. If, for instance, a power failure occurs in the middle of a database write action, you wouldn’t end up with half a record; it wouldn’t be there at all.

Consistency—This important principle maintains the integrity of the data. No entry that makes it into the database will ever be in conflict with predefined rules, such as lacking a required field or a field being numeric instead of text.

Isolation—When something is changed in the database, nothing can happen on this exact same data at exactly the same moment. Instead, the actions happen in serial with other changes. Isolation is a scale going from low isolation to high isolation. On this scale, traditional databases are on the “high isolation” end.

An example of low isolation would be Google Docs: Multiple people can write to a document at the exact same time and see each other’s changes happening instantly. A traditional Word document, on the other end of the spectrum, has high isolation; it’s locked for editing by the first user to open it. The second per-son opening the document can view its last saved version but is unable to see unsaved changes or edit the document without first saving it as a copy. So once someone has it opened, the most up-to-date version is completely isolated from anyone but the editor who locked the document.

Durability—If data has entered the database, it should survive permanently.

Physical damage to the hard discs will destroy records, but power outages and software crashes should not.

ACID applies to all relational databases and certain NoSQL databases, such as the graph database Neo4j. We’ll further discuss graph databases later in this chapter and in chapter 7. For most other NoSQL databases another principle applies: BASE. To understand BASE and why it applies to most NoSQL databases, we need to look at the CAP Theorem.

6.1.2 CAP Theorem: the problem with DBs on many nodes

Once a database gets spread out over different servers, it’s difficult to follow the ACID principle because of the consistency ACID promises; the CAP Theorem points out why this becomes problematic. The CAP Theorem states that a database can be any two of the following things but never all three:

Partition tolerant—The database can handle a network partition or network failure.

Available—As long as the node you’re connecting to is up and running and you can connect to it, the node will respond, even if the connection between the different database nodes is lost.

Consistent—No matter which node you connect to, you’ll always see the exact same data.

For a single-node database it’s easy to see how it’s always available and consistent:

Available—As long as the node is up, it’s available. That’s all the CAP availabil-ity promises.

Consistent—There’s no second node, so nothing can be inconsistent.

Things get interesting once the database gets parti-tioned. Then you need to make a choice between availability and consistency, as shown in figure 6.2.

Let’s take the example of an online shop with a server in Europe and a server in the United States, with a single distribution center. A German named Fritz and an American named Freddy are shopping at the same time on that same online shop. They see an item and only one is still in stock: a bronze, octopus-shaped coffee table. Disaster strikes, and communication between the two local servers is temporarily down. If you were the owner of the shop, you’d have two options:

Availability—You allow the servers to keep on serving customers, and you sort out everything afterward.

Consistency—You put all sales on hold until communication is reestablished.

In the first case, Fritz and Freddy will both buy the octopus coffee table, because the last-known stock number for both nodes is “one” and both nodes are allowed to sell it, as shown in figure 6.3.

If the coffee table is hard to come by, you’ll have to inform either Fritz or Freddy that he won’t receive his table on the promised delivery date or, even worse, he will

Partitioned

Available Consistent

Figure 6.2 CAP Theorem: when partitioning your database, you need to choose between availability and consistency.

never receive it. As a good businessperson, you might compensate one of them with a discount coupon for a later purchase, and everything might be okay after all.

The second option (figure 6.4) involves putting the incoming requests on hold temporarily.

This might be fair to both Fritz and Freddy if after five minutes the web shop is open for business again, but then you might lose both sales and probably many more.

Web shops tend to choose availability over consistency, but it’s not the optimal choice

X

CAP available but not consistent:

both Fritz and Freddy order last available item

Fritz Local server

Local server Disconnection

Freddy

Figure 6.3 CAP Theorem: if nodes get disconnected, you can choose to remain available, but the data could become inconsistent.

X

CAP consistent but not available:

Orders on hold until local server connection restored

Fritz Local server

Local server Disconnection

Freddy

X X

Figure 6.4 CAP Theorem: if nodes get disconnected, you can choose to remain consistent by stopping access to the databases until connections are restored

in all cases. Take a popular festival such as Tomorrowland. Festivals tend to have a maximum allowed capacity for safety reasons. If you sell more tickets than you’re allowed because your servers kept on selling during a node communication failure, you could sell double the number allowed by the time communications are reestab-lished. In such a case it might be wiser to go for consistency and turn off the nodes temporarily. A festival such as Tomorrowland is sold out in the first couple of hours anyway, so a little downtime won’t hurt as much as having to withdraw thousands of entry tickets.

6.1.3 The BASE principles of NoSQL databases

RDBMS follows the ACID principles; NoSQL databases that don’t follow ACID, such as the document stores and key-value stores, follow BASE. BASE is a set of much softer database promises:

Basically available—Availability is guaranteed in the CAP sense. Taking the web shop example, if a node is up and running, you can keep on shopping. Depend-ing on how thDepend-ings are set up, nodes can take over from other nodes. Elastic-search, for example, is a NoSQL document–type search engine that divides and replicates its data in such a way that node failure doesn’t necessarily mean service failure, via the process of sharding. Each shard can be seen as an indi-vidual database server instance, but is also capable of communicating with the other shards to divide the workload as efficiently as possible (figure 6.5). Sev-eral shards can be present on a single node. If each shard has a replica on another node, node failure is easily remedied by re-dividing the work among the remaining nodes.

Soft state—The state of a system might change over time. This corresponds to the eventual consistency principle: the system might have to change to make the data

Shard 1

Figure 6.5 Sharding: each shard can function as a self-sufficient database, but they also work together as a whole. The example represents two nodes, each containing four shards: two main shards and two replicas. Failure of one node is backed up by the other.

consistent again. In one node the data might say “A” and in the other it might say “B” because it was adapted. Later, at conflict resolution when the network is back online, it’s possible the “A” in the first node is replaced by “B.” Even though no one did anything to explicitly change “A” into “B,” it will take on this value as it becomes consistent with the other node.

Eventual consistency—The database will become consistent over time. In the web shop example, the table is sold twice, which results in data inconsistency. Once the connection between the individual nodes is reestablished, they’ll communi-cate and decide how to resolve it. This conflict can be resolved, for example, on a first-come, first-served basis or by preferring the customer who would incur the lowest transport cost. Databases come with default behavior, but given that there’s an actual business decision to make here, this behavior can be overwrit-ten. Even if the connection is up and running, latencies might cause nodes to become inconsistent. Often, products are kept in an online shopping basket, but putting an item in a basket doesn’t lock it for other users. If Fritz beats Freddy to the checkout button, there’ll be a problem once Freddy goes to check out. This can easily be explained to the customer: he was too late. But what if both press the checkout button at the exact same millisecond and both sales happen?

ACID versus BASE

The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid is a fluid with a low pH value. A base is the opposite and has a high pH value.

We won’t go into the chemistry details here, but figure 6.6 shows a mnemonic to those familiar with the chemistry equivalents of acid and base.

0 1 2 3 6 7 10

ACID

• Atomicity

• Consistency

• Isolation

• Durability

BASE

• Basically available

• Soft state

• Eventual consistency 8

4 5 9 11 12 13 14

Figure 6.6 ACID versus BASE: traditional relational databases versus most NoSQL databases. The names are derived from the chemistry concept of the pH scale. A pH value below 7 is acidic; higher than 7 is a base. On this scale, your average surface water fluctuates between 6.5 and 8.5.

6.1.4 NoSQL database types

As you saw earlier, there are four big NoSQL types: key-value store, document store, column-oriented database, and graph database. Each type solves a problem that can’t be solved with relational databases. Actual implementations are often combinations of these. OrientDB, for example, is a multi-model database, combining NoSQL types.

OrientDB is a graph database where each node is a document.

Before going into the different NoSQL databases, let’s look at relational databases so you have something to compare them to. In data modeling, many approaches are possible. Relational databases generally strive toward normalization: making sure every piece of data is stored only once. Normalization marks their structural setup.

If, for instance, you want to store data about a person and their hobbies, you can do so with two tables: one about the person and one about their hobbies. As you can see in figure 6.7, an additional table is necessary to link hobbies to persons because of their many-to-many relationship: a person can have multiple hobbies and a hobby can have many persons practicing it.

A full-scale relational database can be made up of many entities and linking tables.

Now that you have something to compare NoSQL to, let’s look at the different types.

Person info table: represents

Hobby ID Hobby Name Hobby Description Shooting arrows

Figure 6.7 Relational databases strive toward normalization (making sure every piece of data is stored only once). Each table has unique identifiers (primary keys) that are used to model the relationship between the entities (tables), hence the term relational.

COLUMN-ORIENTEDDATABASE

Traditional relational databases are row-oriented, with each row having a row id and each field within the row stored together in a table. Let’s say, for example’s sake, that no extra data about hobbies is stored and you have only a single table to describe peo-ple, as shown in figure 6.8. Notice how in this scenario you have slight denormaliza-tion because hobbies could be repeated. If the hobby informadenormaliza-tion is a nice extra but not essential to your use case, adding it as a list within the Hobbies column is an acceptable approach. But if the information isn’t important enough for a separate table, should it be stored at all?

Every time you look up something in a row-oriented database, every row is scanned, regardless of which columns you require. Let’s say you only want a list of birthdays in September. The database will scan the table from top to bottom and left to right, as shown in figure 6.9, eventually returning the list of birthdays.

Indexing the data on certain columns can significantly improve lookup speed, but indexing every column brings extra overhead and the database is still scanning all the columns.

Figure 6.8 Row-oriented database layout. Every entity (person) is represented by a single row, spread over multiple columns.

Name

Figure 6.9 Row-oriented lookup: from top to bottom and for every entry, all columns are taken into memory

Column databases store each column separately, allowing for quicker scans when only a small number of columns is involved; see figure 6.10.

This layout looks very similar to a row-oriented database with an index on every col-umn. A database index is a data structure that allows for quick lookups on data at the cost of storage space and additional writes (index update). An index maps the row number to the data, whereas a column database maps the data to the row numbers; in that way counting becomes quicker, so it’s easy to see how many people like archery, for instance. Storing the columns separately also allows for optimized compression because there’s only one data type per table.

When should you use a row-oriented database and when should you use a column-oriented database? In a column-column-oriented database it’s easy to add another column because none of the existing columns are affected by it. But adding an entire record requires adapting all tables. This makes the row-oriented database preferable over the column-oriented database for online transaction processing (OLTP), because this implies adding or changing records constantly. The column-oriented database shines when performing analytics and reporting: summing values and counting entries. A row-oriented database is often the operational database of choice for actual transactions (such as sales). Overnight batch jobs bring the column-oriented database up to date, supporting lightning-speed lookups and aggregations using MapReduce algorithms for reports. Examples of column-family stores are Apache HBase, Facebook’s Cassandra, Hypertable, and the grandfather of wide-column stores, Google BigTable.

KEY-VALUESTORES

Key-value stores are the least complex of the NoSQL databases. They are, as the name suggests, a collection of key-value pairs, as shown in figure 6.11, and this simplicity makes them the most scalable of the NoSQL database types, capable of storing huge amounts of data.

Figure 6.10 Column-oriented databases store each column separately with the related row numbers. Every entity (person) is divided over multiple tables.

The value in a key-value store can be anything: a string, a number, but also an entire new set of key-value pairs encapsulated in an object. Figure 6.12 shows a slightly more complex key-value structure. Examples of key-value stores are Redis, Voldemort, Riak, and Amazon’s Dynamo.

DOCUMENTSTORES

Document stores are one step up in complexity from key-value stores: a document store does assume a certain document structure that can be specified with a schema.

Document stores appear the most natural among the NoSQL database types because they’re designed to store everyday documents as is, and they allow for complex query-ing and calculations on this often already aggregated form of data. The way thquery-ings are stored in a relational database makes sense from a normalization point of view: every-thing should be stored only once and connected via foreign keys. Document stores care little about normalization as long as the data is in a structure that makes sense. A relational data model doesn’t always fit well with certain business cases. Newspapers or magazines, for example, contain articles. To store these in a relational database, you need to chop them up first: the article text goes in one table, the author and all the information about the author in another, and comments on the article when published on a website go in yet another. As shown in figure 6.13, a newspaper article

Name Key

Jos The Boss Value

Birthday 11-12-1985

Archery, conquering the world

Hobbies Figure 6.11 Key-value stores

store everything as a key and a value.

Figure 6.12 Key-value nested structure

{

"articles": [ {

"title": "title of the article",

"articleID": 1,

"body": "body of the article",

"author": "Isaac Asimov",

"comments": [ {

"username": "Fritz"

"join date": "1/4/2014"

"commentid": 1,

"body": "this is a great article",

"replies": [ {

"username": "Freddy",

"join date": "11/12/2013",

"commentid": 2,

"body": "seriously? it's rubbish"

} ] }, {

"username": "Stark",

"join date": "19/06/2011",

"commentid": 3,

"body": "I don’t agree with the conclusion"

} ] } ] }

Document store approach

Relational database approach

Author name

Comment table Reader table

Author table Article table

Figure 6.13 Document stores save documents as a whole, whereas an RDMS cuts up the article and saves it in several tables. The example was taken from the Guardian website.

can also be stored as a single entity; this lowers the cognitive burden of working with the data for those used to seeing articles all the time. Examples of document stores are MongoDB and CouchDB.

GRAPHDATABASES

The last big NoSQL database type is the most complex one, geared toward storing relations between entities in an efficient manner. When the data is highly intercon-nected, such as for social networks, scientific paper citations, or capital asset clusters, graph databases are the answer. Graph or network data has two main components:

Node —The entities themselves. In a social network this could be people.

Edge —The relationship between two entities. This relationship is represented by a line and has its own properties. An edge can have a direction, for example, if the arrow indicates who is whose boss.

Graphs can become incredibly complex given enough relation and entity types. Fig-ure 6.14 already shows that complexity with only a limited number of entities. Graph

Graphs can become incredibly complex given enough relation and entity types. Fig-ure 6.14 already shows that complexity with only a limited number of entities. Graph