Normalization versus Denormalization - MongoDB: The Definitive Guide

There are many ways of representing data and one of the most important issues is how much you should normalize your data. Normalization is dividing up data into multiple collections with references between collections. Each piece of data lives in one collection although multiple documents may reference it. Thus, to change the data, only one document must be updated. However, MongoDB has no joining facilities, so gathering documents from multiple collections will require multiple queries.

Denormalization is the opposite of normalization: embedding all of the data in a single document. Instead of documents containing references to one definitive copy of the data, many documents may have copies of the data. This means that multiple documents need to be updated if the information changes but that all related data can be fetched with a single query.

Deciding when to normalize and when to denormalize can be difficult: typically, normalizing makes writes faster and denormalizing makes reads faster. Thus, you need to find what trade-offs make sense for your application.


Examples of Data Representations

Suppose we are storing information about students and the classes that they are taking.

One way to represent this would be to have a students collection (each student is one document) and a classes collection (each class is one document). Then we could have a third collection (studentClasses) that contains references to the student and classes he is taking:

{
    "studentId" : ObjectId("512512a5d86041c7dca81914"),
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

If you are familiar with relational databases, you may have seen this type of join table before, although typically you’d have one student and one class per document (instead of a list of class "_id"s). It’s a bit more MongoDB-ish to put the classes in an array, but you usually wouldn’t want to store the data this way because it requires a lot of querying to get to the actual information.

Suppose we wanted to find the classes a student was taking. We’d query for the student in the students collection, query studentClasses for the course "_id"s, and then query the classes collection for the class information. Thus, finding this information would take three trips to the server. This is generally not the way you want to structure data in MongoDB, unless the classes and students are changing constantly and reading the data does not need to be done quickly.

We can remove one of the dereferencing queries by embedding class references in the student’s document:

{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        ObjectId("512512ced86041c7dca81916"),
        ObjectId("512512dcd86041c7dca81917"),
        ObjectId("512512e6d86041c7dca81918"),
        ObjectId("512512f0d86041c7dca81919")
    ]
}

The "classes" field keeps an array of "_id"s of classes that John Doe is taking. When we want to find out information about those classes, we can query the classes collection with those "_id"s. This only takes two queries. This is a fairly popular way to structure data that does not need to be instantly accessible and changes, but not constantly.
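The difference between the two layouts described so far can be sketched in plain JavaScript. This is only an illustration: in-memory arrays stand in for the collections, strings stand in for ObjectIds, the find helper is a hypothetical stand-in for one query to the server, and the "studentId" field name is an assumption.

```javascript
// In-memory stand-ins for collections (assumption: no real MongoDB server).
const classes = [
  { _id: "512512dcd86041c7dca81917", class: "Physics" },
  { _id: "512512e6d86041c7dca81918", class: "Women in Literature" },
  { _id: "512512f0d86041c7dca81919", class: "AP European History" },
];
const classIds = classes.map((c) => c._id);

// Layout 1: a separate studentClasses collection ("studentId" is a hypothetical field name).
const studentsPlain = [{ _id: "512512a5d86041c7dca81914", name: "John Doe" }];
const studentClasses = [{ studentId: "512512a5d86041c7dca81914", classes: classIds }];

// Layout 2: class references embedded in the student document.
const studentsWithRefs = [
  { _id: "512512a5d86041c7dca81914", name: "John Doe", classes: classIds },
];

let trips = 0; // counts simulated round trips to the server

// Hypothetical helper: every call represents one query.
function find(collection, predicate) {
  trips += 1;
  return collection.filter(predicate);
}

function viaJoinCollection(name) {
  trips = 0;
  const [student] = find(studentsPlain, (s) => s.name === name);
  const [link] = find(studentClasses, (sc) => sc.studentId === student._id);
  const found = find(classes, (c) => link.classes.includes(c._id));
  return { classes: found.map((c) => c.class), trips };
}

function viaEmbeddedRefs(name) {
  trips = 0;
  const [student] = find(studentsWithRefs, (s) => s.name === name);
  const found = find(classes, (c) => student.classes.includes(c._id));
  return { classes: found.map((c) => c.class), trips };
}

const threeStep = viaJoinCollection("John Doe");
const twoStep = viaEmbeddedRefs("John Doe");
console.log("join collection trips:", threeStep.trips);
console.log("embedded references trips:", twoStep.trips);
```

Both functions return the same class names; only the number of round trips differs.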

If we need to optimize reads further, we can get all of the information in a single query by fully denormalizing the data and storing each class as an embedded document in the "classes" field:

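As a rough sketch of what a fully denormalized student document might look like, consider the following. The "credits" and "room" fields and their values are assumptions made for illustration (only the three credits for Physics come from the discussion here), and an in-memory array stands in for the students collection.

```javascript
// Hypothetical fully denormalized student document. The "credits" and "room"
// fields and most of their values are assumptions for illustration.
const students = [
  {
    _id: "512512a5d86041c7dca81914",
    name: "John Doe",
    classes: [
      { class: "Physics", credits: 3, room: "159" },
      { class: "Women in Literature", credits: 3, room: "14b" },
      { class: "AP European History", credits: 4, room: "321" },
    ],
  },
];

// A single lookup returns the student together with every class detail.
const john = students.find((s) => s.name === "John Doe");
const classNames = john.classes.map((c) => c.class);
const totalCredits = john.classes.reduce((sum, c) => sum + c.credits, 0);

console.log(classNames, "total credits:", totalCredits);
```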
The upside of this is that it only takes one query to get the information. The downsides are that it takes up more space and is more difficult to keep in sync. For example, if it turns out that physics was supposed to be four credits (not three) every student in the physics class would need to have her document updated (instead of just updating a central “Physics” document).
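The synchronization cost can be made concrete with a small sketch. It assumes three hypothetical students held in an in-memory array, two of whom take Physics; the setCredits helper is illustrative and not a MongoDB API.

```javascript
// Assumption: three hypothetical students with fully embedded class documents.
const students = [
  {
    name: "John Doe",
    classes: [
      { class: "Physics", credits: 3 },
      { class: "Women in Literature", credits: 3 },
    ],
  },
  { name: "Jane Roe", classes: [{ class: "Physics", credits: 3 }] },
  { name: "Sam Poe", classes: [{ class: "AP European History", credits: 4 }] },
];

// Rewrite every embedded copy of a class and return how many copies changed.
// Each student embeds a class at most once here, so this equals the number
// of student documents that had to be rewritten.
function setCredits(className, credits) {
  let modified = 0;
  for (const student of students) {
    for (const c of student.classes) {
      if (c.class === className && c.credits !== credits) {
        c.credits = credits;
        modified += 1;
      }
    }
  }
  return modified;
}

const modified = setCredits("Physics", 4);
console.log("student documents rewritten:", modified);
```

With a central "Physics" document, the same correction would touch exactly one document regardless of enrollment.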

Finally, you can use a hybrid of embedding and referencing: create an array of subdocuments with the frequently used information, but with a reference to the actual document for more information:

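{
    "_id" : ObjectId("512512a5d86041c7dca81914"),
    "name" : "John Doe",
    "classes" : [
        {
            "_id" : ObjectId("512512ced86041c7dca81916"),
            "class" : "..."
        },
        {
            "_id" : ObjectId("512512dcd86041c7dca81917"),
            "class" : "Physics"
        },
        {
            "_id" : ObjectId("512512e6d86041c7dca81918"),
            "class" : "Women in Literature"
        },
        {
            "_id" : ObjectId("512512f0d86041c7dca81919"),
            "class" : "AP European History"
        }
    ]
}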
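A minimal sketch of how an application might read from the hybrid layout follows. It assumes in-memory stand-ins for the collections, string ids in place of ObjectIds, and hypothetical "credits" and "room" values on the full class documents; classDetails is an illustrative helper, not a driver call.

```javascript
// Student document with embedded class summaries plus references.
const student = {
  _id: "512512a5d86041c7dca81914",
  name: "John Doe",
  classes: [
    { _id: "512512dcd86041c7dca81917", class: "Physics" },
    { _id: "512512e6d86041c7dca81918", class: "Women in Literature" },
    { _id: "512512f0d86041c7dca81919", class: "AP European History" },
  ],
};

// Full class documents; the "credits" and "room" values are hypothetical.
const classes = [
  { _id: "512512dcd86041c7dca81917", class: "Physics", credits: 3, room: "159" },
  { _id: "512512e6d86041c7dca81918", class: "Women in Literature", credits: 3, room: "14b" },
  { _id: "512512f0d86041c7dca81919", class: "AP European History", credits: 4, room: "321" },
];

// Listing class names needs no second lookup: the summary is embedded.
const names = student.classes.map((c) => c.class);

// Details are fetched only on demand, by following the embedded "_id".
function classDetails(classId) {
  return classes.find((c) => c._id === classId) || null;
}

const physics = classDetails(student.classes[0]._id);
console.log(names, physics);
```

The common case (showing a class list) is served from the student document alone, and only a request for details pays for the second query.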

This approach is also a nice option because the amount of information embedded can change over time as your requirements change: if you want to include more or less information on a page, you could embed more or less of it in the document.

Another important consideration is how often this information will change versus how often it’s read. If it will be updated regularly, then normalizing it is a good idea. However, if it changes infrequently, then there is little benefit in optimizing the update process at the expense of every read your application performs.

For example, a textbook normalization use case is to store a user and his address in separate collections. However, people almost never change their address, so you generally shouldn’t penalize every read on the off chance that someone’s moved. Your application should embed the address in the user document.

If you decide to use embedded documents and you need to update them, you should set up a cron job to ensure that any updates you do are successfully propagated to every document. For example, you might attempt to do a multiupdate but the server crashes before all of the documents have been updated. You need a way to detect this and retry the update.
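One possible shape for such a repair pass is sketched below. It assumes that a canonical copy of each class still exists in a classes collection (as in the hybrid layout), that in-memory arrays stand in for the collections, and that an earlier multiupdate was interrupted after reaching only the first student. The function name and data are hypothetical.

```javascript
// Canonical class documents (the source of truth in this sketch).
const classes = [{ _id: "512512dcd86041c7dca81917", class: "Physics", credits: 4 }];

// Students with embedded copies; the second one was missed by the
// interrupted update and still carries the old value.
const students = [
  {
    name: "John Doe",
    classes: [{ _id: "512512dcd86041c7dca81917", class: "Physics", credits: 4 }],
  },
  {
    name: "Jane Roe",
    classes: [{ _id: "512512dcd86041c7dca81917", class: "Physics", credits: 3 }],
  },
];

// Hypothetical repair pass: compare each embedded copy with its canonical
// document, fix stale copies, and report which students were repaired.
function repairEmbeddedClasses() {
  const canonical = new Map(classes.map((c) => [c._id, c]));
  const repaired = [];
  for (const student of students) {
    let stale = false;
    for (const embedded of student.classes) {
      const source = canonical.get(embedded._id);
      if (source && embedded.credits !== source.credits) {
        embedded.credits = source.credits;
        stale = true;
      }
    }
    if (stale) repaired.push(student.name);
  }
  return repaired;
}

const firstPass = repairEmbeddedClasses();
const secondPass = repairEmbeddedClasses();
console.log("repaired on first pass:", firstPass);
console.log("repaired on second pass:", secondPass);
```

Because the pass only rewrites copies that differ from the canonical document, running it repeatedly from a scheduled job is harmless: a second run finds nothing to fix.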

To some extent, the more information you are generating, the less of it you should embed. If the embedded fields or the number of embedded fields is expected to grow without bound, then they should generally be referenced, not embedded. Things like comment trees or activity lists should be stored as their own documents, not embedded.

Finally, include fields that are integral to the data in the document. If a field is almost always excluded from your results when you query for this document, it’s a good sign that it may belong in another collection. These guidelines are summarized in Table 8-1.


Table 8-1. Comparison of embedding versus references

Embedding is better for...                                     | References are better for...
Small subdocuments                                             | Large subdocuments
Data that does not change regularly                            | Volatile data
When eventual consistency is acceptable                        | When immediate consistency is necessary
Documents that grow by a small amount                          | Documents that grow a large amount
Data that you’ll often need to perform a second query to fetch | Data that you’ll often exclude from the results
Fast reads                                                     | Fast writes

Suppose we had a users collection. Here are some example fields we might have and whether or not they should be embedded:

Account preferences

They are only relevant to this user document, and will probably be exposed with other user information in this document. Account preferences should generally be embedded.

Recent activity

This one depends on how much recent activity grows and changes. If it is a fixed-size field (last 10 things), it might be useful to embed.

Friends

Generally this should not be embedded, or at least not fully. See the advice on social networking in the section below.

All of the content this user has produced

No.
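For the "Recent activity" case, a fixed-size embedded list might be maintained along these lines. This is a sketch in plain JavaScript with a hypothetical user object and helper; the limit of 10 follows the "last 10 things" example.

```javascript
const MAX_RECENT = 10; // "last 10 things"

// Hypothetical helper: append an activity and keep only the newest entries,
// so the embedded array never grows past MAX_RECENT.
function recordActivity(user, activity) {
  user.recentActivity.push(activity);
  if (user.recentActivity.length > MAX_RECENT) {
    user.recentActivity = user.recentActivity.slice(-MAX_RECENT);
  }
}

const user = { name: "John Doe", recentActivity: [] };
for (let i = 1; i <= 25; i++) {
  recordActivity(user, `event ${i}`);
}

console.log(user.recentActivity.length, user.recentActivity[0]);
```

Against a real server, a "$push" update using the "$each" and "$slice" modifiers can apply the same cap in a single operation.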

Cardinality

Cardinality is how many references a collection has to another collection. Common relationships are one-to-one, one-to-many, or many-to-many. For example, suppose we had a blog application. Each post has a title, so that’s a one-to-one relationship. Each author has many posts, so that’s a one-to-many relationship. And posts have many tags and tags refer to many posts, so that’s a many-to-many relationship.

When using MongoDB, it can be conceptually useful to split “many” into subcategories: “many” and “few.” For example, you might have a one-to-few cardinality between authors and posts: each author only writes a few posts. You might have a many-to-few relationship between blog posts and tags: you probably have many more blog posts than you have tags. However, you’d have a one-to-many relationship between blog posts and comments: each post has many comments.


Determining which relationships are “few” and which are “many” can help you decide what to embed and what to reference. Generally, “few” relationships will work better with embedding, and “many” relationships will work better as references.
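Applied to the blog example, this rule of thumb might lead to a layout like the following sketch, where tags (a "few" relationship) are embedded in the post and comments (a "many" relationship) live in their own collection. All names and data are hypothetical, and in-memory arrays stand in for the collections.

```javascript
// Posts embed their tags: each post has only a few of them.
const posts = [
  { _id: 1, title: "Schema design", author: "John Doe", tags: ["mongodb", "design"] },
];

// Comments are separate documents that reference their post, because a
// single post may accumulate many of them.
const comments = [
  { _id: 101, postId: 1, text: "Great overview." },
  { _id: 102, postId: 1, text: "What about indexes?" },
  { _id: 103, postId: 2, text: "Comment on another post." },
];

// Tags arrive with the post; comments need a second lookup by reference.
function loadPost(postId) {
  const post = posts.find((p) => p._id === postId);
  if (!post) return null;
  const postComments = comments.filter((c) => c.postId === postId);
  return { ...post, comments: postComments };
}

const loaded = loadPost(1);
console.log(loaded.tags, loaded.comments.length);
```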

From the document: MongoDB: The Definitive Guide (pages 175-180)