CHAPTER 5
Indexing
This chapter introduces MongoDB’s indexing, which allows you to optimize your quer‐
ies and is even required for certain types of queries:
• What indexing is and why you’d want to use it
• How to choose which fields to index
• How to enforce and evaluate index usage
• Administrative details on creating and removing indexes
Choosing the right indexes for your collections is critical to performance.
Introduction to Indexing
A database index is similar to a book’s index. Instead of looking through the whole book, the database takes a shortcut and just looks at an ordered list that points to the content, which allows it to query orders of magnitude faster.
A query that does not use an index is called a table scan (a term inherited from relational databases), which means that the server has to “look through the whole book” to find a query’s results. This process is basically what you’d do if you were looking for infor‐
mation in a book without an index: you start at page 1 and read through the whole thing.
In general, you want to avoid making the server do table scans because it is very slow for large collections.
For example, let’s create a collection with 1 million documents in it (or 10 million or 100 million, if you have the patience):
> for (i=0; i<1000000; i++) { ... db.users.insert(
... {
... "i" : i,
81
... "username" : "user"+i,
If we do a query on this collection, we can use the explain() function to see what MongoDB is doing when it executes the query. Try querying on a random username to see an example.
“Using explain() and hint()” on page 98 will explain the output fields; for now you can ignore almost all of them. "nscanned" is the number of documents MongoDB looked at while trying to satisfy the query, which, as you can see, is every document in the col‐
lection. That is, MongoDB had to look through every field in every document. This took nearly a second to accomplish: the "millis" field shows the number of milliseconds it took to execute the query.
The "n" field shows the number of results returned: 1, which makes sense because there is only one user with the username "user101". Note that MongoDB had to look through every document in the collection for matches because it did not know that usernames are unique. To optimize this query, we could limit it to one result so that MongoDB would stop looking after it found user101:
"indexBounds" : { }
}
The number scanned has now been cut way down and the query is almost instantaneous.
However, this is an impractical solution in general: what if we were looking for user999999? Then we’d still have to traverse the entire collection and our service would just get slower and slower as we added users.
Indexes are a great way to fix queries like this because they organize data by a given field to let MongoDB find it quickly. Try creating an index on the username field:
> db.users.ensureIndex({"username" : 1})
Depending on your machine and how large you made the collection, creating an index may take a few minutes. If the ensureIndex call does not return after a few seconds, run db.currentOp() (in a different shell) or check your mongod’s log to see the index build’s progress.
Once the index build is complete, try repeating the original query:
> db.users.find({"username" : "user101"}).explain() {
"cursor" : "BtreeCursor username_1", "nscanned" : 1,
"nscannedObjects" : 1, "n" : 1,
"millis" : 3, "nYields" : 0, "nChunkSkips" : 0, "isMultiKey" : false, "indexOnly" : false, "indexBounds" : { "username" : [ [
"user101", "user101"
] ] } }
This explain() output is more complex, but continue to ignore all the fields other than
"n", "nscanned", and "millis" for now. As you can see, the query is now almost in‐
stantaneous and, even better, has a similar runtime when querying for any username:
> db.users.find({username: "user999999"}).explain().millis 1
As you can see, an index can make a dramatic difference in query times. However, indexes have their price: every write (insert, update, or delete) will take longer for every Introduction to Indexing | 83
index you add. This is because MongoDB has to update all your indexes whenever your data changes, as well as the document itself. Thus, MongoDB limits you to 64 indexes per collection. Generally you should not have more than a couple of indexes on any given collection. The tricky part becomes figuring out which fields to index.
MongoDB’s indexes work almost identically to typical relational data‐
base indexes, so if you are familiar with those, you can skim this section for syntax specifics. We’ll go over some indexing basics, but keep in mind that it’s an extensive topic and most of the material out there on indexing for MySQL/Oracle/SQLite will apply equally well to Mon‐
goDB [such as the “Use the Index, Luke!” tutorial].
To choose which fields to create indexes for, look through your common queries and queries that need to be fast and try to find a common set of keys from those. For instance, in the example above, we were querying on "username". If that was a particularly com‐
mon query or was becoming a bottleneck, indexing "username" would be a good choice.
However, if this was an unusual query or one that was only done by administrators who didn’t care how long it took, it would not be a good choice of index.
Introduction to Compound Indexes
An index keeps all of its values in a sorted order so it makes sorting documents by the indexed key much faster. However, an index can only help with sorting if it is a prefix of the sort. For example, the index on "username" wouldn’t help much for this sort:
> db.users.find().sort({"age" : 1, "username" : 1})
This sorts by "age" and then "username", so a strict sorting by "username" isn’t terribly helpful. To optimize this sort, you could make an index on "age" and "username":
> db.users.ensureIndex({"age" : 1, "username" : 1})
This is called a compound index and is useful if your query has multiple sort directions or multiple keys in the criteria. A compound index is an index on more than one field.
Suppose we have a users collection that looks something like this, if we run a query with no sorting (called natural order):
> db.users.find({}, {"_id" : 0, "i" : 0, "created" : 0}) { "username" : "user0", "age" : 69 }
{ "username" : "user1", "age" : 50 } { "username" : "user2", "age" : 88 } { "username" : "user3", "age" : 52 } { "username" : "user4", "age" : 74 } { "username" : "user5", "age" : 104 } { "username" : "user6", "age" : 59 } { "username" : "user7", "age" : 102 }
84 | Chapter 5: Indexing
Download from Wow! eBook <www.wowebook.com>
{ "username" : "user8", "age" : 94 } { "username" : "user9", "age" : 7 } { "username" : "user10", "age" : 80 } ...
If we index this collection by {"age" : 1, "username" : 1}, the index will look roughly like this:
[0, "user100309"] -> 0x0c965148 [0, "user100334"] -> 0xf51f818e [0, "user100479"] -> 0x00fd7934 ...
[0, "user99985" ] -> 0xd246648f [1, "user100156"] -> 0xf78d5bdd [1, "user100187"] -> 0x68ab28bd [1, "user100192"] -> 0x5c7fb621 ...
[1, "user999920"] -> 0x67ded4b7 [2, "user100141"] -> 0x3996dd46 [2, "user100149"] -> 0xfce68412 [2, "user100223"] -> 0x91106e23 ...
Each index entry contains an age and a username and points to the location of a docu‐
ment on disk (represented by the hexadecimal numbers, which can be ignored). Note that "age" fields are ordered to be strictly ascending and, within each age, "user name"s are also in ascending order. As each age has approximately 8,000 usernames associated with it, only those necessary to convey the general idea have been included.
The way MongoDB uses this index depends on the type of query you’re doing. These are the three most common ways:
db.users.find({"age" : 21}).sort({"username" : -1})
This is a point query, which searches for a single value (although there may be multiple documents with that value). Due to the second field in the index, the results are already in the correct order for the sort: MongoDB can start with the last match for {"age" : 21} and traverse the index in order:
[21, "user999977"] -> 0x9b3160cf [21, "user999954"] -> 0xfe039231 [21, "user999902"] -> 0x719996aa ...
This type of query is very efficient: MongoDB can jump directly to the correct age and doesn’t need to sort the results as traversing the index returns the data in the correct order.
Note that sort direction doesn’t matter: MongoDB is comfortable traversing the index in either direction.
Introduction to Indexing | 85
db.users.find({"age" : {"$gte" : 21, "$lte" : 30}})
This is a multi-value query, which looks for documents matching multiple values (in this case, all ages between 21 and 30). MongoDB will use the first key in the index, "age", to return the matching documents, like so:
[21, "user100000"] -> 0x37555a81
In general, if MongoDB uses an index for a query it will return the resulting docu‐
ments in index order.
db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1})
This is a multi-value query, like the previous one, but this time it has a sort. As before, MongoDB will use the index to match the criteria:
[21, "user100000"] -> 0x37555a81
However, the index doesn’t return the usernames in sorted order and the query requested that the results be sorted by username, so MongoDB has to sort the results in memory before returning them. Thus, this query is usually less efficient than the queries above.
Of course, the speed depends on how many results match your criteria: if your result set is only a couple of documents, MongoDB won’t have much work to do to sort them. If there are more results, it will be slower or may not work at all: if you have more than 32 MB of results MongoDB will just error out, refusing to sort that much data:
Mon Oct 29 16:25:26 uncaught exception: error: {
"$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
"code" : 10128 }
86 | Chapter 5: Indexing
One other index you can use in the last example is the same keys in reverse order:
{"username" : 1, "age" : 1}. MongoDB will then traverse all the index entries, but in the order you want them back in. It will pick out the matching documents using the
"age" part of the index:
This is good in that it does not require any giant in-memory sorts. However, it does have to scan the entire index to find all matches. Thus, putting the sort key first is generally a good strategy when you’re using a limit so MongoDB can stop scanning the index after a couple of matches.
You can diagnose how MongoDB defaults to processing db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1}) by using explain():
> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).
... sort({"username" : 1}).
... explain() {
"cursor" : "BtreeCursor age_1_username_1", "isMultiKey" : false,
{
You can ignore most of these fields; they will be covered later in this chapter. For now, note that the "cursor" field indicates that this query would use the {"age" : 1, "user name" : 1} index and looks at less than a tenth of the documents ("nscanned" is only 83484), but takes nearly three seconds to execute (the "millis" field). The "scanAnd Order" field is true: this indicates that MongoDB had to sort the data in memory, as mentioned above.
We can use a hint to force MongoDB to use a certain index, so try the same query again using the {"username" : 1, "age" : 1} index, instead. This query scans more docu‐
ments, but does not have to do an in-memory sort:
> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).
... sort({"username" : 1}).
... hint({"username" : 1, "age" : 1}).
... explain() {
"cursor" : "BtreeCursor username_1_age_1", "isMultiKey" : false,
"age" : [
Note that this took nearly 15 seconds to run, making the first index the clear winner.
However, if we limit the number of results for each query, a new winner emerges:
> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).
... sort({"username" : 1}).
... limit(1000).
... hint({"age" : 1, "username" : 1}).
... explain()['millis']
2031
> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).
... sort({"username" : 1}).
... limit(1000).
... hint({"username" : 1, "age" : 1}).
... explain()['millis']
181
The first query is still hovering between two and three seconds, but the second query now takes less than a fifth of a second! Thus, you should always run explain()s on exactly the queries that your application is running. Excluding any options could make the explain() output misleading.
The index pattern of {"sortKey" : 1, "queryCriteria" : 1} often works well in applications, as most application do not want all possible results for a query but only the first few. It also scales well because of the way indexes are organized internally.
Indexes are basically trees, with the smallest value on the leftmost leaf and the greatest on the rightmost. If you have a "sortKey" that is a date (or any value that increases over time) then as you traverse the tree from left to right, you’re basically travelling forward in time. Thus, for applications that tend to use recent data more than older data, Mon‐
goDB only has to keep the rightmost (most recent) branches of the tree in memory, not the whole thing. An index like this is called right-balanced and, whenever possible, you should make your indexes balanced. The "_id" index is an example of a right-balanced index.