Aggregation Commands - MongoDB: The Definitive Guide

MongoDB provides several commands for basic aggregation tasks over a collection. These commands were added before the aggregation framework and have, for the most part, been superseded by it. However, complex groups may still require JavaScript, and counts and distincts can be simpler to run as non-framework commands.

count

The simplest aggregation tool is count, which returns the number of documents in the collection:

> db.foo.count()
0
> db.foo.insert({"x" : 1})
> db.foo.count()
1

Counting the total number of documents in a collection is fast regardless of collection size.

You can also pass in a query, and MongoDB will count the number of results for that query:

> db.foo.insert({"x" : 2})
> db.foo.count()
2
> db.foo.count({"x" : 1})
1

This can be useful for getting a total for pagination: “displaying results 0–10 of 439.”

Adding criteria does make the count slower. Counts can use indexes, but indexes do not contain enough metadata to make counting any more efficient than actually doing a query for the criteria.

146 | Chapter 7: Aggregation

distinct

The distinct command finds all of the distinct values for a given key. You must specify a collection and key:

> db.runCommand({"distinct" : "people", "key" : "age"})

For example, suppose we had the following documents in our collection:

{"name" : "Ada", "age" : 20}

{"name" : "Fred", "age" : 35}

{"name" : "Susan", "age" : 60}

{"name" : "Andy", "age" : 35}

If you call distinct on the "age" key, you will get back all of the distinct ages:

> db.runCommand({"distinct" : "people", "key" : "age"})
{"values" : [20, 35, 60], "ok" : 1}

A common question at this point is whether there’s a way to get all of the distinct keys in a collection. There is no built-in way of doing this, although you can write something to do it yourself using MapReduce (described in “MapReduce” on page 140).
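To give a rough idea of what such a do-it-yourself approach computes, here is a minimal sketch in plain JavaScript. It is not MapReduce and does not talk to a server: the people array stands in for documents read from a collection, the "email" field is invented for illustration, and distinctKeys is a hypothetical helper, not a MongoDB function. It only looks at top-level keys.

```javascript
// Hypothetical helper: collect every distinct top-level key across a set of
// documents. `docs` is a plain array standing in for a collection's contents.
function distinctKeys(docs) {
    const keys = new Set();
    for (const doc of docs) {
        for (const key of Object.keys(doc)) {
            keys.add(key);
        }
    }
    // Sort so the result does not depend on document order.
    return Array.from(keys).sort();
}

// Illustrative documents; "email" is made up for this example.
const people = [
    {"name" : "Ada", "age" : 20},
    {"name" : "Fred", "age" : 35},
    {"name" : "Susan", "email" : "susan@example.com"}
];

console.log(distinctKeys(people));
```

A MapReduce version would follow the same shape: emit each key in the map step and collapse duplicates in the reduce step.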

group

group allows you to perform more complex aggregation. You choose a key to group by, and MongoDB divides the collection into separate groups for each value of the chosen key. For each group, you can create a result document by aggregating the documents that are members of that group.

If you are familiar with SQL, group is similar to SQL’s GROUP BY.

Suppose we have a site that keeps track of stock prices. Every few minutes from 10 a.m. to 4 p.m., it gets the latest price for a stock, which it stores in MongoDB. Now, as part of a reporting application, we want to find the closing price for the past 30 days. This can be easily accomplished using group.

The collection of stock prices contains thousands of documents with the following form:

{"day" : "2010/10/03", "time" : "10/3/2010 03:57:01 GMT-400", "price" : 4.23}

{"day" : "2010/10/04", "time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27}

{"day" : "2010/10/03", "time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10}

{"day" : "2010/10/06", "time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30}

{"day" : "2010/10/04", "time" : "10/4/2010 08:34:50 GMT-400", "price" : 4.01}


You should never store money amounts as floating-point numbers because of inexactness concerns, but for simplicity we’ll do it in this example.

We want our results to be a list of the latest time and price for each day, something like this:

[
    {"time" : "10/3/2010 05:00:23 GMT-400", "price" : 4.10},
    {"time" : "10/4/2010 11:28:39 GMT-400", "price" : 4.27},
    {"time" : "10/6/2010 05:27:58 GMT-400", "price" : 4.30}
]

We can accomplish this by splitting the collection into sets of documents grouped by day, then finding the document with the latest timestamp for each day and adding it to the result set. The whole command might look something like this:

> db.runCommand({"group" : {
... "ns" : "stocks",
... "key" : "day",
... "initial" : {"time" : 0},
... "$reduce" : function(doc, prev) {
...     if (doc.time > prev.time) {
...         prev.price = doc.price;
...         prev.time = doc.time;
...     }
... }}})

Let’s break this command down into its component keys:

"ns" : "stocks"

This determines which collection we’ll be running the group on.

"key" : "day"

This specifies the key on which to group the documents in the collection. In this case, that would be the "day" key. All the documents with a "day" key of a given value will be grouped together.

"initial" : {"time" : 0}

The first time the reduce function is called for a given group, it will be passed the initialization document. This same accumulator will be used for each member of a given group, so any changes made to it can be persisted.

"$reduce" : function(doc, prev) { ... }

This will be called once for each document in the collection. It is passed the current document and an accumulator document: the result so far for that group. In this example, we want the reduce function to compare the current document’s time with the accumulator’s time. If the current document has a later time, we’ll set the accumulator’s time and price to be the current document’s values. Remember that there is a separate accumulator for each group, so there is no need to worry about different days using the same accumulator.
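To make the roles of these keys concrete, here is a minimal sketch in plain JavaScript that imitates the behavior described above on an in-memory array. It is not how the server implements group: simulateGroup is a hypothetical name, and we assume "time" is stored as epoch milliseconds so that it compares numerically with the initial value of 0.

```javascript
// Simplified in-memory model of "key", "initial", and "$reduce".
// simulateGroup is a hypothetical name, not a MongoDB function.
function simulateGroup(docs, key, initial, reduce) {
    const groups = new Map();
    for (const doc of docs) {
        const value = doc[key];
        if (!groups.has(value)) {
            // Each group gets its own (shallow) copy of the initial document.
            const accumulator = Object.assign({}, initial);
            accumulator[key] = value;
            groups.set(value, accumulator);
        }
        // reduce mutates that group's accumulator in place.
        reduce(doc, groups.get(value));
    }
    return Array.from(groups.values());
}

// Assumption: times are epoch milliseconds rather than strings.
const stocks = [
    {"day" : "2010/10/03", "time" : Date.UTC(2010, 9, 3, 19, 57, 1), "price" : 4.23},
    {"day" : "2010/10/04", "time" : Date.UTC(2010, 9, 4, 15, 28, 39), "price" : 4.27},
    {"day" : "2010/10/03", "time" : Date.UTC(2010, 9, 3, 21, 0, 23), "price" : 4.10},
    {"day" : "2010/10/06", "time" : Date.UTC(2010, 9, 6, 21, 27, 58), "price" : 4.30},
    {"day" : "2010/10/04", "time" : Date.UTC(2010, 9, 4, 12, 34, 50), "price" : 4.01}
];

const latestPerDay = simulateGroup(stocks, "day", {"time" : 0}, function(doc, prev) {
    if (doc.time > prev.time) {
        prev.price = doc.price;
        prev.time = doc.time;
    }
});

console.log(latestPerDay.map(function(g) { return [g.day, g.price]; }));
```

Because every day starts from its own copy of {"time" : 0}, the first document seen for a day always wins the comparison, and later documents replace it only if their time is greater.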

In the initial statement of the problem, we said that we wanted only the last 30 days’ worth of prices. Our current solution is iterating over the entire collection, however. This is why you can include a "condition" that documents must satisfy in order to be processed by the group command at all:

> db.runCommand({"group" : {
... "ns" : "stocks",
... "key" : "day",
... "initial" : {"time" : 0},
... "$reduce" : function(doc, prev) {
...     if (doc.time > prev.time) {
...         prev.price = doc.price;
...         prev.time = doc.time;
...     }},
... "condition" : {"day" : {"$gt" : "2010/09/30"}}
... }})

Some documentation refers to a "cond" or "q" key, both of which are identical to the "condition" key (just less descriptive).
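Note that the condition above compares "day" values as strings. The following sketch, in plain JavaScript on made-up documents, suggests why that works here: zero-padded "YYYY/MM/DD" strings sort in the same order as the dates they represent. matchesCondition is a hypothetical stand-in for the server-side filter, and the prices are illustrative.

```javascript
// Hypothetical stand-in for {"day" : {"$gt" : "2010/09/30"}}.
function matchesCondition(doc) {
    // A plain string comparison; it orders days correctly only because
    // the year, month, and day are zero-padded and in that order.
    return typeof doc.day === "string" && doc.day > "2010/09/30";
}

// Illustrative documents, not taken from a real collection.
const prices = [
    {"day" : "2010/09/29", "price" : 4.05},
    {"day" : "2010/09/30", "price" : 4.12},
    {"day" : "2010/10/01", "price" : 4.18},
    {"day" : "2010/10/03", "price" : 4.23}
];

const recent = prices.filter(matchesCondition);
console.log(recent.map(function(d) { return d.day; }));

// Without zero-padding the ordering breaks: as strings, October 3
// compares greater than October 10.
console.log("2010/10/3" > "2010/10/10");
```

If days were stored without padding, or as "MM/DD/YYYY", a range condition like this one would silently select the wrong documents.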

Now the command will return an array of 30 documents, each of which is a group. Each group has the key on which the group was based (in this case, "day" : string) and the final value of prev for that group. If some of the documents do not contain the key, these will be grouped into a single group with a "day" : null element. You can eliminate this group by adding "day" : {"$exists" : true} to the "condition". The group command also returns the total number of documents used and the number of distinct values for "key":

> db.runCommand({"group" : {...}})
{
    "retval" : [
        {
            "day" : "2010/10/04",
            "time" : "Mon Oct 04 2010 11:28:39 GMT-0400 (EST)",
            "price" : 4.27
        },
        ...
    ],
    "count" : 734,
    "keys" : 30,
    "ok" : 1
}


We explicitly set the "price" for each group, and the "time" was set by the initializer and then updated. The "day" is included because the key being grouped by is included by default in each "retval" embedded document. If you don’t want to return this key, you can use a finalizer to change the final accumulator document into anything, even a nondocument (e.g., a number or string).

Using a finalizer

Finalizers can be used to minimize the amount of data that needs to be transferred from the database to the user, which is important because the group command’s output needs to fit in a single database response. To demonstrate this, we’ll take the example of a blog where each post has tags. We want to find the most popular tag for each day. We can group by day (again) and keep a count for each tag. This might look something like this:

> db.posts.group({...})
[
    {"day" : "2010/01/12", "tags" : {"nosql" : 4, "winter" : 10, "sledding" : 2}},
    {"day" : "2010/01/13", "tags" : {"soda" : 5, "php" : 2}},
    {"day" : "2010/01/14", "tags" : {"python" : 6, "winter" : 4, "nosql": 15}}
]

Then we could find the largest value in the "tags" document on the client side. However, sending the entire tags document for every day is a lot of extra overhead to send to the client: an entire set of key/value pairs for each day, when all we want is a single string.

This is why group takes an optional "finalize" key. "finalize" can contain a function that is run on each group once, right before the result is sent back to the client. We can use a "finalize" function to trim out all of the cruft from our results:

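> db.runCommand({"group" : {
... "ns" : "posts",
... "key" : {"day" : true},
... "initial" : {"tags" : {}},
... "$reduce" : function(doc, prev) {
...     for (var i in doc.tags) {
...         if (doc.tags[i] in prev.tags) {
...             prev.tags[doc.tags[i]]++;
...         } else {
...             prev.tags[doc.tags[i]] = 1;
...         }
...     }
... },
... "finalize" : function(prev) {
...     var mostPopular = 0;
...     for (var i in prev.tags) {
...         if (prev.tags[i] > mostPopular) {
...             prev.tag = i;
...             mostPopular = prev.tags[i];
...         }
...     }
...     delete prev.tags;
... }}})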

Now, we’re only getting the information we want; the server will send back something like this:

[
    {"day" : "2010/01/12", "tag" : "winter"},
    {"day" : "2010/01/13", "tag" : "soda"},
    {"day" : "2010/01/14", "tag" : "nosql"}
]

finalize can either modify the argument passed in or return a new value.
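The two styles can be compared side by side in plain JavaScript. This is only a sketch: applyFinalize is a hypothetical helper that models the behavior as "use the returned value if there is one, otherwise use the modified accumulator," which is our assumption about how the two cases are told apart, not a description of the server's code.

```javascript
// Accumulator as it might look after the reduce step for one day.
const accumulator = {"day" : "2010/01/12", "tags" : {"nosql" : 4, "winter" : 10, "sledding" : 2}};

// Style 1: modify the argument in place and return nothing.
function finalizeInPlace(prev) {
    let mostPopular = 0;
    for (const tag in prev.tags) {
        if (prev.tags[tag] > mostPopular) {
            prev.tag = tag;
            mostPopular = prev.tags[tag];
        }
    }
    delete prev.tags;
}

// Style 2: leave the argument alone and return a replacement value.
function finalizeByReturn(prev) {
    let best = null;
    for (const tag in prev.tags) {
        if (best === null || prev.tags[tag] > prev.tags[best]) {
            best = tag;
        }
    }
    return {"day" : prev.day, "tag" : best};
}

// Hypothetical model: prefer the return value, else the mutated copy.
function applyFinalize(acc, finalize) {
    const copy = JSON.parse(JSON.stringify(acc)); // keep the original intact
    const returned = finalize(copy);
    return returned === undefined ? copy : returned;
}

console.log(applyFinalize(accumulator, finalizeInPlace));
console.log(applyFinalize(accumulator, finalizeByReturn));
```

Both calls produce the same day and tag for this accumulator. Returning a new value is also what lets a finalizer produce a nondocument, such as just the tag string.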

Using a function as a key

Sometimes you may have more complicated criteria that you want to group by, not just a single key. Suppose you are using group to count how many blog posts are in each category. (Each blog post is in a single category.) Post authors were inconsistent, though, and categorized posts with haphazard capitalization. So, if you group by category name, you’ll end up with separate groups for “MongoDB” and “mongodb.” To make sure any variation of capitalization is treated as the same key, you can define a function to determine documents’ grouping key.

To define a grouping function, you must use a "$keyf" key (instead of "key"). Using "$keyf" makes the group command look something like this:

> db.posts.group({"ns" : "posts",
... "$keyf" : function(x) { return x.category.toLowerCase(); },
... "initializer" : ... })

"$keyf" allows you to group by arbitrarily complex criteria.
