
MongoDB and MapReduce

From the document MongoDB: The Definitive Guide (pages 165-168)

};

Now we need to reduce all of the emitted values for a tag into a single score for that tag:

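reduce = function(key, emits) {
    var total = {urls : [], score : 0};
    for (var i = 0; i < emits.length; i++) {
        // collect every URL emitted for this tag
        emits[i].urls.forEach(function(url) {
            total.urls.push(url);
        });
        // add up the scores emitted for this tag
        total.score += emits[i].score;
    }
    return total;
};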

The final collection will end up with a full list of URLs for each tag and a score showing how popular that particular tag is.
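To see the reduce step in isolation, the function can be exercised outside the server on hand-written values. The following is a minimal sketch, assuming emitted values of the form {urls : [...], score : number} as the reduce function expects; the tag name, URLs, and scores are made up for illustration. It also reduces the values in two stages, because MongoDB may call reduce again on its own earlier output, so the result should not depend on how the values are grouped.

```javascript
// The reduce function from the text, written so it runs in any JavaScript engine.
var reduce = function(key, emits) {
    var total = {urls : [], score : 0};
    for (var i = 0; i < emits.length; i++) {
        emits[i].urls.forEach(function(url) {
            total.urls.push(url);
        });
        total.score += emits[i].score;
    }
    return total;
};

// Hypothetical emitted values for the tag "mongodb" (made-up data).
var emits = [
    {urls : ["http://example.com/a"], score : 0.5},
    {urls : ["http://example.com/b"], score : 0.25},
    {urls : ["http://example.com/c"], score : 0.25}
];

// Reduce everything in one pass.
var onePass = reduce("mongodb", emits);

// Reduce in two stages: the first two values, then that result with the third.
var partial = reduce("mongodb", emits.slice(0, 2));
var twoPass = reduce("mongodb", [partial, emits[2]]);
```

Both onePass and twoPass hold the same three URLs and the same total score, which is the property a reduce function needs in order to be safely re-run on partial results.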

MongoDB and MapReduce

Both of the previous examples used only the mapreduce, map, and reduce keys. These three keys are required, but there are many optional keys that can be passed to the MapReduce command:

"finalize" : function

A final step to send reduce’s output to.

"keeptemp" : boolean

Whether the temporary result collection should be kept when the connection is closed.


"out" : string

Name for the output collection. Setting this option implies keeptemp : true.

"query" : document

Query to filter documents by before sending to the map function.

"sort" : document

Sort to use on documents before sending to the map (useful in conjunction with the limit option).

"limit" : integer

Maximum number of documents to send to the map function.

"scope" : document

Variables that can be used in any of the JavaScript code.

"verbose" : boolean

Whether or not to use more verbose output in the server logs.
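As a rough illustration of how these keys fit together, the sketch below builds one command document that uses all of them, with the option names exactly as listed above. The collection name webpages, the output name tag_scores, the date field, and the placeholder map and reduce functions are assumptions made for the example, not part of the original text.

```javascript
// Placeholder functions; real map and reduce functions would go here.
var map = function() { emit(this.tag, 1); };
var reduce = function(key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i];
    }
    return total;
};

var command = {
    "mapreduce" : "webpages",                     // required: collection to process
    "map" : map,                                  // required
    "reduce" : reduce,                            // required
    "finalize" : function(key, value) { return value; },
    "out" : "tag_scores",                         // implies keeptemp : true
    "query" : {"date" : {"$gt" : new Date(2010, 0, 1)}},
    "sort" : {"date" : -1},
    "limit" : 10000,
    "scope" : {now : new Date()},
    "verbose" : true
};

// In the mongo shell, this document would be run with db.runCommand(command).
```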

The finalize function

As with the previous group command, MapReduce can be passed a finalize function that will be run on the last reduce’s output before it is saved to a temporary collection.

Large result sets are less of a concern with MapReduce than with group, because the whole result doesn’t have to fit in 4 MB. However, the information will be passed over the wire eventually, so finalize is a good chance to take averages, trim arrays, and remove extra information in general.
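For the tag example, a finalize function might look like the following sketch. It assumes the reduced value has the {urls, score} shape produced by the reduce function earlier; the cutoff of 10 URLs and the urlCount and averageScore field names are hypothetical choices made for illustration.

```javascript
var finalize = function(key, value) {
    // Hypothetical cutoff, kept inside the function so it does not
    // depend on any variable defined outside of it.
    var maxUrls = 10;
    var count = value.urls.length;
    return {
        urls : value.urls.slice(0, maxUrls),               // trim the array
        urlCount : count,                                  // remember the full size
        averageScore : count > 0 ? value.score / count : 0 // take an average
    };
};

// Made-up reduced value with 12 URLs and a total score of 6.
var sample = {urls : [], score : 6};
for (var i = 0; i < 12; i++) {
    sample.urls.push("http://example.com/page" + i);
}
var finalized = finalize("mongodb", sample);
```

Here finalized keeps only the first 10 URLs, records that there were 12, and stores an average score of 0.5 in place of the raw total.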

Keeping output collections

By default, Mongo creates a temporary collection while it is processing the MapReduce with a name that you are unlikely to choose for a collection: a dot-separated string containing mr, the name of the collection you’re MapReducing, a timestamp, and the job’s ID with the database. It ends up looking something like mr.stuff.18234210220.2.

MongoDB will automatically destroy this collection when the connection that did the MapReduce is closed. (You can also drop it manually when you’re done with it.) If you want to persist this collection even after disconnecting, you can specify keeptemp : true as an option.

If you’ll be using the temporary collection regularly, you may want to give it a better name. You can specify a more human-readable name with the out option, which takes a string. If you specify out, you need not specify keeptemp : true, since it is implied.

Even if you specify a “pretty” name for the collection, MongoDB will use the autogenerated collection name for intermediate steps of the MapReduce. When it has finished, it will automatically and atomically rename the collection from the autogenerated name to your chosen name. This means that if you run MapReduce multiple times with the same target collection, you will never be using an incomplete collection for operations.

The output collection created by MapReduce is a normal collection, which means that there is no problem with doing a MapReduce on it or a MapReduce on the results from that MapReduce, ad infinitum!

MapReduce on a subset of documents

Sometimes you need to run MapReduce on only part of a collection. You can add a query to filter the documents before they are passed to the map function.

Every document passed to the map function needs to be deserialized from BSON into a JavaScript object, which is a fairly expensive operation. If you know that you will need to run MapReduce only on a subset of the documents in the collection, adding a filter can greatly speed up the command. The filter is specified by the "query", "limit", and "sort" keys.

The "query" key takes a query document as a value. Any documents that would ordinarily be returned by that query will be passed to the map function. For example, if we have an application tracking analytics and want a summary for the last week, we can use MapReduce on only the most recent week’s documents with the following command:

> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, "query" : {"date" : {"$gt" : week_ago}}})

The sort option is mostly useful in conjunction with limit. limit can be used on its own, as well, to simply provide a cutoff on the number of documents sent to the map function.

If, in the previous example, we wanted an analysis of the last 10,000 page views (instead of the last week), we could use limit and sort:

> db.runCommand({"mapreduce" : "analytics", "map" : map, "reduce" : reduce, "limit" : 10000, "sort" : {"date" : -1}})

query, limit, and sort can be used in any combination, but sort isn’t useful if limit isn’t present.
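The commands above use a week_ago variable without showing where it comes from. One way it might be computed, and combined with limit and sort in a single command document, is sketched below. The placeholder map and reduce functions and the page field are assumptions for the example.

```javascript
// Milliseconds in one week.
var MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

var now = new Date();
var week_ago = new Date(now.getTime() - MS_PER_WEEK);

// Placeholder functions; real map and reduce functions would go here.
var map = function() { emit(this.page, 1); };
var reduce = function(key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i];
    }
    return total;
};

// At most the 10,000 most recent page views from the last week.
var command = {
    "mapreduce" : "analytics",
    "map" : map,
    "reduce" : reduce,
    "query" : {"date" : {"$gt" : week_ago}},
    "sort" : {"date" : -1},
    "limit" : 10000
};

// In the mongo shell, this document would be run with db.runCommand(command).
```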

Using a scope

MapReduce can take a code type for the map, reduce, and finalize functions, and, in most languages, you can specify a scope to be passed with code. However, MapReduce ignores this scope. It has its own scope key, "scope", and you must use that if there are client-side values you want to use in your MapReduce. You can set them using a plain document of the form variable_name : value, and they will be available in your map, reduce, and finalize functions. The scope is immutable from within these functions.


For instance, in the example in the previous section, we calculated the recency of a page using 1/(new Date() - this.date). We could, instead, pass in the current date as part of the scope with the following code:

> db.runCommand({"mapreduce" : "webpages", "map" : map, "reduce" : reduce, "scope" : {now : new Date()}})

Then, in the map function, we could say 1/(now - this.date).
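A map function written this way can be tried outside the server by supplying stand-ins for what MongoDB would normally provide. In the sketch below, the emit function, the global now variable, and the sample page document (with url, date, and tags fields) are all assumptions made so the example runs on its own; on the server, emit is built in and now would come from the "scope" document.

```javascript
// Stand-ins for what the server provides during a MapReduce.
var emitted = [];
function emit(key, value) {
    emitted.push({key : key, value : value});
}
var now = new Date("2010-06-01T00:00:00Z"); // would be supplied through "scope"

// A map function that reads now from the scope instead of calling new Date().
var map = function() {
    var score = 1 / (now - this.date);
    var url = this.url;
    this.tags.forEach(function(tag) {
        emit(tag, {urls : [url], score : score});
    });
};

// Hypothetical document, saved exactly one day before now.
var page = {
    url : "http://example.com/a",
    date : new Date("2010-05-31T00:00:00Z"),
    tags : ["mongodb", "databases"]
};

// The server calls map with each document bound to this.
map.call(page);
```

Because every document sees the same now value, all scores in one run are measured against the same instant, rather than against whenever each map call happened to execute.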

Getting more output

There is also a verbose option for debugging. If you would like to see the progress of your MapReduce as it runs, you can specify "verbose" : true.

You can also use print to see what’s happening in the map, reduce, and finalize functions. print will print to the server log.
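As a small illustration, a reduce function could log each call as sketched below. The stand-in definition of print is only there so the example also runs outside MongoDB; the counting reduce function and the message format are assumptions for the example.

```javascript
// Stand-in so the sketch also runs where print is not defined.
if (typeof print === "undefined") {
    var print = function(message) { console.log(message); };
}

var reduce = function(key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i];
    }
    // On the server, this line would end up in the server log.
    print("reduce: key=" + key + " values=" + emits.length + " total=" + total);
    return total;
};

var result = reduce("mongodb", [1, 2, 3]);
```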
