Exploring your data with Aggregations

7.3 Multi-bucket aggregations

7.3.1 Terms aggregations

We first looked at the terms aggregation in section 7.1 as an example of how all aggregations work. The typical use case is to get the top frequent X, where X is a field in your documents, like the name of a user, a tag, or a category. Because the terms aggregation counts every term and not every field value, you'll normally run it on a non-analyzed field: you want big data to be counted once, and not once for big and once for data.
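The difference between counting field values and counting terms can be sketched in a few lines of Python. This is only an illustration: the sample tags are made up, and the lowercase-and-split step is a crude stand-in for a real analyzer, not what Elasticsearch actually runs.

```python
from collections import Counter

# Hypothetical tag values for three groups (not taken from the get-together data).
group_tags = [
    ["big data", "open source"],
    ["big data", "data visualization"],
    ["open source"],
]

def count_not_analyzed(docs):
    """Count documents per term when each field value is kept as one term."""
    counts = Counter()
    for tags in docs:
        counts.update(set(tags))
    return counts

def count_analyzed(docs):
    """Count documents per term after a stand-in analyzer (lowercase, split on whitespace)."""
    counts = Counter()
    for tags in docs:
        counts.update({word for tag in tags for word in tag.lower().split()})
    return counts

print(count_not_analyzed(group_tags))
print(count_analyzed(group_tags))
```

With the non-analyzed variant, big data is a single bucket. With the analyzed variant, it disappears and big and data become separate buckets.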

You could also use the terms aggregation to extract the most frequent terms from an analyzed field, like the description of an event, and use this information to generate a word cloud, like the one in figure 7.6. Just make sure you have enough memory to load all the field's terms if you have many documents or the documents contain many terms.

Figure 7.6 A terms aggregation can be used to get term frequencies and generate a word cloud

By default, the order of terms is by their count, descending, which fits all the top frequent X use-cases. But you can order terms ascending, or by other criteria, like the term name itself. The following listing shows how to list the group tags ordered alphabetically by using the order property.

Listing 7.8 Ordering tag buckets by name

curl localhost:9200/get-together/group/_search?pretty -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "order": {
          "_term": "asc"
        }
      }
    }
  }
}'
### reply
"aggregations" : {
  "tags" : {
    "buckets" : [ {
      "key" : "apache lucene",
      "doc_count" : 1
    }, {
      "key" : "big data",
      "doc_count" : 3
    }, {
      "key" : "clojure",
      "doc_count" : 1
[...]

If you're nesting a metric aggregation under your terms aggregation, you can order terms by the metric, too. For example, you could use the average metric aggregation under your tags aggregation from listing 7.7 to get the average number of group members per tag. You can then order tags by the number of members by referring to your metric aggregation's name, as in avg_members: desc.
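As a rough sketch of what such a request body could look like, the following Python builds it as a dictionary and prints the JSON you would send. Listing 7.7 isn't reproduced here, so the members_count field is a hypothetical numeric field holding the number of members in each group; replace it with whatever your mapping provides.

```python
import json

def tags_ordered_by_avg_members(direction="desc"):
    """Build a terms aggregation whose buckets are ordered by a nested avg metric."""
    return {
        "aggregations": {
            "tags": {
                "terms": {
                    "field": "tags.verbatim",
                    # Order buckets by the value of the nested metric named below.
                    "order": {"avg_members": direction},
                },
                "aggregations": {
                    # Assumption: "members_count" is a hypothetical numeric field.
                    "avg_members": {"avg": {"field": "members_count"}}
                },
            }
        }
    }

print(json.dumps(tags_ordered_by_avg_members(), indent=2))
```

The name used in order has to match the name of the nested metric aggregation, which is why both appear as avg_members.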

WHICH TERMS TO INCLUDE IN THE REPLY

By default, the terms aggregation will return only the top 10 terms by the order you selected.

You can, however, change that number through the size parameter. Setting size to 0 will get you all the terms, but it's dangerous to use with a high-cardinality field, because a very large result is CPU-intensive to sort and might saturate your network.

To get back the top 10 terms (or the number of terms you configure with size), Elasticsearch has to get the top 10 terms from each shard and aggregate the results. The process is shown in figure 7.7, with size=2 for clarity.

Figure 7.7 Sometimes, the overall top X is inaccurate, because only top X terms are returned per shard

This mechanism implies that you might get inaccurate counts for some terms if those terms don't make it into the top 10 on every shard. It can even result in missing terms, as in figure 7.7, where lucene, with a total count of 7, isn't returned in the top 2 overall tags because it didn't make the top 2 on every shard.

Figure 7.8 Reducing inaccuracies by increasing shard_size

To solve this problem, you can get more than 10 results from each shard by configuring shard_size, while retaining the same value of size. You will trade some performance for this, because aggregating larger per-shard result sets is more expensive.
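To see why per-shard truncation loses information, here is a minimal simulation of the merge step. The per-shard counts are made up for this sketch (they aren't the numbers from the figures), and the code assumes each shard returns exactly shard_size terms, falling back to size when shard_size isn't set.

```python
from collections import Counter

# Hypothetical per-shard tag counts, made up for this sketch.
SHARDS = [
    {"elasticsearch": 6, "lucene": 4, "search": 1},
    {"elasticsearch": 6, "search": 5, "lucene": 3},
]

def top_terms(shards, size, shard_size=None):
    """Merge per-shard top terms, then keep the overall top `size`.

    Assumption: each shard returns exactly `shard_size` terms,
    and `shard_size` falls back to `size` when not set.
    """
    if shard_size is None:
        shard_size = size
    merged = Counter()
    for shard in shards:
        # Only the shard's own top terms ever reach the merge step.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

print(top_terms(SHARDS, size=2))
print(top_terms(SHARDS, size=2, shard_size=3))
```

With these counts, the first call drops lucene (true total 7) and undercounts search (5 instead of 6). The second call asks each shard for one more term and returns elasticsearch and lucene with their full counts.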

At the other end of the accuracy spectrum, you could consider terms with low frequency irrelevant and exclude them from the result set entirely. This is especially useful when you sort terms by something other than frequency (which makes it likely for low-frequency terms to appear) but don't want to “pollute” the results with irrelevant terms like typos. To do that, you'll need to change the min_doc_count setting from its default value of 1.

Finally, you can include and exclude specific terms from the result. You'd do that by using the include and exclude options and providing regular expressions as values. Using include alone will return only terms matching the pattern; using exclude alone will return only terms that don't match. When you use both, exclude takes precedence: returned terms match the include pattern but don't match the exclude pattern.
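The selection rules for min_doc_count, include, and exclude can be sketched as a small filter over term counts. This is an illustration in plain Python, not Elasticsearch code: the tag counts are made up, and the sketch assumes a pattern has to match the whole term, which is why the pattern is written as .*search.* rather than search.

```python
import re

def select_terms(doc_counts, min_doc_count=1, include=None, exclude=None):
    """Keep terms that pass min_doc_count, match include, and don't match exclude."""
    selected = {}
    for term, count in doc_counts.items():
        if count < min_doc_count:
            continue
        # Assumption: patterns are matched against the whole term.
        if include is not None and not re.fullmatch(include, term):
            continue
        # exclude is checked last, so it wins over include.
        if exclude is not None and re.fullmatch(exclude, term):
            continue
        selected[term] = count
    return selected

# Hypothetical tag counts, including a typo with a single document.
counts = {
    "enterprise search": 3,
    "search server": 2,
    "solr search": 1,
    "big data": 3,
    "serach": 1,
}

print(select_terms(counts, include=".*search.*"))
print(select_terms(counts, include=".*search.*", exclude=".*solr.*"))
print(select_terms(counts, min_doc_count=2))
```

The last call shows how raising min_doc_count removes one-off terms such as the typo.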

The following listing will show you how to return counters for only tags containing “search.”

Listing 7.9 Creating buckets only for terms containing “search”

curl localhost:9200/get-together/group/_search?pretty -d '{

The significant terms aggregation is useful if you want to see which terms have higher frequencies than normal in your current search results. Let's take the example of get-together groups: across all the groups out there, the term clojure may not appear frequently enough to count. Let's assume that it appears 10 times out of 1,000,000 terms (0.001%). If you restrict your search to Denver, let's say it appears 7 times out of 10,000 terms (0.07%). That percentage is significantly higher than before and indicates a strong Clojure community in Denver compared to the rest. It doesn't matter that other terms, such as programming or devops, have a much higher absolute frequency.

The significant terms aggregation is much like the terms aggregation in the sense that it's counting terms. But the resulting buckets are ordered by a score, which represents the difference in percentage between the foreground documents (that 0.07% in the previous example) and the background documents (0.001%). The foreground documents are those matching your query, and the background documents are all the documents from the index.
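The arithmetic behind the clojure example is easy to check with a short Python sketch. It only reproduces the foreground and background percentages from the counts given in the text; the score Elasticsearch computes from them is more involved and isn't modeled here.

```python
def term_share(term_count, total_terms):
    """Share of a term among all terms, as a percentage."""
    return 100.0 * term_count / total_terms

background = term_share(10, 1_000_000)  # clojure across all groups
foreground = term_share(7, 10_000)      # clojure in the Denver results

print(f"background: {background:.4f}%")
print(f"foreground: {foreground:.4f}%")
print(f"foreground is about {foreground / background:.0f} times the background share")
```

The absolute count is lower in the foreground set (7 versus 10), but the share is much higher, and that gap is what makes the term significant.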

In the following listing, we'll try to find out which users of the get-together site have a similar preference to Lee for events. To do that, we'll query for events that Lee attends and use the significant terms aggregation to see which attendees participate in those events more often, compared to the overall set of events they attend.

Listing 7.10 Finding attendees attending similar events to Lee

curl localhost:9200/get-together/event/_search?pretty -d '{
[...]
      "significant_terms": {
        "field": "attendees", #B
[...]
    "significant_attendees" : {
      "doc_count" : 5, #E
[...]

#A Foreground documents are the events Lee attends

#B We need attendees that appear more in these events than overall

#C Take only attendees that participated in at least 2 events

#D Exclude Lee from the analyzed terms; he has the same taste as himself

#E Total number of events Lee attends is 5

#F Greg has similar taste: attended 3 events in total, all of them with Lee

#G Mike is after him, with 2 events in total, all of them with Lee

#H Daniel is last. He went to 3 events, but only 2 of them with Lee

As you might have guessed from the listing, the significant terms aggregation has the same size, shard_size, min_doc_count, include, and exclude options as the terms aggregation, which let you control the terms you get back. In addition to those, it allows you to change the background documents from all the documents in the index to only those matching a filter defined in the background_filter parameter. For example, you may know that Lee only participates in technology events, so you can filter for those to make sure that events irrelevant to him aren't taken into account.
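A request using background_filter might be put together along these lines. Treat it as a sketch under assumptions rather than the book's listing: the match query on attendees, the exclude pattern, and especially the category field with the value technology are hypothetical and depend on how your events are mapped.

```python
import json

def similar_attendees_request(background_filter=None):
    """Build a significant terms request for attendees of Lee's events."""
    significant_terms = {
        "field": "attendees",
        "min_doc_count": 2,  # only attendees that took part in at least 2 events
        "exclude": "lee",    # Lee trivially matches himself
    }
    if background_filter is not None:
        # Narrow the background set from the whole index to matching documents.
        significant_terms["background_filter"] = background_filter
    return {
        # Assumption: this query selects the events Lee attends (the foreground).
        "query": {"match": {"attendees": "lee"}},
        "aggregations": {
            "significant_attendees": {"significant_terms": significant_terms}
        },
    }

# Hypothetical field and value identifying technology events.
tech_only = {"term": {"category": "technology"}}
print(json.dumps(similar_attendees_request(tech_only), indent=2))
```

Calling similar_attendees_request() without an argument leaves background_filter out, so the whole index remains the background.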

Both the terms and significant terms aggregations work well for string fields. For numeric fields, range and histogram aggregations are more relevant, and we'll look at them next.

Source: MEAP Edition, Manning Early Access Program, Elasticsearch in Action, Version 11 (pages 174-179)