Exploring your data with Aggregations
7.1 Anatomy of an aggregation
All aggregations, no matter their type, follow some rules:
• You define them in the same JSON request as your queries, and you mark them by the key aggregations or aggs. You need to give each one a name, specify the type and the options specific to that type.
• They run on the results of your query. Documents that don't match your query aren't accounted for unless you include them with the global aggregation, a bucket aggregation that will be covered later in this chapter.
• You can further filter the results of your query without influencing aggregations. To do that, you'd use post filters. For example, when searching for a keyword in an online shop, you can build statistics on all items matching the keyword, but use post filters to show only the results that are in stock.
Let's take a look at the popular terms aggregation, which you've already seen in the intro to this chapter. The example use-case was getting the most popular subjects (tags) for existing groups of your get-together site. We'll use this same terms aggregation to explore the rules that all aggregations must follow.
7.1.1 Structure of an aggregation request
In listing 7.1, you'll run a terms aggregation that will give you the most frequent tags in the get-together groups. The structure of this terms aggregation will apply to every other aggregation.
Listing 7.1 Using the terms aggregation to get top tags
curl 'localhost:9200/get-together/group/_search?pretty' -d '{
#A Aggregations key indicates that this is the aggregations part of the request
#B Give the aggregation a name
#C Specify the aggregation type terms
#D Verbatim field is used to have “big data” as a single term, instead of “big” and “data” separately
#E The list of results is there anyway, as if you hit the _search endpoint with no query
#F Aggregation results begin here
#G Aggregation name, as specified
#H Each unique term is an item in the bucket
#I For each term, you see how many times it appeared
• At the top level, there's the aggregations key, which can be shortened to aggs.
• On the next level, you have to give the aggregation a name. You can see that name in the reply. This is useful when you use multiple aggregations in the same request, so you can easily see the meaning of each set of results.
• Finally, you have to specify the aggregation type, terms, and the options specific to that type. In this case, the only option is the field name.
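The three levels described above can be sketched as a nested structure. The following Python snippet is only an illustration, not the original listing: the aggregation name top_tags and the field name tags.verbatim are assumptions chosen for this example. It builds and prints the JSON body you could pass to curl with -d.

```python
import json

# Hypothetical names for illustration: "top_tags" is the aggregation name,
# and "tags.verbatim" is assumed to be the not-analyzed variant of "tags".
AGG_NAME = "top_tags"
FIELD = "tags.verbatim"

request_body = {
    "aggregations": {        # level 1: the aggregations key ("aggs" also works)
        AGG_NAME: {          # level 2: a name you choose; it is echoed in the reply
            "terms": {       # level 3: the aggregation type ...
                "field": FIELD   # ... and the options specific to that type
            }
        }
    }
}

# Print the JSON that would go after -d in the curl command.
print(json.dumps(request_body, indent=2))
```

Because the name is the key under aggregations, a second aggregation in the same request would simply be another key next to top_tags.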
The aggregation request from listing 7.1 hits the _search endpoint, just like the queries you've seen in previous chapters. In fact, you also get back 10 group results. This is because no query was specified, so Elasticsearch effectively runs the match_all query you saw in chapter 4, and your aggregation runs on all the group documents. Running a different query will make the aggregation run on a different set of documents.
Field data cache for faster aggregations
When you run a regular search, it goes pretty fast because of the nature of the inverted index: you have a limited number of terms to look for, and Elasticsearch identifies the documents containing those terms and returns the results. Aggregations, on the other hand, require more work, because Elasticsearch has to pull all the terms from the fields you aggregate on and then do the counting or other computation.
To speed things up, Elasticsearch loads those terms into memory, in the field data cache. The more terms it has to deal with, the more memory the field data cache uses. That's why you have to make sure you have enough memory, especially when you're aggregating over large numbers of documents, or when fields are analyzed and you have more than one term per document.
By default, the field data cache is unlimited, so running many expensive aggregations can trigger an out-of-memory error. You can change the configuration to make old items expire (indices.fielddata.cache.expire), and you can limit the amount of memory this cache can occupy (indices.fielddata.cache.size). Another helpful feature is the “field data circuit breaker,” which raises an exception if an aggregation would use more field data memory than a certain limit. You can adjust that limit via indices.fielddata.breaker.limit in the configuration or in the cluster settings.
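As a rough sketch of how these three settings might be written down, the snippet below prints two lines in the key: value form of a YAML configuration file and a JSON body for a cluster settings update. The setting names come from the text above. The values (30%, 10m, 60%) are purely illustrative assumptions, and wrapping the breaker limit in a transient block assumes you apply it through the cluster update settings API.

```python
import json

# Illustrative values only; choose limits that fit your own heap size.
static_settings = {
    "indices.fielddata.cache.size": "30%",    # cap on the memory the cache may occupy
    "indices.fielddata.cache.expire": "10m",  # let old entries expire
}

# Assumed shape of a cluster settings update that adjusts the breaker limit.
breaker_update = {
    "transient": {"indices.fielddata.breaker.limit": "60%"}
}

# Lines in the form used by a YAML configuration file.
for key, value in static_settings.items():
    print(f"{key}: {value}")

# JSON body for the cluster settings request.
print(json.dumps(breaker_update, indent=2))
```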
7.1.2 Aggregations run on query results
Computing metrics over the whole data set is just one of the possible use-cases for aggregations. Often, you want to compute metrics in the context of a query. For example, if you're searching for groups in Denver, you probably want to see the most popular tags for those groups only. As you'll see in listing 7.2, this is the default behavior for aggregations.
Unlike in listing 7.1, where the implied query was match_all, here we query for Denver in the location field, and aggregations will only be about groups from Denver.
Listing 7.2 Getting top tags for groups in Denver
curl 'localhost:9200/get-together/group/_search?pretty' -d '{
#A In this query we only look for groups in Denver
#B Fewer results than in listing 7.1, because we only look for Denver groups
#C Tags are only counted for Denver groups, so they look different than in listing 7.1
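To make the difference between listings 7.1 and 7.2 concrete, here is a small in-memory simulation. It doesn't talk to Elasticsearch: the sample groups are made up, and match and terms_agg are hypothetical helpers that only mimic the behavior described above, where the aggregation counts tags over whatever documents the query matched.

```python
from collections import Counter

# Made-up sample data; the field names mirror the get-together examples.
groups = [
    {"location": "Denver, Colorado", "tags": ["clojure", "functional programming"]},
    {"location": "Denver, Colorado", "tags": ["elasticsearch", "big data"]},
    {"location": "San Francisco, California", "tags": ["elasticsearch", "big data", "lucene"]},
    {"location": "London, England", "tags": ["solr", "big data"]},
]

def match(docs, field, text):
    """Toy stand-in for a match query: case-insensitive substring test."""
    return [doc for doc in docs if text.lower() in doc[field].lower()]

def terms_agg(docs, field):
    """Count the documents containing each unique term, most frequent first."""
    counts = Counter(term for doc in docs for term in set(doc[field]))
    # Ties are broken alphabetically here only to keep the output deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

all_tags = terms_agg(groups, "tags")                                  # implied match_all
denver_tags = terms_agg(match(groups, "location", "denver"), "tags")  # query for Denver

print(all_tags)
print(denver_tags)
```

With this sample data, "big data" appears three times across all groups but only once among the Denver groups, which is the kind of difference you'd expect between the two listings.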
FROM AND SIZE
Recall from chapter 4 that you can use the from and size parameters of your query to control the pagination of results. These parameters have no influence on aggregations, because aggregations always run on all the documents matching a query.
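The following toy simulation illustrates this point. It's a sketch under simplifying assumptions, not Elasticsearch code: the search function is hypothetical, every document matches the implied query, and the five documents are made up. Only the list of hits changes from page to page, while the counts stay the same.

```python
from collections import Counter

def search(docs, from_=0, size=10, agg_field=None):
    """Toy search: every document matches; hits are paginated, aggregations are not."""
    matched = list(docs)
    response = {"total": len(matched), "hits": matched[from_:from_ + size]}
    if agg_field is not None:
        # The aggregation looks at all matching documents, not only the current page.
        counts = Counter(term for doc in matched for term in set(doc[agg_field]))
        response["aggregations"] = dict(counts)
    return response

# Five made-up documents: all are tagged "big data", even ids also "lucene".
docs = [
    {"id": i, "tags": ["big data", "lucene"] if i % 2 == 0 else ["big data"]}
    for i in range(1, 6)
]

page1 = search(docs, from_=0, size=2, agg_field="tags")
page2 = search(docs, from_=2, size=2, agg_field="tags")

print([doc["id"] for doc in page1["hits"]], page1["aggregations"])
print([doc["id"] for doc in page2["hits"]], page2["aggregations"])
```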
If you want to restrict query results further without also restricting aggregations, you can use post filters. We'll discuss post filters, and the relationship between filters and aggregations in general, next.
7.1.3 Filters and aggregations
In chapter 4 you saw that for most query types there is a filter equivalent. Because filters don't calculate scores and are cacheable, they're faster than their query counterparts. You've also learned that you should wrap filters in a filtered query, like this:
% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": { "filtered": { "filter": { "term": {
    "location": "denver"
  } } } }
}'
Using the filter this way is good for overall query performance, because the filter runs first. Then the query, which is typically more performance-intensive, runs only on documents matching the filter. As far as aggregations are concerned, they run only on documents matching the overall filtered query, as shown in figure 7.3.
Figure 7.3 A filter wrapped in a filtered query runs first, and restricts both results and aggregations
“Nothing new so far,” you might say, “the filtered query behaves like any other query when it comes to aggregations,” and you'd be right. But there's another way of running filters: using a post filter, which runs after the query and independently of the aggregations. The following request returns the same results as the previous filtered query:
% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "post_filter": { "term": {
    "location": "denver"
  } }
}'
As illustrated in figure 7.4, the post filter differs from the filter in the filtered query in two ways:
• Performance: The post filter runs after the query, so the query runs on all documents and the filter runs only on those matching the query. The overall request is typically slower than the filtered query equivalent, except when you have “expensive” filters, like the script filter.
• Document set processed by aggregations: If a document doesn't match the post filter, it will still be accounted for by aggregations.
Figure 7.4 Post filter runs after the query and doesn't affect aggregations
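The second difference is easier to see with numbers. The sketch below is a plain Python simulation, not Elasticsearch code: the run function, the in_denver predicate, and the three sample groups are all made up for illustration. It applies the same Denver filter once as part of the query and once as a post filter, then compares hits and tag counts.

```python
from collections import Counter

# Made-up sample groups; "location" holds a single lowercase term for simplicity.
groups = [
    {"name": "Denver Clojure", "location": "denver", "tags": ["clojure"]},
    {"name": "Elasticsearch Denver", "location": "denver", "tags": ["elasticsearch", "big data"]},
    {"name": "Elasticsearch San Francisco", "location": "san francisco", "tags": ["elasticsearch", "big data"]},
]

def in_denver(doc):
    """Toy stand-in for the term filter on the location field."""
    return doc["location"] == "denver"

def run(docs, query=None, post_filter=None, agg_field="tags"):
    """Toy request: the query runs first, aggregations next, the post filter last."""
    matched = [doc for doc in docs if query is None or query(doc)]
    # Aggregations see every document the query matched.
    aggs = dict(Counter(term for doc in matched for term in set(doc[agg_field])))
    # The post filter narrows only the returned hits.
    hits = [doc for doc in matched if post_filter is None or post_filter(doc)]
    return {"hits": [doc["name"] for doc in hits], "aggregations": aggs}

filtered = run(groups, query=in_denver)          # filter as part of the query
post = run(groups, post_filter=in_denver)        # same filter as a post filter

print(filtered)
print(post)
```

Both requests return the same two Denver groups as hits, but only the filtered variant restricts the tag counts to Denver. With the post filter, the San Francisco group still contributes to the counts for "elasticsearch" and "big data".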
Now that you understand the relationship between queries, filters, and aggregations, as well as the overall structure of an aggregation request, we can dive deeper into Aggregations Land and explore different aggregation types. We'll start with metrics aggregations, then move on to bucket aggregations, and then discuss how to combine them to get powerful insights from your data in real time.