Analyzing your data
This chapter covers
• Analyzing your document's text with Elasticsearch
• Using the analysis API
• Tokenization
• Character Filters
• Token Filters
• Stemming
• Analyzers included with Elasticsearch
So far we've covered indexing and searching your data, but what actually happens to the text in a document when it's sent to Elasticsearch? How can Elasticsearch find specific words within sentences, even when the case changes? For example, when a user searches for the word "Bear," you would generally like a document containing the sentence "I love bears and fish" to match, because the word "bear" is in there. Although you could use what you learned in the previous chapter to run a query_string search for "bear*" and find the document, this is much more easily accomplished with analysis. Once you finish this chapter, you'll have a better idea of how Elasticsearch's analysis lets you search your document set more flexibly.
5.1 What is analysis?
Analysis is the process Elasticsearch performs on the body of a document before the document is sent off to be indexed. Elasticsearch goes through a number of steps for every analyzed field before the document is added to the index:
• Character filtering: Transform the characters using a character filter.
• Breaking text into tokens: Break apart the text into a set of one or more tokens.
• Token filtering: Transform each token using a token filter.
• Token indexing: Store those tokens into the index.
We'll talk about each step in more detail next, but first, let's see the entire process summed up in a diagram. In figure 5.1, we'll use the text "I love Bears & Fish," which is eventually transformed into the analyzed tokens "i," "like," "bears," and "fish."
Figure 5.1 Overview of the analysis process
CHARACTER FILTERING
As you can see in the upper left of the figure, Elasticsearch first runs the character filters; these filters transform particular character sequences into other character sequences. This can be used for things like stripping HTML out of text, or converting an arbitrary number of characters into other characters (perhaps correcting the text-message shortening of "I love u too" into "I love you too"). In figure 5.1 we use the character filter to replace "&" with the word "and."
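To make the idea concrete, here is a minimal Python sketch of a mapping-style character filter. It is not Elasticsearch code; the function name apply_char_filter and the mappings dictionary are hypothetical, chosen only to mirror the "&" to "and" replacement in figure 5.1.

```python
def apply_char_filter(text, mappings):
    """Replace each mapped character sequence with its substitute.

    Hypothetical helper for illustration only; this is not how
    Elasticsearch implements its character filters.
    """
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


# The replacement from figure 5.1: "&" becomes "and".
filtered = apply_char_filter("I love Bears & Fish", {"&": "and"})
print(filtered)  # I love Bears and Fish
```

Note that the filter works on the raw string, before any tokens exist, which is why it runs first in the analysis chain.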
BREAKING INTO TOKENS
After the text has had the character filters applied, it needs to be split into pieces that can be operated on. Lucene itself doesn't act on large strings of data; instead, it acts on what are known as tokens. Tokens are generated from a piece of text, and a given piece of text can produce any number (even zero!) of tokens. In English, for example, a common choice is the whitespace tokenizer, which splits text into tokens based on whitespace such as spaces and newlines. In figure 5.1 this is represented by breaking the string "I love Bears and Fish" into the tokens "I," "love," "Bears," "and," and "Fish."
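As a rough illustration, whitespace tokenization can be sketched in a few lines of Python. The function whitespace_tokenize is a hypothetical stand-in rather than Elasticsearch's own tokenizer, and it assumes that splitting on runs of whitespace is all we need.

```python
def whitespace_tokenize(text):
    """Split text into tokens on runs of whitespace (spaces, tabs, newlines).

    Hypothetical stand-in for a whitespace tokenizer, for illustration only.
    """
    return text.split()


tokens = whitespace_tokenize("I love Bears and Fish")
print(tokens)  # ['I', 'love', 'Bears', 'and', 'Fish']

# Text made up only of whitespace produces zero tokens.
empty = whitespace_tokenize("   \n")
print(empty)  # []
```

The tokens keep their original case at this point; changing them is the job of the token filters.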
TOKEN FILTERING
Once the block of text has been converted into tokens, Elasticsearch applies what are called token filters to each token. A token filter takes a token as input and can modify it, remove it, or add further tokens as needed. One of the most useful and common examples is the lowercase token filter, which takes in a token and lowercases it, so that you can find a get-together about "The Doors" when searching for the term "doors." Tokens can pass through more than one token filter, each doing different things to the tokens to mold the data into the best format for your index.
In our example in figure 5.1, there are three token filters: the first lowercases the tokens; the second removes the stopword "and" (we'll talk about stopwords later in this chapter); and the third substitutes the word "like" for "love," using synonyms.
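The three filters can be sketched as a chain of Python generators, each consuming the output of the one before it. The function names, the stopword set, and the synonym dictionary are all hypothetical and only mirror the example in figure 5.1; they are not Elasticsearch's implementation.

```python
def lowercase_filter(tokens):
    """Modify: lowercase every token."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens, stopwords):
    """Remove: drop any token that appears in the stopword set."""
    for token in tokens:
        if token not in stopwords:
            yield token


def synonym_filter(tokens, synonyms):
    """Replace: swap a token for its synonym when one is defined."""
    for token in tokens:
        yield synonyms.get(token, token)


# Same order as in figure 5.1: lowercase, then stopwords, then synonyms.
stream = lowercase_filter(["I", "love", "Bears", "and", "Fish"])
stream = stop_filter(stream, {"and"})
stream = synonym_filter(stream, {"love": "like"})
result = list(stream)
print(result)  # ['i', 'like', 'bears', 'fish']
```

Order matters in a chain like this: because lowercasing runs first, the stopword and synonym entries only need to be listed in lowercase, and the token "I" ends up as "i".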
TOKEN INDEXING
After the tokens have gone through zero or more token filters, they are sent to Lucene to be indexed for the document. These tokens make up the inverted index we discussed back in chapter 1.
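As a reminder of what that means, here is a toy inverted index in Python: a mapping from each token to the set of documents that contain it. This is a simplified sketch, not Lucene's actual data structure, and the second document is invented for the example.

```python
from collections import defaultdict


def build_inverted_index(analyzed_docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in analyzed_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


# Document 1 holds the analyzed tokens from figure 5.1;
# document 2 is a made-up second document.
analyzed_docs = {
    1: ["i", "like", "bears", "fish"],
    2: ["bears", "eat", "fish"],
}
index = build_inverted_index(analyzed_docs)
print(sorted(index["bears"]))  # [1, 2]
print(sorted(index["like"]))   # [1]
```

Only the analyzed tokens are stored as keys, so a lookup for "Bears" with a capital B would find nothing in this toy index.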
Together, these different parts make up an analyzer, which can also be defined as zero or more character filters, a tokenizer, and zero or more token filters. There are some prebuilt analyzers that we'll talk about later in this chapter, which you can use without having to construct your own, but first we'll talk about the different components of an analyzer individually.
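That definition translates naturally into a small data structure. The sketch below is a hypothetical Python Analyzer class, not an Elasticsearch API: it assumes character filters are functions from string to string, the tokenizer is a function from string to a list of tokens, and token filters are functions from one token list to another.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Analyzer:
    """Zero or more character filters, one tokenizer, zero or more token filters."""

    tokenizer: Callable[[str], List[str]]
    char_filters: List[Callable[[str], str]] = field(default_factory=list)
    token_filters: List[Callable[[List[str]], List[str]]] = field(default_factory=list)

    def analyze(self, text: str) -> List[str]:
        for char_filter in self.char_filters:
            text = char_filter(text)          # 1. character filtering
        tokens = self.tokenizer(text)         # 2. breaking into tokens
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)     # 3. token filtering
        return tokens


# The analyzer from figure 5.1, built from inline stand-in components.
analyzer = Analyzer(
    tokenizer=str.split,
    char_filters=[lambda text: text.replace("&", "and")],
    token_filters=[
        lambda tokens: [t.lower() for t in tokens],
        lambda tokens: [t for t in tokens if t != "and"],
        lambda tokens: ["like" if t == "love" else t for t in tokens],
    ],
)
analyzed = analyzer.analyze("I love Bears & Fish")
print(analyzed)  # ['i', 'like', 'bears', 'fish']

# Only the tokenizer is required; both filter lists may be empty.
bare = Analyzer(tokenizer=str.split).analyze("I love Bears & Fish")
print(bare)  # ['I', 'love', 'Bears', '&', 'Fish']
```

The second call shows why the filters matter: with a tokenizer alone, the "&" and the original capitalization would go into the index unchanged.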
Analysis during a query
Depending on what kind of query you use, this analysis can also be applied to the search text before the search is performed against the index. In particular, queries such as the match and match_phrase queries perform analysis before searching, and queries like the term and terms queries do not. It's important to keep this in mind when debugging why a particular search matches or doesn't match a document: the search text might be analyzed differently than you expect!
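The difference can be simulated with a toy index in Python. This is only a sketch of the behavior described above, not how Elasticsearch executes queries: analyze, match_like_search, and term_like_search are hypothetical names, and the analyzer is assumed to be nothing more than whitespace splitting plus lowercasing.

```python
def analyze(text):
    """Toy analyzer: split on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]


# Index one document, storing only its analyzed tokens.
index = {}
for doc_id, body in {1: "The Doors"}.items():
    for token in analyze(body):
        index.setdefault(token, set()).add(doc_id)


def match_like_search(query_text):
    """Analyze the search text first, as a match query does."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return hits


def term_like_search(term):
    """Look up the search text exactly as given, as a term query does."""
    return index.get(term, set())


print(match_like_search("Doors"))  # {1}
print(term_like_search("Doors"))   # set()
print(term_like_search("doors"))   # {1}
```

The middle lookup finds nothing because the index holds "doors," not "Doors," and nothing lowercased the search text on the way in.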
Now that you have an understanding of what goes on during Elasticsearch's analysis phase, let's talk about how analyzers are specified for fields in your mapping, and how custom analyzers are specified.