Analyzing your data

5.2 Using analyzers for your documents

Knowing about the different types of analyzers and token filters is all fine and well, but before they can actually be used, Elasticsearch needs to know how you want to use them. For instance, you can specify which tokenizer and token filters an analyzer uses, and which analyzer each field in the mapping uses.

There are two ways to specify analyzers that can be used by your fields:

• when the index is created, as settings for that particular index; or

• as global analyzers in the configuration file for Elasticsearch.

Generally, to be more flexible, it's easier to specify them at the index creation time, which is also when you will want to specify your mappings. This allows you to create new indices with updated or entirely different analyzers. On the other hand, if you find yourself using the same set of analyzers across your indices, without changing them very often at all, you can also save yourself some bandwidth by putting the analyzers into the configuration file.

Examine how you are using Elasticsearch and pick the option that works best for you. You could even combine the two: put the analyzers that are used by all of your indices into the configuration file, and specify additional analyzers when you create indices, for added flexibility.

Regardless of the way you specify your custom analyzers, you will need to specify which field uses which analyzer in the mapping of your index, either by specifying the mapping when the index is created, or using the put mapping API to specify it at a later time.

5.2.1 Adding analyzers when an index is created

In chapter 3, you saw some of the settings you can specify when an index is created, notably the number of primary and replica shards for the index, which look something like the next listing.

Listing 5.1 Setting the number of primary and replica shards

% curl -XPOST 'localhost:9200/myindex' -d '{
  "settings" : {
    "number_of_shards": 2,       #A
    "number_of_replicas": 1      #B
  },
  "mappings" : {
    ...                          #C
  }
}'

#A Specifying custom settings for the index, here specifying 2 primary shards

#B And specifying 1 replica here

#C Mappings for the index

Adding a custom analyzer is done by specifying another map in the settings config, under the "index" key. This map should specify the custom analyzer you want to use, and can also contain the custom tokenizers, token filters, and char filters that can be used by the index.

Listing 5.2 shows a custom analyzer that specifies custom parts for all the analysis steps. This is a complex example, so we've added some headings to show the different parts. Don't worry about all the code details yet, as we'll go through it later on in this chapter.

Listing 5.2 Adding a custom analyzer during index creation

% curl -XPOST 'localhost:9200/myindex' -d '{
  "settings" : {
    "number_of_shards": 2,                                    #A
    "number_of_replicas": 1,                                  #A
    "index": {                                                #B
      "analysis": {                                           #C
Custom analyzer
        "analyzer": {                                         #D
          "myCustomAnalyzer": {                               #E
            "type": "custom",                                 #F
            "tokenizer": "myCustomTokenizer",                 #G
            "filter": ["myCustomFilter1", "myCustomFilter2"], #H
            "char_filter": ["myCustomCharFilter"]             #I
          }
        },
Tokenizer
        "tokenizer": {
          "myCustomTokenizer": {                              #J
            "type": "letter"                                  #J
          }                                                   #J
        },
Custom filters
        "filter": {
          "myCustomFilter1": {                                #K
            "type": "lowercase"                               #K
          },                                                  #K
          "myCustomFilter2": {                                #K
            "type": "kstem"                                   #K
          }                                                   #K
        },
Character filter
        "char_filter": {
          "myCustomCharFilter": {                             #L
            "type": "mapping",                                #L
            ...
          }
        }
      }
    }
  },
  "mappings" : {
    ...                                                       #M
  }
}'

#A Other settings for the index that we've covered before

#B Other "index"-level settings

#C The analysis settings for this index

#D Specifying a custom analyzer in the "analyzer" object

#E The custom analyzer is named "myCustomAnalyzer"

#F It's of type "custom"

#G It uses the "myCustomTokenizer" to tokenize text

#H Specify two filters that text should be run through, myCustomFilter1 and myCustomFilter2

#I Specify a custom char filter called "myCustomCharFilter" that will run before other analysis

#J Specifying the custom tokenizer of type "letter"

#K Two custom token filters, one for lowercasing and another using kstem

#L A custom char filter that translates characters to other mappings

#M Mappings for creating the index

The mappings have been left out of the code listing here, as we'll cover how to specify the analyzer for a field in section 5.2.3. In this example, a custom analyzer is created called myCustomAnalyzer, which uses the custom tokenizer myCustomTokenizer, two custom filters named myCustomFilter1 and myCustomFilter2, and a custom character filter named myCustomCharFilter (notice a trend here?). Each of these separate analysis parts is given in its respective JSON sub-map. You can define multiple analyzers, tokenizers, and filters under different names, and custom analyzers can combine those parts to give you flexible analysis options when indexing and searching.
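To make the relationship between the analyzer and its parts concrete, here is a minimal sketch that builds the request body of Listing 5.2 as a Python dictionary and checks that every part the analyzer refers to is defined in the matching sub-map. The char filter's "mappings" rule is a hypothetical placeholder and the helper `undefined_parts` is an illustrative name; neither comes from the listing.

```python
import json

# Request body for index creation, mirroring Listing 5.2.
# ASSUMPTION: the "mappings" rule of the char filter is a hypothetical
# placeholder; it is not taken from the listing.
index_body = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
        "index": {
            "analysis": {
                "analyzer": {
                    "myCustomAnalyzer": {
                        "type": "custom",
                        "tokenizer": "myCustomTokenizer",
                        "filter": ["myCustomFilter1", "myCustomFilter2"],
                        "char_filter": ["myCustomCharFilter"],
                    }
                },
                "tokenizer": {"myCustomTokenizer": {"type": "letter"}},
                "filter": {
                    "myCustomFilter1": {"type": "lowercase"},
                    "myCustomFilter2": {"type": "kstem"},
                },
                "char_filter": {
                    "myCustomCharFilter": {
                        "type": "mapping",
                        "mappings": ["ph=>f"],  # hypothetical rule
                    }
                },
            }
        },
    }
}


def undefined_parts(analysis):
    """List (analyzer, section, part) references with no definition in this body.

    Built-in parts referenced directly by name would also be listed, because
    this check only looks at the custom definitions in the body itself.
    """
    missing = []
    for analyzer_name, spec in analysis.get("analyzer", {}).items():
        refs = [("tokenizer", spec.get("tokenizer"))]
        refs += [("filter", name) for name in spec.get("filter", [])]
        refs += [("char_filter", name) for name in spec.get("char_filter", [])]
        for section, part in refs:
            if part is not None and part not in analysis.get(section, {}):
                missing.append((analyzer_name, section, part))
    return missing


analysis = index_body["settings"]["index"]["analysis"]
print(undefined_parts(analysis))  # [] because every referenced part is defined
body = json.dumps(index_body, indent=2)
print(body)  # JSON text that could be passed to curl with -d
```

Renaming a part in only one place, for example the tokenizer definition but not the reference inside myCustomAnalyzer, would make the helper report that reference.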

Now that you have a sense of what adding custom analyzers looks like when an index is created, let's look at the same analyzers added to the Elasticsearch configuration itself.

5.2.2 Adding analyzers to the Elasticsearch configuration

In addition to specifying analyzers with settings during index creation, you can specify custom analyzers in the Elasticsearch configuration. There are tradeoffs to this method, however. If you specify the analyzers during index creation, you will always be able to make changes to them without restarting Elasticsearch. If you specify the analyzers in the Elasticsearch configuration, you will need to restart Elasticsearch to pick up any changes you make to them. On the flip side, you'll have less data to send when creating indices. While it's generally easier to specify analyzers at index creation for the larger degree of flexibility, if you plan to never change your analyzers, you can go ahead and put them into the configuration file.

Specifying analyzers in the elasticsearch.yml configuration file is very similar to specifying them as JSON; here are the same custom analyzers from the previous section, but specified in the configuration YAML file:

index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: myCustomCharFilter
    ...
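Because the YAML form follows the JSON settings key for key, one can be derived mechanically from the other. The following is a minimal sketch of that correspondence, using a small hand-written renderer rather than a YAML library. The function name `to_yaml_lines` is illustrative, and the char filter's "mappings" rule is a hypothetical placeholder that does not come from the listings.

```python
import json

# ASSUMPTION: the "mappings" rule of the char filter is a hypothetical
# placeholder; the other names and types come from Listing 5.2.
config = {
    "index": {
        "analysis": {
            "analyzer": {
                "myCustomAnalyzer": {
                    "type": "custom",
                    "tokenizer": "myCustomTokenizer",
                    "filter": ["myCustomFilter1", "myCustomFilter2"],
                    "char_filter": ["myCustomCharFilter"],
                }
            },
            "tokenizer": {"myCustomTokenizer": {"type": "letter"}},
            "filter": {
                "myCustomFilter1": {"type": "lowercase"},
                "myCustomFilter2": {"type": "kstem"},
            },
            "char_filter": {
                "myCustomCharFilter": {"type": "mapping", "mappings": ["ph=>f"]}
            },
        }
    }
}


def to_yaml_lines(node, depth=0):
    """Render nested dicts as block-style YAML lines.

    Lists are written as JSON arrays, which YAML accepts as flow sequences.
    """
    pad = "  " * depth
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, depth + 1))
        elif isinstance(value, list):
            lines.append(f"{pad}{key}: {json.dumps(value)}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


yaml_lines = to_yaml_lines(config)
print("\n".join(yaml_lines))
```

The printed text has the same nesting as the "index" map in the JSON settings, with indentation taking the place of braces.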

5.2.3 Specifying the analyzer for a field in the mapping

There's one piece of the puzzle left before you're off on your way, analyzing fields with custom analyzers: how to specify that a particular field in the mapping should be analyzed using one of your custom analyzers. It's quite simple to specify the analyzer for a field by setting the "analyzer" field on a mapping. For instance, if we had the mapping for a field called "description," specifying the analyzer would look like this:

{
  ...
  "description" : {
    ...
    "analyzer" : "myCustomAnalyzer"    #A
  }
}

#A Specifying the analyzer "myCustomAnalyzer" for the description field
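As a minimal sketch, a complete mapping body of this kind can be written as a Python dictionary and serialized to the JSON that would be sent to Elasticsearch. The mapping type name `document` and the field type `string` are assumptions made for illustration; only the field name and the analyzer name come from the text.

```python
import json

# ASSUMPTIONS: the mapping type name "document" and the field type
# "string" are illustrative choices, not taken from the text.
mapping = {
    "mappings": {
        "document": {
            "properties": {
                "description": {
                    "type": "string",
                    # Analyze this field with the custom analyzer.
                    "analyzer": "myCustomAnalyzer",
                }
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

The analyzer named here has to be available to the index, either from its settings or from the configuration file.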

If you want a particular field to not be analyzed at all, you need to specify the "index" field with the not_analyzed setting. This keeps the text as a single token, without any kind of modification (no lowercasing or anything). It looks something like this:

{
  ...
  "name" : {
    ...
    "index" : "not_analyzed"    #A
  }
}

#A Specifying that the name field is not to be analyzed
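A minimal sketch of a full mapping body with a field that is not analyzed, again as a Python dictionary serialized to JSON. The mapping type name `document` and the field type `string` are assumptions for illustration; the field name and the not_analyzed setting come from the text.

```python
import json

# ASSUMPTIONS: the mapping type name "document" and the field type
# "string" are illustrative choices, not taken from the text.
mapping = {
    "mappings": {
        "document": {
            "properties": {
                "name": {
                    "type": "string",
                    # Keep the text as a single, unmodified token.
                    "index": "not_analyzed",
                }
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```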

There is a common pattern for fields where you may want to search on both the analyzed and verbatim text of a field, which is to stick them in multi-fields.

USING MULTI-FIELD TYPE TO STORE DIFFERENTLY ANALYZED TEXT

Often it's quite helpful to be able to search on both the analyzed version of a field and the original, non-analyzed text. This is especially useful for things like facets and aggregations, or sorting on a string field. Elasticsearch makes this simple to do by using multi-fields, which we first saw in chapter 3. Take the "name" field for example; you may want to be able to sort on the name field, but search through it using analysis. You can specify a field that does both like so:

#A The original analysis, using the standard analyzer

#B A raw version of the field, which is not analyzed
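A minimal sketch of such a multi-field mapping, together with a search body that uses both versions of the field. The mapping type name `document`, the sub-field name `raw`, the field type `string`, and the search text are assumptions made for illustration.

```python
import json

# ASSUMPTIONS: the mapping type name "document", the sub-field name "raw",
# and the field type "string" are illustrative choices.
mapping = {
    "mappings": {
        "document": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "standard",  # analyzed version, for searching
                    "fields": {
                        "raw": {  # verbatim version, for sorting
                            "type": "string",
                            "index": "not_analyzed",
                        }
                    },
                }
            }
        }
    }
}

# Search on the analyzed field and sort on the verbatim sub-field.
# The search text "elasticsearch" is a hypothetical example value.
search = {
    "query": {"match": {"name": "elasticsearch"}},
    "sort": [{"name.raw": "asc"}],
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The sub-field is addressed by joining the field name and the sub-field name with a dot, here `name.raw`.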

We’ve covered how to specify analyzers; now we're ready to cover a neat way to check how any arbitrary text can be analyzed: the analyze API.

Source: MEAP Edition, Manning Early Access Program, Elasticsearch in Action, Version 11 (pages 132-137)