
} } }'

# expected result

# {"error":"MergeMappingException[Merge failed with failures {[mapper [host] of different type, current_type [string], merged_type [long]]}]","status":400}

The only way around this error is to reindex all the data in new-events, which involves the following steps:

• Remove all data from the new-events type. You’ll learn later in this chapter how to delete data. Removing data also removes the current mapping.

• Put the new mapping.

• Index all the data again.

To understand why reindexing might be required, imagine you’ve already indexed an event with a string in the host field. If you want the host field to be long now, Elasticsearch would have to change the way host is indexed in the existing document. As you’ll explore later in this chapter, editing an existing document implies deleting and indexing again.
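The rule behind that error can be modeled in a few lines. The following Python sketch is a toy, in-memory model and not Elasticsearch code: `merge_mapping` and `MappingConflict` are hypothetical names, and a mapping is reduced to a plain dictionary of field name to type name. It accepts new fields but rejects a type change for an existing field, which is the situation that forces a reindex.

```python
class MappingConflict(Exception):
    """Raised when an existing field would change type (hypothetical name)."""


def merge_mapping(current, update):
    """Return a new mapping with the fields from `update` added to `current`.

    Both arguments are dicts of field name -> type name. Adding fields is
    allowed; changing the type of an existing field is not.
    """
    merged = dict(current)
    for field, new_type in update.items():
        old_type = merged.get(field)
        if old_type is not None and old_type != new_type:
            raise MappingConflict(
                f"mapper [{field}] of different type, "
                f"current_type [{old_type}], merged_type [{new_type}]"
            )
        merged[field] = new_type
    return merged


current = {"name": "string", "host": "string"}

# Adding a new field works.
with_date = merge_mapping(current, {"date": "date"})

# Changing host from string to long is rejected.
try:
    merge_mapping(current, {"host": "long"})
    conflict = None
except MappingConflict as error:
    conflict = str(error)
```

In this model, as in the text above, the only way to give host a different type is to start from an empty mapping and index the data again.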

To define mappings that hopefully will need only additions, not changes, let’s look at the core types you can choose for your fields in Elasticsearch and what you can do with them.

3.2 Core types for defining your own fields in documents

With Elasticsearch, a field can be one of the core types (see table 3.1), such as a string or a number, or it can be a more complex type derived from core types, such as an array.

There are some additional types not covered in this chapter. For example, there’s the nested type, which allows you to have documents within documents, and the geo_point type, which stores a location on Earth based on its longitude and latitude. We’ll discuss those additional types in chapter 7, where we cover relationships among documents, and in appendix A, where we discuss geospatial data.

NOTE In addition to the fields you define in your documents, such as name or date, Elasticsearch uses a set of predefined fields to enrich them. For example, there’s an _all field, where all the document’s fields are indexed together. This is useful when users search for something without specifying the field—you can search in all fields. These predefined fields have their own configuration options, and we’ll discuss them later in this chapter.

Table 3.1 Elasticsearch core field types

| Core type | Example values |
|-----------|----------------|
| String | "Lee", "Elasticsearch Denver" |
| Numeric | 17, 3.2 |
| Date | 2013-03-15T10:02:26.231+01:00 |
| Boolean | Value can be either true or false |

Let’s look at each of these core types, so you can make good mapping choices when you index your own data.

3.2.1 String

Strings are the most straightforward: your field should be string if you’re indexing characters.

They’re also the most interesting because you have so many options in your mapping about how to analyze them.

Analysis is the process of parsing the text to transform it and break it down into elements to make searches relevant. If it sounds too abstract, don’t worry: chapter 5 explores the concept. But let’s look at the basics now starting with the document you indexed in listing 3.1:

% curl -XPUT 'localhost:9200/get-together/new-events/1' -d '{
  "name": "Late Night with Elasticsearch",
  "date": "2013-10-25T19:00"
}'

With this document indexed, let’s search for the word “late” in the name field, which is a string:

% curl 'localhost:9200/get-together/new-events/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "late"
    }
  }
}'

And the search finds the “Late Night with Elasticsearch” document you indexed in listing 3.1. Elasticsearch connects the strings "late" and "Late Night with Elasticsearch" through analysis. As you can see in figure 3.2, when you index "Late Night with Elasticsearch", the default analyzer lowercases all letters and then breaks the string into words.

Figure 3.2 After the default analyzer breaks strings into terms, subsequent searches match those terms.

The analyzer removes the word "with" because it’s so common it belongs to a list of stop words. By default, stop words are eliminated during analysis because they appear so frequently that they’re irrelevant in searches.

The analysis produces three terms: “late”, “night”, and “elasticsearch”. The same process is then applied to the query string, but this time, “late” produces the same string: “late.” The document (doc1) matches the search because the “late” term that resulted from the query matches the “late” term that resulted from the document.

DEFINITION A term is a word from the text and is the basic unit for searching. In different contexts, this “word” can mean different things: it could be a name, or it could be an IP address, for example. If you want only exact matches on a field, the entire field should be treated as a word.

On the other hand, if you index “latenight”, the default analyzer creates only one term: “latenight”. Searching for “late” won’t hit doc1 because it doesn’t include the term “late”.
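To make the steps concrete, here is a minimal Python sketch of the analysis described above. It is a toy model, not the Lucene analyzer that Elasticsearch runs: the `analyze` and `matches` functions and the short stop word list are assumptions made for illustration.

```python
import re

# Assumption: a tiny stop word list for illustration only.
STOP_WORDS = {"with", "the", "a", "an", "and", "of"}


def analyze(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def matches(query, document):
    """A document matches if any query term is among the document's terms."""
    document_terms = set(analyze(document))
    return any(term in document_terms for term in analyze(query))


doc_terms = analyze("Late Night with Elasticsearch")
```

With these definitions, `doc_terms` holds the three terms late, night, and elasticsearch; `matches("late", "Late Night with Elasticsearch")` is true, while `matches("late", "latenight")` is false because the only term produced is latenight.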

MAPPING AND ANALYSIS INTERPLAY

This analysis process is where the mapping comes into play. You can specify many options around analyzing in your mapping. For example, you can configure stemming to take place during analysis. Stemming reduces words to their root form, so a query for one form of a word (such as “event”) also matches other forms (such as “events”). We’ll dive into the details of analysis in chapter 5, as promised, but for now, let’s look at the index option, which can be set to analyzed (the default), not_analyzed, or no. For example, to set the name field to not_analyzed, your mapping might look like this:

% curl -XPUT 'localhost:9200/get-together/new-events/_mapping' -d '{
  "new-events": {
    "properties": {
      "name": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'

Setting index to analyzed produces the behavior you saw previously: by default, the analyzer lowercases all letters, breaks your string into words, and eliminates stop words. Use this option when your strings are long enough and you expect a single matching word to produce a match. For example, if users search for “elasticsearch,” they expect to see “Late Night with Elasticsearch” in the list of results.

Setting index to not_analyzed does the opposite: the analysis process is skipped, and the entire string is indexed as one term. Use this option when you want exact matches, such as when you search for tags. You probably want only “big data” to show up as a result when you search for “big data,” not “data”.

If you set index to no, then indexing is skipped and no terms are produced, so you won’t be able to search on that particular field. When you don’t need to search on a field, this option saves space and decreases the time it takes to index and search. For example, you might store reviews for events. Although storing and showing those reviews is valuable, searching through them might not be. In this case, disable indexing for that field, making the indexing process faster and saving space.
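The three settings can be compared side by side with a small Python sketch. This is an illustration of the behavior described above, not Elasticsearch code: `terms_for` is a hypothetical helper, and the stop word list is an assumption.

```python
import re

STOP_WORDS = {"with", "the", "a", "an", "and", "of"}  # illustrative list only


def terms_for(value, index="analyzed"):
    """Return the terms a string field would produce for each index setting."""
    if index == "analyzed":
        words = re.findall(r"[a-z0-9]+", value.lower())
        return [word for word in words if word not in STOP_WORDS]
    if index == "not_analyzed":
        return [value]  # the whole string is a single term
    if index == "no":
        return []  # nothing is indexed, so nothing can be searched
    raise ValueError(f"unknown index option: {index}")


analyzed_tag = terms_for("big data", "analyzed")
exact_tag = terms_for("big data", "not_analyzed")
unindexed_review = terms_for("Great event, would attend again", "no")
```

Here the analyzed tag yields the two terms big and data, so a search for data alone would match it; the not_analyzed tag yields the single term big data; and the unindexed review yields no terms at all.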

Check if your query is analyzed when searching in fields that aren’t

For some queries, such as the query_string you used previously, the analysis process is applied to your search criteria. It’s important to be aware of when this happens; otherwise, results might not be what you expect.

For example, if you index “Elasticsearch,” and it’s not analyzed, it produces the term “Elasticsearch”. When you query for “Elasticsearch” like this:

% curl 'localhost:9200/get-together/new-events/_search?q=Elasticsearch'

the URI request is analyzed, and the term “elasticsearch” (lowercased) is produced. But you don’t have the term “elasticsearch” in your index; you have only “Elasticsearch” (with a capital E), so you get no hits.

In chapter 4, which is all about searching, you’ll learn which query types analyze the input text and which don’t.
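The mismatch in this sidebar comes down to a case-sensitive term lookup, which the following Python sketch imitates. It is a simplified model under stated assumptions: the index is a plain set of terms, and `analyze_query` (a hypothetical name) only lowercases and splits on whitespace.

```python
def analyze_query(text):
    """Toy query-time analysis: lowercase and split on whitespace."""
    return text.lower().split()


# The field is not_analyzed, so the index holds the original string as one term.
indexed_terms = {"Elasticsearch"}

# An analyzed query is lowercased before the lookup, so it finds nothing.
analyzed_hits = [t for t in analyze_query("Elasticsearch") if t in indexed_terms]

# A query that skips analysis looks up the text exactly as typed.
exact_hits = [t for t in ["Elasticsearch"] if t in indexed_terms]
```

The analyzed lookup returns an empty list, while the exact lookup finds the term.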

Next, let’s look at how you can index numbers. Elasticsearch provides many core types that can help you deal with numbers, so we’ll refer to them collectively as numeric.

3.2.2 Numeric

Numeric types can be with or without a floating point. If you don’t need decimals, you can choose between byte, short, integer, and long; if you do need them, your choices are float and double. These types correspond to Java’s primitive data types, and choosing between them influences the size of your index and the range of values you can index. For example, a long takes up 64 bits and a short only 16, but a long can store a range of values roughly 281 trillion (2^48) times larger than the -32,768 to 32,767 that a short can store.

If you don’t know the range you need for your integer values or the precision you need for your floating point values, it’s safe to do what Elasticsearch does when it detects your mapping automatically: use long for integer values, and double for floating-point values. Your index might become larger and slower because these two types take up the most space, but at least you’re unlikely to get an “out of range” error from Elasticsearch when indexing.
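If you do know the range of your integer values, the trade-off can be worked out directly. The Python sketch below derives each type's range from its bit width, assuming signed two's-complement integers as in Java's primitive types; `value_range` and `smallest_type_for` are hypothetical helper names, not part of any Elasticsearch API.

```python
# Bit widths of the integer types, as in Java's primitive types.
INTEGER_BITS = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def value_range(type_name):
    """Return (min, max) for a signed two's-complement integer type."""
    bits = INTEGER_BITS[type_name]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(low, high):
    """Pick the narrowest integer type that can hold every value in [low, high]."""
    for type_name in ("byte", "short", "integer", "long"):
        type_min, type_max = value_range(type_name)
        if type_min <= low and high <= type_max:
            return type_name
    raise OverflowError("range does not fit in a long")


short_range = value_range("short")            # (-32768, 32767)
attendees_type = smallest_type_for(0, 100000)  # "integer"
```

A field counting up to 100,000 attendees, for example, doesn't fit in a short but fits comfortably in an integer.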

Now that we’ve covered strings and numbers, let’s look at a type that’s more purpose-built: date.

3.2.3 Date

The date type is used for storing dates and times. It works like this: you normally provide a string with a date, as in 2013-12-25T09:00:00. Then, Elasticsearch parses the string and stores it as a number of type long in the Lucene index. That long is the number of milliseconds between 00:00:00 UTC on January 1, 1970 (the UNIX epoch) and the time you provided.

When you search for documents, you still provide date strings, and Elasticsearch parses those strings and works with numbers in the background. It does that because numbers are faster to store and work with than strings.
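The conversion from date string to long can be reproduced with Python's standard library. This is a sketch of the arithmetic only, not of Elasticsearch's parser; `to_epoch_millis` is a hypothetical name, and treating strings without a time zone as UTC is an assumption made here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(date_string):
    """Parse an ISO 8601 string and return milliseconds since the UNIX epoch.

    Assumption: strings without a time zone are treated as UTC.
    """
    parsed = datetime.fromisoformat(date_string)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(milliseconds=1)


millis = to_epoch_millis("2013-12-25T09:00:00")  # 1387962000000
```

Once dates are numbers, comparing two dates or checking whether a date falls in a range becomes a simple integer comparison.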

You, on the other hand, only have to consider whether Elasticsearch understands the date string you’re providing. The date format of your date string is defined by the format option, and Elasticsearch parses ISO 8601 timestamps by default.

ISO 8601

An international standard for exchanging date- and time-related data, ISO 8601 is widely used in timestamps due to RFC 3339 (https://www.ietf.org/rfc/rfc3339.txt). An ISO 8601 date looks like this:

2013-10-11T10:32:45.453-03:00

It has all the right ingredients of a good timestamp: information is read from left to right, from the most important to the least important; the year has four digits; and the time includes subseconds and time zone.

Much of the information in this timestamp is optional. For example, you don’t need to specify milliseconds, and you can skip the time altogether.

When you use the format option to specify a date format, you have two options:

• Use a predefined date format. For example, the “date” format parses dates as “2013-02-25.” Many predefined formats are available, and you can see them all here: www.elasticsearch.org/guide/reference/mapping/date-format/

• Specify your own custom format. You can specify a pattern for timestamps to follow. For example, specifying “MMM YYYY” parses dates as “Jul 2001.” For a full reference on building date patterns, visit: http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html
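To get a feel for pattern-based parsing, here is a rough analogy in Python. Note the assumption: Python's `strptime` directives (`%b`, `%d`, `%Y`) are not the Joda-Time pattern letters that Elasticsearch expects in the format option; they only play the same role of describing where the month, day, and year appear. `parse_with_pattern` is a hypothetical helper.

```python
from datetime import datetime

# Assumption: these strptime patterns stand in for Joda-style patterns;
# the directive syntax is Python's, not Elasticsearch's.
MONTH_YEAR = "%b %Y"          # e.g. "Jul 2001"
MONTH_DAY_YEAR = "%b %d %Y"   # e.g. "Oct 25 2013"


def parse_with_pattern(date_string, pattern):
    """Parse a date string that must follow the given pattern exactly."""
    return datetime.strptime(date_string, pattern)


july = parse_with_pattern("Jul 2001", MONTH_YEAR)
october = parse_with_pattern("Oct 25 2013", MONTH_DAY_YEAR)
```

A string that doesn't follow the pattern, such as "2013-10-25" parsed with the month-and-year pattern, raises a `ValueError`, in the same way that a date string is rejected when it doesn't match the format defined for its field.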

To put all this date information to use, let’s add a new mapping type called weekly-events, as shown in listing 3.3. Then, as also shown in the listing, add a title and date of the first event, and specify an ISO 8601 timestamp for that date. Also add a field with the date of the next event, and specify a custom date format for that date.

Listing 3.3 Using default and custom time formats

% curl -XPUT 'localhost:9200/get-together/weekly-events/_mapping' -d '{
  "weekly-events": {
    "properties": {
      "next_event": {
        "type": "date",
        "format": "MMM dd YYYY"
      }
    }
  }
}'

% curl -XPUT 'localhost:9200/get-together/weekly-events/1' -d '{
  "name": "Elasticsearch News",
  "first_occurence": "2011-04-03",
  "next_event": "Oct 25 2013"
}'

#A (the next_event mapping) Defines the custom date format. Other dates are automatically detected and don’t need to be explicitly defined.

#B (the first_occurence value) Specifies a standard date/time format. Only the date is included; the time isn’t specified.

We’ve talked about strings, numbers, and dates; let’s move on to the last core type: boolean.

Like date, boolean is a type that’s more purpose-built.

3.2.4 Boolean

The boolean type is used for storing true/false values from your documents. For example, you might want a field that indicates whether the event’s video is available for download. A sample document could be indexed like this:

% curl -XPUT 'localhost:9200/get-together/new-events/downloadable' -d '{
  "name": "Broadcasted Elasticsearch News",
  "downloadable": true
}'

The downloadable field is automatically mapped as boolean and is stored in the Lucene index as T for true or F for false. As with date fields, Elasticsearch parses the value you supply in the source document and transforms true and false to T and F, respectively. If you supply a number, it transforms 0 to F and any other number to T:

% curl -XPUT 'localhost:9200/get-together/new-events/downloadable2' -d '{
  "name": "Broadcasted Big Data News",
  "downloadable": 0
}'

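The conversion rules for boolean values fit in a few lines of Python. This sketch only mirrors the behavior described in this section; `to_indexed_boolean` is a hypothetical name, and rejecting other value types is an assumption, because the text covers only true, false, and numbers.

```python
def to_indexed_boolean(value):
    """Map a source value to the stored form: "T" for true, "F" for false."""
    # Check bool first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "T" if value else "F"
    if isinstance(value, (int, float)):
        return "F" if value == 0 else "T"
    # Assumption: anything else is rejected in this sketch.
    raise TypeError(f"unsupported boolean value: {value!r}")


stored_true = to_indexed_boolean(True)   # "T"
stored_zero = to_indexed_boolean(0)      # "F"
stored_other = to_indexed_boolean(42)    # "T"
```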
We’ve looked at the core types: string, numeric, date, and boolean, which you can use in your own fields; let’s move on to arrays and multi fields, which enable you to use the same core type multiple times.

From the document: MEAP Edition, Manning Early Access Program, Elasticsearch in Action, Version 11 (pages 67-73)