Diving into the functionality
3.4 Using predefined fields
3.4.1 Control how to store and search your documents
Let’s start by looking at _source, which lets you store the documents you index, and _all, which lets you index all their content in a single field.
_SOURCE FOR STORING THE ORIGINAL CONTENTS
The _source field is for storing the original document, in the original format. This lets you see the documents that matched a search, not only their IDs.
_source can have enabled set to true or false, to specify whether you want to store the original document or not. By default it’s true, and, in most cases, that’s good because the existence of _source allows you to use other important features of Elasticsearch. For example, as you’ll learn later in this chapter, updating document contents using the update API needs _source.
To see how this field works, let’s look at what Elasticsearch typically returns when you retrieve a previously indexed document:
% curl 'localhost:9200/get-together/new-events/downloadable?pretty'
{
  "_index" : "get-together",
  "_type" : "new-events",
  "_id" : "downloadable",
  "_version" : 1,
  "exists" : true,
  "_source" : {
    "name": "Broadcasted Elasticsearch News",
    "downloadable": true
  }
}
You also get the _source JSON back when you search, because it’s returned there by default. If you disable _source, you don’t get the original document in the reply. This is typically done when you have a separate data store for your original content. In such a situation, you may want to index every entry in Elasticsearch with the same ID it has in the data store. When you search, you get the list of IDs from the results and then go back to the data store to get the content. Such a process is illustrated in figure 3.3.
Figure 3.3 Using Elasticsearch for indexing only and using a different data store for document content
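The flow in figure 3.3 can be sketched in a few lines of Python. This is a toy model, not a client for Elasticsearch: the search_index and data_store dictionaries, the function names, and the sample groups other than “Elasticsearch Denver” are assumptions made up for illustration. It only shows the two steps: search returns IDs, and the IDs are resolved against the separate store.

```python
# Hypothetical stand-ins: a real setup would query Elasticsearch and a real database.
search_index = {            # term -> IDs of the documents that contain it
    "elasticsearch": ["1", "3"],
    "denver": ["1"],
}
data_store = {              # ID -> original document, kept outside Elasticsearch
    "1": {"name": "Elasticsearch Denver"},
    "2": {"name": "Big Data Boulder"},
    "3": {"name": "Elasticsearch San Francisco"},
}


def search_ids(term):
    """Return only the IDs of matching documents, as a search with _source disabled would."""
    return search_index.get(term.lower(), [])


def fetch_documents(ids):
    """Look up the original content in the separate data store by ID."""
    return [data_store[doc_id] for doc_id in ids if doc_id in data_store]


hits = search_ids("elasticsearch")
documents = fetch_documents(hits)
print(documents)
```

The important design point is that both systems share the same ID for each document; without that, the second step has nothing to look up.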
As you saw in sections 3.2 and 3.3, the fields you define for your documents go under the properties field of the JSON mapping. Predefined fields, including _source, go directly under the mapping name. This makes it clear that predefined fields have a special status and aren’t just another property of your document. In this case, the _source field isn’t content that you add to a document but a way to control how Elasticsearch stores it. The document remains the same; it’s the functionality around it that changes. To disable _source, you can define a mapping as shown in the following listing:
Listing 3.5 Disabling _source
% curl -XPUT localhost:9200/get-together/events_unstored/_mapping -d '{
  "events_unstored": {
    "_source": { "enabled": false},    #A
    "properties": {
      "name": {                        #B
        "type": "string"               #B
      }                                #B
    }
  }
}'
#A Predefined field, defined at the root of the mapping type
#B Custom data field, defined under properties
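If you generate mappings from code rather than typing them by hand, the same layout can be built programmatically. The following is a minimal sketch; build_mapping is a hypothetical helper, not part of any Elasticsearch client. It only shows where each kind of field goes: _source at the root of the mapping type, data fields under properties.

```python
import json


def build_mapping(type_name, properties, source_enabled=True):
    """Build a mapping body: predefined fields at the root of the type, data fields under "properties"."""
    return {
        type_name: {
            "_source": {"enabled": source_enabled},   # predefined field
            "properties": properties,                  # custom data fields
        }
    }


body = build_mapping(
    "events_unstored",
    {"name": {"type": "string"}},
    source_enabled=False,
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the request body in listing 3.5.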
RETURNING ONLY SOME FIELDS OF THE SOURCE DOCUMENT
When you retrieve or search for a document, you can ask Elasticsearch to return only specific fields, and not the entire source. One way to do this is to give a comma-separated list of fields in the fields parameter. For example:
% curl -XGET 'localhost:9200/get-together/group/1?pretty&fields=name'
When the source is stored, Elasticsearch automatically goes to the source, gets the required fields and returns them to you. When you have no source, there’s nothing to return.
For example, if you followed listing 3.5 and indexed a sample document, trying to retrieve the name field won’t get you anything.
If _source is disabled, you can store individual fields by setting the store option to yes. For example, to store only the name field, your mapping might look like this:
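% curl -XPUT localhost:9200/get-together/events_unstored/_mapping -d '{
  "events_unstored": {
    "_source": { "enabled": false },
    "properties": { "name": { "type": "string", "store": "yes" } }
  }
}'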
You can also choose to store both _source and individual fields. This might be useful when you often ask Elasticsearch for a particular field because retrieving a single stored field will be faster than retrieving the entire _source and extracting that field from it.
When you store _source and individual fields, you should take into account that the more you store, the bigger your index gets. And usually, bigger indices imply slower indexing and slower searching. The good news here is that since version 0.90, Elasticsearch automatically compresses both _source and any individual fields you might choose to store. This is useful because it keeps your index size small, and lets the operating system keep more of your data in its caches. In most situations, the overhead of compressing and uncompressing fields is insignificant when you compare it to the benefit of having smaller indices.
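The retrieval rules from this section can be summarized with a small Python model. It is a simplification for illustration only: get_fields and its parameters are hypothetical names, and the model ignores everything about how Elasticsearch actually stores data. It captures one rule: a requested field can be returned if the source was kept or if that field was stored individually.

```python
def get_fields(doc, requested, source_enabled=True, stored_fields=()):
    """Toy model: return the requested fields that were kept at index time."""
    result = {}
    for field in requested:
        if field not in doc:
            continue  # the document never had this field
        if source_enabled or field in stored_fields:
            result[field] = doc[field]
    return result


doc = {"name": "Broadcasted Elasticsearch News", "downloadable": True}

with_source = get_fields(doc, ["name"])
no_source = get_fields(doc, ["name"], source_enabled=False)
name_stored = get_fields(
    doc, ["name", "downloadable"], source_enabled=False, stored_fields=("name",)
)
print(with_source, no_source, name_stored)
```

With _source disabled and only name stored, asking for downloadable yields nothing, which mirrors the trade-off described above: you save space but can get back only what you chose to store.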
_ALL FOR INDEXING EVERYTHING
Just as _source stores everything, _all indexes everything. When you search in _all, Elasticsearch returns a hit regardless of which field matches. This is useful when users are looking for something without knowing where to look. For example, searching for “elasticsearch” may match the group name “Elasticsearch Denver” as well as the tag “elasticsearch” on other groups.
Running a search from the URI without a field name will search on _all by default:
curl 'localhost:9200/get-together/group/_search?q=elasticsearch'
If you always search on specific fields, you can disable _all by setting enabled to false. As with _source, doing so reduces the total size of your index and makes indexing operations faster.
By default, each field is included in _all, by having include_in_all implicitly set to true.
You can use this option to control what is and isn’t included in _all. In the next listing, you’ll create a mapping where you include in _all only two of the total of three data fields. Then you’ll search in specific fields and in _all and compare the results.
Listing 3.6 Using include_in_all to store only some fields in _all
curl -XPUT 'localhost:9200/get-together/custom-all/_mapping' -d '{
#E curl "$CUSTOM_ALL/_search?q=organizer:lee&pretty" #F
#A These are included in _all by default
#B You explicitly say you don't need this in _all
#C Returns result because the name field is in _all
#D Returns result because you can search in the specific field
#E Doesn’t return result because the organizer field isn’t included in _all
#F Returns result when you search in the specific field
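The four outcomes described by annotations #C through #F can be reproduced with a toy model of _all in Python. This is a sketch under stated assumptions: the sample document and the bare-term queries are made up for illustration, analysis is reduced to lowercasing and splitting on whitespace, and build_all and matches are hypothetical helpers, not how Elasticsearch implements _all.

```python
def build_all(doc, mapping):
    """Collect lowercase terms from every field whose include_in_all is not False."""
    terms = set()
    for field, value in doc.items():
        if mapping.get(field, {}).get("include_in_all", True):
            terms.update(str(value).lower().split())
    return terms


def matches(doc, mapping, query):
    """Evaluate a URI-style query: "field:term" searches one field, a bare term searches _all."""
    if ":" in query:
        field, term = query.split(":", 1)
        return term.lower() in str(doc.get(field, "")).lower().split()
    return query.lower() in build_all(doc, mapping)


# name and tags are included in _all by default; organizer is explicitly excluded.
mapping = {
    "name": {"type": "string"},
    "tags": {"type": "string"},
    "organizer": {"type": "string", "include_in_all": False},
}
# Hypothetical sample document.
doc = {"name": "Elasticsearch Denver", "tags": "elasticsearch", "organizer": "Lee"}

for query in ["elasticsearch", "name:elasticsearch", "lee", "organizer:lee"]:
    print(query, matches(doc, mapping, query))
```

Only the bare lee query fails to match, because organizer is the one field kept out of _all.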
Using include_in_all gives you flexibility not only in terms of saving space but also regarding how your queries behave. As you saw in the previous example, if a user searches without specifying a field, you might want to give back matches from the name and tags fields and match organizer fields only when the user specifically asks for that. This avoids surprising results if most users have a name or a tag in mind when they type a keyword.
The next set of predefined fields are those used to identify documents: _index, _type, _id and _uid.