Indexing new data - Diving into the functionality

2.3 Indexing new data

Although chapter 3 gets into the details of indexing, here the goal is to give you a feel for what indexing is about. In this section, we’ll discuss the following processes:

• Indexing a document with cURL. To send your first document, you’ll use the HTTP API to send a JSON document to be indexed with Elasticsearch. Then, we’ll have a look at the JSON reply that comes back.

• Looking at how Elasticsearch automatically creates the index and type to which your document belongs if they don’t exist already.

• Running a script to index additional documents. You’ll run the code samples for this chapter to quickly index additional files. This way, you have a bunch of documents ready to search through.

You’ll index your first document by hand, so let’s start by looking at how to issue an HTTP PUT request to a URI. A sample URI is shown in figure 2.10, with each part labeled.

Figure 2.10 URI of a document in Elasticsearch
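The figure itself is not reproduced in this copy, but the parts of the URI can be inferred from the command used later in this section. As a minimal sketch, the following shell snippet assembles that URI from its parts. The variable names are illustrative only and are not part of any Elasticsearch API; the values are the ones used for the first document below, with 9200 being the default HTTP port.

```shell
# Parts of a document URI (variable names are illustrative only)
protocol='http'        # optional with curl, which assumes http when omitted
host='localhost'       # machine where Elasticsearch is running
port='9200'            # default HTTP port of Elasticsearch
index='get-together'   # index name
doc_type='group'       # type name
id='1'                 # document ID

uri="${protocol}://${host}:${port}/${index}/${doc_type}/${id}"
echo "$uri"
```

Running it prints the full address of the document, which is the target of the PUT request in the next section.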

Let’s walk through how you issue the request.

2.3.1 Indexing a document with cURL

For most snippets in this book we’ll use the cURL binary. cURL is a command-line tool for transferring data over HTTP. You’ll use the curl command to make HTTP requests, as it has become a convention to use cURL for Elasticsearch code snippets. That’s because it’s easy to translate a cURL example into any programming language. In fact, if you ask for help on the official mailing list for Elasticsearch, it’s recommended that you provide a curl re-creation of your problem. A curl re-creation is a command or a sequence of curl commands that reproduces the problem you’re experiencing, and anyone who has Elasticsearch installed locally can run it.

Installing cURL

If you’re running on a UNIX-like operating system, like Linux or Mac OS X, then you’re likely to have the curl command available. If you don’t have it already, or if you’re on Windows, you can download it from http://curl.haxx.se/. You can also install Cygwin, and then select cURL as part of the Cygwin installation, which is the approach we recommend.

Using Cygwin to run curl commands on Windows is preferred because you can copy-paste the commands that work on UNIX-like systems. If you choose to stick with the Windows shell, take extra care because single quotes behave differently on Windows. In most situations, you must replace single quotes (') with double-quotes (") and escape double quotes with a backslash (\"). For example, a UNIX command like this

curl 'http://localhost' -d '{"field": "value"}'

looks like this on Windows:

curl "http://localhost" -d "{\"field\": \"value\"}"

Assuming you can use the curl command and you have Elasticsearch installed with the default settings on your local machine, you can index your first document with the following command:

% curl -XPUT 'localhost:9200/get-together/group/1?pretty' -d '{
  "name": "Elasticsearch Denver",
  "organizer": "Lee"
}'

You should get the following output:

{
  "ok" : true,
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "1",
  "_version" : 1,
  "created" : true
}

The reply tells you whether the request succeeded or failed. If it worked, you should get back the index, type, and ID of the indexed document. In this case, you get the ones you specified, but it’s also possible to rely on Elasticsearch to generate IDs, as you’ll learn in chapter 3. You also get the version of the document, which begins at 1 and is incremented with each update. You’ll learn all about updates in chapter 3.

There are many ways to use curl to make HTTP requests; run man curl to see all of them.

Throughout this book, we use the following curl usage conventions:

• The method, which is typically GET, PUT or POST, is the argument of the -X parameter. You can add a space between the parameter and its argument, but we don’t add one. For example, we use -XPUT instead of -X PUT. The default method is GET, and when we use it, we skip the -X parameter altogether.

• In the URI, we skip specifying the protocol; it’s always http, and curl uses http by default when no protocol is specified.

• We put single quotes around the URI because it can contain multiple parameters, and you have to separate the parameters with an ampersand (&), which normally sends the process to the background.

• True values of Boolean parameters can be expressed as pretty=true or simply pretty. We use the latter. The pretty parameter in particular makes the JSON reply look more readable than the default, which is to return the reply all in one line.

• The data that we send through HTTP is typically JSON, and we surround it with single quotes because the JSON itself contains double quotes. If single quotes are needed in the JSON itself, we first close the single quotes, and then surround the needed single quote with double quotes, as shown in this example:

'{"name": "Scarlet O'"'"'Hara"}'
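To see what the shell actually passes to curl under these quoting conventions, you can print the strings instead of sending them. This is a small sketch that makes no request. The variable names are illustrative, and refresh=true appears only to give the URI a second parameter, so treat it as an assumption rather than something this section relies on.

```shell
# JSON body containing a single quote: close the single-quoted string,
# add the quote inside double quotes, then reopen the single-quoted string.
body='{"name": "Scarlet O'"'"'Hara"}'
printf '%s\n' "$body"

# URI with two parameters: the single quotes stop the shell from treating
# the ampersand as a request to run the command in the background.
uri='localhost:9200/get-together/group/1?pretty&refresh=true'
printf '%s\n' "$uri"
```

The first line printed is the JSON exactly as Elasticsearch would receive it, with a plain single quote inside the name.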

Using Elasticsearch from your browser via Head, kopf or Marvel

If you prefer graphical interfaces to the command line, several tools are available.

Elasticsearch Head—You can install this tool as an Elasticsearch plugin, a standalone HTTP server, or a web page that you can open from your file system. You can send HTTP requests from there, but Head is most useful as a monitoring tool to show you how shards are distributed in your cluster. You can find Elasticsearch Head at https://github.com/mobz/elasticsearch-head.

Elasticsearch kopf—Similar to Head in that it’s good for both monitoring and sending requests, this tool runs as a web page from your file system or as an Elasticsearch plugin. Both Head and kopf evolve quickly, so any comparison might become obsolete quickly as well. You can find Elasticsearch kopf at https://github.com/lmenezes/elasticsearch-kopf.

Marvel—This tool is a monitoring solution for Elasticsearch. We discuss more about monitoring in chapter 11, which is all about administering your cluster. For now, the thing to remember is that Marvel also provides a graphical way to send requests to Elasticsearch, and it provides an autocomplete feature, which is a useful learning aid. You can download Marvel at http://www.elasticsearch.org/overview/marvel/download/.

2.3.2 Creating an index and mapping type

If you installed Elasticsearch and ran the curl command to index a document, you might be wondering why it worked given the following factors:

• The index wasn’t there before. You didn’t issue any command to create an index named get-together.

• The mapping wasn’t previously defined. You didn’t define any mapping type called group in which to define the fields from your document.

The curl command works because Elasticsearch automatically adds the get-together index for you and also creates a new mapping for the type group. That mapping contains definitions of your fields as strings. Elasticsearch handles all this for you by default, which enables you to start indexing without any prior configuration. You can change this default behavior if you need to, as you’ll explore in chapter 3.

CREATING AN INDEX MANUALLY

You can always create an index with a PUT request similar to the request used to index a document:

% curl -XPUT 'localhost:9200/get-together?pretty'
{
  "acknowledged" : true
}

Creating the index itself takes more time than creating a document, so you might want to have the index ready beforehand. Another reason to create indices in advance is to specify settings other than the ones Elasticsearch defaults to; for example, you may want a specific number of shards.
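As an illustration of creating an index with non-default settings, the sketch below builds a create-index request that asks for two shards. The index name new-index and the shard count are hypothetical values chosen for the example. The snippet prints the command rather than sending it, so it is safe to run without a node.

```shell
# Hypothetical index name and shard count, chosen only for illustration
index='new-index'
settings='{"settings": {"number_of_shards": 2}}'

# Build the request as text and print it instead of sending it
cmd="curl -XPUT 'localhost:9200/${index}?pretty' -d '${settings}'"
echo "$cmd"
```

Pasting the printed command into a shell sends the request to a local node. Releases from 6.0 onward also expect a Content-Type: application/json header on requests that carry a body.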

VIEWING THE MAPPING TYPE

As we mentioned, the mapping is automatically created with your new document, and it automatically detects your name and organizer fields as strings. If you add a new document with yet another new field, Elasticsearch guesses its type, too, and appends the new field to the mapping.

To view the current mapping, issue an HTTP GET to the _mapping endpoint of the type’s URL:
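The command itself appears to be missing from this copy. A plausible reconstruction is sketched here, assuming an Elasticsearch 1.x node on localhost:9200, where the mapping of a type is exposed at /index/_mapping/type. The variable name is illustrative, and the fallback message is only there so the snippet behaves sensibly when no node is running.

```shell
# Assumed address of the mapping for the group type (Elasticsearch 1.x URL layout)
mapping_url='localhost:9200/get-together/_mapping/group?pretty'

# GET is curl's default method, so no -X parameter is needed.
# --max-time keeps the call from hanging; the fallback covers an unreachable node.
curl -s --max-time 5 "$mapping_url" \
  || echo "No Elasticsearch node reachable at ${mapping_url%%/*}"
```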

The response contains the following relevant data:

• Type name—group

• Property list—name and organizer

• Property options—The type is string for both properties

We talk more about indices, mappings, and mapping types in chapter 3. For now, let’s define a mapping, and then index some documents by running a script from the code samples that came with this book.

2.3.3 Indexing documents from the code samples

Before we look at searching through the indexed documents, let’s do some more indexing by running populate.sh from the code samples for chapter 2.

NOTE To download the source code, visit https://github.com/dakrone/elasticsearch-in-action, and then follow the instructions from there.

The script first deletes the get-together index you created. Then, it re-creates it and creates the mapping that’s defined in mapping.json. The mapping file specifies options other than those you’ve seen so far, and we explore them in the rest of the book, mostly in chapter 3. Finally, the script indexes documents in two types: group and event. There is a parent-child relationship between those types (events belonging to groups), which we explore in chapter 8. For now, ignore this relationship.
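To make those three steps concrete, here is a dry-run outline in shell. It is not the real populate.sh: the function name, the variable names, and the assumption that mapping.json is sent in the same request that creates the index are ours. It prints the requests instead of sending them.

```shell
# Not the real populate.sh: a dry-run outline of the steps described above.
address='localhost:9200'
index='get-together'

plan() {
  # 1. Delete the existing index
  echo "curl -XDELETE '${address}/${index}'"
  # 2. Re-create it, sending the contents of mapping.json (assumed to be in one request)
  echo "curl -XPUT '${address}/${index}' -d @mapping.json"
  # 3. Index documents, one request per document (only the first group is shown)
  echo "curl -XPUT '${address}/${index}/group/1' -d '{\"name\": \"Elasticsearch Denver\", \"organizer\": \"Lee\"}'"
}

plan
```

The real script indexes several groups and events and prints the reply to each request.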

Running the populate.sh script should look similar to the following listing.

Listing 2.1 Indexing data with the populate.sh script

% ./populate.sh

WARNING, this script will delete the 'get-together' index and re-index all data!

Press Control-C to cancel this operation.

Press [Enter] to continue.

Creating 'get-together' index...

{"acknowledged":true} #A

Done creating 'get-together' index.

Indexing data...

Indexing groups...

{"_index":"get-together","_type":"group","_id":"1","_version":1} #A

#more replies like this, one for each document

Done indexing groups.

Indexing events...

{"_index":"get-together","_type":"event","_id":"10","_version":1} #A

#more replies like this, one for each document

Done indexing events.

{"_shards":{"total":4,"successful":2,"failed":0}} #A

Done indexing data.

#A JSON replies, which come from Elasticsearch to acknowledge indexing

After running the script, you’ll have a handful of groups that meet and the events planned for those groups. Let’s have a look at how you can search through those documents.

Source: Elasticsearch in Action, MEAP Edition (Manning Early Access Program), Version 11, pages 40-45