Diving into the functionality
2.6 Adding nodes to the cluster
2.6.2 Adding additional nodes
If you run bin/elasticsearch or elasticsearch.bat again, to add a third node, and then a fourth, you’ll see that they detect the master via multicast and join the cluster in the same way.
Additionally, as shown in figure 2.14, the four shards of the get-together index automatically get balanced across the cluster.
Figure 2.14 Elasticsearch automatically distributes shards across the growing cluster.
At this point you might wonder what happens if you add more nodes. By default, nothing happens because you have four total shards that can’t be distributed to more than four nodes.
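To make the arithmetic concrete, here is a minimal Python sketch. It assumes the four shards come from two primary shards with one replica each; that breakdown and the function name are assumptions made for illustration. It only counts how many nodes can hold at least one shard.

```python
def nodes_holding_shards(primaries: int, replicas: int, nodes: int) -> int:
    """Return how many nodes can hold at least one shard copy."""
    total_shards = primaries * (1 + replicas)  # each primary plus its replicas
    return min(nodes, total_shards)


# Assumed layout: 2 primaries x (1 primary copy + 1 replica) = 4 shards.
for node_count in (1, 2, 4, 6):
    used = nodes_holding_shards(2, 1, node_count)
    print(f"{node_count} nodes -> {used} hold shards")
```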
That said, if you need to scale, you have a few options:
• Change the number of replicas. Replicas can be updated on the fly, but scaling this way increases only the number of concurrent searches your cluster can serve. The indexing throughput as well as the performance of isolated searches remains the same.
• Create an index with more shards. This implies reindexing your data because the number of shards can’t be changed on the fly.
• Add more indices. Some data can be easily designed to use more indices. For example, if you index logs, you can put each day’s logs in its own index.
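The last option can be sketched in a few lines of Python. The `logs-` prefix, the date format, and both function names are assumptions for illustration; the point is only that each day maps to its own index name, and that a search over several days can list the matching indices separated by commas.

```python
from datetime import date, timedelta
from typing import List


def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build a per-day index name such as logs-2015.06.01 (assumed naming)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_for_range(start: date, end: date, prefix: str = "logs") -> List[str]:
    """List the daily indices covering start..end, both days included."""
    days = (end - start).days
    return [daily_index_name(start + timedelta(days=i), prefix) for i in range(days + 1)]


print(daily_index_name(date(2015, 6, 1)))
# Comma-separated index names can be used in the URL of a search request.
print(",".join(indices_for_range(date(2015, 6, 1), date(2015, 6, 3))))
```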
We discuss these patterns for scaling out in chapter 9. Again, scaling out for more concurrent searches isn’t a problem because you can change the number of replicas. The real challenge is in making indexing and individual searches run fast, a topic we discuss in chapter 10.
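As a rough illustration of changing the number of replicas on the fly, the sketch below builds a request to the index settings endpoint. It assumes a node listening on localhost:9200 and uses the get-together index; the function name is hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def replica_update_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a PUT to the index's _settings endpoint."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed address of a local node; adjust to your cluster.
req = replica_update_request("http://localhost:9200", "get-together", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply it against a running node: urllib.request.urlopen(req)
```

Because only a setting changes, no data has to be reindexed, which is why this option is the easiest of the three.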
2.7 Summary
• Elasticsearch is document-oriented, scalable, and schema-free by default.
• Although you can form a cluster with the default settings, you should adjust at least some of them before you go to production; for example, cluster name and heap size.
• Indexing requests are distributed among the primary shards and replicated to those primary shards’ replicas.
• Searches are done using a round-robin approach between complete sets of data, regardless of whether those sets are made up of primary shards or replicas. The node that received the search request then aggregates partial results from individual shards and returns those results to the application.
• Client applications may be unaware of the sharded nature of each index or what the cluster looks like. They care only about indices, types, and document IDs. They use the HTTP REST API to index and search for documents.
• You can send new documents and search parameters as the JSON payload of an HTTP request, and you’ll get back a JSON reply with the results.
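The last two points can be sketched as follows. The code only builds URLs and JSON payloads; the localhost:9200 address, the document ID 1, and the function names are assumptions for illustration, while the get-together index and group type follow the examples used in this chapter.

```python
import json
from urllib.parse import quote

BASE_URL = "http://localhost:9200"  # assumed address of a local node


def document_url(index: str, doc_type: str, doc_id: str) -> str:
    """A document is addressed by its index, type, and ID."""
    parts = [quote(part, safe="") for part in (index, doc_type, doc_id)]
    return "/".join([BASE_URL] + parts)


def search_url(index: str, doc_type: str) -> str:
    """Searches go to the _search endpoint of an index and type."""
    return "/".join([BASE_URL, quote(index, safe=""), quote(doc_type, safe=""), "_search"])


# JSON payload for indexing a document (sent with PUT to the document URL).
new_doc = json.dumps({"name": "Elasticsearch Denver"})
# JSON payload for a search (sent to the search URL).
search_body = json.dumps({"query": {"query_string": {"query": "elasticsearch"}}})

print("PUT", document_url("get-together", "group", "1"), new_doc)
print("GET", search_url("get-together", "group"), search_body)
```

Nothing in these requests mentions shards or nodes, which is what lets the application stay unaware of the cluster layout.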
In the next chapter, you’ll get the foundation you need to organize your data effectively in Elasticsearch, you’ll learn what types of fields your documents can have, and you’ll become familiar with all the relevant options for indexing, updating, and deleting.
3
Indexing, updating, and deleting data
This chapter covers
• Using mapping types to define multiple types of documents in the same index
• Types of fields you can use in mappings
• Using predefined fields and their options
• Updating and deleting data
This chapter is all about getting data in and out of Elasticsearch: indexing, updating, and deleting documents. In chapter 1, you learned that Elasticsearch is document-based and that documents are made up of fields and their values, which makes them self-contained, much like having the column names from a table contained in the rows. In chapter 2, you saw how you can index such a document via Elasticsearch’s REST API. Here, we’ll dive deeper into the indexing process by looking at the fields in those documents and what they contain. For example, when you index a document that looks like this:
{"name": "Elasticsearch Denver"}
the name field is a string because its value, Elasticsearch Denver, is a string. Other fields could be numbers, booleans, and so on. In this chapter, we’ll look at three types of fields:
• Core—These fields include strings and numbers.
• Arrays and multi fields—These fields help you store multiple values of the same core type, in the same field. For example, you can have multiple tag strings in your tags field.
• Predefined—Examples of these fields include _ttl (which stands for “time to live”) and _timestamp.
Think of these field types as metadata that can be automatically managed by Elasticsearch to give you additional functionality. For example, you can store some fields in a way that makes your indices smaller, you can configure Elasticsearch to automatically add new data to documents, such as a timestamp, or you can use the _ttl field to get your documents automatically deleted after a specified amount of time.
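The three kinds of fields can be illustrated with a small Python sketch. The `infer_core_type` function is a simplified stand-in written for this example, not Elasticsearch’s own detection logic, and the tags and members values are hypothetical. The mapping fragment shows how the predefined _timestamp and _ttl fields are commonly enabled in 1.x-era mappings; treat the exact options as an assumption to check against your version.

```python
import json


def infer_core_type(value) -> str:
    """Rough guess at a core type from a JSON value (illustration only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list) and value:
        return infer_core_type(value[0])  # an array holds values of one core type
    raise ValueError(f"unsupported value: {value!r}")


doc = {
    "name": "Elasticsearch Denver",
    "tags": ["denver", "elasticsearch"],  # hypothetical array of strings
    "members": 42,                        # hypothetical numeric field
}
for field, value in doc.items():
    print(field, "->", infer_core_type(value))

# Mapping fragment enabling two predefined fields for the "group" type.
mapping = {
    "group": {
        "_timestamp": {"enabled": True},
        "_ttl": {"enabled": True, "default": "1d"},
    }
}
print(json.dumps(mapping, indent=2))
```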
Once you know the field types that can be in your documents and how to index them, we’ll look at how you can update documents that are already there. Because of the way it stores data, when Elasticsearch updates an existing document, it retrieves it and applies changes according to your specifications. It then indexes the resulting document again and deletes the old one. Such updates can raise concurrency issues, and you’ll see how they can be solved automatically with document versions.
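The retrieve, change, and reindex cycle, along with the version check that guards it, can be sketched with a toy in-memory store. This is not Elasticsearch code: the class and method names are hypothetical, and here the caller passes the version it last saw so that a stale update is easy to demonstrate.

```python
class VersionConflict(Exception):
    """Raised when a document changed since it was last read."""


class ToyStore:
    """In-memory stand-in for an index; for illustration only."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source dict)

    def index(self, doc_id, source):
        """Store a document, replacing any old copy and bumping its version."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, dict(source))
        return version

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def update(self, doc_id, changes, expected_version):
        version, source = self.get(doc_id)   # 1. retrieve the existing document
        if version != expected_version:      # someone else updated it first
            raise VersionConflict(f"expected {expected_version}, found {version}")
        source.update(changes)               # 2. apply the changes
        return self.index(doc_id, source)    # 3. index the result over the old copy


store = ToyStore()
store.index("1", {"name": "Elasticsearch Denver"})
seen_version, _ = store.get("1")
print(store.update("1", {"organizer": "Lee"}, expected_version=seen_version))
try:
    # Reusing the old version simulates a concurrent, stale update.
    store.update("1", {"organizer": "Radu"}, expected_version=seen_version)
except VersionConflict as err:
    print("conflict:", err)
```

A caller that hits a conflict would typically read the document again and retry with the fresh version.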
You’ll also see various ways of deleting documents. Some ways are faster than others. This is again due to the particular way Apache Lucene, the main library used by Elasticsearch for indexing, stores data on disk.
We’ll start with indexing, by looking at how you can manage fields from your documents.
As you saw in chapter 2, fields are defined in mappings, so before we dive into how you can work with each type of field, we’ll look at how you can work with mappings in general.