Understanding the logical layout: documents, types, and indices

Diving into the functionality

2.1 Understanding the logical layout: documents, types, and indices

When you index a document in Elasticsearch, you put it in a type that belongs to a specific index. You can see this idea in figure 2.2, where the get-together index contains two types: event and group. Those types contain documents, such as the one labeled “1.” The label “1” is that document’s ID. The index-type-ID combination uniquely identifies a document in your Elasticsearch setup. When you search, you can look for documents in a specific type of a specific index, or you can search across multiple types or even multiple indices.

#A Index name + type name + document ID = uniquely identified document, for example /get-together/event/1

Figure 2.2 Logical layout of data in Elasticsearch: how an application sees data
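As a rough illustration of this addressing scheme, the following Python sketch builds the path of a single document and the paths of searches that span several types or indices. The helper names document_path and search_path and the blog index are hypothetical, and the comma-separated and _search path forms are the REST conventions assumed here. Nothing is sent over the network.

```python
def document_path(index, doc_type, doc_id):
    """Return the path that identifies one document: /index/type/id."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


def search_path(indices=(), types=()):
    """Return a search path over the given indices and types.

    Empty indices means "search everywhere"; empty types means
    "all types of the chosen indices".
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        # Types can only be named after an index; _all stands for every index.
        if not indices:
            parts.append("_all")
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(document_path("get-together", "event", 1))
print(search_path(["get-together"], ["event"]))
print(search_path(["get-together"], ["event", "group"]))
print(search_path(["get-together", "blog"]))  # "blog" is a made-up second index
print(search_path())
```

The point of the sketch is only that the same three names (index, type, ID) narrow the scope step by step: leave out the ID to search a type, list several types or indices to widen the search, or leave out everything to search the whole setup.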

At this point you might be asking: what exactly are a document, a type, and an index? That’s exactly what we’re going to discuss next.

2.1.1 Documents

We said in chapter 1 that Elasticsearch is document-oriented, where a document is the smallest unit of data you index or search for. There are a few important properties of a document in Elasticsearch:

• It’s self-contained. A document contains both the fields (name) and their values (Elasticsearch Denver).

• It can be hierarchical. Think of this as documents within documents. The value of a field can be simple; for example, the value of the location field can be a string. It can also contain other fields and values. For example, the location field might contain both a city and a street address within it.

• It has a flexible structure. Your documents don’t depend on a predefined schema. For example, not all events need description values, so that field can be omitted altogether. But an event might bring new fields, like the latitude and longitude of the location.

A document is normally a JSON representation of your data. As we discussed in chapter 1, JSON over HTTP is the most widely used way to communicate with Elasticsearch, and it’s the method we use throughout the book. For example, an event in your get-together site can be represented in the following document:

{
  "name": "Elasticsearch Denver",
  "organizer": "Lee",
  "location": "Denver, Colorado, USA"
}

NOTE Throughout the book, we’ll use different colors for the field names and values of the JSON documents, to make them easier to read. Field names are darker/blue, and values are in lighter/red.

You can also imagine a table with three columns: name, organizer, and location. The document would be a row containing the values. But there are some differences that make this comparison inexact.

The main difference between documents like this and rows in a table is that a single document contains the names of all the fields it has a value for. So, although it uses up more space in its raw form, you can easily understand which value belongs to which field by looking at one document.
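To make the comparison concrete, here is a small Python sketch that holds the same event once as a table row and once as a document. The variable names are made up for the illustration; only the field names and values come from the example above.

```python
import json

# A table keeps the column names once, separately from the rows.
columns = ("name", "organizer", "location")
row = ("Elasticsearch Denver", "Lee", "Denver, Colorado, USA")

# A document repeats the field names next to the values it carries.
document = dict(zip(columns, row))

row_json = json.dumps(list(row))
document_json = json.dumps(document)

print(document_json)
# The document is longer in raw form, but it can be read on its own.
print(len(row_json), len(document_json))
```

The row is meaningless without the separate list of columns, whereas the document can be understood by itself, at the cost of a few more bytes.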

Another difference is that, unlike rows, documents can be hierarchical. For example, the location can contain a name and a geo-location:

{
  "name": "Elasticsearch Denver",
  "organizer": "Lee",
  "location": {
    "name": "Denver, Colorado, USA",
    "geolocation": "39.7392, -104.9847"
  }
}

A single document can also contain arrays of values. For example:

{
  "name": "Elasticsearch Denver",
  "organizer": "Lee",
  "members": ["Lee", "Mike"]
}

Finally, documents in Elasticsearch are said to be schema-free, in the sense that not all your documents need to have the same fields, so they’re not bound to the same schema. For example, you can omit the location altogether, in case the organizer needs to be called before every gathering:

{
  "name": "Elasticsearch Denver",
  "organizer": "Lee",
  "members": ["Lee", "Mike"]
}

Although you can add or omit fields at will, the type of each field matters: some are strings, some are integers, and so on. Because of that, Elasticsearch keeps a mapping of all your fields and their types, and other settings. This mapping is specific to every type of every index. That’s why types are also called mapping types in Elasticsearch terminology.

2.1.2 Mapping types

Mapping types are logical containers for documents, similar to how tables are containers for rows. They’re often called simply types, because you’d put different types of documents in different mapping types.

For example, you can have a type that defines the get-together groups, and a type for the events when people gather. These could be different types of documents because they’d have different structures.

We call them mapping types because they’re typically used as containers for different types of documents, that is, documents with different structures. The definition of fields in each type is called a mapping. For example, name would be mapped as a string, but the geolocation field under location would be mapped as a special geo_point type. (We explore working with geospatial data in appendix A.) Each kind of field is handled differently. For example, you search for a word in the name field, but you search the geolocation field for groups that are located near where you live.

TIP Whenever you’re searching in a field that isn’t at the root of your JSON document, you must specify its path. For example, the geolocation field under location is referred to as location.geolocation.
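The relationship between a dotted path and the nested JSON structure can be sketched in a few lines of Python. The get_by_path helper is hypothetical and only shows how a path such as location.geolocation lines up with the document; it is not how Elasticsearch resolves fields internally.

```python
def get_by_path(document, path):
    """Follow a dotted path such as "location.geolocation" into nested dicts.

    Returns None when any step of the path is missing.
    """
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


event = {
    "name": "Elasticsearch Denver",
    "organizer": "Lee",
    "location": {
        "name": "Denver, Colorado, USA",
        "geolocation": "39.7392, -104.9847",
    },
}

print(get_by_path(event, "name"))                  # a field at the root
print(get_by_path(event, "location.geolocation"))  # a field under location
print(get_by_path(event, "location.street"))       # a field this document lacks
```

Note that name at the root and location.name are different fields, which is why the full path is needed.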

You may ask yourself: If Elasticsearch is schema-free, why does each document belong to a type, and each type contains a mapping, which is like a schema?

We say schema-free because documents are not bound to the schema. They aren’t required to contain all the fields defined in your mapping, and they may introduce new fields.

How does it work? First, the mapping contains all the fields of all the documents indexed so far in that type. But not all documents have to have all fields. Also, if a new document gets indexed with a field that’s not already in the mapping, Elasticsearch automatically adds that new field to your mapping. To add that field, it has to decide what type it is, so it guesses. For example, if the value is 7, it assumes it’s a long type.

This autodetection of new fields has its downside because Elasticsearch might not guess right. For example, after indexing 7, you might want to index hello world, which will fail because it’s a string and not a long. In production, the safe way to go is to define your mapping before indexing data. We talk more about defining mappings in chapter 3.
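The behavior described above can be imitated with a toy model. The ToyMapping class below is a deliberately simplified sketch, not Elasticsearch’s implementation: the type names are illustrative, the attendees field is made up, and the rule of rejecting every value whose guessed type differs from the mapped one is stricter than what a real cluster may do.

```python
class ToyMapping:
    """A toy model of a mapping: field name -> guessed type name."""

    def __init__(self):
        self.fields = {}

    @staticmethod
    def guess_type(value):
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, int):
            return "long"
        if isinstance(value, float):
            return "double"
        return "string"

    def index(self, document):
        """Add new fields to the mapping; reject values that clash with it."""
        new_fields = {}
        for field, value in document.items():
            guessed = self.guess_type(value)
            known = self.fields.get(field, guessed)
            if known != guessed:
                raise ValueError(
                    "field {!r} is mapped as {}, got a {}".format(field, known, guessed)
                )
            new_fields[field] = guessed
        # Only update the mapping once the whole document has been accepted.
        self.fields.update(new_fields)


mapping = ToyMapping()
mapping.index({"name": "Elasticsearch Denver", "attendees": 7})
mapping.index({"name": "Elasticsearch Boulder"})                 # omitting a field is fine
mapping.index({"name": "Big Data Denver", "organizer": "Lee"})   # a new field is added
print(mapping.fields)

try:
    mapping.index({"attendees": "hello world"})  # clashes with the guessed long
except ValueError as error:
    print("rejected:", error)
```

The sketch shows both halves of the trade-off: documents are free to omit or introduce fields, but the first value seen for a field fixes its type for every document that follows.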

Mapping types only divide documents logically. Physically, documents from the same index are written to disk regardless of the mapping type they belong to.

2.1.3 Indices

Indices are containers for mapping types. An Elasticsearch index is an independent chunk of documents, much like a database is in the relational world: each index is stored on the disk in the same set of files; it stores all the fields from all the mapping types in there, and it has its own settings.

For example, each index has a setting called refresh_interval, which defines the interval at which newly indexed documents are made available for searches. This refresh operation is quite expensive in terms of performance, and this is why it’s done occasionally—by default, every second—instead of doing it after each indexed document. If you’ve read that Elasticsearch is near-real-time, this refresh process is what it refers to. Even though you can refresh after each new document, it’s not worth it for many use cases. We talk more about indexing performance in chapter 10.
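A toy model can show what near-real-time means from the application’s side. The ToyIndex class below is a hypothetical sketch that only models visibility: documents are accepted immediately but become searchable after a refresh. It says nothing about how the real refresh works on disk or how the timer is scheduled.

```python
class ToyIndex:
    """A toy model of near-real-time search: documents become searchable
    only after a refresh."""

    def __init__(self):
        self._buffer = []       # indexed, but not yet visible to searches
        self._searchable = []   # visible to searches

    def index(self, document):
        self._buffer.append(document)

    def refresh(self):
        """Make everything indexed so far visible to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        return [doc for doc in self._searchable if doc.get(field) == value]


index = ToyIndex()
index.index({"name": "Elasticsearch Denver", "organizer": "Lee"})
before = index.search("organizer", "Lee")
print(before)   # empty: no refresh has run yet
index.refresh() # in the text above, this happens every refresh_interval
after = index.search("organizer", "Lee")
print(after)
```

Calling refresh after every document would make each one visible at once, which is the expensive pattern the paragraph above warns about; batching many documents per refresh is the cheaper default.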

A nice feature of Elasticsearch is that you can search across indices like you can search across mapping types. This gives you flexibility in terms of how you can organize your documents. For example, you can put your get-together events and the blog posts about them in different indices or in different types of the same index. Because Elasticsearch is schema-free, you can even put them in the same type. You can organize your documents in various ways, but some ways are more efficient than others, depending on your use case. We talk more about how to organize your data for efficient indexing in chapter 4.

Elasticsearch index vs. Lucene index

You’ll see the word “index” used frequently as we discuss Elasticsearch; here’s how the terminology works.

An Elasticsearch index is broken down into chunks: shards. A shard is a Lucene index. So an Elasticsearch index is made up of multiple Lucene indices. This makes sense because Elasticsearch uses Apache Lucene as its core library to index your data and search through it.

Throughout this book, whenever you see the word “index” by itself, it refers to an Elasticsearch index. If we’re digging into the details of what’s in a shard, we’ll specifically use the term “Lucene index.”

Another index-specific setting is the number of shards. You saw in chapter 1 that an index can be made up of one or more chunks called shards. This is good for scalability: you can run Elasticsearch on multiple servers and have shards of the same index live on multiple servers.
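The idea of one index being split into shards can be pictured with the following Python sketch. It assumes, for illustration only, that each document is assigned to a shard by hashing its ID; this section does not spell out the actual rule, and the CRC32 hash, the shard_for helper, and the shard count of 3 are stand-ins chosen for the example.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard for a document ID with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_shards


number_of_shards = 3  # hypothetical value for this sketch
shards = {shard: [] for shard in range(number_of_shards)}
for doc_id in range(1, 11):
    shards[shard_for(doc_id, number_of_shards)].append(doc_id)

for shard, doc_ids in sorted(shards.items()):
    print("shard", shard, "holds documents", doc_ids)
```

Each document lands in exactly one shard, and the same ID always maps to the same shard, so the shards can live on different servers while the application still addresses a single index.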

From a search or an indexing application’s point of view, the way you shard your index doesn’t matter in terms of how data is organized logically. It’s about how your data is physically laid out, and we’ll look at that next.

From: MEAP Edition, Manning Early Access Program, Elasticsearch in Action, Version 11 (pages 29-34)