Relations among documents
8.1 Options for defining relationships among documents
First, let’s quickly define each of these approaches:
• Object type: This allows you to have a sub-document as the value of a field in your document. For example, the address field of an event could be an object with its own fields: city, postal code, street name, and so on. You could even have an array of addresses if the same event happens in multiple cities.
• Nested documents: The problem you may have with the object type is that all the data is stored in the same document, so a search can match across sub-documents. For example, city=Paris AND street_name=Broadway could return an event that's hosted in both New York and Paris, even though there's no Broadway street in Paris. Nested documents let you index the same JSON document but keep each address in a separate Lucene document, so that only searches like city=New York AND street_name=Broadway return that event.
• Parent-child relationships between documents: This method allows you to use completely separate Elasticsearch documents for different types of data, like events and groups, but still define a relationship between them. For example, you can have groups as “parents” of events, to indicate which group hosts which event. This will allow you to search for events hosted by groups in your area, or for groups that host events about Elasticsearch.
• Denormalizing: This is a general technique of duplicating data in order to represent relationships. In Elasticsearch, you're likely to employ it to represent many-to-many relationships, because the other options only work for one-to-many. For example, all groups have members, and members can belong to multiple groups. You can duplicate one side of the relationship, for example by including all the members of a group in that group's document.
Before we dive into the details of working with each possibility, we'll give an overview of each one and its typical use cases.
8.1.1 Object type
The easiest way to represent a common interest group and the corresponding events is to use the object type. This allows you to put a JSON object, or an array of JSON objects, as the value of your field, like the example below:
{
  "name": "Denver technology group",
  "events": [
    {
      "date": "2014-12-22",
      "title": "Introduction to Elasticsearch"
    },
    {
      "date": "2014-06-20",
      "title": "Introduction to Hadoop"
    }
  ]
}
If you want to search for a group with events that are about Elasticsearch, you can simply search in the events.title field.
This works brilliantly for one-to-one relationships, but with one-to-many relationships you might get unexpected results. For example, let's say you want to filter groups hosting Hadoop meetings in December 2014. Your query can look like this:
"bool": {
  "must": [
    {
      "term": {
        "events.title": "hadoop"
      }
    },
    {
      "range": {
        "events.date": {
          "from": "2014-12-01",
          "to": "2014-12-31"
        }
      }
    }
  ]
}
This will match our sample document, because it has a title that matches hadoop and a date that's in the specified range. But this is not what we want: it's the Elasticsearch event that's in December; the Hadoop one is in June. Sticking with the default object type is the fastest and easiest approach to relations, but Elasticsearch is unaware of the boundaries between inner objects: their fields are flattened into multi-valued fields of the same document (events.title and events.date here), as illustrated in figure 8.1.
#A To the left: Elasticsearch event is in December, Hadoop event is in June
#B To the right: search for Hadoop events in December matches the document
Figure 8.1 Inner object boundaries are not accounted for when storing, leading to unexpected results
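The cross-object match described above can be reproduced with a small in-memory sketch. This is not how Elasticsearch is implemented; it only imitates the flattening of inner objects into multi-valued fields. The helper names flatten, has_term and in_range are hypothetical, and the term check is a rough stand-in for a term query on an analyzed field.

```python
from collections import defaultdict

group = {
    "name": "Denver technology group",
    "events": [
        {"date": "2014-12-22", "title": "Introduction to Elasticsearch"},
        {"date": "2014-06-20", "title": "Introduction to Hadoop"},
    ],
}


def flatten(doc, prefix=""):
    """Collect every value under its dotted field path, losing object boundaries."""
    fields = defaultdict(list)
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Inner objects contribute to the same shared field lists.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields[sub_path].extend(sub_values)
            else:
                fields[path].append(item)
    return dict(fields)


def has_term(fields, path, term):
    """Rough stand-in for a term query: any value of the field contains the word."""
    return any(term in value.lower().split() for value in fields.get(path, []))


def in_range(fields, path, start, end):
    """Rough stand-in for a range query; ISO dates compare correctly as strings."""
    return any(start <= value <= end for value in fields.get(path, []))


flat = flatten(group)
matches = has_term(flat, "events.title", "hadoop") and in_range(
    flat, "events.date", "2014-12-01", "2014-12-31"
)
print(flat["events.title"])
print(flat["events.date"])
print(matches)  # True, although no single event is both Hadoop and in December
```

Each condition is satisfied by a different event, but because both events feed the same two field lists, the document as a whole matches.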
8.1.2 Nested type
If you need to make sure such cross-object matches don't happen, you can use the nested type, which will index your events in separate Lucene documents. With either the object or the nested type, the group's JSON document will look exactly the same and applications will index it in the same way.
The difference is in the mapping, which tells Elasticsearch to index nested inner objects in adjacent, but separate, Lucene documents, as illustrated in figure 8.2. When searching, you'll need to use nested filters and queries, which will be explored in section 8.2; these search across the separate Lucene documents while evaluating the criteria against one inner object at a time.
Figure 8.2 Nested type makes Elasticsearch index objects as separate Lucene documents
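As a rough illustration of the difference, the sketch below evaluates both conditions against one event at a time, which is the effect the nested type is meant to achieve. The mapping fragment is an assumption about what a minimal nested mapping looks like (the exact form is covered in section 8.2), and event_matches and nested_match are hypothetical helpers, not Elasticsearch APIs.

```python
group = {
    "name": "Denver technology group",
    "events": [
        {"date": "2014-12-22", "title": "Introduction to Elasticsearch"},
        {"date": "2014-06-20", "title": "Introduction to Hadoop"},
    ],
}

# Assumed minimal mapping fragment: only the field type changes, not the JSON document.
nested_mapping = {"properties": {"events": {"type": "nested"}}}


def event_matches(event, term, start, end):
    """Both criteria must hold for the same inner object."""
    in_title = term in event["title"].lower().split()
    in_dates = start <= event["date"] <= end  # ISO dates compare correctly as strings
    return in_title and in_dates


def nested_match(doc, path, term, start, end):
    """The group matches only if at least one single event satisfies everything."""
    return any(event_matches(event, term, start, end) for event in doc.get(path, []))


print(nested_match(group, "events", "hadoop", "2014-12-01", "2014-12-31"))  # False
print(nested_match(group, "events", "hadoop", "2014-06-01", "2014-06-30"))  # True
print(nested_match(group, "events", "elasticsearch", "2014-12-01", "2014-12-31"))  # True
```

The search for Hadoop events in December no longer returns the sample group, because no individual event satisfies both conditions.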
In some use-cases, it's not a good idea to mash all the data in the same document, like objects and nested types do. Take the case of groups and events: if a new event is organized by a group, and all that group's data is in the same document, you'll have to re-index the whole document just for that event. This can hurt performance and concurrency, depending on how big those documents get, and how often those operations are done.
8.1.3 Parent-child relationships
With parent-child relationships, you can use completely different Elasticsearch documents, by putting them in different types and defining their relationship in the mapping of each type. For example, you can have events in one mapping type and groups in another, and you can specify in the mapping that groups are “parents” of events. Also, when you index an event, you can point it to the group that it belongs to, like in figure 8.3. At search time, you can use has_parent or has_child queries and filters to take the other part of the relationship into account. We'll discuss them later in this chapter as well.
Figure 8.3 Different types of Elasticsearch documents can have parent-child relationships
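The following sketch models the idea in memory: events are stored separately and only carry a pointer to their parent group, so indexing a new event never rewrites the group. It is a simplified model under stated assumptions, not Elasticsearch code: index_event, has_child and has_parent are hypothetical functions that mimic the intent of the queries with the same names, and the second group is made-up sample data.

```python
# Parent documents (groups), keyed by ID. The second group is made-up sample data.
groups = {
    "1": {"name": "Denver technology group"},
    "2": {"name": "Boulder hiking group"},
}
# Child documents (events) live in their own store and only point at a parent.
events = {}


def index_event(event_id, parent_id, body):
    """Store an event that references its parent group by ID."""
    if parent_id not in groups:
        raise KeyError(f"unknown parent group: {parent_id}")
    events[event_id] = {"_parent": parent_id, **body}


def has_child(term):
    """IDs of groups with at least one event whose title contains the term."""
    return sorted(
        {e["_parent"] for e in events.values() if term in e["title"].lower().split()}
    )


def has_parent(term):
    """IDs of events whose parent group's name contains the term."""
    return sorted(
        event_id
        for event_id, e in events.items()
        if term in groups[e["_parent"]]["name"].lower().split()
    )


groups_before = {group_id: dict(body) for group_id, body in groups.items()}
index_event("e1", "1", {"date": "2014-12-22", "title": "Introduction to Elasticsearch"})
index_event("e2", "1", {"date": "2014-06-20", "title": "Introduction to Hadoop"})

print(has_child("elasticsearch"))  # ['1']
print(has_parent("denver"))        # ['e1', 'e2']
print(groups == groups_before)     # True: adding events never rewrote a group
```

The trade-off is that the relationship has to be resolved at search time, whereas object and nested types keep everything in one document.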
8.1.4 Denormalizing
For any relational work, you have objects, nested documents and parent-child. These work for one-to-one and one-to-many relationships, the “one parent with one or more children” kind.
There's also a fourth way, which is not a specific Elasticsearch feature, but a method often employed by NoSQL data-stores to overcome the lack of joins: denormalizing, which means a document will include data that's related to it, even if the same data will have to be duplicated in another document.
For example, let's take groups and their members. A group can have multiple members, and a user can be a member of multiple groups. Both have their own set of properties. To represent this relationship, you can have groups as “parents” of the members. For users who are members of multiple groups, you'd have to duplicate their data, once for each group they belong to, as in figure 8.4:
#A Next to one of the “Lee” labels: the document for Lee is stored twice: once for each group he is a member of
Figure 8.4 Denormalizing is the technique of duplicating data to avoid costly relations
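A minimal sketch of that trade-off, assuming members are embedded in each group's document (one of the two layouts mentioned in this section): reads need no join, but a change to one member has to be applied to every copy. The group IDs, the second group, Lee's city field, and the helper functions are all hypothetical sample data and names.

```python
import copy

groups = {
    "denver-tech": {"name": "Denver technology group", "members": []},
    "denver-search": {"name": "Denver search group", "members": []},  # made-up group
}


def add_member(group_id, member):
    """Denormalize: store an independent copy of the member inside the group."""
    groups[group_id]["members"].append(copy.deepcopy(member))


def groups_with_member(name):
    """Answer 'which groups is this user in?' without any join."""
    return sorted(
        group_id
        for group_id, group in groups.items()
        if any(member["name"] == name for member in group["members"])
    )


def update_member(name, changes):
    """Apply a change to every copy; returns how many group documents were rewritten."""
    rewritten = 0
    for group in groups.values():
        touched = False
        for member in group["members"]:
            if member["name"] == name:
                member.update(changes)
                touched = True
        if touched:
            rewritten += 1
    return rewritten


lee = {"name": "Lee", "city": "Denver"}  # the city field is made up for illustration
add_member("denver-tech", lee)
add_member("denver-search", lee)

print(groups_with_member("Lee"))                  # ['denver-search', 'denver-tech']
print(update_member("Lee", {"city": "Boulder"}))  # 2
```

Searches stay simple and fast, at the cost of extra storage and of updates that must touch every document holding a copy.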
In the rest of this chapter, we'll take a deeper look at each of these techniques: objects and arrays, nested, parent-child, and denormalizing. You'll learn how they work internally, how to define them in the mapping, how to index and how to search those documents.