Deleting data - Diving into the functionality


USING VERSIONS WHEN YOU INDEX DOCUMENTS

Another way to update a document, without using the update API, is to index a new one to the same index, type, and ID. This overwrites the existing document, and you can still use the version field for concurrency control. To do that, set the version parameter in the HTTP request to the version you expect the document to have. For example, if you want to index a new T-shirt and make sure you don’t overwrite anything by accident, set version=0:

% curl -XPUT 'localhost:9200/online-shop/shirts/2?version=0' -d '{
  "caption": "Learning Elasticsearch Versioning",
  "price": 2
}'

Elasticsearch throws an error if the document exists because the document has a version other than 0. For other such operations, specify the version you expect the document to have:

% curl -XPUT 'localhost:9200/online-shop/shirts/2?version=1' -d '{
  "caption": "I Know about Elasticsearch Versioning",
  "price": 5
}'

Again, the operation fails if the current version is anything other than 1, whether because the document doesn’t exist or because it has a higher version number.
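
The version check described above can be pictured with a small in-memory model. This is only a sketch of the idea, not how Elasticsearch implements it: the ToyStore class and the VersionConflict exception are hypothetical names, and the model simply treats a missing document as version 0, as the text does.

```python
class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""


class ToyStore:
    """In-memory stand-in for an index; NOT the Elasticsearch implementation."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def current_version(self, doc_id):
        # A missing document is treated as version 0, as in the text above.
        return self._docs.get(doc_id, (0, None))[0]

    def index(self, doc_id, body, expected_version=None):
        current = self.current_version(doc_id)
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        # Indexing overwrites the document and bumps its version by one.
        self._docs[doc_id] = (current + 1, body)
        return current + 1


store = ToyStore()
v1 = store.index("2", {"caption": "Learning Elasticsearch Versioning", "price": 2},
                 expected_version=0)
v2 = store.index("2", {"caption": "I Know about Elasticsearch Versioning", "price": 5},
                 expected_version=1)

try:
    # Stale expectation: the document is now at version 2, so this must fail.
    store.index("2", {"price": 7}, expected_version=1)
    conflict = False
except VersionConflict:
    conflict = True

print(v1, v2, conflict)  # prints: 1 2 True
```

The third call fails for the same reason the second curl command would fail if someone else had changed the document in the meantime: the caller’s expectation no longer matches the stored version.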

With versions, you can index or update your documents safely. Next, let’s look at how you can remove documents.

3.6 Deleting data

Now that you know how to send data to Elasticsearch, let’s look at what options you have for removing some of what was indexed. If you’ve worked through the listings throughout this chapter, you now have unnecessary data that’s waiting to be removed. We’ll look at a few ways to remove data, or at least to keep it from slowing down your searches or further indexing:

• Delete individual documents or groups of documents. When you do that, Elasticsearch only marks them as deleted, so they don’t show up in searches, and removes them from the index later, asynchronously.

• Delete complete indices. This is a particular case of deleting groups of documents. But it differs in the sense that it’s easy to do performance-wise. The main job is to remove all the files associated with that index, which happens almost instantly.

• Close indices. Although this isn’t about removing, it’s worth mentioning here. A closed index doesn’t allow read or write operations, and its data isn’t loaded in memory. It’s similar to removing data from Elasticsearch, but the data remains on disk and is easy to restore: you just open the closed index.

3.6.1 Deleting documents

There are a few ways to remove individual documents, and we’ll discuss most of them here:

• Remove a single document by its ID. This is good if you have only one document to delete, provided that you know its ID.

• Remove multiple documents in a single request. If you have multiple individual documents that you want to delete, you can remove them all at once in a bulk request, which is faster than removing one document at a time. We’ll cover bulk deletes in chapter 12, along with bulk indexing and bulk updating.

• Remove a mapping type, with all the documents in it. This will effectively search for and remove all the documents you’ve indexed in that type, plus the mapping itself.

• Remove all the documents matching a query. This is similar to removing a mapping type, in the sense that internally, a search is run to identify the documents that need to be deleted. Only here you can specify any query you want, and the matching documents will be deleted.

REMOVE A SINGLE DOCUMENT

To remove a single document, you need to send an HTTP DELETE request to its URL. For example:

% curl -XDELETE 'localhost:9200/online-shop/shirts/1'

TIP You’ve used versions for indexing and updating, and you can use them for deletes, too, to manage concurrency. For example, let’s assume you sold all shirts of a certain type, and you want to remove that document so it doesn’t appear in searches at all. But you might not know at that time whether a new shipment has arrived and the stock data has been updated. To avoid deleting an updated document, add the version you last saw, for example version=1, as a parameter to your DELETE request. The document will then be deleted only if it’s still at version 1.
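
As a rough sketch, the versioned delete from the tip could be prepared in Python as follows. The build_versioned_delete helper is a hypothetical name introduced here for illustration. It only builds the request and never sends it, so it runs without a cluster.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_versioned_delete(host, index, doc_type, doc_id, version):
    """Build (but do not send) a DELETE that should only apply at `version`."""
    query_string = urlencode({"version": version})
    url = f"http://{host}/{index}/{doc_type}/{doc_id}?{query_string}"
    return Request(url, method="DELETE")


req = build_versioned_delete("localhost:9200", "online-shop", "shirts", 1, 1)
print(req.get_method(), req.full_url)
# prints: DELETE http://localhost:9200/online-shop/shirts/1?version=1
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose. If the stock data had been updated in the meantime, the document would be at a higher version and the delete would be rejected.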

REMOVE A MAPPING TYPE AND DOCUMENTS MATCHING A QUERY

You can also remove an entire mapping type, which removes the mapping itself, plus all the documents indexed in that type. To do that, you provide the type’s URL to the DELETE request:

% curl -XDELETE 'localhost:9200/online-shop/shirts'

The tricky part about removing types is that the type name is treated like another field in the documents. All documents of an index end up in the same shards regardless of the mapping type they belong to. When you issue the previous command, Elasticsearch has to query for documents of that type, and then remove them. This is an important detail when it comes to performance for removing types versus removing complete indices because removing types typically takes longer and uses more resources.
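
The difference in cost can be illustrated with a toy model. This is an assumption-laden sketch: the list below is not Elasticsearch’s storage format, the mugs type is made up for the example, and the linear scan merely stands in for the query Elasticsearch runs to find documents of a type.

```python
# Toy shard: documents of every type sit side by side, and the type is just
# another field. This is an illustration, not Elasticsearch's storage format.
shard = [
    {"_type": "shirts", "_id": "1", "caption": "Learning Elasticsearch"},
    {"_type": "mugs", "_id": "1", "caption": "Coffee first"},
    {"_type": "shirts", "_id": "2", "caption": "I Know about Elasticsearch Versioning"},
]


def delete_type(docs, type_name):
    """Removing a type: documents have to be matched against the type field."""
    examined = 0
    kept = []
    for doc in docs:
        examined += 1
        if doc["_type"] != type_name:
            kept.append(doc)
    return kept, examined


def delete_index(docs):
    """Removing an index: the whole container is dropped without looking inside."""
    return [], 0


after_type, examined_for_type = delete_type(shard, "shirts")
after_index, examined_for_index = delete_index(shard)
print(len(after_type), examined_for_type, len(after_index), examined_for_index)
# prints: 1 3 0 0
```

Removing the type has to touch documents to decide which ones go, while removing the index does no per-document work at all.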

In the same way you can query for all documents within a type and delete them, Elasticsearch allows you to specify your own query for documents you want to delete through an API called delete by query. Using the API is similar to running a query, except that the HTTP method is DELETE, and the _search endpoint becomes _query. For example, to remove all documents that match “Elasticsearch” from the index get-together, you can run this command:

% curl -XDELETE 'localhost:9200/get-together/_query?q=elasticsearch'

Similar to regular queries, which we cover in more detail in chapter 4, you can run a delete by query on a specific type, on multiple types, everywhere in an index, in multiple indices, or in all indices. Be especially careful when you run a delete by query across all indices.
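
The scopes listed above differ only in the URL path. The sketch below builds such requests without sending them. The build_delete_by_query helper and the event type are hypothetical, and the comma-separated lists and the _all placeholder are assumed to follow the same URL conventions as searches.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_delete_by_query(host, query, indices=None, types=None):
    """Build (but do not send) a delete-by-query request for the given scope."""
    # No indices given means every index, which is the scope to be careful with.
    parts = [",".join(indices) if indices else "_all"]
    if types:
        parts.append(",".join(types))
    parts.append("_query")
    path = "/".join(parts)
    query_string = urlencode({"q": query})
    return Request(f"http://{host}/{path}?{query_string}", method="DELETE")


host = "localhost:9200"
one_index = build_delete_by_query(host, "elasticsearch", indices=["get-together"])
one_type = build_delete_by_query(host, "elasticsearch",
                                 indices=["get-together"], types=["event"])
everything = build_delete_by_query(host, "elasticsearch")

for req in (one_index, one_type, everything):
    print(req.get_method(), req.full_url)
```

The first request corresponds to the curl command shown earlier. The last one targets all indices, so a helper like this would be a natural place to add a confirmation step.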

TIP Besides being careful, you can use backups. We talk about backups in chapter 14, which is all about administration.

3.6.2 Deleting indices

As you might expect, to delete an index, issue a DELETE request to the URL of that index:

% curl -XDELETE 'localhost:9200/get-together/'

You can also delete multiple indices, by providing a comma-separated list, or even delete all indices by providing _all as the index name.

Deleting an index is fast because it’s mostly about removing the files associated with all shards of that index, and deleting files on the file system happens fast. This is in contrast to deleting individual documents: when you do that, they’re only marked as deleted. They get removed when segments are merged. Merging is the process of combining multiple small Lucene segments into a bigger segment.

On segments and merging

A segment is a chunk of the Lucene index (a Lucene index is a shard, in Elasticsearch terminology) that is created when you’re indexing. Segments are never appended to; new ones are created as you index new documents. Data is never removed from them because deleting only marks documents as deleted. Finally, data never changes because updating documents implies reindexing.

When Elasticsearch is performing a query on a shard, Lucene has to query all its segments, merge the results, and send them back—much like the process of querying multiple shards within an index.

As with shards, the more segments you have to go through, the slower the search.

As you may imagine, normal indexing operations create many such small segments. To avoid having an extremely large number of segments in an index, Lucene merges them from time to time.

Merging some segments implies reading their contents, excluding the deleted documents, and creating new and bigger segments with their combined content. This process requires resources, specifically CPU and disk I/O. Fortunately, merges run asynchronously, and Elasticsearch lets you configure numerous options around them. We talk more about those options in chapter 12, where you learn how to improve the performance of index, update, and delete operations.
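
The lifecycle described in this sidebar can be sketched with a toy model. The Segment and ToyShard classes are hypothetical and greatly simplified. Real Lucene segments are files on disk and merges are driven by a merge policy, neither of which is modeled here.

```python
class Segment:
    """Batch of documents that never changes, plus a set of delete markers."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> body, fixed once the segment exists
        self.deleted = set()     # IDs marked as deleted; the data itself stays

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}


class ToyShard:
    def __init__(self):
        self.segments = []

    def index_batch(self, docs):
        # Indexing never appends to an old segment; it creates a new one.
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # Deleting only marks the document; nothing is removed yet.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self):
        # A search visits every segment and combines the live documents.
        hits = {}
        for seg in self.segments:
            hits.update(seg.live_docs())
        return hits

    def stored_docs(self):
        return sum(len(seg.docs) for seg in self.segments)

    def merge(self):
        # Merging copies only live documents into one bigger segment.
        self.segments = [Segment(self.search())]


shard = ToyShard()
shard.index_batch({"1": "a", "2": "b"})
shard.index_batch({"3": "c"})
shard.delete("2")
before = (len(shard.segments), shard.stored_docs(), len(shard.search()))
shard.merge()
after = (len(shard.segments), shard.stored_docs(), len(shard.search()))
print(before, after)  # prints: (2, 3, 2) (1, 2, 2)
```

Before the merge, the deleted document is hidden from searches but still stored. Only the merge reduces both the number of segments and the amount of stored data.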

3.6.3 Closing indices

Instead of deleting indices, you also have the option of closing them. If you close an index, you won’t be able to read data from it or write data to it with Elasticsearch until you open it again. This is useful when you have flowing data, such as application logs. You’ll learn in chapter 12 that it’s a good idea to store such flowing data in time-based indices, for example, creating one index per day.

In an ideal world, you’d hold application logs forever, in case you need to look far back in time. On the other hand, having a large amount of data in Elasticsearch demands increased resources. For this use case, it makes sense to close “old” indices. You’re unlikely to need that data, but you don’t want to remove it, either.

To close the get-together index, send an HTTP POST request to its URL at the _close endpoint:

% curl -XPOST 'localhost:9200/get-together/_close'

To open it again, you run a similar command, only the endpoint becomes _open:

% curl -XPOST 'localhost:9200/get-together/_open'

Once you close an index, the only trace of it in Elasticsearch’s memory is its metadata, such as name, and where shards are located. If you have enough disk space and you’re not sure whether you’ll need to search in that data again, closing indices is better than removing them. Closing them gives you the peace of mind that you can always reopen a closed index and search in it again.
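
For the time-based log indices mentioned earlier, choosing which indices to close can be automated. The sketch below is only an illustration: the logs-YYYY.MM.DD naming scheme, the indices_to_close helper, and the five-day retention are assumptions, and the script only prints the curl commands instead of running them.

```python
from datetime import date, datetime


def indices_to_close(index_names, today, keep_days, prefix="logs-"):
    """Return the daily indices older than keep_days, oldest first."""
    old = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of the time-based indices
        day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if (today - day).days > keep_days:
            old.append((day, name))
    return [name for _, name in sorted(old)]


names = ["logs-2014.05.03", "logs-2014.05.01", "get-together", "logs-2014.05.10"]
to_close = indices_to_close(names, today=date(2014, 5, 10), keep_days=5)
commands = [f"curl -XPOST 'localhost:9200/{name}/_close'" for name in to_close]
for command in commands:
    print(command)
```

With these inputs, the two oldest log indices are selected, while get-together and the most recent log index are left open.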

3.6.4 Reindexing sample documents

In chapter 2, you used the book’s code samples (https://github.com/dakrone/elasticsearch-in-action) to index documents. Running populate.sh from these code samples removes the get-together index you created in this chapter, and reindexes the sample documents.

If you look at both the populate.sh script and the mapping definition from mapping.json, you’ll recognize various types of fields we discussed in this chapter.

Some of the mapping and indexing options, such as the analysis settings, are dealt with in upcoming chapters. For now, run populate.sh to prepare the get-together index for chapter 4, which is all about searches. The code samples provide you with sample data to search on.

3.7 Summary

• Mappings let you define fields in your documents and how those fields are indexed. We say Elasticsearch is schema-free because mappings are extended automatically, but in production you often need to take control over what is indexed, what is stored and how.

• Most fields in your documents are core types, such as strings and numbers. The way you index those fields has a big impact on how Elasticsearch performs and how relevant your search results are. One example is the analysis settings, which we cover in chapter 5.

• A single field can also be a container for multiple fields or values. We looked at arrays and multi fields, which let you have multiple occurrences of the same core type in the same field.

• Besides the fields that are specific to your documents, Elasticsearch provides predefined fields, such as _source and _timestamp. Configuring these fields changes data that you don’t explicitly provide in your documents but that has a big impact on both performance and functionality. For example, you can decide whether you want the original document to be stored or not.

• Because Elasticsearch stores data in Lucene segments that don’t change once they’re created, updating a document implies retrieving the existing one, putting the changes in a new document that gets indexed, and marking the old one as deleted.

• The removal of documents happens when the Lucene segments are asynchronously merged. This is also why deleting an entire index is faster than removing one or more individual documents from it: it only implies removing files on disk, with no merging.

• Throughout indexing, updating, and deleting, you can use document versions to manage concurrency issues. With updates, you can tell Elasticsearch to retry automatically if an update fails because of a concurrency issue.


Source: MEAP Edition, Manning Early Access Program, Elasticsearch in Action, Version 11 (pages 91-96)