Amazon Comprehend


Amazon Comprehend: Developer Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


What Is Amazon Comprehend?

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format, image files (JPG, PNG, or TIFF), and semi-structured documents (PDF or Word files). It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.

Topics

• Amazon Comprehend Insights (p. 1)

• Comprehend Custom (p. 1)

• Document Clustering (Topic Modeling) (p. 2)

• Examples (p. 2)

• Benefits (p. 2)

• Are You a First-time User of Amazon Comprehend? (p. 3)

Amazon Comprehend Insights

You work with one or more documents at a time to evaluate their content and gain insights about them.

Some of the insights that Amazon Comprehend develops about a document include:

• Entities – Amazon Comprehend returns a list of entities, such as people, places, and organizations, identified in a document. For more information, see Detect Entities (p. 155).

• Key phrases – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score. For more information, see Detect Key Phrases (p. 162).

• PII – Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number. For more information, see Detect Personally Identifiable Information (PII) (p. 167).

• Language – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages. For more information, see Detect the Dominant Language (p. 163).

• Sentiment – Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. For more information, see Determine Sentiment (p. 172).

• Syntax – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun. For more information, see Analyze Syntax (p. 173).

Comprehend Custom

Customize Comprehend for your specific requirements without the skillset required to build machine learning-based NLP solutions. Using automatic machine learning, or AutoML, Comprehend Custom builds customized NLP models on your behalf, using data you already have.


Custom Classification – Create custom document classifiers to organize your documents into your own categories. For each classification label, provide a set of documents that best represent that label and train your classifier on it. Once trained, a classifier can be used on any number of unlabeled document sets. You can use the console for a code-free experience or install the latest AWS SDK. For more information, see Custom Classification (p. 90).

Custom Entities – Create custom entity types that analyze text for your specific terms and noun-based phrases. You can train custom entities to extract terms like policy numbers, or phrases that imply a customer escalation. To train the model, you provide a list of the entities and a set of documents that contain them. Once the model is trained, you can submit analysis jobs against it to extract your custom entities. For more information, see Custom Entity Recognition (p. 109).

Document Clustering (Topic Modeling)

You can also use Amazon Comprehend to examine a corpus of documents to organize them based on similar keywords within them. Document clustering (topic modeling) is useful to organize a large corpus of documents into topics or clusters that are similar based on the frequency of words within them.

Topic modeling is an asynchronous process: you submit a set of documents for processing and then get the results later, when processing is complete. Amazon Comprehend does topic modeling on large document sets. For best results, include at least 1,000 documents when you submit a topic modeling job. For more information, see Topic Modeling (p. 175).

Examples

The following examples show how you might use the Amazon Comprehend operations in your applications.

Example 1: Find documents about a subject

Find the documents about a particular subject using Amazon Comprehend topic modeling. Scan a set of documents to determine the topics discussed, and to find the documents associated with each topic. You can specify the number of topics that Amazon Comprehend should return from the document set.

Example 2: Find out how customers feel about your products

If your company publishes a catalog, let Amazon Comprehend tell you what customers think of your products. Send each customer comment to the DetectSentiment operation and it will tell you whether customers feel positive, negative, neutral, or mixed about a product.
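As a rough illustration of Example 2, the following Python sketch sends each comment to DetectSentiment and tallies the labels that come back. It assumes a client object that exposes detect_sentiment(Text=..., LanguageCode=...), as the AWS SDK for Python (boto3) Comprehend client does. The function name tally_sentiment and the StubComprehendClient class are hypothetical, and the stub exists only so that the sketch runs without AWS credentials.

```python
from collections import Counter


def tally_sentiment(client, comments, language_code="en"):
    """Send each comment to DetectSentiment and count the returned labels."""
    counts = Counter()
    for comment in comments:
        response = client.detect_sentiment(Text=comment, LanguageCode=language_code)
        counts[response["Sentiment"]] += 1
    return counts


class StubComprehendClient:
    """Hypothetical offline stand-in for a Comprehend client (not an AWS class)."""

    def detect_sentiment(self, Text, LanguageCode):
        # Toy rule used only for the demo; the real service uses a trained model.
        label = "POSITIVE" if "love" in Text.lower() else "NEGATIVE"
        return {"Sentiment": label}


if __name__ == "__main__":
    comments = ["I love this blender.", "The lid broke after a week."]
    print(tally_sentiment(StubComprehendClient(), comments))
```

To call the service itself, you would pass a real client, for example one created with boto3.client("comprehend"), in place of the stub.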

Example 3: Discover what matters to your customers

Use Amazon Comprehend topic modeling to discover the topics that your customers are talking about on your forums and message boards, then use entity detection to determine the people, places, and things that they associate with the topic. Finally, use sentiment analysis to determine how your customers feel about a topic.

Benefits

Some of the benefits of using Amazon Comprehend include:

• Integrate powerful natural language processing into your apps – Amazon Comprehend removes the complexity of building text analysis capabilities into your applications by making powerful and accurate natural language processing available with a simple API. You don't need textual analysis expertise to take advantage of the insights that Amazon Comprehend produces.

• Deep learning-based natural language processing – Amazon Comprehend uses deep learning technology to accurately analyze text. Our models are constantly trained with new data across multiple domains to improve accuracy.

• Scalable natural language processing – Amazon Comprehend enables you to analyze millions of documents so that you can discover the insights that they contain.

• Integrate with other AWS services – Amazon Comprehend is designed to work seamlessly with other AWS services like Amazon S3, AWS KMS, and AWS Lambda. Store your documents in Amazon S3, or analyze real-time data with Kinesis Data Firehose. Support for AWS Identity and Access Management (IAM) makes it easy to securely control access to Amazon Comprehend operations. Using IAM, you can create and manage AWS users and groups to grant the appropriate access to your developers and end users.

• Encryption of output results and volume data – Amazon S3 already enables you to encrypt your input documents, and Amazon Comprehend extends this even further. By using your own KMS key, you can encrypt not only the output results of your job, but also the data on the storage volume attached to the compute instance that processes the analysis job. The result is significantly enhanced security.

• Low cost – With Amazon Comprehend, you only pay for the documents that you analyze. There are no minimum fees or upfront commitments.

Are You a First-time User of Amazon Comprehend?

If you are a first-time user of Amazon Comprehend, we recommend that you read the following sections in order:

1. How It Works (p. 4) – This section introduces Amazon Comprehend concepts.

2. Getting Started with Amazon Comprehend (p. 7) – In this section, you set up your account and test Amazon Comprehend.

3. API Reference (p. 232) – In this section, you'll find reference documentation for Amazon Comprehend operations.


How It Works

Amazon Comprehend uses a pre-trained model to examine and analyze a document or set of documents to gather insights about it. This model is continuously trained on a large body of text so that there is no need for you to provide training data.

Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Amazon Comprehend supported languages.

Additionally, Amazon Comprehend's Detect the Dominant Language (p. 163) operation can examine documents and determine the dominant language out of a far wider variety of different languages. For more information, see Languages Supported in Amazon Comprehend (p. 5).

With Amazon Comprehend, you can perform the following on your documents:

• Detect the Dominant Language (p. 163) – Examine text to determine the dominant language.

• Detect Entities (p. 155) – Detect textual references to the names of people, places, and items as well as references to dates and quantities.

• Detect Key Phrases (p. 162) – Find key phrases such as "good morning" in a document or set of documents.

• Detect Personally Identifiable Information (PII) (p. 167) – Analyze documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number.

• Determine Sentiment (p. 172) – Analyze documents and determine the dominant sentiment of the text.

• Analyze Syntax (p. 173) – Parse the words in your text and show the part of speech for each word, which helps you understand the content of the document.

• Topic Modeling (p. 175) – Search the content of documents to determine common themes and topics.

Each operation can be processed in several ways:

• Single-Document Processing (p. 179) — You call Amazon Comprehend with a single document and receive a synchronous response.

• Multiple Document Synchronous Processing (p. 184) — You call Amazon Comprehend with a collection of up to 25 documents and receive a synchronous response.

• Asynchronous Batch Processing (p. 179) — You put a collection of documents into an Amazon S3 bucket and start an asynchronous operation to analyze the documents. The results of the analysis are returned in an S3 bucket.
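Because multiple-document synchronous processing accepts up to 25 documents per call, a larger collection has to be split into groups before it is submitted. The following Python sketch shows one way to do that split. The function name chunk_documents is hypothetical, and the code only prepares the batches without calling any Amazon Comprehend operation.

```python
MAX_DOCUMENTS_PER_CALL = 25  # limit stated for multiple-document synchronous processing


def chunk_documents(documents, batch_size=MAX_DOCUMENTS_PER_CALL):
    """Split a list of documents into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        documents[start:start + batch_size]
        for start in range(0, len(documents), batch_size)
    ]


if __name__ == "__main__":
    documents = [f"document {n}" for n in range(60)]
    for batch in chunk_documents(documents):
        # Each batch is small enough for one synchronous multi-document call.
        print(len(batch))
```

Each resulting batch can then be passed to a synchronous multi-document operation, while very large collections are usually a better fit for asynchronous batch processing through Amazon S3.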

Each operation can be encrypted both during communication and processing.

By using the integrated AWS KMS encryption, you maintain control over who can access your encrypted data.

You can optionally provide a custom KMS key when you create your analysis job, and your data will be encrypted on the storage volume attached to the ML compute instance processing the job. You can also provide a key to encrypt your output results as they are sent to the S3 bucket. If you have set up encryption on the S3 bucket that holds your input documents, this can provide you with end-to-end security.

For more information, see KMS Encryption in Amazon Comprehend (p. 195).


Languages Supported in Amazon Comprehend

Amazon Comprehend supports a wide variety of languages for its various features. The languages supported and the features that support them can be seen in the following tables.

Topics

• Supported Languages (p. 5)

• Languages Supported by Amazon Comprehend Features (p. 5)

Supported Languages

Amazon Comprehend (except the Detect Dominant Language feature) supports the following languages for one or more features.

Code    Language
de      German
en      English
es      Spanish
it      Italian
pt      Portuguese
fr      French
ja      Japanese
ko      Korean
hi      Hindi
ar      Arabic
zh      Chinese (simplified)
zh-TW   Chinese (traditional)

Note: Amazon Comprehend identifies the language using identifiers from RFC 5646. If there is a two-letter ISO 639-1 identifier, with a regional subtag if necessary, it uses that. Otherwise, it uses the three-letter ISO 639-2 code. For more information about RFC 5646, see the IETF Tools web site.

Languages Supported by Amazon Comprehend Features

Feature                                                                  Supported Languages
Detect the Dominant Language (p. 163)                                    See Detect the Dominant Language (p. 163).
Detect Entities (p. 155)                                                 All supported languages.
Detect Key Phrases (p. 162)                                              All supported languages.
Detect Personally Identifiable Information (PII) (p. 167)                English
Label Documents with Personally Identifiable Information (PII) (p. 171)  English
Determine Sentiment (p. 172)                                             All supported languages.
Analyze Syntax (p. 173)                                                  German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).
Topic Modeling (p. 175)                                                  Not dependent on the language used. Does not support character-based languages such as Chinese, Japanese, and Korean.
Custom Classification (p. 90)                                            German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).
Custom Entity Recognition (p. 109)                                       German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).


Getting Started with Amazon Comprehend

To get started using Amazon Comprehend, set up an AWS account and create an AWS Identity and Access Management (IAM) user. To use Amazon Comprehend with the AWS Command Line Interface (AWS CLI), download and configure the AWS CLI.

Topics

• Step 1: Set Up an AWS Account and Create an Administrator User (p. 7)

• Step 2: Set Up the AWS Command Line Interface (AWS CLI) (p. 8)

• Step 3: Getting Started Using the Amazon Comprehend Console (p. 9)

• Step 4: Getting Started Using the Amazon Comprehend API (p. 30)

• Solution: Analyzing Text with Amazon Comprehend and Amazon OpenSearch (p. 65)

Step 1: Set Up an AWS Account and Create an Administrator User

Before you use Amazon Comprehend for the first time, complete the following tasks:

1. Sign Up for AWS (p. 7)

2. Create an IAM User (p. 7)

Sign Up for AWS

When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all AWS services, including Amazon Comprehend. You are charged only for the services that you use.

With Amazon Comprehend, you pay only for the resources that you use. If you are a new AWS customer, you can get started with Amazon Comprehend for free. For more information, see AWS Free Usage Tier.

If you already have an AWS account, skip to the next section.

To create an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

Record your AWS account ID because you'll need it for the next task.

Create an IAM User

Services in AWS, such as Amazon Comprehend, require that you provide credentials when you access them. This allows the service to determine whether you have permissions to access the service's resources.


We strongly recommend that you access AWS using AWS Identity and Access Management (IAM), not the credentials for your AWS account. To use IAM to access AWS, create an IAM user, add the user to an IAM group with administrative permissions, and then grant administrative permissions to the IAM user. You can then access AWS using a special URL and the IAM user's credentials.

The Getting Started exercises in this guide assume that you have a user with administrator privileges, adminuser.

To create an administrator user and sign in to the console

1. Create an administrator user called adminuser in your AWS account. For instructions, see Creating Your First IAM User and Administrators Group in the IAM User Guide.

2. Sign in to the AWS Management Console using a special URL. For more information, see How Users Sign In to Your Account in the IAM User Guide.

For more information about IAM, see the following:

• AWS Identity and Access Management (IAM)

• Getting started

• IAM User Guide

Next Step

Step 2: Set Up the AWS Command Line Interface (AWS CLI) (p. 8)

Step 2: Set Up the AWS Command Line Interface (AWS CLI)

You don't need the AWS CLI to perform the steps in the Getting Started exercises. However, some of the other exercises in this guide do require it. If you prefer, you can skip this step and go to Step 3: Getting Started Using the Amazon Comprehend Console (p. 9), and set up the AWS CLI later.

To set up the AWS CLI

1. Download and configure the AWS CLI. For instructions, see the following topics in the AWS Command Line Interface User Guide:

• Getting Set Up with the AWS Command Line Interface

• Configuring the AWS Command Line Interface

2. In the AWS CLI config file, add a named profile for the administrator user:

[profile adminuser]
aws_access_key_id = adminuser access key ID
aws_secret_access_key = adminuser secret access key
region = aws-region

You use this profile when executing the AWS CLI commands. For more information about named profiles, see Named Profiles in the AWS Command Line Interface User Guide. For a list of AWS Regions, see Regions and Endpoints in the Amazon Web Services General Reference.


3. Verify the setup by typing the following help command at the command prompt:

aws help
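If you want to confirm that the named profile from step 2 is well formed before you run AWS CLI commands, a short script can parse the profile text. The following Python sketch uses only the standard library. The function name load_profile is hypothetical, and the key and Region values in SAMPLE_CONFIG are placeholders rather than real credentials.

```python
import configparser

# Placeholder profile text in the same layout as the AWS CLI config file.
SAMPLE_CONFIG = """\
[profile adminuser]
aws_access_key_id = EXAMPLE-ACCESS-KEY-ID
aws_secret_access_key = EXAMPLE-SECRET-ACCESS-KEY
region = us-west-2
"""

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "region")


def load_profile(config_text, profile_name):
    """Return the settings of a named profile, checking that the expected keys exist."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = f"profile {profile_name}"
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section found")
    settings = dict(parser[section])
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError(f"profile is missing: {', '.join(missing)}")
    return settings


if __name__ == "__main__":
    print(load_profile(SAMPLE_CONFIG, "adminuser")["region"])
```

To check your own setup, you would read the text of your config file and pass it in place of SAMPLE_CONFIG.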

Next Step

Step 3: Getting Started Using the Amazon Comprehend Console (p. 9)

Step 3: Getting Started Using the Amazon Comprehend Console

The easiest way to get started using Amazon Comprehend is to use the console to analyze a short text file. If you haven't reviewed the concepts and terminology in How It Works (p. 4), we recommend that you do that before proceeding.

Topics

• Analyzing Documents Using the Console (p. 9)

• Creating and Using Custom Entity Recognizer (p. 16)

• Creating and Using Custom Classifiers (p. 20)

• Model Versioning with Amazon Comprehend (p. 25)

• Creating a Topic Modeling Job Using the Console (p. 27)

• Creating an Events Detection Job Using the Console (p. 29)

Analyzing Documents Using the Console

The Amazon Comprehend console enables you to analyze the contents of documents up to 5,000 characters long. The results are shown in the console so that you can review the analysis.

To start analyzing documents, sign in to the AWS Management Console and open the Amazon Comprehend console.

You can replace the sample text with your own text either in English or one of the other languages supported by Amazon Comprehend and then choose Analyze to get an analysis of your text. Below the text being analyzed, the Results pane shows more information about the text.

Entities

The Entities tab lists each entity that Amazon Comprehend detected in the input text, along with its category and the level of confidence in the detection. The results are color-coded to indicate different entity types such as organizations, locations, dates, and persons. For more information, see Detect Entities (p. 155).
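The same entity, category, and confidence information is available programmatically. The following Python sketch filters a DetectEntities-style response by confidence score. The field names (Entities, Text, Type, Score, BeginOffset, EndOffset) follow the response shape of the DetectEntities operation, while the function name confident_entities, the threshold, and the sample scores are illustrative assumptions rather than real service output.

```python
def confident_entities(response, min_score=0.9):
    """Return (text, type) pairs for entities at or above the confidence threshold."""
    return [
        (entity["Text"], entity["Type"])
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]


# Hand-written sample for the sentence "It is raining today in Seattle".
# The scores are made up for illustration.
SAMPLE_RESPONSE = {
    "Entities": [
        {"Text": "today", "Type": "DATE", "Score": 0.62, "BeginOffset": 14, "EndOffset": 19},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.99, "BeginOffset": 23, "EndOffset": 30},
    ]
}

if __name__ == "__main__":
    for text, entity_type in confident_entities(SAMPLE_RESPONSE):
        print(f"{entity_type}: {text}")
```

Choosing the threshold is a trade-off: a higher value returns fewer entities that are more likely to be correct, and the right value depends on your application.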


Key phrases

The Key phrases tab lists key noun phrases that Amazon Comprehend detected in the input text and the associated confidence level. For more information, see Detect Key Phrases (p. 162).


Language

The Language tab shows the dominant language of the text and Amazon Comprehend's level of confidence that it has detected the dominant language correctly. Amazon Comprehend can recognize 100 languages. For more information, see Detect the Dominant Language (p. 163).


PII

The PII tab lists entities in your input text that contain personally identifiable information (PII). A PII entity is a textual reference to personal data that could be used to identify an individual, such as an address, bank account number, or phone number. For more information, see Detect Personally Identifiable Information (PII) (p. 167).

The PII tab provides two analysis modes:

• Offsets

• Labels

Offsets

The Offsets analysis mode identifies the location of PII in your text documents. For more information, see Locate PII Entities (p. 169).
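One common use of the location information is to mask the PII in the original text. The following Python sketch does that for entities described by BeginOffset and EndOffset, the field names returned by the DetectPiiEntities operation. The function name redact_pii and the sample entity are hypothetical, and the sketch assumes that the offsets index characters in the same way that Python strings do.

```python
def redact_pii(text, entities, mask_char="*"):
    """Replace each PII span with mask characters of the same length."""
    redacted = text
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:begin] + mask_char * (end - begin) + redacted[end:]
    return redacted


if __name__ == "__main__":
    text = "Call 555-0100 today."
    # Hand-written sample entity; the score is made up for illustration.
    entities = [{"Type": "PHONE", "Score": 0.99, "BeginOffset": 5, "EndOffset": 13}]
    print(redact_pii(text, entities))
```

Because each span is replaced by a mask of equal length, the offsets of any remaining entities still line up with the redacted text.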


Labels

The Labels analysis mode checks for the presence of PII in your text document and returns the labels of identified PII entity types. For more information, see Label Documents with PII Entity Types (p. 171).


Sentiment

The Sentiment tab shows the overall emotional sentiment of the text. Sentiment can be rated neutral, positive, negative, or mixed. Each sentiment has a confidence rating, which is Amazon Comprehend's estimate of how likely that sentiment is to be the dominant one. For more information, see Determine Sentiment (p. 172).


Syntax

The Syntax tab shows a breakdown of each element in the text, along with its part of speech and the associated confidence score. For more information, see Analyze Syntax (p. 173).


Creating and Using Custom Entity Recognizer

You can create custom entity recognizers using the Amazon Comprehend console. This section shows you how to create and train a custom entity recognizer and then how to create an entity recognizer job.

Creating a Custom Entity Recognizer Using the Console - CSV Format

To create the custom entity recognizer, first provide a dataset to train your model. With this dataset, include one of the following: a set of annotated documents, or a list of entities and their type labels along with a set of documents containing those entities. For more information, see Custom Entity Recognition (p. 109).

To train a custom entity recognizer with a CSV file

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.


2. From the left menu, choose Customization and then choose Custom entity recognition.

3. Choose Create new model.

4. Give the recognizer a name. The name must be unique within the Region and account.

5. Select the language.

6. Under Custom entity type, enter a custom label that you want the recognizer to find in the dataset.

The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.

7. Choose Add type.

8. If you want to add an additional entity type, enter it, and then choose Add type. If you want to remove one of the entity types you've added, choose Remove type and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed.

9. To encrypt your training job, choose Recognizer encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, for KMS key ID choose the key ID.

• If you are using a key associated with a different account, for KMS key ARN enter the ARN for the key ID.

Note: For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).

10. Under Data specifications, choose the format of your training documents:

• CSV file – A CSV file that supplements your training documents. The CSV file contains information about the custom entities that your trained model will detect. The required format of the file depends on whether you are providing annotations or an entity list.

• Augmented manifest – A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. You can provide up to 5 augmented manifest files.

For more information about available formats, and for examples, see Training custom entity recognizers (p. 109).

11. Under Training type, choose the training type to use:

• Using annotations and training docs

• Using entity list and training docs

If choosing annotations, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose Browse S3.

If choosing entity list, enter the URL of the entity list in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the entity list is located and choose Browse S3.

12. Enter the URL of an input dataset containing the training documents in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the training documents are located and choose Select folder.

13. Under Test dataset, select how you want to evaluate the performance of your trained model. You can do this for both the annotations and entity list training types.

• Autosplit: Automatically selects 10% of your provided training data to use as testing data.

• (Optional) Customer provided: When you select Customer provided, you can specify exactly what test data you want to use.

14. If you select Customer provided test dataset, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose Select folder.

15. In the Choose an IAM role section, either select an existing IAM role or create a new one.

• Choose an existing IAM role – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.

• Create a new IAM role – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets.

Note: If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).

16. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.

1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.

2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.

Note: When you use a VPC with your custom entity recognition job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

17. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the recognizer, choose Remove tag.

18. Choose Train.

The new recognizer then appears in the list, showing its status. It first shows as Submitted. It then shows Training for a recognizer that is processing training documents, Trained for a recognizer that is ready to use, and In error for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.
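The naming rules for custom entity types in the procedure above (uppercase, words separated by underscores, at most 25 types) are easy to check before you open the console. The following Python sketch applies those rules. The function names normalize_entity_type and validate_entity_types are hypothetical, and the sketch covers only the rules stated in this section; the service may enforce additional constraints.

```python
MAX_ENTITY_TYPES = 25  # maximum number of entity types stated in the procedure


def normalize_entity_type(label):
    """Convert a label such as 'policy number' to the form 'POLICY_NUMBER'."""
    words = label.strip().split()
    if not words:
        raise ValueError("entity type must not be empty")
    return "_".join(words).upper()


def validate_entity_types(labels):
    """Normalize every label and enforce the limit on the number of types."""
    normalized = [normalize_entity_type(label) for label in labels]
    if len(normalized) > MAX_ENTITY_TYPES:
        raise ValueError(f"at most {MAX_ENTITY_TYPES} entity types are allowed")
    return normalized


if __name__ == "__main__":
    print(validate_entity_types(["policy number", "Customer Escalation"]))
```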

Creating a Custom Entity Recognizer Using the Console - Augmented Manifest

To train a custom entity recognizer with plain text, PDF, or Word documents

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

3. Choose Train recognizer.

4. Give the recognizer a name. The name must be unique within the Region and account.

5. Select the language. Note: If you're training a PDF or Word document, English is the supported language.

6. Under Custom entity type, enter a custom label that you want the recognizer to find in the dataset.

The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.

7. Choose Add type.


8. If you want to add an additional entity type, enter it, and then choose Add type. If you want to remove one of the entity types you've added, choose Remove type and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed.

9. To encrypt your training job, choose Recognizer encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, for KMS key ID choose the key ID.

10. Under Training data, choose Augmented manifest as your data format:

• Augmented manifest – A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line in the file is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document.

You can provide up to 5 augmented manifest files. If you are using PDF documents for training data, you must select Augmented manifest.

For each file, you can name up to 5 attributes to use as training data.

For more information about available formats, and for examples, see Training custom entity recognizers (p. 109).

11. Select the training model type.

If you selected Plain text documents, under Input location, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest files are located and choose Select folder.

12. Under Attribute name, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job. In this case, each attribute contains the set of annotations from a labeling job. Note: You can provide up to 5 attribute names for each file.

13. Select Add.

14. If you selected PDF, Word documents, under Input location, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest files are located and choose Select folder.

15. Enter the S3 prefix for your Annotation data files. These are the PDF documents that you labeled.

16. Enter the S3 prefix for your Source documents. These are the original PDF documents (data objects) that you provided to Ground Truth for your labeling job.

17. Enter the attribute names that contain your annotations. Note: You can provide up to 5 attribute names for each file. Any attributes in your file that you don't specify are ignored.

18. In the IAM role section, either select an existing IAM role or create a new one.

• Choose an existing IAM role – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.

• Create a new IAM role – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets.

Note: If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).

Note: When you use a VPC with your custom entity recognition job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

20. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the recognizer, choose Remove tag.

21. Choose Train.

The new recognizer then appears in the list, showing its status. It first shows as Submitted. It then shows Training for a recognizer that is processing training documents, Trained for a recognizer that is ready to use, and In error for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.
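Because an augmented manifest is in JSON lines format, you can inspect it with a few lines of code before you start training. The following Python sketch reads each line as a JSON object and keeps only the attributes you name, mirroring the rule that unspecified attributes are ignored and that up to 5 attribute names are allowed per file. The function name read_manifest and the keys in the sample ("source", "labeling-job-1", "unused") are hypothetical; the real keys depend on your Ground Truth labeling job.

```python
import json

MAX_ATTRIBUTES_PER_FILE = 5  # limit stated in the procedure above


def read_manifest(manifest_text, attribute_names):
    """Parse JSON lines text and keep only the named annotation attributes."""
    if not 1 <= len(attribute_names) <= MAX_ATTRIBUTES_PER_FILE:
        raise ValueError(f"provide 1 to {MAX_ATTRIBUTES_PER_FILE} attribute names")
    records = []
    for line_number, line in enumerate(manifest_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # each line is one complete JSON object
        labels = {name: record[name] for name in attribute_names if name in record}
        records.append({"line": line_number, "labels": labels})
    return records


# Hypothetical two-line manifest; the keys and values are placeholders.
SAMPLE_MANIFEST = "\n".join([
    json.dumps({"source": "Policy AB-1234 was renewed.", "labeling-job-1": {"note": "placeholder"}, "unused": 1}),
    json.dumps({"source": "No policy number here.", "labeling-job-1": {"note": "placeholder"}}),
])

if __name__ == "__main__":
    for record in read_manifest(SAMPLE_MANIFEST, ["labeling-job-1"]):
        print(record["line"], sorted(record["labels"]))
```

A check like this can catch malformed lines or missing attribute names early, before a training job fails on them.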

Creating and Using Custom Classifiers

You can create and train custom classifiers using the console, and then run asynchronous classification jobs to analyze your documents. You can also add an endpoint to the same custom model and run custom classification requests to gain real-time (synchronous) insights about your text. This section shows you how to create a classifier using the console, and then how to either use it to run an asynchronous classification job or create an endpoint for it and run a real-time classification request.

Topics

• Creating a Custom Classiﬁer (Console) (p. 20)

• Running an Asynchronous Custom Classiﬁcation Job (p. 22)

• Creating a Real-time Custom Classiﬁcation Request (p. 24)

Creating a Custom Classiﬁer (Console)

Create a custom document classiﬁer to identify the categories of a set of documents.

To train the classifier, you need a set of training documents. You label these documents with the categories that you want the document classifier to recognize. For more information on these training documents, see Custom Classification (p. 90).
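As a rough illustration of what such a training file can look like, the sketch below writes labeled documents in the two-column CSV layout that the console procedure describes (labels in the first column, document text in the second). The helper name `write_training_csv`, the sample labels and texts, and the `|` delimiter for multi-label rows are assumptions made for this example, not values taken from this guide.

```python
import csv
import io


def write_training_csv(rows, delimiter="|"):
    """Return CSV text with one row per document: labels first, text second.

    rows is an iterable of (labels, text) pairs, where labels is a list of
    strings. A single-label row simply has a one-element list. The delimiter
    is only used when a row carries more than one label (assumed to be "|").
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for labels, text in rows:
        writer.writerow([delimiter.join(labels), text])
    return buffer.getvalue()


# Hypothetical sample data for illustration only.
sample = write_training_csv([
    (["SPORTS"], "The home team won 98-92 in overtime."),
    (["FINANCE", "TECHNOLOGY"], "The chipmaker's shares rose after earnings."),
])
print(sample)
```

In single-label mode every row would carry exactly one label; the second sample row only makes sense for a classifier trained in multi-label mode.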

To train a document classifier

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom Classiﬁcation.

3. Choose Create new model.

4. Give the classiﬁer a name. The name must be unique within your account and current Region.

5. Select the language of the training documents. You can train a document classifier using any of the languages that work with Amazon Comprehend. However, you can train the classifier in only one language. To learn more, see Languages Supported by Amazon Comprehend (p. 5).

6. (Optional) If you want to encrypt the data in the storage volume while your training job is being processed, choose Classiﬁer encryption and then choose whether to use a KMS key associated with your current account, or one from another account.

• If you are using a key associated with the current account, choose the key ID for KMS key ID.

• If you are using a key associated with a diﬀerent account, enter the ARN for the key ID under KMS key ARN.

Note

For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).

7. Under Data speciﬁcations, choose which classiﬁer mode to use.

• Single-label mode: Choose this option if the categories you are assigning to documents are mutually exclusive and you are training your classiﬁer to assign one and only one label to each document.

• Multi-label mode: Choose this option if multiple categories can be applied to a document at the same time and you are training your classifier to assign one, many, all, or no labels to each document.

8. If you chose Multi-label mode, under Delimiter for labels, choose the character that separates labels when a line contains more than one label.

9. Under Data format, choose the format of your training documents:

• CSV file — A two-column CSV file, where labels are provided in the first column, and documents are provided in the second.

• Augmented manifest — A labeled dataset that is produced by Amazon SageMaker Ground Truth.

This ﬁle is in JSON lines format. Each line is a complete JSON object that contains a training document and its associated labels.

For more information about these formats, and for examples, see Training a Custom Classiﬁer (p. 91).

10. Under Training dataset, enter the location of the Amazon S3 bucket that contains your training documents or navigate to it by choosing Select folder. The IAM role you're using for access permissions for the training job must have read permissions for the S3 bucket.

11. Under Test dataset, select how you want to evaluate the performance of your trained model. You can do this for both annotations and entity list training types.

• Autosplit: Autosplit automatically selects 10% of your provided training data to use as testing data.

• (Optional) Customer provided: When you select customer provided, you can specify exactly what test data you want to use. If you select Customer provided test dataset, enter the URL of the annotations ﬁle in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation ﬁles are located and choose Select folder.

12. (Optional) If you want Amazon Comprehend to create a confusion matrix that provides metrics on how well the classiﬁer performed during training, enter the location of an Amazon S3 bucket where it will be saved. For more information, see Confusion Matrix (p. 106).

(Optional) If you choose to encrypt the output result from your training job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, choose the key alias for KMS key ID.

• If you are using a key associated with a diﬀerent account, enter the ARN for the key alias or ID under KMS key ID.

13. Choose Choose an existing IAM role, and then choose an existing IAM role that has read permissions for the S3 bucket that contains your training documents. Only roles that have a trust policy that begins with comprehend.amazonaws.com are valid.

If you don't already have an IAM role with these permissions, choose Create an IAM role to make one. Choose the access permissions to grant this role, and then choose a name suﬃx to distinguish the role from IAM roles in your account.

Note

If the input documents are encrypted, the IAM role used must also have kms:Decrypt permission. For more information, see Permissions Required to Use KMS.

1. Choose the subnet under Subnets(s). After you select the ﬁrst subnet, you can choose additional ones.

Note

When you use a VPC with your classification job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

15. (Optional) To add a tag to the custom classiﬁer, enter a key-value pair under Tags. Choose Add tag.

To remove this pair before creating the classiﬁer, choose Remove tag. For more information, see Tagging your resources (p. 191).

16. Choose Create.

The new classifier will then appear in the list, showing its status. It will first show as Submitted. It will then show Training for a classifier that is processing training documents, Trained for a classifier that is ready to use, and In error for a classifier that has an error. You can click on a job to get more information about the classifier, including any error messages.
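The console choices above correspond roughly to a single API request. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its `create_document_classifier` operation. The helper names, the ARN, and the bucket paths are hypothetical placeholders, and the client is passed in so that nothing here contacts AWS until you call `create_classifier` yourself.

```python
def build_classifier_request(name, language_code, role_arn, training_s3_uri,
                             mode="MULTI_CLASS", label_delimiter=None,
                             test_s3_uri=None, output_s3_uri=None,
                             volume_kms_key_id=None, tags=None):
    """Assemble keyword arguments for create_document_classifier.

    mode is "MULTI_CLASS" for single-label mode or "MULTI_LABEL" for
    multi-label mode. Optional settings are left out when not provided.
    """
    input_config = {"DataFormat": "COMPREHEND_CSV", "S3Uri": training_s3_uri}
    if mode == "MULTI_LABEL" and label_delimiter:
        input_config["LabelDelimiter"] = label_delimiter
    if test_s3_uri:
        # Customer-provided test data; omit to let the service autosplit.
        input_config["TestS3Uri"] = test_s3_uri

    request = {
        "DocumentClassifierName": name,
        "LanguageCode": language_code,
        "DataAccessRoleArn": role_arn,
        "Mode": mode,
        "InputDataConfig": input_config,
    }
    if output_s3_uri:
        # Location where the confusion matrix is written.
        request["OutputDataConfig"] = {"S3Uri": output_s3_uri}
    if volume_kms_key_id:
        request["VolumeKmsKeyId"] = volume_kms_key_id
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def create_classifier(client, request):
    """Submit the request and return the ARN of the new classifier."""
    return client.create_document_classifier(**request)["DocumentClassifierArn"]


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   arn = create_classifier(client, build_classifier_request(
#       "example-classifier", "en",
#       "arn:aws:iam::111122223333:role/ExampleComprehendRole",
#       "s3://example-bucket/training/train.csv"))
```

Keeping the request builder separate from the call makes it easy to inspect or log the exact settings before training starts.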

Running an Asynchronous Custom Classiﬁcation Job

Once you have created a custom document classiﬁer, you can use it to categorize a group of documents.

To create a custom asynchronous classification job

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom classiﬁcation.

3. Choose Create job.

4. Give the classification job a name. The name must be unique within your account and current Region.

5. Under Analysis type, choose Custom classiﬁcation.

6. From Select classiﬁer, choose the custom classiﬁer to use.

7. (Optional) If you choose to encrypt the data in the storage volume while your classiﬁcation job is processed, choose Job encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, choose the key ID for KMS key ID.

• If you are using a key associated with a diﬀerent account, enter the ARN for the key ID under KMS key ARN.

8. Under Input data, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing Select folder. This bucket must be in the same Region as the API that you are calling. The IAM role you're using for access permissions for the classification job must have read permissions for the S3 bucket.

9. (Optional) Choose the format of the documents to be classified under Input format. These can be one document per file, or one document per line in a single file.

10. Under Output data, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data or navigate to it by choosing Select folder. This bucket must be in the same region as the API that you are calling. The IAM role you're using for access permissions for the classiﬁcation job must have write permissions for the S3 bucket.

11. (Optional) If you choose to encrypt the output result from your job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, choose the key alias or ID for KMS key ID.

• If you are using a key associated with a diﬀerent account, enter the ARN for the key alias or ID under KMS key ID.

Note

When you use a VPC with your classification job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

13. Choose Create job to create the document classiﬁcation job.
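For readers who prefer to script this procedure, the sketch below shows one way the same job settings might be expressed with boto3's `start_document_classification_job` operation. It is an illustration under stated assumptions: the helper names and every ARN and S3 path are hypothetical, and the client is supplied by the caller.

```python
VALID_INPUT_FORMATS = {"ONE_DOC_PER_FILE", "ONE_DOC_PER_LINE"}


def build_classification_job_request(job_name, classifier_arn, role_arn,
                                     input_s3_uri, output_s3_uri,
                                     input_format="ONE_DOC_PER_FILE",
                                     output_kms_key_id=None):
    """Assemble keyword arguments for start_document_classification_job."""
    if input_format not in VALID_INPUT_FORMATS:
        raise ValueError(f"input_format must be one of {sorted(VALID_INPUT_FORMATS)}")

    output_config = {"S3Uri": output_s3_uri}
    if output_kms_key_id:
        # Encrypts the job's output results.
        output_config["KmsKeyId"] = output_kms_key_id

    return {
        "JobName": job_name,
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": input_format},
        "OutputDataConfig": output_config,
    }


def start_job(client, request):
    """Start the asynchronous job and return its job ID."""
    return client.start_document_classification_job(**request)["JobId"]


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   job_id = start_job(client, build_classification_job_request(
#       "example-job", "<classifier ARN>", "<role ARN>",
#       "s3://example-bucket/input/", "s3://example-bucket/output/"))
```

The two input formats mirror the console's choice between one document per file and one document per line.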

Creating a Real-time Custom Classiﬁcation Request

In addition to running asynchronous jobs with the custom document classifier, you can use it to run synchronous custom classification requests and gain real-time insight into the categories in your document. To do this, you first create an endpoint and set its level of data throughput, and then run the real-time analysis.

Note

Using real-time analysis incurs additional cost to your account. This cost is determined by how long the endpoint is operating and the level of throughput you provision.

The level of throughput assigned to an endpoint is measured in inference units, each of which represents data throughput of 100 characters per second. You can provision the endpoint with up to 10 inference units. You can adjust this level of throughput to meet your needs by updating the endpoint.

Once you have completed your real-time analysis, you should delete the endpoint because the charge for it will continue as long as it's active. You can easily create another endpoint whenever you need it.
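The throughput arithmetic described above can be sketched in a few lines. This example uses only the two figures stated in this section (100 characters per second per inference unit, and at most 10 units per endpoint); the function names are made up for illustration.

```python
import math

CHARS_PER_SECOND_PER_UNIT = 100   # stated in this section
MAX_UNITS_PER_ENDPOINT = 10       # stated in this section


def max_chars_per_second(inference_units):
    """Throughput provided by a given number of inference units."""
    return inference_units * CHARS_PER_SECOND_PER_UNIT


def units_needed(chars_per_second):
    """Smallest number of inference units that covers the requested rate."""
    units = max(1, math.ceil(chars_per_second / CHARS_PER_SECOND_PER_UNIT))
    if units > MAX_UNITS_PER_ENDPOINT:
        raise ValueError(
            f"{chars_per_second} characters per second needs {units} units, "
            f"more than the {MAX_UNITS_PER_ENDPOINT} allowed per endpoint")
    return units


print(units_needed(250))         # 3
print(max_chars_per_second(10))  # 1000
```

Because each unit adds to the running cost, it is usually worth provisioning the smallest number of units that covers the expected peak rate.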

To create an endpoint

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom Classiﬁcation.

3. From the Classiﬁers list, choose the name of the custom model for which you want to create the endpoint and follow the link. The Endpoints list on the custom model details page is displayed.

Note

Previously created endpoints are shown on the model details page, along with the model with which they're associated.

4. Under Endpoints, choose Create endpoint.

5. Give the endpoint a name. The name must be unique within the AWS Region and account.

6. Enter the number of inference units to assign to the endpoint. Each unit represents a throughput of 100 characters per second. You can assign up to a maximum of 10 inference units per endpoint.

7. (Optional) To add a tag to the endpoint, enter a key-value pair under Tags and choose Add tag. To remove this pair before creating the endpoint, choose Remove tag.

8. Choose Create endpoint. The Endpoints list is displayed, with the new endpoint showing Creating.

Once it shows Ready, the endpoint can be used for real-time analysis.
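The same create-then-wait flow could be scripted roughly as follows. This is a sketch that assumes boto3's `create_endpoint` and `describe_endpoint` operations, and it assumes that the API status `IN_SERVICE` corresponds to what the console displays as Ready. The helper names, polling interval, and retry limit are arbitrary choices for the example.

```python
import time


def create_endpoint(client, endpoint_name, model_arn, inference_units):
    """Request a new endpoint for a trained custom model and return its ARN."""
    response = client.create_endpoint(
        EndpointName=endpoint_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=inference_units,
    )
    return response["EndpointArn"]


def wait_until_in_service(client, endpoint_arn, poll_seconds=30,
                          max_polls=60, sleep=time.sleep):
    """Poll the endpoint until it can serve requests.

    Raises RuntimeError if creation fails and TimeoutError if the endpoint
    is still not in service after max_polls checks.
    """
    for _ in range(max_polls):
        properties = client.describe_endpoint(EndpointArn=endpoint_arn)
        status = properties["EndpointProperties"]["Status"]
        if status == "IN_SERVICE":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Endpoint {endpoint_arn} failed to deploy")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_arn} not ready after {max_polls} polls")


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   arn = create_endpoint(client, "example-endpoint", "<model ARN>", 1)
#   wait_until_in_service(client, arn)
```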

To run a real-time custom classification request

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Real-time analysis.

3. Under Input type, choose Custom for Analysis type.

4. For Select endpoint, choose the endpoint that you want to use. This endpoint is linked to a speciﬁc custom model.

5. Enter the text you want to analyze.

6. Choose Analyze. The text analysis based on your custom model is displayed, along with a conﬁdence assessment of the analysis.
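Outside the console, a real-time request of this kind might look like the sketch below. It assumes boto3's `classify_document` operation and the response shape in which a single-label model returns its results under `Classes` and a multi-label model under `Labels`, each entry carrying a `Name` and a confidence `Score`. The helper names and the sample response are invented for illustration.

```python
def top_category(response):
    """Return (name, score) for the highest-confidence category, or None."""
    candidates = response.get("Classes") or response.get("Labels") or []
    if not candidates:
        return None
    best = max(candidates, key=lambda item: item["Score"])
    return best["Name"], best["Score"]


def classify(client, endpoint_arn, text):
    """Send one document to the endpoint and return its top category."""
    response = client.classify_document(Text=text, EndpointArn=endpoint_arn)
    return top_category(response)


# A hand-written response, used here only to show the parsing step.
example_response = {
    "Classes": [
        {"Name": "SPORTS", "Score": 0.91},
        {"Name": "FINANCE", "Score": 0.06},
    ]
}
print(top_category(example_response))
```

The score plays the same role as the confidence assessment that the console displays next to each result.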

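To update your endpoint

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom Classification.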

3. From the Classiﬁers list, choose the name of the custom model for which you want to update the endpoint and follow the link. The custom model details page is displayed.

4. Navigate to the Endpoints list, choose the name of the endpoint you want to update and follow the link.

5. Choose Edit.

6. Enter the updated number of inference units to assign to the endpoint. Each unit represents a throughput of 100 characters per second. You can assign up to a maximum of 10 inference units per endpoint.

Note

The cost of using an endpoint is based on how long it operates and on its throughput (based on the number of inference units). Increasing the number of inference units therefore increases the cost of operation. For more information, see Amazon Comprehend Pricing.

7. Choose Edit endpoint. The endpoint details page is displayed.

8. Conﬁrm that the endpoint is updating by choosing the model name from the breadcrumbs at the top of the page. On the custom model details page, navigate to the Endpoints list and verify that it shows Updating next to the endpoint. When the update is complete, it will show Ready.
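An equivalent scripted update could be as small as the following sketch, which assumes boto3's `update_endpoint` and `describe_endpoint` operations. The function name is hypothetical, and the 1 to 10 range check simply restates the per-endpoint limit given in this section.

```python
def update_inference_units(client, endpoint_arn, inference_units):
    """Change an endpoint's throughput and return its status afterwards.

    The status is typically "UPDATING" right after the call and changes
    once the endpoint has been resized.
    """
    if not 1 <= inference_units <= 10:
        raise ValueError("inference_units must be between 1 and 10")
    client.update_endpoint(
        EndpointArn=endpoint_arn,
        DesiredInferenceUnits=inference_units,
    )
    properties = client.describe_endpoint(EndpointArn=endpoint_arn)
    return properties["EndpointProperties"]["Status"]


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   print(update_inference_units(client, "<endpoint ARN>", 3))
```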

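To delete your endpoint

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom Classification.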

3. From the Classiﬁers list, choose the name of the custom model associated with the endpoint you want to delete and follow the link. The custom model details page is displayed.

4. Navigate to the Endpoints list, choose the name of the endpoint to delete and follow the link.

5. Choose Delete.

Note

All endpoints associated with a custom model must be deleted before that model itself can be removed.

6. Choose Delete again to conﬁrm the deletion. The custom model details page is displayed. Conﬁrm that the endpoint you deleted shows deleting next to it. When it's deleted, the endpoint is removed from the Endpoints list.
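Because every endpoint attached to a model has to be deleted before the model itself can be removed, a small cleanup helper can be convenient. The sketch below assumes boto3's `list_endpoints` (filtered by `ModelArn`) and `delete_endpoint` operations; the function name is hypothetical.

```python
def delete_model_endpoints(client, model_arn):
    """Delete every endpoint attached to a model and return their ARNs."""
    endpoint_arns = []
    kwargs = {"Filter": {"ModelArn": model_arn}}

    # Collect all endpoint ARNs first, following pagination tokens.
    while True:
        page = client.list_endpoints(**kwargs)
        for properties in page.get("EndpointPropertiesList", []):
            endpoint_arns.append(properties["EndpointArn"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # Deletion is asynchronous; each endpoint is removed once it finishes.
    for arn in endpoint_arns:
        client.delete_endpoint(EndpointArn=arn)
    return endpoint_arns


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   delete_model_endpoints(client, "<model ARN>")
```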

Model Versioning with Amazon Comprehend

Artificial intelligence and machine learning (AI/ML) is all about rapid experimentation. With Amazon Comprehend, you train and build models that you use to gain insight into your data. With model versioning, you can keep track of your modeling history and the scores your models achieve as you provide more or different sets of data. You can use versioning with your custom classification models or your custom entity recognition models. By comparing versions over time, you can see how well each one performed and which parameters led to your best results.

When you train a new version of an existing custom classiﬁer model or entity recognition model, all you need to do is create a new version from the model details page and all the details populate for you. The new version will have the same name as your earlier model — what we call the versionID — although you will give it a unique version name during creation. As you add new versions to a model, you can see all the previous versions and their details in one view from the model details page. With versioning, you can see how model performance changes as you make changes to your training dataset.
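To make the idea of comparing versions concrete, here is a minimal sketch that picks the best-scoring trained version of a classifier. It assumes boto3's `list_document_classifiers` operation with a `DocumentClassifierName` filter, and that each returned entry exposes `Status`, `VersionName`, and evaluation metrics under `ClassifierMetadata`. The helper names and the sample data are invented, and F1 score is just one possible yardstick.

```python
def list_versions(client, classifier_name):
    """Return the first page of versions that share a classifier name."""
    response = client.list_document_classifiers(
        Filter={"DocumentClassifierName": classifier_name})
    return response["DocumentClassifierPropertiesList"]


def best_trained_version(versions, metric="F1Score"):
    """Return the trained version with the highest value for metric, or None."""
    def score(version):
        metrics = version.get("ClassifierMetadata", {}).get("EvaluationMetrics", {})
        return metrics.get(metric)

    trained = [v for v in versions
               if v.get("Status") == "TRAINED" and score(v) is not None]
    if not trained:
        return None
    return max(trained, key=score)


# Hand-written sample entries, used only to demonstrate the comparison.
sample_versions = [
    {"VersionName": "v1", "Status": "TRAINED",
     "ClassifierMetadata": {"EvaluationMetrics": {"F1Score": 0.81}}},
    {"VersionName": "v2", "Status": "TRAINED",
     "ClassifierMetadata": {"EvaluationMetrics": {"F1Score": 0.88}}},
    {"VersionName": "v3", "Status": "TRAINING"},
]
print(best_trained_version(sample_versions)["VersionName"])  # v2
```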

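Create a new Custom classifier version (console)

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom Classification.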

3. From the Classiﬁers list, choose the name of the custom model from which you want to create a new version. The custom model details page is displayed.

4. On the top right, select Create new model. A screen opens with prepopulated details from the parent custom classiﬁcation model.

5. Under Version name add a unique name to the new version.

6. Under version details, you can change the language and number of labels associated with your new model.

7. Under the Data specifications section, configure how you want to provide the data to your new version. Make sure to provide the full dataset, which includes documents from your previous model and your new documents. You can change the Classifier mode (single-label or multi-label), Data format (CSV file or Augmented manifest), your Training dataset, and your Test dataset (autosplit or your custom test data configuration).

8. (Optional) Update the S3 location for your output data.

9. Under Access permissions, create or use an existing IAM role.

10. (Optional) Update your VPC settings.

11. (Optional) Add tags to your new version to help keep track of the details.

For more information about creating custom classifiers, see Custom Classification (p. 90) and Creating and Using Custom Classifiers (p. 20).

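Create a new Custom entity recognizer version (console)

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Customization and then choose Custom entity recognition.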

3. From the Recognizer model list, choose the name of the recognizer from which you want to create a new version. The details page is displayed.

4. On the top right, select Train new version. A screen opens with prepopulated details from the parent entity recognizer.

5. Under Version name add a unique name to the new version.

6. Under Custom entity type, add the custom label or labels you want the recognizer to identify in your dataset and select Add type. Choose a custom entity type from the annotations or entity list you've provided. The recognizer then uses all of the included entity types to identify entities in the dataset when running your job. Each entity type must be uppercase, with words separated by an underscore if it uses multiple words. A maximum of 25 types is allowed.

7. (Optional) Select Recognizer encryption to encrypt the data in the storage volume while your job is being processed.

8. Under the Training data section, specify the Annotation and data format details (CSV file or Augmented manifest), your Training dataset, and your Test dataset (autosplit or your custom test data configuration).

9. (Optional) Update the S3 location for your output data.

10. Under Access permissions, create or use an existing IAM role.

11. (Optional) Update your VPC settings.

12. (Optional) Add tags to your new version to help keep track of the details.

To learn more about custom entity recognizers, see Custom Entity Recognition (p. 109) and Creating a Custom Entity Recognizer Using the Console (p. 16).
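The naming rules for custom entity types given in the procedure above (uppercase, underscores between words, at most 25 types) are easy to check before submitting a training request. The sketch below is one possible pre-flight check. The function name is hypothetical, and the returned list uses the `{"Type": ...}` shape that boto3's `create_entity_recognizer` accepts under `InputDataConfig["EntityTypes"]`.

```python
MAX_ENTITY_TYPES = 25  # limit stated in the procedure above


def validate_entity_types(types):
    """Check entity type names and return them in EntityTypes request shape."""
    if not types:
        raise ValueError("At least one entity type is required")
    if len(types) > MAX_ENTITY_TYPES:
        raise ValueError(f"At most {MAX_ENTITY_TYPES} entity types are allowed")
    for name in types:
        has_space = any(ch.isspace() for ch in name)
        if not name or has_space or name != name.upper():
            raise ValueError(
                f"Invalid entity type {name!r}: use uppercase and underscores, "
                "for example PRODUCT_NAME")
    return [{"Type": name} for name in types]


print(validate_entity_types(["PRODUCT_NAME", "CITY"]))
```

Running the check locally gives faster feedback than waiting for a training request to be rejected.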

Creating a Topic Modeling Job Using the Console

You can use the Amazon Comprehend console to create and manage asynchronous topic detection jobs.

To create a topic modeling job

1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

2. From the left menu, choose Analysis Jobs and then choose Create.

3. Under Job settings, give the job a name. The name must be unique within the region and account.

4. For Analysis Type, choose Topic Modeling.

5. (Optional) If you choose to encrypt the data in the storage volume while your job is processed, choose Job encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, for KMS key ID choose the key ID.

6. Choose the data source to use. You can use either sample data or you can analyze your own data stored in an Amazon S3 bucket.

If you choose to use your own data, provide the following information:

• S3 data location – An Amazon S3 data bucket that contains the documents to analyze. You can choose the folder icon to browse to the location of your data. The bucket must be in the same region as the API that you are calling.

• Input format – Optionally choose whether input data is contained in one document per ﬁle, or if there is one document per line in a ﬁle.

• Number of topics – The number of topics to return.

7. (Optional) If you choose to encrypt the output result from your job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

• If you are using a key associated with the current account, for KMS key ID choose the key alias or ID.

• If you are using a key associated with a diﬀerent account, for KMS key ID enter the ARN for the key alias or ID.

8. In the Choose an IAM role section, either select an existing IAM role or create a new one.

• Choose an existing IAM role – Choose this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.

• Create a new IAM role – Choose this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. For more information about the permissions given to the IAM role, see Role-Based Permissions Required for Asynchronous Operations (p. 210).

Note

If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS.

Note

When you use a VPC with your topic modeling job, the DataAccessRole that is used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

10. When you have ﬁnished ﬁlling out the form, choose Create job to create and start the topic detection job.
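The form fields in this procedure line up fairly directly with the parameters of an API request. The following sketch assumes boto3's `start_topics_detection_job` and `describe_topics_detection_job` operations. The helper names, ARNs, and S3 paths are placeholders, and the client is supplied by the caller.

```python
def build_topics_job_request(job_name, role_arn, input_s3_uri, output_s3_uri,
                             number_of_topics,
                             input_format="ONE_DOC_PER_FILE"):
    """Assemble keyword arguments for start_topics_detection_job."""
    if input_format not in ("ONE_DOC_PER_FILE", "ONE_DOC_PER_LINE"):
        raise ValueError("input_format must be ONE_DOC_PER_FILE or ONE_DOC_PER_LINE")
    if number_of_topics < 1:
        raise ValueError("number_of_topics must be at least 1")
    return {
        "JobName": job_name,
        "DataAccessRoleArn": role_arn,
        "NumberOfTopics": number_of_topics,
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": input_format},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_topics_job(client, request):
    """Create and start the job, returning its job ID."""
    return client.start_topics_detection_job(**request)["JobId"]


def topics_job_status(client, job_id):
    """Return the job's current status string, such as IN_PROGRESS."""
    response = client.describe_topics_detection_job(JobId=job_id)
    return response["TopicsDetectionJobProperties"]["JobStatus"]


# Usage (requires AWS credentials; placeholder values, not executed here):
#   import boto3
#   client = boto3.client("comprehend", region_name="us-east-1")
#   job_id = start_topics_job(client, build_topics_job_request(
#       "example-topics", "<role ARN>",
#       "s3://example-bucket/docs/", "s3://example-bucket/topics-out/", 10))
#   print(topics_job_status(client, job_id))
```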

The new job appears in the job list with the status field showing the status of the job. The field can be IN_PROGRESS for a job that is processing, COMPLETED for a job that has finished successfully,
