Amazon Comprehend
Developer Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What Is Amazon Comprehend? ... 1
Amazon Comprehend Insights ... 1
Comprehend Custom ... 1
Document Clustering (Topic Modeling) ... 2
Examples ... 2
Benefits ... 2
Are You a First-time User of Amazon Comprehend? ... 3
How It Works ... 4
Supported Languages ... 5
Supported Languages ... 5
Languages Supported by Amazon Comprehend Features ... 5
Getting Started ... 7
Step 1: Set Up an Account ... 7
Sign Up for AWS ... 7
Create an IAM User ... 7
Next Step ... 8
Step 2: Set Up the AWS CLI ... 8
Next Step ... 9
Step 3: Getting Started Using the Console ... 9
Analyzing Documents Using the Console ... 9
Creating and Using Custom Entity Recognizer ... 16
Creating and Using Custom Classifiers ... 20
Model Versioning with Amazon Comprehend ... 25
Creating a Topic Modeling Job Using the Console ... 27
Creating an Events Detection Job Using the Console ... 29
Step 4: Getting Started Using the API ... 30
Detecting the Dominant Language ... 30
Detecting Named Entities ... 33
Detecting Key Phrases ... 35
Detecting PII ... 37
Labeling Documents with PII ... 38
Detecting Sentiment ... 39
Detecting Syntax ... 41
Using Custom Classification ... 44
Detecting Custom Entities ... 48
Detecting Events ... 52
Topic Modeling ... 54
Using the Batch APIs ... 60
Solution: Analyzing Text with OpenSearch ... 65
Tutorial: Analyzing Insights from Reviews ... 66
Prerequisites ... 67
Step 1: Adding Documents to Amazon S3 ... 68
Prerequisites ... 69
Download Sample Data ... 69
Create an Amazon S3 Bucket ... 69
(Console Only) Create Folders ... 70
Upload the Input Data ... 70
Step 2: (CLI Only) Creating an IAM Role ... 71
Prerequisites ... 72
Create an IAM Role ... 72
Attach an IAM Policy to the IAM Role ... 73
Step 3: Running Analysis Jobs ... 74
Prerequisites ... 74
Analyze Sentiment and Entities ... 74
Step 4: Preparing the Output ... 77
Prerequisites ... 77
Download the Output ... 77
Extract the Output Files ... 78
Upload the Extracted Files ... 79
Load the Data into an AWS Glue Data Catalog ... 80
Prepare the Data for Analysis ... 83
Step 5: Visualizing the Output ... 85
Prerequisites ... 85
Give Amazon QuickSight Access ... 85
Import the Datasets ... 86
Create a Sentiment Visualization ... 86
Create an Entities Visualization ... 87
Publish a Dashboard ... 88
Clean Up ... 89
Custom Classification ... 90
Multi-Class and Multi-Label Modes ... 90
Asynchronous Classification ... 91
Training a Custom Classifier ... 91
Creating Training Data ... 92
Multi-Class Mode ... 92
Multi-Label Mode ... 94
Testing the Training Data ... 96
Running an Asynchronous Classification Job ... 96
The Classification Job ... 98
Real-time Analysis with Custom Classification ... 100
Creating an Endpoint for Custom Classification ... 100
Running Real-Time Custom Classification ... 101
Metrics ... 102
Metrics ... 103
Improving Your Custom Classifier's Performance ... 106
Confusion Matrix ... 106
Custom Entity Recognition ... 109
Training custom entity recognizers ... 109
Annotations ... 111
Entity lists ... 120
Detecting Custom Entities with a Batch Job ... 122
Detecting Custom Entities in Real Time ... 125
Creating an Endpoint ... 126
Running Entity Detection ... 127
Metrics ... 127
Improving Performance ... 129
Managing Endpoints ... 131
Endpoints Overview ... 131
Monitoring an Endpoint ... 132
Updating an Endpoint ... 133
Using Trusted Advisor ... 134
Amazon Comprehend Underutilized Endpoints ... 135
Amazon Comprehend Endpoint Access Risk ... 136
Deleting an Endpoint ... 137
Auto Scaling with Endpoints ... 137
Target Tracking ... 138
Scheduled Scaling ... 140
Copying Custom Models Between AWS Accounts ... 143
Sharing a Custom Model ... 143
Before You Begin ... 144
Resource-Based Policies for Custom Models ... 147
Step 1: Add a Resource-Based Policy to a Custom Model ... 147
Step 2: Provide the Details That Others Need to Import ... 149
Importing a Custom Model ... 150
Before You Begin ... 150
Importing a Custom Model ... 152
Text Analysis APIs ... 155
Detect Entities ... 155
Detect Events ... 157
Supported Types for Entities, Events, and Arguments ... 158
Detect Key Phrases ... 162
Detect the Dominant Language ... 163
Detect PII ... 167
PII Entity Types ... 167
Locate PII Entities ... 169
Redact PII Entities ... 170
Label Documents with PII ... 171
Label Documents with PII Entity Types ... 171
Determine Sentiment ... 172
Analyze Syntax ... 173
Topic Modeling ... 175
Document Processing Modes ... 179
Single-Document Processing ... 179
Asynchronous Batch Processing ... 179
Prerequisites ... 180
Starting an Analysis Job ... 180
Monitoring Analysis Jobs ... 181
Getting Analysis Results ... 181
Multiple Document Synchronous Processing ... 184
Using S3 Object Lambda Access Points for PII ... 187
Controlling Access to Documents with PII ... 187
Creating an Amazon S3 Object Lambda Access Point to Control Access to Documents ... 188
Invoking an Amazon S3 Object Lambda Access Point to Control Access to Documents ... 188
Redacting PII from Documents ... 189
Creating an Amazon S3 Object Lambda Access Point to Redact PII from Documents ... 188
Invoking an Amazon S3 Object Lambda Access Point to Redact PII from Documents ... 188
Tagging ... 191
Tagging a new resource ... 191
Viewing, editing, and deleting tags ... 192
Security ... 194
Data Protection ... 194
KMS Encryption in Amazon Comprehend ... 195
Cross-service Confused Deputy Prevention ... 197
Using a Virtual Private Cloud (VPC) ... 199
VPC endpoints (AWS PrivateLink) ... 202
Authentication and Access Control ... 203
Authentication ... 204
Access Control ... 204
Overview of Managing Access ... 205
Using Identity-Based Policies (IAM Policies) for Amazon Comprehend ... 207
Amazon Comprehend API Permissions Reference ... 212
AWS Managed Policies ... 212
Logging Amazon Comprehend API Calls with AWS CloudTrail ... 215
Amazon Comprehend Information in CloudTrail ... 215
Examples: Amazon Comprehend Log File Entries ... 217
Compliance Validation ... 223
Resilience ... 224
Infrastructure Security ... 224
Permissions Required for a Custom Asynchronous Analysis Job ... 225
Guidelines and Quotas ... 226
Supported Regions ... 226
Overall Quotas ... 226
Throttling When Using Single Transactions ... 226
Multiple Document Operations ... 226
Asynchronous Operations ... 227
Document Classification ... 227
Language Detection ... 228
Events ... 229
Topic Modeling ... 229
Entity Recognition ... 229
API Reference ... 232
Actions ... 232
BatchDetectDominantLanguage ... 234
BatchDetectEntities ... 237
BatchDetectKeyPhrases ... 240
BatchDetectSentiment ... 243
BatchDetectSyntax ... 246
ClassifyDocument ... 249
ContainsPiiEntities ... 252
CreateDocumentClassifier ... 254
CreateEndpoint ... 260
CreateEntityRecognizer ... 264
DeleteDocumentClassifier ... 270
DeleteEndpoint ... 272
DeleteEntityRecognizer ... 274
DeleteResourcePolicy ... 276
DescribeDocumentClassificationJob ... 278
DescribeDocumentClassifier ... 281
DescribeDominantLanguageDetectionJob ... 284
DescribeEndpoint ... 287
DescribeEntitiesDetectionJob ... 289
DescribeEntityRecognizer ... 292
DescribeEventsDetectionJob ... 295
DescribeKeyPhrasesDetectionJob ... 297
DescribePiiEntitiesDetectionJob ... 300
DescribeResourcePolicy ... 303
DescribeSentimentDetectionJob ... 306
DescribeTopicsDetectionJob ... 309
DetectDominantLanguage ... 312
DetectEntities ... 315
DetectKeyPhrases ... 319
DetectPiiEntities ... 322
DetectSentiment ... 324
DetectSyntax ... 327
ImportModel ... 330
ListDocumentClassificationJobs ... 334
ListDocumentClassifiers ... 337
ListDocumentClassifierSummaries ... 340
ListDominantLanguageDetectionJobs ... 342
ListEndpoints ... 345
ListEntitiesDetectionJobs ... 348
ListEntityRecognizers ... 351
ListEntityRecognizerSummaries ... 355
ListEventsDetectionJobs ... 357
ListKeyPhrasesDetectionJobs ... 360
ListPiiEntitiesDetectionJobs ... 363
ListSentimentDetectionJobs ... 366
ListTagsForResource ... 369
ListTopicsDetectionJobs ... 371
PutResourcePolicy ... 374
StartDocumentClassificationJob ... 377
StartDominantLanguageDetectionJob ... 382
StartEntitiesDetectionJob ... 387
StartEventsDetectionJob ... 392
StartKeyPhrasesDetectionJob ... 396
StartPiiEntitiesDetectionJob ... 401
StartSentimentDetectionJob ... 405
StartTopicsDetectionJob ... 410
StopDominantLanguageDetectionJob ... 415
StopEntitiesDetectionJob ... 417
StopEventsDetectionJob ... 419
StopKeyPhrasesDetectionJob ... 421
StopPiiEntitiesDetectionJob ... 423
StopSentimentDetectionJob ... 425
StopTrainingDocumentClassifier ... 427
StopTrainingEntityRecognizer ... 429
TagResource ... 431
UntagResource ... 433
UpdateEndpoint ... 435
Data Types ... 437
AugmentedManifestsListItem ... 439
BatchDetectDominantLanguageItemResult ... 441
BatchDetectEntitiesItemResult ... 442
BatchDetectKeyPhrasesItemResult ... 443
BatchDetectSentimentItemResult ... 444
BatchDetectSyntaxItemResult ... 445
BatchItemError ... 446
ClassifierEvaluationMetrics ... 447
ClassifierMetadata ... 449
DocumentClass ... 450
DocumentClassificationJobFilter ... 451
DocumentClassificationJobProperties ... 452
DocumentClassifierFilter ... 455
DocumentClassifierInputDataConfig ... 456
DocumentClassifierOutputDataConfig ... 458
DocumentClassifierProperties ... 459
DocumentClassifierSummary ... 463
DocumentLabel ... 465
DocumentReaderConfig ... 466
DominantLanguage ... 467
DominantLanguageDetectionJobFilter ... 468
DominantLanguageDetectionJobProperties ... 469
EndpointFilter ... 472
EndpointProperties ... 473
EntitiesDetectionJobFilter ... 476
EntitiesDetectionJobProperties ... 477
Entity ... 480
EntityLabel ... 482
EntityRecognizerAnnotations ... 483
EntityRecognizerDocuments ... 484
EntityRecognizerEntityList ... 485
EntityRecognizerEvaluationMetrics ... 486
EntityRecognizerFilter ... 487
EntityRecognizerInputDataConfig ... 488
EntityRecognizerMetadata ... 490
EntityRecognizerMetadataEntityTypesListItem ... 491
EntityRecognizerProperties ... 492
EntityRecognizerSummary ... 495
EntityTypesEvaluationMetrics ... 497
EntityTypesListItem ... 498
EventsDetectionJobFilter ... 499
EventsDetectionJobProperties ... 500
InputDataConfig ... 503
KeyPhrase ... 505
KeyPhrasesDetectionJobFilter ... 506
KeyPhrasesDetectionJobProperties ... 507
OutputDataConfig ... 510
PartOfSpeechTag ... 511
PiiEntitiesDetectionJobFilter ... 512
PiiEntitiesDetectionJobProperties ... 513
PiiEntity ... 516
PiiOutputDataConfig ... 518
RedactionConfig ... 519
SentimentDetectionJobFilter ... 520
SentimentDetectionJobProperties ... 521
SentimentScore ... 524
SyntaxToken ... 525
Tag ... 526
TopicsDetectionJobFilter ... 527
TopicsDetectionJobProperties ... 528
VpcConfig ... 531
Common Errors ... 531
Common Parameters ... 533
Document History ... 536
What Is Amazon Comprehend?
Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format, image files (JPG, PNG, or TIFF), and semi-structured documents (PDF or Word files). It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.
Topics
• Amazon Comprehend Insights (p. 1)
• Comprehend Custom (p. 1)
• Document Clustering (Topic Modeling) (p. 2)
• Examples (p. 2)
• Benefits (p. 2)
• Are You a First-time User of Amazon Comprehend? (p. 3)
Amazon Comprehend Insights
You work with one or more documents at a time to evaluate their content and gain insights about them.
Some of the insights that Amazon Comprehend develops about a document include:
• Entities – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document. For more information, see Detect Entities (p. 155).
• Key phrases – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score. For more information, see Detect Key Phrases (p. 162).
• PII – Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number. For more information, see Detect Personally Identifiable Information (PII) (p. 167).
• Language – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages. For more information, see Detect the Dominant Language (p. 163).
• Sentiment – Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. For more information, see Determine Sentiment (p. 172).
• Syntax – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun. For more information, see Analyze Syntax (p. 173).
Comprehend Custom
Customize Comprehend for your specific requirements without the skillset required to build machine learning-based NLP solutions. Using automatic machine learning, or AutoML, Comprehend Custom builds customized NLP models on your behalf, using data you already have.
• Custom Classification – Create custom document classifiers to organize your documents into your own categories. For each classification label, provide a set of documents that best represent that label and train your classifier on it. Once trained, a classifier can be used on any number of unlabeled document sets. You can use the console for a code-free experience or install the latest AWS SDK. For more information, see Custom Classification (p. 90).
• Custom Entities – Create custom entity types that analyze text for your specific terms and noun-based phrases. You can train custom entities to extract terms like policy numbers, or phrases that imply a customer escalation. To train the model, you provide a list of the entities and a set of documents that contain them. Once the model is trained, you can submit analysis jobs against it to extract your custom entities. For more information, see Custom Entity Recognition (p. 109).
Document Clustering (Topic Modeling)
You can also use Amazon Comprehend to examine a corpus of documents to organize them based on similar keywords within them. Document clustering (topic modeling) is useful to organize a large corpus of documents into topics or clusters that are similar based on the frequency of words within them.
Topic modeling is an asynchronous process: you submit a set of documents for processing and then get the results later, when processing is complete. Amazon Comprehend performs topic modeling on large document sets. For best results, include at least 1,000 documents when you submit a topic modeling job. For more information, see Topic Modeling (p. 175).
Examples
The following examples show how you might use the Amazon Comprehend operations in your applications.
Example 1: Find documents about a subject
Find the documents about a particular subject using Amazon Comprehend topic modeling. Scan a set of documents to determine the topics discussed, and to find the documents associated with each topic. You can specify the number of topics that Amazon Comprehend should return from the document set.
Example 2: Find out how customers feel about your products
If your company publishes a catalog, let Amazon Comprehend tell you what customers think of your products. Send each customer comment to the DetectSentiment operation and it will tell you whether customers feel positive, negative, neutral, or mixed about a product.
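As a rough illustration of Example 2, the sketch below sends each customer comment to DetectSentiment and counts the results. It assumes `client` is a boto3 Comprehend client created elsewhere (for example with `boto3.client("comprehend")`); the function name `tally_sentiment` and the default language code are illustrative choices, not part of the service.

```python
from collections import Counter


def tally_sentiment(client, comments, language_code="en"):
    """Count how many comments fall into each sentiment category.

    `client` is assumed to be a boto3 Comprehend client, for example:
        import boto3
        client = boto3.client("comprehend")
    """
    counts = Counter()
    for comment in comments:
        # DetectSentiment returns one overall label per document:
        # POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
        response = client.detect_sentiment(Text=comment, LanguageCode=language_code)
        counts[response["Sentiment"]] += 1
    return counts
```

Calling the operation once per comment keeps the sketch simple. For larger volumes, the batch and asynchronous modes described in How It Works are likely a better fit.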
Example 3: Discover what matters to your customers
Use Amazon Comprehend topic modeling to discover the topics that your customers are talking about on your forums and message boards, then use entity detection to determine the people, places, and things that they associate with the topic. Finally, use sentiment analysis to determine how your customers feel about a topic.
Benefits
Some of the benefits of using Amazon Comprehend include:
• Integrate powerful natural language processing into your apps – Amazon Comprehend removes the complexity of building text analysis capabilities into your applications by making powerful and accurate natural language processing available with a simple API. You don't need textual analysis expertise to take advantage of the insights that Amazon Comprehend produces.
• Deep learning-based natural language processing – Amazon Comprehend uses deep learning technology to accurately analyze text. Our models are constantly trained with new data across multiple domains to improve accuracy.
• Scalable natural language processing – Amazon Comprehend enables you to analyze millions of documents so that you can discover the insights that they contain.
• Integrate with other AWS services – Amazon Comprehend is designed to work seamlessly with other AWS services like Amazon S3, AWS KMS, and AWS Lambda. Store your documents in Amazon S3, or analyze real-time data with Kinesis Data Firehose. Support for AWS Identity and Access Management (IAM) makes it easy to securely control access to Amazon Comprehend operations. Using IAM, you can create and manage AWS users and groups to grant the appropriate access to your developers and end users.
• Encryption of output results and volume data – Amazon S3 already enables you to encrypt your input documents, and Amazon Comprehend extends this even further. By using your own KMS key, you can encrypt not only the output results of your job, but also the data on the storage volume attached to the compute instance that processes the analysis job. As a result, both the job output and the data on the processing volume are encrypted with a key that you control.
• Low cost – With Amazon Comprehend, you only pay for the documents that you analyze. There are no minimum fees or upfront commitments.
Are You a First-time User of Amazon Comprehend?
If you are a first-time user of Amazon Comprehend, we recommend that you read the following sections in order:
1. How It Works (p. 4) – This section introduces Amazon Comprehend concepts.
2. Getting Started with Amazon Comprehend (p. 7) – In this section, you set up your account and test Amazon Comprehend.
3. API Reference (p. 232) – In this section you'll find reference documentation for Amazon Comprehend operations.
How It Works
Amazon Comprehend uses a pre-trained model to examine and analyze a document or set of documents to gather insights about it. This model is continuously trained on a large body of text so that there is no need for you to provide training data.
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Amazon Comprehend supported languages.
Additionally, Amazon Comprehend's Detect the Dominant Language (p. 163) operation can examine documents and determine the dominant language out of a far wider variety of different languages. For more information, see Languages Supported in Amazon Comprehend (p. 5).
With Amazon Comprehend, you can perform the following on your documents:
• Detect the Dominant Language (p. 163) – Examine text to determine the dominant language.
• Detect Entities (p. 155) – Detect textual references to the names of people, places, and items as well as references to dates and quantities.
• Detect Key Phrases (p. 162) – Find key phrases such as "good morning" in a document or set of documents.
• Detect Personally Identifiable Information (PII) (p. 167) – Analyze documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number.
• Determine Sentiment (p. 172) – Analyze documents and determine the dominant sentiment of the text.
• Analyze Syntax (p. 173) – Parse the words in your text and return the part of speech for each word, to help you understand the content of the document.
• Topic Modeling (p. 175) – Search the content of documents to determine common themes and topics.
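The operations listed above might be combined in code roughly as follows. This is a minimal sketch that assumes `client` is a boto3 Comprehend client created elsewhere; the function name `collect_insights` and the shape of the returned dictionary are illustrative and not part of the service.

```python
def collect_insights(client, text, language_code="en"):
    """Run several synchronous Amazon Comprehend operations on one document.

    `client` is assumed to be a boto3 Comprehend client, for example:
        import boto3
        client = boto3.client("comprehend")
    """
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    syntax = client.detect_syntax(Text=text, LanguageCode=language_code)

    # Keep only a few fields from each response; confidence scores are dropped.
    return {
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
        "sentiment": sentiment["Sentiment"],
        "parts_of_speech": [
            (t["Text"], t["PartOfSpeech"]["Tag"]) for t in syntax["SyntaxTokens"]
        ],
    }
```

Each call is an independent request, so an application that needs only one insight can call just that operation.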
Each operation can be processed in several ways:
• Single-Document Processing (p. 179) — You call Amazon Comprehend with a single document and receive a synchronous response.
• Multiple Document Synchronous Processing (p. 184) — You call Amazon Comprehend with a collection of up to 25 documents and receive a synchronous response.
• Asynchronous Batch Processing (p. 179) — You put a collection of documents into an Amazon S3 bucket and start an asynchronous operation to analyze the documents. The results of the analysis are returned in an S3 bucket.
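Because a multiple-document synchronous call accepts at most 25 documents, a larger list has to be split before it is sent. The sketch below shows one way this might be done with BatchDetectSentiment. It assumes `client` is a boto3 Comprehend client created elsewhere; the helper name `batch_sentiment` is illustrative.

```python
MAX_BATCH = 25  # Documents per multiple-document synchronous call.


def batch_sentiment(client, documents, language_code="en"):
    """Return ({document_index: sentiment}, {document_index: error_code}).

    `client` is assumed to be a boto3 Comprehend client, for example:
        import boto3
        client = boto3.client("comprehend")
    """
    sentiments = {}
    errors = {}
    for start in range(0, len(documents), MAX_BATCH):
        chunk = documents[start:start + MAX_BATCH]
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        # "Index" is relative to the chunk, so shift it back to the
        # document's position in the full list.
        for item in response["ResultList"]:
            sentiments[start + item["Index"]] = item["Sentiment"]
        for item in response["ErrorList"]:
            errors[start + item["Index"]] = item["ErrorCode"]
    return sentiments, errors
```

Documents that fail are reported separately instead of raising an exception, so one bad document does not discard the results for the rest of its batch.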
Each operation can be encrypted both during communication and processing.
By using the integrated AWS KMS encryption, you maintain control over who can access your encrypted data.
You can optionally provide a custom KMS key when you create your analysis job, and your data will be encrypted on the storage volume attached to the ML compute instance processing the job. You can also provide a key to encrypt your output results as they are sent to the S3 bucket. If you have set up encryption on the S3 bucket that holds your input documents, this can provide you with end-to-end security.
For more information, see KMS Encryption in Amazon Comprehend (p. 195).
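To make the two optional keys concrete, the sketch below assembles the parameters for an asynchronous sentiment job. The parameter names follow the StartSentimentDetectionJob operation. The helper name `build_sentiment_job_request` is illustrative, and every bucket URI, role ARN, and key ID is a value that the caller must supply.

```python
def build_sentiment_job_request(job_name, input_s3_uri, output_s3_uri, role_arn,
                                language_code="en", volume_kms_key_id=None,
                                output_kms_key_id=None):
    """Assemble keyword arguments for start_sentiment_detection_job."""
    request = {
        "JobName": job_name,
        "LanguageCode": language_code,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Assumes one document per line; ONE_DOC_PER_FILE is the alternative.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }
    if volume_kms_key_id:
        # Encrypts data on the storage volume attached to the compute instance.
        request["VolumeKmsKeyId"] = volume_kms_key_id
    if output_kms_key_id:
        # Encrypts the results written to the output S3 location.
        request["OutputDataConfig"]["KmsKeyId"] = output_kms_key_id
    return request
```

With a boto3 Comprehend client, the job could then be started with `client.start_sentiment_detection_job(**request)`. Building the request separately makes it easy to check which keys are set before anything is submitted.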
Languages Supported in Amazon Comprehend
Amazon Comprehend supports a wide variety of languages for its various features. The languages supported and the features that support them can be seen in the following tables.
Topics
• Supported Languages (p. 5)
• Languages Supported by Amazon Comprehend Features (p. 5)
Supported Languages
Amazon Comprehend (except the Detect Dominant Language feature) supports the following languages for one or more features.
Code Language
de German
en English
es Spanish
it Italian
pt Portuguese
fr French
ja Japanese
ko Korean
hi Hindi
ar Arabic
zh Chinese (simplified)
zh-TW Chinese (traditional)
Note
Amazon Comprehend identifies the language using identifiers from RFC 5646. If there is a 2-letter ISO 639-1 identifier, with a regional subtag if necessary, it uses that. Otherwise, it uses the ISO 639-2 3-letter code. For more information about RFC 5646, see the IETF Tools web site.
Languages Supported by Amazon Comprehend Features
Feature Supported Languages
Detect the Dominant Language (p. 163) See Detect the Dominant Language (p. 163).
Detect Entities (p. 155) All supported languages.
Detect Key Phrases (p. 162) All supported languages.
Detect Personally Identifiable Information (PII) (p. 167) English
Label Documents with Personally Identifiable Information (PII) (p. 171) English
Determine Sentiment (p. 172) All supported languages.
Analyze Syntax (p. 173) German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).
Topic Modeling (p. 175) Not dependent on the language used. Does not support character-based languages such as Chinese, Japanese, and Korean.
Custom Classification (p. 90) German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).
Custom Entity Recognition (p. 109) German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt).
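An application that accepts text in several languages might check the tables above before it calls a feature. The sketch below encodes them as a lookup. The feature keys (such as `analyze_syntax`) are illustrative names, not API identifiers, and the data is only a snapshot of the tables in this section. Detect the Dominant Language and Topic Modeling are omitted because the tables do not list language codes for them.

```python
# Language codes from the Supported Languages table.
ALL_LANGUAGES = frozenset(
    {"de", "en", "es", "it", "pt", "fr", "ja", "ko", "hi", "ar", "zh", "zh-TW"}
)
# Subset listed for syntax analysis and the custom features.
SIX_LANGUAGES = frozenset({"de", "en", "es", "fr", "it", "pt"})

FEATURE_LANGUAGES = {
    "detect_entities": ALL_LANGUAGES,
    "detect_key_phrases": ALL_LANGUAGES,
    "detect_pii": frozenset({"en"}),
    "label_pii": frozenset({"en"}),
    "determine_sentiment": ALL_LANGUAGES,
    "analyze_syntax": SIX_LANGUAGES,
    "custom_classification": SIX_LANGUAGES,
    "custom_entity_recognition": SIX_LANGUAGES,
}


def supports(feature, language_code):
    """Return True if the table lists `language_code` for `feature`."""
    return language_code in FEATURE_LANGUAGES[feature]
```

For example, `supports("analyze_syntax", "ja")` is false under this snapshot, while `supports("detect_entities", "ja")` is true.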
Getting Started with Amazon Comprehend
To get started using Amazon Comprehend, set up an AWS account and create an AWS Identity and Access Management (IAM) user. To use Amazon Comprehend from the AWS Command Line Interface (AWS CLI), download and configure the AWS CLI.
Topics
• Step 1: Set Up an AWS Account and Create an Administrator User (p. 7)
• Step 2: Set Up the AWS Command Line Interface (AWS CLI) (p. 8)
• Step 3: Getting Started Using the Amazon Comprehend Console (p. 9)
• Step 4: Getting Started Using the Amazon Comprehend API (p. 30)
• Solution: Analyzing Text with Amazon Comprehend and Amazon OpenSearch (p. 65)
Step 1: Set Up an AWS Account and Create an Administrator User
Before you use Amazon Comprehend for the first time, complete the following tasks:
1. Sign Up for AWS (p. 7)
2. Create an IAM User (p. 7)
Sign Up for AWS
When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all AWS services, including Amazon Comprehend. You are charged only for the services that you use.
With Amazon Comprehend, you pay only for the resources that you use. If you are a new AWS customer, you can get started with Amazon Comprehend for free. For more information, see AWS Free Usage Tier.
If you already have an AWS account, skip to the next section.
To create an AWS account
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.
Record your AWS account ID because you'll need it for the next task.
Create an IAM User
Services in AWS, such as Amazon Comprehend, require that you provide credentials when you access them. This allows the service to determine whether you have permissions to access the service's resources.
We strongly recommend that you access AWS using AWS Identity and Access Management (IAM), not the credentials for your AWS account. To use IAM to access AWS, create an IAM user, add the user to an IAM group with administrative permissions, and then grant administrative permissions to the IAM user. You can then access AWS using a special URL and the IAM user's credentials.
The Getting Started exercises in this guide assume that you have a user with administrator privileges, adminuser.
To create an administrator user and sign in to the console
1. Create an administrator user called adminuser in your AWS account. For instructions, see Creating Your First IAM User and Administrators Group in the IAM User Guide.
2. Sign in to the AWS Management Console using a special URL. For more information, see How Users Sign In to Your Account in the IAM User Guide.
For more information about IAM, see the following:
• AWS Identity and Access Management (IAM)
• Getting started
• IAM User Guide
Next Step
Step 2: Set Up the AWS Command Line Interface (AWS CLI) (p. 8)
Step 2: Set Up the AWS Command Line Interface (AWS CLI)
You don't need the AWS CLI to perform the steps in the Getting Started exercises. However, some of the other exercises in this guide do require it. If you prefer, you can skip this step and go to Step 3: Getting Started Using the Amazon Comprehend Console (p. 9), and set up the AWS CLI later.
To set up the AWS CLI
1. Download and configure the AWS CLI. For instructions, see the following topics in the AWS Command Line Interface User Guide:
• Getting Set Up with the AWS Command Line Interface
• Configuring the AWS Command Line Interface
2. In the AWS CLI config file, add a named profile for the administrator user:
[profile adminuser]
aws_access_key_id = adminuser access key ID
aws_secret_access_key = adminuser secret access key
region = aws-region
You use this profile when executing the AWS CLI commands. For more information about named profiles, see Named Profiles in the AWS Command Line Interface User Guide. For a list of AWS Regions, see Regions and Endpoints in the Amazon Web Services General Reference.
3. Verify the setup by typing the following help command at the command prompt:
aws help
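Beyond running the help command, it can be useful to confirm that the named profile contains the settings that the procedure above asks for. The sketch below is one possible check using Python's standard `configparser` module. The function name `missing_profile_keys` is illustrative, and the check reports only which keys are missing, so that no credential values are printed.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key", "region")


def missing_profile_keys(config_text, profile_name="adminuser"):
    """Return the required keys that are absent or empty in a named profile."""
    # Interpolation is disabled so that a "%" in a value is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # In the AWS CLI config file, named profiles use "[profile NAME]" sections.
    section = f"profile {profile_name}"
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser[section].get(key)]
```

The text to check could be read from the config file, which is typically `~/.aws/config`. An empty list means that all three settings are present. A missing `region` entry often means that it was accidentally typed on the same line as another setting.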
Next Step
Step 3: Getting Started Using the Amazon Comprehend Console (p. 9)
Step 3: Getting Started Using the Amazon Comprehend Console
The easiest way to get started using Amazon Comprehend is to use the console to analyze a short text file. If you haven't reviewed the concepts and terminology in How It Works (p. 4), we recommend that you do that before proceeding.
Topics
• Analyzing Documents Using the Console (p. 9)
• Creating and Using Custom Entity Recognizer (p. 16)
• Creating and Using Custom Classifiers (p. 20)
• Model Versioning with Amazon Comprehend (p. 25)
• Creating a Topic Modeling Job Using the Console (p. 27)
• Creating an Events Detection Job Using the Console (p. 29)
Analyzing Documents Using the Console
The Amazon Comprehend console enables you to analyze the contents of documents up to 5,000 characters long. The results are shown in the console so that you can review the analysis.
To start analyzing documents, sign in to the AWS Management Console and open the Amazon Comprehend console.
You can replace the sample text with your own text either in English or one of the other languages supported by Amazon Comprehend and then choose Analyze to get an analysis of your text. Below the text being analyzed, the Results pane shows more information about the text.
Entities
The Entities tab lists each entity that Amazon Comprehend detected in the input text, along with its category and the associated level of confidence. The results are color-coded to indicate different entity types such as organizations, locations, dates, and persons. For more information, see Detect Entities (p. 155).
Key phrases
The Key phrases tab lists key noun phrases that Amazon Comprehend detected in the input text and the associated confidence level. For more information, see Detect Key Phrases (p. 162).
Language
The Language tab shows the dominant language of the text and Amazon Comprehend's level of confidence that it has detected the dominant language correctly. Amazon Comprehend can recognize 100 languages. For more information, see Detect the Dominant Language (p. 163).
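The same information that the Language tab displays can be requested from code. The sketch below assumes `client` is a boto3 Comprehend client created elsewhere; the function name `dominant_language` is illustrative.

```python
def dominant_language(client, text):
    """Return (language_code, confidence) for the most likely language, or None.

    `client` is assumed to be a boto3 Comprehend client, for example:
        import boto3
        client = boto3.client("comprehend")
    """
    languages = client.detect_dominant_language(Text=text)["Languages"]
    if not languages:
        return None
    # The response can list several candidates; keep the highest-scoring one.
    best = max(languages, key=lambda language: language["Score"])
    return best["LanguageCode"], best["Score"]
```

The returned code could then be passed as `LanguageCode` to other operations, provided that the feature supports that language.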
PII
The PII tab lists entities in your input text that contain personally identifiable information (PII). A PII entity is a textual reference to personal data that could be used to identify an individual, such as an address, bank account number, or phone number. For more information, see Detect Personally Identifiable Information (PII) (p. 167).
The PII tab provides two analysis modes:
• Offsets
• Labels
Offsets
The Offsets analysis mode identifies the location of PII in your text documents. For more information, see Locate PII Entities (p. 169).
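Once the offsets are known, an application can act on them, for example by masking the matching text. The sketch below works on the `Entities` list returned by the DetectPiiEntities operation. It assumes that `BeginOffset` and `EndOffset` are character offsets and that `EndOffset` is exclusive, which matches Python slicing. The function name `mask_pii` and the mask character are illustrative.

```python
def mask_pii(text, entities, mask_char="*"):
    """Replace each detected PII span in `text` with mask characters.

    `entities` is assumed to be the "Entities" list from a DetectPiiEntities
    response, where each item carries "BeginOffset" and "EndOffset".
    """
    chars = list(text)
    for entity in entities:
        begin = entity["BeginOffset"]
        # Clamp so that an out-of-range offset cannot lengthen the text.
        end = min(entity["EndOffset"], len(chars))
        if begin < end:
            chars[begin:end] = mask_char * (end - begin)
    return "".join(chars)
```

Masking character by character preserves the length of the text, so offsets reported for other entities stay valid after each replacement.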
Labels
The Labels analysis mode checks for the presence of PII in your text document and returns the labels of identified PII entity types. For more information, see Label Documents with PII Entity Types (p. 171).
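A label check of this kind might look like the following in code. The sketch assumes `client` is a boto3 Comprehend client created elsewhere and uses the ContainsPiiEntities operation. The function name `pii_label_names` and the 0.5 score threshold are illustrative choices, not service defaults.

```python
def pii_label_names(client, text, language_code="en", min_score=0.5):
    """Return the sorted names of PII entity types found in `text`.

    `client` is assumed to be a boto3 Comprehend client, for example:
        import boto3
        client = boto3.client("comprehend")
    """
    response = client.contains_pii_entities(Text=text, LanguageCode=language_code)
    # Each label has a "Name" (the PII entity type) and a confidence "Score".
    return sorted(
        label["Name"] for label in response["Labels"] if label["Score"] >= min_score
    )
```

Because only labels are returned, this mode suits decisions such as whether a document needs further review. It does not say where in the text the PII occurs.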
Sentiment
The Sentiment tab shows the overall emotional sentiment of the text. Sentiment can be rated neutral, positive, negative, or mixed. Each sentiment has a confidence rating, which is Amazon Comprehend's estimate of how likely that sentiment is to be the dominant one. For more information, see Determine Sentiment (p. 172).
Syntax
The Syntax tab shows a breakdown of each element in the text, along with its part of speech and the associated confidence score. For more information, see Analyze Syntax (p. 173).
Creating and Using Custom Entity Recognizer
You can create custom entity recognizers using the Amazon Comprehend console. This section shows you how to create and train a custom entity recognizer and then how to create an entity recognizer job.
Creating a Custom Entity Recognizer Using the Console - CSV Format
To create the custom entity recognizer, first provide a dataset to train your model. The dataset must include one of the following: a set of annotated documents, or a list of entities and their type labels along with a set of documents containing those entities. For more information, see Custom Entity Recognition (p. 109).
To train a custom entity recognizer with a CSV file
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom entity recognition.
3. Choose Create new model.
4. Give the recognizer a name. The name must be unique within the Region and account.
5. Select the language.
6. Under Custom entity type, enter a custom label that you want the recognizer to find in the dataset.
The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.
7. Choose Add type.
8. If you want to add an additional entity type, enter it, and then choose Add type. If you want to remove one of the entity types you've added, choose Remove type and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed.
9. To encrypt your training job, choose Recognizer encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, for KMS key ID choose the key ID.
• If you are using a key associated with a different account, for KMS key ARN enter the ARN for the key ID.
Note
For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).
10. Under Data specifications, choose the format of your training documents:
• CSV file – A CSV file that supplements your training documents. The CSV file contains information about the custom entities that your trained model will detect. The required format of the file depends on whether you are providing annotations or an entity list.
• Augmented manifest – A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. You can provide up to 5 augmented manifest files.
For more information about available formats, and for examples, see Training custom entity recognizers (p. 109).
11. Under Training type, choose the training type to use:
• Using annotations and training docs
• Using entity list and training docs
If choosing annotations, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose Browse S3.
If choosing entity list, enter the URL of the entity list in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the entity list is located and choose Browse S3.
12. Enter the URL of an input dataset containing the training documents in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the training documents are located and choose Select folder.
13. Under Test dataset, select how you want to evaluate the performance of your trained model. You can do this for both the annotations and entity list training types.
• Autosplit: Autosplit automatically selects 10% of your provided training data to use as testing data.
• (Optional) Customer provided: When you select customer provided, you can specify exactly what test data you want to use.
14. If you select Customer provided test dataset, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose Select folder.
15. In the Choose an IAM role section, either select an existing IAM role or create a new one.
• Choose an existing IAM role – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
• Create a new IAM role – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets.
Note
If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).
16. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.
1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.
2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
Note
When you use a VPC with your custom entity recognition job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.
17. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the recognizer, choose Remove tag.
18. Choose Train.
The new recognizer then appears in the list, showing its status. It first shows as Submitted. It then shows Training for a recognizer that is processing training documents, Trained for a recognizer that is ready to use, and In error for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.
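The naming rules for custom entity types in the procedure above (uppercase, words separated by underscores, at most 25 types) can be checked before you open the console. The following Python sketch is one way to do that. The helper names and the exact pattern are assumptions; in particular, the sketch assumes that a valid name contains only uppercase letters, digits, and single underscores, which is stricter than the wording of the procedure.

```python
import re
from typing import List

MAX_ENTITY_TYPES = 25  # limit stated in the procedure above

# Assumption: a valid name is made of uppercase letters or digits,
# with single underscores between words.
_ENTITY_TYPE_PATTERN = re.compile(r"[A-Z0-9]+(_[A-Z0-9]+)*")


def to_entity_type(label: str) -> str:
    """Convert a free-form label such as "job title" to "JOB_TITLE"."""
    words = re.split(r"[\s_\-]+", label.strip())
    return "_".join(word.upper() for word in words if word)


def validate_entity_types(entity_types: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(entity_types) > MAX_ENTITY_TYPES:
        problems.append(
            f"{len(entity_types)} entity types listed; "
            f"the maximum is {MAX_ENTITY_TYPES}"
        )
    for name in entity_types:
        if not _ENTITY_TYPE_PATTERN.fullmatch(name):
            problems.append(f"{name!r} is not uppercase words joined by underscores")
    return problems
```

For example, validate_entity_types(["JOB_TITLE", "ENGINEER"]) returns an empty list, while a list containing "job title" returns one problem.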
Creating a Custom Entity Recognizer Using the Console - Augmented Manifest
To train a custom entity recognizer with plain text, PDF, or Word documents
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom entity recognition.
3. Choose Train recognizer.
4. Give the recognizer a name. The name must be unique within the Region and account.
5. Select the language. Note: If you're training with PDF or Word documents, English is the supported language.
6. Under Custom entity type, enter a custom label that you want the recognizer to find in the dataset.
The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.
7. Choose Add type.
8. If you want to add an additional entity type, enter it, and then choose Add type. If you want to remove one of the entity types you've added, choose Remove type and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed.
9. To encrypt your training job, choose Recognizer encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, for KMS key ID choose the key ID.
• If you are using a key associated with a different account, for KMS key ARN enter the ARN for the key ID.
Note
For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).
10. Under Training data, choose Augmented manifest as your data format:
• Augmented manifest – A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line in the file is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document.
You can provide up to 5 augmented manifest files. If you are using PDF documents for training data, you must select Augmented manifest.
For each file, you can name up to 5 attributes to use as training data.
For more information about available formats, and for examples, see Training custom entity recognizers (p. 109).
11. Select the training model type.
If you selected Plain text documents, under Input location, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest or manifests are located and choose Select folder.
12. Under Attribute name, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job. In this case, each attribute contains the set of annotations from a labeling job. Note: You can provide up to 5 attribute names for each file.
13. Select Add.
14. If you selected PDF, Word documents, under Input location, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest or manifests are located and choose Select folder.
15. Enter the S3 prefix for your Annotation data files. These are the PDF documents that you labeled.
16. Enter the S3 prefix for your Source documents. These are the original PDF documents (data objects) that you provided to Ground Truth for your labeling job.
17. Enter the attribute names that contain your annotations. Note: You can provide up to 5 attribute names for each file. Any attributes in your file that you don't specify are ignored.
18. In the IAM role section, either select an existing IAM role or create a new one.
• Choose an existing IAM role – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
• Create a new IAM role – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets.
Note
If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).
19. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.
1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.
2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
Note
When you use a VPC with your custom entity recognition job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.
20. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the recognizer, choose Remove tag.
21. Choose Train.
The new recognizer then appears in the list, showing its status. It first shows as Submitted. It then shows Training for a recognizer that is processing training documents, Trained for a recognizer that is ready to use, and In error for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.
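Before you enter attribute names in the console, it can help to confirm that every line of the augmented manifest is a complete JSON object and that each line carries the attributes you plan to name. The Python sketch below does only that. It treats the contents of each attribute as opaque and makes no claim about the full Ground Truth schema. The function name and the sample lines, including the source key and the job-1 attribute, are hypothetical.

```python
import json
from typing import Dict, List

MAX_ATTRIBUTE_NAMES = 5  # per-file limit stated in the procedure above


def check_manifest(lines: List[str], attribute_names: List[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to the attribute names missing on that line.

    Raises ValueError if more than five attribute names are given, and
    json.JSONDecodeError if a line is not valid JSON.
    """
    if len(attribute_names) > MAX_ATTRIBUTE_NAMES:
        raise ValueError(
            f"{len(attribute_names)} attribute names given; "
            f"the maximum is {MAX_ATTRIBUTE_NAMES}"
        )
    missing: Dict[int, List[str]] = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, such as a trailing newline
        record = json.loads(line)
        absent = [name for name in attribute_names if name not in record]
        if absent:
            missing[number] = absent
    return missing


# Hypothetical manifest lines; the keys are illustrative only.
example_lines = [
    '{"source": "Jane works in Seattle.", "job-1": {"annotations": {}}}',
    '{"source": "No labels here."}',
]
report = check_manifest(example_lines, ["job-1"])
```

Here the report flags line 2, which has no job-1 attribute. Attributes in a line that you don't name are simply not checked.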
Creating and Using Custom Classifiers
You can create and train custom classifiers using the console, and then run asynchronous classification jobs to analyze your documents. You can also add an endpoint to the same custom model and run custom classification requests to gain real-time (synchronous) insights about your text. This section shows you how to create a classifier using the console, and then how to use it either to run an asynchronous classification job or to create an endpoint and run a real-time classification request.
Topics
• Creating a Custom Classifier (Console) (p. 20)
• Running an Asynchronous Custom Classification Job (p. 22)
• Creating a Real-time Custom Classification Request (p. 24)
Creating a Custom Classifier (Console)
Create a custom document classifier to identify the categories of a set of documents.
To train the classifier, you need a set of training documents. You label these documents with the categories that you want the document classifier to recognize. For more information on these training documents, see Custom Classification (p. 90).
To train a document classifier
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom Classification.
3. Choose Create new model.
4. Give the classifier a name. The name must be unique within your account and current Region.
5. Select the language of the training documents. You can train a document classifier using any of the languages that work with Amazon Comprehend. However, you can only train the classifier in one language. To learn more, see Languages Supported by Amazon Comprehend (p. 5).
6. (Optional) If you want to encrypt the data in the storage volume while your training job is being processed, choose Classifier encryption and then choose whether to use a KMS key associated with your current account, or one from another account.
• If you are using a key associated with the current account, choose the key ID for KMS key ID.
• If you are using a key associated with a different account, enter the ARN for the key ID under KMS key ARN.
Note
For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).
7. Under Data specifications, choose which classifier mode to use.
• Single-label mode: Choose this option if the categories you are assigning to documents are mutually exclusive and you are training your classifier to assign one and only one label to each document.
• Multi-label mode: Choose this option if multiple categories can be applied to a document at the same time and you are training your classifier to assign one, many, all, or no labels to each document.
8. If you chose Multi-label mode, from Delimiter for labels, choose the character that separates labels when there is more than one label per line.
9. Under Data format, choose the format of your training documents:
• CSV file — A two-column CSV file, where labels are provided in the first column, and documents are provided in the second.
• Augmented manifest – A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its associated labels.
For more information about these formats, and for examples, see Training a Custom Classifier (p. 91).
10. Under Training dataset, enter the location of the Amazon S3 bucket that contains your training documents or navigate to it by choosing Select folder. The IAM role you're using for access permissions for the training job must have reading permissions for the S3 bucket.
11. Under Test dataset, select how you want to evaluate the performance of your trained model.
• Autosplit: Autosplit automatically selects 10% of your provided training data to use as testing data.
• (Optional) Customer provided: When you select customer provided, you can specify exactly what test data you want to use. If you select Customer provided test dataset, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose Select folder.
12. (Optional) If you want Amazon Comprehend to create a confusion matrix that provides metrics on how well the classifier performed during training, enter the location of an Amazon S3 bucket where it will be saved. For more information, see Confusion Matrix (p. 106).
(Optional) If you choose to encrypt the output result from your training job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, choose the key alias for KMS key ID.
• If you are using a key associated with a different account, enter the ARN for the key alias or ID under KMS key ID.
13. Choose Choose an existing IAM role, and then choose an existing IAM role that has read permissions for the S3 bucket that contains your training documents. Only roles that have a trust policy that begins with comprehend.amazonaws.com are valid.
If you don't already have an IAM role with these permissions, choose Create an IAM role to make one. Choose the access permissions to grant this role, and then choose a name suffix to distinguish the role from IAM roles in your account.
Note
If the input documents are encrypted, the IAM role used must also have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).
14. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.
1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.
2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
Note
When you use a VPC with your classification job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.
15. (Optional) To add a tag to the custom classifier, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the classifier, choose Remove tag. For more information, see Tagging your resources (p. 191).
16. Choose Create.
The new classifier will then appear in the list, showing its status. It will first show as Submitted. It will then show Training for a classifier that is processing training documents, Trained for a classifier that is ready to use, and In error for a classifier that has an error. You can click on a job to get more information about the classifier, including any error messages.
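The two-column CSV format described in the procedure above (labels in the first column, the document in the second) can be produced with Python's standard csv module. The sketch below is a minimal example under stated assumptions: the function name and sample rows are hypothetical, and the pipe character is only one possible choice for the multi-label delimiter, which must match the Delimiter for labels setting.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def write_training_csv(
    rows: Iterable[Tuple[List[str], str]], out: TextIO, delimiter: str = "|"
) -> None:
    """Write (labels, document) pairs as a two-column CSV.

    In single-label mode, pass exactly one label per row. In multi-label
    mode, the labels are joined with `delimiter` in the first column.
    """
    writer = csv.writer(out, lineterminator="\n")
    for labels, document in rows:
        for label in labels:
            if delimiter in label:
                raise ValueError(f"label {label!r} contains the delimiter")
        # The csv module quotes any field that contains a comma or a quote.
        writer.writerow([delimiter.join(labels), document])


# Hypothetical training rows for a multi-label classifier.
example_rows = [
    (["SPORTS"], "The team won, 3-1."),
    (["SPORTS", "BUSINESS"], "The club was sold."),
]
buffer = io.StringIO()
write_training_csv(example_rows, buffer)
csv_text = buffer.getvalue()
```

Writing through the csv module rather than joining strings by hand keeps documents that contain commas or quotation marks in a single column.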
Running an Asynchronous Custom Classification Job
Once you have created a custom document classifier, you can use it to categorize a group of documents.
To create a custom asynchronous classification job
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom classification.
3. Choose Create job.
4. Give the classification job a name. The name must be unique within your account and current Region.
5. Under Analysis type, choose Custom classification.
6. From Select classifier, choose the custom classifier to use.
7. (Optional) If you choose to encrypt the data in the storage volume while your classification job is processed, choose Job encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, choose the key ID for KMS key ID.
• If you are using a key associated with a different account, enter the ARN for the key ID under KMS key ARN.
Note
For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).
8. Under Input data, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing Select folder. This bucket must be in the same region as the API that you are calling. The IAM role you're using for access permissions for the classification job must have reading permissions for the S3 bucket.
9. (Optional) Choose the format of the documents to be classified under Input format. These can be one document per file, or one document per line in a single file.
10. Under Output data, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data or navigate to it by choosing Select folder. This bucket must be in the same region as the API that you are calling. The IAM role you're using for access permissions for the classification job must have write permissions for the S3 bucket.
11. (Optional) If you choose to encrypt the output result from your job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, choose the key alias or ID for KMS key ID.
• If you are using a key associated with a different account, enter the ARN for the key alias or ID under KMS key ID.
12. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.
1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.
2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
Note
When you use a VPC with your classification job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.
13. Choose Create job to create the document classification job.
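The console settings in this procedure correspond closely to the parameters of the StartDocumentClassificationJob API operation. As a hedged sketch, the Python function below assembles such a request as a plain dictionary without calling AWS. The function name, the bucket, the account ID, and the ARNs are placeholders invented for this example.

```python
from typing import Dict, List, Optional


def build_classification_job_request(
    job_name: str,
    classifier_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    one_doc_per_line: bool = False,
    volume_kms_key_id: Optional[str] = None,
    output_kms_key_id: Optional[str] = None,
    subnets: Optional[List[str]] = None,
    security_group_ids: Optional[List[str]] = None,
) -> Dict:
    """Assemble request parameters for an asynchronous classification job."""
    request: Dict = {
        "JobName": job_name,
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE" if one_doc_per_line else "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }
    # Optional settings are added only when a value was supplied.
    if volume_kms_key_id:
        request["VolumeKmsKeyId"] = volume_kms_key_id
    if output_kms_key_id:
        request["OutputDataConfig"]["KmsKeyId"] = output_kms_key_id
    if subnets and security_group_ids:
        request["VpcConfig"] = {
            "Subnets": subnets,
            "SecurityGroupIds": security_group_ids,
        }
    return request


# Placeholder values; replace them with resources from your own account.
example_request = build_classification_job_request(
    job_name="example-classification-job",
    classifier_arn="arn:aws:comprehend:us-west-2:111122223333:document-classifier/example",
    input_s3_uri="s3://amzn-s3-demo-bucket/input/",
    output_s3_uri="s3://amzn-s3-demo-bucket/output/",
    role_arn="arn:aws:iam::111122223333:role/example-comprehend-role",
    one_doc_per_line=True,
)
# With boto3 installed and credentials configured, the request could be sent with:
#   boto3.client("comprehend").start_document_classification_job(**example_request)
```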
Creating a Real-time Custom Classification Request
In addition to using the custom document classifier to run asynchronous jobs, you can use it to run synchronous custom classification requests to gain real-time insight into the categories in your document. To do this, first create an endpoint and set its level of data throughput, and then run the real-time analysis.
Note
Using real-time analysis will result in additional cost to your account. This cost is determined by how long the endpoint is operating and the level of throughput you provision.
The level of throughput assigned to an endpoint is measured in inference units, each of which represents data throughput of 100 characters per second. You can provision the endpoint with up to 10 inference units. You can adjust this level of throughput to meet your needs by updating the endpoint.
Once you have completed your real-time analysis, you should delete the endpoint because the charge for it will continue as long as it's active. You can easily create another endpoint whenever you need it.
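The throughput figures above reduce to simple arithmetic: divide the peak number of characters per second that you expect by 100 and round up. The Python sketch below applies that rule, using the limit of 10 inference units stated in this section. The function name is hypothetical.

```python
import math

CHARS_PER_SECOND_PER_UNIT = 100  # throughput of one inference unit
MAX_UNITS_PER_ENDPOINT = 10      # provisioning limit stated above


def inference_units_needed(chars_per_second: float) -> int:
    """Return the smallest number of inference units that covers the load."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    units = math.ceil(chars_per_second / CHARS_PER_SECOND_PER_UNIT)
    if units > MAX_UNITS_PER_ENDPOINT:
        raise ValueError(
            f"{units} units needed, but an endpoint supports at most "
            f"{MAX_UNITS_PER_ENDPOINT}"
        )
    return units
```

For example, an expected peak of 250 characters per second needs 3 inference units, because 2 units cover only 200 characters per second.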
To create an endpoint
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom Classification.
3. From the Classifiers list, choose the name of the custom model for which you want to create the endpoint and follow the link. The Endpoints list on the custom model details page is displayed.
Note
Previously created endpoints are shown on the model details page, along with the model with which they're associated.
4. Under Endpoints, choose Create endpoint.
5. Give the endpoint a name. The name must be unique within the AWS Region and account.
6. Enter the number of inference units to assign to the endpoint. Each unit represents a throughput of 100 characters per second. You can assign up to a maximum of 10 inference units per endpoint.
7. (Optional) To add a tag to the endpoint, enter a key-value pair under Tags and choose Add tag. To remove this pair before creating the endpoint, choose Remove tag.
8. Choose Create endpoint. The Endpoints list is displayed, with the new endpoint showing Creating.
Once it shows Ready, the endpoint can be used for real-time analysis.
To run a real-time custom classification request
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Real-time analysis.
3. Under Input type, choose Custom for Analysis type.
4. For Select endpoint, choose the endpoint that you want to use. This endpoint is linked to a specific custom model.
5. Enter the text you want to analyze.
6. Choose Analyze. The text analysis based on your custom model is displayed, along with a confidence assessment of the analysis.
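If you call the endpoint from code instead of the console, the result arrives as a dictionary of categories and confidence scores. The Python sketch below summarizes such a result. It assumes a response shaped like the output of the ClassifyDocument operation, with a Classes list for single-label models or a Labels list for multi-label models. The function name, the 0.5 cutoff for multi-label results, and the sample responses are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def summarize_classification(
    response: Dict, threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return the (name, score) pairs worth reporting from a response.

    Single-label models: the one class with the highest score.
    Multi-label models: every label whose score meets `threshold`.
    """
    classes = response.get("Classes")
    if classes:
        best = max(classes, key=lambda item: item["Score"])
        return [(best["Name"], best["Score"])]
    labels = response.get("Labels", [])
    return [
        (item["Name"], item["Score"])
        for item in labels
        if item["Score"] >= threshold
    ]


# Hypothetical responses; the names and scores are invented.
single_label_response = {
    "Classes": [
        {"Name": "SPORTS", "Score": 0.91},
        {"Name": "BUSINESS", "Score": 0.06},
    ]
}
multi_label_response = {
    "Labels": [
        {"Name": "SPORTS", "Score": 0.8},
        {"Name": "BUSINESS", "Score": 0.7},
        {"Name": "TRAVEL", "Score": 0.1},
    ]
}
single_summary = summarize_classification(single_label_response)
multi_summary = summarize_classification(multi_label_response)
```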
To update your endpoint
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom classification.
3. From the Classifiers list, choose the name of the custom model for which you want to update the endpoint and follow the link. The custom model details page is displayed.
4. Navigate to the Endpoints list, choose the name of the endpoint you want to update and follow the link.
5. Choose Edit.
6. Enter the updated number of inference units to assign to the endpoint. Each unit represents a throughput of 100 characters per second. You can assign up to a maximum of 10 inference units per endpoint.
Note
The cost of using an endpoint is based on how long it operates and on its throughput (which is based on the number of inference units). Increasing the number of inference units therefore increases the cost of operation. For more information, see Amazon Comprehend Pricing.
7. Choose Edit endpoint. The endpoint details page is displayed.
8. Confirm that the endpoint is updating by choosing the model name from the breadcrumbs at the top of the page. On the custom model details page, navigate to the Endpoints list and verify that it shows Updating next to the endpoint. When the update is complete, it will show Ready.
To delete your endpoint
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom classification.
3. From the Classifiers list, choose the name of the custom model associated with the endpoint you want to delete and follow the link. The custom model details page is displayed.
4. Navigate to the Endpoints list, choose the name of the endpoint to delete and follow the link.
5. Choose Delete.
Note
All endpoints associated with a custom model must be deleted before that model itself can be removed.
6. Choose Delete again to confirm the deletion. The custom model details page is displayed. Confirm that the endpoint you deleted shows deleting next to it. When it's deleted, the endpoint is removed from the Endpoints list.
Model Versioning with Amazon Comprehend
Artificial intelligence and machine learning (AI/ML) depend on rapid experimentation. With Amazon Comprehend, you train and build models that you use to gain insight into your data. With model versioning, you can keep track of your modeling history and the scores associated with each model's results as you provide more or different sets of data. You can use versioning with your custom classification models or your custom entity recognition models. By comparing your versions over time, you can see how well each one performed and which parameters you used to reach your best results.
When you train a new version of an existing custom classifier model or entity recognition model, all you need to do is create a new version from the model details page and all the details populate for you. The new version will have the same name as your earlier model — what we call the versionID — although you will give it a unique version name during creation. As you add new versions to a model, you can see all the previous versions and their details in one view from the model details page. With versioning, you can see how model performance changes as you make changes to your training dataset.
Create a new Custom classifier version (console)
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom classification.
3. From the Classifiers list, choose the name of the custom model from which you want to create a new version. The custom model details page is displayed.
4. On the top right, select Create new model. A screen opens with prepopulated details from the parent custom classification model.
5. Under Version name add a unique name to the new version.
6. Under version details, you can change the language and number of labels associated with your new model.
7. Under the Data specifications section, configure how you want to provide the data to your new version. Make sure to provide the full data, which includes the documents from your previous model and your new documents. You can change the Classifier mode (single-label, or multi-label), Data format (CSV file, Augmented manifest), your Training dataset, and your Test dataset (autosplit, or your custom test data configuration).
8. (Optional) Update the S3 location for your output data.
9. Under Access permissions, create or use an existing IAM role.
10. (Optional) Update your VPC settings.
11. (Optional) Add tags to your new version to help keep track of the details.
For more information about creating custom classifiers, see Custom Classification (p. 90) and Creating and Using Custom Classifiers (p. 20).
Create a new Custom entity recognizer version (console)
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Customization and then choose Custom entity recognition.
3. From the Recognizer model list, choose the name of the recognizer from which you want to create a new version. The details page is displayed.
4. On the top right, select Train new version. A screen opens with prepopulated details from the parent entity recognizer.
5. Under Version name add a unique name to the new version.
6. Under Custom entity type, add the custom label or labels you want the recognizer to identify in your dataset and select Add type. Choose a custom entity type from the annotations or entity list you've provided. The recognizer will then use all of the included entity types to identify entities in the dataset when running your job. Each entity type must be uppercase, with words separated by an underscore if it uses multiple words. A maximum of 25 types is allowed.
7. (Optional) Select Recognizer encryption to encrypt the data in the storage volume while your job is being processed.
8. Under the Training data section, specify the Annotation and data format details (CSV file, Augmented manifest), your Training dataset, and your Test dataset (autosplit, or your custom test data configuration).
9. (Optional) Update the S3 location for your output data.
10. Under Access permissions, create or use an existing IAM role.
11. (Optional) Update your VPC settings.
12. (Optional) Add tags to your new version to help keep track of the details.
To learn more about custom entity recognizers, see Custom Entity Recognition (p. 109) and Creating a Custom Entity Recognizer Using the Console (p. 16).
Creating a Topic Modeling Job Using the Console
You can use the Amazon Comprehend console to create and manage asynchronous topic detection jobs.
To create a topic modeling job
1. Sign in to the AWS Management Console and open the Amazon Comprehend console.
2. From the left menu, choose Analysis Jobs and then choose Create.
3. Under Job settings, give the job a name. The name must be unique within the region and account.
4. For Analysis Type, choose Topic Modeling.
5. (Optional) If you choose to encrypt the data in the storage volume while your job is processed, choose Job encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, for KMS key ID choose the key ID.
• If you are using a key associated with a different account, for KMS key ARN enter the ARN for the key ID.
Note
For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).
6. Choose the data source to use. You can use either sample data or you can analyze your own data stored in an Amazon S3 bucket.
If you choose to use your own data, provide the following information:
• S3 data location – An Amazon S3 data bucket that contains the documents to analyze. You can choose the folder icon to browse to the location of your data. The bucket must be in the same region as the API that you are calling.
• Input format – Optionally choose whether input data is contained in one document per file, or if there is one document per line in a file.
• Number of topics – The number of topics to return.
7. (Optional) If you choose to encrypt the output result from your job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.
• If you are using a key associated with the current account, for KMS key ID choose the key alias or ID.
• If you are using a key associated with a different account, for KMS key ID enter the ARN for the key alias or ID.
8. In the Choose an IAM role section, either select an existing IAM role or create a new one.
• Choose an existing IAM role – Choose this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
• Create a new IAM role – Choose this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. For more information about the permissions given to the IAM role, see Role-Based Permissions Required for Asynchronous Operations (p. 210).
Note
If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption (p. 207).
9. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.
1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.
2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
Note
When you use a VPC with your topic modeling job, the DataAccessRole that is used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.
10. When you have finished filling out the form, choose Create job to create and start the topic detection job.
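The console form above maps closely onto the parameters of the StartTopicsDetectionJob API operation. The following is a minimal sketch of the same job created with the AWS SDK for Python (boto3), assuming boto3 is installed and credentials are configured. The helper names build_topics_job_request and start_topics_job are hypothetical, and every bucket name, ARN, subnet ID, and security group ID shown is a placeholder.

```python
from typing import Any, Dict, List, Optional


def build_topics_job_request(
    job_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    number_of_topics: int,
    one_doc_per_line: bool = False,
    volume_kms_key_id: Optional[str] = None,
    output_kms_key_id: Optional[str] = None,
    subnets: Optional[List[str]] = None,
    security_group_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for start_topics_detection_job."""
    request: Dict[str, Any] = {
        "JobName": job_name,                      # step 3
        "NumberOfTopics": number_of_topics,       # step 6, Number of topics
        "DataAccessRoleArn": role_arn,            # step 8
        "InputDataConfig": {                      # step 6, S3 location and input format
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE" if one_doc_per_line else "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }
    if volume_kms_key_id:                         # step 5, Job encryption
        request["VolumeKmsKeyId"] = volume_kms_key_id
    if output_kms_key_id:                         # step 7, output encryption
        request["OutputDataConfig"]["KmsKeyId"] = output_kms_key_id
    if subnets or security_group_ids:             # step 9, VPC settings
        if not (subnets and security_group_ids):
            raise ValueError("VPC settings need both subnets and security groups")
        request["VpcConfig"] = {
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        }
    return request


def start_topics_job(request: Dict[str, Any], region_name: str) -> str:
    """Submit the job and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request can be built without boto3

    client = boto3.client("comprehend", region_name=region_name)
    return client.start_topics_detection_job(**request)["JobId"]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = build_topics_job_request(
        job_name="example-topics-job",
        input_s3_uri="s3://example-bucket/input/",
        output_s3_uri="s3://example-bucket/output/",
        role_arn="arn:aws:iam::111122223333:role/ExampleComprehendRole",
        number_of_topics=10,
    )
    print(example)
```

Keeping the request construction separate from the API call makes it possible to inspect or log the parameters before anything is submitted. The job status can afterward be read with the describe_topics_detection_job operation.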
The new job appears in the job list with the status field showing the status of the job. The field can be IN_PROGRESS for a job that is processing, COMPLETED for a job that has finished successfully,