Advanced data manipulation

You can manipulate your document metadata fields or attributes and content using Lambda functions.

This is useful if you want to go beyond basic logic and apply advanced data manipulations. For example, you can use Optical Character Recognition (OCR) to interpret text from images and treat each image as a textual document. Or you can retrieve the current time in a given time zone and insert that date-time wherever a date field has an empty value. You can apply basic logic first and then use a Lambda function to further manipulate your data.

Amazon Kendra can invoke a Lambda function to apply advanced data manipulations during the ingestion process as part of your CustomDocumentEnrichmentConfiguration. You specify a role that includes permission to execute the Lambda function and to access the Amazon S3 bucket that stores the output of your data manipulations; see IAM access roles. Amazon Kendra can apply your advanced data manipulations to your original, raw documents or to the structured, parsed documents. To configure a Lambda function that takes your original, raw data and applies your data manipulations, use PreExtractionHookConfiguration. To configure a Lambda function that takes your structured documents and applies your data manipulations, use PostExtractionHookConfiguration.

Amazon Kendra extracts the document metadata and text to structure your documents. Your Lambda functions must follow the mandatory request and response structures. For more information, see the section called “Data contracts for Lambda functions” (p. 147).
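As a rough illustration of the shape such a function might take, the following Python sketch shows a pre-extraction handler that leaves the document content untouched and returns a single metadata update. The request and response field names used here (version, s3ObjectKey, metadataUpdates) and the attribute name Processed_By are assumptions made for this sketch; the data contracts section is the authority on the exact structures.

```python
def lambda_handler(event, context):
    """Pass the document through unchanged and attach one metadata update.

    Assumption: the request carries "version" and "s3ObjectKey", and the
    response accepts a "metadataUpdates" list. Verify these names against
    the data contracts before relying on them.
    """
    return {
        "version": event.get("version", "v0"),
        # Point back at the same object, so the content is not modified.
        "s3ObjectKey": event["s3ObjectKey"],
        "metadataUpdates": [
            # "Processed_By" is a hypothetical attribute name.
            {"name": "Processed_By", "value": {"stringValue": "pre-extraction-hook"}},
        ],
    }
```

A real function would read the object from the bucket, transform it, write the result back, and return the key of the new object.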

To configure a Lambda function in the console, select your index and then select Document enrichments in the navigation menu. Go to Configure Lambda functions to configure a Lambda function.

You can configure only one Lambda function for PreExtractionHookConfiguration and only one Lambda function for PostExtractionHookConfiguration. However, your Lambda function can invoke other functions that it requires. You can configure both PreExtractionHookConfiguration and PostExtractionHookConfiguration, or either one. Your Lambda function for PreExtractionHookConfiguration must not exceed a run time of 5 minutes, and your Lambda function for PostExtractionHookConfiguration must not exceed a run time of 1 minute. With Custom Document Enrichment configured, ingesting your documents into Amazon Kendra naturally takes longer than it would without it.
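The rules above can be sketched in Python as a small helper that assembles the enrichment configuration with one or both hooks and checks a Lambda timeout setting against the stated run-time limits. The function names build_enrichment_config and timeout_is_allowed are hypothetical helpers written for this sketch and are not part of any AWS SDK; the dictionary keys follow the configuration names used in this section.

```python
# Run-time limits stated above, in seconds.
MAX_RUNTIME_SECONDS = {
    "PreExtractionHookConfiguration": 5 * 60,
    "PostExtractionHookConfiguration": 60,
}


def build_enrichment_config(role_arn, pre_hook=None, post_hook=None):
    """Build a CustomDocumentEnrichmentConfiguration dict with one or both hooks."""
    config = {"RoleArn": role_arn}
    # Each hook accepts a single Lambda function, so each key appears at most once.
    if pre_hook is not None:
        config["PreExtractionHookConfiguration"] = pre_hook
    if post_hook is not None:
        config["PostExtractionHookConfiguration"] = post_hook
    return config


def timeout_is_allowed(hook_name, lambda_timeout_seconds):
    """Return True if a Lambda timeout setting fits the limit for the given hook."""
    return lambda_timeout_seconds <= MAX_RUNTIME_SECONDS[hook_name]
```

Checking the timeout before deployment is only a convenience; the limits themselves are enforced by the service, not by this helper.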

You can conﬁgure Amazon Kendra to invoke a Lambda function only if a condition is met. For example, you can specify a condition that if there are empty date-time values, then Amazon Kendra should invoke a function that inserts the current date-time.
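A condition of this kind might be expressed as follows. This is a minimal sketch that assumes the empty date-time case maps to the NotExists operator of an InvocationCondition on a date field; the helper name hook_with_condition and the field name Last_Updated are chosen for illustration only.

```python
def hook_with_condition(lambda_arn, s3_bucket, attribute_key, operator="NotExists"):
    """Return a hook configuration that is invoked only when the condition holds."""
    return {
        "LambdaArn": lambda_arn,
        "S3Bucket": s3_bucket,
        # The function runs only for documents that satisfy this condition.
        "InvocationCondition": {
            "ConditionDocumentAttributeKey": attribute_key,
            "Operator": operator,
        },
    }


# Placeholder ARN and bucket name; replace them with your own.
hook = hook_with_condition(
    "arn:aws:lambda:region:account-id:function:function-name",
    "S3-bucket-name",
    "Last_Updated",
)
```

The resulting dictionary can be used as the value of PreExtractionHookConfiguration or PostExtractionHookConfiguration.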

The following is an example of using a Lambda function to run OCR to interpret text from images and store this text in a field called 'Document_Image_Text'.

Example 1: Extracting text from images to create textual documents

Data before advanced manipulation applied.

Document_ID    Document_Image
1              image_1.png
2              image_2.png
3              image_3.png

Data after advanced manipulation applied.

Document_ID    Document_Image    Document_Image_Text
1              image_1.png       Mailed survey response
2              image_2.png       Mailed survey response
3              image_3.png       Mailed survey response
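The transformation in Example 1 can be sketched in Python as follows. The OCR step is passed in as a callable so that the sketch stays self-contained; fake_ocr is a stand-in written for this illustration, not a real OCR service, and add_image_text is a hypothetical helper name.

```python
def add_image_text(rows, ocr):
    """Return new rows with a Document_Image_Text field produced by `ocr`."""
    return [
        {**row, "Document_Image_Text": ocr(row["Document_Image"])}
        for row in rows
    ]


def fake_ocr(image_name):
    # Stand-in for a real OCR call; always returns the text used in Example 1.
    return "Mailed survey response"


before = [
    {"Document_ID": 1, "Document_Image": "image_1.png"},
    {"Document_ID": 2, "Document_Image": "image_2.png"},
    {"Document_ID": 3, "Document_Image": "image_3.png"},
]
after = add_image_text(before, fake_ocr)
```

In a Lambda function, the callable would wrap whichever OCR service you use, and the extracted text would be returned as a metadata update rather than as a new table column.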

The following is an example of using a Lambda function to insert the current date-time for empty date values. It uses the condition that if a date field value is 'null', the function replaces it with the current date-time.

Example 2: Replacing empty values in the Last_Updated field with the current date-time

Data before advanced manipulation applied.

Document_ID    Body_Text       Last_Updated
1              Lorem Ipsum.    January 1st 2020
2              Lorem Ipsum.
3              Lorem Ipsum.    July 1st 2020

Data after advanced manipulation applied.

Document_ID    Body_Text       Last_Updated
1              Lorem Ipsum.    January 1st 2020
2              Lorem Ipsum.    December 1st 2021
3              Lorem Ipsum.    July 1st 2020
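The logic behind Example 2 can be sketched in plain Python as follows. The helper name fill_last_updated is hypothetical, and the sketch writes the date-time in ISO 8601 form rather than the display format used in the table; the current time can be injected so that the result is reproducible.

```python
from datetime import datetime, timezone


def fill_last_updated(rows, now=None):
    """Return new rows where an empty Last_Updated is replaced with `now`."""
    now = now or datetime.now(timezone.utc)
    filled = []
    for row in rows:
        # Treat a missing key, None, or an empty string as an empty value.
        if not row.get("Last_Updated"):
            row = {**row, "Last_Updated": now.isoformat()}
        filled.append(row)
    return filled


rows = [
    {"Document_ID": 1, "Body_Text": "Lorem Ipsum.", "Last_Updated": "January 1st 2020"},
    {"Document_ID": 2, "Body_Text": "Lorem Ipsum.", "Last_Updated": None},
    {"Document_ID": 3, "Body_Text": "Lorem Ipsum.", "Last_Updated": "July 1st 2020"},
]
result = fill_last_updated(rows, now=datetime(2021, 12, 1, tzinfo=timezone.utc))
```

Rows that already have a value are returned unchanged, and the input list is not modified.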

The following examples show how to configure a Lambda function for advanced data manipulation on the raw, original data.

Console

To configure a Lambda function for advanced data manipulation on the raw, original data

1. In the left navigation pane, under Indexes, select Document enrichments and then select Add document enrichment.

2. On the Configure Lambda functions page, in the Lambda for pre-extraction section, select your Lambda function ARN and your Amazon S3 bucket from the dropdowns. Add your IAM access role by selecting it from the dropdown; this role gives the permissions required to create the document enrichment.

CLI

To configure a Lambda function for advanced data manipulation on the raw, original data

aws kendra create-data-source \
 --name data-source-name \
 --index-id index-id \

Python

To configure a Lambda function for advanced data manipulation on the raw, original data

import pprint
import time

import boto3
from botocore.exceptions import ClientError

kendra = boto3.client("kendra")

print("Create a data source with customizations")

# Placeholder values; replace them with your own.
name = "data-source-name"
index_id = "index-id"
role_arn = "arn:aws:iam::account-id:role/role-name"
data_source_type = "S3"
s3_bucket_name = "S3-bucket-name"

# Connection information for the S3 data source.
configuration = {"S3Configuration": {"BucketName": s3_bucket_name}}

# Invoke a Lambda function on the raw, original documents (pre-extraction).
custom_document_enrichment_configuration = {
    "PreExtractionHookConfiguration": {
        "LambdaArn": "arn:aws:lambda:region:account-id:function:function-name",
        "S3Bucket": "S3-bucket-name",
    },
    "RoleArn": "arn:aws:iam::account-id:role/cde-role-name",
}

try:
    data_source_response = kendra.create_data_source(
        Name=name,
        IndexId=index_id,
        RoleArn=role_arn,
        Type=data_source_type,
        Configuration=configuration,
        CustomDocumentEnrichmentConfiguration=custom_document_enrichment_configuration,
    )
    pprint.pprint(data_source_response)

    data_source_id = data_source_response["Id"]

    print("Wait for Kendra to create the data source with your customizations.")
    while True:
        # Get the data source description
        data_source_description = kendra.describe_data_source(
            Id=data_source_id, IndexId=index_id
        )
        status = data_source_description["Status"]
        print(" Creating data source. Status: " + status)
        time.sleep(60)
        if status != "CREATING":
            break

    print("Synchronize the data source.")
    sync_response = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )
    pprint.pprint(sync_response)

    print("Wait for the data source to sync with the index.")
    while True:
        jobs = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )
        # For this example, there should be one job
        status = jobs["History"][0]["Status"]
        print(" Syncing data source. Status: " + status)
        time.sleep(60)
        if status != "SYNCING":
            break

except ClientError as e:
    print("%s" % e)

print("Program ends.")

Java

To configure a Lambda function for advanced data manipulation on the raw, original data

package com.amazonaws.kendra;

import java.util.concurrent.TimeUnit;

import software.amazon.awssdk.services.kendra.KendraClient;

import software.amazon.awssdk.services.kendra.model.CreateDataSourceRequest;

import software.amazon.awssdk.services.kendra.model.CreateDataSourceResponse;

import software.amazon.awssdk.services.kendra.model.CreateIndexRequest;

import software.amazon.awssdk.services.kendra.model.CreateIndexResponse;

import software.amazon.awssdk.services.kendra.model.DataSourceConfiguration;

import software.amazon.awssdk.services.kendra.model.DataSourceStatus;

import software.amazon.awssdk.services.kendra.model.DataSourceSyncJob;

import software.amazon.awssdk.services.kendra.model.DataSourceSyncJobStatus;

import software.amazon.awssdk.services.kendra.model.DataSourceType;

import software.amazon.awssdk.services.kendra.model.DescribeDataSourceRequest;

import software.amazon.awssdk.services.kendra.model.DescribeDataSourceResponse;

import software.amazon.awssdk.services.kendra.model.DescribeIndexRequest;

import software.amazon.awssdk.services.kendra.model.DescribeIndexResponse;

import software.amazon.awssdk.services.kendra.model.IndexStatus;

import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsRequest;

import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsResponse;

import software.amazon.awssdk.services.kendra.model.S3DataSourceConfiguration;

import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobRequest;

import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobResponse;

public class CreateDataSourceWithCustomizationsExample {

public static void main(String[] args) throws InterruptedException {

System.out.println("Create a data source with customizations");


String dataSourceName = "data-source-name";

String indexId = "index-id";

String dataSourceRoleArn = "arn:aws:iam::account-id:role/role-name";

String s3BucketName = "S3-bucket-name";

KendraClient kendra = KendraClient.builder().build();

CreateDataSourceRequest createDataSourceRequest = CreateDataSourceRequest .builder()

CreateDataSourceResponse createDataSourceResponse = kendra.createDataSource(createDataSourceRequest);

System.out.println(String.format("Response of creating data source: %s", createDataSourceResponse));

String dataSourceId = createDataSourceResponse.id();

System.out.println(String.format("Waiting for Kendra to create the data source %s", dataSourceId));

DescribeDataSourceRequest describeDataSourceRequest = DescribeDataSourceRequest.builder()
    .indexId(indexId)
    .id(dataSourceId)
    .build();

while (true) {

DescribeDataSourceResponse describeDataSourceResponse = kendra.describeDataSource(describeDataSourceRequest);

DataSourceStatus status = describeDataSourceResponse.status();

System.out.println(String.format("Creating data source. Status: %s", status));

System.out.println(String.format("Synchronize the data source %s", dataSourceId));
