Advanced data manipulation

In document: Amazon Kendra (pages 154-159)

You can manipulate your document metadata fields or attributes and content using Lambda functions.

This is useful if you want to go beyond basic logic and apply advanced data manipulations. For example, you can use Optical Character Recognition (OCR) to interpret text from images and treat each image as a textual document. Or you can retrieve the current time in a certain time zone and insert that date-time wherever a date field has an empty value. You can apply basic logic first and then use a Lambda function to further manipulate your data.

Amazon Kendra can invoke a Lambda function to apply advanced data manipulations during the ingestion process as part of your CustomDocumentEnrichmentConfiguration. You specify a role that includes permission to execute the Lambda function and to access the Amazon S3 bucket that stores the output of your data manipulations. For more information, see IAM access roles. Amazon Kendra can apply your advanced data manipulations to your original, raw documents or to the structured, parsed documents. To configure a Lambda function that takes your original, raw data and applies your data manipulations, use PreExtractionHookConfiguration. To configure a Lambda function that takes your structured documents and applies your data manipulations, use PostExtractionHookConfiguration.
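As a rough illustration of how the two hooks fit into a single CustomDocumentEnrichmentConfiguration, the following Python sketch assembles the dictionary that is passed for this parameter. The helper name build_cde_configuration and every ARN and bucket name in it are hypothetical placeholders. The key names (RoleArn, LambdaArn, S3Bucket) are the ones used elsewhere in this section.

```python
from typing import Optional


def build_cde_configuration(
    role_arn: str,
    s3_bucket: str,
    pre_extraction_lambda_arn: Optional[str] = None,
    post_extraction_lambda_arn: Optional[str] = None,
) -> dict:
    """Assemble a CustomDocumentEnrichmentConfiguration dictionary.

    Hypothetical helper for illustration: each hook takes at most one
    Lambda function, and either hook can be left out.
    """
    configuration = {"RoleArn": role_arn}
    if pre_extraction_lambda_arn:
        # Applied to the original, raw documents.
        configuration["PreExtractionHookConfiguration"] = {
            "LambdaArn": pre_extraction_lambda_arn,
            "S3Bucket": s3_bucket,
        }
    if post_extraction_lambda_arn:
        # Applied to the structured, parsed documents.
        configuration["PostExtractionHookConfiguration"] = {
            "LambdaArn": post_extraction_lambda_arn,
            "S3Bucket": s3_bucket,
        }
    return configuration


# Example with placeholder values: configure only the pre-extraction hook.
example_configuration = build_cde_configuration(
    role_arn="arn:aws:iam::account-id:role/cde-role-name",
    s3_bucket="S3-bucket-name",
    pre_extraction_lambda_arn="arn:aws:lambda:region:account-id:function:function-name",
)
```

The resulting dictionary could then be supplied as the CustomDocumentEnrichmentConfiguration argument when you create or update a data source.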

Amazon Kendra extracts the document metadata and text to structure your documents. Your Lambda functions must follow the mandatory request and response structures. For more information, see the section called “Data contracts for Lambda functions” (p. 147).

To configure a Lambda function in the console, select your index and then select Document enrichments in the navigation menu. Go to Configure Lambda functions to configure a Lambda function.

You can configure only one Lambda function for PreExtractionHookConfiguration and only one Lambda function for PostExtractionHookConfiguration. However, your Lambda function can invoke other functions that it requires. You can configure both PreExtractionHookConfiguration and PostExtractionHookConfiguration, or either one. Your Lambda function for PreExtractionHookConfiguration must not exceed a run time of 5 minutes, and your Lambda function for PostExtractionHookConfiguration must not exceed a run time of 1 minute. When you configure Custom Document Enrichment, ingesting your documents into Amazon Kendra takes longer than it does without it.
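The run-time limits above can be checked before you attach a function to a hook. The following is a minimal Python sketch, not part of any AWS SDK: the constant and function names are hypothetical, and the timeout value is assumed to be the timeout, in seconds, that you set on your Lambda function.

```python
# Run-time limits for each hook as stated in this section, in seconds.
HOOK_RUNTIME_LIMITS_SECONDS = {
    "PreExtractionHookConfiguration": 5 * 60,
    "PostExtractionHookConfiguration": 1 * 60,
}


def check_hook_timeout(hook_name: str, lambda_timeout_seconds: int) -> bool:
    """Return True if the Lambda timeout fits within the hook's run-time limit."""
    limit = HOOK_RUNTIME_LIMITS_SECONDS[hook_name]
    return lambda_timeout_seconds <= limit


# A 90-second function fits the pre-extraction limit but not the post-extraction one.
fits_pre = check_hook_timeout("PreExtractionHookConfiguration", 90)
fits_post = check_hook_timeout("PostExtractionHookConfiguration", 90)
```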

You can configure Amazon Kendra to invoke a Lambda function only if a condition is met. For example, you can specify a condition that if there are empty date-time values, then Amazon Kendra should invoke a function that inserts the current date-time.
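A condition of this kind might be expressed through the InvocationCondition member of a hook configuration. The sketch below is an assumption-laden illustration: the helper name is hypothetical, the field name Last_Updated is borrowed from Example 2 later in this section, and the choice of the NotExists operator to represent an "empty" value is an assumption that you should verify against the API reference for your own field.

```python
def build_conditional_hook(lambda_arn: str, s3_bucket: str, date_field: str) -> dict:
    """Build a hook configuration that is invoked only when date_field has no value.

    Hypothetical helper: the "NotExists" operator is assumed to be the right
    way to express "the date field is empty".
    """
    return {
        "LambdaArn": lambda_arn,
        "S3Bucket": s3_bucket,
        "InvocationCondition": {
            "ConditionDocumentAttributeKey": date_field,
            "Operator": "NotExists",
        },
    }


# Placeholder ARN and bucket name, in the same style as the rest of this section.
conditional_hook = build_conditional_hook(
    lambda_arn="arn:aws:lambda:region:account-id:function:function-name",
    s3_bucket="S3-bucket-name",
    date_field="Last_Updated",
)
```

The returned dictionary would take the place of a plain hook configuration, for example as the value of PostExtractionHookConfiguration.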

The following is an example of using a Lambda function to run OCR to interpret text from images and store this text in a field called 'Document_Image_Text'.

Example 1: Extracting text from images to create textual documents

Data before advanced manipulation applied.

Document_ID    Document_Image
1              image_1.png
2              image_2.png
3              image_3.png

Data after advanced manipulation applied.

Document_ID    Document_Image    Document_Image_Text
1              image_1.png       Mailed survey response
2              image_2.png       Mailed survey response
3              image_3.png       Mailed survey response
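The transformation in Example 1 can be sketched in a few lines of Python. This is only an illustration of the before and after tables, not the Lambda function itself: add_image_text and fake_ocr are hypothetical names, and fake_ocr is a stand-in that returns fixed text so that the sketch runs without an OCR service.

```python
from typing import Callable, Dict, List


def add_image_text(
    rows: List[Dict[str, str]],
    run_ocr: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of rows with a Document_Image_Text field added.

    run_ocr is a hypothetical callable that takes an image name and returns
    the recognized text; substitute a call to the OCR service you use.
    """
    return [
        {**row, "Document_Image_Text": run_ocr(row["Document_Image"])}
        for row in rows
    ]


def fake_ocr(image_name: str) -> str:
    # Stand-in for a real OCR call so that the sketch runs offline.
    return "Mailed survey response"


before = [
    {"Document_ID": "1", "Document_Image": "image_1.png"},
    {"Document_ID": "2", "Document_Image": "image_2.png"},
    {"Document_ID": "3", "Document_Image": "image_3.png"},
]
after = add_image_text(before, fake_ocr)
```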

The following is an example of using a Lambda function to insert the current date-time for empty date values. This uses the condition that if a date field value is 'null', then replace this with the current date-time.

Example 2: Replacing empty values in the Last_Updated field with the current date-time.

Data before advanced manipulation applied.

Document_ID    Body_Text       Last_Updated
1              Lorem Ipsum.    January 1st 2020
2              Lorem Ipsum.
3              Lorem Ipsum.    July 1st 2020

Data after advanced manipulation applied.

Document_ID    Body_Text       Last_Updated
1              Lorem Ipsum.    January 1st 2020
2              Lorem Ipsum.    December 1st 2021
3              Lorem Ipsum.    July 1st 2020
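The logic behind Example 2 can be sketched as follows. This is a minimal Python illustration rather than the Lambda function itself: fill_empty_dates is a hypothetical name, the rows are plain dictionaries, and dates are written in ISO 8601 form instead of the display format used in the tables. The optional now parameter exists only to make the result reproducible.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional
from zoneinfo import ZoneInfo


def fill_empty_dates(
    rows: List[Dict[str, Optional[str]]],
    time_zone: str = "UTC",
    now: Optional[datetime] = None,
) -> List[Dict[str, Optional[str]]]:
    """Return copies of rows where an empty Last_Updated holds the current date-time."""
    # Use the supplied date-time if given; otherwise take the current time
    # in the requested time zone.
    current = now if now is not None else datetime.now(ZoneInfo(time_zone))
    stamp = current.isoformat()
    filled = []
    for row in rows:
        updated = dict(row)
        if not updated.get("Last_Updated"):
            updated["Last_Updated"] = stamp
        filled.append(updated)
    return filled


documents = [
    {"Document_ID": "1", "Body_Text": "Lorem Ipsum.", "Last_Updated": "2020-01-01T00:00:00+00:00"},
    {"Document_ID": "2", "Body_Text": "Lorem Ipsum.", "Last_Updated": None},
    {"Document_ID": "3", "Body_Text": "Lorem Ipsum.", "Last_Updated": "2020-07-01T00:00:00+00:00"},
]
# Pin "now" to December 1st 2021 to mirror the table above.
result = fill_empty_dates(documents, now=datetime(2021, 12, 1, tzinfo=timezone.utc))
```

Only the second document changes, and the input list is left unmodified.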

The following examples show how to configure a Lambda function for advanced data manipulation on the raw, original data.

Console

To configure a Lambda function for advanced data manipulation on the raw, original data

1. In the left navigation pane, under Indexes, select Document enrichments and then select Add document enrichment.

2. On the Configure Lambda functions page, in the Lambda for pre-extraction section, select your Lambda function ARN and your Amazon S3 bucket from the dropdowns. Then select your IAM access role from the dropdown to grant the permissions required to create the document enrichment.

CLI

To configure a Lambda function for advanced data manipulation on the raw, original data

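aws kendra create-data-source \
 --name data-source-name \
 --index-id index-id \
 --role-arn arn:aws:iam::account-id:role/role-name \
 --type S3 \
 --configuration '{"S3Configuration":{"BucketName":"S3-bucket-name"}}' \
 --custom-document-enrichment-configuration '{"PreExtractionHookConfiguration":{"LambdaArn":"arn:aws:lambda:region:account-id:function:function-name","S3Bucket":"S3-bucket-name"},"RoleArn":"arn:aws:iam::account-id:role/cde-role-name"}'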

Python

To configure a Lambda function for advanced data manipulation on the raw, original data

import pprint
import time

import boto3
from botocore.exceptions import ClientError

kendra = boto3.client("kendra")

print("Create a data source with customizations")

# Placeholder values: replace them with your own resources.
name = "data-source-name"
index_id = "index-id"
role_arn = "arn:aws:iam::account-id:role/role-name"
s3_bucket_name = "S3-bucket-name"
data_source_type = "S3"

# Read the documents from the Amazon S3 bucket.
configuration = {"S3Configuration": {"BucketName": s3_bucket_name}}

# Invoke the Lambda function on the raw, original documents (pre-extraction).
custom_document_enrichment_configuration = {
    "PreExtractionHookConfiguration": {
        "LambdaArn": "arn:aws:lambda:region:account-id:function:function-name",
        "S3Bucket": s3_bucket_name
    },
    "RoleArn": "arn:aws:iam::account-id:role/cde-role-name"
}

try:
    data_source_response = kendra.create_data_source(
        Name=name,
        IndexId=index_id,
        RoleArn=role_arn,
        Type=data_source_type,
        Configuration=configuration,
        CustomDocumentEnrichmentConfiguration=custom_document_enrichment_configuration
    )

    pprint.pprint(data_source_response)

    data_source_id = data_source_response["Id"]

    print("Wait for Kendra to create the data source with your customizations.")

    while True:
        # Get the data source description
        data_source_description = kendra.describe_data_source(
            Id=data_source_id,
            IndexId=index_id
        )
        status = data_source_description["Status"]
        print(" Creating data source. Status: " + status)
        time.sleep(60)
        if status != "CREATING":
            break

    print("Synchronize the data source.")

    sync_response = kendra.start_data_source_sync_job(
        Id=data_source_id,
        IndexId=index_id
    )

    pprint.pprint(sync_response)

    print("Wait for the data source to sync with the index.")

    while True:
        jobs = kendra.list_data_source_sync_jobs(
            Id=data_source_id,
            IndexId=index_id
        )
        # For this example, there should be one job
        status = jobs["History"][0]["Status"]
        print(" Syncing data source. Status: " + status)
        time.sleep(60)
        if status != "SYNCING":
            break

except ClientError as e:
    print("%s" % e)

print("Program ends.")

Java

To configure a Lambda function for advanced data manipulation on the raw, original data

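package com.amazonaws.kendra;

import java.util.concurrent.TimeUnit;
import software.amazon.awssdk.services.kendra.KendraClient;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.CustomDocumentEnrichmentConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJob;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJobStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceType;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.HookConfiguration;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsRequest;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsResponse;
import software.amazon.awssdk.services.kendra.model.S3DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobRequest;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobResponse;

public class CreateDataSourceWithCustomizationsExample {

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Create a data source with customizations");

        // Placeholder values: replace them with your own resources.
        String dataSourceName = "data-source-name";
        String indexId = "index-id";
        String dataSourceRoleArn = "arn:aws:iam::account-id:role/role-name";
        String s3BucketName = "S3-bucket-name";

        KendraClient kendra = KendraClient.builder().build();

        CreateDataSourceRequest createDataSourceRequest = CreateDataSourceRequest
            .builder()
            .name(dataSourceName)
            .indexId(indexId)
            .roleArn(dataSourceRoleArn)
            .type(DataSourceType.S3)
            .configuration(DataSourceConfiguration.builder()
                .s3Configuration(S3DataSourceConfiguration.builder().bucketName(s3BucketName).build())
                .build())
            // Invoke the Lambda function on the raw, original documents (pre-extraction).
            .customDocumentEnrichmentConfiguration(CustomDocumentEnrichmentConfiguration.builder()
                .preExtractionHookConfiguration(HookConfiguration.builder()
                    .lambdaArn("arn:aws:lambda:region:account-id:function:function-name")
                    .s3Bucket(s3BucketName)
                    .build())
                .roleArn("arn:aws:iam::account-id:role/cde-role-name")
                .build())
            .build();

        CreateDataSourceResponse createDataSourceResponse = kendra.createDataSource(createDataSourceRequest);
        System.out.println(String.format("Response of creating data source: %s", createDataSourceResponse));

        String dataSourceId = createDataSourceResponse.id();
        System.out.println(String.format("Waiting for Kendra to create the data source %s", dataSourceId));
        DescribeDataSourceRequest describeDataSourceRequest = DescribeDataSourceRequest
            .builder().indexId(indexId).id(dataSourceId).build();

        while (true) {
            DescribeDataSourceResponse describeDataSourceResponse = kendra.describeDataSource(describeDataSourceRequest);
            DataSourceStatus status = describeDataSourceResponse.status();
            System.out.println(String.format("Creating data source. Status: %s", status));
            TimeUnit.SECONDS.sleep(60);
            if (status != DataSourceStatus.CREATING) {
                break;
            }
        }

        System.out.println(String.format("Synchronize the data source %s", dataSourceId));
        StartDataSourceSyncJobRequest startDataSourceSyncJobRequest = StartDataSourceSyncJobRequest
            .builder().indexId(indexId).id(dataSourceId).build();
        StartDataSourceSyncJobResponse startDataSourceSyncJobResponse = kendra.startDataSourceSyncJob(startDataSourceSyncJobRequest);
        System.out.println(String.format("Waiting for the data source to sync with the index %s for execution ID %s",
            indexId, startDataSourceSyncJobResponse.executionId()));

        ListDataSourceSyncJobsRequest listDataSourceSyncJobsRequest = ListDataSourceSyncJobsRequest
            .builder().indexId(indexId).id(dataSourceId).build();

        while (true) {
            ListDataSourceSyncJobsResponse listDataSourceSyncJobsResponse = kendra.listDataSourceSyncJobs(listDataSourceSyncJobsRequest);
            // For this example, there should be one job
            DataSourceSyncJob job = listDataSourceSyncJobsResponse.history().get(0);
            System.out.println(String.format("Syncing data source. Status: %s", job.status()));
            TimeUnit.SECONDS.sleep(60);
            if (job.status() != DataSourceSyncJobStatus.SYNCING) {
                break;
            }
        }

        System.out.println("Program ends.");
    }
}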
