Part Two: Export Data from DynamoDB

To verify the DynamoDB table

1. Open the DynamoDB console.

2. On the Tables screen, click your DynamoDB table and click Explore Table.

3. On the Browse Items tab, columns that correspond to the data input file should display, such as Id, Price, and ProductCategory. This indicates that the import operation from the file to the DynamoDB table occurred successfully.

Step 6: Delete Your Pipeline (Optional)

To stop incurring charges, delete your pipeline. Deleting your pipeline deletes the pipeline definition and all associated objects.

To delete your pipeline

1. On the List Pipelines page, select your pipeline.

2. Click Actions, and then choose Delete.

3. When prompted for confirmation, choose Delete.

Part Two: Export Data from DynamoDB

This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of DynamoDB using AWS Data Pipeline.

Tasks

• Before You Begin

• Step 1: Create the Pipeline

• Step 2: Save and Validate Your Pipeline

• Step 3: Activate Your Pipeline

• Step 4: Monitor the Pipeline Runs

• Step 5: Verify the Data Export File

• Step 6: Delete Your Pipeline (Optional)


Before You Begin

You must complete part one of this tutorial to ensure that your DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into DynamoDB.

Additionally, be sure you've completed the following steps:

• Complete the tasks in Setting up for AWS Data Pipeline.

• Create a topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.

• Ensure that you have the DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into DynamoDB.

Be aware of the following:

Underlying service costs

AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that it uses. The import and export pipelines create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. For details, see Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.

Imports may overwrite data

When you import data from Amazon S3, the import may overwrite items in your DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

Exports may overwrite data

When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.
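As a rough illustration of why appending the scheduled time avoids overwrites, the following Python sketch builds one output prefix per run. The function name export_prefix and the timestamp layout are assumptions made for this example; the template's actual path format may differ.

```python
from datetime import datetime, timezone


def export_prefix(base_path: str, scheduled_start: datetime) -> str:
    """Return base_path with the run's scheduled start time appended."""
    # Assumed timestamp layout for illustration only.
    stamp = scheduled_start.strftime("%Y-%m-%d-%H-%M-%S")
    return f"{base_path.rstrip('/')}/{stamp}"


base = "s3://mybucket/output/MyTable"
first_run = export_prefix(base, datetime(2024, 1, 1, tzinfo=timezone.utc))
second_run = export_prefix(base, datetime(2024, 1, 2, tzinfo=timezone.utc))

# Each scheduled run gets its own prefix, so a later export
# does not land on top of an earlier one.
print(first_run)
print(second_run)
```

If you remove the time component and write every run to the same path, later exports can replace earlier ones.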

Jobs consume throughput capacity

Import and export jobs consume some of your DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster consumes some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.
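To make the ratio concrete, here is a small worked calculation in Python. It assumes the ratio is applied as a fraction of the provisioned capacity in effect when the job starts; the function name and the sample numbers are illustrative only.

```python
def target_capacity(provisioned_units: float, throughput_ratio: float) -> float:
    """Capacity units the job aims to consume, fixed at job start."""
    if provisioned_units < 0 or throughput_ratio <= 0:
        raise ValueError("capacity must be non-negative and ratio positive")
    return provisioned_units * throughput_ratio


# Hypothetical table: 100 provisioned read units, read ratio of 0.25.
export_target = target_capacity(100, 0.25)
print(export_target)

# If the table is raised to 200 units mid-run, the running job keeps
# the target computed at start; only a new job would pick up the change.
next_job_target = target_capacity(200, 0.25)
print(export_target, next_job_target)
```

In this sketch the running export stays at 25 read units even after the table is scaled up, while a job started afterward would target 50.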

On-Demand Capacity works only with EMR 5.24.0 or later

DynamoDB tables configured for On-Demand Capacity are supported only when using Amazon EMR release version 5.24.0 or later. When you use a template to create a pipeline for DynamoDB, choose Edit in Architect and then choose Resources to configure the Amazon EMR cluster that AWS Data Pipeline provisions. For Release label, choose emr-5.24.0 or later.


Step 1: Create the Pipeline

First, create the pipeline.

To create the pipeline

1. Open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

2. The first screen that you see depends on whether you've created a pipeline in the current region.

a. If you haven't created a pipeline in this region, the console displays an introductory screen. Choose Get started now.

b. If you've already created a pipeline in this region, the console displays a page that lists your pipelines for the region. Choose Create new pipeline.

3. In Name, enter a name for your pipeline.

4. (Optional) In Description, enter a description for your pipeline.

5. For Source, select Build using a template, and then select the following template: Export DynamoDB table to S3.

6. Under Parameters, set DynamoDB table name to the name of your table. Click the folder icon next to Output S3 folder, select one of your Amazon S3 buckets, and then click Select.

7. Under Schedule, choose on pipeline activation.

8. Under Pipeline Configuration, leave logging enabled. Choose the folder icon under S3 location for logs, select one of your buckets or folders, and then choose Select.

If you prefer, you can disable logging instead.

9. Under Security/Access, leave IAM roles set to Default.

10. Click Edit in Architect.

Next, configure the Amazon SNS notification actions that AWS Data Pipeline performs depending on the outcome of the activity.


To configure the success, failure, and late notification actions

1. In the right pane, click Activities.

2. From Add an optional field, select On Success.

3. From the newly added On Success, select Create new: Action.

4. From Add an optional field, select On Fail.

5. From the newly added On Fail, select Create new: Action.

6. From Add an optional field, select On Late Action.

7. From the newly added On Late Action, select Create new: Action.

8. In the right pane, click Others.

9. For DefaultAction1, do the following:

a. Change the name to SuccessSnsAlarm.

b. From Type, select SnsAlarm.

c. In Topic Arn, enter the ARN of the topic that you created. See ARN resource names for Amazon SNS.

d. Enter a subject and a message.

10. For DefaultAction2, do the following:

a. Change the name to FailureSnsAlarm.

b. From Type, select SnsAlarm.

c. In Topic Arn, enter the ARN of the topic that you created. See ARN resource names for Amazon SNS.

d. Enter a subject and a message.

11. For DefaultAction3, do the following:

a. Change the name to LateSnsAlarm.

b. From Type, select SnsAlarm.

c. In Topic Arn, enter the ARN of the topic that you created. See ARN resource names for Amazon SNS.

d. Enter a subject and a message.
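For readers who prefer to see the result as a pipeline definition rather than console fields, the following Python sketch prints roughly what the three alarms and their references might look like as JSON. The topic ARN, role name, subjects, and messages are placeholder values, and the field names reflect the pipeline definition syntax as understood here; compare against a definition exported from your own pipeline before relying on it.

```python
import json

# Placeholder values: substitute your own topic ARN and IAM role.
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:my-example-topic"
ROLE = "DataPipelineDefaultRole"


def sns_alarm(name: str, subject: str, message: str) -> dict:
    """Build one SnsAlarm object for a pipeline definition."""
    return {
        "id": name,
        "name": name,
        "type": "SnsAlarm",
        "topicArn": TOPIC_ARN,
        "role": ROLE,
        "subject": subject,
        "message": message,
    }


alarms = [
    sns_alarm("SuccessSnsAlarm", "Export succeeded", "The export finished."),
    sns_alarm("FailureSnsAlarm", "Export failed", "The export did not finish."),
    sns_alarm("LateSnsAlarm", "Export is late", "The export has not started on time."),
]

# Fields added to the activity so that it points at each alarm by id.
activity_fields = {
    "id": "TableBackupActivity",
    "onSuccess": {"ref": "SuccessSnsAlarm"},
    "onFail": {"ref": "FailureSnsAlarm"},
    "onLateAction": {"ref": "LateSnsAlarm"},
}

print(json.dumps({"objects": alarms + [activity_fields]}, indent=2))
```

Each alarm is a separate object, and the activity refers to it by id, which mirrors renaming DefaultAction1 through DefaultAction3 in the console.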

Step 2: Save and Validate Your Pipeline

Important

If your pipeline uses an Amazon EMR release version in the 6.x series, you must add a bootstrap action to copy the following JAR file to the Hadoop classpath, where MyRegion is the AWS Region where your pipeline runs: s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar. For more information, see Amazon EMR 6.1.0 Release and Hadoop 3.x Jar Dependencies.

In addition, you must change the first argument in the step field of the EmrActivity named TableBackupActivity from s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar to s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar.
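The substitution described above can be sketched in Python as follows. This assumes the step field is a single comma-separated string whose first element is the JAR path; the remaining arguments shown here (arg2, arg3) are placeholders, and the helper names are invented for this example.

```python
def tools_jar(region: str) -> str:
    """Path of the 4.14.0 tools JAR for the given AWS Region."""
    return (
        f"s3://dynamodb-dpl-{region}/emr-ddb-storage-handler/4.14.0/"
        "emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar"
    )


def replace_first_step_argument(step: str, new_first: str) -> str:
    """Swap the first comma-separated argument and keep the rest unchanged."""
    _old_first, separator, rest = step.partition(",")
    return new_first + separator + rest


# Original step value, with placeholder trailing arguments.
original_step = (
    "s3://dynamodb-dpl-us-east-1/emr-ddb-storage-handler/4.11.0/"
    "emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,arg2,arg3"
)

updated_step = replace_first_step_argument(original_step, tools_jar("us-east-1"))
print(updated_step)
```

Only the first argument changes; everything after the first comma is carried over as-is.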

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline generates validation errors and warnings. Warning messages are informational only, but you must fix any error messages before you can activate your pipeline.


To save and validate your pipeline

1. Choose Save pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either success or error or warning messages. If you get an error message, choose Close and then, in the right pane, choose Errors/Warnings.

3. The Errors/Warnings pane lists the objects that failed validation. Choose the plus (+) sign next to the object names and look for an error message in red.

4. When you see an error message, go to the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, go to the DataNodes pane to fix the error.

5. After you fix the errors listed in the Errors/Warnings pane, choose Save Pipeline.

6. Repeat the process until your pipeline validates successfully.
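The same error-versus-warning distinction can be expressed in code. The sketch below assumes a validation result shaped as a dictionary with validationErrors and validationWarnings lists, each entry carrying an object id and a list of messages; that shape, the function names, and the sample messages are assumptions for illustration, not output copied from the service.

```python
from typing import Dict, List


def summarize_validation(result: Dict) -> List[str]:
    """Flatten errors and warnings into one line per message."""
    lines = []
    for entry in result.get("validationErrors", []):
        for message in entry.get("errors", []):
            lines.append(f"ERROR {entry.get('id')}: {message}")
    for entry in result.get("validationWarnings", []):
        for message in entry.get("warnings", []):
            lines.append(f"WARNING {entry.get('id')}: {message}")
    return lines


def ready_to_activate(result: Dict) -> bool:
    """Warnings are informational; only errors block activation."""
    return not result.get("validationErrors")


# Hypothetical validation result for demonstration.
sample = {
    "validationErrors": [
        {"id": "SuccessSnsAlarm", "errors": ["Missing required field 'topicArn'"]}
    ],
    "validationWarnings": [
        {"id": "Default", "warnings": ["Logging location is not set"]}
    ],
}

for line in summarize_validation(sample):
    print(line)
print("Ready to activate:", ready_to_activate(sample))
```

Once the errors list is empty, the pipeline counts as valid in this sketch even if warnings remain.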

Step 3: Activate Your Pipeline

Activate your pipeline to start creating and processing runs. The pipeline starts based on the schedule and period in your pipeline definition.

Important

If activation succeeds, your pipeline is running and might incur usage charges. For more information, see AWS Data Pipeline pricing. To stop incurring usage charges for AWS Data Pipeline, delete your pipeline.

To activate your pipeline

1. Choose Activate.

2. In the confirmation dialog box, choose Close.

Step 4: Monitor the Pipeline Runs

After you activate your pipeline, you are taken to the Execution details page where you can monitor the progress of your pipeline.

To monitor the progress of your pipeline runs

1. Choose Update or press F5 to update the status displayed.

Tip

If there are no runs listed, ensure that Start (in UTC) and End (in UTC) cover the scheduled start and end of your pipeline, and then choose Update.

2. When the status of every object in your pipeline is FINISHED, your pipeline has successfully completed the scheduled tasks. If you created an SNS notification, you should receive an email about the successful completion of this task.

3. If your pipeline doesn't complete successfully, check your pipeline settings for issues. For more information about troubleshooting failed or incomplete instance runs of your pipeline, see Resolving Common Problems.
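The completion rule in the steps above (every object must be FINISHED) can be written as a small check. In this Python sketch, the set of failure statuses and the object names other than TableBackupActivity are assumptions chosen for illustration.

```python
from typing import Dict

# Assumed set of statuses treated as unsuccessful terminal states.
FAILURE_STATUSES = {"FAILED", "CANCELLED", "TIMEDOUT", "CASCADE_FAILED"}


def pipeline_outcome(statuses: Dict[str, str]) -> str:
    """Classify a run from the status of each pipeline object."""
    values = set(statuses.values())
    if not values:
        return "no runs listed"
    if values == {"FINISHED"}:
        return "complete"
    if values & FAILURE_STATUSES:
        return "needs attention"
    return "in progress"


# Hypothetical object names, apart from TableBackupActivity.
run = {
    "TableBackupActivity": "FINISHED",
    "ExampleSourceTable": "FINISHED",
    "ExampleEmrCluster": "FINISHED",
}
print(pipeline_outcome(run))
print(pipeline_outcome({**run, "TableBackupActivity": "RUNNING"}))
print(pipeline_outcome({**run, "TableBackupActivity": "FAILED"}))
```

A single object that is not FINISHED is enough to keep the run from counting as complete.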

Step 5: Verify the Data Export File

Next, verify that the data export occurred successfully by viewing the output file contents.


To view the export file contents

1. Open the Amazon S3 console.

2. On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://mybucket/output/MyTable) and open the output file. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f-4790-9b11-fbfea01a871e_000000.

3. Using your preferred text editor, view the contents of the output file and ensure that there is a data file that corresponds to the DynamoDB source table. The presence of this text file indicates that the export operation from DynamoDB to the output file occurred successfully.

Step 6: Delete Your Pipeline (Optional)

To stop incurring charges, delete your pipeline. Deleting your pipeline deletes the pipeline definition and all associated objects.

To delete your pipeline

1. On the List Pipelines page, select your pipeline.

2. Click Actions, and then choose Delete.

3. When prompted for confirmation, choose Delete.

