To verify the DynamoDB table
1. Open the DynamoDB console.
2. On the Tables screen, click your DynamoDB table and click Explore Table.
3. On the Browse Items tab, columns that correspond to the data input file should display, such as Id, Price, ProductCategory, as shown in the following screen. This indicates that the import operation from the file to the DynamoDB table occurred successfully.
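The same check can be scripted. The following is a minimal sketch, not part of the original tutorial, assuming a boto3 DynamoDB client and the attribute names mentioned above (Id, Price, ProductCategory). The helper names missing_attributes and verify_table and the table name MyTable are hypothetical.

```python
def missing_attributes(items, expected):
    """Return the expected attribute names that appear in none of the items.

    `items` is the "Items" list of a DynamoDB Scan response, where each item
    maps attribute names to typed values such as {"N": "1"}.
    """
    seen = set()
    for item in items:
        seen.update(item.keys())
    return sorted(set(expected) - seen)


def verify_table(client, table_name, expected, sample_size=25):
    """Scan a small sample of the table and report missing attributes.

    `client` is assumed to be boto3.client("dynamodb"); it is passed in so
    that this sketch does not create AWS connections on its own.
    """
    response = client.scan(TableName=table_name, Limit=sample_size)
    return missing_attributes(response.get("Items", []), expected)


# Hypothetical usage:
#   verify_table(boto3.client("dynamodb"), "MyTable",
#                ["Id", "Price", "ProductCategory"])
# An empty list suggests that the imported columns are present.
```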
Step 6: Delete Your Pipeline (Optional)
To stop incurring charges, delete your pipeline. Deleting your pipeline deletes the pipeline definition and all associated objects.
To delete your pipeline
1. On the List Pipelines page, select your pipeline.
2. Click Actions, and then choose Delete.
3. When prompted for confirmation, choose Delete.
Part Two: Export Data from DynamoDB
This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of DynamoDB using AWS Data Pipeline.
Tasks
• Before You Begin (p. 97)
• Step 1: Create the Pipeline (p. 98)
• Step 2: Save and Validate Your Pipeline (p. 99)
• Step 3: Activate Your Pipeline (p. 100)
• Step 4: Monitor the Pipeline Runs (p. 100)
• Step 5: Verify the Data Export File (p. 100)
• Step 6: Delete Your Pipeline (Optional) (p. 101)
Before You Begin
You must complete part one of this tutorial to ensure that your DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into DynamoDB (p. 91).
Additionally, be sure you've completed the following steps:
• Complete the tasks in Setting up for AWS Data Pipeline (p. 14).
• Create a topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.
• Ensure that you have the DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into DynamoDB (p. 91).
Be aware of the following:
Underlying service costs
AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services used. The import and export pipelines create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. For details, see Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
Imports may overwrite data
When you import data from Amazon S3, the import may overwrite items in your DynamoDB table. Make sure that you are importing the right data and into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.
Exports may overwrite data
When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.
Jobs consume throughput capacity
Import and export jobs consume some of your DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster consumes read capacity during exports and write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio. Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and do not adapt in real time if you change your table's provisioned capacity in the middle of the process.
On-Demand Capacity works only with EMR 5.24.0 or later
DynamoDB tables configured for On-Demand Capacity are supported only when using Amazon EMR release version 5.24.0 or later. When you use a template to create a pipeline for DynamoDB, choose Edit in Architect and then choose Resources to configure the Amazon EMR cluster that AWS Data Pipeline provisions. For Release label, choose emr-5.24.0 or later.
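To make the throughput ratio described under "Jobs consume throughput capacity" concrete, the following sketch computes the capacity that a job would target. It assumes the ratio is expressed as a fraction between 0 and 1 (for example, 0.25) and is applied to the capacity provisioned when the job starts. The function name and the numbers are hypothetical.

```python
def targeted_capacity_units(provisioned_units, throughput_ratio):
    """Capacity units a job aims to consume, fixed when the job starts.

    Assumption: the ratio is a fraction in the range (0, 1].
    """
    if not 0 < throughput_ratio <= 1:
        raise ValueError("throughput_ratio must be greater than 0 and at most 1")
    return provisioned_units * throughput_ratio


# Hypothetical table with 100 provisioned read capacity units.
capacity_at_start = 100
print(targeted_capacity_units(capacity_at_start, 0.25))  # 25.0
```

Because the value is computed once at the start, raising the table's provisioned capacity while the job runs would not change the 25 units targeted in this example.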
Step 1: Create the Pipeline
First, create the pipeline.
To create the pipeline
1. Open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.
2. The first screen that you see depends on whether you've created a pipeline in the current region.
a. If you haven't created a pipeline in this region, the console displays an introductory screen. Choose Get started now.
b. If you've already created a pipeline in this region, the console displays a page that lists your pipelines for the region. Choose Create new pipeline.
3. In Name, enter a name for your pipeline.
4. (Optional) In Description, enter a description for your pipeline.
5. For Source, select Build using a template, and then select the following template: Export DynamoDB table to S3.
6. Under Parameters, set DynamoDB table name to the name of your table. Click the folder icon next to Output S3 folder, select one of your Amazon S3 buckets, and then click Select.
7. Under Schedule, choose on pipeline activation.
8. Under Pipeline Configuration, leave logging enabled. Choose the folder icon under S3 location for logs, select one of your buckets or folders, and then choose Select.
If you prefer, you can disable logging instead.
9. Under Security/Access, leave IAM roles set to Default.
10. Click Edit in Architect.
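For readers who script their setup, the console steps above have a rough counterpart in the AWS SDK. The sketch below assumes a boto3 Data Pipeline client and only creates an empty pipeline and prepares parameter values; it does not reproduce the objects that the console template generates. The function names and the parameter IDs in the usage comment are hypothetical placeholders, so use the IDs defined in your own pipeline definition.

```python
def to_parameter_values(values):
    """Convert {"parameterId": value} pairs into the list-of-dicts shape
    used for the parameterValues argument of put_pipeline_definition."""
    return [{"id": key, "stringValue": str(value)} for key, value in values.items()]


def create_export_pipeline(client, name, unique_id, description=""):
    """Create an empty pipeline and return its ID.

    `client` is assumed to be boto3.client("datapipeline"). `unique_id` is a
    caller-chosen token that guards against creating duplicates on retry.
    """
    response = client.create_pipeline(
        name=name, uniqueId=unique_id, description=description
    )
    return response["pipelineId"]


# Hypothetical usage (parameter IDs are placeholders):
#   pipeline_id = create_export_pipeline(client, "ddb-export", "ddb-export-001")
#   values = to_parameter_values({
#       "myDDBTableName": "MyTable",
#       "myOutputS3Loc": "s3://mybucket/output",
#   })
```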
Next, configure the Amazon SNS notification actions that AWS Data Pipeline performs depending on the outcome of the activity.
To configure the success, failure, and late notification actions
1. In the right pane, click Activities.
2. From Add an optional field, select On Success.
3. From the newly added On Success, select Create new: Action.
4. From Add an optional field, select On Fail.
5. From the newly added On Fail, select Create new: Action.
6. From Add an optional field, select On Late Action.
7. From the newly added On Late Action, select Create new: Action.
8. In the right pane, click Others.
9. For DefaultAction1, do the following:
a. Change the name to SuccessSnsAlarm.
b. From Type, select SnsAlarm.
c. In Topic Arn, enter the ARN of the topic that you created. See ARN resource names for Amazon SNS.
d. Enter a subject and a message.
10. For DefaultAction2, do the following:
a. Change the name to FailureSnsAlarm.
b. From Type, select SnsAlarm.
c. In Topic Arn, enter the ARN of the topic that you created (see ARN resource names for Amazon SNS).
d. Enter a subject and a message.
11. For DefaultAction3, do the following:
a. Change the name to LateSnsAlarm.
b. From Type, select SnsAlarm.
c. In Topic Arn, enter the ARN of the topic that you created (see ARN resource names for Amazon SNS).
d. Enter a subject and a message.
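The three alarms configured above can also be expressed as pipeline objects in the shape that the SDK's put_pipeline_definition call accepts. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the topic ARN in the usage comment is a placeholder, and the role field is assumed to default to DataPipelineDefaultRole.

```python
def sns_alarm(object_id, topic_arn, subject, message, role="DataPipelineDefaultRole"):
    """Build one SnsAlarm pipeline object as a plain dictionary."""
    return {
        "id": object_id,
        "name": object_id,
        "fields": [
            {"key": "type", "stringValue": "SnsAlarm"},
            {"key": "topicArn", "stringValue": topic_arn},
            {"key": "subject", "stringValue": subject},
            {"key": "message", "stringValue": message},
            {"key": "role", "stringValue": role},  # assumed default role name
        ],
    }


def action_references():
    """Fields to add to the activity so that it points at the three alarms."""
    return [
        {"key": "onSuccess", "refValue": "SuccessSnsAlarm"},
        {"key": "onFail", "refValue": "FailureSnsAlarm"},
        {"key": "onLateAction", "refValue": "LateSnsAlarm"},
    ]


# Hypothetical usage with a placeholder topic ARN:
#   arn = "arn:aws:sns:us-east-1:111122223333:my-topic"
#   alarms = [
#       sns_alarm("SuccessSnsAlarm", arn, "Export succeeded", "The export finished."),
#       sns_alarm("FailureSnsAlarm", arn, "Export failed", "The export failed."),
#       sns_alarm("LateSnsAlarm", arn, "Export late", "The export is running late."),
#   ]
```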
Step 2: Save and Validate Your Pipeline
Important
If your pipeline uses an Amazon EMR release version in the 6.x series, you must add a bootstrap action to copy the following Jar file to the Hadoop classpath where MyRegion is the AWS Region where your pipeline runs: s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar. For more information, see Amazon EMR 6.1.0 Release and Hadoop 3.x Jar Dependencies (p. 230).
In addition, you must change the first argument in the step field in EmrActivity with name TableBackupActivity from s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar to s3://dynamodb-dpl-MyRegion/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar.
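The path substitution described in this note can be sketched as a small string operation. The example assumes the step field is a single comma-separated string whose first element is the JAR path. The function names, and the remaining arguments in the usage comment, are hypothetical.

```python
JAR_TEMPLATE = (
    "s3://dynamodb-dpl-{region}/emr-ddb-storage-handler/4.14.0/"
    "emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar"
)


def jar_path(region):
    """Return the 4.14.0 JAR location for the given AWS Region."""
    return JAR_TEMPLATE.format(region=region)


def replace_first_step_argument(step, region):
    """Swap the first comma-separated argument of `step` for the 4.14.0 JAR.

    Everything after the first comma is kept unchanged.
    """
    _old_jar, separator, rest = step.partition(",")
    return jar_path(region) + separator + rest


# Hypothetical usage (the class name and argument are placeholders):
#   old = ("s3://dynamodb-dpl-us-east-1/emr-ddb-storage-handler/4.11.0/"
#          "emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,"
#          "SomeMainClass,someArgument")
#   new = replace_first_step_argument(old, "us-east-1")
```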
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline generates validation errors and warnings. Warning messages are informational only, but you must fix any error messages before you can activate your pipeline.
To save and validate your pipeline
1. Choose Save pipeline.
2. AWS Data Pipeline validates your pipeline definition and returns either a success message or error and warning messages. If you get an error message, choose Close and then, in the right pane, choose Errors/Warnings.
3. The Errors/Warnings pane lists the objects that failed validation. Choose the plus (+) sign next to the object names and look for an error message in red.
4. When you see an error message, go to the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, go to the DataNodes pane to fix the error.
5. After you fix the errors listed in the Errors/Warnings pane, choose Save pipeline.
6. Repeat the process until your pipeline validates successfully.
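When validation is driven from a script rather than the console, the errors and warnings arrive as a structured response. The sketch below assumes a response dictionary shaped like the one returned by the boto3 validate_pipeline_definition call (keys validationErrors, validationWarnings, and errored). The helper names are hypothetical.

```python
def summarize_validation(response):
    """Flatten a validation response into readable lines, errors first."""
    lines = []
    for entry in response.get("validationErrors", []):
        for text in entry.get("errors", []):
            lines.append(f"ERROR {entry['id']}: {text}")
    for entry in response.get("validationWarnings", []):
        for text in entry.get("warnings", []):
            lines.append(f"WARNING {entry['id']}: {text}")
    return lines


def can_activate(response):
    """Warnings are informational; only errors block activation."""
    return not response.get("errored", False)


# Hypothetical usage:
#   response = client.validate_pipeline_definition(
#       pipelineId=pipeline_id, pipelineObjects=objects)
#   for line in summarize_validation(response):
#       print(line)
```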
Step 3: Activate Your Pipeline
Activate your pipeline to start creating and processing runs. The pipeline starts based on the schedule and period in your pipeline definition.
Important
If activation succeeds, your pipeline is running and might incur usage charges. For more information, see AWS Data Pipeline pricing. To stop incurring usage charges for AWS Data Pipeline, delete your pipeline.
To activate your pipeline
1. Choose Activate.
2. In the confirmation dialog box, choose Close.
Step 4: Monitor the Pipeline Runs
After you activate your pipeline, you are taken to the Execution details page where you can monitor the progress of your pipeline.
To monitor the progress of your pipeline runs
1. Choose Update or press F5 to update the status displayed.
Tip
If there are no runs listed, ensure that Start (in UTC) and End (in UTC) cover the scheduled start and end of your pipeline, and then choose Update.
2. When the status of every object in your pipeline is FINISHED, your pipeline has successfully completed the scheduled tasks. If you created an SNS notification, you should receive an email about the successful completion of this task.
3. If your pipeline doesn't complete successfully, check your pipeline settings for issues. For more information about troubleshooting failed or incomplete instance runs of your pipeline, see Resolving Common Problems (p. 306).
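The FINISHED check in step 2 can be approximated in code. This sketch assumes a boto3 Data Pipeline client and that each instance object reports its state in a field with the key @status. The helper names are hypothetical, and pagination is ignored for brevity.

```python
def object_status(pipeline_object):
    """Return the "@status" value of one described pipeline object, or None."""
    for field in pipeline_object.get("fields", []):
        if field.get("key") == "@status":
            return field.get("stringValue")
    return None


def all_finished(pipeline_objects):
    """True only if there is at least one object and every one is FINISHED."""
    statuses = [object_status(obj) for obj in pipeline_objects]
    return bool(statuses) and all(status == "FINISHED" for status in statuses)


def fetch_instances(client, pipeline_id):
    """Describe the pipeline's instance objects.

    `client` is assumed to be boto3.client("datapipeline"). Only the first
    page of IDs is inspected, and at most 25 IDs are passed per call
    (an assumed batch limit).
    """
    ids = client.query_objects(pipelineId=pipeline_id, sphere="INSTANCE").get("ids", [])
    if not ids:
        return []
    described = client.describe_objects(pipelineId=pipeline_id, objectIds=ids[:25])
    return described["pipelineObjects"]
```

An empty result is treated as "not finished", which mirrors the tip above: if no runs are listed, the pipeline has not yet produced anything to check.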
Step 5: Verify the Data Export File
Next, verify that the data export occurred successfully by viewing the output file contents.
To view the export file contents
1. Open the Amazon S3 console.
2. On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://mybucket/output/MyTable) and open the output file with your preferred text editor. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f-4790-9b11-fbfea01a871e_000000.
3. Using your preferred text editor, view the contents of the output file and ensure that there is a data file that corresponds to the DynamoDB source table. The presence of this text file indicates that the export operation from DynamoDB to the output file occurred successfully.
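As an alternative to browsing the bucket by hand, the presence of an output file can be checked programmatically. The sketch below assumes a boto3 Amazon S3 client and the example output path s3://mybucket/output/MyTable. It treats any non-empty object under the prefix as a candidate data file, which is an assumption about the export layout. The helper names are hypothetical.

```python
def export_data_files(list_response):
    """Return the keys of non-empty objects in a list_objects_v2 response."""
    return [
        obj["Key"]
        for obj in list_response.get("Contents", [])
        if obj.get("Size", 0) > 0
    ]


def find_export_files(client, bucket, prefix):
    """List non-empty objects under the export prefix.

    `client` is assumed to be boto3.client("s3"). Only the first page of
    results is inspected in this sketch.
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return export_data_files(response)


# Hypothetical usage for s3://mybucket/output/MyTable:
#   keys = find_export_files(client, "mybucket", "output/MyTable")
#   if not keys:
#       print("No export files found")
```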
Step 6: Delete Your Pipeline (Optional)
To stop incurring charges, delete your pipeline. Deleting your pipeline deletes the pipeline definition and all associated objects.
To delete your pipeline
1. On the List Pipelines page, select your pipeline.
2. Click Actions, and then choose Delete.
3. When prompted for confirmation, choose Delete.
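Deletion can also be scripted. The following is a minimal sketch, assuming a boto3 Data Pipeline client. The helper names are hypothetical, and the function refuses to delete anything unless exactly one pipeline matches the given name.

```python
def find_pipeline_ids(client, name):
    """Return the IDs of all pipelines with the given name, following pages."""
    ids = []
    marker = None
    while True:
        kwargs = {"marker": marker} if marker else {}
        page = client.list_pipelines(**kwargs)
        ids.extend(p["id"] for p in page["pipelineIdList"] if p["name"] == name)
        if not page.get("hasMoreResults"):
            return ids
        marker = page["marker"]


def delete_pipeline_by_name(client, name):
    """Delete the single pipeline called `name` and return its ID.

    `client` is assumed to be boto3.client("datapipeline"). Deleting a
    pipeline removes its definition and all associated objects.
    """
    ids = find_pipeline_ids(client, name)
    if len(ids) != 1:
        raise LookupError(f"expected exactly one pipeline named {name!r}, found {len(ids)}")
    client.delete_pipeline(pipelineId=ids[0])
    return ids[0]
```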