You can use the following pipeline objects and components in your pipeline definition.
Contents
• Data Nodes (p. 146)
• Activities (p. 167)
• Resources (p. 223)
• Preconditions (p. 251)
• Databases (p. 266)
• Data Formats (p. 271)
• Actions (p. 279)
• Schedule (p. 282)
• Utilities (p. 286)
NoteFor an example application that uses the AWS Data Pipeline Java SDK, see Data Pipeline DynamoDB Export Java Sample on GitHub.
The following is the object hierarchy for AWS Data Pipeline.
Data Nodes
Data Nodes
The following are the AWS Data Pipeline data node objects:
Objects
• DynamoDBDataNode (p. 146)
• MySqlDataNode (p. 150)
• RedshiftDataNode (p. 154)
• S3DataNode (p. 159)
• SqlDataNode (p. 163)
DynamoDBDataNode
Defines a data node using DynamoDB, which is specified as an input to a HiveActivity or EMRActivity object.
NoteThe DynamoDBDataNode object does not support the Exists precondition.
Example
The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.
{ "id" : "MyDynamoDBTable", "type" : "DynamoDBDataNode",
"schedule" : { "ref" : "CopyPeriod" }, "tableName" : "adEvents",
"precondition" : { "ref" : "Ready" } }
Syntax
Required Fields Description Slot Type
tableName The DynamoDB table. String
Object Invocation
Fields Description Slot Type
schedule This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying
"schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference
Reference Object, for example, "schedule":
{"ref":"myScheduleId"}
DynamoDBDataNode
Object Invocation
Fields Description Slot Type
on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see Schedule.
Optional Fields Description Slot Type
attemptStatus Most recently reported status from the remote
activity. String
attemptTimeout Timeout for remote work completion. If this field is set, then a remote activity that does not complete within the set time of starting may be retried.
Period
dataFormat DataFormat for the data described by this data node. Currently supported for HiveActivity and HiveCopyActivity.
Reference Object,
"dataFormat":
{"ref":"myDynamoDBDataFormatId"}
dependsOn Specify dependency on another runnable object Reference Object, e.g. "dependsOn":
{"ref":"myActivityId"}
failureAndRerunMode Describes consumer node behavior when
dependencies fail or are rerun Enumeration
lateAfterTimeout The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand.
Period
maxActiveInstances The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances.
Integer
maximumRetries Maximum number attempt retries on failure Integer
onFail An action to run when current object fails. Reference Object, e.g. "onFail":
{"ref":"myActionId"}
onLateAction Actions that should be triggered if an object has
not yet been scheduled or still not completed. Reference Object, e.g. "onLateAction":
{"ref":"myActionId"}
onSuccess An action to run when current object succeeds. Reference Object, e.g. "onSuccess":
{"ref":"myActionId"}
parent Parent of the current object from which slots will
be inherited. Reference Object,
e.g. "parent":
{"ref":"myBaseObjectId"}
DynamoDBDataNode
Optional Fields Description Slot Type
pipelineLogUri The S3 URI (such as 's3://BucketName/Key/') for
uploading logs for the pipeline. String precondition Optionally define a precondition. A data node is
not marked "READY" until all preconditions have been met.
Reference Object, e.g. "precondition":
{"ref":"myPreconditionId"}
readThroughputPercent Sets the rate of read operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is a double between 0.1 and 1.0, inclusively.
Double
region The code for the region where the DynamoDB table exists. For example, us-east-1. This is used by HiveActivity when it performs staging for DynamoDB tables in Hive.
Enumeration
reportProgressTimeout Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried.
Period
retryDelay The timeout duration between two retry
attempts. Period
runsOn The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Reference Object, e.g. "runsOn":
{"ref":"myResourceId"}
scheduleType Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries.
Enumeration
workerGroup The worker group. This is used for routing tasks.
If you provide a runsOn value and workerGroup exists, workerGroup is ignored.
String
writeThroughputPercent Sets the rate of write operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is a double between .1 and 1.0, inclusively.
Double
DynamoDBDataNode
Runtime Fields Description Slot Type
@activeInstances List of the currently scheduled active instance
objects. Reference Object,
e.g. "activeInstances":
{"ref":"myRunnableObjectId"}
@actualEndTime Time when the execution of this object finished. DateTime
@actualStartTime Time when the execution of this object started. DateTime cancellationReason The cancellationReason if this object was
cancelled. String
@cascadeFailedOn Description of dependency chain the object failed
on. Reference Object, e.g.
"cascadeFailedOn":
{"ref":"myRunnableObjectId"}
emrStepLog EMR step logs available only on EMR activity
attempts String
errorId The errorId if this object failed. String
errorMessage The errorMessage if this object failed. String errorStackTrace The error stack trace if this object failed. String
@finishedTime The time at which this object finished its
execution. DateTime
hadoopJobLog Hadoop job logs available on attempts for
EMR-based activities. String
@healthStatus The health status of the object which reflects success or failure of the last object instance that reached a terminated state.
String
@healthStatusFromInstanceIdId of the last instance object that reached a
terminated state. String
@healthStatusUpdatedTimeTime at which the health status was updated last
time. DateTime
hostname The host name of client that picked up the task
attempt. String
@lastDeactivatedTime The time at which this object was last deactivated. DateTime
@latestCompletedRunTimeTime the latest run for which the execution
completed. DateTime
@latestRunTime Time the latest run for which the execution was
scheduled. DateTime
@nextRunTime Time of run to be scheduled next. DateTime
reportProgressTime Most recent time that remote activity reported
progress. DateTime
@scheduledEndTime Schedule end time for object DateTime
@scheduledStartTime Schedule start time for object DateTime
MySqlDataNode
Runtime Fields Description Slot Type
@status The status of this object. String
@version Pipeline version the object was created with. String
@waitingOn Description of list of dependencies this object is
waiting on. Reference Object,
e.g. "waitingOn":
{"ref":"myRunnableObjectId"}
System Fields Description Slot Type
@error Error describing the ill-formed object String
@pipelineId Id of the pipeline to which this object belongs to String
@sphere The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects
String
MySqlDataNode
Defines a data node using MySQL.
NoteThe MySqlDataNode type is deprecated. We recommend that you use SqlDataNode (p. 163) instead.
Example
The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.
{ "id" : "Sql Table", "type" : "MySqlDataNode",
"schedule" : { "ref" : "CopyPeriod" }, "table" : "adEvents",
"username": "user_name", "*password": "my_password",
"connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
"selectQuery" : "select * from #{table} where eventTime >=
'#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime <
'#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'", "precondition" : { "ref" : "Ready" }
}
Syntax
Required Fields Description Slot Type
table The name of the table in the MySQL database. String
MySqlDataNode
Object Invocation
Fields Description Slot Type
schedule This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying
"schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see https://docs.aws.amazon.com/datapipeline/
latest/DeveloperGuide/dp-object-schedule.html
Reference Object, e.g. "schedule":
{"ref":"myScheduleId"}
Optional Fields Description Slot Type
attemptStatus Most recently reported status from the remote
activity. String
attemptTimeout Timeout for remote work completion. If set then a remote activity that does not complete within the set time of starting may be retried.
Period
createTableSql An SQL create table expression that creates the
table. String
database The name of the database. Reference Object,
e.g. "database":
{"ref":"myDatabaseId"}
dependsOn Specifies dependency on another runnable object. Reference Object, e.g. "dependsOn":
{"ref":"myActivityId"}
failureAndRerunMode Describes consumer node behavior when
dependencies fail or are rerun. Enumeration insertQuery An SQL statement to insert data into the table. String lateAfterTimeout The elapsed time after pipeline start within which
the object must complete. It is triggered only when the schedule type is not set to ondemand.
Period
maxActiveInstances The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances.
Integer
maximumRetries Maximum number attempt retries on failure Integer
MySqlDataNode
Optional Fields Description Slot Type
onFail An action to run when current object fails. Reference Object, e.g. "onFail":
{"ref":"myActionId"}
onLateAction Actions that should be triggered if an object has
not yet been scheduled or still not completed. Reference Object, e.g. "onLateAction":
{"ref":"myActionId"}
onSuccess An action to run when current object succeeds. Reference Object, e.g. "onSuccess":
{"ref":"myActionId"}
parent Parent of the current object from which slots will
be inherited. Reference Object,
e.g. "parent":
{"ref":"myBaseObjectId"}
pipelineLogUri The S3 URI (such as 's3://BucketName/Key/') for
uploading logs for the pipeline. String precondition Optionally define a precondition. A data node is
not marked "READY" until all preconditions have been met.
Reference Object, e.g. "precondition":
{"ref":"myPreconditionId"}
reportProgressTimeout Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried.
Period
retryDelay The timeout duration between two retry
attempts. Period
runsOn The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.
Reference Object, e.g. "runsOn":
{"ref":"myResourceId"}
scheduleType Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries.
Enumeration
schemaName The name of the schema holding the table String selectQuery A SQL statement to fetch data from the table. String
MySqlDataNode
Optional Fields Description Slot Type
workerGroup The worker group. This is used for routing tasks.
If you provide a runsOn value and workerGroup exists, workerGroup is ignored.
String
Runtime Fields Description Slot Type
@activeInstances List of the currently scheduled active instance
objects. Reference Object,
e.g. "activeInstances":
{"ref":"myRunnableObjectId"}
@actualEndTime Time when the execution of this object finished. DateTime
@actualStartTime Time when the execution of this object started. DateTime cancellationReason The cancellationReason if this object was
cancelled. String
@cascadeFailedOn Description of the dependency chain the object
failed on. Reference Object, e.g.
"cascadeFailedOn":
{"ref":"myRunnableObjectId"}
emrStepLog EMR step logs available only on EMR activity
attempts String
errorId The errorId if this object failed. String
errorMessage The errorMessage if this object failed. String errorStackTrace The error stack trace if this object failed. String
@finishedTime The time at which this object finished its
execution. DateTime
hadoopJobLog Hadoop job logs available on attempts for
EMR-based activities. String
@healthStatus The health status of the object which reflects success or failure of the last object instance that reached a terminated state.
String
@healthStatusFromInstanceIdId of the last instance object that reached a
terminated state. String
@healthStatusUpdatedTimeTime at which the health status was updated last
time. DateTime
hostname The host name of client that picked up the task
attempt. String
@lastDeactivatedTime The time at which this object was last deactivated. DateTime
@latestCompletedRunTimeTime the latest run for which the execution
completed. DateTime
@latestRunTime Time the latest run for which the execution was
scheduled. DateTime
RedshiftDataNode
Runtime Fields Description Slot Type
@nextRunTime Time of run to be scheduled next. DateTime
reportProgressTime Most recent time that remote activity reported
progress. DateTime
@scheduledEndTime Schedule end time for object. DateTime
@scheduledStartTime Schedule start time for object. DateTime
@status The status of this object. String
@version Pipeline version the object was created with. String
@waitingOn Description of list of dependencies this object is
waiting on. Reference Object,
e.g. "waitingOn":
{"ref":"myRunnableObjectId"}
System Fields Description Slot Type
@error Error describing the ill-formed object. String
@pipelineId Id of the pipeline to which this object belongs to. String
@sphere The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects.
String