AWS DataSync
User Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What is AWS DataSync? ... 1
Use cases ... 1
Benefits ... 1
Additional AWS DataSync resources ... 2
How AWS DataSync works ... 3
AWS DataSync architecture ... 3
Data transfer between self-managed storage and AWS ... 3
Data transfer between AWS storage services ... 4
Data transfer using a DataSync EC2 agent deployed in a Region ... 5
Components and terminology ... 5
Agent ... 6
Location ... 6
Task ... 6
Task execution ... 6
How DataSync transfers files ... 7
How AWS DataSync verifies data integrity ... 8
How DataSync handles open and locked files ... 8
Setting up ... 9
Sign up for AWS ... 9
AWS Regions and endpoints ... 9
How to access AWS DataSync ... 9
DataSync pricing ... 9
Requirements ... 10
Agent requirements ... 10
Supported hypervisors ... 10
Virtual machine requirements ... 11
Amazon EC2 instance requirements ... 11
Network requirements ... 11
Network requirements to connect to your self-managed storage ... 11
Network requirements when using VPC endpoints ... 12
Network requirements when using public service endpoints or FIPS endpoints ... 15
Required network interfaces for data transfers ... 19
Getting started ... 21
Create an agent ... 21
Deploy your agent ... 22
Choose a service endpoint ... 26
Activate your agent ... 28
Configure a source location ... 29
Configure a destination location ... 30
Configure task settings ... 31
Data verification options ... 31
Ownership and permissions-related options ... 32
File metadata options and file management ... 32
Bandwidth options ... 33
Filtering options ... 33
Scheduling and queueing options ... 33
Tags and logging options ... 34
Review and create your task ... 34
Start your task ... 34
Clean up resources ... 35
Using the AWS CLI ... 36
Step 1: Create an agent ... 36
Step 2: Create locations ... 39
Create an NFS location ... 39
Create an SMB location ... 40
Create an HDFS location ... 41
Create an object storage location ... 42
Create an Amazon EFS location ... 42
Create an FSx for Windows File Server location ... 44
Create an Amazon FSx for Lustre location ... 44
Create an Amazon S3 location ... 45
Step 3: Create a task ... 49
Step 4: Start a task execution ... 50
Step 5: Monitor your task execution ... 51
Monitor your task execution in real time ... 52
API filters ... 52
Parameters for API filtering ... 52
API filtering ListLocations ... 53
API filtering ListTasks ... 53
Working with agents ... 55
Creating and activating an agent ... 55
Using DataSync in a VPC ... 56
How DataSync works with VPC endpoints ... 56
Configuring DataSync to use private IP addresses for data transfer ... 56
Deploying your agent in AWS Regions ... 59
Data transfer from in-cloud file system to in-cloud file system or Amazon S3 ... 59
Data transfer from S3 to in-cloud file systems ... 60
Editing your agent's properties ... 61
Using multiple agents for a location ... 62
Agent statuses ... 62
Deleting an agent ... 62
Configuring your agent for multiple NICs ... 63
Working with your agent's local console ... 63
Logging in to the agent local console ... 64
Obtaining an activation key using the local console ... 64
Configuring your agent network settings ... 65
Testing your agent connectivity to the internet ... 66
Testing connectivity to self-managed storage ... 67
Viewing your agent system resource status ... 67
Synchronizing your VMware agent time ... 68
Running AWS DataSync commands on the local console ... 69
Enabling AWS Support to help troubleshoot DataSync ... 70
Working with locations ... 72
Creating a location for NFS ... 73
NFS location settings ... 74
NFS server on AWS Snowcone and AWS Snowball Edge ... 75
Creating a location for SMB ... 75
SMB location settings ... 76
Creating a location for HDFS ... 77
Unsupported HDFS features ... 78
Creating a location for object storage ... 78
Creating a location for Amazon EFS ... 79
Considerations when creating a location for Amazon EFS ... 80
Creating a location for FSx for Windows File Server ... 81
Creating a location for FSx for Lustre ... 83
Creating a location for Amazon S3 ... 83
Amazon S3 location settings ... 85
Considerations when working with Amazon S3 storage classes in DataSync ... 86
Manually configuring an IAM role to access your Amazon S3 bucket ... 88
How DataSync handles metadata and special files ... 90
Metadata copied by DataSync ... 90
Default POSIX metadata applied by DataSync ... 92
Links and directories copied by DataSync ... 93
Deleting a location ... 93
Working with tasks ... 94
Creating your task ... 94
Creating a task for DataSync ... 94
Creating a task to transfer data between self-managed storage and AWS ... 95
Creating a task to transfer between in-cloud locations ... 95
Configuring task settings ... 99
Filtering data ... 100
Filtering terms, definitions, and syntax ... 100
Excluding data from a transfer ... 101
Including data in a transfer ... 102
Sample filters for common uses ... 102
Scheduling your task ... 103
Configuring a task schedule ... 104
Editing a task schedule ... 104
Task creation statuses ... 105
Starting your task ... 105
Queueing task executions ... 106
Working with task executions ... 106
Adjust bandwidth throttling ... 106
Task execution statuses ... 107
Cancel a task execution ... 107
Deleting your task ... 108
Monitoring ... 109
Accessing CloudWatch metrics ... 109
DataSync CloudWatch metrics ... 109
CloudWatch events for DataSync ... 110
DataSync dimensions ... 111
Uploading logs to Amazon CloudWatch log groups ... 111
Security ... 113
Data protection ... 113
Data encryption ... 113
Identity and access management ... 114
Overview of managing access ... 114
Using identity-based policies (IAM policies) ... 119
Cross-service confused deputy prevention ... 122
DataSync API permissions reference ... 123
Logging ... 128
Working with AWS DataSync information in CloudTrail ... 128
Understanding AWS DataSync log file entries ... 129
Compliance validation ... 130
Resilience ... 131
Infrastructure security ... 131
Quotas and limits ... 132
Quotas for tasks ... 132
Quotas for task executions ... 134
Limits for DataSync file systems ... 134
Limits for DataSync filters ... 134
Troubleshooting ... 135
I need DataSync to use a specific NFS or SMB version to mount my share ... 135
What does the "Failed to retrieve agent activation key" error mean? ... 136
I can't activate an agent I created using a VPC endpoint ... 136
My task status is unavailable and indicates a mount error ... 136
My task failed with an input/output error message ... 137
My task is stuck in launching status ... 137
My task failed with a permissions denied error message ... 137
My task has had a preparing status for a long time ... 138
How long does it take to verify a task I've run? ... 138
My storage cost is higher than I expected ... 138
I don't know what's going on with my agent. Can someone help me? ... 139
How do I connect to an Amazon EC2 agent's local console? ... 139
Tutorial: Transferring across accounts to S3 ... 140
Overview ... 140
Prerequisites ... 140
Step 1: Create an IAM role for DataSync in Account A ... 141
Create the IAM role ... 141
Attach a custom policy to the IAM role ... 141
Step 2: Disable ACLs for your S3 bucket in Account B ... 142
Step 3: Update the S3 bucket policy in Account B ... 142
Step 4: Create a DataSync destination location for the S3 bucket ... 143
Step 5: Create and start your DataSync task ... 144
Related ... 144
Additional resources ... 146
Transferring data from a self-managed storage array ... 146
Other use cases ... 146
Transferring files in opposite directions ... 146
Using multiple tasks to write to the same Amazon S3 bucket ... 147
Allowing Amazon S3 access from a private VPC endpoint ... 147
API reference ... 149
Actions ... 149
CancelTaskExecution ... 151
CreateAgent ... 153
CreateLocationEfs ... 157
CreateLocationFsxLustre ... 161
CreateLocationFsxWindows ... 164
CreateLocationHdfs ... 167
CreateLocationNfs ... 172
CreateLocationObjectStorage ... 176
CreateLocationS3 ... 180
CreateLocationSmb ... 185
CreateTask ... 189
DeleteAgent ... 194
DeleteLocation ... 196
DeleteTask ... 198
DescribeAgent ... 200
DescribeLocationEfs ... 203
DescribeLocationFsxLustre ... 206
DescribeLocationFsxWindows ... 209
DescribeLocationHdfs ... 212
DescribeLocationNfs ... 216
DescribeLocationObjectStorage ... 219
DescribeLocationS3 ... 222
DescribeLocationSmb ... 225
DescribeTask ... 228
DescribeTaskExecution ... 234
ListAgents ... 239
ListLocations ... 241
ListTagsForResource ... 244
ListTaskExecutions ... 247
ListTasks ... 250
StartTaskExecution ... 253
TagResource ... 257
UntagResource ... 259
UpdateAgent ... 261
UpdateLocationHdfs ... 263
UpdateLocationNfs ... 267
UpdateLocationObjectStorage ... 270
UpdateLocationSmb ... 273
UpdateTask ... 276
UpdateTaskExecution ... 279
Data Types ... 280
AgentListEntry ... 282
Ec2Config ... 283
FilterRule ... 284
HdfsNameNode ... 285
LocationFilter ... 286
LocationListEntry ... 287
NfsMountOptions ... 289
OnPremConfig ... 290
Options ... 291
PrivateLinkConfig ... 296
QopConfiguration ... 298
S3Config ... 299
SmbMountOptions ... 300
TagListEntry ... 301
TaskExecutionListEntry ... 302
TaskExecutionResultDetail ... 303
TaskFilter ... 305
TaskListEntry ... 306
TaskSchedule ... 307
Common Errors ... 307
Common Parameters ... 309
Document history ... 311
AWS glossary ... 314
What is AWS DataSync?
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS storage services, and also between AWS storage services. DataSync can copy data between:
• Network File System (NFS) file servers
• Server Message Block (SMB) file servers
• Hadoop Distributed File System (HDFS)
• On-premises (self-managed) object storage
• Snow Family devices
• Amazon Simple Storage Service (Amazon S3) buckets
• Amazon EFS file systems
• Amazon FSx for Windows File Server file systems
• Amazon FSx for Lustre file systems
In this guide, you can find a description of the components of DataSync, detailed instructions on how to get started, and the API Reference.
Topics
• Use cases (p. 1)
• Benefits (p. 1)
• Additional AWS DataSync resources (p. 2)
Use cases
These are some of the main use cases for AWS DataSync:
• Data migration – Move active datasets rapidly over the network into Amazon S3, Amazon EFS, FSx for Windows File Server, or FSx for Lustre. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
• Archiving cold data – Move cold data stored in on-premises storage directly to durable and secure long-term storage classes such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. Doing so can free up on-premises storage capacity and let you shut down legacy systems.
• Data protection – Move data into any Amazon S3 storage class, choosing the most cost-effective storage class for your needs. You can also send data to Amazon EFS, FSx for Windows File Server, or FSx for Lustre for a standby file system.
• Data movement for timely in-cloud processing – Move data into or out of AWS for processing when working with systems that generate data on-premises. This approach can speed up critical hybrid cloud workflows across many industries. These include machine learning in the life-sciences industry, video production in media and entertainment, big-data analytics in financial services, and seismic research in the oil and gas industry.
Benefits
By using AWS DataSync, you can get the following benefits:
• Simplify and automate data movement – AWS DataSync makes it easier to move data over the network between on-premises storage and AWS storage services, and also between AWS storage services. DataSync automates both the management of data-transfer processes and the infrastructure required for high performance and secure data transfer.
• Transfer data securely – DataSync provides end-to-end security, including encryption and integrity validation, to help ensure that your data arrives securely, intact, and ready to use. DataSync accesses your AWS storage through built-in AWS security mechanisms, such as AWS Identity and Access Management (IAM) roles. It also supports virtual private cloud (VPC) endpoints, giving you the option to transfer data without traversing the public internet, and further increasing the security of data copied online.
• Move data faster – With DataSync, you can transfer data rapidly over the network into AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers. This approach speeds up migrations, recurring data-processing workflows for analytics and machine learning, and data-protection processes.
• Reduce operational costs – You can move data cost-effectively with the flat, per-gigabyte pricing of DataSync. You can save on script development, and deployment and maintenance costs, and avoid the need for costly commercial transfer tools.
Additional AWS DataSync resources
We recommend that you read the following:
• DataSync resources – The resources page includes blogs, videos, and other training materials.
• AWS DataSync developer forum – The AWS DataSync developer forum.
• AWS DataSync pricing – AWS DataSync pricing information.
AWS DataSync also supports Terraform. To learn more about DataSync deployment automation with Terraform, see the Terraform documentation.
How AWS DataSync works
In this section, you can find information about components, terms, and how DataSync works.
Topics
• AWS DataSync architecture (p. 3)
• Components and terminology (p. 5)
• How DataSync transfers files (p. 7)
AWS DataSync architecture
Topics
• Data transfer between self-managed storage and AWS (p. 3)
• Data transfer between AWS storage services (p. 4)
• Data transfer using a DataSync EC2 agent deployed in a Region (p. 5)
The architectural diagrams show how DataSync transfers data between on-premises (self-managed) storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.
For a list of all DataSync supported source and destination endpoints, see Working with locations (p. 72).
Data transfer between self-managed storage and AWS
The following diagram shows a high-level view of the DataSync architecture for transferring files between self-managed storage and AWS services.
Data transfer between AWS storage services
The following diagram provides a high-level view of the DataSync architecture for transferring files between AWS services within the same AWS account. This architecture applies to both in-Region and cross-Region transfers.
Important
When you use DataSync to copy files or objects between AWS Regions, you pay for data transfer between Regions. This is billed as data transfer OUT from your source Region to your destination Region. For more information, see Data transfer pricing.
Data transfer using a DataSync EC2 agent deployed in a Region
You can use DataSync to transfer data between AWS services in different AWS accounts, or between self-managed file systems in AWS and Amazon S3, by deploying the DataSync Amazon EC2 agent in an AWS Region. For more information, see Deploying your DataSync agent in AWS Regions (p. 59).
Components and terminology
The components of DataSync include the following:
• Agent – A virtual machine (VM) that's used to read data from or write data to a self-managed location. An agent isn't required when transferring between AWS storage services in the same AWS account.
• Location – Any source or destination location that's used in the data transfer, such as Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, Amazon FSx for Lustre, Network File System (NFS), Server Message Block (SMB), Hadoop Distributed File System (HDFS), or self-managed object storage.
• Task – A source location and a destination location, and a configuration that defines how data is transferred. A task always transfers data from the source to the destination. The configuration can include options such as task schedule, bandwidth limit, and so on. A task is the complete definition of a data transfer.
• Task execution – An individual run of a task, which includes information such as the start time, end time, bytes written, and status.
Agent
An agent is a VM that you own that's used to read data from or write data to self-managed storage systems. You can deploy the agent on a VMware ESXi, KVM, or Microsoft Hyper-V hypervisor, or launch it as an Amazon EC2 instance. You use the AWS DataSync console or the API to set up and activate your agent.
The activation process associates your agent VM with your AWS account. For information about agents, see Working with agents (p. 55).
An agent that's functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from issues with a network partition, firewall misconfiguration, and other events that make the agent VM unable to connect to AWS. The status of an agent that's powered off also shows as OFFLINE.
Location
A location is an endpoint of a task. Each task has two locations—a source location and a destination location. AWS DataSync supports the following location types:
• Network File System (NFS)
• Server Message Block (SMB)
• Hadoop Distributed File System (HDFS)
• On-premises (self-managed) object storage
• Amazon EFS
• Amazon FSx for Windows File Server
• Amazon FSx for Lustre
• Amazon S3
For more information, see Working with locations (p. 72).
Task
A task includes two locations (source and destination), and the configuration of how to transfer the data from one location to the other. The configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.
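As a rough illustration of how a task ties two locations to a configuration, the following Python sketch assembles the parameters for a CreateTask request. The helper name build_create_task_request, the placeholder ARNs, the task name, and the choice of ONLY_FILES_TRANSFERRED as the default verification mode are assumptions made for this example and are not prescribed by this guide.

```python
from typing import Any, Dict, Optional

# Verification modes accepted in the Options structure of the DataSync API.
VERIFY_MODES = {"POINT_IN_TIME_CONSISTENT", "ONLY_FILES_TRANSFERRED", "NONE"}


def build_create_task_request(
    source_location_arn: str,
    destination_location_arn: str,
    name: str,
    verify_mode: str = "ONLY_FILES_TRANSFERRED",
    bytes_per_second: Optional[int] = None,
    schedule_expression: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a CreateTask call (hypothetical helper)."""
    if verify_mode not in VERIFY_MODES:
        raise ValueError(f"unsupported verify mode: {verify_mode!r}")

    # The task configuration: verification plus an optional bandwidth limit.
    options: Dict[str, Any] = {"VerifyMode": verify_mode}
    if bytes_per_second is not None:
        options["BytesPerSecond"] = bytes_per_second

    # A task always points from one source location to one destination location.
    request: Dict[str, Any] = {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        "Options": options,
    }
    if schedule_expression is not None:
        request["Schedule"] = {"ScheduleExpression": schedule_expression}
    return request


# Placeholder ARNs for illustration only; substitute your own location ARNs.
example_request = build_create_task_request(
    "arn:aws:datasync:us-east-1:111122223333:location/loc-source-example",
    "arn:aws:datasync:us-east-1:111122223333:location/loc-destination-example",
    name="nightly-copy",
    schedule_expression="cron(0 2 * * ? *)",
)
```

With the AWS SDK for Python (Boto3) installed and credentials configured, a dictionary like this could be passed to the DataSync client as create_task(**example_request). The sketch deliberately stops short of making that call so that it runs without network access.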
Task execution
A task execution is an individual run of a task, which shows information such as the start time, end time, number of transferred files, and status.
A task execution has five transition phases and two terminal statuses, as shown in the following diagram.
These phases and statuses are:
• QUEUEING – During this phase, the task execution waits in a queue while another task execution is using the same agent.
• LAUNCHING – During this phase, the task execution is initialized.
• PREPARING – During this phase, DataSync computes which files need to be transferred.
• TRANSFERRING – During this phase, DataSync transfers data from the source location to the destination location.
• VERIFYING – During this optional phase, DataSync performs a full data and metadata integrity verification. This phase occurs only if the VerifyMode option is enabled during configuration.
• SUCCESS or ERROR – When the task is finished, DataSync sets the task to one of these terminal statuses, depending on whether it was successful.
If the VerifyMode option isn't enabled in the task configuration, the terminal status is set after the TRANSFERRING phase. Otherwise, it is set after the VERIFYING phase. The two terminal statuses are these:
• SUCCESS
• ERROR
For more information, see Task execution statuses (p. 107).
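The phases and terminal statuses described above can be modeled as a small sequence. The following Python sketch is only a conceptual model, not DataSync code: the function names are hypothetical, and it assumes that a failure ends the run with ERROR at the phase where it occurs.

```python
from typing import List, Optional


def phase_sequence(verify_enabled: bool) -> List[str]:
    """Return the transition phases a task execution passes through, in order."""
    phases = ["QUEUEING", "LAUNCHING", "PREPARING", "TRANSFERRING"]
    if verify_enabled:
        # VERIFYING is optional and occurs only when verification is enabled.
        phases.append("VERIFYING")
    return phases


def run_history(verify_enabled: bool, fails_in: Optional[str] = None) -> List[str]:
    """Simulate one run and return the phases visited plus the terminal status."""
    history: List[str] = []
    for phase in phase_sequence(verify_enabled):
        history.append(phase)
        if phase == fails_in:
            # Assumption: an error ends the run at the failing phase.
            history.append("ERROR")
            return history
    history.append("SUCCESS")
    return history
```

For example, run_history(verify_enabled=False) ends with SUCCESS directly after TRANSFERRING, which matches the rule that the terminal status is set after TRANSFERRING when verification isn't enabled.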
How DataSync transfers files
Topics
• How AWS DataSync verifies data integrity (p. 8)
• How DataSync handles open and locked files (p. 8)
When a task starts, it goes through different phases: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING phase, DataSync initializes the task execution. In the PREPARING phase, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents and metadata of files on the source and destination file systems for differences.
The time that DataSync spends in the PREPARING phase depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems and usually takes between a few minutes and a few hours. For more information, see Starting your DataSync task (p. 105).
After the scanning is done and the differences are calculated, DataSync transitions to the TRANSFERRING phase. At this point, DataSync starts transferring files and metadata from the source file system to the destination. DataSync copies changes to files with contents or metadata that are different between the source and the destination. You can narrow down the copied files by filtering the data or by configuring DataSync to not overwrite files that are already present in the destination.
Note
By default, any changes to metadata on the source storage result in this metadata being copied to the destination storage.
After the TRANSFERRING phase is done, DataSync verifies consistency between the source and destination file systems. This is the VERIFYING phase.
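To make the comparison performed in the PREPARING phase more concrete, the following Python sketch computes which files would need to be copied. It is a simplified model under stated assumptions: each file is represented by a hypothetical fingerprint of size and modification time, whereas DataSync compares file contents and a richer set of metadata, and the function name plan_transfer is invented for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical fingerprint of one file: (size in bytes, modification time).
Fingerprint = Tuple[int, float]


def plan_transfer(
    source: Dict[str, Fingerprint],
    destination: Dict[str, Fingerprint],
    overwrite_existing: bool = True,
) -> List[str]:
    """Return the sorted paths that differ between source and destination."""
    to_copy: List[str] = []
    for path, fingerprint in source.items():
        if path not in destination:
            # New file: always copied.
            to_copy.append(path)
        elif overwrite_existing and destination[path] != fingerprint:
            # Changed file: copied unless overwriting is turned off.
            to_copy.append(path)
    return sorted(to_copy)


source_files = {"a.txt": (10, 1.0), "b.txt": (20, 2.0), "c.txt": (5, 3.0)}
destination_files = {"a.txt": (10, 1.0), "b.txt": (20, 1.5)}

changed = plan_transfer(source_files, destination_files)
new_only = plan_transfer(source_files, destination_files, overwrite_existing=False)
```

In this example, changed contains b.txt and c.txt, while new_only contains only c.txt because files already present in the destination are left alone.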
How AWS DataSync verifies data integrity
When DataSync transfers data, it always performs data integrity checks during the transfer. You can enable additional verification to compare the source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.
AWS DataSync locally calculates the checksum of every file in the source file system and the destination and compares them. Additionally, DataSync compares the metadata of every file in the source and destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, Files were removed, and so on.
For more information, see DataSync task creation statuses (p. 105) and Enable verification in the Configuring task settings (p. 99) section.
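The checksum comparison described above can be sketched as follows. This is an illustrative model only: this guide doesn't state which checksum algorithm DataSync uses, so SHA-256 is an assumption, files are represented as in-memory byte strings, and the problem labels are invented for the example rather than taken from DataSync error codes.

```python
import hashlib
from typing import Dict, List, Tuple


def checksum(data: bytes) -> str:
    """Return a hex digest for the given content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify(
    source: Dict[str, bytes], destination: Dict[str, bytes]
) -> List[Tuple[str, str]]:
    """Compare two file sets and return (path, problem) pairs for each difference."""
    problems: List[Tuple[str, str]] = []
    for path in sorted(set(source) | set(destination)):
        if path not in destination:
            problems.append((path, "missing in destination"))
        elif path not in source:
            problems.append((path, "not present in source"))
        elif checksum(source[path]) != checksum(destination[path]):
            problems.append((path, "checksum mismatch"))
    return problems


src = {"report.csv": b"1,2,3\n", "notes.txt": b"hello"}
dst = {"report.csv": b"1,2,4\n", "extra.log": b"unexpected"}
issues = verify(src, dst)
```

An empty result means the two file sets match. Verifying only the files transferred corresponds to passing just those paths in both dictionaries.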
How DataSync handles open and locked files
In general, DataSync can transfer open files without any limitations.
If a file is open and it's being written to during the transfer, DataSync detects data inconsistency during the VERIFYING phase. This phase is when DataSync detects whether the file on the source is different from the file on the destination.
If a file is locked and the server prevents DataSync from opening it, DataSync skips transferring it.
DataSync logs an error during the TRANSFERRING phase and sends a verification error.
Setting up
To get started, you first sign up for AWS. If you are a first-time user, we recommend that you read the Regions and requirements section.
Topics
• Sign up for AWS (p. 9)
• AWS Regions and endpoints (p. 9)
• How to access AWS DataSync (p. 9)
• DataSync pricing (p. 9)
Sign up for AWS
To use AWS DataSync, you need an AWS account that gives you access to all AWS resources, forums, support, and usage reports. You aren't charged for any of the services unless you use them. If you already have an AWS account, you can skip this step.
To sign up for an AWS account
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.
AWS Regions and endpoints
AWS DataSync is available in the following AWS Regions.
How to access AWS DataSync
You can use the DataSync management console to perform various sync configuration and management tasks.
Additionally, you can use the AWS DataSync API or the AWS CLI to programmatically configure and manage DataSync. For more information about the API, see API reference (p. 149).
You can also use the AWS SDKs to develop applications that interact with DataSync. The AWS SDKs for Java, .NET, and PHP wrap the underlying DataSync API to simplify your programming tasks. For information about downloading the SDK libraries, see Sample code libraries.
DataSync pricing
For information about AWS DataSync pricing, see the AWS DataSync pricing page.
Requirements for AWS DataSync
AWS DataSync agent and network requirements vary based on where and how you plan to transfer data.
Topics
• Agent requirements (p. 10)
• Network requirements (p. 11)
Agent requirements
Your AWS DataSync agent must adhere to the requirements that apply to your scenario.
Topics
• Supported hypervisors (p. 10)
• Virtual machine requirements (p. 11)
• Amazon EC2 instance requirements (p. 11)
Supported hypervisors
DataSync supports the following hypervisor versions and hosts:
• VMware ESXi Hypervisor (version 6.5, 6.7, or 7.0): A free version of VMware is available on the VMware website. You also need a VMware vSphere client to connect to the host.
Note
When VMware ends general support for an ESXi hypervisor version, DataSync also ends support for that version. For information about VMware's supported hypervisor versions, see VMware lifecycle policy on the VMware website.
• Microsoft Hyper-V Hypervisor (version 2012 R2, 2016, or 2019): A free, standalone version of Hyper-V is available at the Microsoft Download Center. For this setup, you need a Microsoft Hyper-V Manager on a Microsoft Windows client computer to connect to the host.
Note
The DataSync VM is a generation 1 virtual machine. For more information about the differences between generation 1 and generation 2 VMs, see Should I create a generation 1 or 2 virtual machine in Hyper-V?
• Linux Kernel-based Virtual Machine (KVM): A free, open-source virtualization technology. KVM is included in Linux versions 2.6.20 and newer. AWS DataSync is tested and supported for the CentOS/RHEL 7.8, Ubuntu 16.04 LTS, and Ubuntu 18.04 LTS distributions. Any other modern Linux distribution might work, but function or performance is not guaranteed. We recommend this option if you already have a KVM environment up and running and you're already familiar with how KVM works.
Note
Running KVM on Amazon EC2 isn't supported, and cannot be used for DataSync agents. To run the agent on Amazon EC2, deploy an agent Amazon Machine Image (AMI). For more information about deploying an agent AMI on Amazon EC2, see Deploy your agent as an Amazon EC2 instance (p. 24).
• Amazon EC2 instance: DataSync provides an Amazon Machine Image (AMI) that contains the DataSync VM image. For the recommended instance types, see Amazon EC2 instance requirements (p. 11).
Virtual machine requirements
When deploying AWS DataSync on-premises, make sure that the underlying hardware where you deploy the DataSync VM can dedicate the following minimum resources:
• Virtual processors: Four virtual processors assigned to the VM.
• Disk space: 80 GB of disk space for installation of VM image and system data.
• RAM: Depending on your configuration, one of the following:
• 32 GB of RAM assigned to the VM, for tasks that transfer up to 20 million files.
• 64 GB of RAM assigned to the VM, for tasks that transfer more than 20 million files.
Amazon EC2 instance requirements
When deploying a DataSync agent with Amazon EC2, the instance size must be at least 2xlarge.
We recommend using one of the following instance sizes:
• m5.2xlarge: For tasks to transfer up to 20 million files.
• m5.4xlarge: For tasks to transfer more than 20 million files.
Note
An exception to this is if you're running DataSync on an AWS Snowcone device. Use the default instance snc1.medium, which provides 2 CPU cores and 4 GiB of memory.
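The sizing guidance above reduces to a single threshold of 20 million files. The following Python sketch encodes it for quick planning. The function names are hypothetical, and the values come only from the recommendations in this section.

```python
# Threshold from the sizing guidance: up to 20 million files vs. more.
FILE_COUNT_THRESHOLD = 20_000_000


def recommended_vm_ram_gb(file_count: int) -> int:
    """Return the RAM in GB to assign to an on-premises agent VM."""
    return 32 if file_count <= FILE_COUNT_THRESHOLD else 64


def recommended_ec2_instance_type(file_count: int, on_snowcone: bool = False) -> str:
    """Return the recommended instance type for an Amazon EC2 agent."""
    if on_snowcone:
        # Exception: a Snowcone device uses its default instance.
        return "snc1.medium"
    return "m5.2xlarge" if file_count <= FILE_COUNT_THRESHOLD else "m5.4xlarge"
```

For example, a task that transfers 35 million files maps to 64 GB of RAM for a VM agent, or to m5.4xlarge for an Amazon EC2 agent.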
To connect to an Amazon EC2 agent using SSH, you must use the following cryptographic algorithms:
• SSH cipher: aes128-ctr
• Key exchange: diffie-hellman-group14-sha1
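One way to apply these algorithms is to pass them explicitly to an OpenSSH client. The following Python sketch builds such a command line. It assumes an OpenSSH client that still supports these algorithms; the helper name, the admin user name, the host name, and the key path are placeholders for illustration and aren't specified in this section.

```python
from typing import List

SSH_CIPHER = "aes128-ctr"
SSH_KEY_EXCHANGE = "diffie-hellman-group14-sha1"


def build_ssh_command(host: str, key_path: str, user: str = "admin") -> List[str]:
    """Return an OpenSSH argument list that pins the cipher and key exchange."""
    return [
        "ssh",
        "-i", key_path,                             # private key for the instance
        "-c", SSH_CIPHER,                           # cipher specification
        "-o", f"KexAlgorithms={SSH_KEY_EXCHANGE}",  # key exchange algorithm
        f"{user}@{host}",
    ]


# Placeholder host name and key path for illustration only.
command = build_ssh_command("agent.example.com", "/path/to/private-key.pem")
```

The resulting list can be passed to subprocess.run, or joined with spaces and run in a terminal.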
Network requirements
DataSync network requirements depend on how you plan to transfer data (for example, over the public internet or using a more private connection).
Use the following tables to help you configure network access for DataSync agents that transfer data from your self-managed storage system and through virtual private cloud (VPC), public service, or Federal Information Processing Standard (FIPS) endpoints.
Topics
• Network requirements to connect to your self-managed storage (p. 11)
• Network requirements when using VPC endpoints (p. 12)
• Network requirements when using public service endpoints or FIPS endpoints (p. 15)
• Required network interfaces for data transfers (p. 19)
Network requirements to connect to your self-managed storage
To minimize network latency, deploy the DataSync agent close to your self-managed storage. Doing this ensures that files travel over the network between the DataSync agent and the DataSync service using our purpose-built, accelerated protocol, which significantly speeds up transfers.
The following ports are required for communication between the DataSync agent and your Network File System (NFS) server, Hadoop Distributed File System (HDFS) cluster, Server Message Block (SMB) server, or Amazon S3 API compatible storage.
From | To | Protocol | Port | How used
Agent | NFS server | TCP/UDP | 2049 (NFS) | By the DataSync agent to mount a source NFS file system. Supports NFS v3.x, NFS v4.0, and NFS v4.1.
Agent | SMB server | TCP/UDP | 139 (SMB) or 445 (SMB) | By the DataSync agent to mount a source SMB file share. Supports SMB 2.1 and SMB 3 versions.
Agent | Self-managed object storage | TCP | 443 (HTTPS) or 80 (HTTP) | By the DataSync agent to access your self-managed object storage.
Agent | Hadoop cluster | TCP | NameNode port (default is 8020) | By the DataSync agent to access the NameNodes in your Hadoop cluster. Specify the port used when creating an HDFS location.
Agent | Hadoop cluster | TCP | DataNode port (default is 50010) | By the DataSync agent to access the DataNodes in your Hadoop cluster. The DataSync agent automatically determines the port to use.
Agent | Hadoop Key Management Server (KMS) | TCP | KMS port (default 9600) | By the DataSync agent to access the KMS for your Hadoop cluster.
Agent | Kerberos Key Distribution Center (KDC) server | TCP | KDC port (default 88) | By the DataSync agent when authenticating to the Kerberos realm. This port is used only with HDFS.
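Before deploying an agent, it can help to confirm that the storage ports listed above accept TCP connections. The following Python sketch does this from the machine where it runs, so it only approximates what the agent sees unless it runs on the same network segment. The dictionary keys and the function name are hypothetical, and the sketch covers TCP only.

```python
import socket
from typing import Dict

# Default TCP ports taken from the table above; the keys are hypothetical labels.
DEFAULT_TCP_PORTS: Dict[str, int] = {
    "nfs": 2049,
    "smb": 445,
    "object_storage_https": 443,
    "hdfs_namenode": 8020,
    "hdfs_datanode": 50010,
    "hadoop_kms": 9600,
    "kerberos_kdc": 88,
}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, tcp_port_open("nfs.example.internal", DEFAULT_TCP_PORTS["nfs"]) checks a placeholder NFS server on port 2049. The agent's local console provides its own connectivity tests, which remain the more direct check.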
Network requirements when using VPC endpoints
If you use only private IP addresses, you can ensure that your VPC can't be reached over the internet, and you can prevent any packets from entering or exiting the network. By using private IP addresses, you can eliminate all internet access from your self-managed systems, and still use DataSync for data transfers to and from AWS.
DataSync requires the following ports for its operation when your agent is using private endpoints.
From | To | Protocol | Port | How used
Your web browser | Your DataSync agent | TCP | 80 (HTTP) | By your computer to obtain the agent activation key. After successful activation, DataSync closes the agent's port 80. The DataSync agent doesn't require port 80 to be publicly accessible. The required level of access to port 80 depends on your network configuration. Note: Alternatively, you can obtain the activation key from the agent's local console. This method does not require connectivity between the browser and your agent. For more information about using the local console to get the activation key, see Obtaining an activation key using the local console (p. 64).
Agent | Your DataSync VPC endpoint. To find the correct IP address, open the Amazon VPC console, and choose Endpoints from the left navigation pane. Choose the DataSync endpoint, and check the Subnets list to find the private IP address that corresponds to the subnet that you chose for your VPC endpoint setup. For more information, see step 5 in Configuring DataSync to use private IP addresses for data transfer (p. 56). | TCP | 1024–1064 | For control traffic between the DataSync agent and the AWS service.
Agent | Your task's elastic network interfaces. To find the related IP addresses, open the Amazon EC2 console and choose Network Interfaces from the left navigation pane. To see the four network interfaces for the task, enter your task ID in the search filter. For more information, see step 9 in Configuring DataSync to use private IP addresses for data transfer (p. 56). | TCP | 443 (HTTPS) | For data transfer from the DataSync VM to the AWS service.
Agent | Your DataSync VPC endpoint | TCP | 22 (Support channel) | To allow AWS Support to access your DataSync to help you with troubleshooting DataSync issues. You don't need this port open for normal operation, but it's required for troubleshooting.
Following is an illustration of the ports required by DataSync when using private endpoints.
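The agent's outbound port requirements for private endpoints can be captured as data, which is convenient when generating firewall or security group rules. The following is a minimal sketch, not an official tool: the names AGENT_OUTBOUND_RULES, agent_outbound_rules, and port_count are hypothetical, and the port values are copied from the table above.

```python
# Outbound TCP port ranges the agent needs with private (VPC) endpoints,
# copied from the table above.
# Each entry is (first_port, last_port, destination, purpose).
AGENT_OUTBOUND_RULES = [
    (1024, 1064, "DataSync VPC endpoint", "control traffic"),
    (443, 443, "task elastic network interfaces", "data transfer"),
    (22, 22, "DataSync VPC endpoint", "support channel"),
]


def agent_outbound_rules(include_support_channel=False):
    """Return the rules to allow. Port 22 is needed only for troubleshooting."""
    return [
        rule
        for rule in AGENT_OUTBOUND_RULES
        if include_support_channel or rule[3] != "support channel"
    ]


def port_count(rules):
    """Count the individual ports covered by a list of rules."""
    return sum(last - first + 1 for first, last, _, _ in rules)


if __name__ == "__main__":
    for first, last, destination, purpose in agent_outbound_rules():
        print(f"tcp {first}-{last} -> {destination} ({purpose})")
```

Leaving the support channel out by default reflects the table's statement that port 22 isn't needed for normal operation.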
Network requirements when using public service endpoints or FIPS endpoints
Your agent VM requires access to the following endpoints to communicate with AWS when using public service endpoints, or when using FIPS endpoints. Enabling this access is not necessary when using DataSync with VPC endpoints.
If you use a firewall or router to filter or limit network traffic, configure your firewall or router to allow these service endpoints. They're required to enable outbound communication between your network and AWS.
From | To | Protocol | Port | How used | Endpoints accessed by the agent
Your web browser | DataSync agent | TCP | 80 (HTTP) | Used by your computer to obtain the agent activation key. After successful activation, DataSync closes the agent's port 80. The DataSync agent doesn't require port 80 to be publicly accessible. The required level of access to port 80 depends on your network configuration. Note: Alternatively, you can obtain the activation key from the agent's local console. This method does not require connectivity between the browser and your agent. For more information about using the local console to get the activation key, see Obtaining an activation key using the local console (p. 64). | N/A
Agent | AWS | TCP | 443 (HTTPS) | Used by the DataSync agent to activate with your AWS account. You can block the public endpoints after activation. | For public endpoint activation: activation.datasync.us-east-2.amazonaws.com. For FIPS endpoint activation: activation.datasync-fips.us-east-2.amazonaws.com
Agent | AWS | TCP | 443 (HTTPS) | For communication between the DataSync agent and the AWS service endpoint. For information about Regions and service endpoints, see Choose a service endpoint (p. 26). | API endpoints: datasync.us-east-2.amazonaws.com. Data transfer endpoints: yourTaskId.datasync-dp.us-east-2.amazonaws.com, cp.datasync.us-east-2.amazonaws.com. Data transfer endpoints for FIPS: cp.datasync-fips.us-east-2.amazonaws.com
Agent | AWS | TCP | 80 (HTTP) | Allows the DataSync agent to get updates from AWS. | The activation_region variable is the AWS Region you used to activate your DataSync agent. repo.default.amazonaws.com, packages.us-west-1.amazonaws.com, packages.sa-east-1.amazonaws.com, repo.$activation_region.amazonaws.com, packages.$activation_region.amazonaws.com, *.s3.$activation_region.amazonaws.com
Agent | AWS | TCP | 443 (HTTPS) | Allows the DataSync agent to get updates from AWS. | The activation_region variable is the AWS Region you used to activate your DataSync agent. amazonlinux.default.amazonaws.com, cdn.amazonlinux.com, amazonlinux-2-repos-us-east-1.s3.dualstack.$activation_region.amazonaws.com, amazonlinux-2-repos-us-east-1.s3.$activation_region.amazonaws.com
Agent | Domain Name Service (DNS) server | TCP/UDP | 53 (DNS) | For communication between DataSync agent and the DNS server. | N/A
Agent | AWS | TCP | 22 (Support channel) | Allows AWS Support to access your DataSync to help you with troubleshooting DataSync issues. You don't need this port open for normal operation, but it's required for troubleshooting. | AWS support channel: 54.201.223.107
Agent | Network Time Protocol (NTP) server | UDP | 123 (NTP) | Used by local systems to synchronize the VM time to the host time. | NTP: 0.amazon.pool.ntp.org, 1.amazon.pool.ntp.org, 2.amazon.pool.ntp.org, 3.amazon.pool.ntp.org. Note: If you want to change the default NTP configuration of your VM agent to use a different NTP server using the local console, see Configuring a Network Time Protocol (NTP) server for VMware agents (p. 68).
The following diagram shows the ports required by DataSync when using public service endpoints or FIPS endpoints.
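When building a firewall allow list, it can help to expand the hostnames in the table above for a specific Region. The following sketch covers only the public (non-FIPS) hostnames on TCP ports 80 and 443. It assumes that the us-east-2 examples in the table generalize to other Regions by substituting the Region name; the function name agent_allowlist and its parameters are hypothetical.

```python
def agent_allowlist(activation_region, task_id="yourTaskId"):
    """Group the public-endpoint hostnames from the table by outbound TCP port.

    Assumption: the us-east-2 hostnames in the table apply to other Regions
    when the Region name is substituted.
    """
    r = activation_region
    return {
        443: [
            # Activation, API, and data transfer endpoints
            f"activation.datasync.{r}.amazonaws.com",
            f"datasync.{r}.amazonaws.com",
            f"{task_id}.datasync-dp.{r}.amazonaws.com",
            f"cp.datasync.{r}.amazonaws.com",
            # Agent update endpoints (HTTPS)
            "amazonlinux.default.amazonaws.com",
            "cdn.amazonlinux.com",
            f"amazonlinux-2-repos-us-east-1.s3.dualstack.{r}.amazonaws.com",
            f"amazonlinux-2-repos-us-east-1.s3.{r}.amazonaws.com",
        ],
        80: [
            # Agent update endpoints (HTTP)
            "repo.default.amazonaws.com",
            "packages.us-west-1.amazonaws.com",
            "packages.sa-east-1.amazonaws.com",
            f"repo.{r}.amazonaws.com",
            f"packages.{r}.amazonaws.com",
            f"*.s3.{r}.amazonaws.com",
        ],
    }


if __name__ == "__main__":
    for port, hosts in sorted(agent_allowlist("us-east-2").items()):
        for host in hosts:
            print(f"tcp/{port}\t{host}")
```

The DNS, NTP, and support channel rows are left out because they don't depend on the Region.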
Required network interfaces for data transfers
For every task you run, DataSync automatically creates elastic network interfaces (ENIs) to manage data transfer traffic. How many ENIs DataSync creates and where they’re created depends on the following details about your task:
• Whether your task requires a DataSync agent.
• Your source and destination locations (where you’re copying data from and to).
• The type of endpoint used to activate your agent.
Each ENI uses a single IP address in your subnet (the more ENIs there are, the more IP addresses you need). Use the following tables to make sure your subnet has enough IP addresses for your task.
Transfers with agents
You need a DataSync agent when copying data between a self-managed storage system and an AWS storage service.
Location | ENIs created by default | Where ENIs are created when using a public or FIPS endpoint | Where ENIs are created when using a private (VPC) endpoint
Amazon S3 | 4 | N/A (ENIs aren’t needed since DataSync communicates directly with the S3 bucket) | The subnet you specified when activating your DataSync agent.
Amazon EFS | 4 | The subnet you specify when creating the Amazon EFS location. | The subnet you specified when activating your DataSync agent.
Amazon FSx for Windows File Server | 4 | The same subnet as the preferred file server for the file system. | The subnet you specified when activating your DataSync agent.
Amazon FSx for Lustre | 4 | The same subnet as the file system. | The subnet you specified when activating your DataSync agent.
Transfers without agents
You don’t need a DataSync agent when copying data between AWS storage services.
Note
The total number of ENIs depends on your DataSync task locations. For example, transferring from an Amazon EFS location to FSx for Lustre requires four ENIs. Meanwhile, transferring from an FSx for Windows File Server to an Amazon S3 bucket requires two ENIs.
Location | ENIs created by default | Where ENIs are created
Amazon S3 | N/A (ENIs aren’t needed since DataSync communicates directly with the S3 bucket) | N/A
Amazon EFS | 2 | The subnet you specify when creating the Amazon EFS location.
Amazon FSx for Windows File Server | 2 | The same subnet as the preferred file server for the file system.
Amazon FSx for Lustre | 2 | The same subnet as the file system.
To see the ENIs allocated for your DataSync task, use the DescribeTask operation.
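To plan subnet capacity, the ENI counts in the preceding tables can be turned into a small calculation. This is a sketch under stated assumptions: the location keys ("s3", "efs", "fsx_windows", "fsx_lustre") and function names are hypothetical, the agentless total is assumed to be the sum of the per-location counts (which matches the two examples in the note above), and an Amazon S3 location with a public or FIPS endpoint is treated as needing no ENIs.

```python
# Per-location ENI counts for transfers without agents, from the table above.
ENIS_WITHOUT_AGENT = {"s3": 0, "efs": 2, "fsx_windows": 2, "fsx_lustre": 2}

# Every AWS location in the "Transfers with agents" table lists 4 ENIs.
ENIS_WITH_AGENT = 4


def enis_with_agent(aws_location, endpoint_type):
    """ENIs (and subnet IP addresses) needed for a transfer that uses an agent."""
    if aws_location not in ENIS_WITHOUT_AGENT:
        raise ValueError(f"unknown location: {aws_location}")
    if endpoint_type not in ("public", "fips", "vpc"):
        raise ValueError(f"unknown endpoint type: {endpoint_type}")
    # With a public or FIPS endpoint, DataSync talks to the S3 bucket directly.
    if aws_location == "s3" and endpoint_type in ("public", "fips"):
        return 0
    return ENIS_WITH_AGENT


def enis_without_agent(source, destination):
    """ENIs needed for a transfer between two AWS storage services."""
    return ENIS_WITHOUT_AGENT[source] + ENIS_WITHOUT_AGENT[destination]
```

For example, `enis_without_agent("efs", "fsx_lustre")` gives 4 and `enis_without_agent("fsx_windows", "s3")` gives 2, in line with the note above. Because each ENI uses one IP address, the result is also the number of free addresses the subnet needs.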
Getting started with AWS DataSync
In this topic, you can find step-by-step instructions on how to get started using AWS DataSync in the AWS Management Console.
Before you begin, we recommend reading How AWS DataSync works (p. 3) to understand the components and terms used in DataSync and how DataSync works. We also recommend reading the Using identity-based policies (IAM policies) for DataSync (p. 119) section to understand the AWS Identity and Access Management (IAM) permissions that DataSync requires.
To use AWS DataSync
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. In the upper-right corner, choose the AWS Region where you want to run DataSync. We recommend choosing the AWS Region where you plan to locate your Amazon S3 bucket, Amazon EFS file system, Amazon FSx for Windows File Server file system, or Amazon FSx for Lustre file system.
If you haven't created DataSync resources in this AWS Region, the DataSync home page appears.
3. On the DataSync home page, select whether to create the data transfer task Between on-premises storage and AWS or Between AWS Storage services.
4. Choose Get started to begin using DataSync.
If this is your first time using DataSync in this AWS Region, the Create agent page appears. From this page, you can download your virtual machine (VM) or create an Amazon EC2 instance.
If you have used DataSync in this AWS Region, the Agents page appears and you can see your agents listed.
Next, take the following steps.
Topics
• Create an agent (p. 21)
• Configure a source location (p. 29)
• Configure a destination location (p. 30)
• Configure task settings (p. 31)
• Review your settings and create your task (p. 34)
• Start your task (p. 34)
• Clean up resources (p. 35)
Create an agent
For AWS DataSync to access your self-managed storage (whether on-premises or in the cloud), you need a DataSync agent associated with your AWS account.
Tip
An agent isn't required when transferring between AWS storage services in the same AWS account. To set up a data transfer between two AWS services, see Configure a source location (p. 29).
Topics
• Deploy your DataSync agent (p. 22)
• Choose a service endpoint (p. 26)
• Activate your agent (p. 28)
Deploy your DataSync agent
Where you deploy your AWS DataSync agent depends on where you're copying data to and from and whether you're working with on-premises or in-cloud storage systems.
Topics
• Deploy your agent on VMware (p. 22)
• Deploy your agent on KVM (p. 22)
• Deploy your agent on Hyper-V (p. 23)
• Deploy your agent as an Amazon EC2 instance (p. 24)
• Deploy your agent on Snow Family devices (p. 26)
• Deploy your agent on AWS Outposts (p. 26)
Deploy your agent on VMware
You can download and deploy an AWS DataSync agent in your VMware environment and then activate it. You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.
To deploy an agent on VMware
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent as an .ova image file that you can deploy in your VMware ESXi hypervisor. The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).
AWS DataSync currently supports the VMware ESXi hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11). For information about how to deploy an .ova file in a VMware host, see the documentation for your hypervisor.
If you have previously activated an agent in this AWS Region and want to use that agent, choose that agent and choose Create agent. The Configure a source location (p. 29) page appears.
3. Power on your hypervisor, log in to your VM, and get the IP address of the agent. You need this IP address to activate the agent.
Note
The VM's default credentials are the login admin and the password password.
You can change the password on the local console. You don't need to log in to the VM for DataSync functionality. Login is mainly required for troubleshooting, such as running a connectivity test or opening a support channel with AWS. It's also required for network-specific settings, such as setting up a static IP address.
After you have deployed an agent, you choose a service endpoint (p. 26).
Deploy your agent on KVM
You can download and deploy an AWS DataSync agent in your KVM environment and then activate it.
You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.
To deploy an agent on KVM
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent in a .zip file that contains a .qcow2 image file that you can deploy in your KVM hypervisor.
The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).
AWS DataSync currently supports the KVM hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11).
To get started installing your .qcow2 image for use in KVM, use the following command.
virt-install \
    --name "datasync" \
    --description "AWS DataSync agent" \
    --os-type=generic \
    --ram=32768 \
    --vcpus=4 \
    --disk path=datasync-yyyymmdd-x86_64.qcow2,bus=virtio,size=80 \
    --network default,model=virtio \
    --graphics none \
    --import
For information about how to manage this VM, and your KVM host, see the documentation for your hypervisor.
If you previously activated an agent in this AWS Region and want to use that agent, choose that agent, and then choose Create agent. The Configure a source location (p. 29) page appears.
3. Power on your hypervisor, log in to your VM, and get the IP address of the agent. You need this IP address to activate the agent.
Note
The VM's default credentials are the login admin and the password password.
You can change the password on the local console. You don't need to log in to the VM for DataSync functionality. Login is mainly required for troubleshooting, such as running a connectivity test or opening a support channel with AWS. It's also required for network-specific settings, such as setting up a static IP address.
After you deploy an agent, you choose a service endpoint (p. 26).
Deploy your agent on Hyper-V
You can download and deploy an AWS DataSync agent in your Hyper-V environment and then activate it. You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.
To deploy an agent on Hyper-V
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent in a .zip file that contains a .vhdx image file that you can deploy in your Hyper-V hypervisor.
The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).
AWS DataSync currently supports the Hyper-V hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11). For information about how to deploy a .vhdx file in a Hyper-V host, see the documentation for your hypervisor.
If you previously activated an agent in this AWS Region and want to use that agent, choose that agent, and then choose Create agent. The Configure a source location (p. 29) page appears.
3. Power on your hypervisor, log in to your VM, and get the IP address of the agent. You need this IP address to activate the agent.
Note
The VM's default credentials are the login admin and the password password.
You can change the password on the local console. You don't need to log in to the VM for DataSync functionality. Login is mainly required for troubleshooting, such as running a connectivity test or opening a support channel with AWS. It's also required for network-specific settings, such as setting up a static IP address.
After you deploy an agent, you choose a service endpoint (p. 26).
Deploy your agent as an Amazon EC2 instance
You deploy a DataSync agent as an Amazon EC2 instance when copying data between:
• A self-managed, in-cloud storage system and an AWS storage service.
For more information about these use cases, including high-level architecture diagrams, see Deploying your DataSync agent in AWS Regions (p. 59).
• Amazon S3 on AWS Outposts (p. 26) and an AWS storage service.
Warning
We don't recommend using an Amazon EC2 agent to access your on-premises storage because of increased network latency. Instead, deploy the agent as a VMware, KVM, or Hyper-V virtual machine in your data center as close to your on-premises storage as possible.
To choose the agent AMI for your AWS Region
• Use the following CLI command to get the latest DataSync Amazon Machine Image (AMI) ID for the specified AWS Region.
aws ssm get-parameter --name /aws/service/datasync/ami --region $region
Example command and output
aws ssm get-parameter --name /aws/service/datasync/ami --region us-east-1
{
    "Parameter": {
        "Name": "/aws/service/datasync/ami",
        "Type": "String",
        "Value": "ami-id",
        "Version": 6,
        "LastModifiedDate": 1569946277.996,
        "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/datasync/ami"
    }
}
For the recommended instance types, see Amazon EC2 instance requirements (p. 11).
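If you script this lookup, the AMI ID is the Value field inside the Parameter object of the command output shown above. The following is a minimal sketch that parses that JSON; the function name ami_id_from_output is hypothetical, and SAMPLE_OUTPUT repeats the example output with its ami-id placeholder rather than a real AMI ID.

```python
import json

# Example output from the get-parameter command above ("ami-id" is a placeholder).
SAMPLE_OUTPUT = """
{
    "Parameter": {
        "Name": "/aws/service/datasync/ami",
        "Type": "String",
        "Value": "ami-id",
        "Version": 6,
        "LastModifiedDate": 1569946277.996,
        "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/datasync/ami"
    }
}
"""


def ami_id_from_output(output_text):
    """Return the AMI ID stored in Parameter.Value of the command's JSON output."""
    parameter = json.loads(output_text)["Parameter"]
    # Guard against parsing the output of a different parameter by mistake.
    if parameter["Name"] != "/aws/service/datasync/ami":
        raise ValueError(f"unexpected parameter: {parameter['Name']}")
    return parameter["Value"]


if __name__ == "__main__":
    print(ami_id_from_output(SAMPLE_OUTPUT))
```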
If you already activated an agent in the Region that can access your file system through a mount target in the same Availability Zone, and you want to use that agent, choose the agent and then choose Create agent. The Configure a source location (p. 29) page appears.
To deploy your DataSync agent as an Amazon EC2 instance
Important
To avoid data transfer charges, deploy your agent so that it doesn't require network traffic between Availability Zones. For example, deploy your agent in the Availability Zone where your self-managed file system resides.
To learn more about data transfer prices for all AWS Regions, see Amazon EC2 On-Demand pricing.
1. From the AWS account where the source file system resides, launch the agent using your AMI from the Amazon EC2 launch wizard. Use the following URL to launch the AMI.
https://console.aws.amazon.com/ec2/v2/home?region=source-file-system-region#LaunchInstanceWizard:ami=ami-id
In the URL, replace source-file-system-region and ami-id with your own source AWS Region and AMI ID. The Choose an Instance Type page appears on the Amazon EC2 console. To find the DataSync AMI ID for a specified AWS Region, use the CLI command described in the preceding procedure.
2. Choose one of the recommended instance types for your use case, and choose Next: Configure Instance Details. For the recommended instance types, see Amazon EC2 instance requirements (p. 11).
3. On the Configure Instance Details page, do the following:
a. For Network, choose the virtual private cloud (VPC) where your source Amazon EFS or NFS file system is located.
b. For Auto-assign Public IP, choose a value. For your instance to be accessible from the public internet, set Auto-assign Public IP to Enable. Otherwise, set Auto-assign Public IP to Disable.
If a public IP address isn't assigned, activate the agent in your VPC using its private IP address.
When you transfer files from an in-cloud file system, to increase performance we recommend that you choose a Placement Group value where your NFS server resides.
4. Choose Next: Add Storage. The agent doesn't require additional storage, so you can skip this step and choose Next: Add Tags.
5. (Optional) On the Add Tags page, you can add tags to your Amazon EC2 instance. When you're finished on the page, choose Next: Configure Security Group.
6. On the Configure Security Group page, do the following:
a. Make sure that the selected security group allows inbound access to HTTP port 80 from the web browser that you plan to use to activate the agent.
b. Make sure that the security group of the source file system allows inbound traffic from the agent. In addition, make sure that the agent allows outbound traffic to the source file system.
If you deploy your agent using a VPC endpoint, you need to allow additional ports. For more information, see How DataSync works with VPC endpoints (p. 56).
For the complete set of network requirements for DataSync, see Network requirements (p. 11).
7. Choose Review and Launch to review your configuration, then choose Launch to launch your instance. Remember to use a key pair that's accessible to you. A confirmation page appears and indicates that your instance is launching.
8. Choose View Instances to close the confirmation page and return to the Amazon EC2 instances screen. When you launch an instance, its initial state is pending. After the instance starts, its state changes to running. At this point, it's assigned a public Domain Name System (DNS) name and IP address, which you can find on the Description tab.
9. If you set Auto-assign Public IP to Enable, choose your instance and note the public IP address on the Description tab. You use this IP address later to connect to your DataSync agent.
If you set Auto-assign Public IP to Disable, launch or use an existing instance in your VPC to activate the agent. In this case, you use the private IP address of the DataSync agent to activate the agent from this instance in the VPC.
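The launch URL from step 1 of the preceding procedure can be assembled from a Region and an AMI ID. The following is a small sketch; the function name launch_wizard_url is hypothetical, and the URL pattern is the one shown in step 1.

```python
def launch_wizard_url(region, ami_id):
    """Build the Amazon EC2 launch wizard URL shown in step 1."""
    if not region or not ami_id:
        raise ValueError("region and ami_id are both required")
    return (
        "https://console.aws.amazon.com/ec2/v2/home"
        f"?region={region}#LaunchInstanceWizard:ami={ami_id}"
    )


if __name__ == "__main__":
    # "ami-id" is the same placeholder used in the example output earlier.
    print(launch_wizard_url("us-east-1", "ami-id"))
```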
Deploy your agent on Snow Family devices
The DataSync agent AMI is pre-installed on your Snow Family device. You can use AWS OpsHub for Snow Family or the AWS Snowball Edge CLI to launch the agent and attach a virtual interface to the agent. Then, use the virtual interface's IP address to activate the agent.
For instructions on launching the agent using AWS OpsHub, see Using DataSync to transfer files to AWS.
For instructions on launching the agent using the Snowball CLI, see Launching AWS DataSync AMI.
For information about using the AWS Snowcone client, see Using the Snowcone client.
Deploy your agent on AWS Outposts
You can launch a DataSync Amazon EC2 instance on your AWS Outpost. To learn more about launching an AMI on AWS Outposts, see Launch an instance on your Outpost in the AWS Outposts User Guide.
When using DataSync to access Amazon S3 on Outposts, you must launch the agent in a VPC that's allowed to access your Amazon S3 access point, and activate the agent in the Outpost's parent Region.
The agent must also be able to route to the Amazon S3 on Outposts endpoint for the bucket. To learn more about working with Amazon S3 on Outposts endpoints, see Working with Amazon S3 on Outposts in the Amazon S3 User Guide.
Choose a service endpoint
You must specify an endpoint that your AWS DataSync agent uses to communicate with AWS. The agent can connect to the following types of endpoints:
• Public endpoints: If you use public endpoints, all communication from your DataSync agent to AWS occurs over the public internet. For instructions, see Choose a public service endpoint (p. 27).
• Federal Information Processing Standard (FIPS) endpoints: If you need FIPS 140-2 validated cryptographic modules when accessing the AWS GovCloud (US-East) or AWS GovCloud (US-West) Region, use this endpoint to activate your agent. You use the AWS CLI or API to access this endpoint.
For more information, see Federal Information Processing Standard (FIPS) 140-2.
• Virtual private cloud (VPC) endpoints: If you use a VPC endpoint, all communication from DataSync to AWS occurs through the endpoint in your AWS VPC. This establishes a private connection between your self-managed storage system, your VPC, and AWS services, providing extra security as your data is copied over the network. For instructions, see Using AWS DataSync in a virtual private cloud (p. 56).
Note
After you choose a service endpoint type and activate your agent, you can't change it to use a different service endpoint type later. If you need to transfer data to multiple endpoint types, create a DataSync agent for each endpoint type that you use.
For more information about service endpoints, see AWS DataSync in the AWS General Reference.
Topics
• Choose a public service endpoint (p. 27)
• Choose a FIPS service endpoint (p. 27)
• Choose a VPC endpoint (p. 27)
Choose a public service endpoint
If you use a public endpoint, all communication from your DataSync agent to AWS occurs over the public internet.
To choose a public service endpoint
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. Go to the Agents page and choose Create agent.
3. In the Service endpoint section, choose Public service endpoints in AWS Region name. For a list of supported AWS Regions, see AWS DataSync in the AWS General Reference.
Next step: the section called “Activate your agent” (p. 28)
Choose a FIPS service endpoint
If you use a FIPS service endpoint, DataSync communicates with the AWS GovCloud (US) or Canada (Central) Region.
To choose a FIPS service endpoint
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. Go to the Agents page and choose Create agent.
3. In the Service endpoint section, choose the FIPS endpoint that you want. For information about supported FIPS endpoints, see AWS DataSync in the AWS General Reference.
Next step: the section called “Activate your agent” (p. 28)
Choose a VPC endpoint
If you use a VPC endpoint, all communication from DataSync to AWS services occurs through the VPC endpoint in your VPC in AWS. This approach provides a private connection between your self-managed data center, your VPC, and AWS services.
You can also use a VPC endpoint outside your VPC to connect your data center directly to AWS resources.
In this case, you use a virtual private network (VPN) or AWS Direct Connect. You set up a VPC route table to use the endpoint to access the service. For detailed information, see Routing for gateway endpoints.
To choose a VPC endpoint
1. Create a VPC endpoint. For instructions, see Creating an interface endpoint. If you already have a VPC endpoint in the AWS Region, you can use it.
Important
In step 4 of those instructions, choose com.amazonaws.region.datasync for Service Name in the table of endpoints.
For information about supported AWS Regions, see AWS DataSync in the AWS General Reference.
2. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
3. Go to the Agents page and choose Create agent.
4. In the Service endpoint section, choose VPC endpoints using AWS PrivateLink. This is the VPC endpoint that the agent has access to.
5. For VPC Endpoint, choose the private VPC endpoint that you want your agent to connect to. You noted the endpoint ID when you created the VPC endpoint.
6. For Subnet, choose the subnet in which you want to run your task. This is the subnet where the elastic network interface is created.
7. For Security Group, choose a security group for your task. This is the security group that protects your network interface for tasks that run on your agent.
For additional information about using DataSync in a VPC, see Using AWS DataSync in a virtual private cloud (p. 56).
Next step: the section called “Activate your agent” (p. 28)
Activate your agent
After you deploy your AWS DataSync agent and specify a service endpoint, you must activate the agent to associate it with your AWS account.
Note
An agent can be associated with only one AWS account at a time.
To activate your agent
1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.
2. Go to the Agents page and choose Create agent.
3. In the Activation key section, select Automatically get the activation key from your agent.
This option requires that your browser access the agent using port 80. Once activated, the agent closes the port. For more information, see Network requirements (p. 11).
Alternatively, select Manually enter your agent's activation key if you don't want a connection between your browser and agent. For more information, see Obtaining an activation key using the local console (p. 64).
4. For the Agent address, enter the agent's IP address or domain name and select Get key. Your browser connects to the IP address and gets a unique activation key from your agent.
If activation succeeds, the activation key is displayed. If the activation fails, make sure that your security group is configured properly and verify that your firewall allows the required ports.
5. (Optional) For Agent name, enter a name for your agent.
6. (Optional) For Tags, enter a key and value to add a tag to your agent. A tag is a key-value pair that helps you manage, filter, and search for your agents.
7. Choose Create agent. Your agent is listed on the Agents page. In the Service endpoint column, verify that your service endpoint is correct.
8. In the Tasks section of the page, choose Create task. The Configure source location page appears.
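If getting the activation key fails in step 4, one quick check is whether the agent's port 80 is reachable from the computer running your browser. The following is a minimal sketch of such a check using only the Python standard library; the function name port_is_open is hypothetical, and the address in the usage note is a placeholder.

```python
import socket


def port_is_open(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("192.0.2.10")` checks a placeholder agent address on port 80. A result of False after activation is expected, because the agent closes port 80 once it's activated.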
Configure a source location
A task consists of a pair of locations that data will be transferred between. The source location defines the storage system or service that you want to read data from. The destination location defines the storage system or service that you want to write data to.
For a list of all DataSync supported source and destination endpoints, see Working with locations (p. 72).
In the following walkthrough, we give an example of configuring a Network File System (NFS) file system as the source location.
To configure a different location type as your source location, see the following topics:
• Creating a location for NFS (p. 73)
• Creating a location for SMB (p. 75)
• Creating a location for HDFS (p. 77)
• Creating a location for object storage (p. 78)
• Creating a location for Amazon EFS (p. 79)
• Creating a location for FSx for Windows File Server (p. 81)
• Creating a location for FSx for Lustre (p. 83)
• Creating a location for Amazon S3 (p. 83)
To create an NFS location
1. On the Configure source location page, choose Create a new location or Choose existing location.
Create a new location enables you to define a new location and Choose existing location enables you to choose from locations that you have previously created in this AWS Region.
2. For Location type in the Configuration section, choose your NFS server from the list.
3. For Agents, choose your agent from the list. You can add more than one agent. For this walkthrough, we add only one agent.
Note
In many cases, you might be transferring from an in-cloud NFS file system or an Amazon EFS file system. In such cases, make sure that you choose an agent that you created in an Amazon EC2 instance that can access this file system.
You can't use agents that are created with different endpoint types for the same task.
4. For NFS server, enter the IP address or domain name of your NFS server. An agent that's installed on-premises uses this host name to mount the NFS server in a network. The NFS server should allow full access to all files.
5. For Mount path, enter a path that's exported by the NFS server, or a subdirectory that can be mounted by other NFS clients in your network. The path is used to read data from or write data to your NFS server.
6. Choose Next to open the Configure destination location page.
Configure a destination location
A task consists of a pair of locations that data will be transferred between. The source location defines the storage system or service that you want to read data from. The destination location defines the storage system or service that you want to write data to.
For a list of all DataSync supported source and destination endpoints, see Working with locations (p. 72).
To configure a different location type, see the following topics:
• Creating a location for NFS (p. 73)
• Creating a location for SMB (p. 75)
• Creating a location for HDFS (p. 77)
• Creating a location for object storage (p. 78)
• Creating a location for Amazon EFS (p. 79)
• Creating a location for FSx for Windows File Server (p. 81)
• Creating a location for FSx for Lustre (p. 83)
• Creating a location for Amazon S3 (p. 83)
Configure task settings
After you have created an AWS DataSync agent and configured the source and destination locations, you can configure the settings for a new task. A task is a set of two locations (source and destination) and a set of options that you use to control the behavior of the task.
You configure task settings when creating a new task in the AWS DataSync console. You can also edit task settings by opening the AWS DataSync console at https://console.aws.amazon.com/datasync/, selecting the task you want to edit, and choosing Edit.
On the Configure settings page, for Task name - optional, you can enter a name for your task.
The Options section contains configuration options for running your task. The following sections provide more details about these options.
Topics
• Data verification options (p. 31)
• Ownership and permissions-related options (p. 32)
• File metadata options and file management (p. 32)
• Bandwidth options (p. 33)
• Filtering options (p. 33)
• Scheduling and queueing options (p. 33)
• Tags and logging options (p. 34)
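The task settings described in these topics correspond to the Options structure of the CreateTask operation. The following is a minimal sketch, assuming boto3. The helper names, task name, and location ARNs are hypothetical placeholders, and the option values shown are one possible choice rather than required settings.

```python
def build_task_request(source_location_arn, destination_location_arn,
                       name=None, options=None):
    """Build keyword arguments for the DataSync CreateTask operation."""
    request = {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
    }
    if name:  # the task name is optional
        request["Name"] = name
    if options:
        request["Options"] = dict(options)
    return request


def create_task(request, region_name):
    """Send the request to DataSync and return the new task's ARN."""
    import boto3  # imported here so the builder above works without boto3 installed

    client = boto3.client("datasync", region_name=region_name)
    return client.create_task(**request)["TaskArn"]


# Placeholder ARNs and example option values for illustration only.
task_request = build_task_request(
    "arn:aws:datasync:us-east-1:111122223333:location/loc-0123456789abcdef0",
    "arn:aws:datasync:us-east-1:111122223333:location/loc-0fedcba9876543210",
    name="nfs-to-s3-example",
    options={
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # verify only the data transferred
        "BytesPerSecond": -1,                    # -1 means no bandwidth limit
        "TaskQueueing": "ENABLED",               # queue executions instead of failing
        "LogLevel": "BASIC",                     # log basic information such as errors
    },
)
```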
Data verification options
DataSync always performs data integrity checks while it transfers data. You can enable additional verification to compare the source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.
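To illustrate the idea behind the additional end-of-transfer check, the following sketch compares checksums for only the files that were transferred. It is a conceptual example, not the DataSync implementation: the use of SHA-256, the function names, and the local directory layout are all assumptions, and it compares file contents only, not metadata.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transferred(source_root, dest_root, transferred):
    """Return the relative paths whose destination copy is missing or differs."""
    mismatched = []
    for rel in transferred:
        src = Path(source_root) / rel
        dst = Path(dest_root) / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatched.append(rel)
    return mismatched


# Demo: a.txt matches on both sides, b.txt differs at the destination.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name, src_data, dst_data in [("a.txt", b"same", b"same"),
                                     ("b.txt", b"one", b"two")]:
        (Path(src) / name).write_bytes(src_data)
        (Path(dst) / name).write_bytes(dst_data)
    mismatched = verify_transferred(src, dst, ["a.txt", "b.txt"])

print(mismatched)  # ['b.txt']
```

Verifying the entire dataset would instead walk every file under both roots, which is why it takes longer on large destinations than checking only the transferred files.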
Task data verification options specify how to verify data that's transferred by the task.
Data verification options are as follows:
• Verify only the data transferred (recommended) – This option calculates the checksum of transferred files and metadata on the source. It then compares this checksum to the checksum calculated on those files at the destination at the end of the transfer. We recommend this option when transferring to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes. For more information, see Considerations when working with Amazon S3 storage classes in DataSync (p. 86).
• Verify all data in the destination – This option performs a scan at the end of the transfer of the entire source and entire destination to verify that source and destination are fully synchronized. You can't use this option when transferring to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage