AWS DataSync

AWS DataSync: User Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be aﬃliated with, connected to, or sponsored by Amazon.

What is AWS DataSync?

AWS DataSync is an online data transfer service that simpliﬁes, automates, and accelerates moving data between on-premises storage systems and AWS storage services, and also between AWS storage services. DataSync can copy data between:

• Network File System (NFS) ﬁle servers

• Server Message Block (SMB) ﬁle servers

• Hadoop Distributed File System (HDFS)

• On-premises (self-managed) object storage

• Snow Family devices

• Amazon Simple Storage Service (Amazon S3) buckets

• Amazon EFS ﬁle systems

• Amazon FSx for Windows File Server ﬁle systems

• Amazon FSx for Lustre ﬁle systems

In this guide, you can ﬁnd a description of the components of DataSync, detailed instructions on how to get started, and the API Reference.

Topics

• Use cases

• Benefits

• Additional AWS DataSync resources

Use cases

These are some of the main use cases for AWS DataSync:

• Data migration – Move active datasets rapidly over the network into Amazon S3, Amazon EFS, FSx for Windows File Server, or FSx for Lustre. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.

• Archiving cold data – Move cold data stored in on-premises storage directly to durable and secure long-term storage classes such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. Doing so can free up on-premises storage capacity and shut down legacy systems.

• Data protection – Move data into any Amazon S3 storage class, choosing the most cost-eﬀective storage class for your needs. You can also send data to Amazon EFS, FSx for Windows File Server, or FSx for Lustre for a standby ﬁle system.

• Data movement for timely in-cloud processing – Move data in to or out of AWS for processing when working with systems that generate data on-premises. This approach can speed up critical hybrid cloud workﬂows across many industries. These include machine learning in the life-sciences industry, video production in media and entertainment, big-data analytics in ﬁnancial services, and seismic research in the oil and gas industry.

Beneﬁts

By using AWS DataSync, you can get the following beneﬁts:

• Simplify and automate data movement – AWS DataSync makes it easier to move data over the network between on-premises storage and AWS storage services, and also between AWS storage services. DataSync automates both the management of data-transfer processes and the infrastructure required for high performance and secure data transfer.

• Transfer data securely – DataSync provides end-to-end security, including encryption and integrity validation, to help ensure that your data arrives securely, intact, and ready to use. DataSync accesses your AWS storage through built-in AWS security mechanisms, such as AWS Identity and Access Management (IAM) roles. It also supports virtual private cloud (VPC) endpoints, giving you the option to transfer data without traversing the public internet, and further increasing the security of data copied online.

• Move data faster – With DataSync, you can transfer data rapidly over the network into AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers. This approach speeds up migrations, recurring data-processing workflows for analytics and machine learning, and data-protection processes.

• Reduce operational costs – You can move data cost-eﬀectively with the ﬂat, per-gigabyte pricing of DataSync. You can save on script development, and deployment and maintenance costs, and avoid the need for costly commercial transfer tools.

Additional AWS DataSync resources

We recommend that you read the following:

• DataSync resources – The resources page includes blogs, videos, and other training materials.

• AWS DataSync developer forum – The AWS DataSync developer forum.

• AWS DataSync pricing – AWS DataSync pricing information.

AWS DataSync also supports Terraform. To learn more about DataSync deployment automation with Terraform, see the Terraform documentation.

How AWS DataSync works

In this section, you can ﬁnd information about components, terms, and how DataSync works.

Topics

• AWS DataSync architecture

• Components and terminology

• How DataSync transfers files

AWS DataSync architecture

Topics

• Data transfer between self-managed storage and AWS

• Data transfer between AWS storage services

• Data transfer using a DataSync EC2 agent deployed in a Region

The architectural diagrams show how DataSync transfers data between on-premises (self-managed) storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.

For a list of all DataSync supported source and destination endpoints, see Working with locations (p. 72).

Data transfer between self-managed storage and AWS

The following diagram shows a high-level view of the DataSync architecture for transferring ﬁles between self-managed storage and AWS services.

Data transfer between AWS storage services

The following diagram provides a high-level view of the DataSync architecture for transferring ﬁles between AWS services within the same AWS account. This architecture applies to both in-Region and cross-Region transfers.

Important

When you use DataSync to copy ﬁles or objects between AWS Regions, you pay for data transfer between Regions. This is billed as data transfer OUT from your source Region to your destination Region. For more information, see Data transfer pricing.

Data transfer using a DataSync EC2 agent deployed in a Region

You can use DataSync to transfer data between AWS services in different AWS accounts, or between self-managed file systems in AWS and Amazon S3, by deploying the DataSync Amazon EC2 agent in an AWS Region. For more information, see Deploying your DataSync agent in AWS Regions (p. 59).

Components and terminology

The components of DataSync include the following:

• Agent – A virtual machine (VM) that's used to read data from or write data to a self-managed location.

An agent isn't required when transferring between AWS storage services in the same AWS account.

• Location – Any source or destination location that's used in the data transfer, such as Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, Amazon FSx for Lustre, Network File System (NFS), Server Message Block (SMB), Hadoop Distributed File System (HDFS), or self-managed object storage.

• Task – A source location and a destination location, and a configuration that defines how data is transferred. A task always transfers data from the source to the destination. The configuration can include options such as task schedule, bandwidth limit, and so on. A task is the complete definition of a data transfer.

• Task execution – An individual run of a task, which includes information such as the start time, end time, bytes written, and status.
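As a rough mental model, the four components listed above can be sketched as plain Python data classes. This is only an illustration of how the terms relate to each other. The class names, field names, and example URIs are hypothetical and are not part of the DataSync API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    """A source or destination endpoint, such as an NFS share or an S3 bucket."""
    kind: str                       # illustrative label, e.g. "NFS", "SMB", "S3"
    uri: str                        # hypothetical identifier for the storage
    agent_id: Optional[str] = None  # set only for self-managed locations


@dataclass
class TaskExecution:
    """One individual run of a task."""
    status: str = "QUEUEING"
    bytes_written: int = 0


@dataclass
class Task:
    """A source, a destination, and the options that define the transfer."""
    source: Location
    destination: Location
    options: dict = field(default_factory=dict)  # e.g. schedule, bandwidth limit
    executions: List[TaskExecution] = field(default_factory=list)

    def start(self) -> TaskExecution:
        """Record a new run; data always flows from source to destination."""
        execution = TaskExecution()
        self.executions.append(execution)
        return execution


# Hypothetical example: an NFS share read through an agent, copied to a bucket.
nfs = Location(kind="NFS", uri="nfs://fileserver.example/export", agent_id="agent-1")
bucket = Location(kind="S3", uri="s3://example-bucket")
task = Task(source=nfs, destination=bucket, options={"bandwidth_limit_mbps": 100})
run = task.start()
```

In this sketch, only the self-managed location carries an agent identifier, which mirrors the point that an agent isn't required for AWS storage services in the same account.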

Agent

An agent is a VM that you own and that reads data from or writes data to self-managed storage systems. You can deploy the agent on a VMware ESXi, KVM, or Microsoft Hyper-V hypervisor, or launch it as an Amazon EC2 instance. You use the AWS DataSync console or the API to set up and activate your agent.

The activation process associates your agent VM with your AWS account. For information about agents, see Working with agents (p. 55).

An agent that's functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from issues with a network partition, firewall misconfiguration, and other events that make the agent VM unable to connect to AWS. The status of an agent that's powered off also shows as OFFLINE.

Location

A location is an endpoint of a task. Each task has two locations—a source location and a destination location. AWS DataSync supports the following location types:

• Network File System (NFS)

• Server Message Block (SMB)

• Hadoop Distributed File System (HDFS)

• On-premises (self-managed) object storage

• Amazon EFS

• Amazon FSx for Windows File Server

• Amazon FSx for Lustre

• Amazon S3

For more information, see Working with locations (p. 72).

Task

A task includes two locations (source and destination), and the configuration of how to transfer the data from one location to the other. The configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

Task execution

A task execution is an individual run of a task, which shows information such as the start time, end time, number of transferred ﬁles, and status.

A task execution has ﬁve transition phases and two terminal statuses, as shown in the following diagram.

These phases and statuses are:

• QUEUEING – This phase consists of queuing the task executions that are running using the same agent.

• LAUNCHING – During this phase, the task execution is initialized.

• PREPARING – During this phase, DataSync computes which ﬁles need to be transferred.

• TRANSFERRING – During this phase, DataSync transfers data to AWS.

• VERIFYING – During this optional phase, DataSync performs a full data and metadata integrity veriﬁcation. This phase occurs only if the VerifyMode option is enabled during conﬁguration.

• SUCCESS or ERROR – When the task is ﬁnished, DataSync sets the task to one of these terminal statuses, depending on whether it was successful.

If the VerifyMode option isn't enabled in the task conﬁguration, the terminal status is set after the TRANSFERRING phase. Otherwise, it is set after the VERIFYING phase. The two terminal statuses are these:

• SUCCESS

• ERROR

For more information, see Task execution statuses (p. 107).
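The phase ordering described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the phase and status names exactly as they are written in this section; the status strings returned by the API might differ, and the function names are hypothetical.

```python
from typing import List

# Phase and status names as listed in this section.
PHASES = ["QUEUEING", "LAUNCHING", "PREPARING", "TRANSFERRING", "VERIFYING"]
TERMINAL_STATUSES = {"SUCCESS", "ERROR"}


def expected_phases(verify_enabled: bool) -> List[str]:
    """Return the phases a successful run passes through, in order.

    VERIFYING is skipped when verification isn't enabled, so the terminal
    status is then set directly after TRANSFERRING.
    """
    phases = [p for p in PHASES if verify_enabled or p != "VERIFYING"]
    return phases + ["SUCCESS"]


def is_terminal(status: str) -> bool:
    """Return True when a task execution has finished, successfully or not."""
    return status in TERMINAL_STATUSES
```

A script that waits for a task execution could, for example, stop checking as soon as `is_terminal` returns True for the reported status.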

How DataSync transfers ﬁles

Topics

• How AWS DataSync verifies data integrity

• How DataSync handles open and locked files

When a task starts, it goes through different phases: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING phase, DataSync initializes the task execution. In the PREPARING phase, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents and metadata of files on the source and destination file systems for differences.

The time that DataSync spends in the PREPARING phase depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems and usually takes between a few minutes to a few hours. For more information, see Starting your DataSync task (p. 105).

After the scanning is done and the differences are calculated, DataSync transitions to the TRANSFERRING phase. At this point, DataSync starts transferring files and metadata from the source file system to the destination. DataSync copies changes to files with contents or metadata that are different between the source and the destination. You can narrow down the copied files by filtering the data or by configuring DataSync to not overwrite files that are already present in the destination.

Note

By default, any changes to metadata on the source storage result in this metadata being copied to the destination storage.

After the TRANSFERRING phase is done, DataSync veriﬁes consistency between the source and destination ﬁle systems. This is the VERIFYING phase.
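The comparison made in the PREPARING phase can be pictured with the following sketch. It is not how DataSync is implemented. It assumes that each side has already been listed into a dictionary that maps a path to a hypothetical metadata record of size and modification time, and it only shows the idea of selecting files that are new or different.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata record: (size_in_bytes, modification_time).
Metadata = Tuple[int, float]


def files_to_transfer(source: Dict[str, Metadata],
                      destination: Dict[str, Metadata],
                      overwrite_existing: bool = True) -> List[str]:
    """Return the source paths that are missing or different on the destination."""
    changed = []
    for path, meta in sorted(source.items()):
        if path not in destination:
            changed.append(path)      # new file on the source
        elif overwrite_existing and destination[path] != meta:
            changed.append(path)      # present on both sides but different
    return changed


source = {"a.txt": (10, 1.0), "b.txt": (20, 2.0), "c.txt": (5, 3.0)}
destination = {"a.txt": (10, 1.0), "b.txt": (21, 2.0)}
to_copy = files_to_transfer(source, destination)
to_copy_no_overwrite = files_to_transfer(source, destination, overwrite_existing=False)
```

Here `to_copy` contains `b.txt` and `c.txt`, while `to_copy_no_overwrite` contains only `c.txt`, which corresponds to configuring a task so that files already present in the destination aren't overwritten.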

How AWS DataSync veriﬁes data integrity

When DataSync transfers data, it always performs data integrity checks during the transfer. You can enable additional verification to compare the source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.

AWS DataSync locally calculates the checksum of every file in the source file system and the destination and compares them. DataSync also compares the metadata of every file in the source and the destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, Files were removed, and so on.

For more information, see DataSync task creation statuses (p. 105) and Enable veriﬁcation in the Conﬁguring task settings (p. 99) section.
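The idea of checksum-based verification can be reproduced by hand with the following sketch, for example to spot-check a small directory tree after a transfer. It is an illustration only. This guide doesn't state which checksum algorithm DataSync uses, so SHA-256 is an assumption, and the function names and message strings are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so that large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to its checksum."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in root.rglob("*") if p.is_file()}


def verify(source_root: Path, destination_root: Path) -> List[str]:
    """Return a sorted list of differences; an empty list means the trees match."""
    src, dst = checksums(source_root), checksums(destination_root)
    problems = []
    for name in sorted(src.keys() | dst.keys()):
        if name not in dst:
            problems.append(f"missing on destination: {name}")
        elif name not in src:
            problems.append(f"extra on destination: {name}")
        elif src[name] != dst[name]:
            problems.append(f"checksum mismatch: {name}")
    return problems
```

This sketch compares file contents only. Comparing metadata such as permissions or timestamps would need an additional check.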

How DataSync handles open and locked ﬁles

In general, DataSync can transfer open ﬁles without any limitations.

If a file is open and it's being written to during the transfer, DataSync detects data inconsistency during the VERIFYING phase. This phase is when DataSync detects whether the file on the source is different from the file on the destination.

If a file is locked and the server prevents DataSync from opening it, DataSync skips transferring it. DataSync logs an error during the TRANSFERRING phase and sends a verification error.

Setting up

To get started, you ﬁrst sign up for AWS. If you are a ﬁrst-time user, we recommend that you read the Regions and requirements section.

Topics

• Sign up for AWS

• AWS Regions and endpoints

• How to access AWS DataSync

• DataSync pricing

Sign up for AWS

To use AWS DataSync, you need an AWS account that gives you access to all AWS resources, forums, support, and usage reports. You aren't charged for any of the services unless you use them. If you already have an AWS account, you can skip this step.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a veriﬁcation code on the phone keypad.

AWS Regions and endpoints

AWS DataSync is available in the following AWS Regions.

How to access AWS DataSync

You can use the DataSync management console to perform various sync conﬁguration and management tasks.

Additionally, you can use the AWS DataSync API or the AWS CLI to programmatically conﬁgure and manage DataSync. For more information about the API, see API reference (p. 149).

You can also use the AWS SDKs to develop applications that interact with DataSync. The AWS SDKs for Java, .NET, and PHP wrap the underlying DataSync API to simplify your programming tasks. For information about downloading the SDK libraries, see Sample code libraries.

DataSync pricing

For information about AWS DataSync pricing, see the AWS DataSync pricing page.

Requirements for AWS DataSync

AWS DataSync agent and network requirements vary based on where and how you plan to transfer data.

Topics

• Agent requirements

• Network requirements

Agent requirements

Your AWS DataSync agent must adhere to the requirements that apply to your scenario.

Topics

• Supported hypervisors

• Virtual machine requirements

• Amazon EC2 instance requirements

Supported hypervisors

DataSync supports the following hypervisor versions and hosts:

• VMware ESXi Hypervisor (version 6.5, 6.7, or 7.0): A free version of VMware is available on the VMware website. You also need a VMware vSphere client to connect to the host.

Note

When VMware ends general support for an ESXi hypervisor version, DataSync also ends support for that version. For information about VMware's supported hypervisor versions, see VMware lifecycle policy on the VMware website.

• Microsoft Hyper-V Hypervisor (version 2012 R2, 2016, or 2019): A free, standalone version of Hyper-V is available at the Microsoft Download Center. For this setup, you need a Microsoft Hyper-V Manager on a Microsoft Windows client computer to connect to the host.

Note

The DataSync VM is a generation 1 virtual machine. For more information about the differences between generation 1 and generation 2 VMs, see Should I create a generation 1 or 2 virtual machine in Hyper-V?

• Linux Kernel-based Virtual Machine (KVM): A free, open-source virtualization technology. KVM is included in Linux versions 2.6.20 and newer. AWS DataSync is tested and supported for the CentOS/RHEL 7.8, Ubuntu 16.04 LTS, and Ubuntu 18.04 LTS distributions. Any other modern Linux distribution might work, but function or performance is not guaranteed. We recommend this option if you already have a KVM environment up and running and you're already familiar with how KVM works.

Note

Running KVM on Amazon EC2 isn't supported, and cannot be used for DataSync agents. To run the agent on Amazon EC2, deploy an agent Amazon Machine Image (AMI). For more information about deploying an agent AMI on Amazon EC2, see Deploy your agent as an Amazon EC2 instance (p. 24).

• Amazon EC2 instance: DataSync provides an Amazon Machine Image (AMI) that contains the DataSync VM image. For the recommended instance types, see Amazon EC2 instance requirements (p. 11).

Virtual machine requirements

When deploying AWS DataSync on-premises, make sure that the underlying hardware where you deploy the DataSync VM can dedicate the following minimum resources:

• Virtual processors: Four virtual processors assigned to the VM.

• Disk space: 80 GB of disk space for installation of VM image and system data.

• RAM: Depending on your conﬁguration, one of the following:

• 32 GB of RAM assigned to the VM, for tasks that transfer up to 20 million ﬁles.

• 64 GB of RAM assigned to the VM, for tasks that transfer more than 20 million ﬁles.

Amazon EC2 instance requirements

When deploying a DataSync agent with Amazon EC2, the instance size must be at least 2xlarge.

We recommend using one of the following instance sizes:

• m5.2xlarge: For tasks to transfer up to 20 million ﬁles.

• m5.4xlarge: For tasks to transfer more than 20 million ﬁles.

Note

An exception to this is if you're running DataSync on an AWS Snowcone device. Use the default instance snc1.medium, which provides 2 CPU cores and 4 GiB of memory.
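The sizing guidance for VM and Amazon EC2 agents can be captured in a small helper, for example as part of a deployment script. This is a minimal sketch of the thresholds stated above; the function and constant names are hypothetical.

```python
# "20 million files" is the threshold used in the requirements above.
FILE_COUNT_THRESHOLD = 20_000_000


def recommended_vm_ram_gb(file_count: int) -> int:
    """RAM to assign to an on-premises agent VM for a task of this size."""
    return 32 if file_count <= FILE_COUNT_THRESHOLD else 64


def recommended_ec2_instance(file_count: int, on_snowcone: bool = False) -> str:
    """Recommended instance size for an Amazon EC2 agent."""
    if on_snowcone:
        return "snc1.medium"  # default instance on an AWS Snowcone device
    return "m5.2xlarge" if file_count <= FILE_COUNT_THRESHOLD else "m5.4xlarge"
```

For a task that transfers 35 million files, the helper returns 64 GB for a VM agent and m5.4xlarge for an EC2 agent.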

To connect to an Amazon EC2 agent using SSH, you must use the following cryptographic algorithms:

• SSH cipher: aes128-ctr

• Key exchange: diffie-hellman-group14-sha1
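With an OpenSSH client, these algorithms can be selected on the command line with the -c option and the KexAlgorithms setting. The following sketch only builds the argument list. The host, user name, and key path are placeholders that you supply; they are assumptions, not values taken from this guide.

```python
from typing import List


def ssh_command(host: str, user: str, key_path: str) -> List[str]:
    """Build an OpenSSH command line that uses the required algorithms."""
    return [
        "ssh",
        "-i", key_path,                                     # private key file
        "-c", "aes128-ctr",                                 # SSH cipher
        "-o", "KexAlgorithms=diffie-hellman-group14-sha1",  # key exchange
        f"{user}@{host}",
    ]


# Placeholder values for illustration only.
command = ssh_command("agent.example.com", "example-user", "/path/to/key.pem")
```

The resulting list can be passed to `subprocess.run` or joined with spaces to run in a terminal.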

Network requirements

DataSync network requirements depend on how you plan to transfer data (for example, over the public internet or using a more private connection).

Use the following tables to help you configure network access for DataSync agents that transfer data from your self-managed storage system through virtual private cloud (VPC) endpoints, public service endpoints, or Federal Information Processing Standard (FIPS) endpoints.

Topics

• Network requirements to connect to your self-managed storage

• Network requirements when using VPC endpoints

• Network requirements when using public service endpoints or FIPS endpoints

• Required network interfaces for data transfers

Network requirements to connect to your self-managed storage

To minimize network latency, deploy the DataSync agent close to your self-managed storage. Doing this ensures that ﬁles travel over the network between the DataSync agent and the DataSync service using our purpose-built, accelerated protocol, which signiﬁcantly speeds up transfers.

The following ports are required for communication between the DataSync agent and your Network File System (NFS) server, Hadoop Distributed File System (HDFS) cluster, Server Message Block (SMB) server, or Amazon S3 API compatible storage.

| From | To | Protocol | Port | How used |
|---|---|---|---|---|
| Agent | NFS server | TCP/UDP | 2049 (NFS) | By the DataSync agent to mount a source NFS file system. Supports NFS v3.x, NFS v4.0, and NFS v4.1. |
| Agent | SMB server | TCP/UDP | 139 (SMB) or 445 (SMB) | By the DataSync agent to mount a source SMB file share. Supports SMB 2.1 and SMB 3 versions. |
| Agent | Self-managed object storage | TCP | 443 (HTTPS) or 80 (HTTP) | By the DataSync agent to access your self-managed object storage. |
| Agent | Hadoop cluster | TCP | NameNode port (default is 8020) | By the DataSync agent to access the NameNodes in your Hadoop cluster. Specify the port used when creating an HDFS location. |
| Agent | Hadoop cluster | TCP | DataNode port (default is 50010) | By the DataSync agent to access the DataNodes in your Hadoop cluster. The DataSync agent automatically determines the port to use. |
| Agent | Hadoop Key Management Server (KMS) | TCP | KMS port (default 9600) | By the DataSync agent to access the KMS for your Hadoop cluster. |
| Agent | Kerberos Key Distribution Center (KDC) server | TCP | KDC port (default 88) | By the DataSync agent when authenticating to the Kerberos realm. This port is used only with HDFS. |
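Before you deploy an agent, you might want to confirm that the storage ports listed above are reachable from the network segment where the agent will run. The following sketch checks TCP connectivity only, using the default ports from this section. The dictionary keys and function names are hypothetical, and a successful connection doesn't prove that the storage system accepts the agent's credentials.

```python
import socket
from typing import Dict, Tuple

# Default TCP ports from this section; your Hadoop or Kerberos ports may differ.
DEFAULT_TCP_PORTS: Dict[str, Tuple[int, ...]] = {
    "nfs": (2049,),
    "smb": (139, 445),
    "object_storage": (443, 80),
    "hdfs_namenode": (8020,),
    "hdfs_datanode": (50010,),
    "hadoop_kms": (9600,),
    "kerberos_kdc": (88,),
}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_storage(host: str, kind: str) -> Dict[int, bool]:
    """Check every default port for one kind of self-managed storage."""
    return {port: tcp_port_open(host, port) for port in DEFAULT_TCP_PORTS[kind]}
```

For example, `check_storage("fileserver.example", "smb")` returns a dictionary with one entry for port 139 and one for port 445. Run the check from a host on the same network as the agent, because firewall rules often differ between segments.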

Network requirements when using VPC endpoints

If you use only private IP addresses, you can ensure that your VPC can't be reached over the internet, and you can prevent any packets from entering or exiting the network. By using private IP addresses, you can eliminate all internet access from your self-managed systems, and still use DataSync for data transfers to and from AWS.

DataSync requires the following ports for its operation when your agent is using private endpoints.

| From | To | Protocol | Port | How used |
|---|---|---|---|---|
| Your web browser | Your DataSync agent | TCP | 80 (HTTP) | By your computer to obtain the agent activation key. After successful activation, DataSync closes the agent's port 80. The DataSync agent doesn't require port 80 to be publicly accessible. The required level of access to port 80 depends on your network configuration. Note: Alternatively, you can obtain the activation key from the agent's local console. This method does not require connectivity between the browser and your agent. For more information about using the local console to get the activation key, see Obtaining an activation key using the local console (p. 64). |
| Agent | Your DataSync VPC endpoint. To find the correct IP address, open the Amazon VPC console, and choose Endpoints from the left navigation pane. Choose the DataSync endpoint, and check the Subnets list to find the private IP address that corresponds to the subnet that you chose for your VPC endpoint setup. For more information, see step 5 in Configuring DataSync to use private IP addresses for data transfer (p. 56). | TCP | 1024–1064 | For control traffic between the DataSync agent and the AWS service. |
| Agent | Your task's elastic network interfaces. To find the related IP addresses, open the Amazon EC2 console and choose Network Interfaces from the left navigation pane. To see the four network interfaces for the task, enter your task ID in the search filter. For more information, see step 9 in Configuring DataSync to use private IP addresses for data transfer (p. 56). | TCP | 443 (HTTPS) | For data transfer from the DataSync VM to the AWS service. |
| Agent | Your DataSync VPC endpoint | TCP | 22 (Support channel) | To allow AWS Support to access your DataSync to help you with troubleshooting DataSync issues. You don't need this port open for normal operation, but it's required for troubleshooting. |

Following is an illustration of the ports required by DataSync when using private endpoints.

Network requirements when using public service endpoints or FIPS endpoints

Your agent VM requires access to the following endpoints to communicate with AWS when using public service endpoints, or when using FIPS endpoints. Enabling this access is not necessary when using DataSync with VPC endpoints.

If you use a firewall or router to filter or limit network traffic, configure your firewall or router to allow these service endpoints. They're required to enable outbound communication between your network and AWS.

| From | To | Protocol | Port | How used | Endpoints accessed by the agent |
|---|---|---|---|---|---|
| Your web browser | DataSync agent | TCP | 80 (HTTP) | Used by your computer to obtain the agent activation key. After successful activation, DataSync closes the agent's port 80. The DataSync agent doesn't require port 80 to be publicly accessible. The required level of access to port 80 depends on your network configuration. Note: Alternatively, you can obtain the activation key from the agent's local console. This method does not require connectivity between the browser and your agent. For more information about using the local console to get the activation key, see Obtaining an activation key using the local console (p. 64). | N/A |
| Agent | AWS | TCP | 443 (HTTPS) | Used by the DataSync agent to activate with your AWS account. You can block the public endpoints after activation. | For public endpoint activation: activation.datasync.us-east-2.amazonaws.com. For FIPS endpoint activation: activation.datasync-fips.us-east-2.amazonaws.com |
| Agent | AWS | TCP | 443 (HTTPS) | For communication between the DataSync agent and the AWS service endpoint. For information about Regions and service endpoints, see Choose a service endpoint (p. 26). | API endpoints: datasync.us-east-2.amazonaws.com. Data transfer endpoints: yourTaskId.datasync-dp.us-east-2.amazonaws.com, cp.datasync.us-east-2.amazonaws.com. Data transfer endpoints for FIPS: cp.datasync-fips.us-east-2.amazonaws.com |
| Agent | AWS | TCP | 80 (HTTP) | Allows the DataSync agent to get updates from AWS. The activation_region variable is the AWS Region you used to activate your DataSync agent. | repo.default.amazonaws.com, packages.us-west-1.amazonaws.com, packages.sa-east-1.amazonaws.com, repo.$activation_region.amazonaws.com, packages.$activation_region.amazonaws.com, *.s3. |
| Agent | AWS | TCP | 443 (HTTPS) | Allows the DataSync agent to get updates from AWS. The activation_region variable is the AWS Region you used to activate your DataSync agent. | amazonlinux.default.amazonaws.com, cdn.amazonlinux.com, amazonlinux-2-repos-us-east-1.s3.dualstack.$activation_region.amazonaws.com, amazonlinux-2-repos-us-east-1.s3. |
| Agent | Domain Name Service (DNS) server | TCP/UDP | 53 (DNS) | For communication between DataSync agent and the DNS server. | N/A |
| Agent | AWS | TCP | 22 (Support channel) | Allows AWS Support to access your DataSync to help you with troubleshooting DataSync issues. You don't need this port open for normal operation, but it's required for troubleshooting. | AWS support channel: 54.201.223.107 |
| Agent | Network Time Protocol (NTP) server | UDP | 123 (NTP) | Used by local systems to synchronize the VM time to the host time. | NTP: 0.amazon.pool.ntp.org, 1.amazon.pool.ntp.org, 2.amazon.pool.ntp.org, 3.amazon.pool.ntp.org |

Note

If you want to change the default NTP configuration of your VM agent to use a different NTP server using the local console, see Configuring a Network Time Protocol (NTP) server for VMware agents (p. 68).

The following diagram shows the ports required by DataSync when using public service endpoints or FIPS endpoints.
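When you write firewall allow rules, it can help to generate the endpoint names for your agent's activation Region instead of typing them. The following sketch does this for a subset of the endpoints in this section. It assumes that other Regions follow the same naming pattern as the us-east-2 examples, which this section doesn't state explicitly, so confirm the names for your Region before relying on them. The function name and dictionary keys are hypothetical.

```python
from typing import Dict, List


def agent_endpoints(region: str, fips: bool = False) -> Dict[str, List[str]]:
    """Build endpoint host names for one Region from the patterns shown above."""
    suffix = "-fips" if fips else ""
    return {
        # Port 443: agent activation
        "activation": [f"activation.datasync{suffix}.{region}.amazonaws.com"],
        # Port 443: data transfer endpoint listed for both public and FIPS use
        "data_transfer": [f"cp.datasync{suffix}.{region}.amazonaws.com"],
        # Port 80: agent updates that depend on the activation Region
        "updates_http": [
            "repo.default.amazonaws.com",
            f"repo.{region}.amazonaws.com",
            f"packages.{region}.amazonaws.com",
        ],
        # UDP port 123: default NTP servers
        "ntp": [f"{i}.amazon.pool.ntp.org" for i in range(4)],
    }


endpoints = agent_endpoints("us-east-2")
fips_endpoints = agent_endpoints("us-east-2", fips=True)
```

The sketch leaves out the per-task endpoint, because its name contains a task ID that isn't known until the task exists.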

Required network interfaces for data transfers

For every task you run, DataSync automatically creates elastic network interfaces (ENIs) to manage data transfer traﬃc. How many ENIs DataSync creates and where they’re created depends on the following details about your task:

• Whether your task requires a DataSync agent.

• Your source and destination locations (where you’re copying data from and to).

• The type of endpoint used to activate your agent.

Each ENI uses a single IP address in your subnet (the more ENIs there are, the more IP addresses you need). Use the following tables to make sure your subnet has enough IP addresses for your task.

Transfers with agents

You need a DataSync agent when copying data between a self-managed storage system and an AWS storage service.

| Location | ENIs created by default | Where ENIs are created when using a public or FIPS endpoint | Where ENIs are created when using a private (VPC) endpoint |
|---|---|---|---|
| Amazon S3 | 4 | N/A (ENIs aren't needed since DataSync communicates directly with the S3 bucket) | The subnet you specified when activating your DataSync agent. |
| Amazon EFS | 4 | The subnet you specify when creating the Amazon EFS location. | The subnet you specified when activating your DataSync agent. |
| Amazon FSx for Windows File Server | 4 | The same subnet as the preferred file server for the file system. | The subnet you specified when activating your DataSync agent. |
| Amazon FSx for Lustre | 4 | The same subnet as the file system. | The subnet you specified when activating your DataSync agent. |

Transfers without agents

You don’t need a DataSync agent when copying data between AWS storage services.

Note

The total number of ENIs depends on your DataSync task locations. For example, transferring from an Amazon EFS location to FSx for Lustre requires four ENIs. Meanwhile, transferring from an FSx for Windows File Server to an Amazon S3 bucket requires two ENIs.

| Location | ENIs created by default | Where ENIs are created |
|---|---|---|
| Amazon S3 | N/A (ENIs aren't needed since DataSync communicates directly with the S3 bucket) | N/A |
| Amazon EFS | 2 | The subnet you specify when creating the Amazon EFS location. |
| Amazon FSx for Windows File Server | 2 | The same subnet as the preferred file server for the file system. |
| Amazon FSx for Lustre | 2 | The same subnet as the file system. |

To see the ENIs allocated for your DataSync task, use the DescribeTask operation.
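For transfers without an agent, the number of ENIs, and therefore the number of free IP addresses that your subnets need, is the sum of the per-location values in the table above. The following sketch works through that arithmetic. The location keys and the function name are hypothetical.

```python
# ENIs created per location for transfers without an agent (from the table above).
AGENTLESS_ENIS = {
    "s3": 0,           # DataSync communicates directly with the S3 bucket
    "efs": 2,
    "fsx_windows": 2,
    "fsx_lustre": 2,
}


def agentless_eni_count(source: str, destination: str) -> int:
    """Total ENIs for a task between two AWS storage services.

    Each ENI uses one IP address, so this is also the number of
    IP addresses the task needs across the subnets involved.
    """
    return AGENTLESS_ENIS[source] + AGENTLESS_ENIS[destination]
```

This reproduces the examples in the preceding note: an Amazon EFS to FSx for Lustre transfer gives 2 + 2 = 4 ENIs, and an FSx for Windows File Server to Amazon S3 transfer gives 2 + 0 = 2 ENIs.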

Getting started with AWS DataSync

In this topic, you can ﬁnd step-by-step instructions on how to get started using AWS DataSync on the AWS Management Console.

Before you begin, we recommend reading How AWS DataSync works (p. 3) to understand the components and terms used in DataSync and how DataSync works. We also recommend reading the Using identity-based policies (IAM policies) for DataSync (p. 119) section to understand the AWS Identity and Access Management (IAM) permissions that DataSync requires.

To use AWS DataSync

1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

2. In the upper-right corner, choose the AWS Region where you want to run DataSync. We recommend choosing the AWS Region where you plan to locate your Amazon S3 bucket, Amazon EFS file system, Amazon FSx for Windows File Server file system, or Amazon FSx for Lustre file system.

If you haven't created DataSync resources in this AWS Region, the DataSync home page appears.

3. On the DataSync home page, select whether to create the data transfer task Between on-premises storage and AWS or Between AWS Storage services.

4. Choose Get started to begin using DataSync.

If this is your ﬁrst time using DataSync in this AWS Region, the Create agent page appears. From this page, you can download your virtual machine (VM) or create an Amazon EC2 instance.

If you have used DataSync in this AWS Region, the Agents page appears and you can see your agents listed.

Next, take the following steps.

Topics

• Create an agent

• Configure a source location

• Configure a destination location

• Configure task settings

• Review your settings and create your task

• Start your task

• Clean up resources

Create an agent

For AWS DataSync to access your self-managed storage (whether on-premises or in the cloud), you need a DataSync agent associated with your AWS account.

Tip

An agent isn't required when transferring between AWS storage services in the same AWS account. To set up a data transfer between two AWS services, see Conﬁgure a source location (p. 29).

Topics

• Deploy your DataSync agent (p. 22)

• Choose a service endpoint (p. 26)

• Activate your agent (p. 28)

Deploy your DataSync agent

Where you deploy your AWS DataSync agent depends on where you're copying data to and from and whether you're working with on-premises or in-cloud storage systems.

Topics

• Deploy your agent on VMware (p. 22)

• Deploy your agent on KVM (p. 22)

• Deploy your agent on Hyper-V (p. 23)

• Deploy your agent as an Amazon EC2 instance (p. 24)

• Deploy your agent on Snow Family devices (p. 26)

• Deploy your agent on AWS Outposts (p. 26)

Deploy your agent on VMware

You can download and deploy an AWS DataSync agent in your VMware environment and then activate it. You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.

To deploy an agent on VMware

1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent image, which you then deploy in your VMware ESXi hypervisor. The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).

AWS DataSync currently supports the VMware ESXi hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11). For information about how to deploy an .ova ﬁle in a VMware host, see the documentation for your hypervisor.

If you have previously activated an agent in this AWS Region and want to use that agent, choose that agent and choose Create agent. The Conﬁgure a source location (p. 29) page appears.

3. Power on your hypervisor, log in to your VM, and get the IP address of the agent. You need this IP address to activate the agent.

Note

The VM's default credentials are the login admin and the password password.

You can change the password on the local console. You don't need to log in to the VM for DataSync functionality. Login is mainly required for troubleshooting, such as running a connectivity test or opening a support channel with AWS. It's also required for network-specific settings, such as setting up a static IP address.

After you have deployed an agent, you choose a service endpoint (p. 26).

Deploy your agent on KVM

You can download and deploy an AWS DataSync agent in your KVM environment and then activate it.

You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.

To deploy an agent on KVM

1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent in a .zip file that contains a .qcow2 image file that you can deploy in your KVM hypervisor.

The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).

AWS DataSync currently supports the KVM hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11).

To get started installing your .qcow2 image for use in KVM, use the following command.

virt-install \
    --name "datasync" \
    --description "AWS DataSync agent" \
    --os-type=generic \
    --ram=32768 \
    --vcpus=4 \
    --disk path=datasync-yyyymmdd-x86_64.qcow2,bus=virtio,size=80 \
    --network default,model=virtio \
    --graphics none \
    --import

For information about how to manage this VM, and your KVM host, see the documentation for your hypervisor.

If you previously activated an agent in this AWS Region and want to use that agent, choose that agent, and then choose Create agent. The Conﬁgure a source location (p. 29) page appears.

Note

The VM's default credentials are the login admin and the password password.

After you deploy an agent, you choose a service endpoint (p. 26).

Deploy your agent on Hyper-V

You can download and deploy an AWS DataSync agent in your Hyper-V environment and then activate it. You can also use an existing agent instead of deploying a new one. You can use a previously created agent if it can access your self-managed storage and if it's activated in the same AWS Region.

To deploy an agent on Hyper-V

1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

2. If you don't have an agent, on the Create agent page in the console, choose Download image in the Deploy agent section. Doing this downloads the agent in a .zip file that contains a .vhdx image file that you can deploy in your Hyper-V hypervisor.

The agent is available as a VM. If you want to deploy the agent as an Amazon EC2 instance, see Deploy your agent as an Amazon EC2 instance (p. 24).

AWS DataSync currently supports the Hyper-V hypervisor. For information about hardware requirements for the VM, see Virtual machine requirements (p. 11). For information about how to deploy a .vhdx ﬁle in a Hyper-V host, see the documentation for your hypervisor.

If you previously activated an agent in this AWS Region and want to use that agent, choose that agent, and then choose Create agent. The Conﬁgure a source location (p. 29) page appears.

Note

The VM's default credentials are the login admin and the password password.

After you deploy an agent, you choose a service endpoint (p. 26).

Deploy your agent as an Amazon EC2 instance

You deploy a DataSync agent as an Amazon EC2 instance when copying data between:

• A self-managed, in-cloud storage system and an AWS storage service.

For more information about these use cases, including high-level architecture diagrams, see Deploying your DataSync agent in AWS Regions (p. 59).

• Amazon S3 on AWS Outposts (p. 26) and an AWS storage service.

Warning

We don't recommend using an Amazon EC2 agent to access your on-premises storage because of increased network latency. Instead, deploy the agent as a VMware, KVM, or Hyper-V virtual machine in your data center as close to your on-premises storage as possible.

To choose the agent AMI for your AWS Region

• Use the following CLI command to get the latest DataSync Amazon Machine Image (AMI) ID for the speciﬁed AWS Region.

aws ssm get-parameter --name /aws/service/datasync/ami --region $region

Example command and output

aws ssm get-parameter --name /aws/service/datasync/ami --region us-east-1

{
    "Parameter": {
        "Name": "/aws/service/datasync/ami",
        "Type": "String",
        "Value": "ami-id",
        "Version": 6,
        "LastModifiedDate": 1569946277.996,
        "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/datasync/ami"
    }
}
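When you only need the AMI ID itself, for example to pass it to another command, you can ask the AWS CLI to print just the Value field instead of the full JSON document. The following is a sketch, not the only way to do this: datasync_ami_id is a hypothetical helper name, and the Region is supplied by the caller.

```shell
# Sketch: print only the DataSync AMI ID for a given AWS Region.
# datasync_ami_id is a hypothetical helper name; $1 is the Region, such as us-east-1.
datasync_ami_id() {
  aws ssm get-parameter \
    --name /aws/service/datasync/ami \
    --region "$1" \
    --query 'Parameter.Value' \
    --output text
}
```

For example, ami_id="$(datasync_ami_id us-east-1)" stores the ID in a shell variable for later use.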

For the recommended instance types, see Amazon EC2 instance requirements (p. 11).

If you previously activated an agent in this AWS Region that can access your file system through a mount target in the same Availability Zone, and you want to use that agent, choose the agent, and then choose Create agent. The Configure a source location (p. 29) page appears.

To deploy your DataSync agent as an Amazon EC2 instance

Important

To avoid charges, deploy your agent so that it doesn't require network traffic between Availability Zones. For example, deploy your agent in the Availability Zone where your self-managed file system resides.

To learn more about data transfer prices for all AWS Regions, see Amazon EC2 On-Demand pricing.

1. From the AWS account where the source ﬁle system resides, launch the agent using your AMI from the Amazon EC2 launch wizard. Use the following URL to launch the AMI.

https://console.aws.amazon.com/ec2/v2/home?region=source-file-system-region#LaunchInstanceWizard:ami=ami-id

In the URL, replace source-file-system-region and ami-id with your own source AWS Region and AMI ID. The Choose an Instance Type page appears on the Amazon EC2 console. To find the DataSync AMI ID for a specified AWS Region, use the aws ssm get-parameter CLI command described in the preceding section.

2. Choose one of the recommended instance types for your use case, and choose Next: Configure Instance Details. For the recommended instance types, see Amazon EC2 instance requirements (p. 11).

3. On the Conﬁgure Instance Details page, do the following:

a. For Network, choose the virtual private cloud (VPC) where your source Amazon EFS or NFS ﬁle system is located.

b. For Auto-assign Public IP, choose a value. For your instance to be accessible from the public internet, set Auto-assign Public IP to Enable. Otherwise, set Auto-assign Public IP to Disable.

If a public IP address isn't assigned, activate the agent in your VPC using its private IP address.

When you transfer ﬁles from an in-cloud ﬁle system, to increase performance we recommend that you choose a Placement Group value where your NFS server resides.

4. Choose Next: Add Storage. The agent doesn't require additional storage, so you can skip this step and choose Next: Add tags.

5. (Optional) On the Add Tags page, you can add tags to your Amazon EC2 instance. When you're ﬁnished on the page, choose Next: Conﬁgure Security Group.

6. On the Conﬁgure Security Group page, do the following:

a. Make sure that the selected security group allows inbound access to HTTP port 80 from the web browser that you plan to use to activate the agent.

b. Make sure that the security group of the source file system allows inbound traffic from the agent. In addition, make sure that the agent allows outbound traffic to the source file system.

If you deploy your agent using a VPC endpoint, you need to allow additional ports. For more information, see How DataSync works with VPC endpoints (p. 56).

For the complete set of network requirements for DataSync, see Network requirements (p. 11).

7. Choose Review and Launch to review your conﬁguration, then choose Launch to launch your instance. Remember to use a key pair that's accessible to you. A conﬁrmation page appears and indicates that your instance is launching.

8. Choose View Instances to close the confirmation page and return to the Amazon EC2 instances screen. When you launch an instance, its initial state is pending. After the instance starts, its state changes to running. At this point, it's assigned a public Domain Name System (DNS) name and IP address, which you can find in the Description tab.

9. If you set Auto-assign Public IP to Enable, choose your instance and note the public IP address in the Description tab. You use this IP address later to connect to your sync agent.

If you set Auto-assign Public IP to Disable, launch or use an existing instance in your VPC to activate the agent. In this case, you use the private IP address of the sync agent to activate the agent from this instance in the VPC.
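Two parts of the preceding procedure lend themselves to scripting: building the launch wizard URL from step 1, and opening HTTP port 80 for activation as described in step 6. The following sketch shows one way to do both. The helper names launch_wizard_url and allow_activation_http are hypothetical, and the security group ID and CIDR range are values that you supply.

```shell
# Sketch: build the Amazon EC2 launch wizard URL used in step 1.
# $1: AWS Region of the source file system, $2: DataSync AMI ID for that Region.
launch_wizard_url() {
  printf 'https://console.aws.amazon.com/ec2/v2/home?region=%s#LaunchInstanceWizard:ami=%s\n' "$1" "$2"
}

# Sketch: allow inbound HTTP (port 80) to the agent for activation (step 6a).
# $1: security group ID of the agent instance (placeholder)
# $2: CIDR range of the machine that runs your browser (placeholder)
allow_activation_http() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 80 \
    --cidr "$2"
}
```

Restricting the CIDR range to the single address that you activate from keeps port 80 closed to everyone else.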

Deploy your agent on Snow Family devices

The DataSync agent AMI is pre-installed on your Snow Family device. You can use AWS OpsHub for Snow Family or the AWS Snowball Edge CLI to launch the agent and attach a virtual interface to the agent. Then, use the virtual interface's IP address to activate the agent.

For instructions on launching the agent using AWS OpsHub, see Using DataSync to transfer ﬁles to AWS.

For instructions on launching the agent using the Snowball CLI, see Launching AWS DataSync AMI.

For information about using the AWS Snowcone client, see Using the Snowcone client.

Deploy your agent on AWS Outposts

You can launch a DataSync Amazon EC2 instance on your AWS Outpost. To learn more about launching an AMI on AWS Outposts, see Launch an instance on your Outpost in the AWS Outposts User Guide.

When using DataSync to access Amazon S3 on Outposts, you must launch the agent in a VPC that's allowed to access your Amazon S3 access point, and activate the agent in the Outpost's parent Region.

The agent must also be able to route to the Amazon S3 on Outposts endpoint for the bucket. To learn more about working with Amazon S3 on Outposts endpoints, see Working with Amazon S3 on Outposts in the Amazon S3 User Guide.

Choose a service endpoint

You must specify an endpoint that your AWS DataSync agent uses to communicate with AWS. The agent can connect to the following types of endpoints:

• Public endpoints: If you use public endpoints, all communication from your DataSync agent to AWS occurs over the public internet. For instructions, see Choose a public service endpoint (p. 27).

• Federal Information Processing Standard (FIPS) endpoints: If you need FIPS 140-2 validated cryptographic modules when accessing the AWS GovCloud (US-East) or AWS GovCloud (US-West) Region, use this endpoint to activate your agent. You use the AWS CLI or API to access this endpoint.

For more information, see Federal Information Processing Standard (FIPS) 140-2.

• Virtual private cloud (VPC) endpoints: If you use a VPC endpoint, all communication from DataSync to AWS occurs through the endpoint in your AWS VPC. This establishes a private connection between your self-managed storage system, your VPC, and AWS services, providing extra security as your data is copied over the network. For instructions, see Using AWS DataSync in a virtual private cloud (p. 56).

Note

After you choose a service endpoint type and activate your agent, you can't change it to use a different service endpoint type later. If you need to transfer data to multiple endpoint types, create a DataSync agent for each endpoint type that you use.

For more information about service endpoints, see AWS DataSync in the AWS General Reference.

Topics

• Choose a public service endpoint (p. 27)

• Choose a FIPS service endpoint (p. 27)

• Choose a VPC endpoint (p. 27)

Choose a public service endpoint

If you use a public endpoint, all communication from your DataSync agent to AWS occurs over the public internet.

To choose a public service endpoint

1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

2. Go to the Agents page and choose Create agent.

3. In the Service endpoint section, choose Public service endpoints in AWS Region name. For a list of supported AWS Regions, see AWS DataSync in the AWS General Reference.

Next step: the section called “Activate your agent” (p. 28)

Choose a FIPS service endpoint

If you use a FIPS service endpoint, DataSync communicates with the AWS GovCloud (US) or Canada (Central) Region.

To choose a FIPS service endpoint

3. In the Service endpoint section, choose the FIPS endpoint that you want. For information about supported FIPS endpoints, see AWS DataSync in the AWS General Reference.

Next step: the section called “Activate your agent” (p. 28)

Choose a VPC endpoint

If you use a VPC endpoint, all communication from DataSync to AWS services occurs through the VPC endpoint in your VPC in AWS. This approach provides a private connection between your self-managed data center, your VPC, and AWS services.

You can also use a VPC endpoint outside your VPC to connect your data center directly to AWS resources.

In this case, you use a virtual private network (VPN) or AWS Direct Connect. You set up a VPC route table to use the endpoint to access the service. For detailed information, see Routing for gateway endpoints.

To choose a VPC endpoint

1. Create a VPC endpoint. For instructions, see Creating an interface endpoint. If you already have a VPC endpoint in the AWS Region, you can use it.

Important

In step 4 of the preceding instructions, choose com.amazonaws.region.datasync for Service Name in the table of endpoints.

For information about supported AWS Regions, see AWS DataSync in the AWS General Reference.

4. In the Service endpoint section, choose VPC endpoints using AWS PrivateLink. This is the VPC endpoint that the agent has access to.

5. For VPC Endpoint, choose the private VPC endpoint that you want your agent to connect to. You noted the endpoint ID when you created the VPC endpoint.

6. For Subnet, choose the subnet in which you want to run your task. This is the subnet where the elastic network interface is created.

7. For Security Group, choose a security group for your task. This is the security group that protects your network interface for tasks that run on your agent.

For additional information about using DataSync in a VPC, see Using AWS DataSync in a virtual private cloud (p. 56).
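If you create the VPC endpoint from the AWS CLI instead of the console, the same service name applies. The following is a minimal sketch that assumes a single subnet and a single security group. The helper names datasync_service_name and create_datasync_vpc_endpoint are hypothetical, and every ID is a placeholder that you replace with your own values.

```shell
# Sketch: derive the DataSync service name for a Region,
# following the com.amazonaws.region.datasync pattern.
datasync_service_name() {
  printf 'com.amazonaws.%s.datasync\n' "$1"
}

# Sketch: create an interface VPC endpoint for DataSync.
# $1: AWS Region, $2: VPC ID, $3: subnet ID, $4: security group ID (all placeholders)
create_datasync_vpc_endpoint() {
  aws ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-id "$2" \
    --vpc-endpoint-type Interface \
    --service-name "$(datasync_service_name "$1")" \
    --subnet-ids "$3" \
    --security-group-ids "$4"
}
```

Note the endpoint ID in the command output, because you choose that endpoint when you create the agent.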

Next step: the section called “Activate your agent” (p. 28)

Activate your agent

After you deploy your AWS DataSync agent and specify a service endpoint, you must activate the agent to associate it with your AWS account.

Note

An agent can be associated with only one AWS account at a time.

To activate your agent

3. In the Activation key section, select Automatically get the activation key from your agent.

This option requires that your browser access the agent using port 80. Once activated, the agent closes the port. For more information, see Network requirements (p. 11).

Alternatively, select Manually enter your agent's activation key if you don't want a connection between your browser and agent. For more information, see Obtaining an activation key using the local console (p. 64).

4. For the Agent address, enter the agent's IP address or domain name and select Get key. Your browser connects to the IP address and gets a unique activation key from your agent.

If activation succeeds, the activation key is displayed. If the activation fails, make sure that your security group is conﬁgured properly and verify that your ﬁrewall allows the required ports.

5. (Optional) For Agent name, enter a name for your agent.

6. (Optional) For Tags, enter a key and value to add a tag to your agent. A tag is a key-value pair that helps you manage, ﬁlter, and search for your agents.

7. Choose Create agent. Your agent is listed on the Agents page. In the Service endpoint column, verify that your service endpoint is correct.

8. In the Tasks section of the page, choose Create task. The Conﬁgure source location page appears.
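Activation can also be scripted for an agent that uses a public service endpoint. The following sketch requests the activation key from the agent over port 80 and then registers the agent with the create-agent command. Treat the details as assumptions to verify: the helper names get_activation_key and register_agent are hypothetical, and the query parameters gatewayType, activationRegion, and no_redirect follow the form that we understand the DataSync CLI instructions to use, so confirm them against the current documentation for your agent version.

```shell
# Sketch: request an activation key from a deployed agent over HTTP port 80.
# $1: IP address or domain name of the agent, $2: AWS Region to activate in.
# The query string is an assumption; see the lead-in above.
get_activation_key() {
  curl --silent "http://$1/?gatewayType=SYNC&activationRegion=$2&no_redirect"
}

# Sketch: associate the agent with your AWS account.
# $1: activation key, $2: agent name (optional in the console), $3: AWS Region.
register_agent() {
  aws datasync create-agent \
    --activation-key "$1" \
    --agent-name "$2" \
    --region "$3"
}
```

A typical sequence is key="$(get_activation_key "$AGENT_ADDRESS" us-east-1)" followed by register_agent "$key" my-agent us-east-1, where AGENT_ADDRESS and my-agent are placeholders. As with the console flow, the machine that runs curl must be able to reach the agent on port 80.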

Conﬁgure a source location

A task consists of a pair of locations that data will be transferred between. The source location deﬁnes the storage system or service that you want to read data from. The destination location deﬁnes the storage system or service that you want to write data to.

In the following walkthrough, we give an example of conﬁguring a Network File System (NFS) ﬁle system as the source location.

To conﬁgure a diﬀerent location type as your source location, see the following topics:

• Creating a location for NFS (p. 73)

• Creating a location for SMB (p. 75)

• Creating a location for HDFS (p. 77)

• Creating a location for object storage (p. 78)

• Creating a location for Amazon EFS (p. 79)

• Creating a location for FSx for Windows File Server (p. 81)

• Creating a location for FSx for Lustre (p. 83)

• Creating a location for Amazon S3 (p. 83)

To create an NFS location

1. On the Conﬁgure source location page, choose Create a new location or Choose existing location.

Create a new location enables you to deﬁne a new location and Choose existing location enables you to choose from locations that you have previously created in this AWS Region.

2. For Location type in the Conﬁguration section, choose your NFS server from the list.

3. For Agents, choose your agent from the list. You can add more than one agent. For this walkthrough, we add only one agent.

Note

In many cases, you might be transferring from an in-cloud NFS file system or an Amazon EFS file system. In such cases, make sure that you choose an agent that you deployed as an Amazon EC2 instance that can access this file system.

You can't use agents that are created with diﬀerent endpoint types for the same task.

4. For NFS server, enter the IP address or domain name of your NFS server. An agent that's installed on-premises uses this host name to mount the NFS server in a network. The NFS server should allow full access to all ﬁles.

5. For Mount path, enter a path that's exported by the NFS server, or a subdirectory that can be mounted by other NFS clients in your network. The path is used to read data from or write data to your NFS server.

6. Choose Next to open the Conﬁgure destination location page.
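The console steps above correspond to a single AWS CLI call. The following sketch creates an NFS location with one agent. The helper name create_nfs_source is hypothetical, and the server name, mount path, and agent ARN are placeholders for the values that you would enter for NFS server, Mount path, and Agents.

```shell
# Sketch: create an NFS location that a task can use as its source.
# $1: IP address or domain name of the NFS server
# $2: mount path exported by the NFS server
# $3: ARN of the activated agent that can reach the server
create_nfs_source() {
  aws datasync create-location-nfs \
    --server-hostname "$1" \
    --subdirectory "$2" \
    --on-prem-config "AgentArns=$3"
}
```

The command returns the ARN of the new location, which you reference when you create the task.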

Conﬁgure a destination location

A task consists of a pair of locations that data will be transferred between. The source location deﬁnes the storage system or service that you want to read data from. The destination location deﬁnes the storage system or service that you want to write data to.

To conﬁgure a diﬀerent location type, see the following topics:

• Creating a location for NFS (p. 73)

• Creating a location for SMB (p. 75)

• Creating a location for HDFS (p. 77)

• Creating a location for object storage (p. 78)

• Creating a location for Amazon EFS (p. 79)

• Creating a location for FSx for Windows File Server (p. 81)

• Creating a location for FSx for Lustre (p. 83)

• Creating a location for Amazon S3 (p. 83)

Conﬁgure task settings

After you have created an AWS DataSync agent and conﬁgured the source and destination locations, you can conﬁgure the settings for a new task. A task is a set of two locations (source and destination) and a set of options that you use to control the behavior of the task.

You conﬁgure task settings when creating a new task in the AWS DataSync console. You can also edit task settings by opening the AWS DataSync console at https://console.aws.amazon.com/datasync/, selecting the task you want to edit, and choosing Edit.

On the Configure settings page, for Task name - optional, enter a name for your task.

The Options section contains conﬁguration options for running your task. The following sections provide more details about these options.

Topics

• Data veriﬁcation options (p. 31)

• Ownership and permissions-related options (p. 32)

• File metadata options and ﬁle management (p. 32)

• Bandwidth options (p. 33)

• Filtering options (p. 33)

• Scheduling and queueing options (p. 33)

• Tags and logging options (p. 34)

Data veriﬁcation options

As DataSync transfers data, it always performs data integrity checks during the transfer. You can enable additional verification to compare source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.
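When you create a task from the AWS CLI, the verification scope is set through the VerifyMode task option. The following sketch maps the two scopes described here to the API values as we understand them: ONLY_FILES_TRANSFERRED for verifying only the transferred files, and POINT_IN_TIME_CONSISTENT for verifying the entire dataset. The helper names verify_mode_for and create_task, and the scope keywords transferred and all, are hypothetical. The location ARNs and task name are placeholders.

```shell
# Sketch: translate a verification scope into a VerifyMode value.
# The keywords "transferred" and "all" are illustrative, not part of the API.
verify_mode_for() {
  case "$1" in
    transferred) echo ONLY_FILES_TRANSFERRED ;;
    all)         echo POINT_IN_TIME_CONSISTENT ;;
    *)           echo "unknown verification scope: $1" >&2; return 1 ;;
  esac
}

# Sketch: create a task with the chosen verification scope.
# $1: source location ARN, $2: destination location ARN,
# $3: verification scope (transferred or all), $4: task name
create_task() {
  mode="$(verify_mode_for "$3")" || return 1
  aws datasync create-task \
    --source-location-arn "$1" \
    --destination-location-arn "$2" \
    --name "$4" \
    --options "VerifyMode=$mode"
}
```

Failing early on an unknown scope avoids creating a task with a verification setting that you didn't intend.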

Task data veriﬁcation options specify how to verify data that's transferred by the task.

Data veriﬁcation options are as follows:

• Verify only the data transferred (recommended) – This option calculates the checksum of transferred ﬁles and metadata on the source. It then compares this checksum to the checksum calculated on those ﬁles at the destination at the end of the transfer. We recommend this option when transferring to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes. For more information, see Considerations when working with Amazon S3 storage classes in DataSync (p. 86).

• Verify all data in the destination – This option performs a scan at the end of the transfer of the entire source and entire destination to verify that source and destination are fully synchronized. You can't use this option when transferring to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage
