The following diagram displays a high-level architecture of the solution, where we use AWS DMS to move data from Microsoft SQL Server databases hosted on Amazon Relational Database Service (Amazon RDS) to Amazon Simple Storage Service (Amazon S3).
The following diagram shows the structure of the Amazon S3 bucket from the preceding diagram.
To replicate data, you need to create and configure the following artifacts in AWS DMS:
• Replication Instance — An AWS managed instance that hosts the AWS DMS engine. You control the type or size of the instance based on the workload you plan to migrate.
• Source Endpoint — An endpoint that provides connection details, data store type, and credentials to connect to a source database. For this use case, we will configure the source endpoint to point to the Amazon RDS for SQL Server database.
• Target Endpoint — AWS DMS supports several target systems including Amazon RDS, Amazon Aurora, Amazon Redshift, Amazon Kinesis Data Streams, Amazon S3, and more. For the use case, we will configure Amazon S3 as the target endpoint.
• Replication Task — A task that runs on the replication instance and connects to endpoints to replicate data from the source database to the target database
For this walkthrough, we will use the AdventureWorks sample database on an Amazon RDS for SQL Server instance as the base data for the walkthrough. The AdventureWorks database holds sales, marketing, and order data. We will use AWS DMS to move sales data from the source database to Amazon S3 object store, which can be used as a data lake for downstream analytics needs.
NoteYou can refer to the section called “Migrating a SQL Server Always On Database to AWS” (p. 8) for details on migrating from a Microsoft SQL Server Always On database instance.
We will create an AWS DMS task, which will perform a one-time full load to migrate a point in time snapshot and will then stream incremental data to the target Amazon S3 bucket. This way, sales data in the S3 bucket will be kept in sync with the source database.
Prerequisites
The following prerequisites are required to complete this walkthrough:
• Understand Amazon Relational Database Service (Amazon RDS), the applicable database technologies, and SQL.
• Create an AWS account with AWS Identity and Access Management (IAM) credentials that allows you to launch Amazon RDS and AWS Database Migration Service (AWS DMS) instances in your AWS Region.
For information about IAM credentials, see Create an IAM user.
• Understand the Amazon Virtual Private Cloud (Amazon VPC) service and security groups. For
information about using Amazon VPC with Amazon RDS, see Amazon Virtual Private Cloud (VPCs) and Amazon RDS. For information about Amazon RDS security groups, see Controlling access with security groups.
• Understand the supported features and limitations of AWS DMS. For information about AWS DMS, see What is AWS Database Migration Service?.
• Understand how to work with Microsoft SQL Server as a source and Amazon S3 data lake as a target.
For information about working with SQL Server as a source, see Using a Microsoft SQL Server database as a source for AWS DMS. For information about working with Amazon S3 as a target, see Using Amazon S3 as a target for AWS Database Migration Service.
• Understand the supported data type conversion options for SQL Server and Amazon S3. For information about data types for SQL Server as a source, see Source data types for SQL Server. For information about data types for Amazon S3 as a target (Parquet only), see Target data types for S3 Parquet.
• Audit your source SQL Server database. For each schema and all the objects under each schema, determine whether any of the objects are no longer being used. Deprecate these objects on the source SQL Server database, because there’s no need to migrate them if they aren’t being used.
For more information about AWS DMS, see Getting started with AWS Database Migration Service.
Step-by-Step Migration
The following steps provide instructions for migrating an Amazon RDS for SQL Server database to an Amazon S3 data lake. These steps assume that you have already prepared your source database as described in the section called “Prerequisites” (p. 112).
Topics
• Step 1: Create an AWS DMS Replication Instance (p. 113)
• Step 2: Configure a Source Amazon RDS for SQL Server Database (p. 114)
• Step 3: Create an AWS DMS Source Endpoint (p. 116)
• Step 4: Configure a Target Amazon S3 Bucket (p. 117)
• Step 5: Configure an AWS DMS Target Endpoint (p. 118)
• Step 6: Create an AWS DMS Task (p. 123)
• Step 7: Run the AWS DMS Task (p. 127)
Step 1: Create an AWS DMS Replication Instance
To create an AWS Database Migration Service (AWS DMS) replication instance, see Creating a replication instance. Usually, the full load phase is multi-threaded (depending on task configurations) and has a greater resource footprint than ongoing replication. Consequently, it’s advisable to start with a larger instance class and then scale down once the task is in the ongoing replication phase. Moreover, if you intend to migrate your workload using multiple tasks, monitor your replication instance metrics and re-size your instance accordingly.
For this use case, we will migrate a subset (the Sales schema) of the AdventureWorks database, which is over 3 GB in size. Because we perform a heterogenous migration without many LOB columns, we can start with a compute optimized instance like c5.xlarge running the latest AWS DMS engine version. We can later scale up or down based on resource utilization during task execution.
NoteScaling replication instance during full load and ongoing replication phases is usually based on CloudWatch metrics such as CPU, memory, I/O, and so on. Choosing the appropriate replication instance class and size depends on several factors such as number of tasks, table size, DML activity, size of transactions, Large Objects (LOB), and so on. This is out of scope for this walkthrough. To learn more about these topics, see Choosing replication instance types and Sizing a replication instance.
To create an AWS DMS replication instance, do the following:
1. Sign in to the AWS Management Console, and open the AWS DMS console at https://
console.aws.amazon.com/dms/v2/.
2. If you are signed in as an AWS Identity and Access Management (IAM) user, you must have the appropriate permissions to access AWS DMS. For more information about the permissions required, see IAM permissions needed to use AWS DMS.
3. On the Welcome page, choose Create replication instance to start a database migration.
4. On the Create replication instance page, specify your replication instance information.
For This Parameter Do This
Name Enter datalake-migration-ri. If you are
using multiple replication servers or sharing an account, choose a name that helps you quickly differentiate between the different servers.
Description Enter Migrate SQL Server to Amazon S3
data lake.
Instance class Choose dms.c5.xlarge. Each size and type of
instance class has increasing CPU, memory, and I/O capacity.
Engine version Leave the default value, which is the latest stable version of the AWS DMS replication engine.
For This Parameter Do This
Allocated storage (GiB) Choose 50.
VPC Choose the virtual private cloud (VPC) in which
your replication instance will launch. If possible, select the same VPC in which either your source or target database resides (or both).
Multi AZ If you choose Yes, AWS DMS creates a second
replication server in a different Availability Zone for failover if there is a problem with the primary replication server.
Publicly accessible If either your source or target database resides outside of the VPC in which your replication server resides, you must make your replication server policy publicly accessible.
5. Choose Create.
Step 2: Configure a Source Amazon RDS for SQL Server Database
One of the primary considerations when setting up AWS DMS replication is the load that it induces on the source database. During full load, AWS DMS tasks initiate two or three connections for each table that is configured for parallel load. Because AWS DMS settings and data volumes vary across tasks, workloads, and even across different runs of the same task, providing an estimate of resource utilization that applies for all use cases is difficult.
Ongoing replication is single-threaded and it usually consumes less resources than full load. Providing estimates for change data capture (CDC) resource utilization has the same challenges described above.
That said, you can estimate the expected increase in load on your source Amazon RDS instance, by running test AWS DMS tasks on replicas of your source Amazon RDS for SQL Server instance and monitoring the CPU, memory, IO and throughput metrics.
For our source database, we use an m5.xlarge Amazon RDS instance running Microsoft SQL Server 2019. While the steps for Amazon RDS for SQL Server creation are out of scope for this walkthrough (for more information, see the section called “Prerequisites” (p. 112)), make sure that your Amazon RDS instance has Automatic Backups turned on so that the recovery model for the database is set to FULL.
This is a pre-requisite for ongoing replication with AWS DMS. You can turn on these settings when you create or modify an existing Amazon RDS instance.
The following image displays the database settings required for ongoing replication with AWS DMS.
To perform the full load phase, AWS DMS requires read privileges to the tables in scope for migration.
For more information on required permissions, see Permissions for full load only tasks.
Connect to the Amazon RDS for SQL Server instance and run the following queries. Use a login with master user privileges for both full load and CDC.
USE AdventureWorks;
CREATE LOGIN dms_user WITH PASSWORD = 'password' CREATE USER dms_user FOR LOGIN dms_user
ALTER ROLE [db_datareader] ADD MEMBER dms_user ALTER ROLE [db_owner] ADD MEMBER dms_user GRANT VIEW DATABASE STATE to dms_user USE master;
GRANT VIEW SERVER STATE TO dms_user
NoteHere, we create a new user to perform the migration. You can skip this step if you plan to use existing logins and users that have the required privileges.
Turn on MS-CDC for your Amazon RDS for SQL Server database instance at the database level.
exec msdb.dbo.rds_cdc_enable_db 'AdventureWorks'
Because we migrate all tables in the Sales schema of the AdventureWorks database, we need to identify the total number of tables.
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE FROM information_schema.tables
WHERE TABLE_SCHEMA = 'Sales' ORDER BY TABLE_NAME
Then we need to divide tables in the following groups:
• Tables with a primary key.
• Tables with a unique index without primary key.
• Tables without a primary key and unique index.
We use the information_schema to identify tables that have a primary key or a unique index without a primary key.
SELECT a.TABLE_SCHEMA, a.TABLE_NAME, a.CONSTRAINT_TYPE, CONSTRAINT_NAME FROM information_schema.table_constraints a
JOIN information_schema.tables b ON aTABLE_SCHEMA = b.TABLE_SCHEMA AND a.TABLE_NAME = b.TABLE_NAME
WHERE b.TABLE_TYPE = 'BASE TABLE' AND a.TABLE_SCHEMA = 'Sales'
AND a.CONSTRAINT_TYPE in ('UNIQUE','PRIMARY KEY') ORDER BY a.TABLE_SCHEMA, a.TABLE_NAME
The query results show that the task has 19 tables and all of them have primary keys. For all these tables, run the following query to turn on MS-CDC at the table level.
exec sys.sp_cdc_enable_table
@source_schema = N'Sales',
@source_name = N'table_name',
@role_name = NULL,
@supports_net_changes = 1
Now, set the retention period for changes to be available on the source using the following commands.
Set the pollinginterval value to 86399 seconds to increase the retention of changes on the Amazon RDS for SQL Server instance.
EXEC sys.sp_cdc_change_job @job_type = 'capture' ,@pollinginterval = 86399 exec sp_cdc_stop_job 'capture'
exec sp_cdc_start_job 'capture' exec sys.sp_cdc_help_jobs
Set the polling interval on your secondary database to 86399 seconds too. For most use cases these settings should be enough. For databases that have a large number of transactions, you need to make additional configuration changes to make sure that the transaction log has optimal retention. For more information, see Optional settings when using Amazon RDS for SQL Server as a source for AWS DMS in the AWS DMS user guide.
For more information about ongoing replication, see Setting up ongoing replication on a Cloud SQL Server DB instance.
NoteAWS DMS does not support replicating ongoing changes from views. For more information, see Selection rules and actions.
In this walkthrough, we focus on migrating the tables and do not include views in the migration scope.
You should also look at estimating the number of records in the tables you are going to migrate as this is a useful consideration while configuring AWS DMS tasks.
Step 3: Create an AWS DMS Source Endpoint
After you configured the AWS Database Migration Service (AWS DMS) replication instance and the source Amazon RDS for SQL Server instance, ensure connectivity between these two instances. To ensure
that the replication instance can access the server and the port for the database, make changes to the relevant security groups and network access control lists. For more information about your network configuration, see Setting up a network for a replication instance.
After you completed the network configurations, you can create a source endpoint.
To create a source endpoint, do the following:
1. Open the AWS DMS console at https://console.aws.amazon.com/dms/v2/.
2. Choose Endpoints.
3. Choose Create endpoint.
4. On the Create endpoint page, enter the following information.
For This Parameter Do This
Endpoint type Choose Source endpoint, turn on Select RDS DB
instance, and choose datalake-source-db RDS instance.
Endpoint identifier Enter datalake-source-db.
Source engine Choose Microsoft SQL Server.
Access to endpoint database Choose Provide access information manually.
Server name Enter the database server name on Amazon RDS.
Port Enter 1433.
Secure Socket Layer (SSL) mode Choose none.
User name Enter dms_user.
Password Enter the password that you created for the
dms_user user.
Database name Enter AdventureWorks.
5. Choose Create endpoint.
NoteTo migrate a Microsoft SQL Server Always On database, you need to use different
configurations. For more information, see the section called “Migrating a SQL Server Always On Database to AWS” (p. 8).
Step 4: Configure a Target Amazon S3 Bucket
You can integrate Amazon S3 with other AWS and third-party services to take advantage of the following:
• Data analysis using Amazon Athena query engine. This service helps reduce cost as you do not pay for dedicated resources and instead pay based on the amount data being scanned.
• Perform extract, transform, and load (ETL) operations using distributed processing frameworks such as Spark with Amazon EMR or AWS Glue.
• Implement machine learning use cases, because Amazon S3 can store granular time series data spanning years in raw form, in conjunction with Amazon SageMaker.
Because in this use case we migrate the Sales schema to Amazon S3, we need to account for future use cases of the migrated data before we set up Amazon S3 bucket and AWS DMS endpoints.
To create the Amazon S3 bucket, do the following:
1. Open the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home.
2. Choose Create bucket.
3. For Bucket name, enter adventure-works-datalake.
4. For AWS Region, choose the region that hosts your AWS DMS replication instance.
5. Leave the default values in the other fields and choose Create bucket.
Step 5: Configure an AWS DMS Target Endpoint
To use Amazon S3 as an AWS Database Migration Service (AWS DMS) target endpoint, create an IAM role with write and delete access to the S3 bucket. Then add DMS (dms.amazonaws.com) as trusted entity in this IAM role. For more information, see Prerequisites for using Amazon S3 as a target.
When using AWS DMS to migrate data to an Amazon Simple Storage Service (Amazon S3) data lake, you can change the default task behavior, such as file formats, partitioning, file sizing, and so on. This leads to minimizing post-migration processing and helps downstream applications consume data efficiently.
You can customize task behavior using endpoint settings and extra connection attributes (ECA). Most of the Amazon S3 endpoint settings and ECA settings overlap, except for a few parameters. In this walkthrough, we will configure Amazon S3 endpoint settings.