Prerequisites
You must complete the following tasks before you can start this tutorial.
1. If you have not already done so, sign up for an AWS account by following the steps at the section called “Signing up for AWS” (p. 14).
2. Create credentials by following the steps at the section called “Creating credentials” (p. 16).
NoteDSBulk currently only supports service-specific credentials. If you prefer to manage access to Amazon Keyspaces by using AWS Identity and Access Management users and roles, you can use the SigV4 authentication plugin and disable the service-specific credentials after you complete data upload with DSBulk.
3. Create a JKS trust store file.
a. Download the Starfield digital certificate using the following command and save sf-class2-root.crt locally or in your home directory.
curl https://certs.secureserver.net/repository/sf-class2-root.crt -O
NoteYou can also use the Amazon digital certificate to connect to Amazon Keyspaces and can continue to do so if your client is connecting to Amazon Keyspaces successfully. The Starfield certificate provides additional backwards compatibility for clients using older certificate authorities.
b. Convert the Starfield digital certificate into a trustStore file.
openssl x509 -outform der -in sf-class2-root.crt -out temp_file.der keytool -import -alias cassandra -keystore cassandra_truststore.jks -file temp_file.der
In this step, you need to create a password for the keystore and trust this certificate. The interactive command looks like this.
Enter keystore password:
Re-enter new password:
Owner: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US
Issuer: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US
Serial number: 0
Valid from: Tue Jun 29 17:39:16 UTC 2004 until: Thu Jun 29 17:39:16 UTC 2034 Certificate fingerprints:
Subject Public Key Algorithm: 2048-bit RSA key Version: 3
[OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US]
]Trust this certificate? [no]: y
4. Set up the Cassandra Query Language shell (cqlsh) connection and confirm that you can connect to Amazon Keyspaces by following the steps at the section called “Using cqlsh” (p. 25).
5. Download and install DSBulk.
a. To download DSBulk, you can use the following code.
curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz
b. Then unpack the tar file and add DSBulk to your PATH as shown in the following example.
tar -zxvf dsbulk-1.8.0.tar.gz
# add the DSBulk directory to the path export PATH=$PATH:./dsbulk-1.8.0/bin
c. Create an application.conf file to store settings to be used by DSBulk. You can save the following example as ./dsbulk_keyspaces.conf. Replace localhost with the contact point of your local Cassandra cluster if you are not on the local node, for example the DNS name or IP address. Replace username and password with your server credentials. Take note of the file name and path, as you're going to need to specify this later in the dsbulk load command.
datastax-java-driver {
basic.contact-points = [ "localhost"]
advanced.auth-provider {
class = PlainTextAuthProvider username = "username"
password = "password"
} }
Step 1: Create the source CSV file and target table
For this tutorial, we use a comma-separated values (CSV) file with the name
keyspaces_sample_table.csv as the source file for the data migration. The provided sample file contains a few rows of data for a table with the name book_awards.
1. Create the source file. You can choose one of the following options:
• Download the sample CSV file (keyspaces_sample_table.csv) contained in the following archive file samplemigration.zip. Unzip the archive and take note of the path to keyspaces_sample_table.csv.
• To populate a CSV file with your own data stored in an Apache Cassandra database, you can populate the source CSV file by using dsbulk unload as shown in the following example.
dsbulk unload -k mykeyspace -t mytable -f ./my_application.conf > keyspaces_sample_table.csv
Make sure the CSV file you create meets the following requirements:
• The first row contains the column names.
• The column names in the source CSV file match the column names in the target table.
• The data is delimited with a comma.
• All data values are valid Amazon Keyspaces data types. See the section called “Data types” (p. 224).
2. Create the target keyspace and table in Amazon Keyspaces.
a. Connect to Amazon Keyspaces using cqlsh, replacing the service endpoint, user name, and password in the following example with your own values.
cqlsh cassandra.us-east-2.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/
K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl
b. Create a new keyspace with the name catalog as shown in the following example.
CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};
c. After the new keyspace has a status of available, use the following code to create the target table book_awards. To learn more about asynchronous resource creation and how to check if a resource is available, see the section called “Creating keyspaces” (p. 110).
CREATE TABLE catalog.book_awards ( year int,
award text, rank int, category text, book_title text, author text, publisher text,
PRIMARY KEY ((year, award), category, rank) );
If Apache Cassandra is your original data source, a simple way to create the Amazon Keyspaces target table with matching headers is to generate the CREATE TABLE statement from the source table as shown in the following statement.
cqlsh localhost 9042 -u "username" -p "password" --execute "DESCRIBE TABLE mykeyspace.mytable;"
Then create the target table in Amazon Keyspaces with the column names and data types matching the description from the Cassandra source table.
Step 2: Prepare the data
Preparing the source data for an efficient transfer is a two-step process. First, you randomize the data.
In the second step, you analyze the data to determine the appropriate dsbulk parameter values and required table settings.
Randomize the data
The dsbulk command reads and writes data in the same order that it appears in the CSV file. If you use the dsbulk command to create the source file, the data is written in key-sorted order in the CSV.
Internally, Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order. This is because you can take advantage of the built-in load balancing that occurs when Amazon Keyspaces is writing to different partitions.
To spread the writes across the partitions evenly, you must randomize the data in the source file. You can write an application to do this or use an open-source tool, such as Shuf. Shuf is freely available on Linux distributions, on macOS (by installing coreutils in homebrew), and on Windows (by using Windows Subsystem for Linux (WSL)). One extra step is required to prevent the header row with the column names to get shuffled in this step.
To randomize the source file while preserving the header, enter the following code.
tail +2 keyspaces_sample_table.csv | shuf -o keyspace.table.csv
(head -1 keyspaces_sample_table.csv && cat keyspace.table.csv ) > keyspace.table.csv1 &&
mv keyspace.table.csv1 keyspace.table.csv
Shuf rewrites the data to a new CSV file called keyspace.table.csv. You can now delete the keyspaces_sample_table.csv file—you no longer need it.
Analyze the data
Determine the average and maximum row size by analyzing the data.
You do this for the following reasons:
• The average row size helps to estimate the total amount of data to be transferred.
• You need the average row size to provision the write capacity needed for the data upload.
• You can make sure that each row is less than 1 MB in size, which is the maximum row size in Amazon Keyspaces.
NoteThis quota refers to row size, not partition size. Unlike Apache Cassandra partitions, Amazon Keyspaces partitions can be virtually unbound in size. Partition keys and clustering columns require additional storage for metadata, which you must add to the raw size of rows. For more information, see the section called “Calculating row size” (p. 114).
The following code uses AWK to analyze a CSV file and print the average and maximum row size.
awk -F, 'BEGIN {samp=10000;max=-1;}{if(NR>1){len=length($0);t+=len;avg=t/NR;max=(len>max ? len : max)}}NR==samp{exit}END{printf("{lines: %d, average: %d bytes, max: %d
bytes}\n",NR,avg,max);}' keyspace.table.csv
Running this code results in the following output.
using 10,000 samples:
{lines: 10000, avg: 123 bytes, max: 225 bytes}
Make sure that your maximum row size doesn't exceed 1 MB. If it does, you have to break up the row or compress the data to bring the row size below 1 MB. In the next step of this tutorial, you use the average row size to provision the write capacity for the table.
Step 3: Set throughput capacity for the table
This tutorial shows you how to tune DSBulk to load data within a set time range. Because you know how many reads and writes you perform in advance, use provisioned capacity mode. After you finish the data transfer, you should set the capacity mode of the table to match your application’s traffic patterns. To learn more about capacity management, see Serverless resource management (p. 95).
With provisioned capacity mode, you specify how much read and write capacity you want to provision to your table in advance. Write capacity is billed hourly and metered in write capacity units (WCUs). Each WCU is enough write capacity to support writing 1 KB of data per second. When you load the data, the write rate must be under the max WCUs (parameter: write_capacity_units) that are set on the target table.
By default, you can provision up to 40,000 WCUs to a table and 80,000 WCUs across all the tables in your account. If you need additional capacity, you can request a quota increase in the Service Quotas console.
For more information about quotas, see Quotas (p. 240).
Step 4: Configure DSBulk settings Calculate the average number of WCUs required for an insert
Inserting 1 KB of data per second requires 1 WCU. If your CSV file has 360,000 rows and you want to load all the data in 1 hour, you must write 100 rows per second (360,000 rows / 60 minutes / 60 seconds
= 100 rows per second). If each row has up to 1 KB of data, to insert 100 rows per second, you must provision 100 WCUs to your table. If each row has 1.5 KB of data, you need two WCUs to insert one row per second. Therefore, to insert 100 rows per second, you must provision 200 WCUs.
To determine how many WCUs you need to insert one row per second, divide the average row size in bytes by 1024 and round up to the nearest whole number.
For example, if the average row size is 3000 bytes, you need three WCUs to insert one row per second.
ROUNDUP(3000 / 1024) = ROUNDUP(2.93) = 3 WCUs
Calculate data load time and capacity
Now that you know the average size and number of rows in your CSV file, you can calculate how many WCUs you need to load the data in a given amount of time, and the approximate time it takes to load all the data in your CSV file using different WCU settings.
For example, if each row in your file is 1 KB and you have 1,000,000 rows in your CSV file, to load the data in 1 hour, you need to provision at least 278 WCUs to your table for that hour.
1,000,000 rows * 1 KBs = 1,000,000 KBs
1,000,000 KBs / 3600 seconds =277.8 KBs / second = 278 WCUs
Configure provisioned capacity settings
You can set a table’s write capacity settings when you create the table or by using the ALTER TABLE command. The following is the syntax for altering a table’s provisioned capacity settings with the ALTER TABLE command.
ALTER TABLE catalog.book_awards WITH custom_properties={'capacity_mode':{'throughput_mode':
'PROVISIONED', 'read_capacity_units': 100, 'write_capacity_units': 278}} ;
For the complete language reference, see the section called “CREATE TABLE” (p. 230) and the section called “ALTER TABLE” (p. 232).
Step 4: Configure DSBulk settings
This section outlines the steps required to configure DSBulk for data upload to Amazon Keyspaces.
You configure DSBulk by using a configuration file. You specify the configuration file directly from the command line.
1. Create a DSBulk configuration file for the migration to Amazon Keyspaces, in this example we use the file name dsbulk_keyspaces.conf. Specify the following settings in the DSBulk configuration file.
a. PlainTextAuthProvider – Create the authentication provider with the
PlainTextAuthProvider class. ServiceUserName and ServicePassword should match the user name and password you obtained when you generated the service-specific credentials by following the steps at the section called “Creating credentials” (p. 16).
b. local-datacenter – Set the value for local-datacenter to the AWS Region that you're connecting to. For example, if the application is connecting to
cassandra.us-Step 4: Configure DSBulk settings
east-2.amazonaws.com, then set the local data center to us-east-2. For all available AWS Regions, see the section called “Service endpoints” (p. 22).
c. SSLEngineFactory – To configure SSL/TLS, initialize the SSLEngineFactory by adding a section in the configuration file with a single line that specifies the class with class = DefaultSslEngineFactory. Provide the path to cassandra_truststore.jks and the password that you created previously.
d. consistency –Set the consistency level to LOCAL QUORUM and turn off the token_metadata setting. Other write consistency levels are not supported, for more information see the section called “Supported Cassandra consistency levels” (p. 12).
The following is the complete sample configuration file.
datastax-java-driver {
basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142"]
advanced.auth-provider {
class = PlainTextAuthProvider username = "ServiceUserName"
password = "ServicePassword"
}basic.load-balancing-policy { local-datacenter = "us-east-2"
}
basic.request {
consistency = LOCAL_QUORUM default-idempotence = true }advanced.ssl-engine-factory { class = DefaultSslEngineFactory
truststore-path = "./cassandra_truststore.jks"
truststore-password = "my_password"
hostname-validation = false }
2. Review the parameters for the DSBulk load command.
a. executor.maxPerSecond – The maximum number of rows that the load command attempts to process concurrently per second. If unset, this setting is disabled with -1.
Set executor.maxPerSecond based on the number of WCUs that you provisioned to the target destination table. The executor.maxPerSecond of the load command isn’t a limit – it’s a target average. This means it can (and often does) burst above the number you set.
To allow for bursts and make sure that enough capacity is in place to handle the data load requests, set executor.maxPerSecond to 90% of the table’s write capacity.
executor.maxPerSecond = WCUs * .90
In this tutorial, we set executor.maxPerSecond to 5.
Note
If you are using DSBulk 1.6.0 or higher, you can use dsbulk.engine.maxConcurrentQueries instead.
b. Configure these additional parameters for the DSBulk load command.
Step 5: Run the DSBulk load command
• batch-mode – This parameter tells the system to group operations by partition key. Because this could interfere with other settings, we recommend to disable batch mode.
• driver.advanced.retry-policy-max-retries – This determines how many times to retry a failed query. If unset, the default is 10. You can adjust this value as needed.
• driver.basic.request.timeout – The time in minutes the system waits for a query to return. If unset, the default is "5 minutes". You can adjust this value as needed.
Step 5: Run the DSBulk load command
In the final step of this tutorial, you upload the data into Amazon Keyspaces.
To run the DSBulk load command, complete the following steps.
1. Run the following code to upload the data from your csv file to your Amazon Keyspaces table. Make sure to update the path to the application configuration file you created earlier.
dsbulk load -f ./dsbulk_keyspaces.conf --connector.csv.url keyspace.table.csv -header true --batch.mode DISABLED --executor.maxPerSecond 5 --driver.basic.request.timeout "5 minutes" --driver.advanced.retry-policy.max-retries 10 -k catalog -t book_awards 2. The output includes the location of a log file that details successful and unsuccessful operations. The
file is stored in the following directory.
Operation directory: /home/user_name/logs/UNLOAD_20210308-202317-801911
3. The log file entries will include metrics, as in the following example. Check to make sure that the number of rows is consistent with the number of rows in your csv file.
total | failed | rows/s | p50ms | p99ms | p999ms 200 | 0 | 200 | 21.63 | 21.89 | 21.89
Important
Now that you have transferred your data, adjust the capacity mode settings of your target table to match your application’s regular traffic patterns. You incur charges at the hourly rate for your provisioned capacity until you change it. For more information, see the section called “Read/
write capacity modes” (p. 95).
Troubleshooting Amazon Keyspaces (for Apache Cassandra)
The following sections provide information about how to troubleshoot common configuration issues that you might encounter when using Amazon Keyspaces (for Apache Cassandra).
For troubleshooting guidance specific to IAM access, see the section called “Troubleshooting” (p. 192).
For more information about security best practices, see the section called “Security best practices” (p. 220).
Topics
• Troubleshooting connections in Amazon Keyspaces (p. 79)
• Troubleshooting capacity management in Amazon Keyspaces (p. 84)
• Troubleshooting data definition language in Amazon Keyspaces (p. 88)
Troubleshooting connections in Amazon Keyspaces
Having trouble connecting? Here are some common issues and how to resolve them.
Errors connecting to an Amazon Keyspaces endpoint
Failed connections and connection errors can result in different error messages. The following section covers the most common scenarios.
Topics
• I can't connect to Amazon Keyspaces with cqlsh (p. 79)
• I can't connect to Amazon Keyspaces using a Cassandra client driver (p. 83)
• I can't connect using IAM identities (p. 84)
• I'm trying to import data with cqlsh and the connection to my Amazon Keyspaces table is lost (p. 84)
I can't connect to Amazon Keyspaces with cqlsh
You're trying to connect to an Amazon Keyspaces endpoint using cqlsh and the connection fails with a Connection error.
If you try to connect to an Amazon Keyspaces table and cqlsh hasn't been configured properly, the connection fails. The following section provides examples of the most common configuration issues that result in connection errors when you're trying to establish a connection using cqlsh.
You're trying to connect to Amazon Keyspaces using cqlsh, but you get a connection timed out error.
This might be the case if you didn't supply the correct port, which results in the following error.
# cqlsh cassandra.us-east-1.amazonaws.com 9140 -u "USERNAME" -p "PASSWORD" --ssl
Connection error: ('Unable to connect to any servers', {'3.234.248.199': error(None, "Tried connecting to [('3.234.248.199', 9140)]. Last error: timed out")})
To resolve this issue, verify that you're using port 9142 for the connection.
You're trying to connect to Amazon Keyspaces using cqlsh, but you get a Name or service not known error.
This might be the case if you used an endpoint that is misspelled or doesn't exist. In the following example, the name of the endpoint is misspelled.
# cqlsh cassandra.us-east-1.amazon.com 9142 -u "USERNAME" -p "PASSWORD" --ssl Traceback (most recent call last):
File "/usr/bin/cqlsh.py", line 2458, in >module>
main(*read_options(sys.argv[1:], os.environ)) File "/usr/bin/cqlsh.py", line 2436, in main encoding=options.encoding)
File "/usr/bin/cqlsh.py", line 484, in __init__
load_balancing_policy=WhiteListRoundRobinPolicy([self.hostname]),
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/
cassandra-driver-3.11.0-bb96859b/cassandra/policies.py", line 417, in __init__
socket.gaierror: [Errno -2] Name or service not known
To resolve this issue when you're using public endpoints to connect, select an available endpoint from the section called “Service endpoints” (p. 22), and verify that the name of the endpoint doesn't have any
To resolve this issue when you're using public endpoints to connect, select an available endpoint from the section called “Service endpoints” (p. 22), and verify that the name of the endpoint doesn't have any