Run the DSBulk load command - Amazon Keyspaces (for Apache Cassandra) Developer Guide

Prerequisites

You must complete the following tasks before you can start this tutorial.

1. If you have not already done so, sign up for an AWS account by following the steps at the section called “Signing up for AWS” (p. 14).

2. Create credentials by following the steps at the section called “Creating credentials” (p. 16).

NoteDSBulk currently only supports service-speciﬁc credentials. If you prefer to manage access to Amazon Keyspaces by using AWS Identity and Access Management users and roles, you can use the SigV4 authentication plugin and disable the service-speciﬁc credentials after you complete data upload with DSBulk.

3. Create a JKS trust store ﬁle.

a. Download the Starﬁeld digital certiﬁcate using the following command and save sf-class2-root.crt locally or in your home directory.

curl https://certs.secureserver.net/repository/sf-class2-root.crt -O

NoteYou can also use the Amazon digital certiﬁcate to connect to Amazon Keyspaces and can continue to do so if your client is connecting to Amazon Keyspaces successfully. The Starﬁeld certiﬁcate provides additional backwards compatibility for clients using older certiﬁcate authorities.

b. Convert the Starﬁeld digital certiﬁcate into a trustStore ﬁle.

openssl x509 -outform der -in sf-class2-root.crt -out temp_file.der keytool -import -alias cassandra -keystore cassandra_truststore.jks -file temp_file.der

In this step, you need to create a password for the keystore and trust this certiﬁcate. The interactive command looks like this.

Enter keystore password:

Re-enter new password:

Owner: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US

Issuer: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US

Serial number: 0

Valid from: Tue Jun 29 17:39:16 UTC 2004 until: Thu Jun 29 17:39:16 UTC 2034 Certificate fingerprints:

Subject Public Key Algorithm: 2048-bit RSA key Version: 3

[OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US]

]Trust this certificate? [no]: y

4. Set up the Cassandra Query Language shell (cqlsh) connection and conﬁrm that you can connect to Amazon Keyspaces by following the steps at the section called “Using cqlsh” (p. 25).

5. Download and install DSBulk.

a. To download DSBulk, you can use the following code.

curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz

b. Then unpack the tar ﬁle and add DSBulk to your PATH as shown in the following example.

tar -zxvf dsbulk-1.8.0.tar.gz

# add the DSBulk directory to the path export PATH=$PATH:./dsbulk-1.8.0/bin

c. Create an application.conf ﬁle to store settings to be used by DSBulk. You can save the following example as ./dsbulk_keyspaces.conf. Replace localhost with the contact point of your local Cassandra cluster if you are not on the local node, for example the DNS name or IP address. Replace username and password with your server credentials. Take note of the ﬁle name and path, as you're going to need to specify this later in the dsbulk load command.

datastax-java-driver {

basic.contact-points = [ "localhost"]

advanced.auth-provider {

class = PlainTextAuthProvider username = "username"

password = "password"

} }

Step 1: Create the source CSV ﬁle and target table

For this tutorial, we use a comma-separated values (CSV) ﬁle with the name

keyspaces_sample_table.csv as the source ﬁle for the data migration. The provided sample ﬁle contains a few rows of data for a table with the name book_awards.

1. Create the source ﬁle. You can choose one of the following options:

• Download the sample CSV ﬁle (keyspaces_sample_table.csv) contained in the following archive ﬁle samplemigration.zip. Unzip the archive and take note of the path to keyspaces_sample_table.csv.

• To populate a CSV ﬁle with your own data stored in an Apache Cassandra database, you can populate the source CSV ﬁle by using dsbulk unload as shown in the following example.

dsbulk unload -k mykeyspace -t mytable -f ./my_application.conf > keyspaces_sample_table.csv

Make sure the CSV ﬁle you create meets the following requirements:

• The ﬁrst row contains the column names.

• The column names in the source CSV ﬁle match the column names in the target table.

• The data is delimited with a comma.

• All data values are valid Amazon Keyspaces data types. See the section called “Data types” (p. 224).

2. Create the target keyspace and table in Amazon Keyspaces.

a. Connect to Amazon Keyspaces using cqlsh, replacing the service endpoint, user name, and password in the following example with your own values.

cqlsh cassandra.us-east-2.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/

K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl

b. Create a new keyspace with the name catalog as shown in the following example.

CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};

c. After the new keyspace has a status of available, use the following code to create the target table book_awards. To learn more about asynchronous resource creation and how to check if a resource is available, see the section called “Creating keyspaces” (p. 110).

CREATE TABLE catalog.book_awards ( year int,

award text, rank int, category text, book_title text, author text, publisher text,

PRIMARY KEY ((year, award), category, rank) );

If Apache Cassandra is your original data source, a simple way to create the Amazon Keyspaces target table with matching headers is to generate the CREATE TABLE statement from the source table as shown in the following statement.

cqlsh localhost 9042 -u "username" -p "password" --execute "DESCRIBE TABLE mykeyspace.mytable;"

Then create the target table in Amazon Keyspaces with the column names and data types matching the description from the Cassandra source table.

Step 2: Prepare the data

Preparing the source data for an eﬃcient transfer is a two-step process. First, you randomize the data.

In the second step, you analyze the data to determine the appropriate dsbulk parameter values and required table settings.

Randomize the data

The dsbulk command reads and writes data in the same order that it appears in the CSV ﬁle. If you use the dsbulk command to create the source ﬁle, the data is written in key-sorted order in the CSV.

Internally, Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more eﬃcient if you randomize the order. This is because you can take advantage of the built-in load balancing that occurs when Amazon Keyspaces is writing to diﬀerent partitions.

To spread the writes across the partitions evenly, you must randomize the data in the source ﬁle. You can write an application to do this or use an open-source tool, such as Shuf. Shuf is freely available on Linux distributions, on macOS (by installing coreutils in homebrew), and on Windows (by using Windows Subsystem for Linux (WSL)). One extra step is required to prevent the header row with the column names to get shuﬄed in this step.

To randomize the source ﬁle while preserving the header, enter the following code.

tail +2 keyspaces_sample_table.csv | shuf -o keyspace.table.csv

(head -1 keyspaces_sample_table.csv && cat keyspace.table.csv ) > keyspace.table.csv1 &&

mv keyspace.table.csv1 keyspace.table.csv

Shuf rewrites the data to a new CSV ﬁle called keyspace.table.csv. You can now delete the keyspaces_sample_table.csv ﬁle—you no longer need it.

Analyze the data

Determine the average and maximum row size by analyzing the data.

You do this for the following reasons:

• The average row size helps to estimate the total amount of data to be transferred.

• You need the average row size to provision the write capacity needed for the data upload.

• You can make sure that each row is less than 1 MB in size, which is the maximum row size in Amazon Keyspaces.

NoteThis quota refers to row size, not partition size. Unlike Apache Cassandra partitions, Amazon Keyspaces partitions can be virtually unbound in size. Partition keys and clustering columns require additional storage for metadata, which you must add to the raw size of rows. For more information, see the section called “Calculating row size” (p. 114).

The following code uses AWK to analyze a CSV ﬁle and print the average and maximum row size.

awk -F, 'BEGIN {samp=10000;max=-1;}{if(NR>1){len=length($0);t+=len;avg=t/NR;max=(len>max ? len : max)}}NR==samp{exit}END{printf("{lines: %d, average: %d bytes, max: %d

bytes}\n",NR,avg,max);}' keyspace.table.csv

Running this code results in the following output.

using 10,000 samples:

{lines: 10000, avg: 123 bytes, max: 225 bytes}

Make sure that your maximum row size doesn't exceed 1 MB. If it does, you have to break up the row or compress the data to bring the row size below 1 MB. In the next step of this tutorial, you use the average row size to provision the write capacity for the table.

Step 3: Set throughput capacity for the table

This tutorial shows you how to tune DSBulk to load data within a set time range. Because you know how many reads and writes you perform in advance, use provisioned capacity mode. After you ﬁnish the data transfer, you should set the capacity mode of the table to match your application’s traﬃc patterns. To learn more about capacity management, see Serverless resource management (p. 95).

With provisioned capacity mode, you specify how much read and write capacity you want to provision to your table in advance. Write capacity is billed hourly and metered in write capacity units (WCUs). Each WCU is enough write capacity to support writing 1 KB of data per second. When you load the data, the write rate must be under the max WCUs (parameter: write_capacity_units) that are set on the target table.

By default, you can provision up to 40,000 WCUs to a table and 80,000 WCUs across all the tables in your account. If you need additional capacity, you can request a quota increase in the Service Quotas console.

For more information about quotas, see Quotas (p. 240).

Step 4: Conﬁgure DSBulk settings Calculate the average number of WCUs required for an insert

Inserting 1 KB of data per second requires 1 WCU. If your CSV ﬁle has 360,000 rows and you want to load all the data in 1 hour, you must write 100 rows per second (360,000 rows / 60 minutes / 60 seconds

= 100 rows per second). If each row has up to 1 KB of data, to insert 100 rows per second, you must provision 100 WCUs to your table. If each row has 1.5 KB of data, you need two WCUs to insert one row per second. Therefore, to insert 100 rows per second, you must provision 200 WCUs.

To determine how many WCUs you need to insert one row per second, divide the average row size in bytes by 1024 and round up to the nearest whole number.

For example, if the average row size is 3000 bytes, you need three WCUs to insert one row per second.

ROUNDUP(3000 / 1024) = ROUNDUP(2.93) = 3 WCUs

Calculate data load time and capacity

Now that you know the average size and number of rows in your CSV ﬁle, you can calculate how many WCUs you need to load the data in a given amount of time, and the approximate time it takes to load all the data in your CSV ﬁle using diﬀerent WCU settings.

For example, if each row in your ﬁle is 1 KB and you have 1,000,000 rows in your CSV ﬁle, to load the data in 1 hour, you need to provision at least 278 WCUs to your table for that hour.

1,000,000 rows * 1 KBs = 1,000,000 KBs

1,000,000 KBs / 3600 seconds =277.8 KBs / second = 278 WCUs

Conﬁgure provisioned capacity settings

You can set a table’s write capacity settings when you create the table or by using the ALTER TABLE command. The following is the syntax for altering a table’s provisioned capacity settings with the ALTER TABLE command.

ALTER TABLE catalog.book_awards WITH custom_properties={'capacity_mode':{'throughput_mode':

'PROVISIONED', 'read_capacity_units': 100, 'write_capacity_units': 278}} ;

For the complete language reference, see the section called “CREATE TABLE” (p. 230) and the section called “ALTER TABLE” (p. 232).

Step 4: Conﬁgure DSBulk settings

This section outlines the steps required to conﬁgure DSBulk for data upload to Amazon Keyspaces.

You conﬁgure DSBulk by using a conﬁguration ﬁle. You specify the conﬁguration ﬁle directly from the command line.

1. Create a DSBulk conﬁguration ﬁle for the migration to Amazon Keyspaces, in this example we use the ﬁle name dsbulk_keyspaces.conf. Specify the following settings in the DSBulk conﬁguration ﬁle.

a. PlainTextAuthProvider – Create the authentication provider with the

PlainTextAuthProvider class. ServiceUserName and ServicePassword should match the user name and password you obtained when you generated the service-speciﬁc credentials by following the steps at the section called “Creating credentials” (p. 16).

b. local-datacenter – Set the value for local-datacenter to the AWS Region that you're connecting to. For example, if the application is connecting to

cassandra.us-Step 4: Conﬁgure DSBulk settings

east-2.amazonaws.com, then set the local data center to us-east-2. For all available AWS Regions, see the section called “Service endpoints” (p. 22).

c. SSLEngineFactory – To conﬁgure SSL/TLS, initialize the SSLEngineFactory by adding a section in the conﬁguration ﬁle with a single line that speciﬁes the class with class = DefaultSslEngineFactory. Provide the path to cassandra_truststore.jks and the password that you created previously.

d. consistency –Set the consistency level to LOCAL QUORUM and turn oﬀ the token_metadata setting. Other write consistency levels are not supported, for more information see the section called “Supported Cassandra consistency levels” (p. 12).

The following is the complete sample conﬁguration ﬁle.

datastax-java-driver {

basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142"]

advanced.auth-provider {

class = PlainTextAuthProvider username = "ServiceUserName"

password = "ServicePassword"

}basic.load-balancing-policy { local-datacenter = "us-east-2"

}

basic.request {

consistency = LOCAL_QUORUM default-idempotence = true }advanced.ssl-engine-factory { class = DefaultSslEngineFactory

truststore-path = "./cassandra_truststore.jks"

truststore-password = "my_password"

hostname-validation = false }

2. Review the parameters for the DSBulk load command.

a. executor.maxPerSecond – The maximum number of rows that the load command attempts to process concurrently per second. If unset, this setting is disabled with -1.

Set executor.maxPerSecond based on the number of WCUs that you provisioned to the target destination table. The executor.maxPerSecond of the load command isn’t a limit – it’s a target average. This means it can (and often does) burst above the number you set.

To allow for bursts and make sure that enough capacity is in place to handle the data load requests, set executor.maxPerSecond to 90% of the table’s write capacity.

executor.maxPerSecond = WCUs * .90

In this tutorial, we set executor.maxPerSecond to 5.

Note

If you are using DSBulk 1.6.0 or higher, you can use dsbulk.engine.maxConcurrentQueries instead.

b. Conﬁgure these additional parameters for the DSBulk load command.

Step 5: Run the DSBulk load command

• batch-mode – This parameter tells the system to group operations by partition key. Because this could interfere with other settings, we recommend to disable batch mode.

• driver.advanced.retry-policy-max-retries – This determines how many times to retry a failed query. If unset, the default is 10. You can adjust this value as needed.

• driver.basic.request.timeout – The time in minutes the system waits for a query to return. If unset, the default is "5 minutes". You can adjust this value as needed.

Step 5: Run the DSBulk load command

In the ﬁnal step of this tutorial, you upload the data into Amazon Keyspaces.

To run the DSBulk load command, complete the following steps.

1. Run the following code to upload the data from your csv ﬁle to your Amazon Keyspaces table. Make sure to update the path to the application conﬁguration ﬁle you created earlier.

dsbulk load -f ./dsbulk_keyspaces.conf --connector.csv.url keyspace.table.csv -header true --batch.mode DISABLED --executor.maxPerSecond 5 --driver.basic.request.timeout "5 minutes" --driver.advanced.retry-policy.max-retries 10 -k catalog -t book_awards 2. The output includes the location of a log ﬁle that details successful and unsuccessful operations. The

ﬁle is stored in the following directory.

Operation directory: /home/user_name/logs/UNLOAD_20210308-202317-801911

3. The log ﬁle entries will include metrics, as in the following example. Check to make sure that the number of rows is consistent with the number of rows in your csv ﬁle.

total | failed | rows/s | p50ms | p99ms | p999ms 200 | 0 | 200 | 21.63 | 21.89 | 21.89

Important

Now that you have transferred your data, adjust the capacity mode settings of your target table to match your application’s regular traﬃc patterns. You incur charges at the hourly rate for your provisioned capacity until you change it. For more information, see the section called “Read/

write capacity modes” (p. 95).

Troubleshooting Amazon Keyspaces (for Apache Cassandra)

The following sections provide information about how to troubleshoot common conﬁguration issues that you might encounter when using Amazon Keyspaces (for Apache Cassandra).

For troubleshooting guidance speciﬁc to IAM access, see the section called “Troubleshooting” (p. 192).

For more information about security best practices, see the section called “Security best practices” (p. 220).

Topics

• Troubleshooting connections in Amazon Keyspaces (p. 79)

• Troubleshooting capacity management in Amazon Keyspaces (p. 84)

• Troubleshooting data deﬁnition language in Amazon Keyspaces (p. 88)

Troubleshooting connections in Amazon Keyspaces

Having trouble connecting? Here are some common issues and how to resolve them.

Errors connecting to an Amazon Keyspaces endpoint

Failed connections and connection errors can result in diﬀerent error messages. The following section covers the most common scenarios.

Topics

• I can't connect to Amazon Keyspaces with cqlsh (p. 79)

• I can't connect to Amazon Keyspaces using a Cassandra client driver (p. 83)

• I can't connect using IAM identities (p. 84)

• I'm trying to import data with cqlsh and the connection to my Amazon Keyspaces table is lost (p. 84)

I can't connect to Amazon Keyspaces with cqlsh

You're trying to connect to an Amazon Keyspaces endpoint using cqlsh and the connection fails with a Connection error.

If you try to connect to an Amazon Keyspaces table and cqlsh hasn't been conﬁgured properly, the connection fails. The following section provides examples of the most common conﬁguration issues that result in connection errors when you're trying to establish a connection using cqlsh.

You're trying to connect to Amazon Keyspaces using cqlsh, but you get a connection timed out error.

This might be the case if you didn't supply the correct port, which results in the following error.

# cqlsh cassandra.us-east-1.amazonaws.com 9140 -u "USERNAME" -p "PASSWORD" --ssl

Connection error: ('Unable to connect to any servers', {'3.234.248.199': error(None, "Tried connecting to [('3.234.248.199', 9140)]. Last error: timed out")})

To resolve this issue, verify that you're using port 9142 for the connection.

You're trying to connect to Amazon Keyspaces using cqlsh, but you get a Name or service not known error.

This might be the case if you used an endpoint that is misspelled or doesn't exist. In the following example, the name of the endpoint is misspelled.

# cqlsh cassandra.us-east-1.amazon.com 9142 -u "USERNAME" -p "PASSWORD" --ssl Traceback (most recent call last):

File "/usr/bin/cqlsh.py", line 2458, in >module>

main(*read_options(sys.argv[1:], os.environ)) File "/usr/bin/cqlsh.py", line 2436, in main encoding=options.encoding)

File "/usr/bin/cqlsh.py", line 484, in __init__

load_balancing_policy=WhiteListRoundRobinPolicy([self.hostname]),

File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/

cassandra-driver-3.11.0-bb96859b/cassandra/policies.py", line 417, in __init__

socket.gaierror: [Errno -2] Name or service not known

To resolve this issue when you're using public endpoints to connect, select an available endpoint from the section called “Service endpoints” (p. 22), and verify that the name of the endpoint doesn't have any

在文檔中 Amazon Keyspaces (for Apache Cassandra) Developer Guide (頁 78-200)