• 沒有找到結果。

Load the CUSTOMER table using MANIFEST

在文檔中 Amazon Redshift (頁 149-153)

In this step, you use the FIXEDWIDTH, MAXERROR, ACCEPTINVCHARS, and MANIFEST options to load the CUSTOMER table.

Step 5: Run the COPY commands

The sample data for this exercise contains characters that cause errors when COPY attempts to load them. You use the MAXERRORS option and the STL_LOAD_ERRORS system table to troubleshoot the load errors and then use the ACCEPTINVCHARS and MANIFEST options to eliminate the errors.

Fixed-Width Format

Fixed-width format defines each field as a fixed number of characters, rather than separating fields with a delimiter. The following excerpt from the data for the CUSTOMER table uses fixed-width format.

1 Customer#000000001 IVhzIApeRb MOROCCO 0MOROCCO AFRICA 25-705 2 Customer#000000002 XSTf4,NCwDVaWNe6tE JORDAN 6JORDAN MIDDLE EAST 23-453 3 Customer#000000003 MG9kdTD ARGENTINA5ARGENTINAAMERICA 11-783

The order of the label/width pairs must match the order of the table columns exactly. For more information, see FIXEDWIDTH (p. 593).

The fixed-width specification string for the CUSTOMER table data is as follows.

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'

To load the CUSTOMER table from fixed-width data, run the following command.

copy customer

from 's3://<your-bucket-name>/load/customer-fw.tbl'

credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10';

You should get an error message, similar to the following.

An error occurred when executing the SQL command:

copy customer

from 's3://mybucket/load/customer-fw.tbl' credentials'...

ERROR: Load into table 'customer' failed. Check 'stl_load_errors' system table for details. [SQL State=XX000]

Execution time: 2.95s 1 statement(s) failed.

MAXERROR

By default, the first time COPY encounters an error, the command fails and returns an error message. To save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified number of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER table data, add maxerror 10 to the COPY command.

To test using the FIXEDWIDTH and MAXERROR options, run the following command.

copy customer

from 's3://<your-bucket-name>/load/customer-fw.tbl'

credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'

maxerror 10;

Step 5: Run the COPY commands

This time, instead of an error message, you get a warning message similar to the following.

Warnings:

Load into table 'customer' completed, 112497 record(s) loaded successfully.

Load into table 'customer' completed, 7 record(s) could not be loaded. Check 'stl_load_errors' system table for details.

The warning indicates that COPY encountered seven errors. To check the errors, query the STL_LOAD_ERRORS table, as shown in the following example.

select query, substring(filename,22,25) as filename,line_number as line,

substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,

substring(raw_field_value,0,15) as field_text, substring(err_reason,0,45) as error_reason from stl_load_errors

order by query desc, filename limit 7;

The results of the STL_LOAD_ERRORS query should look similar to the following.

query | filename | line | column | type | pos | line_text | field_text | error_reason

---+---+---+--- +---+---+---+---

334489 | customer-fw.tbl.log | 2 | c_custkey | int4 | -1 | customer-fw.tbl | customer-f | Invalid digit, Value 'c', Pos 0, Type: Integ

334489 | customer-fw.tbl.log | 6 | c_custkey | int4 | -1 | Complete | Complete | Invalid digit, Value 'C', Pos 0, Type: Integ

334489 | customer-fw.tbl.log | 3 | c_custkey | int4 | -1 | #Total rows | #Total row | Invalid digit, Value '#', Pos 0, Type: Integ

334489 | customer-fw.tbl.log | 5 | c_custkey | int4 | -1 | #Status | #Status | Invalid digit, Value '#', Pos 0, Type: Integ

334489 | customer-fw.tbl.log | 1 | c_custkey | int4 | -1 | #Load file | #Load file | Invalid digit, Value '#', Pos 0, Type: Integ

334489 | customer-fw.tbl000 | 1 | c_address | varchar | 34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8 334489 | customer-fw.tbl000 | 1 | c_address | varchar | 34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8 (7 rows)

By examining the results, you can see that there are two messages in the error_reasons column:

• Invalid digit, Value '#', Pos 0, Type: Integ

These errors are caused by the customer-fw.tbl.log file. The problem is that it is a log file, not a data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file.

• String contains invalid or unsupported UTF8

The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace each invalid character with a specified alternative character.

Another problem with the load is more difficult to detect—the load produced unexpected results. To investigate this problem, run the following command to query the CUSTOMER table.

Step 5: Run the COPY commands

select c_custkey, c_name, c_address from customer

order by c_custkey limit 10;

c_custkey | c_name | c_address

2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE

2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE 3 | Customer#000000003 | MG9kdTD

3 | Customer#000000003 | MG9kdTD 4 | Customer#000000004 | XxVSJsL 4 | Customer#000000004 | XxVSJsL

5 | Customer#000000005 | KvpyuHCplrB84WgAi 5 | Customer#000000005 | KvpyuHCplrB84WgAi 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx (10 rows)

The rows should be unique, but there are duplicates.

Another way to check for unexpected results is to verify the number of rows that were loaded. In our case, 100000 rows should have been loaded, but the load message reported loading 112497 records.

The extra rows were loaded because the COPY loaded an extraneous file, customer-fw.tbl0000.bak.

In this exercise, you use a manifest file to avoid loading the wrong files.

ACCEPTINVCHARS

By default, when COPY encounters a character that is not supported by the column's data type, it skips the row and returns an error. For information about invalid UTF-8 characters, see Multibyte character load errors (p. 94).

You could use the MAXERRORS option to ignore errors and continue loading, then query

STL_LOAD_ERRORS to locate the invalid characters, and then fix the data files. However, MAXERRORS is best used for troubleshooting load problems and should generally not be used in a production environment.

The ACCEPTINVCHARS option is usually a better choice for managing invalid characters.

ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character and continue with the load operation. You can specify any valid ASCII character, except NULL, as the replacement character. The default replacement character is a question mark ( ? ). COPY replaces multibyte characters with a replacement string of equal length. For example, a 4-byte character would be replaced with '????'.

COPY returns the number of rows that contained invalid UTF-8 characters. It also adds an entry to the STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice.

Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.

ACCEPTINVCHARS is valid only for VARCHAR columns.

For this step, you add the ACCEPTINVCHARS with the replacement character '^'.

MANIFEST

When you COPY from Amazon S3 using a key prefix, there is a risk that you might load unwanted tables.

For example, the 's3://mybucket/load/ folder contains eight data files that share the key prefix customer-fw.tbl: customer-fw.tbl0000, customer-fw.tbl0001, and so on. However, the same folder also contains the extraneous files customer-fw.tbl.log and customer-fw.tbl-0001.bak.

Step 5: Run the COPY commands

To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded.

The file objects can be in different folders or different buckets, but they must be in the same region. For more information, see MANIFEST (p. 582).

The following shows the customer-fw-manifest text.

{ "entries": [

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"}, {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"}

] }

To load the data for the CUSTOMER table using the manifest file 1. Open the file customer-fw-manifest in a text editor.

2. Replace <your-bucket-name> with the name of your bucket.

3. Save the file.

4. Upload the file to the load folder on your bucket.

5. Run the following COPY command.

copy customer from 's3://<your-bucket-name>/load/customer-fw-manifest' credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'

maxerror 10

acceptinvchars as '^' manifest;

在文檔中 Amazon Redshift (頁 149-153)