Load the LINEORDER table using multiple ﬁles

This step uses the GZIP and COMPUPDATE options to load the LINEORDER table.

In this exercise, you load the LINEORDER table from a single data ﬁle and then load it again from multiple ﬁles. Doing this enables you to compare the load times for the two methods.

NoteThe ﬁles for loading the LINEORDER table are provided in an AWS sample bucket. You don't need to upload ﬁles for this step.

GZIP, LZOP and BZIP2

You can compress your ﬁles using either gzip, lzop, or bzip2 compression formats. When loading from compressed ﬁles, COPY uncompresses the ﬁles during the load process. Compressing your ﬁles saves storage space and shortens upload times.

COMPUPDATE

When COPY loads an empty table with no compression encodings, it analyzes the load data to determine the optimal encodings. It then alters the table to use those encodings before beginning the load. This analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by turning COMPUPDATE oﬀ. To enable an accurate evaluation of COPY times, you turn COMPUPDATE oﬀ for this step.

Multiple Files

The COPY command can load data very eﬃciently when it loads from multiple ﬁles in parallel instead of from a single ﬁle. You can split your data into ﬁles so that the number of ﬁles is a multiple of the number of slices in your cluster. If you do, Amazon Redshift divides the workload and distributes the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For more information about the number of slices that each node size has, go to About clusters and nodes in the Amazon Redshift Cluster Management Guide.

For example, the dc2.large compute nodes used in this tutorial have two slices each, so the four-node cluster has eight slices. In previous steps, the load data was contained in eight ﬁles, even though the ﬁles are very small. In this step, you compare the time diﬀerence between loading from a single large ﬁle and loading from multiple ﬁles.

The ﬁles you use for this tutorial contain about 15 million records and occupy about 1.2 GB. These ﬁles are very small in Amazon Redshift scale, but suﬃcient to demonstrate the performance advantage of loading from multiple ﬁles. The ﬁles are large enough that the time required to download them and then upload them to Amazon S3 is excessive for this tutorial. Thus, you load the ﬁles directly from an AWS sample bucket.

The following screenshot shows the data ﬁles for LINEORDER.

Step 5: Run the COPY commands

To evaluate the performance of COPY with multiple ﬁles

1. Run the following command to COPY from a single ﬁle. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl' credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' gzipcompupdate off

region 'us-east-1';

2. Your results should be similar to the following. Note the execution time.

Warnings:

Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.

copy executed successfully Execution time: 51.56s

3. Run the following command to COPY from multiple ﬁles. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl' credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' gzipcompupdate off

region 'us-east-1';

4. Your results should be similar to the following. Note the execution time.

Warnings:

Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.

copy executed successfully Execution time: 17.7s

Step 6: Vacuum and analyze the database

5. Compare execution times.

In our example, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a reduction of 65.7 percent.

These results are based on using a four-node cluster. If your cluster has more nodes, the time savings is multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the diﬀerence is even more dramatic. If you have a single node cluster, there is little diﬀerence between the execution times.

Next step

Step 6: Vacuum and analyze the database (p. 141)

Step 6: Vacuum and analyze the database

Whenever you add, delete, or modify a signiﬁcant number of rows, you should run a VACUUM command and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see Vacuuming tables (p. 108).

If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a signiﬁcant number of rows, but you added them to empty tables. That being the case, there is no need to resort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you complete this tutorial by vacuuming and analyzing your database.

To vacuum and analyze the database, run the following commands.

vacuum;

analyze;

Next step

Step 7: Clean up your resources (p. 141)

Step 7: Clean up your resources

Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial, you should return your environment to the previous state by following the steps in Step 5: Revoke access and delete your sample cluster in the Amazon Redshift Getting Started Guide.

If you want to keep the cluster, but recover the storage used by the SSB tables, run the following commands.

drop table part;

drop table supplier;

drop table customer;

drop table dwdate;

drop table lineorder;

Summary (p. 142)

Summary

In this tutorial, you uploaded data ﬁles to Amazon S3 and then used COPY commands to load the data from the ﬁles into Amazon Redshift tables.

You loaded data using the following formats:

• Character-delimited

• CSV

• Fixed-width

You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION, MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.

You applied the following best practices for loading data:

• Use a COPY command to load data (p. 19)

• Split your load data (p. 20)

• Use a single COPY command to load from multiple ﬁles (p. 19)

• Compress your data ﬁles (p. 20)

• Verify data ﬁles before and after a load (p. 20)

For more information about Amazon Redshift best practices, see the following links:

• Amazon Redshift best practices for loading data (p. 19)

• Amazon Redshift best practices for designing tables (p. 16)

• Amazon Redshift best practices for designing queries (p. 22)

在文檔中 Amazon Redshift (頁 154-158)