This step uses the GZIP and COMPUPDATE options to load the LINEORDER table.
In this exercise, you load the LINEORDER table from a single data file and then load it again from multiple files. Doing this enables you to compare the load times for the two methods.
NoteThe files for loading the LINEORDER table are provided in an AWS sample bucket. You don't need to upload files for this step.
GZIP, LZOP and BZIP2
You can compress your files using either gzip, lzop, or bzip2 compression formats. When loading from compressed files, COPY uncompresses the files during the load process. Compressing your files saves storage space and shortens upload times.
COMPUPDATE
When COPY loads an empty table with no compression encodings, it analyzes the load data to determine the optimal encodings. It then alters the table to use those encodings before beginning the load. This analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by turning COMPUPDATE off. To enable an accurate evaluation of COPY times, you turn COMPUPDATE off for this step.
Multiple Files
The COPY command can load data very efficiently when it loads from multiple files in parallel instead of from a single file. You can split your data into files so that the number of files is a multiple of the number of slices in your cluster. If you do, Amazon Redshift divides the workload and distributes the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For more information about the number of slices that each node size has, go to About clusters and nodes in the Amazon Redshift Cluster Management Guide.
For example, the dc2.large compute nodes used in this tutorial have two slices each, so the four-node cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files are very small. In this step, you compare the time difference between loading from a single large file and loading from multiple files.
The files you use for this tutorial contain about 15 million records and occupy about 1.2 GB. These files are very small in Amazon Redshift scale, but sufficient to demonstrate the performance advantage of loading from multiple files. The files are large enough that the time required to download them and then upload them to Amazon S3 is excessive for this tutorial. Thus, you load the files directly from an AWS sample bucket.
The following screenshot shows the data files for LINEORDER.
Step 5: Run the COPY commands
To evaluate the performance of COPY with multiple files
1. Run the following command to COPY from a single file. Do not change the bucket name.
copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl' credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' gzipcompupdate off
region 'us-east-1';
2. Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully Execution time: 51.56s
3. Run the following command to COPY from multiple files. Do not change the bucket name.
copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl' credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' gzipcompupdate off
region 'us-east-1';
4. Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully Execution time: 17.7s
Step 6: Vacuum and analyze the database
5. Compare execution times.
In our example, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a reduction of 65.7 percent.
These results are based on using a four-node cluster. If your cluster has more nodes, the time savings is multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the difference is even more dramatic. If you have a single node cluster, there is little difference between the execution times.
Next step
Step 6: Vacuum and analyze the database (p. 141)
Step 6: Vacuum and analyze the database
Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see Vacuuming tables (p. 108).
If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number of rows, but you added them to empty tables. That being the case, there is no need to resort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you complete this tutorial by vacuuming and analyzing your database.
To vacuum and analyze the database, run the following commands.
vacuum;
analyze;
Next step
Step 7: Clean up your resources (p. 141)
Step 7: Clean up your resources
Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial, you should return your environment to the previous state by following the steps in Step 5: Revoke access and delete your sample cluster in the Amazon Redshift Getting Started Guide.
If you want to keep the cluster, but recover the storage used by the SSB tables, run the following commands.
drop table part;
drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;
Next
Summary (p. 142)
Summary
Summary
In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data from the files into Amazon Redshift tables.
You loaded data using the following formats:
• Character-delimited
• CSV
• Fixed-width
You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION, MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.
You applied the following best practices for loading data:
• Use a COPY command to load data (p. 19)
• Split your load data (p. 20)
• Use a single COPY command to load from multiple files (p. 19)
• Compress your data files (p. 20)
• Verify data files before and after a load (p. 20)
For more information about Amazon Redshift best practices, see the following links:
• Amazon Redshift best practices for loading data (p. 19)
• Amazon Redshift best practices for designing tables (p. 16)
• Amazon Redshift best practices for designing queries (p. 22)