• 沒有找到結果。

Testing compression encodings

在文檔中 Amazon Redshift (頁 61-65)

Original data value Original size (bytes) Compressed value

(token) Compressed size

(bytes)

Yellow 6 7

Yellow 6 0

Yellow 6 0

Yellow 6

{4,Yellow}

0

Total 51 23

Text255 and Text32k encodings

Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often. A separate dictionary of unique words is created for each block of column values on disk.

(An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the column. Those words are replaced on disk by a one-byte index value representing one of the 245 values, and any words that are not represented in the dictionary are stored uncompressed. The process repeats for each 1 MB disk block. If the indexed words occur frequently in the column, the column yields a high compression ratio.

For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.

For example, consider the VENUENAME column in the VENUE table. Words such as Arena, Center, and Theatre recur in this column and are likely to be among the first 245 words encountered in each block if text255 compression is applied. If so, this column benefits from compression. This is because every time those words appear, they occupy only 1 byte of storage (instead of 5, 6, or 7 bytes, respectively).

Zstandard encoding

Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some algorithms, such as Delta (p. 43) encoding or Mostly (p. 44) encoding, can potentially use more storage space than no compression, ZSTD is very unlikely to increase disk usage.

ZSTD supports SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, TIMESTAMP, and TIMESTAMPTZ data types.

Testing compression encodings

If you decide to manually specify column encodings, you might want to test different encodings with your data.

NoteWe recommend that you use the COPY command to load data whenever possible, and allow the COPY command to choose the optimal encodings based on your data. Or you can use the ANALYZE COMPRESSION (p. 565) command to view the suggested encodings for existing data. For details about applying automatic compression, see Loading tables with automatic compression (p. 89).

To perform a meaningful test of data compression, you need a large number of rows. For this example, we create a table and insert rows by using a statement that selects from two tables; VENUE and LISTING.

Testing compression encodings

We leave out the WHERE clause that would normally join the two tables. The result is that each row in the VENUE table is joined to all of the rows in the LISTING table, for a total of over 32 million rows.

This is known as a Cartesian join and normally is not recommended. However, for this purpose, it's a convenient method of creating a lot of rows. If you have an existing table with data that you want to test, you can skip this step.

After we have a table with sample data, we create a table with seven columns. Each has a different compression encoding: raw, bytedict, lzo, run length, text255, text32k, and zstd. We populate each column with exactly the same data by running an INSERT command that selects the data from the first table.

To test compression encodings, do the following:

1. (Optional) First, use a Cartesian join to create a table with a large number of rows. Skip this step if you want to test an existing table.

create table cartesian_venue(

venueid smallint not null distkey sortkey, venuename varchar(100),

venuecity varchar(30), venuestate char(2), venueseats integer);

insert into cartesian_venue

select venueid, venuename, venuecity, venuestate, venueseats from venue, listing;

2. Next, create a table with the encodings that you want to compare.

create table encodingvenue ( venueraw varchar(100) encode raw,

venuebytedict varchar(100) encode bytedict, venuelzo varchar(100) encode lzo,

venuerunlength varchar(100) encode runlength, venuetext255 varchar(100) encode text255, venuetext32k varchar(100) encode text32k, venuezstd varchar(100) encode zstd);

3. Insert the same data into all of the columns using an INSERT statement with a SELECT clause.

insert into encodingvenue

select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo, venuename as venuerunlength, venuename as venuetext32k, venuename as venuetext255, venuename as venuezstd

from cartesian_venue;

4. Verify the number of rows in the new table.

select count(*) from encodingvenue count

38884394 (1 row)

5. Query the STV_BLOCKLIST (p. 1241) system table to compare the number of 1 MB disk blocks used by each column.

The MAX aggregate function returns the highest block number for each column. The STV_BLOCKLIST table includes details for three system-generated columns. This example uses col < 6 in the WHERE clause to exclude the system-generated columns.

Example: Choosing compression encodings for the CUSTOMER table

select col, max(blocknum)

from stv_blocklist b, stv_tbl_perm p

where (b.tbl=p.id) and name ='encodingvenue' and col < 7

group by name, col order by col;

The query returns the following results. The columns are numbered beginning with zero. Depending on how your cluster is configured, your result might have different numbers, but the relative sizes should be similar. You can see that BYTEDICT encoding on the second column produced the best results for this dataset. This approach has a compression ratio of better than 20:1. LZO and ZSTD encoding also produced excellent results. Different data sets produce different results, of course.

When a column contains longer text strings, LZO often produces the best compression results.

col | max 0 | 203 1 | 10 2 | 22 3 | 204 4 | 56 5 | 72 6 | 20 (7 rows)

If you have data in an existing table, you can use the ANALYZE COMPRESSION (p. 565) command to view the suggested encodings for the table. For example, the following example shows the recommended encoding for a copy of the VENUE table, CARTESIAN_VENUE, that contains 38 million rows. Notice that ANALYZE COMPRESSION recommends LZO encoding for the VENUENAME column.

ANALYZE COMPRESSION chooses optimal compression based on multiple factors, which include percent of reduction. In this specific case, BYTEDICT provides better compression, but LZO also produces greater than 90 percent compression.

analyze compression cartesian_venue;

Table | Column | Encoding | Est_reduction_pct ---+---+---+---reallybigvenue | venueid | lzo | 97.54 reallybigvenue | venuename | lzo | 91.71 reallybigvenue | venuecity | lzo | 96.01 reallybigvenue | venuestate | lzo | 97.68 reallybigvenue | venueseats | lzo | 98.21

Example: Choosing compression encodings for the CUSTOMER table

The following statement creates a CUSTOMER table that has columns with various data types. This CREATE TABLE statement shows one of many possible combinations of compression encodings for these columns.

create table customer(

custkey int encode delta, custname varchar(30) encode raw, gender varchar(7) encode text255, address varchar(200) encode text255,

Example: Choosing compression

The following table shows the column encodings that were chosen for the CUSTOMER table and gives an explanation for the choices:.

Column Data type Encoding Explanation

CUSTKEY int delta CUSTKEY consists of

unique, consecutive integer values. Because the differences are one byte, DELTA is a good choice.

CUSTNAME varchar(30) raw CUSTNAME has a

large domain with few repeated values. Any compression encoding would probably be ineffective.

GENDER varchar(7) text255 GENDER is very small

domain with many

ADDRESS varchar(200) text255 ADDRESS is a large

domain, but contains

CITY varchar(30) text255 CITY is a large domain,

with some repeated

STATE char(2) raw In the United States,

STATE is a precise

在文檔中 Amazon Redshift (頁 61-65)