Amazon Redshift
Database Developer Guide
Amazon Redshift: Database Developer Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
System overview ... 1
Are you a first-time Amazon Redshift user? ... 1
Are you a database developer? ... 2
Prerequisites ... 3
System and architecture overview ... 3
Data warehouse system architecture ... 3
Performance ... 5
Columnar storage ... 7
Workload management ... 9
Using Amazon Redshift with other services ... 9
Best practices ... 11
Conducting a proof of concept ... 11
Overview of the process ... 11
Identify the business goals and success criteria ... 12
Set up your proof of concept ... 12
Checklist for a complete evaluation ... 13
Develop a project plan for your evaluation ... 14
Additional resources to help your evaluation ... 15
Need help? ... 16
Best practices for designing tables ... 16
Choose the best sort key ... 16
Choose the best distribution style ... 17
Use automatic compression ... 18
Define constraints ... 18
Use the smallest possible column size ... 18
Use date/time data types for date columns ... 19
Best practices for loading data ... 19
Take the loading data tutorial ... 19
Use a COPY command to load data ... 19
Use a single COPY command ... 19
Split your load data ... 20
Compress your data files ... 20
Verify data files before and after a load ... 20
Use a multi-row insert ... 20
Use a bulk insert ... 21
Load data in sort key order ... 21
Load data in sequential blocks ... 21
Use time-series tables ... 21
Use a staging table to perform a merge ... 22
Schedule around maintenance windows ... 22
Best practices for designing queries ... 22
Working with Advisor ... 24
Amazon Redshift Regions ... 24
Access Advisor ... 25
Advisor recommendations ... 26
Tutorials ... 36
Working with automatic table optimization ... 37
Enabling automatic table optimization ... 37
Removing automatic table optimization ... 38
Monitoring actions of automatic table optimization ... 38
Working with column compression ... 39
Compression encodings ... 40
Testing compression encodings ... 46
Example: Choosing compression encodings for the CUSTOMER table ... 48
Working with data distribution styles ... 50
Data distribution concepts ... 51
Distribution styles ... 51
Viewing distribution styles ... 52
Evaluating query patterns ... 53
Designating distribution styles ... 54
Evaluating the query plan ... 55
Query plan example ... 56
Distribution examples ... 60
Working with sort keys ... 62
Compound sort key ... 63
Interleaved sort key ... 63
Defining table constraints ... 64
Loading data ... 65
Using COPY to load data ... 65
Credentials and access permissions ... 66
Preparing your input data ... 67
Loading data from Amazon S3 ... 68
Loading data from Amazon EMR ... 76
Loading data from remote hosts ... 80
Loading from Amazon DynamoDB ... 86
Verifying that the data loaded correctly ... 88
Validating input data ... 88
Automatic compression ... 89
Optimizing for narrow tables ... 91
Default values ... 91
Troubleshooting ... 91
Updating with DML ... 96
Updating and inserting ... 96
Merge method 1: Replacing existing rows ... 97
Merge method 2: Specifying a column list ... 97
Creating a temporary staging table ... 97
Performing a merge operation by replacing existing rows ... 98
Performing a merge operation by specifying a column list ... 98
Merge examples ... 100
Performing a deep copy ... 102
Analyzing tables ... 103
Automatic analyze ... 104
Analysis of new table data ... 104
ANALYZE command history ... 107
Vacuuming tables ... 108
Automatic table sort ... 108
Automatic vacuum delete ... 109
VACUUM frequency ... 109
Sort stage and merge stage ... 110
Vacuum threshold ... 110
Vacuum types ... 110
Managing vacuum times ... 110
Managing concurrent write operations ... 116
Serializable isolation ... 117
Write and read/write operations ... 120
Concurrent write examples ... 121
Tutorial: Loading data from Amazon S3 ... 122
Prerequisites ... 122
Overview ... 123
Steps ... 123
Step 1: Create a cluster ... 123
Step 2: Download the data files ... 125
Step 3: Upload the files to an Amazon S3 bucket ... 125
Step 4: Create the sample tables ... 127
Step 5: Run the COPY commands ... 129
Step 6: Vacuum and analyze the database ... 141
Step 7: Clean up your resources ... 141
Summary ... 142
Unloading data ... 143
Unloading data to Amazon S3 ... 143
Unloading encrypted data files ... 145
Unloading data in delimited or fixed-width format ... 147
Reloading unloaded data ... 148
Creating user-defined functions ... 149
UDF security and privileges ... 149
Creating a scalar SQL UDF ... 150
Scalar SQL function example ... 150
Creating a scalar Python UDF ... 151
Scalar Python UDF example ... 151
Python UDF data types ... 151
ANYELEMENT data type ... 152
Python language support ... 152
UDF constraints ... 155
Creating a scalar Lambda UDF ... 156
Registering a Lambda UDF ... 156
Managing Lambda UDF security and privileges ... 156
Configuring the authorization parameter for Lambda UDFs ... 157
Using the JSON interface between Amazon Redshift and Lambda ... 156
Naming UDFs ... 160
Logging errors and warnings ... 161
Example uses of UDFs ... 162
Creating stored procedures ... 163
Stored procedure overview ... 163
Naming stored procedures ... 165
Security and privileges ... 166
Returning a result set ... 167
Managing transactions ... 168
Trapping errors ... 173
Logging stored procedures ... 174
Limits and differences ... 175
PL/pgSQL language reference ... 176
PL/pgSQL reference conventions ... 176
Structure of PL/pgSQL ... 176
Supported PL/pgSQL statements ... 180
Creating materialized views ... 192
Querying a materialized view ... 194
Automatic query rewriting to use materialized views ... 195
Usage notes ... 195
Limitations ... 195
Refreshing a materialized view ... 196
Autorefreshing a materialized view ... 198
Automated materialized views (preview) ... 198
Streaming ingestion (preview) ... 199
Data flow ... 200
Streaming ingestion use cases ... 200
Getting started with streaming ingestion ... 200
Limitations ... 201
Electric vehicle station-data tutorial (preview) ... 202
Querying spatial data ... 205
Tutorial: Using spatial SQL functions ... 207
Prerequisites ... 207
Step 1: Create tables and load test data ... 208
Step 2: Query spatial data ... 210
Step 3: Clean up your resources ... 212
Loading a shapefile ... 213
Terminology ... 214
Bounding box ... 214
Geometric validity ... 214
Geometric simplicity ... 216
Limitations ... 217
Querying data with federated queries ... 218
Getting started with using federated queries to PostgreSQL ... 218
Getting started using federated queries to PostgreSQL with CloudFormation ... 219
Launching a CloudFormation stack for Redshift federated queries ... 220
Querying data from the external schema ... 221
Getting started with using federated queries to MySQL ... 221
Creating a secret and an IAM role ... 222
Prerequisites ... 222
Examples of using a federated query ... 224
Example of using a federated query with PostgreSQL ... 224
Example of using a mixed-case name ... 225
Example of using a federated query with MySQL ... 227
Data type differences ... 227
Limitations and considerations ... 230
Querying external data using Amazon Redshift Spectrum ... 232
Amazon Redshift Spectrum overview ... 232
Amazon Redshift Spectrum Regions ... 233
Amazon Redshift Spectrum considerations ... 233
Getting started with Amazon Redshift Spectrum ... 234
Prerequisites ... 234
CloudFormation ... 234
Getting started with Redshift Spectrum step by step ... 234
Step 1. Create an IAM role ... 234
Step 2: Associate the IAM role with your cluster ... 237
Step 3: Create an external schema and an external table ... 238
Step 4: Query your data in Amazon S3 ... 238
Launch your CloudFormation stack and then query your data ... 241
IAM policies for Amazon Redshift Spectrum ... 243
Amazon S3 permissions ... 244
Cross-account Amazon S3 permissions ... 244
Grant or restrict access using Redshift Spectrum ... 245
Minimum permissions ... 245
Chaining IAM roles ... 246
Accessing AWS Glue data ... 247
Using Redshift Spectrum with Lake Formation ... 253
Creating data files for queries in Amazon Redshift Spectrum ... 254
Data formats for Redshift Spectrum ... 254
Compression types for Redshift Spectrum ... 255
Encryption for Redshift Spectrum ... 255
Creating external schemas ... 256
Working with external catalogs ... 257
Creating external tables ... 261
Pseudocolumns ... 262
Partitioning Redshift Spectrum external tables ... 263
Mapping to ORC columns ... 267
Creating external tables for Hudi-managed data ... 269
Creating external tables for Delta Lake data ... 270
Improving Amazon Redshift Spectrum query performance ... 271
Setting data handling options ... 273
Monitoring metrics ... 274
Troubleshooting queries ... 274
Retries exceeded ... 275
Access throttled ... 275
Resource limit exceeded ... 276
No rows returned for a partitioned table ... 276
Not authorized error ... 276
Incompatible data formats ... 276
Syntax error when using Hive DDL in Amazon Redshift ... 277
Permission to create temporary tables ... 277
Invalid range ... 277
Invalid Parquet version number ... 278
Tutorial: Querying nested data with Amazon Redshift Spectrum ... 278
Overview ... 278
Step 1: Create an external table that contains nested data ... 279
Step 2: Query your nested data in Amazon S3 with SQL extensions ... 280
Nested data use cases ... 283
Nested data limitations ... 285
Serializing complex nested JSON ... 285
Using HyperLogLog sketches in Amazon Redshift ... 288
Considerations ... 288
Limitations ... 289
Examples ... 289
Example: Return cardinality in a subquery ... 289
Example: Return an HLLSKETCH type from combined sketches in a subquery ... 290
Example: Return a HyperLogLog sketch from combining multiple sketches ... 290
Example: Generate HyperLogLog sketches over S3 data using external tables ... 291
Querying data across databases ... 293
Considerations ... 294
Limitations ... 294
Examples of using a cross-database query ... 295
Using cross-database queries with the query editor ... 298
Sharing data across clusters ... 300
Data sharing overview ... 300
Data sharing use cases ... 300
Data sharing concepts ... 301
Sharing data at different levels ... 302
Managing data consistency ... 302
Accessing shared data ... 302
Data sharing considerations in Amazon Redshift ... 303
How data sharing works ... 303
Controlling access for cross-account datashares ... 304
Working with views in Amazon Redshift data sharing ... 307
Managing the data sharing lifecycle ... 308
Managing permissions for datashares ... 309
Tracking usage and auditing in data sharing ... 310
Cluster management and data sharing ... 310
Integrating Amazon Redshift data sharing with business intelligence tools ... 311
Accessing metadata for datashares ... 312
Working with AWS Data Exchange for Amazon Redshift ... 312
How AWS Data Exchange datashares work ... 313
Considerations when using AWS Data Exchange for Amazon Redshift ... 313
Getting started data sharing ... 314
Getting started data sharing using the SQL interface ... 314
Getting started data sharing using the console ... 334
Getting started data sharing with CloudFormation ... 343
Limitations ... 345
Ingesting and querying semistructured data in Amazon Redshift ... 346
Use cases for the SUPER data type ... 346
Concepts for SUPER data type use ... 347
Considerations for SUPER data ... 348
SUPER sample dataset ... 348
Loading semistructured data into Amazon Redshift ... 350
Parsing JSON documents to SUPER columns ... 350
Using COPY to load JSON data in Amazon Redshift ... 351
Unloading semistructured data ... 354
Unloading semistructured data in CSV or text formats ... 354
Unloading semistructured data in the Parquet format ... 355
Querying semistructured data ... 355
Navigation ... 355
Unnesting queries ... 356
Object unpivoting ... 357
Dynamic typing ... 358
Lax semantics ... 360
Types of introspection ... 360
Order by ... 361
Operators and functions ... 362
Arithmetic operators ... 362
Arithmetic functions ... 362
Array functions ... 363
SUPER configurations ... 364
Lax and strict modes for SUPER ... 364
Accessing JSON fields with upper or mixed case letters ... 364
Limitations ... 365
Using SUPER data type with materialized views ... 367
Accelerating PartiQL queries ... 367
Limitations for using the SUPER data type with materialized views ... 369
Using machine learning in Amazon Redshift ... 370
Machine learning overview ... 371
How machine learning can solve a problem ... 371
Terms and concepts for Amazon Redshift ML ... 372
Machine learning for novices and experts ... 373
Costs for using Amazon Redshift ML ... 374
Getting started with Amazon Redshift ML ... 375
Administrative setup ... 376
Using model explainability with Amazon Redshift ML ... 379
Tutorial: Creating a customer churn model ... 379
Tuning query performance ... 383
Query processing ... 383
Query planning and execution workflow ... 383
Query plan ... 385
Reviewing query plan steps ... 390
Factors affecting query performance ... 392
Analyzing and improving queries ... 393
Query analysis workflow ... 393
Reviewing query alerts ... 394
Analyzing the query plan ... 395
Analyzing the query summary ... 396
Improving query performance ... 401
Diagnostic queries for query tuning ... 403
Troubleshooting queries ... 406
Connection fails ... 407
Query hangs ... 407
Query takes too long ... 408
Load fails ... 409
Load takes too long ... 409
Load data is incorrect ... 409
Setting the JDBC fetch size parameter ... 409
Implementing workload management ... 411
Modifying the WLM configuration ... 412
Migrating from manual WLM to automatic WLM ... 412
Automatic WLM ... 413
Priority ... 414
Concurrency scaling mode ... 414
User groups ... 414
Query groups ... 414
Wildcards ... 415
Query monitoring rules ... 415
Checking for automatic WLM ... 415
Query priority ... 415
Manual WLM ... 419
Concurrency scaling mode ... 420
Concurrency level ... 420
User groups ... 421
Query groups ... 421
Wildcards ... 421
WLM memory percent to use ... 421
WLM timeout ... 422
Query monitoring rules ... 422
WLM query queue hopping ... 422
Tutorial: Configuring manual WLM queues ... 425
Concurrency scaling ... 438
Concurrency scaling capabilities ... 438
Limitations for concurrency scaling ... 439
Regions for concurrency scaling ... 439
Concurrency scaling candidates ... 440
Configuring concurrency scaling queues ... 416
Monitoring concurrency scaling ... 441
Concurrency scaling system views ... 441
Short query acceleration ... 442
Maximum SQA runtime ... 442
Monitoring SQA ... 443
WLM queue assignment rules ... 443
Queue assignments example ... 444
Assigning queries to queues ... 446
Assigning queries to queues based on user groups ... 446
Assigning a query to a query group ... 446
Assigning queries to the superuser queue ... 446
Dynamic and static properties ... 447
WLM dynamic memory allocation ... 448
Dynamic WLM example ... 448
Query monitoring rules ... 450
Defining a query monitor rule ... 450
Query monitoring metrics ... 451
Query monitoring rules templates ... 453
System tables and views for query monitoring rules ... 454
WLM system tables and views ... 455
WLM service class IDs ... 456
Managing database security ... 457
Amazon Redshift security overview ... 457
Default database user privileges ... 458
Superusers ... 458
Users ... 459
Creating, altering, and deleting users ... 459
Role-based access control ... 460
Role hierarchy ... 460
Role assignment ... 460
Amazon Redshift system-defined roles ... 461
System privileges ... 461
Database object permissions ... 464
ALTER DEFAULT PRIVILEGES for RBAC ... 465
Considerations for role usage ... 465
Managing roles ... 465
Groups ... 465
Creating, altering, and deleting groups ... 466
Schemas ... 466
Creating, altering, and deleting schemas ... 466
Search path ... 467
Schema-based privileges ... 467
Example for controlling user and group access ... 467
SQL reference ... 469
Amazon Redshift SQL ... 469
SQL functions supported on the leader node ... 469
Amazon Redshift and PostgreSQL ... 470
Using SQL ... 475
SQL reference conventions ... 476
Basic elements ... 476
Expressions ... 505
Conditions ... 509
SQL commands ... 527
ABORT ... 529
ALTER DATABASE ... 530
ALTER DATASHARE ... 532
ALTER DEFAULT PRIVILEGES ... 534
ALTER GROUP ... 536
ALTER MATERIALIZED VIEW ... 537
ALTER ROLE ... 538
ALTER PROCEDURE ... 539
ALTER SCHEMA ... 540
ALTER TABLE ... 541
ALTER TABLE APPEND ... 556
ALTER USER ... 560
ANALYZE ... 563
ANALYZE COMPRESSION ... 565
BEGIN ... 567
CALL ... 568
CANCEL ... 570
CLOSE ... 572
COMMENT ... 573
COMMIT ... 574
COPY ... 575
CREATE DATABASE ... 640
CREATE DATASHARE ... 646
CREATE EXTERNAL FUNCTION ... 647
CREATE EXTERNAL SCHEMA ... 654
CREATE EXTERNAL TABLE ... 660
CREATE FUNCTION ... 678
CREATE GROUP ... 682
CREATE LIBRARY ... 683
CREATE MATERIALIZED VIEW ... 685
CREATE MODEL ... 688
CREATE PROCEDURE ... 701
CREATE ROLE ... 704
CREATE SCHEMA ... 705
CREATE TABLE ... 707
CREATE TABLE AS ... 722
CREATE USER ... 729
CREATE VIEW ... 733
DEALLOCATE ... 736
DECLARE ... 737
DELETE ... 739
DESC DATASHARE ... 741
DROP DATABASE ... 742
DROP DATASHARE ... 743
DROP FUNCTION ... 744
DROP GROUP ... 745
DROP LIBRARY ... 746
DROP MODEL ... 747
DROP MATERIALIZED VIEW ... 747
DROP PROCEDURE ... 748
DROP ROLE ... 749
DROP SCHEMA ... 750
DROP TABLE ... 752
DROP USER ... 754
DROP VIEW ... 756
END ... 757
EXECUTE ... 758
EXPLAIN ... 759
FETCH ... 763
GRANT ... 764
INSERT ... 777
INSERT (external table) ... 781
LOCK ... 783
PREPARE ... 784
REFRESH MATERIALIZED VIEW ... 785
RESET ... 787
REVOKE ... 788
ROLLBACK ... 797
SELECT ... 798
SELECT INTO ... 833
SET ... 834
SET SESSION AUTHORIZATION ... 837
SET SESSION CHARACTERISTICS ... 838
SHOW ... 838
SHOW EXTERNAL TABLE ... 838
SHOW MODEL ... 840
SHOW DATASHARES ... 842
SHOW PROCEDURE ... 843
SHOW TABLE ... 844
SHOW VIEW ... 845
START TRANSACTION ... 846
TRUNCATE ... 846
UNLOAD ... 847
UPDATE ... 866
VACUUM ... 870
SQL functions reference ... 875
Leader node–only functions ... 876
Compute node–only functions ... 876
Aggregate functions ... 877
Array functions ... 895
VARBYTE functions ... 899
Bit-wise aggregate functions ... 905
Conditional expressions ... 910
Date and time functions ... 919
Spatial functions ... 959
Hash functions ... 1053
HyperLogLog functions ... 1058
JSON functions ... 1062
Math functions ... 1069
Machine learning functions ... 1093
String functions ... 1095
SUPER type information functions ... 1140
Window functions ... 1149
Data type formatting functions ... 1192
System administration functions ... 1209
System information functions ... 1216
Reserved words ... 1235
System tables reference ... 1238
System tables and views ... 1238
Types of system tables and views ... 1238
Visibility of data in system tables and views ... 1239
Filtering system-generated queries ... 1239
STV tables for snapshot data ... 1239
STV_ACTIVE_CURSORS ... 1240
STV_BLOCKLIST ... 1241
STV_CURSOR_CONFIGURATION ... 1244
STV_EXEC_STATE ... 1244
STV_INFLIGHT ... 1245
STV_LOAD_STATE ... 1247
STV_LOCKS ... 1248
STV_ML_MODEL_INFO ... 1249
STV_MV_DEPS ... 1249
STV_MV_INFO ... 1250
STV_NODE_STORAGE_CAPACITY ... 1252
STV_PARTITIONS ... 1253
STV_QUERY_METRICS ... 1254
STV_RECENTS ... 1258
STV_SESSIONS ... 1259
STV_SLICES ... 1260
STV_STARTUP_RECOVERY_STATE ... 1261
STV_TBL_PERM ... 1261
STV_TBL_TRANS ... 1263
STV_WLM_QMR_CONFIG ... 1265
STV_WLM_CLASSIFICATION_CONFIG ... 1265
STV_WLM_QUERY_QUEUE_STATE ... 1266
STV_WLM_QUERY_STATE ... 1267
STV_WLM_QUERY_TASK_STATE ... 1269
STV_WLM_SERVICE_CLASS_CONFIG ... 1270
STV_WLM_SERVICE_CLASS_STATE ... 1271
System views ... 1272
STL views for logging ... 1273
SVCS views ... 1354
SVL views ... 1372
SVV views ... 1412
Monitoring views (preview) ... 1454
System catalog tables ... 1463
PG_ATTRIBUTE_INFO ... 1464
PG_CLASS_INFO ... 1464
PG_DATABASE_INFO ... 1465
PG_DEFAULT_ACL ... 1466
PG_EXTERNAL_SCHEMA ... 1467
PG_LIBRARY ... 1468
PG_PROC_INFO ... 1469
PG_STATISTIC_INDICATOR ... 1469
PG_TABLE_DEF ... 1470
Querying the catalog tables ... 1472
Configuration reference ... 1477
Modifying the server configuration ... 1477
activate_aqua ... 1478
Values ... 1478
Description ... 1478
Example ... 1478
analyze_threshold_percent ... 1478
Values (default in bold) ... 1478
Description ... 1479
Examples ... 1479
cast_super_null_on_error ... 1479
Values (default in bold) ... 1479
Description ... 1478
datashare_break_glass_session_var ... 1479
Values (default in bold) ... 1479
Description ... 1479
Example ... 1480
datestyle ... 1480
Values (default in bold) ... 1480
Description ... 1478
Example ... 1480
default_geometry_encoding ... 1480
Values (default in bold) ... 1480
Description ... 1478
describe_field_name_in_uppercase ... 1481
Values (default in bold) ... 1481
Description ... 1478
Example ... 1480
downcase_delimited_identifier ... 1481
Values (default in bold) ... 1481
Description ... 1478
enable_case_sensitive_identifier ... 1481
Values (default in bold) ... 1481
enable_result_cache_for_session ... 1483
Values (default in bold) ... 1483
Description ... 1478
enable_vacuum_boost ... 1483
Values (default in bold) ... 1483
Description ... 1478
error_on_nondeterministic_update ... 1483
Values (default in bold) ... 1483
Description ... 1478
Example ... 1480
extra_float_digits ... 1484
Values (default in bold) ... 1484
Description ... 1484
json_serialization_enable ... 1484
Values (default in bold) ... 1484
Description ... 1478
json_serialization_parse_nested_strings ... 1484
Values (default in bold) ... 1484
Description ... 1478
max_concurrency_scaling_clusters ... 1485
Values (default in bold) ... 1485
Description ... 1485
max_cursor_result_set_size ... 1485
Values (default in bold) ... 1485
Description ... 1485
mv_enable_aqmv_for_session ... 1485
Values (default in bold) ... 1485
Description ... 1485
navigate_super_null_on_error ... 1485
Values (default in bold) ... 1485
Description ... 1478
parse_super_null_on_error ... 1486
Values (default in bold) ... 1486
Description ... 1478
query_group ... 1486
Values (default in bold) ... 1486
Description ... 1486
search_path ... 1487
Values (default in bold) ... 1487
Description ... 1487
Example ... 1487
spectrum_query_maxerror ... 1488
Values (default in bold) ... 1488
Description ... 1488
Example ... 1488
statement_timeout ... 1489
Values (default in bold) ... 1489
Description ... 1489
Example ... 1489
stored_proc_log_min_messages ... 1489
Values (default in bold) ... 1489
Description ... 1478
timezone ... 1490
Values (default in bold) ... 1490
Syntax ... 1490
Description ... 1490
Time zone formats ... 1490
Examples ... 1492
wlm_query_slot_count ... 1492
Values (default in bold) ... 1492
Description ... 1492
Examples ... 1493
Sample database ... 1494
CATEGORY table ... 1495
DATE table ... 1495
EVENT table ... 1496
VENUE table ... 1496
USERS table ... 1497
LISTING table ... 1497
SALES table ... 1498
Time zone names and abbreviations ... 1499
Time zone names ... 1499
Time zone abbreviations ... 1508
Document history ... 1512
Earlier updates ... 1518
Are you a first-time Amazon Redshift user?
Amazon Redshift system overview
Topics
• Are you a first-time Amazon Redshift user? (p. 1)
• Are you a database developer? (p. 2)
• Prerequisites (p. 3)
• System and architecture overview (p. 3)
This is the Amazon Redshift Database Developer Guide.
Amazon Redshift is an enterprise-level, petabyte scale, fully managed data warehousing service.
This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with databases as a designer, software developer, or administrator, it gives you the information you need to design, build, query, and maintain your data warehouse.
Are you a first-time Amazon Redshift user?
If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following sections.
• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.
• Getting Started – Amazon Redshift Getting Started Guide includes an example that walks you
through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.
After you complete the Getting Started guide, we recommend that you explore one of the following guides:
• Amazon Redshift Cluster Management Guide – The Cluster Management guide shows you how to create and manage Amazon Redshift clusters.
If you are an application developer, you can use the Amazon Redshift Query API to manage clusters programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift API can help simplify your programming tasks. If you prefer a more interactive way of managing clusters, you can use the Amazon Redshift console and the AWS Command Line Interface (AWS CLI).
For information about the API and CLI, go to the following manuals:
• API reference
• CLI reference
• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the Database Developer Guide explains how to design, build, query, and maintain the databases that make up your data warehouse.
If you are transitioning to Amazon Redshift from another relational database system or data warehouse application, you should be aware of important differences in how Amazon Redshift is implemented. For
Are you a database developer?
a summary of the most important considerations for designing tables and loading data, see Amazon Redshift best practices for designing tables (p. 16) and Amazon Redshift best practices for loading data (p. 19). Amazon Redshift is based on PostgreSQL. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).
Are you a database developer?
If you are a database user, database designer, database developer, or database administrator, the following table will help you find what you're looking for.
If you want to ... We recommend Quickly start using
Amazon Redshift Begin by following the steps in Amazon Redshift Getting Started Guide to quickly deploy a cluster, connect to a database, and try out some queries.
When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide.
Learn about the internal architecture of the Amazon Redshift data warehouse.
The System and architecture overview (p. 3) gives a high-level overview of Amazon Redshift's internal architecture.
If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page.
Create databases, tables, users, and other database objects.
Getting started using databases is a quick introduction to the basics of SQL development.
The Amazon Redshift SQL (p. 469) has the syntax and examples for Amazon Redshift SQL commands and functions and other SQL elements.
Amazon Redshift best practices for designing tables (p. 16) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings.
Learn how to design tables for optimum performance.
Working with automatic table optimization (p. 37) details considerations for applying compression to the data in table columns and choosing distribution and sort keys.
Load data. Loading data (p. 65) explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets.
Amazon Redshift best practices for loading data (p. 19) provides for tips for loading your data quickly and effectively.
Manage users, groups,
and database security. Managing database security (p. 457) covers database security topics.
Monitor and optimize
system performance. The System tables reference (p. 1238) details system tables and views that you can query for the status of the database and monitor queries and processes.
You should also consult the Amazon Redshift Cluster Management Guide to learn how to use the AWS Management Console to check the system health, monitor metrics, and back up and restore clusters.
Prerequisites
If you want to ... We recommend Analyze and report
information from very large datasets.
Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page.
The SQL reference (p. 469) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports.
Prerequisites
Before you use this guide, you should complete these tasks.
• Install a SQL client.
• Launch an Amazon Redshift cluster.
• Connect your SQL client to a database in your cluster.
For step-by-step instructions, see Amazon Redshift Getting Started Guide.
You should also know how to use your SQL client and should have a fundamental understanding of the SQL language.
System and architecture overview
Topics
• Data warehouse system architecture (p. 3)
• Performance (p. 5)
• Columnar storage (p. 7)
• Workload management (p. 9)
• Using Amazon Redshift with other services (p. 9)
An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.
Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.
When you run analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.
Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.
Data warehouse system architecture
This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in the following figure.
Data warehouse system architecture
Client applications
Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).
Clusters
The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external
communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL functions supported on the leader node (p. 469).
Compute nodes
The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes runs the compiled code and send intermediate results back to the leader node for final aggregation.
Performance
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides several node types for your compute and storage needs. For details of each node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.
Node slices
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Cluster Management Guide.
When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and run queries efficiently. For information about choosing a distribution key, see Choose the best distribution style (p. 17).
Internal network
Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.
Databases
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.
Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).
Performance
Amazon Redshift achieves extremely fast query execution by employing these performance features.
Topics
• Massively parallel processing (p. 6)
• Columnar data storage (p. 6)
• Data compression (p. 6)
• Query optimizer (p. 6)
• Result caching (p. 6)
• Compiled code (p. 7)
Performance
Massively parallel processing
Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node. For more
information, see Choose the best distribution style (p. 17).
Loading data from flat files takes advantage of parallel processing by spreading the workload across multiple nodes while simultaneously reading from multiple files. For more information about how to load data into tables, see Amazon Redshift best practices for loading data (p. 19).
Columnar data storage
Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. See Columnar storage (p. 7) for a more detailed explanation.
When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. For more information, see Choose the best sort key (p. 16).
Data compression
Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you run a query, the compressed data is read into memory, then uncompressed during query execution. Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data. To learn more about using automatic data compression, see Loading tables with automatic compression (p. 89).
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see Tuning query performance (p. 383).
Result caching
To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn't run the query. Result caching is transparent to the user.
Result caching is turned on by default. To turn off result caching for the current session, set the enable_result_cache_for_session (p. 1483) parameter to off.
Columnar storage
Amazon Redshift uses cached results for a new query when all of the following are true:
• The user submitting the query has access privilege to the objects used in the query.
• The table or views in the query haven't been modified.
• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
• The query doesn't reference Amazon Redshift Spectrum external tables.
• Configuration parameters that might affect query results are unchanged.
• The query syntactically matches the cached query.
To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some large query result sets. Amazon Redshift determines whether to cache query results based on a number of factors. These factors include the number of entries in the cache and the instance type of your Amazon Redshift cluster.
To determine whether a query used the result cache, query the SVL_QLOG (p. 1382) system view. If a query used the result cache, the source_query column returns the query ID of the source query. If result caching wasn't used, the source_query column value is NULL.
The following example shows that queries submitted by userid 104 and userid 102 use the result cache from queries run by userid 100.
select userid, query, elapsed, source_query from svl_qlog where userid > 1
order by query desc;
userid | query | elapsed | source_query ---+---+---+--- 104 | 629035 | 27 | 628919 104 | 629034 | 60 | 628900 104 | 629033 | 23 | 628891 102 | 629017 | 1229393 | 102 | 628942 | 28 | 628919 102 | 628941 | 57 | 628900 102 | 628940 | 26 | 628891 100 | 628919 | 84295686 | 100 | 628900 | 87015637 | 100 | 628891 | 58808694 |
Compiled code
The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.
The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql (libq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code.
Columnar storage
Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.
Columnar storage
The following series of illustrations describe how columnar data storage implements efficiencies and how that translates into efficiencies when retrieving data into memory.
This first illustration shows how records from database tables are typically stored into disk blocks by row.
In a typical relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, storage for an entire record may take more than one block. If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.
The next illustration shows how with columnar storage, the values for each column are stored sequentially into disk blocks.
Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.
In this simplified example, using columnar storage, each data block holds column field values for as many as three times as many records as row-based storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.
An added advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. For more information about compression encodings based on data types, see Compression encodings (p. 40).
The savings in space for storing data on disk also carries over to retrieving and then storing that data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small
Workload management
number of records, data warehouse queries commonly read only a few columns for a very large number of rows. This means that reading the same number of column field values for the same number of rows requires a fraction of the I/O operations and uses a fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very large numbers of columns and very large row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases.
In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.
Workload management
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.
Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user- accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.
When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.
Currently, the default for clusters using the default parameter group is to use automatic WLM.
Automatic WLM manages query concurrency and memory allocation. For more information, see Implementing automatic WLM (p. 413).
With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console.
You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.
For more information about implementing and using workload management, see Implementing workload management (p. 411).
Using Amazon Redshift with other services
Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your data quickly and reliably, using data security features.
Moving data between Amazon Redshift and Amazon S3
Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. For more information, see Loading data from Amazon S3 (p. 68).
Using Amazon Redshift with other services
You can also use parallel processing to export data from your Amazon Redshift data warehouse to multiple data files on Amazon S3. For more information, see Unloading data (p. 143).
Using Amazon Redshift with Amazon DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see Loading data from an Amazon DynamoDB table (p. 86).
Importing data from remote hosts over SSH
You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts using SSH and runs commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. The COPY command reads and loads the output from multiple host sources in parallel. For more information, see Loading data from remote hosts (p. 80).
Automating data loads using AWS Data Pipeline
You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and run recurring jobs without having to write your own complex data transfer or transformation logic.
For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift, see Copy data to Amazon Redshift using AWS Data Pipeline in the AWS Data Pipeline Developer Guide.
Migrating data using AWS Database Migration Service (AWS DMS)
You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can
migrate your data to and from most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL. For more information, see Using an Amazon Redshift database as a target for AWS Database Migration Service.
Conducting a proof of concept
Amazon Redshift best practices
Following, you can find best practices for planning a proof of concept, designing tables, loading data into tables, and writing queries for Amazon Redshift, and also a discussion of working with Amazon Redshift Advisor.
Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively parallel processing, columnar data storage, and columnar data compression. If your data loading and query execution times are longer than you expect, or longer than you want, you might be overlooking key information.
If you are an experienced SQL database developer, we strongly recommend that you review this topic before you begin developing your Amazon Redshift data warehouse.
If you are new to developing SQL databases, this topic is not the best place to start. We recommend that you begin by reading Getting started using databases and trying the examples yourself.
In this topic, you can find an overview of the most important development principles, along with specific tips, examples, and best practices for implementing those principles. No single practice can apply to every application. You should evaluate all of your options before finalizing a database design. For more information, see Working with automatic table optimization (p. 37), Loading data (p. 65), Tuning query performance (p. 383), and the reference chapters.
Topics
• Conducting a proof of concept for Amazon Redshift (p. 11)
• Amazon Redshift best practices for designing tables (p. 16)
• Amazon Redshift best practices for loading data (p. 19)
• Amazon Redshift best practices for designing queries (p. 22)
• Working with recommendations from Amazon Redshift Advisor (p. 24)
Conducting a proof of concept for Amazon Redshift
Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL with your existing business intelligence (BI) tools. Amazon Redshift offers fast performance in a low-cost cloud data warehouse. It uses sophisticated query optimization, accelerated cache, columnar storage on high-performance local disks, and massively parallel query execution.
In the following sections, you can find a framework for building a proof of concept with Amazon Redshift. The framework helps you to use architectural best practices for designing and operating a secure, high-performing, and cost-effective data warehouse. This guidance is based on reviewing designs of thousands of customer architectures across a wide variety of business types and use cases. We have compiled customer experiences to develop this set of best practices to help you develop criteria for evaluating your data warehouse workload.
Overview of the process
Conducting a proof of concept is a three-step process:
1. Identify the goals of the proof of concept – you can work backward from your business requirements and success criteria, and translate them into a technical proof of concept project plan.
Identify the business goals and success criteria
2. Set up the proof of concept environment – most of the setup process is a click of few buttons to create your resources. Within minutes, you can have a data warehouse environment ready with data loaded.
3. Complete the proof of concept project plan to ensure that the goals are met.
In the following sections, we go into the details of each step.
Identify the business goals and success criteria
Identifying the goals of the proof of concept plays a critical role in determining what you want to measure as part of the evaluation process. The evaluation criteria should include the current scaling challenges, enhancements to improve your customer's experience of the data warehouse, and methods of addressing your current operational pain points. You can use the following questions to identify the goals of the proof of concept:
• What are your goals for scaling your data warehouse?
• What are the specific service-level agreements whose terms you want to improve?
• What new datasets do you need to include in your data warehouse?
• What are the business-critical SQL queries that you need to test and measure? Make sure to include the full range of SQL complexities, such as the different types of queries (for example, select, insert, update, and delete).
• What are the general types of workloads you plan to test? Examples might include extract-transform- load (ETL) workloads, reporting queries, and batch extracts.
After you have answered these questions, you should be able to establish SMART goals and success criteria for building your proof of concept. For information about setting goals, see SMART criteria in Wikipedia.
Set up your proof of concept
Because we eliminated hardware provisioning, networking, and software installation from an on- premises data warehouse, trying Amazon Redshift with your own dataset has never been easier. Many of the sizing decisions and estimations that used to be required are now simply a click away. You can flexibly resize your cluster or adjust the ratio of storage versus compute.
Broadly, setting up the Amazon Redshift proof of concept environment is a two-step process. It involves the launching of a data warehouse and then the conversion of the schema and datasets for evaluation.
Choose a starting cluster size
You can choose the node type and number of nodes using the Amazon Redshift console. We recommend that you also test resizing the cluster as part of your proof of concept plan. To get the initial sizing for your cluster, take the following steps:
1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. On the navigation pane, choose Create cluster to open the configuration page.
3. For Cluster identifier, enter a name for your cluster.
4. Choose one of the following methods to size your cluster:
NoteThe following step describes an Amazon Redshift console that is running in an AWS Region that supports RA3 node types. For a list of AWS Regions that support RA3 node types, see Overview of RA3 node types in the Amazon Redshift Cluster Management Guide.
Checklist for a complete evaluation
• If your AWS Region supports RA3 node types, choose either Production or Free trial to answer the question What are you planning to use this cluster for?
If your organization is eligible, you might be able to create a cluster under the Amazon Redshift free trial program. To do this, choose Free trial to create a configuration with the dc2.large node type.
For more information about choosing a free trial, see Amazon Redshift free trial.
• If you don't know how large to size your cluster, choose Help me choose. Doing this starts a sizing calculator that asks you questions about the size and query characteristics of the data that you plan to store in your data warehouse.
If you know the required size of your cluster (that is, the node type and number of nodes), choose I'll choose. Then choose the Node type and number of Nodes to size your cluster for the proof of concept.
5. After you enter all required cluster properties, choose Create cluster to launch your data warehouse.
For more details about creating clusters with the Amazon Redshift console, see Creating a cluster in the Amazon Redshift Cluster Management Guide.
Convert the schema and set up the datasets for the proof of concept
If you don't have an existing data warehouse, skip this section and see Amazon Redshift Getting Started Guide. Amazon Redshift Getting Started Guide provides a tutorial to create a cluster and examples of setting up data in Amazon Redshift.
When migrating from your existing data warehouse, you can convert schema, code, and data using the AWS Schema Conversion Tool and the AWS Database Migration Service. Your choice of tools depends on the source of your data and optional ongoing replications. For more information, see What Is the AWS Schema Conversion Tool? in the AWS Schema Conversion Tool User Guide and What Is AWS Database Migration Service? in the AWS Database Migration Service User Guide. The following can help you set up your data in Amazon Redshift:
• Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool – this blog post provides an overview on how you can use the AWS SCT data extractors to migrate your existing data warehouse to Amazon Redshift. The AWS SCT tool can migrate your data from many legacy platforms (such as Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica).
• Optionally, you can also use the AWS Database Migration Service for ongoing replications of changed data from the source. For more information, see Using an Amazon Redshift Database as a Target for AWS Database Migration Service in the AWS Database Migration Service User Guide.
Amazon Redshift is a relational database management system (RDBMS). As such, it can run many types of data models including star schemas, snowflake schemas, data vault models, and simple, flat, or normalized tables. After setting up your schemas in Amazon Redshift, you can take advantage of massively parallel processing and columnar data storage for fast analytical queries out of the box. For information about types of schemas, see star schema, snowflake schema, and data vault modeling in Wikipedia.
Checklist for a complete evaluation
Make sure that a complete evaluation meets all your data warehouse needs. Consider including the following items in your success criteria:
• Data load time – using the COPY command is a common way to test how long it takes to load data. For more information, see Amazon Redshift best practices for loading data (p. 19).
Develop a project plan for your evaluation
• Throughput of the cluster – measuring queries per hour is a common way to determine throughput.
To do so, set up a test to run typical queries for your workload.
• Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. You also have a number of options for managing keys. Amazon Redshift also supports single sign-on (SSO) integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and data transfer.
• Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with business intelligence and other external tools.
• Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis. You can use this integration when setting up and managing your data warehouse.
• Backups and snapshots – backups and snapshots are created automatically. You can also create a point-in-time snapshot at any time or on a schedule. Try using a snapshot and creating a second cluster as part of your evaluation. Evaluate if your development and testing organizations can use the cluster.
• Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes.
Evaluate that the workload throughput before and after a resize meets any variability of the volume of your workload. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift Cluster Management Guide.
• Concurrency scaling – this feature helps you handle variability of traffic volume in your data warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. For more information, see Working with concurrency scaling (p. 438).
• Automatic workload management (WLM) – prioritize your business critical queries over other queries by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory resources dynamically. For more information, see Implementing automatic WLM (p. 413).
• Amazon Redshift Advisor – the Advisor develops customized recommendations to increase
performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in to the Amazon Redshift console to view Advisor recommendations. For more information, see Working with recommendations from Amazon Redshift Advisor (p. 24).
• Table design – Amazon Redshift provides great performance out of the box for most workloads.
When you create a table, the default sort key and distribution key is AUTO. For more information, see Working with automatic table optimization (p. 37).
• Support – we strongly recommend that you evaluate AWS Support as part of your evaluation. Also, make sure to talk to your account manager about your proof of concept. AWS can help with technical guidance and credits for the proof of concept if you qualify. If you don't find the help you're looking for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request support for your Amazon Redshift proof-of-concept.
• Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift
Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and run queries against petabytes of data in Amazon S3 using your existing cluster. For more information, see Querying external data using Amazon Redshift Spectrum (p. 232).
Develop a project plan for your evaluation
Some of the following techniques for creating query benchmarks might help support your Amazon Redshift evaluation:
• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per category) helps ensure that your evaluation reflects a real-world data warehouse implementation.
Add a unique identifier to associate each query that you include in your evaluation with one of the
Additional resources to help your evaluation
categories you establish for your evaluation. You can then use these unique identifiers to determine throughput from the system tables.
You can also create a query group to organize your evaluation queries. For example, if you have established a "Reporting" category for your evaluation, you might create a coding system to tag your evaluation queries with the word "Report." You can then identify individual queries within reporting as R1, R2, and so on. The following example demonstrates this approach.
SELECT 'Reporting' AS query_category, 'R1' as query_id, * FROM customers;
SELECT query, datediff(seconds, starttime, endtime) FROM stl_query
WHERE
querytxt LIKE '%Reporting%'
and starttime >= '2018-04-15 00:00' and endtime < '2018-04-15 23:59';
When you have associated a query with an evaluation category, you can use a unique identifier to determine throughput from the system tables for each category.
• Test throughput with historical user or ETL queries that have a variety of runtimes in your existing data warehouse. You might use a load testing utility, such as the open-source JMeter or a custom utility. If so, make sure that your utility does the following:
• It can take the network transmission time into account.
• It evaluates execution time based on throughput of the internal system tables. For information about how to do this, see Analyzing the query summary (p. 396).
• Identify all the various permutations that you plan to test during your evaluation. The following list provides some common variables:
• Cluster size
• Node type
• Load testing duration
• Concurrency settings
• Reduce the cost of your proof of concept by pausing your cluster during off-hours and weekends.
When a cluster is paused, on-demand compute billing is suspended. To run tests on the cluster, resume per-second billing. You can also create a schedule to pause and resume your cluster automatically.
For more information, see Pausing and resuming clusters in the Amazon Redshift Cluster Management Guide.
At this stage, you're ready to complete your project plan and evaluate results.
Additional resources to help your evaluation
To help your Amazon Redshift evaluation, see the following:
• Service highlights and pricing – this product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.
• Amazon Redshift Getting Started Guide – this guide provides a tutorial of using Amazon Redshift to create a sample cluster and work with sample data.
• Getting started with Amazon Redshift Spectrum (p. 234) – in this tutorial, you learn how to use Redshift Spectrum to query data directly from files on Amazon S3.
• Amazon Redshift management overview – this topic in the Amazon Redshift Cluster Management Guide provides an overview of Amazon Redshift.