• 沒有找到結果。

Amazon Redshift

N/A
N/A
Protected

Academic year: 2022

Share "Amazon Redshift"

Copied!
1549
0
0

加載中.... (立即查看全文)

全文

(1)

Amazon Redshift

Database Developer Guide

(2)

Amazon Redshift: Database Developer Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

(3)

Table of Contents

System overview ... 1

Are you a first-time Amazon Redshift user? ... 1

Are you a database developer? ... 2

Prerequisites ... 3

System and architecture overview ... 3

Data warehouse system architecture ... 3

Performance ... 5

Columnar storage ... 7

Workload management ... 9

Using Amazon Redshift with other services ... 9

Best practices ... 11

Conducting a proof of concept ... 11

Overview of the process ... 11

Identify the business goals and success criteria ... 12

Set up your proof of concept ... 12

Checklist for a complete evaluation ... 13

Develop a project plan for your evaluation ... 14

Additional resources to help your evaluation ... 15

Need help? ... 16

Best practices for designing tables ... 16

Choose the best sort key ... 16

Choose the best distribution style ... 17

Use automatic compression ... 18

Define constraints ... 18

Use the smallest possible column size ... 18

Use date/time data types for date columns ... 19

Best practices for loading data ... 19

Take the loading data tutorial ... 19

Use a COPY command to load data ... 19

Use a single COPY command ... 19

Split your load data ... 20

Compress your data files ... 20

Verify data files before and after a load ... 20

Use a multi-row insert ... 20

Use a bulk insert ... 21

Load data in sort key order ... 21

Load data in sequential blocks ... 21

Use time-series tables ... 21

Use a staging table to perform a merge ... 22

Schedule around maintenance windows ... 22

Best practices for designing queries ... 22

Working with Advisor ... 24

Amazon Redshift Regions ... 24

Access Advisor ... 25

Advisor recommendations ... 26

Tutorials ... 36

Working with automatic table optimization ... 37

Enabling automatic table optimization ... 37

Removing automatic table optimization ... 38

Monitoring actions of automatic table optimization ... 38

Working with column compression ... 39

Compression encodings ... 40

Testing compression encodings ... 46

Example: Choosing compression encodings for the CUSTOMER table ... 48

(4)

Working with data distribution styles ... 50

Data distribution concepts ... 51

Distribution styles ... 51

Viewing distribution styles ... 52

Evaluating query patterns ... 53

Designating distribution styles ... 54

Evaluating the query plan ... 55

Query plan example ... 56

Distribution examples ... 60

Working with sort keys ... 62

Compound sort key ... 63

Interleaved sort key ... 63

Defining table constraints ... 64

Loading data ... 65

Using COPY to load data ... 65

Credentials and access permissions ... 66

Preparing your input data ... 67

Loading data from Amazon S3 ... 68

Loading data from Amazon EMR ... 76

Loading data from remote hosts ... 80

Loading from Amazon DynamoDB ... 86

Verifying that the data loaded correctly ... 88

Validating input data ... 88

Automatic compression ... 89

Optimizing for narrow tables ... 91

Default values ... 91

Troubleshooting ... 91

Updating with DML ... 96

Updating and inserting ... 96

Merge method 1: Replacing existing rows ... 97

Merge method 2: Specifying a column list ... 97

Creating a temporary staging table ... 97

Performing a merge operation by replacing existing rows ... 98

Performing a merge operation by specifying a column list ... 98

Merge examples ... 100

Performing a deep copy ... 102

Analyzing tables ... 103

Automatic analyze ... 104

Analysis of new table data ... 104

ANALYZE command history ... 107

Vacuuming tables ... 108

Automatic table sort ... 108

Automatic vacuum delete ... 109

VACUUM frequency ... 109

Sort stage and merge stage ... 110

Vacuum threshold ... 110

Vacuum types ... 110

Managing vacuum times ... 110

Managing concurrent write operations ... 116

Serializable isolation ... 117

Write and read/write operations ... 120

Concurrent write examples ... 121

Tutorial: Loading data from Amazon S3 ... 122

Prerequisites ... 122

Overview ... 123

Steps ... 123

Step 1: Create a cluster ... 123

(5)

Step 2: Download the data files ... 125

Step 3: Upload the files to an Amazon S3 bucket ... 125

Step 4: Create the sample tables ... 127

Step 5: Run the COPY commands ... 129

Step 6: Vacuum and analyze the database ... 141

Step 7: Clean up your resources ... 141

Summary ... 142

Unloading data ... 143

Unloading data to Amazon S3 ... 143

Unloading encrypted data files ... 145

Unloading data in delimited or fixed-width format ... 147

Reloading unloaded data ... 148

Creating user-defined functions ... 149

UDF security and privileges ... 149

Creating a scalar SQL UDF ... 150

Scalar SQL function example ... 150

Creating a scalar Python UDF ... 151

Scalar Python UDF example ... 151

Python UDF data types ... 151

ANYELEMENT data type ... 152

Python language support ... 152

UDF constraints ... 155

Creating a scalar Lambda UDF ... 156

Registering a Lambda UDF ... 156

Managing Lambda UDF security and privileges ... 156

Configuring the authorization parameter for Lambda UDFs ... 157

Using the JSON interface between Amazon Redshift and Lambda ... 156

Naming UDFs ... 160

Logging errors and warnings ... 161

Example uses of UDFs ... 162

Creating stored procedures ... 163

Stored procedure overview ... 163

Naming stored procedures ... 165

Security and privileges ... 166

Returning a result set ... 167

Managing transactions ... 168

Trapping errors ... 173

Logging stored procedures ... 174

Limits and differences ... 175

PL/pgSQL language reference ... 176

PL/pgSQL reference conventions ... 176

Structure of PL/pgSQL ... 176

Supported PL/pgSQL statements ... 180

Creating materialized views ... 192

Querying a materialized view ... 194

Automatic query rewriting to use materialized views ... 195

Usage notes ... 195

Limitations ... 195

Refreshing a materialized view ... 196

Autorefreshing a materialized view ... 198

Automated materialized views (preview) ... 198

Streaming ingestion (preview) ... 199

Data flow ... 200

Streaming ingestion use cases ... 200

Getting started with streaming ingestion ... 200

Limitations ... 201

Electric vehicle station-data tutorial (preview) ... 202

(6)

Querying spatial data ... 205

Tutorial: Using spatial SQL functions ... 207

Prerequisites ... 207

Step 1: Create tables and load test data ... 208

Step 2: Query spatial data ... 210

Step 3: Clean up your resources ... 212

Loading a shapefile ... 213

Terminology ... 214

Bounding box ... 214

Geometric validity ... 214

Geometric simplicity ... 216

Limitations ... 217

Querying data with federated queries ... 218

Getting started with using federated queries to PostgreSQL ... 218

Getting started using federated queries to PostgreSQL with CloudFormation ... 219

Launching a CloudFormation stack for Redshift federated queries ... 220

Querying data from the external schema ... 221

Getting started with using federated queries to MySQL ... 221

Creating a secret and an IAM role ... 222

Prerequisites ... 222

Examples of using a federated query ... 224

Example of using a federated query with PostgreSQL ... 224

Example of using a mixed-case name ... 225

Example of using a federated query with MySQL ... 227

Data type differences ... 227

Limitations and considerations ... 230

Querying external data using Amazon Redshift Spectrum ... 232

Amazon Redshift Spectrum overview ... 232

Amazon Redshift Spectrum Regions ... 233

Amazon Redshift Spectrum considerations ... 233

Getting started with Amazon Redshift Spectrum ... 234

Prerequisites ... 234

CloudFormation ... 234

Getting started with Redshift Spectrum step by step ... 234

Step 1. Create an IAM role ... 234

Step 2: Associate the IAM role with your cluster ... 237

Step 3: Create an external schema and an external table ... 238

Step 4: Query your data in Amazon S3 ... 238

Launch your CloudFormation stack and then query your data ... 241

IAM policies for Amazon Redshift Spectrum ... 243

Amazon S3 permissions ... 244

Cross-account Amazon S3 permissions ... 244

Grant or restrict access using Redshift Spectrum ... 245

Minimum permissions ... 245

Chaining IAM roles ... 246

Accessing AWS Glue data ... 247

Using Redshift Spectrum with Lake Formation ... 253

Creating data files for queries in Amazon Redshift Spectrum ... 254

Data formats for Redshift Spectrum ... 254

Compression types for Redshift Spectrum ... 255

Encryption for Redshift Spectrum ... 255

Creating external schemas ... 256

Working with external catalogs ... 257

Creating external tables ... 261

Pseudocolumns ... 262

Partitioning Redshift Spectrum external tables ... 263

Mapping to ORC columns ... 267

(7)

Creating external tables for Hudi-managed data ... 269

Creating external tables for Delta Lake data ... 270

Improving Amazon Redshift Spectrum query performance ... 271

Setting data handling options ... 273

Monitoring metrics ... 274

Troubleshooting queries ... 274

Retries exceeded ... 275

Access throttled ... 275

Resource limit exceeded ... 276

No rows returned for a partitioned table ... 276

Not authorized error ... 276

Incompatible data formats ... 276

Syntax error when using Hive DDL in Amazon Redshift ... 277

Permission to create temporary tables ... 277

Invalid range ... 277

Invalid Parquet version number ... 278

Tutorial: Querying nested data with Amazon Redshift Spectrum ... 278

Overview ... 278

Step 1: Create an external table that contains nested data ... 279

Step 2: Query your nested data in Amazon S3 with SQL extensions ... 280

Nested data use cases ... 283

Nested data limitations ... 285

Serializing complex nested JSON ... 285

Using HyperLogLog sketches in Amazon Redshift ... 288

Considerations ... 288

Limitations ... 289

Examples ... 289

Example: Return cardinality in a subquery ... 289

Example: Return an HLLSKETCH type from combined sketches in a subquery ... 290

Example: Return a HyperLogLog sketch from combining multiple sketches ... 290

Example: Generate HyperLogLog sketches over S3 data using external tables ... 291

Querying data across databases ... 293

Considerations ... 294

Limitations ... 294

Examples of using a cross-database query ... 295

Using cross-database queries with the query editor ... 298

Sharing data across clusters ... 300

Data sharing overview ... 300

Data sharing use cases ... 300

Data sharing concepts ... 301

Sharing data at different levels ... 302

Managing data consistency ... 302

Accessing shared data ... 302

Data sharing considerations in Amazon Redshift ... 303

How data sharing works ... 303

Controlling access for cross-account datashares ... 304

Working with views in Amazon Redshift data sharing ... 307

Managing the data sharing lifecycle ... 308

Managing permissions for datashares ... 309

Tracking usage and auditing in data sharing ... 310

Cluster management and data sharing ... 310

Integrating Amazon Redshift data sharing with business intelligence tools ... 311

Accessing metadata for datashares ... 312

Working with AWS Data Exchange for Amazon Redshift ... 312

How AWS Data Exchange datashares work ... 313

Considerations when using AWS Data Exchange for Amazon Redshift ... 313

Getting started data sharing ... 314

(8)

Getting started data sharing using the SQL interface ... 314

Getting started data sharing using the console ... 334

Getting started data sharing with CloudFormation ... 343

Limitations ... 345

Ingesting and querying semistructured data in Amazon Redshift ... 346

Use cases for the SUPER data type ... 346

Concepts for SUPER data type use ... 347

Considerations for SUPER data ... 348

SUPER sample dataset ... 348

Loading semistructured data into Amazon Redshift ... 350

Parsing JSON documents to SUPER columns ... 350

Using COPY to load JSON data in Amazon Redshift ... 351

Unloading semistructured data ... 354

Unloading semistructured data in CSV or text formats ... 354

Unloading semistructured data in the Parquet format ... 355

Querying semistructured data ... 355

Navigation ... 355

Unnesting queries ... 356

Object unpivoting ... 357

Dynamic typing ... 358

Lax semantics ... 360

Types of introspection ... 360

Order by ... 361

Operators and functions ... 362

Arithmetic operators ... 362

Arithmetic functions ... 362

Array functions ... 363

SUPER configurations ... 364

Lax and strict modes for SUPER ... 364

Accessing JSON fields with upper or mixed case letters ... 364

Limitations ... 365

Using SUPER data type with materialized views ... 367

Accelerating PartiQL queries ... 367

Limitations for using the SUPER data type with materialized views ... 369

Using machine learning in Amazon Redshift ... 370

Machine learning overview ... 371

How machine learning can solve a problem ... 371

Terms and concepts for Amazon Redshift ML ... 372

Machine learning for novices and experts ... 373

Costs for using Amazon Redshift ML ... 374

Getting started with Amazon Redshift ML ... 375

Administrative setup ... 376

Using model explainability with Amazon Redshift ML ... 379

Tutorial: Creating a customer churn model ... 379

Tuning query performance ... 383

Query processing ... 383

Query planning and execution workflow ... 383

Query plan ... 385

Reviewing query plan steps ... 390

Factors affecting query performance ... 392

Analyzing and improving queries ... 393

Query analysis workflow ... 393

Reviewing query alerts ... 394

Analyzing the query plan ... 395

Analyzing the query summary ... 396

Improving query performance ... 401

Diagnostic queries for query tuning ... 403

(9)

Troubleshooting queries ... 406

Connection fails ... 407

Query hangs ... 407

Query takes too long ... 408

Load fails ... 409

Load takes too long ... 409

Load data is incorrect ... 409

Setting the JDBC fetch size parameter ... 409

Implementing workload management ... 411

Modifying the WLM configuration ... 412

Migrating from manual WLM to automatic WLM ... 412

Automatic WLM ... 413

Priority ... 414

Concurrency scaling mode ... 414

User groups ... 414

Query groups ... 414

Wildcards ... 415

Query monitoring rules ... 415

Checking for automatic WLM ... 415

Query priority ... 415

Manual WLM ... 419

Concurrency scaling mode ... 420

Concurrency level ... 420

User groups ... 421

Query groups ... 421

Wildcards ... 421

WLM memory percent to use ... 421

WLM timeout ... 422

Query monitoring rules ... 422

WLM query queue hopping ... 422

Tutorial: Configuring manual WLM queues ... 425

Concurrency scaling ... 438

Concurrency scaling capabilities ... 438

Limitations for concurrency scaling ... 439

Regions for concurrency scaling ... 439

Concurrency scaling candidates ... 440

Configuring concurrency scaling queues ... 416

Monitoring concurrency scaling ... 441

Concurrency scaling system views ... 441

Short query acceleration ... 442

Maximum SQA runtime ... 442

Monitoring SQA ... 443

WLM queue assignment rules ... 443

Queue assignments example ... 444

Assigning queries to queues ... 446

Assigning queries to queues based on user groups ... 446

Assigning a query to a query group ... 446

Assigning queries to the superuser queue ... 446

Dynamic and static properties ... 447

WLM dynamic memory allocation ... 448

Dynamic WLM example ... 448

Query monitoring rules ... 450

Defining a query monitor rule ... 450

Query monitoring metrics ... 451

Query monitoring rules templates ... 453

System tables and views for query monitoring rules ... 454

WLM system tables and views ... 455

(10)

WLM service class IDs ... 456

Managing database security ... 457

Amazon Redshift security overview ... 457

Default database user privileges ... 458

Superusers ... 458

Users ... 459

Creating, altering, and deleting users ... 459

Role-based access control ... 460

Role hierarchy ... 460

Role assignment ... 460

Amazon Redshift system-defined roles ... 461

System privileges ... 461

Database object permissions ... 464

ALTER DEFAULT PRIVILEGES for RBAC ... 465

Considerations for role usage ... 465

Managing roles ... 465

Groups ... 465

Creating, altering, and deleting groups ... 466

Schemas ... 466

Creating, altering, and deleting schemas ... 466

Search path ... 467

Schema-based privileges ... 467

Example for controlling user and group access ... 467

SQL reference ... 469

Amazon Redshift SQL ... 469

SQL functions supported on the leader node ... 469

Amazon Redshift and PostgreSQL ... 470

Using SQL ... 475

SQL reference conventions ... 476

Basic elements ... 476

Expressions ... 505

Conditions ... 509

SQL commands ... 527

ABORT ... 529

ALTER DATABASE ... 530

ALTER DATASHARE ... 532

ALTER DEFAULT PRIVILEGES ... 534

ALTER GROUP ... 536

ALTER MATERIALIZED VIEW ... 537

ALTER ROLE ... 538

ALTER PROCEDURE ... 539

ALTER SCHEMA ... 540

ALTER TABLE ... 541

ALTER TABLE APPEND ... 556

ALTER USER ... 560

ANALYZE ... 563

ANALYZE COMPRESSION ... 565

BEGIN ... 567

CALL ... 568

CANCEL ... 570

CLOSE ... 572

COMMENT ... 573

COMMIT ... 574

COPY ... 575

CREATE DATABASE ... 640

CREATE DATASHARE ... 646

CREATE EXTERNAL FUNCTION ... 647

(11)

CREATE EXTERNAL SCHEMA ... 654

CREATE EXTERNAL TABLE ... 660

CREATE FUNCTION ... 678

CREATE GROUP ... 682

CREATE LIBRARY ... 683

CREATE MATERIALIZED VIEW ... 685

CREATE MODEL ... 688

CREATE PROCEDURE ... 701

CREATE ROLE ... 704

CREATE SCHEMA ... 705

CREATE TABLE ... 707

CREATE TABLE AS ... 722

CREATE USER ... 729

CREATE VIEW ... 733

DEALLOCATE ... 736

DECLARE ... 737

DELETE ... 739

DESC DATASHARE ... 741

DROP DATABASE ... 742

DROP DATASHARE ... 743

DROP FUNCTION ... 744

DROP GROUP ... 745

DROP LIBRARY ... 746

DROP MODEL ... 747

DROP MATERIALIZED VIEW ... 747

DROP PROCEDURE ... 748

DROP ROLE ... 749

DROP SCHEMA ... 750

DROP TABLE ... 752

DROP USER ... 754

DROP VIEW ... 756

END ... 757

EXECUTE ... 758

EXPLAIN ... 759

FETCH ... 763

GRANT ... 764

INSERT ... 777

INSERT (external table) ... 781

LOCK ... 783

PREPARE ... 784

REFRESH MATERIALIZED VIEW ... 785

RESET ... 787

REVOKE ... 788

ROLLBACK ... 797

SELECT ... 798

SELECT INTO ... 833

SET ... 834

SET SESSION AUTHORIZATION ... 837

SET SESSION CHARACTERISTICS ... 838

SHOW ... 838

SHOW EXTERNAL TABLE ... 838

SHOW MODEL ... 840

SHOW DATASHARES ... 842

SHOW PROCEDURE ... 843

SHOW TABLE ... 844

SHOW VIEW ... 845

START TRANSACTION ... 846

(12)

TRUNCATE ... 846

UNLOAD ... 847

UPDATE ... 866

VACUUM ... 870

SQL functions reference ... 875

Leader node–only functions ... 876

Compute node–only functions ... 876

Aggregate functions ... 877

Array functions ... 895

VARBYTE functions ... 899

Bit-wise aggregate functions ... 905

Conditional expressions ... 910

Date and time functions ... 919

Spatial functions ... 959

Hash functions ... 1053

HyperLogLog functions ... 1058

JSON functions ... 1062

Math functions ... 1069

Machine learning functions ... 1093

String functions ... 1095

SUPER type information functions ... 1140

Window functions ... 1149

Data type formatting functions ... 1192

System administration functions ... 1209

System information functions ... 1216

Reserved words ... 1235

System tables reference ... 1238

System tables and views ... 1238

Types of system tables and views ... 1238

Visibility of data in system tables and views ... 1239

Filtering system-generated queries ... 1239

STV tables for snapshot data ... 1239

STV_ACTIVE_CURSORS ... 1240

STV_BLOCKLIST ... 1241

STV_CURSOR_CONFIGURATION ... 1244

STV_EXEC_STATE ... 1244

STV_INFLIGHT ... 1245

STV_LOAD_STATE ... 1247

STV_LOCKS ... 1248

STV_ML_MODEL_INFO ... 1249

STV_MV_DEPS ... 1249

STV_MV_INFO ... 1250

STV_NODE_STORAGE_CAPACITY ... 1252

STV_PARTITIONS ... 1253

STV_QUERY_METRICS ... 1254

STV_RECENTS ... 1258

STV_SESSIONS ... 1259

STV_SLICES ... 1260

STV_STARTUP_RECOVERY_STATE ... 1261

STV_TBL_PERM ... 1261

STV_TBL_TRANS ... 1263

STV_WLM_QMR_CONFIG ... 1265

STV_WLM_CLASSIFICATION_CONFIG ... 1265

STV_WLM_QUERY_QUEUE_STATE ... 1266

STV_WLM_QUERY_STATE ... 1267

STV_WLM_QUERY_TASK_STATE ... 1269

STV_WLM_SERVICE_CLASS_CONFIG ... 1270

(13)

STV_WLM_SERVICE_CLASS_STATE ... 1271

System views ... 1272

STL views for logging ... 1273

SVCS views ... 1354

SVL views ... 1372

SVV views ... 1412

Monitoring views (preview) ... 1454

System catalog tables ... 1463

PG_ATTRIBUTE_INFO ... 1464

PG_CLASS_INFO ... 1464

PG_DATABASE_INFO ... 1465

PG_DEFAULT_ACL ... 1466

PG_EXTERNAL_SCHEMA ... 1467

PG_LIBRARY ... 1468

PG_PROC_INFO ... 1469

PG_STATISTIC_INDICATOR ... 1469

PG_TABLE_DEF ... 1470

Querying the catalog tables ... 1472

Configuration reference ... 1477

Modifying the server configuration ... 1477

activate_aqua ... 1478

Values ... 1478

Description ... 1478

Example ... 1478

analyze_threshold_percent ... 1478

Values (default in bold) ... 1478

Description ... 1479

Examples ... 1479

cast_super_null_on_error ... 1479

Values (default in bold) ... 1479

Description ... 1478

datashare_break_glass_session_var ... 1479

Values (default in bold) ... 1479

Description ... 1479

Example ... 1480

datestyle ... 1480

Values (default in bold) ... 1480

Description ... 1478

Example ... 1480

default_geometry_encoding ... 1480

Values (default in bold) ... 1480

Description ... 1478

describe_field_name_in_uppercase ... 1481

Values (default in bold) ... 1481

Description ... 1478

Example ... 1480

downcase_delimited_identifier ... 1481

Values (default in bold) ... 1481

Description ... 1478

enable_case_sensitive_identifier ... 1481

Values (default in bold) ... 1481

enable_result_cache_for_session ... 1483

Values (default in bold) ... 1483

Description ... 1478

enable_vacuum_boost ... 1483

Values (default in bold) ... 1483

Description ... 1478

(14)

error_on_nondeterministic_update ... 1483

Values (default in bold) ... 1483

Description ... 1478

Example ... 1480

extra_float_digits ... 1484

Values (default in bold) ... 1484

Description ... 1484

json_serialization_enable ... 1484

Values (default in bold) ... 1484

Description ... 1478

json_serialization_parse_nested_strings ... 1484

Values (default in bold) ... 1484

Description ... 1478

max_concurrency_scaling_clusters ... 1485

Values (default in bold) ... 1485

Description ... 1485

max_cursor_result_set_size ... 1485

Values (default in bold) ... 1485

Description ... 1485

mv_enable_aqmv_for_session ... 1485

Values (default in bold) ... 1485

Description ... 1485

navigate_super_null_on_error ... 1485

Values (default in bold) ... 1485

Description ... 1478

parse_super_null_on_error ... 1486

Values (default in bold) ... 1486

Description ... 1478

query_group ... 1486

Values (default in bold) ... 1486

Description ... 1486

search_path ... 1487

Values (default in bold) ... 1487

Description ... 1487

Example ... 1487

spectrum_query_maxerror ... 1488

Values (default in bold) ... 1488

Description ... 1488

Example ... 1488

statement_timeout ... 1489

Values (default in bold) ... 1489

Description ... 1489

Example ... 1489

stored_proc_log_min_messages ... 1489

Values (default in bold) ... 1489

Description ... 1478

timezone ... 1490

Values (default in bold) ... 1490

Syntax ... 1490

Description ... 1490

Time zone formats ... 1490

Examples ... 1492

wlm_query_slot_count ... 1492

Values (default in bold) ... 1492

Description ... 1492

Examples ... 1493

Sample database ... 1494

(15)

CATEGORY table ... 1495

DATE table ... 1495

EVENT table ... 1496

VENUE table ... 1496

USERS table ... 1497

LISTING table ... 1497

SALES table ... 1498

Time zone names and abbreviations ... 1499

Time zone names ... 1499

Time zone abbreviations ... 1508

Document history ... 1512

Earlier updates ... 1518

(16)

Are you a first-time Amazon Redshift user?

Amazon Redshift system overview

Topics

• Are you a first-time Amazon Redshift user? (p. 1)

• Are you a database developer? (p. 2)

• Prerequisites (p. 3)

• System and architecture overview (p. 3)

This is the Amazon Redshift Database Developer Guide.

Amazon Redshift is an enterprise-level, petabyte scale, fully managed data warehousing service.

This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with databases as a designer, software developer, or administrator, it gives you the information you need to design, build, query, and maintain your data warehouse.

Are you a first-time Amazon Redshift user?

If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following sections.

• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.

• Getting Started – Amazon Redshift Getting Started Guide includes an example that walks you

through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.

After you complete the Getting Started guide, we recommend that you explore one of the following guides:

• Amazon Redshift Cluster Management Guide – The Cluster Management guide shows you how to create and manage Amazon Redshift clusters.

If you are an application developer, you can use the Amazon Redshift Query API to manage clusters programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift API can help simplify your programming tasks. If you prefer a more interactive way of managing clusters, you can use the Amazon Redshift console and the AWS Command Line Interface (AWS CLI).

For information about the API and CLI, go to the following manuals:

• API reference

• CLI reference

• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the Database Developer Guide explains how to design, build, query, and maintain the databases that make up your data warehouse.

If you are transitioning to Amazon Redshift from another relational database system or data warehouse application, you should be aware of important differences in how Amazon Redshift is implemented. For

(17)

Are you a database developer?

a summary of the most important considerations for designing tables and loading data, see Amazon Redshift best practices for designing tables (p. 16) and Amazon Redshift best practices for loading data (p. 19). Amazon Redshift is based on PostgreSQL. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Are you a database developer?

If you are a database user, database designer, database developer, or database administrator, the following table will help you find what you're looking for.

If you want to ... We recommend Quickly start using

Amazon Redshift Begin by following the steps in Amazon Redshift Getting Started Guide to quickly deploy a cluster, connect to a database, and try out some queries.

When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide.

Learn about the internal architecture of the Amazon Redshift data warehouse.

The System and architecture overview (p. 3) gives a high-level overview of Amazon Redshift's internal architecture.

If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page.

Create databases, tables, users, and other database objects.

Getting started using databases is a quick introduction to the basics of SQL development.

The Amazon Redshift SQL (p. 469) has the syntax and examples for Amazon Redshift SQL commands and functions and other SQL elements.

Amazon Redshift best practices for designing tables (p. 16) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings.

Learn how to design tables for optimum performance.

Working with automatic table optimization (p. 37) details considerations for applying compression to the data in table columns and choosing distribution and sort keys.

Load data. Loading data (p. 65) explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets.

Amazon Redshift best practices for loading data (p. 19) provides for tips for loading your data quickly and effectively.

Manage users, groups,

and database security. Managing database security (p. 457) covers database security topics.

Monitor and optimize

system performance. The System tables reference (p. 1238) details system tables and views that you can query for the status of the database and monitor queries and processes.

You should also consult the Amazon Redshift Cluster Management Guide to learn how to use the AWS Management Console to check the system health, monitor metrics, and back up and restore clusters.

(18)

Prerequisites

If you want to ... We recommend Analyze and report

information from very large datasets.

Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page.

The SQL reference (p. 469) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports.

Prerequisites

Before you use this guide, you should complete these tasks.

• Install a SQL client.

• Launch an Amazon Redshift cluster.

• Connect your SQL client to a database in your cluster.

For step-by-step instructions, see Amazon Redshift Getting Started Guide.

You should also know how to use your SQL client and should have a fundamental understanding of the SQL language.

System and architecture overview

Topics

• Data warehouse system architecture (p. 3)

• Performance (p. 5)

• Columnar storage (p. 7)

• Workload management (p. 9)

• Using Amazon Redshift with other services (p. 9)

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.

Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.

When you run analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.

Data warehouse system architecture

This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in the following figure.

(19)

Data warehouse system architecture

Client applications

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Clusters

The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external

communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader node

The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL functions supported on the leader node (p. 469).

Compute nodes

The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes runs the compiled code and send intermediate results back to the leader node for final aggregation.

(20)

Performance

Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Amazon Redshift provides several node types for your compute and storage needs. For details of each node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.

Node slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Cluster Management Guide.

When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and run queries efficiently. For information about choosing a distribution key, see Choose the best distribution style (p. 17).

Internal network

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

Databases

A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Performance

Amazon Redshift achieves extremely fast query execution by employing these performance features.

Topics

• Massively parallel processing (p. 6)

• Columnar data storage (p. 6)

• Data compression (p. 6)

• Query optimizer (p. 6)

• Result caching (p. 6)

• Compiled code (p. 7)

(21)

Performance

Massively parallel processing

Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.

Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node. For more

information, see Choose the best distribution style (p. 17).

Loading data from flat files takes advantage of parallel processing by spreading the workload across multiple nodes while simultaneously reading from multiple files. For more information about how to load data into tables, see Amazon Redshift best practices for loading data (p. 19).

Columnar data storage

Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. See Columnar storage (p. 7) for a more detailed explanation.

When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. For more information, see Choose the best sort key (p. 16).

Data compression

Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you run a query, the compressed data is read into memory, then uncompressed during query execution. Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data. To learn more about using automatic data compression, see Loading tables with automatic compression (p. 89).

Query optimizer

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see Tuning query performance (p. 383).

Result caching

To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn't run the query. Result caching is transparent to the user.

Result caching is turned on by default. To turn off result caching for the current session, set the enable_result_cache_for_session (p. 1483) parameter to off.

(22)

Columnar storage

Amazon Redshift uses cached results for a new query when all of the following are true:

• The user submitting the query has access privilege to the objects used in the query.

• The table or views in the query haven't been modified.

• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.

• The query doesn't reference Amazon Redshift Spectrum external tables.

• Configuration parameters that might affect query results are unchanged.

• The query syntactically matches the cached query.

To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some large query result sets. Amazon Redshift determines whether to cache query results based on a number of factors. These factors include the number of entries in the cache and the instance type of your Amazon Redshift cluster.

To determine whether a query used the result cache, query the SVL_QLOG (p. 1382) system view. If a query used the result cache, the source_query column returns the query ID of the source query. If result caching wasn't used, the source_query column value is NULL.

The following example shows that queries submitted by userid 104 and userid 102 use the result cache from queries run by userid 100.

select userid, query, elapsed, source_query from svl_qlog where userid > 1

order by query desc;

userid | query | elapsed | source_query ---+---+---+--- 104 | 629035 | 27 | 628919 104 | 629034 | 60 | 628900 104 | 629033 | 23 | 628891 102 | 629017 | 1229393 | 102 | 628942 | 28 | 628919 102 | 628941 | 57 | 628900 102 | 628940 | 26 | 628891 100 | 628919 | 84295686 | 100 | 628900 | 87015637 | 100 | 628891 | 58808694 |

Compiled code

The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.

The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql (libq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code.

Columnar storage

Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

(23)

Columnar storage

The following series of illustrations describe how columnar data storage implements efficiencies and how that translates into efficiencies when retrieving data into memory.

This first illustration shows how records from database tables are typically stored into disk blocks by row.

In a typical relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, storage for an entire record may take more than one block. If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.

The next illustration shows how with columnar storage, the values for each column are stored sequentially into disk blocks.

Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.

In this simplified example, using columnar storage, each data block holds column field values for as many as three times as many records as row-based storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.

An added advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. For more information about compression encodings based on data types, see Compression encodings (p. 40).

The savings in space for storing data on disk also carries over to retrieving and then storing that data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small

(24)

Workload management

number of records, data warehouse queries commonly read only a few columns for a very large number of rows. This means that reading the same number of column field values for the same number of rows requires a fraction of the I/O operations and uses a fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very large numbers of columns and very large row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases.

In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.

Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.

Workload management

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user- accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.

When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.

Currently, the default for clusters using the default parameter group is to use automatic WLM.

Automatic WLM manages query concurrency and memory allocation. For more information, see Implementing automatic WLM (p. 413).

With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.

The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console.

You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.

For more information about implementing and using workload management, see Implementing workload management (p. 411).

Using Amazon Redshift with other services

Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your data quickly and reliably, using data security features.

Moving data between Amazon Redshift and Amazon S3

Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. For more information, see Loading data from Amazon S3 (p. 68).

(25)

Using Amazon Redshift with other services

You can also use parallel processing to export data from your Amazon Redshift data warehouse to multiple data files on Amazon S3. For more information, see Unloading data (p. 143).

Using Amazon Redshift with Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see Loading data from an Amazon DynamoDB table (p. 86).

Importing data from remote hosts over SSH

You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts using SSH and runs commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. The COPY command reads and loads the output from multiple host sources in parallel. For more information, see Loading data from remote hosts (p. 80).

Automating data loads using AWS Data Pipeline

You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and run recurring jobs without having to write your own complex data transfer or transformation logic.

For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift, see Copy data to Amazon Redshift using AWS Data Pipeline in the AWS Data Pipeline Developer Guide.

Migrating data using AWS Database Migration Service (AWS DMS)

You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can

migrate your data to and from most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL. For more information, see Using an Amazon Redshift database as a target for AWS Database Migration Service.

(26)

Conducting a proof of concept

Amazon Redshift best practices

Following, you can find best practices for planning a proof of concept, designing tables, loading data into tables, and writing queries for Amazon Redshift, and also a discussion of working with Amazon Redshift Advisor.

Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively parallel processing, columnar data storage, and columnar data compression. If your data loading and query execution times are longer than you expect, or longer than you want, you might be overlooking key information.

If you are an experienced SQL database developer, we strongly recommend that you review this topic before you begin developing your Amazon Redshift data warehouse.

If you are new to developing SQL databases, this topic is not the best place to start. We recommend that you begin by reading Getting started using databases and trying the examples yourself.

In this topic, you can find an overview of the most important development principles, along with specific tips, examples, and best practices for implementing those principles. No single practice can apply to every application. You should evaluate all of your options before finalizing a database design. For more information, see Working with automatic table optimization (p. 37), Loading data (p. 65), Tuning query performance (p. 383), and the reference chapters.

Topics

• Conducting a proof of concept for Amazon Redshift (p. 11)

• Amazon Redshift best practices for designing tables (p. 16)

• Amazon Redshift best practices for loading data (p. 19)

• Amazon Redshift best practices for designing queries (p. 22)

• Working with recommendations from Amazon Redshift Advisor (p. 24)

Conducting a proof of concept for Amazon Redshift

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL with your existing business intelligence (BI) tools. Amazon Redshift offers fast performance in a low-cost cloud data warehouse. It uses sophisticated query optimization, accelerated cache, columnar storage on high-performance local disks, and massively parallel query execution.

In the following sections, you can find a framework for building a proof of concept with Amazon Redshift. The framework helps you to use architectural best practices for designing and operating a secure, high-performing, and cost-effective data warehouse. This guidance is based on reviewing designs of thousands of customer architectures across a wide variety of business types and use cases. We have compiled customer experiences to develop this set of best practices to help you develop criteria for evaluating your data warehouse workload.

Overview of the process

Conducting a proof of concept is a three-step process:

1. Identify the goals of the proof of concept – you can work backward from your business requirements and success criteria, and translate them into a technical proof of concept project plan.

(27)

Identify the business goals and success criteria

2. Set up the proof of concept environment – most of the setup process is a click of few buttons to create your resources. Within minutes, you can have a data warehouse environment ready with data loaded.

3. Complete the proof of concept project plan to ensure that the goals are met.

In the following sections, we go into the details of each step.

Identify the business goals and success criteria

Identifying the goals of the proof of concept plays a critical role in determining what you want to measure as part of the evaluation process. The evaluation criteria should include the current scaling challenges, enhancements to improve your customer's experience of the data warehouse, and methods of addressing your current operational pain points. You can use the following questions to identify the goals of the proof of concept:

• What are your goals for scaling your data warehouse?

• What are the specific service-level agreements whose terms you want to improve?

• What new datasets do you need to include in your data warehouse?

• What are the business-critical SQL queries that you need to test and measure? Make sure to include the full range of SQL complexities, such as the different types of queries (for example, select, insert, update, and delete).

• What are the general types of workloads you plan to test? Examples might include extract-transform- load (ETL) workloads, reporting queries, and batch extracts.

After you have answered these questions, you should be able to establish SMART goals and success criteria for building your proof of concept. For information about setting goals, see SMART criteria in Wikipedia.

Set up your proof of concept

Because we eliminated hardware provisioning, networking, and software installation from an on- premises data warehouse, trying Amazon Redshift with your own dataset has never been easier. Many of the sizing decisions and estimations that used to be required are now simply a click away. You can flexibly resize your cluster or adjust the ratio of storage versus compute.

Broadly, setting up the Amazon Redshift proof of concept environment is a two-step process. It involves the launching of a data warehouse and then the conversion of the schema and datasets for evaluation.

Choose a starting cluster size

You can choose the node type and number of nodes using the Amazon Redshift console. We recommend that you also test resizing the cluster as part of your proof of concept plan. To get the initial sizing for your cluster, take the following steps:

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://

console.aws.amazon.com/redshift/.

2. On the navigation pane, choose Create cluster to open the configuration page.

3. For Cluster identifier, enter a name for your cluster.

4. Choose one of the following methods to size your cluster:

NoteThe following step describes an Amazon Redshift console that is running in an AWS Region that supports RA3 node types. For a list of AWS Regions that support RA3 node types, see Overview of RA3 node types in the Amazon Redshift Cluster Management Guide.

(28)

Checklist for a complete evaluation

• If your AWS Region supports RA3 node types, choose either Production or Free trial to answer the question What are you planning to use this cluster for?

If your organization is eligible, you might be able to create a cluster under the Amazon Redshift free trial program. To do this, choose Free trial to create a configuration with the dc2.large node type.

For more information about choosing a free trial, see Amazon Redshift free trial.

• If you don't know how large to size your cluster, choose Help me choose. Doing this starts a sizing calculator that asks you questions about the size and query characteristics of the data that you plan to store in your data warehouse.

If you know the required size of your cluster (that is, the node type and number of nodes), choose I'll choose. Then choose the Node type and number of Nodes to size your cluster for the proof of concept.

5. After you enter all required cluster properties, choose Create cluster to launch your data warehouse.

For more details about creating clusters with the Amazon Redshift console, see Creating a cluster in the Amazon Redshift Cluster Management Guide.

Convert the schema and set up the datasets for the proof of concept

If you don't have an existing data warehouse, skip this section and see Amazon Redshift Getting Started Guide. Amazon Redshift Getting Started Guide provides a tutorial to create a cluster and examples of setting up data in Amazon Redshift.

When migrating from your existing data warehouse, you can convert schema, code, and data using the AWS Schema Conversion Tool and the AWS Database Migration Service. Your choice of tools depends on the source of your data and optional ongoing replications. For more information, see What Is the AWS Schema Conversion Tool? in the AWS Schema Conversion Tool User Guide and What Is AWS Database Migration Service? in the AWS Database Migration Service User Guide. The following can help you set up your data in Amazon Redshift:

• Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool – this blog post provides an overview on how you can use the AWS SCT data extractors to migrate your existing data warehouse to Amazon Redshift. The AWS SCT tool can migrate your data from many legacy platforms (such as Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica).

• Optionally, you can also use the AWS Database Migration Service for ongoing replications of changed data from the source. For more information, see Using an Amazon Redshift Database as a Target for AWS Database Migration Service in the AWS Database Migration Service User Guide.

Amazon Redshift is a relational database management system (RDBMS). As such, it can run many types of data models including star schemas, snowflake schemas, data vault models, and simple, flat, or normalized tables. After setting up your schemas in Amazon Redshift, you can take advantage of massively parallel processing and columnar data storage for fast analytical queries out of the box. For information about types of schemas, see star schema, snowflake schema, and data vault modeling in Wikipedia.

Checklist for a complete evaluation

Make sure that a complete evaluation meets all your data warehouse needs. Consider including the following items in your success criteria:

Data load time – using the COPY command is a common way to test how long it takes to load data. For more information, see Amazon Redshift best practices for loading data (p. 19).

(29)

Develop a project plan for your evaluation

Throughput of the cluster – measuring queries per hour is a common way to determine throughput.

To do so, set up a test to run typical queries for your workload.

Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. You also have a number of options for managing keys. Amazon Redshift also supports single sign-on (SSO) integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and data transfer.

Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with business intelligence and other external tools.

Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis. You can use this integration when setting up and managing your data warehouse.

Backups and snapshots – backups and snapshots are created automatically. You can also create a point-in-time snapshot at any time or on a schedule. Try using a snapshot and creating a second cluster as part of your evaluation. Evaluate if your development and testing organizations can use the cluster.

Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes.

Evaluate that the workload throughput before and after a resize meets any variability of the volume of your workload. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift Cluster Management Guide.

Concurrency scaling – this feature helps you handle variability of traffic volume in your data warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. For more information, see Working with concurrency scaling (p. 438).

Automatic workload management (WLM) – prioritize your business critical queries over other queries by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory resources dynamically. For more information, see Implementing automatic WLM (p. 413).

Amazon Redshift Advisor – the Advisor develops customized recommendations to increase

performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in to the Amazon Redshift console to view Advisor recommendations. For more information, see Working with recommendations from Amazon Redshift Advisor (p. 24).

Table design – Amazon Redshift provides great performance out of the box for most workloads.

When you create a table, the default sort key and distribution key is AUTO. For more information, see Working with automatic table optimization (p. 37).

Support – we strongly recommend that you evaluate AWS Support as part of your evaluation. Also, make sure to talk to your account manager about your proof of concept. AWS can help with technical guidance and credits for the proof of concept if you qualify. If you don't find the help you're looking for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request support for your Amazon Redshift proof-of-concept.

Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift

Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and run queries against petabytes of data in Amazon S3 using your existing cluster. For more information, see Querying external data using Amazon Redshift Spectrum (p. 232).

Develop a project plan for your evaluation

Some of the following techniques for creating query benchmarks might help support your Amazon Redshift evaluation:

• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per category) helps ensure that your evaluation reflects a real-world data warehouse implementation.

Add a unique identifier to associate each query that you include in your evaluation with one of the

(30)

Additional resources to help your evaluation

categories you establish for your evaluation. You can then use these unique identifiers to determine throughput from the system tables.

You can also create a query group to organize your evaluation queries. For example, if you have established a "Reporting" category for your evaluation, you might create a coding system to tag your evaluation queries with the word "Report." You can then identify individual queries within reporting as R1, R2, and so on. The following example demonstrates this approach.

SELECT 'Reporting' AS query_category, 'R1' as query_id, * FROM customers;

SELECT query, datediff(seconds, starttime, endtime) FROM stl_query

WHERE

querytxt LIKE '%Reporting%'

and starttime >= '2018-04-15 00:00' and endtime < '2018-04-15 23:59';

When you have associated a query with an evaluation category, you can use a unique identifier to determine throughput from the system tables for each category.

• Test throughput with historical user or ETL queries that have a variety of runtimes in your existing data warehouse. You might use a load testing utility, such as the open-source JMeter or a custom utility. If so, make sure that your utility does the following:

• It can take the network transmission time into account.

• It evaluates execution time based on throughput of the internal system tables. For information about how to do this, see Analyzing the query summary (p. 396).

• Identify all the various permutations that you plan to test during your evaluation. The following list provides some common variables:

• Cluster size

• Node type

• Load testing duration

• Concurrency settings

• Reduce the cost of your proof of concept by pausing your cluster during off-hours and weekends.

When a cluster is paused, on-demand compute billing is suspended. To run tests on the cluster, resume per-second billing. You can also create a schedule to pause and resume your cluster automatically.

For more information, see Pausing and resuming clusters in the Amazon Redshift Cluster Management Guide.

At this stage, you're ready to complete your project plan and evaluate results.

Additional resources to help your evaluation

To help your Amazon Redshift evaluation, see the following:

• Service highlights and pricing – this product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.

• Amazon Redshift Getting Started Guide – this guide provides a tutorial of using Amazon Redshift to create a sample cluster and work with sample data.

• Getting started with Amazon Redshift Spectrum (p. 234) – in this tutorial, you learn how to use Redshift Spectrum to query data directly from files on Amazon S3.

• Amazon Redshift management overview – this topic in the Amazon Redshift Cluster Management Guide provides an overview of Amazon Redshift.

參考文獻

相關文件

 Evaluated deadline and cost perfor mance of various scheduling polici es under a large range of SLA cost function and

Since the FP-tree reduces the number of database scans and uses less memory to represent the necessary information, many frequent pattern mining algorithms are based on its

The Performance Evaluation for Horizontal, Vertical and Hybrid Schema in Database Systems.. -A Case Study of Wireless Broadband

For example, both Illumination Cone and Quotient Image require several face images of different lighting directions in order to train their database; all of

The aims of this study are: (1) to provide a repository for collecting ECG files, (2) to decode SCP-ECG files and store the results in a database for data management and further

This research is focused on the integration of test theory, item response theory (IRT), network technology, and database management into an online adaptive test system developed

In response to the twenty-first century’s global economy, “broadband network construction” is an important basis for the government in developing the national knowledge and

The purpose of this research is to explore the important and satisfaction analysis of experiential marketing in traditional bakery industry by using Importance-Performance and