
Amazon Redshift: Database Developer Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


Amazon Redshift system overview

Topics

• Are you a first-time Amazon Redshift user?

• Are you a database developer?

• Prerequisites

• System and architecture overview

This is the Amazon Redshift Database Developer Guide.

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.

This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with databases as a designer, software developer, or administrator, it gives you the information you need to design, build, query, and maintain your data warehouse.

Are you a first-time Amazon Redshift user?

If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following sections.

• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.

• Getting Started – The Amazon Redshift Getting Started Guide includes an example that walks you through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.

After you complete the Getting Started guide, we recommend that you explore one of the following guides:

• Amazon Redshift Cluster Management Guide – The Cluster Management guide shows you how to create and manage Amazon Redshift clusters.

If you are an application developer, you can use the Amazon Redshift Query API to manage clusters programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift API can help simplify your programming tasks. If you prefer a more interactive way of managing clusters, you can use the Amazon Redshift console and the AWS Command Line Interface (AWS CLI).

For information about the API and CLI, go to the following manuals:

• API reference

• CLI reference

• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the Database Developer Guide explains how to design, build, query, and maintain the databases that make up your data warehouse.

If you are transitioning to Amazon Redshift from another relational database system or data warehouse application, you should be aware of important differences in how Amazon Redshift is implemented. For a summary of the most important considerations for designing tables and loading data, see Amazon Redshift best practices for designing tables (p. 16) and Amazon Redshift best practices for loading data (p. 19). Amazon Redshift is based on PostgreSQL. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Are you a database developer?

If you are a database user, database designer, database developer, or database administrator, the following table will help you find what you're looking for.

If you want to ... | We recommend

Quickly start using Amazon Redshift | Begin by following the steps in the Amazon Redshift Getting Started Guide to quickly deploy a cluster, connect to a database, and try out some queries. When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide.

Learn about the internal architecture of the Amazon Redshift data warehouse. | The System and architecture overview (p. 3) gives a high-level overview of Amazon Redshift's internal architecture. If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page.

Create databases, tables, users, and other database objects. | Getting started using databases is a quick introduction to the basics of SQL development. The Amazon Redshift SQL (p. 469) has the syntax and examples for Amazon Redshift SQL commands and functions and other SQL elements.

Learn how to design tables for optimum performance. | Amazon Redshift best practices for designing tables (p. 16) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings. Working with automatic table optimization (p. 37) details considerations for applying compression to the data in table columns and choosing distribution and sort keys.

Load data. | Loading data (p. 65) explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets. Amazon Redshift best practices for loading data (p. 19) provides tips for loading your data quickly and effectively.

Manage users, groups, and database security. | Managing database security (p. 457) covers database security topics.

Monitor and optimize system performance. | The System tables reference (p. 1238) details system tables and views that you can query for the status of the database and monitor queries and processes. You should also consult the Amazon Redshift Cluster Management Guide to learn how to use the AWS Management Console to check the system health, monitor metrics, and back up and restore clusters.

Analyze and report information from very large datasets. | Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page. The SQL reference (p. 469) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports.

Prerequisites

Before you use this guide, you should complete these tasks.

• Install a SQL client.

• Launch an Amazon Redshift cluster.

• Connect your SQL client to a database in your cluster.

For step-by-step instructions, see the Amazon Redshift Getting Started Guide.

You should also know how to use your SQL client and should have a fundamental understanding of the SQL language.

System and architecture overview

Topics

• Data warehouse system architecture

• Performance

• Columnar storage

• Workload management

• Using Amazon Redshift with other services

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.

Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.

When you run analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.

Data warehouse system architecture

This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in the following figure.


Client applications

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Clusters

The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader node

The leader node manages communications with client programs and all communication with compute nodes. It parses queries and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL functions supported on the leader node (p. 469).
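As a hedged illustration of this rule, the following sketch contrasts a query that runs only on the leader node with one that also references a table on the compute nodes. It assumes that CURRENT_SCHEMA is among the leader-node-only functions in the referenced list. The table name demo_users is hypothetical.

```sql
-- Hypothetical table, created only for this illustration.
create table demo_users (userid integer, username varchar(30));

-- References no user tables, so it runs exclusively on the leader node.
select current_schema();

-- Combines a leader-node-only function with a table stored on the
-- compute nodes, so this statement is expected to return an error.
select current_schema(), userid from demo_users;
```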

Compute nodes

The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes run the compiled code and send intermediate results back to the leader node for final aggregation.


Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Amazon Redshift provides several node types for your compute and storage needs. For details of each node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.

Node slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Cluster Management Guide.
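One way to see how a particular cluster is sliced is to query the STV_SLICES system table. The following is a minimal sketch. It assumes the session has permission to read that table, and the result depends on the node type and number of nodes in your cluster.

```sql
-- Count the slices that belong to each compute node in the current cluster.
select node, count(slice) as slices
from stv_slices
group by node
order by node;
```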

When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and run queries efficiently. For information about choosing a distribution key, see Choose the best distribution style (p. 17).
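The following sketch shows one way to declare a distribution key. The table and column names are hypothetical, and the choice of customer_id as the key is an assumption made for illustration, not a recommendation for any particular workload.

```sql
-- Hypothetical tables. Both use customer_id as the distribution key,
-- so rows that share a customer_id value are placed on the same slice.
create table demo_customers (
  customer_id integer not null distkey,
  name        varchar(100)
);

create table demo_orders (
  order_id    bigint  not null,
  customer_id integer not null distkey,
  order_date  date    not null,
  amount      decimal(12,2)
);
```

Distributing both tables on the column they are joined on is a common design choice because matching rows are already located together, which can reduce data movement between nodes during the join.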

Internal network

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

Databases

A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Performance

Amazon Redshift achieves extremely fast query execution by employing these performance features.

Topics

• Massively parallel processing

• Columnar data storage

• Data compression

• Query optimizer

• Result caching

• Compiled code


Massively parallel processing

Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.

Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node. For more information, see Choose the best distribution style (p. 17).

Loading data from flat files takes advantage of parallel processing by spreading the workload across multiple nodes while simultaneously reading from multiple files. For more information about how to load data into tables, see Amazon Redshift best practices for loading data (p. 19).

Columnar data storage

Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. See Columnar storage (p. 7) for a more detailed explanation.

When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. For more information, see Choose the best sort key (p. 16).
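As a minimal sketch of how a sort key relates to block filtering, the following example defines a hypothetical table sorted on a timestamp column and then runs a range-restricted query against it. The table, columns, and date range are assumptions made for illustration.

```sql
-- Hypothetical table sorted on event_time.
create table demo_events (
  event_id   bigint    not null,
  event_time timestamp not null sortkey,
  event_type varchar(20)
);

-- A range predicate on the sort key column gives the query processor
-- the opportunity to skip blocks that fall outside the requested range.
select event_type, count(*) as events
from demo_events
where event_time >= '2024-01-01'
  and event_time <  '2024-02-01'
group by event_type;
```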

Data compression

Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you run a query, the compressed data is read into memory, then uncompressed during query execution. Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data. To learn more about using automatic data compression, see Loading tables with automatic compression (p. 89).
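The following sketch shows one way to let a load choose the encodings and then inspect the result. The table name, Amazon S3 location, and IAM role ARN are placeholders, and the example assumes the input files are in CSV format.

```sql
-- Hypothetical table defined without explicit compression encodings.
create table demo_sales (
  sale_id   bigint not null,
  sale_date date,
  region    varchar(20),
  amount    decimal(12,2)
);

-- Load the empty table and let COPY apply compression encodings.
-- The bucket, prefix, and role ARN below are placeholders.
copy demo_sales
from 's3://example-bucket/demo_sales/'
iam_role 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
format as csv
compupdate on;

-- Inspect the encoding assigned to each column.
select "column", type, encoding
from pg_table_def
where tablename = 'demo_sales';
```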

Query optimizer

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see Tuning query performance (p. 383).

Result caching

To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn't run the query. Result caching is transparent to the user.

Result caching is turned on by default. To turn off result caching for the current session, set the enable_result_cache_for_session (p. 1483) parameter to off.
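For example, a session that needs to measure uncached query times, such as a benchmark run, might change the parameter as sketched here. The benchmarking scenario is an assumption. The parameter name comes from the text above.

```sql
-- Turn off result caching for the current session only.
set enable_result_cache_for_session to off;

-- Confirm the current value of the parameter.
show enable_result_cache_for_session;

-- Turn result caching back on for the rest of the session.
set enable_result_cache_for_session to on;
```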


Amazon Redshift uses cached results for a new query when all of the following are true:

• The user submitting the query has access privilege to the objects used in the query.

• The tables or views in the query haven't been modified.

• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.

• The query doesn't reference Amazon Redshift Spectrum external tables.

• Configuration parameters that might affect query results are unchanged.

• The query syntactically matches the cached query.

To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some large query result sets. Amazon Redshift determines whether to cache query results based on a number of factors. These factors include the number of entries in the cache and the instance type of your Amazon Redshift cluster.

To determine whether a query used the result cache, query the SVL_QLOG (p. 1382) system view. If a query used the result cache, the source_query column returns the query ID of the source query. If result caching wasn't used, the source_query column value is NULL.

The following example shows that queries submitted by userid 104 and userid 102 use the result cache from queries run by userid 100.

select userid, query, elapsed, source_query from svl_qlog
where userid > 1
order by query desc;

userid | query  | elapsed  | source_query
-------+--------+----------+-------------
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |
   100 | 628900 | 87015637 |
   100 | 628891 | 58808694 |

Compiled code

The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.

The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql (libpq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code.

Columnar storage

Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.


The following series of illustrations describes how columnar data storage implements efficiencies and how that translates into efficiencies when retrieving data into memory.

This first illustration shows how records from database tables are typically stored into disk blocks by row.

In a typical relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, storage for an entire record may take more than one block. If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.

The next illustration shows how with columnar storage, the values for each column are stored sequentially into disk blocks.

Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.

In this simplified example, using columnar storage, each data block holds column field values for up to three times as many records as row-based storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.

An added advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. For more information about compression encodings based on data types, see Compression encodings (p. 40).

The savings in space for storing data on disk also carries over to retrieving and then storing that data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small number of records, data warehouse queries commonly read only a few columns for a very large number of rows. This means that reading the same number of column field values for the same number of rows requires a fraction of the I/O operations and uses a fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very large numbers of columns and very large row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases.

In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.

Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.
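To see how the blocks of a loaded table are spread across its columns, one option is to join the STV_BLOCKLIST and STV_TBL_PERM system tables, as in the following sketch. The table name demo_sales is a placeholder for an existing, loaded table, and access to these system tables might require superuser privileges. Because each block is 1 MB, the block count approximates the storage in megabytes used by each column.

```sql
-- Count the 1 MB blocks used by each column of a table.
-- Replace 'demo_sales' with the name of a table that contains data.
select b.col, count(*) as blocks
from stv_blocklist b
join stv_tbl_perm p
  on b.tbl = p.id
 and b.slice = p.slice
where p.name = 'demo_sales'
group by b.col
order by b.col;
```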

Workload management

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.

When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.
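Setting a query group label at runtime can be sketched as follows. The label reports and the table demo_report_input are hypothetical. The example assumes that a queue in your WLM configuration lists a query group with the same name.

```sql
-- Hypothetical table used only to have something to query.
create table demo_report_input (id integer, amount decimal(12,2));

-- Label the queries that follow so that WLM can match them to a queue
-- whose configuration lists the 'reports' query group.
set query_group to 'reports';

select count(*), sum(amount) from demo_report_input;

-- Remove the label so that later queries are routed normally.
reset query_group;
```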

Currently, the default for clusters using the default parameter group is to use automatic WLM. Automatic WLM manages query concurrency and memory allocation. For more information, see Implementing automatic WLM (p. 413).

With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.

The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console. You can also use the AWS Command Line Interface (AWS CLI) or the Amazon Redshift API.

For more information about implementing and using workload management, see Implementing workload management (p. 411).

Using Amazon Redshift with other services

Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your data quickly and reliably, using data security features.

Moving data between Amazon Redshift and Amazon S3

Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. For more information, see Loading data from Amazon S3 (p. 68).
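A load from multiple files can be sketched with a COPY command that names a key prefix instead of a single object. The table, bucket, prefix, and IAM role ARN are placeholders, and the example assumes pipe-delimited files compressed with gzip.

```sql
-- Hypothetical target table.
create table if not exists demo_clicks (
  click_time timestamp,
  user_id    integer,
  url        varchar(2048)
);

-- Load every object whose key begins with the given prefix, for example
-- clicks/part-0000.gz, clicks/part-0001.gz, and so on.
copy demo_clicks
from 's3://example-bucket/clicks/part-'
iam_role 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
delimiter '|'
gzip;
```

Splitting the input into several files gives the load the opportunity to read them in parallel rather than from a single object.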


You can also use parallel processing to export data from your Amazon Redshift data warehouse to multiple data files on Amazon S3. For more information, see Unloading data (p. 143).
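An export to multiple files can be sketched with the UNLOAD command. The table, bucket, prefix, and IAM role ARN are placeholders.

```sql
-- Hypothetical source table.
create table if not exists demo_clicks (
  click_time timestamp,
  user_id    integer,
  url        varchar(2048)
);

-- Write the query result to Amazon S3 as multiple files whose names
-- begin with the given prefix.
unload ('select click_time, user_id, url from demo_clicks')
to 's3://example-bucket/unload/clicks_'
iam_role 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
parallel on;
```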

Using Amazon Redshift with Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see Loading data from an Amazon DynamoDB table (p. 86).
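A load from a DynamoDB table can be sketched as follows. The Amazon Redshift table, the DynamoDB table name ExampleProductCatalog, and the IAM role ARN are placeholders. The example assumes that the DynamoDB items carry attributes whose names match the column names.

```sql
-- Hypothetical target table whose column names match the attribute
-- names of the items in the DynamoDB table.
create table demo_products (
  product_id integer,
  title      varchar(200),
  price      decimal(8,2)
);

-- READRATIO sets the percentage of the DynamoDB table's provisioned
-- read throughput that the load is allowed to consume.
copy demo_products
from 'dynamodb://ExampleProductCatalog'
iam_role 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
readratio 50;
```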

Importing data from remote hosts over SSH

You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts using SSH and runs commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. The COPY command reads and loads the output from multiple host sources in parallel. For more information, see Loading data from remote hosts (p. 80).
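A load over SSH can be sketched as follows. COPY reads the host details from a manifest file stored in Amazon S3, where each entry names a host endpoint, the command to run on that host, and the user name to log in with. The table, bucket, manifest name, and IAM role ARN are placeholders, and the example assumes the remote commands write pipe-delimited text to standard output.

```sql
-- Hypothetical target table.
create table demo_host_data (
  host_name   varchar(64),
  metric_time timestamp,
  metric      decimal(12,2)
);

-- The manifest lists the remote hosts and the command to run on each one.
-- The SSH keyword tells COPY to treat the file as an SSH manifest.
copy demo_host_data
from 's3://example-bucket/ssh_manifest.json'
iam_role 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
delimiter '|'
ssh;
```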

Automating data loads using AWS Data Pipeline

You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and run recurring jobs without having to write your own complex data transfer or transformation logic.

For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift, see Copy data to Amazon Redshift using AWS Data Pipeline in the AWS Data Pipeline Developer Guide.

Migrating data using AWS Database Migration Service (AWS DMS)

You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can migrate your data to and from the most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL. For more information, see Using an Amazon Redshift database as a target for AWS Database Migration Service.


Amazon Redshift best practices

Following, you can find best practices for planning a proof of concept, designing tables, loading data into tables, and writing queries for Amazon Redshift, and also a discussion of working with Amazon Redshift Advisor.

Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively parallel processing, columnar data storage, and columnar data compression. If your data loading and query execution times are longer than you expect, or longer than you want, you might be overlooking key information.

If you are an experienced SQL database developer, we strongly recommend that you review this topic before you begin developing your Amazon Redshift data warehouse.

If you are new to developing SQL databases, this topic is not the best place to start. We recommend that you begin by reading Getting started using databases and trying the examples yourself.

In this topic, you can find an overview of the most important development principles, along with specific tips, examples, and best practices for implementing those principles. No single practice can apply to every application. You should evaluate all of your options before finalizing a database design. For more information, see Working with automatic table optimization (p. 37), Loading data (p. 65), Tuning query performance (p. 383), and the reference chapters.

Topics

• Conducting a proof of concept for Amazon Redshift

• Amazon Redshift best practices for designing tables

• Amazon Redshift best practices for loading data

• Amazon Redshift best practices for designing queries

• Working with recommendations from Amazon Redshift Advisor

Conducting a proof of concept for Amazon Redshift

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL with your existing business intelligence (BI) tools. Amazon Redshift offers fast performance in a low-cost cloud data warehouse. It uses sophisticated query optimization, accelerated cache, columnar storage on high-performance local disks, and massively parallel query execution.

In the following sections, you can find a framework for building a proof of concept with Amazon Redshift. The framework helps you to use architectural best practices for designing and operating a secure, high-performing, and cost-effective data warehouse. This guidance is based on reviewing designs of thousands of customer architectures across a wide variety of business types and use cases. We have compiled customer experiences to develop this set of best practices to help you develop criteria for evaluating your data warehouse workload.

Overview of the process

Conducting a proof of concept is a three-step process:

1. Identify the goals of the proof of concept – you can work backward from your business requirements and success criteria, and translate them into a technical proof of concept project plan.

2. Set up the proof of concept environment – most of the setup process is a matter of clicking a few buttons to create your resources. Within minutes, you can have a data warehouse environment ready with data loaded.

3. Complete the proof of concept project plan to ensure that the goals are met.

In the following sections, we go into the details of each step.

Identify the business goals and success criteria

Identifying the goals of the proof of concept plays a critical role in determining what you want to measure as part of the evaluation process. The evaluation criteria should include the current scaling challenges, enhancements to improve your customer's experience of the data warehouse, and methods of addressing your current operational pain points. You can use the following questions to identify the goals of the proof of concept:

• What are your goals for scaling your data warehouse?

• What are the specific service-level agreements whose terms you want to improve?

• What new datasets do you need to include in your data warehouse?

• What are the business-critical SQL queries that you need to test and measure? Make sure to include the full range of SQL complexities, such as the different types of queries (for example, select, insert, update, and delete).

• What are the general types of workloads you plan to test? Examples might include extract-transform-load (ETL) workloads, reporting queries, and batch extracts.

After you have answered these questions, you should be able to establish SMART goals and success criteria for building your proof of concept. For information about setting goals, see SMART criteria in Wikipedia.

Set up your proof of concept

Because Amazon Redshift eliminates the hardware provisioning, networking, and software installation that an on-premises data warehouse requires, trying Amazon Redshift with your own dataset has never been easier. Many of the sizing decisions and estimations that used to be required are now simply a click away. You can flexibly resize your cluster or adjust the ratio of storage versus compute.

Broadly, setting up the Amazon Redshift proof of concept environment is a two-step process. It involves launching a data warehouse and then converting the schema and datasets for evaluation.

Choose a starting cluster size

You can choose the node type and number of nodes using the Amazon Redshift console. We recommend that you also test resizing the cluster as part of your proof of concept plan. To get the initial sizing for your cluster, take the following steps:

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.

2. On the navigation pane, choose Create cluster to open the conﬁguration page.

3. For Cluster identiﬁer, enter a name for your cluster.

4. Choose one of the following methods to size your cluster:

Note: The following step describes an Amazon Redshift console that is running in an AWS Region that supports RA3 node types. For a list of AWS Regions that support RA3 node types, see Overview of RA3 node types in the Amazon Redshift Cluster Management Guide.

• If your AWS Region supports RA3 node types, choose either Production or Free trial to answer the question What are you planning to use this cluster for?

If your organization is eligible, you might be able to create a cluster under the Amazon Redshift free trial program. To do this, choose Free trial to create a conﬁguration with the dc2.large node type.

For more information about choosing a free trial, see Amazon Redshift free trial.

• If you don't know how large to size your cluster, choose Help me choose. Doing this starts a sizing calculator that asks you questions about the size and query characteristics of the data that you plan to store in your data warehouse.

• If you know the required size of your cluster (that is, the node type and number of nodes), choose I'll choose. Then choose the Node type and number of Nodes to size your cluster for the proof of concept.

5. After you enter all required cluster properties, choose Create cluster to launch your data warehouse.

For more details about creating clusters with the Amazon Redshift console, see Creating a cluster in the Amazon Redshift Cluster Management Guide.

Convert the schema and set up the datasets for the proof of concept

If you don't have an existing data warehouse, skip this section and see the Amazon Redshift Getting Started Guide, which provides a tutorial for creating a cluster and examples of setting up data in Amazon Redshift.

When migrating from your existing data warehouse, you can convert schema, code, and data using the AWS Schema Conversion Tool and the AWS Database Migration Service. Your choice of tools depends on the source of your data and on whether you need ongoing replication. For more information, see What Is the AWS Schema Conversion Tool? in the AWS Schema Conversion Tool User Guide and What Is AWS Database Migration Service? in the AWS Database Migration Service User Guide. The following can help you set up your data in Amazon Redshift:

• Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool – this blog post provides an overview of how you can use the AWS SCT data extractors to migrate your existing data warehouse to Amazon Redshift. AWS SCT can migrate your data from many legacy platforms (such as Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica).

• Optionally, you can also use the AWS Database Migration Service for ongoing replications of changed data from the source. For more information, see Using an Amazon Redshift Database as a Target for AWS Database Migration Service in the AWS Database Migration Service User Guide.

Amazon Redshift is a relational database management system (RDBMS). As such, it can run many types of data models including star schemas, snowflake schemas, data vault models, and simple, flat, or normalized tables. After setting up your schemas in Amazon Redshift, you can take advantage of massively parallel processing and columnar data storage for fast analytical queries out of the box. For information about types of schemas, see star schema, snowflake schema, and data vault modeling in Wikipedia.
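As a minimal sketch of the star schema case mentioned above, the following SQL defines one dimension table and one fact table and then runs a typical aggregate join. The table and column names (dim_customer, fact_sales, and so on) are hypothetical and are not taken from this guide. Primary key and foreign key constraints in Amazon Redshift are informational only; they are not enforced.

```sql
-- Hypothetical dimension table for a small star schema.
CREATE TABLE dim_customer (
    customer_id   INTEGER NOT NULL,
    customer_name VARCHAR(100),
    region        VARCHAR(50),
    PRIMARY KEY (customer_id)          -- informational, not enforced
);

-- Hypothetical fact table that references the dimension.
CREATE TABLE fact_sales (
    sale_id     BIGINT NOT NULL,
    customer_id INTEGER NOT NULL REFERENCES dim_customer (customer_id),
    sale_date   DATE NOT NULL,
    amount      DECIMAL(12,2)
);

-- A typical analytical query: total sales per region.
SELECT d.region,
       SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON d.customer_id = f.customer_id
GROUP BY d.region
ORDER BY total_sales DESC;
```

No sort key or distribution settings are specified here, so the tables rely on the defaults. For a proof of concept, that is often a reasonable starting point before you tune anything.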

Checklist for a complete evaluation

Make sure that a complete evaluation meets all your data warehouse needs. Consider including the following items in your success criteria:

• Data load time – using the COPY command is a common way to test how long it takes to load data. For more information, see Amazon Redshift best practices for loading data (p. 19).

• Throughput of the cluster – measuring queries per hour is a common way to determine throughput. To do so, set up a test to run typical queries for your workload.

• Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. You also have a number of options for managing keys. Amazon Redshift also supports single sign-on (SSO) integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and data transfer.

• Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with business intelligence and other external tools.

• Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis. You can use this integration when setting up and managing your data warehouse.

• Backups and snapshots – backups and snapshots are created automatically. You can also create a point-in-time snapshot at any time or on a schedule. Try restoring a snapshot to create a second cluster as part of your evaluation. Evaluate whether your development and testing organizations can use that cluster.

• Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes. Evaluate whether the workload throughput before and after a resize can handle the variability in your workload volume. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift Cluster Management Guide.

• Concurrency scaling – this feature helps you handle variability of traﬃc volume in your data warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. For more information, see Working with concurrency scaling (p. 438).

• Automatic workload management (WLM) – prioritize your business critical queries over other queries by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory resources dynamically. For more information, see Implementing automatic WLM (p. 413).

• Amazon Redshift Advisor – the Advisor develops customized recommendations to increase performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in to the Amazon Redshift console to view Advisor recommendations. For more information, see Working with recommendations from Amazon Redshift Advisor (p. 24).

• Table design – Amazon Redshift provides great performance out of the box for most workloads. When you create a table, the default sort key and distribution key are AUTO. For more information, see Working with automatic table optimization (p. 37).

• Support – we strongly recommend that you evaluate AWS Support as part of your evaluation. Also, make sure to talk to your account manager about your proof of concept. AWS can help with technical guidance and credits for the proof of concept if you qualify. If you don't ﬁnd the help you're looking for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request support for your Amazon Redshift proof-of-concept.

• Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and run queries against petabytes of data in Amazon S3 using your existing cluster. For more information, see Querying external data using Amazon Redshift Spectrum (p. 232).
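Two of the checklist items above, data load time and cluster throughput, can be measured directly in SQL. The following is a sketch under stated assumptions, not a prescribed method: the table name sales, the S3 path, and the IAM role ARN are hypothetical placeholders, and the date range is only an example test window. The COPY statement and the timing query must run in the same session, because PG_LAST_COPY_ID returns the ID of the most recent COPY command in the current session.

```sql
-- Load data from Amazon S3 (hypothetical table, bucket, and role).
COPY sales
FROM 's3://my-poc-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Data load time: elapsed seconds for the COPY that just ran
-- in this session.
SELECT query,
       DATEDIFF(seconds, starttime, endtime) AS load_seconds
FROM stl_query
WHERE query = PG_LAST_COPY_ID();

-- Throughput: completed (non-aborted) queries per hour
-- during an example test window.
SELECT DATE_TRUNC('hour', starttime) AS test_hour,
       COUNT(*) AS queries_completed
FROM stl_query
WHERE starttime >= '2018-04-15 00:00'
  AND starttime <  '2018-04-16 00:00'
  AND aborted = 0
GROUP BY 1
ORDER BY 1;
```

Running the throughput query before and after a resize, or with and without concurrency scaling, gives comparable numbers for several other checklist items.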

Develop a project plan for your evaluation

Some of the following techniques for creating query benchmarks might help support your Amazon Redshift evaluation:

• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per category) helps ensure that your evaluation reflects a real-world data warehouse implementation. Add a unique identifier to associate each query that you include in your evaluation with one of the categories you establish for your evaluation. You can then use these unique identifiers to determine throughput from the system tables.

You can also create a query group to organize your evaluation queries. For example, if you have established a "Reporting" category for your evaluation, you might create a coding system to tag your evaluation queries with the word "Reporting." You can then identify individual queries within reporting as R1, R2, and so on. The following example demonstrates this approach.

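-- Tag the evaluation query with its category and a unique ID.
SELECT 'Reporting' AS query_category, 'R1' AS query_id, * FROM customers;

-- Find the tagged queries and their elapsed time in seconds.
SELECT query, DATEDIFF(seconds, starttime, endtime)
FROM stl_query
WHERE querytxt LIKE '%Reporting%'
  AND starttime >= '2018-04-15 00:00'
  AND endtime < '2018-04-15 23:59';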

When you have associated a query with an evaluation category, you can use a unique identiﬁer to determine throughput from the system tables for each category.

• Test throughput with historical user or ETL queries that have a variety of runtimes in your existing data warehouse. You might use a load testing utility, such as the open-source JMeter or a custom utility. If so, make sure that your utility does the following:

• It can take the network transmission time into account.

• It evaluates execution time based on throughput of the internal system tables. For information about how to do this, see Analyzing the query summary (p. 396).

• Identify all the various permutations that you plan to test during your evaluation. The following list provides some common variables:

• Cluster size

• Node type

• Load testing duration

• Concurrency settings

• Reduce the cost of your proof of concept by pausing your cluster during off-hours and weekends. When a cluster is paused, on-demand compute billing is suspended. To run tests on the cluster, resume it; per-second billing then starts again. You can also create a schedule to pause and resume your cluster automatically. For more information, see Pausing and resuming clusters in the Amazon Redshift Cluster Management Guide.
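The query group technique from the first item in this list can also be sketched with the SET query_group command, which labels every statement that runs in the session until the group is reset. This is one possible approach, not the only one: the customers table comes from the earlier example, while the R1 naming and the date range are illustrative assumptions. The label set this way appears in the label column of STL_QUERY.

```sql
-- Label every following statement in this session as 'Reporting'.
SET query_group TO 'Reporting';

-- Evaluation query R1 (illustrative).
SELECT COUNT(*) FROM customers;

-- Stop labeling statements.
RESET query_group;

-- Per-category throughput: number of queries and total elapsed
-- seconds for the 'Reporting' group in an example test window.
SELECT TRIM(label) AS query_category,
       COUNT(*) AS queries_run,
       SUM(DATEDIFF(seconds, starttime, endtime)) AS total_seconds
FROM stl_query
WHERE label = 'Reporting'
  AND starttime >= '2018-04-15 00:00'
  AND starttime <  '2018-04-16 00:00'
GROUP BY 1;
```

Compared with matching on query text, filtering on the label avoids picking up unrelated statements that happen to contain the category name.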

At this stage, you're ready to complete your project plan and evaluate results.

Additional resources to help your evaluation

To help your Amazon Redshift evaluation, see the following:

• Service highlights and pricing – this product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.

• Amazon Redshift Getting Started Guide – this guide provides a tutorial on using Amazon Redshift to create a sample cluster and work with sample data.

• Getting started with Amazon Redshift Spectrum (p. 234) – in this tutorial, you learn how to use Redshift Spectrum to query data directly from ﬁles on Amazon S3.

• Amazon Redshift management overview – this topic in the Amazon Redshift Cluster Management Guide provides an overview of Amazon Redshift.
