• 沒有找到結果。

Amazon Athena

N/A
N/A
Protected

Academic year: 2022

Share "Amazon Athena"

Copied!
659
0
0

加載中.... (立即查看全文)

全文

(1)

Amazon Athena

User Guide

Amazon Athena: User Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

(2)

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

(3)

Table of Contents

What is Amazon Athena? ... 1

When should I use Athena? ... 1

Amazon Athena ... 1

Amazon EMR ... 2

Amazon Redshift ... 2

Accessing Athena ... 2

Understanding Tables, Databases, and the Data Catalog ... 3

AWS service integrations with Athena ... 4

Setting Up ... 7

1. Sign up for an AWS account ... 7

2. Create an IAM administrator user and group ... 7

3. Attach managed policies for Athena ... 7

4. Sign in as an IAM user ... 8

Getting Started ... 9

Prerequisites ... 9

Step 1: Create a Database ... 9

Step 2: Create a Table ... 12

Step 3: Query Data ... 16

Saving Your Queries ... 18

Connecting to Other Data Sources ... 18

Accessing Amazon Athena ... 19

Using the Console ... 19

Using the API ... 19

Using the CLI ... 19

Connecting to Data Sources ... 20

Integration with AWS Glue ... 20

Using AWS Glue to Connect to Data Sources in Amazon S3 ... 21

Registering an AWS Glue Data Catalog from Another Account ... 22

Best Practices When Using Athena with AWS Glue ... 23

Upgrading to the AWS Glue Data Catalog Step-by-Step ... 31

FAQ: Upgrading to the AWS Glue Data Catalog ... 32

Using a Hive Metastore ... 34

Overview of Features ... 35

Workflow ... 36

Considerations and Limitations ... 36

Connecting Athena to an Apache Hive Metastore ... 38

Using the AWS Serverless Application Repository ... 40

Connecting Athena to Hive Using an Existing Role ... 42

Configure Athena to Use a Deployed Hive Metastore Connector ... 45

Using a Default Data Source ... 46

Using the AWS CLI with Hive Metastores ... 49

Reference Implementation ... 55

Using Amazon Athena Federated Query ... 56

Considerations and Limitations ... 56

Videos ... 57

Deploying a Connector and Connecting to a Data Source ... 57

Using the AWS Serverless Application Repository ... 59

Athena Data Source Connectors ... 60

Writing Federated Queries ... 64

Enabling Cross-Account Federated Queries ... 67

Writing a Data Source Connector ... 73

IAM Policies for Accessing Data Catalogs ... 74

Data Catalog Example Policies ... 75

Managing Data Sources ... 78

(4)

Connecting to Amazon Athena with ODBC and JDBC Drivers ... 79

Using Athena with the JDBC Driver ... 79

Connecting to Amazon Athena with ODBC ... 81

Configuring SSO Using ODBC, SAML 2.0, and Okta ... 84

Power BI Connector ... 94

Creating Databases and Tables ... 100

Creating Databases ... 100

Creating Tables ... 101

Considerations and Limitations ... 101

Creating Tables Using AWS Glue or the Athena Console ... 102

Showing Table Information ... 103

Names for Tables, Databases, and Columns ... 104

Use lower case for table names and table column names in Athena ... 104

Special characters ... 104

Reserved Keywords ... 105

List of Reserved Keywords in DDL Statements ... 105

List of Reserved Keywords in SQL SELECT Statements ... 105

Examples of Queries with Reserved Words ... 106

Table Location in Amazon S3 ... 106

Table Location and Partitions ... 108

Columnar Storage Formats ... 108

Converting to Columnar Formats ... 109

Partitioning Data ... 109

Considerations and Limitations ... 109

Creating and Loading a Table with Partitioned Data ... 110

Preparing Hive Style and Non-Hive Style Data for Querying ... 110

Partition Projection ... 114

Additional Resources ... 114

Partition Projection ... 114

Pruning and Projection for Heavily Partitioned Tables ... 114

Using Partition Projection ... 114

Use Cases ... 115

Considerations and Limitations ... 115

Video ... 116

Setting up Partition Projection ... 116

Supported Types for Partition Projection ... 120

Dynamic ID Partitioning ... 124

Amazon Kinesis Data Firehose Example ... 126

Creating a Table from Query Results (CTAS) ... 130

Considerations and Limitations for CTAS Queries ... 130

Running CTAS Queries in the Console ... 132

Bucketing vs Partitioning ... 133

Data Types Supported for Filtering on Bucketed Columns ... 134

Examples of CTAS Queries ... 134

Using CTAS and INSERT INTO for ETL ... 138

Overview ... 138

Step 1: Create a Table Based on the Original Dataset ... 138

Step 2: Use CTAS to Partition, Convert, and Compress the Data ... 139

Step 3: Use INSERT INTO to Add Data ... 140

Step 4: Measure Performance and Cost Differences ... 142

Summary ... 143

Creating a Table with More Than 100 Partitions ... 143

Compression Support ... 147

Compression Support in Athena by File Format ... 148

Specifying Compression Formats ... 148

Notes and Resources ... 148

SerDe Reference ... 150

(5)

Using a SerDe ... 150

To Use a SerDe in Queries ... 150

Supported SerDes and Data Formats ... 151

Avro SerDe ... 153

Regex SerDe ... 155

CloudTrail SerDe ... 155

OpenCSVSerDe for Processing CSV ... 157

Grok SerDe ... 160

JSON SerDe Libraries ... 163

LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files ... 167

ORC SerDe ... 172

Parquet SerDe ... 175

Running Queries ... 178

Query Results and Recent Queries ... 178

Specifying a Query Result Location ... 179

Downloading Query Results Files Using the Athena Console ... 182

Viewing Recent Queries ... 184

Finding Query Output Files in Amazon S3 ... 187

Working with Views ... 189

When to Use Views? ... 189

Supported Actions for Views in Athena ... 190

Considerations for Views ... 190

Limitations for Views ... 191

Working with Views in the Console ... 191

Creating Views ... 192

Examples of Views ... 194

Using saved queries ... 195

Considerations and limitations ... 195

Working with saved queries in the Athena console ... 196

Using the Athena API to update saved queries ... 197

Querying with Prepared Statements ... 197

Considerations and Limitations ... 197

SQL Statements ... 197

Handling Schema Updates ... 199

Summary: Updates and Data Formats in Athena ... 200

Index Access in ORC and Parquet ... 201

Types of Updates ... 202

Updates in Tables with Partitions ... 207

Querying Arrays ... 208

Creating Arrays ... 208

Concatenating Arrays ... 210

Converting Array Data Types ... 211

Finding Lengths ... 211

Accessing Array Elements ... 212

Flattening Nested Arrays ... 213

Creating Arrays from Subqueries ... 215

Filtering Arrays ... 216

Sorting Arrays ... 217

Using Aggregation Functions with Arrays ... 217

Converting Arrays to Strings ... 218

Using Arrays to Create Maps ... 219

Querying Arrays with Complex Types and Nested Structures ... 219

Querying Geospatial Data ... 225

What is a Geospatial Query? ... 225

Input Data Formats and Geometry Data Types ... 225

Supported Geospatial Functions ... 226

Examples: Geospatial Queries ... 248

(6)

Querying Hudi Datasets ... 250

Hudi Dataset Table Types ... 250

Considerations and Limitations ... 252

Video ... 252

Creating Hudi Tables ... 252

See Also ... 255

Querying JSON ... 255

Best Practices for Reading JSON Data ... 256

Extracting Data from JSON ... 257

Searching for Values in JSON Arrays ... 259

Obtaining Length and Size of JSON Arrays ... 261

Troubleshooting JSON Queries ... 262

Using ML with Athena ... 262

Considerations and Limitations ... 262

ML with Athena Syntax ... 263

Customer Use Examples ... 264

Querying with UDFs ... 264

Considerations and Limitations ... 264

Videos ... 265

UDF Query Syntax ... 265

Examples ... 267

Creating and Deploying a UDF Using Lambda ... 267

Querying AWS service logs ... 272

Application Load Balancer ... 273

Elastic Load Balancing ... 275

CloudFront ... 276

CloudTrail ... 278

Amazon EMR ... 284

Global Accelerator ... 287

GuardDuty ... 289

Network Firewall ... 291

Network Load Balancer ... 293

Route 53 ... 294

Amazon VPC ... 296

AWS WAF ... 303

Querying AWS Glue Data Catalog ... 310

Listing Databases and Searching a Specified Database ... 311

Listing Tables in a Specified Database and Searching for a Table by Name ... 311

Listing Partitions for a Specific Table ... 312

Listing or Searching Columns for a Specified Table or View ... 313

Querying Web Server Logs ... 314

Querying Apache Logs ... 315

Querying IIS Logs ... 317

Using ACID Transactions ... 323

Using Governed Tables ... 323

Considerations and Limitations ... 324

Getting Started ... 324

Creating Governed Tables ... 325

Querying Governed Tables ... 325

Supported Data Types ... 327

Using Iceberg Tables ... 328

Considerations and Limitations ... 328

Getting Started ... 329

Creating Iceberg Tables ... 329

DDL Operations ... 332

Schema Evolution ... 332

Querying Iceberg Tables ... 333

(7)

Updating Iceberg Tables ... 335

Optimizing Iceberg Tables ... 336

Supported Data Types ... 337

Security ... 339

Data Protection ... 339

Protecting Multiple Types of Data ... 340

Encryption at Rest ... 341

Encryption in Transit ... 349

Key Management ... 349

Internetwork Traffic Privacy ... 349

Identity and Access Management ... 350

AWS managed policies ... 350

Access through JDBC and ODBC Connections ... 355

Access to Amazon S3 ... 356

Cross-account Access to S3 Buckets ... 357

Fine-Grained Access to Databases and Tables in AWS Glue ... 360

Cross-Account Access to AWS Glue Data Catalogs ... 368

Access to Encrypted Metadata in the Data Catalog ... 372

Access to Workgroups and Tags ... 372

Allow Access to Prepared Statements ... 373

Using CalledVia Context Keys ... 374

Allow Access to an Athena Data Connector for External Hive Metastore ... 376

Allow Lambda Function Access to External Hive Metastores ... 378

Allow Access to Athena Federated Query ... 381

Allow Access to Athena UDF ... 385

Allowing Access for ML with Athena ... 389

Enabling Federated Access to the Athena API ... 390

Logging and Monitoring ... 393

Logging Amazon Athena API Calls with AWS CloudTrail ... 394

Compliance Validation ... 396

Resilience ... 397

Infrastructure Security ... 397

Connect to Amazon Athena Using an Interface VPC Endpoint ... 398

Configuration and Vulnerability Analysis ... 399

Using Athena with Lake Formation ... 399

How Lake Formation Data Access Works ... 400

Considerations and Limitations ... 402

Managing User Permissions ... 404

Applying Lake Formation Permissions to Existing Databases and Tables ... 406

Using Lake Formation and JDBC or ODBC for Federated Access ... 407

Using Workgroups to Control Query Access and Costs ... 447

Using Workgroups for Running Queries ... 447

Benefits of Using Workgroups ... 448

How Workgroups Work ... 448

Setting up Workgroups ... 449

IAM Policies for Accessing Workgroups ... 450

Workgroup Settings ... 456

Managing Workgroups ... 457

Athena Workgroup APIs ... 463

Troubleshooting Workgroups ... 463

Controlling Costs and Monitoring Queries with CloudWatch Metrics and Events ... 465

Video ... 465

Enabling CloudWatch Query Metrics ... 465

Monitoring Athena Queries with CloudWatch Metrics ... 466

Monitoring Athena Queries with CloudWatch Events ... 469

Monitoring Usage Metrics ... 471

Setting Data Usage Control Limits ... 473

(8)

Tagging Resources ... 475

Tag Basics ... 475

Tag Restrictions ... 475

Working with Tags on Workgroups in the Console ... 476

Displaying Tags for Individual Workgroups ... 476

Adding and Deleting Tags on an Individual Workgroup ... 477

Using Tag Operations ... 478

Managing Tags Using API Operations ... 478

Managing Tags Using the AWS CLI ... 479

Tag-Based IAM Access Control Policies ... 481

Tag Policy Examples for Workgroups ... 481

Tag Policy Examples for Data Catalogs ... 483

Athena Engine Versioning ... 486

Changing Athena Engine Versions ... 486

Finding the Query Engine Version for a Workgroup ... 486

Changing the Engine Version ... 487

Specifying the Engine Version When You Create a Workgroup ... 488

Testing Queries in Advance of an Engine Version Upgrade ... 488

Troubleshooting Queries That Fail ... 489

Athena Engine Version Reference ... 489

Athena engine version 2 ... 489

SQL Reference ... 498

Data Types in Athena ... 498

DML Queries, Functions, and Operators ... 500

SELECT ... 500

INSERT INTO ... 505

UNLOAD ... 508

EXPLAIN and EXPLAIN ANALYZE ... 510

Functions ... 522

Supported Time Zones ... 524

DDL Statements ... 532

Unsupported DDL ... 533

ALTER DATABASE SET DBPROPERTIES ... 534

ALTER TABLE ADD COLUMNS ... 535

ALTER TABLE ADD PARTITION ... 535

ALTER TABLE DROP PARTITION ... 537

ALTER TABLE RENAME PARTITION ... 537

ALTER TABLE REPLACE COLUMNS ... 538

ALTER TABLE SET LOCATION ... 539

ALTER TABLE SET TBLPROPERTIES ... 540

CREATE DATABASE ... 541

CREATE TABLE ... 542

CREATE TABLE AS ... 546

CREATE VIEW ... 548

DESCRIBE TABLE ... 549

DESCRIBE VIEW ... 550

DROP DATABASE ... 550

DROP TABLE ... 551

DROP VIEW ... 551

MSCK REPAIR TABLE ... 552

SHOW COLUMNS ... 554

SHOW CREATE TABLE ... 555

SHOW CREATE VIEW ... 555

SHOW DATABASES ... 556

SHOW PARTITIONS ... 556

SHOW TABLES ... 558

SHOW TBLPROPERTIES ... 559

(9)

SHOW VIEWS ... 559

Considerations and Limitations ... 560

Cross-Regional Queries ... 561

Troubleshooting ... 562

CREATE TABLE AS SELECT (CTAS) ... 562

Duplicated data occurs with concurrent CTAS statements ... 562

Data File Issues ... 562

Athena cannot read hidden files ... 562

Athena cannot read files stored in Amazon S3 Glacier ... 562

Athena reads files that I excluded from the AWS Glue crawler ... 563

HIVE_BAD_DATA: Error parsing field value ... 563

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket-name... 563

HIVE_CURSOR_ERROR: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist ... 563

HIVE_CURSOR_ERROR: Unexpected end of input stream ... 563

HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file ... 564

HIVE_UNKNOWN_ERROR: Unable to create input format ... 564

org.apache.parquet.io.GroupColumnIO cannot be cast to org.apache.parquet.io.PrimitiveColumnIO ... 564

The S3 location provided to save your query results is invalid. ... 564

Federated Queries ... 564

JSON Related Errors ... 565

NULL or incorrect data errors when trying to read JSON data ... 565

HIVE_BAD_DATA: Error parsing field value for field 0: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject ... 565

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key ... 565

HIVE_CURSOR_ERROR messages with pretty-printed JSON ... 565

Multiple JSON records return a SELECT COUNT of 1 ... 565

Cannot query a table created by a AWS Glue crawler that uses a custom JSON classifier ... 566

MSCK REPAIR TABLE ... 566

Output Issues ... 566

Unable to verify/create output bucket ... 566

TIMESTAMP result is empty ... 566

Store Athena query output in a format other than CSV ... 566

The S3 location provided to save your query results is invalid ... 566

Partitioning Issues ... 567

MSCK REPAIR TABLE does not remove stale partitions ... 567

MSCK REPAIR TABLE failure ... 567

MSCK REPAIR TABLE detects partitions but doesn't add them to AWS Glue ... 567

Partition projection ranges with the date format of dd-MM-yyyy-HH-mm-ss or yyyy-MM-dd do not work ... 567

PARTITION BY doesn't support the BIGINT type ... 567

No meaningful partitions available ... 567

Partition projection does not work in conjunction with range partitions ... 567

HIVE_UNKNOWN_ERROR: Unable to create input format ... 568

HIVE_PARTITION_SCHEMA_MISMATCH ... 568

SemanticException table is not partitioned but partition spec exists ... 568

Zero records returned from partitioned data ... 568

Permissions ... 568

Access Denied Error when querying Amazon S3 ... 568

Access Denied with Status Code: 403 error when running DDL queries on encrypted data in Amazon S3 ... 568

Access Denied with Status Code: 403 when querying an Amazon S3 bucket in another account .. 569

Use IAM role credentials to connect to the Athena JDBC driver ... 569

Query Syntax Issues ... 569

FAILED: NullPointerException Name is null ... 569

Function not registered ... 570

(10)

GENERIC_INTERNAL_ERROR Exceptions ... 570

Number of matching groups doesn't match the number of columns ... 570

queryString failed to satisfy constraint: Member must have length less than or equal to 262144 571 SYNTAX_ERROR: Column cannot be resolved ... 571

Throttling Issues ... 571

Views ... 571

Views created in Apache Hive shell do not work in Athena ... 571

View is stale; it must be re-created ... 571

Workgroups ... 572

Additional Resources ... 572

Athena Error Catalog ... 572

Error Category ... 572

Error Type Reference ... 573

Performance Tuning ... 577

Physical Limits ... 577

Query Optimization Techniques ... 577

Data Size ... 577

File Formats ... 578

Joins, Grouping, and Unions ... 578

Partitioning ... 579

Window Functions ... 579

Use More Efficient Functions ... 579

Additional Resources ... 580

Code Samples, Service Quotas, and Previous JDBC Driver ... 581

Code Samples ... 581

Constants ... 581

Create a Client to Access Athena ... 582

Start Query Execution ... 582

Stop Query Execution ... 584

List Query Executions ... 586

Create a Named Query ... 587

Delete a Named Query ... 588

List Named Queries ... 589

Earlier Version JDBC Drivers ... 590

Instructions for JDBC Driver version 1.1.0 ... 591

Service Quotas ... 595

Queries ... 595

Workgroups ... 596

Databases, Tables, and Partitions ... 596

Amazon S3 Buckets ... 596

Per Account API Call Quotas ... 597

Release Notes ... 598

March 2, 2022 ... 600

February 23, 2022 ... 600

February 15, 2022 ... 600

February 14, 2022 ... 601

February 9, 2022 ... 601

February 8, 2022 ... 601

January 28, 2022 ... 601

January 13, 2022 ... 602

November 30, 2021 ... 602

November 26, 2021 ... 602

November 24, 2021 ... 603

November 22, 2021 ... 603

November 18, 2021 ... 603

November 17, 2021 ... 604

November 16, 2021 ... 604

(11)

November 12, 2021 ... 605

November 2, 2021 ... 605

October 29, 2021 ... 605

JDBC and ODBC Drivers ... 605

Features and Improvements ... 606

October 4, 2021 ... 606

September 16, 2021 ... 606

Features ... 606

Improvements ... 606

September 15, 2021 ... 607

Athena Console Preview ... 607

Athena JDBC Driver 2.0.24 ... 607

August 31, 2021 ... 607

August 12, 2021 ... 608

August 6, 2021 ... 608

August 5, 2021 ... 608

July 30, 2021 ... 608

July 21, 2021 ... 609

July 16, 2021 ... 609

July 8, 2021 ... 609

July 1, 2021 ... 610

June 23, 2021 ... 610

May 12, 2021 ... 610

May 10, 2021 ... 610

May 5, 2021 ... 611

April 30, 2021 ... 611

April 29, 2021 ... 611

April 26, 2021 ... 611

April 21, 2021 ... 612

April 5, 2021 ... 612

EXPLAIN Statement ... 612

SageMaker Machine Learning Models in SQL Queries ... 612

User Defined Functions (UDF) ... 612

March 30, 2021 ... 612

March 25, 2021 ... 612

March 5, 2021 ... 613

February 25, 2021 ... 613

December 16, 2020 ... 613

Athena engine version 2 and Athena Federated Query ... 613

AWS PrivateLink ... 613

November 24, 2020 ... 613

November 11, 2020 ... 614

Athena engine version 2 ... 614

Federated SQL Queries ... 614

October 22, 2020 ... 615

July 29, 2020 ... 615

July 9, 2020 ... 615

Querying Apache Hudi Datasets ... 616

AWS CloudFormation Data Catalog Resource ... 616

June 1, 2020 ... 616

Using Apache Hive Metastore as a Metacatalog with Amazon Athena ... 616

May 21, 2020 ... 616

April 1, 2020 ... 616

March 11, 2020 ... 617

March 6, 2020 ... 617

November 26, 2019 ... 617

Federated SQL Queries ... 617

(12)

Invoking Machine Learning Models in SQL Queries ... 618

User Defined Functions (UDFs) (Preview) ... 619

Using Apache Hive Metastore as a Metacatalog with Amazon Athena (Preview) ... 619

New Query-Related Metrics ... 619

November 12, 2019 ... 620

November 8, 2019 ... 620

October 8, 2019 ... 620

September 19, 2019 ... 620

September 12, 2019 ... 621

August 16, 2019 ... 621

August 9, 2019 ... 621

June 26, 2019 ... 621

May 24, 2019 ... 622

March 05, 2019 ... 622

February 22, 2019 ... 622

February 18, 2019 ... 623

November 20, 2018 ... 624

October 15, 2018 ... 624

October 10, 2018 ... 625

September 6, 2018 ... 625

August 23, 2018 ... 626

August 16, 2018 ... 626

August 7, 2018 ... 627

June 5, 2018 ... 627

Support for Views ... 627

Improvements and Updates to Error Messages ... 627

Bug Fixes ... 628

May 17, 2018 ... 628

April 19, 2018 ... 628

April 6, 2018 ... 628

March 15, 2018 ... 629

February 2, 2018 ... 629

January 19, 2018 ... 629

November 13, 2017 ... 630

November 1, 2017 ... 630

October 19, 2017 ... 630

October 3, 2017 ... 630

September 25, 2017 ... 630

August 14, 2017 ... 630

August 4, 2017 ... 631

June 22, 2017 ... 631

June 8, 2017 ... 631

May 19, 2017 ... 631

Improvements ... 632

Bug Fixes ... 632

April 4, 2017 ... 632

Features ... 632

Improvements ... 633

Bug Fixes ... 633

March 24, 2017 ... 633

Features ... 633

Improvements ... 633

Bug Fixes ... 633

February 20, 2017 ... 634

Features ... 634

Improvements ... 635

Document History ... 636

(13)

AWS glossary ... 646

(14)

When should I use Athena?

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL (p. 498). With a few actions in the AWS

Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries.

Topics

• When should I use Athena? (p. 1)

• Accessing Athena (p. 2)

• Understanding Tables, Databases, and the Data Catalog (p. 3)

• AWS service integrations with Athena (p. 4)

When should I use Athena?

Query services like Amazon Athena, data warehouses like Amazon Redshift, and sophisticated data processing frameworks like Amazon EMR all address different needs and use cases. The following guidance can help you choose one or more services based on your requirements.

Amazon Athena

Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.

Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.

Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information, see What is Amazon QuickSight in the Amazon QuickSight User Guide and Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).

Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.

Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without having to format data or manage infrastructure. For example, Athena is useful if you want to run a quick query on web logs to troubleshoot a performance issue on your site. With Athena, you can get started fast: you just define a table for your data and start querying using standard SQL.

You should use Amazon Athena if you want to run interactive ad hoc SQL queries against data on Amazon S3, without having to manage any infrastructure or clusters. Amazon Athena provides the easiest way to run ad hoc queries for data in Amazon S3 without the need to setup or manage any servers.

(15)

Amazon EMR

For a list of AWS services that Athena leverages or integrates with, see the section called “AWS service integrations with Athena” (p. 4).

Amazon EMR

Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.

In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code. You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with the latest big data processing frameworks such as Spark, Hadoop, Presto, or Hbase. Amazon EMR gives you full control over the configuration of your clusters and the software installed on them.

You can use Amazon Athena to query data that you process using Amazon EMR. Amazon Athena supports many of the same data formats as Amazon EMR. Athena's data catalog is Hive metastore compatible. If you use EMR and already have a Hive metastore, you can run your DDL statements on Amazon Athena and query your data immediately without affecting your Amazon EMR jobs.

Amazon Redshift

A data warehouse like Amazon Redshift is your best choice when you need to pull together data from many different sources – like inventory systems, financial systems, and retail sales systems – into a common format, and store it for long periods of time. If you want to build sophisticated business reports from historical data, then a data warehouse like Amazon Redshift is the best choice. The query engine in Amazon Redshift has been optimized to perform especially well on running complex queries that join large numbers of very large database tables. When you need to run queries against highly structured data with lots of joins across lots of very large tables, choose Amazon Redshift.

For more information about when to use Athena, consult the following resources:

• When to use Athena vs other big data services

• Amazon Athena Overview

• Amazon Athena Features

• Amazon Athena FAQs

• Amazon Athena Blog posts

Accessing Athena

You can access Athena using the AWS Management Console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.

• To get started with the console, see Getting Started (p. 9).

• To learn how to use JDBC or ODBC drivers, see Connecting to Amazon Athena with JDBC (p. 79) and Connecting to Amazon Athena with ODBC (p. 81).

• To use the Athena API, see the Amazon Athena API Reference.

• To use the CLI, install the AWS CLI and then type aws athena help from the command line to see available commands. For information about available commands, see the Amazon Athena command line reference.

(16)

Understanding Tables, Databases, and the Data Catalog

• To use the AWS SDK for Java 2.x, see the Athena section of the AWS SDK for Java 2.x API Reference, the Athena Java V2 Examples on GitHub.com, and the AWS SDK for Java 2.x Developer Guide.

• To use the AWS SDK for .NET, see the Amazon.Athena namespace in the AWS SDK for .NET API Reference, the .NET Athena examples on GitHub.com, and the AWS SDK for .NET Developer Guide.

• To use AWS Tools for Windows PowerShell, see the AWS Tools for PowerShell - Amazon Athena cmdlet reference, the AWS Tools for PowerShell portal page, and the AWS Tools for Windows PowerShell User Guide.

• For information about Athena service endpoints that you can connect to programmatically, see Amazon Athena endpoints and quotas in the Amazon Web Services General Reference.

Understanding Tables, Databases, and the Data Catalog

In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.

For each dataset that you'd like to query, Athena must have an underlying table it will use for obtaining and returning query results. Therefore, before querying data, a table must be registered in Athena. The registration occurs when you either create tables automatically or manually.

Regardless of how the tables are created, the tables creation process registers the dataset with Athena.

This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.

• To create a table automatically, use an AWS Glue crawler from within Athena. For more information about AWS Glue and crawlers, see Integration with AWS Glue (p. 20). When AWS Glue creates a table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store and retrieve this metadata, using it when you run queries to analyze the underlying dataset.

After you create a table, you can use SQL SELECT (p. 500) statements to query it, including getting specific file locations for your source data (p. 504). Your query results are stored in Amazon S3 in the query result location that you specify (p. 179).

The AWS Glue Data Catalog is accessible throughout your Amazon Web Services account. Other AWS services can share the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and extract, transform, and load (ETL) data.

• To create a table manually:

• Use the Athena console to run the Create Table Wizard.

• Use the Athena console to write Hive DDL statements in the Query Editor.

• Use the Athena API or CLI to run a SQL query string with DDL statements.

• Use the Athena JDBC or ODBC driver.

When you create tables and databases manually, Athena uses HiveQL data definition language (DDL) statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create tables and databases in the AWS Glue Data Catalog.

NoteIf you have tables in Athena created before August 14, 2017, they were created in an Athena- managed internal data catalog that exists side-by-side with the AWS Glue Data Catalog until

(17)

AWS service integrations with Athena

you choose to update. For more information, see Upgrading to the AWS Glue Data Catalog Step- by-Step (p. 31).

When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL engine. We have examples with sample data within Athena to show you how to create a table and then issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started creating a table based on data that is stored in Amazon S3.

• For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see Getting Started (p. 9).

• Run the Athena tutorial in the console. This launches automatically if you log in to https://

console.aws.amazon.com/athena/ for the first time. You can also choose Tutorial in the console to launch it.

AWS service integrations with Athena

You can use Athena to query data from the AWS services listed in this section. To see the Regions that each service supports, see Regions and Endpoints in the Amazon Web Services General Reference.

AWS services integrated with Athena

• AWS CloudFormation

• Amazon CloudFront

• AWS CloudTrail

• Elastic Load Balancing

• AWS Glue Data Catalog

• AWS Identity and Access Management (IAM)

• Amazon QuickSight

• Amazon S3 Inventory

• AWS Step Functions

• AWS Systems Manager Inventory

• Amazon Virtual Private Cloud

For information about each integration, see the following sections.

AWS CloudFormation Data Catalog

Reference topic: AWS::Athena::DataCatalog in the AWS CloudFormation User Guide

Specify an Athena data catalog, including a name, description, type, parameters, and tags. For more information, see DataCatalog in the Amazon Athena API Reference.

Named Query

Reference topic: AWS::Athena::NamedQuery in the AWS CloudFormation User Guide

Specify named queries with AWS CloudFormation and run them in Athena. Named queries allow you to map a query name to a query and then run it as a saved query from the Athena console.

For information, see CreateNamedQuery in the Amazon Athena API Reference.

Workgroup

Reference topic: AWS::Athena::WorkGroup in the AWS CloudFormation User Guide

(18)

AWS service integrations with Athena

Specify Athena workgroups using AWS CloudFormation. Use Athena workgroups to isolate queries for you or your group from other queries in the same account. For more information, see Using Workgroups to Control Query Access and Costs (p. 447) in the Amazon Athena User Guide and CreateWorkGroup in the Amazon Athena API Reference.

Amazon CloudFront

Reference topic: Querying Amazon CloudFront Logs (p. 276)

Use Athena to query Amazon CloudFront logs. For more information about using CloudFront, see the Amazon CloudFront Developer Guide.

AWS CloudTrail

Reference topic: Querying AWS CloudTrail Logs (p. 278)

Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity.

For example, you can use queries to identify trends and further isolate activity by attribute, such as source IP address or user. You can create tables for querying logs directly from the CloudTrail console, and use those tables to run queries in Athena. For more information, see Creating a Table for CloudTrail Logs in the CloudTrail Console (p. 279).

Elastic Load Balancing

Reference topic: Querying Application Load Balancer Logs (p. 273)

Querying Application Load Balancer logs allows you to see the source of traffic, latency, and bytes transferred to and from Elastic Load Balancing instances and backend applications. For more information, see Creating the Table for ALB Logs (p. 273).

Reference topic: Querying Classic Load Balancer Logs (p. 275)

Query Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes transferred. For more information, see Creating the Table for ELB Logs (p. 275).

AWS Glue Data Catalog

Reference topic: Integration with AWS Glue (p. 20)

Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.

AWS Identity and Access Management (IAM) Reference topic: Actions for Amazon Athena

You can use Athena API actions in IAM permission policies. For more information, see Actions for Amazon Athena and Identity and Access Management in Athena (p. 350).

Amazon QuickSight

Reference topic: Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)

Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information about Amazon QuickSight, see What is Amazon QuickSight in the Amazon QuickSight User Guide. For information about using JDBC and ODBC drivers with Athena, see Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).

Amazon S3 Inventory

Reference topic: Querying inventory with Athena in the Amazon Simple Storage Service User Guide

(19)

AWS service integrations with Athena

You can use Amazon Athena to query Amazon S3 inventory using standard SQL. You can use

Amazon S3 inventory to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. For more information, see Amazon S3 inventory in the Amazon Simple Storage Service User Guide.

AWS Step Functions

Reference topic: Call Athena with Step Functions in the AWS Step Functions Developer Guide Call Athena with AWS Step Functions. AWS Step Functions can control select AWS services directly using the Amazon States Language. You can use Step Functions with Athena to start and stop query execution, get query results, run ad-hoc or scheduled data queries, and retrieve results from data lakes in Amazon S3. For more information, see the AWS Step Functions Developer Guide.

Video: Orchestrate Amazon Athena Queries using AWS Step Functions

The following video demonstrates how to use Amazon Athena and AWS Step Functions to run a regularly scheduled Athena query and generate a corresponding report.

Orchestrate Amazon Athena Queries using AWS Step Functions

For an example that uses Step Functions and Amazon EventBridge to orchestrate AWS Glue DataBrew, Athena, and Amazon QuickSight, see Orchestrating an AWS Glue DataBrew job and Amazon Athena query with AWS Step Functions in the AWS Big Data Blog.

AWS Systems Manager Inventory

Reference topic: Querying inventory data from multiple Regions and accounts in the AWS Systems Manager User Guide

AWS Systems Manager Inventory integrates with Amazon Athena to help you query inventory data from multiple AWS Regions and accounts. For more information, see the AWS Systems Manager User Guide.

Amazon Virtual Private Cloud

Reference topic: Querying Amazon VPC Flow Logs (p. 296)

Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Query the logs in Athena to investigate network traffic patterns and identify threats and risks across your Amazon VPC network. For more information about Amazon VPC, see the Amazon VPC User Guide.

(20)

1. Sign up for an AWS account

Setting Up

If you've already signed up for Amazon Web Services, you can start using Amazon Athena immediately.

If you haven't signed up for AWS or need assistance getting started, be sure to complete the following tasks:

1. Sign up for an AWS account (p. 7)

2. Create an IAM administrator user and group (p. 7) 3. Attach managed policies for Athena (p. 7) 4. Sign in as an IAM user (p. 8)

1. Sign up for an AWS account

When you sign up for AWS, your account is automatically signed up for all services in AWS, including Athena. You are charged only for the services that you use. For pricing information, see Amazon Athena pricing.

If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account

1. Open http://aws.amazon.com/, and then choose Create an AWS account.

2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad.

3. Note your AWS account number, because you need it for the next task.

2. Create an IAM administrator user and group

An AWS Identity and Access Management (IAM) user is an account that you create to access services. It is a different user than your main AWS account. As a security best practice, we recommend that you use the IAM user's credentials to access AWS services. You use the IAM console to create an administrator IAM user and an Administrators group for the user. You can then access the console for Athena and other AWS services by accessing a special link and providing the credentials for the IAM user that you created.

For steps, see Creating an administrator IAM user and user group (console) in the IAM User Guide.

3. Attach managed policies for Athena

After you have created an IAM user, you must attach some Athena managed policies to the user so that the user can access Athena. There are two managed policies for Athena: AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. These policies grant permissions to Athena to query Amazon S3 and to write the results of your queries to a separate bucket on your behalf. To see the contents of these policies for Athena, see AWS managed policies for Amazon Athena (p. 350).

(21)

4. Sign in as an IAM user

For steps to attach the Athena managed policies, follow Adding IAM Identity Permissions (Console) in the IAM User Guide and add the AmazonAthenaFullAccess and AWSQuicksightAthenaAccess managed policies to the IAM administrator user that you created.

NoteYou may need additional permissions to access the underlying dataset in Amazon S3. If you are not the account owner or otherwise have restricted access to a bucket, contact the bucket owner to grant access using a resource-based bucket policy, or contact your account administrator to grant access using an identity-based policy. For more information, see Amazon S3 Permissions (p. 356). If the dataset or Athena query results are encrypted, you may need additional permissions. For more information, see Configuring Encryption Options (p. 341).

4. Sign in as an IAM user

To sign in as the new IAM user that you created, you can use the custom sign-in URL for the IAM users of your account. To see the sign-in URL for the IAM users for your account, open the IAM console and choose Users, user_name, Security credentials, Console sign-in link. As a convenience, you can use the clipboard icon to copy the sign-in URL to the clipboard.

For more information about signing in as an IAM user, see How IAM users sign in to your AWS account in the IAM User Guide.

(22)

Prerequisites

Getting Started

This tutorial walks you through using Amazon Athena to query data. You'll create a table based on sample data stored in Amazon Simple Storage Service, query the table, and check the results of the query.

The tutorial uses live resources, so you are charged for the queries that you run. You aren't charged for the sample data in the location that this tutorial uses, but if you upload your own data files to Amazon S3, charges do apply.

Prerequisites

• If you have not already done so, sign up for an account in Setting Up (p. 7).

• Using the same AWS Region (for example, US West (Oregon)) and account that you are using for Athena, Create a bucket in Amazon S3 to hold your Athena query results.

Step 1: Create a Database

You first need to create a database in Athena.

To create an Athena database

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. If this is your first time to visit the Athena console in your current AWS Region, choose Explore the query editor to open the query editor. Otherwise, Athena opens in the query editor.

3. Choose View Settings to set up a query result location in Amazon S3.

4. On the Settings tab, choose Manage.

(23)

Step 1: Create a Database

5. For Manage settings, do one of the following:

• In the Location of query result box, enter the path to the bucket that you created in Amazon S3 for your query results. Prefix the path with s3://.

• Choose Browse S3, choose the Amazon S3 bucket that you created for your current Region, and then choose Choose.

(24)

Step 1: Create a Database

6. Choose Save.

7. Choose Editor to switch to the query editor.

8. On the right of the navigation pane, you can use the Athena query editor to enter and run queries and statements.

(25)

Step 2: Create a Table

9. To create a database named mydatabase, enter the following CREATE DATABASE statement.

CREATE DATABASE mydatabase 10. Choose Run or press Ctrl+ENTER.

11. From the Database list on the left, choose mydatabase to make it your current database.

Step 2: Create a Table

Now that you have a database, you can create an Athena table for it. The table that you create will be based on sample Amazon CloudFront log data in the location s3://athena-examples-myregion/

cloudfront/plaintext/, where myregion is your current AWS Region.

The sample log data is in tab-separated values (TSV) format, which means that a tab character is used as a delimiter to separate the fields. The data looks like the following example. For readability, the tabs in the excerpt have been converted to spaces and the final field shortened.

2014-07-05 20:00:09 DFW3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0[...]

2014-07-05 20:00:09 DFW3 4252 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-2.jpeg 200 - Mozilla/5.0[...]

2014-07-05 20:00:10 AMS1 4261 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-3.jpeg 200 - Mozilla/5.0[...]

To enable Athena to read this data, you could run a CREATE EXTERNAL TABLE statement like the following. The statement that creates the table defines columns that map to the data, specifies how the data is delimited, and specifies the Amazon S3 location that contains the sample data.

NoteFor the LOCATION clause, specify an Amazon S3 folder location, not a specific file. Athena scans all of the files in the folder that you specify.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (

(26)

Step 2: Create a Table

`Date` DATE, Time STRING, Location STRING, Bytes INT,

RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, ClientInfo STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'

LOCATION 's3://athena-examples-my-region/cloudfront/plaintext/';

The example creates a table called cloudfront_logs and specifies a name and data type for each field.

These fields become the columns in the table. Because date is a reserved word (p. 105), it is escaped with backtick (`) characters. ROW FORMAT DELIMITED means that Athena will use a default library called LazySimpleSerDe (p. 167) to do the actual work of parsing the data. The example also specifies that the fields are tab separated (FIELDS TERMINATED BY '\t') and that each record in the file ends in a newline character (LINES TERMINATED BY '\n). Finally, the LOCATION clause specifies the path in Amazon S3 where the actual data to be read is located. If you have your own tab or comma-separated data, you can use a CREATE TABLE statement like this.

Returning to the sample data, here is a full example of the final field ClientInfo:

Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;

%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9

As you can see, this field is multivalued. If the CREATE TABLE statement specifies tabs as field delimiters, the separate components inside this particular field can't be broken out into separate

columns. To create columns from the values inside the field, you can use a regular expression (regex) that contains regex groups. The regex groups that you specify become separate table columns. To use a regex in your CREATE TABLE statement, use syntax like the following. This syntax instructs Athena to use the Regex SerDe (p. 155) library and the regular expression that you specify.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "regular_expression")

Regular expressions can be useful for creating tables from complex CSV or TSV data but can be difficult to write and maintain. Fortunately, there are other libraries that you can use for formats like JSON, Parquet, and ORC. For more information, see Supported SerDes and Data Formats (p. 151).

Now you are ready to create the table in the Athena query editor. The CREATE TABLE statement and regex are provided for you.

To create a table in Athena

1. In the navigation pane, for Database, make sure that mydatabase is selected.

2. To give yourself more room in the query editor, you can choose the arrow icon to collapse the navigation pane.

(27)

Step 2: Create a Table

3. To create a tab for a new query, choose the plus (+) sign in the query editor. You can have up to ten query tabs open at once.

4. To close one or more query tabs, choose the arrow next to the plus sign. To close all tabs at once, choose the arrow, and then choose Close all tabs.

(28)

Step 2: Create a Table

5. In the query pane, enter the following CREATE EXTERNAL TABLE statement. The regex breaks out the operating system, browser, and browser version information from the ClientInfo field in the log data.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs ( `Date` DATE,

Time STRING, Location STRING, Bytes INT,

RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, os STRING, Browser STRING, BrowserVersion STRING

) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES (

"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s +([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^

\/]+)[\/](.*)$"

) LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';

6. In the LOCATION statement, replace myregion with the AWS Region that you are currently using (for example, us-west-1).

7. Choose Run.

(29)

Step 3: Query Data

The table cloudfront_logs is created and appears under the list of Tables for the mydatabase database.

Step 3: Query Data

Now that you have the cloudfront_logs table created in Athena based on the data in Amazon S3, you can run SQL queries on the table and see the results in Athena. For more information about using SQL in Athena, see SQL Reference for Amazon Athena (p. 498).

To run a query

1. Choose the plus (+) sign to open a new query tab and enter the following SQL statement in the query pane.

SELECT os, COUNT(*) count FROM cloudfront_logs

WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05' GROUP BY os

2. Choose Run.

The results look like the following:

3. To save the results of the query to a .csv file, choose Download results.

(30)

Step 3: Query Data

4. To view or run previous queries, choose the Recent queries tab.

5. To download the results of a previous query from the Recent queries tab, select the query, and then choose Download results. Queries are retained for 45 days.

(31)

Saving Your Queries

For more information, see Working with Query Results, Recent Queries, and Output Files (p. 178).

Saving Your Queries

You can save the queries that you create or edit in the query editor with a name. Athena stores these queries on the Saved queries tab. You can use the Saved queries tab to recall, run, rename, or delete your saved queries. For more information, see Using saved queries (p. 195).

Connecting to Other Data Sources

This tutorial used a data source in Amazon S3 in CSV format. For information about using Athena with AWS Glue, see Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21). You can also connect Athena to a variety of data sources by using ODBC and JDBC drivers, external Hive metastores, and Athena data source connectors. For more information, see Connecting to Data Sources (p. 20).

(32)

Using the Console

Accessing Amazon Athena

You can access Amazon Athena using the AWS Management Console, the Amazon Athena API, or the AWS CLI.

Using the Console

You can use the AWS Management Console for Amazon Athena to do the following:

• Create or select a database.

• Create, view, and delete tables.

• Filter tables by starting to type their names.

• Preview tables and generate CREATE TABLE DDL for them.

• Show table properties.

• Run queries on tables, save and format queries, and view query history.

• Create up to ten queries using different query tabs in the query editor. To open a new tab, click the plus sign.

• Display query results, save, and export them.

• Access the AWS Glue Data Catalog.

• View and change settings, such as view the query result location, configure auto-complete, and encrypt query results.

In the right pane, the Query Editor displays an introductory screen that prompts you to create your first table. You can view your tables under Tables in the left pane.

Here's a high-level overview of the actions available for each table:

Preview tables – View the query syntax in the Query Editor on the right.

Show properties – Show a table's name, its location in Amazon S3, input and output formats, the serialization (SerDe) library used, and whether the table has encrypted data.

Delete table – Delete a table.

Generate CREATE TABLE DDL – Generate the query behind a table and view it in the query editor.

Using the API

Amazon Athena enables application programming for Athena. For more information, see Amazon Athena API Reference. The latest AWS SDKs include support for the Athena API.

For examples of using the AWS SDK for Java with Athena, see Code Samples (p. 581).

For more information about AWS SDK for Java documentation and downloads, see the SDKs section in Tools for Amazon Web Services.

Using the CLI

You can access Amazon Athena using the AWS CLI. For more information, see the AWS CLI Reference for Athena.

(33)

Integration with AWS Glue

Connecting to Data Sources

You can use Amazon Athena to query data stored in different locations and formats in a dataset. This dataset might be in CSV, JSON, Avro, Parquet, or some other format.

The tables and databases that you work with in Athena to run queries are based on metadata. Metadata is data about the underlying data in your dataset. How that metadata describes your dataset is called the schema. For example, a table name, the column names in the table, and the data type of each column are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog that describes it is called a data source.

The relationship of metadata to an underlying dataset depends on the type of data source that you work with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata with the dataset. In these systems, the metadata is most often written when the data is written. Other data sources, like those built using Hive, allow you to define metadata on-the-fly when you read the dataset. The dataset can be in a variety of formats—for example, CSV, JSON, Parquet, or Avro.

Athena natively supports the AWS Glue Data Catalog. The AWS Glue Data Catalog is a data catalog built on top of other datasets and data sources such as Amazon S3, Amazon Redshift, and Amazon DynamoDB. You can also connect Athena to other data sources by using a variety of connectors.

Topics

• Integration with AWS Glue (p. 20)

• Using Athena Data Connector for External Hive Metastore (p. 34)

• Using Amazon Athena Federated Query (p. 56)

• IAM Policies for Accessing Data Catalogs (p. 74)

• Managing Data Sources (p. 78)

• Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)

Integration with AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) AWS service. One of its key abilities is to analyze and categorize data. You can use AWS Glue crawlers to automatically infer database and table schema from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.

Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your Amazon Web Services account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

To create database and table schema in the AWS Glue Data Catalog, you can run an AWS Glue crawler from within Athena on a data source, or you can run Data Definition Language (DDL) queries directly in the Athena Query Editor. Then, using the database and table schema that you created, you can use Data Manipulation (DML) queries in Athena to query the data.

You can register an AWS Glue Data Catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue, you can use Athena to run cross-account queries. For more information, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).

For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

(34)

Using AWS Glue to Connect to Data Sources in Amazon S3

Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate charges for AWS Glue? (p. 33) For more information about the benefits of using AWS Glue with

Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 33) Topics

• Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21)

• Registering an AWS Glue Data Catalog from Another Account (p. 22)

• Best Practices When Using Athena with AWS Glue (p. 23)

• Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 31)

• FAQ: Upgrading to the AWS Glue Data Catalog (p. 32)

Using AWS Glue to Connect to Data Sources in Amazon S3

Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.

To define schema information for AWS Glue to use, you can create an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.

Creating an AWS Glue Crawler

You can create a crawler by starting in the Athena console and then using the AWS Glue console in an integrated way. When you create the crawler, you specify a data location in Amazon S3 to crawl.

To create a crawler in AWS Glue starting from the Athena console 1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. In the query editor, next to Tables and views, choose Create, and then choose AWS Glue crawler.

3. On the AWS Glue console Add crawler page, follow the steps to create a crawler. For more information, see Using AWS Glue Crawlers (p. 24) in this guide and Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Note

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Adding a Table Using a Form

The following procedure shows you how to use the Athena console to add a table using the Create Table From S3 bucket data form.

To add a table and enter schema information using a form

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. In the query editor, next to Tables and views, choose Create, and then choose S3 bucket data.

3. On the Create Table From S3 bucket data form, for Table name, enter a name for the table.

4. For Database configuration, choose an existing database, or create a new one.

(35)

Registering an AWS Glue Data Catalog from Another Account

5. For Location of Input Data Set, specify the path in Amazon S3 to the folder that contains the dataset that you want to process.

6. For Data Format, choose a data format (Apache Web Logs, CSV, TSV, Text File with Custom Delimiters, JSON, Parquet, or ORC).

• For the Apache Web Logs option, you must also enter a regex expression in the Regex box.

• For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column delimiter). Optionally, you can specify a Collection terminator for array types or a Map key terminator.

7. For Column details, specify a column name and the column data type.

• To add more columns one at a time, choose Add a column.

• To quickly add more columns, choose Bulk add columns. In the text box, enter a comma separated list of columns in the format column_namedata_type, column_namedata_type[,

…], and then choose Add.

8. (Optional) For Partition details, add one or more column names and data types.

9. The Preview table query box shows the CREATE TABLE statement generated by the information that you entered into the form. The preview statement cannot be edited directly. To change the statement, modify the fields in the form, or create the statement directly (p. 103) in the query editor instead of using the form.

10. Choose Create table to run the generated statement in the query editor and create the table.

Registering an AWS Glue Data Catalog from Another Account

You can use Athena's cross-account AWS Glue catalog feature to register an AWS Glue catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue and register the catalog as an Athena DataCatalog resource, you can use Athena to run cross-account queries. For information about configuring the required permissions, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).

The following procedure shows you how to use the Athena console to configure an AWS Glue Data Catalog in an Amazon Web Services account other than your own as a data source.

To register an AWS Glue Data Catalog from another account

1. Follow the steps in Cross-Account Access to AWS Glue Data Catalogs (p. 368) to ensure that you have permissions to query the data catalog in the other account.

2. Open the Athena console at https://console.aws.amazon.com/athena/.

3. If the console navigation pane is not visible, choose the expansion menu on the left.

4. Choose Data sources.

(36)

Best Practices When Using Athena with AWS Glue

5. On the upper right, choose Connect data source.

6. In the AWS Glue Data Catalog section, for Choose an AWS Glue Data Catalog, choose AWS Glue Data Catalog in another account.

7. For Data source details, enter the following information:

Data source name – Enter the name that you want to use in your SQL queries to refer to the data catalog in the other account.

Description – (Optional) Enter a description of the data catalog in the other account.

Catalog ID – Enter the 12-digit Amazon Web Services account ID of the account to which the data catalog belongs. The Amazon Web Services account ID is the catalog ID.

8. (Optional) For Tags, enter key-value pairs that you want to associate with the data source. For more information about tags, see Tagging Athena Resources (p. 475).

9. Choose Register. On the Data sources page, the data catalog that you entered is listed in the Catalog name column.

10. To view or edit information about the new data catalog, choose the catalog, and then choose Edit.

11. To delete the new data catalog, choose the catalog, and then choose Delete.

Best Practices When Using Athena with AWS Glue

When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in AWS Glue and related services. This topic provides considerations and best practices when using either method.

Under the hood, Athena uses Presto to process DML statements and Hive to process the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and AWS Glue work well together.

In this topic

Database, Table, and Column Names (p. 23)

Using AWS Glue Crawlers (p. 24)

• Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync (p. 24)

• Using Multiple Data Sources with Crawlers (p. 24)

• Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH" (p. 25)

• Updating Table Metadata (p. 26)

Working with CSV Files (p. 26)

• CSV Data Enclosed in Quotes (p. 26)

• CSV Files with Headers (p. 28)

AWS Glue Partition Indexing and Filtering (p. 28)

Working with Geospatial Data (p. 29)

Using AWS Glue Jobs for ETL with Athena (p. 29)

• Creating Tables Using Athena for AWS Glue ETL Jobs (p. 29)

• Using ETL Jobs to Optimize Query Performance (p. 30)

• Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC (p. 31)

• Automating AWS Glue Jobs for ETL (p. 31)

Database, Table, and Column Names

When you create schema in AWS Glue to query in Athena, consider the following:

(37)

Best Practices When Using Athena with AWS Glue

• A database name cannot be longer than 255 characters.

• A table name cannot be longer than 255 characters.

• A column name cannot be longer than 255 characters.

• The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.

For more information, see Databases and Tables in the AWS Glue Developer Guide.

NoteIf you use an AWS::Glue::Database AWS CloudFormation template to create an AWS Glue database and do not specify a database name, AWS Glue automatically generates a database name in the format resource_name–random_string that is not compatible with Athena.

You can use the AWS Glue Catalog Manager to rename columns, but not table names or database names.

To change a database name, you must create a new database and copy tables from the old database to it (in other words, copy the metadata to a new entity). You can follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.

Using AWS Glue Crawlers

AWS Glue crawlers help discover the schema for datasets and register them as tables in the AWS Glue Data Catalog. The crawlers go through your data and determine the schema. In addition, the crawler can detect and register partitions. For more information, see Defining Crawlers in the AWS Glue Developer Guide. Tables from data that were successfully crawled can be queried from Athena.

NoteAthena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync

AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time- Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.

If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler to run on schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive MSCK REPAIR command or manually run an ALTER TABLE ADD PARTITION command. For more information, see Table Partitions in the AWS Glue Developer Guide.

Using Multiple Data Sources with Crawlers

When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

The following partitions in Amazon S3 are an example:

s3://bucket01/folder1/table1/partition1/file.txt s3://bucket01/folder1/table1/partition2/file.txt s3://bucket01/folder1/table1/partition3/file.txt s3://bucket01/folder1/table2/partition4/file.txt

數據

Table Runtime Data Scanned

參考文獻

相關文件

In this section we define a general model that will encompass both register and variable automata and study its query evaluation problem over graphs. The model is essentially a

• The memory storage unit holds instructions and data for a running program.. • A bus is a group of wires that transfer data from one part to another (data,

Since we use the Fourier transform in time to reduce our inverse source problem to identification of the initial data in the time-dependent Maxwell equations by data on the

(a) In your group, discuss what impact the social issues in Learning Activity 1 (and any other socials issues you can think of) have on the world, Hong Kong and you.. Choose the

Official Statistics --- Reproduction of these data is allowed provided the source is quoted.. Further information can be obtained from the Documentation and Information Centre

可以設定遊戲音 效以及是否離開

Programming languages can be used to create programs that control the behavior of a. machine and/or to express algorithms precisely.” -

To complete the “plumbing” of associating our vertex data with variables in our shader programs, you need to tell WebGL where in our buffer object to find the vertex data, and