Amazon Athena
User Guide
Amazon Athena: User Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What is Amazon Athena? ... 1
When should I use Athena? ... 1
Amazon Athena ... 1
Amazon EMR ... 2
Amazon Redshift ... 2
Accessing Athena ... 2
Understanding Tables, Databases, and the Data Catalog ... 3
AWS service integrations with Athena ... 4
Setting Up ... 7
1. Sign up for an AWS account ... 7
2. Create an IAM administrator user and group ... 7
3. Attach managed policies for Athena ... 7
4. Sign in as an IAM user ... 8
Getting Started ... 9
Prerequisites ... 9
Step 1: Create a Database ... 9
Step 2: Create a Table ... 12
Step 3: Query Data ... 16
Saving Your Queries ... 18
Connecting to Other Data Sources ... 18
Accessing Amazon Athena ... 19
Using the Console ... 19
Using the API ... 19
Using the CLI ... 19
Connecting to Data Sources ... 20
Integration with AWS Glue ... 20
Using AWS Glue to Connect to Data Sources in Amazon S3 ... 21
Registering an AWS Glue Data Catalog from Another Account ... 22
Best Practices When Using Athena with AWS Glue ... 23
Upgrading to the AWS Glue Data Catalog Step-by-Step ... 31
FAQ: Upgrading to the AWS Glue Data Catalog ... 32
Using a Hive Metastore ... 34
Overview of Features ... 35
Workflow ... 36
Considerations and Limitations ... 36
Connecting Athena to an Apache Hive Metastore ... 38
Using the AWS Serverless Application Repository ... 40
Connecting Athena to Hive Using an Existing Role ... 42
Configure Athena to Use a Deployed Hive Metastore Connector ... 45
Using a Default Data Source ... 46
Using the AWS CLI with Hive Metastores ... 49
Reference Implementation ... 55
Using Amazon Athena Federated Query ... 56
Considerations and Limitations ... 56
Videos ... 57
Deploying a Connector and Connecting to a Data Source ... 57
Using the AWS Serverless Application Repository ... 59
Athena Data Source Connectors ... 60
Writing Federated Queries ... 64
Enabling Cross-Account Federated Queries ... 67
Writing a Data Source Connector ... 73
IAM Policies for Accessing Data Catalogs ... 74
Data Catalog Example Policies ... 75
Managing Data Sources ... 78
Connecting to Amazon Athena with ODBC and JDBC Drivers ... 79
Using Athena with the JDBC Driver ... 79
Connecting to Amazon Athena with ODBC ... 81
Configuring SSO Using ODBC, SAML 2.0, and Okta ... 84
Power BI Connector ... 94
Creating Databases and Tables ... 100
Creating Databases ... 100
Creating Tables ... 101
Considerations and Limitations ... 101
Creating Tables Using AWS Glue or the Athena Console ... 102
Showing Table Information ... 103
Names for Tables, Databases, and Columns ... 104
Use lower case for table names and table column names in Athena ... 104
Special characters ... 104
Reserved Keywords ... 105
List of Reserved Keywords in DDL Statements ... 105
List of Reserved Keywords in SQL SELECT Statements ... 105
Examples of Queries with Reserved Words ... 106
Table Location in Amazon S3 ... 106
Table Location and Partitions ... 108
Columnar Storage Formats ... 108
Converting to Columnar Formats ... 109
Partitioning Data ... 109
Considerations and Limitations ... 109
Creating and Loading a Table with Partitioned Data ... 110
Preparing Hive Style and Non-Hive Style Data for Querying ... 110
Partition Projection ... 114
Additional Resources ... 114
Partition Projection ... 114
Pruning and Projection for Heavily Partitioned Tables ... 114
Using Partition Projection ... 114
Use Cases ... 115
Considerations and Limitations ... 115
Video ... 116
Setting up Partition Projection ... 116
Supported Types for Partition Projection ... 120
Dynamic ID Partitioning ... 124
Amazon Kinesis Data Firehose Example ... 126
Creating a Table from Query Results (CTAS) ... 130
Considerations and Limitations for CTAS Queries ... 130
Running CTAS Queries in the Console ... 132
Bucketing vs Partitioning ... 133
Data Types Supported for Filtering on Bucketed Columns ... 134
Examples of CTAS Queries ... 134
Using CTAS and INSERT INTO for ETL ... 138
Overview ... 138
Step 1: Create a Table Based on the Original Dataset ... 138
Step 2: Use CTAS to Partition, Convert, and Compress the Data ... 139
Step 3: Use INSERT INTO to Add Data ... 140
Step 4: Measure Performance and Cost Differences ... 142
Summary ... 143
Creating a Table with More Than 100 Partitions ... 143
Compression Support ... 147
Compression Support in Athena by File Format ... 148
Specifying Compression Formats ... 148
Notes and Resources ... 148
SerDe Reference ... 150
Using a SerDe ... 150
To Use a SerDe in Queries ... 150
Supported SerDes and Data Formats ... 151
Avro SerDe ... 153
Regex SerDe ... 155
CloudTrail SerDe ... 155
OpenCSVSerDe for Processing CSV ... 157
Grok SerDe ... 160
JSON SerDe Libraries ... 163
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files ... 167
ORC SerDe ... 172
Parquet SerDe ... 175
Running Queries ... 178
Query Results and Recent Queries ... 178
Specifying a Query Result Location ... 179
Downloading Query Results Files Using the Athena Console ... 182
Viewing Recent Queries ... 184
Finding Query Output Files in Amazon S3 ... 187
Working with Views ... 189
When to Use Views? ... 189
Supported Actions for Views in Athena ... 190
Considerations for Views ... 190
Limitations for Views ... 191
Working with Views in the Console ... 191
Creating Views ... 192
Examples of Views ... 194
Using saved queries ... 195
Considerations and limitations ... 195
Working with saved queries in the Athena console ... 196
Using the Athena API to update saved queries ... 197
Querying with Prepared Statements ... 197
Considerations and Limitations ... 197
SQL Statements ... 197
Handling Schema Updates ... 199
Summary: Updates and Data Formats in Athena ... 200
Index Access in ORC and Parquet ... 201
Types of Updates ... 202
Updates in Tables with Partitions ... 207
Querying Arrays ... 208
Creating Arrays ... 208
Concatenating Arrays ... 210
Converting Array Data Types ... 211
Finding Lengths ... 211
Accessing Array Elements ... 212
Flattening Nested Arrays ... 213
Creating Arrays from Subqueries ... 215
Filtering Arrays ... 216
Sorting Arrays ... 217
Using Aggregation Functions with Arrays ... 217
Converting Arrays to Strings ... 218
Using Arrays to Create Maps ... 219
Querying Arrays with Complex Types and Nested Structures ... 219
Querying Geospatial Data ... 225
What is a Geospatial Query? ... 225
Input Data Formats and Geometry Data Types ... 225
Supported Geospatial Functions ... 226
Examples: Geospatial Queries ... 248
Querying Hudi Datasets ... 250
Hudi Dataset Table Types ... 250
Considerations and Limitations ... 252
Video ... 252
Creating Hudi Tables ... 252
See Also ... 255
Querying JSON ... 255
Best Practices for Reading JSON Data ... 256
Extracting Data from JSON ... 257
Searching for Values in JSON Arrays ... 259
Obtaining Length and Size of JSON Arrays ... 261
Troubleshooting JSON Queries ... 262
Using ML with Athena ... 262
Considerations and Limitations ... 262
ML with Athena Syntax ... 263
Customer Use Examples ... 264
Querying with UDFs ... 264
Considerations and Limitations ... 264
Videos ... 265
UDF Query Syntax ... 265
Examples ... 267
Creating and Deploying a UDF Using Lambda ... 267
Querying AWS service logs ... 272
Application Load Balancer ... 273
Elastic Load Balancing ... 275
CloudFront ... 276
CloudTrail ... 278
Amazon EMR ... 284
Global Accelerator ... 287
GuardDuty ... 289
Network Firewall ... 291
Network Load Balancer ... 293
Route 53 ... 294
Amazon VPC ... 296
AWS WAF ... 303
Querying AWS Glue Data Catalog ... 310
Listing Databases and Searching a Specified Database ... 311
Listing Tables in a Specified Database and Searching for a Table by Name ... 311
Listing Partitions for a Specific Table ... 312
Listing or Searching Columns for a Specified Table or View ... 313
Querying Web Server Logs ... 314
Querying Apache Logs ... 315
Querying IIS Logs ... 317
Using ACID Transactions ... 323
Using Governed Tables ... 323
Considerations and Limitations ... 324
Getting Started ... 324
Creating Governed Tables ... 325
Querying Governed Tables ... 325
Supported Data Types ... 327
Using Iceberg Tables ... 328
Considerations and Limitations ... 328
Getting Started ... 329
Creating Iceberg Tables ... 329
DDL Operations ... 332
Schema Evolution ... 332
Querying Iceberg Tables ... 333
Updating Iceberg Tables ... 335
Optimizing Iceberg Tables ... 336
Supported Data Types ... 337
Security ... 339
Data Protection ... 339
Protecting Multiple Types of Data ... 340
Encryption at Rest ... 341
Encryption in Transit ... 349
Key Management ... 349
Internetwork Traffic Privacy ... 349
Identity and Access Management ... 350
AWS managed policies ... 350
Access through JDBC and ODBC Connections ... 355
Access to Amazon S3 ... 356
Cross-account Access to S3 Buckets ... 357
Fine-Grained Access to Databases and Tables in AWS Glue ... 360
Cross-Account Access to AWS Glue Data Catalogs ... 368
Access to Encrypted Metadata in the Data Catalog ... 372
Access to Workgroups and Tags ... 372
Allow Access to Prepared Statements ... 373
Using CalledVia Context Keys ... 374
Allow Access to an Athena Data Connector for External Hive Metastore ... 376
Allow Lambda Function Access to External Hive Metastores ... 378
Allow Access to Athena Federated Query ... 381
Allow Access to Athena UDF ... 385
Allowing Access for ML with Athena ... 389
Enabling Federated Access to the Athena API ... 390
Logging and Monitoring ... 393
Logging Amazon Athena API Calls with AWS CloudTrail ... 394
Compliance Validation ... 396
Resilience ... 397
Infrastructure Security ... 397
Connect to Amazon Athena Using an Interface VPC Endpoint ... 398
Configuration and Vulnerability Analysis ... 399
Using Athena with Lake Formation ... 399
How Lake Formation Data Access Works ... 400
Considerations and Limitations ... 402
Managing User Permissions ... 404
Applying Lake Formation Permissions to Existing Databases and Tables ... 406
Using Lake Formation and JDBC or ODBC for Federated Access ... 407
Using Workgroups to Control Query Access and Costs ... 447
Using Workgroups for Running Queries ... 447
Benefits of Using Workgroups ... 448
How Workgroups Work ... 448
Setting up Workgroups ... 449
IAM Policies for Accessing Workgroups ... 450
Workgroup Settings ... 456
Managing Workgroups ... 457
Athena Workgroup APIs ... 463
Troubleshooting Workgroups ... 463
Controlling Costs and Monitoring Queries with CloudWatch Metrics and Events ... 465
Video ... 465
Enabling CloudWatch Query Metrics ... 465
Monitoring Athena Queries with CloudWatch Metrics ... 466
Monitoring Athena Queries with CloudWatch Events ... 469
Monitoring Usage Metrics ... 471
Setting Data Usage Control Limits ... 473
Tagging Resources ... 475
Tag Basics ... 475
Tag Restrictions ... 475
Working with Tags on Workgroups in the Console ... 476
Displaying Tags for Individual Workgroups ... 476
Adding and Deleting Tags on an Individual Workgroup ... 477
Using Tag Operations ... 478
Managing Tags Using API Operations ... 478
Managing Tags Using the AWS CLI ... 479
Tag-Based IAM Access Control Policies ... 481
Tag Policy Examples for Workgroups ... 481
Tag Policy Examples for Data Catalogs ... 483
Athena Engine Versioning ... 486
Changing Athena Engine Versions ... 486
Finding the Query Engine Version for a Workgroup ... 486
Changing the Engine Version ... 487
Specifying the Engine Version When You Create a Workgroup ... 488
Testing Queries in Advance of an Engine Version Upgrade ... 488
Troubleshooting Queries That Fail ... 489
Athena Engine Version Reference ... 489
Athena engine version 2 ... 489
SQL Reference ... 498
Data Types in Athena ... 498
DML Queries, Functions, and Operators ... 500
SELECT ... 500
INSERT INTO ... 505
UNLOAD ... 508
EXPLAIN and EXPLAIN ANALYZE ... 510
Functions ... 522
Supported Time Zones ... 524
DDL Statements ... 532
Unsupported DDL ... 533
ALTER DATABASE SET DBPROPERTIES ... 534
ALTER TABLE ADD COLUMNS ... 535
ALTER TABLE ADD PARTITION ... 535
ALTER TABLE DROP PARTITION ... 537
ALTER TABLE RENAME PARTITION ... 537
ALTER TABLE REPLACE COLUMNS ... 538
ALTER TABLE SET LOCATION ... 539
ALTER TABLE SET TBLPROPERTIES ... 540
CREATE DATABASE ... 541
CREATE TABLE ... 542
CREATE TABLE AS ... 546
CREATE VIEW ... 548
DESCRIBE TABLE ... 549
DESCRIBE VIEW ... 550
DROP DATABASE ... 550
DROP TABLE ... 551
DROP VIEW ... 551
MSCK REPAIR TABLE ... 552
SHOW COLUMNS ... 554
SHOW CREATE TABLE ... 555
SHOW CREATE VIEW ... 555
SHOW DATABASES ... 556
SHOW PARTITIONS ... 556
SHOW TABLES ... 558
SHOW TBLPROPERTIES ... 559
SHOW VIEWS ... 559
Considerations and Limitations ... 560
Cross-Regional Queries ... 561
Troubleshooting ... 562
CREATE TABLE AS SELECT (CTAS) ... 562
Duplicated data occurs with concurrent CTAS statements ... 562
Data File Issues ... 562
Athena cannot read hidden files ... 562
Athena cannot read files stored in Amazon S3 Glacier ... 562
Athena reads files that I excluded from the AWS Glue crawler ... 563
HIVE_BAD_DATA: Error parsing field value ... 563
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket-name... 563
HIVE_CURSOR_ERROR: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist ... 563
HIVE_CURSOR_ERROR: Unexpected end of input stream ... 563
HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file ... 564
HIVE_UNKNOWN_ERROR: Unable to create input format ... 564
org.apache.parquet.io.GroupColumnIO cannot be cast to org.apache.parquet.io.PrimitiveColumnIO ... 564
The S3 location provided to save your query results is invalid. ... 564
Federated Queries ... 564
JSON Related Errors ... 565
NULL or incorrect data errors when trying to read JSON data ... 565
HIVE_BAD_DATA: Error parsing field value for field 0: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject ... 565
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key ... 565
HIVE_CURSOR_ERROR messages with pretty-printed JSON ... 565
Multiple JSON records return a SELECT COUNT of 1 ... 565
Cannot query a table created by a AWS Glue crawler that uses a custom JSON classifier ... 566
MSCK REPAIR TABLE ... 566
Output Issues ... 566
Unable to verify/create output bucket ... 566
TIMESTAMP result is empty ... 566
Store Athena query output in a format other than CSV ... 566
The S3 location provided to save your query results is invalid ... 566
Partitioning Issues ... 567
MSCK REPAIR TABLE does not remove stale partitions ... 567
MSCK REPAIR TABLE failure ... 567
MSCK REPAIR TABLE detects partitions but doesn't add them to AWS Glue ... 567
Partition projection ranges with the date format of dd-MM-yyyy-HH-mm-ss or yyyy-MM-dd do not work ... 567
PARTITION BY doesn't support the BIGINT type ... 567
No meaningful partitions available ... 567
Partition projection does not work in conjunction with range partitions ... 567
HIVE_UNKNOWN_ERROR: Unable to create input format ... 568
HIVE_PARTITION_SCHEMA_MISMATCH ... 568
SemanticException table is not partitioned but partition spec exists ... 568
Zero records returned from partitioned data ... 568
Permissions ... 568
Access Denied Error when querying Amazon S3 ... 568
Access Denied with Status Code: 403 error when running DDL queries on encrypted data in Amazon S3 ... 568
Access Denied with Status Code: 403 when querying an Amazon S3 bucket in another account .. 569
Use IAM role credentials to connect to the Athena JDBC driver ... 569
Query Syntax Issues ... 569
FAILED: NullPointerException Name is null ... 569
Function not registered ... 570
GENERIC_INTERNAL_ERROR Exceptions ... 570
Number of matching groups doesn't match the number of columns ... 570
queryString failed to satisfy constraint: Member must have length less than or equal to 262144 571 SYNTAX_ERROR: Column cannot be resolved ... 571
Throttling Issues ... 571
Views ... 571
Views created in Apache Hive shell do not work in Athena ... 571
View is stale; it must be re-created ... 571
Workgroups ... 572
Additional Resources ... 572
Athena Error Catalog ... 572
Error Category ... 572
Error Type Reference ... 573
Performance Tuning ... 577
Physical Limits ... 577
Query Optimization Techniques ... 577
Data Size ... 577
File Formats ... 578
Joins, Grouping, and Unions ... 578
Partitioning ... 579
Window Functions ... 579
Use More Efficient Functions ... 579
Additional Resources ... 580
Code Samples, Service Quotas, and Previous JDBC Driver ... 581
Code Samples ... 581
Constants ... 581
Create a Client to Access Athena ... 582
Start Query Execution ... 582
Stop Query Execution ... 584
List Query Executions ... 586
Create a Named Query ... 587
Delete a Named Query ... 588
List Named Queries ... 589
Earlier Version JDBC Drivers ... 590
Instructions for JDBC Driver version 1.1.0 ... 591
Service Quotas ... 595
Queries ... 595
Workgroups ... 596
Databases, Tables, and Partitions ... 596
Amazon S3 Buckets ... 596
Per Account API Call Quotas ... 597
Release Notes ... 598
March 2, 2022 ... 600
February 23, 2022 ... 600
February 15, 2022 ... 600
February 14, 2022 ... 601
February 9, 2022 ... 601
February 8, 2022 ... 601
January 28, 2022 ... 601
January 13, 2022 ... 602
November 30, 2021 ... 602
November 26, 2021 ... 602
November 24, 2021 ... 603
November 22, 2021 ... 603
November 18, 2021 ... 603
November 17, 2021 ... 604
November 16, 2021 ... 604
November 12, 2021 ... 605
November 2, 2021 ... 605
October 29, 2021 ... 605
JDBC and ODBC Drivers ... 605
Features and Improvements ... 606
October 4, 2021 ... 606
September 16, 2021 ... 606
Features ... 606
Improvements ... 606
September 15, 2021 ... 607
Athena Console Preview ... 607
Athena JDBC Driver 2.0.24 ... 607
August 31, 2021 ... 607
August 12, 2021 ... 608
August 6, 2021 ... 608
August 5, 2021 ... 608
July 30, 2021 ... 608
July 21, 2021 ... 609
July 16, 2021 ... 609
July 8, 2021 ... 609
July 1, 2021 ... 610
June 23, 2021 ... 610
May 12, 2021 ... 610
May 10, 2021 ... 610
May 5, 2021 ... 611
April 30, 2021 ... 611
April 29, 2021 ... 611
April 26, 2021 ... 611
April 21, 2021 ... 612
April 5, 2021 ... 612
EXPLAIN Statement ... 612
SageMaker Machine Learning Models in SQL Queries ... 612
User Defined Functions (UDF) ... 612
March 30, 2021 ... 612
March 25, 2021 ... 612
March 5, 2021 ... 613
February 25, 2021 ... 613
December 16, 2020 ... 613
Athena engine version 2 and Athena Federated Query ... 613
AWS PrivateLink ... 613
November 24, 2020 ... 613
November 11, 2020 ... 614
Athena engine version 2 ... 614
Federated SQL Queries ... 614
October 22, 2020 ... 615
July 29, 2020 ... 615
July 9, 2020 ... 615
Querying Apache Hudi Datasets ... 616
AWS CloudFormation Data Catalog Resource ... 616
June 1, 2020 ... 616
Using Apache Hive Metastore as a Metacatalog with Amazon Athena ... 616
May 21, 2020 ... 616
April 1, 2020 ... 616
March 11, 2020 ... 617
March 6, 2020 ... 617
November 26, 2019 ... 617
Federated SQL Queries ... 617
Invoking Machine Learning Models in SQL Queries ... 618
User Defined Functions (UDFs) (Preview) ... 619
Using Apache Hive Metastore as a Metacatalog with Amazon Athena (Preview) ... 619
New Query-Related Metrics ... 619
November 12, 2019 ... 620
November 8, 2019 ... 620
October 8, 2019 ... 620
September 19, 2019 ... 620
September 12, 2019 ... 621
August 16, 2019 ... 621
August 9, 2019 ... 621
June 26, 2019 ... 621
May 24, 2019 ... 622
March 05, 2019 ... 622
February 22, 2019 ... 622
February 18, 2019 ... 623
November 20, 2018 ... 624
October 15, 2018 ... 624
October 10, 2018 ... 625
September 6, 2018 ... 625
August 23, 2018 ... 626
August 16, 2018 ... 626
August 7, 2018 ... 627
June 5, 2018 ... 627
Support for Views ... 627
Improvements and Updates to Error Messages ... 627
Bug Fixes ... 628
May 17, 2018 ... 628
April 19, 2018 ... 628
April 6, 2018 ... 628
March 15, 2018 ... 629
February 2, 2018 ... 629
January 19, 2018 ... 629
November 13, 2017 ... 630
November 1, 2017 ... 630
October 19, 2017 ... 630
October 3, 2017 ... 630
September 25, 2017 ... 630
August 14, 2017 ... 630
August 4, 2017 ... 631
June 22, 2017 ... 631
June 8, 2017 ... 631
May 19, 2017 ... 631
Improvements ... 632
Bug Fixes ... 632
April 4, 2017 ... 632
Features ... 632
Improvements ... 633
Bug Fixes ... 633
March 24, 2017 ... 633
Features ... 633
Improvements ... 633
Bug Fixes ... 633
February 20, 2017 ... 634
Features ... 634
Improvements ... 635
Document History ... 636
AWS glossary ... 646
When should I use Athena?
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL (p. 498). With a few actions in the AWS
Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries.
Topics
• When should I use Athena? (p. 1)
• Accessing Athena (p. 2)
• Understanding Tables, Databases, and the Data Catalog (p. 3)
• AWS service integrations with Athena (p. 4)
When should I use Athena?
Query services like Amazon Athena, data warehouses like Amazon Redshift, and sophisticated data processing frameworks like Amazon EMR all address different needs and use cases. The following guidance can help you choose one or more services based on your requirements.
Amazon Athena
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.
Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information, see What is Amazon QuickSight in the Amazon QuickSight User Guide and Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.
Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without having to format data or manage infrastructure. For example, Athena is useful if you want to run a quick query on web logs to troubleshoot a performance issue on your site. With Athena, you can get started fast: you just define a table for your data and start querying using standard SQL.
You should use Amazon Athena if you want to run interactive ad hoc SQL queries against data on Amazon S3, without having to manage any infrastructure or clusters. Amazon Athena provides the easiest way to run ad hoc queries for data in Amazon S3 without the need to setup or manage any servers.
Amazon EMR
For a list of AWS services that Athena leverages or integrates with, see the section called “AWS service integrations with Athena” (p. 4).
Amazon EMR
Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.
In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code. You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with the latest big data processing frameworks such as Spark, Hadoop, Presto, or Hbase. Amazon EMR gives you full control over the configuration of your clusters and the software installed on them.
You can use Amazon Athena to query data that you process using Amazon EMR. Amazon Athena supports many of the same data formats as Amazon EMR. Athena's data catalog is Hive metastore compatible. If you use EMR and already have a Hive metastore, you can run your DDL statements on Amazon Athena and query your data immediately without affecting your Amazon EMR jobs.
Amazon Redshift
A data warehouse like Amazon Redshift is your best choice when you need to pull together data from many different sources – like inventory systems, financial systems, and retail sales systems – into a common format, and store it for long periods of time. If you want to build sophisticated business reports from historical data, then a data warehouse like Amazon Redshift is the best choice. The query engine in Amazon Redshift has been optimized to perform especially well on running complex queries that join large numbers of very large database tables. When you need to run queries against highly structured data with lots of joins across lots of very large tables, choose Amazon Redshift.
For more information about when to use Athena, consult the following resources:
• When to use Athena vs other big data services
• Amazon Athena Overview
• Amazon Athena Features
• Amazon Athena FAQs
• Amazon Athena Blog posts
Accessing Athena
You can access Athena using the AWS Management Console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.
• To get started with the console, see Getting Started (p. 9).
• To learn how to use JDBC or ODBC drivers, see Connecting to Amazon Athena with JDBC (p. 79) and Connecting to Amazon Athena with ODBC (p. 81).
• To use the Athena API, see the Amazon Athena API Reference.
• To use the CLI, install the AWS CLI and then type aws athena help from the command line to see available commands. For information about available commands, see the Amazon Athena command line reference.
Understanding Tables, Databases, and the Data Catalog
• To use the AWS SDK for Java 2.x, see the Athena section of the AWS SDK for Java 2.x API Reference, the Athena Java V2 Examples on GitHub.com, and the AWS SDK for Java 2.x Developer Guide.
• To use the AWS SDK for .NET, see the Amazon.Athena namespace in the AWS SDK for .NET API Reference, the .NET Athena examples on GitHub.com, and the AWS SDK for .NET Developer Guide.
• To use AWS Tools for Windows PowerShell, see the AWS Tools for PowerShell - Amazon Athena cmdlet reference, the AWS Tools for PowerShell portal page, and the AWS Tools for Windows PowerShell User Guide.
• For information about Athena service endpoints that you can connect to programmatically, see Amazon Athena endpoints and quotas in the Amazon Web Services General Reference.
Understanding Tables, Databases, and the Data Catalog
In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.
For each dataset that you'd like to query, Athena must have an underlying table it will use for obtaining and returning query results. Therefore, before querying data, a table must be registered in Athena. The registration occurs when you either create tables automatically or manually.
Regardless of how the tables are created, the tables creation process registers the dataset with Athena.
This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.
• To create a table automatically, use an AWS Glue crawler from within Athena. For more information about AWS Glue and crawlers, see Integration with AWS Glue (p. 20). When AWS Glue creates a table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store and retrieve this metadata, using it when you run queries to analyze the underlying dataset.
After you create a table, you can use SQL SELECT (p. 500) statements to query it, including getting specific file locations for your source data (p. 504). Your query results are stored in Amazon S3 in the query result location that you specify (p. 179).
The AWS Glue Data Catalog is accessible throughout your Amazon Web Services account. Other AWS services can share the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and extract, transform, and load (ETL) data.
• To create a table manually:
• Use the Athena console to run the Create Table Wizard.
• Use the Athena console to write Hive DDL statements in the Query Editor.
• Use the Athena API or CLI to run a SQL query string with DDL statements.
• Use the Athena JDBC or ODBC driver.
When you create tables and databases manually, Athena uses HiveQL data definition language (DDL) statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create tables and databases in the AWS Glue Data Catalog.
NoteIf you have tables in Athena created before August 14, 2017, they were created in an Athena- managed internal data catalog that exists side-by-side with the AWS Glue Data Catalog until
AWS service integrations with Athena
you choose to update. For more information, see Upgrading to the AWS Glue Data Catalog Step- by-Step (p. 31).
When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL engine. We have examples with sample data within Athena to show you how to create a table and then issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started creating a table based on data that is stored in Amazon S3.
• For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see Getting Started (p. 9).
• Run the Athena tutorial in the console. This launches automatically if you log in to https://
console.aws.amazon.com/athena/ for the first time. You can also choose Tutorial in the console to launch it.
AWS service integrations with Athena
You can use Athena to query data from the AWS services listed in this section. To see the Regions that each service supports, see Regions and Endpoints in the Amazon Web Services General Reference.
AWS services integrated with Athena
• AWS CloudFormation
• Amazon CloudFront
• AWS CloudTrail
• Elastic Load Balancing
• AWS Glue Data Catalog
• AWS Identity and Access Management (IAM)
• Amazon QuickSight
• Amazon S3 Inventory
• AWS Step Functions
• AWS Systems Manager Inventory
• Amazon Virtual Private Cloud
For information about each integration, see the following sections.
AWS CloudFormation Data Catalog
Reference topic: AWS::Athena::DataCatalog in the AWS CloudFormation User Guide
Specify an Athena data catalog, including a name, description, type, parameters, and tags. For more information, see DataCatalog in the Amazon Athena API Reference.
Named Query
Reference topic: AWS::Athena::NamedQuery in the AWS CloudFormation User Guide
Specify named queries with AWS CloudFormation and run them in Athena. Named queries allow you to map a query name to a query and then run it as a saved query from the Athena console.
For information, see CreateNamedQuery in the Amazon Athena API Reference.
Workgroup
Reference topic: AWS::Athena::WorkGroup in the AWS CloudFormation User Guide
AWS service integrations with Athena
Specify Athena workgroups using AWS CloudFormation. Use Athena workgroups to isolate queries for you or your group from other queries in the same account. For more information, see Using Workgroups to Control Query Access and Costs (p. 447) in the Amazon Athena User Guide and CreateWorkGroup in the Amazon Athena API Reference.
Amazon CloudFront
Reference topic: Querying Amazon CloudFront Logs (p. 276)
Use Athena to query Amazon CloudFront logs. For more information about using CloudFront, see the Amazon CloudFront Developer Guide.
AWS CloudTrail
Reference topic: Querying AWS CloudTrail Logs (p. 278)
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity.
For example, you can use queries to identify trends and further isolate activity by attribute, such as source IP address or user. You can create tables for querying logs directly from the CloudTrail console, and use those tables to run queries in Athena. For more information, see Creating a Table for CloudTrail Logs in the CloudTrail Console (p. 279).
Elastic Load Balancing
Reference topic: Querying Application Load Balancer Logs (p. 273)
Querying Application Load Balancer logs allows you to see the source of traffic, latency, and bytes transferred to and from Elastic Load Balancing instances and backend applications. For more information, see Creating the Table for ALB Logs (p. 273).
Reference topic: Querying Classic Load Balancer Logs (p. 275)
Query Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes transferred. For more information, see Creating the Table for ELB Logs (p. 275).
AWS Glue Data Catalog
Reference topic: Integration with AWS Glue (p. 20)
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.
AWS Identity and Access Management (IAM) Reference topic: Actions for Amazon Athena
You can use Athena API actions in IAM permission policies. For more information, see Actions for Amazon Athena and Identity and Access Management in Athena (p. 350).
Amazon QuickSight
Reference topic: Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information about Amazon QuickSight, see What is Amazon QuickSight in the Amazon QuickSight User Guide. For information about using JDBC and ODBC drivers with Athena, see Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).
Amazon S3 Inventory
Reference topic: Querying inventory with Athena in the Amazon Simple Storage Service User Guide
AWS service integrations with Athena
You can use Amazon Athena to query Amazon S3 inventory using standard SQL. You can use
Amazon S3 inventory to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. For more information, see Amazon S3 inventory in the Amazon Simple Storage Service User Guide.
AWS Step Functions
Reference topic: Call Athena with Step Functions in the AWS Step Functions Developer Guide Call Athena with AWS Step Functions. AWS Step Functions can control select AWS services directly using the Amazon States Language. You can use Step Functions with Athena to start and stop query execution, get query results, run ad-hoc or scheduled data queries, and retrieve results from data lakes in Amazon S3. For more information, see the AWS Step Functions Developer Guide.
Video: Orchestrate Amazon Athena Queries using AWS Step Functions
The following video demonstrates how to use Amazon Athena and AWS Step Functions to run a regularly scheduled Athena query and generate a corresponding report.
Orchestrate Amazon Athena Queries using AWS Step Functions
For an example that uses Step Functions and Amazon EventBridge to orchestrate AWS Glue DataBrew, Athena, and Amazon QuickSight, see Orchestrating an AWS Glue DataBrew job and Amazon Athena query with AWS Step Functions in the AWS Big Data Blog.
AWS Systems Manager Inventory
Reference topic: Querying inventory data from multiple Regions and accounts in the AWS Systems Manager User Guide
AWS Systems Manager Inventory integrates with Amazon Athena to help you query inventory data from multiple AWS Regions and accounts. For more information, see the AWS Systems Manager User Guide.
Amazon Virtual Private Cloud
Reference topic: Querying Amazon VPC Flow Logs (p. 296)
Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Query the logs in Athena to investigate network traffic patterns and identify threats and risks across your Amazon VPC network. For more information about Amazon VPC, see the Amazon VPC User Guide.
1. Sign up for an AWS account
Setting Up
If you've already signed up for Amazon Web Services, you can start using Amazon Athena immediately.
If you haven't signed up for AWS or need assistance getting started, be sure to complete the following tasks:
1. Sign up for an AWS account (p. 7)
2. Create an IAM administrator user and group (p. 7) 3. Attach managed policies for Athena (p. 7) 4. Sign in as an IAM user (p. 8)
1. Sign up for an AWS account
When you sign up for AWS, your account is automatically signed up for all services in AWS, including Athena. You are charged only for the services that you use. For pricing information, see Amazon Athena pricing.
If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the following procedure to create one.
To create an AWS account
1. Open http://aws.amazon.com/, and then choose Create an AWS account.
2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad.
3. Note your AWS account number, because you need it for the next task.
2. Create an IAM administrator user and group
An AWS Identity and Access Management (IAM) user is an account that you create to access services. It is a different user than your main AWS account. As a security best practice, we recommend that you use the IAM user's credentials to access AWS services. You use the IAM console to create an administrator IAM user and an Administrators group for the user. You can then access the console for Athena and other AWS services by accessing a special link and providing the credentials for the IAM user that you created.
For steps, see Creating an administrator IAM user and user group (console) in the IAM User Guide.
3. Attach managed policies for Athena
After you have created an IAM user, you must attach some Athena managed policies to the user so that the user can access Athena. There are two managed policies for Athena: AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. These policies grant permissions to Athena to query Amazon S3 and to write the results of your queries to a separate bucket on your behalf. To see the contents of these policies for Athena, see AWS managed policies for Amazon Athena (p. 350).
4. Sign in as an IAM user
For steps to attach the Athena managed policies, follow Adding IAM Identity Permissions (Console) in the IAM User Guide and add the AmazonAthenaFullAccess and AWSQuicksightAthenaAccess managed policies to the IAM administrator user that you created.
NoteYou may need additional permissions to access the underlying dataset in Amazon S3. If you are not the account owner or otherwise have restricted access to a bucket, contact the bucket owner to grant access using a resource-based bucket policy, or contact your account administrator to grant access using an identity-based policy. For more information, see Amazon S3 Permissions (p. 356). If the dataset or Athena query results are encrypted, you may need additional permissions. For more information, see Configuring Encryption Options (p. 341).
4. Sign in as an IAM user
To sign in as the new IAM user that you created, you can use the custom sign-in URL for the IAM users of your account. To see the sign-in URL for the IAM users for your account, open the IAM console and choose Users, user_name, Security credentials, Console sign-in link. As a convenience, you can use the clipboard icon to copy the sign-in URL to the clipboard.
For more information about signing in as an IAM user, see How IAM users sign in to your AWS account in the IAM User Guide.
Prerequisites
Getting Started
This tutorial walks you through using Amazon Athena to query data. You'll create a table based on sample data stored in Amazon Simple Storage Service, query the table, and check the results of the query.
The tutorial uses live resources, so you are charged for the queries that you run. You aren't charged for the sample data in the location that this tutorial uses, but if you upload your own data files to Amazon S3, charges do apply.
Prerequisites
• If you have not already done so, sign up for an account in Setting Up (p. 7).
• Using the same AWS Region (for example, US West (Oregon)) and account that you are using for Athena, Create a bucket in Amazon S3 to hold your Athena query results.
Step 1: Create a Database
You first need to create a database in Athena.
To create an Athena database
1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. If this is your first time to visit the Athena console in your current AWS Region, choose Explore the query editor to open the query editor. Otherwise, Athena opens in the query editor.
3. Choose View Settings to set up a query result location in Amazon S3.
4. On the Settings tab, choose Manage.
Step 1: Create a Database
5. For Manage settings, do one of the following:
• In the Location of query result box, enter the path to the bucket that you created in Amazon S3 for your query results. Prefix the path with s3://.
• Choose Browse S3, choose the Amazon S3 bucket that you created for your current Region, and then choose Choose.
Step 1: Create a Database
6. Choose Save.
7. Choose Editor to switch to the query editor.
8. On the right of the navigation pane, you can use the Athena query editor to enter and run queries and statements.
Step 2: Create a Table
9. To create a database named mydatabase, enter the following CREATE DATABASE statement.
CREATE DATABASE mydatabase 10. Choose Run or press Ctrl+ENTER.
11. From the Database list on the left, choose mydatabase to make it your current database.
Step 2: Create a Table
Now that you have a database, you can create an Athena table for it. The table that you create will be based on sample Amazon CloudFront log data in the location s3://athena-examples-myregion/
cloudfront/plaintext/, where myregion is your current AWS Region.
The sample log data is in tab-separated values (TSV) format, which means that a tab character is used as a delimiter to separate the fields. The data looks like the following example. For readability, the tabs in the excerpt have been converted to spaces and the final field shortened.
2014-07-05 20:00:09 DFW3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0[...]
2014-07-05 20:00:09 DFW3 4252 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-2.jpeg 200 - Mozilla/5.0[...]
2014-07-05 20:00:10 AMS1 4261 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-3.jpeg 200 - Mozilla/5.0[...]
To enable Athena to read this data, you could run a CREATE EXTERNAL TABLE statement like the following. The statement that creates the table defines columns that map to the data, specifies how the data is delimited, and specifies the Amazon S3 location that contains the sample data.
NoteFor the LOCATION clause, specify an Amazon S3 folder location, not a specific file. Athena scans all of the files in the folder that you specify.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
Step 2: Create a Table
`Date` DATE, Time STRING, Location STRING, Bytes INT,
RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, ClientInfo STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-my-region/cloudfront/plaintext/';
The example creates a table called cloudfront_logs and specifies a name and data type for each field.
These fields become the columns in the table. Because date is a reserved word (p. 105), it is escaped with backtick (`) characters. ROW FORMAT DELIMITED means that Athena will use a default library called LazySimpleSerDe (p. 167) to do the actual work of parsing the data. The example also specifies that the fields are tab separated (FIELDS TERMINATED BY '\t') and that each record in the file ends in a newline character (LINES TERMINATED BY '\n). Finally, the LOCATION clause specifies the path in Amazon S3 where the actual data to be read is located. If you have your own tab or comma-separated data, you can use a CREATE TABLE statement like this.
Returning to the sample data, here is a full example of the final field ClientInfo:
Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;
%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9
As you can see, this field is multivalued. If the CREATE TABLE statement specifies tabs as field delimiters, the separate components inside this particular field can't be broken out into separate
columns. To create columns from the values inside the field, you can use a regular expression (regex) that contains regex groups. The regex groups that you specify become separate table columns. To use a regex in your CREATE TABLE statement, use syntax like the following. This syntax instructs Athena to use the Regex SerDe (p. 155) library and the regular expression that you specify.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "regular_expression")
Regular expressions can be useful for creating tables from complex CSV or TSV data but can be difficult to write and maintain. Fortunately, there are other libraries that you can use for formats like JSON, Parquet, and ORC. For more information, see Supported SerDes and Data Formats (p. 151).
Now you are ready to create the table in the Athena query editor. The CREATE TABLE statement and regex are provided for you.
To create a table in Athena
1. In the navigation pane, for Database, make sure that mydatabase is selected.
2. To give yourself more room in the query editor, you can choose the arrow icon to collapse the navigation pane.
Step 2: Create a Table
3. To create a tab for a new query, choose the plus (+) sign in the query editor. You can have up to ten query tabs open at once.
4. To close one or more query tabs, choose the arrow next to the plus sign. To close all tabs at once, choose the arrow, and then choose Close all tabs.
Step 2: Create a Table
5. In the query pane, enter the following CREATE EXTERNAL TABLE statement. The regex breaks out the operating system, browser, and browser version information from the ClientInfo field in the log data.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs ( `Date` DATE,
Time STRING, Location STRING, Bytes INT,
RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, os STRING, Browser STRING, BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s +([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^
\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';
6. In the LOCATION statement, replace myregion with the AWS Region that you are currently using (for example, us-west-1).
7. Choose Run.
Step 3: Query Data
The table cloudfront_logs is created and appears under the list of Tables for the mydatabase database.
Step 3: Query Data
Now that you have the cloudfront_logs table created in Athena based on the data in Amazon S3, you can run SQL queries on the table and see the results in Athena. For more information about using SQL in Athena, see SQL Reference for Amazon Athena (p. 498).
To run a query
1. Choose the plus (+) sign to open a new query tab and enter the following SQL statement in the query pane.
SELECT os, COUNT(*) count FROM cloudfront_logs
WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05' GROUP BY os
2. Choose Run.
The results look like the following:
3. To save the results of the query to a .csv file, choose Download results.
Step 3: Query Data
4. To view or run previous queries, choose the Recent queries tab.
5. To download the results of a previous query from the Recent queries tab, select the query, and then choose Download results. Queries are retained for 45 days.
Saving Your Queries
For more information, see Working with Query Results, Recent Queries, and Output Files (p. 178).
Saving Your Queries
You can save the queries that you create or edit in the query editor with a name. Athena stores these queries on the Saved queries tab. You can use the Saved queries tab to recall, run, rename, or delete your saved queries. For more information, see Using saved queries (p. 195).
Connecting to Other Data Sources
This tutorial used a data source in Amazon S3 in CSV format. For information about using Athena with AWS Glue, see Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21). You can also connect Athena to a variety of data sources by using ODBC and JDBC drivers, external Hive metastores, and Athena data source connectors. For more information, see Connecting to Data Sources (p. 20).
Using the Console
Accessing Amazon Athena
You can access Amazon Athena using the AWS Management Console, the Amazon Athena API, or the AWS CLI.
Using the Console
You can use the AWS Management Console for Amazon Athena to do the following:
• Create or select a database.
• Create, view, and delete tables.
• Filter tables by starting to type their names.
• Preview tables and generate CREATE TABLE DDL for them.
• Show table properties.
• Run queries on tables, save and format queries, and view query history.
• Create up to ten queries using different query tabs in the query editor. To open a new tab, click the plus sign.
• Display query results, save, and export them.
• Access the AWS Glue Data Catalog.
• View and change settings, such as view the query result location, configure auto-complete, and encrypt query results.
In the right pane, the Query Editor displays an introductory screen that prompts you to create your first table. You can view your tables under Tables in the left pane.
Here's a high-level overview of the actions available for each table:
• Preview tables – View the query syntax in the Query Editor on the right.
• Show properties – Show a table's name, its location in Amazon S3, input and output formats, the serialization (SerDe) library used, and whether the table has encrypted data.
• Delete table – Delete a table.
• Generate CREATE TABLE DDL – Generate the query behind a table and view it in the query editor.
Using the API
Amazon Athena enables application programming for Athena. For more information, see Amazon Athena API Reference. The latest AWS SDKs include support for the Athena API.
For examples of using the AWS SDK for Java with Athena, see Code Samples (p. 581).
For more information about AWS SDK for Java documentation and downloads, see the SDKs section in Tools for Amazon Web Services.
Using the CLI
You can access Amazon Athena using the AWS CLI. For more information, see the AWS CLI Reference for Athena.
Integration with AWS Glue
Connecting to Data Sources
You can use Amazon Athena to query data stored in different locations and formats in a dataset. This dataset might be in CSV, JSON, Avro, Parquet, or some other format.
The tables and databases that you work with in Athena to run queries are based on metadata. Metadata is data about the underlying data in your dataset. How that metadata describes your dataset is called the schema. For example, a table name, the column names in the table, and the data type of each column are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog that describes it is called a data source.
The relationship of metadata to an underlying dataset depends on the type of data source that you work with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata with the dataset. In these systems, the metadata is most often written when the data is written. Other data sources, like those built using Hive, allow you to define metadata on-the-fly when you read the dataset. The dataset can be in a variety of formats—for example, CSV, JSON, Parquet, or Avro.
Athena natively supports the AWS Glue Data Catalog. The AWS Glue Data Catalog is a data catalog built on top of other datasets and data sources such as Amazon S3, Amazon Redshift, and Amazon DynamoDB. You can also connect Athena to other data sources by using a variety of connectors.
Topics
• Integration with AWS Glue (p. 20)
• Using Athena Data Connector for External Hive Metastore (p. 34)
• Using Amazon Athena Federated Query (p. 56)
• IAM Policies for Accessing Data Catalogs (p. 74)
• Managing Data Sources (p. 78)
• Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)
Integration with AWS Glue
AWS Glue is a fully managed ETL (extract, transform, and load) AWS service. One of its key abilities is to analyze and categorize data. You can use AWS Glue crawlers to automatically infer database and table schema from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.
Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your Amazon Web Services account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.
To create database and table schema in the AWS Glue Data Catalog, you can run an AWS Glue crawler from within Athena on a data source, or you can run Data Definition Language (DDL) queries directly in the Athena Query Editor. Then, using the database and table schema that you created, you can use Data Manipulation (DML) queries in Athena to query the data.
You can register an AWS Glue Data Catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue, you can use Athena to run cross-account queries. For more information, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).
For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
Using AWS Glue to Connect to Data Sources in Amazon S3
Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate charges for AWS Glue? (p. 33) For more information about the benefits of using AWS Glue with
Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 33) Topics
• Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21)
• Registering an AWS Glue Data Catalog from Another Account (p. 22)
• Best Practices When Using Athena with AWS Glue (p. 23)
• Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 31)
• FAQ: Upgrading to the AWS Glue Data Catalog (p. 32)
Using AWS Glue to Connect to Data Sources in Amazon S3
Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.
To define schema information for AWS Glue to use, you can create an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.
Creating an AWS Glue Crawler
You can create a crawler by starting in the Athena console and then using the AWS Glue console in an integrated way. When you create the crawler, you specify a data location in Amazon S3 to crawl.
To create a crawler in AWS Glue starting from the Athena console 1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. In the query editor, next to Tables and views, choose Create, and then choose AWS Glue crawler.
3. On the AWS Glue console Add crawler page, follow the steps to create a crawler. For more information, see Using AWS Glue Crawlers (p. 24) in this guide and Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
Note
Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
Adding a Table Using a Form
The following procedure shows you how to use the Athena console to add a table using the Create Table From S3 bucket data form.
To add a table and enter schema information using a form
1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. In the query editor, next to Tables and views, choose Create, and then choose S3 bucket data.
3. On the Create Table From S3 bucket data form, for Table name, enter a name for the table.
4. For Database configuration, choose an existing database, or create a new one.
Registering an AWS Glue Data Catalog from Another Account
5. For Location of Input Data Set, specify the path in Amazon S3 to the folder that contains the dataset that you want to process.
6. For Data Format, choose a data format (Apache Web Logs, CSV, TSV, Text File with Custom Delimiters, JSON, Parquet, or ORC).
• For the Apache Web Logs option, you must also enter a regex expression in the Regex box.
• For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column delimiter). Optionally, you can specify a Collection terminator for array types or a Map key terminator.
7. For Column details, specify a column name and the column data type.
• To add more columns one at a time, choose Add a column.
• To quickly add more columns, choose Bulk add columns. In the text box, enter a comma separated list of columns in the format column_namedata_type, column_namedata_type[,
…], and then choose Add.
8. (Optional) For Partition details, add one or more column names and data types.
9. The Preview table query box shows the CREATE TABLE statement generated by the information that you entered into the form. The preview statement cannot be edited directly. To change the statement, modify the fields in the form, or create the statement directly (p. 103) in the query editor instead of using the form.
10. Choose Create table to run the generated statement in the query editor and create the table.
Registering an AWS Glue Data Catalog from Another Account
You can use Athena's cross-account AWS Glue catalog feature to register an AWS Glue catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue and register the catalog as an Athena DataCatalog resource, you can use Athena to run cross-account queries. For information about configuring the required permissions, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).
The following procedure shows you how to use the Athena console to configure an AWS Glue Data Catalog in an Amazon Web Services account other than your own as a data source.
To register an AWS Glue Data Catalog from another account
1. Follow the steps in Cross-Account Access to AWS Glue Data Catalogs (p. 368) to ensure that you have permissions to query the data catalog in the other account.
2. Open the Athena console at https://console.aws.amazon.com/athena/.
3. If the console navigation pane is not visible, choose the expansion menu on the left.
4. Choose Data sources.
Best Practices When Using Athena with AWS Glue
5. On the upper right, choose Connect data source.
6. In the AWS Glue Data Catalog section, for Choose an AWS Glue Data Catalog, choose AWS Glue Data Catalog in another account.
7. For Data source details, enter the following information:
• Data source name – Enter the name that you want to use in your SQL queries to refer to the data catalog in the other account.
• Description – (Optional) Enter a description of the data catalog in the other account.
• Catalog ID – Enter the 12-digit Amazon Web Services account ID of the account to which the data catalog belongs. The Amazon Web Services account ID is the catalog ID.
8. (Optional) For Tags, enter key-value pairs that you want to associate with the data source. For more information about tags, see Tagging Athena Resources (p. 475).
9. Choose Register. On the Data sources page, the data catalog that you entered is listed in the Catalog name column.
10. To view or edit information about the new data catalog, choose the catalog, and then choose Edit.
11. To delete the new data catalog, choose the catalog, and then choose Delete.
Best Practices When Using Athena with AWS Glue
When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in AWS Glue and related services. This topic provides considerations and best practices when using either method.
Under the hood, Athena uses Presto to process DML statements and Hive to process the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and AWS Glue work well together.
In this topic
• Database, Table, and Column Names (p. 23)
• Using AWS Glue Crawlers (p. 24)
• Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync (p. 24)
• Using Multiple Data Sources with Crawlers (p. 24)
• Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH" (p. 25)
• Updating Table Metadata (p. 26)
• Working with CSV Files (p. 26)
• CSV Data Enclosed in Quotes (p. 26)
• CSV Files with Headers (p. 28)
• AWS Glue Partition Indexing and Filtering (p. 28)
• Working with Geospatial Data (p. 29)
• Using AWS Glue Jobs for ETL with Athena (p. 29)
• Creating Tables Using Athena for AWS Glue ETL Jobs (p. 29)
• Using ETL Jobs to Optimize Query Performance (p. 30)
• Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC (p. 31)
• Automating AWS Glue Jobs for ETL (p. 31)
Database, Table, and Column Names
When you create schema in AWS Glue to query in Athena, consider the following:
Best Practices When Using Athena with AWS Glue
• A database name cannot be longer than 255 characters.
• A table name cannot be longer than 255 characters.
• A column name cannot be longer than 255 characters.
• The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more information, see Databases and Tables in the AWS Glue Developer Guide.
NoteIf you use an AWS::Glue::Database AWS CloudFormation template to create an AWS Glue database and do not specify a database name, AWS Glue automatically generates a database name in the format resource_name–random_string that is not compatible with Athena.
You can use the AWS Glue Catalog Manager to rename columns, but not table names or database names.
To change a database name, you must create a new database and copy tables from the old database to it (in other words, copy the metadata to a new entity). You can follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
Using AWS Glue Crawlers
AWS Glue crawlers help discover the schema for datasets and register them as tables in the AWS Glue Data Catalog. The crawlers go through your data and determine the schema. In addition, the crawler can detect and register partitions. For more information, see Defining Crawlers in the AWS Glue Developer Guide. Tables from data that were successfully crawled can be queried from Athena.
NoteAthena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync
AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time- Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.
If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler to run on schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive MSCK REPAIR command or manually run an ALTER TABLE ADD PARTITION command. For more information, see Table Partitions in the AWS Glue Developer Guide.
Using Multiple Data Sources with Crawlers
When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.
The following partitions in Amazon S3 are an example:
s3://bucket01/folder1/table1/partition1/file.txt s3://bucket01/folder1/table1/partition2/file.txt s3://bucket01/folder1/table1/partition3/file.txt s3://bucket01/folder1/table2/partition4/file.txt