Amazon Athena

User Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL (p. 498). With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries.

Topics

• When should I use Athena?

• Accessing Athena

• Understanding Tables, Databases, and the Data Catalog

• AWS service integrations with Athena

When should I use Athena?

Query services like Amazon Athena, data warehouses like Amazon Redshift, and sophisticated data processing frameworks like Amazon EMR all address different needs and use cases. The following guidance can help you choose one or more services based on your requirements.

Amazon Athena

Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.

Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information, see What is Amazon QuickSight in the Amazon QuickSight User Guide and Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).

Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.

Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without having to format data or manage infrastructure. For example, Athena is useful if you want to run a quick query on web logs to troubleshoot a performance issue on your site. With Athena, you can get started fast: you just define a table for your data and start querying using standard SQL.

Use Amazon Athena when you want to run interactive ad hoc SQL queries against data in Amazon S3 without setting up or managing any servers or clusters.

For a list of AWS services that Athena leverages or integrates with, see the section called “AWS service integrations with Athena” (p. 4).

Amazon EMR

Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.

In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code. You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with the latest big data processing frameworks such as Spark, Hadoop, Presto, or HBase. Amazon EMR gives you full control over the configuration of your clusters and the software installed on them.

You can use Amazon Athena to query data that you process using Amazon EMR. Amazon Athena supports many of the same data formats as Amazon EMR. Athena's data catalog is Hive metastore compatible. If you use EMR and already have a Hive metastore, you can run your DDL statements on Amazon Athena and query your data immediately without affecting your Amazon EMR jobs.

Amazon Redshift

A data warehouse like Amazon Redshift is your best choice when you need to pull together data from many different sources – like inventory systems, financial systems, and retail sales systems – into a common format, and store it for long periods of time. If you want to build sophisticated business reports from historical data, then a data warehouse like Amazon Redshift is the best choice. The query engine in Amazon Redshift has been optimized to perform especially well on running complex queries that join large numbers of very large database tables. When you need to run queries against highly structured data with lots of joins across lots of very large tables, choose Amazon Redshift.

For more information about when to use Athena, consult the following resources:

• When to use Athena vs other big data services

• Amazon Athena Overview

• Amazon Athena Features

• Amazon Athena FAQs

• Amazon Athena Blog posts

Accessing Athena

You can access Athena using the AWS Management Console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.

• To get started with the console, see Getting Started (p. 9).

• To learn how to use JDBC or ODBC drivers, see Connecting to Amazon Athena with JDBC (p. 79) and Connecting to Amazon Athena with ODBC (p. 81).

• To use the Athena API, see the Amazon Athena API Reference.

• To use the CLI, install the AWS CLI and then type aws athena help from the command line to see available commands. For information about available commands, see the Amazon Athena command line reference.

• To use the AWS SDK for Java 2.x, see the Athena section of the AWS SDK for Java 2.x API Reference, the Athena Java V2 Examples on GitHub.com, and the AWS SDK for Java 2.x Developer Guide.

• To use the AWS SDK for .NET, see the Amazon.Athena namespace in the AWS SDK for .NET API Reference, the .NET Athena examples on GitHub.com, and the AWS SDK for .NET Developer Guide.

• To use AWS Tools for Windows PowerShell, see the AWS Tools for PowerShell - Amazon Athena cmdlet reference, the AWS Tools for PowerShell portal page, and the AWS Tools for Windows PowerShell User Guide.

• For information about Athena service endpoints that you can connect to programmatically, see Amazon Athena endpoints and quotas in the Amazon Web Services General Reference.

Understanding Tables, Databases, and the Data Catalog

In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.

For each dataset that you'd like to query, Athena must have an underlying table that it uses for obtaining and returning query results. Therefore, before querying data, you must register a table in Athena. Registration occurs when you create a table, either automatically or manually. Regardless of how the table is created, the table creation process registers the dataset with Athena. This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.

• To create a table automatically, use an AWS Glue crawler from within Athena. For more information about AWS Glue and crawlers, see Integration with AWS Glue (p. 20). When AWS Glue creates a table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store and retrieve this metadata, using it when you run queries to analyze the underlying dataset.

After you create a table, you can use SQL SELECT (p. 500) statements to query it, including getting specific file locations for your source data (p. 504). Your query results are stored in Amazon S3 in the query result location that you specify (p. 179).

The AWS Glue Data Catalog is accessible throughout your Amazon Web Services account. Other AWS services can share the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and extract, transform, and load (ETL) data.

• To create a table manually:

• Use the Athena console to run the Create Table Wizard.

• Use the Athena console to write Hive DDL statements in the Query Editor.

• Use the Athena API or CLI to run a SQL query string with DDL statements.

• Use the Athena JDBC or ODBC driver.

When you create tables and databases manually, Athena uses HiveQL data definition language (DDL) statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create tables and databases in the AWS Glue Data Catalog.

Note: If you have tables in Athena created before August 14, 2017, they were created in an Athena-managed internal data catalog that exists side-by-side with the AWS Glue Data Catalog until you choose to update. For more information, see Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 31).

When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL engine. We have examples with sample data within Athena to show you how to create a table and then issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started creating a table based on data that is stored in Amazon S3.

• For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see Getting Started (p. 9).

• Run the Athena tutorial in the console. This launches automatically if you log in to https://console.aws.amazon.com/athena/ for the first time. You can also choose Tutorial in the console to launch it.

AWS service integrations with Athena

You can use Athena to query data from the AWS services listed in this section. To see the Regions that each service supports, see Regions and Endpoints in the Amazon Web Services General Reference.

AWS services integrated with Athena

• AWS CloudFormation

• Amazon CloudFront

• AWS CloudTrail

• Elastic Load Balancing

• AWS Glue Data Catalog

• AWS Identity and Access Management (IAM)

• Amazon QuickSight

• Amazon S3 Inventory

• AWS Step Functions

• AWS Systems Manager Inventory

• Amazon Virtual Private Cloud

For information about each integration, see the following sections.

AWS CloudFormation

Data Catalog

Reference topic: AWS::Athena::DataCatalog in the AWS CloudFormation User Guide

Specify an Athena data catalog, including a name, description, type, parameters, and tags. For more information, see DataCatalog in the Amazon Athena API Reference.

Named Query

Reference topic: AWS::Athena::NamedQuery in the AWS CloudFormation User Guide

Specify named queries with AWS CloudFormation and run them in Athena. Named queries allow you to map a query name to a query and then run it as a saved query from the Athena console.

For information, see CreateNamedQuery in the Amazon Athena API Reference.

Workgroup

Reference topic: AWS::Athena::WorkGroup in the AWS CloudFormation User Guide

Specify Athena workgroups using AWS CloudFormation. Use Athena workgroups to isolate queries for you or your group from other queries in the same account. For more information, see Using Workgroups to Control Query Access and Costs (p. 447) in the Amazon Athena User Guide and CreateWorkGroup in the Amazon Athena API Reference.

Amazon CloudFront

Reference topic: Querying Amazon CloudFront Logs (p. 276)

Use Athena to query Amazon CloudFront logs. For more information about using CloudFront, see the Amazon CloudFront Developer Guide.

AWS CloudTrail

Reference topic: Querying AWS CloudTrail Logs (p. 278)

Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For example, you can use queries to identify trends and further isolate activity by attribute, such as source IP address or user. You can create tables for querying logs directly from the CloudTrail console, and use those tables to run queries in Athena. For more information, see Creating a Table for CloudTrail Logs in the CloudTrail Console (p. 279).

Elastic Load Balancing

Reference topic: Querying Application Load Balancer Logs (p. 273)

Querying Application Load Balancer logs allows you to see the source of traffic, latency, and bytes transferred to and from Elastic Load Balancing instances and backend applications. For more information, see Creating the Table for ALB Logs (p. 273).

Reference topic: Querying Classic Load Balancer Logs (p. 275)

Query Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes transferred. For more information, see Creating the Table for ELB Logs (p. 275).

AWS Glue Data Catalog

Reference topic: Integration with AWS Glue (p. 20)

Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 20) and What is AWS Glue in the AWS Glue Developer Guide.

AWS Identity and Access Management (IAM)

Reference topic: Actions for Amazon Athena

You can use Athena API actions in IAM permission policies. For more information, see Actions for Amazon Athena and Identity and Access Management in Athena (p. 350).

Amazon QuickSight

Reference topic: Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)

Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information about Amazon QuickSight, see What is Amazon QuickSight in the Amazon QuickSight User Guide. For information about using JDBC and ODBC drivers with Athena, see Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79).

Amazon S3 Inventory

Reference topic: Querying inventory with Athena in the Amazon Simple Storage Service User Guide

You can use Amazon Athena to query Amazon S3 inventory using standard SQL. You can use Amazon S3 inventory to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. For more information, see Amazon S3 inventory in the Amazon Simple Storage Service User Guide.

AWS Step Functions

Reference topic: Call Athena with Step Functions in the AWS Step Functions Developer Guide

Call Athena with AWS Step Functions. AWS Step Functions can control select AWS services directly using the Amazon States Language. You can use Step Functions with Athena to start and stop query execution, get query results, run ad-hoc or scheduled data queries, and retrieve results from data lakes in Amazon S3. For more information, see the AWS Step Functions Developer Guide.

Video: Orchestrate Amazon Athena Queries using AWS Step Functions

The following video demonstrates how to use Amazon Athena and AWS Step Functions to run a regularly scheduled Athena query and generate a corresponding report.

For an example that uses Step Functions and Amazon EventBridge to orchestrate AWS Glue DataBrew, Athena, and Amazon QuickSight, see Orchestrating an AWS Glue DataBrew job and Amazon Athena query with AWS Step Functions in the AWS Big Data Blog.

AWS Systems Manager Inventory

Reference topic: Querying inventory data from multiple Regions and accounts in the AWS Systems Manager User Guide

AWS Systems Manager Inventory integrates with Amazon Athena to help you query inventory data from multiple AWS Regions and accounts. For more information, see the AWS Systems Manager User Guide.

Amazon Virtual Private Cloud

Reference topic: Querying Amazon VPC Flow Logs (p. 296)

Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Query the logs in Athena to investigate network traffic patterns and identify threats and risks across your Amazon VPC network. For more information about Amazon VPC, see the Amazon VPC User Guide.

Setting Up

If you've already signed up for Amazon Web Services, you can start using Amazon Athena immediately.

If you haven't signed up for AWS or need assistance getting started, be sure to complete the following tasks:

1. Sign up for an AWS account

2. Create an IAM administrator user and group

3. Attach managed policies for Athena

4. Sign in as an IAM user

1. Sign up for an AWS account

When you sign up for AWS, your account is automatically signed up for all services in AWS, including Athena. You are charged only for the services that you use. For pricing information, see Amazon Athena pricing.

If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account

1. Open http://aws.amazon.com/, and then choose Create an AWS account.

2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad.

3. Note your AWS account number, because you need it for the next task.

2. Create an IAM administrator user and group

An AWS Identity and Access Management (IAM) user is an account that you create to access services. It is a different user than your main AWS account. As a security best practice, we recommend that you use the IAM user's credentials to access AWS services. You use the IAM console to create an administrator IAM user and an Administrators group for the user. You can then access the console for Athena and other AWS services by accessing a special link and providing the credentials for the IAM user that you created.

For steps, see Creating an administrator IAM user and user group (console) in the IAM User Guide.

3. Attach managed policies for Athena

After you have created an IAM user, you must attach some Athena managed policies to the user so that the user can access Athena. There are two managed policies for Athena: AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. These policies grant permissions to Athena to query Amazon S3 and to write the results of your queries to a separate bucket on your behalf. To see the contents of these policies for Athena, see AWS managed policies for Amazon Athena (p. 350).

For steps to attach the Athena managed policies, follow Adding IAM Identity Permissions (Console) in the IAM User Guide and add the AmazonAthenaFullAccess and AWSQuicksightAthenaAccess managed policies to the IAM administrator user that you created.

Note: You may need additional permissions to access the underlying dataset in Amazon S3. If you are not the account owner or otherwise have restricted access to a bucket, contact the bucket owner to grant access using a resource-based bucket policy, or contact your account administrator to grant access using an identity-based policy. For more information, see Amazon S3 Permissions (p. 356). If the dataset or Athena query results are encrypted, you may need additional permissions. For more information, see Configuring Encryption Options (p. 341).

4. Sign in as an IAM user

To sign in as the new IAM user that you created, you can use the custom sign-in URL for the IAM users of your account. To see the sign-in URL for the IAM users for your account, open the IAM console and choose Users, user_name, Security credentials, Console sign-in link. As a convenience, you can use the clipboard icon to copy the sign-in URL to the clipboard.

For more information about signing in as an IAM user, see How IAM users sign in to your AWS account in the IAM User Guide.

Getting Started

This tutorial walks you through using Amazon Athena to query data. You'll create a table based on sample data stored in Amazon Simple Storage Service, query the table, and check the results of the query.

The tutorial uses live resources, so you are charged for the queries that you run. You aren't charged for the sample data in the location that this tutorial uses, but if you upload your own data files to Amazon S3, charges do apply.

Prerequisites

• If you have not already done so, sign up for an account in Setting Up (p. 7).

• Using the same AWS Region (for example, US West (Oregon)) and account that you are using for Athena, create a bucket in Amazon S3 to hold your Athena query results.

Step 1: Create a Database

You first need to create a database in Athena.

To create an Athena database

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. If this is your first time visiting the Athena console in your current AWS Region, choose Explore the query editor to open the query editor. Otherwise, Athena opens in the query editor.

3. Choose View Settings to set up a query result location in Amazon S3.

4. On the Settings tab, choose Manage.

5. For Manage settings, do one of the following:

• In the Location of query result box, enter the path to the bucket that you created in Amazon S3 for your query results. Prefix the path with s3://.

• Choose Browse S3, choose the Amazon S3 bucket that you created for your current Region, and then choose Choose.

6. Choose Save.

7. Choose Editor to switch to the query editor.

8. On the right of the navigation pane, you can use the Athena query editor to enter and run queries and statements.

9. To create a database named mydatabase, enter the following CREATE DATABASE statement.

CREATE DATABASE mydatabase

10. Choose Run or press Ctrl+ENTER.

11. From the Database list on the left, choose mydatabase to make it your current database.

Step 2: Create a Table

Now that you have a database, you can create an Athena table for it. The table that you create will be based on sample Amazon CloudFront log data in the location s3://athena-examples-myregion/cloudfront/plaintext/, where myregion is your current AWS Region.

The sample log data is in tab-separated values (TSV) format, which means that a tab character is used as a delimiter to separate the fields. The data looks like the following example. For readability, the tabs in the excerpt have been converted to spaces and the final field shortened.

2014-07-05 20:00:09 DFW3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0[...]

2014-07-05 20:00:09 DFW3 4252 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-2.jpeg 200 - Mozilla/5.0[...]

2014-07-05 20:00:10 AMS1 4261 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-3.jpeg 200 - Mozilla/5.0[...]

To enable Athena to read this data, you could run a CREATE EXTERNAL TABLE statement like the following. The statement that creates the table defines columns that map to the data, specifies how the data is delimited, and specifies the Amazon S3 location that contains the sample data.

Note: For the LOCATION clause, specify an Amazon S3 folder location, not a specific file. Athena scans all of the files in the folder that you specify.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  ClientInfo STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';

The example creates a table called cloudfront_logs and specifies a name and data type for each field. These fields become the columns in the table. Because date is a reserved word (p. 105), it is escaped with backtick (`) characters. ROW FORMAT DELIMITED means that Athena will use a default library called LazySimpleSerDe (p. 167) to do the actual work of parsing the data. The example also specifies that the fields are tab separated (FIELDS TERMINATED BY '\t') and that each record in the file ends in a newline character (LINES TERMINATED BY '\n'). Finally, the LOCATION clause specifies the path in Amazon S3 where the actual data to be read is located. If you have your own tab or comma-separated data, you can use a CREATE TABLE statement like this.
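To make the delimiter settings concrete, the following Python sketch splits one record the way the clauses describe: a record ends at a newline, and its fields are separated by tabs. It only illustrates the data layout and is not how LazySimpleSerDe is implemented. The COLUMNS list and the parse_record helper are hypothetical names introduced here, and the sample record is the first line of the excerpt with its tabs restored and the final field left shortened.

```python
# Column names taken from the CREATE EXTERNAL TABLE statement above.
COLUMNS = [
    "date", "time", "location", "bytes", "requestip", "method",
    "host", "uri", "status", "referrer", "clientinfo",
]


def parse_record(line: str) -> dict:
    """Split one tab-delimited record into a column-name -> value mapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# First record of the excerpt with tabs restored; the final field stays
# shortened exactly as it appears in the excerpt.
sample = "\t".join([
    "2014-07-05", "20:00:09", "DFW3", "4260", "10.0.0.15", "GET",
    "eabcd12345678.cloudfront.net", "/test-image-1.jpeg", "200", "-",
    "Mozilla/5.0[...]",
]) + "\n"

record = parse_record(sample)
print(record["location"], record["status"])
```

Every value comes back as a string in this sketch; in Athena, the declared column types (for example, INT for Bytes) determine how each field is interpreted.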

Returning to the sample data, here is a full example of the final field ClientInfo:

Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9

As you can see, this field is multivalued. If the CREATE TABLE statement specifies tabs as field delimiters, the separate components inside this particular field can't be broken out into separate columns. To create columns from the values inside the field, you can use a regular expression (regex) that contains regex groups. The regex groups that you specify become separate table columns. To use a regex in your CREATE TABLE statement, use syntax like the following. This syntax instructs Athena to use the Regex SerDe (p. 155) library and the regular expression that you specify.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "regular_expression")

Regular expressions can be useful for creating tables from complex CSV or TSV data but can be difficult to write and maintain. Fortunately, there are other libraries that you can use for formats like JSON, Parquet, and ORC. For more information, see Supported SerDes and Data Formats (p. 151).
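As a rough illustration of how regex groups turn one field into several columns, the following Python sketch applies a simplified pattern to the ClientInfo value shown earlier. The pattern is an assumption: it is adapted from the last part of the tutorial's input.regex, uses Python's re module rather than the Java regex engine behind the Regex SerDe, and covers only this one field.

```python
import re

# Simplified pattern for the ClientInfo field only (an assumption for
# illustration). Each parenthesized group corresponds to one output column.
CLIENT_INFO = re.compile(r"[^(]+\(([^;]+).*%20([^/]+)/(.*)$")

client_info = (
    "Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;"
    "%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9"
)

match = CLIENT_INFO.match(client_info)
os_name, browser, browser_version = match.groups()
print(os_name, browser, browser_version)  # Android IE 3.0.9
```

The first group captures the text between the opening parenthesis and the first semicolon, and the last two groups capture the name and version that follow the final %20.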

Now you are ready to create the table in the Athena query editor. The CREATE TABLE statement and regex are provided for you.

To create a table in Athena

1. In the navigation pane, for Database, make sure that mydatabase is selected.

2. To give yourself more room in the query editor, you can choose the arrow icon to collapse the navigation pane.

3. To create a tab for a new query, choose the plus (+) sign in the query editor. You can have up to ten query tabs open at once.

4. To close one or more query tabs, choose the arrow next to the plus sign. To close all tabs at once, choose the arrow, and then choose Close all tabs.

5. In the query pane, enter the following CREATE EXTERNAL TABLE statement. The regex breaks out the operating system, browser, and browser version information from the ClientInfo field in the log data.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)
LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';

6. In the LOCATION statement, replace myregion with the AWS Region that you are currently using (for example, us-west-1).

7. Choose Run.

The table cloudfront_logs is created and appears under the list of Tables for the mydatabase database.
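If you generate the statement from a script rather than editing it by hand, the placeholder substitution described in step 6 can be sketched as follows. The sample_location helper is a hypothetical name, and us-west-1 is only the example Region code from that step.

```python
# Location from the tutorial; "myregion" is the placeholder to replace.
LOCATION_TEMPLATE = "s3://athena-examples-myregion/cloudfront/plaintext/"


def sample_location(region: str) -> str:
    """Return the sample-data location for the given AWS Region code."""
    if not region:
        raise ValueError("region must be a non-empty Region code, such as us-west-1")
    return LOCATION_TEMPLATE.replace("myregion", region)


print(sample_location("us-west-1"))
# s3://athena-examples-us-west-1/cloudfront/plaintext/
```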

Step 3: Query Data

Now that you have the cloudfront_logs table created in Athena based on the data in Amazon S3, you can run SQL queries on the table and see the results in Athena. For more information about using SQL in Athena, see SQL Reference for Amazon Athena (p. 498).

To run a query

1. Choose the plus (+) sign to open a new query tab and enter the following SQL statement in the query pane.

SELECT os, COUNT(*) count
FROM cloudfront_logs
WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05'
GROUP BY os

2. Choose Run.

The results look like the following:

3. To save the results of the query to a .csv file, choose Download results.

4. To view or run previous queries, choose the Recent queries tab.

5. To download the results of a previous query from the Recent queries tab, select the query, and then choose Download results. Queries are retained for 45 days.

For more information, see Working with Query Results, Recent Queries, and Output Files (p. 178).
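To see what the query in step 1 computes, the following Python sketch mirrors its filter and aggregation on a handful of rows. The rows are hypothetical stand-ins for the cloudfront_logs table, not actual sample data, and the sketch runs locally without calling Athena.

```python
from collections import Counter
from datetime import date

# Hypothetical rows standing in for cloudfront_logs: (date, os).
rows = [
    (date(2014, 7, 5), "Android"),
    (date(2014, 7, 20), "Windows"),
    (date(2014, 8, 1), "Android"),
    (date(2014, 8, 6), "Linux"),  # outside the range, so it is filtered out
]

start, end = date(2014, 7, 5), date(2014, 8, 5)

# WHERE date BETWEEN start AND end (BETWEEN includes both endpoints),
# followed by GROUP BY os with COUNT(*).
counts = Counter(os_name for day, os_name in rows if start <= day <= end)

for os_name, count in sorted(counts.items()):
    print(os_name, count)
```

Each distinct os value becomes one result row, paired with the number of log records in the date range that carry that value.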

Saving Your Queries

You can save the queries that you create or edit in the query editor with a name. Athena stores these queries on the Saved queries tab. You can use the Saved queries tab to recall, run, rename, or delete your saved queries. For more information, see Using saved queries (p. 195).

Connecting to Other Data Sources

This tutorial used a data source in Amazon S3 in tab-separated values (TSV) format. For information about using Athena with AWS Glue, see Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21). You can also connect Athena to a variety of data sources by using ODBC and JDBC drivers, external Hive metastores, and Athena data source connectors. For more information, see Connecting to Data Sources (p. 20).

Accessing Amazon Athena

You can access Amazon Athena using the AWS Management Console, the Amazon Athena API, or the AWS CLI.

Using the Console

You can use the AWS Management Console for Amazon Athena to do the following:

• Create or select a database.

• Create, view, and delete tables.

• Filter tables by starting to type their names.

• Preview tables and generate CREATE TABLE DDL for them.

• Show table properties.

• Run queries on tables, save and format queries, and view query history.

• Create up to ten queries using different query tabs in the query editor. To open a new tab, choose the plus (+) sign.

• Display query results, save, and export them.

• Access the AWS Glue Data Catalog.

• View and change settings, such as viewing the query result location, configuring auto-complete, and encrypting query results.

In the right pane, the Query Editor displays an introductory screen that prompts you to create your first table. You can view your tables under Tables in the left pane.

Here's a high-level overview of the actions available for each table:

• Preview tables – View the query syntax in the Query Editor on the right.

• Show properties – Show a table's name, its location in Amazon S3, input and output formats, the serialization (SerDe) library used, and whether the table has encrypted data.

• Delete table – Delete a table.

• Generate CREATE TABLE DDL – Generate the query behind a table and view it in the query editor.

Using the API

The Amazon Athena API enables you to work with Athena programmatically from your applications. For more information, see the Amazon Athena API Reference. The latest AWS SDKs include support for the Athena API.

For examples of using the AWS SDK for Java with Athena, see Code Samples (p. 581).

For AWS SDK for Java documentation and downloads, see the SDKs section in Tools for Amazon Web Services.
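
As a rough illustration of calling the Athena API through an SDK, the following Python sketch builds the parameters for a StartQueryExecution request and submits them through a client object. It assumes that the client is created with the AWS SDK for Python (Boto3), for example with boto3.client("athena"). The function names, the database name, the query, and the S3 output location are placeholders chosen for this example and do not come from this guide.

```python
def build_start_query_request(query, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(athena_client, request):
    """Submit the request and return the query execution ID.

    athena_client is assumed to be a Boto3 Athena client that was
    created elsewhere, for example with boto3.client("athena").
    """
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder values for illustration only.
request = build_start_query_request(
    query="SELECT 1",
    database="mydatabase",
    output_location="s3://amzn-s3-demo-bucket/athena-results/",
)
```

Keeping the request construction separate from the call makes it possible to inspect or log the parameters before anything is sent to the service.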

Using the CLI

You can access Amazon Athena using the AWS CLI. For more information, see the AWS CLI Reference for Athena.

Connecting to Data Sources

You can use Amazon Athena to query data stored in different locations and formats in a dataset. This dataset might be in CSV, JSON, Avro, Parquet, or some other format.

The tables and databases that you work with in Athena to run queries are based on metadata. Metadata is data about the underlying data in your dataset. How that metadata describes your dataset is called the schema. For example, a table name, the column names in the table, and the data type of each column are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog that describes it is called a data source.

The relationship of metadata to an underlying dataset depends on the type of data source that you work with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata with the dataset. In these systems, the metadata is most often written when the data is written. Other data sources, like those built using Hive, allow you to define metadata on the fly when you read the dataset. The dataset can be in a variety of formats, for example CSV, JSON, Parquet, or Avro.

Athena natively supports the AWS Glue Data Catalog. The AWS Glue Data Catalog is a data catalog built on top of other datasets and data sources such as Amazon S3, Amazon Redshift, and Amazon DynamoDB. You can also connect Athena to other data sources by using a variety of connectors.

Topics

• Integration with AWS Glue (p. 20)

• Using Athena Data Connector for External Hive Metastore (p. 34)

• Using Amazon Athena Federated Query (p. 56)

• IAM Policies for Accessing Data Catalogs (p. 74)

• Managing Data Sources (p. 78)

• Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 79)

Integration with AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) AWS service. One of its key abilities is to analyze and categorize data. You can use AWS Glue crawlers to automatically infer database and table schema from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.

Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your Amazon Web Services account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

To create database and table schema in the AWS Glue Data Catalog, you can run an AWS Glue crawler from within Athena on a data source, or you can run Data Definition Language (DDL) queries directly in the Athena Query Editor. Then, using the database and table schema that you created, you can use Data Manipulation Language (DML) queries in Athena to query the data.

You can register an AWS Glue Data Catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue, you can use Athena to run cross-account queries. For more information, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).

For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate charges for AWS Glue? (p. 33) For more information about the benefits of using AWS Glue with Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 33)

Topics

• Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 21)

• Registering an AWS Glue Data Catalog from Another Account (p. 22)

• Best Practices When Using Athena with AWS Glue (p. 23)

• Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 31)

• FAQ: Upgrading to the AWS Glue Data Catalog (p. 32)

Using AWS Glue to Connect to Data Sources in Amazon S3

Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.

To define schema information for AWS Glue to use, you can create an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.

Creating an AWS Glue Crawler

You can create a crawler by starting in the Athena console and then using the AWS Glue console in an integrated way. When you create the crawler, you specify a data location in Amazon S3 to crawl.

To create a crawler in AWS Glue starting from the Athena console

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. In the query editor, next to Tables and views, choose Create, and then choose AWS Glue crawler.

3. On the AWS Glue console Add crawler page, follow the steps to create a crawler. For more information, see Using AWS Glue Crawlers (p. 24) in this guide and Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Note

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Adding a Table Using a Form

The following procedure shows you how to use the Athena console to add a table using the Create Table From S3 bucket data form.

To add a table and enter schema information using a form

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. In the query editor, next to Tables and views, choose Create, and then choose S3 bucket data.

3. On the Create Table From S3 bucket data form, for Table name, enter a name for the table.

4. For Database configuration, choose an existing database, or create a new one.

5. For Location of Input Data Set, specify the path in Amazon S3 to the folder that contains the dataset that you want to process.

6. For Data Format, choose a data format (Apache Web Logs, CSV, TSV, Text File with Custom Delimiters, JSON, Parquet, or ORC).

• For the Apache Web Logs option, you must also enter a regular expression in the Regex box.

• For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column delimiter). Optionally, you can specify a Collection terminator for array types or a Map key terminator.

7. For Column details, specify a column name and the column data type.

• To add more columns one at a time, choose Add a column.

• To quickly add more columns, choose Bulk add columns. In the text box, enter a comma-separated list of columns in the format column_name data_type, column_name data_type[, …], and then choose Add.

8. (Optional) For Partition details, add one or more column names and data types.

9. The Preview table query box shows the CREATE TABLE statement generated by the information that you entered into the form. The preview statement cannot be edited directly. To change the statement, modify the fields in the form, or create the statement directly (p. 103) in the query editor instead of using the form.

10. Choose Create table to run the generated statement in the query editor and create the table.
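
To make the relationship between the form fields and the generated statement more concrete, the following Python sketch parses a bulk column list in the column_name data_type format and assembles a CREATE EXTERNAL TABLE statement for comma-delimited text data. It is an illustration only: the function names, table name, and S3 location are hypothetical, and the statement that the console actually generates may differ in its details.

```python
def parse_bulk_columns(text):
    """Parse 'name type, name type' into a list of (name, type) pairs.

    Simplification: the text is split on commas, so data types that
    themselves contain commas (for example decimal(10,2)) are not handled.
    """
    columns = []
    for item in text.split(","):
        parts = item.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'column_name data_type', got {item!r}")
        columns.append((parts[0], parts[1]))
    return columns


def build_create_table(database, table, columns, location, partitions=()):
    """Return a CREATE EXTERNAL TABLE statement for comma-delimited text."""
    col_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n  {col_sql}\n)"
    ]
    if partitions:
        part_sql = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions)
        lines.append(f"PARTITIONED BY ({part_sql})")
    lines.append("ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


# Hypothetical inputs that mirror the form fields.
columns = parse_bulk_columns("request_id string, status int")
ddl = build_create_table(
    "mydatabase", "requests", columns, "s3://amzn-s3-demo-bucket/requests/"
)
print(ddl)
```

A statement built this way could be pasted into the query editor, which corresponds to creating the statement directly instead of using the form.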

Registering an AWS Glue Data Catalog from Another Account

You can use Athena's cross-account AWS Glue catalog feature to register an AWS Glue catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue and register the catalog as an Athena DataCatalog resource, you can use Athena to run cross-account queries. For information about configuring the required permissions, see Cross-Account Access to AWS Glue Data Catalogs (p. 368).

The following procedure shows you how to use the Athena console to configure an AWS Glue Data Catalog in an Amazon Web Services account other than your own as a data source.

To register an AWS Glue Data Catalog from another account

1. Follow the steps in Cross-Account Access to AWS Glue Data Catalogs (p. 368) to ensure that you have permissions to query the data catalog in the other account.

2. Open the Athena console at https://console.aws.amazon.com/athena/.

3. If the console navigation pane is not visible, choose the expansion menu on the left.

4. Choose Data sources.

5. On the upper right, choose Connect data source.

6. In the AWS Glue Data Catalog section, for Choose an AWS Glue Data Catalog, choose AWS Glue Data Catalog in another account.

7. For Data source details, enter the following information:

• Data source name – Enter the name that you want to use in your SQL queries to refer to the data catalog in the other account.

• Description – (Optional) Enter a description of the data catalog in the other account.

• Catalog ID – Enter the 12-digit Amazon Web Services account ID of the account to which the data catalog belongs. The Amazon Web Services account ID is the catalog ID.

8. (Optional) For Tags, enter key-value pairs that you want to associate with the data source. For more information about tags, see Tagging Athena Resources (p. 475).

9. Choose Register. On the Data sources page, the data catalog that you entered is listed in the Catalog name column.

10. To view or edit information about the new data catalog, choose the catalog, and then choose Edit.

11. To delete the new data catalog, choose the catalog, and then choose Delete.
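
The same registration can also be expressed through the Athena CreateDataCatalog API. The following Python sketch shows one way to build such a request, assuming a Boto3 Athena client created elsewhere with boto3.client("athena"). The helper names, catalog name, account ID, and tag are hypothetical values chosen for this example.

```python
import re


def build_glue_catalog_request(name, account_id, description="", tags=None):
    """Build keyword arguments for the Athena CreateDataCatalog API.

    The 12-digit account ID of the other account is passed as the
    catalog ID, matching the Catalog ID field in the console procedure.
    """
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    request = {
        "Name": name,
        "Type": "GLUE",
        "Parameters": {"catalog-id": account_id},
    }
    if description:
        request["Description"] = description
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def register_catalog(athena_client, request):
    """Register the catalog; athena_client is assumed to be a Boto3 client."""
    return athena_client.create_data_catalog(**request)


# Hypothetical values for illustration only.
request = build_glue_catalog_request(
    name="shared_catalog",
    account_id="111122223333",
    description="Glue Data Catalog owned by another account",
    tags={"team": "analytics"},
)
```

After registration, the value passed as the name is what a query would use to refer to the catalog in the other account.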

Best Practices When Using Athena with AWS Glue

When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in AWS Glue and related services. This topic provides considerations and best practices when using either method.

Under the hood, Athena uses Presto to process DML statements and Hive to process the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and AWS Glue work well together.

In this topic

• Database, Table, and Column Names (p. 23)

• Using AWS Glue Crawlers (p. 24)

• Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync (p. 24)

• Using Multiple Data Sources with Crawlers (p. 24)

• Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH" (p. 25)

• Updating Table Metadata (p. 26)

• Working with CSV Files (p. 26)

• CSV Data Enclosed in Quotes (p. 26)

• CSV Files with Headers (p. 28)

• AWS Glue Partition Indexing and Filtering (p. 28)

• Working with Geospatial Data (p. 29)

• Using AWS Glue Jobs for ETL with Athena (p. 29)

• Creating Tables Using Athena for AWS Glue ETL Jobs (p. 29)

• Using ETL Jobs to Optimize Query Performance (p. 30)

• Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC (p. 31)

• Automating AWS Glue Jobs for ETL (p. 31)

Database, Table, and Column Names

When you create schema in AWS Glue to query in Athena, consider the following:

• A database name cannot be longer than 255 characters.

• A table name cannot be longer than 255 characters.

• A column name cannot be longer than 255 characters.

• The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.

For more information, see Databases and Tables in the AWS Glue Developer Guide.
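
The naming rules above can be checked before you create schema. The following Python sketch is a minimal validator based only on the rules listed in this section (at most 255 characters; lowercase letters, numbers, and underscores only). The function and constant names are hypothetical.

```python
import re

# Rules taken from the list above: lowercase letters, numbers, underscore.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")
MAX_NAME_LENGTH = 255


def is_valid_name(name):
    """Return True if name satisfies the length and character rules."""
    return (
        len(name) <= MAX_NAME_LENGTH
        and _NAME_PATTERN.fullmatch(name) is not None
    )


for candidate in ["web_logs_2021", "WebLogs", "my-database"]:
    print(candidate, is_valid_name(candidate))
```

The same check applies to database, table, and column names, because the rules listed above are identical for all three.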

Note

If you use an AWS::Glue::Database AWS CloudFormation template to create an AWS Glue database and do not specify a database name, AWS Glue automatically generates a database name in the format resource_name–random_string that is not compatible with Athena.

You can use the AWS Glue Catalog Manager to rename columns, but not table names or database names.

To change a database name, you must create a new database and copy tables from the old database to it (in other words, copy the metadata to a new entity). You can follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
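
As one possible way to carry out the copy with the AWS SDK, the following Python sketch creates a new database and re-creates each table definition in it. It assumes a Boto3 AWS Glue client created elsewhere with boto3.client("glue"); the function names are hypothetical. The sketch copies table definitions only. Partition metadata is not copied and would need separate handling.

```python
# Keys that the Glue CreateTable API accepts in TableInput. GetTables also
# returns read-only fields (for example CreateTime and DatabaseName), which
# are dropped before the definition is re-created.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def to_table_input(table):
    """Reduce a table returned by GetTables to a valid TableInput."""
    return {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}


def copy_database(glue_client, source_db, target_db):
    """Create target_db and copy every table definition from source_db.

    Returns the names of the tables that were copied. The source
    database is left unchanged.
    """
    glue_client.create_database(DatabaseInput={"Name": target_db})
    copied = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=source_db):
        for table in page["TableList"]:
            glue_client.create_table(
                DatabaseName=target_db,
                TableInput=to_table_input(table),
            )
            copied.append(table["Name"])
    return copied
```

Because only metadata is copied, the underlying data in Amazon S3 stays where it is and both databases point to the same locations until the old one is removed.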

Using AWS Glue Crawlers

AWS Glue crawlers help discover the schema for datasets and register them as tables in the AWS Glue Data Catalog. The crawlers go through your data and determine the schema. In addition, the crawler can detect and register partitions. For more information, see Defining Crawlers in the AWS Glue Developer Guide. Tables from data that were successfully crawled can be queried from Athena.

Note

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync

AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time-Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.

If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler to run on a schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive MSCK REPAIR TABLE command or manually run an ALTER TABLE ADD PARTITION command. For more information, see Table Partitions in the AWS Glue Developer Guide.
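
A scheduled crawler of this kind can also be defined through the AWS Glue CreateCrawler API. The following Python sketch builds such a request, assuming a Boto3 AWS Glue client created elsewhere with boto3.client("glue"). The helper names, crawler name, IAM role ARN, database name, S3 path, and schedule are hypothetical values chosen for this example.

```python
def build_crawler_request(name, role_arn, database, s3_paths, schedule=None):
    """Build keyword arguments for the AWS Glue CreateCrawler API.

    If schedule is omitted, the crawler is created without a schedule
    and runs only on demand.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }
    if schedule:
        request["Schedule"] = schedule
    return request


def create_crawler(glue_client, request):
    """Create the crawler; glue_client is assumed to be a Boto3 client."""
    return glue_client.create_crawler(**request)


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="daily-partition-crawler",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
    database="mydatabase",
    s3_paths=["s3://amzn-s3-demo-bucket/logs/"],
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
```

Choosing a schedule shortly after the time that new data normally arrives keeps the partition list current without a manual repair step.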

Using Multiple Data Sources with Crawlers

When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

The following partitions in Amazon S3 are an example:

s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
