View AWS Glue Data Catalog objects

Prerequisites

This tutorial assumes that you have an AWS account and access to AWS Glue.

Step 1: Add a crawler

Use these steps to configure and run a crawler that extracts the metadata from a CSV file stored in Amazon S3.

To create a crawler that reads files stored on Amazon S3

1. On the AWS Glue service console, on the left-side menu, choose Crawlers.

2. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the crawler details.

3. In the Crawler name field, enter Flights Data Crawler, and choose Next.

Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier for CSV by default.

4. For the crawler source type, choose Data stores and choose Next.

5. Now let's point the crawler to your data. On the Add a data store page, choose the Amazon S3 data store. This tutorial doesn't use a connection, so leave the Connection field blank if it's visible.

For the option Crawl data in, choose Specified path in another account. Then, for the Include path, enter the path where the crawler can find the flights data, which is s3://crawler-public-us-east-1/flight/2016/csv. After you enter the path, the title of this field changes to Include path. Choose Next.

6. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a single data store, so choose No, and then choose Next.

7. The crawler needs permissions to access the data store and create objects in the AWS Glue Data Catalog. To configure these permissions, choose Create an IAM role. The IAM role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter CrawlerTutorial, and then choose Next.

Note

To create an IAM role, your AWS user must have CreateRole, CreatePolicy, and AttachRolePolicy permissions.

The wizard creates an IAM role named AWSGlueServiceRole-CrawlerTutorial, attaches the AWS managed policy AWSGlueServiceRole to this role, and adds an inline policy that allows read access to the Amazon S3 location s3://crawler-public-us-east-1/flight/2016/csv.

8. Create a schedule for the crawler. For Frequency, choose Run on demand, and then choose Next.

9. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog.

First, choose Add database to create a database. In the pop-up window, enter test-flights-db for the database name, and then choose Create.

Next, enter flights for Prefix added to tables. Use the default values for the rest of the options, and choose Next.

10. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes.

After you have reviewed the information, choose Finish to create the crawler.
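The wizard choices above correspond roughly to a single CreateCrawler request. The following is a minimal sketch using the AWS SDK for Python (boto3), which this tutorial does not otherwise use. The helper names build_crawler_request, create_flights_crawler, and main are hypothetical. The sketch assumes that the IAM role and the database already exist (the console wizard creates both for you) and that the role can be referenced by name.

```python
# Sketch only: the values mirror the wizard steps above; helper names are hypothetical.
CRAWLER_NAME = "Flights Data Crawler"
ROLE_NAME = "AWSGlueServiceRole-CrawlerTutorial"  # assumption: pass the role ARN if the name alone is not accepted
DATABASE_NAME = "test-flights-db"
TABLE_PREFIX = "flights"
S3_PATH = "s3://crawler-public-us-east-1/flight/2016/csv"


def build_crawler_request():
    """Return keyword arguments for the AWS Glue CreateCrawler operation."""
    return {
        "Name": CRAWLER_NAME,
        "Role": ROLE_NAME,
        "DatabaseName": DATABASE_NAME,
        "TablePrefix": TABLE_PREFIX,
        "Targets": {"S3Targets": [{"Path": S3_PATH}]},
        # No "Schedule" key is set, so the crawler runs on demand only.
    }


def create_flights_crawler(glue_client):
    """Create the crawler through any client that exposes create_crawler()."""
    request = build_crawler_request()
    glue_client.create_crawler(**request)
    return request["Name"]


def main():
    # Assumption: boto3 is installed and AWS credentials and a Region are configured.
    import boto3

    print(create_flights_crawler(boto3.client("glue")))
```

Passing the client in as a parameter keeps the request-building logic testable without an AWS account. Calling main() sends the real request.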

Step 2: Run the crawler

After you create a crawler, the wizard sends you to the Crawlers view page. Because you created the crawler with an on-demand schedule, you're given the option to run the crawler.

To run the crawler

1. The banner near the top of this page lets you know that the crawler was created, and asks if you want to run it now. Choose Run it now? to run the crawler.

The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the crawler starts running, the banner disappears, and the crawler display is updated to show a status of Starting for your crawler. After a minute, you can click the Refresh icon to update the status of the crawler that is displayed in the table.

2. When the crawler completes, a new banner appears that describes the changes made by the crawler.

You can choose the test-flights-db link to view the Data Catalog objects.
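Outside the console, the same run-and-refresh cycle can be approximated with the StartCrawler and GetCrawler operations. This is a hedged sketch, not part of the original tutorial: run_crawler_and_wait is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and the client is assumed to be a boto3 Glue client or any object with the same two methods.

```python
import time


def run_crawler_and_wait(glue_client, name, poll_seconds=15,
                         timeout_seconds=900, sleep=time.sleep):
    """Start the crawler, then poll until it returns to the READY state.

    Returns the status of the last crawl (for example "SUCCEEDED"),
    or None if the response carries no LastCrawl entry.
    """
    glue_client.start_crawler(Name=name)
    waited = 0
    while waited <= timeout_seconds:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl summarizes the most recent run, when one is recorded.
            return crawler.get("LastCrawl", {}).get("Status")
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Crawler {name!r} did not finish within {timeout_seconds}s")
```

For example, run_crawler_and_wait(boto3.client("glue"), "Flights Data Crawler") would block until the crawler is idle again, assuming boto3 and credentials are available.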

Step 3: View AWS Glue Data Catalog objects

The crawler reads data at the source location and creates tables in the Data Catalog. A table is the metadata definition that represents your data, including its schema. The tables in the Data Catalog do not contain data. Instead, you use these tables as a source or target in a job definition.
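To illustrate that a table is only a metadata definition, the sketch below reads one definition with the GetTable operation and extracts the data location, the column schema, and the partition keys. The helper name describe_table is hypothetical, and the client is assumed to be a boto3 Glue client or a compatible object. No rows of data are returned at any point.

```python
def describe_table(glue_client, database, table):
    """Return (location, columns, partition_keys) from a table's metadata."""
    definition = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    storage = definition.get("StorageDescriptor", {})
    # Each column entry carries a name and a data type, not data values.
    columns = [(c["Name"], c["Type"]) for c in storage.get("Columns", [])]
    partition_keys = [(k["Name"], k["Type"]) for k in definition.get("PartitionKeys", [])]
    # Location points at where the underlying data lives (here, Amazon S3).
    return storage.get("Location"), columns, partition_keys
```

For the crawler in this tutorial, a call such as describe_table(client, "test-flights-db", "flightscsv") would be the programmatic counterpart of opening the table in the console.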

To view the Data Catalog objects created by the crawler

1. In the left-side navigation, under Data catalog, choose Databases. Here you can view the test-flights-db database that you created while configuring the crawler.

2. In the left-side navigation, under Data catalog and below Databases, choose Tables. Here you can view the flightscsv table created by the crawler. If you choose the table name, then you can view the table settings, parameters, and properties. Scrolling down in this view, you can view the schema, which is information about the columns and data types of the table.

3. If you choose View partitions on the table view page, you can see the partitions created for the data. The first column is the partition key.
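The console views in these steps have rough API counterparts in the GetTables and GetPartitions operations. The following sketch is an illustration under assumptions: list_table_names and list_partition_values are hypothetical helpers, the client is assumed to be a boto3 Glue client or a compatible object, and pagination is handled by following NextToken by hand.

```python
def list_table_names(glue_client, database):
    """Return the names of all tables in a Data Catalog database."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(t["Name"] for t in page["TableList"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # request the next page of results


def list_partition_values(glue_client, database, table):
    """Return the partition key values recorded for each partition of a table."""
    values = []
    kwargs = {"DatabaseName": database, "TableName": table}
    while True:
        page = glue_client.get_partitions(**kwargs)
        values.extend(p["Values"] for p in page["Partitions"])
        token = page.get("NextToken")
        if not token:
            return values
        kwargs["NextToken"] = token
```

With the objects from this tutorial, list_table_names(client, "test-flights-db") would be expected to include flightscsv once the crawler has finished.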

Document history for AWS Glue Studio User Guide

Latest documentation update: October 11, 2021

The following table describes the important changes in each revision of the AWS Glue Studio User Guide.

For notification about updates to this documentation, you can subscribe to an RSS feed.

Change: Glue Studio is now available in China
Description: AWS Glue Studio is now available in the China Beijing and Ningxia regions.
Date: October 11, 2021

Change: Direct access to streaming sources now available
Description: When adding data sources to your ETL job in the visual editor, you can supply information to access the data stream instead of having to use a Data Catalog database and table.

Change: Glue version 3.0
Description: When creating jobs in AWS Glue Studio, you can choose Glue 3.0 as the version for your job in the Job details tab. If you do not choose a version for your ETL job, Glue 2.0 is used by default.
Date: August 18, 2021

Change: AWS GovCloud (US) Region
Description: AWS Glue Studio is now available in the AWS GovCloud (US) Region.
Date: August 18, 2021

Change: Python shell authoring available in AWS Glue Studio
Description: When creating a new job, you can now choose to create a ...

Change: ... Studio
Description: In conjunction with the script editor feature, you can upload ...

Change: View your job's dataset while creating and editing jobs
Description: You can use the new Data preview tab for a node in your job diagram to see a sample of the data processed by that node.

Change: ... added
Description: If you want to access a data source located in your VPC, ...

Change: Edit job scripts
Description: You can now edit scripts in the job editor. For more information, see Editing a job script.
Date: May 24, 2021

Change: Delete jobs using the AWS Glue Studio console
Description: You can now delete jobs in AWS Glue Studio. To learn how, see Delete jobs.
Date: May 24, 2021

Change: Read data from files in child folders in Amazon S3
Description: You can specify a single folder in Amazon S3 as your data source and use the Recursive option to include all the child folders as part of the data source. For more information, see Using files in ...

Change: ... added
Description: You can use the FillMissingValues transform in AWS Glue Studio to ...

Change: SQL transform available
Description: You can use a SQL transform node to write your own transform in the form of a SQL query. For more information, see Using a SQL query to transform data.
Date: March 23, 2021

Change: JDBC source nodes now support job bookmark keys
Description: Job bookmarks help AWS Glue maintain state information and ...

Change: ... targets
Description: Using a custom or AWS Marketplace connector for your ...

Change: A new toolbar is available for the visual job editor
Description: A more streamlined and functional toolbar is available for the visual job editor of AWS Glue Studio. This feature makes it easier to add nodes to your graph.

Description: ... a table in the AWS Glue Data Catalog. For more information, ...

Change: ... AWS Glue Studio
Description: You can define a time-based schedule for your job runs in ...

Change: AWS Glue Custom Connectors released
Description: AWS Glue Custom Connectors allow you to discover and subscribe to connectors in AWS Marketplace. We also released AWS Glue Spark runtime interfaces to plug in connectors built for Apache Spark Datasource, Athena federated query, and JDBC APIs. For more information, see Using Connectors and connections with AWS Glue Studio.
Date: December 21, 2020

Change: Support for running streaming ETL jobs in AWS Glue version 2.0
Description: AWS Glue Studio now supports running streaming ETL jobs using AWS Glue version 2.0. For more information, see Adding Streaming ETL Jobs in AWS Glue in the AWS Glue Developer Guide.
Date: November 11, 2020

Change: Availability of AWS Glue Studio announced
Description: AWS Glue Studio provides a visual interface that simplifies the creation of jobs that prepare the data for analysis. The initial version of this guide was published on the same day AWS Glue Studio launched.
Date: September 23, 2020

AWS glossary

For the latest AWS terminology, see the AWS glossary in the AWS General Reference.
