Prerequisites
This tutorial assumes that you have an AWS account and access to AWS Glue.
Step 1: Add a crawler
Use these steps to configure and run a crawler that extracts the metadata from a CSV file stored in Amazon S3.
To create a crawler that reads files stored on Amazon S3
1. On the AWS Glue service console, on the left-side menu, choose Crawlers.
2. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the crawler details.
3. In the Crawler name field, enter Flights Data Crawler, and choose Next.
Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier for CSV by default.
4. For the crawler source type, choose Data stores and choose Next.
5. Now let's point the crawler to your data. On the Add a data store page, choose the Amazon S3 data store. This tutorial doesn't use a connection, so leave the Connection field blank if it's visible.
For the option Crawl data in, choose Specified path in another account. Then, for the Include path, enter the path where the crawler can find the flights data, which is s3://crawler-public-us-east-1/flight/2016/csv. After you enter the path, the title of this field changes to Include path. Choose Next.
6. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a single data store, so choose No, and then choose Next.
7. The crawler needs permissions to access the data store and create objects in the AWS Glue Data Catalog. To configure these permissions, choose Create an IAM role. The IAM role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter CrawlerTutorial, and then choose Next.
Note
To create an IAM role, your AWS user must have CreateRole, CreatePolicy, and AttachRolePolicy permissions.
The wizard creates an IAM role named AWSGlueServiceRole-CrawlerTutorial, attaches the AWS managed policy AWSGlueServiceRole to this role, and adds an inline policy that allows read access to the Amazon S3 location s3://crawler-public-us-east-1/flight/2016/csv.
8. Create a schedule for the crawler. For Frequency, choose Run on demand, and then choose Next.
9. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog.
First, choose Add database to create a database. In the pop-up window, enter test-flights-db for the database name, and then choose Create.
Next, enter flights for Prefix added to tables. Use the default values for the rest of the options, and choose Next.
10. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes.
After you have reviewed the information, choose Finish to create the crawler.
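The wizard choices above can also be expressed as a single request to the AWS Glue CreateCrawler API. The following is a minimal sketch, assuming the AWS SDK for Python (boto3), which this tutorial does not otherwise use. The helper names build_crawler_request and create_crawler are hypothetical, and the sketch assumes that the IAM role and the database already exist.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str = "Flights Data Crawler",
    role: str = "AWSGlueServiceRole-CrawlerTutorial",
    database: str = "test-flights-db",
    table_prefix: str = "flights",
    s3_path: str = "s3://crawler-public-us-east-1/flight/2016/csv",
) -> Dict[str, Any]:
    """Collect the wizard choices as keyword arguments for create_crawler."""
    return {
        "Name": name,
        "Role": role,  # role name or ARN; assumed to exist already
        "DatabaseName": database,  # assumed to exist already
        "TablePrefix": table_prefix,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key is set, so the crawler only runs on demand.
    }


def create_crawler(request: Dict[str, Any]) -> None:
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder works without boto3 installed

    boto3.client("glue").create_crawler(**request)
```

Calling build_crawler_request() with no arguments reproduces the values entered in this tutorial. Passing the result to create_crawler sends it to AWS Glue.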
Step 2: Run the crawler
After you create a crawler, the wizard takes you to the Crawlers page. Because you created the crawler with an on-demand schedule, you're given the option to run the crawler.
To run the crawler
1. The banner near the top of this page lets you know that the crawler was created, and asks if you want to run it now. Choose Run it now? to run the crawler.
The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the crawler starts running, the banner disappears, and the crawler display is updated to show a status of Starting for your crawler. After a minute, you can choose the Refresh icon to update the status of the crawler that is displayed in the table.
2. When the crawler completes, a new banner appears that describes the changes made by the crawler.
You can choose the test-flights-db link to view the Data Catalog objects.
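The same run-and-wait cycle can be scripted instead of refreshing the console. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its start_crawler and get_crawler calls. The helper names wait_until_ready and run_crawler and the polling defaults are hypothetical. The sketch treats the crawler as finished when its reported state returns to READY.

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    poll_seconds: float = 15.0,
    max_polls: int = 80,
) -> int:
    """Poll get_state until it returns "READY"; return the number of polls used."""
    for attempt in range(1, max_polls + 1):
        if get_state() == "READY":
            return attempt
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler not READY after {max_polls} polls")


def run_crawler(name: str = "Flights Data Crawler") -> None:
    """Start the crawler and block until it finishes. Requires boto3 and credentials."""
    import boto3  # imported here so wait_until_ready works without boto3 installed

    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    wait_until_ready(lambda: glue.get_crawler(Name=name)["Crawler"]["State"])
```

The state lookup is passed in as a function so that the polling loop can be exercised without an AWS account, for example with a stub that returns a fixed sequence of states.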
Step 3: View AWS Glue Data Catalog objects
The crawler reads data at the source location and creates tables in the Data Catalog. A table is the metadata definition that represents your data, including its schema. The tables in the Data Catalog do not contain data. Instead, you use these tables as a source or target in a job definition.
To view the Data Catalog objects created by the crawler
1. In the left-side navigation, under Data catalog, choose Databases. Here you can view the test-flights-db database that you created in the crawler wizard.
2. In the left-side navigation, under Data catalog and below Databases, choose Tables. Here you can view the flightscsv table created by the crawler. If you choose the table name, then you can view the table settings, parameters, and properties. Scrolling down in this view, you can view the schema, which is information about the columns and data types of the table.
3. If you choose View partitions on the table view page, you can see the partitions created for the data. The first column is the partition key.
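The schema and partition keys shown in the console can also be read from the Data Catalog programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and the table structure returned by its get_table call. The helper names describe_table and fetch_table are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def describe_table(table: Dict[str, Any]) -> Dict[str, List[Tuple[str, str]]]:
    """Extract (name, type) pairs for columns and partition keys from a table definition."""
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    return {
        "columns": [(c["Name"], c.get("Type", "")) for c in columns],
        "partition_keys": [(k["Name"], k.get("Type", "")) for k in partition_keys],
    }


def fetch_table(
    database: str = "test-flights-db", name: str = "flightscsv"
) -> Dict[str, Any]:
    """Read one table definition from the Data Catalog. Requires boto3 and credentials."""
    import boto3  # imported here so describe_table works without boto3 installed

    return boto3.client("glue").get_table(DatabaseName=database, Name=name)["Table"]
```

With credentials configured, describe_table(fetch_table()) returns the column and partition key lists for the table created in this tutorial. As in the console, the result contains only metadata, not the flight data itself.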
Document history for AWS Glue Studio User Guide
Latest documentation update: October 11, 2021
The following table describes the important changes in each revision of the AWS Glue Studio User Guide.
For notification about updates to this documentation, you can subscribe to an RSS feed.
Change: Glue Studio is now available in China
Description: AWS Glue Studio is now available in the China (Beijing) and China (Ningxia) Regions.
Date: October 11, 2021

Change: Direct access to streaming sources now available
Description: When adding data sources to your ETL job in the visual editor, you can supply information to access the data stream instead of having to use a Data Catalog database and table.

Change: Glue version 3.0
Description: When creating jobs in AWS Glue Studio, you can choose Glue 3.0 as the version for your job in the Job details tab. If you do not choose a version for your ETL job, Glue 2.0 is used by default.
Date: August 18, 2021

Change: AWS GovCloud (US) Region
Description: AWS Glue Studio is now available in the AWS GovCloud (US) Region.
Date: August 18, 2021

Change: Python shell authoring available in AWS Glue Studio
Description: When creating a new job, you can now choose to create a ...

Change: ... Studio
Description: In conjunction with the script editor feature, you can upload ...

Change: View your job's dataset while creating and editing jobs
Description: You can use the new Data preview tab for a node in your job diagram to see a sample of the data processed by that node.

Change: ... added
Description: If you want to access a data source located in your VPC, ...

Change: Edit job scripts
Description: You can now edit scripts in the job editor. For more information, see Editing a job script.
Date: May 24, 2021

Change: Delete jobs using the AWS Glue Studio console
Description: You can now delete jobs in AWS Glue Studio. To learn how, see Delete jobs.
Date: May 24, 2021

Change: Read data from files in child folders in Amazon S3
Description: You can specify a single folder in Amazon S3 as your data source and use the Recursive option to include all the child folders as part of the data source. For more information, see Using files in ...

Change: ... added
Description: You can use the FillMissingValues transform in AWS Glue Studio to ...

Change: SQL transform available
Description: You can use a SQL transform node to write your own transform in the form of a SQL query. For more information, see Using a SQL query to transform data.
Date: March 23, 2021

Change: JDBC source nodes now support job bookmark keys
Description: Job bookmarks help AWS Glue maintain state information and ...

Change: ... targets
Description: Using a custom or AWS Marketplace connector for your ...

Change: A new toolbar is available for the visual job editor
Description: A more streamlined and functional toolbar is available for the visual job editor of AWS Glue Studio. This feature makes it easier to add nodes to your graph.

Change: ...
Description: ... a table in the AWS Glue Data Catalog. For more information, ...

Change: ... AWS Glue Studio
Description: You can define a time-based schedule for your job runs in ...

Change: AWS Glue Custom Connectors released
Description: AWS Glue Custom Connectors allow you to discover and subscribe to connectors in AWS Marketplace. We also released AWS Glue Spark runtime interfaces to plug in connectors built for Apache Spark Datasource, Athena federated query, and JDBC APIs. For more information, see Using Connectors and connections with AWS Glue Studio.
Date: December 21, 2020

Change: Support for running streaming ETL jobs in AWS Glue version 2.0
Description: AWS Glue Studio now supports running streaming ETL jobs using AWS Glue version 2.0. For more information, see Adding Streaming ETL Jobs in AWS Glue in the AWS Glue Developer Guide.
Date: November 11, 2020

Change: Availability of AWS Glue Studio announced
Description: AWS Glue Studio provides a visual interface that simplifies the creation of jobs that prepare the data for analysis. The initial version of this guide was published on the same day AWS Glue Studio launched.
Date: September 23, 2020
AWS glossary
For the latest AWS terminology, see the AWS glossary in the AWS General Reference.