AWS Glue Studio
User Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
Using Notebooks (Preview) ... 1
Overview of using notebooks ... 1
Getting started with notebooks in AWS Glue Studio ... 2
Creating an ETL job using notebooks in AWS Glue Studio ... 2
Notebook editor components ... 3
Saving your notebook and job script ... 3
Managing notebook sessions ... 4
AWS Glue Visual Job API (Preview) ... 5
API design and CRUD APIs ... 5
Getting started ... 5
API design and CRUD APIs ... 9
SDK Onboarding ... 9
Appendix: Visual Job Examples and Model Definitions ... 9
Examples ... 9
Model Definitions ... 11
Detect PII (Preview) ... 22
Choosing how you want the data to be scanned ... 22
Choosing the PII entities to take action on ... 23
Choosing what to do with identified PII data ... 24
What is AWS Glue Studio? ... 25
Features of AWS Glue Studio ... 26
Visual job editor ... 26
Notebook interface for interactively developing and debugging job scripts ... 26
Job script code editor ... 27
Job performance dashboard ... 27
Support for dataset partitioning ... 27
When should I use AWS Glue Studio? ... 27
Accessing AWS Glue Studio ... 28
Pricing for AWS Glue Studio ... 28
Setting up ... 29
Complete initial AWS configuration tasks ... 29
Sign up for AWS ... 29
Create an IAM administrator user ... 29
Sign in as an IAM user ... 30
Review IAM permissions needed for the AWS Glue Studio user ... 31
AWS Glue service permissions ... 31
Creating Custom IAM Policies for AWS Glue Studio ... 31
Notebook and data preview permissions ... 33
Amazon CloudWatch permissions ... 33
Review IAM permissions needed for ETL jobs ... 34
Data source and data target permissions ... 34
Permissions required for deleting jobs ... 34
AWS Key Management Service permissions ... 34
Permissions required for using connectors ... 35
Set up IAM permissions for AWS Glue Studio ... 35
Create an IAM Role ... 35
Attach policies to the AWS Glue Studio user ... 36
Create a trust policy for roles not named "AWSGlueServiceRole*" ... 36
Configure a VPC for your ETL job ... 37
Populate the AWS Glue Data Catalog ... 38
Tutorial: Getting started ... 39
Prerequisites ... 39
Step 1: Start the job creation process ... 39
Step 2: Edit the data source node in the job diagram ... 40
Step 3: Edit the transform node of the job ... 41
Step 4: Edit the data target node of the job ... 41
Step 5: View the job script ... 42
Step 6: Specify the job details and save the job ... 42
Step 7: Run the job ... 43
Next steps ... 43
Creating jobs ... 44
Start the job creation process ... 44
Create jobs that use a connector ... 45
Next steps for creating a job in AWS Glue Studio ... 45
Editing jobs ... 47
Accessing the job diagram editor ... 47
Job editor features ... 47
Using schema previews in the visual job editor ... 48
Using data previews in the visual job editor ... 48
Restrictions when using data previews ... 49
Script code generation ... 49
Editing the data source node ... 50
Using Data Catalog tables for the data source ... 51
Using a connector for the data source ... 51
Using files in Amazon S3 for the data source ... 51
Using a streaming data source ... 53
Editing the data transform node ... 55
Using ApplyMapping to remap data property keys ... 55
Using SelectFields to remove most data property keys ... 56
Using DropFields to keep most data property keys ... 57
Renaming a field in the dataset ... 57
Using Spigot to sample your dataset ... 58
Joining datasets ... 58
Using SplitFields to split a dataset into two ... 60
Overview of SelectFromCollection transform ... 60
Using SelectFromCollection to choose which dataset to keep ... 61
Find and fill missing values in a dataset ... 62
Filtering keys within a dataset ... 62
Using DropNullFields to remove fields with null values ... 63
Using a SQL query to transform data ... 64
Using Aggregate to perform summary calculations on selected fields ... 66
Creating a custom transformation ... 68
Using Aggregate to perform summary calculations on selected fields ... 71
Configuring data target nodes ... 73
Overview of data target options ... 73
Editing the data target node ... 74
Editing or uploading a job script ... 76
Creating and editing Scala scripts in AWS Glue Studio ... 77
Creating and editing Python shell jobs in AWS Glue Studio ... 78
Adding nodes to the job diagram ... 79
Changing the parent nodes for a node in the job diagram ... 79
Deleting nodes from the job diagram ... 80
Using connectors and connections ... 81
Overview of using connectors and connections ... 81
Adding connectors to AWS Glue Studio ... 82
Subscribing to AWS Marketplace connectors ... 82
Creating custom connectors ... 83
Creating connections for connectors ... 85
Authoring jobs with custom connectors ... 85
Create jobs that use a connector for the data source ... 85
Configure source properties for nodes that use connectors ... 86
Configure target properties for nodes that use connectors ... 89
Managing connectors and connections ... 90
Viewing connector and connection details ... 90
Editing connectors and connections ... 91
Deleting connectors and connections ... 91
Cancel a subscription for a connector ... 92
Developing custom connectors ... 92
Developing Spark connectors ... 92
Developing Athena connectors ... 93
Developing JDBC connectors ... 93
Examples of using custom connectors with AWS Glue Studio ... 93
Developing AWS Glue connectors for AWS Marketplace ... 94
Restrictions for using connectors and connections in AWS Glue Studio ... 94
Tutorial: Using the open-source Elasticsearch Spark Connector ... 95
Prerequisites ... 95
Step 1: (Optional) Create an AWS secret for your OpenSearch cluster information ... 95
Next step ... 96
Step 2: Subscribe to the connector ... 96
Next step ... 97
Step 3: Activate the connector in AWS Glue Studio and create a connection ... 97
Next step ... 97
Step 4: Configure an IAM role for your ETL job ... 97
Next step ... 97
Step 5: Create a job that uses the OpenSearch connection ... 98
Next step ... 100
Step 6: Run the job ... 100
Monitoring jobs ... 101
Accessing the job monitoring dashboard ... 101
Overview of the job monitoring dashboard ... 101
Job runs view ... 101
Viewing the job run logs ... 103
Viewing the details of a job run ... 103
Viewing Amazon CloudWatch metrics for a job run ... 104
Managing jobs ... 106
Start a job run ... 106
Schedule job runs ... 106
Manage job schedules ... 107
Stop job runs ... 108
View your jobs ... 108
Customize the job display ... 108
View information for recent job runs ... 108
View the job script ... 109
Modify the job properties ... 109
Store Spark shuffle files on Amazon S3 ... 110
Save the job ... 111
Troubleshooting errors when saving a job ... 111
Clone a job ... 113
Delete jobs ... 113
Tutorial: Adding an AWS Glue crawler ... 114
Prerequisites ... 114
Step 1: Add a crawler ... 114
Step 2: Run the crawler ... 115
Step 3: View AWS Glue Data Catalog objects ... 115
Document history ... 117
AWS glossary ... 121
Using Notebooks with AWS Glue Studio and AWS Glue
Notebooks is in preview release for AWS Glue Studio and is subject to change.
Data engineers can author AWS Glue jobs faster and more easily than before using the new interactive notebook interface in AWS Glue Studio or interactive sessions in AWS Glue.
Topics
• Overview of using notebooks (p. 1)
• Getting started with notebooks in AWS Glue Studio (p. 2)
Overview of using notebooks
AWS Glue Studio allows you to interactively author jobs in a notebook interface based on Jupyter Notebooks. Through notebooks in AWS Glue Studio, you can edit job scripts and data integration code and view the output without having to run a full job. You can also add markdown and save notebooks as .ipynb files and job scripts. You can start a notebook without installing software locally or managing servers. When you are satisfied with your code, AWS Glue Studio can convert your notebook to an AWS Glue job with the click of a button.
Some benefits of using notebooks include:
• No cluster to provision or manage
• No idle clusters to pay for
• No up-front configuration required
• No installation of Jupyter notebooks required
• The same runtime/platform as AWS Glue ETL
When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds. AWS Glue Studio configures a Jupyter notebook with the AWS Glue Jupyter kernel. You don’t have to configure VPCs, network connections, or development endpoints to use this notebook.
To create jobs using the notebook interface:
• Configure the necessary IAM permissions.
• Start a notebook session to create a job.
• Write code in the cells in the notebook.
• Run and test the code to view the output.
• Save the job.
After your notebook is saved, it is a full AWS Glue job. You can manage all aspects of the job, such as scheduling job runs, setting job parameters, and viewing the job run history, right alongside your notebook.
Getting started with notebooks in AWS Glue Studio
When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds.
The following sections describe how to use AWS Glue Studio to create notebooks for ETL jobs.
Topics
• Creating an ETL job using notebooks in AWS Glue Studio (p. 2)
• Notebook editor components (p. 3)
• Saving your notebook and job script (p. 3)
• Managing notebook sessions (p. 4)
Creating an ETL job using notebooks in AWS Glue Studio
To start using notebooks in the AWS Glue Studio console
1. Attach AWS Identity and Access Management policies to the AWS Glue Studio user and create an IAM role for your ETL job and notebook, as instructed in Set up IAM permissions for AWS Glue Studio (p. 35).
2. Configure additional IAM security for notebooks, as described in
3. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.
4. Choose the Jobs link in the left-side navigation menu.
5. Choose Jupyter notebook and then choose Create to start a new notebook session.
6. On the Create job in Jupyter notebook page, provide the job name, the IAM role to use, and choose which programming language you want to use within the notebook. Choose Create job.
After a short time period, the notebook editor appears.
7. The notebook doesn't automatically run any code. To configure your session, use the %%configure magic to specify the notebook session parameters. For more information about magics, see the AWS Glue Developer Guide.
When the notebook first opens, it contains a single cell with an example %%configure command based on the information you provided on the Create job in Jupyter notebook page. You can modify this cell to customize the notebook session.
Run the cell to start a new notebook session and generate a session ID.
8. Add cells, and enter code or markdown text.
For information about writing code using a Jupyter notebook interface, see The Jupyter Notebook User Documentation.
9. To test your script, run the entire script, or individual cells. Any command output will be displayed in the area beneath the cell.
10. After you have finished developing your script, you can save the job and then run it. For more information about running jobs, see Start a job run (p. 106).
Notebook editor components
The notebook editor interface has the following main sections.
• Notebook interface (main panel) and toolbar
• Job editing tabs
The notebook editor
The AWS Glue Studio notebook editor is based on the Jupyter Notebook Application. The AWS Glue Studio notebook interface is similar to that provided by Jupyter Notebooks, which is described in the section Notebook user interface. The notebook used by interactive sessions is a Jupyter Notebook.
Although the AWS Glue Studio notebook is similar to Jupyter Notebooks, it differs in a few key ways:
• currently, the AWS Glue Studio notebook cannot install extensions
• you cannot use multiple tabs; there is a 1:1 relationship between a job and a notebook
• the AWS Glue Studio notebook does not have the same top file menu that exists in Jupyter Notebooks
• currently, the AWS Glue Studio notebook only runs with the AWS Glue kernel
AWS Glue Studio job editing tabs
The tabs that you use to interact with the ETL job are at the top of the notebook page. They are similar to tabs that appear in the visual job editor of AWS Glue Studio, and they perform the same actions.
• Notebook – Use this tab to view the job script using the notebook interface.
• Job details – Configure the environment and properties for the job runs.
• Runs – View information about previous runs of this job.
• Schedules – Configure a schedule for running your job at specific times.
Saving your notebook and job script
You can save your notebook and the job script you are creating at any time. Simply choose the Save button in the upper right corner, the same as if you were using the visual or script editor.
When you choose Save, the job script and notebook file are saved in the locations you specified.
• The job script is saved to the Amazon S3 location indicated by the job property Script path, in the Scripts folder.
• The notebook file (.ipynb) is saved to the Amazon S3 location indicated by the job property Script path, in the Notebooks folder.
When you save the job, the job script contains only the code cells from the notebook. The Markdown cells aren't included.
After you save the job, you can then run the job using the script that you created in the notebook.
Managing notebook sessions
Notebooks in AWS Glue Studio are based on the interactive sessions feature of AWS Glue. There is a cost for using interactive sessions. To help manage your costs, you can monitor the sessions created for your account, and configure the default settings for all sessions.
Change the default timeout for all notebook sessions
By default, the notebook (interactive) session in AWS Glue Studio times out after 1 hour.
To modify the default session timeout for notebooks in AWS Glue Studio
1. In the notebook, enter the %idle_timeout magic in a cell and specify the timeout value in minutes.
For example, %idle_timeout 15 changes the default timeout from 60 minutes to 15 minutes. If the session is not used for 15 minutes, the session is automatically stopped.
Installing additional Python modules
To install additional modules to your session using pip, use the %additional_python_modules magic:
%additional_python_modules awswrangler, s3://mybucket/mymodule.whl
All arguments to additional_python_modules are passed to pip3 install -m <>
To view a list of available Python modules, see Using Python Libraries with AWS Glue.
Changing AWS Glue Configuration
AWS Glue supports various worker types. You can set the worker type with %worker_type. For example: %worker_type G.2X. The default is G.1X.
You can also specify the number of workers with %number_of_workers. For example, to specify 40 workers: %number_of_workers 40.
For more information, see Defining Job Properties.
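The session magics described in this section can be gathered into the text of a single notebook cell. The following is a minimal sketch in Python, assuming a hypothetical helper named build_magic_cell that is not part of AWS Glue. It only formats strings from the magics named above and never contacts the service.

```python
def build_magic_cell(idle_timeout_minutes=None, worker_type=None,
                     number_of_workers=None, python_modules=()):
    """Return the text of a notebook cell made of AWS Glue session magics.

    Hypothetical helper for illustration only: it formats strings and
    does not start or modify a session.
    """
    lines = []
    if idle_timeout_minutes is not None:
        lines.append(f"%idle_timeout {idle_timeout_minutes}")
    if worker_type is not None:
        lines.append(f"%worker_type {worker_type}")
    if number_of_workers is not None:
        lines.append(f"%number_of_workers {number_of_workers}")
    if python_modules:
        # Modules are comma-separated, as in the example in this section.
        lines.append("%additional_python_modules " + ", ".join(python_modules))
    return "\n".join(lines)


cell = build_magic_cell(
    idle_timeout_minutes=15,
    worker_type="G.2X",
    number_of_workers=40,
    python_modules=["awswrangler", "s3://mybucket/mymodule.whl"],
)
print(cell)
```

Paste the printed text into a notebook cell and run it to apply the settings.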
Stop a notebook session
To stop a notebook session, use the magic %stop_session.
If you navigate away from the notebook in the AWS console, you will receive a warning message where you can choose to stop the session.
AWS Glue Visual Job API (Preview)
The Visual Job API is in preview release for AWS Glue and is subject to change.
Topics
• API design and CRUD APIs (p. 5)
• Getting started (p. 5)
• API design and CRUD APIs (p. 9)
• SDK Onboarding (p. 9)
• Appendix: Visual Job Examples and Model Definitions (p. 9)
AWS Glue provides an API that allows customers to create data integration jobs using the AWS Glue API from a JSON object that represents a DAG. Customers can then use the visual editor in AWS Glue Studio to work with these jobs.
API design and CRUD APIs
The create and update Job APIs now support an additional optional parameter, codeGenConfigurationNodes. Providing a non-empty JSON structure for this field results in the DAG being registered in AWS Glue Studio for the created job and the associated code being generated. A null value or empty string for this field on job creation is ignored.
Updates to the codeGenConfigurationNodes field are made through the update-job AWS Glue API in the same way as with create-job. Specify the entire field in update-job, with the DAG changed as desired. A null value is ignored and the DAG is not updated.
An empty structure or string causes codeGenConfigurationNodes to be set to empty and any previous DAG to be removed. The get-job API returns a DAG if one exists. The delete-job API also deletes any associated DAG.
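The update rules above can be summarized as a small decision function. The following Python sketch is an illustrative model of the described behavior, not the service implementation. The function name apply_update is hypothetical and the code does not call AWS Glue.

```python
def apply_update(existing_dag, new_value):
    """Model how update-job treats codeGenConfigurationNodes, per the text above.

    existing_dag is the DAG currently stored for the job (a dict).
    new_value is the value supplied for codeGenConfigurationNodes.
    """
    if new_value is None:
        # A null value is ignored: the stored DAG is left untouched.
        return existing_dag
    if new_value == {} or new_value == "":
        # An empty structure or string clears any previously stored DAG.
        return {}
    # Otherwise the entire field replaces the stored DAG; nothing is merged.
    return new_value


stored = {"node-1": {"CatalogSource": {"Name": "src", "Database": "db", "Table": "t"}}}
unchanged = apply_update(stored, None)
cleared = apply_update(stored, {})
```

Because the entire field is replaced, a caller that wants to change a single node still sends the full map of nodes.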
Getting started
Follow the steps in SDK Onboarding (p. 9).
To create a job, use the createJob function. The CreateJobRequest input has an additional field, ‘codeGenConfigurationNodes’, where you can specify the DAG object in JSON.
Things to keep in mind:
• The ‘codeGenConfigurationNodes’ field is a map of nodeId to node.
• Each node begins with a key identifying what kind of node it is.
• There can only be one key specified since a node can only be of one type.
• The input field contains the parent nodes of the current node.
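The rules in the list above can be checked before a request is sent. The following is a minimal Python sketch, assuming that parent nodes are listed under the key "Inputs", as in the JSON examples in this chapter. The function name validate_nodes is hypothetical, and the check is local only.

```python
import json


def validate_nodes(nodes):
    """Check the structural rules listed above for a map of nodeId to node.

    Returns a list of problem descriptions; an empty list means none were found.
    """
    problems = []
    for node_id, node in nodes.items():
        # A node can be of only one type, so it must have exactly one key.
        if len(node) != 1:
            problems.append(f"{node_id}: expected exactly one node-type key, found {len(node)}")
            continue
        body = next(iter(node.values()))
        # Every parent listed in Inputs must be another node ID in the map.
        for parent in body.get("Inputs", []):
            if parent not in nodes:
                problems.append(f"{node_id}: input {parent!r} is not a known node ID")
    return problems


nodes = {
    "node-1": {"S3CatalogSource": {"Name": "S3 bucket", "Database": "database", "Table": "table1"}},
    "node-2": {"SelectFields": {"Name": "SelectFields", "Inputs": ["node-1"], "Paths": ["col0"]}},
}
print(validate_nodes(nodes))
print(json.dumps(nodes, indent=2))
```

The second print statement shows the JSON text that would be supplied as the value of the field.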
The following is a JSON representation of a createJob input.
{
  "Name": "myjob1",
  "Role": "arn:aws:iam::253723508848:role/myrole",
  "Description": "",
  "GlueVersion": "2.0",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://myscripts/myjob1.py",
    "PythonVersion": "3"
  },
  "MaxRetries": 3,
  "Timeout": 2880,
  "ExecutionProperty": { "MaxConcurrentRuns": 1 },
  "NotificationProperty": {},
  "DefaultArguments": {
    "--class": "GlueApp",
    "--job-language": "python",
    "--job-bookmark-option": "job-bookmark-enable",
    "--TempDir": "s3://assets/temporary/",
    "--enable-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://assets/sparkHistoryLogs/",
    "--encryption-type": "sse-s3",
    "--enable-glue-datacatalog": "true"
  },
  "Tags": {},
  "DeveloperMode": false,
  "WorkerType": "G.1X",
  "NumberOfWorkers": 10,
  "CodeGenConfigurationNodes": {
    "node-1": {
      "S3CatalogSource": { "Database": "database", "Name": "S3 bucket", "Table": "table1" }
    },
    "node-2": {
      "ApplyMapping": {
        "Mapping": [
          { "FromPath": ["col0"], "ToKey": "col0", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col1"], "ToKey": "col1", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col2"], "ToKey": "col2", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col3"], "ToKey": "col3", "ToType": "string", "FromType": "string", "Dropped": false }
        ],
        "Inputs": ["node-1"],
        "Name": "ApplyMapping"
      }
    },
    "node-3": {
      "S3CatalogTarget": {
        "Path": "s3://mypath/",
        "UpdateCatalogOptions": "none",
        "Inputs": ["node-1", "node-2"],
        "SchemaChangePolicy": { "EnableUpdateCatalog": false },
        "Name": "S3 bucket",
        "Format": "json",
        "PartitionKeys": [],
        "Compression": "none"
      }
    }
  }
}
The following is a more complex example:
{
  "Name": "myjob2",
  "Role": "arn:aws:iam::253723508848:role/myrole",
  "Description": "",
  "GlueVersion": "2.0",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://myscripts/myjob1.py",
    "PythonVersion": "3"
  },
  "MaxRetries": 3,
  "Timeout": 2880,
  "ExecutionProperty": { "MaxConcurrentRuns": 1 },
  "CodeGenConfigurationNodes": {
    "node-3": {
      "S3DirectTarget": {
        "Path": "s3://mypath/",
        "UpdateCatalogOptions": "none",
        "Inputs": ["node-1624994219677"],
        "SchemaChangePolicy": { "EnableUpdateCatalog": false },
        "Name": "S3 bucket",
        "Format": "json",
        "PartitionKeys": [],
        "Compression": "none"
      }
    },
    "node-1624994205115": {
      "CatalogSource": {
        "Name": "AWS Glue Data Catalog",
        "Database": "database2",
        "Table": "table2"
      }
    },
    "node-1624994219677": {
      "Join": {
        "Name": "Join",
        "Inputs": ["node-1624994205115", "node-2"],
        "JoinType": "equijoin",
        "Columns": [
          { "From": "node-1624994205115", "Keys": ["firstname"] },
          { "From": "node-2", "Keys": ["col0"] }
        ],
        "ColumnConditions": ["="]
      }
    },
    "node-2": {
      "S3CatalogSource": {
        "Database": "database",
        "Input": ["node-1624994219677"],
        "Name": "S3 bucket",
        "Format": "json",
        "PartitionKeys": [],
        "Compression": "none"
      }
    }
  }
}
Updating jobs
Since updateJob will also have a ‘codeGenConfigurationNodes’ field, the input format will be the same. The get-job command will return a ‘codeGenConfigurationNodes’ field in the same format as well.
API design and CRUD APIs
Because the ‘codeGenConfigurationNodes’ parameter has been added to existing APIs, it inherits any limitations of those APIs. In addition, codeGenConfigurationNodes and some nodes are limited in size. See the Appendix for the full set of limitations. The limitations are applied on a per-field basis.
The general shape is:
{
"Nodeid-1": {...}, "Nodeid-2": {...}
}
Things to note:
• keys uniquely identify a node
• the body contains the node specification
• every node will have a strongly typed model
• an exhaustive list of node models is present in the Appendix section
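Because the node map represents a DAG, one useful local check is that the nodes can be ordered so that each node follows its parents. The following Python sketch uses the standard library graphlib module (Python 3.9 or later) and assumes that parents are listed under "Inputs", as in the earlier examples. The function name execution_order is hypothetical, and the node bodies are abbreviated placeholders.

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Return node IDs ordered so that every node follows its inputs.

    Raises graphlib.CycleError (a ValueError) if the Inputs references
    form a cycle, which means the map is not a valid DAG.
    """
    graph = {}
    for node_id, node in nodes.items():
        body = next(iter(node.values()))  # the single node-type body
        graph[node_id] = set(body.get("Inputs", []))
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


nodes = {
    "node-3": {"S3DirectTarget": {"Name": "out", "Inputs": ["node-2"]}},
    "node-1": {"CatalogSource": {"Name": "src", "Database": "database", "Table": "table1"}},
    "node-2": {"ApplyMapping": {"Name": "ApplyMapping", "Inputs": ["node-1"], "Mapping": []}},
}
order = execution_order(nodes)
print(order)
```

The order of keys in the map does not matter; only the Inputs references determine the result.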
SDK Onboarding
To access the required files, go to the GitHub repository as described below.
CLI
Go to the GitHub repository and download the service-2.json file. If you're using Mac or Linux, place this file in the folder ~/.aws/models/glue/2017-03-31. If .aws does not exist, you need to configure the AWS CLI first; see the AWS CLI installation instructions. If the other folders do not exist, you can create them manually. You can use the CLI with this custom model in the same way that you normally use the CLI.
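The file placement described above can also be scripted. The following is a minimal Python sketch for Mac or Linux, assuming that service-2.json has already been downloaded. The function name install_cli_model and its home parameter are hypothetical; the parameter exists only so that the function can be tried in a scratch directory.

```python
import shutil
from pathlib import Path


def install_cli_model(downloaded_file, home=None):
    """Copy a downloaded service-2.json into the AWS CLI model folder.

    The target folder is the one named in the text above. The function
    refuses to run if .aws is missing, because that indicates the AWS CLI
    has not been configured yet.
    """
    home = Path(home) if home is not None else Path.home()
    aws_dir = home / ".aws"
    if not aws_dir.is_dir():
        raise FileNotFoundError(f"{aws_dir} not found; configure the AWS CLI first")
    target_dir = aws_dir / "models" / "glue" / "2017-03-31"
    target_dir.mkdir(parents=True, exist_ok=True)  # create the other folders if missing
    target = target_dir / "service-2.json"
    shutil.copyfile(downloaded_file, target)
    return target
```

Calling install_cli_model("service-2.json") from the download folder copies the file into place and returns the destination path.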
Java SDK
For older Java clients, a JAR called AwsGlueJavaClient-1.12.x.jar is available on the GitHub repository.
To use the newer AWS SDK for Java 2.x, a JAR called AwsJavaSdk-Glue-2.0.jar is available on the GitHub repository.
Add the JAR to your class path in your preferred way. After the JAR is added to your class path, it can be used in the same way as you are using the existing AWS Glue SDK.
Appendix: Visual Job Examples and Model Definitions
This appendix provides examples and model definitions of sources, data targets, and transforms.
Examples
Sources
S3CSVSource from an AWS Glue Data Catalog table:
{
"Database": "database", "Table": "table1", "Name": "S3 bucket", "IsCatalog": true }
CatalogSource for RDS:
{
"Database": "database", "Table": "rdsSource", "Name": "MyRdsSource", "IsCatalog": true }
Data Targets
S3CatalogTarget
{
"Inputs": [
"node-1625147321253"
],
"Database": "dbl", "Table": "s3Table", "Name": "s3 bucket", "Format": "json", "PartitionKeys": [ "col1"
],
"UpdateCatalogOptions": "schemaAndPartitions", "SchemaChangePolicy": {
"EnableUpdateCatalog": true,
"UpdateBehavior": "UPDATE_IN_DATABASE"
}
}
S3DirectTarget
{
"Path": "s3://mypath/",
"UpdateCatalogOptions": "none", "Inputs": [
"node-2"
],
"SchemaChangePolicy": {
"EnableUpdateCatalog": false },
"Name": "S3 bucket", "Format": "json", "PartitionKeys": [],
"Classification": "DataSink", "Compression": "none"
}
Transforms
Rename Field
{
  "Inputs": ["node-1"],
  "Name": "MyRenameField",
  "SourcePath": "col3",
  "TargetPath": "name"
}
Filter
{
"Name": "Filter", "Inputs": [ "node-2"
],
"LogicalOperator": "AND", "Filters": [
{
"Operation": "ISNULL", "Negated": false, "Values": [ {
"Type": "COLUMNEXTRACTED", "Value": "col1"
} ] }, {
"Operation": "REGEX", "Negated": false, "Values": [ {
"Type": "CONSTANT", "Value": ".*"
}, {
"Type": "COLUMNEXTRACTED", "Value": "col2"
} ] } ] }
Model Definitions
Sources
AthenaConnector
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required, "ConnectorName": 256 character String,
"ConnectionType": 256 character String Required, "ConnectionTable": 256 character String Required, "SchemaName": 256 character String Required }
JDBCConnector
JDBCDataType is an enum. To see a full list of possible values, see JDBCDataType.
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required, "ConnectorName": 256 character String,
"ConnectionType": 256 character String Required, "AdditionalOptions": JDBCConnectorOptions, "ConnectionTable": 256 character String, "Query": 256 character String
}
JDBCConnectorOptions:
{
"FilterPredicate": 256 character String,
"PartitionColumn": 256 character String,
"LowerBound": Non-Negative Long,
"UpperBound": Non-Negative Long,
"NumPartitions": Non-Negative Long,
"JobBookmarkKeys": List of Strings up to 100,
"JobBookmarkKeysSortOrder": ASC or DESC,
"DataTypeMapping": Map<JDBCDataType, JDBCDataType>
}
SparkConnectorSource
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required, "ConnectorName": 256 character String,
"ConnectionType": 256 character String Required, "AdditionalOptions": Map<256 character String, Object>
}
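The length and Required annotations in these model definitions can be checked locally before a node is submitted. The following Python sketch assumes that "100 character String" means a string of at most 100 characters. The table SPARK_CONNECTOR_SOURCE_SPEC and the function check_string_fields are hypothetical names, and the values in the sample node are placeholders.

```python
# Limits transcribed from the SparkConnectorSource model above:
# field name -> (maximum length, required?)
SPARK_CONNECTOR_SOURCE_SPEC = {
    "Name": (100, True),
    "ConnectionName": (256, True),
    "ConnectorName": (256, False),
    "ConnectionType": (256, True),
}


def check_string_fields(body, spec):
    """Return a list of violations of the length and required rules in spec."""
    problems = []
    for field, (max_len, required) in spec.items():
        value = body.get(field)
        if value is None:
            if required:
                problems.append(f"{field} is required")
            continue
        if not isinstance(value, str):
            problems.append(f"{field} must be a string")
        elif len(value) > max_len:
            problems.append(f"{field} exceeds {max_len} characters")
    return problems


source = {
    "Name": "My Spark source",          # placeholder value
    "ConnectionName": "my-connection",  # placeholder value
    "ConnectionType": "my-type",        # placeholder value
}
print(check_string_fields(source, SPARK_CONNECTOR_SOURCE_SPEC))
```

The same function works for the other models if a matching table of limits is supplied.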
CatalogSource:
{
"Name": 100 character String Required, "Database": 256 character String Required, "Table": 256 character String Required }
CatalogKinesisSource – is a kind of CatalogSource
{
"Name": 100 character String Required, "Database": 256 character String Required, "WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions, "Table": 256 character String Required
}
KinesisStreamingSourceOptions
{
"EndpointUrl":256 character String,
"StreamName":256 character String,
"Classification":256 character String,
"Delimiter":256 character String,
"StartingPosition": LATEST or TRIM_HORIZON or EARLIEST,
"MaxFetchTimeInMs": Non-negative Long,
"MaxFetchRecordsPerShard": Non-negative Long,
"MaxRecordsPerRead": Non-negative Long,
"AddIdleTimeBetweenReads": Boolean,
"IdleTimeBetweenReadsInMs": Non-negative Long,
"DescribeShardInterval": Non-negative Long,
"NumRetries": Positive Integer,
"RetryIntervalInMs": Non-negative Long,
"MaxRetryIntervalMs": Non-negative Long,
"AvoidEmptyBatches": Boolean,
"StreamARN": 256 character String,
"AwsSTSRoleARN": 256 character String,
"AwsSTSSessionName": 256 character String
}
DirectKinesisSource
{
"Name":100 character String Required, "WindowSize": Positive Integer, "DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions }
CatalogKafkaSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions,
"Table": 256 character String Required
}
KafkaStreamingSourceOptions
{
"BootstrapServers":256 character String, "SecurityProtocol":256 character String, "ConnectionName":256 character String, "TopicName":256 character String, "Assign":256 character String,
"SubscribePattern":256 character String, "Classification":256 character String, "Delimiter":256 character String, "StartingOffsets":256 character String, "EndingOffsets":256 character String, "PollTimeoutInMs": Non-negative long, "NumRetries": Positive integer, "RetryIntervalMs": Non-negative long, "MaxOffsetsPerTrigger": Non-negative long, "MinPartitions": Non-negative integer }
DirectKafkaSource
{
"Name":100 character String Required, "WindowSize": Positive Integer, "DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions }
RedshiftSource - is a kind of CatalogSource:
{
"Name":100 character String Required, "Database": 256 character String Required, "Table": 256 character String Required, "RedshiftTmpDir":256 character String, "TmpDirIAMRole":256 character String }
S3CatalogSource
{
"Name":100 character String Required, "Database": 256 character String Required, "Table": 256 character String Required, "S3SourceAdditionalOptions": {
//Only one can be specified, or neither "BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long }
}
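The comment in S3SourceAdditionalOptions states that only one of the two bounds, or neither, can be specified. The following Python sketch builds the options while enforcing that rule. The function name s3_additional_options is hypothetical.

```python
def s3_additional_options(bounded_size=None, bounded_files=None):
    """Build S3SourceAdditionalOptions, allowing at most one of the two bounds."""
    if bounded_size is not None and bounded_files is not None:
        raise ValueError("specify BoundedSize or BoundedFiles, not both")
    options = {}
    if bounded_size is not None:
        options["BoundedSize"] = bounded_size
    if bounded_files is not None:
        options["BoundedFiles"] = bounded_files
    return options


source = {
    "S3CatalogSource": {
        "Name": "S3 bucket",
        "Database": "database",
        "Table": "table1",
        "S3SourceAdditionalOptions": s3_additional_options(bounded_files=10),
    }
}
print(source)
```

Passing neither argument yields an empty options structure, which matches the "or neither" case.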
S3CSVSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required, "CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings, "GroupFiles":256 character String,
"GroupSize":256 character String, "Recurse":Boolean,
"MaxBand":Integer,
"MaxFilesInBand": Non negative Integer, "S3SourceAdditionalOptions": {
//Only one can be specified, or neither "boundedSize":Nullable Long,
"boundedFiles":Nullable Long },
"Separator":256 character String, "Escaper":256 character String, "QuoteChar":256 character String, "Multiline":Boolean,
"WithHeader":Boolean, "WriteHeader":Boolean, "SkipFirst":Boolean,
"UseArrowColumnVectors":Boolean, "UseSimdCsvParser":Boolean }
S3JSONSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required, "CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings, "GroupFiles":256 character String,
"GroupSize":256 character String, "Recurse":Boolean,
"MaxBand": Non negative Integer, "MaxFilesInBand": Non negative Integer, "S3SourceAdditionalOptions": {
//Only one can be specified, or neither "BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long },
"JsonPath":256 character String, "Multiline":Boolean
}
S3ParquetSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required,
"CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings,
"GroupFiles":256 character String,
"GroupSize":256 character String,
"Recurse":Boolean,
"MaxBand": Non negative Integer,
"MaxFilesInBand": Non negative Integer,
"S3SourceAdditionalOptions": {
//Only one can be specified, or neither
"BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long }
}
Targets
JDBCConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "ConnectionName":256 character String Required,
"ConnectionTable":256 character String, "ConnectorName":256 character String,
"ConnectionType":256 character String Required, "ConnectionTypeSuffix":256 character String,
"AdditionalOptions":Map<256 character String,Object>
}
SparkConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"ConnectionName":256 character String Required,
"ConnectionTable":256 character String,
"ConnectorName":256 character String,
"ConnectionType":256 character String Required,
"ConnectionTypeSuffix":256 character String,
"AdditionalOptions":Map<256 character String,Object>
}
CatalogTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Database":256 character String Required,
"Table":256 character String Required }
RedshiftTarget
{ "Name":100 character String Required, "Inputs": List of Strings. One 256 character String Required,
"Database":256 character String Required, "Table":256 character String Required, "RedshiftTmpDir":256 character String, "TmpDirIAMRole":256 character String }
S3CatalogTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "SchemaChangePolicy": SchemaChangePolicy,
"PartitionKeys": List of Strings. Up to 100 256 character Strings, "Database":256 character String Required,
"Table":256 character String Required }
SchemaChangePolicy:
{
"EnableUpdateCatalog": Boolean,
"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE"
}
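As an illustration of how these definitions constrain a node, the following Python sketch builds an S3CatalogTarget as a plain dictionary and checks it against the limits listed above. The helper name validate_s3_catalog_target, the node names, and the database and table names are hypothetical, and the preview API may enforce these rules differently.

```python
UPDATE_BEHAVIORS = {"LOG", "UPDATE_IN_DATABASE"}


def validate_s3_catalog_target(node):
    """Return a list of problems found in an S3CatalogTarget-shaped dict."""
    problems = []

    name = node.get("Name")
    if not isinstance(name, str) or not 0 < len(name) <= 100:
        problems.append("Name is required and limited to 100 characters")

    inputs = node.get("Inputs")
    if not isinstance(inputs, list) or len(inputs) != 1:
        problems.append("Inputs must list exactly one node")

    for key in ("Database", "Table"):
        value = node.get(key)
        if not isinstance(value, str) or not 0 < len(value) <= 256:
            problems.append(f"{key} is required and limited to 256 characters")

    if len(node.get("PartitionKeys", [])) > 100:
        problems.append("PartitionKeys accepts at most 100 entries")

    policy = node.get("SchemaChangePolicy")
    if policy is not None:
        if not isinstance(policy.get("EnableUpdateCatalog", False), bool):
            problems.append("EnableUpdateCatalog must be a Boolean")
        behavior = policy.get("UpdateBehavior")
        if behavior is not None and behavior not in UPDATE_BEHAVIORS:
            problems.append("UpdateBehavior must be LOG or UPDATE_IN_DATABASE")

    return problems


# Hypothetical node; every name below is made up for illustration.
example_target = {
    "Name": "S3CatalogTarget_example",
    "Inputs": ["ApplyMapping_example"],
    "PartitionKeys": ["year", "month"],
    "Database": "sales_db",
    "Table": "orders",
    "SchemaChangePolicy": {
        "EnableUpdateCatalog": True,
        "UpdateBehavior": "UPDATE_IN_DATABASE",
    },
}

print(validate_s3_catalog_target(example_target))  # prints []
```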
S3DirectTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "PartitionKeys": List of Strings. Up to 100 256 character Strings, "Path":256 character String Required,
"Compression": gzip or bzip2,
"Format":json, csv, avro, orc, or parquet Required, "SchemaChangePolicy": DirectSchemaChangePolicy }
DirectSchemaChangePolicy:
{
"EnableUpdateCatalog": Boolean,
"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE", "Database":256 character String,
"Table":256 character String }
Transforms
ApplyMapping
See the end of the document for the possible values of ApplyMappingType
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Mapping":List of up to 250 Mapping Required
}
Mapping:
{
"ToKey":256 character String Required,
"FromPath": List of Strings. One 256 character String Required, "FromType":ApplyMappingType Required,
"ToType": ApplyMappingType Required, "Dropped":Boolean,
"Children": List of up to 250 Mapping }
SelectFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Paths": List of Strings. Up to 100 256 character Strings Required }
DropFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Paths": List of Strings. Up to 100 256 character Strings Required }
RenameField
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"SourcePath":List of Strings. Up to 100 256 character Strings Required,
"TargetPath":256 character String Required
}
Spigot
{
"name":100 character String Required,
"inputs": List of Strings. One 256 character String Required, "path":256 character String Required,
"topk":Integer from 0 to 100, "prob":Double from 0 to 1.0 }
Join
{
"Name": 100 character String Required,
"Inputs": List of Strings. Two 256 character String Required,
"JoinType": equijoin, left, right, outer, leftsemi, or leftanti Required,
"Columns": List[Column] Required }
Column:
{
"From": 256 character String Required,
"Keys": List[String] Required
}
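A Join node built from this definition might look like the following Python sketch. It assumes that the join type key is spelled JoinType and that each Column's From value names one of the two input nodes; neither is confirmed by the definition above. The node names and key columns are hypothetical.

```python
JOIN_TYPES = {"equijoin", "left", "right", "outer", "leftsemi", "leftanti"}


def validate_join(node):
    """Return a list of problems found in a Join-shaped dict."""
    problems = []

    inputs = node.get("Inputs", [])
    if len(inputs) != 2:
        problems.append("Inputs must list exactly two nodes")

    if node.get("JoinType") not in JOIN_TYPES:
        problems.append(f"JoinType must be one of {sorted(JOIN_TYPES)}")

    columns = node.get("Columns", [])
    if not columns:
        problems.append("Columns is required")
    for column in columns:
        # Assumption: From refers to one of the node's inputs.
        if column.get("From") not in inputs:
            problems.append(f"Column.From is not an input: {column.get('From')}")
        if not column.get("Keys"):
            problems.append("Column.Keys must list at least one key")

    return problems


# Hypothetical node joining two upstream nodes on a customer identifier.
example_join = {
    "Name": "Join_example",
    "Inputs": ["orders_source", "customers_source"],
    "JoinType": "left",
    "Columns": [
        {"From": "orders_source", "Keys": ["customer_id"]},
        {"From": "customers_source", "Keys": ["id"]},
    ],
}

print(validate_join(example_join))  # prints []
```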
SplitFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Paths":List of Strings. Up to 100 256 character Strings Required }
SelectFromCollection
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "Index": Non Negative Integer Required
}
FillMissingValues
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "ImputedPath":256 character String Required,
"FilledPath":256 character String }
Filter
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required, "LogicalOperator":String Required,
"Filters":List[FilterInstance] Required }
FilterInstance:
{
"Operation": "EQ" | "LT" | "GT" | "LTE" | "GTE" | "REGEX" | "ISNULL" Required,
"Negated":Boolean,
"Values":List[FilterValue] Required }
FilterValue:
{
"Type": "COLUMNEXTRACTED" | "CONSTANT" Required, "Value": Object Required
}
CustomCode
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required, "Code":Up to 51,200 character string or 50 KB Required,
"ClassName":256 character String Required }
SparkSQL
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required, "SqlQuery": Up to 51,200 character string or 50 KB Required,
"SqlAliases":List of Alias. Up to 256 Aliases Required }
Alias:
{
"From":256 character String Required, "Alias":256 character String Required }
DropNullFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required, "Paths":List of Strings. Up to 100 256 character Strings Required, "NullCheckBoxList": NullCheckBoxList,
"NullTextList": List of NullValueField. Up to 50 NullValueField.
}
NullCheckBoxList {
"IsEmpty": Boolean, "IsNullString": Boolean, "IsNegOne": Boolean }
NullValueField {
"Value": 256 character String, "DataType": DataType
}
DataType {
"Id": 256 character String, "Label": 256 character String }
Union
{ "Name":100 character String Required,
"Inputs":List of Strings. Two 256 character String Required, "Sources":List of Strings. Two 256 character String,
"UnionType": ALL or DISTINCT Required }
Enums
JDBCDataType
ARRAY,BIGINT,BINARY,BIT,BLOB,BOOLEAN,CHAR,CLOB,DATALINK,DATE,DECIMAL,DISTINCT,DOUBLE,FLOAT,INTEGER,JAVA_OBJECT,LONGNVARCHAR,LONGVARBINARY,LONGVARCHAR,NCHAR,NCLOB,NULL,NUMERIC,NVARCHAR,OTHER,REAL,REF,REF_CURSOR,ROWID,SMALLINT,SQLXML,STRUCT,TIME,TIME_WITH_TIMEZONE,TIMESTAMP,TIMESTAMP_WITH_TIMEZONE,TINYINT,VARBINARY,VARCHAR
ApplyMappingType
bigint,binary,boolean,char,date,decimal,double,float,int,interval,long,smallint,string,timestamp,tinyint,varchar
Detect PII using AWS Glue Studio
The Detect PII transform is in preview release for AWS Glue Studio and is subject to change.
Note
Using the Detect PII transform in AWS Glue Studio jobs requires AWS Glue 2.0.
The Detect PII transform identifies personally identifiable information (PII) in your data source. You choose the PII entities to identify, how you want the data to be scanned, and what to do with the PII entities that the transform identifies.
Topics
• Choosing how you want the data to be scanned (p. 22)
• Choosing the PII entities to take action on (p. 23)
• Choosing what to do with identified PII data (p. 24)
Choosing how you want the data to be scanned
You can choose to detect PII in the entire data source, or detect the fields (columns) that contain PII.
When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a comprehensive scan to ensure that PII entities are identified.
When you choose Detect fields containing PII, you’re choosing to scan a sample of rows for PII entities.
This is a way to keep costs and resources low while also identifying the fields where PII entities are found.
When you choose to detect fields that contain PII, you can reduce costs and improve performance by sampling a portion of rows. Choosing this option will allow you to specify additional options:
• Sample portion: The percentage of rows to sample. For example, if you enter ‘50’, 50 percent of the rows are scanned for the PII entity.
• Detection threshold: The minimum percentage of scanned rows that must contain the PII entity for the entire column to be identified as having that entity. For example, if you enter ‘10’, at least 10 percent of the scanned rows must contain the PII entity US Phone for the field to be identified as having US Phone. If fewer than 10 percent of the scanned rows contain the entity, the field is not labeled as having US Phone.
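The interaction between the two options can be sketched in Python as follows. This is only an illustration of the sampling and threshold arithmetic described above, not the transform's implementation: the function names, the fixed random seed, and the regular expression standing in for a US Phone detector are all assumptions.

```python
import random
import re

# Hypothetical stand-in for a US Phone detector; the real transform's rules are not shown here.
US_PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")


def looks_like_us_phone(value):
    return bool(US_PHONE.match(str(value)))


def field_has_entity(values, sample_portion, detection_threshold, detector, seed=0):
    """Sample a percentage of rows and compare the match rate to the threshold."""
    if not values:
        return False
    # Sample portion: percentage of rows to scan (at least one row, at most all rows).
    sample_size = min(len(values), max(1, round(len(values) * sample_portion / 100)))
    sampled = random.Random(seed).sample(list(values), sample_size)
    matches = sum(1 for value in sampled if detector(value))
    # Detection threshold: flag the field when matches / scanned rows >= threshold percent.
    return matches * 100 >= detection_threshold * len(sampled)


column = ["206-555-0100", "(425) 555-0142"] + ["n/a"] * 18

# With all rows scanned, 2 of 20 rows (10 percent) contain a phone number.
print(field_has_entity(column, 100, 10, looks_like_us_phone))  # True
print(field_has_entity(column, 100, 25, looks_like_us_phone))  # False
```

With a sample portion below 100, the result depends on which rows are drawn, so a field with few PII values can be missed; that is the trade-off for the lower cost.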
Choosing the PII entities to take action on
You can specify one or more PII entities that you want to detect and take action on.
• ITIN (US)
• Passport Number (US)
• US Phone
• Credit Card
• Bank Account (US, Canada)
• US Driving License
• IP Address
• MAC Address
• DEA Number (US)
• HCPCS Code (US)
• National Provider Identifier (US)
• National Drug Code (US)
• Health Insurance Claim Number (US)
• Medicare Beneficiary Identifier (US)
• CPT Code (US)
Choosing what to do with identified PII data
If you chose to detect PII in the entire data source (Detect PII in each cell), you can choose to:
• Enrich data with detection results: You can store the detected entities in a new column.
• Redact detected text: You can replace the detected PII value with a string that you specify in the optional Replacing text input field. If no string is specified, the detected PII entity is replaced with '*******'.
If you chose to detect fields containing PII, you can choose to take the following actions:
• Output Detection Results: This creates a new dataframe with the detected PII information for each column.
• Redact detected text: You can replace the detected PII value with a string that you specify. If no string is specified, the detected PII entity is replaced with '*******'.
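The redaction behavior described above can be sketched in Python as follows. The pattern used to find a phone number and the function name are hypothetical stand-ins, not the transform's own detector; only the default replacement string comes from the description above.

```python
import re

# Default replacement used when no replacement string is specified.
DEFAULT_MASK = "*******"

# Hypothetical stand-in pattern for a US phone number.
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")


def redact(text, pattern, replacement=None):
    """Replace every match of pattern with the replacement, or with the default mask."""
    mask = replacement if replacement else DEFAULT_MASK
    # A function is passed to sub() so the mask is inserted literally.
    return pattern.sub(lambda _match: mask, text)


print(redact("Call 206-555-0100 today", US_PHONE))
print(redact("Call (206) 555-0100 today", US_PHONE, "[PHONE]"))
```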
What is AWS Glue Studio?
AWS Glue Studio is a new graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job.
AWS Glue Studio is designed not only for tabular data, but also for semi-structured data, which is difficult to render in spreadsheet-like data preparation interfaces. Examples of semi-structured data include application logs, mobile events, Internet of Things (IoT) event streams, and social feeds.
When creating a job in AWS Glue Studio, you can choose from a variety of data sources that are stored in AWS services. You can quickly prepare that data for analysis in data warehouses and data lakes. AWS Glue Studio also offers tools to monitor ETL workflows and validate that they are operating as intended.
You can preview the dataset for each node. This helps you to debug your ETL jobs by displaying a sample of the data at each step of the job.
AWS Glue Studio provides a visual interface that makes it easy to:
• Pull data from an Amazon S3, Amazon Kinesis, or JDBC source.
• Configure a transformation that joins, samples, or transforms the data.
• Specify a target location for the transformed data.
• View the schema or a sample of the dataset at each point in the job.
• Run, monitor, and manage the jobs created in AWS Glue Studio.
Features of AWS Glue Studio
AWS Glue Studio helps you to create and manage jobs that gather, transform, and clean data. Advanced users can use AWS Glue Studio to troubleshoot and edit job scripts.
Visual job editor
You can perform the following actions when creating and editing jobs in AWS Glue Studio:
• Add additional nodes to the job to implement:
• Multiple data sources.
• Multiple data targets.
• Data sources and targets that use connectors for external data stores that were not previously supported.
• View a sample of the data at each node in the job diagram.
• Change the parent nodes for an existing node.
• Add transforms that:
• Join data sources.
• Select specific fields from the data.
• Drop fields.
• Rename fields.
• Change the data type of fields.
• Write select fields from the data into a JSON file in an Amazon S3 bucket (spigot).
• Filter out data from a dataset.
• Split a dataset into two datasets.
• Find missing values in a dataset and provide the missing value in a separate column.
• Use SQL to query and transform the data.
• Use custom code.
Notebook interface for interactively developing and debugging job scripts
AWS Glue Studio provides an enhanced notebook experience with one-click setup for easy job authoring and data exploration. The notebook and connections are configured automatically for you. You can use the notebook interface based on Jupyter Notebook to interactively develop, debug, and deploy scripts and workflows using AWS Glue serverless Apache Spark ETL infrastructure. You can perform ad-hoc queries, data analysis, and visualization (for example, tables and graphs) in the notebook environment.
The notebook editor interface in AWS Glue Studio offers the following features:
• No cluster to provision or manage
• No cost for idle clusters waiting to run notebooks
• No up-front configuration required
• No resource contention for the same development environment
• Easy installation and usage
• Test in the exact same run environment that your AWS Glue ETL jobs run in
Job script code editor
AWS Glue Studio also has a script editor for writing or customizing the extract-transform-and-load (ETL) code for your jobs. You can use the visual editor in AWS Glue Studio to quickly design your ETL job and then edit the generated script to write code for the unique components of your job.
When creating a new job, you can choose to write scripts for Spark jobs or Python shell jobs. You can code the job ETL script for Spark jobs using either Python or Scala. If you create a Python shell job, the job ETL script uses Python 3.6.
The script editor interface in AWS Glue Studio offers the following features:
• Insert, modify, and delete sources, targets, and transforms in your script.
• Add or modify arguments for data sources, targets, and transforms.
• Syntax and keyword highlighting.
• Auto-completion suggestions for local words, Python keywords, and code snippets.
Job performance dashboard
AWS Glue Studio provides a comprehensive run dashboard for your ETL jobs. The dashboard displays information about job runs from a specific time frame. The information displayed on the dashboard includes:
• Jobs overview summary – A high-level overview showing total jobs, current runs, completed runs, and failed jobs.
• Status summaries – Provides high level job metrics based on job properties, such as worker type and job type.
• Job runs time line – A bar graph summary of successful, failed, and total runs for the currently selected time frame.
• Job run breakdown – A detailed list of job runs from the selected time frame.
Support for dataset partitioning
You can use AWS Glue Studio to efficiently process partitioned datasets. You can load, filter, transform, and save your partitioned data by using SQL expressions or user-defined functions, which avoids listing and reading unnecessary data from Amazon S3.
When should I use AWS Glue Studio?
Use AWS Glue Studio for a simple visual interface to create ETL workflows for data cleaning and transformation, and run them on AWS Glue.
AWS Glue Studio makes it easy for ETL developers to create repeatable processes to move and transform large-scale, semi-structured datasets, and load them into data lakes and data warehouses. It provides a boxes-and-arrows style visual interface for developing and managing AWS Glue ETL workflows that you can optionally customize with code. AWS Glue Studio combines the ease of use of traditional ETL tools, and the power and flexibility of AWS Glue’s big data processing engine.
AWS Glue Studio provides multiple ways to customize your ETL scripts, including adding nodes that represent code snippets in the visual editor.
Use AWS Glue Studio for easier job management. AWS Glue Studio provides you with job and job run management interfaces that make it clear how jobs relate to each other, and give an overall picture of your job runs. The job management page makes it easy to do bulk operations on jobs (previously difficult to do in the AWS Glue console). All job runs are available in a single interface where you can search and filter. This gives you a constantly updated view of your ETL operations and the resources you use. You can use the real-time dashboard in AWS Glue Studio to monitor your job runs and validate that they are operating as intended.
Accessing AWS Glue Studio
To access AWS Glue Studio, sign in to AWS as a user that has the required permissions, as described in Set up IAM permissions for AWS Glue Studio (p. 35). Then you can sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Click the AWS Glue Studio link in the navigation pane.
Pricing for AWS Glue Studio
When using AWS Glue Studio, you are charged for data previews. After you specify an IAM role for the job, the visual editor starts an Apache Spark session for sampling your source data and executing transformations. This session runs for 30 minutes and then turns off automatically. AWS charges you for 2 DPUs at the development endpoint rate (DEVED-DPU-Hour), typically resulting in a charge of $0.44 for each 30 minute session. The rate might vary for each region. At the end of the 30 minute session, you can choose Retry on the Data preview tab for any node or reload the visual editor page to start a new 30 minute session at the same rates.
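The arithmetic behind the typical charge can be sketched as follows. The rate of $0.44 per DPU-hour is an assumption inferred from the figures above (2 DPUs for half an hour costing $0.44); actual rates vary by Region, and the function name is hypothetical.

```python
def preview_session_cost(dpus=2, minutes=30, rate_per_dpu_hour=0.44):
    """Estimate the data preview charge: DPUs x hours x rate per DPU-hour."""
    return round(dpus * (minutes / 60) * rate_per_dpu_hour, 2)


# One 30 minute session with the assumed rate.
print(preview_session_cost())  # 0.44

# Three sessions in a day, for example after choosing Retry twice.
print(round(3 * preview_session_cost(), 2))  # 1.32
```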
You also pay for the underlying AWS services that your jobs use or interact with, such as AWS Glue, your data sources, and your data targets. For pricing information, see AWS Glue Pricing.
Setting up for AWS Glue Studio
Complete the tasks in this section when you're using AWS Glue Studio for the first time:
Topics
• Complete initial AWS configuration tasks (p. 29)
• Review IAM permissions needed for the AWS Glue Studio user (p. 31)
• Review IAM permissions needed for ETL jobs (p. 34)
• Set up IAM permissions for AWS Glue Studio (p. 35)
• Configure a VPC for your ETL job (p. 37)
• Populate the AWS Glue Data Catalog (p. 38)
Complete initial AWS configuration tasks
To use AWS Glue Studio you must first complete the following tasks:
• Sign up for AWS (p. 29)
• (Recommended) Create an IAM administrator user (p. 29)
• (Recommended) Create an AWS user for AWS Glue Studio.
You can either use the administrator user for creating and managing your ETL jobs, or you can create a separate user for accessing AWS Glue Studio.
To create additional users for AWS Glue or AWS Glue Studio, follow the steps in Creating Your First IAM Delegated User and Group in the IAM User Guide.
• Sign in as an IAM user (p. 30)
Sign up for AWS
If you do not have an AWS account, complete the following steps to create one.
To sign up for an AWS account
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.
Create an IAM administrator user
If your account already includes an IAM user with full AWS administrative permissions, you can skip this section.
To create an administrator user for yourself and add the user to an administrators group (console)
1. Sign in to the IAM console as the account owner by choosing Root user and entering your AWS account email address. On the next page, enter your password.
Note
We strongly recommend that you adhere to the best practice of using the Administrator IAM user that follows and securely lock away the root user credentials. Sign in as the root user only to perform a few account and service management tasks.
2. In the navigation pane, choose Users and then choose Add user.
3. For User name, enter Administrator.
4. Select the check box next to AWS Management Console access. Then select Custom password, and then enter your new password in the text box.
5. (Optional) By default, AWS requires the new user to create a new password when first signing in. You can clear the check box next to User must create a new password at next sign-in to allow the new user to reset their password after they sign in.
6. Choose Next: Permissions.
7. Under Set permissions, choose Add user to group.
8. Choose Create group.
9. In the Create group dialog box, for Group name enter Administrators.
10. Choose Filter policies, and then select AWS managed - job function to filter the table contents.
11. In the policy list, select the check box for AdministratorAccess. Then choose Create group.
Note
You must activate IAM user and role access to Billing before you can use the AdministratorAccess permissions to access the AWS Billing and Cost Management console. To do this, follow the instructions in step 1 of the tutorial about delegating access to the billing console.
12. Back in the list of groups, select the check box for your new group. Choose Refresh if necessary to see the group in the list.
13. Choose Next: Tags.
14. (Optional) Add metadata to the user by attaching tags as key-value pairs. For more information about using tags in IAM, see Tagging IAM entities in the IAM User Guide.
15. Choose Next: Review to see the list of group memberships to be added to the new user. When you are ready to proceed, choose Create user.
You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access management and Example policies.
Sign in as an IAM user
Sign in to the IAM console by choosing IAM user and entering your AWS account ID or account alias. On the next page, enter your IAM user name and your password.
Note
For your convenience, the AWS sign-in page uses a browser cookie to remember your IAM user name and account information. If you previously signed in as a different user, choose the sign-in link beneath the button to return to the main sign-in page. From there, you can enter your AWS account ID or account alias to be redirected to the IAM user sign-in page for your account.
Review IAM permissions needed for the AWS Glue Studio user
To use AWS Glue Studio, the user must have access to various AWS resources. The user must be able to view and select Amazon S3 buckets, IAM policies and roles, and AWS Glue Data Catalog objects.
AWS Glue service permissions
AWS Glue Studio uses the actions and resources of the AWS Glue service. Your user needs permissions on these actions and resources to effectively use AWS Glue Studio. You can grant the AWS Glue Studio user the AWSGlueConsoleFullAccess managed policy, or create a custom policy with a smaller set of permissions.
Important
As a security best practice, tighten your policies to further restrict access to Amazon S3 buckets and Amazon CloudWatch log groups. For an example Amazon S3 policy, see Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket.
Creating Custom IAM Policies for AWS Glue Studio
You can create a custom policy with a smaller set of permissions for AWS Glue Studio. The policy can grant permissions for a subset of objects or actions. Use the following information when creating a custom policy. Actions marked with an asterisk (*) are APIs that are available only in AWS Glue Studio and do not exist in AWS Glue.
Job Actions
• GetJob
• CreateJob
• DeleteJob
• GetJobs
• UpdateJob
• *QueryJobs
• *SaveJob
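As a rough sketch, a custom policy that grants only the standard job actions listed above could be assembled like this in Python. It assumes the actions are written with the glue: service prefix and omits the asterisked AWS Glue Studio actions, because how those are named in an IAM policy is not stated here. The function name is hypothetical, and the "*" resource is a placeholder that you would normally narrow.

```python
import json

# Standard AWS Glue job actions from the list above (asterisked actions omitted).
JOB_ACTIONS = ["GetJob", "CreateJob", "DeleteJob", "GetJobs", "UpdateJob"]


def build_policy(actions, resource="*"):
    """Build an IAM policy document that allows the given AWS Glue actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: each action uses the "glue:" service prefix.
                "Action": [f"glue:{action}" for action in actions],
                "Resource": resource,
            }
        ],
    }


policy = build_policy(JOB_ACTIONS)
print(json.dumps(policy, indent=2))
```

The same helper can be called with the action names from the other groups to extend the statement.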
Directed acyclic graph (DAG) Actions
• *CreateDag
• *UpdateDag
• *GetDag
• *DeleteDag
Database Actions
• GetDatabases
Plan Actions
• GetPlan
Job run Actions
• StartJobRun
• GetJobRuns
• BatchStopJobRun
• GetJobRun
• *QueryJobRuns
• *QueryJobRunsAggregated
Schema Actions
• *GetSchema
• *GetInferredSchema
Table Actions
• SearchTables
• GetTables
• GetTable
Connection Actions
• CreateConnection
• DeleteConnection
• UpdateConnection
• GetConnections
• GetConnection
File Actions
• GetFile
Mapping Actions
• GetMapping
Job Schedule Actions
• *GetNextScheduledJobs
Repository Actions
• *ListRepositories
Branch Actions
• *ListBranches
• *GetBranches
Commit Actions