AWS Glue Studio


User Guide


AWS Glue Studio: User Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


Using Notebooks with AWS Glue Studio and AWS Glue

Notebooks is in preview release for AWS Glue Studio and is subject to change.

Data engineers can author AWS Glue jobs faster and more easily than before using the new interactive notebook interface in AWS Glue Studio or interactive sessions in AWS Glue.

Topics

• Overview of using notebooks

• Getting started with notebooks in AWS Glue Studio

Overview of using notebooks

AWS Glue Studio allows you to interactively author jobs in a notebook interface based on Jupyter Notebooks. Through notebooks in AWS Glue Studio, you can edit job scripts and data integration code and view the output without having to run a full job. You can also add markdown and save notebooks as .ipynb files and job scripts. You can start a notebook without installing software locally or managing servers. When you are satisfied with your code, AWS Glue Studio can convert your notebook to an AWS Glue job with the click of a button.

Some benefits of using notebooks include:

• No cluster to provision or manage

• No idle clusters to pay for

• No up-front configuration required

• No installation of Jupyter notebooks required

• The same runtime/platform as AWS Glue ETL

When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds. AWS Glue Studio configures a Jupyter notebook with the AWS Glue Jupyter kernel. You don’t have to configure VPCs, network connections, or development endpoints to use this notebook.

To create jobs using the notebook interface:

• configure the necessary IAM permissions

• start a notebook session to create a job

• write code in the cells in the notebook

• run and test the code to view the output

• save the job


After your notebook is saved, your notebook is a full AWS Glue job. You can manage all aspects of the job, such as scheduling job runs, setting job parameters, and viewing the job run history, right alongside your notebook.

Getting started with notebooks in AWS Glue Studio

When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds.

The following sections describe how to use AWS Glue Studio to create notebooks for ETL jobs.

Topics

• Creating an ETL job using notebooks in AWS Glue Studio

• Notebook editor components

• Saving your notebook and job script

• Managing notebook sessions

Creating an ETL job using notebooks in AWS Glue Studio

To start using notebooks in the AWS Glue Studio console

1. Attach AWS Identity and Access Management policies to the AWS Glue Studio user and create an IAM role for your ETL job and notebook, as instructed in Set up IAM permissions for AWS Glue Studio (p. 35).

2. Conﬁgure additional IAM security for notebooks, as described in

3. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.

4. Choose the Jobs link in the left-side navigation menu.

5. Choose Jupyter notebook and then choose Create to start a new notebook session.

6. On the Create job in Jupyter notebook page, provide the job name, the IAM role to use, and choose which programming language you want to use within the notebook. Choose Create job.

After a short time period, the notebook editor appears.

7. The notebook doesn't automatically run any code. To configure your session, you use the %%configure magic to specify the notebook session parameters. For more information about magics, see the AWS Glue Developer Guide.

When the notebook first opens, it contains a single cell with an example %%configure command based on the information you provided on the Create job in Jupyter notebook page. You can modify this cell to customize the notebook session.

Run the cell to start a new notebook session and generate a session ID.

8. Add cells, and enter code or markdown text.

For information about writing code using a Jupyter notebook interface, see The Jupyter Notebook User Documentation.

9. To test your script, run the entire script, or individual cells. Any command output will be displayed in the area beneath the cell.


10. After you have finished developing your script, you can save the job and then run it. For more information about running jobs, see Start a job run (p. 106).

Notebook editor components

The notebook editor interface has the following main sections.

• Notebook interface (main panel) and toolbar

• Job editing tabs

The notebook editor

The AWS Glue Studio notebook editor is based on the Jupyter Notebook Application. The AWS Glue Studio notebook interface is similar to that provided by Jupyter Notebooks, which is described in the section Notebook user interface. The notebook used by interactive sessions is a Jupyter Notebook.

Although the AWS Glue Studio notebook is similar to Jupyter Notebooks, it differs in a few key ways:

• currently, the AWS Glue Studio notebook cannot install extensions

• you cannot use multiple tabs; there is a 1:1 relationship between a job and a notebook

• the AWS Glue Studio notebook does not have the same top file menu that exists in Jupyter Notebooks

• currently, the AWS Glue Studio notebook only runs with the AWS Glue kernel

AWS Glue Studio job editing tabs

The tabs that you use to interact with the ETL job are at the top of the notebook page. They are similar to tabs that appear in the visual job editor of AWS Glue Studio, and they perform the same actions.

• Notebook – Use this tab to view the job script using the notebook interface.

• Job details – Configure the environment and properties for the job runs.

• Runs – View information about previous runs of this job.

• Schedules – Configure a schedule for running your job at specific times.

Saving your notebook and job script

You can save your notebook and the job script you are creating at any time. Simply choose the Save button in the upper right corner, the same as if you were using the visual or script editor.

When you choose Save, the job script and notebook file are saved in the locations you specified.

• The job script is saved to the Amazon S3 location indicated by the job property Script path, in the Scripts folder.

• The notebook file (.ipynb) is saved to the Amazon S3 location indicated by the job property Script path, in the Notebooks folder.

When you save the job, the job script contains only the code cells from the notebook. The Markdown cells aren't included.

After you save the job, you can then run the job using the script that you created in the notebook.
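The split between code cells and Markdown cells described above can be illustrated with a short sketch. It is not how AWS Glue Studio generates the script; it only assumes the standard .ipynb layout (a JSON document with a "cells" list, where each cell has a "cell_type" and a "source"). The function name code_cells_to_script is hypothetical.

```python
import json


def code_cells_to_script(notebook_json: str) -> str:
    """Return the source of all code cells, in order, as one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # Markdown (and raw) cells are left out of the script
        source = cell.get("source", "")
        # The notebook format allows "source" as a string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# A tiny in-memory notebook with one Markdown cell and two code cells
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Load the data\n"]},
        {"cell_type": "code", "source": ["x = 1\n", "y = x + 1\n"]},
        {"cell_type": "code", "source": "print(y)"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})
script = code_cells_to_script(example)
print(script)
```

Only the two code cells appear in the printed script; the Markdown heading is dropped.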


Managing notebook sessions

Notebooks in AWS Glue Studio are based on the interactive sessions feature of AWS Glue. There is a cost for using interactive sessions. To help manage your costs, you can monitor the sessions created for your account, and configure the default settings for all sessions.

Change the default timeout for all notebook sessions

By default, a notebook (interactive) session in AWS Glue Studio times out after 1 hour.

To modify the default session timeout for notebooks in AWS Glue Studio

1. In the notebook, enter the %idle_timeout magic in a cell and specify the timeout value in minutes.

For example, %idle_timeout 15 changes the default timeout from 60 minutes to 15 minutes. If the session is not used for 15 minutes, it is automatically stopped.

Installing additional Python modules

To install additional Python modules in your session using pip, use the %additional_python_modules magic:

%additional_python_modules awswrangler, s3://mybucket/mymodule.whl

All arguments to additional_python_modules are passed to pip3 install -m <>

To view a list of available Python modules, see Using Python Libraries with AWS Glue.

Changing AWS Glue Configuration

AWS Glue supports various worker types. You can set the worker type with %worker_type. For example: %worker_type G.2X. The default is G.1X.

You can also specify the number of workers with %number_of_workers. For example, to specify 40 workers: %number_of_workers 40.

For more information, see Defining Job Properties.
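The session magics described in this section are typed directly into notebook cells. As a convenience, the text for such a cell could be assembled from a few settings, as in the sketch below. The helper session_magics is hypothetical and is not part of AWS Glue; it only formats the magic lines exactly as they are written in this guide.

```python
def session_magics(idle_timeout_minutes=None, additional_python_modules=None,
                   worker_type=None, number_of_workers=None):
    """Build the text of a notebook cell that sets session options."""
    lines = []
    if idle_timeout_minutes is not None:
        lines.append(f"%idle_timeout {idle_timeout_minutes}")
    if additional_python_modules:
        # Modules are listed comma-separated, as in the example in this guide
        lines.append("%additional_python_modules "
                     + ", ".join(additional_python_modules))
    if worker_type is not None:
        lines.append(f"%worker_type {worker_type}")
    if number_of_workers is not None:
        lines.append(f"%number_of_workers {number_of_workers}")
    return "\n".join(lines)


cell_text = session_magics(
    idle_timeout_minutes=15,
    additional_python_modules=["awswrangler", "s3://mybucket/mymodule.whl"],
    worker_type="G.2X",
    number_of_workers=40,
)
print(cell_text)
```

The printed text can be pasted into a cell and run before any other code in the session.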

Stop a notebook session

To stop a notebook session, use the magic %stop_session.

If you navigate away from the notebook in the AWS console, you will receive a warning message where you can choose to stop the session.


AWS Glue Visual Job API (Preview)

The Visual Job API is in preview release for AWS Glue and is subject to change.

Topics

• API design and CRUD APIs

• Getting started

• API design and CRUD APIs

• SDK Onboarding

• Appendix: Visual Job Examples and Model Definitions

AWS Glue provides an API that allows customers to create data integration jobs using the AWS Glue API from a JSON object that represents a DAG. Customers can then use the visual editor in AWS Glue Studio to work with these jobs.

API design and CRUD APIs

The create and update Job APIs now support an additional optional parameter, codeGenConfigurationNodes. Providing a non-empty JSON structure for this field results in the DAG being registered in AWS Glue Studio for the created job and the associated code being generated. A null value or empty string for this field on job create is ignored.

Updates to the codeGenConfigurationNodes field are made through the update-job AWS Glue API in a similar way as create-job. Specify the entire field in update-job, with the DAG changed as desired. A null value is ignored and no update to the DAG is performed.

An empty structure or string causes codeGenConfigurationNodes to be set as empty and any previous DAG to be removed. The get-job API returns a DAG if one exists. The delete-job API also needs to delete any associated DAG.
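The three update cases described above (null, empty, and non-empty) can be summarized in a small model. This is only a sketch of the behavior as stated in the text, not service code; resolve_dag_update is a hypothetical name, and Python's None stands in for a null value.

```python
from typing import Optional


def resolve_dag_update(current: Optional[dict], requested) -> Optional[dict]:
    """Model what update-job does with codeGenConfigurationNodes."""
    if requested is None:
        return current    # null: ignored, the existing DAG is unchanged
    if requested == {} or requested == "":
        return None       # empty structure or string: previous DAG removed
    return requested      # non-empty: the entire field replaces the old DAG


existing = {"node-1": {"S3CatalogSource": {"Name": "S3 bucket"}}}
replacement = {"node-9": {"CatalogSource": {"Name": "AWS Glue Data Catalog"}}}

unchanged = resolve_dag_update(existing, None)
removed = resolve_dag_update(existing, {})
replaced = resolve_dag_update(existing, replacement)
print(unchanged, removed, replaced, sep="\n")
```

Because the whole field is replaced on a non-empty update, a caller that wants to change one node has to send the complete DAG again.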

Getting started

Follow the SDK Onboarding (p. 9).

To create a job, use the createJob function. The CreateJobRequest input has an additional field, 'codeGenConfigurationNodes', where you can specify the DAG object in JSON.

Things to keep in mind:

• The 'codeGenConfigurationNodes' field is a map of nodeId to node.

• Each node begins with a key identifying what kind of node it is.

• There can only be one key specified since a node can only be of one type.

• The Inputs field contains the parent nodes of the current node.

The following is a JSON representation of a createJob input.


{
  "Name": "myjob1",
  "Role": "arn:aws:iam::253723508848:role/myrole",
  "Description": "",
  "GlueVersion": "2.0",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://myscripts/myjob1.py",
    "PythonVersion": "3"
  },
  "MaxRetries": 3,
  "Timeout": 2880,
  "ExecutionProperty": {
    "MaxConcurrentRuns": 1
  },
  "NotificationProperty": {},
  "DefaultArguments": {
    "--class": "GlueApp",
    "--job-language": "python",
    "--job-bookmark-option": "job-bookmark-enable",
    "--TempDir": "s3://assets/temporary/",
    "--enable-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://assets/sparkHistoryLogs/",
    "--encryption-type": "sse-s3",
    "--enable-glue-datacatalog": "true"
  },
  "Tags": {},
  "DeveloperMode": false,
  "WorkerType": "G.1X",
  "NumberOfWorkers": 10,
  "CodeGenConfigurationNodes": {
    "node-1": {
      "S3CatalogSource": {
        "Database": "database",
        "Name": "S3 bucket",
        "Table": "table1"
      }
    },
    "node-2": {
      "ApplyMapping": {
        "Mapping": [
          { "FromPath": ["col0"], "ToKey": "col0", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col1"], "ToKey": "col1", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col2"], "ToKey": "col2", "ToType": "string", "FromType": "string", "Dropped": false },
          { "FromPath": ["col3"], "ToKey": "col3", "ToType": "string", "FromType": "string", "Dropped": false }
        ],
        "Inputs": ["node-1"],
        "Name": "ApplyMapping"
      }
    },
    "node-3": {
      "S3CatalogTarget": {
        "Path": "s3://mypath/",
        "UpdateCatalogOptions": "none",
        "Inputs": ["node-1", "node-2"],
        "SchemaChangePolicy": {
          "enableUpdateCatalog": false
        },
        "Name": "S3 bucket",
        "Format": "json",
        "PartitionKeys": [],
        "Compression": "none"
      }
    }
  }
}
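The rules listed earlier (one type key per node, and Inputs naming the parent nodes) can be checked locally before a request is sent. The following is a minimal sketch and not part of the AWS Glue API: validate_nodes is a hypothetical helper, and the rule that every Inputs entry must be another key in the same map is an assumption inferred from the examples.

```python
def validate_nodes(nodes: dict) -> list:
    """Return a list of problems found in a codeGenConfigurationNodes map."""
    problems = []
    for node_id, node in nodes.items():
        # A node can only be of one type, so it must have exactly one key
        if not isinstance(node, dict) or len(node) != 1:
            problems.append(f"{node_id}: expected exactly one node-type key")
            continue
        node_type, body = next(iter(node.items()))
        # Assumption: each parent listed in Inputs is a node ID in this map
        for parent in body.get("Inputs", []):
            if parent not in nodes:
                problems.append(f"{node_id}: unknown input {parent!r}")
    return problems


good = {
    "node-1": {"S3CatalogSource": {"Database": "database", "Name": "S3 bucket", "Table": "table1"}},
    "node-2": {"ApplyMapping": {"Mapping": [], "Inputs": ["node-1"], "Name": "ApplyMapping"}},
}
bad = {
    "node-1": {"S3CatalogSource": {}, "CatalogSource": {}},   # two type keys
    "node-2": {"ApplyMapping": {"Inputs": ["node-7"]}},       # missing parent
}
good_problems = validate_nodes(good)
bad_problems = validate_nodes(bad)
print(good_problems)
print(bad_problems)
```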

The following is a more complex example:

{
  "Name": "myjob2",
  "Role": "arn:aws:iam::253723508848:role/myrole",
  "Description": "",
  "GlueVersion": "2.0",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://myscripts/myjob1.py",
    "PythonVersion": "3"
  },
  "MaxRetries": 3,
  "Timeout": 2880,
  "ExecutionProperty": { "MaxConcurrentRuns": 1 },
  "node-3": {
    "S3DirectTarget": {
      "Path": "s3://mypath/",
      "UpdateCatalogOptions": "none",
      "Inputs": ["node-1624994219677"],
      "SchemaChangePolicy": { "EnableUpdateCatalog": false },
      "Name": "S3 bucket",
      "Format": "json",
      "PartitionKeys": [],
      "Compression": "none"
    }
  },
  "node-1624994205115": {
    "CatalogSource": {
      "Name": "AWS Glue Data Catalog",
      "Database": "database2",
      "Table": "table2"
    }
  },
  "node-1624994219677": {
    "Join": {
      "Name": "Join",
      "Inputs": ["node-1624994205115", "node-2"],
      "JoinType": "equijoin",
      "Columns": [
        { "From": "node-1624994205115", "Keys": ["firstname"] },
        { "From": "node-2", "Keys": ["col0"] }
      ],
      "ColumnConditions": ["="]
    }
  },
  "node-2": {
    "S3CatalogSource": {
      "Database": "database",
      "Input": ["node-1624994219677"],
      "Name": "S3 bucket",
      "Format": "json",
      "PartitionKeys": [],
      "Compression": "none"
    }
  }
}

Updating jobs

Since updateJob also has a 'codeGenConfigurationNodes' field, the input format is the same. The get-job command returns a 'codeGenConfigurationNodes' field in the same format as well.


API design and CRUD APIs

Since the 'codeGenConfigurationNodes' parameter has been added to existing APIs, any limitations in those APIs are inherited. In addition, codeGenConfigurationNodes and some nodes are limited in size. See the Appendix for the full set of limitations. The limitations are applied on a per-field basis.

The general shape is:

{

"Nodeid-1": {...}, "Nodeid-2": {...}

}

Things to note:

• keys uniquely identify a node

• the body contains the node specification

• every node will have a strongly typed model

• an exhaustive list of node models is present in the Appendix section
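Because the keys identify nodes and each node body can name its parents in Inputs, the map describes a DAG that can be walked in dependency order. The sketch below uses the graphlib module from the Python standard library (Python 3.9 or later); run_order is a hypothetical helper, and the node bodies are trimmed versions of the examples in this chapter.

```python
from graphlib import TopologicalSorter


def run_order(nodes: dict) -> list:
    """Return node IDs ordered so that every parent precedes its children."""
    graph = {}
    for node_id, node in nodes.items():
        body = next(iter(node.values()))   # the single node-type body
        graph[node_id] = set(body.get("Inputs", []))
    # TopologicalSorter expects a mapping of node -> predecessors
    return list(TopologicalSorter(graph).static_order())


nodes = {
    "node-3": {"S3CatalogTarget": {"Inputs": ["node-1", "node-2"]}},
    "node-2": {"ApplyMapping": {"Inputs": ["node-1"]}},
    "node-1": {"S3CatalogSource": {"Name": "S3 bucket"}},
}
order = run_order(nodes)
print(order)
```

If the Inputs references form a cycle, static_order raises graphlib.CycleError, which is one way to detect a map that is not a valid DAG.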

SDK Onboarding

To access the required files, go to the GitHub repository as described below.

CLI

Go to the GitHub repository and download the service-2.json file. If you're using Mac or Linux, place this file in the folder ~/.aws/models/glue/2017-03-31. If .aws does not exist, you have to configure the AWS CLI first; see the AWS CLI installation instructions. If you do not have the other folders, you can create them manually. The CLI with this custom model can be used in the same way that the CLI is normally used.
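The manual steps above (create the folders if needed, then copy the file) could be scripted as follows. This is a sketch for Mac or Linux only; install_glue_model is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real home directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def install_glue_model(model_file: Path, home: Optional[Path] = None) -> Path:
    """Copy service-2.json into <home>/.aws/models/glue/2017-03-31."""
    home = home or Path.home()
    target_dir = home / ".aws" / "models" / "glue" / "2017-03-31"
    target_dir.mkdir(parents=True, exist_ok=True)   # create missing folders
    target = target_dir / "service-2.json"
    shutil.copyfile(model_file, target)
    return target


# Demonstration with a stand-in file and a fake home directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    downloaded = tmp_path / "service-2.json"
    downloaded.write_text("{}")
    fake_home = tmp_path / "fake_home"
    installed = install_glue_model(downloaded, home=fake_home)
    relative = installed.relative_to(fake_home).as_posix()
    copied_ok = installed.read_text() == "{}"
print(relative)
```

Calling install_glue_model with only the downloaded file uses the current user's home directory.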

Java SDK

For older Java clients, a JAR called AwsGlueJavaClient-1.12.x.jar is available on the GitHub repository.

To use the newer AWS SDK for Java 2.x, a JAR called AwsJavaSdk-Glue-2.0.jar is available on the GitHub repository.

Add the JAR to your class path in your preferred way. After the JAR is added to your class path, it can be used in the same way as you are using the existing AWS Glue SDK.

Appendix: Visual Job Examples and Model Definitions

This appendix provides examples and model definitions of sources, data targets, and transforms.

Examples

Sources


S3CSVSource from a Glue catalog table:

{
  "Database": "database",
  "Table": "table1",
  "Name": "S3 bucket",
  "IsCatalog": true
}

CatalogSource for RDS:

{
  "Database": "database",
  "Table": "rdsSource",
  "Name": "MyRdsSource",
  "IsCatalog": true
}

Data Targets

S3CatalogTarget

{
  "Inputs": ["node-1625147321253"],
  "Database": "dbl",
  "Table": "s3Table",
  "Name": "s3 bucket",
  "Format": "json",
  "PartitionKeys": ["col1"],
  "UpdateCatalogOptions": "schemaAndPartitions",
  "SchemaChangePolicy": {
    "EnableUpdateCatalog": true,
    "UpdateBehavior": "UPDATE_IN_DATABASE"
  }
}

S3DirectTarget

{
  "Path": "s3://mypath/",
  "UpdateCatalogOptions": "none",
  "Inputs": ["node-2"],
  "SchemaChangePolicy": {
    "EnableUpdateCatalog": false
  },
  "Name": "S3 bucket",
  "Format": "json",
  "PartitionKeys": [],
  "Classification": "DataSink",
  "Compression": "none"
}

Transforms

Rename Field

{
  "Inputs": ["node-1"],
  "Name": "MyRenameField",
  "SourcePath": "col3",
  "TargetPath": "name"
}

Filter

{
  "Name": "Filter",
  "Inputs": ["node-2"],
  "LogicalOperator": "AND",
  "Filters": [
    {
      "Operation": "ISNULL",
      "Negated": false,
      "Values": [
        { "Type": "COLUMNEXTRACTED", "Value": "col1" }
      ]
    },
    {
      "Operation": "REGEX",
      "Negated": false,
      "Values": [
        { "Type": "CONSTANT", "Value": ".*" },
        { "Type": "COLUMNEXTRACTED", "Value": "col2" }
      ]
    }
  ]
}

Model Definitions

Sources


AthenaConnector

{

"Name": 100 character String Required,

"ConnectionName": 256 character String Required, "ConnectorName": 256 character String,

"ConnectionType": 256 character String Required, "ConnectionTable": 256 character String Required, "SchemaName": 256 character String Required }

JDBCConnector

JDBCDataType is an enum. To see a full list of possible values, see JDBCDataType.

{

"ConnectionType": 256 character String Required, "AdditionalOptions": JDBCConnectorOptions, "ConnectionTable": 256 character String, "Query": 256 character String

}

JDBCConnectorOptions:

{
  "FilterPredicate": 256 character String,
  "PartitionColumn": 256 character String,
  "LowerBound": Non-Negative Long,
  "UpperBound": Non-Negative Long,
  "NumPartitions": Non-Negative Long,
  "JobBookmarkKeys": List of Strings up to 100,
  "JobBookmarkKeysSortOrder": ASC or DESC,
  "DataTypeMapping": Map<JDBCDataType, JDBCDataType>
}

SparkConnectorSource

{

"ConnectionType": 256 character String Required, "AdditionalOptions": Map<256 character String, Object>

}

CatalogSource:

{

"Name": 100 character String Required, "Database": 256 character String Required, "Table": 256 character String Required }


CatalogKinesisSource – is a kind of CatalogSource

{

"Name": 100 character String Required, "Database": 256 character String Required, "WindowSize": Positive Integer,

"DetectSchema": Boolean,

"StreamingOptions": KinesisStreamingSourceOptions, "Table": 256 character String Required

}

KinesisStreamingSourceOptions

{
  "EndpointUrl": 256 character String,
  "StreamName": 256 character String,
  "Classification": 256 character String,
  "Delimiter": 256 character String,
  "StartingPosition": LATEST or TRIM_HORIZON or EARLIEST,
  "MaxFetchTimeInMs": Non-negative Long,
  "MaxFetchRecordsPerShard": Non-negative Long,
  "MaxRecordsPerRead": Non-negative Long,
  "AddIdleTimeBetweenReads": Boolean,
  "IdleTimeBetweenReadsInMs": Non-negative Long,
  "DescribeShardInterval": Non-negative Long,
  "NumRetries": Positive Integer,
  "RetryIntervalInMs": Non-negative Long,
  "MaxRetryIntervalMs": Non-negative Long,
  "AvoidEmptyBatches": Boolean,
  "StreamARN": 256 character String,
  "AwsSTSRoleARN": 256 character String,
  "AwsSTSSessionName": 256 character String
}

DirectKinesisSource

{

"Name":100 character String Required, "WindowSize": Positive Integer, "DetectSchema": Boolean,

"StreamingOptions": KinesisStreamingSourceOptions }

CatalogKafkaSource

{
  "Name": 100 character String Required,
  "Database": 256 character String Required,
  "WindowSize": Positive Integer,
  "DetectSchema": Boolean,
  "StreamingOptions": KafkaStreamingSourceOptions,
  "Table": 256 character String Required
}

KafkaStreamingSourceOptions


{

"BootstrapServers":256 character String, "SecurityProtocol":256 character String, "ConnectionName":256 character String, "TopicName":256 character String, "Assign":256 character String,

"SubscribePattern":256 character String, "Classification":256 character String, "Delimiter":256 character String, "StartingOffsets":256 character String, "EndingOffsets":256 character String, "PollTimeoutInMs": Non-negative long, "NumRetries": Positive integer, "RetryIntervalMs": Non-negative long, "MaxOffsetsPerTrigger": Non-negative long, "MinPartitions": Non-negative integer }

DirectKafkaSource

{

"Name":100 character String Required, "WindowSize": Positive Integer, "DetectSchema": Boolean,

"StreamingOptions": KafkaStreamingSourceOptions }

RedshiftSource - is a kind of CatalogSource:

{

"Name":100 character String Required, "Database": 256 character String Required, "Table": 256 character String Required, "RedshiftTmpDir":256 character String, "TmpDirIAMRole":256 character String }

S3CatalogSource

{
  "Name": 100 character String Required,
  "Database": 256 character String Required,
  "Table": 256 character String Required,
  "S3SourceAdditionalOptions": {
    //Only one can be specified, or neither
    "BoundedSize": Nullable Long,
    "BoundedFiles": Nullable Long
  }
}
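The constraints in a model such as S3CatalogSource can be checked on the client side before calling the API. The sketch below is not an official validator: check_s3_catalog_source is a hypothetical helper, and it assumes that "100 character String" means a maximum length of 100 characters and that "Required" means the field must be present as a string.

```python
def check_s3_catalog_source(node: dict) -> list:
    """Return a list of problems found in an S3CatalogSource body."""
    errors = []
    # Assumption: the character counts in the model are maximum lengths
    limits = {"Name": 100, "Database": 256, "Table": 256}
    for field, max_len in limits.items():
        value = node.get(field)
        if not isinstance(value, str):
            errors.append(f"{field} is required")
        elif len(value) > max_len:
            errors.append(f"{field} is longer than {max_len} characters")
    # Only one of BoundedSize and BoundedFiles can be specified, or neither
    options = node.get("S3SourceAdditionalOptions") or {}
    bounded = [key for key in ("BoundedSize", "BoundedFiles")
               if options.get(key) is not None]
    if len(bounded) > 1:
        errors.append("only one of BoundedSize and BoundedFiles can be specified")
    return errors


valid = {"Name": "S3 bucket", "Database": "database", "Table": "table1",
         "S3SourceAdditionalOptions": {"BoundedFiles": 10}}
invalid = {"Name": "x" * 101, "Database": "database",
           "S3SourceAdditionalOptions": {"BoundedSize": 1, "BoundedFiles": 1}}
valid_errors = check_s3_catalog_source(valid)
invalid_errors = check_s3_catalog_source(invalid)
print(valid_errors)
print(invalid_errors)
```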

S3CSVSource

{
  "Name": 100 character String Required,
  "Paths": List of Strings. Up to 100 256 character Strings Required,
  "CompressionType": gzip or bzip2,
  "Exclusions": List of Strings. Up to 100 256 character Strings,
  "GroupFiles": 256 character String,
  "GroupSize": 256 character String,
  "Recurse": Boolean,
  "MaxBand": Integer,
  "MaxFilesInBand": Non negative Integer,
  "S3SourceAdditionalOptions": {
    //Only one can be specified, or neither
    "boundedSize": Nullable Long,
    "boundedFiles": Nullable Long
  },
  "Separator": 256 character String,
  "Escaper": 256 character String,
  "QuoteChar": 256 character String,
  "Multiline": Boolean,
  "WithHeader": Boolean,
  "WriteHeader": Boolean,
  "SkipFirst": Boolean,
  "UseArrowColumnVectors": Boolean,
  "UseSimdCsvParser": Boolean
}

S3JSONSource

{

"MaxBand": Non negative Integer, "MaxFilesInBand": Non negative Integer, "S3SourceAdditionalOptions": {

"BoundedFiles":Nullable Long },

"JsonPath":256 character String, "Multiline":Boolean

}

S3ParquetSource

{
  "MaxBand": Non negative Integer,
  "MaxFilesInBand": Non negative Integer,
  "S3SourceAdditionalOptions": {
    "BoundedFiles": Nullable Long
  }
}

Targets

JDBCConnectorTarget

{

"Inputs": List of Strings. One 256 character String Required, "ConnectionName":256 character String Required,

"ConnectionTable":256 character String, "ConnectorName":256 character String,

"ConnectionType":256 character String Required, "ConnectionTypeSuffix":256 character String,

"AdditionalOptions":Map<256 character String,Object>

}

SparkConnectorTarget

{
  "Inputs": List of Strings. One 256 character String Required,
  "ConnectionName": 256 character String Required,
  "ConnectionTable": 256 character String,
  "ConnectorName": 256 character String,
  "ConnectionType": 256 character String Required,
  "ConnectionTypeSuffix": 256 character String,
  "AdditionalOptions": Map<256 character String,Object>
}

CatalogTarget

{

"Inputs": List of Strings. One 256 character String Required, "Database":256 character String Required,

"Table":256 character String Required }

RedshiftTarget

{ "Name":100 character String Required, "Inputs": List of Strings. One 256 character String Required,

"Database":256 character String Required, "Table":256 character String Required, "RedshiftTmpDir":256 character String, "TmpDirIAMRole":256 character String }

S3CatalogTarget

{
  "Inputs": List of Strings. One 256 character String Required,
  "SchemaChangePolicy": SchemaChangePolicy,
  "PartitionKeys": List of Strings. Up to 100 256 character Strings,
  "Database": 256 character String Required,
  "Table": 256 character String Required
}

SchemaChangePolicy:

{

"EnableUpdateCatalog": Boolean,

"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE"

}

S3DirectTarget

{

"Inputs": List of Strings. One 256 character String Required, "PartitionKeys": List of Strings. Up to 100 256 character Strings, "Path":256 character String Required,

"Compression": gzip or bzip2,

"Format":json, csv, avro, orc, or parquet Required, "SchemaChangePolicy": DirectSchemaChangePolicy }

DirectSchemaChangePolicy:

{

"EnableUpdateCatalog": Boolean,

"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE", "Database":256 character String,

"Table":256 character String }

Transforms

ApplyMapping

See the end of the document for the possible values of ApplyMappingType

{

"Inputs": List of Strings. One 256 character String Required, "Mapping":List of up to 250 Mapping Required

}

Mapping:

{

"ToKey":256 character String Required,

"FromPath": List of Strings. One 256 character String Required, "FromType":ApplyMappingType Required,

"ToType": ApplyMappingType Required, "Dropped":Boolean,

"Children": List of up to 250 Mapping }

SelectFields

{
  "Inputs": List of Strings. One 256 character String Required,
  "Paths": List of Strings. Up to 100 256 character Strings Required
}

DropFields

{
  "Inputs": List of Strings. One 256 character String Required,
  "Paths": List of Strings. Up to 100 256 character Strings Required
}

RenameField

{
  "Inputs": List of Strings. One 256 character String Required,
  "SourcePath": List of Strings. Up to 100 256 character Strings Required,
  "TargetPath": 256 character String Required
}

Spigot

{

"name":100 character String Required,

"inputs": List of Strings. One 256 character String Required, "path":256 character String Required,

"topk":Integer from 0 to 100, "prob":Double from 0 to 1.0 }

Join

{
  "Name": 100 character String Required,
  "Inputs": List of Strings. Two 256 character String Required,
  "JoinType": equijoin, left, right, outer, leftsemi, or leftanti Required,
  "Columns": List[Column] Required
}

Column:

{
  "From": 256 character String Required,
  "Keys": List[String] Required
}

SplitFields

{
  "Inputs": List of Strings. One 256 character String Required,
  "Paths": List of Strings. Up to 100 256 character Strings Required
}

SelectFromCollection

{

"Inputs": List of Strings. One 256 character String Required, "Index": Non Negative Integer Required

}

FillMissingValues

{
  "Inputs": List of Strings. One 256 character String Required,
  "ImputedPath": 256 character String Required,
  "FilledPath": 256 character String
}

Filter

{

"Inputs": List of Strings. One 256 character String Required, "LogicalOperator":String Required,

"Filters":List[FilterInstance] Required }

FilterInstance:

{

"Operation": "EQ" | "LT" | "GT" | "LTE" | "GTE" | "REGEX" | "ISNULL" Required,

"Negated":Boolean,

"Values":List[FilterValue] Required }

FilterValue:

{
  "Type": "COLUMNEXTRACTED" | "CONSTANT" Required,
  "Value": Object Required
}

CustomCode

{

"Inputs": List of Strings. One to fifty 256 character String Required, "Code":Up to 51,200 character string or 50 KB Required,

"ClassName":256 character String Required }

SparkSQL


{

"Inputs": List of Strings. One to fifty 256 character String Required, "SqlQuery": Up to 51,200 character string or 50 KB Required,

"SqlAliases":List of Alias. Up to 256 Aliases Required }

Alias:

{

"From":256 character String Required, "Alias":256 character String Required }

DropNullFields

{
  "Inputs": List of Strings. One to fifty 256 character String Required,
  "Paths": List of Strings. Up to 100 256 character Strings Required,
  "NullCheckBoxList": NullCheckBoxList,
  "NullTextList": List of NullValueField. Up to 50 NullValueField.
}

NullCheckBoxList

{
  "IsEmpty": Boolean,
  "IsNullString": Boolean,
  "IsNegOne": Boolean
}

NullValueField

{
  "Value": 256 character String,
  "DataType": DataType
}

DataType

{
  "Id": 256 character String,
  "Label": 256 character String
}

Union

{ "Name":100 character String Required,

"Inputs":List of Strings. Two 256 character String Required, "Sources":List of Strings. Two 256 character String,

"UnionType": ALL or DISTINCT Required }

Enums

JDBCDataType

ARRAY,BIGINT,BINARY,BIT,BLOB,BOOLEAN,CHAR,CLOB,DATALINK,DATE,DECIMAL,DISTINCT,DOUBLE,FLOAT,INTEGER,JAVA_OBJECT,LONGNVARCHAR,LONGVARBINARY,LONGVARCHAR,NCHAR,NCLOB,NULL,NUMERIC,NVARCHAR,OTHER,REAL,REF,REF_CURSOR,ROWID,SMALLINT,SQLXML,STRUCT,TIME,TIME_WITH_TIMEZONE,TIMESTAMP,TIMESTAMP_WITH_TIMEZONE,TINYINT,VARBINARY,VARCHAR


ApplyMappingType

bigint,binary,boolean,char,date,decimal,double,float,int,interval,long,smallint,string,timestamp,tinyint,varchar


Detect PII using AWS Glue Studio

The Detect PII transform is in preview release for AWS Glue Studio and is subject to change.

Note

Using the Detect PII transform in AWS Glue Studio jobs requires AWS Glue 2.0.

The Detect PII transform identifies Personally Identifiable Information (PII) in your data source. You choose the PII entity to identify, how you want the data to be scanned, and what to do with the PII entities that have been identified by the Detect PII transform.

Topics

• Choosing how you want the data to be scanned

• Choosing the PII entities to take action on

• Choosing what to do with identified PII data

Choosing how you want the data to be scanned

You can choose to detect PII in the entire data source, or to detect the fields (columns) that contain PII.

When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a comprehensive scan to ensure that PII entities are identified.

When you choose Detect fields containing PII, you’re choosing to scan a sample of rows for PII entities. This is a way to keep costs and resources low while also identifying the fields where PII entities are found.

When you choose to detect fields that contain PII, you can reduce costs and improve performance by sampling a portion of rows. Choosing this option allows you to specify additional options:

• Sample portion: This allows you to specify the percentage of rows to sample. For example, if you enter ‘50’, you’re specifying that you want 50 percent of the rows to be scanned for the PII entity.

• Detection threshold: This allows you to specify the percentage of scanned rows that must contain the PII entity for the entire column to be identified as having the PII entity. For example, if you enter ‘10’ and the PII entity is US Phone, at least 10 percent of the scanned rows must contain a US Phone value for the field to be identified as having that entity. If fewer than 10 percent of the scanned rows contain it, the field is not labeled as having the US Phone entity.
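The interaction between the sample portion and the detection threshold can be worked through in a few lines. This sketch is not the detector that AWS Glue Studio uses: the function column_has_entity, the US_PHONE regular expression, and the seeded random sampling are all assumptions made for illustration.

```python
import random
import re

# Hypothetical pattern for illustration; not the AWS Glue detection logic
US_PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")


def column_has_entity(values, pattern, sample_portion=100, threshold=10, seed=0):
    """Decide whether a column is labeled as containing a PII entity.

    sample_portion: percentage of rows to scan.
    threshold: minimum percentage of scanned rows that must match.
    """
    if not values:
        return False
    sample_size = max(1, round(len(values) * sample_portion / 100))
    sample = random.Random(seed).sample(values, sample_size)
    hits = sum(1 for value in sample if pattern.match(str(value)))
    return hits * 100 / sample_size >= threshold


# One phone number in ten rows: exactly 10 percent of the scanned rows match
column = ["555-123-4567"] + ["n/a"] * 9
labeled_at_10 = column_has_entity(column, US_PHONE, sample_portion=100, threshold=10)
labeled_at_20 = column_has_entity(column, US_PHONE, sample_portion=100, threshold=20)
print(labeled_at_10, labeled_at_20)
```

With a sample portion below 100, the outcome depends on which rows are drawn, which is the trade-off between cost and completeness described above.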


Choosing the PII entities to take action on

You can specify one or more PII entities that you want to detect and take action on.

• ITIN (US)

• Email

• Passport Number (US)

• US Phone

• Credit Card

• Bank Account (US, Canada)

• US Driving License

• IP Address

• MAC Address

• DEA Number (US)

• HCPCS Code (US)

• National Provider Identifier (US)

• National Drug Code (US)

• Health Insurance Claim Number (US)

• Medicare Beneficiary Identifier (US)

• CPT Code (US)


Choosing what to do with identiﬁed PII data

If you chose to detect PII in the entire datasource, you can choose to:

• Enrich data with detection results: If you chose Detect PII in each cell, you can store the detected entities into a new column.

• Redact detected text: You can replace the detected PII value with a string that you specify in the optional Replacing text input ﬁeld. If no string is speciﬁed, the detected PII entity is replaced with '*******'.

If you chose to detect fields containing PII, you can choose to take the following actions:

• Output Detection Results: This creates a new dataframe with the detected PII information for each column.

• Redact detected text: You can replace the detected PII value with a string that you specify. If no string is speciﬁed, the detected PII entity is replaced with '*******'.
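The redaction behavior described above can be approximated with a short Python sketch. It assumes a simplified, hypothetical email pattern and a helper named `redact`; neither comes from AWS Glue Studio. The only detail taken from this guide is the default replacement string of seven asterisks.

```python
import re

# Hypothetical, simplified email pattern used only for this illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Default replacement string, as described in this guide.
DEFAULT_REPLACEMENT = "*******"


def redact(text, replacement=None):
    """Replace each detected value with the given string, or the default."""
    mask = replacement if replacement else DEFAULT_REPLACEMENT
    # A function is passed to sub() so the mask is inserted literally,
    # even if it contains backslashes.
    return EMAIL.sub(lambda _match: mask, text)


masked_default = redact("Contact jane@example.com today")
masked_custom = redact("Contact jane@example.com today", "[REDACTED]")
```

Here `masked_default` holds the text with the address replaced by the default string, and `masked_custom` uses the caller-supplied string instead.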

What is AWS Glue Studio?

AWS Glue Studio is a new graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workﬂows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job.

AWS Glue Studio is designed not only for tabular data, but also for semi-structured data, which is diﬃcult to render in spreadsheet-like data preparation interfaces. Examples of semi-structured data include application logs, mobile events, Internet of Things (IoT) event streams, and social feeds.

When creating a job in AWS Glue Studio, you can choose from a variety of data sources that are stored in AWS services. You can quickly prepare that data for analysis in data warehouses and data lakes. AWS Glue Studio also oﬀers tools to monitor ETL workﬂows and validate that they are operating as intended.

You can preview the dataset for each node. This helps you to debug your ETL jobs by displaying a sample of the data at each step of the job.

AWS Glue Studio provides a visual interface that makes it easy to:

• Pull data from an Amazon S3, Amazon Kinesis, or JDBC source.

• Conﬁgure a transformation that joins, samples, or transforms the data.

• Specify a target location for the transformed data.

• View the schema or a sample of the dataset at each point in the job.

• Run, monitor, and manage the jobs created in AWS Glue Studio.

Features of AWS Glue Studio

AWS Glue Studio helps you to create and manage jobs that gather, transform, and clean data. Advanced users can use AWS Glue Studio to troubleshoot and edit job scripts.

Visual job editor

You can perform the following actions when creating and editing jobs in AWS Glue Studio:

• Add nodes to the job to implement:

• Multiple data sources.

• Multiple data targets.

• Data sources and targets that use connectors for external data stores that were not previously supported.

• View a sample of the data at each node in the job diagram.

• Change the parent nodes for an existing node.

• Add transforms that:

• Join data sources.

• Select speciﬁc ﬁelds from the data.

• Drop ﬁelds.

• Rename ﬁelds.

• Change the data type of ﬁelds.

• Write select ﬁelds from the data into a JSON ﬁle in an Amazon S3 bucket (spigot).

• Filter out data from a dataset.

• Split a dataset into two datasets.

• Find missing values in a dataset and provide the missing value in a separate column.

• Use SQL to query and transform the data.

• Use custom code.

Notebook interface for interactively developing and debugging job scripts

AWS Glue Studio provides an enhanced notebook experience with one-click setup for easy job authoring and data exploration. The notebook and connections are configured automatically for you. You can use the notebook interface, which is based on Jupyter Notebook, to interactively develop, debug, and deploy scripts and workflows using the AWS Glue serverless Apache Spark ETL infrastructure. You can perform ad hoc queries, data analysis, and visualization (for example, tables and graphs) in the notebook environment.

The notebook editor interface in AWS Glue Studio oﬀers the following features:

• No cluster to provision or manage

• No cost for idle clusters waiting to run notebooks

• No up-front conﬁguration required

• No resource contention for the same development environment

• Easy installation and usage

• Test in the exact same run environment that your AWS Glue ETL jobs run in

Job script code editor

AWS Glue Studio also has a script editor for writing or customizing the extract-transform-and-load (ETL) code for your jobs. You can use the visual editor in AWS Glue Studio to quickly design your ETL job and then edit the generated script to write code for the unique components of your job.

When creating a new job, you can choose to write scripts for Spark jobs or Python shell jobs. You can code the job ETL script for Spark jobs using either Python or Scala. If you create a Python shell job, the job ETL script uses Python 3.6.

The script editor interface in AWS Glue Studio oﬀers the following features:

• Insert, modify, and delete sources, targets, and transforms in your script.

• Add or modify arguments for data sources, targets, and transforms.

• Syntax and keyword highlighting.

• Auto-completion suggestions for local words, Python keywords, and code snippets.

Job performance dashboard

AWS Glue Studio provides a comprehensive run dashboard for your ETL jobs. The dashboard displays information about job runs from a speciﬁc time frame. The information displayed on the dashboard includes:

• Jobs overview summary – A high-level overview showing total jobs, current runs, completed runs, and failed jobs.

• Status summaries – High-level job metrics based on job properties, such as worker type and job type.

• Job runs timeline – A bar graph summary of successful, failed, and total runs for the currently selected time frame.

• Job run breakdown – A detailed list of job runs from the selected time frame.
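As a rough illustration of what the jobs overview summary aggregates, the following Python sketch counts jobs and runs from a list of job run records. The record layout, the job names, and the `summarize_runs` helper are hypothetical; only the state values (RUNNING, SUCCEEDED, FAILED) correspond to AWS Glue job run states. It does not reflect how the dashboard itself is implemented.

```python
from collections import Counter

# Hypothetical job run records; the field names are illustrative only.
job_runs = [
    {"job": "orders-etl", "state": "SUCCEEDED"},
    {"job": "orders-etl", "state": "FAILED"},
    {"job": "clicks-etl", "state": "RUNNING"},
    {"job": "clicks-etl", "state": "SUCCEEDED"},
]


def summarize_runs(runs):
    """Aggregate run records into overview-style counts."""
    states = Counter(run["state"] for run in runs)
    return {
        # Distinct jobs that appear in the selected runs.
        "total_jobs": len({run["job"] for run in runs}),
        # Runs that are still in progress.
        "current_runs": states["RUNNING"],
        # Runs that finished successfully.
        "completed_runs": states["SUCCEEDED"],
        # Distinct jobs with at least one failed run.
        "failed_jobs": len({run["job"] for run in runs if run["state"] == "FAILED"}),
    }


summary = summarize_runs(job_runs)
```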

Support for dataset partitioning

You can use AWS Glue Studio to efficiently process partitioned datasets. You can load, filter, transform, and save your partitioned data by using SQL expressions or user-defined functions, which avoids listing and reading unnecessary data from Amazon S3.
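The idea behind avoiding unnecessary reads can be shown with a small, self-contained Python sketch of partition pruning. It assumes Hive-style partition paths (`name=value` segments) and uses made-up object keys and helper names; it is a conceptual illustration, not the mechanism AWS Glue uses internally.

```python
def parse_partitions(key):
    """Extract Hive-style partition values (name=value) from an object key."""
    partitions = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(keys, predicate):
    """Keep only the keys whose partition values satisfy the predicate."""
    return [key for key in keys if predicate(parse_partitions(key))]


# Hypothetical object keys for a dataset partitioned by year and month.
keys = [
    "sales/year=2021/month=12/part-0000.parquet",
    "sales/year=2022/month=01/part-0000.parquet",
    "sales/year=2022/month=02/part-0000.parquet",
]

# Only objects under year=2022 would need to be read.
selected = prune(keys, lambda p: p.get("year") == "2022")
```

Because the filter is evaluated on partition values taken from the path, objects in partitions that do not match never have to be opened.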

When should I use AWS Glue Studio?

Use AWS Glue Studio for a simple visual interface to create ETL workﬂows for data cleaning and transformation, and run them on AWS Glue.

AWS Glue Studio makes it easy for ETL developers to create repeatable processes to move and transform large-scale, semi-structured datasets, and load them into data lakes and data warehouses. It provides a boxes-and-arrows style visual interface for developing and managing AWS Glue ETL workflows that you can optionally customize with code. AWS Glue Studio combines the ease of use of traditional ETL tools with the power and flexibility of AWS Glue’s big data processing engine.

AWS Glue Studio provides multiple ways to customize your ETL scripts, including adding nodes that represent code snippets in the visual editor.

Use AWS Glue Studio for easier job management. AWS Glue Studio provides you with job and job run management interfaces that make it clear how jobs relate to each other, and give an overall picture of your job runs. The job management page makes it easy to do bulk operations on jobs (previously diﬃcult to do in the AWS Glue console). All job runs are available in a single interface where you can search and ﬁlter. This gives you a constantly updated view of your ETL operations and the resources you use. You can use the real-time dashboard in AWS Glue Studio to monitor your job runs and validate that they are operating as intended.

Accessing AWS Glue Studio

To access AWS Glue Studio, sign in to AWS as a user that has the required permissions, as described in Set up IAM permissions for AWS Glue Studio (p. 35). Then you can sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Click the AWS Glue Studio link in the navigation pane.

Pricing for AWS Glue Studio

When using AWS Glue Studio, you are charged for data previews. After you specify an IAM role for the job, the visual editor starts an Apache Spark session for sampling your source data and executing transformations. This session runs for 30 minutes and then turns off automatically. AWS charges you for 2 DPUs at the development endpoint rate (DEVED-DPU-Hour), typically resulting in a charge of $0.44 for each 30-minute session. The rate might vary by Region. At the end of the 30-minute session, you can choose Retry on the Data preview tab for any node, or reload the visual editor page, to start a new 30-minute session at the same rate.
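The session charge quoted above follows from simple arithmetic, sketched here in Python. The rate of $0.44 per DPU-hour is inferred from the figures in this section (2 DPUs for half an hour costing $0.44) and is an assumption; check AWS Glue Pricing for the rate in your Region. The function name is hypothetical.

```python
def preview_session_cost(dpus=2, minutes=30, rate_per_dpu_hour=0.44):
    """Estimate the cost of one data preview session in US dollars.

    rate_per_dpu_hour is an assumed rate inferred from this guide.
    """
    # DPU-hours consumed = number of DPUs x session length in hours.
    dpu_hours = dpus * (minutes / 60)
    return round(dpu_hours * rate_per_dpu_hour, 2)


one_session = preview_session_cost()
# Three consecutive sessions, for example after choosing Retry twice.
three_sessions = round(3 * preview_session_cost(), 2)
```

Under these assumptions one session consumes 1 DPU-hour, so each additional 30-minute session adds the same amount to the bill.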

You also pay for the underlying AWS services that your jobs use or interact with, for example, AWS Glue, your data sources, and your data targets. For pricing information, see AWS Glue Pricing.

Setting up for AWS Glue Studio

Complete the tasks in this section when you're using AWS Glue Studio for the ﬁrst time:

Topics

• Complete initial AWS conﬁguration tasks (p. 29)

• Review IAM permissions needed for the AWS Glue Studio user (p. 31)

• Review IAM permissions needed for ETL jobs (p. 34)

• Set up IAM permissions for AWS Glue Studio (p. 35)

• Conﬁgure a VPC for your ETL job (p. 37)

• Populate the AWS Glue Data Catalog (p. 38)

Complete initial AWS conﬁguration tasks

To use AWS Glue Studio you must ﬁrst complete the following tasks:

• Sign up for AWS (p. 29)

• (Recommended) Create an IAM administrator user (p. 29)

• (Recommended) Create an AWS user for AWS Glue Studio.

You can either use the administrator user for creating and managing your ETL jobs, or you can create a separate user for accessing AWS Glue Studio.

To create additional users for AWS Glue or AWS Glue Studio, follow the steps in Creating Your First IAM Delegated User and Group in the IAM User Guide.

• Sign in as an IAM user (p. 30)

Sign up for AWS

If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a veriﬁcation code on the phone keypad.

Create an IAM administrator user

If your account already includes an IAM user with full AWS administrative permissions, you can skip this section.

To create an administrator user for yourself and add the user to an administrators group (console)

1. Sign in to the IAM console as the account owner by choosing Root user and entering your AWS account email address. On the next page, enter your password.

Note

We strongly recommend that you adhere to the best practice of using the Administrator IAM user that follows and securely lock away the root user credentials. Sign in as the root user only to perform a few account and service management tasks.

2. In the navigation pane, choose Users and then choose Add user.

3. For User name, enter Administrator.

4. Select the check box next to AWS Management Console access. Then select Custom password, and then enter your new password in the text box.

5. (Optional) By default, AWS requires the new user to create a new password when ﬁrst signing in. You can clear the check box next to User must create a new password at next sign-in to allow the new user to reset their password after they sign in.

6. Choose Next: Permissions.

7. Under Set permissions, choose Add user to group.

8. Choose Create group.

9. In the Create group dialog box, for Group name enter Administrators.

10. Choose Filter policies, and then select AWS managed - job function to ﬁlter the table contents.

11. In the policy list, select the check box for AdministratorAccess. Then choose Create group.

Note

You must activate IAM user and role access to Billing before you can use the AdministratorAccess permissions to access the AWS Billing and Cost Management console. To do this, follow the instructions in step 1 of the tutorial about delegating access to the billing console.

12. Back in the list of groups, select the check box for your new group. Choose Refresh if necessary to see the group in the list.

13. Choose Next: Tags.

14. (Optional) Add metadata to the user by attaching tags as key-value pairs. For more information about using tags in IAM, see Tagging IAM entities in the IAM User Guide.

15. Choose Next: Review to see the list of group memberships to be added to the new user. When you are ready to proceed, choose Create user.

You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to speciﬁc AWS resources, see Access management and Example policies.

Sign in as an IAM user

Sign in to the IAM console by choosing IAM user and entering your AWS account ID or account alias. On the next page, enter your IAM user name and your password.

Note

For your convenience, the AWS sign-in page uses a browser cookie to remember your IAM user name and account information. If you previously signed in as a diﬀerent user, choose the sign-in link beneath the button to return to the main sign-in page. From there, you can enter your AWS account ID or account alias to be redirected to the IAM user sign-in page for your account.

Review IAM permissions needed for the AWS Glue Studio user

To use AWS Glue Studio, the user must have access to various AWS resources. The user must be able to view and select Amazon S3 buckets, IAM policies and roles, and AWS Glue Data Catalog objects.

AWS Glue service permissions

AWS Glue Studio uses the actions and resources of the AWS Glue service. Your user needs permissions on these actions and resources to eﬀectively use AWS Glue Studio. You can grant the AWS Glue Studio user the AWSGlueConsoleFullAccess managed policy, or create a custom policy with a smaller set of permissions.

Important

As a security best practice, we recommend that you tighten policies to further restrict access to the Amazon S3 bucket and Amazon CloudWatch log groups. For an example Amazon S3 policy, see Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket.

Creating Custom IAM Policies for AWS Glue Studio

You can create a custom policy with a smaller set of permissions for AWS Glue Studio. The policy can grant permissions for a subset of objects or actions. Use the following information when creating a custom policy. Actions marked with an asterisk (*) are APIs that are available only in AWS Glue Studio and do not exist in AWS Glue.

Job Actions

• GetJob

• CreateJob

• DeleteJob

• GetJobs

• UpdateJob

• *QueryJobs

• *SaveJob
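As a minimal sketch, the standard job actions listed above could be assembled into an IAM policy document as follows. The `build_policy` helper is hypothetical, the `glue:` action prefix is the namespace used for AWS Glue actions, and the wildcard resource is a placeholder that you would normally narrow. The AWS Glue Studio-only actions marked with an asterisk are left out here because this guide does not state how they are named in a policy.

```python
import json

# Standard AWS Glue job actions from the list above (asterisked,
# Studio-only actions are intentionally omitted from this sketch).
JOB_ACTIONS = ["GetJob", "CreateJob", "DeleteJob", "GetJobs", "UpdateJob"]


def build_policy(actions, resource="*"):
    """Build an IAM policy document that allows the given AWS Glue actions.

    The wildcard resource is a placeholder; restrict it where you can.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"glue:{action}" for action in actions],
                "Resource": resource,
            }
        ],
    }


policy = build_policy(JOB_ACTIONS)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON text in `policy_json` could be pasted into the IAM policy editor and extended with actions from the other categories in this section.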

Directed acyclic graph (DAG) Actions

• *CreateDag

• *UpdateDag

• *GetDag

• *DeleteDag

Database Actions

• GetDatabases

Plan Actions

• GetPlan

Job run Actions

• StartJobRun

• GetJobRuns

• BatchStopJobRun

• GetJobRun

• *QueryJobRuns

• *QueryJobRunsAggregated

Schema Actions

• *GetSchema

• *GetInferredSchema

Table Actions

• SearchTables

• GetTables

• GetTable

Connection Actions

• CreateConnection

• DeleteConnection

• UpdateConnection

• GetConnections

• GetConnection

File Actions

• GetFile

Mapping Actions

• GetMapping

Job Schedule Actions

• *GetNextScheduledJobs

Repository Actions

• *ListRepositories

Branch Actions

• *ListBranches

• *GetBranches

Commit Actions
