Amazon EMR

Amazon EMR Serverless User Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

What is Amazon EMR Serverless?

Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR Serverless provides a serverless runtime environment that simpliﬁes running analytics applications using the latest open source frameworks such as Apache Spark and Apache Hive. With EMR Serverless, you don’t have to conﬁgure, optimize, secure, or operate clusters to run applications with these frameworks.

EMR Serverless helps you avoid over- or under-provisioning resources for your data processing jobs. EMR Serverless automatically determines the resources required by the applications, acquires these resources to process your jobs, and relinquishes them when the jobs ﬁnish. For use cases where applications require a response within seconds, such as interactive data analysis, you can pre-initialize required resources during application creation.

With EMR Serverless, you'll continue to get the beneﬁts of Amazon EMR such as open source compatibility, concurrency, and performance optimized runtime for popular frameworks.

EMR Serverless is suitable for customers who want ease in operating applications using open source frameworks. It oﬀers easy provisioning, quick job startup, automatic capacity management, and simple cost controls.

Concepts

EMR Serverless terms and concepts.

Release version

An Amazon EMR release is a set of open-source applications from the big data ecosystem. Each release comprises different big data applications, components, and features that you select to have EMR Serverless deploy and configure to run your applications. When creating an application, you must specify its release version. You'll choose the Amazon EMR release version along with the open source framework version you want to use in your application.

Application

With EMR Serverless, you can create one or more EMR Serverless applications that use open source analytics frameworks. To create an application, you must specify the following attributes:

• The Amazon EMR release version for the open source framework version you want to use. To determine your release version, see Release versions (p. 66).

• The speciﬁc runtime that you want your application to use, such as Apache Spark or Apache Hive.

After you create an application, you can schedule data processing jobs or interactive requests to your application.

Each EMR Serverless application is strictly isolated from other applications and runs on a secure Amazon Virtual Private Cloud (VPC). Additionally, you can use IAM policies to deﬁne which IAM users and roles can access the application. You can also specify limits to control and track usage costs incurred by the application.

Consider creating multiple applications for the following scenarios:

• Using diﬀerent open source frameworks

• Using diﬀerent versions of open source frameworks for diﬀerent use cases

• Performing A/B testing when upgrading from one version to another

• Maintaining separate logical environments for test and production scenarios

• Providing separate logical environments for diﬀerent teams with independent cost controls and usage tracking

• Separating diﬀerent line-of-business applications

EMR Serverless is a Regional service that simpliﬁes running workloads across multiple Availability Zones within a Region. To learn more about using applications with EMR Serverless, see Interacting with your application (p. 10).

Job run

A job run is a request submitted to an EMR Serverless application that is asynchronously executed and tracked through completion. Examples of jobs include a HiveQL query submitted to an Apache Hive application or a PySpark data processing script submitted to an Apache Spark application. When submitting a job, you must specify an execution role, authored in IAM, that will be used by the job to access AWS resources, such as Amazon S3 objects. Multiple job run requests can be submitted to an application, and each job run can use a diﬀerent execution role to access AWS resources. EMR Serverless starts executing jobs as soon as they are received and runs multiple job requests concurrently. To learn more about running jobs, see Running jobs (p. 14).

Workers

An EMR Serverless application internally uses workers to execute your workloads. The default sizes of these workers are based on your application type and Amazon EMR release version. You can override these sizes when scheduling a job run.

When a job is submitted, EMR Serverless computes the resources needed for the job and schedules workers. EMR Serverless breaks down your workloads into tasks, downloads images, provisions and sets up workers, and decommissions them when the job ﬁnishes. EMR Serverless automatically scales workers up or down depending on the workload and parallelism required at every stage of the job, removing the need for you to estimate the number of workers required to run your workloads.

Pre-initialized capacity

EMR Serverless provides a feature that keeps workers initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initial-capacity parameter of an application. Pre-initialized capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs. To learn more about pre-initialized workers, see Configuring and managing pre-initialized capacity (p. 12).

Setting up

Step 1: Sign up for AWS

When you sign up for AWS, your AWS account is automatically signed up for all services, including the generally available Amazon EMR deployment options. You are charged only for the services that you use. If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and entering a veriﬁcation code on the phone keypad.

Step 2: Create an IAM user

As a best practice, create an AWS Identity and Access Management (IAM) user with administrator permissions, and then use that IAM user for all work that does not require root credentials. Create a password for console access, and create access keys to use command line tools. For instructions, see Creating your ﬁrst IAM admin user and group in the IAM User Guide.

You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to speciﬁc AWS resources, see Access management. If you choose to create a separate user to work with EMR Serverless, ensure the user has suﬃcient permissions to invoke EMR Serverless actions by attaching an IAM policy to the IAM user. For more information, see Access policies (p. 45). However, if you choose to continue with an Admin user, no further action will be required.

Step 3: Install and conﬁgure the AWS CLI

To set up EMR Serverless, you must have the latest version of the AWS CLI installed. To install the latest version of the AWS CLI for macOS, Linux, or Windows, see Installing or updating the latest version of the AWS CLI.

To configure the AWS CLI and set up secure access to AWS services, including EMR Serverless, see Quick configuration with aws configure.

Getting started

This tutorial helps you get started using EMR Serverless by deploying a sample Spark workload. You'll create your application with default pre-initialized capacity, run the sample application with logs stored in your S3 bucket, and view event logs in the Spark History Server. For simplicity, this tutorial uses default options in most steps. For examples of running Hive applications, see Running Hive jobs (p. 21).

Prerequisites

Before you launch an EMR Serverless application, make sure you complete the following tasks:

• EMR Serverless is currently in preview release. To access the preview of EMR Serverless, follow the sign-up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html.

• You must update the AWS CLI with the latest service model for EMR Serverless. Once you've received conﬁrmation of access, use the following command to download the latest API model ﬁle and update the AWS CLI.

aws s3 cp s3://elasticmapreduce/emr-serverless-preview/artifacts/latest/dev/cli/service.json ./service.json

aws configure add-model --service-model file://service.json

• To use EMR Serverless, you must choose the AWS Region where preview is available. This applies to any AWS services and resources that EMR Serverless will need to access as part of running your workloads. Preview is currently available in US East (N. Virginia) us-east-1, and you may want to conﬁgure the AWS CLI to send all your AWS requests to this speciﬁc region by default. You can do so with the following command.

aws configure set region us-east-1

• Validate that the AWS CLI configuration and permissions to interact with EMR Serverless are correctly set up. You can do so by running the following command to see a list of your EMR Serverless applications in your current Region.

aws emr-serverless list-applications

If the command returns with an error, see Troubleshooting EMR Serverless identity and access (p. 47).

Step 1: Plan an EMR Serverless application

Prepare output log storage for EMR Serverless

In this tutorial, you'll use an S3 bucket to store output ﬁles and logs from the sample Spark workload you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide.

As noted in the prerequisites, the S3 bucket must be created in the same Region where EMR Serverless is available (us-east-1). Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.

Set up a job execution role

Job runs in EMR Serverless use an execution role that provides granular permissions to specific AWS services and resources at runtime. In this tutorial, the data and scripts are hosted in a public S3 bucket; however, the output, including logs, will be stored in DOC-EXAMPLE-BUCKET.

To set up a job execution role, you will first create an execution role with a trust policy that allows EMR Serverless to use the new role. Next, you'll attach the required S3 access policy to that role. The following steps walk you through the process.

1. Create a ﬁle named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The ﬁle should contain the following policy.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "EMRServerlessTrustPolicy",
        "Action": "sts:AssumeRole",
        "Effect": "Allow",
        "Principal": {
            "Service": "emr-serverless.amazonaws.com"
        }
    }]
}

2. Create an IAM role named sampleJobExecutionRole using the trust policy created in the previous step.

aws iam create-role \

--role-name sampleJobExecutionRole \

--assume-role-policy-document file://emr-serverless-trust-policy.json

Note the ARN in the output. You will use the ARN of the new role during job submission, where it is referred to as <execution_role_arn>.

3. Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload. The policy grants read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET. You must replace DOC-EXAMPLE-BUCKET in the policy below with the actual name of the bucket that you created in Step 1.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAccessForEMRSamples",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*.elasticmapreduce",
                "arn:aws:s3:::*.elasticmapreduce/*"
            ]
        },
        {
            "Sid": "FullAccessToOutputBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
            ]
        }
    ]
}

4. Create an IAM policy named sampleS3AccessPolicy using the policy file created in the previous step.

aws iam create-policy \

--policy-name sampleS3AccessPolicy \

--policy-document file://emr-sample-access-policy.json

Take note of the new policy's ARN in the output, as you will substitute it for <policy_arn> in the next step.

5. Attach the IAM policy sampleS3AccessPolicy to the job execution role sampleJobExecutionRole.

aws iam attach-role-policy \
    --role-name sampleJobExecutionRole \
    --policy-arn <policy_arn>
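If you did not record the role or policy ARN from the earlier steps, one way to look them up again is with the IAM read commands. The following is a minimal sketch rather than part of the tutorial: the function names get_role_arn and get_policy_arn are hypothetical, and the sketch relies on the AWS CLI global --query and --output options.

```shell
#!/bin/sh
# Hypothetical helpers (the names are illustrative, not part of the guide).

# Print the ARN of an IAM role, given its name.
get_role_arn() {
    aws iam get-role \
        --role-name "$1" \
        --query 'Role.Arn' \
        --output text
}

# Print the ARN of a customer managed policy, given its name.
get_policy_arn() {
    aws iam list-policies \
        --scope Local \
        --query "Policies[?PolicyName=='$1'].Arn" \
        --output text
}

# Example usage (requires IAM read permissions):
#   execution_role_arn=$(get_role_arn sampleJobExecutionRole)
#   policy_arn=$(get_policy_arn sampleS3AccessPolicy)
```

Storing the values in shell variables lets you reuse them wherever the tutorial shows <execution_role_arn> or <policy_arn>.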

Step 2: Create an EMR Serverless application

Now you're ready to create a new Spark application using EMR Serverless. To create an application, run the following command.

aws emr-serverless create-application \
    --release-label emr-6.5.0-preview \
    --type "SPARK" \
    --name my-application

Note the application ID returned in the output. You will use the ID to start the application and during job submission, where it is referred to as <application_id>.

Although EMR Serverless automatically pre-initializes a set of workers for you (with additional workers created on demand), you may choose a diﬀerent set of pre-initialized workers by setting the initialCapacity parameter while creating the application. You may also choose to set a limit for the total maximum capacity that an application can use by setting the maximumCapacity parameter. To learn more about these options, see Conﬁguring and managing pre-initialized capacity (p. 12).

Step 3: Start your application

Before you can schedule a job using your application, you must start the application. This action will pre-initialize a set of workers. You must ensure the application has reached the CREATED state before starting it. To check the state of your application, run the following command, substituting <application_id> with the ID of your new application.

aws emr-serverless get-application \
    --application-id <application_id>

When the application has reached the CREATED state, start your application using the following command.

aws emr-serverless start-application \
    --application-id <application_id>

Before moving to the next step, ensure your application has reached the STARTED state using the get-application API.
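Rather than rerunning get-application by hand, you could poll it in a loop. The following is a minimal sketch under stated assumptions: the function name wait_for_application_state is hypothetical, and the JMESPath expression application.state is an assumption about the shape of the preview response, so adjust it if your output differs.

```shell
#!/bin/sh
# Hypothetical helper: poll get-application until the application reaches
# the wanted state. Assumes the state is exposed at application.state
# (an assumption about the preview response shape).
wait_for_application_state() {
    app_id=$1
    wanted=$2
    max_attempts=${3:-30}     # give up after this many polls
    delay_seconds=${4:-10}    # pause between polls

    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        state=$(aws emr-serverless get-application \
            --application-id "$app_id" \
            --query 'application.state' \
            --output text) || return 2
        if [ "$state" = "$wanted" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
    echo "Timed out waiting for $wanted (last state: $state)" >&2
    return 1
}

# Example usage:
#   wait_for_application_state <application_id> CREATED
#   aws emr-serverless start-application --application-id <application_id>
#   wait_for_application_state <application_id> STARTED
```

The function returns 0 when the wanted state is observed, 1 on timeout, and 2 if the CLI call itself fails, so a script can stop before submitting jobs to an application that is not ready.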

Step 4: Schedule a job run to your EMR Serverless application

Now your Spark application is ready to run jobs. In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. Both the script and the dataset are stored in a public, read-only S3 bucket. The output file and the log data from the Spark runtime will be pushed to the /output and /logs directories in the S3 bucket you created in Step 1 (DOC-EXAMPLE-BUCKET).

In the command below, substitute <application_id> with your application ID. Substitute <execution_role_arn> with the execution role ARN you created in Step 1. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket you created, adding /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.

aws emr-serverless start-job-run \
    --application-id <application_id> \
    --execution-role-arn <execution_role_arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
            "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET/logs"
            }
        }
    }'

Note the job run ID returned in the output, as you will replace <job_run_id> with this ID in the following steps.

Step 5: Review your job run's output

The job run typically takes 3-5 minutes to complete. You can check the state of the job using the following command.

aws emr-serverless get-job-run \
    --application-id <application_id> \
    --job-run-id <job_run_id>

With your log destination set to s3://DOC-EXAMPLE-BUCKET/logs, the logs for this specific job run can be found under s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>.

Depending upon the type of application and log file, EMR Serverless will upload logs to your bucket at different cadences. For Spark applications, EMR Serverless will push event logs every 30 seconds to the sparklogs folder in the S3 log destination. The Spark runtime logs for the driver and executors (i.e., stderr and stdout log files) will upload upon completion of the job to folders named appropriately by the worker type, such as driver or executor.

The output of the PySpark job will be uploaded upon successful execution of the job to s3://DOC-EXAMPLE-BUCKET/output/.
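The log layout described above can be captured in a small helper so that you do not have to assemble the path by hand. This is a minimal sketch: the function name job_run_log_prefix is hypothetical, and the path pattern is taken directly from the description above.

```shell
#!/bin/sh
# Hypothetical helper: print the S3 prefix that holds the logs of one job run,
# following the layout described above:
#   <logUri>/applications/<application_id>/jobs/<job_run_id>
job_run_log_prefix() {
    log_uri=${1%/}    # drop one trailing slash, if present
    printf '%s/applications/%s/jobs/%s\n' "$log_uri" "$2" "$3"
}

# Example usage (placeholders as in the tutorial):
#   prefix=$(job_run_log_prefix s3://DOC-EXAMPLE-BUCKET/logs <application_id> <job_run_id>)
#   aws s3 ls "$prefix/" --recursive
#   aws s3 ls s3://DOC-EXAMPLE-BUCKET/output/
```

Listing the prefix with aws s3 ls is a quick way to confirm that the event logs and the driver and executor logs have arrived before you open them.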

Step 6: Clean up

When you’re done working with this tutorial, consider deleting the resources you created. This will help you avoid any unnecessary expenses. Note that in preview, there is no additional cost to using EMR Serverless. However, we still recommend following best practices by releasing resources that you don't intend to use again.

Delete your application

To delete an application, it must be in the STOPPED state. Use the following command to stop the application.

aws emr-serverless stop-application \
    --application-id <application_id>

Once the application is in the STOPPED state, use the following command to delete the application.

aws emr-serverless delete-application \
    --application-id <application_id>

Delete your S3 log bucket

To delete your S3 logging and output bucket, use the following commands. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Step 1.

aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive
aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET

Delete your job execution role

To delete the execution role, detach the policy from the role. You can then delete both the role and the policy.

aws iam detach-role-policy \
    --role-name sampleJobExecutionRole \
    --policy-arn <policy_arn>

To delete the role, use the following command.

aws iam delete-role \

--role-name sampleJobExecutionRole

To delete the policy that was attached to the role, use the following command.

aws iam delete-policy \
    --policy-arn <policy_arn>

This concludes the tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).

Interacting with your application

In this section, we'll cover how you can interact with your Amazon EMR Serverless application using the AWS CLI and the defaults for Spark and Hive engines.

Topics

• Application states

• Working with your application on the AWS CLI

• Pre-initialized capacity defaults

• Configuring and managing pre-initialized capacity

Application states

When you create an application with Amazon EMR Serverless, the application enters the CREATING state. It then passes through the following states during its lifetime.

Applications can have the following states:

State        Description
Creating     The application is being prepared and is not yet ready to use.
Created      The application has been created but has not yet provisioned capacity. It can be modified to make changes to the initial capacity configuration.
Starting     The application has started and is provisioning capacity.
Started      The application is ready to accept new jobs. Jobs will only be accepted in this state.
Stopping     All jobs have completed and the application is releasing its capacity. EMR Serverless may move an application to this state when there are failures in provisioning capacity.
Stopped      The application is stopped and no resources are running on the application. It can be modified to make changes to the initial capacity configuration.
Terminated   The application has been terminated and will not appear on your list. EMR Serverless may move an application to this state when there are failures in creation.
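Two rules from the table are easy to check in a script before you call the API: jobs are accepted only in the Started state, and the initial capacity can be changed only in the Created or Stopped state. The following is a minimal sketch; the function names are hypothetical, and the upper-case state names follow the forms used elsewhere in this guide (for example CREATED, STARTED, and STOPPED).

```shell
#!/bin/sh
# Hypothetical helpers that encode two rules from the state table above.

# Succeeds if the application accepts new jobs in the given state.
accepts_jobs() {
    [ "$1" = "STARTED" ]
}

# Succeeds if the initial capacity configuration can be changed.
can_modify_capacity() {
    case "$1" in
        CREATED|STOPPED) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   if accepts_jobs "$state"; then echo "ready for start-job-run"; fi
```

Both helpers signal their answer through the exit status, so they can be used directly in if statements.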

The following diagram illustrates the trajectory of EMR Serverless application states.

Working with your application on the AWS CLI

This section covers how to create, describe, and delete individual applications on the command line, as well as how to list all of your applications to view them at a glance. For more application operations, such as starting, stopping, and updating your application, see the EMR Serverless API Reference.

To create an application, use create-application. You must specify SPARK or HIVE as the application type. This command returns the application’s ARN, name, and ID.

aws emr-serverless create-application \
    --name <my_application_name> \
    --type <application_type> \
    --release-label <release_version>

To describe an application, use get-application and provide its application-id. This command returns the state and capacity-related conﬁgurations for your application.

aws emr-serverless get-application \
    --application-id <application_id>

To list all of your applications, call list-applications. This command returns the same properties as get-application but includes all of your applications.

aws emr-serverless list-applications

To delete your application, call delete-application and supply your application-id.

aws emr-serverless delete-application \
    --application-id <application_id>

Pre-initialized capacity defaults

By default, EMR Serverless configures each application with enough pre-initialized capacity to run one job at low latency. The defaults are engine-specific and are set as follows:

• Spark - 3 drivers (4 vCPU, 16 GB memory, 21 GB disk), 9 executors (4 vCPU, 16 GB memory, 21 GB disk)

• Hive - 3 drivers (2 vCPU, 6 GB memory, and 21 GB disk), 30 Tez tasks (1 vCPU, 6 GB memory, and 21 GB disk)

Conﬁguring and managing pre-initialized capacity

EMR Serverless provides an optional feature that keeps drivers and workers pre-initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initialCapacity parameter of an application to the number of workers you want to pre-initialize. Pre-initialized worker capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs.

When a job is run, if any workers from initialCapacity are available (not already in use by previously submitted jobs), those resources are used to start running the job. If those resources are not available because they are in use by other jobs, or if they are insufficient because the job requires more than what is available from initialCapacity, then additional workers are requested and acquired, up to the maximum limits on resources set for the application. When jobs finish running, the workers used by the job are released, and the number of resources available for the application returns to initialCapacity. An application maintains the initialCapacity of resources even after jobs finish running. Excess resources beyond initialCapacity are released immediately when they're no longer required to run jobs.
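The allocation behavior described above can be illustrated with simple arithmetic. The following sketch is an illustrative model only, not the actual EMR Serverless scheduler: the function name split_worker_request is hypothetical, and the model ignores worker types and the application's maximum capacity.

```shell
#!/bin/sh
# Illustrative model only, not the actual EMR Serverless scheduler.
# Arguments: initial capacity, workers already in use, workers the job needs.
# Prints "<from_pool> <on_demand>".
split_worker_request() {
    initial=$1
    in_use=$2
    needed=$3

    # Pre-initialized workers that are not busy with other jobs.
    free=$((initial - in_use))
    if [ "$free" -lt 0 ]; then
        free=0
    fi

    # Take as many workers as possible from the warm pool.
    if [ "$needed" -le "$free" ]; then
        from_pool=$needed
    else
        from_pool=$free
    fi

    # Anything beyond the pool has to be acquired on demand.
    on_demand=$((needed - from_pool))

    echo "$from_pool $on_demand"
}

# Example usage:
#   split_worker_request 9 6 5    # prints "3 2"
```

In the example, 9 workers are pre-initialized and 6 are busy, so a job that needs 5 workers takes the 3 free workers from the pool and waits for 2 more to be acquired on demand.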

Note: For this preview release, you must manually stop an application to decommission the pre-initialized workers.

Pre-initialized capacity is available and ready to use when the application has started. It is decommissioned when the application is stopped. An application moves to the STARTED state only if the requested pre-initialized capacity has been created and is ready to use. For the entire duration that the application is in the STARTED state, EMR Serverless ensures that the pre-initialized capacity is available for use or is in use by jobs or interactive workloads. Capacity is replenished for released or failed containers to maintain the number of workers specified in the initialCapacity parameter. For an application with no pre-initialized capacity, the state can immediately transition from CREATED to STARTED.

You can modify the initialCapacity counts and specify compute configurations such as vCPU, memory, and disk for each worker. Modifications are only allowed when the application is in the CREATED or STOPPED state.

Customizing pre-initialized capacity for speciﬁc big data frameworks

You can further customize pre-initialized capacity to suit workloads running on speciﬁc big data frameworks. For example, when running Apache Spark, you can specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, you can specify how many workers start as Hive drivers, and how many are used to run Tez tasks.

Conﬁguring an application running Apache Hive with pre-initialized capacity

The following API request creates an application running Apache Hive based on Amazon EMR release emr-5.34.0-preview. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 6 GB of memory, and 50 pre-initialized Tez task workers, each with 1 vCPU and 6 GB of memory. When Hive queries are run on this application, they ﬁrst use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.

aws emr-serverless create-application \
    --type HIVE \
    --name <my_application_name> \
    --release-label emr-5.34.0-preview \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "6GB"
            }
        },
        "TEZ_TASK": {
            "workerCount": 50,
            "resourceConfiguration": {
                "cpu": "1vCPU",
                "memory": "6GB"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'

Conﬁguring an application running Apache Spark with pre-initialized capacity

The following API request creates an application running Apache Spark 3.1 based on Amazon EMR release 6.5. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs are run on this application, they ﬁrst use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Spark jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.

Note: Spark adds 10% overhead to the memory requested for driver and executors. In order for jobs to use pre-initialized workers, the initial capacity memory configuration should be at least 10% more than the memory requested by the job.

aws emr-serverless create-application \
    --type "SPARK" \
    --name <my_application_name> \
    --release-label "emr-6.5.0-preview" \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 5,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 50,
            "resourceConfiguration": {
                "cpu": "4vCPU",
                "memory": "8GB"
            }
        }
    }' \
    --maximum-capacity '{
        "cpu": "400vCPU",
        "memory": "1024GB"
    }'
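The 10% memory overhead mentioned in the note can be checked with integer arithmetic before you pick spark.driver.memory or spark.executor.memory for a job. This is a minimal sketch under one assumption: "at least 10% more" is read as worker memory being at least 1.1 times the requested memory. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the 10% rule from the note above, using integer arithmetic in MB.
# Assumption: worker memory must be >= 1.1 x the memory requested by the job.

# Print the smallest worker memory (MB) that satisfies the rule, rounded up.
min_worker_memory_mb() {
    echo $(( ($1 * 110 + 99) / 100 ))
}

# Succeeds if a job requesting $1 MB fits a pre-initialized worker of $2 MB.
fits_preinitialized_worker() {
    [ "$(min_worker_memory_mb "$1")" -le "$2" ]
}

# Example usage:
#   min_worker_memory_mb 4096                # prints 4506
#   fits_preinitialized_worker 4096 8192     # succeeds
#   fits_preinitialized_worker 8192 8192     # fails
```

Under this reading, an executor request of 4 GB fits the 8 GB pre-initialized executors described above, while a request of 8 GB does not, because the overhead pushes the requirement past 8 GB.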

Running jobs

After you've provisioned your application, you can submit jobs to it. This section covers how to use the AWS CLI to run jobs on your application and what the defaults are for each type of application available on EMR Serverless.

Topics

• Job run states

• Submitting jobs on the AWS CLI

• Running Spark jobs

• Running Hive jobs

Job run states

When you submit a job run to an Amazon EMR Serverless job queue, the job run enters the SUBMITTED state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code).

Job runs can have the following states:

State        Description
Submitted    The initial job state when a job run is submitted to EMR Serverless. The job is waiting to be scheduled for the application, and EMR Serverless is working on prioritizing and scheduling this job run.
Pending      The job run is being evaluated by the scheduler to prioritize and schedule this job run for the application.
Scheduled    The job run has been scheduled for the application, and EMR Serverless is allocating resources to execute it.
Running      The job run has necessary initial resources and is running in the application. In Spark applications, this means that the Spark driver process is in the running state.
Failed       The job run failed to be submitted to the application or it completed unsuccessfully. See StateDetails for additional information about this job failure.
Completed    The job run completed successfully.
Cancelling   The job run has been requested for cancellation, either through the CancelJobRun API or due to timeout. EMR Serverless is trying to cancel the job in the application and release the resources.
Cancelled    The job run was cancelled successfully, and the resources it used have been released.
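A script that submits a job usually needs to wait until the job run reaches a state from which it will not move again. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the state names are the upper-case forms of the entries listed above, and the JMESPath expression jobRun.state is an assumption about the shape of the get-job-run response.

```shell
#!/bin/sh
# Hypothetical helpers built on the job run states listed above.
# State names are assumed to be returned in upper case (an assumption).

# Succeeds if no further transitions are expected from the given state.
is_terminal_job_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll get-job-run until the job run reaches a terminal state, then print it.
# Assumes the state is exposed at jobRun.state (an assumption).
wait_for_job_run() {
    app_id=$1
    job_run_id=$2
    max_attempts=${3:-40}     # give up after this many polls
    delay_seconds=${4:-30}    # pause between polls

    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        state=$(aws emr-serverless get-job-run \
            --application-id "$app_id" \
            --job-run-id "$job_run_id" \
            --query 'jobRun.state' \
            --output text) || return 2
        if is_terminal_job_state "$state"; then
            echo "$state"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
    return 1
}

# Example usage:
#   final_state=$(wait_for_job_run <application_id> <job_run_id>)
```

Cancelling is deliberately not treated as terminal, because the table describes it as a transition toward Cancelled.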

Submitting jobs on the AWS CLI

You can submit, describe, and cancel individual jobs on the command line. You can also list all of your jobs to view them at a glance.

To submit a new job, use start-job-run and supply the ID of the application that you want to run the job on, along with job-specific properties. For Spark examples, see Running Spark jobs (p. 15). For Hive examples, see Running Hive jobs (p. 21). This command returns your application-id, ARN, and new job-id.

To describe a job, use get-job-run. This command returns job-speciﬁc conﬁgurations and the set capacity for your new job.

aws emr-serverless get-job-run \
    --application-id <application_id> \
    --job-run-id <job_id>

To list your jobs, call list-job-runs. This command returns an abbreviated set of properties including job type, state, and other high-level attributes. If you don't want to see all of your jobs, you can specify the maximum number of jobs you'd like to see, up to 50. The following command demonstrates how to specify that you'd like to see your last two job runs.

aws emr-serverless list-job-runs \
    --application-id <application_id> \
    --max-results 2

To cancel a job, call cancel-job-run, supplying both the application-id and the job-id of the job you want to cancel.

aws emr-serverless cancel-job-run \
    --application-id <application_id> \
    --job-run-id <job_id>

For more information on running jobs using the AWS CLI, see the EMR Serverless API Reference.

Running Spark jobs

You can run Spark jobs on an application with the type parameter set to 'SPARK'. Jobs must be compatible with the Spark version referenced in the Amazon EMR release version. For example, when running jobs on an application with Amazon EMR release 6.5, your job must be compatible with Apache Spark 3.1.2.

To run a Spark job, you must specify the following parameters when using the start-job-run API.

Execution role (executionRoleArn)

This is an IAM role ARN that your application uses to execute Spark jobs. This role must contain the following permissions:

• Read from S3 buckets or other data sources where your data resides

• Read from S3 buckets or preﬁxes where your PySpark script or JAR ﬁle is located

• Write to S3 buckets where you intend to write your ﬁnal output

• Write logs to an S3 bucket or preﬁx speciﬁed by S3MonitoringConfiguration

• Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket

• Access to AWS Glue Catalog if you use SparkSQL

Failure to provide these permissions to the IAM role can lead to job failures. If your Spark job is reading or writing data to or from other data sources, make sure that the appropriate permissions are speciﬁed in this IAM role. For more information, see Using job execution roles with EMR Serverless (p. 40) and Logging (p. 55).

Job driver (jobDriver)

A job's driver is used to provide input to the job. This parameter accepts only one value for the job type that you want to run. For a Spark job, that value is sparkSubmit. You can use this job type to run Scala, Java, PySpark, SparkR, and any other supported jobs through Spark submit. This job type has the following parameters:

• entryPoint ‐ This is the reference in Amazon S3 to the main JAR or Python file that you want to run. If you are running a Scala or Java JAR, specify the main entry class in sparkSubmitParameters using the --class argument.

• entryPointArguments ‐ This is an array of arguments that you want to pass to your main JAR or Python ﬁle. You should handle reading these parameters using your entrypoint code. Each argument in the array should be separated by a comma.

• sparkSubmitParameters ‐ These are the additional Spark parameters that you want to send to the job. Use this parameter to override default Spark properties, such as driver memory or number of executors, by passing arguments such as --conf or --class.

For additional information, see Launching Applications with spark-submit.
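As a minimal sketch of how the three sparkSubmit parameters fit together, the following Python snippet assembles a job driver as a dictionary and serializes it to the JSON string that the --job-driver option expects. The helper name build_spark_job_driver and the script path under DOC-EXAMPLE-BUCKET are illustrative assumptions, not part of the EMR Serverless API.

```python
import json


def build_spark_job_driver(entry_point, arguments=(), conf=None, main_class=None):
    """Assemble a sparkSubmit job driver as a plain dictionary (hypothetical helper)."""
    submit_args = []
    if main_class:
        # Scala and Java JARs name their main entry class with --class.
        submit_args += ["--class", main_class]
    for key, value in (conf or {}).items():
        # Each Spark property becomes its own "--conf key=value" pair.
        submit_args += ["--conf", f"{key}={value}"]

    spark_submit = {
        "entryPoint": entry_point,
        "entryPointArguments": list(arguments),
    }
    if submit_args:
        spark_submit["sparkSubmitParameters"] = " ".join(submit_args)
    return {"sparkSubmit": spark_submit}


# Placeholder S3 locations; replace them with your own buckets.
driver = build_spark_job_driver(
    "s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py",
    arguments=["s3://DOC-EXAMPLE-BUCKET-OUTPUT/wordcount_output"],
    conf={"spark.executor.cores": 1, "spark.executor.memory": "4g"},
)
job_driver_json = json.dumps(driver, indent=2)
print(job_driver_json)
```

The printed JSON can be passed as the value of --job-driver. Building the string from a dictionary avoids hand-editing quoted JSON on the command line.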

Conﬁguration overrides (configurationOverrides)

This parameter is used for overriding application and monitoring level conﬁguration properties. This parameter accepts a JSON object having the following two ﬁelds.

• applicationConfiguration ‐ This field allows you to override the default configurations for applications by supplying a configuration object. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by specific release version for Amazon EMR. For a list of configuration classifications that are available for each release version of Amazon EMR, see Release versions (p. 66).

If you pass the same conﬁguration in an application override and in Spark submit parameters, the Spark submit parameters take precedence. The complete conﬁguration priority list follows, in order of highest priority to lowest priority:

• Conﬁguration supplied when creating SparkSession.

• Configuration supplied as part of sparkSubmitParameters using --conf.

• Conﬁguration provided as part of application overrides.

• Optimized conﬁgurations chosen by Amazon EMR for the release.


• Default open source conﬁgurations for the application.

• monitoringConfiguration ‐ This field allows you to specify the Amazon S3 URL (s3MonitoringConfiguration) where you want the EMR Serverless job to store logs of your Spark job. Make sure you've created this bucket with the same AWS account hosting your application, and in the same Region where your job is running.
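The configuration priority list above can be pictured as layered dictionaries, where each higher-priority layer overwrites the ones below it. The following Python sketch only illustrates that ordering; the function name resolve_spark_conf and the sample values are assumptions for the example and do not reflect how EMR Serverless is implemented internally.

```python
def resolve_spark_conf(session_conf, submit_conf, override_conf, emr_conf, oss_conf):
    """Return the effective properties given the five layers, highest priority first."""
    effective = {}
    # Apply layers from lowest to highest priority so that later layers win.
    for layer in (oss_conf, emr_conf, override_conf, submit_conf, session_conf):
        effective.update(layer)
    return effective


# Illustrative values only.
effective = resolve_spark_conf(
    session_conf={},                                   # set when creating SparkSession
    submit_conf={"spark.executor.memory": "4g"},       # --conf in sparkSubmitParameters
    override_conf={"spark.executor.memory": "8g"},     # application overrides
    emr_conf={"spark.executor.memory": "14G", "spark.executor.cores": "4"},
    oss_conf={"spark.executor.memory": "1g"},          # open source defaults
)
print(effective)
```

In this example the Spark submit value for spark.executor.memory wins over the application override, while spark.executor.cores falls through to the Amazon EMR default because no higher layer sets it.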

Spark job defaults

The following table lists optional Spark properties and their default values that you can override when submitting a Spark job.

| Key | Explanation | Default value |
| --- | --- | --- |
| spark.emr-serverless.allocation.batch.size | The number of containers to request in each cycle of executor allocation. There is a 1-second gap between each allocation cycle. | 20 |
| spark.emr-serverless.allocation.executor.timeout | The time to wait for a newly created executor container to reach the running state before the request is cancelled. | 300s |
| spark.emr-serverless.memoryOverheadFactor | Sets the Memory Overhead Factor to be added to the driver and executor container memory. | 0.1 |
| spark.executor.memory | The amount of memory used by each executor. | 14G |
| spark.executor.cores | The number of cores used by each executor. | 4 |
| spark.driver.memory | The amount of memory used by the driver. | 14G |
| spark.driver.cores | The number of cores used by the driver. | 4 |
| spark.emr-serverless.driver.disk | The Spark driver disk. | 21G |
| spark.emr-serverless.executor.disk | The Spark executor disk. | 21G |
| spark.executor.instances | The number of Spark executor containers to allocate. | 3 |
| spark.executor.extraJavaOptions | Extra Java options for the Spark executor. | NULL |
| spark.driver.extraJavaOptions | Extra Java options for the Spark driver. | NULL |
| spark.emr-serverless.driverEnv.[KEY] | Adds additional environment variables to the Spark driver. | NULL |
| spark.executorEnv.[KEY] | Adds additional environment variables to the Spark executors. | NULL |
| spark.files | A comma-separated list of files to be placed in the working directory of each executor. The file paths of these files in executors can be accessed by running SparkFiles.get(fileName). | NULL |
| spark.jars | Additional jars to add to the runtime classpath of the driver and executors. | NULL |
| spark.archives | A comma-separated list of archives to be extracted into the working directory of each executor. Supported file types include .jar, .tar.gz, .tgz and .zip. You can specify the directory name to unpack by adding # after the file name to unpack. For example, file.zip#directory. This configuration is experimental. | NULL |
| spark.submit.pyFiles | A comma-separated list of .zip, .egg, or .py files to place in the PYTHONPATH for Python apps. | NULL |
| spark.sql.warehouse.dir | The default location for managed databases and tables. | The value of $PWD/spark-warehouse |
| spark.hadoop.hive.metastore.client.factory.class | The Hive metastore implementation class. | NULL |
| spark.authenticate | Enables authentication of Spark's internal connections. | TRUE |
| spark.network.crypto.enabled | Enables AES-based RPC encryption, including the new authentication protocol added in 2.2.0. | FALSE |
| spark.dynamicAllocation.enabled | Enables dynamic resource allocation, which scales the number of executors registered with the application up or down, based on the workload. | TRUE |
| spark.dynamicAllocation.executorIdleTimeout | The duration of idle time an executor can have before it will be removed. This only applies if dynamic allocation is enabled. | 60s |
| spark.dynamicAllocation.initialExecutors | The initial number of executors to run if dynamic allocation is enabled. | 3 |
| spark.dynamicAllocation.maxExecutors | The upper bound for the number of executors if dynamic allocation is enabled. | 1000 |
| spark.dynamicAllocation.minExecutors | The lower bound for the number of executors if dynamic allocation is enabled. | 0 |

The following table lists the default Spark submit parameters.

| Key | Explanation | Default value |
| --- | --- | --- |
| executor-memory | The amount of memory used by each executor. | 14G |
| executor-cores | The number of cores used by each executor. | 4 |
| driver-memory | The amount of memory used by the driver. | 14G |
| driver-cores | The number of cores used by the driver. | 4 |
| num-executors | The number of executors to launch. | 3 |
| files | A comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). | NULL |
| py-files | A comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. | NULL |
| archives | A comma-separated list of archives to be extracted into the working directory of each executor. | NULL |
| jars | A comma-separated list of jars to include on the driver and executor classpaths. | NULL |
| verbose | Enables printing additional debug output. | NULL |
| class | The application's main class (for Java and Scala apps). | NULL |
| conf | An arbitrary Spark configuration property. | NULL |

Spark examples

The following is an example of running a Python script using the StartJobRun API. For an end-to-end tutorial using this example, see Getting started (p. 4).

aws emr-serverless start-job-run \
    --application-id <application_id> \
    --execution-role-arn <iam_role_arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
            "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET-OUTPUT/wordcount_output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
            }
        }
    }'

The following is an example of running a Spark JAR using the StartJobRun API.

aws emr-serverless start-job-run \
    --application-id <application_id> \
    --execution-role-arn <iam_role_arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
            "entryPointArguments": ["1"],
            "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
            }
        }
    }'


Running Hive jobs

You can run Hive jobs on an application with the type parameter set to 'HIVE'. Jobs must be compatible with the Hive version referenced in the Amazon EMR release version. For example, when running jobs on an application with Amazon EMR release 5.34.0-preview, your job must be compatible with Apache Hive 2.3.8.

To run a Hive job, you must specify the following parameters when using the start-job-run API.

Execution role (executionRoleArn)

This is an IAM role ARN that your application uses to execute Hive jobs. This role must contain the following permissions:

• Read from S3 buckets or other data sources where your data resides

• Read from S3 buckets or prefixes where your Hive query file and init query file are located

• Read and write to S3 buckets where your Hive Scratch directory and Hive Metastore warehouse directory are located

• Write to S3 buckets where you intend to write your ﬁnal output

• Write logs to an S3 bucket or preﬁx speciﬁed by S3MonitoringConfiguration

• Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket

• Access to AWS Glue Catalog

Failure to provide these permissions to the IAM role can lead to job failures. If your Hive job is reading or writing data to or from other data sources, make sure that the appropriate permissions are speciﬁed in this IAM role. For more information, see Using job execution roles with EMR Serverless (p. 40).

Job driver (jobDriver)

A job's driver is used to provide input to the job. This parameter accepts only one value for the job type that you want to run. For a Hive job, that value is hive. This job type has the following parameters:

• query ‐ This is the reference in Amazon S3 to the Hive query ﬁle that you want to run.

• parameters ‐ These are the additional Hive configuration properties you want to override. You can override properties by passing them to this parameter as --hiveconf <property=value>, and set variables by passing them as --hivevar <key=value>.

• initQueryFile ‐ This is the init Hive query ﬁle. It will be executed prior to your query and can be used to initialize tables.
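To make the shape of the hive job driver concrete, the following Python sketch assembles the query, parameters, and initQueryFile fields into the JSON expected by --job-driver. The helper name build_hive_job_driver, the run_date variable, and the init file path are hypothetical and only illustrate the format.

```python
import json


def build_hive_job_driver(query, hiveconf=None, hivevar=None, init_query_file=None):
    """Assemble a hive job driver as a plain dictionary (hypothetical helper)."""
    parts = []
    for key, value in (hiveconf or {}).items():
        # Configuration property overrides use --hiveconf property=value.
        parts.append(f"--hiveconf {key}={value}")
    for key, value in (hivevar or {}).items():
        # Query variables use --hivevar key=value.
        parts.append(f"--hivevar {key}={value}")

    hive = {"query": query}
    if init_query_file:
        hive["initQueryFile"] = init_query_file
    if parts:
        hive["parameters"] = " ".join(parts)
    return {"hive": hive}


# Placeholder S3 locations and a made-up variable name.
hive_driver = build_hive_job_driver(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
    hiveconf={"hive.root.logger": "DEBUG,DRFA"},
    hivevar={"run_date": "2022-01-01"},
    init_query_file="s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/init.ql",
)
print(json.dumps(hive_driver, indent=2))
```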

Conﬁguration overrides (configurationOverrides)

This parameter is used for overriding application and monitoring level configuration properties. This parameter accepts a JSON object having the following two fields:

• applicationConfiguration ‐ This field allows you to override the default configurations for applications by supplying a configuration object. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by specific release version for Amazon EMR. For a list of configuration classifications that are available for each release version of Amazon EMR, see Release versions (p. 66).


If you pass the same conﬁguration in an application override and in Hive parameters, the Hive parameters take precedence. The complete conﬁguration priority list follows, in order of highest priority to lowest priority.

• Conﬁguration supplied as part of Hive parameters using --hiveconf <property=value>.

• Conﬁguration provided as part of application overrides.

• Optimized conﬁgurations chosen by Amazon EMR for the release.

• Default open source conﬁgurations for the application.

• monitoringConfiguration ‐ This field allows you to specify the Amazon S3 URL (s3MonitoringConfiguration) where you want the EMR Serverless job to store logs of your Hive job. Make sure you've created this bucket with the same AWS account hosting your application, and in the same Region where your job is running.

Hive job properties

The following table lists the mandatory properties that you must conﬁgure when submitting a Hive job.

| Setting | Description |
| --- | --- |
| hive.exec.scratchdir | The Amazon S3 location where temporary files are created during the Hive job execution. |
| hive.metastore.warehouse.dir | The Amazon S3 location of databases for managed tables in Hive. |
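Because a Hive job needs both mandatory properties, it can help to check a configuration before submitting it. The following Python sketch scans an applicationConfiguration list for the two settings under the hive-site classification. The function name missing_hive_properties is an assumption for illustration; EMR Serverless performs its own validation when you submit the job.

```python
REQUIRED_HIVE_PROPERTIES = ("hive.exec.scratchdir", "hive.metastore.warehouse.dir")


def missing_hive_properties(application_configuration):
    """Return the mandatory Hive properties that are absent or empty."""
    configured = {}
    for entry in application_configuration:
        # Only the hive-site classification is inspected in this sketch.
        if entry.get("classification") == "hive-site":
            configured.update(entry.get("properties", {}))
    return [name for name in REQUIRED_HIVE_PROPERTIES if not configured.get(name)]


# Placeholder bucket name; the warehouse directory is deliberately left out.
app_config = [{
    "classification": "hive-site",
    "properties": {
        "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
    },
}]
print(missing_hive_properties(app_config))
```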

The following table lists the optional Hive properties and their default values that you can override when submitting a Hive job.

| Setting | Description | Value |
| --- | --- | --- |
| hive.driver.memory | The amount of memory to use per Hive driver process. This memory is shared equally between HiveCLI and Tez Application Master with 20% of headroom. | 6G |
| hive.driver.cores | The number of cores to use for the Hive driver process. | 2 |
| hive.driver.disk | The disk size for the Hive driver. | 21G |
| hive.tez.disk.size | The disk size for each task container. | 21G |
| hive.prewarm.enabled | Enables container prewarm for Tez. | FALSE |
| hive.prewarm.numcontainers | The number of containers to prewarm for Tez. | 10 |
| hive.tez.container.size | The amount of memory to use per Tez task process. | 6144 |
| hive.tez.cpu.vcores | The number of cores to use for each Tez task. | 1 |
| hive.max-task-containers | The maximum number of concurrent containers. The configured mapper memory is also multiplied by this value to determine the available memory that is used by split computation and task preemption. | 100 |
| hive.exec.reducers.max | The maximum number of reducers. | 256 |
| hive.auto.convert.join.noconditionaltask.size | The size below which a join is directly converted to a mapjoin. | Optimal value is calculated based on Tez task memory |
| tez.runtime.io.sort.mb | The size of the soft buffer when output is sorted. | Optimal value is calculated based on Tez task memory |
| tez.runtime.unordered.output.buffer.size-mb | The size of the buffer to use if not writing directly to disk. | Optimal value is calculated based on Tez task memory |
| tez.am.task.max.failed.attempts | The maximum number of attempts that can fail for a particular task before the task is failed. This does not count manually terminated attempts. | 3 |
| hive.exec.stagingdir | The name of the directory for storing temporary files that will be created inside table locations and in the scratch directory location specified using the hive.exec.scratchdir property. | .hive-staging |
| hive.compute.query.using.stats | Enables Hive to answer a few queries using statistics stored in the metastore. For basic statistics, set hive.stats.autogather to true. For a more advanced collection, run analyze table queries. | TRUE |
| hive.vectorized.execution.enabled | Enables vectorized mode of query execution. | FALSE for EMR versions 5.34.0 and onwards. TRUE for EMR versions 6.5.0 and onwards. See HIVE-19269 for more information. |
| hive.cbo.enable | Enables cost-based optimizations using the Calcite framework. | TRUE |
| hive.tez.auto.reducer.parallelism | Enables Tez's auto-reducer parallelism feature. Hive will still estimate data sizes and set parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary. | FALSE |
| hive.stats.fetch.column.stats | Controls fetching of column statistics from the metastore; set to FALSE to disable it. Fetching column statistics can be expensive when the number of columns is high. | FALSE |
| hive.vectorized.execution.reduce.enabled | Enables vectorized mode of a query execution's reduce-side. | TRUE |
| hive.exec.max.dynamic.partitions.pernode | Maximum number of dynamic partitions allowed to be created in each mapper and reducer node. | 100 |
| hive.stats.fetch.partition.stats | Enables fetching of partition statistics from the metastore. When this flag is disabled, Hive will make calls to the file system to get file sizes and estimate the number of rows from the row schema. Fetching partition statistics can be expensive when the number of partitions is high. | TRUE for EMR versions 5.34.0 and onwards. Property is removed for EMR versions 6.5.0 and onwards. See HIVE-17932 for more information. Partition stats are collected by default if a table is partitioned. |
| hive.exec.max.dynamic.partitions | The maximum number of dynamic partitions allowed to be created in total. | 1000 |
| hive.auto.convert.join.noconditionaltask | Enables optimization in converting a common join into a mapjoin based on the input file size. | TRUE |
| hive.exec.dynamic.partition.mode | In strict mode, you must specify at least one static partition in case you accidentally overwrite all partitions. In non-strict mode, all partitions are allowed to be dynamic. | strict |
| hive.merge.tezfiles | Enables the merging of small files at the end of a Tez DAG. | FALSE |
| hive.strict.checks.cartesian.product | Enables strict Cartesian join checks, which disallows a Cartesian product (a cross join). | TRUE for EMR versions 5.34.0 and onwards. FALSE for EMR versions 6.5.0 and onwards. See HIVE-18251 for more information. |
| hive.stats.autogather | Enables basic statistics to be gathered automatically during the INSERT OVERWRITE command. | TRUE |
| hive.exec.orc.split.strategy | Expects one of [BI, ETL, HYBRID]. This is not a user level config. BI strategy is used when you want to spend less time in split generation as opposed to query execution (split generation does not read or cache file footers). ETL strategy is used when you want to spend more time in split generation (split generation reads and caches file footers). HYBRID chooses between the above strategies based on heuristics. | HYBRID |
| hive.auto.convert.join | Enables auto-conversion of common joins into mapjoins, based on the input file size. | TRUE |
| hive.default.fileformat | The default file format for CREATE TABLE statements. You can explicitly override this by specifying STORED AS [FORMAT] in your CREATE TABLE command. | TEXTFILE |
| hive.exec.reducers.bytes.per.reducer | The size per reducer. The default is 256 MB. If the input size is 1G, the job will use 4 reducers. | 256000000 |
| hive.exec.dynamic.partition | Enables dynamic partitions in DML/DDL. | TRUE |
| hive.merge.size.per.task | The size of merged files at the end of the job. | 256000000 |
| hive.merge.mapfiles | Enables small files to be merged at the end of a map-only job. | TRUE |
| hive.fetch.task.conversion | Expects one of [NONE, MINIMAL, MORE]. Some select queries can be converted to single FETCH task, minimizing latency. | MORE |
| hive.stats.gather.num.threads | The number of threads used by the partialscan and noscan analyze command for partitioned tables. This is applicable only for file formats that implement StatsProvidingRecordReader (like ORC). | 10 |
| hive.optimize.ppd | Enables predicate pushdown. | TRUE |
| hive.input.format | The default input format. Set to HiveInputFormat if you encounter problems with CombineHiveInputFormat. | org.apache.hadoop.hive.ql.io.CombineHiveInputFormat |
| hive.optimize.ppd.storage | Enables predicate pushdown to storage handlers. | TRUE |
| hive.groupby.position.alias | Enables using a column position alias in GROUP BY statements. | FALSE |
| hive.orderby.position.alias | Enables using a column position alias in ORDER BY statements. | TRUE |
| hive.mapred.reduce.tasks.speculative.execution | Enables speculative execution for reducers. | TRUE |
| hive.support.quoted.identifiers | Expects one of [NONE, COLUMN]. NONE implies only alphanumeric and underscore characters are valid in identifiers. COLUMN implies column names can contain any character. | COLUMN |
| hive.tez.min.partition.factor | Puts a lower limit to the number of reducers that Tez specifies when auto-reducer parallelism is enabled. | 0.25 |
| hive.strict.checks.type.safety | Enables strict type safety checks and disables comparing bigint with both string and double. | TRUE |
| hive.root.logger | The Hive log4j logging level. | INFO, DRFA |
| tez.am.log.level | The root logging level passed to the Tez app master. | INFO |
| tez.task.log.level | The root logging level passed to the Tez tasks. | INFO |
| tez.grouping.max-size | The upper bound on the size (in bytes) of a grouped split, to avoid generating excessively large splits. | 1073741824 |
| tez.grouping.min-size | The lower bound on the size (in bytes) of a grouped split, to avoid generating too many small splits. | 52428800 |
| tez.shuffle-vertex-manager.min-src-fraction | The fraction of source tasks which must complete before tasks for the current vertex are scheduled (in case of a ScatterGather connection). | 0.25 |
| tez.shuffle-vertex-manager.max-src-fraction | The fraction of source tasks that must have completed before all tasks on the current vertex can be scheduled (in case of a ScatterGather connection). The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction. This defaults to the default value or tez.shuffle-vertex-manager.min-src-fraction, whichever is greater. | 0.75 |
| tez.am.speculation.enabled | Enables speculative execution of slower tasks. This can help reduce job latency when some tasks are running slower due to bad or slow machines. | FALSE |
| tez.am.dag.cleanup.on.completion | Enables the cleanup of shuffle data upon DAG completion. | TRUE |
| tez.client.asynchronous-stop | Enables pushing of ATS events before terminating the Hive driver. | FALSE |
| tez.am.sleep.time.before.exit.millis | The amount of time after which ATS events should be pushed upon AM shutdown request. | 0 |
| tez.yarn.ats.event.flush.timeout.millis | The maximum amount of time for which AM should wait for events to be flushed before shutting down. | 300000 |
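The description of hive.exec.reducers.bytes.per.reducer states that a 1G input yields 4 reducers at the default of 256000000 bytes. The following Python sketch reproduces that arithmetic under a simplifying assumption: the reducer count is the input size divided by the bytes per reducer, rounded up, and capped by hive.exec.reducers.max. Hive's actual estimate depends on additional factors, such as Tez auto-reducer parallelism, so treat this as an approximation.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000, max_reducers=256):
    """Approximate the reducer count; a simplification of Hive's real estimate."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Use at least one reducer and never exceed hive.exec.reducers.max.
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(1_000_000_000))      # 1G of input with the defaults
print(estimate_reducers(1_000_000_000_000))  # a much larger input hits the cap
```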

Hive job examples

The following is an example of running a Hive query using the StartJobRun API.

aws emr-serverless start-job-run \
    --application-id <application_id> \
    --execution-role-arn <iam_role_arn> \
    --job-driver '{
        "hive": {
            "query": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
            "parameters": "--hiveconf hive.root.logger=DEBUG,DRFA"
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [{
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
                "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
                "hive.driver.cores": "2",
                "hive.driver.memory": "4g",
                "hive.tez.container.size": "4096",
                "hive.tez.cpu.vcores": "1"
            }
        }],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs/"
            }
        }
    }'

While the job is running, the logs for the Hive driver and Tez tasks continuously upload to the Amazon S3 log location you conﬁgured in monitoringConfiguration.

Once the job run has a state of SUCCEEDED, the output of your Hive query will be available in the Amazon S3 location you specified in the monitoringConfiguration field of configurationOverrides. For example, if your log location is s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs, your Hive query's output will be available in s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs/applications/<application-id>/jobs/<job-run-id>/HIVE_DRIVER/stdout.gz.
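The output location follows a fixed pattern beneath the log URI, so it can be derived from the application ID and job run ID. The following Python sketch builds that path from the layout described above. The function name hive_driver_stdout_uri and the sample IDs are placeholders for illustration.

```python
def hive_driver_stdout_uri(log_uri, application_id, job_run_id):
    """Build the S3 URI of the Hive driver's stdout log for a job run."""
    base = log_uri.rstrip("/")  # tolerate a trailing slash in the log URI
    return (
        f"{base}/applications/{application_id}"
        f"/jobs/{job_run_id}/HIVE_DRIVER/stdout.gz"
    )


# Made-up IDs; use the values returned by start-job-run.
uri = hive_driver_stdout_uri(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs/",
    "example-application-id",
    "example-job-run-id",
)
print(uri)
```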

The hive-query.ql ﬁle contains the query that Hive will run. The following is an example of a sample query.

create database if not exists emrserverless;

use emrserverless;

create table if not exists test_table(id int);

drop table if exists Values__Tmp__Table__1;

insert into test_table values (1),(2),(2),(3),(3),(3);

select id, count(id) from test_table group by id order by id desc;


Security

Cloud security at AWS is the highest priority. As an AWS customer, you beneﬁt from a data center and network architecture that is built to meet the requirements of the most security-sensitive organizations.

Security is a shared responsibility between AWS and you. The shared responsibility model describes this as security of the cloud and security in the cloud:

• Security of the cloud – AWS is responsible for protecting the infrastructure that runs AWS services in the AWS Cloud. AWS also provides you with services that you can use securely. Third-party auditors regularly test and verify the eﬀectiveness of our security as part of the AWS compliance programs. To learn about the compliance programs that apply to Amazon EMR Serverless, see AWS services in scope by compliance program.

• Security in the cloud – Your responsibility is determined by the AWS service that you use. You are also responsible for other factors including the sensitivity of your data, your company's requirements, and applicable laws and regulations.

This documentation helps you understand how to apply the shared responsibility model when using Amazon EMR Serverless. The topics in this chapter show you how to conﬁgure Amazon EMR Serverless and use other AWS services to meet your security and compliance objectives.

Topics

• Data protection (p. 29)

• Identity and Access Management (IAM) in Amazon EMR Serverless (p. 31)

• Security best practices for Amazon EMR Serverless (p. 49)

• Logging and monitoring (p. 50)

• Compliance validation for Amazon EMR Serverless (p. 50)

• Resilience in Amazon EMR Serverless (p. 51)

• Infrastructure security in Amazon EMR Serverless (p. 51)

• Conﬁguration and vulnerability analysis in Amazon EMR Serverless (p. 51)

Data protection

The AWS shared responsibility model applies to data protection in Amazon EMR Serverless. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud.

You are responsible for maintaining control over your content that is hosted on this infrastructure. This content includes the security configuration and management tasks for the AWS services that you use. For more information about data privacy, see the Data Privacy FAQ. For information about data protection in Europe, see the AWS Shared Responsibility Model and GDPR blog post on the AWS Security Blog.

For data protection purposes, we recommend that you protect AWS account credentials and set up individual user accounts with AWS Identity and Access Management (IAM). That way each user is given only the permissions necessary to fulﬁll their job duties. We also recommend that you secure your data in the following ways:

• Use multi-factor authentication (MFA) with each account.

• Use SSL/TLS to communicate with AWS resources. We recommend TLS 1.2 or later.

• Set up API and user activity logging with AWS CloudTrail.

• Use AWS encryption solutions, along with all default security controls within AWS services.


Table of Contents

What is Amazon EMR Serverless?

Concepts

Release version

Application

Job run

Workers

Pre-initialized capacity

Setting up

Step 1: Sign up for AWS

Step 2: Create an IAM user

Step 3: Install and conﬁgure the AWS CLI

Getting started

Step 1: Plan an EMR Serverless application

Prepare output log storage for EMR Serverless

Set up a job execution role

Step 2: Create an EMR Serverless application

Step 3: Start your application

Step 4: Schedule a job run to your EMR Serverless application

Step 5: Review your job run's output

Step 6: Clean up

Delete your application

Delete your S3 log bucket

Delete your job execution role

Interacting with your application

Application states

Working with your application on the AWS CLI

Pre-initialized capacity defaults

Conﬁguring and managing pre-initialized capacity

Customizing pre-initialized capacity for speciﬁc big data frameworks

Running jobs

Job run states

Submitting jobs on the AWS CLI

Running Spark jobs

Spark job defaults

Spark examples

Running Hive jobs

Hive job properties

Hive job examples

Security

Data protection