Amazon EMR
Amazon EMR Serverless User Guide
Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What is Amazon EMR Serverless? ... 1
Concepts ... 1
Release version ... 1
Application ... 1
Job run ... 2
Workers ... 2
Pre-initialized capacity ... 2
Setting up ... 3
Step 1: Sign up ... 3
Step 2: Create an IAM user ... 3
Step 3: Install the AWS CLI ... 3
Getting started ... 4
Step 1: Plan an application ... 4
Prepare output log storage for EMR Serverless ... 4
Set up a job execution role ... 5
Step 2: Create an application ... 6
Step 3: Start your application ... 6
Step 4: Schedule a job run ... 7
Step 5: Review output ... 7
Step 6: Clean up ... 8
Delete your application ... 8
Delete your S3 log bucket ... 8
Delete your job execution role ... 8
Interacting with your application ... 10
Application states ... 10
Working with your application on the AWS CLI ... 11
Pre-initialized capacity defaults ... 11
Configuring and managing pre-initialized capacity ... 12
Customizing pre-initialized capacity for specific big data frameworks ... 12
Running jobs ... 14
Job run states ... 14
Submitting jobs on the AWS CLI ... 15
Running Spark jobs ... 15
Spark defaults ... 17
Spark examples ... 20
Running Hive jobs ... 21
Hive properties ... 22
Hive examples ... 27
Security ... 29
Data protection ... 29
Encryption at rest ... 30
Encryption in transit ... 31
Identity and Access Management (IAM) ... 31
Audience ... 32
Authenticating with identities ... 32
Managing access using policies ... 34
How EMR Serverless works with IAM ... 35
Using job execution roles with EMR Serverless ... 40
Identity-based policy examples ... 43
Access policies ... 45
Policies for tag-based access control ... 46
Troubleshooting ... 47
Security best practices ... 49
Apply principle of least privilege ... 49
Isolate untrusted application code ... 49
Role-based access control (RBAC) permissions ... 50
Logging and monitoring ... 50
Compliance validation ... 50
Resilience ... 51
Infrastructure security ... 51
Configuration and vulnerability analysis ... 51
Tagging resources ... 52
What is a tag? ... 52
Tagging resources ... 52
Tagging limitations ... 53
Working with tags ... 53
Logging ... 55
Using Amazon S3 logs ... 55
Using the Spark UI ... 56
Using the Tez UI ... 58
Limitations ... 65
Release versions ... 66
Apache Hive ... 66
emr-5.34.0-preview (Hive 2.3.8) ... 66
Apache Spark ... 67
emr-6.5.0-preview (Spark 3.1.2) ... 67
What is Amazon EMR Serverless?
Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR Serverless provides a serverless runtime environment that simplifies running analytics applications using the latest open source frameworks such as Apache Spark and Apache Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks.
EMR Serverless helps you avoid over- or under-provisioning resources for your data processing jobs. EMR Serverless automatically determines the resources required by the applications, acquires these resources to process your jobs, and relinquishes them when the jobs finish. For use cases where applications require a response within seconds, such as interactive data analysis, you can pre-initialize required resources during application creation.
With EMR Serverless, you'll continue to get the benefits of Amazon EMR such as open source compatibility, concurrency, and performance optimized runtime for popular frameworks.
EMR Serverless is suitable for customers who want ease in operating applications using open source frameworks. It offers easy provisioning, quick job startup, automatic capacity management, and simple cost controls.
Concepts
This section introduces the key terms and concepts used in EMR Serverless.
Release version
An Amazon EMR release is a set of open-source applications from the big data ecosystem. Each release comprises different big data applications, components, and features that you select to have EMR Serverless deploy and configure to run your applications. When creating an application, you must specify its release version. You'll choose the Amazon EMR release version along with the open source framework version you want to use in your application.
Application
With EMR Serverless, you can create one or more EMR Serverless applications that use open source analytics frameworks. To create an application, you must specify the following attributes:
• The Amazon EMR release version for the open source framework version you want to use. To determine your release version, see Release versions (p. 66).
• The specific runtime that you want your application to use, such as Apache Spark or Apache Hive.
After you create an application, you can schedule data processing jobs or interactive requests to your application.
Each EMR Serverless application is strictly isolated from other applications and runs on a secure Amazon Virtual Private Cloud (VPC). Additionally, you can use IAM policies to define which IAM users and roles can access the application. You can also specify limits to control and track usage costs incurred by the application.
Consider creating multiple applications for the following scenarios:
• Using different open source frameworks
• Using different versions of open source frameworks for different use cases
• Performing A/B testing when upgrading from one version to another
• Maintaining separate logical environments for test and production scenarios
• Providing separate logical environments for different teams with independent cost controls and usage tracking
• Separating different line-of-business applications
EMR Serverless is a Regional service that simplifies running workloads across multiple Availability Zones within a Region. To learn more about using applications with EMR Serverless, see Interacting with your application (p. 10).
Job run
A job run is a request submitted to an EMR Serverless application that is asynchronously executed and tracked through completion. Examples of jobs include a HiveQL query submitted to an Apache Hive application or a PySpark data processing script submitted to an Apache Spark application. When submitting a job, you must specify an execution role, authored in IAM, that will be used by the job to access AWS resources, such as Amazon S3 objects. Multiple job run requests can be submitted to an application, and each job run can use a different execution role to access AWS resources. EMR Serverless starts executing jobs as soon as they are received and runs multiple job requests concurrently. To learn more about running jobs, see Running jobs (p. 14).
Workers
An EMR Serverless application internally uses workers to execute your workloads. The default sizes of these workers are based on your application type and Amazon EMR release version. You can override these sizes when scheduling a job run.
When a job is submitted, EMR Serverless computes the resources needed for the job and schedules workers. EMR Serverless breaks down your workloads into tasks, downloads images, provisions and sets up workers, and decommissions them when the job finishes. EMR Serverless automatically scales workers up or down depending on the workload and parallelism required at every stage of the job, removing the need for you to estimate the number of workers required to run your workloads.
Pre-initialized capacity
EMR Serverless provides a feature that keeps workers initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initial-capacity parameter of an application. Pre-initialized capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs. To learn more about pre-initialized workers, see Configuring and managing pre-initialized capacity (p. 12).
Setting up
Step 1: Sign up for AWS
When you sign up for AWS, your AWS account is automatically signed up for all services, including the generally available Amazon EMR deployment options. You are charged only for the services that you use. If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.
To create an AWS account
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.
Step 2: Create an IAM user
As a best practice, create an AWS Identity and Access Management (IAM) user with administrator permissions, and then use that IAM user for all work that does not require root credentials. Create a password for console access, and create access keys to use command line tools. For instructions, see Creating your first IAM admin user and group in the IAM User Guide.
You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access management. If you choose to create a separate user to work with EMR Serverless, ensure the user has sufficient permissions to invoke EMR Serverless actions by attaching an IAM policy to the IAM user. For more information, see Access policies (p. 45). However, if you choose to continue with an Admin user, no further action will be required.
Step 3: Install and configure the AWS CLI
To set up EMR Serverless, you must have the latest version of the AWS CLI installed. To install the latest version of the AWS CLI for macOS, Linux, or Windows, see Installing or updating the latest version of the AWS CLI.
To configure the AWS CLI and set up secure access to AWS services, including EMR Serverless, see Quick configuration with aws configure.
Getting started
This tutorial helps you get started using EMR Serverless by deploying a sample Spark workload. You'll create your application with default pre-initialized capacity, run the sample application with logs stored in your S3 bucket and view event logs in the Spark History Server. Note that, for simplicity, we have chosen default options in most parts of this tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).
Prerequisites
Before you launch an EMR Serverless application, make sure you complete the following tasks:
• EMR Serverless is currently in preview release. To access the preview of EMR Serverless, follow the sign-up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html.
• You must update the AWS CLI with the latest service model for EMR Serverless. Once you've received confirmation of access, use the following command to download the latest API model file and update the AWS CLI.
aws s3 cp s3://elasticmapreduce/emr-serverless-preview/artifacts/latest/dev/cli/service.json ./service.json
aws configure add-model --service-model file://service.json
• To use EMR Serverless, you must choose the AWS Region where preview is available. This applies to any AWS services and resources that EMR Serverless will need to access as part of running your workloads. Preview is currently available in US East (N. Virginia) us-east-1, and you may want to configure the AWS CLI to send all your AWS requests to this specific region by default. You can do so with the following command.
aws configure set region us-east-1
• Validate that the AWS CLI configuration and permissions to interact with EMR Serverless are correctly set up. You can do so by running the following command to see a list of your EMR Serverless applications in your current Region.
aws emr-serverless list-applications
If the command returns with an error, see Troubleshooting EMR Serverless identity and access (p. 47).
Step 1: Plan an EMR Serverless application
Prepare output log storage for EMR Serverless
In this tutorial, you'll use an S3 bucket to store output files and logs from the sample Spark workload you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide.
As noted in the prerequisites, the S3 bucket must be created in the same Region where EMR Serverless is available (us-east-1). Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.
Set up a job execution role
Job runs in EMR Serverless use an execution role that provides granular permissions to specific AWS services and resources at runtime. In this tutorial, the data and scripts are hosted in a public S3 bucket; however, the output, including logs, will be stored in DOC-EXAMPLE-BUCKET.
To set up a job execution role, you will first create an execution role with a trust policy that allows EMR Serverless to use the new role. Next, you'll attach the required S3 access policy to that role. The following steps walk you through the process.
1. Create a file named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The file should contain the following policy.
{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "EMRServerlessTrustPolicy",
        "Action": "sts:AssumeRole",
        "Effect": "Allow",
        "Principal": {
            "Service": "emr-serverless.amazonaws.com"
        }
    }]
}
2. Create an IAM role named sampleJobExecutionRole using the trust policy created in the previous step.
aws iam create-role \
--role-name sampleJobExecutionRole \
--assume-role-policy-document file://emr-serverless-trust-policy.json
Take note of the ARN in the output, as you will use the ARN of the new role during job submission, henceforth referred to as the <execution_role_arn>.
3. Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload. The policy grants read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET. You must replace DOC-EXAMPLE-BUCKET in the policy below with the actual name of the bucket you created in Step 1.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAccessForEMRSamples",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*.elasticmapreduce",
                "arn:aws:s3:::*.elasticmapreduce/*"
            ]
        },
        {
            "Sid": "FullAccessToOutputBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
            ]
        }
    ]
}
4. Create an IAM policy named sampleS3AccessPolicy using the policy file created in the previous step.
aws iam create-policy \
--policy-name sampleS3AccessPolicy \
--policy-document file://emr-sample-access-policy.json
Take note of the new policy's ARN in the output, as you will substitute it for <policy_arn> in the next step.
5. Attach the IAM policy sampleS3AccessPolicy to the job execution role sampleJobExecutionRole.
aws iam attach-role-policy \
--role-name sampleJobExecutionRole \
--policy-arn <policy_arn>
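If you prefer not to edit the JSON by hand, you can generate the two policy documents from steps 1 and 3 instead, which avoids leaving a DOC-EXAMPLE-BUCKET placeholder behind. The following Python sketch is one possible approach and is not part of the EMR Serverless tooling; the function names are illustrative assumptions, and the policy contents mirror the JSON shown in the steps above.

```python
import json


def build_trust_policy():
    """Return the trust policy that lets EMR Serverless assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EMRServerlessTrustPolicy",
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "emr-serverless.amazonaws.com"},
            }
        ],
    }


def build_access_policy(bucket_name):
    """Return the S3 access policy with the output bucket name filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadAccessForEMRSamples",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::*.elasticmapreduce",
                    "arn:aws:s3:::*.elasticmapreduce/*",
                ],
            },
            {
                "Sid": "FullAccessToOutputBucket",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                ],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


def write_policy(path, policy):
    """Write a policy document to a JSON file for use with file:// arguments."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(policy, handle, indent=2)
```

For example, calling write_policy("emr-sample-access-policy.json", build_access_policy("my-output-bucket")) with your own bucket name in place of the hypothetical my-output-bucket produces a file that you can pass to aws iam create-policy.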
Step 2: Create an EMR Serverless application
Now you're ready to create a new Spark application using EMR Serverless. To create an application, run the following command.
aws emr-serverless create-application \
--release-label emr-6.5.0-preview \
--type "SPARK" \
--name my-application
Take note of the application ID returned in the output, as you will use the ID to start the application and during job submission, henceforth referred to as the <application_id>.
Although EMR Serverless automatically pre-initializes a set of workers for you (with additional workers created on demand), you may choose a different set of pre-initialized workers by setting the initialCapacity parameter while creating the application. You may also choose to set a limit for the total maximum capacity that an application can use by setting the maximumCapacity parameter. To learn more about these options, see Configuring and managing pre-initialized capacity (p. 12).
Step 3: Start your application
Before you can schedule a job using your application, you must start the application. This action will pre-initialize a set of workers. You must ensure the application has reached the CREATED state before starting it. To check the state of your application, run the following command, substituting <application_id> with the ID of your new application.
aws emr-serverless get-application \
--application-id <application_id>
When the application has reached the CREATED state, start your application using the following command.
aws emr-serverless start-application \
--application-id <application_id>
Before moving to the next step, ensure your application has reached the STARTED state using the get-application API.
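Waiting for the CREATED and STARTED states amounts to polling get-application until the state changes. The following Python sketch shows one way to structure such a loop. It is a generic helper under stated assumptions, not an EMR Serverless API: get_state is a function you supply (for example, one that runs aws emr-serverless get-application and extracts the state from its output, whose exact shape is not shown in this guide), and the timeout, interval, and failure states are illustrative choices.

```python
import time


def wait_for_state(get_state, target, failure_states=(), timeout_s=600.0,
                   interval_s=10.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns target.

    get_state is a caller-supplied function that returns the current state
    string. Raises RuntimeError if a state listed in failure_states is seen,
    and TimeoutError if target is not reached within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return state
        if state in failure_states:
            raise RuntimeError(f"reached {state} while waiting for {target}")
        if clock() >= deadline:
            raise TimeoutError(f"still {state} after {timeout_s} seconds")
        sleep(interval_s)
```

When waiting for STARTED, you might pass failure_states=("STOPPING", "STOPPED", "TERMINATED"), since the application states table notes that EMR Serverless may move an application to Stopping when capacity provisioning fails.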
Step 4: Schedule a job run to your EMR Serverless application
Now your Spark application is ready to run jobs. In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. Both the script and the dataset are stored in a public, read-only S3 bucket. The output file and the log data from the Spark runtime will be pushed to the /output and /logs directories in the S3 bucket you created in Step 1 (DOC-EXAMPLE-BUCKET).
In the command below, substitute <application_id> with your application ID. Substitute <execution_role_arn> with the execution role ARN you created in Step 1. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket you created, adding /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.
aws emr-serverless start-job-run \
--application-id <application_id> \
--execution-role-arn <execution_role_arn> \
--job-driver '{
    "sparkSubmit": {
        "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
        "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/output"],
        "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
    }
}' \
--configuration-overrides '{
    "monitoringConfiguration": {
        "s3MonitoringConfiguration": {
            "logUri": "s3://DOC-EXAMPLE-BUCKET/logs"
        }
    }
}'
Note the job run ID returned in the output, as you will replace <job_run_id> with this ID in the following steps.
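The --job-driver and --configuration-overrides arguments are JSON documents, and the sparkSubmitParameters value is a single string of --conf pairs. If you submit many jobs, it can be less error-prone to build these values programmatically. The following Python sketch assembles the same structures used in the command above; the helper names are illustrative assumptions and not part of the AWS CLI.

```python
import json


def spark_submit_parameters(conf):
    """Render a dict of Spark properties as a '--conf key=value ...' string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_job_driver(entry_point, arguments, conf):
    """Build the document passed to --job-driver for a Spark job."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": list(arguments),
            "sparkSubmitParameters": spark_submit_parameters(conf),
        }
    }


def build_configuration_overrides(log_uri):
    """Build the document passed to --configuration-overrides for S3 logging."""
    return {
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": log_uri}
        }
    }


def to_cli_argument(document):
    """Serialize a document to the JSON text expected by the CLI option."""
    return json.dumps(document)
```

This sketch assumes property values contain no spaces. Values that do would need quoting appropriate to spark-submit, which is outside the scope of this example.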
Step 5: Review your job run's output
The job run should typically take 3-5 minutes to complete. You can check for the state of the job using the following command.
aws emr-serverless get-job-run \
--application-id <application_id> \
--job-run-id <job_run_id>
With your log destination set to s3://DOC-EXAMPLE-BUCKET/logs, the logs for this specific job run can be found under s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>.
Depending upon the type of application and log file, EMR Serverless will upload logs to your bucket at different cadences. For Spark applications, EMR Serverless will push event logs every 30 seconds to the sparklogs folder in the S3 log destination. The Spark runtime logs for the driver and executors (i.e., stderr and stdout log files) will upload upon completion of the job to folders named appropriately by the worker type, such as driver or executor.
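The log location for a job run follows directly from the log destination, the application ID, and the job run ID. As a small illustration, the following Python sketch builds that prefix. The function name and the IDs in the usage note are hypothetical, and the layout below the prefix is left out because this guide describes it only in general terms.

```python
def job_log_prefix(log_uri, application_id, job_run_id):
    """Return the S3 prefix under which logs for one job run are stored."""
    base = log_uri.rstrip("/")  # tolerate a trailing slash on the log URI
    return f"{base}/applications/{application_id}/jobs/{job_run_id}"
```

For example, job_log_prefix("s3://DOC-EXAMPLE-BUCKET/logs", "<application_id>", "<job_run_id>") reproduces the path shown above, which you could then pass to aws s3 ls.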
The output of the PySpark job will be uploaded upon successful execution of the job to s3://DOC-EXAMPLE-BUCKET/output/.
Step 6: Clean up
When you’re done working with this tutorial, consider deleting the resources you created. This will help you avoid any unnecessary expenses. Note that in preview, there is no additional cost to using EMR Serverless. However, we still recommend following best practices by releasing resources that you don't intend to use again.
Delete your application
An application must be in the STOPPED state before you can delete it. Use the following command to stop the application.
aws emr-serverless stop-application \
--application-id <application_id>
Once the application is in the STOPPED state, use the following command to delete the application.
aws emr-serverless delete-application \
--application-id <application_id>
Delete your S3 log bucket
To delete your S3 logging and output bucket, use the following commands. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Step 1.
aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive
aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET
Delete your job execution role
To delete the execution role, detach the policy from the role. You can then delete both the role and the policy.
aws iam detach-role-policy \
--role-name sampleJobExecutionRole \
--policy-arn <policy_arn>
To delete the role, use the following command.
aws iam delete-role \
--role-name sampleJobExecutionRole
To delete the policy that was attached to the role, use the following command.
aws iam delete-policy \
--policy-arn <policy_arn>
This concludes the tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).
Interacting with your application
In this section, we'll cover how you can interact with your Amazon EMR Serverless application using the AWS CLI and the defaults for Spark and Hive engines.
Topics
• Application states (p. 10)
• Working with your application on the AWS CLI (p. 11)
• Pre-initialized capacity defaults (p. 11)
• Configuring and managing pre-initialized capacity (p. 12)
Application states
When you create an application with Amazon EMR Serverless, the application enters the CREATING state. It then moves through the following states over its lifetime.
Applications can have the following states:
State        Description
Creating     The application is being prepared and is not yet ready to use.
Created      The application has been created but has not yet provisioned capacity. It can be modified to make changes to the initial capacity configuration.
Starting     The application has started and is provisioning capacity.
Started      The application is ready to accept new jobs. Jobs will only be accepted in this state.
Stopping     All jobs have completed and the application is releasing its capacity. EMR Serverless may move an application to this state when there are failures in provisioning capacity.
Stopped      The application is stopped and no resources are running on the application. It can be modified to make changes to the initial capacity configuration.
Terminated   The application has been terminated and will not appear on your list. EMR Serverless may move an application to this state when there are failures in creation.
The following diagram illustrates the trajectory of EMR Serverless application states.
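The rules stated in the table can be expressed as simple checks, which is convenient if you script against an application. The following Python sketch encodes only what this guide states: jobs are accepted in the Started state, the initial capacity configuration can be changed in the Created or Stopped states, and an application must be Stopped before it is deleted. The upper-case state spellings and the function names are assumptions made for illustration, and the sketch does not attempt to model every transition.

```python
# States as listed in the table, written in the upper-case form used in the
# running text (for example CREATED and STARTED). The spelling is an assumption.
APPLICATION_STATES = (
    "CREATING", "CREATED", "STARTING", "STARTED",
    "STOPPING", "STOPPED", "TERMINATED",
)


def _normalize(state):
    """Validate a state name and return it in upper case."""
    name = state.upper()
    if name not in APPLICATION_STATES:
        raise ValueError(f"unknown application state: {state}")
    return name


def accepts_jobs(state):
    """Jobs are only accepted while the application is Started."""
    return _normalize(state) == "STARTED"


def can_modify_initial_capacity(state):
    """Initial capacity can be changed in the Created or Stopped states."""
    return _normalize(state) in {"CREATED", "STOPPED"}


def can_delete(state):
    """An application must be Stopped before it can be deleted."""
    return _normalize(state) == "STOPPED"
```

A script might call accepts_jobs on the state returned by get-application before submitting a job run.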
Working with your application on the AWS CLI
This section covers how to create, describe, and delete individual applications on the command line, as well as how to list all of your applications to view them at a glance. For more application operations, such as starting, stopping, and updating your application, see the EMR Serverless API Reference.
To create an application, use create-application. You must specify SPARK or HIVE as the application type. This command returns the application’s ARN, name, and ID.
aws emr-serverless create-application \
--name <my_application_name> \
--type <application_type> \
--release-label <release_version>
To describe an application, use get-application and provide its application-id. This command returns the state and capacity-related configurations for your application.
aws emr-serverless get-application \
--application-id <application_id>
To list all of your applications, call list-applications. This command returns the same properties as get-application but includes all of your applications.
aws emr-serverless list-applications
To delete your application, call delete-application and supply your application-id.
aws emr-serverless delete-application \
--application-id <application_id>
Pre-initialized capacity defaults
By default, EMR Serverless configures every application with enough pre-initialized capacity to run one job at low latency. The defaults are engine-specific and are set as follows:
• Spark - 3 drivers (4 vCPU, 16 GB memory, 21 GB disk), 9 executors (4 vCPU, 16 GB memory, 21 GB disk)
• Hive - 3 drivers (2 vCPU, 6 GB memory, and 21 GB disk), 30 Tez tasks (1 vCPU, 6 GB memory, and 21 GB disk)
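To see what these defaults add up to, you can multiply each worker count by its per-worker size. The following Python sketch does that arithmetic using the figures listed above. The worker type names DRIVER, EXECUTOR, and TEZ_TASK are taken from the CLI examples later in this guide, and the function name is an illustrative assumption.

```python
# (worker count, vCPU per worker, memory GB per worker, disk GB per worker)
DEFAULT_CAPACITY = {
    "SPARK": {"DRIVER": (3, 4, 16, 21), "EXECUTOR": (9, 4, 16, 21)},
    "HIVE": {"DRIVER": (3, 2, 6, 21), "TEZ_TASK": (30, 1, 6, 21)},
}


def total_default_capacity(application_type):
    """Sum the default pre-initialized resources for one application type."""
    totals = {"workers": 0, "vcpu": 0, "memory_gb": 0, "disk_gb": 0}
    for count, vcpu, memory_gb, disk_gb in DEFAULT_CAPACITY[application_type].values():
        totals["workers"] += count
        totals["vcpu"] += count * vcpu
        totals["memory_gb"] += count * memory_gb
        totals["disk_gb"] += count * disk_gb
    return totals
```

With these figures, a default Spark application pre-initializes 12 workers totaling 48 vCPU and 192 GB of memory, and a default Hive application pre-initializes 33 workers totaling 36 vCPU and 198 GB of memory.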
Configuring and managing pre-initialized capacity
EMR Serverless provides an optional feature that keeps drivers and workers pre-initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initialCapacity parameter of an application to the number of workers you want to pre-initialize. Pre-initialized worker capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs.
When a job is run, if any workers from initialCapacity are available (not already in use by previously submitted jobs), those resources are used to start running the job. If those resources are not available because they are in use by other jobs, or if resources are insufficient to execute the job because the job requires more than what is available from initialCapacity, then additional workers are requested and acquired, up to the maximum limits on resources set for the application. When jobs finish running, the workers used by the job are released, and the number of resources available for the application returns to initialCapacity. An application maintains the initialCapacity of resources even after jobs finish running. Excess resources beyond initialCapacity are released immediately when they're no longer required to run jobs.
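The allocation behavior described above can be illustrated with a simplified model. The following Python sketch is an approximation for reasoning about capacity, not a description of the EMR Serverless scheduler: it counts whole workers rather than vCPU and memory, treats the application's maximum as a cap on the total number of workers, and ignores on-demand workers held by other jobs. This guide does not say what happens to demand beyond the maximum, so the sketch only reports it.

```python
def plan_workers(required, initial_capacity, in_use, max_workers):
    """Split a job's worker demand between the warm pool and on-demand workers.

    All arguments are worker counts. in_use is the number of pre-initialized
    workers already taken by other jobs.
    """
    # Pre-initialized workers that are free are used first.
    free_in_pool = max(initial_capacity - in_use, 0)
    from_pool = min(required, free_in_pool)

    # Anything the pool cannot cover is requested on demand, up to the cap.
    still_needed = required - from_pool
    headroom = max(max_workers - initial_capacity, 0)
    on_demand = min(still_needed, headroom)

    return {
        "from_pool": from_pool,
        "on_demand": on_demand,
        "unmet": still_needed - on_demand,
    }
```

For example, with 10 pre-initialized workers of which 4 are busy and a cap of 20, a job that needs 12 workers takes 6 from the pool and 6 on demand.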
Note: For this preview release, you must manually stop an application to decommission the pre-initialized workers.
Pre-initialized capacity is available and ready to use when the application has started. It is decommissioned when the application is stopped. An application moves to the STARTED state only if the requested pre-initialized capacity has been created and is ready to use. For the entire duration that the application is in the STARTED state, EMR Serverless ensures that the pre-initialized capacity is available for use or is in use by jobs or interactive workloads. Capacity is replenished for released or failed containers to maintain the number of workers specified in the initialCapacity parameter. For an application with no pre-initialized capacity, the state can immediately transition from CREATED to STARTED.
You can modify the initialCapacity worker counts and specify compute configurations, such as vCPU, memory, and disk, for each worker. Modifications are allowed only when the application is in the CREATED or STOPPED state.
Customizing pre-initialized capacity for specific big data frameworks
You can further customize pre-initialized capacity to suit workloads running on specific big data frameworks. For example, when running Apache Spark, you can specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, you can specify how many workers start as Hive drivers, and how many are used to run Tez tasks.
Configuring an application running Apache Hive with pre-initialized capacity
The following API request creates an application running Apache Hive based on Amazon EMR release emr-5.34.0-preview. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized Tez task workers, each with 4 vCPU and 8 GB of memory. When Hive queries are run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.
aws emr-serverless create-application \
--type HIVE \
--name <my_application_name> \
--release-label emr-5.34.0-preview \
--initial-capacity '{
    "DRIVER": {
        "workerCount": 5,
        "resourceConfiguration": {
            "cpu": "2vCPU",
            "memory": "4GB"
        }
    },
    "TEZ_TASK": {
        "workerCount": 50,
        "resourceConfiguration": {
            "cpu": "4vCPU",
            "memory": "8GB"
        }
    }
}' \
--maximum-capacity '{
    "cpu": "400vCPU",
    "memory": "1024GB"
}'
Configuring an application running Apache Spark with pre-initialized capacity
The following API request creates an application running Apache Spark 3.1 based on Amazon EMR release 6.5. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs are run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Spark jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.
Note: Spark adds 10% overhead to the memory requested for drivers and executors. For jobs to use pre-initialized workers, the initial capacity memory configuration should be at least 10% more than the memory requested by the job.
aws emr-serverless create-application \
--type "SPARK" \
--name <my_application_name> \
--release-label "emr-6.5.0-preview" \
--initial-capacity '{
    "DRIVER": {
        "workerCount": 5,
        "resourceConfiguration": {
            "cpu": "2vCPU",
            "memory": "4GB"
        }
    },
    "EXECUTOR": {
        "workerCount": 50,
        "resourceConfiguration": {
            "cpu": "4vCPU",
            "memory": "8GB"
        }
    }
}' \
--maximum-capacity '{
    "cpu": "400vCPU",
    "memory": "1024GB"
}'
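The 10% memory overhead mentioned in the note above determines whether a job's drivers and executors fit the pre-initialized workers. The following Python sketch works through that arithmetic using integer megabytes. It assumes 1 GB equals 1024 MB and applies the overhead as a flat percentage, which is a simplification of how Spark computes overhead; the function names are illustrative.

```python
def fits_preinitialized_worker(requested_mb, worker_mb, overhead_percent=10):
    """Return True if the request plus overhead fits in the worker's memory."""
    return requested_mb * (100 + overhead_percent) <= worker_mb * 100


def max_request_mb(worker_mb, overhead_percent=10):
    """Largest memory request (in MB) that still fits after adding overhead."""
    return worker_mb * 100 // (100 + overhead_percent)
```

Under these assumptions, a job that requests 4g for its driver needs about 4.4 GB in total, so it does not fit a 4 GB pre-initialized driver, while a 4g executor fits an 8 GB pre-initialized executor. The largest request that fits a 4 GB worker is 3723 MB.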
Running jobs
After you've provisioned your application, you can submit jobs to it. This section covers how to use the AWS CLI to run jobs on your application and what the defaults are for each type of application available on EMR Serverless.
Topics
• Job run states (p. 14)
• Submitting jobs on the AWS CLI (p. 15)
• Running Spark jobs (p. 15)
• Running Hive jobs (p. 21)
Job run states
When you submit a job run to an Amazon EMR Serverless job queue, the job run enters the SUBMITTED state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code).
Job runs can have the following states:
State | Description
Submitted | The initial job state when a job run is submitted to EMR Serverless. The job is waiting to be scheduled for the application, and EMR Serverless is working on prioritizing and scheduling this job run.
Pending | The job run is being evaluated by the scheduler to prioritize and schedule it for the application.
Scheduled | The job run has been scheduled for the application, and EMR Serverless is allocating resources to execute it.
Running | The job run has the necessary initial resources and is running in the application. In Spark applications, this means that the Spark driver process is in the running state.
Failed | The job run failed to be submitted to the application or it completed unsuccessfully. See StateDetails for additional information about this job failure.
Completed | The job run completed successfully.
Cancelling | The job run has been requested for cancellation, either through the CancelJobRun API or due to timeout. EMR Serverless is trying to cancel the job in the application and release the resources.
Cancelled | The job run was cancelled successfully, and the resources it used have been released.
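The state table above suggests a simple way to wait for a job run: poll its state until it reaches one that no longer changes. The following Python sketch illustrates the idea. It assumes that Completed, Failed, and Cancelled are the terminal states, uses the state names from the table in upper case (the exact strings returned by the API may differ), and takes a caller-supplied fetch_state function, which is a hypothetical stand-in for a get-job-run call.

```python
import time
from typing import Callable

# Assumption: these are the states after which a job run no longer changes.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def is_terminal(state: str) -> bool:
    """Return True if the job run has finished, successfully or not."""
    return state.upper() in TERMINAL_STATES


def wait_for_job_run(fetch_state: Callable[[], str],
                     poll_seconds: float = 5.0,
                     max_polls: int = 120) -> str:
    """Poll fetch_state() until a terminal state is seen, then return it."""
    for _ in range(max_polls):
        state = fetch_state()
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job run not finished after {max_polls} polls")
```

Passing the fetch function as an argument keeps the loop independent of how the state is obtained, so the same code works with the AWS CLI, an SDK, or a stub in a test.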
Submitting jobs on the AWS CLI
You can submit, describe, and cancel individual job runs from the command line. You can also list all of your job runs to view them at a glance.
To submit a new job, use start-job-run and supply the ID of the application that you want to run the job on, along with job-specific properties. For Spark examples, see Running Spark jobs (p. 15). For Hive examples, see Running Hive jobs (p. 21). This command returns your application-id, ARN, and new job-id.
To describe a job, use get-job-run. This command returns job-specific configurations and the set capacity for your new job.
aws emr-serverless get-job-run \
  --job-run-id <job_id> \
  --application-id <application_id>
To list your jobs, call list-job-runs. This command returns an abbreviated set of properties including job type, state, and other high-level attributes. If you don't want to see all of your jobs, you can specify the maximum number of jobs to return, up to 50. The following command shows how to request only your two most recent job runs.
aws emr-serverless list-job-runs \
  --max-results 2 \
  --application-id <application_id>
To cancel a job, call cancel-job-run, supplying both the application-id and the job-id of the job you want to cancel.
aws emr-serverless cancel-job-run \
  --job-run-id <job_id> \
  --application-id <application_id>
For more information on running jobs using the AWS CLI, see the EMR Serverless API Reference.
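When these commands are driven from a script, it can help to assemble the argument list in one place and validate it before calling the CLI. The following Python sketch builds argument lists for the get-job-run, list-job-runs, and cancel-job-run commands shown above. The function names are hypothetical, and the only validation is the limit of 50 results mentioned in the text. The sketch only builds the commands; running them (for example with subprocess.run) is left to the caller.

```python
from typing import List, Optional

MAX_RESULTS_LIMIT = 50  # upper bound stated in the text above


def list_job_runs_command(application_id: str,
                          max_results: Optional[int] = None) -> List[str]:
    """Build the argument list for 'aws emr-serverless list-job-runs'."""
    cmd = ["aws", "emr-serverless", "list-job-runs",
           "--application-id", application_id]
    if max_results is not None:
        if not 1 <= max_results <= MAX_RESULTS_LIMIT:
            raise ValueError(f"max_results must be 1..{MAX_RESULTS_LIMIT}")
        cmd += ["--max-results", str(max_results)]
    return cmd


def job_run_command(action: str, application_id: str,
                    job_run_id: str) -> List[str]:
    """Build the argument list for get-job-run or cancel-job-run."""
    if action not in ("get-job-run", "cancel-job-run"):
        raise ValueError(f"unsupported action: {action}")
    return ["aws", "emr-serverless", action,
            "--application-id", application_id,
            "--job-run-id", job_run_id]


# Example with placeholder IDs (hypothetical values).
print(list_job_runs_command("my-application-id", max_results=2))
```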
Running Spark jobs
You can run Spark jobs on an application with the type parameter set to 'SPARK'. Jobs must be compatible with the Spark version referenced in the Amazon EMR release version. For example, when running jobs on an application with Amazon EMR release 6.5, your job must be compatible with Apache Spark 3.1.2.
To run a Spark job, you must specify the following parameters when using the start-job-run API.
Execution role (executionRoleArn)
This is an IAM role ARN that your application uses to execute Spark jobs. This role must contain the following permissions:
• Read from S3 buckets or other data sources where your data resides
• Read from S3 buckets or prefixes where your PySpark script or JAR file is located
• Write to S3 buckets where you intend to write your final output
• Write logs to an S3 bucket or prefix specified by S3MonitoringConfiguration
• Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket
• Access to AWS Glue Catalog if you use SparkSQL
Failure to provide these permissions to the IAM role can lead to job failures. If your Spark job is reading or writing data to or from other data sources, make sure that the appropriate permissions are specified in this IAM role. For more information, see Using job execution roles with EMR Serverless (p. 40) and Logging (p. 55).
Job driver (jobDriver)
A job's driver is used to provide input to the job. This parameter accepts only one value for the job type that you want to run. For a Spark job, that value is sparkSubmit. You can use this job type to run Scala, Java, PySpark, SparkR, and any other supported jobs through Spark submit. This job type has the following parameters:
• entryPoint ‐ This is the reference in Amazon S3 to the main JAR or Python file that you want to run. If you are running a Scala or Java JAR, specify the main entry class in sparkSubmitParameters using the --class argument.
• entryPointArguments ‐ This is an array of arguments that you want to pass to your main JAR or Python file. You should handle reading these parameters using your entrypoint code. Each argument in the array should be separated by a comma.
• sparkSubmitParameters ‐ These are the additional Spark parameters that you want to send to the job. Use this parameter to override default Spark properties, such as driver memory or number of executors, by passing arguments such as --conf or --class.
For additional information, see Launching Applications with spark-submit.
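The three sparkSubmit fields described above can be assembled programmatically before they are passed to --job-driver. The following Python sketch shows one way to do that. The helper names (build_spark_submit_parameters, build_spark_job_driver) are hypothetical, and the sketch only produces the JSON structure used in this guide's examples; it does not submit anything.

```python
import json
from typing import Dict, List, Optional


def build_spark_submit_parameters(conf: Dict[str, str],
                                  main_class: Optional[str] = None) -> str:
    """Join an optional --class and a set of --conf pairs into one string."""
    parts = []
    if main_class:
        parts.append(f"--class {main_class}")
    parts.extend(f"--conf {key}={value}" for key, value in conf.items())
    return " ".join(parts)


def build_spark_job_driver(entry_point: str,
                           arguments: Optional[List[str]] = None,
                           conf: Optional[Dict[str, str]] = None,
                           main_class: Optional[str] = None) -> dict:
    """Return the jobDriver object for a Spark job."""
    spark_submit = {"entryPoint": entry_point}
    if arguments:
        spark_submit["entryPointArguments"] = list(arguments)
    parameters = build_spark_submit_parameters(conf or {}, main_class)
    if parameters:
        spark_submit["sparkSubmitParameters"] = parameters
    return {"sparkSubmit": spark_submit}


# A Scala/Java JAR needs --class; a PySpark script would omit main_class.
driver = build_spark_job_driver(
    "/usr/lib/spark/examples/jars/spark-examples.jar",
    arguments=["1"],
    conf={"spark.executor.cores": "4", "spark.executor.memory": "20g"},
    main_class="org.apache.spark.examples.SparkPi",
)
print(json.dumps(driver, indent=2))
```

The printed JSON can be pasted into the --job-driver argument of start-job-run.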
Configuration overrides (configurationOverrides)
This parameter overrides application-level and monitoring-level configuration properties. It accepts a JSON object with the following two fields:
• applicationConfiguration ‐ This field allows you to override the default configurations for applications by supplying a configuration object. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by specific release version for Amazon EMR. For a list of configuration classifications that are available for each release version of Amazon EMR, see Release versions (p. 66).
If you pass the same configuration in an application override and in Spark submit parameters, the Spark submit parameters take precedence. The complete configuration priority list follows, in order of highest priority to lowest priority:
• Configuration supplied when creating SparkSession.
• Configuration supplied as part of sparkSubmitParameters using --conf.
• Configuration provided as part of application overrides.
• Optimized configurations chosen by Amazon EMR for the release.
• Default open source configurations for the application.
• monitoringConfiguration ‐ This field allows you to specify the Amazon S3 URL (s3MonitoringConfiguration) where you want the EMR Serverless job to store logs of your Spark job. Make sure you've created this bucket with the same AWS account hosting your application, and in the same Region where your job is running.
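The configuration priority list above can be pictured as a layered lookup in which the first layer that defines a property wins. The following Python sketch models that behavior locally to help predict which value a job will see. It is an illustration of the stated ordering, not the service's implementation, and the function name and sample values are hypothetical.

```python
from collections import ChainMap
from typing import Dict, Mapping


def effective_spark_conf(session_conf: Mapping[str, str],
                         submit_conf: Mapping[str, str],
                         application_overrides: Mapping[str, str],
                         emr_optimized: Mapping[str, str],
                         open_source_defaults: Mapping[str, str]) -> Dict[str, str]:
    """Merge the five layers, highest priority first."""
    # ChainMap searches its mappings left to right, so earlier layers win.
    layers = ChainMap(dict(session_conf), dict(submit_conf),
                      dict(application_overrides), dict(emr_optimized),
                      dict(open_source_defaults))
    return dict(layers)


resolved = effective_spark_conf(
    session_conf={},
    submit_conf={"spark.executor.memory": "4g"},
    application_overrides={"spark.executor.memory": "8g",
                           "spark.executor.cores": "2"},
    emr_optimized={"spark.executor.memory": "14G",
                   "spark.executor.cores": "4"},
    open_source_defaults={"spark.dynamicAllocation.enabled": "false"},
)
print(resolved)
```

In this sample, spark.executor.memory resolves to the value passed with --conf, spark.executor.cores resolves to the application override, and the remaining property falls through to the lowest layer.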
Spark job defaults
The following table lists optional Spark properties and their default values that you can override when submitting a Spark job.
Key | Explanation | Default value
spark.emr-serverless.allocation.batch.size | The number of containers to request in each cycle of executor allocation. There is a 1-second gap between each allocation cycle. | 20
spark.emr-serverless.allocation.executor.timeout | The time to wait for a newly created executor container to reach the running state before the request is cancelled. | 300s
spark.emr-serverless.memoryOverheadFactor | Sets the memory overhead factor to be added to the driver and executor container memory. | 0.1
spark.executor.memory | The amount of memory used by each executor. | 14G
spark.executor.cores | The number of cores used by each executor. | 4
spark.driver.memory | The amount of memory used by the driver. | 14G
spark.driver.cores | The number of cores used by the driver. | 4
spark.emr-serverless.driver.disk | The Spark driver disk. | 21G
spark.emr-serverless.executor.disk | The Spark executor disk. | 21G
spark.executor.instances | The number of Spark executor containers to allocate. | 3
spark.executor.extraJavaOptions | Extra Java options for the Spark executor. | NULL
spark.driver.extraJavaOptions | Extra Java options for the Spark driver. | NULL
spark.emr-serverless.driverEnv.[KEY] | Adds additional environment variables to the Spark driver. | NULL
spark.executorEnv.[KEY] | Adds additional environment variables to the Spark executors. | NULL
spark.files | A comma-separated list of files to be placed in the working directory of each executor. The file paths of these files in executors can be accessed by running SparkFiles.get(fileName). | NULL
spark.jars | Additional jars to add to the runtime classpath of the driver and executors. | NULL
spark.archives | A comma-separated list of archives to be extracted into the working directory of each executor. Supported file types include .jar, .tar.gz, .tgz and .zip. You can specify the directory name to unpack by adding # after the file name to unpack. For example, file.zip#directory. This configuration is experimental. | NULL
spark.submit.pyFiles | A comma-separated list of .zip, .egg, or .py files to place in the PYTHONPATH for Python apps. | NULL
spark.sql.warehouse.dir | The default location for managed databases and tables. | The value of $PWD/spark-warehouse
spark.hadoop.hive.metastore.client.factory.class | The Hive metastore implementation class. | NULL
spark.authenticate | Enables authentication of Spark's internal connections. | TRUE
spark.network.crypto.enabled | Enables AES-based RPC encryption, including the new authentication protocol added in 2.2.0. | FALSE
spark.dynamicAllocation.enabled | Enables dynamic resource allocation, which scales the number of executors registered with the application up or down, based on the workload. | TRUE
spark.dynamicAllocation.executorIdleTimeout | The duration of idle time an executor can have before it will be removed. This only applies if dynamic allocation is enabled. | 60s
spark.dynamicAllocation.initialExecutors | The initial number of executors to run if dynamic allocation is enabled. | 3
spark.dynamicAllocation.maxExecutors | The upper bound for the number of executors if dynamic allocation is enabled. | 1000
spark.dynamicAllocation.minExecutors | The lower bound for the number of executors if dynamic allocation is enabled. | 0
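The spark.emr-serverless.allocation.batch.size row in the table above implies some simple arithmetic: executors are requested in batches, with a 1-second gap between allocation cycles. The following Python sketch estimates how many cycles a given number of executors needs and the minimum time spent on gaps between requests. It models only the request pacing described in the table, not container start-up time, and the function names are hypothetical.

```python
import math


def allocation_cycles(executors: int, batch_size: int = 20) -> int:
    """Number of allocation cycles needed to request all executors."""
    if executors < 0 or batch_size <= 0:
        raise ValueError("executors must be >= 0 and batch_size must be > 0")
    return math.ceil(executors / batch_size)


def minimum_request_seconds(executors: int, batch_size: int = 20,
                            gap_seconds: float = 1.0) -> float:
    """Total gap time between the first and last allocation cycle."""
    cycles = allocation_cycles(executors, batch_size)
    return max(cycles - 1, 0) * gap_seconds


# 50 executors with the default batch size of 20: batches of 20, 20, and 10.
print(allocation_cycles(50), minimum_request_seconds(50))
```

Under these assumptions, 50 executors need three cycles with two 1-second gaps between them, while the default of 3 executors fits in a single cycle.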
The following table lists the default Spark submit parameters.
Key | Explanation | Default value
executor-memory | The amount of memory used by each executor. | 14G
executor-cores | The number of cores used by each executor. | 4
driver-memory | The amount of memory used by the driver. | 14G
driver-cores | The number of cores used by the driver. | 4
num-executors | The number of executors to launch. | 3
files | A comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). | NULL
py-files | A comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. | NULL
archives | A comma-separated list of archives to be extracted into the working directory of each executor. | NULL
jars | A comma-separated list of jars to include on the driver and executor classpaths. | NULL
verbose | Enables printing additional debug output. | NULL
class | The application's main class (for Java and Scala apps). | NULL
conf | An arbitrary Spark configuration property. | NULL
Spark examples
The following is an example of running a Python script using the StartJobRun API. For an end-to-end tutorial using this example, see Getting started (p. 4).
aws emr-serverless start-job-run \
  --application-id <application_id> \
  --execution-role-arn <iam_role_arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
      "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET-OUTPUT/wordcount_output"],
      "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
      }
    }
  }'
The following is an example of running a Spark JAR using the StartJobRun API.
aws emr-serverless start-job-run \
  --application-id <application_id> \
  --execution-role-arn <iam_role_arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
      "entryPointArguments": ["1"],
      "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING/logs/"
      }
    }
  }'
Running Hive jobs
You can run Hive jobs on an application with the type parameter set to 'HIVE'. Jobs must be compatible with the Hive version referenced in the Amazon EMR release version. For example, when running jobs on an application with Amazon EMR release 5.34.0-preview, your job must be compatible with Apache Hive 2.3.8.
To run a Hive job, you must specify the following parameters when using the start-job-run API.
Execution role (executionRoleArn)
This is an IAM role ARN that your application uses to execute Hive jobs. This role must contain the following permissions:
• Read from S3 buckets or other data sources where your data resides
• Read from S3 buckets or prefixes where your Hive query file and init query file are located
• Read and write to S3 buckets where your Hive Scratch directory and Hive Metastore warehouse directory are located
• Write to S3 buckets where you intend to write your final output
• Write logs to an S3 bucket or prefix specified by S3MonitoringConfiguration
• Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket
• Access to AWS Glue Catalog
Failure to provide these permissions to the IAM role can lead to job failures. If your Hive job is reading or writing data to or from other data sources, make sure that the appropriate permissions are specified in this IAM role. For more information, see Using job execution roles with EMR Serverless (p. 40).
Job driver (jobDriver)
A job's driver is used to provide input to the job. This parameter accepts only one value for the job type that you want to run. For a Hive job, that value is hive. This job type has the following parameters:
• query ‐ This is the reference in Amazon S3 to the Hive query file that you want to run.
• parameters ‐ These are the additional Hive configuration properties you want to override. You can override properties by passing them to this parameter as --hiveconf <property=value>, and you can pass variables as --hivevar <key=value>.
• initQueryFile ‐ This is the init Hive query file. It will be executed prior to your query and can be used to initialize tables.
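The Hive job driver fields described above can be assembled in the same way as a Spark job driver. The following Python sketch builds the hive object, turning dictionaries of properties and variables into the --hiveconf and --hivevar string expected by parameters. The helper names are hypothetical, and the sketch only produces the JSON structure; it does not submit a job.

```python
import json
from typing import Dict, Optional


def build_hive_parameters(hiveconf: Optional[Dict[str, str]] = None,
                          hivevar: Optional[Dict[str, str]] = None) -> str:
    """Render property overrides and variables as one parameters string."""
    parts = [f"--hiveconf {k}={v}" for k, v in (hiveconf or {}).items()]
    parts += [f"--hivevar {k}={v}" for k, v in (hivevar or {}).items()]
    return " ".join(parts)


def build_hive_job_driver(query: str,
                          init_query_file: Optional[str] = None,
                          hiveconf: Optional[Dict[str, str]] = None,
                          hivevar: Optional[Dict[str, str]] = None) -> dict:
    """Return the jobDriver object for a Hive job."""
    hive = {"query": query}
    if init_query_file:
        hive["initQueryFile"] = init_query_file
    parameters = build_hive_parameters(hiveconf, hivevar)
    if parameters:
        hive["parameters"] = parameters
    return {"hive": hive}


hive_driver = build_hive_job_driver(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
    hiveconf={"hive.root.logger": "DEBUG,DRFA"},
)
print(json.dumps(hive_driver, indent=2))
```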
Configuration overrides (configurationOverrides)
This parameter overrides application-level and monitoring-level configuration properties. It accepts a JSON object with the following two fields:
• applicationConfiguration ‐ This field allows you to override the default configurations for applications by supplying a configuration object. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by specific release version for Amazon EMR. For a list of configuration classifications that are available for each release version of Amazon EMR, see Release versions (p. 66).
If you pass the same configuration in an application override and in Hive parameters, the Hive parameters take precedence. The complete configuration priority list follows, in order of highest priority to lowest priority.
• Configuration supplied as part of Hive parameters using --hiveconf <property=value>.
• Configuration provided as part of application overrides.
• Optimized configurations chosen by Amazon EMR for the release.
• Default open source configurations for the application.
• monitoringConfiguration ‐ This field allows you to specify the Amazon S3 URL (s3MonitoringConfiguration) where you want the EMR Serverless job to store logs of your Hive job. Make sure you've created this bucket with the same AWS account hosting your application, and in the same Region where your job is running.
Hive job properties
The following table lists the mandatory properties that you must configure when submitting a Hive job.
Setting | Description
hive.exec.scratchdir | The Amazon S3 location where temporary files are created during the Hive job execution.
hive.metastore.warehouse.dir | The Amazon S3 location of databases for managed tables in Hive.
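Because both properties in the table above are mandatory, it can be useful to check a configuration locally before submitting a job. The following Python sketch looks for the two properties in an applicationConfiguration list. It assumes they are supplied under the hive-site classification, as in this guide's Hive example, and that a valid value starts with s3://. It does not inspect values passed with --hiveconf, and the function name is hypothetical.

```python
from typing import Dict, List

MANDATORY_HIVE_PROPERTIES = (
    "hive.exec.scratchdir",
    "hive.metastore.warehouse.dir",
)


def missing_hive_properties(application_configuration: List[dict]) -> List[str]:
    """Return mandatory properties that are absent or not Amazon S3 URIs."""
    provided: Dict[str, str] = {}
    for config in application_configuration:
        # Assumption: the mandatory settings live in the hive-site classification.
        if config.get("classification") == "hive-site":
            provided.update(config.get("properties", {}))
    return [name for name in MANDATORY_HIVE_PROPERTIES
            if not str(provided.get(name, "")).startswith("s3://")]


overrides = [{
    "classification": "hive-site",
    "properties": {
        "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
        "hive.driver.cores": "2",
    },
}]
print(missing_hive_properties(overrides))
```

In this sample, only hive.metastore.warehouse.dir is reported as missing.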
The following table lists the optional Hive properties and their default values that you can override when submitting a Hive job.
Setting | Description | Value
hive.driver.memory | The amount of memory to use per Hive driver process. This memory is shared equally between HiveCLI and Tez Application Master with 20% of headroom. | 6G
hive.driver.cores | The number of cores to use for the Hive driver process. | 2
hive.driver.disk | The disk size for the Hive driver. | 21G
hive.tez.disk.size | The disk size for each task container. | 21G
hive.prewarm.enabled | Enables container prewarm for Tez. | FALSE
hive.prewarm.numcontainers | The number of containers to prewarm for Tez. | 10
hive.tez.container.size | The amount of memory to use per Tez task process. | 6144
hive.tez.cpu.vcores | The number of cores to use for each Tez task. | 1
hive.max-task-containers | The maximum number of concurrent containers. The configured mapper memory is also multiplied by this value to determine the available memory that is used by split computation and task preemption. | 100
hive.exec.reducers.max | The maximum number of reducers. | 256
hive.auto.convert.join.noconditionaltask.size | The size below which a join is directly converted to a mapjoin. | Optimal value is calculated based on Tez task memory
tez.runtime.io.sort.mb | The size of the soft buffer when output is sorted. | Optimal value is calculated based on Tez task memory
tez.runtime.unordered.output.buffer.size-mb | The size of the buffer to use if not writing directly to disk. | Optimal value is calculated based on Tez task memory
tez.am.task.max.failed.attempts | The maximum number of attempts that can fail for a particular task before the task is failed. This does not count manually terminated attempts. | 3
hive.exec.stagingdir | The name of the directory for storing temporary files that will be created inside table locations and in the scratch directory location specified using the hive.exec.scratchdir property. | .hive-staging
hive.compute.query.using.stats | Enables Hive to answer a few queries using statistics stored in the metastore. For basic statistics, set hive.stats.autogather to true. For a more advanced collection, run analyze table queries. | TRUE
hive.vectorized.execution.enabled | Enables vectorized mode of query execution. | FALSE for EMR versions 5.34.0 and onwards. TRUE for EMR versions 6.5.0 and onwards. See HIVE-19269 for more information.
hive.cbo.enable | Enables cost-based optimizations using the Calcite framework. | TRUE
hive.tez.auto.reducer.parallelism | Enables Tez's auto-reducer parallelism feature. Hive will still estimate data sizes and set parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary. | FALSE
hive.stats.fetch.column.stats | Enables fetching of column statistics from the metastore. Fetching column statistics can be expensive when the number of columns is high. | FALSE
hive.vectorized.execution.reduce.enabled | Enables vectorized mode of a query execution's reduce-side. | TRUE
hive.exec.max.dynamic.partitions.pernode | Maximum number of dynamic partitions allowed to be created in each mapper and reducer node. | 100
hive.stats.fetch.partition.stats | Enables fetching of partition statistics from the metastore. When this flag is disabled, Hive will make calls to the file system to get file sizes and estimate the number of rows from the row schema. Fetching partition statistics can be expensive when the number of partitions is high. | TRUE for EMR versions 5.34.0 and onwards. Property is removed for EMR versions 6.5.0 and onwards. See HIVE-17932 for more information. Partition stats are collected by default if a table is partitioned.
hive.exec.max.dynamic.partitions | The maximum number of dynamic partitions allowed to be created in total. | 1000
hive.auto.convert.join.noconditionaltask | Enables optimization in converting a common join into a mapjoin based on the input file size. | TRUE
hive.exec.dynamic.partition.mode | In strict mode, you must specify at least one static partition in case you accidentally overwrite all partitions. In non-strict mode, all partitions are allowed to be dynamic. | strict
hive.merge.tezfiles | Enables the merging of small files at the end of a Tez DAG. | FALSE
hive.strict.checks.cartesian.product | Enables strict Cartesian join checks, which disallows a Cartesian product (a cross join). | TRUE for EMR versions 5.34.0 and onwards. FALSE for EMR versions 6.5.0 and onwards. See HIVE-18251 for more information.
hive.stats.autogather | Enables basic statistics to be gathered automatically during the INSERT OVERWRITE command. | TRUE
hive.exec.orc.split.strategy | Expects one of [BI, ETL, HYBRID]. This is not a user level config. BI strategy is used when you want to spend less time in split generation as opposed to query execution (split generation does not read or cache file footers). ETL strategy is used when you want to spend more time in split generation (split generation reads and caches file footers). HYBRID chooses between the above strategies based on heuristics. | HYBRID
hive.auto.convert.join | Enables auto-conversion of common joins into mapjoins, based on the input file size. | TRUE
hive.default.fileformat | The default file format for CREATE TABLE statements. You can explicitly override this by specifying STORED AS [FORMAT] in your CREATE TABLE command. | TEXTFILE
hive.exec.reducers.bytes.per.reducer | The size per reducer. The default is 256 MB. If the input size is 1G, the job will use 4 reducers. | 256000000
hive.exec.dynamic.partition | Enables dynamic partitions in DML/DDL. | TRUE
hive.merge.size.per.task | The size of merged files at the end of the job. | 256000000
hive.merge.mapfiles | Enables small files to be merged at the end of a map-only job. | TRUE
hive.fetch.task.conversion | Expects one of [NONE, MINIMAL, MORE]. Some select queries can be converted to a single FETCH task, minimizing latency. | MORE
hive.stats.gather.num.threads | The number of threads used by the partialscan and noscan analyze command for partitioned tables. This is applicable only for file formats that implement StatsProvidingRecordReader (like ORC). | 10
hive.optimize.ppd | Enables predicate pushdown. | TRUE
hive.input.format | The default input format. Set to HiveInputFormat if you encounter problems with CombineHiveInputFormat. | org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
hive.optimize.ppd.storage | Enables predicate pushdown to storage handlers. | TRUE
hive.groupby.position.alias | Enables using a column position alias in GROUP BY statements. | FALSE
hive.orderby.position.alias | Enables using a column position alias in ORDER BY statements. | TRUE
hive.mapred.reduce.tasks.speculative.execution | Enables speculative execution for reducers. | TRUE
hive.support.quoted.identifiers | Expects one of [NONE, COLUMN]. NONE implies only alphanumeric and underscore characters are valid in identifiers. COLUMN implies column names can contain any character. | COLUMN
hive.tez.min.partition.factor | Puts a lower limit to the number of reducers that Tez specifies when auto-reducer parallelism is enabled. | 0.25
hive.strict.checks.type.safety | Enables strict type safety checks and disables comparing bigint with both string and double. | TRUE
hive.root.logger | The Hive log4j logging level. | INFO, DRFA
tez.am.log.level | The root logging level passed to the Tez app master. | INFO
tez.task.log.level | The root logging level passed to the Tez tasks. | INFO
tez.grouping.max-size | The upper bound on the size (in bytes) of a grouped split, to avoid generating excessively large splits. | 1073741824
tez.grouping.min-size | The lower bound on the size (in bytes) of a grouped split, to avoid generating too many small splits. | 52428800
tez.shuffle-vertex-manager.min-src-fraction | The fraction of source tasks which must complete before tasks for the current vertex are scheduled (in case of a ScatterGather connection). | 0.25
tez.shuffle-vertex-manager.max-src-fraction | The fraction of source tasks that must have completed before all tasks on the current vertex can be scheduled (in case of a ScatterGather connection). The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction. This defaults to the default value or tez.shuffle-vertex-manager.min-src-fraction, whichever is greater. | 0.75
tez.am.speculation.enabled | Enables speculative execution of slower tasks. This can help reduce job latency when some tasks are running slower due to bad or slow machines. | FALSE
tez.am.dag.cleanup.on.completion | Enables the cleanup of shuffle data upon DAG completion. | TRUE
tez.client.asynchronous-stop | Enables pushing of ATS events before terminating the Hive driver. | FALSE
tez.am.sleep.time.before.exit.millis | The amount of time after which ATS events should be pushed upon AM shutdown request. | 0
tez.yarn.ats.event.flush.timeout.millis | The maximum amount of time for which AM should wait for events to be flushed before shutting down. | 300000
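The hive.exec.reducers.bytes.per.reducer row in the table above gives a worked figure: a 1G input with the default setting uses 4 reducers. The following Python sketch reproduces that arithmetic and applies the hive.exec.reducers.max cap from the same table. It is a simplified model that assumes 1G means 10^9 bytes and that the reducer count is the input size divided by the per-reducer size, rounded up. Hive's actual estimate may take other factors into account, and the function name is hypothetical.

```python
import math


def estimated_reducers(input_bytes: int,
                       bytes_per_reducer: int = 256_000_000,
                       max_reducers: int = 256) -> int:
    """Simplified estimate: ceil(input / bytes_per_reducer), capped at the maximum."""
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    if input_bytes <= 0:
        return 1  # assume at least one reducer
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


# 1G of input (taken as 10**9 bytes) with the default 256000000 bytes per reducer.
print(estimated_reducers(10**9))
```

Under these assumptions, lowering hive.exec.reducers.bytes.per.reducer raises the reducer count until it reaches hive.exec.reducers.max.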
Hive job examples
The following is an example of running a Hive query using the StartJobRun API.
aws emr-serverless start-job-run \
  --application-id <application_id> \
  --execution-role-arn <iam_role_arn> \
  --job-driver '{
    "hive": {
      "query": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
      "parameters": "--hiveconf hive.root.logger=DEBUG,DRFA"
    }
  }' \
  --configuration-overrides '{
    "applicationConfiguration": [{
      "classification": "hive-site",
      "properties": {
        "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
        "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
        "hive.driver.cores": "2",
        "hive.driver.memory": "4g",
        "hive.tez.container.size": "4096",
        "hive.tez.cpu.vcores": "1"
      }
    }],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs/"
      }
    }
  }'
While the job is running, the logs for the Hive driver and Tez tasks continuously upload to the Amazon S3 log location you configured in monitoringConfiguration.
Once the job run has a state of SUCCEEDED, the output of your Hive query is available in the Amazon S3 location you specified in the monitoringConfiguration field of configurationOverrides. For example, if your log location is s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs, your Hive query's output is available in s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs/applications/<application-id>/jobs/<job-run-id>/HIVE_DRIVER/stdout.gz.
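The output location described above follows a fixed pattern under the log URI, so it can be derived from the IDs returned by start-job-run. The following Python sketch builds that path. The function name and the sample IDs are hypothetical; the path layout is the one given in the preceding paragraph.

```python
def hive_driver_stdout_uri(log_uri: str, application_id: str,
                           job_run_id: str) -> str:
    """Build the S3 URI of the Hive driver's stdout for one job run."""
    base = log_uri.rstrip("/")  # tolerate a trailing slash in the log URI
    return (f"{base}/applications/{application_id}"
            f"/jobs/{job_run_id}/HIVE_DRIVER/stdout.gz")


# Sample IDs are placeholders, not real resources.
print(hive_driver_stdout_uri(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/logs/",
    "my-application-id",
    "my-job-run-id",
))
```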
The hive-query.ql file contains the query that Hive will run. The following is a sample query.
create database if not exists emrserverless;
use emrserverless;
create table if not exists test_table(id int);
drop table if exists Values__Tmp__Table__1;
insert into test_table values (1),(2),(2),(3),(3),(3);
select id, count(id) from test_table group by id order by id desc;
Security
Cloud security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture that is built to meet the requirements of the most security-sensitive organizations.
Security is a shared responsibility between AWS and you. The shared responsibility model describes this as security of the cloud and security in the cloud:
• Security of the cloud – AWS is responsible for protecting the infrastructure that runs AWS services in the AWS Cloud. AWS also provides you with services that you can use securely. Third-party auditors regularly test and verify the effectiveness of our security as part of the AWS compliance programs. To learn about the compliance programs that apply to Amazon EMR Serverless, see AWS services in scope by compliance program.
• Security in the cloud – Your responsibility is determined by the AWS service that you use. You are also responsible for other factors including the sensitivity of your data, your company's requirements, and applicable laws and regulations.
This documentation helps you understand how to apply the shared responsibility model when using Amazon EMR Serverless. The topics in this chapter show you how to configure Amazon EMR Serverless and use other AWS services to meet your security and compliance objectives.
Topics
• Data protection
• Identity and Access Management (IAM) in Amazon EMR Serverless
• Security best practices for Amazon EMR Serverless
• Logging and monitoring
• Compliance validation for Amazon EMR Serverless
• Resilience in Amazon EMR Serverless
• Infrastructure security in Amazon EMR Serverless
• Configuration and vulnerability analysis in Amazon EMR Serverless
Data protection
The AWS shared responsibility model applies to data protection in Amazon EMR Serverless. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud.
You are responsible for maintaining control over your content that is hosted on this infrastructure. This content includes the security configuration and management tasks for the AWS services that you use. For more information about data privacy, see the Data Privacy FAQ. For information about data protection in Europe, see the AWS Shared Responsibility Model and GDPR blog post on the AWS Security Blog.
For data protection purposes, we recommend that you protect AWS account credentials and set up individual user accounts with AWS Identity and Access Management (IAM). That way each user is given only the permissions necessary to fulfill their job duties. We also recommend that you secure your data in the following ways:
• Use multi-factor authentication (MFA) with each account.
• Use SSL/TLS to communicate with AWS resources. We recommend TLS 1.2 or later.
• Set up API and user activity logging with AWS CloudTrail.
• Use AWS encryption solutions, along with all default security controls within AWS services.