As a best practice, create an AWS Identity and Access Management (IAM) user with administrator permissions, and then use that IAM user for all work that does not require root credentials. Create a password for console access, and create access keys to use command line tools. For instructions, see Creating your first IAM admin user and group in the IAM User Guide.
You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access management. If you choose to create a separate user to work with EMR Serverless, ensure the user has sufficient permissions to invoke EMR Serverless actions by attaching an IAM policy to the IAM user. For more information, see Access policies (p. 45). However, if you choose to continue with an Admin user, no further action will be required.
Step 3: Install and configure the AWS CLI
To set up EMR Serverless, you must have the latest version of the AWS CLI installed. To install the latest version of the AWS CLI for macOS, Linux, or Windows, see Installing or updating the latest version of the AWS CLI.
To configure the AWS CLI and set up secure access to AWS services, including EMR Serverless, see Quick configuration with aws configure.
Getting started
This tutorial helps you get started using EMR Serverless by deploying a sample Spark workload. You'll create your application with default pre-initialized capacity, run the sample application with logs stored in your S3 bucket and view event logs in the Spark History Server. Note that, for simplicity, we have chosen default options in most parts of this tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).
Prerequisites
Before you launch an EMR Serverless application, make sure you complete the following tasks:
• EMR Serverless is currently in preview release. To access the preview of EMR Serverless, follow the sign-up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html.
• You must update the AWS CLI with the latest service model for EMR Serverless. Once you've received confirmation of access, use the following command to download the latest API model file and update the AWS CLI.
aws s3 cp s3://elasticmapreduce/emr-serverless-preview/artifacts/latest/dev/cli/service.json ./service.json
aws configure add-model --service-model file://service.json
• To use EMR Serverless, you must choose the AWS Region where preview is available. This applies to any AWS services and resources that EMR Serverless will need to access as part of running your workloads. Preview is currently available in US East (N. Virginia) us-east-1, and you may want to configure the AWS CLI to send all your AWS requests to this specific region by default. You can do so with the following command.
aws configure set region us-east-1
• Validate that the AWS CLI configuration and permissions to interact with EMR Serverless are correctly set up. You can do so by running the following command to see a list of your EMR Serverless applications in your current Region.
aws emr-serverless list-applications
If the command returns with an error, see Troubleshooting EMR Serverless identity and access (p. 47).
Step 1: Plan an EMR Serverless application
Prepare output log storage for EMR Serverless
In this tutorial, you'll use an S3 bucket to store output files and logs from the sample Spark workload you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide.
As noted in the prerequisites, the S3 bucket must be created in the same Region where EMR Serverless is available (us-east-1). Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.
Set up a job execution role
Job runs in EMR Serverless use an execution role that provides granular permissions to specific AWS services and resources at runtime. In this tutorial, the data and scripts are hosted in a public S3 bucket, but the output, including logs, is stored in DOC-EXAMPLE-BUCKET.
To set up a job execution role, you first create an execution role with a trust policy that allows EMR Serverless to use the new role. Next, you attach the required S3 access policy to that role. The following steps walk you through the process.
1. Create a file named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The file should contain the following policy.
{ "Version": "2012-10-17", "Statement": [{
"Sid": "EMRServerlessTrustPolicy", "Action": "sts:AssumeRole",
2. Create an IAM role named sampleJobExecutionRole using the trust policy created in the previous step.
aws iam create-role \
--role-name sampleJobExecutionRole \
--assume-role-policy-document file://emr-serverless-trust-policy.json
Take note of the ARN in the output, as you will use the ARN of the new role during job submission, henceforth referred to as the <execution_role_arn>.
3. Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload. The policy grants read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET. You must replace DOC-EXAMPLE-BUCKET in the policy below with the actual name of the bucket you created in Step 1.
{ "Version": "2012-10-17", "Statement": [
"s3:ListBucket", "s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::DOC-EXAMPLE-BUCKET", "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
] } ] }
4. Create an IAM policy named sampleS3AccessPolicy using the policy file created in the previous step.
aws iam create-policy \
--policy-name sampleS3AccessPolicy \
--policy-document file://emr-sample-access-policy.json
Take note of the new policy's ARN in the output, as you will substitute it for <policy_arn> in the next step.
5. Attach the IAM policy sampleS3AccessPolicy to the job execution role sampleJobExecutionRole.
aws iam attach-role-policy \
--role-name sampleJobExecutionRole \
--policy-arn <policy_arn>
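If you did not record the ARNs printed by create-role and create-policy, you can look them up again. The following is a minimal sketch, assuming the role and policy names used in this tutorial. The function names role_arn and policy_arn are hypothetical helpers introduced here for illustration; they are not part of the AWS CLI.

```shell
# Hypothetical helpers (not part of the AWS CLI) that print the ARNs
# used in this tutorial as <execution_role_arn> and <policy_arn>.

# Print the ARN of an IAM role, given its name.
role_arn() {
  aws iam get-role \
    --role-name "$1" \
    --query 'Role.Arn' \
    --output text
}

# Print the ARN of a customer managed IAM policy, given its name.
# --scope Local restricts the listing to policies in your own account.
policy_arn() {
  aws iam list-policies \
    --scope Local \
    --query "Policies[?PolicyName=='$1'].Arn" \
    --output text
}

# Usage (needs AWS credentials, so it is left commented out):
# execution_role_arn="$(role_arn sampleJobExecutionRole)"
# policy_arn_value="$(policy_arn sampleS3AccessPolicy)"
```

Keeping the values in shell variables avoids copying ARNs by hand into later commands.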
Step 2: Create an EMR Serverless application
Now you're ready to create a new Spark application using EMR Serverless. To create an application, run the following command.
aws emr-serverless create-application \
--release-label emr-6.5.0-preview \
--type "SPARK" \
--name my-application
Take note of the application ID returned in the output, as you will use the ID to start the application and during job submission, henceforth referred to as the <application_id>.
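Instead of copying the application ID by hand, you can capture it in a shell variable when you create the application. This is a sketch only: it assumes the create-application response exposes the ID in a top-level field named applicationId, which this guide does not confirm. The function name create_spark_application is a hypothetical helper.

```shell
# Hypothetical helper: create a Spark application and print only its ID.
# ASSUMPTION: the response contains a top-level "applicationId" field.
create_spark_application() {
  name="$1"   # application name, e.g. my-application
  aws emr-serverless create-application \
    --release-label emr-6.5.0-preview \
    --type "SPARK" \
    --name "$name" \
    --query 'applicationId' \
    --output text
}

# Usage (needs AWS credentials, so it is left commented out):
# application_id="$(create_spark_application my-application)"
# echo "created application $application_id"
```

If the field name differs in your CLI model, run the command without --query once and inspect the full response.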
Although EMR Serverless automatically pre-initializes a set of workers for you (with additional workers created on demand), you may choose a different set of pre-initialized workers by setting the initialCapacity parameter while creating the application. You may also choose to set a limit for the total maximum capacity that an application can use by setting the maximumCapacity parameter. To learn more about these options, see Configuring and managing pre-initialized capacity (p. 12).
Step 3: Start your application
Before you can schedule a job using your application, you must start the application. This action will pre-initialize a set of workers. You must ensure the application has reached the CREATED state before starting it. To check the state of your application, run the following command, substituting <application_id> with the ID of your new application.
aws emr-serverless get-application \
--application-id <application_id>
When the application has reached the CREATED state, start your application using the following command.
aws emr-serverless start-application \
--application-id <application_id>
Before moving to the next step, ensure your application has reached the STARTED state using the get-application API.
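Checking the state by hand can be replaced with a small polling loop. The following sketch assumes that get-application reports the state at application.state in its response, which this guide does not confirm. The function name wait_for_application_state and its default retry count and delay are hypothetical choices.

```shell
# Hypothetical helper: poll an application until it reaches a target state.
# ASSUMPTION: get-application returns the state at "application.state".
wait_for_application_state() {
  app_id="$1"            # application ID returned by create-application
  target="$2"            # state to wait for, e.g. CREATED or STARTED
  max_tries="${3:-60}"   # give up after this many polls (arbitrary default)
  delay="${4:-10}"       # seconds to sleep between polls (arbitrary default)
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    state="$(aws emr-serverless get-application \
      --application-id "$app_id" \
      --query 'application.state' \
      --output text)" || return 1
    echo "application $app_id is $state"
    if [ "$state" = "$target" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $target" >&2
  return 1
}

# Usage (needs AWS credentials, so it is left commented out):
# wait_for_application_state <application_id> STARTED
```

The same helper can be pointed at CREATED before starting the application, or at STOPPED before deleting it.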
Step 4: Schedule a job run to your EMR Serverless application
Now your Spark application is ready to run jobs. In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. Both the script and the dataset are stored in a public, read-only S3 bucket. The output file and the log data from the Spark runtime will be pushed to the /output and /logs directories in the S3 bucket you created in Step 1 (DOC-EXAMPLE-BUCKET).
In the command below, substitute <application_id> with your application ID. Substitute <execution_role_arn> with the execution role ARN you created in Step 1. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket you created, adding /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.
aws emr-serverless start-job-run \
--application-id <application_id> \
--execution-role-arn <execution_role_arn> \
--job-driver '{
  "sparkSubmit": {
    "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
    "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/output"],
    "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
  }
}' \
--configuration-overrides '{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {
      "logUri": "s3://DOC-EXAMPLE-BUCKET/logs"
    }
  }
}'
Note the job run ID returned in the output, as you will replace <job_run_id> with this ID in the following steps.
Step 5: Review your job run's output
The job run should typically take 3-5 minutes to complete. You can check for the state of the job using the following command.
aws emr-serverless get-job-run \
--application-id <application_id> \
--job-run-id <job_run_id>
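Because the job takes a few minutes, you may prefer to poll until it finishes. The sketch below makes two assumptions that this guide does not confirm: that get-job-run reports the state at jobRun.state, and that SUCCESS, FAILED, and CANCELLED are the terminal state names. The function name wait_for_job_run and its defaults are hypothetical.

```shell
# Hypothetical helper: poll a job run until it reaches a terminal state.
# ASSUMPTIONS: the state is at "jobRun.state"; terminal states are
# SUCCESS, FAILED, and CANCELLED.
# Returns 0 on SUCCESS, 1 on FAILED or CANCELLED, 2 on error or timeout.
wait_for_job_run() {
  app_id="$1"            # application ID
  job_run_id="$2"        # job run ID returned by start-job-run
  delay="${3:-30}"       # seconds between polls (arbitrary default)
  max_tries="${4:-40}"   # give up after this many polls (arbitrary default)
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    state="$(aws emr-serverless get-job-run \
      --application-id "$app_id" \
      --job-run-id "$job_run_id" \
      --query 'jobRun.state' \
      --output text)" || return 2
    echo "job run $job_run_id is $state"
    case "$state" in
      SUCCESS) return 0 ;;
      FAILED|CANCELLED) return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for job run $job_run_id" >&2
  return 2
}

# Usage (needs AWS credentials, so it is left commented out):
# wait_for_job_run <application_id> <job_run_id>
```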
With your log destination set to s3://DOC-EXAMPLE-BUCKET/logs, the logs for this specific job run can be found under s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>.
Depending upon the type of application and log file, EMR Serverless will upload logs to your bucket at different cadences. For Spark applications, EMR Serverless will push event logs every 30 seconds to the sparklogs folder in the S3 log destination. The Spark runtime logs for the driver and executors (i.e., stderr and stdout log files) will upload upon completion of the job to folders named appropriately by the worker type, such as driver or executor.
The output of the PySpark job will be uploaded upon successful execution of the job to s3://DOC-EXAMPLE-BUCKET/output/.
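To browse the results from the command line, you can build the log prefix described above and list it with aws s3 ls. This is a minimal sketch; the function name job_log_prefix is a hypothetical helper that only assembles the path layout stated in this step.

```shell
# Hypothetical helper: build the S3 prefix that holds one job run's logs,
# following the layout <logUri>/applications/<application_id>/jobs/<job_run_id>.
job_log_prefix() {
  log_uri="${1%/}"   # strip one trailing slash, if present
  printf '%s/applications/%s/jobs/%s\n' "$log_uri" "$2" "$3"
}

# Usage (needs AWS credentials, so it is left commented out):
# aws s3 ls "$(job_log_prefix s3://DOC-EXAMPLE-BUCKET/logs <application_id> <job_run_id>)/" --recursive
# aws s3 ls s3://DOC-EXAMPLE-BUCKET/output/
```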
Step 6: Clean up
When you’re done working with this tutorial, consider deleting the resources you created. This will help you avoid any unnecessary expenses. Note that in preview, there is no additional cost to using EMR Serverless. However, we still recommend following best practices by releasing resources that you don't intend to use again.
Delete your application
An application must be in the STOPPED state before you can delete it. Use the following command to stop the application.
aws emr-serverless stop-application \
--application-id <application_id>
Once the application is in the STOPPED state, use the following command to delete the application.
aws emr-serverless delete-application \
--application-id <application_id>
Delete your S3 log bucket
To delete your S3 logging and output bucket, use the following command. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Step 1.
aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive
aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET
Delete your job execution role
To delete the execution role, detach the policy from the role. You can then delete both the role and the policy.
aws iam detach-role-policy \
--role-name sampleJobExecutionRole \
--policy-arn <policy_arn>
To delete the role, use the following command.
aws iam delete-role \
--role-name sampleJobExecutionRole
To delete the policy that was attached to the role, use the following command.
aws iam delete-policy \
--policy-arn <policy_arn>
This concludes the tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).
Interacting with your application
In this section, we'll cover how you can interact with your Amazon EMR Serverless application using the AWS CLI and the defaults for Spark and Hive engines.
Topics
• Application states (p. 10)
• Working with your application on the AWS CLI (p. 11)
• Pre-initialized capacity defaults (p. 11)
• Configuring and managing pre-initialized capacity (p. 12)
Application states
When you create an application with Amazon EMR Serverless, the application run enters the CREATING state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code).
Applications can have the following states:
State        Description
Creating     The application is being prepared and is not yet ready to use.
Created      The application has been created but has not yet provisioned capacity. It can be modified to make changes to the initial capacity configuration.
Starting     The application has started and is provisioning capacity.
Started      The application is ready to accept new jobs. Jobs will only be accepted in this state.
Stopping     All jobs have completed and the application is releasing its capacity. EMR Serverless may move an application to this state when there are failures in provisioning capacity.
Stopped      The application is stopped and no resources are running on the application. It can be modified to make changes to the initial capacity configuration.
Terminated   The application has been terminated and will not appear on your list. EMR Serverless may move an application to this state when there are failures in creation.
The following diagram illustrates the trajectory of EMR Serverless application states.
Working with your application on the AWS CLI
This section covers how to create, describe, and delete individual applications on the command line, as well as how to list all of your applications to view them at a glance. For more application operations, such as starting, stopping, and updating your application, see the EMR Serverless API Reference.
To create an application, use create-application. You must specify SPARK or HIVE as the application type. This command returns the application’s ARN, name, and ID.
aws emr-serverless create-application \
--name <my_application_name> \
--type <application_type> \
--release-label <release_version>
To describe an application, use get-application and provide its application-id. This command returns the state and capacity-related configurations for your application.
aws emr-serverless get-application \
--application-id <application_id>
To list all of your applications, call list-applications. This command returns the same properties as get-application but includes all of your applications.
aws emr-serverless list-applications
To delete your application, call delete-application and supply your application-id.
aws emr-serverless delete-application \
--application-id <application_id>
Pre-initialized capacity defaults
By default, EMR Serverless configures all applications with pre-initialized capacity to run one job at low latency. The defaults are engine-specific and are set as follows:
• Spark - 3 drivers (4 vCPU, 16 GB memory, 21 GB disk), 9 executors (4 vCPU, 16 GB memory, 21 GB disk)
• Hive - 3 drivers (2 vCPU, 6 GB memory, and 21 GB disk), 30 Tez tasks (1 vCPU, 6 GB memory, and 21 GB disk)
Configuring and managing pre-initialized capacity
EMR Serverless provides an optional feature that keeps driver and workers pre-initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initialCapacity parameter of an application to the number of workers you want to pre-initialize. Pre-initialized worker capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs.
When a job is run, if any workers from initialCapacity are available (not already in use from jobs previously submitted), those resources are used to start running the job. If those resources are not available because they are in use by other jobs, or if resources are insufficient to execute the job because the job requires more than what is available from initialCapacity, then additional workers are requested and acquired, up to the maximum limits on resources set for the application. When jobs finish running, the workers used by the job are released, and the number of resources available for the application returns to initialCapacity. An application maintains the initialCapacity of resources even after jobs finish running. Excess resources beyond initialCapacity are released immediately when they're no longer required to run jobs.
Note: For this preview release, you must manually stop an application to decommission the pre-initialized workers.
Pre-initialized capacity is available and ready to use when the application has started. It is decommissioned when the application is stopped. An application moves to the STARTED state only if the requested pre-initialized capacity has been created and is ready to use. For the entire duration that the application is in the STARTED state, EMR Serverless ensures that the pre-initialized capacity is available for use or is in use by jobs or interactive workloads. Capacity is replenished for released or failed containers to maintain the number of workers specified in the initialCapacity parameter. For an application with no pre-initialized capacity, the state can immediately transition from CREATED to STARTED.
You can modify the initialCapacity counts and specify compute configurations, such as vCPU, memory, and disk, for each worker. Modifications are only allowed when the application is in the CREATED or STOPPED state.
Customizing pre-initialized capacity for specific big data frameworks
You can further customize pre-initialized capacity to suit workloads running on specific big data frameworks. For example, when running Apache Spark, you can specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, you can specify how many workers start as Hive drivers, and how many are used to run Tez tasks.
Configuring an application running Apache Hive with pre-initialized capacity
The following API request creates an application running Apache Hive based on Amazon EMR release emr-5.34.0-preview. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 6 GB of memory, and 50 pre-initialized Tez task workers, each with 1 vCPU and 6 GB of memory. When Hive queries are run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.
aws emr-serverless create-application \
--type HIVE \
--name <my_application_name> \
--release-label emr-5.34.0-preview \
--initial-capacity '{
"cpu": "400vCPU", "memory": "1024GB"
}'
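To see how much of the maximum capacity the pre-initialized workers in the Hive example occupy, you can total the per-worker figures from the description above. This is a worked arithmetic sketch only. It assumes the 400 vCPU and 1024 GB limits count the pre-initialized workers as part of the total, and the function name total_capacity is a hypothetical helper.

```shell
# Hypothetical helper: sum COUNT * PER_WORKER over pairs of arguments.
total_capacity() {
  total=0
  while [ "$#" -ge 2 ]; do
    total=$((total + $1 * $2))
    shift 2
  done
  echo "$total"
}

# Hive example: 5 drivers (2 vCPU, 6 GB each) and 50 Tez task workers (1 vCPU, 6 GB each).
hive_vcpu="$(total_capacity 5 2 50 1)"
hive_mem="$(total_capacity 5 6 50 6)"
echo "pre-initialized: ${hive_vcpu} vCPU, ${hive_mem} GB"
# prints: pre-initialized: 60 vCPU, 330 GB

# Remaining room before the stated maximum of 400 vCPU and 1024 GB.
echo "headroom: $((400 - hive_vcpu)) vCPU, $((1024 - hive_mem)) GB"
# prints: headroom: 340 vCPU, 694 GB
```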
Configuring an application running Apache Spark with pre-initialized capacity
The following API request creates an application running Apache Spark 3.1 based on Amazon EMR release 6.5. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs are run on this application, they first use the pre-initialized workers and start executing immediately. If all