Create an IAM user - Amazon EMR

As a best practice, create an AWS Identity and Access Management (IAM) user with administrator permissions, and then use that IAM user for all work that does not require root credentials. Create a password for console access, and create access keys to use command line tools. For instructions, see Creating your first IAM admin user and group in the IAM User Guide.

You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access management. If you choose to create a separate user to work with EMR Serverless, ensure the user has sufficient permissions to invoke EMR Serverless actions by attaching an IAM policy to the IAM user. For more information, see Access policies (p. 45). However, if you choose to continue with an Admin user, no further action will be required.

Step 3: Install and configure the AWS CLI

To set up EMR Serverless, you must have the latest version of the AWS CLI installed. To install the latest version of the AWS CLI for macOS, Linux, or Windows, see Installing or updating the latest version of the AWS CLI.

To configure the AWS CLI and set up secure access to AWS services, including EMR Serverless, see Quick configuration with aws configure.


Getting started

This tutorial helps you get started using EMR Serverless by deploying a sample Spark workload. You'll create your application with default pre-initialized capacity, run the sample application with logs stored in your S3 bucket and view event logs in the Spark History Server. Note that, for simplicity, we have chosen default options in most parts of this tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).

Prerequisites

Before you launch an EMR Serverless application, make sure you complete the following tasks:

• EMR Serverless is currently in preview release. To access the preview of EMR Serverless, follow the sign-up steps at https://pages.awscloud.com/EMR-Serverless-Preview.html.

• You must update the AWS CLI with the latest service model for EMR Serverless. Once you've received confirmation of access, use the following commands to download the latest API model file and update the AWS CLI.

aws s3 cp s3://elasticmapreduce/emr-serverless-preview/artifacts/latest/dev/cli/service.json ./service.json
aws configure add-model --service-model file://service.json

• To use EMR Serverless, you must choose the AWS Region where preview is available. This applies to any AWS services and resources that EMR Serverless will need to access as part of running your workloads. Preview is currently available in US East (N. Virginia) us-east-1, and you may want to configure the AWS CLI to send all your AWS requests to this specific region by default. You can do so with the following command.

aws configure set region us-east-1

• Validate that the AWS CLI configuration and permissions to interact with EMR Serverless are correctly set up. You can do so by running the following command to see a list of your EMR Serverless applications in your current Region.

aws emr-serverless list-applications

If the command returns with an error, see Troubleshooting EMR Serverless identity and access (p. 47).

Step 1: Plan an EMR Serverless application

Prepare output log storage for EMR Serverless

In this tutorial, you'll use an S3 bucket to store output files and logs from the sample Spark workload you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide.

As noted in the prerequisites, the S3 bucket must be created in the same Region where EMR Serverless is available (us-east-1). Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.
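The placeholder substitution can also be scripted. The following is a minimal sketch rather than part of the tutorial: fill_bucket_name is a hypothetical helper, and the bucket name in the example is made up.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and write it to stdout with every
# DOC-EXAMPLE-BUCKET placeholder replaced by the given bucket name.
# S3 bucket names use only lowercase letters, digits, dots, and hyphens,
# so the name is safe to use inside a sed replacement.
fill_bucket_name() {
    sed "s/DOC-EXAMPLE-BUCKET/$1/g"
}

# Example with a made-up bucket name:
echo '"logUri": "s3://DOC-EXAMPLE-BUCKET/logs"' | fill_bucket_name "my-emr-tutorial-bucket"
```

The same filter can be applied to a saved policy file or command by redirecting the file to the helper's standard input.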

Set up a job execution role

Job runs in EMR Serverless use an execution role that provides granular permissions to specific AWS services and resources at runtime. In this tutorial, the data and scripts are hosted in a public S3 bucket; however, the output, including logs, will be stored in DOC-EXAMPLE-BUCKET.

To set up a job execution role, you will first create an execution role with a trust policy to allow EMR Serverless to use the new role. Next, you'll attach the required S3 access policy to that role. The following steps walk you through the process.

1. Create a file named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The file should contain the following policy.

{ "Version": "2012-10-17", "Statement": [{

"Sid": "EMRServerlessTrustPolicy", "Action": "sts:AssumeRole",

2. Create an IAM role named sampleJobExecutionRole using the trust policy created in the previous step.

aws iam create-role \
    --role-name sampleJobExecutionRole \
    --assume-role-policy-document file://emr-serverless-trust-policy.json

Take note of the ARN in the output, as you will use the ARN of the new role during job submission, henceforth referred to as the <execution_role_arn>.

3. Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload to get read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET. You must replace DOC-EXAMPLE-BUCKET in the policy below with the actual bucket name created in Step 1.

{ "Version": "2012-10-17", "Statement": [


"s3:ListBucket", "s3:DeleteObject"

"Resource": [

"arn:aws:s3:::DOC-EXAMPLE-BUCKET", "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"

] } ] }

4. Create an IAM policy named sampleS3AccessPolicy using the policy file created in the previous step.

aws iam create-policy \
    --policy-name sampleS3AccessPolicy \
    --policy-document file://emr-sample-access-policy.json

Take note of the new policy's ARN in the output, as you will substitute it for <policy_arn> in the next step.

5. Attach the IAM policy sampleS3AccessPolicy to the job execution role sampleJobExecutionRole.

aws iam attach-role-policy \
    --role-name sampleJobExecutionRole \
    --policy-arn <policy_arn>

Step 2: Create an EMR Serverless application

Now you're ready to create a new Spark application using EMR Serverless. To create an application, run the following command.

aws emr-serverless create-application \
    --release-label emr-6.5.0-preview \
    --type "SPARK" \
    --name my-application

Take note of the application ID returned in the output, as you will use the ID to start the application and during job submission, henceforth referred to as the <application_id>.

Although EMR Serverless automatically pre-initializes a set of workers for you (with additional workers created on demand), you may choose a different set of pre-initialized workers by setting the initialCapacity parameter while creating the application. You may also choose to set a limit for the total maximum capacity that an application can use by setting the maximumCapacity parameter. To learn more about these options, see Configuring and managing pre-initialized capacity (p. 12).

Step 3: Start your application

Before you can schedule a job using your application, you must start the application. This action will pre-initialize a set of workers. You must ensure the application has reached the CREATED state before starting it. To check the state of your application, run the following command, substituting <application_id> with the ID of your new application.

aws emr-serverless get-application \
    --application-id <application_id>

When the application has reached the CREATED state, start your application using the following command.

aws emr-serverless start-application \
    --application-id <application_id>

Before moving to the next step, ensure your application has reached the STARTED state using the get-application API.
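Rather than re-running get-application by hand, the state check can be repeated in a loop. The following is a minimal sketch, not part of the original tutorial: wait_for_state and app_state are hypothetical helpers, and the application.state query path is an assumption about the shape of the get-application response that you should confirm against your own CLI output.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it prints the wanted state.
# Usage: wait_for_state <wanted_state> <max_attempts> <sleep_seconds> <command...>
wait_for_state() {
    wfs_wanted="$1"
    wfs_attempts="$2"
    wfs_delay="$3"
    shift 3
    wfs_i=1
    wfs_state=""
    while [ "$wfs_i" -le "$wfs_attempts" ]; do
        # The caller-supplied command prints the current state on stdout.
        wfs_state="$("$@")"
        if [ "$wfs_state" = "$wfs_wanted" ]; then
            echo "$wfs_state"
            return 0
        fi
        sleep "$wfs_delay"
        wfs_i=$((wfs_i + 1))
    done
    echo "timed out; last state: $wfs_state" >&2
    return 1
}

# Assumed wrapper around the get-application command shown above.
# --query and --output are global AWS CLI options; the 'application.state'
# path is an assumption about the response shape.
app_state() {
    aws emr-serverless get-application \
        --application-id "$1" \
        --query 'application.state' \
        --output text
}

# Example (commented out because it needs AWS credentials):
# wait up to 30 attempts, 10 seconds apart, for the STARTED state.
# wait_for_state STARTED 30 10 app_state "<application_id>"
```

Passing the state-fetching command as an argument keeps the loop independent of the AWS CLI, so the same helper could wait for CREATED before the start-application call.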

Step 4: Schedule a job run to your EMR Serverless application

Now your Spark application is ready to run jobs. In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. Both the script and the dataset are stored in a public, read-only S3 bucket. The output file and the log data from the Spark runtime will be pushed to the /output and /logs directories in the S3 bucket you created in Step 1 (DOC-EXAMPLE-BUCKET).

In the command below, substitute <application_id> with your application ID. Substitute <execution_role_arn> with the execution role ARN you created in Step 1. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket you created, adding /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.

aws emr-serverless start-job-run \
    --application-id <application_id> \
    --execution-role-arn <execution_role_arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
            "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET/logs"
            }
        }
    }'

Note the job run ID returned in the output, as you will replace <job_run_id> with this ID in the following steps.

Step 5: Review your job run's output

The job run should typically take 3-5 minutes to complete. You can check the state of the job using the following command.

aws emr-serverless get-job-run \
    --application-id <application_id> \
    --job-run-id <job_run_id>

With your log destination set to s3://DOC-EXAMPLE-BUCKET/logs, the logs for this specific job run can be found under s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>.
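The log location follows a fixed pattern, so it can be assembled from the three values you already have. The following sketch uses a hypothetical helper, job_log_prefix, and placeholder IDs; only the path layout comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper: build the S3 prefix that holds the logs of one job run,
# following the layout described above:
#   <logUri>/applications/<application_id>/jobs/<job_run_id>
job_log_prefix() {
    # $1 = log URI (for example s3://DOC-EXAMPLE-BUCKET/logs)
    # $2 = application ID, $3 = job run ID
    # ${1%/} drops one trailing slash so the result has no double slash.
    printf '%s/applications/%s/jobs/%s\n' "${1%/}" "$2" "$3"
}

# Example with placeholder IDs (not real resources):
job_log_prefix "s3://DOC-EXAMPLE-BUCKET/logs" "app-id-placeholder" "job-id-placeholder"

# The prefix can then be passed to `aws s3 ls` to see what has been uploaded
# so far (commented out because it needs AWS credentials):
# aws s3 ls "$(job_log_prefix "s3://DOC-EXAMPLE-BUCKET/logs" "<application_id>" "<job_run_id>")/" --recursive
```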

Depending upon the type of application and log file, EMR Serverless will upload logs to your bucket at different cadences. For Spark applications, EMR Serverless will push event logs every 30 seconds to the sparklogs folder in the S3 log destination. The Spark runtime logs for the driver and executors (i.e., stderr and stdout log files) will upload upon completion of the job to folders named appropriately by the worker type, such as driver or executor.

The output of the PySpark job will be uploaded upon successful execution of the job to s3://DOC-EXAMPLE-BUCKET/output/.

Step 6: Clean up

When you’re done working with this tutorial, consider deleting the resources you created. This will help you avoid any unnecessary expenses. Note that in preview, there is no additional cost to using EMR Serverless. However, we still recommend following best practices by releasing resources that you don't intend to use again.

Delete your application

To delete an application, it must be in the STOPPED state. Use the following command to stop the application.

aws emr-serverless stop-application \
    --application-id <application_id>

Once the application is in the STOPPED state, use the following command to delete the application.

aws emr-serverless delete-application \
    --application-id <application_id>

Delete your S3 log bucket

To delete your S3 logging and output bucket, use the following commands. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Step 1.

aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive
aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET

Delete your job execution role

To delete the execution role, detach the policy from the role. You can then delete both the role and the policy.

aws iam detach-role-policy \
    --role-name sampleJobExecutionRole \
    --policy-arn <policy_arn>

To delete the role, use the following command.

aws iam delete-role \
    --role-name sampleJobExecutionRole

To delete the policy that was attached to the role, use the following command.

aws iam delete-policy \
    --policy-arn <policy_arn>

This concludes the tutorial. For examples of running Hive applications, see Running Hive jobs (p. 21).


Interacting with your application

In this section, we'll cover how you can interact with your Amazon EMR Serverless application using the AWS CLI and the defaults for Spark and Hive engines.

Topics

• Application states

• Working with your application on the AWS CLI

• Pre-initialized capacity defaults

• Configuring and managing pre-initialized capacity

Application states

When you create an application with Amazon EMR Serverless, the application run enters the CREATING state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code).

Applications can have the following states:

State        Description

Creating     The application is being prepared and is not yet ready to use.

Created      The application has been created but has not yet provisioned capacity. It can be modified to make changes to the initial capacity configuration.

Starting     The application has started and is provisioning capacity.

Started      The application is ready to accept new jobs. Jobs will only be accepted in this state.

Stopping     All jobs have completed and the application is releasing its capacity. EMR Serverless may move an application to this state when there are failures in provisioning capacity.

Stopped      The application is stopped and no resources are running on the application. It can be modified to make changes to the initial capacity configuration.

Terminated   The application has been terminated and will not appear on your list. EMR Serverless may move an application to this state when there are failures in creation.
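Two rules from the state descriptions matter most when scripting against an application: jobs are accepted only in the Started state, and the initial capacity configuration can be changed only in the Created or Stopped state. The following sketch encodes them in two hypothetical helpers; the upper-case state names are assumed to match the form used elsewhere in this guide (for example, CREATED and STARTED).

```shell
#!/bin/sh
# Hypothetical helpers that encode two rules from the state table.

# Succeeds when the application can accept new jobs.
accepts_jobs() {
    [ "$1" = "STARTED" ]
}

# Succeeds when the initial capacity configuration can be modified.
can_modify_capacity() {
    case "$1" in
        CREATED|STOPPED) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: decide what to do for a given state string.
state="STOPPED"
if can_modify_capacity "$state"; then
    echo "$state: capacity configuration can be changed"
fi
if ! accepts_jobs "$state"; then
    echo "$state: start the application before submitting jobs"
fi
```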

The following diagram illustrates the trajectory of EMR Serverless application states.

Working with your application on the AWS CLI

This section covers how to create, describe, and delete individual applications on the command line, as well as how to list all of your applications to view them at a glance. For more application operations, such as starting, stopping, and updating your application, see the EMR Serverless API Reference.

To create an application, use create-application. You must specify SPARK or HIVE as the application type. This command returns the application’s ARN, name, and ID.

aws emr-serverless create-application \
    --name <my_application_name> \
    --type <application_type> \
    --release-label <release_version>

To describe an application, use get-application and provide its application-id. This command returns the state and capacity-related configurations for your application.

aws emr-serverless get-application \
    --application-id <application_id>

To list all of your applications, call list-applications. This command returns the same properties as get-application but includes all of your applications.

aws emr-serverless list-applications

To delete your application, call delete-application and supply your application-id.

aws emr-serverless delete-application \
    --application-id <application_id>

Pre-initialized capacity defaults

By default, EMR Serverless configures every application with enough pre-initialized capacity to run one job at low latency. The defaults are engine-specific and are set as follows:

• Spark - 3 drivers (4 vCPU, 16 GB memory, 21 GB disk), 9 executors (4 vCPU, 16 GB memory, 21 GB disk)

• Hive - 3 drivers (2 vCPU, 6 GB memory, and 21 GB disk), 30 Tez tasks (1 vCPU, 6 GB memory, and 21 GB disk)
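To see what these defaults add up to per application, the per-worker figures can be multiplied out. The following is a small arithmetic sketch; total is a hypothetical helper, and the inputs are taken directly from the two bullets above.

```shell
#!/bin/sh
# Hypothetical helper: total one resource across two worker groups.
# Usage: total <count_a> <per_worker_a> <count_b> <per_worker_b>
total() {
    echo $(( $1 * $2 + $3 * $4 ))
}

# Spark defaults: 3 drivers and 9 executors, each with 4 vCPU and 16 GB.
echo "Spark: $(total 3 4 9 4) vCPU, $(total 3 16 9 16) GB memory"
# prints: Spark: 48 vCPU, 192 GB memory

# Hive defaults: 3 drivers (2 vCPU, 6 GB) and 30 Tez tasks (1 vCPU, 6 GB).
echo "Hive: $(total 3 2 30 1) vCPU, $(total 3 6 30 6) GB memory"
# prints: Hive: 36 vCPU, 198 GB memory
```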

Configuring and managing pre-initialized capacity

EMR Serverless provides an optional feature that keeps drivers and workers pre-initialized and ready to respond in seconds, effectively creating a warm pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initialCapacity parameter of an application to the number of workers you want to pre-initialize. Pre-initialized worker capacity allows jobs to start immediately, making it ideal for implementing iterative applications and time-sensitive jobs.

When a job is run, if any workers from initialCapacity are available (not already in use from jobs previously submitted), those resources are used to start running the job. If those resources are not available because they are in use by other jobs, or if resources are insufficient to execute the job because the job requires more than what is available from initialCapacity, then additional workers are requested and acquired, up to the maximum limits on resources set for the application. When jobs finish running, the workers used by the job are released, and the number of resources available for the application returns to initialCapacity. An application maintains the initialCapacity of resources even after jobs finish running. Excess resources beyond initialCapacity are released immediately when they're no longer required to run jobs.
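The sourcing rule described above can be illustrated with simple arithmetic. The following sketch is a deliberately simplified model and not how EMR Serverless is implemented: plan_workers is a hypothetical helper, capacity is counted in whole workers rather than vCPU and memory, pre-initialized workers are assumed to count toward the maximum, and no on-demand workers are assumed to be running already.

```shell
#!/bin/sh
# Simplified model, in whole workers, of where a job's workers come from.
# Usage: plan_workers <initial> <in_use> <needed> <maximum>
# Prints: "<from_pool> <on_demand>"
plan_workers() {
    pw_free=$(( $1 - $2 ))                  # idle pre-initialized workers
    [ "$pw_free" -lt 0 ] && pw_free=0
    pw_from_pool=$3                         # take from the warm pool first
    [ "$pw_from_pool" -gt "$pw_free" ] && pw_from_pool=$pw_free
    pw_on_demand=$(( $3 - pw_from_pool ))   # shortfall requested on demand
    pw_headroom=$(( $4 - $1 ))              # room left under the maximum
    [ "$pw_headroom" -lt 0 ] && pw_headroom=0
    [ "$pw_on_demand" -gt "$pw_headroom" ] && pw_on_demand=$pw_headroom
    echo "$pw_from_pool $pw_on_demand"
}

# 10 pre-initialized workers, none busy, job needs 4, maximum 50.
plan_workers 10 0 4 50     # prints: 4 0
# 8 of the 10 are busy, job needs 6: 2 from the pool, 4 on demand.
plan_workers 10 8 6 50     # prints: 2 4
# Pool fully busy and the job asks for more than the maximum allows.
plan_workers 10 10 100 50  # prints: 0 40
```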

Note: For this preview release, you must manually stop an application to decommission the pre-initialized workers.

Pre-initialized capacity is available and ready to use when the application has started. It is decommissioned when the application is stopped. An application moves to the STARTED state only if the requested pre-initialized capacity has been created and is ready to use. For the entire duration that the application is in the STARTED state, EMR Serverless ensures that the pre-initialized capacity is available for use or is in use by jobs or interactive workloads. Capacity is replenished for released or failed containers to maintain the number of workers specified in the initialCapacity parameter. For an application with no pre-initialized capacity, the state can immediately transition from CREATED to STARTED.

You can modify the initialCapacity counts, and specify compute configurations such as vCPU, memory, and disk, for each worker. Modifications are only allowed when the application is in the CREATED or STOPPED state.

Customizing pre-initialized capacity for specific big data frameworks

You can further customize pre-initialized capacity to suit workloads running on specific big data frameworks. For example, when running Apache Spark, you can specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, you can specify how many workers start as Hive drivers, and how many are used to run Tez tasks.

Configuring an application running Apache Hive with pre-initialized capacity

The following API request creates an application running Apache Hive based on Amazon EMR release emr-5.34.0-preview. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 6 GB of memory, and 50 pre-initialized Tez task workers, each with 1 vCPU and 6 GB of memory. When Hive queries are run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory.

aws emr-serverless create-application \
    --type HIVE \
    --name <my_application_name> \
    --release-label emr-5.34.0-preview \

--initial-capacity '{

"cpu": "400vCPU", "memory": "1024GB"

Configuring an application running Apache Spark with pre-initialized capacity

The following API request creates an application running Apache Spark 3.1 based on Amazon EMR release 6.5. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs are run on this application, they first use the pre-initialized workers and start executing immediately. If all
