Amazon EMR

Amazon EMR on EKS Development Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


What is Amazon EMR on EKS

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With this deployment option, you can focus on running analytics workloads while Amazon EMR on EKS builds, configures, and manages containers for open-source applications.

If you already use Amazon EMR, you can now run Amazon EMR based applications with other types of applications on the same Amazon EKS cluster. This deployment option also improves resource utilization and simplifies infrastructure management across multiple Availability Zones. If you already run big data frameworks on Amazon EKS, you can now use Amazon EMR to automate provisioning and management, and run Apache Spark more quickly.

Amazon EMR on EKS enables your team to collaborate more efficiently and process vast amounts of data more easily and cost-effectively:

• You can run applications on a common pool of resources without having to provision infrastructure. You can use Amazon EMR Studio and the AWS SDK or AWS CLI to develop, submit, and diagnose analytics applications running on EKS clusters. You can run scheduled jobs on Amazon EMR on EKS using self-managed Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA).

• Infrastructure teams can centrally manage a common computing platform to consolidate Amazon EMR workloads with other container-based applications. You can simplify infrastructure management with common Amazon EKS tools and take advantage of a shared cluster for workloads that need different versions of open-source frameworks. You can also reduce operational overhead with automated Kubernetes cluster management and OS patching. With Amazon EC2 and AWS Fargate, you can enable multiple compute resources to meet performance, operational, or financial requirements.

The following diagram shows the two different deployment models for Amazon EMR.

Topics

• Architecture

• Concepts

• How the components work together

Architecture

Amazon EMR on EKS loosely couples applications to the infrastructure that they run on. Each infrastructure layer provides orchestration for the subsequent layer. When you submit a job to Amazon EMR, your job definition contains all of its application-specific parameters. Amazon EMR uses these parameters to instruct Amazon EKS about which pods and containers to deploy. Amazon EKS then brings online the computing resources from Amazon EC2 and AWS Fargate required to run the job.

With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously. You can also benchmark the same job with different compute backends or spread your job across multiple Availability Zones to improve availability.

The following diagram illustrates how Amazon EMR on EKS works with other AWS services.

Concepts

Kubernetes namespace

Amazon EKS uses Kubernetes namespaces to divide cluster resources between multiple users and applications. These namespaces are the foundation for multi-tenant environments. A Kubernetes namespace can have either Amazon EC2 or AWS Fargate as the compute provider. This flexibility provides you with different performance and cost options for your jobs to run on.

Virtual cluster

A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters do not create any active resources that contribute to your bill or that require lifecycle management outside the service.

Job run

A job run is a unit of work, such as a Spark jar, PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. One job can have multiple job runs. When you submit a job run, you include the following information:

• A virtual cluster where the job should run.

• A job name to identify the job.

• The execution role — a scoped IAM role that runs the job and allows you to specify which resources can be accessed by the job.

• The Amazon EMR release label that specifies the version of open-source applications to use.

• The artifacts to use when submitting your job, such as spark-submit parameters.

By default, logs are uploaded to the Spark History server and are accessible from the AWS Management Console. You can also push event logs, execution logs, and metrics to Amazon S3 and Amazon CloudWatch.
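The information that a job run carries can be sketched as a request body. The following Python example is a minimal illustration, not the service's reference implementation: it assembles a dictionary using the field names of the StartJobRun API, and every value shown (virtual cluster ID, role ARN, bucket, script path) is a hypothetical placeholder.

```python
import json


def build_start_job_run_request(virtual_cluster_id, name, execution_role_arn,
                                release_label, entry_point,
                                spark_submit_parameters, log_uri=None):
    """Assemble a StartJobRun request body as a plain dictionary."""
    request = {
        "virtualClusterId": virtual_cluster_id,    # where the job should run
        "name": name,                              # job name
        "executionRoleArn": execution_role_arn,    # scoped IAM role for the job
        "releaseLabel": release_label,             # Amazon EMR release label
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_submit_parameters,
            }
        },
    }
    if log_uri is not None:
        # Optional: push logs to Amazon S3 in addition to the defaults.
        request["configurationOverrides"] = {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        }
    return request


# All values below are placeholders chosen for illustration.
request = build_start_job_run_request(
    virtual_cluster_id="abcdefghijklmnop1234567890",
    name="sample-job",
    execution_role_arn="arn:aws:iam::111122223333:role/my-job-execution-role",
    release_label="emr-6.2.0-latest",
    entry_point="s3://my-bucket/scripts/job.py",
    spark_submit_parameters="--conf spark.executor.instances=2",
    log_uri="s3://my-bucket/logs/",
)
print(json.dumps(request, indent=2))
```

Saved to a file, JSON of this shape can be supplied to aws emr-containers start-job-run through the AWS CLI --cli-input-json option.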

Amazon EMR containers

Amazon EMR containers is the API name for Amazon EMR on EKS. The emr-containers prefix is used in the following scenarios:

• It is the prefix in the CLI commands for Amazon EMR on EKS. For example, aws emr-containers start-job-run.

• It is the prefix before IAM policy actions for Amazon EMR on EKS. For example, "Action": [ "emr-containers:StartJobRun"]. For more information, see Policy actions for Amazon EMR on EKS.

• It is the prefix used in Amazon EMR on EKS service endpoints. For example, emr-containers.us-east-1.amazonaws.com. For more information, see Amazon EMR on EKS Service Endpoints.
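The three uses of the prefix can be illustrated with a few string helpers. This is only a sketch: the helper names are hypothetical, and the endpoint pattern is assumed from the single us-east-1 example above, so check the service endpoints reference for Regions that use a different domain.

```python
SERVICE_PREFIX = "emr-containers"


def cli_command(operation):
    """Return the AWS CLI command line for an Amazon EMR on EKS operation."""
    return f"aws {SERVICE_PREFIX} {operation}"


def iam_action(api_name):
    """Return the IAM policy action string for an API operation."""
    return f"{SERVICE_PREFIX}:{api_name}"


def service_endpoint(region):
    """Return the endpoint host name, assuming the pattern of the example above."""
    return f"{SERVICE_PREFIX}.{region}.amazonaws.com"


print(cli_command("start-job-run"))
print(iam_action("StartJobRun"))
print(service_endpoint("us-east-1"))
```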

How the components work together

The following steps and diagram illustrate the Amazon EMR on EKS workflow:

• Use an existing Amazon EKS cluster or create one by using the eksctl command line utility or Amazon EKS console.

• Create a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster.

• Submit your job to the virtual cluster using the AWS CLI or SDK.


Registering Amazon EMR with a Kubernetes namespace on Amazon EKS creates a virtual cluster. Amazon EMR can then run analytics workloads on that namespace. When you use Amazon EMR on EKS to submit Spark jobs to the virtual cluster, Amazon EMR on EKS requests the Kubernetes scheduler on Amazon EKS to schedule pods.

For each job that you run, Amazon EMR on EKS creates a container with an Amazon Linux 2 base image, Apache Spark, and associated dependencies. Each job runs in a pod that downloads the container and starts to run it. The pod terminates after the job terminates. If the container’s image has been previously deployed to the node, then a cached image is used and the download is bypassed. Sidecar containers, such as log or metric forwarders, can be deployed to the pod. After the job terminates, you can still debug it using Spark application UI in the Amazon EMR console.


Setting up

Complete the following tasks to get set up for Amazon EMR on EKS. If you've already signed up for Amazon Web Services (AWS) and have been using Amazon EKS, you are almost ready to use Amazon EMR on EKS. If you have already completed any of these steps, you may skip them and move on to the next step.

Note

You can also follow the Amazon EMR on EKS Workshop to set up all the necessary resources to run Spark jobs on Amazon EMR on EKS. The workshop also provides automation by using CloudFormation templates to create the resources necessary for you to get started.

1. Install the AWS CLI

2. Install eksctl

3. Set up an Amazon EKS cluster

4. Enable cluster access for Amazon EMR on EKS

5. Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

6. Create a job execution role

7. Update the trust policy of the job execution role

8. Grant users access to Amazon EMR on EKS

9. Register the Amazon EKS cluster with Amazon EMR

Install the AWS CLI

You can install the latest version of the AWS CLI for macOS, Linux, or Windows.

Important

To set up Amazon EMR on EKS, you must have the latest version of AWS CLI installed.

To install or update the AWS CLI for macOS

1. If you currently have the AWS CLI installed, determine which version you have installed.

aws --version

2. If you have an earlier version of AWS CLI, then use the following command to install the latest AWS CLI version 2. For other installation options, or to upgrade your currently installed version 2, see Upgrading the AWS CLI version 2 on macOS.

curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"

sudo installer -pkg AWSCLIV2.pkg -target /

If you're unable to use the AWS CLI version 2, then ensure that you have the latest version of the AWS CLI version 1 installed using the following command.

pip3 install awscli --upgrade --user

To install or update the AWS CLI for Linux

1. If you currently have the AWS CLI installed, determine which version you have installed.

aws --version

2. If you have an earlier version of AWS CLI, then use the following commands to install the latest AWS CLI version 2. For other installation options, or to upgrade your currently installed version 2, see Upgrading the AWS CLI version 2 on Linux.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

unzip awscliv2.zip

sudo ./aws/install

If you're unable to use the AWS CLI version 2, then ensure that you have the latest version of the AWS CLI version 1 installed using the following command.

pip3 install --upgrade --user awscli

To install or update the AWS CLI for Windows

1. If you currently have the AWS CLI installed, determine which version you have installed.

aws --version

2. If you have an earlier version of AWS CLI, then use the following steps to install the latest AWS CLI version 2. For other installation options, or to upgrade your currently installed version 2, see Upgrading the AWS CLI version 2 on Windows.

a. Download the AWS CLI MSI installer for Windows (64-bit) at https://awscli.amazonaws.com/AWSCLIV2.msi

b. Run the downloaded MSI installer and follow the onscreen instructions. By default, the AWS CLI installs to C:\Program Files\Amazon\AWSCLIV2.

If you're unable to use the AWS CLI version 2, then ensure that you have the latest version of the AWS CLI version 1 installed using the following command.

pip3 install --user --upgrade awscli
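When the version check is scripted, the output of aws --version can be parsed rather than read by eye. The following Python sketch assumes the output begins with a token of the form aws-cli/MAJOR.MINOR.PATCH; the sample string is made up for illustration and is not captured from a real installation.

```python
import re


def parse_aws_cli_version(version_output):
    """Extract (major, minor, patch) from the text printed by `aws --version`."""
    match = re.search(r"aws-cli/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version output: {version_output!r}")
    return tuple(int(part) for part in match.groups())


# Hypothetical sample output, for illustration only.
sample = "aws-cli/2.1.30 Python/3.8.8 Linux/5.4.0 exe/x86_64.ubuntu.20"
version = parse_aws_cli_version(sample)
print(version, "-> AWS CLI version 2" if version[0] >= 2 else "-> AWS CLI version 1")
```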

Configure your AWS CLI credentials

Both eksctl and the AWS CLI require that you have AWS credentials configured in your environment. The aws configure command is the fastest way to set up your AWS CLI installation for general use.

$ aws configure

AWS Access Key ID [None]: <AKIAIOSFODNN7EXAMPLE>

AWS Secret Access Key [None]: <wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY>

Default region name [None]: <region-code>

Default output format [None]: <json>

When you type this command, the AWS CLI prompts you for four pieces of information: access key, secret access key, AWS Region, and output format. This information is stored in a profile (a collection of settings) named default. This profile is used when you run commands unless you specify another one.

For more information, see Configuring the AWS CLI in the AWS Command Line Interface User Guide.
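To make the idea of a profile concrete, the following Python sketch reads profile settings from text in the INI format that the AWS CLI uses for its config file (typically ~/.aws/config, where non-default profiles appear as [profile name] sections). The sample text and the analytics profile are hypothetical, and the sketch parses an in-memory string rather than touching real files.

```python
import configparser

# Hypothetical contents of an AWS CLI config file.
SAMPLE_CONFIG = """\
[default]
region = us-west-2
output = json

[profile analytics]
region = us-east-1
output = text
"""


def profile_settings(config_text, profile="default"):
    """Return the settings of one profile as a dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The default profile has a bare section name; others are prefixed.
    section = "default" if profile == "default" else f"profile {profile}"
    return dict(parser[section])


print(profile_settings(SAMPLE_CONFIG))
print(profile_settings(SAMPLE_CONFIG, "analytics"))
```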


Install eksctl

You need to install the latest version of the eksctl command line utility on macOS, Linux, or Windows. For more information, see https://eksctl.io/.

Important

To set up Amazon EMR on EKS, you must have eksctl version 0.34.0 or later.

To install or upgrade eksctl on macOS using Homebrew

The easiest way to get started with Amazon EKS and macOS is by installing eksctl with Homebrew. The eksctl Homebrew recipe installs eksctl and any other dependencies that are required for Amazon EKS, such as kubectl. The recipe also installs the aws-iam-authenticator, which is required if you don't have the AWS CLI version 1.16.156 or later installed.

1. If you do not already have Homebrew installed on macOS, install it with the following command.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

2. Install the Weaveworks Homebrew tap.

brew tap weaveworks/tap

3. Install or upgrade eksctl.

• Install eksctl with the following command.

brew install weaveworks/tap/eksctl

• If eksctl is already installed, run the following command to upgrade.

brew upgrade eksctl && brew link --overwrite eksctl

4. Test that your installation was successful with the following command. You must have eksctl version 0.34.0 or later.

eksctl version

To install or upgrade eksctl on Linux using curl

1. Download and extract the latest release of eksctl with the following command.

curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

2. Move the extracted binary to /usr/local/bin.

sudo mv /tmp/eksctl /usr/local/bin

3. Test that your installation was successful with the following command. You must have eksctl version 0.34.0 or later.

eksctl version

To install or upgrade eksctl on Windows using Chocolatey

1. If you do not already have Chocolatey installed on your Windows system, see Installing Chocolatey.

2. Install or upgrade eksctl.

• Install the binaries with the following command.

chocolatey install -y eksctl

• If they are already installed, run the following command to upgrade:

chocolatey upgrade -y eksctl

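3. Test that your installation was successful with the following command. You must have eksctl version 0.34.0 or later.

eksctl version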
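The minimum-version requirement is easy to get wrong if version strings are compared as text, because "0.9.0" sorts after "0.34.0" alphabetically. A minimal Python sketch of a numeric comparison follows; it assumes the eksctl version output contains a MAJOR.MINOR.PATCH number, and the function names are illustrative.

```python
import re

MINIMUM_EKSCTL = (0, 34, 0)  # minimum version stated in this guide


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH number found in the text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output, minimum=MINIMUM_EKSCTL):
    """Compare versions numerically, component by component."""
    return parse_version(version_output) >= minimum


for candidate in ("0.34.0", "0.9.0", "0.100.0"):
    print(candidate, meets_minimum(candidate))
```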

Set up an Amazon EKS cluster

Amazon EKS is a managed service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Follow the steps outlined below to create a new Kubernetes cluster with nodes in Amazon EKS.

Prerequisites

Before creating an Amazon EKS cluster, you must install and configure the following tools:

• The latest version of AWS CLI.

• kubectl version 1.20 or later.

• The latest version of eksctl.

For more information, see Install the AWS CLI, Installing kubectl, and Install eksctl.

Create an Amazon EKS cluster using eksctl

Take the following steps to create an Amazon EKS cluster using eksctl.

Important

To get started quickly, you can create an EKS cluster and the nodes with default settings. But for production use, we recommend that you customize the settings for the cluster and nodes to meet your specific requirements. For a list of all settings and options, run the command eksctl create cluster -h. For more information, see Creating and Managing Clusters in the eksctl documentation.


1. Create an Amazon EC2 key pair.

If you don't have an existing key pair, you can run the following command to create a new key pair. Replace us-west-2 with the Region where you want to create your cluster.

aws ec2 create-key-pair --region us-west-2 --key-name myKeyPair

Save the returned output in a file on your local computer. For more information, see Creating or importing a key pair in the Amazon EC2 User Guide for Linux Instances.

Note

A key pair is not required for creating an EKS cluster. But specifying the key pair allows you to SSH to nodes once they're created. You can specify a key pair only when you create the node group.

2. Create an EKS cluster.

Run the following command to create an EKS cluster and nodes. Replace my-cluster and myKeyPair with your own cluster name and key pair name. Replace us-west-2 with the Region where you want to create your cluster. For more information about Amazon EKS supported Regions, see Amazon Elastic Kubernetes Service endpoints and quotas.

eksctl create cluster \
    --name my-cluster \
    --region us-west-2 \
    --with-oidc \
    --ssh-access \
    --ssh-public-key myKeyPair \
    --instance-types=m5.xlarge \
    --managed

Important

When creating an EKS cluster, use m5.xlarge as the instance type, or any other instance type with a higher CPU and memory. Using an instance type with lower CPU or memory compared to m5.xlarge may lead to job failure due to insufficient resources available in the cluster. To see all resources created, view the stack named eksctl-my-cluster-cluster in the AWS CloudFormation console.

The cluster and node creation process takes several minutes. You'll see several lines of output when the cluster and nodes are created. The following example demonstrates the last line of output.

...

[✓] EKS cluster "my-cluster" in "us-west-2" region is ready

eksctl created a kubectl config file in ~/.kube or added the new cluster's configuration within an existing config file in ~/.kube.

3. View and validate resources

Run the following command to view your cluster nodes.

kubectl get nodes -o wide

The following shows an example output.

Amazon EC2 node output

NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-12-49.us-west-2.compute.internal Ready <none> 6m7s v1.18.9-eks-d1db3c 192.168.12.49 52.35.116.65 Amazon Linux 2 4.14.209-160.335.amzn2.x86_64 docker://19.3.6
ip-192-168-72-129.us-west-2.compute.internal Ready <none> 6m4s v1.18.9-eks-d1db3c 192.168.72.129 44.242.140.21 Amazon Linux 2 4.14.209-160.335.amzn2.x86_64 docker://19.3.6

For more information, see View nodes.

Use the following command to view the workloads running on your cluster.

kubectl get pods --all-namespaces -o wide

The following shows an example output.

Amazon EC2 output

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system aws-node-6ctpm 1/1 Running 0 7m43s 192.168.72.129 ip-192-168-72-129.us-west-2.compute.internal <none> <none>
kube-system aws-node-cbntg 1/1 Running 0 7m46s 192.168.12.49 ip-192-168-12-49.us-west-2.compute.internal <none> <none>
kube-system coredns-559b5db75d-26t47 1/1 Running 0 14m 192.168.78.81 ip-192-168-72-129.us-west-2.compute.internal <none> <none>
kube-system coredns-559b5db75d-9rvnk 1/1 Running 0 14m 192.168.29.248 ip-192-168-12-49.us-west-2.compute.internal <none> <none>
kube-system kube-proxy-l8pbd 1/1 Running 0 7m46s 192.168.12.49 ip-192-168-12-49.us-west-2.compute.internal <none> <none>
kube-system kube-proxy-zh85h 1/1 Running 0 7m43s 192.168.72.129 ip-192-168-72-129.us-west-2.compute.internal <none> <none>

For more information about what you see here, see View workloads.
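If cluster creation is driven from a script, the eksctl create cluster command from step 2 can be assembled as an argument list instead of a hand-edited string. The Python sketch below uses only the flags shown in this section; the function name is hypothetical, and the example prints the command rather than running it.

```python
import shlex


def eksctl_create_cluster_args(name, region, key_pair=None,
                               instance_type="m5.xlarge"):
    """Build the argument list for `eksctl create cluster`."""
    args = ["eksctl", "create", "cluster",
            "--name", name,
            "--region", region,
            "--with-oidc"]
    if key_pair is not None:
        # A key pair is optional; it only enables SSH access to the nodes.
        args += ["--ssh-access", "--ssh-public-key", key_pair]
    args += [f"--instance-types={instance_type}", "--managed"]
    return args


args = eksctl_create_cluster_args("my-cluster", "us-west-2", key_pair="myKeyPair")
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) once eksctl is installed and credentials are configured.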

Create an EKS cluster using AWS Management Console and AWS CLI

You can also use AWS Management Console and AWS CLI to create an EKS cluster. Follow the steps at Getting started with Amazon EKS – AWS Management Console and AWS CLI. This way gives you visibility into how each resource is created for the EKS cluster and how the resources interact with each other.

Important

When creating nodes for an EKS cluster, use m5.xlarge as the instance type, or any other instance type with a higher CPU and memory.

Create an EKS cluster with AWS Fargate

You can also create an EKS cluster with pods running on AWS Fargate.

1. To create an EKS cluster with pods running on Fargate, follow the steps outlined at Getting Started with AWS Fargate using Amazon EKS.

Note

Amazon EMR on EKS needs CoreDNS to run jobs on an EKS cluster. If you want to run your pods only on Fargate, you must follow the steps at Updating CoreDNS.

2. Run the following command to view your cluster nodes.

kubectl get nodes -o wide

The following shows an example Fargate output.

Fargate node output

NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
fargate-ip-192-168-141-147.us-west-2.compute.internal Ready <none> 8m3s v1.18.8-eks-7c9bda 192.168.141.147 <none> Amazon Linux 2 4.14.209-160.335.amzn2.x86_64 containerd://1.3.2
fargate-ip-192-168-164-53.us-west-2.compute.internal Ready <none> 7m30s v1.18.8-eks-7c9bda 192.168.164.53 <none> Amazon Linux 2 4.14.209-160.335.amzn2.x86_64 containerd://1.3.2

For more information, see View nodes.

3. Run the following command to view the workloads running on your cluster.

kubectl get pods --all-namespaces -o wide

The following shows an example Fargate output.

Fargate output

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-69dfb8f894-9z95l 1/1 Running 0 18m 192.168.164.53 fargate-ip-192-168-164-53.us-west-2.compute.internal <none> <none>
kube-system coredns-69dfb8f894-c8v66 1/1 Running 0 18m 192.168.141.147 fargate-ip-192-168-141-147.us-west-2.compute.internal <none> <none>

For more information, see View workloads.

Enable cluster access for Amazon EMR on EKS

You must allow Amazon EMR on EKS access to a specific namespace in your cluster by taking the following actions: creating a Kubernetes role, binding the role to a Kubernetes user, and mapping the Kubernetes user with the service-linked role AWSServiceRoleForAmazonEMRContainers. These actions are automated in eksctl when the IAM identity mapping command is used with emr-containers as the service name. You can perform all of these operations with the following command.

eksctl create iamidentitymapping \
    --cluster my_eks_cluster \
    --namespace kubernetes_namespace \
    --service-name "emr-containers"

Replace my_eks_cluster with the name of your Amazon EKS cluster and replace kubernetes_namespace with the Kubernetes namespace created to run Amazon EMR workloads.

Important

To use this functionality, you must install the latest version of eksctl, as described in Install eksctl.

Manual steps to enable cluster access for Amazon EMR on EKS

You can also use the following manual steps to enable cluster access for Amazon EMR on EKS.

1. Create a Kubernetes role in a specific namespace

Run the following command to create a Kubernetes role in a specific namespace. This role grants the necessary RBAC permissions to Amazon EMR on EKS.

namespace=my-namespace

cat - <<EOF | kubectl apply -f - --namespace "${namespace}"
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  # Role name is missing from the source; emr-containers is an assumed value.
  name: emr-containers
  namespace: ${namespace}
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["serviceaccounts", "services", "configmaps", "events", "pods", "pods/log"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "deletecollection", "annotate", "patch", "label"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "patch", "delete", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "deployments"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    # Verbs are missing from the source; assumed to match the "apps" rule.
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["extensions"]
    resources: ["ingresses"]
    # Verbs are missing from the source; assumed to match the "apps" rule.
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["roles", "rolebindings"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "deletecollection", "annotate", "patch", "label"]
EOF

2. Create a Kubernetes role binding scoped to the namespace

Run the following command to create a Kubernetes role binding in the given namespace. This role binding grants the permissions defined in the role created in the previous step to a user named emr-containers. This user identifies service-linked roles for Amazon EMR on EKS and thus allows Amazon EMR on EKS to perform actions as defined by the role you created.

namespace=my-namespace

cat - <<EOF | kubectl apply -f - --namespace "${namespace}"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  # Binding name is missing from the source; emr-containers is an assumed value.
  name: emr-containers
  namespace: ${namespace}
subjects:
- kind: User
  name: emr-containers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  # Must match the name of the role created in the previous step (assumed emr-containers).
  name: emr-containers
  apiGroup: rbac.authorization.k8s.io
EOF

3. Update Kubernetes aws-auth configuration map

You can use one of the following options to map the Amazon EMR on EKS service-linked role with the emr-containers user that was bound with the Kubernetes role in the previous step.

Option 1: Using eksctl

Run the following eksctl command to map the Amazon EMR on EKS service-linked role with the emr-containers user.

eksctl create iamidentitymapping \
    --cluster my-cluster-name \
    --arn "arn:aws:iam::my-account-id:role/AWSServiceRoleForAmazonEMRContainers" \
    --username emr-containers

Option 2: Without using eksctl

1. Run the following command to open the aws-auth configuration map in a text editor.

kubectl edit -n kube-system configmap/aws-auth

Note

If you receive an error stating Error from server (NotFound): configmaps "aws-auth" not found, see the steps in Add user roles in the Amazon EKS User Guide to apply the stock ConfigMap.

2. Add Amazon EMR on EKS service-linked role details to the mapRoles section of the ConfigMap, under data. Add this section if it does not already exist in the file. The updated mapRoles section under data looks like the following example.

apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::<your-account-id>:role/AWSServiceRoleForAmazonEMRContainers
      username: emr-containers
    - ... <other previously existing role entries, if any>

3. Save the file and exit your text editor.
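The mapRoles entry added in Option 2 has a fixed shape apart from the account ID, so it can be generated rather than typed. The following Python sketch is illustrative only: the function names are hypothetical, and the 12-digit check reflects the usual format of AWS account IDs.

```python
SERVICE_LINKED_ROLE = "AWSServiceRoleForAmazonEMRContainers"


def service_linked_role_arn(account_id):
    """Return the ARN of the Amazon EMR on EKS service-linked role."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"arn:aws:iam::{account_id}:role/{SERVICE_LINKED_ROLE}"


def map_roles_entry(account_id, username="emr-containers"):
    """Render one mapRoles list item as YAML text."""
    return (f"- rolearn: {service_linked_role_arn(account_id)}\n"
            f"  username: {username}\n")


# 111122223333 is a placeholder account ID.
print(map_roles_entry("111122223333"))
```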


Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

The IAM roles for service accounts feature is available on Amazon EKS versions 1.14 and later and for EKS clusters that are updated to versions 1.13 or later on or after September 3rd, 2019. To use this feature, you can update existing EKS clusters to version 1.14 or later. For more information, see Updating an Amazon EKS cluster Kubernetes version.

If your cluster supports IAM roles for service accounts, it has an OpenID Connect issuer URL associated with it. You can view this URL in the Amazon EKS console, or you can use the following AWS CLI command to retrieve it.

Important

You must use the latest version of the AWS CLI to receive the proper output from this command.

aws eks describe-cluster --name cluster_name --query "cluster.identity.oidc.issuer" --output text

The expected output is as follows.

https://oidc.eks.<region-code>.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E
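The issuer URL is also the basis for the identity provider's name in IAM: the provider is identified by the URL without its https:// scheme. The following Python sketch shows that relationship; the function names are hypothetical, the account ID is a placeholder, and the us-west-2 Region in the sample URL is chosen for illustration.

```python
from urllib.parse import urlparse


def oidc_provider_from_issuer(issuer_url):
    """Strip the scheme from an OIDC issuer URL, leaving host and path."""
    parsed = urlparse(issuer_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https issuer URL, got {issuer_url!r}")
    return parsed.netloc + parsed.path


def oidc_provider_arn(account_id, issuer_url):
    """Return the IAM ARN of the OIDC identity provider for this issuer."""
    provider = oidc_provider_from_issuer(issuer_url)
    return f"arn:aws:iam::{account_id}:oidc-provider/{provider}"


issuer = "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
print(oidc_provider_from_issuer(issuer))
print(oidc_provider_arn("111122223333", issuer))
```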

To use IAM roles for service accounts in your cluster, you must create an OIDC identity provider using either eksctl or the AWS Management Console.

To create an IAM OIDC identity provider for your cluster with eksctl

Check your eksctl version with the following command. This procedure assumes that you have installed eksctl and that your eksctl version is 0.32.0 or later.

eksctl version

For more information about installing or upgrading eksctl, see Installing or upgrading eksctl.

Create your OIDC identity provider for your cluster with the following command. Replace cluster_name with your own value.

eksctl utils associate-iam-oidc-provider --cluster cluster_name --approve

To create an IAM OIDC identity provider for your cluster with the AWS Management Console

1. Retrieve the OIDC issuer URL from the Amazon EKS console description of your cluster, or use the following AWS CLI command.

aws eks describe-cluster --name <cluster_name> --query "cluster.identity.oidc.issuer" --output text

2. Open the IAM console at https://console.aws.amazon.com/iam/.

3. In the navigation panel, choose Identity Providers, and then choose Create Provider.

a. For Provider Type, choose Choose a provider type, and then choose OpenID Connect.

b. For Provider URL, paste the OIDC issuer URL for your cluster.

c. For Audience, type sts.amazonaws.com and choose Next Step.

4. Verify that the provider information is correct, and then choose Create to create your identity provider.

Create a job execution role

You must create an IAM role to run workloads on Amazon EMR on EKS. We refer to this role as the job execution role in this documentation. For more information about how to create IAM roles, see Creating IAM roles in the IAM User Guide.

You must create an IAM policy that specifies the permissions for the job execution role and then attach the IAM policy to the job execution role. For more information, see Creating IAM Policies in the IAM User Guide.

The following policy for the job execution role allows access to resource targets, Amazon S3, and CloudWatch. These permissions are necessary to monitor jobs and access logs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents",
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
}

For more information, see Using job execution roles, Configure a job run to use S3 logs, and Configure a job run to use CloudWatch Logs.
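The policy above grants access to all buckets and log resources. As a sketch of how it could be narrowed, the following Python example generates the same actions scoped to one bucket and one log group. The bucket name, log group name, and function name are hypothetical, and the resource ARN patterns are assumptions based on the standard Amazon S3 and CloudWatch Logs ARN formats; verify them against your own resources before use.

```python
import json


def job_execution_policy(bucket, log_group, region="*", account_id="*"):
    """Build a job execution role policy limited to one bucket and log group."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Object-level actions apply to objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {   # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # Stream-level actions within the chosen log group.
                "Effect": "Allow",
                "Action": ["logs:PutLogEvents", "logs:CreateLogStream",
                           "logs:DescribeLogStreams"],
                "Resource": [
                    f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"
                ],
            },
            {   # Kept as broad as in the original policy.
                "Effect": "Allow",
                "Action": ["logs:DescribeLogGroups"],
                "Resource": ["arn:aws:logs:*:*:*"],
            },
        ],
    }


policy = job_execution_policy("my-bucket", "/emr-on-eks/my-jobs")
print(json.dumps(policy, indent=4))
```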

Update the trust policy of the job execution role

When you use IAM Roles for Service Accounts (IRSA) to run jobs on a Kubernetes namespace, an administrator must create a trust relationship between the job execution role and the identity of the EMR managed service account. The trust relationship can be created by updating the trust policy of the job execution role. Note that the EMR managed service account is automatically created at job submission, scoped to the namespace where the job is submitted.

Run the following command to update the trust policy.

aws emr-containers update-role-trust-policy \
    --cluster-name cluster \
    --namespace namespace \
    --role-name iam_role_name_for_job_execution

For more information, see Using job execution roles with Amazon EMR on EKS.

Important

The operator running the above command must have these permissions: eks:DescribeCluster, iam:GetRole, iam:UpdateAssumeRolePolicy.
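For orientation, an IRSA trust relationship generally takes the form of a statement that lets a Kubernetes service account assume the role through the cluster's OIDC provider. The Python sketch below shows that general shape only. It is not the exact statement that update-role-trust-policy writes: the service account name pattern that Amazon EMR on EKS generates is not given here, so it is left as a caller-supplied, hypothetical value.

```python
import json


def irsa_trust_statement(account_id, oidc_provider, namespace,
                         service_account_pattern):
    """Build a generic IRSA trust policy statement.

    oidc_provider is the issuer URL without the https:// scheme.
    service_account_pattern is hypothetical and supplied by the caller.
    """
    return {
        "Effect": "Allow",
        "Principal": {
            "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringLike": {
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account_pattern}"
            }
        },
    }


# All values below are placeholders.
statement = irsa_trust_statement(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E",
    namespace="my-namespace",
    service_account_pattern="my-service-account-*",
)
print(json.dumps(statement, indent=4))
```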

Grant users access to Amazon EMR on EKS

For any actions that you perform on Amazon EMR on EKS, you need a corresponding IAM permission for that action. You must create an IAM policy that allows you to perform the Amazon EMR on EKS actions and attach the policy to the IAM user or role that you use.

This topic provides steps for creating a new policy and attaching it to an IAM user. It also covers the basic permissions that you need to set up your Amazon EMR on EKS environment. We recommend that you refine the permissions to specific resources whenever possible based on your business needs.

Creating a new IAM policy and attaching it to an IAM user in the IAM console

Create a new IAM policy

1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.

2. In the left navigation pane of the IAM console, choose Policies.

3. On the Policies page, choose Create Policy.

4. In the Create Policy window, navigate to the Edit JSON tab. Create a policy document with one or more JSON statements as shown in the examples following this procedure. Next, choose Review policy.

5. On the Review Policy screen, enter your Policy Name, for example AmazonEMROnEKSPolicy. Enter an optional description, and then choose Create policy.

Attach the policy to an IAM user or role

1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.

2. In the navigation pane, choose Policies.

3. In the list of policies, select the check box next to the policy created in the previous section. You can use the Filter menu and the search box to filter the list of policies.

4. Choose Policy actions, and then choose Attach.

5. Choose the user or role to attach the policy to. You can use the Filter menu and the search box to filter the list of principal entities. After choosing the user or role to attach the policy to, choose Attach policy.

Permissions for managing virtual clusters

To manage virtual clusters in your AWS account, create an IAM policy with the following permissions.

These permissions allow you to create, list, describe, and delete virtual clusters in your AWS account.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "iam:AWSServiceName": "emr-containers.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "emr-containers:CreateVirtualCluster",
                "emr-containers:ListVirtualClusters",
                "emr-containers:DescribeVirtualCluster",
                "emr-containers:DeleteVirtualCluster"
            ],
            "Resource": "*"
        }
    ]
}

When the CreateVirtualCluster operation is invoked for the ﬁrst time from an AWS account, you also need the CreateServiceLinkedRole permissions to create the service-linked role for Amazon EMR on EKS. For more information, see Using service-linked roles for Amazon EMR on EKS (p. 66).
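A policy document like the one above can also be generated programmatically, which helps when the same statements are reused across several accounts. The following Python sketch is illustrative only: the function name build_virtual_cluster_policy is hypothetical, and the statements simply mirror the permissions listed above.

```python
import json


def build_virtual_cluster_policy():
    """Return the virtual cluster management policy as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Needed the first time CreateVirtualCluster runs in an account.
                "Effect": "Allow",
                "Action": ["iam:CreateServiceLinkedRole"],
                "Resource": "*",
                "Condition": {
                    "StringLike": {
                        "iam:AWSServiceName": "emr-containers.amazonaws.com"
                    }
                },
            },
            {
                "Effect": "Allow",
                "Action": [
                    "emr-containers:CreateVirtualCluster",
                    "emr-containers:ListVirtualClusters",
                    "emr-containers:DescribeVirtualCluster",
                    "emr-containers:DeleteVirtualCluster",
                ],
                "Resource": "*",
            },
        ],
    }


# Serialize the policy so it can be pasted into the JSON editor or saved to a file.
policy_json = json.dumps(build_virtual_cluster_policy(), indent=4)
print(policy_json)
```

The printed JSON can be pasted into the policy editor described in the procedure above.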

Permissions for submitting jobs

To submit jobs on the virtual clusters in your AWS account, create an IAM policy with the following permissions. These permissions allow you to start, list, describe, and cancel job runs for all virtual clusters in your account. Consider also adding permissions to list or describe virtual clusters, so that you can check the state of a virtual cluster before submitting jobs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "emr-containers:StartJobRun",
                "emr-containers:ListJobRuns",
                "emr-containers:DescribeJobRun",
                "emr-containers:CancelJobRun"
            ],
            "Resource": "*"
        }
    ]
}
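To follow the recommendation to refine permissions to specific resources, the job permissions can be limited to a single virtual cluster. The Python sketch below is one possible way to do that. It assumes the ARN layout arn:aws:emr-containers:Region:account:/virtualclusters/id for virtual clusters, with job runs nested under /jobruns/; check that layout against the IAM documentation for Amazon EMR on EKS before relying on it. The function name and the sample IDs are hypothetical.

```python
import json


def build_job_policy(region, account_id, virtual_cluster_id):
    """Return a job-submission policy scoped to one virtual cluster."""
    # Assumed ARN layout for a virtual cluster (verify before use).
    cluster_arn = (
        f"arn:aws:emr-containers:{region}:{account_id}"
        f":/virtualclusters/{virtual_cluster_id}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "emr-containers:StartJobRun",
                    "emr-containers:ListJobRuns",
                    "emr-containers:DescribeJobRun",
                    "emr-containers:CancelJobRun",
                ],
                # The virtual cluster itself plus every job run under it.
                "Resource": [cluster_arn, f"{cluster_arn}/jobruns/*"],
            }
        ],
    }


# Hypothetical Region, account ID, and virtual cluster ID.
print(json.dumps(build_job_policy("us-east-1", "123456789012", "abc123"), indent=4))
```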

Permissions for debugging and monitoring

To get access to logs pushed to Amazon S3 and CloudWatch, or to view application event logs in the Amazon EMR console, create an IAM policy with the following permissions. We recommend that you reﬁne the permissions to speciﬁc resources whenever possible based on your business needs.

Important

If you haven't created an Amazon S3 bucket, you need to add s3:CreateBucket permission to the policy statement. If you haven't created a log group, you need to add logs:CreateLogGroup to the policy statement.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "emr-containers:DescribeJobRun",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:Get*",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource": "*"
        }
    ]
}

For more information about how to configure a job run to push logs to Amazon S3 and CloudWatch, see Configure a job run to use S3 logs and Configure a job run to use CloudWatch Logs.

Register the Amazon EKS cluster with Amazon EMR

Registering your cluster is the ﬁnal required step to set up Amazon EMR on EKS to run workloads.

Use the following command to create a virtual cluster with a name of your choice for the Amazon EKS cluster and namespace that you set up in previous steps.

Note

Each virtual cluster must have a unique name across all the EKS clusters. If two virtual clusters have the same name, the deployment process will fail even if the two virtual clusters belong to different EKS clusters.

aws emr-containers create-virtual-cluster \
--name virtual_cluster_name \
--container-provider '{
    "id": "cluster_name",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "namespace_name"
        }
    }
}'

Alternatively, you can create a JSON ﬁle that includes the required parameters for the virtual cluster and then run the create-virtual-cluster command with the path to the JSON ﬁle. For more information, see Managing virtual clusters (p. 52).

Note

To validate the successful creation of a virtual cluster, view the status of virtual clusters using the list-virtual-clusters operation or by going to the Virtual Clusters page in the Amazon EMR console.
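The JSON file alternative mentioned above can be sketched in Python as follows. This is a minimal illustration, not a prescribed workflow: the function name virtual_cluster_request is hypothetical, and the keys name and containerProvider are assumed to be the request-shape counterparts of the --name and --container-provider options.

```python
import json


def virtual_cluster_request(name, eks_cluster_name, namespace):
    """Build the request body for registering an EKS namespace as a virtual cluster."""
    return {
        "name": name,
        "containerProvider": {
            "id": eks_cluster_name,  # name of the Amazon EKS cluster
            "type": "EKS",
            "info": {"eksInfo": {"namespace": namespace}},
        },
    }


request = virtual_cluster_request(
    "virtual_cluster_name", "cluster_name", "namespace_name"
)
# Print the JSON; redirect the output to a file to use it with the CLI.
print(json.dumps(request, indent=4))
```

If the output is saved as, for example, create-virtual-cluster-request.json, it can be passed to the command with --cli-input-json file://./create-virtual-cluster-request.json.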

Getting started

This topic helps you get started using Amazon EMR on EKS by deploying a Spark Python application on a virtual cluster.

Before you begin, make sure that you have completed the steps in Setting up (p. 5).

You will need the following information from the setup steps:

• Virtual cluster ID for the Amazon EKS cluster and Kubernetes namespace registered with Amazon EMR

Important

When creating an EKS cluster, make sure to use m5.xlarge as the instance type, or any other instance type with higher CPU and memory. Using an instance type with lower CPU or memory than m5.xlarge may lead to job failure due to insufficient resources available in the cluster.

• Name of the IAM role used for job execution

• Release label for the Amazon EMR release (for example, emr-6.4.0-latest)

• Destination targets for logging and monitoring:

• Amazon CloudWatch log group name and log stream preﬁx

• Amazon S3 location to store event and container logs

Important

Amazon EMR on EKS jobs use Amazon CloudWatch and Amazon S3 as destination targets for monitoring and logging. You can monitor job progress and troubleshoot failures by viewing the job logs sent to these destinations. To enable logging, the IAM policy associated with the IAM role for job execution must have the required permissions to access the target resources. If the IAM policy doesn't have the required permissions, you must follow the steps outlined in Update the trust policy of the job execution role (p. 15), Conﬁgure a job run to use Amazon S3 logs, and Conﬁgure a job run to use CloudWatch Logs before running this sample job.

Run a Spark Python application

Take the following steps to run a simple wordcount.py Spark Python application on Amazon EMR on EKS. The application entryPoint file is located at s3://REGION.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py. The REGION is the Region in which your Amazon EMR on EKS virtual cluster resides, such as us-east-1.

1. Update the IAM policy for the job execution role with the required permissions, as the following policy statements demonstrate.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadFromLoggingAndInputScriptBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*.elasticmapreduce",
                "arn:aws:s3:::*.elasticmapreduce/*",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-OUTPUT",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-OUTPUT/*",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-LOGGING",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-LOGGING/*"
            ]
        },
        {
            "Sid": "WriteToLoggingAndOutputDataBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-OUTPUT/*",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET-LOGGING/*"
            ]
        },
        {
            "Sid": "DescribeAndCreateCloudwatchLogStream",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        },
        {
            "Sid": "WriteToCloudwatchLogs",
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group:my_log_group_name:log-stream:my_log_stream_prefix/*"
            ]
        }
    ]
}

• The first statement ReadFromLoggingAndInputScriptBuckets in this policy grants ListBucket and GetObject access to the following Amazon S3 buckets:

• REGION.elasticmapreduce - the bucket where the application entryPoint file is located.

• DOC-EXAMPLE-BUCKET-OUTPUT - a bucket that you define for your output data.

• DOC-EXAMPLE-BUCKET-LOGGING - a bucket that you define for your logging data.

• The second statement WriteToLoggingAndOutputDataBuckets in this policy grants the job permissions to write data to, and delete data from, your output and logging buckets.

• The third statement DescribeAndCreateCloudwatchLogStream grants the job permissions to describe Amazon CloudWatch log groups and log streams and to create log streams.

• The fourth statement WriteToCloudwatchLogs grants permissions to write log events to log streams whose names begin with my_log_stream_prefix/ in the Amazon CloudWatch log group named my_log_group_name.
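The log group name and log stream prefix appear twice: once in the WriteToCloudwatchLogs resource of the policy and once in the monitoring configuration of the job run. As a small sketch of how the two can be kept in sync, the hypothetical helpers below derive both from the same pair of values. The function names are illustrative, and the ARN pattern is copied from the policy above.

```python
def log_stream_arn_pattern(log_group_name, log_stream_prefix):
    """Resource pattern for logs:PutLogEvents, as used in the policy above."""
    return (
        "arn:aws:logs:*:*:"
        f"log-group:{log_group_name}:"
        f"log-stream:{log_stream_prefix}/*"
    )


def cloudwatch_monitoring_configuration(log_group_name, log_stream_prefix):
    """Matching CloudWatch block for the job run's monitoring configuration."""
    return {
        "logGroupName": log_group_name,
        "logStreamNamePrefix": log_stream_prefix,
    }


group, prefix = "my_log_group_name", "my_log_stream_prefix"
print(log_stream_arn_pattern(group, prefix))
print(cloudwatch_monitoring_configuration(group, prefix))
```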

2. Initiate the sample application using the following command. Replace the placeholder values, such as cluster_id, execution-role-arn, and the DOC-EXAMPLE-BUCKET names, with appropriate values. The REGION is the Region in which your Amazon EMR on EKS virtual cluster resides, such as us-east-1.

aws emr-containers start-job-run \
--virtual-cluster-id cluster_id \
--name sample-job-name \
--execution-role-arn execution-role-arn \
--release-label emr-6.4.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://REGION.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
        "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET-OUTPUT/wordcount_output"],
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
}' \
--configuration-overrides '{
    "monitoringConfiguration": {
        "cloudWatchMonitoringConfiguration": {
            "logGroupName": "my_log_group_name",
            "logStreamNamePrefix": "my_log_stream_prefix"
        },
        "s3MonitoringConfiguration": {
            "logUri": "s3://DOC-EXAMPLE-BUCKET-LOGGING"
        }
    }
}'

The output data from this job will be available at s3://DOC-EXAMPLE-BUCKET-OUTPUT/wordcount_output.

You can also create a JSON file with specified parameters for your job run. Then run the start-job-run command with a path to the JSON file. For more information, see Submit a job run (p. 33). For more details about configuring job run parameters, see Options for configuring a job run (p. 34).

3. To monitor the progress of the job or to debug failures, you can inspect logs uploaded to Amazon S3, CloudWatch Logs, or both. For the log path in Amazon S3, see Configure a job run to use S3 logs. For the log path in CloudWatch Logs, see Configure a job run to use CloudWatch Logs. To see logs in CloudWatch Logs, follow these instructions.

• Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

• In the Navigation pane, choose Logs. Then choose Log groups.

• Choose the log group for Amazon EMR on EKS and then view the uploaded log events.
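The JSON file approach mentioned in step 2 can be sketched as follows. This Python example is an illustration under stated assumptions: the function name wordcount_job_request is hypothetical, the key names follow the request shape used in the start-job-run-request.json examples later in this guide, and all argument values are placeholders.

```python
import json


def wordcount_job_request(region, virtual_cluster_id, execution_role_arn,
                          output_bucket, logging_bucket,
                          log_group_name, log_stream_prefix):
    """Build a start-job-run request for the sample wordcount application."""
    spark_conf = {
        "spark.executor.instances": "2",
        "spark.executor.memory": "2G",
        "spark.executor.cores": "2",
        "spark.driver.cores": "1",
    }
    return {
        "name": "sample-job-name",
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": "emr-6.4.0-latest",
        "jobDriver": {
            "sparkSubmitJobDriver": {
                # The sample script lives in a Region-specific bucket.
                "entryPoint": f"s3://{region}.elasticmapreduce/emr-containers"
                              "/samples/wordcount/scripts/wordcount.py",
                "entryPointArguments": [f"s3://{output_bucket}/wordcount_output"],
                "sparkSubmitParameters": " ".join(
                    f"--conf {key}={value}" for key, value in spark_conf.items()
                ),
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "cloudWatchMonitoringConfiguration": {
                    "logGroupName": log_group_name,
                    "logStreamNamePrefix": log_stream_prefix,
                },
                "s3MonitoringConfiguration": {"logUri": f"s3://{logging_bucket}"},
            }
        },
    }


request = wordcount_job_request(
    "us-east-1", "cluster_id", "execution-role-arn",
    "DOC-EXAMPLE-BUCKET-OUTPUT", "DOC-EXAMPLE-BUCKET-LOGGING",
    "my_log_group_name", "my_log_stream_prefix",
)
print(json.dumps(request, indent=4))
```

Saving the output to a file and passing it with --cli-input-json file://./path-to-file keeps the job definition under version control instead of in shell history.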

Customizing Docker images for Amazon EMR on EKS

You can use customized Docker images with Amazon EMR on EKS. Customizing the Amazon EMR on EKS runtime image provides the following beneﬁts:

• Package application dependencies and runtime environment into a single immutable container that promotes portability and simpliﬁes dependency management for each workload.

• Install and conﬁgure packages that are optimized to your workloads. These packages may not be widely available in the public distribution of Amazon EMR runtimes.

• Integrate Amazon EMR on EKS with current established build, test, and deployment processes within your organization, including local development and testing.

• Apply established security processes, such as image scanning, that meet compliance and governance requirements within your organization.

Topics

• How to customize Docker images (p. 23)

• How to select a base image URI (p. 30)

• Considerations (p. 31)

How to customize Docker images

Take the following steps to customize Docker images for Amazon EMR on EKS.

• Prerequisites (p. 23)

• Step 1: Retrieve a base image from Amazon Elastic Container Registry (Amazon ECR) (p. 24)

• Step 2: Customize a base image (p. 24)

• Step 3: (Optional but recommended) Validate a custom image (p. 25)

• Step 4: Publish a custom image (p. 26)

• Step 5: Submit a Spark workload in Amazon EMR using a custom image (p. 26)

Here are other options you may want to consider when customizing Docker images:

• Customize Docker images for managed endpoints (p. 28)

• Work with multi-architecture images (p. 29)

Prerequisites

• Complete the Setting up (p. 5) steps for Amazon EMR on EKS.

• Install Docker in your environment. For more information, see Get Docker.

Step 1: Retrieve a base image from Amazon Elastic Container Registry (Amazon ECR)

Take the following steps to retrieve an Amazon EMR on EKS base image from Amazon ECR. The base image contains the Amazon EMR runtime and connectors used to access other AWS services.

1. Choose a base image URI. The image URI follows this format, ECR-registry-account.dkr.ecr.Region.amazonaws.com/spark/container-image-tag, as the following example demonstrates.

895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129

To choose a base image in your Region, see How to select a base image URI (p. 30).

2. Log in to the Amazon ECR repository where the base image is stored. Replace 895885662937 and us-west-2 with the Amazon ECR registry account and the AWS Region you have selected.

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com

3. Pull the base image into your local workspace. Replace emr-6.2.0-20210129 with the container image tag you have selected.

docker pull 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.2.0-20210129
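The URI format from step 1 and the registry host used in step 2 can be derived from the same three values. The Python sketch below shows one way to assemble them. The function names are hypothetical, and the runtime argument defaults to spark, the path segment used in the examples above.

```python
def registry_host(registry_account, region):
    """Host name of the Amazon ECR registry, as passed to docker login."""
    return f"{registry_account}.dkr.ecr.{region}.amazonaws.com"


def base_image_uri(registry_account, region, image_tag, runtime="spark"):
    """Full base image URI, as passed to docker pull or a Dockerfile FROM line."""
    return f"{registry_host(registry_account, region)}/{runtime}/{image_tag}"


print(registry_host("895885662937", "us-west-2"))
print(base_image_uri("895885662937", "us-west-2", "emr-6.2.0-20210129"))
```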

Step 2: Customize a base image

Take the following steps to customize the base image you have pulled from Amazon ECR.

1. Create a new Dockerfile on your local workspace.

2. Edit the Dockerfile you just created and add the following content. This Dockerfile uses the container image you have pulled from 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.2.0-20210129.

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.2.0-20210129
USER root
### Add customization commands here ####
USER hadoop:hadoop

3. Add commands in the Dockerfile to customize the base image. For example, add a command to install Python libraries, as the following Dockerfile demonstrates.

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.2.0-20210129
USER root
# Install libraries for Python 3
RUN pip3 install --upgrade boto3 pandas numpy
USER hadoop:hadoop

4. From the same directory where the Dockerfile is created, run the following command to build the Docker image. Provide a name for the Docker image, for example, emr6.2_custom.

docker build -t emr6.2_custom .

Step 3: (Optional but recommended) Validate a custom image

We recommend that you test the compatibility of your custom image before publishing it. You can use the Amazon EMR on EKS custom image CLI to check if your image has the required ﬁle structures and correct conﬁgurations for running on Amazon EMR on EKS.

Note

The Amazon EMR on EKS custom image CLI cannot confirm that your image is free of errors. Use caution when removing dependencies from the base images.

Take the following steps to validate your custom image.

1. Download and install Amazon EMR on EKS custom image CLI. For more information, see Amazon EMR on EKS custom image CLI Installation Guide.

2. Run the following command to test the installation.

emr-on-eks-custom-image --version

The following shows an example of the output.

Amazon EMR on EKS Custom Image CLI
Version: x.xx

3. Run the following command to validate your custom image.

emr-on-eks-custom-image validate-image -i image_name -r release_version [-t image_type]

• -i specifies the local image that needs to be validated. This can be the image URI or any name or tag that you defined for your image.

• -r speciﬁes the exact release version for the base image, for example, emr-5.32.0.

• -t speciﬁes the image type. If this is a Spark image, input spark. The default value is spark. The current Amazon EMR on EKS custom image CLI version only supports Spark runtime images.

If you run the command successfully and the custom image meets all the required conﬁgurations and ﬁle structures, the returned output displays the results of all of the tests, as the following example demonstrates.

Amazon EMR on EKS Custom Image Test
Version: x.xx
... Checking if docker cli is installed
... Checking Image Manifest
[INFO] Image ID: xxx
[INFO] Created On: 2021-05-17T20:50:07.986662904Z
[INFO] Default User Set to hadoop:hadoop : PASS
[INFO] Working Directory Set to /home/hadoop : PASS
[INFO] Entrypoint Set to /usr/bin/entrypoint.sh : PASS
[INFO] SPARK_HOME is set with value: /usr/lib/spark : PASS
[INFO] JAVA_HOME is set with value: /etc/alternatives/jre : PASS
[INFO] File Structure Test for spark-jars in /usr/lib/spark/jars: PASS
[INFO] File Structure Test for hadoop-files in /usr/lib/hadoop: PASS
[INFO] File Structure Test for hadoop-jars in /usr/lib/hadoop/lib: PASS
[INFO] File Structure Test for bin-files in /usr/bin: PASS
... Start Running Sample Spark Job
[INFO] Sample Spark Job Test with local:///usr/lib/spark/examples/jars/spark-examples.jar : PASS
---
Overall Custom Image Validation Succeeded.
---

If the custom image doesn't meet the required configurations or file structures, the returned output includes error messages that identify the incorrect configurations or file structures.
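When validation runs as part of an automated build, the output can be checked by a script instead of by eye. The Python sketch below is a minimal example that relies only on two features visible in the sample output above: check lines ending in ": PASS" and the final "Overall Custom Image Validation Succeeded." line. The function name is hypothetical, and the format of failure lines is not shown in this guide, so the sketch does not attempt to parse them.

```python
def summarize_validation(output):
    """Count passing checks and detect the overall success line."""
    lines = [line.strip() for line in output.splitlines()]
    passed = [line for line in lines if line.endswith(": PASS")]
    succeeded = "Overall Custom Image Validation Succeeded." in output
    return {"passed_checks": len(passed), "succeeded": succeeded}


# A shortened excerpt of the sample output shown above.
sample_output = """\
[INFO] Default User Set to hadoop:hadoop : PASS
[INFO] File Structure Test for spark-jars in /usr/lib/spark/jars: PASS
... Start Running Sample Spark Job
Overall Custom Image Validation Succeeded.
"""

print(summarize_validation(sample_output))
```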

Step 4: Publish a custom image

Publish the new Docker image to your Amazon ECR registry.

1. Run the following command to create an Amazon ECR repository for storing your Docker image.

Provide a name for your repository, for example, emr6.2_custom_repo. Replace us-west-2 with your Region.

aws ecr create-repository \
--repository-name emr6.2_custom_repo \
--image-scanning-configuration scanOnPush=true \
--region us-west-2

For more information, see Create a repository in the Amazon ECR User Guide.

2. Run the following command to authenticate to your default registry.

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.us-west-2.amazonaws.com

For more information, see Authenticate to your default registry in the Amazon ECR User Guide.

3. Tag and publish an image to the Amazon ECR repository you created.

Tag the image.

docker tag emr6.2_custom aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.2_custom_repo

Push the image.

docker push aws_account_id.dkr.ecr.us-west-2.amazonaws.com/emr6.2_custom_repo

For more information, see Push an image to Amazon ECR in the Amazon ECR User Guide.
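The tag and push commands in step 3 share the same remote repository URI, so a script can build both from one set of inputs. The following Python sketch only assembles and prints the commands; it does not run Docker. The function name publish_commands is hypothetical, and the account ID is a placeholder.

```python
import shlex


def publish_commands(local_image, account_id, region, repository):
    """Return the docker tag and docker push commands as argument lists."""
    remote = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}"
    return [
        ["docker", "tag", local_image, remote],
        ["docker", "push", remote],
    ]


commands = publish_commands(
    "emr6.2_custom", "123456789012", "us-west-2", "emr6.2_custom_repo"
)
for command in commands:
    # shlex.join quotes arguments safely for display or for a shell script.
    print(shlex.join(command))
```

Keeping the commands as argument lists means they can later be handed to subprocess.run without going through a shell.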

Step 5: Submit a Spark workload in Amazon EMR using a custom image

After a custom image is built and published, you can submit an Amazon EMR on EKS job using a custom image.

First, create a start-job-run-request.json file and specify the spark.kubernetes.container.image parameter to reference the custom image, as the following example JSON file demonstrates.

Note

You can use the local:// scheme to refer to files available in the custom image, as shown with the entryPoint argument in the JSON snippet below. You can also use the local:// scheme to refer to application dependencies. All files and dependencies referred to with the local:// scheme must already be present at the specified path in the custom image.

{
    "name": "spark-custom-image",
    "virtualClusterId": "virtual-cluster-id",
    "executionRoleArn": "execution-role-arn",
    "releaseLabel": "emr-6.2.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
            "entryPointArguments": [
                "10"
            ],
            "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.kubernetes.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.2_custom_repo"
        }
    }
}

You can also reference the custom image by using applicationConfiguration properties as the following example demonstrates.

{
    "name": "spark-custom-image",
    "virtualClusterId": "virtual-cluster-id",
    "executionRoleArn": "execution-role-arn",
    "releaseLabel": "emr-6.2.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
            "entryPointArguments": [
                "10"
            ],
            "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi"
        }
    },
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.kubernetes.container.image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.2_custom_repo"
                }
            }
        ]
    }
}

Then run the start-job-run command to submit the job.

aws emr-containers start-job-run --cli-input-json file://./start-job-run-request.json

In the JSON examples above, replace emr-6.2.0-latest with your Amazon EMR release version. In Step 1: Retrieve a base image from Amazon Elastic Container Registry (Amazon ECR) (p. 24), the selected base image tag is emr-6.2.0-20210129, so the corresponding Amazon EMR release version can be either emr-6.2.0-latest or emr-6.2.0-20210129. We strongly recommend that you use the -latest release version to ensure that the selected version contains the latest security updates. For more information about Amazon EMR release versions and corresponding image tags, see How to select a base image URI (p. 30).

Note

You can use spark.kubernetes.driver.container.image and spark.kubernetes.executor.container.image to specify different images for the driver and executor pods.
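As an illustration of the note above, the following Python sketch assembles a sparkSubmitParameters string that sets separate driver and executor images. The function name is hypothetical, and the :driver and :executor image tags are made-up placeholders rather than tags that exist in any repository.

```python
def spark_params_with_images(main_class, driver_image, executor_image):
    """Build sparkSubmitParameters with distinct driver and executor images."""
    confs = {
        "spark.kubernetes.driver.container.image": driver_image,
        "spark.kubernetes.executor.container.image": executor_image,
    }
    conf_args = " ".join(f"--conf {key}={value}" for key, value in confs.items())
    return f"--class {main_class} {conf_args}"


# Hypothetical image tags, used only to show two different images.
repo = "123456789012.dkr.ecr.us-west-2.amazonaws.com/emr6.2_custom_repo"
params = spark_params_with_images(
    "org.apache.spark.examples.SparkPi", f"{repo}:driver", f"{repo}:executor"
)
print(params)
```

The resulting string would take the place of the sparkSubmitParameters value in start-job-run-request.json.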

Customize Docker images for managed endpoints

You can also customize Docker images for managed endpoints, so you can run custom kernels and ensure you have the dependencies you need when running interactive workloads from EMR Studio.

1. Follow Steps 1 through 4 above to customize a Docker image. The only difference is the base image URI in your Dockerfile. The base image URI follows this format: ECR-registry-account.dkr.ecr.Region.amazonaws.com/notebook-spark/container-image-tag. You need to use notebook-spark in the base image URI, instead of spark. The base image contains the Spark runtime and the notebook kernels that run with it. For more information about selecting Regions and container image tags, see How to select a base image URI (p. 30).

2. Create a managed endpoint that can be used with the custom image.

First, create a JSON ﬁle custom-image-managed-endpoint.json with the following contents.

{
    "name": "endpoint-name",
    "virtualClusterId": "virtual-cluster-id",
    "type": "JUPYTER_ENTERPRISE_GATEWAY",
    "releaseLabel": "emr-6.2.0-latest",
    "executionRoleArn": "execution-role-arn",
    "certificateArn": "certificate-arn",
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "jupyter-kernel-overrides",
                "configurations": [
                    {
                        "classification": "python-kubernetes",
                        "properties": {
                            "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-python:latest"
                        }
                    },
                    {
                        "classification": "spark-python-kubernetes",
                        "properties": {
                            "container-image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/custom-notebook-spark:latest"
                        }
                    }
                ]
            }
        ]
    }
}

Then create a managed endpoint using the configurations specified in the JSON file, as the following example demonstrates.

aws emr-containers create-managed-endpoint --cli-input-json file://./custom-image-managed-endpoint.json

For more information, see Create a managed endpoint for your virtual cluster.

3. Then connect to the managed endpoint via EMR Studio. For more information, see Connecting from Studio.

Work with multi-architecture images

Amazon EMR on EKS supports multi-architecture container images for Amazon Elastic Container Registry (Amazon ECR). For more information, see Introducing multi-architecture container images for Amazon ECR.

Amazon EMR on EKS custom images support both Graviton-based EC2 instances and non-Graviton-based EC2 instances. The Graviton-based images are stored in the same image repositories in Amazon ECR as non-Graviton-based images.

For example, to inspect the Docker manifest list for 6.3.0 images, run the following command.

docker manifest inspect 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.3.0:latest

Here is the output. The arm64 architecture is for Graviton-based instances. The amd64 architecture is for non-Graviton-based instances.

{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 1805,
            "digest": "xxx123:6b971cb47d11011ab3d45fff925e9442914b4977ae0f9fbcdcf5cfa99a7593f0",
            "platform": {
                "architecture": "arm64",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "size": 1805,
            "digest": "xxx123:6f2375582c9c57fa9838c1d3a626f1b4fc281e287d2963a72dfe0bd81117e52f",
            "platform": {
                "architecture": "amd64",
                "os": "linux"
            }
        }
    ]
}
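To pick out the image for a particular architecture from a manifest list like the one above, a short script can map each architecture to its digest. The Python sketch below is illustrative only: the function name is hypothetical, and the sample digests are shortened placeholders rather than real values.

```python
import json


def digests_by_architecture(manifest_list_json):
    """Map each platform architecture in a manifest list to its image digest."""
    manifest_list = json.loads(manifest_list_json)
    return {
        manifest["platform"]["architecture"]: manifest["digest"]
        for manifest in manifest_list["manifests"]
    }


# A trimmed manifest list with placeholder digests.
sample_manifest_list = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "xxx123:6b97", "platform": {"architecture": "arm64", "os": "linux"}},
        {"digest": "xxx123:6f23", "platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

digests = digests_by_architecture(sample_manifest_list)
print(digests["arm64"])  # digest of the image for Graviton-based instances
```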

Take the following steps to create multi-architecture images:

1. Create a Dockerfile with the following contents so you can pull the arm64 image.
