Amazon EMR



Amazon EMR: Management Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


What is Amazon EMR?

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

If you are a first-time user of Amazon EMR, we recommend that you begin by reading the following, in addition to this section:

• Amazon EMR – This service page provides Amazon EMR highlights, product details, and pricing information.

• Tutorial: Getting started with Amazon EMR (p. 13) – This tutorial gets you started using Amazon EMR quickly.

In This Section

• Overview of Amazon EMR

• Benefits of using Amazon EMR

• Overview of Amazon EMR architecture

Overview of Amazon EMR

This topic provides an overview of Amazon EMR clusters, including how to submit work to a cluster, how data is processed, and the various states that the cluster goes through during processing.

In This Topic

• Understanding clusters and nodes

• Submitting work to a cluster

• Processing data

• Understanding the cluster lifecycle

Understanding clusters and nodes

The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.

The node types in Amazon EMR are as follows:

• Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

• Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

• Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.

The following diagram represents a cluster with one master node and four core nodes.
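As a rough illustration of the layout described above (one master node and four core nodes), the sketch below builds the instance-group portion of a cluster request as a plain Python list. The helper name `build_instance_groups`, the group names, and the `m5.xlarge` instance type are assumptions made for this example; the dictionary keys follow the shape that the Amazon EMR API uses for instance groups.

```python
def build_instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Return an instance-group list: one master node, plus core and optional task nodes.

    The instance type and the group names are illustrative assumptions.
    """
    if core_count < 0 or task_count < 0:
        raise ValueError("node counts must be non-negative")
    if task_count > 0 and core_count == 0:
        # Multi-node clusters have at least one core node.
        raise ValueError("a cluster with task nodes needs at least one core node")

    # Every cluster has exactly one master node.
    groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_count:
        # Core nodes run tasks and store data in HDFS.
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    if task_count:
        # Task nodes only run tasks and are optional.
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


# One master node and four core nodes.
layout = build_instance_groups(core_count=4)
for group in layout:
    print(group["InstanceRole"], group["InstanceCount"])
```

A list like this could be supplied as the `InstanceGroups` value when you create a cluster through an SDK. Calling `build_instance_groups(0)` yields the single-node case, with only the master node.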

Submitting work to a cluster

When you run a cluster on Amazon EMR, you have several options as to how you specify the work that needs to be done.

• Provide the entire definition of the work to be done in functions that you specify as steps when you create a cluster. This is typically done for clusters that process a set amount of data and then terminate when processing is complete.

• Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more jobs. For more information, see Submit work to a cluster (p. 526).

• Create a cluster, connect to the master node and other nodes as required using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively. For more information, see the Amazon EMR Release Guide.
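To make the idea of a step more concrete, here is a minimal sketch of a step definition expressed as a Python dictionary in the shape that the Amazon EMR API accepts. The helper name `spark_step`, the script file name, and the `--data_source` and `--output_uri` flags are assumptions for illustration; only the dictionary is built here, and nothing is sent to AWS.

```python
# Failure behaviors that a step can request (see the step-failure discussion
# later in this topic): terminate the cluster, cancel remaining steps, or continue.
ALLOWED_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def spark_step(name, script_uri, script_args=(), action_on_failure="CANCEL_AND_WAIT"):
    """Build a step definition that runs a PySpark script with spark-submit."""
    if action_on_failure not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs the given command line on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_uri, *script_args],
        },
    }


# Hypothetical script location and arguments.
step = spark_step(
    "Calculate red violations",
    "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
    ["--data_source", "s3://DOC-EXAMPLE-BUCKET/food-establishment-data.csv",
     "--output_uri", "s3://DOC-EXAMPLE-BUCKET/restaurant_violation_results"],
)
print(step["HadoopJarStep"]["Args"])
```

With the AWS SDK for Python installed and credentials configured, a list of such dictionaries could be passed as the `Steps` argument of the EMR client's `add_job_flow_steps` call for a running cluster, or included in the request that creates the cluster.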

Processing data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or queries directly to installed applications, or you can run steps in the cluster.

Submitting jobs directly to applications

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster.

To do this, you typically connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see Connect to the cluster (p. 476).


Running steps to process data

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

The following is an example process using four steps:

1. Submit an input dataset for processing.

2. Process the output of the first step by using a Pig program.

3. Process a second input dataset by using a Hive program.

4. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

Steps are run in the following sequence:

1. A request is submitted to begin processing steps.

2. The state of all steps is set to PENDING.

3. When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.

4. After the first step completes, its state changes to COMPLETED.

5. The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.

6. This pattern repeats for each step until they all complete and processing ends.

The following diagram represents the step sequence and change of state for the steps as they are processed.

If a step fails during processing, its state changes to FAILED. You can choose what happens next for each step. By default, if a preceding step fails, any remaining steps in the sequence are set to CANCELLED and do not run. You can also choose to ignore the failure and allow remaining steps to proceed, or to terminate the cluster immediately.

The following diagram represents the step sequence and default change of state when a step fails during processing.
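The state changes described above can be sketched as a small simulation. This is a toy model written for this explanation, not Amazon EMR code: the function name `simulate_steps` and its `continue_on_failure` flag are hypothetical, step names are assumed to be unique, and the third option (terminating the cluster immediately) is not modeled.

```python
def simulate_steps(outcomes, continue_on_failure=False):
    """Simulate ordered steps and return (final_states, trace).

    outcomes: list of (step_name, succeeds) pairs, in submission order.
    continue_on_failure: if False (the default behavior described above),
        steps after a failed step are set to CANCELLED and do not run.
    """
    # When processing is requested, every step starts out PENDING.
    states = {name: "PENDING" for name, _ in outcomes}
    trace = [(name, "PENDING") for name, _ in outcomes]

    for index, (name, succeeds) in enumerate(outcomes):
        # Only one step runs at a time; later steps stay PENDING.
        states[name] = "RUNNING"
        trace.append((name, "RUNNING"))

        states[name] = "COMPLETED" if succeeds else "FAILED"
        trace.append((name, states[name]))

        if not succeeds and not continue_on_failure:
            # Default behavior: cancel everything that has not run yet.
            for later_name, _ in outcomes[index + 1:]:
                states[later_name] = "CANCELLED"
                trace.append((later_name, "CANCELLED"))
            break

    return states, trace


final_states, transitions = simulate_steps(
    [("step-1", True), ("step-2", False), ("step-3", True)]
)
print(final_states)
```

Passing `continue_on_failure=True` models the choice to ignore the failure, in which case the third step in this example runs to completion.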

Understanding the cluster lifecycle

A successful Amazon EMR cluster follows this process:

1. Amazon EMR first provisions the EC2 instances in the cluster according to your specifications. For more information, see Configure cluster hardware and networking (p. 197). For all instances, Amazon EMR uses the default AMI for Amazon EMR or a custom Amazon Linux AMI that you specify. For more information, see Using a custom AMI (p. 183). During this phase, the cluster state is STARTING.

2. Amazon EMR runs bootstrap actions that you specify on each instance. You can use bootstrap actions to install custom applications and perform customizations that you require. For more information, see Create bootstrap actions to install additional software (p. 193). During this phase, the cluster state is BOOTSTRAPPING.

3. Amazon EMR installs the native applications that you specify when you create the cluster, such as Hive, Hadoop, Spark, and so on.

4. After bootstrap actions are successfully completed and native applications are installed, the cluster state is RUNNING. At this point, you can connect to cluster instances, and the cluster sequentially runs any steps that you specified when you created the cluster. You can submit additional steps, which run after any previous steps complete. For more information, see Work with steps using the AWS CLI and console (p. 526).

5. After steps run successfully, the cluster goes into a WAITING state. If a cluster is configured to auto-terminate after the last step is complete, it goes into a TERMINATING state and then into the TERMINATED state. If the cluster is configured to wait, you must manually shut it down when you no longer need it. After you manually shut down the cluster, it goes into the TERMINATING state and then into the TERMINATED state.

A failure during the cluster lifecycle causes Amazon EMR to terminate the cluster and all of its instances unless you enable termination protection. If a cluster terminates because of a failure, any data stored on the cluster is deleted, and the cluster state is set to TERMINATED_WITH_ERRORS. If you enabled termination protection, you can retrieve data from your cluster, and then remove termination protection and terminate the cluster. For more information, see Using termination protection (p. 177).

The following diagram represents the lifecycle of a cluster, and how each stage of the lifecycle maps to a particular cluster state.
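Because a cluster moves through these states over several minutes, scripts commonly poll the state until the cluster is ready or has ended. The following is a minimal sketch of such a polling loop. The function name `wait_for_state` and its parameters are hypothetical, and the state lookup is injected as a callable so the example runs without an AWS connection.

```python
import time

# States from which a cluster never returns.
TERMINAL_STATES = frozenset({"TERMINATED", "TERMINATED_WITH_ERRORS"})


def wait_for_state(get_state, target_states, max_polls=60, poll_seconds=30,
                   sleep=time.sleep):
    """Poll get_state() until it returns one of target_states.

    Raises RuntimeError if the cluster ends before reaching a target state,
    and TimeoutError if max_polls is exhausted.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in target_states:
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(
                f"cluster reached {state} before any of {sorted(target_states)}"
            )
        sleep(poll_seconds)
    raise TimeoutError(f"no target state reached after {max_polls} polls")


# Demonstration with a scripted sequence of states instead of a real cluster.
fake_states = iter(["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"])
reached = wait_for_state(lambda: next(fake_states), {"WAITING"},
                         sleep=lambda seconds: None)
print(reached)
```

In real use, `get_state` would wrap a call that describes the cluster and reads its current state, for example through the AWS CLI or an SDK.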

Benefits of using Amazon EMR

There are many benefits to using Amazon EMR. This section provides an overview of these benefits and links to additional information to help you explore further.

Topics

• Cost savings

• AWS integration

• Deployment

• Scalability and flexibility

• Reliability

• Security

• Monitoring

• Management interfaces

Cost savings

Amazon EMR pricing depends on the instance type and number of Amazon EC2 instances that you deploy and the Region in which you launch your cluster. On-demand pricing offers low rates, but you can reduce the cost even further by purchasing Reserved Instances or Spot Instances. Spot Instances can offer significant savings—as low as a tenth of on-demand pricing in some cases.


Note

If you use Amazon S3, Amazon Kinesis, or DynamoDB with your EMR cluster, there are additional charges for those services that are billed separately from your Amazon EMR usage.

Note

When you set up an Amazon EMR cluster in a private subnet, we recommend that you also set up VPC endpoints for Amazon S3. If your EMR cluster is in a private subnet without VPC endpoints for Amazon S3, you will incur additional NAT gateway charges that are associated with S3 traffic because the traffic between your EMR cluster and S3 will not stay within your VPC.

For more information about pricing options and details, see Amazon EMR pricing.

AWS integration

Amazon EMR integrates with other AWS services to provide capabilities and functionality related to networking, storage, security, and so on, for your cluster. The following list provides several examples of this integration:

• Amazon EC2 for the instances that comprise the nodes in the cluster

• Amazon Virtual Private Cloud (Amazon VPC) to configure the virtual network in which you launch your instances

• Amazon S3 to store input and output data

• Amazon CloudWatch to monitor cluster performance and configure alarms

• AWS Identity and Access Management (IAM) to configure permissions

• AWS CloudTrail to audit requests made to the service

• AWS Data Pipeline to schedule and start your clusters

• AWS Lake Formation to discover, catalog, and secure data in an Amazon S3 data lake

Deployment

Your EMR cluster consists of EC2 instances, which perform the work that you submit to your cluster. When you launch your cluster, Amazon EMR configures the instances with the applications that you choose, such as Apache Hadoop or Spark. Choose the instance size and type that best suits the processing needs for your cluster: batch processing, low-latency queries, streaming data, or large data storage. For more information about the instance types available for Amazon EMR, see Configure cluster hardware and networking (p. 197).

Amazon EMR offers a variety of ways to configure software on your cluster. For example, you can install an Amazon EMR release with a chosen set of applications that can include versatile frameworks, such as Hadoop, and applications, such as Hive, Pig, or Spark. You can also install one of several MapR distributions. Amazon EMR uses Amazon Linux, so you can also install software on your cluster manually using the yum package manager or from the source. For more information, see Configure cluster software (p. 192).

Scalability and flexibility

Amazon EMR provides flexibility to scale your cluster up or down as your computing needs change. You can resize your cluster to add instances for peak workloads and remove instances to control costs when peak workloads subside. For more information, see Manually resizing a running cluster (p. 518).

Amazon EMR also provides the option to run multiple instance groups so that you can use On-Demand Instances in one group for guaranteed processing power together with Spot Instances in another group to have your jobs completed faster and at lower costs. You can also mix different instance types to take advantage of better pricing for one Spot Instance type over another. For more information, see When should you use Spot Instances? (p. 237).

Additionally, Amazon EMR provides the flexibility to use several file systems for your input, output, and intermediate data. For example, you might choose the Hadoop Distributed File System (HDFS) which runs on the master and core nodes of your cluster for processing data that you do not need to store beyond your cluster's lifecycle. You might choose the EMR File System (EMRFS) to use Amazon S3 as a data layer for applications running on your cluster so that you can separate your compute and storage, and persist data outside of the lifecycle of your cluster. EMRFS provides the added benefit of allowing you to scale up or down for your compute and storage needs independently. You can scale your compute needs by resizing your cluster and you can scale your storage needs by using Amazon S3. For more information, see Work with storage and file systems (p. 138).

Reliability

Amazon EMR monitors nodes in your cluster and automatically terminates and replaces an instance in case of failure.

Amazon EMR provides configuration options that control whether your cluster is terminated automatically or manually. If you configure your cluster to be automatically terminated, it is terminated after all the steps complete. This is referred to as a transient cluster. However, you can configure the cluster to continue running after processing completes so that you can choose to terminate it manually when you no longer need it. Or, you can create a cluster, interact with the installed applications directly, and then manually terminate the cluster when you no longer need it. The clusters in these examples are referred to as long-running clusters.

Additionally, you can configure termination protection to prevent instances in your cluster from being terminated due to errors or issues during processing. When termination protection is enabled, you can recover data from instances before termination. The default settings for these options differ depending on whether you launch your cluster by using the console, CLI, or API. For more information, see Using termination protection (p. 177).

Security

Amazon EMR leverages other AWS services, such as IAM and Amazon VPC, and features such as Amazon EC2 key pairs, to help you secure your clusters and data.

IAM

Amazon EMR integrates with IAM to manage permissions. You define permissions using IAM policies, which you attach to IAM users or IAM groups. The permissions that you define in the policy determine the actions that those users or members of the group can perform and the resources that they can access. For more information, see How Amazon EMR works with IAM (p. 284).

Additionally, Amazon EMR uses IAM roles for the Amazon EMR service itself and the EC2 instance profile for the instances. These roles grant permissions for the service and instances to access other AWS services on your behalf. There is a default role for the Amazon EMR service and a default role for the EC2 instance profile. The default roles use AWS managed policies, which are created for you automatically the first time you launch an EMR cluster from the console and choose default permissions. You can also create the default IAM roles from the AWS CLI. If you want to manage the permissions instead of AWS, you can choose custom roles for the service and instance profile. For more information, see Configure IAM service roles for Amazon EMR permissions to AWS services and resources (p. 286).

Security groups

Amazon EMR uses security groups to control inbound and outbound traffic to your EC2 instances. When you launch your cluster, Amazon EMR uses a security group for your master instance and a security group to be shared by your core/task instances. Amazon EMR configures the security group rules to ensure communication among the instances in the cluster. Optionally, you can configure additional security groups and assign them to your master and core/task instances for more advanced rules. For more information, see Control network traffic with security groups (p. 419).

Encryption

Amazon EMR supports optional Amazon S3 server-side and client-side encryption with EMRFS to help protect the data that you store in Amazon S3. With server-side encryption, Amazon S3 encrypts your data after you upload it.

With client-side encryption, the encryption and decryption process occurs in the EMRFS client on your EMR cluster. You manage the root key for client-side encryption using either the AWS Key Management Service (AWS KMS) or your own key management system.

For more information, see Encryption for Amazon S3 data with EMRFS in the Amazon EMR Release Guide.

Amazon VPC

Amazon EMR supports launching clusters in a virtual private cloud (VPC) in Amazon VPC. A VPC is an isolated, virtual network in AWS that provides the ability to control advanced aspects of network configuration and access. For more information, see Configure networking (p. 205).

AWS CloudTrail

Amazon EMR integrates with CloudTrail to log information about requests made by or on behalf of your AWS account. With this information, you can track who accesses your cluster, when they access it, and the IP address from which they made the request. For more information, see Logging Amazon EMR API calls in AWS CloudTrail (p. 474).

Amazon EC2 key pairs

You can monitor and interact with your cluster by forming a secure connection between your remote computer and the master node. You use the Secure Shell (SSH) network protocol for this connection or use Kerberos for authentication. If you use SSH, an Amazon EC2 key pair is required. For more information, see Use an Amazon EC2 key pair for SSH credentials (p. 338).

Monitoring

You can use the Amazon EMR management interfaces and log files to troubleshoot cluster issues, such as failures or errors. Amazon EMR provides the ability to archive log files in Amazon S3 so you can store logs and troubleshoot issues even after your cluster terminates. Amazon EMR also provides an optional debugging tool in the Amazon EMR console to browse the log files based on steps, jobs, and tasks. For more information, see Configure cluster logging and debugging (p. 241).

Amazon EMR integrates with CloudWatch to track performance metrics for the cluster and jobs within the cluster. You can configure alarms based on a variety of metrics such as whether the cluster is idle or the percentage of storage used. For more information, see Monitor metrics with CloudWatch (p. 462).
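As one possible illustration of an idle-cluster alarm, the sketch below assembles the parameters for a CloudWatch alarm on the `IsIdle` metric that Amazon EMR publishes. The helper name `idle_alarm_params`, the alarm naming scheme, the 30-minute window, and the placeholder cluster ID are assumptions for this example. The code only builds a dictionary and does not call AWS.

```python
def idle_alarm_params(cluster_id, idle_minutes=30, period_seconds=300):
    """Build alarm parameters that trigger after the cluster has been idle
    for idle_minutes, evaluated in periods of period_seconds."""
    total_seconds = idle_minutes * 60
    if total_seconds < period_seconds or total_seconds % period_seconds != 0:
        raise ValueError("idle_minutes must be a whole multiple of the period")
    return {
        "AlarmName": f"emr-{cluster_id}-idle",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        # EMR metrics are reported per cluster through the JobFlowId dimension.
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": total_seconds // period_seconds,
        # IsIdle is 1 while the cluster is idle, so an average of 1 over
        # every period means the cluster was idle for the whole window.
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = idle_alarm_params("j-EXAMPLECLUSTERID")  # placeholder cluster ID
print(params["EvaluationPeriods"], "periods of", params["Period"], "seconds")
```

A dictionary like this could be passed to the CloudWatch `put_metric_alarm` operation through an SDK, typically together with an alarm action such as a notification topic.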

Management interfaces

There are several ways you can interact with Amazon EMR:

• Console: A graphical user interface that you can use to launch and manage clusters. With it, you fill out web forms to specify the details of clusters to launch, view the details of existing clusters, debug, and terminate clusters. Using the console is the easiest way to get started with Amazon EMR; no programming knowledge is required. The console is available online at https://console.aws.amazon.com/elasticmapreduce/home.

• AWS Command Line Interface (AWS CLI): A client application you run on your local machine to connect to Amazon EMR and create and manage clusters. The AWS CLI contains a feature-rich set of commands specific to Amazon EMR. With it, you can write scripts that automate the process of launching and managing clusters. If you prefer working from a command line, using the AWS CLI is the best option. For more information, see Amazon EMR in the AWS CLI Command Reference.

• Software Development Kit (SDK): SDKs provide functions that call Amazon EMR to create and manage clusters. With them, you can write applications that automate the process of creating and managing clusters. Using the SDK is the best option to extend or customize the functionality of Amazon EMR. Amazon EMR is currently available in the following SDKs: Go, Java, .NET (C# and VB.NET), Node.js, PHP, Python, and Ruby. For more information about these SDKs, see Tools for AWS and Amazon EMR sample code & libraries.

• Web Service API: A low-level interface that you can use to call the web service directly, using JSON. Using the API is the best option to create a custom SDK that calls Amazon EMR. For more information, see the Amazon EMR API Reference.

Overview of Amazon EMR architecture

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster. This section provides an overview of the layers and the components of each.

In This Topic

• Storage

• Cluster resource management

• Data processing frameworks

• Applications and programs

Storage

The storage layer includes the different file systems that are used with your cluster. There are several different types of storage options as follows.

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O.

For more information, go to HDFS User Guide on the Apache Hadoop website.

EMR File System (EMRFS)

Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.

Local file system

The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
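The three storage options above are usually distinguished by the URI scheme of a path. The following sketch maps a URI to the storage layer it refers to and to whether data written there outlives the cluster, based only on the descriptions in this section. The function name `describe_storage` and the example paths are hypothetical.

```python
from urllib.parse import urlparse

# scheme -> (storage layer, data persists after the cluster terminates?)
STORAGE_LAYERS = {
    "s3": ("EMRFS (Amazon S3)", True),     # stored outside the cluster
    "hdfs": ("HDFS", False),               # ephemeral, reclaimed on termination
    "file": ("local file system", False),  # tied to the instance lifecycle
}


def describe_storage(uri):
    """Return the storage layer and persistence implied by a URI's scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in STORAGE_LAYERS:
        raise ValueError(f"unrecognized storage scheme in {uri!r}")
    layer, persists = STORAGE_LAYERS[scheme]
    return {"layer": layer, "persists_after_cluster_termination": persists}


for example in ("s3://DOC-EXAMPLE-BUCKET/input/",
                "hdfs:///user/hadoop/intermediate/",
                "file:///mnt/tmp/scratch"):
    print(example, describe_storage(example))
```

A check like this can help confirm that final output is written to an `s3://` location before a transient cluster terminates.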

Cluster resource management

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks. However, there are other frameworks and applications that are offered in Amazon EMR that do not use YARN as a resource manager. Amazon EMR also has an agent on each node that administers YARN components, keeps the cluster healthy, and communicates with Amazon EMR.

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release version 5.19.0 and later uses the built-in YARN node labels feature to achieve this. (Earlier versions used a code patch.) Properties in the yarn-site and capacity-scheduler configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the CORE label, and sets properties so that application masters are scheduled only on nodes with the CORE label. Manually modifying related properties in the yarn-site and capacity-scheduler configuration classifications, or directly in associated XML files, could break this feature or modify this functionality.

Data processing frameworks

The data processing framework layer is the engine used to process and analyze data. There are many frameworks available that run on YARN or have their own resource management. Different frameworks are available for different kinds of processing needs, such as batch, interactive, in-memory, streaming, and so on. The framework that you choose depends on your use case. This impacts the languages and interfaces available from the application layer, which is the layer used to interact with the data you want to process. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark.

Hadoop MapReduce

Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results. The Reduce function combines the intermediate results, applies additional algorithms, and produces the final output. There are multiple frameworks available for MapReduce, such as Hive, which automatically generates Map and Reduce programs.

For more information, go to How map and reduce operations are actually carried out on the Apache Hadoop Wiki website.
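The division of labor between the Map and Reduce functions can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model only, not Hadoop code: the function names are made up for this example, and the grouping step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) key-value pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Stand-in for the framework's shuffle: collect values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: combine the intermediate values for one key."""
    return key, sum(values)


def word_count(lines):
    # Intermediate results: one (word, 1) pair per word occurrence.
    intermediate = (pair for line in lines for pair in map_words(line))
    grouped = group_by_key(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

In a real MapReduce job, the framework runs many Map and Reduce calls in parallel across the cluster and handles the grouping, data movement, and retries; you supply only the two functions.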

Apache Spark

Spark is a cluster framework and programming model for processing big data workloads. Like Hadoop MapReduce, Spark is an open-source, distributed processing system but uses directed acyclic graphs for execution plans and in-memory caching for datasets. When you run Spark on Amazon EMR, you can use EMRFS to directly access your data in Amazon S3. Spark supports multiple interactive query modules such as SparkSQL.

For more information, see Apache Spark on Amazon EMR clusters in the Amazon EMR Release Guide.

Applications and programs

Amazon EMR supports many applications, such as Hive, Pig, and the Spark Streaming library, to provide capabilities such as using higher-level languages to create processing workloads, leveraging machine learning algorithms, making stream processing applications, and building data warehouses. Amazon EMR also supports open-source projects that have their own cluster management functionality instead of using YARN.

You use various libraries and languages to interact with the applications that you run in Amazon EMR. For example, you can use Java, Hive, or Pig with MapReduce, or Spark Streaming, Spark SQL, MLlib, and GraphX with Spark.

For more information, see the Amazon EMR Release Guide.


Setting up Amazon EMR

Complete the tasks in this section before you launch an Amazon EMR cluster for the first time:

1. Sign up for AWS

2. Create an Amazon EC2 key pair for SSH

Sign up for AWS

If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

Create an Amazon EC2 key pair for SSH

Note

With Amazon EMR release versions 5.10.0 or later, you can configure Kerberos to authenticate users and SSH connections to a cluster. For more information, see Use Kerberos authentication (p. 338).

To authenticate and connect to the nodes in a cluster over a secure channel using the Secure Shell (SSH) protocol, create an Amazon Elastic Compute Cloud (Amazon EC2) key pair before you launch the cluster.

You can also create a cluster without a key pair. This is usually done with transient clusters that start, run steps, and then terminate automatically.

If... | Then...

You already have an Amazon EC2 key pair that you want to use, or you don't need to authenticate to your cluster. | Skip this step.

You need to create a key pair. | See Creating your key pair using Amazon EC2.

Next steps

• For guidance on creating a sample cluster, see Tutorial: Getting started with Amazon EMR (p. 13).

• For more information on how to configure a custom cluster and control access to it, see Plan and configure clusters (p. 131) and Security in Amazon EMR (p. 251).

Tutorial: Getting started with Amazon EMR

Overview

With Amazon EMR, you can set up a cluster to process and analyze data with big data frameworks in just a few minutes. This tutorial shows you how to launch a sample cluster using Spark, and how to run a simple PySpark script stored in an Amazon S3 bucket. It covers essential Amazon EMR tasks in three main workflow categories: Plan and Configure, Manage, and Clean Up.

You'll find links to more detailed topics as you work through the tutorial, and ideas for additional steps in the Next steps (p. 24) section. If you have questions or get stuck, contact the Amazon EMR team on our Discussion forum.

Prerequisites

• Before you launch an Amazon EMR cluster, make sure you complete the tasks in Setting up Amazon EMR (p. 12).

Cost

• The sample cluster that you create runs in a live environment. The cluster accrues minimal charges. To avoid additional charges, make sure you complete the cleanup tasks in the last step of this tutorial. Charges accrue at the per-second rate according to Amazon EMR pricing. Charges also vary by Region. For more information, see Amazon EMR pricing.

• Minimal charges might accrue for small files that you store in Amazon S3. Some or all of the charges for Amazon S3 might be waived if you are within the usage limits of the AWS Free Tier. For more information, see Amazon S3 pricing and AWS Free Tier.

Step 1: Plan and configure an Amazon EMR cluster

Prepare storage for Amazon EMR

When you use Amazon EMR, you can choose from a variety of file systems to store input data, output data, and log files. In this tutorial, you use EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that lets you read and write regular files to Amazon S3. For more information, see Work with storage and file systems (p. 138).

To create a bucket for this tutorial, follow the instructions in How do I create an S3 bucket? in the Amazon Simple Storage Service Console User Guide. Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster. For example, US West (Oregon) us-west-2.

Buckets and folders that you use with Amazon EMR have the following limitations:

• Names can consist of lowercase letters, numbers, periods (.), and hyphens (-).

• Names cannot end in numbers.

• A bucket name must be unique across all AWS accounts.

• An output folder must be empty.
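The first two limitations in the list above can be checked locally before you create anything. The sketch below implements only those two rules as stated here; the function name `is_valid_emr_name` is hypothetical. Amazon S3 enforces further naming rules of its own, and global uniqueness and an empty output folder can only be verified against the service.

```python
import re

# Lowercase letters, numbers, periods, and hyphens only.
_ALLOWED_CHARACTERS = re.compile(r"[a-z0-9.-]+")


def is_valid_emr_name(name):
    """Check a bucket or folder name against the two local rules listed above."""
    if not _ALLOWED_CHARACTERS.fullmatch(name):
        return False  # empty, or contains a disallowed character
    return not name[-1].isdigit()  # names cannot end in numbers


for candidate in ("doc-example-bucket", "My_Bucket", "logs-2020", "emr.tutorial.data"):
    print(candidate, is_valid_emr_name(candidate))
```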

Prepare an application with input data for Amazon EMR

The most common way to prepare an application for Amazon EMR is to upload the application and its input data to Amazon S3. Then, when you submit work to your cluster you specify the Amazon S3 locations for your script and data.

In this step, you upload a sample PySpark script to your Amazon S3 bucket. We've provided a PySpark script for you to use. The script processes food establishment inspection data and returns a results file in your S3 bucket. The results file lists the top ten establishments with the most "Red" type violations.

You also upload sample input data to Amazon S3 for the PySpark script to process. The input data is a modified version of Health Department inspection results in King County, Washington, from 2006 to 2020. For more information, see King County Open Data: Food Establishment Inspection Data. Following are sample rows from the dataset.

name, inspection_result, inspection_closed_business, violation_type, violation_points
100 LB CLAM, Unsatisfactory, FALSE, BLUE, 5
100 PERCENT NUTRICION, Unsatisfactory, FALSE, BLUE, 5
7-ELEVEN #2361-39423A, Complete, FALSE, , 0

To prepare the example PySpark script for EMR

1. Copy the example code below into a new ﬁle in your editor of choice.

import argparse

from pyspark.sql import SparkSession


def calculate_red_violations(data_source, output_uri):
    """
    Processes sample food establishment inspection data and queries the data to find the top 10 establishments
    with the most Red violations from 2006 to 2020.

    :param data_source: The URI of your food establishment data CSV, such as 's3://DOC-EXAMPLE-BUCKET/food-establishment-data.csv'.
    :param output_uri: The URI where output is written, such as 's3://DOC-EXAMPLE-BUCKET/restaurant_violation_results'.
    """
    with SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate() as spark:
        # Load the restaurant violation CSV data
        if data_source is not None:
            restaurants_df = spark.read.option("header", "true").csv(data_source)

        # Create an in-memory DataFrame to query
        restaurants_df.createOrReplaceTempView("restaurant_violations")

        # Create a DataFrame of the top 10 restaurants with the most Red violations
        top_red_violation_restaurants = spark.sql("""SELECT name, count(*) AS total_red_violations
          FROM restaurant_violations
          WHERE violation_type = 'RED'
          GROUP BY name
          ORDER BY total_red_violations DESC LIMIT 10""")

        # Write the results to the specified output URI
        top_red_violation_restaurants.write.option("header", "true").mode("overwrite").csv(output_uri)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Both arguments are required; without a data source there is nothing to query.
    parser.add_argument(
        '--data_source', required=True,
        help="The URI for your CSV restaurant data, like an S3 bucket location.")
    parser.add_argument(
        '--output_uri', required=True,
        help="The URI where output is saved, like an S3 bucket location.")
    args = parser.parse_args()

    calculate_red_violations(args.data_source, args.output_uri)

2. Save the ﬁle as health_violations.py.

3. Upload health_violations.py to Amazon S3 into the bucket you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide.
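
To check what the query in health_violations.py computes before you run it on a cluster, you can reproduce the same grouping locally on a few rows. The following sketch is not part of the tutorial script. It uses only the Python standard library, a hypothetical helper named top_red_violations, and made-up sample rows in the same column layout as the dataset.

```python
import csv
import io
from collections import Counter


def top_red_violations(csv_text: str, limit: int = 10):
    """Count rows whose violation_type is RED, grouped by name, highest count first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["name"] for row in reader if row["violation_type"] == "RED")
    return counts.most_common(limit)


# Hypothetical rows for illustration only; these are not taken from the real dataset.
SAMPLE = """name,inspection_result,inspection_closed_business,violation_type,violation_points
CAFE A,Unsatisfactory,FALSE,RED,10
CAFE A,Unsatisfactory,FALSE,RED,5
CAFE B,Unsatisfactory,FALSE,BLUE,5
CAFE B,Unsatisfactory,FALSE,RED,25
CAFE C,Complete,FALSE,,0
"""

print(top_red_violations(SAMPLE))  # [('CAFE A', 2), ('CAFE B', 1)]
```

This mirrors the WHERE, GROUP BY, ORDER BY, and LIMIT clauses of the SQL statement. The order of establishments with equal counts may differ between this sketch and Spark.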

To prepare the sample input data for EMR

1. Download the zip ﬁle, food_establishment_data.zip.

2. Unzip food_establishment_data.zip and save the extracted file as food_establishment_data.csv on your machine.

3. Upload the CSV ﬁle to the S3 bucket that you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide.

For more information about setting up data for EMR, see Prepare input data (p. 140).
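
The two uploads described above can also be scripted. The following is a minimal sketch that assumes Boto3 is installed and that both files are in your working directory. The helper name upload_tutorial_files is hypothetical. It uses each file name as the object key so that the resulting URIs match the examples in this tutorial.

```python
from pathlib import Path


def upload_tutorial_files(s3_client, bucket: str, local_paths):
    """Upload each local file to the top level of the bucket and return the S3 URIs."""
    uris = []
    for path in map(Path, local_paths):
        # upload_file(Filename, Bucket, Key) is the Boto3 S3 client call.
        s3_client.upload_file(str(path), bucket, path.name)
        uris.append(f"s3://{bucket}/{path.name}")
    return uris


# Assumed usage (requires boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   upload_tutorial_files(
#       boto3.client("s3"),
#       "doc-example-bucket",
#       ["health_violations.py", "food_establishment_data.csv"],
#   )
```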

Launch an Amazon EMR cluster

After you prepare a storage location and your application, you can launch a sample Amazon EMR cluster.

In this step, you launch an Apache Spark cluster using the latest Amazon EMR release version.

Console

To launch a cluster with Spark installed using Quick Options

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Create cluster to open the Quick Options wizard.

3. Note the default values for Release, Instance type, Number of instances, and Permissions on the Create Cluster - Quick Options page. These fields autofill with values that work for general-purpose clusters. For more information about the Quick Options configuration settings, see Summary of Quick Options (p. 132).

4. Enter a Cluster name to help you identify the cluster. For example, My First EMR Cluster.

5. Leave Logging enabled, but replace the S3 folder value with the Amazon S3 bucket you created, followed by /logs. For example, s3://DOC-EXAMPLE-BUCKET/logs. Adding /logs creates a new folder called 'logs' in your bucket, where EMR can copy the log files of your cluster.

6. Choose the Spark option under Applications to install Spark on your cluster.

Note

Choose the applications you want on your Amazon EMR cluster before you launch the cluster. You can't add or remove applications from a cluster after launch.

7. Choose your EC2 key pair under Security and access.

8. Choose Create cluster to launch the cluster and open the cluster status page.

9. Find the cluster Status next to the cluster name. The status changes from Starting to Running to Waiting as Amazon EMR provisions the cluster. You may need to choose the refresh icon on the right or refresh your browser to see status updates.

Your cluster status changes to Waiting when the cluster is up, running, and ready to accept work. For more information about reading the cluster summary, see View cluster status and details (p. 436).

For information about cluster status, see Understanding the cluster lifecycle (p. 3).

CLI

To launch a cluster with Spark installed using the AWS CLI

1. Create a Spark cluster with the following command. Enter a name for your cluster with the --name option, and specify the name of your EC2 key pair with the --ec2-attributes option.

aws emr create-cluster \
--name "<My First EMR Cluster>" \
--release-label <emr-5.34.0> \
--applications Name=Spark \
--ec2-attributes KeyName=<myEMRKeyPairName> \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles

Note the other required values for --instance-type, --instance-count, and --use-default-roles. These values have been chosen for general-purpose clusters. For more information about create-cluster, see the AWS CLI reference.

Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

You should see output like the following. The output shows the ClusterId and ClusterArn of your new cluster. Note your ClusterId. You use the ClusterId to check on the cluster status and to submit work.

{
    "ClusterId": "myClusterId",
    "ClusterArn": "myClusterArn"
}

2. Check your cluster status with the following command.

aws emr describe-cluster --cluster-id <myClusterId>

You should see output like the following with the Status object for your new cluster.

{
    "Cluster": {
        "Id": "myClusterId",
        "Name": "My First EMR Cluster",
        "Status": {
            "State": "STARTING",
            "StateChangeReason": {
                "Message": "Configuring cluster software"
            }
        }
    }
}

The State value changes from STARTING to RUNNING to WAITING as Amazon EMR provisions the cluster.

Cluster status changes to WAITING when a cluster is up, running, and ready to accept work. For information about cluster status, see Understanding the cluster lifecycle (p. 3).
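
The launch and status check can also be done from the AWS SDK for Python (Boto3). The following is a sketch under stated assumptions, not the tutorial's method: the helper names are hypothetical, the role names are the ones that --use-default-roles refers to, and the same instance type is used for every node to mirror --instance-type. The EMR client is passed in so the helpers do not depend on how you configure credentials.

```python
import time


def build_cluster_request(name, key_name, release_label="emr-5.34.0",
                          instance_type="m5.xlarge", instance_count=3):
    """Build run_job_flow keyword arguments that mirror the create-cluster command."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,
            # Keep the cluster alive with no steps so that it reaches WAITING.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def wait_for_cluster_state(emr_client, cluster_id, target="WAITING",
                           poll_seconds=30, max_polls=60):
    """Poll describe_cluster until the cluster reaches the target or a terminal state."""
    terminal_states = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == target or state in terminal_states:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} did not reach {target}")


# Assumed usage (requires boto3 and AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-west-2")
#   request = build_cluster_request("My First EMR Cluster", "myEMRKeyPairName")
#   cluster_id = emr.run_job_flow(**request)["JobFlowId"]
#   print(wait_for_cluster_state(emr, cluster_id))
```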

Step 2: Manage your Amazon EMR cluster

Submit work to Amazon EMR

After you launch a cluster, you can submit work to the running cluster to process and analyze data.

You submit work to an Amazon EMR cluster as a step. A step is a unit of work made up of one or more actions. For example, you might submit a step to compute values, or to transfer and process data. You can submit steps when you create a cluster, or to a running cluster. In this part of the tutorial, you submit health_violations.py as a step to your running cluster. To learn more about steps, see Submit work to a cluster (p. 526).

Console

To submit a Spark application as a step using the console

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Select the name of your cluster from the Cluster List. The cluster state must be Waiting.

3. Choose Steps, and then choose Add step.

4. Conﬁgure the step according to the following guidelines:

• For Step type, choose Spark application. You should see additional ﬁelds for Deploy Mode, Spark-submit options, and Application location appear.

• For Name, leave the default value or type a new name. If you have many steps in a cluster, naming each step helps you keep track of them.

• For Deploy mode, leave the default value Cluster. For more information about Spark deployment modes, see Cluster mode overview in the Apache Spark documentation.

• Leave the Spark-submit options ﬁeld blank. For more information about spark-submit options, see Launching applications with spark-submit.

• For Application location, enter the location of your health_violations.py script in Amazon S3. For example, s3://DOC-EXAMPLE-BUCKET/health_violations.py.

• In the Arguments ﬁeld, enter the following arguments and values:

--data_source s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv --output_uri s3://DOC-EXAMPLE-BUCKET/myOutputFolder

Replace s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv with the S3 URI of the input data you prepared in Prepare an application with input data for Amazon EMR (p. 14).

Replace DOC-EXAMPLE-BUCKET with the name of the bucket you created for this tutorial, and myOutputFolder with a name for your cluster output folder.

• For Action on failure, accept the default option Continue so that if the step fails, the cluster continues to run.

5. Choose Add to submit the step. The step should appear in the console with a status of Pending.

6. Check for the step status to change from Pending to Running to Completed. To refresh the status in the console, choose the refresh icon to the right of the Filter. The script takes about one minute to run.

You will know that the step ﬁnished successfully when the status changes to Completed.

CLI

To submit a Spark application as a step using the AWS CLI

1. Make sure you have the ClusterId of the cluster you launched in Launch an Amazon EMR cluster (p. 16). You can also retrieve your cluster ID with the following command.

aws emr list-clusters --cluster-states WAITING

2. Submit health_violations.py as a step with the add-steps command and your ClusterId.

• You can specify a name for your step by replacing "My Spark Application". In the Args array, replace s3://DOC-EXAMPLE-BUCKET/health_violations.py with the location of your health_violations.py application.

• Replace s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv with the S3 location of your food_establishment_data.csv dataset.

• Replace s3://DOC-EXAMPLE-BUCKET/MyOutputFolder with the S3 path of your designated bucket and a name for your cluster output folder.

• ActionOnFailure=CONTINUE means the cluster continues to run if the step fails.

aws emr add-steps \
--cluster-id <myClusterId> \
--steps Type=Spark,Name="<My Spark Application>",ActionOnFailure=CONTINUE,Args=[<s3://DOC-EXAMPLE-BUCKET/health_violations.py>,--data_source,<s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv>,--output_uri,<s3://DOC-EXAMPLE-BUCKET/MyOutputFolder>]

For more information about submitting steps using the CLI, see the AWS CLI Command Reference.

After you submit the step, you should see output like the following with a list of StepIds. Since you submitted one step, you will see just one ID in the list. Copy your step ID. You use your step ID to check the status of the step.

{
    "StepIds": [
        "s-1XXXXXXXXXXA"
    ]
}

3. Query the status of your step with the describe-step command.

aws emr describe-step --cluster-id <myClusterId> --step-id <s-1XXXXXXXXXXA>

You should see output like the following with information about your step.

{
    "Step": {
        "Id": "s-1XXXXXXXXXXA",
        "Name": "My Spark Application",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
                "--data_source",
                "s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv",
                "--output_uri",
                "s3://DOC-EXAMPLE-BUCKET/myOutputFolder"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "COMPLETED"
        }
    }
}

The State of the step changes from PENDING to RUNNING to COMPLETED as the step runs. The step takes about one minute to run, so you might need to check the status a few times.

You will know that the step was successful when the State changes to COMPLETED.

For more information about the step lifecycle, see Running steps to process data (p. 3).
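
The same step can be submitted with Boto3. The following sketch builds a step definition whose Jar and Args match the describe-step output shown above, then submits it and reads back its state. The helper names build_spark_step, submit_step, and get_step_state are hypothetical, and the EMR client is supplied by the caller.

```python
def build_spark_step(name, script_uri, data_source, output_uri):
    """Build a step definition that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        # CONTINUE keeps the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", script_uri,
                "--data_source", data_source,
                "--output_uri", output_uri,
            ],
        },
    }


def submit_step(emr_client, cluster_id, step):
    """Submit one step to a running cluster and return its step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


def get_step_state(emr_client, cluster_id, step_id):
    """Return the step state, such as PENDING, RUNNING, or COMPLETED."""
    response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
    return response["Step"]["Status"]["State"]


# Assumed usage (requires boto3 and AWS credentials; URIs are placeholders):
#   import boto3
#   emr = boto3.client("emr")
#   step = build_spark_step(
#       "My Spark Application",
#       "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
#       "s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv",
#       "s3://DOC-EXAMPLE-BUCKET/myOutputFolder",
#   )
#   step_id = submit_step(emr, "<myClusterId>", step)
```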

View results

After a step runs successfully, you can view its output results in your Amazon S3 output folder.

To view the results of health_violations.py

1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.

2. Choose the Bucket name and then the output folder that you speciﬁed when you submitted the step. For example, DOC-EXAMPLE-BUCKET and then myOutputFolder.

3. Verify that the following items appear in your output folder:

• A small-sized object called _SUCCESS.

• A CSV ﬁle starting with the preﬁx part- that contains your results.

4. Choose the object with your results, then choose Download to save the results to your local ﬁle system.

5. Open the results in your editor of choice. The output ﬁle lists the top ten food establishments with the most red violations. The output ﬁle also shows the total number of red violations for each establishment.

The following is an example of health_violations.py results.

name, total_red_violations
SUBWAY, 322
T-MOBILE PARK, 315
WHOLE FOODS MARKET, 299
PCC COMMUNITY MARKETS, 251
TACO TIME, 240
MCDONALD'S, 177
THAI GINGER, 153
SAFEWAY INC #1508, 143
TAQUERIA EL RINCONSITO, 134
HIMITSU TERIYAKI, 128

For more information about Amazon EMR cluster output, see Conﬁgure an output location (p. 149).
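
If you want to read the results programmatically instead of downloading them from the console, the following sketch shows one way to do it with Boto3. It checks for the _SUCCESS object, selects the objects whose names start with part-, and parses them as CSV. The helper names are hypothetical, and the sketch assumes the output folder holds fewer objects than a single list request returns.

```python
import csv
import io


def pick_result_keys(keys, output_prefix):
    """Return the part- object keys under the output folder, or raise if _SUCCESS is missing."""
    prefix = output_prefix.rstrip("/") + "/"
    names = [key[len(prefix):] for key in keys if key.startswith(prefix)]
    if "_SUCCESS" not in names:
        raise ValueError("No _SUCCESS object found; the step may not have completed.")
    return sorted(prefix + name for name in names if name.startswith("part-"))


def parse_results(csv_text):
    """Parse one results file into (name, total_red_violations) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["name"], int(row["total_red_violations"])) for row in reader]


def read_results(s3_client, bucket, output_prefix):
    """List the output folder, then download and parse every part- object."""
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=output_prefix.rstrip("/") + "/")
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    rows = []
    for key in pick_result_keys(keys, output_prefix):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows.extend(parse_results(body.decode("utf-8")))
    return rows


# Assumed usage (requires boto3 and AWS credentials):
#   import boto3
#   print(read_results(boto3.client("s3"), "doc-example-bucket", "myOutputFolder"))
```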

(Optional) Connect to your running Amazon EMR cluster

When you use Amazon EMR, you may want to connect to a running cluster to read log files, debug the cluster, or use CLI tools like the Spark shell. Amazon EMR lets you connect to a cluster using the Secure Shell (SSH) protocol. This section covers how to configure SSH, connect to your cluster, and view log files for Spark. For more information about connecting to a cluster, see Authenticate to Amazon EMR cluster nodes (p. 337).

Authorize SSH connections to your cluster

Before you connect to your cluster, you need to modify your cluster security groups to authorize inbound SSH connections. Amazon EC2 security groups act as virtual ﬁrewalls to control inbound and outbound traﬃc to your cluster. When you created your cluster for this tutorial, Amazon EMR created the following security groups on your behalf:

ElasticMapReduce-master

The default Amazon EMR managed security group associated with the master node. In an Amazon EMR cluster, the master node is an Amazon EC2 instance that manages the cluster.

ElasticMapReduce-slave

The default security group associated with core and task nodes.

To allow SSH access for trusted sources for the ElasticMapReduce-master security group

To edit your security groups, you must have permission to manage security groups for the VPC that the cluster is in. For more information, see Changing Permissions for an IAM User and the Example Policy that allows managing EC2 security groups in the IAM User Guide.

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Clusters.

3. Choose the Name of the cluster you want to modify.

4. Choose the Security groups for Master link under Security and access.

5. Choose ElasticMapReduce-master from the list.

6. Choose the Inbound rules tab and then Edit inbound rules.

7. Check for an inbound rule that allows public access with the following settings. If it exists, choose Delete to remove it.

• Type: SSH

• Port: 22

• Source: Custom 0.0.0.0/0

Warning

Before December 2020, the ElasticMapReduce-master security group had a pre-configured rule to allow inbound traffic on Port 22 from all sources. This rule was created to simplify initial SSH connections to the master node. We strongly recommend that you remove this inbound rule and restrict traffic to trusted sources.

8. Scroll to the bottom of the list of rules and choose Add Rule.

9. For Type, select SSH.

Selecting SSH automatically enters TCP for Protocol and 22 for Port Range.

10. For Source, select My IP to automatically add your IP address as the source address. You can also add a range of Custom trusted client IP addresses, or create additional rules for other clients. Many network environments dynamically allocate IP addresses, so you might need to update your IP addresses for trusted clients in the future.

11. Choose Save.

12. Optionally, choose ElasticMapReduce-slave from the list and repeat the steps above to allow SSH client access to core and task nodes.
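
The inbound rule from the procedure above can also be expressed in code. The following sketch builds an SSH rule for a trusted IPv4 source and refuses a source of 0.0.0.0/0, in line with the warning above. The helper name build_ssh_ingress_rule is hypothetical, and 203.0.113.10 is a documentation address used as a placeholder for your own IP address.

```python
import ipaddress


def build_ssh_ingress_rule(source_cidr, description="Trusted SSH client"):
    """Build an inbound rule that allows TCP port 22 from one trusted IPv4 range."""
    network = ipaddress.IPv4Network(source_cidr, strict=False)
    if network.prefixlen == 0:
        raise ValueError("Refusing to allow SSH from all sources (0.0.0.0/0).")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": str(network), "Description": description}],
    }


print(build_ssh_ingress_rule("203.0.113.10/32")["IpRanges"][0]["CidrIp"])  # 203.0.113.10/32

# Assumed usage (requires boto3, AWS credentials, and the security group ID
# of ElasticMapReduce-master, shown here as a placeholder):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<sg-id>",
#       IpPermissions=[build_ssh_ingress_rule("203.0.113.10/32")],
#   )
```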

Connect to your cluster using the AWS CLI

Regardless of your operating system, you can create an SSH connection to your cluster using the AWS CLI.

To connect to your cluster and view log files using the AWS CLI

1. Use the following command to open an SSH connection to your cluster. Replace <mykeypair.key> with the full path and file name of your key pair file. For example, C:\Users\<username>\.ssh\mykeypair.pem.

aws emr ssh --cluster-id <j-2AL4XXXXXX5T9> --key-pair-file <~/mykeypair.key>

2. Navigate to /mnt/var/log/spark to access the Spark logs on your cluster's master node. Then view the files in that location. For a list of additional log files on the master node, see View log files on the master node (p. 449).

cd /mnt/var/log/spark
ls

Step 3: Clean up your Amazon EMR resources

Terminate your cluster

Now that you've submitted work to your cluster and viewed the results of your PySpark application, you can terminate the cluster. Terminating a cluster stops all of the cluster's associated Amazon EMR charges and Amazon EC2 instances.

When you terminate a cluster, Amazon EMR retains metadata about the cluster for two months at no charge. Archived metadata helps you clone the cluster (p. 525) for a new job or revisit the cluster conﬁguration for reference purposes. Metadata does not include data that the cluster writes to S3, or data stored in HDFS on the cluster.

Note

The Amazon EMR console does not let you delete a cluster from the list view after you terminate the cluster. A terminated cluster disappears from the console when Amazon EMR clears its metadata.

Console

To terminate the cluster using the console

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Clusters, then choose the cluster you want to terminate. For example, My First EMR Cluster.

3. Choose Terminate to open the Terminate cluster prompt.

4. Choose Terminate in the open prompt. Depending on the cluster conﬁguration, termination may take 5 to 10 minutes. For more information about terminating Amazon EMR clusters, see Terminate a cluster (p. 490).

Note

If you followed the tutorial closely, termination protection should be oﬀ. Cluster termination protection prevents accidental termination. If termination protection is on, you will see a prompt to change the setting before terminating the cluster. Choose Change, then Oﬀ.

CLI

To terminate the cluster using the AWS CLI

1. Initiate the cluster termination process with the following command. Replace <myClusterId> with the ID of your sample cluster. The command does not return output.

aws emr terminate-clusters --cluster-ids <myClusterId>

2. To check that the cluster termination process is in progress, check the cluster status with the following command.

aws emr describe-cluster --cluster-id <myClusterId>

Following is example output in JSON format. The cluster Status should change from TERMINATING to TERMINATED. Termination may take 5 to 10 minutes depending on your cluster conﬁguration. For more information about terminating an Amazon EMR cluster, see Terminate a cluster (p. 490).

{
    "Cluster": {
        "Id": "j-xxxxxxxxxxxxx",
        "Name": "My Cluster Name",
        "Status": {
            "State": "TERMINATED",
            "StateChangeReason": {
                "Code": "USER_REQUEST",
                "Message": "Terminated by user request"
            }
        }
    }
}
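
Termination and the follow-up status check can be combined in a short Boto3 script. The following is a sketch with a hypothetical helper name, terminate_and_wait. The polling interval and retry count are arbitrary values chosen to cover the 5 to 10 minutes that termination may take.

```python
import time


def terminate_and_wait(emr_client, cluster_id, poll_seconds=30, max_polls=40):
    """Request termination, then poll until the cluster reports a terminated state."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} is still terminating")


# Assumed usage (requires boto3 and AWS credentials):
#   import boto3
#   print(terminate_and_wait(boto3.client("emr"), "<myClusterId>"))
```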

Delete S3 resources

To avoid additional charges, you should delete your Amazon S3 bucket. Deleting the bucket removes all of the Amazon S3 resources for this tutorial. Your bucket should contain:

• The PySpark script

• The input dataset

• Your output results folder

• Your log ﬁles folder

You might need to take extra steps to delete stored ﬁles if you saved your PySpark script or output in a diﬀerent location.

Note

Your cluster must be terminated before you delete your bucket. Otherwise, you may not be allowed to empty the bucket.

To delete your bucket, follow the instructions in How do I delete an S3 bucket? in the Amazon Simple Storage Service User Guide.
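
A bucket must be empty before it can be deleted. The following sketch shows one way to empty and delete the tutorial bucket with Boto3 after the cluster is terminated. The helper name empty_and_delete_bucket is hypothetical, and the sketch assumes that versioning is not enabled on the bucket. The deletion is permanent, so confirm the bucket name before you run it.

```python
def empty_and_delete_bucket(s3_client, bucket):
    """Delete every object in the bucket, then the bucket itself. Returns the object count."""
    deleted = 0
    list_kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**list_kwargs)
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            # Each list page holds at most as many keys as one delete request accepts.
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects})
            deleted += len(objects)
        if not page.get("IsTruncated"):
            break
        list_kwargs["ContinuationToken"] = page["NextContinuationToken"]
    s3_client.delete_bucket(Bucket=bucket)
    return deleted


# Assumed usage (requires boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   empty_and_delete_bucket(boto3.client("s3"), "doc-example-bucket")
```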

Next steps

You have now launched your ﬁrst Amazon EMR cluster from start to ﬁnish. You have also completed essential EMR tasks like preparing and submitting big data applications, viewing results, and terminating a cluster.

Use the following topics to learn more about how you can customize your Amazon EMR workﬂow.

Explore big data applications for Amazon EMR

Discover and compare the big data applications you can install on a cluster in the Amazon EMR Release Guide. The Release Guide details each EMR release version and includes tips for using frameworks such as Spark and Hadoop on Amazon EMR.

Plan cluster hardware, networking, and security

In this tutorial, you created a simple EMR cluster without conﬁguring advanced options. Advanced options let you specify Amazon EC2 instance types, cluster networking, and cluster security. For more information about planning and launching a cluster that meets your requirements, see Plan and conﬁgure clusters (p. 131) and Security in Amazon EMR (p. 251).

Manage clusters

Dive deeper into working with running clusters in Manage clusters (p. 436). To manage a cluster, you can connect to the cluster, debug steps, and track cluster activities and health. You can also adjust cluster resources in response to workload demands with EMR managed scaling (p. 493).

Use a diﬀerent interface

In addition to the Amazon EMR console, you can manage Amazon EMR using the AWS Command Line Interface, the web service API, or one of the many supported AWS SDKs. For more information, see Management interfaces (p. 8).

You can also interact with applications installed on Amazon EMR clusters in many ways. Some applications like Apache Hadoop publish web interfaces that you can view. For more information, see View web interfaces hosted on Amazon EMR clusters (p. 482).

Browse the EMR technical blog

For sample walkthroughs and in-depth technical discussion of new Amazon EMR features, see the AWS big data blog.
