Academic year: 2022

Amazon EMR

Management Guide

Amazon EMR: Management Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

What is Amazon EMR? ... 1

Overview ... 1

Understanding clusters and nodes ... 1

Submitting work to a cluster ... 2

Processing data ... 2

Understanding the cluster lifecycle ... 3

Benefits ... 5

Cost savings ... 5

AWS integration ... 6

Deployment ... 6

Scalability and flexibility ... 6

Reliability ... 7

Security ... 7

Monitoring ... 8

Management interfaces ... 8

Architecture ... 9

Storage ... 9

Cluster resource management ... 10

Data processing frameworks ... 10

Applications and programs ... 11

Setting up Amazon EMR ... 12

Sign up for AWS ... 12

Create an Amazon EC2 key pair for SSH ... 12

Next steps ... 12

Getting started tutorial ... 13

Overview ... 13

Step 1: Plan and configure ... 14

Prepare storage for Amazon EMR ... 14

Prepare an application with input data for Amazon EMR ... 14

Launch an Amazon EMR cluster ... 16

Step 2: Manage ... 17

Submit work to Amazon EMR ... 17

View results ... 20

Step 3: Clean up ... 22

Terminate your cluster ... 22

Delete S3 resources ... 23

Next steps ... 24

Explore big data applications for Amazon EMR ... 24

Plan cluster hardware, networking, and security ... 24

Manage clusters ... 24

Use a different interface ... 24

Browse the EMR technical blog ... 24

Amazon EMR Studio ... 25

Key features of EMR Studio ... 25

How Amazon EMR Studio works ... 25

Authentication and user login ... 26

Access control ... 28

Workspaces ... 29

Notebook storage ... 29

Considerations and limitations ... 29

Known issues ... 30

Feature limitations ... 31

Service limits ... 31

VPC and subnet best practices ... 32

Cluster requirements ... 32

Configure EMR Studio ... 33

Administrator permissions to create an EMR Studio ... 34

Set up an EMR Studio ... 37

Manage a Studio ... 66

Control EMR Studio network traffic ... 70

Set up Amazon EMR on EKS for EMR Studio ... 72

Create cluster templates ... 76

Access and permissions for Git-based repositories ... 81

Optimize Spark jobs ... 82

Use an Amazon EMR Studio ... 83

Workspace basics ... 84

Configure Workspace collaboration ... 88

Browse data with SQL Explorer ... 90

Attach a cluster to a Workspace ... 90

Link Git repositories ... 93

Debug applications and jobs ... 95

Install kernels and libraries ... 98

Enhance kernels with magic commands ... 99

Use multi-language notebooks with Spark kernels ... 103

EMR Notebooks ... 105

Considerations ... 105

Cluster requirements ... 105

Differences in capabilities by cluster release version ... 106

Limits for concurrently attached EMR Notebooks ... 107

Jupyter Notebook and Python versions ... 107

Creating a Notebook ... 107

Working with EMR Notebooks ... 109

Understanding Notebook status ... 109

Working with the Notebook editor ... 110

Changing clusters ... 111

Deleting Notebooks and Notebook files ... 111

Sharing Notebook files ... 112

Executing EMR Notebooks programmatically ... 112

CLI command samples ... 113

Boto3 SDK sample script ... 116

Ruby sample script ... 118

User impersonation for Spark ... 119

Setting up Spark user impersonation ... 120

Using the Spark job monitoring widget ... 120

Security ... 121

Installing and using kernels and libraries ... 122

Installing kernels and Python libraries on a cluster master node ... 122

Using Notebook-scoped libraries ... 123

Working with Notebook-scoped libraries ... 123

Associating Git-based repositories with EMR Notebooks ... 124

Prerequisites and considerations ... 125

Add a Git-based repository to Amazon EMR ... 126

Update or delete a Git-based repository ... 128

Link or unlink a Git-based repository ... 128

Create a new Notebook with an associated Git repository ... 129

Use Git repositories in a Notebook ... 130

Plan and configure clusters ... 131

Launch a cluster with Quick Options ... 131

Summary of Quick Options ... 132

Configure cluster location and data storage ... 136

Choose an AWS Region ... 137

Work with storage and file systems ... 138

Prepare input data ... 140

Configure an output location ... 149

Plan and configure master nodes ... 153

Supported applications and features ... 153

Launch an Amazon EMR Cluster with multiple master nodes ... 158

Amazon EMR integration with EC2 placement groups ... 160

Considerations and best practices ... 164

EMR clusters on AWS Outposts ... 165

Prerequisites ... 165

Limitations ... 165

Network connectivity considerations ... 166

Creating an Amazon EMR cluster on AWS Outposts ... 166

EMR clusters on AWS Local Zones ... 167

Supported instance types ... 167

Creating an Amazon EMR cluster on Local Zones ... 168

Configure Docker ... 168

Docker registries ... 169

Configuring Docker registries ... 169

Configuring YARN to access Amazon ECR on EMR 6.0.0 and earlier ... 170

Control cluster termination ... 172

Configuring a cluster to continue or terminate after step execution ... 172

Using an auto-termination policy ... 174

Using termination protection ... 177

Working with AMIs ... 181

Using the default AMI ... 182

Using a custom AMI ... 183

Specifying the Amazon EBS root device volume size ... 191

Configure cluster software ... 192

Create bootstrap actions to install additional software ... 193

Configure cluster hardware and networking ... 197

Understand node types ... 197

Configure Amazon EC2 instances ... 199

Configure networking ... 205

Configure instance fleets or instance groups ... 214

Configure cluster logging and debugging ... 241

Default log files ... 241

Archive log files to Amazon S3 ... 242

Enable the debugging tool ... 244

Debugging option information ... 245

Tag clusters ... 245

Tag restrictions ... 246

Tag resources for billing ... 247

Add tags to a new cluster ... 247

Adding tags to an existing cluster ... 248

View tags on a cluster ... 248

Remove tags from a cluster ... 249

Drivers and third-party application integration ... 249

Use business intelligence tools with Amazon EMR ... 250

Security ... 251

Security configurations ... 251

Data protection ... 251

AWS Identity and Access Management with Amazon EMR ... 251

Kerberos ... 252

Lake Formation ... 252

Secure Socket Shell (SSH) ... 252

Amazon EC2 security groups ... 252

Default Amazon Linux AMI updates ... 252

Use security configurations to set up cluster security ... 253

Create a security configuration ... 253

Specify a security configuration for a cluster ... 271

Data protection ... 271

Encrypt data at rest and in transit ... 272

IAM with Amazon EMR ... 280

Audience ... 281

Authenticating with identities ... 281

Managing access using policies ... 283

How Amazon EMR works with IAM ... 284

Configure service roles for Amazon EMR ... 286

Identity-based policy examples ... 316

Authenticate to cluster nodes ... 337

Use an Amazon EC2 key pair for SSH credentials ... 338

Use Kerberos authentication ... 338

Integrate Amazon EMR with AWS Lake Formation ... 364

Overview ... 364

Applications, features, and limitations ... 370

Before you begin ... 371

Launch an Amazon EMR cluster with Lake Formation ... 380

Integrate Amazon EMR with Apache Ranger ... 386

Ranger overview ... 386

Application support and limitations ... 389

Set up Amazon EMR for Apache Ranger ... 390

Apache Ranger plugins ... 402

Apache Ranger troubleshooting ... 417

Control network traffic with security groups ... 419

Working with Amazon EMR-managed security groups ... 420

Working with additional security groups ... 426

Specifying security groups ... 426

Security groups for EMR Notebooks ... 428

Using block public access ... 429

Compliance validation ... 432

Resilience ... 432

Infrastructure security ... 432

Connect to Amazon EMR using an interface VPC endpoint ... 433

Manage clusters ... 436

View and monitor a cluster ... 436

View cluster status and details ... 436

Enhanced step debugging ... 442

View application history ... 443

View log files ... 449

View cluster instances in Amazon EC2 ... 453

CloudWatch events and metrics ... 453

View cluster application metrics with Ganglia ... 474

Logging Amazon EMR API calls in AWS CloudTrail ... 474

Connect to the cluster ... 476

Before you connect ... 476

Connect to the master node using SSH ... 477

View web interfaces hosted on Amazon EMR clusters ... 482

Terminate a cluster ... 490

Terminate a cluster using the console ... 491

Terminate a cluster using the AWS CLI ... 491

Terminate a cluster using the API ... 492

Scaling cluster resources ... 492

Using EMR managed scaling in Amazon EMR ... 493

Using automatic scaling with a custom policy for instance groups ... 509

Manually resizing a running cluster ... 518

Cluster scale-down ... 524

Cloning a cluster using the console ... 525

Submit work to a cluster ... 526

Work with steps using the AWS CLI and console ... 526

Submit Hadoop jobs interactively ... 531

Add more than 256 steps to a cluster ... 532

Automate recurring clusters with AWS Data Pipeline ... 533

Troubleshoot a cluster ... 534

What tools are available for troubleshooting? ... 534

Tools to display cluster details ... 534

Tools to run scripts and configure processes ... 535

Tools to view log files ... 535

Tools to monitor cluster performance ... 535

Viewing and restarting Amazon EMR and application processes (daemons) ... 536

Viewing running processes ... 536

Stopping and restarting processes ... 537

Troubleshoot a failed cluster ... 539

Step 1: Gather data about the issue ... 540

Step 2: Check the environment ... 540

Step 3: Look at the last state change ... 541

Step 4: Examine the log files ... 541

Step 5: Test the cluster step by step ... 542

Troubleshoot a slow cluster ... 543

Step 1: Gather data about the issue ... 543

Step 2: Check the environment ... 544

Step 3: Examine the log files ... 545

Step 4: Check cluster and instance health ... 546

Step 5: Check for suspended groups ... 547

Step 6: Review configuration settings ... 547

Step 7: Examine input data ... 549

Common errors in Amazon EMR ... 549

Input and output errors ... 549

Permissions errors ... 551

Resource errors ... 552

Streaming cluster errors ... 559

Custom JAR cluster errors ... 560

Hive cluster errors ... 560

VPC errors ... 561

AWS GovCloud (US-West) errors ... 564

Other issues ... 564

Troubleshoot a Lake Formation cluster ... 564

Data lake access not allowed ... 564

Session expiration ... 565

No permissions for user on requested table ... 565

Inserting into, creating and altering tables: Unsupported in beta ... 565

Write applications that launch and manage clusters ... 566

End-to-end Amazon EMR Java source code sample ... 566

Common concepts for API calls ... 568

Endpoints for Amazon EMR ... 569

Specifying cluster parameters in Amazon EMR ... 569

Availability Zones in Amazon EMR ... 569

How to use additional files and libraries in Amazon EMR clusters ... 570

Use SDKs to call Amazon EMR APIs ... 570

Using the AWS SDK for Java to create an Amazon EMR cluster ... 570

Manage Amazon EMR Service Quotas ... 572

What are Amazon EMR Service Quotas ... 572

How to manage Amazon EMR Service Quotas ... 573

When to set up EMR events in CloudWatch ... 573

AWS glossary ... 576

What is Amazon EMR?

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

If you are a first-time user of Amazon EMR, we recommend that you begin by reading the following, in addition to this section:

• Amazon EMR – This service page provides Amazon EMR highlights, product details, and pricing information.

• Tutorial: Getting started with Amazon EMR (p. 13) – This tutorial gets you started using Amazon EMR quickly.

In This Section

• Overview of Amazon EMR (p. 1)

• Benefits of using Amazon EMR (p. 5)

• Overview of Amazon EMR architecture (p. 9)

Overview of Amazon EMR

This topic provides an overview of Amazon EMR clusters, including how to submit work to a cluster, how the cluster processes data, and the states that the cluster goes through during processing.

In This Topic

• Understanding clusters and nodes (p. 1)

• Submitting work to a cluster (p. 2)

• Processing data (p. 2)

• Understanding the cluster lifecycle (p. 3)

Understanding clusters and nodes

The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.

The node types in Amazon EMR are as follows:

Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
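
As a rough illustration of the three node types, the following sketch describes a cluster as a list of instance groups and checks the rules stated above. The dictionary keys follow the InstanceGroups shape of the Amazon EMR RunJobFlow API as commonly documented; the instance type, the counts, and the validate_node_layout helper are hypothetical and exist only for this example.

```python
# Hypothetical cluster layout: one master node, four core nodes, no task nodes.
INSTANCE_GROUPS = [
    {"Name": "Master", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",
     "InstanceType": "m5.xlarge", "InstanceCount": 4},
]


def validate_node_layout(groups):
    """Check the node-type rules described in this topic (illustrative helper)."""
    counts = {"MASTER": 0, "CORE": 0, "TASK": 0}
    for group in groups:
        counts[group["InstanceRole"]] += group["InstanceCount"]
    if counts["MASTER"] < 1:
        raise ValueError("every cluster has a master node")
    # A single-node cluster is only the master node; larger clusters need a core node.
    if sum(counts.values()) > 1 and counts["CORE"] < 1:
        raise ValueError("multi-node clusters have at least one core node")
    return counts


print(validate_node_layout(INSTANCE_GROUPS))
```

Task nodes never appear in the checks because they are optional; a layout with only a master node also passes, which corresponds to the single-node cluster mentioned above.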

The following diagram represents a cluster with one master node and four core nodes.

Submitting work to a cluster

When you run a cluster on Amazon EMR, you have several options as to how you specify the work that needs to be done.

• Provide the entire definition of the work to be done in functions that you specify as steps when you create a cluster. This is typically done for clusters that process a set amount of data and then terminate when processing is complete.

• Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more jobs. For more information, see Submit work to a cluster (p. 526).

• Create a cluster, connect to the master node and other nodes as required using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively. For more information, see the Amazon EMR Release Guide.
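
For the second option, a step submitted through the API is essentially a small structured description of a command. The following is a minimal sketch, assuming the step shape used by the AddJobFlowSteps API and the command-runner.jar launcher; the build_spark_step helper, the step name, the bucket, and the script path are hypothetical placeholders.

```python
def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Return a step definition in the shape used by the EMR AddJobFlowSteps API."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs the given command on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Placeholder bucket and script; replace with your own locations.
step = build_spark_step("Example step", "s3://DOC-EXAMPLE-BUCKET/scripts/job.py")
print(step)

# With boto3 installed and AWS credentials configured, the step could be
# submitted to a running cluster (the cluster ID below is a placeholder):
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same dictionary can be passed in the Steps list when the cluster is created, which corresponds to the first option above.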

Processing data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or queries directly to installed applications, or you can run steps in the cluster.

Submitting jobs directly to applications

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster. To do this, you typically connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see Connect to the cluster (p. 476).

Running steps to process data

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

The following is an example process using four steps:

1. Submit an input dataset for processing.

2. Process the output of the first step by using a Pig program.

3. Process a second input dataset by using a Hive program.

4. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

Steps are run in the following sequence:

1. A request is submitted to begin processing steps.

2. The state of all steps is set to PENDING.

3. When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.

4. After the first step completes, its state changes to COMPLETED.

5. The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.

6. This pattern repeats for each step until they all complete and processing ends.

The following diagram represents the step sequence and change of state for the steps as they are processed.

If a step fails during processing, its state changes to FAILED. You can determine what happens next for each step. By default, any remaining steps in the sequence are set to CANCELLED and do not run if a preceding step fails. You can also choose to ignore the failure and allow remaining steps to proceed, or to terminate the cluster immediately.
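
The state changes described above can be sketched as a small simulation. This is an illustrative model only, not Amazon EMR code: the run_steps function is hypothetical, and the option names CANCEL_AND_WAIT and CONTINUE are modeled on the ActionOnFailure values of the step API. The option that terminates the cluster is left out to keep the sketch short.

```python
def run_steps(outcomes, action_on_failure="CANCEL_AND_WAIT"):
    """Simulate the final state of each step in an ordered sequence.

    outcomes: list of booleans, True if that step would succeed.
    action_on_failure: "CANCEL_AND_WAIT" (the default behavior described
    above) or "CONTINUE" (ignore the failure and keep going).
    """
    # When the request is submitted, every step starts out as PENDING.
    states = ["PENDING"] * len(outcomes)
    for i, succeeds in enumerate(outcomes):
        # The step runs here (state RUNNING), then finishes one way or the other.
        states[i] = "COMPLETED" if succeeds else "FAILED"
        if not succeeds and action_on_failure != "CONTINUE":
            # Remaining steps are cancelled and do not run.
            for j in range(i + 1, len(outcomes)):
                states[j] = "CANCELLED"
            break
    return states


print(run_steps([True, False, True]))
print(run_steps([True, False, True], action_on_failure="CONTINUE"))
```

With the default setting, the third step ends as CANCELLED; with CONTINUE, it still runs after the second step fails.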

The following diagram represents the step sequence and default change of state when a step fails during processing.

Understanding the cluster lifecycle

A successful Amazon EMR cluster follows this process:

1. Amazon EMR first provisions the EC2 instances in the cluster according to your specifications. For more information, see Configure cluster hardware and networking (p. 197). For all instances, Amazon EMR uses the default AMI for Amazon EMR or a custom Amazon Linux AMI that you specify. For more information, see Using a custom AMI (p. 183). During this phase, the cluster state is STARTING.

2. Amazon EMR runs bootstrap actions that you specify on each instance. You can use bootstrap actions to install custom applications and perform customizations that you require. For more information, see Create bootstrap actions to install additional software (p. 193). During this phase, the cluster state is BOOTSTRAPPING.

3. Amazon EMR installs the native applications that you specify when you create the cluster, such as Hive, Hadoop, Spark, and so on.

4. After bootstrap actions are successfully completed and native applications are installed, the cluster state is RUNNING. At this point, you can connect to cluster instances, and the cluster sequentially runs any steps that you specified when you created the cluster. You can submit additional steps, which run after any previous steps complete. For more information, see Work with steps using the AWS CLI and console (p. 526).

5. After steps run successfully, the cluster goes into a WAITING state. If a cluster is configured to auto-terminate after the last step is complete, it goes into a TERMINATING state and then into the TERMINATED state. If the cluster is configured to wait, you must manually shut it down when you no longer need it. After you manually shut down the cluster, it goes into the TERMINATING state and then into the TERMINATED state.

A failure during the cluster lifecycle causes Amazon EMR to terminate the cluster and all of its instances unless you enable termination protection. If a cluster terminates because of a failure, any data stored on the cluster is deleted, and the cluster state is set to TERMINATED_WITH_ERRORS. If you enabled termination protection, you can retrieve data from your cluster, and then remove termination protection and terminate the cluster. For more information, see Using termination protection (p. 177).
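
The lifecycle above can be summarized as a sequence of states. The following sketch is a simplified model under stated assumptions: the cluster_states function is hypothetical, an auto-terminating cluster is assumed to skip the WAITING state, and any intermediate states on the failure path are omitted.

```python
def cluster_states(auto_terminate, failed_during=None):
    """Return the sequence of cluster states for a simplified lifecycle.

    auto_terminate: True if the cluster terminates after its last step.
    failed_during: name of the state in which a failure occurs, or None.
    Termination protection is not modeled here.
    """
    states = []
    for state in ("STARTING", "BOOTSTRAPPING", "RUNNING"):
        states.append(state)
        if state == failed_during:
            # Without termination protection, a failure ends the cluster.
            return states + ["TERMINATED_WITH_ERRORS"]
    if not auto_terminate:
        # Assumption: only a cluster configured to wait enters WAITING,
        # and it stays there until you shut it down manually.
        states.append("WAITING")
    return states + ["TERMINATING", "TERMINATED"]


print(cluster_states(auto_terminate=False))
print(cluster_states(auto_terminate=True, failed_during="BOOTSTRAPPING"))
```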

The following diagram represents the lifecycle of a cluster, and how each stage of the lifecycle maps to a particular cluster state.

Benefits of using Amazon EMR

There are many benefits to using Amazon EMR. This section provides an overview of these benefits and links to additional information to help you explore further.

Topics

• Cost savings (p. 5)

• AWS integration (p. 6)

• Deployment (p. 6)

• Scalability and flexibility (p. 6)

• Reliability (p. 7)

• Security (p. 7)

• Monitoring (p. 8)

• Management interfaces (p. 8)

Cost savings

Amazon EMR pricing depends on the instance type and number of Amazon EC2 instances that you deploy and the Region in which you launch your cluster. On-Demand pricing offers low rates, but you can reduce the cost even further by purchasing Reserved Instances or Spot Instances. Spot Instances can offer significant savings, as low as a tenth of On-Demand pricing in some cases.

Note

If you use Amazon S3, Amazon Kinesis, or DynamoDB with your EMR cluster, there are additional charges for those services that are billed separately from your Amazon EMR usage.

Note

When you set up an Amazon EMR cluster in a private subnet, we recommend that you also set up VPC endpoints for Amazon S3. If your EMR cluster is in a private subnet without VPC endpoints for Amazon S3, you will incur additional NAT gateway charges that are associated with S3 traffic because the traffic between your EMR cluster and S3 will not stay within your VPC.

For more information about pricing options and details, see Amazon EMR pricing.
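
To make the comparison between On-Demand and Spot Instances concrete, the following sketch works through the arithmetic. Every number in it is hypothetical: the hourly rate is not a real price, the Spot rate uses the best case of one tenth mentioned above, and charges for other services are ignored.

```python
ON_DEMAND_RATE = 0.20            # hypothetical USD per instance-hour, not a real price
SPOT_RATE = ON_DEMAND_RATE / 10  # best case from the text: a tenth of On-Demand


def cluster_cost(hours, on_demand_instances, spot_instances):
    """Instance cost of running a cluster for a number of hours."""
    hourly = on_demand_instances * ON_DEMAND_RATE + spot_instances * SPOT_RATE
    return hours * hourly


all_on_demand = cluster_cost(hours=10, on_demand_instances=10, spot_instances=0)
mixed = cluster_cost(hours=10, on_demand_instances=4, spot_instances=6)
print(f"All On-Demand: ${all_on_demand:.2f}, mixed: ${mixed:.2f}")
```

In this made-up scenario, moving six of the ten instances to Spot reduces the instance cost by a little more than half. Actual savings depend on current prices in your Region.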

AWS integration

Amazon EMR integrates with other AWS services to provide capabilities and functionality related to networking, storage, security, and so on, for your cluster. The following list provides several examples of this integration:

• Amazon EC2 for the instances that comprise the nodes in the cluster

• Amazon Virtual Private Cloud (Amazon VPC) to configure the virtual network in which you launch your instances

• Amazon S3 to store input and output data

• Amazon CloudWatch to monitor cluster performance and configure alarms

• AWS Identity and Access Management (IAM) to configure permissions

• AWS CloudTrail to audit requests made to the service

• AWS Data Pipeline to schedule and start your clusters

• AWS Lake Formation to discover, catalog, and secure data in an Amazon S3 data lake

Deployment

Your EMR cluster consists of EC2 instances, which perform the work that you submit to your cluster. When you launch your cluster, Amazon EMR configures the instances with the applications that you choose, such as Apache Hadoop or Spark. Choose the instance size and type that best suits the processing needs for your cluster: batch processing, low-latency queries, streaming data, or large data storage. For more information about the instance types available for Amazon EMR, see Configure cluster hardware and networking (p. 197).

Amazon EMR offers a variety of ways to configure software on your cluster. For example, you can install an Amazon EMR release with a chosen set of applications that can include versatile frameworks, such as Hadoop, and applications, such as Hive, Pig, or Spark. You can also install one of several MapR distributions. Amazon EMR uses Amazon Linux, so you can also install software on your cluster manually using the yum package manager or from the source. For more information, see Configure cluster software (p. 192).

Scalability and flexibility

Amazon EMR provides flexibility to scale your cluster up or down as your computing needs change. You can resize your cluster to add instances for peak workloads and remove instances to control costs when peak workloads subside. For more information, see Manually resizing a running cluster (p. 518).

Amazon EMR also provides the option to run multiple instance groups so that you can use On-Demand Instances in one group for guaranteed processing power together with Spot Instances in another group to have your jobs completed faster and at lower costs. You can also mix different instance types to take advantage of better pricing for one Spot Instance type over another. For more information, see When should you use Spot Instances? (p. 237).
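
One way to picture this mix is as a set of instance groups that differ in their purchasing option. The following sketch assumes the Market field of the RunJobFlow InstanceGroups shape, with the values ON_DEMAND and SPOT; the build_instance_groups helper, the group names, and the instance types are hypothetical.

```python
def build_instance_groups(core_count, task_count,
                          core_type="m5.xlarge", task_type="c5.xlarge"):
    """Build instance groups with On-Demand master/core and Spot task nodes."""
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": core_type, "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": core_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so this sketch places them on Spot.
        groups.append(
            {"Name": "Task (Spot)", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": task_type, "InstanceCount": task_count}
        )
    return groups


for group in build_instance_groups(core_count=2, task_count=6):
    print(group["Name"], group["Market"], group["InstanceCount"])
```

Placing only task nodes on Spot is a design choice of this sketch, not a requirement: it keeps the nodes that store HDFS data on On-Demand capacity.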

Additionally, Amazon EMR provides the flexibility to use several file systems for your input, output, and intermediate data. For example, you might choose the Hadoop Distributed File System (HDFS) which runs on the master and core nodes of your cluster for processing data that you do not need to store beyond your cluster's lifecycle. You might choose the EMR File System (EMRFS) to use Amazon S3 as a data layer for applications running on your cluster so that you can separate your compute and storage, and persist data outside of the lifecycle of your cluster. EMRFS provides the added benefit of allowing you to scale up or down for your compute and storage needs independently. You can scale your compute needs by resizing your cluster and you can scale your storage needs by using Amazon S3. For more information, see Work with storage and file systems (p. 138).

Reliability

Amazon EMR monitors nodes in your cluster and automatically terminates and replaces an instance in case of failure.

Amazon EMR provides configuration options that control if your cluster is terminated automatically or manually. If you configure your cluster to be automatically terminated, it is terminated after all the steps complete. This is referred to as a transient cluster. However, you can configure the cluster to continue running after processing completes so that you can choose to terminate it manually when you no longer need it. Or, you can create a cluster, interact with the installed applications directly, and then manually terminate the cluster when you no longer need it. The clusters in these examples are referred to as long-running clusters.

Additionally, you can configure termination protection to prevent instances in your cluster from being terminated due to errors or issues during processing. When termination protection is enabled, you can recover data from instances before termination. The default settings for these options differ depending on whether you launch your cluster by using the console, CLI, or API. For more information, see Using termination protection (p. 177).
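
When a cluster is created through the API, these choices come down to two flags. The following minimal sketch assumes the KeepJobFlowAliveWhenNoSteps and TerminationProtected fields of the RunJobFlow Instances structure; the termination_settings helper is hypothetical.

```python
def termination_settings(transient, termination_protected=False):
    """Map the cluster styles described above onto RunJobFlow 'Instances' flags."""
    return {
        # False: terminate after the last step completes (transient cluster).
        # True: keep running until terminated manually (long-running cluster).
        "KeepJobFlowAliveWhenNoSteps": not transient,
        # True: keep instances from being terminated because of errors.
        "TerminationProtected": termination_protected,
    }


print(termination_settings(transient=True))
print(termination_settings(transient=False, termination_protected=True))
```

Setting both flags explicitly avoids relying on defaults, which, as noted above, differ between the console, CLI, and API.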

Security

Amazon EMR leverages other AWS services, such as IAM and Amazon VPC, and features such as Amazon EC2 key pairs, to help you secure your clusters and data.

IAM

Amazon EMR integrates with IAM to manage permissions. You define permissions using IAM policies, which you attach to IAM users or IAM groups. The permissions that you define in the policy determine the actions that those users or members of the group can perform and the resources that they can access. For more information, see How Amazon EMR works with IAM (p. 284).
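
As an illustration of such a policy, the following sketch builds a small identity-based policy document that allows read-only inspection of clusters. The selection of actions and the read_only_emr_policy helper are assumptions made for this example; see the identity-based policy examples later in this guide for policies suited to real use.

```python
import json


def read_only_emr_policy():
    """Return an IAM policy document that allows viewing clusters and steps."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListClusters",
                    "elasticmapreduce:ListSteps",
                ],
                # "*" is used for brevity; a real policy would narrow this.
                "Resource": "*",
            }
        ],
    }


print(json.dumps(read_only_emr_policy(), indent=2))
```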

Additionally, Amazon EMR uses IAM roles for the Amazon EMR service itself and the EC2 instance profile for the instances. These roles grant permissions for the service and instances to access other AWS services on your behalf. There is a default role for the Amazon EMR service and a default role for the EC2 instance profile. The default roles use AWS managed policies, which are created for you automatically the first time you launch an EMR cluster from the console and choose default permissions. You can also create the default IAM roles from the AWS CLI. If you want to manage the permissions instead of AWS, you can choose custom roles for the service and instance profile. For more information, see Configure IAM service roles for Amazon EMR permissions to AWS services and resources (p. 286).

Security groups

Amazon EMR uses security groups to control inbound and outbound traffic to your EC2 instances. When you launch your cluster, Amazon EMR uses a security group for your master instance and a security group to be shared by your core/task instances. Amazon EMR configures the security group rules to ensure communication among the instances in the cluster. Optionally, you can configure additional security groups and assign them to your master and core/task instances for more advanced rules. For more information, see Control network traffic with security groups (p. 419).

Encryption

Amazon EMR supports optional Amazon S3 server-side and client-side encryption with EMRFS to help protect the data that you store in Amazon S3. With server-side encryption, Amazon S3 encrypts your data after you upload it.

With client-side encryption, the encryption and decryption process occurs in the EMRFS client on your EMR cluster. You manage the root key for client-side encryption using either the AWS Key Management Service (AWS KMS) or your own key management system.

For more information, see Encryption for Amazon S3 data with EMRFS in the Amazon EMR Release Guide.

Amazon VPC

Amazon EMR supports launching clusters in a virtual private cloud (VPC) in Amazon VPC. A VPC is an isolated, virtual network in AWS that provides the ability to control advanced aspects of network configuration and access. For more information, see Configure networking (p. 205).

AWS CloudTrail

Amazon EMR integrates with CloudTrail to log information about requests made by or on behalf of your AWS account. With this information, you can track who accessed your cluster, when they accessed it, and the IP address from which they made the request. For more information, see Logging Amazon EMR API calls in AWS CloudTrail (p. 474).

Amazon EC2 key pairs

You can monitor and interact with your cluster by forming a secure connection between your remote computer and the master node. You use the Secure Shell (SSH) network protocol for this connection or use Kerberos for authentication. If you use SSH, an Amazon EC2 key pair is required. For more information, see Use an Amazon EC2 key pair for SSH credentials (p. 338).

Monitoring

You can use the Amazon EMR management interfaces and log files to troubleshoot cluster issues, such as failures or errors. Amazon EMR provides the ability to archive log files in Amazon S3 so you can store logs and troubleshoot issues even after your cluster terminates. Amazon EMR also provides an optional debugging tool in the Amazon EMR console to browse the log files based on steps, jobs, and tasks. For more information, see Configure cluster logging and debugging (p. 241).

Amazon EMR integrates with CloudWatch to track performance metrics for the cluster and jobs within the cluster. You can configure alarms based on a variety of metrics such as whether the cluster is idle or the percentage of storage used. For more information, see Monitor metrics with CloudWatch (p. 462).
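
For example, an alarm on an idle cluster could be described as follows. This is a sketch under assumptions: it uses the AWS/ElasticMapReduce namespace, the IsIdle metric, and the JobFlowId dimension, and arranges them as keyword arguments for the CloudWatch PutMetricAlarm operation. The idle_cluster_alarm helper, the alarm name, the thresholds, and the cluster ID are hypothetical.

```python
def idle_cluster_alarm(cluster_id, idle_periods=6):
    """Build keyword arguments for CloudWatch PutMetricAlarm (illustrative values)."""
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,                      # seconds per evaluation period
        "EvaluationPeriods": idle_periods,  # 6 x 5 minutes = 30 minutes idle
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


alarm = idle_cluster_alarm("j-XXXXXXXXXXXXX")  # placeholder cluster ID
print(alarm)

# With boto3 installed and AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```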

Management interfaces

There are several ways you can interact with Amazon EMR:

• Console: A graphical user interface that you can use to launch and manage clusters. With it, you fill out web forms to specify the details of clusters to launch, view the details of existing clusters, debug, and terminate clusters. Using the console is the easiest way to get started with Amazon EMR; no programming knowledge is required. The console is available online at https://console.aws.amazon.com/elasticmapreduce/home.

• AWS Command Line Interface (AWS CLI): A client application you run on your local machine to connect to Amazon EMR and create and manage clusters. The AWS CLI contains a feature-rich set of commands specific to Amazon EMR. With it, you can write scripts that automate the process of launching and managing clusters. If you prefer working from a command line, using the AWS CLI is the best option. For more information, see Amazon EMR in the AWS CLI Command Reference.

• Software Development Kit (SDK): SDKs provide functions that call Amazon EMR to create and manage clusters. With them, you can write applications that automate the process of creating and managing clusters. Using the SDK is the best option to extend or customize the functionality of Amazon EMR. Amazon EMR is currently available in the following SDKs: Go, Java, .NET (C# and VB.NET), Node.js, PHP, Python, and Ruby. For more information about these SDKs, see Tools for AWS and Amazon EMR sample code & libraries.

• Web Service API: A low-level interface that you can use to call the web service directly, using JSON. Using the API is the best option to create a custom SDK that calls Amazon EMR. For more information, see the Amazon EMR API Reference.
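As a small illustration of the SDK option, the following Python sketch lists clusters in the WAITING state. It assumes that emr_client is an Amazon EMR client from the AWS SDK for Python, for example one created with boto3.client("emr"); the two function names are hypothetical helpers for this example only.

```python
def summarize_clusters(response):
    """Return (id, name, state) tuples from a ListClusters-style response."""
    return [
        (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for cluster in response.get("Clusters", [])
    ]


def list_waiting_clusters(emr_client):
    """Ask Amazon EMR for clusters that are ready to accept work."""
    response = emr_client.list_clusters(ClusterStates=["WAITING"])
    return summarize_clusters(response)
```

Passing the client in as a parameter keeps the helpers easy to exercise with a stand-in object before you run them against a real account.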

Overview of Amazon EMR architecture

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster. This section provides an overview of the layers and the components of each.

In This Topic

• Storage (p. 9)

• Cluster resource management (p. 10)

• Data processing frameworks (p. 10)

• Applications and programs (p. 11)

Storage

The storage layer includes the different file systems that are used with your cluster. There are several different types of storage options as follows.

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O.

For more information, go to HDFS User Guide on the Apache Hadoop website.

EMR File System (EMRFS)

Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.


Local file system

The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.

Cluster resource management

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks.

However, there are other frameworks and applications that are offered in Amazon EMR that do not use YARN as a resource manager. Amazon EMR also has an agent on each node that administers YARN components, keeps the cluster healthy, and communicates with Amazon EMR.

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes.

The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release version 5.19.0 and later uses the built-in YARN node labels feature to achieve this. (Earlier versions used a code patch.) Properties in the yarn-site and capacity-scheduler configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the CORE label, and sets properties so that application masters are scheduled only on nodes with the CORE label. Manually modifying related properties in the yarn-site and capacity-scheduler configuration classifications, or directly in associated XML files, could break this feature or modify this functionality.

Data processing frameworks

The data processing framework layer is the engine used to process and analyze data. There are many frameworks available that run on YARN or have their own resource management. Different frameworks are available for different kinds of processing needs, such as batch, interactive, in-memory, streaming, and so on. The framework that you choose depends on your use case. This impacts the languages and interfaces available from the application layer, which is the layer used to interact with the data you want to process. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark.

Hadoop MapReduce

Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results.

The Reduce function combines the intermediate results, applies additional algorithms, and produces the final output. There are multiple frameworks available for MapReduce, such as Hive, which automatically generates Map and Reduce programs.
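The Map and Reduce roles described above can be sketched in plain Python with a word count. This is a single-process illustration of the programming model only, not how Hadoop runs jobs; the function names and the shuffle helper are assumptions made for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit one (word, 1) key-value pair per word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: combine all intermediate values for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle, and reduce phases over a list of lines."""
    intermediate = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())
```

In a real cluster, the framework distributes the map and reduce calls across nodes and handles the grouping step for you.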

For more information, go to How map and reduce operations are actually carried out on the Apache Hadoop Wiki website.

Apache Spark

Spark is a cluster framework and programming model for processing big data workloads. Like Hadoop MapReduce, Spark is an open-source, distributed processing system but uses directed acyclic graphs for execution plans and in-memory caching for datasets. When you run Spark on Amazon EMR, you can use EMRFS to directly access your data in Amazon S3. Spark supports multiple interactive query modules such as SparkSQL.

For more information, see Apache Spark on Amazon EMR clusters in the Amazon EMR Release Guide.

Applications and programs

Amazon EMR supports many applications, such as Hive, Pig, and the Spark Streaming library, to provide capabilities such as using higher-level languages to create processing workloads, leveraging machine learning algorithms, making stream processing applications, and building data warehouses. Amazon EMR also supports open-source projects that have their own cluster management functionality instead of using YARN.

You use various libraries and languages to interact with the applications that you run in Amazon EMR. For example, you can use Java, Hive, or Pig with MapReduce, or Spark Streaming, Spark SQL, MLlib, and GraphX with Spark.

For more information, see the Amazon EMR Release Guide.


Setting up Amazon EMR

Complete the tasks in this section before you launch an Amazon EMR cluster for the first time:

1. Sign up for AWS (p. 12)

2. Create an Amazon EC2 key pair for SSH (p. 12)

Sign up for AWS

If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.

2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

Create an Amazon EC2 key pair for SSH

Note

With Amazon EMR release versions 5.10.0 or later, you can configure Kerberos to authenticate users and SSH connections to a cluster. For more information, see Use Kerberos authentication (p. 338).

To authenticate and connect to the nodes in a cluster over a secure channel using the Secure Shell (SSH) protocol, create an Amazon Elastic Compute Cloud (Amazon EC2) key pair before you launch the cluster.

You can also create a cluster without a key pair. This is usually done with transient clusters that start, run steps, and then terminate automatically.

If... | Then...

You already have an Amazon EC2 key pair that you want to use, or you don't need to authenticate to your cluster. | Skip this step.

You need to create a key pair. | See Creating your key pair using Amazon EC2.

Next steps

• For guidance on creating a sample cluster, see Tutorial: Getting started with Amazon EMR (p. 13).

• For more information on how to configure a custom cluster and control access to it, see Plan and configure clusters (p. 131) and Security in Amazon EMR (p. 251).


Tutorial: Getting started with Amazon EMR

Overview

With Amazon EMR you can set up a cluster to process and analyze data with big data frameworks in just a few minutes. This tutorial shows you how to launch a sample cluster using Spark, and how to run a simple PySpark script stored in an Amazon S3 bucket. It covers essential Amazon EMR tasks in three main workflow categories: Plan and Configure, Manage, and Clean Up.

You'll find links to more detailed topics as you work through the tutorial, and ideas for additional steps in the Next steps (p. 24) section. If you have questions or get stuck, contact the Amazon EMR team on our Discussion forum.

Prerequisites

• Before you launch an Amazon EMR cluster, make sure you complete the tasks in Setting up Amazon EMR (p. 12).

Cost

• The sample cluster that you create runs in a live environment. The cluster accrues minimal charges. To avoid additional charges, make sure you complete the cleanup tasks in the last step of this tutorial. Charges accrue at the per-second rate according to Amazon EMR pricing. Charges also vary by Region. For more information, see Amazon EMR pricing.

• Minimal charges might accrue for small files that you store in Amazon S3. Some or all of the charges for Amazon S3 might be waived if you are within the usage limits of the AWS Free Tier. For more information, see Amazon S3 pricing and AWS Free Tier.

Step 1: Plan and configure an Amazon EMR cluster

Prepare storage for Amazon EMR

When you use Amazon EMR, you can choose from a variety of file systems to store input data, output data, and log files. In this tutorial, you use EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that lets you read and write regular files to Amazon S3. For more information, see Work with storage and file systems (p. 138).

To create a bucket for this tutorial, follow the instructions in How do I create an S3 bucket? in the Amazon Simple Storage Service Console User Guide. Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster. For example, US West (Oregon) us-west-2.

Buckets and folders that you use with Amazon EMR have the following limitations:

• Names can consist of lowercase letters, numbers, periods (.), and hyphens (-).

• Names cannot end in numbers.

• A bucket name must be unique across all AWS accounts.

• An output folder must be empty.
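The character rules in this list can be checked before you create a bucket. The following Python sketch covers only the two naming rules stated above (allowed characters and no trailing number); it does not check uniqueness or the full set of Amazon S3 naming rules, and the function name is an assumption made for this example.

```python
import re

# Lowercase letters, numbers, periods, and hyphens only.
_ALLOWED_CHARACTERS = re.compile(r"[a-z0-9.-]+")


def check_emr_bucket_name(name):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if not _ALLOWED_CHARACTERS.fullmatch(name):
        problems.append("use only lowercase letters, numbers, periods, and hyphens")
    if name and name[-1].isdigit():
        problems.append("must not end in a number")
    return problems
```

Returning a list of problems rather than a single Boolean lets a script report every issue with a candidate name at once.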

Prepare an application with input data for Amazon EMR

The most common way to prepare an application for Amazon EMR is to upload the application and its input data to Amazon S3. Then, when you submit work to your cluster you specify the Amazon S3 locations for your script and data.

In this step, you upload a sample PySpark script to your Amazon S3 bucket. We've provided a PySpark script for you to use. The script processes food establishment inspection data and returns a results file in your S3 bucket. The results file lists the top ten establishments with the most "Red" type violations.

You also upload sample input data to Amazon S3 for the PySpark script to process. The input data is a modified version of Health Department inspection results in King County, Washington, from 2006 to 2020. For more information, see King County Open Data: Food Establishment Inspection Data. Following are sample rows from the dataset.

name, inspection_result, inspection_closed_business, violation_type, violation_points
100 LB CLAM, Unsatisfactory, FALSE, BLUE, 5
100 PERCENT NUTRICION, Unsatisfactory, FALSE, BLUE, 5
7-ELEVEN #2361-39423A, Complete, FALSE, , 0

To prepare the example PySpark script for EMR

1. Copy the example code below into a new file in your editor of choice.

import argparse

from pyspark.sql import SparkSession


def calculate_red_violations(data_source, output_uri):
    """
    Processes sample food establishment inspection data and queries the data to find the top 10 establishments
    with the most Red violations from 2006 to 2020.

    :param data_source: The URI of your food establishment data CSV, such as 's3://DOC-EXAMPLE-BUCKET/food-establishment-data.csv'.
    :param output_uri: The URI where output is written, such as 's3://DOC-EXAMPLE-BUCKET/restaurant_violation_results'.
    """
    with SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate() as spark:
        # Load the restaurant violation CSV data
        if data_source is not None:
            restaurants_df = spark.read.option("header", "true").csv(data_source)

        # Create an in-memory DataFrame to query
        restaurants_df.createOrReplaceTempView("restaurant_violations")

        # Create a DataFrame of the top 10 restaurants with the most Red violations
        top_red_violation_restaurants = spark.sql("""SELECT name, count(*) AS total_red_violations
          FROM restaurant_violations
          WHERE violation_type = 'RED'
          GROUP BY name
          ORDER BY total_red_violations DESC LIMIT 10""")

        # Write the results to the specified output URI
        top_red_violation_restaurants.write.option("header", "true").mode("overwrite").csv(output_uri)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--data_source', help="The URI for your CSV restaurant data, like an S3 bucket location.")
    parser.add_argument(
        '--output_uri', help="The URI where output is saved, like an S3 bucket location.")
    args = parser.parse_args()

    calculate_red_violations(args.data_source, args.output_uri)

2. Save the file as health_violations.py.

3. Upload health_violations.py to Amazon S3 into the bucket you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide.
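If you want to sanity-check the query logic on a few rows before you run it on a cluster, the aggregation in health_violations.py can be approximated locally without Spark. The following sketch is an assumption-laden stand-in rather than part of the tutorial: the function name is hypothetical, it reads CSV text from a string, and the order of establishments with equal counts may differ from what Spark returns.

```python
import csv
import io
from collections import Counter


def top_red_violations(csv_text, limit=10):
    """Count rows whose violation_type is RED, grouped by establishment name."""
    # skipinitialspace tolerates a space after each comma, as in the sample rows.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    counts = Counter(
        row["name"] for row in reader if row["violation_type"] == "RED"
    )
    # most_common sorts by count, highest first, like ORDER BY ... DESC LIMIT.
    return counts.most_common(limit)
```

For example, you might paste the header and a handful of rows from food_establishment_data.csv into a string and compare the result with what you expect by hand.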

To prepare the sample input data for EMR

1. Download the zip file, food_establishment_data.zip.

2. Unzip food_establishment_data.zip and save the extracted file as food_establishment_data.csv on your machine.

3. Upload the CSV file to the S3 bucket that you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide.

For more information about setting up data for EMR, see Prepare input data (p. 140).


Launch an Amazon EMR cluster

After you prepare a storage location and your application, you can launch a sample Amazon EMR cluster.

In this step, you launch an Apache Spark cluster using the latest Amazon EMR release version.

Console

To launch a cluster with Spark installed using Quick Options

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Create cluster to open the Quick Options wizard.

3. Note the default values for Release, Instance type, Number of instances, and Permissions on the Create Cluster - Quick Options page. These fields autofill with values that work for general-purpose clusters. For more information about the Quick Options configuration settings, see Summary of Quick Options (p. 132).

4. Enter a Cluster name to help you identify the cluster. For example, My First EMR Cluster.

5. Leave Logging enabled, but replace the S3 folder value with the Amazon S3 bucket you created, followed by /logs. For example, s3://DOC-EXAMPLE-BUCKET/logs. Adding /logs creates a new folder called 'logs' in your bucket, where EMR can copy the log files of your cluster.

6. Choose the Spark option under Applications to install Spark on your cluster.

Note

Choose the applications you want on your Amazon EMR cluster before you launch the cluster. You can't add or remove applications from a cluster after launch.

7. Choose your EC2 key pair under Security and access.

8. Choose Create cluster to launch the cluster and open the cluster status page.

9. Find the cluster Status next to the cluster name. The status changes from Starting to Running to Waiting as Amazon EMR provisions the cluster. You may need to choose the refresh icon on the right or refresh your browser to see status updates.

Your cluster status changes to Waiting when the cluster is up, running, and ready to accept work. For more information about reading the cluster summary, see View cluster status and details (p. 436).

For information about cluster status, see Understanding the cluster lifecycle (p. 3).

CLI

To launch a cluster with Spark installed using the AWS CLI

1. Create a Spark cluster with the following command. Enter a name for your cluster with the --name option, and specify the name of your EC2 key pair with the --ec2-attributes option.

aws emr create-cluster \
    --name "<My First EMR Cluster>" \
    --release-label <emr-5.34.0> \
    --applications Name=Spark \
    --ec2-attributes KeyName=<myEMRKeyPairName> \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles

Note the other required values for --instance-type, --instance-count, and --use-default-roles. These values have been chosen for general-purpose clusters. For more information about create-cluster, see the AWS CLI reference.

Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

You should see output like the following. The output shows the ClusterId and ClusterArn of your new cluster. Note your ClusterId. You use the ClusterId to check on the cluster status and to submit work.

{
    "ClusterId": "myClusterId",
    "ClusterArn": "myClusterArn"
}

2. Check your cluster status with the following command.

aws emr describe-cluster --cluster-id <myClusterId>

You should see output like the following with the Status object for your new cluster.

{
    "Cluster": {
        "Id": "myClusterId",
        "Name": "My First EMR Cluster",
        "Status": {
            "State": "STARTING",
            "StateChangeReason": {
                "Message": "Configuring cluster software"
            }
        }
    }
}

The State value changes from STARTING to RUNNING to WAITING as Amazon EMR provisions the cluster.

Cluster status changes to WAITING when a cluster is up, running, and ready to accept work. For information about cluster status, see Understanding the cluster lifecycle (p. 3).
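Rather than rerunning describe-cluster by hand, you can poll until the cluster reaches WAITING. The following Python sketch is one possible approach: the function names, the 30-second interval, and the poll limit are assumptions for this example, and cli_state_getter assumes that the AWS CLI is installed and configured on your machine.

```python
import subprocess
import time


def wait_for_state(get_state, target="WAITING",
                   failed=("TERMINATED", "TERMINATED_WITH_ERRORS"),
                   poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll get_state() until it returns target; raise if that cannot happen."""
    for _ in range(max_polls):
        state = get_state()
        if state == target:
            return state
        if state in failed:
            raise RuntimeError(f"cluster reached {state} before {target}")
        sleep(poll_seconds)
    raise TimeoutError(f"cluster did not reach {target} after {max_polls} polls")


def cli_state_getter(cluster_id):
    """Return a callable that reads Cluster.Status.State with the AWS CLI."""
    def get_state():
        result = subprocess.run(
            ["aws", "emr", "describe-cluster", "--cluster-id", cluster_id,
             "--query", "Cluster.Status.State", "--output", "text"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()
    return get_state
```

A call such as wait_for_state(cli_state_getter("<myClusterId>")) would then block until the cluster is ready, fails, or the poll limit is reached.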

Step 2: Manage your Amazon EMR cluster

Submit work to Amazon EMR

After you launch a cluster, you can submit work to the running cluster to process and analyze data.

You submit work to an Amazon EMR cluster as a step. A step is a unit of work made up of one or more actions. For example, you might submit a step to compute values, or to transfer and process data. You can submit steps when you create a cluster, or to a running cluster. In this part of the tutorial, you submit health_violations.py as a step to your running cluster. To learn more about steps, see Submit work to a cluster (p. 526).


Console

To submit a Spark application as a step using the console

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Select the name of your cluster from the Cluster List. The cluster state must be Waiting.

3. Choose Steps, and then choose Add step.

4. Configure the step according to the following guidelines:

• For Step type, choose Spark application. You should see additional fields for Deploy Mode, Spark-submit options, and Application location appear.

• For Name, leave the default value or type a new name. If you have many steps in a cluster, naming each step helps you keep track of them.

• For Deploy mode, leave the default value Cluster. For more information about Spark deployment modes, see Cluster mode overview in the Apache Spark documentation.

• Leave the Spark-submit options field blank. For more information about spark-submit options, see Launching applications with spark-submit.

• For Application location, enter the location of your health_violations.py script in Amazon S3. For example, s3://DOC-EXAMPLE-BUCKET/health_violations.py.

• In the Arguments field, enter the following arguments and values:

--data_source s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv --output_uri s3://DOC-EXAMPLE-BUCKET/myOutputFolder

Replace s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv with the S3 URI of the input data you prepared in Prepare an application with input data for Amazon EMR (p. 14).

Replace DOC-EXAMPLE-BUCKET with the name of the bucket you created for this tutorial, and myOutputFolder with a name for your cluster output folder.

• For Action on failure, accept the default option Continue so that if the step fails, the cluster continues to run.

5. Choose Add to submit the step. The step should appear in the console with a status of Pending.

6. Check for the step status to change from Pending to Running to Completed. To refresh the status in the console, choose the refresh icon to the right of the Filter. The script takes about one minute to run.

You will know that the step finished successfully when the status changes to Completed.

CLI

To submit a Spark application as a step using the AWS CLI

1. Make sure you have the ClusterId of the cluster you launched in Launch an Amazon EMR cluster (p. 16). You can also retrieve your cluster ID with the following command.

aws emr list-clusters --cluster-states WAITING

2. Submit health_violations.py as a step with the add-steps command and your ClusterId.


• You can specify a name for your step by replacing "My Spark Application". In the Args array, replace s3://DOC-EXAMPLE-BUCKET/health_violations.py with the location of your health_violations.py application.

• Replace s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv with the S3 location of your food_establishment_data.csv dataset.

• Replace s3://DOC-EXAMPLE-BUCKET/MyOutputFolder with the S3 path of your designated bucket and a name for your cluster output folder.

• ActionOnFailure=CONTINUE means the cluster continues to run if the step fails.

aws emr add-steps \
    --cluster-id <myClusterId> \
    --steps Type=Spark,Name="<My Spark Application>",ActionOnFailure=CONTINUE,Args=[<s3://DOC-EXAMPLE-BUCKET/health_violations.py>,--data_source,<s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv>,--output_uri,<s3://DOC-EXAMPLE-BUCKET/MyOutputFolder>]

For more information about submitting steps using the CLI, see the AWS CLI Command Reference.

After you submit the step, you should see output like the following with a list of StepIds. Since you submitted one step, you will see just one ID in the list. Copy your step ID. You use your step ID to check the status of the step.

{
    "StepIds": [
        "s-1XXXXXXXXXXA"
    ]
}

3. Query the status of your step with the describe-step command.

aws emr describe-step --cluster-id <myClusterId> --step-id <s-1XXXXXXXXXXA>

You should see output like the following with information about your step.

{
    "Step": {
        "Id": "s-1XXXXXXXXXXA",
        "Name": "My Spark Application",
        "Config": {
            "Jar": "command-runner.jar",
            "Properties": {},
            "Args": [
                "spark-submit",
                "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
                "--data_source",
                "s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv",
                "--output_uri",
                "s3://DOC-EXAMPLE-BUCKET/myOutputFolder"
            ]
        },
        "ActionOnFailure": "CONTINUE",
        "Status": {
            "State": "COMPLETED"
        }
    }
}

The State of the step changes from PENDING to RUNNING to COMPLETED as the step runs. The step takes about one minute to run, so you might need to check the status a few times.

You will know that the step was successful when the State changes to COMPLETED.

For more information about the step lifecycle, see Running steps to process data (p. 3).
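The --steps value is long and easy to mistype, so you might prefer to assemble it in a script and to read the JSON responses programmatically. The following Python sketch shows one way to do that; all three function names are hypothetical, and the generated string is meant to be pasted after --steps in a shell command like the one shown earlier.

```python
import json


def spark_step_shorthand(script_uri, data_source, output_uri,
                         name="My Spark Application"):
    """Build the shorthand value used with the --steps option of add-steps."""
    args = [script_uri, "--data_source", data_source, "--output_uri", output_uri]
    joined = ",".join(args)
    return f'Type=Spark,Name="{name}",ActionOnFailure=CONTINUE,Args=[{joined}]'


def first_step_id(add_steps_output):
    """Extract the first step ID from the JSON that add-steps prints."""
    return json.loads(add_steps_output)["StepIds"][0]


def step_state(describe_step_output):
    """Extract the step state from the JSON that describe-step prints."""
    return json.loads(describe_step_output)["Step"]["Status"]["State"]
```

Because the helpers only build and parse text, you can check them against the example output in this section before using them with a real cluster.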

View results

After a step runs successfully, you can view its output results in your Amazon S3 output folder.

To view the results of health_violations.py

1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.

2. Choose the Bucket name and then the output folder that you specified when you submitted the step. For example, DOC-EXAMPLE-BUCKET and then myOutputFolder.

3. Verify that the following items appear in your output folder:

• A small-sized object called _SUCCESS.

• A CSV file starting with the prefix part- that contains your results.

4. Choose the object with your results, then choose Download to save the results to your local file system.

5. Open the results in your editor of choice. The output file lists the top ten food establishments with the most red violations. The output file also shows the total number of red violations for each establishment.

The following is an example of health_violations.py results.

name, total_red_violations
SUBWAY, 322
T-MOBILE PARK, 315
WHOLE FOODS MARKET, 299
PCC COMMUNITY MARKETS, 251
TACO TIME, 240
MCDONALD'S, 177
THAI GINGER, 153
SAFEWAY INC #1508, 143
TAQUERIA EL RINCONSITO, 134
HIMITSU TERIYAKI, 128

For more information about Amazon EMR cluster output, see Configure an output location (p. 149).

(Optional) Connect to your running Amazon EMR cluster

When you use Amazon EMR, you may want to connect to a running cluster to read log files, debug the cluster, or use CLI tools like the Spark shell. Amazon EMR lets you connect to a cluster using the Secure Shell (SSH) protocol. This section covers how to configure SSH, connect to your cluster, and view log files for Spark. For more information about connecting to a cluster, see Authenticate to Amazon EMR cluster nodes (p. 337).

Authorize SSH connections to your cluster

Before you connect to your cluster, you need to modify your cluster security groups to authorize inbound SSH connections. Amazon EC2 security groups act as virtual firewalls to control inbound and outbound traffic to your cluster. When you created your cluster for this tutorial, Amazon EMR created the following security groups on your behalf:

ElasticMapReduce-master

The default Amazon EMR managed security group associated with the master node. In an Amazon EMR cluster, the master node is an Amazon EC2 instance that manages the cluster.

ElasticMapReduce-slave

The default security group associated with core and task nodes.

To allow SSH access for trusted sources for the ElasticMapReduce-master security group

To edit your security groups, you must have permission to manage security groups for the VPC that the cluster is in. For more information, see Changing Permissions for an IAM User and the Example Policy that allows managing EC2 security groups in the IAM User Guide.

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Clusters.

3. Choose the Name of the cluster you want to modify.

4. Choose the Security groups for Master link under Security and access.

5. Choose ElasticMapReduce-master from the list.

6. Choose the Inbound rules tab and then Edit inbound rules.

7. Check for an inbound rule that allows public access with the following settings. If it exists, choose Delete to remove it.

Type: SSH

Port: 22

Source: Custom 0.0.0.0/0

Warning

Before December 2020, the ElasticMapReduce-master security group had a pre-configured rule to allow inbound traffic on Port 22 from all sources. This rule was created to simplify initial SSH connections to the master node. We strongly recommend that you remove this inbound rule and restrict traffic to trusted sources.

8. Scroll to the bottom of the list of rules and choose Add Rule.

9. For Type, select SSH.

Selecting SSH automatically enters TCP for Protocol and 22 for Port Range.


10. For source, select My IP to automatically add your IP address as the source address. You can also add a range of Custom trusted client IP addresses, or create additional rules for other clients. Many network environments dynamically allocate IP addresses, so you might need to update your IP addresses for trusted clients in the future.

11. Choose Save.

12. Optionally, choose ElasticMapReduce-slave from the list and repeat the steps above to allow SSH client access to core and task nodes.
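The check in step 7 can also be scripted if you audit several security groups. The following Python sketch scans inbound rules for SSH access that is open to all IPv4 addresses. It assumes that the rules are dictionaries shaped like the IpPermissions entries that Amazon EC2 returns when you describe a security group; the function name is hypothetical, and IPv6 ranges are not examined.

```python
import ipaddress


def open_ssh_rules(ip_permissions):
    """Return CIDR blocks that allow port 22 from any IPv4 address."""
    findings = []
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" stands for all protocols and ports
        elif protocol == "tcp":
            covers_ssh = rule.get("FromPort", 0) <= 22 <= rule.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue
        for ip_range in rule.get("IpRanges", []):
            network = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
            if network.prefixlen == 0:  # a /0 prefix matches every address
                findings.append(ip_range["CidrIp"])
    return findings
```

An empty result suggests that no rule exposes SSH to every IPv4 address, although narrower ranges should still be reviewed against your list of trusted clients.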

Connect to your cluster using the AWS CLI

Regardless of your operating system, you can create an SSH connection to your cluster using the AWS CLI.

To connect to your cluster and view log files using the AWS CLI

1. Use the following command to open an SSH connection to your cluster. Replace <~/mykeypair.key> with the full path and file name of your key pair file. For example, C:\Users\<username>\.ssh\mykeypair.pem.

aws emr ssh --cluster-id <j-2AL4XXXXXX5T9> --key-pair-file <~/mykeypair.key>

2. Navigate to /mnt/var/log/spark to access the Spark logs on your cluster's master node. Then view the files in that location. For a list of additional log files on the master node, see View log files on the master node (p. 449).

cd /mnt/var/log/spark
ls

Step 3: Clean up your Amazon EMR resources

Terminate your cluster

Now that you've submitted work to your cluster and viewed the results of your PySpark application, you can terminate the cluster. Terminating a cluster stops all of the cluster's associated Amazon EMR charges and Amazon EC2 instances.

When you terminate a cluster, Amazon EMR retains metadata about the cluster for two months at no charge. Archived metadata helps you clone the cluster (p. 525) for a new job or revisit the cluster configuration for reference purposes. Metadata does not include data that the cluster writes to S3, or data stored in HDFS on the cluster.

Note

The Amazon EMR console does not let you delete a cluster from the list view after you terminate the cluster. A terminated cluster disappears from the console when Amazon EMR clears its metadata.

Console

To terminate the cluster using the console

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

2. Choose Clusters, then choose the cluster you want to terminate. For example, My First EMR Cluster.


3. Choose Terminate to open the Terminate cluster prompt.

4. Choose Terminate in the open prompt. Depending on the cluster configuration, termination may take 5 to 10 minutes. For more information about terminating Amazon EMR clusters, see Terminate a cluster (p. 490).

Note

If you followed the tutorial closely, termination protection should be off. Cluster termination protection prevents accidental termination. If termination protection is on, you will see a prompt to change the setting before terminating the cluster. Choose Change, then Off.

CLI

To terminate the cluster using the AWS CLI

1. Initiate the cluster termination process with the following command. Replace <myClusterId> with the ID of your sample cluster. The command does not return output.

aws emr terminate-clusters --cluster-ids <myClusterId>

2. To check that the cluster termination process is in progress, check the cluster status with the following command.

aws emr describe-cluster --cluster-id <myClusterId>

Following is example output in JSON format. The cluster Status should change from TERMINATING to TERMINATED. Termination may take 5 to 10 minutes depending on your cluster configuration. For more information about terminating an Amazon EMR cluster, see Terminate a cluster (p. 490).

{
    "Cluster": {
        "Id": "j-xxxxxxxxxxxxx",
        "Name": "My Cluster Name",
        "Status": {
            "State": "TERMINATED",
            "StateChangeReason": {
                "Code": "USER_REQUEST",
                "Message": "Terminated by user request"
            }
        }
    }
}
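The same two CLI steps can be combined in a short script. The following Python sketch assumes that emr_client is an Amazon EMR client from the AWS SDK for Python, for example one created with boto3.client("emr"); the function name is a hypothetical helper for this example.

```python
def terminate_and_report(emr_client, cluster_id):
    """Request termination, then return the state that Amazon EMR reports."""
    # Equivalent to: aws emr terminate-clusters --cluster-ids <myClusterId>
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # Equivalent to: aws emr describe-cluster --cluster-id <myClusterId>
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]
```

Immediately after the request, the returned state is likely to be TERMINATING, so you might call describe_cluster again after a few minutes to confirm TERMINATED.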

Delete S3 resources

To avoid additional charges, you should delete your Amazon S3 bucket. Deleting the bucket removes all of the Amazon S3 resources for this tutorial. Your bucket should contain:

• The PySpark script

• The input dataset

• Your output results folder

• Your log files folder


You might need to take extra steps to delete stored files if you saved your PySpark script or output in a different location.

Note

Your cluster must be terminated before you delete your bucket. Otherwise, you may not be allowed to empty the bucket.

To delete your bucket, follow the instructions in How do I delete an S3 bucket? in the Amazon Simple Storage Service User Guide.
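
As an alternative to the console instructions, the AWS CLI command aws s3 rb with the --force option removes the objects in a bucket and then the bucket itself. The following sketch only assembles that command and, by default, prints it rather than running it. The helper names and the bucket name DOC-EXAMPLE-BUCKET are hypothetical placeholders. Note that --force does not remove object versions, so a bucket with versioning enabled may need the extra steps described in the Amazon S3 documentation.

```python
import subprocess
from typing import List


def delete_bucket_command(bucket_name: str) -> List[str]:
    """Build the argv for `aws s3 rb s3://<bucket> --force`."""
    # Expect a bare bucket name, not a URI or a key path.
    if not bucket_name or "/" in bucket_name:
        raise ValueError("expected a bare bucket name, e.g. DOC-EXAMPLE-BUCKET")
    return ["aws", "s3", "rb", f"s3://{bucket_name}", "--force"]


def delete_bucket(bucket_name: str, dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it with the AWS CLI."""
    cmd = delete_bucket_command(bucket_name)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Requires the AWS CLI to be installed and configured with credentials.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run with a placeholder bucket name: nothing is deleted.
delete_bucket("DOC-EXAMPLE-BUCKET")
```

Keeping dry_run=True as the default is a deliberate choice in this sketch, because deleting a bucket cannot be undone. Pass dry_run=False only after the cluster is terminated and you have confirmed the bucket name.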

Next steps

You have now launched your first Amazon EMR cluster from start to finish. You have also completed essential EMR tasks like preparing and submitting big data applications, viewing results, and terminating a cluster.

Use the following topics to learn more about how you can customize your Amazon EMR workflow.

Explore big data applications for Amazon EMR

Discover and compare the big data applications you can install on a cluster in the Amazon EMR Release Guide. The Release Guide details each EMR release version and includes tips for using frameworks such as Spark and Hadoop on Amazon EMR.

Plan cluster hardware, networking, and security

In this tutorial, you created a simple EMR cluster without configuring advanced options. Advanced options let you specify Amazon EC2 instance types, cluster networking, and cluster security. For more information about planning and launching a cluster that meets your requirements, see Plan and configure clusters (p. 131) and Security in Amazon EMR (p. 251).

Manage clusters

Dive deeper into working with running clusters in Manage clusters (p. 436). To manage a cluster, you can connect to the cluster, debug steps, and track cluster activities and health. You can also adjust cluster resources in response to workload demands with EMR managed scaling (p. 493).

Use a different interface

In addition to the Amazon EMR console, you can manage Amazon EMR using the AWS Command Line Interface, the web service API, or one of the many supported AWS SDKs. For more information, see Management interfaces (p. 8).

You can also interact with applications installed on Amazon EMR clusters in many ways. Some applications like Apache Hadoop publish web interfaces that you can view. For more information, see View web interfaces hosted on Amazon EMR clusters (p. 482).

Browse the EMR technical blog

For sample walkthroughs and in-depth technical discussion of new Amazon EMR features, see the AWS big data blog.
