• 沒有找到結果。

When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster. In fact, you may have several clusters you work with, or you may have a local “pseudodistributed” cluster that you like to test on (a pseudodistributed cluster is one whose daemons all run on the local machine; setting up this mode is covered in Appendix A).

One way to accommodate these variations is to have Hadoop configuration files

containing the connection settings for each cluster you run against and specify which one you are using when you run Hadoop applications or tools. As a matter of best practice, it’s recommended to keep these files outside Hadoop’s installation directory tree, as this

makes it easy to switch between Hadoop versions without duplicating or losing settings.

For the purposes of this book, we assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml (these are available in the example code for this book). Note that there is

nothing special about the names of these files; they are just convenient ways to package up some configuration settings. (Compare this to Table A-1 in Appendix A, which sets out the equivalent server-side configurations.)

The hadoop-local.xml file contains the default Hadoop configuration for the default filesystem and the local (in-JVM) framework for running MapReduce jobs:

<?xml version="1.0"?>

<configuration>

<property>

<name>fs.defaultFS</name>

<value>file:///</value>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>local</value>

</property>

</configuration>

The settings in hadoop-localhost.xml point to a namenode and a YARN resource manager both running on localhost:

<?xml version="1.0"?>

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost/</value>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<property>

<name>yarn.resourcemanager.address</name>

<value>localhost:8032</value>

</property>

</configuration>

Finally, hadoop-cluster.xml contains details of the cluster’s namenode and YARN resource manager addresses (in practice, you would name the file after the name of the cluster, rather than “cluster” as we have here):

<?xml version="1.0"?>

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://namenode/</value>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<property>

<name>yarn.resourcemanager.address</name>

<value>resourcemanager:8032</value>

</property>

</configuration>

You can add other configuration properties to these files as needed.

SETTING USER IDENTITY

The user identity that Hadoop uses for permissions in HDFS is determined by running the whoami command on the client system. Similarly, the group names are derived from the output of running groups.

If, however, your Hadoop user identity is different from the name of your user account on your client machine, you can explicitly set your Hadoop username by setting the HADOOP_USER_NAME environment variable. You can also override user group mappings by means of the hadoop.user.group.static.mapping.overrides configuration property. For example, dr.who=;preston=directors,inventors means that the dr.who user is in no groups, but

preston is in the directors and inventors groups.

You can set the user identity that the Hadoop web interfaces run as by setting the hadoop.http.staticuser.user

property. By default, it is dr.who, which is not a superuser, so system files are not accessible through the web interface.

Notice that, by default, there is no authentication with this system. See Security for how to use Kerberos authentication with Hadoop.

With this setup, it is easy to use any configuration with the -conf command-line switch.

For example, the following command shows a directory listing on the HDFS server running in pseudodistributed mode on localhost:

% hadoop fs -conf conf/hadoop-localhost.xml -ls . Found 2 items

drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 input drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 output

If you omit the -conf option, you pick up the Hadoop configuration in the etc/hadoop subdirectory under $HADOOP_HOME. Or, if HADOOP_CONF_DIR is set, Hadoop configuration files will be read from that location.

NOTE

Here’s an alternative way of managing configuration settings. Copy the etc/hadoop directory from your Hadoop installation to another location, place the *-site.xml configuration files there (with appropriate settings), and set the

HADOOP_CONF_DIR environment variable to the alternative location. The main advantage of this approach is that you don’t need to specify -conf for every command. It also allows you to isolate changes to files other than the Hadoop XML configuration files (e.g., log4j.properties) since the HADOOP_CONF_DIR directory has a copy of all the

configuration files (see Hadoop Configuration).

Tools that come with Hadoop support the -conf option, but it’s straightforward to make your programs (such as programs that run MapReduce jobs) support it, too, using the Tool

interface.

GenericOptionsParser, Tool, and ToolRunner

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop

command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more

convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally:

public interface Tool extends Configurable { int run(String [] args) throws Exception;

}

Example 6-4 shows a very simple implementation of Tool that prints the keys and values of all the properties in the Tool’s Configuration object.

Example 6-4. An example Tool implementation for printing the properties in a Configuration

public class ConfigurationPrinter extends Configured implements Tool {

static {

Configuration.addDefaultResource("hdfs-default.xml");

Configuration.addDefaultResource("hdfs-site.xml");

Configuration.addDefaultResource("yarn-default.xml");

Configuration.addDefaultResource("yarn-site.xml");

Configuration.addDefaultResource("mapred-default.xml");

Configuration.addDefaultResource("mapred-site.xml");

}

@Override

public int run(String[] args) throws Exception { Configuration conf = getConf();

for (Entry<String, String> entry: conf) {

System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());

}

return 0;

}

public static void main(String[] args) throws Exception {

int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);

System.exit(exitCode);

} }

We make ConfigurationPrinter a subclass of Configured, which is an implementation

Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable’s

getConf() method and then iterates over it, printing each property to standard output.

The static block makes sure that the HDFS, YARN, and MapReduce configurations are picked up, in addition to the core ones (which Configuration knows about already).

ConfigurationPrinter’s main() method does not invoke its own run() method directly.

Instead, we call ToolRunner’s static run() method, which takes care of creating a

Configuration object for the Tool before calling its run() method. ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and to set them on the Configuration instance. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following commands:

% mvn compile

% export HADOOP_CLASSPATH=target/classes/

% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \ | grep yarn.resourcemanager.address=

yarn.resourcemanager.address=localhost:8032

WHICH PROPERTIES CAN I SET?

ConfigurationPrinter is a useful tool for discovering what a property is set to in your environment. For a running daemon, like the namenode, you can see its configuration by viewing the /conf page on its web server. (See Table 10-6 to find port numbers.)

You can also see the default settings for all the public properties in Hadoop by looking in the share/doc directory of your Hadoop installation for files called core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml.

Each property has a description that explains what it is for and what values it can be set to.

The default settings files’ documentation can be found online at pages linked from

http://hadoop.apache.org/docs/current/ (look for the “Configuration” heading in the navigation). You can find the defaults for a particular Hadoop release by replacing current in the preceding URL with r<version> — for example, http://hadoop.apache.org/docs/r2.5.0/.

Be aware that some properties have no effect when set in the client configuration. For example, if you set

yarn.nodemanager.resource.memory-mb in your job submission with the expectation that it would change the amount of memory available to the node managers running your job, you would be disappointed, because this property is honored only if set in the node manager’s yarn-site.xml file. In general, you can tell the component where a property should be set by its name, so the fact that yarn.nodemanager.resource.memory-mb starts with

yarn.nodemanager gives you a clue that it can be set only for the node manager daemon. This is not a hard and fast rule, however, so in some cases you may need to resort to trial and error, or even to reading the source.

Configuration property names have changed in Hadoop 2 onward, in order to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode have been changed to have a dfs.namenode

prefix, so dfs.name.dir is now dfs.namenode.name.dir. Similarly, MapReduce properties have the mapreduce

prefix rather than the older mapred prefix, so mapred.job.name is now mapreduce.job.name.

This book uses the new property names to avoid deprecation warnings. The old property names still work, however, and they are often referred to in older documentation. You can find a table listing the deprecated property names and their replacements on the Hadoop website.

We discuss many of Hadoop’s most important configuration properties throughout this book.

GenericOptionsParser also allows you to set individual properties. For example:

% hadoop ConfigurationPrinter -D color=yellow | grep color color=yellow

Here, the -D option is used to set the configuration property with key color to the value

yellow. Options specified with -D take priority over properties from the configuration files. This is very useful because you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the

number of reducers for a MapReduce job via -D mapreduce.job.reduces=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.

The other options that GenericOptionsParser and ToolRunner support are listed in Table 6-1. You can find more on Hadoop’s configuration API in The Configuration API.

WARNING

Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and

ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas

GenericOptionsParser does allow whitespace.

JVM system properties are retrieved from the java.lang.System class, but Hadoop properties are accessible only from a Configuration object. So, the following command will print nothing, even though the color system property has been set (via HADOOP_OPTS), because the System class is not used by ConfigurationPrinter:

% HADOOP_OPTS='-Dcolor=yellow' \

hadoop ConfigurationPrinter | grep color

If you want to be able to set configuration through system properties, you need to mirror the system properties of interest in the configuration file. See Variable Expansion for further discussion.

Table 6-1. GenericOptionsParser and ToolRunner options

Option Description

-D property=value Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration and any properties set via the -conf option.

-conf filename ... Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties or to set a number of properties at once.

-fs uri Sets the default filesystem to the given URI. Shortcut for -D fs.defaultFS=uri.

-jt host:port Sets the YARN resource manager to the given host and port. (In Hadoop 1, it sets the jobtracker address, hence the option name.) Shortcut for -D yarn.resourcemanager.address=host:port.

-files

file1,file2,...

Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and makes them available to

MapReduce programs in the task’s working directory. (See Distributed Cache for more on the distributed cache mechanism for copying files to machines in the cluster.)

-archives

archive1,archive2,...

Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task’s working directory.

-libjars

jar1,jar2,...

Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and adds them to the MapReduce task’s classpath. This option is a useful way of shipping JAR files that a job is dependent on.