5.2 File formats

5.2.6 Hadoop input and output formats

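In addition to the formats Spark wraps, we can interact with any format that Hadoop supports. Spark supports both the old and the new Hadoop file APIs, which provides a great deal of flexibility.⁴

1. Loading with other Hadoop input formats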

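To read a file using the new Hadoop API, we need to tell Spark a few things. newAPIHadoopFile takes a path and three classes. The first class is the "format" class, which represents the input format. A similar function, hadoopFile(), exists for Hadoop input formats implemented with the older API. The second class is the class of the key, and the last class is the class of the value. If we need to specify additional Hadoop configuration properties, we can also pass in a conf object.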

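KeyValueTextInputFormat is one of the simplest Hadoop input formats and can be used to read key/value data from text files, as shown in Example 5-24. Each line is processed individually, with the key and the value separated by a tab character. This format ships with Hadoop, so we do not need to add any extra dependencies to the project to use it.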

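Example 5-24. Loading KeyValueTextInputFormat() with the old-style API in Scala

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat

// Read (Text, Text) pairs and convert them to plain Scala strings
val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map {
  case (x, y) => (x.toString, y.toString)
}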
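To make the splitting rule concrete, here is a minimal pure-Python sketch of how one line of such a file breaks into a pair. It assumes the separator is a single tab and that a line without a tab yields the whole line as the key and an empty value. The helper name split_key_value is hypothetical and is not part of Hadoop or Spark.

```python
def split_key_value(line, sep="\t"):
    """Split one text line into (key, value) at the first separator.

    Hypothetical helper for illustration only; it encodes an assumption
    about the parsing rule and is not Hadoop code.
    """
    key, _found, value = line.partition(sep)
    # When no separator is present, partition() returns the whole line
    # as the key and "" as the value.
    return (key, value)


lines = ["panda\t3", "coffee\tdark\troast", "lonely"]
pairs = [split_key_value(l) for l in lines]
print(pairs)
```

Only the first tab splits the line, so any later tabs stay inside the value.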

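We looked at loading JSON data by reading it as a text file and then parsing it, but we can also load JSON data using a custom Hadoop input format. This example needs some extra setup for compression, the details of which we skip here. Twitter's Elephant Bird package (https://github.com/twitter/elephant-bird) supports a large number of data formats, including JSON, Lucene, and Protocol Buffer-related formats. The package works with both the new and the old Hadoop file APIs. To illustrate how to use the new-style Hadoop API from Spark, Example 5-25 loads LZO-compressed JSON data with LzoJsonInputFormat.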

Example 5-25. Loading LZO-compressed JSON with Elephant Bird in Scala

val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
  classOf[LongWritable], classOf[MapWritable], conf)
// Each MapWritable in "input" represents a JSON object

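LZO support requires you to install the hadoop-lzo package and point Spark to its native libraries. If you installed the Debian package, adding --driver-library-path /usr/lib/hadoop/lib/native/ --driver-class-path /usr/lib/hadoop/lib/ to your spark-submit invocation is enough.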

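Reading a file with the old Hadoop API is almost identical in usage, except that we provide an old-style InputFormat class. Many of Spark's built-in convenience functions (such as sequenceFile()) are implemented with the old-style Hadoop API.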

2. Saving with Hadoop output formats

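We have already looked at SequenceFiles to some extent, but the Java API has no convenient function for saving a pair RDD. We use this case as an example of how to use the old-style Hadoop format APIs (see Example 5-26); the call for the new interface, saveAsNewAPIHadoopFile, is similar.

4: Hadoop added a new MapReduce API early in its lifetime, but some libraries still use the old one.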

Example 5-26. Saving a SequenceFile in Java

// Converts (String, Integer) pairs into Hadoop Writable types
public static class ConvertToWritableTypes implements
    PairFunction<Tuple2<String, Integer>, Text, IntWritable> {
  public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> record) {
    return new Tuple2<Text, IntWritable>(new Text(record._1), new IntWritable(record._2));
  }
}

JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(input);
JavaPairRDD<Text, IntWritable> result = rdd.mapToPair(new ConvertToWritableTypes());
result.saveAsHadoopFile(fileName, Text.class, IntWritable.class,
  SequenceFileOutputFormat.class);

3. Non-filesystem data sources

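In addition to the hadoopFile() and saveAsHadoopFile() family of functions, you can use hadoopDataset/saveAsHadoopDataSet and newAPIHadoopDataset/saveAsNewAPIHadoopDataset to access storage formats supported by Hadoop that are not filesystems. For example, many key/value stores, such as HBase and MongoDB, provide Hadoop input formats that read directly from the store. These formats are easy to use from Spark.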

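The hadoopDataset() family of functions takes only a Configuration object, on which you set the Hadoop properties needed to access the data source. You configure it the same way you would configure a Hadoop MapReduce job, so you can follow the instructions for accessing the data source from MapReduce and then pass the object to Spark. For example, Section 5.5.3 shows how to use newAPIHadoopDataset to read data from HBase.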

4. Example: protocol buffers

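Protocol buffers (PB for short, https://github.com/google/protobuf)⁵ were first developed at Google for internal remote procedure calls (RPCs) and have since been open sourced. PBs are structured data: the fields and their types must be clearly defined.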

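The type of a PB field is either a predefined type or another PB message. The predefined types include string, int32, enum, and others. This is by no means a complete introduction to PBs; if you are interested, visit the Protocol Buffers website (https://developers.google.com/protocol-buffers) to learn more.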

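Example 5-27 shows a simple PB format from which we read many VenueResponse objects. VenueResponse is a simple format with one repeated field, which holds another PB message with required fields, optional fields, and enumeration fields.

Example 5-27. Sample protocol buffer definition

message Venue {
  required int32 id = 1;
  // further fields of Venue are not shown in this excerpt
}
message VenueResponse {
  repeated Venue results = 1;
}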

Twitter's Elephant Bird library, which we used earlier to read JSON data, also supports reading and saving data in PB form. Example 5-28 writes out some Venue objects.

Example 5-28. Writing out protocol buffers with Elephant Bird in Scala

val job = new Job()
val conf = job.getConfiguration
LzoProtobufBlockOutputFormat.setClassConf(classOf[Places.Venue], conf)
// Build a single Venue message
val dnaLounge = Places.Venue.newBuilder()
dnaLounge.setId(1)
dnaLounge.setName("DNA Lounge")
dnaLounge.setType(Places.Venue.VenueType.CLUB)
val data = sc.parallelize(List(dnaLounge.build()))
// Wrap each message in a ProtobufWritable; the key is unused
val outputData = data.map { pb =>
  val protoWritable = ProtobufWritable.newInstance(classOf[Places.Venue])
  protoWritable.set(pb)
  (null, protoWritable)
}
outputData.saveAsNewAPIHadoopFile(outputFile, classOf[Text],
  classOf[ProtobufWritable[Places.Venue]],
  classOf[LzoProtobufBlockOutputFormat[ProtobufWritable[Places.Venue]]], conf)

The full version of this example is available in the source code for this book.

Compression options (surviving table row): gzip; speed: fast; effectiveness: high; Hadoop codec: org.apache.hadoop.io.compress.GzipCodec

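Although Spark's textFile() method can handle compressed input, it does not enable splitting, even when the input is compressed in a way that could be read in splits. If you need to read a single large compressed input, consider skipping Spark's wrapper and using newAPIHadoopFile or hadoopFile instead, specifying the correct compression codec.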

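Some input formats, such as SequenceFiles, allow us to compress only the values in key/value data, which can be useful for lookups. Other input formats have their own compression control: for example, many of the formats in Twitter's Elephant Bird package work with LZO-compressed data.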

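5.3 Filesystems

Spark can read from and write to many filesystems, and we can use them with any of the file formats we want.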

5.3.1 Local/"regular" filesystem

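Spark supports loading files from the local filesystem, but it requires that the files be available at the same path on all nodes in the cluster. Some network filesystems, such as NFS, AFS, and MapR's NFS layer, are exposed to the user as a regular filesystem. If your data is already in one of these systems, you only need to specify a file:// path as the input; Spark handles it as long as the filesystem is mounted at the same path on each node, as shown in Example 5-29.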

Example 5-29. Loading a compressed text file from the local filesystem in Scala

val rdd = sc.textFile("file:///home/holden/happypandas.gz")

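If the file is not already on all nodes in the cluster, you can load it locally in the driver program without going through the cluster and then call parallelize to distribute the contents to the workers. This approach can be slow, so the recommended method is to put the file in a shared filesystem such as HDFS, NFS, or S3.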

5.3.2 Amazon S3

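Amazon S3 is an increasingly popular option for storing large amounts of data. S3 is especially fast when the compute nodes run inside Amazon EC2, but performance can be much worse when the data has to be accessed over the public Internet.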

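To access S3 from Spark, first set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to your S3 credentials. You can create these credentials from the Amazon Web Services console.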

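Then pass a path starting with s3n://, of the form s3n://bucket/path-within-bucket, to Spark's file input methods. As with all other filesystems, Spark supports wildcards in S3 paths, such as s3n://bucket/my-Files/*.txt.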

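If you get an access permissions error from Amazon S3, make sure that the account for which you specified an access key has both "read" and "list" permissions on the bucket. Spark needs to list the objects in the bucket to find the data you want to read.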

5.3.3 HDFS

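The Hadoop Distributed File System (HDFS) is a widely used filesystem with which Spark works well. HDFS is designed to run on commodity hardware and to be resilient to node failure while providing high throughput. Spark and HDFS can be deployed on the same machines, and Spark can take advantage of this data locality to avoid network overhead.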

Using HDFS from Spark is as simple as specifying hdfs://master:port/path for the input and output paths.

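The HDFS protocol changes across Hadoop versions, so if you run a version of Spark that was compiled for a different Hadoop version, reads will fail. By default, Spark is built against Hadoop 1.0.4.⁷ If you build from source, you can set SPARK_HADOOP_VERSION= as an environment variable to build against a different version of Hadoop, or you can download a different precompiled version of Spark. You can determine the value to use from the output of hadoop version.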

5.4 Structured data with Spark SQL

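Spark SQL is a component added in Spark 1.0 that has quickly become a preferred way to work with structured and semistructured data in Spark. By structured data, we mean data that has a schema, that is, a set of fields shared by all data records. Spark SQL supports several structured data sources as input, and because Spark SQL knows the schema of the data, it can read only the fields you need from these sources. Chapter 9 covers Spark SQL in more detail; for now, we show how to use it to load data from a few common sources.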

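In all cases, we give Spark SQL a SQL query to run against the data source (selecting some fields or applying functions to fields), and we get back an RDD of Row objects, one per record. In Java and Scala, Row objects are accessed by column number. Each Row has a get() method that returns a general type that we can cast, as well as specific get() methods for common basic types (for example, getFloat(), getInt(), getLong(), getString(), getShort(), and getBoolean()). In Python, we can access elements with row[column_number] and row.column_name.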
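The two Python access patterns can be illustrated without a cluster. The sketch below uses collections.namedtuple as a stand-in for a Row; it is not the pyspark Row class, and the field names name and age, as well as the sample values, are assumptions made up for illustration.

```python
from collections import namedtuple

# Stand-in for a Spark SQL Row with two columns (assumed field names).
UserRow = namedtuple("UserRow", ["name", "age"])

# Made-up sample values, for illustration only.
row = UserRow(name="Holden", age=30)

by_position = row[0]   # analogous to row[column_number]
by_name = row.name     # analogous to row.column_name
print(by_position, by_name, row[1])
```

Positional access depends on the column order of the query, while access by name keeps working if columns are reordered.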

7: Since Spark 1.4.0, Spark's default Hadoop version has been 2.2.0. (Translator's note)

5.4.1 Apache Hive

Example 5-30. Creating a HiveContext and selecting data in Python

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print(firstRow.name)

Example 5-31. Creating a HiveContext and selecting data in Scala

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // Field 0 is the name field

Example 5-32. Creating a HiveContext and selecting data in Java

import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;

HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT name, age FROM users");
Row firstRow = rows.first();
System.out.println(firstRow.getString(0)); // Field 0 is the name field

Section 9.3.1 covers loading data from Hive in more detail.

5.4.2 JSON

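If you have JSON data with a consistent schema across records, Spark SQL can infer the schema and load the data as rows, which makes it simple to pull out the fields you need. To load JSON data, first create a HiveContext, just as when using Hive. (No installation of Hive is needed in this case, so you do not need a hive-site.xml file.) Then use the HiveContext.jsonFile method to get an RDD of Row objects for the whole file. Besides using the whole Row object, you can register the RDD as a table and select specific fields from it. For example, suppose we have a JSON file of tweets in the format shown in Example 5-33, one record per line.

Example 5-33. Sample tweets in JSON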

{"user": {"name": "Holden", "location": "San Francisco"}, "text": "Nice day out today"}

{"user": {"name": "Matei", "location": "Berkeley"}, "text": "Even nicer here :)"}

We can load this data and select just the username and text fields, as shown in Examples 5-34 through 5-36.

Example 5-34. JSON loading with Spark SQL in Python

tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
results = hiveCtx.sql("SELECT user.name, text FROM tweets")

Example 5-35. JSON loading with Spark SQL in Scala

val tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
val results = hiveCtx.sql("SELECT user.name, text FROM tweets")

Example 5-36. JSON loading with Spark SQL in Java

SchemaRDD tweets = hiveCtx.jsonFile(jsonFile);
tweets.registerTempTable("tweets");
SchemaRDD results = hiveCtx.sql("SELECT user.name, text FROM tweets");

Section 9.3.3 discusses loading JSON data with Spark SQL and accessing its schema in more detail.
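For intuition about what the query SELECT user.name, text FROM tweets returns, the following plain-Python sketch parses the two sample records from Example 5-33 with the standard json module and picks out the same two fields. It does not use Spark SQL, and the function name select_name_and_text is hypothetical.

```python
import json

# The two sample records from Example 5-33, one JSON object per line.
raw_lines = [
    '{"user": {"name": "Holden", "location": "San Francisco"}, "text": "Nice day out today"}',
    '{"user": {"name": "Matei", "location": "Berkeley"}, "text": "Even nicer here :)"}',
]


def select_name_and_text(line):
    """Return (user.name, text) for one JSON record (illustrative helper)."""
    record = json.loads(line)
    # "user.name" in the SQL query addresses the nested "name" field.
    return (record["user"]["name"], record["text"])


results = [select_name_and_text(line) for line in raw_lines]
for name, text in results:
    print(name, text)
```

The nested lookup record["user"]["name"] plays the role of the dotted path user.name in the SQL query.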

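In addition, Spark SQL supports far more than loading data, including querying the data, combining it in more complex ways with RDDs, and running custom functions over it, all of which we cover in Chapter 9.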

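5.5 Databases

Spark can access several commonly used database systems through their Hadoop connectors or through custom Spark connectors. This section shows four common connectors.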

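5.5.1 Java Database Connectivity

Spark can load data from any relational database that supports Java Database Connectivity (JDBC), including MySQL, Postgres, and other systems. To access this data, we construct an org.apache.spark.rdd.JdbcRDD and pass it our SparkContext along with the other parameters. Example 5-37 shows how to use JdbcRDD with a MySQL database.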

Example 5-37. JdbcRDD in Scala

// Opens a connection; called on each node
def createConnection() = {
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden")
}

// Converts one row of the result set into an (Int, String) pair
def extractValues(r: ResultSet) = {
  (r.getInt(1), r.getString(2))
}

val data = new JdbcRDD(sc,
  createConnection, "SELECT * FROM panda WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 3, numPartitions = 2, mapRow = extractValues)
println(data.collect().toList)

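• The last parameter is a function that converts each row of output from a java.sql.ResultSet (http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html) into a format that is useful for manipulating the data. In Example 5-37, we get (Int, String) pairs. If this parameter is left out, Spark automatically converts each row to an array of objects.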
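The lowerBound, upperBound, and numPartitions arguments in Example 5-37 determine which id range each partition's query covers. The sketch below shows one way an inclusive range can be divided evenly and substituted into the two ? placeholders. The exact arithmetic Spark uses internally is an assumption here, and split_range is a hypothetical helper.

```python
def split_range(lower_bound, upper_bound, num_partitions):
    """Divide the inclusive range [lower_bound, upper_bound] into
    num_partitions contiguous inclusive sub-ranges (illustrative only)."""
    length = upper_bound - lower_bound + 1
    ranges = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        ranges.append((start, end))
    return ranges


# Same bounds as Example 5-37: ids 1..3 split across 2 partitions.
for start, end in split_range(1, 3, 2):
    print("SELECT * FROM panda WHERE %d <= id AND id <= %d" % (start, end))
```

Because each sub-range becomes its own query, different machines can read different slices of the table in parallel.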

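5.5.2 Cassandra

Spark's Cassandra support has improved greatly since DataStax open sourced its Cassandra connector for Spark (https://github.com/datastax/spark-cassandra-connector). The connector is not currently part of Spark, so you need to add some extra dependencies to your build file to use it. Cassandra does not yet use Spark SQL, but it returns RDDs of CassandraRow objects, which share some methods with Spark SQL's Row object. The dependencies are shown in Examples 5-38 and 5-39. The Spark Cassandra connector is currently available only in Java and Scala.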

Example 5-38. sbt requirements for the Cassandra connector

"com.datastax.spark" %% "spark-cassandra-connector" % "1.0.0-rc5",

"com.datastax.spark" %% "spark-cassandra-connector-java" % "1.0.0-rc5"

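8: If you do not know how many records there are, you can run a count query manually first and use its result to choose the upperBound and lowerBound values.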

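Example 5-39. Maven requirements for the Cassandra connector

<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector</artifactId>
  <version>1.0.0-rc5</version>
</dependency>
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector-java</artifactId>
  <version>1.0.0-rc5</version>
</dependency>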

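Much like the Elasticsearch connector, the Cassandra connector reads a job property to determine which cluster to connect to. We set spark.cassandra.connection.host to point to our Cassandra cluster. If there is a username and password, we set them with spark.cassandra.auth.username and spark.cassandra.auth.password, respectively. Assuming you have only one Cassandra cluster to connect to, you can set this up when creating the SparkContext, as shown in Examples 5-40 and 5-41.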

Example 5-40. Setting the Cassandra property in Scala

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "hostname")
val sc = new SparkContext(conf)

Example 5-41. Setting the Cassandra property in Java

SparkConf conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", cassandraHost);
JavaSparkContext sc = new JavaSparkContext(
  sparkMaster, "basicquerycassandra", conf);

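The Datastax Cassandra connector uses Scala implicits to add functions to the SparkContext and to RDDs. Example 5-42 imports the implicit conversions and loads some data.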

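Example 5-42. Loading an entire table as an RDD in Scala

// Implicits that add functions to the SparkContext and RDDs
import com.datastax.spark.connector._

// Read the entire table as an RDD. Assumes the table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
val data = sc.cassandraTable("test", "kv")
// Print some basic stats on the value field
println(data.map(row => row.getInt("value")).stats())

Example 5-43. Loading an entire table as an RDD in Java

import static com.datastax.spark.connector.CassandraJavaUtil.javaFunctions;

// Read the entire table as an RDD. Assumes the table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
JavaRDD<CassandraRow> data = javaFunctions(sc).cassandraTable("test", "kv");
// Print some basic stats
System.out.println(data.mapToDouble(new DoubleFunction<CassandraRow>() {
  public double call(CassandraRow row) { return row.getInt("value"); }
}).stats());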

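In addition to loading an entire table, we can query subsets of the data. Adding a where clause to the cassandraTable() call restricts the data that is read, for example sc.cassandraTable(...).where("key=?", "panda").

The Cassandra connector supports saving several types of RDDs to Cassandra. We can directly save RDDs of CassandraRow objects, which is useful for copying data between tables. By specifying the column mapping, we can also save RDDs that are not in row form, such as tuples and lists, as shown in Example 5-44.

Example 5-44. Saving to Cassandra in Scala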

val rdd = sc.parallelize(List(Seq("moremagic", 1)))

rdd.saveToCassandra("test" , "kv", SomeColumns("key", "value"))

This section introduced the Cassandra connector only briefly. For more information, see the connector's GitHub page (https://github.com/datastax/spark-cassandra-connector).

5.5.3 HBase

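Spark can access HBase through its Hadoop input format, implemented in the org.apache.hadoop.hbase.mapreduce.TableInputFormat class. This input format returns key/value pairs, where the key is of type org.apache.hadoop.hbase.io.ImmutableBytesWritable and the value is of type org.apache.hadoop.hbase.client.Result. The Result class includes various methods for getting values by column, as described in its API documentation (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html).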

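To use Spark with HBase, call SparkContext.newAPIHadoopRDD with the correct input format, as shown for Scala in Example 5-45.

Example 5-45. Reading from HBase in Scala

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tablename") // which table to scan
val rdd = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])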

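TableInputFormat includes several settings that can be used to optimize reads from HBase, such as limiting the scan to a subset of columns and limiting the time range scanned. You can find these options in the TableInputFormat API documentation (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html) and set them on your HBaseConfiguration before passing it to Spark.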

5.5.4 Elasticsearch

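Spark can read and write data in Elasticsearch using Elasticsearch-Hadoop (https://github.com/elastic/elasticsearch-hadoop). Elasticsearch is an open source search system based on Lucene.

The Elasticsearch connector is quite different from the other connectors we have examined: it ignores the path information we provide and instead depends on configuration set on the SparkContext. The Elasticsearch OutputFormat connector also does not have the types that Spark's wrappers need, so we use saveAsHadoopDataSet instead, which means we need to set more properties by hand.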

Example 5-46. Elasticsearch output in Scala

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat")
jobConf.setOutputCommitter(classOf[FileOutputCommitter])
jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweets")
jobConf.set(ConfigurationOptions.ES_NODES, "localhost")
FileOutputFormat.setOutputPath(jobConf, new Path("-"))
output.saveAsHadoopDataset(jobConf)

Example 5-47. Elasticsearch input in Scala

def mapWritableToInput(in: MapWritable): Map[String, String] = {
  in.map { case (k, v) => (k.toString, v.toString) }.toMap
}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set(ConfigurationOptions.ES_RESOURCE_READ, args(1))
jobConf.set(ConfigurationOptions.ES_NODES, args(2))
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]], classOf[Object],
  classOf[MapWritable])
// Extract only the map
// Convert the MapWritable[Text, Text] to Map[String, String]
val tweets = currentTweets.map { case (key, value) => mapWritableToInput(value) }

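Compared with some of the other connectors, the Elasticsearch connector is a bit convoluted, but it serves as a useful reference for working with this kind of connector.

On the output side, Elasticsearch can infer mappings, but it occasionally infers the wrong data types, so if you store data types other than strings it is a good idea to set a mapping explicitly (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html).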

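5.6 Conclusion

With this chapter finished, you should be able to get your data into Spark and store the result of your computation in the form you want. We examined a number of different formats that data can use, along with their compression options and what they imply for how the data is processed. Now that we can load and save large datasets, the following chapters look at ways to write more efficient and more powerful Spark programs.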

Chapter 6

Advanced Spark Programming

6.1 Introduction

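This chapter introduces advanced Spark programming features that earlier chapters did not cover. It introduces two types of shared variables: accumulators and broadcast variables. Accumulators aggregate information, while broadcast variables efficiently distribute large values. Building on the existing RDD transformations, we introduce batch operations for tasks with a high setup cost, such as querying a database. To extend the range of tools available, this chapter also covers Spark's ways of interacting with external programs, such as scripts written in R.

Throughout this chapter we build a complete example application that uses ham radio operators' call logs as input. At a minimum, these logs include the call signs of the stations contacted. Call signs are assigned by country, and each country has its own range of call signs, so we can look up the corresponding country from a call sign. Some call logs also include the operators' physical location, which helps determine the distance involved. Example 6-1 shows a sample log entry. The book's sample code includes a list of call signs to look up in the call logs and process.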

Example 6-1. Sample call log entry in JSON, with some fields removed

{"address":"address here", "band":"40m","callsign":"KK6JLK","city":"SUNNYVALE",

"contactlat":"37.384733","contactlong":"-122.032164",

"county":"Santa Clara","dxcc":"291","fullname":"MATTHEW McPherrin",

"id":57779,"mode":"FM","mylat":"37.751952821","mylong":"-122.4208688735",...}

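The first kind of Spark feature we look at is shared variables, a special type of variable that can be used in Spark tasks. In this example, we use Spark's shared variables to count nonfatal error conditions and to distribute a large lookup table.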

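When a task takes a long time to set up, for example because it has to create a database connection or a random-number generator, it is useful to share that setup work across multiple data items.

Example 6-2. Accumulator empty line count in Python

file = sc.textFile(inputFile)
# Create an Accumulator[Int] initialized to 0
blankLines = sc.accumulator(0)

def extractCallSigns(line):
    global blankLines  # Make the global variable accessible
    if (line == ""):
        blankLines += 1
    return line.split(" ")

callSigns = file.flatMap(extractCallSigns)
callSigns.saveAsTextFile(outputDir + "/callsigns")
print("Blank lines: %d" % blankLines.value)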

Example 6-3. Accumulator empty line count in Scala

val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
val blankLines = sc.accumulator(0) // Create an Accumulator[Int] initialized to 0
val callSigns = file.flatMap(line => {
  if (line == "") {
    blankLines += 1 // Add to the accumulator
  }
  line.split(" ")
})
callSigns.saveAsTextFile("output.txt")
println("Blank lines: " + blankLines.value)

Example 6-4. Accumulator empty line count in Java

JavaRDD<String> rdd = sc.textFile(args[1]);
final Accumulator<Integer> blankLines = sc.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      if (line.equals("")) {
        blankLines.add(1);
      }
      return Arrays.asList(line.split(" "));
    }});
callSigns.saveAsTextFile("output.txt");
System.out.println("Blank lines: " + blankLines.value());

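In these examples, we create an Accumulator[Int] called blankLines and add 1 to it whenever we see a blank line in the input. After the transformation has been evaluated, we print the value held in the accumulator. In summary, accumulators work as follows:

• We create an accumulator in the driver by calling the SparkContext.accumulator(initialValue) method, which produces an accumulator holding an initial value. The return type is an org.apache.spark.Accumulator[T] object, where T is the type of initialValue.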
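The pattern of tasks adding to an accumulator and the driver reading the total can be mimicked in a few lines of plain Python. This is a conceptual sketch under stated assumptions, not Spark's implementation: SimpleAccumulator and run_task are hypothetical names, and the "tasks" run sequentially in one process.

```python
class SimpleAccumulator(object):
    """Toy accumulator: tasks add to private copies, the driver merges them."""

    def __init__(self, initial_value):
        self.value = initial_value

    def add(self, amount):
        self.value += amount


def run_task(lines):
    """Simulated task: counts blank lines in its partition using a local copy."""
    local = SimpleAccumulator(0)
    for line in lines:
        if line == "":
            local.add(1)
    return local.value  # the update that would be sent back to the driver


# Two made-up partitions of input lines.
partitions = [["KK6JLK", "", "W1AW"], ["", ""]]
blank_lines = SimpleAccumulator(0)      # created on the "driver"
for part in partitions:
    blank_lines.add(run_task(part))     # the driver merges each task's update
print("Blank lines: %d" % blank_lines.value)
```

Each simulated task only ever adds to its own copy, and the total becomes visible only after the driver has merged the per-task updates.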

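This kind of counting is especially handy when there are several values to keep track of, or when the same value needs to be increased in several places.

Example 6-5. Accumulator error count in Python

import re

# Create accumulators for validating call signs
validSignCount = sc.accumulator(0)
invalidSignCount = sc.accumulator(0)

def validateSign(sign):
    global validSignCount, invalidSignCount
    if re.match(r"\A\d?[a-zA-Z]{1,2}\d{1,4}[a-zA-Z]{1,3}\Z", sign):
        validSignCount += 1
        return True
    else:
        invalidSignCount += 1
        return False

# Count the number of times we contacted each call sign
validSigns = callSigns.filter(validateSign)
contactCount = validSigns.map(lambda sign: (sign, 1)).reduceByKey(lambda x, y: x + y)

# Force evaluation so the counters are populated
contactCount.count()
if invalidSignCount.value < 0.1 * validSignCount.value:
    contactCount.saveAsTextFile(outputDir + "/contactCount")
else:
    print("Too many errors: %d in %d" % (invalidSignCount.value, validSignCount.value))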
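The validation in the example above rests on a single regular expression, which can be tried on its own without Spark. The sketch below applies the same pattern to a few sample strings. The helper name is_valid_sign and the sample inputs other than KK6JLK (taken from Example 6-1) are chosen for illustration.

```python
import re

# Same pattern as in the example above: an optional leading digit,
# 1-2 letters, 1-4 digits, then 1-3 letters.
CALL_SIGN = re.compile(r"\A\d?[a-zA-Z]{1,2}\d{1,4}[a-zA-Z]{1,3}\Z")


def is_valid_sign(sign):
    """Return True when sign matches the call sign pattern."""
    return CALL_SIGN.match(sign) is not None


samples = ["KK6JLK", "W1AW", "", "12345", "hello world"]
for s in samples:
    print(repr(s), is_valid_sign(s))
```

The \A and \Z anchors make the pattern match the whole string, so a call sign embedded in longer text is rejected.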

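How are accumulators handled when a task runs more than once? For accumulators used in actions, Spark applies each task's update to each accumulator only once. Thus, if we want a counter that is reliable regardless of failures or repeated evaluation, we must put it inside an action such as foreach().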

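For accumulators used in RDD transformations, this guarantee does not hold. An accumulator update inside a transformation can happen more than once. For example, such an unintended repeated update occurs when a cached but infrequently used RDD is first evicted from the LRU cache and is later needed again. This forces the RDD to be recomputed from its lineage, with the side effect that accumulator updates in the transformations of that lineage are applied again and sent to the driver once more. Inside transformations, accumulators should therefore be used only for debugging.

Although future versions of Spark may change this behavior so that the update is counted only once, the current version (1.2.0) does apply repeated updates, so accumulators in transformations are best used only for debugging.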
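The double-counting effect can be reproduced in miniature without Spark. In the sketch below, a plain dictionary stands in for an accumulator and a list comprehension stands in for evaluating a transformation; rerunning the same transformation, as a recomputation from lineage would, applies the side effect a second time. All names and sample lines here are hypothetical.

```python
# A plain dict stands in for an accumulator updated inside a transformation.
counter = {"updates": 0}


def transform(line):
    """Simulated transformation with an accumulator-style side effect."""
    if line == "":
        counter["updates"] += 1
    return line.split(" ")


lines = ["KK6JLK W1AW", "", ""]

# First evaluation of the lineage.
first = [transform(l) for l in lines]
after_first = counter["updates"]

# The cached result is lost, so the same transformation runs again.
second = [transform(l) for l in lines]
after_second = counter["updates"]

print(after_first, after_second)
```

The data produced is identical both times, but the counter has doubled, which is why a count gathered this way is only suitable for debugging.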

6.2.2 Custom accumulators

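So far we have seen how to use one of Spark's accumulator types with addition: integers (Accumulator[Int]). Spark also supports accumulators of type Double, Long, and Float. In addition, Spark provides an API for defining custom accumulator types and custom aggregation operations, for example finding the maximum of the accumulated values instead of adding them. Custom accumulators need to extend AccumulatorParam, which is covered in the Spark API documentation (http://spark.apache.org/docs/latest/api/scala/index.html#package). Any operation can be used in place of numeric addition, as long as it is both commutative and associative. For example, instead of tracking the total, we can track the maximum value in the data.

An operation op is commutative if a op b = b op a for all values a and b.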
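The two properties can be spot-checked on sample values. The sketch below tests commutativity (a op b = b op a) and associativity, taken here in its standard meaning (a op b) op c = a op (b op c), for addition, max, and subtraction. The helper names are hypothetical, and passing on a finite sample is evidence rather than proof.

```python
from itertools import product
import operator


def is_commutative(op, values):
    """Check a op b == b op a for every pair drawn from values."""
    return all(op(a, b) == op(b, a) for a, b in product(values, repeat=2))


def is_associative(op, values):
    """Check (a op b) op c == a op (b op c) for every triple from values."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a, b, c in product(values, repeat=3))


sample = [-2, 0, 1, 5]
for name, op in [("add", operator.add), ("max", max), ("sub", operator.sub)]:
    print(name, is_commutative(op, sample), is_associative(op, sample))
```

Addition and max pass both checks, which is why either can serve as an accumulator operation; subtraction fails both, so its result would depend on the order in which task updates arrive.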
