Spark SQL Performance - Learning Spark: Lightning-Fast Big Data Analysis

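As mentioned at the start of this chapter, Spark SQL's higher-level query language and the additional type information it carries allow Spark SQL to process data more efficiently.

Spark SQL is not only for users who already know SQL. It makes conditional aggregate operations very easy, such as summing several columns at once (Example 9-40), without constructing the special objects described in Chapter 6.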

Example 9-40: Multiple sums in Spark SQL

SELECT SUM(user.favouritesCount), SUM(retweetCount), user.id FROM tweets GROUP BY user.id

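Spark SQL can use its knowledge of types to represent data more efficiently. When caching data, Spark SQL uses in-memory columnar storage. This takes less space in the cache and, when later queries depend on only a subset of the fields, also minimizes the amount of data read.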

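Predicate push-down lets Spark SQL move parts of a query down to the engine being queried. To read only certain records in plain Spark, the standard approach is to read the entire dataset and then run a filter over it. In Spark SQL, however, if the underlying data store supports retrieving only a subset of the records by key range or by some other restriction, Spark SQL can push the restrictions in the query down to the data store, which can greatly reduce the amount of data that has to be read.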

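Spark SQL's performance tuning options are listed in Table 9-2.

Table 9-2: Performance options in Spark SQL

spark.sql.codegen
    Default: false
    Usage: When true, Spark SQL compiles each query to Java bytecode at run time. This can improve performance for large queries, but can slow down very short queries.

spark.sql.inMemoryColumnarStorage.compressed
    Default: false
    Usage: Automatically compress the in-memory columnar storage.

spark.sql.inMemoryColumnarStorage.batchSize
    Default: 1000
    Usage: The batch size for columnar caching. Larger values may cause out-of-memory problems.

spark.sql.parquet.compression.codec
    Default: snappy
    Usage: Which compression codec to use. Possible values are uncompressed, snappy, gzip, and lzo.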

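When using the JDBC connector and the Beeline shell, these performance options, and other options, can be set with the set command, as shown in Example 9-41.

Example 9-41: Beeline command for enabling codegen
beeline> set spark.sql.codegen=true;
SET spark.sql.codegen=true
spark.sql.codegen=true
Time taken: 1.196 seconds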

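In a traditional Spark SQL application, these properties can be set on the Spark configuration instead, as shown in Example 9-42.

Example 9-42: Scala code for enabling codegen
conf.set("spark.sql.codegen", "true")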

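A few options deserve special attention. The first is spark.sql.codegen, which makes Spark SQL compile each query to Java bytecode before running it. Because it generates code specialized for the query, codegen can make long queries or frequently repeated queries substantially faster. However, for very short ad hoc queries (1 to 2 seconds), codegen may add overhead, because it has to run a compiler for each query.⁵ Codegen is still experimental, but it is worth trying for any workload with large queries or with the same query repeated over and over.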

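The second option that may need tuning is spark.sql.inMemoryColumnarStorage.batchSize. When caching SchemaRDDs, Spark SQL groups the records in the RDD into batches of the size given by this option (default: 1000) and compresses each batch. Very small batch sizes lead to low compression, while very large batch sizes can also cause problems, because each batch may be too large to build up in memory. If the rows in a table are large (for example, they contain hundreds of fields, or string fields that can be very long, such as web pages), the batch size may need to be lowered to avoid out-of-memory (OOM) errors. Otherwise the default batch size is likely fine, since compressing more than 1,000 records at a time yields little additional benefit.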

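9.7 Conclusion

This chapter showed how to use Spark SQL to process structured and semistructured data with Spark. Beyond the queries explored here, the RDD tools from Chapter 3 through Chapter 6 also apply to the SchemaRDDs in Spark SQL. It is often convenient to combine SQL with code written in other programming languages, taking advantage of the conciseness of SQL and the ability of a programming language to express more complex logic. When Spark SQL is used this way, the Spark execution engine can also apply optimizations based on the schema of the data.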

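5: Note that the first few runs with codegen enabled will be especially slow, because Spark SQL needs to initialize its compiler. Run four or five queries before measuring the overhead of codegen.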

Chapter 10

Spark Streaming

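Many applications benefit from acting on data as soon as it arrives: an application that tracks page-view statistics in real time, one that trains a machine learning model, or one that detects anomalies automatically. Spark Streaming is Spark's module for such applications. It lets users write streaming applications with an API very similar to the one for batch jobs, so that much of the skill and even code built for batch applications can be reused.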

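Just as Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called the discretized stream, or DStream. A DStream is a sequence of data arriving over time. Internally, the data received during each time interval is stored as an RDD, and the DStream is the sequence of these RDDs (hence the name "discretized"). DStreams can be created from a variety of input sources, such as Flume, Kafka, or HDFS. Once created, DStreams support two types of operations: transformations, which produce a new DStream, and output operations, which write data to an external system. DStreams provide many of the operations available on RDDs, plus new operations related to time, such as sliding windows.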
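The idea of a discretized stream can be illustrated without Spark at all. The following is a minimal plain-Java sketch, not Spark's implementation: the class name DiscretizeSketch and its discretize() method are hypothetical, and the sketch only shows how timestamped records fall into fixed-length batches. Unlike a real DStream, it does not emit an empty batch for an interval in which nothing arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "discretizing" a stream; not part of any Spark API.
final class DiscretizeSketch {
    // Groups records into batches keyed by the start time of their batch interval.
    static Map<Long, List<String>> discretize(long[] timestampsMs, String[] records,
                                              long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < records.length; i++) {
            // Round the arrival time down to the start of its interval.
            long batchStart = (timestampsMs[i] / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(records[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] arrivalTimes = {100, 900, 1200, 2500, 2999};
        String[] records = {"a", "b", "c", "d", "e"};
        // With a 1-second batch interval this prints {0=[a, b], 1000=[c], 2000=[d, e]}
        System.out.println(discretize(arrivalTimes, records, 1000));
    }
}
```

Each map entry plays the role of one RDD in the sequence: a transformation on the stream amounts to applying the same operation to every entry.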

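Unlike batch programs, Spark Streaming applications need additional setup in order to run 24/7. This chapter discusses checkpointing, the main mechanism Spark Streaming provides for this purpose, which stores data in a reliable filesystem such as HDFS. It also covers how to restart an application after a failure and how to have applications restart automatically.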

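Finally, as of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental Python support was added in Spark 1.2, though it supports only text data. This chapter uses Java and Scala to present the full API, but similar concepts apply in Python.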

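10.1 A Simple Example

Example 10-1: Maven coordinates for Spark Streaming
groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.2.0

We start by creating a StreamingContext, the main entry point for streaming functionality. Creating it also sets up an underlying SparkContext, which is used to process the data. Its constructor takes a batch interval that specifies how often to process new data; here it is set to 1 second. Next, socketTextStream() creates a DStream from text data received on port 7777 of the local machine. The DStream is then transformed with filter() to keep only the lines that contain error. Finally, the output operation print() prints some of the filtered lines, as shown in Examples 10-4 and 10-5.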

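Example 10-4: Streaming filter for printing lines containing "error" in Scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream from the data received after connecting to port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)
// Filter the DStream for lines containing "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines containing "error"
errorLines.print()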

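Example 10-5: Streaming filter for printing lines containing "error" in Java
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter the DStream for lines containing "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String line) {
    return line.contains("error");
  }});
// Print out the lines containing "error"
errorLines.print();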

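This only sets up the computation that will be performed once the system receives data. To start receiving data, start() must be called explicitly on the StreamingContext. Spark Streaming then starts scheduling Spark jobs on the underlying SparkContext. This happens in a separate thread, so the application also needs to call awaitTermination() to wait for the streaming computation to finish and keep the application from exiting.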

$ spark-submit --class com.oreilly.learningsparkexamples.scala.StreamingLogInput \

$ASSEMBLY_JAR local[4]

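$ nc localhost 7777   # Lets you type input lines to send to the server
<your input here>

Windows users can use the ncat command (http://nmap.org/ncat/) in place of nc. ncat is part of the nmap toolset (http://nmap.org/).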

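The rest of this chapter builds on this example to process Apache logfiles. To generate fake logs, run the script ./bin/fakelogs.sh or ./bin/fakelogs.cmd from this book's Git repository; it sends logs to port 7777.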

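10.2 Architecture and Abstraction

Figure 10-1: High-level architecture of Spark Streaming

The programming abstraction in Spark Streaming is the discretized stream, or DStream (Figure 10-2).

DStreams can be created from external input sources or by applying transformations to other DStreams. DStreams support many of the transformations that Chapter 3 introduced for RDDs.

-------------------------------------------
Time: 1413833674000 ms
-------------------------------------------
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error78978 HTTP/1.1" 404 505
...

-------------------------------------------
Time: 1413833675000 ms
-------------------------------------------
71.19.164.174 - - [24/Sep/2014:22:27:10 +0000] "GET /error78978 HTTP/1.1" 404 505
...

Apart from transformations, DStreams support output operations, such as the print() used in the example.

Figure 10-5: Execution of Spark Streaming within Spark's components

Spark Streaming offers DStreams the same fault-tolerance properties that Spark offers RDDs.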

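10.3 Transformations

In Scala, DStream transformations on key/value pairs are available only after importing StreamingContext._. As with RDDs, in Java it is necessary to create a JavaPairDStream with mapToPair() before they can be used.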

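Table 10-1: Examples of stateless DStream transformations (incomplete list)

flatMap()
    Scala example: ds.flatMap(x => x.split(" "))
    Signature of user-supplied function on DStream[T]: f: T -> Iterable[U]

filter()
    Purpose: Return a DStream made up of only the elements of the given DStream that pass the condition.
    Scala example: ds.filter(x => x != 1)
    Signature of user-supplied function on DStream[T]: f: T -> Boolean

repartition()
    Purpose: Change the number of partitions of the DStream.
    Scala example: ds.repartition(10)
    Signature of user-supplied function on DStream[T]: N/A

reduceByKey()
    Purpose: Combine the records with the same key in each batch.
    Scala example: ds.reduceByKey((x, y) => x + y)
    Signature of user-supplied function on DStream[T]: f: T, T -> T

groupByKey()
    Purpose: Group the records in each batch by key.
    Scala example: ds.groupByKey()
    Signature of user-supplied function on DStream[T]: N/A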

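Keep in mind that although these functions look as if they apply to the whole stream, a DStream is in fact made up of many RDDs (batches), and a stateless transformation is applied to each RDD separately.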

Example 10-10: map() and reduceByKey() on a DStream in Scala
// ApacheAccessLog is a utility class for parsing entries from Apache logs
val accessLogsDStream = logData.map(line => ApacheAccessLog.parseFromLogLine(line))
val ipDStream = accessLogsDStream.map(entry => (entry.getIpAddress(), 1))
val ipCountsDStream = ipDStream.reduceByKey((x, y) => x + y)

Example 10-11: map() and reduceByKey() on a DStream in Java

// ApacheAccessLog is a utility class for parsing entries from Apache logs
static final class IpTuple implements PairFunction<ApacheAccessLog, String, Long> {
  public Tuple2<String, Long> call(ApacheAccessLog log) {
    return new Tuple2<>(log.getIpAddress(), 1L);
  }
}

JavaDStream<ApacheAccessLog> accessLogsDStream = logData.map(new ParseFromLogLine());

JavaPairDStream<String, Long> ipDStream = accessLogsDStream.mapToPair(new IpTuple());

JavaPairDStream<String, Long> ipCountsDStream = ipDStream.reduceByKey(new LongSumReducer());
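To make the "per batch" behavior concrete, here is a small plain-Java sketch that does not use Spark. The class PerBatchCountSketch and its methods are hypothetical names for illustration; each inner list stands in for the RDD of one batch, and the count is computed for each batch on its own, with nothing carried over from earlier batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a stateless, per-batch reduceByKey; not a Spark API.
final class PerBatchCountSketch {
    // Counts occurrences of each key within ONE batch.
    static Map<String, Long> countByKey(List<String> batch) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : batch) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Applies the per-batch count to every batch independently.
    static List<Map<String, Long>> countEachBatch(List<List<String>> batches) {
        List<Map<String, Long>> results = new ArrayList<>();
        for (List<String> batch : batches) {
            results.add(countByKey(batch));
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.1"),
            Arrays.asList("10.0.0.1"));
        // Prints [{10.0.0.1=2, 10.0.0.2=1}, {10.0.0.1=1}]
        System.out.println(countEachBatch(batches));
    }
}
```

The address 10.0.0.1 is counted as 2 in the first batch and 1 in the second; a running total of 3 would require a stateful transformation.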

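Stateless transformations can also combine data from multiple DStreams, again within each time interval. For example, key/value DStreams have the same join-related transformations as RDDs: cogroup(), join(), leftOuterJoin(), and so on (see Section 4.3.3). Using these operations on DStreams runs the corresponding RDD operation separately on each batch.

Consider a join between two DStreams. In Examples 10-12 and 10-13, the data is keyed by IP address, and the request counts are joined with the number of bytes transferred.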

Example 10-12: Joining two DStreams in Scala
val ipBytesDStream =
  accessLogsDStream.map(entry => (entry.getIpAddress(), entry.getContentSize()))
val ipBytesSumDStream =
  ipBytesDStream.reduceByKey((x, y) => x + y)
val ipBytesRequestCountDStream =
  ipCountsDStream.join(ipBytesSumDStream)

Example 10-13: Joining two DStreams in Java

JavaPairDStream<String, Long> ipBytesDStream = accessLogsDStream.mapToPair(new IpContentTuple());

JavaPairDStream<String, Long> ipBytesSumDStream = ipBytesDStream.reduceByKey(new LongSumReducer());

JavaPairDStream<String, Tuple2<Long, Long>> ipBytesRequestCountDStream = ipCountsDStream.join(ipBytesSumDStream);

As in regular Spark, the union() operator on a DStream merges its contents with those of another DStream, and StreamingContext.union() merges multiple streams.

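Finally, for cases where these stateless transformations are not enough, DStreams provide an advanced operator called transform(), which operates directly on the RDDs inside them. transform() takes any function from RDD to RDD. The function is called on each batch of data in the stream to produce a new stream. A common use of transform() is to reuse batch processing code written for RDDs. For example, given a function extractOutliers() that extracts outliers from an RDD of log records (perhaps after computing some statistics on the messages), it can be reused within transform(), as shown in Examples 10-14 and 10-15.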

Example 10-14: transform() on a DStream in Scala
val outlierDStream = accessLogsDStream.transform { rdd =>
  extractOutliers(rdd)
}

Example 10-15: transform() on a DStream in Java
JavaDStream<ApacheAccessLog> outlierDStream = accessLogsDStream.transform(
  new Function<JavaRDD<ApacheAccessLog>, JavaRDD<ApacheAccessLog>>() {
    // The declared types now agree: the function maps a JavaRDD to a JavaRDD
    public JavaRDD<ApacheAccessLog> call(JavaRDD<ApacheAccessLog> rdd) {
      return extractOutliers(rdd);
    }
  });

Data from multiple DStreams can also be combined and transformed together using StreamingContext.transform or DStream.transformWith(otherStream, func).

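10.3.2 Stateful Transformations

Stateful transformations are operations on DStreams that track data across time; that is, data from previous batches is used to compute the results of a new batch. The two main types are windowed operations, which act over a sliding window of time periods, and updateStateByKey(), which tracks state across events for each key (for example, to build up an object representing each user session).

Stateful transformations require checkpointing to be enabled in the StreamingContext for fault tolerance. Section 10.6 discusses checkpointing in more detail; for now, it can be enabled by passing a directory to ssc.checkpoint(), as shown in Example 10-16.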

Example 10-16: Setting up checkpointing
ssc.checkpoint("hdfs://...")

For local development, a local path (e.g., /tmp) can be used instead of HDFS.

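Windowed operations compute results over a longer time period than the StreamingContext's batch interval by combining the results of multiple batches. This section shows how to use them to track the most common response codes, content sizes, and clients in a web server access log.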

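All windowed operations need two parameters, the window duration and the sliding duration, both of which must be a multiple of the StreamingContext's batch interval. The window duration controls how many previous batches of data are considered, namely the last windowDuration/batchInterval batches. For a source DStream with a batch interval of 10 seconds, a sliding window covering the last 30 seconds (the last 3 batches) is created by setting windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval, controls how often the new DStream computes results. For a source DStream with a batch interval of 10 seconds, computing the window only on every second batch means setting the sliding duration to 20 seconds. Figure 10-6 shows an example.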
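The arithmetic in this paragraph can be checked with a short plain-Java sketch. WindowMathSketch and its methods are hypothetical helpers, not part of Spark. The sketch assumes batches are numbered from 0 and that a result is produced at the end of every slide, so the earliest window may cover fewer batches than a full window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reasoning about window and slide sizes; not a Spark API.
final class WindowMathSketch {
    // Converts a duration into a number of batches, enforcing the "multiple of the
    // batch interval" rule that windowed operations require.
    static int batchesIn(long durationMs, long batchIntervalMs) {
        if (batchIntervalMs <= 0 || durationMs <= 0 || durationMs % batchIntervalMs != 0) {
            throw new IllegalArgumentException(
                "duration must be a positive multiple of the batch interval");
        }
        return (int) (durationMs / batchIntervalMs);
    }

    // Lists the inclusive batch ranges "first..last" covered by each computed window.
    static List<String> windows(int totalBatches, int windowBatches, int slideBatches) {
        List<String> ranges = new ArrayList<>();
        for (int end = slideBatches - 1; end < totalBatches; end += slideBatches) {
            int start = Math.max(0, end - windowBatches + 1);
            ranges.add(start + ".." + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        int window = batchesIn(30_000, 10_000); // 30 s window over 10 s batches: 3
        int slide = batchesIn(20_000, 10_000);  // 20 s slide over 10 s batches: 2
        // Prints [0..1, 1..3, 3..5]
        System.out.println(windows(6, window, slide));
    }
}
```

With a window of 3 batches and a slide of 2, consecutive windows overlap by one batch, which is the situation Figure 10-6 depicts.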

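The simplest window operation on a DStream is window(), which returns a new DStream with the data for the requested window. In other words, each RDD in the DStream produced by window() contains data from multiple batches, which can then be processed with count(), transform(), and so on (see Examples 10-17 and 10-18).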

Figure 10-6: A windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two batches, a result is computed over the previous three batches

Example 10-17: Using window() to count data over a window in Scala
val accessLogsWindow = accessLogsDStream.window(Seconds(30), Seconds(10))
val windowCounts = accessLogsWindow.count()

Example 10-18: Using window() to count data over a window in Java
JavaDStream<ApacheAccessLog> accessLogsWindow = accessLogsDStream.window(
  Durations.seconds(30), Durations.seconds(10));
// count() on a JavaDStream yields a JavaDStream<Long>
JavaDStream<Long> windowCounts = accessLogsWindow.count();

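Although any window operation can be built on top of window(), Spark Streaming provides a number of other windowed operations for efficiency and convenience. First, reduceByWindow() and reduceByKeyAndWindow() perform reductions on each window more efficiently. They take a single reduce function to run on the whole window, such as +. In addition, they have a special form that lets Spark compute the reduction incrementally, by considering only the data entering the window and the data leaving it. This special form requires an inverse of the reduce function, such as - for +. For large windows, having an inverse function makes execution much more efficient (see Figure 10-7).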
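The saving from an inverse function can be seen in a plain-Java sketch that does not use Spark. IncrementalWindowSketch is a hypothetical class; it takes one already-reduced value per batch, uses + as the reduce function and - as its inverse, and assumes a slide of one batch. The naive version re-adds every batch in the window, while the incremental version adds the entering batch and subtracts the leaving one.

```java
import java.util.Arrays;

// Hypothetical illustration of naive vs. incremental window reduction; not a Spark API.
final class IncrementalWindowSketch {
    // Recomputes each window from scratch: O(windowBatches) work per result.
    static long[] naive(long[] batchSums, int windowBatches) {
        long[] out = new long[batchSums.length];
        for (int end = 0; end < batchSums.length; end++) {
            long sum = 0;
            for (int i = Math.max(0, end - windowBatches + 1); i <= end; i++) {
                sum += batchSums[i];
            }
            out[end] = sum;
        }
        return out;
    }

    // Updates the previous result: O(1) work per result, thanks to the inverse (-).
    static long[] incremental(long[] batchSums, int windowBatches) {
        long[] out = new long[batchSums.length];
        long running = 0;
        for (int end = 0; end < batchSums.length; end++) {
            running += batchSums[end];                     // batch entering the window
            if (end >= windowBatches) {
                running -= batchSums[end - windowBatches]; // batch leaving the window
            }
            out[end] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] batchSums = {1, 2, 3, 4, 5};
        // Both lines print [1, 3, 6, 9, 12]
        System.out.println(Arrays.toString(naive(batchSums, 3)));
        System.out.println(Arrays.toString(incremental(batchSums, 3)));
    }
}
```

The two versions agree only because subtraction exactly undoes addition; a reduce function such as max has no inverse, so it cannot use the incremental form.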

Figure 10-7: Difference between naive reduceByWindow() and incremental reduceByWindow() using an inverse function

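In the log processing example, these two functions can be used to count visits per IP address more efficiently, as shown in Examples 10-19 and 10-20.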

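Example 10-19: Visit counts per IP address in Scala
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
  {(x, y) => x + y}, // Add elements from the new batches entering the window
  {(x, y) => x - y}, // Remove elements from the oldest batches leaving the window
  Seconds(30),       // Window duration
  Seconds(10))       // Slide duration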

Example 10-20: Visit counts per IP address in Java
class ExtractIp implements PairFunction<ApacheAccessLog, String, Long> {
  public Tuple2<String, Long> call(ApacheAccessLog entry) {
    return new Tuple2<>(entry.getIpAddress(), 1L);
  }
}
class AddLongs implements Function2<Long, Long, Long> {
  public Long call(Long v1, Long v2) { return v1 + v2; }
}
class SubtractLongs implements Function2<Long, Long, Long> {
  public Long call(Long v1, Long v2) { return v1 - v2; }
}

JavaPairDStream<String, Long> ipAddressPairDStream = accessLogsDStream.mapToPair(
  new ExtractIp());
JavaPairDStream<String, Long> ipCountDStream = ipAddressPairDStream.
  reduceByKeyAndWindow(
    new AddLongs(),         // Add elements from the new batches entering the window
    new SubtractLongs(),    // Remove elements from the oldest batches leaving the window
    Durations.seconds(30),  // Window duration
    Durations.seconds(10)); // Slide duration

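Finally, for counting data, DStreams offer countByWindow() and countByValueAndWindow() as shorthands. countByWindow() returns a DStream representing the number of elements in each window, and countByValueAndWindow() returns a DStream with the count for each value in the window, as shown in Examples 10-21 and 10-22.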

Example 10-21: Windowed count operations in Scala
val ipDStream = accessLogsDStream.map{entry => entry.getIpAddress()}
val ipAddressRequestCount = ipDStream.countByValueAndWindow(Seconds(30), Seconds(10))
val requestCount = accessLogsDStream.countByWindow(Seconds(30), Seconds(10))

Example 10-22: Windowed count operations in Java

JavaDStream<String> ip = accessLogsDStream.map(
  new Function<ApacheAccessLog, String>() {
    public String call(ApacheAccessLog entry) {
      return entry.getIpAddress();
    }});
JavaDStream<Long> requestCount = accessLogsDStream.countByWindow(
  Durations.seconds(30), Durations.seconds(10));
JavaPairDStream<String, Long> ipAddressRequestCount = ip.countByValueAndWindow(
  Durations.seconds(30), Durations.seconds(10));

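UpdateStateByKey transformation

With updateStateByKey(), it is possible, for example, to track the last 10 pages each user visited. That list is the "state" object, and it is updated as each event arrives.

To use updateStateByKey(), provide a function update(events, oldState) that takes the events that arrived for a key together with its previous state and returns the new state to store for that key. The function's signature is as follows:

• events is a list of the events that arrived in the current batch (it may be empty).

• oldState is an optional state object, stored in an Option; it may be missing if there was no previous state for the key.

• newState, returned by the function, is also an Option; returning an empty Option indicates that the state should be deleted.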
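The contract of the update function can be simulated in plain Java without Spark. In this sketch, RunningCountSketch, update(), and applyBatch() are hypothetical names, and java.util.Optional stands in for the Option type described above (the Java listing in this chapter uses a different Optional class with an or() method). The sketch assumes the update function is invoked for every key that has either new events or existing state, and that an empty result removes the key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical simulation of per-key state updates across batches; not a Spark API.
final class RunningCountSketch {
    // update(events, oldState): add the number of new events to the previous count.
    static Optional<Long> update(List<Long> events, Optional<Long> oldState) {
        return Optional.of(oldState.orElse(0L) + events.size());
    }

    // Applies update() to every key seen in the old state or in the new batch.
    static Map<Integer, Long> applyBatch(Map<Integer, Long> state,
                                         Map<Integer, List<Long>> batch) {
        Map<Integer, Long> next = new TreeMap<>();
        Set<Integer> keys = new TreeSet<>(state.keySet());
        keys.addAll(batch.keySet());
        for (Integer key : keys) {
            List<Long> events = batch.getOrDefault(key, Collections.<Long>emptyList());
            // An empty Optional would drop the key from the new state.
            update(events, Optional.ofNullable(state.get(key)))
                .ifPresent(value -> next.put(key, value));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<Integer, Long> state = new TreeMap<>();

        Map<Integer, List<Long>> batch1 = new TreeMap<>();
        batch1.put(200, Arrays.asList(1L, 1L));
        batch1.put(404, Arrays.asList(1L));
        state = applyBatch(state, batch1);
        System.out.println(state); // {200=2, 404=1}

        Map<Integer, List<Long>> batch2 = new TreeMap<>();
        batch2.put(200, Arrays.asList(1L));
        state = applyBatch(state, batch2);
        System.out.println(state); // {200=3, 404=1}
    }
}
```

Response code 404 keeps its count of 1 in the second batch even though no new 404 events arrived, because its old state is passed through update() with an empty event list.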

Example 10-23: Running count of response codes using updateStateByKey() in Scala
def updateRunningSum(values: Seq[Long], state: Option[Long]) = {
  Some(state.getOrElse(0L) + values.size)
}

val responseCodeDStream = accessLogsDStream.map(log => (log.getResponseCode(), 1L))
val responseCodeCountDStream = responseCodeDStream.updateStateByKey(updateRunningSum _)

Example 10-24: Running count of response codes using updateStateByKey() in Java

class UpdateRunningSum implements Function2<List<Long>, Optional<Long>, Optional<Long>> {
  public Optional<Long> call(List<Long> nums, Optional<Long> current) {
    long sum = current.or(0L);
    return Optional.of(sum + nums.size());
  }
};

JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogsDStream.mapToPair(
  new PairFunction<ApacheAccessLog, Integer, Long>() {
    public Tuple2<Integer, Long> call(ApacheAccessLog log) {
      return new Tuple2<>(log.getResponseCode(), 1L);
    }})
  .updateStateByKey(new UpdateRunningSum());

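10.4 Output Operations

Once the program has been debugged, output operations can be used to save its results. Spark Streaming has no built-in saveAsSequenceFile() function, for example, but DStreams can be saved as SequenceFiles as shown in Examples 10-26 and 10-27.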

Example 10-26: Saving SequenceFiles from a DStream in Scala
val writableIpAddressRequestCount = ipAddressRequestCount.map {
  // Pattern matching is needed to destructure the (ip, count) pair
  case (ip, count) => (new Text(ip), new LongWritable(count)) }
writableIpAddressRequestCount.saveAsHadoopFiles[
  SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")

Example 10-27: Saving SequenceFiles from a DStream in Java

JavaPairDStream<Text, LongWritable> writableDStream = ipDStream.mapToPair(
  new PairFunction<Tuple2<String, Long>, Text, LongWritable>() {
    public Tuple2<Text, LongWritable> call(Tuple2<String, Long> e) {
      return new Tuple2<>(new Text(e._1()), new LongWritable(e._2()));
    }});
class OutFormat extends SequenceFileOutputFormat<Text, LongWritable> {};
writableDStream.saveAsHadoopFiles(
  "outputDir", "txt", Text.class, LongWritable.class, OutFormat.class);

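Finally, foreachRDD() is a generic output operation that runs arbitrary computations on each RDD in the DStream.

ipAddressRequestCount.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Open a connection to the storage system (e.g., a database connection)
    // (push each item in the partition through the connection, then close it)
  }
}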

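10.5 Input Sources

Spark Streaming has built-in support for a number of different data sources. Some "core" sources are built into the Spark Streaming Maven artifact, while others are available through additional artifacts, such as spark-streaming-kafka.

This section walks through some of these sources. It assumes that each input source is already set up, and does not cover the non-Spark-specific components of any of these systems. When designing a new application, HDFS or Kafka is a simple input source to start with.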

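10.5.1 Core Sources

The methods for creating a DStream from the core sources are all available on the StreamingContext. One of these sources, sockets, already appeared in the example. Two more are discussed here: files and Akka actors.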

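Since Spark supports reading from any Hadoop-compatible filesystem, Spark Streaming naturally allows a stream to be created from files written to a directory of a Hadoop-compatible filesystem. This is a popular option because it works with a wide variety of backends, especially for log data that would be copied to HDFS anyway. For Spark Streaming to work with the data, the directory names need a consistent date format, and the files have to be created atomically (for example, by moving each file into the directory Spark is monitoring).² Examples 10-4 and 10-5 can be changed to handle new logfiles as they show up in a directory, as shown in Examples 10-29 and 10-30.

Example 10-29: Streaming text files written to a directory in Scala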

val logData = ssc.textFileStream(logDirectory)

Example 10-30: Streaming text files written to a directory in Java
JavaDStream<String> logData = jssc.textFileStream(logsDirectory);

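The provided ./bin/fakelogs_directory.sh script can be used to fake the logs. With real logs, the rotated logfiles can be moved into the monitored directory with an mv command.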
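For a local filesystem, the same "write elsewhere, then move" pattern can be sketched in plain Java. AtomicDropSketch and publish() are hypothetical names, and the directory and file names are made up for illustration. The sketch assumes the staging directory and the monitored directory live on the same filesystem; otherwise Files.move() with ATOMIC_MOVE may throw AtomicMoveNotSupportedException. It does not cover HDFS, where the move would go through the Hadoop filesystem tools instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: write a file completely, then move it into the watched directory.
final class AtomicDropSketch {
    static Path publish(Path stagingDir, Path watchedDir, String fileName,
                        List<String> lines) throws IOException {
        // 1. Write the whole file where the stream is NOT looking.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, lines, StandardCharsets.UTF_8);
        // 2. Move it into place in a single step, so a reader never sees a partial file.
        return Files.move(staged, watchedDir.resolve(fileName),
                          StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("logs-demo");
        Path staging = Files.createDirectory(root.resolve("staging"));
        Path watched = Files.createDirectory(root.resolve("incoming"));
        Path published = publish(staging, watched, "access-0001.log",
                                 Arrays.asList("GET /index.html 200"));
        System.out.println("published " + published.getFileName());
    }
}
```

Writing directly into the monitored directory would risk the file being picked up while it is still growing, which is the problem the atomic move avoids.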

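In addition to text data, any Hadoop input format can be read. As described in Section 5.2.6, this only requires providing Spark Streaming with the Key, Value, and InputFormat classes. For example, if a previous streaming job had processed the logs and saved the bytes transferred in each time interval as a SequenceFile, the data could be read as shown in Example 10-31.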

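2: Atomic means that the entire operation happens at once. This matters here because if Spark Streaming started processing a file and more data then appeared in it, Spark Streaming would not notice the added data. In filesystems, the file rename operation is typically atomic.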

Example 10-31: Streaming SequenceFiles written to a directory in Scala
ssc.fileStream[LongWritable, IntWritable,
  SequenceFileInputFormat[LongWritable, IntWritable]](inputDirectory).map {
  case (x, y) => (x.get(), y.get())
}

2. Akka actor

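The second core receiver is actorStream, which allows Akka actors (http://akka.io/) to be used as a source for streaming. To construct an actor stream, create an Akka actor and implement the org.apache.spark.streaming.receiver.ActorHelper interface. To copy input from the actor into Spark Streaming, call the actor's store() function when new data arrives. Akka actor streams are less common, so they are not covered in detail here; see the streaming documentation (http://spark.apache.org/docs/latest/streaming-custom-receivers.html) and the ActorWordCount example in Spark: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/ActorWordCount.

Additional receivers are available for sources such as Kafka, Amazon Kinesis, Apache Flume, and ZeroMQ. They can be included by adding the Maven artifact spark-streaming-[projectname]_2.10 with the same version number as Spark.

1. Apache Kafka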

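Apache Kafka (http://kafka.apache.org/) is a popular input source due to its speed and resilience. Using the native support for Kafka, messages on many topics can be processed easily. To use it, include the Maven artifact spark-streaming-kafka_2.10 in the project. The provided KafkaUtils object works on StreamingContext and JavaStreamingContext to create a DStream of Kafka messages. Since it can subscribe to multiple topics, the DStream it creates consists of pairs of topic and message. To create a stream, call the createStream() method with the streaming context, a string containing comma-separated ZooKeeper hosts, the name of the consumer group (a unique name), and a map of topics to the number of receiver threads to use for each topic (see Examples 10-32 and 10-33).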

Example 10-32: Apache Kafka subscribing to Panda's topic in Scala
import org.apache.spark.streaming.kafka._
...
// Create a map of topics to the number of receiver threads to use
val topics = List(("pandas", 1), ("logs", 1)).toMap
val topicLines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
StreamingLogInput.processLines(topicLines.map(_._2))

Example 10-33: Apache Kafka subscribing to Panda's topic in Java
import org.apache.spark.streaming.kafka.*;
