Beeline

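9.4 JDBC/ODBC Server

9.4.1 Working with Beeline

Within the Beeline client, you can use standard HiveQL commands to create, list, and query tables. The full details of HiveQL are in the Hive Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual); here we show only a few common operations.

First, to create a table from local data, use the CREATE TABLE command, followed by LOAD DATA to load the data. Hive easily supports loading text files with a fixed delimiter, such as CSV files, as well as other formats, as shown in Example 9-33.

Example 9-33: Loading a table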

> CREATE TABLE IF NOT EXISTS mytable (key INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

> LOAD DATA LOCAL INPATH 'learning-spark-examples/files/int_string.csv' INTO TABLE mytable;

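To list tables, use the SHOW TABLES statement, as shown in Example 9-34. You can also inspect the schema of each table with DESCRIBE tableName.

Example 9-34: Listing tables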

> SHOW TABLES;

mytable

Time taken: 0.052 seconds

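To cache a table, use CACHE TABLE tableName. You can later remove it from the cache with UNCACHE TABLE tableName. Note that, as mentioned earlier, cached tables are shared across all clients of this JDBC server.

Finally, Beeline makes it easy to view query plans. Run EXPLAIN on a query to see its execution plan, as shown in Example 9-35.

Example 9-35: Running EXPLAIN in the Spark SQL shell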

spark-sql> EXPLAIN SELECT * FROM mytable where key = 1;

== Physical Plan ==

Filter (key#16 = 1)

HiveTableScan [key#16,value#17], (MetastoreRelation default, mytable, None), None

Time taken: 0.551 seconds

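In this execution plan, Spark SQL applies a filter on top of a HiveTableScan.

From here you can also write SQL to query the data. The Beeline shell is well suited to quick data exploration on cached tables shared by multiple users.

9.4.2 Long-Lived Tables and Queries

One advantage of using Spark SQL's JDBC server is that multiple programs can share cached tables. This is possible because the JDBC Thrift server is a single driver program. As described in the previous section, you only need to register the table and run the CACHE command on it to make use of the cache.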

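Standalone Spark SQL Shell

Apart from its JDBC server, Spark SQL also supports a simple shell that runs as a single process, started with ./bin/spark-sql. This shell connects to the Hive metastore configured in conf/hive-site.xml. If no metastore exists, Spark SQL creates a new one locally. The shell is most useful for local development. On a shared cluster, you should use the JDBC server instead and have users connect with beeline.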

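9.5 User-Defined Functions

User-defined functions, or UDFs, let you register custom functions written in Python, Java, or Scala and call them within SQL. They are a popular way to expose advanced functionality to SQL users, who can call the registered functions without writing the code themselves. Spark SQL makes writing UDFs especially easy. It supports both its own UDF interface and existing Apache Hive UDFs.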
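The idea of registering a host-language function so that SQL can call it can be tried without a Spark cluster. The following is only an analogy, written against Python's standard sqlite3 module rather than Spark SQL; the table contents and the helper name str_len are made up for illustration.

```python
import sqlite3

def str_len(text):
    """Return the length of a string, or None when SQL passes NULL."""
    return None if text is None else len(text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)", [("hello",), ("spark sql",)])

# Register the Python function under a SQL-visible name.
# The second argument is the number of parameters the function takes.
conn.create_function("strLenPython", 1, str_len)

lengths = [row[0] for row in
           conn.execute("SELECT strLenPython(text) FROM tweets ORDER BY rowid")]
print(lengths)
```

As in Spark SQL, the SQL user only sees the registered name; the function body stays in the host language.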

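9.5.1 Spark SQL UDFs

Spark SQL offers a built-in method for registering UDFs by passing in a function written in your programming language. In Scala and Python, we can use the language's native function and lambda syntax, while in Java we need to extend the appropriate UDF class. UDFs can work on a variety of types, and the return type may differ from the types of the arguments.

In Python and Java, we also need to specify the return type using one of the SchemaRDD types listed in Table 9-1. In Java these types are found in org.apache.spark.sql.api.java.DataType, and in Python we import DataType.

In Examples 9-36 and 9-37, a very simple UDF computes the length of a string, and we use it to find the length of the tweets.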

Example 9-36: String length UDF in Python

# Make a UDF that returns the length of a string
hiveCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
lengthSchemaRDD = hiveCtx.sql("SELECT strLenPython('text') FROM tweets LIMIT 10")

Example 9-37: String length UDF in Scala

registerFunction("strLenScala", (_: String).length)
val tweetLength = hiveCtx.sql("SELECT strLenScala('tweet') FROM tweets LIMIT 10")

Defining a UDF in Java requires some extra imports. As with the functions we defined for RDDs, we extend a special class, chosen according to the number of arguments the UDF takes (UDF1 in the code that follows).

hiveCtx.udf().register("stringLengthJava", new UDF1<String, Integer>() {
    @Override
    public Integer call(String str) throws Exception {
      return str.length();
    }
  }, DataTypes.IntegerType);
SchemaRDD tweetLength = hiveCtx.sql(
  "SELECT stringLengthJava('text') FROM tweets LIMIT 10");
List<Row> lengths = tweetLength.collect();
// Iterate over the collected rows (the original looped over an undefined variable)
for (Row row : lengths) {
  System.out.println(row.get(0));
}

To make a Hive UDF available, simply call hiveCtx.sql("CREATE TEMPORARY FUNCTION name AS class.function").

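9.6 Spark SQL Performance

As mentioned at the start of this chapter, Spark SQL's higher-level query language and additional type information allow it to process data more efficiently.

Spark SQL is not only for users who are already familiar with SQL. It makes conditional aggregate operations very easy, such as summing multiple columns (Example 9-40), without constructing the special objects discussed in Chapter 6.

Example 9-40: Summing multiple columns in Spark SQL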

SELECT SUM(user.favouritesCount), SUM(retweetCount), user.id FROM tweets GROUP BY user.id

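Spark SQL can use its knowledge of types to represent data more efficiently. When caching data, Spark SQL uses an in-memory columnar store. This not only takes up less space in the cache, but also minimizes the data read when later queries depend on only a subset of the fields.

Predicate push-down lets Spark SQL move parts of a query down to the engine being queried. If we want to read only certain records in plain Spark, the standard approach is to read the entire dataset and then run a filter on it. In Spark SQL, however, if the underlying data store supports retrieving only a subset of the key range, or some other restriction, Spark SQL can push the restrictions in the query down to the data store, which can greatly reduce the amount of data read.

Spark SQL has a number of performance tuning options, listed in Table 9-2.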

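Table 9-2: Performance options in Spark SQL

Option | Default | Usage
spark.sql.codegen | false | When true, Spark SQL compiles each query to Java bytecode at runtime. This can improve performance for large queries, but can slow down very short queries.
spark.sql.inMemoryColumnarStorage.compressed | false | Automatically compress the in-memory columnar storage.
spark.sql.inMemoryColumnarStorage.batchSize | 1000 | The batch size for columnar caching. Larger values may cause out-of-memory problems.
spark.sql.parquet.compression.codec | snappy | Which compression codec to use. Possible options include uncompressed, snappy, gzip, and lzo.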

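When using the JDBC connector and the Beeline shell, we can set these performance options, and other options, with the set command, as shown in Example 9-41.

Example 9-41: Beeline command for enabling codegen

beeline> set spark.sql.codegen=true;
SET spark.sql.codegen=true
spark.sql.codegen=true
Time taken: 1.196 seconds

In a traditional Spark SQL application, we can set these properties on the Spark configuration instead, as shown in Example 9-42.

Example 9-42: Scala code for enabling codegen

conf.set("spark.sql.codegen", "true")

A few options deserve special attention. The first is spark.sql.codegen, which makes Spark SQL compile each query to Java bytecode before running it. Because it generates specialized code for the query, codegen can make long queries or frequently repeated queries substantially faster. However, for very short ad hoc queries (1 to 2 seconds), codegen may add overhead, because it has to run a compiler for each query.⁵ Codegen is still experimental, but we recommend trying it for any workload with large queries or with the same query repeated over and over.

The second option you may need to tune is spark.sql.inMemoryColumnarStorage.batchSize. When caching SchemaRDDs, Spark SQL groups the records into batches of the size given by this option (default: 1000) and compresses each batch. Very small batch sizes lead to low compression, while very large batch sizes can also cause problems, because a single batch may be too large to build up in memory. If the rows in your tables are large (for example, they contain hundreds of fields, or string fields that can be very long, such as web pages), you may need to lower the batch size to avoid out-of-memory (OOM) errors. Otherwise the default batch size is likely fine, because compression gains diminish beyond 1,000 records.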

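9.7 Conclusion

In this chapter, we have seen how to use Spark SQL to process structured and semistructured data in Spark. Beyond the queries covered here, the tools for working with RDDs from Chapter 3 through Chapter 6 also apply to the SchemaRDDs in Spark SQL. In many pipelines it is convenient to combine SQL with code written in other programming languages, taking advantage of both the conciseness of SQL and the ability of a general-purpose language to express complex logic. When you use Spark SQL for this, the Spark execution engine can also optimize based on the schema of the data, which benefits you as well.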

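5. Note that the first few queries run with codegen enabled will be especially slow, because Spark SQL needs to initialize its compiler. You should therefore run four or five queries before measuring the overhead of codegen.

Chapter 10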

Spark Streaming

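Many applications need to act on data as soon as it arrives: for example, applications that track page view statistics in real time, applications that train machine learning models, and applications that automatically detect anomalies. Spark Streaming is the Spark module designed for such applications. It lets users write streaming applications with an API very similar to the one for batch jobs, so they can reuse much of the skill, and even the code, that they built for batch applications.

Much as Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called a discretized stream, or DStream. A DStream is a sequence of data arriving over time. Internally, the data received during each time interval is stored as an RDD, and the DStream is the sequence of these RDDs (hence the name "discretized"). DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once created, a DStream supports two types of operations: transformations, which produce a new DStream, and output operations, which write data to an external system. DStreams provide many of the operations available on RDDs, plus new operations related to time, such as sliding windows.

Unlike batch programs, Spark Streaming applications need additional setup to operate 24/7. This chapter discusses checkpointing, a mechanism that stores data in a reliable file system such as HDFS and is the main way Spark Streaming achieves fault tolerance. We will also discuss how to restart an application after a failure and how to set applications up to restart automatically.

Finally, as of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental Python support was added in Spark 1.2, but it supports only text data. This chapter uses Java and Scala to show the full API, though similar concepts apply in Python.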

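10.1 A Simple Example

Example 10-1: Maven coordinates for Spark Streaming

groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.2.0

We start by creating a StreamingContext, which is the main entry point for streaming functionality. The StreamingContext also sets up an underlying SparkContext that it uses to process the data. Its constructor takes a batch interval, which specifies how often to process new data; we set it to 1 second. Next, we call socketTextStream() to create a DStream from text data received on port 7777 of the local machine.

We then transform the DStream with filter() to keep only the lines that contain error. Finally, we apply the output operation print() to print the filtered lines, as shown in Examples 10-4 and 10-5.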

Example 10-4: Streaming filter for printing lines containing "error" in Scala

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Connect to port 7777 on the local machine and create a DStream from the data received
val lines = ssc.socketTextStream("localhost", 7777)
// Filter the DStream for lines containing "error"
val errorLines = lines.filter(_.contains("error"))
// Print the lines containing "error"
errorLines.print()

Example 10-5: Streaming filter for printing lines containing "error" in Java

// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter the DStream for lines containing "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String line) {
    return line.contains("error");
  }});
// Print the lines containing "error"
errorLines.print();

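This only sets up the computation that will run once the system receives data. To start receiving data, we must explicitly call start() on the StreamingContext. Spark Streaming then begins scheduling Spark jobs on the underlying SparkContext. This happens in a separate thread, so to keep the application from exiting we also need to call awaitTermination() and wait for the streaming computation to finish.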

$ spark-submit --class com.oreilly.learningsparkexamples.scala.StreamingLogInput \

$ASSEMBLY_JAR local[4]

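$ nc localhost 7777 # Lets you type input lines to send to the server

<your input here>

Windows users can use the ncat command (http://nmap.org/ncat/) in place of nc. ncat is part of the nmap tool (http://nmap.org/).

In the rest of this chapter, we build on this example to process Apache log files. If you need to generate some fake logs, run the script ./bin/fakelogs.sh or ./bin/fakelogs.cmd from this book's Git repository to send logs to port 7777.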

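10.2 Architecture and Abstraction

Figure 10-1: High-level architecture of Spark Streaming

As discussed earlier, the programming abstraction in Spark Streaming is the discretized stream, or DStream, shown in Figure 10-2.

You can create DStreams either from external input sources or by applying transformations to other DStreams to obtain new ones. DStreams support many of the transformations on RDDs covered in Chapter 3.

Time: 1413833674000 ms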

---

71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error78978 HTTP/1.1" 404 505 ...

---

Time: 1413833675000 ms

---

71.19.164.174 - - [24/Sep/2014:22:27:10 +0000] "GET /error78978 HTTP/1.1" 404 505 ...

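Apart from transformations, DStreams also support output operations, such as the print() used in our example.

Figure 10-5: Execution of Spark Streaming within Spark's components

Spark Streaming provides the same fault-tolerance properties for DStreams as Spark provides for RDDs.

10.3 Transformations

Key/value DStream transformations are available in Scala only after importing StreamingContext._. As with RDDs, in Java you need to create a JavaPairDStream with mapToPair() before you can use them.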

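Table 10-1: Examples of stateless DStream transformations (incomplete list)

Function name | Purpose | Scala example | Signature of user-supplied function on DStream[T]
flatMap() | Apply a function to each element in the DStream and return a DStream of the contents of the iterators returned. | ds.flatMap(x => x.split(" ")) | f: T -> Iterable[U]
filter() | Return a DStream consisting of only the elements that pass the given condition. | ds.filter(x => x != 1) | f: T -> Boolean
repartition() | Change the number of partitions of the DStream. | ds.repartition(10) | N/A
reduceByKey() | Combine values with the same key in each batch. | ds.reduceByKey((x, y) => x + y) | f: T, T -> T
groupByKey() | Group values with the same key in each batch. | ds.groupByKey() | N/A

Keep in mind that although these functions look as if they apply to the whole stream, each DStream is internally composed of multiple RDDs (batches), and a stateless transformation is applied to each RDD separately.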

Example 10-10: map() and reduceByKey() on a DStream in Scala

// ApacheAccessLog is a utility class for parsing entries from Apache logs
val accessLogsDStream = logData.map(line => ApacheAccessLog.parseFromLogLine(line))
val ipDStream = accessLogsDStream.map(entry => (entry.getIpAddress(), 1))
val ipCountsDStream = ipDStream.reduceByKey((x, y) => x + y)

Example 10-11: map() and reduceByKey() on a DStream in Java

// ApacheAccessLog is a utility class for parsing entries from Apache logs
static final class IpTuple implements PairFunction<ApacheAccessLog, String, Long> {
  public Tuple2<String, Long> call(ApacheAccessLog log) {
    return new Tuple2<>(log.getIpAddress(), 1L);
  }
}
JavaDStream<ApacheAccessLog> accessLogsDStream = logData.map(new ParseFromLogLine());
JavaPairDStream<String, Long> ipDStream = accessLogsDStream.mapToPair(new IpTuple());
JavaPairDStream<String, Long> ipCountsDStream = ipDStream.reduceByKey(new LongSumReducer());

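Stateless transformations can also combine data from multiple DStreams, again within each time interval. For example, key/value DStreams have the same join-related transformations as RDDs: cogroup(), join(), leftOuterJoin(), and so on (see Section 4.3.3). Using these operations on DStreams performs the corresponding RDD operation separately on each batch.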
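The per-batch behavior described above can be sketched in plain Python, with no Spark involved. This is a simulation under stated assumptions: each batch is modeled as a list of (key, value) pairs, the helper names reduce_by_key and join are hypothetical stand-ins for the DStream operations, and the sample data is invented.

```python
def reduce_by_key(batch, func):
    """Combine values that share a key within ONE batch."""
    out = {}
    for key, value in batch:
        out[key] = func(out[key], value) if key in out else value
    return out

def join(left, right):
    """Inner join of two per-batch dicts on their keys."""
    return {k: (left[k], right[k]) for k in left if k in right}

# Two batches of (ip, bytes transferred) records; hypothetical data.
batches = [
    [("10.0.0.1", 100), ("10.0.0.2", 50), ("10.0.0.1", 25)],
    [("10.0.0.1", 10)],
]

results = []
for batch in batches:
    # Each batch is processed on its own; nothing carries over between batches.
    counts = reduce_by_key([(ip, 1) for ip, _ in batch], lambda x, y: x + y)
    byte_sums = reduce_by_key(batch, lambda x, y: x + y)
    results.append(join(counts, byte_sums))

print(results)
```

Note that the count for 10.0.0.1 in the second batch starts again from 1: a stateless transformation never sees data from earlier batches.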

Let us look at a concrete example of joining two DStreams. In Examples 10-12 and 10-13, we have data keyed by IP address, and we join the request counts with the number of bytes transferred.

Example 10-12: Joining two DStreams in Scala

val ipBytesDStream =
  accessLogsDStream.map(entry => (entry.getIpAddress(), entry.getContentSize()))
val ipBytesSumDStream =
  ipBytesDStream.reduceByKey((x, y) => x + y)
val ipBytesRequestCountDStream =
  ipCountsDStream.join(ipBytesSumDStream)

Example 10-13: Joining two DStreams in Java

JavaPairDStream<String, Long> ipBytesDStream =
  accessLogsDStream.mapToPair(new IpContentTuple());
JavaPairDStream<String, Long> ipBytesSumDStream =
  ipBytesDStream.reduceByKey(new LongSumReducer());
JavaPairDStream<String, Tuple2<Long, Long>> ipBytesRequestCountDStream =
  ipCountsDStream.join(ipBytesSumDStream);

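As in regular Spark, we can merge the contents of a DStream with another DStream using its union() operation, or merge multiple streams using StreamingContext.union().

Finally, if these stateless transformations are not enough, DStreams provide an advanced operator called transform(), which lets you operate directly on the RDDs inside them. transform() takes an arbitrary RDD-to-RDD function. The function is called on each batch of data in the stream to produce a new stream. A common use of transform() is to reuse batch processing code written for RDDs. For example, if you have a function extractOutliers() that takes an RDD of log records and returns an RDD of outliers (perhaps after computing some statistics on the messages), you can reuse it inside transform(), as shown in Examples 10-14 and 10-15.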

Example 10-14: transform() on a DStream in Scala

val outlierDStream = accessLogsDStream.transform { rdd =>
  extractOutliers(rdd)
}

Example 10-15: transform() on a DStream in Java

// The declared types now match: the function maps JavaRDD to JavaRDD,
// so transform() yields a JavaDStream of the same element type.
JavaDStream<ApacheAccessLog> outlierDStream = accessLogsDStream.transform(
  new Function<JavaRDD<ApacheAccessLog>, JavaRDD<ApacheAccessLog>>() {
    public JavaRDD<ApacheAccessLog> call(JavaRDD<ApacheAccessLog> rdd) {
      return extractOutliers(rdd);
    }
  });

You can also combine and transform data from multiple DStreams using StreamingContext.transform or DStream.transformWith(otherStream, func).

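10.3.2 Stateful Transformations

Stateful transformations are operations on DStreams that track data across time intervals: data from previous batches is also used to compute the results for a new batch. The two main types are windowed operations, which act over a sliding window of time periods, and updateStateByKey(), which tracks state across events for each key (for example, to build up an object representing each user session).

Stateful transformations require checkpointing to be enabled in your StreamingContext for fault tolerance. We discuss checkpointing in detail in Section 10.6. For now, it is enough to know that you can enable it by passing a directory to ssc.checkpoint(), as shown in Example 10-16.

Example 10-16: Setting up checkpointing

ssc.checkpoint("hdfs://...")

For local development, you can use a local path (such as /tmp) instead of HDFS.

Windowed operations compute results over a period longer than the StreamingContext's batch interval by combining the results of multiple batches. In this section, we show how to use windowed operations to keep track of the most common response codes, content sizes, and clients in a web server access log.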

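All windowed operations need two parameters, the window duration and the sliding duration, and both must be a multiple of the StreamingContext's batch interval. The window duration controls how many previous batches of data are considered in each computation, namely the last windowDuration/batchInterval batches. If we have a source DStream with a batch interval of 10 seconds and want a sliding window over the last 30 seconds (that is, the last 3 batches), we set windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval, controls how often the new DStream computes results. If the source DStream has a batch interval of 10 seconds and we want to compute the window only on every second batch, we set the sliding duration to 20 seconds. Figure 10-6 shows an example.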
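The arithmetic relating the two durations to batches can be sketched with a small plain-Python simulation. It does not use Spark: the function windowed and its parameters are hypothetical, batches are modeled as lists, and the treatment of the first, partially filled window is an assumption of this sketch.

```python
def windowed(batches, batch_interval, window_duration, slide_duration):
    """Group batches into sliding windows (durations given in seconds)."""
    if window_duration % batch_interval or slide_duration % batch_interval:
        raise ValueError("durations must be multiples of the batch interval")
    window_len = window_duration // batch_interval   # batches per window
    slide_len = slide_duration // batch_interval     # batches between results
    windows = []
    # A result is produced every slide_len batches and covers the
    # last window_len batches (fewer at the very start of the stream).
    for end in range(slide_len, len(batches) + 1, slide_len):
        start = max(0, end - window_len)
        windows.append([x for batch in batches[start:end] for x in batch])
    return windows

# Six 10-second batches, a 30-second window, results every 20 seconds.
batches = [[1], [2], [3], [4], [5], [6]]
result = windowed(batches, 10, 30, 20)
print(result)
```

With these settings each window covers 30/10 = 3 batches and a result appears every 20/10 = 2 batches, which matches the layout described for Figure 10-6.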

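The simplest windowed operation on a DStream is window(), which returns a new DStream containing the data for the requested window. In other words, each RDD in the DStream produced by window() contains data from multiple batches, which we can then process with count(), transform(), and so on (see Examples 10-17 and 10-18).

[Figure: a network input stream and the resulting windowed stream; window size: 3, slide: 2]

Figure 10-6: A windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every 2 batches, a result is computed over the previous 3 batches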

Example 10-17: Using window() to count data over a window in Scala

val accessLogsWindow = accessLogsDStream.window(Seconds(30), Seconds(10))
val windowCounts = accessLogsWindow.count()

Example 10-18: Using window() to count data over a window in Java

JavaDStream<ApacheAccessLog> accessLogsWindow = accessLogsDStream.window(
  Durations.seconds(30), Durations.seconds(10));
// count() yields a stream of Long values
JavaDStream<Long> windowCounts = accessLogsWindow.count();

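Although we can build all windowed operations on top of window(), Spark Streaming provides several other windowed operations for efficiency and convenience. First, reduceByWindow() and reduceByKeyAndWindow() let us perform reductions over each window more efficiently. They take a single reduce function, such as +, to run over the whole window. In addition, they have a special form that lets Spark compute the reduction incrementally by considering only the data entering the window and the data leaving it. This special form requires an inverse of the reduce function, such as - for +. For large windows, providing an inverse function can greatly improve efficiency (see Figure 10-7).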

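[Figure: two panels showing a reduction over the network input stream, one without and one with an inverse function]

Figure 10-7: Difference between naive reduceByWindow() and incremental reduceByWindow() using an inverse function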
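The difference between the two forms can be sketched in plain Python. This is a simulation of the idea only, not Spark's implementation: the function names are hypothetical, each batch is assumed to be already reduced to a single number, and 0 is assumed to be the identity of the add function.

```python
def naive_window_sums(batch_sums, window_len):
    """Recompute the reduction over every batch in the window each time."""
    return [sum(batch_sums[max(0, i - window_len + 1): i + 1])
            for i in range(len(batch_sums))]

def incremental_window_sums(batch_sums, window_len, add, subtract):
    """Update the previous result using only the entering and leaving batch."""
    results = []
    current = 0  # assumes 0 is the identity for `add`
    for i, value in enumerate(batch_sums):
        current = add(current, value)              # batch entering the window
        if i >= window_len:
            leaving = batch_sums[i - window_len]   # batch leaving the window
            current = subtract(current, leaving)
        results.append(current)
    return results

batch_sums = [4, 1, 3, 2, 5]
naive = naive_window_sums(batch_sums, 3)
incremental = incremental_window_sums(
    batch_sums, 3, lambda x, y: x + y, lambda x, y: x - y)
print(naive, incremental)
```

Both functions produce the same results, but the incremental version does a constant amount of work per step regardless of the window length, which is why the inverse function pays off for large windows.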

In our log processing example, we can use these two functions to count visits from each IP address more efficiently, as shown in Examples 10-19 and 10-20.

Example 10-19: Visit counts per IP address in Scala

val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
  {(x, y) => x + y}, // Add elements from the new batches entering the window
  {(x, y) => x - y}, // Remove elements from the batches leaving the window
  Seconds(30),       // Window duration
  Seconds(10))       // Slide duration

Example 10-20: Visit counts per IP address in Java

// PairFunction and Function2 are interfaces, so they are implemented, not extended
class ExtractIp implements PairFunction<ApacheAccessLog, String, Long> {
  public Tuple2<String, Long> call(ApacheAccessLog entry) {
    return new Tuple2<>(entry.getIpAddress(), 1L);
  }
}
class AddLongs implements Function2<Long, Long, Long> {
  public Long call(Long v1, Long v2) { return v1 + v2; }
}
class SubtractLongs implements Function2<Long, Long, Long> {
  public Long call(Long v1, Long v2) { return v1 - v2; }
}
JavaPairDStream<String, Long> ipAddressPairDStream = accessLogsDStream.mapToPair(
  new ExtractIp());
JavaPairDStream<String, Long> ipCountDStream = ipAddressPairDStream.
  reduceByKeyAndWindow(
    new AddLongs(),         // Add elements from the new batches entering the window
    new SubtractLongs(),    // Remove elements from the batches leaving the window
    Durations.seconds(30),  // Window duration
    Durations.seconds(10)); // Slide duration

Finally, DStreams offer countByWindow() and countByValueAndWindow() as shorthands for counting data. countByWindow() returns a DStream representing the number of elements in each window, while countByValueAndWindow() returns a DStream with the count for each value in the window, as shown in Examples 10-21 and 10-22.

Example 10-21: Windowed count operations in Scala

val ipDStream = accessLogsDStream.map{entry => entry.getIpAddress()}
val ipAddressRequestCount = ipDStream.countByValueAndWindow(Seconds(30), Seconds(10))
val requestCount = accessLogsDStream.countByWindow(Seconds(30), Seconds(10))

Example 10-22: Windowed count operations in Java

JavaDStream<String> ip = accessLogsDStream.map(
  new Function<ApacheAccessLog, String>() {
    public String call(ApacheAccessLog entry) {
      return entry.getIpAddress();
    }});
JavaDStream<Long> requestCount = accessLogsDStream.countByWindow(
  Durations.seconds(30), Durations.seconds(10));
JavaPairDStream<String, Long> ipAddressRequestCount = ip.countByValueAndWindow(
  Durations.seconds(30), Durations.seconds(10));

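The updateStateByKey() Transformation

With updateStateByKey(), we could, for example, track the last 10 pages each user visited. That list would be our state object, and we would update it as each event arrives.

To use updateStateByKey(), we provide a function update(events, oldState) that takes the events that arrived for a key and the key's previous state, and returns the new state for that key. The parts of its signature are as follows:

• events: the list of events that arrived in the current batch (may be empty).

• oldState: an optional state object, stored in an Option; it may be missing if the key has no previous state.

• newState: returned by the function, also as an Option; we can return an empty Option to indicate that the state should be deleted.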
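The contract of the update function can be sketched as a plain-Python simulation, with no Spark involved. The driver function update_state_by_key is a hypothetical stand-in for what the streaming engine does on each batch, None plays the role of an empty Option, and the response codes are invented sample data.

```python
def update_running_sum(events, old_state):
    """Return the new state for one key: previous count plus new events."""
    return (old_state or 0) + len(events)

def update_state_by_key(batch, state, update):
    """Apply `update` to every key seen in this batch or in earlier ones."""
    events_by_key = {}
    for key, event in batch:
        events_by_key.setdefault(key, []).append(event)
    new_state = {}
    # Keys with existing state are updated even when they have no new events.
    for key in set(state) | set(events_by_key):
        result = update(events_by_key.get(key, []), state.get(key))
        if result is not None:  # returning None deletes the key's state
            new_state[key] = result
    return new_state

# Three batches of (response code, 1) pairs; the last batch is empty.
state = {}
for batch in [[(200, 1), (404, 1), (200, 1)], [(200, 1)], []]:
    state = update_state_by_key(batch, state, update_running_sum)

print(state)
```

Unlike the stateless operations, the count for each key keeps growing across batches, and a key keeps its state through batches in which it receives no events.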

Example 10-23: Running count of response codes using updateStateByKey() in Scala

def updateRunningSum(values: Seq[Long], state: Option[Long]) = {
  Some(state.getOrElse(0L) + values.size)
}
val responseCodeDStream = accessLogsDStream.map(log => (log.getResponseCode(), 1L))
val responseCodeCountDStream = responseCodeDStream.updateStateByKey(updateRunningSum _)

Example 10-24: Running count of response codes using updateStateByKey() in Java

class UpdateRunningSum implements Function2<List<Long>, Optional<Long>, Optional<Long>> {
  public Optional<Long> call(List<Long> nums, Optional<Long> current) {
    long sum = current.or(0L);
    return Optional.of(sum + nums.size());
  }
};
JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogsDStream.mapToPair(
  new PairFunction<ApacheAccessLog, Integer, Long>() {
    public Tuple2<Integer, Long> call(ApacheAccessLog log) {
      return new Tuple2(log.getResponseCode(), 1L);
    }})
  .updateStateByKey(new UpdateRunningSum());

10.4 Output Operations

Once the program is debugged, we can use output operations to save the results. For example, although Spark Streaming does not have a built-in saveAsSequenceFile() function, we can save SequenceFiles using the approach in Examples 10-26 and 10-27.

Example 10-26: Saving a DStream as SequenceFiles in Scala

// A pattern-matching function (case) is needed to destructure each tuple
val writableIpAddressRequestCount = ipAddressRequestCount.map {
  case (ip, count) => (new Text(ip), new LongWritable(count)) }
writableIpAddressRequestCount.saveAsHadoopFiles[
  SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")

Example 10-27: Saving a DStream as SequenceFiles in Java

JavaPairDStream<Text, LongWritable> writableDStream = ipDStream.mapToPair(
  new PairFunction<Tuple2<String, Long>, Text, LongWritable>() {
    public Tuple2<Text, LongWritable> call(Tuple2<String, Long> e) {
      return new Tuple2(new Text(e._1()), new LongWritable(e._2()));
    }});
class OutFormat extends SequenceFileOutputFormat<Text, LongWritable> {};
writableDStream.saveAsHadoopFiles(
  "outputDir", "txt", Text.class, LongWritable.class, OutFormat.class);

Finally, there is a generic output operation, foreachRDD(), which lets us run arbitrary computations on each RDD in the DStream.

ipAddressRequestCount.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
