Performance Analysis and Comparison

Chapter 5 System Implementation

5.3 FOAF Data Analysis

5.3.4 Performance Analysis and Comparison

SELECT subject, object
FROM btc2012
WHERE `predicate` = '<http://xmlns.com/foaf/0.1/knows>'
  AND subject LIKE '%edu%';
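To make the selection rule of this query concrete, the following is a minimal sketch that runs an equivalent statement against a small in-memory table. It uses Python with SQLite rather than MySQL or Hive, and the three-column layout (subject, predicate, object) and the sample triples are assumptions for illustration only, not the actual btc2012 schema or data.

```python
import sqlite3

# Hypothetical sample triples; the real btc2012 table holds the BTC 2012 dataset.
TRIPLES = [
    ("<http://www.example.edu/~alice#me>",
     "<http://xmlns.com/foaf/0.1/knows>",
     "<http://www.example.edu/~bob#me>"),
    ("<http://www.example.edu/~alice#me>",
     "<http://xmlns.com/foaf/0.1/name>",
     '"Alice"'),
    ("<http://www.example.com/carol#me>",
     "<http://xmlns.com/foaf/0.1/knows>",
     "<http://www.example.edu/~alice#me>"),
]

# Same predicate and subject conditions as the query in the text.
QUERY = """
SELECT subject, object
FROM btc2012
WHERE predicate = '<http://xmlns.com/foaf/0.1/knows>'
  AND subject LIKE '%edu%'
"""


def run_query(triples):
    """Load the triples into an in-memory table (assumed schema) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE btc2012 (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO btc2012 VALUES (?, ?, ?)", triples)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


rows = run_query(TRIPLES)
for subject, obj in rows:
    print(subject, "knows", obj)
```

Only the first sample triple is returned: the second has a different predicate, and the subject of the third does not contain "edu". Note that LIKE '%edu%' matches "edu" anywhere in the subject string, not only in an .edu domain.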

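Table 7 lists the final execution results. For MySQL, execution time is roughly proportional to data size. Hive was tested in two configurations, single node and cluster. On a single node, Hive takes longer than MySQL and its execution time is also roughly proportional to data size; this is because Hadoop/MapReduce is inefficient on a single machine, where it cannot process data in parallel. The cluster, by contrast, takes full advantage of Hadoop/MapReduce parallel processing: the larger the data, the more pronounced the benefit, and on the larger datasets it is roughly twice as fast as MySQL.

Note also that the namenode and datanode1 used in this experiment have different hardware specifications (see Table 5), with the latter offering better processing performance. An experiment run on two machines with identical hardware might therefore yield somewhat different figures.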

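Table 7: Performance comparison of MySQL and Hive (execution time in seconds)

Data size (GB)   MySQL   R+MySQL   Hive (single node)   R+Hive (single node)   Hive (cluster)   R+Hive (cluster)
7.8              86      93        128                  142                    91               96
13.3             156     157       219                  240                    112              143
18.1             231     233       264                  313                    119              169
30               345     349       424                  510                    187              272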
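The comparison discussed in the text can be checked directly from the table. The sketch below computes, for each data size, the ratio of the MySQL time to the Hive time. It assumes that the second Hive column of Table 7 is the cluster run and that all times are in seconds; the dictionary keys are names chosen here for illustration.

```python
# Execution times copied from Table 7 (assumed to be seconds).
# Assumption: the first Hive column is the single-node run, the second the cluster run.
TABLE7 = {
    7.8:  {"mysql": 86,  "hive_single": 128, "hive_cluster": 91},
    13.3: {"mysql": 156, "hive_single": 219, "hive_cluster": 112},
    18.1: {"mysql": 231, "hive_single": 264, "hive_cluster": 119},
    30.0: {"mysql": 345, "hive_single": 424, "hive_cluster": 187},
}


def speedup_over_mysql(column):
    """Return MySQL time divided by the given column's time, per data size."""
    return {size: row["mysql"] / row[column] for size, row in TABLE7.items()}


single = speedup_over_mysql("hive_single")
cluster = speedup_over_mysql("hive_cluster")

for size in sorted(TABLE7):
    print(f"{size:5.1f} GB  single node: {single[size]:.2f}x  cluster: {cluster[size]:.2f}x")
```

Under these assumptions the single-node ratio stays below 1 at every size, while the cluster ratio is just below 1 at 7.8 GB, about 1.4 at 13.3 GB, and between 1.8 and 2.0 at 18.1 GB and 30 GB.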


Figure 23: Performance of MySQL and Hive (x-axis: data size in GB; y-axis: processing time in seconds; series: MySQL, Hive, Hive (Cluster))

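This study first divides social networks into centralized (Centralized Social Network) and decentralized (Decentralized Social Network) types and explains why decentralized social networks were chosen for analysis. A decentralized online social network uses the RDF(S)-based FOAF format to store personal data and the associated social network on a trusted third-party Hadoop cluster. Faced with large volumes of social network data, traditional analysis methods run into many processing and storage problems. By combining R with Hadoop MapReduce, this study proposes three analysis approaches, R + Hadoop Streaming (RHS), R + MySQL (RMS), and R + Hive (RH), to overcome the computation and storage bottlenecks of analyzing large amounts of FOAF data.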

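R+Hadoop Streaming analysis (RHS Analytics) uses the Hadoop Streaming framework to run Hadoop/MapReduce distributed processing in R. It counts how often each FOAF vocabulary term is used in a large volume of RDF data and writes the result as text files to HDFS, where it is available for further processing by R + Hive (RH), for example plotting FOAF term usage frequencies in R. R+MySQL analysis (RMS Analytics) uses a MySQL database together with R. The small-to-medium data used in this study (up to 30 GB) is well suited to MySQL as a storage and analysis architecture, and RMS serves as the control group for R+Hive analysis (RH Analytics). R+Hive analysis (RH Analytics) has R draw on the strengths of Hadoop/MapReduce through RHive to analyze large volumes of FOAF data, and then applies R's social network analysis functions to identify important analysis metrics.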
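The term-frequency count performed by RHS follows the usual map and reduce pattern. The following is a minimal sketch of that pattern in plain Python rather than R with Hadoop Streaming: the mapper emits one pair per FOAF predicate, a sort stands in for Hadoop's shuffle, and the reducer sums the counts. The function names, the line format (whitespace-separated subject, predicate, object), and the sample lines are assumptions for illustration; the sketch counts only predicates and is not the thesis's actual script.

```python
from itertools import groupby

FOAF_NS = "http://xmlns.com/foaf/0.1/"


def mapper(lines):
    """Emit (term, 1) for every line whose predicate is in the FOAF namespace."""
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest of the line
        if len(parts) < 3:
            continue  # skip malformed lines
        predicate = parts[1].strip("<>")
        if predicate.startswith(FOAF_NS):
            yield predicate[len(FOAF_NS):], 1


def reducer(pairs):
    """Sum the counts per term; pairs must arrive sorted by term."""
    for term, group in groupby(pairs, key=lambda kv: kv[0]):
        yield term, sum(count for _, count in group)


def count_foaf_terms(lines):
    """Run mapper, sort (standing in for the Hadoop shuffle), then reducer."""
    shuffled = sorted(mapper(lines))
    return dict(reducer(shuffled))


# Hypothetical sample input.
SAMPLE = [
    '<http://a.example.edu/me> <http://xmlns.com/foaf/0.1/knows> <http://b.example.edu/me> .',
    '<http://a.example.edu/me> <http://xmlns.com/foaf/0.1/name> "A" .',
    '<http://b.example.edu/me> <http://xmlns.com/foaf/0.1/knows> <http://a.example.edu/me> .',
    '<http://a.example.edu/me> <http://purl.org/dc/terms/title> "x" .',
]

counts = count_foaf_terms(SAMPLE)
print(counts)
```

In a real Hadoop Streaming job the mapper and reducer would be separate scripts reading standard input and writing tab-separated pairs to standard output, with the framework performing the sort between them.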

We discuss the three analysis approaches used in this study in terms of Storage, Performance (execution performance), and FOAF SNA (Social Network Analysis):

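1. R+Hadoop Streaming (RHS) Analytics

(1) Storage: FOAF data is stored in Hadoop HDFS, whose capacity can reach the petabyte (PB) scale and beyond.

(2) Performance: Processing is distributed across a Hadoop cluster, and processing speed increases as nodes are added to the cluster.

(3) FOAF SNA: Only simple FOAF usage-frequency statistics are computed. The data is kept in Hadoop HDFS and can be passed to R+Hive (RH) Analytics for further processing.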


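2. R+MySQL (RMS) Analytics

(1) Storage: MySQL storage capacity is limited by the operating system's file-size limit rather than by MySQL itself [37]. For the Linux system used in this experiment, which runs the ext3 file system, the limit is 4 TB; for Solaris 9/10 it is 16 TB.

(2) Performance: In our tests, processing slows down as the data grows (only the single namenode machine was used).

(3) FOAF SNA: Social Network Centrality indicators are computed with R igraph. For graph data of 80,000 records, the computation takes under ten minutes. For plotting, a graph with 16,000 vertices and 25,000 edges takes about 45 minutes on the namenode.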

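3. R+Hive (RH) Analytics

(1) Storage: Same as RHS.

(2) Performance: Hive translates HQL into Hadoop/MapReduce jobs, which Match (match FOAF strings), Filter (keep only the data belonging to academia), and Extract (pull out the data remaining after Match and Filter) before returning the query result. Processing is faster than with RHS.

(3) FOAF SNA: Same as RMS.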
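The centrality indicators mentioned under FOAF SNA can be illustrated on a small graph. The thesis computes them with R igraph; the sketch below is a plain Python stand-in that computes degree and closeness centrality from a list of foaf:knows pairs. Treating the relation as undirected, the function names, and the sample edges are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Number of neighbours divided by the maximum possible (n - 1)."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph, source):
    """Reachable vertex count divided by the sum of shortest-path lengths."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical sample: alice knows three people who do not know each other.
EDGES = [("alice", "bob"), ("alice", "carol"), ("alice", "dave")]
graph = build_graph(EDGES)
degree = degree_centrality(graph)
print(degree["alice"], closeness_centrality(graph, "bob"))
```

In this star-shaped sample, alice has the highest score on both measures, since she is adjacent to every other vertex.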

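The test data used in this study is the BTC2012 dataset. Some studies use SPARQL as the query and analysis tool for it. SQL-like languages suit the analysis of structured data, so besides Hive, HBase (a column-oriented database), another component of the Hadoop ecosystem, is also suitable for similar research. This study combines R and Hadoop/MapReduce to analyze large volumes of social network data: whether MySQL or Hive is used, the FOAF data first goes through a first stage of processing (distributed and parallel under Hive), and the results are then passed to R for a second stage of social network analysis. Once back in the plain R environment, the social network analysis metrics are computed on a single core or a single CPU, even when multiple cores or CPUs are available. With social network data growing daily, integrating R more closely with Hadoop/MapReduce, and making use of HBase or existing parallelization packages for R, are directions for future research.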


[1]. Apache Hadoop Project, http://hadoop.apache.org

[2]. Billion Triples Challenge 2012 Dataset, http://km.aifb.kit.edu/projects/btc-2012/

[3]. Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked data - the story so far. International journal on semantic web and information systems, 5(3), 1-22.

[4]. Bonacich, P. (1987). Power and centrality: A family of measures. American journal of sociology, 1170-1182.

[5]. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., ... & Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2), 4.

[6]. Daniel J. Weitzner, http://www.w3.org/People/Weitzner.html

[7]. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

[8]. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

[9]. Department Of Statistics, Purdue University (2012). Divide and Recombine (D&R) with RHIPE. Retrieved from http://www.datadr.org/.

[10]. Ding, L., Zhou, L., Finin, T., & Joshi, A. (2005, January). How the semantic web is being used: An analysis of foaf documents. In System Sciences, 2005. HICSS'05. Proceedings of the 38th Annual Hawaii International Conference on (pp. 113c-113c). IEEE.

[11]. Ding, L., Zhou, L., Finin, T., & Joshi, A. (2005, January). How the semantic web is being used: An analysis of foaf documents. In System Sciences, 2005. HICSS'05. Proceedings of the 38th Annual Hawaii International Conference on (pp. 113c-113c). IEEE.

[12]. Dirk Eddelbuettel (2014, July 7). CRAN Task View: High-Performance and Parallel Computing with R. Retrieved July 7, 2014, from http://cran.r-project.org/web/views/HighPerformanceComputing.html

[13]. Erétéo, G., Gandon, F., Corby, O., & Buffa, M. (2009). Semantic social network analysis. arXiv preprint arXiv:0904.3701.

[14]. FOAF Vocabulary Specification 0.99/Namespace Document 14 January 2014 - Paddington Edition. http://xmlns.com/foaf/spec/

[15]. Freeman, L. C. (1979). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.

[16]. G. K. Zipf, Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press, 1932

[17]. Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, October). The Google file system. In ACM SIGOPS Operating Systems Review (Vol. 37, No. 5, pp. 29-43). ACM.

[18]. Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, October). The Google file system. In ACM SIGOPS Operating Systems Review (Vol. 37, No. 5, pp. 29-43). ACM.

[19]. Golbeck, J., & Rothstein, M. (2008, July). Linking Social Networks on the Web with FOAF: A Semantic Web Case Study. In AAAI (Vol. 8, pp. 1138-1143).

[20]. http://en.wikipedia.org/wiki/Information_Sciences_Institute

[21]. http://www.ldodds.com/foaf/foaf-a-matic.html

[22]. Jonathan Seidman, & Ramesh Venkataramaiah (2011). Distributed Data Analysis with Hadoop and R.

[23]. Mori, J., Matsuo, Y., Ishizuka, M., & Faltings, B. (2004, September). Keyword extraction from the web for foaf metadata. In Proceedings of the 1st Workshop on Friend of a Friend, Social Networking and the (Semantic) Web.

[24]. MySQL database, http://www.mysql.com/

[25]. MySQL Limits on Table Size, http://dev.mysql.com/doc/refman/5.1/en/table-size-limit.html

[26]. Paolillo, J. C., & Wright, E. (2004). The challenges of FOAF characterization. In Proceedings of the 1st Workshop on Friend of a Friend, Social Networking and the (Semantic) Web.

[27]. Paolillo, J. C., & Wright, E. (2006). Social network analysis on the semantic web: Techniques and challenges for visualizing FOAF. In Visualizing the semantic web (pp. 229-241). Springer London.

[28]. Piccolboni, A. (2014, May 25). RevolutionAnalytics/RHadoop. Retrieved from https://github.com/RevolutionAnalytics/RHadoop/wiki.

[29]. Resource Description Framework (RDF), http://www.w3.org/RDF/


[30]. Rickert, J. B. (2010). Big Data Analysis with Revolution R Enterprise.

[31]. Ryan R. Rosario (2010). Taking R to the Limit. Los Angeles R Users' Group

[32]. The Apache HBase, http://hbase.apache.org/

[33]. The Apache Hive, https://hive.apache.org/

[34]. The Apache ZooKeeper, http://zookeeper.apache.org/

[35]. The Friend of a Friend (FOAF) project, http://www.foaf-project.org/

[36]. The R Project for Statistical Computing, http://www.r-project.org/

[37]. Yeung, C. M. A., Liccardi, I., Lu, K., Seneviratne, O., & Berners-Lee, T. (2009, January). Decentralization: The future of online social networking. In W3C Workshop on the Future of Social Networking Position Papers (Vol. 2, pp. 2-7).

Source document: Integrating R and Hadoop/MapReduce to Analyze FOAF Social Networks, NCCU Academic Hub (page 53-0)
