Conclusions and Future Work - Chung Hua University

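Following the design of a distributed file system, imported data is split into blocks that are stored on data nodes at random. Scattering data across nodes does help parallel distributed computation, but not all data is used to the same degree, so random placement is not a wise choice when importing a database. This observation is the main motivation of this thesis.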

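Join is one of the most frequent database operations, and it consumes a great deal of computing resources. From a distributed-computing point of view, a join is split into several smaller tasks that run in parallel. When the data those tasks need is scattered across many nodes, however, it must be transferred over the network, which inevitably lengthens the time the whole join takes to complete. Our method starts from the fact that a database keeps log files recording the operations performed on it. By analyzing the log, we can learn which tables are joined together. Because the amount of data a join touches also matters, we take table size into account as well. With our method, CA_Sqoop, related tables are placed on the same node as far as possible while the database is being imported.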
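The idea in the paragraph above can be illustrated with a minimal sketch. It is not the CA_Sqoop algorithm itself: the log format (one plain SQL statement per line), the regular expression, the weight (join frequency times combined table size), the uniform node capacity, and the greedy strategy are all assumptions made for illustration, and the function names `join_counts` and `place_tables` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed log format: one simple SQL statement per line.
# The pattern picks up the table name that follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)


def join_counts(log_lines):
    """Count how often each pair of tables appears in the same join query."""
    counts = Counter()
    for line in log_lines:
        if not JOIN_RE.search(line):
            continue  # not a join query
        tables = sorted({t.lower() for t in TABLE_RE.findall(line)})
        for pair in combinations(tables, 2):
            counts[pair] += 1
    return counts


def place_tables(sizes, counts, num_nodes, capacity):
    """Greedy sketch: process the heaviest table pairs first and try to put
    both tables on one node; remaining tables go to the least-loaded node."""
    placement = {}
    used = [0] * num_nodes

    def assign(table, node):
        placement[table] = node
        used[node] += sizes[table]

    def least_loaded_fit(needed):
        nodes = [n for n in range(num_nodes) if used[n] + needed <= capacity]
        return min(nodes, key=lambda n: used[n]) if nodes else None

    # Assumed weight: join frequency times the combined size of the pair.
    pairs = [p for p in counts if p[0] in sizes and p[1] in sizes]
    pairs.sort(key=lambda p: counts[p] * (sizes[p[0]] + sizes[p[1]]), reverse=True)

    for a, b in pairs:
        if a in placement and b in placement:
            continue
        if a not in placement and b not in placement:
            node = least_loaded_fit(sizes[a] + sizes[b])
            if node is not None:
                assign(a, node)
                assign(b, node)
            continue
        placed, other = (a, b) if a in placement else (b, a)
        node = placement[placed]
        if used[node] + sizes[other] <= capacity:
            assign(other, node)

    # Tables that could not be co-located are placed on their own.
    for table in sizes:
        if table not in placement:
            node = least_loaded_fit(sizes[table])
            if node is None:
                raise ValueError(f"no node has room for table {table!r}")
            assign(table, node)
    return placement


if __name__ == "__main__":
    log = [
        "SELECT * FROM orders JOIN customers ON orders.cid = customers.id",
        "SELECT * FROM orders JOIN customers ON orders.cid = customers.id",
        "SELECT * FROM items JOIN orders ON items.oid = orders.id",
        "SELECT * FROM logs",
    ]
    sizes = {"orders": 40, "customers": 20, "items": 50, "logs": 10}
    print(place_tables(sizes, join_counts(log), num_nodes=2, capacity=100))
```

In this toy input, `orders` and `customers` are joined most often and fit together within the assumed capacity, so the sketch co-locates them; `items` is too large to share that node and is placed elsewhere.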

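The simulation experiments clearly show that CA_Sqoop improves data locality. Because this much data no longer has to cross the network and can be read from local disk instead, computation time should drop considerably.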
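To make the notion of data locality concrete, the following sketch computes one possible measure: the fraction of bytes read by joins that already reside on the node running the join. The measure, the rule that a join runs on the node holding most of its data, and the name `local_fraction` are assumptions for illustration and are not necessarily the definition used in the experiments.

```python
def local_fraction(joins, sizes, placement):
    """Fraction of join input bytes that can be read from local disk.

    Assumption: each join runs on the node that already holds the most bytes
    of its tables; all other bytes count as network transfer.
    """
    local = 0
    total = 0
    for tables in joins:
        per_node = {}
        for table in tables:
            node = placement[table]
            per_node[node] = per_node.get(node, 0) + sizes[table]
        local += max(per_node.values())
        total += sum(per_node.values())
    return local / total if total else 1.0


if __name__ == "__main__":
    sizes = {"orders": 40, "customers": 20, "items": 50}
    joins = [("orders", "customers"), ("orders", "items")]
    scattered = {"orders": 0, "customers": 1, "items": 2}
    grouped = {"orders": 0, "customers": 0, "items": 1}
    print(local_fraction(joins, sizes, scattered))
    print(local_fraction(joins, sizes, grouped))
```

Under these assumptions, a placement that keeps joined tables together yields a higher local fraction than one that scatters them, which is the effect the conclusion describes.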

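This thesis does not consider data replicas in HDFS. If replica placement were taken into account as well, the performance of a database imported into Hadoop could presumably be improved further.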

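In addition, our definition of a node's capacity limit (Capacity) simplifies the node storage problem. In a real system, capacity should be defined per node, according to the storage each node actually has. That approach is more realistic, but deciding at import time which nodes to use for the greatest performance gain may take a large amount of time. Even so, only this approach reflects practice, and it will be the main direction of our future research.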

