行政院國家科學委員會專題研究計畫成果報告

(1)

行政院國家科學委員會專題研究計畫成果報告

利用 P2P 技術建構以服務導向架構(SOA)為基礎之輕量化格網系統--子計畫二:應用 P2P 與 Web 技術發展以 SOA 為基礎

的格網中介軟體與經濟模型(第 3 年) 研究成果報告(完整版)

計畫類別：整合型

計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日執行單位：中華大學資訊工程學系

計畫主持人：許慶賢

計畫參與人員：碩士班研究生-兼任助理人員：李志純碩士班研究生-兼任助理人員：黃安婷碩士班研究生-兼任助理人員：游景涵

報告附件：出席國際會議研究心得報告及發表論文

處理方式：本計畫涉及專利或其他智慧財產權，2 年後可公開查詢

中華民國 100 年 10 月 29 日

(2)

行政院國家科學委員會補助專題研究計畫 █ 成果報告

□期中進度報告

※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※

※ 利用 P 2 P 技術建構以服務導向架構 ( S O A ) 為基礎之 ※

※ 輕量化格網系統－子計畫二 : ※

※ 應用 P 2 P 與 W e b 技術發展以 S O A 為基礎的 ※

※ 格網中介軟體與經濟模型 (第三年 ) ※

※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※

計畫類別：□個別型計畫  整合型計畫計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日執行單位：中華大學資訊工程學系

計畫主持人：許慶賢中華大學資訊工程學系教授共同主持人：

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列管計畫及下列情形者外，得立即公開查詢

□涉及專利或其他智慧財產權，□一年  二年後可公開查詢

中華民國 100 年 10 月 28 日

(3)

行政院國家科學委員會補助專題研究計畫 █ 成果報告

□期中進度報告

應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介軟體與經濟模型 (第三年)

計畫類別：□個別型計畫  整合型計畫計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日計畫主持人：許慶賢中華大學資訊工程學系教授共同主持人：

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

成果報告類型(依經費核定清單規定繳交)：□精簡報告  完整報告

本成果報告包括以下應繳交之附件：

□赴國外出差或研習心得報告一份

□赴大陸地區出差或研習心得報告一份

 出席國際學術會議心得報告及發表之論文各一份

□國際合作研究計畫國外研究報告書一份

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列管計畫及下列情形者外，得立即公開查詢

□涉及專利或其他智慧財產權，□一年  二年後可公開查詢

執行單位：中華大學資訊工程學系

中華民國 100 年 10 月 28 日

(4)

1

目錄

中文摘要... 2

一、緣由與目的...3

二、研究方法與成果...4

三、結果與討論...16

五、計畫成果自評...18

六、參考文獻...19

出席國際學術會議心得報告 (MTPP 2010)...21

出席國際學術會議心得報告 (CSE 2009)...32

出席國際學術會議心得報告 (InforScale 2009)...47

出席國際學術會議心得報告 (ChinaGrid 2008)...77

(5)

2

行政院國家科學委員會專題研究計畫成果報告

應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介軟體與經濟模型

計畫編號：NSC 97-2628-E-216-006-MY3 執行期限：99 年 8 月 1 日至 100 年 7 月 31 日主持人：許慶賢中華大學資訊工程學系教授

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

一、中文摘要

本計畫整合 P2P 技術於格網系統，目標是提供大量分散式計算與資料之傳輸。由於 P2P 具有自我調適、擴充性與容錯等特性，而格網著重於異質環境的整合與資源的管理，格網技術與 P2P 技術的整合已經成為格網系統開發之趨勢。這樣的結合可形成具擴充性、容錯、自我調適、高效能的格網平台。此外藉由服務導向架構(SOA, Service-oriented Architecture)，將格網中成員所提供的功能、參與的資源與交付的工作服務化，使得所有包裝的格網服務彼此間可以透過標準的協定進行溝通與整合。因此本計畫以 P2P 與服務導向架構為基礎，

建置輕量化的格網系統。

本計畫執行三年，第一年，我們發展完整的中介資料處理、儲存、與快取機制、資料儲存機制、P2P 索引機制、安全性機制，並且建置虛擬儲存空間、

作業系統的虛擬檔案系統連結、針對 SRB 進行效能與系統功能測試。第二年，

我們改良 MapReduce 這個資料處理模型，並且配合原本的資料格網系統應用觀念，發展能夠大量處理資料的應用程式介面，以及能夠針對特定領域來進行資料處理的領域特定語言與各種平台函式庫開發。第三年，我們利用開發三套資料格網的管理與分析工具，可以監控資料的分佈、網路狀態、運算資源的狀態，

並整理出資料的分佈模式。讓資料能夠運用其存放節點的運算資源，讓每筆資料不需要透過集中式的運算，達成輕鬆地移動或複寫自己而提升整體資料的容錯率。整體而言，本計畫預計完成的研究項目，皆已經實作出來，並在相關期刊與研討會發表。

關鍵詞：P2P、格網計算、Web 技術、中介軟體、經濟模型、服務導向架構、資源管理、

資料格網、網際服務。

(6)

3

二、緣由與目的

由於網路、電腦與數位產品的普及，全世界的資料每天幾乎是以 Peta Bytes 的速度成長。這樣龐大的資訊量，一定也需要有軟硬體搭配儲存。因此除了典型硬式磁碟以外，也有許多不同的整合式單一媒體系統，如階層式(Hierarchical)儲存，叢集式(Cluster) 儲存，網路式儲存(Network Attached Storage, NAS)；非傳統的儲存方式如資料庫系統，

或加密式的檔案系統也受到重視。相對於單一儲存系統，在異質性的儲存環境之下，

面對大量的儲存需求，資料格網的另一個挑戰，便是能夠整合這些儲存媒體，建構虛擬儲存空間，有效率地使用其容量並提高傳輸速度。另外，由於資訊的複雜化，資料往往必頇具備索引，並能夠被快速搜尋及取得。比起即時地分析資料並且進行搜尋，

過去就有針對資料的關鍵欄位建立索引的作法，這些索引資料同樣也必備讓使用者能夠快速閱讀，或是分類的功能。我們稱作這些為中介資料，目的是用來描述資料的資料。舉例來說，在典型的檔案系統中有一個稱作 Superblock 的地方，就是儲存所有檔案的中介資料。當系統進行檔案搜尋的時候，實際上是搜尋中介資料，並且回報給系統連結到實體檔案的位置。另一個例子是在典型索引及搜尋的資料網路上，例如一個 P2P 網路，我們通常都是透過檔案名稱來搜尋檔案。但實際上真正的應用方式，應該是針對檔案的內容來進行搜尋。困難的地方是在於，這樣的網路通常沒有任何能力分析檔案的內容或者是其中介資料，並且讓使用者找到內容也符合搜尋條件的檔案，因此也經常會有許多假檔出現。而針對檔案中介資料進行分析與索引，能夠做得相當好的，

又是必頇重新設計其中介資料讀取及索引的介面，一旦設計完成就難以修改並套用在其他類型的檔案上。造成這個問題的主要原因，是不同領域對於相同物件或檔案會有不同的描述方法。因此，要創造一個共通的資料協定，進行資料轉換甚至擷取，分析也是難上加難。而如果要建立索引，也必頇耗費成本設計專用的索引系統。因此，以往也有些人反過來志在開發一些能夠通用的協定，但也多半是不符合完整的領域需求，因為不見得所有的組織都會參與。當然這其中也不乏商業競爭的問題。

在這個研究計畫中，我們將討論如何採用 W3C 的 Web 技術，及 P2P 技術來改善這些問題，並且將其成果套用在一個以資料為核心的容錯度，效能為主來考量的複寫演算法上。以及最後要來探討資料格網上的經濟模型。簡言之，本研究目的是發展一套可以運行在現行學術及大眾網路環境中的資料格網中介軟體系統，並且發展資料自我感測的複寫技術及發展可以商業化的資料格網經濟模型。本計畫有下列幾個主要

(7)

4

的研究課題：

 發展以 P2P 及 W3C 各種 Web 技術為核心的格網中介軟體，具有輕量化，高速傳輸，容錯及大量處理中介資料的能力，並擁有完整的開發環境 API。並且實際建置於學術網路環境之上，使其成為接下來研究所依賴的開發環境。

 利用 MapReduce 這個資料處理模型，發展大量資料操作，處理的應用程式介面 (Application Programming Interface)，以及能夠處理資料的領域特定語言(Domain Specific Language)，使得在此資料格網上開發應用軟體成為一件簡單的事情。

 利用前項成果，發展資料自我感測式的複寫管理技術，並針對運行中的平台進行容錯與效能測試。

 發展大量部署及自我管理的智慧型資料格網中介軟體平台，使得未來管理的人力及時間成本大量降低，並且實際地在既有的學術網路環境中建置該平台。

 發展以服務為導向的格網經濟模型，應用在資料保全、工作排程、與各種 Web 服務，進而滿足不同使用者與系統管理的需求。

三、研究方法與成果

第一年，我們以 W3C 的 Web 相關技術，及現有的 P2P 演算法為基礎，研究資料格網的中介資料，虛擬儲存，大量部署管理等核心技術的開發。而與現有的資料格網中介軟體-Storage Resource Broker(SRB)，進行效能的比較。此外在校園學術網路上進行系統平台的大量部署，利用校園內的分散硬碟空間，成功建置大型的資料格網系統。

第二年，我們進行資料領域特定語言的設計與各種平台函式庫開發，並且在學術網路上進行第一年成果的建置與部署並且測試其效能。在這一個階段，我們也與其他學校進行緊密的整合，並進行許多細節的修正。

第三年，我們利用前項所發展的資料格網中介軟體 API，以及 P2P 分散式排程的技術，進行資料自我感測式複寫技術的研究，並且開發完整的使用者介面，以及發展成該平台複寫技術的核心。另外，建構以服務為導向的格網經濟模型，應用於各種 Web 服務，進而滿足不同使用者的需求。格網經濟模型研究的重點在於同時考量供給者的維運成本與滿足消費者的不同需求(QoS)，發展一套未來可以導入企業網路的格網經濟架構。

(8)

5

我們使用許多 Web 核心技術，而這些技術多半是由 W3C 所制訂的。此外，我們也在各種方面加入 P2P 的觀念及考量，使得擴充性及容錯度更高。透過 P2P 的觀念，我們將整個 Data Grid 的成員設計成 Storage Peer(SP), Metadata Peer(MP), 及 Cluster Peer(CP)。這些成員如圖一的 P2P 資料格網架構圖所示。以下的敘述我們將採用 SP，MP 及 CP 來簡述。

Network Overlay

Domain Overlay

Domain MP

Storage Overlay

SP Network MP

圖一 P2P 資料格網架構圖

1.

中介資料容錯、儲存、索引、與處理

我們將 MP 利用 Chord 網路連接在一起，最主要的目的是利用 Distributed Hash Table 的特性，一旦加入網路後，能夠很快速地找到特定節點，並且進行後續動作。

所有的 MP 都有一個基本的能力，就是可以快速聚合(aggregate)自己下一層的所有中介資料，儲存在自己的中介資料儲存裡，並且能與自己的備援進行完全的同步。

然而整個 Data Grid 環境，並不是一個 Overlay 能夠解決，我們預期根據 Data Grid 的特性建立不同層次的 Overlay 來對應。

在圖二的架構中，第 1 層為 Network Overlay，Network 也就是 SRB 裡所定義的 Zone。這一層的主要目的並不是實際地建立一層資料服務的 Overlay，然後彼此建立副本來容錯(儘管根據設計是可以的，但卻不實用)。而是每一個 Peer 都必頇作為下一層 Overlay，也就是 Domain Overlay 的 Chord 網路初始進入點。一旦 Domain Overlay 節點增加，Network Overlay 的功能就會放在另一個層面。

我們另外一個設計的主要概念是，不管是服務程式還是使用者，都會希望有個單一進入點。一個 Network MP 就是用來代表一整個 Data Grid，對於程式所需要的

(9)

6

資料，他可以定期地統計所有 Domain MP 的狀態，並且透過 Grid Information System 的規格對外服務其資料；而對於使用者的部分，則是被用來當作 Web Portal 的負載平衡進入點。為了避免發生問題，Network MP 的目的就不是用來服務，而是連線重導，實際上真正進行服務的還是下層的 Domain Overlay。也就是說 Network MP 的功用有點像是 Web 負載平衡主機及 Proxy，但從實做的角度，就是簡單地開啟了 Portal 功能的 MP。

Network Overlay

Domain Overlay

Domain MP

Storage Overlay

SP Network MP

圖二 P2P 資料格網架構圖

如果系統中有單一進入點，通常也會造成單點錯誤(Single Point Failure)的來源。我們將 Network MP 設計成很簡單的錯誤備援機制，一個 Network MP 只能夠有一個備援，而透過其他網路上的通訊機制如 Mail 或簡訊來通知管理者已經發生錯誤備援的警告，如圖三所示。

(10)

7

Network Overlay

Master Network MP Backup Network MP

IP ,Portal Service, Entry point

圖三 Network overlay 備援機制示意圖

在圖二的架構中，第 2 層才是中介資料實際上運作的 Overlay，稱做 Domain Overlay。Domain MP 的建立，一方面是根據使用者的要求而建立，這個狀況是當使用者建立新的 Domain，對應的 Domain MP 也會被要求啟動及設定；另一個就是透過大量部署機制，後面會提到。任何一個 SP 都一定會被加入到一個特定的 Domain，就另一個角度，也就是與該 MP 連接上。這邊的設計最主要是讓使用者能夠自然地在建立整個資料格網管理階層時，就同時設定好了對應的 Peer 連線，使得中介資料階層能夠與管理階層相對應。根據 SP 的數量，有些 SP 會自動地被提升為 Backup MP(也就是開啟中介資料儲存的功能)，來作為其他 MP 的備援，但 MP 之間的運作並非構成 Master / Slave 的架構，而是 Peer-to-Peer。彼此互為備援的 MP，才會互相地完全同步其 Domain 的中介資料。各 MP 之間能夠轉送其他 Domain 的 SP 所要求的中介資料查詢，所有機制讓他們有點像是 DNS 一樣地在運作。所有的 SP 也都會有他們所負責 MP 的清單，

所以當任何清單中的 MP 離線時，只要還有一個以上的 MP，就能夠查詢，此外也會通知使用者發生了 MP 連線的警告。而當所有的 MP 離線，我們的作法是等 Network MP 偵測到，然後透過其他網路通訊機制通知使用者發生了所有 MP 斷線的錯誤。

第 3 層是 Storage Overlay，當然所有的 SP 也都只負責儲存，而根據後面提到的中介資料快取機制，也會儲存跟自己有關檔案的中介資料。除了對自己所屬的 MP 才能夠傳送其中介資料，對任何其他的 SP 或 MP 是無法進行中介資料查詢及傳送的。SP 之間有一個重要的工作，就是進行平行傳輸。根據 Replica 策略，資料能夠輕易地在 SP 之間移動，或是進行複寫。任何移動及複寫的過程，都會被其負責的 Domain MP 記錄，以維持一致性。此外我們還針對了格網管理或是開發人員，設計了獨特的大量部署，管理及開發功能，由 Cluster Overlay 及 CP 來運作，將在後面介紹。

在中介資料儲存的部份，這一個機制的目的，主要是為了在設計架構上實現中介

(11)

8

資料中立的概念。這個概念描述說，任何中介資料索引或是儲存的機制，不能夠限制在同一種儲存媒體上，如資料庫系統。DBMS 的優點是搜尋關連式資料或是聚合同性質資料有一定的效率，可是對於要求高輸出的中介資料，不見得是好的實做。因此在這裡我們選擇透過系統函式庫提供 DBMS Abstraction Layer，來支援不同的 DBMS 連線，藉此達成中立概念。另一方面我們只利用 DBMS 來做中介資料索引及統計的動作。

除非是這兩個動作，否則都是得讓任何要讀取中介資料的節點將整個中介資料的 XML 讀取回去。然而利用 HTTP 的特性，節點在讀取的時候可以檢查其 ETag Header，發現檔案沒有變動，即可直接利用本地快取。

2.

通訊、Cluster Peer、與資料儲存

HTTP 是一個非常完善設計的檔案通訊協定，而透過一些擴充如 WebDAV，使得 HTTP 比 FTP 還要能夠有檔案的中介資料及實體資料傳輸的能力。儘管 HTTP 並不是 Stateful，但我們依然能夠加上 Cookie 及 Session 來完成一些需要狀態的動作，

例如我們自行定義的通訊流程。此外 HTTP 很多核心的觀念都建立在其 Server 裡，

最有名的例子為 Apache HTTP Server，使得檔案的部分傳輸相當地簡單，因此平行傳輸的問題只剩一半。另一個好處是，如果需要加密的傳輸，只需要把 URL 從 http 改為 https 即可。

在資料格式定址(Data Format Addressing)的方法上，近來也有一個針對 HTTP 通訊及 Web 應用相當熱門的名詞稱做 REST。REST 原意為 Representational State Transfer，表示同樣的 URL(Resource)，利用 HTTP Accept Header，能夠做到讓 Server 傳回不同型態的資料。例如 http://www.someone.com/login，對於 HTML 要求，便傳回人看得懂的網頁；而對於程式的 XML 要求，便傳回 XML 結果。我們將利用這種特性，將一系列的 URL 設計為簡易的 Web API，使得整個程式框架更容易實做。所以，我們就可以將 URL 視作整個網路內資料的定址及選擇通訊資料格式的方式。

對於 P2P 連線及平行傳輸，在前述的中介資料索引小節中，我們提到了整個 P2P 機制是透過 Chord 這個分散式雜湊表來完成。透過這機制，取得了任何資料的來源清單後，便可以透過 HTTP 的資料分斷機制，進行平行化的傳輸。任何資料進行平行傳輸後，會再次計算 checksum，確保資料傳輸沒有問題。

針對大量部署及管理的問題，管理者不僅要從虛擬檔案系統看整個資料格網內的資料，也必頇隨時掌握實體機器的狀態。但以往管理者遇到的問題是，很難快速

(12)

9

地在資料格網中建立或刪除節點，加入新的硬碟，或是在任何機器有狀況的時候進行處理，此外像是 CPU，記憶體，硬碟空間的監控都很重要。這個困難點是因為，

很多實體機器不見得都是該管理者有全部權限，或是可以立即進行操作，此外如果一次要對 20 台進行硬碟的安裝可能就很困難。

為了解決這個問題，管理者必頇安裝 Cluster Peer(CP)元件。CP 實際上是一種為了管理而存在的 Peer。而構成的 Cluster Overlay，除了擁有 MP 的所有功能以外，預設會認定實體機器是在同一個或是鄰近網段下運作，藉此調整其快取機制，並更提高輸出率，而不是容錯率。所有屬於 CP 的 SP，會被視作一個 Storage Pool，而不是單獨的儲存節點。為了達到這個需求，我們也會建議管理者提升網路的等級，來加快整體的運作時間。

安裝 CP，使用者可以透過 SSH(必頇提供機器 root 密碼)，或是事先安裝好 SP 軟體來達成半自動或全自動的部署。部署完後的 SP，將會自動地加入 CP。CP 除了能夠當作這一群 SP 的 MP 以外，也能夠透過叢集指令對每個 SP 的硬碟進行管理。

圖四描繪上述 CP 提供自動部署機制。舉例來說，如果需要新增儲存節點，如同安裝的時候一樣，提供 SSH 密碼，或是在新主機上安裝 SP，都能夠讓 CP 將其自動地納入管理。

另外，叢集指令及叢集指令介面(Clustered Shell)是 CP 提供的特殊功能，允許管理者執行一道指令便能夠在指定或是所有的節點上執行。在未安裝 SP 軟體的情況下，預設會需要提供主機的 root 密碼，以透過 SSH 連線。叢集指令使得大量管理成為可能，例如一次格式化所有主機的新硬碟，並將其掛載上指定目錄。這個需要管理者事先規劃好使用同樣類型的 OS，來防止任何問題。這些節點也可以分做群組，

使得不同類型 OS 或主機都可以應用在同一個 CP 上。

(13)

10

Storage Pool 2 Storage Pool 1

Cluster Peer

Clustered Command Clustered Command

圖四 Cluster Peer 提供自動部署機制

將大量的 SP 視作為磁碟快取的作法，對提高資料格網效率與擴充性有很大幫助。這正符合一些用到大量硬碟空間的計畫需求。

在資料儲存的策略方面，我們選擇的主流的 OS，其 Native IO API 作為預設儲存實體檔案的驅動方式。也因此目前只支援 Linux 與 Windows。我們保留了可以方便撰寫 Extension 的框架，期待更多使用者能夠參與開發。

透過前述定址的方式，任何在資料格網上的檔案將會照著邏輯位置(Logical Location)來服務，也就是在虛擬儲存空間上的位置。而其虛擬的儲存位置，僅會以簡單的雜湊來區隔檔案，而儲存在使用者或是系統指定的目錄裡。雜湊的目的，只需要確認一個目錄的檔案個數上限不會超過檔案系統所能承受的即可。

我們針對現行的檔案系統如 NTFS 及 EXT3，透過 Windows 或是 Unix 都會有的原生 IO 指令來使用。並且設計儲存驅動的程式介面，讓任何人都可以透過撰寫外掛的方式，來驅動自己所需要的儲存設備，如特殊的階層式檔案系統。此外，針對常見的備份式儲存，如磁帶，以及可移除式儲存如 USB 碟與光碟，也都有支援。這些非傳統的儲存介面帶來的挑戰是，利用檔案快取來讓使用者還是能夠存取到。也因此後面會有提到利用 Cluster Peer(CP)及 SP 在近端網路建立大量磁碟快取(Disk Pool) 的作法，圖五所示。而另外一點令人關切的問題是，不同 Domain 的資料，或是不同使用者擁有的資料，是不會能夠被其他人看見。

針對很多學術計畫所需要的大量硬碟空間，透過上述部署機制，減少很多不必要的管理時間。然而，同性質的機器，除了能夠當作儲存，使用其 CPU 的資源也是相當重要。舉例如果有個計畫的成果會產生大量的感測器資料，不僅容量大，而且

(14)

11

數量多，例如 30 萬個檔案。這 30 萬個檔案，如果要能夠擷取其中介資料，再利用單台電腦匯入到資料庫裡建立索引，進行查詢，可以說是相當緩慢。儘管資料格網附有中介資料匯入的方法，但只也只侷限在從一個用戶端進行匯入。

IOManager Native IO

Module

Disks

Tapes, CDROMs

RemovableIO Module

Disk Pools

Non-Conventional Storage Module

Network Filesystems, Databases

圖五資料格網之異質性儲存整合介面

3.

管理及使用

我們針對此建立了大量中介資料處理的機制，主要是透過 Google 提出的 MapReduce 資料平行處理框架。我們修改 MapReduce 架構，來套用在我們的資料格網系統上。根據 MapReduce 架構，任何資料首先必頇平均地散佈在各個 SP 上，這個我們就透過原本資料格網既有的傳輸機制。接著使用者必頇根據程式介面撰寫 Mapping，以及 Reducing 的程序，而與用戶端直接匯入的不同是，必頇自行撰寫 Reducing 的程式碼，而非由系統負責處理。而實際的 Mapping 及 Reducing 程序進行完後，資料就會直接合併為 CP 所管理的中介資料，這樣的程序會持續進行直到中介資料都合併至所有的 MP 為止。圖六顯示上述 MapReduce 資料平行處理框架。

在與 Globus GSI 整合的部份，使用 CP 的管理者，絕大多數都是因為計畫上需要與 Globus 整合，或是透過 Globus 來整合儲存與運算。我們的資料格網系統，也必頇相容於這個要求。但其實 Globus 的通訊，是透過 GSI (Globus Security Infrastructure)，

這個元件使得所有通訊都必頇透過 GSI SSL 加密。也因此任何的節點都要有主機憑證，而要進行連接的使用者也要有使用者憑證。由於這種使用情境並不適合一般單純想要使用儲存的使用者，所以我們將這個功能分離至 CP 中。

(15)

12

與 Globus 整合後，所有節點間的通訊都必頇經過 GSI SSL，免不了加重了 CPU 負擔，也因此上述 CP 與 SP 的硬體要求就相當重要。然而好處是，除了確保計畫資料絕對可以儲存在透過憑證授權的機器，資料格網的資料也都可以透過原先 Globus 支援的方式來傳送。例如在沒有 Globus 的情況下，我們是使用未加密的 HTTP 傳送檔案，

而使用了 Globus 之後，任何檔案傳送就變成使用 GridFTP。而叢集指令也會使用 globus-job-run 的方式處理。對於原本在 Globus 上的應用方式，例如執行平行程式之前，原本都是使用 GridFTP 每個節點來散佈，在我們的設計架構下，也可以呼叫資料格網 API，或是叢集指令來進行快速的散佈。

M ap p in g R ed u ci n g

Grouped SP Metadata

CP Metadata 圖六 MapReduce 資料平行處理框架

對於資料格網管理者而言，資料格網往往都具備一個重要的元件稱為虛擬檔案系統，有了這個元件，任何使用者可以以管理單台電腦同樣的觀念來管理資料格網理的檔案。以往針對虛擬檔案系統的管理方式，一直都是相當繁重而浪費時間的，

絕大部分都是因為浪費在使用者介面與系統通訊來來回回地。SRB 在 3 版後，不管是 Web 介面，還是視窗介面，都實做出來了，但卻還是很難讓使用者感受到資料格網是可以管理大量檔案的。問題也是在於操作順暢度，以及系統針對中介資料如何處理。透過前述的中介資料快取，解決了很多順暢度的問題，因此要達成如檔案總管般拖拉檔案相當地簡單。基本的檔案管理動作，如複製，更名，移動，會直接對應到資料格網的 API。而比較複雜的動作，像是讓檔案在節點間移動，我們也會採取較視覺化的作法。而另一些基本元素的管理，如使用者，也是將其模擬成如同檔案般的觀念，使得管理者可以以同樣的觀念處理所有的工作。

(16)

13

對於資料格網開發者而言，對於資料格網上的資料，開發者必頇透過 API，對檔案進行很有效率的 CRUD(新增，取回，修改，刪除)。也因此在後面提到的開發框架中，我們會說明使用 ActiveRecord 這個資料存取設計樣式，來設計我們的資料格網 API。ActiveRecord 這個設計樣式，主要是將任何的永久式資料儲存(如資料庫，

或是 Directory Service)，裡面的單一資料，利用物件導向的觀念，將其欄位對應成程式裡物件的屬性。套用在資料格網上，例如查詢的時候，就可以取回其中介資料，

以及檔案存取點的 URL。同樣地，當任何屬性修改的時候，透過呼叫儲存成員函式，

就可以直接存回 MP。

對於一般使用者而言，由於 P2P 資料格網的特性，使得透過資料格網來共享檔案成為可能。我們也設計了使用者專用的 P2P 共享介面，並讓使用者在操作資料格網的時候，也可以輕鬆地決定是否要公開分享檔案。在 P2P 社群中，共享的檔案可以輕易地利用中介資料加上標籤，並且透過前述中介資料的索引功能，加強搜尋的準確性。此外，這個虛擬社群也間接地解決了假檔的問題，在 P2P 領域的許多論文都有不同的作法。而我們是利用資料格網的中介資料，讓檔案的使用度，使用者對於共享檔案的熱門度，直接反映在中介資料上。並利用社交網路(Social Network)的觀念，讓使用者在系統提供的虛擬社群上共享檔案，使得假檔率降到最低。

從應用程式的觀點，以往在資料格網上，任何應用程式如果要存取資料，都必頇重新利用用戶端函式庫來撰寫，而通常存取資料這工作，在應用程式上都是相當底層，重新撰寫通常會有一定難度。為了使得任何可能的應用程式可以透過既有的單一介面存取到虛擬檔案系統裡的資料，我們也提供了虛擬檔案系統對作業系統的存取介面。這個作法目前儘管只能夠運行在 Linux 及 Windows，卻可以讓多數舊有的應用程式也可以享受資料格網的好處。這樣的應用程式如 FTP Server，或 HTTP Server 都可以在資料格網上使用。

4.

策略與安全性

對資料格網的使用者管理來說，所需要的是不同階層的權限來進行管理。以往的資料格網遇到的問題是，針對不同情境所需要的 ACL 條件，例如群組能夠做什麼，或是特定的管理階層能夠做什麼，無法提供預設的策略，也無法讓管理員能夠輕易地修改為自己管理上的需要。

在權限控管清單這樣的機制中(ACL Policy)，如果要做到越精確，系統負載就會

(17)

14

越大。但如果想要系統進行權限檢查時更快速，就必頇捨棄很多驗證的項目。而我們的格網系統，主要是根據 Unix 的檔案權限系統，只有提供「擁有者」，「群組」，

「其他」等的權限等級，而驗證功能為「讀取 Metadata」，「完全讀取」，「寫入 Metadata」，「完全寫入」等，由於我們的群組功能支援子群組，也因此不需要考慮一個檔案是不是要屬於很多群組這樣的功能。在預設的情況下，檔案是被擁有人完全控制，而群組是完全讀取，而非擁有者或群組的帳號是無法讀取或寫入。

在安全性方面，目前的資料格網系統，如 SRB，使用的是自行撰寫的認證系統。

而任何單一簽入的動作都是傳遞給中央控管的 MCAT 伺服器，這可以想像，每個檔案都需要進行如此繁雜的動作，需要多少的負載。另一方面，SRB 也支援與 Globus GSI 整合，GSI 的方式幾乎是透過憑證，加上與本機的使用者進行對應。但 GSI 的缺點，便是安裝，申請憑證的程序需要透過人工。如果要很快速地讓使用的儲存節點增加，是相當地困難。

而我們的格網系統採取上述優點的部分，並利用了 HTTP 協定，分做兩種情況。

一般的情況下，任何連線要進行認證前，會採取 SSL 進行通訊，使用的是該節點安裝的時候便會產生的主機憑證。而實際進行 SSL 連線的時候，考量到該 SP 可能會換 IP，便不會進行憑證驗證。SSL 連線中途，便向該 SP 傳送帳號密碼，而此 SP 會再透過 MP 進行認證。認證完畢，會傳回由 MP 產生的 Session Key。如此只要連接在同一個 MP 下的 SP，便不需要重新認證。當 Client 確定離線後，Session Key 便會被清除。圖七展示了上述的認證過程。

這個機制的優點是：第一，認證的單位，也就是想要單一簽入的範圍可以擴充。

要進行認證的帳號，可以在 MP 上設定將使用者中介資料移往更上層的 MP，這樣因為發 session key 的單位為更上層，因此單一簽入的範圍就可以跨許多 MP。另一點是，採取加密的連線，只會發生在進行認證連線的時候，因此對大多數的使用者，

儘管沒那樣安全，卻降低不必要的 CPU 負載。

(18)

15

MP Authentication ( No Certificate Validation)

Session Key

Request data with session key

SSL

NO SSL

Other SP/Client

SP 2

Authentication ( No Certificate Validation)

Session Key SP 1

Data Disconnect Reset Key Status

圖七 HTTP 格網認證機制

5.

格網經濟模型

我們針對資料格網的應用，提出可以套用在市場行為的格網經濟模型。在我們的研究方法中，當有一個用戶端節點或新的 Replica 節點(以下簡稱 Client)要從資料格網中從己存在的 Replica 節點同時下載檔案時，Client 會送出所要求的條件給 Broker，條件可包含要最快完成下載的組合、成本最低的組合、多少時間內要完成下載或只限制最多要花多少成本來完成下載，Broker 再去向 RLS(Replica Location Server)去查詢相關 Replica 節點的資訊，包括 Replica 的位置、每下載 1MB 的單位成本和 Replica 與 client 之間的頻寛，Broker 取得相關資訊後，會依 client 所要求的條件來計算出要由那些節點，需要在各節點下載多少 MB 的檔案，再由 client 去向各 Replica 下載資料。圖八描繪出我們在這個問題上所要採用的經濟模型架構。這邊值得一提的是，在平行下載檔案的實例中，下載完檔案需要合併檔案，而合併檔案需要時間去處理，但是合併資料所需的時間和下載大量資料的時間相比，可能是可以忽略，所以在我們提出的方法中並沒有將合併時間考慮進去。

(19)

16

圖八資料格網經濟模型架構圖

對於最快完成與最小成本的方法，演算法的設計其實並不難。當使用者限制時間內要完成檔案下載，或限制成本完成檔案下載，我們暫時考慮採用以下的方法。

限制時間內要完成檔案下載 - 先計算出每個 Replica 在限制時間內最多可下載 的檔案大小 Capabilityi，再依照成本來排序，由最小成本開始選擇 Replica，直到整個檔案大小可在所選的 Replica 下載完成，則所選的 Replica 即是可完成限制時間內最小成本可完成下載檔案的 Replica 組合。

限制成本完成檔案下載 -我們所提出的演算法是將限制的成本完全使用以求最快的時間能夠完成下載檔案，演算法的概念是假設 Client 從每個 Replica 下載不同大小的部份檔案，可以使下載的成本剛好等於限制的成本，因每個 Replica 下載的部份檔案大小都不一樣，每個 replica 都會有一個變數。為了演算法複雜度，我們將每個 Replica 產生一個分數，而成本愈低分數會愈高，所以演算法會先排序，讓成本愈低的 Replica 得到愈高的分數，而這些分數會都會乘上一個變數，我們只要調整此乘冪變數即可讓成本等於限制成本，再進行最佳化的部份。

上述的劇情，基本上可以導入不同的應用領域。我們將提出若干個最佳化的演算法，作為服務式計算的核心。並且，實作此一經濟模型與使用者介面，提供適應於各種使用者需求的 Web 服務。

四、結論與討論

我們以 W3C 的 Web 相關技術，及現有的 P2P 演算法為基礎，研究資料格網的中介資料，虛擬儲存，大量部署管理等核心技術的開發。而與現有的資料格網中介

(20)

17

軟體-Storage Resource Broker(SRB)，進行效能的比較。此外在校園學術網路上進行系統平台的大量部署，利用校園內的分散硬碟空間，成功建置大型的資料格網系統。另外，我們進行資料領域特定語言的設計與各種平台函式庫開發，並且在學術網路上進行建置與部署並且測試其效能。我們也與其他學校進行緊密的整合，並進行許多細節的修正。利用前項所發展的資料格網中介軟體 API，以及 P2P 分散式排程的技術，我們進行資料自我感測式複寫技術的研究，並且開發完整的使用者介面，以及發展成該平台複寫技術的核心。另外，建構以服務為導向的格網經濟模型，

應用於各種 Web 服務，進而滿足不同使用者的需求。格網經濟模型研究的重點在於同時考量供給者的維運成本與滿足消費者的不同需求(QoS)，發展一套未來可以導入企業網路的格網經濟架構。

我們改良 MapReduce 這個資料處理模型，並且配合原本的資料格網系統應用觀念，發展能夠大量處理資料的應用程式介面，以及能夠針對特定領域來進行資料處理的領域特定語言與各種平台函式庫開發。利用資料格網的管理分析工具，瞭解資料的分佈及網路，運算資源的狀態，整理出資料的分佈模式。讓資料能夠運用其存放節點的運算資源，讓每筆資料不需要透過集中式的運算，達成輕鬆地移動或複寫自己而提升整體資料的容錯率。此外，我們也實做出以 P2P 及 Web 技術為基礎的格網中介軟體，以及分散式中介資料服務，高輸出虛擬檔案系統。我們的中介軟體顯示了在節點數量，檔案數量成長時，也有容錯的特性。iRODS 的問題是，當檔案數量成長時，讀取效能也會跟著下降。使用網路型資料庫作為後端，當資料量一大，整體的效能就會降低。我們的內部 JSON 資料格式相對地就比較快速。整體而言，本計畫預計完成的研究項目，皆已經實作出來，並在相關期刊與研討會發表。

下面我們歸納本計畫主要的成果:

 完成資料格網軟體的開發

 完成開發中介資料處理、儲存、與快取機制

 完成通訊機制與安全性機制

 完成 P2P 索引機制

 完成虛擬儲存空間與兩種作業系統的虛擬檔案系統連結

 針對 SRB 進行效能的比較與系統功能測試

 完成 CP 的大量部署機制

 在學術網路上進行資料格網的部署

 完成開發使用者介面包含 WEB 介面與應用程式介面(APIs)

 完成分析資料格網運作紀錄，建立資料運作模型

(21)

18

 完成資料自我感測複寫機制技術與實作

 完成格網經濟模型，並開發使用者介面，提供 Web 服務

 發表 4 篇 SCI 國際期刊



Ching-Hsien Hsu, Hai Jin and Franck Cappello, “Peer-to-Peer Grid Technologies”,

Future Generation Computer Systems

(FGCS), Vol. 26, No. 5, pp. 701-703, 2010.



Ching-Hsien Hsu, Yun-Chiu Ching, Laurence T. Yang and Frode Eika Sandnes, “An Efficient Peer Collaboration Strategy for Optimizing P2P Services in BitTorrent-Like File Sharing Networks”,

Journal of Internet Technology

(JIT), Vol. 11, Issue 1, January 2010, pp. 79-88. (SCIE, EI)



Ching-Hsien Hsu and Shih Chang Chen, “A Two-Level Scheduling Strategy for Optimizing Communications of Data Parallel Programs in Clusters”, Accepted,

International Journal of Ad-Hoc and Ubiquitous Computing

(IJAHUC), 2010. (SCIE, EI, IF=0.66)



Ching-Hsien Hsu and Bing-Ru Tsai, “Scheduling for Atomic Broadcast Operation in Heterogeneous Networks with One Port Model,” The

Journal of Supercomputing

(TJS), Springer, Vol. 50, Issue 3, pp. 269-288, December 2009. (SCI, EI, IF=0.615)

 發表 6 篇國際研討會論文



Ching-Hsien Hsu, Alfredo Cuzzocrea and Shih-Chang Chen, "CAD: Efficient Transmission Schemes across Clouds for Data-Intensive Scientific Applications", Proceedings of the 4th International Conference on Data Management in Grid and P2P Systems, LNCS, Toulouse, France, August 29-September 2, 2011.



Tai-Lung Chen, Ching-Hsien Hsu and Shih-Chang Chen, “Scheduling of Job Combination and Dispatching Strategy for Grid and Cloud System,” Proceedings of the 5th International Grid and Pervasive Computing (GPC 2010), LNCS 6104, pp. 612-621, 2010.



Shih-Chang Chen, Tai-Lung Chen and Ching-Hsien Hsu, “Message Clustering Techniques towards Efficient Communication Scheduling in Clusters and Grids,” Proceedings of the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010), LNCS 6081, pp. 283-291, 2010.



Shih-Chang Chen, Ching-Hsien Hsu, Tai-Lung Chen, Kun-Ming Yu, Hsi-Ya Chang and Chih-Hsun Chou, “A Compound Scheduling Strategy for Irregular Array Redistribution in Cluster Based Parallel System,” Proceedings of the 2nd Russia-Taiwan Symposium on Methods and Tools for Parallel Programming (MTPP 2010), LNCS 6083, 2010.



Ching-Hsien Hsuand Tai-Lung Chen, “Adaptive Scheduling based on Quality of Services in Heterogeneous Environments”, IEEE Proceedings of the 4

^th

International Conference on Multimedia and Ubiquitous Engineering (MUE), Cebu, Philippines, Aug. 2010.



Ching-Hsien Hsu, Yen-Jun Chen, Kuan-Ching Li, Hsi-Ya Chang and Shuen-Tai Wang, "Power Consumption Optimization of MPI Programs on Multi-Core Clusters" Proceedings of the 4th ICST International Conference on Scalable Information Systems (InfoScale 2009), Hong Kong, June, 2009, Lecture Notes of the Institute for Computer Science, Social Informatics and Telecommunications Engineering, (ISBN: 978-3-642-10484-8) Vol. 18, pp. 108-120, (DOI:

10.1007/978-3-642-10485-5_8) (EI)

五、計畫成果自評

本計畫完成了大量中介資料處理機制，以及 CP 的儲存緩衝區與大量部署機制。此外，我們也在多所大學進行部署、分析資料格網運作紀錄，建立資料運作模型。本計畫相關論文產出共計發表四篇期刊論文與六篇研討會論文。期刊論文部

(22)

19

分，論文[21]提出了 P2P 網路中跨 ISP 通訊最佳化的技術。論文[22]發表應用於異質性網格系統中通訊排程的技術。論文[23]提出了異質性網路訊息廣播的技術。其他研討會論文則是發表與本計畫相關之實作成果。整體來說，研究成果完全符合預期之目標。

本計畫有目前研究成果，感謝國科會給予機會、也感謝許多合作的學校、教授、

同學協助軟硬體的架設、測試、與協助機器的管理。另外，對於參與研究計畫執行同學的認真，本人亦表達肯定與感謝。

六、參考文獻

[1]

W3C Standards

http://www.w3.org/

[2]

[2] Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, Steven Tuecke “The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientic Datasets,” Journal of Network and Computer Applications, 2000

[3]

[3] Arcot Rajasekar, Michael Wan, Reagan Moore, George Kremenek, Tom Guptil “Data Grids, Collections, and Grid Bricks,” Proceedings. 20th IEEE/11th NASA Goddard Conference

on Mass Storage Systems and Technologies, 2003. (MSST 2003).

[4]

[4] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis

“Evaluating MapReduce for Multi-core and Multiprocessor Systems,” Proceedings of the 2007

IEEE 13th International Symposium on High Performance Computer Architecture

[5]

[5] Ching-Hsien Hsu and Chih-Chun Chang, “QoS and Economic Adaptation Scheduling for Bag-of-Task Applications in Service Oriented Grids”, Accepted, Journal of Internet

Technology (SCI), 2009

[6]

[6] Ching-Hsien Hsu, Chi-Guey Hsu and Shih-Chang Chen,“Efficient Message Traversal Techniques towards Low Traffic P2P Services”, Accepted,International Journal of

Communication Systems (SCI), 2009

[7]

[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber “Bigtable: A Distributed Storage System for Structured Data,” 7th USENIX Symposium on Operating Systems Design and

Implementation (OSDI), 2006, pp. 205-218

[8]

[8] Gurmeet Singh, Shishir Bharathi, Ann Chervenak, Ewa Deelman, Carl Kesselman,Mary Manohar, Sonal Patil, Laura Pearlman “A Metadata Catalog Service for Data Intensive Applications,” Proceedings of the 2003 ACM/IEEE conference on Supercomputing.

[9]

[9] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishna “Chord: A Scalable Peertopeer Lookup Service for Internet Applications,” Proceedings of the 2001

conference on Applications, technologies, architectures, and protocols for computer communications.

[10]

[10] Jeffrey Dean and Sanjay Ghemawat “MapReduce: Simplied Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, 2004, pp. 137-150.

[11]

[11] Jeffrey Dean “Experiences with MapReduce, an abstraction for large-scale computation,”

Proc. 15th International Conference on Parallel Architectures and Compilation Techniques,

2006, pp. 1.

[12]

[12] Jiannong Cao and Fred B. Liu “P2PGrid: Integrating P2P Networks into the Grid Environment,” Concurrency and Computation: Practice and Experience, 2007

[13]

[13] Karl Aberer, Philippe Cudré-Mauroux, Anwitaman Datta, Zoran Despotovic, Manfred Hauswirth, Magdalena Punceva, Roman Schmidt “P-Grid: A Self-organizing Structured P2P System,” SIGMOD Record, 32(2), September 2003.

[14]

[14] Karl Aberer, Anwitaman Datta, Manfred Hauswirth “P-Grid: Dynamics of self-organization processes in structured P2P systems,” Peer-to-Peer Systems and Applications,

Lecture Notes in Computer Science, LNCS 3845, Springer Verlag, 2005.

[15]

[15] Michael Wan, Arcot Rajasekar, Reagan Moore, Phil Andrews “A Simple Mass Storage

(23)

20

System for the SRB Data Grid,” Proceedings. 20th IEEE/11th NASA Goddard Conference on

Mass Storage Systems and Technologies, (MSST 2003), 2003.

[16]

[16] Mike Burrows “The Chubby lock service for loosely-coupled distributed systems,” 7th

USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

[17]

[17] N. Santos and B. Koblitz “Distributed Metadata with the AMGA Metadata Catalog”

Workshop on Next-Generation Distributed Data Management

[18]

[18] N. Santos and B. Koblitz “Metadata Services on the Grid,” Proceedings of the X

International Workshop on Advanced Computing and Analysis Techniques in Physics Research.

[19]

[19] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung “The Google File System,”

Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 20-43.

[20]

[20] Tim Oreilly “What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software,” Communications & Strategies, No. 1, p. 17, First Quarter 2007

[21]

Ching-Hsien Hsu, Yun-Chiu Ching, Laurence T. Yang and Frode Eika Sandnes, “An Efficient Peer Collaboration Strategy for Optimizing P2P Services in BitTorrent-Like File Sharing Networks”, Journal of Internet Technology (JIT), Vol. 11, Issue 1, January 2010, pp. 79-88.

[22]

Ching-Hsien Hsu and Shih Chang Chen, “A Two-Level Scheduling Strategy for Optimizing Communications of Data Parallel Programs in Clusters”, Accepted, International Journal of Ad-Hoc and Ubiquitous Computing (IJAHUC), 2010.

[23]

Ching-Hsien Hsu and Bing-Ru Tsai, “Scheduling for Atomic Broadcast Operation in Heterogeneous Networks with One Port Model,” The Journal of Supercomputing (TJS), Springer, Vol. 50, Issue 3, pp.

269-288, December 2009.

[24]

Tai-Lung Chen, Ching-Hsien Hsu and Shih-Chang Chen, “Scheduling of Job Combination and Dispatching Strategy for Grid and Cloud System,” Proceedings of the 5th International Grid and Pervasive Computing (GPC 2010), LNCS 6104, pp. 612-621, 2010.

[25]

Shih-Chang Chen, Tai-Lung Chen and Ching-Hsien Hsu, “Message Clustering Techniques towards Efficient Communication Scheduling in Clusters and Grids,” Proceedings of the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010), LNCS 6081, pp.

283-291, 2010.

[26]

Shih-Chang Chen, Ching-Hsien Hsu, Tai-Lung Chen, Kun-Ming Yu, Hsi-Ya Chang and Chih-Hsun Chou,

“A Compound Scheduling Strategy for Irregular Array Redistribution in Cluster Based Parallel System,”

Proceedings of the 2nd Russia-Taiwan Symposium on Methods and Tools for Parallel Programming (MTPP 2010), LNCS 6083, 2010.

(24)

出席國際學術會議心得報告

計畫名稱應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介軟體與經濟模型

計畫編號 NSC 97-2628-E-216-006-MY3

報告人姓名許慶賢

服務機構

及職稱

中華大學資訊工程學系教授

會議名稱 The 2nd Russia-Taiwan Symposium on Methods and Tools of Parallel Programming Multicomputers (MTPP 2010)

會議 / 訪問時間地點海參威, 俄羅斯 / 2010.05.16-19

發表論文題目 A Compound Scheduling Strategy for Irregular Array Redistribution in Cluster Based Parallel System

參加會議經過

會議時間行程敘述

2010/05/16 (上午)

10:00 會場報到

11:00 參訪研究中心 (半日) (下午)

6:00 committee meeting (晚上)

7:00 參加歡迎茶會 2010/05/17 (上午)

9:00 開幕致詞

9:10 聽取 Parallel Algorithm 相關論文發表 11:00 聽取 Models and Tools 相關論文發表

(下午)

2:00 聽取 Parallel Programming 相關論文發表

(25)

2010/05/18 (上午)

9:00 發表論文

11:00 聽取 System Algorithm 相關論文發表

(下午)

2:00 聽取 Numerical simulation 相關論文發表 4:00 參訪 Far East National University (晚上)

7:00 參加晚宴 2010/05/19 (上午)

9:00 聽取 Simulation 相關論文發表

MTPP-10 是台俄雙邊在平行計算研究領域主要的研討會。這一次參與 MTPP-10 ，本人擔任會議議程主席，除了發表相關研究成果以外，也在會場與多位國外教授交換研究心得，

並且討論未來可能的合作。

這一次參與 MTPP-10 除了發表我們最新的研究成果以外，也在會場中，向多位國內外學者解釋我們的研究內容，彼此交換研究心得。除了讓別的團隊知道我們的研究方向與成果，藉此，我們也學習他人的研究經驗。經過兩次的雙邊研討會交流，雙方已經找到共同研究的題目，兩邊的團隊也將於今年(2010 年)8 月開始撰寫研究計畫書，進行更密切的合作。

這一次在 Vladivostok, Russia 所舉行的國際學術研討會議議程共計四天。開幕當天由俄羅斯方面的 General Co-Chair，RSA 的 Victor E. Malyshkin 教授，與敝人分別致詞歡迎大家參加這次的第二屆 MTPP 2010 國際研討會。接著全程參與整個會議的流程，也聽取不同論文發表，休息時間與俄羅斯的學者教授交換意見和資訊。本人發表的論文在會議第三天的議程九點三十分發表（A Compound Scheduling Strategy for Irregular Array Redistribution in Cluster Based Parallel System ）。本人主要聽取 Parallel and Distributed 、Grid、Cloud 與 Multicore 相關研究，同時獲悉許多新興起的研究主題，

並了解目前國外學者主要的研究方向。最後一天，我們把握機會與國外的教授認識，希望能夠讓他們加深對台灣研究的印象。這是一次非常成功的學術研討會。

主辦第一、二屆台俄雙邊學術研討會，感受良多。論文篇數從第一屆的 30 篇到第二屆的 50 篇，也讓本人感受到這個研討會的進步成長。台方參與的教授學生超過 15 個學研單位，

包括台大、清華、交大、中研院、成大、中山、等等。俄方也有超過 10 個學研單位的參與。

值得一提的是，這一次的論文集我們爭取到 Springer LNCS 的出版，並且在 EI 索引。這一個研討會與發表的論文，其影響力已達到國際的水準。

(26)

(27)

A Compound Scheduling Strategy for Irregular Array Redistribution in Cluster Based Parallel System

Shih-Chang Chen¹, Ching-Hsien Hsu², Tai-Lung Chen¹, Kun-Ming Yu², Hsi-Ya Chang³ and Chih-Hsun Chou^2*

1 College of Engineering

2 Department of Computer Science and Information Engineering Chung Hua University, Hsinchu, Taiwan 300, R.O.C.

3 National Center for High-Performance Computing, Hsinchu 30076, Taiwan

{scc, robert, tai}@grid.chu.edu.tw, [email protected], [email protected], [email protected]

Abstract. With the advancement of network and techniques of clusters, joining clusters to

construct a wide parallel system becomes a trend. Irregular array redistribution employs generalized blocks to help utilize the resource while executing scientific application on such platforms. Research for irregular array redistribution is focused on scheduling heuristics because communication cost could be saved if this operation follows an efficient schedule. In this paper, a two-step communication cost modification (T2CM) and a synchronization delay-aware scheduling heuristic (SDSH) are proposed to normalize the communication cost and reduce transmission delay in algorithm level. The performance evaluations show the contributions of proposed method for irregular array redistribution.

1 Introduction

Scientific application executing on parallel systems with multiple phases requires appropriate data distribution schemes. Each scheme describes the data quantity for every node in each phase. Therefore, performing data redistribution operations among nodes help enhance the data locality.

Generally, data redistribution is classified into regular and irregular redistributions. BLOCK, CYCLIC and BLOCK-CYCLIC(c) are used to specify array decomposition for the former while user-defined function, such as GEN_BLOCK, is used to specify array decomposition for the latter. High Performance Fortran version 2 provides GEN_BLOCK directive to facilitate the data redistribution for user-defined function. To perform array redistribution efficiently, it is important to follow a schedule with low communication cost.

With the advancement of network and the popularizing of cluster computing research in campus, it is a trend to join clusters in different regions to construct a complex parallel system. To performing array redistribution on this platform, new techniques are required instead of existing methods.

Schedules illustrate time steps for data segments (messages) to be transmitted in appropriate time. The cost of schedules given by scheduling heuristics is the summation of cost of every time steps while cost of each time step is dominated by the message with largest cost. A phenomenon is observed that most local transmissions, which are happened in a node, do not dominate the cost of each step although they are in algorithm level for existing methods. In other words, they are overestimated. Since a node can send and receive only one message in the same time step [5], the arranged position of each message becomes important. Therefore, a two-step communication cost modification (T2CM) and a synchronization

delay-aware scheduling heuristic (SDSH) are proposed to deal with the overestimate problems, reduce

overall communication cost and avoid synchronization of schedules in algorithm level.

(28)

The rest of this paper is organized as follows: Section 2 gives a survey of existing works related to array redistribution. Section 3 gives notations, terminology and examples to explain each parts of scheduling heuristics. The proposed techniques are described in section 4. Section 5 presents the results of the comparative evaluation, while section 6 concludes the paper.

2 Related Work

Array redistribution techniques have been developed for regular array redistribution and GEN_BLOCK redistribution in many papers. Both kinds of redistribution issues require at least two sorts of techniques.

One is communication sets identification which decomposes array for nodes; the other one is communication scheduling method which derives schedules to shorten the overall transmission cost for redistributions.

ScaLAPACK [9] was proposed to identify communication sets for regular array redistribution. Guo et al.

[2] proposed a symbolic analysis method to help generate messages for GEN_BLOCK redistribution. Hsu et

al. [3] proposed the Generalized Basic-Cycle Calculation method to shorten the communication for

generalized cases. The research on prototype framework for distributed memory platforms is proposed by Sundarsan et al. [11] who developed a method to distribute multidimensional block-cyclic arrays on processor grids. Karwande et al. [8] presented CC-MPI with the compiled communication technique to optimize collective communication routines. Huang et al. [6] proposed a flexible processor mapping technique to reduce the number of data element exchanging among processors and enhance the data locality.

To reduce indexing cost, a processor replacement scheme was proposed [4]. With local matrix and compressed CRS vectors transposition schemes the communication cost can be reduced significantly.

Combining the advantages of relocation scheduling algorithm and divide-and-conquer scheduling algorithm, Wang et al. [12] proposed a method with two phases for GEN_BLOCK redistribution. The first phase acts like relocation algorithm, but the contentions avoidance mechanism of second phase will not be proceeded immediately while contentions happened. To minimize the total communication time, Cohen et al. [1]

supposed that at most k communication can be performed at the same time and proposed two algorithms with low complexity and fast heuristics. A study [7] focusing on the cases of local redistributions and inter-cluster redistribution was given by Jeannot and Wagner. It compared existing scheduling methods and described the difference among them. Rauber and Runger [10] presented a data-re-distribution library to deal with composed data structures which are distributed to one or more processor groups for executing multiprocessor task on distributed memory machines or cluster platforms. Hsu et al. [5] proposed a two-phase degree-reduction scheduling heuristic to minimize the overall communication cost. The proposed method derives each time step of a complete schedule by performing degree reduction technique while the number of messages of each node representing the degree of each vertex in algorithm level.

3 Preliminary

Following are notations, terminology and examples to explain each parts of scheduling heuristics for GEN_BLOCK redistribution. To improve data locality, multi-phase scientific problems require appropriate data distribution schemes for specific phases. For example, to distribute array for two different phases on six nodes, which are indexed from 0 to 5, two strings, {13, 20, 17, 17, 12, 21} and {16, 18, 13, 16, 29, 8},

(29)

are given, where the array size is 100 units. These two strings provide necessary information for nodes to generate messages to be transmitted among them. Fig. 1 shows these messages marked from m₁ to m₁₁ and are with information such as data size, source node and destination node in the relative rows.

Scheduling heuristics are developed for providing solutions of time steps to reduce total communication cost for a GEN_BLOCK redistribution operation. In each step, there are several messages which are suggested to be transmitted in the same time step. To help perform an efficient redistribution, scheduling methods should avoid node contention, synchronization delay and redundant transmission cost. It is also important to follow policies of messages arrangement, i.e. with the same source nodes, messages should not be in the same step; with the same destination nodes, messages should be in different step; a node can only deal with one message while playing whether source node or destination node. These messages that cannot be scheduled together called conflict tuples, for example, a conflict tuple is formed with messages m1 and m2. Note that if a node can only deal with a message while it is a source/destination node, the number of steps for a schedule must be the equal to or more than the number of messages from/to these nodes. In other words, the minimal number of time steps is equal to the maximal number of messages in a conflict tuple, CTmax.

Information of messages No. of

message

Data size

Source node

Destination node

m₁ 13 0 0

m2 3 1 0

m3 17 1 1

m₄ 1 2 1

m5 13 2 2

m6 3 2 3

m₇ 13 3 3

m₈ 4 3 4

m9 12 4 4

m10 13 5 4

m₁₁ 8 5 5

Fig. 1. Information of messages generated from given schemes to be transmitted on six nodes which are indexed from 0 to 5

Fig. 2 gives a schedule with low communication cost and arranges messages in the number of minimal steps.

In this result, there are three time steps with messages sent/received to/from different nodes. The values beside m_1~11 are data size, the cost of each step is dominated by the largest one. Thus, m₃, m₁ and m₈ dominate step 1, 2 and 3, and the estimated cost are 17, 13 and 4, respectively. To avoid node contentions, messages m₁ and m₂ are in separate steps due to destination nodes of both messages are the same. Based on same argument, m₂ and m₃ are in separate steps due to both messages are members of a conflict tuple. The total cost which represents the performance of a schedule is the summation of all cost of steps. In other words, a schedule with lower cost is better than another one with higher cost in terms of performance.

(30)

A result of scheduling heuristics

No. of step No. of message Cost of step Step 1 m3(17), m5(13), m7(13) , m10(13) 17 Step 2 m₁(13), m₆(3), m₉(12), m₁₁(8) 13 Step 3 m₂(3), m₄(1), m₈(4) 4

Total cost 34

Fig. 2. A result of scheduling messages with low communication cost and minimal steps

The result in Fig. 2 schedules messages in three steps, which is the number of minimal steps or CTmax. The total cost is small which representing low communication cost due to messages with larger cost and messages with smaller cost are in separate steps. However, the schedule can still be better by providing a cost normalization method and a new scheduling technique to avoid synchronization delay among nodes during message transmissions in next section.

4 The Proposed Method

In this paper, a two-step communication cost modification (T2CM) and a synchronization delay-aware

scheduling heuristic (SDSH) are proposed to normalize the communication cost of messages and reduce

transmission delay in algorithm level. The first step of T2CM is a local reduction operation, which deal with the message happened in local memory. In other words, candidates are transmissions whose source node and destination node are the same node. For example, m1, m3, m5, m7, m9 and m11 are such kind of transmissions which happened inside nodes. The second step is a inter amplification method, which is responsible for transmissions happened across clusters. Assumed there are two clusters, and node 0~2 are in cluster 1, other nodes are in cluster 2. Then m6 is such message which is transmitted from cluster 1 to cluster 2. Both operations are responsible for different kind of transmissions due to the heterogeneity of network bandwidth. The local reduction operation reduces simulated cost of messages to 1/8 which is evaluated from PC clusters that connected with 100Mbps layer-2 switch. On same argument, inter

amplification operation increases cost of messages five times. The cost then becomes more practical for

real machines when scheduling heuristics try to give a perfect schedule with low communication cost. For previous research, the difference does not exist in algorithm level of scheduling heuristics in and could result in erroneous judgments and high communication cost.

Fig. 3 gives the results of local reduction and inter amplification operations modifying data size for messages m1~11. The given schedule in Fig. 2 becomes the results in Fig. 4. Difference of Fig. 2 and Fig.

4 shows the schedule could be improved and explains the explain the erroneous judgments. First, the dominators in step 1 and 2 are changed to others whose estimated cost is larger in Fig. 4. For example, the

m

3 and m1 are replaced by m10 and m6 for both steps, respectively. Second, the cost of step 1 and step 2 are changed due to new dominators are chosen in both steps. Furthermore, the synchronization delay is small in algorithm level but results in more node idle time in practical. For instance, the cost of m3, m5, m7 and

m

10 are 17, 13, 13 and 13 are close to each other in step 1 in Fig. 2. But it is quite different in practical in

行政院國家科學委員會專題研究計畫 成果報告

行政院國家科學委員會專題研究計畫 成果報告

利用 P2P 技術建構以服務導向架構(SOA)為基礎之輕量化格 網系統--子計畫二:應用 P2P 與 Web 技術發展以 SOA 為基礎

的格網中介軟體與經濟模型(第 3 年) 研究成果報告(完整版)

行政院國家科學委員會補助專題研究計畫 █ 成 果 報 告

□期中進度報告

※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※

※ 利 用 P 2 P 技 術 建 構 以 服 務 導 向 架 構 ( S O A ) 為 基 礎 之 ※

※ 輕 量 化 格 網 系 統 － 子 計 畫 二 : ※

※ 應 用 P 2 P 與 W e b 技 術 發 展 以 S O A 為 基 礎 的 ※

※ 格 網 中 介 軟 體 與 經 濟 模 型 (第 三 年 ) ※

※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※ ※

計畫類別：□個別型計畫  整合型計畫 計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日 執行單位： 中華大學資訊工程學系

計畫主持人： 許慶賢 中華大學資訊工程學系教授 共同主持人：

計畫參與人員： 陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列 管計畫及下列情形者外，得立即公開查詢

□涉及專利或其他智慧財產權，□一年  二年後可公開查詢

中 華 民 國 100 年 10 月 28 日

行政院國家科學委員會補助專題研究計畫 █ 成 果 報 告

□期中進度報告

應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介軟體與經濟模型 (第三年)

計畫類別：□個別型計畫  整合型計畫 計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日 計畫主持人： 許慶賢 中華大學資訊工程學系教授 共同主持人：

計畫參與人員： 陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

成果報告類型(依經費核定清單規定繳交)：□精簡報告  完整報告

本成果報告包括以下應繳交之附件：

□赴國外出差或研習心得報告一份

□赴大陸地區出差或研習心得報告一份

 出席國際學術會議心得報告及發表之論文各一份

□國際合作研究計畫國外研究報告書一份

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列 管計畫及下列情形者外，得立即公開查詢

□涉及專利或其他智慧財產權，□一年  二年後可公開查詢

執行單位： 中華大學資訊工程學系

中 華 民 國 100 年 10 月 28 日

目錄

行政院國家科學委員會專題研究計畫成果報告

應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介 軟體與經濟模型

計畫編號：NSC 97-2628-E-216-006-MY3 執行期限：99 年 8 月 1 日至 100 年 7 月 31 日 主持人：許慶賢 中華大學資訊工程學系教授

計畫參與人員： 陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

(中華大學資訊工程學系研究生)

1.

Network Overlay

2.

IOManager Native IO

Module

RemovableIO Module

Disk Pools

Non-Conventional Storage Module

3.

M ap p in g R ed u ci n g

4.

5.

Ching-Hsien Hsu, Hai Jin and Franck Cappello, “Peer-to-Peer Grid Technologies”,

(FGCS), Vol. 26, No. 5, pp. 701-703, 2010.

Ching-Hsien Hsu, Yun-Chiu Ching, Laurence T. Yang and Frode Eika Sandnes, “An Efficient Peer Collaboration Strategy for Optimizing P2P Services in BitTorrent-Like File Sharing Networks”,

(JIT), Vol. 11, Issue 1, January 2010, pp. 79-88. (SCIE, EI)

Ching-Hsien Hsu and Shih Chang Chen, “A Two-Level Scheduling Strategy for Optimizing Communications of Data Parallel Programs in Clusters”, Accepted,

(IJAHUC), 2010. (SCIE, EI, IF=0.66)

Ching-Hsien Hsu and Bing-Ru Tsai, “Scheduling for Atomic Broadcast Operation in Heterogeneous Networks with One Port Model,” The

(TJS), Springer, Vol. 50, Issue 3, pp. 269-288, December 2009. (SCI, EI, IF=0.615)

Ching-Hsien Hsu, Alfredo Cuzzocrea and Shih-Chang Chen, "CAD: Efficient Transmission Schemes across Clouds for Data-Intensive Scientific Applications", Proceedings of the 4th International Conference on Data Management in Grid and P2P Systems, LNCS, Toulouse, France, August 29-September 2, 2011.

Tai-Lung Chen, Ching-Hsien Hsu and Shih-Chang Chen, “Scheduling of Job Combination and Dispatching Strategy for Grid and Cloud System,” Proceedings of the 5th International Grid and Pervasive Computing (GPC 2010), LNCS 6104, pp. 612-621, 2010.

Ching-Hsien Hsuand Tai-Lung Chen, “Adaptive Scheduling based on Quality of Services in Heterogeneous Environments”, IEEE Proceedings of the 4

International Conference on Multimedia and Ubiquitous Engineering (MUE), Cebu, Philippines, Aug. 2010.

10.1007/978-3-642-10485-5_8) (EI)

W3C Standards

[2] Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, Steven Tuecke “The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientic Datasets,” Journal of Network and Computer Applications, 2000

[3] Arcot Rajasekar, Michael Wan, Reagan Moore, George Kremenek, Tom Guptil “Data Grids, Collections, and Grid Bricks,” Proceedings. 20th IEEE/11th NASA Goddard Conference

[4] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis

“Evaluating MapReduce for Multi-core and Multiprocessor Systems,” Proceedings of the 2007

[5] Ching-Hsien Hsu and Chih-Chun Chang, “QoS and Economic Adaptation Scheduling for Bag-of-Task Applications in Service Oriented Grids”, Accepted, Journal of Internet

[6] Ching-Hsien Hsu, Chi-Guey Hsu and Shih-Chang Chen,“Efficient Message Traversal Techniques towards Low Traffic P2P Services”, Accepted,International Journal of

[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber “Bigtable: A Distributed Storage System for Structured Data,” 7th USENIX Symposium on Operating Systems Design and

[8] Gurmeet Singh, Shishir Bharathi, Ann Chervenak, Ewa Deelman, Carl Kesselman,Mary Manohar, Sonal Patil, Laura Pearlman “A Metadata Catalog Service for Data Intensive Applications,” Proceedings of the 2003 ACM/IEEE conference on Supercomputing.

[9] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishna “Chord: A Scalable Peertopeer Lookup Service for Internet Applications,” Proceedings of the 2001

[10] Jeffrey Dean and Sanjay Ghemawat “MapReduce: Simplied Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, 2004, pp. 137-150.

[11] Jeffrey Dean “Experiences with MapReduce, an abstraction for large-scale computation,”

行政院國家科學委員會專題研究計畫成果報告

行政院國家科學委員會專題研究計畫成果報告

利用 P2P 技術建構以服務導向架構(SOA)為基礎之輕量化格網系統--子計畫二:應用 P2P 與 Web 技術發展以 SOA 為基礎

行政院國家科學委員會補助專題研究計畫 █ 成果報告

※ 利用 P 2 P 技術建構以服務導向架構 ( S O A ) 為基礎之 ※

※ 輕量化格網系統－子計畫二 : ※

※ 應用 P 2 P 與 W e b 技術發展以 S O A 為基礎的 ※

※ 格網中介軟體與經濟模型 (第三年 ) ※

計畫類別：□個別型計畫  整合型計畫計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日執行單位：中華大學資訊工程學系

計畫主持人：許慶賢中華大學資訊工程學系教授共同主持人：

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列管計畫及下列情形者外，得立即公開查詢

中華民國 100 年 10 月 28 日

行政院國家科學委員會補助專題研究計畫 █ 成果報告

計畫類別：□個別型計畫  整合型計畫計畫編號： NSC 97-2628-E-216-006-MY3

執行期間： 99 年 08 月 01 日至 100 年 07 月 31 日計畫主持人：許慶賢中華大學資訊工程學系教授共同主持人：

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷

處理方式：除產學合作研究計畫、提升產業技術及人才培育研究計畫、列管計畫及下列情形者外，得立即公開查詢

執行單位：中華大學資訊工程學系

中華民國 100 年 10 月 28 日

應用 P2P 與 Web 技術發展以 SOA 為基礎的格網中介軟體與經濟模型

計畫編號：NSC 97-2628-E-216-006-MY3 執行期限：99 年 8 月 1 日至 100 年 7 月 31 日主持人：許慶賢中華大學資訊工程學系教授

計畫參與人員：陳世璋、陳泰龍 (中華大學工程科學研究所博士生) 許志貴、陳柏宇、李志純、游景涵、黃安婷