A Peer-to-Peer File System on DHT for Cloud Computing

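Master's Thesis, Department of Computer Science and Information Engineering, National University of Kaohsiung. A Peer-to-Peer File System on DHT for Cloud Computing. Student: Kun-Yi Cheng (鄭坤益). Advisor: Dr. Chun-Hsin Wu (吳俊興). August 2009.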

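A Peer-to-Peer File System on DHT for Cloud Computing

Advisor: Dr. Chun-Hsin Wu (Professor), Department of Computer Science and Information Engineering, National University of Kaohsiung
Student: Kun-Yi Cheng, Department of Computer Science and Information Engineering, National University of Kaohsiung

Abstract (Chinese version)

This thesis designs Peeraid, a peer-to-peer file system based on a distributed hash table, for cloud computing environments formed cooperatively from servers contributed by different members. The system has the following characteristics. Every node in the system is a peer that acts both as a user of the file system and as a provider of storage resources. A distributed hash table serves as the mechanism for node lookup and message delivery, which improves the scalability of the file system and automates management when nodes come online or go offline. Erasure coding is adopted as the backup strategy for user files; compared with conventional file replication, it raises file availability at a lower storage cost. Additional system information and the properties of the distributed hash table are used to support distributed groups, and file access is tied to groups so that group locality improves access performance. Measurements of transmission time are added to the messages passed between nodes and recorded, so that nodes with lower latency can be chosen for file transfers, which improves system performance.

Keywords: cloud computing, peer-to-peer file system, distributed hash table, erasure coding, distributed group, group locality, network locality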

A Peer-to-Peer File System on DHT for Cloud Computing

Advisor: Dr. Chun-Hsin Wu (Professor), Department of Computer Science and Information Engineering, National University of Kaohsiung
Student: Kun-Yi Cheng, Department of Computer Science and Information Engineering, National University of Kaohsiung

ABSTRACT

Data management is one of the key issues in cloud computing. This dissertation proposes a peer-to-peer file system, called Peeraid, for clouds structured as a collection of peer nodes supplied by different participants around the world. Peeraid is highly scalable because data object and node lookups are accomplished by a distributed hash table (DHT). User files are stored as erasure-coded shares to achieve high data availability at a lower storage cost. Peeraid also supports distributed group management in a fully decentralized manner, and uses group locality and network locality to improve access performance. In addition, write operations are logged efficiently to support fast updates.

Keywords: cloud computing, peer-to-peer file system, distributed hash table, erasure coding, distributed group, group locality, network locality

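Acknowledgements

Having finally completed two years of master's study, I am filled with excitement and gratitude. These two years have been a major turning point in my life. Looking back, they seem to have passed in the blink of an eye, yet so much has happened.

First, I thank my parents, whose selfless devotion made me who I am today. However hard things were for them, and even in spite of their own ailments, they supported me without reservation so that I could concentrate on completing this thesis. I also thank my younger sister and my girlfriend for keeping me company during this time, for encouraging and comforting me when I was discouraged, and for putting up with my bad temper. I am grateful as well to my two closest classmates in the master's program, 嘉鈞 and 英琪. I will miss the time the three of us spent cheering each other on and working hard together, in our studies and in daily life.

I especially thank my advisor, Professor Chun-Hsin Wu. I feel very fortunate to have met such a good teacher, who from the very beginning was willing to accept a student like me with no background in computer science at all, and who throughout these two years did everything possible to guide and push me, always considering matters from my standpoint. Besides smoothing my path in learning, Professor Wu gave me much valuable advice on the direction of my life.

There are truly too many people to thank. I am a very fortunate person to be surrounded by such wonderful family, teachers, and friends. It is because of all of you that this thesis could be completed. Thank you!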

Contents

Acknowledgements
Contents
List of Figures
1. Introduction
    1.1. Cloud computing
    1.2. Challenges
    1.3. Contributions
    1.4. Rest of thesis
2. Background and related work
    2.1. Client-server distributed file systems
    2.2. Peer-to-peer architecture
    2.3. Distributed hash table
    2.4. Peer-to-peer storage systems
3. System design
    3.1. Design overview
    3.2. DHT module
        3.2.1 DHT-based vs. directory-based
        3.2.2 Kademlia
        3.2.3 Kademlia protocol
        3.2.4 Revising Kademlia
    3.3. Storage module
        3.3.1 File fragmentation
        3.3.2 Erasure coding
        3.3.3 Hybrid backup
        3.3.4 Enhancing write performance with log
        3.3.5 Cooperative service support
        3.3.6 Group locality and file distribution
    3.4. File system module
        3.4.1 File system model
        3.4.2 Data object management
        3.4.3 Security
4. Evaluation
    4.1. Erasure coding vs. replication
    4.2. Group locality and network locality
5. Conclusion and future work
Bibliography

List of Figures

Figure 1-1 An open cloud model.
Figure 1-2 A typical architecture of cloud computing.
Figure 3-1 System architecture of Peeraid.
Figure 3-2 Units in the file system module.
Figure 3-3 Publishing a File-Object.
Figure 3-4 Retrieving a File-Object.
Figure 3-5 DHT-based vs. directory-based.
Figure 3-6 Every group on Peeraid has their own routing table for key lookups.
Figure 3-7 Erasure coding.
Figure 3-8 The contents of a write log.
Figure 3-9 Directory structure of Peeraid.
Figure 4-1 Availability analysis when average node availability is 0.8.
Figure 4-2 Availability comparison when storage cost is four times.
Figure 4-3 The internal nodes of group.
Figure 4-4 The external nodes of group.
Figure 4-5 The file source node.
Figure 4-6 Performance comparison when the number of group nodes is 32.
Figure 4-7 Performance comparison when the number of group nodes is 48.
Figure 4-8 Performance comparison when the number of group nodes is 320.

1 Introduction

Cloud computing is a new style of computing in which dynamically scalable and reconfigurable resources are provided as a service over the Internet [17]. Developers no longer need large capital outlays for hardware to deploy a service, or the staff to operate it. They can simply rent hardware and systems software in datacenters from cloud providers in a pay-as-you-go manner [18]. Although cloud computing simplifies developers' efforts to provide new services, managing huge volumes of data efficiently in the datacenters becomes one of the most fundamental issues for cloud providers. In building a cloud datacenter, service availability, data confidentiality, and storage scalability have been identified as three major concerns [18].

Moreover, most cloud providers have defined their own computation model, storage model, networking model, and even proprietary APIs. This proprietary cloud computing model is not always suitable for developers who want to build portable services or to experiment with new services that have specific resource requirements. To motivate more developers and small service providers to develop innovative applications and services, we believe an open cloud computing model is required to build a collaborative platform for experimentation and deployment. This open model can be structured as a collection of peer nodes supplied by different organizations or participants around the world. In such open clouds, each participant contributes several nodes and takes advantage of the others' nodes to develop and deploy services in a share-as-you-go manner.

Figure 1-1. An open cloud model.

In this dissertation, we propose a peer-to-peer file system, Peeraid, to tackle the storage and data management problems of open cloud systems. We make the following assumptions about the file system:

• The system is aimed at storing large files (from megabytes to gigabytes), such as long documents or media files (photos, images, videos, etc.).
• The contents of files will not be modified frequently.
• Users will not alter file paths arbitrarily.

1.1. Cloud computing

Cloud computing has recently become one of the most popular research topics in computer science. It combines several essential concepts, such as software as a service (SaaS), utility computing, and pay-as-you-go pricing, and provides “anytime, anywhere” services for end users.

Figure 1-2. A typical architecture of cloud computing. (Labels in the figure: SaaS user, web applications, SaaS provider, utility computing, cloud provider.)

Compared with the deployment of earlier Internet services, cloud computing separates service applications from hardware and systems software. Consequently, service providers do not have to worry about building private datacenters that might be over-provisioned or under-provisioned for new services. According to [19], the server capacity of a datacenter is usually set to twice the predicted peak load of a new service. However, machines in these datacenters run at under 30% of their capacity most of the time, thus wasting costly resources.

Figure 1-2 shows a typical architecture of cloud computing. Hardware and systems software are provided by cloud providers. Service providers (SaaS providers) only need to develop their service applications and rent resources from cloud providers according to their needs. A service provider may also be a SaaS user of other services.

At present, cloud providers are usually companies that own large datacenters, such as Google, IBM, Microsoft, and Amazon. Most of them define their own computation model, storage model, networking model, and even proprietary APIs. This proprietary cloud computing model lacks the flexibility needed to develop new services with specific resource requirements. Therefore, we propose an open cloud computing model structured as a collection of peer nodes supplied by different organizations or participants around the world, and design a peer-to-peer file system to tackle the storage and data management problems of this model.

1.2. Challenges

Although the open cloud computing model seems promising, it faces more challenges than the conventional proprietary cloud computing model. First, the peer nodes are spread over the world and belong to different Internet service providers. Any of them might become temporarily inaccessible for many reasons, such as hardware or software failure, network disconnection, server attacks, or natural disasters. The file system must be robust and fault-tolerant, and must guarantee data availability even when many nodes cannot be reached, so that services can be provided to users without interruption. Peeraid adopts a peer-to-peer architecture based on a distributed hash table to address this challenge. The system is fully decentralized to avoid a single point of failure. In addition, Peeraid applies erasure coding for file backup to achieve higher data availability at a lower storage cost than conventional replication. The distributed hash table also makes the system highly scalable and self-sustained. Peeraid can cope with thousands of participating nodes.

However, these benefits of the peer-to-peer architecture come with a second challenge: security. Files are stored at nodes that are supplied by different participants and accessed across a WAN. The file system must keep files safe from improper access. Peeraid provides three basic mechanisms to achieve this goal. First, users must be authenticated by other peers when they try to retrieve the location information of files. Second, the location information of files is encrypted, so only users who possess the correct decryption keys have sufficient knowledge to access files. Third, the contents of files are also encrypted.

In addition, each service has its own characteristics. For example, some services are offered only to the residents of a certain region, and files required by these services should be stored at nodes near that region to obtain better access performance. The file system has to give users flexibility in file distribution. Peeraid supports distributed group management and combines the concept of groups with file distribution. When creating a new file, users can decide whether to store it at the group's peer nodes and thereby use group locality to improve access performance.

1.3. Contributions

In this dissertation, we propose a peer-to-peer file system, Peeraid, to tackle the storage and data management problems of open cloud systems. Based on a distributed hash table, Peeraid is fully decentralized to guarantee scalability and avoid a single point of failure. In addition, Peeraid applies erasure coding for file backup to achieve higher data availability at a lower storage cost. Security and protection are also addressed: Peeraid implements three basic mechanisms to keep files safe from improper access. Taking data locality and network proximity into account, Peeraid provides group management and peer selection to enhance access performance. When creating a new file, users can decide whether to store it at group peer nodes and use group locality to improve access performance.

1.4. Rest of thesis

The remainder of this dissertation is organized as follows. Chapter 2 discusses the background knowledge behind the design of Peeraid. Chapter 3 first gives an overview of the system design, including the types of data objects on Peeraid and the system architecture, and then describes the three major software modules of Peeraid in detail. Chapter 4 presents analysis and experimental results. Finally, Chapter 5 concludes the dissertation and discusses future work.

2 Background and Related Work

File systems are an essential topic in systems research. A stable, reliable, and efficient storage system is a basic and crucial requirement of many applications. For large-scale applications, past experience has shown that a distributed approach is the more promising way to achieve this goal. With the evolution of network technology, distributed file systems have gradually moved from the client-server architecture to the peer-to-peer architecture in recent years. Peeraid is also a peer-to-peer file system. In this chapter, we briefly describe a few existing distributed file systems of both the client-server and the peer-to-peer kind. We also explain why Peeraid adopts a peer-to-peer architecture and describe the principle of the distributed hash table.

2.1. Client-server distributed file systems

Most early distributed file systems are organized along the lines of client-server architectures. Files are stored only at remote file servers. Clients are unaware of the actual locations of their files, and send access requests to the file servers through the interface provided by the file system. Several famous systems such as SUN NFS [26], CODA [21], Plan 9 [24], and SFS [9] are still popular today. Client-server architectures have the advantages of being simple, secure, and reliable, since all user files are managed by file servers. To improve access performance and file availability, file servers are often structured as server clusters linked by a LAN, with a single file distributed across multiple servers. Whenever a client issues an access request, the file servers authenticate the client's identity and carry out the requested operations. Alternatively, a client can access a file locally after having downloaded it from the file servers.

Nevertheless, the client-server architecture has inherent limitations. Storing all user files at centralized servers means that the bandwidth, storage capacity, and processing power of the servers become the bottleneck of the system. Besides, a central server represents a single point of failure, so high availability requires server replication, which increases cost and complexity. Researchers have therefore sought a new architecture that is more scalable, robust, and fault-tolerant to address these limitations.

2.2. Peer-to-peer architecture

Peer-to-peer architecture is a type of computing technology that allows participants to share their resources directly without centralized servers. These resources include processing power, disk space, and network bandwidth. In a peer-to-peer system, all participating nodes are symmetric: each is both a supplier and a consumer of resources. The currently successful services based on peer-to-peer architecture are mostly file sharing systems, such as Napster [22], Gnutella [7], Kazaa [16], and Morpheus [20]. Peer-to-peer systems have the following characteristics.

• Decentralized: In a client-server architecture, the bottleneck of the system is determined by a few centralized servers. Peer-to-peer systems instead distribute tasks to every peer node to achieve load balance.
• Scalable: As nodes arrive, the total capacity of the system increases along with the demand for resources. Ideally, the capacity of a peer-to-peer system is unbounded.
• Robust and fault-tolerant: System resources are distributed over multiple peer nodes, which mitigates single points of failure and DoS attacks. Even if some nodes cannot be reached, participants can still retrieve the resources they need.
• Self-sustained: As nodes join or leave, peer-to-peer systems reallocate resources and manage nodes automatically.

These characteristics make the peer-to-peer architecture very suitable for developing distributed storage systems.

2.3. Distributed hash table

One of the most important issues in peer-to-peer systems is the method used for resource location. Because resources are distributed over diverse peers, an efficient mechanism for object location becomes the deciding factor in the performance of such systems. Distributed hash tables (DHTs), such as Chord [14], Pastry [2], Tapestry [6], and CAN [25], address the problem of locating data and provide an efficient solution.

Most of the DHTs mentioned above take similar approaches. Each node in the system is assigned a node ID by a base hash function such as SHA-1 [12], and this ID indicates the node's position in a circular identifier space. Each object, usually called a key, is also assigned a key ID by the same hash function. Key k is then assigned to the node whose identifier is numerically closest to k, and replicated to that node's neighbors. Let m be the number of bits in the key and node identifiers. Each node maintains a routing table with m entries. The i-th entry in the table contains information about a node whose distance from the table's owner is between 2^i and 2^(i+1). To locate the node closest to key k, the originator first searches its routing table for the node i whose identifier is closest to k, asks i for the node j that i knows to be closest to k, and then asks j in turn. By repeating this process, the originator learns about nodes with identifiers closer and closer to k, and finally finds the closest node. It has been proven that, in an N-node system, each DHT node maintains information about only O(log N) other nodes, and resolves all resource lookups via O(log N) messages to other nodes.

2.4. Peer-to-peer storage systems

Peeraid was motivated by previous work on peer-to-peer storage systems and file systems such as CFS [10], Ivy [1], PAST [3], OceanStore [15], and BitTorrent [4]. All of these systems use DHTs as their location schemes.

CFS and Ivy are both block-based file systems built on top of Chord. Ivy stores its files as a set of logs, one log per participant. Each participant reads data by consulting all logs, but performs modifications by appending only to its own log.

PAST is built on top of the Pastry lookup algorithm and provides persistent storage management. It stores each file whole, without breaking it into smaller chunks.

OceanStore locates data objects by either a non-deterministic but fast algorithm (attenuated Bloom filters) or a slower deterministic algorithm (Tapestry). It also provides self-monitoring introspection mechanisms for data migration based on access patterns.

Peeraid adopts many techniques from BitTorrent. BitTorrent stores file metadata as a “.torrent” file, which contains checksums for each file piece and information about the tracker. Users who want to download a file must first obtain its torrent file and connect to the specified tracker, which tells them from which other peers to download the pieces of the file. Alternatively, some BitTorrent clients implement a tracker-less system by using Kademlia. With this approach, each peer acts as a tracker.
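As an illustration of the key assignment described in Section 2.3, the following Python sketch hashes names into a 160-bit identifier space with SHA-1 and picks the node closest to a key. It is a minimal sketch under stated assumptions, not code from any of the cited systems: the function names (`make_id`, `ring_distance`, `closest_node`, `table_entry`) are hypothetical, "numerically closest" is assumed to mean the shorter way around the circular space, and a real DHT reaches the closest node through routed lookups rather than by scanning a full list of nodes.

```python
import hashlib
from typing import Iterable

ID_BITS = 160                 # SHA-1 produces 160-bit digests
ID_SPACE = 1 << ID_BITS


def make_id(name: str) -> int:
    """Map a node address or object name to an identifier (assumed: SHA-1 of the UTF-8 name)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def ring_distance(a: int, b: int) -> int:
    """Distance on the circular identifier space (assumed: the shorter way around)."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def closest_node(key_id: int, node_ids: Iterable[int]) -> int:
    """Return the node ID that a key would be assigned to."""
    return min(node_ids, key=lambda n: ring_distance(n, key_id))


def table_entry(own_id: int, other_id: int) -> int:
    """Index i of the routing-table entry covering other_id: 2^i <= distance < 2^(i+1)."""
    d = ring_distance(own_id, other_id)
    if d == 0:
        raise ValueError("a node does not keep an entry for itself")
    return d.bit_length() - 1


# Hypothetical node names and key, for illustration only.
nodes = [make_id(f"node-{i}") for i in range(8)]
key = make_id("/user/foo.txt")
owner = closest_node(key, nodes)
```

Because the routing-table index grows with the logarithm of the distance, each node needs only one entry per doubling of distance, which is where the O(log N) table size comes from.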

3 System Design

Peeraid is an open read/write peer-to-peer file system. It is structured as a collection of peer nodes supplied by different participants around the world. The system is tuned for storing large files, from hundreds of kilobytes to gigabytes, and provides users with block-based access. Peeraid also supports the concept of groups. Users are allowed to create new groups or join existing ones. The owner of a file can decide whether or not to store the file on its group nodes.

3.1. Design overview

There are three primary types of data objects on Peeraid: SysInfo-objects, File-objects, and Metadata-objects. These data objects may be distributed to nodes in different ways. A SysInfo-object represents a system file. Each one contains information about a user or a group, in particular their authentication keys. The main purpose of SysInfo-objects is to provide a public mechanism for user authentication. A File-object represents an actual user file, which can be either a directory or a regular file. Each File-object has a corresponding Metadata-object, similar to an inode in the UNIX virtual file system. The Metadata-object holds all the information about the File-object, such as file attributes, the locations and checksums of file shares, and the permission mode.

By default, each File-object is fragmented into 8 KB blocks, and each block is stored as 32 erasure-coded shares of 1024 bytes each, any eight of which are sufficient to reconstruct the block. Users can assign the policy used to distribute these shares. SysInfo-objects and Metadata-objects, however, are not fragmented. They are stored directly at the nodes mapped from their identifiers by the DHT and replicated at the neighbor nodes.
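The default parameters above can be checked with a short calculation. The Python sketch below computes the storage overhead of the 8-of-32 scheme and compares its availability with plain replication at the same storage cost. It is a minimal sketch under stated assumptions: every node is assumed to be available independently with the same probability p, the value p = 0.8 is only an illustrative input, and the function names are hypothetical.

```python
from math import comb

BLOCK_SIZE = 8 * 1024      # bytes per block (default from the text)
SHARE_SIZE = 1024          # bytes per erasure-coded share
TOTAL_SHARES = 32          # shares stored per block
NEEDED_SHARES = 8          # any 8 shares reconstruct the block


def storage_overhead() -> float:
    """Bytes stored per byte of user data."""
    return TOTAL_SHARES * SHARE_SIZE / BLOCK_SIZE


def erasure_availability(p: float, n: int = TOTAL_SHARES, k: int = NEEDED_SHARES) -> float:
    """Probability that at least k of n shares sit on available nodes (independent nodes assumed)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def replication_availability(p: float, copies: int) -> float:
    """Probability that at least one of `copies` full replicas is on an available node."""
    return 1 - (1 - p) ** copies


p = 0.8                                   # illustrative node availability
overhead = storage_overhead()             # 4.0: same cost as keeping four full copies
coded = erasure_availability(p)
replicated = replication_availability(p, copies=4)
```

Under these assumptions a block is lost only if 25 or more of its 32 share holders are unavailable at once, whereas four replicas are lost whenever all four holders are unavailable, so the coded layout is the more available of the two at the same fourfold storage cost.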

Figure 3-1. System architecture of Peeraid. (Labels in the figure: three nodes, each running a Peeraid entity with FS, Storage, and DHT layers, connected by the network.)

Figure 3-1 illustrates the architecture of the Peeraid system. Each Peeraid entity consists of three software modules: a file system module, a storage module, and a DHT module. The file system module is responsible for managing access to data objects, offering the file system interface to service applications, and maintaining file security. Figure 3-2 shows the units in the file system module. Each type of data object is accessed through its own unit because the types have different distribution and backup policies. The storage module provides file fragmentation, reassembly, distribution, backup, and caching functions for the file system module. Data object and node lookups are accomplished by the DHT module.

Figure 3-2. Units in the file system module. (Labels in the figure: FS interface, SysInfo unit, Metadata unit, File unit, Security unit.)

Figure 3-3 and Figure 3-4 give an overview of publishing and retrieving a File-object. The publisher encrypts the user file according to its permission mode and fragments it into fixed-size blocks. The publisher then applies erasure coding to the blocks, expanding each of them into multiple shares. All shares belonging to the same block are distributed to different nodes, which are chosen according to the policy that the publisher assigns.

A user who wants to retrieve the File-object uses a hash of the file pathname to obtain the File ID. The user then sends a request to the metadata unit of the node that the DHT maps from the File ID. The metadata unit examines the permission mode of the file and retrieves the SysInfo-object required for user authentication. If the user has permission to access the file, the metadata unit returns the Metadata-object of the file. The user decrypts the Metadata-object to obtain the locations of the shares and starts downloading them.

3.2. DHT module

In Peeraid, the DHT module organizes the nodes of the system into more than one overlay network, and also provides the mechanisms for message routing and key lookup. Unlike most peer-to-peer file systems based on DHTs, Peeraid separates file access from the DHT protocol. User files are not stored directly at the nodes mapped from their keys. Instead, they are distributed to nodes chosen according to a policy that users assign, which gives service applications more flexibility.
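The separation just described, with metadata at the node mapped from the File ID and shares at nodes chosen by the user's policy, can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not Peeraid's implementation: the class and method names are hypothetical, SHA-1 is assumed as the pathname hash, the DHT mapping is modeled as the smallest absolute ID difference, and encryption, authentication, erasure coding, and networking are all left out.

```python
import hashlib
from typing import Dict, List, Tuple


def object_id(name: str) -> int:
    """Identifier of a node or file (assumed: SHA-1 of the name)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class ToyOverlay:
    """In-memory stand-in for a set of peers."""

    def __init__(self, node_names: List[str]) -> None:
        self.node_ids: Dict[str, int] = {n: object_id(n) for n in node_names}
        # Per node: file ID -> list of (node name, share index) location pointers.
        self.metadata: Dict[str, Dict[int, List[Tuple[str, int]]]] = {n: {} for n in node_names}
        # Per node: (file ID, share index) -> share bytes.
        self.shares: Dict[str, Dict[Tuple[int, int], bytes]] = {n: {} for n in node_names}

    def home_node(self, file_id: int) -> str:
        """Node that holds the metadata for file_id (assumed closeness rule)."""
        return min(self.node_ids, key=lambda n: abs(self.node_ids[n] - file_id))

    def publish(self, pathname: str, shares: List[bytes], chosen_nodes: List[str]) -> None:
        """Place shares on the nodes the user chose and record their locations."""
        file_id = object_id(pathname)
        locations = []
        for index, share in enumerate(shares):
            node = chosen_nodes[index % len(chosen_nodes)]  # stand-in for the user's policy
            self.shares[node][(file_id, index)] = share
            locations.append((node, index))
        self.metadata[self.home_node(file_id)][file_id] = locations

    def retrieve(self, pathname: str) -> List[bytes]:
        """Look up the location pointers first, then fetch each share."""
        file_id = object_id(pathname)
        locations = self.metadata[self.home_node(file_id)][file_id]
        return [self.shares[node][(file_id, index)] for node, index in locations]


overlay = ToyOverlay(["a", "b", "c", "d"])
overlay.publish("/user/foo.txt", [b"s0", b"s1", b"s2"], chosen_nodes=["c", "d"])
fetched = overlay.retrieve("/user/foo.txt")
```

The point of the indirection is visible in `publish`: the caller, not the hash of the file, decides which nodes hold the shares, while the lookup path still starts from a node that any peer can compute from the pathname.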

Figure 3-3. Publishing a File-Object. (Stages shown: File-object, encryption, blocks 0 to n, erasure coding, shares 1 to (k+m), file distribution, DHT.)

Figure 3-4. Retrieving a File-Object. (Elements shown: the pathname /user/foo.txt on the local node, the File ID, the Metadata unit, SysInfo unit, and File unit on peer nodes, the permission check by SysInfo ID, the response carrying the Metadata, the share IDs, and the shares fetched from peer nodes.)
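The first steps of the publishing path in Figure 3-3 can be sketched as follows. This Python fragment derives a File ID from a pathname and cuts a file into fixed-size blocks using the default sizes from Section 3.1. It is a minimal sketch under stated assumptions: SHA-1 is assumed as the pathname hash, the final block is simply left shorter rather than padded, `stored_bytes` assumes every block (including a short final one) still expands to 32 full-size shares, and the encryption and erasure-coding steps themselves are not shown. All names are hypothetical.

```python
import hashlib
from typing import List

BLOCK_SIZE = 8 * 1024       # default block size from the text
SHARES_PER_BLOCK = 32       # default number of shares per block
SHARE_SIZE = 1024           # default share size in bytes


def file_id(pathname: str) -> str:
    """File ID as a hex string (assumed: SHA-1 of the pathname)."""
    return hashlib.sha1(pathname.encode("utf-8")).hexdigest()


def fragment(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the (already encrypted) file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def stored_bytes(data: bytes) -> int:
    """Total bytes placed on peer nodes once every block is expanded into shares."""
    return len(fragment(data)) * SHARES_PER_BLOCK * SHARE_SIZE


payload = b"x" * 20000                      # stand-in for encrypted file contents
blocks = fragment(payload)                  # two full blocks and one partial block
fid = file_id("/user/foo.txt")
```

Reassembly on the retrieving side is the reverse of `fragment`: once each block has been reconstructed from any eight of its shares, the blocks are concatenated in order.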

(23) 3.2.1. DHT-based vs. directory-based. In the early design, storage systems built on top of DHT usually use content hashes of data objects as their object keys, and store them at the nodes directly mapped from their keys by DHT, which is called DHT-based storage systems. The advantages of this design are simple and easy for backup management, because every node is required to maintain a list of its neighbor nodes in these systems. However, a serious problem of these systems is lack of flexibility. In these systems, users cannot do anything to control locations of their files, so files are often stored at the nodes far way from those users who need to access them frequently. This will significantly reduce system performance. Previous research such as [11] had focused on server selection and file caching on the routing path. Nevertheless, server selection cannot take effect when only a few nodes are available. File caching on the routing path will greatly increase the storage cost. Specifically, for an open network environment such as Cloud Computing, file confidentiality is one of the most important concerns. Users might not be willing to store their files at too many nodes that they cannot trust. B. G. Chun and F. Dabek et al. had proposed another architecture of DHT [5], which is called directory-based storage systems. In directory-based storage systems, a data object can be stored and backed up at arbitrary nodes, and the location pointers to this object are stored at the nodes mapped from its key. Figure 3-5 illustrates the difference between DHT-based and directory-based systems. This kind of design separates the policy of file distribution from the DHT protocol, and provides users more flexibility. However, accompanying with the benefit of flexibility is how to choose the appropriate nodes for storing files and how to manage user files so that data 16.

(24) successor list. successor list. data object. location pointer. data object. data object ID space. ID space. data object DHT Figure 3-5. Directory DHT-based vs. directory-based.. availability can be ensured. Peeraid combines the concept of group and file distribution to address these two problems, and we will discuss it in the later section. Peeraid integrates both architectures to locate data objects according to their types. SysInfo-Objects and Metadata-Objects are stored with DHT-based in order to locate them efficiently, whereas File-Objects are stored with directory-based so that users are allowed to control the location of their files. Users who need to access a File-Object should retrieve its corresponding Metadata-Object by the DHT module first to obtain location information of file shares recorded in the Metadata-Object.. 3.2.2. Kademlia. Kademlia [23] is a distributed hash table that has the similar basic idea like those introduced previously. However, there are three main differences between Kademlia and other DHTs. First, Kademlia uses XOR metric for calculating the distance between nodes instead of numerical difference. XOR is a symmetric operation, allowing 17.

(25) Kademlia participants to receive lookup queries from precisely the same distribution of nodes contained in their routing tables. Second, for the ith entry in the routing table, each Kademlia node keeps a list of〈IP address, UDP port, Node ID〉triples for up to k nodes of distance between 2i and 2i+1 instead of only one node like other DHTs. Therefore, the routing table of Kademlia is called k-buckets, and k is chosen such that any given k nodes are very unlikely to fail within an hour. Maintaining information about multiple nodes within an interval of identifier space makes Kademlia more robust and tolerant towards network churn and DoS attacks. Third, each message sent by a node must include its node ID, so that the recipient can learn about the sender and update its k-bucket for the sender’s node ID. This minimizes the number of messages required by a node to learn about each other, which spreads automatically as a side-effect of key lookups. As a result, the cost of building and maintaining routing tables in Kademlia is lower than other DHTs. Besides, additional useful information about senders can be appended to the message sent (e.g. latency between two nodes). Peeraid uses Kademlia protocol as the DHT module due to its low cost of initializing and maintaining routing tables. Kademlia minimizes the number of messages a node must send to learn about each other. This property can be deployed to support group management. Every user on Peeraid has a global routing table for key lookups among all nodes in the system. However, every group has their own routing table for key lookups only among group members. This will ensure that all lookups are carried out within the group and enhance the security of file access. The efficiency of lookups among group members is also improved because the number of group members is usually far less than the number of all nodes in the system. Figure 3-6 18.

illustrates this feature.

[Figure 3-6. Every group on Peeraid has its own routing table for key lookups. Labels: global ID space, global routing table, Group 1 routing table, Group 2 routing table.]

3.2.3. Kademlia protocol

The original Kademlia protocol consists of four RPCs:

- PING: probes a node to see if it is online.
- FIND_NODE: takes a 160-bit key ID as the argument. The recipient of the RPC must return up to k 〈IP address, UDP port, Node ID〉 triples for the nodes it knows to be closest to the key.
- STORE: instructs a node to store a 〈key, value〉 pair for later retrieval.
- FIND_VALUE: behaves like FIND_NODE with one exception. If a corresponding value is present on the recipient, the associated data is returned.

For participating peers, the most important operation is to locate the k closest nodes to a given target ID, which is called a node lookup. Kademlia accomplishes node

lookups with a recursive algorithm. First, the initiator picks the α nodes closest to the target ID from its local k-buckets and sends parallel, asynchronous FIND_NODE RPCs to them. After receiving a response from any of these nodes, the initiator again picks the α nodes closest to the target ID among those it has learned about and sends FIND_NODE RPCs to them. By repeating this process recursively, the initiator learns about nodes whose IDs are closer and closer to the target ID. The lookup terminates when the initiator has queried and obtained responses from the k closest nodes it has seen.

The DHT module of Peeraid includes the PING and FIND_NODE RPCs and the node lookup function, but does not implement STORE and FIND_VALUE. As mentioned before, Peeraid uses both DHT-based and directory-based architectures to locate data objects according to their types, and users must be authenticated by other peers when they try to access files. Therefore, we separate the file access functions from the DHT protocol and implement them in the upper layers.

3.2.4. Revising Kademlia

Since Kademlia requires every message a node sends to include its node ID, additional useful information about the sender can be appended. In Peeraid, timestamps are piggybacked on every message to measure the round-trip time between nodes. With this information, users can route queries through low-latency paths and select low-latency nodes from multiple sources when downloading files.

3.3. Storage module

The storage module provides the essential functions for accessing data objects, including file fragmentation, reassembly, backup, distribution, and caching. It uses

the DHT module to accomplish data object and node lookups. The file system module calls these functions in accordance with the types of data objects.

3.3.1. File fragmentation

Peeraid is a block-based file system. File-Objects are fragmented into fixed-size blocks before they are published. The system is aimed at storing large files, and users would not want to wait for the whole file to download every time they issue a file operation. SysInfo-Objects and Metadata-Objects, however, are not fragmented, since they are small and their full content must be retrieved at once.

Several factors affect the block size. First, Peeraid applies erasure coding to each block and expands it into multiple shares, so shares are the actual units transferred over the network. A share should fit in a single 1500-byte packet for better transport performance. An overly large block increases the number of shares, which must then be distributed to more nodes; this reduces write performance and makes file management more difficult. Conversely, a block that is too small makes read operations inefficient. In addition, a Metadata-Object records the locations and checksums of its corresponding File-Object, so increasing the number of file shares also increases the size of the Metadata-Object.

Besides block size, some services exhibit spatial locality: a user who accesses one file block is likely to access the following blocks soon afterwards. For each file, Peeraid allows users to specify how many contiguous blocks are returned whenever they issue an operation.
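The fragmentation step described above can be sketched as follows. This is only an illustration under stated assumptions: the 8 KB block size (eight 1 KB units, so that a share stays below a 1500-byte packet), the zero padding of the last block, and the function names `fragment`, `read_blocks`, and `reassemble` are all hypothetical and not taken from the Peeraid implementation.

```python
from typing import List

# Assumed block size: 8 units of 1 KB each, so one share fits in a 1500-byte packet.
BLOCK_SIZE = 8 * 1024


def fragment(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks, zero-padding the last block (an assumption)."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, b"\0"))
    return blocks


def read_blocks(blocks: List[bytes], index: int, count: int = 1) -> List[bytes]:
    """Return `count` contiguous blocks starting at `index` (spatial locality)."""
    return blocks[index:index + count]


def reassemble(blocks: List[bytes], file_size: int) -> bytes:
    """Join blocks and drop the padding, using the file size kept in the metadata."""
    return b"".join(blocks)[:file_size]
```

For example, `read_blocks(blocks, 0, count=4)` would return the first four blocks in one operation, which corresponds to a user-specified read-ahead of four contiguous blocks.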

3.3.2. Erasure coding

Erasure coding [13] is commonly used in error-resilient communication systems, and it is growing in importance for storage applications as storage systems grow in size and complexity. Figure 3-7 depicts the basic concept. The original data is divided into k units, which are then encoded into a total of (k+m) shares. The key property of erasure coding is that the original data can be reconstructed from any k shares; in other words, up to m erasures can be tolerated. Systems employing erasure codes have a mean time to failure many orders of magnitude higher than replicated systems with similar storage and bandwidth costs.

Peeraid uses erasure coding with k = 8 and m = 24. Any eight of the 32 shares are sufficient to reconstruct the original block, and the storage cost is four times the original size. Users can choose low-latency nodes for downloading shares according to the latency information recorded by the DHT module. The ratio of k to m determines data availability, and the sum of k and m determines how many nodes a File-Object is distributed to. When the average node availability is 0.8, the data availability of replication with the same storage cost is 0.9984, whereas the data availability of the erasure coding described above is 0.99999999999746.
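The "any k of (k+m) shares" property can be illustrated with a small polynomial code. The thesis does not state which erasure code Peeraid uses, so the sketch below is an assumption made purely for illustration: a Reed-Solomon-style code over the prime field GF(257), where the k data bytes are treated as values of a polynomial at x = 0..k-1 and the shares are its values at x = 0..k+m-1. The names `encode` and `decode` are hypothetical.

```python
P = 257  # prime slightly larger than a byte value; requires k + m <= 257


def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for i, (xi, _) in enumerate(points):
            if i != j:
                num = num * (x - xi) % P
                den = den * (xj - xi) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yj * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, m: int):
    """Encode k = len(data) data units into k + m shares of the form (x, value)."""
    k = len(data)
    points = list(enumerate(data))
    return [(x, _lagrange_eval(points, x)) for x in range(k + m)]


def decode(shares, k: int) -> bytes:
    """Reconstruct the k original data units from any k distinct shares."""
    points = shares[:k]
    return bytes(_lagrange_eval(points, x) for x in range(k))
```

With k = 8 and m = 24, `encode(b"Peeraid!", 24)` yields 32 shares; the first eight carry the data units themselves and the remaining 24 are coding units, and `decode` recovers the block from any eight of them.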

[Figure 3-7. Erasure coding. (a) Encoding: k data units and m coding units. (b) Decoding: from (k+m) units with up to m erasures, the contents of the original k units are recalculated.]

3.3.3. Hybrid backup

Peeraid uses both erasure coding and replication for file backup. Compared with replication, erasure coding provides higher data availability at a lower storage cost, at the expense of encoding and decoding overhead. For large files, storage cost is the main concern, whereas efficiency is more important for small files. Consequently, Peeraid applies erasure coding to File-Objects for high data availability, and replication to SysInfo-Objects and Metadata-Objects for efficiency.

Besides efficiency, applying replication to SysInfo-Objects and Metadata-Objects also makes security management easier. If these two kinds of data

objects are stored as erasure-coded shares, users must contact more nodes to accomplish user authentication.

3.3.4. Enhancing write performance with log

Although erasure coding provides higher data availability, it also degrades performance, because users must contact more nodes to access a file than with replication. Read performance can be improved by selecting low-latency nodes from multiple sources and downloading in parallel. However, the system still suffers from the small-write problem, like RAID [8]: all 32 shares of a block must be overwritten every time a write operation is issued.

Peeraid solves this problem with write logs. Users who want to write a file do not write their contents directly into the File-Object. Instead, they generate write logs and send them to the metadata unit mapped from the File ID. The metadata unit stores these logs in a list together with the Metadata-Object of the file. Whenever a user accesses the file, the metadata unit returns both the Metadata-Object and all logs of the file. Figure 3-8 shows the contents of a write log, which represents one write operation: the user ID of the requester, the write type (modify or append), a version number that represents the write order, the block index, the byte range in the written block, and the written content. The file owner, or other users who have write permission, periodically integrates these write logs into the File-Object.

3.3.5. Cooperative service support
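The write log of Section 3.3.4 can be sketched as a small record plus a merge step. The fields follow the list given there (user ID, version number, type, block index, byte range, content); everything else is an assumption made for illustration, including the names `WriteLog` and `apply_logs`, the in-memory `bytearray` blocks, and the reading of "append" as adding content to the end of the indicated block.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WriteLog:
    user_id: str                 # requester
    version: int                 # write order
    kind: str                    # "modify" or "append"
    block_index: int             # which block is written
    byte_range: Tuple[int, int]  # (start, end) within the block, used by "modify"
    content: bytes               # data to write


def apply_logs(blocks: List[bytearray], logs: List[WriteLog]) -> None:
    """Integrate write logs into the blocks in version order (hypothetical merge step)."""
    for log in sorted(logs, key=lambda entry: entry.version):
        block = blocks[log.block_index]
        if log.kind == "modify":
            start, end = log.byte_range
            block[start:end] = log.content
        elif log.kind == "append":
            # Assumption: append adds the content at the end of the block.
            block.extend(log.content)
        else:
            raise ValueError(f"unknown write type: {log.kind}")
```

Sorting by version number before applying the logs reflects the role of the version number as the write order; how conflicting logs from different users are resolved is left open here, as it is in the text.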

[Figure 3-8. The contents of a write log. Fields: User ID, Version No., Type, Block Index, Byte Range, Content.]

In cloud computing, a file is often shared by multiple users. These users may provide related services or cooperate with each other to support one service. For example, many current Internet services access Google Maps by means of Google APIs. A file system should therefore have mechanisms to manage access permission for shared files. On UNIX, group management is achieved by recording information about all groups in the /etc/group file.

Peeraid supports the concept of group with two mechanisms. First, the information about a group is recorded in a SysInfo-Object. It includes the group ID, the authentication public key of the group, and the IDs of the group manager and all members. The user who creates a new group is the group manager, and only the group manager has write permission for its SysInfo-Object. Second, every group has its own routing table for key lookups, as mentioned before. Each file belongs to one group, and the file owner can decide whether only group members have permission to access the file.

There are two main differences between UNIX system files and SysInfo-Objects. UNIX system files are stored under the same directory hierarchy as user files, but SysInfo-Objects are stored under an independent name space. The reason is that SysInfo-Objects are distributed and

accessed in different ways from user files. Besides, only the root user has permission to modify system files on UNIX. Peeraid, however, is fully decentralized and has no administrator. Consequently, every group has its own SysInfo-Object, which is managed by its group manager.

3.3.6. Group locality and file distribution

The concept of group in peer-to-peer systems is quite different from that in UNIX. In UNIX, all group members are on the local machine. In peer-to-peer systems, however, group members may be located on different nodes spread over the network. The concept of group therefore relates not only to access permission for shared files but also to the locations of members. Some groups may consist of peer nodes around the world, whereas others consist of peer nodes in a certain geographical region or community only.

Peeraid combines file distribution with group locality to improve access performance. Users can decide whether to store files on the nodes of their group members. Group locality arises from two aspects. First, the group members of a file are more likely to access it than other users. Second, many groups are regional and consist of adjacent nodes, so the mean latency between group members is lower than that of the whole system. Storing files on group members also enhances file security, because users who belong to the same group are more willing to trust each other.

Peeraid provides three distribution policies for users: RANDOM, GROUP, and MIX. By default, a File-Object is distributed to 32 nodes because a block is stored as 32 erasure-coded shares. RANDOM means all 32 nodes are selected randomly from

all nodes in the system. This policy is suitable for public files, especially when group members are spread over the world. It can achieve higher data availability because the nodes storing file shares are less likely to fail simultaneously, but access performance is not deterministic. GROUP means all 32 nodes are selected randomly from group members only. This policy is suitable for group files, that is, files that only group members have permission to access. It can ensure file security and exploit group locality to enhance access performance. However, if group members are highly concentrated in a certain region, they might fail simultaneously because of a single common problem. MIX means half of the nodes are selected from group members and the other half from nodes outside the group. This policy is suitable for files with lower security requirements. It can both exploit group locality and achieve average data availability.

Although the IDs of all members of a group are recorded in the corresponding SysInfo-Object, Peeraid supports these distribution policies by using the individual routing table of each group. Therefore, users do not have to access the SysInfo-Objects of groups whenever they distribute a File-Object.

3.4. File system module

The file system module offers the file system interface to service applications. It also manages access to data objects and maintains file security.

3.4.1. File system model

File types and attributes

In Peeraid, user files are represented by File-Objects, which can be either directories or general files. The File-Object of a directory records information about each entry in that directory, including its path name, type (a general file or a subdirectory), permission mode, and file owner and group. Compared with the contents of directories on UNIX, Peeraid includes additional file information and has no index to the Metadata-Objects. This is because files that belong to the same directory might be distributed to different nodes. To improve the performance of reading directories, Peeraid duplicates some file information that is also stored in Metadata-Objects and puts it into the File-Objects of directories. In addition, Metadata-Objects are stored at the nodes mapped directly from File IDs, so users can locate them with the DHT module without any index. The complete attributes of a user file are recorded in the corresponding Metadata-Object, which includes the file path name, file ID, file type, permission mode, file size, file owner and group, and time information.

File system interface

Peeraid provides the file system interface listed below to service applications.

For directories:

- mkdir: Create a subdirectory in a given directory.
- opendir: Open a directory, that is, retrieve the Metadata-Object of the directory.
- readdir: Read the entries under a directory, that is, retrieve the File-Object of the directory.
- closedir: Close a directory.

- rmdir: Remove an empty subdirectory from a directory.

For general files:

- create: Create a general file.
- open: Open a file, that is, retrieve the Metadata-Object of the file.
- read: Read the data contained in a file, that is, retrieve the File-Object of the file.
- write: Write data to a file.
- close: Close a file.
- unlink: Remove a file from the file system.
- getattr: Get the attribute values of a file.
- setattr: Set one or more attribute values of a file.

Directory structure

Figure 3-9 shows the directory structure of Peeraid. The top level is the root directory, and each user and group has its own home directory on the second level. Users and group members can freely create new files or subdirectories in their home directories.

File system mounting

Peeraid allows users to mount the file system under any local directory. At least one bootstrap node that already exists in the system must be specified when mounting the file system. It helps the newly arrived user build a routing table and notifies other users of the newcomer's arrival.

[Figure 3-9. Directory structure of Peeraid. The root directory "/" contains the user or group directories (User 1 ... User m, Group 1 ... Group n), each of which contains files and subdirectories.]

3.4.2. Data object management

The file system module contains individual units for managing access to data objects: the SysInfo unit, the metadata unit, and the file unit. Users communicate only with their local file units through the file system interface. Whenever they issue a file system operation, the local file unit accomplishes it by cooperating with other units on different nodes and calls the functions that the storage module provides if necessary.

A SysInfo-Object is named after the user or group it represents, stored at the nodes mapped from its SysInfo ID (the hash of its name), and managed by the SysInfo units of those nodes. Similarly, a Metadata-Object is named after the file path name it represents, stored at the nodes mapped from its File ID (the hash of the file path name), and managed by the metadata units of those nodes. Users can locate both kinds of objects efficiently with the DHT module.

File-Objects are stored as erasure-coded shares. Because the block is the basic unit when accessing files, Peeraid authenticates each share by naming it with the file path name and the block number. File-Objects are distributed to nodes according to the

policies that users assign and managed by their file units.

Any node in the system may become temporarily inaccessible for some reason, and damage to hardware or software can also occur. The file system should maintain the number of active backups of every data object to guarantee data availability. Peeraid requires every node to periodically republish the SysInfo-Objects and Metadata-Objects stored on it. File-Objects are maintained by file owners. When a user fails to access a file share, the user sends a notification to the file owner. The file owner accumulates these notifications and periodically republishes the corresponding blocks.

3.4.3. Security

In Peeraid, file security is protected by three mechanisms. First, users must be authenticated when they send requests to retrieve Metadata-Objects. The metadata unit that receives a request examines the permission mode of the file and retrieves the information required for user authentication from SysInfo-Objects. This protects Metadata-Objects from being accessed by users without appropriate permission. Second, the information about the locations of file shares in Metadata-Objects is encrypted, so that only users who have the correct decryption keys can obtain it. Finally, the contents of File-Objects are also encrypted.

Every user and group has one public/private key pair for authentication and another one for encryption. Authentication public keys are recorded in their SysInfo-Objects and can be accessed by any user. As on UNIX, the file permission mode is divided into three classes: owner, group, and others. The permission mode determines how the Metadata-Object and File-Object are encrypted and which key is required for user authentication when accessing the Metadata-Object.
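To make the object naming of Section 3.4.2 and the XOR metric of Section 3.2.2 concrete, the sketch below derives a 160-bit ID from a name and ranks nodes by XOR distance to it. It is an illustration under assumptions: SHA-1 is assumed as the hash function because the bibliography cites the Secure Hash Standard [12] and Kademlia keys are 160 bits, and the function names and the parameter `k` (how many nodes hold an object) are hypothetical.

```python
import hashlib

ID_BITS = 160  # Kademlia key and node IDs are 160 bits


def object_id(name: str) -> int:
    """Hash a user name, group name, or file path name to a 160-bit ID (SHA-1 assumed)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two IDs; symmetric by construction."""
    return a ^ b


def bucket_index(own_id: int, other_id: int) -> int:
    """Index i of the k-bucket covering distances in [2**i, 2**(i+1))."""
    distance = xor_distance(own_id, other_id)
    if distance == 0:
        raise ValueError("a node does not keep itself in a k-bucket")
    return distance.bit_length() - 1


def closest_nodes(node_ids, target: int, k: int):
    """Return the k known node IDs closest to target under the XOR metric."""
    return sorted(node_ids, key=lambda node: xor_distance(node, target))[:k]
```

Under these assumptions, the nodes responsible for the Metadata-Object of a file would be `closest_nodes(known_ids, object_id(path), k)`, which is the set a node lookup converges to.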

4 Evaluation

This chapter evaluates some primary characteristics of Peeraid, including the difference between erasure coding and replication, and the effects of group locality and network locality on performance.

4.1. Erasure coding vs. replication

We assume the average node availability in the system is p. With replication, a block is unavailable only when all n replicas are unavailable, so the availability is $1 - (1 - p)^n$. With erasure coding, each block is split into k units, and the erasure code is applied to these units to generate a total of (k+m) shares. Since any k shares are sufficient to reconstruct the original block, the availability is

$$\sum_{i=k}^{k+m} \binom{k+m}{i}\, p^{i} (1-p)^{k+m-i} \qquad (1)$$

Figure 4-1 shows the data availability analysis for k=8, k=16, and full replication when the average node availability is p=0.8. According to the figure, erasure coding achieves higher availability than replication when the storage cost is at least double. Peeraid uses erasure coding with k=8 and m=24, so the storage cost is four times the original size. Figure 4-2 shows data availability for k=8, k=16, and full replication under different node availabilities when the storage cost is four times. According to the figure, erasure coding achieves higher availability than replication when the average node availability is at least 0.3.
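The two availability expressions can be evaluated directly. The sketch below is a straightforward transcription of $1-(1-p)^n$ and equation (1); the function names are hypothetical. It also computes the complementary unavailability, which avoids floating-point cancellation when the availability is extremely close to 1.

```python
from math import comb


def replication_availability(p: float, n: int) -> float:
    """Availability of a block stored as n full replicas: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


def erasure_availability(p: float, k: int, m: int) -> float:
    """Equation (1): probability that at least k of the k + m shares are available."""
    n = k + m
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def erasure_unavailability(p: float, k: int, m: int) -> float:
    """Probability that fewer than k of the k + m shares are available."""
    n = k + m
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))


if __name__ == "__main__":
    print(replication_availability(0.8, 4))   # four replicas, storage cost 4x
    print(erasure_availability(0.8, 8, 24))   # k = 8, m = 24, storage cost 4x
    print(erasure_unavailability(0.8, 8, 24))
```

For p = 0.8 and a storage cost of four times, the replication case evaluates to 0.9984, and the erasure-coded unavailability is on the order of 2.5e-12, consistent with the availability of 0.99999999999746 quoted in Section 3.3.2.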

[Figure 4-1. Availability analysis when average node availability is 0.8. Plot of data availability versus storage cost (times) for k = 8 erasure, k = 16 erasure, and full replication; the plotted values are tabulated below.]

Storage cost (times)   1              2              3              4
k = 8, erasure         0.1677721600   0.9985240617   0.9999998948   0.9999999999
k = 16, erasure        0.0281474977   0.9999672849   0.9999999999   0.9999999999
full replication       0.8000000000   0.9600000000   0.9920000000   0.9984000000

It can be seen that k=16 achieves higher availability than k=8 at the same storage cost. However, the value of k also determines the minimum number of requests that must be sent when accessing a block, so there is a trade-off between data availability and access performance. Therefore, Peeraid uses erasure coding with k=8 instead of k=16.

4.2. Group locality and network locality

Next, we design an experiment on PlanetLab to evaluate the effects of group locality and network locality on performance. PlanetLab is a global research test bed that, in July 2009, comprised approximately 1025 machines distributed over 488 sites around the world. When the experiment was running, only about 640 nodes were active.

[Figure 4-2. Availability comparison when storage cost is four times. Plot of data availability versus average node availability for k = 8 erasure, k = 16 erasure, and full replication; the plotted values are tabulated below.]

Average node availability   k = 8, erasure   k = 16, erasure   full replication
0.1                         0.0116854546     0.0004467193      0.3439000000
0.2                         0.3017631322     0.1967625597      0.5904000000
0.3                         0.7882322328     0.8437627667      0.7599000000
0.4                         0.9751780577     0.9959893851      0.8704000000
0.5                         0.9989487992     0.9999878177      0.9375000000
0.6                         0.9999871684     0.9999999970      0.9744000000
0.7                         0.9999999735     0.9999999999      0.9919000000
0.8                         0.9999999999     0.9999999999      0.9984000000
0.9                         0.9999999999     0.9999999999      0.9999000000
1                           1.0000000000     1.0000000000      1.0000000000

The experiment setup is described below. First, a node was randomly selected as the file source. The latency between the file source node and all other nodes was measured, and the nodes with the lowest latency formed the group. Since Peeraid distributes a file to 32 nodes, we chose 32 file nodes according to each of the three file distribution policies, RANDOM, GROUP, and MIX. Nodes that received a request would download 1 KB shares from an arbitrary eight of the 32 file nodes of each

policy sequentially. Then, the experiment was repeated with network locality taken into account. The only difference was that nodes that received a request would download 1 KB shares from the eight nodes with the lowest latency among the 32 file nodes of each policy. Peeraid can achieve this because latency information is measured by the DHT module.

[Figure 4-3. The internal nodes of group. Download time (s) versus number of group nodes (32, 48, 128, 320, 480) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+.]

Figure 4-3 shows how performance changes for internal nodes of the group as the number of group nodes increases. The symbol “+” means network locality was employed when downloading shares. It can be seen that, for internal nodes of the group, distributing file shares with the GROUP policy achieves the best performance; MIX is next, and RANDOM is the worst. When the number of group nodes is 32, which is exactly sufficient to store 32 file shares, the performance of GROUP is about 2.5 times better than MIX, and almost 5 times better than RANDOM without network locality.

[Figure 4-4. The external nodes of group. Download time (s) versus number of group nodes (32, 48, 128, 320, 480) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+.]

As the number of group nodes increases, the performance of GROUP and MIX degrades because the spread of latency between group members also increases. With network locality, the performance of GROUP and RANDOM improves by more than 2 times, and that of MIX by almost 4 times, when the number of group nodes is 32. Because half of the file nodes in MIX are selected from group members, downloading file shares from the eight nodes with the lowest latency has the same effect as GROUP. Although the performance of GROUP and MIX degrades as the number of group nodes increases, network locality still improves it by 2-3 times.

Figure 4-4 shows the download results for external nodes of the group. For external nodes, GROUP and MIX are worse than RANDOM. When the number of group nodes is 32, the performance of RANDOM is 1.6 times better than GROUP, and 1.2 times better

than MIX. Because external nodes of the group are spread around the world, a high concentration of file shares is unfavorable for them.

[Figure 4-5. The file source node. Download time (s) versus number of group nodes (32, 48, 128, 320, 480) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+.]

Figure 4-5 shows the download results for the file source node. They are similar to those of the internal nodes of the group, but the performance of GROUP is better. This property is good for the file system because the owner of a file is usually the user who accesses it most often. Network locality also improves performance by 2-3 times for external nodes and the file source node.

Next, we repeat the experiment again but increase the number of shares downloaded from each file node. Figure 4-6 and Figure 4-7 show the performance for different download sizes from each file node when the number of group nodes is 32 and 48. It can be seen that the ranking of the three policies does not change when the download size increases. For internal nodes of the group, GROUP is still the best; the next

is MIX, and RANDOM is the worst. For external nodes of the group, RANDOM is better than GROUP and MIX. However, the difference between the three policies becomes greater as the download size increases. Figure 4-8 shows the download results when the number of group nodes is 320. The benefit of group locality disappears, and the curves of the three policies almost overlap for both internal and external nodes of the group. With network locality, performance improves by about 1.5-4 times under all conditions.

These experimental results lead to useful suggestions for file distribution. Files that only group members have permission to access should be distributed with the GROUP policy to achieve the best performance. Files that are public to everyone, however, should be distributed with the MIX policy, because it offers average performance for both internal and external nodes of the group.
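The three distribution policies and the network-locality selection used in the experiment can be sketched as follows. This is a minimal illustration, assuming that group members and external nodes are available as plain lists and that measured latencies are kept in a dictionary; the function names `choose_file_nodes` and `pick_sources` are hypothetical and do not come from the Peeraid implementation.

```python
import random
from typing import Dict, List, Optional, Sequence

SHARES = 32  # a block is stored as 32 erasure-coded shares
K = 8        # any 8 shares reconstruct the block


def choose_file_nodes(policy: str, group: Sequence[str], others: Sequence[str],
                      n: int = SHARES,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Select n distinct file nodes according to RANDOM, GROUP, or MIX."""
    rng = rng or random.Random()
    if policy == "RANDOM":   # from all nodes in the system
        return rng.sample(list(group) + list(others), n)
    if policy == "GROUP":    # from group members only
        return rng.sample(list(group), n)
    if policy == "MIX":      # half from the group, half from outside it
        half = n // 2
        return rng.sample(list(group), half) + rng.sample(list(others), n - half)
    raise ValueError(f"unknown policy: {policy}")


def pick_sources(file_nodes: Sequence[str], latency: Dict[str, float],
                 k: int = K) -> List[str]:
    """Network locality: download from the k file nodes with the lowest latency."""
    return sorted(file_nodes, key=lambda node: latency[node])[:k]
```

`random.Random.sample` raises `ValueError` when the population is smaller than the requested sample, which mirrors the observation that a group needs at least 32 members to hold all shares under the GROUP policy.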

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               2.180    2.368    5.518    22.854
MIX                  1.069    1.579    3.443    12.435
GROUP                0.401    0.699    1.878    5.490
RANDOM+              0.954    0.992    2.560    16.156
MIX+                 0.276    0.753    1.722    8.894
GROUP+               0.144    0.372    0.575    2.408

(a) Internal nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.003    1.382    2.558    14.201
MIX                  1.292    1.942    3.038    21.903
GROUP                1.624    2.105    3.204    34.762
RANDOM+              0.411    1.051    1.642    8.580
MIX+                 0.330    1.651    2.620    13.256
GROUP+               1.117    1.865    2.691    21.333

(b) External nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.899    2.104    5.305    23.593
MIX                  0.916    1.237    2.476    10.626
GROUP                0.172    0.287    1.108    3.182
RANDOM+              0.584    0.830    2.505    15.855
MIX+                 0.233    0.409    1.262    6.351
GROUP+               0.166    0.197    0.410    2.141

(c) The file source node.

Figure 4-6. Performance comparison when the number of group nodes is 32.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.810    2.315    5.806    20.803
MIX                  1.164    2.197    3.798    15.026
GROUP                0.464    1.020    2.765    6.937
RANDOM+              0.944    1.236    2.728    14.604
MIX+                 0.362    0.803    1.948    8.051
GROUP+               0.302    0.491    0.635    2.507

(a) Internal nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.019    1.241    2.945    16.661
MIX                  1.211    1.828    3.155    20.328
GROUP                1.359    1.918    3.182    33.386
RANDOM+              0.418    1.110    2.227    9.081
MIX+                 0.378    1.359    2.770    13.216
GROUP+               0.874    1.461    2.677    24.343

(b) External nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.714    2.369    5.800    21.432
MIX                  1.204    1.531    2.683    13.712
GROUP                0.200    0.842    2.263    4.831
RANDOM+              0.509    0.914    2.683    12.584
MIX+                 0.219    0.467    1.569    7.110
GROUP+               0.158    0.208    0.641    2.301

(c) The file source node.

Figure 4-7. Performance comparison when the number of group nodes is 48.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               2.193    2.338    4.936    23.687
MIX                  1.539    2.301    5.337    25.903
GROUP                1.372    2.241    5.423    22.574
RANDOM+              0.782    1.557    2.430    15.390
MIX+                 0.705    1.337    2.297    14.959
GROUP+               0.752    1.369    2.451    16.014

(a) Internal nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               0.950    1.564    3.015    15.270
MIX                  1.160    1.641    2.876    16.117
GROUP                0.939    1.428    2.814    14.780
RANDOM+              0.441    0.939    2.708    10.675
MIX+                 0.421    1.116    2.442    11.258
GROUP+               0.411    1.098    2.593    13.572

(b) External nodes of group.

[Plot: download time (s) versus share size (bytes) for RANDOM, MIX, GROUP, RANDOM+, MIX+, and GROUP+; the plotted values are tabulated below.]

Share size (bytes)   1K       10K      100K     1M
RANDOM               1.714    2.369    5.800    21.432
MIX                  1.204    1.531    2.683    13.712
GROUP                0.200    0.842    2.263    4.831
RANDOM+              0.509    0.914    2.683    12.584
MIX+                 0.219    0.467    1.569    7.110
GROUP+               0.158    0.208    0.641    2.301

(c) The file source node.

Figure 4-8. Performance comparison when the number of group nodes is 320.

5 Conclusion and Future Works

Peeraid is a highly scalable, robust, and fault-tolerant peer-to-peer file system based on DHT for open clouds structured as a collection of peer nodes supplied by different participants around the world. It provides three basic mechanisms to ensure file security. Erasure coding is applied for file backup to achieve higher data availability at a lower storage cost. Assuming the average node availability is 0.8, erasure coding achieves higher availability than replication when the storage cost is at least double.

Peeraid supports the concept of group with two mechanisms. First, the information about a group is recorded in a SysInfo-Object, and only the group manager has write permission for it. Second, every group has its own routing table for key lookups. Peeraid combines file distribution with group locality to improve access performance. According to our experimental results, distributing file shares with the GROUP policy achieves better access performance than MIX and RANDOM for the internal nodes of the group.

At present, Peeraid does not guarantee file consistency when multiple users write to a shared file. If there are conflicts between write logs, users have to select the ones they trust by themselves. A synchronization mechanism should be provided in the future. Besides, although erasure coding achieves higher data availability, we still need an efficient way to maintain the number of file shares in the system for durability. Finally, accounting is another important issue for our future research.

Bibliography

[1] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen, “Ivy: a read/write peer-to-peer file system,” in OSDI, 2002.
[2] A. Rowstron and P. Druschel, “Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems,” in IFIP/ACM Middleware, 2001.
[3] A. Rowstron and P. Druschel, “Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility,” in ACM SOSP, 2001.
[4] B. Cohen, “Incentives build robustness in BitTorrent,” May 2003.
[5] B. G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris, “Efficient replica maintenance for distributed storage systems,” in Proceedings of the 3rd Conference on 3rd Symposium on Networked Systems Design & Implementation, vol. 3, 2006.
[6] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph, “Tapestry: an infrastructure for fault-resilient wide-area location and routing,” U.C. Berkeley, Tech. Rep. No. UCB/CSD-01-1141, April 2001.
[7] Clip2, The Gnutella Protocol Specification v0.4, 2000.
[8] D. A. Patterson, G. Gibson, and R. H. Katz, “A case for redundant arrays of inexpensive disks,” ACM SIGMOD Record, vol. 17, pp. 109-116, June 1988.
[9] D. Mazières, M. Kaminsky, M. F. Kaashoek, and E. Witchel, “Separating key management from file system security,” ACM SIGOPS Operating Systems Review, vol. 33, pp. 124-139, December 1999.
[10] F. Dabek et al., “Wide-area cooperative storage with CFS,” in ACM SOSP, 2001.
[11] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris, “Designing a DHT for low latency and high throughput,” in USENIX NSDI, 2004.
[12] FIPS 180-1, Secure hash standard, U.S. Department of Commerce/NIST, National Technical Information Service, Springfield, VA, April 1995.
[13] H. Weatherspoon and J. D. Kubiatowicz, “Erasure coding vs. replication: a quantitative comparison,” in Electronic Proceedings for the 1st International Workshop on Peer-to-Peer Systems, March 2002.
[14] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: a scalable peer-to-peer lookup service for Internet applications,” in ACM SIGCOMM, 2001.
[15] J. Kubiatowicz et al., “OceanStore: an architecture for global-scale persistent storage,” in ACM ASPLOS, 2000.
[16] Kazaa, http://www.kazaa.com.
[17] L. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, “A Break in the Clouds: Towards a Cloud Definition,” ACM SIGCOMM Computer Communication Review, vol. 39, pp. 50-55, January 2009.
[18] M. A. Armbrust et al., “Above the clouds: a Berkeley view of cloud computing,” UC Berkeley Reliable Adaptive Distributed Systems Laboratory, Tech. Rep. No. UCB/EECS-2009-28, 2009.
[19] M. D. Stefano, Distributed data management for grid computing, Hoboken, N.J., Wiley-Interscience, 2005.
[20] Morpheus, http://www.morpheus.com.
[21] M. Satyanarayanan et al., “Coda: a highly available file system for a distributed workstation environment,” IEEE Transactions on Computers, vol. 39, pp. 447-459, April 1990.
[22] Napster Inc., http://www.napster.com.
[23] P. Maymounkov and D. Mazières, “Kademlia: a peer-to-peer information system based on the XOR metric,” in Electronic Proceedings for the 1st International Workshop on Peer-to-Peer Systems, March 2002.
[24] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom, “Plan 9 from Bell Labs,” Computing Systems, 1995.
[25] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A scalable content-addressable network,” in ACM SIGCOMM, August 2001.
[26] S. Shepler et al., Network File System (NFS) Version 4 Protocol, RFC 3530, April 2003.
