The storage module provides the essential functions for accessing data objects and relies on the DHT module to accomplish data object and node lookups. The file system module calls these functions according to the type of data object.
3.3.1 File fragmentation
Peeraid is a block-based file system. File-Objects are fragmented into fixed-size blocks before they are published. The system is aimed at storing large files, and users should not have to wait for a whole file to download every time they issue a file operation. SysInfo-Objects and Metadata-Objects, however, are not fragmented, because they are small and their full content must be retrieved at once.
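The fragmentation step can be illustrated with a minimal sketch. The function name `fragment` and the choice to leave the last block shorter than the others (rather than padding it) are assumptions made for illustration; the text does not specify how Peeraid handles the final block.

```python
def fragment(data, block_size):
    """Split a File-Object's bytes into fixed-size blocks.

    Assumption: the final block is simply left shorter when the file
    size is not a multiple of block_size (no padding is applied).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


blocks = fragment(b"abcdefgh", 3)
print(blocks)  # [b'abc', b'def', b'gh']
```

Concatenating the blocks in index order restores the original content, so a reader only needs the blocks that cover the byte range it is interested in.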
Several factors affect the choice of block size. First, Peeraid applies erasure coding to each block and expands it into multiple shares, so shares are the actual units transferred over the network. A share should fit in a single 1500-byte packet for better transport performance. An overly large block increases the number of shares, which must then be distributed to more nodes; this reduces write performance and makes file management more difficult. Conversely, a block that is too small makes read operations inefficient.
In addition, a Metadata-Object records the locations and checksums of its corresponding File-Object, so increasing the number of file shares also increases the size of the Metadata-Object.
Besides block size, some services exhibit spatial locality: a user who accesses one file block is likely to access the following blocks as well. Peeraid therefore allows users to specify, for each file, how many consecutive blocks are returned whenever they issue an operation.
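The per-file setting for consecutive blocks can be sketched as a small index calculation. The names `blocks_to_fetch` and `blocks_per_request` are hypothetical, and the sketch assumes that the requested block counts toward the configured number and that the range is clipped at the end of the file.

```python
def blocks_to_fetch(block_index, blocks_per_request, total_blocks):
    """Return the block indices served for one request.

    Assumptions: the requested block is included in the count, and the
    range stops at the last block of the file.
    """
    if not 0 <= block_index < total_blocks:
        raise IndexError("block_index out of range")
    if blocks_per_request < 1:
        raise ValueError("blocks_per_request must be at least 1")
    end = min(block_index + blocks_per_request, total_blocks)
    return list(range(block_index, end))


print(blocks_to_fetch(3, 4, 10))  # [3, 4, 5, 6]
print(blocks_to_fetch(8, 4, 10))  # [8, 9]
```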
3.3.2 Erasure coding
Erasure coding [13] is commonly used in error-resilient communication systems and is growing in importance for storage applications as storage systems grow in size and complexity. Figure 3-7 depicts the basic concept. The original data is divided into k units, which are then encoded into a total of k + m shares. The key property of erasure coding is that the original data can be reconstructed from any k shares; in other words, up to m erasures can be tolerated. Systems employing erasure codes have mean times to failure many orders of magnitude higher than replicated systems with similar storage and bandwidth costs.
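The "any k of k + m shares" property can be demonstrated with a small sketch. The code below is not Peeraid's codec, which the text does not specify; it is a Reed-Solomon-style construction chosen for illustration, using Lagrange interpolation over the prime field GF(257). Each data unit is assumed to be a single byte value, and all names (`P`, `encode`, `decode`) are hypothetical.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) points, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, m):
    """Expand k data units into k + m shares as (index, value) pairs.
    Shares 0..k-1 carry the data itself; shares k..k+m-1 are coding units."""
    k = len(data)
    if k + m > P:
        raise ValueError("k + m must not exceed the field size")
    points = list(enumerate(data))
    coding = [(x, _interpolate(points, x)) for x in range(k, k + m)]
    return points + coding


def decode(shares, k):
    """Recalculate the k original data units from any k distinct shares."""
    if len(shares) < k:
        raise ValueError("need at least k shares")
    points = shares[:k]
    return [_interpolate(points, x) for x in range(k)]


data = list(b"Peeraid!")        # k = 8 data units
shares = encode(data, 24)       # 32 shares in total
survivors = shares[20:28]       # any 8 shares, here all coding units
assert decode(survivors, 8) == data
```

Because coding values can reach 256, a real implementation would typically work over GF(2^8) so that every share fits in a byte; the prime field is used here only to keep the arithmetic short.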
Peeraid uses erasure coding with k = 8 and m = 24. Any 8 of the 32 shares are sufficient to reconstruct the original block, and the storage cost is four times the original size. Users can choose low-latency nodes for downloading shares according to the latency information recorded by the DHT module. The ratio of k to m determines data availability, and the sum of k and m determines how many nodes a File-Object is distributed to. When the average node availability is 0.8, replication with the same storage cost (four copies) achieves a data availability of 0.9984, whereas the erasure coding described above achieves 0.99999999999746.
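The availability figures can be checked with a short calculation. The sketch assumes that nodes fail independently with the same availability p, which appears to be the model behind the numbers above; the function names are hypothetical. Data is available under replication when at least one copy is reachable, and under erasure coding when at least k of the k + m shares are reachable.

```python
from math import comb


def replication_availability(p, copies):
    """Probability that at least one of the copies is on an available node."""
    return 1 - (1 - p) ** copies


def erasure_unavailability(p, k, m):
    """Probability that fewer than k of the k + m shares are available."""
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


p = 0.8
print(f"replication, 4 copies: {replication_availability(p, 4):.4f}")
print(f"erasure coding, k=8, m=24: 1 - {erasure_unavailability(p, 8, 24):.3e}")
```

Reporting the erasure-coded case as an unavailability avoids losing digits when a value of about 2.5e-12 is subtracted from 1 in floating point.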
[Figure: k data units and m coding units; (b) decoding, in which the contents of the original k units are recalculated.]
Figure 3-7 Erasure coding.
3.3.3 Hybrid backup
Peeraid uses both erasure coding and replication for file backup. Compared with replication, erasure coding provides higher data availability at lower storage cost, at the expense of encoding and decoding overhead. For large files, storage cost is the main concern, whereas efficiency is more important for small files. Consequently, Peeraid applies erasure coding to File-Objects for high data availability, and replication to SysInfo-Objects and Metadata-Objects for efficiency.
Besides efficiency, applying replication to SysInfo-Objects and Metadata-Objects also makes security management easier. If these two kinds of data objects were stored as erasure-coded shares, users would have to contact more nodes to accomplish user authentication.
3.3.4 Enhancing write performance with log
Although erasure coding provides higher data availability, it also degrades performance: users must contact more nodes to access a file than they would with replication. Read performance can be improved by selecting low-latency nodes from multiple sources and using parallel downloads. However, the system still suffers from the small-write problem, as RAID does [8]: all 32 shares of a block must be overwritten simultaneously every time a write operation is issued.
Peeraid solves this problem by using write logs. Users who want to write to a file do not write their contents directly into the File-Object. Instead, they generate write logs and send them to the metadata unit mapped from the file's File ID. The metadata unit stores these logs in a list alongside the file's Metadata-Object. Whenever a user accesses the file, the metadata unit returns both the Metadata-Object and all logs of the file to that user.
Figure 3-8 shows the contents of a write log, which represents one write operation. It includes the user ID of the requester, the write type (modify or append), a version number that represents the write order, the block index, the byte range within the written block, and the write content. The file owner or other users who have write permission periodically integrate these write logs into the File-Object.
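A write log and its replay can be sketched as follows. The field names mirror the list above, but the record layout, the `replay` function, and the exact semantics are assumptions made for illustration: the byte range is taken to be a half-open `(start, end)` pair inside the block, "modify" overwrites exactly that range, and "append" extends the block at `block_index`, creating it if it does not yet exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteLog:
    user_id: str
    write_type: str    # "modify" or "append"
    version: int       # defines the write order
    block_index: int
    byte_range: tuple  # (start, end), half-open, within the block
    content: bytes


def replay(blocks, logs):
    """Return new blocks with all logs applied in ascending version order."""
    result = [bytearray(b) for b in blocks]
    for log in sorted(logs, key=lambda entry: entry.version):
        if log.write_type == "append":
            while len(result) <= log.block_index:
                result.append(bytearray())
            result[log.block_index].extend(log.content)
        elif log.write_type == "modify":
            start, end = log.byte_range
            block = result[log.block_index]
            if end - start != len(log.content) or not 0 <= start <= end <= len(block):
                raise ValueError("byte range does not fit block or content")
            block[start:end] = log.content
        else:
            raise ValueError(f"unknown write type: {log.write_type}")
    return [bytes(b) for b in result]


logs = [
    WriteLog("bob", "modify", 2, 0, (0, 5), b"HELLO"),
    WriteLog("alice", "modify", 1, 0, (0, 5), b"jello"),
    WriteLog("alice", "append", 3, 0, (11, 12), b"!"),
]
print(replay([b"hello world"], logs))  # [b'HELLO world!']
```

The same replay could serve both a reader, who combines fetched blocks with the returned logs, and the periodic integration step, which would then re-encode only the blocks that actually changed.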
Figure 3-8 The contents of a write log.
3.3.5 Cooperative service support
In cloud computing, a file is often shared by multiple users, who may provide related services or cooperate to support a single service. For example, many current Internet services access Google Maps by means of Google APIs. A file system should therefore have mechanisms to manage access permissions for shared files.
On UNIX, group management is achieved by recording information about all groups in the /etc/group file. Peeraid supports the concept of a group with two mechanisms.
First, the information about a group is recorded in a SysInfo-Object. It includes the group ID, the authentication public key for the group, and the IDs of the group manager and all members. The user who creates a new group is its group manager, and only the group manager has write permission for the group's SysInfo-Object. Second, every group has its own routing table for key lookups, as mentioned before.
Each file belongs to one group, and the file owner can decide whether only group members have permission to access the file. There are two main differences between UNIX system files and SysInfo-Objects. First, UNIX system files are stored under the same directory hierarchy as user files, whereas SysInfo-Objects are stored under an independent name space, because SysInfo-Objects are distributed and accessed in different ways from user files.
Second, only the root user has permission to modify system files on UNIX. Peeraid, however, is fully decentralized and has no administrator. Consequently, every group has its own SysInfo-Object, which is managed by its group manager.
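The group rules described in this section can be summarized in a small sketch. The class and function names are hypothetical, and two points are assumptions rather than statements from the text: the group manager is treated as a member of the group, and the file owner is always allowed to access the file. The sketch does not distinguish read from write permission on files.

```python
from dataclasses import dataclass, field


@dataclass
class GroupInfo:
    """Group information as recorded in a SysInfo-Object."""
    group_id: str
    public_key: bytes
    manager_id: str
    member_ids: set = field(default_factory=set)

    def can_modify(self, user_id):
        # Only the group manager has write permission for the SysInfo-Object.
        return user_id == self.manager_id

    def is_member(self, user_id):
        # Assumption: the manager counts as a member of the group.
        return user_id == self.manager_id or user_id in self.member_ids


@dataclass
class FileEntry:
    owner_id: str
    group_id: str
    group_only: bool  # set by the file owner


def may_access(entry, group, user_id):
    """Decide whether user_id may access the file described by entry."""
    if user_id == entry.owner_id or not entry.group_only:
        return True
    return group.group_id == entry.group_id and group.is_member(user_id)


maps = GroupInfo("maps", b"<public key>", "alice", {"bob"})
tiles = FileEntry(owner_id="alice", group_id="maps", group_only=True)
print(may_access(tiles, maps, "bob"), may_access(tiles, maps, "carol"))  # True False
```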
3.3.6 Group locality and file distribution
The concept of a group in peer-to-peer systems is quite different from that in UNIX. In UNIX, all group members are on the local machine. In peer-to-peer systems, however, group members may be located on different nodes spread over the network. The concept of a group therefore relates not only to access permissions for shared files but also to the locations of members. Some groups consist of peer nodes around the world, whereas others consist of peer nodes only in a certain geographical region or community.
Peeraid combines file distribution with group locality to improve access performance. Users can decide whether to store files on the nodes of their group members. Group locality arises from two aspects. First, the group members of a file are more likely to access it than other users. Second, many groups are regional and consist of adjacent nodes, so the mean latency between group members is lower than that across the whole system. Storing files on group members' nodes also enhances file security, because users who belong to the same group are more willing to trust each other.
Peeraid provides three kinds of distribution policies for users: RANDOM, GROUP, and MIX. By default, a File-Object is distributed to 32 nodes because a block is stored as 32 shares.
RANDOM means all 32 nodes are selected randomly from all nodes in the system. This policy is suitable for public files, especially when group members are spread over the world. It can achieve higher data availability, because the probability that the nodes storing file shares fail simultaneously is lower, but access performance is not deterministic.
GROUP means all 32 nodes are selected randomly from group members only. This policy is suitable for group files, that is, files that only group members have permission to access. It ensures file security and exploits group locality to enhance access performance. However, if group members are highly concentrated in a certain region, they might fail simultaneously because of a single common problem.
MIX means half of the nodes are selected from group members and the other half from nodes outside the group. This policy is suitable for files with lower security requirements. It can both exploit group locality and achieve intermediate data availability.
Although the IDs of all members of a group are recorded in the corresponding SysInfo-Object, Peeraid supports these file distribution policies by using the individual routing table of each group. Users therefore do not have to access the SysInfo-Objects of groups whenever they distribute a File-Object.
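The three policies can be sketched as a node-selection function. This is an illustration only: the function name and parameters are hypothetical, candidate nodes are passed in as plain lists rather than being obtained from the per-group routing tables, and RANDOM is taken to mean sampling from all nodes in the system.

```python
import random


def select_nodes(policy, all_nodes, group_members, count=32, rng=None):
    """Pick the nodes that will store the shares of a block.

    random.Random.sample raises ValueError if a pool has fewer nodes
    than the policy needs.
    """
    rng = rng or random.Random()
    members = sorted(set(group_members))
    if policy == "RANDOM":
        return rng.sample(sorted(set(all_nodes)), count)
    if policy == "GROUP":
        return rng.sample(members, count)
    if policy == "MIX":
        outsiders = sorted(set(all_nodes) - set(members))
        half = count // 2
        return rng.sample(members, half) + rng.sample(outsiders, count - half)
    raise ValueError(f"unknown policy: {policy}")


nodes = [f"node{i:03d}" for i in range(200)]
group = nodes[:50]
chosen = select_nodes("MIX", nodes, group, rng=random.Random(7))
print(sum(n in group for n in chosen), "of", len(chosen), "nodes are group members")
```

Sampling without replacement keeps the 32 shares on distinct nodes, so that the loss of one node costs at most one share of a block.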