
Chapter 2 Preliminaries and Related Work

2.4 Other load balancing methods

In RDS [18], each peer periodically publishes its load information to a few fixed rendezvous directories, which schedule load reassignment to achieve load balance. The rendezvous directories are a fixed group of peers that are publicly known to all peers. If a rendezvous directory is taken over by malicious peers or overwhelmed by DoS traffic, the load balancing service fails.

In ISS [19], a peer does not publish its load information anywhere. To achieve load balancing, each peer searches independently for other peers with inverse load characteristics and moves load from heavy nodes to light nodes. ISS is very inefficient because searching for peers with inverse load characteristics may incur heavy traffic.

The authors of [17] proposed GDS for load balancing in DHT-based P2P networks, which combines RDS and ISS. The whole network is organized into groups, and a gossip protocol is used to disseminate load information. After dissemination, each group member has the load information of its own group and the utilization information of the network.

This load information is then used to achieve load balancing. GDS can only be built on top of ring-like DHT-based P2P networks, such as Pastry and Chord.

The authors of [27] presented a dynamic feedback adaptive scheduling algorithm that constantly adjusts the distribution ratio of peers according to the periodic feedback from the super peers. The algorithm takes the load capacity of each super peer into account and addresses the super peer load balancing problem in hybrid P2P networks [27]. The super peers in a system are divided into several cluster networks according to their locations in the network, and each cluster network communicates with a load balancing control peer through a cluster agent peer. All requests from peers are first sent to the load balancing control peer, which assigns each peer a super peer to log in to according to the most recently updated super peer scheduling sequence. There may be several such load balancing control peers in the system [27].

Chapter 3 Design Approach

In general, each keyword of an object is hashed once to obtain the key to publish. To enhance the search hit rate, we propose the KAD-N method, which publishes keywords by multiple hashing in the KAD P2P network, where N is the maximum number of hash times. Hashing the key of a keyword yields a second key that represents the same keyword, and hashing the second key yields a third key. By hashing a random number of times, we obtain the final key to publish. Different peers may therefore produce different keys for the same keyword, and different keys are published to different peers. In this way, multiple hashing spreads the indexes of a keyword over different peers and balances their loads.

3.1 Concept of KAD-N

Figure 7 shows how a key to publish is generated. Suppose we want to publish object 1 and use terms 1, 2, and 3 as keywords. We take term 1 as an example to describe the multiple hash operation. First, we generate a random number r with 1 ≦ r ≦ N. Next, we hash term 1 to get key 1, hash key 1 to get key 2, and so on. After hashing r times, we get key r, which is the final key. We then publish key r to the network.
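The key generation described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: MD5 is only a stand-in for the hash function (it yields a 128-bit key), r is assumed to be drawn uniformly, and the names hash_once, key_chain, and publish_key are hypothetical.

```python
import hashlib
import random


def hash_once(data: bytes) -> bytes:
    """One hash step. MD5 is an assumed stand-in that yields a 128-bit key."""
    return hashlib.md5(data).digest()


def key_chain(keyword: str, n: int) -> list[bytes]:
    """Return [key 1, ..., key n], where key i is the keyword hashed i times."""
    keys = []
    current = keyword.encode("utf-8")
    for _ in range(n):
        current = hash_once(current)
        keys.append(current)
    return keys


def publish_key(keyword: str, n: int, rng: random.Random) -> tuple[int, bytes]:
    """Draw r with 1 <= r <= n (uniform draw is an assumption) and return (r, key r)."""
    r = rng.randint(1, n)
    return r, key_chain(keyword, r)[-1]


if __name__ == "__main__":
    r, key = publish_key("term 1", 3, random.Random())
    print(f"r = {r}, key r = {key.hex()}")
```

With n = 1 the chain contains only key 1, so the sketch reduces to hashing the keyword once, as in the original KAD.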

Figure 7. The procedure of how to generate a key to publish.

However, multiple hashing increases the number of search messages. If we publish a keyword by hashing the key twice, then we must query two keys on average to get all the indexes of the keyword. Figure 8 shows the concept of multiple hashing for publishing and search. In this figure, keyword A may be published to peers 1 to N. If peer M wants to search for keyword A, it must send N search messages (queries) to the peers that may store indexes of keyword A.

Peer M then receives N search responses and builds the indexes of keyword A from them. The proposed KAD-N method therefore generates more traffic when searching for keywords. However, search messages are usually about ten times fewer than publishing messages [16]. Hence, KAD-N requires only a small overhead to balance the publishing load.

Figure 8. The concept of multiple hash for publishing and search.

[Figure 8 labels: keyword A is published to N peers (peers 1 to N) by different peers; peer M must send N search messages to these peers to search keyword A. P: publishing message; S: search message.]

Figure 9. The concept of multiple hash for balancing the index distribution.

Figure 9 is an example of multiple hashing for balancing the index distribution. Figure 9(a) is the index distribution of the original KAD. All indexes of an individual keyword are handled by one keyspace, e.g. keyspace 1 handles all indexes of keyword Q. After the proposed KAD-N method is applied to the network, the indexes are spread more evenly. In this example, we apply KAD-3 to the network. KAD-3 distributes the indexes of keyword Q over 3 keyspaces, as shown in Figure 9(b). One part is handled by the original keyspace 1 and the other two are spread to two other keyspaces.

3.2 Publishing procedure

The detailed publishing procedure of KAD-N is shown in Figure 11. Recall that N is the maximum number of hash times. A KAD-N network works like the original KAD P2P network if N is set to 1. In Chapter 4, the optimal number of hash times will be derived. Block A1 of Figure 11 is discussed in Chapter 5.

[Figure 9 panels: (a) the original KAD index distribution, with keyspace 1 marked; (b) the KAD-3 index distribution, obtained by applying the KAD-3 method. Legend: indexes of keyword Q; publishing target.]

To publish a file, we first extract the publishing keywords from the object name. For each keyword (for example, keyword A), a random number r with 1 ≤ r ≤ N is generated. We then hash keyword A r times to produce the final target key, key a. Figure 10 shows the proposed multiple hash algorithm. Key a is the target of the lookup procedure, whose details were described in Chapter 2. After the lookup procedure starts, we receive several responses containing peers that may be closer to the target, and we use these peers to update the candidate list. Once the candidate list becomes stable, we advance to the next step. The candidate list is sorted by distance to the target in ascending order, so the closest node is the first in the list. As mentioned in Chapter 2, KAD publishes a key to 11 different peers in the target keyspace. That is, the top 11 nodes are selected from the list for publishing. After publishing messages have been sent to these nodes, the keyword has been published to the KAD P2P network.
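The final selection step can be illustrated with a short sketch. It assumes that distance is the XOR metric of Kademlia, on which KAD is based, and that the candidate list is already stable; the names xor_distance and select_publish_targets are hypothetical, and the lookup itself is not modeled.

```python
REPLICAS = 11  # number of peers a key is published to, as stated in the text


def xor_distance(a: bytes, b: bytes) -> int:
    """XOR distance between two equal-length identifiers (assumed metric)."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def select_publish_targets(
    target: bytes, candidates: list[bytes], count: int = REPLICAS
) -> list[bytes]:
    """Sort candidates by ascending distance to the target and keep the top ones."""
    ranked = sorted(candidates, key=lambda node_id: xor_distance(node_id, target))
    return ranked[:count]


if __name__ == "__main__":
    target = (0).to_bytes(16, "big")
    # Twenty made-up 128-bit node identifiers, listed from farthest to nearest.
    candidates = [i.to_bytes(16, "big") for i in range(20, 0, -1)]
    for node_id in select_publish_targets(target, candidates):
        print(int.from_bytes(node_id, "big"))
```

Publishing messages would then be sent to each returned node; if fewer than 11 candidates are known, the sketch simply returns all of them.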


Figure 10. Multiple hash algorithm.

Figure 11. The publishing procedure of KAD-N.

[Figure 11 flowchart, where N is the maximum hash times: Start; obtain keyword A from an object name; generate a random number r, 1 ≦ r ≦ N; set i = 1 and input = keyword A; hash input to get key tmp and set i = i + 1, repeating while i ≦ r (Block A1); use key tmp as a target and then run the lookup procedure; use the responses of lookup messages to update the candidate list; select the 11 closest nodes from the candidate list (nodes 1~11); send publishing messages to nodes 1~11; End.]

3.3 Search procedure

Figure 12. The search procedure of KAD-N.

[Figure 12 flowchart, where N is the maximum hash times and TOTAL is the maximum number of answers: Start; obtain keyword A from a query; set i = 1 and input = keyword A; hash input to get key tmp; use key tmp as a target and then send a search message to the network; set i = i + 1; decision i ≦ N (Yes: input = key tmp); save answers from search responses; decision "number of answers > TOTAL or timeout" (No / Yes); End. The hash steps are marked as Block A2.]

Figure 12 describes the search procedure of KAD-N. First, we obtain a keyword from a query (for example, keyword A). This keyword is hashed to produce a temporary key, which we use as the target of a search message sent to the network. The temporary key is then hashed again to produce the next target, and this procedure is repeated N times. That is, N different search messages are sent to the network. We then receive several search responses, which may contain search answers. The search stops when we have received enough answers or the timeout is triggered.

The default TOTAL value is 300 and the default timeout is set to 20 seconds. In other words, we will stop the search process after 20 seconds or if we receive more than 300 answers [18]. We will discuss block A2 of Figure 12 in Chapter 5.
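The search loop can be sketched as below. This is a simplified, synchronous illustration rather than the actual KAD message handling: send_query is a hypothetical callback standing in for sending one search message and collecting its answers, MD5 is an assumed stand-in for the hash function, and the stop condition is checked after each query.

```python
import hashlib
import time
from typing import Callable, Iterable

TOTAL = 300             # default maximum number of answers, from the text
TIMEOUT_SECONDS = 20.0  # default timeout, from the text


def search(
    keyword: str,
    n: int,
    send_query: Callable[[bytes], Iterable[str]],
    total: int = TOTAL,
    timeout: float = TIMEOUT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Query key 1 .. key n in turn and collect the answers.

    Stops early once more than `total` answers are held or the timeout expires.
    """
    deadline = clock() + timeout
    answers: list[str] = []
    key = keyword.encode("utf-8")
    for _ in range(n):
        key = hashlib.md5(key).digest()  # key i = hash of key i-1
        answers.extend(send_query(key))
        if len(answers) > total or clock() >= deadline:
            break
    return answers
```

A caller would pass a function that sends the search message for the given key and returns the answers found in the responses; in a real client the N messages would be sent without waiting for each response in turn.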

3.4 Qualitative comparison of KAD-N and KAD

Table II compares the original KAD with our KAD-N. If we hash a keyword at most N times, the publishing load becomes more balanced and the search hit rate also increases. Indexes of a keyword are published to at most N targets. Furthermore, KAD-N does not increase the total number of indexes; it only distributes them more evenly, as shown in Figure 9. In other words, the total publishing load of the network in KAD-N is the same as in KAD, but the number of search messages increases N times compared to KAD. Because KAD-N spreads the indexes, more peers hold indexes of the same keyword, which improves the search hit rate when some peers fail. However, the network traffic increases slightly because of the additional search messages. The keyword of an object may be hashed at most N times, so the computation overhead is O(N). In the original KAD, each keyword is hashed only once, so the computation overhead is O(1).

KAD-N would cause extra computation overhead. We will discuss an optimal value of N in the following chapter.

TABLE II. Qualitative comparison of KAD and KAD-N

Approach                     KAD          KAD-N (proposed)
Publishing load              Imbalanced   Balanced
Search hit rate              Normal       Better
Computation overhead         O(1)         O(N)
Query messages per search    1            N
Network traffic              Normal       More

Chapter 4 Simulation Results

4.1 Simulation setup

First, we analyze the overhead of publishing messages and search messages in KAD. The authors of [21] monitored 20 different keyspaces of the KAD network for 24 hours. During this time, on average, 4.3 million publishing messages and 350,000 search messages were recorded. These measurements show that there are roughly ten times more publishing messages than search messages. Moreover, a publishing message is about ten times bigger than a search message, since it contains not only a keyword but also metadata describing the published object. The authors of [23] likewise monitored a keyspace in the KAD P2P network for 12 hours and recorded 561,542 search messages and 5,549,183 publishing messages. The search messages produced 10.8 MB of traffic and the publishing messages produced 966 MB. Based on these data, a search message produces 0.019 KB of traffic and a publishing message 0.18 KB on average. We used these values to calculate the total network traffic, which consists of search traffic and publishing traffic, in our simulation environment.
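The per-message figures can be checked with a few lines of arithmetic. The sketch assumes 1 MB = 1024 KB, which reproduces the quoted values; the function name kb_per_message is hypothetical.

```python
KB_PER_MB = 1024  # assumed conversion factor

# Measurements reported in [23] for one keyspace over 12 hours.
SEARCH_MESSAGES = 561_542
PUBLISH_MESSAGES = 5_549_183
SEARCH_TRAFFIC_MB = 10.8
PUBLISH_TRAFFIC_MB = 966.0


def kb_per_message(total_mb: float, message_count: int) -> float:
    """Average traffic per message in KB."""
    return total_mb * KB_PER_MB / message_count


if __name__ == "__main__":
    search_kb = kb_per_message(SEARCH_TRAFFIC_MB, SEARCH_MESSAGES)
    publish_kb = kb_per_message(PUBLISH_TRAFFIC_MB, PUBLISH_MESSAGES)
    print(f"search message:  {search_kb:.4f} KB")
    print(f"publish message: {publish_kb:.4f} KB")
    print(f"publish/search message count ratio: {PUBLISH_MESSAGES / SEARCH_MESSAGES:.2f}")
```

The results are roughly 0.0197 KB and 0.178 KB, in line with the 0.019 KB and 0.18 KB used in the simulation, and the count ratio is close to ten.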

We rank keywords by the number of times they appear. Rank 1 is the most popular keyword.

The publishing messages, which were collected by [21], contain 26,500 different keywords per keyspace and 315,000 distinct files. The appearances of each keyword were also counted.

Based on these data, [21] used Matlab to estimate the number of indexes for the $i$-th most popular keyword, which is proportional to $1/i^{\alpha}$ [exponent value lost in the source], and the number of indexes for the most popular keyword is about 10 [exponent lost in the source]. For example, the most popular keyword has ten times more indexes than the tenth most popular keyword in the KAD P2P network. That is to say, the peer that handles the indexes of the most popular keyword receives ten times the network load of the peer that handles the indexes of the tenth most popular keyword. Based on the above analysis, we evaluate the performance of our approach.

We used Java to construct our simulation environment. Based on [20] and the above analysis, we simulated how KAD P2P networks publish objects and distribute indexes, and recorded the indexes handled by each peer. We then applied our method to this simulation environment, gathered the indexes handled by each peer, and used them to show the effectiveness of the proposed KAD-N method.

4.2 Simulation results

Figure 13 shows the index distribution over the keyspaces under different hash times. We rank keyspaces according to the number of indexes handled, i.e. keyspace popularity. The rank 1 keyspace handles the most indexes. We found that the index distribution of the original KAD (KAD-1) is very uneven: a large number of indexes are handled by a few keyspaces. If we hash more times, some indexes are moved from the top-ranked keyspaces to others. The figure shows that the index distribution curve becomes smoother as we hash the key more times. In other words, the more times we hash the key, the more balanced the publishing load.

Figure 13. The index distribution of each keyspace under different hash times.

However, the number of search messages increases after the proposed KAD-N method is applied. Figure 14 shows that the total number of network messages increases linearly with the number of hash times. The total number of network messages, which includes search messages and publishing messages, was calculated based on [23], as introduced in the simulation setup.

Figure 14. The total number of network messages under different hash times.


Figure 15 plots the percentage of extra traffic under different hash times. As in Figure 14, the curve grows linearly. As mentioned in the simulation setup, the search traffic is the number of search messages multiplied by 0.019 KB and the publishing traffic is the number of publishing messages multiplied by 0.18 KB. We add the search traffic to the publishing traffic to get the total network traffic. We calculate the extra traffic percentage p according to the following equation:
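The equation itself did not survive in this copy. As a hedged reconstruction, assuming that p measures the increase in total traffic relative to the original KAD (KAD-1), it can be written as follows, where T, S, and P are symbols introduced only for this sketch:

```latex
T(N) = 0.019\,S(N) + 0.18\,P \quad \text{(KB)},
\qquad
p(N) = \frac{T(N) - T(1)}{T(1)} \times 100\%
```

Here S(N) is the number of search messages with N hash times and P is the number of publishing messages, which does not depend on N because KAD-N does not change the total number of indexes.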

The extra traffic is very small because search messages are far fewer than publishing messages, and a search message produces much less traffic than a publishing message.

Figure 15. The percentage of extra traffic under different hash times.

We use the standard deviation $\sigma$ to show the divergence under different hash times. The standard deviation is a measure of the dispersion of a data set. A low standard deviation indicates that the data points tend to be very close to the mean, while a high standard deviation indicates that the data are “spread out” over a large range of values. We calculate the standard deviation from the number of indexes handled by each keyspace. In other words, the higher the value, the more unbalanced the publishing load of the keyspaces. $\sigma$ is computed as follows:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

where $n$ is the number of keyspaces ($n$ = 256 in KAD), $x_i$ is the number of indexes handled in the $i$-th keyspace, and $\mu$ is the average number of indexes handled in each keyspace.
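As a small illustration of this measure, the sketch below computes the standard deviation of per-keyspace index counts. It assumes the population form (dividing by the number of keyspaces); the function name keyspace_std and the sample counts are made up for the example.

```python
import math


def keyspace_std(index_counts: list[int]) -> float:
    """Population standard deviation of the indexes handled per keyspace."""
    n = len(index_counts)
    mean = sum(index_counts) / n
    return math.sqrt(sum((x - mean) ** 2 for x in index_counts) / n)


if __name__ == "__main__":
    # Made-up counts for four keyspaces holding eight indexes in total.
    concentrated = [8, 0, 0, 0]  # one keyspace handles everything
    spread = [2, 2, 2, 2]        # the same indexes spread evenly
    print(keyspace_std(concentrated))
    print(keyspace_std(spread))
```

The concentrated case gives a larger value than the evenly spread case, which is the sense in which a lower standard deviation indicates a more balanced publishing load.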

From Figure 16, we observe that once the number of hash times reaches 7, $\sigma$ no longer decreases much. That is, if we hash more than 7 times, the standard deviations are almost the same. In other words, hashing 7 or more times does not help much with load balancing.

Figure 16. The standard deviation of each keyspace under different hash times.


We also simulated the hit rate variation under different hash times in case that some peers failed. We cannot retrieve indexes from failed peers, and we call these indexes missing indexes. Objects referenced by missing indexes would be unsearchable. Note that the hit rate is calculated from the number of missing indexes divided by the number of total indexes. In Figure 17, hashing more times increases the hit rate when peers fail. From [22], we know that the number of peers varies according to a diurnal pattern, and the minimal number of peers is about 78% of the maximum. So the percentage of failed peers in a day is about 27%.

Figure 17. The hit rate with respect to failed peers under different hash times.

[Figure 17 axis: failed peers (%), with ticks 10, 20, 30.]

Figure 17 shows that the hit rate is close to 100% if we hash more than 5 times with 27% of peers failed. The proposed KAD-N increases the search resilience when a large number of peers fail. Because more hash times do not always bring more efficiency, we used a cost-effectiveness factor k to determine the maximum hash times.

[Equation defining k: not recoverable from the source apart from the symbol $\sigma$.]

To obtain a larger k, one has to increase the hit rate and reduce the total network traffic and the standard deviation. In Figure 18, the k values for 6, 7, and 8 hash times are very close, and k for 7 hash times is the highest. That is, hashing 7 times is the optimal choice for the trace we simulated.

Figure 18. Cost-effectiveness factor under different hash times.

[Figure 18 axes: x-axis, hash times (1 to 12); y-axis, cost-effectiveness factor (k), from 3.000 to 6.000. Data labels shown: 5.663, 5.705, 5.696.]

Chapter 5 Implementation Issues

5.1 Applying KAD-N to existing KAD P2P networks

In this chapter, we describe how to implement the proposed KAD-N method. Our method is an improvement of the original KAD and can be implemented on top of an existing KAD P2P network client, such as eMule [8] or aMule [24]. Both are open source projects, so their source code is easy to obtain, and the proposed KAD-N method can be implemented by modifying it. In the following, we describe how to adapt the KAD method to the proposed KAD-N method.

Figure 19 shows the publishing procedure of the original KAD P2P network and Figure 20 shows its search procedure. We can implement the proposed KAD-N method based on a KAD P2P network by modifying the hash operation for publishing and searching objects. To apply our method to the KAD P2P network, we replace block B1 in Figure 19 with block A1 in Figure 11 and also replace block B2 in Figure 20 with block A2 in Figure 12.

Figure 19. The publishing procedure of KAD.

[Figure 19 flowchart: Start; obtain keyword A from object name; hash keyword A to get key tmp (Block B1); use key tmp as a target and run the lookup procedure; use the responses of lookup messages to update the candidate list; decision "Is the candidate list stable?" (Yes); select the 11 closest nodes from the candidate list (nodes 1~11); send publishing messages to nodes 1~11; End.]

Figure 20. The search procedure of KAD.

5.2 Combining with streaming

The KAD P2P network can be extended to support streaming applications. Most existing P2P networks based on KAD are only capable of file sharing. We can enhance KAD P2P networks by adding functions such as P2P streaming. For example, if a video file has been published by several peers in the KAD network, we can use a P2P streaming tool to view this file while it is downloading. In the original KAD P2P network, we must wait until the whole file has been downloaded, which is not efficient.

[Figure 20 flowchart, where TOTAL is the maximum number of answers: Start; obtain keyword A from a query; hash keyword A to get key tmp (Block B2); use key tmp as a target and then send a search message to the network; save answers from search responses; decision "number of answers > TOTAL or timeout" (Yes); End.]

To combine a KAD P2P network with streaming, we can modify its download function. A peer can search for a video file to get a list of the peers that have this file in the KAD P2P network. We can use this peer list to form a P2P streaming network. Peers who have the requested video file
