
Chapter 1 Introduction

1.4 Proposed KAD-N

In this thesis, to resolve the unbalanced load problem in structured P2P networks, we propose a KAD-N scheme that hashes the keyword of an object a random number of times to produce the key under which the object is published. Other peers that publish an object with the same keyword do the same. Different peers may hash a different number of times and thus produce different keys, all of which represent the same keyword. That is, our method produces several different keys for the same keyword, and these keys are published to different peers rather than to a single peer.

Peers that receive these publishing messages use the keys to build indexes, so our method spreads indexes more evenly across the network. Simulation results show that KAD-N balances the loads of peers and also increases the search hit rate when some peers fail or leave. The overhead of our method is a slight increase in network traffic, of no more than 10 percent.

The thesis is organized as follows. Chapter 2 presents preliminary knowledge of the KAD P2P network and related work. Chapter 3 describes our multiple hash method. Chapter 4 presents the simulation setup and results. Chapter 5 discusses implementation issues of the proposed KAD-N method. Chapter 6 gives concluding remarks and future work.


Chapter 2

Preliminaries and Related Work

Since the KAD P2P network is the main target of our enhancement, this chapter reviews this network and some existing load balancing methods. First, we give an overview of the distributed hash table (DHT), the main component of the KAD P2P network. Then we describe how objects are looked up, published, and searched for in the KAD P2P network. After that, we discuss the original KAD load balancing mechanisms. Finally, we review other load balancing methods, such as the Gossip Dissemination Strategy (GDS) [17], the Rendezvous Directory Strategy (RDS) [18], and the Independent Searching Strategy (ISS) [19].

2.1 Distributed hash table

A main characteristic of DHT-based networks is that search is deterministic: routing to and retrieving an object costs a bounded effort. DHT-based P2P networks basically support only exact-name matching, because each object is given a unique identifier, obtained by hashing its name, that determines its location in the network. Keyword search must therefore be built on top of the overlay to enhance search functionality. Several mechanisms have been proposed for keyword search in DHTs, and all of them use the inverted index as the primary data structure. An inverted index is a set of (keyword, object set) pairs. Once an inverted index is built, we can use a keyword to find all objects that contain it.

To implement keyword search in a structured P2P network, a distributed inverted index can be built. In a DHT-based network, a given keyword can be used as a key to find the peers that hold objects containing this keyword. Peers can then retrieve objects for a given search query (a keyword set), thereby performing a keyword search operation [2].

Figure 1. An inverted index example of four objects [2].

Figure 1(a) shows an inverted index example with four objects: objects 1, 2, 3, and 4. Each object contains three keywords. For example, object 1 has terms 1, 2, and 3 as keywords, and object 2 has terms 1, 3, and 5. Figure 1(b) is the inverted index built from these keywords and object sets. In this way, keywords are linked to the objects that contain them. For instance, if we use “term 1” as a keyword to search the network, objects 1, 2, and 3 will be found.
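The inverted index of this example can be sketched in a few lines of Python. Objects 1 and 2 use the keyword sets given in the text; the keyword sets of objects 3 and 4 are not listed here, so the ones below are hypothetical, chosen only so that a search for “term 1” returns objects 1, 2, and 3 as in the example.

```python
from collections import defaultdict

# Keywords per object. Objects 1 and 2 follow the text; objects 3 and 4
# are hypothetical placeholders, since the figure is not reproduced here.
objects = {
    1: {"term 1", "term 2", "term 3"},
    2: {"term 1", "term 3", "term 5"},
    3: {"term 1", "term 4", "term 5"},  # assumed
    4: {"term 2", "term 4", "term 6"},  # assumed
}


def build_inverted_index(objects):
    """Map each keyword to the set of object ids that contain it."""
    index = defaultdict(set)
    for obj_id, keywords in objects.items():
        for keyword in keywords:
            index[keyword].add(obj_id)
    return dict(index)


inverted_index = build_inverted_index(objects)
print(sorted(inverted_index["term 1"]))
```

A lookup in `inverted_index` is the centralized analogue of what a DHT does when a keyword key is routed to the peer that stores the corresponding index entries.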


2.2 Background of KAD

KAD specifies the structure of a network and the exchange of information through peer lookups. Peers in KAD communicate among themselves using UDP [3]. The participating peers form a virtual overlay network. Each peer is identified by a KAD ID, which serves not only as an identifier but also as the means by which peers locate objects. In Figure 2, the KAD ID of the peer is “11111.” An object can produce two kinds of keys: source keys and keyword keys. A source key identifies the content and location of an object. A keyword key is computed by hashing a keyword from the name of the object [18]. In Figure 2, the keywords of the object are “project” and “KAD.” Thus, the source key is “00110” and the keyword keys are “00010” and “11000.”

Figure 2. An example of a publishing peer.

Figure 3 and Figure 4 show an example of publishing an object. A peer wants to publish an object named “project KAD.” This object name yields two keywords, “project” and “KAD.” All relevant references to the original object, namely the source key and the keyword keys, are generated. Next, the keyword keys “project 00010” and “KAD 11000” are published to the corresponding peers “00001” and “11001” to build indexes, all of which point to peer “00111.” Finally, the source key is published, with an index pointing to the publishing peer.

Figure 3. The index building procedure.

In KAD, a key is not published to just the single peer that is numerically closest to it, but to 11 different peers whose KAD IDs match at least the first 8 bits of the key. This zone around a key is called the tolerance zone or the keyspace [16]. Since a keyspace is determined by an 8-bit prefix, there are 2^8 = 256 keyspaces in a KAD P2P network. Figure 5 shows the 11 peers that receive publishing messages in the target keyspace.
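The tolerance-zone test amounts to comparing 8-bit prefixes. The following minimal Python sketch illustrates it under the assumption that KAD IDs and keys are 128-bit integers; the ID width is not stated in this chapter, and the names `ID_BITS`, `ZONE_BITS`, and `in_tolerance_zone` are illustrative only.

```python
ID_BITS = 128   # assumed width of a KAD ID (not stated in this chapter)
ZONE_BITS = 8   # a peer is in the zone if its first 8 bits match the key
REPLICAS = 11   # number of peers a key is published to


def in_tolerance_zone(peer_id: int, key: int) -> bool:
    """Return True if peer_id and key share their first ZONE_BITS bits."""
    shift = ID_BITS - ZONE_BITS
    return (peer_id >> shift) == (key >> shift)


def number_of_keyspaces() -> int:
    """Each distinct 8-bit prefix defines one keyspace."""
    return 2 ** ZONE_BITS


key = (0xAB << 120) | 5
print(in_tolerance_zone((0xAB << 120) | 999, key))  # same 8-bit prefix
print(in_tolerance_zone(0xAC << 120, key))          # different prefix
print(number_of_keyspaces())
```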


Figure 4. An example of publishing an object.

Figure 5. The diagram of where to publish [20].

When searching for an object, a peer needs to know the target location and explores the network in several steps, each of which finds peers that are closer to the target. Figure 6 shows the steps of the lookup procedure. First, the searching peer sends requests to the two closest possible peers. When it receives the responses, it obtains three closer possible peers. The new possible peers that lie in the target keyspace are stored in a list called the candidate list. In this example, two of them are in the target keyspace and are saved to the candidate list. In the last step, the searching peer again asks the three closest peers for closer peers, but only two of them are available and reply. The lookup procedure terminates when the lookup responses contain only peers that are either already present in the candidate list or farther from the target than the top three candidate peers [16]. At this point, the candidate list is called stable. Like other DHT networks, KAD contacts only O(log N) peers during a lookup when there are N peers in the network, so the lookup procedure is very efficient.
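The iterative lookup can be illustrated with a toy simulation. The sketch below is a simplification under stated assumptions, not the KAD implementation: distance is the XOR metric used by Kademlia-style networks, a dictionary named `routing` stands in for real request and response messages, and the restriction of the candidate list to the target keyspace is omitted. The 5-bit IDs are made up for illustration.

```python
def lookup(target, start_peers, routing, top=3):
    """Iteratively query the `top` closest known peers.

    `routing` maps a peer id to the peer ids it would return in a
    response (a stand-in for the network). The loop stops when the
    closest `top` candidates have all been queried, i.e. when responses
    no longer bring in an unseen peer that is closer to the target.
    """
    candidates = set(start_peers)
    queried = set()
    while True:
        closest = sorted(candidates, key=lambda p: p ^ target)[:top]
        to_query = [p for p in closest if p not in queried]
        if not to_query:
            # The candidate list is stable.
            return sorted(candidates, key=lambda p: p ^ target)
        for peer in to_query:
            queried.add(peer)
            candidates.update(routing.get(peer, ()))


# Hypothetical 5-bit network used only for this example.
routing = {
    0b00001: [0b10000, 0b10100],
    0b00111: [0b10100, 0b11100],
    0b10000: [0b11010],
    0b10100: [0b11001, 0b11100],
    0b11100: [0b11001, 0b11010],
    0b11001: [],
    0b11010: [0b11001],
}
result = lookup(0b11000, [0b00001, 0b00111], routing)
print([format(p, "05b") for p in result[:3]])
```

Each round either queries at least one new peer or terminates, so the loop always ends on a finite network.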

Figure 6. Steps of the lookup procedure [20].


2.3 Load balancing in KAD

KAD P2P networks do little to balance the load of each peer. They merely limit the number of indexes that each peer handles to keep it from being overloaded. A peer can be responsible for at most 60,000 indexes in total and at most 50,000 indexes of an individual keyword. Once a limit is reached, the peer sends an overload response to the publishing peer, which then publishes the object to another peer.
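The two limits can be expressed as a small admission check. This is a minimal Python sketch of the rule as described above; the class name `IndexStore` and the boolean return value used to signal an overload response are assumptions made for illustration.

```python
MAX_TOTAL_INDEXES = 60_000        # per-peer limit, from the text
MAX_INDEXES_PER_KEYWORD = 50_000  # per-keyword limit on one peer, from the text


class IndexStore:
    """Counts the indexes held by one peer and enforces both limits."""

    def __init__(self):
        self.count_by_key = {}
        self.total = 0

    def publish(self, key):
        """Store one index for `key`.

        Returns False to signal an overload response, in which case the
        publishing peer would try another peer.
        """
        if self.total >= MAX_TOTAL_INDEXES:
            return False
        if self.count_by_key.get(key, 0) >= MAX_INDEXES_PER_KEYWORD:
            return False
        self.count_by_key[key] = self.count_by_key.get(key, 0) + 1
        self.total += 1
        return True
```

Note that the per-keyword limit alone lets a single popular keyword consume most of a peer's capacity, which is the imbalance that KAD-N targets.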

2.4 Other load balancing methods

In RDS [18], the load information of each peer is periodically published to a few fixed rendezvous directories, which are responsible for scheduling load reassignment to achieve load balance. The rendezvous directories are a fixed group of peers that are publicly known to all peers. If a rendezvous directory is taken over by malicious peers or overwhelmed by DoS traffic, the load balancing service fails.

In ISS [19], a peer does not publish its load information anywhere. To achieve load balancing, peers search independently for other peers with inverse load characteristics and move load from heavily loaded nodes to lightly loaded ones. ISS is very inefficient because searching for peers with inverse load characteristics may incur heavy traffic.

In [17], the authors propose GDS for load balancing in DHT-based P2P networks, which combines RDS and ISS. The whole network is organized into groups, and a gossip protocol is used to disseminate load information. After dissemination, each group member has the load information of its own group and the utilization information of the network, and this information is used to achieve load balancing. GDS can only be built on top of ring-like DHT-based P2P networks, such as Pastry and Chord.


In [27], the authors present a dynamic feedback adaptive scheduling algorithm that constantly adjusts the distribution ratio of peers according to the periodic feedback of the super peers. The algorithm takes the load capacity of each super peer fully into account and effectively solves the super peer load balancing problem in hybrid P2P networks [27]. The super peers in a system are divided into several cluster networks according to their locations in the network, and each cluster network communicates with a load balancing control peer through a cluster agent peer. All requests from peers are first sent to the load balancing control peer, which designates the super peer that each peer logs in to according to the most recently updated super peer scheduling sequence. There may be several such load balancing control peers in the system [27].


Chapter 3

Design Approach

In the original KAD, each keyword of an object is hashed once to obtain the key to publish. To enhance the search hit rate, we propose the KAD-N method, which publishes keywords by multiple hashing in the KAD P2P network, where N is the maximum number of hashes. By hashing the key of a keyword, we obtain a second key that represents the same keyword. We can hash the second key again to obtain a third key, and so on. By hashing a random number of times, we obtain the final key to publish. Different peers may therefore produce different keys for the same keyword, and different keys are published to different peers. In this way, the multiple hash method spreads the indexes of a keyword over different peers to balance their loads.

3.1 Concept of KAD-N

Figure 7 shows how a key to publish is generated. To publish object 1, we use terms 1, 2, and 3 as keywords. We take term 1 as an example to describe the multiple hash operation. First, we generate a random number r with 1 ≤ r ≤ N. Next, we hash term 1 to get key 1, hash key 1 to get key 2, and so on. After r rounds of hashing, we obtain key r, the final key, which we then publish to the network.
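The key generation just described can be sketched as follows. This is an illustration under assumptions rather than the thesis implementation: the chapter does not name the hash function, so a truncated SHA-256 digest is used as a stand-in 128-bit hash, and the function names are hypothetical.

```python
import hashlib
import random


def hash_once(data: bytes) -> bytes:
    # Stand-in 128-bit hash (assumption); the chapter does not specify
    # which hash function KAD-N uses.
    return hashlib.sha256(data).digest()[:16]


def key_after(keyword: str, r: int) -> bytes:
    """Hash the keyword r times: key 1 = H(keyword), key i = H(key i-1)."""
    if r < 1:
        raise ValueError("r must be at least 1")
    key = hash_once(keyword.encode("utf-8"))
    for _ in range(r - 1):
        key = hash_once(key)
    return key


def publishing_key(keyword: str, n_max: int, rng=random):
    """Draw r uniformly from 1..n_max and return (r, key r)."""
    r = rng.randint(1, n_max)
    return r, key_after(keyword, r)


r, key = publishing_key("term 1", 3, random.Random(0))
print(r, key.hex())
```

With `n_max = 1` the function always returns key 1, which corresponds to hashing the keyword exactly once.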


Figure 7. The procedure of how to generate a key to publish.

However, multiple hashing increases the number of search messages. If a keyword is published by hashing its key twice, then, on average, we must query two keys to obtain all the indexes of the keyword. Figure 8 shows the concept of multiple hashing for publishing and search. In this figure, keyword A may be published to peers 1 to N. If peer M wants to search for keyword A, it must send N search messages (queries) to the peers that may store indexes of keyword A. Peer M then receives N search responses and builds the indexes of keyword A from them. Therefore, the proposed KAD-N method generates more traffic when searching for keywords. However, the number of search messages is usually about one tenth of the number of publishing messages [16]. Hence, KAD-N requires only a small overhead to balance the publishing load.

Figure 8. The concept of multiple hash for publishing and search. (Diagram annotations: keyword A is published to N peers by different peers; peer M must send N search messages to these peers to search keyword A. Legend: P = publishing message, S = search message.)

Figure 9. The concept of multiple hash for balancing the index distribution.

Figure 9 gives an example of multiple hashing for balancing the index distribution. Figure 9(a) shows the index distribution of the original KAD, in which all indexes of an individual keyword are handled by one keyspace, e.g., keyspace 1 handles all indexes of keyword Q. After the proposed KAD-N method is applied to the network, the indexes are spread more evenly. In this example, we apply KAD-3, which distributes the indexes of keyword Q over three keyspaces, as shown in Figure 9(b): one part is handled by the original keyspace 1 and the other two parts are spread to two other keyspaces.

3.2 Publishing procedure

The detailed publishing procedure of KAD-N is shown in Figure 11. Recall that N is the maximum number of hashes. A KAD-N network works like the original KAD P2P network if N is set to 1. The optimal number of hashes is derived in Chapter 4. Block A1 of Figure 11 is discussed in Chapter 5.

To publish a file, we first obtain the publishing keywords from the object name. Then, for each keyword (for example, keyword A), a random number r with 1 ≤ r ≤ N is generated. After that, we hash keyword A r times to produce the final target key, key a. Figure 10 shows the proposed multiple hash algorithm. Key a is the target of the lookup procedure, whose details were described in Chapter 2. After starting the lookup procedure, we receive several responses containing possible peers that are closer to the target, and we use these peers to update the candidate list. Once the candidate list becomes stable, we advance to the next step. The candidate list is sorted by distance to the target in ascending order, so the closest node is first in the list. As mentioned in Chapter 2, KAD publishes a key to 11 different peers in the target keyspace; that is, the top 11 nodes are selected from the list for publishing. After publishing messages are sent to these nodes, the keyword has been successfully published to the KAD P2P network.
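The final selection step can be sketched in Python as below. The sketch assumes that IDs are integers and that distance is the XOR metric of Kademlia-style networks; the function name `select_publish_targets` is hypothetical.

```python
def select_publish_targets(candidate_list, target_key, replicas=11):
    """Sort candidates by distance to the target (ascending) and keep
    the closest `replicas` peers as publishing targets."""
    ordered = sorted(candidate_list, key=lambda peer_id: peer_id ^ target_key)
    return ordered[:replicas]


# Hypothetical stable candidate list with 20 peers.
targets = select_publish_targets(list(range(100, 120)), target_key=105)
print(targets)
```

If the stable candidate list holds fewer than 11 peers, the slice simply returns all of them.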


Figure 10. Multiple hash algorithm.


Figure 11. The publishing procedure of KAD-N (N: maximum hash times). Flowchart steps: Start; obtain keyword A from an object name; generate a random number r, 1 ≤ r ≤ N; set i = 1 and input = keyword A; hash input to get key tmp and set i = i + 1, repeating while i ≤ r; use key tmp as a target and run the lookup procedure; use the responses of lookup messages to update the candidate list; select the 11 closest nodes from the candidate list (nodes 1 to 11); send publishing messages to nodes 1 to 11; End. The flowchart also marks Block A1.

3.3 Search procedure

Figure 12. The search procedure of KAD-N (N: maximum hash times; TOTAL: maximum number of answers). Flowchart steps: Start; obtain keyword A from a query; set i = 1 and input = keyword A; hash input to get key tmp; use key tmp as a target and send a search message to the network; set i = i + 1; if i ≤ N, set input = key tmp and hash again; otherwise save answers from search responses until the number of answers exceeds TOTAL or a timeout occurs; End. The flowchart also marks Block A2.

Figure 12 describes the search procedure of KAD-N. First, we obtain a keyword from a query (for example, keyword A). This keyword is then hashed to produce a temporary key, which we use as the target of a search message sent to the network. This step is repeated N times, each time hashing the previous temporary key, so that N different search messages are sent to the network. We then receive several search responses, which may contain search answers. The search stops when we have received enough answers or when the timeout is triggered. The default TOTAL value is 300 and the default timeout is 20 seconds. In other words, the search process stops after 20 seconds or once more than 300 answers have been received [18]. Block A2 of Figure 12 is discussed in Chapter 5.
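The two parts of the search procedure, generating the N target keys and collecting answers up to TOTAL, can be sketched as follows. As before, a truncated SHA-256 digest is only a stand-in for the unspecified hash function; the network and the 20-second timeout are left out, and `responses` is a hypothetical list of answer lists.

```python
import hashlib

TOTAL = 300  # default maximum number of answers, from the text


def hash_once(data: bytes) -> bytes:
    # Stand-in 128-bit hash (assumption); the hash function is not
    # specified in this chapter.
    return hashlib.sha256(data).digest()[:16]


def search_keys(keyword: str, n_max: int):
    """Return key 1 .. key N, the targets of the N search messages."""
    keys = []
    current = keyword.encode("utf-8")
    for _ in range(n_max):
        current = hash_once(current)
        keys.append(current)
    return keys


def collect_answers(responses, total=TOTAL):
    """Save answers until more than `total` have been received.

    The timeout of the real procedure is omitted in this sketch.
    """
    answers = []
    for response in responses:
        answers.extend(response)
        if len(answers) > total:
            break
    return answers


keys = search_keys("A", 3)
print([k.hex() for k in keys])
```

Because a publisher uses key r for some r between 1 and N, querying all N keys covers every key under which the keyword may have been published.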

3.4 Qualitative comparison of KAD-N and KAD

Table II compares the original KAD with our KAD-N. If a keyword is hashed at most N times, the publishing load becomes more balanced and the search hit rate also increases. The indexes of a keyword are published to at most N targets. Furthermore, KAD-N does not increase the total number of indexes; it just distributes them more evenly, as shown in Figure 9. In other words, the total publishing load of the network in KAD-N is the same as in KAD, but the number of search messages is N times that of KAD. Because KAD-N spreads the indexes, more peers hold indexes of the same keyword, which improves the search hit rate when some peers fail. However, the network traffic increases slightly because of the larger number of search messages. The keyword of an object may be hashed at most N times, so the computation overhead is O(N). In the original KAD, each keyword is hashed only once, so the computation overhead is O(1). KAD-N therefore incurs extra computation overhead. We discuss the optimal value of N in the following chapter.


TABLE II. Qualitative comparison of KAD and KAD-N

Approach                    KAD          KAD-N (proposed)
Publishing load             Imbalanced   Balanced
Search hit rate             Normal       Better
Computation overhead        O(1)         O(N)
Query messages per search   1            N
Network traffic             Normal       More


Chapter 4

Simulation Results

4.1 Simulation setup

First, we analyze the overhead of publishing messages and search messages in KAD. The authors of [21] monitored 20 different keyspaces of the KAD network for 24 hours and recorded, on average, 4.3 million publishing messages and 350,000 search messages. These measurements show that there are about ten times more publishing messages than search messages. Moreover, a publishing message is about ten times larger than a search message, since it contains not only a keyword but also metadata describing the published object. The authors of [23] likewise monitored a keyspace in the KAD P2P network for 12 hours and recorded 561,542 search messages and 5,549,183 publishing messages. The search messages produced 10.8 MB of traffic and the publishing messages produced 966 MB. Based on these data, the average traffic is 0.019 KB per search message and 0.18 KB per publishing message. We used these figures to calculate the total network traffic, which consists of search traffic and publishing traffic, in our simulation environment.
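The per-message figures follow from dividing the measured traffic by the message counts reported in [23]. The short Python calculation below reproduces them approximately; it assumes 1 MB = 1024 KB, a convention the text does not state, so the last digit may differ slightly from the rounded values quoted above.

```python
KB_PER_MB = 1024  # assumption: binary megabytes

# Measurements of [23] as quoted in the text.
search_messages = 561_542
publish_messages = 5_549_183
search_traffic_mb = 10.8
publish_traffic_mb = 966

kb_per_search = search_traffic_mb * KB_PER_MB / search_messages
kb_per_publish = publish_traffic_mb * KB_PER_MB / publish_messages
count_ratio = publish_messages / search_messages
size_ratio = kb_per_publish / kb_per_search

print(f"KB per search message:     {kb_per_search:.4f}")
print(f"KB per publishing message: {kb_per_publish:.4f}")
print(f"publish/search count ratio: {count_ratio:.1f}")
print(f"publish/search size ratio:  {size_ratio:.1f}")
```

Both ratios come out close to ten, in line with the observations drawn from [21].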

We rank keywords by the number of times they appear; rank 1 is the most popular keyword. The publishing messages collected in [21] contain 26,500 different keywords per keyspace and 315,000 distinct files, and the number of appearances of each keyword was also counted. Based on these data, [21] used Matlab to estimate the number of indexes for the k-th most popular keyword, which is proportional to 1/k, and the number of indexes for the most popular keyword is about 10. For example, the most popular keyword has ten times as many indexes as the tenth most popular keyword in the KAD P2P network. That is to say, the peer that handles the indexes of the most popular keyword carries ten times the network load of the peer that handles the indexes of the tenth most popular keyword. Based on the above analysis, we evaluate the performance of our approach.
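The rank-based popularity model can be illustrated with a short Python sketch. It assumes the number of indexes of the keyword at rank k is proportional to 1/k, which is consistent with the tenfold ratio between rank 1 and rank 10 stated above; the function names are hypothetical and the absolute index counts are deliberately left out.

```python
def relative_indexes(rank: int) -> float:
    """Relative number of indexes of the keyword at the given rank,
    assuming a 1/rank popularity law."""
    return 1.0 / rank


def keyword_shares(num_keywords: int):
    """Fraction of all indexes that belongs to each rank."""
    weights = [relative_indexes(k) for k in range(1, num_keywords + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# 26,500 keywords per keyspace, as reported in the text.
shares = keyword_shares(26_500)
print(f"rank 1 / rank 10 ratio: {shares[0] / shares[9]:.1f}")
print(f"share of rank 1 keyword: {shares[0]:.3f}")
```

Under this model a single keyspace that stores the rank 1 keyword receives a disproportionate share of the indexes, which is the skew that spreading a keyword over up to N keys is meant to reduce.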

We used Java to construct our simulation environment. Based on [20] and the above analysis, we simulated how KAD P2P networks publish objects and distribute indexes, and recorded the indexes handled by each peer. We then applied our method to this simulation environment, gathered the indexes handled by each peer again, and used them to show the effectiveness of the proposed KAD-N method.

4.2 Simulation results

Figure 13 shows the index distributions of each keyspace under different hash times. We rank

