
Chapter 1 Introduction

1.5 Thesis organization

The rest of this thesis is organized as follows. Chapter 2 reviews the background of KAD P2P networks and existing load balancing schemes in structured P2P networks. In Chapter 3, we present our design approach in detail. The simulation setup and simulation results are presented in Chapter 4, and Chapter 5 concludes the thesis and outlines future work.

Chapter 2

Preliminaries and Related Work

Since the KAD P2P network is the main target of our enhancement, this chapter reviews the KAD P2P network and some existing load balancing methods. KAD P2P networks are based on a distributed hash table (DHT). We first review the DHT and then describe how objects are looked up, published, and searched in the KAD P2P network. Finally, we review existing load balancing mechanisms: KAD [5], KAD-7 [23], and MHF [24].

2.1 Distributed hash table

A DHT is based on a consistent hashing function. In DHT-based P2P networks, each object is assigned a unique ID using a consistent hashing function. A DHT-based P2P network basically supports only exact name matching, because each object is given a unique identifier, obtained by hashing its name, that determines its location in the network. Keyword search must be built on top of the network to enhance search functionality. The most common way to implement keyword search in information systems is an inverted index [15]. An inverted index is a set of (keyword, object set) pairs. After an inverted index is built, we can use a keyword to find all objects that contain this keyword.

A distributed inverted index is built to implement keyword search in a structured P2P network. By using DHT-based P2P networks, one can use a given keyword as a key to find the peers that have objects containing this keyword. Peers can retrieve objects with a given search query (keyword set) to perform a keyword search operation [15]. In Figure 1(a), we show an inverted index example of four objects: Objects 1, 2, 3, and 4. Object 1 contains Keywords 1, 2, and 3, and Object 2 has Keywords 1, 3, and 5, etc. Figure 1(b) shows the inverted index that is built from these keywords and objects. Following this rule, we can link keywords to different objects. For example, if we use “Keyword 2” as a keyword to search the network, we can find Objects 1, 3, and 4.
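As a minimal sketch of the inverted index in Figure 1, the following Python snippet builds the keyword-to-objects mapping from per-object keyword sets. The function name `build_inverted_index` is hypothetical, and the keyword sets of Objects 3 and 4 are not stated in the text; they are inferred here from the inverted index in Figure 1(b).

```python
from collections import defaultdict


def build_inverted_index(objects):
    """Map each keyword to the set of object names that contain it."""
    index = defaultdict(set)
    for name, keywords in objects.items():
        for keyword in keywords:
            index[keyword].add(name)
    return dict(index)


# Objects 1 and 2 follow the text; Objects 3 and 4 are inferred
# from Figure 1(b) (an assumption of this sketch).
OBJECTS = {
    "Object 1": {"Keyword 1", "Keyword 2", "Keyword 3"},
    "Object 2": {"Keyword 1", "Keyword 3", "Keyword 5"},
    "Object 3": {"Keyword 1", "Keyword 2", "Keyword 4"},
    "Object 4": {"Keyword 2", "Keyword 4"},
}

INDEX = build_inverted_index(OBJECTS)
```

Looking up `INDEX["Keyword 2"]` then returns Objects 1, 3, and 4, matching the example in the text.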

2.2 Background of KAD

Each KAD node has a global identifier, referred to as its KAD ID, which is 128 bits long and randomly generated by a cryptographic hash function. The designers of KAD consider a contact sufficiently close to the target if it shares at least the first 8 bits with it. The space of KAD IDs that satisfy this constraint is called the tolerance zone [17]. There are $2^8 = 256$ zones in a KAD P2P network. We will briefly explain the lookup, publishing, and search procedures.

(a) Each object has some keywords

Keyword      Object set
Keyword 1    {Object 1, Object 2, Object 3}
Keyword 2    {Object 1, Object 3, Object 4}
Keyword 3    {Object 1, Object 2}
Keyword 4    {Object 3, Object 4}
Keyword 5    {Object 2}

(b) An inverted index

Figure 1. An inverted index example of four objects [15].

2.2.1 Lookup procedure

When searching for objects, a peer needs to know the target location and explores the network in several steps. Each step finds peers that are closer to the target. Routing in KAD is based on prefix matching. In KAD networks, the distance between two nodes is the XOR-distance, defined as $d(a, b) = a \oplus b$. It is calculated bitwise on the KAD IDs of the two nodes, e.g., the distance between a = 10011 and b = 01111 is $d(a, b) = 10011 \oplus 01111 = 11100$. Routing to a KAD ID is done iteratively. Figure 2 shows an example lookup procedure. In the first step, the searching peer has the three closest possible contacts from its routing table. They have different XOR-distances and are still not close enough to the target peer. In the second step, the searching peer receives three responses, from which it obtains three closer possible contacts. If a new possible peer is in the tolerance zone, it is stored in a list called the candidate list. In the third step, two of these possible peers are in the tolerance zone and are saved to the candidate list. In the fourth step, the searching peer again sends a request for closer peers to the three closest peers. The lookup procedure terminates when the lookup responses contain only peers that are either already present in the candidate list or farther away from the target than the top three candidate peers [17]. At this point, no new request is sent and the candidate list becomes stable. KAD contacts only O(log N) peers during the lookup procedure when there are N peers in the network.
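The building blocks of the lookup described above can be sketched in a few lines of Python: the XOR-distance, the tolerance-zone test (first 8 bits equal), and the choice of the three closest contacts. The function names are hypothetical, and the sketch covers only these local computations, not the message exchange.

```python
ID_BITS = 128          # KAD IDs are 128 bits long
ZONE_PREFIX_BITS = 8   # contacts sharing the first 8 bits are close enough


def xor_distance(a, b):
    """XOR-distance d(a, b) between two IDs given as integers."""
    return a ^ b


def in_tolerance_zone(peer_id, target_id,
                      id_bits=ID_BITS, prefix_bits=ZONE_PREFIX_BITS):
    """True if both IDs share their first prefix_bits bits."""
    shift = id_bits - prefix_bits
    return (peer_id >> shift) == (target_id >> shift)


def closest_contacts(contacts, target_id, count=3):
    """Return the count contacts with the smallest XOR-distance to the target."""
    return sorted(contacts, key=lambda c: xor_distance(c, target_id))[:count]


# Five-bit example: 10011 XOR 01111 = 11100
example = xor_distance(0b10011, 0b01111)
```

In an iterative lookup, `closest_contacts` would be applied to the merged set of known contacts after each round of responses, and every newly learned contact passing `in_tolerance_zone` would be added to the candidate list.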

2.2.2 Publish procedure

Publishing is an essential action when peers want to share objects. A peer publishes keyword keys and a source key to other peers. In Figure 3, the KAD ID of the peer is “10111.” An object produces two kinds of keys: a source key and keyword keys. The source key is computed by hashing the name of the object. Keyword keys are computed by hashing the keywords in the name of the object [16]. The keywords of this object are “Modular” and “KAD.” In Figure 3, the source key is “01011” and the keyword keys of “Modular” and “KAD” are “00001” and “00100,” respectively.

Figure 2. An example iterative lookup procedure [16].

Figure 4 shows an example of the publishing steps for an index. Before publishing an index, a sending peer must use KAD_REQ to find a receiving peer. First, the sending peer sends a KAD_REQ to the receiving peer. KAD_REQ is used to find the receiving peer and to check whether that peer is alive. When the receiving peer receives the KAD_REQ, it sends a KAD_RES back. After a connection is established between the sending peer and the receiving peer, the sending peer starts to publish keys to the receiving peer.

Figure 4. The KAD publish steps for an index [16].

Figure 3. An example of an object to be published (peer 10111; source key 01011; keyword keys 00001 and 00100).

When a peer starts to publish keys, it publishes a source key and keyword keys using a 2-level publishing scheme. Figure 5 shows an example of 2-level publishing. Peer “10111” wants to publish an object named “Modular KAD.” This object name results in two keywords, “Modular” and “KAD.” All relevant references to the original object are generated, namely the source key and the keyword keys. Next, keyword keys “Modular 00001” and “KAD 00100” are published to the corresponding peers “00001” and “00100” to build indexes that all point to “01011,” the source key. Finally, the source key is published, with an index pointing to the publishing peer.
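The 2-level structure can be illustrated with a short Python sketch: keyword keys point to the source key, and the source key points to the publishing peer. The text only says that keys are obtained by hashing, so the use of MD5 as a 128-bit stand-in, the whitespace tokenization of the object name, and the helper names `kad_key` and `two_level_entries` are all assumptions of this sketch.

```python
import hashlib


def kad_key(text):
    """Hash a string to a 128-bit integer key (MD5 is only a stand-in)."""
    digest = hashlib.md5(text.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def two_level_entries(object_name, publisher_id):
    """Return the index entries created by publishing one object.

    Level 1: each keyword key maps to the source key.
    Level 2: the source key maps to the publishing peer.
    """
    source_key = kad_key(object_name)
    keyword_entries = {kad_key(word): source_key
                       for word in object_name.split()}
    source_entry = {source_key: publisher_id}
    return keyword_entries, source_entry


keyword_entries, source_entry = two_level_entries("Modular KAD", 0b10111)
```

A search would follow the two levels in the same order: a keyword key leads to one or more source keys, and a chosen source key leads to the peers that hold the object.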

In KAD, a key is not published on just the single peer that is numerically closest to that key, but on 11 different peers whose KAD IDs match at least the first 8 bits of the key. This zone around a key is called the tolerance zone or the keyspace [17].

2.2.3 Search procedure

Like publishing, searching for files is also a 2-level process: keyword search and source search. For a keyword search, the hash value of the first word of the user input is computed. The remaining words are packed into a search tree. A query consists of the hash value of the first keyword and a search tree [16]. The query is routed to the peers that have a KAD ID close to the hash value. The matching results are returned by those peers and carry the source keys. For a source search, the user chooses a desired object from the returned results. Then the source key of the object is used to search for the peers that have the object. The returned results are added to the download queue of the object.

2.3 Original load balancing scheme in KAD

KAD limits the number of indexes in each peer to avoid overloading. A peer can handle a maximum of 60,000 indexes in total and a maximum of 50,000 indexes for an individual keyword. When a peer that has reached either limit receives a publishing request, it still replies with a success message even though the publishing request is rejected.
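The limiting behavior described above can be sketched as a small Python class. The class and method names are hypothetical; only the two limits and the "always report success" behavior are taken from the text.

```python
MAX_TOTAL_INDEXES = 60_000         # per-peer limit from the text
MAX_INDEXES_PER_KEYWORD = 50_000   # per-keyword limit from the text


class KadIndexStore:
    """Index storage of one peer under the original KAD limits."""

    def __init__(self):
        self.per_keyword = {}   # keyword key -> number of stored indexes
        self.total = 0

    def publish(self, keyword_key):
        """Try to store one index; return (reply, stored)."""
        count = self.per_keyword.get(keyword_key, 0)
        stored = (self.total < MAX_TOTAL_INDEXES
                  and count < MAX_INDEXES_PER_KEYWORD)
        if stored:
            self.per_keyword[keyword_key] = count + 1
            self.total += 1
        # The reply reports success even when the index was dropped.
        return "success", stored
```

Because the publisher cannot distinguish a stored index from a dropped one, indexes of very popular keywords are silently lost once a peer is full.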

2.4 Other existing load balancing schemes

KAD-7 [23] hashes the keyword of an object r times to produce a key for publishing objects, where r is a random number and 1 ≤ r ≤ 7. Different peers may hash a different number of times and thus produce different indexes, all of which represent the same keyword. These keys are published to different peers, not just to one peer. Because the indexes are spread out, KAD-7 increases the number of search messages.
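A rough Python sketch of the repeated-hashing idea follows. MD5 is used only as a 128-bit stand-in hash, and the helper names are hypothetical. The search side, which probes all seven possible keys, is one reading of why the number of search messages grows; it is not confirmed by the text.

```python
import hashlib
import random


def hash_times(keyword, r):
    """Hash the keyword r times in a row and return a 128-bit integer key."""
    digest = keyword.encode("utf-8")
    for _ in range(r):
        digest = hashlib.md5(digest).digest()
    return int.from_bytes(digest, "big")


def kad7_publish_key(keyword, rng=random):
    """Pick r uniformly in [1, 7] and publish under the r-times hashed key."""
    r = rng.randint(1, 7)
    return hash_times(keyword, r)


def kad7_search_keys(keyword):
    """All seven keys under which the keyword may have been published."""
    return [hash_times(keyword, r) for r in range(1, 8)]
```

Each publisher thus lands on one of seven keys for the same keyword, which spreads the indexes over at most seven locations.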

In [18], the authors found that the load peaks are due to very popular keywords, which are most often meaningless stopwords. They proposed adding a stopword exclusion step to all KAD-based P2P systems. Stopword exclusion reduces the number of indexes, so the total number of indexes in KAD decreases.

In [24], the authors describe a novel approach with multiple hash functions (MHF) to replicate the hotspots in a series of different nodes to distribute the high load evenly, and it can increase or decrease the replicas dynamically. MHF provides a load balancing scheme for a high request rate. If the request rate is not over a threshold, the zone with popular keywords will still have a large number of requests. MHF will result in a lot of additional network traffic because a large number of indexes are replicated.

2.5 Qualitative comparison of representative load balancing schemes

Table II shows the comparison of four representative load balancing schemes. In the proposed KAD-mod, the publish load and request load are more balanced and thus the search hit rate will increase. If a popular keyword references $10^6$ indexes, KAD-mod can distribute the indexes more evenly to 160 zones. In contrast, KAD-7 can only spread the indexes to seven zones and KAD to just one zone. Since KAD-mod and KAD-7 spread the indexes, more peers hold the same indexes. As a result, the search hit rates of both schemes will increase when some peers fail. However, their network traffic will increase slightly because of the increased search messages. MHF replicates the hotspot load to other peers, so it generates a lot of indexes, which increases the network traffic. To learn where a key is located, peers must ask the original responsible peer. Once the original responsible peer goes offline, the search hit rate will decrease.

Table II. Qualitative comparison of four load balance schemes.

Chapter 3

Proposed KAD-mod Load Balancing Scheme

KAD has a 128-bit ID space and there are 256 zones in a KAD P2P network. Dividing $2^{128}$ by 256 gives a quotient of $2^{120}$, so each zone has at most $2^{120}$ peers. We define a new ID type: the mod ID. A mod ID is computed as KAD ID mod $2^{120}$. By deriving a new mod ID (as follows) for each peer, the peers with the same mod ID will be located in different zones. In Figure 6, assume that there are 15 peers and 5 zones so that each zone has 3 peers. Each peer’s KAD ID mod 3 generates its mod ID. For example, $4 \equiv 1 \pmod{3}$, and “1” is the mod ID. As another example, if a peer’s KAD ID is N, then the peers with KAD IDs $N + 2^{120}$, $N + 2 \times 2^{120}$, …, and $N + 255 \times 2^{120}$ will all have the same mod ID.

Deriving a new mod ID:

Let n be a nonzero integer. We say that two integers a and b are congruent modulo n if there is an integer k such that a – b = kn. In the KAD case, we have $a \equiv b \pmod{n}$, where a = KAD ID, $n = 2^{120}$, and b = mod ID.
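The mod ID arithmetic can be checked with a short Python sketch, which relies on Python's arbitrary-precision integers for the 128-bit values. The helper names are hypothetical.

```python
ID_BITS = 128
ZONE_BITS = 8
NUM_ZONES = 2 ** ZONE_BITS                 # 256 zones
ZONE_SIZE = 2 ** (ID_BITS - ZONE_BITS)     # 2^120 IDs per zone


def mod_id(kad_id):
    """mod ID = KAD ID mod 2^120 (the low 120 bits of the KAD ID)."""
    return kad_id % ZONE_SIZE


def zone_of(kad_id):
    """Zone number, i.e. the first 8 bits of the KAD ID."""
    return kad_id // ZONE_SIZE


def same_mod_id_peers(kad_id):
    """The 256 KAD IDs, one per zone, that share kad_id's mod ID."""
    base = mod_id(kad_id)
    return [base + zone * ZONE_SIZE for zone in range(NUM_ZONES)]


peers = same_mod_id_peers(12345)
```

Every ID in `peers` has the same mod ID and a different zone number, which is the property KAD-mod uses to locate reassigned loads.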

Figure 6. A mapping between KAD IDs and mod IDs in the KAD P2P network.

3.1 Concept of KAD-mod

We propose a modulo-based load balancing method that lets peers with the same mod ID share loads. Using the mod ID, we can easily determine where loads are reassigned and where to find the reassigned loads. We use a request forwarding threshold (RFT) to help decide whether an index should be redirected to the peer with the same mod ID in another zone. We limit the number of indexes stored in each peer to avoid overloading. A peer can handle at most RFT indexes of an individual keyword. When a peer reaches the limit of RFT indexes, it redirects the remaining requests to the peers with the same mod ID in other zones. Figure 7 shows the concept of KAD-mod for publishing. In this example, 180 keys K are published from many peers to peer N and RFT = 60. Peer N handles the first 60 keys. For the remaining keys, peer N redirects the 61st to 120th keys to peer $N + 2^{120}$ and the 121st to 180th keys to peer $N + 2 \times 2^{120}$.
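The Figure 7 example can be reproduced with a few lines of Python. The function name `handling_peer` is hypothetical, and the offset is computed as (key number − 1) divided by RFT, an indexing chosen here only so that the split matches the 60/60/60 example.

```python
from collections import Counter

ID_SPACE = 2 ** 128
ZONE_SIZE = 2 ** 120


def handling_peer(n, key_number, rft):
    """ID of the peer that stores the key_number-th published key (1-based)."""
    zone_offset = (key_number - 1) // rft
    return (n + zone_offset * ZONE_SIZE) % ID_SPACE


N = 7      # arbitrary example ID of the original receiving peer
RFT = 60
load = Counter(handling_peer(N, k, RFT) for k in range(1, 181))
```

With these values, `load` assigns 60 keys each to peers N, N + 2^120, and N + 2 × 2^120.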

3.2 Publish procedure

Figure 7. The concept of KAD-mod for publishing.

A receiving peer becomes a redirection peer when the number of KAD_REQs received for storing the same key exceeds RFT. Figure 8(a) shows a scenario in which the number of KAD_REQs that peer N received for storing the same key does not exceed RFT. Figure 8(b) shows the alternative publishing procedure, in five steps, when the number of KAD_REQs that peer N received for storing the same key K exceeds RFT. The five steps are:

Step 1: When a sending peer wants to publish key K to receiving peer N, it sends a KAD_REQ to receiving peer N.

Step 2: Divide REQ_counter by RFT to get a quotient i. In this case, i ≥ 1. Receiving peer N becomes a redirection peer and redirects the KAD_REQ to peer $N + i \times 2^{120}$.

Step 3: When receiving peer $N + i \times 2^{120}$ receives the KAD_REQ, it sends a KAD_RES to the sending peer.

Step 4: Then, the sending peer starts to send KAD_PUBLISH_REQ to the new receiving peer $N + i \times 2^{120}$.

Step 5: When receiving peer $N + i \times 2^{120}$ receives the KAD_PUBLISH_REQ, it sends a KAD_PUBLISH_RES to the sending peer. Then, key K is published successfully.

Figure 8. The KAD-mod publish procedure for a key.

The details of the publishing procedure in KAD-mod are shown in Figure 9. Figure 9(a) shows the procedure by which a peer publishes a key. We hash keyword A to generate a key K. Peer K is the target peer, and the publishing peer uses the lookup procedure to send KAD_REQ toward the target peer. The details of the lookup procedure were presented in Chapter 2. The peer then receives several responses that contain possible peers closer to the target peer and uses these peers to update the candidate list. Once the candidate list becomes stable, the peer goes to the next step: the top 11 peers are selected from the candidate list, and a KAD_REQ is sent to each of them to ask for storing the key. After receiving KAD_RES, the peer sends publishing messages to the 11 peers to complete the procedure of publishing a keyword to the KAD P2P network.

Figure 9(b) shows the case in which a peer receives a KAD_REQ for storing a key K. First, the peer checks whether it has received the same KAD_REQ before. If not, it initializes a new counter, REQ_counter, to 1. Otherwise, it adds one to REQ_counter. The peer then checks whether REQ_counter > RFT. If so, the peer calculates a peer number NEXT and redirects the KAD_REQ to peer NEXT, which has the same mod ID in another zone. If not, the peer sends a KAD_RES to the sending peer. To avoid an infinite loop, the peer redirects to at most 255 different zones, excluding its own zone.
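The receiving-peer logic of Figure 9(b) can be sketched as a small Python class. The class and method names are hypothetical. The quotient REQ_counter / RFT is taken literally from Step 2, so a count that is an exact multiple of RFT already moves on to the next zone, and capping the offset at 255 is an assumed way to honor the 255-zone limit, which the text does not spell out.

```python
ID_SPACE = 2 ** 128
ZONE_SIZE = 2 ** 120
MAX_REDIRECT_ZONES = 255   # never redirect past the 255 other zones


class ReceivingPeer:
    """Per-key request counting and redirection at one KAD-mod peer."""

    def __init__(self, peer_id, rft):
        self.peer_id = peer_id
        self.rft = rft
        self.req_counter = {}   # key -> number of KAD_REQs seen so far

    def on_kad_req(self, key):
        """Return ("KAD_RES", own ID) or ("REDIRECT", ID of peer NEXT)."""
        # First request sets the counter to 1; later ones add one.
        self.req_counter[key] = self.req_counter.get(key, 0) + 1
        count = self.req_counter[key]
        if count <= self.rft:
            return "KAD_RES", self.peer_id
        zone_offset = min(count // self.rft, MAX_REDIRECT_ZONES)
        next_peer = (self.peer_id + zone_offset * ZONE_SIZE) % ID_SPACE
        return "REDIRECT", next_peer


peer = ReceivingPeer(peer_id=5, rft=60)
results = [peer.on_kad_req("K") for _ in range(180)]
```

The first 60 entries of `results` are answered locally, and the later ones are redirected to peers that share the mod ID of peer 5.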

(a) The procedure of a peer publishing a key.

[Flowchart labels: $(N + \lfloor \mathrm{REQ\_counter} / \mathrm{RFT} \rfloor \times 2^{120}) \bmod 2^{128}$; $\lfloor \mathrm{REQ\_counter} / \mathrm{RFT} \rfloor$]

3.3 Search procedure

Figure 10 describes the search procedure of KAD-mod. The searching peer obtains a keyword from a keyword query (for example, keyword A). Then, this keyword is hashed to produce a key K. The searching peer uses key K as the target peer ID to send search messages. The searching peer learns REQ_counter from the received response and uses it (through $\lfloor \mathrm{REQ\_counter} / \mathrm{RFT} \rfloor$) to determine where the indexes corresponding to key K are stored. After that, we send search messages to these peers, from the nearest peer to the farthest peer in our routing table. Then, we receive several search responses, which may contain search answers. The search stops when a maximum number of answers, TOTAL, has been received or a timeout is triggered. The default TOTAL value is 300 and the default timeout is 20 seconds. In other words, we stop the search process after 20 seconds or once the searching peer has received more than 300 answers [20].
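A hedged Python sketch of the search side follows. It assumes that the peers holding indexes for key K are K, K + 2^120, …, up to the offset given by the quotient of REQ_counter and RFT, mirroring the publish procedure. The function names, the `query_peer` callback, and the simple zone-order traversal (instead of nearest-to-farthest ordering in the routing table) are assumptions of this sketch.

```python
import time

ID_SPACE = 2 ** 128
ZONE_SIZE = 2 ** 120
MAX_ZONE_OFFSET = 255


def index_holders(key, req_counter, rft):
    """IDs of the same-mod-ID peers that may hold indexes for key."""
    last_offset = min(req_counter // rft, MAX_ZONE_OFFSET)
    return [(key + z * ZONE_SIZE) % ID_SPACE for z in range(last_offset + 1)]


def search(holders, query_peer, total=300, timeout_s=20.0):
    """Query holders until total answers are collected or the timeout expires."""
    answers = []
    deadline = time.monotonic() + timeout_s
    for peer in holders:
        if len(answers) >= total or time.monotonic() >= deadline:
            break
        # query_peer is a caller-supplied function returning a list of answers.
        answers.extend(query_peer(peer))
    return answers


holders = index_holders(key=9, req_counter=150, rft=60)
```

Here `holders` contains three peers, and `search(holders, query_peer)` stops early as soon as TOTAL answers have arrived.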

Chapter 4

Simulation Results

4.1 Simulation setup

First, we analyze the KAD P2P network environment. In [21], the authors crawled a representative subnet of KAD every five minutes for six months. They found that, on average, there are 8,000 peers in a zone. In [19], the authors spied on one zone in the KAD P2P network for 12 hours. They observed 561,542 search messages with a total size of 10.8 MB and 5,549,183 publishing messages with a total size of 996 MB. According to the observation in [19], the average size is 0.019 KB for a search message and 0.18 KB for a publishing message. We classify keywords into ranks according to the number of times a keyword appears. The nth most popular keyword is classified as rank n. The number of indexes for the nth most popular keyword is proportional to $k \times 1/n^{1.63}$, where k is the number of indexes for the most popular keyword [18]. According to [18], k is about $10^7$. The parameter settings of our simulation environment are shown in Table III. We used Java to construct our simulation environment. In the simulation, the number of indexes handled by each zone and the number of times each zone was requested were collected for comparison and evaluation.
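The arithmetic behind these settings can be checked with a short Python sketch. The function names are hypothetical, and 1 MB is taken as 1,000 KB, an assumption under which the quoted averages of 0.019 KB and 0.18 KB come out after rounding.

```python
def indexes_for_rank(n, k=10 ** 7, exponent=1.63):
    """Approximate number of indexes of the rank-n keyword: k / n^1.63."""
    return k / n ** exponent


def average_message_kb(total_mb, message_count):
    """Average message size in KB, assuming 1 MB = 1,000 KB."""
    return total_mb * 1000 / message_count


search_kb = average_message_kb(10.8, 561_542)     # roughly 0.019 KB
publish_kb = average_message_kb(996, 5_549_183)   # roughly 0.18 KB
top_keyword = indexes_for_rank(1)                 # k itself
```

Under this distribution the index count drops quickly with rank, for example `indexes_for_rank(10)` is already well below one tenth of `indexes_for_rank(1)`.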

4.2 Simulation results

We used the Gini coefficient (G) as a load balancing index to evaluate load balancing regarding the number of indexes handled by each zone. G ranges between 0 and 1. The closer G is to 0, the more balanced the load is. G is computed as follows [22]:

$$G = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} |l_i - l_j|}{2 N^2 \mu}$$

For calculating G regarding the number of published indexes in each zone, N is the number of zones (N = 256), $l_i$ and $l_j$ are the numbers of indexes handled by the ith and jth zones, respectively, and $\mu$ is the average number of indexes handled by each zone. For calculating G

Table III. Simulation parameter settings.

Parameter                                        Value
Number of KAD peers                              256 × 8,000
Number of KAD zones                              256
Peers per zone                                   8,000
Number of different keywords                     1,000,000
Keywords popularity distribution                 Zipf’s law [18]
Search distribution                              Zipf’s law [18]
Ratio of publish messages to search messages     10 : 1

Because RFT would affect the performance of KAD, KAD-7, and the proposed KAD-mod, we conducted experiments to decide the best RFT. Figure 11 shows G regarding the number of indexes published in each zone under different RFT values. We found that the lowest value of G occurs when RFT is between 5,000 and 6,000.
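As a minimal sketch, the Gini coefficient used as the load balancing index can be computed in Python from the per-zone loads with the standard mean-absolute-difference form, $G = \sum_i \sum_j |l_i - l_j| / (2 N^2 \mu)$. The function name `gini` is hypothetical, and the choice of this particular form is an assumption based on the symbols defined for G.

```python
def gini(loads):
    """Gini coefficient of a list of non-negative per-zone loads."""
    n = len(loads)
    mean = sum(loads) / n
    if mean == 0:
        return 0.0   # no load at all counts as perfectly balanced
    total_diff = sum(abs(a - b) for a in loads for b in loads)
    return total_diff / (2 * n * n * mean)


balanced = gini([5, 5, 5, 5])    # identical loads
skewed = gini([0, 0, 0, 4])      # all load on one zone
```

With N = 256 zones the double sum has only 65,536 terms, so this direct form is fast enough to evaluate for every candidate RFT.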

There are two issues in the proposed KAD-mod. First, the average hop count of finding a target to publish an index increases after applying the KAD-mod method. We used the results of [17] to evaluate the average hop count of finding a target to publish an index. Figure 12 shows the average hop count of finding a target to publish an index under different RFT values.

In our method, for some popular keywords, receiving peers may need to redirect KAD_REQs to other peers because the total number of indexes of a popular keyword in the receiving peers exceeds RFT. Redirecting a KAD_REQ needs an additional hop to find the next target.

Figure 11. The Gini coefficient regarding the number of indexes published in each zone under different RFT values (x-axis: RFT, from 1,000 to 20,000; y-axis: Gini coefficient G).

Second, the number of search messages will also increase after applying the proposed
