Decentralized Construction of Navigable Small-World Peer-to-Peer Networks


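National Science Council, Executive Yuan: Sponsored Research Project Report

□ Interim progress report
Decentralized Construction of Navigable Small-World Peer-to-Peer Networks

Project type: ■ Individual project □ Integrated project; Project number: NSC 96－2221－E－006－062－

Project period: August 1, 2007 to July 31, 2008

Principal investigator: Hung-Chang Hsiao (蕭宏章); Co-principal investigator:

Project participant: 廖豪 (Ph.D. student)

Type of final report (submitted as required by the approved funding list): □ Concise report ■ Full report

This final report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

□ One report on attending an international academic conference, plus one copy of the paper presented

□ One overseas research report for an international collaborative research project

Disclosure: Except for industry-academia collaborative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases listed below, this report may be made available for public inquiry immediately.

□ Involves patents or other intellectual property rights; may be made public after □ one year ■ two years

Executing institution: Department of Computer Science and Information Engineering, National Cheng Kung University

Republic of China (Taiwan), August 1, 2008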


Building Small-World Peer-to-Peer Networks Based on Hierarchical Structures

Hung-Chang Hsiao, Yung-Chih Lin

Abstract— Small-world (SW) networks possess two properties, namely low diameter and high clustering coefficient, that are often desired by large-scale peer-to-peer networks. Prior studies have shown that an SW network can be constructed from a d-regular graph in which each node maintains d local neighbors and a small constant number of long-distance contacts. However, it is commonly understood that constructing a short route in an SW network from a source node s to a target node t is difficult, even though an SW network guarantees that such a short route exists. Prior work [1] proposed a “navigable” SW network for a d-dimensional lattice such that a simple localized routing algorithm can route a message from s to t using O(log² X) hops, where X is the number of nodes in the network.

In this paper, we present a novel navigable SW network based on a hierarchical model. Compared to previous efforts, our study is novel in that (i) our network construction based on a hierarchical model is decentralized, (ii) routing a message between any two nodes in our SW network takes a logarithmic hop count in expectation, (iii) our SW network has a high clustering coefficient, and (iv) the performance of our proposal is mathematically provable. We support these claims through rigorous performance analysis and extensive simulations.

Index Terms— Peer-to-peer systems, small world, overlay networks, tree hierarchy, performance analysis

I. INTRODUCTION

Peer-to-peer (P2P) networks (or overlays) have recently become an active area of research. Applications over P2P networks include information retrieval, content distribution, processor cycle sharing, etc. These applications often demand that their underlying P2P network infrastructures be scalable and robust and have a low diameter. For example, an Internet-scale file sharing system, namely Oceanstore [2], is designed and deployed on top of the P2P network infrastructure Tapestry [3], which guarantees that each node participates in the network using O(log X) connections and that routing a message between any two nodes takes O(log X) hops, where X is the total number of nodes in the system.

One elegant construction for P2P network infrastructures is the implementation of a distributed hash table (DHT). Examples of DHT networks include CAN [4], Chord [5], Pastry [6], Tapestry, etc. These networks are well-structured in that each node picks its neighboring nodes deterministically. In a dynamic environment where nodes come and go frequently, however, a DHT network “may” have more difficulty maintaining its network topological structure than unstructured proposals. In unstructured P2P networks (e.g., Gnutella [7]), nodes interconnect with one another randomly.

Corresponding author. Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan 701, Taiwan, E-mail: [email protected].

Another thread for constructing a P2P network is to follow the small-world (SW) principle [8]. For building an SW network, prior studies by Bollobás et al. [9] and Watts et al. [10] suggested introducing only a few “random” edges into a primitive, regular ring network. One of the major conclusions drawn by [10] is that such a randomized SW network structure exhibits (1) a low diameter and (2) a high clustering coefficient. By diameter, we mean the maximum length of the shortest routes between any two nodes in the network. Given a small k, the clustering coefficient of a node v is

$$
\frac{|E_v^k|}{\binom{|B_v^k|}{2}},
$$

where B_v^k and E_v^k denote the set of v's neighbors within v's k-hop scope and the set of edges appearing in B_v^k, respectively. If the clustering coefficient is high, then the neighboring nodes of v are likely to be interconnected. Since an SW network has a low diameter, the path length for routing a message between any two nodes in the network is small. Additionally, nodes with a high clustering coefficient form strongly connected clusters that cannot easily be disconnected. In particular, since nodes in a P2P network are heterogeneous [11], strongly connected clusters formed by capable and durable peers may further enhance system performance and reliability. Moreover, in contrast to a deterministic DHT network, a randomized SW network admits a whole family of network topologies. This gives an SW network more flexibility in connecting nodes than a deterministic DHT structure.
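The k-hop clustering coefficient above can be computed directly. The Python sketch below is our illustration under stated assumptions: the graph is undirected and given as a dict of adjacency sets, B_v^k excludes v itself, and E_v^k counts edges whose two endpoints both lie in B_v^k. The function names are ours.

```python
import math
from collections import deque
from itertools import combinations

def k_hop_ball(adj, v, k):
    """Return B_v^k: all nodes within k hops of v, excluding v itself (BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == k:
            continue  # do not expand past the k-hop scope
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    del dist[v]
    return set(dist)

def clustering_coefficient(adj, v, k=1):
    """|E_v^k| / C(|B_v^k|, 2) for an undirected graph given as adjacency sets."""
    ball = k_hop_ball(adj, v, k)
    if len(ball) < 2:
        return 0.0
    edges = sum(1 for a, b in combinations(ball, 2) if b in adj[a])
    return edges / math.comb(len(ball), 2)

# Usage: ring of 10 nodes, each linked to its 2 nearest neighbors on either side.
ring = {i: {(i + d) % 10 for d in (-2, -1, 1, 2)} for i in range(10)}
print(clustering_coefficient(ring, 0))  # 0.5
```

For node 0 the 1-hop neighborhood is {8, 9, 1, 2}; three of its six possible pairs (8-9, 9-1, 1-2) are linked, which gives 0.5.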

SW overlay networks with a low diameter and a high clustering coefficient have several applications. For example, Zhang et al. [12] improved the search performance of an overlay network, namely Freenet [13], by clustering the locations of data objects with close key values. Hui et al. suggested replicating popular objects in a cluster of nodes to handle flash crowds [14]. Li et al. implemented a semantic content space over an SW overlay to provide attribute-based search [15]. Iamnitchi et al. proposed taking advantage of the high clustering coefficient of SW networks for rapid information dissemination [16].

Given an SW network, few works have discussed the development of a localized routing algorithm¹ for discovering a short route between any two nodes in the network [1], [17], [18]. This is because an SW network is a randomized network in which no deterministic structure exists to organize the participating peers². Kleinberg et al. were perhaps the first to present a “navigable” SW network based on a d-dimensional lattice substrate [1]. Consider the 2-dimensional lattice-based SW network (i.e., d = 2) presented by Kleinberg et al. A node v connects to its four neighboring nodes in the lattice. v also connects to a long-distance contact node u with probability

$$
P(v \to u) = \frac{l(v,u)^{-2}}{\sum_{w \neq v} l(v,w)^{-2}},
$$

where l(v, u) = |v.x − u.x| + |v.y − u.y| (each node v in the 2-dimensional lattice has lattice coordinates v.x and v.y). In the SW network proposed by Kleinberg et al., routing is done greedily: a node x that receives the message forwards it to the node y, among x's lattice neighbors and long-distance contacts, that is closest to the target node. This routing algorithm takes O(log² X) hops to send a message from a source to a target.

¹ In a localized routing algorithm, nodes depend only on their local knowledge to perform efficient message routing.
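To make the greedy rule concrete, here is a small Python simulation of Kleinberg's 2-dimensional model, written for illustration under our own assumptions (an n × n grid without wrap-around, one long-distance contact per node, exponent 2). It is a sketch, not the implementation from [1]. Because every node keeps its lattice neighbors, each greedy step shortens the lattice distance to the target by at least one, so a route never takes more hops than l(s, t).

```python
import random

def manhattan(a, b):
    """Lattice distance l(v, u) = |v.x - u.x| + |v.y - u.y|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def build_contacts(n, rng, r=2):
    """Give every node of an n x n grid one long-distance contact u, chosen with
    probability proportional to l(v, u)^-r (the inverse-square rule when r = 2)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    contact = {}
    for v in nodes:
        others = [u for u in nodes if u != v]
        weights = [manhattan(v, u) ** -r for u in others]
        contact[v] = rng.choices(others, weights=weights, k=1)[0]
    return contact

def lattice_neighbors(v, n):
    """The (up to) four grid neighbors of v; border nodes have fewer."""
    x, y = v
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < n and 0 <= y + dy < n:
            yield (x + dx, y + dy)

def greedy_route(s, t, n, contact):
    """Forward to whichever neighbor or long-distance contact is closest to t;
    return the number of hops taken."""
    hops, cur = 0, s
    while cur != t:
        candidates = list(lattice_neighbors(cur, n)) + [contact[cur]]
        cur = min(candidates, key=lambda u: manhattan(u, t))
        hops += 1
    return hops

# Usage: average hop count over random pairs on a 20 x 20 grid.
rng = random.Random(42)
contact = build_contacts(20, rng)
pairs = [((rng.randrange(20), rng.randrange(20)), (rng.randrange(20), rng.randrange(20)))
         for _ in range(100)]
print(sum(greedy_route(s, t, 20, contact) for s, t in pairs) / len(pairs))
```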

Studies in [1], [19], [20] presented navigable SW networks in which either the participating nodes maintain global knowledge of the entire network topology to construct their connections, or the nodes cannot join and leave the network freely. In our work, by contrast, we are concerned with constructing a navigable SW P2P network that can operate in a large-scale, dynamic environment. We aim to provide a decentralized protocol for constructing an SW P2P overlay such that nodes depend only on their local knowledge when joining and maintaining the network. A node also depends only on its local knowledge to help relay a message towards a destination. In particular, we present an SW network based on a tree structure. This allows our SW network to incorporate previously proposed optimization techniques for performance enhancement, thus leveraging prior research efforts. For example, by relying on the tree hierarchy embedded in our SW network, it is possible to incorporate the algorithm presented by Xu et al. [21] to exploit physical network locality. Our tree substrate can also be used to implement the proposals by Zhu et al. [22] and Shen et al. [23] for balancing loads among participating peers.

A. Our Contributions

We present a novel navigable SW network embedded with a tree substrate. Since an SW network is a randomized network, we first present a decentralized, randomized tree formation protocol for constructing our base substrate T. Our tree protocol constructs T in a distributed, bottom-up fashion such that T consists of d_T T^{k−1} trees, each T^{k−1} in turn comprises d_T T^{k−2} trees, and so on. Long-distance contacts are then created and maintained by each node v in T to support navigation.

Our contributions are threefold.

• We present a navigable SW network in which participating nodes operate in a localized, decentralized manner. In contrast to the works in [1], [19], [20]³, our SW network can operate in a dynamic environment where participating entities may come and go freely. In addition, unlike the studies in [1], [19], we assume that no global knowledge is available to any participating node. Moreover, in contrast with [20], we do not impose any distribution of communication delays among nodes.

² Unlike SW networks, most DHT networks maintain deterministic structures.

³ Note that although the SW network presented in [19] is conceptually based on a tree, the resultant SW network does not incorporate a physical tree substrate.

• Even in a dynamic environment where participating nodes have only local knowledge, our SW network performs well, with statistical performance guarantees approximating those established by Kleinberg et al. in [19]. In [19], the authors presented a static SW network in which each node contributes O(log² X) connections to the network and routing a message takes O(log X) hops. By contrast, in our design the expected number of connections maintained by any node v is no more than d + ln² X, and routing a message takes O(log_d X) hops in expectation, where d is a constant.

• We evaluate our design through rigorous performance analysis; the performance of our proposal is mathematically provable. We also validate our analytical results with extensive simulations. Compared with prior proposals [5], [24], the simulation results show that our tree-based SW network has a low diameter and a high clustering coefficient.

B. Roadmap

The remainder of the paper is organized as follows. Section II discusses related work. Section III gives an overview of our SW network. The design of our tree formation protocol is detailed in Section IV. We describe the augmentation process and the navigation protocol in Section V. Section VI presents the theoretical analysis of our SW overlay protocol. Section VII presents our simulation study and its results. We conclude our study in Section VIII with possible future research directions.

II. RELATED WORK

Earlier DHT networks (e.g., Tapestry [3], CAN [4], Chord [5], and Pastry [6]) and more recent constant-degree DHTs (e.g., Viceroy [25], Koorde [26], and Cycloid [27]) are not designed based on the SW concept. In contrast, Symphony [24], a randomized DHT network⁴, realizes the SW concept. Symphony can operate in a dynamic environment, and its structured substrate is a ring network. As in most DHTs, each node in Symphony has a unique ID. Any node v on the Symphony ring links to a long-distance contact u with probability 1/(l(u, v) log X), where l(u, v) is the numerical ID difference in the clockwise direction of the ring. In Symphony, each node maintains a constant number of links, k, and a message takes O((1/k) log² X) hops in expectation to reach its destination.
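The harmonic rule can be sampled by inverse-transform sampling: the density 1/(x ln X) on [1/X, 1] has CDF 1 + ln x / ln X, so x = X^(r−1) for r uniform in [0, 1). The Python sketch below is our illustration of that derivation (natural logarithm assumed), not code from [24]; x is the clockwise distance as a fraction of the ring, and turning it into an actual contact (e.g., routing to the node that owns that point) is left out.

```python
import random

def harmonic_distance(num_nodes, rng):
    """Draw a clockwise ring distance x in [1/X, 1) with density 1/(x ln X)."""
    return num_nodes ** (rng.random() - 1.0)

# Usage: about half of the draws fall below X^(-1/2), since P(x < X^(-1/2)) = P(r < 1/2).
rng = random.Random(3)
X = 10_000
draws = [harmonic_distance(X, rng) for _ in range(10_000)]
print(sum(d < X ** -0.5 for d in draws) / len(draws))
```

Short links dominate, but every scale between 1/X and 1 receives roughly equal probability mass per factor of two, which is what keeps greedy routing polylogarithmic.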

Unlike Symphony, (1) our SW network is based on a randomized tree substrate, (2) routing a message in our network takes O(log_d X) hops in expectation, and (3) nodes in our network form strongly connected components, which allows capable peers to be included in the clusters to further optimize system performance and reliability. A detailed performance comparison between Symphony and our proposal is given in Sections VII-D and VII-E.

⁴ In a randomized network (e.g., [28], [29]), some (or all) of the connections among nodes are created randomly. Unlike SW networks, randomized networks do not necessarily have a high clustering coefficient [30].

Duchon et al. also offered the design of an SW P2P network that can operate in a distributed manner [20]. Duchon et al. assumed that the node distribution in the network follows α-power-law latency expansion [31], a realistic network model. That is, for each node v in the network, the number of nodes at latency x from v is no more than βx^α, where α and β are two given positive constants. Duchon et al. targeted a static environment in which nodes do not come and go. In contrast to [20], we impose no assumption on the distribution of nodes in the network, and our design allows nodes to join and leave the system at any time.

Wang et al. [32] analyzed the resilience of structured P2P networks, including Chord [5], CAN [4], and Pastry [6], in terms of the percentage of messages that can be routed to their destinations (namely, the average hit ratio). They concluded that, in a dynamic environment, CAN, which uses a d-dimensional lattice substrate, is less robust than Chord and Pastry in terms of the average hit ratio. This may be due to the deterministic structure of a CAN network. Wang et al. suggested introducing random long-distance contacts to each peer participating in a CAN network. With the help of randomly picked long-distance contacts, not only is the number of hops for routing a message in CAN reduced (i.e., the average routing length becomes O(log² X)), but the resilience of CAN is also improved.

By contrast, our proposal in this paper builds an SW network based on the hierarchical model. As we will discuss later in this section, our SW network can leverage prior efforts (e.g., those presented in [21], [22], [23]) by taking advantage of the embedded tree structure for further optimization.

Merugu et al. [33] provided an extensive simulation study of P2P networks constructed based on the SW principle, validating that message routing in an SW network is efficient. However, their study provided no rigorous analytical results.

Zhu et al. [22] proposed a proximity-aware load-balancing mechanism for peers participating in a DHT network. In their proposal, a tree structure is constructed and maintained over the DHT network; this tree collects load information from participating nodes and performs matching to migrate loads from heavy nodes to light ones.

Shen et al. [23] also provided a design aiming to balance the loads of peers while minimizing the communication cost incurred by moving loads. In this study, we construct an SW network based on a tree substrate. Since studies such as [22] and [23] rely on a tree structure to collect load information and perform load balancing, our SW overlay can naturally be extended with load balancing by incorporating the mechanisms presented by Zhu et al. and Shen et al., so that peers in the network have balanced loads.

Node clustering algorithms (e.g., the proposals in [34], [35], [36]) have been extensively studied. Given a network G = (V, E), a node clustering algorithm partitions the node set V into clusters C_1, C_2, C_3, ⋯, C_l such that C_i ∩ C_j = ∅ for any i ≠ j, and ⋃_{i=1}^{l} C_i = V. In addition, nodes assigned to the same cluster are “similar” according to a predefined similarity function, and nodes in different clusters are dissimilar [35]. If a similarity function (e.g., that in [35]) is defined appropriately, node clustering algorithms may generate a set of disjoint clusters such that nodes in the same cluster are highly connected (i.e., intermediate nodes on a routing path within cluster C_i are also members of C_i). However, to the best of our knowledge, few studies present the construction of SW networks based on node clustering algorithms.

Another important group of studies related to our work (e.g., [37], [38], [39], [40]) finds a dominating set in a given network G = (V, E). A dominating set is a subset of nodes S ⊂ V such that every node v ∈ V − S connects to a node in S.

More generally, an extended dominating set is a subset S of nodes such that every node v ∈ V − S is k-hop reachable from some node u ∈ S. A connected dominating set S is one whose nodes induce a connected subgraph. Clearly, a connected, extended dominating set is a candidate substrate for building an SW overlay network.

This is because a node u ∈ S may serve as the head of a “cluster” in which the non-head members are k-hop reachable from u. Nodes could then interconnect to form short-distance links within a cluster, while long-distance links are constructed among cluster heads. However, it is unclear to us how an SW network can be built with connected dominating sets.
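These definitions can be checked mechanically. The following Python sketch (function names are ours) tests whether S is a connected, k-hop extended dominating set of an undirected graph: a multi-source BFS from S confirms that every node lies within k hops of some node in S, and a BFS restricted to S confirms that S induces a connected subgraph.

```python
from collections import deque

def within_k_hops(adj, sources, k):
    """Multi-source BFS: every node at distance <= k from some node in sources."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        x = queue.popleft()
        if dist[x] == k:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)

def is_connected_k_dominating(adj, S, k):
    """True if every node is within k hops of S and S induces a connected subgraph."""
    S = set(S)
    if not S or within_k_hops(adj, S, k) != set(adj):
        return False
    start = next(iter(S))
    seen, queue = {start}, deque([start])
    while queue:  # BFS that only walks edges between members of S
        x = queue.popleft()
        for y in adj[x]:
            if y in S and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen == S

# Usage: on the path 0-1-2-3-4, the single "head" 2 covers everything within 2 hops.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_connected_k_dominating(path, {2}, 2), is_connected_k_dominating(path, {2}, 1))  # True False
```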

Techniques orthogonal to the SW principle (e.g., the proposals in [21], [41], [42]) offer efficient message routing in P2P networks. For example, to exploit physical network locality, Xu et al. [21] proposed overlaying an extra hierarchical storage structure over CAN [4] such that nodes participating in CAN not only register their network locations with the distributed storage, but also forward their routing messages to geographically nearby nodes discovered from the storage. Studies such as [42] by Ratnasamy et al. rely on landmark nodes to exploit physical network locality. Since overlay links among nodes in an SW network are constructed probabilistically, an SW network may exploit physical network locality by incorporating the techniques presented in [21], [41], [42].

III. PROPOSAL OVERVIEW

Basically, an SW network is a randomized network in which overlay links among nodes are determined probabilistically. Since the substrate of our SW network is a tree network, the tree network must also be randomized. Figure 1 shows our SW network based on a tree substrate T. In our design, a node first participates in a lowest-level “sub-tree” of the tree substrate, denoted by T¹. Disjoint T¹ trees construct a T² tree formed by their representative nodes (i.e., the root nodes of these T¹'s). For example, a and g are the representative nodes of the two distinct T¹'s shown in Figure 1. a and g form a T² tree, and the root of this T², g, is the representative node that forms another tree T³. Note that this T² tree contains only the node set {a, g}, i.e., the root nodes of the respective T¹'s that assemble T². Similarly, the node set of T³ in Figure 1 is {i, g}, the root nodes of the respective T² trees that assemble it. This process proceeds until a T^k is constructed, where k = log_{d_T−1} X, d_T is the maximum number of nodes participating in a Tⁱ (1 ≤ i ≤ k) tree, and X is the maximum total number of nodes in the system. We denote the resultant tree T^k as T.

Fig. 1

A SMALL-WORLD NETWORK BASED ON A TREE SUBSTRATE T (THE SHAPE WITH THICK LINES), WHERE THE LONG-DISTANCE CONTACTS (THE DASHED POINTERS) OF THE NODE a ARE SHOWN

Conceptually, given T, each node a in T creates its long-distance contacts, which a picks probabilistically. In general, a prefers to select long-distance contacts from the nodes in T^j rather than from those in T^l, where j < l. For example, in Figure 1, instead of picking i, node a picks b, d, and j as its long-distance contacts. Each node participating in T performs similar operations to select its long-distance contacts, resulting in an SW network.

In our SW network, each node a routes a message by greedily forwarding it towards its destination. More precisely, each node a has a unique hierarchical label representing its location in the tree; labels are named similarly to IP addresses in the Internet. a picks the long-distance contact x whose label is closest to the label of the message's destination and relays the message to x. If a cannot find any long-distance contact to help relay the message, a forwards the message to the neighboring tree node that is closest to the destination.

In the following sections, we first present the construction and maintenance of our tree substrate in Section IV. Our SW overlay based on the tree substrate and its navigation are then given in Section V.

IV. TREE CONSTRUCTION AND MAINTENANCE

Our SW overlay network relies on a base substrate: the tree hierarchy. Our tree is formed recursively in a hierarchical fashion. The basic element of our tree is a Tⁱ tree (a single peer is represented as a T⁰ tree). A Tⁱ tree is built from at most d_T Tⁱ⁻¹ trees, where 1 ≤ i ≤ k. The resultant tree constructed by our tree protocol is T = T^k. We note that X = (d_T)^k is the maximum number of nodes in T.

Specifically, nodes self-organize when forming a Tⁱ tree. In each Tⁱ tree, the root node, denoted by r, maintains only a single child node, r.chd. This minimizes the degree of the root, since the root will use some of its connections to participate in a Tⁱ⁺¹ tree. In contrast, non-root nodes can use a degree of up to d_T − 1. Once a Tⁱ tree is constructed, its root node proceeds to join a Tⁱ⁺¹ tree. The root may remain a root node of the Tⁱ⁺¹ tree; otherwise, it can connect to no more than d_T nodes in Tⁱ⁺¹. As Section VI-B will make clear, d_T is a soft constraint.

In our tree construction algorithm, each Tⁱ, where 1 ≤ i ≤ k, is built with a unified network construction protocol, which we call the T protocol in the following discussion.

A. T Protocol

We now consider how to form and maintain a tree network Tⁱ, where 1 ≤ i ≤ k. Notably, each Tⁱ is created and maintained by a constant number (i.e., d_T) of peers. Sections IV-A.1 and IV-A.2 detail the T protocol and its maintenance algorithm, respectively.

1) Tree Construction: We first define the following notation.

Definition 1. The numerical difference, diff(v), of a node v with respect to the only child node r.chd of the root r is defined as

$$
\mathrm{diff}(v) \stackrel{\text{def}}{=}
\begin{cases}
\Delta, & \text{if } F(r.chd) \le F(v),\\
R + \Delta, & \text{otherwise},
\end{cases}
\tag{1}
$$

where F can be an arbitrary uniformly random function (e.g., SHA-1 [43] or MD5 [44]) that provides a unique ID (≥ 1) to a node, R is the maximum value that F can return, and Δ = F(v) − F(r.chd).
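Definition 1 translates directly into code. The Python sketch below assumes, for illustration only, that F is SHA-1 over a node's name with 1 added, so IDs fall in [1, R] with R = 2^160; the names F, diff, and R follow the text, while hashing a node name is our assumption.

```python
import hashlib

R = 2 ** 160  # assumed maximum ID: SHA-1 digests shifted from [0, 2^160 - 1] to [1, 2^160]

def F(name: str) -> int:
    """Hypothetical F: SHA-1 of a node's name, plus 1 so that every ID is >= 1."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16) + 1

def diff(f_v: int, f_chd: int, r_max: int = R) -> int:
    """Eq. (1): numerical difference of ID f_v with respect to r.chd's ID f_chd."""
    delta = f_v - f_chd
    return delta if f_chd <= f_v else r_max + delta

# Usage with Figure 2's toy IDs (F(p_i) = i, r.chd = p10) and a toy R = 100.
print(diff(15, 10, 100), diff(50, 10, 100), diff(5, 10, 100))  # 5 40 95
```

The wrap-around branch makes diff a clockwise distance on a ring of size R, so r.chd itself always has the smallest value, 0.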

Consider a node A that newly joins Tⁱ. A first contacts a bootstrap node⁵ that provides an entry point for A. In our design, the entry point is the root node, r, of Tⁱ. The root node r then helps A join by picking a node in the tree uniformly at random. Once a random node, say B, is determined, the following process is performed immediately.

1) If diff(A) < diff(B), B reports its parent node, B.prt, to A. Upon receiving the network address of B.prt, A iteratively continues the join by sending the join request to B.prt. The request is forwarded upward until it reaches an ancestor Q of B (starting with B.prt itself) such that diff(Q) < diff(A). A then connects to Q.

2) Otherwise, if diff(A) ≥ diff(B), A simply connects to B as B's child node.

If B is unavailable (i.e., Tⁱ contains only the root node r), then A becomes the only child node of r in the Tⁱ tree.
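The join procedure can be simulated in memory. The Python sketch below is a hypothetical model of one Tⁱ tree: the root keeps the member set (as Section IV-A.1 describes), B is drawn uniformly from the non-root members, degree limits and failures are ignored, and an optional forced_b argument (ours) fixes B so that the Figure 2 example can be replayed. Because A always attaches under a node with a smaller diff, diff strictly increases along every path down from r.chd.

```python
import random

class Node:
    """A peer in one T^i tree; fid plays the role of F(v)."""
    def __init__(self, fid, parent=None):
        self.fid = fid
        self.parent = parent
        self.children = []

class TreeTi:
    """Hypothetical in-memory model of a single T^i tree (no degree cap, no failures)."""
    def __init__(self, root_fid, r_max):
        self.root = Node(root_fid)
        self.r_max = r_max   # R: the largest ID that F can return
        self.members = []    # non-root node set kept by the root r

    def diff(self, v):
        """Eq. (1) relative to r.chd, the root's only child."""
        delta = v.fid - self.root.children[0].fid
        return delta if delta >= 0 else self.r_max + delta

    def _attach(self, child, parent):
        child.parent = parent
        parent.children.append(child)
        self.members.append(child)

    def join(self, a_fid, rng, forced_b=None):
        """Join node A; forced_b (hypothetical) fixes B to replay a worked example."""
        a = Node(a_fid)
        if not self.members:                    # only r exists: A becomes r.chd
            self._attach(a, self.root)
            return a
        b = forced_b if forced_b is not None else rng.choice(self.members)
        if self.diff(a) < self.diff(b):         # case 1): climb from B.prt
            q = b.parent
            while self.diff(q) >= self.diff(a):
                q = q.parent                    # stops at r.chd at the latest (diff 0)
            self._attach(a, q)
        else:                                   # case 2): become B's child
            self._attach(a, b)
        return a

# Usage: replay Figure 2 with F(p_i) = i; p50's parent is p10, as the text states.
t = TreeTi(root_fid=999, r_max=1000)
rng = random.Random(0)
p10 = t.join(10, rng)
p30 = t.join(30, rng, forced_b=p10)
p50 = t.join(50, rng, forced_b=p10)
p70 = t.join(70, rng, forced_b=p50)
p100 = t.join(100, rng, forced_b=p70)
p15 = t.join(15, rng, forced_b=p50)
print(p15.parent.fid)  # 10
```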

⁵ We adopt a mechanism similar to that of Gnutella [7], which provides a bootstrap node to help a node join. There may be several bootstrap nodes to help nodes join the overlay.


Fig. 2

(A) AN EXAMPLE OF A Tⁱ TREE CONSTRUCTED USING THE PROPOSED ALGORITHM, (B) THE NODE p15 JOINS THE OVERLAY, AND (C) THE OVERLAY THEN INCLUDES ONE EXTRA NODE p15

Figure 2 illustrates an example of our tree-shaped overlay construction. In Figure 2(a), the overlay initially consists of six peers, i.e., {r, p10, p30, p50, p70, p100}, where the non-root nodes are p10, p30, p50, p70, and p100, and p10 is the only child node, r.chd, of r. In this example, F(p_i) = i. When p15 joins the overlay (Figure 2(b)), it first sends its join request to r with the help of a bootstrap node (not shown in Figure 2).

r randomly picks a node (i.e., p50) to help p15 join. Since the numerical difference of p15 is less than that of p50 (i.e., diff(p15) = F(p15) − F(p10) = 5 < diff(p50) = F(p50) − F(p10) = 40), p15 sends its join request to the parent node of p50, which is p10. Because diff(p10) = 0 is less than diff(p15), p15 then links to p10 (Figure 2(c)).

We note the following about our tree formation protocol. First, the root node r of a Tⁱ tree needs to pick a node in Tⁱ uniformly at random. It is possible to rely on a Markov chain Monte Carlo method (see Chapter 10 in [45]) to sample a random node in Tⁱ. However, since a Tⁱ tree consists of a constant number, d_T, of nodes, our implementation instead maintains the node set of Tⁱ at the root node r, so that r can simply pick a random node from this set. To maintain the node set of Tⁱ, each node v in Tⁱ is required to send a liveness message to r, because nodes may leave the system dynamically without informing any node in Tⁱ. Consequently, r can keep track of the set of nodes participating in Tⁱ.

Second, the root node r of a Tⁱ tree registers with the bootstrap node if Tⁱ contains fewer than d_T participants. If Tⁱ contains exactly d_T nodes, then Tⁱ accepts no newly arriving nodes, and r deregisters from the bootstrap node. r may re-register with the bootstrap node if it maintains fewer than d_T nodes because some nodes have left.

Finally, the root node r of Tⁱ joins a Tⁱ⁺¹ tree, using its unique ID and the T protocol discussed in this section. That is, r queries the bootstrap node for the root node r̂ of an available Tⁱ⁺¹ and joins Tⁱ⁺¹ via r̂. If no root node of Tⁱ⁺¹ is registered with the bootstrap node, r becomes the root of a new Tⁱ⁺¹ tree.

2) Network Maintenance: A Tⁱ tree may become fragmented due to node failures or departures. To handle the dynamics of the tree overlay, each node v in Tⁱ periodically pings its parent node v.prt. If v.prt fails to respond, v assumes that v.prt has failed and rejoins the tree with the help of the root node r of Tⁱ, using the T protocol described in Section IV-A.1.

Notably, the root node r of a Tⁱ tree may itself fail, in which case the non-root participants in Tⁱ cannot rejoin with the help of r. These non-root participants then rejoin the system via the bootstrap node as newly arriving peers.

B. Recursive Construction for T

Section IV-A presented the basic algorithm for forming a Tⁱ tree that can consist of up to d_T nodes.

Assume that we have constructed a number of T¹ trees.

To construct a level-2 tree T², the root nodes r of distinct T¹ trees query the bootstrap node for their entry points. This process is identical to the joining of a node into a T¹ tree, except that the candidate entry points that can help these root nodes form their T² tree are the root nodes of T¹ trees. Therefore, our tree formation protocol requires the bootstrap node to additionally label each registered node with its level-ID, from which the bootstrap node identifies the “root level” of a registered node. That is, the root node of a Tⁱ tree is labeled with level-ID i in the bootstrap node. For example, the root node of a level-3 tree T³ has level-ID 3 in the bootstrap node.

If the entire tree network T is a level-k tree T^k, then the above process proceeds until multiple T^{k−1} trees self-organize into a T^k tree. Similarly, the root nodes of these T^{k−1} trees form T^k by consulting the bootstrap node for the locations of roots with level-ID k − 1 as their entry points.
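The bootstrap bookkeeping described in Section IV-A.1 and above (register while a tree has room, deregister once it holds d_T nodes, and look up entry points by level-ID) can be sketched as a small in-memory registry. Everything here, including the class and method names and the random choice among open roots, is a hypothetical Python illustration; the text does not specify how the bootstrap node picks among several registered roots.

```python
import random
from collections import defaultdict

class BootstrapRegistry:
    """Hypothetical bootstrap node: roots of not-yet-full trees, keyed by level-ID."""

    def __init__(self, d_t):
        self.d_t = d_t                    # maximum number of nodes in one T^i tree
        self.by_level = defaultdict(set)

    def update(self, root_id, level, tree_size):
        """Called by the root of a T^level tree whenever its membership changes."""
        if tree_size < self.d_t:
            self.by_level[level].add(root_id)      # register or re-register: room left
        else:
            self.by_level[level].discard(root_id)  # full: accept no newcomers

    def entry_point(self, level, rng):
        """A registered root with this level-ID, or None (the caller starts a new tree)."""
        roots = self.by_level.get(level)
        return rng.choice(sorted(roots)) if roots else None

# Usage: with d_T = 3, a T^1 root holding 2 nodes is an entry point until a third joins.
reg = BootstrapRegistry(d_t=3)
rng = random.Random(0)
reg.update("r1", level=1, tree_size=2)
print(reg.entry_point(1, rng))  # r1
reg.update("r1", level=1, tree_size=3)
print(reg.entry_point(1, rng))  # None
```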

V. AUGMENTATION AND NAVIGATION

Fig. 3

LABELING NODES IN T, WHERE CIRCLE SHAPES REPRESENT NODES, THE NUMBER INSIDE A NODE REPRESENTS THE COORDINATE-LIKE LABEL, AND THE NUMBER BESIDE A NODE INDICATES THE HIERARCHICAL LABEL

In our design, each node v in T has two labels, namely the coordinate-like label and the hierarchical label. Coordinate-like labels are used for creating long-distance contacts, while routing a message depends on hierarchical labels. In our implementation, the parent node, v.prt, of a node v periodically updates the coordinate-like label of each of its child nodes. If v.prt has the coordinate-like label (x, y) and v is the i-th child node of v.prt, then v has the coordinate-like label (x + 1, i). v.prt also assigns h.i as v's hierarchical label if the hierarchical label of v.prt is h. Figure 3 illustrates the idea.
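The labeling rule can be sketched as a traversal from the root. This Python illustration assumes that the tree is given as an ordered child list per node, that child indices start at 1, and that the root's hierarchical label is empty (the text fixes only the root's coordinate-like label, (0, 0)); hierarchical labels are stored as tuples and rendered with dots.

```python
def assign_labels(children, root):
    """children maps a node to its ordered list of child nodes.
    Returns (coord, hier): coordinate-like labels (x, y) and hierarchical labels."""
    coord = {root: (0, 0)}   # the root of T has coordinate-like label (0, 0)
    hier = {root: ()}        # assumption: the root's hierarchical label is empty
    stack = [root]
    while stack:
        parent = stack.pop()
        x, _ = coord[parent]
        for i, child in enumerate(children.get(parent, []), start=1):
            coord[child] = (x + 1, i)          # i-th child of a node at (x, y)
            hier[child] = hier[parent] + (i,)  # parent's label h becomes h.i
            stack.append(child)
    return coord, hier

def dotted(label):
    """Render a hierarchical label such as (2, 1) as '2.1'."""
    return ".".join(str(i) for i in label)

# Usage: root r with children a, b; b has one child c.
coord, hier = assign_labels({"r": ["a", "b"], "b": ["c"]}, "r")
print(coord["c"], dotted(hier["c"]))  # (2, 1) 2.1
```

The coordinate-like label records only depth and sibling position, whereas the hierarchical label records the full path from the root, which is what prefix-based navigation needs.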

A. Augmentation

We augment our tree network T with long-distance contacts.

In addition to tree links, each node v in T creates and maintains an extra set, T_v, that contains at most ln² X long-distance contacts. T_v is defined as follows:

$$
T_v = \left\{ u \;:\; P(u_x = x) = \frac{b^{-(k-x)}}{\ln X},\quad P(u_y = y) = \frac{1}{2b} \right\},
\tag{2}
$$

where (u_x, u_y) is the coordinate-like label of a node u and b = d_T − 1. We note that the root node of T = T^k has the coordinate-like label (0, 0) in our implementation.

The idea is that v is more likely to connect to a node u with a larger value of u_x, i.e., a node u at a lower level of T is more likely to be picked.

Notably, the probability distribution for picking long-distance contacts in Eq. (2) yields an expected path length of O(log_{d_T−1} X) for routing a message. Theorem 6 in Section VI-C presents the details.

Algorithm 1: AUGMENTATION: A node v creates its ln² X long-distance contacts periodically

for c ← 1 to ln² X do
    u_x ← i with probability P(u_x = i) = b^{−(k−i)} / ln X;
    u_y ← j picked uniformly at random in [1, 2b];
    w ← ROUTE(u);
    v links to w;
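A Python sketch of the sampling step in Algorithm 1 follows. As written, the weights b^{−(k−x)}/ln X in Eq. (2) need not sum to one over x = 0, …, k, so this illustration assumes they are normalized (random.choices does this implicitly); it also assumes that u_x ranges over 0..k and it omits ROUTE(u), producing only the target labels.

```python
import math
import random

def sample_target_label(k, b, rng):
    """One target coordinate-like label (u_x, u_y) drawn in the spirit of Eq. (2).
    Assumption: u_x ranges over 0..k with weights b^-(k - x), normalized by choices()."""
    levels = list(range(k + 1))
    weights = [b ** -(k - x) for x in levels]  # deeper levels (larger x) weigh more
    u_x = rng.choices(levels, weights=weights, k=1)[0]
    u_y = rng.randint(1, 2 * b)                # uniform over [1, 2b]
    return u_x, u_y

def augmentation_targets(num_nodes, k, b, rng):
    """Loop of Algorithm 1: floor(ln^2 X) target labels, each later passed to ROUTE(u)."""
    count = int(math.log(num_nodes) ** 2)
    return [sample_target_label(k, b, rng) for _ in range(count)]

# Usage: X = 10,000 nodes, k = 5 levels, d_T = 8 (so b = 7).
targets = augmentation_targets(10_000, k=5, b=7, rng=random.Random(1))
print(len(targets), targets[:3])
```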

Algorithm 1 presents the details of constructing and maintaining long-distance contacts for any node v. Note that in Algorithm 1, ROUTE(u) works as follows. Given a coordinate-like label (u_x, u_y), a node a in our augmented tree-shaped network that receives the message greedily forwards it towards (u_x, u_y), to the node f among the nodes in T_a that minimizes |f_x − u_x|. Otherwise (i.e., if |f_x − u_x| cannot be reduced further), a relays the message to the node f whose f_y minimizes |f_y − u_y|. A route stops at a node w if w cannot route the message further, that is, if w cannot find any node in T_w that further reduces the numerical distance to (u_x, u_y). If a has multiple candidates, say f and f′, for forwarding the message, it randomly picks one of them to relay the message.
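The text leaves open exactly how the two greedy criteria interact; one reading is a lexicographic comparison, first on |f_x − u_x| and then on |f_y − u_y|. The Python sketch below implements that reading (our interpretation, not necessarily the authors'): it returns the next hop among the candidate labels, breaking exact ties at random, or None when no candidate improves on the current node, which is where the route stops.

```python
import random

def route_next_hop(current, candidates, target, rng):
    """One greedy ROUTE step toward target = (u_x, u_y); labels are (x, y) tuples.
    Interpretation (ours): compare (|f_x - u_x|, |f_y - u_y|) lexicographically."""
    def key(label):
        return (abs(label[0] - target[0]), abs(label[1] - target[1]))

    best = min((key(f) for f in candidates), default=None)
    if best is None or best >= key(current):
        return None                      # no progress possible: the route stops here
    ties = [f for f in candidates if key(f) == best]
    return rng.choice(ties)              # several equally good f, f': pick one at random

# Usage: from (1, 1) toward (3, 2); (3, 7) wins because it fixes the x-coordinate.
rng = random.Random(0)
print(route_next_hop((1, 1), [(2, 5), (3, 7), (1, 2)], (3, 2), rng))  # (3, 7)
```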

B. Navigation

Given a destination node t with hierarchical label l_1.l_2.⋯.l_i, a node v in our augmented tree-shaped network that receives a message greedily forwards it towards t, to the node u among the nodes in T_v such that

$$
u = \arg\max_{w \in T_v \,:\, |\mathrm{prefix}(t, w)| \ge \ell} |\mathrm{prefix}(t, w)|,
\tag{3}
$$

where prefix(t, w) returns the maximal common prefix string of t and w, and ℓ is initialized to zero and updated at every hop by setting ℓ = |u|. The forwarding process stops when the hierarchical label of a forwarding node matches the target's. Notably, if v cannot find a qualified node u, v relays the message along the links of the tree network towards the destination. The details are given in Algorithm 2.

Algorithm 2: NAVIGATION(t): A node v relays a route message towards a node t

if there exists $u = \arg\max_{w \in T_v:\ |\mathrm{prefix}(t, w)| \ge \ell} |\mathrm{prefix}(t, w)|$ then
    v relays the message to u;
    ℓ ← |u|;
else
    v forwards the message along the tree network towards the message's destination;
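One hop of Algorithm 2 can be sketched as below. This is an illustrative sketch, not the authors' code: hierarchical labels $l_1.l_2.\cdots.l_i$ are stored as tuples, |u| is assumed to mean the length of u's common prefix with t, ties among equally good neighbors are broken by list order, and a return value of `None` stands for the fallback to tree links. The names `prefix_len` and `navigation_step` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Label = Tuple[int, ...]  # hierarchical label l1.l2. ... .li stored as a tuple


def prefix_len(t: Label, w: Label) -> int:
    """Length of the longest common prefix of labels t and w."""
    n = 0
    for a, b in zip(t, w):
        if a != b:
            break
        n += 1
    return n


def navigation_step(t: Label, neighbors: Sequence[Label],
                    ell: int) -> Tuple[Optional[Label], int]:
    """One hop of Algorithm 2 at node v (sketch).

    Returns (u, new_ell) when some neighbor in T_v shares a prefix of length
    >= ell with t, else (None, ell) to signal the fallback to tree links.
    Assumption: |u| is read as prefix_len(t, u).
    """
    qualified = [w for w in neighbors if prefix_len(t, w) >= ell]
    if not qualified:
        return None, ell
    u = max(qualified, key=lambda w: prefix_len(t, w))  # first maximum wins ties
    return u, prefix_len(t, u)
```

For t = (1, 2, 3), neighbors [(1, 5), (1, 2, 7), (4,)], and ℓ = 0, the step returns ((1, 2, 7), 2); a later node whose neighbors are only [(1, 5), (4,)] then gets (None, 2) and must use the tree links.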

VI. THEORETICAL ANALYSIS

We provide the theoretical performance analysis for our T protocol in Section VI-A. Section VI-B discusses the performance of T. We present the analysis of routing delay in Section VI-C, and our protocol overheads are given in Section VI-D. We also perform a simulation study, whose results are discussed in Section VII.

We briefly summarize the major results of this section as follows. The height of our $T^i$ tree is O(ln N) w.h.p.⁶ (see Theorem 2 in Section VI-A), where N + 1 is the number of nodes participating in a $T^i$ tree. Since $T = T^k$ is recursively constructed with $T^1, T^2, \ldots, T^{k-1}$, we show that the diameter of the resultant tree network T is 2 ln X in expectation (Corollary 3 in Section VI-B), where X is the maximum number of peers in T. In addition, the maximum degree of a peer joining T is $d^T$ in expectation (Theorem 4 in Section VI-B). Together, Corollary 3 and Theorem 4 yield an expected routing length of $O(\log_{d^T-1} X)$ for sending a message (Theorem 6 in Section VI-C).

We also investigate whether the bootstrap node becomes the performance bottleneck of our system. Theorem 5 in Section VI-B concludes that the number of root nodes registering with the bootstrap peer is no more than $O(\log_{d^T} X)$ in expectation.

A. Performance of T

It is sufficient to consider the sub-tree Γ = (V, E) rooted at r.chd in $T^i$. Recent measurement studies [46], [47] of real P2P systems (namely, Gnutella [7] and Napster [48]) provided evidence that peer lifetimes are approximated reasonably well by the exponential distribution [49]. In the following analysis, we assume that the system follows the M/M/∞ queuing model, in which peers arrive according to a Poisson process with rate λ and peer lifetimes are independent and exponentially distributed with parameter μ. The number of peers in the system at time t is denoted by M(t). We also assess the load of the bootstrap node in this section.

⁶ In this paper, w.h.p. stands for “with high probability”, i.e., with probability no less than $1 - N^{-\Omega(1)}$, where N is the problem size.

Theorem 1. The number of peers in the system at time t is O(E[M(t)]) w.h.p.

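Proof. Since

$$P\{M(t) = j\} = e^{-\lambda t p}\,\frac{(\lambda t p)^j}{j!}, \qquad \text{where } p = \int_0^t \frac{e^{-\mu(t-x)}}{t}\,dx$$

due to the uniformity of the arrival time in [0, t], M(t) follows a Poisson distribution with parameter λtp (see page 218 in [45]). That is, E[M(t)] = λtp. Let E[M(t)] = N. By the Chernoff tail bound (page 97 in [45]), we have

$$P\left\{M(t) \ge \tfrac{3}{2}N\right\} \le \frac{e^{-N}(eN)^{\frac{3}{2}N}}{\left(\frac{3}{2}N\right)^{\frac{3}{2}N}} = \left(\frac{8e}{27}\right)^{N/2}$$

and

$$P\left\{M(t) \le \tfrac{1}{2}N\right\} \le \frac{e^{-N}(eN)^{\frac{1}{2}N}}{\left(\frac{1}{2}N\right)^{\frac{1}{2}N}} = \left(\frac{e}{2}\right)^{-N/2}.$$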

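Combining the two bounds,

$$P\left\{|M(t) - N| \ge \tfrac{1}{2}N\right\} \le \left(\frac{8e}{27}\right)^{N/2} + \left(\frac{2}{e}\right)^{N/2} < 2\left(\frac{8e}{27}\right)^{N/2},$$

which decays exponentially in N. The proof thus follows.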
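The two Chernoff bounds used above can be compared against exact Poisson tails. The following is a small numerical sketch using only the Python standard library; the values of N are arbitrary illustrations, and the helper names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P{M = k} for M ~ Poisson(mean), computed in log space to avoid overflow."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def upper_tail(mean: float, x: float) -> float:
    """P{M >= x}, summing enough terms past x for the remainder to be negligible."""
    start = math.ceil(x)
    stop = start + int(20 * math.sqrt(mean)) + 200
    return sum(poisson_pmf(k, mean) for k in range(start, stop))


def lower_tail(mean: float, x: float) -> float:
    """P{M <= x}."""
    return sum(poisson_pmf(k, mean) for k in range(0, math.floor(x) + 1))


def chernoff_upper(n: float) -> float:
    """Bound on P{M >= 3N/2}: (8e/27)^(N/2)."""
    return (8 * math.e / 27) ** (n / 2)


def chernoff_lower(n: float) -> float:
    """Bound on P{M <= N/2}: (e/2)^(-N/2) = (2/e)^(N/2)."""
    return (2 / math.e) ** (n / 2)


for n in (4, 10, 20, 50):
    print(n, upper_tail(n, 1.5 * n), chernoff_upper(n),
          lower_tail(n, 0.5 * n), chernoff_lower(n))
```

Both exact tails stay below their bounds, and all four columns shrink geometrically as N grows, which is what the w.h.p. statement of Theorem 1 relies on.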

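Corollary 1. Let λ/μ = N. If t ≥ N/μ, then E[M(t)] = Θ(N).

Proof. Since

$$E[M(t)] = \lambda t p = \lambda t \int_0^t \frac{e^{-\mu(t-x)}}{t}\,dx = \lambda t \cdot \frac{1}{\mu t}\left(1 - e^{-\mu t}\right) = \frac{\lambda}{\mu}\left(1 - e^{-\mu t}\right),$$

if t ≥ N/μ, then $\frac{\lambda}{\mu}\left(1 - e^{-N}\right) \le E[M(t)] < \frac{\lambda}{\mu}$. The proof follows.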

Theorem 1 states that, w.h.p., the number of nodes in the system at any time t is O(E[M(t)]). Corollary 1 shows that if t ≥ N/μ, then the expected number of nodes in the system is Θ(N). Therefore, in the following we consider Γ operating at t > cN for some constant c, and denote the number of peers in Γ at time t by N.
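The closed form $E[M(t)] = \frac{\lambda}{\mu}(1 - e^{-\mu t})$ from Corollary 1 can be sanity-checked by direct simulation of the M/M/∞ model assumed above. This is a minimal Monte Carlo sketch; the parameter values are arbitrary illustrations rather than values from this study, and the function names are hypothetical.

```python
import math
import random


def expected_peers(lam: float, mu: float, t: float) -> float:
    """E[M(t)] = (lam / mu) * (1 - exp(-mu * t)), as derived in Corollary 1."""
    return (lam / mu) * (1.0 - math.exp(-mu * t))


def simulate_peers(lam: float, mu: float, t: float, rng: random.Random) -> int:
    """Draw one sample of M(t) for an M/M/inf system that is empty at time 0.

    Peers arrive as a Poisson process with rate lam; each stays for an
    independent Exp(mu) lifetime. Count the peers still present at time t.
    """
    alive = 0
    clock = rng.expovariate(lam)  # time of the first arrival
    while clock <= t:
        if clock + rng.expovariate(mu) > t:
            alive += 1
        clock += rng.expovariate(lam)  # exponential inter-arrival gap
    return alive


if __name__ == "__main__":
    lam, mu = 50.0, 0.5  # N = lam / mu = 100 (illustrative values)
    t = 20.0             # mu * t = 10, so exp(-mu * t) is already negligible
    rng = random.Random(1)
    trials = 200
    mean = sum(simulate_peers(lam, mu, t, rng) for _ in range(trials)) / trials
    print(f"simulated mean {mean:.1f}, formula {expected_peers(lam, mu, t):.2f}")
```

Because M(t) is Poisson with mean close to N, the sample mean should land within a few units of λ/μ once μt is moderately large.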

Lemma 1. If an overlay Γ is constructed using the T protocol, then Γ is cycle-free.

Proof. Consider a cycle, denoted by $p = a_0 a_1 a_2 \cdots a_{n-1} a_0$ in Γ, where $a_0 = r.chd$ and $\{a_1, a_2, \ldots, a_{n-1}\} \subseteq V - \{r.chd\}$.

We consider the following two cases. (1) p is a cycle because two paths $p_1$ and $p_2$ that share the endpoint $a_0$ join at a node, say $a_i$ (1 ≤ i ≤ n − 1). However, this is impossible since, by definition, each non-root node in Γ has only one parent node.

(2) p forms a cycle even though no two paths with r.chd as their endpoint cross. If so, it can be shown that $\mathrm{diff}(a_0) < \mathrm{diff}(a_1) < \mathrm{diff}(a_2) < \cdots < \mathrm{diff}(a_0)$. This is a contradiction, and the proof follows.
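Both cases of the proof can be mirrored in a small check on a parent-pointer map. The sketch below is illustrative only: `has_cycle` follows each node's unique parent (case (1) relies on this single-parent property), and `strictly_monotone` tests whether a hypothetical per-node key, standing in for diff, strictly increases from every node to its parent. If it does, no closed walk can exist, because the keys along it would have to satisfy key(a_0) < ... < key(a_0), exactly the contradiction in case (2). The direction of increase is an assumption made for this illustration.

```python
from typing import Dict, Hashable, Mapping, Optional

Parent = Dict[Hashable, Optional[Hashable]]  # parent[v] is None for the root r.chd


def has_cycle(parent: Parent) -> bool:
    """True if following parent pointers from some node revisits a node."""
    done = set()  # nodes already known to reach the root
    for start in parent:
        path = set()
        v = start
        while v is not None and v not in done:
            if v in path:
                return True  # walked back into the current path: a cycle
            path.add(v)
            v = parent[v]
        done |= path
    return False


def strictly_monotone(parent: Parent, key: Mapping[Hashable, float]) -> bool:
    """True if key strictly increases from every node to its parent (hypothetical potential)."""
    return all(p is None or key[v] < key[p] for v, p in parent.items())
```

For a tree such as {"r": None, "a": "r", "b": "a", "c": "a"}, `has_cycle` returns False, while a map in which a, b, and c point at each other in a loop returns True.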

Remark 1. If an overlay Γ is constructed using the T protocol, then a node joining Γ will visit nodes on no more than one path with the root node as the endpoint.

Theorem 2. If Γ = (V, E) implements the T protocol and |V| = N, then the height of Γ is ln N + O(1) in expectation.

The proof of Theorem 2 is technically involved and lengthy; we refer readers to Appendix I for the details.

Theorem 3. Assume Γ has N nodes, and denote the height of Γ by the random variable $S_N$. Then $S_N = O(\ln N)$ with probability no less than $1 - O(N^{-1})$.

The details of the proof for Theorem 3 are given in Appendix II.

Lemma 1 and Remark 1 state that any node a takes a finite number of hops to join the overlay, and that the nodes helping a to join lie on a single path with r.chd as the endpoint.

Theorems 2 and 3 show that any path with r.chd as the endpoint has O(ln N) hops w.h.p. We thus conclude as follows.

Corollary 2. If Γ, rooted at r.chd with N nodes, is constructed with the T protocol, then a newly joining node takes O(ln N) hops w.h.p. to join Γ. Clearly, the $T^i$ associated with Γ has a height of O(ln N) + 1 = O(ln N).

B. Performance of T

As discussed earlier, $T = T^k$. In this section, we analyze the performance of T in terms of the degree $d^T$ of any node v in T and the height of T.

Theorem 4. Assume that each node v in T initially has degree $\hat{d}$ to form $T^i$, where 1 ≤ i ≤ k. Then $E[d^T] = \hat{d} + O(1)$, and $d^T \le 2\hat{d}$ with probability no less than $1 - \hat{d}^{-3}$.

To simplify notation, we write $\hat{d} = d^T$ in the following discussion. As mentioned in Section IV, $d^T$ is a soft constraint, and Theorem 4 bounds the degree that a node needs to contribute to the network.

Theorem 5. Assume that constructing a $\hat{d}$-node $T^i$ tree takes $t(\hat{d})$ time units, where 1 ≤ i ≤ k. If $E[t(\hat{d})] \le \frac{\hat{d}}{2(\lambda+\mu)}$ (a bound that is approximately $\frac{\hat{d}}{2\lambda}$ when $\lambda \gg \mu$), then the number of registry nodes at the bootstrap node is less than k in expectation and no more than $k^2 + O(k)$ with probability $1 - k^{-4} + o(1)$.

The proofs of Theorems 4 and 5 are lengthy and are given in Appendices III and IV, respectively.

Corollary 3. Let $X = \hat{d}^k$ be the total number of nodes in T. Then the diameter of T is $D_T = 2 \ln X$ in expectation, and with probability no less than $1 - O(\hat{d}^{-1})$, $D_T$ is no more than 12 ln X.

Proof. The diameter $D_T$ of T is the length of the path from a leaf node a in $T^1$ to a leaf node b in another $T^1$, passing through the root node r of $T^i$ for i = 1, 2, ..., k. Therefore, $D_T = 2k \ln \hat{d} = 2 \log_{\hat{d}} X \cdot \ln \hat{d} = 2 \ln X$. By the proof in Theorem 3,