Chapter 3. Self-Organizing P2P Network Architecture
3.3. Semantic Search
When performing a semantic search, the user needs to provide both a range (a sphere with a center and a radius) in the semantic space and the number of documents to be returned.
In addition, the maximum number of hops (i.e. TTL value) should also be given. The search procedure will continue collecting documents that fall into the given range until either the required number of documents has been collected, or the TTL is reached. Clearly, the quality of the query depends on the document distribution and the parameters provided by the user.
In practice, however, the user may provide only the center, and the system can fill in the other parameters properly, possibly after conducting several “sampling” queries to gain sufficient knowledge about the distribution of documents around the center.
Below we describe the search procedure in more detail.
Fig. 3-2. Peer positioning using mediums (legend: document, peer median)
Fig. 3-3. The flow chart of the request procedure
As shown in Fig. 3-3, the procedure mainly consists of six steps:
Put document’s information into result set. When a query is initiated, it carries an empty result set that holds the documents collected so far, together with the peers that contain them. The result set has an upper bound on the number of documents it can contain, which we can conveniently set to the number of documents required by the query. When a peer receives a query, it first looks up its own document link table and decides whether to put any related documents it finds into the result set. If the result set is full, the peer can replace documents in the result set with better ones from its document link table. If the result set is not full, there are two choices. In the first, a document is put into the result set only if it falls within the query region. In the second, the result set is simply filled with as many documents as possible, following the same replacement strategy as above. We focus on the latter approach because it propagates more document and peer information along the query path, allowing more peers to learn the network topology and their inter-relations more quickly.
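The bounded result set with replacement described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `ResultSet`, the method `offer`, and the choice of "better" as "closer to the query center" are hypothetical and not taken from the original design.

```python
import math
from dataclasses import dataclass, field


def distance(a, b):
    """Euclidean distance between two equal-length semantic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class ResultSet:
    center: tuple   # query center in the semantic space
    limit: int      # upper bound, set to the number of documents requested
    entries: list = field(default_factory=list)  # (doc_id, peer_id, vector)

    def offer(self, doc_id, peer_id, vector):
        """Add a document, or replace the worst entry if the set is full.

        Assumption: a document is "better" when it is closer to the center.
        Returns True if the document ended up in the result set.
        """
        if any(entry[0] == doc_id for entry in self.entries):
            return False  # already recorded
        if len(self.entries) < self.limit:
            self.entries.append((doc_id, peer_id, vector))
            return True
        worst = max(self.entries, key=lambda e: distance(e[2], self.center))
        if distance(vector, self.center) < distance(worst[2], self.center):
            self.entries.remove(worst)
            self.entries.append((doc_id, peer_id, vector))
            return True
        return False
```

With this sketch, a peer would call `offer` once for each entry of its document link table, so the result set keeps carrying the closest documents (and their peers) seen so far along the query path.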
Replace Document Procedure (RDP). This procedure replaces a document link in the document table whenever a peer gains knowledge about a new document in the network. Often the knowledge comes from a passing query (more precisely, from its result set). RDP is performed to make the documents of a peer more focused. However, RDP does not drop a document link when its document is cached in the peer, to prevent the document from being dropped from the network altogether. In RDP, if the document table is not full, the new document is simply added to the document table. If the document table is full, RDP calculates the distance (similarity) between the peer and the document, where the distance between two semantic vectors is computed using Euclidean distance:
$$d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_k - q_k)^2} \qquad (3)$$
RDP then compares this distance with that of dmax, i.e. the document in the document table that is most distant from the peer.
If the new document is closer than dmax, RDP replaces dmax with the new document. Algorithm 3-1 shows the pseudo code of RDP:
Algorithm 3-1: Algorithm for RDP
RDP(Peer p, Document d):
  if document table isn’t full then
    add d to document table
  ...
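As a rough illustration of the RDP description above, the following sketch fills in the replacement branch from the prose. The function name `rdp`, the dictionary-based document table, and the `cached` set are hypothetical. Restricting the search for dmax to non-cached documents is one possible reading of the rule that links to cached documents are never dropped.

```python
import math


def distance(p, q):
    """Euclidean distance between two k-dimensional semantic vectors (Eq. 3)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def rdp(peer_pos, doc_table, capacity, cached, new_id, new_vec):
    """Replace Document Procedure on doc_table (dict: doc_id -> vector).

    `cached` is the set of doc_ids stored locally; their links are kept.
    Returns the id of the evicted document, or None if nothing was evicted.
    """
    if new_id in doc_table:
        return None
    if len(doc_table) < capacity:
        doc_table[new_id] = new_vec  # table not full: simply add the link
        return None
    # Assumption: only links to non-cached documents may be evicted.
    evictable = [d for d in doc_table if d not in cached]
    if not evictable:
        return None
    d_max = max(evictable, key=lambda d: distance(peer_pos, doc_table[d]))
    if distance(peer_pos, new_vec) < distance(peer_pos, doc_table[d_max]):
        del doc_table[d_max]
        doc_table[new_id] = new_vec
        return d_max
    return None
```

Because only documents closer than dmax are admitted, repeated calls pull the document table toward the peer's own position, which is the "more focused" effect described in the text.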
Replace Routing Procedure (RRP). This procedure updates and replaces routing table entries in a way similar to RDP. Here, too, a peer gains knowledge about a new peer from a passing query. This implies that whenever a peer receives and/or redirects a query, it attempts to “inject” information about itself, including its current position. However, if we only keep nearby peers in the routing table, the network may gradually evolve into disjoint clusters. To avoid this, we divide the routing table into a short-range link set and a long-range link set. The replacement of short-range links is similar to RDP. In contrast, when the new peer under consideration falls outside the short-range link set, RRP checks whether adding the new peer can help the peer “expand” its horizon. This is done by computing, for each peer in the long-range link set as well as for the new peer, the total distance between it and the rest, and then keeping the combination of peers that has the maximum total distance. Of course, if the original long-range link set has the maximum total distance, no replacement is done. Algorithm 3-2 shows the RRP pseudo code:
Algorithm 3-2: Algorithm for RRP
RRP(Peer p, ResultSet rs):
for each peer np in rs
  if np exists in routing table then
    update np’s information in the routing table
  ...
      if ... replacing peer l with peer np then
        replace peer l with peer np
      end if
    end if
  end if
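The long-range replacement rule can be illustrated with the sketch below. It assumes one particular reading of the description: each candidate combination is obtained by swapping a single existing long-range link for the new peer, and the "total distance" of a combination is the sum of pairwise distances among its members. The function names and the dictionary layout are hypothetical.

```python
import math
from itertools import combinations


def distance(p, q):
    """Euclidean distance between two semantic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def total_distance(positions):
    """Sum of pairwise distances among a collection of peer positions."""
    return sum(distance(a, b) for a, b in combinations(positions, 2))


def replace_long_range(long_links, np_id, np_pos):
    """Return the long-range link set (dict: peer_id -> position) to keep
    after considering new peer np. The input dict is left unmodified."""
    best = dict(long_links)
    best_total = total_distance(best.values())
    for l in long_links:
        # Candidate combination: drop link l, add the new peer np.
        candidate = {k: v for k, v in long_links.items() if k != l}
        candidate[np_id] = np_pos
        total = total_distance(candidate.values())
        if total > best_total:  # strict: keep the original set on ties
            best, best_total = candidate, total
    return best
```

Under this reading, a new peer far away from the existing long-range links tends to displace one of the links that sit close together, while a new peer in the middle of them leaves the set unchanged.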
Update own information. This procedure re-calculates the peer’s semantic vector, because the peer’s position may change after RDP and RRP are performed.
Put neighbor information into candidate queue. In the request procedure, the search follows a greedy path: the request message is always forwarded to the peer that has not yet been visited and whose position is closest to the query. To support this, the request message uses a candidate queue to store information about the top k peers closest to the query.
Forward the message. The search procedure stops if the result set has accumulated the required number of documents, or if the query has traversed the TTL hops given in the query. In either case the query is sent back along the request path. Otherwise, the peer forwards the query to the peer in its routing table that is closest to the query. Algorithm 3-3 shows the complete pseudo code for the request procedure:
Algorithm 3-3: Algorithm for search procedure
search(Peer p, Query Q):
Q.RS: result set, Q.CQ: candidate queue, Q.TTL: TTL, Q.S: similarity threshold, Q.TS: travel stack
  for all d in the document table do
  ...
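To make the greedy forwarding with a candidate queue more concrete, here is a simplified, single-process sketch. It is not the complete request procedure: RDP, RRP and the position update are omitted, only documents inside the query sphere are collected (the first of the two choices discussed earlier), and the network is modelled as a plain dictionary. The function name `search` mirrors the pseudo code, but its parameters and the data layout are assumptions made for illustration.

```python
import math


def distance(p, q):
    """Euclidean distance between two semantic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def search(peers, start, center, radius, wanted, ttl, k=3):
    """Greedy semantic search over `peers`, a dict of the form
    peer_id -> {"pos": vector, "docs": {doc_id: vector}, "links": [peer_id]}.
    Returns (hits, path).
    """
    results = {}      # doc_id -> (vector, peer_id); plays the role of Q.RS
    candidates = {}   # peer_id -> position; plays the role of Q.CQ
    path = []         # visited peers in order; plays the role of Q.TS
    current = start
    while current is not None:
        path.append(current)
        peer = peers[current]
        # Collect local documents that fall inside the query sphere.
        for doc_id, vec in peer["docs"].items():
            if distance(vec, center) <= radius:
                results.setdefault(doc_id, (vec, current))
        # Stop when enough documents are found or TTL hops were traversed.
        if len(results) >= wanted or len(path) > ttl:
            break
        # Remember the k unvisited neighbours closest to the query.
        for n in peer["links"]:
            if n not in path:
                candidates[n] = peers[n]["pos"]
        ranked = sorted(candidates, key=lambda n: distance(candidates[n], center))
        candidates = {n: candidates[n] for n in ranked[:k]}
        # Forward greedily to the closest candidate, if there is one.
        current = ranked[0] if ranked else None
        candidates.pop(current, None)
    hits = sorted(results, key=lambda d: distance(results[d][0], center))[:wanted]
    return hits, path
```

The returned `path` corresponds to the route along which the query would travel back to the source in the response procedure.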
To reduce the maintenance overhead, we make use of the query as much as possible by packing document and peer information into it. This also applies when the query is returned along the search path towards the source: all the peers along the path can still learn about new documents and other peers and update themselves accordingly. Fig. 3-4 shows the steps performed when a query is returned:
Fig. 3-4. The flow chart of the response procedure
When peer p receives the returning query, it first executes RDP and RRP again and re-calculates its own position. Before sending the response message further back along the search path, the peer again updates its own entry in the query (mostly in the travel stack carried by the query).
Algorithm 3-4 shows the pseudo code for the response procedure:
Algorithm 3-4: Algorithm for response procedure
response(Peer p, Query Q):