De-centralized Environment - A Fast Algorithm for Continuous Frequent k-n-Match Search

In a centralized environment, the centralized server must handle all queries from the clients.

When clients issue a large volume of queries and data sources produce a large volume of points, the centralized server spends much of its computation on them. If other servers with the same computation power as the centralized server are available, we can use them to compute partial answers and send these partial answers to the centralized server, which computes the final answers. In this way, we decrease the number of points that each server has to handle and balance the workload across servers. Consider a de-centralized environment with m + 1 servers:

N_1, N_2, ..., N_m, together with a centralized server N_0. Every server N_i (0 < i ≤ m) holds the data points {P_{1,i}, P_{2,i}, ..., P_{l_i,i}}, where l_i is the number of points in N_i. In the de-centralized environment, when a client registers a frequent k-n-match query <Q, [n0, n1], k> with N_0, N_0 assigns the query a unique number qid and broadcasts (qid, <Q, [n0, n1], k>) to every server N_i (0 < i ≤ m). Unlike in the centralized environment, N_i does not first send the information of its points to the centralized server N_0. Instead, for each query point, N_i performs CFKNMatchAD to find the points that have a chance of becoming final answers. In traditional similarity search algorithms, whether a point is an answer the user wants is determined by a score.

In that setting, it can be guaranteed that the global top k points are also in the set of the local top k points. However, in frequent k-n-match search, an answer point is determined by the number of times it appears in S_{n0}, ..., S_{n1}. It is possible that a point that is not among the local top k points is in the global S_{n0}, ..., S_{n1}. If N_i (0 < i ≤ m) performed CFKNMatchAD to find only its local top k points and sent them to N_0, N_0 might miss some points that should be in the global S_{n0}, ..., S_{n1}, which would make the appearance counts of points in the global S_{n0}, ..., S_{n1} incorrect. To avoid this problem, we let N_i perform CFKNMatchAD and then send every point {P'_{1,i}, P'_{2,i}, ..., P'_{a_i,i}} that appears in the local S_{n0}, ..., S_{n1}, together with its attributes and the corresponding query number qid, to N_0, where a_i is the number of points in the local S_{n0}, ..., S_{n1}. We prove below that a point that does not appear in the top k positions of the local S_{n0}, ..., S_{n1} will not appear in the top k positions of the global S_{n0}, ..., S_{n1}. Therefore, sending only the points that appear in the local S_{n0}, ..., S_{n1} is enough to ensure that the final answer is correct. N_0 then receives the points with the same qid from every N_i (0 < i ≤ m) and performs FKNMatchAD to find the global top k points.
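The exchange between the data servers and N_0 can be sketched as follows. This is a brute-force illustration only, not CFKNMatchAD or FKNMatchAD themselves. It assumes the usual definition of the n-match difference (the n-th smallest per-dimension absolute difference between a point and Q) and breaks ties by point id; the function names and the sample points are hypothetical.

```python
from collections import Counter


def n_match_difference(point, query, n):
    """Assumed definition: n-th smallest per-dimension absolute difference."""
    return sorted(abs(p - q) for p, q in zip(point, query))[n - 1]


def match_sets(points, query, n0, n1, k):
    """S_n for each n in [n0, n1]: ids of the k points with the smallest
    n-match difference (ties broken by id so the order is total)."""
    sets = {}
    for n in range(n0, n1 + 1):
        ranked = sorted(
            points,
            key=lambda pid: (n_match_difference(points[pid], query, n), pid),
        )
        sets[n] = ranked[:k]
    return sets


def frequent_k_n_match(points, query, n0, n1, k):
    """The k ids that appear most often in S_n0, ..., S_n1."""
    counts = Counter()
    for ids in match_sets(points, query, n0, n1, k).values():
        counts.update(ids)
    return sorted(counts, key=lambda pid: (-counts[pid], pid))[:k]


def local_candidates(points, query, n0, n1, k):
    """What a data server N_i sends: every point in any local S_n."""
    ids = set()
    for s in match_sets(points, query, n0, n1, k).values():
        ids.update(s)
    return {pid: points[pid] for pid in ids}


def central_merge(batches, query, n0, n1, k):
    """What N_0 does with the batches received for one qid."""
    merged = {}
    for batch in batches:
        merged.update(batch)
    return frequent_k_n_match(merged, query, n0, n1, k)


# Hypothetical data: two data servers, three points each.
servers = [
    {"a": (1, 9, 9), "b": (2, 2, 8), "e": (9, 9, 9)},
    {"c": (3, 3, 3), "d": (0, 7, 7), "f": (8, 8, 8)},
]
query, n0, n1, k = (0, 0, 0), 1, 3, 2

all_points = {pid: p for s in servers for pid, p in s.items()}
centralized = frequent_k_n_match(all_points, query, n0, n1, k)
batches = [local_candidates(s, query, n0, n1, k) for s in servers]
decentralized = central_merge(batches, query, n0, n1, k)
```

In this sample, each server drops one point before sending, and N_0 still reaches the same answer as a single server holding all six points.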

In N_i (0 < i ≤ m), the attributes of {P_{1,i}, P_{2,i}, ..., P_{l_i,i}} have their safe regions. When an attribute of these points fluctuates, N_i performs CFKNMatchAD and, if the points in the local S_{n0}, ..., S_{n1} change, sends the points that appear in the local S_{n0}, ..., S_{n1} to N_0. N_0 receives the update from N_i and performs FKNMatchAD to find the new global top k points. The architecture of the de-centralized environment is shown in Figure 4.2.
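A minimal sketch of this update rule on one data server, under stated assumptions: the local sets are recomputed by brute force rather than by CFKNMatchAD, safe regions are not modeled, and a change is detected by comparing the ids and attributes of the candidate points with what was last sent. The class and method names are hypothetical.

```python
def n_match_difference(point, query, n):
    """Assumed definition: n-th smallest per-dimension absolute difference."""
    return sorted(abs(p - q) for p, q in zip(point, query))[n - 1]


def local_candidate_ids(points, query, n0, n1, k):
    """Ids of all points that appear in any local S_n, n0 <= n <= n1."""
    ids = set()
    for n in range(n0, n1 + 1):
        ranked = sorted(
            points,
            key=lambda pid: (n_match_difference(points[pid], query, n), pid),
        )
        ids.update(ranked[:k])
    return ids


class DataServer:
    """Hypothetical stand-in for one server N_i handling one query."""

    def __init__(self, points, qid, query, n0, n1, k):
        self.points = dict(points)
        self.qid = qid
        self.query, self.n0, self.n1, self.k = query, n0, n1, k
        self.last_sent = None  # candidates most recently sent to N_0

    def evaluate(self):
        """Return (qid, candidates) if they differ from the last message
        sent to N_0, otherwise None (nothing needs to be sent)."""
        ids = local_candidate_ids(
            self.points, self.query, self.n0, self.n1, self.k
        )
        candidates = {pid: self.points[pid] for pid in ids}
        if candidates == self.last_sent:
            return None
        self.last_sent = candidates
        return (self.qid, candidates)

    def update_point(self, pid, attributes):
        """Apply an attribute change and reevaluate the local sets."""
        self.points[pid] = attributes
        return self.evaluate()


server = DataServer(
    {"a": (1, 9, 9), "b": (2, 2, 8), "e": (9, 9, 9)},
    qid=7, query=(0, 0, 0), n0=1, n1=3, k=2,
)
first = server.evaluate()                       # initial answer is sent
quiet = server.update_point("e", (9, 9, 10))    # local sets unchanged
changed = server.update_point("e", (0, 9, 9))   # e enters the local S_1
```

Comparing attributes as well as ids is a design choice of this sketch: it keeps the values held by N_0 current whenever a candidate point itself fluctuates.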

Proof. We show that, in frequent k-n-match search, a point that does not appear in the top k positions of the local S_{n0}, ..., S_{n1} will not appear in the top k positions of the global S_{n0}, ..., S_{n1}.

Let P be a point that does not appear in the local S_{n0}, ..., S_{n1}. Hence, for every n with n0 ≤ n ≤ n1, there are at least k local points whose n-match differences are smaller than that of P. Suppose that we send all points in the local S_{n0}, ..., S_{n1} and also P to the centralized server and that, after performing FKNMatchAD, P is in the global S_{n0}, ..., S_{n1}. This means that, for some n, fewer than k points have n-match differences smaller than that of P. But we do send all points in the local S_{n0}, ..., S_{n1} to the centralized server, so at least k points with n-match differences smaller than that of P are available at the centralized server. This contradicts the assumption, so such a point P does not exist. Thus, a point that does not appear in the top k positions of the local S_{n0}, ..., S_{n1} will not appear in the top k positions of the global S_{n0}, ..., S_{n1}. □

[Figure 4.2 diagram: each server N_1, N_2, ..., N_m receives points {P_{1,i}, P_{2,i}, ...} from its data source, performs CFKNMatchAD, and sends its local answers {P'_{1,i}, P'_{2,i}, ..., P'_{a_i,i}} to the centralized server N_0, which performs FKNMatchAD to obtain the global top k points.]

Figure 4.2: Architecture of distributed system

In the de-centralized environment, each data server performs CFKNMatchAD only on the points it holds, and the centralized server performs FKNMatchAD only on the points received from the data servers.

Moreover, data servers send only the data of the points that appear in the top k positions of the local S_{n0}, ..., S_{n1} to the centralized server. Because data servers eliminate the points that cannot be answers, the computation of query processing is balanced across all data servers, and both the response time of each query and the network traffic are reduced. The system described above is a 2-level de-centralized architecture: the data servers are in level 1 and the centralized server is in level 2. To further reduce the response time of queries, we can deploy more servers to deepen the architecture. The servers of level 1 are all data servers, and the server at the top level is the centralized server, which is responsible for handling the queries from the clients. The servers between level 1 and the top level perform FKNMatchAD to find the temporary top k points and send them to the servers of the upper level.
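A deeper hierarchy can be sketched recursively. The sketch below is an assumption-laden illustration: it uses brute force instead of FKNMatchAD, and it assumes that an intermediate server forwards every point that appears in its own S_{n0}, ..., S_{n1}, by the same argument as for data servers. The tree encoding (a dict for a level-1 data server, a list for a server with children) and all names are hypothetical.

```python
from collections import Counter


def n_match_difference(point, query, n):
    """Assumed definition: n-th smallest per-dimension absolute difference."""
    return sorted(abs(p - q) for p, q in zip(point, query))[n - 1]


def match_sets(points, query, n0, n1, k):
    """S_n for each n in [n0, n1], ties broken by point id."""
    return {
        n: sorted(
            points,
            key=lambda pid, n=n: (n_match_difference(points[pid], query, n), pid),
        )[:k]
        for n in range(n0, n1 + 1)
    }


def forward(node, query, n0, n1, k):
    """Points that `node` sends to the level above it.

    A dict is a level-1 data server holding its own points; a list is a
    server whose children are the nodes in the list."""
    if isinstance(node, dict):
        pool = node
    else:
        pool = {}
        for child in node:
            pool.update(forward(child, query, n0, n1, k))
    keep = set()
    for ids in match_sets(pool, query, n0, n1, k).values():
        keep.update(ids)
    return {pid: pool[pid] for pid in keep}


def answer(root, query, n0, n1, k):
    """Final answer at the top-level server."""
    pool = forward(root, query, n0, n1, k)
    counts = Counter()
    for ids in match_sets(pool, query, n0, n1, k).values():
        counts.update(ids)
    return sorted(counts, key=lambda pid: (-counts[pid], pid))[:k]


# Hypothetical 3-level layout: two data servers under one intermediate
# server, and a third data server attached through its own branch.
s1 = {"a": (1, 9, 9), "b": (2, 2, 8), "e": (9, 9, 9)}
s2 = {"c": (3, 3, 3), "d": (0, 7, 7), "f": (8, 8, 8)}
s3 = {"g": (1, 1, 1), "h": (5, 5, 5)}
tree = [[s1, s2], [s3]]

hierarchical = answer(tree, (0, 0, 0), 1, 3, 2)
flat = answer({**s1, **s2, **s3}, (0, 0, 0), 1, 3, 2)
```

With a total order on n-match differences, pruning at every level leaves the final answer unchanged, since the global sets are always contained in the union of the sets kept one level below.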

4.4 Implementation

In continuous queries, the server has to report the valid answer periodically. Therefore, when an attribute fluctuates, we have to check whether the answer has changed and reevaluate the query. In every reevaluation, FKNMatchAD and CFKNMatchAD have to sort all attributes of the objects in every dimension, which costs a lot of computation. In fact, we can obtain the valid answer without sorting all attributes. After we get the first answer, we know how many attributes we retrieved to obtain it, and the differences of these attributes are all smaller than a threshold. In the next reevaluation, we can sort only the attributes whose differences are smaller than this threshold plus a value and still obtain the valid answer. This reduces the sorting time significantly and improves the performance of FKNMatchAD and CFKNMatchAD.
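The idea of sorting only the attributes near the previous threshold can be illustrated for a single k-n-match evaluation. This is a brute-force sketch rather than the AD-based algorithms; the function names, the sample points, and the slack value are hypothetical, and the fallback to a full sort when the bound turns out to be too tight is an assumption of the sketch.

```python
def attribute_differences(point, query):
    """Per-dimension absolute differences between a point and Q."""
    return [abs(a - q) for a, q in zip(point, query)]


def k_n_match_full(points, query, n, k):
    """Baseline: sort every attribute difference of every point."""
    scored = sorted(
        (sorted(attribute_differences(p, query))[n - 1], pid)
        for pid, p in points.items()
    )
    return scored[:k]


def k_n_match_bounded(points, query, n, k, bound):
    """Sort only the differences that are at most `bound`.

    A point has an n-match difference of at most `bound` exactly when at
    least n of its differences are at most `bound`, so points with fewer
    than n such attributes can be skipped. If fewer than k points qualify,
    the bound was too tight and we fall back to the full sort."""
    scored = []
    for pid, p in points.items():
        small = sorted(d for d in attribute_differences(p, query) if d <= bound)
        if len(small) >= n:
            scored.append((small[n - 1], pid))
    if len(scored) < k:
        return k_n_match_full(points, query, n, k)
    return sorted(scored)[:k]


query = (0, 0, 0)
points = {"a": (1, 9, 9), "b": (2, 2, 8), "c": (3, 3, 3), "d": (0, 7, 7)}

first = k_n_match_full(points, query, n=2, k=2)
threshold = first[-1][0]      # largest difference needed for the first answer

points["c"] = (3, 4, 3)       # an attribute fluctuates
second = k_n_match_bounded(points, query, n=2, k=2, bound=threshold + 1)
```

The slack added to the threshold trades sorting work against the chance of having to fall back: a larger value keeps more attributes in the sort but makes a full reevaluation less likely.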

