In a centralized environment, the centralized server must handle all queries from the clients.
When the clients issue a large volume of queries and the data sources produce a large volume of points, the server spends much of its computation on them. If other servers with the same computational power as the centralized server are available, we can use them to compute partial answers and send those partial answers to the centralized server, which computes the final answers. In this way, we reduce the number of points that each server has to handle and balance the workload across servers. Consider a de-centralized environment with m + 1 servers:
N1, N2, ..., Nm, together with a centralized server N0. Every server Ni (0 < i ≤ m) holds data points {P1,i, P2,i, ..., Pli,i}, where li is the number of points in Ni. In the de-centralized environment, when a client registers a frequent k-n-match query < Q, [n0, n1], k > with N0, N0 assigns the query a unique number qid and broadcasts (qid, < Q, [n0, n1], k >) to every server Ni (0 < i ≤ m). Unlike in the centralized environment, Ni does not send the information of its points to the centralized server N0 up front. Instead, for each query point, Ni performs CFKNMatchAD to find the points that have a chance of becoming final answers. In a traditional similarity search algorithm, whether a point is an answer that the user wants is determined by a score.
In that setting, it can be guaranteed that the global top k points are contained in the union of the local top k points. However, in frequent k-n-match search, an answer point is determined by the number of times it appears in Sn0, ..., Sn1. A point that is not among the local top k points may still be in the global Sn0, ..., Sn1. If Ni (0 < i ≤ m) performed CFKNMatchAD to find only the local top k points and sent them to N0, N0 could miss some points that should be in the global Sn0, ..., Sn1, which would make the appearance counts of points in the global Sn0, ..., Sn1 incorrect. To avoid this problem, we let Ni perform CFKNMatchAD and then send every point {P′1,i, P′2,i, ..., P′ai,i} that appears in the local Sn0, ..., Sn1, together with its attributes and the corresponding query number qid, to N0, where ai is the number of points in the local Sn0, ..., Sn1. We prove below that a point that does not appear in the top k positions of the local Sn0, ..., Sn1 will not appear in the top k positions of the global Sn0, ..., Sn1. Therefore, sending only the points that appear in the local Sn0, ..., Sn1 is enough to ensure that the final answer is correct. N0 then receives the points with the same qid from every Ni (0 < i ≤ m) and performs FKNMatchAD to find the global top k points.
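The exchange between the data servers and N0 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not CFKNMatchAD or FKNMatchAD themselves: it replaces the AD-style scan with a brute-force sort, assumes that the n-match difference of a point is the n-th smallest per-dimension absolute difference to Q, breaks ties by point id, and assumes that point ids are unique across servers. All function names are hypothetical.

```python
from collections import Counter


def n_match_difference(p, q, n):
    # Assumption: n-th smallest absolute per-dimension difference between p and q.
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]


def k_n_match_sets(points, q, n0, n1, k):
    # points: dict mapping point id -> attribute tuple.
    # Returns {n: S_n}, where S_n holds the k ids with the smallest n-match difference.
    sets = {}
    for n in range(n0, n1 + 1):
        ranked = sorted(
            points,
            key=lambda pid: (n_match_difference(points[pid], q, n), pid),
        )
        sets[n] = set(ranked[:k])
    return sets


def frequent_k_n_match(points, q, n0, n1, k):
    # Brute-force stand-in for FKNMatchAD: rank ids by how often they appear
    # in S_n0, ..., S_n1 (ties broken by id) and keep the first k.
    sets = k_n_match_sets(points, q, n0, n1, k)
    counts = Counter(pid for s in sets.values() for pid in s)
    return sorted(counts, key=lambda pid: (-counts[pid], pid))[:k]


def local_candidates(points, q, n0, n1, k):
    # Brute-force stand-in for what a data server Ni sends to N0:
    # every point that appears in any local S_n, with its attributes.
    sets = k_n_match_sets(points, q, n0, n1, k)
    ids = set().union(*sets.values())
    return {pid: points[pid] for pid in ids}


def distributed_frequent_k_n_match(servers, q, n0, n1, k):
    # servers: list of per-server point dicts. N0 merges the candidates it
    # receives and evaluates the query on the merged set only.
    merged = {}
    for local_points in servers:
        merged.update(local_candidates(local_points, q, n0, n1, k))
    return frequent_k_n_match(merged, q, n0, n1, k)
```

With a deterministic tie-break, evaluating the query on the merged candidates gives the same result as evaluating it on all points at once, which is the property the text relies on.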
In Ni (0 < i ≤ m), the attributes of {P1,i, P2,i, ..., Pli,i} have their safe regions. When an attribute of one of these points fluctuates, Ni performs CFKNMatchAD and, if the points in the local Sn0, ..., Sn1 change, sends the points that now appear in the local Sn0, ..., Sn1 to N0. N0 receives the update from Ni and performs FKNMatchAD to find the new global top k points. The architecture of the de-centralized environment is shown in Figure 4.2.
Figure 4.2: Architecture of distributed system. (Servers N1, N2, ..., Nm each receive points {P1,i, P2,i, ...} from a data source, perform CFKNMatchAD, and send their local answers {P′1,i, P′2,i, ..., P′ai,i} to the centralized server N0, which performs FKNMatchAD and monitors the global top k points.)
Proof. We show that, in frequent k-n-match search, a point that does not appear in the top k positions of the local Sn0, ..., Sn1 will not appear in the top k positions of the global Sn0, ..., Sn1.
Let P be a point that does not appear in the local Sn0, ..., Sn1. Then, for every n with n0 ≤ n ≤ n1, at least k local points have n-match differences smaller than that of P. Suppose that we send all points in the local Sn0, ..., Sn1, together with P, to the centralized server and that, after performing FKNMatchAD, P is in the global Sn for some n with n0 ≤ n ≤ n1. This means that fewer than k points at the centralized server have n-match differences smaller than that of P. But we did send all points in the local Sn0, ..., Sn1 to the centralized server, so at least k points there have n-match differences smaller than that of P. This contradicts the assumption, so no such point P exists. Thus, a point that does not appear in the top k positions of the local Sn0, ..., Sn1 will not appear in the top k positions of the global Sn0, ..., Sn1. □
In the de-centralized environment, each data server performs CFKNMatchAD only on the points it holds, and the centralized server performs FKNMatchAD only on the points received from the data servers.
Moreover, data servers send only the data of the points that appear in the top k positions of the local Sn0, ..., Sn1 to the centralized server. Because data servers eliminate the points that cannot be answers, the computation for query processing is balanced across all data servers, and both the response time of each query and the network traffic are reduced. The system described above is a 2-level de-centralized architecture: the data servers are in level 1 and the centralized server is in level 2. To further reduce the response time of the queries, we can deploy more servers and deepen the architecture. The servers in level 1 are all data servers, and the server at the top level is the centralized server, which is responsible for handling the queries from the clients.
The servers between level 1 and the top level perform FKNMatchAD to find the temporary top k points and send them to the servers in the level above.
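A deeper architecture can be sketched as a recursive aggregation over a tree of servers. This is an illustrative sketch under stated assumptions, not the FKNMatchAD algorithm: it uses a brute-force sort, takes the n-match difference to be the n-th smallest per-dimension absolute difference to Q, and assumes globally unique point ids. It also assumes that an intermediate server forwards every point that appears in its own Sn0, ..., Sn1 (mirroring the rule for level-1 servers), which is one possible reading of "temporary top k points". All names are hypothetical.

```python
from collections import Counter


def nth_diff(p, q, n):
    # Assumption: n-match difference = n-th smallest per-dimension difference.
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]


def match_sets(points, q, n0, n1, k):
    # One set S_n per n in [n0, n1]: the k ids with the smallest n-match difference.
    return [
        set(sorted(points, key=lambda pid: (nth_diff(points[pid], q, n), pid))[:k])
        for n in range(n0, n1 + 1)
    ]


def candidates(points, q, n0, n1, k):
    # Points that appear in at least one S_n on this server.
    keep = set().union(*match_sets(points, q, n0, n1, k))
    return {pid: points[pid] for pid in keep}


def forward_up(node, q, n0, n1, k):
    # node is either a dict of points (a level-1 data server)
    # or a list of child nodes (a server at a higher level).
    if isinstance(node, dict):
        pool = node
    else:
        pool = {}
        for child in node:
            pool.update(forward_up(child, q, n0, n1, k))
    return candidates(pool, q, n0, n1, k)


def answer_at_root(tree, q, n0, n1, k):
    # The top-level server ranks the surviving points by appearance count.
    pool = forward_up(tree, q, n0, n1, k)
    counts = Counter(pid for s in match_sets(pool, q, n0, n1, k) for pid in s)
    return sorted(counts, key=lambda pid: (-counts[pid], pid))[:k]
```

Under these assumptions, the shape of the tree changes only how the filtering work is distributed, not the answer computed at the root.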
4.4 Implementation
In continuous queries, the server has to report the valid answer periodically. Therefore, when an attribute fluctuates, we have to check whether the answer has changed and perform a reevaluation. In every reevaluation, FKNMatchAD and CFKNMatchAD have to sort all attributes of the objects in every dimension, which costs a lot of computation. In fact, we can obtain the valid answer without sorting all attributes. After we compute the first answer, we know how many attributes were retrieved to obtain it, and the differences of these attributes are all smaller than some threshold. In the next reevaluation, we need to sort only the attributes whose differences are smaller than the threshold plus a margin to obtain the valid answer.
This reduces the sorting time significantly and improves the performance of FKNMatchAD and CFKNMatchAD.
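The thresholded reevaluation can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the actual implementation: it handles a single n rather than the range [n0, n1], models the AD-style scan as a pass over attribute differences in ascending order that stops once k points have been seen n times, and falls back to a full sort when the filtered attributes do not yield k points (the fallback is an assumption; the text does not say what happens in that case). The names `ad_scan`, `first_evaluation`, `reevaluate`, and `slack` are hypothetical.

```python
def all_entries(points, q):
    # One (difference, point id, dimension) entry per attribute.
    return [
        (abs(v - q[d]), pid, d)
        for pid, p in points.items()
        for d, v in enumerate(p)
    ]


def ad_scan(entries, n, k):
    # entries must be sorted in ascending order. A point joins the answer
    # once n of its attributes have been retrieved. Returns (ids, threshold),
    # where threshold is the last difference retrieved, or (None, None) if
    # fewer than k points complete.
    seen = {}
    answer = []
    for diff, pid, _dim in entries:
        seen[pid] = seen.get(pid, 0) + 1
        if seen[pid] == n:
            answer.append(pid)
            if len(answer) == k:
                return answer, diff
    return None, None


def first_evaluation(points, q, n, k):
    # Full sort of every attribute difference.
    return ad_scan(sorted(all_entries(points, q)), n, k)


def reevaluate(points, q, n, k, threshold, slack):
    # Sort only the attributes whose difference is within threshold + slack.
    bound = threshold + slack
    near = sorted(e for e in all_entries(points, q) if e[0] <= bound)
    answer, new_threshold = ad_scan(near, n, k)
    if answer is None:
        # Assumed fallback: the filtered attributes were not enough.
        return first_evaluation(points, q, n, k)
    return answer, new_threshold
```

Because the filtered entries form a prefix of the fully sorted list, a scan that completes within the prefix returns the same answer as a scan over the full sort. A larger `slack` makes the fallback less likely but leaves more attributes to sort.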