Solving DIA problem via the well-known Greedy-NN algorithm

Chapter 4 Inverted File Optimization

4.2 Document Identifier Assignment Problem and Its Algorithm

4.2.2 Solving DIA problem via the well-known Greedy-NN algorithm

Shieh et al. (2003) and Gelbukh et al. (2003) showed that finding a near-optimal solution for the SDIA problem can be recast as a traveling salesman problem (TSP), so that heuristic algorithms for the TSP can be applied to the SDIA problem to find a near-optimal DIA. Shieh et al. (2003) further showed that, on average, the Greedy-NN algorithm performs better on the SDIA problem than other well-known TSP heuristic algorithms, such as the insertion heuristic algorithm and the spanning-tree-based algorithm. In this section, we first show how to solve the SDIA problem using the Greedy-NN algorithm. We then show how to transform the DIA problem into the SDIA problem, and explain why the Greedy-NN algorithm can provide better performance than the other TSP heuristic algorithms for the DIA problem.

Solving SDIA problem via Greedy-NN algorithm

Shieh et al. (2003) showed that the SDIA problem can be solved by using TSP heuristic algorithms. Given a collection of N documents, a document similarity graph (DSG) can be constructed. In a DSG, each vertex represents a document, and the weight on the edge between two vertices represents the similarity of the two corresponding documents. The similarity Sim(d_i, d_j) between two documents d_i and d_j is defined as

$$Sim(d_i, d_j) = |d_i \cap d_j|,$$

where each document is regarded as the set of terms it contains and ∩ denotes the intersection operator. Hence, the similarity between two documents is the number of common terms appearing in both documents. The DSG for the example documents in Figure 4.2(a) is shown in Figure 4.3. A TSP heuristic algorithm can then be used to find a path in the DSG that visits each vertex exactly once with a maximal sum of similarities. If we assign document identifiers in the order in which the vertices are visited on the path, the sum of d-gap values for an inverted file can be decreased, and the size of the inverted file compressed via the d-gap compression approach can be reduced. Shieh et al. (2003) showed that the Greedy-NN algorithm (Figure 4.4) can provide excellent performance for the SDIA problem.
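The DSG construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each document is modelled as a set of term strings, the names `similarity` and `build_dsg` are hypothetical, and the three-document collection is invented for the example (it is not the data of Figure 4.2).

```python
from itertools import combinations


def similarity(terms_a, terms_b):
    """Number of distinct terms that appear in both documents."""
    return len(set(terms_a) & set(terms_b))


def build_dsg(docs):
    """Return the DSG as {frozenset({a, b}): weight} over all document pairs."""
    return {
        frozenset((a, b)): similarity(docs[a], docs[b])
        for a, b in combinations(docs, 2)
    }


# Hypothetical collection, invented for illustration only.
docs = {
    "d1": {"t1", "t2", "t3"},
    "d2": {"t2", "t3"},
    "d3": {"t4"},
}
dsg = build_dsg(docs)
print(dsg[frozenset(("d1", "d2"))])  # 2 (shared terms t2 and t3)
```

Storing one weight per unordered pair reflects that the similarity is symmetric, so the graph is undirected.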

We now show how to obtain a DIA for the example documents in Figure 4.2. In Step 1, we construct the DSG (Figure 4.3) for the given documents, where V = {d₁, d₂, d₃, d₄, d₅, d₆}. In Step 2, we pick d₄ as v₁ since the sum of similarity values associated with its adjacent edges is maximal (= 10). In Step 3, we have V′ = {d₁, d₂, d₃, d₅, d₆}. In Step 4, we pick d₆ as v₂ since d₆ is the vertex v in V′ for which the edge (v, v₁) has the maximal similarity value. In Step 5, we have V′ = {d₁, d₂, d₃, d₅}.

Repeating Steps 4 and 5 as needed, we sequentially pick d₁ as v₃, d₃ as v₄, d₂ as v₅, and d₅ as v₆. Hence, we have the TSP path {d₄, d₆, d₁, d₃, d₂, d₅} and the DIA π = {d₁→3, d₂→5, d₃→4, d₄→1, d₅→6, d₆→2}.
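The last step of the example, turning the visiting order into identifiers, can be checked with a short Python sketch. The path is the one derived in the text. The helper names and the sample posting list are hypothetical, since the posting lists of Figure 4.2 are not reproduced here.

```python
def path_to_dia(path):
    """Assign each document its 1-based position on the path as its new identifier."""
    return {doc: position for position, doc in enumerate(path, start=1)}


def d_gaps(ids):
    """First identifier followed by the differences between consecutive identifiers."""
    ids = sorted(ids)
    return ids[:1] + [b - a for a, b in zip(ids, ids[1:])]


path = ["d4", "d6", "d1", "d3", "d2", "d5"]  # TSP path from the text
dia = path_to_dia(path)
print(dict(sorted(dia.items())))
# {'d1': 3, 'd2': 5, 'd3': 4, 'd4': 1, 'd5': 6, 'd6': 2}

# Hypothetical posting list of one term (not taken from Figure 4.2).
posting = ["d1", "d4", "d6"]
old_ids = [int(doc[1:]) for doc in posting]  # original identifiers 1, 4, 6
new_ids = [dia[doc] for doc in posting]      # reassigned identifiers 3, 1, 2
print(d_gaps(old_ids), d_gaps(new_ids))      # [1, 3, 2] [1, 1, 1]
```

In this invented posting list the three documents are adjacent on the path, so their new identifiers are consecutive and every d-gap shrinks to 1.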

Figure 4.3 The DSG for the example documents in Figure 4.2(a).

Algorithm Greedy_nearest_neighbor

1. Construct the DSG(V, E), where V is a set of vertices (in which each vertex represents a document) and E is a set of edges (in which each edge has a similarity value associated with it);

2. Pick a vertex v ∈ V as v₁ such that the sum of similarity values associated with the adjacent edges of v is maximal;

3. V′ := V − {v₁}; i := 1;

4. Find v in V′ such that the similarity value of the edge (v, v_i) is maximal; if more than one such vertex exists, select one randomly;

5. i := i+1; v_i := v; V′ := V′ − {v_i};

6. If i < N then goto 4;

7. Output a TSP path with its visiting order of vertices being {v₁, v₂, ..., v_N}.

Figure 4.4 The Greedy-NN algorithm for the SDIA problem.
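The steps of Figure 4.4 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that documents are sets of terms, that Steps 4 and 5 are repeated until every vertex has been placed, and that ties are broken by input order rather than randomly so that the result is reproducible. The function name and the four-document collection are hypothetical.

```python
def greedy_nn(vertices, sim):
    """Return a visiting order of the vertices, following the steps of Figure 4.4.

    vertices: list of document ids; sim(a, b): symmetric similarity function.
    Ties are broken by input order (an assumption; the figure picks randomly).
    """
    # Step 2: start from the vertex whose adjacent edges have the largest total weight.
    start = max(vertices, key=lambda v: sum(sim(v, u) for u in vertices if u != v))
    path = [start]
    remaining = [v for v in vertices if v != start]  # Step 3
    while remaining:  # Steps 4 to 6
        # Step 4: the unvisited vertex most similar to the last vertex on the path.
        nearest = max(remaining, key=lambda v: sim(v, path[-1]))
        path.append(nearest)       # Step 5
        remaining.remove(nearest)
    return path  # Step 7


# Hypothetical collection, invented for illustration only.
docs = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b", "c", "d"},
    "d3": {"c", "d", "e"},
    "d4": {"e", "f"},
}


def common_terms(a, b):
    """Similarity as the number of terms shared by documents a and b."""
    return len(docs[a] & docs[b])


print(greedy_nn(list(docs), common_terms))  # ['d2', 'd1', 'd3', 'd4']
```

Here d2 has the largest similarity sum (3 + 2 + 0 = 5) and starts the path; d1 is its most similar neighbour, then d3, then d4. The sketch recomputes similarities on demand; an implementation for larger collections would typically precompute or cache them.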

Transforming DIA problem into SDIA problem

We use a matrix A to represent the input document collection, in which a row corresponds to a term and a column corresponds to a document. The entry Ai,j is a 1 if term i appears in document dj, and 0 otherwise. The SDIA problem is to determine whether there exists a permutation of the columns of A that results in a matrix B such that

$$\sum_{i=1}^{n}\left(\sum_{j=2}^{f_i} C\bigl(z(i,j)-z(i,j-1)\bigr) + C\bigl(z(i,1)\bigr)\right) < k,$$

where C is a coding method that requires C(x) bits to encode a d-gap x, n is the number of terms, f_i is the total number of documents in which term i appears, z(i,j) is a function that returns the column index of the j-th nonzero entry in row i of B, and k is a given integer used to determine whether there exists a permutation of the columns of A such that the total encoded size of an inverted file is less than k. The DIA problem is to determine whether there exists a permutation of the columns of A that results in a matrix B such that

$$\sum_{i=1}^{n} p_i\left(\sum_{j=2}^{f_i} C\bigl(z(i,j)-z(i,j-1)\bigr) + C\bigl(z(i,1)\bigr)\right) < k',$$

where p_i is the probability of term i appearing in a query, and k′ is a given value used to determine whether there exists a permutation of the columns of A such that the mean encoded size needed to read and decompress a posting list during query processing is less than k′.
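The two quantities being bounded can be computed directly from a 0/1 term-document matrix. The Python sketch below is illustrative only. It assumes that the first identifier of each posting list is coded as-is and every later one as a d-gap, and it uses the Elias gamma code length as a stand-in for the coding method C, which the text leaves open. All function names and the small matrix are hypothetical.

```python
def gamma_bits(x):
    """Bits used by the Elias gamma code for a positive integer x (assumed choice of C)."""
    return 2 * (x.bit_length() - 1) + 1


def row_cost(row, code_bits):
    """Encoded size of one posting list: the first identifier plus the following d-gaps."""
    ids = [col for col, bit in enumerate(row, start=1) if bit]
    gaps = ids[:1] + [b - a for a, b in zip(ids, ids[1:])]
    return sum(code_bits(g) for g in gaps)


def permute_columns(matrix, order):
    """Build B: column j of B is column order[j] of the input (0-based indices)."""
    return [[row[c] for c in order] for row in matrix]


def sdia_cost(matrix, code_bits=gamma_bits):
    """Total encoded size over all rows (the quantity compared with k)."""
    return sum(row_cost(row, code_bits) for row in matrix)


def dia_cost(matrix, probs, code_bits=gamma_bits):
    """Mean encoded size weighted by query probability (compared with k')."""
    return sum(p * row_cost(row, code_bits) for row, p in zip(matrix, probs))


# Hypothetical 2-term, 4-document matrix, invented for illustration only.
A = [[1, 0, 1, 0],
     [0, 1, 0, 1]]
B = permute_columns(A, [0, 2, 1, 3])
print(sdia_cost(A), sdia_cost(B))                              # 10 6
print(dia_cost(A, [0.25, 0.75]), dia_cost(B, [0.25, 0.75]))    # 5.5 3.5
```

In this toy case the permutation places the documents of each term next to each other, which lowers both the total size and the probability-weighted mean size.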

To show how to transform the DIA problem into the SDIA problem, we use the document collection in Figure 4.2(a) as an example instance of the DIA problem, and assume that the probabilities of terms being queried are p₁=0.2, p₂=0.3, p₃=0.1, and p₄=0.4. Figure 4.5(a) shows the matrix A of Figure 4.2(a). We then construct a new matrix A′ for the SDIA problem by duplicating each row of matrix A a certain number of times based on the given probabilities of terms appearing in a query, as shown in Figure 4.5(b). In matrix A′, the row of matrix A corresponding to term i is duplicated m_i times, where m_i = rows(A′) × p_i and rows(A′) denotes the number of rows of matrix A′. The value of rows(A′) can be any positive integer such that m_i = rows(A′) × p_i is an integer for every i. In this example, we let rows(A′) be 10. Because the row of term i appears m_i = rows(A′) × p_i times in A′, the total encoded size of A′ under any column permutation equals rows(A′) times the mean encoded size of A under the same permutation. Hence the optimal solution of matrix A′ for the SDIA problem is also the optimal solution of matrix A for the DIA problem when the probabilities p₁=0.2, p₂=0.3, p₃=0.1, and p₄=0.4 are given.
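The row-duplication step can be sketched in Python. The probabilities are those of the example. The function names are hypothetical, and the placeholder matrix stands in for matrix A of Figure 4.5(a), whose entries are not reproduced here. Exact fractions are used so that each m_i is computed without floating-point rounding; `math.lcm` with several arguments assumes Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+


def duplication_counts(probs):
    """Return the smallest rows(A') and the counts m_i = rows(A') * p_i, all integers."""
    fracs = [Fraction(p) for p in probs]
    rows = lcm(*(f.denominator for f in fracs))
    return rows, [int(f * rows) for f in fracs]


def expand(matrix, counts):
    """Build A' by repeating row i of A counts[i] times."""
    return [list(row) for row, m in zip(matrix, counts) for _ in range(m)]


rows, counts = duplication_counts(["0.2", "0.3", "0.1", "0.4"])
print(rows, counts)  # 10 [2, 3, 1, 4]

# Placeholder 4-term, 3-document matrix (hypothetical, not the matrix of Figure 4.5).
A = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [0, 0, 1]]
A_prime = expand(A, counts)
print(len(A_prime))  # 10
```

The smallest admissible rows(A′) is the least common multiple of the denominators of the probabilities, which is 10 for this example and matches the value chosen in the text. Any multiple of it would work as well.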

Using the same approach, one can transform any instance A of the DIA problem into an instance A′ of the SDIA problem such that the optimal solution of matrix A′ for the SDIA problem is also the optimal solution of matrix A for the DIA problem when the probabilities p_i for 1 ≤ i ≤ n are given, where n denotes the number of distinct terms. Since Shieh et al. (2003) showed that the Greedy-NN algorithm performs the best for the SDIA problem on average, it follows that the Greedy-NN algorithm can also provide better performance than the other TSP heuristic algorithms for the DIA problem. Therefore, the DIA problem can be solved using the Greedy-NN algorithm described in Figure 4.4, if the similarity Sim(d_i, d_j) between two documents d_i and d_j is defined as

$$Sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} p_t,$$

where the probability of a term t appearing in a query is known to be p_t.
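A probability-weighted similarity of this kind can be sketched in Python. The sketch assumes that the similarity is the sum of p_t over the terms shared by the two documents, which is what counting common rows in the duplicated matrix A′ gives after dividing by rows(A′). The function name and the two sample documents are hypothetical; the probabilities are those of the running example.

```python
from fractions import Fraction


def weighted_similarity(terms_a, terms_b, prob):
    """Sum of query probabilities over the terms that both documents contain."""
    return sum(prob[t] for t in set(terms_a) & set(terms_b))


# Query probabilities from the running example, kept exact with fractions.
prob = {
    "t1": Fraction(2, 10),
    "t2": Fraction(3, 10),
    "t3": Fraction(1, 10),
    "t4": Fraction(4, 10),
}

# Two hypothetical documents (not taken from Figure 4.2).
doc_a = {"t1", "t2", "t4"}
doc_b = {"t2", "t3", "t4"}

sim = weighted_similarity(doc_a, doc_b, prob)
print(sim)  # 7/10

# Cross-check against the duplicated matrix with rows(A') = 10:
# the shared terms t2 and t4 contribute 3 + 4 = 7 common rows.
rows = 10
shared_rows = sum(int(prob[t] * rows) for t in doc_a & doc_b)
print(shared_rows == rows * sim)  # True
```

Because the weighted similarity differs from the common-row count in A′ only by the constant factor rows(A′), both produce the same vertex choices in Greedy-NN, so A′ never has to be materialized.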

(a) An example instance for the DIA problem: matrix A corresponds to the document collection in Figure 4.2(a), and the probabilities of terms appearing in a query are p₁=0.2, p₂=0.3, p₃=0.1, and p₄=0.4. (b) Matrix A′ is the corresponding instance of Figure 4.5(a) for the SDIA problem. In matrix A′, Row_term_i of matrix A is duplicated m_i times, where m_i = rows(A′) × p_i and rows(A′) denotes the number of rows of matrix A′:

Row_term1 of matrix A is duplicated m₁ = rows(A′) × p₁ = 2 times

Row_term2 of matrix A is duplicated m₂ = rows(A′) × p₂ = 3 times

Row_term3 of matrix A is duplicated m₃ = rows(A′) × p₃ = 1 time

Row_term4 of matrix A is duplicated m₄ = rows(A′) × p₄ = 4 times

Figure 4.5 An example to illustrate how to transform an instance of the DIA problem into an instance of the SDIA problem.

Although the Greedy-NN algorithm is very simple to implement, it is not well suited to large-scale IRSs due to its high complexity. Given a collection of N documents and n distinct terms, the number of comparisons for calculating Sim(d_i, d_j) for fixed i and j is O(n); hence the total number of comparisons to construct a DSG for the Greedy-NN algorithm is O(N²×n). An algorithm with lower complexity that still generates satisfactory results should be developed.
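The bound follows from counting document pairs. The short derivation below spells this out; the collection size used in the last line is an illustrative assumption, not a figure from the text.

```latex
% One similarity computation per unordered pair of documents, each costing O(n):
\binom{N}{2}\cdot O(n) \;=\; \frac{N(N-1)}{2}\cdot O(n) \;=\; O(N^{2}\times n).
% Illustrative scale (assumed N = 10^{6} documents):
\frac{10^{6}\,(10^{6}-1)}{2} \;\approx\; 5\times 10^{11} \text{ document pairs.}
```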

Source document: Inverted File Design for Large-Scale Information Retrieval Systems (pp. 89-94)