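Chung Hua University

Master’s Thesis

Title: Improving Information Retrieval Performance Using Enhanced Pseudo Relevance Feedback

Department: Master’s Program, Department of Computer Science and Information Engineering

Student ID and Name: M09502050 吳智瑋

Advisor: Professor 曾秋蓉

July 2008 (Republic of China year 97)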


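Chinese Abstract

With the advance of the Internet and information technology, information retrieval (IR) systems are widely used in daily life. An IR system helps find information relevant to a user’s need and has become an indispensable application for users of today’s information systems; its value, however, often depends on the quality of the retrieval results. Users expect an IR system not only to find the relevant information they need, but also to rank that information by its degree of relevance to their need, so as to save the time spent filtering it. Many researchers have found that relevance feedback (RF) information is quite helpful in improving the quality of IR systems; the best known approach is the standard Rocchio’s relevance feedback algorithm. Pseudo relevance feedback selects relevant and irrelevant documents automatically from the query results, which avoids the burden and inconvenience of supplying relevance feedback manually and makes relevance feedback practical in IR systems. Although previously proposed relevance feedback algorithms have been shown to improve retrieval quality, they treat all relevant or irrelevant documents alike in their effect on term weights. They do not take into account that individual documents and individual terms differ in their importance to the query and should therefore contribute differently to the feedback, which is needed to make the best use of relevance information. In view of this, this thesis proposes a set of enhanced pseudo relevance feedback algorithms that consider the individual importance of documents and terms to the query in order to further improve retrieval quality. Experimental results confirm that the proposed enhanced pseudo relevance feedback algorithms outperform the standard Rocchio’s relevance feedback algorithm.

Keywords: information retrieval, relevance feedback, pseudo relevance feedback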


Abstract

Owing to the rapid growth and popularization of the Internet and information technology, information retrieval systems have become a necessary part of modern life. Users find valuable information in digital libraries or on the Internet using a few keywords or a natural language sentence. However, the quality of an information retrieval system relies heavily on the accuracy of the information retrieved. The retrieved information should not only match the user’s query, but also be ranked according to its relevance to that query. In the literature, researchers have found that Relevance Feedback (RF) information is quite useful for improving the accuracy of an information retrieval system. Among the proposed relevance feedback algorithms, the standard Rocchio’s relevance feedback algorithm is the best known and the most widely employed in information retrieval systems. Furthermore, the idea of pseudo relevance feedback was proposed for relevance feedback algorithms. It reduces the user’s burden by automatically deciding which documents are relevant and which are irrelevant according to their ranks in the retrieval results. Although relevance feedback algorithms can improve retrieval performance, they do not discriminate well between the degrees of importance of individual documents or terms. To cope with this problem, an enhanced pseudo relevance feedback algorithm is proposed in this thesis. Experimental results show that the proposed algorithm outperforms the standard Rocchio’s relevance feedback algorithm.

Keywords: information retrieval, relevance feedback, pseudo relevance feedback


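Acknowledgements

Writing this page means that my graduate studies are coming to an end. After two years of hard work, it is finally my turn to write the acknowledgements.

First, I sincerely thank Professor 曾秋蓉, both a teacher and a friend, for the careful guidance that allowed me to complete my graduate studies smoothly. The professor was always very patient and gave timely advice. Besides helping with my coursework, the professor also cared about the difficulties I met in daily life. The professor’s enthusiasm for research and dedication to teaching inspired me to keep striving.

I also thank two outstanding doctoral seniors, 創楷 and 志祥, for looking after me; whether the matter was academic or practical, they were always eager to help me solve the difficulties I encountered. Of course, I also thank my senior 智強, whose help from my undergraduate project through graduate school allowed me to get up to speed quickly. I thank my friends and my laboratory partners; your company brought warmth and joy to the road of research. I thank my family, whose support and encouragement allowed me to concentrate on my research. I thank everyone who has helped me; I will always remember your kindness. Finally, I dedicate this thesis to my respected teachers, my partners, and my beloved family.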


List of Figures

Figure 2.1 Main idea of relevance feedback

Figure 2.2 Process of the relevance feedback algorithm

Figure 4.1 Precision comparisons of four parameter settings

Figure 4.2 MAP comparisons of four parameter settings

Figure 4.3 Precision comparisons with Medlars

Figure 4.4 MAP comparisons with Medlars

Figure 4.5 Precision comparisons with OHSUMED

Figure 4.6 MAP comparisons with OHSUMED

Figure 4.7 IR vs. SB

Figure 4.8 (IR, SB) vs. (IRF, SBF)


List of Tables

Table 3.1 An example of query reformulation with IR

Table 3.2 The query-document similarity of the first 3 ranked documents

Table 3.3 An example of query reformulation with SB

Table 3.4 The weight of the j-th term to the user’s query in either R or S

Table 3.5 An example of query reformulation with IRF

Table 3.6 The query-document similarity of the first 3 documents

Table 3.7 The weight of the j-th term to the user’s query

Table 3.8 An example of query reformulation with SBF

Table 4.1 Four parameter settings for SR

Table 4.2 IR vs. SB

Table 4.3 (IR, SB) vs. (IRF, SBF)


Chapter 1 Introduction

Owing to the rapid growth and popularization of the Internet and information technology, vast amounts of information are widely accessible on the web. Information Retrieval (IR) is one of the core technologies for finding the information that users need. Nowadays, information retrieval systems have become a necessary part of modern life. Users find valuable information in digital libraries or on the Internet using a few keywords or a natural language sentence. However, the quality of an information retrieval system relies heavily on the accuracy of the information retrieved. The retrieved information should not only match the user’s query, but also be ranked according to its relevance to that query.

In the literature, researchers have found that Relevance Feedback (RF) information is quite useful for improving the accuracy of an information retrieval system [Rocchio 1966, Ide 1971, Rocchio 1971, Ruthven 2003, Jordan 2004, Pan 2007, Yang 2006].

Relevance feedback algorithms use query reformulation strategies to reformulate a user’s query based on relevance judgments, which come either from the user or from the system. Rocchio’s relevance feedback algorithm [Rocchio 1966] was the first relevance feedback algorithm proposed for IR systems based on the vector space model [Salton 1968, Salton 1971]. Later, two relevance feedback strategies, called Ide regular and Ide dec-hi, were developed by Ide [Ide 1971]. Both strategies are modifications of Rocchio’s relevance feedback algorithm. Subsequently, the standard Rocchio’s relevance feedback algorithm (SR) was proposed [Rocchio 1971]. It employs three constants α, β and γ to adjust the effect of each component in the query reformulation formula.

In these algorithms, the relevance feedback information is provided by the users. That is, users are involved in judging whether the retrieved documents are relevant or not. If the users are not willing to provide relevance information, relevance feedback algorithms are not applicable. In order to reduce the users’ burden in providing relevance feedback information, an alternative approach known as Pseudo Relevance Feedback (PRF) [Croft 1979] was proposed. With PRF, the information retrieval system judges the relevance or irrelevance of retrieved documents automatically according to their ranks in the retrieved results. This makes relevance feedback information always available, so relevance feedback algorithms become more applicable to information retrieval systems. Since then, many IR systems have employed relevance feedback algorithms to improve their retrieval performance [Jordan 2004, Pan 2007, Yang 2006].

Although relevance feedback algorithms have been shown to improve the performance of information retrieval systems, some problems remain:

(1) For the documents retrieved, only two states, relevant or irrelevant, are considered in the query reformulation formula. The degrees of relevance/irrelevance are neglected.

(2) For the terms contained in a retrieved document, only their weights to the document are considered in the reformulation formula. The weights of terms to the user’s query are neglected.

To cope with these problems, an Enhanced Pseudo relevance feedback algorithm (EP) is proposed. The degrees of relevance/irrelevance of documents, as well as the weights of terms to the user’s query, are considered in the proposed algorithm. Four variants based on EP are then derived: the Inverse-Ranked algorithm (IR), the Similarity-Based algorithm (SB), the Inverse-Ranked with Frequency ratio algorithm (IRF) and the Similarity-Based with Frequency ratio algorithm (SBF). IR and SB take only the degrees of relevance/irrelevance of documents into consideration, while IRF and SBF consider both the degrees of relevance/irrelevance of documents and the weights of terms to the user’s query.

The performance of the four variants, IR, SB, IRF and SBF, is evaluated on two standard test collections: Medlars [Salton 1975] and OHSUMED [Hersh 1994]. Medlars contains 30 queries and 1033 documents, while OHSUMED contains 106 queries and 14430 documents. Experimental results show that all four variants of EP outperform SR.

The remainder of this thesis is organized as follows. Chapter 2 reviews related work on information retrieval techniques. Chapter 3 describes the Enhanced Pseudo relevance feedback algorithm (EP) and its four variants. Chapter 4 presents experiments that compare the performance of the proposed algorithms with SR. Finally, conclusions and future work are presented in Chapter 5.


Chapter 2 Related Works

Information retrieval (IR) is the task of searching for information in either structured or unstructured documents. IR algorithms predict which documents are relevant to the user’s need and return those documents to the user. In this chapter, previous work on information retrieval techniques is reviewed. Section 2.1 surveys the models of information retrieval. Section 2.2 describes in detail the model used in this thesis, the vector space model. Query reformulation methods, which have been proposed for improving the performance of information retrieval systems, are introduced in Section 2.3. Relevance feedback algorithms are described in Section 2.4.

2.1 Models of Information Retrieval

The Boolean model [Salton 1989] is the first operational IR model and is based on Boolean logic. A user’s query is presented as a Boolean expression that consists of terms and operators. The Boolean model is an exact-match model: IR systems based on it retrieve only those documents that exactly match the user’s query expression.

Although the Boolean model has been used in IR systems, it has been shown to have a number of difficulties [Ruthven 2003]. The first problem is that term weights are not used in Boolean IR systems, so the documents that match the query are returned as an unordered set. In addition, different systems may retrieve different documents for the same query, because the order in which operators are applied may not be consistent across systems [Ruthven 2003]. To cope with these problems, statistical models were proposed, such as the probabilistic model [Robertson 1976] and the Vector Space Model (VSM) [Salton 1968, Salton 1971].

The probabilistic model was proposed by Robertson and Sparck Jones [Robertson 1976]. It is also known as the Binary Independence Retrieval (BIR) model. Documents are retrieved by estimating the probability that a document is relevant to the user’s need: the higher the estimated probability, the more likely the document is to be relevant. Although this model avoids returning the matching documents as an unordered set, all its weights are binary. Factors that represent the importance of a term, such as term frequency and inverse document frequency, are not taken into account.

The Vector Space Model (VSM) is the most popular model for information retrieval systems. In the vector space model, each document, as well as the user’s query, is represented as a Characteristic Vector (CV) that consists of n weights. Each weight represents the importance of a unique term in the document. A common approach to determining the weights is the so-called TF×IDF (Term Frequency × Inverse Document Frequency) method [Salton 1975, Salton 1983]. Once documents and the user’s query are represented as vectors, an operator can be applied to a document vector and the query vector to determine the similarity between them, called the query-document similarity. Documents with higher similarity are then returned to the user as feasible solutions. Because of its popularity, VSM is used in this thesis. The process of information retrieval with the vector space model is described in detail in the next section.

2.2 Information Retrieval with Vector Space Model

In the vector space model, a document is represented as a Characteristic Vector (CV) that consists of n weights. Each weight represents the importance of the corresponding term in the document. A query is also represented as a CV. The CV of the query is then compared with the CV of each document by a predefined operator to find the similarity between the query and each document. Finally, the documents with higher similarity are returned to the user as feasible solutions. In general, the process of information retrieval based on the vector space model can be divided into four stages: (1) term extraction, (2) term weight definition, (3) similarity computation and (4) document extraction.

Term extraction consists of three steps [Ruthven 2003]:

(1) Tokenization: convert the document into a stream of terms. Typically, all the terms are converted to lower case and punctuation characters are removed.

(2) Stop word removal: remove all the terms which appear commonly in the document collection, and which are not supposed to aid retrieval of relevant material.

(3) Stemming: reduce terms to their root variant to avoid having to instantiate every possible variation of each query term [Lovins 1968, Porter 1980].
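The three term extraction steps above can be sketched as follows. This is only a minimal illustration: the stop-word list and the suffix-stripping rule are hypothetical stand-ins chosen for this example, not the stop list or the stemmers [Lovins 1968, Porter 1980] cited in the text.

```python
import re

# Tiny illustrative stop-word list (an assumption); real systems use larger lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Naive suffix list (an assumption), standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_terms(text):
    """Run tokenization, stop-word removal and stemming in sequence."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


terms = extract_terms("The systems are ranking retrieved documents.")
print(terms)  # ['system', 'rank', 'retriev', 'document']
```

A production system would replace `naive_stem` with an established stemmer; the point here is only the order of the three steps.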

“Term weight” represents the importance of a term in a document. One of the best-known schemes for defining term weights is TF×IDF (Term Frequency × Inverse Document Frequency) [Salton 1975, Salton 1983]. It is a statistical measure for evaluating the importance of a term: the importance increases proportionally to the frequency of the term in the document, but is offset by the frequency of the term across the whole document collection. Although TF×IDF is well known in IR, it does not consider the length of a term, which also implies the importance of the term. The IMportance Factor (IMF) [Wu 2007] was proposed by Wu et al. to cope with this problem. Experimental results show that IMF outperforms TF×IDF on short corpora.
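As a rough illustration of TF×IDF weighting, the sketch below builds one weight vector per document. The text does not state which TF×IDF variant is used, so the raw term count for TF and log(N/df) for IDF are assumptions made for this example, and the three tiny documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors): one TF x IDF vector per tokenized document.

    Assumed variant: TF is the raw count and IDF is log(N / df).
    """
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocabulary])
    return vocabulary, vectors


# Hypothetical, already tokenized documents.
docs = [
    ["precision", "measure", "precision"],
    ["accuracy", "measure"],
    ["define", "measure"],
]
vocabulary, vectors = tf_idf_vectors(docs)
print(vocabulary)  # ['accuracy', 'define', 'measure', 'precision']
print(vectors[0])
```

In this toy collection "measure" occurs in every document, so its weight is zero everywhere, while "precision" is frequent in the first document and rare in the collection, so it receives the largest weight there.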

After terms are extracted and term weights are defined and computed, the CV of document Di can be represented as Eq. (2.1).

$$CV(D_i) = \{(k_1, w_{i1}), (k_2, w_{i2}), \ldots, (k_j, w_{ij}), \ldots, (k_n, w_{in})\} \tag{2.1}$$


where kj represents the j-th term and wij is the weight of the j-th term in document Di. The CV of user’s query Q can be represented as Eq. (2.2).

$$CV(Q) = \{(k_1, w_1), (k_2, w_2), \ldots, (k_j, w_j), \ldots, (k_n, w_n)\} \tag{2.2}$$

where kj represents the j-th term and wj is the weight of the j-th term in the user’s query Q. For similarity comparison, the cosine measure and the inner product (also called the dot product) are usually used to measure the similarity between two vectors. The cosine measure is applied to two vectors to compute the cosine of their included angle; the smaller the angle, the more similar the two vectors. For example, given two characteristic vectors, CV(Q) for the user’s query and CV(Di) for document Di, the cosine similarity measurement is represented as Eq. (2.3):

$$Sim(Q, D_i) = \cos\theta = \frac{CV(Q) \cdot CV(D_i)}{|CV(Q)| \, |CV(D_i)|} \tag{2.3}$$

where

$$CV(Q) \cdot CV(D_i) = \sum_{j=1}^{n} (w_j \times w_{ij}), \tag{2.4}$$

$$|CV(Q)| = \sqrt{\sum_{j=1}^{n} w_j^2}, \quad \text{and} \quad |CV(D_i)| = \sqrt{\sum_{j=1}^{n} w_{ij}^2}. \tag{2.5}$$


If inner product is used for similarity measurement, then

$$Sim(Q, D_i) = CV(Q) \cdot CV(D_i)$$

After similarity measurement, we have to decide which documents are returned to the user. In general, this is decided by either threshold setting or document ranking [Cordon 2002]. In the threshold setting method, a document is returned to the user when its similarity measure is higher than a predefined threshold. In the document ranking method, documents are ranked by their similarity measure, and the top-ranked document is judged to be the most feasible solution for the user. In general, the top n documents are returned to the user.
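The similarity measures of Eqs. (2.3) to (2.5) and the two document extraction methods can be sketched together as below. The function names and the four-term example vectors are hypothetical, chosen only for illustration; the handling of zero-length vectors is also an assumption, since the text does not discuss it.

```python
import math


def inner_product(q, d):
    """Inner product of two weight vectors, as in Eq. (2.4)."""
    return sum(wq * wd for wq, wd in zip(q, d))


def cosine_similarity(q, d):
    """Cosine of the angle between two weight vectors, as in Eq. (2.3)."""
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return inner_product(q, d) / (norm_q * norm_d)


def rank_documents(query, documents, top_n=None, threshold=None):
    """Score every document, optionally filter by threshold, then rank."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    if threshold is not None:
        scored = [(name, s) for name, s in scored if s > threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Hypothetical vectors over four terms.
query = [0.5, 0.0, 0.5, 0.0]
documents = {
    "D1": [0.5, 0.0, 0.0, 0.5],
    "D2": [0.5, 0.5, 0.5, 0.0],
    "D3": [0.0, 0.0, 0.5, 0.0],
}
ranking = rank_documents(query, documents)
print([name for name, _ in ranking])  # ['D2', 'D3', 'D1']
```

Passing `top_n` corresponds to the document ranking method, and passing `threshold` corresponds to the threshold setting method.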

2.3 Query Reformulation

In IR systems, quality relies heavily on the accuracy of the information retrieved. One reason why retrieval results may be unsatisfactory is term mismatch [Croft 1995]: the terms specified in a user’s query do not match the terms contained in the documents relevant to that query. This makes the performance of the IR system unacceptable. To cope with this problem, several techniques, such as query expansion and relevance feedback, have been proposed to reformulate a user’s query.

Query expansion reformulates a user’s query to obtain more relevant results. It consists of two basic steps:

(1) New meaningful terms related to the query terms are added to expand the original query. Several query expansion methods have been proposed, such as co-occurrence term addition [He 2005], query substitution [Jones 2006] and ontology-based query expansion [Bhogal 2007].

(2) The terms contained in the expanded query are re-weighted. The effectiveness of the retrieved results depends on how the term weights are calculated. Several re-weighting methods have been proposed, such as TF×IDF, fuzzy inference [Lin 2006] and simulated annealing [Cordon 2002].

Relevance feedback is another method for reformulating a user’s query. The relevant and irrelevant documents are used to refine the CV of the user’s original query. The relevance judgments come from one of two sources: user designation or system designation. Relevance feedback algorithms have the following advantages:

(1) The details of the query reformulation are shielded.

(2) A controlled process is designed to emphasize some relevant terms and deemphasize some irrelevant terms.

The idea of relevance feedback will be described in the following section.

2.4 Relevance Feedback

The main idea of Relevance Feedback (RF) [Harman 1992, Rocchio 1971, Salton 1990, Orengo 2006] is to reformulate a user’s query so that the retrieval results become more relevant to the user’s need, as shown in Figure 2.1.

Figure 2.1 Main idea of relevance feedback


The process of information retrieval with relevance feedback is shown in Figure 2.2. First, each document in the document set is represented as a Characteristic Vector (CV). Each document is compared with the query, and the feasible documents are then returned to the user. In an IR system without relevance feedback, the document with the largest similarity measure is judged to be the most feasible document for the user’s query. With relevance feedback, however, the relevance information can be used to reformulate the original query in order to obtain more precise results.

Figure 2.2 Process of the relevance feedback algorithm

2.4.1 Rocchio’s Relevance Feedback Algorithm

Rocchio’s approach, which is based on relevance feedback, was originally developed to improve the quality of retrieval results. Rocchio’s query reformulation formula [Rocchio 1966] is as follows:

$$Q' = Q + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i \tag{2.6}$$

where Q’ represents the new query vector, obtained from the original query vector by adding the vectors of the relevant documents and subtracting the vectors of the irrelevant documents; Q represents the original query vector; R represents the set of returned documents rated as relevant; S represents the set of returned documents rated as irrelevant; n1 is the cardinality of R; n2 is the cardinality of S; Ri represents the vector of the i-th relevant document; and Si represents the vector of the i-th irrelevant document.

In Eq. (2.6), Rocchio’s query reformulation formula is expressed in vector form. In more detail, it can be expressed term by term:

$$w_j' = w_j + \frac{1}{|R|} \sum_{D_i \in R} w_{ij} - \frac{1}{|S|} \sum_{D_i \in S} w_{ij} \tag{2.7}$$

where wj’ is the weight of the j-th term in the new query vector, wj is the weight of the j-th term in the original query, wij is the weight of the j-th term in the document Di, R represents the set of returned documents rated as relevant, and S represents the set of returned documents rated as irrelevant.

An advantage of the Rocchio’s relevance feedback algorithm is that the quality of the retrieval results is improved. However, in its query reformulation formula, the degrees of effect of each component: the original query, the relevant documents, and the irrelevant documents, are not specified.

2.4.2 Ide’s Relevance Feedback Algorithm

Ide regular and Ide dec-hi [Ide 1971] were developed by Ide. Both strategies are modified Rocchio’s relevance feedback algorithms. Ide regular is based on the Rocchio’s algorithm. Its query reformulation formula is represented as:

$$Q' = Q + \sum_{i=1}^{n_1} R_i - \sum_{i=1}^{n_2} S_i \tag{2.8}$$

or

$$w_j' = w_j + \sum_{D_i \in R} w_{ij} - \sum_{D_i \in S} w_{ij} \tag{2.9}$$

Ide dec-hi is another strategy based on Rocchio’s algorithm. Its query reformulation formula is represented as follows:

$$Q' = Q + \sum_{i=1}^{n_1} R_i - S_1 \tag{2.10}$$

or

$$w_j' = w_j + \sum_{D_i \in R} w_{ij} - \sum_{D_i \in S} w_{ij}, \quad \text{where } |S| = 1 \tag{2.11}$$

The quality of the retrieval results is improved by Ide’s algorithms, though the improvement is only slight.

2.4.3 Standard Rocchio’s Relevance Feedback Algorithm

An intuitive modification of Rocchio’s formula, called the Standard Rocchio’s relevance feedback algorithm (SR), weights the relative contributions of the original query, the relevant documents and the irrelevant documents [Ruthven 2003]. SR is represented as follows:

$$Q' = \alpha \times Q + \beta \times \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \gamma \times \frac{1}{n_2} \sum_{i=1}^{n_2} S_i \tag{2.12}$$

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} w_{ij} - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} w_{ij} \tag{2.13}$$


where α, β and γ are constants that regulate the degree of effect of each component in Eq. (2.12). α is the degree of effect of the user’s query, β is the degree of effect of the relevant documents and γ is the degree of effect of the irrelevant documents. Typically, α = 1 and β + γ = 1 [Eichmann 2002, Jordan 2005, Salton 1990].

SR has an advantage that the degree of effect of each component is specified.

However, the relevance feedback information is provided by the users. If the users are not willing to provide relevance information, relevance feedback algorithms will not be applicable.
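A term-by-term sketch of Eq. (2.13) is given below. The function name, the default constants and the example vectors are hypothetical; the treatment of an empty relevant or irrelevant set as contributing zero is an assumption, since the formula itself is undefined in that case.

```python
def standard_rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.5):
    """Reformulate a query vector with the SR formula of Eq. (2.13).

    query      : list of term weights w_j
    relevant   : list of document vectors in R
    irrelevant : list of document vectors in S
    """
    new_query = []
    for j, w_j in enumerate(query):
        # Average weight of term j over each feedback set (0 if the set is empty).
        rel = sum(doc[j] for doc in relevant) / len(relevant) if relevant else 0.0
        irr = sum(doc[j] for doc in irrelevant) / len(irrelevant) if irrelevant else 0.0
        new_query.append(alpha * w_j + beta * rel - gamma * irr)
    return new_query


# Hypothetical vectors over four terms.
q = [0.5, 0.0, 0.5, 0.0]
d1 = [0.5, 0.0, 0.0, 0.5]
d2 = [0.5, 0.5, 0.5, 0.0]
d3 = [0.0, 0.0, 0.5, 0.0]

new_q = standard_rocchio(q, relevant=[d2], irrelevant=[d1, d3])
print(new_q)  # [0.625, 0.25, 0.625, -0.125]
```

Calling the same function with alpha = beta = gamma = 1 gives the original Rocchio formula of Eq. (2.7).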

2.4.4 Pseudo Relevance Feedback

In order to reduce the users’ burden in providing relevance feedback information, an alternative approach known as Pseudo Relevance Feedback (PRF) [Croft 1979] was proposed. PRF is a well-known technique that is widely used in IR systems. It reduces the user’s burden by automatically deciding which documents are relevant and which are irrelevant according to the retrieval results, which makes relevance feedback algorithms applicable to information retrieval systems. However, it may cause query drift when few or none of the documents used as feedback information are actually relevant. There are two possible ways to cope with this problem: (1) improving the initial retrieval results or (2) developing better RF techniques.
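One simple way to pick pseudo relevance feedback sets from a ranked list is sketched below. Taking the top-ranked documents as relevant and the bottom-ranked documents as irrelevant follows the examples used later in Chapter 3; the function name, the cut-off parameters and the document identifiers are hypothetical.

```python
def pseudo_relevance_sets(ranked_doc_ids, k_relevant, k_irrelevant):
    """Split a ranked result list into pseudo-relevant and pseudo-irrelevant sets.

    The first k_relevant documents are assumed relevant and the last
    k_irrelevant documents are assumed irrelevant; no user judgment is used.
    """
    if k_relevant + k_irrelevant > len(ranked_doc_ids):
        raise ValueError("feedback sets would overlap")
    relevant = ranked_doc_ids[:k_relevant]
    irrelevant = ranked_doc_ids[len(ranked_doc_ids) - k_irrelevant:]
    return relevant, irrelevant


# Hypothetical ranking, best match first.
ranked = ["D2", "D4", "D5", "D7", "D1", "D3"]
relevant, irrelevant = pseudo_relevance_sets(ranked, k_relevant=2, k_irrelevant=2)
print(relevant)    # ['D2', 'D4']
print(irrelevant)  # ['D1', 'D3']
```

If the initial ranking is poor, the documents selected as relevant may not be relevant at all, which is the query drift risk mentioned above.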

Since then, many IR systems have employed relevance feedback algorithms to improve their retrieval performance [Jordan 2004, Pan 2007, Yang 2006]. In this thesis, PRF is used as the basic idea for our relevance feedback algorithm, called the Enhanced Pseudo relevance feedback algorithm (EP). We will describe EP in Chapter 3.


Chapter 3 The Enhanced Pseudo Relevance Feedback Algorithm

Although the well-known Standard Rocchio’s relevance feedback algorithm (SR) is widely used in IR systems, its query reformulation formula can still be improved. In this chapter, an Enhanced Pseudo relevance feedback algorithm (EP) is proposed. EP incorporates more relevance feedback information in its query reformulation formula to make it more accurate. The basic ideas of EP are described in Section 3.1. Four variants based on EP are then derived and described in Sections 3.2 through 3.5, respectively.

3.1 Basic Ideas

An examination of the query reformulation formula of SR reveals the following problems:

(1) For the documents retrieved, only two states, relevant or irrelevant, are considered in the query reformulation formula. The degrees of relevance/irrelevance are neglected.

(2) For the terms contained in a retrieved document, only their weights to the document are considered in the reformulation formula. The weights of terms to the user’s query are neglected.

To cope with these problems, an Enhanced Pseudo relevance feedback algorithm (EP) is proposed. EP enhances SR by incorporating more relevance feedback information in its query reformation formula. The degrees of relevance/irrelevance of documents, as well as the weights of terms to the user’s query, are considered in the proposed algorithm.

EP is a relevance feedback algorithm based on the vector space model. Like SR, EP modifies the original query vector

$$CV(Q) = \{(k_1, w_1), (k_2, w_2), \ldots, (k_j, w_j), \ldots, (k_n, w_n)\}$$

to a new query vector

$$CV(Q') = \{(k_1, w_1'), (k_2, w_2'), \ldots, (k_j, w_j'), \ldots, (k_n, w_n')\}$$

according to the relevance/irrelevance of each document Di returned by the original query, where a returned document Di is also represented as a vector

$$CV(D_i) = \{(k_1, w_{i1}), (k_2, w_{i2}), \ldots, (k_j, w_{ij}), \ldots, (k_n, w_{in})\}.$$

The query reformulation formula of EP is then presented as follows:

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} (w_{ij} \times d_i \times t_j) - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} (w_{ij} \times d_i \times t_j) \tag{3.1}$$

where α, β and γ are constants analogous to those in SR, wj’ is the weight of the j-th term in the new query vector, wj is the weight of the j-th term in the original query vector, wij is the weight of the j-th term in document Di, R represents the set of returned documents rated as relevant, S represents the set of returned documents rated as irrelevant, di is the degree of relevance/irrelevance of Di, and tj is the weight of the j-th term to the user’s query.

In Eq. (3.1), di, the degree of relevance/irrelevance of Di, and tj, the weight of the j-th term to the user’s query, are two unique factors which are only incorporated in EP.

We can designate different settings of di and tj to result in different relevance feedback algorithms. When we use the following setting:

$$\begin{cases} d_i = 1, & \text{for each } i \\ t_j = 1, & \text{for each } j \end{cases}$$


That means both the degree of relevance/irrelevance of Di and the weight of the j-th term to the user’s query are neglected. Eq. (3.1) can be reduced to

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} w_{ij} - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} w_{ij},$$

which is just the query reformulation formula of SR. In this case, EP is reduced to SR.

In the following sections, four settings of di and tj are used to result in four relevance feedback algorithms.
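The general EP formula of Eq. (3.1) can be sketched as a function that takes the degrees di and the term weights tj as parameters. The function and argument names and the example vectors are hypothetical, and both feedback sets are assumed to be non-empty. Setting every di and tj to 1 reproduces SR, as stated above.

```python
def enhanced_prf(query, relevant, irrelevant, d_rel, d_irr, t,
                 alpha=1.0, beta=0.5, gamma=0.5):
    """Reformulate a query vector with the EP formula of Eq. (3.1).

    relevant / irrelevant : lists of document vectors in R and S (non-empty)
    d_rel / d_irr         : degree d_i for each document in R and in S
    t                     : weight t_j of each term to the user's query
    """
    new_query = []
    for j, w_j in enumerate(query):
        rel = sum(doc[j] * d * t[j] for doc, d in zip(relevant, d_rel))
        irr = sum(doc[j] * d * t[j] for doc, d in zip(irrelevant, d_irr))
        new_query.append(
            alpha * w_j + beta * rel / len(relevant) - gamma * irr / len(irrelevant)
        )
    return new_query


# Hypothetical vectors over four terms.
q = [0.5, 0.0, 0.5, 0.0]
d1 = [0.5, 0.0, 0.0, 0.5]
d2 = [0.5, 0.5, 0.5, 0.0]
d3 = [0.0, 0.0, 0.5, 0.0]

# With d_i = 1 and t_j = 1 the result equals the SR reformulation.
sr_like = enhanced_prf(q, [d2], [d1, d3], d_rel=[1], d_irr=[1, 1], t=[1, 1, 1, 1])
print(sr_like)  # [0.625, 0.25, 0.625, -0.125]
```

The variants described in the following sections differ only in how `d_rel`, `d_irr` and `t` are filled in.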

3.2 The Inverse-Ranked Algorithm

SR employs three constants α, β and γ to adjust the effect of each component in the query reformulation formula. However, for the documents retrieved, only two states, relevant or irrelevant, are considered in the query reformulation formula. The degrees of relevance/irrelevance are neglected. This makes SR unable to discriminate well between the degrees of importance of documents.

To cope with this problem, a variant of EP, called Inverse-Ranked algorithm (IR), is proposed. IR takes the degrees of relevance/irrelevance of documents into consideration. In IR, the degrees of relevance/irrelevance of documents are defined by their inverse rank in the retrieved results.

For EP to take the degrees of relevance/irrelevance of documents into consideration, we use the following setting:

$$\begin{cases} d_i = (\text{set size} - i + 1), & \text{for each } i \\ t_j = 1, & \text{for each } j \end{cases} \tag{3.2}$$


That means the degree of relevance/irrelevance of Di is set to inverse-ranked value, and the weight of the j-th term to the user’s query is neglected. Eq. (3.1) can be reduced to

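$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} (w_{ij} \times d_i) - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} (w_{ij} \times d_i) \tag{3.3}$$

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} \bigl(w_{ij} \times (|R| - i + 1)\bigr) - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} \bigl(w_{ij} \times (|S| - i + 1)\bigr)$$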

In this case, EP is reduced to IR.

In Eq. (3.3), di, the degree of relevance/irrelevance of Di, is defined as in Eq. (3.2): each document Di receives an inverse-ranked value as its degree of relevance/irrelevance, corresponding to its rank i in the relevant or irrelevant document set. For example, assume that the first 3 ranked documents are considered to be relevant, and the last 3 ranked documents are considered to be irrelevant. The degree of relevance/irrelevance of the first-ranked document in a set is d1 = (3-1+1) = 3, that of the second-ranked document is d2 = (3-2+1) = 2, and so on.

Assume that four key terms, k1 as “precision”, k2 as “accuracy”, k3 as “measure” and k4 as “define”, are extracted by the IR system. The Characteristic Vector (CV) of the user’s original query Q is represented as follows:

User’s original query: CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}

Moreover, the relevant document, D2, and the irrelevant documents, D1, D3, are represented as follows:


D2 as 1st relevant document: CV(D₂)={(k₁,0.5),(k₂,0.5),(k₃,0.5),(k₄,0)}

D1 as 1st irrelevant document: CV(D₁)={(k₁,0.5),(k₂,0),(k₃,0),(k₄,0.5)}

D3 as 2nd irrelevant document: CV(D₃)={(k₁,0),(k₂,0),(k₃,0.5),(k₄,0)}

Suppose that the constants α = 1, β = 0.5 and γ = 0.5 are employed to adjust the effect of each component in IR, that |R|, the number of relevant documents, is 1, and that |S|, the number of irrelevant documents, is 2. An example of query reformulation with IR is shown in Table 3.1.

Table 3.1 An example of query reformulation with IR

    1 × CV(Q) = {(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}
  + (0.5/1) × Σ over Di in R (relevant document set):
      CV(D₂) = {(k₁,0.5×1),(k₂,0.5×1),(k₃,0.5×1),(k₄,0×1)}
  − (0.5/2) × Σ over Di in S (irrelevant document set):
      CV(D₁) = {(k₁,0.5×2),(k₂,0×2),(k₃,0×2),(k₄,0.5×2)}
      CV(D₃) = {(k₁,0×1),(k₂,0×1),(k₃,0.5×1),(k₄,0×1)}
  = CV(Q') = {(k₁,0.5),(k₂,0.25),(k₃,0.625),(k₄,−0.25)}

In one of our studies, IR was applied to improve the precision of a virtual tutoring assistant system. The results were published in WSEAS-ACACOS [Wu 2008] and WSEAS-TISA [Wu to appear], respectively.
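The numbers in Table 3.1 can be checked with a short sketch of the IR reformulation in Eq. (3.3). The function name is hypothetical, and the documents in each list are assumed to be ordered by their rank within that set; the vectors and constants are the ones used in the example above.

```python
def inverse_ranked_feedback(query, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.5):
    """IR variant of EP: d_i = (set size - i + 1) and t_j = 1, as in Eq. (3.2).

    relevant / irrelevant are non-empty lists of document vectors, each
    ordered by rank within its own set (rank 1 first).
    """
    n_rel, n_irr = len(relevant), len(irrelevant)
    new_query = []
    for j, w_j in enumerate(query):
        rel = sum(doc[j] * (n_rel - i + 1) for i, doc in enumerate(relevant, start=1))
        irr = sum(doc[j] * (n_irr - i + 1) for i, doc in enumerate(irrelevant, start=1))
        new_query.append(alpha * w_j + beta * rel / n_rel - gamma * irr / n_irr)
    return new_query


# Vectors from the example: terms k1..k4.
q = [0.5, 0.0, 0.5, 0.0]
d1 = [0.5, 0.0, 0.0, 0.5]  # 1st irrelevant document
d2 = [0.5, 0.5, 0.5, 0.0]  # 1st relevant document
d3 = [0.0, 0.0, 0.5, 0.0]  # 2nd irrelevant document

new_q = inverse_ranked_feedback(q, relevant=[d2], irrelevant=[d1, d3])
print([round(w, 3) for w in new_q])  # [0.5, 0.25, 0.625, -0.25]
```

The output agrees with CV(Q') in Table 3.1: the first irrelevant document is weighted by 2 and the second by 1, so "define" (k4) is pushed down more strongly than it would be under SR.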

3.3 Similarity-Based algorithm

The Similarity-Based algorithm (SB) is described in this section. Like IR, SB takes the degrees of relevance/irrelevance of documents into consideration. In SB, the degrees of relevance/irrelevance of documents are defined by their similarity to the query.

For EP to take the degrees of relevance/irrelevance of documents into consideration, we use the following setting:

$$\begin{cases} d_i = Sim(Q, D_i),\ D_i \in R, & \text{for each } i \\ t_j = 1, & \text{for each } j \end{cases} \tag{3.4}$$

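That means the degree of relevance/irrelevance of Di is set to the query-document similarity, and the weight of the j-th term to the user’s query is neglected. Eq. (3.1) can be reduced to

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} (w_{ij} \times d_i) - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} (w_{ij} \times d_i) \tag{3.5}$$

$$w_j' = \alpha \times w_j + \beta \times \frac{1}{|R|} \sum_{D_i \in R} \bigl(w_{ij} \times Sim(Q, D_i)\bigr) - \gamma \times \frac{1}{|S|} \sum_{D_i \in S} \bigl(w_{ij} \times Sim(Q, D_i)\bigr)$$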

In this case, EP is reduced to SB.

In Eq. (3.5), di, the degree of relevance/irrelevance of Di, is defined as in Eq. (3.4): the query-document similarity value is given as the degree of relevance/irrelevance of Di according to the rank of Di in the relevant document set. For example, assume that the first 3 ranked documents are considered to be relevant, and the last 3 ranked documents are considered to be irrelevant. Assume that the first 3 ranked documents are D2, D4 and D5, in that order. Then d1, the degree of relevance/irrelevance of the first-ranked document, is Sim(Q, D2), d2 is Sim(Q, D4), and so on.

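As in Section 3.2, assume that four key terms, k1 as “precision”, k2 as “accuracy”, k3 as “measure” and k4 as “define”, are extracted by the IR system. The CV of the user’s original query Q is represented as follows:

User’s original query: CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}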

Suppose that the first 3 ranked documents are D2, D4 and D5, and that their query-document similarities are as shown in Table 3.2.


Table 3.2 The query-document similarity of the first 3 ranked documents

D_i    Query-document similarity
D2     0.8
D4     0.6
D5     0.5

Suppose that the three constants α = 1, β = 0.5 and γ = 0.5 are employed to adjust the effect of each component in SB, that |R|, the number of relevant documents, is 1, and that |S|, the number of irrelevant documents, is 2. An example of query reformulation with SB is shown in Table 3.3.

Table 3.3 An example of query reformulation with SB

  1 × CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}
+ (0.5/1) × Σ_{Di in R}:
      CV(D₂)={(k₁,0.5×0.8),(k₂,0.5×0.8),(k₃,0.5×0.8),(k₄,0×0.8)}
− (0.5/2) × Σ_{Di in S}:
      CV(D₁)={(k₁,0.5×0.8),(k₂,0×0.8),(k₃,0×0.8),(k₄,0.5×0.8)}
      CV(D₃)={(k₁,0×0.6),(k₂,0×0.6),(k₃,0.5×0.6),(k₄,0×0.6)}
= CV(Q')={(k₁,0.6),(k₂,0.2),(k₃,0.625),(k₄,−0.1)}
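The arithmetic of Table 3.3 can be reproduced with a few lines of Python. The sketch below is a minimal illustration of Eq. (3.5) rather than the thesis implementation; the names `weighted_mean` and `reformulate_sb` are hypothetical, and the document vectors and degrees are copied directly from the table.

```python
ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5

query = [0.5, 0.0, 0.5, 0.0]                    # CV(Q) over k1..k4

# (document vector, degree d_i) pairs, copied from Table 3.3
relevant = [([0.5, 0.5, 0.5, 0.0], 0.8)]        # D2
irrelevant = [([0.5, 0.0, 0.0, 0.5], 0.8),      # D1
              ([0.0, 0.0, 0.5, 0.0], 0.6)]      # D3


def weighted_mean(docs, j):
    """(1 / |set|) * sum over the set of w_ij * d_i, for term j."""
    return sum(vec[j] * d for vec, d in docs) / len(docs)


def reformulate_sb(query, relevant, irrelevant):
    """Apply w'_j = alpha*w_j + beta*mean_R - gamma*mean_S to every term."""
    return [ALPHA * w
            + BETA * weighted_mean(relevant, j)
            - GAMMA * weighted_mean(irrelevant, j)
            for j, w in enumerate(query)]


new_query = [round(w, 3) for w in reformulate_sb(query, relevant, irrelevant)]
print(new_query)  # [0.6, 0.2, 0.625, -0.1]
```

The rounding to three decimals is only there to hide floating-point noise when comparing against the table.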


3.4 Inverse-Ranked with Frequency ratio algorithm

This section describes the Inverse-Ranked with Frequency ratio algorithm (IRF). IRF is an enhanced version of IR: it considers not only the degrees of relevance/irrelevance of documents, but also the weights of terms to the user’s query.

To make EP consider both the degrees of relevance/irrelevance of documents and the weights of terms to the user’s query, we use the following setting:

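$$
\begin{cases}
d_i = \text{set size} - i + 1, & \text{for each } i \\[6pt]
t_j = \begin{cases}
\dfrac{\sum_{D_i \in R} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}, & \text{when } D_i \text{ in } R \\[10pt]
\dfrac{\sum_{D_i \in S} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}, & \text{when } D_i \text{ in } S
\end{cases} & \text{for each } j
\end{cases} \tag{3.6}
$$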

In this case, EP is reduced to IRF.

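$$
\begin{aligned}
w'_j &= \alpha \times w_j + \frac{\beta}{|R|} \times \sum_{D_i \in R} w_{ij} \times d_i \times t_j - \frac{\gamma}{|S|} \times \sum_{D_i \in S} w_{ij} \times d_i \times t_j \\
     &= \alpha \times w_j + \frac{\beta}{|R|} \times \sum_{D_i \in R} w_{ij} \times (|R| - i + 1) \times \frac{\sum_{D_i \in R} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}} \\
     &\quad - \frac{\gamma}{|S|} \times \sum_{D_i \in S} w_{ij} \times (|S| - i + 1) \times \frac{\sum_{D_i \in S} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}
\end{aligned} \tag{3.7}
$$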

In Eq. (3.7), $d_i$, the degree of relevance/irrelevance of $D_i$, and $t_j$, the weight of the j-th term to the user’s query, are defined as in Eq. (3.6). For example, if k1 occurs 2 times in R and 3 times in S, then $t_1$ in R is 0.4 (2/5) and $t_1$ in S is 0.6 (3/5).
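The frequency ratio in this example is simple enough to state as a short Python function. This is a sketch under stated assumptions: the name `frequency_ratio` is hypothetical, and the handling of a term that occurs in neither set (both weights set to 0) is an assumption, since the text does not cover that case.

```python
from typing import Tuple


def frequency_ratio(occ_in_r: int, occ_in_s: int) -> Tuple[float, float]:
    """Return (t_j in R, t_j in S) from a term's occurrence numbers."""
    total = occ_in_r + occ_in_s
    if total == 0:
        # Assumption: a term absent from both R and S gets zero weight.
        return 0.0, 0.0
    return occ_in_r / total, occ_in_s / total


print(frequency_ratio(2, 3))  # (0.4, 0.6)
```

By construction the two weights of a term sum to 1 whenever the term occurs at all, so a term concentrated in R is emphasized in the positive feedback and de-emphasized in the negative feedback.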

and k4 as “define”, are extracted from the IR system. The CV of the user’s original query Q can be represented as follows:

CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}

Suppose that the occurrence number of each term and the weight of the j-th term to the user’s query ($t_j$) are as shown in Table 3.4.


Table 3.4 The weight of the j-th term to the user’s query in either R or S

k_j    Occurrence number in R    Occurrence number in S    t_j in R    t_j in S
k1     3                         2                         0.6         0.4
k2     3                         0                         1           0
k3     3                         2                         0.6         0.4
k4     0                         2                         0           1

Suppose that the three constants α = 1, β = 0.5 and γ = 0.5 are employed to adjust the effect of each component in IRF, that |R|, the number of relevant documents, is 1, and that |S|, the number of irrelevant documents, is 2. An example of query reformulation with IRF is shown in Table 3.5.

Table 3.5 An example of query reformulation with IRF

  1 × CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}
+ (0.5/1) × Σ_{Di in R}:
      CV(D₂)={(k₁,0.5×1×0.6),(k₂,0.5×1×1),(k₃,0.5×1×0.6),(k₄,0×1×0)}
− (0.5/2) × Σ_{Di in S}:
      CV(D₁)={(k₁,0.5×2×0.4),(k₂,0×2×0),(k₃,0×2×0.4),(k₄,0.5×2×1)}
      CV(D₃)={(k₁,0×1×0.4),(k₂,0×1×0),(k₃,0.5×1×0.4),(k₄,0×1×1)}
= CV(Q')={(k₁,0.55),(k₂,0.25),(k₃,0.6),(k₄,−0.25)}
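As a cross-check on Table 3.5, the sketch below derives both $d_i$ (inverse rank) and $t_j$ (frequency ratio) from the raw inputs and then applies Eq. (3.7). It is an illustration under stated assumptions, not the thesis code: all function names are hypothetical, documents are assumed to be listed in rank order within each set, and a term occurring in neither set is assumed to get weight 0.

```python
ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5

query = [0.5, 0.0, 0.5, 0.0]               # CV(Q) over k1..k4
relevant = [[0.5, 0.5, 0.5, 0.0]]          # D2 (rank order within R)
irrelevant = [[0.5, 0.0, 0.0, 0.5],        # D1 (rank order within S)
              [0.0, 0.0, 0.5, 0.0]]        # D3
occ_r = [3, 3, 3, 0]                       # occurrence numbers in R (Table 3.4)
occ_s = [2, 0, 2, 2]                       # occurrence numbers in S (Table 3.4)


def inverse_rank(docs):
    """d_i = set size - i + 1 for the i-th ranked document (i starts at 1)."""
    return [len(docs) - i + 1 for i in range(1, len(docs) + 1)]


def ratio(part, other):
    """Frequency ratio part / (part + other); 0.0 if the term never occurs."""
    total = part + other
    return part / total if total else 0.0


def feedback(docs, t, j):
    """(1 / |set|) * sum_i w_ij * d_i * t_j, for term j."""
    degrees = inverse_rank(docs)
    return sum(vec[j] * d * t for vec, d in zip(docs, degrees)) / len(docs)


def reformulate_irf():
    new_query = []
    for j, w in enumerate(query):
        t_r = ratio(occ_r[j], occ_s[j])    # t_j used for documents in R
        t_s = ratio(occ_s[j], occ_r[j])    # t_j used for documents in S
        new_query.append(ALPHA * w
                         + BETA * feedback(relevant, t_r, j)
                         - GAMMA * feedback(irrelevant, t_s, j))
    return new_query


result = [round(w, 3) for w in reformulate_irf()]
print(result)  # [0.55, 0.25, 0.6, -0.25]
```

With |S| = 2, `inverse_rank(irrelevant)` yields degrees 2 and 1 for D1 and D3, which are the middle factors in the corresponding rows of the table.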


3.5 Similarity-Based with Frequency ratio algorithm

This section describes the Similarity-Based with Frequency ratio algorithm (SBF). SBF is an enhanced version of SB: it considers not only the degrees of relevance/irrelevance of documents, but also the weights of terms to the user’s query.

To make EP consider both the degrees of relevance/irrelevance of documents and the weights of terms to the user’s query, we use the following setting:

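$$
\begin{cases}
d_i = Sim(Q, D_i), & \text{for each } D_i \text{ in } R \\[6pt]
t_j = \begin{cases}
\dfrac{\sum_{D_i \in R} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}, & \text{when } D_i \text{ in } R \\[10pt]
\dfrac{\sum_{D_i \in S} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}, & \text{when } D_i \text{ in } S
\end{cases} & \text{for each } j
\end{cases} \tag{3.8}
$$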

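$$
\begin{aligned}
w'_j &= \alpha \times w_j + \frac{\beta}{|R|} \times \sum_{D_i \in R} w_{ij} \times d_i \times t_j - \frac{\gamma}{|S|} \times \sum_{D_i \in S} w_{ij} \times d_i \times t_j \\
     &= \alpha \times w_j + \frac{\beta}{|R|} \times \sum_{D_i \in R} w_{ij} \times Sim(Q, D_i) \times \frac{\sum_{D_i \in R} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}} \\
     &\quad - \frac{\gamma}{|S|} \times \sum_{D_i \in S} w_{ij} \times Sim(Q, D_i) \times \frac{\sum_{D_i \in S} tf_{ij}}{\sum_{D_i \in R \cup S} tf_{ij}}
\end{aligned} \tag{3.9}
$$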

In Eq. (3.9), $d_i$, the degree of relevance/irrelevance of $D_i$, and $t_j$, the weight of the j-th term to the user’s query, are defined as in Eq. (3.8).


and k4 as “define”, are extracted from the IR system. The CV of the user’s original query Q can be represented as follows:

CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}

Suppose that the first 3 ranked documents are D2, D4 and D5, and that their query-document similarities are as shown in Table 3.6. The occurrence number of each term and the weight of the j-th term to the user’s query ($t_j$) are shown in Table 3.7.

Table 3.6 The query-document similarity of the first 3 documents

D_i    Query-document similarity
D2     0.8
D4     0.6
D5     0.5


Table 3.7 The weight of the j-th term to the user’s query

k_j    Occurrence number in R    Occurrence number in S    t_j in R    t_j in S
k1     3                         2                         0.6         0.4
k2     3                         0                         1           0
k3     3                         2                         0.6         0.4
k4     0                         2                         0           1

Suppose that the three constants α = 1, β = 0.5 and γ = 0.5 are employed to adjust the effect of each component in SBF, that |R|, the number of relevant documents, is 1, and that |S|, the number of irrelevant documents, is 2. An example of query reformulation with SBF is shown in Table 3.8.

Table 3.8 An example of query reformulation with SBF

  1 × CV(Q)={(k₁,0.5),(k₂,0),(k₃,0.5),(k₄,0)}
+ (0.5/1) × Σ_{Di in R}:
      CV(D₂)={(k₁,0.5×0.8×0.6),(k₂,0.5×0.8×1),(k₃,0.5×0.8×0.6),(k₄,0×0.8×0)}
− (0.5/2) × Σ_{Di in S}:
      CV(D₁)={(k₁,0.5×0.8×0.4),(k₂,0×0.8×0),(k₃,0×0.8×0.4),(k₄,0.5×0.8×1)}
      CV(D₃)={(k₁,0×0.6×0.4),(k₂,0×0.6×0),(k₃,0.5×0.6×0.4),(k₄,0×0.6×1)}
= CV(Q')={(k₁,0.58),(k₂,0.2),(k₃,0.59),(k₄,−0.1)}
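The k₁ and k₃ entries of Table 3.8 can be checked by expanding Eq. (3.9) with the values listed in the table (α = 1, β = γ = 0.5, |R| = 1, |S| = 2). The block below is only a worked restatement of that arithmetic; it introduces no values beyond those in the table.

```latex
\begin{aligned}
w'_1 &= 1 \times 0.5
      + \frac{0.5}{1} \times (0.5 \times 0.8 \times 0.6)
      - \frac{0.5}{2} \times (0.5 \times 0.8 \times 0.4 + 0 \times 0.6 \times 0.4) \\
     &= 0.5 + 0.12 - 0.04 = 0.58 \\[6pt]
w'_3 &= 1 \times 0.5
      + \frac{0.5}{1} \times (0.5 \times 0.8 \times 0.6)
      - \frac{0.5}{2} \times (0 \times 0.8 \times 0.4 + 0.5 \times 0.6 \times 0.4) \\
     &= 0.5 + 0.12 - 0.03 = 0.59
\end{aligned}
```

Both terms receive the same positive feedback from D₂; they differ only in which irrelevant document contains them, and therefore in the degree that scales the negative feedback.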


Chapter 4 Experiment and Evaluation

In this chapter, we conduct experiments to compare the performance of EP and SR. The experimental environments are introduced in Section 4.1 and the experimental design in Section 4.2. The parameter settings needed to carry out the experiments are described in Section 4.3. The performance of the proposed algorithms and SR is then compared first on a small test collection, Medlars, and next on a large test collection, OHSUMED. Finally, a summary and analysis of the experimental results are given in Section 4.6.

4.1 Experimental Environments

4.1.1 Test Collections

To evaluate the retrieval performance of the algorithms, we need test collections that contain (1) a set of documents, (2) a set of queries, and (3) a list of judged relevant documents for each query. Medlars, a small test collection, and OHSUMED, a large test collection, are used to compare the performance of the proposed algorithms with SR.
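The three components of a test collection map naturally onto a small data structure. The Python sketch below is purely illustrative: the class name `Collection`, its field names, the helper `hits` and the toy data are all hypothetical and do not reflect the actual file formats of Medlars or OHSUMED.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Collection:
    documents: Dict[str, str]                  # document id -> text
    queries: Dict[str, str]                    # query id -> text
    relevant: Dict[str, Set[str]] = field(default_factory=dict)
    # relevant: query id -> ids of the documents judged relevant

    def hits(self, query_id: str, ranked: List[str], k: int) -> int:
        """Count judged-relevant documents among the first k retrieved."""
        judged = self.relevant.get(query_id, set())
        return sum(1 for doc_id in ranked[:k] if doc_id in judged)


# Toy data, made up for illustration only.
toy = Collection(
    documents={"d1": "first abstract", "d2": "second abstract", "d3": "third abstract"},
    queries={"q1": "an example query"},
    relevant={"q1": {"d1", "d3"}},
)

print(toy.hits("q1", ["d3", "d2", "d1"], k=2))  # 1
```

Any evaluation of a ranked result list reduces to comparisons of this kind between the retrieved documents and the judged relevant documents of the query.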

Medlars [Salton 1975] is based on MEDLINE reference collections from 1964 to 1966. It is publicly available at ftp://ftp.cs.cornell.edu/pub/smart/med/. The collection contains 1033 documents (medical abstracts) and 30 queries. Each query is associated with a list of relevant documents; the relevance judgments were provided by human experts.

OHSUMED [Hersh 1994] is also based on MEDLINE reference collections, from 1987 to 1991. It is publicly available at ftp://medir.ohsu.edu/pub/ohsumed/. There are
