
Chapter 4 A Unified Framework for Pseudo-Relevance Feedback

4.1 Pseudo-Relevance Feedback

In reality, since a query often consists of only a few words, the query model that is meant to represent the user’s information need might not be appropriately estimated by the ML estimator. Furthermore, merely matching words between a query and documents might not be an effective approach, as word overlap alone cannot capture the semantic intent of the query. To address this, an LM-based SDR system with the KL-divergence measure can adopt the idea of pseudo-relevance feedback and perform two rounds of retrieval to search for more relevant documents. In the first round of retrieval, an initial query is input into the SDR system to retrieve a number of top-ranked feedback documents. Subsequently, on top of these top-ranked feedback documents, a refined query model is constructed, and a second round of retrieval is conducted with this new query model and the KL-divergence measure, as depicted in Figure 4.2. It is anticipated that the SDR system can thus retrieve more documents relevant to the query.
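For reference, the KL-divergence ranking criterion used in both rounds can be written as follows (a standard identity in LM-based retrieval, with P(w|Q) and P(w|D) denoting the query and document models):

\[
-\mathrm{KL}\big(P(\cdot|Q)\,\big\|\,P(\cdot|D)\big)=-\sum_{w}P(w|Q)\log\frac{P(w|Q)}{P(w|D)}\;\stackrel{\mathrm{rank}}{=}\;\sum_{w}P(w|Q)\log P(w|D),
\]

where the rank equivalence holds because the query-model entropy term is identical for every document and thus does not affect the ranking.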

Figure 4.1 A toy example of a user going to a search engine (query: “Mac price”; underlying information need: the spec and price of Apple’s new MacBook; and the relevant documents that satisfy it).

However, an LM-based SDR system with the pseudo-relevance feedback process may confront two intrinsic challenges. One is how to purify the top-ranked feedback documents obtained from the first round of retrieval so as to remove redundant and non-relevant information. The other is how to effectively utilize the selected set of representative feedback documents to estimate a more accurate query model. For the latter, a number of studies have proposed query modeling techniques that directly exploit the top-ranked feedback text (or spoken) documents, such as the simple mixture model (SMM) [203], the relevance model (RM) [102] and their extensions [43][190], among others. However, for the former, relatively little work has been done on selecting useful and representative feedback documents from the top-ranked ones for SDR, as far as we are aware. Recently, the so-called “Gapped Top K” and “Cluster Centroid” selection methods [182] have been proposed for text information retrieval. “Gapped Top K” selects K documents with a ranking gap J between any two selected documents, while “Cluster Centroid” groups the top-ranked documents into K clusters and selects one representative document from each cluster to obtain diversified feedback documents, as sketched below. Another more attractive and sophisticated method proposed for text IR is “Active-RDD” [39][199], which takes into account the relevance, diversity and density cues of the top-ranked documents for feedback document selection.
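As an illustration of these two selection strategies, consider the following minimal sketch (our own, not the implementation of [182]; the reading of the ranking gap J as a stride and the use of scikit-learn’s KMeans over precomputed document vectors are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def gapped_top_k(ranked_doc_ids, k, gap):
    """Pick K documents from a ranked list, skipping `gap` documents
    between consecutive picks (one reading of the ranking-gap criterion)."""
    picks = [ranked_doc_ids[i] for i in range(0, k * (gap + 1), gap + 1)
             if i < len(ranked_doc_ids)]
    return picks[:k]

def cluster_centroid(ranked_doc_vectors, ranked_doc_ids, k):
    """Cluster the top-ranked documents into K groups and keep, for each
    cluster, the document closest to its centroid as the representative."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ranked_doc_vectors)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(
            ranked_doc_vectors[members] - km.cluster_centers_[c], axis=1)
        reps.append(ranked_doc_ids[members[np.argmin(dists)]])
    return reps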

Figure 4.2 A schematic illustration of the SDR process with pseudo-relevance feedback (initial query model and document models; initial round of retrieval over the document collection; selection of representative documents from the top-ranked ones; query modeling; second round of retrieval; relevant documents).

4.1.1 Relevance Modeling (RM)

Under the notion of relevance modeling (RM; specifically, the variant referred to as RM-1), each query Q is assumed to be associated with an unknown relevance class RQ, and documents relevant to the semantic content expressed in the query are regarded as samples drawn from RQ. However, in reality, since there is no prior knowledge about RQ, we may use the top-ranked documents DTop to approximate it. The corresponding relevance model, on the grounds of a multinomial view of RQ, can be estimated using the following equation [102][103]:

\[
P_{\mathrm{RM}}(w|Q)=\frac{\sum_{D_r\in D_{\mathrm{Top}}}P(D_r)\,P(w|D_r)\prod_{l=1}^{L}P(q_l|D_r)}{\sum_{D_{r'}\in D_{\mathrm{Top}}}P(D_{r'})\prod_{l=1}^{L}P(q_l|D_{r'})},
\]

where Q = q1, q2, …, qL denotes the query words, the prior probability P(Dr) of each document can simply be kept uniform, and the document models P(w|Dr) and P(ql|Dr) are estimated with the ML estimator on the basis of the occurrence counts of the corresponding words in each document.
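A minimal sketch of this estimation in Python (our own illustration, assuming the document models of the top-ranked documents have been precomputed as unigram distributions over a common vocabulary):

import numpy as np

def relevance_model(query_ids, doc_models, doc_priors=None):
    """RM-1 estimation.
    doc_models: (N_docs x V) array of ML unigram models P(w|D_r) for D_Top.
    query_ids:  vocabulary indices of the query words q_1..q_L."""
    n_docs, _ = doc_models.shape
    priors = np.full(n_docs, 1.0 / n_docs) if doc_priors is None else doc_priors
    # P(Q|D_r) = prod_l P(q_l|D_r); computed in log space for stability
    log_q_given_d = np.log(doc_models[:, query_ids] + 1e-12).sum(axis=1)
    weights = priors * np.exp(log_q_given_d - log_q_given_d.max())
    weights /= weights.sum()        # normalized document weights
    return weights @ doc_models     # P_RM(w|Q), a distribution over the vocabulary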

4.1.2 Simple Mixture Model (SMM)

Another perspective on estimating an accurate query model from the top-ranked documents is the simple mixture model (SMM), which assumes that the words in DTop are drawn from a two-component mixture model: 1) one component is the query-specific topic model PSMM(w|Q), and 2) the other is a generic background model P(w|BG). Under this assumption, the SMM model PSMM(w|Q) can be estimated by maximizing the likelihood of all the top-ranked documents [43][190][203]:

\[
P(D_{\mathrm{Top}}|Q)=\prod_{D_r\in D_{\mathrm{Top}}}\prod_{w}\big[(1-\alpha)\,P_{\mathrm{SMM}}(w|Q)+\alpha\,P(w|\mathrm{BG})\big]^{c(w,D_r)},
\]

where c(w, Dr) denotes the occurrence count of word w in document Dr, and α is a pre-defined weighting parameter used to control the degree of reliance on PSMM(w|Q) versus P(w|BG). This estimation enables more specific words (i.e., words in DTop that are not well explained by the background model) to receive more probability mass, thereby leading to a more discriminative query model PSMM(w|Q).

Simply put, the SMM model is anticipated to extract useful word usage cues from DTop that are not only probably relevant to the query Q, but also complementary to those already captured by the generic background model.
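The SMM likelihood has no closed-form maximizer and is typically optimized with EM; the following is a minimal sketch under our own naming conventions (counts aggregates c(w, Dr) over all of DTop):

import numpy as np

def smm_em(counts, bg, alpha, n_iter=50):
    """EM for the simple mixture model.
    counts: V-dim vector of word counts summed over D_Top.
    bg:     V-dim background distribution P(w|BG).
    alpha:  fixed mixing weight of the background component."""
    topic = counts / counts.sum()  # initialize P_SMM(w|Q)
    for _ in range(n_iter):
        # E-step: posterior that an occurrence of w came from the topic model
        t = (1 - alpha) * topic / ((1 - alpha) * topic + alpha * bg + 1e-12)
        # M-step: re-estimate the topic model from expected topic counts
        topic = counts * t
        topic /= topic.sum()
    return topic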

4.1.3 Regularized Simple Mixture Model (RSMM)

Although SMM modeling aims to extract extra word usage cues for enhanced query modeling, it may confront two intrinsic problems. One is that the extraction of word usage cues from DTop is not guided by the original query; SMM may thus be distracted from appropriately modeling the query of interest, for instance by some dominant distracting (or irrelevant) documents. The other is that the mixing coefficient α is fixed across all top-ranked documents, even though different (either relevant or irrelevant) documents would potentially contribute different amounts of word usage cues to the enhanced query model. To mitigate these two problems, the original query model P(w|Q) can be used to define a conjugate Dirichlet prior for the enhanced query model to be estimated; meanwhile, a trainable document-specific weighting coefficient αDr is introduced for each pseudo-relevant document Dr. The resulting model is referred to hereafter as the regularized simple mixture model (RSMM), and its corresponding objective function, reconstructed here from the description above as the SMM log-likelihood plus the prior term, is expressed as [54][190]:

\[
\mathcal{L}_{\mathrm{RSMM}}=\sum_{D_r\in D_{\mathrm{Top}}}\sum_{w}c(w,D_r)\log\big[(1-\alpha_{D_r})\,P_{\mathrm{RSMM}}(w|Q)+\alpha_{D_r}\,P(w|\mathrm{BG})\big]+\mu\sum_{w}P(w|Q)\log P_{\mathrm{RSMM}}(w|Q),
\]

where μ is a weighting factor indicating the confidence in the prior information (viz. the original query model).
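Estimation again proceeds with EM, now in MAP form. The following sketch shows one plausible set of updates under our own reading of the objective (the prior contributes pseudo-counts μ·P(w|Q) in the topic re-estimation, and each αDr is re-estimated from its document’s expected background counts); it is an illustration, not the thesis’s reference implementation:

import numpy as np

def rsmm_em(doc_counts, bg, q_orig, mu, n_iter=50):
    """MAP-EM for the regularized simple mixture model (sketch).
    doc_counts: (N_docs x V) matrix of counts c(w, D_r).
    bg:         V-dim background distribution P(w|BG).
    q_orig:     V-dim original query model P(w|Q), used as a Dirichlet prior.
    mu:         confidence placed on the prior."""
    n_docs, _ = doc_counts.shape
    topic = doc_counts.sum(axis=0) / doc_counts.sum()  # init P_RSMM(w|Q)
    alpha = np.full(n_docs, 0.5)                       # per-document mixing weights
    for _ in range(n_iter):
        # E-step: per-document posterior that w came from the topic component
        t = (1 - alpha)[:, None] * topic / (
            (1 - alpha)[:, None] * topic + alpha[:, None] * bg + 1e-12)
        # M-step: expected topic counts, smoothed by the prior pseudo-counts
        topic = (doc_counts * t).sum(axis=0) + mu * q_orig
        topic /= topic.sum()
        # M-step: document-specific background weights from expected counts
        alpha = (doc_counts * (1 - t)).sum(axis=1) / doc_counts.sum(axis=1)
    return topic, alpha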
