Query Modeling - Language Modeling for Spoken Document Retrieval

2.1 Language Modeling for Spoken Document Retrieval

2.1.3 Query Modeling

To investigate the feedback document selection methods and the proximity information cues examined in this study, this section introduces the query models used for query reformulation: the relevance model (RM), the topic-based relevance model (TRM), and the simple mixture model (SMM).

(I). Relevance Model

Under the KL-divergence measure, a straightforward but effective technique for enhancing the query formulation is to leverage extra relevance cues related to the query through the so-called relevance model (RM) [19,36,37]. To realize this idea, each query $Q$ is assumed to be associated with an unknown relevance class $R_Q$. Under this assumption, relevant documents (those that satisfy the information need stated by the query) are in turn assumed to be samples drawn from $R_Q$.

Strengthening the original query formulation with relevance information extracted from $R_Q$ is expected to better discriminate the content of relevant documents from that of non-relevant documents. The retrieval problem can therefore be addressed by finding a strategy to formulate the relevance model (RM), that is, the probability model $P_{RM}(w)$. The relevance model $P_{RM}(w)$, which is the probability of observing a word $w$ in a document related to the stated information need, viewed as a multinomial observation from $R_Q$, can be interpreted as randomly selecting a document from the relevance class and then observing a word from it.

Estimating the enhanced query model depends largely on the ideal relevance class; in practice, however, there is no prior knowledge of how to find it. A common strategy is to employ a pseudo-relevance feedback process. As introduced above, the process typically performs two rounds of retrieval: the first round is conducted with the user-given query, and the second round relies on a query newly formulated from a small number of top-ranked documents from the initial retrieval.

Hence, the relevance model can also adopt pseudo-relevance feedback, leveraging the top-ranked documents from the initial retrieval as pseudo-relevant documents to approximate the ideal relevance class and estimating the enhanced query model on top of these documents. It is worth mentioning that the initial retrieval in this study, unless otherwise noted, is implemented with the KL-divergence measure, where the query model $P(w|Q)$ is estimated with the ML estimator (cf. (5)) to obtain a top-ranked list of $M$ pseudo-relevant documents $D_{Top} = \{D_1, D_2, \ldots, D_M\}$ from the spoken document collection. An enhanced query model $P_{RM}(w|Q)$ is then constructed from these top-ranked documents. Once estimated, $P_{RM}(w|Q)$ can be combined with, or replace, the original query model $P(w|Q)$ to form a better estimated (or enhanced) query model so as to identify more relevant documents. After query modeling, the final stage of pseudo-relevance feedback is to retrieve the final ranked list with the newly constructed relevance model.
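As an illustration only, the two building blocks of this feedback loop can be sketched in Python: ranking documents by the rank-equivalent form of the negative KL divergence, $\sum_w P(w|Q) \log P(w|D)$, and combining the original query model with a feedback model. The function names, the Jelinek-Mercer smoothing of the document model with weight `mu`, the probability floor, and the linear interpolation with weight `alpha` are assumptions made for this sketch and are not taken from the study.

```python
import math
from collections import Counter


def ml_model(tokens):
    """Maximum-likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(query_model, doc_tokens, background, mu=0.5):
    """Rank-equivalent negative KL divergence: sum_w P(w|Q) * log P(w|D).

    Assumption: the document model is smoothed with a background model
    (Jelinek-Mercer, weight mu) so that unseen query words do not yield log(0).
    """
    p_doc = ml_model(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        p_w = (1.0 - mu) * p_doc.get(w, 0.0) + mu * background.get(w, 0.0)
        score += p_q * math.log(max(p_w, 1e-12))  # floor guards out-of-collection words
    return score


def rank(query_model, docs, background):
    """Return document indices sorted from most to least similar to the query."""
    scores = [kl_score(query_model, d, background) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def interpolate(original, feedback, alpha=0.5):
    """Assumed linear combination of the original and the feedback query model."""
    vocab = set(original) | set(feedback)
    return {
        w: (1.0 - alpha) * original.get(w, 0.0) + alpha * feedback.get(w, 0.0)
        for w in vocab
    }
```

In such a sketch, the first call to `rank` with the ML query model would supply the top-ranked documents, a feedback model would be estimated from them, and a second call to `rank` with the output of `interpolate` would produce the final list. Setting `alpha=1.0` corresponds to replacing the original query model outright.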

To be more specific, based on the top-ranked list of $M$ pseudo-relevant documents, the joint probability $P_{RM}(Q, w)$ of $Q$ and $w$ being observed together in the relevance class $R_Q$ of $Q$ is formulated as follows:

$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(q_1, q_2, \ldots, q_L, w \mid D_m) \qquad (9)$$

where $P(D_m)$ is the probability of the document $D_m$ being randomly selected and $P(q_1, q_2, \ldots, q_L, w \mid D_m)$ is the joint probability of $Q$ and $w$ co-occurring in $D_m$, which essentially assigns higher probability to words that co-occur with the query terms in $D_m$. If we further assume that words are conditionally independent given $D_m$ and that word order is not important (i.e., the “bag-of-words” assumption), the joint probability can be decomposed as follows:

$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(q_l \mid D_m) \qquad (10)$$

The probability $P(D_m)$ of a pseudo-relevant document can simply be set uniform or determined by the degree of relevance of $D_m$ to $Q$. $P(w \mid D_m)$ and $P(q_l \mid D_m)$ are determined by ML estimation, which is based on the word occurrence counts in $D_m$. Although different ways of deriving the relevance model $P_{RM}(w|Q)$ have been explored, the formulation in (11) has been shown to be more effective and robust than the other variants across different collections [37].
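To make the decomposition in (10) concrete, the following Python sketch accumulates the joint probability over a handful of pseudo-relevant documents, using a uniform $P(D_m)$ by default and unsmoothed ML estimates for $P(w \mid D_m)$ and $P(q_l \mid D_m)$. It is a minimal sketch and not the implementation used in this study: the function names are hypothetical, and the final step, which normalizes the joint probability over the vocabulary to obtain a distribution $P_{RM}(w|Q)$, is an assumption made here for illustration.

```python
from collections import Counter


def ml_unigram(tokens):
    """ML estimate P(w | D): word count divided by document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def relevance_model(query, top_docs, doc_priors=None):
    """Accumulate the joint P_RM(Q, w) of Eq. (10), then normalize over words.

    query:      list of query terms q_1 .. q_L
    top_docs:   list of M tokenized pseudo-relevant documents
    doc_priors: optional list of P(D_m); uniform when omitted
    """
    m = len(top_docs)
    if doc_priors is None:
        doc_priors = [1.0 / m] * m
    joint = Counter()
    for prior, tokens in zip(doc_priors, top_docs):
        p_doc = ml_unigram(tokens)
        # Query likelihood under this document: prod_l P(q_l | D_m).
        q_lik = 1.0
        for q in query:
            q_lik *= p_doc.get(q, 0.0)
        if q_lik == 0.0:
            continue  # without smoothing, a document missing a query term adds nothing
        for w, p_w in p_doc.items():
            joint[w] += prior * p_w * q_lik
    z = sum(joint.values())
    # Assumed normalization step (not shown in the text above).
    return {w: v / z for w, v in joint.items()} if z > 0 else {}
```

Because the ML estimates are unsmoothed, a document that lacks any query term contributes nothing to the sum, so in practice $P(q_l \mid D_m)$ would usually be smoothed.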

(II). Topic-based Relevance Model

Apart from RM, this study also evaluates the topic-based relevance model, which leverages latent topic information in the modeling of RM. To this end, a set of pre-defined latent topic variables $\{T_1, T_2, \ldots, T_K\}$ is assumed to describe the “word-document” co-occurrence characteristics among the pseudo-relevant documents obtained in the initial round of retrieval. Consequently, the probability of a word observed in a pseudo-relevant document $D_m$ is no longer estimated directly from the frequency of the word in the document, but is instead based on the likelihood that the document generates each topic together with the probability of the word being observed in the respective latent topic:

$$P_{TRM}(Q, w) = \sum_{m=1}^{M} \sum_{k=1}^{K} P(D_m)\, P(T_k \mid D_m)\, P(w \mid T_k) \prod_{l=1}^{L} P(q_l \mid T_k) \qquad (13)$$

This is the topic-based relevance model (TRM), which employs a set of latent variables to reinterpret the probability that a word is observed in a pseudo-relevant document. In contrast to RM, TRM assumes that the word distributions over a set of latent topics obtained from all spoken documents in the collection may carry useful global topic information for relevance modeling.
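The double sum in (13) can be sketched in Python as follows, assuming the topic mixtures $P(T_k \mid D_m)$ and the topic word distributions $P(w \mid T_k)$ have already been estimated. The function name and the data layout (a list of per-document topic weights and a list of per-topic word dictionaries) are hypothetical, and the closing normalization over the vocabulary is an assumption made for illustration.

```python
def topic_relevance_model(query, doc_topic, topic_word, doc_priors=None):
    """Accumulate the joint P_TRM(Q, w) of Eq. (13), then normalize over words.

    query:      list of query terms q_1 .. q_L
    doc_topic:  doc_topic[m][k] = P(T_k | D_m) for the M pseudo-relevant documents
    topic_word: topic_word[k] = dict mapping w to P(w | T_k)
    doc_priors: optional list of P(D_m); uniform when omitted
    """
    m = len(doc_topic)
    if doc_priors is None:
        doc_priors = [1.0 / m] * m
    # prod_l P(q_l | T_k) depends only on the topic, so compute it once per topic.
    q_lik = []
    for p_w_t in topic_word:
        lik = 1.0
        for q in query:
            lik *= p_w_t.get(q, 0.0)
        q_lik.append(lik)
    joint = {}
    for prior, p_t_d in zip(doc_priors, doc_topic):
        for k, p_t in enumerate(p_t_d):
            weight = prior * p_t * q_lik[k]
            if weight == 0.0:
                continue
            for w, p_w in topic_word[k].items():
                joint[w] = joint.get(w, 0.0) + weight * p_w
    z = sum(joint.values())
    # Assumed normalization step to obtain a distribution over words.
    return {w: v / z for w, v in joint.items()} if z > 0 else {}
```

Note that a word can receive probability mass here even if it never occurs in any pseudo-relevant document, as long as it is likely under a topic that those documents and the query favor.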

To obtain the probabilities $P(w \mid T_k)$ and $P(T_k \mid D_m)$, we can employ PLSA or LDA, so that the topical probabilities are estimated by maximizing the total log-likelihood $\log L_{\mathbf{D}}$ of the spoken document collection $\mathbf{D}$. This can be carried out with inference algorithms such as the expectation-maximization (EM) algorithm [38] with uniform priors, or the variational approximation algorithm [39] with Dirichlet priors. To be more specific, we take the EM algorithm as an example.
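For illustration, a minimal PLSA estimation loop with EM might look like the Python sketch below. It follows the standard PLSA updates, in which the E-step computes the posterior $P(T_k \mid w, D) \propto P(w \mid T_k) P(T_k \mid D)$ and the M-step re-normalizes the expected counts; it is not the implementation used in this study. The function name, the random initialization, the fixed iteration count, and the input layout are all assumptions, and documents are assumed to be non-empty.

```python
import math
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM on a list of non-empty {word: count} documents.

    Returns (p_t_d, p_w_t, history):
      p_t_d[d][k] = P(T_k | D_d), p_w_t[k][w] = P(w | T_k),
      history[i]  = collection log-likelihood at the start of iteration i.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in counts for w in doc})

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random positive initialization of both sets of parameters.
    p_t_d = [normalized([rng.random() + 0.1 for _ in range(n_topics)]) for _ in counts]
    p_w_t = [
        dict(zip(vocab, normalized([rng.random() + 0.1 for _ in vocab])))
        for _ in range(n_topics)
    ]

    history = []
    for _ in range(n_iter):
        exp_tw = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        exp_dt = [[0.0] * n_topics for _ in counts]
        loglik = 0.0
        for d, doc in enumerate(counts):
            for w, c in doc.items():
                # E-step: unnormalized posterior of each topic for the pair (w, D_d).
                joint = [p_w_t[k][w] * p_t_d[d][k] for k in range(n_topics)]
                p_w = sum(joint)
                loglik += c * math.log(p_w)
                for k in range(n_topics):
                    resp = c * joint[k] / p_w  # expected count assigned to topic k
                    exp_tw[k][w] += resp
                    exp_dt[d][k] += resp
        history.append(loglik)
        # M-step: re-normalize the expected counts into probabilities.
        for k in range(n_topics):
            total = sum(exp_tw[k].values())
            p_w_t[k] = {w: v / total for w, v in exp_tw[k].items()}
        p_t_d = [normalized(row) for row in exp_dt]
    return p_t_d, p_w_t, history
```

The returned `history` gives a simple convergence check, since EM never decreases the log-likelihood from one iteration to the next. The resulting `p_t_d` and `p_w_t` play the roles of $P(T_k \mid D_m)$ and $P(w \mid T_k)$ in (13).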

(III). Simple Mixture Model

One school of thought for deriving a feedback query model assumes that the words in the set of feedback documents $D_P$ are generated from two models: 1) the feedback model $P(w \mid FB)$ and 2) the background model $P(w \mid BG)$; this is known as the simple mixture model (SMM) [26]. The feedback model $P(w \mid FB)$ is estimated from the log-likelihood of the set of feedback documents $D_P$; to obtain it, the objective function (18) is maximized with the EM algorithm via iterative maximization steps.
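As a hedged illustration of such an EM procedure, the Python sketch below uses the commonly cited form of the simple mixture model, in which each word occurrence in the feedback documents is generated by $(1-\lambda) P(w \mid FB) + \lambda P(w \mid BG)$ with a fixed background weight $\lambda$. The exact objective function and update equations of this study are not reproduced here; the function name, the parameter `lam`, the ML initialization, and the fixed iteration count are assumptions of this sketch.

```python
from collections import Counter


def simple_mixture_model(feedback_docs, background, lam=0.5, n_iter=30):
    """Estimate P(w | FB) with EM under an assumed two-component mixture.

    feedback_docs: list of tokenized feedback documents
    background:    dict mapping w to P(w | BG)
    lam:           fixed weight of the background model (assumed form)
    """
    counts = Counter()
    for tokens in feedback_docs:
        counts.update(tokens)
    total = sum(counts.values())
    p_fb = {w: c / total for w, c in counts.items()}  # start from the ML estimate

    for _ in range(n_iter):
        expected = {}
        for w, c in counts.items():
            fb = (1.0 - lam) * p_fb[w]
            bg = lam * background.get(w, 0.0)
            # E-step: probability that an occurrence of w came from the feedback model.
            post = fb / (fb + bg) if fb + bg > 0.0 else 0.0
            expected[w] = c * post
        # M-step: re-normalize the expected feedback counts.
        z = sum(expected.values())
        p_fb = {w: v / z for w, v in expected.items()}
    return p_fb
```

The effect is that words well explained by the background model (typically common, non-discriminative words) lose probability in $P(w \mid FB)$, while words that are frequent in the feedback documents but rare in the background gain probability.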

 

^{ }

 

model. A schematic illustration of the SDR process is shown in Figure 1.

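Source: Exploring Pseudo-Relevance Feedback Techniques and Proximity Information for the Improvement of Spoken Document Retrieval and Recognition (pp. 38-44)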