2.1 Language Modeling for Spoken Document Retrieval
2.1.3 Query Modeling
To support the investigation of the feedback document selection methods and proximity cues examined in this study, this section introduces three query models used for query reformulation: the relevance model (RM), the topic-based relevance model (TRM), and the simple mixture model (SMM).
(I). Relevance Model
Within the KL-divergence measure, one straightforward but effective technique for enhancing the query formulation is to leverage extra relevance cues related to the query through the so-called relevance model (RM) [19,36,37]. To realize this idea, each query $Q$ is assumed to be associated with an unknown relevance class $R_Q$. Under this assumption, the relevant documents (those that satisfy the information need stated by the query) are, of course, also assumed to be samples drawn from $R_Q$.
Strengthening the original query formulation with relevance information extracted from $R_Q$ is expected to better discriminate the content of relevant documents from that of non-relevant documents. Hence, the retrieval problem can be addressed by finding a way to formulate the relevance model (RM), that is, the probability model $P_{RM}(w)$. The relevance model $P_{RM}(w)$ is the probability of observing a word $w$ in a document related to the stated information need. Viewed as a multinomial observation of $R_Q$, it can be interpreted as randomly selecting a document from the relevance class and then observing a word from it.
Estimating the enhanced query model depends largely on the ideal relevance class; however, in practice there is no prior knowledge of how to find it. A common strategy is to employ pseudo-relevance feedback. As introduced above, this process typically performs two rounds of retrieval: the first round is conducted with the query given by the user, and the second round relies on a query newly formulated from a small number of top-ranked documents returned by the initial retrieval.
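As a rough illustration of the first retrieval round, the sketch below ranks documents with a rank-equivalent form of the negative KL divergence, using an ML query model. The helper names (`ml_model`, `kl_rank_score`, `retrieve`), the toy documents, and the Jelinek-Mercer smoothing of the document model with weight `lam` are assumptions made for this example only; they are not taken from the study's implementation.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Maximum-likelihood unigram model: relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_rank_score(query_model, doc_tokens, background, lam=0.5):
    """Rank-equivalent of negative KL(Q || D): sum_w P(w|Q) * log P(w|D).

    The document model is interpolated with a background model
    (Jelinek-Mercer smoothing, an assumption of this sketch) so that
    query words missing from the document do not produce log(0).
    """
    doc_model = ml_model(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        # Tiny floor for words absent from the background model as well.
        p_bg = background.get(w, 1e-9)
        p_d = lam * doc_model.get(w, 0.0) + (1.0 - lam) * p_bg
        score += p_q * math.log(p_d)
    return score

def retrieve(query_model, docs, background, top_m):
    """Return the indices of the top_m documents under the KL-based score."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: kl_rank_score(query_model, docs[i], background),
        reverse=True,
    )
    return ranked[:top_m]

# Toy collection standing in for transcribed spoken documents.
docs = [
    "stock market falls as investors sell".split(),
    "market rally lifts stock prices".split(),
    "heavy rain floods the river valley".split(),
]
background = ml_model([w for d in docs for w in d])
query_model = ml_model("stock market".split())

# First round: the top-ranked documents act as pseudo-relevant documents.
top_docs = retrieve(query_model, docs, background, top_m=2)
```

The second round would call the same `retrieve` function with a reformulated query model in place of `query_model`.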
Hence, the relevance model can also adopt pseudo-relevance feedback, leveraging the top-ranked documents from the initial retrieval as pseudo-relevant documents to approximate the ideal relevance class and estimating the enhanced query model on top of these documents. It is worth mentioning that, unless otherwise noted, the initial retrieval in this study is implemented with the KL-divergence measure, where the query model $P(w|Q)$ is estimated with the ML estimator (cf. (5)) to obtain a top-ranked list of $M$ pseudo-relevant documents $D_{Top} = \{D_1, D_2, \ldots, D_M\}$ from the spoken document collection. An enhanced query model $P_{RM}(w|Q)$ is then constructed from these top-ranked documents. In addition, $P_{RM}(w|Q)$ can be combined with, or can replace, the original query model $P(w|Q)$ to form a better estimated (or enhanced) query model so as to identify more relevant documents. After query modeling, the final stage of pseudo-relevance feedback is to retrieve the final ranked list with the newly constructed relevance model.

To be more specific, based on the top-ranked list of $M$ pseudo-relevant documents, the joint probability $P_{RM}(Q, w)$ of $Q$ and $w$ being observed together in the relevance class $R_Q$ of $Q$ is formulated as follows:
$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(q_1, q_2, \ldots, q_L, w \mid D_m), \tag{9}$$
where $P(D_m)$ is the probability that document $D_m$ is randomly selected, and $P(q_1, q_2, \ldots, q_L, w \mid D_m)$ is the joint probability of $Q$ and $w$ co-occurring in $D_m$, which essentially assigns higher probability to words that co-occur with the query terms in $D_m$. Further assuming that words are conditionally independent given $D_m$ and that word order is not important (i.e., the “bag-of-words” assumption), the joint probability can be decomposed as follows:
$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(q_l \mid D_m). \tag{10}$$
The probability $P(D_m)$ of a pseudo-relevant document can simply be set to be uniform or be determined by the relevance degree of $D_m$ to $Q$. $P(w \mid D_m)$ and $P(q_l \mid D_m)$ are determined by ML estimation, which is based on the word occurrence counts in $D_m$.
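The decomposition of the joint probability under the bag-of-words assumption can be sketched as follows. This is a minimal illustration, assuming uniform document priors by default and unsmoothed ML document models, so a feedback document that lacks any query word contributes nothing. The function names, the toy data, and the final normalization over words (one common way to turn the joint probability into a distribution over $w$) are assumptions of this sketch rather than details given in the text.

```python
from collections import Counter

def ml_model(tokens):
    """Maximum-likelihood unigram model: relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relevance_model(query, feedback_docs, doc_priors=None):
    """Accumulate P(D_m) * P(w|D_m) * prod_l P(q_l|D_m) over feedback docs.

    Returns the joint scores normalized over w (the normalization is an
    assumption of this sketch).
    """
    m = len(feedback_docs)
    if doc_priors is None:
        doc_priors = [1.0 / m] * m  # uniform P(D_m)

    joint = Counter()
    for prior, tokens in zip(doc_priors, feedback_docs):
        p_d = ml_model(tokens)
        # Query likelihood prod_l P(q_l | D_m) under the bag-of-words assumption.
        query_lik = 1.0
        for q in query:
            query_lik *= p_d.get(q, 0.0)
        if query_lik == 0.0:
            continue  # unsmoothed ML: this document cannot generate the query
        for w, p_w in p_d.items():
            joint[w] += prior * p_w * query_lik

    total = sum(joint.values())
    if total == 0.0:
        return {}
    return {w: p / total for w, p in joint.items()}

# Toy pseudo-relevant documents (hypothetical data).
feedback_docs = [
    "stock market falls as investors sell".split(),
    "market rally lifts stock prices".split(),
]
rm = relevance_model(["stock", "market"], feedback_docs)
```

In this toy case the query words receive the highest probability because they occur in both documents, and words from the shorter document outrank those from the longer one because its ML query likelihood is larger.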
Even though different ways of deriving the relevance model $P_{RM}(w|Q)$ have been explored, the equation shown above (11) has been validated to be more effective and robust than the other variants across different collections [37].
(II). Topic-based Relevance Model
Apart from RM, this study also evaluates the topic-based relevance model, which leverages latent topic information for relevance modeling. To this end, a set of pre-defined latent topic variables $\{T_1, T_2, \ldots, T_K\}$ is assumed to describe the “word-document” co-occurrence characteristics among the pseudo-relevant documents obtained by the initial round of retrieval. Consequently, the probability of a word observed in a pseudo-relevant document $D_m$ is no longer estimated directly from the frequency of the word in the document, but is instead based on the likelihood that the document generates each topic and the probability of the word being observed in the respective latent topics:
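$$P(w \mid D_m) = \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid D_m).$$

The joint probability of $Q$ and $w$ is then expressed as:

$$P_{TRM}(Q, w) = \sum_{m=1}^{M} \sum_{k=1}^{K} P(D_m)\, P(T_k \mid D_m)\, P(w \mid T_k) \prod_{l=1}^{L} P(q_l \mid T_k). \tag{13}$$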
This is the topic-based relevance model (TRM), which employs a set of latent variables to reinterpret the probability that a word is observed in a pseudo-relevant document. In contrast to RM, TRM assumes that the word distributions across a set of latent topics obtained from all spoken documents in the collection may carry useful global topic information for relevance modeling.
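To make the topic-based formulation concrete, the sketch below evaluates the TRM joint probability from topic distributions that are assumed to be already trained. The function name `topic_relevance_model`, the two toy topics, and the topic mixtures of the two documents are hypothetical values chosen only for illustration.

```python
def topic_relevance_model(query, p_t_given_d, p_w_given_t, doc_priors=None):
    """Joint score: sum_m sum_k P(D_m) P(T_k|D_m) P(w|T_k) prod_l P(q_l|T_k).

    p_t_given_d: one list of topic probabilities per pseudo-relevant document.
    p_w_given_t: one {word: probability} dict per latent topic.
    """
    m = len(p_t_given_d)
    if doc_priors is None:
        doc_priors = [1.0 / m] * m  # uniform P(D_m)

    vocab = set().union(*p_w_given_t)
    joint = dict.fromkeys(vocab, 0.0)
    for prior, topic_mix in zip(doc_priors, p_t_given_d):
        for k, p_t in enumerate(topic_mix):
            word_dist = p_w_given_t[k]
            # Likelihood of all query words under topic T_k.
            query_lik = 1.0
            for q in query:
                query_lik *= word_dist.get(q, 0.0)
            weight = prior * p_t * query_lik
            for w, p_w in word_dist.items():
                joint[w] += weight * p_w
    return joint

# Two hypothetical topics: one finance-like, one weather-like.
p_w_given_t = [
    {"stock": 0.4, "market": 0.4, "rally": 0.2},
    {"rain": 0.6, "river": 0.3, "stock": 0.05, "market": 0.05},
]
# Hypothetical topic mixtures P(T_k | D_m) for two pseudo-relevant documents.
p_t_given_d = [[0.9, 0.1], [0.5, 0.5]]

trm = topic_relevance_model(["stock", "market"], p_t_given_d, p_w_given_t)
```

Here a word such as "rally" can score well even if it never appears in a feedback document, because it shares a topic with the query words; this is the global topic information that TRM is meant to exploit.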
In order to obtain the probabilities $P(w \mid T_k)$ and $P(T_k \mid D_m)$, we can employ PLSA or LDA, so that the topical probabilities are estimated by maximizing the total log-likelihood $\log L_{\mathbf{D}}$ of the spoken document collection $\mathbf{D}$. This maximization can be carried out with inference algorithms such as the expectation-maximization (EM) algorithm [38] with uniform priors, or the variational approximation algorithm [39] with Dirichlet priors. To be more specific, we take the EM algorithm as an example.
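As a rough sketch of how EM can estimate the PLSA parameters $P(w \mid T_k)$ and $P(T_k \mid D_m)$, the following pure-Python example alternates between computing topic posteriors for each word-document pair and re-normalizing the expected counts. The function name `plsa`, the random initialization, the iteration count, and the small numerical floor are assumptions of this illustration and are not taken from the study.

```python
import random
from collections import Counter

def plsa(doc_counts, vocab, num_topics, iters=20, seed=0):
    """EM training for PLSA.

    doc_counts: list of {word: count} dicts, one per document.
    Returns (p_w_t, p_t_d): a list of P(w|T_k) dicts and a list of
    P(T_k|D_m) lists.
    """
    rng = random.Random(seed)

    def normalize(values):
        s = sum(values)
        return [v / s for v in values]

    # Random positive initialization of both distributions.
    p_w_t = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(num_topics)]
    p_t_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in doc_counts]

    for _ in range(iters):
        new_w_t = [dict.fromkeys(vocab, 0.0) for _ in range(num_topics)]
        new_t_d = [[0.0] * num_topics for _ in doc_counts]
        for m, counts in enumerate(doc_counts):
            for w, c in counts.items():
                # E-step: posterior P(T_k | w, D_m); the tiny floor guards
                # against numerical underflow (an assumption of this sketch).
                post = normalize([p_w_t[k][w] * p_t_d[m][k] + 1e-12
                                  for k in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for k, r in enumerate(post):
                    new_w_t[k][w] += c * r
                    new_t_d[m][k] += c * r
        # M-step: re-normalize expected counts into probabilities.
        p_w_t = [dict(zip(vocab, normalize([t[w] for w in vocab])))
                 for t in new_w_t]
        p_t_d = [normalize(row) for row in new_t_d]
    return p_w_t, p_t_d

# Toy collection (hypothetical data).
doc_counts = [
    Counter("stock market stock rally".split()),
    Counter("market stock prices".split()),
    Counter("rain river rain flood".split()),
    Counter("river flood rain".split()),
]
vocab = sorted(set().union(*doc_counts))
p_w_t, p_t_d = plsa(doc_counts, vocab, num_topics=2)
```

The resulting `p_w_t` and `p_t_d` have the shapes expected by a topic-based relevance model: one word distribution per topic and one topic mixture per document.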
(III). Simple Mixture Model
One school of thought for deriving a feedback query model assumes that the words in the set of feedback documents $D_P$ are generated from two models: 1) the feedback model $P(w \mid FB)$ and 2) the background model $P(w \mid BG)$. This is known as the simple mixture model (SMM) [26]. The feedback model $P(w \mid FB)$ is estimated by maximizing the log-likelihood of the set of feedback documents $D_P$; this objective function (18) can be maximized with the EM algorithm via iterative maximization steps.
A schematic illustration of the SDR process is shown in Figure 1.
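As an illustration of the simple mixture model described above, the sketch below runs EM under the commonly used two-component formulation, in which each word occurrence in the feedback documents is generated by the feedback model with probability `alpha` and by the background model otherwise. The mixing weight `alpha`, its default value, the function name, and the toy data are assumptions of this example; the study's own update equations may differ in detail.

```python
from collections import Counter

def simple_mixture_model(feedback_docs, background, alpha=0.5, iters=30):
    """Estimate P(w|FB) by EM for a two-component mixture.

    Mixture: P(w) = alpha * P(w|FB) + (1 - alpha) * P(w|BG).
    The background model stays fixed; only P(w|FB) is re-estimated.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    p_fb = {w: c / total for w, c in counts.items()}  # start from ML estimate

    for _ in range(iters):
        expected = {}
        for w, c in counts.items():
            fb = alpha * p_fb[w]
            # Tiny floor for words missing from the background model.
            bg = (1.0 - alpha) * background.get(w, 1e-9)
            # E-step: probability that w was generated by the feedback model.
            expected[w] = c * fb / (fb + bg)
        # M-step: re-normalize the expected feedback counts.
        norm = sum(expected.values())
        p_fb = {w: e / norm for w, e in expected.items()}
    return p_fb

# Toy feedback documents and background model (hypothetical data).
feedback_docs = [
    "the stock market the".split(),
    "the stock rally".split(),
]
background = {"the": 0.7, "stock": 0.1, "market": 0.1, "rally": 0.1}

p_fb = simple_mixture_model(feedback_docs, background)
```

In this toy case the common word "the" ends up with a lower probability than its ML estimate of 3/7, because the background model already explains most of its occurrences, while "stock" gains probability mass.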