1 Introduction
1.1 Motivation
As multimedia technology evolves, ever-increasing volumes of content, whether static text or audio-visual media, have made tremendous amounts of information available. Accompanying the exponential proliferation of multimedia containing spoken documents, research on spoken document retrieval (SDR) has attracted growing interest from researchers and practitioners over the past two decades. Two main factors account for this: advances in automatic speech recognition (ASR), and the unprecedented volume of spoken content made available to the public, such as broadcast news stories, lecture and meeting recordings, telephone conversations, and many others [1-3].
1.1.1 Spoken Document Retrieval
Research on spoken term detection (STD) usually targets the extraction of spoken terms or phrases in a spoken document that literally match the query words or phrases [2]. Research on SDR, in contrast, pays more attention to the notion of the relevance of a spoken document to a given query [4]. Typically, a document is deemed relevant if it addresses the stated information need of the query, not merely because it contains all of the query terms [5].
Even though they used only imperfect transcripts derived from one-best recognition results, most retrieval systems that participated in the TREC-SDR evaluations reported that speech recognition errors do not seem to cause significant deterioration in retrieval quality [6]. This might be partly because TREC queries tend to be rather long and contain varied wording that describes a similar concept, which helps them match their relevant spoken documents. In addition, the same word is not always misrecognized throughout the corpus, and a query word (or phrase) may occur more than once within a truly relevant spoken document. Accordingly, although SDR may look like a solved problem, we believe three fundamental problems still have to be faced in practice:
(I). A query is often a short and vague expression of an underlying information need.
(II). Word usage mismatch between a query and a spoken document is likely to occur even when their terms are topically related.
(III). An imperfect speech recognition transcript carries erroneous information, which may drift away from the true theme of the spoken document.
The language modeling (LM) approach is one of the most popular paradigms for building SDR systems [7-10]. This is attributable to its neat formulation, which offers not only impressive retrieval performance but also a clear probabilistic interpretation [11]. In the LM approach, the relevance (or similarity) measure between a query and a spoken document is typically computed with one of two matching strategies, namely literal term matching and concept matching. For literal term matching, the most popular instantiation is the unigram language model (ULM) [7-10]. In this class of methods, each document is regarded as a generative model composed of a mixture of unigram (multinomial) distributions, which is used to compute the likelihood of generating the query. The query, usually expressed as a sequence of words (or index terms), is treated as an observation of the document model.
Accordingly, documents can be ranked by the likelihood that each of them generates the query, that is, by the so-called query-likelihood measure. Moreover, the position and order of terms in a document are assumed to be unimportant, which is the “bag-of-words” assumption. To improve on the ULM, a considerable body of work has sought to capture contextual information with n-grams of various orders or with grammatical structures; however, most of these efforts yield only mild gains or mixed results [11].
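The query-likelihood ranking described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the formulation used in the cited systems: the mixture is taken to be a two-component interpolation of a document model and a corpus-wide background model, and the weight `lam`, the toy documents, and all function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def query_log_likelihood(query, doc_tokens, background, lam=0.5):
    """log P(query | document) under a two-component unigram mixture.

    lam is an assumed interpolation weight for the document model;
    the remaining mass (1 - lam) goes to the background model.
    """
    doc_model = unigram_model(doc_tokens)
    score = 0.0
    for word in query:
        p = lam * doc_model.get(word, 0.0) + (1.0 - lam) * background.get(word, 0.0)
        if p == 0.0:
            # The word is unseen in both the document and the background.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (hypothetical data for illustration only).
docs = {
    "d1": "stock market news stock prices fall".split(),
    "d2": "weather news rain expected tomorrow".split(),
}
background = unigram_model([w for tokens in docs.values() for w in tokens])
query = "stock news".split()

# Rank documents by descending query likelihood.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], background),
    reverse=True,
)
```

Because the score is a sum of per-word log probabilities, reordering the query words leaves it unchanged, which is the bag-of-words assumption in action.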
Since the aforementioned methods rely on literal term matching, they inevitably confront the problem of word usage diversity: differences in word usage between a query and its relevant documents can degrade retrieval performance. To address this, a family of topic modeling methods has been proposed. Topic models attempt to capture the latent topic cues hidden in the query and the documents.
For instance, latent Dirichlet allocation (LDA) [12] and its precursor, probabilistic latent semantic analysis (PLSA) [13], are often treated as two typical examples of concept matching. Both employ a set of latent topic variables to describe the co-occurrence relationship between words and documents. The relevance measure between a query and a document is thus estimated not from literal term overlap but from the probability of the query words under the latent topics, together with the probability of those topics under the document, which amounts to a form of concept matching. Although many follow-up studies have sought to extend LDA and PLSA, empirical results suggest that more sophisticated (or complicated) topic models, such as the Pachinko allocation model (PAM), do not provide further benefits for retrieval [14,15].
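To make the idea of concept matching concrete, the following sketch scores a query by summing, for each query word, the product of the word probability under each topic and the topic probability under the document. The topic names, the hand-set probabilities, and the function name are all hypothetical; in PLSA or LDA these parameters would be estimated from data rather than written by hand.

```python
import math

# Hypothetical, hand-set parameters for two latent topics.
# P(word | topic): each row sums to 1.
word_given_topic = {
    "finance": {"stock": 0.4, "shares": 0.4, "news": 0.2},
    "weather": {"rain": 0.7, "news": 0.3},
}

# P(topic | document): each row sums to 1.
topic_given_doc = {
    "d1": {"finance": 0.9, "weather": 0.1},
    "d2": {"finance": 0.1, "weather": 0.9},
}


def topic_query_log_likelihood(query, doc_id):
    """log P(query | doc), where each word is generated through the topics:

    P(w | d) = sum over topics t of P(w | t) * P(t | d)
    """
    score = 0.0
    for word in query:
        p = sum(
            word_given_topic[topic].get(word, 0.0) * p_topic
            for topic, p_topic in topic_given_doc[doc_id].items()
        )
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# "shares" receives a high score for d1 through the finance topic,
# whether or not the word literally occurs in d1.
score_d1 = topic_query_log_likelihood(["shares"], "d1")
score_d2 = topic_query_log_likelihood(["shares"], "d2")
```

In this sketch a document is never compared with the query word by word; it is matched only through its topic proportions, which is how such models can sidestep word usage mismatch.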
Although most of the aforementioned retrieval methods can be applied to spoken documents as well as to text without adaptation, spoken documents still suffer from unique difficulties, such as speech recognition errors and redundant information. Apart from the many conventional studies that focus on boosting recognition accuracy, an intuitive idea is to directly develop more robust representations for spoken documents.
For instance, multiple recognition hypotheses beyond the top-scoring one can be used to derive alternative representations for the uncertain parts of spoken documents [1,2,10]. Another line of research uses subword-level index features, or a combination of word- and subword-level index features, to represent spoken documents, which has also been shown to benefit SDR. This may be attributed to the fact that incorrectly recognized spoken words often contain several correctly recognized subword-level units. As a result, retrieval based on subword-level indexing of spoken documents may benefit from partial matching [9,16,17].
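The benefit of partial matching can be illustrated with a small sketch. It assumes character bigrams as a stand-in for subword-level units (real SDR systems typically index phones or syllables), and the function names and example words are hypothetical.

```python
def char_ngrams(word, n=2):
    """Set of overlapping character n-grams, used here as stand-in subword units."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def subword_overlap(query_word, recognized_word, n=2):
    """Jaccard overlap between the subword units of two words (0.0 to 1.0)."""
    q_units = char_ngrams(query_word, n)
    r_units = char_ngrams(recognized_word, n)
    if not q_units or not r_units:
        return 0.0
    return len(q_units & r_units) / len(q_units | r_units)


# Suppose the spoken word "retrieval" was misrecognized as "retrieve".
# A word-level index sees no match, but most subword units still agree.
word_level_match = "retrieval" == "retrieve"
partial = subword_overlap("retrieval", "retrieve")
unrelated = subword_overlap("retrieval", "weather")
```

Here the word-level comparison fails outright, while the subword overlap remains high for the misrecognized word and drops to zero for an unrelated one.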
To better represent spoken documents, a large body of SDR research has been devoted to exploring more robust indexing or modeling techniques [4,9,10,16,17]. However, very limited work has addressed the other side of the coin, namely improving query modeling so that it better reflects the user’s underlying information need [18]. For the latter problem, we recently presented a new view of query modeling [18], which works with pseudo-relevance feedback [5] to leverage the notion of relevance [19] and shows preliminary promise for query reformulation. It is worth mentioning that this notion of relevance rests on the assumption that the small number of top-ranked feedback documents obtained from the initial round of retrieval are relevant. These documents, which largely determine the success of such query modeling, can be used to estimate a more precise query model for retrieving more relevant documents. This procedure is the so-called pseudo-relevance feedback.
Nevertheless, simply exploiting all of the top-ranked documents for query modeling (or reformulation) does not necessarily guarantee good performance, especially when the top-ranked documents contain much redundant or non-relevant information.
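The pseudo-relevance feedback procedure discussed above can be sketched as follows. This is a deliberately simple illustration and not the query modeling method of [18]: it assumes the new query model is a linear interpolation between the original query's unigram distribution and the pooled unigram distribution of the top-ranked documents, and the parameters `top_k` and `alpha`, the function names, and the toy data are all hypothetical.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def feedback_query_model(query, ranked_docs, top_k=2, alpha=0.5):
    """Re-estimate the query model from the top_k documents of an initial ranking.

    alpha is an assumed weight on the original query model; the remaining
    mass (1 - alpha) goes to the model pooled from the feedback documents.
    """
    original = unigram_model(query)
    # Treat the top_k documents as if they were relevant (the PRF assumption).
    feedback_tokens = [w for doc in ranked_docs[:top_k] for w in doc]
    feedback = unigram_model(feedback_tokens)
    vocabulary = set(original) | set(feedback)
    return {
        w: alpha * original.get(w, 0.0) + (1.0 - alpha) * feedback.get(w, 0.0)
        for w in vocabulary
    }


# Documents in the order returned by a hypothetical initial retrieval round.
ranked_docs = [
    "stock market shares".split(),
    "stock prices market".split(),
    "rain weather".split(),
]
expanded = feedback_query_model(["stock"], ranked_docs, top_k=2, alpha=0.5)
```

The expanded model assigns probability to related words such as "market" that the original query never mentioned. Raising `top_k` to 3 in this toy example would also pull in "rain" and "weather", which illustrates why using all top-ranked documents indiscriminately can hurt.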