Exploring Effective Pseudo-Relevance Feedback and Proximity Information for Speech Retrieval and Transcription


Master's Thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University. Advisor: Dr. 陳柏琳. Exploring Effective Pseudo-Relevance Feedback and Proximity Information for Speech Retrieval and Transcription. Graduate student: 陳憶文. July 2013 (ROC year 102).


Exploring Effective Pseudo-Relevance Feedback and Proximity Information for Speech Retrieval and Transcription
Master's Thesis by Yi-Wen Chen
Department of Computer Science and Information Engineering, National Taiwan Normal University
July 2013


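Abstract (translated from the Chinese)
Pseudo-relevance feedback is currently the most common paradigm for query reformulation. It assumes that the top-ranked documents from the initial round of retrieval are all relevant and can therefore all be used for query expansion. However, the documents obtained from the initial retrieval are very likely to contain both redundant and non-relevant information, so the reformulated query may not retrieve well. In view of this, this thesis explores the use of different information cues to select appropriate relevant documents from the spoken documents obtained in the initial retrieval for building the query representation, so that spoken document retrieval results become more accurate. On the other hand, although the relevance model can simplify model derivation and estimation through the bag-of-words assumption, it may thereby oversimplify the problem, especially when used as a language model for speech recognition. To adapt the relevance model, this thesis makes two contributions. First, it incorporates word proximity information into the relevance model to mitigate the unsuitability of the bag-of-words assumption for speech recognition. Second, it further explores topic proximity information to strengthen the proximity-based relevance model framework. Experimental results show that the proposed methods effectively improve on existing methods in both spoken document retrieval and speech recognition.
Keywords: spoken document retrieval, speech recognition, language modeling, pseudo-relevance feedback, proximity information.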

Abstract
Pseudo-relevance feedback is by far the most commonly used paradigm for query reformulation in spoken document retrieval. It assumes that a small number of top-ranked feedback documents obtained from the initial retrieval are relevant and can be utilized for query expansion. Nevertheless, simply taking all of the top-ranked feedback documents acquired from the initial retrieval for query modeling does not necessarily work well, especially when the top-ranked documents contain many redundant or non-relevant cues. In view of this, we explore different kinds of information cues for selecting helpful feedback documents to further improve information retrieval. On the other hand, the relevance model (RM), which is based on the “bag-of-words” assumption to facilitate derivation and estimation, may be oversimplified for the task of language modeling in speech recognition. Hence, we also enhance RM in two significant aspects. First, the “bag-of-words” assumption of RM is relaxed by incorporating word proximity information into the RM formulation. Second, topic-based proximity information is additionally explored to further enhance the proximity-based RM framework. Experiments conducted on both a spoken document retrieval task and a speech recognition task indicate that our approaches are competitive with existing ones.
Keywords: Spoken Document Retrieval, Speech Recognition, Language Modeling, Pseudo-Relevance Feedback, Proximity.


Contents

1 Introduction
  1.1 Motivation
    1.1.1 Spoken Document Retrieval
    1.1.2 Speech Recognition
  1.2 Contribution
  1.3 Outline of the Thesis
2 Related Work
  2.1 Language Modeling for Spoken Document Retrieval
    2.1.1 Retrieval Modeling Approaches
    2.1.2 Pseudo-Relevance Feedback
    2.1.3 Query Modeling
  2.2 Language Modeling for Speech Recognition
    2.2.1 N-gram Language Model
    2.2.2 Topic-based Language Models
    2.2.3 Trigger-based Language Model
    2.2.4 Recurrent Neural Network Language Model vs. Discriminative Language Model
    2.2.5 Relevance Modeling
3 Effective Pseudo-Relevance Feedback & Proximity Information
  3.1 Diversity Measure
  3.2 Density Measure
  3.3 Non-Relevance Measure
  3.4 Proximity Information for RM
  3.5 Topic-based Proximity Information for RM
4 Experiments on Spoken Document Retrieval
  4.1 Spoken Document Collections & Evaluation Metrics
  4.2 Subword-level Index Features
  4.3 Baseline Experiments
  4.4 Using Effective Pseudo-Relevance Feedback
  4.5 IDF-Based Term Weighting
  4.6 Fusion of Different Levels of Indexing Features
5 Experiments on Speech Recognition
  5.1 Speech Recognition Corpus & Evaluation Metrics
  5.2 Baseline Experiments
  5.3 Using Proximity Information
  5.4 Using Latent Topic Proximity Information
6 Conclusion and Future Work
Bibliography

List of Figures

Figure 1: A schematic illustration of the Pseudo-Relevance Feedback process
Figure 2: A schematic illustration of the Flexible Pseudo-Relevance Feedback process
Figure 3: A schematic illustration of the Active Pseudo-Relevance Feedback process
Figure 4: The Gapped Top K algorithm
Figure 5: A Cluster Centroid example
Figure 6: A diagram of density measure
Figure 7: The speech recognition results (in CER (%)) of PRM


List of Tables

Table 1: Retrieval results (in mAP) achieved by ULM and PLSA.
Table 2: Statistics for the TDT-2 Collections.
Table 3: Retrieval results (in mAP) achieved by various retrieval models.
Table 4: Retrieval results (in mAP) achieved by various combinations of retrieval models and feedback document selection methods.
Table 5: Retrieval results (in mAP) achieved when simply using the top 5, 10, 15, 25 or 30 documents obtained from the initial round of retrieval for constructing various query models.
Table 6: Statistics for the Speech Corpus.
Table 7: The speech recognition results (in CER (%)) of various language models compared in this study.
Table 8: The speech recognition results (in CER (%)) of PRM.
Table 9: The speech recognition results (in CER (%)) of TRM and PLSA, and their respective combinations with PRM.
Table 10: The p-values obtained from the paired t-test on the CER (%) of PRM with respect to that of RM, and on the CER (%) of PRM + TRM with respect to that of RM.


1 Introduction

1.1 Motivation
With the ongoing evolution of multimedia technology, ever-increasing amounts of content, whether static text or audio-visual media, have made tremendous amounts of information available. Accompanying the exponential proliferation of multimedia containing speech, research on spoken document retrieval (SDR) has received growing interest from researchers and practitioners over the past two decades. The two main reasons are the advances in automatic speech recognition (ASR) and the unprecedented volumes of multimedia associated with spoken documents made available to the public, such as broadcast news stories, lecture or meeting recordings, telephone conversations and many others [1-3].

1.1.1 Spoken Document Retrieval
Research on spoken term detection (STD) usually aims at extracting the spoken terms or phrases in a spoken document that could match
the query words or phrases literally [2]. However, research on SDR pays more attention to the notion of relevance of a spoken document to a given query [4]. Typically, a document is deemed relevant if it addresses the stated information need of the query, not merely because it contains all the query terms [5]. Even though they merely used imperfect transcripts produced from one-best recognition results, most retrieval systems that participated in the TREC-SDR evaluations claimed that speech recognition errors do not seem to cause very significant deterioration in retrieval quality [6]. This might be partly due to the fact that TREC queries tend to be rather long and contain different words that often describe a similar concept, which helps these queries match their relevant spoken documents. In addition, the same word is not always misrecognized throughout the corpus, and a query word (or phrase) may occur more than once within a truly relevant spoken document. Accordingly, though SDR may look like a solved problem, we believe three fundamental problems still have to be faced in practice:
(I) A query is often a short and vague expression of an underlying information need.
(II) Word usage mismatch between a query and a spoken document may occur even if their terms are topically related to each other.
(III) The imperfect speech recognition transcript carries wrong information, which
would drift away somewhat from representing the true theme of a spoken document.

The language modeling (LM) approach is by far one of the most popular paradigms for building SDR systems [7-10]. This is attributable to the fact that its neat formulation offers not only impressive retrieval performance but also a clear probabilistic interpretation [11]. In the LM approach, the relevance (or similarity) measure between a query and a spoken document is typically computed with one of two matching strategies, namely literal term matching and concept matching. For literal term matching, the most popular instantiation is the unigram language model (ULM) [7-10]. In this class of methods, each document is regarded as a generative model composed of a mixture of unigram (multinomial) distributions for computing the likelihood of generating a query, which is usually expressed as a sequence of words (or index terms) and treated as the observation of the document. Accordingly, documents can be ranked by their likelihood of generating the query, which is the so-called query-likelihood measure. Moreover, the position or order of terms in the document is assumed to be unimportant, which is the “bag-of-words” assumption. To improve ULM, a considerable amount of work has sought to glean contextual information with n-grams of various orders or with grammatical structures; however, most of it yields only mild gains or mixed results [11].

Since the aforementioned class of methods follows the idea of literal term matching, these methods inevitably confront the problem of word usage diversity, which may degrade retrieval performance when word usage differs between a given query and its relevant documents. Consequently, a family of topic modeling methods has been proposed. Topic models attempt to capture the latent topic cues hidden in the query and documents. For instance, latent Dirichlet allocation (LDA) [12] and its precursor, probabilistic latent semantic analysis (PLSA) [13], are often treated as two typical examples of concept matching. Both employ a set of latent topic variables to describe the co-occurrence relationship between a word and a document. Thus, the relevance measure between a query and a document is instead estimated from the probabilities of the query words occurring in the latent topics and the probabilities of the document generating the respective topics, which realizes a form of concept matching. Although many follow-up studies have been devoted to extending LDA and PLSA, empirical results imply that more sophisticated (or complicated) topic models, such as the Pachinko allocation model (PAM), cannot provide further benefits for retrieval [14,15].

Although most of the aforementioned retrieval methods can be applied not only to text but also to spoken documents without adaptation, the latter still suffer from
unique difficulties, such as speech recognition errors and redundant information. Apart from the many conventional studies that focus on boosting recognition accuracy, an intuitive idea is to directly develop more robust representations for spoken documents. For instance, in addition to the top-scoring hypothesis, multiple recognition hypotheses can be used to derive alternative representations for the uncertain parts of a spoken document [1,2,10]. Another line of research leverages subword-level index features, or the combination of word- and subword-level index features, to represent spoken documents, which has also been shown to benefit SDR. This may be attributed to the fact that incorrectly recognized spoken words often contain several correctly recognized subword-level units. As a result, retrieval based on subword-level indexing of spoken documents may gain from partial matching [9,16,17].

A large body of SDR research has been devoted to exploring more robust indexing or modeling techniques to better represent spoken documents [4,9,10,16,17]. However, very limited work has addressed the other side of the coin, namely improving query modeling to better reflect the user's underlying information need [18]. As for the latter problem, we recently presented a new view of query modeling [18], which works with pseudo-relevance feedback [5] to leverage the notion of relevance [19] and shows preliminary promise for query
reformulation. It is worth mentioning that the notion of relevance here is built on the assumption that the small number of top-ranked feedback documents obtained from the initial round of retrieval are relevant; this assumption largely determines the success of such query modeling. These documents can be used to estimate a more precise query model for retrieving more relevant documents, which is the so-called pseudo-relevance feedback. Nevertheless, simply exploiting all of the top-ranked documents for query modeling (or reformulation) does not necessarily yield good performance, especially when the top-ranked documents contain much redundant or non-relevant information.

1.1.2 Speech Recognition
In an automatic speech recognition (ASR) system, language modeling (LM) also plays a crucial role: it helps constrain the acoustic analysis, guide the search through multiple candidate word strings, and quantify the acceptability of the final output. Owing to its simplicity and predictive power, the n-gram model remains the predominant language model. A growing number of novel and ingenious LM techniques have been developed to complement or replace the n-gram model. A more recent school of thought is to build a language model by leveraging information cues extracted from pseudo-relevance feedback (PRF) to complement the n-gram model. For example, relevance modeling (RM) formulates the language model based on the notion of relevance, which can be approximated by PRF. RM, which explores the relevance
information inherent between the search history and an upcoming word, has exhibited preliminary promise for dynamic language model adaptation. Consequently, how to further explore useful cues from PRF for better estimating the relevance model is an interesting research issue.

1.2 Contribution
In view of the above-mentioned problems, our research is divided into two parts. First, for spoken document retrieval, we turn our attention to the more challenging problem of how to glean additional useful cues from the top-ranked feedback documents to achieve more accurate query modeling. To this end, several kinds of information cues are considered and integrated to select representative and useful feedback documents for better query reformulation, which leads to better retrieval performance. Furthermore, we investigate representing the query and documents with index features of different granularities, in conjunction with the various selection criteria for pseudo-relevance feedback. Finally, the utility of the methods derived from our framework is verified by extensive comparisons with several existing active feedback methods for pseudo-relevance feedback.

On the other hand, in terms of speech recognition, this thesis follows the general line of research that builds language models on top of the notion of relevance modeling
(RM) and makes two significant contributions. First, the so-called “bag-of-words” assumption of RM is relaxed by incorporating loose word order information and word proximity evidence into the RM formulation. Second, topic-based proximity information is additionally explored to further enhance the proximity-based RM language model.

1.3 Outline of the Thesis
The remainder of this thesis is structured as follows. Chapter 2 briefly reviews the theoretical underpinnings of the LM approach for both SDR and speech recognition, to show how language modeling figures prominently in these two fields of research. It also gives a concise introduction to existing variations of pseudo-relevance feedback and proximity information modeling techniques, followed by the basic foundation of the RM framework, which can leverage lexical co-occurrence in a systematic way for language modeling in speech recognition. In Chapter 3, we describe and explain the cues we explore to select representative feedback documents during pseudo-relevance feedback. Subsequently, we elucidate how proximity information cues are integrated into the formulation of RM for speech recognition. After that, the experimental settings and a series of retrieval experiments are presented in Chapter 4, and the speech recognition experiments in Chapter 5. Finally, Chapter 6 draws a conclusion from our study and suggests avenues for future work.

2 Related Work

In this chapter, we survey the literature on language modeling for spoken document retrieval and for speech recognition, respectively. We first present an overview of the retrieval modeling approaches for spoken document retrieval, and then review the major work to date on pseudo-relevance feedback. Then, a brief introduction to language modeling for speech recognition is provided, followed by the basic foundation of relevance modeling for speech recognition.

2.1 Language Modeling for Spoken Document Retrieval
Language modeling (LM), which assigns quantitative scores to sequences of words or tokens by means of a statistical mechanism, has long been an interesting yet challenging problem in the speech and language processing community [20,21]. For instance, it can be applied to facilitate the acoustic analysis in speech recognition, guide the search through the vast space of candidate word strings, and quantify the acceptability of the final output from the speech recognizer. This statistical
paradigm was first proposed for solving IR problems by [7,8], demonstrating very good potential, and was later also introduced to the field of SDR [4,9,10]. The language modeling (LM) approach is a recent trend in building SDR systems [4,10,18], which can be attributed to both its sound theoretical underpinnings and its impressive empirical performance.

2.1.1 Retrieval Modeling Approaches

(I) Unigram Language Model
The basic formulation of the LM approach to SDR is to compute the conditional probability $P(Q \mid D)$, i.e., the likelihood of a query $Q$ being generated by each spoken document $D$ (the so-called query-likelihood measure) [11]. A spoken document is deemed more relevant to a query if its document model is more likely to generate the query. The query $Q$ is treated as a sequence of words (or terms), $Q = q_1, q_2, \ldots, q_L$, where the query words are assumed to be conditionally independent given the document $D$ and their order is assumed to be of no importance (i.e., the so-called “bag-of-words” assumption). Thus, the similarity measure $P(Q \mid D)$ between a query and a document can be further decomposed as a product of the probabilities of the query words generated by the document:

$$\mathrm{SIM}_1(Q, D) = P(Q \mid D) = \prod_{l=1}^{L} P(q_l \mid D). \tag{1}$$
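
As an illustration only, Eq. (1) can be sketched in a few lines of Python. The sketch assumes that the document model is a maximum-likelihood unigram estimate linearly interpolated with a background unigram model; the interpolation weight `lam`, the floor value for unseen words, the toy documents, and the function name `query_log_likelihood` are all assumptions made for this example and are not taken from the thesis. The product in Eq. (1) is computed in the log domain to avoid numerical underflow.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, background, lam=0.5):
    """Return log P(Q|D) under the bag-of-words assumption of Eq. (1).

    query, doc : lists of word tokens
    background : dict mapping word -> background unigram probability
    lam        : hypothetical interpolation weight of the document model
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for q in query:
        # Maximum-likelihood estimate from the document itself.
        p_ml = doc_counts[q] / doc_len if doc_len else 0.0
        # Interpolate with the background model to avoid zero probabilities;
        # the 1e-9 floor for unseen words is an arbitrary illustrative choice.
        p = lam * p_ml + (1.0 - lam) * background.get(q, 1e-9)
        score += math.log(p)
    return score


# Toy collection (hypothetical data).
docs = {
    "d1": "speech retrieval with language model".split(),
    "d2": "weather report for taipei".split(),
}
all_words = [w for d in docs.values() for w in d]
background = {w: c / len(all_words) for w, c in Counter(all_words).items()}

query = "speech retrieval".split()
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], background),
    reverse=True,
)
print(ranking)
```

Because the score is a sum over query words, permuting the query leaves it unchanged, which is exactly what the bag-of-words assumption implies.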

In Eq. (1), $P(q_l \mid D)$ represents the probability of $D$ generating $q_l$ (a.k.a. the document model). In this study, the document model is constructed in two ways for each document $D$. One is to use the unigram language model (ULM). To this end, each document offers a unigram distribution for observing a query word, which is based on the empirical counts of words occurring in the document and estimated with the maximum likelihood (ML) estimator [11,20]. The document model is further smoothed by a background unigram language model estimated from a large general collection, both to model the general properties of the language and to avoid the problem of zero probability. However, how to strike the balance between these two probability distributions is a matter of judgment, or trial and error. The other is to employ a topic model, such as probabilistic latent semantic analysis (PLSA) or latent Dirichlet allocation (LDA), which calculates the query likelihood based on the probability of $q_l$ occurring in a given latent topic as well as the likelihood that the document $D$ generates the corresponding topic. Nevertheless, both PLSA and LDA merely provide coarse-grained latent semantic representations of the user's information need, and are essentially unable to distinguish fine-grained differences between semantically related words. In practice, combining them with ULM has been found to yield better retrieval quality [4,22]. For instance, as shown in Eq. (2) below, ULM can be linearly combined with PLSA (see Eq. (3)),
where their retrieval results, obtained with optimally tuned parameters on TDT-2, are shown in Table 1. The experimental results show the benefit of further combining ULM with PLSA.

$$P(Q \mid D) = \prod_{l=1}^{L} \left[ \lambda \cdot P(q_l \mid D) + (1 - \lambda) \cdot P_{\mathrm{PLSA}}(q_l \mid D) \right], \tag{2}$$

$$P_{\mathrm{PLSA}}(q_l \mid D) = \sum_{k=1}^{K} P(q_l \mid T_k)\, P(T_k \mid D). \tag{3}$$

Table 1: Retrieval results (in mAP) achieved by ULM and PLSA.

Dev.    ULM      PLSA
TD      0.371    0.418
SD      0.323    0.345

(II) Kullback-Leibler Divergence Measure
For SDR, another fundamental LM-based measure is the Kullback-Leibler (KL) divergence measure [11,23,24]:

$$\begin{aligned}
\mathrm{SIM}_2(Q, D) &= -\mathrm{KL}(Q \,\|\, D) \\
&= -\sum_{w \in V} P(w \mid Q) \log \frac{P(w \mid Q)}{P(w \mid D)} \\
&= \sum_{w \in V} P(w \mid Q) \log P(w \mid D) - \sum_{w \in V} P(w \mid Q) \log P(w \mid Q) \\
&\overset{\mathrm{rank}}{=} \sum_{w \in V} P(w \mid Q) \log P(w \mid D),
\end{aligned} \tag{4}$$

where the term $\sum_{w \in V} P(w \mid Q) \log P(w \mid Q)$ can be ignored, since for a given query the query entropy is identical for all documents and thus has no effect on the ranking of documents; $\overset{\mathrm{rank}}{=}$ denotes equivalence in terms of ranking results. Compared with (1), where a query $Q$ is regarded as a sample drawn from the language model of a possibly relevant document $D$, (4) entails designing the models of the query $Q$ and the
document $D$, which are conventionally formed as (unigram) language models (denoted by $P(w \mid Q)$ and $P(w \mid D)$, respectively) for observing any word $w$ in the vocabulary $V$. In practice, the degree of relevance of the document $D$ is measured by the value of $\mathrm{KL}(Q \,\|\, D)$ (a probability distance): the smaller the KL-divergence, the more relevant the document. The retrieval effectiveness of the KL-divergence measure depends largely on the accurate estimation of the query model $P(w \mid Q)$ and the document model $P(w \mid D)$. Moreover, it is easy to prove that the KL-divergence measure generates the same ranking as the query-likelihood measure when the query model $P(w \mid Q)$ is simply estimated with the ML estimator [25]:

$$\begin{aligned}
\mathrm{SIM}_2(Q, D) &\overset{\mathrm{rank}}{=} \sum_{w \in V} P(w \mid Q) \log P(w \mid D) \\
&= \sum_{w \in V} \frac{c(w, Q)}{|Q|} \log P(w \mid D) \\
&\overset{\mathrm{rank}}{=} \sum_{w \in V} c(w, Q) \log P(w \mid D) \\
&= \log P(Q \mid D) \\
&\overset{\mathrm{rank}}{=} P(Q \mid D).
\end{aligned} \tag{5}$$

In (5), $P(w \mid Q)$ is simply estimated as $c(w, Q) / |Q|$, where $c(w, Q)$ is the frequency of $w$ in $Q$ and $|Q|$ is the total number of words in the query $Q$. Accordingly, the KL-divergence measure can be viewed as a generalization of the query-likelihood measure, with the additional merit of being able to accommodate extra information cues to better estimate its component models (especially the query model) and thereby obtain the document ranking in a systematic way.

Because a query usually consists of only a few words, the query model
$P(w \mid Q)$, which is deemed to reflect the user's information need, might not be well estimated merely by the ML estimator. To alleviate this problem, a conventional approach is to explore extra cues to strengthen the query model in the KL-divergence measure, which is the so-called query expansion.

2.1.2 Pseudo-Relevance Feedback
As mentioned above, a query often consists of only a few words, which is usually too short to represent the user's information need. Therefore, estimating a query model with the ML estimator alone might not be appropriate. Furthermore, merely matching words between a query and documents might not be effective, as word overlap alone may not capture the semantic intent of the query. To cater for this, a conventional strategy is to adopt pseudo-relevance feedback, which performs two rounds of retrieval so as to retrieve more relevant documents, as illustrated in Figure 1. In the first round, the user's query is used by the SDR system to retrieve a small number of top-ranked feedback documents as pseudo-relevant documents. Subsequently, a refined query model is formulated by leveraging these top-ranked feedback documents to add possible query terms and to reweight the query terms, improving the query representation. After that, a second round of retrieval is conducted with this new and better estimated query model, again using the KL-divergence measure in (4). It is generally anticipated that the SDR
system can thus retrieve more relevant documents.

Figure 1: A schematic illustration of the Pseudo-Relevance Feedback process

Nevertheless, an LM-based SDR system implementing the pseudo-relevance feedback process faces two basic challenges. First, after the initial round of retrieval, how can the top-ranked feedback documents be purified so as to preserve relevant information while removing redundant and non-relevant information, that is, how can useful information be extracted from the top-ranked documents for query expansion? This problem is important because state-of-the-art query modeling techniques generally exploit the whole set of feedback documents as a unit without thoroughly checking the information within. Second, if the top-ranked documents can guarantee some degree of relevance and usefulness for query modeling, how can these selected feedback documents or their information be effectively leveraged for
estimating a more precise query model? As for the latter, probably owing to its effectiveness in retrieval performance, there has been much research on query modeling with pseudo-relevance feedback, such as the simple mixture model (SMM) [26], the relevance model (RM) [19] and their extensions [18], among others. Nevertheless, as far as we know, little work has addressed the other side of the coin, namely filtering helpful and representative feedback documents from the pseudo-relevance feedback for SDR query expansion. Research in text information retrieval (IR) has recently introduced several interesting algorithms that avoid redundant information and select more diversified feedback documents for query representation, such as the “Gapped Top K” and “Cluster Centroid” selection methods [27], among many others. These related approaches, which are based on different points of view, are briefly reviewed as follows.

(I) Local Context Analysis
Local Context Analysis, a variation of pseudo-relevance feedback, was proposed to avoid selecting expansion terms from non-relevant passages of the assumed-relevant documents. The essence of local context analysis is therefore to exploit the top-ranked passages, instead of the top-ranked documents, of the initial round of retrieval. Local context analysis is reported to be at least as effective as pseudo-relevance feedback [28,29].

(II) Flexible Pseudo-Relevance Feedback
Pseudo-relevance feedback (PRF), or local feedback (LF), is a well-known technique for improving average retrieval performance without human-judged relevant documents. However, a closer look often reveals that around one-third of search requests are actually harmed, yielding worse retrieval performance than the initial retrieval with the original query [30,31]. Even though its potential to improve average retrieval results fully automatically is known to hold not only for monolingual retrieval [28] but also for cross-language retrieval [29], users would probably not be pleased if automatic query expansion, after a long wait, ends up spoiling the original query and the retrieval performance. To improve pseudo-relevance feedback, some studies attempt to vary the number of expansion terms [32]. Others try to make it more reliable by proposing flexible local feedback (FLF), or flexible pseudo-relevance feedback (FPRF), which estimates not only the best number of expansion terms for each query but also the optimal number of top assumed-relevant documents [30], as illustrated in Figure 2. Before 2005, existing FPRF approaches determined the optimal numbers of top assumed-relevant documents and query expansion terms under the assumption that the small number of top-ranked documents are relevant. In 2005, approaches such as Selective Sampling and Selective Sampling with Memory
Resetting [33] explored the query word set within the top-ranked documents in order to skip some top-ranked documents and thereby avoid redundant ones. Since this study does not attempt to determine the optimal number of top assumed-relevant documents or the optimal number of query expansion terms for each individual query, our approach does not belong to this category; this school of thought might nonetheless be a good avenue for future work.

Figure 2: A schematic illustration of the Flexible Pseudo-Relevance Feedback process

(III) Active Pseudo-Relevance Feedback
Shen and Zhai (2005) proposed the notion of active feedback, which tackles this problem from the viewpoint of maximum learning benefit, that is, how to extract a useful subset of the feedback documents so that the retrieval system can obtain the maximum learning benefit. For active feedback, they developed two practical
(33) algorithms to avoid redundant information among the top-ranked documents and to search for more diverse documents instead, namely the Gapped Top K and K Cluster Centroid algorithms [27]. The whole procedure of this category is illustrated in Figure 3, where the blue documents represent the selected documents, whose selection is the primary goal of active pseudo-relevance feedback.

Figure 3: A schematic illustration of the Active Pseudo-Relevance Feedback process.

- Gapped Top K.

Suppose that K documents need to be selected from the top N documents for relevance feedback, where

$$N = (G + 1)K, \tag{6}$$

and $G$ is a small non-negative integer. To consider both relevance and diversity of a candidate document, one possible way is to cluster the top N documents into K

(34) groups based on their relevance scores and choose the document with the highest relevance score from each group. In this way, the first group consists of the top $G + 1$ documents in the ranking, the second group of the next $G + 1$ documents, and so on. Visualizing the method as in Figure 4 makes it clear why the authors named it “Gapped Top K”. The method also reduces to traditional pseudo-relevance feedback, which simply uses the top K documents of the initial ranking, when $G$ is equal to 0 and $K$ is set to $N$.

[Figure 4: ranked list of documents, rank 1 to rank 6]

Figure 4: The Gapped Top K algorithm. For example, selecting 2 documents from the top 6 documents of the initial round of retrieval, that is, $N = 6$, $K = 2$, and $G = 2$.
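The selection rule described here can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `gapped_top_k` and the document identifiers are hypothetical, and the input is assumed to be already sorted by relevance score in descending order.

```python
def gapped_top_k(ranked_docs, k, gap):
    """Select k documents from a relevance-ranked list, skipping `gap`
    documents after each selected one (hypothetical helper, for illustration).

    With gap = 0 this is the conventional Top K selection.
    """
    if k < 0 or gap < 0:
        raise ValueError("k and gap must be non-negative")
    step = gap + 1                      # each group holds gap + 1 documents
    return ranked_docs[::step][:k]      # take the top-ranked document of each group


# Example matching the setting N = 6, K = 2, G = 2.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
print(gapped_top_k(ranked, k=2, gap=2))   # ['d1', 'd4']
```

Because the groups are formed on the ranked list itself, taking the first document of each group is the same as taking the highest-scoring document of that group.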

(35) Figure 4 demonstrates an example of the Gapped Top K algorithm, where the top 6 documents are ranked by relevance score and then split into two groups according to that ranking. The documents marked with a red check on their left are the chosen ones, which have the highest relevance score in each group. In other words, the algorithm picks one document and skips two (the “gap” between chosen ones), repeating until the pre-defined number of selected documents is reached.

- Maximal Marginal Relevance.

Maximal Marginal Relevance (MMR) is a greedy algorithm in which documents are ranked based on relevance and non-redundancy cues [34,35]. Specifically, documents are selected iteratively, one at a time, from those not yet selected, each time optimizing the MMR function

$$s(d; D) = \lambda \cdot r(d) - (1 - \lambda) \cdot \max_{d' \in D} \mathrm{sim}(d, d'), \tag{7}$$

where $D$ is the set of already selected documents, $r(d)$ is a relevance scoring function, $\mathrm{sim}(d, d')$ is a similarity function, and $\lambda$ is a weighting parameter that balances relevance and non-redundancy. It is worth noting that when $\lambda = 1$, MMR reduces to the conventional Top K method as a special case.
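A minimal Python sketch of greedy MMR selection is given below. It assumes the common form of the criterion in which the redundancy term (the largest similarity to any already selected document) is subtracted from the weighted relevance; the function name `mmr_select`, the toy scores and the similarity function are hypothetical and serve only as an illustration.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedily pick k documents, trading relevance against redundancy.

    relevance:  dict mapping a document id to its relevance score r(d)
    similarity: function (d, d') -> similarity score
    lam:        weight on relevance; lam = 1 ignores redundancy entirely
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is less relevant but different.
rel = {"a": 0.9, "b": 0.85, "c": 0.5}

def sim(x, y):
    return 0.95 if {x, y} == {"a", "b"} else 0.1

print(mmr_select(["a", "b", "c"], rel, sim, k=2, lam=0.5))  # ['a', 'c']
print(mmr_select(["a", "b", "c"], rel, sim, k=2, lam=1.0))  # ['a', 'b']
```

With `lam=1.0` the redundancy term vanishes and the procedure returns the plain top-k documents, which matches the special case noted in the text.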

(36) - Cluster Centroid.

Apart from the aforementioned methods, a more straightforward way to model diversity is to divide the top N documents, ranked by relevance score, directly into K clusters and take a representative document from each cluster to construct a better feedback document subset. The diversity of the chosen documents rests on the clustering and on selecting only one document from each cluster. A representative document can be chosen for each cluster by different measures. One intuitive approach is to choose the document with the highest relevance score. Alternatively, choosing the centroid document, the one that maximizes the similarity to the other documents in the same cluster, gives a better representation.

Figure 5: A Cluster Centroid example.

As an example, each blue dot on the left side of Figure 5 indicates a top-ranked document. The dots inside a circle on the right-hand side of Figure 5 form a cluster, where

(37) the red dot stands for the selected documents. Selecting the feedback documents by Cluster Centroid thus amounts to extracting a representative document from each cluster, which is the ultimate goal of Cluster Centroid. Figure 5 illustrates the selection process when the number of top-ranked documents N is 6 and the predefined number of selected documents K is 2.

- Active-RDD.

A more recent and attractive approach to selecting helpful documents is Active-RDD (Active learning to achieve Relevance, Diversity and Density), which was originally proposed for text IR and is here introduced to SDR. As its name suggests, this algorithm uses relevance, diversity and density measures to select a better set of feedback documents for query expansion. Its objective function is expressed as

$$D^{*} = \arg\max_{D \in \mathbf{D}_{Top} - \mathbf{D}_{P}} \left[ (1 - \alpha - \beta) \cdot M_{Rel}(Q, D) + \alpha \cdot M_{Diversity}(D) + \beta \cdot M_{Density}(D) \right], \tag{8}$$

where $D^{*}$ stands for the document selected at each step for a better set of feedback documents; $\mathbf{D}_{Top}$ indicates the top-ranked documents from the initial retrieval; $\mathbf{D}_{P}$ represents the already selected feedback documents; $M_{Rel}(Q, D)$, $M_{Diversity}(D)$ and $M_{Density}(D)$ denote the measures of relevance, diversity and density, respectively; and $\alpha$ and $\beta$ are the weighting parameters. Implementation details of these three measures are given in Chapter 3. Active-RDD the most

(38) resembling method to ours in this study selects a document from the top-ranked documents at a time, simultaneously considering the relevance, diversity, and density of the document with respect to the set of already selected documents.

2.1.3 Query Modeling

In order to investigate the various feedback document selection methods and proximity information cues studied in this thesis, this section introduces the query models used for query reformulation: the relevance model (RM), the topic-based relevance model (TRM), and the simple mixture model (SMM).

(I). Relevance Model

Under the KL-divergence measure, one straightforward but effective technique for enhancing the query formulation is to leverage extra relevance cues related to the query through the so-called relevance model (RM) [19,36,37]. To realize this idea, each query $Q$ is assumed to be associated with an unknown relevance class $R_Q$. Relevant documents (those that satisfy the information need stated by the query) are in turn assumed to be samples drawn from $R_Q$. Strengthening the original query formulation with relevance information extracted from $R_Q$ is expected to help discriminate the content of relevant documents from that of non-relevant ones. Hence, the retrieval issue

(39) can be solved by finding a strategy to formulate the relevance model (RM), that is, the probability model $P_{RM}(w)$. The relevance model $P_{RM}(w)$, the probability of observing a word $w$ in a document related to the stated information need, viewed as a multinomial observation of $R_Q$, can be interpreted as randomly selecting a document from the relevance class and then observing a word from it. Estimating the enhanced query model depends largely on the ideal relevance class; in reality, however, there is no prior knowledge of how to find it. A common strategy is to employ the pseudo-relevance feedback process. As introduced above, the process typically performs two rounds of retrieval: the first round is conducted with the user-given query, and the second round relies on a query newly formulated from the small set of top-ranked documents of the initial retrieval. Hence the relevance model can also adopt pseudo-relevance feedback, using the top-ranked documents from the initial retrieval as pseudo-relevant documents to approximate the ideal relevance class and estimating the enhanced query model on top of these documents. Unless otherwise noted, the initial retrieval in this study is implemented with the KL-divergence measure, where the query model $P(w \mid Q)$ is estimated with the ML estimator (cf. (5)), to obtain a top-ranked list of $M$ pseudo-relevant documents $\mathbf{D}_{Top} = \{D_1, D_2, \ldots, D_M\}$ from the spoken document collection. An enhanced query model $P_{RM}(w \mid Q)$ is then constructed

(40) by these top-ranked documents. In addition, $P_{RM}(w \mid Q)$ can be combined with, or replace, the original query model $P(w \mid Q)$ to form a better-estimated (enhanced) query model so as to identify more relevant documents. After query modeling, the final stage of pseudo-relevance feedback is to retrieve the final ranked list with the newly constructed relevance model. More specifically, based on the top-ranked list of $M$ pseudo-relevant documents, the joint probability $P_{RM}(Q, w)$ of $Q$ and $w$ being observed together in the relevance class $R_Q$ of $Q$ is formulated as

$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(q_1, q_2, \ldots, q_L, w \mid D_m), \tag{9}$$

where $P(D_m)$ is the probability of document $D_m$ being randomly selected and $P(q_1, q_2, \ldots, q_L, w \mid D_m)$ is the joint probability of $Q$ and $w$ co-occurring in $D_m$, which essentially assigns higher probability to words that co-occur with the query terms in $D_m$. Further assuming that words are conditionally independent given $D_m$ and that word order is unimportant (i.e., the “bag-of-words” assumption), the joint probability can be decomposed as

$$P_{RM}(Q, w) = \sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(q_l \mid D_m). \tag{10}$$

The probability $P(D_m)$ of a pseudo-relevant document can simply be set uniform or determined by the degree of relevance of $D_m$ to $Q$. $P(w \mid D_m)$ and $P(q_l \mid D_m)$ are determined by ML estimation based on the word occurrence counts in $D_m$.
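As a rough illustration of the bag-of-words decomposition, the following Python sketch accumulates the joint score of the query and each word over the feedback documents and then normalizes over words to obtain a query model. The function names, the toy documents, the uniform document priors and the use of unsmoothed ML estimates are assumptions made for this example only; a real system would normally smooth the document models.

```python
import math
from collections import Counter, defaultdict


def ml_unigram(tokens):
    """Maximum-likelihood unigram model of one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def relevance_model(query, feedback_docs, doc_priors=None):
    """Estimate P_RM(w | Q) from tokenized pseudo-relevant documents.

    query:         list of query terms q_1 ... q_L
    feedback_docs: list of token lists, one per document D_m
    doc_priors:    optional list of P(D_m); uniform if omitted (assumption)
    """
    m = len(feedback_docs)
    if doc_priors is None:
        doc_priors = [1.0 / m] * m
    joint = defaultdict(float)                       # P_RM(Q, w)
    for prior, tokens in zip(doc_priors, feedback_docs):
        p = ml_unigram(tokens)
        # Query likelihood under this document; zero if a query term is
        # missing, because no smoothing is applied in this sketch.
        query_lik = math.prod(p.get(q, 0.0) for q in query)
        for w, p_w in p.items():
            joint[w] += prior * p_w * query_lik
    norm = sum(joint.values())                       # equals P_RM(Q)
    return {w: v / norm for w, v in joint.items()} if norm > 0 else {}


docs = [["a", "b", "b"], ["a", "c"]]
rm = relevance_model(["a"], docs)
print(sorted(rm.items()))
```

Documents that explain the query well (high query likelihood) contribute more of their word distribution to the resulting model, which is the intended effect of the weighting.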

(41) As a result, the enhanced query model $P_{RM}(w \mid Q)$ can be expressed as

$$P_{RM}(w \mid Q) = \frac{P_{RM}(Q, w)}{P_{RM}(Q)} = \frac{\sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(q_l \mid D_m)}{\sum_{m=1}^{M} P(D_m) \prod_{l=1}^{L} P(q_l \mid D_m)}. \tag{11}$$

Even though different ways of deriving the relevance model $P_{RM}(w \mid Q)$ have been explored, equation (11) has been shown to be more effective and robust than the other variants across different collections [37].

(II). Topic-based Relevance Model

Apart from RM, this study also evaluates the topic-based relevance model, which leverages latent topic information for relevance modeling. To this end, a set of pre-defined latent topic variables $\{T_1, T_2, \ldots, T_K\}$ is assumed to describe the “word-document” co-occurrence characteristics among the pseudo-relevant documents obtained by the initial round of retrieval. Consequently, the probability of a word observed in a pseudo-relevant document $D_m$ is no longer estimated directly from the frequency of the word in the document, but from the likelihood that the document generates each topic together with the probability of the word in the respective latent topic:

$$\tilde{P}(w \mid D_m) = \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid D_m). \tag{12}$$

The joint probability of $Q$ and $w$ occurring together in the relevance class $R_Q$ of $Q$, shown earlier in (10), is thus decomposed as

(42)

$$P_{TRM}(Q, w) = \sum_{m=1}^{M} \sum_{k=1}^{K} P(D_m)\, P(T_k \mid D_m)\, P(w \mid T_k) \prod_{l=1}^{L} P(q_l \mid T_k). \tag{13}$$

This is the topic-based relevance model (TRM), which employs a set of latent variables to reinterpret the probability that a word is observed in a pseudo-relevant document. In contrast to RM, TRM assumes that the word distribution across a set of latent topics obtained from all spoken documents in the collection may carry useful global topic information for relevance modeling. To obtain the probabilities $P(w \mid T_k)$ and $P(T_k \mid D_m)$, we can employ PLSA or LDA, so that the topical probabilities are estimated by maximizing the total log-likelihood $\log L_{\mathbf{D}}$ of the spoken document collection $\mathbf{D}$, using inference algorithms such as the expectation-maximization (EM) algorithm [38] with uniform priors or the variational approximation algorithm [39] with Dirichlet priors. Taking the EM algorithm as an example, the objective function for deriving $P(w \mid T_k)$ and $P(T_k \mid D_m)$ is defined as

$$\log L_{\mathbf{D}} = \sum_{D \in \mathbf{D}} \sum_{w_i \in D} c(w_i, D) \log \tilde{P}(w_i \mid D), \tag{14}$$

where $c(w_i, D)$ stands for the frequency count of $w_i$ occurring in $D$. The objective function (14) can then be maximized by iterating three update equations:

$$P(w_i \mid T_k) = \frac{\sum_{D \in \mathbf{D}} c(w_i, D)\, P(T_k \mid w_i, D)}{\sum_{D \in \mathbf{D}} \sum_{w_j \in D} c(w_j, D)\, P(T_k \mid w_j, D)}, \tag{15}$$

(43)

$$P(T_k \mid D) = \frac{\sum_{w_i \in D} c(w_i, D)\, P(T_k \mid w_i, D)}{\sum_{w_j \in D} c(w_j, D)}, \tag{16}$$

$$P(T_k \mid w_i, D) = \frac{P(w_i \mid T_k)\, P(T_k \mid D)}{\sum_{l=1}^{K} P(w_i \mid T_l)\, P(T_l \mid D)}, \tag{17}$$

where $P(T_k \mid w_i, D)$ represents the probability of the latent topic $T_k$ given a word $w_i$ and a document $D$. More precisely, $P(T_k \mid w_i, D)$ is estimated from $P(w_i \mid T_k)$ and $P(T_k \mid D)$ obtained in the previous training iteration.

(III). Simple Mixture Model

Another school of thought derives a feedback query model by assuming that the words in the set of feedback documents $\mathbf{D}_P$ are generated from two models: 1) the feedback model $P(w \mid FB)$ and 2) the background model $P(w \mid BG)$; this is the simple mixture model (SMM) [26]. The feedback model $P(w \mid FB)$ is estimated by maximizing, via the EM algorithm, the log-likelihood of the set of feedback documents $\mathbf{D}_P$:

$$LL(\mathbf{D}_P) = \sum_{D_j \in \mathbf{D}_P} \sum_{w \in V} c(w, D_j) \log \left[ \lambda \cdot P(w \mid FB) + (1 - \lambda) \cdot P(w \mid BG) \right], \tag{18}$$

where $c(w, D_j)$ is the number of times $w$ occurs in a feedback document $D_j$ and $\lambda$ is the weighting parameter, which accounts for the amount of background information (modeled by the background model $P(w \mid BG)$) in the feedback documents. To obtain the feedback model, the objective function (18) can also be

(44) maximized by the following EM algorithm via iterative maximization steps:

$$P^{(m)}(FB \mid w) = \frac{\lambda \cdot P^{(m)}(w \mid FB)}{\lambda \cdot P^{(m)}(w \mid FB) + (1 - \lambda) \cdot P(w \mid BG)} \tag{19}$$

and

$$P^{(m+1)}(w \mid FB) = \frac{\sum_{D_j \in \mathbf{D}_P} c(w, D_j) \cdot P^{(m)}(FB \mid w)}{\sum_{w'} \sum_{D_j \in \mathbf{D}_P} c(w', D_j) \cdot P^{(m)}(FB \mid w')}, \tag{20}$$

where $m$ indicates the $m$-th iteration of the EM algorithm and $c(w, D_j)$ is the frequency count of $w$ occurring in the feedback document $D_j$. After EM training, the feedback model $P(w \mid FB)$ can be used to support or replace the original query model. A schematic illustration of the SDR process is shown in Figure 1.

2.2 Language Modeling for Speech Recognition

In any large vocabulary continuous speech recognition (LVCSR) system, language modeling (LM) plays a critical and indispensable role [20]. This can be attributed to the fact that the LM assists the acoustic analysis, guides the search through a vast space of candidate word strings, and judges the quality or acceptability of the best output transcript of the speech recognizer.

2.2.1 N-gram Language Model

Owing to its inherent simplicity and predictive power, the n-gram language model [40,41], based on a statistical modeling paradigm, is still the most commonly used LM in speech recognition. The n-gram language models the regularity between the

(45) immediately preceding n-1 words and a newly decoded word $w_i$:

$$P(W) = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \tag{21}$$

where $P(W)$ stands for the probability of generating a word sequence $W = w_1, \ldots, w_m$. However, the n-gram language model, though good at modeling local contextual cues or lexical regularities of a language, inevitably faces problems on two fronts. First, it is brittle across domains: its performance is sensitive to test data whose topics differ from those of its training corpus. That is, the training corpus directly affects or limits its performance. Second, it misses the information (semantic or syntactic) carried in the history beyond the immediately preceding n-1 words of a newly decoded word.

2.2.2 Topic-based Language Models

For these reasons, a number of latent topic modeling approaches originally formulated for information retrieval (IR) [5], such as latent Dirichlet allocation (LDA) [13] and its precursor, probabilistic latent semantic analysis (PLSA) [12], have been introduced to dynamic language model adaptation and investigated as complements to n-gram models, with varying degrees of success [11,42,43]. Both LDA and PLSA exploit a set of latent topic variables to portray the “word-document” co-occurrence relationship. Similar to topic models in information retrieval, the relationship between an upcoming word and its preceding search history

(46) (regarded as a document in SDR) is reinterpreted by a set of predefined latent topics. That is, the search history predicts the subsequent decoded word based on the likelihood that the search history generates the topics and on the probability of the word observed in the respective latent topic. The main difference between LDA and PLSA lies in the inference of model parameters: the model parameters in PLSA are assumed to be fixed and unknown, whereas those in LDA are assumed to follow Dirichlet distributions.

2.2.3 Trigger-based Language Model

Apart from the topic models mentioned above, other researchers have developed a number of complementary approaches to the n-gram models, such as the trigger-based language model (TBLM) [44]. As the name suggests, the concept of word trigger pairs is formulated for language modeling. In TBLM, word trigger pairs are automatically generated to describe the co-occurrence relationship between the preceding history sequence $w_1, \ldots, w_{i-1}$ and the upcoming word $w_i$ as follows:

$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{1}{i - 1} \sum_{j=1}^{i-1} P(w_i \mid w_j). \tag{22}$$

The word trigger pairs are estimated from a prepared adaptation corpus, where the function that decides whether two words have a close (trigger-pair) relationship can be designed using mutual information inverse document frequency. As a

(47) language model for speech recognition, the triggers often exist in the preceding history words. Thus, TBLM can capture the associations between the words in the search history and an upcoming word.

2.2.4 Recurrent Neural Network Language Model vs. Discriminative Language Model

In recent years, the recurrent neural network language model (RNNLM) [45] and the discriminative language model (DLM) [46] have received considerable interest from both researchers and practitioners. The former maps both the preceding history and an upcoming decoded word into a continuous space and derives, in a recursive fashion, the probability of a decoded word following the history sequence. In contrast to RNNLM, DLM tries to discriminate the correct decoded word from incorrect recognition hypotheses using a rich set of lexical and/or syntactic features and a wide variety of training algorithms, aiming at better recognition results rather than relying solely on the distribution of the training data.

2.2.5 Relevance Modeling

In addition to the above LMs, a more recent school of thought leverages the notion of relevance to construct language models for speech recognition, namely relevance

(48) modeling (RM) [47]. The notion of relevance modeling, originally developed in information retrieval, has recently attracted much attention and been successfully applied to many IR tasks. Nevertheless, to the best of our knowledge, the effectiveness of relevance modeling for language modeling in speech recognition has so far been little investigated [47]. In speech recognition, the role of language modeling can be interpreted as estimating the conditional probability $P(w \mid H)$, in which $H$ is a search history, usually expressed as a sequence of words $H = h_1, h_2, \ldots, h_L$, and $w$ is a possible decoded word (i.e., an upcoming word) [20,40,41]. Whereas RM in SDR starts from a query, here each search history $H$ (which plays the role of the query in SDR) is assumed to be associated with a relevance class $R_H$, which can assist in predicting its immediately subsequent word $w$. As in RM for SDR, the decoded word $w$ is deemed relevant to $H$ if $w$ is drawn from the same relevance class $R_H$ as $H$ and has a higher probability of co-occurring with $H$. To this end, the joint probability of $H$ and $w$ being observed from $R_H$, i.e., $P_{RM}(H, w)$, can be used to derive the conditional probability $P(w \mid H)$ for speech recognition [47]. As with RM in SDR, since there is no prior knowledge about the ideal relevance class $R_H$ for each search history $H$, one possible strategy is to leverage a local feedback-like procedure, namely pseudo-relevance feedback, which takes $H$ as a

(49) query and can make an initial round of retrieval to obtain a top-ranked list of $M$ pseudo-relevant documents from a contemporaneous (or in-domain) corpus to approximate $R_H$, denoted as $\mathbf{D}_H = \{D_1, D_2, \ldots, D_M\}$. Accordingly, the joint probability of simultaneously observing $H$ and $w$ can be defined as

$$P_{RM}(H, w) = \sum_{m=1}^{M} P(D_m)\, P(H, w \mid D_m), \tag{23}$$

where $P(D_m)$ is the probability that $D_m$ is randomly selected from $R_H$ and $P(H, w \mid D_m)$ (or $P(h_1, h_2, \ldots, h_L, w \mid D_m)$) is the joint probability of observing $H$ together with $w$ in $D_m$. If words are further assumed to be conditionally independent given $D_m$ and word order is of no importance (i.e., the so-called “bag-of-words” assumption), equation (23) can be decomposed as a product of unigram probabilities of words observed from $D_m$:

$$P_{RM}(H, w) = \sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(h_l \mid D_m). \tag{24}$$

The probability $P(D_m)$ can simply be set uniform or weighted according to the relevance of $D_m$ to $H$. Both $P(w \mid D_m)$ and $P(h_l \mid D_m)$ are calculated from the word occurrence frequencies in a pseudo-relevant document, combined with the Bayesian or Jelinek-Mercer smoothing method [11]. As a result, the conditional probability $P(w \mid H)$ can be expressed as

$$P_{RM}(w \mid H) = \frac{P_{RM}(H, w)}{P_{RM}(H)} = \frac{\sum_{m=1}^{M} P(D_m)\, P(w \mid D_m) \prod_{l=1}^{L} P(h_l \mid D_m)}{\sum_{m=1}^{M} P(D_m) \prod_{l=1}^{L} P(h_l \mid D_m)}. \tag{25}$$

(50) If the language model probabilities are realized in the logarithmic domain, the implementation of (25) can be quite efficient [47]. Besides, RM can be combined with the baseline n-gram language model to obtain better recognition results, since the baseline n-gram language model, trained on a large general corpus, offers generic constraints on lexical regularities:

$$\tilde{P}(w \mid H) = \lambda \cdot P_{RM}(w \mid H) + (1 - \lambda) \cdot P_{n\text{-}gram}(w \mid H), \tag{26}$$

where $\lambda$ is the interpolation parameter, which balances the reliance on the RM model against that on the n-gram language model.
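Since recognizers usually store language model scores as log-probabilities, the linear interpolation of the two models can be carried out directly in the log domain. The Python sketch below shows one way to do this with the log-sum-exp trick; the function name `interpolated_log_prob` and the example probabilities are hypothetical and only illustrate the computation.

```python
import math


def interpolated_log_prob(log_p_rm, log_p_ngram, lam):
    """Return log(lam * P_RM + (1 - lam) * P_ngram) from log-probabilities.

    lam is the interpolation weight on the relevance model, 0 <= lam <= 1.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    terms = []
    if lam > 0.0:
        terms.append(math.log(lam) + log_p_rm)
    if lam < 1.0:
        terms.append(math.log1p(-lam) + log_p_ngram)
    peak = max(terms)
    if peak == float("-inf"):           # both components have zero probability
        return peak
    # Log-sum-exp: factor out the largest term to avoid underflow.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Example: P_RM(w|H) = 0.2, P_ngram(w|H) = 0.1, equal weights.
log_p = interpolated_log_prob(math.log(0.2), math.log(0.1), lam=0.5)
print(round(math.exp(log_p), 6))   # 0.15
```

Setting `lam` to 0 or 1 falls back to the n-gram model or the relevance model alone, which is convenient when tuning the weight on held-out data.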

(51) 3 Effective Pseudo-Relevance Feedback & Proximity Information

In order to effectively extract a smaller set of helpful, representative feedback documents from the small set of top-ranked documents obtained from initial retrieval, the ULM retrieval model is employed with the initial query to acquire a number of top-ranked documents $\mathbf{D}_{Top} = \{D_1, D_2, \ldots, D_M\}$. To select good feedback documents, each document $D$ in the top-ranked list $\mathbf{D}_{Top}$ is graded from four different points of view, namely relevance, non-relevance, diversity and density cues, and documents are then selected one at a time. Specifically, in the selection process, each candidate feedback document $D$ is scored by a linear combination of the measures of these cues:

$$D^{*} = \arg\max_{D \in \mathbf{D}_{Top} - \mathbf{D}_{P}} \left[ (1 - \alpha - \beta - \gamma) \cdot M_{Rel}(Q, D) + \alpha \cdot M_{NR}(Q, D) + \beta \cdot M_{Diversity}(D) + \gamma \cdot M_{Density}(D) \right], \tag{27}$$

where $D^{*}$ denotes the document selected at each step for the final feedback document set; $\mathbf{D}_{Top}$ is the top-ranked document set obtained from initial retrieval; $\mathbf{D}_{P}$ indicates the set of already selected feedback

(52) documents; $M_{Rel}(Q, D)$, $M_{NR}(Q, D)$, $M_{Diversity}(D)$ and $M_{Density}(D)$ stand for the measures of relevance, non-relevance, diversity and density of each candidate document $D$ in $\mathbf{D}_{Top}$; $\alpha$, $\beta$ and $\gamma$ are the weighting parameters that balance the importance of, or reliance on, these four cues. The final set of feedback documents is then built iteratively by selecting, one at a time, the document with the highest score in (27) from the small set of top-ranked documents until $\mathbf{D}_{P}$ reaches the pre-defined number of feedback documents. It is worth mentioning that the iterative selection in (27) to some extent resembles the maximal marginal relevance (MMR) ranking algorithm [34,48], which was originally developed for extractive document summarization. The measurement techniques for these four cues are detailed below.

First of all, $M_{Rel}(Q, D)$ denotes the relevance measure between query and document, which can be realized by the ULM retrieval model depicted in (1), the initial retrieval model used here.

3.1 Diversity Measure

In recent years, diversification of both retrieval results and pseudo-relevance feedback documents has attracted much attention in the text IR community, since conventional document ranking criteria often consider merely the relevance of a document to a given query and inevitably suffer from too many redundant documents in the top-ranked list, which may be quite annoying and even not

(53) effective for query modeling. In terms of pseudo-relevance feedback, if the top-ranked documents contain too much redundant information and are used to build the query model, the second round of retrieval, which is expected to return more relevant documents, may end up returning too many “redundant” documents to the user. Hence, it is desirable to consider diversity cues within the feedback documents, especially those used to construct a query model. For better query reformulation, the diversity measure of a candidate feedback document with respect to the already selected feedback documents $\mathbf{D}_{P}$ is defined as

$$M_{Diversity}(D) = \min_{D_j \in \mathbf{D}_{P}} \frac{1}{2} \left[ KL(D_j \,\|\, D) + KL(D \,\|\, D_j) \right]. \tag{28}$$

As shown above, the model distance between each candidate document and each already selected document is computed in turn, and the smallest distance is recorded as the candidate's diversity score; the objective function (27), which prefers candidates with higher scores, thereby achieves the maximum diversity effect.

3.2 Density Measure

Another interesting and effective approach is to take into account the structural or distributional information among the top-ranked documents [49]. To realize this idea, the average symmetric probability distance between a candidate document $D$ and all the other documents $D_h$ in $\mathbf{D}_{Top}$ is computed as follows:

(54)

$$M_{Density}(D) = -\frac{1}{|\mathbf{D}_{Top}| - 1} \sum_{\substack{D_h \in \mathbf{D}_{Top} \\ D_h \neq D}} \left[ KL(D_h \,\|\, D) + KL(D \,\|\, D_h) \right], \tag{29}$$

where $|\mathbf{D}_{Top}|$ indicates the total number of documents in $\mathbf{D}_{Top}$, and the negative sign makes a higher value correspond to a smaller average distance. Based on this similarity (or distance) measure, a document $D$ that is more similar (or closer) to all the other documents is deemed more representative of the group $\mathbf{D}_{Top}$. Thus, the density measure $M_{Density}(D)$ can be used to measure the representativeness of a document: a higher value of $M_{Density}(D)$ means that $D$ is more representative in $\mathbf{D}_{Top}$. That is, this information measures both the representativeness and the generality of a candidate document among the others. The measure can be visualized as follows:

Figure 6: A diagram of the density measure (outliers are labeled in the figure).

In Figure 6, each blue point stands for a candidate document among the top-ranked documents, and the similarity distance or relationship between any two blue points (namely, two assumed-relevant documents) is what the density measure equation

(55) attempts to capture. That is, if the average distance of an assumed-relevant document to all the others is small, the document is close to all the other top-ranked documents, which means it lies in a larger group of similar documents, like the blue points enclosed by the pink circle in Figure 7, that forms a dominant group among the pseudo-relevance feedback documents. A candidate document like this may be the most representative document in the candidate set. Conversely, if the average distance of a top-ranked document to the others is large, the document is far away from most of the top-ranked documents and is probably an outlier, like the points in the green circle of Figure 6. Therefore, if these assumed-relevant documents are actually relevant to some degree, this measure can be used to steer away from information or documents that are not in popular demand, which may increase the probability of satisfying the user's information need. The above three measures (relevance, diversity and density) are the primary features that Active-RDD considers.

3.3 Non-Relevance Measure

Several studies have investigated modeling non-relevance information and exploiting it to support information retrieval [18,50]. Ideally, a non-relevance model is estimated separately and uniquely for each given query $Q$. To this end, a straightforward approach is to exploit low-ranked documents acquired from the initial round of retrieval to construct

(56) the non-relevance model $P(w \mid NR_Q)$ specifically for each single query. Then, the non-relevance measure of a candidate feedback document $D$ can be expressed as

$$M_{NR}(D) = KL(NR_Q \,\|\, D), \tag{30}$$

where $NR_Q$ denotes the non-relevance model with respect to a query $Q$. It is worth noting that incorporating $M_{NR}(D)$ into feedback document selection constrains the selection process so that the selected feedback documents not only have a probability distance close to the original query model but also a probability distance far from the non-relevance model. This is quite useful when an uncertain document is judged only by its relevance measure, with no knowledge of the non-relevance information it may contain. However, non-relevance information is not easy to estimate correctly even though it is strongly related to the query: when the query itself cannot be estimated correctly, neither can the non-relevance model. Therefore, a model that is easier to obtain and estimate, and that can approximate non-relevance information, is needed. Because in reality the number of documents relevant to a given query is usually very small compared with the number of non-relevant ones in the whole retrieval corpus, the non-relevance model may be approximated by the entire spoken document collection. More

(57) specifically, the background language model $P(w \mid BG)$ can be an alternative estimate for the non-relevance model. Another advantage of using the background language model as an alternative is that the clarity measure computed against the background model has been shown to be able to detect vagueness in a query in text IR [51], which in turn can help feedback document selection identify vague expressions in a candidate document. As a result, in addition to the three measures introduced above, namely relevance, diversity and density, the background language model $P(w \mid BG)$ is further explored to support feedback document selection.

3.4 Proximity Information for RM

On the other hand, model-based pseudo-relevance feedback can also be an effective method for speech recognition. The relevance model (RM), based on the “bag-of-words” assumption, which facilitates derivation and estimation, may be oversimplified for the task of language modeling in speech recognition, which pays more attention to the regularity of a language, as the success of the n-gram language model shows. Therefore, in order to better adapt RM (cf. Section 2.2.5) for speech recognition, one possible approach is to take advantage of word order and adjacency relationships among the history words and the upcoming word in the pseudo-relevance feedback documents. To this end, the joint probability of observing $H = h_1, h_2, \ldots, h_L$ together with $w$ in a pseudo-relevant

(58) document Dm can be alternatively decomposed as follows, which simultaneously models the pairwise word order and (immediate or intermediate) adjacency relationships as well: ~ PRM ( H , w | Dm ). . .  P(h1 | Dm ) l 2 P(hl | hl 1 , Dm ) P( w | hL , Dm ), L. (31). where P(hl | hl 1 , Dm ) and P(w | hL , Dm ) resemble traditional bigram language model in capturing the pairwise proximity (namely, word order and adjacency) relationships between history words and the decoded word in a pseudo-relevant document. To be more specifically, the conditional probability P(w | hL , Dm ) (which is the building block of RM) is further realized by the following formulation: P( w | hL , Dm ) . C (hL, , w, Dm ) . w C (hL, , w, Dm ). (32). where C (hL , w, Dm ) stand for the frequency co-occurrence count of hL and. w. observed within a fixed-length sliding window in a pseudo-relevant document Dm , where the sliding window places immediately after each position of hL and has a window size of  words. It is worth noting that when the window size  is set to one, then the conditional probability P(w | hL , Dm ) in (31) is actually equivalent to a conventional bigram language model estimated by a document Dm . Therefore, it is quite interesting to modulate the window size and explore the impact of word proximity with different degree of closeness on relevance modeling. The resulting language model is hereafter named as the proximity-based RM model (PRM). 45.
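The windowed estimate in (32) can be illustrated with a minimal Python sketch. It assumes a document is already tokenized into a list of words, applies no smoothing, and uses the hypothetical helper names `window_cooccurrence_counts` and `proximity_prob`, which do not come from the thesis.

```python
from collections import Counter, defaultdict


def window_cooccurrence_counts(tokens, window):
    """For each word h, count the words found in the `window` positions right after h."""
    counts = defaultdict(Counter)
    for i, h in enumerate(tokens):
        # The sliding window starts immediately after position i.
        for w in tokens[i + 1 : i + 1 + window]:
            counts[h][w] += 1
    return counts


def proximity_prob(counts, h, w):
    """Unsmoothed relative frequency of w among the words co-occurring with h."""
    following = counts.get(h, Counter())
    total = sum(following.values())
    return following[w] / total if total else 0.0


doc = ["a", "b", "c", "a", "c"]
bigram_like = window_cooccurrence_counts(doc, window=1)
wider = window_cooccurrence_counts(doc, window=2)
print(proximity_prob(bigram_like, "a", "c"))
print(proximity_prob(wider, "a", "c"))
```

With a window of one word the counts reduce to ordinary bigram counts, which matches the remark about the conventional bigram model; a practical system would presumably also smooth these estimates.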

3.5 Topic-based Proximity Information for RM

Inspired by PLSA and LDA, latent topic information is explored in conjunction with RM [47]. To this end, a set of latent topic variables $\{T_1, T_2, \ldots, T_K\}$ is employed to portray the “word-document” co-occurrence relationship in the pseudo-relevant documents of a search history $H = h_1, h_2, \ldots, h_L$. Accordingly, the conditional probability that the search history $H$ and a decoded word $w$ are observed together in a pseudo-relevant document $D_m$ is not computed directly from the number of times $H$ and $w$ co-occur in $D_m$. Instead, it is based on how often $H$ and $w$ co-occur in the latent topics and on the likelihood that $D_m$ generates the respective topics:

$$\hat{P}_{RM}(H, w \mid D_m) = \sum_{k=1}^{K} \left[ \prod_{l=1}^{L} P(h_l \mid T_k) \right] P(w \mid T_k) \, P(T_k \mid D_m), \tag{33}$$

where the component probabilities can be estimated with the expectation-maximization (EM) algorithm [38]. Substituting (33) into (25) offers, to some extent, a mechanism for depicting the proximity information between the search history $H$ and the upcoming word $w$ in the latent topic space related to a pseudo-relevant document. The relevance model in (33) is referred to as the topic-based relevance model (TRM).
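As a rough illustration of how (33) is evaluated once the component probabilities are available, the following Python sketch computes the topic mixture for one document. The EM training step is not shown; the function name `trm_joint_prob` and the toy probability tables are assumptions made only for this example.

```python
import math


def trm_joint_prob(history, word, p_word_given_topic, p_topic_given_doc):
    """Sum over topics of prod_l P(h_l|T_k) * P(w|T_k) * P(T_k|D_m).

    p_word_given_topic: list of dicts, one per topic, mapping word -> P(word|T_k).
    p_topic_given_doc:  list of P(T_k|D_m) values for a single document D_m.
    """
    total = 0.0
    for topic_dist, p_topic in zip(p_word_given_topic, p_topic_given_doc):
        # Words unseen in a topic get probability 0.0 here (no smoothing).
        p_history = math.prod(topic_dist.get(h, 0.0) for h in history)
        total += p_history * topic_dist.get(word, 0.0) * p_topic
    return total


# Toy parameters for two latent topics (illustrative values only).
topics = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
doc_topic_weights = [0.6, 0.4]
print(trm_joint_prob(["a"], "b", topics, doc_topic_weights))
```

Because the history and the upcoming word interact only through the shared topic, two words that never co-occur in the document can still receive a non-zero joint probability.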

4 Experiments on Spoken Document Retrieval

4.1 Spoken Document Collections & Evaluation Metrics

For this study, we used the Topic Detection and Tracking collection (TDT-2) [52]. The spoken documents are Mandarin news stories gathered from Voice of America news broadcasts. The test queries were compiled from the title fields of Chinese text news stories from Xinhua News Agency. The corpus is therefore especially suitable for the task of news monitoring and tracking. All news stories were exhaustively labeled with event-based topics, and these labels serve as the relevance judgments for performance evaluation. Table 2 shows some basic statistics of the TDT-2 collection used in this study. The TDT-2 collection is used to tune the parameters and to observe the best performance of the various retrieval models. In addition, the number of latent topics used to build TRM, PLSA and LDA is set to 32. The number of pseudo-relevant documents acquired from the initial round of retrieval for the various query models is set to 25. It is worth mentioning that all the parameters used in this study can be further fine-tuned, via appropriate experimentation, to achieve better performance on different spoken document collections.

Table 2: Statistics for the TDT-2 collection.

                                        TDT-2 (1998, 02~06)
# Spoken documents                      2,265 stories, 46.03 hours of audio
# Distinct test queries                 16 Xinhua text stories (Topics 20001~20096)

                                        Min.    Max.    Med.    Mean
Doc. length (in characters)             23      4,841   153     287.1
Length of test query (in characters)    8       27      13      14
# Relevant doc. per test query          2       95      13      29.3

The Chinese word transcripts of the Mandarin audio collection (TDT-2) were produced by the Dragon large-vocabulary continuous speech recognizer. To evaluate the performance of Dragon’s recognizer, a fraction of TDT-2 (approximately 39.90 hours) was spot-checked. The word, character and syllable error rates are 35.38%, 17.69% and 13.00%, respectively. Because Dragon’s lexicon is not available, the LDC Mandarin Chinese Lexicon was augmented with 24k words extracted from Dragon’s word recognition output; for computing error rates, the manual transcripts were tokenized with the augmented LDC lexicon (about 51,000 words). The query sets were also tokenized with this augmented LDC lexicon in the retrieval experiments.

The retrieval results are expressed in terms of non-interpolated mean average precision (mAP), defined following the TREC evaluation [5,6]:

$$\mathrm{mAP} = \frac{1}{E} \sum_{i=1}^{E} \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{j}{r_{i,j}}, \tag{34}$$

where $E$ is the number of test queries, $N_i$ is the total number of relevant documents pertaining to query $Q_i$, and $r_{i,j}$ denotes the position (rank) of the $j$-th relevant document pertaining to query $Q_i$, counting down from the top of the ranked list.

4.2 Subword-level Index Features

Mandarin Chinese has an unknown number of words, although only some (e.g., about 80 thousand, depending on the domain) are commonly used. Each word consists of one or more characters, each of which is pronounced as a monosyllable and is a morpheme with its own meaning. Moreover, an inventory of about 6,000 characters provides almost full textual coverage of written Chinese. There is a many-to-many mapping between characters and syllables. These characteristics of the Chinese language lead to some special considerations when performing Mandarin Chinese speech recognition; for example, its evaluation is usually based on syllable and character accuracy rather than word accuracy. The same characteristics also lead to special considerations for SDR. First, word-level index features carry more semantic information than subword-level index features; consequently, word-based retrieval enhances precision. Second, subword-level index features are more robust against Chinese word tokenization ambiguity, homophone ambiguity, the open vocabulary problem, and speech recognition errors; as a result, subword-based retrieval enhances recall. Accordingly, it is beneficial to combine the information obtained from index features of different levels [9]. In this study, different levels of index features are used to construct the query and document models involved in the KL-divergence measure, including words, syllable-level units, and their combination. To this end, in addition to words, syllable pairs are taken as the basic units for indexing. Both the recognition transcript and the manual transcript of each spoken document, which were originally tokenized into words, were automatically converted into overlapping syllable pairs. Then, all the distinct syllable pairs occurring in the spoken document collection were identified and collected to construct a vocabulary (lexicon) of syllable pairs for indexing. Syllable pairs can thus be used in place of words to represent the query and the spoken documents, and thereby to construct the associated query and document models.

4.3 Baseline Experiments

In the first set of experiments, we compare the performance of RM, TRM and SMM when the top-ranked (i.e., top 25) documents obtained from the initial round of retrieval are used to construct the refined query models. The corresponding results are shown in Table 3, where the results of ULM and LDA (latent Dirichlet allocation) [12] are also listed for reference. LDA is a state-of-the-art (more sophisticated) LM-based retrieval model that employs a set of latent topics to represent (spoken) documents. It is worth mentioning that both ULM and LDA perform retrieval with the initial query only. A look at Table 3 reveals two noteworthy points. First, in terms of mAP, the performance gap between retrieval using the manual transcripts (denoted by TD) and retrieval using the recognition transcripts (denoted by SD) is about 0.05; this degradation is apparently small compared with the WER of the spoken documents [18]. Second, RM and SMM perform on a par with each other, and they yield substantial improvements over ULM (and are comparable to LDA). TRM outperforms both RM and SMM, confirming the merits of leveraging topical information for query modeling.

Table 3: Retrieval results (in mAP) achieved by various retrieval models.

        ULM     LDA     RM      TRM     SMM
TD      0.371   0.401   0.421   0.456   0.415
SD      0.323   0.341   0.369   0.397   0.361
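The mAP figures reported in Table 3 follow the definition in (34), which can be sketched in Python as below. This is a minimal illustration rather than the evaluation tool used in the thesis: the function names are hypothetical, each ranked list is assumed to contain no duplicate documents, and a relevant document that never appears in the ranked list is assumed to contribute zero.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query: (1/N) * sum_j (j / r_j)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1              # this is the j-th relevant document found
            total += hits / rank   # j divided by its rank r_j
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of per-query AP over a list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(mean_average_precision(runs))
```

In the first toy query the relevant documents sit at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2; the second query contributes 1/2.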