
國立臺灣大學電機資訊學院資訊工程研究所 博士論文

Department of Computer Science and Information Engineering College of Electrical Engineering and Computer Science

National Taiwan University Doctoral Dissertation

統計式語言模型 – 語音文件標記、檢索以及摘要 Statistical Language Modeling –

Spoken Document Indexing, Retrieval and Summarization

陳冠宇 Kuan-Yu Chen

指導教授:陳信希 博士、王新民 博士

Advisor: Hsin-Hsi Chen, Ph.D. and Hsin-Min Wang, Ph.D.

中華民國 104 年 4 月

April, 2015


DEDICATION

This thesis is dedicated to my family and girlfriend!


中文摘要

The ever-increasing amount of readily available multimedia content has made spoken document understanding and organization important research topics over the past two decades. Among the wide variety of related studies, spoken document indexing, retrieval, and summarization are regarded as essential and fundamental research problems in this field. Statistical language modeling, which is mainly used to quantify how likely a piece of text is to occur in a natural language, has long been an interesting and highly challenging research area. Many previous studies have been devoted to applying language models to spoken document processing tasks, and most of them have presented rich and remarkable experimental results. In view of the importance of language modeling to spoken document processing, this thesis takes language modeling as its backbone and further investigates spoken document indexing, retrieval, and summarization.

Since the queries given by users are usually very short, which is a major challenge faced by information retrieval systems, this thesis starts from this problem: in addition to an extensive study of previously proposed methods, it offers a unified view of the conventional approaches and further applies this family of techniques to spoken document summarization. Next, inspired by the i-vector technique, this thesis proposes a novel language modeling method and further combines it with pseudo-relevance feedback to improve spoken document retrieval performance. We also observe that, although language models have been used for spoken document summarization, the techniques adopted in the past are mostly unigram-based and cannot take long-range semantic information into account; in view of this, the thesis proposes training a recurrent neural network language model with a curriculum learning strategy, which successfully improves spoken document summarization. Finally, as the development of language modeling has gradually shifted from modeling to vectorization, this thesis proposes novel similarity measures that match well with the various word embedding representations proposed in recent years and applies them to spoken document summarization; in addition, it proposes probabilistic word embedding representations, which not only inherit the advantages of conventional representations but also effectively compensate for the limited interpretability of current word embeddings.


ABSTRACT

The inestimable volumes of multimedia associated with spoken documents that have been made available to the public in the past two decades have brought spoken document understanding and organization to the forefront as subjects of research. Among all the related subtasks, spoken document indexing, retrieval and summarization can be thought of as the cornerstones of this research area. Statistical language modeling (LM), which purports to quantify the acceptability of a given piece of text, has long been an interesting yet challenging research area. Much research shows that language modeling for spoken document processing has enjoyed remarkable empirical success. Motivated by the great importance of and interest in language modeling for various spoken document processing tasks (i.e., indexing, retrieval and summarization), this thesis takes language modeling as its backbone.

In real-world applications, a serious challenge faced by search engines is that queries usually consist of only a few words to express users’ information needs. This thesis starts with a general survey of this practical challenge, and then not only proposes a principled framework which can unify the relationships among several widely-used approaches but also extends this school of techniques to spoken document summarization tasks.

Next, inspired by the concept of the i-vector technique, an i-vector based language modeling framework is proposed for spoken document retrieval and further extended to more accurately represent users’ information needs.

Next, language models have shown preliminary success in extractive speech summarization, but a central challenge facing the LM approach is how


to formulate sentence models and accurately estimate their parameters for each sentence in the spoken document to be summarized. Thus, in this thesis we propose a framework which builds on the notion of recurrent neural network language models and a curriculum learning strategy, which shows promise in capturing not only word usage cues but also long-span structural information about word co-occurrence relationships within spoken documents, thus eliminating the need for the strict bag-of-words assumption made by most existing LM-based methods.

Lastly, word embedding has recently become a popular research area due to its excellent performance in many natural language processing (NLP)-related tasks. However, as far as we are aware, relatively few studies have investigated its use in extractive text or speech summarization. This thesis first focuses on building novel and efficient ranking models based on general word embedding methods for extractive speech summarization. It then proposes a novel probabilistic modeling framework for learning word and sentence representations, which not only inherits the advantages of the original word embedding methods but also boasts a clear and rigorous probabilistic foundation.


CONTENTS

中文摘要

ABSTRACT

CONTENTS

LIST OF FIGURES

LIST OF TABLES

Chapter 1 Introduction
1.1 Spoken Document Processing
1.2 Organization of the Thesis

Chapter 2 Overview of Related Literature
2.1 Statistical Language Modeling
2.1.1 Word-Regularity Language Modeling
2.1.2 Topic Language Modeling
2.1.3 Continuous Language Modeling
2.1.4 Neural Network-based Language Modeling
2.2 Spoken Document Indexing and Retrieval
2.2.1 Language Modeling for Spoken Document Retrieval
2.2.1.1 Query-Likelihood Measure
2.2.1.2 Kullback-Leibler (KL)-Divergence Measure
2.3 Speech Summarization
2.3.1 Language Modeling for Speech Summarization

Chapter 3 Speech and Language Corpora & Evaluation Metrics
3.1 Data Sets for Spoken Document Indexing and Retrieval
3.1.1 Subword-level Index Units
3.1.2 Evaluation Metrics
3.1.3 Baseline Experiments
3.2 Data Sets for Speech Summarization
3.2.1 Performance Evaluation
3.2.2 Baseline Experiments

Chapter 4 A Unified Framework for Pseudo-Relevance Feedback
4.1 Pseudo-Relevance Feedback
4.1.1 Relevance Modeling (RM)
4.1.2 Simple Mixture Model (SMM)
4.1.3 Regularized Simple Mixture Model (RSMM)
4.2 A Unified Framework
4.3 Query-specific Mixture Modeling (QMM)
4.4 Experimental Results

Chapter 5 An I-vector based Language Modeling Framework for Retrieval
5.1 I-vector based Language Modeling
5.1.1 Experimental Results
5.2 Improved Query Representation with IVLM
5.2.1 Sample Pooling
5.2.2 I-vector Pooling
5.2.3 Model Pooling
5.2.4 Experimental Results

Chapter 6 A RNNLM-based Framework for Summarization
6.1 Recurrent Neural Network Language Modeling for Speech Summarization
6.2 Experimental Results
6.2.1 Experiments on Higher-order N-gram and Topic Language Modeling
6.2.2 Experiments on the Proposed RNNLM Summarizer
6.2.3 More Empirical Analysis of the RNNLM Summarizer
6.2.4 Further Extensions on RNNLM Summarizer
6.2.5 RNNLM with Syllable-level Index Units
6.2.6 Coupling RNNLM with Extra Acoustic Features

Chapter 7 A Word Embedding Framework for Summarization
7.1 Classic Word Embedding Methods
7.1.1 Continuous Bag-of-Words (CBOW) Model
7.1.2 Skip-Gram (SG) Model
7.1.3 Global Vector (GloVe) Model
7.1.4 Analytic Comparisons
7.2 Leveraging Word Embeddings for Summarization
7.2.1 Cosine Similarity Measure
7.2.2 The Triplet Learning Model
7.2.3 Document Likelihood Measure
7.2.4 Experimental Results
7.3 Probabilistic Word Embeddings
7.3.1 Probabilistic Bag-of-Words (PBOW) Model
7.3.2 Probabilistic Skip-gram (PSG) Model
7.3.3 Analytic Comparisons
7.3.4 Experimental Results

Chapter 8 Conclusion and Outlook

REFERENCE


LIST OF FIGURES

Figure 2.1 Several state-of-the-art language models.
Figure 4.1 A toy example of a user going to a search engine.
Figure 4.2 A schematic illustration of the SDR process with pseudo-relevance feedback.
Figure 5.1 Retrieval results (in MAP) of i-vector based query representation techniques, relevance model (RM), and simple mixture model (SMM) with word- and subword-level index features.
Figure 6.1 A schematic depiction of the fundamental network of RNNLM.
Figure 6.2 A sketch of the proposed RNNLM summarization framework.
Figure 6.3 Summarization results (in ROUGE-2) for each individual document (represented with either manual or speech transcript) in the test set, respectively, achieved by ULM and RNNLM+ULM.
Figure 7.1 A running toy example for learning disparate distributional representations of a specific word w5, where the training corpus contains three documents, the vocabulary size is 9 (i.e., having words w1,…,w9) and the context window size is 1 (i.e., c=1).
Figure 8.1 The important language models and the proposed frameworks are summarized year by year.


LIST OF TABLES

Table 3.1 Statistics for TDT-2 collection used for spoken document retrieval.
Table 3.2 Retrieval results (in MAP) of different retrieval models with word- and subword-level index features.
Table 3.3 The statistical information of the broadcast news documents used for the summarization.
Table 3.4 The agreement among the subjects for important sentence ranking for the evaluation set.
Table 3.5 Summarization results achieved by a few well-studied or/and state-of-the-art unsupervised methods.
Table 4.1 The summarization results (in F-scores(%)) achieved by various language models along with text and spoken documents.
Table 5.1 Retrieval results (in MAP) of IVLM with word- and subword-level index features for short and long queries using inductive and transductive learning strategies.
Table 5.2 Retrieval results (in MAP) of different approaches with word- and subword-level index features for short and long queries.
Table 5.3 Retrieval results (in MAP) of different pooling methods with word- and subword-level index features with respect to the number of references (|R|).
Table 6.1 Training of RNNLM-based sentence models and the application of them for important sentence ranking.
Table 6.2 Summarization results achieved by various LM-based methods, including ULM, BLM, PLSA, PLSA+ULM, RNNLM and RNNLM+ULM.
Table 6.3 Summarization results respectively achieved by ULM and RNNLM+ULM with respect to different summarization ratios.
Table 6.4 Summarization results achieved by the proposed framework and a few well-studied or/and state-of-the-art unsupervised methods, which were measured by using the abstractive summaries written by the human subjects as the ground truth.
Table 6.5 Summarization results achieved by RNNLM+ULM with respect to different numbers of hidden-layer neurons being used.
Table 6.6 Summarization results achieved by RNNLM+ULM, MMR, ILP and their combinations.
Table 6.7 Summarization results achieved by ULM, RNNLM and RNNLM+ULM in conjunction with syllable-level index features.
Table 6.8 Four types of acoustic features used to represent each spoken sentence.
Table 6.9 Summarization results achieved by using acoustic features in isolation and its combination with ULM, RNNLM and ULM+RNNLM based sentence ranking scores, respectively.
Table 7.1 Summarization results achieved by various word-embedding methods in conjunction with the cosine similarity measure.
Table 7.2 Summarization results achieved by various word-embedding methods in conjunction with the triplet learning model.
Table 7.3 Summarization results achieved by various word-embedding methods in conjunction with the document likelihood measure.
Table 7.4 Summarization results achieved by various word-embedding methods in conjunction with the cosine similarity measure.
Table 7.5 Summarization results achieved by various word-embedding methods in conjunction with the document likelihood measure.


Chapter 1 Introduction

Before 2000, speech processing and text processing were largely separate research areas. Speech recognition [79][82], speaker identification and verification [168], voice synthesis [3], and so forth were important subtopics in the speech processing community. At the same time, information retrieval [177][179], language modeling [82][165], and summarization [130][131] were popular directions in text processing research. Since then, the rapid development of technology (especially computing hardware), the popularity of the Internet, and the rise of handheld devices have led to a considerable amount of research on spoken document processing [22][100][106].

1.1 Spoken Document Processing

Along with the growing popularity of Internet applications, ever-increasing volumes of multimedia, such as broadcast radio and television programs, lecture recordings, and digital archives, are being made available in our daily life. Clearly, speech itself is one of the most important sources of information within multimedia. Users can efficiently listen to and digest multimedia associated with spoken documents by virtue of spoken content processing, which includes spoken document indexing, retrieval, and summarization [60][109][158].

A significant amount of effort has been put towards researching robust indexing (or representation) techniques [51][77][150] so as to extract probable spoken terms or phrases inherent in a spoken document that can match query words or phrases literally.

On the other hand, spoken document retrieval (SDR), which revolves more around the notion of the relevance of a spoken document in response to a query, has also been a


prominent subject of much recent research. It is generally agreed that a document is relevant to a query if it addresses the stated information need of the query, but not merely because it happens to contain all the words in the query [132]. We have also witnessed a flurry of research activity aimed at the development of novel and ingenious methods for speech summarization, the aim of which is to generate a concise summary to help users efficiently review or quickly assimilate the important information conveyed by either a single spoken document or multiple spoken documents [59][125][136][151][163][206]. The dramatic growth of these studies is due in large part to advances in automatic speech recognition [60][157] and the ever-increasing volumes of multimedia associated with spoken documents made available to the public [109][158].

Beginning in the late 20th century, statistical language modeling has been successfully applied to various NLP-related applications, such as speech recognition [37][79][83], information retrieval [165][186][202], document summarization [25][29][115], and spelling error detection and correction [33][40][118][197]. Language modeling (LM) provides a statistical mechanism to associate quantitative scores to sequences of words or tokens. By far, the most widely-used and well-practiced language model is the n-gram language model [37][82], because of its simplicity and moderately good predictive power. For instance, in speech recognition, it can be used to constrain the acoustic analysis, guide the search through the vast space of candidate word strings, and quantify the acceptability of the final output from the speech recognizer [156][200].

This statistical paradigm was first introduced for information retrieval (IR) problems by Ponte and Croft (1998) [165], Song and Croft (1999) [186] and Miller, Leek, and Schwartz (1999) [143], demonstrating good success, and was then extended in a number of publications [31][50][99][203].


The n-gram language model, which determines the probability of an upcoming word given the previous n-1 word history, is the most commonly used language model because of its neat formulation and good predictive power. Nevertheless, the n-gram language model, as it only captures local contextual information and the lexical regularity of a language, is inevitably faced with two fundamental problems. On one hand, it is brittle across domains, since its performance is sensitive to changes in the genre or topic of the text on which it is trained. On the other hand, due to its limitation in scope, it fails to capture information (either semantic or syntactic) conveyed in the contextual history beyond its order (e.g., a trigram language model is limited to two words of context).

Motivated by the great importance of and interest in language modeling for various spoken document processing tasks, language modeling is the backbone of this thesis.

Three subtasks (spoken document indexing, retrieval and summarization) are considered, and several insights are shared and methods proposed to unify conventional approaches or to make further progress in spoken document processing.

1.2 Organization of the Thesis

The remainder of this thesis is organized as follows:

Chapter 2 is a brief introduction to statistical language modeling, including word-regularity models, topic models, continuous language models and neural network-based language models. Also, spoken document indexing, retrieval and summarization are discussed.

Chapter 3 presents the experimental data sets, settings, and evaluation metrics for spoken document retrieval and summarization, as well as the baseline results.


Chapter 4 focuses on analyzing pseudo-relevance feedback for query reformulation approaches, and then presents a continuation of this general line of research. The main contribution here is two-fold. First, the thesis proposes a principled framework which unifies the relationships among several widely-used query modeling formulations.

Second, on top of this successfully developed framework, an extended query modeling formulation is introduced by incorporating critical query-specific information cues to guide model estimation.

In Chapter 5 an i-vector based language modeling framework, stemming from the state-of-the-art i-vector framework for language identification and speaker recognition, is proposed and formulated to represent documents for spoken document retrieval. Also described in detail in this chapter are three novel methods to be applied in concert with i-vector based language modeling to more accurately represent user information needs.

Chapter 6 proposes a novel and effective recurrent neural network language modeling framework for speech summarization, on top of which the deduced sentence models are able to capture not only word usage cues but also long-span structural information about word co-occurrence relationships within spoken documents, thus eliminating the need for the strict bag-of-words assumption. In addition, the utility of the method originating from the proposed framework and that of several widely-used unsupervised methods are analyzed and compared extensively.

Beyond the effort made to improve word representations, Chapter 7 focuses on building novel and efficient ranking models based on general word embedding methods for extractive speech summarization. After that, the chapter also introduces a novel probabilistic modeling framework for learning word and sentence representations, which not only inherits the advantages from the original word embedding methods but also boasts a clear and rigorous probabilistic foundation.


Finally, Chapter 8 summarizes the contribution of this thesis and concludes the work.


Chapter 2 Overview of Related Literature

2.1 Statistical Language Modeling

Language modeling is an important component in most natural language processing (NLP)-related tasks today. The wide array of language modeling methods that have been developed so far fall roughly into four main categories: 1) word-regularity language models, 2) topic language models, 3) continuous language models, and 4) neural network language models. In this chapter, we briefly review several well-known or state-of-the-art language models. Figure 2.1 highlights some but not all of the state-of-the-art and widely-used language models year by year.

2.1.1 Word-Regularity Language Modeling

Beginning in the late 20th century, statistical language modeling has been successfully applied to various NLP applications, such as speech recognition [37][82], information retrieval [102][103][165], document summarization [25][115], and spelling correction [33][118][197]. The most widely-used and mature language model, by far, is the n-gram language model [37][82], because of its simplicity and fair predictive power.

Quantifying the acceptability of a word string in a natural language is the most common task. Take the trigram model for example: given a word string $W_1^L = w_1, w_2, \cdots, w_L$, the probability of the word string is approximated by the following product of a series of conditional probabilities [82]:

$$P(W_1^L) = \prod_{l=1}^{L} P(w_l \mid W_1^{l-1}) \approx P(w_1)\, P(w_2 \mid w_1) \prod_{l=3}^{L} P(w_l \mid w_{l-2}, w_{l-1}). \quad (2.1)$$


In the trigram model, we make the approximation (or assumption) that the probability of a word depends only on the two immediately preceding words.

The easiest way to estimate the conditional probability in Eq. (2.1) is to use the maximum likelihood (ML) estimation:

$$P(w_l \mid w_{l-2}, w_{l-1}) = \frac{c(w_{l-2}, w_{l-1}, w_l)}{c(w_{l-2}, w_{l-1})}, \quad (2.2)$$

where $c(w_{l-2}, w_{l-1}, w_l)$ and $c(w_{l-2}, w_{l-1})$ denote the number of occurrences of the word strings "$w_{l-2}, w_{l-1}, w_l$" and "$w_{l-2}, w_{l-1}$" in a given training corpus, respectively. Without loss of generality, the trigram model can be extended to higher-order models, such as the four-gram model and the five-gram model, but high-order n-gram models usually suffer from data sparseness, which leads to zero conditional probabilities.
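As a concrete illustration of Eqs. (2.1)-(2.2), the following minimal Python sketch collects trigram and bigram counts from a toy tokenized corpus and returns the ML estimate of Eq. (2.2). The function name and the toy corpus are illustrative assumptions, not part of the thesis.

```python
from collections import Counter

def trigram_ml(corpus):
    """Return P_ML(w_l | w_{l-2}, w_{l-1}) estimated from raw counts (Eq. (2.2))."""
    tri, bi = Counter(), Counter()
    for sent in corpus:                       # each sentence is a list of tokens
        tri.update(zip(sent, sent[1:], sent[2:]))
        bi.update(zip(sent, sent[1:]))
    def prob(w2, w1, w):
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return prob

p = trigram_ml([["the", "cat", "sat", "on", "the", "mat"]])
print(p("the", "cat", "sat"))   # 1.0: "sat" always follows "the cat" in the toy corpus
```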

Figure 2.1 Several state-of-the-art language models, arranged year by year (roughly 2000–2014) and grouped into word-regularity models, topic models, continuous language models, and neural network-based language models.


To eliminate zero probabilities, various smoothing techniques have been proposed, e.g., Good-Turing [66][88], Kneser-Ney [37][92], and Pitman-Yor [78]. The general formulation of these approaches is [37]

$$P(w_l \mid w_{l-n+1}, \ldots, w_{l-1}) =
\begin{cases}
f(w_l \mid w_{l-n+1}, \ldots, w_{l-1}), & \text{if } c(w_{l-n+1}, \ldots, w_l) > 0, \\
\beta(w_{l-n+1}, \ldots, w_{l-1})\, P(w_l \mid w_{l-n+2}, \ldots, w_{l-1}), & \text{if } c(w_{l-n+1}, \ldots, w_l) = 0,
\end{cases} \quad (2.3)$$

where $f(\cdot)$ denotes a discounting probability function and $\beta(\cdot)$ denotes a back-off weighting factor that makes the distribution sum to one.
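For illustration, the sketch below instantiates the back-off recursion of Eq. (2.3) for a bigram model, using absolute discounting as an assumed stand-in for the discounting function $f(\cdot)$ (real systems typically use Good-Turing or Kneser-Ney discounts); all names are illustrative.

```python
from collections import Counter

def backoff_bigram(sentences, discount=0.75):
    """A bigram instance of Eq. (2.3): discounted f(.) for seen bigrams,
    back-off weight beta(.) times the unigram model for unseen ones."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def p_unigram(w):
        return unigrams[w] / total

    def p(w, h):
        if bigrams[(h, w)] > 0:                       # seen: discounted relative frequency
            return (bigrams[(h, w)] - discount) / unigrams[h]
        seen = [v for (h2, v) in bigrams if h2 == h]  # words ever observed after history h
        if not seen:                                  # unseen history: plain unigram
            return p_unigram(w)
        reserved = discount * len(seen) / unigrams[h]          # mass freed by discounting
        unseen_mass = 1.0 - sum(p_unigram(v) for v in seen)    # unigram mass of unseen words
        beta = reserved / unseen_mass if unseen_mass > 0 else 0.0
        return beta * p_unigram(w)

    return p
```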

Clearly, n-gram language modeling focuses on modeling the local contextual information, or the lexical regularity, of a language, and it is recognized as the earliest family of language models. Continuing this school of research, many successive language models have been proposed, such as the cache language model [95][96], the trigger model [101], class-based language models [17][196], and the maximum entropy language model [174][175]. Interested readers are referred to [176] for thorough and entertaining discussions of the major methods.

The year 1997 can be thought of as a watershed in language modeling research. On one hand, discriminative language modeling [41][208] is representative of one subsequent line of research. Although this sort of research is still aimed at building n-gram models, the major difference between discriminative language modeling and conventional n-gram models is the training objective. Conventional n-gram-based language models seek a set of parameters that maximizes the corpus likelihood under an ML criterion, while discriminative language models seek parameters that reduce the speech recognition error rate [41], enhance the F-score for information retrieval [23], or optimize the ROUGE score for summarization [120]. Minimum word error training (MERT) [153][155], the global conditional log-linear model (GCLM) [170][171], and the round-robin discriminative model (R2D2) [154] are representative methods. On the other hand, researchers are aware that many polysemic words have different meanings in different contexts, and that word-regularity language modeling takes into account only local contextual information and thus cannot capture long-span semantic information embedded in a sentence or document. To mitigate this flaw, topic language modeling has been proposed [14][26][75]. We give a brief introduction to this school of research in the next subsection.

2.1.2 Topic Language Modeling

The n-gram language model, as it is aimed at capturing only local contextual information, or a language’s lexical regularities, is unable to capture information (either semantic or syntactic) conveyed by words before the n-1 immediately preceding words.

To mitigate this weakness of the n-gram model, various topic language models have been proposed and widely used in many NLP tasks. We can roughly organize these topic models into two categories [30][31]: document topic models (DTMs) and word topic models (WTMs).

DTMs introduce a set of latent topic variables to describe the “word-document”

co-occurrence characteristics. The dependence between a word and its preceding words (regarded as a document) is not computed directly based on frequency counts as in the conventional n-gram model. The probability now is instead based on the frequency of the word in the latent topics as well as the likelihood that the preceding words together generate the respective topics. Probabilistic latent semantic analysis (PLSA) [75][76]

and latent Dirichlet allocation (LDA) [12][67] are two representatives of this category.

LDA, having a formula analogous to PLSA, can be regarded as an extension to PLSA


and has enjoyed much success for various NLP tasks. The major difference between PLSA and LDA is the inference of model parameters [12][14]. PLSA assumes that the model parameters are fixed and unknown while LDA places additional a priori constraints on model parameters by treating them as random variables that follow Dirichlet distributions. Since LDA has a more complex form for model optimization, which is not easily solved by exact inference, several approximate inference algorithms, including variational approximation [11][12][14], expectation propagation [145], and Gibbs sampling [67], have been proposed to estimate LDA parameters.
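As a minimal illustration of the EM-based estimation used by PLSA (LDA's inference is more involved), the sketch below fits P(w|z) and P(z|d) to a toy term-document count matrix. It is a bare-bones, illustrative implementation under assumed inputs, not the configuration used in this thesis.

```python
import numpy as np

def plsa(term_doc, K, iters=50, seed=0):
    """Minimal PLSA trained with EM: term_doc is a |V| x |D| count matrix."""
    rng = np.random.default_rng(seed)
    V, D = term_doc.shape
    p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0, keepdims=True)   # P(w|z)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0, keepdims=True)   # P(z|d)
    for _ in range(iters):
        # E-step: P(z|w,d) proportional to P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]                # V x K x D
        post = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts
        n = term_doc[:, None, :] * post                              # V x K x D
        p_w_z = n.sum(axis=2); p_w_z /= p_w_z.sum(axis=0, keepdims=True).clip(1e-12)
        p_z_d = n.sum(axis=0); p_z_d /= p_z_d.sum(axis=0, keepdims=True).clip(1e-12)
    return p_w_z, p_z_d

counts = np.array([[2, 0, 1],
                   [1, 3, 0],
                   [0, 1, 2]])        # a toy 3-word x 3-document corpus
p_w_z, p_z_d = plsa(counts, K=2)
```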

Instead of treating the preceding word string as a document topic model, we can further regard each word $w_l$ of the language as a word topic model (WTM) [44][45]. Each WTM model $\mathrm{M}_{w_l}$ can be trained in a data-driven manner by concatenating those words occurring within the vicinity of each occurrence of $w_l$ in a training corpus, which are postulated to be relevant to $w_l$. To this end, a sliding window with a size of S words is placed on each occurrence of $w_l$, allowing for the consequent aggregation of a pseudo-document associated with such vicinity information of $w_l$. The WTM model of each word can be estimated using the expectation-maximization (EM) algorithm [52] by maximizing the total log-likelihood of words occurring in their associated "vicinity documents". The word vicinity model (WVM) [30] bears a certain similarity to WTM in its motivation of modeling word-word co-occurrences, but has a more concise parameterization. WVM explores word vicinity information by directly modeling the joint probability of any word pair in the language. Along a similar vein, WVM is trained using the EM algorithm by maximizing the probabilities of all word pairs, respectively, that co-occur within a sliding window of S words in the training corpus.

It is worth noting that several variations of topic language models have been


proposed for use with NLP-related tasks, including the supervised topic model [13], the labeled LDA [167], and the latent association analysis (LAA) [137].

2.1.3 Continuous Language Modeling

The Gaussian mixture language model (GMLM) was proposed in 2007 [1]. The motivation behind GMLM is that although the n-gram has been the dominant and successful technology for language modeling, it has two clear weaknesses: generalizability and adaptability. To leverage the lessons learned in acoustic modeling for speech recognition, GMLM attempts to use Gaussian mixture models to model language instead of the usual multinomial distributions. Formally, GMLM employs singular value decomposition (SVD) [51] to project each word in the vocabulary into a continuous space, thus assigning to each word its own distributed representation. Since each history consists of a set of n-1 words for an n-gram sample, the history can be represented by concatenating the word representations corresponding to the words in the history. GMLM then models contextual information by using a Gaussian mixture model (GMM) [11] for each word, respectively.

Specifically, word $w_i$ has its own density function with which to calculate the probability densities for an observed history $W_{i-n+1}^{i-1} = w_{i-n+1}, \ldots, w_{i-1}$:

$$p(W_{i-n+1}^{i-1} \mid w_i) = \sum_{m=1}^{M} c_{w_i, m}\, \mathcal{N}(W_{i-n+1}^{i-1}; \mu_{w_i, m}, \Sigma_{w_i, m}), \quad (2.4)$$

where M is the number of mixtures in the GMM model of word $w_i$, and $c_{w_i,m}$, $\mu_{w_i,m}$, and $\Sigma_{w_i,m}$ are the component weight, mean vector, and covariance matrix for the m-th mixture in the GMM model. However, when we observe a history, what we need is the prediction probability. GMLM suggests using Bayes' rule to calculate the conditional probability as


$$P(w_i \mid W_{i-n+1}^{i-1}) = \frac{p(W_{i-n+1}^{i-1} \mid w_i)\, P(w_i)}{p(W_{i-n+1}^{i-1})} = \frac{p(W_{i-n+1}^{i-1} \mid w_i)\, P(w_i)}{\sum_{w_j \in V} P(w_j)\, p(W_{i-n+1}^{i-1} \mid w_j)}, \quad (2.5)$$

where $P(w_i)$ is the conventional unigram language model. The advantage of using the Gaussian mixture model as the cornerstone of the language model is that it can be easily adapted from a relatively small text corpus by utilizing well-studied techniques such as maximum likelihood linear regression (MLLR) [110]. Moreover, it is also well-suited for combination with the concept of clusters to enhance generalization capability. Continuing this school of research, many related language models have been proposed, such as tied-mixture language modeling (TMLM) [180] and continuous topic language modeling (CTLM) [46].
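The sketch below illustrates the overall GMLM recipe (SVD word vectors, one GMM per predicted word over concatenated history vectors, and Bayes' rule as in Eq. (2.5)) using scikit-learn. The inputs `cooc` and `histories`, all parameter settings, and the function names are hypothetical; this is only a structural sketch, not the model configuration studied in the cited work.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

def train_gmlm(cooc, histories, dim=16, mixtures=2):
    """cooc: |V| x |V| co-occurrence counts; histories[w]: list of (n-1)-word histories
    (as word indices) observed before word w in the training data (assumed inputs)."""
    # 1) Project each word into a continuous space with SVD (its distributed representation).
    word_vecs = TruncatedSVD(n_components=dim).fit_transform(cooc)
    # 2) Fit, for each word, a GMM over the concatenated vectors of its observed histories.
    gmms, priors = {}, {}
    total = sum(len(h) for h in histories.values())
    for w, hists in histories.items():
        X = np.vstack([np.concatenate([word_vecs[v] for v in h]) for h in hists])
        gmms[w] = GaussianMixture(n_components=min(mixtures, len(X)),
                                  covariance_type="diag").fit(X)
        priors[w] = len(hists) / total                # unigram prior P(w)
    return word_vecs, gmms, priors

def gmlm_prob(w, history, word_vecs, gmms, priors):
    # Eq. (2.5): P(w | history) proportional to P(w) * p(history | w), normalized over the vocabulary.
    x = np.concatenate([word_vecs[v] for v in history]).reshape(1, -1)
    scores = {u: priors[u] * np.exp(g.score_samples(x))[0] for u, g in gmms.items()}
    return scores[w] / sum(scores.values())
```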

2.1.4 Neural Network-based Language Modeling

The artificial neural network (ANN) can be traced back to threshold logic, a computational model for neural networks based on mathematics and algorithms [134]. Although several neural network-based language models have been proposed over the years, this school of research gained widespread attention only after the year 2000. The feedforward neural network [68] and the recurrent neural network [56] are two important representatives.

The feedforward neural network language model (NNLM) is an n-gram-based language model [7][8]. The original motivation for NNLM was to mitigate the data sparseness faced by conventional n-gram models. A famous example is the sentence "The cat is walking in the bedroom": seeing this sentence in the training corpus, we should generalize such that the sentence "A dog was running in a room" is almost as likely, simply because dog and cat (or the and a, room and bedroom, and so on) have similar


semantic and grammatical roles. To achieve this goal, the model learns (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations, simultaneously.

The recurrent neural network language model (RNNLM) tries to project the history $W_1^{L-1}$ onto a continuous space and estimate the conditional probability in a recursive way by using the full information about $W_1^{L-1}$ [36][138]. It has recently emerged as a promising language modeling framework that can effectively and efficiently capture the long-span context relationships among words (or more precisely, the dependence between an upcoming word and its whole history) for use in speech recognition and spoken document summarization. The fundamental network of RNNLM consists of three main ingredients: the input layer, the hidden layer and the output layer. The most attractive aspect of RNNLM is that the statistical cues of previously encountered words retained in the hidden layer are fed back to the input layer and work in combination with the currently encountered word $w_{L-1}$ as an "augmented" input vector for predicting an arbitrary succeeding word $w_L$. Intuitively, the information stored in the hidden layer can be viewed as topic-like information similar in spirit to PLSA or LDA; the major difference is that RNNLM leverages a set of non-linear activation functions to calculate the values of the latent variables, while PLSA and LDA estimate the corresponding model parameters by using the (variational) EM algorithm. In so doing, RNNLM naturally takes into account not only word usage cues but also long-span structural information about word co-occurrence relationships for language modeling.
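To make the three-layer structure concrete, here is a minimal sketch of an RNN language model written in PyTorch (an assumed choice of toolkit); the layer sizes, the toy batch, and all names are illustrative rather than the configuration used in this thesis.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """A minimal RNN language model: input (embedding) layer, hidden layer, output layer."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, hidden=None):
        # The hidden state carries the history forward in time; each output
        # position predicts the next word given everything seen so far.
        emb = self.embed(word_ids)                 # (batch, seq, emb_dim)
        states, hidden = self.rnn(emb, hidden)     # (batch, seq, hidden_dim)
        logits = self.out(states)                  # (batch, seq, vocab_size)
        return logits, hidden

# Usage sketch: next-word prediction with cross-entropy over shifted targets.
model = RNNLM(vocab_size=1000)
ids = torch.randint(0, 1000, (2, 12))             # a toy batch of word-id sequences
logits, _ = model(ids[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), ids[:, 1:].reshape(-1))
```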

Recently, neural networks have emerged as a popular subject of research because of their excellent performance in many fundamental areas, including multimedia processing [87][94], speech processing [1][148], and natural language processing


[48][146][183]. In language model research, the recurrent neural network represents a breakthrough in building language models; recently, research trends have moved from modeling to vectorization. Several representation learning approaches have been proposed and applied to various NLP-related tasks [10][124][159][189].

2.2 Spoken Document Indexing and Retrieval

Over the last two decades, spoken document retrieval (SDR) has become an active area of research and experimentation in the speech processing community. Most retrieval systems participating in the TREC-SDR evaluations claimed that speech recognition errors did not seem to cause much adverse effect on SDR performance, even when merely using imperfect transcripts derived from the one-best recognition results of a speech recognizer. However, this is probably due to the fact that the TREC-style test queries tend to be quite long and contain different words describing similar concepts, which could help the queries match their relevant spoken documents. Furthermore, a query word (or phrase) might occur repeatedly (more than once) within a relevant spoken document, and it is not always the case that all of the occurrences of the word would be misrecognized as other words. Nevertheless, we believe that there are still at least two fundamental challenges facing SDR. On one hand, the imperfect speech recognition transcript carries erroneous information and thus deviates somewhat from representing the true theme of a spoken document. On the other hand, a query is often only a vague expression of an underlying information need, and there would probably be word usage mismatch between a query and a spoken document even if they are topically related to each other.

A significant body of spoken content retrieval work has focused on the exploration of robust indexing or modeling techniques to represent spoken documents in order to work around (or mitigate) the problems caused by ASR [22][26][42][161][191].

In contrast, very limited research has been conducted to look at the other side of the coin, namely, the improvement of query formulation to better reflect the underlying information need of a user [27]. For the latter problem, pseudo-relevance feedback [1][173] is by far the most commonly-used paradigm, which assumes that a small number of top-ranked spoken documents obtained from the initial round of retrieval are relevant and can be utilized for query reformulation. Subsequently, the retrieval system can perform a second round of retrieval with the enhanced query representation to search for more relevant documents.
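As a concrete (and deliberately simplified) sketch of this paradigm, the function below pools the term statistics of the top-k documents from a first retrieval round into a feedback distribution and interpolates it with the original query model. The mixing weight and all names are assumptions; the actual relevance-model and mixture-model formulations are discussed in Chapter 4.

```python
from collections import Counter

def prf_expand_query(query_terms, first_round_ranking, k=5, alpha=0.5):
    """Pseudo-relevance feedback sketch: assume the top-k documents are relevant,
    build a feedback unigram model from them, and interpolate it with the query model."""
    fb = Counter()
    for doc_terms in first_round_ranking[:k]:
        fb.update(doc_terms)
    fb_total = sum(fb.values()) or 1
    q, q_total = Counter(query_terms), len(query_terms)
    vocab = set(fb) | set(q)
    # P'(w|Q) = alpha * P_ML(w|Q) + (1 - alpha) * P(w | feedback documents)
    return {w: alpha * (q[w] / q_total) + (1 - alpha) * (fb[w] / fb_total) for w in vocab}
```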

2.2.1 Language Modeling for Spoken Document Retrieval

2.2.1.1 Query-Likelihood Measure

Recently, language modeling (LM) has emerged as a promising approach to building SDR systems [26][27][42], owing to its inherently clear probabilistic foundation and excellent retrieval performance [204]. The fundamental formulation of the LM approach to SDR is to compute the conditional probability P(Q|D), i.e., the likelihood of a query Q being generated by each spoken document D (the so-called query-likelihood measure). A spoken document D is deemed to be relevant with respect to the query Q if the corresponding document model is more likely to generate the query. If the query Q is treated as a sequence of words, Q=w1,w2,…,wL, where the query words are assumed to be conditionally independent given the document D and their order is also assumed to be of no importance (i.e., the so-called

“bag-of-words” assumption), the similarity measure P(Q|D) can be further decomposed


as a product of the probabilities of the query words generated by the document [204]:

$$P(Q \mid D) = \prod_{l=1}^{L} P(w_l \mid D), \quad (2.6)$$

where P(wl|D) is the likelihood of generating wl by document D (a.k.a. the document model). The simplest way to construct P(wl|D) is based on literal term matching [109], or using the unigram language model (ULM). To this end, each document D can, respectively, offer a unigram distribution for observing any given word w, which is parameterized on the basis of the empirical counts of words occurring in the document with the maximum likelihood (ML) estimator [82][204]:

$$P(w \mid D) = \frac{c(w, D)}{|D|}, \quad (2.7)$$

where c(w,D) is the number of times that word w occurs in the document D and |D| is the number of words in the document. The document model is further smoothed by a background unigram language model estimated from a large general collection to model the general properties of the language as well as to avoid the problem of zero probability [204]. However, how to strike the balance between these two probability distributions is actually a matter of judgment, or trial and error.
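To make the query-likelihood measure concrete, here is a minimal Python sketch of Eqs. (2.6)-(2.7) with Jelinek-Mercer smoothing against a background collection model. The interpolation weight and all names are assumptions for illustration, not the exact smoothing scheme tuned in this thesis.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, doc_terms, bg_counts, bg_len, lam=0.5):
    """log P(Q|D) = sum_l log[ lam * P_ML(w_l|D) + (1 - lam) * P(w_l|Background) ]."""
    doc_counts, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_bg = bg_counts[w] / bg_len
        score += math.log(max(lam * p_doc + (1 - lam) * p_bg, 1e-12))
    return score

# Documents are then ranked in descending order of this score for a given query.
```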

2.2.1.2 Kullback-Leibler (KL)-Divergence Measure

Another basic formulation of LM for SDR is the Kullback-Leibler (KL)-divergence measure [97][204]:

$$\mathrm{KL}(Q \,\|\, D) = \sum_{w \in V} P(w \mid Q) \log \frac{P(w \mid Q)}{P(w \mid D)} \;\overset{\mathrm{rank}}{=}\; -\sum_{w \in V} P(w \mid Q) \log P(w \mid D), \quad (2.8)$$

where the query and the document are, respectively, framed as (unigram) language models (i.e., P(w|Q) and P(w|D)), $\overset{\mathrm{rank}}{=}$ means equivalence for the purpose of ranking documents, and V denotes the vocabulary. A document D that has a smaller value (or probability distance) in terms of KL(Q||D) is deemed to be more relevant with respect to Q. The retrieval effectiveness of the KL-divergence measure depends primarily on the accurate estimation of the query model P(w|Q) and the document model P(w|D). In addition, it is easy to show that the KL-divergence measure gives the same ranking as the ULM model (cf. Eq. (2.6) and Eq. (2.7)) when the query language model is simply derived with the ML estimator [27]:

$$-\mathrm{KL}(Q \,\|\, D) \;\overset{\mathrm{rank}}{=}\; \sum_{w \in V} P(w \mid Q) \log P(w \mid D) \;\overset{\mathrm{rank}}{=}\; \sum_{w \in V} \frac{c(w, Q)}{|Q|} \log P(w \mid D) \;\overset{\mathrm{rank}}{=}\; \sum_{w \in V} c(w, Q) \log P(w \mid D) = \log P(Q \mid D). \quad (2.9)$$

In Eq. (2.9), P(w|Q) is simply estimated as c(w,Q)/|Q|, where c(w,Q) is the number of times w occurs in Q and |Q| is the total count of words in Q. Accordingly, the KL-divergence measure can not only be thought of as a generalization of the query-likelihood measure, but also has the additional merit of being able to accommodate extra information cues to improve the estimation of its component models (especially the query model) for better document ranking in a systematic manner [27][204].
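The identity in Eq. (2.9) can be checked numerically: with an ML query model, ranking by the negative KL divergence and ranking by the log query likelihood always coincide, since they differ only by the factor 1/|Q|. The toy sketch below (illustrative names and data) makes that comparison.

```python
import math
from collections import Counter

def doc_model(doc_terms, bg_counts, bg_len, lam=0.5):
    d, dlen = Counter(doc_terms), len(doc_terms)
    return lambda w: lam * (d[w] / dlen) + (1 - lam) * (bg_counts[w] / bg_len)

def neg_kl(query_terms, p_wd):
    # -KL(Q||D) up to the query entropy term: sum_w P(w|Q) log P(w|D).
    q, qlen = Counter(query_terms), len(query_terms)
    return sum((c / qlen) * math.log(max(p_wd(w), 1e-12)) for w, c in q.items())

def log_ql(query_terms, p_wd):
    return sum(math.log(max(p_wd(w), 1e-12)) for w in query_terms)

bg = Counter("the cat sat on the mat the dog ran after the cat".split())
docs = ["the cat sat on the mat".split(), "the dog ran after the cat".split()]
query = "cat mat".split()
models = [doc_model(d, bg, sum(bg.values())) for d in docs]
rank_kl = sorted(range(len(docs)), key=lambda i: -neg_kl(query, models[i]))
rank_ql = sorted(range(len(docs)), key=lambda i: -log_ql(query, models[i]))
print(rank_kl == rank_ql)   # True: the two measures induce the same document ranking
```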

2.3 Speech Summarization

By virtue of extractive speech summarization, one can listen to and digest multimedia associated with spoken documents efficiently. Extractive speech summarization manages to select a set of indicative sentences from an original spoken document


according to a target summarization ratio and concatenates them together to form a summary accordingly [125][130][151][163]. The wide spectrum of extractive speech summarization methods developed so far may be split into three main categories [125][130]: 1) methods simply based on the sentence position or structure information, 2) methods based on unsupervised sentence ranking, and 3) methods based on supervised sentence classification.

For the first category, the important sentences can be selected from some salient parts of a spoken document [5]. For instance, sentences can be selected from the introductory and/or concluding parts of a spoken document. However, such methods can only be applied to some specific domains with limited document structures. On the other hand, unsupervised sentence ranking methods attempt to select important sentences based on statistical features of spoken sentences or of the words in the sentences, without human labor involved. Statistical features, for example, can be the term (word) frequency, linguistic score and recognition confidence measure, as well as the prosodic information.

The associated unsupervised methods based on these features have gained much research attention. Among them, the vector space model (VSM) [65], the latent semantic analysis (LSA) method [65], the Markov random walk (MRW) method [192], the maximum marginal relevance (MMR) method [19], the sentence significant score method [59], the LexRank method [58], the submodularity-based method [114], and the integer linear programming (ILP) method [135] are the most popular approaches for spoken document summarization. Apart from that, a number of classification-based methods using various kinds of representative features have also been investigated, such as the Gaussian mixture models (GMM) [65], the Bayesian classifier (BC) [98], the support vector machine (SVM) [205] and the conditional random fields (CRFs) [61], to name just a few. In these methods, important sentence selection is usually formulated as


a binary classification problem. A sentence can either be included in a summary or not.

These classification-based methods need a set of training documents along with their corresponding handcrafted summaries (or labeled data) for training the classifiers (or summarizers). However, manual annotation is expensive in terms of time and personnel.

Even though the performance of unsupervised summarizers is not always comparable to that of supervised summarizers, their ease of implementation and flexibility (i.e., they can be readily adapted and carried over to summarization tasks pertaining to different languages, genres or domains) still make them attractive. Interested readers may also refer to [125][130][151][163] for comprehensive reviews and new insights into the major methods that have been developed and applied with good success to a wide range of text and speech summarization tasks.

2.3.1 Language Modeling for Speech Summarization

Among the aforementioned methods, one of the emerging lines of research is to employ the language modeling (LM) approach for important sentence selection, which has shown preliminary success for performing extractive speech summarization in an unsupervised fashion [38][116]. However, a central challenge facing the LM approach is how to formulate the sentence models and accurately estimate their parameters for each sentence in the spoken document to be summarized.

Intuitively, extractive speech summarization could be cast as an ad-hoc information retrieval (IR) problem, where a spoken document to be summarized is taken as an information need and each sentence of the document is regarded as a candidate information unit to be retrieved according to its relevance (or importance) to the information need. As such, the ultimate goal of extractive speech summarization could be stated as the selection of the most representative sentences that can succinctly


describe the main topics of the spoken document.

When applying the LM-based approach to extractive speech summarization, a principal realization is to use a probabilistic generative paradigm for ranking each sentence S of a spoken document D to be summarized, which can be expressed by P(S|D). Instead of calculating this probability directly, we can apply Bayes' rule and rewrite it as follows [82]:

$$P(S \mid D) = \frac{P(D \mid S)\, P(S)}{P(D)}, \quad (2.10)$$

where P(D|S) is the sentence generative probability, i.e., the likelihood of D being generated by S, P(S) is the prior probability of the sentence S being relevant, and P(D) is the prior probability of the document D. P(D) in Eq. (2.10) can be eliminated because it is identical for all sentences and will not affect the ranking of the sentences.

Furthermore, because the way to estimate the probability P(S) is still under active study [38], we may simply assume that P(S) is uniformly distributed, or identical for all sentences. In this way, the sentences of a spoken document to be summarized can be ranked by means of the probability P(D|S) instead of using the probability P(S|D): the higher the probability P(D|S), the more representative S is likely to be for D. If the document D is expressed as a sequence of words, D=w1,w2,…,wL, where words are further assumed to be conditionally independent given the sentence and their order is assumed to be of no importance (i.e., the so-called “bag-of-words” assumption), then P(D|S) can be approximated by

$$P(D \mid S) \approx \prod_{i=1}^{L} P(w_i \mid S), \quad (2.11)$$

where L denotes the length of the document D. The sentence ranking problem has now been reduced to the problem of how to accurately infer the probability distribution P(wi|S), i.e., the corresponding sentence model for each sentence of the document.


Again, the simplest way is to estimate a unigram language model (ULM) on the basis of the frequency of each distinct word w occurring in the sentence, with the maximum likelihood (ML) criterion [82][204]:

$$P(w \mid S) = \frac{c(w, S)}{|S|}, \quad (2.12)$$

where c(w,S) is the number of times that word w occurs in S and |S| is the length of S.

The ULM model can be further smoothed by a background unigram language model estimated from a large general collection to model the general properties of the language as well as to avoid the problem of zero probability. It turns out that a sentence S with more document words w occurring frequently in it would tend to have a higher probability of generating the document.


Chapter 3 Speech and Language Corpora &

Evaluation Metrics

3.1 Data Sets for Spoken Document Indexing and Retrieval

The thesis uses the Mandarin Chinese collection of the TDT corpora for the retrospective retrieval task [23][104], such that the statistics for the entire document collection are obtainable. The Chinese news stories (text) from the Xinhua News Agency are used as our test queries and as the training corpus for all models (excluding the test query set).

More specifically, in the following experiments, we merely extract the title field from a news story as a test query. The Mandarin news stories (audio) from Voice of America news broadcasts are used as the spoken documents. All news stories are exhaustively tagged with event-based topic labels, which serve as the relevance judgments for performance evaluation. Table 3.1 describes some basic statistics about the corpora used in this thesis. The Dragon large-vocabulary continuous speech recognizer provided Chinese word transcripts for our Mandarin audio collections (TDT-2). To assess the performance level of the recognizer, we spot-checked a fraction of the TDT-2 development set (about 39.90 hours) by comparing the Dragon recognition hypotheses with manual transcripts, and obtained a word error rate (WER) of 35.38%.

Since Dragon’s lexicon is not available, we augmented the LDC Mandarin Chinese Lexicon with 24k words extracted from Dragon’s word recognition output, and for computing error rates used the augmented LDC lexicon (about 51,000 words) to tokenize the manual transcripts. We also used this augmented LDC lexicon to tokenize the query sets and training corpus in the retrieval experiments.


3.1.1 Subword-level Index Units

In Mandarin Chinese, there is an unknown number of words, although only some (e.g., 80 thousand, depending on the domain) are commonly used. Each word encompasses one or more characters, each of which is pronounced as a monosyllable and is a morpheme with its own meaning. Consequently, new words are easily generated every day by combining a few characters. Furthermore, Mandarin Chinese is phonologically compact; an inventory of about 400 base syllables provides full phonological coverage of Mandarin audio, if the differences in tones are disregarded. Additionally, an inventory of about 6,000 characters provides nearly full textual coverage of written Chinese. There is a many-to-many mapping between characters and syllables.

TDT-2 (Development Set), 1998, 02~06

# Spoken documents:          2,265 stories, 46.03 hours of audio
# Distinct test queries:     16 Xinhua text stories (Topics 20001~20096)
# Distinct training queries: 819 Xinhua text stories (Topics 20001~20096)

                                          Min.   Max.    Med.   Mean
Doc. length (in characters)               23     4,841   153    287.1
Short query length (in characters)        8      27      13     14
Long query length (in characters)         183    2,623   329    532.9
# Relevant documents per test query       2      95      13     29.3
# Relevant documents per training query   2      95      87     74.4

Table 3.1 Statistics for TDT-2 collection used for spoken document retrieval.

As such,


a foreign word can be translated into different Chinese words based on its pronunciation, where different translations usually have some syllables in common, or may have exactly the same syllables.

The characteristics of the Chinese language lead to some special considerations when performing Mandarin Chinese speech recognition; for example, syllable recognition is believed to be a key problem. Mandarin Chinese speech recognition evaluation is usually based on syllable and character accuracy, rather than word accuracy. The characteristics of the Chinese language also lead to some special considerations for SDR. Word-level indexing features possess more semantic information than subword-level features; hence, word-based retrieval enhances precision. On the other hand, subword-level indexing features behave more robustly against the Chinese word tokenization ambiguity, homophone ambiguity, open vocabulary problem, and speech recognition errors; hence, subword-based retrieval enhances recall. Accordingly, there is good reason to fuse the information obtained from indexing the features of different levels [23]. To do this, syllable pairs are taken as the basic units for indexing besides words. Both the manual transcript and the recognition transcript of each spoken document, in the form of a word stream, were automatically converted into a stream of overlapping syllable pairs. Then, all the distinct syllable pairs occurring in the spoken document collection were identified to form a vocabulary of syllable pairs for indexing.

We can simply use syllable pairs, in place of words, to represent the spoken documents, and thereby construct the associated retrieval models.
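For illustration, converting a recognized syllable stream into overlapping syllable-pair index units can be as simple as the following sketch (the romanized syllables are used here purely for readability and are an assumption of the example):

```python
def syllable_pair_index_terms(syllable_stream):
    """Convert a syllable stream into overlapping syllable-pair index units."""
    return list(zip(syllable_stream, syllable_stream[1:]))

# e.g., the syllables of a four-character word (tai wan da xue):
print(syllable_pair_index_terms(["tai", "wan", "da", "xue"]))
# -> [('tai', 'wan'), ('wan', 'da'), ('da', 'xue')]
```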

3.1.2 Evaluation Metrics

The retrieval results are expressed in terms of non-interpolated mean average precision (MAP) following the TREC evaluation [63], which is computed by the following


equation:

$$\mathrm{MAP} = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{j}{r_{i,j}}, \quad (3.1)$$

where L is the number of test queries, $N_i$ is the total number of documents that are relevant to query $Q_i$, and $r_{i,j}$ is the position (rank) of the j-th document that is relevant to query $Q_i$, counting down from the top of the ranked list.
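A direct implementation of Eq. (3.1) is straightforward; the function name and the toy numbers below are illustrative only.

```python
def mean_average_precision(ranked_rel_positions, total_relevant):
    """MAP per Eq. (3.1): ranked_rel_positions[i] lists the ranks r_{i,j} (1-based, ascending)
    of the relevant documents retrieved for query Q_i; total_relevant[i] is N_i."""
    ap_sum = 0.0
    for ranks, n_i in zip(ranked_rel_positions, total_relevant):
        ap_sum += sum(j / r for j, r in enumerate(ranks, start=1)) / n_i
    return ap_sum / len(total_relevant)

# Example: one query whose 3 relevant documents are found at ranks 1, 3, and 10.
print(mean_average_precision([[1, 3, 10]], [3]))   # (1/1 + 2/3 + 3/10) / 3 ~= 0.656
```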

3.1.3 Baseline Experiments

In the first set of experiments, we compare several retrieval models, including the vector space model (VSM) [132][178], the latent semantic analysis (LSA) [51], the semantic context inference (SCI) [77], and the basic LM-based method (i.e., ULM) [202]. The results when using word- and subword-level index features are shown in Table 3.2. At first glance, ULM in general outperforms the other three methods in most cases, validating the applicability of the LM framework for SDR. Next, we compare two extensions of ULM, namely the probabilistic latent semantic analysis (PLSA) [31] and the latent Dirichlet allocation (LDA) [195], with ULM. The experimental results are also shown in Table 3.2. As expected, both PLSA and LDA outperform ULM, and they are almost on par with each other. The results also reveal that PLSA and LDA can give more accurate estimates of the document language models than the empirical ML estimation.

          VSM     LSA     SCI     ULM     LDA
Word      0.273   0.296   0.270   0.321   0.328
Subword   0.257   0.384   0.270   0.329   0.377

Table 3.2 Retrieval results (in MAP) of different retrieval models with word- and subword-level index features.
