

Chapter 3 Speech and Language Corpora & Evaluation Metrics

3.1 Data Sets for Spoken Document Indexing and Retrieval

This thesis uses the Mandarin Chinese collection of the TDT corpora for the retrospective retrieval task [23][104], so that the statistics for the entire document collection are obtainable. The Chinese news stories (text) from Xinhua News Agency are used as our test queries and as the training corpus for all models (excluding the test query set).

More specifically, in the following experiments we extract only the title field of a news story as a test query. The Mandarin news stories (audio) from Voice of America news broadcasts are used as the spoken documents. All news stories are exhaustively tagged with event-based topic labels, which serve as the relevance judgments for performance evaluation. Table 3.1 gives some basic statistics of the corpora used in this thesis. The Dragon large-vocabulary continuous speech recognizer provided Chinese word transcripts for our Mandarin audio collection (TDT-2). To assess the performance of the recognizer, we spot-checked a fraction of the TDT-2 development set (about 39.90 hours) by comparing the Dragon recognition hypotheses with manual transcripts, and obtained a word error rate (WER) of 35.38%.
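As a reminder of how WER is obtained, it is the number of word substitutions, deletions, and insertions in the minimum-cost alignment of the recognition hypothesis against the reference transcript, divided by the reference length. The sketch below computes this by dynamic programming; it is a generic illustration, since the scoring tool actually used for the TDT-2 spot check is not specified here.

```python
def wer(reference, hypothesis):
    """Word error rate between two token lists via Levenshtein alignment:
    (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = min edit operations turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # all deletions
    for j in range(m + 1):
        d[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m] / n
```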

Since Dragon’s lexicon is not available, we augmented the LDC Mandarin Chinese Lexicon with 24k words extracted from Dragon’s word recognition output, and used the augmented LDC lexicon (about 51,000 words) to tokenize the manual transcripts when computing error rates. We also used this augmented lexicon to tokenize the query sets and the training corpus in the retrieval experiments.
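Lexicon-based tokenization of Chinese text can be illustrated with a forward maximum-matching sketch: at each position, take the longest character string found in the lexicon. This greedy strategy is a common baseline for Chinese word segmentation, but it is only an assumption here; the actual tokenizer applied to the transcripts is not specified, and the lexicon entries below are purely illustrative.

```python
def tokenize(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry; fall back to a single character (treated as OOV)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens
```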

3.1.1 Subword-level Index Units

In Mandarin Chinese, the number of words is open-ended, although only some (e.g., around 80 thousand, depending on the domain) are commonly used. Each word comprises one or more characters, each of which is pronounced as a monosyllable and is itself a morpheme with its own meaning. Consequently, new words are readily coined every day by combining a few characters. Furthermore, Mandarin Chinese is phonologically compact: an inventory of about 400 base syllables provides full phonological coverage of Mandarin audio if differences in tone are disregarded, while an inventory of about 6,000 characters provides almost full textual coverage of written Chinese. There is a many-to-many mapping between characters and syllables. As such, a foreign word can be translated into different Chinese words based on its pronunciation, where different translations usually have some syllables in common, or may have exactly the same syllables.

TDT-2 (Development Set), 1998/02~06

  # Spoken documents            2,265 stories, 46.03 hours of audio
  # Distinct test queries       16 Xinhua text stories (Topics 20001~20096)
  # Distinct training queries   819 Xinhua text stories (Topics 20001~20096)

                                            Min.    Max.    Med.    Mean
  Doc. length (in characters)                 23    4,841    153    287.1
  Short query length (in characters)           8       27     13     14
  Long query length (in characters)          183    2,623    329    532.9
  # Relevant documents per test query          2       95     13     29.3
  # Relevant documents per training query      2       95     87     74.4

Table 3.1 Statistics for the TDT-2 collection used for spoken document retrieval.

The characteristics of the Chinese language lead to some special considerations when performing Mandarin Chinese speech recognition; for example, syllable recognition is believed to be a key problem, and Mandarin Chinese speech recognition evaluation is usually based on syllable and character accuracy rather than word accuracy. These characteristics also lead to some special considerations for SDR. Word-level indexing features possess more semantic information than subword-level features; hence, word-based retrieval enhances precision. On the other hand, subword-level indexing features are more robust against Chinese word tokenization ambiguity, homophone ambiguity, the open-vocabulary problem, and speech recognition errors; hence, subword-based retrieval enhances recall. Accordingly, there is good reason to fuse the information obtained from indexing features of different levels [23]. To this end, syllable pairs are taken as basic indexing units in addition to words. Both the manual transcript and the recognition transcript of each spoken document, in the form of a word stream, were automatically converted into a stream of overlapping syllable pairs. Then, all the distinct syllable pairs occurring in the spoken document collection were identified to form a vocabulary of syllable pairs for indexing.

We can simply use syllable pairs, in place of words, to represent the spoken documents, and thereby construct the associated retrieval models.
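The conversion from a syllable stream to overlapping syllable pairs can be sketched as follows; this is a minimal illustration, and the syllable spellings in the example are hypothetical.

```python
def syllable_pairs(syllables):
    """Convert a syllable stream into a stream of overlapping syllable pairs:
    each syllable is paired with its immediate successor."""
    return [(syllables[k], syllables[k + 1]) for k in range(len(syllables) - 1)]

def pair_vocabulary(documents):
    """Collect all distinct syllable pairs over a collection of documents,
    forming the indexing vocabulary."""
    return {pair for doc in documents for pair in syllable_pairs(doc)}
```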

3.1.2 Evaluation Metrics

The retrieval results are expressed in terms of non-interpolated mean average precision (MAP) following the TREC evaluation [63], which is computed by the following equation:

\mathrm{MAP} = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{j}{r_{i,j}}    (3.1)

where L is the number of test queries, N_i is the total number of documents that are relevant to query Q_i, and r_{i,j} is the position (rank) of the j-th document that is relevant to query Q_i, counting down from the top of the ranked list.
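Equation (3.1) can be computed directly from the 1-based ranks of the relevant documents retrieved for each query; a minimal sketch:

```python
def mean_average_precision(rank_lists):
    """rank_lists[i] holds the 1-based ranks r_{i,j} of the N_i documents
    relevant to query Q_i. Implements Eq. (3.1):
    MAP = (1/L) * sum_i (1/N_i) * sum_j (j / r_{i,j})."""
    total = 0.0
    for ranks in rank_lists:
        n_i = len(ranks)
        # j-th relevant document (ranks sorted ascending) contributes j / r_{i,j}
        total += sum(j / r for j, r in enumerate(sorted(ranks), start=1)) / n_i
    return total / len(rank_lists)
```

For example, a single query whose two relevant documents sit at ranks 1 and 2 yields an average precision of 1.0, whereas ranks 2 and 4 yield (1/2)(1/2 + 2/4) = 0.5.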

3.1.3 Baseline Experiments

In the first set of experiments, we compare several retrieval models: the vector space model (VSM) [132][178], latent semantic analysis (LSA) [51], semantic context inference (SCI) [77], and the basic LM-based method (i.e., ULM) [202]. The results obtained with word- and subword-level index features are shown in Table 3.2. At first glance, ULM in general outperforms the other three methods in most cases, validating the applicability of the LM framework for SDR. Next, we compare two extensions of ULM, namely probabilistic latent semantic analysis (PLSA) [31] and latent Dirichlet allocation (LDA) [195], with ULM. The experimental results are also shown in Table 3.2. As expected, both PLSA and LDA outperform ULM, and they are almost on par with each other. The results also reveal that PLSA and LDA can give more accurate estimates of the document language models than the empirical ML estimator used in ULM, and thus improve retrieval effectiveness. On the other hand, a closer look at these results shows that although the word error rate (WER) for the spoken document collection is higher than 35%, it does not lead to catastrophic failures, probably because the recognition errors are overshadowed by the large number of correctly recognized spoken words in the documents.

            VSM     LSA     SCI     ULM     LDA
  Word     0.273   0.296   0.270   0.321   0.328
  Subword  0.257   0.384   0.270   0.329   0.377

Table 3.2 Retrieval results (in MAP) of different retrieval models with word- and subword-level index features.
