Chapter 1 Introduction
1.1 Background
The use of prosodic information in automatic speech recognition (ASR) has been an attractive research topic in recent years. Prosody refers to the suprasegmental features of continuous speech, such as accentuation, prominence, tone, pause, intonation, and rhythm. It is physically encoded in the variations of pitch contour, energy level, duration, and silence of spoken utterances. Prosody is known to correlate closely with linguistic features at various levels, from phone, syllable, word, and phrase up to sentence or above. Owing to these correlations, prosody is potentially useful for ASR.
Generally, the task of prosody-assisted ASR is first to explore prosodic cues correlated with linguistic features, then to model their relationships with linguistic features and prosodic-acoustic features, and finally to incorporate these models into the ASR framework.
Many studies on using prosodic information to assist ASR have been reported [1]-[7], covering American English [1]-[4],[6],[7] and Spanish [5].
Ananthakrishnan et al. [1]-[3] proposed to incorporate a prosodic language model and a prosodic acoustic model into a conventional hidden Markov model (HMM)-based ASR recognizer by rescoring the N-best word sequences or the word lattice. The prosodic acoustic model used a Gaussian mixture model (GMM) or a multilayer perceptron (MLP) to model the relation between the binary pitch accent label of a word and the prosodic-acoustic features extracted from the F0 track, energy, and duration cues of its context. The prosodic language model was a trigram language model (LM) with compound tokens of words and their binary pitch accent labels. In addition, an unsupervised adaptation approach that jointly refines the two categorical prosody models and bootstraps prosodic labels was proposed to address the lack of large corpora annotated with relevant prosodic symbols [1].
Relative improvements of 1.2-3.1% in word error rate (WER) were obtained on the Boston University Radio News Corpus (BU-RNC). Chen et al. [4] used two prosodic events, intonational phrase boundary and pitch accent, in ASR to construct
WER was achieved on BU-RNC. Milone et al. [5] proposed a method to use accentual information in ASR. The method first estimated the sequence of accentual structures of words from the speech signal, using F0 and energy, with an HMM-based classifier or a neural tree network classifier, and then incorporated it into the recognition process. An LM built to take into account the accentual structure of words in a phrase was used. A relative improvement of 28.91% in WER was achieved on a medium-vocabulary Spanish continuous-speech recognition task. Vergyri et al. [6] proposed to integrate models of different prosodic knowledge sources into ASR. These included a word duration model, a pause language model, and a prosodic model of hidden events (e.g., sentence boundaries and speech disfluencies). Relative improvements of 2.6-3.1% in WER were achieved on the Switchboard database. Ostendorf et al. [7] presented a statistical modeling framework for incorporating prosody in the speech recognition process. Several issues were discussed, including prosodic feature extraction at different time scales and normalization, prosody modeling using an intermediate symbol representation in contrast to directly conditioning on acoustic correlates, the use of questions about prosodic structure in acoustic model clustering, and dynamic pronunciation modeling conditioned on acoustic-prosodic features.
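Several of the studies above combine prosodic model scores with the baseline recognizer scores by rescoring N-best lists or lattices. The following is a minimal sketch of N-best rescoring, assuming a simple log-linear combination of scores. The class name `Hypothesis`, the weights, and all score values are hypothetical illustrations and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list
    acoustic_logp: float   # log-score from the baseline acoustic model
    lm_logp: float         # log-score from the baseline language model
    prosody_logp: float    # log-score from a prosodic model (hypothetical)


def rescore(nbest, lm_weight=1.0, prosody_weight=0.5):
    """Return the hypotheses sorted by combined log-linear score, best first."""
    def total(h):
        return (h.acoustic_logp
                + lm_weight * h.lm_logp
                + prosody_weight * h.prosody_logp)
    return sorted(nbest, key=total, reverse=True)


# Toy N-best list with made-up scores.
nbest = [
    Hypothesis(["recognize", "speech"], -120.0, -8.0, -6.0),
    Hypothesis(["wreck", "a", "nice", "beach"], -118.0, -9.5, -12.0),
]

# Without the prosodic score the second hypothesis ranks first;
# with it, the first hypothesis does.
print(" ".join(rescore(nbest, prosody_weight=0.0)[0].words))
print(" ".join(rescore(nbest)[0].words))
```

In this toy setting the prosodic score only reorders hypotheses that the first pass already produced, which is why rescoring cannot recover a word sequence that is absent from the N-best list or lattice.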
Several studies on using prosodic information to assist Mandarin ASR can also be found [8]-[13]. In [8], a recurrent neural network (RNN) was used to detect word-boundary information from the input prosodic features, with base-syllable boundaries pre-determined by an HMM-based acoustic decoder. The word-boundary information was then used to assist the linguistic decoder in resolving word-boundary ambiguity as well as in pruning unlikely paths. An absolute improvement of 1.1% in character error rate (CER) was achieved on a large-vocabulary speaker-dependent (SD) Mandarin continuous ASR task. Huang et al. [9],[10] utilized decision tree-based or GMM-based prosodic models at the syllable and word levels to generate prosodic likelihood scores for rescoring in a two-pass recognition process.
Absolute CER improvements of 1.06% [9] and 1.45% [10] were reported on a large-vocabulary multi-speaker continuous ASR task. In [11], word-dependent tone modeling using prosodic features of syllable duration and three F0 values, with two back-off schemes, was proposed for Mandarin ASR. A minor improvement in CER was achieved on a Mandarin broadcast news corpus. Ni et al. [12] proposed an implicit tone model using F0 contour features and an explicit tone model using both prosodic and lexical features to assist Mandarin ASR. An absolute improvement of 3.65% in CER was achieved on the Project-863 database. In [13], Ni et al. incorporated a GMM-based prosody-dependent tonal syllable duration model and a maximum entropy (ME)-based syntactical prosody model into a prosody-dependent acoustic model recognizer by rescoring the syllable lattice. Only the tonal syllable recognition rate was reported on the Project-863 database.
Prosody modeling has also been used in some other speech recognition tasks. Liu et al. [14] enriched speech recognition output with automatic detection of sentence boundaries and disfluencies, using both prosodic and lexical features, on the conversational telephone speech and broadcast news tasks of the NIST RT-04F evaluation. Shriberg et al. [15] employed the decision tree method to model rhythmic and melodic features of speech for several applications, including sentence segmentation and disfluency detection, topic segmentation in broadcast news, and dialog act labeling and word recognition in conversational speech. Although prosody modeling was useful in those applications, only minor improvements in word recognition were achieved.
The above discussion shows that prosody modeling is the main concern in all of these previous studies. Their prosody modeling methods fall into two classes: 1) direct modeling of target classes [8],[10]-[12], and 2) prosody modeling via intermediate abstract phonological categories [1]-[6],[9],[13], such as ToBI [16] and INTSINT [17]. In direct modeling of target classes, the relationship between prosodic-acoustic features and target classes (usually linguistic features, e.g., lexical tone or lexical word) is directly modeled by a pattern classifier, such as a GMM, decision tree, RNN, or ME model. This approach has the advantage of bypassing manual labeling of prosodic tags and hence avoids inter-annotator inconsistency. Nevertheless, the variability, or space, of both the prosodic-acoustic features and the linguistic features (targets) may become too large when more features of various levels or wider time windows are considered. Therefore, only limited linguistic and prosodic-acoustic features are incorporated in this direct modeling approach [8],[10]-[12].
On the other hand, prosody modeling via intermediate abstract phonological categories [1]-[6],[9],[13] first explores important prosodic cues or events potentially useful for ASR and then builds prosodic models to describe the prosodic-acoustic features using a prosody-annotated speech database. Figure 1.1 shows a conceptual block diagram of prosody modeling using intermediate abstract phonological categories. Usually, prosody annotation is based on the ToBI labeling system [16] and is performed manually. The variability of prosodic-acoustic features can be reduced by introducing a finite discrete set of prosody tags, which makes the construction of the prosody-syntax relationship easier. The main drawback of this approach is the need for a large, well-annotated database in which the full set of prosodic cues is properly labeled. In the past, prosody labeling was usually done by humans because of the lack of a good automatic labeling algorithm. However, preparing such a database by hand is difficult because the labeling work is highly time-consuming, and it is not easy to maintain consistency in fully labeling all prosodic cues, either for the same annotator or across different annotators. Consequently, most previous works of this class used databases annotated with only a few obvious prosodic cues, such as pitch accent and intonational phrase boundary. This severely limits the effectiveness of prosodic information in improving ASR performance.
Although some studies [13],[18],[19] conducted automatic prosody labeling to enlarge the prosody-annotated corpus, the prosodic cues they used were still very limited. Moreover, their prosodic models were still trained with manually annotated speech corpora, so their performance was subject to the quality of human prosody labeling. Table 1.1 summarizes the primary features of prosody modeling and the experimental settings of those previous studies on prosody-assisted ASR for comparison.
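The direct modeling class described above can be illustrated with a toy classifier that maps prosodic-acoustic features straight to a linguistic target, with no intermediate prosody tags. This is only a sketch under stated assumptions: a single diagonal Gaussian per class stands in for a full GMM, and the two-dimensional features (a normalized mean F0 and an F0 slope), the tone labels, and all data values are hypothetical rather than drawn from the cited studies.

```python
import math
from collections import defaultdict


def fit_gaussians(samples):
    """Fit one diagonal Gaussian per class from (feature_vector, label) pairs."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    models = {}
    for label, rows in by_class.items():
        dims = list(zip(*rows))  # one tuple of values per feature dimension
        means = [sum(d) / len(d) for d in dims]
        # A variance floor keeps the log-likelihood finite on tiny toy sets.
        variances = [max(sum((x - m) ** 2 for x in d) / len(d), 1e-3)
                     for d, m in zip(dims, means)]
        models[label] = (means, variances)
    return models


def log_likelihood(features, model):
    """Log-density of a feature vector under a diagonal Gaussian."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(features, means, variances))


def classify(features, models):
    """Pick the class whose Gaussian gives the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(features, models[label]))


# Hypothetical training data: (normalized mean F0, F0 slope) -> tone label.
train = [
    ((0.9, 0.0), "level"), ((1.0, 0.1), "level"), ((1.1, -0.1), "level"),
    ((0.5, -0.9), "falling"), ((0.6, -1.0), "falling"), ((0.4, -1.1), "falling"),
]
models = fit_gaussians(train)
print(classify((0.95, 0.05), models))
print(classify((0.5, -0.95), models))
```

The sketch also hints at the scaling problem noted above: each additional feature or target class enlarges the space the classifier must cover, while the amount of training data per class stays fixed.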
Figure 1.1: A conceptual block diagram of the prosody modeling class using intermediate abstract phonological categories. PD-AM and PD-LM denote the prosody-dependent acoustic model and the prosody-dependent language model, respectively.
Table 1.1: Comparison Between Prosody-Assisted ASR Studies
Prosody modeling: PE, PH, PL, PAF, LF. Experiment setting: LNG, STL, VSZ, SPK, IMP (%).

| Literature | PE | PH | PL | PAF | LF | LNG | STL | VSZ | SPK | IMP (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ni [13] | 2B+2S | 1-L | SS | F0/d | t | M | R | TSR | SI | 9.82/24.4 (tonal syllable) |
| Huang [9] | 2B | 2-L | R | F0*/d*/e*/p | t/WB | M | B | 100K | SD | 1.06/5.5 (character) |
| Ana [1] | 2A | - | UA | F0*/d*/e* | W | E | R | - | - | 1/3.1 |
| Chen [4] | 2B+2A | 1-L* | S | F0/d | ph/W/POS | E | R | - | SI | 1.73/6.9 |
| Vergyri [6] | 3P+5HE | - | - | F0*/d*/p | ph/W | E | C | 8K | SI | 1.1, 0.7, 0.9/3.9, 2.6, 3.1 |
| Milone [5] | AS | - | - | F0/e/d | W | S | R | <500 | SI | 2.18/28.91 |
| Huang [10] | Dir | - | - | F0*/d*/e*/p | t/WB | M | B | 100K | SD | 1.45/7.5 (character) |
| Ni [12] | Dir | - | - | F0/d/e/p | t/WB | M | R | 48188 | SI | 3.65/21.5 (character) |
| Lei [11] | Dir | - | - | F0/d | t/ts/W | M | B | 49k | SI | 0.7, 1/6, 5.2 (character) |
| Wang [8] | Dir | - | - | F0*/d/e*/p | SJ | M | R | 110K | SD | 1.1/4.2 (character) |
| proposed | 7B+PS | 4-L | U | F0*/d*/e/p/ed | t/s/f/WL/WB/POS/PM | M | R | 60K | SI | 9.82/24.4 (tonal syllable) |

PE: prosodic event = {B: break type | PS: prosodic state | S: phrase stress | A: binary pitch accent | HE: hidden events | AS: accentual structure of words | Dir: direct prosody modeling};
PH: prosody hierarchy = {L: layer};
PL: prosody labeling = {U: unsupervised | SS: semi-supervised | S: supervised | BS: bootstrapping | R: taking lexical word as potential PW};
PAF: prosodic-acoustic feature = {F0: fundamental frequency | d: duration | e: energy | pd: pause duration | *: with differential};
LF: linguistic feature = {t: tone | ph: phone | s: base-syllable type | W: word | POS: part of speech | PM: punctuation mark};
LNG: language = {M: Mandarin | E: English | S: Spanish};
STL: style = {R: read | B: broadcasting | C: conversational};
VSZ: vocabulary size in word, TSR: tonal syllable recognition;
SPK: speaker = {SI: speaker independent | SD: speaker dependent | MS: multi speaker};
IMP: improvement in absolute/relative accuracy.
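
The IMP column pairs an absolute figure with a relative one. As a short worked example, the function below shows how the two are usually related, assuming the relative figure is the absolute error-rate reduction divided by the baseline error rate; the source does not state the formula, and the baseline and new error rates used here are hypothetical round numbers, not values from any cited study.

```python
def improvement(baseline_error, new_error):
    """Return (absolute, relative) error-rate reduction, both in percent."""
    absolute = baseline_error - new_error
    relative = 100.0 * absolute / baseline_error
    return absolute, relative


# Hypothetical example: error rate drops from 20.0% to 19.0%.
absolute, relative = improvement(baseline_error=20.0, new_error=19.0)
print(f"{absolute:.2f} absolute, {relative:.1f}% relative")
# prints "1.00 absolute, 5.0% relative"
```

Under this convention the same absolute gain gives a larger relative figure on a stronger baseline, so the relative values in the table are not directly comparable across tasks with very different baseline error rates.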