
Chapter 2 The Proposed Prosody-Assisted ASR

2.2 Training of the Proposed Prosodic Models

The joint prosody labeling and modeling (PLM) algorithm proposed previously [20] is adopted to train all 12 models from an unlabeled speech database. The PLM algorithm is a sequential optimization procedure based on the maximum-likelihood (ML) criterion that jointly labels the prosodic tags of all utterances in the training corpus and estimates the parameters of all 12 prosodic models. It is composed of two parts: initialization and iteration. The initialization part first determines the initial prosodic tags of all utterances, and then estimates the initial parameters of the prosodic models by a specially designed procedure. The iteration part first defines an objective likelihood function Q for each utterance, given in Equation (2.14). It then performs a multi-step iterative procedure that re-labels the prosodic tags of each utterance with the goal of maximizing Q and updates the parameters of all prosodic models sequentially and iteratively. In the following, we describe the sequential optimization procedure in more detail.

2.3.1 Initialization

(a) Initial labeling of break indices

The initial break index of each syllable juncture is determined by the decision tree shown in Figure 2.3. The decision tree is designed based on general knowledge of the break types obtained in our previous prosody labeling and modeling study on a single-speaker database [20]. First, a juncture is labeled as B4 if its pause duration is longer than a large threshold Th1. Otherwise, it is labeled as B3 if its pause duration is longer than Th2. Next, all remaining intraword junctures are labeled as B0/B1. We then label the interword junctures with medium pause duration (threshold Th3) as B2-2, those with medium pitch jump (threshold Th4) as B2-1, and those with medium pre- or post-syllable lengthening (thresholds Th5 and Th6) as B2-3. All remaining interword junctures are labeled as B0/B1. Lastly, each B0/B1 juncture is refined to B0 if it has a continuous F0 trajectory, and to B1 otherwise. All six thresholds are determined systematically by an algorithm rather than by trial and error. The algorithm is described in detail below.

Figure 2.3: The decision tree for initial break type labeling.
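The labeling rules of Figure 2.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `initial_break_label`, the argument names, the threshold values in the usage lines, the strict `>` comparison for Th3 to Th6, and the reading of "Th5 and Th6" as an either-or test are all hypothetical choices made for illustration.

```python
def initial_break_label(pd, pj, dl, df, is_interword, f0_continuous, th):
    """Return the initial break type of one syllable juncture.

    pd: pause duration, pj: pitch jump, dl/df: the two lengthening
    features, th: dict holding the six thresholds 'Th1'..'Th6'.
    The comparison directions for Th3..Th6 are assumptions.
    """
    # Long pauses are labeled first, regardless of word boundary status.
    if pd > th["Th1"]:
        return "B4"
    if pd > th["Th2"]:
        return "B3"
    # Only interword junctures can receive one of the B2 subtypes.
    if is_interword:
        if pd > th["Th3"]:
            return "B2-2"  # medium pause duration
        if pj > th["Th4"]:
            return "B2-1"  # medium pitch jump
        if dl > th["Th5"] or df > th["Th6"]:
            return "B2-3"  # medium lengthening
    # Everything left is B0/B1, refined by F0 continuity.
    return "B0" if f0_continuous else "B1"


# Hypothetical threshold values, chosen only to exercise each branch.
thresholds = {"Th1": 0.5, "Th2": 0.2, "Th3": 0.05,
              "Th4": 3.0, "Th5": 1.3, "Th6": 1.3}
print(initial_break_label(0.6, 0.0, 1.0, 1.0, True, False, thresholds))
print(initial_break_label(0.0, 0.0, 1.0, 1.0, False, True, thresholds))
```

In this sketch the thresholds are passed in as data, so the same function can be reused once Th1 to Th6 have been derived from the corpus.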

The algorithm uses both linguistic and acoustic cues to determine the six thresholds. First, we note that PMs are usually associated with long breaks, which are labeled as B3 or B4. We hence collect the pause durations of all word junctures with a PM and use scalar quantization to divide them into two clusters. Two gamma distributions are accordingly constructed to represent the pause duration distributions of B3 and B4, i.e., $f_{B3}(pd)$ and $f_{B4}(pd)$, respectively. The threshold Th1 is then set to the equal-probability intersection of the two distributions.
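The derivation of Th1 described above can be illustrated with a short Python sketch. It is only an illustration under several assumptions that the text does not specify: scalar quantization is realized as a two-codeword Lloyd iteration, each gamma distribution is fitted by the method of moments, the "equal probability intersection" is taken as the point where the two unweighted densities are equal, and the pause durations are invented toy values. All function names are hypothetical.

```python
import math


def two_cluster_sq(values, iters=50):
    """Split 1-D values into two clusters with a two-codeword Lloyd iteration."""
    lo, hi = min(values), max(values)
    short, long_ = list(values), []
    for _ in range(iters):
        boundary = (lo + hi) / 2.0
        short = [v for v in values if v <= boundary]
        long_ = [v for v in values if v > boundary]
        if not long_:
            break
        new_lo = sum(short) / len(short)
        new_hi = sum(long_) / len(long_)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return short, long_


def fit_gamma(values):
    """Method-of-moments gamma fit (an assumption); returns (shape, scale)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean * mean / var, var / mean


def gamma_logpdf(x, shape, scale):
    """Log density of a gamma distribution at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def equal_prob_threshold(g_short, g_long, tol=1e-9):
    """Bisection for the point between the two means where the densities match."""
    a = g_short[0] * g_short[1]  # mean of the short-pause gamma
    b = g_long[0] * g_long[1]    # mean of the long-pause gamma

    def diff(x):
        return gamma_logpdf(x, *g_short) - gamma_logpdf(x, *g_long)

    if not (diff(a) > 0.0 > diff(b)):
        raise ValueError("densities do not cross between the two means")
    while b - a > tol:
        mid = (a + b) / 2.0
        if diff(mid) > 0.0:
            a = mid
        else:
            b = mid
    return (a + b) / 2.0


# Invented pause durations (seconds) of junctures with a PM.
pauses = [0.18, 0.20, 0.22, 0.25, 0.30, 0.80, 0.90, 1.00, 1.10, 1.30]
b3_pauses, b4_pauses = two_cluster_sq(pauses)
th1 = equal_prob_threshold(fit_gamma(b3_pauses), fit_gamma(b4_pauses))
print(round(th1, 3))
```

The same intersection routine could be reused for any pair of fitted densities, provided the two densities cross exactly once between their means.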

Then, we construct a gamma distribution $f_{B0/1}(pd)$ for B0/B1 by using the pause [...] constructed by using the pause durations of all non-PM interword junctures with [...] of interwords with PM and of intrawords, respectively. Then, a Gaussian distribution of pj for B2-1, i.e., $f_{B2\text{-}1}(pj)$, is constructed using non-PM interwords with apparent pitch jump, defined by the criterion $f_{PM}(pj) > f_{intra}(pj)$. Similarly, two Gaussian distributions of dl and df for B2-3, i.e., $f_{B2\text{-}3}(dl)$ and $f_{B2\text{-}3}(df)$, are constructed using non-PM interwords with apparent duration lengthening, defined by the criteria $f_{PM}(dl) > f_{intra}(dl)$ and $f_{PM}(df) > f_{intra}(df)$. Lastly, Th4, Th5 and Th6 are set to the equal-probability intersections of $f_{intra}(pj)$/$f_{B2\text{-}1}(pj)$, $f_{intra}(dl)$/$f_{B2\text{-}3}(dl)$, and $f_{intra}(df)$/$f_{B2\text{-}3}(df)$, respectively.

(b) Initialization of 12 prosodic models

The break-syntax model and the syllable-juncture prosodic-acoustic model can be initialized independently once the initial break indices of all syllable junctures are given. We realize them by the CART algorithm [22]. Then, the initializations of the three syllable prosodic-acoustic models are considered. Since they are multi-parametric representation models that superimpose several APs of the major affecting factors to form the observed syllable prosodic-acoustic features, the estimation of one AP may be interfered with by the presence of APs of other types. It is therefore improper to estimate all initial parameters independently. We hence adopt a progressive estimation strategy that first determines the initial APs which can be estimated most reliably and then eliminates their effects from the surface prosodic-acoustic features before estimating the remaining APs. Based on this idea, we determine the order of initial AP estimation according to the availability of the affecting factor and the size of the AP. The resulting order is: global means $\mu_{sp}$/$\mu_{sd}$/$\mu_{se}$, tone $\beta_t$/$\gamma_t$/$\omega_t$, coarticulation $\beta^{f}_{B,tp}$/$\beta^{b}_{B,tp}$, base-syllable/final type $\gamma_s$/$\omega_f$, and prosodic states $\beta_p$/$\gamma_q$/$\omega_r$. Note that an improper ordering of initial AP estimation may result in poor AP estimates. For example, if we reverse the order of initial estimation of the tone and base-syllable APs of syllable duration (i.e., $\gamma_t$ and $\gamma_s$), then the value of $\gamma_s$ for base-syllable “de” will decrease significantly while the value of $\gamma_t$ for Tone 5 will increase accordingly. This is due to the high-frequency character “的”, which dominates the distributions of both Tone 5 and base-syllable “de”. We also note that the initial pitch, duration and energy prosodic-state indices are assigned by applying vector quantization (VQ) to the residues of syllable F0 level, duration and energy level, respectively, and their APs are set to the corresponding codewords. Lastly, the three prosodic state transition models are initialized using the labeled prosodic-state indices and break indices.

2.3.2 Iteration

The iteration is a multi-step procedure listed below:

Step 1: Update the APs of tones, $\beta_t$/$\gamma_t$/$\omega_t$, with all other APs being fixed.

Step 2: Update the APs of coarticulation, $\beta^{f}_{B,tp}$/$\beta^{b}_{B,tp}$, with all other APs being fixed.

Step 3: Update the APs of base-syllable/final type, $\gamma_s$/$\omega_f$, with all other APs being fixed.

Step 4: Re-label the prosodic state sequence of each utterance by the Viterbi algorithm so as to maximize Q defined in Equation (2.14).

Step 5: Update the APs of prosodic state, $\beta_p$/$\gamma_q$/$\omega_r$, the variances, $R_{sp}$/$R_{sd}$/$R_{se}$, and the prosodic state transition models.

Step 6: Re-label the break type sequence of each utterance by the Viterbi algorithm so as to maximize Q defined in Equation (2.14).

Step 7: Update the decision trees of the break-syntax model and of the syllable-juncture prosodic-acoustic model.

Step 8: Repeat Steps 1 to 7 until convergence is reached.
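The overall shape of the iteration, alternating parameter updates and re-labeling until the objective stops improving, can be sketched in Python as follows. This is a toy analogue rather than the PLM algorithm itself: nearest-codeword assignment stands in for Viterbi re-labeling, a negative squared error stands in for the likelihood Q of Equation (2.14), and the names `run_sequential_optimization`, `relabel`, `update_means` and `objective` are hypothetical.

```python
def run_sequential_optimization(state, steps, objective, tol=1e-6, max_iters=100):
    """Apply the steps in order, repeating until the objective change is below tol."""
    previous = objective(state)
    history = [previous]
    for _ in range(max_iters):
        for step in steps:
            state = step(state)
        current = objective(state)
        history.append(current)
        if abs(current - previous) < tol:
            break
        previous = current
    return state, history


def relabel(state):
    """Stand-in for re-labeling: assign each value to its nearest codeword."""
    means = state["means"]
    labels = [min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
              for x in state["data"]]
    return {**state, "labels": labels}


def update_means(state):
    """Stand-in for a parameter update: recompute each codeword from its members."""
    means = list(state["means"])
    for k in range(len(means)):
        members = [x for x, lab in zip(state["data"], state["labels"]) if lab == k]
        if members:  # keep the old codeword if no value is assigned to it
            means[k] = sum(members) / len(members)
    return {**state, "means": means}


def objective(state):
    """Stand-in for Q: negative total squared error under the current labels."""
    return -sum((x - state["means"][lab]) ** 2
                for x, lab in zip(state["data"], state["labels"]))


initial = {"data": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
           "means": [0.0, 1.0],
           "labels": [0] * 6}
final, history = run_sequential_optimization(initial, [relabel, update_means], objective)
print(final["means"], len(history))
```

Because each step in this toy either keeps or improves the objective with the other quantities held fixed, the recorded objective values never decrease, which is the property that makes a convergence test on the objective meaningful.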
