
Chapter 4 An Application to Prosody Generation for

4.3 Prosodic Feature Prediction

The prosodic features to be predicted include syllable prosodic features (sp, sd, se) and inter-syllable pause duration (pd). Among them, the inter-syllable pause duration of each syllable juncture can be simply predicted by the break-acoustic model trained in Chapter 3, i.e.

$$pd_n^* = \arg\max_{pd_n} \, p(pd_n \mid B_n^*, l_n) \tag{4.2}$$

where $B_n^*$ represents the optimal break type of syllable $n$ predicted by the break-syntax model discussed in Section 4.2. The syllable prosodic features, namely syllable pitch contour sp, syllable duration sd, and syllable energy level se, are predicted by models formulated on the minimum mean squared error (MMSE) criterion. Given the predicted break sequence $B^*$ and the linguistic features $l$, the MMSE predictors for sp, sd, and se are $sp_n^* = E[sp_n \mid B^*, l]$, $sd_n^* = E[sd_n \mid B^*, l]$, and $se_n^* = E[se_n \mid B^*, l]$, respectively. Since sp, sd, and se are predicted in the same way, only the prediction model of sp is presented here. The MMSE predictor for sp can be written as

$$sp_n^* = E[sp_n \mid B^*, l] = \sum_{p_n} P(p_n \mid B^*, l) \, E[sp_n \mid p_n, B^*, l] \tag{4.3}$$

It can be seen from Eq. (4.3) that the predicted syllable pitch contour is a weighted sum of the reconstructed patterns formed by superimposing various APs, with the weights being the a posteriori probabilities of the prosodic state $p_n$. The a posteriori probability $P(p_n \mid B^*, l)$ is computed by the forward-backward algorithm with the probability $P(p_n \mid p_{n-1}, B_{n-1}^{n}, l_n)$, which is similar to the prosodic state model $P(p_n \mid p_{n-1}, B_{n-1})$. The probability $P(p_n \mid p_{n-1}, B_{n-1}^{n}, l_n)$ strengthens the influences of the linguistic features and the break $B_n$ on the current prosodic state $p_n$. In practice, since the space of the histories $\{p_{n-1}, B_{n-1}^{n}\}$ and linguistic features $\{l_n\}$ is too large, we partition it into several classes $C(p_{n-1}, B_{n-1}^{n}, l_n)$ and estimate the conditional probabilities $P(p_n \mid C(p_{n-1}, B_{n-1}^{n}, l_n))$ by the decision tree method. The question set used to construct the decision tree is listed below:

(1) Current word length in syllable: {1, 2, 3, 4, >4}.

(2) Current syllable position in word: {1st, intermediate, last, mono-syllable word}.

(3) Sentence length in syllable: {1, [2,5], [6,10], [11,15], [16,20], >20}.

(4) Current syllable position in sentence: {1st, 2nd, 3rd, [4th, 5th], [6th, 7th], [8th, 11th], last, 2nd last, 3rd last, [5th last, 4th last], [7th last, 6th last], [11th last, 8th last], others}. When a syllable falls into both a category counted from the beginning and one counted from the end, the category with the smaller count applies.

(5) PM after the current syllable (five types).

(6) POS3: 47-types POS.

(7) Break types of junctures n, n-1, and n-2.

(8) Prosodic state of the (n-1)-th syllable.
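Question (4) in the list above can be illustrated with a short Python sketch. The function name, the label strings, and the tie-breaking rule (a tie between the two counts goes to the count from the beginning) are assumptions made for illustration; the text only states that the smaller count wins.

```python
# Bins shared by the counts from the beginning and from the end of the sentence.
POSITION_BINS = [
    (1, 1, "1st"),
    (2, 2, "2nd"),
    (3, 3, "3rd"),
    (4, 5, "4th-5th"),
    (6, 7, "6th-7th"),
    (8, 11, "8th-11th"),
]


def sentence_position_category(index, sentence_length):
    """Map a 1-based syllable index to a category of question (4)."""
    if not 1 <= index <= sentence_length:
        raise ValueError("index must lie in 1..sentence_length")
    from_start = index
    from_end = sentence_length - index + 1
    count = min(from_start, from_end)
    # Assumption: when both counts are equal, use the count from the beginning.
    counted_from_start = from_start <= from_end
    for low, high, label in POSITION_BINS:
        if low <= count <= high:
            if counted_from_start:
                return label
            return "last" if count == 1 else label + " last"
    # More than 11 syllables away from both ends.
    return "others"
```

For a 10-syllable sentence, the 5th syllable is 5th from the beginning and 6th from the end, so it falls into the [4th, 5th] category, while the 6th syllable falls into the [5th last, 4th last] category.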

The proposed prediction method takes as input the break sequence produced by the two-stage method, which was chosen for its better break prediction performance.
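The MMSE prediction described around Eq. (4.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the data layout, and the toy numbers are hypothetical, the transition matrices stand in for the decision-tree probabilities of the prosodic state given its context class, and the reconstructed patterns are passed in as plain vectors.

```python
def forward_backward_posteriors(initial, transitions):
    """Posterior state probabilities of a time-varying Markov chain.

    initial[k]           : probability of state k at the first syllable.
    transitions[n][i][j] : probability of state j at syllable n+1 given
                           state i at syllable n (hypothetical stand-in for
                           the decision-tree probabilities).
    """
    num_states = len(initial)
    length = len(transitions) + 1
    # Forward pass.
    alpha = [list(initial)]
    for trans in transitions:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(num_states))
                      for j in range(num_states)])
    # Backward pass. This sketch has no observation likelihoods, so the
    # backward messages stay at 1; the pass is kept to mirror the
    # forward-backward structure named in the text.
    beta = [[1.0] * num_states for _ in range(length)]
    for n in range(length - 2, -1, -1):
        trans = transitions[n]
        beta[n] = [sum(trans[i][j] * beta[n + 1][j] for j in range(num_states))
                   for i in range(num_states)]
    posteriors = []
    for a, b in zip(alpha, beta):
        unnormalized = [x * y for x, y in zip(a, b)]
        total = sum(unnormalized)
        posteriors.append([u / total for u in unnormalized])
    return posteriors


def mmse_predict(posteriors, patterns):
    """Posterior-weighted sum of per-state reconstructed patterns.

    patterns[n][k] is the feature vector reconstructed for syllable n
    under prosodic state k.
    """
    predictions = []
    for post, pats in zip(posteriors, patterns):
        dim = len(pats[0])
        predictions.append([sum(post[k] * pats[k][d] for k in range(len(post)))
                            for d in range(dim)])
    return predictions


# Toy example with two prosodic states and two syllables (made-up numbers).
toy_posteriors = forward_backward_posteriors(
    initial=[0.5, 0.5],
    transitions=[[[0.9, 0.1], [0.2, 0.8]]],
)
toy_predictions = mmse_predict(
    toy_posteriors,
    patterns=[[[5.0], [4.0]], [[5.2], [4.6]]],
)
```

In the toy example the second syllable has posteriors of about 0.55 and 0.45, so its prediction is 0.55 × 5.2 + 0.45 × 4.6 ≈ 4.93.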

Table 4.7 displays the TREs of the prosody prediction results for syllable pitch contour, duration, and energy level. Since the evaluation of the proposed method should exclude the influence of the utterance, the TREs of syllable duration and energy level are the ratios of the sum-squared prediction errors to the sum-squared values of the normalized features, i.e. the features with the utterance-level influences removed. The TREs range from 18.50% to 45.60% in the inside test and from 18.92% to 46.24% in the outside test, which we consider acceptable. To isolate the effect of break prediction on prosodic feature prediction, we repeated the experiment using the correct break labels; Table 4.8 displays the results. Comparing the two tables shows that the TREs obtained with correct break labels are lower for all four features in both the inside and outside tests. This shows that break prediction plays an important role in the prediction of prosodic features. Erroneously predicted breaks cause gross shifts of PW patterns and may result in large prediction errors of prosodic features. Hence, improving break prediction is essential to better prosodic feature generation and is worth further investigation.
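The TRE ratio described above for syllable duration and energy level can be sketched as a small Python function. The function name and the toy numbers are hypothetical, and removing the utterance-level influence by subtracting a per-syllable offset is an assumption of this sketch; the text does not spell out the normalization.

```python
def tre(observed, predicted, utterance_offset):
    """Ratio of sum-squared prediction error to sum-squared normalized feature.

    Assumption: the utterance-level influence is removed by subtracting
    utterance_offset[n] from observed[n].
    """
    if not (len(observed) == len(predicted) == len(utterance_offset)):
        raise ValueError("all inputs must have the same length")
    error_energy = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    reference_energy = sum((o - u) ** 2
                           for o, u in zip(observed, utterance_offset))
    if reference_energy == 0:
        raise ValueError("normalized features have zero energy")
    return error_energy / reference_energy


# Toy syllable durations in seconds (made-up numbers).
toy_tre = tre(
    observed=[0.20, 0.25, 0.30],
    predicted=[0.22, 0.24, 0.27],
    utterance_offset=[0.25, 0.25, 0.25],
)
```

Here the sum-squared error is 0.0014 and the sum-squared normalized feature is 0.005, giving a TRE of about 28%.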

Table 4.7: TREs of the prosodic feature prediction results.

Test set    sp        sd        se        pd
Inside      42.39%    45.60%    37.61%    18.50%
Outside     42.73%    46.24%    35.98%    18.92%

Table 4.8: TREs of the prosodic feature prediction using correct break labels.

Test set    sp        sd        se        pd
Inside      32.72%    34.80%    32.13%    8.64%
Outside     39.10%    41.74%    33.33%    7.00%
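The comparison between Tables 4.7 and 4.8 can be made explicit with a few lines of Python that compute the TRE reduction, in percentage points, obtained by replacing predicted breaks with correct break labels. The numbers are copied from the two tables; the variable and function names are illustrative only.

```python
# TREs in percent, copied from Table 4.7 (predicted breaks).
TRE_PREDICTED_BREAKS = {
    "inside": {"sp": 42.39, "sd": 45.60, "se": 37.61, "pd": 18.50},
    "outside": {"sp": 42.73, "sd": 46.24, "se": 35.98, "pd": 18.92},
}

# TREs in percent, copied from Table 4.8 (correct break labels).
TRE_CORRECT_BREAKS = {
    "inside": {"sp": 32.72, "sd": 34.80, "se": 32.13, "pd": 8.64},
    "outside": {"sp": 39.10, "sd": 41.74, "se": 33.33, "pd": 7.00},
}


def tre_reduction(predicted, correct):
    """Percentage-point drop in TRE for every test set and feature."""
    return {
        test_set: {
            feature: round(predicted[test_set][feature]
                           - correct[test_set][feature], 2)
            for feature in predicted[test_set]
        }
        for test_set in predicted
    }


reductions = tre_reduction(TRE_PREDICTED_BREAKS, TRE_CORRECT_BREAKS)
for test_set, row in reductions.items():
    print(test_set, row)
```

Every entry is positive. The largest drop is for pd in the outside test (11.92 points) and the smallest is for se in the outside test (2.65 points).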

Figure 4.4 displays an example of the prosodic features predicted by the A-UJPLM-based approach. It shows the prosodic feature variations of two sentential utterances extracted from a long utterance. The predicted prosodic features match their original counterparts well for most syllables. Some large errors occur in the syllable durations and inter-syllable pause durations of the first sentence, mainly as a result of a series of break prediction errors. For example, the two contiguous break prediction errors in the first sentence, where (B2-2, B1) was predicted as (B1, B3), caused a gross shift of the first PW, moving the phrase-ending lengthening from the 4th syllable to the 6th syllable. The after-phrase long pause likewise shifted two syllables to the right. With the correct break labels, these large prediction errors disappeared, confirming that break prediction errors are responsible for them and that break prediction plays an important role in prosodic feature prediction.

[Figure 4.4: plot not reproduced; surviving axis labels: LogF0, Sec.]

Figure 4.4: An example of the prosodic feature prediction by the A-UJPLM-based approach. The panels from top to bottom show, respectively, syllable log-F0 means, syllable duration, syllable energy level, and inter-syllable pause duration. Solid lines, open circles, and closed circles denote, respectively, the original features, the features predicted using predicted breaks, and the features predicted using correct break labels. Vertical dashed lines mark erroneously predicted major/minor break boundaries, while vertical solid lines mark correctly predicted ones. Break labels in parentheses are erroneously predicted breaks.

4.4 Conclusions

In this chapter, a model-based prosody generation method for TTS was presented. The method consists of two steps: break prediction and prosodic feature prediction. For break prediction, three methods were investigated: the baseline all-in-one CART-based method, the two-stage method, and the Markov model-based method. Among them, the Markov model-based method achieves the highest accuracy in predicting the three break classes (non-break, minor break, and major break), while the baseline all-in-one CART-based method performs worst.

However, compared with the two-stage method, the Markov model-based method brings only negligible improvement on the outside test. We therefore conclude that linguistic features, rather than contextual break information, are the primary features in break prediction.

Based on the break prediction results of the two-stage method, four prosodic features (syllable pitch contour, syllable duration, syllable energy level, and inter-syllable pause duration) are predicted by the proposed A-UJPLM-based prosody generator. Experimental results show that the performance of the proposed method is acceptable. The oracle experiment using correct break labels gives an upper bound on performance and confirms that break prediction is essential to prosodic feature generation. Further refinement of the break prediction model is worth studying in the future.

Source document: 非監督式中文語音韻律標記及韻律模式 (Unsupervised Prosody Labeling and Prosody Modeling of Mandarin Speech), pages 106-111