Chapter 3 Advanced Unsupervised Joint Prosody
3.3 Model Training by the A-UJPLM Method
3.3.1 The Syllable Prosodic Model
We first examined the parameters of the syllable prosodic model p(X | B, PS, L). The covariance matrices/variances of the original and residual syllable log-F0 contour, duration and energy level are shown below:
All components of the covariance matrices and variances of the three residuals were much smaller than their counterparts for the original features. This shows that the affecting factors considered were indeed essential to the variation of sp, sd and se.
Table 3.2 displays the APs of five tones. These results generally agreed with those of previous studies [58,59,67,68].
Table 3.2: APs of five tones
Tone                1       2       3       4       5
Pitch mean          0.153  -0.080  -0.175   0.088  -0.145
Syllable duration   0.012   0.015  -0.008  -0.001  -0.075
Energy level        0.367  -1.015  -1.272   1.500  -1.940
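As a minimal sketch of how the tone APs in Table 3.2 might be used, the snippet below ranks the tones by their AP for each feature and subtracts a tone AP from an observed value. It assumes that APs combine additively with the observation, which this section does not state explicitly; the function names and the sample observation are hypothetical, and the units of the table values are not specified here.

```python
# Tone APs copied from Table 3.2 (tone -> AP value).
TONE_APS = {
    "pitch_mean": {1: 0.153, 2: -0.080, 3: -0.175, 4: 0.088, 5: -0.145},
    "syllable_duration": {1: 0.012, 2: 0.015, 3: -0.008, 4: -0.001, 5: -0.075},
    "energy_level": {1: 0.367, 2: -1.015, 3: -1.272, 4: 1.500, 5: -1.940},
}


def remove_tone_ap(feature, tone, observed):
    """Subtract the tone AP from an observed value (additive assumption)."""
    return observed - TONE_APS[feature][tone]


def rank_tones(feature):
    """Return the tones sorted from the largest AP to the smallest."""
    aps = TONE_APS[feature]
    return sorted(aps, key=aps.get, reverse=True)


# Hypothetical observation: an energy value of 2.0 on a Tone 4 syllable.
compensated = remove_tone_ap("energy_level", 4, 2.0)
energy_order = rank_tones("energy_level")
```

Under this reading of the table, Tone 5 has the smallest duration and energy-level APs of the five tones.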
Figure 3.3 displays the decision tree analysis of the duration APs of base-syllable type. The figure shows that syllables with an initial in {b, d, g} are on average much shorter than other initial-final combinations. Generally, syllables with an initial in {q, ch, c, f, h, x, sh, s, p, t, k} are longer, while syllables whose final is a single vowel are shorter. These results generally conformed to those of the previous study [59]. The decision tree analysis of the energy-level APs of final type is shown in Figure 3.4. The figure shows that the average energy level decreases in the order of open, mid and close vowels. In addition, the energy level of finals with a medial is generally smaller than that of other finals.
Figure 3.3: Decision tree analysis of duration APs of base-syllable type. The number in parentheses is the average length (ms) of the APs in the leaf node. A solid line indicates a positive answer to the question and a dashed line indicates a negative answer.
Figure 3.4: Decision tree analysis of energy-level APs of final. The number in parentheses is the average energy level (dB) of the APs in the leaf node. A solid line indicates a positive answer to the question and a dashed line indicates a negative answer.
Table 3.3 displays the APs of the pitch, duration and energy prosodic states. It can be seen from Figure 3.5 that, for each of the three prosodic features, the APs of 16 prosodic states spanned widely to cover the whole dynamic range.
Table 3.3: APs of prosodic states
Figure 3.5: Distributions of normalized prosodic features and the APs of prosodic states (vertical lines).
Table 3.4 displays the total residual errors (TREs) of the prosodic models for syllable pitch contour, duration, and energy level with respect to the use of different combinations of APs. The table shows that the TREs decreased as more APs were considered, and that the prosodic state produced the largest reduction. This result suggests that higher-level prosodic constituents (i.e., PW, PPh and PG/BG) may account for a large share of the prosodic variation. A more detailed analysis of the prosodic state is given in Subsection 3.3.3.
Table 3.4: TREs of the prosodic modelings for syllable pitch contour, duration and energy level w.r.t. the use of different combinations of affecting factors.
Pitch                     Duration                  Energy level
APs               TRE     APs               TRE     APs               TRE
-                 -       + Utterance       98.8%   + Utterance       77.8%
+ Tone            71.6%   + Tone            88.1%   + Tone            74.5%
+ Coarticulation  60.3%   + Base-syllable   62.9%   + Final           48.0%
+ Prosodic state  1.1%    + Prosodic state  1.1%    + Prosodic state  1.0%
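The cumulative reading of Table 3.4 can be illustrated with a small sketch. It assumes that the TRE is the residual variance expressed as a percentage of the original variance, and that each AP is estimated as a per-label mean and removed additively; neither assumption is confirmed by this section. All data and helper names below are hypothetical toy values, not corpus results.

```python
from statistics import pvariance


def tre_percent(original, residual):
    """Residual variance as a percentage of the original variance (assumed TRE definition)."""
    return 100.0 * pvariance(residual) / pvariance(original)


def mean_by_label(values, labels):
    """Estimate one AP per label as the mean of the values carrying that label."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def subtract_aps(values, labels, aps):
    """Remove the AP selected by each sample's label from its value."""
    return [value - aps[label] for value, label in zip(values, labels)]


# Toy syllable durations with tone and prosodic-state labels (hypothetical).
durations = [0.20, 0.24, 0.18, 0.30, 0.26, 0.16]
tones = [1, 2, 1, 2, 2, 1]
states = [0, 1, 0, 2, 1, 0]

# Remove the tone APs first, then the prosodic-state APs from what remains.
after_tone = subtract_aps(durations, tones, mean_by_label(durations, tones))
after_state = subtract_aps(after_tone, states, mean_by_label(after_tone, states))

tre_tone = tre_percent(durations, after_tone)
tre_state = tre_percent(durations, after_state)
```

With per-label means as APs, each removal step cannot increase the residual variance, which mirrors the monotone decrease seen in each column of the table.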
3.3.2 The Break-Acoustics Model
Figure 3.6 displays the distributions of pause duration, energy-dip level, normalized pitch jump, and normalized duration lengthening factors for the root nodes of the seven break types. As the figure shows, higher-level break types were generally associated with longer pause duration, lower energy-dip level, greater normalized pitch jump, and larger duration lengthening factors. The distributions of pause duration and energy-dip level were similar to those obtained in the previous study shown in Fig. 2.5. Notice that B2-3 was similar to B1 and B2-1 in its distributions of pause duration and energy-dip level. B2-1, B2-2, B3, and B4 had positive normalized pitch jumps on average, while B0, B1, and B2-3 had negative ones. This result illustrates the declination and reset effects of log-F0 at intra-PW and inter-PW syllable boundaries, respectively. The normalized duration lengthening factors of B2-2, B2-3, B3, and B4 were larger than those of B0, B1, and B2-1, which shows the lengthening effect for the last syllable of a PW, PPh, or PG/BG.
Figure 3.6: The pdfs of (a) pause duration, (b) energy-dip level, (c) normalized pitch jump, (d) normalized duration lengthening factor 1 and (e) normalized duration lengthening factor 2 for the root nodes of the seven break types. Numbers in parentheses denote the mean values.
3.3.3 The Prosodic State Model
Figure 3.7 displays some of the most significant transitions of the pitch prosodic state, P(p_n | p_{n-1}, B_{n-1}), for the seven break types. The prosodic state transitions of B0, B1, B2-1, B2-2, B3 and B4 generally agree with the results illustrated in Subsection 2.4.3. The transitions of B2-3 are similar to those of B0 and B1, which implies that no apparent pitch reset exists at the duration-lengthening juncture of B2-3.
Figure 3.7: Pitch prosodic state transitions from syllable n-1 to syllable n for each break type. Darker lines represent more primary prosodic state transitions.
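The transition probabilities P(p_n | p_{n-1}, B_{n-1}) discussed above can be estimated by relative-frequency counting once state and break labels are available. The following is a minimal sketch under that assumption; it is not the training procedure of the A-UJPLM method, and the function name, the indexing convention (breaks[n] is the juncture following syllable n) and the toy labels are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(states, breaks):
    """Estimate P(p_n | p_{n-1}, B_{n-1}) by relative frequency.

    states[n] is the prosodic state of syllable n and breaks[n] is the break
    type at the juncture following syllable n (assumed convention), so
    len(breaks) == len(states) - 1.
    """
    counts = defaultdict(Counter)
    for n in range(1, len(states)):
        context = (states[n - 1], breaks[n - 1])
        counts[context][states[n]] += 1
    return {
        context: {state: c / sum(nxt.values()) for state, c in nxt.items()}
        for context, nxt in counts.items()
    }


# Toy label sequences (hypothetical, not corpus data).
states = [5, 4, 9, 8, 5, 3, 9, 8]
breaks = ["B0", "B3", "B1", "B0", "B0", "B3", "B1"]
probs = estimate_transitions(states, breaks)
```

In this toy sequence the context (state 5, B0) is followed once by state 4 and once by state 3, so each receives probability 0.5; plotting the largest entries per break type would give a picture in the spirit of Figure 3.7.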
Figure 3.8 illustrates the transitions of the duration prosodic state, P(q_n | q_{n-1}, B_{n-1}). Generally, larger break types produced more significant high-to-low transitions. The transitions of B3 and B4 show that PPhs and PG/BGs usually began with lower states and ended with higher states, manifesting the significant duration lengthening effect before major break junctures. Compared with the transitions of B3 and B4, those of B2-2 and B2-3 had less high-to-low dynamics, implying less syllable duration lengthening before minor break junctures. B0, B1 and B2-1 had small nearby-state transitions without a preferred direction.
Figure 3.8: Duration prosodic state transitions from syllable n-1 to syllable n for each break type. Darker lines represent more primary prosodic state transitions.
The energy prosodic state transitions are shown in Figure 3.9. Low-to-high transitions were primarily found at major breaks (i.e., B3 and B4), while high-to-low and level transitions were mainly observed at non-breaks and minor breaks. These results demonstrate the declination of energy level within a PPh or PG/BG, and its reset at the start of a new PPh or PG/BG.
Figure 3.9: Energy prosodic state transitions from syllable n-1 to syllable n for each break type. Darker lines represent more primary prosodic state transitions.
3.3.4 The Break-Syntax Model
Figure 3.10 displays the decision tree of the break-syntax model. The figure shows that the root nodes of the two sub-trees T3 and T4, which corresponded to syllable junctures with a PM, were mainly composed of the major break types B3 and B4. T4 contained more B4 because it corresponded to major PMs. T6, which corresponded to intra-word junctures, was mainly composed of non-breaks. T5 had a much more complex tree structure than the other sub-trees. By further analyzing the entropies of the leaf nodes in sub-trees T3 to T6, we found that T5 had the largest entropy. This implies that it is more difficult to correctly predict the break types of non-PM inter-word junctures.
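[Figure 3.10 tree diagram: the root node (51813 samples) is split by the question "PM?" into nodes of 4680 and 47133 samples. "Comma?" splits the 4680 samples into 1106 and 3574, giving sub-trees T3 and T4; "Interword?" splits the 47133 samples into 22187 and 24946, giving sub-trees T5 and T6.]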
Figure 3.10: The decision tree of the break-syntax model. The bar plot associated with a node denotes the distributions of these six break types (B0, B1, B2-1, B2-2, B3, B4, from left to right) and the number is the total sample count of the node.
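The entropy comparison of leaf nodes mentioned above can be sketched as follows. The snippet computes the Shannon entropy of a leaf's break-type counts and a sample-weighted average over the leaves of a sub-tree; how the entropies were actually aggregated is not stated in this section, so the weighting is an assumption, and the leaf counts are hypothetical values chosen only to contrast a peaked and a spread-out distribution.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (in bits) of a break-type distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def weighted_leaf_entropy(leaves):
    """Sample-weighted average entropy over the leaf nodes of one sub-tree."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * entropy_bits(leaf) for leaf in leaves)


# Hypothetical leaf counts over six break types (B0, B1, B2-1, B2-2, B3, B4).
peaked_subtree = [[90, 10, 0, 0, 0, 0], [0, 100, 0, 0, 0, 0]]
spread_subtree = [[30, 30, 20, 20, 0, 0], [25, 25, 25, 25, 0, 0]]

peaked_entropy = weighted_leaf_entropy(peaked_subtree)
spread_entropy = weighted_leaf_entropy(spread_subtree)
```

A sub-tree whose leaves mix several break types, like the second toy example, has higher entropy and is correspondingly harder to predict from its leaf distributions.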
More detailed structures of these four sub-trees up to the fourth layer are shown in Figure 3.11. Figures 3.11(a) and (b) show that nodes in T3 and T4 were mainly split by questions related to sentence-level linguistic features such as LFS>=7 (Is the length of the following sentence equal to or greater than 7?). Generally, a PM juncture was more likely to be B4 when the previous/following sentence was long. Figure 3.11(b) also shows that the minor PM “dun hao” (or “、”) was likely to be labeled as B3 or B2-2 rather than B4. Figure 3.11(c) shows that, in T6, Type-2 intra-word junctures, which are anticipated to be potential break positions, were more likely to be minor breaks than Type-1 intra-word junctures, which were simply labeled as B0 or B1. For the most complex sub-tree, T5 (see Figure 3.11(d)), non-PM inter-word junctures were first discriminated as tending toward B0 or B1 when the initial of the following syllable was in {null initial, m, n, l, r}. Junctures followed by a non-sonorant initial could be further discriminated as non-breaks when the following word had the POS “DE”. This result matched the previous finding presented in Subsection 2.5.1. It was also found that the distance to previous PM
discriminate other sub-trees of T5. Generally speaking, a non-PM inter-word juncture was more likely to be labeled as a minor break as its distance to the nearby PM increased.
Figure 3.11: The more detailed structures of sub-trees of (a) T3, (b) T4, (c) T6 and (d) T5. Solid line indicates positive answer to the question and dashed line indicates negative answer.