

The new model λ̄ then becomes λ for the next iteration, and the re-estimation process is repeated until the likelihood converges.

4.4 Prosodic transformation

Most current approaches to voice conversion make little or no use of pitch measures, despite evidence that intonational information is highly correlated with speech individuality. The main reason is the difficulty of finding an appropriate feature set that captures linguistically relevant intonational information. This problem is alleviated in the Mandarin speech conversion task because its tonal system allows relatively non-overlapping characterizations of the corresponding F0 contour dynamics. Speech enhancement can therefore be realized by a proper analysis and control of the F0 contour dynamics. Since pitch is defined only for voiced speech, the pertinent tone-related portions of syllables are the vowel or diphthong nuclei, from which distinctive pitch changes are perceived. Recognizing this, we need only concatenate the F0 values of the final subsyllable into a vector and represent it by a small, linguistically motivated parameter set. Unlike conventional frame-based VQ approaches [10], this segment-based approach makes it possible to convert not only the static but also the dynamic characteristics of F0 contours.

Choosing an appropriate representation of the F0 contour is the first step in applying pitch modification to voice conversion. Taking advantage of the simple tone structure of F0 contours in Mandarin speech, the polynomial curve-fitting technique is used to decompose the F0 contour into mutually orthogonal components in the transform domain [44]. The F0 contour can therefore be represented by a smooth curve formed by orthogonal expansion using a few low-order transform coefficients. In describing the source speaker's F0 contour, F0 values are measured only for the final subsyllable and are in the form {w_0(m_x), B_x ≤ m_x ≤ T_x}. For notational convenience, the F0 contour of a segment with I_x + 1 frames is rewritten as {w_0(i_x), 0 ≤ i_x ≤ I_x}, where i_x = m_x − B_x and I_x = T_x − B_x. Parameters for pitch modification are then extracted from the F0 contour segment by the orthogonal polynomial transform:

b_j^(x) = (1 / (I_x + 1)) Σ_{i_x=0}^{I_x} w_0(i_x) Ψ_j(i_x),   j = 0, 1, 2, 3.

Due to the smoothness of an F0 contour segment [44], the first four discrete Legendre polynomials are chosen as the basis functions Ψ_j(·) to represent it. Based on this orthogonal polynomial representation, the source F0 contour is characterized by a 4-dimensional feature vector, b^(x) = (b_0^(x), b_1^(x), b_2^(x), b_3^(x))^T, which is quantized using the vector quantization (VQ) technique. Similarly, b^(y) = (b_0^(y), b_1^(y), b_2^(y), b_3^(y))^T is a feature vector representing the F0 contour of the target speaker.
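As a concrete sketch, the orthogonal polynomial representation can be implemented as follows (a minimal NumPy sketch; the function names are ours, and the basis is built by Gram-Schmidt orthonormalization of low-order monomials as a stand-in for the discrete Legendre polynomials Ψ_j):

```python
import numpy as np

def discrete_legendre_basis(n_points, order=4):
    """Orthonormal basis over n_points samples, built by Gram-Schmidt
    on the monomials 1, t, t^2, t^3 (a stand-in for the first four
    discrete Legendre polynomials Psi_j)."""
    t = np.linspace(-1.0, 1.0, n_points)
    basis = []
    for j in range(order):
        v = t ** j
        for b in basis:                 # remove components along earlier vectors
            v = v - np.dot(v, b) * b
        basis.append(v / np.linalg.norm(v))
    return np.stack(basis)              # shape (order, n_points)

def f0_to_coeffs(f0_segment):
    """Forward transform: project an F0 segment onto the basis -> 4-dim b."""
    psi = discrete_legendre_basis(len(f0_segment))
    return psi @ f0_segment

def coeffs_to_f0(b, n_points):
    """Inverse transform: smooth F0 curve from the four coefficients."""
    psi = discrete_legendre_basis(n_points)
    return b @ psi
```

Because any contour that is (up to) cubic in time lies in the span of the basis, such a segment is reproduced exactly by the four coefficients; real F0 segments are merely smoothed.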

Our conversion technique is based on codebook mapping and consists of two steps: a learning step and a conversion-synthesis step. In the learning step, the source and target F0 codebooks are generated separately using the orthogonal polynomial representation of F0 contours in the training utterances. Each of the two codebooks includes 16 codevectors and is designed using the well-known LBG algorithm [45]. Next, a histogram of correspondences between codebook elements of the two speakers is calculated. Using this histogram as a weighting function, the mapping codebook is defined as a linear combination of target F0 codevectors. In the conversion-synthesis step, the F0 contour of the input speech is orthogonally transformed and vector-quantized using the source F0 codebook. Pitch modification is then carried out by decoding the quantized vector with the mapping codebook. If the decoded codevector is b̂ = (b̂_0, b̂_1, b̂_2, b̂_3)^T, the converted F0 contour is reconstructed by the inverse orthogonal transform.
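The codebook-mapping scheme can be sketched as follows (a minimal NumPy illustration with hypothetical function names; LBG codebook training is omitted and pre-built codebooks are assumed):

```python
import numpy as np

def encode(vectors, codebook):
    """VQ encoding: nearest-codevector index for each row."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def mapping_codebook(src_cb, tgt_cb, src_feats, tgt_feats):
    """Build a histogram of correspondences between paired training
    vectors; each mapping codevector is then the histogram-weighted
    linear combination of target codevectors."""
    si = encode(src_feats, src_cb)
    ti = encode(tgt_feats, tgt_cb)
    hist = np.zeros((len(src_cb), len(tgt_cb)))
    for i, j in zip(si, ti):
        hist[i, j] += 1
    w = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1)
    return w @ tgt_cb

def convert(b_src, src_cb, map_cb):
    """Conversion step: encode with the source codebook,
    decode with the mapping codebook."""
    return map_cb[encode(b_src, src_cb)]
```

In this sketch the paired feature vectors stand in for the time-aligned b^(x), b^(y) coefficient vectors extracted from training utterances.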

Hearing-impaired speech is generally characterized by a much lower speaking rate and by excessive shortening of consonants. For the converted speech to carry the naturalness of human speech, the durations of individual phonemes need to match those found in natural speech. This can be done by modifying the interval of each synthesis frame by a time-varying factor ρ(m), i.e., Q(m) = ρ(m)Q, where Q is the original frame interval.

The case ρ(m) > 1 corresponds to a time-scale expansion, while ρ(m) < 1 corresponds to a compression. The next step is to determine the time-scaling factor ρ(m) based on spectral representations of the same syllable uttered by the source and target speakers. In describing the source speaker's spectral envelope, cepstral coefficients are measured frame by frame and are of the form X = {x(m_x), m_x = 1, 2, . . . , T_x}, where T_x is the syllable duration in frames. Similarly, Y = {y(m_y), m_y = 1, 2, . . . , T_y} is the sequence of T_y cepstral vectors representing the target speaker's spectral envelope. Acoustic analysis of Mandarin hearing-impaired speech has indicated that unvoiced sounds such as consonants may not be subject to the same scaling as the vowels. Thus, for time-scaling of speech, different approaches should be applied in the intervals whose frames are marked as Mandarin initials and in those marked as finals.

The boundary between the initial and final parts of an isolated syllable is relatively easy to detect by a voiced/unvoiced decision based on the voicing probability Pv. Let B_x and B_y represent the starting frames of the final subsyllables in the source and target utterances, respectively. For the constituent frames of the initial consonant, a linear time normalization is applied with a fixed factor ρ = (B_y − 1)/(B_x − 1). For the final subsyllables, the two sets of paired cepstral vectors, {x(m_x), B_x ≤ m_x ≤ T_x} and {y(m_y), B_y ≤ m_y ≤ T_y}, are time-aligned using dynamic time warping (DTW) [46]. The DTW problem is usually formulated as a path-finding problem over a finite range of grid points (m_x, m_y). The basic strategy applied here is to interpret the slope of the DTW path as a time-scaling function, which indicates on a frame-by-frame basis how much to shorten or lengthen each frame of the source utterance in order to reproduce the same duration as in the target utterance.
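The boundary detection and the fixed-factor normalization for initials might be sketched as follows (the autocorrelation-based voicing probability is our assumption; the text does not specify how Pv is computed):

```python
import numpy as np

def voicing_prob(frame):
    """Crude voicing probability Pv: peak of the normalized
    autocorrelation outside very short lags (an assumed estimator,
    not the one used in the chapter)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    return float(ac[20:].max() / ac[0])   # skip lags shorter than 20 samples

def final_start(frames, threshold=0.5):
    """Index B of the first voiced frame, taken as the
    initial/final boundary of an isolated syllable."""
    for m, fr in enumerate(frames):
        if voicing_prob(fr) > threshold:
            return m
    return len(frames)

def initial_scale_factor(B_x, B_y):
    """Fixed linear time-normalization factor for the initial part."""
    return (B_y - 1) / (B_x - 1)
```

With the boundaries B_x and B_y in hand, the initial consonant frames are scaled uniformly while the final subsyllable is handled by the DTW procedure described next.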

The DTW aligns two utterances with a path through a matrix of similarity distances that minimizes the sum of the distances. We begin by defining a partial accumulated distance D_A(m_x, m_y), representing the accumulated distance along the best path from the point (B_x, B_y) to the point (m_x, m_y). For an efficient implementation, a dynamic programming recursion is applied to compute D_A(m_x, m_y) for all local paths that reach (m_x, m_y) in exactly one step from an intermediate point (m′_x, m′_y), using a set of local path constraints. Table 4.1 summarizes the local constraints and slope weights for the three local paths, ℘1, ℘2, and ℘3, chosen for the implementation. The local distance d(m_x, m_y) between a time-aligned pair of cepstral vectors is defined as a squared Euclidean distance. We summarize the dynamic programming implementation for finding the time-scaling factor at every frame of a final subsyllable as follows.

1) Initialization: Set D_A(B_x, B_y) = d(B_x, B_y).

2) Recursion: For each grid point (m_x, m_y), compute D_A(m_x, m_y) as the minimum over the local paths ℘1, ℘2, and ℘3 of D_A(m′_x, m′_y) + w℘ d(m_x, m_y), where (m′_x, m′_y) is the predecessor point of the local path and w℘ is its slope weight (Table 4.1), and record the minimizing path.

3) Path backtracking: According to the optimal DTW path, we define the time-scaling factor ρ(m) = 0.5, 1, or 2 for the case where the move from the point (m′_x, m′_y) to the point (m_x, m_y) is via the local path ℘1, ℘2, or ℘3, respectively.
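The steps above can be sketched as follows (a minimal NumPy sketch assuming local steps of (2, 1), (1, 1), and (1, 2) source/target frames for ℘1, ℘2, and ℘3 and unit slope weights; the exact constraints and weights of Table 4.1 are not reproduced here):

```python
import numpy as np

# Assumed local paths: (source-step, target-step) and the resulting
# per-source-frame time-scaling factor rho.
PATHS = {1: ((2, 1), 0.5), 2: ((1, 1), 1.0), 3: ((1, 2), 2.0)}

def dtw_scaling(src, tgt):
    """DTW over cepstral sequences src (Tx, d) and tgt (Ty, d); returns
    rho(m) for the source frames consumed along the optimal path,
    decoded by backtracking."""
    Tx, Ty = len(src), len(tgt)
    D = np.full((Tx, Ty), np.inf)
    back = np.zeros((Tx, Ty), dtype=int)
    d = lambda i, j: float(((src[i] - tgt[j]) ** 2).sum())  # squared Euclidean
    D[0, 0] = d(0, 0)
    for i in range(Tx):                 # dynamic programming recursion
        for j in range(Ty):
            if i == j == 0:
                continue
            for p, ((di, dj), _) in PATHS.items():
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0 and D[pi, pj] + d(i, j) < D[i, j]:
                    D[i, j] = D[pi, pj] + d(i, j)
                    back[i, j] = p
    assert np.isfinite(D[Tx - 1, Ty - 1]), "no admissible path"
    rho, i, j = [], Tx - 1, Ty - 1      # backtrack, emitting rho per move
    while (i, j) != (0, 0):
        (di, dj), r = PATHS[back[i, j]]
        rho[:0] = [r] * di
        i, j = i - di, j - dj
    return rho
```

Identical sequences yield ρ(m) = 1 throughout, while a target final twice as long as the source forces ℘3 moves with ρ(m) = 2.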

4.5 Experimental Results

Experiments were carried out to investigate the potential advantages of using the proposed conversion algorithms to enhance hearing-impaired Mandarin speech. Our efforts began with the collection of a speech corpus that contained two sets of monosyllabic utterances, one for system learning and one for testing in our voice conversion

Table 4.1: Incremental distortions and slope weights for local paths.

experiment. The text material consisted of 76 isolated tonal CV syllables (19 base syllables × 4 tones), formed by pairing the three prominent vowels /a,i,u/ with 11 consonants, the five fricatives and six affricates of Mandarin Chinese, excluding combinations that were phonologically unacceptable. These two classes were chosen because research findings show that they are among the sounds most frequently misarticulated by hearing-impaired Mandarin speakers [47].

Speech samples were produced by two male adult speakers, one with normal hearing sensitivity and the other with congenital severe-to-profound (> 70 dB) hearing loss. The speech of the impaired speaker was largely intelligible in sentences but often caused misunderstanding when produced in syllable form, owing to prosodic deviations and misarticulated initial consonants.

Figure 4.3 presents the results of our pitch modification method for transforming F0 contours. Panels 4.3(a) and 4.3(c) show the F0 contours of the source and target syllable /ti/ spoken with four different tones, and panel 4.3(b) shows the converted F0 contours obtained using VQ and the orthogonal polynomial representation. Comparing the F0 variations as a function of time in panel 4.3(b) with those in 4.3(a) clearly shows the improvements on tones 2 and 3. Our next examination focused on how the converted F0 contours were perceived relative to those of the source. For easy judgment of the tonal categories, only syllables of one consonant class (affricates) were used, for a total of 40 tonal syllables (10 for each tone). Four male and one female adult native

Figure 4.3: F0 contours for syllable /ti/ spoken with four different tones: (a) source speech, (b) converted speech, and (c) target speech.

Table 4.2: Confusion matrix showing tone recognition results for source syllables.

Table 4.3: Confusion matrix showing tone recognition results for converted syllables.

speakers of Mandarin Chinese, all with normal hearing status, served as the listeners.

Tables 4.2 and 4.3 present the confusion matrices showing the tone recognition results for the source and the converted set, respectively. The results in each table are based on the listeners' judgments of 400 responses (40 tonal syllables × 5 listeners × 2 sessions). It is clear that the proposed system resulted in more intelligible stimuli, with an average tone recognition score of 86.25%, compared with 69% for the source stimuli.

The results further showed an improvement of 38% and 28% for syllables with tone 2 and tone 3, respectively.

To establish the statistical significance of these results, we calculated the P-value using a Z-test [48]. If we let p1 and p2 denote the recognition rates for the source and

Table 4.4: Raw data and tone recognition rates derived from Tables 4.2 and 4.3.

converted set, respectively, our objective was to test the null hypothesis H0: p1 ≥ p2. Based on the statistics in Table 4.4, the Z-test yielded a small P-value (P < 0.0002);

therefore, the null hypothesis was strongly rejected. Further evidence of improvement is seen in Figure 4.4, which shows our prosodic modification applied to continuous speech. A four-syllable utterance, containing tones 4-4-3-3, was used. According to the tone-sandhi rule, the first tone 3 should be produced with a tone 2 F0 pattern.
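The significance test can be sketched as a standard one-sided two-proportion Z-test with a pooled standard error (the exact statistic used in [48] may differ; the counts below are reconstructed from the reported rates, 69% and 86.25% of 400 responses each):

```python
import math

def two_prop_ztest(k1, n1, k2, n2):
    """One-sided two-proportion Z-test of H0: p1 >= p2 against
    H1: p1 < p2, using the pooled-proportion standard error."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z >= z)
    return z, p_value

# 276/400 = 69% correct for the source set, 345/400 = 86.25% converted.
z, p = two_prop_ztest(276, 400, 345, 400)
```

With these counts the resulting P-value is indeed far below the 0.0002 threshold reported in the text.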

The audio presentation, however, showed that the first tone 3 was produced more like tone 1 than like the targeted tone 2. A comparison of the F0 contours for the source and target utterances showed that the former exhibited fewer fine fluctuation details, even though both variation ranges were within 100 Hz. Further, the first tone 4 essentially carried a tone 1 F0 pattern, and the last tone 3 was produced with the rising part truncated. The improvement due to prosodic modification can be seen in the following areas. First, the missing falling part of the first tone 4 and the dipping of the last tone 3 were fully restored. Second, the rising part of the first tone 3 segment was steeper in slope, making it more appropriate for the targeted tone 2.

To hear audio examples of the voice conversion system, please visit the web site at http://a61.cm.nctu.edu.tw/demo.

Results of the spectral conversion were analyzed acoustically with a software spectrograph to assess how closely the converted speech resembled the target speech in rendering acoustic cues for phoneme perception. The improvement for the fricatives is shown

Figure 4.4: F0 contours for a four-syllable utterance /ying-4 yong-4 ruan-3 ti-3/(Application software): (a) source speech, (b) converted speech, and (c) target speech.

in three aspects: (1) lengthening of the consonant duration, (2) a less abrupt transition, or a gradual blending of the acoustic energy, near the consonant-vowel boundary, and (3) a redistribution of acoustic energy around appropriate frequency regions, such as an elevation to 3 kHz for the syllable /shu/ or to 4 kHz for the syllable /shii/. An example of such spectral differences for the syllable /shu/ is shown in Figure 4.5. Even closer spectrographic matches were obtained for the affricates, as shown in Figure 4.6 using /chii/ as an example. In normal production, affricates are stops followed by fricatives, represented on the spectrograph as a burst with its energy concentrated at higher frequencies that blends immediately into the following fricative. The distorted affricate, however, was rendered spectrographically as a stop with a full voicing gap but little frication. Our analysis revealed that the conversion filled the gap, softened the burst, removed the low-frequency energy, and elevated the fricative portion to normal frequency ranges. When examined along with audio presentations, this modification also changed the vowel percept from the erroneous high front lip-rounded vowel /yu/ to the correct /i/, even though formant modification for the vowel was less apparent.

Two listening tests, preference and intelligibility, were conducted to determine whether the above spectrographic enhancement could also be realized perceptually.

The five listeners from the previous tone recognition test were used. In the preference test, the listeners were asked to give their preference judgments over pairs of source vs.

converted syllables. A two-alternative-forced-choice (2AFC) test paradigm was used, in which the presentation order of the two stimuli was randomized. For converted stimuli, two sets of converted syllables were used: (1) those with spectral conversion only and (2) those with combined spectral and time-scaled conversions. The results showed 62%

of the 380 responses (2 stimulus sets × 19 base syllables × 5 listeners × 2 sessions) preferred spectrally modified syllables to source syllables, while 84% preferred those with combined modifications. To further validate the effect of the proposed approach, intelligibility measures were obtained for 19 base syllables before and after spectral

conversion. The listeners were instructed to write down their responses using Mandarin phonetic symbols. Figure 4.7 shows a comparison of the percent-correct phoneme recognition scores for the source and converted stimuli. Individual phonemes are arranged from left to right into three groups: fricatives, affricates, and vowels. Recognition of the vowels /a,u/ was near perfect even without the modification. In contrast, recognition of the affricates and the fricatives (with the exception of /h/) was near or at 0%, a finding consistent with our earlier observation that these two consonant classes are frequently substituted with stops by hearing-impaired speakers. The relatively good recognition of /h/, even for the source, can be explained by the fact that little oral modification of the glottal air source is required during its articulation.

With the converted stimuli, an improvement was seen in all three groups. An average increase of 47.25% was obtained for the fricatives, excluding /h/. The increase was 20% larger (67.17%) for the affricates, with /ji, chi/ showing total correction, making this group the phoneme class that benefited most from our application. The vowels, despite their small improvement, were the only group showing total correction for all members.

We also considered a recognition experiment, based on the spectral conversion only, to perform supervised speaker adaptation for the hearing-impaired speaker. The recognition task was Mandarin digit strings of unrestricted length, using the system described in Section 4.2. The source and target speakers are both male. The recognition accuracy for the target speaker is 96.04% on the training tokens.

After conversion, the recognition accuracy for the source speaker improved from 19.51% on the original speech to 36.02% on the converted speech.

4.6 Summary

This chapter presents a novel means of exploiting spectral and prosodic transformations to enhance disordered speech. In spectral conversion, subsyllable-based GMMs were

applied within the sinusoidal framework to modify the articulation-related parameters of speech. In prosodic conversion, we found that the tone structure of F0 contours in Mandarin speech can be exploited through an orthogonal polynomial representation of pitch contours. The results also suggest a new approach to time-scale modification, in which the initial part of a syllable is linearly normalized with a fixed factor and a DTW algorithm then controls the time-varying scaling factor for the final part. Evaluations by objective tests and listening tests show that the proposed techniques can improve the intelligibility and naturalness of hearing-impaired Mandarin speech.

Figure 4.5: Spectrograms for syllable /shu/: (a) source speech, (b) converted speech, and (c) target speech.

Figure 4.6: Spectrograms for syllable /chii/: (a) source speech, (b) converted speech, and (c) target speech.

Figure 4.7: Percent correct phoneme recognition scores for source and converted speech.

Chapter 5
