
A Study on Semi-automatic Voice Dubbing System






A Study on Semi-automatic Voice Dubbing System (NSC 88-2213-E-011-035)


E-mail: root@guhy.ee.ntust.edu.tw

Keywords: voice dubbing, timbre translation, vocal tract, speech synthesis



Abstract

Voice dubbing, or timbre translation, refers to processing that can translate a single voice timbre into many distinct timbres. In drama works, different actors are usually dubbed with different timbres, and many dubbing persons are therefore required. To reduce the cost of dubbing the actors, it would be useful if the computer could help convert a single voice timbre into many distinct timbres. Therefore, we intend to develop a semi-automatic voice-dubbing system. It is called semi-automatic because the emotions (such as anger, sadness, and happiness) perceived from the translated voice still need to be controlled by the person who provides the original voice. In this project, a method for timbre translation is proposed. The goal is accomplished by providing independent control of fundamental frequency, vocal-tract length, voice source, and internal ratio of the vocal tract. Among the four factors, the internal ratio of the vocal tract is newly studied here. In addition, an on-line operable system is built with this method; it can be used for real-time voice dubbing. Also, according to our perception tests, it can indeed translate a single voice timbre into many distinct timbres.

For pitch detection, three peak values max1, max2, and max3 are first taken from the frame, and the clipping level is computed either as Clip = (max1 + max2 + max3) * 0.5 or as Clip = min(max1, max2, max3) * 0.6, depending on the case; the values Y[1] .. Y[45] are then compared against this clipping level.
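The clipping-level rule above can be sketched in Python. The 0.5 and 0.6 combining formulas are the ones recovered from the text; splitting the frame into three equal segments to obtain max1, max2, and max3 is an assumption, since the report's surrounding prose is not recoverable.

```python
def clip_threshold(frame, conservative=False):
    """Clipping level for pitch detection.

    The frame is split into three equal segments and max1..max3 are
    the per-segment peak magnitudes (assumed); the combining rules
    Clip = (max1+max2+max3)*0.5 and Clip = min(max1,max2,max3)*0.6
    are the formulas recovered from the report.
    """
    n = len(frame) // 3
    maxes = [max(abs(x) for x in frame[i * n:(i + 1) * n]) for i in range(3)]
    if conservative:
        return min(maxes) * 0.6
    return sum(maxes) * 0.5
```

A lower, `min`-based level keeps more candidate peaks after clipping; the `sum`-based level is stricter.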

(vocal tract)



The input signal is read into a buffer (buffer) of 500 * sampling_rate / 11,025 samples, and successive buffers overlap by 50%. A magnitude measure accumulated over each buffer is compared with the values 256,000 and 60 to decide how the buffer is to be processed. For pitch detection, the peaks among Y[1] .. Y[45] whose heights exceed the clipping level Clip are collected as candidates X[1] .. X[K]; the spacing of the selected peaks is constrained to lie between a minimum pitch period of 35 * sampling_rate / 11,025 samples and a maximum of 200 * sampling_rate / 11,025 samples, and the candidate X[i] that best satisfies these constraints is chosen.
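All of the constants recovered above scale linearly with the sampling rate from the 11,025 Hz reference. A small helper (the function name is illustrative, not from the report) collects them:

```python
def frame_params(sampling_rate):
    """Analysis parameters scaled from the 11,025 Hz reference rate,
    using the constants recovered from the report."""
    scale = sampling_rate / 11025
    frame_len = int(500 * scale)   # analysis buffer length in samples
    hop = frame_len // 2           # 50% overlap between buffers
    min_period = int(35 * scale)   # shortest allowed pitch period
    max_period = int(200 * scale)  # longest allowed pitch period
    return frame_len, hop, min_period, max_period
```

At the reference rate this gives a 500-sample buffer and a pitch-period search range of 35 to 200 samples (roughly 315 Hz down to 55 Hz).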


The detected results were compared with manually obtained ones, and the accuracy of the automatic processing reached about 95%.

The timbre of a voice is determined largely by its formant frequencies (formant frequency); to convert the timbre, the formant frequencies must therefore be shifted accordingly.


In this project, the TIPW synthesis method is adopted; with TIPW, both the duration (duration) and the formant frequencies of the synthesized signal can be controlled. The formant-frequency conversion is carried out in five steps, (Step 1) through (Step 5), in which the waveform is resampled (resampling); since resampling rescales the spectrum, it shifts the positions of the formant frequencies, and the resampling ratio determines the amount of the shift.
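Resampling a segment of speech rescales its spectrum and thus shifts the formant frequencies. A minimal linear-interpolation resampler sketches this step (illustrative code, not the report's implementation; here a ratio greater than 1 shortens the segment, which scales the formants upward):

```python
def resample_linear(samples, ratio):
    """Resample a waveform segment by linear interpolation.

    ratio > 1 produces fewer output samples (shorter segment, formants
    shifted up); ratio < 1 does the opposite.
    """
    n_out = max(1, round(len(samples) / ratio))
    out = []
    for i in range(n_out):
        pos = i * ratio                      # position in the input
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out
```

Linear interpolation is the simplest choice; a band-limited resampler would reduce aliasing at the cost of more computation.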

Next, the pitch peak (pitch peak) within each pitch period of the signal is located, and the processing is aligned to these peaks. Let T1 and T2 denote the durations before and after the pitch peak within one period, let T = T1 + T2, and let R be a control ratio. The durations are adjusted so that the total period length is preserved (formulas reconstructed from the damaged equations):

    if T1 < T2:   T1' = T1 × R,        T2' = T − T1'
    if T1 ≥ T2:   T2' = T2 × (2 − R),  T1' = T − T2'

The two parts of the period are then resampled (resampling) to the new durations T1' and T2'.
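The duration rule recovered on this page (if T1 < T2 then T1' = T1 × R and T2' = T − T1', otherwise T2' = T2 × (2 − R) and T1' = T − T2', with T = T1 + T2) can be written as a small function; the names are illustrative:

```python
def adjust_peak_position(t1, t2, r):
    """Redistribute the durations before (t1) and after (t2) the pitch
    peak by ratio r, keeping the total period t1 + t2 unchanged."""
    total = t1 + t2
    if t1 < t2:
        t1_new = t1 * r
        t2_new = total - t1_new
    else:
        t2_new = t2 * (2 - r)
        t1_new = total - t2_new
    return t1_new, t2_new
```

With r = 1 the period is unchanged; other values skew the peak earlier or later while the period length stays fixed.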

In addition, an on-line real-time system was implemented with the proposed method, and perception tests were conducted with the translated voices.

To convert the timbre, the vocal tract is modeled by LPC analysis; the LPC coefficients are converted to the reflection coefficients of a lattice (lattice) filter, and from these the vocal-tract area function is computed section by section:

    Area_{i+1} = Area_i × (1 − K_i) / (1 + K_i)                      (1)

After the areas are modified (for example, to change the internal ratio of the vocal tract), the new reflection coefficients are recovered from the adjusted areas Area'_i:

    K'_i = (Area'_i − Area'_{i+1}) / (Area'_i + Area'_{i+1})         (2)

The modified reflection coefficients are then used in the lattice synthesis filter to generate the converted voice.

[1] H. Kuwabara and Y. Sagisaka, “Acoustic Characteristics of Speaker Individuality: Control and Conversion”, Speech Communication, Vol. 16, pp. 165-174, 1995.
[2] H. Mizuno and M. Abe, “Voice Conversion Algorithm Based on Piecewise Linear Conversion Rules of Formant Frequency and Spectrum Tilt”, Speech Communication, Vol. 16, pp. 153-164, 1995.
[3] Y. Stylianou and O. Cappe, “A System for Voice Conversion Based on Probabilistic Classification and a Harmonic plus Noise Model”, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 281-284, 1998.
[4] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, “Transformation of Formants for Voice Conversion Using Artificial Neural Networks”, Speech Communication, Vol. 16, pp. 207-216, 1995.
[5] N. Iwahashi and Y. Sagisaka, “Speech Spectrum Conversion Based on Speaker Interpolation and Multi-functional Representation with Weighting by Radial Basis Function Networks”, Speech Communication, Vol. 16, pp. 139-152, 1995.
[6] G. Baudoin and Y. Stylianou, “On the Transformation of the Speech Spectrum for Voice Conversion”, Int. Conf. on Spoken Language Processing, Vol. 3, pp. 1405-1408, 1996.
[7] H. Y. Gu and W. L. Shiu, “A Mandarin-syllable Signal Synthesis Method with Increased Flexibility in Duration, Tone and Timbre Control”, Proceedings of the National Science Council, Republic of China, Part A: Physical Science and Engineering, Vol. 22, No. 3, pp. 385-395, 1998.
[8] D. G. Childers, “Glottal Source Modeling for Voice Conversion”, Speech Communication, Vol. 16, pp. 127-138, 1995.
[9] P. H. Milenkovic, “Voice Source Model for Continuous Control of Pitch Period”, J. Acoust. Soc. Am., Vol. 93, No. 2, pp. 1087-1096, 1993.
[10] L. R. Rabiner, et al., “A Comparative Performance Study of Several Pitch Detection Algorithms”, IEEE Trans. Acoust., Speech, and Signal Processing, pp. 399-418, Oct. 1976.
[11] J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, New York: Springer-Verlag, 1976.
[12] Y. Medan, E. Yair, and D. Chazan, “Super Resolution Pitch Determination of Speech Signals”, IEEE Trans. Signal Processing, pp. 40-48, Jan. 1991.

[13] …, pp. 228-234, 1995.
[14] J. F. Wang, et al., “A Hierarchical Neural Network Model Based on a C/V Segmentation Algorithm for Isolated Mandarin Speech Recognition”, IEEE Trans. Signal Processing, pp. 2141-2146, Sep. 1991.
[15] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
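The relations numbered (1) and (2) in the text convert between the lattice reflection coefficients and the vocal-tract area function: Area_{i+1} = Area_i (1 − K_i)/(1 + K_i), and K'_i = (Area'_i − Area'_{i+1})/(Area'_i + Area'_{i+1}). A minimal round-trip sketch (function names are illustrative):

```python
def areas_from_reflection(ks, area0=1.0):
    """Build the vocal-tract area function from reflection
    coefficients via Eq. (1): Area[i+1] = Area[i]*(1-K[i])/(1+K[i])."""
    areas = [area0]
    for k in ks:
        areas.append(areas[-1] * (1 - k) / (1 + k))
    return areas

def reflection_from_areas(areas):
    """Recover reflection coefficients from areas via Eq. (2):
    K[i] = (Area[i] - Area[i+1]) / (Area[i] + Area[i+1])."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]
```

The two functions are exact inverses (up to the arbitrary first-section area `area0`), which is what allows the area function to be edited and then mapped back to filter coefficients.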


