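Singing Resonance Generation for Singing Voice Synthesis

National Science Council, Executive Yuan: Research Project Final Report

Singing Resonance Generation for Singing Voice Synthesis: Research Report (Condensed Version)

Project type: Individual

Project No.: NSC 100-2221-E-011-157-

Project period: August 1, 2011 to July 31, 2012

Executing institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Principal investigator: Hung-Yan Gu (古鴻炎)

Project participants (master's students, part-time assistants): 張家維, 簡延庭, 張世穎, 林祐靖, 陳彥華

Attachments: report on attending an international conference, and the published paper

Public information: this project is open to public query

October 19, 2012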

Chinese abstract: In this project we rebuilt a Mandarin singing-voice synthesis system based on HMM spectral models. The HMMs must be built for different combinations of context and pitch register; only then can the synthesized singing voice have a consistent timbre. Our experiments also show that training the HMMs on a corpus pronounced in a singing manner is enough for the synthesized voice to carry the characteristic of singing resonance. For the singing-resonance generation model, we studied two conversion procedures built on the spectrum-mapping mechanism of the Gaussian mixture model (GMM). Listening tests show that the first procedure (called the basic conversion method) performs slightly better: it converts the timbre of speaker A into one close to that of speaker B and also produces the resonance characteristic, although the voice quality still needs improvement.

Chinese keywords: singing voice synthesis, singing resonance, spectral envelope, Gaussian mixture model

English abstract: In this project, an HMM (hidden Markov model) based Mandarin singing-voice synthesis system is implemented. The context of a lyric vowel and the pitch height of the note sung must be taken into account when training the HMMs in order to obtain a consistent timbre among the synthesized lyric syllables. To keep the characteristic of singing resonance in the synthesized singing voice, it is enough to train the HMMs on a corpus recorded from a real singer's singing voice. As to the model for singing resonance generation, the GMM based conversion mechanism for spectrum mapping is adopted here. We have developed two conversion procedures. The first conversion procedure (called the basic conversion method) is found to be slightly better in performance. It can not only convert the timbre of speaker A into a timbre similar to that of speaker B, but also keep the characteristic of singing resonance in the synthesized singing voice. Nevertheless, the voice quality still needs to be improved.

English keywords: singing voice synthesis, singing resonance, spectral envelope, GMM

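Singing Resonance Generation for Singing Voice Synthesis

Project No.: NSC 100-2221-E-011-157

Project period: August 1, 2011 to July 31, 2012

Principal investigator: Hung-Yan Gu (古鴻炎), Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Project participants: 張家維, 簡延庭, 張世穎, 林祐靖, 陳彥華

e-mail: [email protected]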

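1. Chinese Abstract

In this project we rebuilt a Mandarin singing-voice synthesis system based on HMM spectral models. The HMMs must be built for different combinations of context and pitch register; only then can the synthesized singing voice have a consistent timbre. Our experiments also show that training the HMMs on a corpus pronounced in a singing manner is enough for the synthesized voice to carry the characteristic of singing resonance. For the singing-resonance generation model, we studied two conversion procedures built on the spectrum-mapping mechanism of the Gaussian mixture model (GMM). Listening tests show that the first procedure (called the basic conversion method) performs slightly better: it converts the timbre of speaker A into one close to that of speaker B and also produces the resonance characteristic, although the voice quality still needs improvement.

Keywords: singing voice synthesis, singing resonance, spectral envelope, Gaussian mixture model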

ABSTRACT

In this project, an HMM (hidden Markov model) based Mandarin singing-voice synthesis system is implemented. The context of a lyric vowel and the pitch height of the note sung must be taken into account when training the HMMs in order to obtain a consistent timbre among the synthesized lyric syllables. To keep the characteristic of singing resonance in the synthesized singing voice, it is enough to train the HMMs on a corpus recorded from a real singer's singing voice.

As to the model for singing resonance generation, the GMM based conversion mechanism for spectrum mapping is adopted here. We have developed two conversion procedures. The first conversion procedure (called the basic conversion method) is found to be slightly better in performance. It can not only convert the timbre of speaker A into a timbre similar to that of speaker B, but also keep the characteristic of singing resonance in the synthesized singing voice. Nevertheless, the voice quality still needs to be improved.

Keywords: singing voice synthesis, singing resonance, spectral envelope, GMM

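2. Background and Objectives

Computer singing-voice synthesis can be used to build singing-tutor software for the many people who enjoy singing but cannot read musical scores. Professional lyricists and composers can also use synthesis software to listen to and evaluate their own works quickly and revise them as needed. Furthermore, as the technology matures, the idea of a computer-generated virtual singer becomes practical. In recent years the singing-synthesis software Vocaloid [1], produced by Yamaha of Japan, has been used to create a virtual singer called "初音" (Hatsune).

We have worked on singing-voice synthesis for several years and have developed a real-time Mandarin singing-voice synthesis system that produces fairly natural singing. However, listening carefully to its output reveals a defect in voice quality that cannot be ignored: the signal is too clear and lacks the resonance that a professional singer's voice exhibits. One important reason is that we previously used Mandarin syllables pronounced in a speaking manner to analyze and obtain the parameters of the signal synthesis model.

Therefore, this project focuses on singing resonance. We analyze its acoustic characteristics and then study how to build a generation model for it, so that our system can synthesize singing that is both natural and resonant. Two questions related to singing resonance are: (a) If the signal-model parameters are analyzed from Mandarin syllables pronounced in a singing manner, will the synthesized singing voice then have singing resonance? (b) Once a singing-resonance generation model has been built from one recorded person's pronunciation, can it be applied to signal-model parameters analyzed from another person's spoken pronunciation, so as to synthesize a singing signal with resonance? These are the questions this project investigates.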

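3. Literature Review

Previous work on singing-voice synthesis has addressed a range of topics [2, 3, 4], for example methods of singing-signal synthesis, articulatory rules and performance rules, the setting of pitch contours for singing techniques such as vibrato and portamento, simulation of the singer's formant, and the expression of emotion in singing. The singer's formant concerns how different manners of phonation (such as singing and speaking) affect the formant frequencies and bandwidths of the same phoneme.

Several methods have been proposed for synthesizing singing signals; they can be roughly divided into time-domain and frequency-domain methods. Time-domain methods generally need pitch peaks to be marked in advance and require less computation. Examples are PSOLA [5] and TIPW [6], both of which originated in speech-synthesis research. However, a method suitable for speech synthesis is not necessarily suitable for singing synthesis, because singing demands much larger variations in pitch (range) and duration (tempo) than speech does.

Frequency-domain methods include the phase vocoder [7], formant synthesis [7], linear-prediction-based source-filter synthesis [7, 8], sinusoidal-model synthesis [9], EpR (excitation plus resonances) model synthesis [10], and HNM (harmonic plus noise model) synthesis [11, 12], which we have recently improved. Frequency-domain methods are generally more complex and computationally heavier, but the synthesized signal is clearer in quality, and the quality degrades more slowly when pitch and duration are changed. We therefore chose HNM synthesis for this project.

As for singing resonance, we found few references, and most of them analyze spectral properties of resonant singing. For example, the singing of trained singers shows a spectral-envelope peak near 3 kHz that is absent from speech; it is called the singer's formant, and an example is shown in Figure 1 [13]. However, the characteristics of singing resonance are not common to all singers. They vary with voice range (for example soprano versus alto), gender, and ethnic group, which lead to different singing and resonance styles [14, 15, 16].

A few singing-synthesis studies [13] have produced spectral envelopes with a singer's formant by modifying formant frequencies and bandwidths, but this direction leaves problems unsolved. For example, can the program in use estimate formant frequencies accurately? For a syllable with several vowels (such as /iau/), how should the spectrum in the transitions between vowels be adjusted? We therefore believe that modeling resonance by automatic learning is preferable to rule-based spectral adjustment.

Figure 1. An example of the singer's formant [13].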

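4. Research Method

We first invited a male singer (speaker A) to record in a soundproof studio. He recorded the same lyrics both in a speaking manner (A speech) and in a singing manner (A singing), giving a parallel corpus of 40 songs. We also invited another male (speaker B) to record the same lyrics in a speaking manner (B speech). We then wrote analysis programs for the fundamental frequency and the spectral-envelope parameters, to obtain the fundamental frequency and the DCC (discrete cepstrum coefficients) [17] of every frame of the speech and singing signals.

Since earlier studies mostly relate singing resonance to the shape of the spectral envelope, we first fed the DCC analyzed from the singing recordings to our HMM training program, to train HMM (hidden Markov model) spectral models of Mandarin syllable initials and finals. Suitable HMM spectral models are then selected according to the lyrics to generate the DCC for the sequence of frames of each lyric syllable, and these DCC are passed to the HNM signal model for signal synthesis.

We also want to build a singing-resonance generation model so that other people who record only in a speaking manner can use it to synthesize resonant singing. We therefore applied techniques developed in the field of voice conversion to build this model, which is also why we recorded parallel corpora of spoken and sung voice.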

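4.1 Main flow of the singing-voice synthesis system: training stage

In this project we switched to HMM spectral models and rebuilt the Mandarin singing-voice synthesis system. The main processing flow of its training stage is shown in Figure 2. The songs in the singing database are first cut phrase by phrase and then labeled and segmented. Each segmented syllable is analyzed with STRAIGHT [18] to obtain spectral-envelope data and pitch data. The discrete-cepstrum estimation program we developed earlier is then used to compute the DCC of each frame of a sung syllable; because the spectral-envelope curve comes from STRAIGHT, we adjusted how spectral peaks are picked during the DCC computation. Next, according to the phonetic labels, the DCC are divided into syllable-initial and syllable-final parts, which are further subdivided by context and pitch. Finally, an HMM is trained for each initial and final subclass.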

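Figure 2. Main flow of the training stage. (Flowchart: Start; singing-voice database; labeling and segmentation; STRAIGHT analysis, giving spectral-envelope data and pitch data; DCC estimation; classification into initials and finals; subclassification of initials and finals by context and pitch; HMM training for each subclass; End.)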

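4.2 Main flow of the singing-voice synthesis system: synthesis stage

The main processing flow of the synthesis stage is shown in Figure 3. An input score is first parsed as text to obtain the pitch and duration of each lyric syllable. Each lyric syllable is then converted into initial and final units, and the best-matching HMM is selected from the HMM set according to context and pitch. Next, the duration data and the state-duration parameters determine how many frames are assigned to each state of an initial or final HMM, and the mean DCC vector of each state is used to generate the DCC of each frame. The frame DCC are then passed to the GMM timbre-conversion module, which outputs DCC carrying another person's timbre. After that, the pitch value of each frame is set from the pitch data. Finally, the frame DCC and the pitch contour are passed to the HNM module to synthesize the singing signal.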

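Figure 3. Main flow of the synthesis stage. (Flowchart: Start; score; score analysis, giving pitch data and duration data; selection of initial and final HMMs from the HMM set; generation of frame DCC; GMM timbre conversion using the GMM parameters; setting of the pitch contour; HNM signal synthesis; singing signal; End.)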

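4.3 DCC estimation

A speech signal resynthesized from the spectral envelope analyzed by STRAIGHT is highly natural, which indicates that the STRAIGHT envelope is very accurate.

In this project, therefore, each segmented sung syllable is first analyzed with STRAIGHT to obtain the spectral-envelope curve of each of its frames. For the harmonic part of a voiced frame, the fundamental frequency F0 given by STRAIGHT is divided by a factor m so that x0 = F0/m falls between 40 Hz and 80 Hz; then, within 0 to 5500 Hz, the frequency and amplitude at each multiple of x0 are recorded. For the noise part (above 5500 Hz), the frequency and amplitude of each spectral peak are detected directly. From the recorded frequency-amplitude pairs, the discrete-cepstrum module then computes the DCC of the frame.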
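The harmonic sampling step can be sketched as follows. This is a minimal illustration, not the project's program: the function name `harmonic_sample_points` is hypothetical, the envelope is passed in as a plain callable, and the choice of m as the smallest integer that brings F0/m to 80 Hz or below is an assumption (the text only says that F0 is divided by a factor m).

```python
import math

def harmonic_sample_points(f0, envelope, lo=40.0, hi=80.0, f_max=5500.0):
    """Return (m, x0, points), where points are (frequency, amplitude) pairs
    taken at the multiples of x0 = f0 / m up to f_max.

    Assumption: m is the smallest positive integer with f0 / m <= hi.
    `envelope` is any callable mapping a frequency in Hz to an amplitude.
    """
    m = max(1, math.ceil(f0 / hi))
    x0 = f0 / m
    if x0 < lo:
        raise ValueError("no integer divisor puts F0/m inside [lo, hi]")
    count = int(f_max // x0)  # number of multiples of x0 not exceeding f_max
    points = [(k * x0, envelope(k * x0)) for k in range(1, count + 1)]
    return m, x0, points
```

For example, with F0 = 200 Hz the sketch picks m = 3, so the envelope is sampled every 66.7 Hz rather than every 200 Hz, which gives the discrete-cepstrum fit many more constraint points per frame.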

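Regarding the DCC order, we measured the approximation error of the spectral envelope and found that the decrease in error levels off only when the order exceeds 80. We therefore set the order to 80.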

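4.4 GMM-based timbre conversion

Timbre conversion proceeds as follows. The GMM mapping function is first selected according to the final of the current lyric syllable. The DCC of each frame are then read in and mapped through the GMM, which yields the timbre-converted frame DCC. For the details of GMM timbre conversion, we studied the following two methods.

(a) Basic conversion method

The frame DCC are first generated with speaker A's singing HMMs. They are then mapped with GMM parameters trained on the parallel spoken corpora of speaker A (source speaker) and speaker B (target speaker), giving the converted frame DCC. The processing flow of this method is shown in Figure 4.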

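Figure 4. Flow of the basic timbre-conversion method. (Flowchart: Start; DCC generated by the HMM; selection of the syllable-final GMM; GMM spectrum mapping; converted DCC; End.)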

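(b) Relative-amplitude conversion method

The frame DCC input in Figure 5 are generated with speaker A's singing HMMs. The first step is a GMM mapping that converts A's singing DCC into A's speech DCC; the GMM parameters used here are trained on A's sung lyrics (source) and A's spoken lyrics (target). The second step maps the DCC output by the first step through a GMM once more to obtain B's speech DCC; the GMM parameters used here are trained on the parallel spoken corpora of speaker A (source) and speaker B (target). The resulting DCC are then converted back into a spectral-envelope curve Y(f).

Next, the frame DCC output by the first GMM mapping are converted back into a spectral-envelope curve X(f), and the amplitude difference between the two envelopes is computed as ∆dB(f) = Y(f) - X(f). Finally, the input frame DCC (before any GMM mapping) are converted into a spectral-envelope curve Z(f), and ∆dB(f) is added to Z(f), which gives the timbre-converted spectral envelope V(f) = Z(f) + ∆dB(f).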

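Figure 5. Flow of the relative-amplitude conversion method. (Flowchart: Start; frame DCC (A singing); GMM spectrum mapping (A singing => A speech), giving DCC (A speech); GMM spectrum mapping (A speech => B speech); conversion of DCC to spectral envelope Y(f); conversion of DCC (A speech) to spectral envelope X(f); computation of the spectral amplitude difference ∆dB(f) = Y(f) - X(f); conversion of the input DCC to spectral envelope Z(f); envelope correction V(f) = Z(f) + ∆dB(f); converted spectral envelope V(f); End.)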
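The final correction step of the relative-amplitude method is a per-frequency addition, which the following minimal sketch illustrates. It assumes that the three envelopes Z(f), X(f), and Y(f) have already been evaluated in dB on the same frequency grid; the function name `relative_amplitude_convert` is hypothetical, and the conversion from DCC to envelope is not shown.

```python
def relative_amplitude_convert(z_db, x_db, y_db):
    """Return V(f) = Z(f) + (Y(f) - X(f)) for envelopes sampled on one grid.

    z_db: envelope of the input frame (A singing), in dB
    x_db: envelope after the first GMM mapping (A speech), in dB
    y_db: envelope after the second GMM mapping (B speech), in dB
    """
    if not (len(z_db) == len(x_db) == len(y_db)):
        raise ValueError("the three envelopes must share one frequency grid")
    return [z + (y - x) for z, x, y in zip(z_db, x_db, y_db)]
```

Because only the difference between B's and A's speech envelopes is applied, the fine detail of the sung envelope Z(f) passes through unchanged.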

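The spectrum mapping carried out by the GMM spectrum-mapping blocks in Figures 4 and 5 can be written as [19]:

$$ y = F(x; \mu, \Psi) = \sum_{m=1}^{M} \frac{w_m N(x; \mu_m^{x}, \Psi_m^{xx})}{\sum_{j=1}^{M} w_j N(x; \mu_j^{x}, \Psi_j^{xx})} \left[ \mu_m^{y} + \Psi_m^{yx} \left( \Psi_m^{xx} \right)^{-1} \left( x - \mu_m^{x} \right) \right] \tag{1} $$

where x is the spectral feature vector of the source speaker, y is the converted spectral feature vector, M is the total number of Gaussian mixtures N(•; •, •), w_m is the weight of the m-th mixture, and μ and Ψ denote the sets of mean vectors and covariance matrices, respectively.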
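To make the structure of Eq. (1) concrete, the sketch below evaluates it for scalar features, so that every covariance is a single number. This is an illustrative simplification only: the system maps DCC vectors with full matrices, and the names `gauss` and `gmm_map` are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Univariate normal density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_map(x, weights, mu_x, mu_y, var_xx, cov_yx):
    """Scalar version of Eq. (1): posterior-weighted sum of per-mixture
    linear regressions mu_y + cov_yx / var_xx * (x - mu_x)."""
    lik = [w * gauss(x, mx, vxx) for w, mx, vxx in zip(weights, mu_x, var_xx)]
    total = sum(lik)
    y = 0.0
    for p, mx, my, vxx, cyx in zip(lik, mu_x, mu_y, var_xx, cov_yx):
        y += (p / total) * (my + cyx / vxx * (x - mx))
    return y
```

With a single mixture the posterior weight is 1 and the mapping reduces to one linear regression; with several mixtures the output is an average of such regressions.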

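5. Results and Discussion

In this project we rebuilt a Mandarin singing-voice synthesis system based on HMM spectral models. The HMM spectral models were trained on a newly recorded corpus with programs we developed ourselves (not off-the-shelf software such as HTK). Listening to the synthesized singing, we found that if HMMs are built only for the 21 initials and 36 finals, the singing sounds insufficiently smooth between some adjacent syllables. We therefore classified the initials and finals more finely by context (the preceding and following units) and built a larger set of context-dependent HMMs, which makes the synthesized singing smoother. Besides context, we also found that when a singer sings the same final at widely different pitches, the spectra differ noticeably (that is, the timbre differs). We therefore added pitch as a factor (high, middle, and low registers) and built separate HMMs for each combination of context and register; only then does the synthesized singing sound as if sung by one person.

We also compared HMMs trained on two corpora: A singing (speaker A pronouncing in a singing manner) and A speech (speaker A pronouncing in a speaking manner). Both sets of HMMs were used to synthesize speaker A's singing. On listening, we found that HMMs trained on the A singing corpus do give the synthesized singing the characteristic of singing resonance, whereas HMMs trained on the A speech corpus do not. So, to give synthesized singing the characteristic of singing resonance, it suffices to train the HMMs on a corpus pronounced in a singing manner.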

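The purpose of the singing-resonance generation model is to let people who record only in a speaking manner synthesize resonant singing. We studied two conversion methods based on GMM spectrum mapping, whose flows are shown in Figures 4 and 5. The basic conversion method of Figure 4 can use only parallel spoken corpora to train the GMM spectrum-mapping model. Even so, on listening to singing synthesized with it, we found that it does give the synthesized singing the resonance characteristic, and that the timbre is converted to one very close to that of speaker B.

With the relative-amplitude conversion method of Figure 5, our impression on listening is that the singing is much clearer (less muffled) than that of Figure 4, but the timbre is somewhat worse: it is less like speaker B's. The two conversion methods of Figures 4 and 5 therefore each have their own strengths and weaknesses.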

6. References

[1] Yamaha, VOCALOID: New Singing Synthesis Technology, http://www.vocaloid.com/en/index.html.

[2] X. Rodet, "Synthesis and Processing of the Singing Voice," Proc. 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), Leuven, Belgium, pp. 99-108, 2002.

[3] P. R. Cook, "Singing Voice Synthesis: History, Current Work, and Future Directions," Computer Music Journal, Vol. 20(3), pp. 38-46, 1996.

[4] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, DeKalb, Illinois, 1987.

[5] N. Schnell, G. Peeters, S. Lemouton, P. Manoury, and X. Rodet, "Synthesizing a Choir in Real-time Using Pitch Synchronous Overlap Add," Int. Computer Music Conference, Berlin, Germany, pp. 102-108, 2000.

[6] H. Y. Gu and W. L. Shiu, "A Mandarin-syllable Signal Synthesis Method with Increased Flexibility in Duration, Tone and Timbre Control," Proc. Natl. Sci. Counc. ROC(A), Vol. 22, pp. 385-395, 1998.

[7] C. Dodge and T. A. Jerse, Computer Music: Synthesis, Composition, and Performance, Schirmer Books, New York, NY, 1997.

[8] Y. E. Kim, Singing Voice Analysis/Synthesis, Ph.D. thesis, Massachusetts Institute of Technology, 2003.

[9] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A Singing Voice Synthesis System Based on Sinusoidal Modeling," Int. Conf. Acoustics, Speech, and Signal Processing, Munich, Germany, pp. 435-438, 1997.

[10] J. Bonada and X. Serra, "Synthesis of the Singing Voice by Performance Sampling and Spectral Models," IEEE Signal Processing Magazine, Vol. 24, pp. 67-79, 2007.

[11] Y. Stylianou, Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification, Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications, Paris, France, 1996.

[12] H. Y. Gu and H. L. Liao, "Mandarin Singing-voice Synthesis Using an HNM Based Scheme," Journal of Information Science and Engineering, Vol. 27(1), pp. 303-317, 2011.

[13] M. E. Lee and M. J. T. Smith, "Spectral Modification for Digital Singing Voice Synthesis Using Asymmetric Generalized Gaussians," Int. Conf. on Acoustics, Speech, and Signal Processing, pp. I260-263, 2003.

[14] M. E. Bestebreurtje and H. K. Schutte, "Resonance Strategies for the Belting Style: Results of a Single Female Subject Study," Journal of Voice, Vol. 14(2), pp. 194-204, 2000.

[15] E. Joliveau, J. Smith, and J. Wolfe, "Vocal Tract Resonances in Singing: The Soprano Voice," Journal of the Acoustical Society of America, pp. 2434-2439, 2004.

[16] N. Henrich, M. Kiek, J. Smith, and J. Wolfe, "Resonance Strategies used in Bulgarian Women's Singing Style: a Pilot Study," Logopedics Phoniatrics Vocology, 2007.

[17] O. Cappé and E. Moulines, "Regularization Techniques for Discrete Cepstrum Estimation," IEEE Signal Processing Letters, Vol. 3(4), pp. 100-102, 1996.

[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring Speech Representations Using a Pitch-adaptive Time-frequency Smoothing and an Instantaneous-frequency-based F0 Extraction: Possible Role of a Repetitive Structure in Sounds," Speech Communication, Vol. 27, pp. 187-207, 1999.

[19] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous Probabilistic Transform for Voice Conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.

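Report on Attending an International Academic Conference

Project No.: NSC 100-2221-E-011-157

Project title: Singing Resonance Generation for Singing Voice Synthesis

Name, affiliation, and title of the traveler: Hung-Yan Gu (古鴻炎), Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Conference dates and location: October 15 to 17, 2011, Shanghai, China

Conference name: International Congress on Image and Signal Processing (CISP 2011)

Title of the paper presented: An Improved Voice Conversion Method Using Segmental GMMs and Automatic GMM Selection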

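1. Conference Participation

CISP 2011 was hosted by Donghua University in Shanghai and co-sponsored by the IEEE Engineering in Medicine and Biology Society; accepted papers are included in the IEEE Xplore database. As the conference name suggests, the accepted fields cover subfields related to image processing, video processing, and signal processing, including speech signal processing. My paper belongs to speech processing. It proposes the concept of segmental Gaussian mixture models (GMMs) to improve voice-conversion performance, and develops a dynamic-programming-based automatic GMM selection algorithm so that the concept can be applied in on-line voice conversion.

The papers accepted by CISP 2011 were presented in 28 sessions, as oral and poster presentations. Because I had told the organizers in advance that I preferred a poster, my paper was scheduled for poster presentation on October 16, 13:30 to 15:30. The upper picture on the right was taken at the presentation venue, and the lower picture on the right at the banquet venue.

As for the itinerary, on October 14 I took the 14:15 direct flight from Songshan Airport to Shanghai Hongqiao Airport, and then took the metro from the airport, changing lines, to the venue, the Everbright Convention and Exhibition Center (光大會展中心) at 66 Caobao Road, arriving at 17:50. After the conference, on the evening of October 17, I took the metro to Terminal 1 of Hongqiao Airport and boarded the 19:50 direct flight to Taipei Songshan Airport.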

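2. Reflections

CISP 2011 accepted submissions in fields related to image, video, and signal processing, and the four invited keynote speakers were all experts at the IEEE Fellow level. However, none of the keynote topics concerned audio or speech processing; audio processing was, after all, not the focus of this conference.

Fifteen papers directly related to speech processing were accepted at this conference. Besides mine, four are on speech recognition, two on speaker recognition, one each on speech synthesis and speech enhancement, and several on speech analysis. Although there were not many speech-processing papers, they covered several subfields of speech research, so it was still possible to learn about the state of research in different subfields.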

An Improved Voice Conversion Method Using Segmental GMMs and Automatic GMM Selection

Hung-Yan Gu and Sung-Fung Tsai

Department of Computer Science and Information Engineering National Taiwan University of Science and Technology

Taipei, Taiwan

Abstract: In this paper, the idea of segmental GMMs is proposed for voice conversion. To apply this idea to on-line voice conversion, we have developed an automatic GMM selection algorithm based on dynamic programming. In addition, to map a vector of DCC (discrete cepstrum coefficients) with only one Gaussian mixture, we have designed a mixture selection algorithm. To evaluate the performance of segmental GMMs, three voice conversion systems are constructed and used to conduct listening tests. The results of the listening tests show that the segmental GMMs proposed here can indeed help to improve the performance in both timbre similarity and voice quality.

Keywords-voice conversion; discrete cepstrum; Gaussian mixture model; timbre similarity; harmonic plus noise model

I. Introduction

The GMM based voice conversion method was introduced by Stylianou [1]. Afterward, many researchers tried to improve this method by considering one or several related issues [2-5]. Nevertheless, some serious problems still exist when applying the GMM based voice conversion method. The most noticeable one is that the converted spectra are often over-smoothed [2-4]. As a result, the converted voice is perceived with apparent distortion, i.e. the voice quality is significantly decreased. Another noticeable problem is that the converted spectra of two adjacent frames may become discontinuous when one tries to solve the over-smoothing problem by using just the most probable Gaussian mixture to map the source spectral coefficients [4, 6].

In this paper, we study how to solve the over-smoothing problem with a different approach. Note that the cause of over-smoothing is the summation across many Gaussian mixtures (usually 128 mixtures) in the GMM based mapping function,

$$ y = F(x; \mu, \Psi) = \sum_{m=1}^{M} \frac{w_m N(x; \mu_m^{x}, \Psi_m^{xx})}{\sum_{j=1}^{M} w_j N(x; \mu_j^{x}, \Psi_j^{xx})} \left[ \mu_m^{y} + \Psi_m^{yx} \left( \Psi_m^{xx} \right)^{-1} \left( x - \mu_m^{x} \right) \right] \tag{1} $$

where x denotes a feature vector of the source speaker, y denotes the converted feature vector for the target speaker, M is the number of Gaussian mixtures, and μ and Ψ represent the sets of mean vectors and covariance matrices, respectively. To solve the problem of over-smoothing, we think reducing the number of Gaussian mixtures, M, in the mapping function is necessary. Nevertheless, the probability density function (PDF) of the trained GMM would become coarse when the number of mixtures is directly decreased.

Therefore, we propose to segment each of the training sentences into a sequence of speech segments, and to group these speech segments into several classes. For example, a speech segment may be a phoneme or a syllable. After segmentation, the signal frames grouped into a class are taken to train a corresponding GMM with fewer mixtures (e.g. 16 mixtures). Then, this GMM is dedicated to converting the source frames recognized as belonging to the corresponding class. In this way, the GMM based mapping function, i.e. (1), can be applied with fewer mixtures. That is, a complicated GMM is now replaced with multiple simpler GMMs, and each GMM is dedicated to converting the signal frames recognized as belonging to its corresponding class.

In this paper, we study voice conversion for Mandarin, and Mandarin is a syllable-prominent language. Therefore, we take each syllable of a labeled training sentence as a speech segment. Next, each segment is grouped into one of 37 classes according to its syllable final. For each of the 37 syllable-final classes, a corresponding GMM is then trained.

After training, the 37 GMMs are used for on-line voice conversion. Nevertheless, there is a problem that must be solved beforehand. That is, how can the right class that an input frame belongs to be picked out? For this problem, we have developed an automatic selection algorithm based on dynamic programming. This algorithm will be described in Subsection III.A.

Besides using multiple segmental GMMs to reduce the number of mixtures, we went further and used only one Gaussian mixture for mapping a source spectrum into its converted spectrum in order to solve the problem of over-smoothed converted spectra. Nevertheless, the converted spectra of two adjacent source frames may become discontinuous and result in artifact sounds. Therefore, we designed a dynamic programming based algorithm that considers both the likelihood (when using a particular Gaussian mixture) and spectral continuity simultaneously for a sequence of signal frames. This algorithm will be described in Subsection III.B. In addition, we have integrated the two solution methods mentioned to build an on-line voice conversion system. Then, this system is used to conduct the listening tests.

II. Training Procedure

As an overview, the processing flow for the training stage of our voice conversion system is shown in Fig. 1.

Three persons were invited to record 375 parallel sentences in a soundproof room. The sampling rate is 22,050 Hz. Among the three persons, two are males, denoted as M1 and M2, and the other one is a female, denoted as F1. In this study, M1 is treated as the source speaker whereas M2 and F1 are treated as the target speakers. Therefore, the two voice conversion tasks here are converting the voice of M1 into the voice of M2 or F1.

Figure 1. Processing flow for the training stage. (Flowchart blocks, for both the source speaker's and the target speaker's training sentences: labeling and segmenting; grouping into 37 classes; framing; estimating DCC; estimating pitch parameters, giving the pitch parameters of source and target. The two branches meet in DTW alignment, followed by training segmental GMMs, giving the parameters of 37 GMMs.)

A. Labeling and Grouping

First, the software package HTK was used to do forced alignment, i.e. automatic labeling. Here, the speech unit is the syllable. Since many errors were found in the labeled syllable boundaries, manual checking and correcting of the syllable boundaries were required. Here, we used the software WaveSurfer to edit the labels and boundaries. Then, according to the syllable boundaries and phonetic symbols, each syllable's signal was extracted and saved into a separate file named with the sentence number, syllable number, and phonetic symbol. In total, 2,926 syllables were extracted from the 375 recorded sentences of a speaker. Next, the syllables from the first 350 sentences were grouped into 37 classes according to the syllable-final symbol parsed from the filename of each saved signal file.

B. DCC Estimation

Several methods have been proposed for estimating a signal frame's magnitude-spectrum envelope (spectral envelope). The method STRAIGHT is very accurate in its estimated spectral envelope, but it requires a large amount of computation and cannot currently be used to implement a real-time system.

Therefore, in this study, we adopt the discrete cepstrum method of spectral envelope estimation [7, 8], and use the estimated discrete cepstrum coefficients (DCC) as the spectral parameters. For each signal frame, the DCC estimation scheme developed previously [8] is executed to obtain 40 DCC. Here, a frame's width is 512 sample points, and adjacent frames are placed 110 points (about 5 ms) apart.
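As a quick check of the framing figures, the sketch below lists the frame start positions for a 512-point window shifted by 110 points and converts the shift to milliseconds at 22,050 Hz. The function name `frame_starts` is hypothetical, and dropping the incomplete trailing frame is an assumption, since the text does not say how the signal tail is handled.

```python
def frame_starts(num_samples, width=512, shift=110):
    """Start indices of all full frames of `width` samples, `shift` apart.
    Assumption: a trailing partial frame is dropped rather than padded."""
    if num_samples < width:
        return []
    return list(range(0, num_samples - width + 1, shift))

SAMPLING_RATE = 22050
shift_ms = 110 / SAMPLING_RATE * 1000.0  # roughly 4.99 ms, i.e. "about 5 ms"
```

One second of signal (22,050 samples) yields 196 full frames under these settings.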

C. Training of Segmental GMM

After the block "grouping into 37 classes" in Fig. 1 is executed, there are 37 classes of syllable segments. For each class, a GMM of 16 mixtures is trained from the syllable signals grouped into that class. The GMM obtained is hence termed a segmental GMM.

Here, a parallel corpus is used. Each source syllable and its corresponding target syllable are first time-aligned with DTW, as indicated in the block "DTW alignment". Then, the DCC computed from a source frame are joined with the DCC computed from the aligned target frame. With the joint vectors of DCC, the training method based on maximum likelihood estimation is used to train a GMM for each class [9].
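The alignment and joining step can be illustrated with a textbook DTW. This is a sketch under stated assumptions rather than the authors' implementation: the local distance is Euclidean, the step pattern is the basic symmetric one (diagonal, vertical, horizontal), and the names `dtw_align` and `joint_vectors` are hypothetical.

```python
import math

def dtw_align(src, tgt):
    """Return the DTW warping path as a list of (source_index, target_index)."""
    n, m = len(src), len(tgt)
    inf = float("inf")

    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    # cost[i][j]: best accumulated distance aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dist(src[i - 1], tgt[j - 1]) + best_prev

    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path

def joint_vectors(src, tgt):
    """Concatenate each source frame's DCC with its aligned target frame's DCC."""
    return [list(src[i]) + list(tgt[j]) for i, j in dtw_align(src, tgt)]
```

Each joint vector then has twice the DCC dimension, which is the form needed to estimate the cross-covariance terms used in (1).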

D. Pitch Parameters

A pitch detection method based on both autocorrelation and AMDF is used to detect the pitch frequency of a signal frame [10]. Then, the pitch frequencies detected from a speaker’s utterances are collected to compute their average and standard deviation, which are the pitch parameters used in this study.

III. Conversion Procedure

The procedure proposed here for converting voice follows the processing flow drawn in Fig. 2. When a spoken sentence with unknown content is input, it is first sliced into a sequence of frames with the frame width and shift given in Subsection II.B. Then, the pitch frequency of each frame is detected in the left flow of Fig. 2 with the method mentioned in Subsection II.D. When a frame is detected to be unvoiced, the three gray colored blocks in Fig. 2 are bypassed directly. That is, pitch adjusting is not needed and the spectral parameters, DCC, are not converted. On the other hand, when a frame is detected to be voiced, its pitch is simply converted as

$$ q_t = \mu^{y} + \frac{\sigma^{y}}{\sigma^{x}} \left( p_t - \mu^{x} \right) \tag{2} $$

where $p_t$ is the detected pitch frequency of the t-th frame, $q_t$ is the converted pitch frequency, $\mu^{x}$ and $\sigma^{x}$ are the average and standard deviation of the source speaker's pitch frequencies, and $\mu^{y}$ and $\sigma^{y}$ are those of the target speaker.
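A minimal sketch of the pitch conversion in (2), together with the pitch parameters of Subsection II.D, is given below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not say which estimator is used.

```python
import statistics

def pitch_params(pitches):
    """Average and standard deviation of a speaker's voiced-frame pitches.
    Assumption: population standard deviation (statistics.pstdev)."""
    return statistics.mean(pitches), statistics.pstdev(pitches)

def convert_pitch(p_t, source_params, target_params):
    """Eq. (2): shift and scale a source pitch into the target's pitch range."""
    mu_x, sigma_x = source_params
    mu_y, sigma_y = target_params
    return mu_y + (sigma_y / sigma_x) * (p_t - mu_x)
```

A source pitch equal to the source average maps to the target average, and a deviation of one source standard deviation maps to one target standard deviation.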

As for the right flow of Fig. 2, the input frames are basically processed one after another. Nevertheless, in the block "Selecting a GMM", we propose a selection algorithm that processes every 20 voiced frames in a batch. With this algorithm, the correct GMM (or sometimes a nearby GMM) can be picked out from the 37 GMMs for each frame. Then, in the block "Mapping with single mixture", only one mixture of the selected GMM is used to map the DCC in order to avoid spectral over-smoothing. Nevertheless, the mixture selected for mapping is not always the most probable one. This is because spectral continuity between adjacent converted frames must also be considered to prevent artifact sounds from being generated. For the problem of mixture selection, we have developed a dynamic programming based algorithm that is different from the one studied by previous researchers [4]. Hence, in this block, a sequence of voiced frames bounded by unvoiced frames on the left and right is processed in a batch.

Finally, in the block where the two flows join, "HNM based speech synthesis", speech signals are re-synthesized using an HNM (harmonic plus noise model) based method [8, 11].

Figure 2. Processing flow for the conversion stage. (Flowchart blocks: unknown spoken sentence; framing; left flow: detect pitch frequency, pitch adjusting; right flow: estimating DCC, selecting a GMM, mapping with single mixture; both flows join in HNM based speech synthesis; converted voice.)

A. GMM Selection

Since the content of the input speech is unknown, which one of the 37 GMMs should be selected for mapping each frame's DCC becomes a problem that must be solved. In general, this is a problem of speech recognition. Nevertheless, it is not so serious, because it is tolerable for some frames to be assigned incorrect but similar GMMs.

Here, we intend to let the 37 trained GMMs take the role of the HMMs usually used for speech recognition. In addition, we observe that it is impossible for a person to utter more than 2 segments (i.e. syllables here) within a very short time interval, e.g. 100 ms. Therefore, we decided to select GMMs for every 20 successive voiced frames (spanning 100 ms) in a batch. Then, only one or two of the 37 GMMs should be picked out. Here, we have developed a dynamic programming based algorithm that selects one or two GMMs according to the criterion of maximum likelihood.

Let the probability that the t-th input frame's DCC are generated by the s-th GMM be $G_t(s)$. That is,

$$ G_t(s) = \sum_{m=1}^{M} w_m(s) \, N\!\left( x_t; \mu_m^{x}(s), \Psi_m^{xx}(s) \right) \tag{3} $$

where $w_m(s)$ is the weight of the m-th mixture, and $x_t$ is the vector of DCC for the t-th frame. In addition, let R(t, s) be the logarithmic likelihood that the frames from time 1 to time t are all generated by the s-th GMM. In contrast, let D(t, s) be the logarithmic likelihood that the frames from time 1 to t are generated by two GMMs and the t-th frame is generated by the s-th GMM. In terms of these definitions, we can derive the two recursive formulas,

$$ R(t, s) = \log\left( G_t(s) \right) + R(t-1, s), \tag{4} $$

$$ D(t, s) = \log\left( G_t(s) \right) + \max\left[ \max_{0 \le v < 37,\; v \ne s} R(t-1, v), \; D(t-1, s) \right], \tag{5} $$

where the boundary values are D(1, s) = 0 and R(1, s) = log(G_1(s)). Then, the maximum likelihood can be calculated as

$$ A(T) = \max\left\{ \max_{0 \le v < 37} R(T, v), \; \max_{0 \le v < 37} D(T, v) \right\}, \tag{6} $$

where the final time T is set to 20 in this study. In terms of (4), (5), and (6), we can calculate the maximum likelihood, A(20), and then backtrack to find the sequence of GMM indices that are best assigned to the batch of 20 voiced frames.
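The recursion in (4) to (6) and its backtracking can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the input is a table of log G_t(s) values computed elsewhere, the function name `select_gmms` is hypothetical, and at least two GMMs are assumed. The boundary value D(1, s) is exposed as the parameter `d_init`; it defaults to 0 as stated in the text, while the test of this sketch passes negative infinity instead (an assumption of the sketch) so that D scores only paths that really switch GMMs.

```python
def select_gmms(log_g, d_init=0.0):
    """log_g[t][s] holds log G_t(s) for frame t and GMM s (0-based indices).
    Returns (best log-likelihood A(T), list with one GMM index per frame)."""
    num_frames, num_gmms = len(log_g), len(log_g[0])
    R = [list(log_g[0])]              # R(1, s) = log G_1(s)
    D = [[d_init] * num_gmms]         # boundary value D(1, s)
    came_from = [[-1] * num_gmms]     # v if the switch v -> s happens at t, else -1
    for t in range(1, num_frames):
        R.append([log_g[t][s] + R[t - 1][s] for s in range(num_gmms)])  # Eq. (4)
        d_row, c_row = [], []
        for s in range(num_gmms):
            others = [v for v in range(num_gmms) if v != s]
            v_best = max(others, key=lambda v: R[t - 1][v])
            if R[t - 1][v_best] >= D[t - 1][s]:      # switch from v_best to s now
                d_row.append(log_g[t][s] + R[t - 1][v_best])
                c_row.append(v_best)
            else:                                    # the switch happened earlier
                d_row.append(log_g[t][s] + D[t - 1][s])
                c_row.append(-1)
        D.append(d_row)                              # Eq. (5)
        came_from.append(c_row)
    s_r = max(range(num_gmms), key=lambda s: R[-1][s])
    s_d = max(range(num_gmms), key=lambda s: D[-1][s])
    if R[-1][s_r] >= D[-1][s_d]:                     # Eq. (6): one GMM is enough
        return R[-1][s_r], [s_r] * num_frames
    t = num_frames - 1
    while t > 0 and came_from[t][s_d] < 0:           # locate the switching frame
        t -= 1
    first = came_from[t][s_d] if t > 0 else s_d
    return D[-1][s_d], [first] * t + [s_d] * (num_frames - t)
```

In the paper's setting the table would have 20 rows (frames) and 37 columns (GMMs), and the returned index list tells which segmental GMM maps each frame.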

B. Mapping with Single Mixture

Mapping an input frame's DCC with a single Gaussian mixture means that the summation and the weighting term of (1) are removed. That is, the converted DCC vector, y, is calculated as

$$ y = F^{k}(x) = \mu_k^{y} + \Psi_k^{yx} \left( \Psi_k^{xx} \right)^{-1} \left( x - \mu_k^{x} \right), \tag{7} $$

where x is the input frame's DCC and $F^{k}(x)$ denotes the mapping function using the k-th mixture.

The dynamic programming based algorithm developed for mixture selection is as follows. Let the index of the GMM selected by Subsection III.A for the t-th frame be I(t). Denote the mapping function using the k-th mixture of that GMM as $F_{I(t)}^{k}(x_t)$. In addition, let C(t, k) represent the cumulated distance from time 1 to time t when the index of the mixture used at time t is k. Then, we design the recursive formula,

$$ C(t, k) = \min_{0 \le m < M,\; w_m(I(t-1)) > H} \left[ \mathrm{dist}\!\left( F_{I(t)}^{k}(x_t), \; F_{I(t-1)}^{m}(x_{t-1}) \right) + C(t-1, m) \right] \tag{8} $$

to realize dynamic programming, where dist(•,•) is a geometric distance measure for DCC, H is a threshold set to 0.3 empirically, and $w_m(s)$ is the weight of the m-th mixture.

At time 0, the values of C(0, k) are directly set as C(0, k) = 0, 0 ≤ k < M. Finally, at time T, the minimum cumulated distance B(T) is computed as

$$ B(T) = \min_{0 \le k < M,\; w_k(I(T)) > H} C(T, k). \tag{9} $$

In terms of (8) and (9), the minimum cumulated distance can be obtained. Also, the sequence of mixture indices for the frames from time 1 to T can be obtained through backtracking.
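The recursion (8), the termination (9), and the backtracking can be sketched as follows. Several points are assumptions of this sketch and not statements of the paper: the candidate outputs F^k_{I(t)}(x_t) and the weights used in the threshold test are supplied by the caller, dist is taken to be the Euclidean distance, the first frame carries zero cost because it has no predecessor, and the function name `select_mixtures` is hypothetical.

```python
import math

def select_mixtures(mapped, weights, threshold=0.3):
    """mapped[t][k]: converted vector of frame t using mixture k of GMM I(t).
    weights[t][k]: weight of mixture k for frame t, compared with the threshold H.
    Returns (minimum cumulated distance B(T), list of chosen mixture indices)."""
    num_frames, num_mix = len(mapped), len(mapped[0])

    def dist(a, b):  # assumption: Euclidean distance between DCC vectors
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    cost = [[0.0] * num_mix]   # first frame: no predecessor, zero cost
    back = [[-1] * num_mix]
    for t in range(1, num_frames):
        allowed = [m for m in range(num_mix) if weights[t - 1][m] > threshold]
        if not allowed:
            raise ValueError("no mixture passes the threshold at frame %d" % (t - 1))
        c_row, b_row = [], []
        for k in range(num_mix):
            # Eq. (8): cheapest admissible predecessor mixture m for mixture k
            best = min(allowed,
                       key=lambda m: dist(mapped[t][k], mapped[t - 1][m]) + cost[t - 1][m])
            c_row.append(dist(mapped[t][k], mapped[t - 1][best]) + cost[t - 1][best])
            b_row.append(best)
        cost.append(c_row)
        back.append(b_row)
    final = [k for k in range(num_mix) if weights[-1][k] > threshold]
    if not final:
        raise ValueError("no mixture passes the threshold at the last frame")
    k = min(final, key=lambda j: cost[-1][j])       # Eq. (9)
    total, path = cost[-1][k], [k]
    for t in range(num_frames - 1, 0, -1):          # backtracking
        k = back[t][k]
        path.append(k)
    path.reverse()
    return total, path
```

The threshold keeps unlikely mixtures out of the path, while the distance term favors mixtures whose converted spectra change smoothly from frame to frame.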

C. HNM Based Speech Synthesis

In HNM, the spectrum of a voiced frame is divided into the lower-frequency harmonic part and the higher-frequency noise part. The frequency at which the two parts are divided is termed the maximum voiced frequency (MVF). In the original work [11], a method is provided to dynamically detect each frame's MVF. Here, to simplify the synthesis processing, we just use the static MVF value, 6,000 Hz, across all voiced frames.

Suppose the i-th and (i+1)-th frames are both voiced and have $L^{i}$ and $L^{i+1}$ harmonic partials, respectively. To synthesize a signal sample for the t-th sampling point between the i-th and (i+1)-th frames, we first derive the frequencies, $f_k(t)$, and amplitudes, $a_k(t)$, of the harmonic partials for this sampling point with linear interpolation. That is,

$$ f_k(t) = f_k^{i} + \frac{f_k^{i+1} - f_k^{i}}{N} \, t, \qquad a_k(t) = a_k^{i} + \frac{a_k^{i+1} - a_k^{i}}{N} \, t, \qquad k = 1, 2, \ldots, L, \tag{10} $$

where N is the number of sampling points between two adjacent frames, L is the larger of $L^{i}$ and $L^{i+1}$, and $f_k^{i}$ and $a_k^{i}$ are the frequency and amplitude of the k-th harmonic partial of the i-th frame. The value of $f_k^{i}$ is simply computed as $k \times q_i$, where $q_i$ is the converted pitch frequency for the i-th frame. As to $a_k^{i}$, its value is derived from the converted vector of DCC; the details of the derivation are given in our previous work [8]. Here, we directly set $a_k^{i} = 0$, $k = L^{i}+1, \ldots, L^{i+1}$, if $L^{i}$ is less than $L^{i+1}$. Then, the harmonic signal, h(t), for the t-th sampling point is computed as

$$ h(t) = \sum_{k=1}^{L} a_k(t) \cos\left( \phi_k(t) \right), \quad 0 \le t < N, \qquad \phi_k(t) = \phi_k(t-1) + \frac{2 \pi f_k(t)}{22{,}050}, \tag{11} $$

where $\phi_k(t)$ denotes the cumulated phase at time t for the k-th harmonic partial and 22,050 is the sampling frequency. $\phi_k(-1)$ is defined to be $\phi_k(N-1)$ of the previous frame to keep the phase continuous. If i = 0, i.e. there is no previous frame, the value of $\phi_k(-1)$ is set randomly.
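The interpolation in (10) and the phase accumulation in (11) can be sketched as below for the harmonic part only. This is a simplified illustration: the function name `synth_harmonic` is hypothetical, amplitudes are given directly instead of being derived from DCC, the shorter amplitude list is zero-padded in either direction (the text states this only for the case where the i-th frame has fewer partials), and a missing previous phase is set to 0 rather than randomly so that the output is reproducible.

```python
import math

def synth_harmonic(q_i, amps_i, q_next, amps_next, n, prev_phase=None, fs=22050.0):
    """Synthesize n samples of the harmonic part between frame i and frame i+1.

    q_i, q_next: converted pitch frequencies, so that f_k = k * q.
    amps_i, amps_next: amplitudes of the harmonic partials of the two frames.
    Returns (samples, final_phases); pass final_phases as prev_phase of the
    next call to keep the phase continuous across frames.
    """
    num_partials = max(len(amps_i), len(amps_next))
    a0 = list(amps_i) + [0.0] * (num_partials - len(amps_i))
    a1 = list(amps_next) + [0.0] * (num_partials - len(amps_next))
    phase = list(prev_phase) if prev_phase is not None else []
    phase += [0.0] * (num_partials - len(phase))   # assumption: 0, not random
    samples = []
    for t in range(n):
        value = 0.0
        for k in range(1, num_partials + 1):
            f = k * q_i + (k * q_next - k * q_i) * t / n        # Eq. (10), frequency
            a = a0[k - 1] + (a1[k - 1] - a0[k - 1]) * t / n     # Eq. (10), amplitude
            phase[k - 1] += 2.0 * math.pi * f / fs              # Eq. (11), phase
            value += a * math.cos(phase[k - 1])
        samples.append(value)
    return samples, phase
```

With the frame shift of Subsection II.B, n would be 110, and consecutive calls chained through the returned phases give a harmonic signal without phase jumps at frame boundaries.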

IV. Experimental Evaluations

To evaluate the conversion method proposed here, we have constructed three voice conversion systems, named SOG, SSG, and SLG. In the system SOG (system using original GMM for mapping), a single GMM of 256 mixtures is trained with the 350 training sentences, and then the mapping function, (1), is used to convert the DCC of each input frame. In the system SSG (system using single Gaussian mixture for mapping), we still trained a single GMM of 256 mixtures. Nevertheless, in the conversion stage, the mixture selection method described in Subsection III.B is applied, and then the DCC of a frame are converted with the single Gaussian mixture selected. As to the system SLG (system using selected GMM for mapping), we trained 37 segmental GMMs instead of a single GMM, and the number of mixtures for each GMM is 16. Then, in the conversion stage, the GMM selection method described in Subsection III.A is applied, followed by the mixture selection method described in Subsection III.B.

Using the three systems, we can obtain three different converted voice files for a source voice file. With the converted voice files, we have conducted two types of listening tests. The first type is for timbre similarity whereas the second type is for voice quality. For each type of listening test, 25 persons were invited to listen to the voice files and give relative scores. Among the 25 persons, 20 are not familiar with the research field of voice conversion.

A. Timbre Similarity Tests

In the tests of timbre similarity, 5 voice files are prepared first, which are named VS (uttered by the source speaker), VT (uttered by the target speaker), VX1 (converted by SOG), VX2 (converted by SSG), and VX3 (converted by SLG). Among the 5 files, VS and VT have the same content, whereas VX1, VX2, and VX3 have the same content as one another but different from that of VS and VT. These 5 files can be downloaded from the web page http://guhy.csie.ntust.edu.tw/VoiceConv/. During listening tests, these files are played in the order ABX where A is fixed to VS, B is fixed to VT, and X is randomly selected from VX1, VX2, and VX3. Each time the three files, ABX, are played, the participant is requested to give a score. Here, the score range is from 1 to 9. The score 9 (1) means the timbre of X is sure to be that of B (A), the score 7 (3) means the timbre of X is more like that of B (A), and the score 5 means the timbre of X cannot be judged.

After listening tests, the scores given by the 25 persons are collected to compute average scores (AVG) and standard deviations (STD) for the three systems respectively. The results are those values listed in Table I. From this table, it can be seen that the average scores for voice conversion between different genders (i.e. from M1 to F1) are much higher than those for voice conversion between same genders (i.e. from