
Executive Yuan National Science Council Research Project Final Report

A Study on Automatic Species Recognition Using Bioacoustic Features

Project type: Individual project

Project number: NSC94-2213-E-216-022-

Project period: August 1, 2005 to July 31, 2006
Executing institution: Department of Computer Science and Information Engineering, Chung Hua University

Principal investigator: Chang-Hsing Lee (李建興)    Co-principal investigator: 李遠坤

Project participants: 蘇忠茂, 林炳佑

Report type: Condensed report

Handling: this project report is open to public inquiry

October 27, 2006


National Science Council Funded Research Project  ☑ Final Report  □ Interim Progress Report

A Study on Automatic Species Recognition Using Bioacoustic Features

Project type: ☑ Individual project  □ Integrated project
Project number: NSC 94-2213-E-216-022-

Project period: August 1, 2005 to July 31, 2006

Principal investigator: Chang-Hsing Lee (李建興)    Co-principal investigator: 李遠坤

Project participants: 蘇忠茂, 林炳佑

Report type (as required by the approved funding list): ☑ Condensed report  □ Complete report

Attachments to be submitted with this report:

□ One report on an overseas research visit or training
□ One report on a research visit or training in mainland China
□ One report on attending an international academic conference, together with the presented paper
□ One foreign research report for an international cooperative research project

Handling: except for industry-academia cooperation projects, industrial technology upgrading and personnel training projects, controlled projects, and the cases below, this report may be made publicly searchable immediately

□ Involves patents or other intellectual property rights; publicly searchable after □ one year □ two years

Executing institution: Department of Computer Science and Information Engineering, Chung Hua University

October 31, 2006


Abstract

Many animals produce sounds mainly to communicate with one another or as a by-product of particular activities such as eating, moving, flying, courtship, and alarm. Research on the automatic recognition of bioacoustic signatures is important to biology, ecology, and environmental monitoring, in particular for detecting animal species and locating their distributions. This project proposes a system for automatically recognizing bioacoustic features, which can be used to identify frog and cricket calls. Given an input animal sound, we first segment it into syllables and then analyze the timbral features of each syllable: the averaged linear predictive cepstral coefficients (ALPCCs) and averaged Mel-frequency cepstral coefficients (AMFCCs) over the whole syllable are used as the syllable's timbral feature vector. Linear discriminant analysis (LDA) is then applied to improve recognition accuracy; because it minimizes the distance between feature vectors of the same class while maximizing the distance between feature vectors of different classes, it raises the recognition rate while reducing the dimensionality of the feature vectors.

I. Report Contents

1. Introduction

In daily life we hear the sounds of many animals, such as human speech, dog barks, bird songs, frog calls, cicadas, and crickets. Many animals produce sounds mainly to communicate with one another or as a by-product of particular activities such as eating, moving, flying, courtship, and alarm. Research on the automatic recognition of bioacoustic signatures is important to biology, ecology, and environmental monitoring, in particular for detecting animal species and locating their distributions. In general, people hear animal vocalizations far more often than they see the animals themselves. Moreover, animal vocalizations have evolved to be species-specific, that is, different species produce different sounds, so it is quite natural to identify species from their sounds. Such recognition can also be applied to ecological censusing, environmental monitoring, and biodiversity assessment.

Ecological surveys are usually carried out by people photographing animals in the field during the day. Although modern technology allows infrared photography even at night, the photographer must invest considerable patience and time and must take care not to disturb the animals. Recording animal sounds is different: in the field, animals are usually easy to hear but very hard to see. If recordings of animal sounds can be used for ecological assessment, a great deal of time and effort can be saved and species can be recorded more effectively, so using sound to identify species for ecological assessment has become a common approach in recent years. Because species are numerous and differ in habitat and way of life, many researchers have studied the differences among animal calls in the hope of discovering new species. However, most current bioacoustic identification work still relies on recording sounds manually in the field. This project therefore proposes a system that automatically recognizes bioacoustic features and uses it to identify frog and cricket calls.

2. Research Objectives and Methods

The proposed bioacoustic recognition system consists of two phases, a training phase and a recognition phase. The training phase is composed of three main modules: syllable segmentation, feature extraction, and linear discriminant analysis (LDA). The recognition phase is composed of four main modules: syllable segmentation, feature extraction, LDA transformation, and classification. Figure 1 shows the architecture of the system.

Figure 1. Architecture of the automatic bioacoustic recognition system (training and testing signals pass through syllable segmentation, feature extraction, and LDA transformation; the feature database and the classification module produce the identification result)

2.1 Syllable segmentation

When an animal sound signal is input, the individual syllables are first segmented out, and each syllable is treated as the basic recognition unit of the system. Syllables are chosen as the recognition unit because, when the input signal contains the sounds of several different animals at once, it is relatively easy to cut out individual syllables; in addition, features extracted from syllables are more stable. Following Harma's method, we use frequency-domain information to perform the syllable segmentation. The detailed steps are as follows:

Step 1. Build the spectrogram of the input bioacoustic signal with the short-time Fourier transform (STFT) and represent it as a matrix M(f, t), where f and t denote the frequency and time (frame) indices, respectively.

Step 2. Set n = 0.

Step 3. Over all (f, t), find the position of maximum magnitude, i.e., M(f_n, t_n) ≥ M(f, t) for every (f, t), and record the position of the nth syllable as (f_n, t_n).

Step 4. Compute A_n(0) = 20 log10 M(f_n, t_n) dB. If A_n(0) < A_0(0) − β dB, stop the segmentation; this means the nth syllable is too weak and no further syllables need to be extracted. β is a threshold whose value is set to 20.

Step 5. Starting from (f_n, t_n), trace the maximum of M(f, t) for t < t_n until A_n(t − t_n) < A_n(0) − β dB, and likewise trace the maximum of M(f, t) for t > t_n until A_n(t − t_n) < A_n(0) − β dB. This step determines the starting time (t_n − t_s) and the ending time (t_n + t_e) of the nth syllable.

Step 6. Store the amplitude trajectory of the nth syllable as A_n(τ), where τ = t_n − t_s, ..., t_n + t_e.

Step 7. Set M(f, [t_n − t_s, ..., t_n + t_e]) = 0 to clear the region of the nth syllable, set n = n + 1, and return to Step 3 to find the next syllable.
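To make the procedure concrete, the following Python sketch implements the same peak-picking loop on a magnitude spectrogram. It is only an illustration under simplified assumptions (the left/right trace uses the per-frame maximum level instead of a true peak-tracking path), and names such as segment_syllables are ours, not from the report.

    import numpy as np
    from scipy.signal import stft

    def segment_syllables(x, fs, n_fft=512, beta=20.0, max_syllables=200):
        """Segment syllables by repeatedly picking spectrogram peaks (Harma-style sketch)."""
        _, _, Z = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft // 2)
        M = np.abs(Z)                                           # magnitude spectrogram M(f, t)
        syllables = []
        A0 = None
        for _ in range(max_syllables):
            f_n, t_n = np.unravel_index(np.argmax(M), M.shape)  # Step 3: global peak
            A_n0 = 20.0 * np.log10(M[f_n, t_n] + 1e-12)         # Step 4: peak level in dB
            if A0 is None:
                A0 = A_n0
            if A_n0 < A0 - beta:                                # stop: remaining peaks too weak
                break
            peaks = 20.0 * np.log10(M.max(axis=0) + 1e-12)      # strongest level in each frame
            # Step 5: grow the syllable left and right until the level drops by beta dB
            t_start = t_n
            while t_start > 0 and peaks[t_start - 1] >= A_n0 - beta:
                t_start -= 1
            t_end = t_n
            while t_end < M.shape[1] - 1 and peaks[t_end + 1] >= A_n0 - beta:
                t_end += 1
            syllables.append((t_start, t_end))                  # Step 6: frame range of syllable
            M[:, t_start:t_end + 1] = 0.0                       # Step 7: clear the region
        return syllables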

2.2 Feature extraction

After the syllables of an animal sound have been segmented, each syllable is first divided into partially overlapping frames. For each frame we compute the LPCCs/MFCCs as its feature vector, and the average of the LPCCs/MFCCs over all frames belonging to the same syllable is taken as the feature vector of that syllable. The length of the feature vector is therefore fixed for every syllable, regardless of syllable duration.

(a) Linear predictive cepstral coefficients (LPCCs)

Linear predictive coding (LPC) has been widely applied in speech recognition and can be regarded as one of the most important representations in speech analysis. The basic idea of LPC is that a speech sample can be predicted by a linear combination of the previous p samples; the predictor is obtained by minimizing the error between the predicted and the actual signal values, and the coefficients of this optimal predictor are the linear predictive coding coefficients.

Linear predictive cepstral coefficients (LPCCs) are a more reliable feature and have been shown to outperform the LPC coefficients in speech recognition, representing the spectral peaks and fine structure of the speech signal better. The detailed steps for computing the LPCCs are as follows:

Step 1. Pre-emphasis

Apply a first-order FIR filter:

    \tilde{s}(n) = s(n) - a\, s(n-1)

Step 2. Framing

Divide the signal into frames of 256, 512, or 1024 samples, with successive frames overlapping by half a frame:

    x_l(n) = \tilde{s}(M l + n), \quad n = 0, 1, \ldots, N-1, \quad l = 0, 1, \ldots, L-1

where l denotes the lth frame, N the frame size, and M the frame shift.

Step 3. Windowing

Multiply each frame by a Hamming window to remove the discontinuities at the beginning and end of the frame:

    \tilde{x}_l(n) = x_l(n)\, w(n), \quad 0 \le n \le N-1

where the Hamming window is

    w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1

Step 4. Autocorrelation analysis

Compute the autocorrelation of each frame, from which the LPC coefficients will be obtained:

    r_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\, \tilde{x}_l(n+m), \quad m = 0, 1, \ldots, p

Step 5. LPC analysis

Obtain the LPC coefficients with the Durbin recursive method:

    E^{(0)} = r(0)

    k_i = \frac{r(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} r(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p

    \alpha_i^{(i)} = k_i

    \alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \quad 1 \le j \le i-1

    E^{(i)} = (1 - k_i^2)\, E^{(i-1)}

    a_m = \alpha_m^{(p)}, \quad 1 \le m \le p

where p is the order of the LPC coefficients.

Step 6. Conversion of LPC coefficients to cepstral coefficients (LPCCs)

The LPC coefficients are converted into cepstral coefficients by

    c_0 = \ln \sigma^2

    c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p

where \sigma^2 is the gain term of the speech production model:

    \sigma^2 = r(0) - \sum_{k=1}^{p} a_k\, r(k)
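As an illustration of Steps 1-6, here is a minimal Python sketch that computes the LPCCs of a single frame using the standard Levinson-Durbin recursion and LPC-to-cepstrum conversion; the function names and default orders are illustrative assumptions, not part of the report.

    import numpy as np

    def levinson_durbin(r, p):
        """Solve for LPC coefficients a[1..p] from autocorrelation r[0..p] (Step 5)."""
        a = np.zeros(p + 1)
        E = r[0]
        for i in range(1, p + 1):
            k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E      # reflection coefficient k_i
            a_new = a.copy()
            a_new[i] = k
            a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
            a = a_new
            E *= (1.0 - k * k)                                   # prediction-error update
        return a[1:], E                                          # a_m = alpha_m^(p), gain sigma^2

    def lpcc_frame(frame, p=12, n_cep=12, preemph=0.95):
        """LPCC features of one frame (Steps 1, 3, 4, 6)."""
        s = np.append(frame[0], frame[1:] - preemph * frame[:-1])     # Step 1: pre-emphasis
        s = s * np.hamming(len(s))                                    # Step 3: Hamming window
        r = np.correlate(s, s, mode="full")[len(s) - 1:len(s) + p]    # Step 4: r(0..p)
        a, sigma2 = levinson_durbin(r, p)
        c = np.zeros(n_cep + 1)
        c[0] = np.log(sigma2 + 1e-12)                                 # c_0 = ln(sigma^2)
        for m in range(1, n_cep + 1):                                 # Step 6: LPC -> cepstrum
            acc = a[m - 1] if m <= p else 0.0
            for k in range(1, m):
                if m - k <= p:
                    acc += (k / m) * c[k] * a[m - k - 1]
            c[m] = acc
        return c[1:]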

(b) Mel-scale frequency cepstral coefficients (MFCCs)

MFCCs have been widely used in speech recognition; they are very effective for speech recognition and can describe an audio signal compactly with a set of frequency bands. The mel is a unit expressing the perceived pitch, or frequency, of a tone. In the human auditory system, the response to the physical frequency of a tone is not linear: the mapping between physical frequency and mel frequency is approximately linear below 1 kHz and logarithmic above it. The relation between physical frequency and mel frequency is shown in Figure 2 and is given by

    mel = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{\,mel/2595} - 1\right)

where f denotes the physical frequency in Hz. The human auditory system divides sound frequencies into critical bands; frequencies within the same critical band sound similar to the human ear, so a bank of filters can be used to filter the signal in each critical band. The bandwidth of each critical band varies with frequency. Figure 3 shows the shape and bandwidth of a set of triangular critical-band filters, and Table 1 lists the frequency range of each critical-band filter.

Figure 2. Relationship between physical frequency and mel frequency
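For reference, the two conversion formulas above can be written as a small Python helper (names are ours, not from the report):

    import numpy as np

    def hz_to_mel(f):
        """Physical frequency (Hz) to mel scale: mel = 2595 * log10(1 + f/700)."""
        return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

    def mel_to_hz(mel):
        """Inverse mapping: f = 700 * (10^(mel/2595) - 1)."""
        return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

    # roughly linear below 1 kHz, logarithmic above
    print(hz_to_mel([500, 1000, 2000, 4000]))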


Figure 3. A set of triangular critical-band filters

Table 1. Frequency range of each critical-band filter

Index      Low Freq. (Hz)   Center Freq. (Hz)   High Freq. (Hz)
Filter 1        0               100                 200
Filter 2      100               200                 300
Filter 3      200               300                 400
Filter 4      300               400                 500
Filter 5      400               500                 600
Filter 6      500               600                 700
Filter 7      600               700                 800
Filter 8      700               800                 900
Filter 9      800               900                1000
Filter 10     900              1000                1149
Filter 11    1000              1149                1320
Filter 12    1149              1320                1516
Filter 13    1320              1516                1741
Filter 14    1516              1741                2000
Filter 15    1741              2000                2297
Filter 16    2000              2297                2639
Filter 17    2297              2639                3031
Filter 18    2639              3031                3482
Filter 19    3031              3482                4000
Filter 20    3482              4000                4595
Filter 21    4000              4595                5278
Filter 22    4595              5278                6063
Filter 23    5278              6063                6964
Filter 24    6063              6964                8000
Filter 25    6964              8000                9190


The steps for computing the MFCCs are as follows:

Step 1: Pre-emphasis

    \hat{s}[n] = s[n] - a\, s[n-1]

where s[n] is the input signal and the default value of a is 0.95.

Step 2: Framing

Each syllable is divided into frames of 512 samples, and to keep the difference between adjacent frames small, successive frames overlap by half a frame.

Step 3: Hamming windowing

To remove the discontinuities at the beginning and end of each frame, every frame is multiplied by a Hamming window:

    w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1

Step 4: Fast Fourier transform (FFT)

Transform the signal from the time domain to the frequency domain:

    X[k] = \sum_{n=0}^{N-1} \tilde{s}[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k < N

where N is the frame size and \tilde{s}[n] is the windowed discrete signal.

Step 5: Triangular band-pass filtering

Because the frequency resolution of the human ear is logarithmic rather than linear, a set of triangular band-pass filters is used to divide the signal into bands, and the energy of each band is computed:

    E_j = \sum_{k=0}^{N/2-1} \phi_j(k)\, A_k, \quad 0 \le j < J

where J is the number of triangular band-pass filters and A_k is the squared magnitude of X[k]:

    A_k = |X[k]|^2, \quad 0 \le k < N/2

The jth filter \phi_j is defined as

    \phi_j[k] = \begin{cases} (k - I_l^j)/(I_c^j - I_l^j), & I_l^j \le k \le I_c^j \\ (I_h^j - k)/(I_h^j - I_c^j), & I_c^j \le k \le I_h^j \\ 0, & k < I_l^j \text{ or } k > I_h^j \end{cases}

where I_l^j, I_c^j, and I_h^j denote the low-, center-, and high-frequency indices of the jth filter:

    I_l^j = \frac{N f_l^j}{f_s}, \quad I_c^j = \frac{N f_c^j}{f_s}, \quad I_h^j = \frac{N f_h^j}{f_s}

with f_s the sampling frequency and f_l^j, f_c^j, f_h^j the low, center, and high frequencies of the jth filter, listed in Table 1.

Step 6: Discrete cosine transform (DCT)

Finally, the band energies are weighted by cosine terms to obtain the MFCCs:

    C_m^i = \sum_{j=0}^{J-1} \cos\!\left(\frac{m\pi}{J}(j + 0.5)\right) \log_{10}(E_j), \quad 0 \le m \le L-1

where L is the number of MFCCs. We use 25 triangular filters (J = 25), and the length of the MFCC vector is 15 (L = 15).
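A possible Python sketch of Steps 1-6 for one frame is given below; it hard-codes the 25 filter boundaries of Table 1 and assumes a sampling rate high enough (e.g., 44.1 kHz, as used in the experiments) that all band edges fall below the Nyquist frequency. All identifiers are illustrative.

    import numpy as np

    # (low, center, high) frequencies in Hz for the 25 triangular filters of Table 1
    BANDS = [(0, 100, 200), (100, 200, 300), (200, 300, 400), (300, 400, 500),
             (400, 500, 600), (500, 600, 700), (600, 700, 800), (700, 800, 900),
             (800, 900, 1000), (900, 1000, 1149), (1000, 1149, 1320), (1149, 1320, 1516),
             (1320, 1516, 1741), (1516, 1741, 2000), (1741, 2000, 2297), (2000, 2297, 2639),
             (2297, 2639, 3031), (2639, 3031, 3482), (3031, 3482, 4000), (3482, 4000, 4595),
             (4000, 4595, 5278), (4595, 5278, 6063), (5278, 6063, 6964), (6063, 6964, 8000),
             (6964, 8000, 9190)]

    def mfcc_frame(frame, fs, n_mfcc=15, preemph=0.95):
        N = len(frame)
        s = np.append(frame[0], frame[1:] - preemph * frame[:-1])           # Step 1: pre-emphasis
        s = s * (0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1)))  # Step 3: Hamming
        A = np.abs(np.fft.rfft(s, N)[:N // 2]) ** 2                         # Steps 4-5: A_k = |X[k]|^2
        k = np.arange(N // 2)
        E = np.zeros(len(BANDS))
        for j, (fl, fc, fh) in enumerate(BANDS):                            # Step 5: band energies
            Il, Ic, Ih = [min(int(f * N / fs), N // 2 - 1) for f in (fl, fc, fh)]
            phi = np.zeros(N // 2)
            if Ic > Il:
                phi[Il:Ic + 1] = (k[Il:Ic + 1] - Il) / (Ic - Il)            # rising edge
            if Ih > Ic:
                phi[Ic:Ih + 1] = (Ih - k[Ic:Ih + 1]) / (Ih - Ic)            # falling edge
            E[j] = np.dot(phi, A)
        m = np.arange(n_mfcc)[:, None]
        jj = np.arange(len(BANDS))[None, :]
        dct = np.cos(m * np.pi / len(BANDS) * (jj + 0.5))                   # Step 6: DCT basis
        return dct @ np.log10(E + 1e-12)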

(c) Averaged LPCCs/MFCCs

As described above, each syllable is first divided into frames and the LPCCs/MFCCs of each frame are computed as its features. Naturally, syllables of different lengths yield different numbers of frames. To deal with this, the LPCCs/MFCCs are averaged over all frames of the syllable and this average is used as the feature vector of the syllable, so the feature vector has a fixed length regardless of syllable duration. The averaged LPCCs/MFCCs are computed as

    f_m = \frac{1}{K} \sum_{i=1}^{K} C_m^i, \quad 0 \le m \le L-1

where f_m is the mth feature, K is the number of frames in the syllable, C_m^i is the mth feature of the ith frame, and L is the length of the feature vector.

During training, the features of all training syllables belonging to the same animal species are averaged and used as the feature of that species:

    F_m = E(f_m), \quad 0 \le m \le L-1

where E(·) denotes the expectation. Because different features have different dynamic ranges, F_m is normalized to obtain F'_m:

    F'_m = \frac{F_m - f_m^{min}}{f_m^{max} - f_m^{min}}

where f_m^{max} and f_m^{min} are the maximum and minimum values of the mth feature.
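A short Python sketch of this averaging and min-max normalization, assuming each syllable is given as a list of per-frame feature vectors (all names are ours):

    import numpy as np

    def syllable_feature(frame_features):
        """f_m: average the per-frame LPCC/MFCC vectors over all K frames of a syllable."""
        return np.mean(np.asarray(frame_features), axis=0)

    def species_templates(train_syllables, labels):
        """F'_m: min-max normalize each feature, then average per species."""
        X = np.vstack([syllable_feature(s) for s in train_syllables])
        f_min, f_max = X.min(axis=0), X.max(axis=0)
        X = (X - f_min) / (f_max - f_min + 1e-12)        # normalize each feature to [0, 1]
        templates = {c: X[np.asarray(labels) == c].mean(axis=0) for c in set(labels)}
        return templates, f_min, f_max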

2.3 Linear discriminant analysis (LDA)

The purpose of linear discriminant analysis is to transform a high-dimensional feature vector into a lower-dimensional vector while increasing recognition accuracy. LDA is concerned with the discrimination between classes rather than with the representation of each class: its main idea is to minimize the within-class distance while maximizing the between-class distance. A transformation matrix must therefore be determined that maps an n-dimensional feature vector to a d-dimensional vector, where d ≤ n; this transformation enhances the separability between classes. The transformation matrix is usually obtained from the Fisher criterion J_F:

    J_F(A) = \mathrm{tr}\!\left( (A^T S_W A)^{-1} (A^T S_B A) \right)

where S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is

    S_W = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (x_i^j - \mu_j)(x_i^j - \mu_j)^T

where x_i^j is the ith feature vector of class j, \mu_j is the mean vector of class j, C is the number of classes, and N_j is the number of feature vectors in class j. The between-class scatter matrix is

    S_B = \sum_{j=1}^{C} (\mu_j - \mu)(\mu_j - \mu)^T

where \mu is the mean vector of all classes. The goal of LDA is to find the transformation matrix A_{opt}, of size n × d, that maximizes the ratio of the between-class scatter to the within-class scatter:

    A_{opt} = \arg\max_A \mathrm{tr}\!\left( (A^T S_W A)^{-1} A^T S_B A \right)

This transformation matrix can be obtained from the eigenvectors of S_W^{-1} S_B: the d column vectors of A_{opt} are the eigenvectors corresponding to the d largest eigenvalues. Assuming the eigenvalues are sorted in non-increasing order, the number of retained eigenvectors d is determined by

    \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge 0.95

After the optimal transformation matrix A_{opt} has been determined, each normalized n-dimensional feature vector is transformed into a d-dimensional vector. Let f_j be an n-dimensional feature vector of class j; it is transformed into a d-dimensional vector by

    x_j = A_{opt}^T f_j.
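The LDA step could be sketched in Python as follows, using the scatter-matrix definitions and the 0.95 eigenvalue-energy rule above; this is an illustrative sketch, not the report's implementation.

    import numpy as np

    def lda_transform(X, y, energy=0.95):
        """Return A_opt (n x d) from labeled feature vectors X (rows) and labels y."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        mu = X.mean(axis=0)
        n = X.shape[1]
        S_W = np.zeros((n, n))
        S_B = np.zeros((n, n))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            S_W += (Xc - mu_c).T @ (Xc - mu_c)              # within-class scatter
            d_c = (mu_c - mu)[:, None]
            S_B += d_c @ d_c.T                              # between-class scatter (as in the report)
        eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
        order = np.argsort(-eigvals.real)
        eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
        ratios = np.cumsum(eigvals) / eigvals.sum()
        d = int(np.searchsorted(ratios, energy) + 1)        # keep eigenvectors up to 95% energy
        return eigvecs[:, :d]                               # A_opt; project with x = A_opt.T @ f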

2.4 Recognition phase

In the recognition phase, each input sound file is first segmented into syllables and the averaged LPCCs/MFCCs of each syllable are computed. The transformation matrix A_{opt} is then used to map the normalized LPCCs/MFCCs into a lower-dimensional feature vector. Next, the distance between this feature vector and the representative feature vector of every animal species is computed, using the Euclidean distance. Let r denote the recognized species:

    r = \arg\min_{1 \le k \le C} \sum_{m=1}^{d} (x_m - x_m^k)^2

where C is the number of classes, d is the dimensionality of the feature vector, x_m is the mth feature of the input syllable, and x_m^k is the mth feature of species k.
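Finally, a one-function Python sketch of this nearest-template rule, assuming the species templates have already been projected with A_opt (names are ours):

    import numpy as np

    def classify(feature, A_opt, templates):
        """templates: dict mapping species code -> LDA-projected template vector."""
        x = A_opt.T @ feature                                   # project to d dimensions
        return min(templates, key=lambda k: np.sum((x - templates[k]) ** 2))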

3. Experimental Results and Discussion

The animal sound data used in the experiments comprise two databases containing 30 kinds of frog calls and 19 kinds of cricket calls, sampled at 44,100 Hz with 16-bit resolution. Before extracting features for recognition, every sound file was segmented into syllables using the spectral energy information; half of the syllables were used for training and the other half for testing. Tables 2 and 3 list the numbers of training and test syllables for each frog and cricket sound file.

Tables 4 and 5 compare the accuracy of HMM, ALPCC, and AMFCC in recognizing the 30 kinds of frog calls and the 19 kinds of cricket calls, respectively. The tables show that AMFCC is more accurate than HMM and ALPCC, with recognition accuracies of 96.2% for frog calls and 98.9% for cricket calls.


Table 2. Frog call database (SC denotes the species code)

SC   Scientific name (Popular name)                              Ns
1    Bufo bankorensis (Central Formosan Toad)                    38
2    Bufo melanosticus (Spectacled Toad)                         48
3    Hyla chinensis (Chinese Tree Frog)                          46
4    Microhyla butleri (Butler's Narrow-Mouthed Toad)            15
5    Microhyla heymonsi (Heymonsi's Narrow-Mouthed Toad)         235
6    Microhyla ornata (Ornate Narrow-Mouthed Toad)               193
7    Rana adenopleura (Olive Frog)                               36
8    Rana catesbiana (American Bull Frog)                        13
9    Rana guentheri (Guenther's Amoy Frog)                       15
10   Rana kuhlii (Kuhli's Wart Frog)                             52
11   Rana latouchii (Brown Wood Frog)                            95
12   Rana limmocharis (Indian Rice Frog)                         38
13   Rana rugulosa (Chinese Bull Frog)                           98
14   Rana swinhoana (Swinhoe's Frog)                             13
15   Rana sauteri (Sauter's Frog)                                132
16   Rana taipehensis (Taipei Grass Frog)                        27
17   Buergeria japonica (Japanese Tree Frog)                     60
18   Buergeri robusta (Brown Tree frog)                          67
19   Chirixalus eiffingeri (Eiffinger's Tree Frog)               10
20   Chirixalus idiootocus (Meintein Tree Frog)                  112
21   Polypedates megacephalus (White lipped Tree Frog)           23
22   Rhacophorus arvalis (Farmland Tree Frog)                    46
23   Rhacophorus Aurantiventris (Orange-Belly Tree Frog)         38
24   Rhacophorus moltrechti (Moltrecht's Tree Frog)              191
25   Rhacophorus prasinatus (Emerald Tree Frog)                  52
26   Rhacophorus taipeianus (Taipei Green Tree Frog)             49
27   Microhyla steinegeri (Steinger's Narrow-Mouthed Toad)       3
28   Kaloula pulchra (Malaysian Narrow-Mouthed Toad)             6
29   Rana longicrus (Long-Legged Frog)                           9
30   Rana psaltes (Harpist Frog)                                 65


Table 3. Cricket call database (SC denotes the species code)

SC   Scientific name (Popular name)                              Ns
1    Gryllotalpa fossor (Mole Cricket)                           80
2    Teleogryllus occipitalis (Oil guord)                        24
3    Teleogryllus mitratus (Oil guord)                           3
4    Teleogryllus emma (Oil guord)                               10
5    Gryllus bimaculatus (Painted Mirror)                        6
6    Brachytrupes portentosus (Formosan Giant Crickets)          67
7    Loxoblemmus equestris (Coffin-headed cricket)               9
8    Dianemobius flavoantennalis (Flowered Bell)                 22
9    Homoeogryllus japonicus (Horse bell)                        3
10   Scleropterus punctatus (Rocky bell)                         7
11   Oecanthus longicaudus (Bamboo bell)                         7
12   Xenogryllus marmoratus (Pagoda Bell)                        6
13   Anaxipha pallidula (Yellow Bell)                            58
14   Svistella bifasciatata (Golden Bell)                        15
15   Homoeoxipha lycoides (Inky bell)                            4
16   Mecopoda elongta L. (Weaving Lady)                          74
17   Gryllotalpa fossor (Mole Cricket)                           135
18   Teleogryllus occipitalis (Oil guord)                        2
19   Teleogryllus mitratus (Oil guord)                           5


Table 4. Recognition accuracy for frog calls

SC HMM ALPCC AMFCC

1 66% 89% 97%

2 98% 91% 100%

3 89% 97% 100%

4 40% 100% 100%

5 76% 71% 95%

6 82% 80% 97%

7 53% 58% 100%

8 54% 100% 100%

9 93% 100% 100%

10 50% 76% 100%

11 80% 66% 100%

12 68% 100% 100%

13 81% 97% 100%

14 62% 46% 38%

15 77% 87% 88%

16 63% 85% 92%

17 75% 100% 100%

18 96% 100% 97%

19 100% 100% 100%

20 99% 100% 100%

21 43% 60% 56%

22 15% 30% 73%

23 79% 68% 100%

24 94% 96% 99%

25 67% 96% 96%

26 94% 85% 100%

27 0% 100% 100%

28 0% 66% 100%

29 89% 88% 100%

30 98% 69% 100%

Average 78.9% 83.9% 96.2%


Table 5. Recognition accuracy for cricket calls

SC HMM ALPCC AMFCC

1 90% 92% 98%

2 83% 100% 100%

3 100% 100% 100%

4 50% 20% 100%

5 50% 100% 100%

6 99% 97% 100%

7 89% 100% 88%

8 68% 86% 100%

9 33% 100% 100%

10 86% 85% 85%

11 71% 100% 100%

12 100% 100% 100%

13 93% 94% 100%

14 87% 93% 93%

15 100% 100% 100%

16 99% 98% 98%

17 100% 97% 100%

18 0% 100% 100%

19 20% 80% 80%

Average 91.2% 94.6% 98.9%

II. References

[1] E. D. Chesmore, "Application of time domain signal coding and artificial neural networks to passive acoustical identification of animals," Applied Acoustics, vol. 62, no. 12, Dec. 2001.

[2] A. Harma, "Automatic identification of bird species based on sinusoidal modeling of syllables," Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. V-545-V-548, 2003.

[3] J. A. Kogan and D. Margoliash, "Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: a comparative study," J. Acoust. Soc. Am., vol. 103, no. 4, Apr. 1998.

[4] A. L. McIlraith and H. C. Card, "A comparison of backpropagation and statistical classifiers for bird identification," Proceedings of the International Conference on Neural Networks, vol. 1, pp. 100-104, June 1997.

[5] A. L. McIlraith and H. C. Card, "Birdsong recognition with DSP and neural networks," Proceedings of the IEEE International Conference on Communications, Power, and Computing, vol. 2, pp. 409-414, May 1995.

[6] A. L. McIlraith and H. C. Card, "Birdsong recognition using backpropagation and multivariate statistics," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2740-2748, Nov. 1997.

[7] A. L. McIlraith and H. C. Card, "Bird song identification using artificial neural networks and statistical analysis," Proceedings of the IEEE 1997 Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 63-66, May 1997.

[8] K. Sasaki and M. Yamazaki, "Vector compression of bird songs spectra in water sites by using the linear prediction method and its application to an automated Bayesian species classification," Proceedings of the 38th SICE Annual Conference, pp. 1083-1088, July 1999.

[9] C. Rogers, "High resolution analysis of bird sounds," Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 3011-3014, May 1995.

[10] A. Harma and M. Juntunen, "A method for parametrization of time-varying sounds," IEEE Signal Processing Letters, vol. 9, no. 5, pp. 151-153, May 2002.

[11] R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 525-532, 1999.

[12] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[13] J. W. Picone, "Signal modeling techniques in speech recognition," Proceedings of the IEEE, vol. 81, pp. 1215-1247, 1993.

[14] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 708-716, Nov. 2000.

[15] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.

[16] L. Lu, S. Z. Li, and H. J. Zhang, "Content-based audio segmentation using support vector machines," Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 749-752, Aug. 2001.

[17] H. W. Ng, Y. Sawahata, and K. Aizawa, "Summarization of wearable videos using support vector machine," Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 325-328, Aug. 2002.

[18] S. I. Hill, P. J. Wolfe, and P. J. W. Rayner, "Nonlinear perceptual audio filtering using support vector machines," Proceedings of the IEEE Workshop on Statistical Signal Processing, pp. 488-491, Aug. 2001.

[19] M. Davy and S. Godsill, "Detection of abrupt spectral changes using support vector machines: an application to audio signal segmentation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1313-1316, 2002.

[20] G. D. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209-215, Jan. 2003.

[21] Y. Liu, P. Ding, and B. Xu, "Using nonstandard SVM for combination of speaker verification and verbal information verification in speaker authentication system," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I-673-I-676, May 2002.

III. Self-Evaluation of Project Results

Building biodiversity databases is fundamental work for promoting conservation, education, and research. Article 17 of the Convention on Biological Diversity requires each country to establish a clearing-house for biodiversity information, to actively collect and organize data on its native biodiversity, and to share them with other countries, in order to promote the conservation, use, management, research, and education of biodiversity and to strengthen national capacity building in taxonomy.

Although Taiwan's land area is small, it possesses unusually rich biodiversity and many endemic species: about 450 bird species, about 31 frog species, 59 cicada species (4 of them new), and roughly eighty cricket species have been formally recorded. Using an automatic sound recognition system to record the habitats of these animals not only helps us understand their ecological changes but also reduces the impact on the ecosystem. Research on bioacoustic recognition, especially in Taiwan, is still scarce; we hope that this automatic recognition system, combined with suitable hardware, can help discover species not yet recorded and build a more complete database of Taiwanese animal sounds. This project has completed a recognition system that automatically identifies frog and cricket calls; in the future we hope to extend it to the more complex and variable songs of birds.

We have so far published three related papers, one journal paper and two conference papers:

Journal paper:

[1] C. H. Lee, C. H. Chou, C. C. Han, and R. Z. Huang, "Automatic Recognition of Animal Vocalizations Using Averaged MFCC and Linear Discriminant Analysis", Pattern Recognition Letters, Vol. 27, Issue 2, Jan. 2006, pp. 93-101. (SCI, EI)

Conference papers:

[1] C. H. Lee, C. H. Chou, and R. Z. Huang, "Automatic Recognition of Bioacoustic Sounds: an Experiment on the Frog Vocalizations", in Proceedings of the 17th IPPR Conference on Computer Vision, Graphics, and Image Processing, Hualien, Aug. 15-17, 2004.

[2] C. H. Lee, C. H. Chou, C. C. Han, and R. Z. Huang, "Automatic Recognition of Frog Calls Using Averaged MFCC and Linear Discriminant Analysis", in Proceedings of the 9th Conference on Artificial Intelligence and Applications, Taipei, Nov. 5-6, 2004.

The full text of the published journal paper follows.


Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis

Chang-Hsing Lee a,*, Chih-Hsun Chou a, Chin-Chuan Han b, Ren-Zhuang Huang a

a Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu 300, Taiwan, ROC
b Department of Computer Science and Information Engineering, National United University, Miao-Li 360, Taiwan, ROC

Received 13 August 2004; received in revised form 30 June 2005
Available online 31 August 2005

Communicated by O. Siohan

Abstract

In this paper we propose a method that uses the averaged Mel-frequency cepstral coefficients (MFCCs) and linear discriminant analysis (LDA) to automatically identify animals from their sounds. First, each syllable corresponding to a piece of vocalization is segmented. The averaged MFCCs over all frames in a syllable are calculated as the vocalization features. Linear discriminant analysis (LDA), which finds a transformation matrix that minimizes the within-class distance and maximizes the between-class distance, is utilized to increase the classification accuracy while reducing the dimensionality of the feature vectors. In our experiment, the average classification accuracy is 96.8% and 98.1% for 30 kinds of frog calls and 19 kinds of cricket calls, respectively.

© 2005 Elsevier B.V. All rights reserved.

MSC: 140.000

Keywords: Linear discriminant analysis; Mel-frequency cepstral coefficients

1. Introduction

Many animals generate sounds either for communication or as a by-product of their living activities such as eating, moving, or flying. Automatic recognition of bioacoustic sounds is valuable for applications such as biological research and environmental monitoring; this is particularly true for detecting and locating animals. In our daily life, we often hear the animal vocalizations rather than see the animals. In general, the animals generate sounds to communicate with members of the same species and thus the animal vocalizations have evolved to be species-specific. Therefore, identifying animal species from their vocalizations is valuable to ecological censusing.

In general, the acoustic signal representing animal vocalizations can be regarded as a sequence of syllables. Thus, a better way to identify animals from their vocalizations is to use a syllable as the acoustic component. It is necessary to segment the syllables of animal vocalizations before the recognition process. Segmentation of speech or audio signals is often based on energy (Lamel et al., 1981; Li et al., 2001; Lu, 2001; Wold et al., 1996; Zhang and Kuo, 2001) and/or zero-crossing rate (Li et al., 2001; Lu, 2001; Tian et al., 2002; Wold et al., 1996; Zhang and Kuo, 2001). A disadvantage of using these segmentation methods to extract syllables from animal vocalizations is that the full syllable cannot be extracted exactly. To overcome this problem, we exploit the frequency information to segment the syllables of animal vocalizations (Harma, 2003).

* Corresponding author. Tel.: +886 3 5186406; fax: +886 3 5186416.
E-mail addresses: chlee@chu.edu.tw (C.-H. Lee), chc@chu.edu.tw (C.-H. Chou), cchan@nuu.edu.tw (C.-C. Han).


Once the syllables have been properly segmented, a set of features will be calculated to represent each syllable. The most well-known features for speech/speaker recognition are linear predictive coefficients (LPCs) (Rabiner and Juang, 1993) or Mel-frequency cepstral coefficients (MFCCs) (Picone, 1993; Rabiner and Juang, 1993; Vergin et al., 1999). In this paper, we use the averaged MFCCs in a syllable to identify animals from their sounds due to the fact that MFCCs can represent the spectrum of animal sounds in a compact form. In the next section, we will describe the proposed recognition method for animal vocalizations.

2. The proposed recognition method for animal vocalizations

The recognition system consists of two parts: the training part and the recognition part. The training part is composed of three main modules: syllable segmentation, averaged MFCCs extraction, and linear discriminant analysis (LDA). The recognition part consists of four modules: syllable segmentation, averaged MFCCs extraction, LDA transformation, and classification. A detailed description of each module will be described below.

2.1. Syllable segmentation

The input acoustic signal is first segmented into a set of syllables (Harma, 2003). Each syllable is regarded as the basic acoustic unit for recognition. The syllable segmentation method based on the frequency information is described as follows:

Step 1. Compute the spectrogram of the input bioacoustic signal using the short-time Fourier transform (STFT). We denote the spectrogram as a matrix S(f, t), where f represents the frequency index and t is the frame index.

Step 2. Set n = 0.

Step 3. Find f_n and t_n such that |S(f_n, t_n)| ≥ |S(f, t)| for every pair of (f, t). Set the position of the nth syllable to be (f_n, t_n).

Step 4. Compute the amplitude A_n(0) = 20 log10 |S(f_n, t_n)| dB and set the frequency parameter W_n(0) = f_n. If A_n(0) < A_0(0) − 20 dB, stop the segmentation process. This means that the amplitude of the nth syllable is too small and hence no more syllables need to be extracted.

Step 5. Starting from (f_n, t_n), trace the maximal peak of |S(f, t)| for t < t_n until A_n(t) < A_n(0) − β dB, where β is the stopping criterion and its default value is 20. Next, trace the maximal peak of |S(f, t)| for t > t_n until A_n(t) < A_n(0) − β dB. This step is to determine the starting time (t_n − t_s) and the ending time (t_n + t_e) of the nth syllable around t_n.

Step 6. Store the amplitude trajectories corresponding to the nth syllable in the function A_n(τ), where τ = t_n − t_s, ..., t_n + t_e.

Step 7. Set S(f, [t_n − t_s, ..., t_n + t_e]) = 0 to delete the area of the nth syllable. Set n = n + 1 and go to Step 3 to find the next syllable.

Fig. 1 shows the waveform of the Olive Frog (Rana adenopleura) as well as the segmentation results obtained by using the energy information and the spectrogram frequency information. It is evident that a better result can be obtained. After segmenting each syllable, the averaged MFCCs are extracted to represent the syllable.

2.2. Averaged MFCCs extraction

MFCCs have been the most widely used features for speech recognition (Picone, 1993; Rabiner and Juang, 1993; Vergin et al., 1999), bird song recognition (Kogan and Margoliash, 1998), and audio retrieval (Slaney, 2002) due to their ability to represent the signal spectrum in a compact form. In fact, the MFCCs have been proven to be very effective in automatic speech recognition or in modeling the subjective frequency content of audio signals. In general, an input signal is first divided into a set of frames. The MFCCs for each frame are then computed and are regarded as the features of this frame. However, the number of frames varies for different syllables. To deal with this problem, the averaged MFCCs of all the frames in a syllable are computed and used as features to represent the syllable. Therefore, the number of features is fixed regardless of the length of the acoustic syllable. A detailed description for deriving the MFCCs of an acoustic signal is given as follows:

Step 1. Pre-emphasis.

    \hat{s}[n] = s[n] - \hat{a}\, s[n-1],    (1)

where s[n] is the signal denoting the input syllable; a typical value for \hat{a} is 0.95.

Step 2. Framing. Each syllable is divided into a set of overlapped frames with a frame size of N samples, and the overlapping size is M samples for each pair of successive frames. Therefore, consecutive frames will never change too much. In our experiments, N is 512 and M is 256.

Step 3. Windowing. To reduce the discontinuity on both ends of a frame, each frame is multiplied by a Hamming window

    \tilde{s}[n] = \hat{s}[n]\, w[n], \quad 0 \le n \le N-1,    (2)

where w[n] is the Hamming window function

    w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1.    (3)

Step 4. Spectral analysis. Take the discrete Fourier transform of each frame using the FFT

    X[k] = \sum_{n=0}^{N-1} \tilde{s}[n]\, e^{-j\frac{2\pi}{N} k n}, \quad 0 \le k \le N-1.    (4)

Step 5. Band-pass filtering. The amplitude spectrum is then filtered using a set of triangular band-pass filters

    E_j = \sum_{k=0}^{N/2-1} \phi_j(k)\, A_k, \quad 0 \le j \le J-1,    (5)

where J is the number of filters, \phi_j is the jth filter, and A_k is the amplitude of X[k]

    A_k = |X[k]|^2, \quad 0 \le k < N/2.    (6)

Step 6. DCT. The MFCCs for the ith frame are computed by performing the DCT on the logarithm of E_j

    C_{im} = \sum_{j=0}^{J-1} \cos\!\left(\frac{m\pi}{J}(j + 0.5)\right) \log_{10}(E_j), \quad 0 \le m \le L-1,    (7)

where L is the number of MFCCs.

In the proposed method, the filter bank consists of 25 triangular filters, that is, J = 25. The length of the MFCC feature vector for each frame is 15 (L = 15). After deriving the MFCCs for each frame, we compute the averaged MFCCs of all frames within the syllable

    f_m = \frac{\sum_{i=1}^{K} C_{im}}{K}, \quad 0 \le m \le L-1,    (8)

where f_m is the mth MFCC, K is the number of frames within the syllable, and C_{im} denotes the mth MFCC of the ith frame. In the training phase, the average of f_m over all training syllables for the acoustic vocalization of the same species is regarded as the mth feature value, F_m. Since the dynamic ranges of the f_m's may be different, we perform a linear normalization process to get the final feature vector F'_m

    F'_m = \frac{F_m - f_m^{min}}{f_m^{max} - f_m^{min}},    (9)

where f_m^{max} and f_m^{min} denote the maximum and minimum values of the mth MFCC over all f_m's for the training syllables, respectively.

From Fig. 1, it seems that each syllable has a different sound structure. Fig. 2 shows that the spectrograms of these syllables look similar except that of the last syllable. Table 1 shows the feature vectors extracted from these syllables. From this table, we can see that these feature vectors are very close. Table 2 shows the distance between each feature vector extracted from these syllables and the 30 representative feature vectors.

2.3. Linear discriminant analysis (LDA)

LDA (Baker, 2004; Duda et al., 2000; Slaney, 2002) aims

Table 1

Feature vectors for the extracted syllables (SC denotes the subject code)

SC S1 S2 S3 S4 S5 S6 S7 S8 S9

f1 0.627 0.667 0.606 0.713 0.727 0.791 0.843 0.829 0.680

f2 0.680 0.684 0.784 0.768 0.784 0.793 0.783 0.801 0.782

f3 0.290 0.242 0.342 0.325 0.361 0.310 0.283 0.296 0.511

f4 0.404 0.456 0.241 0.309 0.305 0.256 0.264 0.270 0.284

f5 0.368 0.324 0.504 0.463 0.416 0.467 0.417 0.383 0.455

f6 0.156 0.098 0.418 0.406 0.346 0.462 0.291 0.303 0.439

f7 0.273 0.250 0.266 0.251 0.162 0.260 0.183 0.154 0.258

f8 0.457 0.459 0.487 0.536 0.467 0.511 0.510 0.493 0.550

f9 0.592 0.617 0.351 0.512 0.439 0.562 0.760 0.734 0.478

f10 0.809 0.915 0.641 0.662 0.699 0.698 0.873 0.908 0.676

f11 0.671 0.658 0.420 0.490 0.577 0.456 0.466 0.485 0.475

f12 0.506 0.479 0.653 0.661 0.686 0.609 0.426 0.463 0.727

f13 0.523 0.357 0.559 0.541 0.501 0.511 0.437 0.409 0.525

f14 0.451 0.477 0.390 0.453 0.337 0.453 0.450 0.462 0.396

f15 0.593 0.614 0.410 0.418 0.491 0.425 0.467 0.483 0.482

Table 2

Distance between each feature vector extracted from the example syllables and the 30 representative feature vectors with the minimum distance highlighted

SC S1 S2 S3 S4 S5 S6 S7 S8 S9

1 0.7415 0.8188 0.6039 0.5936 0.6587 0.5493 0.7214 0.7292 0.6717

2 0.8217 0.9170 0.6944 0.6371 0.7520 0.6797 0.8420 0.8740 0.7215

3 0.7595 0.8941 0.8354 0.8194 0.8182 0.8604 0.9300 0.9503 0.8349

4 0.6782 0.7808 0.7048 0.6665 0.7097 0.6718 0.7571 0.7804 0.7471

5 0.6993 0.7981 0.6162 0.5661 0.6487 0.5693 0.7071 0.7289 0.5668

6 0.8232 0.8872 0.6899 0.6597 0.7372 0.6684 0.8217 0.8341 0.6709

7 0.3099 0.3917 0.4410 0.2813 0.2748 0.2685 0.2748 0.2622 0.3777

8 0.9130 1.0213 0.6975 0.6980 0.7500 0.6806 0.8556 0.8842 0.6811

9 0.8558 0.9303 0.6713 0.6936 0.7061 0.6909 0.8221 0.8495 0.7285

10 0.9542 1.0621 0.7159 0.7853 0.7150 0.7850 0.9780 0.9661 0.7123

11 0.7517 0.8106 0.5296 0.5299 0.6268 0.5263 0.7261 0.7387 0.6011

12 1.0211 1.1240 0.8332 0.8651 0.9174 0.8173 0.9882 1.0073 0.9110

13 0.6558 0.7210 0.6242 0.5835 0.6120 0.5959 0.6438 0.6716 0.6317

14 0.6947 0.8062 0.6483 0.6258 0.6534 0.6830 0.8248 0.8373 0.6077

15 0.5267 0.5939 0.6179 0.5592 0.5456 0.6100 0.7183 0.7158 0.5603

16 0.7915 0.8734 0.7215 0.6223 0.7144 0.6161 0.7501 0.7792 0.6388

17 0.6448 0.7294 0.6343 0.5876 0.6302 0.6157 0.7493 0.7423 0.5883

18 0.6568 0.7204 0.5167 0.5368 0.5833 0.5677 0.7192 0.7357 0.6165

19 1.2702 1.3241 1.3281 1.3180 1.3344 1.3571 1.4431 1.4364 1.2818

20 1.1940 1.2704 1.2935 1.3012 1.3162 1.3569 1.4272 1.4301 1.2355

21 0.6753 0.7434 0.4479 0.4495 0.5182 0.4171 0.5986 0.6129 0.5046

22 0.8649 0.9368 0.7933 0.7902 0.8599 0.8153 0.9374 0.9430 0.7289

23 0.7865 0.8738 0.7021 0.6580 0.6753 0.6612 0.7866 0.7947 0.5952

24 1.0395 1.1656 0.7796 0.8132 0.9295 0.7812 0.9507 0.9850 0.8778

25 0.7724 0.8922 0.7335 0.6541 0.7303 0.6930 0.8256 0.8609 0.6843

26 0.8508 0.9501 0.6928 0.6290 0.7261 0.6534 0.8123 0.8423 0.6436

27 1.0673 1.0742 0.9840 0.8746 1.0122 0.8696 1.0030 0.9894 0.9164

28 0.9436 1.0374 0.7056 0.7272 0.7848 0.7024 0.8491 0.8737 0.7178

29 0.7372 0.8596 0.5467 0.5054 0.6029 0.5086 0.7217 0.7424 0.4600

30 0.9759 1.0893 0.8906 0.8582 0.9431 0.8531 0.9991 1.0190 0.7676

