## 國

## 立

## 交

## 通

## 大

## 學

### 資訊科學與工程研究所

### 博

### 士

### 論

### 文

### 機率式模型分群法之研究與其應用

### Probabilistic Model-based Clustering and Its Applications

### 研 究 生：鄭士賢

### 指導教授：傅心家 教授

### 王新民 博士

### 機率式模型分群法之研究與其應用

### Probabilistic Model-based Clustering and Its Applications

### 研 究 生：鄭士賢 Student：Shih-Sian Cheng

### 指導教授：傅心家 教授 Advisor：Prof. Hsin-Chia Fu

### 王新民 博士

### Dr. Hsin-Min Wang

### 國 立 交 通 大 學

### 資 訊 科 學 與 工 程 研 究 所

### 博 士 論 文

A DissertationSubmitted to Department of Computer Science College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Computer Science

May 2009

Hsinchu, Taiwan, Republic of China

### 機率式模型分群法之研究與其應用

### 學生：鄭士賢

### 指導教授

：### 傅心家 教授

### 王新民 博士

### 國立交通大學資訊科學與工程研究所

### 摘

### 要

機率式模型分群法藉由學習一個有限混合模型(finite mixture model)而達成資料分群的 目的，其已經有很多成功的應用; 而最常用的混合模型乃是高斯混合模型(Gaussian mixture model, i.e., GMM)。當其目的函數的最佳化無法用分析方法來達成時，我們經 常用迭代式、局部最佳(local-optimal)的演算法來學習混合模型。此外，因為來自於混 合模型的資料樣本有非完全(incompleteness)的特性，因此我們通常採用 EM 形式 (EM-type)的演算法，如 EM 演算法與分類 EM 演算法(classification EM, i.e., CEM)。然 而，用傳統 EM 形式演算法來做模型分群有幾個缺點: 一、它的效能與模型初始值有 高度相關性; 二、混合元件(mixture component)的個數需要事先給定; 三、在完成分群 之後，資料樣本與群聚之間的拓樸關係(topological relationships)無法被保留。在本論 文中，針對上述前兩樣缺點，我們提出一個自我分裂之高斯混合模型學習演算法 (self-splitting Gaussian mixture learning, i.e., SGML)。此法起始於單一元件，然後迭代 式地，用貝氏資訊法則(Bayesian information criterion, i.e., BIC)來確認分裂的有效性， 分裂某一個元件而成為兩個新元件直到最佳的元件個數被找到為止，最佳的元件個數 亦是用 BIC 來決定。基於此分裂程序，SGML 可為學習不同模型複雜度的 GMM 提供 不錯的模型初始值。關於拓樸保留(topology-preserving)的議題，我們將一個機率式自 我組織圖(probabilistic self-organizing map, i.e., PbSOM)的學習視為一種可保留資料樣 本與群聚之間拓樸關係於一個神經網路(neural network)的模型分群法。基於此概念， 我們為 PbSOM 發展了一個耦合概似混合模型(coupling-likelihood mixture model)，其延 伸 Kohonen SOM 的參考向量(reference vectors)至多變量高斯模型。基於最大概似法則

(maximum likelihood criterion)，我們亦發展了三個學習 PbSOM 的 EM 形式演算法，亦 即 SOCEM、 SOEM、以及 SODAEM 演算法。SODAEM 是 SOCEM 與 SOEM 的決定 性 退 火 (deterministic annealing, i.e., DA) 變 形 ; 此 外 ， 藉 由 逐 漸 縮 小 鄰 域 大 小 (neighborhood size)，SOCEM 與 SOEM 可分別被解釋成: 高斯模型分群的 CEM 與 EM 演算法的 DA 變形。實驗結果顯示，我們所提出的 PbSOM 演算法與決定性 EM(DAEM) 演算法有相近的分群效能，並能維持拓樸保留之特性。關於應用方面，我們用 SGML 來訓練用於語者識別(speaker identification)的語者 GMMs;實驗結果顯示，SGML 可以 自動地為個別語者 GMM 決定適當的模型複雜度，而此在文獻裡通常是用經驗法則來 決定的。我們將所提出的 PbSOM 學習演算法應用於資料視覺化與分析;實驗結果顯示 它們可用於在一個二維的網路上探索資料樣本與群聚之間的拓樸關係。此外，我們提 出幾種分割且克服(divide-and-conquer)的方法來做音訊分段(audio segmentation)，其用 BIC 來評定兩個音段之間的不相似度(dissimilarity);在廣播新聞的實驗結果顯示，相較 於公認最佳的視窗成長分段法(window-growing-based segmentation)，我們的方法不僅 有比較低的計算量，亦有較高的分段正確性。

### Probabilistic

### Model-based Clustering and Its Applications

### Student：Shih-Sian Cheng

### Advisors：Prof. Hsin-Chia Fu

### Dr. Hsin-Min Wang

### Department of Computer Science

### National Chiao Tung University

### ABSTRACT

Probabilistic model-based clustering is achieved by learning a finite mixture model (usually a Gaussian mixture model, i.e., GMM) and has been successfully applied to many tasks. When the objective likelihood function cannot be optimized analytically, iterative, locally-optimal learning algorithms are usually applied to learn the model. Moreover, because the data samples drawn from a mixture model have the property of incompleteness, EM-type algorithms, such as EM and classification EM (CEM), are usually employed. However, model-based clustering based on conventional EM-type algorithms suffer from the following shortcomings: 1) the performance is sensitive to the model initialization; 2) the number of mixture components (data clusters) needs to be pre-defined; and 3) the topological relationships between data samples and clusters are not preserved after the learning process. In this thesis, we propose a self-splitting Gaussian mixture learning algorithm (SGML) to address the first two issues. The algorithm starts with one single component and iteratively splits a component into two new components using Bayesian information criterion (BIC) as the validity measure for the splitting until the most appropriate component number is found, which is also achieved by BIC. Based on the splitting process, SGML provides a decent model initialization for learning Gaussian mixture models with different model complexities. For the topology-preserving issue, we consider the learning process of a probabilistic self-organizing map (PbSOM) as a model-based clustering procedure that preserves the topological relationships between data samples and clusters in a neural network. Based on this concept, we develop a

coupling-likelihood mixture model for the PbSOM that extends the reference vectors in Kohonen's SOM to multivariate Gaussians distributions. We also derive three EM-type algorithms, called the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM) based on the maximum likelihood criterion. SODAEM is a deterministic annealing (DA) variant of SOCEM and SOEM; moreover, by shrinking the neighborhood size, SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to that of the deterministic annealing EM (DAEM) approach, while maintaining the topology-preserving property. For applications, we apply SGML to the training of speaker GMMs for the speaker identification task; the experiment results show that SGML can automatically determine an appropriate model complexity for a speaker GMM, which is usually determined empirically in the literature. We apply the proposed PbSOM algorithms to data visualization and analysis; the experiment results show that they can be used to explore the topological relationships between data samples and clusters on a two-dimensional network. In addition, we propose divide-and-conquer approaches for the audio segmentation task using BIC to evaluate the dissimilarity between two audio segments; the experiment results on the broad-cast news data demonstrate that our approaches not only have a lower computation cost but also achieve higher segmentation accuracy than the leading window-growing-based segmentation approach.

### ACKNOWLEDGEMENTS

I am grateful to my advisor, Prof. Hsin-Chia Fu and Dr. Hsin-Min Wang, for their intensive suggestions, patient guidance, and enthusiasm of research. Moreover, I would like to thank all the members of the Neural Network Multimedia Laboratory at Department of Computer Science, National Chiao Tung University, and the Spoken Language Group, Chinese Information Processing Laboratory at Institute of Information Science, Academia Sinica, for their valuable discussions. Finally, I would like to express my appreciation to my family for their supporting and encouragement. I dedicate this dissertation to my parents.

### Contents

1 Introduction 1

1.1 Background . . . 1

1.2 Contributions of this dissertation . . . 5

1.2.1 On model initialization and complexity . . . 5

1.2.2 On topology-preserving ability . . . 5

1.2.3 On applications . . . 6

1.3 Organization of this dissertation . . . 6

2 Preliminaries and related works 7 2.1 K-means clustering . . . 7

2.2 Hierarchical agglomerative clustering . . . 8

2.3 Self-organizing map . . . 9

2.4 Probabilistic model-based clustering . . . 10

2.4.1 The mixture likelihood approach . . . 10

2.4.1.1 The EM algorithm for mixture models . . . 11

2.4.1.2 The EM algorithm for Gaussian mixture models . . . 11

2.4.1.3 The Bayesian approach for model selection and BIC . . . 12

2.4.2 The classification likelihood approach . . . 13

2.4.2.1 The CEM algorithm for mixture models . . . 13

2.4.2.2 The CEM algorithm for Gaussian mixture models . . . 14

2.4.2.3 AWE and ICL for model selection . . . 14

2.4.3 The DAEM algorithm . . . 15

2.4.3.1 The DAEM algorithm for Gaussian mixture models . . . . 16

3 BIC-based self-splitting Gaussian mixture learning 17 3.1 The SGML algorithm . . . 17

3.1.1 Computational complexity of SGML . . . 18

3.2 The fastSGML algorithm . . . 19

3.3 Experiments on Gaussian mixture learning . . . 19

3.3.1 Results on the synthetic data . . . 20

3.4 Application of SGML to GMM-based speaker identification . . . 24

3.4.1 The GMM-based speaker identification . . . 24

3.4.2 Experiments . . . 26

3.4.2.1 Database description and feature extraction . . . 26

3.4.2.2 Configuration of model training and test utterance . . . . 27

3.4.2.3 Results . . . 27

4 Model-based clustering by probabilistic self-organizing maps 34 4.1 Formulation of the coupling-likelihood mixture model for PbSOMs . . . 34

4.2 The SOCEM algorithm for learning PbSOMs . . . 36

4.2.1 SOCEM - a DA variant of CEM for GMM . . . 37

4.2.2 Relation to Kohonen’s batch algorithm . . . 39

4.2.3 Computational cost . . . 40

4.3 The SOEM algorithm for learning PbSOMs . . . 41

4.3.1 SOEM - a DA variant of EM for GMM . . . 42

4.3.2 Computational cost . . . 42

4.4 The SODAEM algorithm for learning PbSOMs . . . 43

4.4.1 Computational cost . . . 44

4.5 Relation to other algorithms . . . 44

4.5.1 For SOCEM . . . 45

4.5.2 For SOEM and SODAEM . . . 46

4.6 Experiments on organizing property and data clustering . . . 46

4.6.1 Experiments on organizing property . . . 46

4.6.1.1 Results on the synthetic data . . . 47

4.6.1.2 Results on PenRecDigits C0 . . . 49

4.6.2 Experiments to evaluate the performance of data clustering . . . 50

4.6.2.1 Results on ImgSeg by using SOCEM and SODAEM C . . 51

4.6.2.2 Results on ImgSeg by using SOEM and SODAEM M . . . 51

4.6.2.3 Results on Ecoli . . . 52

4.7 Application of SOCEM, SOEM, and SODAEM to data visualization and analysis . . . 52

4.7.1 Experiment results on ImgSeg by using SOCEM and SODAEM C . 53 4.7.2 Experiment results on ImgSeg by using SOEM and SODAEM M . . 54

4.7.3 Experiment results on Ecoli . . . 54

5 BIC-based audio segmentation using divide-and-conquer 69 5.1 Window-growing-based segmentation . . . 71

5.1.1 *∆BIC as the distance measure of two audio segments . . . 71*

5.1.2 One-change-point detection . . . 72

5.2 Divide-and-Conquer-based segmentation . . . 73

5.2.1 The DACDec1 approach . . . 73

5.2.2 The DACDec2 approach . . . 75

5.2.3 Sequential segmentation by DACDec1 and DACDec2 . . . 77

5.3 Computational cost analysis . . . 78

5.3.1 For DACDec1 . . . 79 5.3.2 For DACDec2 . . . 80 5.3.3 For FixSlid . . . 81 5.3.4 For WinGrow . . . 81 5.3.5 Discussion . . . 82 5.4 Experiments . . . 83

5.4.1 Experiments on the synthetic data . . . 83

5.4.2 Experiments on broadcast news data . . . 84

6 Conclusion and future work 89 6.1 Conclusion . . . 89

6.2 Future work . . . 90

A 92 A.1 The SOCEM, SOEM, and SODAEM algorithms where mixture weights are learned . . . 92

B 95
*B.1 Compute T*1*(k) . . . 95*

*B.2 Compute T0 _{(m) . . . 96}*

### List of Figures

*2.1 (a) {A, B, · · ·, G} are data samples in <*2_{. (b) An illustrative dendrogram}
of the data samples in (a) obtained by hierarchical agglomerative clustering. 8
3.1 The learning process of SGML on the synthetic data, where full covariance

GMMs are used. . . 22 3.2 The learning curves of SGML and the baseline approaches on the synthetic

data, where full covariance GMMs are used. . . 23 3.3 The learning curve of SGML on the synthetic data, where diagonal

covari-ance GMMs are used. . . 23 3.4 The learning curves of applying the various GMM learning methods to the

15-second speech data of #5007 in NIST01SpkEval. (a) and (b) depicts the full covariance case, while (c) and (d) depicts the diagonal covariance case. . . 25 3.5 The learning curves of applying fastSGML, K-means-BinSplitting and SGML

to learn diagonal covariance GMMs with 60-second speech data of #5007
*in NIST01SpkEval. The splitting confidence of the fastSGML was set at*
150, 100, and 50, respectively, and K-means-BinSplitting was forced to stop
*at GMM*43. . . 26
4.1 (a) The network structure of a Gaussian mixture model, and (b) the

*pro-posed coupling-likelihood mixture model. Here, rl*(x*i; θl*) denotes the

mul-tivariate Gaussian distribution described in Eq. (2.12). . . 55 4.2 SOCEM’s objective function becomes more complex with the reduction of

*neighborhood size (σ in hkl*). . . 56

4.3 For each data sample x*i*, the adaptation of the reference models in SOCEM

is restricted to the winning reference model and its neighborhood. However,
in SOEM, the winner is relaxed to the weighted winners by the posterior
*probabilities γ _{k|i}(t), for k = 1, 2, · · · , G. Each data sample xi* contributes

pro-portionally to the adaptation of each reference model and its neighborhood according to the posterior probabilities. . . 57

4.4 The family of Gaussian model-based clustering algorithms derived from the

*SODAEM, SOEM and SOCEM algorithms. δkl* *= 1 if k = l; otherwise,*

*δkl* = 0. . . 57

4.5 The map-learning process obtained by running the SOCEM algorithm on
the synthetic data. Simulation 1 ((a)-(b)): When SOCEM is run with the
*random initialization in (a) and σ = 0.15, it converges to the unordered*
*map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6*
*and the random initialization in (a). Then, the value of σ is reduced to*
0.15 in 0.15 decrements. . . 58
4.6 The map-learning process obtained by running the SOEM algorithm on

the synthetic data. Simulation 1 ((a)-(b)): When SOEM is run with the
*random initialization in (a) and σ = 0.15, it converges to the unordered*
*map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and*
*the random initialization in (a). Then, the value of σ is reduced to 0.15 in*
0.15 decrements. . . 59
4.7 The map-learning process obtained by running the SODAEM algorithm

*on the synthetic data. The value of σ is fixed at 0.15, while value of β is*
initialized at 0.16 and increased in multiples of 1.6 up to 17.592. . . 60
4.8 The map-learning process obtained by running the SOCEM algorithm on

PenRecDigits C0. Simulation 1 ((a)-(b)): When SOCEM is run with the
*random initialization in (a) and σ = 0.15, it converges to the unordered*
*map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6*
*and the random initialization in (a). Then, the value of σ is reduced to*
0.15 in 0.15 decrements. . . 61
4.9 The map-learning process obtained by running the SOEM algorithm on

PenRecDigits C0. Simulation 1 ((a)-(b)): When SOEM is run with the
*random initialization in (a) and σ = 0.15, it converges to the unordered*
*map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and*
*the random initialization in (a). Then, the value of σ is reduced to 0.15 in*
0.15 decrements. . . 62
4.10 The map-learning process obtained by running the SODAEM algorithm

*on PenRecDigits C0. The value of σ is fixed at 0.15, while value of β is*
initialized at 0.16 and increased in multiples of 1.6 up to 17.592. . . 63
4.11 The data clustering performance of CEM, DAEM C, SOCEM, SODAEM C,

and KohonenGaussian on ImgSeg in terms of the classification log-likelihood. 64 4.12 Learning a Gaussian mixture model by applying EM, DAEM M, SOEM,

and SODAEM M to ImgSeg. . . 64 4.13 The data clustering performance on Ecoli in terms of (a) the classification

4.14 Data visualization for ImgSeg by running KohonenGaussian ((b)), SOCEM
((c), (d)), and SODAEM C ((e), (f)) with the random initialization in (a).
*The network structure is a 7 × 7 equally spaced square lattice in a unit*
square. . . 66
4.15 Data visualization for ImgSeg by running SOEM ((a), (b)) and SODAEM M

((c), (d)) with the random initialization in Figure 4.14 (a). The network

*structure is a 7 × 7 equally spaced square lattice in a unit square.* . . . . 67

4.16 Data visualization for Ecoli by running (b) KohonenGaussian, (c) SOCEM,
(d) SOEM, (e) SODAEM C, and (f) SODAEM M with the random
*initial-ization in (a). The network structure is a 7 × 7 equally spaced square*
lattice in a unit square. . . 68
5.1 The fixed-size sliding window detection approach. . . 71
5.2 Diagram of the multiple-change-point detection in window-growing-based

segmentation (WinGrow). The audio stream contains three segments,
*namely Seg1, Seg2, and Seg3; P and Q denote the change points. . . 73*
5.3 (a) An audio stream comprised of three speech segments, each derived from

*a distinct speaker. C*1 *and C*2 *are the change points. (b) The ∆BIC curve*

obtained by applying OCD-Chen to the audio stream in (a). . . 74 5.4 An illustration that data samples distribute as three Gaussian clusters.

*For this case, generally, two Gaussians (H*1) fit the distribution of the

*data better than one Gaussian (H*0) if the samples belonging to the same

Gaussian cluster are used together to estimate the parameters. . . 76 5.5 (a) An audio stream comprised of three speech segments; the first and third

segments are derived from the same speaker (Speaker1), while the second is
*derived from another speaker (Speaker2). (b) The ∆BIC curve obtained*
by applying OCD-Chen to the audio stream in (a). (c) The diagram of

*the hypothesis test at the change point C*2 in (b). (d) The diagram of the

*hypothesis test at the non-change point R in (b). . . 77*
5.6 A recursive tree that simulates the recursive process of DACDec2 on the

audio stream in Figure 5.5 (a). . . 79 5.7 Diagram of the detection process of SeqDACDec1 and SeqDACDec2. If a

change point is detected in the fixed-size analysis window by DACDec1 or
DACDec2, the window is moved to the change point with the largest time
*index. Otherwise, it is moved forward by ηL samples, where L denotes the*
*window size, and η > 0. . . 80*
*5.8 An audio stream comprised of k+1 homogeneous segments, each containing*

5.9 ROC curves obtained by running SeqDACDec1 and SeqDACDec2 on the synthetic data using 10-second, 20-second, and 30-second analysis windows.

*L denotes the size of the analysis window. . . 84*

5.10 The empirical cumulative distributions of the size of homogeneous segments in MATBN3hr and HUB4-98. . . 85 5.11 A significant local maximum on the distance curve. . . 86

*5.12 The ROC curves for MATBN3hr obtained by (a) SeqDACDec1 with Nmin* =

2 seconds and analysis windows of different size; (b) SeqDACDec2 with

*Nmin* = 2 seconds and analysis windows of different size; and (c)

*Seq-DACDec1 with Nmin* *= 2 seconds and L = 20 seconds, SeqDACDec2 with*

*Nmin* *= 2 seconds and L = 20 seconds, WinGrow with Nmin* = 3 seconds

*and Nmax* = 20 seconds, and FixSlid with a 2-second window. . . 87

5.13 The ROC curves for HUB4-98. . . 88 A.1 The map-learning process obtained by running the SOCEM algorithm on

the synthetic data with an ordered initialization in (a). Simulation 1 ((a)-(e)): The mixture weights are initialized at 1

16, and updated in the learning
process; the algorithm starts with the initialization in (a) and converges to
the unordered map in (e). Simulation 2 ((a) and (f)): SOCEM is performed
with equal mixture weights throughout the learning process; the algorithm
starts with the initialization in (a) and converges to the map in (f). The
*network structure is a 4 × 4 square lattice; the value of σ is set at 0.4. . . 94*

### List of Tables

2.1 Common inter-cluster distance measures for hierarchical agglomerative
*clus-tering. d(xi, xj*) denotes the distance measure for x*i* and x*j*. . . 9

3.1 The mean and standard deviation of the component number of the diago-nal covariance male speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation. . 29 3.2 The mean and standard deviation of the component number of the diagonal

covariance female speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation. . 29 3.3 The average CPU time (in second) of the diagonal covariance male speaker

GMMs obtained by SGML and fastSGML on different amounts of training data. . . 30 3.4 The average CPU time (in second) of the diagonal covariance female speaker

GMMs obtained by SGML and fastSGML on different amounts of training data. . . 30 3.5 Speaker identification accuracy (in %) for the male speakers. . . 30 3.6 Speaker identification accuracy (in %) for the female speakers. . . 31 4.1 The DAEM algorithm for learning GMMs with equal mixture weights and

the SOCEM algorithm. . . 40 4.2 Results of simulations using KohonenGaussian, SOCEM, SOEM, and

SO-DAEM in 20 independent random initialization trials on the synthetic data.

*The algorithms were run with two setups for σ in hkl. When σ = 0.15, *

Ko-honenGaussian succeeded in converging to an ordered map in one random initialization case (S:1), but failed in the remaining cases (F:19). . . 49 4.3 Results of simulations using KohonenGaussian, SOCEM, SOEM, and

SO-DAEM in 20 independent random initialization trials on PenRecDigits C0.

*The algorithms were run with two setups for σ in hkl. When σ = 0.15, *

Ko-honenGaussian succeeded in converging to an ordered map in one random initialization case (S:1), but failed in the remaining cases (F:19). . . 50

5.1 The CPU time of different audio segmentation approaches evaluated on MATBN3hr in the EER case and the associated EERs, where M and F denote the miss detection rate and the false alarm rate, respectively. . . 86 5.2 The CPU time of different audio segmentation approaches evaluated on

HUB4-98 in the EER case and the associated EERs. . . 87 A.1 The mixture weights learned by SOCEM with the initialization in Figure

A.1 (a). The mixture weights are initialized at 1

### Chapter 1

### Introduction

### 1.1

### Background

The goal of data clustering is to group data samples into clusters such that samples from
the same cluster are more similar to each other than samples from different clusters. It is
also known as unsupervised learning, numerical taxonomy, typological analysis and vector
quantization [1, 2]. Among the various clustering approaches, probabilistic model-based
clustering is the one derived from statistical learning; and it has been successfully applied
to many tasks, for example, speaker recognition [3, 4], speech recognition [5], handwritten
recognition [6], image segmentation [7], and clustering of microarray expression data [8, 9].
In model-based clustering, data samples are grouped by learning a finite mixture
model (usually a Gaussian mixture model, i.e., GMM), in which each mixture component
represents a cluster. There are two major learning methods for model-based clustering:
*the mixture likelihood approach, where the likelihood of each data sample is a mixture of all*
*the component likelihoods of the data sample; and the classification likelihood approach,*
where the likelihood of each data sample is generated by its winning component only [10,
11, 12, 13, 14, 15, 16, 17, 18]. In both approaches, when the globally optimal estimation of
the model parameters cannot be obtained analytically, iterative learning algorithms that
only guarantee obtaining locally optimal solutions are usually employed. The
expectation-maximization (EM) algorithm for mixture likelihood learning [19, 20, 21, 22] and the
classification EM (CEM) algorithm for classification likelihood learning [17] are two such
algorithms and have been the dominant approaches in this task. The conventional EM
or CEM-based model-based clustering has three critical aspects, namely the initialization

of model parameters, the model complexity1_{, and the topology-preserving ability. These}

aspects are discussed as follows.

*• For model initialization: The learning performance of EM and CEM are very *

sen-sitive to the initial conditions of the model’s parameters. To address this issue, the

authors in [17] proposed a simulated annealing implementation for CEM, which
re-duces the initial-position dependence based on random perturbations. Rather than
applying the randomization power of simulated annealing, the authors in [23]
*pro-posed a deterministic annealing EM (DAEM) algorithm that tackles the *
initializa-tion issue via a deterministic annealing process. DAEM was originally derived as a
DA variant of EM; however, as shown in Section 2.4.3, it is also a DA variant of CEM.
Although DAEM has been reported achieving decent performance, there is still no
guarantee that it finds the globally optimal model parameters because, like EM, it
is a iterative, single-token search scheme. In order to search the parameter space in
multiple pathes, the authors in [24, 25] and [26, 27] proposed multi-thread search
strategies when employing EM and DAEM, respectively, where the search pathes are
adapted in the search process according to the eigen-decomposition of the Hessian
matrix of the target log-likelihood function. As another kind of multiple-path search,
a GA-EM algorithm was proposed in [28], where EM learning is integrated into a
Genetic search procedure. It is less sensitive to the initialization because of the
sto-chastic search nature of the Genetic Algorithms (GA). Some heuristic-like learning
algorithms have also been proposed. For example, an initialization approach for EM
which is based on subsampling was presented in [29]. The authors in [30] proposed
an SMEM algorithm that finds the appropriate initial conditions for EM learning
by using split and merge operations. Similarly, Young and Woodland [31] proposed
a component-splitting approach to learn a GMM, which iteratively splits the mean
vector of the Gaussian component with the largest weight into two new ones, and
then performs EM to update all the Gaussian components. Another method based
on component splitting is presented in [32]. In [33], the authors suggested a simple
way that one can perform several short runs of EM (by early stopping) first, then
select the best model from the results and use them as the initial model for the long
run (standard) EM learning. As a common and simple way, one can apply K-means
clustering or hierarchical agglomerative clustering (HAC) to locate the initial mean
vectors of Gaussian components for the EM learning [1, 34, 35].

Note that all the approaches mentioned above are performed with a given target number of mixture components.

*• For model complexity: Assessment of the number of mixture components (data*

clusters) is an important issue in model-based clustering. The mixture model would over-fit the data if it contains too many mixture components; in contrast, it would not be flexible enough to describe the structure of the data if the number of com-ponents is too small. Various approaches have been proposed to address this issue. In [36], Furman and Lindsay developed two hypothesis test procedures based on the moment estimators to assess the number of components for a GMM. As

an-other hypothesis test-based approach, McLachlan and Khan estimated the number of components by likelihood ratio test, where the re-sampling process is applied to assess its null hypothesis [37]. In [38], the authors estimated the mixture complex-ity by comparing an information theoretic-derived nonparametric estimator with the best parametric fit of a given complexity. As another information theoretic-based approach, a maximum entropy-based approach with a modified EM algorithm was proposed to assess the model complexity of GMM in [39]. Moreover, model se-lection criteria, also known as penalized-likelihood criteria, have been proposed to assess the model complexity; for example, Akaike’s Information Criterion (AIC) [40], Bayesian Information Criterion (BIC) [41, 10, 42], Integrated Completed Like-lihood (ICL) [43], Approximate Weight Evidence (AWE) [18], Minimum Description Length (MDL) [44] (which is formally identical to BIC), and Minimum Message Length (MML) [45]. BIC is derived on the basis of mixture likelihood, ICL and AWE are derived on the basis of classification likelihood, and AIC, MDL and MML are information-theoretic-derived criteria. A common way to applying these criteria

*to assess model complexity is that defining the upper bound, Gmax*, of the

compo-nent number of candidate models first, and then choosing the one with the best score calculated using the employed criterion as the best model. For example, when using BIC, the best component number is

ˆ

*G = arg max*

*G* *{2 log p(X ; ˆ*Θ*G) − P enalty(G)|G = 1, 2, . . . , Gmax},* (1.1)

where ˆΘ*G* is the maximum likelihood estimate of parameters of the mixture model

*with G components, P enalty(G) is a monotonically increasing function of G that*
penalizes more for a more complex model [10]. However, there are two potential
drawbacks with this model-selection-based approach. First, it needs to define the

*upper bound of the component number, Gmax*, beforehand. On the one hand, if

*Gmax* is too large (much larger than the best component number determined by

the model selection criterion), the learning process will waste a lot of computation
*time. On the other hand, if Gmax* is too small, the selected model may be not flexible

enough to describe the structure of the data. Second, ˆΘ*G* is usually obtained by

EM, whose performance is highly dependent on the model initialization.

Rather than assessing the model complexity by incrementally adding mixture com-ponents during the learning process, as discussed above, the variational Bayesian framework in [46, 47] automatically determines the number of components by setting a larger component number initially and then suppressing unwanted components.

*• For topology-preserving ability: Conventional model-based clustering cannot*

clustering procedure. To overcome this shortcoming, the clustering task can be per-formed by using Kohonen’s self-organizing map (SOM) [48, 49]. The SOM, rather than being a supervised neural network model for pattern recognition, is an unsuper-vised model for data clustering and visualization. After SOM’s clustering procedure, the topological relationships among data samples and clusters can be preserved (or visualized) on the network, which is usually a two dimensional lattice. Kohonen’s sequential and batch SOM learning algorithms have proved successful in many prac-tical applications [48, 49]. However, they also suffer from some shortcomings, such as the lack of an objective (cost) function, a general proof of convergence, and a prob-abilistic framework [50]. Some related works that have addressed these issues are as follows. In [51, 52], the behavior of Kohonen’s sequential learning algorithm was studied in terms of energy functions, based on which, Cheng [53] proposed an energy function for SOM whose parameters can be learned by a K-means type algorithm. Luttrell [54, 55] proposed a noisy vector quantization model called the topographic vector quantizer (TVQ), whose training process coincides with the learning process of SOM. The cost function of TVQ represents the topographic distortion between the input data and the output code vectors in terms of Euclidean distance. Graepel

*et al. [56, 57] derived a soft topographic vector quantization (STVQ) algorithm*

by applying a deterministic annealing process to the optimization of TVQ’s cost
function. Based on the topographic distortion concept, Heskes [58] applied a
dif-ferent DA implementation from that of STVQ, and obtained an algorithm identical
to STVQ when the quantization error is expressed in terms of Euclidean distance.
In [59], Chow and Wu proposed an on-line algorithm for STVQ; later, motivated
by STVQ, they proposed a data visualization method that integrates SOM and
multi-dimensional scaling [60]. Based on the Bayesian analysis of SOMs in [61],
*Anouar et al. [62] proposed a probabilistic formalism for SOM, where the *
para-meters are learned by a K-means type algorithm. To help users select the correct
model complexity for SOM by probabilistic assessment, Lampinen and Kostiainen
[63] developed a generative model in which the SOM is trained by Kohonen’s
algo-rithm. Meanwhile, Van Hulle [64] developed a kernel-based topographic formation
in which the parameters are adjusted to maximize the joint entropy of the kernel
outputs. He subsequently developed a new algorithm with heteroscedastic Gaussian
mixtures that allows for a unified account of vector quantization, log-likelihood, and
Kullback-Leibler divergence [65]. Another probabilistic formulation is proposed in
[66], whereby a normalized neighborhood function of SOM is used as the posterior
*distribution in the E-step of the EM algorithm for a mixture model to enforce the*
*self-organizing of the mixture components. Sum et al. [67] interpreted Kohonen’s*
sequential learning algorithm in terms of maximizing the local correlations
(cou-pling energies) between neurons and their neighborhoods for the given input data.

They then proposed an energy function for SOM that reveals the correlations, and a gradient ascent learning algorithm for the energy function.

### 1.2

### Contributions of this dissertation

### 1.2.1

### On model initialization and complexity

This thesis proposes a BIC-based self-splitting Gaussian mixture learning (SGML) algo-rithm that starts with a single component and iteratively splits a component into two new components until the most appropriate component number is found. The proposed algorithm has several advantages as follows. 1) SGML provides a better initialization for EM. For the Gaussian mixture learning process whose initialization is based on a data clustering process like K-mens or HAC, the clustering phase may results in an poor initial mixture model with too many components in one part of the space and too few in another widely-separated part of the space; in this case, the following EM phase may fail to escape from this configuration and fall into an poor local maximum. In the splitting process of SGML, however, BIC is used to determine which part of the space should be divided into two parts. In this way, the ill initialization situation can be avoided to some extent and, thus, a better estimation for model parameters can be obtained. 2) SGML automati-cally determines the appropriate component number without the need to define the upper bound of the component number beforehand. It stops when a “significant maximum” in the learning curve (i.e., the BIC plot) is found, and then outputs the model yielding the maximum BIC valve. 3) SGML is deterministic; its output is always the same in different runs on the same data set since it does not contain any randomization procedure in its learning rules.

### 1.2.2

### On topology-preserving ability

In Kohonen’s SOM architecture, neurons in the network associate with reference vectors
in the data space. This contrasts with a SOM whose neurons associate with reference
models that present probability distributions, such as the isotropic Gaussians used in [66]
and the heteroscedastic Gaussians used in [62, 65]. In this thesis, the latter is called a
*probabilistic SOM (PbSOM). Motivated by the coupling energy concept in Sum et al.’s*
work [67], a coupling-likelihood mixture model for the PbSOM that uses multivariate
Gaussian distributions as the reference models is developed. In the proposed model,
local coupling energies between neurons and their neighborhoods are expressed in terms
of probabilistic likelihoods; and each mixture component expresses the local
coupling-likelihood between one neuron and its neighborhood. Based on this model, we develop
CEM, EM, and DAEM algorithms for learning PbSOMs, namely the SOCEM, SOEM, and
SODAEM algorithms, respectively. SODAEM is a DA variant of SOCEM and SOEM.

Moreover, we show that SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering, where the neighborhood shrinking is interpreted as an annealing process. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to the DAEM algorithm, while maintaining the topology-preserving property.

### 1.2.3

### On applications

In this thesis, we apply SGML to the training of speaker GMMs for the speaker identi-fication task. The experiment results on NIST 2001 speaker recognition evaluation [68] show that SGML can automatically determine the appropriate model complexity for a speaker GMM, which is usually empirically determined in the literature [4, 69]. The proposed PbSOM algorithms are applied to data visualization and analysis. The exper-iment results on UCI data sets show that we can explore the topological relationships between data samples and clusters on a two-dimensional network. Moreover, we propose divide-and-conquer approaches for the audio segmentation task using BIC to evaluate the dissimilarity between two audio segments. The experiment results on the broadcast news data demonstrate that our approaches not only have a lower computation cost but also achieve a higher segmentation accuracy than the leading window-growing-based segmen-tation approach [70, 71].

### 1.3

### Organization of this dissertation

To help readers understand the content of this thesis, we briefly review K-means clustering, hierarchical agglomerative clustering, Kohonen’s SOM, and model-based clustering in Chapter 2. In Chapter 3, we describe the SGML algorithm for learning a Gaussian mixture model and its application to speaker identification (based on [72]). In Chapter 4, we describe the SOCEM, SOEM, and SODAEM algorithms for model-based clustering where the topological relationships between data samples and clusters can be preserved on a network and their application to data visualization and analysis (based on [73]). We present the application of BIC to audio segmentation (based on [74]) in Chapter 5. Finally, we give the conclusion and discuss our future works in Chapter 6.

### Chapter 2

### Preliminaries and related works

### 2.1

### K-means clustering

*Given a data set X = {x*1*, x*2*, · · · , xN} ⊂ <d*, data clustering can be achieved based on

*learning the vector prototypes {m*1*, m*2*, · · · , mG} ⊂ <d; after the learning process, X is*

*partitioned into clusters P = {P*1*, P*2*, · · · , PG} by classifying each data sample to the*

cluster associated with its nearest prototype. For the learning, the goal is to find the partition and the prototypes that minimize the distance function

*D(P, {m*1*, m*2*, · · · , mG}; X ) =*
*G*
X
*k=1*
X
x*i∈Pk*
*kxi− mkk*2*,* (2.1)

which can not be minimized analytically.

*K-means clustering is a popular approach to learning the prototypes in order to *

min-imize Eq. (2.1). Given the initial prototypes, it iteratively and alternatively applies a classification step and a distance minimization step on the data samples and prototypes as follows.

*• Nearest-neighbor classification: Given the current prototypes, {m(t)*_{1} *, m(t)*_{2} *, · · · , m(t) _{G}},*

assign each data sample to the cluster associated with its nearest prototype, i.e.,
x*i* *∈ ˆPj(t)* *if j = arg minkkxi− m(t)k* *k*2.

*• Distance minimization: Suppose we have the clusters { ˆP*_{1}*(t), ˆP*_{2}*(t), · · · , ˆP _{G}(t)} after the*

classification step; then, by minimizing P_{x}

*i∈ ˆP _{k}(t)kxi*

*− mkk*

2 _{with respect to m}

*k*,

we obtain the update rules for the prototypes as: m*(t+1) _{k}* = 1

*| ˆP _{k}(t)|*

P

x*i∈ ˆP _{k}(t)*x

*i*, for

*k = 1, 2, · · · , G.*

Although the K-means clustering algorithm is simple and efficient, it only guaran-tees converging to a local minimum of the distance function in Eq. (2.1). Besides, it has two more shortcomings that the performance highly depends on the initialization of

1

*x*

2
*x*

(a)
A B C D E F G
{B ,C }
{A ,B ,C }
{D ,E }
{F,G }
{D ,E ,F,G }
{A ,B ,C ,D ,E ,F,G }
P
ro
xi
m
ity
(b)
*Figure 2.1: (a) {A, B, · · ·, G} are data samples in <*2_{. (b) An illustrative dendrogram of}
the data samples in (a) obtained by hierarchical agglomerative clustering.

prototypes and the number of prototypes need to be defined beforehand. To overcome the initialization issue, the authors in [75] proposed to iteratively split each mean vec-tor into two new ones until the desired number of clusters is reached. Similarly, in [76], the authors proposed iteratively splitting the mean vector of the cluster with the largest accumulated distance between data samples and its prototype until the desired number of clusters is reached. As a simple way, one may conduct the K-means algorithm with random initialization for many runs, say 20, and then select the best one.

### 2.2

### Hierarchical agglomerative clustering

Different from K-means clustering that performs data clustering based on prototype
learn-ing, hierarchical agglomerative clustering (HAC) performs the clustering according to the
proximity between clusters. When performing HAC, each data sample is considered as a
cluster initially; then, the algorithm iteratively merges the two clusters with the smallest
*distance into a new cluster. For example, if HAC is applied to clustering {A, B,· · ·,G}*
in Figure 2.1 (a) with Euclidean distance, we may obtain a dendrogram (tree) that
rep-resents the hierarchy of the clusters in Figure 2.1 (b). As shown in the figure, B and C
are merged into a cluster first, then D and E are merged, and so on. And we can obtain
different configurations of clusters by cutting the dendrogram according to the proximity.
As for the inter-cluster distance measure, the most commonly used ones are the so-called
single-linkage, average-linkage, and complete-linkage approaches defined in Table 2.1. For
more discussions of these distance measure, reader can refer to [77].

Table 2.1: Common inter-cluster distance measures for hierarchical agglomerative
*clus-tering. d(xi, xj*) denotes the distance measure for x*i* and x*j*.

Approach *distance between clusters Pm* *and Pn*

Single-linkage minx*i∈Pm,xj∈Pnd(xi, xj*)
Average-linkage 1
*|Pm||Pn|*
P
x*i∈Pm*
P
x*j∈Pnd(xi, xj*)
Complete-linkage maxx*i∈Pm,xj∈Pnd(xi, xj*)

### 2.3

### Self-organizing map

Self-organizing map (SOM) is a neural network model for data clustering that preserves the topological relationships between data samples and clusters in a network [48, 49]. Like K-means clustering, the training of SOM is a prototype-learning process. How-ever, in addition to the interconnections between the data samples and the prototypes as in K-means clustering, SOM involves lateral interactions between prototypes via a neighborhood definition on the network to learn the topological relationships between the prototypes. Kohonen’s sequential (incremental) and batch learning algorithms have been well recognized and successfully applied to many tasks, such as data visualization [48], document processing [78, 79], image processing [80], and speech precessing [81, 82, 83].

*Kohonen’s SOM model consists of G neurons R={r*1*, r*2*, · · · , rG} on a network with*

*a neighborhood function hkl* that defines the strength of lateral interaction between two

*neurons, rk* *and rl, for k, l ∈ {1, 2, · · · , G}; and each neuron rk* associates with a vector

prototype m*k*(a reference vector) in the input data space [48, 49]. In Kohonen’s sequential

learning algorithm, the training data samples x1*, x*2*, · · · , xN* are applied sequentially. For

each data sample, say x*i*, its winning prototype m_{win}(t)

*i* such that

*win(t) _{i}* = arg min

*j* *kxi− m*

*(t)*

*j* *k,* (2.2)

is retrieved first, then the prototypes are adapted by the following rules:
m*(t+1) _{k}* = m

*(t)*

_{k}*+ α(t)h*

_{win}(t)*i* *k*(x*i− m*
*(t)*

*k* *),* (2.3)

*for k = 1, 2, · · · , G. α(t) is the learning-rate factor such that 0 < α(t) < 1; it decreases*
monotonically with the increasing of the learning iterations. The neighborhood function

*hkl* *is usually a rectangular function, i.e., hkl* *= 1 for rl* *that is in the neighborhood of rk*

*and hkl* *= 0 for rl* *that is not in the neighborhood of rk*; or a monotonically decreasing

*function of the Euclidean distance krk− rlk between rk* *and rl* on the SOM network, for

example, the widely applied unnormalized Gaussian neighborhood function

*hkl* *= exp(−*

*krk− rlk*2

Kohonen’s batch learning algorithm applies all the data samples as a whole to up-date the prototypes, rather than one by one as in the sequential learning algorithm. In each learning iteration, the winning prototype for each sample is found first; then, the prototypes are updated by

m*(t+1) _{k}* =
P

_{N}*i=1hwin(t)*

_{i}*k*x

*i*P

_{N}*i=1hwin(t)*

_{i}*k*

*,*(2.5)

*for k = 1, 2, · · · , G.*

Kohonen’s learning algorithms need to be applied with two stages. First, a large
neighborhood is applied to the algorithm to learn an ordered map near the center of
the data samples. Then, the prototypes are adapted to fit the distribution of the data
samples by gradually shrinking the neighborhood. When the neighborhood is reduce to
*zero-neighborhood, i.e., hkl* *= δkl* *where δkl* *= 1 if k = l and otherwise δkl* = 0, Kohonen’s

batch learning algorithm becomes the K-means clustering algorithm.

### 2.4

### Probabilistic model-based clustering

### 2.4.1

### The mixture likelihood approach

In the mixture likelihood approach for model-based clustering, it is assumed that the
*given data set X = {x*1*, x*2*, · · · , xN} ⊂ <d* is generated by a set of independently and

identically distributed (i.i.d.) random vectors from a mixture model:

*p(xi*; Θ) =
*G*

X

*k=1*

*w(k)p(xi; θk),* (2.6)

*where w(k) is the mixing weight of the mixture component p(xi; θk), subject to 0 ≤*

*w(k) ≤ 1 for k = 1, 2, · · · , G;* P*G*

*k=1w(k) = 1; and θk* denotes the parameter set of

*p(xi; θk*).

The maximum likelihood estimate of the parameter set of the mixture model ˆΘ =

*{ ˆw(1), ˆw(2) , · · · , ˆw(G), ˆθ*1*, ˆθ*2*, · · · , ˆθG} can be obtained by maximizing the following *

log-likelihood function:
*L(Θ; X ) = log*
*N*
Y
*i=1*
*p(xi*; Θ)
=
*N*
X
*i=1*
log(
*G*
X
*k=1*
*w(k)p(xi; θk)).* (2.7)

This is usually achieved by using the expectation-maximization (EM) algorithm [20, 22].
*After learning the mixture model, we derive a partition of X , ˆP = { ˆP*1*, ˆP*2*, · · · , ˆPG}, by*

for x*i*, i.e., x*i* *∈ ˆPj* *if j = arg maxkp(k | xi*; ˆΘ).

2.4.1.1 The EM algorithm for mixture models

If the maximum likelihood estimation of the parameters cannot be accomplished analyti-cally, the EM algorithm is normally used as an alternative approach when the given data is incomplete or contains hidden information.

In the case of the mixture model, suppose that Θ*(t)* _{denotes the current estimate of}

*the parameter set, and k is the hidden variable that indicates the mixture component*
*from which the data sample is generated. The E-step of EM algorithm then computes*
*the following so-called auxiliary function:*

*Q(Θ; Θ(t)*_{) =} X*N*
*i=1*
*G*
X
*k=1*
*p(k | xi*; Θ*(t)) log p(xi, k; Θ),* (2.8)
where
*p(xi, k; Θ) = w(k)p(xi; θk),* (2.9)
and
*p(k | xi*; Θ*(t)*) =
*w(k)(t) _{p(x}*

*i; θ(t)k*) P

_{G}*j=1w(j)(t)p(xi; θ(t)j*) (2.10)

*denotes the posterior probability of the kth mixture component for xi*with the given Θ

*(t)*.

*Then, in the following M-step, the Θ(t+1)* _{that satisfies}

*Q(Θ(t+1)*; Θ*(t)*) = max

Θ *Q(Θ; Θ*

*(t)*_{)} _{(2.11)}

is chosen as the new estimate of the parameter set. By iteratively creating the auxiliary function in Eq. (2.8) and performing the subsequent maximization step, the EM algorithm guarantees to converge to a local maximum of the log-likelihood function in Eq. (2.7).

*When Q(Θ; Θ(t) _{) can not be maximized analytically, the M-step is modified to find}*

some Θ*(t+1)* _{such that Q(Θ}(t+1)_{; Θ}*(t) _{) > Q(Θ}(t)*

_{; Θ}

*(t)*

_{). This type of algorithm, called}Generalized EM (GEM), is also guaranteed to converge to a local maximum [20, 22].

2.4.1.2 The EM algorithm for Gaussian mixture models

Suppose Eq. (2.6) is a Gaussian mixture model where

*p(xi; θk*) =
1
*(2π)d/2 _{|Σ}*

*k|1/2*

*exp(−*1 2(x

*i− µk*)

*T*

_{Σ}

*−1*

*k*(x

*i− µk)),*(2.12)

*is the multivariate Gaussian distribution and θk* *= {µk*,Σ*k} are its mean vector and*

covariance matrix. Then, each iteration of the EM algorithm for learning GMMs is summarized as follows.

*• E-step: Given the current model, Θ(t)*_{, compute the posterior probability of each}
*mixture component of p(xi*; Θ*(t)*) for each x*i* using Eq. (2.10).

*• M-step: Substituting Eq. (2.12) into Eq. (2.8) and then setting its derivative to*

zero, we obtain the re-estimation formulae of the parameters as follows.

*w(k)(t+1)*_{=} 1
*N*
*N*
X
*i=1*
*p(k | xi*; Θ*(t)),* (2.13)
*µ(t+1) _{k}* =
P

_{N}*i=1p(k | xi*; Θ

*(t)*)x

*i*P

_{N}*i=1p(k | xi*; Θ

*(t)*)

*,*(2.14) Σ

*(t+1)*= P

_{k}

_{N}*i=1p(k | xi*; Θ

*(t)*)(x

*i− µ(t+1)k*)(x

*i− µ(t+1)k*)

*T*P

_{N}*i=1p(k | xi*; Θ

*(t)*)

*.*(2.15)

2.4.1.3 The Bayesian approach for model selection and BIC

*Given the data set X and a set of candidate probability models M = {MG* *| G =*

*1, . . . , Gmax} where model MG* associates with a parameter set Θ*G*, the posterior

*proba-bility p(MG* *| X ) can be used to select the appropriate model from M to represent the*

*distribution of X . According to Bayes’ theorem, p(MG* *| X ) can be expressed as*

*p(MG* *| X ) =*

*p(MG)p(X ; MG*)

*p(X )* *,* (2.16)

*where p(MG) is the prior probability of model MG. p(X ) can be ignored because it is*

identical for all models and does not affect the selection. Moreover, it is assumed that each
*model is equally likely (i.e., p(MG) = 1/Gmax); then, p(MG* *| X ) is proportional to the*

*probability that the data conforms to the model MG, p(X ; MG*), which can be computed

by
*p(X ; MG*) =
Z
*p(X ; ΘG, MG)p(ΘG; MG)dΘG,*
=
Z
*p(X ; ΘG)p(ΘG)dΘG.* (2.17)

*The calculation of log p(X ; MG*) can be achieved by the Laplace approximation [42, 47, 84],

which gives

*log p(X ; MG) ≈ log p(X ; ˆ*Θ*G) −*

1

2*d(ΘG) log N,* (2.18)

where ˆΘ*G* is the maximum likelihood estimate of Θ*G, d(ΘG*) is the number of free

*model MG* *over the data set X , BIC(MG, X ), is defined as*1

*BIC(MG, X ) ≡ 2 log p(X ; ˆ*Θ*G) − d(ΘG) log N.* (2.19)

Accordingly, the larger the BIC value, the stronger the evidence for the model is. In other words, the model with the maximum BIC value will be selected. The BIC can be used to compare models with different parameterizations, different numbers of components, or both.

### 2.4.2

### The classification likelihood approach

In the classification likelihood approach for model-based clustering [15, 16, 17], instead of
maximizing the log-likelihood function of the mixture model in Eq. (2.7), the objective is
to find the partition ˆ*P = { ˆP*1*, ˆP*2*, · · · , ˆPG} of X and the model parameters that maximize*

*C*1*(P, {θ*1*, θ*2*, · · · , θG}; X ) =*
*G*
X
*k=1*
X
x*i∈Pk*
*log p(xi; θk),* (2.20)
or
*C*2*(P, Θ; X ) =*
*G*
X
*k=1*
X
x*i∈Pk*
*log(w(k)p(xi; θk)).* (2.21)

*The relation between C*1 *and C*2 is

*C*2*(P, Θ; X ) = C*1*(P, {θ*1*, θ*2*, · · · , θG}; X ) +*
*G*

X

*k=1*

*|Pk| log w(k),* (2.22)

If all the mixture components are equally weighted,P*G*

*k=1|Pk| log w(k) becomes a constant,*

*such that C*1 *and C*2 are equivalent.

2.4.2.1 The CEM algorithm for mixture models

Celeux and Govaert [17] proposed the Classification EM (CEM) algorithm for estimating

the partition ˆ*P and model parameters. Like the EM algorithm, CEM is also an iterative*

*learning approach. In each iteration, it inserts a classification step (C-step) between the*

*E-step and M-step of the EM algorithm. In the E-step, the posterior probability of each*

*mixture component is calculated for each data sample. In the C-step, to obtain the *
parti-tion ˆ*P of the data samples, each sample is assigned to the mixture component that yields*

*the largest posterior probability for that sample. In the M-step, the maximization process*
is applied to ˆ*Pk* *individually for k = 1, 2, · · · , G. From a practical point of view, CEM*

1_{Kass and Raftery [85] defined BIC as minus the value given in Eq. (2.19); Fraley and Raftery [10]}

used the BIC defined in Eq. (2.19) to make it easier to interpret the plot of BIC values. Here, we follow Fraley and Raftery’s BIC definition.

is a K-means type algorithm where a cluster is associated with a probability distribution rather than a vector prototype [17].

2.4.2.2 The CEM algorithm for Gaussian mixture models

Suppose the multivariate Gaussian is used as the mixture component, each iteration of the CEM algorithm for learning GMMs is summarized as follows.

*• E-step: Given the current model, Θ(t)*_{, compute the posterior probability of each}
*mixture component of p(xi*; Θ*(t)*) for each x*i* using Eq. (2.10).

*• C-step: Assign each xi* to the cluster whose corresponding mixture component has

the largest posterior probability for x*i*, i.e., x*i* *∈ ˆPj(t)* *if j = arg maxkp(k|xi*; Θ*(t)*).

*• M-step: After the C-step, the partition of X (i.e., ˆP(t)*_{) is formed, and the objective}

*function C*2 defined in Eq. (2.21) becomes

*C*2(Θ; ˆ*P(t), X ) =*
*G*
X
*k=1*
X
x*i∈ ˆP _{k}(t)*

*log(w(k)p(xi; θk)).*(2.23)

*Substituting Eq. (2.12) into Eq. (2.23) and then setting the derivative of C*2 with

respect to individual parameters to zero, we obtain the re-estimation formulae of the parameters as follows.

*w(k)(t+1)* _{=} *| ˆP*
*(t)*
*k* *|*
*N* *,* (2.24)
*µ(t+1) _{k}* =
P
x

*i∈ ˆP*x

_{k}(t)*i*

*| ˆP*

_{k}(t)|*,*(2.25) Σ

*(t+1)*= P x

_{k}*i∈ ˆPk(t)*(x

*i− µ*

*(t+1)*

*k*)(x

*i− µ(t+1)k*)

*T*

*| ˆP*

_{k}(t)|*.*(2.26)

2.4.2.3 AWE and ICL for model selection

Based on classification likelihood, Banfield and Raftery proposed the AWE criterion [18, 86] to assess the number of Gaussian components (clusters) for a given data set. The AWE criterion is

*AW E(G clusters; X ) = C*1( ˆ*P, { ˆθ*1*, ˆθ*2*, · · · , ˆθG}; X ) − d({θ*1*, θ*2*, · · · , θG})(*

3

2*+ log N),*

(2.27)
*In addition, Biernacki et al. proposed the ICL criterion [43],*

*ICL(G clusters; X ) = C*2( ˆ*P, ˆ*Θ*G; X ) −*

1

### 2.4.3

### The DAEM algorithm

In the DAEM algorithm for learning a mixture model [23], the objective is to minimize the following system energy function during the annealing process:

*Fβ(Θ; X ) = −*
1
*β*
*N*
X
*i=1*
log(
*G*
X
*k=1*
*(w(k)p(xi; θk*))*β),* (2.29)

*where 1/β corresponds to the temperature that controls the annealing process. The *
aux-iliary function in this case is

*Uβ*(Θ; Θ*(t)) = −*
*N*
X
*i=1*
*G*
X
*k=1*
*f (k | xi*; Θ*(t)) log p(xi, k; Θ),* (2.30)
where
*f (k | xi*; Θ*(t)*) =
*(w(k)(t) _{p(x}*

*i; θ(t)k*))

*β*P

_{G}*j=1(w(j)(t)p(xi; θ(t)j*))

*β*(2.31) is the posterior probability derived by using the maximum entropy principle.

*Ueda and Nakano [23] showed that Fβ(Θ; X ) can be iteratively minimized by *

*it-eratively minimizing Uβ*(Θ; Θ*(t)). When using DAEM to learn a mixture model, β is*

initialized with a small value (less then 1) such that the energy function itself is simple
*enough to be optimized. Then, the value of β is gradually increased to 1. During the*
learning process, the parameters learned in the current learning phase are used as the
initial parameters of the next phase2_{. In the case of β = 1, F}

*β(Θ; X ) and Uβ*(Θ; Θ*(t)*) are

*the negatives of the log-likelihood function in Eq. (2.7) and the Q-function in Eq. (2.8),*

*respectively; thus, minimizing Fβ(Θ; X ) is equivalent to maximizing the log-likelihood*

function.

According to [23], Eq. (2.29) can be rewritten as

*Fβ(Θ; X ) = Uβ(Θ) −*
1
*βSβ(Θ),* (2.32)
where
*Uβ(Θ) = −*
*N*
X
*i=1*
*G*
X
*k=1*
*f (k | xi; Θ) log p(xi, k; Θ),* (2.33)
and
*Sβ(Θ) = −*
*N*
X
*i=1*
*G*
X
*k=1*
*f (k | xi; Θ) log f (k | xi*; Θ) (2.34)

*is the entropy of the posterior distribution. When β→∞, the rational function f (k | xi*; Θ)

*approximates to a zero-one function; thus, the entropy term Sβ(Θ)→0. In this case,*

*Fβ(Θ; X ) is equivalent to the negative of the objective function for CEM in Eq. (2.21).*

2_{Each β value corresponds to a learning phase. The algorithm proceeds to the next phase after it}

Therefore, DAEM can be viewed as a DA variant of CEM.

2.4.3.1 The DAEM algorithm for Gaussian mixture models

*For the given β value, each iteration of the DAEM algorithm for learning GMMs is*
summarized as follows.

*• E-step: Given the current model, Θ(t)*_{, compute the posterior probability of each}
*mixture component of p(xi*; Θ*(t)*) for each x*i* using Eq. (2.31).

*• M-step: Substituting Eq. (2.12) into Eq. (2.30) and then setting its derivative to*

zero, we obtain the re-estimation formulae of the parameters as follows.

*w(k)(t+1)*_{=} 1
*N*
*N*
X
*i=1*
*f (k | xi*; Θ*(t)),* (2.35)
*µ(t+1) _{k}* =
P

_{N}*i=1f (k | xi*; Θ

*(t)*)x

*i*P

_{N}*i=1f (k | xi*; Θ

*(t)*)

*,*(2.36) Σ

*(t+1)*= P

_{k}

_{N}*i=1f (k | xi*; Θ

*(t)*)(x

*i− µ(t+1)k*)(x

*i− µ(t+1)k*)

*T*P

_{N}*i=1f (k | xi*; Θ

*(t)*)

*.*(2.37)

### Chapter 3

### BIC-based self-splitting Gaussian

### mixture learning

### 3.1

### The SGML algorithm

*Given the data set X = {x*1*, x*2*, · · · , xN}, we can partition it into G clusters after applying*

*the EM algorithm to learn a G-component Gaussian mixture model (GMMG*). Here, to

help explain the scenario of the proposed approach, we denote a cluster obtained by EM
*learning a EM cluster. The self-splitting Gaussian mixture learning algorithm (SGML)*
*starts with a single component (i.e., a single EM cluster) in the data space (i.e., X )*
and splits the selected component adaptively during the learning process until the most
*appropriate number of components (EM clusters) is found. We describe the details of*
SGML in Algorithm 1. There are three main aspects with respect to the self-splitting
learning rules:

*I1: How to group the data samples into clusters?*

*I2: Which component should be split into two new components?*

*I3: How many components are enough for fitting the distribution of the data samples?*

*On issue I1: After EM, each xi* *∈ X is assigned to the EM cluster whose Gaussian*

component has the largest posterior probability for x*i*, i.e., assign x*i* *to EM clusterj* for

*j = arg maxkp(k | xi*; ˆΘ).

*On issue I2: For each EM cluster, we calculate the value of ∆BIC*21*(EM cluster):*

*∆BIC*21*(EM cluster) = BIC(GMM*2*, EM cluster) − BIC(GMM*1*, EM cluster),*

(3.1)

*where GMM*1 *fits the EM cluster with a single Gaussian component while GMM*2 fits

*the EM cluster with a mixture of two Gaussian components. The component *

*because, according to the BIC theory, the larger the ∆BIC*21 is, the higher the confidence

*that GMM*2 *fits the corresponding EM cluster better than GMM*1 does. The mean

*vec-tor and covariance matrix of GMM*1can be optimally estimated in an analytic way (which

*are respectively the sample mean vector and sample covariance matrix of EM cluster),*

*whereas those of GMM*2 need to be estimated by the EM algorithm. For the learning

*of GMM*2, we use K-means clustering (here, K=2) to find two initial mean vectors of

*GMM*2*, and then apply EM to fine-tune the parameters. Suppose µ is the mean vector of*

*GMM*1*, we use µ − ε and µ + ε as the initial centroids of the K-means clustering, where*

*ε is a constant vector.*

*On issue I3: After the jth split operation, the number of components grows to j + 1*

*and these Gaussian components are used as the initialization of EM to learn the GMMj+1*

*for X ; and the value of BIC(GMMj+1, X ) is computed after the EM learning. During the*

learning process the BIC values of the learned GMMs with different numbers of mixture components are collected to form a learning curve (i.e., the BIC plot); and the learning process stops when a “significant maximum” in the learning curve is found. Given that

*SGML has generated {GMM*1*,GMM*2*,· · ·,GMMcN um}, if BIC(GMMcN um−SRange, X ) is*

*the maximum among {BIC(GMM*1*, X ),BIC(GMM*2*, X ),· · ·,BIC(GMMcN um, X )}, it is*

called the “significant maximum” of the learning curve and SGML stops and returns

*the model GMMcN um−SRange; otherwise, SGML continues to learn GMMcN um+1*. Here,

*SRange is a positive integer; it is appropriate to set SRange around 5 according to our*

experience.

### 3.1.1

### Computational complexity of SGML

*Without loss of generality, we assume the splitting process of SGML stops at GMMGmax*.

*With the data set X , we use TKM(k, X ) to denote the computational cost of applying *

*K-means clustering to initialize GMMkfor EM and use TEM(k, X ) to denote the cost of using*

*EM for learning GMMk. For SGML, the computational cost of the Splitting step that*

*splits GMMM* *to GMMM +1* consists of the cost of computing the sample mean vectors

and sample covariance matrices for all clusters, i.e.,P*M*

*j=1TEM(1, EM clusterj*), plus the

*cost of applying K-means (K=2 here) followed by EM to learn GMM*2 for all clusters, i.e.,

P_{M}

*j=1[TKM(2, EM clusterj) +TEM(2, EM clusterj*)].

P_{M}

*j=1TEM(1, EM clusterj*) is

*al-most equal to TEM(1, X ). Moreover, from Eqs. (2.13)-(2.15), it is obvious that TEM(k, X )*

*is proportional to the component number k and the size of data set X . In fact, TKM(k, X )*

also has this property. Therefore, P*M*

*j=1TEM(2, EM clusterj*) can be approximated by

*TEM(2, X ), while*

P_{M}

*j=1TKM(2, EM clusterj) can be approximated by TKM(2, X ). In *

of SGML is about

*G*_{X}*max*

*k=1*

*[TEM(1, X ) + TKM(2, X ) + TEM(2, X ) + TEM(k, X )].* (3.2)

For the common approach that uses K-means (with random initialization) followed by EM to learn the candidate models for the model selection criterion, the computational cost is

*G*_{X}*max*

*k=1*

*[TKM(k, X ) + TEM(k, X )].* (3.3)

*Comparing Eq. (3.2) to Eq. (3.3), it is clear that SGML is more efficient when k À 2*
*because for which case, TKM(k, X ) is larger than TEM(1, X )+ TKM(2, X )+TEM(2, X ).*

Comparing to the method proposed by Young and Woodland [31], SGML has an
*over-head of TEM(1, X ) + TKM(2, X ) + TEM(2, X ) in each splitting step. However, TEM(1, X )*

is much small than 1

2*TEM(2, X ) because it does not involve likelihood calculations when*

*estimating GMM*1*. Moreover, TKM(2, X ) is much smaller than TEM(2, X ). Thus, the*

*overhead is just slightly higher than TEM(2, X ). When M À 2, the overhead for splitting*

*GMMM* *to GMMM +1is negligible since TEM(2, X ) is much smaller than TEM(M + 1, X ).*

### 3.2

### The fastSGML algorithm

The computation cost becomes an important issue when the data set is very large; a fast

*learning procedure is therefore desirable. Since we can validate if GMM*2 fits a cluster

*better than GMM*1 using BIC, a straightforward idea to speedup SGML is that split all

*the clusters whose ∆BIC*21values are larger than a threshold in each splitting step. Here,
*we call the threshold the splitting confidence. Based on the concept of multiple splitting,*
we proposed a fastSGML approach in Algorithm 2. To avoid the over-fitting condition
caused when using a too small splitting confidence, the fastSGML stops not only when
*the ∆BIC*21 value for each cluster is smaller than the splitting confidence (in Step 3) but
also when the learning curve starts to go down (in Step 4). Clearly, fastSGML needs a
much lower computation requirement than SGML because it allows multiple splits in the
splitting step. fasSGML resembles the LBG algorithm [75], but the difference is that the
later iteratively splits each of the clusters into two clusters until a given cluster number
is reached and no validation measure is applied to the splitting.

### 3.3

### Experiments on Gaussian mixture learning

We evaluated the proposed SGML and fastSGML algorithms on one synthetic data set
*which consists of six Gaussian clusters in <*2 _{and one real-world data set which consists}
of speech feature vectors from speaker #5007 of the NIST 2001 cellular speaker

recog-nition evaluation data (NIST01SpkEval) [68]. We evaluated the algorithms using the performance of speaker GMM training because we shall apply SGML to the GMM-based speaker identification task, as shown in Section 3.4. In each task, the performance of the proposed algorithms were compared to that of other EM-based learning approaches whose initial mean vectors of GMMs are located by hierarchical agglomerative clustering (HAC) or K-means clustering. The baseline approaches are as follows.

1. Hier-ComLink method: The initial Gaussian mean vectors in EM were obtained by the complete-linkage HAC [35].

2. Hier-SLink method: The initial Gaussian mean vectors in EM were obtained by the single-linkage HAC [35].

3. Hier-CenLink method: The initial Gaussian mean vectors in EM were obtained by the centroid-linkage HAC [35].

4. K-means-random method: The initial Gaussian mean vectors in EM were obtained by K-means clustering, in which the initial centroids are randomly selected from the data set.

5. K-means-BinSplitting method [75]: The initial Gaussian mean vectors in EM were determined by the LBG algorithm, in which each mean vector was split into two new ones in each splitting step until the desired number of clusters was reached. 6. K-means-IncSplitting method [76]: The initial Gaussian mean vectors in EM were

obtained by the incremental splitting K-means algorithm, in which only the mean vector of the cluster with the largest accumulated distance was split into two new ones in each splitting step until the desired number of clusters was reached.

7. EMSplitByMaxWeight method [31]: This method splits the mean vector of the Gaussian component with the largest mixture weight into two new ones in each splitting step, and then performs EM to update all Gaussian components. The number of mixture component is incremented from one to the pre-defined number.

*For each baseline approach, the initial covariance matrix of GMMk* *was set as ρk*I,

*where ρk*=min*j6=k{kµ*(0)*j* *− µ*(0)*k* *k} and µ*(0)*j* *denotes the initial mean vector of the jth*

Gaussian component.

### 3.3.1

### Results on the synthetic data

Figure 3.1 (a) shows the synthetic data that contains six Gaussian clusters, each with 100 samples. First, we conducted experiments using full covariance Gaussian components. Figure 3.1 schematically depicts the learning process of the SGML algorithm. For this