• 沒有找到結果。

# 機率式模型分群法之研究與其應用

N/A
N/A
Protected

Share "機率式模型分群法之研究與其應用"

Copied!
122
1
0

(1)

## 學

(2)

### 博 士 論 文

A Dissertation

Submitted to Department of Computer Science College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Computer Science

May 2009

Hsinchu, Taiwan, Republic of China

(3)

### 要

(4)

(maximum likelihood criterion)，我們亦發展了三個學習 PbSOM 的 EM 形式演算法，亦 即 SOCEM、 SOEM、以及 SODAEM 演算法。SODAEM 是 SOCEM 與 SOEM 的決定 性 退 火 (deterministic annealing, i.e., DA) 變 形 ; 此 外 ， 藉 由 逐 漸 縮 小 鄰 域 大 小 (neighborhood size)，SOCEM 與 SOEM 可分別被解釋成: 高斯模型分群的 CEM 與 EM 演算法的 DA 變形。實驗結果顯示，我們所提出的 PbSOM 演算法與決定性 EM(DAEM) 演算法有相近的分群效能，並能維持拓樸保留之特性。關於應用方面，我們用 SGML 來訓練用於語者識別(speaker identification)的語者 GMMs;實驗結果顯示，SGML 可以 自動地為個別語者 GMM 決定適當的模型複雜度，而此在文獻裡通常是用經驗法則來 決定的。我們將所提出的 PbSOM 學習演算法應用於資料視覺化與分析;實驗結果顯示 它們可用於在一個二維的網路上探索資料樣本與群聚之間的拓樸關係。此外，我們提 出幾種分割且克服(divide-and-conquer)的方法來做音訊分段(audio segmentation)，其用 BIC 來評定兩個音段之間的不相似度(dissimilarity);在廣播新聞的實驗結果顯示，相較 於公認最佳的視窗成長分段法(window-growing-based segmentation)，我們的方法不僅 有比較低的計算量，亦有較高的分段正確性。

(5)

### ABSTRACT

Probabilistic model-based clustering is achieved by learning a finite mixture model (usually a Gaussian mixture model, i.e., GMM) and has been successfully applied to many tasks. When the objective likelihood function cannot be optimized analytically, iterative, locally-optimal learning algorithms are usually applied to learn the model. Moreover, because the data samples drawn from a mixture model have the property of incompleteness, EM-type algorithms, such as EM and classification EM (CEM), are usually employed. However, model-based clustering based on conventional EM-type algorithms suffer from the following shortcomings: 1) the performance is sensitive to the model initialization; 2) the number of mixture components (data clusters) needs to be pre-defined; and 3) the topological relationships between data samples and clusters are not preserved after the learning process. In this thesis, we propose a self-splitting Gaussian mixture learning algorithm (SGML) to address the first two issues. The algorithm starts with one single component and iteratively splits a component into two new components using Bayesian information criterion (BIC) as the validity measure for the splitting until the most appropriate component number is found, which is also achieved by BIC. Based on the splitting process, SGML provides a decent model initialization for learning Gaussian mixture models with different model complexities. For the topology-preserving issue, we consider the learning process of a probabilistic self-organizing map (PbSOM) as a model-based clustering procedure that preserves the topological relationships between data samples and clusters in a neural network. Based on this concept, we develop a

(6)

coupling-likelihood mixture model for the PbSOM that extends the reference vectors in Kohonen's SOM to multivariate Gaussians distributions. We also derive three EM-type algorithms, called the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM) based on the maximum likelihood criterion. SODAEM is a deterministic annealing (DA) variant of SOCEM and SOEM; moreover, by shrinking the neighborhood size, SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to that of the deterministic annealing EM (DAEM) approach, while maintaining the topology-preserving property. For applications, we apply SGML to the training of speaker GMMs for the speaker identification task; the experiment results show that SGML can automatically determine an appropriate model complexity for a speaker GMM, which is usually determined empirically in the literature. We apply the proposed PbSOM algorithms to data visualization and analysis; the experiment results show that they can be used to explore the topological relationships between data samples and clusters on a two-dimensional network. In addition, we propose divide-and-conquer approaches for the audio segmentation task using BIC to evaluate the dissimilarity between two audio segments; the experiment results on the broad-cast news data demonstrate that our approaches not only have a lower computation cost but also achieve higher segmentation accuracy than the leading window-growing-based segmentation approach.

(7)

### ACKNOWLEDGEMENTS

I am grateful to my advisor, Prof. Hsin-Chia Fu and Dr. Hsin-Min Wang, for their intensive suggestions, patient guidance, and enthusiasm of research. Moreover, I would like to thank all the members of the Neural Network Multimedia Laboratory at Department of Computer Science, National Chiao Tung University, and the Spoken Language Group, Chinese Information Processing Laboratory at Institute of Information Science, Academia Sinica, for their valuable discussions. Finally, I would like to express my appreciation to my family for their supporting and encouragement. I dedicate this dissertation to my parents.

(8)

### Contents

1 Introduction 1

1.1 Background . . . 1

1.2 Contributions of this dissertation . . . 5

1.2.1 On model initialization and complexity . . . 5

1.2.2 On topology-preserving ability . . . 5

1.2.3 On applications . . . 6

1.3 Organization of this dissertation . . . 6

2 Preliminaries and related works 7 2.1 K-means clustering . . . 7

2.2 Hierarchical agglomerative clustering . . . 8

2.3 Self-organizing map . . . 9

2.4 Probabilistic model-based clustering . . . 10

2.4.1 The mixture likelihood approach . . . 10

2.4.1.1 The EM algorithm for mixture models . . . 11

2.4.1.2 The EM algorithm for Gaussian mixture models . . . 11

2.4.1.3 The Bayesian approach for model selection and BIC . . . 12

2.4.2 The classification likelihood approach . . . 13

2.4.2.1 The CEM algorithm for mixture models . . . 13

2.4.2.2 The CEM algorithm for Gaussian mixture models . . . 14

2.4.2.3 AWE and ICL for model selection . . . 14

2.4.3 The DAEM algorithm . . . 15

2.4.3.1 The DAEM algorithm for Gaussian mixture models . . . . 16

3 BIC-based self-splitting Gaussian mixture learning 17 3.1 The SGML algorithm . . . 17

3.1.1 Computational complexity of SGML . . . 18

3.2 The fastSGML algorithm . . . 19

3.3 Experiments on Gaussian mixture learning . . . 19

3.3.1 Results on the synthetic data . . . 20

(9)

3.4 Application of SGML to GMM-based speaker identification . . . 24

3.4.1 The GMM-based speaker identification . . . 24

3.4.2 Experiments . . . 26

3.4.2.1 Database description and feature extraction . . . 26

3.4.2.2 Configuration of model training and test utterance . . . . 27

3.4.2.3 Results . . . 27

4 Model-based clustering by probabilistic self-organizing maps 34 4.1 Formulation of the coupling-likelihood mixture model for PbSOMs . . . 34

4.2 The SOCEM algorithm for learning PbSOMs . . . 36

4.2.1 SOCEM - a DA variant of CEM for GMM . . . 37

4.2.2 Relation to Kohonen’s batch algorithm . . . 39

4.2.3 Computational cost . . . 40

4.3 The SOEM algorithm for learning PbSOMs . . . 41

4.3.1 SOEM - a DA variant of EM for GMM . . . 42

4.3.2 Computational cost . . . 42

4.4 The SODAEM algorithm for learning PbSOMs . . . 43

4.4.1 Computational cost . . . 44

4.5 Relation to other algorithms . . . 44

4.5.1 For SOCEM . . . 45

4.5.2 For SOEM and SODAEM . . . 46

4.6 Experiments on organizing property and data clustering . . . 46

4.6.1 Experiments on organizing property . . . 46

4.6.1.1 Results on the synthetic data . . . 47

4.6.1.2 Results on PenRecDigits C0 . . . 49

4.6.2 Experiments to evaluate the performance of data clustering . . . 50

4.6.2.1 Results on ImgSeg by using SOCEM and SODAEM C . . 51

4.6.2.2 Results on ImgSeg by using SOEM and SODAEM M . . . 51

4.6.2.3 Results on Ecoli . . . 52

4.7 Application of SOCEM, SOEM, and SODAEM to data visualization and analysis . . . 52

4.7.1 Experiment results on ImgSeg by using SOCEM and SODAEM C . 53 4.7.2 Experiment results on ImgSeg by using SOEM and SODAEM M . . 54

4.7.3 Experiment results on Ecoli . . . 54

5 BIC-based audio segmentation using divide-and-conquer 69 5.1 Window-growing-based segmentation . . . 71

5.1.1 ∆BIC as the distance measure of two audio segments . . . 71

5.1.2 One-change-point detection . . . 72

(10)

5.2 Divide-and-Conquer-based segmentation . . . 73

5.2.1 The DACDec1 approach . . . 73

5.2.2 The DACDec2 approach . . . 75

5.2.3 Sequential segmentation by DACDec1 and DACDec2 . . . 77

5.3 Computational cost analysis . . . 78

5.3.1 For DACDec1 . . . 79 5.3.2 For DACDec2 . . . 80 5.3.3 For FixSlid . . . 81 5.3.4 For WinGrow . . . 81 5.3.5 Discussion . . . 82 5.4 Experiments . . . 83

5.4.1 Experiments on the synthetic data . . . 83

5.4.2 Experiments on broadcast news data . . . 84

6 Conclusion and future work 89 6.1 Conclusion . . . 89

6.2 Future work . . . 90

A 92 A.1 The SOCEM, SOEM, and SODAEM algorithms where mixture weights are learned . . . 92

B 95 B.1 Compute T1(k) . . . 95

B.2 Compute T0(m) . . . 96

(11)

### List of Figures

2.1 (a) {A, B, · · ·, G} are data samples in <2. (b) An illustrative dendrogram of the data samples in (a) obtained by hierarchical agglomerative clustering. 8 3.1 The learning process of SGML on the synthetic data, where full covariance

GMMs are used. . . 22 3.2 The learning curves of SGML and the baseline approaches on the synthetic

data, where full covariance GMMs are used. . . 23 3.3 The learning curve of SGML on the synthetic data, where diagonal

covari-ance GMMs are used. . . 23 3.4 The learning curves of applying the various GMM learning methods to the

15-second speech data of #5007 in NIST01SpkEval. (a) and (b) depicts the full covariance case, while (c) and (d) depicts the diagonal covariance case. . . 25 3.5 The learning curves of applying fastSGML, K-means-BinSplitting and SGML

to learn diagonal covariance GMMs with 60-second speech data of #5007 in NIST01SpkEval. The splitting confidence of the fastSGML was set at 150, 100, and 50, respectively, and K-means-BinSplitting was forced to stop at GMM43. . . 26 4.1 (a) The network structure of a Gaussian mixture model, and (b) the

pro-posed coupling-likelihood mixture model. Here, rl(xi; θl) denotes the

mul-tivariate Gaussian distribution described in Eq. (2.12). . . 55 4.2 SOCEM’s objective function becomes more complex with the reduction of

neighborhood size (σ in hkl). . . 56

4.3 For each data sample xi, the adaptation of the reference models in SOCEM

is restricted to the winning reference model and its neighborhood. However, in SOEM, the winner is relaxed to the weighted winners by the posterior probabilities γk|i(t), for k = 1, 2, · · · , G. Each data sample xi contributes

pro-portionally to the adaptation of each reference model and its neighborhood according to the posterior probabilities. . . 57

(12)

4.4 The family of Gaussian model-based clustering algorithms derived from the

SODAEM, SOEM and SOCEM algorithms. δkl = 1 if k = l; otherwise,

δkl = 0. . . 57

4.5 The map-learning process obtained by running the SOCEM algorithm on the synthetic data. Simulation 1 ((a)-(b)): When SOCEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in 0.15 decrements. . . 58 4.6 The map-learning process obtained by running the SOEM algorithm on

the synthetic data. Simulation 1 ((a)-(b)): When SOEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in 0.15 decrements. . . 59 4.7 The map-learning process obtained by running the SODAEM algorithm

on the synthetic data. The value of σ is fixed at 0.15, while value of β is initialized at 0.16 and increased in multiples of 1.6 up to 17.592. . . 60 4.8 The map-learning process obtained by running the SOCEM algorithm on

PenRecDigits C0. Simulation 1 ((a)-(b)): When SOCEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in 0.15 decrements. . . 61 4.9 The map-learning process obtained by running the SOEM algorithm on

PenRecDigits C0. Simulation 1 ((a)-(b)): When SOEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in 0.15 decrements. . . 62 4.10 The map-learning process obtained by running the SODAEM algorithm

on PenRecDigits C0. The value of σ is fixed at 0.15, while value of β is initialized at 0.16 and increased in multiples of 1.6 up to 17.592. . . 63 4.11 The data clustering performance of CEM, DAEM C, SOCEM, SODAEM C,

and KohonenGaussian on ImgSeg in terms of the classification log-likelihood. 64 4.12 Learning a Gaussian mixture model by applying EM, DAEM M, SOEM,

and SODAEM M to ImgSeg. . . 64 4.13 The data clustering performance on Ecoli in terms of (a) the classification

(13)

4.14 Data visualization for ImgSeg by running KohonenGaussian ((b)), SOCEM ((c), (d)), and SODAEM C ((e), (f)) with the random initialization in (a). The network structure is a 7 × 7 equally spaced square lattice in a unit square. . . 66 4.15 Data visualization for ImgSeg by running SOEM ((a), (b)) and SODAEM M

((c), (d)) with the random initialization in Figure 4.14 (a). The network

structure is a 7 × 7 equally spaced square lattice in a unit square. . . . . 67

4.16 Data visualization for Ecoli by running (b) KohonenGaussian, (c) SOCEM, (d) SOEM, (e) SODAEM C, and (f) SODAEM M with the random initial-ization in (a). The network structure is a 7 × 7 equally spaced square lattice in a unit square. . . 68 5.1 The fixed-size sliding window detection approach. . . 71 5.2 Diagram of the multiple-change-point detection in window-growing-based

segmentation (WinGrow). The audio stream contains three segments, namely Seg1, Seg2, and Seg3; P and Q denote the change points. . . 73 5.3 (a) An audio stream comprised of three speech segments, each derived from

a distinct speaker. C1 and C2 are the change points. (b) The ∆BIC curve

obtained by applying OCD-Chen to the audio stream in (a). . . 74 5.4 An illustration that data samples distribute as three Gaussian clusters.

For this case, generally, two Gaussians (H1) fit the distribution of the

data better than one Gaussian (H0) if the samples belonging to the same

Gaussian cluster are used together to estimate the parameters. . . 76 5.5 (a) An audio stream comprised of three speech segments; the first and third

segments are derived from the same speaker (Speaker1), while the second is derived from another speaker (Speaker2). (b) The ∆BIC curve obtained by applying OCD-Chen to the audio stream in (a). (c) The diagram of

the hypothesis test at the change point C2 in (b). (d) The diagram of the

hypothesis test at the non-change point R in (b). . . 77 5.6 A recursive tree that simulates the recursive process of DACDec2 on the

audio stream in Figure 5.5 (a). . . 79 5.7 Diagram of the detection process of SeqDACDec1 and SeqDACDec2. If a

change point is detected in the fixed-size analysis window by DACDec1 or DACDec2, the window is moved to the change point with the largest time index. Otherwise, it is moved forward by ηL samples, where L denotes the window size, and η > 0. . . 80 5.8 An audio stream comprised of k+1 homogeneous segments, each containing

(14)

5.9 ROC curves obtained by running SeqDACDec1 and SeqDACDec2 on the synthetic data using 10-second, 20-second, and 30-second analysis windows.

L denotes the size of the analysis window. . . 84

5.10 The empirical cumulative distributions of the size of homogeneous segments in MATBN3hr and HUB4-98. . . 85 5.11 A significant local maximum on the distance curve. . . 86

5.12 The ROC curves for MATBN3hr obtained by (a) SeqDACDec1 with Nmin =

2 seconds and analysis windows of different size; (b) SeqDACDec2 with

Nmin = 2 seconds and analysis windows of different size; and (c)

Seq-DACDec1 with Nmin = 2 seconds and L = 20 seconds, SeqDACDec2 with

Nmin = 2 seconds and L = 20 seconds, WinGrow with Nmin = 3 seconds

and Nmax = 20 seconds, and FixSlid with a 2-second window. . . 87

5.13 The ROC curves for HUB4-98. . . 88 A.1 The map-learning process obtained by running the SOCEM algorithm on

the synthetic data with an ordered initialization in (a). Simulation 1 ((a)-(e)): The mixture weights are initialized at 1

16, and updated in the learning process; the algorithm starts with the initialization in (a) and converges to the unordered map in (e). Simulation 2 ((a) and (f)): SOCEM is performed with equal mixture weights throughout the learning process; the algorithm starts with the initialization in (a) and converges to the map in (f). The network structure is a 4 × 4 square lattice; the value of σ is set at 0.4. . . 94

(15)

### List of Tables

2.1 Common inter-cluster distance measures for hierarchical agglomerative clus-tering. d(xi, xj) denotes the distance measure for xi and xj. . . 9

3.1 The mean and standard deviation of the component number of the diago-nal covariance male speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation. . 29 3.2 The mean and standard deviation of the component number of the diagonal

covariance female speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation. . 29 3.3 The average CPU time (in second) of the diagonal covariance male speaker

GMMs obtained by SGML and fastSGML on different amounts of training data. . . 30 3.4 The average CPU time (in second) of the diagonal covariance female speaker

GMMs obtained by SGML and fastSGML on different amounts of training data. . . 30 3.5 Speaker identification accuracy (in %) for the male speakers. . . 30 3.6 Speaker identification accuracy (in %) for the female speakers. . . 31 4.1 The DAEM algorithm for learning GMMs with equal mixture weights and

the SOCEM algorithm. . . 40 4.2 Results of simulations using KohonenGaussian, SOCEM, SOEM, and

SO-DAEM in 20 independent random initialization trials on the synthetic data.

The algorithms were run with two setups for σ in hkl. When σ = 0.15,

Ko-honenGaussian succeeded in converging to an ordered map in one random initialization case (S:1), but failed in the remaining cases (F:19). . . 49 4.3 Results of simulations using KohonenGaussian, SOCEM, SOEM, and

SO-DAEM in 20 independent random initialization trials on PenRecDigits C0.

The algorithms were run with two setups for σ in hkl. When σ = 0.15,

Ko-honenGaussian succeeded in converging to an ordered map in one random initialization case (S:1), but failed in the remaining cases (F:19). . . 50

(16)

5.1 The CPU time of different audio segmentation approaches evaluated on MATBN3hr in the EER case and the associated EERs, where M and F denote the miss detection rate and the false alarm rate, respectively. . . 86 5.2 The CPU time of different audio segmentation approaches evaluated on

HUB4-98 in the EER case and the associated EERs. . . 87 A.1 The mixture weights learned by SOCEM with the initialization in Figure

A.1 (a). The mixture weights are initialized at 1

(17)

### Background

The goal of data clustering is to group data samples into clusters such that samples from the same cluster are more similar to each other than samples from different clusters. It is also known as unsupervised learning, numerical taxonomy, typological analysis and vector quantization [1, 2]. Among the various clustering approaches, probabilistic model-based clustering is the one derived from statistical learning; and it has been successfully applied to many tasks, for example, speaker recognition [3, 4], speech recognition [5], handwritten recognition [6], image segmentation [7], and clustering of microarray expression data [8, 9]. In model-based clustering, data samples are grouped by learning a finite mixture model (usually a Gaussian mixture model, i.e., GMM), in which each mixture component represents a cluster. There are two major learning methods for model-based clustering: the mixture likelihood approach, where the likelihood of each data sample is a mixture of all the component likelihoods of the data sample; and the classification likelihood approach, where the likelihood of each data sample is generated by its winning component only [10, 11, 12, 13, 14, 15, 16, 17, 18]. In both approaches, when the globally optimal estimation of the model parameters cannot be obtained analytically, iterative learning algorithms that only guarantee obtaining locally optimal solutions are usually employed. The expectation-maximization (EM) algorithm for mixture likelihood learning [19, 20, 21, 22] and the classification EM (CEM) algorithm for classification likelihood learning [17] are two such algorithms and have been the dominant approaches in this task. The conventional EM or CEM-based model-based clustering has three critical aspects, namely the initialization

of model parameters, the model complexity1, and the topology-preserving ability. These

aspects are discussed as follows.

• For model initialization: The learning performance of EM and CEM are very

sen-sitive to the initial conditions of the model’s parameters. To address this issue, the

(18)

authors in [17] proposed a simulated annealing implementation for CEM, which re-duces the initial-position dependence based on random perturbations. Rather than applying the randomization power of simulated annealing, the authors in [23] pro-posed a deterministic annealing EM (DAEM) algorithm that tackles the initializa-tion issue via a deterministic annealing process. DAEM was originally derived as a DA variant of EM; however, as shown in Section 2.4.3, it is also a DA variant of CEM. Although DAEM has been reported achieving decent performance, there is still no guarantee that it finds the globally optimal model parameters because, like EM, it is a iterative, single-token search scheme. In order to search the parameter space in multiple pathes, the authors in [24, 25] and [26, 27] proposed multi-thread search strategies when employing EM and DAEM, respectively, where the search pathes are adapted in the search process according to the eigen-decomposition of the Hessian matrix of the target log-likelihood function. As another kind of multiple-path search, a GA-EM algorithm was proposed in [28], where EM learning is integrated into a Genetic search procedure. It is less sensitive to the initialization because of the sto-chastic search nature of the Genetic Algorithms (GA). Some heuristic-like learning algorithms have also been proposed. For example, an initialization approach for EM which is based on subsampling was presented in [29]. The authors in [30] proposed an SMEM algorithm that finds the appropriate initial conditions for EM learning by using split and merge operations. Similarly, Young and Woodland [31] proposed a component-splitting approach to learn a GMM, which iteratively splits the mean vector of the Gaussian component with the largest weight into two new ones, and then performs EM to update all the Gaussian components. Another method based on component splitting is presented in [32]. In [33], the authors suggested a simple way that one can perform several short runs of EM (by early stopping) first, then select the best model from the results and use them as the initial model for the long run (standard) EM learning. As a common and simple way, one can apply K-means clustering or hierarchical agglomerative clustering (HAC) to locate the initial mean vectors of Gaussian components for the EM learning [1, 34, 35].

Note that all the approaches mentioned above are performed with a given target number of mixture components.

• For model complexity: Assessment of the number of mixture components (data

clusters) is an important issue in model-based clustering. The mixture model would over-fit the data if it contains too many mixture components; in contrast, it would not be flexible enough to describe the structure of the data if the number of com-ponents is too small. Various approaches have been proposed to address this issue. In [36], Furman and Lindsay developed two hypothesis test procedures based on the moment estimators to assess the number of components for a GMM. As

(19)

an-other hypothesis test-based approach, McLachlan and Khan estimated the number of components by likelihood ratio test, where the re-sampling process is applied to assess its null hypothesis [37]. In [38], the authors estimated the mixture complex-ity by comparing an information theoretic-derived nonparametric estimator with the best parametric fit of a given complexity. As another information theoretic-based approach, a maximum entropy-based approach with a modified EM algorithm was proposed to assess the model complexity of GMM in [39]. Moreover, model se-lection criteria, also known as penalized-likelihood criteria, have been proposed to assess the model complexity; for example, Akaike’s Information Criterion (AIC) [40], Bayesian Information Criterion (BIC) [41, 10, 42], Integrated Completed Like-lihood (ICL) [43], Approximate Weight Evidence (AWE) [18], Minimum Description Length (MDL) [44] (which is formally identical to BIC), and Minimum Message Length (MML) [45]. BIC is derived on the basis of mixture likelihood, ICL and AWE are derived on the basis of classification likelihood, and AIC, MDL and MML are information-theoretic-derived criteria. A common way to applying these criteria

to assess model complexity is that defining the upper bound, Gmax, of the

compo-nent number of candidate models first, and then choosing the one with the best score calculated using the employed criterion as the best model. For example, when using BIC, the best component number is

ˆ

G = arg max

G {2 log p(X ; ˆΘG) − P enalty(G)|G = 1, 2, . . . , Gmax}, (1.1)

where ˆΘG is the maximum likelihood estimate of parameters of the mixture model

with G components, P enalty(G) is a monotonically increasing function of G that penalizes more for a more complex model [10]. However, there are two potential drawbacks with this model-selection-based approach. First, it needs to define the

upper bound of the component number, Gmax, beforehand. On the one hand, if

Gmax is too large (much larger than the best component number determined by

the model selection criterion), the learning process will waste a lot of computation time. On the other hand, if Gmax is too small, the selected model may be not flexible

enough to describe the structure of the data. Second, ˆΘG is usually obtained by

EM, whose performance is highly dependent on the model initialization.

Rather than assessing the model complexity by incrementally adding mixture com-ponents during the learning process, as discussed above, the variational Bayesian framework in [46, 47] automatically determines the number of components by setting a larger component number initially and then suppressing unwanted components.

• For topology-preserving ability: Conventional model-based clustering cannot

(20)

clustering procedure. To overcome this shortcoming, the clustering task can be per-formed by using Kohonen’s self-organizing map (SOM) [48, 49]. The SOM, rather than being a supervised neural network model for pattern recognition, is an unsuper-vised model for data clustering and visualization. After SOM’s clustering procedure, the topological relationships among data samples and clusters can be preserved (or visualized) on the network, which is usually a two dimensional lattice. Kohonen’s sequential and batch SOM learning algorithms have proved successful in many prac-tical applications [48, 49]. However, they also suffer from some shortcomings, such as the lack of an objective (cost) function, a general proof of convergence, and a prob-abilistic framework [50]. Some related works that have addressed these issues are as follows. In [51, 52], the behavior of Kohonen’s sequential learning algorithm was studied in terms of energy functions, based on which, Cheng [53] proposed an energy function for SOM whose parameters can be learned by a K-means type algorithm. Luttrell [54, 55] proposed a noisy vector quantization model called the topographic vector quantizer (TVQ), whose training process coincides with the learning process of SOM. The cost function of TVQ represents the topographic distortion between the input data and the output code vectors in terms of Euclidean distance. Graepel

et al. [56, 57] derived a soft topographic vector quantization (STVQ) algorithm

by applying a deterministic annealing process to the optimization of TVQ’s cost function. Based on the topographic distortion concept, Heskes [58] applied a dif-ferent DA implementation from that of STVQ, and obtained an algorithm identical to STVQ when the quantization error is expressed in terms of Euclidean distance. In [59], Chow and Wu proposed an on-line algorithm for STVQ; later, motivated by STVQ, they proposed a data visualization method that integrates SOM and multi-dimensional scaling [60]. Based on the Bayesian analysis of SOMs in [61], Anouar et al. [62] proposed a probabilistic formalism for SOM, where the para-meters are learned by a K-means type algorithm. To help users select the correct model complexity for SOM by probabilistic assessment, Lampinen and Kostiainen [63] developed a generative model in which the SOM is trained by Kohonen’s algo-rithm. Meanwhile, Van Hulle [64] developed a kernel-based topographic formation in which the parameters are adjusted to maximize the joint entropy of the kernel outputs. He subsequently developed a new algorithm with heteroscedastic Gaussian mixtures that allows for a unified account of vector quantization, log-likelihood, and Kullback-Leibler divergence [65]. Another probabilistic formulation is proposed in [66], whereby a normalized neighborhood function of SOM is used as the posterior distribution in the E-step of the EM algorithm for a mixture model to enforce the self-organizing of the mixture components. Sum et al. [67] interpreted Kohonen’s sequential learning algorithm in terms of maximizing the local correlations (cou-pling energies) between neurons and their neighborhoods for the given input data.

(21)

They then proposed an energy function for SOM that reveals the correlations, and a gradient ascent learning algorithm for the energy function.

### On model initialization and complexity

This thesis proposes a BIC-based self-splitting Gaussian mixture learning (SGML) algo-rithm that starts with a single component and iteratively splits a component into two new components until the most appropriate component number is found. The proposed algorithm has several advantages as follows. 1) SGML provides a better initialization for EM. For the Gaussian mixture learning process whose initialization is based on a data clustering process like K-mens or HAC, the clustering phase may results in an poor initial mixture model with too many components in one part of the space and too few in another widely-separated part of the space; in this case, the following EM phase may fail to escape from this configuration and fall into an poor local maximum. In the splitting process of SGML, however, BIC is used to determine which part of the space should be divided into two parts. In this way, the ill initialization situation can be avoided to some extent and, thus, a better estimation for model parameters can be obtained. 2) SGML automati-cally determines the appropriate component number without the need to define the upper bound of the component number beforehand. It stops when a “significant maximum” in the learning curve (i.e., the BIC plot) is found, and then outputs the model yielding the maximum BIC valve. 3) SGML is deterministic; its output is always the same in different runs on the same data set since it does not contain any randomization procedure in its learning rules.

### On topology-preserving ability

In Kohonen’s SOM architecture, neurons in the network associate with reference vectors in the data space. This contrasts with a SOM whose neurons associate with reference models that present probability distributions, such as the isotropic Gaussians used in [66] and the heteroscedastic Gaussians used in [62, 65]. In this thesis, the latter is called a probabilistic SOM (PbSOM). Motivated by the coupling energy concept in Sum et al.’s work [67], a coupling-likelihood mixture model for the PbSOM that uses multivariate Gaussian distributions as the reference models is developed. In the proposed model, local coupling energies between neurons and their neighborhoods are expressed in terms of probabilistic likelihoods; and each mixture component expresses the local coupling-likelihood between one neuron and its neighborhood. Based on this model, we develop CEM, EM, and DAEM algorithms for learning PbSOMs, namely the SOCEM, SOEM, and SODAEM algorithms, respectively. SODAEM is a DA variant of SOCEM and SOEM.

(22)

Moreover, we show that SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering, where the neighborhood shrinking is interpreted as an annealing process. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to the DAEM algorithm, while maintaining the topology-preserving property.

### On applications

In this thesis, we apply SGML to the training of speaker GMMs for the speaker identi-fication task. The experiment results on NIST 2001 speaker recognition evaluation [68] show that SGML can automatically determine the appropriate model complexity for a speaker GMM, which is usually empirically determined in the literature [4, 69]. The proposed PbSOM algorithms are applied to data visualization and analysis. The exper-iment results on UCI data sets show that we can explore the topological relationships between data samples and clusters on a two-dimensional network. Moreover, we propose divide-and-conquer approaches for the audio segmentation task using BIC to evaluate the dissimilarity between two audio segments. The experiment results on the broadcast news data demonstrate that our approaches not only have a lower computation cost but also achieve a higher segmentation accuracy than the leading window-growing-based segmen-tation approach [70, 71].

### Organization of this dissertation

To help readers understand the content of this thesis, we briefly review K-means clustering, hierarchical agglomerative clustering, Kohonen’s SOM, and model-based clustering in Chapter 2. In Chapter 3, we describe the SGML algorithm for learning a Gaussian mixture model and its application to speaker identification (based on [72]). In Chapter 4, we describe the SOCEM, SOEM, and SODAEM algorithms for model-based clustering where the topological relationships between data samples and clusters can be preserved on a network and their application to data visualization and analysis (based on [73]). We present the application of BIC to audio segmentation (based on [74]) in Chapter 5. Finally, we give the conclusion and discuss our future works in Chapter 6.

(23)

### K-means clustering

Given a data set X = {x1, x2, · · · , xN} ⊂ <d, data clustering can be achieved based on

learning the vector prototypes {m1, m2, · · · , mG} ⊂ <d; after the learning process, X is

partitioned into clusters P = {P1, P2, · · · , PG} by classifying each data sample to the

cluster associated with its nearest prototype. For the learning, the goal is to find the partition and the prototypes that minimize the distance function

D(P, {m1, m2, · · · , mG}; X ) = G X k=1 X xi∈Pk kxi− mkk2, (2.1)

which can not be minimized analytically.

K-means clustering is a popular approach to learning the prototypes in order to

min-imize Eq. (2.1). Given the initial prototypes, it iteratively and alternatively applies a classification step and a distance minimization step on the data samples and prototypes as follows.

• Nearest-neighbor classification: Given the current prototypes, {m(t)1 , m(t)2 , · · · , m(t)G},

assign each data sample to the cluster associated with its nearest prototype, i.e., xi ∈ ˆPj(t) if j = arg minkkxi− m(t)k k2.

• Distance minimization: Suppose we have the clusters { ˆP1(t), ˆP2(t), · · · , ˆPG(t)} after the

classification step; then, by minimizing Px

i∈ ˆPk(t)kxi − mkk

2 with respect to m

k,

we obtain the update rules for the prototypes as: m(t+1)k = 1

| ˆPk(t)|

P

xi∈ ˆPk(t)xi, for

k = 1, 2, · · · , G.

Although the K-means clustering algorithm is simple and efficient, it only guaran-tees converging to a local minimum of the distance function in Eq. (2.1). Besides, it has two more shortcomings that the performance highly depends on the initialization of

(24)

1

2

### x

(a) A B C D E F G {B ,C } {A ,B ,C } {D ,E } {F,G } {D ,E ,F,G } {A ,B ,C ,D ,E ,F,G } P ro xi m ity (b)

Figure 2.1: (a) {A, B, · · ·, G} are data samples in <2. (b) An illustrative dendrogram of the data samples in (a) obtained by hierarchical agglomerative clustering.

prototypes and the number of prototypes need to be defined beforehand. To overcome the initialization issue, the authors in [75] proposed to iteratively split each mean vec-tor into two new ones until the desired number of clusters is reached. Similarly, in [76], the authors proposed iteratively splitting the mean vector of the cluster with the largest accumulated distance between data samples and its prototype until the desired number of clusters is reached. As a simple way, one may conduct the K-means algorithm with random initialization for many runs, say 20, and then select the best one.

### Hierarchical agglomerative clustering

Different from K-means clustering that performs data clustering based on prototype learn-ing, hierarchical agglomerative clustering (HAC) performs the clustering according to the proximity between clusters. When performing HAC, each data sample is considered as a cluster initially; then, the algorithm iteratively merges the two clusters with the smallest distance into a new cluster. For example, if HAC is applied to clustering {A, B,· · ·,G} in Figure 2.1 (a) with Euclidean distance, we may obtain a dendrogram (tree) that rep-resents the hierarchy of the clusters in Figure 2.1 (b). As shown in the figure, B and C are merged into a cluster first, then D and E are merged, and so on. And we can obtain different configurations of clusters by cutting the dendrogram according to the proximity. As for the inter-cluster distance measure, the most commonly used ones are the so-called single-linkage, average-linkage, and complete-linkage approaches defined in Table 2.1. For more discussions of these distance measure, reader can refer to [77].

(25)

Table 2.1: Common inter-cluster distance measures for hierarchical agglomerative clus-tering. d(xi, xj) denotes the distance measure for xi and xj.

Approach distance between clusters Pm and Pn

### Self-organizing map

Self-organizing map (SOM) is a neural network model for data clustering that preserves the topological relationships between data samples and clusters in a network [48, 49]. Like K-means clustering, the training of SOM is a prototype-learning process. How-ever, in addition to the interconnections between the data samples and the prototypes as in K-means clustering, SOM involves lateral interactions between prototypes via a neighborhood definition on the network to learn the topological relationships between the prototypes. Kohonen’s sequential (incremental) and batch learning algorithms have been well recognized and successfully applied to many tasks, such as data visualization [48], document processing [78, 79], image processing [80], and speech precessing [81, 82, 83].

Kohonen’s SOM model consists of G neurons R={r1, r2, · · · , rG} on a network with

a neighborhood function hkl that defines the strength of lateral interaction between two

neurons, rk and rl, for k, l ∈ {1, 2, · · · , G}; and each neuron rk associates with a vector

prototype mk(a reference vector) in the input data space [48, 49]. In Kohonen’s sequential

learning algorithm, the training data samples x1, x2, · · · , xN are applied sequentially. For

each data sample, say xi, its winning prototype mwin(t)

i such that

win(t)i = arg min

j kxi− m

(t)

j k, (2.2)

is retrieved first, then the prototypes are adapted by the following rules: m(t+1)k = m(t)k + α(t)hwin(t)

i k(xi− m (t)

k ), (2.3)

for k = 1, 2, · · · , G. α(t) is the learning-rate factor such that 0 < α(t) < 1; it decreases monotonically with the increasing of the learning iterations. The neighborhood function

hkl is usually a rectangular function, i.e., hkl = 1 for rl that is in the neighborhood of rk

and hkl = 0 for rl that is not in the neighborhood of rk; or a monotonically decreasing

function of the Euclidean distance krk− rlk between rk and rl on the SOM network, for

example, the widely applied unnormalized Gaussian neighborhood function

hkl = exp(−

krk− rlk2

(26)

Kohonen’s batch learning algorithm applies all the data samples as a whole to up-date the prototypes, rather than one by one as in the sequential learning algorithm. In each learning iteration, the winning prototype for each sample is found first; then, the prototypes are updated by

m(t+1)k = PN i=1hwin(t)i kxi PN i=1hwin(t)i k , (2.5) for k = 1, 2, · · · , G.

Kohonen’s learning algorithms need to be applied with two stages. First, a large neighborhood is applied to the algorithm to learn an ordered map near the center of the data samples. Then, the prototypes are adapted to fit the distribution of the data samples by gradually shrinking the neighborhood. When the neighborhood is reduce to zero-neighborhood, i.e., hkl = δkl where δkl = 1 if k = l and otherwise δkl = 0, Kohonen’s

batch learning algorithm becomes the K-means clustering algorithm.

### The mixture likelihood approach

In the mixture likelihood approach for model-based clustering, it is assumed that the given data set X = {x1, x2, · · · , xN} ⊂ <d is generated by a set of independently and

identically distributed (i.i.d.) random vectors from a mixture model:

p(xi; Θ) = G

X

k=1

w(k)p(xi; θk), (2.6)

where w(k) is the mixing weight of the mixture component p(xi; θk), subject to 0 ≤

w(k) ≤ 1 for k = 1, 2, · · · , G; PG

k=1w(k) = 1; and θk denotes the parameter set of

p(xi; θk).

The maximum likelihood estimate of the parameter set of the mixture model ˆΘ =

{ ˆw(1), ˆw(2) , · · · , ˆw(G), ˆθ1, ˆθ2, · · · , ˆθG} can be obtained by maximizing the following

log-likelihood function: L(Θ; X ) = log N Y i=1 p(xi; Θ) = N X i=1 log( G X k=1 w(k)p(xi; θk)). (2.7)

This is usually achieved by using the expectation-maximization (EM) algorithm [20, 22]. After learning the mixture model, we derive a partition of X , ˆP = { ˆP1, ˆP2, · · · , ˆPG}, by

(27)

for xi, i.e., xi ∈ ˆPj if j = arg maxkp(k | xi; ˆΘ).

2.4.1.1 The EM algorithm for mixture models

If the maximum likelihood estimation of the parameters cannot be accomplished analyti-cally, the EM algorithm is normally used as an alternative approach when the given data is incomplete or contains hidden information.

In the case of the mixture model, suppose that Θ(t) denotes the current estimate of

the parameter set, and k is the hidden variable that indicates the mixture component from which the data sample is generated. The E-step of EM algorithm then computes the following so-called auxiliary function:

Q(Θ; Θ(t)) = XN i=1 G X k=1 p(k | xi; Θ(t)) log p(xi, k; Θ), (2.8) where p(xi, k; Θ) = w(k)p(xi; θk), (2.9) and p(k | xi; Θ(t)) = w(k)(t)p(x i; θ(t)k ) PG j=1w(j)(t)p(xi; θ(t)j ) (2.10) denotes the posterior probability of the kth mixture component for xi with the given Θ(t).

Then, in the following M-step, the Θ(t+1) that satisfies

Q(Θ(t+1); Θ(t)) = max

Θ Q(Θ; Θ

(t)) (2.11)

is chosen as the new estimate of the parameter set. By iteratively creating the auxiliary function in Eq. (2.8) and performing the subsequent maximization step, the EM algorithm guarantees to converge to a local maximum of the log-likelihood function in Eq. (2.7).

When Q(Θ; Θ(t)) can not be maximized analytically, the M-step is modified to find

some Θ(t+1) such that Q(Θ(t+1); Θ(t)) > Q(Θ(t); Θ(t)). This type of algorithm, called Generalized EM (GEM), is also guaranteed to converge to a local maximum [20, 22].

2.4.1.2 The EM algorithm for Gaussian mixture models

Suppose Eq. (2.6) is a Gaussian mixture model where

p(xi; θk) = 1 (2π)d/2 k|1/2 exp(−1 2(xi− µk) TΣ−1 k (xi− µk)), (2.12)

is the multivariate Gaussian distribution and θk = {µkk} are its mean vector and

covariance matrix. Then, each iteration of the EM algorithm for learning GMMs is summarized as follows.

(28)

• E-step: Given the current model, Θ(t), compute the posterior probability of each mixture component of p(xi; Θ(t)) for each xi using Eq. (2.10).

• M-step: Substituting Eq. (2.12) into Eq. (2.8) and then setting its derivative to

zero, we obtain the re-estimation formulae of the parameters as follows.

w(k)(t+1)= 1 N N X i=1 p(k | xi; Θ(t)), (2.13) µ(t+1)k = PN i=1p(k | xi; Θ(t))xi PN i=1p(k | xi; Θ(t)) , (2.14) Σ(t+1)k = PN i=1p(k | xi; Θ(t))(xi− µ(t+1)k )(xi− µ(t+1)k )T PN i=1p(k | xi; Θ(t)) . (2.15)

2.4.1.3 The Bayesian approach for model selection and BIC

Given the data set X and a set of candidate probability models M = {MG | G =

1, . . . , Gmax} where model MG associates with a parameter set ΘG, the posterior

proba-bility p(MG | X ) can be used to select the appropriate model from M to represent the

distribution of X . According to Bayes’ theorem, p(MG | X ) can be expressed as

p(MG | X ) =

p(MG)p(X ; MG)

p(X ) , (2.16)

where p(MG) is the prior probability of model MG. p(X ) can be ignored because it is

identical for all models and does not affect the selection. Moreover, it is assumed that each model is equally likely (i.e., p(MG) = 1/Gmax); then, p(MG | X ) is proportional to the

probability that the data conforms to the model MG, p(X ; MG), which can be computed

by p(X ; MG) = Z p(X ; ΘG, MG)p(ΘG; MG)dΘG, = Z p(X ; ΘG)p(ΘG)dΘG. (2.17)

The calculation of log p(X ; MG) can be achieved by the Laplace approximation [42, 47, 84],

which gives

log p(X ; MG) ≈ log p(X ; ˆΘG) −

1

2d(ΘG) log N, (2.18)

where ˆΘG is the maximum likelihood estimate of ΘG, d(ΘG) is the number of free

(29)

model MG over the data set X , BIC(MG, X ), is defined as1

BIC(MG, X ) ≡ 2 log p(X ; ˆΘG) − d(ΘG) log N. (2.19)

Accordingly, the larger the BIC value, the stronger the evidence for the model is. In other words, the model with the maximum BIC value will be selected. The BIC can be used to compare models with different parameterizations, different numbers of components, or both.

### The classification likelihood approach

In the classification likelihood approach for model-based clustering [15, 16, 17], instead of maximizing the log-likelihood function of the mixture model in Eq. (2.7), the objective is to find the partition ˆP = { ˆP1, ˆP2, · · · , ˆPG} of X and the model parameters that maximize

C1(P, {θ1, θ2, · · · , θG}; X ) = G X k=1 X xi∈Pk log p(xi; θk), (2.20) or C2(P, Θ; X ) = G X k=1 X xi∈Pk log(w(k)p(xi; θk)). (2.21)

The relation between C1 and C2 is

C2(P, Θ; X ) = C1(P, {θ1, θ2, · · · , θG}; X ) + G

X

k=1

|Pk| log w(k), (2.22)

If all the mixture components are equally weighted,PG

k=1|Pk| log w(k) becomes a constant,

such that C1 and C2 are equivalent.

2.4.2.1 The CEM algorithm for mixture models

Celeux and Govaert [17] proposed the Classification EM (CEM) algorithm for estimating

the partition ˆP and model parameters. Like the EM algorithm, CEM is also an iterative

learning approach. In each iteration, it inserts a classification step (C-step) between the

E-step and M-step of the EM algorithm. In the E-step, the posterior probability of each

mixture component is calculated for each data sample. In the C-step, to obtain the parti-tion ˆP of the data samples, each sample is assigned to the mixture component that yields

the largest posterior probability for that sample. In the M-step, the maximization process is applied to ˆPk individually for k = 1, 2, · · · , G. From a practical point of view, CEM

1Kass and Raftery [85] defined BIC as minus the value given in Eq. (2.19); Fraley and Raftery [10]

used the BIC defined in Eq. (2.19) to make it easier to interpret the plot of BIC values. Here, we follow Fraley and Raftery’s BIC definition.

(30)

is a K-means type algorithm where a cluster is associated with a probability distribution rather than a vector prototype [17].

2.4.2.2 The CEM algorithm for Gaussian mixture models

Suppose the multivariate Gaussian is used as the mixture component, each iteration of the CEM algorithm for learning GMMs is summarized as follows.

• E-step: Given the current model, Θ(t), compute the posterior probability of each mixture component of p(xi; Θ(t)) for each xi using Eq. (2.10).

• C-step: Assign each xi to the cluster whose corresponding mixture component has

the largest posterior probability for xi, i.e., xi ∈ ˆPj(t) if j = arg maxkp(k|xi; Θ(t)).

• M-step: After the C-step, the partition of X (i.e., ˆP(t)) is formed, and the objective

function C2 defined in Eq. (2.21) becomes

C2(Θ; ˆP(t), X ) = G X k=1 X xi∈ ˆPk(t) log(w(k)p(xi; θk)). (2.23)

Substituting Eq. (2.12) into Eq. (2.23) and then setting the derivative of C2 with

respect to individual parameters to zero, we obtain the re-estimation formulae of the parameters as follows.

w(k)(t+1) = | ˆP (t) k | N , (2.24) µ(t+1)k = P xi∈ ˆPk(t)xi | ˆPk(t)| , (2.25) Σ(t+1)k = P xi∈ ˆPk(t)(xi− µ (t+1) k )(xi− µ(t+1)k )T | ˆPk(t)| . (2.26)

2.4.2.3 AWE and ICL for model selection

Based on classification likelihood, Banfield and Raftery proposed the AWE criterion [18, 86] to assess the number of Gaussian components (clusters) for a given data set. The AWE criterion is

AW E(G clusters; X ) = C1( ˆP, { ˆθ1, ˆθ2, · · · , ˆθG}; X ) − d({θ1, θ2, · · · , θG})(

3

2+ log N),

(2.27) In addition, Biernacki et al. proposed the ICL criterion [43],

ICL(G clusters; X ) = C2( ˆP, ˆΘG; X ) −

1

(31)

### The DAEM algorithm

In the DAEM algorithm for learning a mixture model [23], the objective is to minimize the following system energy function during the annealing process:

Fβ(Θ; X ) = − 1 β N X i=1 log( G X k=1 (w(k)p(xi; θk))β), (2.29)

where 1/β corresponds to the temperature that controls the annealing process. The aux-iliary function in this case is

(Θ; Θ(t)) = − N X i=1 G X k=1 f (k | xi; Θ(t)) log p(xi, k; Θ), (2.30) where f (k | xi; Θ(t)) = (w(k)(t)p(x i; θ(t)k ))β PG j=1(w(j)(t)p(xi; θ(t)j ))β (2.31) is the posterior probability derived by using the maximum entropy principle.

Ueda and Nakano [23] showed that Fβ(Θ; X ) can be iteratively minimized by

it-eratively minimizing Uβ(Θ; Θ(t)). When using DAEM to learn a mixture model, β is

initialized with a small value (less then 1) such that the energy function itself is simple enough to be optimized. Then, the value of β is gradually increased to 1. During the learning process, the parameters learned in the current learning phase are used as the initial parameters of the next phase2. In the case of β = 1, F

β(Θ; X ) and Uβ(Θ; Θ(t)) are

the negatives of the log-likelihood function in Eq. (2.7) and the Q-function in Eq. (2.8),

respectively; thus, minimizing Fβ(Θ; X ) is equivalent to maximizing the log-likelihood

function.

According to [23], Eq. (2.29) can be rewritten as

Fβ(Θ; X ) = Uβ(Θ) − 1 βSβ(Θ), (2.32) where Uβ(Θ) = − N X i=1 G X k=1 f (k | xi; Θ) log p(xi, k; Θ), (2.33) and Sβ(Θ) = − N X i=1 G X k=1 f (k | xi; Θ) log f (k | xi; Θ) (2.34)

is the entropy of the posterior distribution. When β→∞, the rational function f (k | xi; Θ)

approximates to a zero-one function; thus, the entropy term Sβ(Θ)→0. In this case,

Fβ(Θ; X ) is equivalent to the negative of the objective function for CEM in Eq. (2.21).

2Each β value corresponds to a learning phase. The algorithm proceeds to the next phase after it

(32)

Therefore, DAEM can be viewed as a DA variant of CEM.

2.4.3.1 The DAEM algorithm for Gaussian mixture models

For the given β value, each iteration of the DAEM algorithm for learning GMMs is summarized as follows.

• E-step: Given the current model, Θ(t), compute the posterior probability of each mixture component of p(xi; Θ(t)) for each xi using Eq. (2.31).

• M-step: Substituting Eq. (2.12) into Eq. (2.30) and then setting its derivative to

zero, we obtain the re-estimation formulae of the parameters as follows.

w(k)(t+1)= 1 N N X i=1 f (k | xi; Θ(t)), (2.35) µ(t+1)k = PN i=1f (k | xi; Θ(t))xi PN i=1f (k | xi; Θ(t)) , (2.36) Σ(t+1)k = PN i=1f (k | xi; Θ(t))(xi− µ(t+1)k )(xi− µ(t+1)k )T PN i=1f (k | xi; Θ(t)) . (2.37)

(33)

### The SGML algorithm

Given the data set X = {x1, x2, · · · , xN}, we can partition it into G clusters after applying

the EM algorithm to learn a G-component Gaussian mixture model (GMMG). Here, to

help explain the scenario of the proposed approach, we denote a cluster obtained by EM learning a EM cluster. The self-splitting Gaussian mixture learning algorithm (SGML) starts with a single component (i.e., a single EM cluster) in the data space (i.e., X ) and splits the selected component adaptively during the learning process until the most appropriate number of components (EM clusters) is found. We describe the details of SGML in Algorithm 1. There are three main aspects with respect to the self-splitting learning rules:

I1: How to group the data samples into clusters?

I2: Which component should be split into two new components?

I3: How many components are enough for fitting the distribution of the data samples?

On issue I1: After EM, each xi ∈ X is assigned to the EM cluster whose Gaussian

component has the largest posterior probability for xi, i.e., assign xi to EM clusterj for

j = arg maxkp(k | xi; ˆΘ).

On issue I2: For each EM cluster, we calculate the value of ∆BIC21(EM cluster):

∆BIC21(EM cluster) = BIC(GMM2, EM cluster) − BIC(GMM1, EM cluster),

(3.1)

where GMM1 fits the EM cluster with a single Gaussian component while GMM2 fits

the EM cluster with a mixture of two Gaussian components. The component

(34)

because, according to the BIC theory, the larger the ∆BIC21 is, the higher the confidence

that GMM2 fits the corresponding EM cluster better than GMM1 does. The mean

vec-tor and covariance matrix of GMM1can be optimally estimated in an analytic way (which

are respectively the sample mean vector and sample covariance matrix of EM cluster),

whereas those of GMM2 need to be estimated by the EM algorithm. For the learning

of GMM2, we use K-means clustering (here, K=2) to find two initial mean vectors of

GMM2, and then apply EM to fine-tune the parameters. Suppose µ is the mean vector of

GMM1, we use µ − ε and µ + ε as the initial centroids of the K-means clustering, where

ε is a constant vector.

On issue I3: After the jth split operation, the number of components grows to j + 1

and these Gaussian components are used as the initialization of EM to learn the GMMj+1

for X ; and the value of BIC(GMMj+1, X ) is computed after the EM learning. During the

learning process the BIC values of the learned GMMs with different numbers of mixture components are collected to form a learning curve (i.e., the BIC plot); and the learning process stops when a “significant maximum” in the learning curve is found. Given that

SGML has generated {GMM1,GMM2,· · ·,GMMcN um}, if BIC(GMMcN um−SRange, X ) is

the maximum among {BIC(GMM1, X ),BIC(GMM2, X ),· · ·,BIC(GMMcN um, X )}, it is

called the “significant maximum” of the learning curve and SGML stops and returns

the model GMMcN um−SRange; otherwise, SGML continues to learn GMMcN um+1. Here,

SRange is a positive integer; it is appropriate to set SRange around 5 according to our

experience.

### Computational complexity of SGML

Without loss of generality, we assume the splitting process of SGML stops at GMMGmax.

With the data set X , we use TKM(k, X ) to denote the computational cost of applying

K-means clustering to initialize GMMkfor EM and use TEM(k, X ) to denote the cost of using

EM for learning GMMk. For SGML, the computational cost of the Splitting step that

splits GMMM to GMMM +1 consists of the cost of computing the sample mean vectors

and sample covariance matrices for all clusters, i.e.,PM

j=1TEM(1, EM clusterj), plus the

cost of applying K-means (K=2 here) followed by EM to learn GMM2 for all clusters, i.e.,

PM

j=1[TKM(2, EM clusterj) +TEM(2, EM clusterj)].

PM

j=1TEM(1, EM clusterj) is

al-most equal to TEM(1, X ). Moreover, from Eqs. (2.13)-(2.15), it is obvious that TEM(k, X )

is proportional to the component number k and the size of data set X . In fact, TKM(k, X )

also has this property. Therefore, PM

j=1TEM(2, EM clusterj) can be approximated by

TEM(2, X ), while

PM

j=1TKM(2, EM clusterj) can be approximated by TKM(2, X ). In

(35)

GXmax

k=1

[TEM(1, X ) + TKM(2, X ) + TEM(2, X ) + TEM(k, X )]. (3.2)

For the common approach that uses K-means (with random initialization) followed by EM to learn the candidate models for the model selection criterion, the computational cost is

GXmax

k=1

[TKM(k, X ) + TEM(k, X )]. (3.3)

Comparing Eq. (3.2) to Eq. (3.3), it is clear that SGML is more efficient when k À 2 because for which case, TKM(k, X ) is larger than TEM(1, X )+ TKM(2, X )+TEM(2, X ).

Comparing to the method proposed by Young and Woodland [31], SGML has an over-head of TEM(1, X ) + TKM(2, X ) + TEM(2, X ) in each splitting step. However, TEM(1, X )

is much small than 1

2TEM(2, X ) because it does not involve likelihood calculations when

estimating GMM1. Moreover, TKM(2, X ) is much smaller than TEM(2, X ). Thus, the

overhead is just slightly higher than TEM(2, X ). When M À 2, the overhead for splitting

GMMM to GMMM +1is negligible since TEM(2, X ) is much smaller than TEM(M + 1, X ).

### The fastSGML algorithm

The computation cost becomes an important issue when the data set is very large; a fast

learning procedure is therefore desirable. Since we can validate if GMM2 fits a cluster

better than GMM1 using BIC, a straightforward idea to speedup SGML is that split all

the clusters whose ∆BIC21values are larger than a threshold in each splitting step. Here, we call the threshold the splitting confidence. Based on the concept of multiple splitting, we proposed a fastSGML approach in Algorithm 2. To avoid the over-fitting condition caused when using a too small splitting confidence, the fastSGML stops not only when the ∆BIC21 value for each cluster is smaller than the splitting confidence (in Step 3) but also when the learning curve starts to go down (in Step 4). Clearly, fastSGML needs a much lower computation requirement than SGML because it allows multiple splits in the splitting step. fasSGML resembles the LBG algorithm [75], but the difference is that the later iteratively splits each of the clusters into two clusters until a given cluster number is reached and no validation measure is applied to the splitting.

### Experiments on Gaussian mixture learning

We evaluated the proposed SGML and fastSGML algorithms on one synthetic data set which consists of six Gaussian clusters in <2 and one real-world data set which consists of speech feature vectors from speaker #5007 of the NIST 2001 cellular speaker

(36)

recog-nition evaluation data (NIST01SpkEval) [68]. We evaluated the algorithms using the performance of speaker GMM training because we shall apply SGML to the GMM-based speaker identification task, as shown in Section 3.4. In each task, the performance of the proposed algorithms were compared to that of other EM-based learning approaches whose initial mean vectors of GMMs are located by hierarchical agglomerative clustering (HAC) or K-means clustering. The baseline approaches are as follows.

1. Hier-ComLink method: The initial Gaussian mean vectors in EM were obtained by the complete-linkage HAC [35].

2. Hier-SLink method: The initial Gaussian mean vectors in EM were obtained by the single-linkage HAC [35].

3. Hier-CenLink method: The initial Gaussian mean vectors in EM were obtained by the centroid-linkage HAC [35].

4. K-means-random method: The initial Gaussian mean vectors in EM were obtained by K-means clustering, in which the initial centroids are randomly selected from the data set.

5. K-means-BinSplitting method [75]: The initial Gaussian mean vectors in EM were determined by the LBG algorithm, in which each mean vector was split into two new ones in each splitting step until the desired number of clusters was reached. 6. K-means-IncSplitting method [76]: The initial Gaussian mean vectors in EM were

obtained by the incremental splitting K-means algorithm, in which only the mean vector of the cluster with the largest accumulated distance was split into two new ones in each splitting step until the desired number of clusters was reached.

7. EMSplitByMaxWeight method [31]: This method splits the mean vector of the Gaussian component with the largest mixture weight into two new ones in each splitting step, and then performs EM to update all Gaussian components. The number of mixture component is incremented from one to the pre-defined number.

For each baseline approach, the initial covariance matrix of GMMk was set as ρkI,

where ρk=minj6=k{kµ(0)j − µ(0)k k} and µ(0)j denotes the initial mean vector of the jth

Gaussian component.

### Results on the synthetic data

Figure 3.1 (a) shows the synthetic data that contains six Gaussian clusters, each with 100 samples. First, we conducted experiments using full covariance Gaussian components. Figure 3.1 schematically depicts the learning process of the SGML algorithm. For this

+7

bone defect with a small daughter radiolucency, b papillary gingival tumor showing a discrete nodule (H&amp;E, 96), c submucosal tumor nests fusing with the surface epithelium

Another study observed a strong genetic component behind caries experience in twins (average age = 24.6 years) differing between males (49%) and females (68%), and a weaker

Listen to the sentence and circle the

This column splitting results in a new sparse matrix and hence a new LDPC code of longer length. If column splitting is done properly, the extended code performs amazingly well

After six months, we found that building learning organization could efficiently improve the learning culture in an organization and the organization’s performance, including

A spoken language understanding (SLU) component requires the domain ontology to decode utterances into semantic forms, which contain core content (a set of slots and slot-fillers)

Mathematically, 5-equation model approaches to same relaxation limits as HRM, but is difficult to solve numerically to ensure solution to be feasible.. 5 -equation

A mixture-energy-consistent 6-equation two-phase numerical model for fluids with interfaces, cavitation and evaporation waves. Modelling phase transition in metastable

Keywords: Adaptive Lasso; Cross-validation; Curse of dimensionality; Multi-stage adaptive Lasso; Naive bootstrap; Oracle properties; Single-index; Pseudo least integrated

• to develop a culture of learning to learn through self-evaluation and self-improvement, and to develop a research culture for improving the quality of learning and teaching

Then they work in groups of four to design a questionnaire on diets and eating habits based on the information they have collected from the internet and in Part A, and with

In this paper, we extended the entropy-like proximal algo- rithm proposed by Eggermont [12] for convex programming subject to nonnegative constraints and proposed a class of

//if it does not connect it starts an access point with the specified name //here &#34;AutoConnectAP&#34;. //and goes into a blocking loop awaiting

To enhance availability of composite services, we propose a discovery-based service com- position framework to better integrate component services in both static and dynamic

The presentation or rebranding by a company of an established product in a new form, a new package or under a new label into a market not previously explored by that company..

In this thesis, we have proposed a new and simple feedforward sampling time offset (STO) estimation scheme for an OFDM-based IEEE 802.11a WLAN that uses an interpolator to recover

Along with this process, a critical component that must be realized in order to assist management in determining knowledge objective and strategies is the assessment of

In order to improve the aforementioned problems, this research proposes a conceptual cost estimation method that integrates a neuro-fuzzy system with the Principal Items

This research proposes a Model Used for the Generation of Innovative Construction Alternatives (MUGICA) for innovation of construction technologies, which contains two models:

In this thesis, a concentric binary phase zone plate was used to re-shape the sharp Gaussian filed profile into an 1-D sinc(x) (or 2-D Bessinc(r)) complex field profile

The impact of an applied component of character education integrated into a general education curriculum on the moral, pro-social, and cognitive development of adolescents attending

This thesis aims to apply the Problem-Based Learning (PBL) method to design a course plan of traffic safety education which is suitable to the elementary school students