We evaluated the proposed SGML and fastSGML algorithms on one synthetic data set, which consists of six Gaussian clusters in $\mathbb{R}^2$, and one real-world data set, which consists of speech feature vectors from speaker #5007 of the NIST 2001 cellular speaker recognition evaluation data (NIST01SpkEval) [68]. We evaluated the algorithms in terms of speaker GMM training performance because we shall apply SGML to the GMM-based speaker identification task, as shown in Section 3.4. In each task, the performance of the proposed algorithms was compared to that of other EM-based learning approaches whose initial GMM mean vectors are located by hierarchical agglomerative clustering (HAC) or K-means clustering. The baseline approaches are as follows.
1. Hier-ComLink method: The initial Gaussian mean vectors in EM were obtained by the complete-linkage HAC [35].
2. Hier-SLink method: The initial Gaussian mean vectors in EM were obtained by the single-linkage HAC [35].
3. Hier-CenLink method: The initial Gaussian mean vectors in EM were obtained by the centroid-linkage HAC [35].
4. K-means-random method: The initial Gaussian mean vectors in EM were obtained by K-means clustering, in which the initial centroids are randomly selected from the data set.
5. K-means-BinSplitting method [75]: The initial Gaussian mean vectors in EM were determined by the LBG algorithm, in which each mean vector was split into two new ones in each splitting step until the desired number of clusters was reached.
6. K-means-IncSplitting method [76]: The initial Gaussian mean vectors in EM were obtained by the incremental splitting K-means algorithm, in which only the mean vector of the cluster with the largest accumulated distance was split into two new ones in each splitting step until the desired number of clusters was reached.
7. EMSplitByMaxWeight method [31]: This method splits the mean vector of the Gaussian component with the largest mixture weight into two new ones in each splitting step, and then performs EM to update all Gaussian components. The number of mixture components is incremented from one to the pre-defined number.
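The two K-means splitting rules in items 5 and 6 differ only in which centroids are split at each step. The following is a minimal sketch of that difference, not the implementation used in [75] or [76]: the additive perturbation `eps`, the use of unsquared Euclidean distance as the "accumulated distance", and all function names are assumptions made for illustration, and the K-means iterations that would follow each split are omitted.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def perturb(centroid, eps):
    """Replace one centroid by two copies shifted by +eps and -eps (assumed rule)."""
    return [[c + eps for c in centroid], [c - eps for c in centroid]]

def split_all(centroids, eps=0.01):
    """Binary splitting: every centroid is split, so the count doubles."""
    return [new for c in centroids for new in perturb(c, eps)]

def split_worst(points, centroids, eps=0.01):
    """Incremental splitting: split only the cluster whose members have the
    largest accumulated distance to their centroid, so the count grows by one."""
    totals = [0.0] * len(centroids)
    for p in points:
        j = nearest(p, centroids)
        totals[j] += math.dist(p, centroids[j])
    worst = max(range(len(centroids)), key=totals.__getitem__)
    return centroids[:worst] + perturb(centroids[worst], eps) + centroids[worst + 1:]
```

With two centroids, `split_all` returns four centroids while `split_worst` returns three, which is why incremental splitting can reach any desired number of clusters and binary splitting reaches only powers of two unless it is stopped early.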
For each baseline approach, the initial covariance matrix of GMMk was set as $\rho_k I$, where $\rho_k = \min_{j \neq k} \{ \| \mu_j^{(0)} - \mu_k^{(0)} \| \}$ and $\mu_j^{(0)}$ denotes the initial mean vector of the jth Gaussian component.
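As an illustration of this initialization, the sketch below computes one scale per initial mean as the distance to its nearest neighboring mean and builds the corresponding scaled identity matrix. The function names and the three example means are hypothetical, and the sketch assumes at least two initial means.

```python
import math

def initial_spherical_scales(means):
    """rho_k = min over j != k of ||mu_j - mu_k|| for each initial mean."""
    return [
        min(math.dist(mu_k, mu_j) for j, mu_j in enumerate(means) if j != k)
        for k, mu_k in enumerate(means)
    ]

def scaled_identity(rho, dim):
    """Initial covariance rho * I as a nested list."""
    return [[rho if r == c else 0.0 for c in range(dim)] for r in range(dim)]

# Hypothetical initial means in the plane.
means = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
rhos = initial_spherical_scales(means)
covariances = [scaled_identity(rho, dim=2) for rho in rhos]
```

For these example means the nearest-neighbor distances are 3, 4, and 3, so an isolated mean starts with a broader Gaussian than means that lie close together.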
3.3.1 Results on the synthetic data
Figure 3.1 (a) shows the synthetic data that contains six Gaussian clusters, each with 100 samples. First, we conducted experiments using full covariance Gaussian components.
Figure 3.1 schematically depicts the learning process of the SGML algorithm. For this example, SGML stops at GMM11 and obtains a “significant maximum” at GMM6, as shown in Figure 3.2. The results show that SGML performs well in automatically clustering the synthetic data. When evaluating the baseline methods, the maximal number of mixture components was limited to 13. From Figure 3.2, we see that all the baseline methods except K-means-random estimate the parameters of GMMk (k = 1, 2, ..., 13) as well as SGML does.
Then, we evaluated SGML with diagonal covariance Gaussians. In this case, as shown in Figure 3.3, SGML groups the synthetic data into ten clusters. From the perspective of “Gaussian mixture modeling” rather than “data clustering”, we divide the learning process into three phases:
1. The cluster-capturing phase (GMM1 − GMM6): In this phase, SGML roughly captures the locations of all the clusters of the data by the self-splitting rules.
2. The shape-smoothing phase (GMM7 − GMM10): A diagonal covariance Gaussian component is unable to model a cluster that has a complex shape, so this kind of cluster needs to be represented by a mixture of Gaussians. For example, as shown in Figure 3.1 (a), clusters 3, 4, 5, and 6 are all oblique ellipses, and each of them needs to be represented by a mixture of two Gaussians. As a result, the learning curve in Figure 3.3 has the largest BIC value at GMM10. The increase in BIC value in the shape-smoothing phase is clearly much smaller than that in the cluster-capturing phase.
3. The over-fitting phase (GMM11 onward): After reaching the component number with the largest BIC value, SGML starts to over-fit the data in each splitting step, and thus the learning curve goes down progressively.
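The limitation behind the shape-smoothing phase can be checked numerically. The sketch below fits a single Gaussian to a made-up oblique point set (not the synthetic data of Figure 3.1) by maximum likelihood, once with a full covariance matrix and once with the off-diagonal term forced to zero, and returns the resulting log-likelihood. The function name and the data are assumptions for illustration only.

```python
import math

def max_loglik_2d(points, diagonal):
    """Log-likelihood of 2-D points under one Gaussian with ML mean and
    covariance; `diagonal=True` forces the off-diagonal covariance to zero."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = 0.0 if diagonal else sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    # At the ML estimate the quadratic terms sum to n * d, with d = 2 here.
    return -0.5 * n * (2 * math.log(2 * math.pi) + math.log(det) + 2)

# Hypothetical oblique cluster: points scattered tightly around the line y = x.
oblique = [(t, t + 0.5 * (-1) ** t) for t in range(20)]
full_ll = max_loglik_2d(oblique, diagonal=False)
diag_ll = max_loglik_2d(oblique, diagonal=True)
```

Because the determinant of the full covariance can never exceed the product of the two variances, `full_ll` is at least `diag_ll`, and the gap grows with the correlation. A diagonal model can only close that gap by spending additional components on the same cluster.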
3.3.2 Results on the real-world data
In this section, we compared the performance of SGML and the baseline methods on speaker GMM training with speech feature vectors from speaker #5007 in NIST01SpkEval. For feature extraction, 24 mel-frequency cepstral coefficients (MFCCs) were extracted using a 32-ms analysis window with a 10-ms shift [87].
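To give a sense of how many feature vectors this framing produces, the sketch below counts the full analysis windows that fit in a recording. It assumes no padding at the signal edges, which the text does not specify, and the function name is hypothetical.

```python
def num_frames(duration_ms, window_ms=32, shift_ms=10):
    """Number of complete analysis windows in the signal (no edge padding)."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // shift_ms + 1

frames_15s = num_frames(15_000)
frames_60s = num_frames(60_000)
```

Under this assumption a 15-second recording gives 1497 frames and a 60-second recording gives 5997, that is, roughly 100 feature vectors per second of speech. The exact count depends on how the front end treats the last partial window.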
Figure 3.4 shows the learning curves obtained by the various methods on 15-second speech data (corresponding to 1500 24-dimensional MFCC vectors). Figures 3.4 (a)-(b) show the learning curves for the full covariance case, while (c)-(d) show those for the diagonal covariance case. For these two cases, the upper bounds on the component number were set to 15 and 40, respectively (in fact, SGML stopped at GMM9 and GMM21, respectively).
Figure 3.1: The learning process of SGML on the synthetic data, where full covariance GMMs are used. Panels: (a) GMM1; (b), (d), (f), (h), (j) initial GMM2 to GMM6 after splitting; (c), (e), (g), (i), (k) GMM2 to GMM6 after EM and clustering; (l) 3D plot of the pdf of GMM6.
Figure 3.2: The learning curves of SGML and the baseline approaches on the synthetic data, where full covariance GMMs are used. Panels: (a) Hier-ComLink, Hier-SLink, Hier-CenLink, and SGML; (b) K-means-random, K-means-BinSplitting, K-means-IncSplitting, EMSplitByMaxWeight, and SGML. Axes: number of components versus BIC value (×10^4).
Figure 3.3: The learning curve of SGML on the synthetic data, where diagonal covariance GMMs are used. Axes: number of components versus BIC value (×10^4).
The corresponding “significant maxima” of the learning curves of SGML are at GMM4 and GMM16, respectively, which are also the global maxima in their corresponding
learning curves. From the figures, we see that the learning curves of SGML are smoother than those of the other methods. The self-splitting learning process splits the cluster with the largest ∆BIC21 value in each splitting step, which makes the learning curve go up steadily before reaching the most appropriate component number. After reaching the best component number, the learning process tends to split a well-modeled cluster in each splitting step, which makes the learning curve go down progressively. We also see that the BIC values of the GMMs trained by SGML are almost always higher than those of the GMMs trained by the other methods at any component number. As discussed in Section 2.4.1.3, for the same model complexity, a higher BIC value indicates a higher log-likelihood value. Therefore, SGML also outperforms the other methods in learning a GMM with a given component number.
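The last step of this argument can be written out explicitly. The equations below assume that the BIC of Section 2.4.1.3 has the common form of a log-likelihood minus a complexity penalty. That definition is not reproduced in this section, so the exact scaling is an assumption, as are the symbols: $X$ is the training data, $N$ the number of samples, $\lambda_K$ and $\lambda'_K$ two GMMs with the same number of components $K$ and the same covariance type, and $d(K)$ their shared number of free parameters.

```latex
\[
\mathrm{BIC}(\lambda_K) = \log p(X \mid \lambda_K) - \tfrac{1}{2}\, d(K) \log N
\]
\[
\mathrm{BIC}(\lambda_K) - \mathrm{BIC}(\lambda'_K)
  = \log p(X \mid \lambda_K) - \log p(X \mid \lambda'_K)
\]
```

The penalty term cancels in the difference because both models have the same $d(K)$ and are trained on the same $N$ samples, so at a fixed component number the ordering by BIC coincides with the ordering by training log-likelihood.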
Next, we investigated the learning performance of fastSGML using diagonal covariance GMMs and 60-second speech data, and compared its performance to that of SGML and K-means-BinSplitting. Figure 3.5 shows the learning curves of fastSGML, K-means-BinSplitting, and SGML. The splitting confidence of fastSGML was set at 150 (fastSGML-150), 100 (fastSGML-100), and 50 (fastSGML-50), respectively. The best component numbers obtained by fastSGML-150, fastSGML-100, fastSGML-50, and SGML were 27, 37, 43, and 40, respectively; hence, K-means-BinSplitting was forced to stop at GMM43. From the figure, it seems that fastSGML-150 and fastSGML-100 under-fit the training data, while fastSGML-50 over-fits it. From the learning curve of SGML, we see that GMMs whose component number is within the range of 30 to 50 have BIC values similar to that of GMM40 (the best model selected by SGML). When the splitting confidence is set appropriately, fastSGML obtains a GMM whose BIC value and component number are both very close to those of the best GMM obtained by SGML. Moreover, for a given component number, K-means-BinSplitting always obtains a smaller BIC value than fastSGML. We evaluated the speed of SGML and fastSGML on an Intel 3.2 GHz CPU. The run times of SGML, fastSGML-50, fastSGML-100, and fastSGML-150 were 268.96 sec., 48.01 sec., 97.18 sec., and 43.59 sec., respectively, from which we see the efficiency gain of fastSGML over SGML.
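The efficiency gain can be expressed as a speedup factor relative to SGML. The short sketch below does this arithmetic on the run times reported above. The dictionary and function names are hypothetical, and the timings are copied from the text as given.

```python
# Run times in seconds, as reported in the text.
run_time_sec = {
    "SGML": 268.96,
    "fastSGML-50": 48.01,
    "fastSGML-100": 97.18,
    "fastSGML-150": 43.59,
}

def speedup_over(baseline, timings):
    """Ratio of the baseline run time to each other method's run time."""
    base = timings[baseline]
    return {name: base / t for name, t in timings.items() if name != baseline}

speedups = speedup_over("SGML", run_time_sec)
```

Rounded to one decimal place, the factors are about 5.6 for fastSGML-50, 2.8 for fastSGML-100, and 6.2 for fastSGML-150.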