3.4 Application of SGML to GMM-based speaker identification
3.4.2 Experiments
3.4.2.3 Results
Tables 3.1 and 3.2 summarize the mean and standard deviation of the number of components in the male and female speaker GMMs, respectively, obtained by SGML and fastSGML on different amounts of training data. From the tables, we see that a splitting confidence in the range of 100 to 150 is appropriate for these two speaker identification tasks, since in this range fastSGML yields an average model complexity similar to that of SGML. Furthermore, for the same amount of training data, a female speaker GMM needs, on average, more Gaussian components than a male speaker GMM. This suggests that, on average, the distribution of MFCC feature vectors of a female speaker is more diverse than that of a male speaker.
We evaluated the CPU time of SGML and fastSGML on an Intel 3.2 GHz CPU. The results for male and female speakers are shown in Tables 3.3 and 3.4, respectively. These two tables show the efficiency gain of fastSGML over SGML. Moreover, the average CPU time of SGML for female speakers is considerably larger than that for male speakers; this is due to the larger average number of components in the former case. For fastSGML, however, the average CPU time for the female case is close to that for the male case; this shows that the CPU time of fastSGML is much less sensitive to model complexity than that of SGML.
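To make the efficiency gain concrete, the speed-up of fastSGML over SGML can be computed directly from the CPU times reported in Tables 3.3 and 3.4. The short sketch below does only that arithmetic; the dictionary layout and the helper name `speedups` are illustrative choices and are not part of the original experiments.

```python
# Average CPU times in seconds, copied from Tables 3.3 (male) and 3.4 (female).
# Each row holds: SGML, then fastSGML with splitting confidence 150, 100, 50.
CPU_TIME = {
    "male": {30: (90.41, 14.23, 19.09, 21.83),
             60: (345.12, 49.14, 63.61, 60.20),
             90: (752.54, 97.03, 110.60, 109.27)},
    "female": {30: (116.81, 14.42, 20.21, 19.62),
               60: (493.45, 51.55, 62.04, 54.36),
               90: (1113.15, 101.52, 127.64, 108.01)},
}
CONFIDENCES = (150, 100, 50)


def speedups(times):
    """Return {splitting confidence: SGML time / fastSGML time} for one row."""
    sgml, *fast = times
    return {conf: sgml / t for conf, t in zip(CONFIDENCES, fast)}


if __name__ == "__main__":
    for gender, rows in CPU_TIME.items():
        for seconds, times in rows.items():
            ratios = ", ".join(f"{c}: {r:.1f}x" for c, r in speedups(times).items())
            print(f"{gender:6s} {seconds} sec  {ratios}")
```

With a splitting confidence of 150, the ratio ranges from roughly 6x (male, 30 sec) to roughly 11x (female, 90 sec), and it grows with the amount of training data for both genders.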
For the identification experiments, we used the K-means-BinSplitting algorithm described in Section 3.3 as the baseline, since, like SGML and fastSGML, it learns the model by component splitting. Tables 3.5 and 3.6 show the identification accuracy for the male and female speakers, respectively, obtained by K-means-BinSplitting, SGML, and fastSGML on different amounts of training speech. For the male speaker case with 30-second training speech in Table 3.5, the identification accuracy of K-means-BinSplitting first improves when the component number is increased from 8 to 16, and then degrades when it is further increased to 32 and 64, at which point the models over-fit the training data. In this case, SGML yields 23.92 Gaussian components on average for each male speaker, as shown in Table 3.1.
For the female speaker case with 30-second training speech, a similar trend is observed, and the baseline system achieves its best identification accuracy with 32 mixture components. This is consistent with the observation from Tables 3.1 and 3.2 that a female speaker GMM generally needs more Gaussian components than a male speaker GMM. In this case, SGML yields 29.44 components on average for each female speaker, as shown in Table 3.2. In the cases of 60-second and 90-second training speech in Tables 3.5 and 3.6, we also observe that SGML can automatically determine an adequate model complexity for speaker GMMs according to the amount and characteristics of the training data, though no significant difference is found between the identification accuracies of SGML and the best accuracies of K-means-BinSplitting under the different training and test conditions.
We also observe a large gap in identification performance between the female case and the male case. We attribute this gap to the greater diversity of the feature vectors of a female speaker: for the female case, more training data are needed to cover the diverse feature space.
From Tables 3.5 and 3.6, we see that fastSGML generally yields identification performance as good as that of SGML. Although fastSGML with different splitting confidences may result in GMMs with different numbers of components, as shown in Tables 3.1 and 3.2, there is no significant difference in identification accuracy between the two approaches.
The splitting confidence within the range of 100 to 150 seems appropriate for these two speaker identification tasks since for this case fastSGML yields similar model complexity for speaker GMMs and similar identification accuracy to those of SGML.
In summary, the speaker identification experimental results show that the proposed SGML and fastSGML algorithms can automatically find the appropriate component number for the speaker GMMs, though the identification accuracy is not significantly improved compared to the best accuracies of the baseline system. Moreover, fastSGML is almost as effective as SGML in the training of GMMs, but at a much lower computation cost.

Table 3.1: The mean and standard deviation of the component number of the diagonal covariance male speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation.

Amount of                          Splitting confidence (fastSGML)
training speech   SGML             150              100              50
30 sec            (23.92/5.07)     (20.58/5.31)     (25.32/6.63)     (27.16/5.99)
60 sec            (35.96/6.90)     (31.72/7.68)     (38.22/9.26)     (41.50/9.42)
90 sec            (46.70/9.23)     (41.30/9.48)     (48.58/11.13)    (53.00/10.65)

Table 3.2: The mean and standard deviation of the component number of the diagonal covariance female speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after ’/’ is the standard deviation.

Amount of                          Splitting confidence (fastSGML)
training speech   SGML             150              100              50
30 sec            (29.44/4.59)     (26.32/5.25)     (31.70/5.97)     (33.34/5.97)
60 sec            (45.06/7.18)     (42.22/6.93)     (50.26/7.85)     (52.60/10.40)
90 sec            (58.26/7.63)     (55.82/9.15)     (65.16/10.16)    (70.18/11.56)
Table 3.3: The average CPU time (in seconds) of the diagonal covariance male speaker GMMs obtained by SGML and fastSGML on different amounts of training data.

Amount of                    Splitting confidence (fastSGML)
training speech   SGML       150        100        50
30 sec            90.41      14.23      19.09      21.83
60 sec            345.12     49.14      63.61      60.20
90 sec            752.54     97.03      110.60     109.27
Table 3.4: The average CPU time (in seconds) of the diagonal covariance female speaker GMMs obtained by SGML and fastSGML on different amounts of training data.

Amount of                    Splitting confidence (fastSGML)
training speech   SGML       150        100        50
30 sec            116.81     14.42      20.21      19.62
60 sec            493.45     51.55      62.04      54.36
90 sec            1113.15    101.52     127.64     108.01
Table 3.5: Speaker identification accuracy (in %) for the male speakers.

Amount of         Length of          Number of components                  Splitting confidence
training speech   test utterance     for K-means-BinSplitting     SGML     for fastSGML
                                     8      16     32     64               150    100    50
30 sec            variable-length    61.46  62.28  61.46  59.35   61.95    61.46  62.76  61.79
                  3 sec              51.68  54.41  54.41  52.38   54.09    53.67  53.86  54.07
                  5 sec              56.91  58.39  57.02  55.99   57.98    57.58  57.80  57.91
                  8 sec              59.80  60.68  60.24  58.47   61.88    60.11  61.19  60.49
60 sec            variable-length    63.25  66.02  66.34  65.85   66.67    66.67  67.15  66.83
                  3 sec              54.29  58.38  58.91  59.29   59.10    59.65  59.80  59.16
                  5 sec              58.94  61.90  62.82  62.12   63.45    62.75  63.19  62.75
                  8 sec              61.95  65.25  66.14  65.88   66.39    66.65  67.22  66.77
90 sec            variable-length    65.53  68.78  69.11  68.62   70.08    70.57  71.06  70.24
                  3 sec              55.28  59.10  61.03  61.99   61.69    61.96  61.79  61.75
                  5 sec              60.01  63.34  65.59  65.67   65.78    64.56  65.37  65.78
                  8 sec              63.86  66.01  68.55  68.36   69.93    69.75  69.37  68.29
Table 3.6: Speaker identification accuracy (in %) for the female speakers.

Amount of         Length of          Number of components                  Splitting confidence
training speech   test utterance     for K-means-BinSplitting     SGML     for fastSGML
                                     8      16     32     64               150    100    50
30 sec            variable-length    29.87  32.21  35.07  30.37   30.54    33.05  29.87  30.70
                  3 sec              27.48  28.46  30.41  29.50   29.09    29.30  27.66  29.23
                  5 sec              29.87  30.55  32.25  30.49   30.67    31.29  28.76  31.08
                  8 sec              31.11  31.90  33.95  32.00   33.00    33.00  30.74  32.79
60 sec            variable-length    40.77  42.45  45.30  45.13   44.97    45.13  45.47  45.47
                  3 sec              32.08  34.03  37.20  37.95   37.79    38.06  37.33  37.78
                  5 sec              35.65  37.97  40.66  40.63   40.11    40.91  40.60  41.00
                  8 sec              38.83  40.20  42.62  43.72   43.14    43.77  42.35  43.08
90 sec            variable-length    42.95  43.62  48.83  48.49   47.32    47.48  47.65  48.49
                  3 sec              32.12  35.41  39.03  40.66   41.78    40.78  40.64  41.00
                  5 sec              35.62  38.74  42.92  44.47   44.34    44.53  44.59  45.24
                  8 sec              38.52  42.35  46.19  47.08   47.50    47.08  46.82  47.24
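Algorithm 1 Self-splitting Gaussian Mixture Learning Algorithm (SGML)
Require: The input data set $X = \{x_1, x_2, \cdots, x_N\}$
Ensure: The estimated parameter set $\hat{\Theta} = \{\hat{w}(1), \hat{w}(2), \cdots, \hat{w}(bestNum), \hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_{bestNum}\}$, where $\hat{w}(k)$ and $\hat{\theta}_k = \{\hat{\mu}_k, \hat{\Sigma}_k\}$ are the maximum likelihood estimates of the mixture weight, mean vector, and covariance matrix of the $k$th Gaussian component
Begin
1. Initialization:
   $SRange \leftarrow 5$;
   $cNum \leftarrow 1$;
   $\hat{\Theta} \leftarrow \{\hat{w}(1), \hat{\theta}_1\}$, where $\hat{\theta}_1 = \{\hat{\mu}_1, \hat{\Sigma}_1\}$ are the sample mean vector and sample covariance matrix of $X$;
   $GMM\_set(1) \leftarrow \hat{\Theta}$; // $GMM\_set(cNum)$ is the parameter set of $GMM_{cNum}$ over $X$
   $BIC\_set(1) \leftarrow BIC(GMM_1, X)$;
2. Data clustering:
   $EM\_cluster_k \leftarrow \emptyset$, for $k \leftarrow 1, 2, \cdots, cNum$;
   for each sample $x_i$:
      $j \leftarrow \arg\max_k p(k \mid x_i; \hat{\Theta})$;
      $EM\_cluster_j \leftarrow EM\_cluster_j \cup \{x_i\}$; // add $x_i$ to $EM\_cluster_j$
3. Split (split one component into two new components):
   $whichSplit \leftarrow \arg\max_k \{\Delta BIC_{21}(EM\_cluster_k)\}$;
   Suppose the parameters of $GMM_2$ corresponding to $EM\_cluster_{whichSplit}$ are $\bar{\lambda}_1 \leftarrow \{\bar{w}(1), \bar{\theta}_1\}$, $\bar{\lambda}_2 \leftarrow \{\bar{w}(2), \bar{\theta}_2\}$, where $\bar{\theta}_k = \{\bar{\mu}_k, \bar{\Sigma}_k\}$, for $k \leftarrow 1, 2$;
   Let
      $\bar{w}(1) \leftarrow \frac{1}{2}\hat{w}(whichSplit)$;
      $\bar{w}(2) \leftarrow \frac{1}{2}\hat{w}(whichSplit)$;
   $\hat{\Theta} \leftarrow \hat{\Theta} \setminus \{\hat{w}(whichSplit), \hat{\theta}_{whichSplit}\}$; // remove $\{\hat{w}(whichSplit), \hat{\theta}_{whichSplit}\}$
   $\hat{\Theta} \leftarrow \hat{\Theta} \cup \{\bar{\lambda}_1, \bar{\lambda}_2\}$; // add $\{\bar{\lambda}_1, \bar{\lambda}_2\}$
   $cNum \leftarrow cNum + 1$;
4. Global EM learning:
   Perform EM learning on all the clusters with $\hat{\Theta}$ as the model initialization;
   $GMM\_set(cNum) \leftarrow \hat{\Theta}$;
   $BIC\_set(cNum) \leftarrow BIC(GMM_{cNum}, X)$;
   if ($cNum > SRange$ and $BIC\_set(cNum - SRange)$ is the maximum in $BIC\_set$)
      $bestNum \leftarrow cNum - SRange$;
      $\hat{\Theta} \leftarrow GMM\_set(bestNum)$; goto End;
   else goto 2;
End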
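The control flow of Algorithm 1 can be sketched in a few dozen lines of plain Python. The sketch below is a simplified illustration rather than the implementation used in the experiments, and several details are assumptions: the data are one-dimensional; BIC is taken in the "larger is better" form, log-likelihood minus half the number of free parameters times log N; $\Delta BIC_{21}$ is taken to be the BIC of a two-component fit of a cluster minus the BIC of its one-component fit; the two-component fit is initialised at the cluster mean plus and minus one standard deviation; and the variance floor, the minimum cluster size, the cap `max_comps`, and the final fallback are safeguards added here that do not appear in Algorithm 1.

```python
import math
import random

VAR_FLOOR = 1e-3  # assumption: keeps variances from collapsing to zero


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_joint(x, comps):
    """log(w_k * N(x; mean_k, var_k)) for every component (w, mean, var)."""
    return [math.log(w) + log_gauss(x, m, v) for w, m, v in comps]


def bic(data, comps):
    """Assumed BIC (larger is better): logL - 0.5 * (#free params) * log N."""
    log_lik = 0.0
    for x in data:
        logs = log_joint(x, comps)
        top = max(logs)
        log_lik += top + math.log(sum(math.exp(v - top) for v in logs))
    return log_lik - 0.5 * (3 * len(comps) - 1) * math.log(len(data))


def mean_var(data):
    mean = sum(data) / len(data)
    return mean, max(sum((x - mean) ** 2 for x in data) / len(data), VAR_FLOOR)


def em(data, comps, iters=30):
    """Plain EM for a 1-D GMM, initialised with comps."""
    for _ in range(iters):
        resp = []
        for x in data:  # E-step: posterior of each component given x
            logs = log_joint(x, comps)
            top = max(logs)
            post = [math.exp(v - top) for v in logs]
            norm = sum(post)
            resp.append([p / norm for p in post])
        comps = []
        for k in range(len(resp[0])):  # M-step: weight, mean, variance
            nk = max(sum(r[k] for r in resp), 1e-12)
            mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
            comps.append((nk / len(data), mean, max(var, VAR_FLOOR)))
    return comps


def delta_bic21(cluster):
    """Assumed Delta-BIC21: BIC of a 2-component fit minus BIC of a 1-component fit."""
    mean, var = mean_var(cluster)
    std = math.sqrt(var)
    two = em(cluster, [(0.5, mean - std, var), (0.5, mean + std, var)])
    return bic(cluster, two) - bic(cluster, [(1.0, mean, var)]), two


def sgml(data, s_range=5, max_comps=12):
    comps = [(1.0,) + mean_var(data)]                     # Step 1: one Gaussian
    gmm_set, bic_set, c_num = {1: comps}, {1: bic(data, comps)}, 1
    while c_num < max_comps:
        clusters = [[] for _ in comps]                    # Step 2: hard assignment
        for x in data:
            logs = log_joint(x, comps)
            clusters[logs.index(max(logs))].append(x)
        best = None                                       # Step 3: pick the split
        for k, cluster in enumerate(clusters):
            if len(cluster) > 3:  # assumption: skip clusters too small to split
                delta, two = delta_bic21(cluster)
                if best is None or delta > best[0]:
                    best = (delta, k, two)
        if best is None:
            break
        _, which, two = best
        half = comps[which][0] / 2                        # each child gets half the weight
        comps = comps[:which] + comps[which + 1:] + [(half, m, v) for _, m, v in two]
        c_num += 1
        comps = em(data, comps)                           # Step 4: global EM
        gmm_set[c_num], bic_set[c_num] = comps, bic(data, comps)
        if c_num > s_range and bic_set[c_num - s_range] == max(bic_set.values()):
            return gmm_set[c_num - s_range]
    return gmm_set[max(bic_set, key=bic_set.get)]  # fallback, not in Algorithm 1


if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(-5, 1) for _ in range(150)] + [random.gauss(5, 1) for _ in range(150)]
    for w, m, v in sgml(data):
        print(f"weight={w:.2f} mean={m:.2f} var={v:.2f}")
```

Note that the stopping rule makes the search run `SRange` splits past the returned model, which is one reason the CPU time of SGML grows quickly with the number of components.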
Algorithm 2 Fast Self-splitting Gaussian Mixture Learning Algorithm (fastSGML)
Require: The input data set $X = \{x_1, x_2, \cdots, x_N\}$
   splitting confidence: the threshold for component splitting
Ensure: The estimated parameter set $\hat{\Theta} = \{\hat{w}(1), \hat{w}(2), \cdots, \hat{w}(bestNum), \hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_{bestNum}\}$, where $\hat{w}(k)$ and $\hat{\theta}_k = \{\hat{\mu}_k, \hat{\Sigma}_k\}$ are the maximum likelihood estimates of the mixture weight, mean vector, and covariance matrix of the $k$th Gaussian component
Begin
1. Initialization: the same as in SGML.
2. Data clustering: the same as in SGML.
3. Split:
   (a) $temp \leftarrow cNum$;
       for $k \leftarrow 1, 2, \cdots, temp$
          if ($\Delta BIC_{21}(EM\_cluster_k) >$ splitting confidence)
             split the component corresponding to $EM\_cluster_k$ into two new components;
             $cNum \leftarrow cNum + 1$;
   (b) if (no component is split in (a))
          $\hat{\Theta} \leftarrow GMM\_set(cNum)$; goto End;
4. Global EM learning:
   Perform EM learning on the Gaussian components obtained in Step 3 (a);
   if (the learning curve starts to go down)
      let $bestNum$ be the component number that has the maximum value in the learning curve;
      $\hat{\Theta} \leftarrow GMM\_set(bestNum)$; goto End;
   else goto 2;
End
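What distinguishes Algorithm 2 from Algorithm 1 is Step 3(a), where every component whose $\Delta BIC_{21}$ exceeds the splitting confidence is split in the same pass, and the simpler stopping test in Step 4. The sketch below isolates these two pieces under stated assumptions: components are one-dimensional `(weight, mean, variance)` tuples, the $\Delta BIC_{21}$ scores and the two-component fits are supplied precomputed, the child weights are halved as in Step 3 of Algorithm 1, and the learning curve is represented as a mapping from component count to score. The function names and the numbers in the demo are made up for illustration.

```python
def split_step(comps, delta_bic21, two_way_fits, splitting_confidence):
    """Step 3(a): split every component whose Delta-BIC21 exceeds the threshold.

    comps        -- list of (weight, mean, variance) tuples
    delta_bic21  -- one precomputed score per component's EM cluster
    two_way_fits -- per component, ((mean_a, var_a), (mean_b, var_b)) taken
                    from a two-component fit of its cluster
    Returns the new component list and the number of splits performed.
    """
    new_comps, n_split = [], 0
    for comp, delta, (first, second) in zip(comps, delta_bic21, two_way_fits):
        if delta > splitting_confidence:
            half = comp[0] / 2  # assumption: halve the parent weight, as in Algorithm 1
            new_comps += [(half, *first), (half, *second)]
            n_split += 1
        else:
            new_comps.append(comp)
    return new_comps, n_split


def best_num_if_declining(curve):
    """Step 4 stop test; curve maps component count -> score in insertion order.

    Returns the component count with the highest score once the latest score
    falls below the previous one, and None while the curve is still rising.
    """
    scores = list(curve.values())
    if len(scores) >= 2 and scores[-1] < scores[-2]:
        return max(curve, key=curve.get)
    return None


if __name__ == "__main__":
    comps = [(0.5, 0.0, 1.0), (0.5, 10.0, 4.0)]
    deltas = [120.0, 30.0]  # made-up Delta-BIC21 scores
    fits = [((-1.0, 0.5), (1.0, 0.5)), ((8.0, 2.0), (12.0, 2.0))]
    for conf in (150, 100, 25):
        new_comps, n_split = split_step(comps, deltas, fits, conf)
        print(f"confidence={conf}: {n_split} split(s), {len(new_comps)} components")
    print(best_num_if_declining({1: -900.0, 2: -700.0, 4: -720.0}))
```

In the demo, lowering the threshold from 150 to 25 raises the number of splits from zero to two, which mirrors the trend in Tables 3.1 and 3.2, where smaller splitting confidences lead to larger models.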
Chapter 4
Model-based clustering by probabilistic self-organizing maps
4.1 Formulation of the coupling-likelihood mixture model for PbSOMs
In this thesis, we define a PbSOM as a SOM that consists of $G$ neurons $R = \{r_1, r_2, \cdots, r_G\}$ in a network with a neighborhood function $h_{kl}$ that defines the strength of lateral interaction between two neurons, $r_k$ and $r_l$, for $k, l \in \{1, 2, \cdots, G\}$; each neuron $r_k$ is associated with a reference model $\theta_k$ that represents some probability distribution in the data space.
Sum et al. [67] interpreted Kohonen’s sequential SOM learning algorithm in terms of maximizing the local correlations (coupling energies) between the neurons and their neighborhoods with the given input data. Given a data sample $x_i \in X = \{x_1, x_2, \cdots, x_N\}$, the local coupling energy between $r_k$ and its neighborhood is defined as
\[
E_{x_i|k} = \sum_{l=1}^{G} h_{kl}\, r_k(x_i; \theta_k)\, r_l(x_i; \theta_l) = r_k(x_i; \theta_k) \sum_{l=1}^{G} h_{kl}\, r_l(x_i; \theta_l), \tag{4.1}
\]
where $r_k(x_i; \theta_k)$ denotes the response of neuron $r_k$ to $x_i$, which is modeled by an isotropic Gaussian density. Then, the coupling energy over the network for $x_i$ is defined as
\[
E_{x_i} = \sum_{k=1}^{G} E_{x_i|k}, \tag{4.2}
\]
and the energy function to be maximized is
\[
C = \sum_{i=1}^{N} \log E_{x_i}. \tag{4.3}
\]
In Eq. (4.1), the term $\sum_{l=1}^{G} h_{kl}\, r_l(x_i; \theta_l)$ can be considered as the neighborhood response of $r_k$, where the conjunction between the neuron responses is implemented using the summing operation.
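As a numerical illustration of Eqs. (4.1) to (4.3), the following sketch evaluates the coupling energy for a toy map. It rests on assumptions that are not taken from the text: the neurons lie on a one-dimensional chain, the neighborhood function is the Gaussian kernel $h_{kl} = \exp(-(k-l)^2 / (2\sigma^2))$, all neurons share one isotropic variance, and the neuron means and data points are made up.

```python
import math


def iso_gauss(x, mean, var):
    """Isotropic Gaussian density N(x; mean, var * I)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-0.5 * sq_dist / var) / (2 * math.pi * var) ** (len(x) / 2)


def gaussian_neighborhood(num_neurons, sigma):
    """Assumed neighborhood on a 1-D chain: h_kl = exp(-(k-l)^2 / (2 sigma^2))."""
    return [[math.exp(-((k - l) ** 2) / (2 * sigma ** 2)) for l in range(num_neurons)]
            for k in range(num_neurons)]


def local_energy(x, k, means, var, h):
    """Eq. (4.1): r_k(x) times the summed neighborhood response of r_k."""
    resp = [iso_gauss(x, m, var) for m in means]
    return resp[k] * sum(h[k][l] * resp[l] for l in range(len(means)))


def energy_function(data, means, var, h):
    """Eq. (4.3): sum over samples of log E_x, with E_x given by Eq. (4.2)."""
    total = 0.0
    for x in data:
        e_x = sum(local_energy(x, k, means, var, h) for k in range(len(means)))
        total += math.log(e_x)
    return total


if __name__ == "__main__":
    means = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # three neurons on a line
    h = gaussian_neighborhood(len(means), sigma=1.0)
    data = [(0.1, 0.2), (1.2, -0.1), (1.9, 0.3)]
    print(energy_function(data, means, var=0.5, h=h))
```

Because the densities are multiplied directly rather than handled in the log domain, this form can underflow for samples that lie far from every neuron; it is meant only to show how the three equations nest.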
Here, we express the neuron response $r_l(x_i; \theta_l)$ as a multivariate Gaussian distribution as in Eq. (2.12) and formulate the neighborhood response of $r_k$ as
\[
\prod_{l \neq k} r_l(x_i; \theta_l)^{h_{kl}}, \tag{4.4}
\]
where the conjunction between the neuron responses in the neighborhood of $r_k$ is implemented using the multiplicative operation. Then, for a given $x_i$, we define the local coupling energy between $r_k$ and its neighborhood as the following coupling-likelihood:
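\[
p_s(x_i|k; \Theta, h) = r_k(x_i; \theta_k)^{h_{kk}} \prod_{l \neq k} r_l(x_i; \theta_l)^{h_{kl}} = \prod_{l=1}^{G} r_l(x_i; \theta_l)^{h_{kl}} = \exp\Big( \sum_{l=1}^{G} h_{kl} \log r_l(x_i; \theta_l) \Big), \tag{4.5}
\]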
where $\Theta$ is the set of reference models, and $h$ denotes the given neighborhood function.¹ Then, we define the coupling-likelihood of $x_i$ over the network as the following (unnormalized) mixture likelihood:
\[
p_s(x_i; \Theta, h) = \sum_{k=1}^{G} w_s(k)\, p_s(x_i|k; \Theta, h), \tag{4.6}
\]
where $w_s(k)$, for $k = 1, 2, \cdots, G$, is fixed at $1/G$. Note that, theoretically, the mixture weights can be learned automatically. When maximizing the local coupling-likelihood $p_s(x_i|k; \Theta, h)$ for each neuron $r_k$, $k = 1, 2, \cdots, G$, the topological order between neuron $r_k$ and its neighborhood for the given data sample $x_i$ is learned in the learning process; therefore, we use equal mixture weights in the mixture model to take account of the topological order learning induced by the neurons faithfully (with equal prior importance). In fact, this is important for learning an ordered map. From our experimental analysis, if the mixture weights are updated in the learning process, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to obtain an ordered map. For details, one can refer to Appendix A.1 after reading this chapter.

¹ From Eq. (4.5), it is obvious that, in our formulation, the coupling between $r_k$ and its neighboring neurons is considered jointly, whereas Sum et al.’s formulation considers it in a pairwise manner, as shown in Eq. (4.1). Note that we use the term “coupling-likelihood” instead of “coupling energy” for two reasons: 1) Eq. (4.5) is a coupling of Gaussian likelihoods; and 2) using “coupling-likelihood” can help describe the link between our proposed approaches and model-based clustering.
Comparing the network structure of the proposed coupling-likelihood mixture model in Eq. (4.6) with that of the Gaussian mixture model (GMM), as shown in Figure 4.1, the proposed model inserts a coupling-likelihood layer between the Gaussian likelihood layer and the mixture likelihood layer to take account of the coupling between the neurons and their neighborhoods. When the neighborhood size is reduced to zero (i.e., $h_{kl} = \delta_{kl}$), the coupling-likelihood mixture model becomes a GMM with equal mixture weights.
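The reduction to an equal-weight GMM can be checked numerically. The sketch below evaluates Eqs. (4.5) and (4.6) in the log domain and compares the result, for $h_{kl} = \delta_{kl}$, with the log-likelihood of an ordinary GMM with weights $1/G$. For brevity it assumes isotropic Gaussian responses rather than the general multivariate Gaussian of Eq. (2.12); the function names, neuron means, variances, and test point are illustrative.

```python
import math


def log_gauss(x, mean, var):
    """Log of an isotropic Gaussian density (a simplification of Eq. (2.12))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (len(x) * math.log(2 * math.pi * var) + sq_dist / var)


def log_coupling_likelihood(x, k, means, variances, h):
    """Log of Eq. (4.5): sum over l of h_kl * log r_l(x; theta_l)."""
    return sum(h[k][l] * log_gauss(x, means[l], variances[l]) for l in range(len(means)))


def log_mixture_likelihood(x, means, variances, h):
    """Log of Eq. (4.6) with w_s(k) = 1/G, evaluated with log-sum-exp."""
    logs = [log_coupling_likelihood(x, k, means, variances, h) for k in range(len(means))]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(means))


def log_equal_weight_gmm(x, means, variances):
    """Log-likelihood of an ordinary GMM with equal weights 1/G."""
    dens = [math.exp(log_gauss(x, m, v)) for m, v in zip(means, variances)]
    return math.log(sum(dens) / len(dens))


if __name__ == "__main__":
    means = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    variances = [0.5, 0.5, 0.5]
    identity = [[1.0 if k == l else 0.0 for l in range(3)] for k in range(3)]
    x = (0.4, -0.2)
    print(log_mixture_likelihood(x, means, variances, identity))
    print(log_equal_weight_gmm(x, means, variances))  # agrees with the line above
```

With any neighborhood function that has nonzero off-diagonal entries, the two values no longer agree, which reflects the "(unnormalized)" qualifier attached to Eq. (4.6).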
Note that other probability distributions are possible for $r_l(x_i; \theta_l)$ in the formulation of the coupling-likelihood mixture model, although we use the multivariate Gaussian distribution here.