Results - Application of SGML to GMM-based speaker identification

3.4 Application of SGML to GMM-based speaker identification

3.4.2 Experiments

3.4.2.3 Results

Tables 3.1 and 3.2 summarize the mean and standard deviation of the component number of the male speaker GMMs and female speaker GMMs, respectively, obtained by SGML and fastSGML on different amounts of training data. From the tables, we see that a splitting confidence within the range of 100 to 150 is appropriate for these two speaker identification tasks, since in this range fastSGML yields an average model complexity similar to that of SGML. Furthermore, we see that, on average, a female speaker GMM needs more Gaussian components than a male speaker GMM for the same amount of training data. This reveals that, on average, the distribution of MFCC feature vectors of a female speaker is more diverse than that of a male speaker.

We evaluated the CPU time cost of SGML and fastSGML on an Intel 3.2 GHz CPU. The results for male and female speakers are shown in Tables 3.3 and 3.4, respectively. These two tables show the efficiency gains of fastSGML over SGML. Moreover, the average CPU time of SGML for female speakers is significantly larger than that for male speakers; this is due to the larger average component number in the former case. For fastSGML, however, the average CPU time for the female case is close to that of the male case; this shows that the CPU time cost of fastSGML is much less sensitive to the model complexity than that of SGML.

For the identification experiments, we used the K-means-BinSplitting algorithm described in Section 3.3 as the baseline approach since, like SGML and fastSGML, it learns the model based on component splitting. Tables 3.5 and 3.6 show the identification accuracy for the male and female speakers, respectively, obtained by K-means-BinSplitting, SGML, and fastSGML on different amounts of training speech. From the male speaker case with 30-second training speech in Table 3.5, we see that the identification accuracy of K-means-BinSplitting first improves when the component number is increased from 8 to 16, and then degrades when the component number is further increased to 32 and 64, at which point the models over-fit the training data. In this case, SGML yields 23.92 Gaussian components on average for each male speaker, as shown in Table 3.1.

For the female speaker case with 30-second training speech, a similar trend is observed, and the baseline system achieves the best identification accuracy with 32 mixture components. This conforms to the observation from Tables 3.1 and 3.2 that a female speaker GMM generally needs more Gaussian components than a male speaker GMM. In this case, SGML yielded 29.44 components on average for each female speaker, as shown in Table 3.2. In the cases of 60-second and 90-second training speech in Tables 3.5 and 3.6, we also observe that SGML can automatically determine the adequate model complexity for speaker GMMs according to the amount and characteristics of training data, though no significant difference is found between the identification accuracies of SGML and the best accuracies of K-means-BinSplitting under the different training and test conditions.

We also observe that there is a huge identification performance gap between the female case and the male case. This gap is obviously due to the diversity of feature vectors of a female speaker. For the female case, more training data are needed to cover the diverse feature space.

From Tables 3.5 and 3.6, we see that fastSGML generally yields as good identification performance as SGML. Though the fastSGML with different splitting confidences might result in GMMs with different numbers of components, as shown in Tables 3.1 and 3.2, there is no significant difference in identification accuracy between these two approaches.

The splitting confidence within the range of 100 to 150 seems appropriate for these two speaker identification tasks since for this case fastSGML yields similar model complexity for speaker GMMs and similar identification accuracy to those of SGML.

In summary, the speaker identification experimental results show that the proposed SGML and fastSGML algorithms can automatically find the appropriate component number for the speaker GMMs, though the identification accuracy is not significantly improved compared to the best accuracies of the baseline system. Moreover, fastSGML is almost as effective as SGML in the training of GMMs, but at a much lower computation cost.

Table 3.1: The mean and standard deviation of the component number of the diagonal covariance male speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after '/' is the standard deviation.

| Amount of training speech | SGML | fastSGML, splitting confidence 150 | fastSGML, splitting confidence 100 | fastSGML, splitting confidence 50 |
|---|---|---|---|---|
| 30 sec | (23.92/5.07) | (20.58/5.31) | (25.32/6.63) | (27.16/5.99) |
| 60 sec | (35.96/6.90) | (31.72/7.68) | (38.22/9.26) | (41.50/9.42) |
| 90 sec | (46.70/9.23) | (41.30/9.48) | (48.58/11.13) | (53.00/10.65) |

Table 3.2: The mean and standard deviation of the component number of the diagonal covariance female speaker GMMs obtained by SGML and fastSGML on different amounts of training data. The first number in parentheses is the mean value, while the second number after '/' is the standard deviation.

| Amount of training speech | SGML | fastSGML, splitting confidence 150 | fastSGML, splitting confidence 100 | fastSGML, splitting confidence 50 |
|---|---|---|---|---|
| 30 sec | (29.44/4.59) | (26.32/5.25) | (31.70/5.97) | (33.34/5.97) |
| 60 sec | (45.06/7.18) | (42.22/6.93) | (50.26/7.85) | (52.60/10.40) |
| 90 sec | (58.26/7.63) | (55.82/9.15) | (65.16/10.16) | (70.18/11.56) |
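To make the comparison of model complexities concrete, the following minimal sketch computes how far the mean component number of fastSGML deviates from that of SGML, using only the mean values reported in Tables 3.1 and 3.2. The names `MEAN_COMPONENTS`, `CONFIDENCES`, and `relative_gap` are illustrative and do not come from the thesis.

```python
# Mean component numbers transcribed from Tables 3.1 (male) and 3.2 (female).
# Keys: amount of training speech in seconds.
# Values: (SGML, fastSGML@150, fastSGML@100, fastSGML@50).
MEAN_COMPONENTS = {
    "male": {30: (23.92, 20.58, 25.32, 27.16),
             60: (35.96, 31.72, 38.22, 41.50),
             90: (46.70, 41.30, 48.58, 53.00)},
    "female": {30: (29.44, 26.32, 31.70, 33.34),
               60: (45.06, 42.22, 50.26, 52.60),
               90: (58.26, 55.82, 65.16, 70.18)},
}
CONFIDENCES = (150, 100, 50)  # splitting confidences used for fastSGML


def relative_gap(gender, seconds):
    """Return {confidence: (fastSGML mean - SGML mean) / SGML mean}."""
    sgml, *fast = MEAN_COMPONENTS[gender][seconds]
    return {c: (m - sgml) / sgml for c, m in zip(CONFIDENCES, fast)}


if __name__ == "__main__":
    for gender in MEAN_COMPONENTS:
        for seconds in (30, 60, 90):
            gaps = relative_gap(gender, seconds)
            row = "  ".join(f"{c}: {g:+.1%}" for c, g in gaps.items())
            print(f"{gender:6s} {seconds} sec  {row}")
```

For every condition in the two tables, the SGML mean lies between the fastSGML means obtained with splitting confidences 150 and 100, which is consistent with the range of 100 to 150 suggested in the text.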

Table 3.3: The average CPU time (in seconds) taken by SGML and fastSGML to learn the diagonal covariance male speaker GMMs on different amounts of training data.

| Amount of training speech | SGML | fastSGML, splitting confidence 150 | fastSGML, splitting confidence 100 | fastSGML, splitting confidence 50 |
|---|---|---|---|---|
| 30 sec | 90.41 | 14.23 | 19.09 | 21.83 |
| 60 sec | 345.12 | 49.14 | 63.61 | 60.20 |
| 90 sec | 752.54 | 97.03 | 110.60 | 109.27 |

Table 3.4: The average CPU time (in seconds) taken by SGML and fastSGML to learn the diagonal covariance female speaker GMMs on different amounts of training data.

| Amount of training speech | SGML | fastSGML, splitting confidence 150 | fastSGML, splitting confidence 100 | fastSGML, splitting confidence 50 |
|---|---|---|---|---|
| 30 sec | 116.81 | 14.42 | 20.21 | 19.62 |
| 60 sec | 493.45 | 51.55 | 62.04 | 54.36 |
| 90 sec | 1113.15 | 101.52 | 127.64 | 108.01 |
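The efficiency gain can be quantified directly from the reported timings. The sketch below, which uses only the values in Tables 3.3 and 3.4, computes the speed-up of fastSGML over SGML and the female-to-male CPU time ratio for each method. The helper names `speedups` and `female_to_male_ratio` are illustrative.

```python
# Average CPU time in seconds, transcribed from Tables 3.3 (male) and 3.4 (female).
# Values: (SGML, fastSGML@150, fastSGML@100, fastSGML@50).
CPU_SECONDS = {
    "male": {30: (90.41, 14.23, 19.09, 21.83),
             60: (345.12, 49.14, 63.61, 60.20),
             90: (752.54, 97.03, 110.60, 109.27)},
    "female": {30: (116.81, 14.42, 20.21, 19.62),
               60: (493.45, 51.55, 62.04, 54.36),
               90: (1113.15, 101.52, 127.64, 108.01)},
}
CONFIDENCES = (150, 100, 50)  # splitting confidences used for fastSGML


def speedups(gender, seconds):
    """Return {confidence: SGML time / fastSGML time}."""
    sgml, *fast = CPU_SECONDS[gender][seconds]
    return {c: sgml / t for c, t in zip(CONFIDENCES, fast)}


def female_to_male_ratio(seconds):
    """Return female/male CPU-time ratios as (SGML, fastSGML@150, @100, @50)."""
    male = CPU_SECONDS["male"][seconds]
    female = CPU_SECONDS["female"][seconds]
    return tuple(f / m for f, m in zip(female, male))


if __name__ == "__main__":
    for gender in CPU_SECONDS:
        for seconds in (30, 60, 90):
            row = "  ".join(f"{c}: {s:.2f}x" for c, s in speedups(gender, seconds).items())
            print(f"{gender:6s} {seconds} sec  {row}")
    for seconds in (30, 60, 90):
        print(seconds, ["%.2f" % r for r in female_to_male_ratio(seconds)])
```

On these numbers the speed-up ranges from roughly 4 to 11 times, and the female-to-male ratio stays close to 1 for fastSGML while it exceeds 1.25 for SGML in all three training conditions.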

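Table 3.5: Speaker identification accuracy (in %) for the male speakers. The K-means-BinSplitting columns are indexed by the number of components, and the fastSGML columns by the splitting confidence.

| Amount of training speech | Length of test utterance | K-means-BinSplitting, 8 | K-means-BinSplitting, 16 | K-means-BinSplitting, 32 | K-means-BinSplitting, 64 | SGML | fastSGML, 150 | fastSGML, 100 | fastSGML, 50 |
|---|---|---|---|---|---|---|---|---|---|
| 30 sec | variable-length | 61.46 | 62.28 | 61.46 | 59.35 | 61.95 | 61.46 | 62.76 | 61.79 |
| 30 sec | 3 sec | 51.68 | 54.41 | 54.41 | 52.38 | 54.09 | 53.67 | 53.86 | 54.07 |
| 30 sec | 5 sec | 56.91 | 58.39 | 57.02 | 55.99 | 57.98 | 57.58 | 57.80 | 57.91 |
| 30 sec | 8 sec | 59.80 | 60.68 | 60.24 | 58.47 | 61.88 | 60.11 | 61.19 | 60.49 |
| 60 sec | variable-length | 63.25 | 66.02 | 66.34 | 65.85 | 66.67 | 66.67 | 67.15 | 66.83 |
| 60 sec | 3 sec | 54.29 | 58.38 | 58.91 | 59.29 | 59.10 | 59.65 | 59.80 | 59.16 |
| 60 sec | 5 sec | 58.94 | 61.90 | 62.82 | 62.12 | 63.45 | 62.75 | 63.19 | 62.75 |
| 60 sec | 8 sec | 61.95 | 65.25 | 66.14 | 65.88 | 66.39 | 66.65 | 67.22 | 66.77 |
| 90 sec | variable-length | 65.53 | 68.78 | 69.11 | 68.62 | 70.08 | 70.57 | 71.06 | 70.24 |
| 90 sec | 3 sec | 55.28 | 59.10 | 61.03 | 61.99 | 61.69 | 61.96 | 61.79 | 61.75 |
| 90 sec | 5 sec | 60.01 | 63.34 | 65.59 | 65.67 | 65.78 | 64.56 | 65.37 | 65.78 |
| 90 sec | 8 sec | 63.86 | 66.01 | 68.55 | 68.36 | 69.93 | 69.75 | 69.37 | 68.29 |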

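Table 3.6: Speaker identification accuracy (in %) for the female speakers. The K-means-BinSplitting columns are indexed by the number of components, and the fastSGML columns by the splitting confidence.

| Amount of training speech | Length of test utterance | K-means-BinSplitting, 8 | K-means-BinSplitting, 16 | K-means-BinSplitting, 32 | K-means-BinSplitting, 64 | SGML | fastSGML, 150 | fastSGML, 100 | fastSGML, 50 |
|---|---|---|---|---|---|---|---|---|---|
| 30 sec | variable-length | 29.87 | 32.21 | 35.07 | 30.37 | 30.54 | 33.05 | 29.87 | 30.70 |
| 30 sec | 3 sec | 27.48 | 28.46 | 30.41 | 29.50 | 29.09 | 29.30 | 27.66 | 29.23 |
| 30 sec | 5 sec | 29.87 | 30.55 | 32.25 | 30.49 | 30.67 | 31.29 | 28.76 | 31.08 |
| 30 sec | 8 sec | 31.11 | 31.90 | 33.95 | 32.00 | 33.00 | 33.00 | 30.74 | 32.79 |
| 60 sec | variable-length | 40.77 | 42.45 | 45.30 | 45.13 | 44.97 | 45.13 | 45.47 | 45.47 |
| 60 sec | 3 sec | 32.08 | 34.03 | 37.20 | 37.95 | 37.79 | 38.06 | 37.33 | 37.78 |
| 60 sec | 5 sec | 35.65 | 37.97 | 40.66 | 40.63 | 40.11 | 40.91 | 40.60 | 41.00 |
| 60 sec | 8 sec | 38.83 | 40.20 | 42.62 | 43.72 | 43.14 | 43.77 | 42.35 | 43.08 |
| 90 sec | variable-length | 42.95 | 43.62 | 48.83 | 48.49 | 47.32 | 47.48 | 47.65 | 48.49 |
| 90 sec | 3 sec | 32.12 | 35.41 | 39.03 | 40.66 | 41.78 | 40.78 | 40.64 | 41.00 |
| 90 sec | 5 sec | 35.62 | 38.74 | 42.92 | 44.47 | 44.34 | 44.53 | 44.59 | 45.24 |
| 90 sec | 8 sec | 38.52 | 42.35 | 46.19 | 47.08 | 47.50 | 47.08 | 46.82 | 47.24 |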
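As a small illustration of how the baseline has to be tuned by hand, the sketch below looks up the best K-means-BinSplitting component number for the variable-length test condition and reports the accuracy difference to SGML. It uses only the variable-length rows of Tables 3.5 and 3.6; the names `VARIABLE_LENGTH_ACCURACY` and `best_baseline` are illustrative.

```python
BASELINE_COMPONENTS = (8, 16, 32, 64)

# Variable-length rows of Tables 3.5 (male) and 3.6 (female), accuracy in %.
# Values: (K-means-BinSplitting accuracies for 8/16/32/64 components, SGML accuracy).
VARIABLE_LENGTH_ACCURACY = {
    "male": {30: ((61.46, 62.28, 61.46, 59.35), 61.95),
             60: ((63.25, 66.02, 66.34, 65.85), 66.67),
             90: ((65.53, 68.78, 69.11, 68.62), 70.08)},
    "female": {30: ((29.87, 32.21, 35.07, 30.37), 30.54),
               60: ((40.77, 42.45, 45.30, 45.13), 44.97),
               90: ((42.95, 43.62, 48.83, 48.49), 47.32)},
}


def best_baseline(gender, seconds):
    """Return (component number, accuracy) of the best baseline setting."""
    accuracies, _ = VARIABLE_LENGTH_ACCURACY[gender][seconds]
    accuracy, components = max(zip(accuracies, BASELINE_COMPONENTS))
    return components, accuracy


if __name__ == "__main__":
    for gender, rows in VARIABLE_LENGTH_ACCURACY.items():
        for seconds, (_, sgml) in rows.items():
            components, accuracy = best_baseline(gender, seconds)
            print(f"{gender:6s} {seconds} sec: best baseline {accuracy:.2f}% "
                  f"with {components} components, SGML {sgml:.2f}% "
                  f"(difference {sgml - accuracy:+.2f})")
```

For the 30-second rows this picks 16 components for the male speakers and 32 for the female speakers, matching the discussion of Tables 3.5 and 3.6 in the text.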

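Algorithm 1 Self-splitting Gaussian Mixture Learning Algorithm (SGML)

Require: The input data set $X = \{x_1, x_2, \ldots, x_N\}$

Ensure: The estimated parameter set $\hat{\Theta} = \{\hat{w}(1), \hat{w}(2), \ldots, \hat{w}(\mathit{bestNum}), \hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_{\mathit{bestNum}}\}$, where $\hat{w}(k)$ and $\hat{\theta}_k = \{\hat{\mu}_k, \hat{\Sigma}_k\}$ are the maximum likelihood estimates of the mixture weight, mean vector, and covariance matrix of the $k$th Gaussian component

Begin

1. Initialization:

   $\mathit{SRange} \leftarrow 5$;

   $\mathit{cNum} \leftarrow 1$;

   $\hat{\Theta} \leftarrow \{\hat{w}(1), \hat{\theta}_1\}$, where $\hat{\theta}_1 = \{\hat{\mu}_1, \hat{\Sigma}_1\}$ are the sample mean vector and sample covariance matrix of $X$;

   $\mathrm{GMM\_set}(1) \leftarrow \hat{\Theta}$; // $\mathrm{GMM\_set}(\mathit{cNum})$ is the parameter set of $\mathrm{GMM}_{\mathit{cNum}}$ over $X$

   $\mathrm{BIC\_set}(1) \leftarrow \mathrm{BIC}(\mathrm{GMM}_1, X)$;

2. Data clustering:

   $\mathrm{EM\_cluster}_k \leftarrow \emptyset$, for $k \leftarrow 1, 2, \ldots, \mathit{cNum}$;

   for each sample $x_i$:

      $j \leftarrow \arg\max_k p(k \mid x_i; \hat{\Theta})$;

      $\mathrm{EM\_cluster}_j \leftarrow \mathrm{EM\_cluster}_j \cup \{x_i\}$; // add $x_i$ to $\mathrm{EM\_cluster}_j$

3. Split (split one component into two new components):

   $\mathit{whichSplit} \leftarrow \arg\max_k \{\Delta\mathrm{BIC}_{21}(\mathrm{EM\_cluster}_k)\}$;

   Suppose the parameters of $\mathrm{GMM}_2$ corresponding to $\mathrm{EM\_cluster}_{\mathit{whichSplit}}$ are $\bar{\lambda}_1 \leftarrow \{\bar{w}(1), \bar{\theta}_1\}$, $\bar{\lambda}_2 \leftarrow \{\bar{w}(2), \bar{\theta}_2\}$, where $\bar{\theta}_k = \{\bar{\mu}_k, \bar{\Sigma}_k\}$, for $k \leftarrow 1, 2$;

   Let

      $\bar{w}(1) \leftarrow \tfrac{1}{2}\hat{w}(\mathit{whichSplit})$;

      $\bar{w}(2) \leftarrow \tfrac{1}{2}\hat{w}(\mathit{whichSplit})$;

   $\hat{\Theta} \leftarrow \hat{\Theta} \setminus \{\hat{w}(\mathit{whichSplit}), \hat{\theta}_{\mathit{whichSplit}}\}$; // remove $\{\hat{w}(\mathit{whichSplit}), \hat{\theta}_{\mathit{whichSplit}}\}$

   $\hat{\Theta} \leftarrow \hat{\Theta} \cup \{\bar{\lambda}_1, \bar{\lambda}_2\}$; // add $\{\bar{\lambda}_1, \bar{\lambda}_2\}$

   $\mathit{cNum} \leftarrow \mathit{cNum} + 1$;

4. Global EM learning:

   Perform EM learning on all the clusters with $\hat{\Theta}$ as the model initialization;

   $\mathrm{GMM\_set}(\mathit{cNum}) \leftarrow \hat{\Theta}$;

   $\mathrm{BIC\_set}(\mathit{cNum}) \leftarrow \mathrm{BIC}(\mathrm{GMM}_{\mathit{cNum}}, X)$;

   if ($\mathit{cNum} > \mathit{SRange}$ and $\mathrm{BIC\_set}(\mathit{cNum} - \mathit{SRange})$ is the maximum in $\mathrm{BIC\_set}$)

      $\mathit{bestNum} \leftarrow \mathit{cNum} - \mathit{SRange}$;

      $\hat{\Theta} \leftarrow \mathrm{GMM\_set}(\mathit{bestNum})$; goto End;

   else goto 2;

End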
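The stopping rule in Step 4 can be sketched in isolation as follows. This is a minimal illustration that assumes the BIC values are already available as a list in which entry `c - 1` holds the BIC of the GMM with `c` components, and that a larger BIC is better (as implied by the algorithm selecting the maximum). The function names `sgml_stop_check` and `first_stop` and the example curve are hypothetical.

```python
def sgml_stop_check(bic_set, s_range=5):
    """Apply the Step 4 test of Algorithm 1 to the BIC values collected so far.

    bic_set[c - 1] is the BIC of the GMM with c components, so
    len(bic_set) plays the role of cNum. Returns bestNum if learning
    should stop, otherwise None.
    """
    c_num = len(bic_set)
    if c_num <= s_range:
        return None
    best_num = c_num - s_range  # 1-based component number
    if bic_set[best_num - 1] == max(bic_set):
        return best_num
    return None


def first_stop(bic_curve, s_range=5):
    """Replay a BIC curve and return (cNum at stop, bestNum), or None."""
    for c_num in range(1, len(bic_curve) + 1):
        best_num = sgml_stop_check(bic_curve[:c_num], s_range)
        if best_num is not None:
            return c_num, best_num
    return None


if __name__ == "__main__":
    # Hypothetical BIC curve that peaks at 3 components.
    curve = [-10.0, -8.0, -7.0, -7.5, -7.6, -7.7, -7.8, -7.9]
    print(first_stop(curve))
```

With SRange set to 5, the procedure keeps splitting for five more components after a candidate maximum, which guards against stopping at a local peak of the BIC curve.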

Algorithm 2 Fast Self-splitting Gaussian Mixture Learning Algorithm (fastSGML)

Require: The input data set $X = \{x_1, x_2, \ldots, x_N\}$

   splitting confidence: the threshold for component splitting

Ensure: The estimated parameter set $\hat{\Theta} = \{\hat{w}(1), \hat{w}(2), \ldots, \hat{w}(\mathit{bestNum}), \hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_{\mathit{bestNum}}\}$

Begin

1. Initialization: the same as in SGML.

2. Data clustering: the same as in SGML.

3. Split:

   (a) $\mathit{temp} \leftarrow \mathit{cNum}$;

       for $k \leftarrow 1, 2, \ldots, \mathit{temp}$

          if ($\Delta\mathrm{BIC}_{21}(\mathrm{EM\_cluster}_k) >$ splitting confidence)

             split the component corresponding to $\mathrm{EM\_cluster}_k$ into two new components;

             $\mathit{cNum} \leftarrow \mathit{cNum} + 1$;

   (b) if (no component is split in (a))

          $\hat{\Theta} \leftarrow \mathrm{GMM\_set}(\mathit{cNum})$; goto End;

4. Global EM learning:

   Perform EM learning on the Gaussian components obtained in Step 3 (a);

   if (the learning curve starts to go down)

      let $\mathit{bestNum}$ be the component number that has the maximum value in the learning curve;

      $\hat{\Theta} \leftarrow \mathrm{GMM\_set}(\mathit{bestNum})$; goto End;

   else goto 2;

End
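The difference between the split steps of the two algorithms can be sketched as follows, assuming the values of ΔBIC₂₁ for the current clusters have already been computed. The function names and the example values are hypothetical, and cluster indices are 0-based here, whereas the algorithms above use 1-based indices.

```python
def sgml_split_choice(delta_bic):
    """Algorithm 1, Step 3: index of the single cluster with the largest delta-BIC."""
    return max(range(len(delta_bic)), key=delta_bic.__getitem__)


def fastsgml_split_choices(delta_bic, splitting_confidence):
    """Algorithm 2, Step 3(a): indices of all clusters whose delta-BIC
    exceeds the splitting confidence. An empty list corresponds to
    Step 3(b), where learning terminates."""
    return [k for k, d in enumerate(delta_bic) if d > splitting_confidence]


if __name__ == "__main__":
    # Hypothetical delta-BIC values for four clusters.
    delta_bic = [120.0, 40.0, 310.5, 99.9]
    print(sgml_split_choice(delta_bic))            # one split per iteration
    for confidence in (150, 100, 50):
        print(confidence, fastsgml_split_choices(delta_bic, confidence))
```

Because fastSGML may split several components in one pass, it needs fewer rounds of global EM learning; a lower splitting confidence admits more splits, which is in line with the larger component numbers reported for a confidence of 50 in Tables 3.1 and 3.2.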

Chapter 4 Model-based clustering by probabilistic self-organizing maps

4.1 Formulation of the coupling-likelihood mixture model for PbSOMs

In this thesis, we define a PbSOM as a SOM that consists of $G$ neurons $R = \{r_1, r_2, \ldots, r_G\}$ in a network with a neighborhood function $h_{kl}$ that defines the strength of lateral interaction between two neurons, $r_k$ and $r_l$, for $k, l \in \{1, 2, \ldots, G\}$; each neuron $r_k$ is associated with a reference model $\theta_k$ that represents some probability distribution in the data space.

Sum et al. [67] interpreted Kohonen's sequential SOM learning algorithm in terms of maximizing the local correlations (coupling energies) between the neurons and their neighborhoods with the given input data. Given a data sample $x_i \in X = \{x_1, x_2, \ldots, x_N\}$, the local coupling energy between $r_k$ and its neighborhood is defined as

\[
E_{x_i \mid k} = \sum_{l=1}^{G} h_{kl}\, r_k(x_i; \theta_k)\, r_l(x_i; \theta_l)
= r_k(x_i; \theta_k) \sum_{l=1}^{G} h_{kl}\, r_l(x_i; \theta_l),
\tag{4.1}
\]

where $r_k(x_i; \theta_k)$ denotes the response of neuron $r_k$ to $x_i$, which is modeled by an isotropic Gaussian density. Then, the coupling energy over the network for $x_i$ is defined as

\[
E_{x_i} = \sum_{k=1}^{G} E_{x_i \mid k},
\tag{4.2}
\]

and the energy function to be maximized is

\[
C = \sum_{i=1}^{N} \log E_{x_i}.
\tag{4.3}
\]

In Eq. (4.1), the term $\sum_{l=1}^{G} h_{kl}\, r_l(x_i; \theta_l)$ can be considered as the neighborhood response of $r_k$, where the conjunction between the neuron responses is implemented using the summing operation.
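Eqs. (4.1) to (4.3) can be evaluated numerically with a few matrix operations. The following is a minimal NumPy sketch, not the formulation's reference implementation: it assumes that all neurons share one isotropic standard deviation `sigma`, and the function names, the chain-shaped neighborhood matrix, and the random data are illustrative.

```python
import numpy as np


def isotropic_gaussian(x, means, sigma):
    """Responses r_k(x_i; theta_k) as an N x G matrix of isotropic Gaussian densities.

    Assumption: all neurons share the same standard deviation sigma.
    """
    d = x.shape[1]
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)
    return norm * np.exp(-sq_dist / (2.0 * sigma ** 2))


def coupling_energy(x, means, sigma, h):
    """Return (E_{x_i|k} as N x G, E_{x_i} as length N, C as a scalar)."""
    r = isotropic_gaussian(x, means, sigma)   # r[i, k]
    neighborhood = r @ h.T                    # sum_l h[k, l] * r[i, l]
    local = r * neighborhood                  # Eq. (4.1)
    per_sample = local.sum(axis=1)            # Eq. (4.2)
    return local, per_sample, np.log(per_sample).sum()  # Eq. (4.3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 2))               # 5 samples in 2 dimensions
    means = rng.normal(size=(3, 2))           # 3 neurons
    idx = np.arange(3)
    # Illustrative Gaussian neighborhood function on a 1-D chain of neurons.
    h = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / 2.0)
    local, per_sample, c = coupling_energy(x, means, 1.0, h)
    print(local.shape, per_sample.shape, c)
```

The product `r @ h.T` computes the neighborhood response of every neuron for every sample at once, so the pairwise structure of Eq. (4.1) is visible as an element-wise product of the response matrix with its neighborhood-smoothed version.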

Here, we express the neuron response $r_l(x_i; \theta_l)$ as a multivariate Gaussian distribution as in Eq. (2.12) and formulate the neighborhood response of $r_k$ as

\[
\prod_{l \neq k} r_l(x_i; \theta_l)^{h_{kl}},
\tag{4.4}
\]

where the conjunction between the neuron responses in the neighborhood of $r_k$ is implemented using the multiplicative operation. Then, for a given $x_i$, we define the local coupling energy between $r_k$ and its neighborhood as the following coupling-likelihood:

\[
\begin{aligned}
p_s(x_i \mid k; \Theta, h)
&= r_k(x_i; \theta_k)^{h_{kk}} \prod_{l \neq k} r_l(x_i; \theta_l)^{h_{kl}} \\
&= \prod_{l=1}^{G} r_l(x_i; \theta_l)^{h_{kl}} \\
&= \exp\Bigl( \sum_{l=1}^{G} h_{kl} \log r_l(x_i; \theta_l) \Bigr),
\end{aligned}
\tag{4.5}
\]

where $\Theta$ is the set of reference models, and $h$ denotes the given neighborhood function¹. Then, we define the coupling-likelihood of $x_i$ over the network as the following (unnormalized) mixture likelihood:

\[
p_s(x_i; \Theta, h) = \sum_{k=1}^{G} w_s(k)\, p_s(x_i \mid k; \Theta, h),
\tag{4.6}
\]

where $w_s(k)$ for $k = 1, 2, \ldots, G$ is fixed at $1/G$. Note that, theoretically, the mixture weights can be learned automatically. When maximizing the local coupling-likelihood $p_s(x_i \mid k; \Theta, h)$ for each neuron $r_k$, $k = 1, 2, \ldots, G$, the topological order between neuron $r_k$ and its neighborhood for the given data sample $x_i$ is learned in the learning process; therefore, we use equal mixture weights in the mixture model to take account of the topological order learning induced by the neurons faithfully (with equal prior importance). In fact, this is important for learning an ordered map. From our experimental analysis, if the mixture weights are updated in the learning process, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to obtain an ordered map. For details, one can refer to Appendix A.1 after reading this chapter.

¹ From Eq. (4.5), it is obvious that, in our formulation, the coupling between $r_k$ and its neighboring neurons is considered jointly, whereas Sum et al.'s formulation considers it in a pairwise manner, as shown in Eq. (4.1). Note that we use the term "coupling-likelihood" instead of "coupling energy" for two reasons: 1) Eq. (4.5) is a coupling of Gaussian likelihoods; and 2) using "coupling-likelihood" can help describe the link between our proposed approaches and model-based clustering.
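Because Eq. (4.5) is an exponential of a weighted sum of log-responses, the coupling-likelihood and the mixture likelihood of Eq. (4.6) are conveniently evaluated in the log domain. The sketch below is one possible way to do this with NumPy; for brevity it assumes diagonal covariance matrices rather than the general multivariate Gaussian of Eq. (2.12), and all function names are illustrative.

```python
import numpy as np


def log_gaussian_diag(x, means, variances):
    """log r_l(x_i; theta_l) as an N x G matrix.

    Assumption: diagonal covariances, given as a G x D array of variances.
    """
    diff = x[:, None, :] - means[None, :, :]
    terms = np.log(2.0 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    return -0.5 * terms.sum(axis=2)


def log_coupling_likelihood(log_r, h):
    """Eq. (4.5): log p_s(x_i | k) = sum_l h[k, l] * log r_l(x_i)."""
    return log_r @ h.T


def log_mixture_likelihood(log_r, h):
    """Eq. (4.6) with w_s(k) = 1/G, computed with the log-sum-exp trick."""
    log_ps = log_coupling_likelihood(log_r, h)
    g = h.shape[0]
    m = log_ps.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_ps - m).sum(axis=1))
    return log_sum - np.log(g)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(6, 2))
    means = rng.normal(size=(4, 2))
    variances = np.full((4, 2), 0.5)
    idx = np.arange(4)
    # Illustrative neighborhood function on a 1-D chain of neurons.
    h = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / 2.0)
    log_r = log_gaussian_diag(x, means, variances)
    print(log_mixture_likelihood(log_r, h))
```

Working with log-responses avoids numerical underflow when many small densities are multiplied together, which is what the product form of Eq. (4.5) would otherwise require.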

Comparing the network structure of the proposed coupling-likelihood mixture model in Eq. (4.6) with that of the Gaussian mixture model (GMM), as shown in Figure 4.1, the proposed model inserts a coupling-likelihood layer between the Gaussian likelihood layer and the mixture likelihood layer to take account of the coupling between the neurons and their neighborhoods. When the neighborhood size is reduced to zero (i.e., h_kl=δ_kl), the coupling-likelihood mixture model becomes a GMM with equal mixture weights.
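The reduction to a GMM follows directly from Eqs. (4.5) and (4.6). As a short worked step, substituting $h_{kl} = \delta_{kl}$ and $w_s(k) = 1/G$ gives:

```latex
\begin{align*}
p_s(x_i \mid k; \Theta, h)
  &= \prod_{l=1}^{G} r_l(x_i; \theta_l)^{\delta_{kl}}
   = r_k(x_i; \theta_k), \\
p_s(x_i; \Theta, h)
  &= \sum_{k=1}^{G} \frac{1}{G}\, r_k(x_i; \theta_k).
\end{align*}
```

Every factor with $l \neq k$ is raised to the power zero and drops out, so only the Gaussian likelihood of neuron $r_k$ itself remains in each mixture term.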

Note that other probability distributions are possible for r_l(x_i; θ_l) in the formulation of the coupling-likelihood mixture model, although we use the multivariate Gaussian distribution here.
