
4.7.3 Experiment results on Ecoli

We conducted experiments on Ecoli using the same algorithms that were applied to ImgSeg in Sections 4.7.1 and 4.7.2. In Figure 4.16, for each algorithm, we show the result at the σ value for which the class separability is best visualized on the network. The Ecoli data set comprises eight classes, labeled as follows: cp: C, im: I, pp: P, imU: U, om: O, omL: M, imL: L, and imS: S. The numbers of data samples are 143, 77, 52, 35, 20, 5, 2, and 2, respectively. The figure shows that the topological relationships among data samples and clusters are well preserved and that the data classes can be roughly separated on the network.

Figure 4.1: The network structure of (a) a Gaussian mixture model and (b) the proposed coupling-likelihood mixture model. Here, r_l(x_i; θ_l) denotes the multivariate Gaussian distribution described in Eq. (2.12).

[Figure 4.2 panels: (a) σ = 0.6; (b) σ = 0.4; (c) σ = 0.3; (d) σ = 0 (i.e., h_kl = δ_kl). Axes: µ1, µ2, and classification log-likelihood (× 10^4).]

Figure 4.2: SOCEM's objective function becomes more complex as the neighborhood size (σ in h_kl) is reduced.

[Figure 4.3 panels: (a) SOCEM, winner selection among the reference models r_1, r_2, ..., r_G for x_i; (b) SOEM, weighted winners with weights γ_{k|i}^{(t)}.]

Figure 4.3: For each data sample x_i, the adaptation of the reference models in SOCEM is restricted to the winning reference model and its neighborhood. In SOEM, however, the single winner is relaxed to weighted winners, with weights given by the posterior probabilities γ_{k|i}^{(t)}, for k = 1, 2, ..., G. Each data sample x_i contributes to the adaptation of each reference model and its neighborhood in proportion to these posterior probabilities.

[Figure 4.4 diagram. Nodes: SODAEM, SOCEM, SOEM, DAEM for GMM, CEM for GMM, EM for GMM. Edge labels: h_kl → δ_kl; 1/β → 0; 1/β → 1; topology-constrained annealing.]

Figure 4.4: The family of Gaussian model-based clustering algorithms derived from the SODAEM, SOEM, and SOCEM algorithms. δ_kl = 1 if k = l; otherwise, δ_kl = 0.

[Figure 4.5 panels: (a) random ini.; (b) σ = 0.15 with rand. ini.; (c) σ = 0.6 with rand. ini.; (d) σ = 0.45; (e) σ = 0.3; (f) σ = 0.15.]

Figure 4.5: The map-learning process obtained by running the SOCEM algorithm on the synthetic data. Simulation 1 ((a)-(b)): When SOCEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in decrements of 0.15.

[Figure 4.6 panels: (a) random ini.; (b) σ = 0.15 with rand. ini.; (c) σ = 0.6 with rand. ini.; (d) σ = 0.45; (e) σ = 0.3; (f) σ = 0.15.]

Figure 4.6: The map-learning process obtained by running the SOEM algorithm on the synthetic data. Simulation 1 ((a)-(b)): When SOEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in decrements of 0.15.

[Figure 4.7 panels: (a) random ini.; (b) σ = 0.15, β = 0.16; (c) σ = 0.15, β = 0.256; (d) σ = 0.15, β = 0.409; (e) σ = 0.15, β = 0.655; (f) σ = 0.15, β = 1.04; (g) σ = 0.15, β = 2.68; (h) σ = 0.15, β = 6.871; (i) σ = 0.15, β = 17.592.]

Figure 4.7: The map-learning process obtained by running the SODAEM algorithm on the synthetic data. The value of σ is fixed at 0.15, while the value of β is initialized at 0.16 and repeatedly multiplied by 1.6 until it reaches 17.592.

[Figure 4.8 panels: (a) random ini.; (b) σ = 0.15 with rand. ini.; (c) σ = 0.6 with rand. ini.; (d) σ = 0.45; (e) σ = 0.3; (f) σ = 0.15.]

Figure 4.8: The map-learning process obtained by running the SOCEM algorithm on PenRecDigits C0. Simulation 1 ((a)-(b)): When SOCEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in decrements of 0.15.

[Figure 4.9 panels: (a) random ini.; (b) σ = 0.15 with rand. ini.; (c) σ = 0.6 with rand. ini.; (d) σ = 0.45; (e) σ = 0.3; (f) σ = 0.15.]

Figure 4.9: The map-learning process obtained by running the SOEM algorithm on PenRecDigits C0. Simulation 1 ((a)-(b)): When SOEM is run with the random initialization in (a) and σ = 0.15, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 0.6 and the random initialization in (a). Then, the value of σ is reduced to 0.15 in decrements of 0.15.

[Figure 4.10 panels: (a) random ini.; (b) σ = 0.15, β = 0.16; (c) σ = 0.15, β = 0.256; (d) σ = 0.15, β = 0.409; (e) σ = 0.15, β = 0.655; (f) σ = 0.15, β = 1.04; (g) σ = 0.15, β = 2.68; (h) σ = 0.15, β = 6.871; (i) σ = 0.15, β = 17.592.]

Figure 4.10: The map-learning process obtained by running the SODAEM algorithm on PenRecDigits C0. The value of σ is fixed at 0.15, while the value of β is initialized at 0.16 and repeatedly multiplied by 1.6 until it reaches 17.592.

Figure 4.11: The data clustering performance of CEM, DAEM C, SOCEM, SODAEM C, and KohonenGaussian on ImgSeg in terms of the classification log-likelihood.

Figure 4.12: Learning a Gaussian mixture model by applying EM, DAEM M, SOEM, and SODAEM M to ImgSeg.

Figure 4.13: The data clustering performance on Ecoli in terms of (a) the classification log-likelihood and (b) the log mixture-likelihood.

[Figure 4.14 panel titles recovered: (b) KohonenGaussian (σ = 0); (f) SODAEM_C (β = 10, σ = 0).]

Figure 4.14: Data visualization for ImgSeg by running KohonenGaussian ((b)), SOCEM ((c), (d)), and SODAEM_C ((e), (f)) with the random initialization in (a). The network structure is a 7 × 7 equally spaced square lattice in a unit square.

Figure 4.15: Data visualization for ImgSeg by running SOEM ((a), (b)) and SODAEM_M ((c), (d)) with the random initialization in Figure 4.14 (a). The network structure is a 7 × 7 equally spaced square lattice in a unit square.

[Figure 4.16 panel titles recovered: (a) random ini.; (b) KohonenGaussian (σ = 0.06).]

Figure 4.16: Data visualization for Ecoli by running (b) KohonenGaussian, (c) SOCEM, (d) SOEM, (e) SODAEM_C, and (f) SODAEM_M with the random initialization in (a). The network structure is a 7 × 7 equally spaced square lattice in a unit square.

Chapter 5

BIC-based audio segmentation using divide-and-conquer

The goal of audio segmentation is to detect acoustic changes in an audio stream, e.g., boundaries between two speakers or between two environmental conditions. In the last decade, researchers in the speech processing community have devoted much effort to this problem because of its potential applications in many speech and audio processing tasks, such as audio indexing [97], automatic transcription of audio recordings [98], speaker tracking [99], and speaker diarization [100]. Existing audio segmentation approaches generally fall into two categories, namely, distance-based segmentation [101, 70, 102, 71, 103, 104, 105] and model-decoding-based segmentation [70, 106].

In distance-based segmentation, a distance measure between two audio segments is first defined, and an acoustic change detection strategy is then designed based on that measure. Compared to model-decoding-based segmentation, these methods have the great advantage of not requiring a priori knowledge about the content of the input audio stream. It is assumed that the acoustic feature vectors in each of the two audio segments are drawn from a probability distribution (e.g., a multivariate Gaussian). The distance between the two segments is then represented as the dissimilarity between the two distributions. Many distance measures have been investigated, e.g., the Kullback-Leibler distance (KL or KL2) [101], the Generalized Likelihood Ratio (GLR) [104], ∆BIC [70, 71], the Mahalanobis distance, and the Bhattacharyya distance [105].

Fixed-size sliding window detection [101, 104, 105] and BIC-based growing-size sliding window detection [70, 102, 71, 103, 107] are the two leading approaches in distance-based segmentation. In the fixed-size sliding window detection approach, as shown in Figure 5.1, a distance measure is used to evaluate the dissimilarity between two adjacent windows that slide along the audio stream, which produces a distance curve. This distance curve is often low-pass filtered. The peaks of the curve are then accepted or rejected as acoustic changes according to heuristic thresholds. This method has the advantage of a low computational cost. However, in order to detect the change boundary associated with a short homogeneous segment, the size of the analysis window is usually set to a small value (e.g., two seconds). This creates a dilemma, because a small analysis window does not contain enough feature vectors to yield a reliable distance statistic.
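The fixed-size sliding window procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the distance is a symmetric KL divergence (KL2) between diagonal Gaussians, the low-pass filter is a moving average, and the threshold (mean plus one standard deviation of the curve) is a hypothetical heuristic. The function names and the parameters `win`, `smooth`, and `threshold` are likewise introduced only for this example.

```python
import numpy as np

def kl2_diag(a, b, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fitted to a and b."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    var_a, var_b = a.var(axis=0) + eps, b.var(axis=0) + eps
    diff2 = (mu_a - mu_b) ** 2
    return 0.5 * np.sum(var_a / var_b + var_b / var_a - 2.0
                        + diff2 * (1.0 / var_a + 1.0 / var_b))

def distance_curve(features, win):
    """Distance between the two adjacent windows that meet at each frame t."""
    centers = np.arange(win, len(features) - win + 1)
    dists = np.array([kl2_diag(features[t - win:t], features[t:t + win])
                      for t in centers])
    return centers, dists

def detect_changes(features, win=200, smooth=25, threshold=None):
    """Return frame indices of smoothed distance peaks above a threshold."""
    centers, d = distance_curve(features, win)
    # Moving-average low-pass filter (an assumed, simple choice of filter).
    d = np.convolve(d, np.ones(smooth) / smooth, mode="same")
    if threshold is None:
        # Hypothetical heuristic threshold; real systems tune this on data.
        threshold = d.mean() + d.std()
    inner = d[1:-1]
    is_peak = (inner > d[:-2]) & (inner >= d[2:]) & (inner > threshold)
    return centers[1:-1][is_peak]
```

With a window of `win` frames on each side, a homogeneous segment shorter than `win` never fills a window on its own, which is the dilemma noted above: shrinking `win` resolves short segments but leaves fewer frames for estimating each Gaussian.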

BIC-based growing-size sliding window detection was first proposed by Chen and Gopalakrishnan [70]. As the distance measure between two audio segments, they used the Bayesian Information Criterion (BIC) [41] to evaluate the following two hypotheses: 1) the union of the feature vectors of the two segments forms a single Gaussian cluster in the feature space; 2) the feature vectors of each segment form a distinct Gaussian cluster. The difference between the two evaluation scores, ∆BIC, was then used as the distance measure. In their acoustic change detection procedure, a small analysis window is initially placed at the beginning of the audio stream. If no change point is detected in the analysis window, the window is enlarged to cover a larger search range. However, as the window grows, this approach incurs a heavy computational cost because of the numerous ∆BIC calculations, particularly when the audio stream contains many long homogeneous segments. To reduce the computational cost, Tritschler and Gopinath [102] proposed heuristics that skip the distance computations at locations where acoustic changes are unlikely to occur. Zhou and Hansen [107] used the computationally cheap Hotelling's T²-statistic as the distance measure in the detection process, and used ∆BIC only to verify the acoustic change candidates. In [71] and [103], the authors proposed more efficient implementations of the ∆BIC calculation that do not affect the detection accuracy. Since the growing-size sliding window detection approach detects acoustic changes using a size-growing analysis window, we refer to it as window-growing-based segmentation (WinGrow).

In this thesis, we propose two divide-and-conquer approaches that detect acoustic changes by recursively partitioning a large analysis window into two sub-windows using ∆BIC, rather than by growing an analysis window. To compare efficiency, we analyze the computational costs of the approaches and report their respective run times in the experiments. The experiment results on broadcast news data show that the proposed recursive (top-down) multiple-change-point detection strategies are more effective and efficient than WinGrow's bottom-up multiple-change-point detection strategy.
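To make the top-down idea concrete, the sketch below recursively splits a window at the point of maximum ∆BIC and stops when no split has a positive ∆BIC. It is only a generic illustration of recursive partitioning and should not be read as either of the two approaches proposed in Section 5.2. The stopping rule (∆BIC > 0), the minimum sub-window length `margin`, the candidate spacing `step`, and the use of maximum-likelihood (1/n) covariance estimates are all assumptions made for this example; ∆BIC itself follows Eq. (5.2).

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC of Eq. (5.2), using maximum-likelihood (1/n) covariances."""
    def half_n_logdet(a):
        cov = np.atleast_2d(np.cov(a, rowvar=False, bias=True))
        return 0.5 * len(a) * np.linalg.slogdet(cov)[1]
    n, d = len(x) + len(y), x.shape[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (half_n_logdet(np.vstack([x, y]))
            - half_n_logdet(x) - half_n_logdet(y) - penalty)

def split_recursively(feats, lo, hi, margin=50, step=1, lam=1.0):
    """Return sorted change points found in feats[lo:hi] by top-down splitting."""
    # Candidate split points keep at least `margin` frames on each side.
    candidates = range(lo + margin, hi - margin + 1, step)
    if len(candidates) == 0:
        return []
    scores = [delta_bic(feats[lo:t], feats[t:hi], lam) for t in candidates]
    best = int(np.argmax(scores))
    if scores[best] <= 0.0:
        # Assumed stopping rule: no split is better than a single Gaussian.
        return []
    t = candidates[best]
    left = split_recursively(feats, lo, t, margin, step, lam)
    right = split_recursively(feats, t, hi, margin, step, lam)
    return left + [t] + right
```

In this sketch a homogeneous window is scanned once and then abandoned, so long homogeneous segments are not rescanned as they would be under a growing window.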

To help explain our proposed approaches, we review the ∆BIC distance measure and the WinGrow approach in Section 5.1. We then present the proposed divide-and-conquer approaches for audio segmentation in Section 5.2. In Section 5.3, we analyze the computational costs of the baseline approaches and the proposed approaches. Section 5.4 details the experiments on audio segmentation.

[Figure 5.1 diagram labels: audio stream, sliding windows, distance measure, distance curve.]

Figure 5.1: The fixed-size sliding window detection approach.

5.1 Window-growing-based segmentation

5.1.1 ∆BIC as the distance measure of two audio segments

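Given two audio segments represented by feature vectors, $\mathcal{X} = \{x_1, x_2, \cdots, x_{n_x}\} \subset \mathbb{R}^d$ and $\mathcal{Y} = \{y_1, y_2, \cdots, y_{n_y}\} \subset \mathbb{R}^d$, we evaluate the following two hypotheses [70]:

\[
\begin{aligned}
H_0 &: x_1, x_2, \cdots, x_{n_x}, y_1, y_2, \cdots, y_{n_y} \sim \mathcal{N}(\mu, \Sigma), \\
H_1 &: x_1, x_2, \cdots, x_{n_x} \sim \mathcal{N}(\mu_{\mathcal{X}}, \Sigma_{\mathcal{X}});\quad y_1, y_2, \cdots, y_{n_y} \sim \mathcal{N}(\mu_{\mathcal{Y}}, \Sigma_{\mathcal{Y}}).
\end{aligned}
\tag{5.1}
\]

$H_0$ posits that $\mathcal{X}$ and $\mathcal{Y}$ are derived from the same multivariate Gaussian, while $H_1$ posits that they are derived from two distinct multivariate Gaussians.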

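Let $\mathcal{Z} = \mathcal{X} \cup \mathcal{Y}$ and $n = n_x + n_y$. Then, the ∆BIC value can be computed as the difference between the BIC values of $H_1$ and $H_0$ as follows:

\[
\begin{aligned}
\Delta\mathrm{BIC}\{\mathcal{X}, \mathcal{Y}\}
&= \mathrm{BIC}(H_1, \mathcal{Z}) - \mathrm{BIC}(H_0, \mathcal{Z}) \\
&= \log p(\mathcal{X}; \hat{\mu}_{\mathcal{X}}, \hat{\Sigma}_{\mathcal{X}}) + \log p(\mathcal{Y}; \hat{\mu}_{\mathcal{Y}}, \hat{\Sigma}_{\mathcal{Y}}) - \log p(\mathcal{Z}; \hat{\mu}, \hat{\Sigma}) - \frac{1}{2}\lambda\Big(d + \frac{1}{2}d(d+1)\Big)\log n \\
&= \frac{n}{2}\log|\hat{\Sigma}| - \frac{n_x}{2}\log|\hat{\Sigma}_{\mathcal{X}}| - \frac{n_y}{2}\log|\hat{\Sigma}_{\mathcal{Y}}| - \frac{1}{2}\lambda\Big(d + \frac{1}{2}d(d+1)\Big)\log n,
\end{aligned}
\tag{5.2}
\]

where $\hat{\mu}$, $\hat{\mu}_{\mathcal{X}}$, and $\hat{\mu}_{\mathcal{Y}}$ are, respectively, the sample mean vectors of $\mathcal{Z}$, $\mathcal{X}$, and $\mathcal{Y}$; $\hat{\Sigma}$, $\hat{\Sigma}_{\mathcal{X}}$, and $\hat{\Sigma}_{\mathcal{Y}}$ are, respectively, the sample covariance matrices of $\mathcal{Z}$, $\mathcal{X}$, and $\mathcal{Y}$; and $d$ is the dimension of the feature vector [71]¹. The larger the value of ∆BIC, the less similar the two segments are, and hence the larger the distance between them. When λ = 0, the ∆BIC distance between two segments is equivalent to the GLR distance [70, 108].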
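As a worked illustration of the last line of Eq. (5.2), the following sketch computes ∆BIC directly from the log-determinants of the three covariance matrices. It assumes maximum-likelihood (1/n) covariance estimates, for which the constant terms of the three Gaussian log-likelihoods cancel exactly; the function names and the `lam` parameter (standing for λ) are chosen for this example only.

```python
import numpy as np

def log_det_ml_cov(x):
    """log|Sigma_hat| for the maximum-likelihood (1/n) covariance of the rows of x."""
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("singular covariance; use more frames than dimensions")
    return logdet

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between segments x (n_x by d) and y (n_y by d), as in Eq. (5.2)."""
    nx, d = x.shape
    ny = y.shape[0]
    n = nx + ny
    z = np.vstack([x, y])  # Z is the union of X and Y
    # Penalty for the extra d + d(d+1)/2 free parameters under H1.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * log_det_ml_cov(z)
            - 0.5 * nx * log_det_ml_cov(x)
            - 0.5 * ny * log_det_ml_cov(y)
            - penalty)
```

Calling `delta_bic(x, y, lam=0.0)` drops the penalty term and leaves the GLR distance, which is never negative; with λ > 0, a positive value favors H1 (two distinct Gaussians) over H0.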