
5.4.2 Experiments on broadcast news data

Data set description: We evaluated FixSlid, WinGrow, and the proposed methods on two broadcast news data sets. Three one-hour broadcast news programs (PTSND-20011203, PTSND-20011204, and PTSND-20011205) selected from the MATBN corpus [109] were used as the development set (denoted as MATBN3hr). We used the 1998 DARPA/NIST HUB-4 broadcast news evaluation test data set [110], which is comprised of two 1.5-hour audio streams, as the evaluation set (denoted as HUB4-98). There were 1386 and 1184 change points in MATBN3hr and HUB4-98, respectively. Figure 5.10 shows the empirical cumulative distributions of the size of homogeneous segments in the two data sets. As shown in the figure, the average length of the segments in HUB4-98 is longer than that in MATBN3hr.
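As an illustration of how an empirical distribution such as the one in Figure 5.10 can be derived from annotated change points, the following minimal Python sketch computes segment lengths and their empirical CDF. The function names and the toy change points are hypothetical and are not taken from the thesis; the sketch also assumes that the start and end of the stream are not counted as change points.

```python
from bisect import bisect_right


def segment_lengths(change_points, stream_end, stream_start=0.0):
    """Lengths (in seconds) of the homogeneous segments delimited by change points."""
    bounds = [stream_start] + sorted(change_points) + [stream_end]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def empirical_cdf(lengths):
    """Return F, where F(x) is the fraction of segments whose length is <= x."""
    ordered = sorted(lengths)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


# Hypothetical change points (seconds) in a 60-second stream, for illustration only.
lengths = segment_lengths([4.0, 10.0, 25.0, 31.0], stream_end=60.0)
F = empirical_cdf(lengths)
```

Under these assumptions, a stream with n annotated change points yields n + 1 segments, and a curve that rises more slowly with x indicates longer segments on average.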

[Figure 5.10 plot: empirical CDF value F(x) versus x = size of segment (seconds), with one curve each for MATBN3hr and HUB4-98.]

Figure 5.10: The empirical cumulative distributions of the size of homogeneous segments in MATBN3hr and HUB4-98.

Parameter setting: For FixSlid, we used the GLR distance as the distance measure between two adjacent windows. In the experiments, the window size was fixed at two seconds, and the value of α used to evaluate the “significant” local maximum, as shown in Figure 5.11, was set at 0.4 initially and increased to 2 in 0.05 increments to obtain the ROC curve. For WinGrow, the values of Ng and Ns were set at one second and Nmax/4 seconds, respectively, and the values of Nini and Nmax were tuned with the development set. For SeqDACDec1 and SeqDACDec2, η was fixed at 0.25, and L and Nmin in DACDec1 and DACDec2 were tuned with the development set. In all the approaches except FixSlid, the penalty factor λ in the ∆BIC computation was set at 0.7 initially and increased to 1.7 in 0.05 increments to obtain the ROC curves. The GLR or ∆BIC distance was evaluated every 0.1 seconds in all the approaches; that is, the resolution for change point detection was 0.1 seconds. However, the tolerance ξ for counting miss detections and false alarms was set at one second rather than 0.5 seconds. We made this change because of the limited precision of the human reference annotation.
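The parameter sweeps and the tolerance-based counting described above can be sketched in Python as follows. This is only an illustrative sketch: the one-to-one matching rule (each reference change point can be matched by at most one detection within ξ), the function name, and the toy time stamps are assumptions, not details confirmed by the thesis.

```python
# Sweep grids taken from the text: alpha from 0.4 to 2 and lambda from 0.7 to 1.7,
# both in 0.05 increments. Integer steps avoid accumulating floating-point error.
alpha_grid = [round(0.4 + 0.05 * k, 2) for k in range(33)]
lambda_grid = [round(0.7 + 0.05 * k, 2) for k in range(21)]


def count_errors(reference, detected, tolerance=1.0):
    """Count miss detections and false alarms given a tolerance in seconds.

    Assumption: a detection matches a reference change point if they are at most
    `tolerance` apart, and each point is used in at most one match.
    """
    ref = sorted(reference)
    det = sorted(detected)
    i = j = hits = 0
    while i < len(ref) and j < len(det):
        if abs(ref[i] - det[j]) <= tolerance:
            hits += 1          # matched pair
            i += 1
            j += 1
        elif det[j] < ref[i]:
            j += 1             # detection too early for any remaining reference
        else:
            i += 1             # reference too early for any remaining detection
    misses = len(ref) - hits
    false_alarms = len(det) - hits
    return misses, false_alarms


# Hypothetical time stamps (seconds), for illustration only.
reference = [5.0, 12.0, 20.0]
detected = [5.6, 13.5, 19.2, 30.0]
loose = count_errors(reference, detected, tolerance=1.0)
strict = count_errors(reference, detected, tolerance=0.5)
```

In this toy example, the one-second tolerance accepts two of the detections, whereas the 0.5-second tolerance accepts none, which shows how sensitive the error counts can be to ξ when the reference annotation is imprecise.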

Figure 5.11: A significant local maximum on the distance curve.

Results: We first evaluated all the approaches on MATBN3hr. In these experiments, we found that it was appropriate to set Nini at three seconds and Nmax at 20 seconds for WinGrow. For both SeqDACDec1 and SeqDACDec2, it was appropriate to set Nmin at two seconds and L at 20 seconds. Figure 5.12 (a) shows the ROC curves obtained by SeqDACDec1 with analysis windows of different sizes. Unlike the results for the synthetic data in Figure 5.9, the results with 10-second and 20-second analysis windows are similar. This is because, in the broadcast news data, homogeneous segments within a 10-second or 20-second analysis window are usually derived from different acoustic sources or speakers. For SeqDACDec2, the results for 10-second, 20-second, and 30-second analysis windows are similar, as shown in Figure 5.12 (b). The ROC curves obtained by all the approaches are shown in Figure 5.12 (c). We observe that the proposed approaches, namely SeqDACDec1 and SeqDACDec2, outperform the other approaches. Table 5.1 shows the CPU times of all the approaches in the EER case. The programs were run on a PC with a 3.2GHz Intel Pentium IV CPU. From the table, we observe that SeqDACDec1 and SeqDACDec2 are more efficient than WinGrow.

Table 5.1: The CPU time of different audio segmentation approaches evaluated on MATBN3hr in the EER case and the associated EERs, where M and F denote the miss detection rate and the false alarm rate, respectively.

| Approach | WinGrow | SeqDACDec1 | SeqDACDec2 | FixSlid |
|---|---|---|---|---|
| CPU time (sec) | 5162.08 | 1911.17 | 3386.84 | 221.28 |
| Speedup over WinGrow | 1 | 2.70 | 1.52 | 23.33 |
| EER (in %), M | 18.69 | 17.03 | 17.39 | 27.13 |
| EER (in %), F | 16.46 | 17.94 | 15.23 | 25.76 |

Next, we conducted experiments on HUB4-98 with the parameters tuned with the MATBN3hr data set. Figure 5.13 shows the ROC curves for all approaches; we see that SeqDACDec1 and SeqDACDec2 achieve the best performance. Table 5.2 summarizes the CPU time required by different approaches in the EER case. Comparing Table 5.2 to Table 5.1, it is clear that every approach achieves a higher speedup over WinGrow on HUB4-98 than on MATBN3hr. This is because the homogeneous segments in HUB4-98 are longer than those in MATBN3hr on average, as shown in Figure 5.10, and these approaches achieve a higher speedup over WinGrow for an audio stream comprised of longer homogeneous segments, as mentioned in Section 5.3 (cf. Eqs. (5.6), (5.10), (5.11), and (5.14)).

Table 5.2: The CPU time of different audio segmentation approaches evaluated on HUB4-98 in the EER case and the associated EERs.

| Approach | WinGrow | SeqDACDec1 | SeqDACDec2 | FixSlid |
|---|---|---|---|---|
| CPU time (sec) | 8418.23 | 2003.62 | 3853.48 | 201.57 |
| Speedup over WinGrow | 1 | 4.20 | 2.18 | 41.76 |
| EER (in %), M | 28.8 | 27.96 | 27.19 | 35.56 |
| EER (in %), F | 31.08 | 25.67 | 26.37 | 38.04 |
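The speedup rows in Tables 5.1 and 5.2 are consistent with dividing the CPU time of WinGrow by the CPU time of each approach. The short Python sketch below reproduces that arithmetic from the reported CPU times; the dictionary layout and the function name are illustrative choices, and rounding to two decimals is assumed to match the tables.

```python
# CPU times in seconds, copied from Tables 5.1 (MATBN3hr) and 5.2 (HUB4-98).
cpu_seconds = {
    "MATBN3hr": {"WinGrow": 5162.08, "SeqDACDec1": 1911.17,
                 "SeqDACDec2": 3386.84, "FixSlid": 221.28},
    "HUB4-98": {"WinGrow": 8418.23, "SeqDACDec1": 2003.62,
                "SeqDACDec2": 3853.48, "FixSlid": 201.57},
}


def speedups(times, baseline="WinGrow"):
    """Speedup of each approach relative to the baseline, rounded to two decimals."""
    return {name: round(times[baseline] / t, 2) for name, t in times.items()}


matbn = speedups(cpu_seconds["MATBN3hr"])
hub4 = speedups(cpu_seconds["HUB4-98"])

# Approaches whose speedup over WinGrow is higher on HUB4-98 than on MATBN3hr.
higher_on_hub4 = [name for name in matbn
                  if name != "WinGrow" and hub4[name] > matbn[name]]
```

With these inputs, all three non-baseline approaches appear in `higher_on_hub4`, in line with the observation that the speedup grows when the homogeneous segments are longer.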

[Figure 5.12 plots: three ROC panels with axes Miss Detection Rate (%) and False Alarm Rate (%), each ranging from 5 to 40. Panel (a): SeqDACDec1 with L = 10, 20, and 30 sec. Panel (b): SeqDACDec2 with L = 10, 20, and 30 sec. Panel (c): SeqDACDec1 with L = 20 sec; SeqDACDec2 with L = 20 sec; WinGrow with Nmax = 20 sec; FixSlid with α = 0.4, 0.45, ..., 2.]

Figure 5.12: The ROC curves for MATBN3hr obtained by (a) SeqDACDec1 with Nmin = 2 seconds and analysis windows of different sizes; (b) SeqDACDec2 with Nmin = 2 seconds and analysis windows of different sizes; and (c) SeqDACDec1 with Nmin = 2 seconds and L = 20 seconds, SeqDACDec2 with Nmin = 2 seconds and L = 20 seconds, WinGrow with Nini = 3 seconds and Nmax = 20 seconds, and FixSlid with a 2-second window.

[Figure 5.13 plot: ROC curves with axes Miss Detection Rate (%) and False Alarm Rate (%), each ranging from 0 to 50, for SeqDACDec1 with L = 20 sec; SeqDACDec2 with L = 20 sec; WinGrow with Nmax = 20 sec; and FixSlid with α = 0.4, 0.45, ..., 2.]

Figure 5.13: The ROC curves for HUB4-98.

Chapter 6

Conclusion and future work

6.1 Conclusion

In this thesis, we propose new learning algorithms for probabilistic model-based clustering. The proposed SGML algorithm tackles two long-standing critical problems in EM-based Gaussian mixture modeling, namely, 1) the difficulty in determining the number of Gaussian components and 2) the sensitivity to model initialization. A fast version of SGML, called fastSGML, is also presented. It splits multiple components in each splitting step and thus incurs a much lower computational cost than SGML. We conducted experiments on the clustering of a synthetic data set and on a speaker identification task. The experimental results on the speaker identification task show that the proposed algorithms can automatically determine an appropriate model complexity for speaker GMMs, though no significant improvements in identification accuracy are obtained compared to the best performance of the baseline systems.

Considering the learning of a probabilistic self-organizing map (PbSOM) as a model-based clustering process, we develop a coupling-likelihood mixture model for PbSOM, and derive three EM-type learning algorithms, namely the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM). The proposed algorithms improve Kohonen’s learning algorithms by including a cost function, an EM-based convergence property, and a probabilistic framework. In addition, the proposed algorithms provide some insights into the choice of neighborhood size that would ensure topographic ordering. From the experimental results, we observe that the learning performance of SOCEM is very sensitive to the initial setting of the reference models when the neighborhood is small. Conversely, it is not sensitive to the initial condition when the neighborhood is sufficiently large. To deal with the initialization problem, we first run SOCEM with a large neighborhood, and then gradually reduce the neighborhood size until the learning converges to the desired map. When using a small neighborhood, SOEM is less sensitive to the initialization than SOCEM. However, to learn an ordered map, SOEM still needs to start with a large neighborhood. In both SOCEM and SOEM, the neighborhood shrinking can be interpreted as an annealing process that overcomes the initialization issue. Alternatively, we can apply SODAEM, which is a deterministic annealing variant of SOCEM and SOEM, to learn a map. In our experiments, SODAEM overcomes the initialization issue of SOCEM and SOEM via the annealing process controlled by the temperature parameter. Moreover, through the comparison of SOCEM and Kohonen’s batch algorithm, we can also apply the DA interpretation of neighborhood shrinking to Kohonen’s algorithms to explain why they need to start with a large neighborhood size. We have also shown that the SOCEM and SOEM algorithms can be interpreted, respectively, as topology-constrained deterministic annealing variants of the CEM and EM algorithms for Gaussian model-based clustering. The experimental results show that our proposed PbSOM learning algorithms achieve an effective data clustering performance, while maintaining the topology-preserving property.

Moreover, we propose two BIC-based audio segmentation approaches that employ divide-and-conquer strategies for acoustic change detection. In contrast to the leading and highly accurate window-growing-based approach, which searches for acoustic changes in a bottom-up manner by using a sequentially size-growing analysis window, the proposed DACDec1 and DACDec2 approaches search for acoustic changes in a top-down manner. We compared our approaches to leading approaches analytically by performing a computational cost analysis. The results of experiments conducted on broadcast news data demonstrate that the proposed approaches are more efficient and achieve higher segmentation accuracy than the existing approaches discussed in this thesis.