Audio event detection (AED) shares a key problem with music event detection: the lack of detailed annotations for training models. We therefore extend the proposed methodology to weakly-supervised audio event detection [75], a work of which I am the second author. In that paper, the model is essentially the one I proposed for music event detection, and the experiments were conducted by Ting-Wei Su. The remainder of this section presents the main results; for details, please refer to the original paper [75].
The goal of audio event detection, also called sound event detection, is to detect sound events in daily life, such as screaming, shouting, and gunshots for security applications [88–90], or breathing and snoring for medical purposes [91].
We build and evaluate such models with UrbanSound and UrbanSound8K datasets [92].
The model differs from the one used for music event detection in two respects.
First, max pooling is used for global pooling because audio events are often much shorter than music events. Using average pooling could include too many false positives and degrade the performance. Second, data augmentation is used because audio events in the real world vary in intensity.
Similar to music events, audio events may have different temporal properties. Some events last only a few frames, like a dog bark, while others can last for several minutes, such as the sound of an air conditioner in operation. To deal with this, a Gaussian filter layer is inserted between the final late convolutional layer and the final max-pooling layer, as was done for music audio.
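One way to read the motivation for max pooling is that a short event contributes very little to an averaged clip-level score, so matching a positive clip label with average pooling would push the network to activate on many non-event frames. The following minimal sketch illustrates this with made-up frame-level scores; the numbers are hypothetical and are not taken from the experiments.

```python
import numpy as np

# Hypothetical frame-level scores for one class over 100 frames:
# a 3-frame event on top of a low background activation.
scores = np.full(100, 0.05)
scores[40:43] = 0.9

clip_score_max = scores.max()   # global max pooling over time
clip_score_avg = scores.mean()  # global average pooling over time

print(f"max pooling: {clip_score_max:.4f}")
print(f"avg pooling: {clip_score_avg:.4f}")
```

With max pooling the clip-level score is driven by the strongest frame, whereas the averaged score stays close to the background level.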
This Gaussian filter layer is implemented with convolution, but the filter weights are set to fit a centered Gaussian distribution:
$$ g[t] = \frac{1}{\sqrt{2\sigma^2\pi}} \, e^{-\frac{t^2}{2\sigma^2}}, \tag{3.1} $$
where σ is the standard deviation of the Gaussian distribution. The filter g is slid along the signal s over a window of M frames, where M is a pre-determined filter size.
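The Gaussian filtering described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the kernel is sampled on M taps centered on the current frame, the output is cropped to the input length, and σ is a fixed number here, whereas in the model it is a per-class parameter learned during training. The function names `gaussian_kernel` and `gaussian_filter_layer` are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma: float, size: int) -> np.ndarray:
    """Sample g[t] from Eq. (3.1) on `size` taps centered at t = 0 (assumed layout)."""
    t = np.arange(size) - (size - 1) / 2.0
    return np.exp(-t**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)

def gaussian_filter_layer(s: np.ndarray, sigma: float, size: int = 32) -> np.ndarray:
    """Convolve a 1-D frame-level signal with the Gaussian kernel.

    mode="same" keeps the output as long as the input, provided the
    input has at least `size` frames.
    """
    g = gaussian_kernel(sigma, size)
    return np.convolve(s, g, mode="same")

# Toy signal: one single-frame spike and one 60-frame sustained event.
s = np.zeros(200)
s[50] = 1.0
s[100:160] = 1.0
smoothed = gaussian_filter_layer(s, sigma=2.0)
print(smoothed[50], smoothed[130])
```

In this sketch the isolated spike is strongly attenuated, while the interior of the sustained event keeps a response close to 1, because the kernel weights sum to approximately 1.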
3.5.1 Datasets
Two datasets are employed in the experiments: UrbanSound and UrbanSound8K. UrbanSound is used for training and frame-level evaluation, and UrbanSound8K is used for clip-level evaluation. Both datasets are composed of ten classes of sounds taken from Freesound.org. The ten classes and the number of clips per class are listed in Table 3.7.
Every clip in these datasets contains only one label.
The UrbanSound dataset comprises 1302 recordings whose durations vary from 1 second to over 30 seconds. All audio clips are annotated with the onsets and offsets of the sound events appearing in them. These annotations allow us to evaluate frame-level predictions. Note, however, that the temporal annotations are used only for evaluating the test results; we do not use them in the training phase. Although UrbanSound is not large, it is still the most suitable public dataset for our work.
UrbanSound8K is composed of 8732 short clips segmented from files in UrbanSound.
Each file in UrbanSound8K is at most 4 seconds long and is labelled with one class.
It is used for clip-level evaluation.
Table 3.7: Number of clips per class in each dataset, the learned standard deviations σ of the Gaussian filter layer after training, and the performance of our model. The AUC on UrbanSound (US) measures frame-level prediction, and the accuracy on UrbanSound8K (US8K) measures clip-level prediction.

Class              # clips (US)   # clips (US8K)   σ           AUC (US)   Acc. (US8K)
Air conditioner    64             1000             2.93±0.30   0.612      76.7%
Car horn           125            429              1.32±0.05   0.807      69.0%
Children playing   158            1000             1.08±0.07   0.594      41.8%
Dog bark           337            1000             0.96±0.06   0.790      79.5%
Drilling           119            1000             2.05±0.13   0.764      52.3%
Engine idling      97             1000             2.97±0.19   0.688      51.8%
Gun shot           117            374              1.13±0.16   0.921      94.4%
Jackhammer         45             1000             1.93±0.10   0.704      39.8%
Siren              74             929              2.09±0.17   0.763      59.0%
Street music       166            1000             1.51±0.10   0.737      61.3%
Total              1302           8732             -           0.738      59.4%
3.5.2 Experiments
The experiments consist of two parts. The first part reports the clip-level prediction of the proposed model under different configurations and compares the best result with a fully-supervised work that was also tested on UrbanSound8K. The second part evaluates the frame-level predictions.
The sampling rate of the audio files is set to 44100 Hz. The input feature is composed of a 128-dimensional mel-spectrogram and its first derivative, which also has 128 dimensions.
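As a rough illustration of this input, the sketch below pairs a spectrogram with its first temporal derivative. Several details are assumptions rather than facts from the paper: a random array stands in for the log-mel spectrogram, the derivative is a simple central difference via `np.gradient` (the original work may use a different delta estimator), and the two 128-dimensional parts are stacked as two channels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mels, n_frames = 128, 50

# Stand-in for a 128-band mel-spectrogram (frequency x time).
mel = rng.standard_normal((n_mels, n_frames))

# First derivative along the time axis (central differences in the interior).
delta = np.gradient(mel, axis=1)

# Assumed layout: two channels, each 128 x n_frames.
features = np.stack([mel, delta], axis=0)
print(features.shape)
```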
Our model contains 2 early convolutional layers. Each layer comprises 60 filters with a filter size of 5 in the time domain. We follow [13] and perform convolution only along the time axis. Therefore, only the first convolutional layer has a filter size of 128 in the frequency domain, while the others have a filter size of 1. Each early convolutional layer is followed by a max-pooling layer with both pooling size and stride set to 4 along the temporal axis (no pooling is done along the frequency axis). The late layers include 3 consecutive convolutional layers with a filter size of 1. We set the number of filters to 128 for the first two layers and to 10 for the last one, which is the total number of labels in UrbanSound. As for the Gaussian filter layer, the filter size is set to 32, and the initial standard deviation σ of every class is set to 2. All dropout rates in this model are set to 0.5, and the learning rate is initialized to 0.006. The Adaptive Gradient Algorithm (AdaGrad) is used as our update method [93], so the effective learning rate changes every time the parameters are updated. As the training data consist of clips of varied duration, the batch size is set to 1 to simplify the training process. Training runs for 300 epochs in every setting, and the model from the epoch with the highest validation accuracy is selected as the final model for that experimental setting.

Table 3.8: Clip-level accuracy of the basic setting and of the basic setting plus one of multiple scales, data augmentation, or the Gaussian filter.

Basic     w/ Multiple scales   w/ Data augmentation   w/ Gaussian filter
25.93%    37.51%               39.78%                 51.73%
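To make the remark about the changing learning rate concrete, here is a minimal sketch of the standard AdaGrad update with the initial learning rate of 0.006 stated above. The function name `adagrad_step`, the `eps` value, and the toy gradients are assumptions for illustration; the actual training code is not shown in the text.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.006, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    accum = accum + grad**2
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

param = np.zeros(2)
accum = np.zeros(2)
grad = np.array([1.0, -2.0])  # hypothetical gradient, repeated for two steps

param1, accum = adagrad_step(param, grad, accum)
param2, accum = adagrad_step(param1, grad, accum)

step1 = np.abs(param1 - param)
step2 = np.abs(param2 - param1)
print(step1, step2)
```

Because the accumulated squared gradient only grows, the second step is smaller than the first even though the gradient is unchanged.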
Clip-level Evaluation
We begin by evaluating three modifications of the CNN model. First, in the basic structure, the window size of the STFT is set to 1024. In the multi-scale setting, we instead use a 3-scale input feature with window sizes of 1024, 4096, and 16384. As Table 3.8 shows, the multi-scale model outperforms the basic one by a large margin (37.51% versus 25.93%). Second, because weakly-supervised data may vary considerably in the volume of audio events, and because the training data from UrbanSound are scarce, we augment the training data by raising and lowering the volume of every clip by 5 dB. The training data are thus tripled, which should make the model less sensitive to diverse volumes.
The results show that volume augmentation does improve the performance.
Third, the Gaussian filter layer described above is added.
We refer to the model with single-scale input, no augmentation, and no Gaussian filter as the ‘basic’ model. We then add one of these three modifications to the model and investigate which one improves the basic model most effectively. The results are shown in Table 3.8. The multi-scale feature is indeed beneficial, improving the basic model from 25.93% to 37.51%. The use of data augmentation is also quite effective, reaching 39.78%.
More importantly, we found that the Gaussian filter brings the largest improvement, raising the accuracy to 51.73%.
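The volume augmentation described above can be sketched as follows, assuming the gain is applied directly to the waveform using the usual amplitude conversion 10^(dB/20). The helper names are hypothetical, and the sketch does not handle clipping of the louder copy.

```python
import numpy as np

def change_gain_db(waveform: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale a waveform by a gain given in decibels (amplitude ratio 10^(dB/20))."""
    return waveform * 10.0 ** (gain_db / 20.0)

def augment_volume(waveform: np.ndarray, step_db: float = 5.0) -> list:
    """Return the original clip plus copies that are step_db louder and quieter."""
    return [
        waveform,
        change_gain_db(waveform, +step_db),
        change_gain_db(waveform, -step_db),
    ]

# Toy clip: one second of a 440 Hz sine at 44100 Hz.
t = np.arange(44100) / 44100.0
clip = 0.1 * np.sin(2.0 * np.pi * 440.0 * t)
augmented = augment_volume(clip)
print(len(augmented))
```

Each training clip yields three versions, which matches the tripling of the training data mentioned in the text.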
Furthermore, the final standard deviations of the Gaussian filters provide insights regarding the classes. From Table 3.7, we can see that sound classes with longer durations, such as “air conditioner” and “engine idling,” obtain higher values, while classes with shorter durations, like “gun shot” and “dog bark,” obtain lower values. Therefore, if “gun shot” and “engine idling” are both detected within a very short duration, “gun shot” is more likely to be highlighted by the Gaussian filter layer. In contrast, the final prediction of “engine idling” will be reduced, since this kind of sound is expected to last longer.
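A small calculation makes this effect concrete. Under the normalization of Eq. (3.1), the response of the Gaussian filter to an isolated single-frame unit activation peaks at g[0] = 1/√(2πσ²), so a class with a larger learned σ attenuates very short detections more strongly. The sketch below evaluates this peak for the σ values reported in Table 3.7; it is an idealization that assumes the kernel is sampled exactly at t = 0.

```python
import math

def impulse_peak(sigma: float) -> float:
    """Peak of g[t] at t = 0, i.e. the response to a single-frame unit impulse."""
    return 1.0 / math.sqrt(2.0 * math.pi * sigma**2)

# Learned standard deviations taken from Table 3.7.
sigmas = {
    "dog bark": 0.96,
    "gun shot": 1.13,
    "air conditioner": 2.93,
    "engine idling": 2.97,
}

for name, sigma in sigmas.items():
    print(f"{name:16s} sigma={sigma:.2f}  peak response={impulse_peak(sigma):.3f}")
```

A sustained activation, by contrast, is largely preserved for every class, since the kernel weights sum to approximately 1.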
Finally, we enable all three modifications. As shown in Table 3.7, our model attains 59.4% accuracy, which is better than the result of using the Gaussian filter alone (51.73%). In a fully-supervised setting, Piczak achieved 73.1% accuracy by training on subsets of UrbanSound8K itself. Although a performance gap remains, this result is promising because our model uses only weakly-supervised data from UrbanSound.
Frame-level Evaluation
The frame-level result is evaluated with the average area under the ROC curve (AUC) [94]; Table 3.7 shows the overall result and the result for each class. In addition, some visualized frame-level results are shown in Fig. 3.8, where the ability to localize events can be clearly seen. In general, when an event is detected, its temporal location is usually correct.
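For reference, a frame-level AUC for one class can be computed from binary frame labels and predicted scores as the probability that a positive frame is scored above a negative one. The sketch below is a generic pairwise implementation, not the evaluation code of the paper; the function name `roc_auc` and the toy labels are assumptions. The last lines check that the "Total" AUC in Table 3.7 equals the unweighted mean of the ten class AUCs.

```python
import numpy as np

def roc_auc(labels, scores) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative frame")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy frame-level example for a single class.
frame_labels = [0, 0, 1, 1]
frame_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(frame_labels, frame_scores))

# Per-class AUC values from Table 3.7, averaged over the ten classes.
class_aucs = [0.612, 0.807, 0.594, 0.790, 0.764, 0.688, 0.921, 0.704, 0.763, 0.737]
mean_auc = float(np.mean(class_aucs))
print(round(mean_auc, 3))
```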