This dissertation consists of five chapters. This chapter, Chapter 1, introduces the task of event detection in audios and videos, including some general information and related work. Chapter 2 introduces the background knowledge used throughout this work, including the difference between supervised learning and weakly-supervised learning, the definition of event detection in the context of this work, and the basics of neural networks.
Then, the proposed method for weakly-supervised music event detection is detailed in Chapter 3. In Chapter 4, weakly-supervised music event detection is used to help weakly-supervised instrument-playing action detection in videos. Finally, I conclude this work in Chapter 5.
Chapter 2 Background
In this chapter, we first survey studies related to this work. Then, the concepts used throughout this work are introduced. These concepts include the definition of event detection, weakly-supervised learning, fully-convolutional networks, and the features used.
2.1 Literature survey
In this section, we will review and discuss the studies related to this work. They are divided into three categories: detection and classification in audios, detection and classification in videos and images, and weakly-supervised learning.
2.1.1 Detection and classification in audios
A considerable amount of work has been done on music auto-tagging, mostly focusing only on clip-level prediction (i.e., whether a tag can be applied to a music piece) [12–15, 41–44]. Although audio features are usually extracted at the frame level, the objective of learning is to make clip-level predictions. In recent years, deep neural network architectures have been found superior to competing machine learning models for music auto-tagging [13, 14, 45]. For example, Dieleman et al. have shown the effectiveness of CNNs in learning features for music auto-tagging [13]. However, this CNN model can neither deal with music of arbitrary length nor perform frame-level prediction.
Some studies investigate audio auto-tagging at a finer granularity. Essid et al. applied hierarchical clustering for frame-level instrument recognition with temporal annotations of instrument occurrences [46]. Mandel et al. studied the tag relationships inside a track and between tags with 10-second clips [47–49]. Frame-level prediction for music auto-tagging was discussed by Wang et al. [50]. Parascandolo et al. conducted polyphonic sound event detection with recurrent neural networks [51]. The main difference between the method proposed in this work and those in these studies is that the proposed method can predict at a temporal granularity finer than that of the training data.
In recent years, the multimedia and MIR communities have started to address the difficulty of collecting training data for frame-level prediction. Kumar et al. investigated the problem of audio event detection with SVMs and neural networks using weakly labeled data [52]. Schlüter utilized saliency maps to iteratively train a model that can recognize singing voices at the frame level [53]. In this work, we also utilize an FCN model to derive the frame-level instrument sound predictions. However, we further use the frame-level instrument sound predictions as the training target for the visual action model, not only as the end product itself.
Instrument recognition has been an active research topic in MIR. Essid et al. extracted various audio features and applied hierarchical clustering and SVM for instrument recognition [31]. Han et al. proposed a CNN structure to recognize the predominant instrument in music [34]. Slizovskaia et al. used both audio and visual features as the input and applied CNNs for the task of instrument recognition [38]. The goal of these works is to recognize instrument sounds with audio or audio-visual information as the input. In contrast, one of the goals of this work is to detect instrument-playing actions at the frame level from the visual cues in a video.
2.1.2 Detection and classification in videos and images
Zhou et al. identified the difficulty in acquiring action annotations for action detection and proposed a way to estimate the temporal and spatial extents of the actions [19]. They proposed a trajectory split-and-merge algorithm that first segments the background and the foreground moving objects by using dense optical flows, and then uses the segmentation information to derive the temporal and spatial extents of the actions. Then, they used a latent SVM to classify these segmented patches and locate the actions. We share the similar goal of deriving the temporal and spatial extents in our proposed framework, but we investigate utilizing two other modalities to estimate the extents, instead of using dense optical flows.
Oquab et al. proposed to use fully-convolutional neural networks (FCNs) to realize weakly-supervised learning for images [10]. By replacing the fully-connected layers in conventional convolutional neural networks (CNNs) [54, 55] with fully-convolutional layers, the model produces an output map that indicates the activation values at different locations. We use this method for spatial weakly-supervised learning in both action detection and object detection in this work.
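As a minimal illustration of why replacing a fully-connected layer with a convolution yields an output map, the following Python sketch applies the same weights first to a single fixed-size patch and then as a sliding filter over a larger input. It is a sketch under stated assumptions, not the implementation of Oquab et al. [10] or of this work: the function names, the 2 x 2 window, and all weights and inputs are hypothetical.

```python
# Illustrative sketch only; all names and numbers are hypothetical.

def dense_score(patch, weights, bias):
    """One-output fully-connected layer applied to a k x k patch."""
    return bias + sum(
        w * x
        for w_row, x_row in zip(weights, patch)
        for w, x in zip(w_row, x_row)
    )

def conv_score_map(image, weights, bias):
    """Slide the same weights over every k x k window (stride 1, no padding),
    producing one score per spatial location instead of a single score."""
    k = len(weights)
    rows, cols = len(image), len(image[0])
    return [
        [
            dense_score([r[j:j + k] for r in image[i:i + k]], weights, bias)
            for j in range(cols - k + 1)
        ]
        for i in range(rows - k + 1)
    ]

weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned weights
bias = -1.0

patch = [[1.0, 0.0], [0.0, 1.0]]    # input of exactly the size the dense layer expects
image = [                           # larger 3 x 4 input containing the pattern once
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

single = dense_score(patch, weights, bias)         # one score for the patch
score_map = conv_score_map(image, weights, bias)   # 2 x 3 map of scores
```

In this toy case the largest value in `score_map` sits at the window that contains the pattern, which is the sense in which the output map indicates activation at different locations. The input size is not fixed: a larger `image` simply produces a larger map.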
There have been several studies on weakly-supervised object detection or segmentation. Hartmann et al. used a support vector machine (SVM) [56] for weakly-supervised object segmentation in videos [57]. Liu et al. used a nearest neighbor-based method to perform weakly-supervised object segmentation in videos [58]. Prest et al. used motion cues to produce candidates of temporal tubes that locate a moving object and trained the object detector with a subset of the tubes [59].
Bojanowski [21] and Huang et al. [26] tackled the problem of weakly-supervised action detection. In their studies, only the sequence of actions was known, and the actions had to be aligned with the frames in a video clip. They proposed different ways to align the action sequence. Our work is different from theirs in two ways. First, we use auxiliary sound and object models to learn to assign labels to video frames, instead of relying on a sequence of labels assigned by humans. Second, they only attempt to predict the labels temporally but not spatially.
Simonyan et al. proposed a two-stream framework for action detection by using an object stream and an action stream [22]. They experimented with fusing the two streams either by averaging the output scores of the two models or by using an SVM to do the final classification. Feichtenhofer et al. extended Simonyan’s work by using different ways of model fusion [25], and Ng et al. extended Simonyan’s work by incorporating information across longer periods of time through temporal pooling and LSTM [23]. Our method also contains multiple streams. However, we use FCNs for all three streams instead of conventional CNNs because we want not only to classify the videos but also to locate the instruments, the actions, and the sounds. Furthermore, the models are fused only after they are separately trained.
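The simpler of the two fusion strategies mentioned above, averaging the output scores of separately trained streams, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `average_fusion`, the three classes, and the score values are hypothetical and are not taken from [22] or from the models in this work.

```python
# Illustrative sketch only; names and scores are hypothetical.

def average_fusion(stream_scores):
    """Average the per-class scores of several separately trained streams."""
    n_streams = len(stream_scores)
    n_classes = len(stream_scores[0])
    return [
        sum(scores[c] for scores in stream_scores) / n_streams
        for c in range(n_classes)
    ]

object_scores = [0.25, 0.5, 0.25]    # hypothetical output of the object stream
action_scores = [0.125, 0.25, 0.625]  # hypothetical output of the action stream

fused = average_fusion([object_scores, action_scores])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the fusion happens only on the output scores, each stream can be trained on its own and a further stream can be added by appending one more score list.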
The proposed method for action detection in this work is also related to the supervision transfer introduced by Gupta et al. [60]. Given two learning tasks where task 1 has a large amount of annotated data while task 2 does not, Gupta et al. proposed to use the output of a middle layer in the well-trained network for task 1 to provide supervision to a middle layer of the network for task 2. In this way, the supervision is transferred. In this work, we also seek more supervision for the instrument-playing actions from two other modalities, but we provide the supervision directly at the output layers through the physical relationships among the three modalities that are indicated by the two observations stated in Section 1.3. Temporal and spatial supervision is also exploited in addition to the instance-level label supervision in this work.
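The idea of letting a middle layer of one network supervise a middle layer of another can be illustrated with a feature-matching loss. The Python sketch below assumes a mean squared error between the two feature vectors; this choice of loss, the function name, and the feature values are assumptions made for illustration and are not taken from this work.

```python
# Illustrative sketch only; the loss choice, names, and values are assumptions.

def feature_matching_loss(teacher_features, student_features):
    """Mean squared error between two equally sized feature vectors."""
    if len(teacher_features) != len(student_features):
        raise ValueError("feature vectors must have the same length")
    n = len(teacher_features)
    return sum(
        (t - s) ** 2 for t, s in zip(teacher_features, student_features)
    ) / n

# Hypothetical mid-layer output of the well-trained task 1 network.
teacher = [0.5, -1.0, 2.0, 0.0]

# Hypothetical mid-layer outputs of the task 2 network before and after training.
student_before = [0.0, 0.0, 0.0, 0.0]
student_after = [0.5, -0.5, 1.5, 0.0]

loss_before = feature_matching_loss(teacher, student_before)
loss_after = feature_matching_loss(teacher, student_after)
```

Minimizing such a loss requires no labels for task 2, which is what makes the supervision transferable. The approach in this work differs in that the supervision is applied at the output layers rather than at a middle layer.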
2.1.3 Weakly-supervised learning
In recent years, unsupervised, weakly-supervised, and semi-supervised learning have received much attention in video processing [61–64]. This trend is partly due to the lack of supervisory signals in videos, but it is also because the multi-modal nature and the temporal continuity of videos provide a good environment for learning feature representations from the dependencies between modalities or between frames without external supervision. For example, Aytar et al. [61] and Arandjelović et al. [63] proposed to match the audio and visual information to learn features from a large number of videos in an unsupervised manner and to use only a small amount of labeled data for training a classifier based on the learned features. Aytar et al. [64] further included text in addition to the audio and visual information for feature learning. In contrast to the aforementioned multi-modal approaches, Canziani et al. [62] proposed the CortexNet framework to learn features by matching neighboring frames in videos. Similar to these works, our proposed framework also represents an attempt to increase supervisory signals by utilizing multiple modalities of videos for a challenging action detection task.
Labeling the bounding boxes of objects in an image requires more labor than annotating the presence of objects does. Therefore, weakly-supervised approaches to visual object localization with only image-level annotations have attracted increasing attention in recent years [8–11]. Among the prior art, our approach is closest to the model proposed by Oquab et al. [10], which is a multi-instance learning variant and is also based on CNNs. A key idea proposed in that work is to use so-called full convolutions, so that the model can process images of arbitrary size. In this way, an input image can be resized arbitrarily in a multi-scale manner to locate a visual object. Although inspired by this approach, our model has two distinct features. First, we adopt a different way to achieve multi-scale learning for audios, as music cannot be easily “resized” as images can. Second, we use a dedicated layer to deal with the temporal dimension in music, which is absent in images.
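To make the multi-instance learning view concrete for the temporal case, the following Python sketch pools frame-level scores into a single clip-level score and computes the loss against the clip-level label only. It assumes global max pooling and a binary cross-entropy loss, which are common choices for this setting; the pooling and loss actually used in this work may differ, and all names and values are hypothetical.

```python
import math

# Illustrative sketch only; pooling, loss, names, and values are assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def clip_score(frame_logits):
    """Global max pooling over time: the clip is scored by its strongest frame."""
    return max(frame_logits)

def weak_label_loss(frame_logits, clip_label):
    """Binary cross-entropy between the pooled score and the clip-level label."""
    p = sigmoid(clip_score(frame_logits))
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))

# Hypothetical frame-level logits for one tag over five frames.
frame_logits = [-2.0, -1.5, 3.0, 2.5, -1.0]

loss_if_tag_present = weak_label_loss(frame_logits, 1)
loss_if_tag_absent = weak_label_loss(frame_logits, 0)

# Frame-level predictions are read from the same logits,
# although no frame-level label was ever used in the loss.
frame_predictions = [sigmoid(z) > 0.5 for z in frame_logits]
```

Only the clip-level label enters the loss, yet the per-frame logits remain available at test time, which is how a model trained on clip-level tags can still predict at a finer temporal granularity.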
Visual event or action detection in videos, which also have a temporal dimension, has also been studied [22, 65, 66]. Similar to music event detection, this problem requires weakly-supervised learning because the annotation is at the video clip level. However, little work, if any, has addressed the localization problem for visual events in videos.