Instrument-playing action detection in music videos

With the popularity of social media and online sharing, people share a large number of videos online every day. These videos often contain human activities, so human actions or movements are informative components of these videos. Therefore, automatically recognizing the types of actions and locating the actions in videos can help with understanding and retrieving videos [19]. This task is often called “action detection” [19–26].

Figure 1.1: An example of the music event predictions.

¹ https://github.com/ciaua/clip2frame

For fully-supervised learning approaches, detailed temporal and spatial annotations of actions are usually needed for action detection. However, these annotations are difficult to acquire because the labeling is labor-intensive and time-consuming [19]. In recent years, researchers have proposed various strategies to alleviate this issue [19, 21, 26].

In several types of videos, objects and sounds might also be used to alleviate this issue. In many videos, both the objects and the sounds signify the key points of the actions. Examples include videos with instrument playing [27], videos with violent content [28], and sports videos [29]. For example, when we hear a guitar solo and see a musician holding a guitar in a video, it is quite likely that the guitar solo comes from the musician’s playing actions. The hitting actions in ball games often involve the acting objects, such as feet, hands, rackets, or bats, as well as the accompanying sounds of hitting. This relationship between actions, objects, and sounds provides an opportunity to infer the presence of actions from objects and sounds.

Specifically, in videos where sounds and objects signify the key points of actions, we can observe the following two common properties:

Action-in-object: The spatial location of an action is close to (e.g., at the border of or within) the spatial location of objects (e.g., instruments, bats, balls, or weapons).

Action-making-sound: A specific type of action is associated with the specific type of sound that the action makes.

Action-in-object, together with the region of the objects in the scene, may give us clues regarding where the actions occur spatially, while action-making-sound, together with the temporal activation of the sounds in the video frames, may help us locate the actions temporally. In contrast to annotated action data, annotated data of objects and annotated data of sounds are easier to acquire. Therefore, in this work, a method is proposed that trains a sound model specifying when the actions occur and an object model telling us where the objects are. These two auxiliary models act as teachers that inform the action model when and where to pay attention. We feed only the motion information (dense optical flows in this work) to the action model, so it is forced to learn when and where the actions occur from motion alone, with help from the two auxiliary models. An interesting feature of the proposed framework is that it does not need annotated action data at all in the training process. The framework can be regarded as weakly supervised, because the model is trained to predict when and where the playing actions are in videos by using only information regarding whether an instrument appears in a video clip in the training phase. The proposed framework is depicted in Fig. 4.2b (Figs. 4.1a, 4.1b, and 4.2a are variants of the proposed framework that will be discussed in Section 4.1.2).
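To make the teacher idea concrete, the following is a minimal sketch of one way the two auxiliary outputs could be combined into a spatio-temporal hint for the action model. It is an illustration under stated assumptions, not the formulation used in this work (the actual framework variants are described in Section 4.1.2). The function name `make_action_targets`, the element-wise product, and the toy values are all hypothetical.

```python
from typing import List

# One frame of spatial scores: an H x W grid of values in [0, 1].
Frame = List[List[float]]


def make_action_targets(sound_activation: List[float],
                        object_maps: List[Frame]) -> List[Frame]:
    """Weight each frame's object map by that frame's sound activation.

    Assumption (not from the source): the "when" cue and the "where" cue
    are combined by a simple product, so a location is marked only if the
    object is present there AND the sound is active in that frame.
    """
    if len(sound_activation) != len(object_maps):
        raise ValueError("need exactly one sound activation per frame")
    return [
        [[s * v for v in row] for row in frame]
        for s, frame in zip(sound_activation, object_maps)
    ]


# Toy clip: 3 frames on a 2x2 grid. The instrument is visible in the
# top-left cell of every frame, but its sound is active only in frame 2.
sound = [0.0, 0.0, 1.0]
objects = [[[1.0, 0.0], [0.0, 0.0]] for _ in range(3)]
targets = make_action_targets(sound, objects)
```

In this toy clip the instrument is visible throughout but is heard only in the last frame, so only the last frame keeps a non-zero target. An action model that sees only optical flow would then be pushed to respond to the motion in that frame and region.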

We will focus on the instrument-playing actions in music-related videos in this work.

Music is one of the most popular categories of online video (ranked number two according to the study of Cheng et al. [30]), and instrument playing is among the most common scenes in these videos. For the audio aspect of instrument playing, automatic detection of instrument sounds has been widely studied in music information retrieval (MIR) [31–35]. It helps people understand the content of the music. However, the visual aspect of instrument playing remains largely unaddressed in the literature. In addition to the sounds of instruments, the visual appearances of instruments and instrument-playing actions also provide important information about the related videos. In order to understand music-related videos, we need to know which instruments are played, when the instruments are played, and where the playing actions occur in the scene.² For example, in a video of a piano concert, the pianist may first walk into the scene, sit down, and then start to play the piano. In this case, the piano is not played until the pianist sits down and is ready. We may want to know when the playing begins, the position of the piano relative to the scene, the positions of the hands relative to the piano, etc.

Figure 1.2: An example of the action predictions. Five consecutive frames at one-second intervals are shown.³

There are also attempts to model the audio and visual information jointly for music information retrieval tasks [37, 38]. For example, Schindler et al. investigated music genre classification by aggregating audio features and visual features as the input features to a classifier [37]. This approach could improve the input features of the model, but it cannot circumvent the lack of annotated data.

In light of these observations, the goal of this work for music videos is to train a model to automatically pinpoint the instrument-playing actions temporally and spatially in videos with instrument-playing scenes, without detailed annotations. In contrast to the abundance of annotated data available for either object recognition (including instruments), such as ImageNet⁴ [39], or sound recognition, such as AudioSet⁵, we have no available dataset specifying the location of the playing actions in the scenes. Therefore, we train the action model by utilizing the two properties mentioned above together with a trained sound model and a trained object model. We use the spatial locations of instrument objects and the temporal locations of the instrument sounds to help the detection of playing actions, but we do not join the input features. In this way, we have a more flexible model that can work even if the audio is degraded due to factors such as environmental noise, audio track loss, or audio compression artifacts [40]. Humans, likewise, can guess whether an instrument is being played simply from the action, the gesture, and the positions of hands or bows relative to the instruments.

² And even how the instruments are played: the gesture, the playing technique, the expression, etc. [36]. This is left as a topic of future research.

³ The RGB snapshots are cropped from a YouTube video (ID: 3hjHJo452dY, uploaded by Zara and Nicola) with a Creative Commons license.

An example of an action detection result is shown in Fig. 1.2. The violinist is not playing initially, and then she gradually raises the bow and starts playing in the final two frames. This shows that the instruments and the playing actions do not always coincide temporally, and the action model should be able to handle this situation.

In order to investigate weakly-supervised instrument-playing action detection, this dissertation makes three contributions. First, a training framework is proposed to learn the temporal and spatial locations of the actions without detailed annotations by utilizing the object and the sound information. Furthermore, after the action model is trained, the object and sound information can be used to further improve the result through a simple yet effective method of model fusion. Second, although the proposed method does not require detailed location information in training, for the purpose of evaluation, I manually annotated a total of 5,400 frames from 135 videos with detailed locations of instrument-playing actions. Third, comprehensive experiments are conducted to investigate the effects of different components in the framework (Section 4.3). The action patterns that the neural network learns for each instrument are also analyzed.

⁴ http://www.image-net.org/

⁵ https://research.google.com/audioset/

In document: Weakly-Supervised Music Audio and Video Event Detection Using Fully Convolutional Networks (pp. 21-26)