
4.2 Experimental setup

4.2.2 Features

Log mel-spectrograms are extracted from the audio at a sampling rate of 16,000 Hz with a hop size of 512 samples. As in Chapter 3, three scales of log mel-spectrograms are used, with window sizes of 512, 2048, and 8192. The input feature maps therefore have a temporal resolution of 16000/512 = 31.25 frames per second (FPS). After processing by the sound model, which has a total stride of 16, the output has a temporal resolution of 31.25/16 ≈ 1.95 FPS. We also use this resolution as the temporal resolution of the action and object models. The log mel-spectrograms are extracted with Librosa, an open-source Python library for audio analysis [101].

² http://host.robots.ox.ac.uk/pascal/VOC/index.html

All images and videos are resized so that the longer side has 256 pixels, maintaining the aspect ratio.
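The resizing rule can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name `resize_longer_side` is hypothetical, and rounding to the nearest pixel is an assumption, since the text only states that the longer side becomes 256 pixels and the aspect ratio is kept.

```python
def resize_longer_side(width, height, target=256):
    """Return (new_width, new_height) so that the longer side equals `target`.

    Hypothetical helper for illustration. Rounding to the nearest pixel is
    an assumption; the text does not specify a rounding rule.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height


print(resize_longer_side(1280, 720))  # (256, 144)
print(resize_longer_side(480, 640))   # (192, 256)
```

Only the target size is computed here; the actual interpolation would be done by an image library of choice.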

The RGB images are used in the object model. They are sampled from each video clip at 1.95 FPS, the same as the temporal resolution of the sound model. Because FCNs can handle inputs of arbitrary size, we do not have to pad the images.
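The frame-rate arithmetic in this section can be checked with a few lines of Python. The constants are taken from the text; exact fractions are used only to avoid floating-point noise, and the variable names are illustrative.

```python
from fractions import Fraction

SAMPLE_RATE = 16000   # Hz, from the text
HOP_SIZE = 512        # samples, from the text
TOTAL_STRIDE = 16     # total stride of the sound model, from the text

# Temporal resolution of the input log mel-spectrogram.
input_fps = Fraction(SAMPLE_RATE, HOP_SIZE)
# Temporal resolution after the sound model; also used for images and flows.
output_fps = input_fps / TOTAL_STRIDE
# Time between two consecutive output frames, in seconds.
frame_period = 1 / output_fps

print(float(input_fps))     # 31.25
print(float(output_fps))    # 1.953125
print(float(frame_period))  # 0.512
```

The value 1.95 FPS quoted in the text is 1.953125 FPS truncated to two decimals, which corresponds to one frame every 0.512 seconds.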

The dense optical flows are used in the action model and are also extracted at 1.95 FPS. Each frame is represented by a stack of five dense optical flows: that of the frame itself and those of its four neighboring frames (two before and two after). Each dense optical flow is decomposed into an x-direction flow and a y-direction flow, so the input has 5×2 = 10 channels in total. We will test five temporal resolutions for the extraction of dense optical flows in Section 4.3.3. To extract the dense optical flows, we convert the RGB images to grayscale and then use OpenCV³.
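The construction of the 10-channel input can be sketched as below. This is an illustration under stated assumptions rather than the implementation used in this work: the function name `stack_flows` is hypothetical, the interleaved x/y channel order is an assumption, and so is the clamping of out-of-range neighbors at the clip boundaries, which the text does not specify. Flow fields are plain nested lists here so that the example has no dependencies.

```python
def stack_flows(flows, t, radius=2):
    """Stack the x/y flow fields of frames t-radius..t+radius into channels.

    `flows` is a list of (flow_x, flow_y) pairs, one pair per sampled frame.
    Neighbors outside the clip are clamped to the nearest valid frame; this
    boundary rule and the x/y interleaving are assumptions.
    """
    if not flows:
        raise ValueError("flows must not be empty")
    channels = []
    for offset in range(-radius, radius + 1):
        idx = min(max(t + offset, 0), len(flows) - 1)
        flow_x, flow_y = flows[idx]
        channels.append(flow_x)
        channels.append(flow_y)
    return channels


# Toy clip: 6 frames of 2x2 flow fields filled with the frame index (x)
# and its negative (y), so the stacking order is easy to read off.
toy = [([[i, i], [i, i]], [[-i, -i], [-i, -i]]) for i in range(6)]
stacked = stack_flows(toy, t=3)
print(len(stacked))                  # 10
print([ch[0][0] for ch in stacked])  # [1, -1, 2, -2, 3, -3, 4, -4, 5, -5]
```

With a radius of two frames on each side, five flows with two components each give the 5×2 = 10 channels stated above.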

4.2.3 Datasets

We use five datasets in this paper. For training the action and object models, we use a subset of YouTube-8M⁴ [102]. For training the sound model, we use AudioSet⁵ [103].

To evaluate the action models, I manually annotate action key points in video clips from 135 videos of YouTube-8M. To evaluate the object model, we collect a set of instrument images from ImageNet. To evaluate the sound model, we use MedleyDB [18]. The datasets are listed in Table 4.3. In this chapter, we focus on the detection of nine instruments. Their properties and the amount of data in each dataset are presented in Table 4.5.

YouTube-8M

We use a subset of the YouTube-8M dataset [102]. We collected videos for nine instruments according to the tag information provided by YouTube-8M. The nine instruments are ‘Accordion’, ‘Cello’, ‘Drummer’, ‘Flute’, ‘Guitar’, ‘Piano’, ‘Saxophone’, ‘Trumpet’, and ‘Violin’.⁷

³ http://opencv.org

⁴ https://research.google.com/YouTube-8M/

⁵ https://research.google.com/audioset/

⁶ http://www.image-net.org/

Table 4.3: Datasets used in this paper and their data types, annotation types, and usages. YT8M-IPA represents the proposed YouTube-8M-Instrument-Playing-Action dataset.

| Dataset | Data type | Annotation type | Usage in this paper |
|---|---|---|---|
| YouTube-8M [102] | YouTube video | Video-level label | Training of the sound, object, and action models |
| YT8M-IPA | YouTube video | Playing-action key points (annotated by the author) | Evaluation of the action model |
| AudioSet [103] | YouTube video | Video-level sound label | Training of the sound model |
| MedleyDB [18] | Music audio | Frame-level sound label | Evaluation of the sound model |
| ImageNet-Instrument⁶ | Image | Instrument bounding box | Evaluation of the object model |
| MagnaTagATune [17] | Music audio | Track-level music-related label | Training of the sound model |

Table 4.4: Models used in this paper and their input features, prediction types, and training targets.

| Model name | Input feature | Prediction type | Training target |
|---|---|---|---|
| Sound model | Log mel-spectrogram | Sound activation | Video-level instrument label |
| Object model | RGB image | Object activation | Video-level instrument label |
| Video tag as target (VT) | Dense optical flow | Action activation | Video-level instrument label |
| Sound as target (ST) | Dense optical flow | Action activation | Sound activation |
| Object as target (OT) | Dense optical flow | Action activation | Object activation |
| Sound×Object as target (SOT) | Dense optical flow | Action activation | Sound activation × Object activation |

Note that ‘Drummer’ is chosen instead of ‘Drum’ because the ‘Drummer’ tag seems to contain more instrument-playing videos. We will refer to the ‘Drummer’ tag as ‘Drum’ in the rest of this work. In addition, I observe that the videos labeled with ‘Trumpet’ in YouTube-8M contain not only trumpets but also other instruments in the brass family, such as cornet, French horn, and trombone. Therefore, we treat ‘Trumpet’ as a more general, trumpet-like tag.

YouTube-8M divides its data into ‘train,’ ‘validate,’ and ‘test’ sets. We collect 16,804 videos from the YouTube-8M ‘train’ set as our training set and 2,100 videos from the ‘validate’ set as our validation set. The first minute of each video clip is used for training. Each instrument has at least 2,000 videos for training. Note that we only need clip-level labels for the training and validation sets.

YouTube-8M-Instrument-Playing-Action

There are no action annotations in YouTube-8M, so we manually annotate a set of video clips from it. Because the metadata and video IDs of the YouTube-8M ‘test’ set are not available, we choose the testing data from the YouTube-8M ‘validate’ set, without overlap with our validation set. Fifteen videos are chosen for each instrument. We manually annotate the frames from 0 to 10 seconds and from 30 to 40 seconds of each video so that we can evaluate the action-detection performance of our model. At a temporal resolution of 1.95 FPS, this amounts to 5,400 snapshots. These annotations are used only for evaluating the action models, not for training. We refer to this manually annotated subset as YouTube-8M-Instrument-Playing-Action, or YT8M-IPA for short.
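The figure of 5,400 snapshots can be reproduced with the short calculation below. It is a sketch under two assumptions that are not stated in the text: frames lie on a regular grid starting at time 0 with a period of 16×512/16000 = 0.512 seconds, and each annotated segment is treated as a half-open interval. The function name `frames_in_segment` is hypothetical.

```python
from fractions import Fraction

# Time between output frames: total stride * hop size / sampling rate.
FRAME_PERIOD = Fraction(16 * 512, 16000)  # 0.512 s

NUM_INSTRUMENTS = 9
VIDEOS_PER_INSTRUMENT = 15
SEGMENTS = [(0, 10), (30, 40)]  # annotated seconds in each video


def frames_in_segment(start_s, end_s, period=FRAME_PERIOD):
    """Count grid frames k*period with start_s <= k*period < end_s.

    Assumes frames start at time 0 and the interval is half-open.
    """
    first = -(-Fraction(start_s) // period)          # ceil(start / period)
    last_exclusive = -(-Fraction(end_s) // period)   # ceil(end / period)
    return int(last_exclusive - first)


frames_per_video = sum(frames_in_segment(s, e) for s, e in SEGMENTS)
frames_per_instrument = VIDEOS_PER_INSTRUMENT * frames_per_video
total_frames = NUM_INSTRUMENTS * frames_per_instrument

print(frames_per_video)       # 40
print(frames_per_instrument)  # 600
print(total_frames)           # 5400
```

Under these assumptions each 10-second segment contains 20 frames, which agrees with the 15 clips and 600 frames per instrument listed for YT8M-IPA in Table 4.5.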

⁷ While there are certainly other instruments, we choose these nine mainly because they cover commonly seen instrument types. On the one hand, we have a sufficient amount of training data for each of them in the datasets we use. On the other hand, we still need to manually annotate the action locations of the chosen instruments for evaluation (because such labels are not available elsewhere), so we have to limit the number of instruments. As the proposed methodology is quite generic, we believe our model can be easily extended to other instruments in future work.

Table 4.5: Properties of the nine instruments. The lower part of the table contains the number of data in the datasets. In YT8M-IPA, the playing actions are annotated at the intersections of the action regions and the playing tools as described in Section 4.2.3. ‘frs’ represents frames. ‘imgs’ represents images.

| | Accordion | Cello | Drum | Flute | Guitar | Piano | Saxophone | Trumpet | Violin |
|---|---|---|---|---|---|---|---|---|---|
| Action region | Keys/body | Strings | Drum skins | Holes/mouthpiece | Strings | Keys | Keys/mouthpiece | Valves/mouthpiece | Strings |
| Playing tool | Hands | Hand/bow | Sticks | Hands/mouth | Hands | Hands | Hands/mouth | Hands/mouth | Hand/bow |
| Portable? | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes |
| YouTube-8M (clips) | 2279 | 2260 | 2495 | 2264 | 3204 | 3678 | 2367 | 2240 | 3521 |
| YT8M-IPA (clips/frs) | 15/600 | 15/600 | 15/600 | 15/600 | 15/600 | 15/600 | 15/600 | 15/600 | 15/600 |
| AudioSet (clips) | 2658 | 4664 | 4866 | 4281 | 5624 | 5233 | 2966 | 3654 | 6553 |
| ImageNet-Inst. (imgs) | 412 | 323 | 252 | 359 | 135 | 315 | 343 | 294 | 365 |

MedleyDB (songs): 511651064435714 (the nine per-instrument values are run together and their column boundaries cannot be recovered).

Figure 4.4: Examples of the manually annotated key points (red dots) of instrument-playing actions. I annotate the locations that are most directly responsible for making the instrument sounds, as described in Section 4.2.3. Best seen in color.

The locations of instrument-playing actions are represented as key points instead of the regions commonly used in the action detection literature [19, 74]. I choose to do so because I want to predict the actions that are most directly responsible for making instrument sounds. The sounds are usually made by contact between a sound-making tool, such as hands or sticks, and an instrument, and such contacts are usually more like points than regions. Some examples of the annotations are shown in Figure 4.4.

The author annotates the locations of instrument playing according to the following principles. For wind instruments such as flute, saxophone, and trumpet, the locations where the hands are pressing and the location of the mouth are labeled. For string instruments such as cello, violin, and guitar, the location of the pressing hand and the intersection of the stroking hand (or the bow) and the strings are labeled. For accordion, the two hands are labeled, as is the center of the accordion, because the deformation of the accordion is also an indicator of playing. For drum and piano, the locations where the hands or sticks hit the instrument are labeled. Note that these locations are labeled only if the instruments in sight are responsible for making the sounds at a given frame.

Table 4.6: Evaluation of the sound models for instrument sound detection. The sound model trained with AudioSet outperforms the one trained with YouTube-8M and the one (the model in our previous work [1]) trained with the music dataset MagnaTagATune. We use ‘Acc.’ as shorthand for Accordion.

AVG w/o Acc. 0.765 0.796 0.812 0.821

ImageNet-Instrument

To evaluate the object localization ability of the object model, images of the nine instruments were collected from the ImageNet website, which also provides bounding boxes for the locations of the instruments. In total, 2,798 images and the corresponding bounding boxes are collected. Each instrument has 311 images on average, with a range from 135 to 412.
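The summary statistics above can be cross-checked against the per-instrument image counts in the ImageNet-Inst. row of Table 4.5. The snippet below is only a consistency check; the dictionary name is illustrative.

```python
# Per-instrument image counts, read from the ImageNet-Inst. row of Table 4.5.
imagenet_counts = {
    "Accordion": 412,
    "Cello": 323,
    "Drum": 252,
    "Flute": 359,
    "Guitar": 135,
    "Piano": 315,
    "Saxophone": 343,
    "Trumpet": 294,
    "Violin": 365,
}

total = sum(imagenet_counts.values())
mean = total / len(imagenet_counts)

print(total)                          # 2798
print(round(mean))                    # 311
print(min(imagenet_counts.values()))  # 135
print(max(imagenet_counts.values()))  # 412
```

The total, the rounded mean, and the range all agree with the figures quoted in the text.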

MedleyDB

MedleyDB [18] is a multi-track instrument dataset that also contains the timestamps of the occurrences of the instrument sounds. In total, 111 songs contain the nine instruments used in this work.⁸ We use it to evaluate the sound model.

⁸ We aggregate the ‘acoustic guitar,’ ‘clean electric guitar,’ and ‘distorted electric guitar’ in MedleyDB into the ‘Guitar’ tag, and aggregate ‘baritone saxophone,’ ‘soprano saxophone,’ and ‘tenor saxophone’ in MedleyDB into the ‘Saxophone’ tag.
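The label aggregation described in the footnote can be written as a simple lookup. This is a sketch, not the code used in this work: the names `MEDLEYDB_TO_TAG` and `aggregate_labels` are hypothetical, and passing unmapped labels through unchanged is an assumption, since the text only describes the guitar and saxophone cases.

```python
# Mapping from MedleyDB instrument labels to the tags used in this chapter.
# Only the six labels named in the footnote are covered.
MEDLEYDB_TO_TAG = {
    "acoustic guitar": "Guitar",
    "clean electric guitar": "Guitar",
    "distorted electric guitar": "Guitar",
    "baritone saxophone": "Saxophone",
    "soprano saxophone": "Saxophone",
    "tenor saxophone": "Saxophone",
}


def aggregate_labels(medleydb_labels):
    """Map MedleyDB labels to tags, keeping first-seen order without duplicates.

    Labels without an entry in the mapping are passed through unchanged;
    this fallback is an assumption made for illustration.
    """
    tags = []
    for label in medleydb_labels:
        tag = MEDLEYDB_TO_TAG.get(label, label)
        if tag not in tags:
            tags.append(tag)
    return tags


labels = ["tenor saxophone", "acoustic guitar", "clean electric guitar", "piano"]
print(aggregate_labels(labels))  # ['Saxophone', 'Guitar', 'piano']
```

A song with several guitar or saxophone stems thus contributes a single ‘Guitar’ or ‘Saxophone’ tag.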

AudioSet

AudioSet [103] is a video dataset released by Google that provides audio annotations for a 10-second clip in each video. We collect a subset covering the nine instruments from AudioSet, consisting of 35,512 video clips for training and 902 video clips for validation. Each instrument has 3,945 training clips on average.