
Note that a high object reliance level is not intrinsically wrong for an action model. If the captured object bias holds across different datasets, the object should be considered an essential part of that action. For instance, the action "playing piano" is genuinely associated with the object "piano".

However, the object biases in the training set $D_{\text{train}}$ and the testing set $D_{\text{test}}$ are not guaranteed to be the same. When they differ, action models with heavy object reliance learn the wrong object-action association and make inaccurate predictions on the testing set (see Fig. 3.2).

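Therefore, we propose a new method to inspect the discrepancy in object representation bias between $D_{\text{train}}$ and $D_{\text{test}}$. The idea is to calculate the performance drop of $M_{\text{obj}\to\text{act}}$ when it is trained on $D_{\text{train}}$ and tested on $D_{\text{test}}$; this drop is the object bias discrepancy $B_{\text{diff}}(D_{\text{train}}, D_{\text{test}}, k)$ of equation 3.5.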
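Reading the performance drop as training performance minus testing performance, the discrepancy for an action class $k$ can be written in the following form. This is a sketch under that reading rather than a verbatim statement of the equation: it agrees with the later remark that $B_{\text{diff}}$ approaches one minus the test performance when the training performance is near 100%, but the argument order of $P_k$ is an assumption.

```latex
B_{\text{diff}}(D_{\text{train}}, D_{\text{test}}, k)
  = P_k(D_{\text{train}}, M_{\text{obj}\to\text{act}})
  - P_k(D_{\text{test}}, M_{\text{obj}\to\text{act}})
```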

In our experiments, we choose $M_{\text{obj}\to\text{act}}$ to be a simple linear regression classifier and $P_k$ to be the per-class F1 score; hence equation 3.5 can be interpreted as the object distribution divergence between $D_{\text{train}}$ and $D_{\text{test}}$ given an action $k$.

Figure 3.2: Examples of Fallacious Object Reliance with the action class "throw". (a) In the EPIC-KITCHENS dataset, "throw" is often tied to "trash can", so the model relies on trash cans to predict the throwing action. However, in the testing set, "throw" may be associated with other objects, such as the highlighted pot. (b) In HMDB51, "throw" is related to entirely different objects from those in EPIC-KITCHENS. Action recognition models should capture the motion component of actions and avoid these FOR problems.
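The per-class performance drop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `per_class_f1` and `bias_discrepancy` are hypothetical, the predictions are assumed to come from an object-to-action classifier that has already been trained on the training set, and the drop is taken as training F1 minus testing F1.

```python
from collections import defaultdict


def per_class_f1(y_true, y_pred):
    """Return {class: one-vs-rest F1} for every class seen in y_true or y_pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for k in set(y_true) | set(y_pred):
        denom = 2 * tp[k] + fp[k] + fn[k]
        scores[k] = 2 * tp[k] / denom if denom else 0.0
    return scores


def bias_discrepancy(train_true, train_pred, test_true, test_pred):
    """Assumed reading of B_diff: per-class F1 on train minus per-class F1 on test."""
    f1_train = per_class_f1(train_true, train_pred)
    f1_test = per_class_f1(test_true, test_pred)
    return {k: f1_train[k] - f1_test.get(k, 0.0) for k in f1_train}


# Toy labels: the object-based classifier is perfect on train but not on test.
train_true = ["throw", "throw", "cut", "cut"]
train_pred = ["throw", "throw", "cut", "cut"]
test_true = ["throw", "throw", "cut", "cut"]
test_pred = ["cut", "throw", "cut", "cut"]
b_diff = bias_discrepancy(train_true, train_pred, test_true, test_pred)
```

A class with a large value in `b_diff` is one whose object cues stopped being predictive between the two splits.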

If an action model performs worse on groups whose object bias discrepancy is large, the model is likely using the wrong object-action association to predict actions. Building on this idea, we further propose the Fallacious Object Reliance (FOR) measurement, which evaluates the alignment between the action model's performance on $D_{\text{test}}$ and the negative of the object bias discrepancy:

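$$\mathrm{FOR}(M_{\text{vid}\to\text{act}}) = A\big(P(D_{\text{test}}, M_{\text{vid}\to\text{act}}),\; -B_{\text{diff}}(D_{\text{train}}, D_{\text{test}}, k)\big) \tag{3.6}$$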
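As a rough illustration of equation 3.6, the sketch below aligns per-class test performance with the negated discrepancy. The alignment function $A$ is not specified in this passage, so Pearson correlation is used here purely as a stand-in assumption; `pearson` and `for_score` are hypothetical names, and the numbers are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; used here only as an assumed stand-in for A."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # no variation, so no measurable alignment
    return cov / math.sqrt(vx * vy)


def for_score(test_perf, b_diff, align=pearson):
    """Align per-class test performance with the negative of B_diff."""
    classes = sorted(set(test_perf) & set(b_diff))
    perf = [test_perf[k] for k in classes]
    neg_diff = [-b_diff[k] for k in classes]
    return align(perf, neg_diff)


# Made-up per-class values: performance falls as the discrepancy grows.
test_perf = {"throw": 0.1, "cut": 0.5, "open": 0.9}
b_diff = {"throw": 0.4, "cut": 0.2, "open": 0.0}
score = for_score(test_perf, b_diff)
```

Under this stand-in, a score near 1 means the video model does worst exactly on the classes whose object bias shifted most, which is the behaviour the FOR measurement is meant to expose.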

Fig. 3.3 shows the FOR scores of I3D on three different datasets. Note that because $M_{\text{obj}\to\text{act}}$ achieves almost 100% accuracy on $D_{\text{train}}$ of HMDB51 split 1, $B_{\text{diff}}(D_{\text{train}}, D_{\text{test}}, k)$ is close to one minus the performance on $D_{\text{test}}$, so Fig. 3.3 (b) is almost the same as Fig. 3.1 (b).

Figure 3.3: The Fallacious Object Reliance (FOR) scores of I3D on three different datasets: (a) EPIC-KITCHENS, (b) HMDB51 split 1, and (c) Moments in Time. Although I3D has a strong ORL on all three datasets, its FOR scores remain high only on EPIC-KITCHENS and HMDB51. This indicates that the Moments in Time dataset either is better calibrated, so that it contains enough object diversity, or has similar object-action associations in its training and test sets.

Chapter 4

(Diagram labels: object image $I_{\text{obj}}$ and object mask $M_{\text{obj}}$; pasted mask $M_{\text{paste}}$; optical flow of $V_{\text{orig}}$; losses $L_{\text{flow\_overlap}}$, $L_G$, $L_D$, $L_{\text{clf}}$, $L_{\text{adv}}$.)

Figure 4.1: The overall architecture of AdvOST. AdvOST is composed of three sub-networks: (a) a synthesizer S that applies an affine transformation to the given object image and pastes it onto the original video to form an augmented video, (b) a classifier C that predicts the action class from the augmented video at the training stage, and (c) a discriminator D that judges whether the input video is original or augmented and provides training signals that push the synthesizer toward natural-looking synthesis. An additional regularization term, the flow overlap loss, prevents the synthesizer from pasting onto regions where motion occurs.

4.1 AdvOST

We show our overall training architecture called AdvOST in Fig. 4.1. This architecture consists of a synthesizer S, a classifier C, and a discriminator D.

For the synthesizer S, given an original video $v_{\text{orig}} \in V_{\text{orig}}$ and an object image $i \in I_{\text{obj}}$, S infers an affine transformation matrix to apply to the object image. The transformed object image is pasted onto $v_{\text{orig}}$ to produce an augmented video $v_{\text{aug}}$. The network structure of S is illustrated and described in Fig. 4.2.
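The transform-and-paste step can be sketched as follows, assuming the five parameters named in the Fig. 4.2 caption (two translations, two scalings, one rotation). Everything here is illustrative rather than the authors' code: the function names are hypothetical, the composition order (scale, then rotate, then translate) is an assumption, frames are single-channel nested lists, and sampling is nearest-neighbor for brevity.

```python
import math


def affine_matrix(tx, ty, sx, sy, theta):
    """2x3 matrix for scale -> rotate -> translate (assumed composition order)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[sx * c, -sy * s, tx],
            [sx * s, sy * c, ty]]


def warp_nearest(img, mat, out_h, out_w, fill=0.0):
    """Warp img onto an out_h x out_w canvas by inverse lookup, nearest neighbor."""
    (a, b, tx), (c, d, ty) = mat
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("degenerate affine matrix")
    h, w = len(img), len(img[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            dx, dy = x - tx, y - ty
            # Invert the 2x2 linear part to find the source coordinate.
            u = math.floor((d * dx - b * dy) / det + 0.5)
            v = math.floor((-c * dx + a * dy) / det + 0.5)
            if 0 <= v < h and 0 <= u < w:
                out[y][x] = img[v][u]
    return out


def paste(frame, obj, mask):
    """Blend per pixel: mask * object + (1 - mask) * frame."""
    return [[m * o + (1 - m) * f for f, o, m in zip(fr, ob, mr)]
            for fr, ob, mr in zip(frame, obj, mask)]


def synthesize(video, obj_img, obj_mask, params):
    """Return (augmented video, pasted mask) for one object and one parameter set."""
    mat = affine_matrix(*params)
    h, w = len(video[0]), len(video[0][0])
    warped_obj = warp_nearest(obj_img, mat, h, w)
    m_paste = warp_nearest(obj_mask, mat, h, w)
    v_aug = [paste(frame, warped_obj, m_paste) for frame in video]
    return v_aug, m_paste


# Two blank 4x4 frames; a 1x1 object moved to column 2, row 1.
video = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
v_aug, m_paste = synthesize(video, [[1.0]], [[1.0]], (2, 1, 1.0, 1.0, 0.0))
```

In a trainable synthesizer the warp would need to be differentiable (for example bilinear sampling), since the parameters are predicted by a network; the nearest-neighbor version above only shows the geometry.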

The classifier's goal is to predict the action given $v_{\text{aug}}$ at the training stage and $v_{\text{orig}}$ at the testing stage. Its architecture can be any action recognition model; therefore, AdvOST is model-agnostic.

The final sub-network D acts like the discriminator in a conventional generative adversarial network [12]. In each batch, $v_{\text{orig}}$ and $v_{\text{aug}}$ are fed to D, which has to judge whether the input video is authentic ($v_{\text{orig}}$) or synthesized ($v_{\text{aug}}$). Its goal is to prevent the synthesizer S from pasting objects in unnatural positions or at odd angles. Otherwise, the classifier could easily learn to ignore the unnatural regions, which would make the augmentation less effective.
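For readers unfamiliar with the adversarial setup, the sketch below shows the standard binary cross-entropy objectives of a generative adversarial network applied to this setting. It is an assumption that AdvOST uses this particular form; the exact losses are not given in this passage, and the function names are hypothetical. The inputs are D's output probabilities that a video is original.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability, clamped for numerical safety."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_orig, d_aug):
    """D should output 1 for original videos and 0 for augmented ones."""
    real = sum(bce(p, 1.0) for p in d_orig) / len(d_orig)
    fake = sum(bce(p, 0.0) for p in d_aug) / len(d_aug)
    return real + fake


def synthesizer_adv_loss(d_aug):
    """Non-saturating form: S is rewarded when D takes augmented videos as original."""
    return sum(bce(p, 1.0) for p in d_aug) / len(d_aug)


# An undecided discriminator versus a confident, correct one.
undecided = discriminator_loss([0.5], [0.5])
confident = discriminator_loss([0.9], [0.1])
```

The two objectives pull in opposite directions on the augmented videos, which is what pressures the synthesizer toward placements the discriminator cannot tell apart from real footage.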

Figure 4.2: The network architecture of our synthesizer. (a) The video convs block and the object convs block extract features from the given videos and objects. The extracted features are concatenated along the channel dimension and processed by (b) the feature mixing convs block, whose output is pooled by a global average pooling layer. The pooled feature is then fed into (c) the affine parameter predictor, which predicts the 5 parameters of the affine matrix. The 5 degrees of freedom comprise 2 translations, 2 scalings, and 1 rotation. We then apply (d) the affine transformation to the object image $I_{\text{obj}}$ and its mask $M_{\text{obj}}$ and use them to (e) paste the transformed object onto the original video $V_{\text{orig}}$, producing the augmented video $V_{\text{aug}}$ and a transformed object mask $M_{\text{paste}}$.
