Fallacious Object Reliance - Mitigating Bias in Action Recognition Models with Adversarial Object Synthesis Training

It should be noted that it is not intrinsically wrong for action models to have a high object reliance level. If the captured object bias is universal across different datasets, we should consider the object an essential part of that action. For instance, the action "playing piano" is genuinely associated with the object "piano".

However, the object biases in the training set D_train and the testing set D_test are not guaranteed to be the same. In this situation, action models with heavy object reliance will learn the wrong object-action association and make inaccurate predictions on the testing set (see Fig. 3.2).

Therefore, we propose a new method to inspect the discrepancy of object representation bias between D_train and D_test. The idea is to calculate the performance drop of M_obj→act when it is trained on D_train and tested on D_test:

In our experiments, we choose M_obj→act as a simple linear regression classifier and P_k as the per-class F1 score; hence, equation 3.5 can be interpreted as the object distribution divergence between D_train and D_test given an action k.

Figure 3.2: Examples of Fallacious Object Reliance with the action class "throw". (a) In the EPIC-KITCHENS dataset, "throw" is often tied to "trash can", so the model relies on trash cans to predict the throwing action. However, in the testing set, "throw" may be associated with other things, such as the highlighted pot. (b) In HMDB51, "throw" is related to totally different objects from those in EPIC-KITCHENS. Action recognition models should capture the motion part of actions and avoid these FOR problems.
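The performance drop of M_obj→act described above can be sketched in a few lines. This is a minimal illustration, assuming the discrepancy takes the form B_diff(D_train, D_test, k) = P_k(D_train, M_obj→act) − P_k(D_test, M_obj→act); that form is a reading consistent with the text, not one the text confirms. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for k in set(y_true) | set(y_pred):
        denom = 2 * tp[k] + fp[k] + fn[k]
        scores[k] = 2 * tp[k] / denom if denom else 0.0
    return scores


def object_bias_discrepancy(train_true, train_pred, test_true, test_pred):
    """ASSUMED form of B_diff: per-class F1 on D_train minus per-class F1 on D_test."""
    f1_train = per_class_f1(train_true, train_pred)
    f1_test = per_class_f1(test_true, test_pred)
    return {k: f1_train[k] - f1_test.get(k, 0.0) for k in f1_train}


# Toy predictions of a hypothetical object-to-action classifier.
train_true = ["throw", "throw", "wash", "wash"]
train_pred = ["throw", "throw", "wash", "wash"]  # fits the training set perfectly
test_true = ["throw", "throw", "wash", "wash"]
test_pred = ["wash", "throw", "wash", "wash"]    # one "throw" is missed at test time

b_diff = object_bias_discrepancy(train_true, train_pred, test_true, test_pred)
for action in sorted(b_diff):
    print(action, round(b_diff[action], 3))
```

In this toy case the classifier is perfect on the training labels, so each class's discrepancy equals one minus its test F1.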

If an action model performs worse on groups whose object bias discrepancy is large, this indicates that the model is using the wrong object-action association to predict actions. Building on this idea, we propose our Fallacious Object Reliance (FOR) measurement, which evaluates the alignment between the action model's performance on D_test and the negative of the object bias discrepancy:

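$$\mathrm{FOR}(M_{\mathrm{vid} \to \mathrm{act}}) = A\big(P(D_{\mathrm{test}}, M_{\mathrm{vid} \to \mathrm{act}}),\ -B_{\mathrm{diff}}(D_{\mathrm{train}}, D_{\mathrm{test}}, k)\big) \tag{3.6}$$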
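To make the FOR measurement concrete, the sketch below scores the alignment between per-class video-model performance and the negated object bias discrepancy. The alignment function A is not specified in this passage, so Pearson correlation across action classes is used here purely as a stand-in assumption; the function names and all numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def for_score(video_perf, b_diff):
    """Alignment between P(D_test, M_vid->act) and -B_diff over classes k.

    ASSUMPTION: A is taken to be Pearson correlation; the source does not say so here.
    """
    classes = sorted(video_perf)
    perf = [video_perf[k] for k in classes]
    neg_diff = [-b_diff[k] for k in classes]
    return pearson(perf, neg_diff)


# Hypothetical per-class test performance of a video model and discrepancies.
video_perf = {"throw": 0.40, "wash": 0.70, "cut": 0.90}
b_diff = {"throw": 0.50, "wash": 0.20, "cut": 0.05}

score = for_score(video_perf, b_diff)
print(round(score, 3))
```

Under this stand-in, a score near 1 means the model does worst exactly where the object bias shifts most between D_train and D_test, which is the pattern the text calls fallacious object reliance.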

Fig. 3.3 shows the FOR scores of I3D on three different datasets. Note that because M_obj→act has almost 100% accuracy on D_train of HMDB51 split 1, B_diff(D_train, D_test, k) is close to (1 − the performance on D_test), so Fig. 3.3 (b) is almost the same as Fig. 3.1 (b).

Figure 3.3: The Fallacious Object Reliance (FOR) scores of I3D on three different datasets: (a) EPIC-KITCHENS, (b) HMDB51 split 1, and (c) Moments in Time. Although I3D has strong ORL on all three datasets, it retains high FOR scores only on EPIC-KITCHENS and HMDB51. This indicates that the Moments in Time dataset is either better calibrated, so that it contains enough object diversity, or has similar object-action associations between its training and test sets.

Chapter 4

[Figure 4.1 diagram labels: object image I_obj and object mask M_obj; pasted mask M_paste; optical flow of V_orig; losses L_flow_overlap, L_G, L_D, L_clf, L_adv]

Figure 4.1: The overall architecture of AdvOST. AdvOST is composed of three sub-networks: (a) a synthesizer S that affinely transforms the given object image and pastes it onto the original video to form an augmented video, (b) a classifier C that predicts the action class given the augmented video at the training stage, and (c) a discriminator D that judges whether the input video is original or augmented and provides training signals for the synthesizer to produce natural syntheses. An additional regularization term, called the flow overlap loss, is added to prevent the synthesizer from pasting objects where motion occurs.
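The role of the flow overlap loss can be illustrated with a small sketch. The exact formulation is not given in this passage, so the version below assumes the loss is the mean, over pixels, of the pasted mask M_paste multiplied by the optical flow magnitude of V_orig; the function name and the toy arrays are hypothetical.

```python
import math


def flow_overlap_loss(paste_mask, flow):
    """Mean of mask * flow magnitude over all pixels.

    ASSUMPTION: this is one plausible form of the flow overlap loss,
    not the formulation confirmed by the source.
    paste_mask: H x W nested list of values in [0, 1].
    flow:       H x W nested list of (dx, dy) displacement pairs.
    """
    total, count = 0.0, 0
    for mask_row, flow_row in zip(paste_mask, flow):
        for m, (dx, dy) in zip(mask_row, flow_row):
            total += m * math.hypot(dx, dy)
            count += 1
    return total / count if count else 0.0


# A 2 x 2 frame in which only the top-left pixel is moving.
flow = [[(3.0, 4.0), (0.0, 0.0)],
        [(0.0, 0.0), (0.0, 0.0)]]

mask_on_motion = [[1.0, 0.0], [0.0, 0.0]]  # object pasted over the moving pixel
mask_on_still = [[0.0, 1.0], [0.0, 0.0]]   # object pasted over a static pixel

loss_on_motion = flow_overlap_loss(mask_on_motion, flow)
loss_on_still = flow_overlap_loss(mask_on_still, flow)
print(loss_on_motion, loss_on_still)
```

With this form, pasting over a moving region is penalized while pasting over a static region costs nothing, which matches the stated purpose of keeping the object away from where motion occurs.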

4.1 AdvOST

Fig. 4.1 shows our overall training architecture, called AdvOST. This architecture consists of a synthesizer S, a classifier C, and a discriminator D.

For the synthesizer S, given an original video v_orig ∈ V_orig and an object image i ∈ I_obj, S infers an affine transformation matrix to apply to the object image. The transformed object image is pasted onto v_orig to produce an augmented video v_aug. Its network structure is illustrated and described in Fig. 4.2.
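The pasting step can be sketched as mask-based compositing. This is an illustrative assumption rather than the confirmed implementation: it treats frames as grayscale nested lists, blends each pixel as mask * object + (1 − mask) * frame, and pastes the same transformed object onto every frame. All names and values are hypothetical.

```python
def paste_object(frame, obj, mask):
    """Composite one frame: out = mask * obj + (1 - mask) * frame, per pixel."""
    return [
        [m * o + (1.0 - m) * f for f, o, m in zip(frame_row, obj_row, mask_row)]
        for frame_row, obj_row, mask_row in zip(frame, obj, mask)
    ]


def paste_on_video(video, obj, mask):
    """Apply the same paste to every frame (ASSUMPTION: object is static over time)."""
    return [paste_object(frame, obj, mask) for frame in video]


# Two identical 2 x 2 grayscale frames standing in for v_orig.
frame = [[0.0, 0.2],
         [0.4, 0.6]]
v_orig = [frame, frame]

# Already-transformed object image and its soft mask (stand-in for M_paste).
obj = [[1.0, 1.0],
       [1.0, 1.0]]
m_paste = [[1.0, 0.0],
           [0.5, 0.0]]

v_aug = paste_on_video(v_orig, obj, m_paste)
print(v_aug[0])
```

Pixels where the mask is 0 keep the original video content, pixels where it is 1 show the object, and fractional values blend the two.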

The classifier's goal is to predict the action given v_aug at the training stage and v_orig at the testing stage. Its architecture can be any action recognition model; therefore, AdvOST is model-agnostic.

The final sub-network D acts just like the discriminator in the traditional generative adversarial network structure [12]. In each batch, v_orig and v_aug are fed to D, which has to judge whether the input video is authentic (v_orig) or synthesized (v_aug). Its goal is to prevent the synthesizer S from pasting objects in unnatural positions or at odd angles. Otherwise, it would be easy for the classifier to ignore the unnatural parts, which would undermine our purpose.
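As a rough illustration of the training signal D provides, the sketch below writes out the standard binary cross-entropy GAN objectives. The text only says that D behaves like a traditional GAN discriminator, so the specific loss forms (including the non-saturating synthesizer loss) are assumptions, and the function names and probabilities are hypothetical.

```python
import math

EPS = 1e-7  # keeps log() away from 0


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def discriminator_loss(d_orig, d_aug):
    """ASSUMED L_D: -log D(v_orig) - log(1 - D(v_aug)), averaged over the batch.

    d_orig / d_aug hold D's probability that each video is authentic.
    """
    terms = [-math.log(_clip(p)) for p in d_orig]
    terms += [-math.log(1.0 - _clip(p)) for p in d_aug]
    return sum(terms) / len(terms)


def synthesizer_adv_loss(d_aug):
    """ASSUMED adversarial loss for S (non-saturating form): -log D(v_aug)."""
    return sum(-math.log(_clip(p)) for p in d_aug) / len(d_aug)


# D is confident and correct: low discriminator loss, high synthesizer loss.
sharp_d = discriminator_loss(d_orig=[0.9, 0.8], d_aug=[0.1, 0.2])
# D cannot tell the videos apart: its loss rises to about log(2).
confused_d = discriminator_loss(d_orig=[0.5, 0.5], d_aug=[0.5, 0.5])

fooled = synthesizer_adv_loss([0.9])   # augmented video looks authentic to D
caught = synthesizer_adv_loss([0.1])   # augmented video is easily spotted
print(round(sharp_d, 3), round(confused_d, 3), round(fooled, 3), round(caught, 3))
```

Under these assumed losses, the synthesizer is rewarded when its pasted objects look natural enough that D rates the augmented video as authentic.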

Figure 4.2: The network architecture of our synthesizer. (a) The video convs block and the object convs block extract the features of the given videos and objects. The extracted features are concatenated along the channel dimension and processed by (b) the feature mixing convs block, whose output is then pooled by a global average pooling layer. The pooled feature is fed into (c) the affine parameter predictor, which predicts the 5 parameters of the affine transformation matrix. The 5 degrees of freedom comprise 2 translations, 2 scalings, and 1 rotation. We then apply (d) the affine transformation to the object image I_obj and its mask M_obj and use them to (e) paste the transformed object onto the original video V_orig, producing the augmented video V_aug and a transformed object mask M_paste.
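The 5-parameter affine transformation in the caption above can be written out explicitly. The sketch below assumes the matrix is composed as translation · rotation · scaling, which is one common convention; the composition order, the function names, and the example values are assumptions, not details taken from the source.

```python
import math


def affine_matrix(tx, ty, sx, sy, theta):
    """Build a 2 x 3 affine matrix from 2 translations, 2 scalings, 1 rotation.

    ASSUMED composition: scale first, then rotate by theta (radians), then translate.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [
        [sx * c, -sy * s, tx],
        [sx * s,  sy * c, ty],
    ]


def apply_affine(mat, x, y):
    """Map a point (x, y) through the 2 x 3 affine matrix."""
    return (
        mat[0][0] * x + mat[0][1] * y + mat[0][2],
        mat[1][0] * x + mat[1][1] * y + mat[1][2],
    )


# Identity parameters leave points unchanged.
identity = affine_matrix(tx=0.0, ty=0.0, sx=1.0, sy=1.0, theta=0.0)
same_point = apply_affine(identity, 2.0, 3.0)

# Scale by 2, rotate 90 degrees, then shift right by 1.
mat = affine_matrix(tx=1.0, ty=0.0, sx=2.0, sy=2.0, theta=math.pi / 2)
moved_point = apply_affine(mat, 1.0, 0.0)
print(same_point, moved_point)
```

Applying the same matrix to both the object image and its mask keeps the two aligned, which is what lets the mask mark exactly where the object lands in the frame.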
