5.1 Incremental learning instance

5.1.2 Implementation Details

We implement several approaches that also address catastrophic forgetting for comparison. For all approaches, we adopt the network architecture proposed by Zeiler and Fergus [48] and pre-train the models on ImageNet [6]. Specifically, the model, which we denote as ZF net, has 8 layers (see Figure 5.2). When an input image is fed into ZF net, it is convolved with 96 different first-layer filters, each of size 7 by 7, with a stride of 2 in both the x and y directions. The resulting feature maps are then (i) passed through a rectified linear function (ReLU), (ii) max-pooled within 3x3 regions using a stride of 2, and (iii) contrast-normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4 and 5; see Figure 5.2 for the detailed differences. The convolutional layers are followed by two dense layers: the features from the top convolutional layer are first flattened into a 9216-dimensional (6 · 6 · 256) vector, from which a 4096-dimensional feature vector is then produced. The final layer is a classification layer that acts on this feature vector and outputs a C-dimensional vector, where C is a predefined number of classes. To enable incremental learning, we discard the final layer and add an extensible classification layer to accommodate new classes. Additionally, we add an ROIPooling layer [49] right before the fc6 layer so that the networks can accommodate images of different sizes and classify local patches of an image.

1 We merge desk_[1-3] into desk and table_small_[1-2] into table_small.

2 b, c, cb, cm, f and sc are the abbreviations of bowl, cap, cereal_box, coffee_mug, flashlight and soda_can, respectively.

Figure 5.2: ZF model.
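As an illustration of what an extensible classification layer could look like, the following is a minimal sketch in plain Python. The class name `ExtensibleClassifier`, the Gaussian initialisation with standard deviation 0.01 and the zero biases are assumptions made for this example only; the text does not specify how new class neurons are initialised.

```python
import random


class ExtensibleClassifier:
    """Linear classification layer whose number of outputs can grow.

    Hypothetical helper for illustration; not the thesis implementation.
    """

    def __init__(self, feature_dim, seed=0):
        self.feature_dim = feature_dim
        self.weights = []   # one weight row per known class
        self.biases = []    # one bias per known class
        self._rng = random.Random(seed)

    def add_classes(self, num_new):
        """Append freshly initialised rows; existing rows are left untouched."""
        for _ in range(num_new):
            # Assumed initialisation: small Gaussian weights, zero bias.
            row = [self._rng.gauss(0.0, 0.01) for _ in range(self.feature_dim)]
            self.weights.append(row)
            self.biases.append(0.0)

    def forward(self, feature):
        """Return one logit per known class for a single feature vector."""
        return [
            sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(self.weights, self.biases)
        ]


clf = ExtensibleClassifier(feature_dim=4096)   # 4096-d feature vector, as in the text
clf.add_classes(3)                             # an early stage with 3 classes
old_rows = [row[:] for row in clf.weights]     # snapshot of the old classifiers
clf.add_classes(2)                             # a later stage adds 2 new classes
logits = clf.forward([0.0] * 4096)             # now yields 5 logits
```

Growing the layer by appending rows keeps the weights of the existing class neurons intact, so the old classifiers are unchanged at the moment new classes are added.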

During training, we fine-tune all parameters after conv5³ for all approaches except fixed representation. We use a fixed learning rate of 0.0001, a weight decay of 0.0005 and a mini-batch of 5 images (around 25 samples) per iteration. Each incremental training stage runs for 80 epochs, or stops early if the average classification loss over one epoch falls below 0.005.

3 Fine-tuning the earlier layers reduces performance for all approaches, because it degrades the rich feature representations obtained by pre-training on ImageNet.
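The stopping rule for one incremental training stage can be sketched as follows. The constants are taken from the text; `run_epoch` is a hypothetical callable, assumed here to train for one epoch and return the average classification loss over that epoch, and `fake_epoch` is a toy stand-in used only to make the sketch runnable.

```python
MAX_EPOCHS = 80          # epochs per incremental training stage (from the text)
LOSS_THRESHOLD = 0.005   # early-stopping threshold on the average loss (from the text)


def train_stage(run_epoch, max_epochs=MAX_EPOCHS, threshold=LOSS_THRESHOLD):
    """Run one training stage and return the number of epochs actually run.

    `run_epoch(epoch)` is a hypothetical callable that trains for one epoch
    and returns the average classification loss over that epoch.
    """
    for epoch in range(1, max_epochs + 1):
        avg_loss = run_epoch(epoch)
        if avg_loss < threshold:
            return epoch      # stop early: loss fell below the threshold
    return max_epochs


def fake_epoch(epoch):
    """Toy stand-in for real training: the loss halves every epoch."""
    return 1.0 / (2 ** epoch)


epochs_run = train_stage(fake_epoch)
```

With the toy loss above, the stage stops once the loss first drops below 0.005; a run whose loss never gets that low uses all 80 epochs.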

Baseline

The baseline solution is fixed representation, which learns only the weights of the new classifiers or of the existing classifiers, depending on whether the current training dataset contains old classes or only new classes. This approach avoids forgetting old knowledge by fixing the existing classifiers’ weights. Figure 5.3 shows the trainable part for the case where the current dataset contains both new and old classes.

Figure 5.3: Trainable weights for the fixed representation method in the case where the current dataset contains both new and old classes.

Less-forgetting Learning in Deep Neural Networks

Jung et al.’s work [39] mainly deals with domain adaptation. As that task does not require adding new class neurons to the final layer, we manually modify their approach to be class-extensible and leave the other elements unchanged. Their approach tries to maintain the final classifier’s decision boundary by fixing its weights, and to preserve the feature representations of the last layer (before the classification layer) by using a Euclidean loss. In addition, they allow the intermediate layers to be trained. We show the modified version in Figure 5.4 and denote it as LF*.

Figure 5.4: Modified version of [39] that supports class extension.

Compete to Compute

We implement Local-Winner-Takes-All (LWTA) [31] in fc6 and fc7 with a block size of 16, which yields the best performance among 4, 8 and 16. Training with LWTA is exactly the same as fine-tuning the current model. The difference is that signal propagation is now constrained: within each block, only the maximum value is passed on to the next layer.
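The block-wise competition can be sketched in a few lines of plain Python. This is a minimal illustration of the LWTA activation on a single layer output, assuming that the layer width is a multiple of the block size and that ties are resolved in favour of the first unit; the function name `lwta` is chosen for this example.

```python
def lwta(activations, block_size=16):
    """Local winner-takes-all: keep the maximum of each block, zero the rest."""
    if len(activations) % block_size != 0:
        raise ValueError("layer width must be a multiple of block_size")
    out = [0.0] * len(activations)
    for start in range(0, len(activations), block_size):
        block = activations[start:start + block_size]
        # Index of the winning unit inside this block (first one on ties).
        winner = max(range(block_size), key=lambda j: block[j])
        out[start + winner] = block[winner]
    return out


# Two blocks of size 4: only the largest value of each block survives.
example = lwta([1.0, 3.0, 2.0, -1.0, -5.0, -2.0, -3.0, -4.0], block_size=4)
# example == [0.0, 3.0, 0.0, 0.0, 0.0, -2.0, 0.0, 0.0]
```

Because each input activates only one unit per block, different inputs tend to update different subsets of weights, which is the property LWTA relies on to reduce interference.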

Learning without Forgetting

Li and Hoiem’s work [37] fits our scenario directly if multi-task learning is treated as multi-class learning. We denote the method as LwF. Its training rule is very similar to pseudorehearsal, which also distils the old model’s knowledge. The distinction is that, instead of using additional data (e.g. pseudo data), LwF uses the currently available data as the source for preserving knowledge. In addition, LwF uses the distillation loss proposed by Hinton et al. [35] rather than the L2 distance loss we adopt in our approach. The loss used in LwF [37] is defined as

$$
L_{dis,LwF} = -\sum_{(x,y)\in D} \sum_{i=1}^{C_o} g_i(x; \theta^{k-1}) \log g_i(x; \theta^{k}) \tag{5.1}
$$

where g_i(x; θ^{k-1}) is the i-th element of the output probabilities from the old model M_{k-1} with fixed weights θ^{k-1}, and θ^k denotes the current model’s weights, which are trainable. The overall loss is thus the sum of the common classification loss in (4.3) and the LwF distillation loss in (5.1).
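To make Equation (5.1) concrete, the following sketch evaluates it on precomputed probability vectors. It assumes that the loss carries a leading minus sign (a cross-entropy between old and current outputs) and that the inner sum runs over the old classes only; the function name, the argument layout and the small clamp that avoids log(0) are choices made for this example.

```python
import math


def lwf_distillation_loss(old_probs, new_probs, eps=1e-12):
    """Cross-entropy between old-model and current-model probabilities.

    old_probs: per-sample lists g_i(x; theta^{k-1}), one entry per old class.
    new_probs: per-sample lists g_i(x; theta^k); only the entries for the
               old classes (the leading ones, by assumption) are used.
    """
    loss = 0.0
    for p_old, p_new in zip(old_probs, new_probs):
        for i, p in enumerate(p_old):          # i runs over the old classes
            q = max(p_new[i], eps)             # clamp to avoid log(0)
            loss -= p * math.log(q)
    return loss


# One sample, two old classes; the current model also has one new class.
old = [[0.5, 0.5]]
matching = lwf_distillation_loss(old, [[0.5, 0.5, 0.0]])
drifted = lwf_distillation_loss(old, [[0.9, 0.1, 0.0]])
```

In this toy case the loss is smaller when the current model reproduces the old probabilities than when it drifts away from them, which is the behaviour the distillation term is meant to encourage.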

Pseudo Rehearsal

For the original pseudorehearsal [40], we set ñ (the number of pseudo samples per class) to 80. Before each training stage, randomized images are generated as pseudo data for later use. During training, the loss weights λ_cls and λ_dis are both set to 0.1. The batch size for pseudo data is set to 16 samples per iteration.
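A rough sketch of how these settings might fit together is given below. The constants come from the text; everything else is an assumption for illustration: the pseudo images are drawn as uniform noise, the distillation term is taken to be a mean squared L2 distance between old and current outputs, the total loss is taken to be a weighted sum of the two terms, and `toy_model` is a hypothetical stand-in for the old network.

```python
import random

N_PER_CLASS = 80     # pseudo samples per class (from the text)
LAMBDA_CLS = 0.1     # weight of the classification loss (from the text)
LAMBDA_DIS = 0.1     # weight of the distillation loss (from the text)
PSEUDO_BATCH = 16    # pseudo samples per iteration (from the text)


def make_pseudo_images(num_old_classes, pixels, rng):
    """Generate randomized images before a training stage (uniform noise assumed)."""
    count = N_PER_CLASS * num_old_classes
    return [[rng.random() for _ in range(pixels)] for _ in range(count)]


def pseudo_batches(data, batch_size=PSEUDO_BATCH):
    """Yield consecutive batches of pseudo samples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def l2_distillation(old_outputs, new_outputs):
    """Mean squared L2 distance between old and current outputs (assumed form)."""
    total = 0.0
    for old, new in zip(old_outputs, new_outputs):
        total += sum((o - n) ** 2 for o, n in zip(old, new))
    return total / len(old_outputs)


def total_loss(cls_loss, dis_loss):
    """Weighted sum of the two loss terms (assumed combination)."""
    return LAMBDA_CLS * cls_loss + LAMBDA_DIS * dis_loss


def toy_model(image):
    """Hypothetical stand-in for the old network's output on one image."""
    return [sum(image), max(image)]


rng = random.Random(0)
pseudo_images = make_pseudo_images(num_old_classes=3, pixels=16, rng=rng)
old_outputs = [toy_model(img) for img in pseudo_images]   # recorded once per stage
unchanged = l2_distillation(old_outputs, old_outputs)      # current model == old model
num_batches = len(list(pseudo_batches(pseudo_images)))
```

The old model's outputs on the pseudo data are recorded once and then act as fixed targets, so the distillation term is zero as long as the current model still reproduces them.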

Pseudo Rehearsal with Imaging Recollection and Pseudo Neurons

The training settings of our approach are the same as those of pseudorehearsal. In addition, the number of the proposed pseudo neurons is set to 10.

5.1.3 Results

Figure 5.5 shows how well the networks preserve old knowledge, measured by re-testing on the testing data of previous scenes. Among all the methods that mitigate catastrophic forgetting, ours performs best in both the intermediate stages and the final stage. The closest competitor is LwF, which trails ours by around 5%.

Figure 5.5: Curve demonstrating how much knowledge is preserved, obtained by testing the model on the accumulated testing data.

Nonetheless, as our goal is to train a robust recognition system in an incremental manner, the results on testset2, which contains various imaging conditions, are more representative. Table 5.1 shows the overall performance on testset1 and testset2 after training on the 8 datasets. After incrementally learning 28 instances from different scenes, our approach learns better invariance than the others, reaching 74.48% accuracy on testset2 and outperforming the second-best method, LwF, by about 13 percentage points. We further report the accuracies (see Table 5.2) over 4 additional trials with random training orders of the 8 datasets to rule out special cases. All the results are consistent and show that our approach outperforms the others.

Table 5.1: Accuracy (%) on testset1 and testset2 after incremental training

Approach     testset1        testset2
Fix rep.     35.30 ± 1.98    22.53 ± 1.54
LF* [39]     50.38 ± 2.22    39.54 ± 2.50
PR [40]      59.55 ± 3.07    47.93 ± 3.11
LWTA [31]    65.27 ± 3.51    48.79 ± 2.78
LwF [37]     82.54 ± 5.12    61.32 ± 4.72
Ours         87.23 ± 3.02    74.48 ± 2.21

Table 5.2: Accuracy (%) on testset1 / testset2 over 4 trials

Approach    trial 1        trial 2        trial 3        trial 4        average
Fix rep.    35.3 / 22.5    50.1 / 36.2    62.7 / 44.9    28.1 / 16.0    46.2 / 33.2

We have already shown the overall performance of our approach on the RGB-D Scenes Dataset, in which each incremental training stage contains data of new classes or different imaging conditions. In this part, we separately investigate whether our approach can learn invariance by seeing the same objects under different imaging conditions.

5.2.1 GMU Kitchen Dataset

In this experiment, we rely on the GMU Kitchen Dataset [50], which is similar to the RGB-D Scenes Dataset but has more challenging imaging conditions. The dataset contains 9 cluttered environments, each with 9 to 11 objects.⁴ We select the 7 scene datasets that include all 11 objects. Among these 7 datasets, we randomly choose 3 for incremental training and use the remaining 4 for testing. As with the RGB-D Scenes Dataset, training data are collected by subsampling every 5 frames of each training dataset. The implementation settings of the approaches are also the same. We conduct 3 trials with different training and testing datasets, and run every trial 5 times to obtain the average performance. We report fixed representation as the baseline, LwF, and our approach.

4 The dataset contains main objects and extra objects. We use the main objects here.
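The data split described above can be sketched as follows. The scene identifiers, the fixed seeds and the helper names are hypothetical, and subsampling "every 5 frames" is interpreted here as keeping every fifth frame starting from the first one.

```python
import random


def split_scenes(scene_ids, num_train=3, seed=0):
    """Randomly pick training scenes; the remaining scenes are used for testing."""
    rng = random.Random(seed)
    train = rng.sample(scene_ids, num_train)
    test = [s for s in scene_ids if s not in train]
    return train, test


def subsample(frames, step=5):
    """Keep every `step`-th frame of a training sequence."""
    return frames[::step]


# Hypothetical identifiers for the 7 selected scene datasets.
scenes = [f"scene_{i}" for i in range(1, 8)]

# One split per trial; a different seed gives different training and testing sets.
trials = [split_scenes(scenes, seed=trial) for trial in range(3)]
train_scenes, test_scenes = trials[0]

kept = subsample(list(range(12)))   # frame indices 0, 5 and 10 are kept
```

Each of the 3 trials would then be repeated 5 times with the same split to obtain the average performance.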
