Pseudo Rehearsal with Imaging Recollection and Pseudo Neurons
4.4 Pseudo Neurons
When training on a new dataset, we rely on the distillation loss to preserve the previous behaviour of the model and on the classification loss to acquire novel knowledge. However, such a training process blinds the newly added neurons to the old information introduced by the pseudo data, so the two objective functions affect the output neurons asymmetrically. We therefore propose pseudo neurons, which let both kinds of information flow through the neurons representing old and new classes alike. We extend the last (logits) layer of the model with several additional pseudo neurons that initially represent no class. These neurons are trained only to be less activated for the classes learned initially. When a new training stage begins and new classes are introduced, these pseudo neurons are converted into capacity for the new classes.
With pseudo neurons in place, $f$ now includes both new-class neurons and pseudo neurons. Equations (4.2) and (4.3) are modified to
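$$L_{dis} = \sum_{(\tilde{x}, \tilde{f}) \in D_e} \; \sum_{i=1}^{C_o + C_n + C_{po}} \frac{1}{2} \left\| f_i(\tilde{x}; \theta) - \tilde{f}_i \right\|_2^2 \tag{4.10}$$
$$L_{cls} = - \sum_{(x, y) \in D} \; \sum_{i=1}^{C_o + C_n + C_{po} + C_{pn}} y_i \log\bigl(g_i(x; \theta)\bigr) \tag{4.11}$$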
where $C_o$, $C_n$, $C_{po}$, and $C_{pn}$ are the numbers of old classes, new classes, remaining old pseudo neurons, and newly added pseudo neurons, respectively. Note that the neurons counted in $C_n$ are obtained by converting some of the pseudo neurons.
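As an illustration of (4.10) and (4.11), the following NumPy sketch evaluates both losses for a single sample. It is a minimal sketch and not the Caffe implementation used in this work. The function names, the neuron ordering (old classes, new classes, old pseudo neurons, new pseudo neurons) and the small constant inside the logarithm are assumptions made for the example, and $g$ is assumed to be the softmax of $f$.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def distillation_loss(logits, recorded_logits):
    # (4.10) for one pseudo sample: squared error over the first
    # Co + Cn + Cpo logits, i.e. every neuron the previous model already had.
    k = recorded_logits.shape[0]
    diff = logits[:k] - recorded_logits
    return 0.5 * np.sum(diff ** 2)

def classification_loss(logits, one_hot_label):
    # (4.11) for one true sample: cross-entropy over all
    # Co + Cn + Cpo + Cpn outputs. The 1e-12 term is an assumption
    # added only to avoid log(0).
    probs = softmax(logits)
    return -np.sum(one_hot_label * np.log(probs + 1e-12))

# Hypothetical sizes: Co = 3, Cn = 1, Cpo = 1, Cpn = 1, so 6 output neurons.
num_prev, num_total = 5, 6
logits = np.zeros(num_total)
recorded = np.zeros(num_prev)        # logits stored with the pseudo sample
label = np.zeros(num_total)
label[3] = 1.0                       # the new class sits at index Co

dis = distillation_loss(logits, recorded)
cls = classification_loss(logits, label)
print(dis, cls)
```

Note that the newly added pseudo neurons (the last $C_{pn}$ outputs) enter only the classification loss, because the previous model has no recorded logits for them.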
Figure 4.5: The arrangement of output neurons without (a) and with (b) pseudo neurons.
Figure 4.5 compares the arrangement of output neurons with and without pseudo neurons. In Figure 4.5 (a), the neurons for new classes are added directly. Because the distillation loss penalizes the dissimilarity between the previous model and the currently trained model on pseudo data, it can only be applied to neurons that already existed. In contrast, the arrangement in Figure 4.5 (b) allows the distillation loss to be applied to the converted neurons as well.
To examine why such a simple arrangement boosts performance, we consider the following two cases, without and with pseudo neurons.
We first consider the original version, in which no pseudo neuron is added. Consider an extreme case in which only one new class is introduced by the new data ($C_n = 1$). We form a minibatch consisting of only one pseudo sample $(\tilde{x}_{old}, \tilde{f}_{old})$ representing old knowledge and one true sample $(x_{new}, y_{new})$ of the new class. The target index $t$ of the new sample is thus $C_o + 1$. Then, by (4.2) and (4.3), the total loss $L$ for this minibatch is the sum of the distillation loss on the pseudo sample and the classification loss on the true sample.
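The optimized state, where $L \approx 0$, can be achieved when
$$f_i(\tilde{x}_{old}; \theta) \approx \tilde{f}_i, \quad \text{for } i \neq C_o + 1 \tag{4.13}$$
and
$$e^{f_t(x_{new}; \theta)} \gg \sum_{i=1}^{C_o} e^{f_i(x_{new}; \theta)} \tag{4.14}$$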
To see how gradient descent leads to this condition, we compute the gradient of the total loss and rewrite it in terms of the derivatives $\partial L / \partial f_i$. To reach the optimized state, the weights are updated in the direction in which all derivative terms $\partial L / \partial f_i$ approach 0.¹ As $f_t(x; \theta)$ is subject only to the classification loss, it is free to grow large enough to satisfy (4.14), which makes $g_t(x_{new}; \theta)$ very close to 1 and $g_i(x_{new}; \theta)$ very close to 0 for $i \neq t$; hence the gradients introduced by the classification loss term approach 0. Under this condition, $f_i(x; \theta)$ for $i \neq t$ is subject only to the distillation loss, because $g_i(x_{new}; \theta) \approx 0$. Optimizing the distillation loss then yields condition (4.13).
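The argument above can be made explicit with a short derivation. This is only a sketch under an assumption: we take (4.2) and (4.3) to have the same form as (4.10) and (4.11) without the pseudo-neuron terms, and $g$ to be the softmax of $f$ over the $C_o + 1$ outputs. The indicator notation $\mathbb{1}[\cdot]$ is introduced here for the example.

```latex
% Assumed minibatch loss: one pseudo sample, one new sample, t = C_o + 1
L = \sum_{i=1}^{C_o} \tfrac{1}{2}\bigl(f_i(\tilde{x}_{old};\theta) - \tilde{f}_i\bigr)^2
    - \log g_t(x_{new};\theta),
\qquad
g_i(x;\theta) = \frac{e^{f_i(x;\theta)}}{\sum_{j=1}^{C_o+1} e^{f_j(x;\theta)}}

% Chain rule through the logits of both samples
\frac{\partial L}{\partial \theta}
  = \sum_{i=1}^{C_o+1} \frac{\partial L}{\partial f_i(\tilde{x}_{old};\theta)}
      \frac{\partial f_i(\tilde{x}_{old};\theta)}{\partial \theta}
  + \sum_{i=1}^{C_o+1} \frac{\partial L}{\partial f_i(x_{new};\theta)}
      \frac{\partial f_i(x_{new};\theta)}{\partial \theta}

% Derivatives with respect to the logits
\frac{\partial L}{\partial f_i(\tilde{x}_{old};\theta)} =
  \begin{cases}
    f_i(\tilde{x}_{old};\theta) - \tilde{f}_i, & i \le C_o \\
    0, & i = t
  \end{cases}
\qquad
\frac{\partial L}{\partial f_i(x_{new};\theta)} = g_i(x_{new};\theta) - \mathbb{1}[i = t]
```

Under this assumed form, the only term acting on $f_t$ is $g_t(x_{new}; \theta) - 1$, which vanishes as $f_t(x_{new}; \theta)$ grows, while nothing constrains $f_t$ on old-class inputs.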
We can now reasonably infer that $f(x_{new}; \theta)$ is a logits vector whose $t$-th element exceeds the other elements, which is desired because the resulting probability of the target class, $g_t(x_{new}; \theta)$, can be very close to 1. However, such an optimization causes confusion for images of old classes. Because the encoded representations of images from different classes cannot be completely orthogonal, they must share some information. The representation of $x_{old}$ will therefore normally also produce a high $f_t(x_{old}; \theta)$, which can be comparable to or even exceed the logit $f_{t_o}(x_{old}; \theta)$, where $t_o$ is the target index of $x_{old}$. Consequently, the network is confused by the two high logits $f_{t_o}(x_{old}; \theta)$ and $f_t(x_{old}; \theta)$.²
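The effect of two competing logits on the predicted probabilities can be seen in a small numerical sketch. The logit values below are hypothetical and chosen only for illustration; they are not measured outputs of our model.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for an old-class image with target index t_o = 0;
# the new-class neuron sits at the last index t = 3.
confused_logits = np.array([6.0, 1.0, 0.5, 5.8])     # f_t almost as high as f_{t_o}
suppressed_logits = np.array([6.0, 1.0, 0.5, -2.0])  # f_t kept low

confused_probs = softmax(confused_logits)
suppressed_probs = softmax(suppressed_logits)

print("confused:  ", np.round(confused_probs, 3))
print("suppressed:", np.round(suppressed_probs, 3))
```

In both cases the correct class still has the largest logit, but in the first case the new-class neuron takes almost half of the probability mass, so a small perturbation is enough to flip the prediction.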
¹ One possible condition for reaching a stable state is $\partial f_i / \partial \theta \approx 0$ instead of $\partial L / \partial f_i \approx 0$, which indicates that training is stuck at a saddle point. However, according to the observations of Goodfellow et al. [44], such a condition may not occur.
² In this sense, the neural network does not really forget the old knowledge, as it still retains a high score for the correct class.
We now turn to the proposed alternative, which uses pseudo neurons. For simplicity, we set the number of pseudo neurons to 1; this neuron is converted into the capacity for the new class during training, and we do not add a new pseudo neuron back, in order to keep the equations clear.
Thus $C_{po} = C_{pn} = 0$ and $C_n = 1$. In the gradients of (4.10) and (4.11), the distillation term then also covers the converted neuron at index $t$, so the logit $f_t(x; \theta)$ is now subject to the distillation loss as well. Consequently, $f_t(x; \theta)$ will be high when the input image belongs to the new class and, when the input belongs to an old class, $f_t(x; \theta)$ will be suppressed by the penalty from the distillation loss during training. The confusion that arises in the original version without pseudo neurons can thus be addressed.
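The difference between the two arrangements can be sketched numerically by looking at the distillation gradient with respect to the logits of one pseudo sample. The function name and all values are hypothetical. The recorded value of the pseudo neuron is assumed to be low because, as described above, it was trained to be weakly activated for the initial classes.

```python
import numpy as np

def distillation_grad(logits, recorded_logits):
    # Gradient of 0.5 * sum_i (f_i - f~_i)^2 with respect to every logit.
    # Only the first len(recorded_logits) neurons have a recorded target;
    # the remaining neurons receive zero gradient from this loss.
    grad = np.zeros_like(logits)
    k = recorded_logits.shape[0]
    grad[:k] = logits[:k] - recorded_logits
    return grad

# Hypothetical setting: Co = 3 old classes, new class at index t = 3.
# Current logits on a pseudo sample of old class 0, with f_t grown large.
current = np.array([4.0, 0.5, 0.2, 3.5])

# Without pseudo neurons the previous model had only 3 outputs.
recorded_without = np.array([4.0, 0.5, 0.2])
# With a pseudo neuron the previous model already had a 4th output.
recorded_with = np.array([4.0, 0.5, 0.2, -2.0])

grad_without = distillation_grad(current, recorded_without)
grad_with = distillation_grad(current, recorded_with)

print("without pseudo neuron:", grad_without)
print("with pseudo neuron:   ", grad_with)
```

Only in the second case does gradient descent receive a signal that pushes $f_t$ back down on old-class inputs.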
Figure 4.6: Distributions of output logits (a) without pseudo neurons and (b) with pseudo neurons after learning one new class. Each column shows the means and standard deviations over one batch of data of the same class. Values at the correct class index are coloured blue, while values at the new class index are coloured yellow for clear comparison.
To further demonstrate the phenomenon, we present an experimental observation: we initially train a CNN model on data of 9 randomly selected instances from the RGB-D Object Dataset [45] and then incrementally train the model to learn one new instance class.
Figure 4.6 shows the resulting distributions of $f(x; \theta)$. Each distribution is obtained by averaging the output logits $f(x; \theta)$ over a batch of test images. Images in the same batch belong to the same class; e.g., the 3rd column is obtained from a batch of data that all belong to the 3rd instance class. We can see that with pseudo neurons (Figure 4.6, bottom row), $f_t(x_{old}; \theta)$ is more suppressed than without pseudo neurons (Figure 4.6, top row), meaning that the new-class neuron is less likely to affect the final decision of the CNN when it sees images of old classes. See also Table 5.6 in Chapter 5.3 for the resulting accuracies. Note that the data in the RGB-D Object Dataset is recorded on a turntable and varies only in rotation. With more challenging data, or when the CNN model has initially learned more classes, the phenomenon becomes more pronounced and leads to a significant performance gap between the models with and without pseudo neurons (see Chapter 5.1).
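The per-class statistics plotted in Figure 4.6 can be computed along the following lines. This is a minimal NumPy sketch with hypothetical names and toy data. It assumes that the logits of all test images are stacked into one array and that each class has at least one image; it is not the script used to produce the figure.

```python
import numpy as np

def per_class_logit_stats(logits, labels, num_classes):
    # logits: array of shape (num_images, num_outputs)
    # labels: array of shape (num_images,) holding the class of each image
    # Returns two arrays of shape (num_classes, num_outputs): entry [c, j] is
    # the mean (or standard deviation) of output j over images of class c.
    means, stds = [], []
    for c in range(num_classes):
        class_logits = logits[labels == c]
        means.append(class_logits.mean(axis=0))
        stds.append(class_logits.std(axis=0))
    return np.stack(means), np.stack(stds)

# Toy data: three images, two output neurons, two classes.
toy_logits = np.array([[1.0, 0.0],
                       [3.0, 0.0],
                       [0.0, 2.0]])
toy_labels = np.array([0, 0, 1])

means, stds = per_class_logit_stats(toy_logits, toy_labels, num_classes=2)
print(means)
print(stds)
```

With the new class at output index $t$, the entry of `means` at row $c$ and column $t$ for an old class $c$ corresponds to the yellow value in that class's column of the figure.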
Chapter 5
Experiment
We implement all approaches using Caffe [46], an open-source library built specifically for designing CNN models that supports GPU computation. We run our programs on an NVIDIA GeForce GTX 960M.