3.3 Proposed Method: CRCNN Framework
3.3.3 CRCNN Formulations
國立政治大學 (National Chengchi University)
Let I be the universal set of facial images and L the corresponding set of possible human age labels. We are given a training set of N facial images X ∈ I with labels Y ∈ L. Let F denote the deep architecture function. In the conventional paradigm, Y is computed directly with F:
\[
\begin{aligned}
F \colon I &\to L \\
X &\mapsto Y = F(X).
\end{aligned}
\tag{3.1}
\]
Instead, the idea is to introduce a baseline set B = {B_1, ..., B_M} drawn from I, together with two functions Ψ and Φ whose composition splits the task into two main parts. Note that X and B are usually disjoint.
First, in the comparative stage, comparing X with the baseline B through Ψ yields a set of hints in H. Second, in the estimation stage, Φ aggregates the votes cast by these hints to obtain the final label in L. The proposed CRCNN approach is therefore formulated as:
\[
\begin{aligned}
(I \times I) &\xrightarrow{\ \Psi\ } H \xrightarrow{\ \Phi\ } L \\
(X, B) &\mapsto Z = \Psi(X, B) \mapsto Y = \Phi(Z).
\end{aligned}
\]
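The two-stage decomposition can be illustrated with a minimal Python sketch. Everything in it is a toy assumption made only so that the example runs: images are replaced by scalar ages, and `toy_psi` and `toy_phi` are hypothetical stand-ins for Ψ and Φ, not the networks described in this section.

```python
from typing import Callable, List, Sequence


def crcnn_predict(x, baselines: Sequence, psi: Callable, phi: Callable):
    """Two-stage prediction Y = Phi(Psi(X, B))."""
    hints = psi(x, baselines)   # comparative stage: Z = Psi(X, B)
    return phi(hints)           # estimation stage:  Y = Phi(Z)


def toy_psi(x: float, baselines: Sequence[float]) -> List[int]:
    """Toy comparison: one hint per baseline, 1 if x is judged older, else 0."""
    return [1 if x > b else 0 for b in baselines]


def toy_phi(hints: Sequence[int]) -> int:
    """Toy estimator: the number of baselines that x was judged older than."""
    return sum(hints)


# An input of "age" 35 compared against five baselines.
estimate = crcnn_predict(35, [10, 20, 30, 40, 50], toy_psi, toy_phi)
```

The point of the sketch is only the structure: the estimator never sees the input image directly, only the hints produced by comparing it with the baseline samples.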
Comparative stage
The set of hints Z ∈ H is computed from X ∈ I and the baseline B ⊂ I with the function Ψ, which is decomposed into
\[
\Psi = \Psi_R \circ \Psi_C \circ \Psi_L \circ \Psi_F \circ \Psi_A.
\]
The operators are applied in the order listed, from left to right. The first operator Ψ_R detects the regions whose facial information R-CNN selects as most relevant. The second operator Ψ_C is the convolutional step (including sub-sampling layers), which extracts a fixed-length feature vector from each region. The third and fourth operators (Ψ_L and Ψ_F) are the locally and fully connected steps [111]. Finally, the features of both the input image and the baseline samples are aggregated by the last operator Ψ_A, where an energy function approximates the age comparison with a distance metric.
Region-detection layer: Given an input image X_i ∈ I, a set of candidate regions {X_{i,j}}, j = 1, ..., J, is detected from X_i so that more informative facial features can be extracted.
Each region X_{i,j} is detected by the algorithm in [122]. The same region-detection operator Ψ_R is applied to each baseline sample B_m, providing a set of candidate regions {B_{m,j′}}, j′ = 1, ..., J′. We denote by H_1 the first hidden layer of our deep architecture, which is formed by the region-detection layer. If no region detection is used (Ψ_R is the identity function), the output is the input image itself ({X_{i,j}} = {X_i}).
Convolutional layers: The convolutional operator Ψ_C extracts features from the first hidden layer H_1. Specifically, features are computed by forward propagation through a convolutional structure of |C| layers with
\[
\Psi_C = \Psi_{C_1} \circ \Psi_{C_2} \circ \cdots \circ \Psi_{C_{|C|}}.
\]
These steps expand the input into a set of simple local features. We denote by H_k = Ψ_{C_k}(H_{k−1}) the output of a convolutional layer for k = 2, 3, ..., |C| + 1. Further details of the convolutional layer are provided in [111]. We interpret these convolutional steps as an adaptive pre-processing step whose purpose is to extract low-level features such as simple edges and textures. Sub-sampling layers make the output of convolutional networks more robust to local translations and small registration errors, which is important in facial recognition problems.
Locally connected layers: After extracting features with Ψ_C, applied independently to X_i and B_m, we first combine the locally extracted features through |L| locally connected layers with
\[
\Psi_L = \Psi_{L_1} \circ \Psi_{L_2} \circ \cdots \circ \Psi_{L_{|L|}},
\]
resulting in H_k = Ψ_{L_k}(H_{k−1}) for k = |C| + 2, |C| + 3, ..., |C| + |L| + 1. Like convolutional layers, locally connected layers apply a filter bank, but every location in the feature map learns a different set of filters. For example, information from the area between the eyes and eyebrows is combined with information from the area between the nose and the mouth, but each area is processed with its own filters, whereas a convolutional layer shares the same filters across all locations.
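The difference between shared and location-specific filters can be made concrete with a small one-dimensional NumPy sketch. It is an illustration only, not the layer configuration used in this work: the function names, the filter width, and the weight values are assumptions, and the "convolution" is implemented as cross-correlation, as is common in deep learning libraries.

```python
import numpy as np


def conv1d_valid(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Convolutional layer: the same filter w is applied at every location."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])


def locally_connected_1d(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Locally connected layer: row W[i] is the filter used only at location i."""
    n_out, k = W.shape
    if n_out != len(x) - k + 1:
        raise ValueError("W needs one filter (row) per output location")
    return np.array([x[i:i + k] @ W[i] for i in range(n_out)])


x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -1.0])

# Tying all rows to the same filter recovers the convolutional layer.
shared_out = conv1d_valid(x, w)
tied_out = locally_connected_1d(x, np.tile(w, (3, 1)))

# Untied rows: each location is processed by its own filter.
W_local = np.array([[1.0, -1.0],
                    [0.5, 0.5],
                    [0.0, 2.0]])
local_out = locally_connected_1d(x, W_local)
```

A locally connected layer therefore has one filter per output location, which increases the number of parameters but lets different facial areas be treated differently.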
Fully connected layers: The fully connected operator Ψ_F combines all features through |F| fully connected layers with
\[
\Psi_F = \Psi_{F_1} \circ \Psi_{F_2} \circ \cdots \circ \Psi_{F_{|F|}}
\]
and H_k = Ψ_{F_k}(H_{k−1}) for k = |C| + |L| + 2, |C| + |L| + 3, ..., |C| + |L| + |F| + 1. Unlike the locally connected operation, where the inputs are combined locally, each output unit in a fully connected layer is connected to all inputs. These layers can therefore capture correlations between features from distant parts of the facial image, for example, between the position and shape of the eyes and the position and shape of the mouth.
Aggregation: An energy function from energy-based models (EBMs) [126] is used to aggregate the information about X_i and B_m produced by the fully connected operation and to estimate whether X_i is younger or older than B_m. The advantage of the adopted energy function is that there is no need to estimate normalized probability distributions over the input space. The scalar energy function E measures the compatibility between X_i and B_m, and it leads to a set of hints associated with their comparative relationship, as shown in Figure 3.2. This real-valued energy function is defined as E(X_i, B_m) = ||G_W(X_i) − G_W(B_m)||, where G_W denotes a mapping (subject to learning) that produces output vectors that are nearby for images from the same person and far away for images from different persons [126].
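As a minimal sketch of this energy, the following NumPy code evaluates E for a pair of feature vectors. The mapping `g_w` is a hypothetical stand-in for G_W (a single linear layer followed by tanh), the Euclidean norm is assumed because the norm is not specified above, and the random vectors merely play the role of the features delivered by the fully connected layers.

```python
import numpy as np


def g_w(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the learned mapping G_W (linear layer + tanh)."""
    return np.tanh(W @ x)


def energy(x: np.ndarray, b: np.ndarray, W: np.ndarray) -> float:
    """E(X, B) = ||G_W(X) - G_W(B)||, with the Euclidean norm assumed."""
    return float(np.linalg.norm(g_w(x, W) - g_w(b, W)))


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # illustrative weights shared by both branches
x = rng.normal(size=8)        # stand-in features of the input image
b = rng.normal(size=8)        # stand-in features of one baseline sample

e_xb = energy(x, b, W)        # positive for two different inputs
e_xx = energy(x, x, W)        # zero when both inputs are identical
```

Because the same weights W are applied to both inputs, the energy is a non-negative scalar that requires no normalization over the input space.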
Learning is then performed by finding the deep architecture parameters that minimize a suitably designed loss function evaluated over a training set. Let L− (or L+) be the partial loss function if X_i is younger (or older) than B_m; then, our loss function is
\[
L = (1 - \bar{Z}_l)\, L_{-}\bigl(E(X_i, B_m)\bigr) + \bar{Z}_l\, L_{+}\bigl(E(X_i, B_m)\bigr),
\]
where Z̄_l denotes the ground truth of the hint Z_l (0 if X_i is younger than B_m and 1 if it is older, so that the corresponding partial loss is selected). The partial loss function L− (or L+) is designed such that the minimization of L decreases (or increases) the energy when X_i is younger (or older) than B_m. A simple way to achieve this is to make L− monotonically increasing and L+ monotonically decreasing in E.
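The structure of this loss can be sketched in a few lines of Python. The partial losses below are hypothetical choices, not those used in this work: a squared energy for L− and a squared hinge with an assumed `margin` for L+. They are picked only so that minimizing the loss lowers the energy for "younger" pairs and raises it for "older" pairs, as required above.

```python
def loss_minus(e: float) -> float:
    """Hypothetical L-: minimizing it pulls the energy toward zero."""
    return e ** 2


def loss_plus(e: float, margin: float = 1.0) -> float:
    """Hypothetical L+: minimizing it pushes the energy up to the margin."""
    return max(0.0, margin - e) ** 2


def crcnn_loss(e: float, z_bar: int, margin: float = 1.0) -> float:
    """L = (1 - z_bar) * L-(E) + z_bar * L+(E) for one (X_i, B_m) pair.

    z_bar = 0 selects L- (X_i younger than B_m); z_bar = 1 selects L+ (older).
    """
    return (1 - z_bar) * loss_minus(e) + z_bar * loss_plus(e, margin)


younger_low = crcnn_loss(0.2, z_bar=0)    # small energy, "younger" pair
younger_high = crcnn_loss(0.8, z_bar=0)   # larger energy is penalized more
older_low = crcnn_loss(0.2, z_bar=1)      # small energy is penalized
older_high = crcnn_loss(2.0, z_bar=1)     # energy beyond the margin costs nothing
```

With these choices, a gradient step on the loss moves the energy in opposite directions for the two ground-truth values, which is what allows the sign of the comparison to be read from E.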
Estimation stage
Once the set of hints has been generated, the estimation stage votes on the output of the comparative stage to determine the age of the person. The representation of the set of hints in Figure 3.2 includes the number of hints for each label, which is computed by summing the hints at each label. In a naive approach, the age of the input person is estimated as the label with the most votes.
In practice, to avoid ties in which more than one label receives the highest number of votes, we use the real value output by the energy function E instead of the number of hints, because this value also encodes the confidence of a vote: a larger value indicates higher confidence, and vice versa.
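The effect of replacing vote counts by real-valued confidences can be illustrated with the following NumPy sketch. The voting rule is an assumption made for illustration, since the exact mapping from hints to labels is given only in Figure 3.2: each baseline sample is assumed to have a known age, an "older" hint (1) votes for every label above that age, and a "younger" hint (0) votes for every label below it. The function `vote`, the labels, and the weights are hypothetical.

```python
import numpy as np


def vote(labels, baseline_ages, hints, weights=None) -> np.ndarray:
    """Accumulate one score per label from the hints.

    hints[m] = 1 means "older than baseline m", 0 means "younger".
    weights[m] is the confidence of hint m (1.0 for plain vote counting).
    """
    labels = np.asarray(labels)
    if weights is None:
        weights = np.ones(len(hints))
    scores = np.zeros(len(labels))
    for age, z, w in zip(baseline_ages, hints, weights):
        mask = labels > age if z == 1 else labels < age
        scores[mask] += w
    return scores


labels = np.array([20, 30, 40])
baseline_ages = [25, 25, 35]
hints = [1, 0, 0]                 # two baselines of age 25 give conflicting hints

counts = vote(labels, baseline_ages, hints)                # labels 20 and 30 tie
confidences = [0.9, 0.2, 0.7]                              # assumed energy values
weighted = vote(labels, baseline_ages, hints, confidences)
estimated_age = int(labels[np.argmax(weighted)])
```

In this toy case, plain counting leaves labels 20 and 30 tied, whereas the confidence-weighted scores favor label 30 because the "older" hint carries more confidence than the conflicting "younger" hint.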