
3.3 Proposed Method: CRCNN Framework

3.3.3 CRCNN Formulations

National Chengchi University (國立政治大學)

Let $\mathcal{I}$ be the universal set of facial images and $\mathcal{L}$ the corresponding label set of possible human ages. We are given a training set of $N$ facial images $X \in \mathcal{I}$ with labels $Y \in \mathcal{L}$. Let $F$ denote the deep architecture function. In the conventional paradigm, $Y$ is computed directly with $F$:

\[
\begin{aligned}
F \colon \mathcal{I} &\to \mathcal{L} \\
X &\mapsto Y = F(X).
\end{aligned}
\tag{3.1}
\]

The proposed idea is to introduce a baseline set $B = \{B_1, \dots, B_M\}$ drawn from $\mathcal{I}$, together with two functions $\Psi$ and $\Phi$ whose composition decomposes the task into two main parts. Note that $X$ and $B$ are usually disjoint.

First, in the comparative stage, comparing $X$ with the baseline $B$ through $\Psi$ yields a set of hints in $\mathcal{H}$. Second, in the estimation stage, the hints vote through $\Phi$ to obtain the final label in $\mathcal{L}$. Therefore, the proposed CRCNN approach is formulated as:

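\[
\begin{aligned}
(\mathcal{I} \times \mathcal{I}) &\xrightarrow{\ \Psi\ } \mathcal{H} \xrightarrow{\ \Phi\ } \mathcal{L} \\
(X, B) &\mapsto Z = \Psi(X, B) \mapsto Y = \Phi(Z).
\end{aligned}
\]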
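As a minimal sketch of the two-stage formulation above, the following Python fragment wires a comparative function and an estimation function together. The names `crcnn_estimate`, `toy_psi`, and `toy_phi` are hypothetical, and the toy functions operate on plain numbers rather than images; they only illustrate the data flow $Y = \Phi(\Psi(X, B))$, not the actual CRCNN operators.

```python
from typing import Callable, Sequence


def crcnn_estimate(x, baseline: Sequence, psi: Callable, phi: Callable):
    """Two-stage estimate Y = Phi(Psi(X, B))."""
    hints = psi(x, baseline)  # comparative stage: Z = Psi(X, B)
    return phi(hints)         # estimation stage:  Y = Phi(Z)


# Hypothetical stand-ins: each "image" is just a number carrying an age.
def toy_psi(x, baseline):
    # One hint per baseline sample: 1 if x is older than the sample, else 0.
    return [1 if x > b else 0 for b in baseline]


def toy_phi(hints):
    # Count how many baseline samples x was judged older than.
    return sum(hints)


rank = crcnn_estimate(35, [10, 20, 30, 40, 50], toy_psi, toy_phi)
```

In this toy setting the output is only a rank among the baseline samples; the point is that the final answer is derived from comparisons against $B$ rather than from $X$ alone.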

Comparative stage

The set of hints $Z \in \mathcal{H}$ is computed from $X \in \mathcal{I}$ and $B \in \mathcal{I}$ with the function $\Psi$, which is decomposed into

\[
\Psi = \Psi^R \circ \Psi^C \circ \Psi^L \circ \Psi^F \circ \Psi^A.
\]

The first operator $\Psi^R$ detects the regions whose facial information R-CNN selects as the most relevant. The second operator $\Psi^C$ is the convolutional step (including sub-sampling layers), which extracts a fixed-length feature vector from each region. The third and fourth operators, $\Psi^L$ and $\Psi^F$, are the locally and fully connected steps [111]. Finally, the features of both the input image and the baseline samples are aggregated by the last operator $\Psi^A$, in which an energy function approximates the age comparison with a distance metric.
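The decomposition lists the operators in the order in which they are applied (region detection first, aggregation last). A small sketch of such left-to-right composition is given below; `compose_in_order` and `make_stage` are hypothetical helpers, and each stage merely records its name instead of transforming an image. The aggregation operator is left out because it takes the features of both the input image and a baseline sample.

```python
from functools import reduce


def compose_in_order(*ops):
    """Return a function that applies ops left to right (first listed runs first)."""
    return lambda x: reduce(lambda acc, op: op(acc), ops, x)


def make_stage(name):
    # Hypothetical stand-in: appends its name to a trace instead of processing an image.
    return lambda trace: trace + [name]


# Per-image branch: region detection, convolutional, locally connected, fully connected.
branch = compose_in_order(*(make_stage(n) for n in ["R", "C", "L", "F"]))
trace = branch([])
```

The same `branch` would be applied to the input image and to each baseline sample before the two feature vectors are compared.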

Region-detection layer: Given an input image $X_i \in \mathcal{I}$, a set of candidate regions $\{X_{i,j}\}_{j=1,\dots,J}$ is detected from $X_i$ in order to extract more effective facial features.

Each region $X_{i,j}$ is detected by the algorithm in [122]. The same region-detection operator $\Psi^R$ is applied to each baseline sample $B_m$, providing a set of candidate regions $\{B_{m,j'}\}_{j'=1,\dots,J'}$. We denote by $H_1$ the first hidden layer of our deep architecture, which is formed by the region-detection layer. If no region detection is used ($\Psi^R$ is the identity function), the output is the input image itself ($\{X_{i,j}\} = \{X_i\}$).

Convolutional layers: The convolutional operator $\Psi^C$ extracts features from the first hidden layer $H_1$. Specifically, features are computed by forward propagation through a convolutional structure of $|C|$ layers with

\[
\Psi^C = \Psi^C_1 \circ \Psi^C_2 \circ \cdots \circ \Psi^C_{|C|}.
\]

These steps expand the input into a set of simple local features. We denote by $H_k = \Psi^C_{k-1}(H_{k-1})$ the output of a convolutional layer for $k = 2, 3, \dots, |C| + 1$, so that the first convolutional layer $\Psi^C_1$ maps $H_1$ to $H_2$. Further details of the convolutional layer are provided in [111]. We interpret these convolutional steps as an adaptive pre-processing stage whose purpose is to extract low-level features such as simple edges and textures. Sub-sampling layers make the output of convolutional networks more robust to local translations and small registration errors, which is important in facial recognition problems.
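The robustness to small translations can be illustrated with a one-dimensional example. The sketch below assumes non-overlapping max pooling as the sub-sampling operation (the text does not specify which form is used), and `max_pool1d` is a hypothetical helper.

```python
import numpy as np


def max_pool1d(x, size):
    """Non-overlapping max pooling; assumes len(x) is a multiple of size."""
    return x.reshape(-1, size).max(axis=1)


signal = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
shifted = np.roll(signal, 1)  # the peak moves one position to the right

pooled = max_pool1d(signal, 2)
pooled_shifted = max_pool1d(shifted, 2)
```

Here the one-position shift stays inside the same pooling window, so both pooled outputs are identical. A shift that crosses a window boundary would change the output, which is why the robustness only holds for small, local translations.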

Locally connected layers: After extracting features with $\Psi^C$, applied independently to $X_i$ and $B_m$, we first combine the locally extracted features through $|L|$ locally connected layers with

\[
\Psi^L = \Psi^L_1 \circ \Psi^L_2 \circ \cdots \circ \Psi^L_{|L|},
\]

resulting in $H_k = \Psi^L_{k-|C|-1}(H_{k-1})$ for $k = |C| + 2, |C| + 3, \dots, |C| + |L| + 1$. Like convolutional layers, locally connected layers apply a filter bank, but every location in the feature map learns a different set of filters. For example, information from the area between the eyes and eyebrows is combined with information from the area between the nose and the mouth; however, the two pieces of information are processed with different filters, unlike in the convolutional operation, where the filters are shared across locations.
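The difference between shared and location-specific filters can be sketched in one dimension with NumPy. Both functions below are hypothetical illustrations (cross-correlation without kernel flipping, no bias, no nonlinearity), not the layers used in the thesis.

```python
import numpy as np


def conv1d_valid(x, kernel):
    """Shared filter: the same kernel is applied at every location."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])


def locally_connected1d(x, kernels):
    """Unshared filters: row i of `kernels` is used only at location i."""
    n_out, k = kernels.shape
    return np.array([x[i:i + k] @ kernels[i] for i in range(n_out)])


x = np.array([1.0, 2.0, 3.0, 4.0])
shared = conv1d_valid(x, np.array([1.0, -1.0]))
local = locally_connected1d(
    x, np.array([[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]])
)
```

With identical rows in `kernels`, the locally connected output reduces to the convolutional one; the extra freedom comes at the cost of one filter per output location.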

Fully connected layers: The fully connected operator $\Psi^F$ connects all units together with

\[
\Psi^F = \Psi^F_1 \circ \Psi^F_2 \circ \cdots \circ \Psi^F_{|F|}
\]

and $H_k = \Psi^F_{k-|C|-|L|-1}(H_{k-1})$ for $k = |C| + |L| + 2, |C| + |L| + 3, \dots, |C| + |L| + |F| + 1$. Unlike in the locally connected operation, where the inputs are combined locally, each output unit in a fully connected layer is connected to all inputs. These layers can therefore capture correlations between features from distant parts of the facial image, for example, the position and shape of the eyes and the position and shape of the mouth.
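The index ranges of the hidden layers can be checked with a short helper. The function `layer_of` is hypothetical; it assumes that $H_1$ is the output of the region-detection layer and that the convolutional, locally connected, and fully connected blocks follow in that order, as stated by the ranges of $k$ in the text.

```python
def layer_of(k, n_conv, n_local, n_full):
    """Return (block, position within block) of the layer that outputs H_k."""
    if k < 1 or k > 1 + n_conv + n_local + n_full:
        raise ValueError("k is outside the hidden-layer range")
    if k == 1:
        return ("R", 1)                       # region-detection layer
    if k <= 1 + n_conv:
        return ("C", k - 1)                   # k = 2, ..., |C| + 1
    if k <= 1 + n_conv + n_local:
        return ("L", k - 1 - n_conv)          # k = |C| + 2, ..., |C| + |L| + 1
    return ("F", k - 1 - n_conv - n_local)    # remaining fully connected layers


# Example with |C| = 2, |L| = 3, |F| = 2, i.e. eight hidden layers in total.
layout = [layer_of(k, 2, 3, 2) for k in range(1, 9)]
```

For this example, `layout` lists one region-detection layer, two convolutional layers, three locally connected layers, and two fully connected layers, in that order.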

Aggregation: An energy function from energy-based models (EBMs) [126] is used to aggregate the information about $X_i$ and $B_m$ produced by the fully connected operation and to estimate whether $X_i$ is younger or older than $B_m$. The advantage of this energy function is that there is no need to estimate normalized probability distributions over the input space. The scalar energy function $E$ measures the compatibility between $X_i$ and $B_m$ and leads to a set of hints associated with their comparative relationship, as shown in Figure 3.2. This real-valued energy function is defined as $E(X_i, B_m) = \|G_W(X_i) - G_W(B_m)\|$, where $G_W$ denotes a mapping with learnable parameters $W$; in [126], this mapping is trained to produce output vectors that are nearby for images of the same person and far apart for images of different persons.
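The energy $E(X_i, B_m) = \|G_W(X_i) - G_W(B_m)\|$ can be sketched as follows. The linear map used for `g_w` is a hypothetical stand-in for the learned deep mapping, the inputs are short vectors instead of images, and the Euclidean norm is assumed.

```python
import numpy as np


def energy(g_w, x, b):
    """E(x, b) = || G_W(x) - G_W(b) || with the Euclidean norm."""
    return float(np.linalg.norm(g_w(x) - g_w(b)))


# Hypothetical linear stand-in for the learned mapping G_W.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])


def g_w(v):
    return W @ v


x = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, 0.0])
e = energy(g_w, x, b)  # G_W(x) = [4, 4], G_W(b) = [1, 0], difference [3, 4]
```

Because both inputs pass through the same mapping, the energy is symmetric in its arguments and vanishes when the two inputs coincide.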

Learning is then performed by finding the deep architecture parameters that minimize a suitably designed loss function evaluated over a training set. Let $L^-$ (or $L^+$) be the partial loss function if $X_i$ is younger (or older) than $B_m$; then, our loss function is

\[
L = (1 - \bar{Z}_l)\, L^-(E(X_i, B_m)) + \bar{Z}_l\, L^+(E(X_i, B_m)),
\]

where $\bar{Z}_l$ denotes the ground truth of the hint $Z_l$. The partial loss function $L^-$ (or $L^+$) is designed such that minimizing $L$ decreases (or increases) the energy when $X_i$ is younger (or older) than $B_m$. A simple way to achieve this is to make $L^-$ monotonically increasing and $L^+$ monotonically decreasing in the energy, since minimizing an increasing function of $E$ pushes $E$ down and minimizing a decreasing one pushes it up.
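One possible instance of such a loss is sketched below. The squared form for $L^-$ and the squared hinge with a margin for $L^+$ are assumptions borrowed from common contrastive-loss practice, not the partial losses used in the thesis; `pair_loss` and `margin` are hypothetical names. The sketch takes $L^-$ increasing in $E$, so that minimizing it lowers the energy, and $L^+$ decreasing in $E$, so that minimizing it raises the energy.

```python
def pair_loss(energy, z_true, margin=1.0):
    """L = (1 - z) * L_minus(E) + z * L_plus(E).

    z_true is 0 when X is younger than B and 1 when X is older (assumed encoding).
    """
    loss_minus = energy ** 2                    # increasing in E: pulls the energy down
    loss_plus = max(0.0, margin - energy) ** 2  # decreasing in E: pushes the energy up
    return (1 - z_true) * loss_minus + z_true * loss_plus


younger_low = pair_loss(0.5, 0)   # small penalty: low energy on a "younger" pair
younger_high = pair_loss(2.0, 0)  # large penalty: high energy on a "younger" pair
older_low = pair_loss(0.5, 1)     # penalized: energy is still below the margin
older_high = pair_loss(2.0, 1)    # no penalty: energy exceeds the margin
```

The margin keeps the "older" term from pushing the energy up without bound; other monotone choices would serve the same purpose.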

Estimation stage

Once the set of hints has been generated, the estimation stage votes on the output of the comparative stage to determine the age of the person. The representation of the set of hints in Figure 3.2 includes the number of hints for each label.

This result is computed by summing the hints at each label. In a naive manner, the age of the input person can therefore be estimated as the label with the most votes.

In practice, to avoid ties in which the maximum number of votes is shared by more than one label, we use the real value output by the energy function $E$ instead of the number of hints $Z_i$, because this value also embeds the confidence of a vote. That is, a larger value indicates higher confidence in a vote, and vice versa.
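The effect of replacing vote counts with real-valued confidences can be sketched as follows. The example assumes that each hint has already been mapped to a candidate label and carries a confidence derived from $E$; how this mapping is done is not specified here, and `tally` and the vote values are hypothetical.

```python
from collections import defaultdict


def tally(votes, use_confidence):
    """Sum votes per label; votes is an iterable of (label, confidence) pairs."""
    scores = defaultdict(float)
    for label, confidence in votes:
        scores[label] += confidence if use_confidence else 1.0
    best = max(scores.values())
    # Return every label that reaches the best score, so ties stay visible.
    winners = sorted(label for label, s in scores.items() if s == best)
    return winners, dict(scores)


votes = [(30, 0.9), (30, 0.7), (35, 0.4), (35, 0.3), (40, 0.8)]
by_count, _ = tally(votes, use_confidence=False)
by_confidence, _ = tally(votes, use_confidence=True)
```

Counting alone leaves labels 30 and 35 tied with two votes each, whereas summing the confidences separates them in favor of label 30.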

Source: Understanding Social Interactions Using Images and Deep Learning in Documents, 政大學術集成 (NCCU Academic Hub), pp. 61-65