
each age label is represented by one baseline sample, and we have nine labels, which makes k = 9. In other words, we only need to compute the comparative relationship of the input image k times, where k is a small number and much less than N. Therefore, the computational cost of our approach is reasonable.
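To make the cost argument concrete, the following Python sketch shows how an age estimate could be aggregated from k pairwise comparisons, one per baseline. It is a minimal sketch only: the baseline ages and the `compare` function are hypothetical stand-ins, not the actual CRCNN comparator.

```python
# Hypothetical ages for the nine baseline samples (one per age label).
AGE_LABELS = [2, 8, 14, 20, 26, 32, 38, 44, 50]

def estimate_age(compare, input_image, baselines):
    """Estimate age from k pairwise comparisons.

    `compare(a, b)` is assumed to return +1 if face `a` looks older
    than face `b`, and -1 otherwise (a stand-in for the comparator).
    """
    # One comparison per baseline: k calls in total, independent of
    # the training-set size N.
    votes = [compare(input_image, b) for b in baselines]
    # Count how many baselines the input is judged older than, and
    # map that rank back onto the label axis.
    rank = sum(v > 0 for v in votes)
    return AGE_LABELS[min(rank, len(AGE_LABELS) - 1)]
```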

3.4.3 Discussions and Comparisons with State-of-the-art Methods

We compared our approach with recent facial age estimation techniques, including rKCCA [68], IIS-LLD [90], CPNN [90], OHRank [129], AGES [70], and two aging-function regression-based methods, WAS [29] and AAS [45]. Several conventional general-purpose classification methods, such as k-nearest neighbors (kNN) [134], the back-propagation neural network (BP) [135], the C4.5 decision tree [136], SVM [137], and the adaptive network-based fuzzy inference system (ANFIS) [138], as well as ranking-based approaches, such as Ranking SVM [96], RankBoost [97], and RankNet [98], were also included in the comparison. We trained our model using the popular leave-one-person-out (LOPO) test strategy [70], as suggested in the related benchmarks [90, 68, 129, 70]. In particular, we split the datasets (FG-NET and MORPH) using the same training/testing protocol for all compared methods.

For example, LOPO on the FG-NET dataset proceeds as follows: in each fold, the images of one person are used as the testing set and those of all the others as the training set. After 82 folds (the FG-NET dataset contains 82 subjects in total), every subject has served once as the testing set, and the average results are computed over all estimates. However, because the MORPH dataset contains more than 13,000 subjects, the LOPO test becomes too time-consuming; we therefore adopted 10-fold cross-validation for MORPH instead.
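As a minimal sketch of the two protocols (assuming per-image subject IDs are available; scikit-learn is used here purely for illustration and is not part of the original pipeline):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, KFold

# Hypothetical data: feature vectors X, true ages y, and subject IDs.
X = np.random.rand(100, 64)
y = np.random.randint(0, 70, size=100)
subjects = np.random.randint(0, 10, size=100)  # 10 subjects here; 82 on FG-NET

# LOPO on FG-NET: one fold per subject.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    pass  # train on X[train_idx], evaluate on X[test_idx]

# 10-fold cross-validation on MORPH (LOPO is too costly with >13,000 subjects).
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    pass  # train / evaluate per fold, then average the per-fold results
```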

Our CRCNN method is configured with the deep-learning parameters optimized in Section 3.4.2, and the results are detailed in Table 3.2. Human tests are included for reference; they were performed on 5% of the samples from the FG-NET database and on 60 samples from the MORPH database [90]. Age estimation performance is evaluated by the mean absolute error (MAE), which in statistics measures how close a prediction is to the ground truth. In our case, the MAE is the mean of the absolute errors between the estimated and true ages, i.e., $\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N} |\hat{a}_k - a_k|$, where $\hat{a}_k$ and $a_k$ denote the estimated and true ages of sample image $k$, and $N$ denotes the total number of samples.

The standard deviations for the MORPH dataset are also listed in Table 3.2: an entry of the form a ± b means that the MAE a has a standard deviation of b. Some compared methods (e.g., rKCCA and rKCCA+SVM) show no standard deviation because none is reported in their original experiments. For the FG-NET dataset, we follow the common practice of previous work (e.g., [90]) and omit standard deviations. As reported in [90], "the number of images for each person in the FG-NET database varies dramatically. Consequently, the standard deviation of the LOPO test on the FG-NET database becomes unstable". In other words, the standard deviation values are not statistically meaningful for the FG-NET database and are therefore not shown.

As Table 3.2 shows, the best results are obtained by our CRCNN approach with the early fusion scheme, and the second-best results by our CRCNN approach with the late fusion scheme. The overall performance of the CRCNN is thus very encouraging: our results are significantly better than those of all the state-of-the-art methods. Compared with the deep learning-based method CPNN [90], we achieve relative improvements of 13.24% (from 4.76 to 4.13 on FG-NET) and 23.20% (from 4.87 to 3.74 on MORPH). These results validate the robustness of the newly proposed comparative approach.
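The MAE metric and the relative-improvement figures quoted above reduce to a few lines of arithmetic; a minimal sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and estimated ages."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

def relative_improvement(baseline_mae, ours_mae):
    """Relative improvement of `ours_mae` over `baseline_mae`, in percent."""
    return 100.0 * (baseline_mae - ours_mae) / baseline_mae

# The figures reported in the text:
print(relative_improvement(4.76, 4.13))  # ~13.24 (CPNN -> CRCNN, FG-NET)
print(relative_improvement(4.87, 3.74))  # ~23.20 (CPNN -> CRCNN, MORPH)
```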

Table 3.2 Comparison with state-of-the-art methods on the FG-NET and MORPH databases (MAE in years; a ± b denotes an MAE of a with standard deviation b).

Method                         FG-NET    MORPH
CRCNN (Early Fusion) (RCNN)    4.13      3.74 ± 0.29
CRCNN (Early Fusion) (CNN)     4.72      4.33 ± 0.27
CRCNN (Late Fusion) (RCNN)     4.20      3.81 ± 0.32
CRCNN (Late Fusion) (CNN)      4.81      4.52 ± 0.23
Ranking SVM [96]               5.24      6.49 ± 0.17
RankBoost [97]                 5.67      6.83 ± 0.25
RankNet [98]                   5.46      6.71 ± 0.24
rKCCA [68]                     -         3.98
rKCCA + SVM [68]               -         3.92
IIS-LLD [90] (Gaussian)        5.77      5.67 ± 0.15
IIS-LLD [90] (Triangle)        5.90      6.09 ± 0.14
IIS-LLD [90] (Single)          6.27      6.35 ± 0.17
CPNN [90] (Gaussian)           4.76      4.87 ± 0.31
CPNN [90] (Triangle)           5.07      4.91 ± 0.29
CPNN [90] (Single)             5.31      6.59 ± 0.31
OHRank [129]                   6.27      6.28 ± 0.18
AGES [70]                      6.77      6.61 ± 0.11
Human Tests (HumanA)           8.13      8.24
Human Tests (HumanB)           6.23      7.23

We further evaluated our approach on the IoG database, which consists of 28,231 facial images collected from Flickr. Each face is labeled with one of seven age groups: 0–2, 3–7, 8–12, 13–19, 20–36, 37–65, and 66+. In our evaluation, we considered only faces with an interocular distance of more than 40 pixels, which resulted in a subset of 1,495 face images. We then reorganized the age labels into three classes, child (ages 0–12), teen (ages 13–19), and adult (ages 20+), yielding 546, 250, and 699 samples per class, respectively. Finally, we applied the same normalizations as in the previous experiments to all IoG faces. We compared our results with the ranking-based methods [96–98] and with local binary pattern kernel density estimation (LBP-KDE) [128]. The age-group classification accuracies are summarized in Table 3.3. Our approach outperforms the state-of-the-art methods by margins ranging from 4.74 percentage points (over LBP-KDE) to 13.74 percentage points (over RankBoost), demonstrating its effectiveness for practical applications.
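The regrouping of the seven IoG labels into the three classes is a simple mapping; a minimal sketch with the group boundaries given above:

```python
# Map the seven IoG age groups onto the three classes used in our evaluation.
GROUP_TO_CLASS = {
    "0-2": "child", "3-7": "child", "8-12": "child",     # ages 0-12
    "13-19": "teen",                                     # ages 13-19
    "20-36": "adult", "37-65": "adult", "66+": "adult",  # ages 20+
}

def to_class(iog_label):
    """Return the coarse class (child/teen/adult) for an IoG group label."""
    return GROUP_TO_CLASS[iog_label]

print(to_class("8-12"))   # child
print(to_class("37-65"))  # adult
```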

Table 3.3 Comparison with state-of-the-art methods on the IoG database (age-group classification accuracy).

Method                         IoG
CRCNN (Early Fusion) (RCNN)    66.41%
CRCNN (Early Fusion) (CNN)     63.16%
CRCNN (Late Fusion) (RCNN)     65.48%
CRCNN (Late Fusion) (CNN)      62.19%
LBP-KDE [128]                  61.67%
Ranking SVM [96]               56.17%
RankBoost [97]                 52.67%
RankNet [98]                   55.08%
