Introduction - Understanding Social Interactions Using Images and Deep Learning

National Chengchi University (國立政治大學)

estimate the age of a person. Our approach has several advantages: (i) the age estimation task is split into several comparative stages, each of which is simpler than directly computing the person’s age; (ii) in addition to the input face, side information (comparative relations) can be explicitly employed to benefit the estimation task; and (iii) a few incorrect comparisons do not considerably affect the accuracy of the result, which makes this approach more robust than the conventional one. To the best of our knowledge, the proposed approach is the first comparative deep learning framework for facial age estimation. In addition, we propose incorporating the method of auxiliary coordinates (MAC) for training, which reduces the ill-conditioning problem of the deep network and affords an efficient and distributed optimization. The experimental results on the FG-NET, MORPH, and IoG databases demonstrate that the proposed CRCNN model significantly outperforms the state-of-the-art methods, with relative improvements of 13.24% (on FG-NET) and 23.20% (on MORPH) in terms of mean absolute error, and 4.74% (on IoG) in terms of age group classification accuracy. The content of this chapter has been published in [110].

3.2 Introduction

The appearance of a human face changes with age. Therefore, facial appearance is a very important trait when estimating the age of a person, and facial age estimation is an essential component in several mobile and social media applications [115–120]. However, age estimation is not as easy a task for humans as determining other facial information such as identity, expression, and gender. Hence, developing automatic facial age estimation methods that are comparable or superior to the human ability of age estimation has become an attractive yet challenging topic in recent years [46, 66, 64, 90, 121].

In the literature, the conventional approach to facial age estimation is a direct method that estimates the age of a person by analysing facial information (e.g., eyes, nose, and so forth) directly from the facial image of that person, cf. Figs. 3.1(a) and 3.1(c). In particular, only the input image is used to estimate the age; however, estimating someone’s precise age at a glance without any reference information is difficult, even for humans [90].

In response to the above challenges, our idea is to develop a facial age estimation algorithm inspired by human cognitive processes [37]. In practice, humans commonly use several judgments to estimate a person’s age, cf. Fig. 3.1(b). First, they learn to establish connections between a known age and the corresponding facial cues of a person (direct method), and second, they employ the learned knowledge as a reference to evaluate whether an unseen face is younger or older than the reference (comparative method). The larger the number of available references, the more precise the age estimate for an unseen face.

Fig. 3.1 Schematic diagram of (a,c) the conventional paradigm for facial age estimation, which learns the age information directly from a facial image, and (b,d) the proposed paradigm, which aggregates the comparisons of a facial image with baseline samples to determine the age in a comparative manner.

Therefore, a general mathematical framework, namely the comparative region-convolutional neural network (CRCNN), is proposed for facial age estimation, cf. Fig. 3.1(d). Conceptually, we compare an unseen face with a set of selected references (labelled baseline samples) to determine whether the person in the unseen face is younger or older than each of the baseline persons. We couple this comparative scheme with a specific deep learning architecture called the region-convolutional neural network (R-CNN) [122]. The R-CNN is exploited to extract the most “iconic” local region from each facial image, where the spatial context (geometrical interrelation) of the extracted local regions can also be accounted for to achieve robust classification. In the proposed CRCNN framework, not only the input image but also several reference images are used as baseline samples to be compared with the input image. Each comparison amounts to estimating whether the input person is younger or older than a baseline person. Compared to the conventional paradigm, the proposed approach allows reformulating the estimation task into a sequence of independent sub-problems, wherein each sub-problem represents a comparison (younger/older decision) between two images, which is considerably simpler than the initial task, i.e., estimating the exact age of an observed face. Further, by simply increasing the number of baseline samples, more side information (comparisons) can be exploited to benefit the estimation task, which can help achieve a more robust estimation. Finally, another advantage is that a few incorrect comparisons do not significantly affect the accuracy of the age estimation, because many baseline samples are leveraged.
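
To make the comparative idea concrete, the following minimal sketch aggregates younger/older decisions against labelled baselines by simple voting. It is only an illustration and not the aggregation used in CRCNN: the function name `estimate_age_by_comparisons`, the candidate age range of 0 to 100, and the tie-breaking rule are assumptions introduced here, and the Boolean outcomes stand in for the decisions that a trained comparator would produce.

```python
from typing import Sequence


def estimate_age_by_comparisons(
    baseline_ages: Sequence[float],
    is_older: Sequence[bool],
    candidate_ages: Sequence[int] = range(0, 101),  # assumed age range
) -> int:
    """Return the candidate age that agrees with the most comparisons.

    is_older[i] is True when the unseen face was judged older than
    baseline i, and False when it was judged younger.
    """
    if len(baseline_ages) != len(is_older):
        raise ValueError("one comparison outcome is needed per baseline")

    def agreement(age: int) -> int:
        # Number of comparisons that would be correct if `age` were the true age.
        return sum((age > b) == older for b, older in zip(baseline_ages, is_older))

    best = max(agreement(a) for a in candidate_ages)
    winners = [a for a in candidate_ages if agreement(a) == best]
    # Neighbouring ages often tie; report the middle entry of the tied list.
    return winners[len(winners) // 2]


# Toy data (hypothetical): a face that is older than the 10-, 20- and 30-year-old
# baselines and younger than the 40-, 50- and 60-year-old ones.
baselines = [10, 20, 30, 40, 50, 60]
clean = [True, True, True, False, False, False]
noisy = [False, True, True, False, False, False]  # first comparison is wrong

print(estimate_age_by_comparisons(baselines, clean))  # 36
print(estimate_age_by_comparisons(baselines, noisy))  # 36
```

In this toy case the single incorrect comparison lowers the best agreement count from six to five but leaves the estimate unchanged, which mirrors the robustness argument above. Adding more baselines narrows the interval of tied ages and therefore sharpens the estimate.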

Further, the traditional way to learn the parameters of a deep architecture is to minimize an objective function by computing the gradient over all the parameters using the backpropagation algorithm [123] with a nonlinear optimizer. However, deep networks are very difficult to train, especially because of the ill-conditioning problem and the local minima issue [36]. These difficulties also complicate the manual tuning of deep learning parameters and slow convergence. In this study, we propose incorporating the recent method of auxiliary coordinates (MAC) [124] into our framework for training, which is an interesting direction toward the more efficient training of deep architectures. The method introduces a set of auxiliary variables that break the nested dependency in the objective function, thereby making the problem considerably better conditioned and affording an efficient and distributed optimization.

The contributions of this chapter can be summarized as follows:

• To the best of our knowledge, our CRCNN framework is the first comparative deep learning approach for facial age estimation, and experiments on well-known face datasets demonstrate that it can outperform state-of-the-art methods.

• Instead of using classical deep learning techniques such as a convolutional neural network (CNN) [111], we propose the use of the R-CNN to account for the spatial context of facial regions. Further, we improve the training efficiency of the deep architecture by incorporating the MAC technique; thus, the notorious ill-conditioning problem of deep learning can be alleviated.

• We implemented our mathematical framework with CAFFE [125], which is a popular deep learning platform that exploits parallelization over multiple GPUs. The compatibility with CAFFE makes all the components of our implementation readily available to other researchers.

• The sensitivity of deep learning parameters makes it a non-trivial task to obtain an appropriate setting; therefore, our systematic investigation of parametric optimization provides guidance to users who plan to extend our approach in future research.
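
As a rough illustration of the auxiliary-coordinate idea mentioned above, the general MAC formulation can be written out for a generic network with K hidden layers. This follows the general method in [124] rather than the specific CRCNN objective; the squared-error loss and the symbols (layer functions f_k with weights W_k, auxiliary coordinates z_{k,n}, penalty parameter \mu) are notation assumed here for illustration.

```latex
% Nested objective: all layer weights are coupled through function composition.
E_{\text{nested}}(\mathbf{W}) = \frac{1}{2} \sum_{n=1}^{N}
  \left\| \mathbf{y}_n - f_{K+1}\big( \cdots f_2( f_1(\mathbf{x}_n; \mathbf{W}_1); \mathbf{W}_2 ) \cdots ; \mathbf{W}_{K+1} \big) \right\|^2

% Equivalent constrained problem: one auxiliary coordinate z_{k,n} per hidden layer k and data point n.
\min_{\mathbf{W}, \mathbf{Z}} \; \frac{1}{2} \sum_{n=1}^{N}
  \left\| \mathbf{y}_n - f_{K+1}(\mathbf{z}_{K,n}; \mathbf{W}_{K+1}) \right\|^2
\quad \text{s.t.} \quad
\mathbf{z}_{k,n} = f_k(\mathbf{z}_{k-1,n}; \mathbf{W}_k), \quad k = 1, \dots, K, \quad \mathbf{z}_{0,n} = \mathbf{x}_n

% Quadratic-penalty relaxation, minimized while \mu is gradually increased.
E_Q(\mathbf{W}, \mathbf{Z}; \mu) = \frac{1}{2} \sum_{n=1}^{N}
  \left\| \mathbf{y}_n - f_{K+1}(\mathbf{z}_{K,n}; \mathbf{W}_{K+1}) \right\|^2
  + \frac{\mu}{2} \sum_{n=1}^{N} \sum_{k=1}^{K}
  \left\| \mathbf{z}_{k,n} - f_k(\mathbf{z}_{k-1,n}; \mathbf{W}_k) \right\|^2
```

With the auxiliary coordinates Z held fixed, E_Q splits into one independent fitting problem per layer, and with the weights W held fixed it splits into one independent problem per data point. Alternating between these two steps is what removes the nesting and allows the optimization to be distributed.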
