


image popularity based on popularity matrix factorization. They explored the mechanism of dynamic popularity by factoring popularity into two contextual associations, i.e., user-item context and time-sensitive context. Furthermore, Almgren et al. [104] employed social context, image semantics, and early popularity features to predict the future popularity of an image. Specifically, they considered the popularity changes over time by collecting information regarding the image within an hour of uploading and keeping track of its popularity for a month. Totti et al. [42] analyzed the effect of visual content on image popularity and its propagation on online social networks. Along with social features, they proposed using aesthetic properties and semantic content to predict the popularity of images on Pinterest.

We observe that most of the aforementioned studies rely on only a subset of the features useful for image popularity prediction and do not consider the interactions among other pertinent types of features.

2.3.2 Prediction Models

Regarding the models used for image popularity prediction, previous studies have introduced several types of machine learning schemes. Both [38] and [39] considered image popularity prediction as a regression problem in which support vector regression (SVR) [105] was used to predict the number of views that an image received on Flickr. Totti et al. [42] reduced the problem to a binary classification task and utilized a random forest classifier [106] to predict whether an image would be extremely popular or unpopular based on the number of reshares on Pinterest. Moreover, the authors in [43] predicted the number of views for an image on Flickr using a gradient boosting regression tree [107]. Although most of these prediction models perform satisfactorily, they tend to generate smoothed results, making it difficult to accurately predict the popularity of images with extremely high or low scores. In addition, fine-tuning the hyperparameters that significantly influence the performance of these models can be time-consuming.
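As a concrete illustration of these formulations, the following sketch sets up the regression and classification views with scikit-learn; the feature matrix, targets, and hyperparameters are placeholders assumed for illustration and are not the features or settings used in the cited studies.

```python
# Sketch of the classical formulations cited above, using scikit-learn.
# X, y, and the hyperparameters are placeholder assumptions, not the
# actual features, labels, or settings of [38, 39, 42, 43].
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # placeholder feature vectors
y = rng.normal(size=1000)                  # placeholder popularity scores
y_cls = (y > np.median(y)).astype(int)     # placeholder popular/unpopular labels

X_tr, X_te, y_tr, y_te, yc_tr, yc_te = train_test_split(X, y, y_cls, random_state=0)

# Regression view (as in [38, 39]): predict a continuous popularity score with SVR.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_tr, y_tr)

# Gradient boosting regression trees (as in [43]).
gbrt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05).fit(X_tr, y_tr)

# Binary classification view (as in [42]): popular vs. unpopular.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, yc_tr)
```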

Recently, deep learning techniques have gained widespread attention and achieved outstanding performances in various tasks [108–111], owing to the capability of deep neural networks to learn complex representations from data at each layer, where they imitate learning in the human brain by abstraction. Nevertheless, relatively little effort has been devoted to predicting image popularity using these techniques. In this regard, Wu et al. [112] proposed a new deep learning framework to investigate the sequential prediction of image popularity by integrating temporal context and attention at different time scales. Moreover, Meghawat et al. [113] developed an approach that integrates multiple types of multimodal information into a CNN model for predicting the popularity of images on Flickr. Although these studies have achieved satisfactory performances, they are not sufficiently powerful to capture and model the characteristics of image popularity. For instance, the authors in [113] investigated the effect of the visual content of an image on its popularity by utilizing only one feature obtained from the pre-trained InceptionResNetV2 model, while ignoring other important visual cues, such as low-level computer vision, aesthetic, and semantic features. Moreover, although it has been demonstrated that time features have a crucial effect on image popularity [103, 112], they were not considered in the proposed model. They also adopted an early fusion scheme for processing the proposed multimodal features, despite several studies having demonstrated that this scheme is outperformed by the late fusion scheme in processing heterogeneous information [38, 114].
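To make the distinction between the two fusion schemes concrete, the sketch below contrasts them on synthetic visual and social-context feature vectors; the feature dimensions, the ridge regressors, and the simple stacking step are assumptions chosen for brevity rather than the pipelines used in [38, 113, 114].

```python
# Generic contrast between early and late fusion of heterogeneous features.
# Feature dimensions and models are placeholder assumptions, not the exact
# pipelines of the cited studies.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
visual = rng.normal(size=(500, 128))   # placeholder visual descriptors
social = rng.normal(size=(500, 16))    # placeholder social-context features
y = rng.normal(size=500)               # placeholder popularity targets

# Early fusion: concatenate all modalities and train a single predictor.
early_model = Ridge().fit(np.hstack([visual, social]), y)

# Late fusion: train one predictor per modality, then learn how to combine
# their outputs (in practice the combiner should be fit on held-out
# predictions to avoid overfitting; this is omitted here for brevity).
visual_model = Ridge().fit(visual, y)
social_model = Ridge().fit(social, y)
stacked = np.column_stack([visual_model.predict(visual),
                           social_model.predict(social)])
late_model = Ridge().fit(stacked, y)
```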

To address the above issues, we analyze a real-world dataset collected from Flickr to identify and extract different kinds of features that are correlated with image popularity, including multi-level visual features, deep learning features, social context information, and time features.

Based on the extracted features and motivated by the recent success of convolutional neural networks (CNNs) in processing data from different modalities, we propose a multimodal deep learning model for image popularity prediction. The proposed model exploits two dedicated CNNs to separately learn discriminative representations from the adopted input features and then efficiently merges them into a unified network to predict the popularity.
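A minimal two-branch sketch of this late-fusion idea is given below in PyTorch; for brevity the branches are plain fully connected stacks over precomputed feature vectors, and all layer sizes and input dimensions are assumptions made for illustration rather than the dedicated CNN architecture actually proposed in this work.

```python
# Illustrative two-branch network: each branch learns its own representation,
# and the branches are merged into a unified head for the final prediction.
# Layer sizes and input dimensions are assumptions for the sketch, not the
# exact architecture proposed here.
import torch
import torch.nn as nn

class TwoBranchPopularityNet(nn.Module):
    def __init__(self, visual_dim=2048, context_dim=64):
        super().__init__()
        # Branch 1: processes the visual feature vector.
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        # Branch 2: processes social-context and time features.
        self.context_branch = nn.Sequential(
            nn.Linear(context_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        # Unified head: late fusion of the two learned representations.
        self.head = nn.Sequential(
            nn.Linear(128 + 32, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, visual_feat, context_feat):
        fused = torch.cat([self.visual_branch(visual_feat),
                           self.context_branch(context_feat)], dim=1)
        return self.head(fused)  # predicted popularity score

model = TwoBranchPopularityNet()
score = model(torch.randn(4, 2048), torch.randn(4, 64))  # batch of 4 samples
```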


Chapter 3

Comparative Deep Learning Framework for Facial Age Estimation

3.1 Overview

In recent years, the development of automatic facial age estimation algorithms that demonstrate performance comparable or superior to that of humans has become an attractive yet challenging topic. Conventional methods estimate the age of a person directly from the given facial image. In contrast, motivated by human cognitive processes, we propose a comparative deep learning framework, called the comparative region convolutional neural network (CRCNN); this framework first compares the input face with reference faces of known age to generate a set of hints (comparative relations, i.e., whether the input face is younger or older than each reference), and then all hints are aggregated to estimate the age of the person.

Our approach has several advantages: (i) the age estimation task is split into several comparative stages, which is simpler than directly computing the person’s age; (ii) in addition to the input face, side information (comparative relations) can be explicitly employed to benefit the estimation task; and (iii) a few incorrect comparisons do not considerably influence the accuracy of the result, which makes this approach more robust than the conventional one. To the best of our knowledge, the proposed approach is the first comparative deep learning framework for facial age estimation. In addition, we propose incorporating the method of auxiliary coordinates (MAC) for training, which reduces the ill-conditioning problem of the deep network and affords efficient and distributed optimization. The experimental results on the FG-NET, MORPH, and IoG databases demonstrate that the proposed CRCNN model significantly outperforms state-of-the-art methods, with relative improvements of 13.24% (on FG-NET) and 23.20% (on MORPH) in terms of mean absolute error, and 4.74% (on IoG) in terms of age group classification accuracy. The content of this chapter has been published in [110].
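To make the comparative formulation concrete, the sketch below aggregates pairwise younger/older hints into an age estimate; the comparator is a stub standing in for the learned comparison network, and the simple count-based aggregation rule is an assumption made for illustration rather than the exact scheme used by the CRCNN.

```python
# Conceptual sketch of the comparative idea: compare the input face against
# reference faces of known age and aggregate the resulting "younger/older"
# hints into an age estimate. The comparator is a stub for the learned
# comparison network; the count-based aggregation is an illustrative rule.
from typing import Callable, List, Tuple

def estimate_age(
    input_face,
    references: List[Tuple[object, int]],        # (reference face, known age)
    is_older: Callable[[object, object], bool],  # True if input looks older than reference
) -> float:
    """Aggregate pairwise comparisons into an age estimate."""
    ref_ages = sorted(age for _, age in references)
    # Count how many references the input face is judged older than;
    # a few wrong comparisons shift this count only slightly, which is
    # what makes the comparative formulation robust.
    older_count = sum(is_older(input_face, face) for face, _ in references)
    if older_count == 0:
        return float(ref_ages[0])
    if older_count == len(ref_ages):
        return float(ref_ages[-1])
    # Otherwise the estimate falls between the ages of the bracketing references.
    return (ref_ages[older_count - 1] + ref_ages[older_count]) / 2.0

# Toy usage: treat an integer "true age" as the face and compare ages directly.
refs = [(a, a) for a in (5, 15, 25, 35, 45, 55)]
print(estimate_age(30, refs, is_older=lambda face, ref: face > ref))  # -> 30.0
```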