National Chengchi University

3.5 Summary

In this chapter, we proposed a novel comparative deep learning framework for facial age estimation called comparative region convolutional neural network (CRCNN). Motivated by human cognitive processes, we used a comparative approach to determine the age of an unseen person. To the best of our knowledge, this is the first comparative approach in deep learning for facial age estimation. The experimental results validate the superior performance of our CRCNN approach over state-of-the-art methods.

Chapter 4

Multimodal Deep Learning Framework for Image Popularity Prediction on Social Media

4.1 Overview

Billions of photos are uploaded to the web daily through various types of social networks. Some of these images receive millions of views and become popular, whereas others remain completely unnoticed. This raises the problem of predicting image popularity on social media. The popularity of an image can be affected by several factors, such as visual content, aesthetic quality, user, post metadata, and time. Thus, considering all these factors is essential for accurately predicting image popularity. In addition, the efficiency of the predictive model also plays a crucial role. In this chapter, motivated by multimodal learning, which uses information from various modalities, and the current success of convolutional neural networks (CNNs) in various fields, we propose a deep learning model, called visual-social convolutional neural network (VSCNN), which predicts the popularity of a posted image by incorporating various types of visual and social features into a unified network model.

VSCNN first learns to extract high-level representations from the input visual and social features by utilizing two individual CNNs. The outputs of these two networks are then fused into a joint network to estimate the popularity score in the output layer. We assess the performance of the proposed method by conducting extensive experiments on a dataset of approximately 432K images posted on Flickr. The experimental results demonstrate that the proposed VSCNN model significantly outperforms state-of-the-art models, with relative improvements of more than 2.33%, 7.59%, and 14.16% in terms of Spearman’s Rho, mean absolute error, and mean squared error, respectively. The content of this chapter has been published in [139].
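The three evaluation measures named above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the evaluation code used in this chapter: the function names are hypothetical, and the formula in `relative_improvement` (change relative to the baseline value, in percent) is an assumed definition, since the text does not spell it out.

```python
from math import sqrt


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    n = len(rt)
    mt, mp = sum(rt) / n, sum(rp) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / sqrt(var_t * var_p)


def relative_improvement(baseline, proposed, higher_is_better):
    """Percentage gain over a baseline (assumed definition, see lead-in)."""
    gain = proposed - baseline if higher_is_better else baseline - proposed
    return gain / abs(baseline) * 100
```

Spearman's Rho is a correlation, so higher is better, whereas the two error measures are better when lower; the `higher_is_better` flag keeps the sign of the reported improvement consistent across both cases.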

4.2 Introduction

Social media websites (e.g., Flickr, Twitter, and Facebook) allow users to create, share, and interact with content (e.g., by liking, commenting, or viewing). Consequently, social media platforms have become an inseparable part of our daily lives, with significant social content generated on these platforms. Given the explosive growth of social media content (i.e., text, images, audio, and video) and the interactive behavior of web users, only a small portion of online social content attracts significant attention and becomes popular, whereas the vast majority either receives little attention or is entirely overlooked. Therefore, extensive efforts have been expended in the past few years to predict social media content popularity, understand its variation, and evaluate its growth [140–145, 38]. This popularity reflects user interests and provides opportunities to understand user interaction with online content, as well as information diffusion through social media websites. Hence, an accurate popularity prediction of online content may improve user experience and service effectiveness. Moreover, it can significantly influence several important applications, such as online advertising [58, 59], information retrieval [60], online product marketing [146], and content recommendation [147].

Popularity prediction on social media is usually defined as the problem of estimating the rating scores, view counts, or click-throughs of a post [103]. In this study, image popularity prediction on social media websites is analyzed to better understand the popularity factors for a particular image. Although this problem has recently received significant attention [40, 39, 41, 99], it remains a challenging task. For example, image popularity can be significantly influenced by various factors (and features), such as visual content, aesthetic quality, user, post metadata, and time; therefore, considering all this multimodal information is crucial for an efficient prediction. Moreover, it is nontrivial to select an appropriate model that can make better use of the various features contributing to image popularity and accurately predict it. For example, simple machine learning schemes (e.g., support vector and decision tree regression) learn to predict by being fed with highly structured data, thus requiring time and skill to fine-tune the hyperparameters. However, to obtain accurate prediction results, it is critical to construct a prediction model capable of learning through a more abstract data representation and optimizing the extracted features.

Accordingly, we address the image popularity prediction problem by analyzing a large-scale dataset collected from Flickr to investigate two essential components that may contribute to the popularity of an image, namely, visual content and social context. In particular, we examine the effect of the visual content of an image on its popularity by adopting different types of features that describe various visual aspects of the image, including high-level, low-level, and deep learning features. These are extracted by applying several techniques from machine learning and computer vision. Additionally, we explore the significant role of social context information associated with images and their owners by analyzing the following three types of social features: user, post metadata, and time. To demonstrate the efficacy of the proposed features, we propose a computational deep learning model, called visual-social convolutional neural network (VSCNN), which uses two individual CNNs to learn high-level representations of the visual and social features independently. The outputs of the two networks are then merged into a shared network to learn joint multimodal features and compute the popularity score in the output layer. End-to-end learning is employed to train the entire model, and its weights are learned through back-propagation. In a nutshell, the contributions of this chapter can be summarized as follows:

• We comprehensively explore the independent benefits and predictive power of various types of visual and social context features for the popularity of an image. We further demonstrate that these multimodal features can be combined effectively to enhance prediction performance.

• We propose a deep learning VSCNN model for predicting image popularity on social media. VSCNN uses dedicated CNNs to learn structural and discriminative representations from the input visual and social features, achieving considerably better performance in predicting image popularity than several traditional machine learning schemes.

• We tailor the architectures and parameters of the adopted CNNs to fit the multimodal information, i.e., the social information and visual content of the image.

• We demonstrate that processing visual and social features using the late fusion scheme performs significantly better than using the early fusion scheme.

• We use a large-scale dataset of approximately 432K images posted on Flickr to evaluate the performance of the proposed VSCNN model. The experimental results demonstrate that VSCNN achieves competitive performance and outperforms six baseline models and other state-of-the-art methods.
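The structural difference between the late and early fusion schemes mentioned in the contributions can be illustrated with a minimal forward-pass sketch. This is not the VSCNN architecture itself: small fully connected layers stand in for the two CNN branches, the weights are random and untrained, and all names and layer sizes (`build_late_fusion`, `n_hidden`, and so on) are hypothetical choices made only for illustration.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases):
    """Fully connected layer followed by a ReLU activation."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def score(inputs, weights, biases):
    """Linear output unit producing a single popularity score."""
    return sum(w * x for w, x in zip(weights[0], inputs)) + biases[0]


def build_late_fusion(n_visual, n_social, n_hidden, rng):
    """One branch per modality, then a joint layer over both outputs."""
    return {
        "visual": init_layer(n_visual, n_hidden, rng),
        "social": init_layer(n_social, n_hidden, rng),
        "joint": init_layer(2 * n_hidden, n_hidden, rng),
        "out": init_layer(n_hidden, 1, rng),
    }


def late_fusion_forward(model, visual, social):
    # Each modality is first encoded by its own branch ...
    v = dense(visual, *model["visual"])
    s = dense(social, *model["social"])
    # ... and only the learned representations are concatenated.
    joint = dense(v + s, *model["joint"])
    return score(joint, *model["out"])


def build_early_fusion(n_visual, n_social, n_hidden, rng):
    """A single network over the concatenated raw features."""
    return {
        "hidden": init_layer(n_visual + n_social, n_hidden, rng),
        "out": init_layer(n_hidden, 1, rng),
    }


def early_fusion_forward(model, visual, social):
    # Raw features are concatenated before any representation is learned.
    hidden = dense(visual + social, *model["hidden"])
    return score(hidden, *model["out"])
```

In the late fusion variant, the joint layer sees one learned representation per modality, which is the arrangement the text describes for the two networks and the shared network; in the early fusion variant, a single network has to handle the heterogeneous raw features directly.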