
National Chengchi University

sequential training of a series of simple decision tree estimators [164], where each successive tree attempts to minimize a certain loss function formed by the preceding trees. That is, in each stage, a new regression tree is sequentially added and trained based on the residual error of the previous ensemble model. The GBDT algorithm then updates all the predicted values by adding the predicted values of the new tree. This process is recursively continued until a maximum number of trees have been generated. Thus, the final prediction value of a single instance is the sum of the predictions of all the regression trees.
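The boosting loop described above can be sketched as follows; this is an illustrative implementation for squared loss using scikit-learn trees as base learners, not the thesis implementation, and the function name and hyper-parameter defaults are our own assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit_predict(X, y, X_test, n_trees=50, lr=0.1, max_depth=3):
    """Minimal gradient boosting for squared loss: each successive tree
    fits the residuals (negative gradients) of the current ensemble, and
    the ensemble prediction is updated by adding the new tree's output."""
    y = np.asarray(y, dtype=float)
    f_train = np.full(len(y), y.mean())       # initial constant prediction
    f_test = np.full(len(X_test), y.mean())
    for _ in range(n_trees):
        residuals = y - f_train               # residual error of previous ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        f_train += lr * tree.predict(X)       # update all predicted values
        f_test += lr * tree.predict(X_test)
    return f_test
```

The final prediction for each instance is thus the initial constant plus the (shrunken) sum of all tree predictions, exactly as described in the text.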

4.5 Experiments and Results

In this section, we present the experimental setting and discuss the results.

4.5.1 Experimental Setup

Popularity Measurement

Social media websites allow users to interact with posted content in various ways, resulting in different social signals that can be utilized to measure the popularity of social content (e.g., images, texts, and videos) on these websites. For instance, on Twitter, popularity can be gauged by the number of re-tweets, whereas the number of likes or comments can be used to measure popularity on Facebook. In this study, we use Flickr as the major image-sharing platform for predicting the popularity of social media images. Previous studies have used various metrics to measure image popularity on Flickr. For example, Khosla et al. [38] determined the popularity of an image based on the number of views it received, whereas McParlane et al. [41] adopted both view and comment counts as the principal metrics.

The dataset used in our experiments follows Khosla et al. [38], and the number of views was adopted as the popularity metric. The log function is applied to manage the large variation in the number of views across photos in the dataset. Moreover, images accumulate views for as long as they are online; thus, a log-normalization approach was used to normalize the effect of the time factor. The score proposed in [38] is defined as follows:

Score_i = log2(p_i / d_i + 1),    (4.17)

where pi is the popularity metric (the original number of views) of image i, and di is the number of days since the image first appeared on Flickr.
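For illustration, the score can be computed directly; note that reading the formula as log2(p_i / d_i + 1) is our interpretation of the log-normalization described above:

```python
import math

def popularity_score(views, days_online):
    """Popularity score in the style of Eq. (4.17), read as
    log2(views / days + 1): the per-day view rate is log-scaled to damp
    large variations, and the +1 keeps the score at 0 for zero views."""
    return math.log2(views / days_online + 1)
```

For example, an image with 1023 views after one day online scores exactly 10, while an image with no views scores 0 regardless of its age.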

Parameter Setting of Baseline Models

All the baseline models were implemented using the scikit-learn machine learning library [167, 168]. In the experiments, the performance of the baseline models was observed to be significantly influenced by several hyper-parameters. Therefore, we set the values of a few important SVR parameters as follows: C = 3, epsilon = 0.1, gamma = auto, and kernel = RBF. Regarding the DTR model, the best performance was achieved when the max_depth parameter was set to 10. Moreover, we set several parameters of GBDT: n_estimators = 2000, max_depth = 10, and learning_rate = 0.01. Finally, the remaining parameters were set to their default values in all the models.
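These settings can be reproduced in scikit-learn roughly as follows (a sketch; the dict name `baselines` is ours, and all unspecified arguments keep the library defaults):

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Hyper-parameter values reported in the text; everything else defaults.
baselines = {
    "LR": LinearRegression(),
    "SVR": SVR(C=3, epsilon=0.1, gamma="auto", kernel="rbf"),
    "DTR": DecisionTreeRegressor(max_depth=10),
    "GBDT": GradientBoostingRegressor(n_estimators=2000,
                                      max_depth=10,
                                      learning_rate=0.01),
}
```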


Dataset

The Social Media Prediction (SMP-T1) dataset presented at the ACM Multimedia Grand Challenge in 2017 was used as a real-world dataset to evaluate the performance of the proposed approach [169, 112]. The dataset consists of approximately 432K posts collected from the personal albums of 135 different users on Flickr. Every post in the dataset has a unique picture id along with an associated user id that identifies the user who posted the picture. Additionally, the following image metadata were provided: post date (postdate), number of comments (commentcount), number of tags in the post, whether the photo is tagged by some users or not (haspeople), and character length of the title and image caption (titlelen and deslen). Furthermore, user-centric information, namely the average view count, group count, and average member count, was also provided in the dataset. Each image has a label representing its popularity score (log-normalized views of the image). A few images selected from the dataset are shown in Fig. 4.5. In our experiments, 60% of the images were used for training, 20% for validation, and 20% for testing.
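The 60/20/20 split can be obtained, for example, with two calls to scikit-learn's `train_test_split`; the helper name and fixed seed below are illustrative assumptions, not the exact procedure used in the experiments:

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=42):
    """60% train / 20% validation / 20% test split: first carve off 40%
    of the data, then halve that remainder into validation and test."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test
```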

Fig. 4.5. Sample images from the dataset. The popularity of the images is sorted from more popular (left) to less popular (right).

In this study, we used the same metrics as the ACM Multimedia Grand Challenge [169, 112] to assess prediction accuracy: Spearman’s Rho [159], mean squared error (MSE), and mean absolute error (MAE).

• Spearman’s Rho: Used to calculate the correlation between the predicted popularity scores and the actual scores over the set of tested images. It is defined by the following equation:

Rho = 1 − 6 Σ_{j=1}^{n} d_j^2 / (n(n^2 − 1)),    (4.18)

where n is the size of the test sample and d_j is the difference between the ranks of the actual and predicted popularity values of each image j.
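Assuming tie-free rankings, the rank-difference formula above can be sketched directly in NumPy (the function name is ours); for tie-free data it agrees with `scipy.stats.spearmanr`:

```python
import numpy as np

def spearman_rho(actual, predicted):
    """Spearman's Rho via the rank-difference formula
    rho = 1 - 6 * sum(d_j^2) / (n * (n^2 - 1)),
    valid when there are no ties in either ranking."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    n = len(actual)
    rank_a = np.argsort(np.argsort(actual))     # 0-based ranks (tie-free case)
    rank_p = np.argsort(np.argsort(predicted))
    d = rank_a - rank_p                         # per-image rank differences
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```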

• MSE: Usually used to measure the average of the sum of squared prediction errors.

Each prediction error represents the difference between the actual value of a data point and the value predicted by the regression model. MSE has simple mathematical properties, making its gradient easy to calculate. In addition, it is often the default metric for predictive models because it is smoothly differentiable and computationally simple, and hence easy to optimize. A significant limitation of MSE is that it heavily penalizes large prediction errors by squaring them: because each error grows quadratically, outliers in the data contribute disproportionately to the total error. This indicates that MSE is sensitive to outliers and applies excessive weight to their effects, which can lead to an underestimation of model performance. This drawback becomes evident only when there are outliers in the data, in which case MAE is a suitable alternative. Formally, the MSE is defined as follows:

MSE = (1/n) Σ_{j=1}^{n} (Ŷ_j − Y_j)^2,    (4.19)

where Ŷ_j and Y_j are the predicted and actual popularity scores of the j-th image, respectively.

• MAE: A simple measure usually used to evaluate the accuracy of a regression model.

It measures the average of the absolute values of the individual prediction errors of the model over all samples in the test set. In the MAE metric, each prediction error contributes proportionally to the total error, meaning that larger errors contribute linearly to the overall error. Because the absolute value of the prediction error is used, MAE does not indicate whether the regression model overpredicts or underpredicts the input samples. Thus, it offers a relatively impartial view of how the model performs. By taking the absolute value of the prediction error rather than squaring it, MAE is more robust than MSE in handling outliers because it does not heavily penalize large errors. Hence, MAE has both advantages and disadvantages: it handles outliers well, but it fails to penalize large prediction errors. If Ŷ_j is the predicted popularity value


of the j-th sample, and Y_j is the corresponding actual popularity value, then the MAE estimated over a test sample of size n is defined as follows:

MAE = (1/n) Σ_{j=1}^{n} |Ŷ_j − Y_j|.    (4.20)
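Both error metrics defined above are straightforward to sketch in NumPy (the function names are ours):

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error: squares each prediction error, so large
    errors and outliers are penalized heavily."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean((predicted - actual) ** 2)

def mae(actual, predicted):
    """Mean absolute error: each error contributes linearly, so the
    metric is more robust to outliers than MSE."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(predicted - actual))
```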

4.5.2 Results

Using the features extracted for model learning, we trained the proposed VSCNN model to predict the popularity score. In the training stage, we used Adam [170], a stochastic gradient-based optimizer, to learn the parameters of VSCNN.

The initial learning rate was set to 0.001. In the experiments, the model was run for 50 training epochs over the entire training set. In each epoch, the model iterated over batches of the training set, where each batch consisted of 20 samples. Furthermore, the following mechanisms were added to the training process: 1) the learning rate was reduced by a factor of 0.1 every 10 epochs using a learning-rate scheduler, which facilitates learning; and 2) the model with the best validation accuracy was saved using a model checkpoint function. The cost function converged during the training phase. In the testing stage, the trained VSCNN model was applied to the test samples for evaluation. The evaluation results demonstrate that VSCNN achieves a Spearman’s Rho of 0.9014, an MAE of 0.73, and an MSE of 0.97; these values are listed in Tables 4.4 and 4.5 and are used for comparison with the baseline models.
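The scheduling and checkpointing logic described above can be sketched in plain Python; this is a simplified skeleton, not the actual VSCNN training code, and the validation loss here is a placeholder:

```python
def lr_at_epoch(epoch, base_lr=1e-3, step=10, gamma=0.1):
    """Step-decay schedule used in training: the learning rate is
    multiplied by 0.1 every 10 epochs (epochs counted from 0)."""
    return base_lr * gamma ** (epoch // step)

def train_loop(epochs=50):
    """Skeleton of the training procedure: per-epoch validation with
    best-checkpoint tracking, mirroring a model-checkpoint callback."""
    best_val, best_state = float("inf"), None
    lr_history = []
    for epoch in range(epochs):
        lr = lr_at_epoch(epoch)
        # ... iterate over mini-batches of 20 samples, take Adam steps at `lr` ...
        val_loss = 1.0 / (epoch + 1)          # placeholder validation loss
        if val_loss < best_val:               # keep only the best model so far
            best_val, best_state = val_loss, {"epoch": epoch, "lr": lr}
        lr_history.append(lr)
    return best_state, lr_history
```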



Fig. 4.6. Quality evaluation of the VSCNN model. (a) Error distribution histogram of the model, and (b) scatterplot of true values (x-axis) versus predicted values (y-axis).

Essential visual analytics were added for model quality evaluation by computing the error distribution histogram, which presents the distribution of the errors made by the model when predicting the popularity score for each test sample, as shown in Fig. 4.6 (a). A greater concentration of errors close to zero in the histogram indicates higher prediction accuracy.

Moreover, Fig. 4.6 (b) presents a scatterplot of the actual values on the x-axis versus the predicted values obtained by the model on the y-axis. This scatterplot presents the correlation between the actual and predicted values. If the data appear to be near a straight diagonal line, it indicates a strong correlation. Thus, a perfect regression model would yield a straight diagonal line from the data. From the results shown in Fig. 4.6, there are certain outliers that are not correctly predicted by the VSCNN model. Hence, we analyze these outliers below and explain in detail why our model fails to predict them.
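The two diagnostics in Fig. 4.6 reduce to simple computations; the sketch below uses NumPy, omits the plotting itself, and the function name is our own:

```python
import numpy as np

def fig_diagnostics(actual, predicted, bins=20):
    """Error-distribution histogram and diagonal correlation, mirroring
    Fig. 4.6: counts massed near zero error, and a correlation close to 1
    (points hugging the diagonal), both indicate accurate predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual                      # per-sample prediction error
    counts, edges = np.histogram(errors, bins=bins)  # data for Fig. 4.6(a)
    r = np.corrcoef(actual, predicted)[0, 1]         # diagonal fit, Fig. 4.6(b)
    return counts, edges, r
```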

In certain regression problems, the distribution of the target variable may have outliers (e.g., large or small values far from the mean value), which can affect the performance of the

predictive model. As shown in Fig. 4.7, the distribution of the target variable (view counts) of the training samples is highly non-uniform in our dataset; therefore, the proposed model attempts to minimize the prediction errors of the largest cluster of view counts of training samples. However, as the number of training samples with extremely high view counts is relatively low, it is more likely that the proposed model cannot correctly predict the high view counts, which will be observed as outliers in the predictive results.


Fig. 4.7. Distribution of the view counts of the training samples.

Fig. 4.8 shows a few good and bad predictions made by our proposed model on images from the test set. The correctly predicted examples are shown in Fig. 4.8 (a); note that our model achieves errors of only 0.001-0.009 relative to the actual scores. For example, the popularity scores of the first four images in Fig. 4.8 (a) are correctly predicted with errors of 0.001, 0.009, 0.002, and 0.001, respectively. In addition, the popularity scores of the last two images in Fig. 4.8 (a) are perfectly predicted with zero prediction error. On the other hand, a few wrongly predicted examples are shown in Fig. 4.8 (b). For example, the actual popularity score of the first image in this figure is 3, while the


(a) Correct examples of popularity prediction

(b) Wrong examples of popularity prediction

Fig. 4.8. Examples of correct and wrong predictions of some images from our dataset using the VSCNN model. The actual popularity score and its corresponding predicted score are displayed below each image.

score obtained by our model is 7.472, resulting in a substantial error of 4.472 in prediction.

This disparity is due to the strong indications of some user features for this image, such as the average view count and member count, which have values of 993.42 and 10,672, respectively, and contribute significantly to the model prediction when all features are integrated. Likewise, the last two images in Fig. 4.8 (b) are further examples that our proposed model predicts poorly.

The actual popularity scores of these two images are very high. We therefore suggest that our model cannot correctly predict the popularity of these images because the number of training samples with high popularity scores is extremely limited in our dataset, as shown in Fig. 4.7.

Comparison with Baseline Models

First, we train the SCNN model using three different types of social features to explore the influence of each type on predicting popularity. Subsequently, the SCNN model is


trained using all the social features as inputs. The prediction results of the SCNN model with different types of input features are summarized in Table 4.4, which shows that the user features perform exceptionally well in predicting the popularity of an image relative to the other two types of social features (i.e., post metadata and time), with a Spearman’s Rho of 0.7537, an MAE of 1.13, and an MSE of 2.17. This indicates that the popularity of an image is closely related to the popularity of the user uploading it; images shared on social media by popular users have a higher chance of obtaining more views. However, not all images posted by popular users are popular. To verify this, we used the popularity score and average view count as popularity metrics for images and users, respectively. As in previous studies [41, 171], the Pareto Principle (or 80-20 rule) was used to select a threshold differentiating images with high (20%) and low (80%) popularity scores. Likewise, we set a threshold to differentiate users with a high (20%) and low (80%) average view count.

Based on these differentiations, the top 20% of the images and users are considered as highly popular (or popular), while the remaining 80% are considered less popular (or common).

Accordingly, on average, 69.19% and 16.17% of the images posted by popular and common users, respectively, were determined to be popular. Thus, we conclude that not all images posted by popular users are always popular.
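The 80-20 thresholding above can be sketched as a percentile cut; this is an illustrative helper (the function name is ours), not the exact analysis code:

```python
import numpy as np

def pareto_split(scores):
    """Pareto (80-20) rule: items at or above the 80th percentile of
    `scores` are labeled popular (top 20%), the rest common (bottom 80%)."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, 80)
    popular = scores >= threshold
    return threshold, popular
```

The same helper applies to both image popularity scores and user average view counts, since each uses the same 20/80 cut.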

The post metadata are also noteworthy features: the SCNN model using these features achieved a Spearman’s Rho, MAE, and MSE of 0.6590, 1.35, and 2.98, respectively. This indicates that image-specific social features, such as tag count, title length, description length, and comment count, also play an important role in predicting popularity, which is expected; an image with significant tags or a longer description/title tends to be

Table 4.4 Performance comparison of SCNN, VCNN, and VSCNN models.

Models Features Spearman’s Rho MAE MSE

SCNN User 0.7537 1.13 2.17

Post_Metadata 0.6590 1.35 2.98

Time 0.5317 1.42 3.43

All_Social 0.8809 0.79 1.13

VCNN Color 0.3278 1.66 4.46

Gist 0.2612 1.72 4.67

LBP 0.3287 1.66 4.45

Aesthetic 0.2000 1.77 4.91

Deep 0.4101 1.61 4.13

All_Visual 0.4168 1.58 4.08

VSCNN Visual+Social 0.9014 0.73 0.97

more popular because it has a greater chance of showing up in the search results when people use keywords to search for images. Similarly, having more comments on the image suggests that more users interact with the image, which may lead to a greater number of views and thus, increased popularity. Considering the results, time features were also determined to make a significant contribution to popularity prediction, which indicates that the time when an image is posted may influence its popularity. For example, users tend to browse social networking sites at a particular time of the day, such as weekend leisure time, which indicates that images posted during that time are more likely to receive a large number of views and become popular.

Furthermore, while each type of social feature performs sufficiently, the SCNN model achieves the best predictive performance when all the social features are combined, as shown in the fourth row of Table 4.4. This suggests that all the social features proposed are strongly correlated and provide complementary information to each other. Fig. 4.9 (a) presents a diagram of the predicted values obtained by SCNN and the corresponding actual values.


Similarly, the VCNN model was trained using each of the individual visual features to analyze their effect on predicting image popularity. We also integrated all the visual features and used them as the input to the model. The evaluation results are listed in Table 4.4. Deep learning features were observed to outperform other visual features. However, it is important to note that the VCNN model achieves the best performance in terms of all the evaluation metrics when all the visual features are combined. In addition, as indicated by the results of the VCNN model, visual features are less effective than social features in terms of image popularity prediction. This finding is consistent with previous studies [38, 172, 43, 173].

Nevertheless, the visual features are useful when no post metadata exist, or in scenarios where no social interactions were recorded before the image was published (e.g., a user who has newly joined the social network). This indicates that image content also plays a critical role in popularity prediction and may complement the social features.

A diagram of the predicted values obtained using VCNN and the corresponding actual values is shown in Fig. 4.9 (b). Finally, the performance of the proposed model is compared with the best performance of both VCNN and SCNN in terms of all the evaluation metrics;

the results are listed in Table 4.4. Clearly, VSCNN outperforms VCNN and SCNN, with a relative improvement of 2.33% (SCNN) and 116.27% (VCNN) in terms of Spearman’s Rho, and a decrease of 7.59% (SCNN) and 53.80% (VCNN) in MAE, as well as 14.16% (SCNN) and 76.23% (VCNN) in MSE.
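The reported percentages can be verified directly from the values in Table 4.4; the helper names below are ours:

```python
def rel_improvement(new, old):
    """Relative gain of `new` over `old`, in percent (higher-is-better metrics)."""
    return (new - old) / old * 100

def rel_decrease(new, old):
    """Relative reduction from `old` to `new`, in percent (error metrics)."""
    return (old - new) / old * 100

# Spearman's Rho gains of VSCNN (0.9014) over SCNN (0.8809) and VCNN (0.4168)
assert round(rel_improvement(0.9014, 0.8809), 2) == 2.33
assert round(rel_improvement(0.9014, 0.4168), 2) == 116.27
# MAE and MSE reductions of VSCNN (0.73, 0.97) relative to SCNN (0.79, 1.13)
assert round(rel_decrease(0.73, 0.79), 2) == 7.59
assert round(rel_decrease(0.97, 1.13), 2) == 14.16
```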

Subsequently, the other four baseline models (i.e., LR, SVR, DTR, and GBDT) were trained using each single feature and various combinations thereof to demonstrate the ef-fectiveness of the proposed features in predicting image popularity. The predictions are



Fig. 4.9. Diagrams of the predicted values obtained using the CNN-based baseline models and their corresponding ground truth values. (a) SCNN, and (b) VCNN.

Table 4.5 Performance comparison of LR, SVR, DTR, GBDT, and VSCNN models.

Features        LR                    SVR                   DTR                   GBDT                  VSCNN
                Rho    MAE  MSE       Rho    MAE  MSE       Rho    MAE  MSE       Rho    MAE  MSE       Rho    MAE  MSE
Color           0.0856 1.81 5.10      0.2569 1.72 4.74      0.1915 1.76 4.93      0.3381 1.66 4.41      -
Gist            0.1337 1.79 5.04      0.3209 1.67 4.55      0.1436 1.80 5.12      0.3176 1.69 4.51      -
LBP             0.1546 1.79 5.04      0.3028 1.69 4.63      0.1640 1.78 5.01      0.3126 1.68 4.52      -
Aesthetic       0.1221 1.80 5.06      0.1866 1.77 4.97      0.1661 1.79 5.00      0.2040 1.77 4.88      -
Deep            0.3701 1.66 4.43      0.4754 1.53 3.88      0.2330 1.76 4.96      0.4403 1.59 4.05      -
All_Visual      0.3837 1.65 4.35      0.5018 1.50 3.73      0.2384 1.76 4.95      0.4890 1.53 3.82      -
User            0.6449 1.41 3.17      0.7548 1.12 2.18      0.7579 1.12 2.15      0.7580 1.12 2.15      -
Post_Metadata   0.5266 1.68 4.41      0.6126 1.42 3.28      0.6682 1.32 2.91      0.6962 1.27 2.72      -
Time            0.1337 1.80 5.00      0.2681 1.70 4.65      0.3485 1.61 4.19      0.6285 1.29 2.84      -
All_Social      0.7114 1.28 2.72      0.8292 0.94 1.58      0.8049 1.01 1.74      0.8611 0.86 1.30      -
Visual+Social   0.7341 1.22 2.44      0.8347 0.93 1.54      0.8200 0.96 1.65      0.8778 0.80 1.15      0.9014 0.73 0.97

shown in Table 4.5, which indicates that the user feature yields the best results. This suggests that the characteristics of the person who posts a photo determine its popularity to a significant extent. Furthermore, post metadata and time features were also determined to be sufficient predictors. Additionally, when all social context features are combined and used as inputs,



Fig. 4.10. Diagrams of the predicted values obtained using the four machine learning baseline models and their corresponding ground truth values. (a) LR, (b) SVR, (c) DTR, and (d) GBDT.

the performance improves significantly for all the models, and GBDT achieves the best performance in terms of all the evaluation metrics.

The deep learning feature is significant and outperforms the other visual features, namely color, gist, LBP, and aesthetics, although these features perform sufficiently in all models. Nevertheless, the performance of all models is improved when all visual features are


combined. Moreover, note that combining visual and social features leads to a significant improvement in the performance of all the models compared to that exhibited using either set of these features independently.

Fig. 4.10 presents diagrams of the predicted values obtained using the four machine learning baseline models and their corresponding ground truth values. As presented in Table 4.5 and shown in Fig. 4.10, GBDT outperformed all other machine learning models, with a relative improvement from 5.16% (SVR) to 19.57% (LR) in terms of Spearman’s Rho, and with decreases from 13.98% to 34.43% and from 25.32% to 52.87% in terms of MAE and MSE, respectively. Finally, the performance of the proposed VSCNN model was compared with the best performance obtained by each of the four baseline models (LR, SVR, DTR, and GBDT); the results are shown in Table 4.5. Compared with GBDT, VSCNN improves the prediction performance by approximately 2.69%, 8.75%, and 15.65% in terms of Spearman’s Rho, MAE, and MSE, respectively.

Fig. 4.11 presents the best prediction performance for all the models in terms of the three evaluation metrics; the VSCNN outperforms all six baseline models in predicting the
