Predicting Pairwise Relationships - Big Data Analysis Based on User-Generated Multimedia Content

Prior studies indicate that users are reluctant to annotate photos, let alone the faces in them. This makes automatically predicting pairwise relationships (e.g., mother-child, father-child) from image content all the more important. Face recognition remains very challenging for consumer photos taken in the wild; once the pairwise relationships are identified, however, unknown identities can potentially be inferred automatically from partial name labels and the existing social relationships. Traditionally, predicting pairwise relationships has relied on the social contexts between the two people, such as relative distance, face size, gender, and age attributes [67, 75]. As illustrated in Fig. 4.1 (a)(b), the social contexts between two people are quite limited and thus lead to poor recognition performance. However, more contextual cues can be inferred when all the faces are considered holistically, as shown in Fig. 4.1 (c). Therefore, we hypothesize that inferring pairwise relationships with the proposed face graph is promising.

The face graph of a group photo may contain many faces, which inevitably confuses co-occurrence measurement. Informative subgraphs, on the other hand, can filter out unintended information while preserving the co-occurring relationships. Therefore, we use the subgraphs that co-occur with the designated pairwise relationship as features. In the training phase, we manually label pairwise relationships on a face graph according to the social relationships in the photo. By subgraph mining (following the process in Sec. 4.3.2) on the labeled face graphs, we discover the informative subgraphs containing the edges labeled with the designated relationship. As shown in Fig. 4.6 (a), the mined informative subgraphs differ across designated pairwise relationships (denoted by gray triangles and their connecting lines). Taking “sibling” as an example, the informative subgraphs often contain a woman (circle) or a man (rectangle), who is possibly the siblings' mother or father.

When predicting the relationship of a pair q (as shown in Fig. 4.6 (b)), we first construct the face graph G_q following the process in Sec. 4.3.1. In G_q, we use graph matching to check for the presence of each informative subgraph s_i mined from the training images. Finally, the pairwise relationship r^∗ is predicted by a naive Bayes classifier, where P(s_i|r_l) is the image frequency of the informative subgraph s_i in the image collection containing the pairwise relationship r_l:

$$ r^{*} = \arg\max_{r_l} \prod_{s_i \subseteq G_q} P(s_i \mid r_l), \tag{4.4} $$

Because relatively few subgraphs are present in G_q, P(s_i|r_l) must be smoothed appropriately. In the experiments, we demonstrate the superiority of this approach over prior work in predicting four typical pairwise relationships.
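The prediction rule in Eq. (4.4) can be illustrated with a minimal sketch. The text does not specify the smoothing scheme, so additive (Laplace) smoothing with a constant `alpha` is assumed here; the function names, the subgraph identifiers, and the toy training data are hypothetical and serve only to show the mechanics.

```python
import math
from collections import defaultdict

def train_likelihoods(images, vocabulary, alpha=1.0):
    """Estimate P(s_i | r_l) as a smoothed image frequency.

    images: list of (relationship_label, set_of_subgraph_ids) pairs.
    alpha:  additive smoothing constant (an assumption; the text only
            states that smoothing is required).
    """
    image_count = defaultdict(int)
    hit_count = defaultdict(lambda: defaultdict(int))
    for label, subgraphs in images:
        image_count[label] += 1
        for s in subgraphs:
            hit_count[label][s] += 1
    return {
        label: {s: (hit_count[label][s] + alpha) / (n + 2 * alpha)
                for s in vocabulary}
        for label, n in image_count.items()
    }

def predict(likelihoods, present_subgraphs):
    """Return the label maximizing the product of P(s_i | r_l) over the
    informative subgraphs matched in G_q (computed in log space)."""
    def log_score(label):
        return sum(math.log(likelihoods[label][s])
                   for s in present_subgraphs if s in likelihoods[label])
    return max(likelihoods, key=log_score)

# Hypothetical toy data: each entry is one labeled training image.
training = [
    ("sibling", {"s1", "s2"}),
    ("sibling", {"s1"}),
    ("couple", {"s3"}),
    ("couple", {"s2", "s3"}),
]
model = train_likelihoods(training, vocabulary={"s1", "s2", "s3"})
print(predict(model, {"s1"}))  # sibling
```

Working in log space avoids numerical underflow when many subgraphs are matched, and the smoothing keeps a subgraph never seen with a given relationship from forcing the product to zero.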

4.6 Experiments

In this section, we (1) evaluate the effectiveness of BoFG for classifying family-type photos and then (2) evaluate the capability of informative subgraphs for predicting pairwise relationships (Sec. 4.6.6). Face detection and facial attribute detection techniques have been developed for years in both academic studies and commercial products. Previous work [42] has shown that the classification accuracy of facial attributes can exceed 80% on average. However, to keep errors in facial attribute estimation from affecting the evaluation, we experiment on the public data set [32], which provides group photos together with the associated attributes of the faces. The data set was collected from social media (Flickr) with specific keywords and is categorized into family images, group images, and wedding images. We use the keywords as soft ground truth to obtain family-type images. In total, 1,167 family images and 1,263 non-family images are retained for the experiments, which are conducted with 5-fold cross-validation. Note that we evaluate the proposed approach on photos containing at least three faces, because those groups are more complex and more challenging for analysis and prediction. For groups containing fewer than three people, the prediction can be made directly from their attributes and distance [75]. Moreover, the proposed approach involves facial attributes rather than face identities; therefore, the discovered informative subgraphs are general and cross-family. In other words, our method operates on a per-photo basis rather than a per-family basis. We further investigate vital factors in classifying family photos with BoFG: (1) different learning approaches, (2) the mined informative subgraphs, (3) sensitivity to normalization, and (4) subgraph selection.

4.6.1 Classification

The analysis of text categorization in [39] concluded that Support Vector Machines (SVMs) are excellent classifiers for BoW-like representations. The proposed bag-of-facial-subgraphs follows a similar paradigm; therefore, we adopt SVMs as the learning method for family photo classification. To maximize performance, we evaluate three common SVM kernels for group classification.

$$ \text{Linear}: K(x, y) = x^{T} y, \quad \text{RBF}: K(x, y) = e^{-\gamma \lVert x - y \rVert^{2}}, \quad \text{RBF-}\chi^{2}: K(x, y) = e^{-\gamma \sum_{k} \frac{(x_k - y_k)^{2}}{2 (x_k + y_k)}}, $$

where x and y are BoFG feature vectors and γ > 0. The RBF kernel maps the training data non-linearly to a high-dimensional space and can therefore handle cases where the relation between class labels and feature vectors is nonlinear. The RBF-χ² kernel is another non-linear kernel that is commonly used in image classification.
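As a rough illustration of the three kernels, the following sketch evaluates them on two small histogram vectors. The χ² form used here (γ multiplying a sum of squared differences over 2(x_k + y_k)) is one reading of the formula above and should be treated as an assumption; the `eps` guard and the example vectors are likewise illustrative and not taken from the experiments.

```python
import math

def linear_kernel(x, y):
    """K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def rbf_chi2_kernel(x, y, gamma=1.0, eps=1e-12):
    """K(x, y) = exp(-gamma * sum_k (x_k - y_k)^2 / (2 (x_k + y_k))).

    eps avoids a 0/0 division when a histogram bin is empty in both
    vectors (an implementation detail assumed here).
    """
    chi2 = sum((a - b) ** 2 / (2.0 * (a + b) + eps) for a, b in zip(x, y))
    return math.exp(-gamma * chi2)

# Hypothetical BoFG histograms (subgraph counts) for two photos.
x = [1.0, 0.0, 2.0]
y = [0.0, 1.0, 2.0]
print(linear_kernel(x, y), rbf_kernel(x, y), rbf_chi2_kernel(x, y))
```

The χ² distance divides each squared difference by the bin mass, so a difference in a rarely occurring subgraph weighs more than the same difference in a frequent one, which is one common explanation for why this kernel suits histogram features.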

Although SVMs are very powerful for learning from high-dimensional features, they perform no feature selection and can only work on a fixed, provided set of features (subgraphs). Because subgraph enumeration is computationally expensive, Kudo et al. [40] proposed a boosting-based algorithm that couples subgraph mining with classification and thus avoids wasting time on enumerating non-discriminative subgraphs. In the experiments, we apply both the aforementioned kernel-based and boosting-based approaches to compare the effects of different learning methods on the proposed feature representation.

4.6.2 Effects from Learning Approaches

As shown in Fig. 4.7, the linear kernel yields the lowest accuracy among the SVM kernels with BoFG features, partly because the number of training samples is small relative to the high dimensionality of the features. The RBF kernel, on the other hand, maps the training data non-linearly to a high-dimensional space and therefore leads to better classification results. In our experiments, the chi-square kernel is superior to both the linear and RBF kernels because the proposed features are essentially histograms of informative subgraphs. That said, the difference in accuracy between linear and non-linear kernels is small, because the proposed feature representations are sparse and discriminative. Therefore, as with document vectors or visual word vectors, they tend to be linearly separable [86]. The boosting-based approach also achieves a classification accuracy of 88.67%, which is on par with SVMs with a linear kernel. We also train a family photo classifier by SVMs using the low-level (and competitive) PHoG feature. Its classification accuracy reaches only 67.94%, mainly because it lacks the (semantic) social cues captured by BoFG.

[Figure 4.7 plot not recovered. Vertical axis: Classification Accuracy (0.6 to 1). Extracted bar values: 0.887, 0.903, 0.898, 0.885, 0.679.]

Figure 4.7: Performance comparisons for social group type classification (family vs. non-family) by different features. The chi-square kernel is superior to both the linear and RBF kernels, as it has been found to work well with histogram representations (e.g., BoW [85], BoFG). Note that the accuracy using the low-level feature PHoG is only 67.94%.

4.6.3 Mined Informative Subgraphs for Family

In Fig. 4.8, we display the mined informative subgraphs for the two classes, organized by the number of vertices (|V^′|) they contain. Block (a) shows the most informative subgraphs in family photos, and block (b) shows their counterparts in non-family photos. The informative subgraphs in family photos clearly contain faces with larger age gaps (e.g., a-2, a-3, a-4). In addition, the order distances between two faces are much smaller (most are equal to 0); that is, family members tend to stand closer to each other. Couple-like subgroups also frequently co-occur with kids in family photos (e.g., a-2). Seniors tend to stand in the center of a family group (e.g., a-4), so they have smaller order distances and are usually linked to the others. In contrast, the informative subgraphs in non-family groups mostly consist of young people with smaller age gaps (a consequence of the collected dataset photos). People of the same gender stand together (e.g., b-4) more frequently than in family photos. They may prefer to arrange themselves in a row; therefore, the order distance is relatively larger (e.g., b-3, b-4).

[Figure 4.8 image not recovered. Surviving panel title: (a) Family.]

Figure 4.8: Block (a) shows the most informative subgraphs (G^′) in family photos, and block (b) shows their counterparts. Both are grouped by the number of vertices (|V^′|). The informative subgraphs in family photos clearly contain faces with larger age gaps (e.g., a-2, a-3, a-4). In addition, the order distances between two faces are much smaller; that is, family members tend to stand closer to each other. Couple-like subgroups also frequently co-occur with kids in family photos (e.g., a-2). In contrast, the informative subgraphs in non-family groups mostly consist of young people with smaller age gaps. People of the same gender stand together more frequently than in family photos. They may prefer to arrange themselves in a row (e.g., b-3, b-4); therefore, the order distance is relatively larger. (Best seen in color.)

4.6.4 Sensitivity in Pixel vs. Order Distance

BoFG adopts order distance as the edge label and is therefore insensitive to photo variations (e.g., size, number of faces). Pixel distance, by contrast, is highly sensitive to the normalization scale. Our experiments show that pixel distance normalized at different scales results in unstable classification performance. We quantized the pixel distance at different scales, ranging from 5 to 15 degrees (quantization levels), and used the normalized distance degrees as the edge labels. Fig. 4.9 shows the classification accuracy of BoFG constructed with pixel distance and with order distance, all learned by the boosting-based approach. As it shows, the results for pixel distance fluctuate as the normalization scale varies and are also affected by the test photos. The proposed order distance avoids this instability and performs robustly across consumer photos.

[Figure 4.9 plot not recovered. Vertical axis: Classification Accuracy (0.4 to 1). Horizontal axis: Number of Pixel-Distance (quantization scales), 5 to 15. Series: order, pixel.]

Figure 4.9: The pixel distance adopted in prior work suffers from high variations in photo size, face scale, number of people, etc. The proposed order distance is more robust to these variations.
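To make the sensitivity of pixel distance concrete, the sketch below quantizes the distance between two face centers into a chosen number of levels. The normalization by a reference length `norm` (for instance the photo width) and the equal-width bins are assumptions made for illustration; the text does not spell out the exact normalization used in the experiments.

```python
import math

def quantize_pixel_distance(p, q, norm, levels):
    """Map the Euclidean distance between face centers p and q to an
    integer edge label in [0, levels - 1].

    norm:   reference length used for normalization (hypothetical choice,
            e.g. the photo width in pixels).
    levels: number of quantization levels (5 to 15 in the experiments).
    """
    d = math.hypot(p[0] - q[0], p[1] - q[1]) / norm
    d = min(max(d, 0.0), 1.0)           # clamp to [0, 1]
    return min(int(d * levels), levels - 1)

# The same two faces, 350 pixels apart.
p, q = (100, 200), (450, 200)

# Changing the number of levels changes the edge label.
print([quantize_pixel_distance(p, q, norm=1000, levels=n) for n in (5, 10, 15)])
# [1, 3, 5]

# Cropping the photo (width 1000 -> 700) changes the label as well.
print(quantize_pixel_distance(p, q, norm=700, levels=10))
# 5
```

Because the edge label depends on both the quantization scale and the photo geometry, the same pair of faces can match different subgraphs in different photos, which is consistent with the fluctuation reported for pixel distance.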

4.6.5 Effects of Subgraph Selection

The large number of features (subgraphs) inevitably incurs heavy computational cost in both model learning and on-line classification. This problem is especially critical for social media, where data are growing exponentially. To reduce the size of the subgraph vocabulary, we further select informative subgraphs by document frequency and sequential covering (Sec. 4.4.1). As Fig. 4.10 shows, both subgraph selection methods can retain only 10% of the subgraphs while maintaining the same classification accuracy (89.75% with 4,315 subgraphs), which makes the proposed framework more scalable. The performance of sequential covering (Fig. 4.10, DF+SC) is slightly better than that of document frequency (Fig. 4.10, DF). The difference may come from the use of the given class labels, which are exploited only in sequential covering. Interestingly, increasing the number of subgraphs does not always benefit learning. As the experiment shows, the classification accuracy degrades notably when the number of features exceeds 30,000. The drop can be attributed to overfitting when learning from high-dimensional features.

[Figure 4.10 plot not recovered. Vertical axis: Classification Accuracy (0.78 to 0.92). Horizontal axis: Number of Features (500, 600, 700, 800, 900, 4315, 7273, 33308). Series: SC+DF, DF.]

Figure 4.10: Both subgraph selection methods, document frequency (DF) and sequential covering (SC), can retain only 10% of the subgraphs while maintaining the classification accuracy, which makes the proposed framework more scalable. Notably, besides efficiency, subgraph selection is vital because it avoids the overfitting problem commonly observed when learning from high-dimensional features.
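Document-frequency selection can be sketched as follows. The details of Sec. 4.4.1 are not reproduced here, so this is a generic version that keeps the subgraphs occurring in the most photos; the function name, the `keep_ratio` parameter, and the toy data are hypothetical, and sequential covering is not shown.

```python
from collections import Counter

def select_by_document_frequency(photos, keep_ratio=0.1):
    """Keep the fraction keep_ratio of subgraphs with the highest
    document frequency (number of photos containing the subgraph).

    photos: list of sets of subgraph ids, one set per photo.
    """
    df = Counter(s for subgraphs in photos for s in set(subgraphs))
    k = max(1, round(len(df) * keep_ratio))
    # Sort by descending frequency; break ties by id for determinism.
    ranked = sorted(df, key=lambda s: (-df[s], s))
    return ranked[:k]

# Hypothetical toy vocabulary of four subgraphs over three photos.
photos = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
print(select_by_document_frequency(photos, keep_ratio=0.5))  # ['a', 'b']
```

Unlike sequential covering, this criterion ignores class labels, which matches the explanation in the text for why DF alone performs slightly worse.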

4.6.6 Performance of Predicting Pairwise Relationships

We use the family photos in [32] for the experiments and predict four pairwise relationships: couple, mother-child, father-child, and sibling. In total, 1,332 pairwise relationships are labeled in 772 photos (at least 250 labels for each pairwise relationship). We use one half of the labeled data for training and the other half for testing. To verify the support provided by the informative subgraphs, we remove the attributes of the two people involved in a pairwise relationship. That is, the social contexts between the two people are hidden in both the training and testing phases. The confusion matrix in Fig. 4.11 shows that relying solely on the information from the subgroups on the face graph can successfully infer the pairwise social relationships and achieve high accuracy. The results also suggest that the additional information provided by the face graph can compensate for errors in estimating the social contexts between the pair of faces. We also obtain superior performance (36% relative improvement on average) compared with the confusion matrix reported in [75], which was obtained on the same database [32]. For example, the recognition of the “sibling” relationship in [75] is less accurate, probably because the social contexts (relative distance, gender, etc.) between siblings are very ambiguous; in our work, the co-occurring subgraphs, which frequently contain links to the parents, provide further support for recognizing pairwise relationships.

[Figure 4.11 image not recovered. Extracted confusion-matrix cell values, in extraction order: .66 .08 .10 .16; .07 .75 .01 .16; .70 .19 .08 .03; .07 .02 .01 .91.]

Figure 4.11: The confusion matrix for predicting pairwise relationships. The results outperform those reported in [24] because the informative subgroups provide supplemental support for determining the pairwise relationship. For example, the largest gain is in “sibling”, since the co-occurring parent-like subgroups bring more support.

4.7 Remarks

Consumer photos exist in sheer volume, and most of them contain groups of people. In this paper, we propose a novel graph feature, bag-of-face-subgraphs (BoFG), for describing the social subgroups in a group photo. The informative subgraphs are automatically discovered from community-contributed photos and reflect the social subgroups that commonly appear in the communities. BoFG preserves the occurrence patterns of social subgroups, which are effective for analyzing human-related activities and group types. We demonstrate the capability to classify family-type photos and achieve a large improvement (30.5% relative) over prior work using state-of-the-art low-level visual features. The proposed framework also considers subgraph selection to ensure scalability. Furthermore, the co-occurrence cues in the informative subgraphs help predict pairwise relationships, which benefits inferring unknown identities in group photos and shows a salient improvement over prior work (36% relative). In the near future, we will investigate more social contexts (e.g., face angles) and people attributes (e.g., race) to enrich the potential social interactions in emerging group photos. Moreover, we will extend the social groups discovered from user-contributed photos to inferring implicit interactions in social networks.
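The relative improvement quoted for family photo classification can be checked with a short calculation. The text does not state which accuracies enter the 30.5% figure; pairing the boosting-based BoFG accuracy (88.67%) with the PHoG baseline (67.94%) from Sec. 4.6.2 reproduces it, so that pairing is assumed here.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline

bofg_boosting = 0.8867  # BoFG with the boosting-based approach (Sec. 4.6.2)
phog_svm = 0.6794       # PHoG low-level feature with SVMs (Sec. 4.6.2)

print(f"{relative_improvement(bofg_boosting, phog_svm):.1%}")  # 30.5%
```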

4.8 Extensive Applications: Personalized and Group
