
Understanding Social Interaction Using Images and Deep Learning (使用圖像和深度學習了解社交互動) - NCCU Academic Hub (政大學術集成)


Academic year: 2021


National Chengchi University (國立政治大學)
College of Science, Department of Computer Science
Taiwan International Graduate Program in Social Networks and Human-Centered Computing

Doctoral Dissertation

Understanding Social Interaction Using Images and Deep Learning
(使用圖像和深度學習了解社交互動)

Student: Fatma Said Abousaleh Abdeo (艾費瑪)
Advisors: Yu Tsao (曹昱), Ph.D.; Neng-Hao Yu (余能豪), Ph.D.

January 2021 (Republic of China year 110)

DOI:10.6814/NCCU202100261

This thesis is dedicated to the sake of Allah (SWT), who has been my constant source of inspiration, wisdom, health, knowledge and skills during the challenges of my whole life. He also gave me the strength to complete my research work successfully when I thought of giving up. I also would like to dedicate this work to my beloved parents, who continually provide me with their moral, spiritual, emotional, and financial support.

Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements.

Furthermore, I am aware of and understand the University’s policy on plagiarism and I certify that this dissertation is my own work, and, to the best of my knowledge and belief, does not breach copyright law, and has not been taken from other sources except where such work has been cited and acknowledged within the text.

Fatma Said Abousaleh
January 2021

Acknowledgments

The completion of this doctoral dissertation would not have been possible without the support of several people. I would like to express my heartfelt gratitude to all who in one way or another contributed to the completion of this thesis. First and foremost, I would like to thank Allah (SWT) for granting me the blessing, patience, health and strength to undertake this research task and for enabling me to complete it. Thank you so much, Allah, I will keep on trusting you for my future life.

I would like to express my deepest appreciation and thanks to my advisor Dr. Yu Tsao for offering me such a great opportunity to join his research lab and for taking on the daunting responsibility of supervising this PhD research. He provided me with extensive personal and professional guidance and taught me a great deal about both scientific research and life in general. On the academic level, he provided all the necessary research facilities and the enlightening work environment that ultimately helped me to carry out and complete this study successfully. He also taught me how to research a problem and achieve goals. On a personal level, he inspired me with his hardworking and passionate attitude, which helped me to belong and fit into the amazing and diverse culture of Taiwan. I am also extremely grateful for his generous financial support during this research work.

I would also like to express my warmest thanks to my co-advisor Dr. Neng-Hao Yu for encouraging me in all stages of this work and for his vital support in crucial times. His encouragement and belief in me have greatly assisted me to carry on my research work, develop my academic identity and formulate the thesis of this research successfully. He was very generous in sharing his wealth of knowledge and experiences in research, academic life and beyond. These experiences will be my primary guide during my life and academic career.

I am extremely delighted to express my gratitude to the Taiwan International Graduate Program (TIGP), the Social Network and Human-Centered Computing (SNHCC) program, and the Institute of Information Science (IIS) of Academia Sinica for the PhD fellowship that provided the financial support and the facilities needed to carry out my research work. I am also grateful to the research fellows of the Institute of Information Science (IIS) and the Research Center for Information Technology Innovation (CITI) in Academia Sinica for their valuable guidance, consistent assistance, and useful critiques throughout this study. I am also deeply thankful to the National Chengchi University (NCCU) and all the faculty members of the Department of Computer Science for their support throughout my study period there.

I am profusely thankful to Dr. Mark Liao for his constant guidance, empathy and motivation during this study. He has always made himself available to listen and clarify my doubts despite his busy schedule, and I consider it a great opportunity to learn from his research expertise. He gave useful advice that helped in addressing some of the shortcomings in this thesis. Your advice on both research as well as on my career has been priceless. Thank you, sir, for all your help and support. I also wish to thank Dr. Wen-Huang Cheng for the valuable and helpful suggestions and comments that he provided during the period of his supervision of this PhD research. His vital advice significantly contributed to the quality of this research.

Special thanks are due to my labmates in the Biomedical Acoustic Signal Processing Laboratory (Bio-ASP) and the Multimedia Computing Laboratory (MCLab), and to my TIGP-SNHCC colleagues, for their prompt help, constructive suggestions and productive discussions during this PhD journey and for the pleasurable moments that we spent together in Taiwan. Similarly, I am thankful to the assistant of TIGP-SNHCC, Ms. Chia-Chien, for her wonderful services and unconditional help all the time during my stay in Academia Sinica and Taipei.

Nobody has been more important to me in the pursuit of seeing this achievement come true than the members of my family. I owe a lot to my parents for their love, care, sacrifices, confidence, encouragement, and prayers at every stage of my personal and academic life. They set a great example for me about how to live, study, work, and handle mental stress in a dynamic environment. Thank you both for giving me the strength to reach for the stars and chase my dreams. I am also grateful to my brother and sisters, who have provided me with endless support and encouraged me to strive towards my goal. I would also like to thank all my other family members and all my friends for their outstanding emotional support and well wishes during this whole journey.

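Abstract in Chinese (中文摘要)

People can usually interact with others naturally and without effort, and social signals are a natural product of effective communication. However, enabling computers to analyze and understand social interactions, and to properly represent human social signals, remains one of the greatest challenges in the field of social signal processing (SSP). Social interaction can take place through two different channels: face-to-face or online. In face-to-face interaction, people commonly rely on observable nonverbal behavioral cues (e.g., gestures, facial expressions, vocal expression, body movement, and interpersonal distance) to understand social signals and behavior and to interact with others. Research on social interaction based on facial image recognition has recently received great attention from the academic community, because facial images contain diverse facial traits that convey information about age, gender, emotion, and health. This information plays an important role in describing individuals and in social communication; age in particular is one of the most basic factors affecting our daily social interactions. Automatic age estimation from facial images has therefore become an important goal in artificial intelligence. Despite great progress in recent years, it remains an unsolved problem, because the variability of facial appearance depends on factors such as genetic traits, lifestyle, facial expressions, and age. On the other hand, online interaction concerns how users interact with others through social platforms such as Facebook, Twitter, Instagram, or Flickr. Most social networks allow users to create and share content and to interact with content created by other users in different forms (e.g., viewing, liking, or commenting), which produces large amounts of social content containing information about users' interests, opinions, daily lives, and interactions. The explosive growth of social media content and of online interaction means that a small amount of social content receives a great deal of attention and becomes popular, while the vast majority is ignored. Among the different kinds of content on social media, images have become an important medium for communication between users, which leads to variation in the number of views or the social popularity they receive. This phenomenon has attracted researchers in computer vision and multimedia to investigate why particular images are popular and how to predict their popularity automatically. However, because of users' individual preferences and other factors such as their interaction history on social media, the popularity of images on social media remains difficult to measure, predict, and define. To this end, this dissertation proposes a framework for understanding social interaction in the real and online worlds in order to address these challenges.

First, this dissertation addresses the problem of automatic age estimation from facial images. Conventional facial age estimation methods determine a person's age from his/her photograph by directly analyzing facial information (e.g., nose, mouth, eyes). However, telling someone's age at a glance is inherently a difficult task, even for humans. To address this problem, and inspired by human cognitive processes, this dissertation proposes a comparative deep learning framework. The framework estimates age from a facial image by comparing the input image with selected reference images (the baseline set) and deciding which is younger or older. A region-convolutional neural network (R-CNN) is used to extract facial features from the input image and the reference samples. Then, to estimate the age comparison, an energy function takes information from the fully connected layer and produces a set of hints, each representing a comparative relationship (younger or older). Finally, the prediction stage of the model collects all hints and determines the person's age by majority vote. Experimental results on the FG-NET, MORPH, and IoG datasets show that the proposed framework outperforms state-of-the-art methods, with improvements of 13.24% on FG-NET (mean absolute error), 23.20% on MORPH (mean absolute error), and 4.74% on IoG (age group classification accuracy).

Second, this dissertation studies the problem of image popularity prediction on social media. With the rise of social networks such as Flickr and Facebook, users often interact by sharing photos of their daily lives. Although billions of images are uploaded to the internet every minute, only a small fraction receive more than a million views, while the others are completely ignored. Even different photos uploaded by the same user do not receive the same number of views. Predicting image popularity is therefore a topic worth studying and a key challenge in social media analytics, because it offers a way to understand individual preferences and public attention. However, identifying the key factors behind image popularity and building a model that can predict image popularity on social media remain unsolved problems. To this end, this dissertation proposes a multimodal deep learning model that predicts the popularity of images on social media using multiple visual and social features associated with image popularity. The model uses two CNNs to learn high-level features of the inputs separately and merges them into a unified network for popularity prediction. The performance of the model was evaluated through a series of experiments on a real-world Flickr dataset. The results show that the proposed prediction model outperforms four traditional machine learning algorithms, two CNN models, and other recent methods, with performance improvements of more than 2.33% (Spearman rank correlation coefficient), 7.59% (mean absolute error), and 14.16% (mean squared error).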

Abstract

Human beings generally have the capability to interact easily with each other without any obvious effort, and social signals are the natural result of this effective communication. The process of providing computers with an equivalent capability that enables them to analyze and understand social interactions, and then properly represent human social signals, remains one of the greatest scientific challenges in the field of social signal processing (SSP). Social interactions can take place in two different ways: face-to-face or cyber. In face-to-face interactions, people commonly use observable nonverbal behavioral cues (e.g., gestures, facial expressions, vocalizations, postures, interpersonal distance, etc.) to understand and interact with the social signals and behavior of others. The problem of recognizing social interactions from face images has recently received significant attention from the research community. This is because facial images have a variety of facial traits that can convey information about an individual’s age, gender, emotions, and physical health. These types of information are known to play a key role both in the description of individuals and in social communication. In particular, age is one of the most fundamental attributes that affect our daily social interactions. Automatic age estimation from face images has therefore become a significant task in numerous applications of artificial intelligence. Despite the huge advances in automatic age estimation from face images in recent years, it remains a challenging problem. This is because of the large variations in facial appearance that result from a number of different factors, including genetic traits, lifestyle, facial expressions, and aging. On the other hand, cyber interactions are related to how users interact with each other through social media websites, such as Facebook, Twitter, Instagram, and Flickr. Most social networks allow users to create and share content and interact with other user-generated content in different forms (e.g., by viewing, liking, or commenting). This results in massive amounts of social content that provide information about users’ interests, opinions, daily activities, and interactions. Because of the explosive growth of social media content and the interactive online behavior of users, only a limited amount of social media content attracts a great deal of user attention and becomes popular, while the vast majority of content is completely ignored. Among the different types of content generated by users on social media, images have become important media for communication between users, resulting in variations in the number of views they receive or their social popularity. This phenomenon has attracted researchers from the computer vision and multimedia domains to explore the reasons why certain photos are considered popular and how to predict their popularity automatically. However, it is still difficult to measure, predict, or even define image popularity on social media because it is based on a user’s preferences and many other factors that could affect users’ social interactions on social media websites and lead to the popularity of content. To this end, this dissertation proposes a framework for understanding social interaction in the real and online worlds to address these challenges.

First, this dissertation addresses the problem of automatic age estimation from facial images. The conventional methods for facial age estimation normally determine the age of a person directly from his/her facial image by analyzing some facial information (e.g., nose, mouth, eyes, etc.). This means only the input image is utilized to estimate the person’s age. However, telling someone’s precise age at a glance without any reference information is essentially a challenging task even for humans. To address this problem, and inspired by human cognitive processes, this dissertation proposes a comparative deep learning framework that estimates the age from the facial image by comparing the input image with a set of selected reference images (labeled baseline samples) to determine whether the input face is younger or older than each of the baseline samples. A specific deep learning architecture, namely a region-convolutional neural network (R-CNN), is used to extract facial information from both the input image and the baseline samples. Then, an energy function is exploited to aggregate the extracted information from the fully connected layer in order to estimate age comparisons. This results in a set of hints where each hint represents a comparative relationship (younger or older). Finally, the estimation stage aggregates all the hints and then counts the votes that the hints give to each label. The age of the input person is estimated as the label that received the most votes. The experimental results on the FG-NET, MORPH, and IoG databases demonstrate that the proposed model outperforms the state-of-the-art methods, with relative improvements of 13.24% (on FG-NET) and 23.20% (on MORPH) in terms of mean absolute error, and 4.74% (on IoG) in terms of age group classification accuracy.

Second, this dissertation addresses the problem of image popularity prediction on social media websites. With an increasing number of social networks such as Flickr and Facebook, users often interact with each other by sharing photos of their daily lives. Although billions of images are uploaded to the internet every minute, only a few of these images receive millions of views and become popular, while others are completely ignored. Even different images posted by the same user receive different numbers of views. This raises the problem of image popularity prediction, which has become a key challenge in social media analytics, as it offers opportunities to reveal individual preferences and public attention. However, the challenge remains to investigate the crucial factors that influence image popularity, as well as to model and predict the evolution of image popularity on social media. To this end, this dissertation proposes a multimodal deep learning model that predicts the popularity of images on social media by using various types of visual and social features that are associated with image popularity. The proposed model uses two dedicated CNNs to learn high-level representations separately from the input features and then merges them into a unified network for popularity prediction. The performance of the model was evaluated by performing a series of experiments on a real-world dataset from Flickr. The evaluation results reveal that the proposed prediction model outperforms four traditional machine learning schemes, two CNN-based models, and other state-of-the-art methods, with a relative performance improvement of more than 2.33%, 7.59%, and 14.16% in terms of the Spearman rank correlation coefficient, mean absolute error, and mean squared error, respectively.
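The estimation stage summarized in the abstract (pairwise younger/older hints followed by a vote over age labels) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the dissertation's implementation: the hints are taken as given instead of being produced by the R-CNN and energy function, each hint is assumed to cast one vote for every label consistent with it, and ties are broken by taking the middle of the tied labels. The names `vote_age`, `hints`, and `labels` are hypothetical.

```python
def vote_age(hints, labels):
    """Estimate an age label from comparative hints by voting.

    hints  : iterable of (baseline_age, relation) pairs, where relation is
             "older" (input looks older than the baseline sample) or
             "younger" (input looks younger than the baseline sample).
    labels : iterable of candidate age labels.

    Assumption: a hint votes for every label consistent with it, and ties
    are resolved by returning the middle label of the tied group.
    """
    votes = {label: 0 for label in labels}
    for baseline_age, relation in hints:
        for label in votes:
            if relation == "older" and label > baseline_age:
                votes[label] += 1
            elif relation == "younger" and label < baseline_age:
                votes[label] += 1
    top = max(votes.values())
    winners = sorted(label for label, count in votes.items() if count == top)
    return winners[len(winners) // 2]


# Three hypothetical comparisons: older than 20, older than 30, younger than 40.
hints = [(20, "older"), (30, "older"), (40, "younger")]
print(vote_age(hints, labels=range(0, 81)))  # 35
```

In this toy run the labels 31 to 39 each collect all three votes, so the sketch returns their midpoint. A denser set of baseline samples narrows the tied range, which is one intuitive reason for comparing against many references.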

Table of contents

Declaration
Acknowledgments
中文摘要 (Abstract in Chinese)
Abstract
List of figures
List of tables

1 Introduction
  1.1 Background of the Study
    1.1.1 Social Interaction and Facial Age Estimation
    1.1.2 Social Interaction across Social Media and Popularity Prediction
  1.2 Motivation
  1.3 Contribution
  1.4 Dissertation Organization

2 Literature Review
  2.1 Introduction
  2.2 Human Age Estimation from Face Images
    2.2.1 Aging-Related Facial Feature Extraction
    2.2.2 Age Estimation Techniques
  2.3 Image Popularity Prediction on Social Media
    2.3.1 Features Influencing the Image Popularity
    2.3.2 Prediction Models

3 Comparative Deep Learning Framework for Facial Age Estimation
  3.1 Overview
  3.2 Introduction
  3.3 Proposed Method: CRCNN Framework
    3.3.1 Preliminary Definitions
    3.3.2 Overview of Our CRCNN Framework
    3.3.3 CRCNN Formulations
    3.3.4 Learning Method for the Comparative Stage
  3.4 Experimental Results and Discussions
    3.4.1 Experimental Setup
    3.4.2 Optimization of Our CRCNN Framework
    3.4.3 Discussions and Comparisons with State-of-the-art Methods
  3.5 Summary

4 Multimodal Deep Learning Framework for Image Popularity Prediction on Social Media
  4.1 Overview
  4.2 Introduction
  4.3 Features
    4.3.1 Visual Content Features
    4.3.2 Social Context Features
  4.4 Methodology
    4.4.1 Overview of Proposed Framework
    4.4.2 Training the VSCNN Model
    4.4.3 Baseline Models
  4.5 Experiments and Results
    4.5.1 Experimental Setup
    4.5.2 Results
  4.6 Summary

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work

References

Appendix A Publications

List of figures

3.1 Schematic diagram of (a,c) the conventional paradigm for facial age estimation by learning the age information from a facial image directly, and (b,d) the proposed paradigm by aggregating the comparisons of a facial image with baseline samples to determine the age in a comparative manner.
3.2 Generation of a set of the hints (for simplicity, five labels are employed).
3.3 Optimization of our CRCNN approach: Performance for the different settings of the deep architecture’s parameters.
3.4 Optimization of our CRCNN approach: Sensitivity of the deep architecture’s parameters.
4.1 The plots of deep learning feature vector values of different images from the dataset.
4.2 Diagram of the proposed framework for image popularity prediction. (a) Feature extraction, and (b) Proposed VSCNN regression model.
4.3 Structure of the VCNN model.
4.4 Structure of the SCNN model.
4.5 Sample images from the dataset. The popularity of the images is sorted from more popular (left) to less popular (right).
4.6 Quality evaluation of the VSCNN model. (a) Error distribution histogram of the model, and (b) scatterplot of true values (x-axis) versus predicted values (y-axis).
4.7 A distribution of the view counts of the training samples.
4.8 Examples of correct and wrong predictions of some images from our dataset using the VSCNN model. The actual popularity score and its corresponding predicted score are displayed below each image.
4.9 Diagrams of the predicted values obtained using the CNN-based baseline models and their corresponding ground truth values. (a) SCNN, and (b) VCNN.
4.10 Diagrams of the predicted values obtained using the four machine learning baseline models and their corresponding ground truth values. (a) LR, (b) SVR, (c) DTR, and (d) GBDT.
4.11 Best prediction performances for all the models in terms of Spearman’s Rho, MAE, and MSE metrics.

List of tables

3.1 Optimized setting of our CRCNN method.
3.2 Comparison with state-of-the-art methods on FG-NET and MORPH databases.
3.3 Comparison with state-of-the-art methods on IoG database.
4.1 Spearman’s Rho values for the correlation of user features with popularity score.
4.2 Spearman’s Rho values for the correlation of post metadata features with popularity score.
4.3 Configuration of the VSCNN Model.
4.4 Performance comparison of SCNN, VCNN, and VSCNN models.
4.5 Performance comparison of LR, SVR, DTR, GBDT, and VSCNN models.
4.6 Comparison with the state-of-the-art methods on SMP-T1 dataset.
4.7 Performance comparison of VSCNN and VSCNN-EF models.

Chapter 1

Introduction

1.1 Background of the Study

Humans have the capability to express and understand the social signals (SSs) that are created during social interactions, such as agreement, disagreement, conflict, empathy, politeness, hostility, and any other way of behaving towards others that cannot be expressed by words alone but is conveyed through nonverbal behavior. They also have the ability to manage these signals in order to get along well with others. This range of abilities is called social intelligence, which is an aspect of human intelligence and the most essential indicator of success in life. Therefore, understanding how people can easily interact with the world and with each other is considered one of the greatest scientific problems in the field of social signal processing (SSP). SSP is a new research field that aims at providing computers with social intelligence that enables them to adapt and work properly in social settings. In particular, this field focuses on how machines can participate in social interactions by automatically modeling, analyzing, and synthesizing many of the nonverbal behavioral cues that people utilize to express socially relevant information or social signals [1]. Nonverbal communication plays an important role in our daily life: humans use nonverbal behavioral cues (e.g., posture, interpersonal distance, facial expressions, gestures, etc.) that they can easily sense with their eyes and ears to recognize human social signals, understand the social behavior of others, and then interact with them accordingly. Thus, the fundamental idea of SSP is that these kinds of cues can be detected with microphones, cameras, and any other suitable sensor, and that they can be used as machine-detectable evidence for the automatic analysis and understanding of social behavior. This implies that SSP will bring computing closer to human-centered approaches that effectively deal with the psychological and behavioral responses natural to humans. This will have a major impact on various domains of computing technology: human-computer interaction, because interfaces will become more adept at social interaction with users [2]; multimedia content analysis, because content will be analyzed on the basis of how humans perceive the reality around them [3]; computer-mediated communication, because the transmission will include the social cues necessary for establishing natural contact with others [4]; and any other domain where computers must integrate seamlessly into the lives of people. All potential nonverbal behavioral cues occurring in social interactions have been grouped by psychologists into five main categories, referred to as codes [5]. These five codes are gestures and postures, vocal behavior, space and environment, physical appearance, and face and eye behavior.
Of these, face and eye behavior is a critical code, as the face is our most direct and naturally preeminent means of communicating and of comprehending someone’s affective state and intentions from facial expressions [6]. In addition, faces convey information about the age, gender, health, personality, attractiveness, and emotions of an individual, all of which are useful sources in social signal processing [7]. This implies that facial behavior plays a key role in shaping perceptions during social interactions [7–9]. For example, age is one of the most essential signals that can be derived from the human face and is considered an important factor in interpersonal communication and interaction in our social life, as our perception of the age of our interlocutors helps us to decide how to interact with them. Beyond daily life, the capability to estimate age is helpful in more specific contexts, such as police testimony or the sale of products that are restricted to buyers above a certain age. However, the human capacity for age estimation is usually not as strong as for estimating other facial traits. Therefore, developing automatic facial age estimation systems that are comparable or even superior to human ability has become an attractive and challenging subject of research in recent years. Consequently, the first problem that we address in this dissertation is how to develop an automatic facial age estimation system, based on the basic concepts that humans use to estimate age, that can accurately predict a person’s age from his/her facial image. On the other hand, the impact of technology, particularly social media, on human life has caused enormous changes in human behavior. These changes span numerous areas of human interaction, influencing the way people communicate, interact, work, think, do business, act, and react.
We can simply say that social media has extensively influenced every facet of human life. The most important change in people’s behavior after the emergence of online social networks is the way they interact and the reach of that interaction. Thousands of millions of users create and share vast amounts of content on online social networks every day, and interact with each other irrespective of time and location. The content created by users on social media provides information about users as well as their living environments, allowing us to access a user’s preferences, opinions, and interactions. Thus, analyzing this content provides an opportunity to comprehend human behavior and can also be employed to improve the user services provided by these networks.

Simultaneously, having more connections on online social networks brings people more attention and visibility, which is called popularity on social media. Popularity is measured by the number of fans, followers, friends, retweets, likes, or any other metric used to calculate engagement, depending on the type of social network. The interactions and reactions of users to posted content play a fundamental role in information diffusion and in the popularity of content on social media [10, 11]. Once content is posted on a social network, it attracts a different amount of user interaction depending on its importance, subject matter, the publisher’s credibility, the time of publishing, etc. [12, 13]. Meanwhile, some content succeeds in attracting more user interactions and becoming popular [14]. The popularity of content is generally measured by different metrics, such as the number of views, shares, likes, comments, etc. Predicting the popularity of content on social media (which can be text, audio, video, or image) has become a significant research topic, as it provides an opportunity to understand how users interact with online content and how information propagates across social networks.
Therefore, the second problem that we address in this dissertation is the analysis of the popularity of content on social media and, more specifically, visual content

such as images, to first explore the factors that can affect the social interactions of users on social media websites and lead to content popularity. Second, we design an efficient prediction model that can accurately predict this popularity.

1.1.1. Social Interaction and Facial Age Estimation

Age is one of the most significant elements of face-to-face interaction because the age of our interlocutor strongly determines the way we interact with him/her [15]. In almost all cultures, people interact differently with younger and older people. For example, some studies revealed that youths tend to talk more slowly and loudly to older people [16]. It is therefore normal in daily life to estimate the age of people in order to interact with them in a proper manner. Humans can observe aging-related traits on faces, which helps them to predict the age of other individuals only by looking at their faces. However, researchers who have studied the process of age estimation by humans conclude that humans are not very precise in age estimation [17]. The main explanation for this is that different people of the same age can have different facial appearances due to varying rates of facial aging [18, 19]. Therefore, several automatic facial age estimation methods have been developed to compensate for this human weakness in age estimation.

Although the automatic estimation of human age from face images has recently received significant attention, it remains a challenging problem. In particular, there are many reasons that make automatic age estimation a nontrivial task. First, the aging process is uncontrollable and is influenced not only by the genetic traits of a person but also by several external

factors, such as dietary habits, environmental conditions, lifestyle, living location, and health status [20]. Second, the gender of a person can have a significant effect on age estimation. Recent studies on facial aging have shown that the aging process differs in some respects between males and females [21, 22], including the appearance of facial hair like beards, increased thickness, facial vascularity, hormonal effects, and possible variations in fat and bone absorption rates throughout the life cycle. For example, the development of deeper wrinkles around the perioral area is higher in women than in men, as their skin has fewer appendages compared to men [23]. Third, from a technical point of view, males and females may have different discriminative facial aging features shown in images because of the different extents to which they use makeup, cosmetic surgery, and accessories [19, 24]. For example, many photos of female faces are likely to show a younger appearance than the subjects’ actual ages [25]. Therefore, the extraction of general discriminative features for age estimation while decreasing the negative impact of individual differences remains an open problem. Fourth, the burden of obtaining large-scale databases that cover a sufficient age range with chronological face aging images makes it harder to perform the estimation tasks [26]. Although web image mining can assist in the data collection process [27], it is usually difficult or even impractical to compile a large database for a large number of subjects who can supply a sequence of personal photos at different ages. Finally, the age estimation process is often affected by certain imaging conditions of face images, such as the variation of head pose, blur, illumination, expression, and occlusion. It is also influenced by camouflage caused by beards, moustaches, glasses, and makeup in the face images.

Automatic age estimation involves automatically labeling a face image with the exact age (year) or the age group (year range) of the human face. Traditional computer vision methods for age estimation from face images rely on the extraction of certain handcrafted features that are carefully designed to represent the aging information and subsequently use these features to train a classification/regression machine learning model to predict the age of the face [28, 29, 18, 19, 24, 30]. These methods achieve relatively good results if they can effectively extract the most relevant features of aging. This means that the performance of these methods relies heavily on feature engineering, which is time-consuming, costs a lot of human effort, and requires expert knowledge. In addition, the model can be brittle if the selected features are not appropriate for the age estimation task. Deep learning-based methods take a different approach from handcrafted methods. With deep learning, the model is left to automatically extract and learn appropriate features related to age by feeding it with several samples of facial images during the training process. The features of the model are then iteratively (and automatically) tuned using the error between the initial output of the model and the desired (real) output. The convolutional neural network (CNN) is one of the most important deep learning models that has been successfully used in face analysis tasks, especially age estimation. This is because a CNN can efficiently capture high-level complex age-related visual features from raw input face images without any handcrafting and adapts robustly to noise in the image, which tends to make the final age estimate more accurate.
Deep learning schemes and CNNs have therefore been used in recent studies on age estimation and have demonstrated superior performance compared to other traditional methods [31–35]. However,

their performances depend on the training efficiency of the deep architecture and the appropriate choice of deep learning parameters, which are difficult to achieve, especially because of the ill-conditioning problem of the deep neural network [36].

Most of the existing methods for facial age estimation so far rely on the biometric features extracted from the input facial image to estimate the person’s age. This indicates that only the input image is used to estimate the age of a person. However, telling someone’s precise age at a glance without any reference information is difficult even for humans. In practice, humans commonly infer the age of a person by learning to create links between a known age and the corresponding facial cues of a person. They then take the learnt information as a reference to judge whether an unseen face is younger or older than the reference. The accuracy of age estimation for an unseen face increases as the number of available references increases. Thus, according to human cognitive processes [37], a more robust way of estimating facial age is likely to be a comparative one, that is, learning from a number of comparative relationships (a given face is younger or older than another face of known age). As a result, there is an increasing demand for the development of an automatic facial age estimation algorithm that simulates the process of age perception by humans and outperforms the human ability in age estimation. Therefore, motivated by the human perception of age and the strength of CNNs in extracting effective and discriminative aging features from face images, this dissertation proposes an automatic facial age estimation system known as comparative region convolutional neural network (CRCNN) that can efficiently estimate the age of a person by using the input face image information in addition to reference information

obtained from some reference face images with known ages, while also addressing the limitations of the abovementioned methods in age estimation.

1.1.2. Social Interaction across Social Media and Popularity Prediction

Social interaction is steeply increasing through online social network websites. The rapid development of social networks has offered a variety of features to facilitate socialization on the internet. Users on these platforms can interact with each other by creating and sharing various forms of content, such as text, photos, audio, and videos. This has contributed to explosive growth in social media content and has intensified the online competition for users’ attention because only a small amount of social content receives the most attention and becomes popular, while the rest goes completely unnoticed. The popularity of social media content reflects people’s interests and provides opportunities to comprehend how users interact with online content and how information is disseminated through social media websites, which have a profound impact on social, economic, and governmental activities. Thus, modeling and predicting the popularity of social media content has become a significant research subject in social media analytics, and an essential task for supporting the design and assessment of a wide range of systems, from targeted advertising to effective search and recommendation services.

Images are one of the main forms of visual content posted by users on social networks and have become an important medium for communication among them. The explosive growth in the number of images posted on social media and interactive behaviors between web users results

in a variation in the number of views that these images receive or their popularity on social networks. Thus, this interesting phenomenon has attracted the research community to explore the factors that make certain images more popular than others, and how to automatically predict their popularity on social media. However, predicting the popularity of images on social media is a nontrivial and challenging task. The main difficulty lies in the fact that image popularity can be affected by different factors and features, such as visual content, text content, aesthetic quality, user, and time, which are intertwined during the cascade process. In addition, it is nontrivial to build a regression model that can integrate and process the various features contributing to image popularity and accurately predict it.

Recent studies have designed different types of features that evidently influence the volume of image popularity [38–41]. However, most of these studies rely only on some useful features (e.g., visual content, social context, and post metadata) for image popularity prediction, and ignore interactions between other valuable features (e.g., time and aesthetic quality). For example, users prefer to browse social media sites during a specific time of the day, such as weekend leisure time, which means that images posted at that time are more likely to receive a large number of views and become popular. Similarly, an image with a high aesthetic quality usually attracts the user’s attention and obtains a higher number of views. Thus, in addition to visual content, social context, and post metadata features, time and aesthetic features are also essential for accurately predicting image popularity. With regard to predictive models, existing works mostly use simple machine learning models to predict image popularity [38, 39, 42, 43].
Although these models have achieved satisfactory prediction accuracy, they are not sufficiently powerful to capture and extract high-level representations

from the various types of raw features associated with image popularity, which consequently affects the predictive accuracy of these models. In addition, these models require both time and skill to fine-tune their hyperparameters. This implies that a predictive model that can effectively handle the multimodal information contributing to image popularity and accurately predict it is highly desirable. Therefore, this dissertation proposes a multimodal deep learning system, called visual-social convolutional neural network (VSCNN), to address image popularity prediction on social media in an efficient way; that is, the proposed VSCNN system learns effective and high-level representations from various visual and social features that significantly influence image popularity and precisely predicts it.

1.2. Motivation

In the real world, age estimation is a skill that we use in everyday life, and it also has an important influence on our daily social interactions. Several automatic age estimation systems are designed to estimate a person’s age from his/her face image, as the estimation of age by humans is not as easy as determining other facial information (e.g., gender, identity, or expression). Although these systems have achieved promising results, the problem of age estimation is far from being solved. The major difficulty lies in how to design aging features that remain discriminative despite the significant variations in facial image appearance. This implies that the automatic estimation of human age from face images is a well-established and challenging problem. In addition, the automatic human age estimation using

facial image analysis has numerous potential real-world applications. These applications include:

(i) Security system access control: With an increasing number of crimes and terrorist threats, security control systems have become increasingly important in our daily lives. With the help of a monitoring camera, an automatic age estimation system can be used in the surveillance of bars as well as alcohol and cigarette vending machines to stop under-aged people from entering bars or wine stores and to prevent them from purchasing alcoholic drinks or cigarettes [44]. Age estimation can also be used to deny children access to websites with unsuitable materials or restricted movies [45, 18]. In addition, age estimation can play an important role in controlling money transfer fraud from ATMs by monitoring a specific age group that the police have found to be more prone to fraud [46].

(ii) Age-specific human-computer interaction: Individuals belonging to different age categories have various criteria and demands related to the way they interact with computers. If an automated age estimator is used to determine the age of a computer user, both the computing environment and the user interface could be adjusted automatically to meet the needs of his/her age group [47, 48]. For example, interfaces based on colorful icons with appropriate illustrations can be activated when dealing with young kids, while interfaces based on icons with titles written in large fonts can be activated for older users.

(iii) Development of automatic age progression systems: Automatic age progression systems have the ability to simulate aging effects on new face images to predict what the person might look like in the future, or what he/she looked like in the past. Because automatic

facial age estimation systems depend on their ability to comprehend and categorize changes in facial appearance because of aging, the methodology needed for this task could form the basis for designing automatic age progression systems [29]. In addition, age progression algorithms often require information related to the current age of an individual, and this emphasizes the essential role of facial age estimation systems in the development of automatic age progression systems.

(iv) Electronic customer relationship management (ECRM): ECRM uses modern internet-based technologies, such as chat rooms, blogs, emails, forums, and web sites, to efficiently manage the distinguished relationships with customers and communicate with them individually [46, 49]. As customers come from different age groups, they may have varied consumption patterns, preferences, and expectations for the products. Accordingly, automatic age estimation can be used by companies for monitoring market tendencies and customizing their products and services to satisfy the needs and desires of clients in various age groups. The issue here is how substantive personal information from all customers’ age groups can be obtained and analyzed without infringing their privacy rights. However, a camera capturing pictures of the clients can collect demographic data by snapping the face images of clients and automatically estimating their age groups using an automated age estimation system. All of this can be done without violating anyone’s privacy.

(v) Biometrics: Age estimation is a kind of soft biometrics that provides additional information about users’ identity [50, 51]. It can be utilized to supplement the main biometric features, such as the face, fingerprints, iris, voice, and hand geometry, to enhance the performance and effectiveness of a hard (primary) biometrics system. For example, the

system in real face recognition or identification applications often needs to recognize or identify faces after a gap that has lasted for many years (e.g., passport renewal or border security), which highlights the importance of age synthesis [52–54]. With the help of a dynamic aging model, the facial recognition system can dynamically fine-tune its parameters by taking into consideration the differences in face structure or skin texture during the aging process. As a result, the efficiency of the system in the time gap could be substantially improved [55]. All the aforementioned application areas of automated age estimation imply the need for developing more precise age estimation systems.

In the online world, social media websites have been designed with the aim of facilitating and increasing social interactions among people on the internet. Put simply, social media websites have altered the way we live and interact with one another. In addition, social media platforms have democratized the process of creating web content, allowing ordinary users to become creators and distributors of content. However, this has also led to massive growth in social content and has intensified the online competition for users’ attention. This is because the interactive behavior of web users often makes some of the content published on social media more popular than the rest. Therefore, there is a growing research interest in modeling and predicting the popularity of social media content [56, 57]. Predicting the popularity of social media content, especially visual content such as images, can help us understand public interest and attention behind user interactions. It can also facilitate several practical applications, such as online advertising [58, 59], online marketing, network dimensioning (e.g., caching and replication) [56], content retrieval [60], and politics.
For example, in the case of online advertising, advertisers would like to be able to predict the number of

views that a specific advertisement might produce on a particular website. Thus, if the popularity count is directly related to advertisement profits (such as with advertisements shown with YouTube videos), profits may be fairly precisely estimated ahead of time if all parties know how many views the video is expected to receive. In addition, in online marketing, the popularity prediction of a given product on a marketing company website provides a great opportunity for the company to make more strategic decisions, such as better managing its resources and more effectively targeting its ads. In general, the customer benefits from a more pleasurable experience and the company benefits from a monetary saving or gain. However, popularity prediction is not an easy task because of the difficulty in modeling and exploring the various factors that contribute to the popularity of social media content and, more specifically, image popularity. In addition, the popularity of different social media content co-evolves over time, and this evolution may be described by complex online interactions and information cascades that are difficult to predict at the microscopic level [61–63]. This implies that developing a popularity prediction system that can accurately predict the popularity of social media content is challenging and an active area of research.

1.3. Contribution

In this dissertation, we explore and automatically estimate one of the factors that influence social interactions in the real world, namely facial age, as it is considered one of the most important factors in interpersonal communication and interaction in our social life. Furthermore, it is known that human behavior, preferences, and interactions are different at different ages, which indicates vast potential applications of automatic age estimation. Simultaneously, we study the social interactions of users with online content by exploring the factors that can affect the social interactions of users on social media websites, more specifically on the Flickr site, and lead to content popularity. Then, we design an efficient prediction model that can accurately predict this popularity. Such predictions can help improve user experience and service effectiveness. The main contributions of this dissertation are as follows:

• Motivated by the human cognitive process and the strength of CNNs in extracting effective and discriminative aging features from face images, we propose a novel comparative deep learning framework for facial age estimation, called comparative region convolutional neural network (CRCNN). In the proposed CRCNN framework, not only is the input face image used, but several other reference face images of known age are also taken as baseline samples to compare with the input face image. The advantage of this comparative approach is that, in addition to the input face image information, some other side information obtained from the baseline samples can be exploited to boost the estimation task, leading to a more accurate estimation. In addition, instead of using classical deep learning models, the region-convolutional neural network (R-CNN) is exploited to account for the spatial context of facial regions. Moreover, the method of auxiliary coordinates (MAC) is incorporated in the training process of our framework to reduce the ill-conditioning problem of the deep network and provide efficient optimization. The experimental results on the FG-NET, MORPH, and IoG databases demonstrate that the proposed model achieves a significant

improvement over the state-of-the-art methods, with a relative improvement of 13.24% (on FG-NET) and 23.20% (on MORPH) in terms of mean absolute error, and 4.74% (on IoG) in terms of age group classification accuracy.

• Motivated by multimodal learning approaches, which use information from various modalities, and the current success of convolutional neural networks (CNNs) in processing data from different modalities, we propose a multimodal deep learning framework for image popularity prediction on social media, called visual-social convolutional neural network (VSCNN). The proposed VSCNN framework uses dedicated CNNs for separately learning high-level representations from different types of features associated with image popularity, including multi-level visual, deep learning, social context, and time features, and then fuses them using a merged layer into a unified network for further processing and obtaining the final prediction. The fusion process in our model becomes easy to execute and does not suffer from the data representation problem because the semantic vectors resulting from the dedicated CNN models usually have the same form of data. Furthermore, the robust interpretation of incomplete and inconsistent multimodal input becomes more reliable at later stages because more semantic knowledge becomes available from various sources, boosting the prediction process. The simulation results demonstrate that the proposed VSCNN model significantly outperforms state-of-the-art models, with a relative improvement of more than 2.33%, 7.59%, and 14.16% in terms of the Spearman rank correlation coefficient, mean absolute error, and mean squared error, respectively.

1.4. Dissertation Organization

Chapter 1 provides background information and a general introduction to social interaction in the real world and on social media websites. It also defines the research problem and sub-problems addressed in this dissertation. Chapter 2 provides an extensive literature review regarding facial age estimation systems. It also offers a comprehensive literature review related to image popularity prediction techniques. In Chapter 3, we propose and develop a comparative deep learning framework that accurately estimates the facial age based on human cognitive processes. In Chapter 4, we propose and design a multimodal deep learning framework that predicts the image popularity on social media in an efficient way by combining several multimodal features that significantly influence image popularity. Chapter 5 summarizes this dissertation, highlights its research contribution, and provides an insight for future work.

Chapter 2
Literature Review

2.1. Introduction

This chapter first reviews related work on automatic facial age estimation. Specifically, it reviews the literature on several descriptors used to extract and represent aging-related features from facial images, as well as the state-of-the-art techniques developed for age estimation from face images. The chapter then reviews related work on image popularity prediction on social media. Specifically, it reviews the literature on many types of features that significantly affect image popularity, as well as the state-of-the-art prediction models designed to predict image popularity.

2.2. Human Age Estimation from Face Images

Current age estimation systems that use face images usually comprise two concatenated stages: aging-related facial feature extraction and age estimation. Thus, we review the literature on age estimation according to these two stages.

2.2.1. Aging-Related Facial Feature Extraction

Most previous studies on facial age estimation focused on the extraction and fusion of different types of facial features. For example, Choi et al. [64] compared the performances of various methods (e.g., Sobel filter, difference image between original and smoothed image, ideal high pass filter (IHPF), Gaussian high pass filter (GHPF), Haar and Daubechies discrete wavelet transform (DWT)) for extracting local features that can be used for detailed age estimation. Both [65] and [66] combined global and local features (e.g., active appearance models (AAM), Gabor filters, local binary patterns (LBP), Gabor wavelets (GW), and local phase quantization (LPQ)) to form hybrid features in order to have a better facial aging representation. Furthermore, Huerta et al. [67] used a fusion of textural and local appearance-based descriptors to achieve faster and more accurate results. Guo et al. [68] proposed the use of canonical correlation analysis (CCA) for jointly estimating the age with other facial information such as gender.

Meanwhile, other studies concentrate on extracting new features specially designed to estimate age [69, 70]. For example, Guo et al. [69] proposed using biologically inspired features (BIF) for estimating human age from faces, which are generated based on a pyramid

of Gabor filters. Geng et al. [70] introduced an approach named AGing pattErn Subspace (AGES) for age estimation. In the AGES approach, they model the aging process using an aging pattern, which is defined as a sequence of face images of the same person at different ages, sorted in time order. For encoding the face images, the AAM [71, 72] is used to extract the feature vectors that represent the face images in an aging pattern, indicating that the extracted feature combines both the shape and the intensity of the face images. In [73], the aging face was represented by integrating AAM, LBP, and Gabor features extracted from the face image. Furthermore, Suo et al. [74] proposed four graphical facial features, that is, topology, geometry, photometry, and configuration, based on their recently developed multi-resolution hierarchical face model [75]. Guided by this hierarchical model, instead of densely pursuing all filters over the image lattice, they applied particular filters to various parameters at different levels to extract these four types of features for age estimation. The authors in [76] combined the LBP histogram features with the main components of BIF, the shape and texture features of AAM, and the projection of the original image pixels onto a principal component analysis (PCA) subspace to represent the aging of the face image. Both [77] and [67] used the histogram of oriented gradients (HOG) [78] to represent facial features.

Recent studies also proposed using high-level complex age-related visual features extracted with deep learning techniques such as CNNs for automatic age estimation [32, 79, 80]. These studies demonstrated that high-level semantic features designed based on deep neural network architectures usually perform better than hand-crafted features.
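Among the handcrafted descriptors reviewed in this section, the local binary pattern (LBP) is simple enough to sketch in a few lines. The following Python code is a minimal illustration of the basic 3×3 LBP operator and its normalized histogram, assuming a grayscale image stored as a list of rows. It is not the implementation used in any of the cited studies, and the function names are hypothetical.

```python
# Clockwise 8-neighbourhood, starting at the top-left neighbour.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Basic LBP code of the interior pixel (r, c).

    Each neighbour whose intensity is >= the centre intensity sets one bit,
    giving an integer in the range 0..255.
    """
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * 256  # image too small to have interior pixels
    return [h / total for h in hist]
```

In a handcrafted pipeline of the kind described above, such a histogram (often computed per facial region and concatenated) would serve as the feature vector passed to a classifier or regressor.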

2.2.2. Age Estimation Techniques

After extracting and representing aging features, the subsequent step is to estimate the age. Age estimation can be considered a particular pattern recognition task. It can be approached as a multi-class classification problem when each age label is viewed as a class [45, 81, 82]. On the other hand, age estimation can be considered a regression problem when age labels are viewed as a sequential chronological series [24, 83, 84]. Thus, age estimation techniques have been divided into the following two categories: a) classification-based methods; and b) regression-based methods.

Regarding the classification-based methods, the existing studies have introduced several types of classification models. For instance, Gao and Ai [48] used a fuzzy version of a classifier known as linear discriminant analysis (LDA) [85] to classify the face image as one of the following four coarse categories: baby, child, adult, and old. Lanitis et al. [45] presented a quantitative evaluation of the performance of different classifiers for automatic age estimation, including the nearest neighbor classifier, artificial neural networks (ANN), and a quadratic function classifier. Ueki et al. [81] formulated age estimation as an 11-class classification problem for the Waseda human-computer Interaction Technology-DataBase (WIT-DB), which has 11 registered age groups. They first built eleven Gaussian models, one for each of the 11 age groups, in a low-dimensional 2DLDA+LDA feature space using the expectation-maximization (EM) algorithm. The age-group classification is then determined by fitting the test image to each Gaussian model and comparing the likelihoods.
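The likelihood-comparison step described above can be illustrated with a simplified sketch. The Python code below fits one univariate Gaussian per age group from direct maximum-likelihood estimates and assigns a test feature to the group with the highest log-likelihood. It is a toy under stated assumptions: a scalar feature stands in for the low-dimensional feature space, the closed-form fit replaces the EM procedure, and the group names and function names are invented for illustration.

```python
import math


def fit_gaussians(features_by_group):
    """Fit a (mean, variance) pair per age group from scalar features.

    Uses closed-form maximum-likelihood estimates; the variance is floored
    to avoid a degenerate model when all samples of a group coincide.
    """
    models = {}
    for group, values in features_by_group.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        models[group] = (mean, max(var, 1e-6))
    return models


def log_likelihood(x, mean, var):
    """Log-density of a univariate Gaussian evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def classify(x, models):
    """Return the age group whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda g: log_likelihood(x, *models[g]))
```

A hypothetical use would be `classify(0.25, fit_gaussians({"child": [0.1, 0.2, 0.3], "adult": [0.9, 1.0, 1.1]}))`, which selects the group whose model best explains the test feature.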

Han and Jain [86] used three different support vector machines (SVMs) to predict the age group (or exact age), gender, and race of a subject.

Treating age estimation as a regression problem, Lanitis et al. [29] examined three formulations of the aging function, namely linear, quadratic, and cubic, using 50 raw model parameters. A genetic algorithm was employed to learn the optimal model parameters from training face images of various ages. Guo et al. [19, 18] applied the support vector regression (SVR) technique to an age manifold learned with the orthogonal locality preserving projections (OLPP) method. To fit an aging manifold learned with the conformal embedding analysis (CEA) method, Fu et al. [83, 24] used a multiple linear regression function [87], which attains considerable improvements over some existing methods. Yan et al. [88] used a semidefinite programming (SDP) formulation to solve the regression problem for age estimation, in which the regressor is learned from uncertain nonnegative labels. The authors demonstrated that the SDP formulation provides much better results for age regression than the quadratic regression function and multilayer perceptrons. However, SDP is computationally very expensive, particularly when the training set is large.

Recently, several deep learning-based techniques have been used for facial age estimation. For example, Takimoto et al. [89] integrated a multilayered neural network with the adapted retinal sampling mechanism to estimate facial age. Geng et al. [90] proposed a constructive probabilistic neural network for facial age estimation based on learning from label distributions. CNNs have also been used in several recent studies on age estimation [91–94]. Niu et al. [33] used ordinal regression and a multiple-output CNN

for age estimation. The ordinal regression problem was transformed into a series of binary classification sub-problems, which were then solved by the CNN, with each output layer corresponding to one sub-problem. Chen et al. [95] proposed a cascaded classification-regression framework for estimating the apparent age from unconstrained face images using deep convolutional neural networks (DCNNs). An error-correcting mechanism is also used to correct erroneous age predictions.

In summary, all the previous works followed the conventional paradigm for facial age estimation, i.e., learning direct mappings between the extracted facial features and the associated age labels. This observation motivated us to develop our comparative approach using deep learning for estimating facial age.

Motivated by human cognitive processes [37], it is arguable that a more robust way to estimate facial age is a comparative one, that is, learning from a number of comparative relations (a given face is younger or older than another face of known age). The development of our approach was also inspired by other ranking-based methods, such as Ranking SVM [96], RankBoost [97], and RankNet [98]. Ranking SVM [96] formulates learning to rank as the problem of classifying instance pairs into two categories: correctly ranked and incorrectly ranked. Experimental results demonstrated that the algorithm performs well in practice, successfully adapting the retrieval function of a meta-search engine to the preferences of a group of users. Nevertheless, the losses (penalties) for incorrect ranking between higher and lower ranks and for incorrect ranking among lower ranks are set to be the same. This causes problems for facial age estimation, as the youngest and oldest persons have entirely different facial information.
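The pairwise view of ranking described above can be sketched as follows. The example only shows how age-labeled samples are turned into ordered pairs and how a linear scoring function is checked against them; it does not train an SVM and is not the comparative framework proposed in this thesis. The two-dimensional feature vectors, the ages, and the hand-picked weight vector are hypothetical.

```python
from itertools import combinations


def make_pairs(samples):
    """Turn (feature_vector, age) samples into (difference_vector, label) pairs.

    label is +1 if the first face is older and -1 if it is younger;
    pairs with equal ages carry no ordering information and are skipped.
    """
    pairs = []
    for (xa, age_a), (xb, age_b) in combinations(samples, 2):
        if age_a == age_b:
            continue
        diff = [a - b for a, b in zip(xa, xb)]
        label = 1 if age_a > age_b else -1
        pairs.append((diff, label))
    return pairs


def pairwise_accuracy(weights, pairs):
    """Fraction of pairs whose order is reproduced by the linear score w . x."""
    correct = 0
    for diff, label in pairs:
        score = sum(w * d for w, d in zip(weights, diff))
        if score * label > 0:
            correct += 1
    return correct / len(pairs)


# Hypothetical 2-D face features paired with ages in years.
faces = [
    ([0.2, 1.0], 8),
    ([0.5, 0.9], 25),
    ([0.9, 0.4], 60),
    ([0.5, 0.8], 25),
]
pairs = make_pairs(faces)
acc = pairwise_accuracy([1.0, 0.0], pairs)
```

Every pair contributes equally to `pairwise_accuracy`, which mirrors the uniform-penalty limitation noted above: misordering a child and a senior costs the same as misordering two faces of similar age.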

RankBoost [97] is another ranking algorithm that is trained on pairs; it is similar to our work because it attempts to solve the preference learning problem directly rather than solve an ordinal regression problem. Its results are reported using decision stumps as weak learners. The RankNet algorithm [98] is easy to train and performs well on a real-world ranking problem with large amounts of data. In addition, RankNet explores the use of a neural network formulation and proposes a probabilistic cost for training systems to learn ranking functions from pairs of training examples. In this study, a novel ranking approach is presented through the proposed comparative framework for facial age estimation. First, a set of selected references, i.e., baseline samples, is introduced into the framework to make each rank more robust. Second, the proposed age estimation model is built using deep learning, providing effective features to rank each age based on the facial information. Finally, the younger/older comparison helps provide robust ranking by learning similar facial information to estimate similar ranks; thus, the ranking is better structured.

2.3 Image Popularity Prediction on Social Media

In recent years, predicting the popularity of social media content has received substantial attention [99–102]. Regarding image popularity prediction, the related studies differ in their definition of the popularity metric (e.g., view, reshare, and comment counts); however, they all share the same basic pipeline: extracting and testing several types of features that influence popularity, followed by applying a classification or regression model

for prediction. Therefore, we review these studies by categorizing them according to the features and prediction models used.

2.3.1 Features Influencing the Image Popularity

Existing studies have primarily focused on investigating the relative effectiveness of various feature types for predicting image popularity, including social context, visual content, aesthetic, and time features. For instance, Khosla et al. [38] demonstrated that image content (e.g., gist, color histogram, texture, color patches, gradient, and deep learning features) and social cues (e.g., number of followers or number of posted images) have a significant effect on image popularity. Gelli et al. [39] employed visual sentiment features along with context and user features to predict a succinct popularity score of social media images. They demonstrated that sentiment features are correlated with popularity and have considerable predictive power when used together with context features. Cappallo et al. [40] demonstrated that latent image features can be used to predict image popularity. They explored the visual cues that determine popularity by identifying themes from both popular and unpopular images. McParlane et al. [41] performed image classification using a combination of four broad feature types, that is, image content, image context, user context, and tags, to predict whether an image will obtain a high or low number of views and comments in the future.

Compared to the aforementioned approaches, relatively few studies have examined the effect of time and aesthetic features on image popularity. For instance, Wu et al. [103] developed a new framework called multi-scale temporal decomposition to predict
