
National Chengchi University
Department of Computer Science
Master's Thesis

Discovering Community Leaders from Coauthor Network via Factorization Machines
(以分解機器為基礎之社群領袖偵測方法研究)

Student: Zhe-Li Lin (林哲立)
Advisor: Ming-Feng Tsai (蔡銘峰)

July 2016


Discovering Community Leaders from Coauthor Network via Factorization Machines

Student: Zhe-Li Lin
Advisor: Ming-Feng Tsai

A Thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science.

July 2016


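Acknowledgments

I would like to thank all the members of the lab and everyone who has helped me for learning alongside me during these years of graduate school. This is an indispensable memory, and it made both my studies and my life very fulfilling. Thank you.

Zhe-Li Lin
Department of Computer Science, National Chengchi University
2016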

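Discovering Community Leaders from Coauthor Network via Factorization Machines

Chinese Abstract

This thesis proposes a method that analyzes influence in a social network for the purpose of community leader detection. The main goal is to understand the structure of a social network through Factorization Machines, a machine learning method. The method further captures the distribution of influence in the social network and then uses this influence analysis to find the influential leaders of the community. In past work, problems of this kind in social network analysis have usually been handled with probabilistic models. In addition, some related work uses basic graph-theoretic features, such as the nodes or edges of a graph, to help solve such problems. Although several methods for this kind of problem already exist, social networks are large and complex, and there is currently no accurate and effective machine learning method that can find community leaders. In this work, we adopt the Factorization Machine learning technique, which has not been tried in previous studies, to analyze this kind of graph problem and to find community leaders. In the proposed method, in addition to the basic network structure, information about the people and other objects in the social network can also be added to the influence analysis model as features of the Factorization Machine. Furthermore, we propose several different random sampling algorithms for matrix factorization to improve efficiency and accuracy. Finally, we conduct a number of experiments on data collected from DBLP. The experimental results show that the proposed method can still effectively find influential community leaders even in a large and sparse social network.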

Discovering Community Leaders from Coauthor Network via Factorization Machines

Abstract

This work proposes a framework based on Factorization Machines to discover community leaders in a social network. The proposed approach uses Factorization Machines to analyze the structure of a social network and then uses that structure to detect influential leaders in the network. In the literature, several studies have used probabilistic models to deal with the leader discovery problem, and others have applied simple graphical models. However, because social networks are enormous and complex, there is still no effective and efficient method for the problem. In this thesis, the traditional graph model is also applied to represent a social network, but we apply the techniques of Factorization Machines to analyze the network for discovering influential leaders. With these techniques, we can incorporate extra information about people and their related items into the analysis in a straightforward manner. Finally, we conduct experiments on a dataset collected from DBLP, a computer science bibliography database. Our experimental results suggest that the proposed approach is effective and efficient at finding social leaders in a sparse social network.


Contents

Acknowledgments
Chinese Abstract
Abstract
1 Introduction
2 Related Work
  2.1 Social Network Analysis
  2.2 Recommender Algorithm
    2.2.1 Collaborative Filtering
    2.2.2 Content-based Filtering
    2.2.3 Hybrid Algorithm
3 Methodology
  3.1 Collaborative Latent Social Influence
  3.2 Modeling Social Influence with FM
4 Experimental Results
  4.1 Experiments
    4.1.1 Dataset
    4.1.2 Experiment Setup
    4.1.3 Evaluation
  4.2 Experimental Results
  4.3 Discussion
5 Conclusions
  5.1 Conclusions
Bibliography


List of Figures

3.1 The Proposed Framework for Modeling Latent Social Influence.
3.2 An Example Input for FM.
4.1 The Word Cloud of the Top 20 Words from Dataset I.


List of Tables

4.1 Top 10 Authors in the Two Gold Standards Ranking Lists and the List Obtained from Our Best Model.
4.2 The Experimental Results. The notations ∗, †, ‡, and § denote that the result is significantly better than the four corresponding baselines #coauthor, #paper, #citation, and PageRank with p < 0.05. MAS uses the ranking list provided by Microsoft Academic Search as the gold standard; for h-index, we treat the list ranked by the authors' h-indices as the gold standard. The "−" symbol denotes the experiments still in progress, which will be provided in the final version.
4.3 Top 20 Words Learned from the FM Model with Textual Information from Dataset I.


Chapter 1 Introduction

In a social network such as Facebook or Twitter, people connect with others, including friends, colleagues, and family members. With the flourishing of social websites, the data on user interactions in social networks has become large and complex. In a social network, users interact with others by posting pieces of information such as articles and photos. The influence of these postings propagates through the connections between users, and there is usually a leader or a group of leaders with the most influence in the network; these are called community leaders. Community leaders have powerful influence, and they usually affect other people effectively and rapidly.

In the literature, several methods have been proposed to solve the leader discovery problem. Some methods use the idea of an influence chain to identify the top influential leaders. This idea is based on the observation that an influencer usually has more influential postings than others do, so an individual is regarded as an influencer according to the number of his/her chains. In [9], the 5-chain influence path was used to identify influential leaders in a small-scale network and achieved promising performance. However, the influence-chain approach is difficult to apply to a large-scale social network and is limited in its ability to incorporate other information into the model.

To overcome the limitations of previous work, we propose a framework based on Factorization Machines (FM) to automatically discover community leaders in a social network. To detect the leaders, we adopt the concept of collaborative filtering (CF), which is commonly used in the field of recommendation. The main concept of CF is to predict the interests of a user by collecting preference or taste information from many users.
Based on the CF technique, users are associated by the same or similar interests or features, and a value representing the level of each person's interests is then computed. We apply this idea to identify influential leaders in a social network. In addition, by taking advantage of FM, we can easily add the meta-information of both users and items into the model to improve performance. Specifically,

we first obtain the latent social influence of a user on an item by FM. Then, we sum up the latent social influences of the user on all items and consider the resulting value as the latent influence of this user. Finally, we rank these values for all users and can thus discover the community leaders in the social network.

In the experiments, we use data collected from DBLP (http://dblp.uni-trier.de/) to construct a social network. The collected data includes the coauthor network between authors, author profiles, and some paper meta-information. We collect two datasets: one consists of 3,662 authors and 5,122 papers, and the other consists of 152,010 papers with the same number of authors. On the basis of the coauthor relationships within the data, we first build an author-paper matrix. We additionally extract the meta-information of papers and authors, including author profiles and the textual information of papers. Then, we use FM to model the problem by considering the author network, the author profiles, and other additional textual information.

This paper proposes an FM framework to model the social influence among individuals based on their patterns of collaboration; meanwhile, due to the nature of FM, auxiliary information (such as time and textual information) can be intuitively integrated into the modeling process. Specifically, we first present an influence transformation function to build up the influence matrix of individuals based on their patterns of collaboration. This influence matrix separates the relation between a user and an item into three types: a related relation, an unrelated relation, and a relation to be predicted. Then, the influence matrix is fed into FM to obtain the social influence of each individual; in the process, additional information can be exploited to model the social influence.

Chapter 2 Related Work

2.1 Social Network Analysis

With the rapid growth of social applications, such as social networks (e.g., Twitter, Facebook) and collaboration networks (e.g., DBLP and GitHub), how to analyze and quantify social influence has become a crucial task for studying social networks and has drawn much attention due to its many important applications. Much research has therefore been conducted in this field, such as on topic influence [2], external influence [4], and indirect influence [7]. These studies analyze how people influence each other and how influence spreads through a social network. By exploiting influence, a company can market a new product by first convincing a small number of users to adopt the product and then extending the influence. Similarly, research collaborators can be regarded as affecting each other in an academic network. On social websites, e.g., Facebook and Twitter, users are very likely to follow influential friends in their social circle or to 'like' a social object such as an article or a picture. We regard this as topic-level influence, which concerns whether a user will like a topic or not.

Lu Liu et al. [2] proposed a probabilistic generative model for mining topic-level influence. The study aims at predicting whether a user will like a topic or not, and it determines the strength of the influence of a topic. The study in [4] discusses how external information spreads in a social network. It develops an information diffusion model to observe the influence of internal and external information, and it found that only about 71% of the information can be attributed to network diffusion, while the other 29% comes from external events or factors outside the network. Most studies discuss direct influence, in which two people are connected in a network.
The research in [7] used a probabilistic model to explain indirect influence, that is, the influence between two people who are not connected directly. Furthermore, there is a similar work that analyzes social influence through Factorization Machines (FM) [10] based on collaborations. This work proposed an influence matrix to represent the influence between users and items. The

influence spreads through texts or media, and a heterogeneous network can be constructed based on this relation. Based on the heterogeneous network, an influence matrix is proposed to predict the indirect relation. However, there is still a lack of a unified framework for incorporating supplementary information, such as time and textual information, into social influence analysis in a straightforward way.

2.2 Recommender Algorithm

The main purpose of our task is to model social influence and utilize it to predict social leaders. We consider the core of a recommendation system to be quite similar to our prediction task. Specifically, a recommendation algorithm attempts to recommend new items to a user based on the items this user liked before. In other words, the items are treated as important clues for understanding a user's behavior. Therefore, recommendation-based algorithms are also feasible for finding social leaders as long as we make use of the coauthor relations.

Recommender systems have become immensely prevalent in recent years and are applied in a variety of social applications. A recommender system produces a list of recommendations in one of several ways, such as collaborative or content-based filtering. We consider that recommender systems and influencer detection share several similar qualities. For example, a recommender system recommends items based on user content and item content, and so does influencer detection.

2.2.1 Collaborative Filtering

CF methods are based on collecting and analyzing a large amount of information on users' behavior and preferences, and on predicting what a user will like based on similar users. This method is based on the assumption that people's behavior or preferences in the past are the same as their behavior or preferences in the future.
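As a rough illustration of the CF idea described above, the following sketch predicts a missing rating from the most similar users. It is a minimal sketch and not the method used in this thesis: the cosine similarity, the convention that 0 means "no rating", and the toy matrix `R` are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating rows (0 means 'no rating')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_user_based(R, user, item, k=2):
    """Predict R[user][item] from the k most similar users who rated the item."""
    # Candidate neighbours: other users with an observed rating for this item.
    neighbours = [
        (cosine(R[user], R[other]), R[other][item])
        for other in range(len(R))
        if other != user and R[other][item] != 0
    ]
    neighbours.sort(key=lambda pair: -pair[0])
    top = neighbours[:k]
    total = sum(sim for sim, _ in top)
    if not top or total <= 0:
        return 0.0  # no usable neighbour: fall back to "unknown"
    # Similarity-weighted average of the neighbours' ratings.
    return sum(sim * rating for sim, rating in top) / total

# Hypothetical toy data: 3 users (rows) by 3 items (columns).
R = [
    [5, 4, 0],
    [5, 4, 2],
    [1, 0, 5],
]
prediction = predict_user_based(R, user=0, item=2, k=1)
print(prediction)
```

With `k=1`, user 0 borrows the rating of user 1, whose past ratings are almost identical, which is the "similar users behave similarly" assumption in miniature.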
Many algorithms are used to measure the similarity between users or items in recommender systems, for instance, k-nearest neighbors. The advantage of CF methods is that we do not need to know the content of users or items. For example, when recommending music to a user, we can produce a recommendation list even if we do not understand the metadata of the music. However, this method cannot work correctly when a new user arrives.

2.2.2 Content-based Filtering

Content-based filtering is based on the descriptions of items and the profiles of users. The main idea of content-based filtering is to recommend items that are similar to items

the user liked in the past. In this method, we first extract features to represent an item, for example in a vector space. Then, we create a user profile from the rated items. With this information, the system can recommend the most similar items. Because this method is based only on profiles, it does not take other people's experience into account, and the system cannot make any judgment about the quality, style, or viewpoint of an item. It also lacks any personality assessment. Because of these disadvantages, hybrid algorithms emerged, which combine CF and content-based filtering to address the problems of each.

2.2.3 Hybrid Algorithm

Most recent studies take a hybrid approach that combines CF and content-based filtering. The purpose of CF is to filter information or patterns involving collaboration among multiple people and data sources [8]. Factorization Machines (FM) is one of the state-of-the-art recommendation algorithms [5]. The algorithm has emerged as a popular technique in recommender systems because of its ability not only to simulate CF but also to incorporate auxiliary information into the model at the same time.

In FM, a user-item matrix records the ratings between users and items. Because of the sparsity of the data, most ratings are missing. To overcome this problem, FM generates an interaction matrix to represent the unknown ratings. For example, suppose user A listened to music X but did not listen to music Y. The interaction matrix can simulate the ratings between all users and all items, so FM can estimate whether user A is interested in music Y or not. Based on this framework, we can use a user-item matrix to represent a social network. We regard the ratings as the relations between users and items. For example, if user A likes or shares a post from user B, we record this relation in the user-item matrix. In other words, user B affects user A through the post.
We regard the values simulated by the interaction matrix as the latent influence. The details are described in Chapter 3.


Chapter 3 Methodology

Section 3.1 describes how to utilize the technique of CF to calculate latent social influence. In Section 3.2, we show the formulation of the social influence calculation with FM.

3.1 Collaborative Latent Social Influence

CF is a common technique adopted by recommendation systems. In this work, we attempt to model the latent social influence of people in a certain research community with this technique, which filters information or patterns involving collaboration among people.

Figure 3.1 gives an illustrative example to introduce the core idea of the proposed framework for modeling the latent social influence. Figure 3.1(a) depicts the relationships between the authors and their papers. These relationships can be transformed to the coauthor matrix in Figure 3.1(b), in which each element $x_{a_i,p_j}$ equals 1 if $a_i$ is the author of paper $p_j$, and otherwise the element is 0. We then define an influence transformation function $F(\cdot)$ to build up the influence matrix, as shown in Figure 3.1(c); this is the key step to transform the relationships in Figure 3.1(a) to the input of a standard CF algorithm. The transformation function $F(\cdot)$ can be designed variously; in this paper, $F(\cdot)$ is defined as

\[
F(x_{a_i,p_j}) =
\begin{cases}
1, & \text{if } a_i \text{ is the author of } p_j, \\
?, & \text{if } \exists\, a_k \in C_{a_i} \text{ and } a_k \text{ is the author of } p_j, \\
0, & \text{otherwise,}
\end{cases}
\tag{3.1}
\]

where $C_{a_i}$ is the set of the authors who have coauthored with author $a_i$. After the transformation, we can obtain the resulting latent influence matrix in Figure 3.1(d) via any CF algorithm. (Here, standard matrix factorization is adopted to calculate the resulting matrix.)
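The transformation in Equation (3.1) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the coauthor matrix is the one read from Figure 3.1(b), `None` is our stand-in for the "?" entries, and the function name `influence_matrix` is hypothetical.

```python
def influence_matrix(X):
    """Apply F of Equation (3.1) to a binary author-by-paper matrix X.

    X is a list of rows (one row per author). The result holds 1 where the
    author wrote the paper, None (the "?" to be predicted) where one of the
    author's coauthors wrote it, and 0 otherwise.
    """
    n_authors, n_papers = len(X), len(X[0])
    # C[i]: authors who share at least one paper with author i.
    C = [
        {k for k in range(n_authors)
         if k != i and any(X[i][j] and X[k][j] for j in range(n_papers))}
        for i in range(n_authors)
    ]
    F = []
    for i in range(n_authors):
        row = []
        for j in range(n_papers):
            if X[i][j] == 1:
                row.append(1)        # first case: a_i is an author of p_j
            elif any(X[k][j] == 1 for k in C[i]):
                row.append(None)     # second case: written by a coauthor of a_i
            else:
                row.append(0)        # third case: unrelated
        F.append(row)
    return F

# Coauthor matrix as read from Figure 3.1(b):
# rows are authors 1-4, columns are papers 1-5.
X = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
]
F = influence_matrix(X)
for row in F:
    print(" ".join("?" if v is None else str(v) for v in row))
```

Checking the cases in this order means that an author's own papers are always marked 1, even when a coauthor also wrote them.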

[Figure 3.1: The Proposed Framework for Modeling Latent Social Influence. (a) Relationships between four authors and five papers; (b) Coauthor Matrix; (c) Influence Matrix, obtained from (b) by the influence transformation function F(·); (d) Latent Influence Matrix, obtained from (c) by matrix factorization for collaborative filtering, with the row sums in the last column.]

(b) Coauthor Matrix

| | Paper 1 | Paper 2 | Paper 3 | Paper 4 | Paper 5 |
|---|---|---|---|---|---|
| Author 1 | 1 | 0 | 1 | 0 | 1 |
| Author 2 | 1 | 1 | 0 | 0 | 0 |
| Author 3 | 0 | 0 | 0 | 1 | 0 |
| Author 4 | 0 | 1 | 0 | 1 | 0 |

(c) Influence Matrix

| | Paper 1 | Paper 2 | Paper 3 | Paper 4 | Paper 5 |
|---|---|---|---|---|---|
| Author 1 | 1 | ? | 1 | 0 | 1 |
| Author 2 | 1 | 1 | ? | ? | ? |
| Author 3 | 0 | ? | 0 | 1 | 0 |
| Author 4 | ? | 1 | 0 | 1 | 0 |

(d) Latent Influence Matrix

| | Paper 1 | Paper 2 | Paper 3 | Paper 4 | Paper 5 | Sum |
|---|---|---|---|---|---|---|
| Author 1 | 1.0 | 0.4 | 1.0 | 0.0 | 1.0 | 3.4 |
| Author 2 | 1.0 | 1.0 | 0.7 | 0.6 | 0.7 | 4.0 |
| Author 3 | 0.0 | 0.8 | 0.0 | 1.0 | 0.0 | 1.8 |
| Author 4 | 0.2 | 1.0 | 0.0 | 1.0 | 0.0 | 2.2 |

[Figure 3.2: An Example Input for FM. Each row is one instance: the target score $y$, the one-hot author indicator, the one-hot paper indicator, the text information $t_{a_i}$ associated with the author, and the text information $t_{p_j}$ associated with the paper.]

| Instance | Target score $y$ | Author ($a_1\ a_2\ a_3\ a_4$) | Paper ($p_1\ p_2\ p_3\ p_4\ p_5$) | $t_{a_i}$ | $t_{p_j}$ |
|---|---|---|---|---|---|
| $x_1$ | 1 | 1 0 0 0 | 1 0 0 0 0 | 0.3 0.6 0.2 0.1 0.3 | 0.8 0.2 0.3 0.4 0.5 |
| $x_2$ | 1 | 1 0 0 0 | 0 0 1 0 0 | 0.3 0.6 0.2 0.1 0.3 | 0.3 0.1 0.3 0.8 0.1 |
| $x_3$ | 0 | 1 0 0 0 | 0 0 0 1 0 | 0.3 0.6 0.2 0.1 0.3 | 0.1 0.1 0.2 0.6 0.3 |
| $x_4$ | 1 | 1 0 0 0 | 0 0 0 0 1 | 0.3 0.6 0.2 0.1 0.3 | 0.2 0.3 0.4 0.4 0.3 |
| $x_5$ | 1 | 0 1 0 0 | 1 0 0 0 0 | 0.2 0.1 0.7 0.3 0.5 | 0.8 0.2 0.3 0.4 0.5 |
| $x_6$ | 1 | 0 1 0 0 | 0 1 0 0 0 | 0.2 0.1 0.7 0.3 0.5 | 0.5 0.5 0.4 0.2 0.1 |

In Figure 3.1(d), each number in blue (the matrix entries) can be explained as the estimated latent social influence; the numbers in the green box (the Sum column) are the sums of the influence scores of each author on all papers. As shown in the figure, we can observe that although author 2 has only written 2 papers, his/her social influence score (i.e., 4) is larger than that of author 1 (i.e., 3.4), who has written the most papers among the 4 authors.
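The step from Figure 3.1(c) to Figure 3.1(d) and the row sums can be sketched as follows. This is only a sketch under assumptions: the thesis states that standard matrix factorization is used but gives no settings, so the rank `k`, learning rate, regularization, and the plain SGD loop below are our own choices, and the estimated values will not match the figure exactly. The two matrices are transcribed from Figure 3.1(c) and 3.1(d), with `None` standing in for "?".

```python
import random

def factorize(M, k=2, epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Fit M ~ P Q^T by SGD on the known entries only (None = unknown)."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    known = [(i, j, M[i][j]) for i in range(n) for j in range(m)
             if M[i][j] is not None]
    for _ in range(epochs):
        for i, j, y in known:
            err = y - sum(P[i][f] * Q[j][f] for f in range(k))
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)
    # Dense reconstruction: every entry, including the unknown ones, gets a value.
    return [[sum(P[i][f] * Q[j][f] for f in range(k)) for j in range(m)]
            for i in range(n)]

def influence_scores(latent):
    """Sum each author's latent influence over all papers."""
    return [sum(row) for row in latent]

# Influence matrix as read from Figure 3.1(c).
INFLUENCE = [
    [1, None, 1, 0, 1],
    [1, 1, None, None, None],
    [0, None, 0, 1, 0],
    [None, 1, 0, 1, 0],
]
# Latent influence matrix as printed in Figure 3.1(d).
FIGURE_LATENT = [
    [1.0, 0.4, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.7, 0.6, 0.7],
    [0.0, 0.8, 0.0, 1.0, 0.0],
    [0.2, 1.0, 0.0, 1.0, 0.0],
]

figure_scores = influence_scores(FIGURE_LATENT)
ranking = sorted(range(len(figure_scores)), key=lambda i: -figure_scores[i])
print([round(s, 1) for s in figure_scores], [i + 1 for i in ranking])

estimated = factorize(INFLUENCE)
print([round(s, 2) for s in influence_scores(estimated)])
```

The first printed line reproduces the sums reported in the figure (3.4, 4.0, 1.8, 2.2) and ranks author 2 first; the second line shows what one particular factorization run yields for the same input.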
Even though author 2 is not the author of papers 3, 4, and 5, we consider that author 2 should still have (latent) influence on these three papers, and this influence can be modeled with the patterns of collaboration among the authors.

3.2 Modeling Social Influence with FM

FM provides an advantage over other existing CF approaches: it can incorporate any auxiliary information that can be encoded as a real-valued feature vector. Thus, by using FM, this paper integrates other supplementary information to model latent social influence; we use textual information as the supplementary information in our experiments.

Given a publication dataset, we first describe how to transform its information to the feature vectors of FM. The four notations are defined as follows:

• $A$: the set of authors, $a_i \in A$, where $i = 1, 2, 3, \cdots, \ell$.
• $P$: the set of papers, $p_j \in P$, where $j = 1, 2, 3, \cdots, m$.
• $t_{a_i}$: the text information associated with author $a_i$.
• $t_{p_j}$: the text information associated with paper $p_j$.

An observed instance can be defined as $(a_i, p_j, t_{a_i}, t_{p_j}, y)$, where $y$ denotes the observed social influence of author $a_i$ on paper $p_j$, which corresponds to Equation (3.1). The feature vectors can be created from all of the observed instances as follows. First, there are $|A| = \ell$ binary indicator variables that represent the active author of an instance; there is always exactly one active author for each observed instance. The next $|P| = m$ binary indicator variables hold the active paper, and again there is always exactly one active paper. The remaining features for $t_{a_i}$ and $t_{p_j}$ represent the text information associated with author $a_i$ and paper $p_j$, respectively. In general, they can be described by the bag-of-words model, each as a vector of size $|V|$, where $V$ is the set of all unique vocabulary words. We then define a transformation function $f$ for the above feature creation:

\[
x_{(a_i,p_j)} = f(a_i, p_j, t_{a_i}, t_{p_j}) \in \mathbb{R}^u .
\]

Once we obtain the feature vectors created from the observed instances, we can model the latent social influence with FM. Assume that the data of a prediction problem is described by a matrix $X \in \mathbb{R}^{n \times u}$, where the $i$th row $x_i \in \mathbb{R}^u$ of $X$ describes one case with $u$ real-valued variables and where $y_i$ is the prediction target of the $i$th case (see Figure 3.2 for an example). The model equation for a factorization machine is then defined as [6]:

\[
\hat{y}(x) := w_0 + \sum_{i=1}^{u} w_i x_i + \sum_{i=1}^{u} \sum_{j=i+1}^{u} \langle v_i, v_j \rangle \, x_i x_j ,
\tag{3.2}
\]
where the model parameters that have to be estimated are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^u$, and $V \in \mathbb{R}^{u \times k}$. Above, $\langle \cdot , \cdot \rangle$ is the dot product of two vectors of size $k$:

\[
\langle v_i, v_j \rangle := \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f} ,
\tag{3.3}
\]

where $k \in \mathbb{N}_0^+$ is a hyperparameter that defines the dimensionality of the factorization.

With the estimation function $\hat{y}(x)$ in Equation (3.2), we can define the social influence score of author $a_i$ as follows:

\[
s(a_i) = \sum_{j=1}^{m} \hat{y}\left( x_{(a_i,p_j)} \right) .
\tag{3.4}
\]

From Equation (3.4), the social influence score of each author $a_i$ is assumed to be the sum of the estimated latent social influence of author $a_i$ on each paper $p_j$.
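To make Equations (3.2) to (3.4) concrete, the sketch below encodes an instance as in Figure 3.2, evaluates the FM model, and sums the predictions over papers. It is an illustration under assumptions: the parameters `w0`, `w`, and `V` are random rather than learned, the helper names are hypothetical, and the text vectors are the values read from Figure 3.2. The fast prediction uses the well-known identity $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \tfrac{1}{2} \sum_{f} \big[ (\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2 \big]$, which avoids the double loop.

```python
import random

def encode(i, j, t_a, t_p, n_authors, n_papers):
    """Build x_(a_i,p_j): one-hot author, one-hot paper, then both text vectors."""
    x = [0.0] * (n_authors + n_papers)
    x[i] = 1.0
    x[n_authors + j] = 1.0
    return x + list(t_a) + list(t_p)

def fm_predict(x, w0, w, V):
    """Equation (3.2), with the pairwise term computed in O(k * u)."""
    k = len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Equation (3.2) written literally as a double sum, for checking."""
    u = len(x)
    y = w0 + sum(w[i] * x[i] for i in range(u))
    for i in range(u):
        for j in range(i + 1, u):
            y += sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
    return y

def influence_score(i, t_a, paper_texts, w0, w, V, n_authors):
    """Equation (3.4): sum author i's predicted influence over every paper."""
    m = len(paper_texts)
    return sum(
        fm_predict(encode(i, j, t_a, paper_texts[j], n_authors, m), w0, w, V)
        for j in range(m)
    )

# Text vectors of author a1 and papers p1..p5 as read from Figure 3.2.
T_A1 = [0.3, 0.6, 0.2, 0.1, 0.3]
PAPER_TEXTS = [
    [0.8, 0.2, 0.3, 0.4, 0.5],
    [0.5, 0.5, 0.4, 0.2, 0.1],
    [0.3, 0.1, 0.3, 0.8, 0.1],
    [0.1, 0.1, 0.2, 0.6, 0.3],
    [0.2, 0.3, 0.4, 0.4, 0.3],
]
N_AUTHORS, N_PAPERS, K = 4, 5, 3

x1 = encode(0, 0, T_A1, PAPER_TEXTS[0], N_AUTHORS, N_PAPERS)
rng = random.Random(0)
w0 = 0.0
w = [rng.gauss(0, 0.1) for _ in x1]                       # untrained, random
V = [[rng.gauss(0, 0.1) for _ in range(K)] for _ in x1]   # untrained, random

y_fast = fm_predict(x1, w0, w, V)
y_naive = fm_predict_naive(x1, w0, w, V)
score_a1 = influence_score(0, T_A1, PAPER_TEXTS, w0, w, V, N_AUTHORS)
print(len(x1), abs(y_fast - y_naive) < 1e-9)
```

In the thesis the parameters are estimated by libFM; the random values here only exercise the equations, so `score_a1` has no meaning as an influence score.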


Chapter 4 Experimental Results

4.1 Experiments

4.1.1 Dataset

The dataset we used is collected from DBLP, and two gold standards are assembled from Microsoft Academic Search and Google Scholar. Microsoft Academic Search is an academic search engine for scientific papers and literature. Since the Microsoft Academic Search service stopped in 2013, we also use Google Scholar as another gold standard for evaluation. This service provides ranking lists in different fields, such as machine learning and networks. Google Scholar contains the newest papers and literature, and it also provides a profile for each author. An author's profile contains the number of papers published by the author and the h-index, which represents the importance of the author.

The experimental dataset is collected from DBLP (http://dblp.uni-trier.de/xml/), which contains the information of papers and the coauthors of each paper. We first collect the top 20 authors in the field of data mining from Microsoft Academic Search. The purpose of our task is to predict the top 10 authors; for the data collection, we use the top 11 to 20 authors as the seeds to build the dataset. These 10 authors, their coauthors, and all of the papers of the 10 authors are then used to construct our first experimental dataset (denoted as Dataset I), which consists of 3,662 authors and 5,122 papers. In addition to the papers in Dataset I, the papers of all of the coauthors are also included to build the second dataset (denoted as Dataset II); there are 152,010 papers in total, and the number of authors remains the same.

In the experiments, two gold standards (see the top 10 authors in Table 4.1) are adopted

to evaluate the performance:

1. The ranking list provided by Microsoft Academic Search (denoted as MAS; http://academic.research.microsoft.com/?SearchDomain=2&SubDomain=7&entitytype=2);
2. The ranking list ranked by the 20 authors' h-indices from Google Scholar (denoted as h-index; https://scholar.google.com.tw/citations?view_op=search_authors&mauthors=label:data_mining).

| Rank | MAS | h-index | Ours |
|---|---|---|---|
| 1 | Jiawei Han | Jiawei Han | Jiawei Han |
| 2 | Philip S. Yu | Hector Garcia-Molina | Philip S. Yu |
| 3 | Rakesh Agrawal | Philip S. Yu | Jian Pei |
| 4 | Christos Faloutsos | Christos Faloutsos | Christos Faloutsos |
| 5 | Hans-Peter Kriegel | Rakesh Agrawal | Eamonn J. Keogh |
| 6 | Eamonn J. Keogh | Andrew McCallum | Heikki Mannila |
| 7 | George Karypis | Hans-Peter Kriegel | Padhraic Smyth |
| 8 | Heikki Mannila | George Karypis | Rakesh Agrawal |
| 9 | Andrew McCallum | Charu C. Aggarwal | Hans-Peter Kriegel |
| 10 | Jian Pei | Andrew McCallum | Jian Pei |

Table 4.1: Top 10 Authors in the Two Gold Standards Ranking Lists and the List Obtained from Our Best Model.

4.1.2 Experiment Setup

We compare the results of four baselines with those of the proposed FM framework. The first three baselines are the rankings by the numbers of coauthors, papers, and citations per author from Microsoft Academic Search. The fourth baseline is the ranking list based on the scores generated by PageRank. The textual information in the FM models is described by a bag-of-words model with term frequency. Note that we use only the title words of the papers as the textual information; the resulting vocabulary sizes of Dataset I and Dataset II are 4,057 and 50,059, respectively. The result values of the FM models are the average scores over 20 experiments for Dataset I; for Dataset II, only 5 experiments are conducted because of the computational cost.

For the text information, we utilize the title of each paper as the auxiliary information and the bag-of-words model with term frequency as the feature; note that the words are in their original forms and are not stemmed. For an author's information, we sum up the sentence vectors of the papers written by that author.
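The text features described above can be sketched as follows. This is a minimal sketch with assumptions: the whitespace tokenization and lower-casing are our own choices (the thesis only states that words are not stemmed), and the two titles and the helper names are hypothetical.

```python
from collections import Counter

def build_vocabulary(titles):
    """Assign an index to each distinct word, in order of first appearance."""
    vocab = {}
    for title in titles:
        for word in title.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def term_frequency(title, vocab):
    """Bag-of-words vector of a title: raw counts, no stemming."""
    vec = [0] * len(vocab)
    for word, count in Counter(title.lower().split()).items():
        vec[vocab[word]] = count
    return vec

def author_vector(paper_ids, paper_vectors):
    """Sum the title vectors of the papers written by one author."""
    return [sum(column) for column in zip(*(paper_vectors[p] for p in paper_ids))]

# Hypothetical paper titles, for illustration only.
titles = ["Mining frequent patterns", "Mining data streams"]
vocab = build_vocabulary(titles)
paper_vectors = [term_frequency(t, vocab) for t in titles]
author = author_vector([0, 1], paper_vectors)  # an author who wrote both papers
print(vocab)
print(paper_vectors, author)
```

Summing title vectors means that an author's text profile grows with the number of papers, so prolific authors have larger feature values.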

| Gold standard | Metric | Dataset I: #Coauthors (∗) | Dataset I: #Papers (†) | Dataset I: #Citations (‡) | Dataset I: PageRank (§) | Dataset I: FM [w/o texts] | Dataset I: FM [w/ texts] | Dataset II: PageRank (§) | Dataset II: FM [w/o texts] | Dataset II: FM [w/ texts] |
|---|---|---|---|---|---|---|---|---|---|---|
| MAS | ρ | 0.233 | 0.388 | 0.460 | 0.364 | 0.478∗†‡§ | 0.556∗†‡§ | 0.364 | 0.417∗†§ | 0.601∗†‡§ |
| MAS | τ | 0.179 | 0.284 | 0.347 | 0.274 | 0.349∗†§ | 0.409∗†‡§ | 0.274 | 0.309∗†§ | 0.463∗†‡§ |
| h-index | ρ | 0.222 | 0.510 | 0.457 | 0.558 | 0.325∗ | 0.625∗†‡§ | 0.550 | 0.540∗‡ | 0.623∗†‡ |
| h-index | τ | 0.179 | 0.368 | 0.326 | 0.418 | 0.221∗ | 0.447∗†‡§ | 0.368 | 0.385∗‡ | 0.455∗†‡ |

Table 4.2: The Experimental Results. The notations ∗, †, ‡, and § denote that the result is significantly better than the four corresponding baselines #coauthor, #paper, #citation, and PageRank with p < 0.05. MAS uses the ranking list provided by Microsoft Academic Search as the gold standard; for h-index, we treat the list ranked by the authors' h-indices as the gold standard. The "−" symbol denotes the experiments still in progress, which will be provided in the final version.

4.1.3 Evaluation

To evaluate the performance in our experiments, two rank correlation metrics are used: Spearman's Rho ($\rho$) [3] and Kendall's Tau ($\tau$) [1, 6]. Given two ranked lists $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$,

\[
\rho = 1 - \frac{6 \sum_{i} (x_i - y_i)^2}{n(n^2 - 1)},
\qquad
\tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{0.5 \cdot n \cdot (n - 1)} .
\]

For the measure of Kendall's Tau, any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is concordant if the ranks for both elements agree; that is, if both $x_i > x_j$ and $y_i > y_j$, or if both $x_j > x_i$ and $y_j > y_i$. In contrast, it is discordant if $x_i > x_j$ and $y_j > y_i$, or if $x_j > x_i$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant.

The FM library, libFM [6], is adopted to conduct the experiments.
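The two rank correlation metrics can be computed directly from the formulas above. The following is a minimal sketch that assumes both lists hold the ranks of the same n items and, for Spearman's Rho, that there are no ties; the example ranks are made up for illustration.

```python
from itertools import combinations

def spearman_rho(x, y):
    """Spearman's Rho from rank differences (assumes no tied ranks)."""
    n = len(x)
    d_squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's Tau: (concordant - discordant) / (0.5 * n * (n - 1))."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            concordant += 1      # both lists order the pair the same way
        elif sign < 0:
            discordant += 1      # the lists disagree on the pair
        # sign == 0: a tie in x or y, counted as neither
    n = len(x)
    return (concordant - discordant) / (0.5 * n * (n - 1))

# Hypothetical ranks of four authors under a gold standard and a model.
gold = [1, 2, 3, 4]
model = [2, 1, 4, 3]
rho = spearman_rho(gold, model)
tau = kendall_tau(gold, model)
print(round(rho, 3), round(tau, 3))
```

For this example the squared rank differences sum to 4, giving ρ = 1 − 24/60 = 0.6; four of the six pairs are concordant and two are discordant, giving τ = 2/6 ≈ 0.333.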
For the smaller Dataset I, the number of iterations is set to 500, and for Dataset II, it is set to 200; all the other parameters are set to the default values of libFM.

4.2 Experimental Results

Table 4.2 tabulates the experimental results, in which we compare the results of four baselines with those of the proposed FM framework. The first three baselines are the rankings by the numbers of coauthors, papers, and citations per author from Microsoft Academic Search. The fourth baseline is the ranking list based on the scores generated by PageRank. The textual information in the FM models is described by a bag-of-words model with

term frequency. Note that we use only the title words of the papers as the textual information; the resulting vocabulary sizes of Dataset I and Dataset II are 4,057 and 50,059, respectively. Due to the randomization of the algorithms implemented in libFM, the values of the FM models are the average scores over 20 experiments for Dataset I; for Dataset II, only 5 experiments are conducted because of the computational cost.

As shown in Table 4.2, among the four baseline methods, #Citations achieves the highest values of the two evaluation metrics in terms of the MAS gold standard, whereas PageRank has the best performance with regard to h-index. Observe that, in terms of MAS, the FMs both without and with texts reach better results than all of the baselines; in addition, almost all results significantly outperform the four baselines at p < 0.05. With the inclusion of the coauthor papers (Dataset II), the FM model with textual information achieves the best performance, i.e., ρ = 0.601 and τ = 0.463. On the other hand, with regard to h-index, the performance gains of the FM models with textual information are statistically significant compared against all of the four baselines. From the experimental results, we can observe that incorporating the supplementary textual information did greatly improve the performance, which confirms that the textual information is beneficial for modeling social influence.

1. availability    2. www    3. character    4. technique
5. re-examination    6. lack    7. amethyst    8. priority
9. book    10. function    11. permutation-based    12. shadow
13. wikipedia    14. topicnets    15. simplex    16. model
17. gaussian    18. system    19. k-way    20. system-throttling

Table 4.3: Top 20 Words Learned from the FM Model with Textual Information from Dataset I.
The top 10 authors ranked by our best model (FM with textual information on Dataset II) are listed in Table 4.1; the top 20 words are listed in Table 4.3.

4.3 Discussion

Our results show that the FM-based framework with supplementary information is an efficient way to predict leaders in a social network. First, we propose an influence matrix to represent the latent influence. To reduce the complexity of our framework, we consider only the relation between an author and that author's coauthors. We do not model the influence of distant authors, that is, authors who are not direct coauthors. In fact, a remote author may still wield some influence; our hypothesis ignores such influence.

The dataset is easy to represent as a user-item matrix, and recommender systems and leader finding share several similar properties. The purpose of a recommender system is to recommend the most appropriate item to a user: we estimate the probability that a user will like an item based on the items the user has previously liked or used. In a social network, different users may be active on the same item; for example, two users may share the same article or write a post together. We can likewise compute the probability that a user will like a post. We assume that if a user likes a post, then the post affects the user. We therefore compute this probability for each item-user pair and regard the value as the influence. For these reasons, we consider FM a well-suited algorithm for this problem.

Our results differ somewhat between the two gold standards. First, the performance of PageRank under h-index is better than its performance under the Microsoft standard, and the same holds for our framework. The dataset comes from DBLP, which is up to date, and the h-index standard from Google is also up to date, whereas the standard from Microsoft is outdated. We believe this makes the results under the Microsoft standard worse, which suggests that the dataset fits the more up-to-date standard better.

In Dataset I, we observe that FM without texts performs poorly, even worse than paper counts and citation counts under the h-index standard, but its performance is much better in Dataset II. Our explanation is that FM is not powerful enough to analyze a network when the data are insufficient; in other words, Dataset I does not contain enough information. After we add the extended information, the performance improves dramatically. This indicates that additional information can effectively remedy a poor dataset.

In other words, the ranking lists of Microsoft Academic Search and Google depend on simple features such as paper counts and citation counts. Because the results of our framework capture the latent influence in the network, we consider that they reflect the real situation of the network.

We also list the top 20 words learned from the FM model with textual information from Dataset I in Table 4.3 and show the corresponding word cloud in Figure 4.1. The words in the cloud are shown in stemmed form, and the size of a word indicates its weight. The top words include 'availability', 're-examination', 'book', 'wikipedia', and 'gaussian'. The word list contains some noise, but it still captures some useful terms such as 'gaussian', 'function', 'k-way', and 'permutation-based'. These words appear in data mining articles.
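To make the recommender-style view above concrete, the following minimal sketch evaluates the second-order Factorization Machine model of Rendle [5] on one sparse feature vector, using the linear-time reformulation of the pairwise term. It is an illustration under stated assumptions, not the thesis's implementation: the feature names (`author:a1`, `coauthor:a2`, `word:gaussian`) and all parameter values are hypothetical, and in practice libFM learns the parameters from data.

```python
from itertools import combinations


def fm_predict(x, w0, w, v):
    """Second-order FM prediction for a sparse input.

    x:  dict mapping feature name -> value (only nonzero entries)
    w0: global bias
    w:  dict mapping feature name -> linear weight
    v:  dict mapping feature name -> list of k latent factors
    """
    k = len(next(iter(v.values())))
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #         = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, v):
    """Same model with the pairwise term written out explicitly (for checking)."""
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i, j in combinations(x, 2)
    )
    return w0 + linear + pairwise


# Hypothetical example: one author, one coauthor, and a title word with
# term frequency 2. All parameter values are made up for illustration.
w0 = 0.1
w = {"author:a1": 0.2, "coauthor:a2": -0.1, "word:gaussian": 0.3}
v = {
    "author:a1": [1.0, 0.0],
    "coauthor:a2": [0.5, 1.0],
    "word:gaussian": [0.0, 2.0],
}
x = {"author:a1": 1.0, "coauthor:a2": 1.0, "word:gaussian": 2.0}

print(fm_predict(x, w0, w, v))
print(fm_predict_naive(x, w0, w, v))
```

Because every pair of features interacts through the inner product of their latent factors, a title word can contribute to an author's score even if that exact author-word pair is rare in the training data, which is one reason FMs suit sparse inputs.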

Figure 4.1: The Word Cloud of the Top 20 Words from Dataset I.

Chapter 5 Conclusions

5.1 Conclusions

In this thesis, we have presented an FM framework to model the social influence among individuals based on their patterns of collaboration in a social network and on supplementary textual information. The proposed influence matrix captures the influence between authors through their papers, and through the computation of FM we further obtain the latent influence matrix. FM is applicable to many kinds of features, which means that we can obtain higher-level information from multiple aspects. In our task, we improve the performance by incorporating textual information. Our experimental results on the two datasets for the data mining community show that the proposed approach provides a better predictive model than the four baselines. Moreover, incorporating the textual information indeed benefits the predictive performance. The predictive model also shows how the textual information works; that is, we can identify which words carry strong influence and which carry weak influence.

Influence spreading is an important issue in social networks. Although our proposed latent influence matrix can capture the latent influence between an author and directly related authors, it still cannot capture the latent influence of distant authors. In future work, we seek to improve the latent influence matrix; that is, we expect to develop a model that captures deeper latent influence. We will also conduct experiments on larger datasets covering various research communities or real-world social networks. In addition, other auxiliary information, such as the temporal information of the publications [11], will be included and analyzed in further experiments. For textual information, we will try word-embedding techniques such as word2vec to represent the auxiliary information and thus improve the performance. Finally, we plan to apply this framework to other social networks such as Facebook, Twitter, and GitHub.


Bibliography

[1] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[2] L. Liu, J. Tang, J. Han, and S. Yang. Learning influence from heterogeneous social networks. Data Mining and Knowledge Discovery, 25(3):511–544, 2012.

[3] J. L. Myers, A. Well, and R. F. Lorch. Research design and statistical analysis. Routledge, 2010.

[4] S. A. Myers, C. Zhu, and J. Leskovec. Information diffusion and external influence in networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 33–41, New York, NY, USA, 2012. ACM.

[5] S. Rendle. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, pages 995–1000, Washington, DC, USA, 2010. IEEE Computer Society.

[6] S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.

[7] X. Shuai, Y. Ding, J. Busemeyer, S. Chen, Y. Sun, and J. Tang. Modeling indirect influence on Twitter. Int. J. Semant. Web Inf. Syst., 8(4):20–36, Oct. 2012.

[8] L. Terveen and W. Hill. Beyond recommender systems: Helping people help each other. 2001.

[9] M.-F. Tsai, C.-W. Tzeng, and A. L. P. Chen. Discovering leaders from social network by action cascade. In Proceedings of the Fifth Workshop on Social Network Systems, SNS '12, pages 12:1–12:2, New York, NY, USA, 2012. ACM.

[10] M.-F. Tsai, C.-J. Wang, and Z.-L. Lin. Social influencer analysis with factorization machines. In Proceedings of the ACM Web Science Conference, WebSci '15, pages 50:1–50:2, New York, NY, USA, 2015. ACM.

[11] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 641–649, 2013.
