Link Discovery with Unlabeled Data (未標記資料之連結發現)

Graduate Institute of Networking and Multimedia, College of Electrical Engineering and Computer Science

National Taiwan University Doctoral Dissertation

Tsung-Ting Kuo (郭宗廷)

Advisor: Shou-De Lin, Ph.D. (林守德)

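Acknowledgements

"Although the actions may be the same, with a great goal in mind every deed becomes a great footprint. The childhood of a prince is surely nothing like the childhood of a commoner. Renew your aspiration again and again each day, and pour your heart into every seemingly ordinary day. The greatest explosive power is steadfast, enduring loyalty to one's goal!"

After five and a half years as a doctoral student, there are far too many people to thank! First, of course, is my advisor, Professor Shou-De Lin (林守德), who gave me the most support and guidance in every respect: research and teaching, paper writing, competitions, work attitude, balancing life and family, building team rapport, and running the laboratory. I thank Professors 曾新穆, 許聞廉, 許永真, 洪宗貝, 張智星, 張嘉惠, 李素瑛, and 彭文志 for their guidance and suggestions at my doctoral examination. I also thank Professors 陳信希, 陳銘憲, 張智星, 林軒田, and 鄭卜壬, who offered guidance at the dissertation proposal review stage. In addition, I thank Professors 林守德, 林智仁, 林軒田, 李明穗, 歐陽明, and 陳彥仰, whose courses I took. Special thanks go to my master's advisor, Professor 曾憲雄, and my undergraduate project advisor, Professor 曾新穆, who continued to give me support and care throughout my doctoral studies.

I thank my lab partners 駱宏毅, 李政德, 解巽評, 嚴睿, 葉蓉蓉, 黃安達, 沈砡君, 林婉真, 洪三權, 林瑋詩, 彭楠贇, 林嘉貞, 黃宇陽, 龔鵬驊, and 曹餘雯 for their support and encouragement. I also thank Ms. 陳雅琳 and all the staff of the Graduate Institute of Networking and Multimedia and the Department of Computer Science and Information Engineering for their help over the years.

I am very grateful to 季松平, 孫基康, 林俊賢, 張烈諄, 徐佩玲, 王馨佩, 黃虹瑜, and the many senior colleagues and partners at 宗臣科技 and National Instruments for their support and help. I also thank my many good friends at 交大資科所, 成大資工系, and 成大慈幼社 for their care.

I also sincerely thank 盧克宙, 賴鍚源, 郭基瑞, 林常如, 陳靜香, 林文清, 李妙玲, 黃俊昌, 林金枝, and all the elders and friends at 福智文教基金會 and 里仁公司 for their care and support. Special thanks also go to all the faculty, staff, and students of 臺大福智教職員聯誼會, 臺大福智青年校友會, 臺大福智青年社, and 福智青年大專班台大校群 for their support and encouragement.

Of course, the support of my family matters a great deal. I thank my paternal grandmother, maternal grandmother, father, mother, aunts, uncles, younger sister, cousins, daughter, and my dear wife.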

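摘要 (Abstract)

Many social, academic, biological, geographical, and information systems can be described as networks. Link discovery is the study of identifying hidden links in a social network. In some cases, however, labeled data cannot be obtained for the links we wish to discover. In this dissertation, we study a new aspect of the link discovery problem: discovering unlabeled links. We further study two sub-problems that predict two kinds of unlabeled links: unlabeled relationship links in heterogeneous networks, and unlabeled diffusion links in homogeneous networks. The main challenge is the lack of labeled data, which rules out the direct use of traditional automatic classification methods. To address this, we design machine-learning-based frameworks that integrate various kinds of information and discover links from unlabeled data. We also conduct experiments on many real-world datasets to validate the proposed methods. The results show that the proposed methods can solve the problem, and also indicate that link discovery with unlabeled data can be applied in many practical scenarios.

Keywords:

Link discovery; Link prediction; Data mining; Machine learning; Social network; Probabilistic graphical model; Natural language processing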

Abstract

Many social, academic, biological, geographical, and information systems can be described by networks. Link discovery aims to identify hidden links in a social network. In some cases, however, labels for the links to be discovered are not available. In this dissertation, we investigate this novel aspect of the link discovery task: the problem of discovering unlabeled links. Specifically, we conduct two studies that predict two kinds of unlabeled links: links that represent unlabeled relationships in heterogeneous networks, and links that represent unlabeled diffusion in homogeneous networks. The main challenge of these tasks is the lack of labeled data, which prevents the direct use of traditional classification approaches. To address this challenge, we design learning-based frameworks that integrate diverse information and solve the corresponding link discovery problems in the two studies. We also conduct experiments on various real-world datasets to evaluate the proposed frameworks. The promising experimental results not only demonstrate the usefulness of the proposed models, but also indicate that discovering links without labeled data is feasible in many practical scenarios.

Keywords:

Link discovery; Link prediction; Data mining; Machine learning; Social network; Probabilistic graphical model; Natural language processing

List of Algorithms

Algorithm 2-1. Ranked-margin learning algorithm.

Algorithm 2-2. Two-stage inference algorithm.

List of Figures

Figure 2-1. The unseen-type link prediction with aggregative statistics problem in a heterogeneous social network.

Figure 2-2. Relational schema of the unseen-type link prediction with aggregative statistics problem shown in Figure 2-1.

Figure 2-3. Factor graph model with aggregative statistics (FGM-AS).

Figure 2-4. An example of FGM-AS based on Figure 2-1's network.

Figure 3-1. The novel-topic diffusion model.

List of Tables

Table 1-1. Summary of two studies of link discovery with unlabeled data.

Table 1-2. Summary of literatures and our proposed solutions.

Table 2-1. Statistics of the datasets.

Table 2-2. Mapping of the random variables for the datasets.

Table 2-3. Experiment results of our framework (FGM-AS) and all comparing methods (in percentage).

Table 2-4. Verification results of candidate-to-candidate functions (in percentage), Pre. = precision, Rec. = recall.

Table 3-1. Single-feature results.

Table 3-2. Feature combination results.

Chapter 1 Introduction

1.1 Problem and Motivation

Many social, academic, biological, geographical, and information systems can be described by networks (tree-structured, homogeneous, heterogeneous, etc.), where nodes represent individuals and links denote the relations or interactions between nodes [31] [32] [39]. Given such networks, link discovery tries to estimate the likelihood that a link exists between two nodes, based on the observed links and the attributes of the nodes [16].

Table 1-1. Summary of two studies of link discovery with unlabeled data.

Study                                        | Unlabeled    | Network       | Publication
Link prediction using aggregative statistics | Relationship | Heterogeneous | [30]
Diffusion prediction of novel topics         | Diffusion    | Homogeneous   | [28]

However, in some cases, the links to be discovered are not labeled in the training data, which makes link discovery much more challenging. In this dissertation, we investigate the problem of discovering unlabeled links (links with specific attributes that are never observed in the training data). Specifically, we conduct two studies that predict two kinds of unlabeled links (Table 1-1): links that represent unlabeled relationships in heterogeneous networks, and links that represent unlabeled diffusion in homogeneous networks. An example of unlabeled relationship prediction is predicting the “like” relationship in Foursquare; because of the privacy policy, the labels of this relationship (i.e., whether a user likes a post or not) are not revealed. An example of unlabeled diffusion prediction is predicting whether a user will respond to a post about “iPhone 6” before any post about “iPhone 6” actually exists (that is, we only have labels for posts on topics such as “HTC” and “iPad”, but not for “iPhone 6”). The two unlabeled link prediction studies are described in detail as follows:

(1) Link prediction using aggregative statistics (discovering links of unlabeled relationship in heterogeneous networks) [30]. Most social network services allow users to express their opinions (such as “like” or “+1”) on messages posted by other people, and such individual opinions are valuable for many reasons. However, because of privacy concerns, opinion holders are sometimes hard to determine. Fortunately, the aggregative statistics of articles (i.e., how many people like an article) are usually available on such websites. In this study, we aim to predict links of unlabeled relationship. We try to answer a question: can we predict links representing a specific relationship in a heterogeneous network without any labeled data, using only the aggregative statistics and some attributes provided by the heterogeneous social network?

homogeneous network. We argue that such an assumption does not always hold in real-world scenarios, and that being able to forecast the propagation of novel or unlabeled topics is more valuable in practice. In this study, we try to forecast links of unlabeled diffusion. We try to answer a question: can we predict future diffusion without labeled training data about how this kind of diffusion has propagated in a homogeneous network?

1.2 Challenge

Although discovering links using unlabeled data is valuable, solving the problems in the two proposed studies is not trivial because of the following challenges:

(1) Link prediction using aggregative statistics. This study faces three challenges. First, the absence of labeled training data prevents us from performing parameter learning in a straightforward way. Next, in a heterogeneous network, the information carried by different types of vertices and links is diverse but correlated. A suitable model has to capture such correlation together with the aggregative statistics. Finally, since the type is unlabeled, the number of possible candidate links presumably approaches O(n²), where n is the total number of nodes. When n is large, this causes a serious sparsity problem, and finding the links in such a large space can be very challenging.

(2) Diffusion prediction of novel topics. In this study, the past diffusion behaviors of novel topics are missing, which makes the problem difficult to solve. That is, without historical training data for the novel topics, it is not easy to maintain reasonable prediction performance.

1.3 Methodology, Dataset and Experiment

To address the challenges for predicting unlabeled links, we design learning-based frameworks to integrate diverse information and solve the corresponding link discovery problems in the two studies. Also, we conduct experiments on various real-world datasets to evaluate our proposed frameworks and get promising results. The two proposed solutions, datasets, and experiment results, are introduced briefly below.

(1) Link prediction using aggregative statistics. In this study, we cannot apply supervised learning methods directly, because we do not have any labeled relationships in the training stage. Instead, we devise a novel unsupervised framework to integrate three kinds of information: candidate, attribute, and count. The proposed framework includes three main components: a three-layer factor graph model with three types of potential functions; a ranked-margin learning algorithm for parameter tuning; and a two-stage inference algorithm for link prediction. We evaluate our method on four diverse scenarios using four datasets: preference prediction (Foursquare), repost prediction (Twitter), response prediction (Plurk), and citation prediction (DBLP). We further apply nine unsupervised models to this problem as baselines, and our approach significantly outperforms them in all scenarios.

(2) Diffusion prediction of novel topics. In this study, we devise a supervised learning framework to solve the problem, because we do have labels for other kinds of diffusion. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Specifically, we integrate four kinds of

based framework is evaluated on real data collected from the public domain. The experiments show promising AUC improvement over baseline methods.

1.4 Literature

As an important task in the recent data mining field, link prediction mainly addresses the following problems [39]: (1) reconstruction of networks [44] [50] [58], which reconstructs networks from observed networks with missing and spurious links; (2) evaluation of network evolving mechanisms [35] [66], which studies the evolving models of networks; and (3) classification of partially labeled networks [14] [65], which, given a network in which only some nodes are labeled, predicts the labels of the unlabeled nodes from the known labels and the network structure.

In terms of methodology, link prediction approaches can be further divided into two categories: supervised learning [4] [11] [18] [37] [40] [56] and unsupervised learning [1] [3] [6] [17] [22] [24] [43]. However, most of the proposed approaches aim at seen links (links of seen node, topic, and type), and thus cannot be applied directly to the problem of discovering unlabeled links. The literature and our proposed solutions are summarized in Table 1-2.

Table 1-2. Summary of literatures and our proposed solutions.

                      | Labeled Data                    | Unlabeled Data
Unsupervised Learning | [1] [3] [6] [17] [22] [24] [43] | Link prediction using aggregative statistics (Chapter 2)
Supervised Learning   | [4] [11] [18] [37] [40] [56]    | Diffusion prediction of novel topics (Chapter 3)

1.5 Contributions

The contributions in this dissertation are three-fold:

• Problem. We propose a novel problem of discovering unlabeled links, and conduct two related studies to predict links of unlabeled relationship in heterogeneous networks (link prediction using aggregative statistics), and links of unlabeled diffusion in homogeneous networks (diffusion prediction of novel topics).

• Solution. We devise two diverse learning-based frameworks to integrate the diverse information and solve the unlabeled link discovery problems. For the link prediction using aggregative statistics task, we integrate candidate, attribute, and count information in an unsupervised learning framework. For the diffusion prediction of novel topics task, we integrate the topic, user, user-topic, and global information in a supervised learning framework.

• Experiment. We conduct experiments on real-world datasets (Foursquare, Twitter, Plurk, and DBLP). The results show that our proposed frameworks provide reasonably high performance and can solve the unlabeled link prediction problems.

1.6 Dissertation Organization

The remainder of this dissertation is organized as follows. In the next chapter, we present the link prediction using aggregative statistics problem and explain how we tackle this problem. In Chapter 3, we introduce and solve the diffusion prediction of novel topics problem. Then, in Chapter 4, we provide concluding remarks of this dissertation.


Chapter 2 Link Prediction Using Aggregative Statistics

Privacy has become an important concern for online social networks. In services such as Foursquare.com, whether a person likes an article is considered private and is therefore not disclosed; only the aggregative statistics of articles (i.e., how many people like an article) are revealed. This study tries to answer a question: can we predict the opinion holder in a heterogeneous social network without any labeled data? This question can be generalized to a link prediction with aggregative statistics problem. This study devises a novel unsupervised framework to solve this problem, with two main components: (1) a three-layer factor graph model and three types of potential functions; (2) a ranked-margin learning and inference algorithm. Finally, we evaluate our method on four diverse prediction scenarios using four datasets: preference (Foursquare), repost (Twitter), response (Plurk), and citation (DBLP). We further apply nine unsupervised models to this problem as baselines. Our approach not only outperforms them in all scenarios, but also achieves on average 9.79% AUC and 12.81% NDCG improvement over the best competitors.

2.1 Overview

Most social network services allow users to express their opinions (e.g., “like” or “+1”) on messages posted by other people. Such individual opinions are usually valuable: companies can identify a specific customer’s preferences, and governments can recognize the will or desires of influential target persons.

However, because of privacy concerns, opinion holders are sometimes concealed. An example is Foursquare.com, a popular location-based social network website. In Foursquare, users can post tips about venues of interest, and other people may “like” the tips. Nevertheless, the information about which user likes which tip is generally not available to the public because of privacy concerns.

Another example is Pinterest.com, a pin-board-style photo sharing website. In Pinterest, users can “like” or “repin” others’ images, but only a small portion of such information is available because of an internal limitation of Pinterest (only the first 24 “like” and the first 8 “repin” actions are shown on the webpage). Thus, it is difficult to gather the full spectrum of information about each individual’s opinions under such circumstances.

Fortunately, aggregative statistics of opinions are usually available. For example, the total “like” count of each tip in Foursquare is accessible, and the total “like” and “repin” counts of an image in Pinterest are also obtainable. Such aggregative statistics are important because they are usually the only available clue to the quality of an item that can be obtained without violating the policy. Hence, this study tries to address a problem: can we predict a link between a user and an item (e.g., whether a user likes a tip) using the aggregative statistics together with other information in a heterogeneous social network?

We generalize the question to an unseen-type link prediction with aggregative statistics problem. The term unseen is used because we assume it is not possible to obtain from the data which person likes which tip (therefore, such a “like” link can be regarded as a kind of relationship that is previously unseen). From a link prediction point of view, one can assume that no labeled training data is available for this type of link.

An example we use throughout this study is a network gathered from Foursquare (Figure 2-1). There are 7 nodes and 7 links, with 3 node types (users, items, and categories) and 3 link types (be-friend-of, own, and belong-to). We want to predict the existence of “like” links (e.g., whether user u2 likes item r2 or not) using the aggregative statistics (e.g., the total like count of item r2 is t(r2) = 1). Note that links of the “like” type are unseen, which means we do not see any such link in the data.

Figure 2-1. The unseen-type link prediction with aggregative statistics problem in a heterogeneous social network.

[Figure 2-1: users u1 and u2 (be-friend-of), items r1, r2, and r3 (own), categories c1 and c2 (belong-to), the unseen “like ?” links between users and items, and the like counts t(r1) = 2, t(r2) = 1, t(r3) = 1.]

Most of the link prediction literature aims at predicting links of seen types (i.e., some labeled historical links are available as training data) [35] [39] [62], and thus cannot be applied to our problem. Some researchers predict links of unseen types using external node group information [33], but such information is not always available. As in the Foursquare example, the only available information in our problem is the aggregative statistics. Our problem is non-trivial because of the following three challenges:

• Lack of labeled data. The absence of labeled training data prevents us from performing parameter learning in a straightforward way.

• Diverse information. In a heterogeneous social network, the information carried by different types of nodes and links is diverse but correlated. A suitable model is needed to represent such correlation together with the aggregative statistics.

• Sparsity of links. Since the type is unseen, the number of possible candidate links presumably approaches O(n²), where n is the total number of nodes. When n is large, this causes a serious sparsity problem, and finding the links in such a large space can be very challenging.

In this study, we try to address these challenges by proposing a novel unsupervised probabilistic graphical model. First, we devise a factor graph model with three layers of random variables (candidate, attribute, and count) to infer the existence of unseen-type links. Second, we define three types of potential functions (attribute-to-candidate, candidate-to-candidate, and candidate-to-count) to integrate diverse information into the factor graph model. Third, we design a ranked-margin learning algorithm to automatically tune the parameters using aggregative statistics. Finally, we design a two-stage inference algorithm to update the candidate-to-count potential functions, and optimize the outputs.

The main contributions of this study are as follows:

• We propose and formulate a novel yet practical problem: predicting links of unseen type using aggregative statistics in heterogeneous social networks.

• We devise an unsupervised learning framework to solve the above-mentioned problem.

• We evaluate our method on four diverse scenarios using different heterogeneous social network datasets: preference prediction (Foursquare), repost prediction (Twitter), response prediction (Plurk), and citation prediction (DBLP). We also apply nine unsupervised models to this problem as baselines. Our model not only outperforms them in all scenarios, but also achieves on average 9.79% AUC and 12.81% NDCG improvement over the best comparing methods.

2.2 Problem Formulation

We start by formulating the problem.

Definition 1. Heterogeneous social network N = ( V, E, ΩV, ΩE ) is a directed graph, where V is a set of nodes, ΩV is a set of node labels, ΩE is a set of link labels, and E ⊆ V × ΩE × V is a set of links.

The function type(v) → lV maps node v onto its node label lV ∈ ΩV. Similarly, given a triplet < source, link-label, target > as a link, the function type(e) → lE maps link e onto its link label lE ∈ ΩE.

For the example shown in Figure 2-1, there are 7 nodes and 7 links, with ΩV = { “user”, “item”, “category” } and ΩE = { “be-friend-of”, “own”, “belong-to” }. For brevity, we denote by U ⊆ V the set of nodes of type “user”, by R ⊆ V the set of nodes of type “item”, and by C ⊆ V the set of nodes of type “category”.
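As a rough illustration of Definition 1, the Figure 2-1 example can be written down as plain Python data. This is only a sketch: the names NODE_TYPE, LINKS, node_type, link_type, and nodes_of_type are made up for illustration, and two of the links (which user owns r3, and which category holds r2) are assumptions, because the worked examples in this chapter fix only the other five.

```python
from typing import Dict, Set, Tuple

# A link is a triplet <source, link-label, target>.
Link = Tuple[str, str, str]

# Node labels for the 7 nodes of the running example.
NODE_TYPE: Dict[str, str] = {
    "u1": "user", "u2": "user",
    "r1": "item", "r2": "item", "r3": "item",
    "c1": "category", "c2": "category",
}

# The 7 links. ASSUMPTION: <u2, own, r3> and <r2, belong-to, c2> are guesses
# made for illustration; the other five follow from the chapter's examples.
LINKS: Set[Link] = {
    ("u1", "be-friend-of", "u2"),
    ("u1", "own", "r1"), ("u1", "own", "r2"), ("u2", "own", "r3"),
    ("r1", "belong-to", "c1"), ("r3", "belong-to", "c1"),
    ("r2", "belong-to", "c2"),
}


def node_type(v: str) -> str:
    """type(v): map a node onto its node label."""
    return NODE_TYPE[v]


def link_type(e: Link) -> str:
    """type(e): map a link onto its link label."""
    return e[1]


def nodes_of_type(label: str) -> Set[str]:
    """All nodes carrying the given node label."""
    return {v for v, t in NODE_TYPE.items() if t == label}


U = nodes_of_type("user")
R = nodes_of_type("item")
C = nodes_of_type("category")
OMEGA_V = set(NODE_TYPE.values())
OMEGA_E = {link_type(e) for e in LINKS}
```

Storing links as triplets keeps the code close to the E ⊆ V × ΩE × V notation, at the cost of linear scans for lookups, which is acceptable for a toy network of this size.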

The relationships between node labels and link labels can be enumerated. For instance, a user u may “be-friend-of” another user v (i.e., < u, “be-friend-of”, v >); a user u may “own” an item r (i.e., < u, “own”, r >); and an item r may “belong-to” a category c (i.e., < r, “belong-to”, c >).

Definition 2. Unseen-type links are a set of links with a special type “?”; links of this type do not appear in a given heterogeneous social network. That is, unseen-type links Φ = { φ | φ = < source, “?”, target >, type(source) ∈ ΩV, type(target) ∈ ΩV, “?” ∉ ΩE }.

For the example in Figure 2-1, the unseen-type links denote the “like” behavior. That is, Φ = { < u, “like”, r > } denotes the set of links in which user u likes item r. We use < u, r > to denote the candidate pairs of unseen-type links, and there are |U| ∙ |R| = 6 plausible candidate pairs in Figure 2-1.
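The candidate count can be checked with a few lines of Python. This is a minimal sketch; the function name candidate_pairs and the list representation are illustrative choices, not part of the formulation.

```python
from itertools import product
from typing import List, Tuple

U = ["u1", "u2"]        # users in Figure 2-1
R = ["r1", "r2", "r3"]  # items in Figure 2-1


def candidate_pairs(users: List[str], items: List[str]) -> List[Tuple[str, str]]:
    """Every <u, r> pair is a plausible unseen-type link."""
    return list(product(users, items))


CANDIDATES = candidate_pairs(U, R)
print(len(CANDIDATES))  # |U| * |R|
```

Because the candidate set is the full Cartesian product, it grows with |U| ∙ |R|, which is the source of the sparsity challenge noted in the overview.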

Definition 3. Aggregative statistic is the total unseen-type link count of a target node. In other words, the aggregative statistic of a node v ∈ V is σ(v, Φ) = | { φ | φ = < source, “?”, target > ∈ Φ, target = v } |, which is a non-negative integer.

In our example, the aggregative statistic of an item r2 ∈ R is σ(r2, Φ) = | { φ | φ = < u, “like”, r > ∈ Φ, r = r2 } | = 1.

Definition 4. Aggregative statistics of a heterogeneous social network T(N, Φ) = { < v, σ(v, Φ) > | v ∈ V } is the set of aggregative statistics of the unseen links for a heterogeneous social network N.

In Figure 2-1, the aggregative statistics of heterogeneous social network N is T(N, Φ) = { < r1, 2 >, < r2, 1 >, < r3, 1 > }.
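Definitions 3 and 4 can be sketched in Python as follows. Note that the true set Φ is hidden in the actual problem; the PHI below is a hypothetical set of “like” links, chosen only so that its counts match the example, and the function names are illustrative.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str, str]  # <source, link-label, target>

# HYPOTHETICAL unseen-type links: in the real problem these are unknown.
# They are picked here only to reproduce t(r1) = 2, t(r2) = 1, t(r3) = 1.
PHI: Set[Link] = {
    ("u1", "like", "r1"), ("u2", "like", "r1"),
    ("u1", "like", "r2"),
    ("u2", "like", "r3"),
}


def sigma(v: str, phi: Set[Link]) -> int:
    """Aggregative statistic: number of unseen-type links whose target is v."""
    return sum(1 for (_source, _label, target) in phi if target == v)


def aggregative_statistics(nodes: Iterable[str], phi: Set[Link]) -> Set[Tuple[str, int]]:
    """T(N, Phi) restricted to the given nodes, as a set of <v, sigma(v)> pairs."""
    return {(v, sigma(v, phi)) for v in nodes}


T = aggregative_statistics(["r1", "r2", "r3"], PHI)
```

In the prediction setting only T is observed, so the direction of computation is reversed: the model must recover a plausible Φ from T and the network.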


Based on the above definitions, we formulate the unseen-type link prediction with aggregative statistics problem as follows: given a heterogeneous social network N and the corresponding aggregative statistics T(N, Φ), predict the existence of unseen-type links Φ.

The relational schema for our example is shown in Figure 2-2: given the heterogeneous social network (3 types of nodes and 3 types of edges) and the aggregative statistics of “like”, predict whether each < u, “like”, r > exists or not, where u ∈ U and r ∈ R.

Figure 2-2. Relational schema of the unseen-type link prediction with aggregative statistics problem shown in Figure 2-1.

[Figure 2-2: node types User, Item, and Category; link types be-friend-of, own, and belong-to; the unseen link type “like ?”; and the aggregative statistics for “like”.]

2.3 Methodology

We first propose to solve this problem using a probabilistic model. Then, we use an illustrative example to demonstrate our model. Finally, we describe a novel learning algorithm utilizing the aggregative statistics to learn the model parameters, as well as a two-stage inference algorithm to predict unseen-type links.

2.3.1 Factor Graph Model with Aggregative Statistics (FGM-AS)

To handle this problem, we propose a novel probabilistic graphical model: factor graph model with aggregative statistics (FGM-AS), as shown in Figure 2-3. There are three layers of variables in FGM-AS:

Figure 2-3. Factor graph model with aggregative statistics (FGM-AS).

[Figure 2-3: attribute layer A (a1, a2, a3), candidate layer Y (y1, y2, y3), and count layer T (t1, t2), connected by the potential functions f(A, yi), g(Y, yi), and h(T, yi).]

• Candidate: the binary random variables Y in the candidate layer represent all unseen-type links to be predicted. Each link either exists (positive) or does not exist (negative). Each candidate yi can be regarded as a pair of a user and an item, < u, r >. Note that some y’s may refer to the same user, while others may share the same item.

• Attribute: the random variables A in the attribute layer carry attribute information (e.g., a1 represents the degree of the source node and a2 represents the degree of the target node) of the candidate links.

• Count: the random variables T in the count layer encode the aggregative statistics of the items. Note that t is a one-to-one mapping of an item r, but a one-to-many mapping of y because there are some y’s sharing the same item (e.g., candidates y1 and y2 point to the same t1 as they have the same item r).

Together with the random variables, we also propose three types of potential functions:

• Attribute-to-candidate functions: we define this type of potential function as a linear exponential function

$$ f(A, y_i) = \frac{1}{Z_\alpha} \exp\{\alpha^{\top} f'(A, y_i)\} \tag{1} $$

• Candidate-to-candidate functions: this type of potential function is defined as

$$ g(Y, y_i) = \frac{1}{Z_\beta} \exp\{\beta^{\top} g'(Y, y_i)\} \tag{2} $$

where g’(Y, yi) is a vector of functions representing the relationships between candidate random variables (see Section 2.3.4 for a detailed example), β is a vector of weights, and Zβ is a normalization factor.

• Candidate-to-count functions: this type of potential function is defined as

$$ h(T, y_i) = \frac{1}{Z_\gamma} \exp\{\gamma^{\top} h'(T, y_i)\} \tag{3} $$

where h’(T, yi) is a vector of functions representing the constraints of aggregative statistics (see Section 2.3.5 for a detailed example), γ is a vector of weights, and Zγ is a normalization factor. More precisely, this type of potential function enforces the following condition: for each item, the sum of the predicted marginal probabilities of its candidate random variables should be as close to the total count of that item as possible.
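All three potential functions share the same linear exponential shape, which the following sketch makes concrete. The function name potential and the example numbers are illustrative only; in particular, how the normalization factor is computed is not specified in this section, so here it is simply passed in as an argument.

```python
import math
from typing import Sequence


def potential(weights: Sequence[float], features: Sequence[float], z: float = 1.0) -> float:
    """Linear exponential potential: (1 / Z) * exp(weights . features).

    ASSUMPTION: z is supplied by the caller; this sketch does not say how
    the normalization factor would be obtained in FGM-AS.
    """
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    score = sum(w * x for w, x in zip(weights, features))
    return math.exp(score) / z


# Example with made-up weights for a three-dimensional feature vector.
alpha = [0.5, 1.0, -0.2]
features = [1.0, 1.0, 2.0]
value = potential(alpha, features)
```

With all weights at zero the potential is constant, so the weights decide how strongly each feature function pushes a candidate toward being positive.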

According to the FGM-AS model, when the candidates, attributes, and counts are known, we can define the joint distribution as

$$ P(A, T, Y) = \prod_{i} f(A, y_i)\, g(Y, y_i)\, h(T, y_i) \tag{4} $$

Therefore, the marginal probability of candidate random variable yi being positive (e.g., like) is

$$ P(A, T, Y, y_i) = \sum_{j} P(A, T, Y, y_j), \quad y_j \in Y / \{y_i\} \tag{5} $$

The marginal probability P(A, T, Y, yi = 1) is the desired output in our problem, as it tells us, for yi = < u, r >, how likely u is to like r.

2.3.2 An Illustrative Example of FGM-AS

We believe that FGM-AS is a general graphical model for solving the unseen-type link prediction problem. The three layers of random variables and the three types of potential functions can be flexibly defined for different application contexts. Here we use FGM-AS to predict whether a user likes an item or not. Figure 2-4 illustrates an example of FGM-AS, which is built from the heterogeneous social network shown in Figure 2-1. The three layers of random variables are defined as:

Figure 2-4. An example of FGM-AS based on Figure 2-1's network.

[Figure 2-4: attribute layer (u1, u2, r1, r2, r3, c1, c2), candidate layer (y1 = <u1, r1>, y2 = <u2, r1>, y3 = <u1, r3>, y4 = <u1, r2>, y5 = <u2, r2>, y6 = <u2, r3>), and count layer (t1, t2, t3), connected by the potential functions f(.), g(.), and h(.).]

• Candidate: candidate random variables Y = { yi | i = 1, 2, …, |U| ∙ |R| } represent the set of plausible links < u, r > to be predicted. In other words, each pair yi = < u, r > indicates whether the user u likes the item r. For example, y1 = < u1, r1 > represents whether user u1 likes item r1. Note that u1 is not necessarily the owner of r1.

• Attribute: attribute random variables A = U ∪ R ∪ C contain three groups of information: users U = { u1, u2, …, u|U| }, items R = { r1, r2, …, r|R| }, and categories C = { c1, c2, …, c|C| }. We use u(yi) to denote the corresponding user, r(yi) to denote the corresponding item, and c(yi) to denote the corresponding category of yi.

• Count: count random variables T = { t1, t2, …, t|R| } represent the aggregative statistics (total like count) of each item. Note that |T| = |R| because t is a one-to-one mapping of r. We use t(yi) to denote the corresponding count of yi.

The design of the three potential functions is described in the following three subsections.

2.3.3 Attribute-to-Candidate Function

According to Equation (1), we define f’(A, yi) = < fUF(u(yi)), fIO(u(yi), r(yi)), fCP(c(yi)) >. The functions fUF, fIO, and fCP are based on user friendship, item ownership, and category popularity, and are defined below:

• User friendship (UF) function: fUF(u(yi)) = the number of friends of u(yi). The intuition behind UF is that the number of friends a user has can influence his / her tendency to like an item. In Figure 2-1, fUF(u(y1)) = fUF(u1) = 1, because user u1 has only one friend (u2).

• Item ownership (IO) function: fIO(u(yi), r(yi)) = 1 if r(yi) is owned by u(yi), otherwise 0.

• Category popularity (CP) function: fCP(c(yi)) = the number of items in the whole dataset that belong to the same category as c(yi). The intuition behind CP is that users tend to like items belonging to a hot category (i.e., a category that contains many items). In Figure 2-1, fCP(c(y1)) = fCP(c1) = 2, because there are two items belonging to c1.
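The three attribute-to-candidate functions can be sketched on the toy network as follows. The helper names (friends, category_of, f_uf, f_io, f_cp, f_prime) are illustrative; two of the links are assumptions, as marked in the comments, and friendship is treated as symmetric even though only one directed be-friend-of link is stored.

```python
from typing import Set, Tuple

Link = Tuple[str, str, str]       # <source, link-label, target>
Candidate = Tuple[str, str]       # <u, r>

# Toy network modelled on Figure 2-1. ASSUMPTION: <u2, own, r3> and
# <r2, belong-to, c2> are guesses; the text fixes only the other links.
LINKS: Set[Link] = {
    ("u1", "be-friend-of", "u2"),
    ("u1", "own", "r1"), ("u1", "own", "r2"), ("u2", "own", "r3"),
    ("r1", "belong-to", "c1"), ("r3", "belong-to", "c1"),
    ("r2", "belong-to", "c2"),
}


def friends(u: str) -> Set[str]:
    """Friends of u. ASSUMPTION: be-friend-of is read as symmetric."""
    out = {t for s, label, t in LINKS if label == "be-friend-of" and s == u}
    inc = {s for s, label, t in LINKS if label == "be-friend-of" and t == u}
    return out | inc


def category_of(r: str) -> str:
    """Category that item r belongs to (each item has exactly one here)."""
    return next(t for s, label, t in LINKS if label == "belong-to" and s == r)


def f_uf(u: str) -> int:
    """User friendship: number of friends of u."""
    return len(friends(u))


def f_io(u: str, r: str) -> int:
    """Item ownership: 1 if u owns r, else 0 (the 0 branch is assumed)."""
    return 1 if (u, "own", r) in LINKS else 0


def f_cp(c: str) -> int:
    """Category popularity: number of items belonging to category c."""
    return sum(1 for _s, label, t in LINKS if label == "belong-to" and t == c)


def f_prime(y: Candidate) -> Tuple[int, int, int]:
    """Feature vector f'(A, y) = <f_UF, f_IO, f_CP> for candidate y = <u, r>."""
    u, r = y
    return (f_uf(u), f_io(u, r), f_cp(category_of(r)))


y1 = ("u1", "r1")
features_y1 = f_prime(y1)
```

The two values stated in the text, fUF(u1) = 1 and fCP(c1) = 2, are reproduced by this sketch regardless of the two assumed links.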

2.3.4 Candidate-to-Candidate Function

According to Equation (2), we define g’(Y, yi) = < Σj gOI(yi, yj), Σj gFI(yi, yj), Σj gOF(yi, yj), Σj gCC(yi, yj), Σj gCI(yi, yj) >, yj ∈ Y / {yi}. The functions gOI, gFI, gOF, gCC, and gCI are based on owner, friend, owner-friend, co-category, and common-interest relationships, and are defined as follows:

• Owner-identification (OI) function: gOI(yi, yj) = 1 if < u(yi), “own”, r(yi) > ∈ E, < u(yj), “own”, r(yj) > ∈ E, and u(yi) = u(yj); otherwise 0. The intuition is that an owner tends to like all of his / her own items. For example, in Figure 2-1, u1 likes both r1 and r2, because u1 owns both items. Therefore, there is a relation between y1 and y4 in Figure 2-4.

• Friend-identification (FI) function: gFI(yi, yj) = 1 if < v, “own”, r(yi) > ∈ E, < v, “own”, r(yj) > ∈ E, u(yi) = u(yj), and v ∈ friend(u(yi)); otherwise 0. The intuition is that a person may like a friend’s items. For example, u2 likes both r1 and r2, because u2’s friend u1 owns both items. Therefore, there is a relation between y2 and y5.

• Owner-friend (OF) function: gOF(yi, yj) = 1 if < u(yi), “own”, r(yi) > ∈ E, r(yi) = r(yj), and u(yi) ∈ friend(u(yj)); otherwise 0. The intuition is that if an owner likes his / her own item, his / her friends tend to like the item too. For example, if u1 likes his / her item r1, then his / her friend u2 tends to like r1 as well. In other words, there is a relation between y1 and y2.

• Co-category (CC) function: gCC(yi, yj) = 1 if < u(yi), “own”, r(yi) > ∈ E, u(yi) = u(yj), and c(yi) = c(yj); otherwise 0. The intuition is that the extent to which an owner likes an item is similar to the extent to which the owner likes other items in the same category. For example, if u1 tends to like item r1, then u1 may also like r3, because r1 and r3 are in the same category c1. Thus, there is a relation between y1 and y3.

• Common-interest (CI) function: gCI(yi, yj) = 1 if < u(yi), “be-friend-of”, u(yj) > ∈ E, and r(yi) = r(yj); otherwise 0. The intuition is that if a user likes an item, his / her friends tend to like the item too. For example, if u1 likes an item r2, then his / her friend u2 tends to like r2 as well. In other words, there is a relation between y4 and y5.
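The five pairwise functions translate almost directly into code. The sketch below reuses the toy network with the same two assumed links, numbers the candidates as in Figure 2-4, and treats friend(·) as symmetric; all function and variable names are illustrative.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str, str]
Candidate = Tuple[str, str]  # <u, r>

# ASSUMPTION: <u2, own, r3> and <r2, belong-to, c2> are guesses.
LINKS: Set[Link] = {
    ("u1", "be-friend-of", "u2"),
    ("u1", "own", "r1"), ("u1", "own", "r2"), ("u2", "own", "r3"),
    ("r1", "belong-to", "c1"), ("r3", "belong-to", "c1"),
    ("r2", "belong-to", "c2"),
}

# Candidate numbering as in Figure 2-4.
Y: Dict[int, Candidate] = {
    1: ("u1", "r1"), 2: ("u2", "r1"), 3: ("u1", "r3"),
    4: ("u1", "r2"), 5: ("u2", "r2"), 6: ("u2", "r3"),
}


def owns(u: str, r: str) -> bool:
    return (u, "own", r) in LINKS


def friends(u: str) -> Set[str]:
    """ASSUMPTION: friendship is symmetric."""
    return ({t for s, l, t in LINKS if l == "be-friend-of" and s == u}
            | {s for s, l, t in LINKS if l == "be-friend-of" and t == u})


def category_of(r: str) -> str:
    return next(t for s, l, t in LINKS if l == "belong-to" and s == r)


def g_oi(yi: Candidate, yj: Candidate) -> int:
    (ui, ri), (uj, rj) = yi, yj
    return int(owns(ui, ri) and owns(uj, rj) and ui == uj)


def g_fi(yi: Candidate, yj: Candidate) -> int:
    (ui, ri), (uj, rj) = yi, yj
    return int(ui == uj and any(owns(v, ri) and owns(v, rj) for v in friends(ui)))


def g_of(yi: Candidate, yj: Candidate) -> int:
    (ui, ri), (uj, rj) = yi, yj
    return int(owns(ui, ri) and ri == rj and ui in friends(uj))


def g_cc(yi: Candidate, yj: Candidate) -> int:
    (ui, ri), (uj, rj) = yi, yj
    return int(owns(ui, ri) and ui == uj and category_of(ri) == category_of(rj))


def g_ci(yi: Candidate, yj: Candidate) -> int:
    (ui, ri), (uj, rj) = yi, yj
    return int((ui, "be-friend-of", uj) in LINKS and ri == rj)


def g_prime(i: int) -> Tuple[int, ...]:
    """g'(Y, y_i): each pairwise function summed over all y_j != y_i."""
    others = [Y[j] for j in Y if j != i]
    return tuple(sum(g(Y[i], yj) for yj in others)
                 for g in (g_oi, g_fi, g_of, g_cc, g_ci))
```

Each relation named in the text (y1 and y4, y2 and y5, y1 and y2, y1 and y3, y4 and y5) evaluates to 1 under the corresponding function in this sketch.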

2.3.5 Candidate-to-Count Function

According to Equation (3), we define h’(T, yi) = < hCT(yi, t(yi)) >. The function hCT is

\[ h_{CT}(y_i, t(y_i)) = 1 - \frac{\left| \sum_{y_j \in Y,\; r(y_j) = r(y_i)} P(A, T, Y, y_j = 1) - t(y_i) \right|}{|U|} \tag{6} \]

The summation term in Equation (6) sums up the probabilities of a certain item r(yi) being liked by each user, which we hope to be as close as possible to the observed “like” count of this item. Thus, the difference between this term and t(yi) represents how close the prediction is to the known aggregative statistic. We divide this difference by |U| for normalization. Ideally, the difference is 0, and thus hCT(yi, t(yi)) = 1. Also, 0 ≤ hCT(yi, t(yi)) ≤ 1.
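As a small worked illustration of the description above (sum of per-user probabilities for one item, compared with the observed count, normalized by |U|), the following Python sketch computes the value. The function name and argument layout are assumptions made for this example.

```python
def candidate_to_count(probs_for_item, like_count, num_users):
    """Closeness of the predicted total to the observed count, in [0, 1].

    probs_for_item: P(y_j = 1) for every candidate pair y_j that shares the item.
    like_count:     observed aggregative statistic t(y_i) for that item.
    num_users:      |U|, used for normalization.
    """
    predicted_total = sum(probs_for_item)
    return 1.0 - abs(predicted_total - like_count) / num_users


# Four users, observed count 2: a prediction that matches the count exactly.
exact_match = candidate_to_count([1.0, 1.0, 0.0, 0.0], like_count=2, num_users=4)
# Four users, observed count 0: a prediction that overshoots by 2.
overshoot = candidate_to_count([0.5, 0.5, 0.5, 0.5], like_count=0, num_users=4)
print(exact_match, overshoot)
```

The value drops linearly as the predicted total moves away from the observed count, which is what lets the count act as a soft constraint rather than a hard one.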

It should be noted that the terms P(A, T, Y, yj = 1) are no longer random variables but their posterior probabilities. Therefore, conventional exact or approximate inference methods cannot be applied directly. To update them accordingly, we design a two-stage inference algorithm, which is described at the end of Section 2.3.6.

2.3.6 Ranked-Margin Learning for FGM-AS

The key factor that contributes to the success of FGM-AS lies in the algorithm’s capability of learning the parameters without labeled data. Here we discuss the main idea. Given a parameter configuration θ = (α, β, γ) and based on Equations (1) – (4), the joint probability P(A, T, Y) can be written as

\[ P(A, T, Y) = \frac{1}{Z} \exp\Big\{ \sum_i \theta \, \langle f'(A, y_i), g'(Y, y_i), h'(T, y_i) \rangle \Big\} = \frac{1}{Z} \exp\Big\{ \theta \sum_i s(y_i) \Big\} = \frac{1}{Z} \exp\{ \theta S \} \tag{7} \]

where all potential functions for a yi are written as s(yi) = < f’(A, yi), g’(Y, yi), h’(T, yi) >, Z = Zα Zβ Zγ, and S = Σi s(yi).


Now, we discuss how to learn the parameters of the model. Traditionally, the idea of maximum-likelihood estimation (MLE) can be exploited, and algorithms such as EM can be applied to achieve this goal. Alternatively, for a factor graph, algorithms such as gradient descent can be exploited to greedily search the parameter space. However, in our scenario, the absence of labels eliminates the possibility of exploiting an MLE strategy for learning. Moreover, even if one can somehow come up with an approximate objective to be maximized in the M-step of EM, the total number of hidden variables in this graph grows to |U| ∙ |R|, which can lead to a very high computational cost for parameter learning.

To effectively and efficiently perform the learning task, we propose a novel idea to maximize the ranked-margin of the instances, incorporating the aggregative statistics into the objective function. The intuition is to assume that the count for an item r(yi) is t(yi), which means that among all candidate users, only t(yi) of them like this item.

Therefore, during learning we want to adjust the parameters so that the top t(yi) users have very high probabilities of liking this item while the rest have very low probabilities of liking it. To realize this idea, we propose to do the following. For each item r, first rank each user ui based on the marginal probability of y = < ui, r >. Then, let P(Y_r^{upper}) be the


An extreme example is that the marginal probabilities of the top t(yi) candidate pairs are all 1, while the rest are all 0. In this case Diff(Y_r^{margin}) = 1 – 0 = 1. Another extreme example is that the marginal probabilities of all candidate pairs are equal, which results in Diff(Y_r^{margin}) = 0. Thus, 0 ≤ Diff(Y_r^{margin}) ≤ 1.
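The two extreme cases can be checked with a short Python sketch. It rests on an assumption that is consistent with both examples but not confirmed by the text available here: the upper term is taken as the mean marginal probability of the top t ranked candidate pairs, and the lower term as the mean over the remaining pairs. The function name is hypothetical.

```python
def ranked_margin_diff(marginals, t):
    """Difference between the mean marginal of the top-t pairs and of the rest.

    Assumes (for illustration) that both groups are summarized by their mean.
    """
    ranked = sorted(marginals, reverse=True)
    upper, lower = ranked[:t], ranked[t:]
    if not upper or not lower:
        raise ValueError("t must satisfy 0 < t < number of candidate pairs")
    return sum(upper) / len(upper) - sum(lower) / len(lower)


# Extreme case 1: the top-2 pairs have probability 1, the rest 0.
separated = ranked_margin_diff([1.0, 0.0, 1.0, 0.0, 0.0], t=2)
# Extreme case 2: all pairs have the same probability.
uniform = ranked_margin_diff([0.5, 0.5, 0.5, 0.5, 0.5], t=2)
print(separated, uniform)
```

Because the pairs are sorted before splitting, the upper mean can never be smaller than the lower mean, so the difference stays between 0 and 1 under this assumption.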

Based on the above idea and Equation (8), we define the log-likelihood objective function to be maximized as

\[ O(\theta, r) = \log P(Y_r^{margin}) = \log \sum_{Y_r^{margin}} \frac{1}{Z} \exp\{ \theta S \} = \log \sum_{Y_r^{upper}} \exp\{ \theta S \} - \log \sum_{Y_r^{lower}} \exp\{ \theta S \} \tag{9} \]

Besides the intuitiveness of Equation (8) with respect to the count as mentioned, there are two other advantages of using Equation (9) as our objective function. First, it should be noted that computing the normalization factor Z in Equation (7) is very time-consuming. However, for Equation (9), we can essentially eliminate Z to avoid the high computational cost during learning. Second, the gradient of Equation (9) can be obtained through sampling using any inference algorithm (as shown below).

To maximize the objective function, we exploit an idea similar to the Stochastic Gradient Descent (SGD) method, as shown in Algorithm 2-1. We calculate the gradient and update the parameters for each item iteratively until convergence, then move on to the next item (η is the learning rate of our algorithm). The gradient for each parameter θ and item r is

\[ \frac{\partial O(\theta, r)}{\partial \theta} = \frac{\partial \left( \log \sum_{Y_r^{upper}} \exp\{ \theta S \} - \log \sum_{Y_r^{lower}} \exp\{ \theta S \} \right)}{\partial \theta} = \frac{\sum_{Y_r^{upper}} \exp\{ \theta S \} \cdot S}{\sum_{Y_r^{upper}} \exp\{ \theta S \}} - \frac{\sum_{Y_r^{lower}} \exp\{ \theta S \} \cdot S}{\sum_{Y_r^{lower}} \exp\{ \theta S \}} = \mathbb{E}_{P(Y_r^{upper})} S - \mathbb{E}_{P(Y_r^{lower})} S \tag{10} \]

where E_{P(Y_r^{upper})} S and E_{P(Y_r^{lower})} S are two expected values of S. The expected values can be obtained naturally using approximate inference algorithms, such as Gibbs Sampling or Contrastive Divergence. It should be noted that the proposed ranked-margin algorithm can be exploited not just for graphical models, but also for other learning models as long as the gradient of the expected difference can be calculated.
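The identity behind the gradient (the derivative of a log-sum-exp is a softmax-weighted expectation) can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the learning procedure itself: each element of `upper` and `lower` is treated as the potential-value vector S of one explicitly enumerated configuration, whereas the text estimates the expectations through sampling. All names and numbers are hypothetical.

```python
import math


def _scores(theta, feats):
    """Inner products theta . S for every configuration."""
    return [sum(t * s for t, s in zip(theta, vec)) for vec in feats]


def log_sum_exp(theta, feats):
    """Numerically stable log of the sum of exp{theta . S}."""
    scores = _scores(theta, feats)
    m = max(scores)
    return m + math.log(sum(math.exp(x - m) for x in scores))


def expected_s(theta, feats):
    """Expectation of S under weights proportional to exp{theta . S}."""
    scores = _scores(theta, feats)
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    total = sum(weights)
    return [sum(w * vec[k] for w, vec in zip(weights, feats)) / total
            for k in range(len(theta))]


def objective(theta, upper, lower):
    """Difference of the two log-sum-exp terms; Z does not appear."""
    return log_sum_exp(theta, upper) - log_sum_exp(theta, lower)


def gradient(theta, upper, lower):
    """Difference of the two expectations of S."""
    e_up, e_lo = expected_s(theta, upper), expected_s(theta, lower)
    return [a - b for a, b in zip(e_up, e_lo)]


theta = [1.0, 1.0, 1.0]                         # all parameters initialized to 1
upper = [[0.9, 0.8, 1.0], [0.7, 0.6, 0.9]]      # toy S vectors, top-ranked group
lower = [[0.2, 0.1, 0.8], [0.1, 0.3, 0.7]]      # toy S vectors, remaining group
grad = gradient(theta, upper, lower)
eta = 0.0001                                    # learning rate
theta_next = [t + eta * g for t, g in zip(theta, grad)]
print(grad)
```

Note that the update adds the gradient, so despite the SGD analogy the step is an ascent step on the objective.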

Algorithm 2-1. Ranked-margin learning algorithm.

Input: FGM-AS, learning rate η
Output: P(A, T, Y, yi = 1) for all yi ∈ Y
Initialize all elements in parameter configuration θ = 1
repeat
    Run inference method using current θ to obtain P(A, T, Y, yi = 1)
    Compute potential function values S according to Eq. (1) – (7)
    foreach r ∈ R do
        Compute gradient ∂O(θ, r)/∂θ using S according to Eq. (10)
        θ = θ + η ∙ ∂O(θ, r)/∂θ
until convergence

In Algorithm 2-1, we need to perform an inference algorithm on the factor graph, to obtain the marginal probability of each candidate pair y. Also, after the parameters are learned, we need to apply the inference algorithm again to compute the marginal probability, representing how likely the person likes the item. Unfortunately, such inference cannot directly be done as P(A, T, Y, yi = 1) in Equation (6) requires the posterior probabilities of y.

Thus, we design a two-stage inference algorithm (Algorithm 2-2). In the first stage, we perform a general inference method using f(A, yi) and g(Y, yi) only (by assigning all h(T, yi) = 1) to initialize P(A, T, Y, yi = 1). In the second stage, we compute h(T, yi) using P(A, T, Y, yi = 1), and then perform inference one more time. This way, we integrate the posterior information into the inference process.

Algorithm 2-2. Two-stage inference algorithm.

Input: FGM-AS, parameter configuration θ
Output: P(A, T, Y, yi = 1) for all yi ∈ Y
Initialize all yi = 0, all h(T, yi) = 1
stage 1
    Calculate f(A, yi) and g(Y, yi) according to Eq. (1), (2)
    Run an inference method using θ to obtain P(A, T, Y, yi = 1)
stage 2
    Calculate h(T, yi) using P(A, T, Y, yi = 1) according to Eq. (3), (6)
    Run an inference method using θ to obtain final P(A, T, Y, yi = 1)
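The control flow of the two stages can be sketched in Python as follows. This is only a sketch under stated assumptions: `toy_infer` is a hypothetical stand-in (an independent logistic score per candidate pair) for a real inference method such as Loopy Belief Propagation, `base_score` is an assumed summary of the f and g potentials, and the count-based term follows the textual description of the candidate-to-count function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_infer(base_score, h, gamma=1.0):
    """Hypothetical stand-in for inference: one independent logistic per pair."""
    return {y: sigmoid(base_score[y] + gamma * h[y]) for y in base_score}


def two_stage_inference(base_score, like_count, num_users, infer=toy_infer):
    """Run inference once with h = 1, then again with h computed from stage 1."""
    # Stage 1: ignore the counts by fixing every h(T, y) to 1.
    h = {y: 1.0 for y in base_score}
    probs = infer(base_score, h)

    # Stage 2: recompute h from the stage-1 marginals, then infer again.
    predicted_total = {}
    for (user, item), p in probs.items():
        predicted_total[item] = predicted_total.get(item, 0.0) + p
    h = {
        (user, item): 1.0 - abs(predicted_total[item] - like_count[item]) / num_users
        for (user, item) in base_score
    }
    return infer(base_score, h)


# Two users, one item with an observed like count of 1 (toy values).
base_score = {("u1", "r1"): 2.0, ("u2", "r1"): -2.0}
final = two_stage_inference(base_score, like_count={"r1": 1}, num_users=2)
print(final)
```

Passing the inference routine as a parameter keeps the two-stage wrapper independent of which base inference method is plugged in.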


2.4 Experiments

Here we want to verify the generality of our model by testing whether it can be applied to datasets in four different scenarios. We also want to verify the usefulness of the potential functions.

2.4.1 Scenarios and Datasets

We study the following four types of scenarios of the unseen-type link prediction problem, each with a real-world dataset. The statistics of the datasets are shown in Table 2-1.

Table 2-1. Statistics of the datasets.

Type   Property        Foursquare      Twitter        Plurk         DBLP
Node   User                71,634       69,026      190,853      102,304
Node   Item               180,684       55,375      352,376      221,935
Node   Category            16,961          100          100          100
Node   Total              269,279      124,501      543,329      324,339
Link   Be-friend-of       724,378   21,979,021    2,151,351      245,391
Link   Own                180,684       55,375      352,376      221,935
Link   Belong-to          180,684       55,375      352,376      221,935
Link   Unseen              15,758       79,918      804,404      123,479
Link   Total            1,101,504   22,169,689    3,660,507      812,740


all tips for these venues, and identify users who posted the tips. We regard venues as categories, and tips as items. Note that due to the privacy policy of Foursquare, only the total like count of each tip is revealed. Only a very limited number (i.e., 15,758) of unseen-type links are revealed, and these serve as the ground truth for evaluation (they are not seen in training).

• Repost prediction. In social network websites, we are interested in predicting whether users will re-blog or retweet a post. Therefore, we use Twitter as the dataset, which is collected from [15]. Twitter is one of the most famous micro-blog websites, and has been used to verify several models with different purposes [15] [20] [47]. In this study, we consider a retweet as the unseen-type link. We keep users who have two or more friends, and have tweeted or retweeted more than once. Then, we perform stemming to identify the 100 most popular terms in tweets as categories, while each tweet is regarded as an item. For example, if a user v posts a tweet r, and later another user u retweets this tweet (with the “RT@” keyword), we consider that an unseen-type link exists from u to r.

• Response prediction. In micro-blog services, we are interested in predicting whether users will respond to a post. We use the Plurk dataset in this scenario. Plurk is a popular micro-blog service in Asia with more than 5 million users, and has been used in studies of diffusion prediction [28], diffusion model evaluation [27], and mood classification [7]. This dataset was collected from 01/2011 to 05/2011. In this study, we consider response-to-message as the unseen-type link. We manually identify the 100 most popular topics as categories, and regard messages as items. For example, if a person v posts a message r, and later another person u responds to this message, we consider that an unseen-type link exists from u to r.

• Citation prediction. In academic indexing and searching services, we are interested in predicting whether researchers will cite a paper. Therefore, we use the DBLP [34] dataset collected from ArnetMiner [52], version 5. In this study, we consider citation-to-paper as the unseen-type link. We first perform stemming, then identify the 100 most popular terms in titles as categories, and regard papers as items. For example, if a researcher v published a paper r, and later another researcher u cites r, we consider that an unseen-type link exists from u to r. Also, we consider two researchers as friends if they have been co-authors of at least one paper in the past.

The mapping of the information in the four abovementioned datasets to the random variables in FGM-AS is shown in Table 2-2. Note that in the above four datasets (Foursquare, Twitter, Plurk, and DBLP), we hide all unseen-link information as ground truth to evaluate our proposed framework. Also note that we obfuscate personal information in all of the datasets.

Table 2-2. Mapping of the random variables for the datasets.

Random Variable Foursquare Twitter Plurk DBLP


It should be noted that the unseen-type links used as ground truth are actually sparse compared to all nodes and relations. For example, in the Twitter dataset, the unseen-to-candidate ratio, |Unseen| / ( |User| ∙ |Item| ), is merely 0.00002. Thus, predicting unseen-type links for these datasets is a very challenging task.
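The ratio can be recomputed directly from the counts in Table 2-1. The short Python sketch below does this for all four datasets; only the Twitter value is stated in the text, and the dictionary layout is an assumption for illustration.

```python
# (|Unseen|, |User|, |Item|) taken from Table 2-1.
DATASETS = {
    "Foursquare": (15_758, 71_634, 180_684),
    "Twitter": (79_918, 69_026, 55_375),
    "Plurk": (804_404, 190_853, 352_376),
    "DBLP": (123_479, 102_304, 221_935),
}


def unseen_to_candidate_ratio(unseen, users, items):
    """|Unseen| / (|User| * |Item|): fraction of candidate pairs that are true links."""
    return unseen / (users * items)


ratios = {name: unseen_to_candidate_ratio(*counts) for name, counts in DATASETS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2e}")
```

For Twitter the number of candidate pairs is 69,026 × 55,375, roughly 3.8 billion, which is why fewer than 80,000 true links yield a ratio on the order of 0.00002.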

2.4.2 Comparing Methods

We use nine unsupervised models for comparison. The first three methods are single attribute-to-candidate functions: UF, IO, and CP. The other six methods are as follows (note that all methods are executed on the whole heterogeneous social network):

• Betweenness Centrality (BC). This method is used to measure an edge's importance in a network. The BC value of an edge equals the number of shortest paths from all nodes to all others that pass through that edge. For each candidate pair, we add a pseudo unseen-type link to the network. Then, we use the BC values of the pseudo links as their prediction scores.

• Jaccard Coefficient (JC). This method is used to directly compute the relatedness of a user u to an item r, which is defined as | neighbor(u) ∩ neighbor(r) | / | neighbor(u) ∪ neighbor(r) |. This score is used to predict whether u likes r.

• Preferential Attachment (PA). This method is based on the assumption that popular users tend to like popular items. Therefore, it is defined as | neighbor(u) | ∙ | neighbor(r) |, which is used as the prediction score.

• Attractiveness (AT). This method is designed to compute user-to-user attractiveness using aggregated counts [61]. We adapt it to predict unseen-type links. It first computes the owner-item attractiveness Pvr from owner v to item r as

\[ P_{vr} = \frac{\sigma(r, \Phi)}{\sum_{c(r') = c(r)} \sigma(r', \Phi)} \tag{11} \]

where Φ is the set of “like” links, and σ(r, Φ) is the aggregative statistic of item r, as defined in Section 2.2. Then, it computes the user-owner attractiveness Puv from user u to v as

\[ P_{uv} = 1 - \prod_{r} \left( 1 - g_{uv} \cdot P_{vr} \right) \tag{12} \]

where guv = 1 if u and v are friends, otherwise 0. To perform link prediction, we further compute the user-item attractiveness Pur (the probability that user u likes item r) as

\[ P_{ur} = P_{uv} \cdot P_{vr} \tag{13} \]

• PageRank with Priors (PRP). This method executes the PageRank algorithm [59] |R| times, once for each item. For a specific item r, we set the prior of the item node to 1,

• AT-PRP. We combine the Attractiveness and PageRank with Priors methods by using the weights of the links. That is, in the heterogeneous social network, we add a link for each < u, r > pair, with weight equal to Pur. We then normalize the weights of all outgoing links to sum to 1, and run PageRank with Priors as mentioned above.
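The two neighborhood-based baselines in the list above (JC and PA) are simple enough to sketch in a few lines of Python. The adjacency-dictionary layout and the toy network are assumptions for illustration; neighbors are taken over all node types of the heterogeneous network.

```python
def jaccard_coefficient(neighbors, user, item):
    """|neighbor(u) ∩ neighbor(r)| / |neighbor(u) ∪ neighbor(r)|."""
    nu, nr = neighbors[user], neighbors[item]
    union = nu | nr
    return len(nu & nr) / len(union) if union else 0.0


def preferential_attachment(neighbors, user, item):
    """|neighbor(u)| * |neighbor(r)|."""
    return len(neighbors[user]) * len(neighbors[item])


# Toy undirected adjacency: u1 owns r1, r1 belongs to category c1, u1 and u2 are friends.
neighbors = {
    "u1": {"u2", "r1"},
    "u2": {"u1"},
    "r1": {"u1", "c1"},
    "c1": {"r1"},
}

jc_score = jaccard_coefficient(neighbors, "u2", "r1")
pa_score = preferential_attachment(neighbors, "u2", "r1")
print(jc_score, pa_score)
```

In this toy network u2 and r1 share one neighbor (u1) out of two distinct neighbors overall, so the Jaccard score rewards the friend-of-owner pattern, while the preferential attachment score depends only on degrees.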

2.4.3 Settings

Because of the sparsity of unseen-type links in the ground truth, we use Area Under ROC Curve (AUC) [9] [36] and Normalized Discounted Cumulative Gain (NDCG) [23] to evaluate our proposed method. For each item, we rank all the candidate pairs based on their predicted positive marginal probabilities, and then compare the rankings with the ground truth to obtain AUC and NDCG scores. Finally, we average the scores over all items.
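The per-item evaluation described above can be sketched as follows. The exact metric variants are assumptions, since the text only cites them: AUC is computed as the fraction of correctly ordered positive-negative pairs (ties count one half), NDCG uses binary relevance with a log2 discount, and items without both a positive and a negative candidate are skipped for AUC.

```python
import math


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        return None  # undefined for this item
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ndcg(scores, labels):
    """Binary-relevance NDCG over the full ranking of one item's candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(sum(labels)))
    return dcg / ideal if ideal > 0 else None


def average_over_items(per_item, metric):
    """Average a metric over items, skipping items where it is undefined."""
    values = [metric(scores, labels) for scores, labels in per_item.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Toy predictions: item -> (predicted probabilities, ground-truth labels) per user.
per_item = {
    "r1": ([0.9, 0.8, 0.1], [1, 1, 0]),   # perfectly ranked
    "r2": ([0.1, 0.2, 0.9], [1, 1, 0]),   # fully reversed
}
mean_auc = average_over_items(per_item, auc)
mean_ndcg = average_over_items(per_item, ndcg)
print(mean_auc, mean_ndcg)
```

Both metrics depend only on the ranking of candidates within an item, which is why they remain informative even when true links are extremely sparse.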

We select Loopy Belief Propagation (LBP) as our base inference method [46], utilize MALLET [42] for LBP inference, and apply LingPipe [2] for stemming. We use JUNG [45] to compute betweenness centrality and PageRank with Priors algorithms.

In FGM-AS, we set all zero potential function values to a small constant (0.000001), and use learning rate η = 0.0001. We run all experiments on a Linux server with AMD Opteron 2350 2.0GHz Quad-core CPU and 32GB memory.

2.4.4 Results

The results of different methods using AUC and NDCG are shown in Table 2-3. The
