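PLPF: A Profile-based Link Prediction Framework in a Time-evolving Graph

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

PLPF - A Profile-based Link Prediction Framework in a Time-evolving Graph

Student: Zheng-Hui Lee

Advisors: Prof. Wen-Chih Peng

Prof. Suh-Yin Lee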

PLPF: A Profile-based Link Prediction Framework in a Time-evolving Graph

Student: Zheng-Hui Lee

Advisors: Wen-Chih Peng, Suh-Yin Lee

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

A Thesis

Submitted to Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

June 2011



PLPF: A Profile-based Link Prediction Framework in a

Time-evolving Graph

Student: Zheng-Hui Lee

Advisor: Dr. Wen-Chih Peng

Dr. Suh-Yin Lee

Institute of Computer Science and Engineering

National Chiao Tung University

ABSTRACT

Connection logs are widely used to record connection behavior between sources and destinations, such as users and sites in the internet network, and are best represented as an evolving graph that illustrates how users' connection behavior changes over time. Given a continuously recorded connection log, when a new destination appears (i.e., it has been connected to by only a few sources so far), one may want to know who else is likely to connect to it next.

In this work, we propose a framework that uses profiles of sources to predict the top-k possible links to a specific destination from k different sources that have never connected to it before. A profile records the connection behavior of a source, including static and surprising features that represent his/her long-term and short-term interests. The concept behind our framework is that sources with similar connection behavior in the past are likely to have similar connection behavior in the future; hence, sources whose profiles are similar to that of the first source to connect to the new destination are the best candidates to connect to the same destination later.

In the experiments, we use a real dataset from the internet network to evaluate the performance of our method and compare it to the state-of-the-art method driven by information flow [20]. The results show that our method is more effective than the previous method, and that its computation time stays consistent as the connection log grows, whereas the computation time of the previous method increases. Furthermore, we show that surprising features of connection behavior are much more useful when predicting new links in the internet network.



List of Tables

3.1 5 connections in the internet network, sorted by time.

3.2 5 connections (email transmissions) in an email system, sorted by date.

4.1 Edge attributes of a graph snapshot G(d) from the connection log in Figure 3.1a.

4.2 The four features used in a profile.

4.3 A profile consists of three different attributes.

4.4 An example of calculating the advanced similarity based on common ratio and connection orders with an originator whose connected sequence: < a, b, c, d, e, f > and β = 0.1

5.1 MAP scores for different propagation models used in EABIF Network

5.2 MAP scores for each method

5.3 4 different combination of parameters β, γ, s_s, s_l for getting best similarities in static features and surprising features

5.4 3 different parameter sets for PLPF to use static features only, surprising


List of Figures

1.1 An example of the evolving graph.

1.2 A rough example of the connection profile derived from Figure 1.1.

1.3 System framework.

3.1 Viewing a connection log as a connection network.

(a) Connection Log

(b) Connection Network

3.2 An evolving graph consists of several graph snapshots G(1), G(2), . . . .

4.1 The whole framework of PLPF.

5.1 Precision & Recall to rank k for each method.

(a) Precision at rank k

(b) Recall at rank k

5.2 The history data used in LastDay profiles and Instant profile.

5.3 MAP scores when using LastDay profiles and Instant profiles.

5.4 Computation time cost for a single query.

5.5 Tuning γ and s_l for similarity using static features only.

5.6 Tuning γ and s_l for similarity using static features only.

5.7 Tuning β and s_s for similarity using surprising features only.

5.8 Tuning β and s_s for similarity using surprising features only.

5.9 Tuning α to combine both similarities using static features and surprising features.


Chapter 1 Introduction

The link prediction problem in social networks is regarded as a useful and interesting research issue because it captures social activities in advance, and it is widely applied in other domains such as bioinformatics, computer networks and recommendation systems [8, 10, 16]. A general link prediction problem aims to predict future links [12, 14, 18, 22] or unobserved links in the current graph [2, 5, 9, 22]. In this work, as in [20], we focus on links from sources to a specific destination that they have never connected to before. For example, when a user (the source) connects to Facebook (the destination), which is a newly appeared website, we want to predict who else will also establish connections to Facebook. Empirically, predicting such nonexistent connections in the internet helps us to recommend interesting sites to users [11], identify anomalous connection links [13] and prevent network attacks such as phishing [4]. To solve this problem, users with similar connection behavior should be identified, since they are the candidates who will behave similarly and are likely to connect to the same site in the future.

Previous studies model the similarity between users by calculating their common features [2, 14, 19], reachable distances [14], consistent structures [19, 24] and information transition probability [20]. However, these works consider the historical behavior only as a static graph in a single time slot, and ignore the evolution of users’ behavior. The evolution of a user’s behavior can help us find the user’s stable interests, such as regular connection behavior, and new interests, such as newly connected sites that are seen only recently in the temporal activities.


Figure 1.1: An example of the evolving graph.

In this study, we construct a connection profile for each user to record his/her evolving behavior. Figure 1.1 depicts the difference between previous works and ours in the scope of users’ historical behavior that is captured. A connection profile contains the connected nodes, connection order, connection count, and connection frequency as connection features. Figure 1.2 shows a rough example of a connection profile. The information in a connection profile reflects the interests of the user, so the similarity of users’ behavior is revealed by comparing their connection profiles. However, a connection profile is a kind of semi-structured data consisting of categorical attributes (connected nodes), sequential data (connection order), and continuous values (connection count and connection frequency), so evaluating the difference between two connection profiles directly is not intuitive. In this study, we use a hierarchical combination to combine the similarities of the different connection features in two connection profiles. In addition, we propose an efficient algorithm that constructs connection profiles in a sliding-window manner.

Our proposed profile-based link prediction system consists of off-line profile construction and on-line link prediction, as shown in Figure 1.3. In the off-line part, a general connection log is translated into an evolving graph made up of a series of consecutive graph snapshots. Then, connection profiles are constructed and updated incrementally. In the on-line part, when a query (e.g., user X connects to a new site Y) arrives, we dynamically calculate the similarities between user X and the other users. Finally, the top-k users most similar to user X are predicted as the successors who will establish connections to site Y later.

Figure 1.2: A rough example of the connection profile derived from Figure 1.1.

Figure 1.3: System framework.

The rest of this thesis is organized as follows. Chapter 2 presents related work on link prediction. Chapter 3 gives the preliminaries. In Chapter 4, the details of our method are provided. In Chapter 5, we compare our method with the most similar work, which is driven by an information flow scheme. Finally, Chapter 6 concludes this study.


Chapter 2 Related Works

Researchers have devoted effort to the link prediction problem in various domains, such as social networks, bioinformatics, computer networks and recommendation systems [10]. The problem of link prediction was first introduced and defined in [14]. Subsequently, the authors of [3, 1, 6, 7] proposed various algorithms to solve the problem on a connection graph. In [19], the authors classified link prediction methods into two main approaches: generative approaches and similarity-based approaches.

The generative approaches build a probability model to describe the historical connection behavior. A link can then be predicted by calculating its probability from the built model. For example, the authors of [22] and [17] proposed a relational Markov model and local conditional probability, respectively, to model the historical behavior. However, generative approaches have three drawbacks. First, constructing such a model is costly because it requires computation over each pair and combination of nodes. Second, the model cannot be built online, so it lacks freshness. Third, the connection graph is sparse, so the predicted probabilities are too small to be representative.

By contrast, the similarity-based approaches only need to identify users with similar behavior as the candidates who will generate similar links in the future. Thus, only the links related to the target need to be calculated, instead of the whole connection graph. Previous studies modelled the similarity between users by calculating their common features [2, 14, 19], reachable distances [14], and consistent structures [19]. In [19], the attributes of users are utilized to assess their similarity. The authors of [14] modelled the similarity of users as their structural proximity, such as graph distance and common neighbors. However, these works consider the historical behavior only in a single time slot, and ignore the evolution of users’ behavior. Without the evolution of a user’s behavior, we cannot describe the sequential order of the user’s behavior.

To the best of our knowledge, only [20], which is a generative approach, discusses predicting links from an evolving graph. Tseng et al. claimed that behavior is propagated among users according to a proposed information flow. For example, suppose users X and X’ frequently connect to the same websites in a specific order, with X followed by X’. They believed that there must exist an explicit information flow that causes the behavior to propagate from user X to user X’. Thus, when user X connects to website Y, they would predict that user X’ will also connect to website Y.

Although [20] is currently the state of the art for the link prediction problem, the inherent drawbacks of the generative approach lead to poor efficiency and effectiveness. Therefore, in this study, we propose a novel approach that combines the evolving graph with the similarity-based approach. A profile is used to record each user’s previous behavior and to calculate similarity. When the behavior changes, only the user’s profile needs to be updated, rather than the probability model constructed in a generative approach. Thus, we can obtain better performance and accuracy when predicting potential links.


Chapter 3 Preliminary

In this chapter, we first illustrate the format of general connection logs with two real-world examples, and show how to view a connection log directly as a connection network. Then we introduce the evolving graph model that we use to access the connection log/network in this work. Finally, we formally define our problem and introduce the symbols and notation used in the following sections.

3.1 Connection Log & Connection Network

Here we first illustrate the format of general connection logs with two real-world instances: the world wide web (WWW) in the internet and email systems. Then we show that any connection log can be viewed as a connection network, which is like a bipartite network with repeated links between the same node pairs.

3.1.1 Connection Log

Connection logs record connection behavior between sources and destinations. For example, in a web connection log, we can find connection behavior such as a user connecting to Google, Yahoo!, or Facebook. Here the user is a source and each website is a destination. Table 3.1 is a short web connection log that contains five connections between two users and three websites. Similarly, sending an email from a sender to a recipient is also a type of connection behavior, where the sender is a source and the recipient is a destination. A partial log of an email system with five email transmissions between two senders and two recipients is shown in Table 3.2.

Table 3.1: 5 connections in the internet network, sorted by time.

User    Website    Time
Alice   Facebook   2011-01-29 19:20:36
Bob     Yahoo!     2011-01-29 19:20:41
Bob     Google     2011-01-29 19:23:05
Bob     Facebook   2011-01-29 19:23:52
Alice   Yahoo!     2011-01-29 19:24:17

Table 3.2: 5 connections (email transmissions) in an email system, sorted by date.

Sender                    Recipient                Date
zhlee@cs.nctu.edu.tw      zxliao@cs.nctu.edu.tw    2011-01-29
hjchang@cs.nctu.edu.tw    zxliao@cs.nctu.edu.tw    2011-01-30
zhlee@cs.nctu.edu.tw      wcpeng@cs.nctu.edu.tw    2011-01-30
hjchang@cs.nctu.edu.tw    wcpeng@cs.nctu.edu.tw    2011-01-31
zhlee@cs.nctu.edu.tw      wcpeng@cs.nctu.edu.tw    2011-01-31

As shown in the two examples, a connection consists of three attributes: source, destination and time. Together they record a connection behavior, which represents a one-way relationship from a source to a destination. The time attribute is used to identify the sequential order of records. Using the time information, we can transform a connection log into an evolving graph of several consecutive graph snapshots, which helps us discover the profiles of users.

3.1.2 Connection Network

As mentioned in the previous subsection, a connection consists of three attributes: source, destination and time. Therefore, an entry in a connection log can be denoted as (u, v, t), which represents a connection from the source u to the destination v at time t. By viewing all the sources as a set U and all the destinations as another set V, each entry (u, v, t) in a connection log can be viewed as a directed link (edge) from the source u to the destination v in a bipartite network with two different node sets U and V. Hence a connection log can be viewed as a bipartite network G = (U, V, E), where U is the node set of sources, V is the node set of destinations and E is the set of all links (connections). This network view of a connection log is called a connection network.

Definition 1. (Connection Network): A connection log can be viewed as a connection network G with three components (U, V, E), where U and V are two different node sets consisting of sources and destinations respectively, and E ⊆ U × V × T is the link set. Each link (u, v, t) ∈ E represents a connection from a source u to a destination v at time t in the connection log.

The connection network G of a connection log is formally defined in Definition 1. Notably, multiple links between the same source-destination pair can coexist in a connection network G, which is not possible in a general bipartite network. This is achieved by embedding the time dimension T into the link set E: since E ⊆ U × V × T, the links (u, v, t) and (u, v, t′) can coexist in the link set if t ≠ t′. So given a general connection log as in Figure 3.1a, we can view it as the connection network in Figure 3.1b.

Note that U and V in a connection network simply stand for two sets of different node types, and the links (edges) in E simply represent a relationship (connection behavior) between the two node sets U and V. The node types and the relationship change across domains and applications, but for convenience, in the following sections we refer to the first type of nodes in U as users, the second type of nodes in V as sites, and the links in E as the connections that users make to sites in the internet network.
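As a minimal sketch of the connection network view, the five web connections of Table 3.1 can be loaded into the three components (U, V, E). The names `Link` and `build_connection_network` are illustrative choices, not part of PLPF, and E is kept as a list so that repeated links between the same pair at different times are preserved.

```python
from collections import namedtuple

# One log entry (u, v, t): source u connects to destination v at time t.
Link = namedtuple("Link", ["u", "v", "t"])

def build_connection_network(log):
    """Return (U, V, E) for a log given as (source, destination, time) triples."""
    E = [Link(u, v, t) for (u, v, t) in log]
    U = {link.u for link in E}   # node set of sources
    V = {link.v for link in E}   # node set of destinations
    return U, V, E

# The five connections of Table 3.1.
log = [
    ("Alice", "Facebook", "2011-01-29 19:20:36"),
    ("Bob", "Yahoo!", "2011-01-29 19:20:41"),
    ("Bob", "Google", "2011-01-29 19:23:05"),
    ("Bob", "Facebook", "2011-01-29 19:23:52"),
    ("Alice", "Yahoo!", "2011-01-29 19:24:17"),
]
U, V, E = build_connection_network(log)
print(sorted(U), sorted(V), len(E))
```

The same function applies unchanged to the email log of Table 3.2, with senders as sources and recipients as destinations.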


(a) Connection Log (b) Connection Network

Figure 3.1: Viewing a connection log as a connection network.

3.2 Evolving Graph Model

An evolving graph is a graph consisting of several consecutive graph snapshots. Here we first define what a graph snapshot (of a connection network) is, then illustrate the evolving graph with an example.

Definition 2. (G[t1,t2] = (U[t1,t2], V[t1,t2], E[t1,t2])): The snapshot of a connection network G that contains only the links (connections) appearing in the time interval [t1, t2] is denoted as G[t1,t2].

A graph snapshot of the connection network G is simply the connection network G restricted to a fixed time interval, say [t1, t2]. We use the notation G[t1,t2] = (U[t1,t2], V[t1,t2], E[t1,t2]) for the graph snapshot of a connection network G that contains only the links (connections) appearing in the time interval [t1, t2]. That is,

E[t1,t2] = {(u, v, t) ∈ E | t ∈ [t1, t2]}, (3.1)

and similarly,

U[t1,t2] = {u ∈ U | ∃(u, v, t) ∈ E[t1,t2]}, (3.2)

V[t1,t2] = {v ∈ V | ∃(u, v, t) ∈ E[t1,t2]}. (3.3)

Given the connection log in the time interval [1, t], one can view it as a single static graph G[1,t], i.e., the graph snapshot in the time interval [1, t]. However, doing so loses the information about how connections evolve within [1, t]. A better way is to view it as several consecutive graph snapshots over a series of consecutive, shorter time intervals, say [1, 2], [2, 3], . . . , [t − 1, t]. One can then easily observe the evolving connection behavior in the different time intervals.
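The snapshot definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: links are plain (u, v, t) tuples with numeric times, the function names `snapshot` and `evolving_graph` are hypothetical, and the intervals are closed as written in the text, so a link whose time falls exactly on a boundary would appear in two consecutive snapshots (a practical implementation might prefer half-open intervals).

```python
def snapshot(E, t1, t2):
    """Return (U, V, E) of the snapshot G[t1,t2]: only links with t1 <= t <= t2."""
    E_snap = [(u, v, t) for (u, v, t) in E if t1 <= t <= t2]   # Eq. (3.1)
    U_snap = {u for (u, _, _) in E_snap}   # sources with a link in the interval
    V_snap = {v for (_, v, _) in E_snap}   # Eq. (3.3)
    return U_snap, V_snap, E_snap

def evolving_graph(E, start, end):
    """Split [start, end] into unit intervals [t, t+1], one snapshot per interval."""
    return [snapshot(E, t, t + 1) for t in range(start, end)]

# A made-up log with four links.
E = [("X", "A", 1.2), ("X", "C", 1.5), ("Y", "A", 2.3), ("Z", "C", 2.7)]
for U_t, V_t, E_t in evolving_graph(E, 1, 3):
    print(sorted(U_t), sorted(V_t), len(E_t))
```

Viewed as the single static snapshot `snapshot(E, 1, 3)`, the same log would only show that X, Y and Z connect to A and C; the per-interval view also shows that X is active only in [1, 2].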

Figure 3.2: An evolving graph consists of several graph snapshots G(1), G(2), . . . .

An example of the evolving graph in the time interval [1, 103] is illustrated in Figure 3.2. Each graph snapshot G(t) in Figure 3.2 is the graph snapshot in the time interval [t, t + 1]. Viewing the log as a static graph snapshot G[1,103] only tells us that the three users (X, Y, Z) connect to the five sites (A, B, C, D, E) in the time interval [1, 103]. By viewing it as an evolving graph as in Figure 3.2, we can easily see how the connection behavior evolves: 1) sites D and E are connected only in the time interval [101, 103]; 2) users X and Y tend to connect to sites A and B in each smaller time interval, whereas user Z tends to connect to A and C; and 3) user X connected to site C once in the time interval [1, 2] but never connected to it again. These observations are available only in the evolving graph model.


3.3 Problem Formulation

Definition 3. (Link Prediction to a Specific Site v): Given a connection network G and the link (o, v, t1), which is the first link to the site v, predict whether a link (u, v, t) to the same site v will exist in G with u ≠ o, t > t1.

Assuming that all users are known initially (i.e., U = U[t,t′] for any time interval [t, t′]) and given the whole connection log of an internet network, the question is: “if a user o connects to the site v, which other users in U are likely to connect to the same site v later?” The question can be formulated as the link prediction problem to a specific node v defined in Definition 3: given the connection network G and the first link (o, v, t1) to the site v from a user o, predict whether a link to the same site v from another user u will exist later.

To formulate this problem precisely, with its correct answers in a prediction time interval, we define the problem K-Predict, which uses the connection network in the past time interval [t0, t1] to predict k possible links in the future time interval [t1, t2]: given the snapshot of a connection network G[t0,t1] and the link (o, v, t1), which is the first link to the site v, predict k possible links (u1, v, t1), (u2, v, t2), . . . , (uk, v, tk) in the later snapshot G[t1,t2] from k different users. The formal definition is given in Definition 4.

Definition 4. (K-Predict): Given times t0, t1, t2 along with the earlier snapshot of the network G[t0,t1] and the link (o, v, t1), which is the first link to the site v, predict k possible links (u1, v, t1), (u2, v, t2), . . . , (uk, v, tk) in the later snapshot G[t1,t2] (i.e., (ui, v, ti) ∈ E[t1,t2] for 1 ≤ i ≤ k), where ui ∈ U \ {o} and ti ∈ [t1, t2] for all 1 ≤ i ≤ k.


Chapter 4 PLPF: Profile-based Link Prediction Framework

In this chapter, we first give an overview of the Profile-based Link Prediction Framework (PLPF). Then we explain the graph generation process, which transforms the connection log into a series of consecutive graph snapshots of the corresponding connection network, one per time period. These consecutive graph snapshots are used to construct profiles and update them periodically in the off-line component of PLPF to capture the connection behavior of users. Finally, we discuss the similarity measurements used to make link predictions in the on-line component of PLPF.

4.1 Framework Overview

Our major concept is that “users with similar connection behavior before are likely to have similar connection behavior after,” which is the same as the concept of collaborative filtering (CF): “users with similar preferences are likely to access the same items later.” When a new site v appears in the connection log, there must be one user o who has just connected to v; we call him/her the originator of v, since he/she is the first user to connect to v. Other users who connect to the same site v later are called successors.


Formally, suppose a site v is given together with the users u1, u2, . . . , uk who have connected to the site v before, ordered such that ui connected to v earlier than uj for all i < j. Then the user u1 is called the originator of v, since he/she is the first to connect to v, and the other users u2, u3, . . . , uk, who connect to v later, are called successors.

So in PLPF, given a newly appeared site v with its originator o, we aim to find possible successors among the remaining users (i.e., U \ {o}) who have connection behavior similar to that of the originator o in the earlier snapshot of the connection network G[t0,t1]. The connection behavior of each user is abstracted into a profile from the connection network G[t0,t1] (i.e., the partial connection log in [t0, t1]), and similarities between users are estimated from the similarities of their profiles. However, accessing the huge connection log takes a lot of time, so we separate PLPF into two components: an off-line component, which constructs/updates the profile of each user periodically, and an on-line component, which accepts queries and extracts the top-k possible successors based on the similarities of the profiles pre-computed by the off-line component. The whole framework of PLPF is illustrated in Figure 4.1.

Figure 4.1: The whole framework of PLPF.


Algorithm 1: Link Prediction in the On-line component of PLPF

Input: the query (o, v, t), k

Output: the top-k possible links

fetch profile p_u for all users in U
foreach user u ∈ U \ {o} do
    calculate the similarity s_o(u) according to p_o, p_u
end
rank users u ∈ U \ {o} according to s_o(u)
return the links from the first k ranked users to v

In the off-line component, the connection log is first transformed into a series of consecutive graph snapshots of the connection network, one per time period. Then, profiles containing the features that describe users’ connection behavior are constructed. To keep the profiles fresh, the off-line component updates them incrementally at a fixed time interval. The time interval should be set according to the domain knowledge of the application. For convenience, we use one day as the unit time interval, since we set it to one day in the experiments to capture the daily changes in users’ connection behavior.

In the on-line component, when a query (i.e., a new connection to v, say (o, v, t1)) arrives, we dynamically calculate the similarities between the originator o and the other users, and then report the top-k users most similar to the originator o as the successors who will establish links to v later. The flow for making link predictions in the on-line component is illustrated in Algorithm 1. When a query arrives, the on-line component first fetches the most recent profiles of the users. Then, for each user u ∈ U \ {o}, the similarity s_o(u) is computed from the fetched profiles p_o and p_u. Finally, the top-k most similar users are returned as the predicted successors who will establish connections to v later.
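The ranking step of Algorithm 1 can be sketched as follows. The sketch makes two loud assumptions: a profile is reduced to a plain set of connected sites, and the similarity is a placeholder Jaccard overlap, which is not the similarity measure of PLPF. The names `jaccard` and `predict_links` are hypothetical.

```python
def jaccard(p_o, p_u):
    """Placeholder similarity (NOT PLPF's measure): overlap of two site sets."""
    union = p_o | p_u
    return len(p_o & p_u) / len(union) if union else 0.0

def predict_links(profiles, o, v, k, similarity=jaccard):
    """Return the top-k predicted links (u, v), ranked by similarity to o's profile."""
    p_o = profiles[o]
    # Score every user except the originator.
    scores = {u: similarity(p_o, p_u) for u, p_u in profiles.items() if u != o}
    # Highest score first; ties are broken by user id for determinism.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return [(u, v) for u in ranked[:k]]

# Made-up profiles: user -> set of connected sites.
profiles = {
    "X": {"A", "B", "C"},
    "Y": {"A", "B"},
    "Z": {"A", "C", "D", "E"},
    "W": {"E"},
}
print(predict_links(profiles, o="X", v="F", k=2))  # [('Y', 'F'), ('Z', 'F')]
```

Passing the similarity as a parameter keeps the ranking logic separate from the profile comparison, so a richer profile-based measure can be substituted without touching the rest of the flow.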

In the following sections, we first describe the graph generation process, which transforms the connection log into a series of daily graph snapshots of the connection network, and then explain which features are selected for the profiles and how to construct/update profiles from the graph snapshots. Finally, the similarity measurement used to make link predictions is illustrated.


4.2 Graph Generation

As described in Section 3.1.2, a connection log can be viewed as a connection network by treating the connections as links in a bipartite network. We then further transform the connection network into a series of consecutive graph snapshots, i.e., an evolving graph. However, the links are still too numerous, especially for the internet network, where the number of connections between a user-site pair can reach one thousand even in a single graph snapshot. To reduce the computational cost of capturing the connection behavior of a user, we focus only on the sites connected by the user and record the attributes describing these connected sites. Therefore, two attributes are recorded in each graph snapshot of the connection network:

1. Connection count c(u, v): the number of connections made by the user u to the site v in this graph snapshot.

2. Adoption time a(u, v): the time when the first connection to the site v is established by the user u in this graph snapshot.

Algorithm 2: Generate graph snapshot G(d) from G[d,d+1]

Input: G[d,d+1]

Output: G(d)

initialize a bipartite graph G with the same user set and site set as G[d,d+1]
foreach link (u, v) ∈ G do
    c(u, v) = 0
    a(u, v) = NULL
end
foreach link (u, v, t) ∈ G[d,d+1] do
    if c(u, v) = 0 then
        update c(u, v) = 1, a(u, v) = t
    else
        update c(u, v) = c(u, v) + 1
        if a(u, v) > t then
            update a(u, v) = t
        end
    end
end
return graph G

The whole flow to derive the graph snapshot G(d), the d-th snapshot, from the connection network G[d,d+1] is illustrated in Algorithm 2. The process is efficient in terms of computation time since it only needs to read the connection log once. For example, given the connection log in Figure 3.1a, we can transform it into a single graph snapshot G(d) with the edge attributes listed in Table 4.1.

Table 4.1: Edge attributes of a graph snapshot G(d) from the connection log in Figure 3.1a.

Edge (u, v)   Adoption Time a(u, v)   Connection Count c(u, v)
(X, B)        T1                      3
(X, D)        T4                      2
(Y, A)        T3                      2
(Y, C)        T2                      1
(Y, E)        T6                      1
(Z, A)        T5                      1
(Z, C)        T10                     1
(Z, E)        T7                      1
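The single-pass aggregation of Algorithm 2 can be sketched in Python as below. This is an illustration rather than the thesis implementation: only edges that actually occur are stored (instead of initializing every user-site pair to zero), the function names are hypothetical, and the input log is made up, with integer times chosen so that the result has the same adoption times and counts as the first four rows of Table 4.1.

```python
def build_snapshot(links):
    """Aggregate the links (u, v, t) of one interval into per-edge attributes.

    Returns {(u, v): (adoption_time, connection_count)}.
    """
    attrs = {}
    for u, v, t in links:                        # single pass over the log
        if (u, v) not in attrs:
            attrs[(u, v)] = (t, 1)               # first connection seen for this edge
        else:
            a, c = attrs[(u, v)]
            attrs[(u, v)] = (min(a, t), c + 1)   # keep earliest time, bump the count
    return attrs

def connected_sites(attrs, u):
    """The set of sites that user u connected to in this snapshot."""
    return {v for (x, v) in attrs if x == u}

# Made-up links for users X and Y within one snapshot.
links = [("X", "B", 1), ("Y", "C", 2), ("Y", "A", 3), ("X", "D", 4),
         ("X", "B", 5), ("Y", "A", 6), ("X", "B", 7), ("X", "D", 8)]
attrs = build_snapshot(links)
print(attrs[("X", "B")], sorted(connected_sites(attrs, "X")))  # (1, 3) ['B', 'D']
```

Taking the minimum time on every repeat makes the result independent of the order in which the links are read, which mirrors the `a(u, v) > t` check in the algorithm.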

For convenience, we further define the notation V_u(d) as the set of all sites connected by u in the graph snapshot G(d):

V_u(d) = {v | c(u, v) > 0 in G(d)} (4.1)

In the above example, V_X(d) = {B, D} and V_Y(d) = V_Z(d) = {A, C, E}. This notation is used for constructing/updating the profiles later. Given the consecutive graph snapshots G(0), G(1), . . . , G(d), we can build the profiles to capture the evolving connection behavior of users from day 0 to day d.

In the next section, we will first explain what features are selected, then explain the profile format, and finally illustrate the methods to construct/update profiles from the daily graph snapshots.

4.3 Profiles

A profile represents various connection features of a user to capture his/her interests from the evolving connection behavior. We first discuss which features are selected for the profile to effectively reflect the similar interests of users. Then the profile construction process using the consecutive graph snapshots is illustrated.

4.3.1 Connection Features

Connection features are used to capture users’ interests from the evolving connection network. Traditional link prediction methods often extract features by viewing the evolving connection network as a static network in a fixed time interval, i.e., they capture the long-term behavior (interest). However, as mentioned in [23]: “overall behavior of a user may be determined by his/her long-term interest, but at any given time, a user is also affected by his/her short-term interest due to transient events, such as new product releases and special personal occasions such as birthdays.” So here we focus not only on the static features, which are extracted from a static network and represent the long-term interest, but also on the surprising features, which appear only recently, must be extracted by viewing the connection network as an evolving graph, and represent the short-term interest.

Definition 6. (Static features & surprising features): Static features represent the long-term interests and can be extracted from the static network view of the connection network in a fixed time interval. Surprising features represent the short-term interests, which appear only recently, and should be extracted from the evolving graph view of the connection network.

The sites connected by a user are often used to represent the interests of that user. However, to separate long-term interests from short-term interests, we use two sliding windows to capture the connected sites in two time intervals of different lengths. The larger sliding window, of size s_l, is used to derive the sites connected in the last s_l days as the static features of long-term interests, and the smaller sliding window, of size s_s, is used to obtain the newly connected sites that appear only in the last s_s days as the surprising features of short-term interests. Furthermore, since the number of sites connected by a user is very large, we record the connection count and connection frequency to help us identify the sites that are more interesting to the user. For the newly connected sites, we also record the order in which the user connected to them.
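The two sliding windows can be sketched as set operations over a user's daily site sets. The sketch rests on assumptions that the text does not pin down: "newly connected" is read as "connected in the last s_s days and on no earlier day of the long window", s_s is assumed not to exceed s_l, and the function name is hypothetical.

```python
def static_and_surprising_sites(daily_sites, d, s_l, s_s):
    """daily_sites[i] is the set of sites the user connected to on day i.

    static:     sites connected in the last s_l days (days d-s_l+1 .. d).
    surprising: sites connected in the last s_s days and on no earlier day
                of the long window (an assumed reading of "newly connected").
    """
    long_days = range(max(0, d - s_l + 1), d + 1)
    short_start = max(0, d - s_s + 1)
    static = set().union(*(daily_sites[i] for i in long_days))
    recent = set().union(*(daily_sites[i] for i in range(short_start, d + 1)))
    earlier = set().union(*(daily_sites[i] for i in long_days if i < short_start))
    return static, recent - earlier

# Made-up history of one user over days 0..4.
daily_sites = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"A", "B", "D"}, {"A", "E", "D"}]
static, surprising = static_and_surprising_sites(daily_sites, d=4, s_l=5, s_s=2)
print(sorted(static), sorted(surprising))  # ['A', 'B', 'C', 'D', 'E'] ['D', 'E']
```

Here A and B are regular sites that stay in the static set, while D and E show up only in the last two days and are therefore reported as surprising.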


In summary, four different types of connection features are used to describe the long-term and/or short-term interests of a user from his/her connection behavior:

1. Connected sites or newly connected sites: the sites connected by a user in a fixed time interval or “only” in a fixed time interval.

2. Connection count for each connected site: the number of connections made to a connected site in a fixed time interval.

3. Connection frequency for each connected site: how often a site is connected in a fixed time interval.

4. Connection order for newly connected sites: the order in which the newly connected sites are first connected.

We will illustrate why the last three features are selected below.

The connection count feature is inspired by recommendation systems. Since our concept is the same as collaborative filtering (CF), we first observe which features are used in CF for recommendation systems. A typical recommendation system consists of users and items along with user feedback on items. This feedback is often expressed as real numbers, for example, using 1 or 0 to represent that a user likes or dislikes an item, or using a score from 0 to 10 to measure how much the user favors the item. CF estimates that two users have similar preferences if they give similar feedback on the items they have both accessed. In connection logs, we do not have explicit user feedback on sites (even the connected sites), but we can estimate it implicitly through the connection counts. If a user connecting to a site means that the user is interested in it, then more connections to the same site mean that the user is more interested in it.

The connection count for each site can represent the preference of a user in a fixed time interval, but a site with many connections initially and fewer (or no) connections recently may no longer interest the user. So we seek another feature to help us find the more stable interests of a user in a fixed time interval. Regularly connected sites are a natural choice: if a user is interested in something for a long time, he/she may access it continuously, i.e., connect to the site continuously. A regularly connected site is identified by measuring how often the user connects to it. Hence, we record the connection frequency for each connected site in a fixed time interval to know which sites are regularly connected by a user.

Besides, to capture the short-term interest of a user with surprising features, we consider not only the newly connected sites but also the connection order of the newly connected sites. The authors of [20] state: “information will flow from earlier adopters to the late adopters, but may not flow back from late adopters to the earlier adopters.” This viewpoint suggests that two users who have recently acquired the same new interest may have the same access sequence on new items; for example, users who bought a new digital camera will then buy a tripod and several lenses later if they have all recently become interested in photography. The order in which a user connects to new sites can reflect his/her new interest directly. So we also use the connection order of newly connected sites as another surprising feature to describe the short-term interest of a user.

Table 4.2: The four features used in a profile.

L/S | Feature | Description
L | Connection count | the number of connections made to a connected site
L | Connection frequency | how often a site is connected in a fixed time period
S | Newly connected sites | the sites only connected in a recent time period
S | Connection order | the order that each newly connected site is connected

To conclude, we list the four selected features in Table 4.2. A feature marked with L is a static feature for the long-term interest, derived in the larger sliding window, and a feature marked with S is a surprising feature for the short-term interest, derived in the smaller sliding window. The larger sliding window with size s_l captures the static interest of users in the last s_l days, and the smaller sliding window with size s_s captures the surprising interest of users in the last s_s days. Both s_l and s_s are fixed parameters in PLPF.


PLPF along with its construction/updating algorithm.

4.3.2 Profile Construction

We use 3 attributes to represent the 4 connection features, either static or surprising, in a profile; the profile format is illustrated in Table 4.3 with an example of each attribute. The connection count distribution maps each connected site to the number of connections made to the site in the last s_l days. Similarly, the connection frequency distribution maps each connected site to its frequency, i.e., how often the site is connected in the last s_l days. The connection count distribution and the connection frequency distribution directly represent the static features for long-term interest: connection count and connection frequency. The connection sequence records the order of the first connection to each newly connected site in the last s_s days as a sequence, and is the only attribute that represents the surprising features for short-term interest: the newly connected sites (simply the sites in the connection sequence) and the connection order (the order of the sites in the connection sequence).

Table 4.3: A profile consists of three different attributes.

Notation | Attribute | Example
CCD_u(d) | Connection Count Distribution | {a : 10, b : 105, c : 99, . . . , v : c(v)}
CFD_u(d) | Connection Frequency Distribution | {a : 0.8, b : 0.6, c : 1.0, . . . , v : f(v)}
CS_u(d) | Connection Sequence | < c′, d′, b′, a′, · · · >

Static features for the long-term interest are derived in the last s_l days. The connection count c(v) for a connected site v is calculated as the number of all connections to v in the last s_l days.

c(v) = number of connections to v in the last sl days (4.2)

The connection frequency f(v) is calculated in a similar way, where the frequency is the fraction of the last s_l days on which v is connected:

f(v) = (number of days on which v is connected in the last s_l days) / s_l (4.3)
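As a small illustration of equations 4.2 and 4.3, the sketch below derives both static attributes for one user from a list of daily snapshots. The input layout (one dictionary per day mapping a site to that day's connection count) and the function name `count_and_frequency` are assumptions made for this example only, not the data structures used in PLPF.

```python
from collections import Counter


def count_and_frequency(snapshots, s_l):
    """Return (CCD, CFD) for one user over the last s_l daily snapshots.

    snapshots: list of dicts, oldest day first; each dict maps a site to the
    number of connections the user made to it on that day (assumed layout).
    """
    window = snapshots[-s_l:]          # the last s_l days
    counts = Counter()                 # c(v): total connections, eq. 4.2
    active_days = Counter()            # number of days on which v is connected
    for day in window:
        for site, c in day.items():
            if c > 0:
                counts[site] += c
                active_days[site] += 1
    # f(v): fraction of the s_l days on which v is connected, eq. 4.3
    frequency = {site: n / s_l for site, n in active_days.items()}
    return dict(counts), frequency


# Hypothetical four-day history for one user.
snapshots = [
    {"a": 3, "b": 1},
    {"a": 2},
    {"b": 4, "c": 1},
    {"a": 5},
]
ccd, cfd = count_and_frequency(snapshots, s_l=4)
print(ccd)  # connection count distribution
print(cfd)  # connection frequency distribution
```

In this toy history, site a has the highest count and the highest frequency, while site b has a fairly high count concentrated on few days, which is the distinction the two static features are meant to capture.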

The connection sequence is simply all the newly connected sites in the last s_s days, ordered by their first connection time. A connection sequence can be expressed as the vector below.

< v_1, v_2, . . . , v_n > (4.4)

Note that v_i is connected earlier than v_j for all i < j in the last s_s days.

Connection counts and connection frequencies for different sites in the last s_l days can be easily derived from the evolving graph using the last s_l consecutive daily graph snapshots. However, to derive the newly connected sites, which are connected “only” in the last s_s days, the last s_s consecutive daily graph snapshots are not sufficient: snapshots from earlier days are also needed to know whether a site has been connected before. So we build an adoption history H_u(d) for each user u to record the first connection time and the last connection time for each site that is connected in the last s_s days before day d. The adoption history H_u(d) can be formatted as the vector below:

H_u(d) = < (v_1, f_1, l_1), (v_2, f_2, l_2), . . . , (v_m, f_m, l_m) >, (4.5)

where m is the number of sites connected by user u in the last s_s days before day d, and f_i (l_i) is the first (last) connection time to site v_i in the last s_s days before day d. At any day d, given the adoption history H_u(d), we can derive the sites newly connected by user u in the last s_s days before day d by simply filtering out the sites in H_u(d) whose first connection time is more than s_s days earlier than day d.

The adoption history H_u(d) can be easily updated from the adoption history of the previous day, H_u(d − 1), and the daily graph snapshot G(d) of the current day. The incremental updating process of an adoption history H_u(d) is illustrated in Algorithm 3. Note that a newly connected site may have been connected by the user more than s_s days before day d. If the user has not connected to it in the last s_s days and happens to connect to it again now, the site is still counted as a newly connected site, since it is new within the last s_s days.
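The incremental update of the adoption history and the derivation of the connection sequence can be sketched as follows. This is a minimal sketch under stated assumptions: days and adoption times are plain integers (day granularity), the history is held as a dictionary from site to a (first, last) pair, and the function names are hypothetical.

```python
def update_adoption_history(history, connected_today, d, s_s):
    """Build H_u(d) from H_u(d - 1) and the sites connected on day d.

    history: dict mapping site -> (first_time, last_time)
    connected_today: dict mapping site -> adoption time on day d
    """
    updated = dict(history)
    for site, t in connected_today.items():
        if site in updated:
            first, _ = updated[site]
            updated[site] = (first, t)      # refresh the last connection time
        else:
            updated[site] = (t, t)          # first time seen in the window
    # Drop sites that have not been connected for more than s_s days.
    return {s: (f, l) for s, (f, l) in updated.items() if d - l <= s_s}


def connection_sequence(history, d, s_s):
    """Newly connected sites (first connection within s_s days), oldest first."""
    new_sites = [(f, s) for s, (f, _) in history.items() if f > d - s_s]
    return [s for _, s in sorted(new_sites)]


# Hypothetical daily connections of one user (day -> {site: adoption time}).
daily = {1: {"a": 1}, 2: {"b": 2}, 3: {"a": 3}, 6: {"c": 6}, 7: {"b": 7}}
history = {}
for d in range(1, 8):
    history = update_adoption_history(history, daily.get(d, {}), d, s_s=3)
sequence = connection_sequence(history, d=7, s_s=3)
print(history, sequence)
```

In this toy run, site b is dropped from the history on day 6 because it has not been connected for more than s_s = 3 days, so its connection on day 7 makes it a newly connected site again, which matches the remark above.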

To conclude this section on profile construction, we list the 3 construction processes for the attributes of connection count distribution, connection frequency distribution and connection sequence in Algorithms 4, 5 and 6, respectively.


Algorithm 3: Update Adoption History H_u(d) from H_u(d − 1), G(d)
Input: H_u(d − 1), G(d)
Output: H_u(d)
foreach v_i ∈ V_u(d) do
    fetch the adoption time a_i = a(u, v_i) in G(d)
    if v_i ∈ H_u(d − 1) then
        update l_i = a_i in H_u(d − 1)
    else
        insert (v_i, a_i, a_i) into H_u(d − 1)
    end
end
foreach (v_i, f_i, l_i) ∈ H_u(d − 1) do
    if (d − l_i) > s_s then
        remove (v_i, f_i, l_i) from H_u(d − 1)
    end
end
return the modified H_u(d − 1) as H_u(d)

Algorithm 4: Construct the attribute of connection count distribution CCD_u(d)
Input: G(d − s_l + 1), . . . , G(d)
Output: CCD_u(d)
initialize an empty hash map M
foreach g in G(d − s_l + 1), . . . , G(d) do
    foreach v_i ∈ V_u(g) do
        fetch the connection count c_i = c(u, v_i) in g
        if v_i ∈ M then
            M[v_i] = M[v_i] + c_i
        else
            M[v_i] = c_i
        end
    end
end
return M

4.4 Similarity Calculation

The similarity between users is calculated through the similarities of their static and/or surprising features, for the long-term and short-term interests separately. In this section, we first explain how we combine two similarity values to obtain a new one. Then we describe how to calculate the similarity values of static features and surprising features separately. Finally, we summarize the 3 similarities used in PLPF.


Algorithm 5: Construct the attribute of connection frequency distribution CFD_u(d)
Input: G(d − s_l + 1), . . . , G(d)
Output: CFD_u(d)
initialize an empty hash map M
foreach g in G(d − s_l + 1), . . . , G(d) do
    foreach v_i ∈ V_u(g) do
        if v_i ∈ M then
            M[v_i] = M[v_i] + 1
        else
            M[v_i] = 1
        end
    end
end
foreach M[v_i] ∈ M do
    M[v_i] = M[v_i] / s_l
end
return M

Algorithm 6: Construct the attribute of connection sequence CS_u(d)
Input: H_u(d)
Output: CS_u(d)
initialize an empty vector V
foreach (v_i, f_i, l_i) ∈ H_u(d) do
    if f_i > (d − s_s) then
        add v_i into V
    end
end
sort V according to f_i
return V

4.4.1 Similarity Combinations

Before we explain the method to combine two similarity values, we first explain the notation of a general similarity (function) s(.) used in PLPF. Since the similarity in PLPF indicates how similar a user u is to an originator o, it is usually written as s_o(u) instead of s(o, u). The notation s_o(u) also indicates that a similarity function in PLPF can be asymmetric, that is, s_o(u) may not be equal to s_u(o). Moreover, since the originator o is fixed in every single query, the similarity s_o(u) can be further simplified to s(u), omitting the originator o.


with a weight w to combine a new one s(u):

s(u) = w ∗ s_1(u) + (1 − w) ∗ s_2(u), (4.6)

where the weight w should be in [0, 1]: w = 1 gives all weight to s_1(u) and w = 0 gives all weight to s_2(u). However, the similarity values calculated by s_1(.) may be much larger or smaller than the ones calculated by s_2(.). So we first normalize the similarity values s_1(u_1), s_1(u_2), . . . , s_1(u_n) by dividing them by their maximum value s_1(u′) before combining them with other similarity values. The normalized similarity ¯s_1(u) is calculated as follows:

$\bar{s}_1(u) = \frac{s_1(u)}{s_1(u')}$, (4.7)

where

$u' = \arg\max_{u} s_1(u)$. (4.8)

The algorithm to calculate the combined similarity is illustrated in Algorithm 7. In the following, we discuss the similarities of static interest and surprising interest separately.

Algorithm 7: Calculate the combined similarity s(u) from s_1(u), s_2(u) with a weight w
Input: s_1(u), s_2(u), w
Output: s(u)
calculate the normalized similarity ¯s_1(u) according to equation 4.7
calculate the normalized similarity ¯s_2(u) according to equation 4.7
s(u) = w ∗ ¯s_1(u) + (1 − w) ∗ ¯s_2(u)
return s(u)
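Algorithm 7 can be sketched in a few lines. The sketch assumes that each similarity is given as a dictionary from candidate user to a non-negative score; treating an all-zero similarity as all zeros after normalization, and a user missing from one dictionary as having score 0, are assumptions of this example, since neither case is specified above.

```python
def normalize(scores):
    """Divide every score by the maximum score (equations 4.7 and 4.8)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        # Assumption: if no user has a positive score, keep everything at 0.
        return {u: 0.0 for u in scores}
    return {u: s / top for u, s in scores.items()}


def combine(s1, s2, w):
    """Convex combination of two normalized similarities (equation 4.6)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    n1, n2 = normalize(s1), normalize(s2)
    users = set(n1) | set(n2)
    return {u: w * n1.get(u, 0.0) + (1 - w) * n2.get(u, 0.0) for u in users}


# Hypothetical scores on two very different scales.
s1 = {"u1": 2.0, "u2": 4.0}
s2 = {"u1": 0.5, "u2": 0.25}
combined = combine(s1, s2, w=0.25)
print(combined)
```

Without the normalization step, s1 would dominate the combination simply because its raw values are larger, regardless of the weight w.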

4.4.2 Similarity of Static Features

Similarity of static features is the combined similarity through the similarities of the static features – connection count and connection frequency – separately, with a weight γ for combination.

Connection count can be derived directly from the connection count distribution, and is often represented as a mapping or two-dimension array as below:


Considering the count of non-connected sites as 0, each user's connection count distribution can be written as a one-dimensional array whose indexes refer to the same sites for all users.

< c1, c2, . . . , c|V | >

Data in this form are usually compared with simple vector similarities such as the cosine similarity. Alternatively, one can convert the array into a discrete probability distribution, i.e., the probability of connecting to each site, and then apply statistical distance functions such as the probability distance. In this work, we tried both and chose the probability distance since it performs better.

The connection frequency can be derived from the connection frequency distribution in the same way. Considering the frequency of non-connected sites as 0, each user's connection frequency distribution can be written as a one-dimensional array with the same unified indexes as the connection count distribution.

< f_1, f_2, . . . , f_|V| >

So the similarities applied to the connection count can be applied to the connection frequency, too. We also choose the probability distance for this feature since it performs better.
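The two families of measures mentioned above can be illustrated on sparse distributions stored as dictionaries, where a missing site counts as 0. The cosine similarity is standard. The exact form of the probability distance is not spelled out here, so the sketch uses one minus the total variation distance purely as an assumed stand-in for a statistical distance between the two normalized distributions; the function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors; missing sites count as 0."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def to_distribution(x):
    """Convert counts (or frequencies) into a discrete probability distribution."""
    total = sum(x.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in x.items()}


def distribution_similarity(x, y):
    """1 - total variation distance (an assumed stand-in, see the lead-in)."""
    p, q = to_distribution(x), to_distribution(y)
    sites = set(p) | set(q)
    distance = 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in sites)
    return 1.0 - distance


# Hypothetical connection count distributions of an originator and a user.
ccd_o = {"a": 3, "b": 1}
ccd_u = {"a": 1, "b": 3}
print(cosine_similarity(ccd_o, ccd_u))
print(distribution_similarity(ccd_o, ccd_u))
```

The same two functions apply unchanged to connection frequency distributions, since both attributes share the same site-indexed layout.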

4.4.3 Similarity of Surprising Features

Similarity of surprising features is derived in a similar way from the similarities of the surprising features, newly connected sites and connection order, with a weight β for combination. The difference is that the connection order is used to enhance the similarity values of newly connected sites, so we do not use the convex combination here.

The newly connected sites can be gained through the connection sequence directly by viewing the sites in the vector as a set of sites.

V(u) = {v1, v2, . . . , vn}

Then several set-based similarity measurements can be applied to this feature, such as the size of the set intersection (number of common items), the Jaccard coefficient [21] and so on. These similarity measurements are derived from the concept of common items and then weighted by other features. In other words, they count the number of sites recently connected by both users and give a higher similarity score if the number is larger. Here we define a new asymmetric similarity, the common ratio c_o(u), which is modified from the Jaccard coefficient to measure how many of the sites connected by o are also connected by u, since we only focus on how similar a user is to the originator.

$c_o(u) = \frac{|V(o) \cap V(u)|}{|V(o)|} = \frac{|V(o, u)|}{|V(o)|}$ (4.9)

We use the common ratio to measure the similarity of newly connected sites.

Connection order is used here to enhance the similarity on newly connected sites. Users with common recently connected sites are said to have similar recent interests, but users who also connect to those sites in a common order are much more similar because they behave more consistently. To capture this consistency, we count the number of ordered site pairs that are connected by both users in the same order:

$\mathrm{pair}(u_i, u_j) = |\{(v_k, v_l) \mid v_k, v_l \in V(u_i, u_j),\ v_k \text{ is connected earlier than } v_l \text{ for both } u_i, u_j\}|$ (4.10)

where V(u_i, u_j) = V(u_i) ∩ V(u_j) is the set of sites recently connected by both u_i and u_j. We then define the enhanced similarity as:

$s_o(u) = c_o(u) \cdot (1 + \beta)^{\mathrm{pair}(o, u)}$ (4.11)

where β > 0 is a tunable parameter that weights the importance of common connection orders. Table 4.4 shows an example of calculating this similarity to an originator with β = 0.1.
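Equations 4.9 to 4.11 can be checked against the example of Table 4.4 with a short sketch. Connection sequences are assumed to be lists of distinct site identifiers ordered by first connection time; the function names are hypothetical.

```python
from itertools import combinations


def common_ratio(seq_o, seq_u):
    """c_o(u): fraction of the originator's new sites also connected by u (eq. 4.9)."""
    sites_o = set(seq_o)
    if not sites_o:
        return 0.0
    return len(sites_o & set(seq_u)) / len(sites_o)


def common_pairs(seq_o, seq_u):
    """pair(o, u): common site pairs connected in the same order by both (eq. 4.10)."""
    pos_u = {site: i for i, site in enumerate(seq_u)}
    # Common sites, listed in the originator's connection order.
    common = [site for site in seq_o if site in pos_u]
    # combinations() yields (a, b) with a before b for the originator;
    # count the pairs where u also connects to a before b.
    return sum(1 for a, b in combinations(common, 2) if pos_u[a] < pos_u[b])


def enhanced_similarity(seq_o, seq_u, beta):
    """s_o(u) = c_o(u) * (1 + beta) ** pair(o, u) (eq. 4.11)."""
    return common_ratio(seq_o, seq_u) * (1 + beta) ** common_pairs(seq_o, seq_u)


originator = ["a", "b", "c", "d", "e", "f"]
u3 = ["x", "a", "b", "c", "y", "z"]
u4 = ["b", "a", "c"]
for name, seq in (("u3", u3), ("u4", u4)):
    print(name, common_ratio(originator, seq), common_pairs(originator, seq),
          round(enhanced_similarity(originator, seq, beta=0.1), 5))
```

For u_3 and u_4 this reproduces the table rows (0.5, 3, 0.6655) and (0.5, 2, 0.605). For u_2 the exact common ratio is 4/6, so the exact similarity is about 0.887; the value 0.87846 in the table corresponds to the truncated ratio 0.66.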

Table 4.4: An example of calculating the enhanced similarity based on the common ratio and connection orders with an originator whose connection sequence is < a, b, c, d, e, f > and β = 0.1

User | Connection sequence | Common ratio | Common pairs | Similarity
u_1 | < f, e, d, c, b, a > | 1 | 0 | 1.0
u_2 | < f, x, b, c, d, y > | 0.66 | 3 | 0.87846
u_3 | < x, a, b, c, y, z > | 0.5 | 3 | 0.6655
u_4 | < b, a, c > | 0.5 | 2 | 0.605
u_5 | < o, p, q, r, s, t > | 0 | 0 | 0
u_6 | < f > | 0.16 | 0 | 0.16

4.4.4 Similarities in PLPF

Given the calculated similarities of static features and surprising features, we can combine them with a weight α to derive an overall similarity using all features. Assuming the similarity of static features s_static(u) and the similarity of surprising features s_surprising(u) are both normalized, the overall similarity s(u) is derived through equation 4.6:

s(u) = α ∗ s_static(u) + (1 − α) ∗ s_surprising(u) (4.12)

In total, 3 similarities can be used in PLPF:

1. similarity of static features only: set α = 1.

2. similarity of surprising features only: set α = 0.

3. the overall similarity using both static features and surprising features: set α ∈ (0, 1).

In Chapter 5, we will reveal the result using each similarity in PLPF.

4.5 Additional Domain Knowledge

As mentioned above, the similarity is calculated from the connection features, which are all (evolving) topology features, since we only have information about the connection log/network. If additional domain knowledge is given, such as node attributes, we can use it to improve our similarity measurement and make better predictions.


One such piece of information is especially important but cannot be derived from the connection log alone: information about the newly appeared site v. As the site v has just appeared in the connection log, we only know who connects to it and when the connection is established. But if more domain knowledge is given, for example, that the newly appeared site v is very similar to another site v′ which was connected by the originator o before, then we can use this information to add weight to the static (surprising) features if v′ belongs to the long-term (short-term) interest of o.

Generally speaking, there are 4 different cases:

1. v is similar to the connected sites in the long-term interest of o,

2. v is similar to the connected sites in the short-term interest of o,

3. v is similar to the connected sites in both long-term and short-term interest of o, and

4. v is not similar to any connected sites in the long-term or short-term interest of o.

This information is not useful in the last case, but for the first three cases, we can derive a relative adjusting weight w_v according to how strongly v belongs to the long-term interest rather than the short-term interest, where 0 < w_v < 1. Then we can adjust the weight of long-term and short-term interests by w_v:

s_o(u) = w_v ∗ s^l_o(u) + (1 − w_v) ∗ s^s_o(u), (4.13)

where s^l_o(u) is the similarity of u related to o on static features only, and s^s_o(u) is the similarity of u related to o on surprising features only.


Chapter 5 Experiments

In this chapter, we first describe the experimental settings, including 1) the real dataset, 2) the query format, and 3) the evaluation measures. Then we show the prediction results derived by our algorithm and compare both the effectiveness and the efficiency with the information flow based approach proposed in [20]. The parameters used in PLPF to construct profiles and calculate similarities are discussed afterwards to show how they affect the prediction results. Finally, we analyze the properties of PLPF with a synthetic dataset.

5.1 Experiment Settings

For this work, we collected the TCP connection logs between the dormnet and the Internet at our university from 2010-09-14 to 2011-01-31, forming a connection log covering one semester. The log contains more than 1,431,000,000 connections between 723 students and 15,436,399 outside sites. A set of queries is created for the experiments with the following properties:

1. Each query contains the information of the first connection to the newly appeared site. The information contains the originator (represented by an IP address), the newly appeared site (represented by an IP address) and the time to make this connection.


really connects to the same site in the following week after the new site appears, and these links established by these successors are used as ground truth.

In total, 125 queries are selected with the above properties, with an average of 59.4 true successors in the week after the new site appears.

The input for each method is the query (i.e., the originator, the newly appeared server and the time of the first connection) along with the connection logs up to the last day before the query, and the output is the top-k possible links to the same site. To evaluate the effectiveness of each method, we consider the precision/recall at rank k and the mean average precision (MAP), since MAP provides a single-figure measure of quality across recall levels, which has been shown to have especially good discrimination and stability [15].

Precision at rank k for a single query q is the fraction of the top-k possible successors that are true successors (i.e., users who establish links to the specific site in the following week), and recall at rank k for a single query q is the fraction of all true successors that appear in the top-k possible successors. Precision and recall at rank k for a query q are formalized as below:

$\mathrm{precision}_q(k) = \frac{\text{number of true successors in the top-}k\text{ possible successors}}{k}$ (5.1)

$\mathrm{recall}_q(k) = \frac{\text{number of true successors in the top-}k\text{ possible successors}}{m_q}$ (5.2)

The notation m_q denotes the number of all true successors for a query q. For a query set Q, the overall precision/recall at rank k is the average of the precision/recall at rank k over all queries q in Q.

$\mathrm{precision}_Q(k) = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{precision}_q(k)$ (5.3)

$\mathrm{recall}_Q(k) = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{recall}_q(k)$ (5.4)

To compute MAP for a query set Q, we first compute the average precision for each query q ∈ Q. The average precision of a query q is the average of the precision at rank k for each k from 1 to m_q, and MAP is the average of the average precisions over all queries in the query set Q. The average precision for a query q and MAP for a query set Q can be formalized as follows:

$\mathrm{AveragePrecision}(q) = \frac{1}{m_q} \sum_{k=1}^{m_q} \mathrm{precision}_q(k)$ (5.5)

$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AveragePrecision}(q)$ (5.6)

Besides, to evaluate the efficiency of each method, we consider its computation time cost, setting aside the disk I/O for reading logs and reading/writing intermediate data structures. For each method, the computation time cost is accumulated from accessing the connection logs to deriving the final result of top-k possible links. This comparison is based on the computation time cost to derive the prediction result for one single query.
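The effectiveness measures of equations 5.1 to 5.6 can be sketched as below. The average precision follows equation 5.5 as written, i.e., the mean of the precision at every rank from 1 to m_q. The input layout (a ranked list of predicted users and a set of true successors per query) and the function names are assumptions of this example.

```python
def precision_at_k(ranked, truth, k):
    """Equation 5.1: true successors among the top-k, divided by k."""
    return sum(1 for u in ranked[:k] if u in truth) / k


def recall_at_k(ranked, truth, k):
    """Equation 5.2: true successors among the top-k, divided by m_q."""
    return sum(1 for u in ranked[:k] if u in truth) / len(truth)


def average_precision(ranked, truth):
    """Equation 5.5: mean of precision at rank k for k = 1 .. m_q."""
    m_q = len(truth)
    return sum(precision_at_k(ranked, truth, k) for k in range(1, m_q + 1)) / m_q


def mean_average_precision(queries):
    """Equation 5.6: mean of the average precision over all queries."""
    return sum(average_precision(r, t) for r, t in queries) / len(queries)


# Two hypothetical queries: (ranked predictions, set of true successors).
queries = [
    (["u1", "u2", "u3", "u4"], {"u1", "u3"}),
    (["u2", "u1"], {"u1"}),
]
print(precision_at_k(*queries[0], k=2), recall_at_k(*queries[0], k=2))
print(mean_average_precision(queries))
```

The overall precision and recall of equations 5.3 and 5.4 are obtained in the same way, by averaging `precision_at_k` and `recall_at_k` over the query set for a fixed k.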

5.2 Comparisons with EABIF Network

In this section, we compare our method, PLPF, with the most similar work driven by an information flow scheme, EABIF Network, in both effectiveness and efficiency. The precision comparison is performed first, operating a link prediction system using PLPF with LastDay Profiles, which are built on the last day before the query. Then we include the most recent data, up to just before the query, into the profiles as Instant Profiles, and illustrate the difference between using LastDay Profiles and Instant Profiles. In the end, we compare the computation time cost for a single query for each method, setting aside the huge I/O cost.

5.2.1 Precision Comparison

In the most similar work [20], the method models an information flow network called EABIF Network, and uses several different propagation models to compute the propagation probabilities for predicting the late adopters (possible successors). Table 5.1 lists the MAP scores of the EABIF Network using the two most useful propagation models with 5 different parameter values each. The propagation model Summation to M Steps (StS(M)) performs better than Exponential Weighted Summation with weight β (EWS(β)) for all parameter values. Besides, the best parameters in both propagation models show that short propagation paths are much more important than long propagation paths for predicting new links. In the following, we use the propagation model StS(1) in EABIF Network to compare the precision with our methods.

Table 5.1: MAP scores for different propagation models used in EABIF Network

Propagation model | MAP score
Exponential Weighted Summation with β = 1.0 | 0.208167
Exponential Weighted Summation with β = 2.0 | 0.199635
Exponential Weighted Summation with β = 3.0 | 0.197339
Exponential Weighted Summation with β = 4.0 | 0.196231
Exponential Weighted Summation with β = 5.0 | 0.195724
Summation to M Step with M = 1 | 0.229598
Summation to M Step with M = 2 | 0.228144
Summation to M Step with M = 3 | 0.225049
Summation to M Step with M = 4 | 0.221385
Summation to M Step with M = 5 | 0.218087

Table 5.2: MAP scores for each method

Method | MAP score | Improvement
EABIF Network (StS(1)) | 0.229598 | 1
PLPF (Surprising-only) | 0.279514 | 1.217406
PLPF (Static-only) | 0.242124 | 1.054056
PLPF (Both) | 0.248188 | 1.080968

We compare our method with EABIF Network in Table 5.2. Since we have two different types of features (static and surprising features for static and surprising interests), we compare the results using each type of features alone and using both types. PLPF performs better than EABIF Network with all three combinations of the two feature types, and PLPF with surprising features only gives the best MAP score, 0.279514, which is the largest improvement (21.7%) over EABIF Network.

(a) Precision at rank k (b) Recall at rank k

Figure 5.1: Precision & Recall to rank k for each method.

Besides the MAP scores, the overall precision/recall at rank k is illustrated in Figure 5.1 for k = 10, 20, 30, 40, 50. As with the MAP scores, PLPF with surprising features only has the best precision and recall values among all methods at each rank. All methods have better precision when k is smaller, except for EABIF Network using propagation model EWS(1.0). In addition, PLPF with surprising features only gives better precision even when k is larger (k = 50). This shows that surprising features (interests) are very useful for predicting new links in the Internet.

5.2.2 Instant Profiles

All the above comparisons use history data that extends at most to the last day before the new site appears. Here we compare the precision obtained with the most recent history data, which includes the connections in the connection log just before the new site appears. The two types of history data are illustrated in Figure 5.2. We call the former profile a LastDay Profile, since it uses history data that extends at most to the last day before the new site appears; the other profile, which uses history data extending to the most recent connections just before the new site appears, is called an Instant Profile. However, building Instant Profiles requires direct access to the newest raw connection log and costs a lot of time. Hence a query using Instant Profiles does not return an instant result, and one has to wait a long time for the query to complete.

Figure 5.2: The history data used in LastDay profiles and Instant profiles.

Figure 5.3: MAP scores when using LastDay profiles and Instant profiles.

The MAP scores using different profiles for each method are illustrated in Figure 5.3. The MAP score clearly gets higher when more recent (fresh) data are included (using Instant Profiles), except for the EABIF Network with EWS(1.0). This again supports that the surprising features for the surprising (short-term) interest are much more important for predicting links to a newly appeared site.

5.2.3 Computation Time Comparison

Figure 5.4: Computation time cost for a single query.

To evaluate the efficiency, we compare the computation time cost against different lengths of history data used as the input. Since we collect connection logs from 2010-09-14, we increase the length of the history data in half-month steps until 2011-01-15. In Figure 5.4 we compare the total computation time cost for a query, from reading the connection logs to deriving the prediction result, for each method. The computation time cost using PLPF is slightly less than using EABIF Network, and using PLPF with surprising features only (with the smaller sliding window size s_s = 7) gives the smallest computation time cost, which is less than one second. PLPF with static features (with the larger sliding window size s_l = 35) has a relatively consistent time cost of about 10 to 15 seconds once the history data includes more than 35 days (after 2010-10-31). The method using EABIF Network has an increasing computation time cost because it records the adoption time of each user-site pair, and the number of user-site pairs it records increases as the length of the history data increases. Besides, all methods have a slightly higher computation cost using the history data from 2010-09-14 to 2010-10-31. The reason is that the number of connections made in these days (2010-09-22 to 2010-10-30) is much larger than on other days, which prolongs the computation for every method.

5.3 Parameters in PLPF

There are 5 parameters in total in PLPF: 1) s_l and s_s are the two sliding window sizes used to derive static features and surprising features respectively, 2) γ is used to derive the similarity of static features by combining the similarities of connection count and connection frequency, 3) β is used in the similarity of surprising features to add weight to the common ratio, and finally 4) α is used to combine the similarities calculated with static features only and surprising features only.

In the following paragraphs, we first choose the best values of (γ, s_l) for calculating the similarity using static features only. Then we choose the best values of (β, s_s) for calculating the similarity using surprising features only. Finally, we choose α to derive the best similarity using both static and surprising features.

5.3.1 Parameters for Static Features

First we set s_l = 7, 14, . . . , 70 and γ = 0, 0.05, . . . , 1 to observe how the MAP score changes with different γ values. The result is illustrated in Figure 5.5. It shows that using the similarity of connection frequency only (γ = 0) gives much better MAP scores than using the similarity of connection count only (γ = 1). (γ, s_l) = (0, 35) gives a very high MAP score using connection frequency only. Besides, the MAP scores get higher only when γ is close to but not equal to 1.

We then zoom in on γ = 0.9, 0.91, . . . , 1, and the result is illustrated in Figure 5.6. It shows that varying γ from 0.9 to 0.99 has little influence on the MAP scores, but the larger sliding window size s_l affects the MAP scores directly. s_l = 35 gives the best MAP


Figure 5.5: Tuning γ and sl for similarity using static features only.

Figure 5.6: Tuning γ and sl for similarity using static features only.

pairs using to calculate the similarity of static features using both connection count distributions and connection frequency distributions.

5.3.2 Parameters for Surprising Features

Two tunable parameters are used for surprising features: the smaller sliding window size ss and the weight β for calculating the similarity of surprising features only. We


Figure 5.7: Tuning β and ss for similarity using surprising features only.

Figure 5.8: Tuning β and ss for similarity using surprising features only.

β will affect the MAP scores, as shown in Figure 5.7. We found that β > 2^{-4} (= 0.0625) makes the MAP score worse, and β values in the range (0.002, 0.03) give almost the same effect. Besides, MAP scores are good only for sliding window sizes 1 and 5-9 (if s_s > 9 the MAP score decreases; the only exception is s_s = 13), and the worst value occurs at s_s = 2.

Then we restrict the smaller sliding window size ss to 1 and 5-9 and let β take values of


gives the best MAP score for any β value. However, since ss = 1 is a special case that treats every new site connected on the previous day as a newly connected site, we also select another sliding window size, ss = 7, which gives the second-best MAP scores. In the end, two different pairs of parameters (β, ss) are chosen: (0.035, 1) and (0.005, 7). Each of them gives the best MAP score for its fixed smaller sliding window size ss.

5.3.3 Parameters for Both Features

Table 5.3: Four different combinations of parameters β, γ, ss, sl for getting the best similarities in static features and surprising features

Combination    Static Features (γ, sl)    Surprising Features (β, ss)
A              (0.95, 35)                 (0.005, 7)
B              (0.95, 35)                 (0.035, 1)
C              (0, 35)                    (0.005, 7)
D              (0, 35)                    (0.035, 1)

Since we have derived two parameter pairs each for the similarity of static features (γ, sl) and the similarity of surprising features (β, ss), there are four combinations of the parameters β, γ, ss and sl in total. These four combinations are listed in Table 5.3.

We set α = 0, 0.05, . . . , 1; the MAP scores for the four parameter combinations are shown in Figure 5.9. Although using surprising features only (α = 1) clearly gives the best MAP scores, we still need a parameter set that uses all features in order to compare their influence in this experiment. We therefore choose α = 0.85 with parameter combination B, which has the highest MAP score among the settings that use both static and surprising features. The final parameter sets for PLPF are listed in Table 5.4.
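The roles of α and γ at their endpoints (γ = 0 uses connection frequency only, γ = 1 uses connection count only, α = 0 uses static features only, α = 1 uses surprising features only) are consistent with a weighted blend of similarity scores. The sketch below assumes a plain convex combination between those endpoints; the linear form and the function names are assumptions made for illustration, and the thesis's own similarity definitions in Section 4.4 take precedence.

```python
def blend(weight, first, second):
    """Convex combination: weight * first + (1 - weight) * second."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * first + (1.0 - weight) * second


def combined_similarity(alpha, gamma, sim_count, sim_freq, sim_surprising):
    """Blend three similarity scores (assumed linear form).

    gamma = 1 keeps only the connection-count similarity,
    gamma = 0 keeps only the connection-frequency similarity.
    alpha = 1 keeps only the surprising-feature similarity,
    alpha = 0 keeps only the static-feature similarity.
    """
    sim_static = blend(gamma, sim_count, sim_freq)
    return blend(alpha, sim_surprising, sim_static)
```

With α = 0.85, as chosen for combination B, the surprising-feature similarity would carry most of the weight under this assumed form, which matches the observation that surprising features dominate the MAP score.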


Figure 5.9: Tuning α to combine both similarities using static features and surprising features.

Table 5.4: Three different parameter sets for PLPF to use static features only, surprising features only, and both static and surprising features

Parameter    PLPF     PLPF (Static-only)    PLPF (Surprising-only)
α            0.85     0                     1
β            0.035    X                     0.035
γ            0.95     0.95                  X
ss           1        X                     1


Chapter 6 Conclusion and Future Work

In this paper, we propose a framework, PLPF, for link prediction based on profiles. PLPF consists of two components: 1) an off-line component, which converts the connection log into an evolving graph of several consecutive graph snapshots and then periodically constructs and updates user profiles with features extracted from the evolving graph, and 2) an on-line component, which uses the profiles to predict the top-k possible links from k different users who have never connected to the specific site before. Four different types of connection features are used in the profiles to capture the connection behavior of users. In addition to the connection count, which is widely used in traditional methods, we introduce other features such as the connection frequency, newly connected sites, and common connection order on the newly connected sites, which can only be derived from the evolving graph view of the connection network.
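The on-line component summarized above can be sketched as a ranking step: users who have not yet connected to the new site are ordered by how similar their profile is to that of the first user who connected. The sketch below is an illustration only; the profile representation, the `similarity` callable, and all names are hypothetical, and any profile similarity function can be plugged in.

```python
import heapq


def predict_top_k(site, first_user, profiles, connected, similarity, k):
    """Rank users who never connected to `site` by profile similarity.

    profiles:   dict mapping user -> profile object (any representation).
    connected:  dict mapping site -> set of users already connected.
    similarity: callable(profile_a, profile_b) -> float.
    """
    already = connected.get(site, set())
    candidates = [
        user for user in profiles
        if user != first_user and user not in already
    ]
    reference = profiles[first_user]
    return heapq.nlargest(
        k, candidates, key=lambda user: similarity(reference, profiles[user])
    )


def jaccard(a, b):
    """Toy similarity on set-valued profiles, used only for the example."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, with set-valued toy profiles, `predict_top_k("new", "u1", profiles, connected, jaccard, 2)` returns the two unconnected users whose profiles overlap most with that of `u1`.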

In the experiment, we compare our method with the state-of-the-art method, EABIF Network, proposed by Tseng et al. [20], on a real dataset of internet connections. In effectiveness, PLPF performs better than EABIF Network when using either static features or surprising features, and PLPF with surprising features only gives the maximum improvement of 21.7% over EABIF Network with its best propagation model. In efficiency, PLPF shows a consistent computation time cost, whereas the computation time cost of EABIF Network keeps increasing because it records old information that should fade away as time evolves. Comparing PLPF with different types of features, PLPF with surprising features only always performs
