Scalable Proximity Estimation and Link Prediction in Online Social Networks


Han Hee Song Tae Won Cho Vacha Dave Yin Zhang Lili Qiu

The University of Texas at Austin

{hhsong, khatz, vacha, yzhang, lili}@cs.utexas.edu

ABSTRACT

Proximity measures quantify the closeness or similarity between nodes in a social network and form the basis of a range of applications in social sciences, business, information technology, computer networks, and cyber security. It is challenging to estimate proximity measures in online social networks due to their massive scale (with millions of users) and dynamic nature (with hundreds of thousands of new nodes and millions of edges added daily). To address this challenge, we develop two novel methods to efficiently and accurately approximate a large family of proximity measures.

We also propose a novel incremental update algorithm to enable near real-time proximity estimation in highly dynamic social networks. Evaluation based on a large amount of real data collected in five popular online social networks shows that our methods are accurate and can easily scale to networks with millions of nodes.

To demonstrate the practical value of our techniques, we consider a significant application of proximity estimation: link prediction, i.e., predicting which new edges will be added in the near future based on past snapshots of a social network. Our results reveal that (i) the effectiveness of different proximity measures for link prediction varies significantly across different online social networks and depends heavily on the fraction of edges contributed by the highest degree nodes, and (ii) combining multiple proximity measures consistently yields the best link prediction accuracy.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services; J.4 [Computer Applications]: Social and Behavioral Sciences—Sociology

General Terms

Algorithms, Human Factors, Measurement

Keywords

Social Network, Proximity Measure, Link Prediction, Embedding, Matrix Factorization, Sketch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

IMC’09, November 4–6, 2009, Chicago, Illinois, USA.

1. INTRODUCTION

A social network [53] is a social structure modeled as a graph, where nodes represent people or other entities embedded in a social context, and edges represent specific types of interdependency among entities, e.g., values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict or trade. Understanding the nature and evolution of social networks has important applications in a number of fields such as sociology, anthropology, biology, economics, information science, and computer science.

Traditionally, studies on social networks often focus on relatively small social networks (e.g., [30, 31] examine co-authorship networks with about 5000 nodes). Recently, however, social networks have gained tremendous popularity in cyberspace. Online social networks such as MySpace [40], Facebook [18] and YouTube [55] have each attracted tens of millions of visitors every month [44] and are now among the most popular sites on the Web [4]. The wide variety of online social networks and the vast amount of rich information available in these networks represent an unprecedented research opportunity for understanding the nature and evolution of social networks at massive scale.

A central concept in the computational analysis of social networks is the proximity measure, which quantifies the closeness or similarity between nodes in a social network. Proximity measures form the basis for a wide range of important applications in social and natural sciences (e.g., modeling complex networks [6, 13, 25, 42]), business (e.g., viral marketing [23], fraud detection [11]), information technology (e.g., improving Internet search [35], collaborative filtering [7]), computer networks (e.g., constructing overlay networks [45]), and cyber security (e.g., mitigating email spam [22], defending against Sybil attacks [56]).

Unfortunately, the explosive growth of online social networks imposes significant challenges on proximity estimation. First, online social networks are typically massive in scale. For example, MySpace has over 400 million user accounts [41], and Facebook reportedly has over 120 million active users worldwide [19]. As a result, many proximity measures that are highly effective in relatively small social networks (e.g., the classic Katz measure [26]) become computationally prohibitive in large online social networks with millions of nodes [48]. Second, online social networks are often highly dynamic, with hundreds of thousands of new nodes and millions of edges added daily. In such fast-evolving social networks, it is challenging to compute up-to-date proximity measures in a timely fashion.

Approach and contributions. To address the above challenges, we develop two novel techniques, proximity sketch and proximity embedding, for efficient and accurate proximity estimation in large social networks with millions of nodes. We then augment these techniques with a novel incremental proximity update algorithm to enable near real-time proximity estimation in highly dynamic social networks. Our techniques are applicable to a large family of commonly used proximity measures, which includes the aforementioned Katz measure [26], as well as rooted PageRank [30, 31] and escape probability [50]. These proximity measures are known to be highly effective for many applications [30, 31, 50], but were previously considered computationally prohibitive for large social networks [48, 50].

To demonstrate the practical value of our techniques, we consider a significant application of proximity estimation: link prediction, which refers to the task of predicting the edges that will be added to a social network in the future based on past snapshots of the network. As shown in [30, 31], proximity measures lie right at the heart of link prediction. Understanding which proximity measures lead to the most accurate link predictions provides valuable insights into the nature of social networks and can serve as the basis for comparing various network evolution models (e.g., [6, 13, 25, 42]). Accurate link prediction also allows online social networks to automatically make high-quality recommendations on potential new friends, making it much easier for individual users to expand their social neighborhood.

We evaluate the effectiveness of our proximity estimation methods using a large amount of real data collected in five popular online social networks: Digg [14], Flickr [20], LiveJournal [33], MySpace [40], and YouTube [55]. Our results show that our methods are accurate and can easily scale to handle large social networks with millions of nodes and hundreds of millions of edges. We also conduct extensive experiments to compare the effectiveness of a variety of proximity measures for link prediction in these online social networks. Our results uncover two interesting new findings:

(i) the effectiveness of different proximity measures varies significantly across different networks and depends heavily on the fraction of edges contributed by the highest degree nodes, and (ii) combining multiple proximity measures using an off-the-shelf machine learning software package consistently yields the best link prediction accuracy.

Paper organization. The rest of the paper is organized as follows. In Section 2, we develop techniques to efficiently and accurately approximate a large family of proximity measures in massive, dynamic online social networks. In Section 3, we describe link prediction techniques. In Section 4, we evaluate both proximity estimation and link prediction in five popular online social networks. In Section 5, we review related work. We conclude in Section 6.

2. SCALABLE PROXIMITY ESTIMATION

Proximity measures are the basis for many applications of social networks. As a result, a variety of proximity measures have been proposed. The simplest proximity measures are based on either the shortest graph distance or the maximum information flow between two nodes. One can also define proximity measures based on node neighborhoods (e.g., the number of common neighbors). Finally, several more sophisticated proximity measures involve infinite sums over the ensemble of all paths between two nodes (e.g., Katz measure [26], rooted PageRank [30, 31], and escape probability [50]). Compared with more direct proximity measures such as shortest graph distances and numbers of shared neighbors, path-ensemble based proximity measures capture more information about the underlying social structure and have been shown to be more effective in social networks with thousands of nodes [30, 31, 50].

Despite the effectiveness of path-ensemble based proximity measures, it is computationally expensive to summarize the ensemble of all paths between two nodes. The state of the art in estimating path-ensemble based proximity measures (e.g., [50]) typically can only handle social networks with tens of thousands of nodes. As a result, recent works on proximity estimation in large social networks (e.g., [48]) either dismiss path-ensemble based proximity measures due to their prohibitive computational cost or leave it as future work to compare with these proximity measures.

In this section, we address the above challenge by developing efficient and accurate techniques to approximate a large family of path-ensemble based proximity measures. Our techniques can handle social networks with millions of nodes, which are several orders of magnitude larger than what the state of the art can support. In addition, our techniques can support near real-time proximity estimation in highly dynamic social networks.

2.1 Problem Formulation

Below we first formally define three commonly used path-ensemble based proximity measures: (i) Katz measure, (ii) rooted PageRank, and (iii) escape probability. We then show that all three proximity measures can be efficiently estimated by solving a common subproblem, which we term the proximity inversion problem. In all our discussions below, we model a social network as a graph G = (V, E), where V is the set of nodes, and E is the set of edges. G can be either undirected or directed, depending on whether the social relationship is symmetric.

Katz measure. The Katz measure [26] is a classic path-ensemble based proximity measure. It is designed to capture the following simple intuition: the more paths there are between two nodes and the shorter these paths are, the stronger the relationship is (because there are more opportunities for the two nodes to discover and interact with each other in the social network). Given two nodes x, y ∈ V, the Katz measure Katz[x, y] is a weighted sum of the number of paths from x to y, exponentially damped by length to count short paths more heavily. Formally, we have

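$$\mathrm{Katz}[x, y] = \sum_{\ell=1}^{\infty} \beta_{\mathrm{Katz}}^{\ell} \cdot \left| \mathrm{paths}^{\langle \ell \rangle}_{x,y} \right| \tag{1}$$

where $\mathrm{paths}^{\langle \ell \rangle}_{x,y}$ is the set of length-ℓ paths from x to y, and $\beta_{\mathrm{Katz}}$ is a damping factor.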

Let A be the adjacency matrix of graph G, where

$$A[x, y] = \begin{cases} 1, & \text{if } \langle x, y \rangle \in E, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

As shown in [31], the Katz measures between all pairs of nodes (represented as a matrix Katz) can be derived as a function of the adjacency matrix A and the damping factor $\beta_{\mathrm{Katz}}$ as follows:

$$\mathrm{Katz} = \sum_{\ell=1}^{\infty} \beta_{\mathrm{Katz}}^{\ell} A^{\ell} = (I - \beta_{\mathrm{Katz}} A)^{-1} - I \tag{3}$$

where I is the identity matrix. Thus, in order to compute Katz, we just need to compute the matrix inverse $(I - \beta_{\mathrm{Katz}} A)^{-1}$.
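As a quick sanity check of Eq. 3, the following minimal sketch compares the closed form with a truncated version of the sum on a three-node toy graph. The function names, the toy adjacency matrix, and the value β = 0.1 are illustrative assumptions rather than anything used in the evaluation, and dense NumPy arrays are used only because the example is tiny.

```python
import numpy as np

def katz_closed_form(A, beta):
    """Katz = (I - beta*A)^-1 - I; the series converges when beta < 1/spectral_radius(A)."""
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

def katz_truncated(A, beta, l_max):
    """Sum over l = 1..l_max of beta^l * A^l (damped path counts up to length l_max)."""
    total = np.zeros_like(A, dtype=float)
    term = np.eye(A.shape[0])
    for _ in range(l_max):
        term = beta * (term @ A)   # after l iterations, term == beta^l * A^l
        total += term
    return total

# Hypothetical toy graph: the undirected path 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
beta = 0.1
exact = katz_closed_form(A, beta)
approx = katz_truncated(A, beta, l_max=30)
print(np.abs(exact - approx).max())   # gap between closed form and truncated sum
```

On this graph the directly connected pair (0, 1) receives a larger Katz score than the pair (0, 2), which is only connected through paths of length two or more.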

Rooted PageRank. The rooted PageRank [30, 31] is a special instance of personalized PageRank [8, 12]. It defines a random walk on the underlying graph G = (V, E) to capture the probability for two nodes to run into each other and uses this probability as an indicator of the node-to-node proximity. Specifically, given two nodes x, y ∈ V, the rooted PageRank RPR[x, y] is defined as the stationary probability of y under the following random walk: (i) with probability $1 - \beta_{\mathrm{RPR}}$, jump to node x, and (ii) with probability $\beta_{\mathrm{RPR}}$, move to a random neighbor of the current node.

The rooted PageRank between all node pairs (represented as a matrix RPR) can be derived as follows. Let D be a diagonal matrix with $D[i, i] = \sum_{j} A[i, j]$. Let $T = D^{-1} A$ be the adjacency matrix with row sums normalized to 1. We then have:

$$\mathrm{RPR} = (1 - \beta_{\mathrm{RPR}}) (I - \beta_{\mathrm{RPR}} T)^{-1} \tag{4}$$


Therefore, to compute RPR, we just need to compute the matrix inverse $(I - \beta_{\mathrm{RPR}} T)^{-1}$. Also note that the standard PageRank can be computed simply as the average of all the columns of RPR.
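Eq. 4 can be illustrated with a small dense computation. The sketch below is only an illustration under stated assumptions: the function name, the four-node toy graph, and β = 0.5 are hypothetical, and every node is assumed to have at least one neighbor so that $D^{-1}$ exists.

```python
import numpy as np

def rooted_pagerank(A, beta):
    """RPR = (1 - beta) * (I - beta * T)^-1 with T = D^-1 A, following Eq. 4."""
    degrees = A.sum(axis=1)          # D[i, i] = sum_j A[i, j]; assumed non-zero
    T = A / degrees[:, None]         # adjacency matrix with row sums normalized to 1
    m = A.shape[0]
    return (1.0 - beta) * np.linalg.inv(np.eye(m) - beta * T)

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
RPR = rooted_pagerank(A, beta=0.5)
print(RPR.sum(axis=1))   # row sums of the resulting matrix
```

Because T is row-stochastic, each row of the resulting matrix sums to 1, which is a convenient check that the matrix was assembled correctly.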

Escape probability. The escape probability [50] is another path-ensemble based proximity measure. Given two nodes x, y ∈ V, the escape probability EP[x, y] from x to y is defined as the probability that a random walk which starts from node x will visit node y before it returns to node x [16]. The escape probability EP[x, y] can be directly derived from the rooted PageRank as follows.

$$\mathrm{EP}[x, y] = \frac{Q[x, y]}{Q[x, x] \, Q[y, y] - Q[x, y] \, Q[y, x]} \tag{5}$$

where matrix $Q = \mathrm{RPR} / (1 - \beta_{\mathrm{RPR}}) = (I - \beta_{\mathrm{RPR}} T)^{-1}$.
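A minimal sketch of Eq. 5 on the smallest possible graph may help make the formula concrete. The function name and the two-node example are illustrative assumptions, and the matrix inverse is formed densely only because the example is tiny.

```python
import numpy as np

def escape_probability(A, beta, x, y):
    """EP[x, y] from Eq. 5, using Q = (I - beta * T)^-1 with T = D^-1 A."""
    T = A / A.sum(axis=1, keepdims=True)   # assumes every node has a neighbor
    Q = np.linalg.inv(np.eye(A.shape[0]) - beta * T)
    return Q[x, y] / (Q[x, x] * Q[y, y] - Q[x, y] * Q[y, x])

# Hypothetical toy graph: two nodes joined by a single undirected edge.
A = np.array([[0, 1],
              [1, 0]], dtype=float)
ep = escape_probability(A, beta=0.5, x=0, y=1)
```

For this graph Q has diagonal entries 4/3 and off-diagonal entries 2/3, so Eq. 5 evaluates to (2/3) / (16/9 − 4/9) = 0.5.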

As shown in [16], when the underlying graph G = (V, E) is undirected, the escape probability EP is also closely related to several other random walk induced proximity or distance measures: effective conductance EC, effective resistance ER, and commute time CT. Specifically, we have:

$$\mathrm{EC}[x, y] = |N(x)| \cdot \mathrm{EP}[x, y] \tag{6}$$

$$\mathrm{ER}[x, y] = 1 / \mathrm{EC}[x, y] \tag{7}$$

$$\mathrm{CT}[x, y] = 2 \cdot |E| \cdot \mathrm{ER}[x, y] \tag{8}$$

The common subproblem: proximity inversion. From the above discussions, it is evident that the key to estimating all three path-ensemble based proximity measures is to efficiently compute elements of the following matrix inverse:

$$P \triangleq (I - \beta M)^{-1} = \sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell} \tag{9}$$

where M is a sparse nonnegative matrix with millions of rows and columns, I is an identity matrix of the same size, and β ≥ 0 is a damping factor. We term this common subproblem the proximity inversion problem.

2.2 Scalable Proximity Inversion

The key challenge in solving the proximity inversion problem (i.e., computing elements of matrix $P = (I - \beta M)^{-1}$) is that while M is a sparse matrix, P is a dense matrix with millions of rows and columns. It is thus computationally prohibitive to compute and/or store the entire P matrix. To address the challenge, we first develop two novel dimensionality reduction techniques to approximate elements of $P = (I - \beta M)^{-1}$ based on a static snapshot of M: proximity sketch and proximity embedding. We then develop an incremental proximity update algorithm to approximate elements of P in an online setting when M continuously evolves.

2.2.1 Preparation

We first present an algorithm to approximate the sum of a subset of rows or columns of $P = (I - \beta M)^{-1}$ efficiently and accurately. We use this algorithm as a basic building block in both proximity sketch and proximity embedding.

Algorithm. Suppose we want to compute the sum of a subset of columns: $\sum_{i \in S} P[*, i]$, where S is a set of column indices. We first construct an indicator column vector v such that v[i] = 1 for ∀i ∈ S and v[j] = 0 for ∀j ∉ S. The sum of columns $\sum_{i \in S} P[*, i]$ is simply P v and can be approximated as:

$$P v = (I - \beta M)^{-1} v = \sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell} v \approx \sum_{\ell=0}^{\ell_{\max}} \beta^{\ell} M^{\ell} v \tag{10}$$

where $\ell_{\max}$ bounds the maximum length of the paths over which the summation is performed.
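The truncated expansion in Eq. 10 needs only repeated sparse matrix-vector products. The following sketch shows one way this might look in plain Python; the adjacency-list representation `rows`, the function name, and the toy inputs are assumptions made for illustration.

```python
def truncated_pv(rows, v, beta, l_max):
    """Approximate P v by the sum over l = 0..l_max of beta^l * M^l * v.

    rows[i] lists the non-zeros of row i of M as (column, value) pairs,
    so each pass over `rows` touches every non-zero exactly once.
    """
    total = list(v)   # l = 0 term, which is v itself
    term = list(v)
    for _ in range(l_max):
        # term <- beta * M * term
        term = [beta * sum(w * term[j] for j, w in row) for row in rows]
        total = [t + s for t, s in zip(total, term)]
    return total

# Hypothetical toy input: the undirected path 0 - 1 - 2, and S = {2}.
rows = [[(1, 1.0)],
        [(0, 1.0), (2, 1.0)],
        [(1, 1.0)]]
v = [0.0, 0.0, 1.0]                      # indicator vector of S
column_2 = truncated_pv(rows, v, beta=0.1, l_max=20)
print(column_2)                          # approximates the column P[*, 2]
```

With S containing a single index, the result approximates one column of P; with S containing all indices, it approximates the sum of all columns.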

[Figure 1: Proximity sketch. Each element P[x, y] of the m × m matrix P is hashed into entry $S_k[x, g_k(y)]$ in each hash table $S_k$ (k = 1, ..., H) via $S_k[x, g_k(y)]$ += P[x, y], so $S_k[x, g_k(y)]$ gives an upper bound on P[x, y]. P[x, y] is estimated by taking the minimum upper bound in all H hash tables.]

Similarly, to compute the sum of a subset of rows $\sum_{i \in S} P[i, *]$, we first construct an indicator row vector u such that u[i] = 1 for ∀i ∈ S and u[j] = 0 for ∀j ∉ S. We then approximate the sum of rows $\sum_{i \in S} P[i, *] = u P$ as:

$$u P = u (I - \beta M)^{-1} = \sum_{\ell=0}^{\infty} \beta^{\ell} u M^{\ell} \approx \sum_{\ell=0}^{\ell_{\max}} \beta^{\ell} u M^{\ell} \tag{11}$$

In one extreme where S contains all the column indices, we can compute the sum of all columns in P. This is useful for computing the PageRank (which is the average of all columns in the RPR matrix). In the other extreme where S contains only one element, we can efficiently approximate a single row or column of P.

Complexity. Suppose M is an m-by-m matrix with n non-zeros. Computing the product of sparse matrix M and a dense vector v takes O(n) time by exploiting the sparseness of M. So it takes $O(n \cdot \ell_{\max})$ time to compute $\{M^{\ell} v \mid \ell = 1, \ldots, \ell_{\max}\}$ and approximate P v. Note that the time complexity is independent of the size of the subset S. The complexity for computing uP is identical.

Note however that the above approximation algorithm is not efficient for estimating individual elements of P. In particular, even if we only want a single element P[x, y], we have to compute either a complete row P[x, ∗] or a complete column P[∗, y] in order to obtain an estimate of P[x, y]. So we only apply the above technique for preprocessing. We will develop several techniques in the rest of this section to estimate individual elements of P efficiently.

Benefits of truncation. We achieve two key benefits by truncating the infinite expansion $\sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell}$ to form a finite expansion $\sum_{\ell=0}^{\ell_{\max}} \beta^{\ell} M^{\ell}$. First, we completely eliminate the influence of paths with length above $\ell_{\max}$ on the resulting sums. This is desirable because, as pointed out in [30, 31], proximity measures that are unable to limit the influence of overly lengthy paths tend to perform poorly for link prediction. Second, we ensure that $\sum_{\ell=0}^{\ell_{\max}} \beta^{\ell} M^{\ell}$ is always finite, whereas elements of $\sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell}$ may reach infinity when the damping factor β is not small enough.

2.2.2 Proximity Sketch

Our first dimensionality reduction technique, proximity sketch, exploits the mice-elephant phenomenon that frequently arises in matrix P in practice, i.e., most elements in P are tiny (i.e., mice) but a few elements are huge (i.e., elephants).

Algorithm. Figure 1 shows the data structure for our proximity sketch, which consists of H hash tables: $S_1, \cdots, S_H$. Each $S_k$ is a 2-dimensional array with m rows and c ≪ m columns. A column hash function $g_k : \{1, \cdots, m\} \to \{1, \cdots, c\}$ is used to hash each element P[x, y] of the m × m matrix P into an element $S_k[x, g_k(y)]$ of the m × c array $S_k$. We ensure that different hash functions $g_k(\cdot)$ (k = 1, ..., H) are two-wise independent. In each $S_k$, each element P[x, y] is added to entry $S_k[x, g_k(y)]$. Thus,

$$S_k[a, b] = \sum_{y:\, g_k(y) = b} P[a, y] \tag{12}$$


Note that each column of $S_k$, namely $S_k[*, b] = \sum_{y:\, g_k(y) = b} P[*, y]$, can be computed efficiently as described in Section 2.2.1.

Since P is a nonnegative matrix, for any x, y ∈ V and any k ∈ [1, H], $S_k[x, g_k(y)]$ is an upper bound for P[x, y] according to Eq. 12. We can therefore estimate P[x, y] by taking the minimum upper bound in all H hash tables in O(H) time. That is:

$$\hat{P}[x, y] = \min_{k} S_k[x, g_k(y)] \tag{13}$$

Probabilistic accuracy guarantee. Our proximity sketch effectively summarizes each row of P, i.e., P[x, ∗], using a count-min sketch [10]: $\{S_k[x, *] \mid k = 1, \cdots, H\}$. As a result, we provide the same probabilistic accuracy guarantee as the count-min sketch, which is summarized in the following theorem (see [10] for detailed proof).

THEOREM 1. With $H = \lceil \ln \frac{1}{\delta} \rceil$ hash tables, each with $c = \lceil \frac{e}{\epsilon} \rceil$ columns, the estimate $\hat{P}[x, y]$ guarantees: (i) $P[x, y] \le \hat{P}[x, y]$; and (ii) with probability at least $1 - \delta$, $\hat{P}[x, y] \le P[x, y] + \epsilon \cdot \sum_{z} P[x, z]$.

Therefore, as long as P[x, y] is much larger than $\epsilon \cdot \sum_{z} P[x, z]$, the relative error of $\hat{P}[x, y]$ is small with high probability.

Extension. If desired, we can further reduce the space requirement of proximity sketch by aggregating the rows of $S_k$ (at the cost of lower accuracy). Specifically, we associate each $S_k$ with a row hash function $f_k(\cdot)$. We then compute

$$R_k[a, b] = \sum_{x:\, f_k(x) = a} S_k[x, b] \tag{14}$$

and store $\{R_k\}$ (instead of $\{S_k\}$) as the final proximity sketch. Clearly, we have $R_k[a, b] = \sum_{x:\, f_k(x) = a} \sum_{y:\, g_k(y) = b} P[x, y]$. For any x, y ∈ V, we can then estimate P[x, y] as

$$\hat{P}[x, y] = \min_{k} R_k[f_k(x), g_k(y)] \tag{15}$$

2.2.3 Proximity Embedding

Our second dimensionality reduction technique, proximity embedding, applies matrix factorization to approximate P as the product of two rank-r factor matrices U and V:

$$P_{m \times m} \approx U_{m \times r} \cdot V_{r \times m} \tag{16}$$

In this way, with O(2mr) total state for factor matrices U and V, we can approximate any P[x, y] in O(r) time as:

$$\hat{P}[x, y] = \sum_{k=1}^{r} U[x, k] \cdot V[k, y] \tag{17}$$

Our technique is motivated by recent research on embedding network distance (e.g., end-to-end round-trip time) into low-dimensional space (e.g., [32, 34, 43, 49]). Note however that proximity is the opposite of distance: the lower the distance, the higher the proximity. As a result, techniques effective for distance embedding do not necessarily work well for proximity embedding.

Algorithm. As shown in Figure 2(a), our goal is to derive the two rank-r factor matrices U and V based on only a subset of rows P [L, ∗] and columns P [∗, L], where L is a set of indices (which we term the landmark set). We achieve this goal by taking the following five steps:

1. Randomly select a subset of ℓ nodes as the landmark set L. The probability for a node i to be included in L is proportional to the PageRank of node i in the underlying graph¹. Note that the PageRank for all the nodes can be precomputed efficiently using the finite expansion method in Section 2.2.1.

2. Compute sub-matrices P[L, ∗] and P[∗, L] efficiently by computing each row P[i, ∗] and each column P[∗, i] (i ∈ L) separately as described in Section 2.2.1.

3. As shown in Figure 2(b), apply singular value decomposition (SVD) to obtain the best rank-r approximation of P[L, L]:

$$P[L, L] \approx U[L, *] \cdot V[*, L] \tag{18}$$

4. Our goal is to find U and V such that U · V is a good approximation of P. As a result, U · V[∗, L] should be a good approximation of P[∗, L]. We can therefore find U such that U · V[∗, L] best approximates sub-matrix P[∗, L] in least-squares sense (shown in Figure 2(c)). Given the use of SVD in step 3, the best U is simply

$$U = P[*, L] \cdot V[*, L]^{T} \tag{19}$$

5. Similarly, find V such that U[L, ∗] · V best approximates sub-matrix P[L, ∗] in least-squares sense (shown in Figure 2(d)), which is simply

$$V = U[L, *]^{T} \cdot P[L, *] \tag{20}$$

¹ We also consider uniform landmark selection, but it yields worse accuracy than PageRank based landmark selection (see Section 4).

[Figure 2: Proximity embedding. (a) goal: approximate P as the product of two rank-r matrices U, V by only computing a subset of rows P[L, ∗] and columns P[∗, L]; (b) factorize P[L, L] to get U[L, ∗], V[∗, L]; (c) obtain U from P[∗, L] and V[∗, L]; (d) obtain V from P[L, ∗] and U[L, ∗].]

Accuracy. Unlike proximity sketch, proximity embedding does not provide any provable data-independent accuracy guarantee. However, as a data-adaptive dimensionality reduction technique, when matrix P is in fact low-rank (i.e., having good low-rank approximations), proximity embedding has the potential to achieve even better accuracy than proximity sketch. Our empirical results in Section 4 suggest that this is indeed the case for the Katz measure.
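Steps 3 to 5 of the embedding procedure can be sketched with NumPy as below. This is a hedged illustration rather than the authors' implementation: the landmark set is passed in directly instead of being sampled by PageRank, the singular values are split evenly between the two factors, and the two least-squares fits are solved with a pseudo-inverse (`np.linalg.pinv`) rather than with the closed forms of Eqs. 19 and 20. The function name and the random low-rank test matrix are assumptions.

```python
import numpy as np

def proximity_embedding(P_rows, P_cols, L, r):
    """Derive rank-r factors U (m x r) and V (r x m) from landmark rows and columns.

    P_rows = P[L, :] (one row per landmark), P_cols = P[:, L], L = landmark indices.
    """
    P_LL = P_rows[:, L]                            # landmark block P[L, L]
    W, s, Zt = np.linalg.svd(P_LL)                 # P_LL = W @ diag(s) @ Zt
    U_L = W[:, :r] * np.sqrt(s[:r])                # U[L, *], best rank-r factor
    V_L = np.sqrt(s[:r])[:, None] * Zt[:r, :]      # V[*, L]
    U = P_cols @ np.linalg.pinv(V_L)               # least-squares fit of U @ V_L to P[:, L]
    V = np.linalg.pinv(U_L) @ P_rows               # least-squares fit of U_L @ V to P[L, :]
    return U, V

# Hypothetical test matrix of exact rank 2, with three landmarks.
rng = np.random.default_rng(0)
P = rng.random((8, 2)) @ rng.random((2, 8))
L = [0, 3, 5]
U, V = proximity_embedding(P[L, :], P[:, L], L, r=2)
max_error = np.abs(U @ V - P).max()
print(max_error)
```

When P has rank exactly r and the landmark block P[L, L] also has rank r, this construction reproduces P up to rounding error; for a real proximity matrix the factors give only an approximation whose quality depends on how close P is to low rank.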

2.2.4 Incremental Proximity Update

To enable online proximity estimation, we periodically checkpoint M and use the above dimensionality reduction techniques to approximate P for the last checkpoint of M. Between two checkpoints, we apply an incremental update algorithm to approximate $P' = (I - \beta \cdot M')^{-1}$, where $M' = M + \Delta$ is the current matrix.

Our algorithm is based on the second-order approximation of $P'$:

$$\begin{aligned} P' &= [I - \beta (M + \Delta)]^{-1} \\ &\approx (I - \beta M)^{-1} + \beta \Delta + \beta^{2} (\Delta M + M \Delta + \Delta^{2}) \end{aligned} \tag{21}$$


The second-order approximation works well as long as ∆ has only a few non-zero elements and β is small, making higher order terms negligible.
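The second-order form follows from expanding both inverses as power series (Eq. 9) and keeping terms up to β². A brief derivation, assuming β is small enough for both series to converge:

$$P' = \sum_{\ell=0}^{\infty} \beta^{\ell} (M + \Delta)^{\ell} = I + \beta (M + \Delta) + \beta^{2} (M^{2} + M \Delta + \Delta M + \Delta^{2}) + O(\beta^{3})$$

$$P = \sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell} = I + \beta M + \beta^{2} M^{2} + O(\beta^{3})$$

$$P' - P = \beta \Delta + \beta^{2} (\Delta M + M \Delta + \Delta^{2}) + O(\beta^{3})$$

Every term in the discarded remainder contains at least one factor of ∆ and at least three factors of β, which is why a sparse ∆ and a small β keep the remainder small.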

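To estimate an individual element $P'[x, y]$, we simply use:

$$\begin{aligned} P'[x, y] \approx{} & P[x, y] + \beta \Delta[x, y] + \sum_{k:\, \Delta[x, k] \neq 0} \beta^{2} \Delta[x, k] M[k, y] \\ & + \sum_{k:\, \Delta[k, y] \neq 0} \beta^{2} M[x, k] \Delta[k, y] + \sum_{k:\, \Delta[x, k] \neq 0} \beta^{2} \Delta[x, k] \Delta[k, y] \end{aligned} \tag{22}$$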

If we checkpoint M frequently enough, the difference between the last checkpoint M and the current matrix $M'$ will be quite small. In other words, the difference matrix ∆ is likely to be sparse. As a result, we expect row ∆[x, ∗] and column ∆[∗, y] to have few non-zero elements. By leveraging such sparseness, we can efficiently compute Eq. 22 in an online fashion. We demonstrate the efficiency and accuracy of our incremental proximity update algorithm in Section 4.2.3.
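The per-element update of Eq. 22 can be sketched as follows. Dense NumPy arrays and an exact inverse are used here purely so the estimate can be compared against the true value on a toy graph; in the setting described above, P[x, y] would come from a sketch or embedding and M and ∆ would be sparse. The function name, the 4-cycle example, and β = 0.05 are assumptions.

```python
import numpy as np

def incremental_entry(P_xy, M, Delta, beta, x, y):
    """Second-order estimate of P'[x, y] following Eq. 22."""
    row_nz = np.flatnonzero(Delta[x, :])   # indices k with Delta[x, k] != 0
    col_nz = np.flatnonzero(Delta[:, y])   # indices k with Delta[k, y] != 0
    estimate = P_xy + beta * Delta[x, y]
    estimate += beta**2 * sum(Delta[x, k] * M[k, y] for k in row_nz)
    estimate += beta**2 * sum(M[x, k] * Delta[k, y] for k in col_nz)
    estimate += beta**2 * sum(Delta[x, k] * Delta[k, y] for k in row_nz)
    return estimate

# Hypothetical toy graph: the 4-cycle 0-1-2-3-0; the update adds the edge 0-2.
M = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
Delta = np.zeros_like(M)
Delta[0, 2] = Delta[2, 0] = 1.0
beta = 0.05
P_old = np.linalg.inv(np.eye(4) - beta * M)             # proximity at the checkpoint
P_new = np.linalg.inv(np.eye(4) - beta * (M + Delta))   # exact value, for comparison only
estimate = incremental_entry(P_old[0, 2], M, Delta, beta, 0, 2)
print(abs(P_old[0, 2] - P_new[0, 2]), abs(estimate - P_new[0, 2]))
```

The two printed numbers are the error of the stale checkpoint value and the error after the incremental correction; only the non-zero entries of row ∆[x, ∗] and column ∆[∗, y] are visited.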

3. LINK PREDICTION TECHNIQUES

We use link prediction as a significant application of our proximity estimation methods. Our goal is to understand (i) the effectiveness of various proximity measures in the context of link prediction, and (ii) the benefit of combining multiple proximity measures. In this section, we summarize the link predictors and the proximity measures we use.

3.1 Link Predictors

We consider two types of link predictors: (i) basic link predictor that uses a single proximity measure, and (ii) composite link predictor that uses multiple proximity measures.

Basic link predictors. A basic link predictor consists of a proximity measure prox[∗, ∗] and a threshold T. Given an input graph G = (V, E) (which models a past snapshot of a given social network), a node pair ⟨x, y⟩ ∉ E is predicted to form an edge in the future if and only if the proximity between x and y is sufficiently large, i.e., prox[x, y] ≥ T.
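A basic link predictor is a thresholding rule and can be sketched in a few lines. The function below enumerates all node pairs, which is only reasonable for toy inputs; the function name, the example graph, and the use of a common-neighbor count as the proximity measure are assumptions for illustration.

```python
def predict_links(nodes, edges, prox, threshold):
    """Return the node pairs (x, y) not in E whose proximity is at least the threshold."""
    return [(x, y) for x in nodes for y in nodes
            if x != y and (x, y) not in edges and prox(x, y) >= threshold]

# Hypothetical toy graph: the undirected path 0 - 1 - 2 - 3, stored in both directions.
nodes = [0, 1, 2, 3]
edges = {(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)}
neighbors = {x: {y for (u, y) in edges if u == x} for x in nodes}

def common_neighbor_count(x, y):
    return len(neighbors[x] & neighbors[y])

predicted = predict_links(nodes, edges, common_neighbor_count, threshold=1)
print(predicted)
```

Raising the threshold trades recall for precision: fewer pairs are predicted, but each predicted pair has stronger evidence of proximity.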

Composite link predictors. A composite link predictor uses machine learning techniques to make link predictions based on multiple proximity measures. We use the WEKA machine learning package [21] to automatically generate composite link predictors using a number of machine learning algorithms, including the REPtree decision tree learner, J48 decision tree learner, JRip rule learner, support vector machine (SVM) learner, and Adaboost learner. The results are consistent across different learners we use. So we only report the results of the REPtree decision tree learner. REPtree is a variant of the commonly used C4.5 decision tree learning algorithm [46]. It builds a decision tree using information gain and prunes it using reduced error pruning. It allows direct control on the depth of the learned decision tree, making it easy to visualize and interpret the resulting composite link predictor.

3.2 Proximity Measures

We consider three classes of proximity measures summarized in Table 1, which are based on (i) graph distance, (ii) node neighborhood, and (iii) ensemble of paths, respectively.

Notations. We model a social network as a graph G = (V, E), where V is the set of nodes, and E is the set of edges. G can be either directed or undirected. For a node x, let N(x) = {y | ⟨x, y⟩ ∈ E} be the set of neighbors x has in G. Similarly, let N⁻¹(x) = {y | ⟨y, x⟩ ∈ E} be the set of inverse neighbors x has in G (i.e., nodes that have x as their neighbor). Let A be the adjacency matrix for G (defined in Eq. 2). Let $T = D^{-1} A$ be the adjacency matrix with row sums normalized to 1, where D is a diagonal matrix with $D[i, i] = \sum_{j} A[i, j]$.

| Proximity measure | Definition |
| --- | --- |
| graph distance | $\mathrm{GD}[x, y]$ = negated distance of the shortest path from x to y |
| common neighbors | $\mathrm{CN}[x, y] = \lvert N(x) \cap N(y) \rvert$ |
| Adamic/Adar | $\mathrm{AA}[x, y] = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log \lvert N(z) \rvert}$ |
| preferential attachment | $\mathrm{PA}[x, y] = \lvert N(x) \rvert \cdot \lvert N(y) \rvert$ |
| PageRank product | $\mathrm{PRP}[x, y] = PR(x) \cdot PR(y)$, where $PR(x) = \frac{1 - d}{\lvert V \rvert} + d \sum_{z \in N^{-1}(x)} \frac{PR(z)}{\lvert N(x) \rvert}$ |
| Katz | $\mathrm{Katz}[x, y] = \sum_{\ell=1}^{\infty} \beta^{\ell} \cdot \lvert \mathrm{paths}^{\langle \ell \rangle}_{x,y} \rvert$; we have $\mathrm{Katz} = (I - \beta A)^{-1} - I$ |

Table 1: Summary of proximity measures

Graph distance based proximity measure. Perhaps the most direct metric for quantifying how close two nodes are is the graph distance. We thus define a proximity measure GD[x, y] as the negative of the shortest-path distance from x to y. Note that the use of negated (instead of original) shortest-path distance ensures that the proximity measure GD[x, y] increases as x and y get closer.

Note that it is inefficient to apply Dijkstra’s algorithm to compute shortest path distance from x to y when G has millions of nodes. Instead, we exploit the small-world property [27] of the social network and apply expanded ring search to compute the shortest path distance from x to y. Specifically, we initialize S = {x} and D = {y}. In each step we either expand set S to include its members’ neighbors (i.e., S = S ∪ {v | ⟨u, v⟩ ∈ E ∧ u ∈ S}) or expand set D to include its members’ inverse neighbors (i.e., D = D ∪ {u | ⟨u, v⟩ ∈ E ∧ v ∈ D}). We stop whenever S ∩ D ≠ ∅; the number of steps taken so far gives the shortest path distance. For efficiency, we always expand the smaller set between S and D in each step. We also stop when a maximum number of steps is reached (set to 6 in our evaluation).
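The expanded ring search can be sketched as a bidirectional set expansion. The function name, the dictionary-based neighbor maps, and the toy graph below are assumptions; the default of six steps mirrors the limit stated above.

```python
def ring_search_distance(out_nbrs, in_nbrs, x, y, max_steps=6):
    """Shortest-path distance from x to y, or None if none is found within max_steps.

    out_nbrs[u] lists the neighbors of u; in_nbrs[v] lists the inverse neighbors of v.
    """
    if x == y:
        return 0
    S, D = {x}, {y}
    for step in range(1, max_steps + 1):
        if len(S) <= len(D):
            # Expand the forward set S by one hop.
            S = S | {v for u in S for v in out_nbrs.get(u, ())}
        else:
            # Expand the backward set D by one hop along inverse neighbors.
            D = D | {u for v in D for u in in_nbrs.get(v, ())}
        if S & D:
            return step
    return None

# Hypothetical toy graph: the directed path 0 -> 1 -> 2 -> 3.
out_nbrs = {0: [1], 1: [2], 2: [3]}
in_nbrs = {1: [0], 2: [1], 3: [2]}
distance = ring_search_distance(out_nbrs, in_nbrs, 0, 3)
gd = -distance if distance is not None else None   # GD[x, y] is the negated distance
print(distance, gd)
```

After a forward expansions and b backward expansions, S holds the nodes within a hops of x and D holds the nodes within b hops of y, so the sets first intersect exactly when a + b equals the shortest-path distance.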

Node neighborhood based proximity measures. We define four proximity measures based on node neighborhood.

• Common neighbors. For two nodes x and y, they are more likely to become friends when the overlap of their neighborhoods is large. The simplest form of this approach is to count the size of the intersection: CN[x, y] = |N (x) ∩ N (y)|.

• Adamic/Adar. Like common neighbors, Adamic/Adar [1] also tries to measure the size of the intersection of two neighborhoods. However, Adamic/Adar also takes “rareness” into account, giving more weight to common nodes with a smaller number of friends: $\mathrm{AA}[x, y] = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log |N(z)|}$.

• Preferential attachment. The preferential attachment is based on the idea that the probability of a node gaining a new neighbor is proportional to the size of its current neighborhood. Moreover, the probability of two users becoming friends is proportional to the product of the numbers of their current friends. We therefore define a proximity measure: PA[x, y] = |N(x)| · |N(y)|.

• PageRank product. PageRank is developed to analyze the hyperlink structure of Web pages by treating a hyperlink as a vote. The PageRank of a node depends on the count of inbound links and the PageRank of outbound neighbors. Formally, the PageRank of a node x, denoted as PR(x), is defined recursively on G = (V, E) as

$$PR(x) = \frac{1 - d}{|V|} + d \sum_{z \in N^{-1}(x)} \frac{PR(z)}{|N(x)|} \tag{23}$$

where d is a damping factor. We define the PageRank product of two nodes x and y as the product of two PageRank values: PRP[x, y] = PR(x) · PR(y).
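The first three neighborhood-based measures translate directly into set operations. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets (so every common neighbor has at least two neighbors and the logarithm in Adamic/Adar is positive) and uses the natural logarithm; the function names and the toy graph are illustrative.

```python
import math

def common_neighbors(N, x, y):
    """CN[x, y] = |N(x) ∩ N(y)|."""
    return len(N[x] & N[y])

def adamic_adar(N, x, y):
    """AA[x, y] = sum over common neighbors z of 1 / log |N(z)|."""
    return sum(1.0 / math.log(len(N[z])) for z in N[x] & N[y])

def preferential_attachment(N, x, y):
    """PA[x, y] = |N(x)| * |N(y)|."""
    return len(N[x]) * len(N[y])

# Hypothetical undirected toy graph as neighbor sets.
N = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
print(common_neighbors(N, "a", "b"),
      adamic_adar(N, "a", "b"),
      preferential_attachment(N, "a", "b"))
```

In this example the common neighbor c, which has only two neighbors, contributes more to the Adamic/Adar score than d, which has three.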


Path-ensemble based proximity measures. We use the Katz measure (Katz[x, y]) as a path-ensemble based proximity measure (described in Section 2.1). We use the Katz measure as the representative of path-ensemble based proximity measures for two main reasons. First, as shown in [30, 31], the Katz measure is more effective than other path-ensemble based proximity measures such as the rooted PageRank. Second, our results in Section 4 show that the accuracy of our proximity estimation methods is the highest for the Katz measure.

4. EVALUATION

4.1 Dataset Description

| Network | Snapshot Date | # of Connected Nodes | # of Links | # of Added Links | Asymmetric Link Fraction |
| --- | --- | --- | --- | --- | --- |
| Digg | 9/15/2008 | 535,071 | 4,432,726 | – | 58.3% |
| | 10/25/2008 | 567,771 | 4,813,668 | 656,478 | |
| | 11/10/2008 | 567,771 | 4,941,401 | 175,958 | |
| Flickr | 3/01/2007 | 1,932,735 | 26,702,209 | – | 37.8% |
| | 4/15/2007 | 2,172,692 | 30,393,940 | 3,691,731 | |
| | 5/18/2007 | 2,172,692 | 32,399,243 | 2,005,303 | |
| LiveJournal | 11/13/2008 | 1,769,493 | 61,488,262 | – | 28.3% |
| | 12/05/2008 | 1,769,543 | 61,921,736 | 1,566,059 | |
| | 1/30/2009 | 1,769,543 | 62,843,995 | 3,093,064 | |
| MySpace | 12/11/2008 | 2,128,945 | 89,138,628 | – | 0% |
| | 1/11/2009 | 2,137,773 | 90,629,452 | 1,845,898 | |
| | 2/14/2009 | 2,137,773 | 89,341,780 | 696,016 | |
| YouTube | 4/30/2007 | 2,012,280 | 9,762,825 | – | 0% |
| | 6/15/2007 | 2,532,050 | 13,017,064 | 3,254,239 | |
| | 7/23/2007 | 2,532,050 | 15,337,226 | 2,320,162 | |
| Wikipedia | 9/30/2006 | 1,636,961 | 28,950,137 | – | 83.1% |
| | 12/31/2006 | 1,758,323 | 33,974,708 | 5,024,571 | |
| | 4/06/2007 | 1,758,323 | 38,349,329 | 4,374,621 | |

Table 2: Dataset summary

We carry out our evaluation on five popular online social networks: Digg [14], Flickr [20], LiveJournal [33], MySpace [40], and YouTube [55]. For comparison, we also examine the hyperlink structure of Wikipedia [54]. For each network, we conduct three crawls and make three snapshots of the network. Table 2 summarizes the characteristics of the three snapshots for each of the networks. Note that, for the purpose of link prediction, we only use connected nodes (i.e., nodes with at least one incoming or outgoing friendship link), rather than considering all the crawled nodes. Another point to note is that link prediction uses one snapshot of the network to predict the new links that are formed in the next snapshot, so the same set of users should appear in two consecutive snapshots. Hence, for a growing network, the number of users appearing in the last snapshot that we create may be less than the total number of users, because we restrict it to match the previous snapshot. Lastly, although there can be both link additions and deletions between two snapshots, the goal of link prediction is to predict the links that get added, so we explicitly show the number of added links between two consecutive snapshots in Table 2.
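As a rough illustration of how the added-link counts can be derived from two snapshots, the sketch below restricts the later snapshot to a fixed user set and takes a set difference. The function name, the edge representation (directed `(u, v)` tuples), and the choice of which user set to fix are assumptions; the actual crawl-processing pipeline is not described in this detail.

```python
def added_links(earlier_edges, later_edges, users):
    """Links present in the later snapshot but not in the earlier one.

    Only links whose endpoints both belong to `users` are considered, so that
    the two snapshots are compared over the same set of users.
    Deleted links (present earlier, absent later) are ignored by construction.
    """
    earlier = set(earlier_edges)
    later_restricted = {
        (u, v) for (u, v) in later_edges if u in users and v in users
    }
    return later_restricted - earlier


if __name__ == "__main__":
    earlier = {(1, 2), (2, 3)}
    later = {(1, 2), (3, 1), (3, 4)}   # (2, 3) was deleted, user 4 is new
    print(added_links(earlier, later, users={1, 2, 3}))
```

Because deletions are ignored, the number of added links can exceed the net change in the link count between two snapshots, as is the case for several networks in Table 2.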

Digg [14] is a website for users to share interesting Web content by posting a link to it. The posted link can be voted as either positive (“digg”) or negative (“bury”) by other users. Digg allows a user to become a “fan” of other users, which we consider as a friendship relation. All the friendship links together form a directed social graph. Overall, 58.3% of directly connected user pairs in Digg have asymmetric friendship (i.e., the friendship link exists in only one direction between the two users). We obtained the entire list of 1.9 million users in September 2008. We crawled friendship links among these users using the Digg API [15] in September 2008, October 2008, and November 2008. The resulting snapshots contain more than 500,000 connected users (i.e., users with at least one incoming or outgoing friendship link) out of 1.9 million crawled users.
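The asymmetric link fraction can be computed directly from a directed edge list. The sketch below assumes that a "directly connected user pair" is an unordered pair with at least one link in either direction and that self-links are ignored; both readings are assumptions rather than details stated in the text.

```python
def asymmetric_fraction(edges):
    """Fraction of directly connected user pairs linked in one direction only."""
    edge_set = {(u, v) for (u, v) in edges if u != v}
    # Unordered pairs that have at least one link between them.
    pairs = {frozenset((u, v)) for (u, v) in edge_set}
    # Each asymmetric pair contributes exactly one edge with no reverse edge.
    one_way = sum(1 for (u, v) in edge_set if (v, u) not in edge_set)
    return one_way / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # (1, 2) is mutual; (1, 3) and (3, 4) are one-way.
    print(asymmetric_fraction([(1, 2), (2, 1), (1, 3), (3, 4)]))
```

Under this definition an undirected network such as MySpace, where every link is mutual, has an asymmetric fraction of 0.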

Flickr [20] is a popular photo-sharing website. Flickr allows users to add other people as “contacts” to form a directed social link. We use the Flickr dataset collected by [36], which represents a breadth-first search on the graph from a set of seed users. The dataset gives the growth of Flickr for 104 days and contains 33 million links among 2.3 million users. We treat the first 25 days as the bootstrap period to ensure that the crawl has sufficiently stabilized. We then partition the remaining dataset into three snapshots separated by approximately 40 days each. Note that the third snapshot of Flickr contains links for the same 2.17 million users that appear in the second snapshot (and not the entire 2.3 million users).

LiveJournal [33] is a Web community that allows its users to post entries to personal journals. LiveJournal also acts as a social networking site, where a user can become a “fan” of another LiveJournal user. We consider this “fan” relationship as a directed friendship link in the social graph. Since LiveJournal does not provide a complete list of users, we obtained a list of active users who have published posts by analyzing periodic RSS announcements of recently updated journals starting from July 2008. We then used the LiveJournal API to gather friendship information of 2.2 million active users in November 2008, December 2008, and January 2009. The resulting snapshots have about 1.8 million connected users who have non-zero friendship links.

MySpace [40] is a social networking site where users can interact with each other by personalizing pages, commenting on others’ photos and videos, and making friends. For two MySpace users to become friends, both parties have to agree. Therefore, the social links in MySpace are undirected and thus symmetric. We crawled 10 million MySpace users out of over 400 million users by taking the first 10 million user IDs in December 2008, January 2009, and February 2009. After discarding all the inactive, deleted, private, and solitary MySpace IDs, we get information for approximately 2.1 million users in each resulting snapshot.

YouTube [55] is a popular video-sharing website. Registered users can connect with others by creating friendship links. We use the undirected version of the social graph collected by [36], which covers the growth of YouTube for 165 days with 18 million added links among 3.2 million users. We divide the dataset into three snapshots separated by 45 days each. Note that the third snapshot of YouTube in Table 2 contains links for 2.5 million users that also appear in the second snapshot (and not the entire 3.2 million users).

Wikipedia [54] is an online encyclopedia that relies on users’ collaboration to build content. Different wiki pages are connected through hyperlinks. We compare Wikipedia’s hyperlink structure against the social graphs of users from the previous five online social networks. Similar to general Web pages, most links in Wikipedia are asymmetric. We use the data collected by [36] over a six-year period from 2001 to 2007, which contains 38 million links connecting 1.8 million pages. We extract three snapshots separated by approximately 90 days each.

4.2 Proximity Estimation Algorithms

In this section, we evaluate the accuracy and scalability of our proximity estimation methods using the above six datasets. We present results for Katz and RPR (defined in Section 2). The accuracy for escape probability (EP) is similar to RPR (due to their close relationship in Eq. 5) and is omitted in the interest of brevity.

Network       PageRank based selection   Uniform selection
Digg          0.00015                    0.00023
Flickr        0.00010                    0.00238
LiveJournal   0.01222                    0.07322
MySpace       0.00016                    0.00032
YouTube       0.02115                    0.05410
Wikipedia     0.00266                    0.00328

Table 3: NMAE of different landmark selection schemes.

Accuracy metrics. We quantify the estimation error using three different metrics: (i) Normalized Absolute Error (NAE), defined as $\frac{|est_i - actual_i|}{\mathrm{mean}_i(actual_i)}$; (ii) Normalized Mean Absolute Error (NMAE), defined as $\frac{\sum_i |est_i - actual_i|}{\sum_i actual_i}$; and (iii) Relative Error, defined as $\frac{|est_i - actual_i|}{actual_i}$, where $est_i$ and $actual_i$ denote the estimated and actual values of the proximity measure for node pair $i$, respectively.
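The three error metrics can be written down in a few lines. The following is a minimal sketch with a hypothetical function name; skipping node pairs whose actual value is zero when computing the relative error is an assumption, since the text does not say how such pairs are handled.

```python
def error_metrics(est, actual):
    """Return (per-pair NAE list, NMAE, per-pair relative error list)."""
    abs_err = [abs(e - a) for e, a in zip(est, actual)]
    mean_actual = sum(actual) / len(actual)
    # NAE: absolute error normalized by the mean actual value.
    nae = [err / mean_actual for err in abs_err]
    # NMAE: total absolute error normalized by the total actual value.
    nmae = sum(abs_err) / sum(actual)
    # Relative error: undefined when actual == 0, so those pairs are skipped.
    rel = [err / a for err, a in zip(abs_err, actual) if a != 0]
    return nae, nmae, rel


if __name__ == "__main__":
    nae, nmae, rel = error_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
    print(nae, nmae, rel)
```

Note that NMAE equals the mean of the per-pair NAE values, which is why a CDF of NAE and a single NMAE number can be reported side by side.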

Since it is expensive to compute the actual proximity measures over all the data points, we randomly sample 100,000 data points by first randomly selecting 200 rows from the proximity matrix and then selecting 500 elements from each of these rows. We then compute errors for these 100,000 data points.
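The sampling procedure can be sketched as follows. The function name and seed are illustrative, and sampling without replacement (both for rows and for the elements within a row) is an assumption.

```python
import random


def sample_pairs(n, n_rows=200, n_cols=500, seed=0):
    """Pick n_rows random rows of an n x n proximity matrix, then n_cols
    random elements from each row, giving n_rows * n_cols (row, col) pairs."""
    rng = random.Random(seed)
    rows = rng.sample(range(n), n_rows)
    return [(r, c) for r in rows for c in rng.sample(range(n), n_cols)]


if __name__ == "__main__":
    pairs = sample_pairs(n=10_000)
    print(len(pairs))  # 200 rows x 500 elements
```

Sampling whole rows first keeps the cost of computing ground truth manageable, because exact proximity values are then needed for only 200 source nodes.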

4.2.1 Proximity Embedding

We first evaluate the accuracy of proximity embedding. We aim to answer the following questions: (i) How accurate is proximity embedding in estimating Katz and RPR? (ii) How many dimensions and landmarks are required to achieve high accuracy? (iii) How does the landmark selection algorithm affect accuracy?

Parameter settings. Throughout our evaluation, we use a damping factor of β = 0.05, ℓmax = 6, and 1600 landmarks unless otherwise specified. We also vary these parameters to understand their impact. By default, we select landmarks based on the PageRank of each node. Specifically, we first compute PageRank for each node and normalize the sum of PageRank of all nodes to 1. We then use the normalized PageRank as the probability of assigning a node as a landmark. In this way, nodes with high PageRank values are more likely to become landmarks. For comparison, we also examine the performance of uniform landmark selection, which selects landmarks uniformly at random.
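One way to realize the two landmark selection schemes is sketched below. The text does not specify the sampling procedure, so the use of weighted sampling without replacement (here via Efraimidis-Spirakis random keys) is an assumption, as are the function names and the seed.

```python
import heapq
import random


def pagerank_landmarks(pagerank, k, seed=0):
    """Pick k distinct landmarks, favoring nodes with high PageRank.

    Each node draws a key u ** (1 / w) with u uniform in [0, 1) and w its
    PageRank; the k largest keys win (weighted sampling without replacement).
    The weights need not be normalized for this scheme to work.
    """
    rng = random.Random(seed)
    keyed = (
        (rng.random() ** (1.0 / w), node)
        for node, w in pagerank.items()
        if w > 0
    )
    return [node for _, node in heapq.nlargest(k, keyed)]


def uniform_landmarks(pagerank, k, seed=0):
    """Baseline: pick k distinct landmarks uniformly at random."""
    return random.Random(seed).sample(list(pagerank), k)


if __name__ == "__main__":
    pr = {node: (node + 1) / 55 for node in range(10)}  # sums to 1
    print(pagerank_landmarks(pr, 3), uniform_landmarks(pr, 3))
```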

Varying the number of dimensions. Figure 3 plots the CDF of normalized absolute errors in approximating the Katz measure as we vary the number of dimensions from 5 to 60. We make the following two key observations. First, for all six datasets the normalized absolute error is small: in more than 95% of cases the normalized absolute error is within 0.05, and the NMAE is within 0.05 except for YouTube. The error in YouTube is higher because its “intrinsic” dimensionality is higher, as analyzed in Figure 6 (see below). Second, as we would expect, the error decreases with the number of dimensions. The reduction is more significant in the YouTube dataset, because the other datasets have very low “intrinsic” dimensionality and using only 5 dimensions already gives low approximation error, whereas YouTube has higher “intrinsic” dimensionality and increasing the number of dimensions is more helpful.

Relative errors. Figure 4 further plots the CDF of relative errors using 60 dimensions. We take the top 1%, 5%, and 10% largest values among the randomly selected data points and generate the CDF for each of the selections. In all datasets, we observe that the relative errors are smaller for elements with larger values. This is desirable because larger elements play a more important role in many applications and are thus more important to estimate accurately.

Uniform landmark selection. Table 3 compares the NMAE of PageRank based landmark selection and uniform selection. PageRank based selection yields higher accuracy than uniform selection. It reduces NMAE by 35% for Digg, 95.8% for Flickr, 83.3% for LiveJournal, 50% for MySpace, 61% for YouTube, and 19% for Wikipedia. The reason is that high-PageRank nodes are well connected, and it is less likely for nodes to be far away from all such landmarks, thereby improving the estimation accuracy.
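The quoted reductions follow directly from Table 3 as (uniform − PageRank) / uniform. The short check below recomputes them; the dictionary simply restates the table values and the variable names are illustrative.

```python
# NMAE from Table 3: (PageRank based selection, uniform selection).
TABLE3 = {
    "Digg": (0.00015, 0.00023),
    "Flickr": (0.00010, 0.00238),
    "LiveJournal": (0.01222, 0.07322),
    "MySpace": (0.00016, 0.00032),
    "YouTube": (0.02115, 0.05410),
    "Wikipedia": (0.00266, 0.00328),
}

# Percentage reduction in NMAE when switching from uniform to PageRank based selection.
reduction = {
    name: 100.0 * (uniform - pr_based) / uniform
    for name, (pr_based, uniform) in TABLE3.items()
}

if __name__ == "__main__":
    for name, pct in reduction.items():
        print(f"{name}: {pct:.1f}%")
```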

              Threshold for Large Katz Values
Network       1% row sum   0.1% row sum   0% row sum
Digg          0.0562       0.0650         0.0002
Flickr        0.2177       0.2505         0.0001
LiveJournal   0.9872       0.2516         0.0122
MySpace       0.0532       0.0650         0.0002
YouTube       0.0074       0.0054         0.0212
Wikipedia     0.0039       0.0001         0.0027

(a) NMAE of proximity embedding

              Threshold for Large Katz Values
Network       1% row sum   0.1% row sum   0% row sum
Digg          0.0001       0.3209         211.5
Flickr        0.0048       0.0293         1116.3
LiveJournal   0.0012       0.0113         1383.2
MySpace       0.0041       0.0360         1451.1
YouTube       0.0495       0.3769         1141.3
Wikipedia     0.0114       0.2645         647.3

(b) NMAE of proximity sketch

Table 4: Comparing proximity embedding and proximity sketch in estimating large Katz values.

              Threshold for Large RPR Values
Network       1% row sum   0.1% row sum   0% row sum
Digg          0.6662       2.0008         0.7933
Flickr        0.7285       1.6385         1.0000
LiveJournal   1.4491       7.2752         0.9980
MySpace       1.0916       6.6324         0.9984
YouTube       0.7068       1.1952         1.0635
Wikipedia     1.4429       5.7987         0.7208

(a) NMAE of proximity embedding

              Threshold for Large RPR Values
Network       1% row sum   0.1% row sum   0% row sum
Digg          0.0031       0.0247         131.8
Flickr        0.0006       0.0019         717.6
LiveJournal   0.0042       0.0296         500.2
MySpace       0.0038       0.0269         853.0
YouTube       0.0019       0.0110         486.3
Wikipedia     0.0046       0.0265         619.7

(b) NMAE of proximity sketch

Table 5: Comparing proximity embedding and proximity sketch in estimating large RPR values.

Varying the number of landmarks. Figure 5 shows the NMAE as we vary the number of landmarks. As before, we use PageRank based landmark selection. For all the datasets that we use, NMAE values decrease with the number of landmarks. The decrease is sharp when the number of landmarks is small, and then tapers off as the number of landmarks reaches 100-200. In all cases, 1600 landmarks are large enough and further increasing the value yields only marginal benefit if any.

Estimating large Katz values. Table 4(a) shows the accuracy of proximity embedding in estimating Katz values larger than 1%, 0.1%, and 0% of their corresponding row sums in the Katz matrix. As we can see, for elements larger than 0, the NMAE is low (the largest one is 0.0212 for YouTube). In comparison, for elements larger than 1% and 0.1% of the row sums, the NMAE is often larger (e.g., the corresponding values for LiveJournal are 0.9872 and 0.2516). Manual inspection suggests that many Katz values larger than 1% of the row sum involve a direct link between two nodes in an isolated island of the network that cannot reach any landmarks. For such node pairs, the proximity embedding yields an estimate of 0, thus seriously inflating the NMAE. Fortunately, large Katz values are quite rare in each row. As a result, the NMAE is low when we consider all elements in the Katz matrix.
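The thresholded NMAE reported in Table 4 can be sketched as follows. The function name and the dense list-of-rows representation are illustrative, and computing each row sum over the full row of actual values is an assumption about how the threshold is applied.

```python
def nmae_above_threshold(actual_rows, est_rows, frac):
    """NMAE restricted to elements whose actual value exceeds
    frac * (sum of the actual values in the same row)."""
    abs_err_total = 0.0
    actual_total = 0.0
    for actual, est in zip(actual_rows, est_rows):
        cutoff = frac * sum(actual)
        for a, e in zip(actual, est):
            if a > cutoff:
                abs_err_total += abs(e - a)
                actual_total += a
    return abs_err_total / actual_total if actual_total else float("nan")


if __name__ == "__main__":
    actual = [[0.9, 0.05, 0.05], [0.5, 0.5, 0.0]]
    est = [[0.8, 0.05, 0.05], [0.5, 0.4, 0.0]]
    for frac in (0.01, 0.001, 0.0):
        print(frac, nmae_above_threshold(actual, est, frac))
```

With frac = 0 the metric covers every nonzero element, so a few badly estimated large values are diluted by the many small ones, which matches the pattern described above.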

[Figure 3 plots omitted. Each panel shows the CDF (y-axis) of normalized absolute errors (x-axis) for 5, 10, 20, 30, and 60 dimensions. Panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.]

Figure 3: Normalized absolute errors with a varying number of dimensions (Katz measure, β = 0.05, and 1600 landmarks).

[Figure 4 plots omitted. Each panel shows the CDF (y-axis) of relative errors (x-axis) for the top 1%, top 5%, and top 10% largest values. Panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.]

Figure 4: Relative errors for top 1%, 5%, and 10% largest values (Katz measure, β = 0.05, 1600 landmarks, and 60 dimensions).

Estimating large Rooted PageRank values. Table 5(a) shows the accuracy of proximity embedding in estimating Rooted PageRank values larger than 1%, 0.1%, and 0% of their corresponding row sums in the RPR matrix. We observe that the NMAE for RPR is much larger than the NMAE for Katz.

To understand why proximity embedding performs well on Katz but not on RPR, we plot the fraction of total variance captured by the best rank-k approximation to the inter-landmark proximity matrices Katz[L, L] and RPR[L, L] in Figure 6, where L is the set of landmarks. Note that the best rank-k approximation to a matrix can be easily computed through the use of singular value decomposition (SVD). The smaller the number of dimensions (i.e., k) it takes to capture most of the variance of the matrix, the lower the “intrinsic” dimensionality of the matrix. As we can see, for LiveJournal, even 3 dimensions can capture over 99% of the variance for Katz, whereas it takes 1590 dimensions to achieve similar approximation accuracy for rooted PageRank. This indicates that the RPR matrix is not low-rank, whereas the Katz matrix exhibits the low-rank property, which makes proximity embedding work well.
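For reference, the quantity plotted in Figure 6 can be expressed through singular values. The formulation below assumes that "fraction of total variance" means the ratio of squared Frobenius norms without mean-centering, which is a common convention but is not spelled out in the text; the symbols M, M_k, and σ_i are introduced here for illustration.

```latex
% M is an inter-landmark proximity matrix such as Katz[L, L] or RPR[L, L],
% with singular values \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0.
M = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
\qquad
M_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
% M_k is the best rank-k approximation in Frobenius norm (Eckart-Young theorem).

\text{fraction of variance captured by rank } k
  = \frac{\lVert M_k \rVert_F^{2}}{\lVert M \rVert_F^{2}}
  = \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sum_{i=1}^{r} \sigma_i^{2}}
```

Under this reading, a matrix whose first few squared singular values dominate the sum reaches a fraction near 1 for small k, which is the behavior reported for Katz, while a flat singular value spectrum requires k close to the number of landmarks, as reported for RPR.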

4.2.2 Proximity Sketch

Now we evaluate the accuracy of proximity sketch. We use H = 3 hash tables and c = 1600 columns in each table.

Estimating large Katz values. Table 4(b) shows the NMAE of proximity sketch in estimating Katz values larger than 1%, 0.1%, and 0% of row sum. Comparing Table 4(a) and (b), we observe