Sampling and Summarization for Social Networks

(1)

Sampling and Summarization for Social Networks

SDM 2013 Tutorial

Shou‐De Lin*, Mi‐Yen Yeh#, and Cheng‐Te Li*

* Computer Science and Information Engineering, National Taiwan University

# Institute of Information Science, Academia Sinica

Tutorial slides: http://mslab.csie.ntu.edu.tw/tut‐pakdd13/samsum‐pakdd13.pdf (note that this version differs substantially from the slides on the SDM conference CD)

(2)

About This Tutorial

• This is a two‐hour tutorial for SDM 2013 on social network sampling and summarization

– More than 50 papers are surveyed and organized in this talk, but the coverage is by no means complete.

– We highlight the trends, categorize the different types of strategies, and describe some of our ongoing work

• Agenda

– Introduction + Sampling (50 min +10 min Q/A)

– Summarization + conclusion (50 min+10 min Q/A)

13/05/02 Lin et al., SDM2013 Tutorial 2

(3)

Big Social Network: billions of nodes and links of different types

[Figure by Paul Butler]

What can be mined from this picture?

(4)

Challenges in Mining Big‐Network Data

>1 billion

>500 million

>200 million

Sometimes the full networks are not completely observed in advance

Even if they are, loading everything into memory for further analysis might not be feasible

Even if that is feasible, performing conventional operations (e.g. average path length) can take a long time, not to mention more complicated ones

(5)

An Example on Facebook

• 1+ billion users

• Avg: 130 friends per node

It costs >1TB of memory simply to store the raw graph data (without attributes, labels, or content)

This can cause problems for information extraction, processing, and analysis

Two possible solutions: Sampling and Summarization
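
As a rough check of the >1TB figure on this slide, the following sketch multiplies the slide's two numbers by an assumed storage cost. The 8‐byte neighbor ID and the plain adjacency‐list layout are assumptions made for illustration; they are not stated in the tutorial.

```python
# Back-of-envelope estimate of the raw adjacency-list size.
USERS = 1_000_000_000        # "1+ billion users" (from the slide)
AVG_FRIENDS = 130            # average friends per node (from the slide)
BYTES_PER_NEIGHBOR_ID = 8    # assumption: one 64-bit ID per adjacency entry


def adjacency_list_bytes(users, avg_degree, bytes_per_id):
    """One stored ID per (node, neighbor) pair; each friendship appears twice."""
    return users * avg_degree * bytes_per_id


total = adjacency_list_bytes(USERS, AVG_FRIENDS, BYTES_PER_NEIGHBOR_ID)
print(f"{total / 1e12:.2f} TB")  # 1.04 TB
```

Under these assumptions the structure alone already exceeds 1TB, before any attributes, labels, or content are attached.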

(6)

Sampling Versus Summarization

• Sampling for Social Network

– Assume the information of a node becomes known only after it is sampled

– Goal: gradually identify a small set of representative nodes and links of a social network, usually given little prior information about this network

• Summarization for Social Network

– The entire social network is known in advance

– Goal: condense the social network as much as possible without losing too much information


(7)

Homogeneous & Heterogeneous Social Networks

• Homogeneous → Single‐Relational Network

– Single object type & link type

• Heterogeneous → Multi‐Relational Network

– Multiple object types & link types

• Example

– Homogeneous: link types = {Friend}

– Heterogeneous: link types = {Friend, Family, Love}

(8)

Sampling for Social Networks

(9)

Sampling Social Networks

• Assume that the detailed information of a node can only be seen after it is sampled

– Entire social network is not known in advance

• Goal

– Sample (i.e. gradually observe nodes and links) a sub‐network that represents the whole network

• To preserve certain properties of the original network


(10)

Properties Preserved (1/3)

• Homogeneous Static Social Network

– In/Out Degree Distribution

– Path Length Distribution

– Clustering Coefficient Distribution

– Eigenvalues

– Weakly/Strongly Connected Component Size Distribution

– Community Structure

– Etc.

(11)

Properties Preserved (2/3)

• Homogeneous Dynamic Social Networks (graphs are time‐evolving)

– Densification Power Law

• Number of edges vs. number of nodes over time

– Shrinking diameter

• The diameter is observed to shrink and then stabilize over time

– Average clustering coefficient over time

– Largest singular value of the graph adjacency matrix over time

– Etc.

(12)

Properties Preserved (3/3)

• Heterogeneous Social Network

– Node type distribution

– Intra‐link and inter‐link type distribution

– Distribution of higher‐order type connections

(13)

Evaluating the Sampling Quality

• How to measure the quality of the sampling algorithm?

• A sampling algorithm is effective if

– The sampled social network can preserve certain network properties

– Using the sampled network to perform an ultimate task (e.g. centrality analysis, link prediction, etc.), one can produce results similar to those obtained if the task were performed on the fully observed network

– It can produce a small sampled sub‐network that achieves the above two goals

(14)

Sampling for Homogeneous Social Networks

(15)

Three Main Strategies

• Node Selection

• Edge Selection

• Sampling by Exploration

– Random Walk

– Graph Search

– Chain‐Referral Sampling

[Figure: exploration starting from seeds (i.e., egos)]

(16)

Node Selection

• Random Node Sampling

– Uniformly select a set of nodes

• Degree‐based Sampling [Adamic’01]

– The probability of a node being selected is proportional to its degree (assumed known)

• PageRank‐based Sampling [Leskovec’06]

– The probability of a node being selected is proportional to its PageRank value (assumed known)
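
The three node‐selection strategies differ only in the weight given to each node, which a single helper can illustrate. This is a minimal sketch, not the implementation from the cited papers: the toy graph, the function name `weighted_node_sample`, and the choice to draw without replacement are all assumptions. PageRank‐based sampling would pass precomputed PageRank values as the weights.

```python
import random


def weighted_node_sample(weights, k, rng):
    """Draw k distinct nodes; each draw picks a node with probability
    proportional to its weight among the nodes not yet chosen."""
    remaining = {v: w for v, w in weights.items() if w > 0}
    chosen = []
    while remaining and len(chosen) < k:
        nodes = list(remaining)
        v = rng.choices(nodes, weights=[remaining[u] for u in nodes], k=1)[0]
        chosen.append(v)
        del remaining[v]
    return chosen


# Hypothetical toy graph as an undirected adjacency dict.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
rng = random.Random(0)

# Random node sampling: every node has the same weight.
uniform_sample = weighted_node_sample({v: 1 for v in adj}, 2, rng)
# Degree-based sampling: weight = degree (assumed known in advance).
degree_sample = weighted_node_sample({v: len(adj[v]) for v in adj}, 2, rng)
print(uniform_sample, degree_sample)
```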

(17)

Edge Selection

• Random Edge (RE) Sampling

– Uniformly select edges at random, and then include the associated nodes

• Random Node‐Edge (RNE) Sampling

– Uniformly select a node, then uniformly select an edge incident to it

• Hybrid Sampling [Leskovec’06]

– With probability p perform RE sampling, with probability 1‐p perform RNE sampling
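
The RE, RNE, and hybrid rules above can be sketched in a few lines. The sketch assumes an undirected graph stored as an adjacency dict; the function names and the toy graph are hypothetical, and the loop simply stops once the requested number of distinct edges has been collected.

```python
import random


def random_edge(edges, rng):
    """RE: pick an edge uniformly at random."""
    return rng.choice(edges)


def random_node_edge(adj, rng):
    """RNE: pick a node uniformly, then one of its incident edges uniformly."""
    u = rng.choice(sorted(v for v in adj if adj[v]))
    return (u, rng.choice(sorted(adj[u])))


def hybrid_edge_sample(adj, n_edges, p, rng):
    """With probability p take an RE step, otherwise an RNE step."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    target = min(n_edges, len(edges))  # cannot collect more edges than exist
    sampled = set()
    while len(sampled) < target:
        e = random_edge(edges, rng) if rng.random() < p else random_node_edge(adj, rng)
        sampled.add(tuple(sorted(e)))
    nodes = {v for e in sampled for v in e}  # include the associated nodes
    return nodes, sampled


# Hypothetical toy graph.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
nodes, sampled = hybrid_edge_sample(adj, 3, 0.8, random.Random(1))
print(sorted(nodes), sorted(sampled))
```

Setting p = 1 reduces the sketch to pure RE sampling and p = 0 to pure RNE sampling.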

(18)

Edge Selection (cont.)

• Induced Edge Sampling [Ahmed’12]

– Step 1: Uniformly select edges (and consequently nodes) for several rounds

– Step 2: Add edges that exist between sampled nodes

• Frontier Sampling [Ribeiro’10]

– Step 0: Randomly select a set of nodes L as seeds

– Step 1: Select a seed u from L using degree‐based sampling

– Step 2: Select an edge of u, (u, v), uniformly

– Step 3: Replace u by v in L and add (u, v) to the sequence of sampled edges

– Repeat Steps 1 to 3

• Example: Seed={A,B} and Degree(B)>Degree(A), so B is the more likely pick; randomly pick (B,C) into the sampled sequence; replace B by C as a new seed
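
Steps 0 to 3 of Frontier Sampling can be sketched as follows. This is an illustrative reading of the slide rather than the reference implementation of [Ribeiro’10]; it assumes an undirected graph with no isolated nodes, and the function name and toy graph are hypothetical.

```python
import random


def frontier_sampling(adj, n_seeds, n_steps, rng):
    """Return the sequence of sampled edges after n_steps iterations."""
    seeds = rng.sample(sorted(adj), n_seeds)          # Step 0: random seeds L
    sampled_edges = []
    for _ in range(n_steps):
        degrees = [len(adj[u]) for u in seeds]
        # Step 1: pick a seed with probability proportional to its degree.
        i = rng.choices(range(len(seeds)), weights=degrees, k=1)[0]
        u = seeds[i]
        v = rng.choice(sorted(adj[u]))                # Step 2: uniform edge of u
        seeds[i] = v                                  # Step 3: replace u by v
        sampled_edges.append((u, v))
    return sampled_edges


# Hypothetical toy graph (every node has at least one neighbor).
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
edges = frontier_sampling(adj, 2, 10, random.Random(2))
print(edges)
```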

(19)

Sampling by Exploration

• Random Walk [Gjoka’10]

– The next‐hop node is chosen uniformly among the neighbors of the current node

• Random Walk with Restart [Leskovec’06]

– Uniformly select a random node and perform a random walk with restarts

• Random Jump [Ribeiro’10]

– Same as random walk, but with probability p we jump to any node in the network

• Forest Fire [Leskovec’06]

– Choose a node u uniformly

– Generate a random number z and select z out‐links of u that are not yet visited

– Apply this step recursively for all newly added nodes
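
The Forest Fire steps can be sketched as a breadth‐first "burning" process. The slide only says that z is a random number; drawing z from a geometric distribution controlled by a forward‐burning probability `p_forward` is an assumption here, as are the function names and the toy graph. The fire may die out before `max_nodes` is reached; this sketch does not restart it.

```python
import random
from collections import deque


def geometric(p_forward, rng):
    """Number of successes before the first failure (mean p / (1 - p))."""
    z = 0
    while rng.random() < p_forward:
        z += 1
    return z


def forest_fire(adj, p_forward, max_nodes, rng):
    """Burn outward from a uniformly chosen node; return the visited set."""
    start = rng.choice(sorted(adj))
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < max_nodes:
        u = queue.popleft()
        unvisited = [w for w in sorted(adj[u]) if w not in visited]
        z = min(geometric(p_forward, rng), len(unvisited))
        for w in rng.sample(unvisited, z):   # select z not-yet-visited links
            if len(visited) >= max_nodes:
                break
            visited.add(w)
            queue.append(w)                  # recurse on newly added nodes
    return visited


# Hypothetical toy graph.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
burned = forest_fire(adj, 0.7, 3, random.Random(3))
print(sorted(burned))
```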

(20)

Sampling by Exploration (cont.)

• Ego‐Centric Exploration (ECE) Sampling

– Similar to random walk, but each neighbor is selected with probability p

– Multiple ECE (starting with multiple seeds)

• Depth‐First / Breadth‐First Search [Krishnamurthy’05]

– Keep visiting neighbors of the most recently / earliest visited nodes

• Sample Edge Count [Maiya’11]

– Move to the neighbor with the highest degree, and keep going

• Expansion Sampling [Maiya’11]

– Construct a sample with the maximal expansion. Select the neighbor v based on

$v = \arg\max_{v \in N(S)} |N(\{v\}) - (N(S) \cup S)|$

S: the set of sampled nodes, N(S): the 1st neighbor set of S

(21)

Example: Expansion Sampling

[Figure: example graph with nodes A, B, C, D, E, F, G, H; A is the sampled node]

|N({A})|=4

|N({E}) – (N({A}) ∪ {A})|=|{F,G,H}|=3

|N({D}) – (N({A}) ∪ {A})|=|{F}|=1
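
The numbers in this example can be reproduced with a small greedy sketch. The edge set below is an assumption: the figure is not recoverable, so the graph is merely one that is consistent with the three quantities on the slide. The greedy, deterministic tie‐breaking is also a simplification chosen for illustration.

```python
def neighborhood(adj, nodes):
    """N(S): all nodes adjacent to some node of S, excluding S itself."""
    nodes = set(nodes)
    reached = set()
    for u in nodes:
        reached |= adj[u]
    return reached - nodes


def expansion_gain(adj, sample, v):
    """|N({v}) - (N(S) U S)|: nodes newly reachable by adding v."""
    sample = set(sample)
    return len(adj[v] - (neighborhood(adj, sample) | sample))


def expansion_sample(adj, start, k):
    """Greedily grow the sample by the neighbor with the largest gain."""
    sample = [start]
    while len(sample) < k:
        frontier = neighborhood(adj, sample)
        if not frontier:
            break
        # Ties are broken alphabetically (an assumption of this sketch).
        best = max(sorted(frontier), key=lambda v: expansion_gain(adj, sample, v))
        sample.append(best)
    return sample


# Assumed edge set, chosen to match the quantities shown on the slide.
adj = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "F"},
    "E": {"A", "F", "G", "H"},
    "F": {"D", "E"},
    "G": {"E"},
    "H": {"E"},
}
print(len(neighborhood(adj, ["A"])))      # 4
print(expansion_gain(adj, ["A"], "E"))    # 3
print(expansion_gain(adj, ["A"], "D"))    # 1
print(expansion_sample(adj, "A", 2))      # ['A', 'E']
```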

(22)

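Drawback of Random Walk: Degree Bias!

[Figure: q_k, the sampled node degree distribution, versus p_k, the real node degree distribution]

• Real average node degree ~ 94, sampled average node degree ~ 338

• Solution: modify the transition probability (k_v denotes the degree of node v):

$$P_{v,w} = \begin{cases} \frac{1}{k_v} \cdot \min\left(1, \frac{k_v}{k_w}\right) & \text{if } w \text{ is a neighbor of } v \\ 1 - \sum_{y \neq v} P_{v,y} & \text{if } w = v \\ 0 & \text{otherwise} \end{cases}$$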
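
The modified transition rule corresponds to a Metropolis‐Hastings random walk: propose a uniformly chosen neighbor w, accept it with probability min(1, k_v/k_w), and otherwise stay at v. The sketch below assumes an undirected, connected graph; the star graph is a hypothetical example on which a plain random walk would spend about half of its time at the hub.

```python
import random


def mhrw_step(adj, v, rng):
    """One Metropolis-Hastings step targeting the uniform node distribution."""
    w = rng.choice(sorted(adj[v]))                 # proposed with prob. 1/k_v
    if rng.random() < min(1.0, len(adj[v]) / len(adj[w])):
        return w                                   # accept the move
    return v                                       # reject: stay (self-loop)


def mhrw(adj, start, n_steps, rng):
    walk = [start]
    for _ in range(n_steps):
        walk.append(mhrw_step(adj, walk[-1], rng))
    return walk


# Hypothetical star graph: hub 0 with four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
walk = mhrw(star, 0, 20000, random.Random(0))
hub_share = walk.count(0) / len(walk)
print(round(hub_share, 2))
```

With five nodes, an unbiased sampler should visit the hub roughly one fifth of the time, so `hub_share` close to 0.2 indicates that the degree bias has been removed.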

(23)

Metropolis Graph Sampling

• Step 1: Initially pick one subgraph sample S with n’ nodes randomly

• Step 2: Iterate the following steps until convergence

2.1: Remove one node from S

2.2: Randomly add a new node to S → S’

2.3: Compute the likelihood ratio $\alpha = \rho^*(S') / \rho^*(S)$

If $\alpha \geq 1$: $S := S'$

If $\alpha < 1$: $S := S'$ with probability $\alpha$; $S := S$ with probability $1 - \alpha$

– ρ*(S) measures the similarity of a certain property between the sample S and the original network G

• Can be derived approximately using Simulated Annealing [Hubler’08]
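
A minimal sketch of the Metropolis loop follows. The similarity score used here (an exponential of the gap between the sample's induced average degree and the full graph's average degree) is an assumption picked for brevity, not the measure of [Hubler’08]; the function names, the fixed iteration count, and the ring graph are likewise hypothetical. Keeping track of the best sample seen so far is an addition of this sketch.

```python
import math
import random


def avg_degree(adj, nodes):
    """Average degree of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    return sum(len(adj[v] & nodes) for v in nodes) / len(nodes)


def rho(adj, sample):
    """Assumed similarity score: larger when the sample's induced average
    degree is closer to that of the whole graph."""
    return math.exp(-abs(avg_degree(adj, sample) - avg_degree(adj, adj)))


def metropolis_sample(adj, n_sample, n_iters, rng):
    nodes = sorted(adj)
    current = set(rng.sample(nodes, n_sample))             # Step 1
    best = set(current)
    for _ in range(n_iters):                               # Step 2
        removed = rng.choice(sorted(current))              # 2.1
        added = rng.choice([v for v in nodes if v not in current])  # 2.2
        proposal = (current - {removed}) | {added}
        alpha = rho(adj, proposal) / rho(adj, current)     # 2.3
        if alpha >= 1 or rng.random() < alpha:
            current = proposal
        if rho(adj, current) > rho(adj, best):
            best = set(current)
    return best


# Hypothetical ring graph with 8 nodes.
ring = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
sample = metropolis_sample(ring, 4, 200, random.Random(0))
print(sorted(sample))
```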

(24)

Sampling for Heterogeneous Social Networks

(25)

Sampling on Heterogeneous Social Networks

• Heterogeneous Social Networks (HSN)

– A graph G=<V, E> has n nodes (v₁,v₂, …, v_n), m directed edges (e₁, …, e_m) and k different types

– Each node/edge belongs to a type

• Given a finite set L = {L₁, ..., L_k} denoting k types

• Sampling methods for HSN

– Multi‐graph sampling [Gjoka’10]

– Type‐distribution preserving sampling [Li’11]

– Relational‐profile preserving sampling [Yang’13]

(26)

Multigraph Sampling

• Perform random walk sampling on the union multigraph of multiple relation graphs, so that the walk does not get stuck in a disconnected part of any single graph.

(27)

Sampling Heterogeneous Social Networks

• Sampling methods for HSN


(28)

Node Type Distribution Preserving Sampling

• Given a graph G and a sampled subgraph G_S

• The node type distribution of G_S is expected to be the same as that of G, i.e., d(Dist(G_S),Dist(G)) = 0

– d() denotes the difference between two distributions

[Figure: the original network and the sampled network have the same node type ratio, (9:6) = (3:2)]

(29)

Connection‐type Preserving Sampling

• Heterogeneous Connection

– For an edge E[v_i,v_j]

– Intra‐connection edge: Type(v_i) = Type(v_j)

– Inter‐connection edge: Type(v_i) ≠ Type(v_j)

• Intra‐Relationship preserving

– The ratio of the intra‐connection should be preserved, that is:

d(IR(G_S),IR(G)) = 0

– If the intra‐relationship is preserved, the inter‐relationship is also preserved
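
Both preservation criteria reduce to comparing a simple statistic of the sample with that of the full graph. The sketch below assumes total variation distance for d(), which the slides leave unspecified; the function names and toy data are hypothetical. The node counts mirror the 9:6 versus 3:2 example of the previous slide.

```python
from collections import Counter


def type_distribution(node_type, nodes):
    """Fraction of nodes of each type."""
    counts = Counter(node_type[v] for v in nodes)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def distribution_distance(p, q):
    """Assumed d(): total variation distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in keys)


def intra_ratio(node_type, edges):
    """IR: fraction of edges whose two endpoints share the same type."""
    intra = sum(1 for u, v in edges if node_type[u] == node_type[v])
    return intra / len(edges)


# Hypothetical network: 9 nodes of type "x" and 6 of type "y".
node_type = {f"x{i}": "x" for i in range(9)}
node_type.update({f"y{i}": "y" for i in range(6)})
full_nodes = list(node_type)
sampled_nodes = ["x0", "x1", "x2", "y0", "y1"]        # ratio 3:2

d = distribution_distance(type_distribution(node_type, full_nodes),
                          type_distribution(node_type, sampled_nodes))
print(d)

edges = [("x0", "x1"), ("x1", "y0"), ("y0", "y1"), ("x2", "y1")]
print(intra_ratio(node_type, edges))  # 0.5
```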

(30)

Respondent‐driven Sampling

• First proposed in social science [Heckathorn’97] to address the hidden‐population problem in surveys.

• Two Main Phases:

Snowball sampling → Finding the steady state of the transition matrix

[Figure: respondents in G recruit further respondents with a limited number of coupons c; the observed transitions S¹¹ … S³³ form a transition matrix, whose N‐step transitions P¹, P², P³ converge to a steady‐state vector]

(31)

Comparing Different Sampling Algorithms

• Respondent‐driven sampling does a good job when few nodes are sampled, but then saturates at mediocre quality

• Random node sampling performs poorly in the beginning, but reaches the best results once a sufficient number of nodes are sampled.

[Figure: similarity of node type distribution and similarity of intra‐link distribution]

(32)

Heterogeneous Social Networks

• Sampling methods for HSN


(33)

Relational Profile Preserving Sampling

• Node‐type/intra‐type preservation considers the semantics of nodes, but not the structure of the networks

• Homogeneous network sampling considers the structure but not the semantics of the networks

• Propose the Relational Profile to consider semantics and structure together

– Captures the dependency between each Node Type (NT) and Edge Type (ET) of a directed heterogeneous network

– Consists of 4 relational matrices

• Conditional probabilities P(Tj|Ti) (e.g. P(ET=cites|NT=paper))

• Node to node, node to edge, edge to node, edge to edge

[Figure: the relational profile as a 2×2 block of transition matrices (NT→NT, NT→ET, ET→NT, ET→ET), next to an example network with node types paper and author and edge types cites, journal_of, and authored]
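
Two of the four relational matrices can be estimated by counting co‐occurrences along directed, typed edges. This is one plausible reading of the conditional probabilities P(Tj|Ti), not the exact definition from [Yang’13]; the function names and the tiny bibliographic network are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_profile(pairs):
    """Estimate P(b | a) from a list of observed (a, b) pairs."""
    counts = defaultdict(Counter)
    for a, b in pairs:
        counts[a][b] += 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}


def node_to_edge_pairs(node_type, typed_edges):
    """(source node type, edge type) for every directed edge."""
    return [(node_type[u], etype) for u, _, etype in typed_edges]


def node_to_node_pairs(node_type, typed_edges):
    """(source node type, target node type) for every directed edge."""
    return [(node_type[u], node_type[v]) for u, v, _ in typed_edges]


# Hypothetical directed heterogeneous network.
node_type = {"p1": "paper", "p2": "paper", "a1": "author", "j1": "journal"}
typed_edges = [
    ("p1", "p2", "cites"),
    ("p2", "p1", "cites"),
    ("a1", "p1", "authored"),
    ("a1", "p2", "authored"),
    ("j1", "p1", "journal_of"),
]

nt_to_et = conditional_profile(node_to_edge_pairs(node_type, typed_edges))
nt_to_nt = conditional_profile(node_to_node_pairs(node_type, typed_edges))
print(nt_to_et["paper"])    # {'cites': 1.0}
print(nt_to_nt["author"])   # {'paper': 1.0}
```

The edge‐to‐node and edge‐to‐edge matrices would follow the same counting pattern with different pair extractors.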

(34)

Example of Relational Profile (RP)

P A C J c p a

P 0.44 0.22 0.22 0.11 0.44 0.33 0.22

A 1 1

C 1 1

J 1 1

c 1 0.22 0.44 0.33

p 0.5 0.33 0.17 0.66 0.33

a 0.5 0.5 0.6 0.4

P A C J c p a

P 0.182 0.364 0.091 0.273 0.182 0.364 0.364

A 1 1

C 1 1

J 1 1

c 1 0.5 0.5

p 0.5 0.125 0.375 0.17 0.5 0.33

a 0.5 0.5 0.22 0.33 0.44


(35)

Challenge: How to approximate RP when the true RP is unknown

• We propose Exploration by Expectation Sampling

• Aim to preserve the unknown relational profile while adding new sample nodes

1. Randomly choose a starting node and the corresponding edges

2. Based on the current RP, select the next node from all 1‐hop neighbors

3. Add the new node and all its edges

4. Update the RP of the sub‐sampled graph

5. Repeat steps 2, 3 & 4 until the RP converges

• Which node should be selected?

– Select the node whose inclusion can potentially lead to the largest change to the existing RP

• Use the partially observed RP to generate the ‘expected amount of change’ of each candidate node as its score

• Weighted sampling based on the score

(36)

Relational Profile Sampling (RPS)

Idea: sample to increase the diversity

Goal: maximize the expected change of the property (the Relational Profile distribution)

D(v, G_s) = estimated change of RP given sampling v on the current graph G_s

= E[Δ_P(G_s, G_s+v)|G_s], where Δ_P = RMSE_RP

Exploiting the existing RP, P(type(v)=t|G_s) can be calculated from the observed types of v’s neighbors; each P(type|type) term can be obtained from the existing RP

[Figure: a candidate node v adjacent to the sampled graph G_s, with each connection annotated by an RP(type|type) entry]

(37)

Evaluation

• Evaluation I (Property Preservation): see how well the sampled network approximates two properties of the full network

• Evaluation II (Prediction): train a prediction model using the sampled network to infer the status of the out‐of‐sample part of the network:

– Node Type Prediction: predict the type of unseen nodes in the network using a sub‐sampled network

– Missing Relations Prediction: recover/predict the missing links

– Features:

• f_deg = (in/out deg; avg in/out deg of neighbors)

• f_topo = (Common Neighbors; Jaccard’s Coefficient; etc.)

• f_nt = P(type(v)|G_s)

• f_RPnode

• f_RPpath

• Datasets: 3 real‐life large‐scale HSNs

• Baselines:

– Random Walk Sampling (RW)

– Degree‐based sampling (HDS)

[Figure: the network is split into an area for sampling and an area for testing]

(38)

Experiments (Property Preservation)

• RMSE (for RP): type dependency preservation

• Weighted PageRank (Kendall‐Tau): preserving relative node weights propagated throughout the entire network

[Figure: results versus # nodes sampled (in 10s) for RW, HDS, and RPS on the Hep, Aca, and Movie datasets]

(39)

Experiments (Prediction)

• We show only the Academic Network for brevity.

[Figure: accuracy versus number of sampled nodes for Node Type Prediction and Missing Relation Prediction; curves: All neighbors visible, highDeg, RandWalk, RPS]

Remarks: These evaluations provide a general idea of how one can evaluate a SN sampling algorithm

(40)

Task‐driven Network Sampling

• Sampling Community Structure [Maiya’10] [Satuluri’11]

• Sampling Network Backbone for Influence Maximization [Mathioudakis’11]

• Sampling High Centrality Individuals [Maiya’10]

• Sampling Personalized PageRank Values [Vattani’11]

• Sampling Network for Link/Label Prediction [Ahmed’12]

(41)

Take Home Points

• Node and Edge Selection

– Homogeneous SN: [Leskovec’06] [Adamic’01] [Ahmed’12] [Ribeiro’10]

– Heterogeneous SN: [Kurant’12]

• Sampling by Exploration

– Homogeneous SN: [Krishnamurthy’05] [Leskovec’06] [Hubler’08] [Gjoka’10] [Ribeiro’10] [Maiya’11] [Kurant’11]

– Heterogeneous SN: [Gjoka’11] [Li’11] [Kurant’12] [Yang’13]

• Task‐driven Sampling

– [Maiya’10] [Satuluri’11] [Mathioudakis’11] [Vattani’11] [Ahmed’12]

• Why sample a social network?

– The full network (e.g. Facebook) cannot be fully observed

– Crawling can be costly in terms of resource and time consumption (therefore a smart sampling strategy is needed)

(42)

The 2nd part of this tutorial:

Social Network Summarization

(43)

Goals of Social Network Summarization

• Find a condensed representation of a given social network to

– produce a succinct overview of the social network,

– save storage,

– enable efficient mining / query processing

(44)

Beyond Graph Summarization

• To summarize not only the structure or topology information such as:

– Neighbor set / adjacency

– Reachability

– Connectivity

• but also the semantic information such as:

– Attributes of an entity and a relationship.

– Relationships of entity‐entity, entity‐community, community‐community.

(45)

Issues for Summarization

• Purpose

– Are there certain properties to preserve? What types of queries / mining tasks are the summaries for?

• Precision of the summary

– Lossless: can reconstruct the exact original social networks

– Lossy: cannot fully recover, usually for a better compression ratio

• Evaluation

– Space saving: reduction in the number of nodes/edges, total data size in bytes, bits per edge, etc.

– Quality: reconstruction errors, interestingness, query errors (degree, centrality, connectivity), etc.

– Efficiency: time for summarization and time savings of querying or mining on the summaries.


(46)

Main Approaches for Summarization

• Aggregation based

– Creating a summary graph with supernodes and superedges.

– For efficient storage, analysis, and visualization.

• Abstraction based

– Extracting a subgraph given certain criteria for abstraction and various visualizations.

• Compression based

– Encoding the network in a space‐efficient way based on the structure information.

• Application‐oriented

– Designing specifically for different kinds of applications.


(47)

Web/Graph ‐> Social Network

‐ from the homogeneous to the heterogeneous structure


(48)

Main Approaches for Summarization

• Aggregation based

• Abstraction based

• Compression based

• Application‐oriented


(49)

Aggregation based on Node/Link Structures

• The basic idea:

– Merge nodes with similar neighbors into a supernode.

– Add a superedge between two supernodes conditionally.

– E.g., complete bipartite graph

• What if a subgraph is not complete?

– Supernode graph + edge corrections!

(50)

Aggregation based on Node/Link Structures (cont’d)

• S‐node representation for Web graphs [Raghavan’03]:

– Partition web pages

• URL split (domain) + Cluster split (adjacency list of out‐links)

– Supernode graph: a node represents a partition, and there is a link between two partitions if there exists any link between two pages, one from each partition.

– Positive/Negative superedge graphs: used to annotate the actual linkage between web pages.

– Lossless representation.

(51)

Aggregation based on Node/Link Structures (cont’d)

• A two‐part representation R(S,C) is proposed [Navlakha’08]:

– Graph summary S: an aggregated graph.

• Merge nodes with more common neighbors.

• A link is added between two supernodes if the nodes in one supernode are densely connected to those in the other.

• Allow supernodes to have a self‐edge.

– Edge corrections C: to be used while recovering the original graph.

– Both lossless/lossy methods are proposed.
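
To make the lossless case concrete, the sketch below expands a summary S and applies the edge corrections C to recover the original edge set. The data layout (supernodes as a dict of sets, corrections as signed pairs), the function name, and the toy example are assumptions for illustration and are not taken from [Navlakha’08].

```python
from itertools import combinations, product


def expand_summary(supernodes, superedges, corrections):
    """Rebuild the original undirected edge set from R = (S, C)."""
    edges = set()
    for a, b in superedges:
        if a == b:
            # Self-edge: all pairs inside the supernode are connected.
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            # Superedge: every node of a is connected to every node of b.
            pairs = product(supernodes[a], supernodes[b])
        edges |= {frozenset(p) for p in pairs}
    for sign, u, v in corrections:
        if sign == "+":
            edges.add(frozenset((u, v)))      # edge missing from the summary
        else:
            edges.discard(frozenset((u, v)))  # edge implied but not present
    return edges


# Hypothetical example: an almost complete bipartite subgraph.
supernodes = {"X": {"a", "b"}, "Y": {"c", "d", "e"}}
superedges = [("X", "Y")]
corrections = [("-", "b", "e"), ("+", "a", "b")]

original = {frozenset(e) for e in
            [("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("b", "d"), ("a", "b")]}
recovered = expand_summary(supernodes, superedges, corrections)
print(recovered == original)  # True
```

Here one superedge plus two corrections stand in for six original edges; a lossy variant would drop some of the corrections.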


(52)

Aggregations of Links by Frequent Patterns

• Leverage pattern mining to compress the Web graph, which supports community discovery and random access [Buehrer’08].

• Two phases:

– Clustering phase: nodes with similar out‐links are grouped together.

– Mining phase (for each cluster): mine virtual nodes to aggregate edges

• A lossless and more compact structure

(53)

Aggregation by Node Attributes and Relations

• SNAP [Tian’08]: Summarizing by Grouping Nodes on Attributes and Pairwise Relationships.

• The user can specify a node‐attribute set and a relation‐type set, and the system returns an (A,R)‐compatible grouping.

– E.g., A={gender, department}, R={friends, classmates}

– Each student in group G₁ has at least a friend and a classmate in group G₂.

• Lossless.

(54)

Aggregation by Node Attributes and Relations (cont’d)

• k‐SNAP [Tian’08]: relaxes the homogeneity requirement for the relationships and allows users to control (drill‐down, roll‐up) the sizes of the summaries.

– k is the user‐specified number of grouping nodes.

– Not requiring that every node participates in a group relationship.

– Lossy


(55)

Aggregation by Node Attributes and Relations (cont’d)

• [Zhang’10] addresses two limitations of k‐SNAP in practice

– Limitation 1: Only handles categorical node attributes

• Sol: Provide cutoffs to categorize numerical attributes

– Limitation 2: The search space is too large for manually identifying interesting summaries

• Sol: An interestingness measure (diversity, coverage, conciseness) is introduced to evaluate the interestingness of a summary

(56)

Main Approaches for Summarization

• Aggregation based

• Abstraction based

• Compression based

• Application‐oriented


(57)

Visual Analysis of Large Heterogeneous Social Networks

• OntoVis [Shen’06] is a visual analysis tool for heterogeneous social networks based on the given ontology graph

– Semantic abstraction: generate an induced graph of node types selected by users

– Structural abstraction: remove one‐degree nodes and duplicate paths to reduce visual complexity

– Importance filtering: use statistics such as node degree, dispersion, and disparity per type to determine the important node types to emphasize.

(58)

Visual Analysis of Large Heterogeneous Social Networks (cont’d)

[Figure: ontology graph of the movie dataset from the UCI KDD Archive¹, with node types GENRE, DISTRIBUTER, STUDIO, ROLE, AWARD, MOVIE, COUNTRY, and PERSON] [Shen’06]

[Figure: semantic abstraction on “role‐actor” relationships; red nodes: role, blue nodes: actor]

1. http://kdd.ics.uci.edu/databases/movies/movies.data.html

(59)

Visual Analysis of Large Heterogeneous Social Networks (cont’d)

[Figure: ontology graph of the movie dataset from the UCI KDD Archive¹ after importance filtering on “node type disparity”; node size: disparity of connected types; number on an edge: frequency of links between two types] [Shen’06]

1. http://kdd.ics.uci.edu/databases/movies/movies.data.html

(60)

Egocentric Abstraction on Heterogeneous Social Networks

• Construct the abstracted graph of an ego node for a heterogeneous social network [Li’09]

– Identify each unique k‐step linear combination of relations as a feature

– Count the frequency of each unique feature

– Several criteria are introduced to decide which features are important for the ego

• E.g., abstraction by showing only rare/frequent features.

(61)

Main Approaches for Summarization

• Aggregation based

• Abstraction based

• Compression based

• Application‐oriented


(62)

Ordering‐based Compression

• URLs of web pages have two features:

– Locality: usually the source and the target of a link are close to each other (in lexicographical order).

– Similarity: pages close to each other (in lexicographical order) tend to have many common successors, i.e., many navigational links are the same for pages in the same local cluster and with the same host.

• By leveraging the lexicographical order, the BV scheme [Boldi’04] needs only about 3 bits per edge to encode Web graphs.

• However, nodes in social networks have no natural order.

(63)

Ordering for Nodes in Social Networks

• The shingle ordering, based on the Jaccard coefficient, is used to find locality in social networks [Chierichetti’09].

• If two nodes share a lot of common out‐neighbors, with high probability they will be close to each other in a shingle‐based ordering.

[Figure: A, the outlinks of v_A, is {15, 20, 24, 29, 33}; B, the outlinks of v_B, is {15, 22, 29, 35, 66, 70, 99}; some permutation function is applied to both lists to obtain the shingle of A and the shingle of B]
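
A shingle‐based ordering can be sketched with a single random permutation: each node's shingle is its out‐neighbor that comes first under the permutation, and nodes are sorted by shingle. This is a simplified illustration in the spirit of [Chierichetti’09], not their exact procedure; the function name, the tie‐breaking by node name, and the toy out‐link lists are assumptions.

```python
import random


def shingle_order(out_adj, rng):
    """Order nodes by the rank of their shingle under a random permutation."""
    universe = sorted({w for nbrs in out_adj.values() for w in nbrs})
    rng.shuffle(universe)                          # random permutation
    rank = {w: i for i, w in enumerate(universe)}

    def shingle_rank(v):
        # Rank of the out-neighbor that comes first under the permutation;
        # nodes without out-links are placed last (an assumption).
        return min((rank[w] for w in out_adj[v]), default=len(rank))

    return sorted(out_adj, key=lambda v: (shingle_rank(v), v))


# Hypothetical out-link lists: A and B share all of their out-neighbors.
out_adj = {
    "A": {1, 2, 3},
    "B": {1, 2, 3},
    "C": {7, 8, 9},
    "D": {10, 11},
}
order = shingle_order(out_adj, random.Random(0))
print(order)
```

Because A and B have identical out‐neighbor sets (Jaccard coefficient 1), they always receive the same shingle and end up next to each other, whichever permutation is drawn.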

(64)

Neighbor Query Friendly Compression of Social Networks

• A novel Eulerian data structure using multi‐position linearizations (MP) of directed graphs is proposed to compress social networks while both out/in neighbor queries can be answered in sublinear time [Maserrat’10].

[Figure: original graph G and its MP₁‐linearization, with v(1) = v₈, v(2) = v₇, and so on]

• Find S, a cover that contains all nodes in G with S‐distance = 1 (the min. distance among all pairs of (u,v) is 1), duplicating nodes where necessary.

• MP₁‐Linearization of G

– Local information: 2 bits to encode whether (v(i‐1), v(i)) and (v(i),v(i+1)) exist in E(G).

– Pointers: next appearance of v(i).

(65)

Other Graph/Network Compressions

• Community‐based (hubs and spokes) Compression [Kang’10]

• MP‐Linearization for lossy compression to preserve communities in social networks [Maserrat’12]

• Mix clusterings and orders for Compressing Social Networks [Boldi’11]

• Encoding based on the newly defined structural entropy for Erdős‐Rényi graphs [Choi’12].

(66)

Main Directions

• Aggregation based

• Abstraction

• Compression based

• Application‐oriented


(67)

Application‐Oriented Summarization

• Summarization for query‐answering and pattern mining

– Adjacency, degree, centrality [LeFevre’10]

– Connectivity [Zhou’10][Toivonen’11]

– Graph pattern mining/search [Chen’09][Kang’10][Fan’12]

• Graph management system [Kang’11], and so on.

(68)

What We Have not Covered in this Tutorial Yet

• Comparisons of the performance among all these works in terms of

– Efficiency

– Space saving

– Quality

(69)

Social Network Summarization Overview

[Figure: overview along three axes: Precision, Strategy (aggregation‐based, abstraction, compression, application‐oriented), and Network]

(70)

Summarization Categories

• Aggregation‐based

– Homogeneous: [Raghavan’03] [Navlakha’08] [Buehrer’08]

– Heterogeneous: [Tian’08] [Zhang’10] [Liu’11]

• Abstraction: [Shen’06] [Li’09]

• Compression: [Chierichetti’09] [Maserrat’10] [Maserrat’12] [Kang’10] [Choi’12]

• Application‐oriented: [Zhou’10] [LeFevre’10] [Toivonen’11] [Kang’11] [Chen’09] [Fan’12]

Summarization Strategies: Lossless / Lossy

(71)

Opportunities for Future Research

• Advanced techniques to sample/summarize more complex graph structures

– E.g. location‐based social networks, diffusion networks, dynamic social networks, social network with activity information, etc.

• Should we focus on task‐driven sampling and summarization or do we need a general framework across tasks?

• Sampling/Summarization on noisy data

• Standard evaluation metrics and benchmark data are in high demand.

• And many others…

(72)

Final Remarks

• Sampling and summarization have immediate practical value in the big data era

– Allow data miners to perform advanced mining tasks on large graphs

– Achieve scalable storage and querying

– Facilitate the development of real‐world applications

• Existing work is rich, but by no means sufficient to handle every aspect of the problem.

(73)

Acknowledgements

• This tutorial is partially sponsored by National Science Council, National Taiwan University and Intel Corporation under Grants NSC101‐2911‐I‐002‐001, NSC101‐2628‐E‐002‐028‐MY2 and NTU102R7501

• Special thanks to Shu‐Ming Hsu @ Academia Sinica for his inputs

(74)

References: Homogeneous Sampling

• J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD 2006.

• A. S. Maiya and T. Y. Berger‐Wolf. Benefits of bias: towards better characterization of network sampling. In KDD 2011.

• B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In ACM SIGCOMM IMC 2010.

• M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in facebook: a case study of unbiased sampling of OSNs. In IEEE INFOCOM 2010.

• V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, J.‐H. Cui, and A. G. Percus. Reducing large internet topologies for faster simulations. In Networking, 2005.

• N. K. Ahmed, J. Neville, and R. Kompella. Network Sampling: From Static to Streaming Graphs. arXiv:1211.3412, 2012.

• C. Hubler, H.‐P. Kriegel, K. M. Borgwardt, and Z. Ghahramani. Metropolis Algorithms for Representative Subgraph Sampling. In IEEE ICDM 2008.

• M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou. Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. SIGMETRICS Perform. Eval. Rev. 2011.

(75)

References: Heterogeneous Sampling

• M. Gjoka, C. T. Butts, M. Kurant, and A. Markopoulou. Multigraph Sampling of Online Social Networks. IEEE Journal on Selected Areas in Communications, 2011.

• M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts, and A. Markopoulou. Coarse‐grained topology estimation via graph sampling. ACM WOSN 2012.

• J.‐Y. Li and M.‐Y. Yeh. On Sampling Type Distribution from Heterogeneous Social Networks. In PAKDD 2011.

• C.‐L. Yang, P.‐H. Kung, C.‐A. Chen, and S.‐D. Lin. Semantically Sampling in Heterogeneous Social Networks. In WWW 2013.

• D. Heckathorn. Respondent‐driven sampling: a new approach to the study of hidden populations. Social problems, 1997.


(76)

References: Task‐driven Sampling

• A. S. Maiya and T. Y. Berger‐Wolf. Sampling community structure. In WWW 2010.

• M. Mathioudakis, F. Bonchi, C. Castillo, A. Gionis, and A. Ukkonen. Sparsification of Influence Networks. In KDD 2011.

• A.S. Maiya and T.Y. Berger‐Wolf. Online Sampling of High Centrality Individuals in Social Networks. In PAKDD 2010.

• V. Satuluri, S. Parthasarathy, and Y. Ruan. Local Graph Sparsification for Scalable Clustering. In SIGMOD 2011.

• A. Vattani, D. Chakrabarti, and M. Gurevich. Preserving Personalized Pagerank in Subgraphs. In ICML 2011.

• N. K. Ahmed, J. Neville, and R. Kompella. Network Sampling Designs for Relational Classification. In AAAI ICWSM 2012.

(77)

References: Aggregation‐based Summarization

• S. Navlakha, R. Rastogi, N. Shrivastava. Graph Summarization with Bounded Error. In Proc. of ACM SIGMOD International Conference on Management of Data (SIGMOD’08), 2008.

• Y. Tian, R. A. Hankins and J. M. Patel. Efficient Aggregation for Graph Summarization. In Proc. of ACM SIGMOD International Conference on Management of Data (SIGMOD’08), 2008.

• G. Buehrer and K. Chellapilla. A Scalable Pattern Mining Approach to Web Graph Compression with Communities. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM’08), pages 95–106, 2008.

• N. Zhang, Y. Tian, and J. M. Patel. Discovery‐driven Graph Summarization. In Proc. of IEEE International Conference on Data Engineering (ICDE’10), 2010.

(78)

References: Abstraction‐based Summarization

• Z. Shen, K. L. Ma and T. Eliassi‐Rad. Visual Analysis of Large Heterogeneous Social Networks by Semantic and Structural Abstraction. IEEE Transactions on Visualization and Computer Graphics, 12(6), 1427–1439, 2006.

• C.‐T. Li and S.‐D. Lin. Egocentric Information Abstraction for Heterogeneous Social Networks. In Proc. of International Conference on Advances in Social Network Analysis and Mining (ASONAM’09), 2009.

(79)

References: Compression‐based Summarization

• P. Boldi and S. Vigna. The Webgraph Framework I: Compression Techniques. In the 13th international conference on World Wide Web (WWW'04), pages 595–602, 2004.

• F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On Compressing Social Networks. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), 2009.

• H. Maserrat and J. Pei. Neighbor Query Friendly Compression of Social Networks, In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10), 2010.

• H. Maserrat and J. Pei. Community Preserving Lossy Compression of Social Networks, In Proc. ICDM, 2012.

• P. Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: a multiresolution coordinate‐free ordering for compressing social networks. In WWW'11.

• Y. Choi and W. Szpankowski. Compression of Graphical Structures: Fundamental Limits, Algorithms, and Experiments. Information Theory, IEEE Transactions on, 58(2):620–638, February 2012


(80)

References: Application‐oriented Summarization

• F. Zhou, S. Mahler, and H. Toivonen. Network Simplification with Minimal Loss of Connectivity. In Proc. of IEEE International Conference on Data Mining (ICDM’10), 2010.

• H. Toivonen, F. Zhou, A. Hartikainen, and A. Hinkka. Compression of Weighted Graphs, In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.

• U. Kang, H. Tong, J. Sun, C. Y. Lin, and C. Faloutsos. GBASE: A Scalable and General Graph Management System. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.

• K. LeFevre and E. Terzi. GraSS: Graph Structure Summarization. In Proc. of SIAM International Conference on Data Mining (SDM’10), 2010.

• U. Kang and C. Faloutsos. Beyond 'Caveman Communities': Hubs and Spokes for Graph Compression and Mining. In Proc. of IEEE International Conference on Data Mining (ICDM’10), 2010.

• C. Chen, C. X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining Graph Patterns Efficiently via Randomized Summaries. Proc. VLDB Endow., 2(1):742–753, August 2009.

• W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, In Proc. of ACM SIGMOD International Conference on Management of Data (SIGMOD’12), 2012.


(81)

Q&A

Thank you!


(82)

Neighbor Query Friendly Compression of Social Networks (cont’d)

Neighbor query of v in

The upper bound on the number of bits per edge is asymptotically about ½ log(|V(G)|), i.e. about half of the log(|V(G)|) bits that the baseline uses to encode an edge. [Maserrat’10]

* Similar ideas are also used for lossy compression to preserve communities in social networks [Maserrat’12].