What Distinguish One from Its Peers in Social Networks?


Yi-Chen Lo · Jhao-Yin Li ·

Mi-Yen Yeh · Shou-De Lin · Jian Pei


Abstract Being able to discover the uniqueness of an individual is a meaningful task in social network analysis. This paper proposes two novel problems in social network analysis: how to identify the uniqueness of a given query vertex, and how to identify a group of vertices that can mutually identify each other. We propose intuitive yet effective methods to identify the uniqueness identification sets and the mutual identification groups of different properties. We also conduct extensive experiments on both real and synthetic datasets to demonstrate the effectiveness of our model.

Keywords social query · node identification · social networks

Yi-Chen Lo
Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
E-mail: d00922006@csie.ntu.edu.tw

Jhao-Yin Li
Institute of Information Science, Academia Sinica, Taipei, Taiwan
E-mail: louisjyli@iis.sinica.edu.tw

Mi-Yen Yeh
Institute of Information Science, Academia Sinica, Taipei, Taiwan
E-mail: miyen@iis.sinica.edu.tw

Shou-De Lin
Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
E-mail: miyen@iis.sinica.edu.tw

Jian Pei
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
E-mail: jpei@cs.sfu.ca


[Figure 1: a bipartite graph linking the expert nodes Alice, Bob, John, Tim, and Mary to the expertise nodes C, C++, Prolog, and Java.]

Fig. 1 Example of a bipartite expertise network.

1 Introduction

In a heterogeneous social network, each entity is assigned a type (or label) to describe its category. A node type can be a place, a person, an organization, etc. For example, consider an expertise bipartite network, where each node is an entity of either the type expert or expertise, as illustrated in Figure 1.

An “ego” is a node of focus, and the ego-centric view of a node generally contains the neighborhood information centered on that node. In this work, the ego is the query node of interest. It is often interesting and useful to distinguish an ego of some type from the rest of the entities of the same type. How can one expert be distinguished from another? In Figure 1, Alice can be uniquely identified from all the other experts because she possesses the skill Prolog that nobody else does. Mary can be uniquely identified from the other experts using the set of expertise C, C++, and Java because she is the only person that owns all three skills. In other words, we call {C, C++, Java} the uniqueness identification set of Mary.

Distinguishing a node from its peers in a social network is highly useful in many applications. In the above example, knowing that Prolog uniquely identifies Alice from all the other experts, any consulting project requiring expertise in Prolog needs to involve Alice. To take full advantage of Mary’s expertise, Mary should be assigned first to those projects requiring expertise in C, C++, and Java together. Similarly, we may also distinguish different expertise by finding a unique set of experts that have the expertise of interest.

Distinguishing a node from its peers in a social network is far from trivial. Not every node can be uniquely identified only by its one-degree neighbors. For example, in Figure 1, Tim and John cannot be distinguished from one another using only their expertise, since their expertise sets, {C++, Java}, are identical. To identify John, we have to use John’s 2-hop neighbors including C++, Java, and Tim. The identification set for John essentially indicates, “Besides Tim, John is the only person who knows C++ and Java”. Such information about John is useful: it provides a unique expertise set of John as well as his alternatives in the network.

Additionally, we may use the identification information to find interesting “communities” in a social network. For example, consider the sub-graph that contains the 5 nodes John, Tim, Mary, C++, and Java. Each of these nodes can be identified by the rest of the nodes (or a subset of them). For instance, the node C++ can be identified by the set {Tim, Java}, which can be read as “C++ is the only language other than Java that Tim knows”. Similarly, the node Java can be identified by the sets {Mary, C++} and {John, Mary, C++}. In other words, the induced graph on {John, Tim, Mary, C++, Java} is a closure in identification, which we call a mutual identification group. Intuitively, in a mutual identification group S, each member can be identified by some other members of S. Such a closure group discloses some interesting information: John, Mary and Tim may replace each other as they are the only experts who master both C++ and Java.

Essentially, by designating any node as an ego, we can use its identification sets and mutual identification groups to conduct ego-centric analysis. Such analysis enables valuable applications including social entity search engines and substitution recommendation systems. For a social entity search engine, existing work such as [9, 13] did not emphasize choosing which neighboring nodes to display; instead, it opted to show all the neighbors of the ego up to a certain degree. In contrast, our goal is to leverage the social network itself and report the uniqueness identification set for a query entity/ego to highlight its unique information, which helps users quickly catch the main differences between this entity and others of the same type amid a large amount of redundant information. Such a mechanism facilitates better visualization and summarization for the task of social relationship search of an entity. Moreover, with the mutual identification group of each query entity/ego, we can build a substitution recommendation system that finds an alternative for a query entity that is out of stock or unavailable. That is, items that can be mutually identified by the same set of objects can be regarded as serving similar roles in a network, and therefore can be regarded as surrogates for each other.

Consider a movie network as an example, which contains relationships among movies, actors, directors, etc. For a movie, an entity search engine can report its uniqueness identification set containing the minimal set of information about its actors, director, or places that distinguishes this movie from others, rather than simply showing its neighborhood graph. Moreover, a mutual identification group may capture a set of movies, actors and places that are closely related to each other. If some movie is sold out, the substitution system may recommend other movies in the same mutual identification group. We will show several cases in Section 6.3.

As astute readers might point out, we may find more than one identification set for each ego of interest in a network. For example, to identify Mary we may choose her neighbor Java. We then find that both Tim and John have the same expertise and thus are structure equivalent to Mary given their mutual neighbor Java. The final identification set of Mary includes Tim, John, and Java. On the other hand, we may also choose both C and C++ to distinguish Mary from other experts. This time, only Alice has both expertise and the resulting uniqueness identification set of Mary is {Alice, C, C++}. Compared to the former uniqueness identification set of Mary, the latter one is preferred since fewer experts are involved and the uniqueness of Mary possessing the combination of expertise C and C++ is highlighted. The same principle of using as few entities as possible should be followed when choosing the mutual identification group.

Motivated by the above identiﬁcation needs, in this paper, we propose a novel ego-centric data mining task as follows. In a network where each node is associated with a unique type, we want to achieve the following two goals:

1) Given an ego, find a uniqueness identification set (UID) that distinguishes the ego from its peers, which are nodes of the same type. As there can be more than one uniqueness identification set of an ego, we aim to find one containing the minimum number of structurally equivalent nodes to distinguish the ego from others of the same type. When two UIDs have the same number of structurally equivalent nodes, we choose the one that can be identified with a smaller number of neighbors.

2) Given an ego, find a mutual identification group (MUID) where the uniqueness of each node in the group can be identified by the rest of the nodes, or a subset of them. Similarly, when more than one mutual identification group is available, we aim to find the one with fewer structurally equivalent neighbor nodes to distinguish the nodes.

We explore different evaluation criteria for the two tasks.

There are four main contributions in this paper. First, we define two novel data mining problems for social network analysis, namely mining uniqueness identification sets and mutual identification groups. Second, we introduce a series of simple yet effective methods, 1-Hop+, One-Neighbor, and Multiple-Neighbor, to tackle the problems from different angles. We prove that our methods are guaranteed to identify a node set that can identify the uniqueness of the query node. Third, we evaluate the three methods systematically on 3 real data sets and 3 synthetic data sets. The results and follow-up analysis show the advantages of these methods. Finally, we analyze some of the interesting mutual identification groups from a real-world movie dataset and explain them. The analysis justifies the usefulness of our approach.

The rest of the paper is organized as follows. We review the related work briefly in Section 2. We define the problems of finding the uniqueness identification set of a vertex in a network and the mutual identification group in Section 3. We present methods for finding uniqueness identification sets in Section 4 and extend them to mutual identification groups in Section 5. We then report the experiment results and conclude the paper.


2 Related Works

Our work is related to the existing studies on social networks in three aspects: social network search/extraction, community queries, and social network anonymization.

The work on social entity search/extraction focuses on extracting social relationships among a specific set of people from available open resources such as the Web. For example, Zhu et al. [13] developed an entity relationship extraction framework for relation extraction from Web data, and a search engine, Renlifang, was built to report the entity relationship graph of some query persons, locations, or organizations. Tang et al. [9] developed Arnetminer, an academic search system that aims to automatically extract the researcher profile, including the co-authorship relation graph, from the Web. Those studies focus on extracting social networks. In contrast to the above works, we do not rely on parsing and extracting the information from external sources, but focus on extracting the mutual identification group from a given social network.

The works on group or community queries [8, 4, 6, 11] focus on selecting a set of nodes or searching for a specific community according to the given query nodes or some constraints. Given a set of query nodes in a graph, Sozio and Gionis [8] searched for a community containing the nodes by finding a densely connected sub-graph. They proposed a greedy algorithm with heuristics to find the optimum solution under monotone constraints on the density measure. Lappas et al. [4] tackled a team formation problem, which seeks a group of suitable people (each as a node in a graph) with different skills (as attributes of each node) to complete a task with certain skill requirements. The output results are determined by minimizing two types of communication cost between the selected people. Li and Shan [6] further generalized the problem in [4] by associating each required skill with a specific number of experts. They proposed a density-based measure for selecting the seed node and a grouping-based approach to find the team for generalized tasks. Similar to the previous two works, given a query initiator, Yang et al. [11] found a group of people that are available to attend activities while satisfying the acquaintanceship and social-temporal constraints. The problem was formulated as an integer linear programming problem, and two efficient algorithms were proposed to find the optimal solution. Based on a specified ego node, Li and Lin [5] reported an egocentric abstraction graph to summarize the features of the given ego node using an unsupervised learning mechanism. Although all of the above works extract a query-based social graph, their goals are very different from ours, which is finding a group of nodes that can uniquely identify the query node.

Our work is also related to the problem of graph anonymization. Specifically, the k-anonymization of social networks [12] alters a given social network such that the 1-neighborhood of each node is isomorphic to those of at least k−1 other nodes. To achieve the goal, the critical step is to determine whether the 1-neighborhood of a node can identify the node with probability higher than 1/k. However, in k-anonymization, the search is constrained to only the 1-neighborhood, while our work does not have such a constraint. Moreover, the goals in social network anonymization and ours are fundamentally different.

To our knowledge, we are the first to identify and tackle the problem of finding identification sets and mutual identification groups. Our work is clearly different from community detection [3, 7] in a social network, which aims to divide an entire social network into a set of disjoint or overlapping partitions.

Table 1 Symbols and terminology.

Symbol        Definition
G = (V, E)    A graph consisting of vertices V and edges E.
N(v)          The first-order neighbor set of vertex v.
N2(v)         The second-order neighbor set of vertex v.
type(v)       The vertex type of v.
UI            Uniqueness identification. See Definition 2.
UID           Uniqueness identification set. See Definition 2.
MUID          Mutual identification group. See Definition 4.
SE(v, M)      The structure equivalent set of node v given a set M. See Definition 1.

3 Preliminaries

In Section 3.1, we provide the formal deﬁnitions of the problems. We then prove that the problem of ﬁnding the optimal UID is NP-hard.

3.1 Problem Deﬁnition

We are given an undirected, simple, and labeled graph G(V, E) and a query node v ∈ V whose first- and second-order neighbor sets are denoted as N(v) and N2(v), respectively. Each vertex in G is labeled with a type denoted by type(v).

Table 1 summarizes the symbols and terminology used in this paper.

Definition 1 In a graph G(V, E), given a vertex v ∈ V and a set of nodes M ⊆ N(v), a node u ∈ V is said to be structure equivalent (abbreviated as SE) to v given M, denoted by u ∈ SE(v, M), if type(u) = type(v) and M ⊆ N(u).

SE(v, M) is the set of nodes structure equivalent to v given M. Please note that SE(v, M) = ∅ if there does not exist any node structure equivalent to v given M.
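Definition 1 can be illustrated with a minimal Python sketch. The function names, the adjacency-map representation, and the explicit exclusion of v itself from SE(v, M) are assumptions made for illustration. The edge list uses only the links discussed for Figure 1 in the Introduction; the Alice–C and Alice–C++ links are inferred from the remark that only Alice shares both C and C++ with Mary, and Bob is omitted because his links are not stated in the text.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def structure_equivalent(adj, node_type, v, M):
    """Return SE(v, M): every u != v with type(u) == type(v) and M a subset of N(u)."""
    M = set(M)
    if not M <= adj[v]:
        raise ValueError("M must be a subset of N(v)")
    return {
        u for u in adj
        if u != v and node_type[u] == node_type[v] and M <= adj[u]
    }

# Links taken from the Figure 1 discussion (partly inferred, see the lead-in).
edges = [
    ("Alice", "Prolog"), ("Alice", "C"), ("Alice", "C++"),
    ("Mary", "C"), ("Mary", "C++"), ("Mary", "Java"),
    ("Tim", "C++"), ("Tim", "Java"),
    ("John", "C++"), ("John", "Java"),
]
adj = build_adjacency(edges)
node_type = {n: "expert" for n in ("Alice", "Mary", "Tim", "John")}
node_type.update({n: "expertise" for n in ("Prolog", "C", "C++", "Java")})

# Java alone leaves Tim and John equivalent to Mary; C and C++ leave only Alice.
print(structure_equivalent(adj, node_type, "Mary", {"Java"}))
print(structure_equivalent(adj, node_type, "Mary", {"C", "C++"}))
print(structure_equivalent(adj, node_type, "Mary", {"C", "C++", "Java"}))
```

Under these assumed links, the three calls reproduce the cases discussed for Mary: two equivalent experts given Java, one given C and C++, and none given all three skills.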

An example is shown in Figure 2. In Figure 2(a), suppose M = {m1}; then SE(v, M) = {u1, u2}. Note that u4 is not included in SE(v, M) because it is not of the same type as v. In Figure 2(b), an additional vertex m2 is added into M (note that the nodes in M do not need to be of the same type). The set SE(v, M) becomes smaller since u2 ∉ N(m2), so u2 has to be removed from SE(v, M). This example shows that adding nodes into M may decrease the number of nodes in SE(v, M). In Figure 2(c), when M = {m1, m2, m3}, no vertex other than v is connected to every element in M. Therefore, SE(v, M) = ∅.

[Figure 2: three panels over the same graph with vertices v, m1 to m4, and u1 to u5. (a) M = {m1}, SE(v, M) = {u1, u2}. (b) M = {m1, m2}, SE(v, M) = {u1}. (c) M = {m1, m2, m3}, SE(v, M) = ∅.]

Fig. 2 Three examples to show SE(v, M). (a) u1 and u2 are structure equivalent (SE) to v given M = {m1}. (b) By adding m2 into M, u2 is no longer SE to v since u2 has no link with m2. (c) Similarly, by adding m3 into M, SE(v, M) becomes the empty set. Note that here we assume two types of nodes: those with filled color and those without.

Definition 2 Given a vertex v and a non-empty set M ⊆ N(v), we define that v’s uniqueness can be identified by the 2-tuple [M, SE(v, M)]. This tuple is called a UID of v.

Note that SE(v, M) can be empty, as it is possible to interpret such a situation as “v is unique because there is no other vertex that connects to M as v does”. When SE(v, M) is non-empty, the uniqueness of v can be interpreted as “v is unique because, besides the vertices in SE(v, M), the only vertex that connects to M is v”. Next we introduce an interesting and useful property of UIDs.

Property 1 Given a query vertex v, its UIDs can always be found within two hops of v.

Proof Since M ⊆ N(v), according to Definition 1, ∀u ∈ SE(v, M), M ⊆ N(u). Therefore, SE(v, M) ⊆ N2(v). That is, any UID, [M, SE(v, M)], is within two hops of v. ⊓⊔

By Definition 2, there are many possible UIDs given a query vertex v. Therefore, we define a function to compare the quality of UIDs.

Definition 3 Given two UIDs, D1 = [M1, SE(v, M1)] and D2 = [M2, SE(v, M2)], we define a comparison function Q(v, D1, D2) as follows.

(a) Q(v, D1, D2) = 1, i.e., the quality of D1 is better than that of D2, if (i) |SE(v, M1)| < |SE(v, M2)| or (ii) |SE(v, M1)| = |SE(v, M2)| and |M1| < |M2|.

(b) Q(v, D1, D2) = 0, i.e., the quality of D1 equals that of D2, if |SE(v, M1)| = |SE(v, M2)| and |M1| = |M2|.

(c) Otherwise, Q(v, D1, D2) = −1, i.e., the quality of D1 is worse than that of D2.

Note that Q(v, D1, D2) = −Q(v, D2, D1). Furthermore, Q is transitive: when Q(v, D1, D2) = 1 and Q(v, D2, D3) = 1, we have Q(v, D1, D3) = 1.
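The comparison in Definition 3 amounts to a lexicographic comparison of the pair (|SE(v, M)|, |M|). A minimal Python sketch follows; representing a UID as a tuple (M, SE) of Python sets and the name compare_uids are assumptions for illustration. The three sample UIDs are the ones listed for Figure 2.

```python
def compare_uids(d1, d2):
    """Return Q(v, D1, D2): 1 if D1 is better, 0 if equal, -1 if worse.

    Each UID is a tuple (M, SE). A smaller SE set wins first; ties are
    broken by a smaller M set.
    """
    m1, se1 = d1
    m2, se2 = d2
    key1 = (len(se1), len(m1))
    key2 = (len(se2), len(m2))
    if key1 < key2:
        return 1
    if key1 == key2:
        return 0
    return -1

# The three UIDs of Figure 2, panels (a), (b), and (c).
uid_a = ({"m1"}, {"u1", "u2"})
uid_b = ({"m1", "m2"}, {"u1"})
uid_c = ({"m1", "m2", "m3"}, set())

print(compare_uids(uid_c, uid_b))  # 1: (c) has fewer SE nodes than (b)
print(compare_uids(uid_b, uid_a))  # 1: (b) has fewer SE nodes than (a)
print(compare_uids(uid_a, uid_c))  # -1: (a) is worse than (c)
```

Because the key is an ordinary tuple, antisymmetry and transitivity of Q follow directly from the ordering of tuples.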

In Definition 3, a UID with a smaller |SE(v, M)| is preferred. This reflects the intuition that having fewer structurally equivalent nodes in the UID indicates a more unique vertex v. Therefore, our primary goal is to minimize |SE(v, M)| of the identified UIDs. We treat |M| as a secondary criterion because using a smaller M to obtain the same SE(v, M) implies that the query vertex can be identified with fewer critical neighbors.

In the three UIDs shown in Figure 2, the quality of the UID in Figure 2(c) is the best since there is no SE node for the query vertex. The UID of the query vertex in Figure 2(a) is the least unique of the three.

We further introduce the problem of finding a mutual identification group (MUID) given a query vertex. An MUID is a set of vertices where each vertex can be uniquely identified by a subset of the remaining vertices in the set.

Definition 4 In a graph G(V, E), given a vertex v ∈ V , a set of vertices X ⊆ V is a mutual identification group (MUID) of v if the following two conditions hold.

(a) v ∈ X.

(b) ∀u ∈ X, ∃D ⊆ X such that D is a UID of u.

The quality measure of MUIDs is conceptually similar to that of UIDs. For each MUID, the primary goal is to minimize |SE(v, M)| and the secondary goal is to minimize |M| of the UIDs.

Because each vertex in an MUID has its own UID, it is less straightforward to define the quality of an MUID. Again, we give a smaller SE(v, M) size priority over a smaller M size. To compare the |SE(v, M)| of MUIDs generated from different models, we propose two different metrics. The first metric uses the union of the SE sets of the nodes in the MUID, while the second metric sums the sizes of the SE sets of the nodes in the MUID. The same rule applies to M as the secondary criterion. The union measure evaluates how the vertices in the MUID exploit other members to identify themselves. If the nodes tend to include the same set of vertices to identify themselves, the union size would be smaller, creating a better MUID. The sum of the SE sizes tells us whether we only need a small set of vertices to uniquely identify each individual member, regardless of whether those vertices overlap.

Then, we can formally define the metrics for an MUID. Given an MUID X, each v ∈ X has a UID Dv = [Mv, SE(v, Mv)]. For the metric of union size, we define the size of the union of the SE sets (USE) and the size of the union of the M sets (UM):

$$ USE = \Big| \bigcup_{v \in X} SE(v, M_v) \Big|, \quad \text{and} \quad UM = \Big| \bigcup_{v \in X} M_v \Big|. $$

We are now able to compare the quality of two MUIDs by replacing |SE(v, M)| and |M| in Definition 3 with USE and UM. Similarly, for the metric of sum of size, we define the total size of the SE sets (TSE) and the total size of the M sets (TM):

$$ TSE = \sum_{v \in X} |SE(v, M_v)|, \quad \text{and} \quad TM = \sum_{v \in X} |M_v|. $$
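The four MUID metrics can be computed directly from the per-member UIDs. The following Python sketch assumes a dictionary that maps each member v of X to a tuple (Mv, SE(v, Mv)); the function name and the sample values are hypothetical and serve only to show how the union-based and sum-based metrics differ.

```python
def muid_metrics(uids):
    """Compute USE, UM, TSE, and TM for an MUID.

    uids maps each member v of the MUID to its UID, a tuple (M_v, SE_v)
    of Python sets.
    """
    m_sets = [m for m, _ in uids.values()]
    se_sets = [se for _, se in uids.values()]
    return {
        "USE": len(set().union(*se_sets)),     # size of the union of the SE sets
        "UM": len(set().union(*m_sets)),       # size of the union of the M sets
        "TSE": sum(len(se) for se in se_sets), # total size of the SE sets
        "TM": sum(len(m) for m in m_sets),     # total size of the M sets
    }

# Hypothetical per-member UIDs, for illustration only.
uids = {
    "a": ({"x"}, {"b", "c"}),
    "b": ({"x"}, {"a", "c"}),
    "c": ({"x", "y"}, set()),
}
metrics = muid_metrics(uids)
print(metrics)
```

In this sample the SE sets overlap on c, so the union-based USE (3) is smaller than the sum-based TSE (4), which is exactly the distinction the two metrics are meant to capture.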

We have considered other quality measures for UIDs, such as comparing |M| + |SE(v, M)|, i.e., comparing only the UID size without differentiating the importance of |M| and |SE(v, M)|. Under this measure, the three examples in Figure 2 have the same quality, which is against intuition because case (c) clearly distinguishes v from the others. Minimizing this criterion may fail to minimize |SE(v, M)| (e.g., by settling on a minimum |M|), leaving the ego with a set of structure equivalent nodes, which does not make it “unique”. For the same reason, an alternative measure that puts more emphasis on minimizing |M| than |SE(v, M)| cannot highlight the uniqueness of vertices. The same priority of the SE set over the M set also applies to MUIDs. We have also considered other measures, such as the number of types included in the UID or MUID found. For example, when forming a team, including more types means including more experts that know different skills. Modifying the proposed models to optimize such an objective function is left as future work.

3.2 Complexity Analysis

In this section we prove that finding the optimal UID with a minimal M under the condition of a minimal SE(v, M ) is NP-hard. We first prove that increasing M can only decrease the size of SE(v, M ) in Theorem 1. Based on this theorem, we can reduce the Set Covering Optimization Problem, which is known as an NP-hard problem, to our problem of finding the optimal UID.

We first observe an interesting property: adding more nodes into an existing M set may cause some nodes in SE(v, M) to be removed from the set. Therefore, |SE(v, M)| can only remain the same or become smaller; it can never become larger when more nodes are added into M.

Theorem 1 Given a query vertex v and M′ ⊇ M, then SE(v, M′) ⊆ SE(v, M).

Proof By Definition 1, every u′ ∈ SE(v, M′) has type(u′) = type(v) and satisfies

$$ u' \in \bigcap_{m \in M'} N(m) = \Big( \bigcap_{m \in M} N(m) \Big) \cap \Big( \bigcap_{m \in M' - M} N(m) \Big) \subseteq \bigcap_{m \in M} N(m). $$

Hence M ⊆ N(u′), and since type(u′) = type(v), u′ ∈ SE(v, M). Therefore, for all u′ ∈ SE(v, M′), u′ ∈ SE(v, M), i.e., SE(v, M′) ⊆ SE(v, M). ⊓⊔

This theorem implies that SE(v, N(v)) is the minimal SE set: N(v) is the largest possible set for M, so any other M′ ⊆ N(v) can only produce SE(v, M′) ⊇ SE(v, N(v)). Next, we show that the Set Covering Optimization Problem, which is known to be NP-hard, can be reduced to the problem of finding the optimal UID, i.e., the M of minimal size that attains the minimal SE(v, M).

Theorem 2 Given a query vertex v with the neighbor set N(v) = {m1, m2, ..., my}, the optimization of UID first finds a minimum SE set, SEmin, and then, under this condition, finds a minimum M set such that M ⊆ N(v) and SE(v, M) = SEmin. This problem is NP-hard.

Proof The Set Covering Optimization Problem is as follows: given a universe set U = {u1, u2, ..., uN} and a collection of subsets of U, S = {S1, S2, ..., Sk} where Si ⊆ U, i ∈ [1, k], find a minimum-size sub-collection Sopt ⊆ S such that

$$ \bigcup_{s \in S_{opt}} s = U. $$

For the problem of finding the optimal UID, Theorem 1 tells us that each time we add a node into M, some nodes in T = N2(v) − SEmin can be removed from the eventual SE set, where SEmin = SE(v, N(v)) and the elements in it are not removable. Therefore, for each mi, we can identify a subset Ti = T − SE(v, {mi}). That is, each mi has a corresponding set Ti of nodes that are excluded from the potential SE set, T, when mi is added into M. Note that in our problem the goal is to find a minimum set M = Mmin such that SE(v, Mmin) = SEmin. In other words, we want to find a minimum M set that removes all nodes in T and thus yields the minimal SE set.

To reduce the Set Covering Optimization Problem to finding the optimal UID, we first create an ego vertex v, create a neighbor mi of v for each Si ∈ S, and then create a vertex tj of the same type as v for each uj ∈ U. Finally, for each mi, we link mi with tj if uj ∈ U − Si. That is, the set U to be covered in the Set Covering Optimization Problem is mapped to the removable SE set, T, in the optimal UID problem, and each covering subset Si is mapped to the set Ti of nodes removed by the neighbor mi. Since generating the graph and finding the optimal UID in this way also solves the Set Covering Optimization Problem, which is known to be NP-hard, the problem of finding the optimal UID is proven to be NP-hard. ⊓⊔
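The construction used in the proof can be sketched in Python on a small instance. All names below (build_reduction_graph, smallest_identifying_set, the node labels "v", "m0", "t1", and the type labels) are assumptions for illustration, and the brute-force search is only there to check the correspondence on a toy input; it is not meant as an efficient solver.

```python
from itertools import combinations

def build_reduction_graph(universe, subsets):
    """Build the graph of the reduction: ego v, one neighbor per subset,
    one peer vertex per universe element, linked when the element is NOT
    in the subset."""
    adj = {"v": set()}
    node_type = {"v": "peer"}
    for i in range(len(subsets)):
        m = f"m{i}"
        adj[m] = {"v"}
        adj["v"].add(m)
        node_type[m] = "neighbor"
    for u in universe:
        t = f"t{u}"
        adj[t] = set()
        node_type[t] = "peer"
        for i, s in enumerate(subsets):
            if u not in s:
                adj[t].add(f"m{i}")
                adj[f"m{i}"].add(t)
    return adj, node_type

def se(adj, node_type, v, M):
    """SE(v, M) as in Definition 1, excluding v itself."""
    M = set(M)
    return {u for u in adj
            if u != v and node_type[u] == node_type[v] and M <= adj[u]}

def smallest_identifying_set(adj, node_type, v):
    """Brute force: the smallest M with SE(v, M) == SE(v, N(v))."""
    neighbors = sorted(adj[v])
    se_min = se(adj, node_type, v, neighbors)
    for k in range(1, len(neighbors) + 1):
        for M in combinations(neighbors, k):
            if se(adj, node_type, v, M) == se_min:
                return set(M)
    return set(neighbors)

# A toy set-cover instance (hypothetical values).
universe = [1, 2, 3, 4, 5]
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]

adj, node_type = build_reduction_graph(universe, subsets)
M_min = smallest_identifying_set(adj, node_type, "v")
cover = [subsets[int(m[1:])] for m in sorted(M_min)]
print(sorted(M_min), cover)
```

On this instance the smallest M consists of the neighbors built from {1, 2, 3} and {4, 5}, which is also the only cover of size two, so the minimum identifying set and the minimum set cover coincide as the proof argues.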

Here we suspect the MUID problem is also NP-hard because in MUID all vertices need to be uniquely identiﬁed by the rest. However, we will leave the detailed theoretical analysis of it to the future work.

4 Finding Optimal UIDs

In this section, we propose an exact method and several greedy methods to find the uniqueness identification set that satisfies a desired criterion. Since finding an optimal UID is an NP-hard problem, we propose an exhaustive search method to find the exact outcome and some heuristic-based methods that identify a sub-optimal result more efficiently.

Algorithm 1: Exhaustive Search

Input: A graph G = (V, E), a query vertex v
Output: The optimal UID of v

  Get N(v) = {m1, m2, ..., md}
  d ← |N(v)|
  ℳ ← the 2^d − 1 non-empty subsets of N(v)   // ℳ(i) is the i-th element of ℳ
  optM ← ℳ(1)
  for i = 2 to 2^d − 1 do
    if Q(v, [ℳ(i), SE(v, ℳ(i))], [optM, SE(v, optM)]) = 1 then
      optM ← ℳ(i)
    end
  end
  optUID ← [optM, SE(v, optM)]
  output optUID

[Figure 3: the example graph with query vertex v, N(v) = {m1, m2, m3, m4}, and N2(v) = {u1, u2, u3, u4, u5, u6}.]

Fig. 3 The example graph to find UID given v.

Algorithm 1 is an exhaustive search that identifies the optimal UID in O(2^d) time, where d is the degree of the query node v. It basically tries all the combinations of M. Note that this method is not efficient for nodes with high degree. For example, in our real-world datasets there are a few nodes with degree in the hundreds or higher, which makes running the exhaustive algorithm impractical.
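The exhaustive search can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, only non-empty subsets M are enumerated (as Definition 2 requires), and the example graph is an assumption reconstructed from the SE values listed for the example graph in Figure 5, with v and the u nodes sharing one type.

```python
from itertools import combinations

def example_graph():
    """Example graph reconstructed (as an assumption) from the listed
    SE(v, mi) values: each mi links to v and to the u nodes shown."""
    links = {
        "m1": {"u1", "u2", "u3"},
        "m2": {"u1", "u4", "u5", "u6"},
        "m3": {"u1", "u2", "u3", "u4"},
        "m4": {"u1", "u2", "u4", "u6"},
    }
    adj = {"v": set(links)}
    node_type = {"v": "peer"}
    for m, us in links.items():
        adj[m] = {"v"} | us
        node_type[m] = "neighbor"
        for u in us:
            adj.setdefault(u, set()).add(m)
            node_type[u] = "peer"
    return adj, node_type

def se(adj, node_type, v, M):
    """SE(v, M) as in Definition 1, excluding v itself."""
    M = set(M)
    return {u for u in adj
            if u != v and node_type[u] == node_type[v] and M <= adj[u]}

def exhaustive_uid(adj, node_type, v):
    """Try every non-empty M within N(v); keep the best (|SE|, |M|)."""
    neighbors = sorted(adj[v])
    best = None
    for k in range(1, len(neighbors) + 1):
        for combo in combinations(neighbors, k):
            cand = (set(combo), se(adj, node_type, v, combo))
            if best is None or (len(cand[1]), len(cand[0])) < (len(best[1]), len(best[0])):
                best = cand
    return best

adj, node_type = example_graph()
opt_M, opt_SE = exhaustive_uid(adj, node_type, "v")
print(sorted(opt_M), sorted(opt_SE))  # ['m1', 'm2'] ['u1']
```

The loop visits all 2^d − 1 non-empty subsets, which is why this approach is only practical for query nodes of small degree.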

To improve the efficiency, we propose the following three greedy methods, 1-Hop+, One-Neighbor, and Multiple-Neighbor, to find sub-optimal UID sets faster. 1-Hop+ and One-Neighbor are naive baselines to be compared with Multiple-Neighbor. We will use the example graph of query v in Figure 3 for demonstration.

In the 1-Hop+ method, the algorithm first includes all neighbors of v as M, and then adds SE(v, M) into the UID. This method extracts all the one-degree neighbors of v together with their SE nodes. Figure 4 shows the UID [{m1, m2, m3, m4}, {u1}], with u1 being the only node that connects to all the nodes in M. Note that this method is guaranteed to find the minimal SE set; however, it may produce the largest M set.

[Figure 4: the example graph of Figure 3 with M = N(v) = {m1, m2, m3, m4} and SE(v, M) = {u1}.]

Fig. 4 The UID of v found by the 1-Hop+ method.

[Figure 5: the example graph of Figure 3 annotated with SE(v, m1) = {u1, u2, u3}, SE(v, m2) = {u1, u4, u5, u6}, SE(v, m3) = {u1, u2, u3, u4}, and SE(v, m4) = {u1, u2, u4, u6}. The chosen set is M = {m1}, giving D = [M, SE(v, M)] = [{m1}, {u1, u2, u3}].]

Fig. 5 The UID of v found by the One-Neighbor method.

In the One-Neighbor method, given a vertex v, we choose a neighbor m ∈ N(v) and then obtain SE(v, M = {m}) to be included in the UID. Since the goal is to minimize the SE size, we compare the sizes of the SE sets produced by different m and choose the minimal one. In Figure 5, m1 is chosen and it produces SE(v, m1) = {u1, u2, u3}.

The two algorithms have low time complexity of O(d) where d is the degree of query node v. The 1-Hop+ method produces the minimal SE set but the maximal M . The One-Neighbor method keeps the minimal size of M , but in general does not minimize the size of the SE set. To address the above concerns, Algorithm 2 introduces the Multiple-Neighbor method whose goal is to identify a minimal M to minimize SE set. In this method, we greedily add neighbor nodes of v into M , guaranteeing the optimal SE set and hoping to obtain a minimal M .
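The two baselines can be sketched in Python as follows. The function names are hypothetical, ties in One-Neighbor are broken by neighbor name (an arbitrary choice made here for determinism), and the example graph is an assumption reconstructed from the SE values listed for Figure 5.

```python
def example_graph():
    """Example graph reconstructed (as an assumption) from the listed
    SE(v, mi) values: each mi links to v and to the u nodes shown."""
    links = {
        "m1": {"u1", "u2", "u3"},
        "m2": {"u1", "u4", "u5", "u6"},
        "m3": {"u1", "u2", "u3", "u4"},
        "m4": {"u1", "u2", "u4", "u6"},
    }
    adj = {"v": set(links)}
    node_type = {"v": "peer"}
    for m, us in links.items():
        adj[m] = {"v"} | us
        node_type[m] = "neighbor"
        for u in us:
            adj.setdefault(u, set()).add(m)
            node_type[u] = "peer"
    return adj, node_type

def se(adj, node_type, v, M):
    """SE(v, M) as in Definition 1, excluding v itself."""
    M = set(M)
    return {u for u in adj
            if u != v and node_type[u] == node_type[v] and M <= adj[u]}

def one_hop_plus(adj, node_type, v):
    """1-Hop+: take M = N(v), which yields the minimal SE set."""
    M = set(adj[v])
    return M, se(adj, node_type, v, M)

def one_neighbor(adj, node_type, v):
    """One-Neighbor: pick the single neighbor with the smallest SE set."""
    best = min(sorted(adj[v]), key=lambda m: len(se(adj, node_type, v, {m})))
    return {best}, se(adj, node_type, v, {best})

adj, node_type = example_graph()
print(one_hop_plus(adj, node_type, "v"))
print(one_neighbor(adj, node_type, "v"))
```

On this graph, 1-Hop+ returns all four neighbors with SE = {u1}, while One-Neighbor returns M = {m1} with SE = {u1, u2, u3}, matching the UIDs described for Figures 4 and 5.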

To guide the greedy choice of neighbors of v to add into M, we define SEgain to represent “the vertices to be removed from SE(v, M) after adding a node into M”.

Definition 5 Given the UID of a query vertex v as D = [M, SE(v, M)], and M′ = M ∪ {n}, where n ∈ N(v) − M, then SEgain(v, n, M) = SE(v, M) − SE(v, M′).

In Theorem 3 we prove that SEgain(v, n, M) is monotonic in M: it can only shrink as M grows. Note that no u ∈ SE(v, N(v)) can be in SEgain(v, n, M) for any n and M, since N(v) is the largest possible M ⊆ N(v) and u still remains in SE(v, N(v)).

Theorem 3 Given SE(v, M), SE(v, M′), n ∈ N(v), n ∉ M and n ∉ M′, if M′ ⊇ M then SEgain(v, n, M′) ⊆ SEgain(v, n, M).

Proof By Definition 1, a vertex in SE(v, M) stays structure equivalent after n is added exactly when it is linked to n, so SE(v, M ∪ {n}) = SE(v, M) ∩ N(n) and hence SEgain(v, n, M) = SE(v, M) − N(n). The same holds for M′. By Theorem 1, M′ ⊇ M implies SE(v, M′) ⊆ SE(v, M). Therefore,

$$ \begin{aligned} SEgain(v, n, M') &= SE(v, M') - N(n) \\ &\subseteq SE(v, M) - N(n) \\ &= SEgain(v, n, M). \end{aligned} $$

⊓⊔

The Multiple-Neighbor method starts from the state of M = ∅ and SE(v, M) = N2(v) given the query vertex v. It iteratively calculates the SEgain(v, n, M) of each neighbor n that is not included in M, adding the neighbor with the largest |SEgain(v, n, M)| into M and removing the vertices in SEgain(v, n, M) from the current SE(v, M). It ends when there is no neighbor vertex with |SEgain(v, n, M)| > 0. This method guarantees that the resulting UID has a minimal SE set.

Theorem 4 The UID of query vertex v found by the Multiple-Neighbor method is guaranteed to have a minimal SE set, i.e., if there is no neighbor n ∈ N(v) − M satisfying |SEgain(v, n, M)| > 0, then SE(v, M) is minimal.

Proof Let SEmin = SE(v, N(v)) be the minimal SE set and let the current state of the Multiple-Neighbor method be SE(v, Mv). Suppose that SE(v, Mv) is not minimal. Then the following two conditions must both hold.

(a) ∃u such that u ∉ SEmin and u ∈ SE(v, Mv), and

(b) ¬∃n ∈ N(v) − Mv such that |SEgain(v, n, Mv)| > 0.

Since u ∉ SEmin, there exists n ∈ N(v) − Mv such that n ∉ N(u). If we add n into Mv, then u ∉ SE(v, Mv ∪ {n}). Because u ∈ SE(v, Mv) and u ∉ SE(v, Mv ∪ {n}), u ∈ SE(v, Mv) − SE(v, Mv ∪ {n}) = SEgain(v, n, Mv). Hence |SEgain(v, n, Mv)| > 0, which contradicts (b). ⊓⊔

Figure 6 shows how the Multiple-Neighbor method is applied to the same example graph. In Figure 6(a), SEgain(v, mx, M) is generated for m1, m2, m3 and m4, and m1 is selected to be added into M because it has the largest SEgain. SEgain(v, m1, M) = {u4, u5, u6} is then removed from SE(v, M). In Figure 6(b), the SEgain of the rest of the neighbors are updated and m2 is then added. After m2 is added, only u1 is left in SE(v, M) and the algorithm halts because the condition |SEgain(v, x, M)| > 0 cannot be satisfied by adding any neighbor.

[Figure 6: two panels showing the Multiple-Neighbor method on the example graph of Figure 3. (a) Starting from M = ∅ and SE(v, M) = {u1, u2, u3, u4, u5, u6}: SEgain(v, m1, M) = {u4, u5, u6}, SEgain(v, m2, M) = {u2, u3}, SEgain(v, m3, M) = {u5, u6}, SEgain(v, m4, M) = {u3, u5}; m1 is added, giving M = {m1} and SE(v, M) = {u1, u2, u3}. (b) With M = {m1}: SEgain(v, m2, M) = {u2, u3}, SEgain(v, m3, M) = ∅, SEgain(v, m4, M) = {u3}; m2 is added, giving M = {m1, m2}, SE(v, M) = {u1}, and D = [M, SE(v, M)] = [{m1, m2}, {u1}].]

Fig. 6 The UID of v found by the Multiple-Neighbor method. (a) m1 is chosen as the first vertex to be added into M. (b) m2 is chosen as the second vertex to be added into M.

Algorithm 2: The Multiple-Neighbor Method

Input: A graph G = (V, E), a query vertex v
Output: UID D of v

SE ← N²(v)
M ← ∅
while |SE| > 0 do
    m_max ← argmax_m |SEgain(v, m, M)|, m ∈ N(v)
    if |SEgain(v, m_max, M)| ≤ 0 and |M| ≥ 1 then
        break
    end
    SE ← SE − SEgain(v, m_max, M)
    M ← M ∪ {m_max}
end
D ← [M, SE]
output D
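As an illustration, the greedy procedure of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the adjacency-dictionary representation, the helper names `build_graph`, `se_set`, and `multiple_neighbor`, the type labels, and the alphabetical tie-breaking are all hypothetical, and the edges of the toy graph are inferred from the SEgain sets reported in Figure 6.

```python
def build_graph(adjacency):
    """Build an undirected graph as a dict mapping each vertex to its neighbor set."""
    graph = {}
    for a, neighbors in adjacency.items():
        for b in neighbors:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def se_set(graph, node_type, v, M):
    """SE(v, M): vertices of v's type, other than v, adjacent to every vertex in M.

    Assumption: candidates are restricted to the two-hop neighborhood N^2(v),
    mirroring the initialization SE <- N^2(v) in Algorithm 2.
    """
    two_hop = {u for m in graph[v] for u in graph[m]
               if u != v and node_type[u] == node_type[v]}
    return {u for u in two_hop if all(u in graph[m] for m in M)}


def multiple_neighbor(graph, node_type, v):
    """Greedy UID search: repeatedly add the neighbor with the largest SEgain."""
    M = set()
    SE = se_set(graph, node_type, v, M)
    while SE:
        # SEgain(v, n, M) = SE(v, M) - N(n): vertices ruled out by adding n.
        gains = {n: SE - graph[n] for n in sorted(graph[v] - M)}
        if not gains:
            break  # every neighbor is already in M
        best = max(gains, key=lambda n: len(gains[n]))
        if not gains[best] and M:
            break  # no remaining neighbor shrinks SE
        M.add(best)
        SE = SE - gains[best]
    return M, SE


# Toy graph reconstructed from Figure 6 (hypothetical type labels).
graph = build_graph({
    "v": ["m1", "m2", "m3", "m4"],
    "m1": ["u1", "u2", "u3"],
    "m2": ["u1", "u4", "u5", "u6"],
    "m3": ["u1", "u2", "u3", "u4"],
    "m4": ["u1", "u2", "u4", "u6"],
})
node_type = {n: ("attribute" if n.startswith("m") else "entity") for n in graph}
M, SE = multiple_neighbor(graph, node_type, "v")
print(sorted(M), sorted(SE))  # ['m1', 'm2'] ['u1']
```

On this toy graph the sketch returns the same UID as Figure 6, and the resulting SE set coincides with SE(v, N(v)), in line with Theorem 4.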

5 Finding MUIDs

In Section 4, we introduced four methods to find the UID of a given vertex v. We now extend the three algorithms proposed in Section 4, excluding exhaustive search, to identify the MUID of a given vertex v.

Note that in an MUID, not only the query vertex but also every other vertex in the set has to be uniquely identified. That is to say, after adding nodes into the set to uniquely identify the query vertex, one then needs to make sure that the introduced nodes can themselves be uniquely identified by adding more nodes. In the worst case, one would have to include every node in the graph into the MUID. The quality of an MUID is measured by the average UID quality over the vertices in the MUID, where the quality of a UID is as defined in Definition 3.

[Figure 7: query vertex v with M_v = {m1, m2, m3, m4} and SE(v, M_v) = {u1, u2}, so D_v = [M_v, SE(v, M_v)] = [M_v, {u1, u2}]. MUID of v: [{v} ∪ M_v, {v} ∪ M_v ∪ SE(v, M_v)]. UIDs of members: D_m1 = [{v}, SE(m1, v)] = [{v}, {m3}], D_m2 = [{v}, {m4}], D_m3 = [{v}, {m1}], D_m4 = [{v}, {m2}], D_u1 = [M_v, SE(u1, M_v)] = [M_v, {v, u2}], D_u2 = [M_v, {v, u1}].]

Fig. 7 The MUID of v found by the 1-Hop+ method by redefining the M and SE sets

For the 1-Hop+ method, we first prove that the UID found for a given vertex v can serve as an MUID of v by redefining the M and SE sets.

Theorem 5 Given a vertex v and the UID identified by the 1-Hop+ method, D = [M_v, SE(v, M_v)], for every u ∈ D we can find a UID D_u = [M_u, SE(u, M_u)] such that M_u ⊆ {v} ∪ M_v and SE(u, M_u) ⊆ {v} ∪ M_v ∪ SE(v, M_v).

Proof We have to show that for each m ∈ M_v we can find a UID D_m = [M_m, SE(m, M_m)] such that M_m ⊆ {v} ∪ M_v and SE(m, M_m) ⊆ {v} ∪ M_v, and that a corresponding UID exists for each u ∈ SE(v, M_v). This follows from the two statements below.

(a) For m ∈ M_v, the UID of m can be found as M_m = {v} and SE(m, M_m) = {x | x ∈ M_v − {m}, type(x) = type(m)}.

(b) For u ∈ SE(v, M_v), the UID of u can be found as M_u = M_v and SE(u, M_u) = SE(v, M_v) ∪ {v} − {u}.

From (a) and (b), every u ∈ D can be identified by a UID D_u = [M_u, SE(u, M_u)] such that M_u ⊆ {v} ∪ M_v and SE(u, M_u) ⊆ {v} ∪ M_v ∪ SE(v, M_v). Therefore, the UID of v identified by the 1-Hop+ method is an MUID of v once the M and SE sets are redefined. ⊓⊔

Figure 7 shows an example of redefining the UID found by the 1-Hop+ method.
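The construction in statements (a) and (b) of the proof of Theorem 5 can be written down directly. The sketch below is illustrative only: the function name `redefined_member_uids` and the type labels are hypothetical, and the input sets are the ones read off Figure 7.

```python
def redefined_member_uids(v, M_v, SE_v, node_type):
    """Return {member: (M_member, SE_member)} following statements (a) and (b)."""
    uids = {}
    for m in M_v:
        # (a) M_m = {v}; SE(m, M_m) holds the other members of M_v with m's type.
        uids[m] = ({v}, {x for x in M_v if x != m and node_type[x] == node_type[m]})
    for u in SE_v:
        # (b) M_u = M_v; SE(u, M_u) = SE(v, M_v) union {v} minus {u}.
        uids[u] = (set(M_v), (SE_v | {v}) - {u})
    return uids


# Values read off Figure 7. The type labels are hypothetical and chosen so that
# m1 shares a type with m3 and m2 shares a type with m4, as the figure implies.
M_v = {"m1", "m2", "m3", "m4"}
SE_v = {"u1", "u2"}
node_type = {"v": "A", "u1": "A", "u2": "A",
             "m1": "B", "m3": "B", "m2": "C", "m4": "C"}
uids = redefined_member_uids("v", M_v, SE_v, node_type)

# MUID of v as given in Figure 7: [{v} + M_v, {v} + M_v + SE(v, M_v)].
muid = ({"v"} | M_v, {"v"} | M_v | SE_v)
```

Every member UID produced this way stays inside the two components of `muid`, which is the containment claimed by Theorem 5.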

The problem with an MUID obtained this way is that, for each query node, all of its neighbor nodes are included in the MUID. For nodes with large degrees, the resulting MUID is large, which is not preferable. Moreover, each neighbor of v is SE to the other neighbors of the same type, which leaves a large SE set in its own UID.

In the One-Neighbor method, given a vertex v, we obtain a UID D_v = [{m}, SE(v, m)]. To extend the One-Neighbor method to MUIDs, we propose the One-Neighbor-MUID method, which further uniquely identifies the newly introduced nodes. The idea is to obtain the UID for m, D_m, in a similar manner. With D_m = [{v}, SE(m, v)] as shown in Figure 8, D_v ∪ D_m = [{m} ∪ {v}, SE(v, m) ∪ SE(m, v)] becomes the MUID of v. The problem with One-Neighbor-MUID is that each node may have many SE nodes, which is not desirable in general.

[Figure 8: vertices v and m. UID of v: D_v = [m, SE(v, m)]. UID of m: D_m = [v, SE(m, v)]. MUID of v: D_v ∪ D_m = [{m, v}, SE(v, m) ∪ SE(m, v)].]

Fig. 8 The MUID of v found by One-Neighbor-MUID, obtained by combining the UIDs of v and m

Theorem 6 Given a vertex v and one of its neighbors m, let D_v = [{m}, SE(v, m)] and D_m = [{v}, SE(m, v)] be the UIDs identified by the One-Neighbor method. Then the union of the UIDs, D_v ∪ D_m, is an MUID of v.

Proof Given v and m, which are uniquely identified by the UIDs D_v and D_m, respectively, we have to show that for each element u ∈ SE(v, m) and each u ∈ SE(m, v) we can find a UID D_u = [M_u, SE(u, M_u)].

(a) For each u ∈ SE(v, m), its UID can be found as M_u = {m} and SE(u, M_u) = SE(v, m) ∪ {v} − {u}.

(b) For each u ∈ SE(m, v), its UID can be found as M_u = {v} and SE(u, M_u) = SE(m, v) ∪ {m} − {u}.

From (a) and (b), we conclude that every vertex u ∈ D_v ∪ D_m can be identified by a UID D_u with D_u ⊆ D_v ∪ D_m. Hence the result of One-Neighbor-MUID, D_v ∪ D_m, satisfies the requirement of an MUID of v. ⊓⊔
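Statements (a) and (b) can be checked mechanically on a small example. The following sketch assumes that SE(x, M) is the set of vertices of x's type, other than x, that are adjacent to every vertex in M. The bipartite toy graph and the names `se` and `one_neighbor_muid` are hypothetical.

```python
def se(graph, node_type, x, M):
    """SE(x, M) for non-empty M: vertices of x's type, other than x, adjacent to all of M."""
    common = set.intersection(*(graph[m] for m in M))
    return {u for u in common if u != x and node_type[u] == node_type[x]}


def one_neighbor_muid(graph, node_type, v, m):
    """Union of D_v = [{m}, SE(v, m)] and D_m = [{v}, SE(m, v)]."""
    return {m, v}, se(graph, node_type, v, {m}) | se(graph, node_type, m, {v})


# Hypothetical bipartite toy graph: experts a1..a3 and skills s1, s2.
graph = {
    "a1": {"s1", "s2"}, "a2": {"s1"}, "a3": {"s1"},
    "s1": {"a1", "a2", "a3"}, "s2": {"a1"},
}
node_type = {n: ("expert" if n.startswith("a") else "skill") for n in graph}
v, m = "a1", "s1"
members, se_union = one_neighbor_muid(graph, node_type, v, m)

# Statement (a): each u in SE(v, m) is identified by M_u = {m}.
for u in se(graph, node_type, v, {m}):
    assert se(graph, node_type, u, {m}) == (se(graph, node_type, v, {m}) | {v}) - {u}
# Statement (b): each u in SE(m, v) is identified by M_u = {v}.
for u in se(graph, node_type, m, {v}):
    assert se(graph, node_type, u, {v}) == (se(graph, node_type, m, {v}) | {m}) - {u}
```

Here the MUID of a1 is [{s1, a1}, {a2, a3, s2}], and both loops pass, so every introduced vertex has a UID contained in the union.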

We have shown that both 1-Hop+ and One-Neighbor-MUID can be used to obtain an MUID for a query vertex v. However, the quality of the MUIDs produced by these methods is usually not optimal. To overcome this deficiency, we extend the Multiple-Neighbor method to the MUID problem. As proposed previously, the spirit of the Multiple-Neighbor method is to choose a small neighborhood subset M of a given query v that leads to the smallest SE(v, M) to uniquely identify v. To extend the method to finding MUIDs, as Multiple-Neighbor-MUID, given the query vertex v and its UID D = [M_v, SE(v, M_v)], we have to make sure that every newly included m ∈ M_v and u ∈ SE(v, M_v) is also uniquely identified, so the process continues on these vertices. It is not hard to show that each u belonging to SE(v, M_v) obtained from our Multiple-Neighbor method has already been uniquely identified. Unfortunately, the corresponding UIDs might not be optimal, because there could exist another set M that yields a better SE set. Thus, we need to include more vertices into the MUID to identify the best result set.


We have proposed to use the heuristic SEgain(v, n, M) = SE(v, M) − N(n) as the criterion for choosing the next neighbor to add in the Multiple-Neighbor method. This criterion counts the nodes that can be removed from the UID if n ∈ N(v) is added. However, in Multiple-Neighbor-MUID, all the newly added nodes in M and SE(v, M) also have to be uniquely identified. As shown in Figure 9, to estimate how many additional nodes would be introduced after adding m1, we need to estimate how many vertices are required to identify M_m1, and to identify the further introduced vertices.

Similar to the Multiple-Neighbor method for the UID problem, here we define a heuristic function to estimate the quality of neighbors as candidates to be added into M. The idea is to execute the UID Multiple-Neighbor method for one pass on all nodes and record the SE(v, M) and M sets of each node v as UID_SE(v) and UID_M(v). When choosing the neighbors of v to add into M in Multiple-Neighbor-MUID, we give higher priority to a neighbor n with a smaller SE set in its own UID, i.e., a smaller |UID_SE(n)|. Given equal |UID_SE(n)|, the secondary criterion is to minimize |UID_M(n) − M|, because exploiting neighbors already in M could potentially introduce fewer new vertices. Given equal |UID_SE(n)| and |UID_M(n) − M|, the tertiary criterion is a larger |SEgain(v, n, M)|. Note that for n to be chosen into M, |SEgain(v, n, M)| must be larger than 0. Algorithm 3 describes the algorithm in detail.
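The three-level priority described above maps naturally onto a lexicographic sort key. The sketch below is one possible rendering under stated assumptions: the function name `choose_neighbor`, the dictionary-based inputs, the final tie-break by vertex name, and all example values are hypothetical.

```python
def choose_neighbor(candidates, uid_se, uid_m, M, se_gain):
    """Pick the next neighbor to add to M under the three-level priority.

    uid_se[n], uid_m[n]: precomputed SE and M sets of n's UID.
    se_gain[n]: SEgain(v, n, M) for the current query vertex and M.
    """
    # Only neighbors with |SEgain| > 0 may be chosen.
    eligible = [n for n in candidates if len(se_gain[n]) > 0]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda n: (
            len(uid_se[n]),      # primary: smaller |UID_SE(n)|
            len(uid_m[n] - M),   # secondary: smaller |UID_M(n) - M|
            -len(se_gain[n]),    # tertiary: larger |SEgain(v, n, M)|
            n,                   # assumed final tie-break: vertex name
        ),
    )


# Hypothetical precomputed values for four neighbors of a query vertex.
M = {"n0"}
uid_se = {"n1": {"a", "b"}, "n2": {"a"}, "n3": {"b"}, "n4": set()}
uid_m = {"n1": {"n0"}, "n2": {"x"}, "n3": {"n0"}, "n4": {"n0"}}
se_gain = {"n1": {"u1"}, "n2": {"u1", "u2"}, "n3": {"u2"}, "n4": set()}
print(choose_neighbor(["n1", "n2", "n3", "n4"], uid_se, uid_m, M, se_gain))  # n3
```

In this example n4 is excluded because its SEgain is empty, n2 and n3 tie on |UID_SE(n)|, and n3 wins because its UID reuses a vertex already in M.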

[Figure 9: query vertex v with its SE set and its set M = {m1, m2, m3, …}; each m_i has its own set M_mi and its own SE set SE(m_i, M_mi).]

Fig. 9 Example of Multiple-Neighbor-MUID.

6 Experimental Results

We introduce three real and three synthetic datasets for evaluation in Section 6.1. We then present the experimental results of finding UIDs and MUIDs, respectively. Finally, we report some case studies on MUID examples.


Algorithm 3: Multiple-Neighbor-MUID Method

Input: A graph G = (V, E), a query vertex v, the precomputed SE(v, M) and M of the UIDs
Output: MUID set X of v

Get N(v) = {m1, m2, ..., md}
M_x ← ∅
SE_x ← ∅
UID_SE(v) ← read SE(v, M) of v's UID
UID_M(v) ← read M of v's UID
W is a stack and initially empty    // W stores not yet uniquely identified vertices
push v into W
while |W| > 0 do
    w ← pop first element from W
    SE_w ← SE(w, M_x)
    M_w ← ∅
    N_x ← N(v) − X
    while |SE_w| > 0 do
        for n ∈ N_x do
            if |SEgain(w, n, M_w)| = 0 then
                N_x ← N_x − {n}
            end
        end
        if N_x = ∅ then
            break
        end
        minUID_SE ← max integer
        minUID_M ← max integer
        maxSEG ← min integer
        n_opt ← null
        for n ∈ N_x do
            if (|UID_SE(n)| < minUID_SE) or
               (|UID_SE(n)| = minUID_SE and |UID_M(n)| < minUID_M) or
               (|UID_SE(n)| = minUID_SE and |UID_M(n)| = minUID_M and |SEgain(w, n, M_w)| > maxSEG) then
                minUID_SE ← |UID_SE(n)|
                minUID_M ← |UID_M(n)|
                maxSEG ← |SEgain(w, n, M_w)|
                n_opt ← n
            end
        end
        SE_w ← SE_w − SEgain(w, n_opt, M_w)
        M_w ← M_w ∪ {n_opt}
        push n_opt into W
    end
    for u ∈ SE_w do
        if N(u) ⊃ N(w) then
            push u into W    // ∃M_u such that w ∉ SE(u, M_u)
        end
    end
    M_x ← M_x ∪ M_w
    SE_x ← SE_x ∪ SE_w
end
output X = [M_x, SE_x]


Table 2 Statistics of the data sets.

Name                      |V|      |E|      Number of types
KDD Movie Dataset         35311    168868   20
TW Academic Network       63122    770155   6
HepTh Citation Network    41840    933149   4
Erdős-Rényi Model         50000    249128   3
Barabási-Albert Model     50000    249179   3
Watts-Strogatz Model      50000    300000   3

6.1 Experiment Settings

We use three real datasets and three synthetic datasets to evaluate our model. The three real datasets are the KDD movie data set, an academic co-author network, and the HepTh citation network. The KDD movie data set is a movie-centric heterogeneous information network. It contains 35,311 nodes belonging to 20 different types, including movie, actor, director, place, producer, etc., and a total of 168,868 edges connecting different entities. The second real data set encodes the paper-author relationships among numerous scholars and published research papers. This network consists of 63,122 nodes and 770,155 edges. There are six types of nodes: student, professor, department, college, and Chinese and English keywords. For the third dataset, we choose a commonly used citation graph, Arxiv HEPTH (High Energy Physics - Theory), provided in KDD Cup 2003. There are 93,319 edges connecting 41,840 nodes of four types: author, paper, journal, and email domain. For each dataset, the giant connected component (GCC) is extracted for our experiments. We also create three synthetic datasets of 50,000 nodes each, based on three different social network generation models. We apply the Erdős-Rényi model [2] to produce a random graph, the Barabási-Albert model [1] to produce scale-free networks, and the Watts-Strogatz model [10] to produce graphs that mimic the small-world phenomenon. For the Erdős-Rényi (ER) random graph model, we set the connection probability to 0.0002. For the Barabási-Albert (BA) model, we set the number of initial nodes to 5. For the Watts-Strogatz (WS) model, we set the mean degree to 6 and the edge rewiring probability to 0.18. We assume there are three types of vertices in the network and randomly assign a type to each node.
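Two of the preprocessing steps mentioned above, extracting the giant connected component and randomly assigning one of three types to each node, can be sketched with the Python standard library alone. The function names `giant_component` and `assign_random_types`, the fixed seed, and the toy graph are assumptions made for illustration and say nothing about how the actual datasets were prepared.

```python
import random
from collections import deque


def giant_component(graph):
    """Return the subgraph induced by the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in graph[x]:
                if y not in component:
                    component.add(y)
                    queue.append(y)
        seen |= component
        if len(component) > len(best):
            best = component
    return {x: graph[x] & best for x in best}


def assign_random_types(nodes, num_types=3, seed=0):
    """Assign each node a type in {0, ..., num_types - 1} uniformly at random."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    return {x: rng.randrange(num_types) for x in nodes}


# Hypothetical toy graph with two components: {1, 2, 3} and {4, 5}.
g = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
gcc = giant_component(g)
types = assign_random_types(gcc)
print(sorted(gcc))  # [1, 2, 3]
```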

In Table 3 we compare the different methods by averaging their rankings (1 to 3). Note that the ranking is generated over all vertices in the graph based on Definition 3 (i.e., we compare |SE(v, M)| first, and the second criterion is |M| given identical |SE(v, M)|). Methods with a lower average rank are considered to perform better, namely to obtain UIDs of better quality.
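The ranking procedure can be illustrated with a short Python sketch. It assumes that ties share the better rank (a method's rank is one plus the number of methods with strictly better quality), which the text does not specify. The function name `average_ranks` and the per-vertex results are hypothetical and are not taken from Table 3.

```python
def average_ranks(results):
    """results: {vertex: {method: (|SE|, |M|)}} -> {method: average rank}.

    Quality tuples compare lexicographically, so |SE| is compared first and
    |M| only breaks ties, following the ordering of Definition 3.
    """
    ranks = {}
    for per_method in results.values():
        for method, quality in per_method.items():
            # Assumed tie rule: rank = 1 + number of strictly better methods.
            rank = 1 + sum(other < quality for other in per_method.values())
            ranks.setdefault(method, []).append(rank)
    return {method: sum(r) / len(r) for method, r in ranks.items()}


# Hypothetical UID qualities (|SE|, |M|) for two query vertices.
results = {
    "x": {"1-Hop+": (2, 4), "One-Neighbor": (3, 1), "Multiple-Neighbor": (2, 2)},
    "y": {"1-Hop+": (0, 3), "One-Neighbor": (0, 1), "Multiple-Neighbor": (0, 1)},
}
avg = average_ranks(results)
print(avg)  # {'1-Hop+': 2.5, 'One-Neighbor': 2.0, 'Multiple-Neighbor': 1.0}
```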

For MUID, we compare the performance of the three greedy methods only, because the exhaustive search method becomes intractable. Similar to our experiment on UID, Table 4 displays the comparison between the three greedy algorithms on MUID using the metrics described in Section 3.1. We report the average rank (calculated over all vertices) of each model.