PROXIMITY DISCOUNT HEURISTIC - Influence Maximization in Labeled Social Networks

In real-life cases, the social network can be derived before marketers use it to plan advertising strategies. Hence, for the original influence maximization problem, we can preprocess the network to find the top-k influencers in advance, so that marketers can determine their advertising plans immediately without waiting for the computation to finish. However, such preprocessing is infeasible for the labeled influence maximization problem, for three reasons. First, the possible combinations of targeted labels are numerous. Second, the profit values of the targeted labels are undefined in advance. Both remain unknown until marketers search for the most profitable strategy by testing combinations of labels and profit weights for their products. Third, it is impossible to precompute the targeted influencers for every possible combination of labels and profit values. We therefore propose the proximity discount method, which computes proximities offline and then, given the targeted labels queried online, finds the top-k seed nodes, so that the labeled influence maximization problem is solved efficiently.

The basic idea of our proximity discount method is to treat the topological proximity between two nodes as the potential for successful influence from one to the other. If node u has a higher proximity towards node v, we expect u to have a higher probability of activating v. Higher proximity indicates that (a) u is separated from v by fewer hops, (b) u and v are connected by multiple disjoint paths, and (c) walks from u to v tend to avoid nodes with high degrees. These three facts imply, respectively, that (a) u can affect v within fewer propagation steps, (b) u has more chances to affect v through multiple disjoint paths, and (c) propagation from u to v can avoid edges with low influence probabilities. In other words, proximity captures the pairwise reachability probability from a global structural view as the influence potential from one node to the other, whereas existing degree-related heuristics use only egocentric local information. We therefore believe that proximity is a better measure for capturing the propagation of influence in a network. The proximity is computed in the offline stage.

In addition, we incorporate the discount idea, inspired by the degree discount heuristic, into the design of our proximity discount method. Each time a new seed node is determined, the proximity scores (i.e., the expectations of influence) of some unselected nodes are discounted because those nodes have been affected by the existing seeds. This proximity discount is performed in the online stage.

In the following, we first describe the relationship between the proximity measure and the independent cascade influence model. We then explain how to compute the proximity offline. Finally, we present the online proximity discount algorithm, together with a data structure that performs the discounting efficiently.

4.1 Proximity for Independent Cascade Model

Many methods have been proposed for computing proximity scores between nodes in a network [8][9][14][23], each for a different scenario. To integrate with influence propagation in the independent cascade model, we modify the Cycle Free Effective Conductance (CFEC) method [14] to compute the pairwise proximity scores. CFEC uses a random walk to compute the probability that a surfer starting at one node ever arrives at the other. The major difference between CFEC and other proximity computation methods [8][9][23] is that the random surfer never passes through the same node twice: all walking routes are simple paths, and no cycles appear. This random walk strategy matches influence propagation in the independent cascade model, in which each node has only one chance to be activated.

The propagation of influence in the independent cascade model can be regarded as a kind of random walk driven by the influence probabilities on the edges. Assume there is a path $P = \langle u = w_0, w_1, \ldots, w_l = v \rangle$ between $u$ and $v$, and let $p_{x,y}$ denote the influence probability on edge $(x, y)$. The probability of starting from $u$ and arriving at $v$ along $P$ is

$$\Pr[P] = \prod_{i=0}^{l-1} p_{w_i, w_{i+1}}.$$

Combining this with the cycle-free idea above, the proximity from $u$ to $v$ equals the probability that a random surfer walks from $u$ to $v$ without visiting any node twice. Let $\mathrm{prox}(u, v)$ denote this probability. It is the sum of the probabilities of all simple paths from $u$ to $v$:

$$\mathrm{prox}(u, v) = \sum_{P \in \mathcal{P}_{u,v}} \Pr[P],$$

where $\mathcal{P}_{u,v}$ is the set of simple paths from $u$ to $v$.
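As a minimal sketch of this definition, the proximity of a tiny graph can be computed by exhaustively enumerating simple paths and summing the products of their edge probabilities. The function name `simple_path_proximity` and the nested-dictionary input `edge_prob` are hypothetical choices made for illustration only; the enumeration takes exponential time, so it is suitable only for very small graphs.

```python
def simple_path_proximity(edge_prob, source, target):
    """Sum the probabilities of all simple paths from source to target.

    edge_prob[x][y] is the influence probability on edge (x, y)
    (hypothetical input format). Exhaustive depth-first enumeration.
    """
    total = 0.0
    # Each stack entry: (current node, probability so far, nodes already visited).
    stack = [(source, 1.0, frozenset([source]))]
    while stack:
        node, prob, visited = stack.pop()
        if node == target:
            total += prob  # one complete simple path
            continue
        for nxt, p in edge_prob.get(node, {}).items():
            if nxt not in visited:  # never visit the same node twice
                stack.append((nxt, prob * p, visited | {nxt}))
    return total


# Toy graph with four simple paths from "u" to "v":
# u-a-v (0.25), u-b-v (0.2), u-a-b-v (0.125), u-b-a-v (0.1).
toy_graph = {
    "u": {"a": 0.5, "b": 0.4},
    "a": {"v": 0.5, "b": 0.5},
    "b": {"v": 0.5, "a": 0.5},
}
toy_proximity = simple_path_proximity(toy_graph, "u", "v")
```

In this toy graph the four path probabilities add up to 0.675, which is the proximity from "u" to "v".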

4.2 Offline Computing Proximity

Before computing the proximity score from u to v, we need to find the set of simple paths between u and v. Enumerating all simple paths is computationally infeasible in large-scale networks.

Nevertheless, the form of the path probability suggests an approximation. The probability of a single path is usually very small. Sorting the simple paths by probability in descending order, we find that the probability of the 100th simple path can be one millionth of that of the 1st. It is nearly impossible to influence a node with such a low probability. We therefore consider only the top-k simple paths when approximating the proximity, since the top ones have the highest potential for effective influence. Figure 2 illustrates the difference between the top-k simple paths and the top-k paths. If k = 3, the lengths of the top-k simple paths are 6, 20, and 21, whereas the lengths of the top-k paths are 6, 8, and 10, respectively.

Figure 2. An example graph to elaborate the difference between the top-k simple paths and the top-k paths.

The propagation of influence is based on the probabilities on the edges. Each move from a node u to its neighbor v corresponds to one more multiplication by the probability on edge (u, v), so the length of a path is tied to the product of its edge probabilities. To find the lengths of simple paths, we therefore transform the influence probability on each edge into an edge length, such that a simple path with a small length has a high influence probability. Finding the top-k paths with the highest probabilities is then equivalent to finding the k shortest simple paths. In this report, we do not fix a specific k. Instead, we keep finding the next shortest simple path until the probability of the current simple path falls to one twentieth (0.05) of that of the top-ranked one. Specifically, we follow Martins and Pascoal's method to find the top-k simple paths, and implement it on the Hadoop/MapReduce framework.
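The following sketch illustrates the idea on a single machine. It rests on several assumptions that are not taken from this report: the edge length is taken to be the negative logarithm of the edge probability (one common transform that turns products into sums), the stopping rule is read as "stop once a path's probability drops below 0.05 times that of the best path", and a plain best-first search over partial simple paths stands in for Martins and Pascoal's method and the Hadoop/MapReduce implementation. The names `top_simple_paths`, `ratio`, and `max_paths` are hypothetical.

```python
import heapq
import itertools
import math


def top_simple_paths(edge_prob, source, target, ratio=0.05, max_paths=100):
    """Return (probability, path) pairs for the most probable simple paths.

    Assumed transform: an edge with probability p gets length -log(p), so the
    most probable path is the shortest one. Partial simple paths are expanded
    best-first; since all lengths are non-negative, complete paths leave the
    heap in order of non-increasing probability.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares paths
    heap = [(0.0, next(counter), (source,))]
    results = []
    best = None
    while heap and len(results) < max_paths:
        length, _, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            prob = math.exp(-length)
            if best is None:
                best = prob
            elif prob < ratio * best:
                break  # every remaining path is at most this probable
            results.append((prob, list(path)))
            continue
        for nxt, p in edge_prob.get(node, {}).items():
            if 0.0 < p <= 1.0 and nxt not in path:  # keep the path simple
                heapq.heappush(
                    heap, (length - math.log(p), next(counter), path + (nxt,))
                )
    return results


demo_graph = {
    "u": {"a": 0.5, "b": 0.4},
    "a": {"v": 0.5, "b": 0.5},
    "b": {"v": 0.5, "a": 0.5},
}
demo_paths = top_simple_paths(demo_graph, "u", "v", ratio=0.6)
approx_proximity = sum(prob for prob, _ in demo_paths)
```

With `ratio=0.6`, only the two most probable paths of the toy graph survive the cut, and their probabilities sum to an approximate proximity. The search is exponential in the worst case, so it only illustrates the ranking and the stopping rule, not a scalable implementation.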

4.3 Online Proximity Discount Algorithm

Based on the pairwise proximity scores derived offline, in this subsection we propose the proximity discount method for finding the seed nodes according to the targeted labels queried online. The idea is similar to that of the previous labeled degree discount algorithm, but the proximity discount method has two further benefits. First, it considers the high-order neighborhood of each node, instead of only the degree as in the labeled degree discount. Second, when computing the expectation of influence of a node v, it not only discounts the profits of v's neighboring seed nodes, but also accounts for the fact that v's neighboring seeds could affect v's other neighbors as well.

Each time a node w is selected as a seed, the proximity discount method updates and discounts the expectation of influence of every node v that has not yet been selected. The computation of the proximity discount for v's expectation of influence consists of three parts. First, we compute the probability that v is not successfully activated by the seed set S. Second, we compute the expected profit of v. If the proximity from v to a node u is not zero, v has a certain chance of successfully activating u. Hence, the higher this proximity, the higher the expected profit that v gains from u, and the more influential v is. Third, we consider the probability that the seed nodes could affect u. Since u could already have been activated by the seed set S, the expectation of influence of v should be discounted accordingly. We summarize the update of v's expectation of influence in the following formula.
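One plausible way to combine the three parts is sketched below. It is an illustration under stated assumptions, not the exact update rule of this report: the proximity from a seed to a node is treated as an independent activation probability, so the probability that a node escapes all seeds is a product of (1 - proximity) terms; the expected profit of v is the sum of proximity times profit over reachable nodes, plus v's own profit; and both the candidate and each reachable node are weighted by their probability of not having been activated yet. All names (`proximity_discount`, `prox`, `profit`, `q`) are hypothetical.

```python
def proximity_discount(prox, profit, k):
    """Greedily pick k seeds with a discounted expectation of influence.

    prox[v][u]: offline proximity from v to u (assumed to act as an
                activation probability).
    profit[u]:  profit of activating u under the queried targeted labels.
    """
    nodes = list(profit)
    # Part 1: q[v] is the probability that v is NOT activated by the seeds so far.
    q = {v: 1.0 for v in nodes}
    seeds = []

    def expected_influence(v):
        # Part 2: expected profit v can collect from nodes it may activate.
        # Part 3: discount each such node by the chance a seed already reached it.
        gain = profit[v] + sum(
            p * profit[u] * q[u]
            for u, p in prox.get(v, {}).items()
            if u != v and u not in seeds
        )
        return q[v] * gain

    for _ in range(k):
        candidates = [v for v in nodes if v not in seeds]
        if not candidates:
            break
        w = max(candidates, key=expected_influence)
        seeds.append(w)
        # The new seed w lowers the "not yet activated" probability of its targets.
        for u, p in prox.get(w, {}).items():
            q[u] *= 1.0 - p
    return seeds


example_prox = {
    "a": {"b": 0.9, "c": 0.9},
    "b": {"c": 0.9},
    "c": {},
    "d": {"e": 0.8},
    "e": {},
}
example_profit = {v: 1.0 for v in example_prox}
example_seeds = proximity_discount(example_prox, example_profit, 2)
```

In this example, node "b" has the second-highest undiscounted score (1.9 against 1.8 for "d"), but once "a" is chosen both "b" and its target "c" are very likely already activated, so the discounted score of "b" collapses and "d" is selected as the second seed.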

To compute the proximity discount efficiently, we also develop a special data structure that stores the graph together with information about the targeted labels. For each node v, we create cells that link v to other nodes, one cell per targeted label. For each cell, we create a linked list that stores the nodes with the corresponding targeted label that v could attempt to activate. As a result, to find the nodes with a given targeted label that v could attempt to activate, we simply traverse the corresponding linked list.

Figure 3 illustrates the proposed data structure with an example. We store the nodes in an array and assume there are two targeted labels. For each node, we create an element that stores the probability that the node is not influenced by the seed nodes, together with two pointers to two linked lists that hold the nodes with the first and the second targeted label, respectively.

Figure 3. The proposed data structure to store the neighbors with certain targeted labels.
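A rough in-memory analogue of this structure is sketched below. It is only an illustration: a Python dictionary replaces the node array, Python lists replace the linked lists, and a node is assumed to be reachable from v when the proximity from v to it is positive. The names `NodeEntry`, `build_index`, `prox`, and `label_of` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeEntry:
    # Probability that this node is not influenced by the current seed nodes.
    not_activated: float = 1.0
    # One list per targeted label, holding the nodes this node may activate.
    by_label: dict = field(default_factory=dict)


def build_index(prox, label_of, targeted_labels):
    """Build one NodeEntry per node, bucketing reachable nodes by targeted label.

    prox[v][u]:  offline proximity from v to u (assumed input format).
    label_of[u]: the label attached to node u.
    """
    index = {}
    for v in label_of:
        entry = NodeEntry(by_label={label: [] for label in targeted_labels})
        for u, p in prox.get(v, {}).items():
            # Assumption: u is reachable from v when the proximity is positive.
            if p > 0 and label_of[u] in entry.by_label:
                entry.by_label[label_of[u]].append(u)
        index[v] = entry
    return index


sample_prox = {"a": {"b": 0.3, "c": 0.2, "d": 0.0}, "b": {}, "c": {}, "d": {}}
sample_labels = {"a": "x", "b": "x", "c": "y", "d": "y"}
sample_index = build_index(sample_prox, sample_labels, ["x", "y"])
```

Looking up `sample_index["a"].by_label["y"]` then returns the nodes with label "y" that "a" could attempt to activate, without scanning the rest of the graph.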

The overall method of the proposed proximity discount is described in Algorithm 5.

Algorithm 5. Proximity Discount (G, k)

Based on the pairwise proximity scores derived offline in the previous section, we propose another greedy method to solve the labeled influence maximization problem. The central idea is to treat the original problem as the Maximum Coverage Problem. The desired seed set is then the set of nodes that covers nodes with the maximum total profit.

Given a graph G, each node v has a set of nodes that it covers, determined by the proximity scores from v. Collect these sets, one per node, into a family. The original maximum coverage problem is to choose k sets from this family such that their coverage (i.e., the number of nodes in their union) is maximized. Applying this concept to the labeled influence maximization problem, the goal is to find a seed set S of k nodes such that the labeled influence spread (i.e., total profit) of S, which is the total profit of the nodes covered by the seeds in S, is maximized.
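To make the objective concrete, the sketch below evaluates the total profit of the covered nodes and maximizes it with the standard greedy heuristic for weighted maximum coverage. How the cover sets are derived from the proximity scores is left open here; the input `cover`, like the names `greedy_max_coverage` and `profit`, is a hypothetical choice for illustration.

```python
def greedy_max_coverage(cover, profit, k):
    """Pick up to k seeds whose cover sets have the largest total profit.

    cover[v]:  set of nodes covered by v (assumed to be given).
    profit[u]: profit of node u under the targeted labels.
    """
    seeds = []
    covered = set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for v, nodes in cover.items():
            if v in seeds:
                continue
            # Marginal gain: profit of nodes that are not covered yet.
            gain = sum(profit.get(u, 0.0) for u in nodes - covered)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break  # no remaining node adds any profit
        seeds.append(best)
        covered |= cover[best]
    total_profit = sum(profit.get(u, 0.0) for u in covered)
    return seeds, total_profit


sample_cover = {
    "a": {"a", "b", "c"},
    "b": {"b", "c", "d"},
    "c": {"d", "e"},
}
sample_profit = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 2.0}
chosen, spread = greedy_max_coverage(sample_cover, sample_profit, 2)
```

Here "a" is chosen first because its cover set is worth 4.0. In the second round "c" adds 3.0 of new profit while "b" adds only 1.0, so the result is the seed set ["a", "c"] with total profit 7.0.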

Algorithm 5. Maximum Coverage (G, k)

1: Initialize .

Since finding the nodes with maximum coverage is an NP-hard problem, which implies that using this concept to find the top-k targeted influencers in a labeled social network is NP-hard as well, we develop a greedy algorithm to solve it, as described in Algorithm 5. The basic idea is to select, in each round of seed selection, the node that increases the total profit the most. In details of
