
Additional Domain Knowledge

As mentioned above, the similarity is calculated from the connection features, which are all (evolving) topology features, since we only have information from the connection log/network. If additional domain knowledge is available, such as node attributes, we can use it to improve the similarity measurement and thereby make better predictions.

One piece of such information is especially important but cannot be derived from the connection log alone: information about the newly appeared site v. Since the site v has just appeared in the connection log, we only know who connects to it and when each connection is established. But if more domain knowledge is given, for example, that the newly appeared site v is very similar to another site v′ that was connected by the originator o before, then we can use this information to add weight to the static (surprising) features if v′ belongs to the long-term (short-term) interest of o.

Generally speaking, there are four different cases:

1. v is similar to the connected sites in the long-term interest of o,

2. v is similar to the connected sites in the short-term interest of o,

3. v is similar to the connected sites in both the long-term and short-term interests of o, and

4. v is not similar to any connected sites in the long-term or short-term interest of o.

This information is not useful in the last case, but for the first three cases we can derive a relative adjusting weight $w_v$ from how strongly v belongs to the long-term interest rather than the short-term interest, where $0 < w_v < 1$. We can then adjust the weights of the long-term and short-term interests by $w_v$:

$$ s_o(u) = w_v \cdot sl_o(u) + (1 - w_v) \cdot ss_o(u), \quad (4.13) $$

where $sl_o(u)$ is the similarity of $u$ to $o$ on static features only, and $ss_o(u)$ is the similarity of $u$ to $o$ on surprising features only.
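As a concrete illustration, the following Python sketch combines the two similarity scores as in Eq. (4.13). The function and argument names are hypothetical and not part of the original PLPF implementation.

```python
# A minimal sketch of the interest-weighted similarity in Eq. (4.13).
# Names (adjusted_similarity, sim_static, sim_surprising) are illustrative.

def adjusted_similarity(sim_static: float, sim_surprising: float, w_v: float) -> float:
    """Combine static-feature and surprising-feature similarities.

    sim_static     -- similarity of u to o on static features only (sl_o(u))
    sim_surprising -- similarity of u to o on surprising features only (ss_o(u))
    w_v            -- adjusting weight, 0 < w_v < 1; larger values favor the
                      long-term (static) interest of the originator o
    """
    assert 0.0 < w_v < 1.0, "w_v must lie strictly between 0 and 1"
    return w_v * sim_static + (1.0 - w_v) * sim_surprising
```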

Chapter 5 Experiments

In this section, we first describe the experimental environment, including 1) the real datasets, 2) the query formats, and 3) the evaluation measurements. We then show the prediction results derived by our algorithm and compare both the effectiveness and the efficiency with the information-flow-based approach proposed in [20]. The parameters used in PLPF to construct profiles and calculate similarities are discussed later to show how they affect the prediction results. Finally, we analyze the properties of PLPF with a synthetic dataset.

5.1 Experiment Settings

For this work, we collect the TCP connection logs between the dormitory network (dormnet) of our university and the Internet from 2010-09-14 to 2011-01-31, which serves as the connection log of one semester.

There are more than 1,431,000,000 connections established between a total of 723 students and 15,436,399 outside sites. A set of queries is created for the experiments with the following properties:

1. Each query contains the information of the first connection to a newly appeared site: the originator (represented by an IP address), the newly appeared site (represented by an IP address), and the time at which this connection is made (a representative query record is sketched below).

2. Each query must have 50 to 70 true successors (about 7∼10% of users) who really connect to the same site in the week following the appearance of the new site; the links established by these successors are used as the ground truth.

In total, 125 queries satisfying the above properties are selected, with an average of 59.4 true successors in the week following the appearance of the new site.
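For illustration, a query record of this form might be represented as below. The field names are assumptions made for this sketch, not the dataset's actual schema.

```python
# A hypothetical representation of a single query record; field names are
# assumptions for illustration only.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Query:
    originator: str            # IP address of the user making the first connection
    new_site: str              # IP address of the newly appeared site
    first_seen: datetime       # time at which the first connection is established
    true_successors: set[str]  # IPs of users who connect within the following week
```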

The input for each method is the query (i.e., the originator, the newly appeared server, and the time of the first connection) along with the connection logs of the last day, and the output is the top-k possible links to the same site. To evaluate the effectiveness of each method, we consider the precision/recall at rank k and the mean average precision (MAP), since MAP provides a single-figure measure of quality across recall levels and has been shown to have especially good discrimination and stability [15].

Precision at rank k for a single query q is the fraction of the top-k possible successors that are true successors (i.e., users who establish links to the specific site in the following week), and recall at rank k for a single query q is the fraction of all true successors that appear among the top-k possible successors. Precision and recall at rank k for a query q are formalized as below:

$$ \text{precision}_q(k) = \frac{|\text{true successors in the top-}k\text{ possible successors}|}{k} \quad (5.1) $$

$$ \text{recall}_q(k) = \frac{|\text{true successors in the top-}k\text{ possible successors}|}{m_q} \quad (5.2) $$

The notation $m_q$ denotes the number of all true successors for a query q. For a query set Q, the overall precision/recall at rank k is computed as the average of the precision/recall at rank k over all queries q in Q.
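A minimal Python sketch of Eqs. (5.1) and (5.2), assuming the prediction is given as a ranked list of user identifiers; all names are illustrative.

```python
# Precision and recall at rank k for a single query, as in Eqs. (5.1)-(5.2).

def precision_at_k(predicted: list[str], true_successors: set[str], k: int) -> float:
    """Fraction of the top-k predicted successors that are true successors."""
    hits = sum(1 for u in predicted[:k] if u in true_successors)
    return hits / k

def recall_at_k(predicted: list[str], true_successors: set[str], k: int) -> float:
    """Fraction of all true successors found among the top-k predictions."""
    hits = sum(1 for u in predicted[:k] if u in true_successors)
    return hits / len(true_successors)  # m_q = number of all true successors
```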

To compute MAP for a query set Q, we should first compute the average precision for each query q ∈ Q. The average precision of a query q is the average of the precision at rank k for each k from 1 to $m_q$, and MAP is the average of the average precisions over all queries in the query set Q. Average precision for a query q and MAP for a query set Q are formalized as below:

$$ \text{AvgPrec}(q) = \frac{1}{m_q} \sum_{k=1}^{m_q} \text{precision}_q(k), \qquad \text{MAP}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \text{AvgPrec}(q) $$
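The corresponding sketch for average precision and MAP, reusing the hypothetical precision_at_k function from the previous sketch:

```python
# Average precision and MAP as defined above; precision_at_k is the
# illustrative function sketched earlier.

def average_precision(predicted: list[str], true_successors: set[str]) -> float:
    """Average of precision at rank k for k = 1 .. m_q."""
    m_q = len(true_successors)
    return sum(precision_at_k(predicted, true_successors, k)
               for k in range(1, m_q + 1)) / m_q

def mean_average_precision(results: list[tuple[list[str], set[str]]]) -> float:
    """Average of the average precisions over all queries in the query set Q."""
    return sum(average_precision(pred, truth)
               for pred, truth in results) / len(results)
```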

Besides, to evaluate the efficiency of each method, we consider its computation time cost, excluding the disk I/O for reading logs and for reading/writing intermediate data structures. For each method, the computation time cost is accumulated from accessing the connection logs to deriving the final result of the top-k possible links. This comparison is based on the computation time cost to derive the prediction result for one single query.
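As a sketch of this measurement protocol, the timer can be started only after the logs are already in memory, so disk I/O is excluded; predict_fn and the other names here are hypothetical stand-ins for either method under comparison.

```python
# Measure per-query computation time, excluding disk I/O: the logs are
# loaded into memory before the timer starts.
import time

def timed_prediction(predict_fn, query, preloaded_logs, k: int):
    """Run one prediction and return (top_k_links, elapsed_seconds)."""
    start = time.perf_counter()              # timer starts after logs are in memory
    top_k = predict_fn(query, preloaded_logs, k)
    elapsed = time.perf_counter() - start
    return top_k, elapsed
```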
