
Chapter 6 Knowledge Discovery

6.1 Preprocessing Phase

Before presenting our method, this section defines the notation used in this paper. It also discusses the RENUMBER SORT ALGORITHM and the PREPROCESS ALGORITHM, which transform the original activity logs into feature vectors that carry more useful information.

6.1.1 Definitions of Original Activity Log Database

Assume there are n users u_1, u_2, …, u_n. Each u_q is represented by a unique ID, e.g., the user ID of a web system or the customer ID of a shopping center. Let U = {u_1, u_2, …, u_n}.

T = [t_0, t_0 + wc] is the time interval over which activity logs are collected, where c is a constant, t_0 = 0, t_i = t_{i-1} + c, and T_i = (t_{i-1}, t_i], 1 ≤ i ≤ w.
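As a minimal sketch of the windowing above, assuming timestamps are plain numbers measured from t_0 = 0, the index i of the window T_i that contains a timestamp can be computed as follows. The function name `window_index` is hypothetical and is not part of the original method.

```python
import math

def window_index(t, c, w):
    """Return i such that t lies in T_i = ((i-1)*c, i*c], for 1 <= i <= w.

    Assumes t_0 = 0 and a constant window length c, as in the definition above.
    Returns None when t falls outside (0, w*c].
    """
    if t <= 0 or t > w * c:
        return None
    # Windows are open on the left and closed on the right, so a timestamp
    # exactly on a boundary t_i belongs to T_i, not to T_{i+1}.
    return math.ceil(t / c)
```

Because each T_i is open on the left, a timestamp equal to t_0 belongs to no window in this sketch.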

E_i = <e_1^i, e_2^i, ..., e_{α_i}^i> is the sequence of activity logs during T_i, sorted in time order, and we assume |E_i| = α_i ≤ α for each i.

‘•’ is the concatenation operator, i.e., E_1•E_2 = <e_1^1, e_2^1, ..., e_{α_1}^1, e_1^2, e_2^2, ..., e_{α_2}^2>.

E = E_1•E_2•…•E_w is the complete activity log under consideration in T.

e-id is the event identifier, defined by the triple <unique ID, action target, action>, where unique ID ∈ U, action target is the target of the user's activity, e.g., the item for sale, and action is the action taken by the user, e.g., POST or GET.

ID(e_j^i) is a function that extracts the e-id of e_j^i.
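The e-id and the extracting function ID can be illustrated with a small sketch. The record type `Activity` and the function name `event_id` are hypothetical names chosen for illustration; only the triple <unique ID, action target, action> comes from the definition above.

```python
from typing import NamedTuple

class Activity(NamedTuple):
    time: float   # when the activity occurred
    uid: str      # unique ID of the user, an element of U
    target: str   # action target, e.g. the item for sale
    action: str   # action taken, e.g. "GET" or "POST"

def event_id(e):
    """Extract the e-id <unique ID, action target, action> of an activity."""
    return (e.uid, e.target, e.action)
```

Two activities that differ only in their time share the same e-id, which is what allows them to be aggregated later.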

6.1.2 Renumber Sort Algorithm

Since a single activity does not carry enough information to represent user behavior, the activities with the same e-id selected from E_i are first aggregated during T_i and then transformed into a feature vector. However, when we aggregate in this phase, if user A performs a GET action on web page X, and the other action user A

Notations:

f_j^i = ReNumSort(E_i) is the jth distinct e-id during T_i.

S_i = <f_1^i, f_2^i, ..., f_{β_i}^i> is a sequence of feature vectors during T_i, where β_i ≤ α_i.

F_i = {f_j^i | for 1 ≤ j ≤ β_i}, and F = ∪_{i=1}^{w} F_i.

S_q^i is a subsequence of S_i for q ∈ U.

v_q = <S_q^1, S_q^2, ..., S_q^w> is a behavior vector of u_q. V = {v_q | for q ∈ U}.

Table 6.1 presents the format of a general activity log. The Time field indicates when the logged activity occurred. The UID and TARGET fields indicate the unique ID of the user and the target item on which the action is performed, respectively. The ACTION field indicates the action taken in the activity. In network traffic, for example, the ACTION may be the destination port, which implies the service requested by the user, e.g., port 21 for FTP, port 23 for Telnet, and port 80 for HTTP. The information contained in the activity log differs from application to application. For purchasing activities, for example, the sale amount and quantity may also be included, and this information can also be used in our algorithm.

Table 6.1: The Format of Standard Log Information.

Time UID TARGET ACTION … … … … …

To aggregate the activities into feature vectors, we first process the original activity database with the RENUMBER SORT ALGORITHM to obtain the distinct e-ids during T_i, denoted f_j^i. For each activity during T_i, if a previously defined feature vector entry has the same e-id as the activity, that entry is replaced by the result of aggregating the activity's information into it. Otherwise, a new feature vector entry is created and defined.

The RENUMBER SORT ALGORITHM is shown as follows.

Algorithm 6.1: ReNumberSort algorithm, ReNumSort(E_i)

Input: E_i

Output: F_i, S_i

Step1. F_i = φ, S_i = < >, β_i = 0.

Step2. For j = 1 to α_i,

    Step2.1. Set DistinctFlag = True.

    Step2.2. For k = 1 to β_i,

        If ID(e_j^i) = f_k^i, replace f_k^i by merging e_j^i and f_k^i, set DistinctFlag = False, and EXIT the loop over k.

    Step2.3. If DistinctFlag = True, β_i++, f_{β_i}^i = ID(e_j^i), and merge e_j^i into f_{β_i}^i.

Step3. For j = 1 to β_i,

    Put f_j^i into F_i, S_i = S_i•f_j^i.

Step4. Return F_i, S_i
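The aggregation described by the algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: activities are assumed to be (time, uid, target, action) tuples, a dictionary lookup replaces the linear scan over previously seen e-ids, F_i is returned as the set of distinct e-ids, and the `init` and `merge` rules (an activity count plus first and last times) are hypothetical stand-ins for the domain-specific aggregation.

```python
def renum_sort(events, event_id, init, merge):
    """Aggregate a time-ordered list of activities by e-id.

    Returns (F_i, S_i): the set of distinct e-ids and the sequence of
    feature vectors, ordered by first appearance of each e-id.
    """
    features = {}  # e-id -> feature vector; dicts preserve insertion order
    for e in events:
        key = event_id(e)
        if key in features:
            features[key] = merge(features[key], e)  # e-id seen before
        else:
            features[key] = init(e)                  # new distinct e-id
    return set(features), list(features.values())

# Hypothetical aggregation rules: count activities, track first/last time.
def event_id(e):
    _, uid, target, action = e
    return (uid, target, action)

def init(e):
    return {"e_id": event_id(e), "count": 1, "first": e[0], "last": e[0]}

def merge(f, e):
    return {**f, "count": f["count"] + 1, "last": e[0]}

E_i = [
    (1, "A", "X", "GET"),
    (2, "B", "X", "GET"),
    (3, "A", "X", "GET"),
]
F_i, S_i = renum_sort(E_i, event_id, init, merge)
```

In this example the two GET activities of user A on X collapse into one feature vector with a count of 2, while user B's single activity yields its own entry.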

In Step2.2, the aggregation used to construct the feature vector is specified by a domain expert and is designed according to the application and the information available in the activity log. For example, if the log being mined is the purchase log of a shop's customers, we may aggregate the price the user spent on the target item, the quantity of items, and any other available information. How the information is aggregated is decided by the domain expert who designs the mining process.

6.1.3 Preprocessing Phase

As defined above, a feature vector is aggregated from the selected activities with the same e-id during T_i, so the feature vector is also identified by that e-id. The feature vector f_j^i can be treated as a user behavior event, which represents the user's behavior during T_i. Therefore, the behavior of user u_q during T can be represented by a sequence of feature vectors in time order.

In Table 6.1, the Time field indicates the starting time of the aggregated feature vector f_j^i. The Duration field indicates the interval between the first and last activities with f_j^i during T_i. The UID, TARGET, and ACTION fields have the same definitions as in the activity log, and all other fields are aggregated by the ReNumberSort algorithm. These fields carry the important information describing the user's behavior, for example the count of activities, the cost the user spent, and the quantity the user took. Any other useful information that the aggregation algorithm can calculate is also included.
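To make the field definitions concrete, the following sketch builds one feature vector from activities that share an e-id within a single window. The class `FeatureVector`, the function `build_feature_vector`, and the `count` field are hypothetical choices for illustration; the actual aggregated fields depend on the application.

```python
from dataclasses import dataclass

@dataclass
class FeatureVector:
    time: float      # starting time: time of the first aggregated activity
    duration: float  # interval between first and last aggregated activity
    uid: str
    target: str
    action: str
    count: int       # hypothetical aggregated field: number of activities

def build_feature_vector(activities):
    """Aggregate activities that share one e-id within a single window T_i.

    `activities` is a non-empty list of (time, uid, target, action) tuples,
    all with the same uid, target, and action.
    """
    times = [a[0] for a in activities]
    _, uid, target, action = activities[0]
    return FeatureVector(
        time=min(times),
        duration=max(times) - min(times),
        uid=uid,
        target=target,
        action=action,
        count=len(activities),
    )
```

A feature vector built from a single activity has a Duration of zero under this definition.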

As shown in Figure 6.2, the preprocessing phase has two major stages: the first stage selects the activities from the activity log database during time window T_i, and the second stage calculates the feature vectors F_i during T_i by aggregating the activities with f_j^i, for 1 ≤ j ≤ β_i. Thus, we obtain the sequence of feature vectors S_i and each user's behavior during T_i. Therefore, each user's behavior during T can be represented as v_q = <S_q^1, S_q^2, ..., S_q^w>, for each q ∈ U.

[Figure: Activity Log Database → Stage 1: Select Activities in T_i → Stage 2: Establish Feature Vector Packet → Feature Vectors]

Figure 6.2: Data Flow of Preprocessing Phase.

Algorithm 6.2: Preprocessing Algorithm, Preprocess(E)

Input: E

Output: F, V

Step1. F = φ, V = φ, v_q = < > for each q ∈ U.

Step2. For i = 1 to w,

    Select E_i from E,

    (F_i, S_i) = ReNumSort(E_i), F = F ∪ F_i, v_q = v_q•S_q^i for each q ∈ U.

Step3. For q = 1 to n,

    V = V ∪ {v_q}.
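The two stages of the preprocessing phase can be sketched end to end as follows. This is a minimal sketch under several assumptions that are not in the original: the log is a list of (time, uid, target, action) tuples with t_0 = 0, the aggregation is a simple activity count, F is represented as a set of (window index, e-id) pairs, and V is a dictionary from user to behavior vector. The function name `preprocess` is hypothetical.

```python
import math
from collections import defaultdict

def preprocess(log, users, c, w):
    """Sketch of Preprocess(E) for a log of (time, uid, target, action) tuples.

    Returns (F, V): F is the set of (i, e-id) pairs seen in any window, and
    V maps each user q to the behavior vector <S_q^1, ..., S_q^w>.
    """
    # Stage 1: select the activities of each window T_i = ((i-1)*c, i*c].
    windows = defaultdict(list)
    for e in sorted(log):  # time is the first tuple field, so this sorts by time
        i = math.ceil(e[0] / c)
        if 1 <= i <= w:
            windows[i].append(e)

    F = set()
    V = {q: [] for q in users}
    for i in range(1, w + 1):
        # Stage 2: aggregate activities with the same e-id (here: a count).
        counts = {}
        for _, uid, target, action in windows[i]:
            key = (uid, target, action)
            counts[key] = counts.get(key, 0) + 1
        F |= {(i, key) for key in counts}
        # Split S_i into per-user subsequences S_q^i and append them to v_q.
        for q in users:
            V[q].append([(key, n) for key, n in counts.items() if key[0] == q])
    return F, V
```

A user with no activity in a window receives an empty subsequence for that window, so every behavior vector has exactly w entries.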