Thesis Organization - A Study on Incremental Quasi-Erasable Itemset Mining (漸進式準可篩除項目集探勘之研究)

Chapter 1 Introduction

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 reviews the background and motivation, including erasable-itemset mining [7], the pre-large concept [4][13], and the Fastupdate (FUP) algorithm [5]. Chapter 3 introduces the proposed incremental ε-quasi-erasable mining approach, together with a theorem, its proof, and examples. Chapter 4 presents experiments that validate the performance of the proposed algorithm. Finally, Chapter 5 gives the conclusion and discusses future work.

Chapter 2 Review of Related Work

2.1 Erasable Itemsets Mining

The concept of erasable-itemset mining derives from the Apriori algorithm [9], but with the threshold and its meaning reversed. Erasable-itemset mining was proposed by Deng et al. in 2009 [7] and is briefly reviewed in this section.

Table 2.1 shows an example of some products manufactured in a factory. Seven items, A, B, C, D, E, F, and G, are used to make the products. For example, P₁ is made from A and C, and the factory gains 100 dollars by producing it.

Table 2.1: A simple example of a product dataset

PID Items Profit value

kinds of products in a factory. For an itemset X, X ⊆ I, the gain of X is defined as:

$$G(X) = \sum_{\{P_k \,\mid\, X \cap P_k.\mathit{Items} \neq \emptyset\}} P_k.\mathit{Val} \qquad (1)$$

The gain of an itemset X is the sum of the profits of the products that include at least one item in X. For example, assume X = {A, B, C}. From Table 2.1, the four products P₁, P₂, P₄ and P₅ include item A, B or C. Therefore, G(X) = P₁.val + P₂.val + P₄.val + P₅.val, which is 850 (dollars). The total gain T is defined as:

$$T = \sum_{P_k \in DB} P_k.\mathit{Val} \qquad (2)$$

An itemset X is called an erasable itemset if G(X) ≤ T × r, where r is a given threshold.

For example, in Table 2.1 with r = 50%, we want to know whether the item B is erasable. Since B appears in P₂ and P₄, its gain is 500 + 150, which is 650. T is 1400 in Table 2.1, so the threshold gain is 1400 × 0.5 (50%), which is 700. Because 650 ≤ 700, B is an erasable itemset. This is a 1-itemset. For the 3-itemset {A, B, C}, since A, B or C appears in P₁, P₂, P₄ and P₅, its gain is 850. Because 850 > 700 (the threshold), {A, B, C} is a non-erasable itemset. The goal of erasable-itemset mining is to find all the erasable itemsets in a product database for a given erasable ratio parameter r [2].
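The gain and erasability definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `gain`, `total_gain` and `is_erasable` are our own, and `toy_products` is a hypothetical four-product dataset, not the contents of Table 2.1.

```python
from typing import List, Set, Tuple

Product = Tuple[str, Set[str], int]  # (PID, items, profit value)


def gain(products: List[Product], itemset: Set[str]) -> int:
    """Formula (1): sum the profits of products sharing at least one item with the itemset."""
    return sum(val for _, items, val in products if items & itemset)


def total_gain(products: List[Product]) -> int:
    """Formula (2): sum the profits of all products."""
    return sum(val for _, _, val in products)


def is_erasable(products: List[Product], itemset: Set[str], r: float) -> bool:
    """An itemset is erasable if its gain does not exceed T * r."""
    return gain(products, itemset) <= total_gain(products) * r


# Hypothetical toy data (NOT Table 2.1); total gain is 1000.
toy_products: List[Product] = [
    ("P1", {"A", "C"}, 100),
    ("P2", {"B"}, 200),
    ("P3", {"C", "D"}, 300),
    ("P4", {"B", "D"}, 400),
]

print(gain(toy_products, {"B"}))                   # 600
print(is_erasable(toy_products, {"B"}, 0.5))       # False, since 600 > 500
print(is_erasable(toy_products, {"A", "C"}, 0.5))  # True, since 400 <= 500
```

Note that the gain of an itemset counts each product once, even if the product contains several items of the itemset.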

2.2 Pre-Large Itemset Mining

In 2001, Hong et al. proposed pre-large itemset mining [12][15] for maintaining frequent itemsets in dynamic environments. The pre-large mining algorithm [12][15] was designed to reduce the time that the FUP algorithm [5] spends rescanning the original database. It uses two parameters, S_d and S_u: S_d is the lower support threshold and S_u is the upper support threshold, which together divide all the itemsets into three parts. The itemsets lying between S_d and S_u are called pre-large itemsets. Both the original transactions and the new transactions thus have three kinds of itemsets: large itemsets, pre-large itemsets, and small itemsets. Each itemset therefore falls into one of nine combinations. Each combination (case) is then processed in its own way to save maintenance time.
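The three-way partition described above can be sketched as follows. The function name `classify_support` and the treatment of the boundaries (a support equal to S_u counts as large, a support equal to S_d counts as pre-large) are assumptions made for illustration, since the text does not state how ties are handled.

```python
from itertools import product

LABELS = ("large", "pre-large", "small")


def classify_support(support: float, s_lower: float, s_upper: float) -> str:
    """Label a support value using the lower (S_d) and upper (S_u) thresholds."""
    if s_lower > s_upper:
        raise ValueError("the lower threshold must not exceed the upper threshold")
    if support >= s_upper:   # boundary handling is an assumption
        return "large"
    if support >= s_lower:   # boundary handling is an assumption
        return "pre-large"
    return "small"


# Each itemset has one label in the original transactions and one in the
# new transactions, which yields the nine combinations mentioned above.
COMBINATIONS = list(product(LABELS, repeat=2))

print(classify_support(0.40, s_lower=0.30, s_upper=0.50))  # pre-large
print(len(COMBINATIONS))                                   # 9
```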

2.3 Survey of Erasable Itemset Mining Algorithms

Several approaches related to erasable-itemset mining have been proposed. Agrawal et al. proposed the Apriori algorithm in 1994. Cheung et al. proposed the Fastupdate algorithm in 1996. Deng et al. proposed the META algorithm for mining erasable itemsets in 2009. Deng and Xu proposed the vertical-format-based VME algorithm for erasable-itemset mining in 2010. Deng and Xu also proposed the MERIT algorithm, which uses a tree data structure to mine erasable itemsets quickly. Le et al. introduced a revised algorithm called dMERIT+, derived from MERIT, which mines erasable itemsets with additional pruning techniques. Le and Vo introduced an effective algorithm called MEI (mining erasable itemsets), which uses a divide-and-conquer strategy together with the difference of PID sets (the dPidset strategy) to improve mining efficiency. Hong et al. proposed the Fastupdate-erasable mining approach in 2016. Table 2.2 summarizes some existing algorithms for mining erasable itemsets.

Table 2.2: The history of existing algorithms for mining erasable itemsets

Algorithm                    Year   Approach
Apriori algorithm [2]        1994   Association rule mining algorithm
Fastupdate algorithm [5]     1996   Incremental updating approach
Pre-large mining [12]        2001   Incremental frequent-itemset mining
Erasable mining [7]          2009   Opposite of Apriori
VME [9]                      2010   PID_List-Structure-based
MERIT [8]                    2012   NC_Set-Structure-based
dMERIT+ [6]                  2013   dNC_Set-Structure-based
MEI [18]                     2014   dPidset-Structure-based
Fastupdate-erasable [16]     2016   Incremental erasable mining

Chapter 3 The Incremental Algorithm for Mining ε-quasi-Erasable Itemsets

In this chapter, the Fastupdate-erasable approach [17] is improved by the proposed algorithm. The main idea of the algorithm is described in Section 3.1, the notation is given in Section 3.2, and the new algorithm is proposed in Section 3.3. Finally, an example illustrating the new algorithm is given in Section 3.4.

3.1 Main idea

If an itemset is non-erasable in an old database, but is erasable in the new products, then the old database needs to be re-scanned to determine whether it is erasable or not. However, if the gain of an itemset in an old database is much larger than the maximum erasable gain threshold, then it is hard for the itemset to become erasable even when some new products come. Figure 3.1 shows the concept behind the solution.

Figure 3.1: The concept of ε-quasi-erasable-itemsets

Table 3.1: Nine cases arising from adding new products to existing products

                      Original products
New products          Erasable   ε-quasi-erasable   Non-erasable
Erasable              Case 1     Case 4             Case 7
ε-quasi-erasable      Case 2     Case 5             Case 8
Non-erasable          Case 3     Case 6             Case 9

Table 3.1 shows the nine cases arising from adding new products to existing products.

Case 1 arises when the itemset is erasable in both the original products and the new products. Case 2 arises when the itemset is erasable in the original products and ε-quasi-erasable in the new products. Case 3 arises when the itemset is erasable in the original products and non-erasable in the new products. Case 4 arises when the itemset is ε-quasi-erasable in the original products and erasable in the new products. Case 5 arises when the itemset is ε-quasi-erasable in both the original products and the new products. Case 6 arises when the itemset is ε-quasi-erasable in the original products and non-erasable in the new products. Case 7 arises when the itemset is non-erasable in the original products and erasable in the new products. Case 8 arises when the itemset is non-erasable in the original products and ε-quasi-erasable in the new products. Case 9 arises when the itemset is non-erasable in both the original products and the new products.
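The numbering of the nine cases follows a regular pattern, which the following minimal sketch makes explicit. The function name `case_number` and the short status labels are our own illustrative choices.

```python
STATUSES = ("erasable", "quasi", "non")  # erasable, ε-quasi-erasable, non-erasable


def case_number(original: str, new: str) -> int:
    """Return the case (1-9) of Table 3.1 for an itemset's status in the
    original products and in the new products."""
    return 3 * STATUSES.index(original) + STATUSES.index(new) + 1


print(case_number("erasable", "quasi"))  # 2
print(case_number("quasi", "erasable"))  # 4
print(case_number("non", "erasable"))    # 7
```

The status in the original products selects a block of three cases, and the status in the new products selects the case within that block.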

In the proposed algorithm, T^P and T^N represent the total gain values of the original products and of the new products, respectively. Two parameters are given: r, the maximum erasable ratio, and ε, the quasi-erasable parameter. If T^N ≤ T^P × (ε / r), then the itemsets that are neither erasable nor ε-quasi-erasable in the original products cannot become erasable after the new products are inserted.

As mentioned above, if the total gain value of the new products is small compared with that of the original product database, then an itemset that is neither erasable nor ε-quasi-erasable in the original database but is erasable in the newly inserted products cannot be erasable for the entire updated product database. This property is proven in the following theorem.

Theorem: Let r be the maximum erasable ratio threshold and ε be the parameter value for ε-quasi-erasable itemsets. Also, let T^P and T^N be the total gain values of the items in the original products and in the new products, respectively. If T^N ≤ T^P × (ε / r), then an itemset that is neither erasable nor ε-quasi-erasable in the original database but is erasable in the newly inserted products is certainly not erasable for the entire updated product database.

Proof: If T^N ≤ T^P × (ε / r) holds, multiplying both sides by r gives:

$$r \times T^N \le T^P \times \varepsilon.$$

If an itemset I is neither erasable nor ε-quasi-erasable in the original product database, its gain value gain^P(I) in the original product database satisfies the following inequality:

$$T^P \times (r + \varepsilon) < \mathit{gain}^P(I).$$

If I is erasable in the newly inserted products, its gain value gain^N(I) in the new products satisfies the following inequality:

$$0 \le \mathit{gain}^N(I) \le T^N \times r.$$

If the above three conditions hold as stated in the theorem, then the gain ratio of I in the updated product database U is gain^U(I) / (T^P + T^N), which can be expanded as follows:

$$\frac{\mathit{gain}^U(I)}{T^P + T^N} = \frac{\mathit{gain}^P(I) + \mathit{gain}^N(I)}{T^P + T^N} \ge \frac{\mathit{gain}^P(I)}{T^P + T^N} > \frac{T^P \times (r + \varepsilon)}{T^P + T^N} = \frac{T^P \times r + T^P \times \varepsilon}{T^P + T^N} \ge \frac{T^P \times r + T^N \times r}{T^P + T^N} = r.$$

The gain ratio of I in U is therefore larger than r, so I is not erasable. This completes the proof.

Therefore, when an itemset is neither erasable nor ε-quasi-erasable in the original products but is erasable in the newly inserted products, it is certainly not erasable for the entire updated product database when T^N ≤ T^P × (ε / r).

For example, assume T^P = 1950, r = 35% and ε = 5%. The allowed total gain of new products within which the original product database does not need to be rescanned is 1950 × (0.05 / 0.35), which is 278.57. That is, if the total gain of the newly inserted products is equal to or smaller than 278.57, an originally non-erasable itemset will certainly remain non-erasable, without the need to rescan the old product database.
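The bound in this example can be checked with a short computation. The function name `safety_gain` is an illustrative choice of ours; the formula itself is the bound T^P × (ε / r) from the theorem.

```python
def safety_gain(t_p: float, r: float, eps: float) -> float:
    """Largest total gain of new products for which the theorem guarantees
    that no rescan of the original product database is needed."""
    if r <= 0:
        raise ValueError("r must be positive")
    return t_p * eps / r


f = safety_gain(t_p=1950, r=0.35, eps=0.05)
print(round(f, 2))  # 278.57
```

A larger ε widens the safety margin but also enlarges the set of ε-quasi-erasable itemsets that must be kept.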

3.2 The Notations

The notation used in the proposed algorithm and throughout this thesis is as follows:

P: the original products,

N: the set of newly inserted products,

I: the set of all items,

U: the updated products,

s: an itemset formed from I,

gain: the gain value of an itemset,

gain^P(s): the gain value of an itemset s in the original products P,

gain^N(s): the gain value of an itemset s in the new products N,

gain^U(s): the gain value of an itemset s in the updated products U,

T^P: the total gain value of all products in the original products P,

T^N: the total gain value of all products in the set of newly inserted products N,

r: the maximum ratio to be an erasable itemset,

ε: the quasi-erasable parameter,

E^P: the set of erasable itemsets in the original products P,

E^P_k: the set of erasable k-itemsets in the original products P,

E^N_k: the set of erasable k-itemsets in the set of newly inserted products N,

E^U_k: the set of erasable k-itemsets in the updated products U,

Q^P: the set of ε-quasi-erasable itemsets in the original products P,

Q^P_k: the set of ε-quasi-erasable k-itemsets in the original products P,

Q^N_k: the set of ε-quasi-erasable k-itemsets in the set of newly inserted products N,

Q^U_k: the set of ε-quasi-erasable k-itemsets in the updated products U,

C^P_k: the set of candidate k-itemsets from the original products P,

C^N_k: the set of candidate k-itemsets in the set of newly inserted products N,

C^U_k: the set of candidate erasable k-itemsets from the updated products U.

3.3 The Proposed Incremental ε-quasi-erasable Itemset Mining Algorithm

Based on the above idea, an incremental ε-quasi-erasable itemset mining algorithm can be designed. When a set of new products arrives, the proposed algorithm re-calculates the gain of an erasable itemset or a non-erasable itemset according to the different cases and decides whether the itemset is erasable. There are nine cases in total, obtained by combining the three parts of the old product database with the three parts of the new products. A variable, t, records the amount of gain from new products since the last re-scan of the original product database. The proposed algorithm is stated as follows:

The proposed algorithm:

Input:

An original product database P with its total profit value T^P, the sets of erasable itemsets E^P and ε-quasi-erasable itemsets Q^P with their gain values from P plus previous new products since the last re-scan, the amount of gain t from previous new products since the last re-scan, an erasable ratio threshold r, a quasi-erasable parameter ε, a set of all items I, and a set of newly coming products N.

Output:

The set of erasable itemsets for the updated products U.

Step 1:

Calculate the safety gain f of new products as follows:

$$f = T^P \times \frac{\varepsilon}{r} \qquad (3)$$

Step 2:

Calculate the total profit value T^N of the new products N as follows:

$$T^N = \sum_{P_i \in N} P_i.\mathit{value} \qquad (4)$$

where P_i is a product in N. Then list the 1-itemsets appearing in the new products N

Step 5:

Put a candidate erasable k-itemset s in C^N_k into the set (E^N_k) of erasable k-itemsets for the new products N if its gain value in the new products is smaller than or equal to the maximum gain threshold T^N × r. Otherwise, put s into the set (Q^N_k) of ε-quasi-erasable k-itemsets for the new products N if its gain value in the new products is larger than the maximum gain threshold T^N × r but smaller than or equal to T^N × (r + ε).

Step 6:

For each k-itemset s in the set of erasable k-itemsets (E^P_k) from the original products P plus previous new products since last re-scan, if it is also in E^N_k, do the following substeps (case 1):

Substep 6-1:

Set the updated gain of s as:

gain^U(s) = gain^P(s) + gain^N(s),

where gain^U(s), gain^P(s), gain^N(s) are the gains of s in the updated product dataset, the original product dataset plus previous new products since last re-scan, and the set of current new products, respectively.

Substep 6-2:

Directly put s into updated erasable k-itemsets, E^U_k.

Step 7:

For each k-itemset s in the set of erasable k-itemsets (E^P_k) from the original products P plus previous new products since last re-scan, if it is also in the candidate erasable k-itemsets C^N_k but not in E^N_k (i.e., in C^N_k − E^N_k), do the following substeps (cases 2 and 3):

Substep 7-1:

Set the updated gain of s as:

gain^U(s) = gain^P(s) + gain^N(s).

Substep 7-2:

Check whether the updated gain of s is smaller than or equal to the maximum erasable gain threshold (T^P + T^N) × r. If s satisfies the condition, put it in the set of updated erasable k-itemsets, E^U_k; otherwise, check whether the updated gain of s is smaller than or equal to the maximum ε-quasi-erasable gain threshold (T^P + T^N) × (r + ε). If s satisfies the condition, put it into the set of updated ε-quasi-erasable k-itemsets, Q^U_k.

Step 8:

For the other k-itemsets in the set of erasable k-itemsets (E^Pk) from the original products P plus previous new products since last re-scan, directly keep them in the set of updated erasable k-itemsets, E^U_k, with their gain values unchanged.

Step 9:

For each k-itemset s in the set of ε-quasi-erasable k-itemsets (Q^P_k) from the original product database P plus previous new products since last re-scan, if it is also in E^N_k, do the following substeps (case 4):

Substep 9-1:

Set the updated gain of s as:

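gain^U(s) = gain^P(s) + gain^N(s),

where gain^U(s), gain^P(s), gain^N(s) are the gains of s in the updated product dataset, the original product dataset plus previous new products since last re-scan, and the set of current new products, respectively.

Substep 9-2:

Check whether the updated gain of s is smaller than or equal to the maximum erasable gain threshold (T^P + T^N) × r. If s satisfies the condition, put it in the set of updated erasable k-itemsets, E^U_k; otherwise, directly put it into the set of updated ε-quasi-erasable k-itemsets, Q^U_k.

Step 10:

For each k-itemset s in the set of ε-quasi-erasable k-itemsets (Q^P_k) from the original product database P plus previous new products since last re-scan, if it is also in Q^N_k, do the following substeps (case 5):

Substep 10-1:

Set the updated gain of s as:

gain^U(s) = gain^P(s) + gain^N(s),

where gain^U(s), gain^P(s), gain^N(s) are the gains of s in the updated product dataset, the original product dataset plus previous new products since last re-scan, and the set of current new products, respectively.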

Substep 10-2:

Directly put s into the set of updated ε-quasi-erasable k-itemsets, Q^U_k.

Step 11:

For each k-itemset s in the set of ε-quasi-erasable k-itemsets (Q^Pk) from the

Substep 11-2:

Check whether the updated gain of s is smaller than or equal to the maximum ε-quasi-erasable gain threshold (T^P + T^N) × (r + ε). If s satisfies the condition, put it into the set of updated ε-quasi-erasable k-itemsets, Q^U_k.

Step 12:

For each s of the other k-itemsets in the set of ε-quasi-erasable k-itemsets (Q^P_k) from the original product database P plus previous new products since last re-scan, check whether the original gain, gain^P(s), of s is smaller than or equal to the maximum erasable gain threshold (T^P + T^N) × r. If s satisfies the condition, put it in the set of updated erasable k-itemsets, E^U_k, with its gain value unchanged; otherwise, directly put it in the set of updated ε-quasi-erasable k-itemsets, Q^U_k, with its gain value unchanged.

Step 13:

For each k-itemset s which exists in (E^N_k + Q^N_k − E^P_k − Q^P_k) (erasable or ε-quasi-erasable for the new products, but neither erasable nor ε-quasi-erasable for the original product database plus previous new products since last re-scan) or in C^P_k − E^P_k − Q^P_k, put s into the rescan set R (cases 7 and 8).

Step 14:

If T^N + t ≤ f, then do nothing; otherwise, do the following substeps for each k-itemset s in the rescan set R (cases 7 and 8):

Substep 14-1:

Rescan the original product database to determine gain^P(s).

Substep 14-2:

If s is in (E^N_k + Q^N_k − E^P_k − Q^P_k), set the updated gain of s as:

gain^U(s) = gain^P(s) + gain^N(s);

Otherwise, directly set the updated gain of s as:

gain^U(s) = gain^P(s).

Substep 14-3:

If the updated gain, gain^U(s), of s is smaller than or equal to the maximum gain threshold (T^P + T^N) × r, put s into the set of updated erasable k-itemsets, E^U_k; else if gain^U(s) is smaller than or equal to the maximum ε-quasi-erasable gain threshold (T^P + T^N) × (r + ε), put s into the set of updated ε-quasi-erasable k-itemsets, Q^U_k; otherwise, remove s from the set of erasable k-itemsets (E^N_k) and from the set of ε-quasi-erasable k-itemsets (Q^N_k) from the new products.

Step 15:

Form the candidate erasable (k + 1)-itemsets C^U_{k+1} from the k-itemsets in E^U_k and in Q^U_k in a way similar to that in the batch algorithm. If a generated candidate erasable (k + 1)-itemset s includes at least one item in the new products, put s in C^N_{k+1} and calculate its gain value from the new products N; otherwise, put s in C^P_{k+1}.

Step 16:

Set k = k + 1.

Step 17:

Repeat Steps 4-16 until no new candidate erasable itemsets are generated.

Step 18:

If T^N + t > f, set T^P = T^P + T^N and set t = 0; otherwise, set t = t + T^N.

Step 19:

Output all the updated erasable itemsets.
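The bookkeeping in the steps above can be illustrated with a simplified Python sketch. It is not the proposed algorithm itself: instead of handling each case separately as in Steps 6 to 12, the hypothetical helper `update_tracked` adds the new gain to every tracked itemset and re-labels it against the updated thresholds. The helpers `needs_rescan` and `advance` mirror the conditions of Steps 14 and 18. All function names and the dictionary-based data layout are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Tuple

Itemset = FrozenSet[str]


def classify(gain: float, total: float, r: float, eps: float) -> str:
    """Label a gain against the thresholds total * r and total * (r + eps)."""
    if gain <= total * r:
        return "erasable"
    if gain <= total * (r + eps):
        return "quasi"
    return "non"


def update_tracked(
    tracked: Dict[Itemset, float],    # gains of itemsets kept in E^P or Q^P
    new_gains: Dict[Itemset, float],  # gains of candidates in the new products
    t_p: float,
    t_n: float,
    r: float,
    eps: float,
) -> Dict[Itemset, Tuple[float, str]]:
    """Simplified Steps 6-12: add the new gain (0 if the itemset does not occur
    in the new products) and re-label the itemset against T^P + T^N."""
    updated = {}
    for s, gain_p in tracked.items():
        gain_u = gain_p + new_gains.get(s, 0.0)
        updated[s] = (gain_u, classify(gain_u, t_p + t_n, r, eps))
    return updated


def needs_rescan(t_n: float, t: float, f: float) -> bool:
    """Step 14: the rescan set R is processed only if T^N + t > f."""
    return t_n + t > f


def advance(t_p: float, t_n: float, t: float, f: float) -> Tuple[float, float]:
    """Step 18: return the new (T^P, t) after one batch of new products."""
    if t_n + t > f:
        return t_p + t_n, 0.0
    return t_p, t + t_n
```

Re-labelling every tracked itemset uniformly is easier to read than the case analysis, but it gives up the shortcuts of the individual steps, such as placing a case 1 itemset directly into E^U_k without a threshold check.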

3.4 An Example

This section gives an example to illustrate the proposed algorithm. Two new products are inserted into the original products, and the threshold r is set at 0.45. The details are as follows.

Table 3.2: The erasable itemsets found from Table 2.1

Erasable Item Gain value

Table 3.3: Two newly inserted products in the example

PID    Items   Profit value
P₉     BE      100
P₁₀    AC      150

Assume the ε-quasi-erasable parameter ε is set at 0.05. The maximum gain threshold value for ε-quasi-erasable itemsets is then calculated as the total gain multiplied by (r + ε), which is 1400 × 0.5 (= 700). The set of 0.05-quasi-erasable itemsets derived from the original products is given in Table 3.4.

Table 3.4: The 0.05-quasi-erasable itemsets found from Table 2.1

0.05-quasi-erasable itemsets
1 item   Gain   2 items   Gain
B        650    AC        700
C        700

Initially, the amount of gain t from the previous new products since the last re-scan of the original product database is zero. The ε-quasi-erasable mining algorithm then proceeds as follows.

Step 1:

The gain f of new products for not re-scanning the original product database is calculated, which is 1400 × (0.05 / 0.45) (= 155.56).

Step 2:

The total profit value (T^N) of the two new products given in Table 3.3 is first calculated as 100 + 150, which is 250. The 1-itemsets in the two new products are then listed as the candidate erasable 1-itemsets. They are A, B, C and E, which appear in P₉ and P₁₀. Their gain values in the new products are then calculated as (A: 150, B: 100, C: 150, E: 100). The results are shown in Table 3.5.

Table 3.5: The gain values of the four 1-itemsets in new products

1-itemset   Gain value
A           150
B           100
C           150
E           100
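The values of Step 2 can be reproduced with a short Python sketch. The data are the two new products of Table 3.3; the variable names `new_products`, `t_n` and `gains_1` are illustrative choices of ours.

```python
# Table 3.3 of the running example: (PID, items, profit value)
new_products = [
    ("P9", {"B", "E"}, 100),
    ("P10", {"A", "C"}, 150),
]

# Total profit value T^N of the new products, as in formula (4)
t_n = sum(value for _, _, value in new_products)

# Gain of each 1-itemset: the sum of the profits of the products containing the item
gains_1 = {}
for _, items, value in new_products:
    for item in items:
        gains_1[item] = gains_1.get(item, 0) + value

print(t_n)                      # 250
print(sorted(gains_1.items()))  # [('A', 150), ('B', 100), ('C', 150), ('E', 100)]
```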

Step 5:

The candidate erasable 1-itemsets in C^N_1 are checked for whether their gain values in the new products are smaller than or equal to the maximum gain threshold T^N × r, which is 250 × 0.45 (= 112.5), or larger than T^N × r but smaller than or equal to T^N × (r + ε), which is 250 × (0.45 + 0.05) (= 125). In this example, no candidate erasable 1-itemset satisfies the second condition, and thus the erasable 1-itemsets in E^N_1 are {B} and {E}, which satisfy the first condition.
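The check in this step can be sketched as follows, using the gain values of the four 1-itemsets in the new products given in Step 2. The variable names, including `e_n1` and `q_n1` for the erasable and ε-quasi-erasable 1-itemsets of the new products, are illustrative.

```python
r, eps = 0.45, 0.05
t_n = 250
gains_1 = {"A": 150, "B": 100, "C": 150, "E": 100}

erasable_max = t_n * r       # maximum gain threshold, about 112.5
quasi_max = t_n * (r + eps)  # maximum ε-quasi-erasable gain threshold, about 125

# Erasable: gain <= T^N * r; ε-quasi-erasable: T^N * r < gain <= T^N * (r + ε)
e_n1 = sorted(i for i, g in gains_1.items() if g <= erasable_max)
q_n1 = sorted(i for i, g in gains_1.items() if erasable_max < g <= quasi_max)

print(e_n1)  # ['B', 'E']
print(q_n1)  # []
```

Items A and C, with a gain of 150, exceed both thresholds and are therefore non-erasable in the new products.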

Step 6:

For each 1-itemset in the set of erasable itemsets E^P_1 from the original product database plus previous new products since last re-scan, if it is also in E^N_1, it is processed by the substeps in this step. In this example, since the corresponding set from Step 5 is null, no erasable 1-itemset in E^P_1 satisfies the condition.

Step 7:

Each candidate 1-itemset that exists both in the set of erasable itemsets from the original product database plus previous new products since last re-scan and in C^N_1 − E^N_1 is processed. In this example, only the 1-itemset {A} satisfies the condition. The gain
