SDP MINING STEP - WISDOM: WISELY IMAGINABLE SIGNIFICANT DIFFERENCE

CHAPTER 4. WISDOM: WISELY IMAGINABLE SIGNIFICANT DIFFERENCE

4.2. SDP MINING STEP

The Data Reduction step filters out the non-sensitive records and categorizes each continuous measure Mi into a new discrete measure Mi’. For the original continuous measure Mi, researchers find significant differences by statistical testing; how, then, can we find SDPs from the new discrete measure Mi’? To answer this, a definition, Score and Range, and a Significant Difference Determination heuristic are proposed as follows. The Score and Range definition is used to calculate the difference among the attribute values of an attribute, and the Significant Difference Determination heuristic is used to determine whether that difference is significant.

DEFINITION 4: Score and Range

Given an attribute A_ij = {V_ijk | k = 1, …, g(i, j)} and a discrete measure M_i’, Score(A_ij = V_ijk) represents how the records with A_ij = V_ijk relate to the total mean of the measure.

[The defining equation of Score(A_ij = V_ijk) is illegible in the source.]

Here, the value good of M_i’ means that the record’s measure M_i is greater than $\bar{X}_{M_i} + \beta S_{M_i}$, and the value bad means that the record’s measure M_i is less than $\bar{X}_{M_i} - \beta S_{M_i}$.

The value of Score(A_ij = V_ijk) lies between 1, meaning that all the values of measure M_i’ are good, and -1, meaning that all the values of measure M_i’ are bad.

Range(A_ij) is the maximum difference among {Score(A_ij = V_ijk) | k = 1, 2, …, g(i, j)}, and it represents the difference within the attribute A_ij. Range(A_ij) is defined as:

$$\mathrm{Range}(A_{ij}) = \max_{1 \le k \le g(i,j)} \{\mathrm{Score}(A_{ij} = V_{ijk})\} - \min_{1 \le k \le g(i,j)} \{\mathrm{Score}(A_{ij} = V_{ijk})\}$$

■
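The good/bad categorization that the definition relies on can be sketched in Python as follows. This is a minimal illustration rather than the thesis's implementation: the function name `discretize`, the middle label "normal", the use of the sample standard deviation for S, a lower threshold that mirrors the upper one, and the sample grades are all assumptions made here.

```python
from statistics import mean, stdev

def discretize(values, beta):
    """Label each continuous value as 'good', 'bad', or 'normal' (assumed label)."""
    m = mean(values)
    s = stdev(values)          # sample standard deviation (assumption)
    upper = m + beta * s       # above this threshold: good
    lower = m - beta * s       # below this threshold: bad (assumed symmetric)
    labels = []
    for v in values:
        if v > upper:
            labels.append("good")
        elif v < lower:
            labels.append("bad")
        else:
            labels.append("normal")
    return labels

grades = [40, 60, 60, 60, 80]  # hypothetical math grades
print(discretize(grades, beta=1.0))
```

A larger β widens the middle band, so fewer records are labeled good or bad.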

EXAMPLE 5:

Given the attribute region and the measure math_grade’ shown in Table 4.4, the Scores and Range are calculated as:

Score(region = north) = 0

Score(region = central) = 0

Score(region = south) = 1

Score(region = east) = -1

Range(region) = Score(region = south) - Score(region = east) = 1 - (-1) = 2

The Score and Range are illustrated in Figure 4.3.

Figure 4.3: The Score and Range of attribute region

■
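A small Python sketch of the Score and Range computation may make the example easier to follow. It assumes that Score is (number of good records - number of bad records) / (number of records) for an attribute value; this form respects the stated bounds of 1 and -1 but may differ from the thesis's exact formula. The records below are hypothetical, are not the contents of Table 4.4, and were chosen only so that the scores match the example; the function names and the "normal" label are also assumptions.

```python
from collections import defaultdict

def scores(records):
    """records: iterable of (attribute_value, label) pairs.

    Assumed form: (n_good - n_bad) / n_records per attribute value.
    """
    good = defaultdict(int)
    bad = defaultdict(int)
    total = defaultdict(int)
    for value, label in records:
        total[value] += 1
        if label == "good":
            good[value] += 1
        elif label == "bad":
            bad[value] += 1
    return {v: (good[v] - bad[v]) / total[v] for v in total}

def score_range(score_by_value):
    """Maximum score minus minimum score over the attribute values."""
    return max(score_by_value.values()) - min(score_by_value.values())

# Hypothetical (region, math_grade') records.
records = [
    ("north", "good"), ("north", "bad"),
    ("central", "normal"),
    ("south", "good"),
    ("east", "bad"),
]
region_scores = scores(records)
print(region_scores)
print(score_range(region_scores))
```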

Range(A_ij) represents the difference between the means of the different attribute values of an attribute A_ij. The larger Range(A_ij) is, the more significant the difference among those attribute values. Hence, the Significant Difference Determination heuristic is proposed to determine whether an SDP exists in an attribute A_ij.

HEURISTIC 3: Significant Difference Determination

Given an attribute A_ij = {V_ijk | k = 1, 2, …, g(i, j)} and a measure M_k’, if Range(A_ij) is greater than or equal to γ, there exists an SDP, (A_ij) : M_k, where γ is the significance determination threshold.

■

EXAMPLE 6:

Given the attribute region and the measure math_grade’ shown in Table 4.4, Range(region) = 2 was calculated in EXAMPLE 5. Given γ = 0.4,

Range(region) = 2 ≥ γ = 0.4

Hence, there exists the SDP

(region) : math_grade (4.3)

■
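The determination in EXAMPLE 6 amounts to a single comparison, sketched here in Python. The function name `has_sdp` is hypothetical; the scores are the ones listed in EXAMPLE 5.

```python
def has_sdp(score_by_value, gamma):
    """Return True when Range = max(score) - min(score) is at least gamma."""
    value_range = max(score_by_value.values()) - min(score_by_value.values())
    return value_range >= gamma

# Scores of attribute region from EXAMPLE 5.
region_scores = {"north": 0, "central": 0, "south": 1, "east": -1}
print(has_sdp(region_scores, gamma=0.4))
```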

In general, researchers are interested in the more general human behaviors and social phenomena. If a more general phenomenon shows a significant difference, they are usually not interested in the more specific one. Hence, the Most General SDP First heuristic is proposed.

HEURISTIC 4: The Most General SDP First

The Most General SDP First heuristic states that a general SDP is more interesting than a specific one. Here, “general” means that an SDP at a higher level and with fewer dimensions is preferred. Hence,

• A higher-level SDP is more interesting than a lower-level SDP.

• An SDP with fewer dimensions is more interesting than an SDP with more dimensions.

■

The following two examples explain The Most General SDP First heuristic more clearly.

EXAMPLE 7:

Given two SDPs:

(region) : math_grade (4.4)

(city) : math_grade (4.5)

If there is a significant difference among resident regions on the measure math_grade, researchers are usually not interested in whether there is also a significant difference among resident cities on math_grade.

■

EXAMPLE 8:

Given two SDPs:

(gender) : math_grade (4.6)

(gender | city = Taipei) : math_grade (4.7)

If there is a significant difference between genders on the measure math_grade, researchers are usually not interested in whether there is also a significant difference between genders on math_grade for only the records of people living in Taipei.

■
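The preference expressed in EXAMPLE 7 and EXAMPLE 8 can be sketched in Python as a sort key over candidate SDPs. The `SDP` class, its fields, the encoding of the level as an integer (1 = highest level), and the choice to compare the number of dimensions before the level are all assumptions made for illustration; the heuristic itself does not say which of the two criteria takes priority.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SDP:
    attribute: str
    level: int                 # 1 = highest level in its dimension (assumed encoding)
    conditions: tuple = ()     # e.g. (("city", "Taipei"),)
    measure: str = "math_grade"

    def dimensions(self):
        """Number of dimensions involved: the attribute plus its conditions."""
        return 1 + len(self.conditions)

def generality_key(pattern):
    # Fewer dimensions first, then higher level first (assumed priority).
    return (pattern.dimensions(), pattern.level)

candidates = [
    SDP("gender", 1, (("city", "Taipei"),)),  # pattern (4.7)
    SDP("city", 2),                           # pattern (4.5)
    SDP("region", 1),                         # pattern (4.4)
    SDP("gender", 1),                         # pattern (4.6)
]
ordered = sorted(candidates, key=generality_key)
for p in ordered:
    print(p.attribute, p.conditions)
```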

The Most General SDP First heuristic is based on our experiments and on discussions with senior researchers. It reflects how researchers commonly look for significant differences, so it will not hold in every situation. For example, in EXAMPLE 8 researchers might also be interested in whether there is a significant difference between genders on math_grade for only the records of people living in Taipei. However, the heuristic effectively reduces the complexity of the SDPD problem.

Based on the Significant Difference Determination and Most General SDP First heuristics, the SDPMining algorithm is a greedy algorithm that searches the Data Warehouse for SDPs in the manner of a breadth-first search (BFS) tree. The pseudocode of the SDPMining algorithm is listed in Table 4.6.

Table 4.6: The SDPMining algorithm

SDPMining(DW’, Mi, PD, Pi, α, γ, Current-Depth, Depth, SDPs’)

Input:

    DW’: A data warehouse with measure Mi’;
    Mi: A measure;
    PD: The potential dimensions that may cause a significant difference;
    Pi: The parents of PD;
    α: A confidence level;
    γ: A significance determination threshold;
    Current-Depth: The current complexity of the output pattern;
    Depth: A search depth threshold;
    SDPs’: The found SDPs;

Begin

    If (Current-Depth > Depth) Return;
    Set PD’ ← PD;
    For each dimension PDi of PD, Do
        Current Level of PDi = Highest-Level;
        While (RANGE(Current Level of PDi) < γ && Current Level ≠ Lowest Level) Do
            [...]
    [...] SDS, α, γ, Current-Depth+1, Depth);
    Return;

End

At the beginning, the algorithm computes Range(A_i1) for the first attribute A_i1 of each dimension D_i. If Range(A_i1) is greater than or equal to the threshold γ, which means A_i1 is significant, the remaining attributes of dimension D_i are not searched, owing to the Most General SDP First heuristic. All the non-significant dimensions are expanded to the next level, and the search goes on.
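The search described above can be sketched in Python with an explicit queue. This is an illustration under several assumptions, not the SDPMining algorithm of Table 4.6: the expansion rule (condition on each attribute found non-significant, then test the dimensions in which no level was significant) is inferred from the description and the examples; the Score form (good - bad) / n is assumed; the parameters α and Pi are omitted; and the function names, record layout, and data are hypothetical.

```python
from collections import deque

def attr_range(records, attr):
    """Range of assumed scores (good - bad) / n over the values of `attr`."""
    counts = {}
    for r in records:
        good, bad, n = counts.get(r[attr], (0, 0, 0))
        counts[r[attr]] = (good + (r["label"] == "good"),
                           bad + (r["label"] == "bad"), n + 1)
    scores = [(good - bad) / n for good, bad, n in counts.values()]
    return max(scores) - min(scores)

def sdp_mining(records, dimensions, gamma, max_depth):
    """Breadth-first search for SDPs; returns (attribute, conditions) pairs."""
    found = []
    queue = deque([((), dimensions, 1)])  # (conditions, open dimensions, depth)
    while queue:
        conditions, dims, depth = queue.popleft()
        if depth > max_depth:
            continue
        subset = [r for r in records if all(r[a] == v for a, v in conditions)]
        if not subset:
            continue
        open_dims, non_significant = {}, []
        for dim, levels in dims.items():
            for attr in levels:  # highest level first
                if attr_range(subset, attr) >= gamma:
                    found.append((attr, conditions))
                    break  # most general SDP first: skip the lower levels
                non_significant.append((dim, attr))
            else:
                open_dims[dim] = levels  # no level of this dimension was significant
        for dim, attr in non_significant:
            rest = {d: l for d, l in open_dims.items() if d != dim}
            if not rest:
                continue
            for value in sorted({r[attr] for r in subset}):
                queue.append((conditions + ((attr, value),), rest, depth + 1))
    return found

# Hypothetical records; "label" holds the discrete measure math_grade'.
records = [
    {"region": "north", "city": "A", "gender": "male",
     "father_education": "senior_high", "label": "good"},
    {"region": "south", "city": "B", "gender": "male",
     "father_education": "university", "label": "bad"},
    {"region": "south", "city": "B", "gender": "female",
     "father_education": "senior_high", "label": "bad"},
    {"region": "north", "city": "A", "gender": "female",
     "father_education": "university", "label": "good"},
]
dimensions = {"residence": ["region", "city"], "gender": ["gender"],
              "father_education": ["father_education"]}
result = sdp_mining(records, dimensions, gamma=0.4, max_depth=2)
for attr, cond in result:
    print(attr, cond)
```

With this data, region is significant at the first level, so city is never tested, while gender and father_education become significant only once each is conditioned on the other.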

The following two examples explain SDPMining algorithm more clearly.

EXAMPLE 9:

At the beginning, the SDPMining algorithm computes Range(A_i1) for the first attribute of each dimension: gender, region, and father_education. The attribute region is significant, while the attributes gender and father_education are not. Because region is significant, the attribute city is not searched.

The result is shown in Figure 4.4.

Figure 4.4: The result after searching the first level in

After expanding the non-significant attribute, the result is shown in Figure 4.5.

Figure 4.5: The result after searching the second level in

The following SDPs can be found from Figure 4.5:

(region) : math_grade

(father_education | gender = male) : math_grade

(father_education | gender = female) : math_grade

(gender | father_education = senior_high) : math_grade

(gender | father_education = university) : math_grade

(gender | father_education = graduate) : math_grade

■

EXAMPLE 10:

At the beginning, the SDPMining algorithm computes Range(A_i1) for the first attribute of each dimension: gender, region, and father_education. Because the attribute region is not significant, the attribute city is also processed. The attribute city is significant, while the attributes region, gender, and father_education are not. The result is shown in Figure 4.6.

Figure 4.6: The result after searching the first level in

After expanding the non-significant attribute region, the result is shown in Figure 4.7.

Figure 4.7: The result after searching the second level in

The following SDPs can be found in Figure 4.7:

(city) : math_grade

(gender | region = north) : math_grade

(father_education | region = north) : math_grade

(father_education | region = central) : math_grade

(gender | region = east) : math_grade

■
