Input: Record vector RVjk.
Output: Class vector CVjk.
Step 1: Set CVjk to ZEROσm.
Step 2: For each i, 1 ≤ i ≤ σm, set the i-th bit of CVjk to 1 if RVjk ∩ RVmi ≠ ZEROn; otherwise, set it to 0.
Step 3: Return CVjk.
DEFINITION 7.3 - Feature-value vector :
A feature-value vector Fjk is the concatenation of RVjk and CVjk.
For example, the feature-value vector F11 in Table 7.2 is 1100100000110, which is RV11 concatenated with CV11. All the feature-value vectors for a feature are then collected together as a feature matrix. This is defined below.
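The class-vector procedure and Definition 7.3 can be sketched together as follows. This is only an illustrative sketch: bit vectors are written as strings of '0' and '1', the function names are hypothetical, and the three decision-feature record vectors in `DECISION_RVS` are made-up values (Table 7.1 is not reproduced here), chosen only so that the result agrees with the F11 example above.

```python
# Hypothetical record vectors RVm1..RVm3 of the decision feature (assumed data).
DECISION_RVS = ["1001000100", "0100101000", "0010010011"]


def class_vector(rv: str, decision_rvs: list[str]) -> str:
    """Return CVjk: bit i is 1 if RVjk shares a record with RVmi, else 0."""
    rv_bits = int(rv, 2)
    # A non-zero bitwise AND means the intersection is not ZEROn.
    return "".join("1" if rv_bits & int(d, 2) else "0" for d in decision_rvs)


def feature_value_vector(rv: str, decision_rvs: list[str]) -> str:
    """Return Fjk, the record vector followed by its class vector."""
    return rv + class_vector(rv, decision_rvs)


rv11 = "1100100000"
cv11 = class_vector(rv11, DECISION_RVS)
f11 = feature_value_vector(rv11, DECISION_RVS)
print(cv11, f11)
```

With these assumed decision vectors, RV11 overlaps the first two decision values but not the third, which gives the class vector 110 used in the example.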
DEFINITION 7.4 - A feature matrix for a feature :
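A feature matrix Mj for the feature Cj is denoted
$M_j = \begin{bmatrix} F_{j1} \\ F_{j2} \\ \vdots \\ F_{j\sigma_j} \end{bmatrix}$, where σj is the number of possible values of Cj.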
For example, the feature matrix M1 in Table 7.2 consists of the feature-value vectors of C1, the first of which is F11 = 1100100000110.
Since each record takes exactly one value of a feature, the record vectors in a feature matrix are pairwise disjoint and together cover all records. It is therefore easily derived that applying the bit-wise operator "OR" to all the record vectors in a feature matrix yields the ONEn vector, and applying the bit-wise operator "AND" to any two record vectors in a feature matrix yields the ZEROn vector. Note that the "OR" and "AND" operators are applied to the corresponding bits of the two given bit vectors. Because XOR coincides with OR on pairwise disjoint vectors, applying the bit-wise operator "XOR" to all the record vectors in a feature matrix also yields the ONEn vector. Take M1 as an example. The result of 1100100000 OR 0011010000 OR 0000001111 is 1111111111. The result of 1100100000 AND 0011010000 is 0000000000. The result of 1100100000 XOR 0011010000 XOR 0000001111 is 1111111111.
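The bit-wise properties of the record vectors of M1 can be checked with a few lines of Python. This is a small sketch, not part of the original method; the helper name `to_bits` is hypothetical, and the three record vectors are the ones quoted in the example above.

```python
from functools import reduce
from operator import or_, xor

# Record vectors of M1, as quoted in the text.
record_vectors = ["1100100000", "0011010000", "0000001111"]
n = len(record_vectors[0])
bits = [int(v, 2) for v in record_vectors]


def to_bits(x: int) -> str:
    """Format an integer as an n-bit string."""
    return format(x, f"0{n}b")


or_all = to_bits(reduce(or_, bits))    # OR over all record vectors
xor_all = to_bits(reduce(xor, bits))   # XOR over all record vectors
# AND of every pair of distinct record vectors.
pairwise_and = [to_bits(a & b) for i, a in enumerate(bits) for b in bits[i + 1:]]

print(or_all, xor_all, pairwise_and)
```

Because the vectors are pairwise disjoint, every pairwise AND is all zeros, and XOR over all of them gives the same all-ones result as OR.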
DEFINITION 7.5 - A feature matrix for a table T :
A feature matrix M for a table T is denoted
$M = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_m \end{bmatrix}$, where m is the number of features in T.
For example, the matrix composed of the bit strings from columns 3 and 4 of Table 7.2 is the feature matrix for the data given in Table 7.1. The feature matrix for a table is then input to the feature selection phase to find a relevant and sufficient set of features.
7.2.3 Feature Selection Phase
In this phase, we want to find a relevant and sufficient set of features to represent the given dataset. The phase is divided into several stages. First, a feature-based spanning tree is built for cleansing the bitmap indexing matrix. Noisy data are identified and filtered out according to the spanning tree. The cleansed, noise-free bitmap indexing matrix is then used to determine the optimal feature set for some classification and clustering problems.
Before the feature selection phase is executed, the correctness of the target table needs to be verified. If some records in the target table have the same values for all condition features but different values for the decision feature, they are treated as noise records and are filtered out of the target table. A straightforward approach compares every pair of records to find the inconsistent ones; its time complexity is O(n²m), where n is the number of records and m is the number of features. Below, we propose the concept of a cleansing tree to decrease the time complexity to O(nmj), where j is the maximum number of possible values of a feature; n is usually much larger than j in general classification and clustering problems. The formation of a cleansing tree depends on the given feature order. We thus first define the spanned feature order.
DEFINITION 7.6 - spanned feature order :
A spanned feature order O is a permutation consisting of all the condition features in a target table T.
For example, in Table 7.1, <C1, C2, C3, C4> can be a spanned feature order. When a spanned feature order is given, a cleansing tree can be built according to it. The definition of a cleansing tree is given below.
DEFINITION 7.7 - cleansing tree :
A cleansing tree Ctree is a tree with a root denoted root[Ctree]. Every node x in the tree corresponds to a feature value. A node y is the parent of a node x if the feature of y precedes the feature of x in the given spanned feature order. A node z is the sibling of a node x if they have the same feature, but different values.
The structure of a cleansing tree is shown in Figure 7.2. Its maximum height is m-1, where m is the number of features in a decision table T. Each node x has three pointers, p[x], left-child[x] and right-sibling[x], which point to its parent node, its leftmost child node and its first right sibling node, respectively. It also contains two additional fields, record[x] and class[x], which hold the record vector and class vector associated with x. If node x has no child, then left-child[x] = NIL; if node x is the rightmost child of its parent, then right-sibling[x] = NIL.
Figure 7.2: The structure of a cleansing tree
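The node layout described above is the usual left-child, right-sibling representation, and can be sketched as below. The class name `CNode`, the field name `class_vec` (used because `class` is a reserved word in Python) and the helpers `add_child` and `children` are hypothetical names introduced only for illustration; NIL is represented by `None`.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class CNode:
    record: str = ""      # record[x]: associated record vector
    class_vec: str = ""   # class[x]: associated class vector
    parent: Optional["CNode"] = field(default=None, repr=False)         # p[x]
    left_child: Optional["CNode"] = field(default=None, repr=False)     # left-child[x]
    right_sibling: Optional["CNode"] = field(default=None, repr=False)  # right-sibling[x]


def add_child(parent: CNode, child: CNode) -> CNode:
    """Attach child as the rightmost child of parent."""
    child.parent = parent
    if parent.left_child is None:
        parent.left_child = child
    else:
        last = parent.left_child
        while last.right_sibling is not None:
            last = last.right_sibling
        last.right_sibling = child
    return child


def children(node: CNode) -> Iterator[CNode]:
    """Yield the children of node from left to right."""
    child = node.left_child
    while child is not None:
        yield child
        child = child.right_sibling


root = CNode()
a = add_child(root, CNode(record="1100100000", class_vec="110"))
b = add_child(root, CNode(record="0011010000"))
print([c.record for c in children(root)])
```

Storing only the leftmost child and the next sibling keeps each node at a fixed size regardless of how many values a feature has.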
As mentioned above, records may have the same values for all condition features but different values for the decision feature. These records are called inconsistent.
Inconsistent records can also be detected while the cleansing tree is built. The building algorithm uses the valid mask vector to identify the consistent records. The valid mask vector is defined as follows.
DEFINITION 7.8 - valid mask vector :
A valid mask vector ValidMask for a target table T is a bit string b1b2…bn, with bi set to 1 if the i-th record Ri is not inconsistent with any other record, and set to 0 otherwise.
The cleansing tree for a given spanned feature order can be built by the following Create cleansing tree algorithm. The ValidMask is initially set to ONEn and is modified during the execution of the Create cleansing tree algorithm.
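To make the definition concrete, the sketch below computes a valid mask by the straightforward pairwise comparison mentioned earlier, not by the cleansing tree. The function name `valid_mask`, the record layout (a tuple of condition values plus a decision value) and the toy records are all assumptions made for illustration; they are not taken from Table 7.1.

```python
def valid_mask(records: list[tuple[tuple[str, ...], str]]) -> str:
    """Return ValidMask as a bit string: bit i is 0 if record i is inconsistent."""
    n = len(records)
    mask = ["1"] * n  # start from ONEn
    for i in range(n):
        for j in range(i + 1, n):
            same_conditions = records[i][0] == records[j][0]
            different_decision = records[i][1] != records[j][1]
            if same_conditions and different_decision:
                # Both records of an inconsistent pair are marked invalid.
                mask[i] = mask[j] = "0"
    return "".join(mask)


# Hypothetical toy table: (condition feature values, decision feature value).
toy_records = [
    (("a", "x"), "yes"),
    (("a", "x"), "no"),   # conflicts with the first record
    (("b", "x"), "yes"),
    (("b", "y"), "no"),
    (("b", "x"), "yes"),  # duplicate of the third record, but consistent
]
mask = valid_mask(toy_records)
print(mask)
```

Comparing every pair of records over all condition features is the quadratic approach that the cleansing tree is intended to replace.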