
CHAPTER II. Hierarchical Clustering and Support Vector Machines

II.1 Hierarchical Clustering

II.1.2 CF Tree

A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. An example of a CF tree with height h = 3 is shown in Figure 2.1. A CF tree has two types of node: non-leaf nodes and leaf nodes. Each non-leaf node contains at most B entries of the form (CF_i, child_i), where:

• i = 1, 2, …, B

• child_i is a pointer to the i-th child node

• CF_i is the CF of the cluster represented by child_i

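[Figure: a three-level tree whose root holds entries CF_1 … CF_B, whose second-level nodes hold CF_11 … CF_1B through CF_B1 … CF_BB, and whose leaves hold CF_111 … CF_11B through CF_BB1 … CF_BBB]

Figure 2.1 CF Tree with height h = 3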

A non-leaf node represents a cluster made up of all the subclusters represented by its entries. A leaf node, in contrast, contains at most L entries, each of the form [CF_i], where i = 1, 2, ..., L, and holds no pointers to other nodes. The tree is hierarchical in the following sense: the CF at any particular node summarizes all data points in that node's sub-tree. A leaf node also represents a cluster made up of all the subclusters represented by its entries, but every entry in a leaf node must satisfy a threshold requirement with respect to T: its radius has to be less than T.
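The node layout described above can be sketched as a small data structure. This is a minimal illustration, not the implementation used in this work: it assumes the usual BIRCH form of a CF as the triple (N, LS, SS), which this section does not restate, and the names `CF`, `Entry`, and `Node` are hypothetical.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional


@dataclass
class CF:
    """Clustering feature, assumed here to be the triple (N, LS, SS)."""
    n: int            # number of points in the subcluster
    ls: List[float]   # linear sum of the points, per dimension
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def merged(self, other):
        # CFs are additive: the CF of a union is the component-wise sum.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Root mean squared distance of the points to the centroid.
        c2 = sum(v * v for v in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


@dataclass
class Entry:
    cf: CF
    child: Optional["Node"] = None   # None for leaf entries [CF_i]


@dataclass
class Node:
    is_leaf: bool                    # leaf: at most L entries; non-leaf: at most B
    entries: List[Entry] = field(default_factory=list)
```

Because the CF is additive, the CF stored in a non-leaf entry is simply the sum of the CFs in the child node it points to.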

The size of the tree depends on the threshold T: the larger T is, the smaller the tree. The branching factor B is tied to the memory page size, since it is chosen so that a leaf or non-leaf node fits in one page. A CF tree is therefore a compact representation of the data set: each entry in a leaf node is not a single data point but a subcluster that absorbs many data points within a radius of less than T.

A CF tree is built dynamically as new data objects are inserted. The insertion process is similar to that of a B+-tree, where the tree guides each new entry to its correct position. The process has the following steps.

1. Identifying the appropriate leaf

To identify the appropriate leaf for a new entry, the algorithm starts from the root and recursively descends the CF tree to the leaf level, choosing at each level the child node whose centroid is closest to the new data object.

Figure 2.2 illustrates this process. Given the CF tree structure in Figure 2.2, when a new entry CF_X is to be inserted into the tree, we first identify the appropriate leaf. Starting from the root, we compute the distance between CF_X and each entry in the root (here D1, D2, and D3). We choose the entry with the smallest distance, D2, and descend to the child of CF_2. We then repeat the process, computing the distances D4 and D5 between CF_X and each entry in that node. Since the current node is a leaf node and D5 is smaller than D4, the appropriate place for the new entry is the entry CF_22.

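[Figure: a two-level tree. The root holds entries CF_1, CF_2, and CF_3, at distances D1 = 0.52, D2 = 0.15, and D3 = 0.23 from the new entry CF_X. The leaves hold CF_11 and CF_12 under CF_1, CF_21 and CF_22 under CF_2, and CF_31 under CF_3; the distances from CF_X to CF_21 and CF_22 are D4 = 0.35 and D5 = 0.25.]

Figure 2.2 Identifying Appropriate Leaf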
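The descent in step 1 can be sketched as follows. This is an illustrative sketch only: nodes are plain dictionaries, the function name `locate` is hypothetical, and the centroid coordinates are made up so that the outcome mirrors Figure 2.2 (they are not the values behind D1 to D5).

```python
from math import dist  # Euclidean distance, available from Python 3.8


def locate(node, x):
    """Descend to a leaf, then return (leaf, index of the closest leaf entry)."""
    while True:
        i = min(range(len(node["centroids"])),
                key=lambda k: dist(node["centroids"][k], x))
        if node["children"] is None:      # leaf level reached
            return node, i
        node = node["children"][i]        # follow the closest child


# Hypothetical tree with the same shape as Figure 2.2.
leaf_1 = {"names": ["CF11", "CF12"], "children": None,
          "centroids": [(0.0, 0.0), (0.0, 1.0)]}
leaf_2 = {"names": ["CF21", "CF22"], "children": None,
          "centroids": [(5.0, 5.0), (6.0, 5.0)]}
leaf_3 = {"names": ["CF31"], "children": None,
          "centroids": [(10.0, 0.0)]}
root = {"names": ["CF1", "CF2", "CF3"],
        "children": [leaf_1, leaf_2, leaf_3],
        "centroids": [(0.0, 0.5), (5.5, 5.0), (10.0, 0.0)]}

leaf, idx = locate(root, (5.9, 5.2))
print(leaf["names"][idx])  # CF22
```

With these coordinates the new point is closest to CF2 at the root and then to CF22 in the leaf, the same path as in the figure.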

2. Modifying the leaf

When the algorithm reaches a leaf node, it finds the closest leaf entry. Two cases must be considered. First, if the leaf entry can absorb the new data object without violating the T threshold condition, we only need to update the CF vector of that entry and terminate (Figure 2.3 (a)). Second, if the leaf entry cannot absorb the new data object without violating the T threshold, a new leaf entry is created for it (Figure 2.3 (b)). In this second case, if adding the new entry violates the node's capacity limit (B entries for a non-leaf node, L for a leaf node), the node is split by choosing the farthest pair of entries as seeds and redistributing the remaining entries according to which seed is closer.
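The absorb-or-add decision can be sketched as below. This is a sketch under stated assumptions: a CF is taken to be the tuple (N, LS, SS), the radius is the root mean squared distance to the centroid, and the function name `modify_leaf` and its return labels are hypothetical.

```python
from math import dist, sqrt


def point_cf(x):
    return (1, list(x), sum(v * v for v in x))


def add_cf(a, b):
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def radius(cf):
    c2 = sum(v * v for v in centroid(cf))
    return sqrt(max(cf[2] / cf[0] - c2, 0.0))


def modify_leaf(entries, x, T, max_entries):
    """Insert point x into a leaf (a list of CFs), modifying it in place.

    Returns "absorbed", "added", or "split" (the leaf now exceeds its capacity).
    """
    new = point_cf(x)
    if entries:
        i = min(range(len(entries)),
                key=lambda k: dist(centroid(entries[k]), x))
        candidate = add_cf(entries[i], new)
        if radius(candidate) < T:          # case 1: threshold still satisfied
            entries[i] = candidate
            return "absorbed"
    entries.append(new)                    # case 2: open a new leaf entry
    return "split" if len(entries) > max_entries else "added"
```

For example, with T = 0.5 a leaf holding the single point (0, 0) absorbs (0.2, 0) because the merged radius is 0.1, while (5, 5) is too far away and becomes a new entry.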

Figure 2.3 (c) shows how a leaf node is split. In this example the leaf node already holds three entries (upper part of Figure 2.3 (c)), its maximum in this example, and we want to insert CF_4. A new entry is required because the closest entry (say CF_3) cannot absorb CF_4 without violating the T threshold. First we compute the distance between each pair of entries in the leaf node (CF_1 and CF_2, CF_1 and CF_3, CF_2 and CF_3) and choose the farthest pair. If, for example, the farthest pair is CF_1 and CF_3, we place these two entries in two different nodes and redistribute the remaining entries (CF_2 and CF_4) according to their closeness to the centroids of the two seed entries. Once redistribution is finished, we compute a parent node with two entries, CF_A and CF_B.

Figure 2.3 Modifying the Leaf: (a) absorbing into an existing entry, (b) adding a new leaf entry, (c) splitting the leaf
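The split of Figure 2.3 (c) can be sketched as follows. As in the worked example, the seeds are taken from the entries already in the leaf and the new entry is redistributed with the rest. The function name `split_leaf` and the centroid coordinates are hypothetical and chosen only so that CF1 and CF3 form the farthest pair.

```python
from itertools import combinations
from math import dist


def split_leaf(old, new_name, new_centroid):
    """Split a full leaf.

    old: dict mapping entry name -> centroid for the entries already in the leaf.
    Returns two lists of entry names, one per resulting node.
    """
    # Seeds: the farthest pair among the existing entries.
    s1, s2 = max(combinations(list(old), 2),
                 key=lambda p: dist(old[p[0]], old[p[1]]))
    groups = {s1: [s1], s2: [s2]}
    # Remaining entries, including the new one, go to the closer seed.
    rest = {k: v for k, v in old.items() if k not in (s1, s2)}
    rest[new_name] = new_centroid
    for name, c in rest.items():
        seed = s1 if dist(c, old[s1]) <= dist(c, old[s2]) else s2
        groups[seed].append(name)
    return groups[s1], groups[s2]


old_entries = {"CF1": (0.0, 0.0), "CF2": (1.0, 0.0), "CF3": (10.0, 0.0)}
node_a, node_b = split_leaf(old_entries, "CF4", (9.0, 1.0))
print(node_a, node_b)  # ['CF1', 'CF2'] ['CF3', 'CF4']
```

The parent entries CF_A and CF_B would then be obtained by summing the CFs of the entries in each of the two lists.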

3. Modifying the path to the leaf

After inserting the new entry into a leaf, we must update the CF information of each non-leaf entry along the path to the root. In the absence of a node split, this simply means adding the CF vector of the new entry to each of these entries. If modifying the leaf caused a node split, we check whether the parent node still satisfies the branching factor constraint. If the parent node violates the B threshold as well, we split it and continue recursively up to the root, performing the same check at each level.
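For the case without a node split, the path update reduces to CF addition. The sketch below assumes a CF is the tuple (N, LS, SS) and that the path is recorded during the descent as a list of (entries, index) pairs; the name `update_path` and that representation are hypothetical. Split propagation is not covered.

```python
def add_cf(a, b):
    """Component-wise sum of two CFs of the form (N, LS, SS)."""
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2])


def update_path(path, new_cf):
    """Add new_cf to every non-leaf entry on the path from root to leaf.

    path: list of (entries, index) pairs, one per non-leaf node visited,
    where entries[index] is the CF of the child that was followed.
    """
    for entries, i in path:
        entries[i] = add_cf(entries[i], new_cf)


# Two non-leaf levels above the leaf; the second root entry was followed.
root_entries = [(3, [3.0, 3.0], 6.0), (2, [10.0, 10.0], 100.0)]
mid_entries = [(2, [10.0, 10.0], 100.0)]
update_path([(root_entries, 1), (mid_entries, 0)], (1, [5.0, 5.0], 50.0))
print(root_entries[1])  # (3, [15.0, 15.0], 150.0)
```

Entries that are not on the path, such as the first root entry here, are left unchanged.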

There are several ways to compute the distance between two points, including Euclidean distance and Manhattan distance. The simplest one, and the one we use in our implementation, is the Euclidean distance, defined as follows:

$$ D = \left( (X_1 - X_2)^2 \right)^{1/2} \qquad (4) $$

where X₁ and X₂ are the points.
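As a small worked illustration of equation (4), the sketch below computes the Euclidean distance between two points, with the Manhattan distance alongside for comparison. The function names are illustrative only.

```python
from math import sqrt


def euclidean(x1, x2):
    """Equation (4): square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def manhattan(x1, x2):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))


print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
```

When distances between CF entries are needed, the same function can be applied to their centroids.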

The BIRCH algorithm described above is not the complete BIRCH algorithm because it does not include the refinement process. A highly skewed input can cause two subclusters that should have been in the same cluster to be split across different nodes, and vice versa. This condition can be resolved by the refinement process, but since it is infrequent and not critical to the final clustering result, we do not use the refinement process.

The most important threshold in building the CF tree is T, because it determines the size of the tree and therefore whether it fits in memory. If T is too small, the number of nodes becomes so large that we run out of memory before we finish scanning all the data. The original BIRCH algorithm initially sets a very low T and iteratively increases it until the tree fits into memory. The problem is that changing the value of T requires rebuilding the tree.

Rebuilding the tree is an expensive process because it requires a re-scan of the data inserted so far and at most h extra pages of memory, where h is the height of the tree [3]. In our implementation we choose the value of T heuristically, based on the number of data points, the dimensionality, and the value range of each dimension, and then adjust it to get the best result for training the SVM.
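The strategy of starting with a low T and raising it until the structure fits can be sketched in a heavily simplified form. The sketch is not the BIRCH procedure itself and not our implementation: it uses a flat list of leaf entries instead of a tree, 1-D points, the number of entries as a stand-in for memory use, and a growth factor of 2. All of these, and the function names, are assumptions made for illustration.

```python
from math import sqrt


def count_entries(points, T):
    """Number of leaf entries a flat, 1-D stand-in for a CF tree needs for T."""
    entries = []                       # each entry is [N, LS, SS]
    for x in points:
        if entries:
            best = min(entries, key=lambda e: abs(e[1] / e[0] - x))
            n, ls, ss = best[0] + 1, best[1] + x, best[2] + x * x
            r = sqrt(max(ss / n - (ls / n) ** 2, 0.0))
            if r < T:                  # absorb into the closest entry
                best[:] = [n, ls, ss]
                continue
        entries.append([1, x, x * x])  # otherwise open a new entry
    return len(entries)


def fit_threshold(points, max_entries, T=0.01, growth=2.0):
    """Raise T until the entries fit; every change of T rescans all points."""
    while count_entries(points, T) > max_entries:
        T *= growth
    return T


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(count_entries(data, 0.01))       # 6
print(count_entries(data, 0.16))       # 2
```

The two counts show the effect of T on size: with a very small T every point keeps its own entry, while a larger T lets each group of three nearby points collapse into one entry. Each pass of the loop in `fit_threshold` repeats the full scan, which is the cost the paragraph above refers to.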
