In document Hierarchical Text Categorization Using One-Class SVM (Page 38-44)

## CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM

### 3.3 UNSUPERVISED LEARNING

We now describe the procedure for performing SV clustering. As mentioned in Section 2.1.2, performing SV clustering requires choosing proper values of q and ν. The choice of q determines the compactness of the enclosing sphere and hence the number of clusters, while the choice of ν helps us solve the problem of overlapping clusters.

The SV clustering process is as follows:

Fig 3.5 The SV clustering process [Ben-Hur et al., 2000]

(Flowchart summary: given an unlabeled data set {x1, x2, ..., xn} ∈ R^d, choose a kernel function and fix ν; increase q starting from 0; use the adjacency matrix and DFS to find all connected components; check cluster validity; if clusters (≥ 2) exist, stop; if q is exhausted and no valid clusters exist, increase ν and repeat.)

We explain the above procedures as follows:

### 3.3.2 The Choice of Kernel Function

In 1992, Boser, Guyon, and Vapnik [Boser et al., 1992] showed that the order of operations for constructing a decision function can be interchanged. So instead of making a non-linear transformation of the input vectors followed by dot products with SVs in feature space, one can first compare two vectors in input space and then make a non-linear transformation of the resulting value.

Commonly used kernel functions are as follows:

a) Gaussian RBF kernel: K(x, y) = exp(−q ‖x − y‖²) (3.2)

b) Polynomial kernel: K(x, y) = (x ⋅ y + 1)^d (3.3)

c) Sigmoid kernel: K(x, y) = tanh(x ⋅ y − θ) (3.4)

We use only the Gaussian kernel, since other kernel functions such as the polynomial kernel do not yield tight contour representations of a cluster [Tax et al., 1999], and we will show that the Gaussian kernel is indeed the best choice for SV clustering in Section 4.3.
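As a minimal sketch, the Gaussian kernel of Eq. (3.2) can be computed over a whole data set at once; the function name and vectorized form below are illustrative, not from the thesis:

```python
import numpy as np

def gaussian_kernel(X, Y, q):
    """Gaussian RBF kernel of Eq. (3.2): K(x, y) = exp(-q * ||x - y||^2)."""
    # Squared Euclidean distances between every row of X and every row of Y.
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-q * np.maximum(d2, 0.0))  # clamp tiny negatives from round-off
```

Note that K(x, x) = 1 for every point, and larger q makes the kernel decay faster, which is what tightens the cluster contours.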

### 3.3.3 Cluster-Finding with Depth First Searching Algorithm

We use graph theory to explain the clustering result. Every enclosing sphere is a connected component, and data points in the same connected component are adjacent.

What we do now is to find out all the connected components.

We define the adjacency matrix A_ij between pairs of points x_i and x_j as

A_ij = 1 if, for all y on the line segment connecting x_i and x_j, R(y) ≤ R; A_ij = 0 otherwise. (3.5)

Now that we can tell whether two data points are adjacent, we need to find all adjacent data points in the same connected component. The algorithm we adopt is the Depth-First Search (DFS) algorithm. As we know, every training data point, even a BSV, belongs to one connected component. We can find out which connected component a data point belongs to with the DFS algorithm.
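In practice the condition of Eq. (3.5) is checked by sampling a few points along the segment. A hedged sketch, where `radius` stands in for the feature-space distance function R(y) (a hypothetical interface, not defined in this section):

```python
import numpy as np

def adjacent(xi, xj, radius, R, n_samples=10):
    """Adjacency test of Eq. (3.5): x_i and x_j are adjacent iff every
    sampled point y on the segment between them satisfies R(y) <= R.
    `radius` is assumed to compute R(y) for a point y (hypothetical)."""
    for t in np.linspace(0.0, 1.0, n_samples):
        y = (1.0 - t) * xi + t * xj   # point on the connecting segment
        if radius(y) > R:             # the segment leaves the enclosing sphere
            return 0
    return 1
```

With n sampling points per segment, building the full matrix costs one radius evaluation per sample, which is the source of the O(mn²) complexity discussed in Section 3.3.5.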

The connected component and DFS algorithm are as follows [黃曲江 1989 ; Ellis 1995]:

```pascal
var
  mark: array[VertexType] of integer;
    { Each vertex will be marked with the number of the component it is in. }
  v: VertexType;
  componentNumber: integer;

procedure DFS(v: VertexType);
  { Does a depth-first search beginning at the vertex v }
var
  w: VertexType;
  ptr: NodePointer;
begin
  mark[v] := componentNumber;
  ptr := adjacencyList[v];          { first node on v's adjacency list }
  while ptr <> nil do
  begin
    w := ptr↑.vertex;
    output(v, w);
    if mark[w] = 0 then DFS(w);
    ptr := ptr↑.link
  end {while}
end; {DFS}

begin {ConnectedComponents}
  { Initialize mark array. }
  for v := 1 to n do mark[v] := 0;
  { Find and number the connected components. }
  componentNumber := 0;
  for v := 1 to n do
    if mark[v] = 0 then
    begin
      componentNumber := componentNumber + 1;
      output heading for a new component;
      DFS(v)
    end { if v was unmarked }
end {ConnectedComponents}
```
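The same component labelling can be sketched as a short runnable Python function over the adjacency matrix A of Eq. (3.5); an explicit stack replaces the recursion of the Pascal version:

```python
def connected_components(A):
    """Label each point with its connected-component number, given a
    symmetric 0/1 adjacency matrix A (list of lists or 2-D array)."""
    n = len(A)
    mark = [0] * n                      # 0 means "not yet visited"
    component = 0
    for v in range(n):
        if mark[v] == 0:
            component += 1
            stack = [v]                 # iterative DFS avoids deep recursion
            while stack:
                u = stack.pop()
                if mark[u] == 0:
                    mark[u] = component
                    stack.extend(w for w in range(n) if A[u][w] and mark[w] == 0)
    return mark
```

Each component number then identifies one cluster; points sharing a label lie inside the same enclosing contour.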

### 3.3.4 Cluster Validation

When should we stop the clustering procedure? It is natural to use the number of SVs as an indicator of a meaningful solution [Ben-Hur et al., 2000; Ben-Hur et al., 2001].

Starting from a fixed value of ν and some small value of q, we slowly increase q. The cluster boundaries become tighter and tighter, more clusters are formed, and the percentage of SVs increases. If this percentage becomes too high, it is time to stop the clustering process. In general, the percentage of SVs in the training data set is about 10%.

If overlapping clusters remain after exhausting q, we should increase the value of ν in order to break the overlapping boundaries. In doing so, many data points that lie in the overlapping boundaries will be forced to become so-called bounded SVs (BSVs). They are not included in the connected components.
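The q-sweep with the ~10% stopping heuristic can be sketched as follows; `train` stands for a hypothetical routine that fits the one-class SVM at a given q and returns the fraction of training points that become SVs (not an interface defined in the thesis):

```python
def tune_q(train, q_values, sv_ratio_limit=0.10):
    """Increase q until the fraction of SVs exceeds the ~10% heuristic.
    `train(q)` is assumed to return the SV fraction for that q (hypothetical)."""
    chosen = None
    for q in q_values:                 # q_values given in increasing order
        if train(q) > sv_ratio_limit:
            break                      # boundaries have become too tight; stop
        chosen = q
    return chosen                      # last q with an acceptable SV ratio
```

If the sweep ends with fewer than two clusters, the outer loop of Fig 3.5 raises ν and repeats.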

Through all the above processes we can construct the complete Reuters hierarchical categories. We will show the advantage of these hierarchical categories over the basic flat Reuters classification in Section 4.4.

Below we address two problems: first, the clustering result may always yield only one connected component for our data set; second, finding connected components is so time-consuming that we cannot afford it.

### 3.3.5 One-Cluster And Time-Consuming Problem

We face two problems in our proposed model:

(1) The clustering result may always tell us there is only one connected component for our training data set.

(2) The clustering process is time-consuming.

The strategy we use for the first problem is to perform dimension reduction in order to see the influence of the dimensionality on the clustering result. We use PCA for dimension reduction.
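A minimal PCA sketch via SVD of the centred data, for projecting the documents onto their first k principal components (the function name and interface are illustrative):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the n x d data matrix X onto its first k principal
    components (minimal PCA sketch via SVD of the centred data)."""
    Xc = X - X.mean(axis=0)            # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # n x k reduced representation
```

Re-running the clustering at several values of k then shows how the dimensionality affects whether more than one connected component appears.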

The second problem can be solved by using sampling. Suppose the training data set is

{x1, x2, ..., xn} ∈ R^d (3.6)

It takes time complexity O(mn²) to build the adjacency matrix, where n is the number of training data and m is the partition number in every loop.

At first we find cluster centers for each category by FCM, and use all the cluster centers as our new training data. We also use SMO to solve our QP problem in order to reduce training time.
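The FCM step can be sketched as follows: each category's documents are replaced by a handful of fuzzy cluster centres, shrinking n and hence the O(mn²) adjacency-matrix cost. This is a minimal textbook Fuzzy C-Means, not the thesis's actual implementation:

```python
import numpy as np

def fcm_centers(X, c, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means sketch: return c cluster centres of X.
    The centres can replace a category's raw documents as a much
    smaller training set for SV clustering."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # fuzzy memberships sum to 1
    for _ in range(iters):
        W = U ** m                             # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))     # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers
```

Clustering the centres instead of the full document set trades some fidelity for a large reduction in adjacency-matrix construction time.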
