**CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM**

**3.3 UNSUPERVISED LEARNING**

We now turn to the procedure for performing SV clustering. As mentioned in Section 2.1.2, SV clustering requires choosing proper values of $q$ and $\nu$. The choice of $q$ determines the compactness of the enclosing sphere and also the number of clusters. The choice of $\nu$ helps us solve the problem of overlapping clusters.

The SV clustering process is as follows:

Fig 3.5 The SV clustering processes [Ben-Hur et al., 2000]

[Flowchart. Input: the unlabeled data set $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in R^d$. Steps: choose a kernel function; increase $q$ from 0 with $\nu$ fixed; for the given $q$ and $\nu$, use the adjacency matrix and DFS to find all connected components; test whether clusters exist ($\ge 2$) and then test cluster validity, stopping when both answers are yes and otherwise continuing with the next $q$; if $q$ is exhausted and every answer has been no, fix $q$ and increase $\nu$.]

We explain the above procedures as follows:

**3.3.2 The Choice of Kernel Function**

In 1992, Vapnik showed [Boser et al., 1992] that the order of operations for constructing a decision function can be interchanged. Instead of applying a non-linear transformation to the input vectors and then taking dot products with the SVs in feature space, one can first compare two vectors in input space and then apply a non-linear transformation to the resulting value.

Commonly used kernel functions are as follows:

a) Gaussian RBF kernel:

$$K(x, y) = \exp\left(-q \lVert x - y \rVert^{2}\right) \tag{3.2}$$

b) Polynomial kernel:

$$K(x, y) = (x \cdot y + 1)^{d} \tag{3.3}$$

c) Sigmoid kernel:

$$K(x, y) = \tanh(x \cdot y - \theta) \tag{3.4}$$

We use only the Gaussian kernel, since other kernel functions such as the polynomial kernel do not yield tight contour representations of a cluster [Tax et al., 1999]. We will show in Section 4.3 that the Gaussian kernel is indeed the best choice for SV clustering.
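The three kernels in Eqs. (3.2) to (3.4) can be written down directly. The following is a minimal sketch in plain Python; the function names and the list-of-floats vector representation are assumptions made for illustration, not part of our implementation.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, q):
    """Eq. (3.2): exp(-q * ||x - y||^2); q is the width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * sq_dist)


def polynomial_kernel(x, y, d):
    """Eq. (3.3): (x . y + 1)^d, where d is the polynomial degree."""
    return (dot(x, y) + 1.0) ** d


def sigmoid_kernel(x, y, theta):
    """Eq. (3.4): tanh(x . y - theta)."""
    return math.tanh(dot(x, y) - theta)
```

For the Gaussian kernel, a larger `q` makes the kernel value fall off faster with distance, which is why increasing $q$ tightens the cluster contours.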

**3.3.3 Cluster-Finding with Depth First Searching Algorithm**

We use graph theory to interpret the clustering result. Every enclosing sphere is a connected component, and data points in the same connected component are adjacent, either directly or through a chain of adjacent points. Our task now is to find all the connected components.

Define an adjacency matrix $A_{ij}$ between pairs of points $x_i$ and $x_j$:

$$A_{ij} = \begin{cases} 1 & \text{if } R(y) \le R \text{ for all } y \text{ on the line segment connecting } x_i \text{ and } x_j, \\ 0 & \text{otherwise.} \end{cases} \tag{3.5}$$
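In practice the condition in Eq. (3.5) cannot be checked for every point on a segment, so it is tested at a finite number of sample points. The sketch below assumes a caller-supplied function `radius_fn(y)` that returns $R(y)$ for the trained model and a scalar `sphere_radius` equal to $R$; both names, and the default of 20 sample points per segment, are hypothetical choices for illustration.

```python
def build_adjacency(points, radius_fn, sphere_radius, m=20):
    """Approximate the adjacency matrix of Eq. (3.5).

    points        -- list of n data points (sequences of floats)
    radius_fn     -- hypothetical callable returning R(y) for a point y
    sphere_radius -- the sphere radius R
    m             -- number of sample points per segment, endpoints included
    """
    if m < 2:
        raise ValueError("m must be at least 2 so both endpoints are checked")
    n = len(points)
    adj = [[0] * n for _ in range(n)]  # diagonal is left at 0
    for i in range(n):
        for j in range(i + 1, n):
            inside = True
            for k in range(m):
                t = k / (m - 1)
                # Point at fraction t along the segment from x_i to x_j.
                y = [a + t * (b - a) for a, b in zip(points[i], points[j])]
                if radius_fn(y) > sphere_radius:
                    inside = False
                    break
            adj[i][j] = adj[j][i] = int(inside)
    return adj
```

The triple loop makes the cost visible: every pair of points is examined, and each pair needs up to `m` evaluations of `radius_fn`.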

So far we can tell whether two data points are adjacent; we still need to find all the adjacent data points that belong to the same connected component. The algorithm we adopt is depth-first search (DFS). Every training data point, even a BSV, belongs to one connected component, and the DFS algorithm tells us which connected component each data point belongs to.

The connected-components and DFS algorithms are as follows [黃曲江, 1989; Ellis, 1995]:

    procedure ConnectedComponents(adjacencyList: HeaderList; n: integer);
    var
        mark: array[VertexType] of integer;
        { Each vertex will be marked with the number of the component it is in. }
        v: VertexType;
        componentNumber: integer;

        procedure DFS(v: VertexType);
        { Does a depth-first search beginning at the vertex v. }
        var
            w: VertexType;
            ptr: NodePointer;
        begin
            mark[v] := componentNumber;
            ptr := adjacencyList[v];
            while ptr ≠ nil do
                w := ptr↑.vertex;
                output(v, w);
                if mark[w] = 0 then DFS(w) end;
                ptr := ptr↑.link
            end { while }
        end { DFS }

    begin { ConnectedComponents }
        { Initialize mark array. }
        for v := 1 to n do mark[v] := 0 end;
        { Find and number the connected components. }
        componentNumber := 0;
        for v := 1 to n do
            if mark[v] = 0 then
                componentNumber := componentNumber + 1;
                output heading for a new component;
                DFS(v)
            end { if v was unmarked }
        end { for }
    end { ConnectedComponents }
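The same procedure can be sketched in Python. This version is an illustration rather than the code we ran: it reads a 0/1 adjacency matrix instead of an adjacency list, and it uses an explicit stack in place of recursion so that large components do not hit the recursion limit. The function name is hypothetical.

```python
def connected_components(adj):
    """Label every vertex with the number (1, 2, ...) of its component.

    adj is an n x n 0/1 adjacency matrix (list of lists).
    """
    n = len(adj)
    mark = [0] * n              # 0 means "not visited yet"
    component_number = 0
    for start in range(n):
        if mark[start] != 0:
            continue            # already assigned to a component
        component_number += 1
        mark[start] = component_number
        stack = [start]         # explicit stack replaces the recursive DFS
        while stack:
            v = stack.pop()
            for w in range(n):
                if adj[v][w] and mark[w] == 0:
                    mark[w] = component_number
                    stack.append(w)
    return mark
```

A point with no adjacent neighbours ends up alone in its own component, which matches the marking scheme of the pseudocode above.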

**3.3.4 Cluster Validation**

When should the clustering procedure stop? It is natural to use the number of SVs as an indication of a meaningful solution [Ben-Hur et al., 2000; Ben-Hur et al., 2001].

We start with a fixed $\nu$ and some value of $q$, and slowly increase the value of $q$. As $q$ grows, the cluster boundaries become tighter and tighter, more clusters are formed, and the percentage of SVs increases. If this percentage becomes too high, it is time to stop the clustering process. In general, the percentage of SVs in the training data set is about 10%.

If connected components are not found for many values of $q$, we should increase the value of $\nu$ in order to break the overlapping boundaries. Doing so forces many data points that lie in the overlapping boundaries to become so-called bounded SVs (BSVs), which are not included in the connected components.

Through all of these processes, we can construct the complete hierarchy of Reuters categories. We will show the advantage of these hierarchical categories over the basic flat Reuters classification in Section 4.4.

Below we address two problems. The first is that we may always obtain only one connected component for our data set. The second is that finding the connected components is so time-consuming that we cannot afford it.

**3.3.5 One-Cluster And Time-Consuming Problem**

We face two problems in our proposed model:

(1) The clustering result always tells us that there is only one connected component for our training data set.

(2) The clustering process is time-consuming.

Our strategy for the first problem is to perform dimension reduction and observe how the dimension influences the clustering result. We use PCA for dimension reduction.
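As an illustration of the dimension-reduction step, the sketch below projects the data onto its top `k` principal components. It uses power iteration with deflation on the sample covariance matrix, a choice made here only to keep the example self-contained; it is not necessarily how our PCA was computed, and a library eigen-solver would be preferable in practice. The names `pca_project`, `k` and `iters` are hypothetical.

```python
import math


def pca_project(data, k, iters=500):
    """Project the rows of `data` onto the top-k principal components.

    Returns (projected, components). Assumes k is at most the number of
    directions with non-zero variance.
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _component in range(k):
        # Fixed start vector; it must not be orthogonal to the sought direction.
        v = [1.0 / math.sqrt(d)] * d
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:        # no variance left in any direction
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components
```

Running SV clustering on `projected` for several values of `k` is one way to see how the dimension affects the number of connected components.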

The second problem can be solved by sampling. Suppose the training data set is $X$,

$$X = \{x_1, x_2, \ldots, x_n\}, \quad x_i \in R^{d} \tag{3.6}$$

It takes $O(mn^{2})$ time to build the adjacency matrix, where $n$ is the number of training data and $m$ is the partition number in every loop, that is, the number of points checked along each connecting line segment.
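To get a feel for this cost, the number of $R(y)$ evaluations can be worked out for illustrative sizes. The values $n = 1000$, $n = 100$ and $m = 20$ below are assumptions chosen only for the arithmetic; they are not figures from our experiments.

```latex
% Each unordered pair of points is checked at m sample points.
\[
  N_{\text{eval}}(n, m) = m \cdot \frac{n(n-1)}{2}
\]
% Full training set (assumed n = 1000, m = 20):
\[
  N_{\text{eval}}(1000, 20) = 20 \cdot \frac{1000 \cdot 999}{2}
                            = 20 \cdot 499\,500 = 9\,990\,000
\]
% After sampling down to n = 100 representatives:
\[
  N_{\text{eval}}(100, 20) = 20 \cdot \frac{100 \cdot 99}{2}
                           = 20 \cdot 4\,950 = 99\,000
\]
```

Reducing $n$ by a factor of 10 cuts the work by roughly a factor of 100, which is why shrinking the training set is the effective lever.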

First we find cluster centers for each category by FCM and use all the cluster centers as our new training data. We also use SMO to solve the QP problem in order to reduce training time.
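The center-finding step can be sketched as follows, assuming FCM denotes the standard fuzzy c-means algorithm. The function name, the fuzzifier value of 2.0, the deterministic initialisation and the stopping tolerance are all illustrative assumptions rather than the settings we used.

```python
import math


def fuzzy_c_means(data, c, fuzzifier=2.0, iters=100, tol=1e-6):
    """Return c cluster centers found by fuzzy c-means (a minimal sketch)."""
    n, d = len(data), len(data[0])
    # Deterministic start: c data points spread evenly through the list.
    centers = [list(data[(i * (n - 1)) // max(c - 1, 1)]) for i in range(c)]
    power = 2.0 / (fuzzifier - 1.0)
    for _ in range(iters):
        # Step 1: membership of every point in every cluster.
        u = []
        for x in data:
            dist = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, ctr)))
                    for ctr in centers]
            if min(dist) == 0.0:
                # Point sits exactly on a center: give it full membership there.
                row = [1.0 if dk == 0.0 else 0.0 for dk in dist]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [1.0 / sum((di / dj) ** power for dj in dist)
                       for di in dist]
            u.append(row)
        # Step 2: centers become membership-weighted means of the data.
        new_centers = []
        for i in range(c):
            weights = [u[k][i] ** fuzzifier for k in range(n)]
            total = sum(weights)
            new_centers.append([sum(w * x[j] for w, x in zip(weights, data)) / total
                                for j in range(d)])
        shift = max(abs(a - b)
                    for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Calling this once per category and pooling the returned centers gives a much smaller training set, so the $O(mn^{2})$ adjacency step operates on far fewer points.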