CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM
3.3 UNSUPERVISED LEARNING
We now describe the procedure for performing SV clustering. As we mentioned in Section 2.1.2, performing SV clustering requires choosing proper values of q and ν. The choice of q determines the compactness of the enclosing sphere and hence the number of clusters; the choice of ν helps us to solve the problem of overlapping clusters. The SV clustering process is as follows:
Fig 3.5 The SV clustering process [Ben-Hur et al., 2000]
[Flowchart: given an unlabeled data set X = {x1, x2, ..., xn} ∈ R^d, choose a kernel function and increase q from 0 with ν fixed. For each (q, ν), use the adjacency matrix and DFS to find all connected components, then check cluster validity. If valid clusters exist (≥ 2), stop. If q is exhausted and no valid clustering is found, fix q and increase ν, then repeat.]
We explain the above procedures as follows:
3.3.2 The Choice of Kernel Function
In 1992, Vapnik and his co-workers [Boser et al., 1992] showed that the order of operations for constructing a decision function can be interchanged. Instead of making a non-linear transformation of the input vectors followed by dot products with SVs in feature space, one can first compare two vectors in input space and then make a non-linear transformation of the resulting value.
Commonly used kernel functions are as follows:
a) Gaussian RBF kernel: K(x, y) = exp(−q‖x − y‖²) (3.2)
b) Polynomial kernel: K(x, y) = (x ⋅ y + 1)^d (3.3)
c) Sigmoid kernel: K(x, y) = tanh(x ⋅ y − θ) (3.4)
We use only the Gaussian kernel, since other kernel functions such as the polynomial kernel do not yield tight contour representations of a cluster [Tax et al., 1999], and we will show that the Gaussian kernel is indeed the best choice for SV clustering in Section 4.3.
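The three kernels in Eqs. (3.2)-(3.4) can be sketched directly; this is a minimal illustration, and the function names and default parameter values below are our own choices, not part of the proposed model:

```python
import numpy as np

def gaussian_kernel(x, y, q=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-q * ||x - y||^2), Eq. (3.2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-q * np.dot(diff, diff))

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y + 1)^d, Eq. (3.3)."""
    return (np.dot(x, y) + 1.0) ** d

def sigmoid_kernel(x, y, theta=0.0):
    """Sigmoid kernel K(x, y) = tanh(x . y - theta), Eq. (3.4)."""
    return np.tanh(np.dot(x, y) - theta)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(gaussian_kernel(x, y, q=0.5))   # exp(-0.5 * 2) ≈ 0.3679
```

Note that q in the Gaussian kernel acts as an inverse width: larger q gives tighter, more local contours, which is what the clustering procedure exploits when it increases q.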
3.3.3 Cluster-Finding with Depth First Searching Algorithm
We use graph theory to explain the clustering result. Every enclosing sphere is a connected component, and data points in the same connected component are adjacent. What we do now is to find all the connected components.
Define an adjacency matrix A_ij between pairs of points x_i and x_j:
A_ij = 1 if, for all y on the line segment connecting x_i and x_j, R(y) ≤ R; A_ij = 0 otherwise. (3.5)
Up to now we know whether any two data points are adjacent; we still need to gather all adjacent data points into the same connected component. The algorithm we adopt is the Depth First Search (DFS) algorithm. Since every training data point, even a BSV, belongs to one connected component, we can find out which connected component a data point belongs to by the DFS algorithm.
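The adjacency test of Eq. (3.5) can be sketched by sampling points along the connecting segment. This is an illustration only: `radius` stands in for the trained feature-space distance R(y), and here we substitute a toy Euclidean radius since the real R(y) comes from the one-class SVM solution.

```python
import numpy as np

def adjacent(x_i, x_j, radius, R, m=10):
    """Eq. (3.5): x_i and x_j are adjacent (return 1) if every sampled
    point y on the segment between them satisfies radius(y) <= R."""
    for t in np.linspace(0.0, 1.0, m):
        y = (1.0 - t) * x_i + t * x_j   # point on the connecting segment
        if radius(y) > R:
            return 0                     # the segment leaves the sphere
    return 1

# Toy stand-in for R(y): the "sphere" is a unit Euclidean ball at the origin.
toy_radius = lambda y: np.linalg.norm(y)
print(adjacent(np.array([0.5, 0.0]), np.array([-0.5, 0.0]), toy_radius, 1.0))  # 1
print(adjacent(np.array([2.0, 0.0]), np.array([-2.0, 0.0]), toy_radius, 1.0))  # 0
```

The parameter m here is the partition number per segment, which is why building the full adjacency matrix over all pairs costs O(mn²), as discussed in Section 3.3.5.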
The connected-component and DFS algorithms are as follows [黃曲江 1989; Ellis 1995]:
procedure ConnectedComponents(adjacencyList: HeaderList; n: integer);
var
  mark: array[VertexType] of integer;
  { Each vertex will be marked with the number of the component it is in. }
  v: VertexType;
  componentNumber: integer;

  procedure DFS(v: VertexType);
  { Does a depth-first search beginning at the vertex v }
  var
    w: VertexType;
    ptr: NodePointer;
  begin
    mark[v] := componentNumber;
    ptr := adjacencyList[v];
    while ptr ≠ nil do
      w := ptr↑.vertex;
      output(v, w);
      if mark[w] = 0 then DFS(w) end;
      ptr := ptr↑.link
    end {while}
  end {DFS}

begin {ConnectedComponents}
  { Initialize mark array. }
  for v := 1 to n do mark[v] := 0 end;
  { Find and number the connected components. }
  componentNumber := 0;
  for v := 1 to n do
    if mark[v] = 0 then
      componentNumber := componentNumber + 1;
      output heading for a new component;
      DFS(v)
    end { if v was unmarked }
  end {for}
end {ConnectedComponents}
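The same procedure translates directly into a short, runnable Python sketch; the adjacency list is represented as a dictionary mapping each vertex to its neighbours, and the returned `mark` dictionary assigns each vertex its component number, just as in the pseudocode above:

```python
def connected_components(adjacency):
    """Number the connected components of an undirected graph given as
    an adjacency list {vertex: [neighbours]}; returns {vertex: component}."""
    mark = {v: 0 for v in adjacency}   # 0 means "not yet visited"
    component = 0

    def dfs(v):
        # Depth-first search beginning at vertex v.
        mark[v] = component
        for w in adjacency[v]:
            if mark[w] == 0:
                dfs(w)

    for v in adjacency:
        if mark[v] == 0:               # v starts a new component
            component += 1
            dfs(v)
    return mark

adj = {1: [2], 2: [1], 3: [4], 4: [3], 5: []}
print(connected_components(adj))   # {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
```

In this example, vertex 5 is isolated and forms its own component, which mirrors how a BSV still belongs to some connected component in the clustering procedure.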
3.3.4 Cluster Validation
But when should we stop the clustering procedure? It is natural to use the number of SVs as an indication of a meaningful solution [Ben-Hur et al., 2000; Ben-Hur et al., 2001].
At first we start with ν fixed and some small value of q, and slowly increase the value of q. We find that the cluster boundaries fit more and more tightly, more clusters are formed, and the percentage of SVs increases. If this percentage becomes too high, it is time to stop the clustering process. In general, the percentage of SVs in the training data set is about 10%.
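This stopping rule can be sketched as a simple loop. Everything named below is our own illustration: `train_svc` is a hypothetical routine standing in for the one-class SVM training step, and the grid of q values is arbitrary.

```python
def run_clustering(X, train_svc, nu=0.1, q_values=(0.1, 0.5, 1.0, 5.0)):
    """Increase q with nu fixed; stop once the SV fraction exceeds ~10%."""
    n = len(X)
    for q in q_values:
        sv_fraction = train_svc(X, q, nu) / n
        if sv_fraction > 0.10:        # too many SVs: stop the process here
            return q, sv_fraction
    return q_values[-1], sv_fraction  # q exhausted without triggering the rule

# Toy stand-in whose SV count simply grows with q, for illustration only.
fake_train_svc = lambda X, q, nu: int(q * len(X) * 0.1)
print(run_clustering(list(range(100)), fake_train_svc))   # (5.0, 0.5)
```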
If valid connected components are not found for many values of q, we should increase the value of ν in order to break the overlapping boundaries. In doing so, many data points that lie in the overlapping boundaries will be forced to become so-called Bounded SVs, which are not included in the connected components. Through all of these processes we can construct the complete Reuters hierarchical categories. We will show the advantage of these hierarchical categories compared to the basic flat Reuters classification in Section 4.4.
Below we address two problems: the first is that we may always obtain only one connected component for our data set; the second is that finding connected components is so time-consuming that we cannot afford it.
3.3.5 One-Cluster And Time-Consuming Problem
We face two problems in our proposed model:
(1) The clustering result may always tell us there is only one connected component for our training data set.
(2) The clustering process is time-consuming.
Our strategy for the first problem is to perform dimension reduction, in order to see the influence of the dimension on the clustering result. We use PCA for dimension reduction.
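PCA dimension reduction can be sketched via the SVD of the centred data matrix; this is a generic illustration (function name and the choice of k are our own), not the exact implementation used in the experiments:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                          # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # (n, k) reduced data

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # 50 points in a 10-dimensional space
Z = pca_reduce(X, 2)
print(Z.shape)   # (50, 2)
```

Clustering the reduced data Z at several values of k then shows directly how the dimension influences the number of connected components that emerge.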
The second problem can be solved by sampling. Suppose the training data set is
X = {x1, x2, ..., xn} ∈ R^d, (3.6)
then it takes time complexity O(mn²) to build the adjacency matrix, where n is the number of training data and m is the partition number in every loop.
At first we find cluster centers for each category by FCM, and use these cluster centers as our new training data. We also use SMO to solve our QP problem in order to reduce training time.
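The FCM step can be sketched with standard fuzzy c-means updates; this numpy version (function name, parameters, and toy data all our own) shows how a small set of cluster centres can replace the full training set, shrinking n before the O(mn²) adjacency-matrix construction:

```python
import numpy as np

def fcm_centers(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: return c cluster centres for the rows of X."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1
    for _ in range(iters):
        W = U ** m                                  # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))          # membership update rule
        U /= U.sum(axis=1, keepdims=True)
    return centers

# Two well-separated toy clusters, at the origin and at (5, 5).
X = np.vstack([np.zeros((20, 2)), np.ones((20, 2)) * 5.0])
print(np.round(fcm_centers(X, 2)))   # two centres, one near (0,0), one near (5,5)
```

Replacing the 40 points by 2 centres here is the same reduction applied per category in our model, after which SV clustering runs on the much smaller centre set.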