National Cheng Kung University
Department of Computer Science and Information Engineering
Master's Thesis
Hierarchical Text Categorization Using One-Class SVM
Student: Yi-Kun Tu
Advisor: Dr. Jung-Hsien Chiang
July 2003
Hierarchical Text Categorization Using One-Class SVM
Yi-Kun Tu* Jung-Hsien Chiang**
Institute of Computer Science and Information Engineering, National Cheng Kung University
Chinese Abstract
With the rapid growth of information, automatic text categorization has become an important analysis technique for processing and organizing data. Experience shows that as the number of categories we handle increases, performance measures such as precision and recall decline accordingly. Adopting a hierarchical category structure makes it possible to handle problems involving large amounts of data.
In this study, we use the one-class support vector machine to perform document clustering, and use the clustering results to build a hierarchical structure that describes the relationships among categories. We then use two-class and multi-class support vector machines for supervised classification training.
Through three experiments, we investigate the characteristics of the system built on the one-class SVM and compare it with other approaches. The experimental results show that the proposed system achieves better performance.
*Author **Advisor
Hierarchical Text Categorization Using One-Class SVM
Yi-Kun Tu* Jung-Hsien Chiang**
Department of Computer Science & Information Engineering, National Cheng Kung University
Abstract
With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Experience to date has demonstrated that both precision and recall decrease as the number of categories increases. Hierarchical categorization affords the ability to deal with very large problems.
We utilize one-class SVM to perform support vector clustering, and then use the clustering results to construct a category hierarchy. This hierarchy describes the relationships between categories. Two-class and multi-class SVMs are used to perform the supervised classification.
We explore the one-class SVM model through three experiments. Performance is analyzed by comparison with other approaches; the experimental results show that the proposed category hierarchy works well.
*Author **Advisor
Acknowledgements
The completion of this thesis owes first of all to my advisor, Professor Jung-Hsien Chiang, for his attentive guidance and encouragement over these two years and for providing an excellent learning environment. I thank God for placing my girlfriend 淑華 at my side as my greatest support and encouragement. 沛毅 was my greatest knowledge base, never stinting with guidance whenever I had a question, and my senior classmate 宗賢 was the best backup, always offering the best advice when I needed it most.
Of course, there are many more people to thank; I keep you all in my heart. Thank you for your support and encouragement.
Midsummer 2003, Yi-Kun Tu, ISMP IIR LAB, Institute of Computer Science and Information Engineering, NCKU
Contents
CHINESE ABSTRACT
ABSTRACT
FIGURE LISTING
TABLE LISTING
CHAPTER 1 INTRODUCTION
1.1 RESEARCH MOTIVATION
1.2 THE APPROACH
1.3 THESIS ORGANIZATION
CHAPTER 2 LITERATURE REVIEW AND RELATED WORKS
2.1 SUPPORT VECTOR CLUSTERING
2.1.1 One-Class SVM
2.1.2 The Formulation of Support Vector Clustering
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
2.2.1 Optimize Two Lagrange Multipliers
2.2.2 Updating After A Successful Optimization Step
2.3 DIMENSION REDUCTION
2.4 FUZZY C-MEAN
2.5 FEATURE SELECTION METHODS
2.5.1 Document Frequency Thresholding
2.5.2 Information Gain
2.5.3 Mutual Information
2.5.4 χ² Statistic
2.5.5 Term Strength
2.6 MULTI-CLASS SVMs
CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
3.2 DATA PREPROCESSING
3.2.1 Part-Of-Speech Tagger
3.2.2 Stemming
3.2.3 Stop-Word Filter
3.2.4 Feature Selection
3.3 UNSUPERVISED LEARNING
3.3.1 Support Vector Clustering
3.3.2 The Choice Of Kernel Function
3.3.3 Cluster-Finding With Depth First Searching Algorithm
3.3.4 Cluster Validation
3.3.5 One-Cluster And Time-Consuming Problem
3.4 SUPERVISED LEARNING
3.4.1 Reuters Category Construction
3.4.2 The Mapping Strategy
3.4.3 Gateway Node Classifier
3.4.4 Expert Node Classifier
CHAPTER 4 EXPERIMENT DESIGN AND ANALYSIS
4.1 THE CORPUS
4.1.1 Documents
4.1.2 File Format
4.1.3 Document Internal Tags
4.1.4 Categories
4.2 DATA REPRESENTATION ANALYSIS
4.3 THE CHOICE OF KERNEL FUNCTION
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
5.1 CONCLUSIONS
5.2 FUTURE WORKS
REFERENCE
APPENDIX A Stop-word List
APPENDIX B Part-Of-Speech Tags
Figure Listing
Figure 1.1 Our Learning Processes
Figure 2.1 Different values of q yield different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48
Figure 2.2 With q fixed, different values of ν result in different cluster shapes: (a) νl=1, where l is the number of data points (b) ν=0.4
Figure 3.1 Our proposed text categorization model
Figure 3.2 Data Preprocessing Processes
Figure 3.3 Words with tagging
Figure 3.4 Representing text as a feature vector
Figure 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
Figure 3.6 Reuters basic hierarchy [D'Alessio et. al., 2000]
Figure 3.7 Our Proposed Hierarchy
Figure 4.1 Sample news stories in the Reuters-21578 corpus
Figure 4.2 One-Class SVM with polynomial kernel where d=2
Figure 4.3 One-Class SVM with Gaussian kernel where g=100, ν=0.1
Figure 4.4 Our Proposed Hierarchical Construction
Table Listing
Table 4.1 The list of categories, sorted in decreasing order of frequency
Table 4.2 Number of Training/Testing Items
Table 4.3 F₁ measure in One-Class SVM with vector dimension 10
Table 4.4 F₁ measure in One-Class SVM with vector dimension 20
Table 4.5 Precision/Recall-breakeven point on the ten most frequent Reuters categories on Basic SVMs
Table 4.6 Comparison of our proposed classifier with k-NN and decision tree
CHAPTER 1 INTRODUCTION
With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. It is used to classify news stories [Hayes et. al., 1990], to find interesting information on the WWW [Lang 1995], and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and very time consuming, it is desirable to learn classifiers from examples.
The most successful paradigm for organizing a large mass of information is to categorize the documents according to their topics. Recently, various machine learning techniques have been applied to automatic text categorization.
These approaches are usually based on the vector space model, in which documents are represented by sparse vectors, with one component for each unique word extracted from the document. Typically, the document vector is very high-dimensional, at least in the thousands for large collections. This is a major stumbling block in applying many machine learning methods, so existing techniques rely heavily on dimension reduction as a preprocessing step, which is computationally expensive. Recently, a new approach appeared with the introduction of Support Vector Machines (SVMs). This algorithm outperforms other classifiers in text categorization, and it can also be used as a clustering method.
Most of the computational experience discussed in the literature deals with hierarchies that are trees. Indeed, until recently, most problems discussed dealt with categorization within a simple (non-hierarchical) set of categories [Frakes et. al., 1992], but a few hierarchical classification methods have also been proposed recently [D'Alessio et. al., 1998 ; Wang et. al., 2001 ; Weigend et. al., 1999].
In this research, we utilize one-class SVM to perform text clustering, and use the clustering results to build a hierarchical structure. This hierarchical structure illustrates the relationships between Reuters categories.
The Reuters-21578 corpus has been studied extensively. Yang [Yang 1997a] compares 14 categorization algorithms applied to this Reuters corpus as a flat categorization problem with 135 categories. The same corpus has more recently been studied by others treating the categories as a hierarchy [Koller et. al., 1997 ; Yang 1997a ; Ng et. al., 1997]. We construct our own hierarchy through Support Vector (SV) clustering and compare the text categorization results with the state of the art in the literature.
1.1 RESEARCH MOTIVATION
In the field of document categorization, if we use only one layer of classifiers, we usually need many training samples. Such a model is usually too complicated, and its training is not accurate enough. We therefore adopt a "divide and conquer" strategy, partitioning a problem into many small, easily solved sub-problems. With this procedure, we can simplify the training processes and obtain a more accurate training model.
Heuristically, we know that each document can be in multiple categories, in exactly one, or in no category at all. By building the hierarchical structure between categories, whenever a new document comes in, we can easily assign it to the category (or categories) to which it belongs.
The Support Vector Machine has outperformed other classifiers in text classification in recent years, and it can also be used to perform text clustering. We want to perform support vector clustering on a text data set and construct a hierarchical structure over the data set.
1.2 THE APPROACH
Our proposed approach consists of three stages: data preprocessing, unsupervised learning, and supervised learning.
We represent documents as "bags of words": only the presence or absence of the words in a document is indicated, not their order or any higher-level linguistic structure. We can thus think of documents as high-dimensional vectors, with one slot for each word in the vocabulary and a value in each slot related to the number of times that word appears in the document. Note that a large fraction of the total number of slots will have the value zero for any given document, so the vectors are quite sparse.
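The bag-of-words representation just described can be sketched as follows (a minimal illustration; the function names are ours, not from the thesis):

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign one slot index to each distinct word across the collection."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_vector(doc, vocab):
    """Sparse document vector: {slot index: count}; absent slots are the zeros."""
    counts = Counter(w for w in doc.split() if w in vocab)
    return {vocab[w]: n for w, n in counts.items()}
```

Storing only the non-zero slots is what makes the high-dimensional vectors practical: the dictionary holds one entry per distinct word in the document, not one per word in the vocabulary.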
The goal of the first stage is to build the training data for the second stage; a detailed description of the methods we use is given in Section 3.2.
The second stage is unsupervised learning, in which we use one-class SVM to perform support vector clustering. We explain how we achieve text clustering with this learning algorithm in Section 3.3.
In the last stage, we use two-class SVMs and multi-class SVMs to train the tree nodes and finally obtain much more accurate tree-node classifiers. The following figure shows stages two and three:
Fig 1.1 Our Learning Process
1.3 THESIS ORGANIZATION
The content of this thesis is partitioned into five chapters and is organized as follows.
Chapter 1 introduces the motivation of our research.
Chapter 2 reviews related techniques such as support vector clustering, one-class SVM, sequential minimal optimization, text categorization, and finally feature selection methods.
Chapter 3 describes our methodologies and the system.
Chapter 4 describes the corpus we use as the test bed, along with our three experiments. Finally, we report our experimental results on SV clustering and text categorization. System performance is also analyzed by comparison with other classifiers.
Chapter 5 presents our conclusions based on the results of our experiments and offers suggestions for future research.
CHAPTER 2
LITERATURE REVIEW AND RELATED WORKS
Sections 2.1 and 2.2 illustrate the unsupervised learning methods we use: SV clustering, one-class SVM, and sequential minimal optimization. Section 2.3 covers methods of dimension reduction, and Section 2.4 is about Fuzzy C-mean.
Section 2.5 reviews feature selection methods, and Section 2.6 is about multi-class SVMs. These topics are the major techniques in our proposed model.
2.1 SUPPORT VECTOR CLUSTERING
SV clustering was derived from the study of the one-class support vector machine (SVM) [Scholkopf et. al., 2000 ; Tax et. al., 1999]. The task of one-class SVM is to estimate the support of a probability distribution. To solve it, one needs to solve a quadratic optimization problem, whose optimal solution gives the positions of the support vectors.
Ben-Hur et al. interpret the support vectors as the boundary of the clusters [Ben-Hur et. al., 2000 ; Miguel 1997 ; Duda et. al., 2000]. They proposed an algorithm that represents the support of the probability distribution of a finite data set using the support vectors. We first review the concept of one-class SVM.
2.1.1 One-Class SVM
The term one-class classification originates from Moya [Moya et. al., 1993], but the names outlier detection [Ritter et. al., 1997], novelty detection [Bishop 1995], and concept learning [Japkowicz et. al., 1995] are also used, since one-class classification arises in different applications. The one-class classification problem can be described as follows: one wants to develop an algorithm that returns a function f taking the value +1 in a small region capturing most of the data points, and −1 elsewhere.

Scholkopf et. al. [Scholkopf et. al., 2000] proposed two methods for solving the one-class classification problem:

a) Map the data into the feature space corresponding to the kernel function, and separate the data from the origin with maximum margin. This is a two-class separation problem in which the only negative example is the origin and all the training data are positive examples.

b) Search for the sphere of minimum volume that includes all the training data.
We review only the solution to the second problem. Suppose the data points are mapped from the input space to a high-dimensional feature space through a non-linear transformation function $\Phi$. One looks for the smallest sphere that encloses the image of the data. Scholkopf solved the problem in the following way: consider the smallest enclosing sphere with radius $R$; the optimization problem is:

$$\min R^2 \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 \;\; \forall j \tag{2.1}$$

where $a$ is the center of the sphere, $\|\cdot\|$ is the Euclidean norm, and $R$ is the radius of the sphere. Soft margin constraints are incorporated by adding slack variables $\xi_j$, so the optimization problem becomes:

$$\min R^2 + C\sum_j \xi_j \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j \tag{2.2}$$

with $\xi_j \ge 0$. By the introduction of Lagrange multipliers, the constrained problem becomes:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j - \sum_j \xi_j u_j + C\sum_j \xi_j \tag{2.3}$$

where $\beta_j \ge 0$ and $u_j \ge 0$ are Lagrange multipliers and $C$ is a constant. One calculates the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_j$. Setting the partial derivatives to zero, one gets:

$$\sum_j \beta_j = 1 \tag{2.4}$$

$$a = \sum_j \beta_j \Phi(x_j) \tag{2.5}$$

$$\beta_j = C - u_j \tag{2.6}$$

The Karush-Kuhn-Tucker (KKT) conditions must hold, which results in:

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \tag{2.7}$$

$$\xi_j u_j = 0 \tag{2.8}$$

Again, one can eliminate the variables $R$, $a$, and $u_j$ and turn the problem into its Wolfe dual form. The new problem is a function of the variables $\beta_j$ only:

$$W = \sum_j \beta_j\, \Phi(x_j)\cdot\Phi(x_j) - \sum_{i,j} \beta_i \beta_j\, \Phi(x_i)\cdot\Phi(x_j) \tag{2.9}$$

with the constraints

$$0 \le \beta_j \le C \tag{2.10}$$

$$\sum_j \beta_j = 1 \tag{2.11}$$

The non-linear transformation $\Phi$ is a feature map, i.e. a map into an inner product space $F$ such that the inner product in the image of $\Phi$ can be computed by evaluating some simple kernel [Boser et. al., 1992 ; Cortes et. al., 1995 ; Scholkopf et. al., 1999]. Define

$$K(x, y) \equiv \Phi(x)\cdot\Phi(y) \tag{2.12}$$

and the Gaussian kernel function,

$$K(x, y) = \exp\!\left(-q\|x - y\|^2\right) \tag{2.13}$$

The Lagrangian $W$ in (2.9) can then be written as

$$W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \tag{2.14}$$

If one solves the above problem and obtains the optimal solution $\beta_j$, one can now calculate the distance of each point $x$ to the center of the sphere:

$$R^2(x) = \|\Phi(x) - a\|^2 \tag{2.15}$$

By (2.5), one can rewrite it as

$$R^2(x) = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \tag{2.16}$$

The radius of the sphere is:

$$R = \{ R(x_i) \mid x_i \text{ is a support vector} \} \tag{2.17}$$

So if one solves the dual problem $W$, one can find all the SVs. Then, by calculating the distance between the SVs and the sphere center, the radius of the sphere is found. Considering equations (2.7) and (2.8), $\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0$ and $\xi_j u_j = 0$, there are three types of data points:

a) If $0 < \beta_j < C$, then the corresponding $\xi_j = 0$; from equation (2.7), these data lie on the sphere surface. They are called Support Vectors (SVs).

b) If $\xi_j > 0$, the corresponding $\beta_j = C$; these data points lie outside the sphere. They are called Bounded Support Vectors (BSVs).

c) All other data, with $\beta_j = 0$, lie inside the sphere.

So, from the discussion above, one can tell whether a data point is inside, outside, or on the sphere surface by the corresponding value of $\beta_j$.
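Assuming the multipliers $\beta_j$ have already been obtained from a QP solver for the dual (2.14), the distance computation (2.16) and the three-way classification of points can be sketched as follows (the function names are ours, not from the thesis):

```python
import math

def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2), eq. (2.13)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * d2)

def sphere_distance2(x, data, beta, q):
    """R^2(x) of eq. (2.16): squared feature-space distance to the sphere center."""
    k_xx = gaussian_kernel(x, x, q)          # equals 1 for the Gaussian kernel
    cross = sum(b * gaussian_kernel(xj, x, q) for xj, b in zip(data, beta))
    const = sum(bi * bj * gaussian_kernel(xi, xj, q)
                for xi, bi in zip(data, beta)
                for xj, bj in zip(data, beta))
    return k_xx - 2.0 * cross + const

def point_type(beta_j, C, eps=1e-9):
    """Classify a training point by its multiplier, cases a)-c) above."""
    if beta_j < eps:
        return "inside"      # beta_j = 0: strictly inside the sphere
    if beta_j > C - eps:
        return "BSV"         # beta_j = C: bounded support vector, outside
    return "SV"              # 0 < beta_j < C: on the sphere surface
```

The constant third term of (2.16) does not depend on $x$, so in an efficient implementation it would be computed once and cached.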
2.1.2 The Formulation of Support Vector Clustering
In the training of one-class SVM with the Gaussian kernel, we map the data into the high-dimensional feature space and find the sphere of minimum volume that encloses the image of the data. When the sphere is mapped back to the input space, it forms contours of different shapes. These contours enclose the data points and can be seen as cluster boundaries. Below we explain how to perform SV clustering.

By the use of the Gaussian kernel function, the parameters one needs to control are the kernel width q and the penalty ν. Fig 2.1 and Fig 2.2 [Ben-Hur et. al., 2000 ; Hao P. Y. et. al., 2002] illustrate how they work.

Fig 2.1 Different values of q yield different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48

In Fig 2.1, one sees that as the value of q increases, the contour boundaries become tighter and tighter, and there are more and more contours. This shows that the parameter q controls the compactness of the enclosing contours as well as their number, so one can tune the value of q from small to large (or from large to small) to obtain proper results.

The influence of the parameter ν is as follows:

Fig 2.2 With q fixed, different values of ν result in different cluster shapes: (a) νl=1, where l is the number of data points (b) ν=0.4

In Fig 2.2, one can see that as ν increases, more BSVs appear, because ν controls the percentage of outliers. So one now knows how to control the tightness of the contour boundaries, and one can also let some data points become outliers in order to form clearer contour boundaries.

The problem now is how to find these contours, as seen in Fig 2.1 and Fig 2.2. [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001] proposed a method that uses the schema of connected components, defined through an adjacency matrix. Clusters are now defined as the connected components, and one can use any graph searching algorithm to find all of them.
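The adjacency test in [Ben-Hur et. al., 2000] marks two points as connected when every sampled point on the line segment between them lies inside the sphere. A minimal sketch, assuming an `inside(y)` membership predicate built from eq. (2.16) and the radius R (the sampling count and function names are our own choices):

```python
def adjacency_matrix(points, inside, n_samples=10):
    """A[i][j] = 1 iff all sampled points on segment (x_i, x_j) stay inside."""
    n = len(points)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        for j in range(i + 1, n):
            ok = True
            for t in range(1, n_samples + 1):
                lam = t / (n_samples + 1)
                # point at fraction lam along the segment from x_i to x_j
                y = tuple(a + lam * (b - a) for a, b in zip(points[i], points[j]))
                if not inside(y):
                    ok = False
                    break
            A[i][j] = A[j][i] = 1 if ok else 0
    return A

def connected_components(A):
    """Depth-first search over the adjacency matrix; each component is a cluster."""
    n = len(A)
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(u for u in range(n) if A[v][u] and u not in seen)
        clusters.append(sorted(comp))
    return clusters
```

A segment that leaves the enclosing contour must pass through a region where $R(y) > R$, so sampling along the segment detects when two points belong to different contours.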
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
The new SVM learning algorithm is called Sequential Minimal Optimization (SMO).
This learning algorithm was proposed by John C. Platt [Platt 1998]. It solves the quadratic programming problem by breaking it into a series of smallest-possible sub-QP problems. These small sub-QP problems are solved analytically, so one does not need to perform matrix computation. The training time is thus almost between linear and quadratic in the training set size for various test problems.
Traditionally, the SVM learning algorithm uses numeric quadratic programming as an inner loop, which takes much time, scaling between linear and cubic in the data set size. SMO proceeds differently: it uses an analytic QP step, which can be divided into the two steps described in the following two sections.
2.2.1 Optimize Two Lagrange Multipliers
In solving the QP problem, every training data point corresponds to a single Lagrange multiplier, and one can judge whether a training data point is a support vector from the value of its corresponding Lagrange multiplier. The task now is to optimize two Lagrange multipliers at a time.
Platt proposed the following way [Platt 1998]: consider optimizing over $\alpha_1$ and $\alpha_2$, with all other variables fixed. The original quadratic problem

$$\min_{\alpha} \;\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu l},\;\; \sum_i \alpha_i = 1 \tag{2.18}$$

can be reduced to

$$\min_{\alpha_1, \alpha_2} \;\frac{1}{2}\sum_{i,j=1}^{2} \alpha_i \alpha_j K_{ij} + \alpha_1 C_1 + \alpha_2 C_2 \tag{2.19}$$

with

$$C_i = \sum_{j=3}^{l} \alpha_j K_{ij}$$

subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}, \quad \sum_{i=1}^{2} \alpha_i = \Delta \tag{2.20}$$

where $\Delta = 1 - \sum_{i=3}^{l} \alpha_i$. Since the constant part of the objective does nothing to $\alpha_1$ and $\alpha_2$, one can eliminate it and, using $\alpha_1 = \Delta - \alpha_2$, obtain the new form:

$$\min_{\alpha_2} \;\frac{1}{2}\left[ (\Delta - \alpha_2)^2 K_{11} + 2(\Delta - \alpha_2)\alpha_2 K_{12} + \alpha_2^2 K_{22} \right] + (\Delta - \alpha_2) C_1 + \alpha_2 C_2 \tag{2.21}$$

with the derivative

$$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2 \tag{2.22}$$

Setting the derivative to zero gives

$$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2 K_{12}} \tag{2.23}$$

Once $\alpha_2$ is found, $\alpha_1$ can be calculated from (2.20). If the new point $(\alpha_1, \alpha_2)$ is outside of $[0, 1/(\nu l)]$, the constrained optimum is found by projecting $\alpha_2$ from (2.23) into the region allowed by the constraints and then re-computing $\alpha_1$. The offset $\rho$ is recomputed at every such step.
2.2.2 Updating After A Successful Optimization Step
Let $\alpha_1^*, \alpha_2^*$ be the values of the Lagrange parameters after the step in 2.2.1; the corresponding output is [Platt 1998]

$$O_i = \alpha_1^* K_{1i} + \alpha_2^* K_{2i} + C_i \tag{2.24}$$

Combined with (2.23), one obtains the update equation for $\alpha_2$ in which $\alpha_1$ no longer appears:

$$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2 K_{12}} \tag{2.25}$$
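The analytic pair update of eqs. (2.20)-(2.23) can be sketched as follows (a simplified illustration of one optimization step, not Platt's full SMO with its working-set heuristics; the names are ours):

```python
def smo_pair_step(K, C_lin, delta, i, j, upper):
    """Analytically optimize the pair (alpha_i, alpha_j), all others fixed.

    K      : precomputed kernel matrix (list of lists)
    C_lin  : linear coefficients C_i = sum over the other points, as in (2.19)
    delta  : alpha_i + alpha_j, fixed by the equality constraint (2.20)
    upper  : box bound 1/(nu*l)
    Returns the new (alpha_i, alpha_j).
    """
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if eta <= 0:
        # degenerate pair: the objective is linear in alpha_j, pick a bound
        a_j = 0.0 if C_lin[j] > C_lin[i] else min(delta, upper)
    else:
        # unconstrained optimum, eq. (2.23)
        a_j = (delta * (K[i][i] - K[i][j]) + C_lin[i] - C_lin[j]) / eta
        # project into the box allowed by 0 <= alpha <= upper and the sum constraint
        lo = max(0.0, delta - upper)
        hi = min(upper, delta)
        a_j = min(max(a_j, lo), hi)
    a_i = delta - a_j
    return a_i, a_j
```

Because the equality constraint ties the pair together, clipping $\alpha_j$ to $[\max(0, \Delta - 1/\nu l),\ \min(1/\nu l, \Delta)]$ keeps both multipliers feasible at once.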
2.3 DIMENSION REDUCTION
The problem of dimension reduction is introduced as a way to overcome the curse of the dimensionality when dealing with vector data in high-dimensional spaces and as a modeling tool for such data [Miguel 1997].
In most applications, dimension reduction is carried out as a preprocessing step. The selection of dimensions using principal component analysis (PCA) [Duda et. al., 2000 ; Jolliffe 1986] through singular value decomposition (SVD) [Golub et. al., 1996] is a popular approach for numerical attributes.
PCA is possibly the most widely used technique to perform dimension reduction. Consider a sample

$$\{x_i\}_{i=1}^{n} \subset \mathbb{R}^D \tag{2.26}$$

with mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2.27}$$

and covariance matrix

$$\Sigma = E\{(x - \bar{x})(x - \bar{x})^T\} \tag{2.28}$$

with spectral decomposition

$$\Sigma = U \Lambda U^T \tag{2.29}$$

The principal component transformation

$$y = U^T (x - \bar{x}) \tag{2.30}$$

yields a reference system in which the sample has mean 0 and diagonal covariance matrix $\Lambda$ containing the eigenvalues of $\Sigma$; the variables are now uncorrelated. One can discard the variables with small variance, i.e. project onto the subspace spanned by the first L principal components, and obtain a good approximation to the original sample.

2.4 FUZZY C-MEAN
Clustering is one of the most fundamental issues in pattern recognition. It plays a key
role in searching for structures in data. Given a finite set of data, the problem is to find several cluster centers that can properly characterize relevant classes of the set.
Fuzzy C-mean is based on fuzzy c-partitions; the algorithm is as follows [Georgej et. al., 1995]:

Step 1. Let t = 0. Select an initial fuzzy pseudopartition $P^{(0)}$.

Step 2. Calculate the c cluster centers $v_1^{(t)}, \dots, v_c^{(t)}$ for $P^{(t)}$ and the chosen value of $m \in (1, \infty)$.

Step 3. Update $P^{(t+1)}$ by the following procedure: for each $x_k \in X$, if

$$\|x_k - v_i^{(t)}\|^2 > 0 \tag{2.31}$$

for all $i \in N_c$, then define

$$A_i^{(t+1)}(x_k) = \left[ \sum_{j=1}^{c} \left( \frac{\|x_k - v_i^{(t)}\|^2}{\|x_k - v_j^{(t)}\|^2} \right)^{\frac{1}{m-1}} \right]^{-1} \tag{2.32}$$

If $\|x_k - v_i^{(t)}\|^2 = 0$ for some $i \in I \subseteq N_c$, then define $A_i^{(t+1)}(x_k)$ for $i \in I$ by any nonnegative real numbers satisfying

$$\sum_{i \in I} A_i^{(t+1)}(x_k) = 1 \tag{2.33}$$

and define $A_i^{(t+1)}(x_k) = 0$ for $i \in N_c - I$.

Step 4. Compare $P^{(t)}$ and $P^{(t+1)}$. If $\|P^{(t)} - P^{(t+1)}\| \le \varepsilon$, then stop; otherwise, increase t by one and return to Step 2.

The most obvious disadvantage of the FCM algorithm is that one needs to guess the number of cluster centers. In our implementation, we know how many clusters we need, so this is not a big problem for us.
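The steps above can be sketched as follows (a minimal implementation of Steps 1-4; the random initialization, the weighted-mean center update, and the stopping details are our own choices):

```python
import random

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-mean following Steps 1-4 above; returns (memberships, centers)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Step 1: random initial fuzzy pseudopartition, each row sums to 1
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # Step 2: centers v_i = sum_k U_ik^m x_k / sum_k U_ik^m
        V = []
        for i in range(c):
            w = [U[k][i] ** m for k in range(n)]
            tot = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(n)) / tot
                      for j in range(d)])
        # Step 3: membership update, eq. (2.32)
        U_new = []
        for k in range(n):
            d2 = [sum((X[k][j] - V[i][j]) ** 2 for j in range(d))
                  for i in range(c)]
            if min(d2) == 0.0:
                # point coincides with a center: put all mass on those centers
                row = [1.0 if d2[i] == 0.0 else 0.0 for i in range(c)]
                row = [r / sum(row) for r in row]
            else:
                row = [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1))
                                 for j in range(c))
                       for i in range(c)]
            U_new.append(row)
        # Step 4: stop when the partition no longer changes much
        diff = max(abs(U_new[k][i] - U[k][i])
                   for k in range(n) for i in range(c))
        U = U_new
        if diff <= eps:
            break
    return U, V
```

With well-separated data, the memberships quickly polarize toward 0 or 1, which is why FCM can serve as a soft replacement for hard c-means.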
2.5 FEATURE SELECTION METHODS
In text categorization one is usually confronted with feature spaces containing 10000 dimensions and more, often exceeding the number of available training examples.
Many have noted the need for feature selection to make the use of conventional learning methods possible, to improve generalization accuracy, and to avoid
“overfitting” [Joachims 1998].
The most popular approach to feature selection is to select a subset of the available features using methods like Document Frequency Thresholding [Yang et. al., 1997b], Information Gain, the χ² statistic [Schutze et. al., 1995], Mutual Information, and Term Strength. The most commonly used, and often most effective [Yang et. al., 1997b], method for selecting features is the information gain criterion. A short description of these methods is given below.

2.5.1 Document Frequency Thresholding
Document frequency is the number of documents in which a word occurs. In Document Frequency Thresholding, one computes the document frequency for each word in the training corpus and removes those words whose document frequency is less than some predetermined threshold. The basic assumption is that rare words are either non-informative for category prediction or not influential in global performance. In either case, removal of rare words reduces the dimensionality of the feature space. Improvement in categorization accuracy is also possible if rare words happen to be noise words.

2.5.2 Information Gain
Information Gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a word in a document. Let $c_1, c_2, \dots, c_M$ denote the set of categories in the target space. The information gain of word w is defined to be:

$$IG(w) = -\sum_{k=1}^{M} P(c_k)\log P(c_k) + P(w)\sum_{k=1}^{M} P(c_k \mid w)\log P(c_k \mid w) + P(\bar{w})\sum_{k=1}^{M} P(c_k \mid \bar{w})\log P(c_k \mid \bar{w}) \tag{2.34}$$

where

$P(c_k)$: the fraction of documents in the total collection that belong to class $c_k$.

$P(w)$: the fraction of documents in which the word w occurs.

$P(c_k \mid w)$: the fraction of documents containing the word w that belong to class $c_k$.

$P(c_k \mid \bar{w})$: the fraction of documents not containing the word w that belong to class $c_k$.

The information gain is computed for each word of the training set, and the words whose information gain is less than some predetermined threshold are removed.

2.5.3 Mutual Information
Mutual Information considers the two-way contingency table of a word w and a category c. The mutual information between w and c is defined to be:

$$MI(w, c) = \log \frac{P(w \wedge c)}{P(w) \times P(c)} \tag{2.35}$$

and is estimated using

$$MI(w, c) \approx \log \frac{A \times N}{(A + C) \times (A + B)} \tag{2.36}$$

where A is the number of times w and c co-occur, B is the number of times w occurs without c, C is the number of times c occurs without w, and N is the total number of documents.

2.5.4 χ² Statistic
The χ² statistic measures the lack of independence between w and class c. It is defined to be:

$$\chi^2(w, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{2.37}$$

where A is the number of times w and c co-occur, B is the number of times w occurs without c, C is the number of times c occurs without w, and D is the number of times neither w nor c occurs; N is still the total number of documents. Two different measures can be computed based on the χ² statistic:

$$\chi^2_{avg}(w) = \sum_{k=1}^{M} P(c_k)\, \chi^2(w, c_k) \tag{2.38}$$

or

$$\chi^2_{max}(w) = \max_{k=1}^{M} \chi^2(w, c_k)$$
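The χ² statistic of eq. (2.37) can be computed directly from the four contingency counts (a minimal sketch; the function name is ours):

```python
def chi_square(A, B, C, D):
    """chi^2(w, c) from the contingency counts of eq. (2.37).

    A: docs containing w in class c      B: docs containing w outside c
    C: docs in c without w               D: docs with neither w nor c
    """
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0
```

When w and c are independent, $AD = CB$ and the statistic is 0; under perfect association (B = C = 0) it reaches its maximum value N.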
=2.5.5 Term Strength
Term Strength estimates word importance based on how commonly a word is
likely to appear in "closely-related" documents. It uses a training set of documents to derive document pairs whose similarity is above a threshold. Term Strength is computed based on the estimated conditional probability that a word occurs in the second half of a pair of related documents given that it occurs in the first half. Let x and y be an arbitrary pair of distinct but related documents, and w be a word; then the strength of the word is defined to be:

$$TS(w) = P(w \in y \mid w \in x) \tag{2.39}$$
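As a concrete instance of these selection criteria, the information gain of eq. (2.34) can be computed from document counts as follows (a sketch using base-2 logarithms; the function and variable names are ours, not from the thesis):

```python
import math

def information_gain(docs, labels, w):
    """IG(w) of eq. (2.34) estimated from a labeled corpus.

    docs: list of token sets; labels: parallel list of category ids.
    """
    N = len(docs)
    cats = set(labels)
    has_w = [w in d for d in docs]
    n_w = sum(has_w)
    p_w, p_not_w = n_w / N, (N - n_w) / N

    def plogp(p):
        return p * math.log(p, 2) if p > 0 else 0.0

    ig = 0.0
    for c in cats:                               # -sum_k P(c_k) log P(c_k)
        n_c = sum(1 for y in labels if y == c)
        ig -= plogp(n_c / N)
    # + P(w) sum_k P(c_k|w) log P(c_k|w)  and the same for the absent case
    for present, p_side, n_side in ((True, p_w, n_w), (False, p_not_w, N - n_w)):
        if n_side == 0:
            continue
        for c in cats:
            n_cw = sum(1 for h, y in zip(has_w, labels)
                       if h == present and y == c)
            ig += p_side * plogp(n_cw / n_side)
    return ig
```

A word that perfectly separates two equally frequent categories yields IG = 1 bit, while a word distributed independently of the categories yields IG = 0.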
2.6 MULTI-CLASS SVMS
There are many methods for SVMs to solve the multi-class classification problems.
One approach is to consider the problem as a two class classification problem. There are two ways to solve multi-class SVMs in this approach [Chih-Wei, et. al., 2002 ; Weston et. al., 1998], they are:
a) one-against-one classifiers.
b) one-against-the-rest classifiers.
In a), suppose there are k classes to be classified; this method constructs $\frac{1}{2}k(k-1)$ SVM models. Each classifier is trained on two classes; for training on the i-th class and the j-th class, one solves the following two-class classification problem:

$$\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \;\; \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_t \xi_t^{ij} \tag{2.40}$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, \quad \text{if } y_t = i \tag{2.41}$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, \quad \text{if } y_t = j \tag{2.42}$$

$$\xi_t^{ij} \ge 0 \tag{2.43}$$

The testing can be implemented in many ways; one of them is the so-called "Max Wins" voting strategy: if the result of $(w^{ij})^T \Phi(x_t) + b^{ij}$ says that the test data is in the i-th class, then the vote for class i is increased by one; otherwise, the vote for class j is increased by one. One then predicts that the test data belongs to the class with the largest vote. In general, the one-against-one method takes much time to accomplish the training work, especially when there are many classes to train on. But in real implementations, if one wants better performance, one has little choice but to use it.

In b), the one-against-the-rest method, if one has k classes, one needs to train only k SVM models, so this method spends much less time than the one-against-one method. The i-th classifier is trained with positive labels on the i-th class and negative labels on all the other classes. So if one has l training data $(x_1, y_1), \dots, (x_l, y_l)$ with $x_i \in \mathbb{R}^N$, $i = 1, \dots, l$, and $y_i \in \{1, \dots, k\}$, then the i-th classifier solves the following problem:

$$\min_{w^{i},\, b^{i},\, \xi^{i}} \;\; \frac{1}{2}(w^{i})^T w^{i} + C \sum_t \xi_t^{i} \tag{2.44}$$

$$(w^{i})^T \Phi(x_j) + b^{i} \ge 1 - \xi_j^{i}, \quad \text{if } y_j = i \tag{2.45}$$

$$(w^{i})^T \Phi(x_j) + b^{i} \le -1 + \xi_j^{i}, \quad \text{if } y_j \ne i \tag{2.46}$$

$$\xi_j^{i} \ge 0, \quad j = 1, \dots, l \tag{2.47}$$

The testing is the same as with two-class SVMs. In general, the performance is usually not as good as the one-against-one method.
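The "Max Wins" voting rule described in a) can be sketched as follows, assuming the $\frac{1}{2}k(k-1)$ pairwise decision functions are given (the names are ours, not from the thesis):

```python
from itertools import combinations

def max_wins_predict(x, pairwise, k):
    """'Max Wins' voting over the k(k-1)/2 one-against-one decision functions.

    pairwise[(i, j)](x) > 0 is taken to mean 'x belongs to class i'.
    """
    votes = [0] * k
    for i, j in combinations(range(k), 2):
        if pairwise[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    # predict the class with the largest vote
    return max(range(k), key=lambda c: votes[c])
```

Any pairwise classifiers can be plugged in; the test below uses simple nearest-center decision functions in place of trained SVMs.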
CHAPTER 3
TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
There are three main stages in our proposed model:
a) Data preprocessing stage.
b) Unsupervised learning stage.
c) Supervised learning stage.
The first stage includes Part-Of-Speech (POS) tagging, word stemming, stop-word removal, and feature selection. Through these processes, we transform the original raw data into normalized data that can be used in the second stage.
The second stage covers the processes for performing SV clustering: the choice of kernel function, how we find clusters, and the strategy for judging cluster validity. We also discuss the problems we faced and the solutions we used.
The last stage concerns the training of the internal and expert node classifiers, which come from the clustering result of the second stage. First we construct a mapping strategy between the raw data and the cluster centers, and then train all the classifiers with the one-against-one or one-against-the-rest training methods.
All the main components and procedures are illustrated in Figure 3.1:
Figure 3.1 Our proposed text categorization model
The behavior of each component is described in detail as follows:
3.2 DATA PREPROCESSING
There are four main procedures in the data preprocessing stage:
Fig 3.2 Data Preprocessing Processes
3.2.1 Part-Of-Speech Tagger
In this procedure, a POS tagger [Brill 1994] is introduced to provide POS information.
Each news article is first tagged so that every word carries an appropriate part-of-speech (POS) tag. In general, news articles are mostly composed of natural language text that expresses human thought. In this thesis, we consider that concepts, which express human thought, are mostly conveyed by noun keywords. Therefore, the POS tagger module provides proper POS tags for the feature selection function. Furthermore, POS tags give important information for deciding contextual relationships between words. In Figure 3.2, this tagger provides noun words to the next module, the stemmer. In this way, the module employs natural language technology to help analyze news articles; consequently, it is considered a language model.
For natural language understanding, assigning POS tags to a sentence provides the information needed for further syntactic analysis. The tagger employed in this thesis is the rule-based POS tagger proposed by Eric Brill in 1992, which learns lexical and contextual rules for tagging words. The precision of Brill's tagger has been reported to be higher than 90% [Brill 1995]. There are 37 POS tags in total, listed in APPENDIX B. As mentioned above, we select noun words only; the corresponding noun tags are NN, NNS, NNP, and NNPS. The following are examples of words after POS tagging.
N.10/CD S.1/CD
"I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN American/NNP Express/NNP is/VBZ
Fig 3.3 Words with tagging
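As a minimal illustration (the token/TAG format follows Figure 3.3; the helper name and tag handling are our own, not part of Brill's tagger), selecting noun words from tagged text might look like:

```python
# Keep only words whose POS tag marks a noun (NN, NNS, NNP, NNPS),
# given tokens in the "word/TAG" format produced by a Brill-style tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_nouns(tagged_text):
    nouns = []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag in NOUN_TAGS:
            nouns.append(word)
    return nouns

tagged = "I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN American/NNP Express/NNP"
print(extract_nouns(tagged))  # ['I', 'American', 'Express']
```

The noun list is then handed to the stemmer module as described above.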
3.2.2 Stemming
Frequently, the user specifies a word in a query but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past tense suffixes are examples of syntactical variations which prevent a perfect match between a query word and a respective document word [Ricardo et. al., 1999]. This problem can be partially overcome with the substitution of the words by their respective stems.
A stem is the portion of a word that is left after the removal of its affixes. A typical example is the stem "calcul", which is shared by the variants calculate, calculation, calculating, calculated, and calculations. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure, because the number of distinct index terms is reduced [Ricardo et. al., 1999].
Because most variants of a word are generated by the introduction of suffixes, and because suffix removal is intuitive, simple, and can be implemented efficiently, several well-known suffix-removal algorithms exist. The most popular is Porter's, so we use the Porter algorithm [Porter 1980] for word stemming.
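As a rough sketch of the idea (this is a deliberately simplified suffix stripper, not the actual five-phase Porter algorithm with its measure conditions; the suffix list and minimum stem length are illustrative):

```python
# A much-simplified suffix-removal stemmer in the spirit of Porter's
# algorithm. The real algorithm applies five ordered rule phases with
# conditions on the stem's "measure"; this sketch only strips a single
# suffix from a fixed list, longest match first.
SUFFIXES = ["ations", "ation", "ating", "ated", "ates", "ing", "ed", "es", "s"]

def simple_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["calculation", "calculating", "calculated", "calculations"]:
    print(w, "->", simple_stem(w))  # all four map to the stem "calcul"
```

Even this crude rule set collapses the example variants to one index term, which is the effect stemming is meant to achieve.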
3.2.3 Stop-Word Filter
Words that are too frequent among the documents in the collection are not good discriminators. In fact, a word that occurs in 80% of the documents in the collection is useless for retrieval purposes. Such words are frequently referred to as stop-words and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stop-words.
Elimination of stop-words has an additional important benefit. It reduces the size of the indexing structure considerably. In fact, it is typical to obtain compression in the size of the indexing structure of 40% or more solely with the elimination of stop-words [Ricardo et. al., 1999].
Since stop-words elimination also provides for compression of the indexing structure, the list of stop-words might be extended to include words other than articles, prepositions, and conjunctions. For example, some verbs, adverbs, and adjectives
could be treated as stop-words. In this thesis, a list of 306 stop-words is used; the detailed list can be found in the appendix of this thesis.
The stop-word filter takes noun words as input. A few noun words contribute little to what the author wants to express in the document; they are merely auxiliary words that complete the natural language text. We call these stop words, and for this reason they must be filtered out to keep noise out of the analysis.
After the stop words are filtered, the remaining noun words still cannot immediately be assumed to be fully related to what the author wants to express. Given common writing habits, it is believed that a word whose frequency of occurrence is too low or too high is not important or representative.
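A minimal sketch of this filtering step might look as follows (the stop-word list and frequency thresholds here are illustrative, not the 306-word list or the cutoffs used in the thesis):

```python
# Filter out stop words, then drop words whose document frequency is
# too low or too high to be representative (thresholds illustrative).
def filter_terms(docs, stop_words, min_df=2, max_df_ratio=0.8):
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    n = len(docs)
    keep = {w for w, c in df.items()
            if w not in stop_words and c >= min_df and c / n <= max_df_ratio}
    return [[w for w in doc if w in keep] for doc in docs]

docs = [["the", "bank", "rate"], ["the", "bank", "loan"],
        ["the", "rate", "cut"], ["a", "stock", "rise"], ["a", "stock", "fall"]]
print(filter_terms(docs, {"the", "a"}))
# [['bank', 'rate'], ['bank'], ['rate'], ['stock'], ['stock']]
```

Here "the" and "a" are removed by the stop list, while "loan", "cut", "rise", and "fall" are removed because their document frequency falls below the minimum.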
3.2.4 Feature Selection
In many supervised learning problems, feature selection is important for a variety of reasons: generalization performance, running time requirements, and interpretational issues imposed by the problem itself.
One approach to feature selection is to select a subset of the available features.
This small feature subset should still retain the essential information of the original attributes. There are several criteria [Meisel 1972]:
(1) low dimensionality
(2) retention of sufficient information
(3) enhancement of distance in pattern space as a measure of the similarity of physical patterns, and
(4) consistency of features throughout the sample.
Our test bed is the Reuters data set; a complete description is given in Section 4.1. We choose features for each category and use them to represent a document, following the vector space model from the information retrieval field. The feature selection method we
adopt is a frequency-based method, the so-called TF-IDF weighting:

$$ w_{t,d} = \frac{tf_{t,d}}{\max_{t} tf_{t,d}} \times \log \frac{N}{n_t} \qquad (3.1) $$

where $tf_{t,d}$ is the number of times the word $t$ occurs in document $d$, $n_t$ is the number of documents in which the word $t$ occurs, and $N$ is the total number of documents.
From Section 3.2.1 to Section 3.2.4, we perform the preprocessing processes.
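Equation 3.1 can be sketched directly in Python (the toy corpus is illustrative):

```python
import math

# TF-IDF weighting as in Equation 3.1:
#   w(t,d) = (tf(t,d) / max_t tf(t,d)) * log(N / n_t)
def tfidf_weights(docs):
    n_t = {}                              # document frequency of each term
    for doc in docs:
        for t in set(doc):
            n_t[t] = n_t.get(t, 0) + 1
    N = len(docs)
    weights = []
    for doc in docs:
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        max_tf = max(tf.values())         # most frequent term in this document
        weights.append({t: (f / max_tf) * math.log(N / n_t[t])
                        for t, f in tf.items()})
    return weights

docs = [["trade", "tariff", "trade"], ["trade", "oil"]]
w = tfidf_weights(docs)
# "trade" occurs in every document, so log(N / n_t) = log(2/2) = 0
print(w[0])
```

Note how a term that appears in every document receives zero weight, which is exactly the discriminative behavior the IDF factor is designed to produce.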
The original text document is now represented as a vector as the following Figure shows.
Fig 3.4 Representing text as a feature vector.
These vectors are all $1 \times m$ dimensional, where $m$ is the total number of features we select for each category. We then use them as the training data in the unsupervised learning stage.
3.3 UNSUPERVISED LEARNING
In this stage, the aim is to construct the hierarchical news categories. To perform SV clustering, we first choose a kernel function that maps the training data into a high-dimensional feature space. By tuning the main parameters, we generate connected components of different shapes and in different numbers. The task is then to find all the connected components using a searching algorithm [Ellis et. al., 1995]; we use an adjacency matrix together with the Depth-First Search (DFS) algorithm. After finding all the connected components, we perform cluster validity checking; the strategy we use is discussed in Section 3.3.4.
From experience, finding the connected components is time-consuming. To mitigate this, we apply sampling and dimension reduction to the raw data: Fuzzy C-Means [Georgej et. al., 1995] for sampling and Principal Component Analysis [Jolliffe 1986] for dimension reduction. We return to this later.
In the following, we describe the procedures of SV clustering and the approaches used to solve the problems we faced.
3.3.1 Support Vector Clustering
SV clustering helps us construct a Reuters hierarchy. The learning algorithm we use is the one proposed by Scholkopf [Scholkopf et. al., 2000]. The QP problem is optimized using Sequential Minimal Optimization (SMO) [Platt 1998]. As already mentioned, the training time of this algorithm scales between linearly and quadratically with the size of the training set. It is much faster than other existing learning algorithms, and the SMO learning algorithm can easily be modified to fit our one-class SVM.
We now turn to the procedure for performing SV clustering. As mentioned in Section 2.1.2, performing SV clustering requires choosing proper values of $q$ and $\nu$. The choice of $q$ decides the compactness of the enclosing sphere, and hence the number of clusters, while the choice of $\nu$ helps us handle overlapping clusters. The SV clustering processes are as follows:
Fig 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
(The flowchart proceeds as follows: given an unlabeled data set $X = \{x_1, x_2, \ldots, x_n\} \in R^d$, choose a kernel function; with $\nu$ fixed, increase $q$ from 0; for the given $(q, \nu)$, use the adjacency matrix and DFS to find all connected components; check cluster validity; if valid clusters exist ($\geq 2$), stop; otherwise fix $q$ and increase $\nu$; if $q$ is exhausted and no valid clustering is found, stop.)
We explain the above procedures as follows:
3.3.2 The Choice of Kernel Function
In 1992, Vapnik and colleagues [Boser et. al., 1992] showed that the order of operations for constructing a decision function can be interchanged: instead of applying a non-linear transformation to the input vectors and then taking dot products with SVs in feature space, one can first compare two vectors in input space and then apply a non-linear transformation to the resulting value.
Commonly used kernel functions are as follows:
a) Gaussian RBF kernel: $K(x, y) = \exp(-q\,\|x - y\|^2)$ (3.2)
b) Polynomial kernel: $K(x, y) = (x \cdot y + 1)^d$ (3.3)
c) Sigmoid kernel: $K(x, y) = \tanh(x \cdot y - \theta)$ (3.4)
We use only the Gaussian kernel, since other kernel functions, such as the polynomial kernel, do not yield tight contour representations of a cluster [Tax et. al., 1999], and we will show that the Gaussian kernel is indeed the best choice for SV clustering in Section 4.3.
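The three kernels of Equations 3.2-3.4 can be written in a few lines (the parameter values below are illustrative only):

```python
import math

# The three kernels of Equations 3.2-3.4; q, d, and theta are the
# kernel parameters.
def gaussian_kernel(x, y, q):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * sq_dist)

def polynomial_kernel(x, y, d):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def sigmoid_kernel(x, y, theta):
    return math.tanh(sum(a * b for a, b in zip(x, y)) - theta)

x, y = [1.0, 0.0], [0.0, 1.0]
print(gaussian_kernel(x, x, q=2.0))  # 1.0: K(x, x) = 1 for any q
print(gaussian_kernel(x, y, q=2.0))
```

For the Gaussian kernel, $K(x, x) = 1$ for every point regardless of $q$, so all images lie on the unit sphere in feature space; this is part of why it yields the tight, closed cluster contours that SV clustering relies on.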
3.3.3 Cluster-Finding with Depth First Searching Algorithm
We use graph theory to explain the clustering result. Every enclosing sphere is a connected component, and data points in the same connected component are adjacent.
What we do now is to find out all the connected components.
Define an adjacency matrix $A_{ij}$ between pairs of points $x_i$ and $x_j$:

$$ A_{ij} = \begin{cases} 1 & \text{if } R(y) \leq R \text{ for all } y \text{ on the line segment connecting } x_i \text{ and } x_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.5) $$
We can now tell whether two data points are adjacent; next we need to find all adjacent data points belonging to the same connected component. The algorithm we adopt is the Depth-First Search (DFS) algorithm. Every training data point, even a BSV, belongs to one connected component, and DFS finds out which connected component each data point belongs to.
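The adjacency test of Equation 3.5 can be sketched as follows (the radius function and sphere radius are placeholders for the trained one-class SVM's distance-to-center function and radius, and the ring-shaped toy contour is purely illustrative; in practice the segment is checked at a finite number of sample points):

```python
# Adjacency test of Equation 3.5: x_i and x_j are adjacent iff every
# sampled point y on the segment between them satisfies R(y) <= R.
# `radius_fn` and `R` stand in for the trained model's quantities.
def adjacent(xi, xj, radius_fn, R, n_samples=10):
    for k in range(1, n_samples + 1):
        t = k / (n_samples + 1)
        y = [a + t * (b - a) for a, b in zip(xi, xj)]
        if radius_fn(y) > R:
            return False
    return True

# Toy non-convex contour: a ring of radius 1 with tolerance 0.3.
ring = lambda y: abs(sum(v * v for v in y) ** 0.5 - 1.0)
print(adjacent([1.0, 0.0], [0.0, 1.0], ring, 0.3))   # True: same component
print(adjacent([1.0, 0.0], [-1.0, 0.0], ring, 0.3))  # False: crosses the hole
```

The second pair fails because the straight segment passes through the ring's hole, where $R(y) > R$; this is exactly how the test separates points lying in different connected components.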
The connected component and DFS algorithm are as follows [黃曲江 1989 ; Ellis 1995]:
procedure ConnectedComponents (adjacencyList: HeaderList; n: integer);
var
mark: array[VertexType] of integer;
{ Each vertex will be marked with the number of the component it is in.}
v: VertexType;
componentNumber: integer;
procedure DFS(v:VertexType);
{Does a depth-first search beginning at the vertex v}
var
w: VertexType;
ptr: NodePointer;
begin
mark[v] := componentNumber;
ptr := adjacencyList[v];
while ptr <> nil do begin
w := ptr^.vertex;
output(v, w);
if mark[w] = 0 then DFS(w);
ptr := ptr^.link
end {while}
end {DFS}
begin {ConnectedComponents}
{Initialize mark array.}
for v:=1 to n do mark[v] :=0 end;
{Find and number the connected components.}
componentNumber := 0;
for v := 1 to n do if mark[v]=0 then
componentNumber := componentNumber +1;
output heading for a new component;
DFS(v)
end { if v was unmarked}
end {for}
end {ConnectedComponents}
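The Pascal procedure above can be sketched in Python, taking an adjacency matrix such as the $A_{ij}$ of Equation 3.5 as input:

```python
# Label connected components by depth-first search over an adjacency
# matrix; mark[v] holds the component number of vertex v, exactly as
# in the Pascal procedure above (components are numbered from 1).
def connected_components(A):
    n = len(A)
    mark = [0] * n
    component = 0

    def dfs(v):
        mark[v] = component
        for w in range(n):
            if A[v][w] and mark[w] == 0:
                dfs(w)

    for v in range(n):
        if mark[v] == 0:
            component += 1
            dfs(v)
    return mark

# Two clusters: {0, 1} and {2, 3}.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(connected_components(A))  # [1, 1, 2, 2]
```

Each distinct label in the returned array corresponds to one enclosing contour, i.e. one cluster.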
3.3.4 Cluster Validation
When should the clustering procedure stop? It is natural to use the number of SVs as an indication of a meaningful solution [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001].
At first we start with fixed