National Cheng Kung University
Department of Computer Science and Information Engineering
Master's Thesis
Hierarchical Text Categorization Using One-Class SVM
Student: Yi-Kun Tu
Advisor: Dr. Jung-Hsien Chiang
July 2003
Hierarchical Text Categorization Using One-Class SVM
Yi-Kun Tu* Jung-Hsien Chiang**
Institute of Computer Science and Information Engineering, National Cheng Kung University
Chinese Abstract
With the rapid growth of information, automatic text categorization has become an important analysis technique for processing and organizing data. Experience shows that as the number of categories we handle increases, performance measures such as precision and recall decline accordingly. Adopting a hierarchical category structure makes it possible to handle problems involving large amounts of data.
In this study, we use the one-class support vector machine to perform document clustering, and use the clustering results to build a hierarchical structure that describes the relationships among categories. We then use two-class and multi-class support vector machines for supervised classification training.
Through three experiments, we investigate the characteristics of the system built on the one-class SVM and compare it with other approaches. The experimental results show that the proposed system achieves better performance.
*Author **Advisor
Hierarchical Text Categorization Using One-Class SVM
Yi-Kun Tu* Jung-Hsien Chiang**
Department of Computer Science & Information Engineering, National Cheng Kung University
Abstract
With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Experience to date has demonstrated that both precision and recall decrease as the number of categories increases. Hierarchical categorization affords the ability to deal with very large problems.
We utilize one-class SVM to perform support vector clustering, and then use the clustering results to construct a category hierarchy. This hierarchy describes the relationships between categories. Two-class and multi-class SVMs are used to perform the supervised classification.
We explore the one-class SVM model through three experiments. Performance is analyzed by comparison with other approaches; the experimental results show that the proposed category hierarchy works well.
*Author **Advisor
Acknowledgements
The completion of this thesis owes first of all to my advisor, Professor Jung-Hsien Chiang, for his attentive guidance and encouragement over these two years and for providing an excellent learning environment. I thank God for placing my girlfriend 淑華 at my side as my greatest support and encouragement. 沛毅 was my greatest knowledge base, never stinting with guidance whenever I had a question, and my senior classmate 宗賢 was the best backup, always offering the best advice when I needed it most.
Of course, there are many more people to thank; I keep you all in my heart. Thank you for your support and encouragement.
Midsummer 2003, Yi-Kun Tu, ISMP IIR LAB, Institute of Computer Science and Information Engineering, NCKU
Contents
CHINESE ABSTRACT
ABSTRACT
FIGURE LISTING
TABLE LISTING
CHAPTER 1 INTRODUCTION
1.1 RESEARCH MOTIVATION
1.2 THE APPROACH
1.3 THESIS ORGANIZATION
CHAPTER 2 LITERATURE REVIEW AND RELATED WORKS
2.1 SUPPORT VECTOR CLUSTERING
2.1.1 One-Class SVM
2.1.2 The Formulation of Support Vector Clustering
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
2.2.1 Optimize Two Lagrange Multipliers
2.2.2 Updating After A Successful Optimization Step
2.3 DIMENSION REDUCTION
2.4 FUZZY C-MEAN
2.5 FEATURE SELECTION METHODS
2.5.1 Document Frequency Thresholding
2.5.2 Information Gain
2.5.3 Mutual Information
2.5.4 χ² Statistic
2.5.5 Term Strength
2.6 MULTI-CLASS SVMs
CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
3.2 DATA PREPROCESSING
3.2.1 Part-Of-Speech Tagger
3.2.2 Stemming
3.2.3 Stop-Word Filter
3.2.4 Feature Selection
3.3 UNSUPERVISED LEARNING
3.3.1 Support Vector Clustering
3.3.2 The Choice Of Kernel Function
3.3.3 Cluster-Finding With Depth First Searching Algorithm
3.3.4 Cluster Validation
3.3.5 One-Cluster And Time-Consuming Problem
3.4 SUPERVISED LEARNING
3.4.1 Reuters Category Construction
3.4.2 The Mapping Strategy
3.4.3 Gateway Node Classifier
3.4.4 Expert Node Classifier
CHAPTER 4 EXPERIMENT DESIGN AND ANALYSIS
4.1 THE CORPUS
4.1.1 Documents
4.1.2 File Format
4.1.3 Document Internal Tags
4.1.4 Categories
4.2 DATA REPRESENTATION ANALYSIS
4.3 THE CHOICE OF KERNEL FUNCTION
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
5.1 CONCLUSIONS
5.2 FUTURE WORKS
REFERENCE
APPENDIX A Stop-word List
APPENDIX B Part-Of-Speech Tags
Figure Listing
Figure 1.1 Our Learning Processes
Figure 2.1 Different values of q yield different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48
Figure 2.2 With q fixed, different values of ν result in different cluster shapes: (a) νl=1, where l is the number of data points (b) ν=0.4
Figure 3.1 Our proposed text categorization model
Figure 3.2 Data Preprocessing Processes
Figure 3.3 Words with tagging
Figure 3.4 Representing text as a feature vector
Figure 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
Figure 3.6 Reuters basic hierarchy [D'Alessio et. al., 2000]
Figure 3.7 Our Proposed Hierarchy
Figure 4.1 Sample news stories in the Reuters-21578 corpus
Figure 4.2 One-Class SVM with polynomial kernel where d=2
Figure 4.3 One-Class SVM with Gaussian kernel where g=100, ν=0.1
Figure 4.4 Our Proposed Hierarchical Construction
Table Listing
Table 4.1 The list of categories, sorted in decreasing order of frequency
Table 4.2 Number of Training/Testing Items
Table 4.3 F₁ measure in One-Class SVM with vector dimension 10
Table 4.4 F₁ measure in One-Class SVM with vector dimension 20
Table 4.5 Precision/Recall-breakeven point on the ten most frequent Reuters categories on Basic SVMs
Table 4.6 Comparison of our proposed classifier with k-NN and decision tree
CHAPTER 1 INTRODUCTION
With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. It is used to classify news stories [Hayes et. al., 1990], to find interesting information on the WWW [Lang 1995], and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and very time consuming, it is desirable to learn classifiers from examples.
The most successful paradigm for organizing a large mass of information is to categorize the documents according to their topics. Recently, various machine learning techniques have been applied to automatic text categorization.
These approaches are usually based on the vector space model, in which documents are represented by sparse vectors, with one component for each unique word extracted from the document. Typically, the document vector is very high-dimensional, at least in the thousands for large collections. This is a major stumbling block in applying many machine learning methods, so existing techniques rely heavily on dimension reduction as a preprocessing step, which is computationally expensive. Recently, a new approach appeared with the introduction of Support Vector Machines (SVMs). This algorithm outperforms other classifiers in text categorization, and it can also be used as a clustering method.
Most of the computational experience discussed in the literature deals with hierarchies that are trees. Indeed, until recently, most problems discussed dealt with categorization within a simple (non-hierarchical) set of categories [Frakes et. al., 1992], but a few hierarchical classification methods have also been proposed recently [D'Alessio et. al., 1998 ; Wang et. al., 2001 ; Weigend et. al., 1999].
In this research, we utilize one-class SVM to perform text clustering, and use the clustering results to build a hierarchical structure. This hierarchical structure illustrates the relationships between Reuters categories.
The Reuters-21578 corpus has been studied extensively. Yang [Yang 1997a] compares 14 categorization algorithms applied to this Reuters corpus as a flat categorization problem with 135 categories. The same corpus has more recently been studied by others treating the categories as a hierarchy [Koller et. al., 1997 ; Yang 1997a ; Ng et. al., 1997]. We construct our own hierarchy through Support Vector (SV) clustering and compare the text categorization results with the state of the art in the literature.
1.1 RESEARCH MOTIVATION
In the field of document categorization, if we use only one layer of classifiers, we usually need many training samples. Such a model is usually too complicated, and its training is not accurate enough. We therefore adopt a "divide and conquer" strategy, partitioning a problem into many small, easily solved sub-problems. With this procedure, we can simplify the training processes and obtain a more accurate training model.
Heuristically, we know that each document can be in multiple categories, in exactly one, or in no category at all. By building the hierarchical structure between categories, whenever a new document comes in, we can easily assign it to the category (or categories) to which it belongs.
The Support Vector Machine has outperformed other classifiers in text classification in recent years, and it can also be used to perform text clustering. We want to perform support vector clustering on a text data set and construct a hierarchical structure over the data set.
1.2 THE APPROACH
Our proposed approach consists of three stages: data preprocessing, unsupervised learning, and supervised learning.
We represent documents as "bags of words": only the presence or absence of the words in a document is indicated, not their order or any higher-level linguistic structure. We can thus think of documents as high-dimensional vectors, with one slot for each word in the vocabulary and a value in each slot related to the number of times that word appears in the document. Note that a large fraction of the total number of slots will have the value zero for any given document, so the vectors are quite sparse.
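The bag-of-words representation just described can be sketched as follows (a minimal illustration; the function names are ours, not from the thesis):

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign one slot index to each distinct word across the collection."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_vector(doc, vocab):
    """Sparse document vector: {slot index: count}; absent slots are the zeros."""
    counts = Counter(w for w in doc.split() if w in vocab)
    return {vocab[w]: n for w, n in counts.items()}
```

Storing only the non-zero slots is what makes the high-dimensional vectors practical: the dictionary holds one entry per distinct word in the document, not one per word in the vocabulary.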
The goal of the first stage is to build the training data for the second stage; a detailed description of the methods we use is given in Section 3.2.
The second stage is unsupervised learning, in which we use one-class SVM to perform support vector clustering. We explain how we achieve text clustering with this learning algorithm in Section 3.3.
In the last stage, we use two-class SVMs and multi-class SVMs to train the tree nodes and finally obtain much more accurate tree-node classifiers. The following figure shows stages two and three:
Fig 1.1 Our Learning Process
1.3 THESIS ORGANIZATION
The content of this thesis is partitioned into five chapters and is organized as follows.
Chapter 1 introduces the motivation of our research.
Chapter 2 reviews related techniques such as support vector clustering, one-class SVM, sequential minimal optimization, text categorization, and finally feature selection methods.
Chapter 3 describes our methodologies and the system.
Chapter 4 describes the corpus we use as the test bed, along with our three experiments. Finally, we report our experimental results on SV clustering and text categorization. System performance is also analyzed by comparison with other classifiers.
Chapter 5 presents our conclusions based on the results of our experiments and offers suggestions for future research.
CHAPTER 2
LITERATURE REVIEW AND RELATED WORKS
Sections 2.1 and 2.2 illustrate the unsupervised learning methods we use: SV clustering, one-class SVM, and sequential minimal optimization. Section 2.3 covers methods of dimension reduction, and Section 2.4 is about Fuzzy C-mean.
Section 2.5 reviews feature selection methods, and Section 2.6 is about multi-class SVMs. These topics are the major techniques in our proposed model.
2.1 SUPPORT VECTOR CLUSTERING
SV clustering was derived from the study of the one-class support vector machine (SVM) [Scholkopf et. al., 2000 ; Tax et. al., 1999]. The task of one-class SVM is to estimate the support of a probability distribution. To solve it, one needs to solve a quadratic optimization problem, whose optimal solution gives the positions of the support vectors.
Ben-Hur et al. interpret the support vectors as the boundary of the clusters [Ben-Hur et. al., 2000 ; Miguel 1997 ; Duda et. al., 2000]. They proposed an algorithm that represents the support of the probability distribution of a finite data set using the support vectors. We first review the concept of one-class SVM.
2.1.1 One-Class SVM
The term one-class classification originates from Moya [Moya et. al., 1993], but the names outlier detection [Ritter et. al., 1997], novelty detection [Bishop 1995], and concept learning [Japkowicz et. al., 1995] are also used, since one-class classification arises in different applications. The one-class classification problem can be described as follows: one wants to develop an algorithm that returns a function f taking the value +1 in a small region capturing most of the data points, and −1 elsewhere.

Scholkopf et. al. [Scholkopf et. al., 2000] proposed two methods for solving the one-class classification problem:

a) Map the data into the feature space corresponding to the kernel function, and separate the data from the origin with maximum margin. This is a two-class separation problem in which the only negative example is the origin and all the training data are positive examples.

b) Search for the sphere of minimum volume that includes all the training data.
We review only the solution to the second problem. Suppose the data points are mapped from the input space to a high-dimensional feature space through a non-linear transformation function $\Phi$. One looks for the smallest sphere that encloses the image of the data. Scholkopf solved the problem in the following way: consider the smallest enclosing sphere with radius $R$; the optimization problem is:

$$\min R^2 \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 \;\; \forall j \tag{2.1}$$

where $a$ is the center of the sphere, $\|\cdot\|$ is the Euclidean norm, and $R$ is the radius of the sphere. Soft margin constraints are incorporated by adding slack variables $\xi_j$, so the optimization problem becomes:

$$\min R^2 + C\sum_j \xi_j \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j \tag{2.2}$$

with $\xi_j \ge 0$. By the introduction of Lagrange multipliers, the constrained problem becomes:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j - \sum_j \xi_j u_j + C\sum_j \xi_j \tag{2.3}$$

where $\beta_j \ge 0$ and $u_j \ge 0$ are Lagrange multipliers and $C$ is a constant. One calculates the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_j$. Setting the partial derivatives to zero, one gets:

$$\sum_j \beta_j = 1 \tag{2.4}$$

$$a = \sum_j \beta_j \Phi(x_j) \tag{2.5}$$

$$\beta_j = C - u_j \tag{2.6}$$

The Karush-Kuhn-Tucker (KKT) conditions must hold, which results in:

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \tag{2.7}$$

$$\xi_j u_j = 0 \tag{2.8}$$

Again, one can eliminate the variables $R$, $a$, and $u_j$ and turn the problem into its Wolfe dual form. The new problem is a function of the variables $\beta_j$ only:

$$W = \sum_j \beta_j\, \Phi(x_j)\cdot\Phi(x_j) - \sum_{i,j} \beta_i \beta_j\, \Phi(x_i)\cdot\Phi(x_j) \tag{2.9}$$

with the constraints

$$0 \le \beta_j \le C \tag{2.10}$$

$$\sum_j \beta_j = 1 \tag{2.11}$$

The non-linear transformation $\Phi$ is a feature map, i.e. a map into an inner product space $F$ such that the inner product in the image of $\Phi$ can be computed by evaluating some simple kernel [Boser et. al., 1992 ; Cortes et. al., 1995 ; Scholkopf et. al., 1999]. Define

$$K(x, y) \equiv \Phi(x)\cdot\Phi(y) \tag{2.12}$$

and the Gaussian kernel function,

$$K(x, y) = \exp\!\left(-q\|x - y\|^2\right) \tag{2.13}$$

The Lagrangian $W$ in (2.9) can then be written as

$$W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \tag{2.14}$$

If one solves the above problem and obtains the optimal solution $\beta_j$, one can now calculate the distance of each point $x$ to the center of the sphere:

$$R^2(x) = \|\Phi(x) - a\|^2 \tag{2.15}$$

By (2.5), one can rewrite it as

$$R^2(x) = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \tag{2.16}$$

The radius of the sphere is:

$$R = \{ R(x_i) \mid x_i \text{ is a support vector} \} \tag{2.17}$$

So if one solves the dual problem $W$, one can find all the SVs. Then, by calculating the distance between the SVs and the sphere center, the radius of the sphere is found. Considering equations (2.7) and (2.8), $\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0$ and $\xi_j u_j = 0$, there are three types of data points:

a) If $0 < \beta_j < C$, then the corresponding $\xi_j = 0$; from equation (2.7), these data lie on the sphere surface. They are called Support Vectors (SVs).

b) If $\xi_j > 0$, the corresponding $\beta_j = C$; these data points lie outside the sphere. They are called Bounded Support Vectors (BSVs).

c) All other data, with $\beta_j = 0$, lie inside the sphere.

So, from the discussion above, one can tell whether a data point is inside, outside, or on the sphere surface by the corresponding value of $\beta_j$.
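Assuming the multipliers $\beta_j$ have already been obtained from a QP solver for the dual (2.14), the distance computation (2.16) and the three-way classification of points can be sketched as follows (the function names are ours, not from the thesis):

```python
import math

def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2), eq. (2.13)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * d2)

def sphere_distance2(x, data, beta, q):
    """R^2(x) of eq. (2.16): squared feature-space distance to the sphere center."""
    k_xx = gaussian_kernel(x, x, q)          # equals 1 for the Gaussian kernel
    cross = sum(b * gaussian_kernel(xj, x, q) for xj, b in zip(data, beta))
    const = sum(bi * bj * gaussian_kernel(xi, xj, q)
                for xi, bi in zip(data, beta)
                for xj, bj in zip(data, beta))
    return k_xx - 2.0 * cross + const

def point_type(beta_j, C, eps=1e-9):
    """Classify a training point by its multiplier, cases a)-c) above."""
    if beta_j < eps:
        return "inside"      # beta_j = 0: strictly inside the sphere
    if beta_j > C - eps:
        return "BSV"         # beta_j = C: bounded support vector, outside
    return "SV"              # 0 < beta_j < C: on the sphere surface
```

The constant third term of (2.16) does not depend on $x$, so in an efficient implementation it would be computed once and cached.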
2.1.2 The Formulation of Support Vector Clustering
In the training of one-class SVM with the Gaussian kernel, we map the data into the high-dimensional feature space and find the sphere of minimum volume that encloses the image of the data. When the sphere is mapped back to the input space, it forms contours of different shapes. These contours enclose the data points and can be seen as cluster boundaries. Below we explain how to perform SV clustering.

By the use of the Gaussian kernel function, the parameters one needs to control are the kernel width q and the penalty ν. Fig 2.1 and Fig 2.2 [Ben-Hur et. al., 2000 ; Hao P. Y. et. al., 2002] illustrate how they work.

Fig 2.1 Different values of q yield different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48

In Fig 2.1, one sees that as the value of q increases, the contour boundaries become tighter and tighter, and there are more and more contours. This shows that the parameter q controls the compactness of the enclosing contours as well as their number, so one can tune the value of q from small to large (or from large to small) to obtain proper results.

The influence of the parameter ν is as follows:

Fig 2.2 With q fixed, different values of ν result in different cluster shapes: (a) νl=1, where l is the number of data points (b) ν=0.4

In Fig 2.2, one can see that as ν increases, more BSVs appear, because ν controls the percentage of outliers. So one now knows how to control the tightness of the contour boundaries, and one can also let some data points become outliers in order to form clearer contour boundaries.

The problem now is how to find these contours, as seen in Fig 2.1 and Fig 2.2. [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001] proposed a method that uses the schema of connected components, defined through an adjacency matrix. Clusters are now defined as the connected components, and one can use any graph searching algorithm to find all of them.
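The adjacency test in [Ben-Hur et. al., 2000] marks two points as connected when every sampled point on the line segment between them lies inside the sphere. A minimal sketch, assuming an `inside(y)` membership predicate built from eq. (2.16) and the radius R (the sampling count and function names are our own choices):

```python
def adjacency_matrix(points, inside, n_samples=10):
    """A[i][j] = 1 iff all sampled points on segment (x_i, x_j) stay inside."""
    n = len(points)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        for j in range(i + 1, n):
            ok = True
            for t in range(1, n_samples + 1):
                lam = t / (n_samples + 1)
                # point at fraction lam along the segment from x_i to x_j
                y = tuple(a + lam * (b - a) for a, b in zip(points[i], points[j]))
                if not inside(y):
                    ok = False
                    break
            A[i][j] = A[j][i] = 1 if ok else 0
    return A

def connected_components(A):
    """Depth-first search over the adjacency matrix; each component is a cluster."""
    n = len(A)
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(u for u in range(n) if A[v][u] and u not in seen)
        clusters.append(sorted(comp))
    return clusters
```

A segment that leaves the enclosing contour must pass through a region where $R(y) > R$, so sampling along the segment detects when two points belong to different contours.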
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
The new SVM learning algorithm is called Sequential Minimal Optimization (SMO).
This learning algorithm was proposed by John C. Platt [Platt 1998]. It solves the quadratic programming problem by breaking it into a series of smallest-possible sub-QP problems. These small sub-QP problems are solved analytically, so one does not need to perform matrix computation. The training time is thus almost between linear and quadratic in the training set size for various test problems.
Traditionally, the SVM learning algorithm uses numeric quadratic programming as an inner loop, which takes much time, scaling between linear and cubic in the data set size. SMO proceeds differently: it uses an analytic QP step, which can be divided into the two steps described in the following two sections.
2.2.1 Optimize Two Lagrange Multipliers
In solving the QP problem, every training data point corresponds to a single Lagrange multiplier, and one can judge whether a training data point is a support vector from the value of its corresponding Lagrange multiplier. The task now is to optimize two Lagrange multipliers at a time.
Platt proposed the following way [Platt 1998]: consider optimizing over $\alpha_1$ and $\alpha_2$, with all other variables fixed. The original quadratic problem

$$\min_{\alpha} \;\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu l},\;\; \sum_i \alpha_i = 1 \tag{2.18}$$

can be reduced to

$$\min_{\alpha_1, \alpha_2} \;\frac{1}{2}\sum_{i,j=1}^{2} \alpha_i \alpha_j K_{ij} + \alpha_1 C_1 + \alpha_2 C_2 \tag{2.19}$$

with

$$C_i = \sum_{j=3}^{l} \alpha_j K_{ij}$$

subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}, \quad \sum_{i=1}^{2} \alpha_i = \Delta \tag{2.20}$$

where $\Delta = 1 - \sum_{i=3}^{l} \alpha_i$. Since the constant part of the objective does nothing to $\alpha_1$ and $\alpha_2$, one can eliminate it and, using $\alpha_1 = \Delta - \alpha_2$, obtain the new form:

$$\min_{\alpha_2} \;\frac{1}{2}\left[ (\Delta - \alpha_2)^2 K_{11} + 2(\Delta - \alpha_2)\alpha_2 K_{12} + \alpha_2^2 K_{22} \right] + (\Delta - \alpha_2) C_1 + \alpha_2 C_2 \tag{2.21}$$

with the derivative

$$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2 \tag{2.22}$$

Setting the derivative to zero gives

$$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2 K_{12}} \tag{2.23}$$

Once $\alpha_2$ is found, $\alpha_1$ can be calculated from (2.20). If the new point $(\alpha_1, \alpha_2)$ is outside of $[0, 1/(\nu l)]$, the constrained optimum is found by projecting $\alpha_2$ from (2.23) into the region allowed by the constraints and then re-computing $\alpha_1$. The offset $\rho$ is recomputed at every such step.
2.2.2 Updating After A Successful Optimization Step
Let $\alpha_1^*, \alpha_2^*$ be the values of the Lagrange parameters after the step in 2.2.1; the corresponding output is [Platt 1998]

$$O_i = \alpha_1^* K_{1i} + \alpha_2^* K_{2i} + C_i \tag{2.24}$$

Combined with (2.23), one obtains the update equation for $\alpha_2$ in which $\alpha_1$ no longer appears:

$$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2 K_{12}} \tag{2.25}$$
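The analytic pair update of eqs. (2.20)-(2.23) can be sketched as follows (a simplified illustration of one optimization step, not Platt's full SMO with its working-set heuristics; the names are ours):

```python
def smo_pair_step(K, C_lin, delta, i, j, upper):
    """Analytically optimize the pair (alpha_i, alpha_j), all others fixed.

    K      : precomputed kernel matrix (list of lists)
    C_lin  : linear coefficients C_i = sum over the other points, as in (2.19)
    delta  : alpha_i + alpha_j, fixed by the equality constraint (2.20)
    upper  : box bound 1/(nu*l)
    Returns the new (alpha_i, alpha_j).
    """
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if eta <= 0:
        # degenerate pair: the objective is linear in alpha_j, pick a bound
        a_j = 0.0 if C_lin[j] > C_lin[i] else min(delta, upper)
    else:
        # unconstrained optimum, eq. (2.23)
        a_j = (delta * (K[i][i] - K[i][j]) + C_lin[i] - C_lin[j]) / eta
        # project into the box allowed by 0 <= alpha <= upper and the sum constraint
        lo = max(0.0, delta - upper)
        hi = min(upper, delta)
        a_j = min(max(a_j, lo), hi)
    a_i = delta - a_j
    return a_i, a_j
```

Because the equality constraint ties the pair together, clipping $\alpha_j$ to $[\max(0, \Delta - 1/\nu l),\ \min(1/\nu l, \Delta)]$ keeps both multipliers feasible at once.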
2.3 DIMENSION REDUCTION
The problem of dimension reduction is introduced as a way to overcome the curse of the dimensionality when dealing with vector data in high-dimensional spaces and as a modeling tool for such data [Miguel 1997].
In most applications, dimension reduction is carried out as a preprocessing step. The selection of dimensions using principal component analysis (PCA) [Duda et. al., 2000 ; Jolliffe 1986] through singular value decomposition (SVD) [Golub et. al., 1996] is a popular approach for numerical attributes.
PCA is possibly the most widely used technique to perform dimension reduction. Consider a sample

$$\{x_i\}_{i=1}^{n} \subset \mathbb{R}^D \tag{2.26}$$

with mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2.27}$$

and covariance matrix

$$\Sigma = E\{(x - \bar{x})(x - \bar{x})^T\} \tag{2.28}$$

with spectral decomposition

$$\Sigma = U \Lambda U^T \tag{2.29}$$

The principal component transformation

$$y = U^T (x - \bar{x}) \tag{2.30}$$

yields a reference system in which the sample has mean 0 and diagonal covariance matrix $\Lambda$ containing the eigenvalues of $\Sigma$; the variables are now uncorrelated. One can discard the variables with small variance, i.e. project onto the subspace spanned by the first L principal components, and obtain a good approximation to the original sample.

2.4 FUZZY C-MEAN
Clustering is one of the most fundamental issues in pattern recognition. It plays a key
role in searching for structures in data. Given a finite set of data, the problem is to find several cluster centers that can properly characterize relevant classes of the set.
Fuzzy C-mean is based on fuzzy c-partitions; the algorithm is as follows [Georgej et. al., 1995]:

Step 1. Let t = 0. Select an initial fuzzy pseudopartition $P^{(0)}$.

Step 2. Calculate the c cluster centers $v_1^{(t)}, \dots, v_c^{(t)}$ for $P^{(t)}$ and the chosen value of $m \in (1, \infty)$.

Step 3. Update $P^{(t+1)}$ by the following procedure: for each $x_k \in X$, if

$$\|x_k - v_i^{(t)}\|^2 > 0 \tag{2.31}$$

for all $i \in N_c$, then define

$$A_i^{(t+1)}(x_k) = \left[ \sum_{j=1}^{c} \left( \frac{\|x_k - v_i^{(t)}\|^2}{\|x_k - v_j^{(t)}\|^2} \right)^{\frac{1}{m-1}} \right]^{-1} \tag{2.32}$$

If $\|x_k - v_i^{(t)}\|^2 = 0$ for some $i \in I \subseteq N_c$, then define $A_i^{(t+1)}(x_k)$ for $i \in I$ by any nonnegative real numbers satisfying

$$\sum_{i \in I} A_i^{(t+1)}(x_k) = 1 \tag{2.33}$$

and define $A_i^{(t+1)}(x_k) = 0$ for $i \in N_c - I$.

Step 4. Compare $P^{(t)}$ and $P^{(t+1)}$. If $\|P^{(t)} - P^{(t+1)}\| \le \varepsilon$, then stop; otherwise, increase t by one and return to Step 2.

The most obvious disadvantage of the FCM algorithm is that one needs to guess the number of cluster centers. In our implementation, we know how many clusters we need, so this is not a big problem for us.
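The steps above can be sketched as follows (a minimal implementation of Steps 1-4; the random initialization, the weighted-mean center update, and the stopping details are our own choices):

```python
import random

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-mean following Steps 1-4 above; returns (memberships, centers)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Step 1: random initial fuzzy pseudopartition, each row sums to 1
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # Step 2: centers v_i = sum_k U_ik^m x_k / sum_k U_ik^m
        V = []
        for i in range(c):
            w = [U[k][i] ** m for k in range(n)]
            tot = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(n)) / tot
                      for j in range(d)])
        # Step 3: membership update, eq. (2.32)
        U_new = []
        for k in range(n):
            d2 = [sum((X[k][j] - V[i][j]) ** 2 for j in range(d))
                  for i in range(c)]
            if min(d2) == 0.0:
                # point coincides with a center: put all mass on those centers
                row = [1.0 if d2[i] == 0.0 else 0.0 for i in range(c)]
                row = [r / sum(row) for r in row]
            else:
                row = [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1))
                                 for j in range(c))
                       for i in range(c)]
            U_new.append(row)
        # Step 4: stop when the partition no longer changes much
        diff = max(abs(U_new[k][i] - U[k][i])
                   for k in range(n) for i in range(c))
        U = U_new
        if diff <= eps:
            break
    return U, V
```

With well-separated data, the memberships quickly polarize toward 0 or 1, which is why FCM can serve as a soft replacement for hard c-means.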
2.5 FEATURE SELECTION METHODS
In text categorization one is usually confronted with feature spaces containing 10000 dimensions and more, often exceeding the number of available training examples.
Many have noted the need for feature selection to make the use of conventional learning methods possible, to improve generalization accuracy, and to avoid
“overfitting” [Joachims 1998].
The most popular approach to feature selection is to select a subset of the available features using methods like Document Frequency Thresholding [Yang et. al., 1997b], Information Gain, the χ² statistic [Schutze et. al., 1995], Mutual Information, and Term Strength. The most commonly used, and often most effective [Yang et. al., 1997b], method for selecting features is the information gain criterion. A short description of these methods is given below.

2.5.1 Document Frequency Thresholding
Document frequency is the number of documents in which a word occurs. In Document Frequency Thresholding, one computes the document frequency for each word in the training corpus and removes those words whose document frequency is less than some predetermined threshold. The basic assumption is that rare words are either non-informative for category prediction or not influential in global performance. In either case, removal of rare words reduces the dimensionality of the feature space. Improvement in categorization accuracy is also possible if rare words happen to be noise words.

2.5.2 Information Gain
Information Gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a word in a document. Let $c_1, c_2, \dots, c_M$ denote the set of categories in the target space. The information gain of word w is defined to be:

$$IG(w) = -\sum_{k=1}^{M} P(c_k)\log P(c_k) + P(w)\sum_{k=1}^{M} P(c_k \mid w)\log P(c_k \mid w) + P(\bar{w})\sum_{k=1}^{M} P(c_k \mid \bar{w})\log P(c_k \mid \bar{w}) \tag{2.34}$$

where

$P(c_k)$: the fraction of documents in the total collection that belong to class $c_k$.

$P(w)$: the fraction of documents in which the word w occurs.

$P(c_k \mid w)$: the fraction of documents containing the word w that belong to class $c_k$.

$P(c_k \mid \bar{w})$: the fraction of documents not containing the word w that belong to class $c_k$.

The information gain is computed for each word of the training set, and the words whose information gain is less than some predetermined threshold are removed.

2.5.3 Mutual Information
Mutual Information considers the two-way contingency table of a word w and a category c. The mutual information between w and c is defined to be:

$$MI(w, c) = \log \frac{P(w \wedge c)}{P(w) \times P(c)} \tag{2.35}$$

and is estimated using

$$MI(w, c) \approx \log \frac{A \times N}{(A + C) \times (A + B)} \tag{2.36}$$

where A is the number of times w and c co-occur, B is the number of times w occurs without c, C is the number of times c occurs without w, and N is the total number of documents.

2.5.4 χ² Statistic
The χ² statistic measures the lack of independence between w and class c. It is defined to be:

$$\chi^2(w, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{2.37}$$

where A is the number of times w and c co-occur, B is the number of times w occurs without c, C is the number of times c occurs without w, and D is the number of times neither w nor c occurs; N is still the total number of documents. Two different measures can be computed based on the χ² statistic:

$$\chi^2_{avg}(w) = \sum_{k=1}^{M} P(c_k)\, \chi^2(w, c_k) \tag{2.38}$$

or

$$\chi^2_{max}(w) = \max_{k=1}^{M} \chi^2(w, c_k)$$
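The χ² statistic of eq. (2.37) can be computed directly from the four contingency counts (a minimal sketch; the function name is ours):

```python
def chi_square(A, B, C, D):
    """chi^2(w, c) from the contingency counts of eq. (2.37).

    A: docs containing w in class c      B: docs containing w outside c
    C: docs in c without w               D: docs with neither w nor c
    """
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0
```

When w and c are independent, $AD = CB$ and the statistic is 0; under perfect association (B = C = 0) it reaches its maximum value N.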
=2.5.5 Term Strength
Term Strength estimates word importance based on how commonly a word is
likely to appear in "closely-related" documents. It uses a training set of documents to derive document pairs whose similarity is above a threshold. Term Strength is computed based on the estimated conditional probability that a word occurs in the second half of a pair of related documents given that it occurs in the first half. Let x and y be an arbitrary pair of distinct but related documents, and w be a word; then the strength of the word is defined to be:

$$TS(w) = P(w \in y \mid w \in x) \tag{2.39}$$
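As a concrete instance of these selection criteria, the information gain of eq. (2.34) can be computed from document counts as follows (a sketch using base-2 logarithms; the function and variable names are ours, not from the thesis):

```python
import math

def information_gain(docs, labels, w):
    """IG(w) of eq. (2.34) estimated from a labeled corpus.

    docs: list of token sets; labels: parallel list of category ids.
    """
    N = len(docs)
    cats = set(labels)
    has_w = [w in d for d in docs]
    n_w = sum(has_w)
    p_w, p_not_w = n_w / N, (N - n_w) / N

    def plogp(p):
        return p * math.log(p, 2) if p > 0 else 0.0

    ig = 0.0
    for c in cats:                               # -sum_k P(c_k) log P(c_k)
        n_c = sum(1 for y in labels if y == c)
        ig -= plogp(n_c / N)
    # + P(w) sum_k P(c_k|w) log P(c_k|w)  and the same for the absent case
    for present, p_side, n_side in ((True, p_w, n_w), (False, p_not_w, N - n_w)):
        if n_side == 0:
            continue
        for c in cats:
            n_cw = sum(1 for h, y in zip(has_w, labels)
                       if h == present and y == c)
            ig += p_side * plogp(n_cw / n_side)
    return ig
```

A word that perfectly separates two equally frequent categories yields IG = 1 bit, while a word distributed independently of the categories yields IG = 0.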
2.6 MULTI-CLASS SVMS
There are many methods for SVMs to solve the multi-class classification problems.
One approach is to consider the problem as a two class classification problem. There are two ways to solve multi-class SVMs in this approach [Chih-Wei, et. al., 2002 ; Weston et. al., 1998], they are:
a) one-against-one classifiers.
b) one-against-the-rest classifiers.
In a), suppose there are k classes to be classified; this method constructs $\frac{1}{2}k(k-1)$ SVM models. Each classifier is trained on two classes; for training on the i-th class and the j-th class, one solves the following two-class classification problem:

$$\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \;\; \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_t \xi_t^{ij} \tag{2.40}$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, \quad \text{if } y_t = i \tag{2.41}$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, \quad \text{if } y_t = j \tag{2.42}$$

$$\xi_t^{ij} \ge 0 \tag{2.43}$$

The testing can be implemented in many ways; one of them is the so-called "Max Wins" voting strategy: if the result of $(w^{ij})^T \Phi(x_t) + b^{ij}$ says that the test data is in the i-th class, then the vote for class i is increased by one; otherwise, the vote for class j is increased by one. One then predicts that the test data belongs to the class with the largest vote. In general, the one-against-one method takes much time to accomplish the training work, especially when there are many classes to train on. But in real implementations, if one wants better performance, one has little choice but to use it.

In b), the one-against-the-rest method, if one has k classes, one needs to train only k SVM models, so this method spends much less time than the one-against-one method. The i-th classifier is trained with positive labels on the i-th class and negative labels on all the other classes. So if one has l training data $(x_1, y_1), \dots, (x_l, y_l)$ with $x_i \in \mathbb{R}^N$, $i = 1, \dots, l$, and $y_i \in \{1, \dots, k\}$, then the i-th classifier solves the following problem:

$$\min_{w^{i},\, b^{i},\, \xi^{i}} \;\; \frac{1}{2}(w^{i})^T w^{i} + C \sum_t \xi_t^{i} \tag{2.44}$$

$$(w^{i})^T \Phi(x_j) + b^{i} \ge 1 - \xi_j^{i}, \quad \text{if } y_j = i \tag{2.45}$$

$$(w^{i})^T \Phi(x_j) + b^{i} \le -1 + \xi_j^{i}, \quad \text{if } y_j \ne i \tag{2.46}$$

$$\xi_j^{i} \ge 0, \quad j = 1, \dots, l \tag{2.47}$$

The testing is the same as with two-class SVMs. In general, the performance is usually not as good as the one-against-one method.
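The "Max Wins" voting rule described in a) can be sketched as follows, assuming the $\frac{1}{2}k(k-1)$ pairwise decision functions are given (the names are ours, not from the thesis):

```python
from itertools import combinations

def max_wins_predict(x, pairwise, k):
    """'Max Wins' voting over the k(k-1)/2 one-against-one decision functions.

    pairwise[(i, j)](x) > 0 is taken to mean 'x belongs to class i'.
    """
    votes = [0] * k
    for i, j in combinations(range(k), 2):
        if pairwise[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    # predict the class with the largest vote
    return max(range(k), key=lambda c: votes[c])
```

Any pairwise classifiers can be plugged in; the test below uses simple nearest-center decision functions in place of trained SVMs.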
CHAPTER 3
TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
There are three main stages in our proposed model:
a) Data preprocessing stage.
b) Unsupervised learning stage.
c) Supervised learning stage.
The first stage includes Part-Of-Speech (POS) tagging, word stemming, stop-word removal, and feature selection. Through these processes, we transform the original raw data into normalized data that can be used in the second stage.
The second stage covers the processes for performing SV clustering: the choice of kernel function, how we find clusters, and the strategy for judging cluster validity. We also discuss the problems we faced and the solutions we used.
The last stage concerns the training of the internal and expert node classifiers, which come from the clustering result of the second stage. First we construct a mapping strategy between the raw data and the cluster centers, and then train all the classifiers with the one-against-one or one-against-the-rest training methods.
All the main components and procedures are illustrated in Figure 3.1:
Figure 3.1 Our proposed text categorization model
The behavior of each component is described in detail as follows:
3.2 DATA PREPROCESSING
There are four main procedures in the data preprocessing stage:
Fig 3.2 Data Preprocessing Processes
3.2.1 Part-Of-Speech Tagger
In this procedure, a POS tagger [Brill 1994] is introduced to provide POS information.
Each news article is first tagged so that every word carries an appropriate part-of-speech (POS) tag. In general, news articles are mostly composed of natural language text that expresses human thought. In this thesis, we consider that concepts, which express human thought, are mostly conveyed by noun keywords. Therefore, the POS tagger module provides proper POS tags for the feature selection function. Furthermore, POS tags give important information for deciding contextual relationships between words. In Figure 3.2, this tagger provides noun words to the next module, the stemmer. In this way, the module employs natural language technology to help analyze news articles; consequently, it is considered a language model.
For natural language understanding, assigning POS tags to a sentence provides the information needed for further syntactic analysis. The tagger employed in this thesis is the rule-based POS tagger proposed by Eric Brill in 1992, which learns lexical and contextual rules for tagging words. The precision of Brill's tagger has been reported to be higher than 90% [Brill 1995]. There are 37 POS tags in total, listed in APPENDIX B. As mentioned above, we select noun words only; the corresponding noun tags are NN, NNS, NNP, and NNPS. The following are examples of words after POS tagging.
N.10/CD S.1/CD
"I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN American/NNP Express/NNP is/VBZ
Fig 3.3 Words with tagging
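As a minimal illustration (the token/TAG format follows Figure 3.3; the helper name and tag handling are our own, not part of Brill's tagger), selecting noun words from tagged text might look like:

```python
# Keep only words whose POS tag marks a noun (NN, NNS, NNP, NNPS),
# given tokens in the "word/TAG" format produced by a Brill-style tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_nouns(tagged_text):
    nouns = []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag in NOUN_TAGS:
            nouns.append(word)
    return nouns

tagged = "I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN American/NNP Express/NNP"
print(extract_nouns(tagged))  # ['I', 'American', 'Express']
```

The noun list is then handed to the stemmer module as described above.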
3.2.2 Stemming
Frequently, the user specifies a word in a query but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past tense suffixes are examples of syntactical variations which prevent a perfect match between a query word and a respective document word [Ricardo et. al., 1999]. This problem can be partially overcome with the substitution of the words by their respective stems.
A stem is the portion of a word that is left after the removal of its affixes. A typical example is the stem "calcul", which is shared by the variants calculate, calculation, calculating, calculated, and calculations. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure, because the number of distinct index terms is reduced [Ricardo et. al., 1999].
Because most variants of a word are generated by the introduction of suffixes, and because suffix removal is intuitive, simple, and can be implemented efficiently, several well-known suffix-removal algorithms exist. The most popular is Porter's, so we use the Porter algorithm [Porter 1980] for word stemming.
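As a rough sketch of the idea (this is a deliberately simplified suffix stripper, not the actual five-phase Porter algorithm with its measure conditions; the suffix list and minimum stem length are illustrative):

```python
# A much-simplified suffix-removal stemmer in the spirit of Porter's
# algorithm. The real algorithm applies five ordered rule phases with
# conditions on the stem's "measure"; this sketch only strips a single
# suffix from a fixed list, longest match first.
SUFFIXES = ["ations", "ation", "ating", "ated", "ates", "ing", "ed", "es", "s"]

def simple_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["calculation", "calculating", "calculated", "calculations"]:
    print(w, "->", simple_stem(w))  # all four map to the stem "calcul"
```

Even this crude rule set collapses the example variants to one index term, which is the effect stemming is meant to achieve.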
3.2.3 Stop-Word Filter
Words that are too frequent among the documents in the collection are not good discriminators. In fact, a word that occurs in 80% of the documents in the collection is useless for retrieval purposes. Such words are frequently referred to as stop-words and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stop-words.
Elimination of stop-words has an additional important benefit. It reduces the size of the indexing structure considerably. In fact, it is typical to obtain compression in the size of the indexing structure of 40% or more solely with the elimination of stop-words [Ricardo et. al., 1999].
Since stop-words elimination also provides for compression of the indexing structure, the list of stop-words might be extended to include words other than articles, prepositions, and conjunctions. For example, some verbs, adverbs, and adjectives
could be treated as stop-words. In this thesis, a list of 306 stop-words is used; the detailed list can be found in the appendix of this thesis.
The stop-word filter takes noun words as input. A few noun words contribute little to what the author wants to express in the document; they are merely auxiliary words that complete the natural language text. We call these stop words, and for this reason they must be filtered out to keep noise out of the analysis.
After the stop words are filtered, the remaining noun words still cannot immediately be assumed to be fully related to what the author wants to express. Given common writing habits, it is believed that a word whose frequency of occurrence is too low or too high is not important or representative.
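A minimal sketch of this filtering step might look as follows (the stop-word list and frequency thresholds here are illustrative, not the 306-word list or the cutoffs used in the thesis):

```python
# Filter out stop words, then drop words whose document frequency is
# too low or too high to be representative (thresholds illustrative).
def filter_terms(docs, stop_words, min_df=2, max_df_ratio=0.8):
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    n = len(docs)
    keep = {w for w, c in df.items()
            if w not in stop_words and c >= min_df and c / n <= max_df_ratio}
    return [[w for w in doc if w in keep] for doc in docs]

docs = [["the", "bank", "rate"], ["the", "bank", "loan"],
        ["the", "rate", "cut"], ["a", "stock", "rise"], ["a", "stock", "fall"]]
print(filter_terms(docs, {"the", "a"}))
# [['bank', 'rate'], ['bank'], ['rate'], ['stock'], ['stock']]
```

Here "the" and "a" are removed by the stop list, while "loan", "cut", "rise", and "fall" are removed because their document frequency falls below the minimum.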
3.2.4 Feature Selection
In many supervised learning problems, feature selection is important for a variety of reasons: generalization performance, running time requirements, and interpretational issues imposed by the problem itself.
One approach to feature selection is to select a subset of the available features.
This small feature subset should still retain the essential information of the original attributes. There are several criteria [Meisel 1972]:
(1) low dimensionality
(2) retention of sufficient information
(3) enhancement of distance in pattern space as a measure of the similarity of physical patterns, and
(4) consistency of features throughout the sample.
Our test bed is the Reuters data set; a complete description is given in Section 4.1. We choose features for each category and use them to represent a document, following the vector space model from the information retrieval field. The feature selection method we
adopt is a frequency-based method, the so-called TF-IDF weighting:

$$ w_{t,d} = \frac{tf_{t,d}}{\max_{t} tf_{t,d}} \times \log \frac{N}{n_t} \qquad (3.1) $$

where $tf_{t,d}$ is the number of times the word $t$ occurs in document $d$, $n_t$ is the number of documents in which the word $t$ occurs, and $N$ is the total number of documents.
From Section 3.2.1 to Section 3.2.4, we perform the preprocessing processes.
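Equation 3.1 can be sketched directly in Python (the toy corpus is illustrative):

```python
import math

# TF-IDF weighting as in Equation 3.1:
#   w(t,d) = (tf(t,d) / max_t tf(t,d)) * log(N / n_t)
def tfidf_weights(docs):
    n_t = {}                              # document frequency of each term
    for doc in docs:
        for t in set(doc):
            n_t[t] = n_t.get(t, 0) + 1
    N = len(docs)
    weights = []
    for doc in docs:
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        max_tf = max(tf.values())         # most frequent term in this document
        weights.append({t: (f / max_tf) * math.log(N / n_t[t])
                        for t, f in tf.items()})
    return weights

docs = [["trade", "tariff", "trade"], ["trade", "oil"]]
w = tfidf_weights(docs)
# "trade" occurs in every document, so log(N / n_t) = log(2/2) = 0
print(w[0])
```

Note how a term that appears in every document receives zero weight, which is exactly the discriminative behavior the IDF factor is designed to produce.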
The original text document is now represented as a vector as the following Figure shows.
Fig 3.4 Representing text as a feature vector.
These vectors are all $1 \times m$ dimensional, where $m$ is the total number of features we select for each category. We then use them as the training data in the unsupervised learning stage.
3.3 UNSUPERVISED LEARNING
In this stage, the aim is to construct the hierarchical news categories. To perform SV clustering, we first choose a kernel function that maps the training data into a high-dimensional feature space. By tuning the main parameters, we generate connected components of different shapes and in different numbers. The task is then to find all the connected components using a searching algorithm [Ellis et. al., 1995]; we use an adjacency matrix together with the Depth-First Search (DFS) algorithm. After finding all the connected components, we perform cluster validity checking; the strategy we use is discussed in Section 3.3.4.
From experience, finding the connected components is time-consuming. To mitigate this, we apply sampling and dimension reduction to the raw data: Fuzzy C-Means [Georgej et. al., 1995] for sampling and Principal Component Analysis [Jolliffe 1986] for dimension reduction. We return to this later.
In the following, we describe the procedures of SV clustering and the approaches used to solve the problems we faced.
3.3.1 Support Vector Clustering
SV clustering helps us construct a Reuters hierarchy. The learning algorithm we use is the one proposed by Scholkopf [Scholkopf et. al., 2000]. The QP problem is optimized using Sequential Minimal Optimization (SMO) [Platt 1998]. As already mentioned, the training time of this algorithm scales between linearly and quadratically with the size of the training set. It is much faster than other existing learning algorithms, and the SMO learning algorithm can easily be modified to fit our one-class SVM.
We now turn to the procedure for performing SV clustering. As mentioned in Section 2.1.2, performing SV clustering requires choosing proper values of $q$ and $\nu$. The choice of $q$ decides the compactness of the enclosing sphere, and hence the number of clusters, while the choice of $\nu$ helps us handle overlapping clusters. The SV clustering processes are as follows:
Fig 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
(The flowchart proceeds as follows: given an unlabeled data set $X = \{x_1, x_2, \ldots, x_n\} \in R^d$, choose a kernel function; with $\nu$ fixed, increase $q$ from 0; for the given $(q, \nu)$, use the adjacency matrix and DFS to find all connected components; check cluster validity; if valid clusters exist ($\geq 2$), stop; otherwise fix $q$ and increase $\nu$; if $q$ is exhausted and no valid clustering is found, stop.)
We explain the above procedures as follows:
3.3.2 The Choice of Kernel Function
In 1992, Vapnik and colleagues [Boser et. al., 1992] showed that the order of operations for constructing a decision function can be interchanged: instead of applying a non-linear transformation to the input vectors and then taking dot products with SVs in feature space, one can first compare two vectors in input space and then apply a non-linear transformation to the resulting value.
Commonly used kernel functions are as follows:
a) Gaussian RBF kernel: $K(x, y) = \exp(-q\,\|x - y\|^2)$ (3.2)
b) Polynomial kernel: $K(x, y) = (x \cdot y + 1)^d$ (3.3)
c) Sigmoid kernel: $K(x, y) = \tanh(x \cdot y - \theta)$ (3.4)
We use only the Gaussian kernel, since other kernel functions, such as the polynomial kernel, do not yield tight contour representations of a cluster [Tax et. al., 1999], and we will show that the Gaussian kernel is indeed the best choice for SV clustering in Section 4.3.
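The three kernels of Equations 3.2-3.4 can be written in a few lines (the parameter values below are illustrative only):

```python
import math

# The three kernels of Equations 3.2-3.4; q, d, and theta are the
# kernel parameters.
def gaussian_kernel(x, y, q):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * sq_dist)

def polynomial_kernel(x, y, d):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def sigmoid_kernel(x, y, theta):
    return math.tanh(sum(a * b for a, b in zip(x, y)) - theta)

x, y = [1.0, 0.0], [0.0, 1.0]
print(gaussian_kernel(x, x, q=2.0))  # 1.0: K(x, x) = 1 for any q
print(gaussian_kernel(x, y, q=2.0))
```

For the Gaussian kernel, $K(x, x) = 1$ for every point regardless of $q$, so all images lie on the unit sphere in feature space; this is part of why it yields the tight, closed cluster contours that SV clustering relies on.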
3.3.3 Cluster-Finding with Depth First Searching Algorithm
We use graph theory to explain the clustering result. Every enclosing sphere is a connected component, and data points in the same connected component are adjacent.
What we do now is to find out all the connected components.
Define an adjacency matrix $A_{ij}$ between pairs of points $x_i$ and $x_j$:

$$ A_{ij} = \begin{cases} 1 & \text{if } R(y) \leq R \text{ for all } y \text{ on the line segment connecting } x_i \text{ and } x_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.5) $$
We can now tell whether two data points are adjacent; next we need to find all adjacent data points belonging to the same connected component. The algorithm we adopt is the Depth-First Search (DFS) algorithm. Every training data point, even a BSV, belongs to one connected component, and DFS finds out which connected component each data point belongs to.
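The adjacency test of Equation 3.5 can be sketched as follows (the radius function and sphere radius are placeholders for the trained one-class SVM's distance-to-center function and radius, and the ring-shaped toy contour is purely illustrative; in practice the segment is checked at a finite number of sample points):

```python
# Adjacency test of Equation 3.5: x_i and x_j are adjacent iff every
# sampled point y on the segment between them satisfies R(y) <= R.
# `radius_fn` and `R` stand in for the trained model's quantities.
def adjacent(xi, xj, radius_fn, R, n_samples=10):
    for k in range(1, n_samples + 1):
        t = k / (n_samples + 1)
        y = [a + t * (b - a) for a, b in zip(xi, xj)]
        if radius_fn(y) > R:
            return False
    return True

# Toy non-convex contour: a ring of radius 1 with tolerance 0.3.
ring = lambda y: abs(sum(v * v for v in y) ** 0.5 - 1.0)
print(adjacent([1.0, 0.0], [0.0, 1.0], ring, 0.3))   # True: same component
print(adjacent([1.0, 0.0], [-1.0, 0.0], ring, 0.3))  # False: crosses the hole
```

The second pair fails because the straight segment passes through the ring's hole, where $R(y) > R$; this is exactly how the test separates points lying in different connected components.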
The connected component and DFS algorithm are as follows [黃曲江 1989 ; Ellis 1995]:
procedure ConnectedComponents (adjacencyList: HeaderList; n: integer);
var
mark: array[VertexType] of integer;
{ Each vertex will be marked with the number of the component it is in.}
v: VertexType;
componentNumber: integer;
procedure DFS(v:VertexType);
{Does a depth-first search beginning at the vertex v}
var
w: VertexType;
ptr: NodePointer;
begin
mark[v] := componentNumber;
ptr := adjacencyList[v];
while ptr <> nil do begin
w := ptr^.vertex;
output(v, w);
if mark[w] = 0 then DFS(w);
ptr := ptr^.link
end {while}
end {DFS}
begin {ConnectedComponents}
{Initialize mark array.}
for v:=1 to n do mark[v] :=0 end;
{Find and number the connected components.}
componentNumber := 0;
for v := 1 to n do if mark[v]=0 then
componentNumber := componentNumber +1;
output heading for a new component;
DFS(v)
end { if v was unmarked}
end {for}
end {ConnectedComponents}
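The Pascal procedure above can be sketched in Python, taking an adjacency matrix such as the $A_{ij}$ of Equation 3.5 as input:

```python
# Label connected components by depth-first search over an adjacency
# matrix; mark[v] holds the component number of vertex v, exactly as
# in the Pascal procedure above (components are numbered from 1).
def connected_components(A):
    n = len(A)
    mark = [0] * n
    component = 0

    def dfs(v):
        mark[v] = component
        for w in range(n):
            if A[v][w] and mark[w] == 0:
                dfs(w)

    for v in range(n):
        if mark[v] == 0:
            component += 1
            dfs(v)
    return mark

# Two clusters: {0, 1} and {2, 3}.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(connected_components(A))  # [1, 1, 2, 2]
```

Each distinct label in the returned array corresponds to one enclosing contour, i.e. one cluster.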
3.3.4 Cluster Validation
When should the clustering procedure stop? It is natural to use the number of SVs as an indication of a meaningful solution [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001].
At first we start with fixed