
CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM

3.2 DATA PREPROCESSING

There are four main procedures in the data preprocessing stage:

Fig 3.2 Data Preprocessing Processes (News Documents → Part-Of-Speech Tagger → Stemmer → Stop-Word Filter → Feature Selection → Training Data)

3.2.1 Part-Of-Speech Tagger

In this procedure, a POS tagger [Brill 1994] is introduced to provide POS information.

News articles are first tagged so that each word carries its appropriate part-of-speech (POS) tag. In general, news articles are composed mostly of natural language text expressing human thought. In this thesis, we consider that concepts, which express human thought, are mostly conveyed by noun keywords.

Therefore, the POS tagger module provides proper POS tags for the feature selection function. Furthermore, POS tags give important information for deciding contextual relationships between words. In Figure 3.2, this tagger provides noun words to the next module, the stemmer. In this way, the module employs natural language technology to help analyze news articles; consequently, it can be considered a language model.

For natural language understanding, assigning POS tags to a sentence prepares the information needed to analyze its syntax. The POS tagger employed in this thesis is the rule-based tagger proposed by Eric Brill in 1992. Brill's tagger learns lexical and contextual rules for tagging words; its precision was reported to be higher than 90% [Brill 1995]. There are 37 POS tags in total, as listed in Appendix B. As mentioned above, we select only noun words; the corresponding noun tags are NN, NNS, NNP, and NNPS. The following are examples of words after POS tagging.

N.10/CD S.1/CD

"I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN American/NNP Express/NNP is/VBZ

Fig 3.3 Words with tagging
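A minimal sketch of this tagging step is shown below. The thesis uses Brill's rule-based tagger; NLTK's default Penn Treebank tagger stands in here (a tooling assumption, not the tagger actually used), since it emits the same NN/NNS/NNP/NNPS noun tags.

```python
# Sketch of the POS-tagging module: tag a news article and keep only
# nouns for the stemmer. NLTK's default tagger substitutes for Brill's
# tagger; both use the Penn Treebank tag set.
import nltk

# One-time resource downloads (uncomment on first run):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_nouns(text: str) -> list[str]:
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)          # [(word, tag), ...]
    return [word for word, tag in tagged if tag in NOUN_TAGS]

print(extract_nouns("American Express reported strong quarterly earnings."))
# e.g. ['American', 'Express', 'earnings']
```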

3.2.2 Stemming

Frequently, the user specifies a word in a query, but only a variant of this word is present in a relevant document. Plurals, gerund forms, and past-tense suffixes are examples of syntactic variations that prevent a perfect match between a query word and a respective document word [Ricardo et al., 1999]. This problem can be partially overcome by substituting the words with their respective stems.

A stem is the portion of a word that is left after the removal of its affixes. A typical example of a stem is the word "calculate", which is the stem for the variants calculation, calculating, calculated, and calculations. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure, because the number of distinct index terms is reduced [Ricardo et al., 1999].

Because most variants of a word are generated by the introduction of suffixes, and because suffix removal is intuitive, simple, and can be implemented efficiently, several well-known suffix-removal algorithms exist. The most popular one is Porter's, so we use the Porter algorithm [Porter 1980] for word stemming.
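As an illustration, the sketch below applies the Porter algorithm through NLTK's implementation (a tooling assumption; any faithful Porter implementation yields the same stems):

```python
# Sketch of the stemming module using NLTK's Porter implementation.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for variant in ["calculate", "calculation", "calculating", "calculated"]:
    print(variant, "->", stemmer.stem(variant))
# All variants reduce to the common stem "calcul".
```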

3.2.3 Stop-Word Filter

Words that are too frequent among the documents in the collection are not good discriminators. In fact, a word that occurs in 80% of the documents in the collection is useless for the purpose of retrieval. Such words are frequently referred to as stop-words and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stop-words.

Elimination of stop-words has an additional important benefit: it considerably reduces the size of the indexing structure. In fact, it is typical to obtain a compression of 40% or more in the size of the indexing structure solely from the elimination of stop-words [Ricardo et al., 1999].

Since stop-word elimination also provides compression of the indexing structure, the list of stop-words might be extended to include words other than articles, prepositions, and conjunctions. For example, some verbs, adverbs, and adjectives could be treated as stop-words. In this thesis, a list of 306 stop-words has been used; the detailed list can be found in the appendix of this thesis.

The stop-word filter procedure takes noun words as input. A few of these noun words contribute little to what the author wants to express in the document; they are only auxiliary words that complete the natural language text. Here we call them stop-words. For this reason, stop-words must be filtered out to keep noise out of the analysis.

After the stop-words are filtered, the remaining non-stop noun words still cannot immediately be taken as fully related to what the author wants to express. Judging from common writing habits, a word whose occurrence frequency is too low or too high is generally not important or representative.
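A minimal sketch of the filtering step follows; the frequency-based pruning motivated above is handled by the feature selection step in Section 3.2.4. The five-word list here is a hypothetical stand-in for the 306-word list given in the appendix.

```python
# Sketch of the stop-word filter: drop stop-words from the noun stream.
# STOP_WORDS is an illustrative subset, not the thesis's actual list.
STOP_WORDS = {"thing", "way", "lot", "kind", "sort"}

def filter_stop_words(nouns: list[str]) -> list[str]:
    return [w for w in nouns if w.lower() not in STOP_WORDS]

print(filter_stop_words(["way", "profit", "thing", "dividend"]))
# ['profit', 'dividend']
```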

3.2.4 Feature Selection

In many supervised learning problems, feature selection is important for a variety of reasons: generalization performance, running time requirements, and interpretational issues imposed by the problem itself.

One approach to feature selection is to select a subset of the available features. This small feature subset still retains the essential information of the original attributes. There are several criteria [Meisel 1972]:

(1) low dimensionality

(2) retention of sufficient information

(3) enhancement of distance in pattern space as a measure of the similarity of physical patterns, and

(4) consistency of features throughout the sample.

Our test bed is the Reuters data set; a complete description is given in Section 4.1. We choose features for each category and use them to represent each document, adopting the vector space model from the information retrieval field. The feature selection method we adopt is a frequency-based method, the so-called TF-IDF weighting:

$$ w_{t,d} = \frac{tf_{t,d}}{\max_{t'} tf_{t',d}} \times \log\frac{N}{n_t} \tag{3.1} $$

where $tf_{t,d}$ is the number of times the word $t$ occurs in document $d$, $\max_{t'} tf_{t',d}$ is the largest term frequency in document $d$, $n_t$ is the number of documents in which the word $t$ occurs, and $N$ is the total number of documents.
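A minimal sketch of Equation (3.1) over a preprocessed corpus (each document already reduced to its kept noun stems) might look as follows:

```python
# Sketch of Eq. (3.1): term frequency normalized by the most frequent
# term in the same document, times the inverse document frequency.
import math
from collections import Counter

def tfidf(documents: list[list[str]]) -> list[dict[str, float]]:
    """documents: each entry is one document's list of kept noun stems."""
    N = len(documents)
    # n_t: the number of documents in which word t occurs.
    df = Counter(t for doc in documents for t in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        max_tf = max(tf.values(), default=1)   # max_{t'} tf_{t',d}
        weights.append({t: (tf[t] / max_tf) * math.log(N / df[t])
                        for t in tf})
    return weights
```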

From Section 3.2.1 to Section 3.2.4, we have performed the preprocessing processes. The original text document is now represented as a vector, as the following figure shows.

Fig 3.4 Representing text as a feature vector.

These vectors are all 1 × m dimensional, where m is the total number of features we select for each category. We then utilize them as the training data in the unsupervised learning stage.
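As a final illustration, the sketch below projects one document's TF-IDF weights onto the m features selected for a category; the feature list and weight values are hypothetical.

```python
# Sketch of the final vectorization: one 1-by-m vector per document,
# over the m features selected for a category.
def to_feature_vector(doc_weights: dict[str, float],
                      features: list[str]) -> list[float]:
    return [doc_weights.get(f, 0.0) for f in features]

# Hypothetical m = 3 features for an earnings-like category:
features = ["profit", "dividend", "quarter"]
print(to_feature_vector({"profit": 0.80, "quarter": 0.31}, features))
# [0.8, 0.0, 0.31]
```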
