
Automatic Web Pages Categorization with ReliefF and Hidden Naive Bayes

Xin Jin, Rongyan Li, Xian Shen, Rongfang Bie*

College of Information Science and Technology
Beijing Normal University
Beijing 100875, China
8610-58800068

xinjin4@yahoo.com

*Corresponding author: rfbie@bnu.edu.cn

ABSTRACT

A great challenge of web mining arises from the increasingly large number of web pages and the high dimensionality associated with natural language. Since classifying web pages of an interesting class is often the first step of mining the web, web page categorization/classification is one of the essential techniques for web mining. One of the main challenges of web page classification is the high dimensional text vocabulary space. In this research, we propose a Hidden Naive Bayes based method for web page classification. We also propose to use the ReliefF feature selection method for selecting relevant words to improve the classification performance. Comparisons with traditional techniques are provided. Results on a benchmark dataset show that the proposed methods are promising for accurate web page classification.

Categories and Subject Descriptors

H.2.8 [Data Mining]:

General Terms

Algorithms, Documentation, Performance, Experimentation.

Keywords

Web mining, ReliefF feature selection, hidden naive bayes.

1. INTRODUCTION

Classification of Web pages has been studied extensively since the Internet has become a huge repository of information, in terms of both volume and variety. Given the fact that web pages are based on loosely structured text, various statistical text learning algorithms have been applied to Web page classification [8, 18-23]. Among them Naive Bayes has shown great success. However, the major problem of Naive Bayes is that it ignores attribute dependencies. On the other hand, although a Bayesian Network can represent arbitrary attribute dependencies, it is intractable to learn from data [25]. In this paper we present a Hidden Naive Bayes (HNB) [17] based method for web page classification. HNB can avoid the intractable computational complexity of learning an optimal Bayesian network and still take the influences from all attributes into account [17, 25].

In the field of data mining many have argued that maximum performance is often not achieved by using all available features, but by using only a “good” subset of features. This is called feature selection. For web page classification, this means that we want to find a subset of words which help to discriminate between different kinds of web pages. In this paper we introduce a ReliefF [1, 2, 5, 7] based method to find relevant words for improving web page classification performance. ReliefF is able to efficiently estimate the quality of attributes with strong interdependencies that can be found for example in parity problems. The key idea of ReliefF is to estimate attributes according to how well their values distinguish among the instances that are near to each other.

The remainder of this paper is organized as follows. Section 2 presents the web page representation and preprocessing method.

Section 3 describes the ReliefF based word selection method.

Section 4 presents HNB based web page classification. Section 5 presents the performance measures and the experiment results.

Finally, conclusions are drawn in Section 6.

2. WEB PAGE REPRESENTATION AND PREPROCESSING

We represent each web page as a bag of words/features. A feature vector V is composed of the various words from a dictionary formed by analyzing the web pages. There is one feature vector per web page. The ith component wi of the feature vector is the IDF-transformed frequency of word i:

wi = Fi * log(number of web pages / number of web pages containing word i)

where Fi is the frequency of word i in the web page.

Word tokens in web pages are formed only from contiguous alphabetic sequences. In addition, since web pages are in HTML format, HTML tags are removed before web page classification. During tokenization we perform stemming, stop-word removal and Document Frequency Thresholding (DFT) [24].

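As a rough illustration of this representation step, the minimal Python sketch below tokenizes a toy set of pages, applies a simple document-frequency threshold, and computes the wi = Fi * log(N/ni) weights. The toy corpus, the stop-word list and the DF threshold are placeholders of our own, and the stemmer is omitted for brevity; this is a sketch of the pipeline described above, not the authors' code.

```python
import math
import re
from collections import Counter

# Toy, tag-stripped pages standing in for the real corpus (placeholder data).
pages = [
    "solar energy utility power grid energy",
    "bank financial credit loan bank",
    "hospital healthcare patient care doctor",
]
STOP_WORDS = {"the", "a", "of", "and"}   # placeholder stop-word list
DF_THRESHOLD = 1                          # placeholder Document Frequency Threshold

def tokenize(text):
    """Keep only contiguous alphabetic sequences, lower-cased, minus stop words."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in STOP_WORDS]

docs = [tokenize(p) for p in pages]

# Document Frequency Thresholding (DFT): keep words appearing in enough pages.
df = Counter(w for d in docs for w in set(d))
vocab = sorted(w for w, c in df.items() if c >= DF_THRESHOLD)

def page_vector(tokens):
    """Feature vector with w_i = F_i * log(N / n_i), as defined in Section 2."""
    tf = Counter(tokens)
    n_pages = len(docs)
    return [tf[w] * math.log(n_pages / df[w]) for w in vocab]

vectors = [page_vector(d) for d in docs]
print(len(vocab), "words; first vector:", vectors[0])
```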


3. RELIEFF WORD SELECTION

ReliefF can be seen as an extension of Relief [1, 2]. The key idea of Relief is to estimate attributes according to how well their values distinguish among the instances that are near to each other [6, 7]. For that purpose, given an instance, Relief searches for its two nearest neighbors: one from the same class (called the nearest hit, "H") and the other from a different class (called the nearest miss, "M"). The original Relief algorithm randomly selects m training instances Ri, i = 1, ..., m, where m is a user-defined parameter, and the weight of attribute A is calculated as:

W[A] := W[A] − diff(A, Ri, H)/m + diff(A, Ri, M)/m        (1)

The function diff(A, I1, I2) calculates the difference between the values of attribute A for two instances I1 and I2. The diff function is also used for calculating the distance between instances when finding the nearest neighbors; in this case the distance is the sum of the diff values over all attributes.
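For concreteness, one common way to realize diff is shown below; the exact variant used in the paper is not spelled out, so the handling here (0/1 for nominal values, range-normalized absolute difference for numeric ones) is our assumption.

```python
def diff(a, inst1, inst2, value_range=None):
    """diff(A, I1, I2): difference of attribute a between two instances.

    Nominal attributes: 0 if the values are equal, 1 otherwise.
    Numeric attributes: |v1 - v2| normalized by the attribute's range.
    """
    v1, v2 = inst1[a], inst2[a]
    if value_range:                          # numeric attribute
        return abs(v1 - v2) / value_range
    return 0.0 if v1 == v2 else 1.0          # nominal attribute

def distance(inst1, inst2, n_attributes):
    """Distance between instances = sum of diff over all attributes."""
    return sum(diff(a, inst1, inst2) for a in range(n_attributes))
```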

ReliefF, an extension of Relief, improves the original algorithm by estimating probabilities more reliably (that is, it is more robust and can deal with noisy data) and extends it to handle incomplete and multiclass data sets [3, 4]. Figure 1 shows the pseudo code of the ReliefF algorithm.

Algorithm ReliefF
Input: for each training instance a vector of attribute values and the class value
Output: the vector W of estimations of the qualities of attributes

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.   randomly select an instance Ri;
4.   find k nearest hits Hj;
5.   for each class C ≠ class(Ri) do
6.     from class C find k nearest misses Mj(C);
7.   for A := 1 to a do
8.     W[A] := W[A] − ∑(j=1..k) diff(A, Ri, Hj)/(m⋅k)
9.            + ∑(C ≠ class(Ri)) [P(C)/(1 − P(class(Ri)))] ⋅ ∑(j=1..k) diff(A, Ri, Mj(C))/(m⋅k);
10. end;

Figure 1. Pseudo code of the ReliefF algorithm
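A compact NumPy rendering of Figure 1 is sketched below. It treats all attributes as numeric with a range-normalized diff and weights miss contributions by class priors; this is a plausible reading of the pseudo code rather than the authors' implementation.

```python
import numpy as np

def relieff(X, y, k=10):
    """ReliefF attribute weights for data X (m x a) with class labels y.

    Follows Figure 1: one pass per training instance, k nearest hits and
    k nearest misses per class, miss contributions weighted by class priors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                       # avoid division by zero
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / m))
    W = np.zeros(a)

    for i in range(m):
        diffs = np.abs(X - X[i]) / span         # per-attribute diff to every instance
        dist = diffs.sum(axis=1)
        dist[i] = np.inf                        # exclude the instance itself

        # k nearest hits from the same class
        hits = np.where(y == y[i])[0]
        hits = hits[np.argsort(dist[hits])][:k]
        W -= diffs[hits].sum(axis=0) / (m * k)

        # k nearest misses from each other class, weighted by class priors
        for c in classes:
            if c == y[i]:
                continue
            misses = np.where(y == c)[0]
            misses = misses[np.argsort(dist[misses])][:k]
            w_c = priors[c] / (1.0 - priors[y[i]])
            W += w_c * diffs[misses].sum(axis=0) / (m * k)
    return W
```

Calling relieff(X, y) on the word-weight matrix returns one quality score per word, which is the RF score used below for ranking words.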

3.1 Reliable Estimation

The parameter m in the Relief algorithm represents the number of instances used for approximating the probabilities in W[A]. A larger m implies a more reliable approximation, but m cannot exceed the number of available training instances. The obvious choice, adopted in ReliefF, is to set m to this upper bound and run the outer loop of the algorithm over all available training instances.

The selection of the nearest neighbors is of crucial importance in Relief. To increase the reliability of the probability approximation, ReliefF searches for k nearest hits/misses instead of only one nearest hit/miss. It was shown that this extension significantly improves the reliability of the estimates of attribute quality [2, 4].

3.2 Multiclass Solution

Instead of finding one near miss M from a different class, ReliefF finds one near miss M(C) for each different class C and averages their contribution for updating the estimate W[A]. The average is weighted with the prior probability of each class.

The idea is that the algorithm should estimate the ability of attributes to separate each pair of classes regardless of which two classes are closest to each other.

We use the ReliefF algorithm to calculate the weight (RF score) of each word in the web pages; word selection is then achieved by selecting the words with the highest weights, as sketched below.
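Given the weight vector from ReliefF, picking the top-ranked words takes only a few lines (np.argsort here is just one convenient way to do it; n_selected = 395 mirrors the cut-off reported in Section 5):

```python
import numpy as np

def select_top_words(weights, vocab, n_selected=395):
    """Return the n_selected words with the highest ReliefF weights."""
    order = np.argsort(weights)[::-1][:n_selected]
    return [vocab[i] for i in order]
```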

The performance of ReliefF is compared to the following three feature selection methods: Information Gain (IG), which is based on the feature's impact on decreasing entropy [10]; Gain Ratio (GR), which compensates for the number of values of a feature by normalizing by the information encoded in the split itself [11]; and Chi Squared (CS), which compares the observed class frequencies resulting from the split to the a priori class frequencies.
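For readers who want comparable baseline rankings without implementing them from scratch, scikit-learn exposes a chi-squared score directly and a mutual-information score that plays the role of Information Gain; this mapping is our suggestion, not the tooling used in the paper, and Gain Ratio would still need a custom implementation.

```python
from sklearn.feature_selection import chi2, mutual_info_classif

def baseline_scores(X, y):
    """Chi Squared (CS) and an information-gain-style (IG) score for each word.

    X: document-by-word matrix of non-negative weights (required by chi2).
    y: class labels.
    """
    cs_scores, _ = chi2(X, y)
    ig_scores = mutual_info_classif(X, y, discrete_features=False)
    return cs_scores, ig_scores
```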

4. HNB WEB PAGE CLASSIFICATION

The basic idea of HNB for web page classification is to create a hidden parent for each word/attribute, which combines the influences from all other words/attributes.

Figure 2. The structure of HNB

Figure 2 gives the structure of an HNB, which was originally proposed by Zhang et al. [17]. In an HNB, attribute dependencies are represented by hidden parents of the attributes. C is the class node, and is also the parent of all attribute nodes. Each attribute Ai has a hidden parent Ahpi, i = 1, 2, ..., n, represented by a dashed circle. The arc from the hidden parent Ahpi to Ai is also drawn as a dashed directed line, to distinguish it from regular arcs.

The joint distribution represented by an HNB is defined as follows.


P(A1, ..., An, C) = P(C) ∏(i=1..n) P(Ai | Ahpi, C)        (5)

where

P(Ai | Ahpi, C) = ∑(j=1..n, j≠i) Wij ⋅ P(Ai | Aj, C)        (6)

The hidden parent Ahpi for Ai is essentially a mixture of the weighted influences from all other attributes.

Algorithm HNB(T)
Input: a set T of training web pages
  For each value c of C
    Compute P(c) from T
  For each pair of words/attributes Ai and Aj
    For each assignment ai, aj, and c to Ai, Aj, and C
      Compute P(ai, aj | c) from T
  For each pair of attributes Ai and Aj
    Compute IP(Ai; Aj | C)
  For each attribute Ai
    Compute Wi = ∑(j=1..n, j≠i) IP(Ai; Aj | C)
    For each attribute Aj with j ≠ i
      Compute Wij = IP(Ai; Aj | C) / Wi
Output: HNB models for T

Figure 3. HNB algorithm for web page classification

The classifier corresponding to an HNB on an example E = (a1, ..., an) is defined as follows.

c(E) = arg max(c ∈ C) P(c) ∏(i=1..n) P(ai | ahpi, c)        (7)

The approach to determining the weights Wij, i, j = 1, ..., n and i ≠ j, is crucial. HNB estimates them from the data, using the conditional mutual information between two attributes Ai and Aj as the weight of P(Ai | Aj, C).

Learning an HNB is mainly about estimating the parameters in the HNB from the training data. HNB based web page classification is depicted in Figure 3.
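To make the training loop of Figure 3 and the classifier of Equation (7) concrete, the sketch below implements an HNB for discrete attribute values (e.g. binarized word presence), with Laplace smoothing added for robustness. The smoothing, the binarization and the small numerical safeguards are our own assumptions rather than details specified in the paper.

```python
import numpy as np

class HNB:
    """Hidden Naive Bayes for discrete attributes (a sketch of Figure 3 and Eq. (7))."""

    def fit(self, X, y, alpha=1.0):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.values_ = [np.unique(X[:, i]) for i in range(X.shape[1])]
        n, self.n_attr_ = X.shape
        n_cls = len(self.classes_)

        # Class priors P(c), Laplace-smoothed.
        counts = np.array([np.sum(y == c) for c in self.classes_], float)
        self.log_prior_ = np.log((counts + alpha) / (n + alpha * n_cls))

        # P(a_i | c) and joint P(a_i, a_j | c), Laplace-smoothed.
        self.p_i = {}    # (i, ci) -> vector over values of A_i
        self.p_ij = {}   # (i, j, ci) -> matrix over values of A_i x A_j
        for ci, c in enumerate(self.classes_):
            Xc = X[y == c]
            for i in range(self.n_attr_):
                vi_count = np.array([np.sum(Xc[:, i] == v) for v in self.values_[i]], float)
                self.p_i[(i, ci)] = (vi_count + alpha) / (len(Xc) + alpha * len(self.values_[i]))
                for j in range(self.n_attr_):
                    if j == i:
                        continue
                    joint = np.array([[np.sum((Xc[:, i] == vi) & (Xc[:, j] == vj))
                                       for vj in self.values_[j]]
                                      for vi in self.values_[i]], float)
                    denom = len(Xc) + alpha * len(self.values_[i]) * len(self.values_[j])
                    self.p_ij[(i, j, ci)] = (joint + alpha) / denom

        # Conditional mutual information IP(A_i; A_j | C) and weights W_ij (Figure 3).
        prior = np.exp(self.log_prior_)
        self.W_ = np.zeros((self.n_attr_, self.n_attr_))
        for i in range(self.n_attr_):
            for j in range(self.n_attr_):
                if j == i:
                    continue
                mi = 0.0
                for ci in range(n_cls):
                    pij = self.p_ij[(i, j, ci)]
                    indep = np.outer(self.p_i[(i, ci)], self.p_i[(j, ci)])
                    mi += prior[ci] * np.sum(pij * np.log(pij / indep))
                self.W_[i, j] = max(mi, 0.0)   # clip tiny negatives caused by smoothing
        row_sum = self.W_.sum(axis=1, keepdims=True)
        row_sum[row_sum == 0] = 1.0
        self.W_ /= row_sum                     # W_ij = IP(A_i;A_j|C) / sum_j IP(A_i;A_j|C)
        return self

    def predict_one(self, x):
        """Eq. (7): argmax_c P(c) * prod_i P(a_i | a_hpi, c), with Eq. (6) for the mixture."""
        scores = []
        for ci in range(len(self.classes_)):
            log_p = self.log_prior_[ci]
            for i in range(self.n_attr_):
                vi = int(np.where(self.values_[i] == x[i])[0][0])   # assumes value seen in training
                p = 0.0
                for j in range(self.n_attr_):
                    if j == i:
                        continue
                    vj = int(np.where(self.values_[j] == x[j])[0][0])
                    pij = self.p_ij[(i, j, ci)]
                    p += self.W_[i, j] * (pij[vi, vj] / pij[:, vj].sum())
                log_p += np.log(max(p, 1e-12))
            scores.append(log_p)
        return self.classes_[int(np.argmax(scores))]
```

In practice one would binarize each selected word (present/absent) before calling fit, since the sketch assumes a small number of discrete values per attribute; note also that the pairwise tables make training quadratic in the number of selected words.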

5. EXPERIMENTS

CMU Industry Sector [12] is a collection of web pages belonging to companies from various economic sectors. We choose 10-fold cross-validation on this benchmark dataset to estimate classification performance. Comparison is done with three traditional methods: Naive Bayes (NB) [16], Support Vector Machine (SVM) [13, 14, 15] and Decision Tree (DT) [9, 11].

We use a subset of the original data, which forms a two-level hierarchy. There are 527 web pages partitioned into 7 classes: materials, energy, financial, healthcare, technology, transportation and utilities. Each class has about 80 web pages. There are 20257 words after stemming and stop-word removal, and 1258 words after DFT feature reduction.
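For a roughly comparable experimental protocol (not the authors' original setup), the standard scikit-learn route to 10-fold cross-validated baselines would look something like the following, where MultinomialNB, DecisionTreeClassifier and LinearSVC stand in for the NB, DT and SVM learners cited above.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

def baseline_error_rates(X, y, seed=0):
    """10-fold cross-validated Error Rate (%) for NB, DT and SVM stand-ins."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    models = {
        "NB": MultinomialNB(),                       # expects non-negative word weights
        "DT": DecisionTreeClassifier(random_state=seed),
        "SVM": LinearSVC(),
    }
    return {name: 100.0 * (1.0 - cross_val_score(m, X, y, cv=cv).mean())
            for name, m in models.items()}
```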

5.1 Performance Measures

We use the following classification performance measures:

Error Rate (ER): defined as the ratio of the number of incorrect predictions to the number of all predictions (both correct and incorrect): ER = Nip/Np, where Nip is the number of incorrect predictions and Np is the number of all predictions (i.e. the number of test samples). ER ranges from 0% to 100%; the lower the ER the better, with 0% corresponding to the ideal.

F1: It is normal practice to combine recall and precision into the F1 measure so that classifiers can be compared in terms of a single rating. F1 is defined as F1 = 2R*P/(R+P). Recall (R) is the percentage of the web pages of a given category that are classified correctly. Precision (P) is the percentage of the web pages predicted for a given category that are classified correctly. F1 ranges from 0 to 1; the higher the F1 the better.
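A minimal sketch of the two measures is given below, macro-averaging the per-class F1, which is one common convention when a single F1 number is reported for a multi-class task (the paper does not state which averaging it uses).

```python
import numpy as np

def error_rate(y_true, y_pred):
    """ER = incorrect predictions / all predictions, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true != y_pred)

def macro_f1(y_true, y_pred):
    """Per-class F1 = 2RP/(R+P), averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))
```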

5.2 Results

Figure 4 shows the word selection results. The feature ranking scores (RF, CS, IG and GR) of the words are normalized to have a maximum of 1. The results show that the top 395 ranked words are informative.


Figure 4. Feature ranking scores of the words. X-axis represents the sorted ranking index according to the score of the features. For example, ‘113’ represents the rank 113 feature. Y-axis represents the ranking score.


Figure 5 shows the Error Rate (ER) of HNB and three traditional methods (NB, DT, SVM) with and without feature selection. We can observe from the results that the performance of HNB is better than the traditional methods both with and without feature selection. HNB achieves the lowest ER of 5.7% with feature selection and 8.8% without feature selection. Feature selection can improve the performance of all the classifiers. RF’s performance is comparable with traditional feature selection methods (CS, GR and IG). For NB and DT classifiers, RF is even the best.


Figure 5. Error Rate (ER) of HNB and traditional methods with feature selection (RF, CS, GR and IG) and without feature selection (Original). RF = ReliefF, CS = Chi Squared, GR = Gain Ratio, IG = Information Gain. "Original" means without feature selection. The X-axis denotes the learners, and the Y-axis denotes the ER.

Figure 6 shows the Error Rate (ER) curves of HNB and the other classifiers with RF feature selection. The number of selected words varies from 100 to 395; "all" denotes no feature selection, that is, all 1258 words. When more than 200 words are selected, HNB is better than the traditional classification methods. HNB achieves its best performance, reaching the lowest ER, with the top 350 RF-selected words.


Figure 6. Error Rate (ER) curves of HNB and three traditional methods (NB, DT and SVM) with ReliefF (RF) feature selection. NB = Naive Bayes, DT = Decision Tree, SVM = Support Vector Machines. The X-axis denotes the number of selected words, and the Y-axis denotes the ER. "all" denotes without feature selection, that is, with all 1258 words.

Figure 7 shows the F1 of HNB and the three traditional methods with and without feature selection. We can see that the performance of HNB is better than the traditional methods both with and without feature selection. HNB achieves the highest F1 of 0.96 with feature selection and 0.94 without feature selection. Feature selection can improve the performance of all the classifiers. ReliefF's performance is comparable to or better than the traditional feature selection methods, and RF is the best for NB, DT and HNB.


Figure 7. F1 of HNB and traditional methods with feature selection and without feature selection (Original).

Figure 8 shows the F1 curves of HNB and the traditional classifiers (NB, DT and SVM) with RF feature selection. When more than 150 words are selected, HNB is better than the traditional classification methods. HNB achieves its best performance, reaching the highest F1, with the top 395 RF-selected words.


Figure 8. F1 curves of HNB and three traditional methods with ReliefF (RF) feature selection.

6. CONCLUSIONS

In this paper we propose a ReliefF (RF) feature selection based method for selecting relevant words in web pages. We also introduce a Hidden Naive Bayes (HNB) based method for classifying web pages. Comparison is done with traditional techniques.

Results on the benchmark web page dataset CMU Industry Sector indicate that ReliefF based feature selection is a promising technique for web page classification. With ReliefF feature selection, all the classifiers can achieve better performance, and in some cases ReliefF is even better than the traditional feature selection methods. The results also show that the HNB based method is better than the traditional methods for web page classification.

7. ACKNOWLEDGMENTS

This work was supported by the National Science Foundation of China under the Grant No. 10001006 and No. 60273015.

8. REFERENCES

[1] M. Robnik-Sikonja and I. Kononenko: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53(1-2): 23-69 (2003)

[2] I. Kononenko and E. Simec: Induction of Decision Trees Using ReliefF. In: G. Della Riccia, R. Kruse, and R. Viertl (eds.): Mathematical and Statistical Methods in Artificial Intelligence, CISM Courses and Lectures No. 363. Springer Verlag (1995)

[3] I. Kononenko: Estimating Attributes: Analysis and Extensions of Relief. In Proceedings of ECML'94, pages 171-182. Springer-Verlag New York, Inc. (1994)

[4] I. Kononenko, E. Simec, and M. Robnik-Sikonja: Overcoming the Myopia of Inductive Learning Algorithms with ReliefF. Applied Intelligence 7, 39-55 (1997)

[5] Yuhang Wang and Fillia Makedon: Application of Relief-F Feature Filtering Algorithm to Selecting Informative Genes for Cancer Classification Using Microarray Data (poster paper). In Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pages 497-498, Stanford, California (2004)

[6] K. Kira and L. A. Rendell: The Feature Selection Problem: Traditional Methods and a New Algorithm. In: Proceedings of AAAI'92 (1992)

[7] K. Kira and L. A. Rendell: A Practical Approach to Feature Selection. In: D. Sleeman and P. Edwards (eds.): Machine Learning: Proceedings of International Conference (ICML'92), pp. 249-256, Morgan Kaufmann (1992)

[8] H. Mase: Experiments on Automatic Web Page Categorization for IR System. Technical Report, Stanford Univ., Stanford, Calif. (1998)

[9] I. Witten and E. Frank: Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)

[10] J. Ross Quinlan: Induction of Decision Trees. Machine Learning, 1: 81-106 (1986)

[11] J. Ross Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA (1993)

[12] Industry Sector Dataset: http://www.cs.cmu.edu/~TextLearning/datasets.html (2005)

[13] Corinna Cortes and Vladimir Vapnik: Support-Vector Networks. Machine Learning, 20(3): 273-297 (1995)

[14] J. Platt: Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In: Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola, eds., MIT Press (1998)

[15] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3), pp. 637-649 (2001)

[16] Karl-Michael Schneider: A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, April (2003)

[17] H. Zhang, L. Jiang, and J. Su: Hidden Naive Bayes. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), pp. 919-924, AAAI Press (2005)

[18] Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang: PEBL: Web Page Classification without Negative Examples. IEEE Trans. Knowl. Data Eng. 16(1): 70-81 (2004)

[19] S. Dumais and H. Chen: Hierarchical Classification of Web Content. In Proc. 23rd ACM Int'l Conf. Research and Development in Information Retrieval (SIGIR '00), pp. 256-263 (2000)

[20] W. Wong and A. W. Fu: Finding Structure and Characteristics of Web Documents for Classification. In Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD '00), pp. 96-105 (2000)

[21] J. Yi and N. Sundaresan: A Classifier for Semi-Structured Documents. In Proc. Sixth Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 340-344 (2000)

[22] H. Oh, S. Myaeng, and M. Lee: A Practical Hypertext Categorization Method Using Links and Incrementally Available Class Information. In Proc. 23rd ACM Int'l Conf. Research and Development in Information Retrieval (SIGIR '00), pp. 264-271 (2000)

[23] L. K. Shih and David R. Karger: Using URLs and Table Layout for Web Classification Tasks. WWW 2004, pp. 193-202 (2004)

[24] Stemming: http://www.comp.lancs.ac.uk/computing/research/stemming/general/ (Accessed 2006)

[25] D. M. Chickering: Learning Bayesian Networks is NP-Complete. In Fisher, D., and Lenz, H., eds., Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 121-130 (1996)
