
A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering

Tsau Young Lin a,*, I-Jen Chiang b

a Department of Computer Science, San Jose State University, One Washington Square, San Jose, CA 95192-0249, USA
b Graduate Institute of Medical Informatics, Taipei Medical University, 205 Wu-Hsien Street, Taipei 110, Taiwan, ROC

Received 1 July 2004; accepted 1 November 2004; available online 7 January 2005

Abstract

This paper presents a novel approach to document clustering based on a geometric structure from combinatorial topology. Given a set of documents, the associations among frequently co-occurring terms naturally form a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection. Based on these concepts, documents can be clustered into meaningful classes. In this paper, however, we attack a softer notion: instead of connected components, we use maximal simplexes of highest dimension as representatives of connected components; the concepts so defined are called maximal primitive concepts.

Experiments with three different data sets from Web pages and medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms, such as k-means, AutoClass and Hierarchical Clustering (HAC). This abstract geometric model seems to have captured the latent semantic structure of documents.

© 2005 Published by Elsevier Inc.

0888-613X/$ - see front matter © 2005 Published by Elsevier Inc. doi:10.1016/j.ijar.2004.11.005

* Corresponding author. Fax: +1 408 924 5062.
E-mail addresses: tylin@cs.sjsu.edu (T.Y. Lin), ijchiang@tmu.edu.tw (I-Jen Chiang).

International Journal of Approximate Reasoning 40 (2005) 55–80


Keywords: Document clustering; Association rules; Topology; Hierarchical clustering; Simplicial complex

1. Introduction

The Internet is an information ocean. Automatic tools are needed to help users find, filter, and extract the desired information. Search engines have become indispensable tools for gathering Web pages and documents that are relevant to a user's query. Unfortunately, inconsistent, uninteresting and disorganized search results are often returned. Without conceptual categorization, issues like polysemy, phrases and term dependency impose limitations on search technology [22]. The goal of this paper is to improve the current state: search results can be improved by proper organization based on categories, subjects, and contents.

How should we organize this information ocean? Roughly speaking, we organize the information by decomposing (triangulating, partitioning, granulating) the latent semantic space of documents into a simplicial complex (in the sense of combinatorial topology), which can be viewed as a special form of hypergraph. Note that the notion of simplicial complexes actually predates that of hypergraphs by about half a century, even though the latter notion is more familiar to modern computer scientists.

A good search engine needs to decide within a short time whether a piece of information is relevant to a user's query. In the current state of the art, extracting the full semantics of a document automatically is impossible. Multiple concepts can be defined simultaneously in a single Web page, and it is hard to limit the number of concept categories in a collection of Web pages, so unsupervised clustering methods are probably the best strategy. We therefore propose a technique, based on the triangulation of the latent semantic space of documents into a simplicial complex, to classify or cluster Web documents.

Clustering documents by the associations (frequent itemsets) of key terms identified in a collection of documents naturally leads to a simplicial complex in combinatorial topology [41]. For example, the association consisting of "wall" and "street" denotes a financial notion whose meaning goes beyond the two nodes "wall" and "street". This is similar to the notion of an open segment (v0, v1), in which the two end points determine a one-dimensional geometric object whose meaning goes beyond the two 0-dimensional end points. In general, an r-association represents semantics generated by a set of r keywords; it may carry more semantics than, or even have nothing to do with, the individual keywords. The Apriori property of such associations is reflected exactly in the mathematical structure of a simplicial complex in combinatorial topology (Section 4). We may regard such a structure as a triangulation (partition, granulation) of the space of latent semantics of Web pages. The thesis of this paper is that a connected component of the simplicial complex of term associations represents a CONCEPT in the conceptual structure of the latent semantic space of Web pages. Based on this conceptual structure, documents can be clustered. In this research, we investigate the strength of this geometric view against traditional approaches, such as k-means, AutoClass [9] and Hierarchical Clustering (HAC) algorithms. The experimental results indicate that our approach performs significantly better than the classical approaches.

In what follows, we start by reviewing related work on Web document clustering in Section 2. Section 3 defines association rules in a collection of documents and illustrates how to compute the support and confidence of each association rule. The concepts and definitions of the latent semantic space, based on geometric forms for the frequent itemsets generated by association rules, are given in Section 4. Section 5 presents the algorithm for clustering the simplicial complex of the latent semantic network into several concrete concepts, each of which represents a CONCEPT in the document collection. Documents can then be clustered based on the primitive concepts identified by this algorithm. Experimental results on three different data sets are described in Section 6, followed by the conclusion.

2. Related work

Most search engines provide instant gratification in response to user queries [8,31,36,42]; however, they provide little guarantee of precision, even for detailed queries. There has been much research on developing more intelligent tools for information retrieval, such as machine learning [40], text mining, and intelligent Web agents (see an earlier survey by Mladenic [33]).

Document clustering has been considered one of the most crucial techniques for dealing with the diverse and large amount of information present on the World Wide Web, the information ocean. In particular, clustering is used to discover latent concepts in a collection of Web documents, which is inherently useful in organizing, summarizing, disambiguating, and searching through large document collections [25].

Numerous document clustering methods have been proposed based on probabilistic models, distance and similarity measures [16], or other techniques such as SOM [24]. A document is often represented as a feature vector, which can be viewed as a point in a multi-dimensional space. Many methods, including k-means [30], hierarchical clustering [20] and nearest-neighbor clustering [29], select a set of key terms or phrases to organize the feature vectors corresponding to different documents. Suffix-tree clustering [44], a phrase-based approach, forms document clusters depending on the similarity between documents.

When the number of features selected from each document is too large, methods for extracting the salient features are applied. However, the residual dimension can still be very large; moreover, the quality of the resulting clusters tends to decrease due to the loss of relevant features. Common frameworks for reducing the dimension of the feature space are principal component analysis [21], independent component analysis [19], and latent semantic indexing [3,5]. Furthermore, in the presence of noise in the data, feature extraction may result in degradation of clustering quality [6]. Association rule hypergraph partitioning was first proposed in [6] to transform documents into a transactional database form and then apply hypergraph partitioning [23] to find the item clusters.


Hierarchical clustering algorithms were proposed in an early paper by Willett [43]. Cutting et al. introduced partition-based clustering algorithms for document clustering [11]. Buckshot and fractionation were developed in [27]. Greedy heuristic methods are used in the hierarchical frequent term-based clustering algorithm [4] to perform hierarchical document clustering using frequent itemsets. We note here that frequent itemsets are also referred to as associations (undirected association rules).

3. Undirected term-associations

The notion of association rules was introduced by Agrawal et al. [1] and has been demonstrated to be useful in several domains [7,10], such as retail sales transaction databases. In the theory, two standard measures, called support and confidence, are used. We are concerned more with the frequency than with the direction of rules. Our focus is on the support; a set of items that meets the minimal support is often referred to as a frequent itemset. We call such itemsets associations (undirected association rules) to emphasize their meaning rather than the phenomenon of frequency.

The frequency distribution of a word or phrase in a document collection is quite different from the item frequency distribution in a retail sales transaction database. In [28], we have shown that isomorphic relations have isomorphic associations; isomorphic essentially means identical. Documents are amorphous: a single key word does not carry much information about a document, yet a large set of key words may identify the document nearly uniquely. So finding all associations in a collection of textual documents presents great interest and challenge.

Traditional text mining generally consists of analyzing a text document by extracting key words, phrases, concepts, etc., and representing the original text in a refined intermediate form for further analysis with data mining techniques (e.g., to determine associations of concepts, key phrases, names, addresses, product names, etc.). Feldman and his colleagues [12,13,15] proposed the KDT and FACT systems to discover association rules based on keywords labeling the documents, the background knowledge of keywords, and the relationships between them. This is not an effective approach, because a substantially large amount of background knowledge is required. Therefore, an automated approach in which documents are labeled by rules learned from labeled documents was adopted [26].

However, several association rules are constructed from compound words (such as "Wall" and "Street", which often co-occur) [37]. Feldman et al. [12,14] further proposed term extraction modules to generate association rules from selected keywords. Nevertheless, a system without the need for human labeling is desirable. Holt and Chung [18] introduced the Multipass-Apriori and Multipass-DHP algorithms to efficiently find association rules in text by modifying the Apriori algorithm [2] and the DHP algorithm [35], respectively. However, these methods did not consider the word distribution within a document, that is, the importance of a word in a document. No matter what method is used, some common words occur more frequently in a document than others. Simple frequency of word occurrence is not adequate, since some documents are larger than others. Furthermore, some words may occur frequently across documents; in most cases, words that appear in only a few documents tend to be the most "important". Techniques such as TFIDF [39] have been proposed to deal directly with some of these problems. The TFIDF value is the weight of a term in each document. When ranking documents relevant to a search query, a term with a large TFIDF value pulls more weight than terms with smaller TFIDF values.

3.1. Feature extraction

A general framework for text mining consists of two phases. The first phase, feature extraction, extracts key terms from a collection of "indexed" documents; in the second phase, various methods, such as association rule algorithms, may be applied to determine relations between features.

When performing association analyses on a collection of documents, all documents should be indexed and stored in an intermediate form. Document indexing originated from the task of assigning terms to documents for retrieval or extraction purposes. In an early approach, an indexing model was developed based on the assumption that a document should be assigned those terms that are used by queries to retrieve the relevant document [32,17]. Weighted indexing weights the index terms with respect to the document, with this model given a theoretical justification in terms of probabilities. The simplest yet most sophisticated weighting scheme, and the one most commonly used in information retrieval and information extraction, is TFIDF indexing, i.e., tf · idf indexing [39,38], where tf denotes the term frequency in the document and idf denotes the inverse document frequency, document frequency being the number of documents that contain the term. This has the effect of giving commonly used words a relatively small tf · idf value. Moffat and Zobel [34] pointed out that the tf · idf function asserts that: (1) rare terms are no less important than frequent terms, according to their idf values; and (2) multiple appearances of a term in a document are no less important than single appearances, according to their tf values. The tf · idf value indicates the significance of a term in a document, which can be defined as follows.

Definition 1. Let $T_r$ denote a collection of documents. The significance of a term $t_i$ in a document $d_j \in T_r$ is its TFIDF value, calculated by the function $\mathrm{tfidf}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \cdot \mathrm{idf}(t_i, d_j)$. It can be calculated as

$$\mathrm{tfidf}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \cdot \log \frac{|T_r|}{|T_r(t_i)|}$$

where $|T_r(t_i)|$ denotes the number of documents in $T_r$ in which $t_i$ occurs at least once, and

$$\mathrm{tf}(t_i, d_j) = \begin{cases} 1 + \log(N(t_i, d_j)) & \text{if } N(t_i, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $N(t_i, d_j)$ denotes the frequency with which term $t_i$ occurs in document $d_j$, counting all its nonstop words.

To prevent the value of $|T_r(t_i)|$ from being zero, a Laplace adjustment is applied by adding an observed count.
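To make Definition 1 concrete, the following is a minimal Python sketch of this tf and tfidf computation; the tokenized-corpus representation and the placement of the Laplace adjustment (a +1 on the document frequency) are our assumptions for illustration.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """tf(t_i, d_j) = 1 + log N(t_i, d_j) if the term occurs in the document, else 0."""
    n = Counter(doc_tokens)[term]
    return 1 + math.log(n) if n > 0 else 0.0

def tfidf(term, doc_tokens, corpus):
    """tfidf(t_i, d_j) = tf(t_i, d_j) * log(|Tr| / |Tr(t_i)|); the +1 on the
    document frequency is a Laplace-style adjustment (an assumption about
    where the paper applies it) so the denominator is never zero."""
    df = sum(1 for d in corpus if term in d)
    return tf(term, doc_tokens) * math.log(len(corpus) / (df + 1))
```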

TFIDF values are often organized in matrix form: a document $d_j \in T_r$ is represented as a vector $V_j = \langle \mathrm{tfidf}(t_1, d_j), \mathrm{tfidf}(t_2, d_j), \ldots, \mathrm{tfidf}(t_n, d_j) \rangle$, and $T_r$ is then represented as a matrix $M_r = \langle V_1, V_2, \ldots, V_i, \ldots \rangle^T$. Most previous works [12,13,15] proposed finding association rules, or partitioning the association rules into clusters [6], from $M_r$. However, there are often thousands of terms in a document, and some terms may appear in only a few documents of a collection. The document matrix $M_r$ is large and sparse, and it becomes computationally hard to find the independent sets of association rules for automatic clustering of the documents.

3.2. Measures on undirected term-associations

We observe that the direction between key terms is irrelevant for the purpose of document clustering, so we ignore the confidence and consider only the support. In other words, we consider the structure of the undirected associations of key terms: we believe the set of co-occurring key terms reflects the essential information, while the rule directions between the key terms are inessential, at least at the present stage of investigation. Let $t_A$ and $t_B$ be two terms. The support is defined for a collection of documents as follows.

Definition 2. The significance of the undirected association of term $t_A$ and term $t_B$ in a collection is

$$\mathrm{tfidf}(t_A, t_B; T_r) = \frac{1}{|T_r|} \sum_{i=1}^{|T_r|} \mathrm{tfidf}(t_A, t_B; d_i)$$

where

$$\mathrm{tfidf}(t_A, t_B; d_i) = \mathrm{tf}(t_A, t_B; d_i) \cdot \log \frac{|T_r|}{|T_r(t_A, t_B)|}$$

and $|T_r(t_A, t_B)|$ is the number of documents containing both term $t_A$ and term $t_B$. The term frequency $\mathrm{tf}(t_A, t_B; d_j)$ of both terms can be calculated as follows.

Definition 3.

$$\mathrm{tf}(t_A, t_B; d_j) = \begin{cases} 1 + \log(\min\{N(t_A, d_j), N(t_B, d_j)\}) & \text{if } N(t_A, d_j) > 0 \text{ and } N(t_B, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$
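A matching sketch for Definitions 2 and 3, reusing the representation above; the min-based pair frequency follows Definition 3 and the corpus-wide average follows Definition 2 (the function names are ours).

```python
def tf_pair(ta, tb, doc_tokens):
    """tf(t_A, t_B; d_j) = 1 + log(min{N(t_A, d_j), N(t_B, d_j)}) when both occur."""
    counts = Counter(doc_tokens)
    na, nb = counts[ta], counts[tb]
    return 1 + math.log(min(na, nb)) if na > 0 and nb > 0 else 0.0

def tfidf_pair(ta, tb, corpus):
    """tfidf(t_A, t_B; Tr): average over all documents of
    tf(t_A, t_B; d_i) * log(|Tr| / |Tr(t_A, t_B)|)."""
    df = sum(1 for d in corpus if ta in d and tb in d)
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return sum(tf_pair(ta, tb, d) * idf for d in corpus) / len(corpus)
```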


A minimal support h is imposed to filter out terms whose TFIDF values are small. This helps eliminate the most common terms in a collection and the nonspecific terms in a document.

Let $t_A$ and $t_B$ be two terms. The support and confidence defined on the document matrix $M_r$ are as follows.

Definition 4. The support is the ratio of the number of documents in $T_r$ that contain both term $t_A$ and term $t_B$, that is,

$$\mathrm{Support}(t_A, t_B) = \frac{|T_r(t_A, t_B)|}{|T_r|}$$

where $|T_r(t_A, t_B)|$ is the number of documents that contain both $t_A$ and $t_B$.

Definition 5. The confidence is obtained from the tfidf of both $t_A$ and $t_B$; it denotes the score of documents that contain $t_A$ and also contain $t_B$ within a fixed distance:

$$\mathrm{Confidence}(t_A, t_B) = P(t_B \mid t_A) = \frac{\mathrm{tfidf}(t_A, t_B; T_r)}{\mathrm{tfidf}(t_A; T_r)}$$

where

$$\mathrm{tfidf}(t_A; T_r) = \frac{1}{|T_r|} \sum_{i=1}^{|T_r|} \mathrm{tfidf}(t_A, d_i), \qquad \mathrm{tfidf}(t_A, t_B; T_r) = \frac{1}{|T_r|} \sum_{i=1}^{|T_r|} \mathrm{tfidf}(t_A, t_B; d_i)$$

with $\mathrm{tfidf}(t_A, t_B; d_i)$ and $\mathrm{tf}(t_A, t_B; d_j)$ as given in Definitions 2 and 3, and $|T_r(t_A, t_B)|$ the number of documents containing both term $t_A$ and term $t_B$.

Terms with confidences lower than a given threshold, i.e., the minimum confidence, are filtered from the original matrix $M_r$ to yield the condensed matrix $\hat{M}_r$. Many of the algorithms developed for discovering association rules discussed in the previous section, such as Apriori [1], can be used to discover association rules in $\hat{M}_r$.
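In the same style, Definitions 4 and 5 can be sketched as follows; `tfidf` and `tfidf_pair` are from the previous sketches, and the zero-guard is ours.

```python
def support(ta, tb, corpus):
    """Support(t_A, t_B) = |Tr(t_A, t_B)| / |Tr| (Definition 4)."""
    return sum(1 for d in corpus if ta in d and tb in d) / len(corpus)

def confidence(ta, tb, corpus):
    """Confidence(t_A, t_B) = tfidf(t_A, t_B; Tr) / tfidf(t_A; Tr) (Definition 5)."""
    avg_single = sum(tfidf(ta, d, corpus) for d in corpus) / len(corpus)
    return tfidf_pair(ta, tb, corpus) / avg_single if avg_single > 0 else 0.0
```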


4. Geometric representations of term-associations

The goal of this section is to model the internal CONCEPTS hidden in a collection of documents. We observe that (1) term–term inter-relationships represent and carry the intrinsic semantics or CONCEPTS hidden in a collection of documents, and (2) the co-occurring term associations, henceforth called term-associations, represent these term–term inter-relationships. So the key to modeling the hidden semantics or CONCEPTS in a set of documents lies in modeling the term-associations. Somewhat surprisingly, the mathematical structure of term-associations is a known geometric/topological object, called a simplicial complex.

So a natural way to represent the latent semantics of a set of documents is to use geometric and topological notions that capture the totality of thoughts expressed in the collection.

4.1. Combinatorial topology

Let us introduce and define some basic notions of combinatorial topology. The central notion is the n-simplex.

Definition 6. An n-simplex is a set of n + 1 independent abstract vertices $[v_0, \ldots, v_n]$. An r-face of an n-simplex $[v_0, \ldots, v_n]$ is an r-simplex $[v_{j_0}, \ldots, v_{j_r}]$ whose vertices form a subset of $\{v_0, \ldots, v_n\}$ of cardinality r + 1.

Geometrically, a 0-simplex is a vertex; a 1-simplex is an open segment (v0, v1) that does not include its end points; a 2-simplex is an open triangle (v0, v1, v2) that does not include its edges and vertices; a 3-simplex is an open tetrahedron (v0, v1, v2, v3) that does not include any of its boundaries. For each simplex, all its proper faces (boundaries) are excluded. An n-simplex is the high-dimensional analogue of these low-dimensional simplexes (segment, triangle, and tetrahedron) in n-space. Geometrically, an n-simplex uniquely determines a set of n + 1 linearly independent vertices, and vice versa. An n-simplex is the smallest convex set in a Euclidean space $\mathbb{R}^n$ containing n + 1 points $v_0, \ldots, v_n$ that do not lie in a hyperplane of dimension less than n. For example, there is the standard n-simplex

$$\delta^n = \left\{ (t_0, t_1, \ldots, t_n) \in \mathbb{R}^{n+1} \;\middle|\; \sum_i t_i = 1,\; t_i \geq 0 \right\}$$

The convex hull of any m + 1 vertices of the n-simplex is called an m-face. The 0-faces are the vertices, the 1-faces are the edges, the 2-faces are the triangles, and the single n-face is the whole n-simplex itself. Formally,

Definition 7. A simplicial complex C is a finite set of simplexes that satisfies the following two conditions:

• Any set consisting of one vertex is a simplex.
• Every face of a simplex from C is also in C.

The vertex set of the complex, $v_0, v_1, \ldots, v_n$, is the union of the vertex sets of its simplexes [41, p. 108]. If the maximal dimension of the constituting simplexes is n, then the complex is called an n-complex.

Note that any set of n + 1 objects can be viewed as a set of abstract vertices; to stress this abstractness, we sometimes refer to such a simplex as a combinatorial n-simplex. The corresponding notion of a combinatorial n-complex can be defined from (combinatorial) r-simplexes. Now, by regarding the key terms, as defined by high TFIDF values, as abstract vertices, an association of n + 1 key terms, called an (n + 1)-association, is a combinatorial n-simplex: a 2-association is an open 1-simplex. The open 1-simplex ("wall", "street") represents a financial notion whose semantics goes well beyond the two vertices "wall" and "street". An (n + 1)-association is a combinatorial n-simplex of keywords that often carries deep semantics well beyond the "union" of its vertices or faces taken individually.

We need more precise notions. An (n, r)-skeleton (denoted by $S^n_r$) of an n-complex is the n-complex with all k-simplexes ($k \leq r - 1$) removed. Two simplexes in an (n, r)-skeleton are said to be directly connected if their intersection is a nonempty face. Two simplexes in a complex are said to be connected if there is a finite sequence of directly connected simplexes connecting them. Two nonempty simplexes A, B are said to be r-connected if there exists a sequence of k-simplexes (k varies) $A = S_0, S_1, \ldots, S_m = B$ such that $S_j$ and $S_{j+1}$ have an h-dimensional common face for $j = 0, 1, 2, \ldots, m - 1$, where $r \leq h \leq k \leq n$.

A maximal r-connected subcomplex of an n-complex is called an r-connected component; that is, there is no other r-connected subcomplex that is a superset of it. When r = 0, an r-connected component is simply called a connected component.
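These definitions translate directly into a union-find computation over the simplexes of a skeleton: two simplexes are joined whenever they share a common face of dimension at least r, i.e., at least r + 1 common vertices. The following is a sketch of that definition, not the paper's own code; simplexes are represented as frozensets of terms.

```python
def r_connected_components(simplexes, r):
    """Group simplexes (frozensets of terms) into r-connected components:
    two simplexes are directly connected if they share a common face with
    at least r + 1 vertices (an h-common face, h >= r)."""
    parent = {s: s for s in simplexes}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a in simplexes:
        for b in simplexes:
            if a != b and len(a & b) >= r + 1:
                parent[find(a)] = find(b)  # union

    groups = {}
    for s in simplexes:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())
```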

4.2. The geometry of term-associations

In the last section, we observed that an (n + 1)-association is an abstract n-simplex; in fact, the set of all associations has more structure. In this section, we investigate the mathematical structure of term-associations. First, let us recall the notion of a hypergraph:

Definition 8. A hypergraph G = (V, E) consists of two distinct sets, where V is a finite set of abstract vertices and $E = \{e_1, e_2, \ldots, e_m\}$ is a nonempty family of subsets of V, each of which is called a hyperedge.

It is obvious that the set of associations can be interpreted as a hypergraph: the key terms are the vertices, and the term-associations are the hyperedges. Likewise, a simplicial complex is a hypergraph: the set of vertices is V, and the set of simplexes is E. However, both term-associations and simplicial complexes have more structure: a simplicial complex satisfies the further conditions specified in the last section. A simplicial complex is a very special kind of hypergraph. Actually, the differences are deeper and intrinsic:

• Hypergraph theory targets the graph-theoretical structure of vertices connected by hyperedges.
• A simplicial complex (combinatorial topology) targets the geometrical or topological structure of the space (polyhedron) supported by the simplicial complex.

Note that the Apriori conditions on term-associations meet the conditions of a simplicial complex: a 1-association is a 0-simplex, and a "subset" of an association is an association of shorter length. So the notion of a simplicial complex is a natural view of term-associations; we will take this view.
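This correspondence can be checked mechanically: a family of itemsets is a simplicial complex exactly when it is downward closed, which is the Apriori property. A small sketch, representing itemsets as frozensets:

```python
from itertools import combinations

def is_simplicial_complex(itemsets):
    """True iff every nonempty proper subset (face) of every itemset is itself
    in the family -- the Apriori / downward-closure condition."""
    family = set(itemsets)
    for s in family:
        for k in range(1, len(s)):
            if any(frozenset(f) not in family for f in combinations(s, k)):
                return False
    return True

assoc = {frozenset({"wall"}), frozenset({"street"}), frozenset({"wall", "street"})}
assert is_simplicial_complex(assoc)
```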

In our application, each vertex is a key term and a simplex is a term-association of maximal length. The open 1-simplex (Wall, Street) represents a concept in financial business. The 0-simplex (Network) might represent many different CONCEPTS; combined with other terms, however, it denotes more specific semantic CONCEPTS. For example, the 1-simplexes (Computer, Network), (Traffic, Network), (Neural, Network), (Communication, Network), etc., express further and richer semantics than their individual 0-simplexes. Of course, the 1-simplex (Neural, Network) is not as conspicuous as the 2-simplexes (Artificial, Neural, Network) and (Biology, Neural, Network).

A collection of documents may carry a set of distinct CONCEPTS. Each concept, we believe, is carried by a connected component of the complex of term-associations. Here is our belief, our thesis:

• An IDEA (in the form of a complex of term-associations) may consist of many CONCEPTS (in the form of connected components), each of which consists of PRIMITIVE CONCEPTS (in the form of maximal simplexes). A maximal simplex of highest dimension is called a MAXIMAL PRIMITIVE CONCEPT. A simplex is said to be maximal if no other simplex in the complex is a superset of it. The geometric dimension represents the degree of preciseness or depth of the latent semantics represented by the term-associations.

Example 1. In Fig. 1, we have an IDEA consisting of 12 terms organized in the form of a 3-complex, denoted by $S^3$. Simplex(a, b, c, d) and Simplex(w, x, y, z) are two maximal simplexes of dimension 3, the highest dimension. Let us consider $S^3_1$, the leftover from the removal of all 0-simplexes from $S^3$:

• Simplex(a, b, c, d) and its 10 faces:
– Simplex(a, b, c)
– Simplex(a, b, d)
– Simplex(a, c, d)
– Simplex(b, c, d)
– Simplex(a, b)
– Simplex(a, c)
– Simplex(b, c)
– Simplex(a, d)
– Simplex(b, d)
– Simplex(c, d)
• Simplex(a, c, h) and its three faces:
– Simplex(a, c)
– Simplex(a, h)
– Simplex(c, h)
• Simplex(c, h, e) and its three faces:
– Simplex(c, h)
– Simplex(h, e)
– Simplex(c, e)
• Simplex(e, h, f) and its three faces:
– Simplex(e, h)
– Simplex(h, f)
– Simplex(e, f)
• Simplex(e, f, x) and its three faces:
– Simplex(e, f)
– Simplex(e, x)
– Simplex(f, x)
• Simplex(f, g, x) and its three faces:
– Simplex(f, g)
– Simplex(g, x)
– Simplex(f, x)
• Simplex(g, x, y) and its three faces:
– Simplex(g, x)
– Simplex(g, y)
– Simplex(x, y)
• Simplex(w, x, y, z) and its 10 faces:
– Simplex(w, x, y)
– Simplex(w, x, z)
– Simplex(w, y, z)
– Simplex(x, y, z)
– Simplex(w, x)
– Simplex(w, y)
– Simplex(w, z)
– Simplex(x, y)
– Simplex(x, z)
– Simplex(y, z)

Fig. 1. A complex with 12 vertexes.

Note that Simplex(a, c), Simplex(c, h), Simplex(h, e), Simplex(e, f), Simplex(f, x), Simplex(g, x), and Simplex(x, y) successively share common faces, so they generate a connected path from Simplex(a, b, c, d) to Simplex(w, x, y, z), along with sub-paths. Therefore the complex $S^3_1$ is connected. This assertion also implies that $S^3$ is connected. Hence the IDEA consists of a single CONCEPT (please note the technical meanings of IDEA and CONCEPT given above). Next, let us consider the (3, 2)-skeleton $S^3_2$, obtained by removing all 0-simplexes and 1-simplexes from $S^3$:

• Simplex(a, b, c, d) and its four faces:
– Simplex(a, b, c)
– Simplex(a, b, d)
– Simplex(a, c, d)
– Simplex(b, c, d)
• Simplex(a, c, h)
• Simplex(c, h, e)
• Simplex(e, h, f)
• Simplex(e, f, x)
• Simplex(f, g, x)
• Simplex(g, x, y)
• Simplex(w, x, y, z) and its four faces:
– Simplex(w, x, y)
– Simplex(w, x, z)
– Simplex(w, y, z)
– Simplex(x, y, z)

There are no common faces (of dimension at least 2) between any two of these simplexes, so $S^3_2$ has eight connected components, i.e., eight CONCEPTS. $S^3_3$ consists of two disconnected 3-simplexes, i.e., two MAXIMAL PRIMITIVE CONCEPTS.

A complex, connected component, or simplex of a skeleton represents a more technically refined IDEA, CONCEPT, or PRIMITIVE CONCEPT. If a maximal connected component of a skeleton contains only one simplex, this component is said to organize a primitive concept.


Definition 9. A set of maximal connected components is said to be independent if there are no common faces between any two maximal connected components.

4.3. Layered clustering

From a collection of documents, a complex of term-associations can be generated. Based on such a complex, documents can be clustered in a layered fashion.

In this section, we first examine the intuitive meaning of such a complex. As seen in Example 1, the connected components of $S^n_k$ are contained in those of $S^n_r$, where $k \geq r$. Based on this, the goal of this paper is to define layered clustering based on the dimension hierarchies of primitive CONCEPTS.

Example 2. Fig. 2 shows a 2-complex composed of the term set $V = \{t_A, t_B, t_C\}$ in a collection of documents. It is a closed 2-simplex; we recall that a closed simplex is a complex consisting of one simplex and all its faces. In the skeleton $S^2_1$, all 0-simplexes are ignored, i.e., the terms depicted with dashed lines. The simplex set $S = \{\mathrm{Simplex}^2_1, \mathrm{Simplex}^1_2, \mathrm{Simplex}^1_3, \mathrm{Simplex}^1_4\}$ is the closed 2-simplex consisting of one 2-simplex and three 1-faces, $\mathrm{Simplex}^1_2$, $\mathrm{Simplex}^1_3$ and $\mathrm{Simplex}^1_4$ (0-faces are ignored). These r-simplexes ($0 \leq r \leq 2$) represent frequent itemsets (term-associations) from V, where $W = \{w_{A,B}, w_{C,A}, w_{B,C}, w_{A,B,C}\}$ denotes their corresponding supports. The lines connecting $\mathrm{Simplex}_1$ and the three vertices represent the incidences of the 2-simplex with the 0-simplexes; the incidences with the 1-simplexes are not shown to avoid overcrowding the figure.

Fig. 2. This figure illustrates the skeleton $S^2_1$ of Example 2. It is composed from three key terms $\{t_A, t_B, t_C\}$ of a collection of documents, where each simplex is identified by its tfidf value and all 0-simplexes have been removed (the nodes are drawn with dashed circles). Note that $\mathrm{Simplex}_1$ has dimension 2.

One of the geometric properties of a simplicial complex is that every face of a simplex that is in the complex must also be in the complex:


Property 1. A simplex has $\binom{n+1}{i+1}$ i-faces ($i \leq n$), where $\binom{n}{k}$ is a binomial coefficient. This is the Apriori condition of association rules.

So, as we observed previously, in a complex of term-associations the set of 0-simplexes (vertices) represents all frequent 1-itemsets, the 1-simplexes the frequent 2-itemsets, the 2-simplexes the frequent 3-itemsets, and so on.

According to Example 1, the simplexes of the higher-level skeleton $S^n_r$ are contained in the lower-level skeleton $S^n_k$ of the same n-complex, for $r \geq k$. Fig. 3 shows the hierarchy; each skeleton is represented as a layer. For simplicity, the middle layers $S^n_r$, $0 < r < 3$, are not shown.

By considering different skeletons, we can draw distinct layers of CONCEPTS (a code sketch of this layering follows the list):

(1) In the full complex $S = S^n_0$, this example has only one CONCEPT (one connected component).
(2) In $S^n_1$, the complex still has only one CONCEPT.
(3) In $S^n_2$, the complex has eight CONCEPTS.
(4) In $S^n_3$, the complex has two CONCEPTS; they are the two MAXIMAL PRIMITIVE CONCEPTS.
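The skeletons underlying this layering are straightforward to extract over the same frozenset representation; a sketch, with `r_connected_components` from the earlier sketch in Section 4.1 (the enumeration loop is our assumption about how the layers would be listed):

```python
def skeleton(simplexes, r):
    """(n, r)-skeleton: remove all k-simplexes with k <= r - 1,
    i.e., keep only the simplexes with at least r + 1 vertices."""
    return [s for s in simplexes if len(s) >= r + 1]

def layered_concepts(simplexes, n):
    """CONCEPTS per layer: the r-connected components of each (n, r)-skeleton."""
    return [r_connected_components(skeleton(simplexes, r), r) for r in range(n + 1)]
```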

For each choice, say $S^n_2$, we have, in this case, eight CONCEPTS with which to label (or cluster) the documents. A document is labeled $\mathrm{CONCEPT}_k$ if it has high TFIDF values on the term-associations that define $\mathrm{CONCEPT}_k$.

Fig. 3. This figure illustrates the layer structure of Example 1. The top layer, the (3, 3)-skeleton, has two distinct CONCEPTS, Simplex(a, b, c, d) and Simplex(w, x, y, z). The middle layer, the (3, 2)-skeleton, has 8 CONCEPTS; it is not illustrated here. The (3, 1)-skeleton layer is skipped. The bottom layer, the (3, 0)-skeleton, contains only one connected component; it is shown in the figure.


By considering different cases, we obtain layered clusters. In fact, we could even consider a very coarse clustering in which only the MAXIMAL PRIMITIVE CONCEPTS are considered; this is the case of $S^n_3$. For the purpose of illustrating the methodology, we have focused on this "oversimplified" case.

In general, the simplexes at the lower layers may have common faces. Therefore, using all layers of CONCEPTS at the same time produces vague discrimination, as shown in Fig. 4, in which CONCEPTS may overlap through (lower-dimensional) common faces. As seen in the skeleton $S^3_1$, the maximal connected components generated from Simplex(a, b, c, d) and Simplex(a, c, h) have a common face Simplex(a, c), so some documents cannot be properly discriminated according to the association rules generated from terms a and c; the same holds for the other maximal connected components in the skeleton. Because of the intersections produced by such faces, a proper strategy is to ignore the lower skeletons as much as the application can tolerate.

5. Finding maximal connected components

We can visualize the latent semantics of a collection of documents as a space triangulated/partitioned/granulated by term-associations (simplexes). The space contains CONCEPTS and PRIMITIVE CONCEPTS. We have observed that combinatorial geometry is an effective theory for modeling the latent semantic space of a huge variety of high-dimensional data, such as document collections or bioinformatics data. Algorithms for finding all CONCEPTS, i.e., the maximal connected components in the complex of term-associations, are introduced below. In fact, we focus on the "oversimplified" version, namely on the complex $S^n_n$; in other words, on the MAXIMAL PRIMITIVE CONCEPTS (highest dimension).

Fig. 4. Each cluster of documents is identified by a maximal connected component. Some clusters may overlap with other clusters because of the common faces between them; this phenomenon is illustrated here. To handle such a situation properly, we need to ignore the lower-dimensional simplexes; by doing so, the overlapping disappears (not shown).

5.1. Incidence matrices

First, we need some geometric notations.

Definition 10. In a simplicial complex, V denotes the set of (individual) key terms in a collection of documents, i.e., the 0-simplexes, and E denotes the set of all r-simplexes, $r \geq 0$. If $\mathrm{Simplex}_A$ is in E, its support is defined as $w(\mathrm{Simplex}_A)$, i.e., the tfidf of the term-association $\mathrm{Simplex}_A$.

The incidence matrix and the weighted incidence matrix of a complex are defined as follows; here we are most interested in the case where $\mathrm{Simplex}_i$ is a 0-face.

Definition 11. The $n \times m$ incidence matrix $A = (a_{ij})$ associated with a complex is defined as

$$a_{ij} = \begin{cases} 1 & \text{if } \mathrm{Simplex}_i \text{ is a face of } \mathrm{Simplex}_j \\ 0 & \text{otherwise} \end{cases}$$

The corresponding weighted incidence matrix $A' = (a'_{ij})$ is

$$a'_{ij} = \begin{cases} w_{ij} & \text{if } \mathrm{Simplex}_i \text{ is a face of } \mathrm{Simplex}_j \\ 0 & \text{otherwise} \end{cases}$$

where the weight $w_{ij}$ denotes the support of the term-association.

Example 3. As seen in Example 2, the 2-simplex is the set $\{t_A, t_B, t_C\}$, which is also the maximal connected component representing a concept in a document collection. Based on the Venn diagram of this complex, the incidence matrix I and the weighted incidence matrix $I_W$ of the simplexes can be constructed. For clarity, we only illustrate the incidences between the key terms (0-simplexes) and the term-associations (r-simplexes, r = 1, 2):

$$I = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix} \qquad I_W = \begin{pmatrix} w_{A,B,C} & 0 & w_{A,B} & w_{C,A} \\ w_{A,B,C} & w_{B,C} & w_{A,B} & 0 \\ w_{A,B,C} & w_{B,C} & 0 & w_{C,A} \end{pmatrix}$$

Each row represents the incidence of a vertex with the r-simplexes. Each column corresponds to the incidence of a fixed simplex with all vertices.
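A sketch of Definition 11 for the vertex-versus-simplex case used in Example 3; the column ordering is our assumption.

```python
def incidence_matrix(vertices, simplexes):
    """a_ij = 1 if vertex i is a (0-)face of simplex j, else 0 (Definition 11)."""
    return [[1 if v in s else 0 for s in simplexes] for v in vertices]

# Example 3, with columns ordered {A,B,C}, {B,C}, {A,B}, {C,A}:
simps = [frozenset("ABC"), frozenset("BC"), frozenset("AB"), frozenset("CA")]
I = incidence_matrix(["A", "B", "C"], simps)
# I == [[1, 0, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1]]
```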

5.2. Algorithm

As we already know, an r-simplex is an (r + 1)-term-association (a frequent (r + 1)-itemset). Documents can be clustered based on the maximal simplexes of highest dimension (MAXIMAL PRIMITIVE CONCEPTS), namely the longest associations. Note that documents clustered by MAXIMAL PRIMITIVE CONCEPTS contain common lower-dimensional faces (shorter associations, in particular 0-simplexes); this is a consequence of the Apriori property. In this sense, the methodology provides a soft approach: we allow lower-dimensional overlapping CONCEPTS to exist within different clusters. Consider Example 4: the two maximal 2-simplexes in the top skeleton produce two MAXIMAL PRIMITIVE CONCEPTS with a common 0-face.

Example 4. As shown in Fig. 5 (in the form of an incidence diagram), one component is organized by the simplex $\mathrm{Simplex}_1 = \{t_A, t_B, t_C\}$, the other by the simplex $\mathrm{Simplex}_5 = \{t_C, t_D, t_E\}$. The incidence matrix is (5 vertices × 8 simplexes)

$$\begin{pmatrix} 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}$$

Both simplexes share a common concept, the 0-simplex $\{t_C\}$, which is a frequent 1-itemset.

Fig. 5. The complex is composed of two maximal 2-simplexes, $\mathrm{Simplex}_1 = \mathrm{Simplex}(t_A, t_B, t_C)$ and $\mathrm{Simplex}_5 = \mathrm{Simplex}(t_C, t_D, t_E)$. Both contain the common face $\mathrm{Simplex}(t_C)$, which produces an overlapping concept.


Since the intersections of connected components have lower dimensions, it is convenient to design an efficient algorithm for document clustering in a skeleton-by-skeleton fashion. The algorithm for finding all maximal connected components in a skeleton is as follows.

Require: $V = \{t_1, t_2, \ldots, t_n\}$, the vertex set of all reserved terms in a collection of documents.
Ensure: S is the set of all maximal connected components.
  Let h be a given minimal support.
  S ← ∅
  Let $S_0 = \{e_i \mid e_i = \{t_i\}, \forall t_i \in V\}$ be the 0-simplex set.
  i ← 0
  while $S_i \neq \emptyset$ do
    $S_{i+1}$ ← ∅, the (i + 1)-simplex set
    for all vertices $t_j \in V$ do
      for all elements $e \in S_i$ do
        if $e' = e \cup \{t_j\}$, with $t_j \notin e$, has support no less than h then
          add $e'$ to $S_{i+1}$
          remove $e$ from $S_i$
        end if
      end for
    end for
    S ← S ∪ $S_i$
    i ← i + 1
  end while

In our notation, $S_i$ is a skeleton of $S^i_0$; it is clear that one can obtain $S^n_m$ for any n and m. A simplex is constructed by including all co-occurring terms whose support is greater than or equal to a given minimal support h; an external vertex is added to a simplex only if the resulting support is no less than h.
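A runnable Python rendering of this level-wise construction is sketched below, under the simplifying assumption that support is the plain co-occurrence fraction of Definition 4 rather than the TFIDF-based significance; the function names are ours.

```python
def mpcc_maximal_simplexes(docs, h):
    """Level-wise (Apriori-style) growth of simplexes over documents given as
    sets of terms; returns the maximal term-associations with support >= h."""
    def sup(terms):
        return sum(1 for d in docs if terms <= d) / len(docs)

    vocab = sorted(set().union(*docs))
    level = [frozenset({t}) for t in vocab if sup(frozenset({t})) >= h]
    maximal = []
    while level:
        nxt, grown = set(), set()
        for e in level:
            for t in vocab:
                if t not in e and sup(e | {t}) >= h:
                    nxt.add(e | {t})
                    grown.add(e)
        # a simplex that admits no frequent extension is maximal
        maximal.extend(e for e in level if e not in grown)
        level = list(nxt)
    return maximal

def primitive_concepts(maximal):
    """MAXIMAL PRIMITIVE CONCEPTS: the maximal simplexes of highest dimension."""
    top = max(map(len, maximal), default=0)
    return [m for m in maximal if len(m) == top]
```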

The documents can be decomposed into several categories based on the MAXIMAL PRIMITIVE CONCEPTS (each corresponding to a maximal simplex of highest dimension). If a document contains a MAXIMAL PRIMITIVE CONCEPT, the document highly matches that concept, and, by the Apriori property, all sub-associations of the concept are also contained in the document. The document can then be classified into the category identified by that concept. A document often contains more than one MAXIMAL PRIMITIVE CONCEPT, in which case it can be classified into multiple categories. In the following sections, the algorithm is abbreviated MPCC (Maximal Primitive Concepts Clustering).
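Document assignment then follows directly; a minimal sketch, in which a document simply receives every concept it contains (thresholding on TFIDF values is omitted):

```python
def classify(doc_terms, concepts):
    """Indices of all MAXIMAL PRIMITIVE CONCEPTS contained in the document;
    a document may fall into several categories."""
    return [k for k, c in enumerate(concepts) if c <= doc_terms]
```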

6. Experimental results

As with text search systems and document categorization systems, experimental results, rather than analytic statements, are used to evaluate the clustering algorithm.


6.1. Data sets

Three kinds of datasets are used in our study. The first dataset consists of Web pages collected by Boley et al. [6]. Ninety-eight Web pages in four broad categories (business and finance, electronic communication and networking, labor, and manufacturing) were selected for the experiments. Each category is further divided into four subcategories.

The second dataset consists of 848 electronic medical literature abstracts collected from PubMed, retrieved by searching for the keywords cancer, metastasis, gene and colon. Our purpose is to discriminate the articles according to the organs to which a cancer spreads from the primary tumor; we disregard whether the primary tumor occurs in the colon or in other organs. A few organs are selected for this study: liver, breast, lung, brain, prostate, stomach, pancreas, and lymph.

The third dataset consists of 305 electronic medical articles collected from the journals Transfusion, Transfusion Medicine, Transfusion Science, Journal of Pediatrics, and Archives of Disease in Childhood Fetal and Neonatal Edition. The articles were selected by searching for the keywords transfusion, newborn, fetal and pediatrics. The MeSH categories are used to evaluate the effectiveness of our algorithm. The second and third datasets each cover a homogeneous topic and follow a similar concept hierarchy, which makes it practical for human experts to validate the concepts generated by our method.

6.2. Evaluation criteria

The experimental evaluation of document clustering approaches usually measures their effectiveness rather than their efficiency [40], in other words, the ability of an approach to make the right categorization.

Considering the contingency table for a category (Table 1), recall, precision, and $F_\beta$ are three measures of the effectiveness of a clustering method. Precision and recall with respect to a category are defined, respectively, as

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

Table 1
The contingency table for category $c_i$

Category $c_i$        Clustering results
                      YES       NO
Expert     YES        $TP_i$    $FN_i$
           NO         $FP_i$    $TN_i$


The $F_\beta$ measure, which combines precision and recall, was introduced by van Rijsbergen in 1979 as the following formula:

$$F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\beta^2 \cdot \mathrm{Precision}_i + \mathrm{Recall}_i}$$

In this paper, we use the $F_1$ measure, obtained when $\beta$ equals 1, meaning that precision and recall are weighted equally, to evaluate the performance of clustering. Because many categories are generated, and for comparison purposes, the overall precision and recall are calculated as the averages of the precisions and recalls of the individual categories, respectively. $F_1$ is calculated as the mean of the individual results; it is a macroaverage over categories.
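A small sketch of this macroaveraged evaluation, assuming the per-category contingency counts of Table 1 are given:

```python
def macro_scores(per_category):
    """per_category: iterable of (TP, FP, FN) triples, one per category.
    Returns macro-averaged (precision, recall, F1), with F1 the beta = 1 case."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in per_category:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
        ps.append(p)
        rs.append(r)
    n = len(ps)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```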

6.3. Results

Table 2 shows the results of the first experiment. The result of the PDDP algorithm [6] is computed over all nonstop words, that is, the F1 database in their paper, with 16 clusters. The result of our algorithm, MPCC, is computed over all nonstop words with a minimal support of 0.15.

The PDDP algorithm hierarchically splits the data into two subsets and derives a linear discriminant function from them based on the principal direction (i.e., principal component analysis). With sparse and high-dimensional datasets, principal component analysis often hurts classification results, inducing high false positive and false negative rates. The hyperedges generated by PDDP are based on the average of the confidences of the itemsets with the same items. It is unfair that a possible concept may be withdrawn whenever one itemset has a very small confidence in some implication direction.

As seen in Fig. 6, 47 clusters, i.e., MAXIMAL PRIMITIVE CONCEPTS (maximal connected components of the top skeleton), have been generated by MPCC. This is larger than the original 16 clusters. After decreasing the minimal support value to 0.1, the number of clusters is reduced to 23, and the precision, recall, and $F_1$ become 63.7%, 77.3%, and 0.698, respectively. The higher the minimal support value, the lower the number of co-occurring terms in a complex. Fig. 7 demonstrates the performance of MPCC on the first dataset.

The effectiveness on the second dataset is shown in Fig. 8. Fourteen organ-related words were selected for clustering the abstracts. Fig. 9 demonstrates the generated simplicial complex with a minimal support of 0.05.

Table 2
Results on the first dataset: MPCC compared with four algorithms, PDDP, k-means, AutoClass and HCA

Method      MPCC     PDDP     k-Means    AutoClass    HCA
Precision   68.3%    65.6%    56.7%      34.2%        35%
Recall      74.2%    68.4%    34.9%      23.6%        22.5%


The MeSH categories (22 categories) were used to evaluate the effectiveness of MPCC on each individual category of the third dataset. Document clustering is based on the MeSH terms related to "Transfusion" and "Pediatrics". The effectiveness on all categories is shown in Fig. 10. The MeSH categories form a hierarchical structure in which some categories are subcategories of others. Many concept categories share the same terminologies, which induces a high false negative rate for MPCC on document clustering. In this dataset, documents are not uniformly distributed over the categories; some categories contain only a few documents, which restricts their latent concepts to a few terms, for example, the Anemia and Surgery categories, whose precisions are both below 70%.

Fig. 6. The complex generated from the first dataset by using MPCC.

Fig. 8. The effectiveness of MPCC on the second dataset.

7. Conclusion

Polysemy, phrases and term dependency are the limitations of search technology [22]. A single term cannot identify a latent concept in a document; for instance, the term "Network" associated with "Computer", "Traffic", or "Neural" denotes different concepts. Discriminating term associations is, without doubt, a concrete way to distinguish one category from the others: a group of solid term associations can clearly identify a concept. Most methods, such as k-means, HCA, AutoClass or PDDP, classify or cluster documents from the matrix representing a set of documents. It seems inefficient and complicated to discover all term associations from such a high-dimensional and sparse matrix. The term-associations (frequently co-occurring terms) of a given collection of documents form a simplicial complex. The complex can be decomposed into connected components at various levels (at various levels of skeletons). We believe each such connected component properly identifies a concept in a collection of documents.


This paper presents a novel view of document clustering based on finding maximal connected components. An agglomerative method for finding geometric maximal connected components without the use of a distance function is proposed. A maximal r-simplex of highest dimension can represent a MAXIMAL PRIMITIVE CONCEPT in a collection of documents. We can effectively discover such maximal simplexes of highest dimension and use them to cluster the collection of documents. Compared with traditional methods, such as k-means, AutoClass and Hierarchical Clustering (HAC), and the partition-based hypergraph algorithm PDDP, our algorithm demonstrates superior performance on three datasets. The paper illustrates that geometric complexes are effective models for automatic document clustering.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 International Conference on Management of Data (SIGMOD 93), May 1993, pp. 207–216.

[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th VLDB Conference, 1994.

[3] T.W. Anderson, On estimation of parameters in latent structure analysis, Psychometrika 19 (1954) 1–10.

[4] F. Beil, M. Ester, X. Xu, Frequent term-based text clustering, in: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alta., Canada, 2002.

[5] M.W. Berry, Large scale sparse singular value computations, International Journal of Supercomputer Applications 6 (1) (1992) 13–49.

[6] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Document categorization and query generation on the World Wide Web using WebACE, Artificial Intelligence Review 13 (5–6) (1999) 365–391.

[7] S. Brin, R. Motwani, J. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proceedings of ACM SIGMOD International Conference on Management of Data, 1997, pp. 255–264.

[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Proceedings of the Seventh International WWW Conference (WWW 98), Brisbane, Australia, 1998.

[9] P. Cheeseman, J. Stutz, Bayesian classification (AutoClass): theory and results, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp. 153–180.

[10] M.S. Chen, J. Han, P.S. Yu, Data mining: an overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering 8 (6) (1996) 866–883.

[11] D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey, Scatter/gather: a cluster-based approach to browsing large document collections, in: Proceedings of the Fifteenth Annual International ACM SIGIR Conference, 1992, pp. 318–329.

[12] R. Feldman, Y. Aumann, A. Amir, W. Klösgen, A. Zilberstien, Text mining at the term level, in: Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA, 1998, pp. 167–172.

[13] R. Feldman, I. Dagan, W. Klösgen, Efficient algorithms for mining and manipulating associations in texts, in: Cybernetics and Systems, The 13th European Meeting on Cybernetics and Research, vol. II, Vienna, Austria, April 1996.

[14] R. Feldman, M. Fresko, H. Hirsh, Y. Aumann, O. Liphstat, Y. Schler, M. Rajman, Knowledge management: a text mining approach, in: Proceedings of 2nd International Conference on Practical Aspects of Knowledge Management, Basel, Switzerland, 1998, pp. 29–30.


[15] R. Feldman, H. Hirsh, Mining associations in text in the presence of background knowledge, in: Proceedings of 3rd International Conference on Knowledge Discovery, 1996.

[16] W.B. Frakes, R. Baeza-Yates, Information Retrieval Data Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992.

[17] N. Fuhr, C. Buckley, A probabilistic learning approach for document indexing, ACM Transactions on Information Systems 9 (3) (1991) 223–248.

[18] J.D. Holt, S.M. Chung, Efficient mining of association rules in text databases, in: Proceedings of CIKM, Kansas City, MO, 1999.

[19] A. Hyvarinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley, 2001.

[20] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.

[21] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.

[22] A. Joshi, Z. Jiang, Retriever: improving web search engine results using clustering, in: A. Gangopadhyay (Ed.), Managing Business with Electronic Commerce: Issues and Trends, World Scientific, 2001, Chapter 4.

[23] G. Karypis, R. Aggarwal, V. Kumar, S. Shekhar, Multilevel hypergraph partitioning: application in VLSI domain, in: Proceedings of the ACM/IEEE Design Automation Conference, 1997, pp. 381–389.

[24] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, Heidelberg, 1995.

[25] R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations 2 (1) (2000) 1–15.

[26] B. Lent, R. Agrawal, R. Srikant, Discovering trends in text databases, in: Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA, 1997, pp. 227–230.

[27] K.I. Lin, H. Chen, Automatic information discovery from the invisible web, in: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'02), Special Session on Web and Hypermedia Systems, 2002.

[28] T.Y. Lin, Attribute (feature) completion—the theory of attributes from data mining prospect, in: Proceedings of 2002 IEEE International Conference on Data Mining (ICDM), Maebashi, Japan, 2002, pp. 282–289.

[29] S.Y. Lu, K.S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Transactions on Systems Man and Cybernetics 8 (1978) 381–389.

[30] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, 1967, pp. 281–297.

[31] M. Marchiori, The quest for correct information on the web: hyper search engines, in: Proceedings of the Sixth International WWW Conference (WWW 97), Santa Clara, CA, 1997.

[32] M.E. Maron, J.L. Kuhns, On relevance, probabilistic indexing and information retrieval, Journal of the ACM 7 (1960) 216–244.

[33] D. Mladenic, Text-learning and related intelligent agents: a survey, IEEE Intelligent Systems (1999) 44–54.

[34] A. Moffat, J. Zobel, Compression and fast indexing for multi-gigabyte text databases, Australian Computer Journal 26 (1) (1994) 1–9.

[35] J.S. Park, M.S. Chen, P.S. Yu, Using a hash-based method with transaction trimming for mining association rules, IEEE Transactions on Knowledge and Data Engineering 9 (5) (1997) 813–825.

[36] B. Pinkerton, Finding what people want: experiences with the webcrawler, in: Proceedings of the Second International WWW Conference, Chicago, IL, 1994.

[37] M. Rajman, R. Besançon, Text mining: natural language techniques and text mining applications, in: Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Leysin, Switzerland, 1997.

[38] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management 24 (5) (1988) 513–523.

[39] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

[40] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.


[42] R. Weiss, B. Velez, M.A. Sheldon, C. Manprempre, P. Szilagyi, A. Duda, D.K. Gifford, Hypursuit: a hierarchical network search engine that exploits content-link hypertext clustering, in: Proceedings of the 7th ACM Conference on Hypertext, New York, NY, 1996.

[43] P. Willett, Recent trends in hierarchic document clustering: a critical review, Information Processing and Management 24 (1988) 577–597.

[44] O. Zamir, O. Etzioni, Web document clustering: a feasibility demonstration, in: Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98), 1998, pp. 46–54.

