Topic Hierarchy Generation for Text Segments: A Practical Web-based Approach


SHUI-LUNG CHUANG

Institute of Information Science, Academia Sinica and

LEE-FENG CHIEN

Institute of Information Science, Academia Sinica

Department of Information Management, National Taiwan University

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed topic hierarchy. In this paper, we address the problem of generating topic hierarchies for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibility of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed for creating the hierarchical topic structure of text segments. Text segments with close concepts can be grouped together in a cluster, and relevant clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is believed to benefit many information systems.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Miscellaneous

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Topic Hierarchy Generation, Text Segment, Hierarchical Clustering, Partitioning, Search-Result Snippet, Text Data Mining

1. INTRODUCTION

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed topic hierarchy.

For example, deriving a topic hierarchy (or concept hierarchy) of terms from a set of documents in an information retrieval system could provide a comprehensive form

This work was supported in part by the following grants: NSC 93-2752-E-001-001-PAE, 93-2422-H-001-0004, and 93-2213-E-001-025.

Authors’ address: Institute of Information Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan; email: {slchuang,lfchien}@iis.sinica.edu.tw

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© 2005 ACM 0000-0000/2005/0000-0001 $5.00

ACM Journal Name, Vol. V, No. N, March 2005, Pages 1–33.


Example text segments (input):

What effects will an Iraq war have on oil prices?
Find the cheapest ACER TravelMate without a CD-Rom drive
PDAs, Tablet computers, and Pocket PCs
When is the SIGIR 2003 paper submission deadline?
What is the difference between PHS, WAP, and GPRS?
The two final proposals to rebuild the World Trade Center
The newest product in the IBM Thinkpad X Series
Where can I find the interview with Saddam Hussein?
The cellular phone with the GSM-1900 system
The website of the KDD 2003 conference

The same segments as grouped in the generated hierarchy:

Find the cheapest ACER TravelMate without a CD-Rom drive
The newest product in the IBM Thinkpad X Series
PDAs, Tablet computers, and Pocket PCs
The cellular phone with the GSM-1900 system
What is the difference between PHS, WAP, and GPRS?
The two final proposals to rebuild the World Trade Center
What effects will an Iraq war have on oil prices?
Where can I find the interview with Saddam Hussein?
The website of the KDD 2003 conference
When is the SIGIR 2003 paper submission deadline?

[Tree rendering lost in extraction; the hierarchy's topic-node labels are (3C), (Notebook), (Mobile), (Handy), (World), (WTC), (Iraq War), and (Conf.).]

Fig. 1. Example text segments and the topic hierarchy we seek to generate.

to present those documents [Sanderson and Croft 1999]. A similar need for auto-generation of topic hierarchies could occur in a question answering system. Many enterprise Web sites provide users with the ability to use natural language queries to ask questions and search for manually prepared answers. To make the preparation of answers to frequently asked questions (FAQ) more efficient, it is expected that queries on similar topics can be automatically clustered. In this paper, we address the problem of generating topic hierarchies for diverse text segments, and present a general and practical approach to this problem using the Web as an additional knowledge source. Here, a text segment is defined as a meaningful word string that is often short in length but represents a specific concept in a certain subject domain, such as a keyword in a document set or a natural language query from a user. Text segments are of many types, including words, phrases, named entities, natural language queries, news events, product names, paper or book titles, etc.

The key idea of the proposed approach is to apply a clustering technique to automatically create the hierarchical topic structure of text segments. In the structure, text segments with close concepts are grouped to form the basic clusters, and similar basic clusters recursively form super clusters that characterize the associations between their composed clusters. Relevant clusters are also linked at the same or nearby levels whenever possible. Each cluster, therefore, represents a certain topic class of its composed text segments. Figure 1 shows an illustrative example of the proposed approach. In the figure, there is a set of example text segments, e.g., natural language queries to a search engine, and a topic hierarchy that we seek to generate automatically from those example queries. With the auto-generated topic hierarchy, users' search topic classes would be easier to observe and analyze.

Clustering short text segments is a difficult problem given that, unlike long documents, short text segments typically do not contain enough information to extract


[Figure 2 diagram (duplicated and garbled in extraction): Text Segments → Context Extraction (via Search Engines) → Hierarchical Clustering (HAC-Based Binary-Tree Hierarchy Generation, then Min-Max Partitioning) → Topic Hierarchy.]

Fig. 2. An abstract diagram showing the concept behind the proposed approach.

reliable features. For long documents, their similarities can be estimated based on their common words. A few words in the documents (usually new words or proper nouns) unknown to the classifier might not cause serious classification errors. However, the similarity between two text segments is difficult to judge in the same way because text segments are usually short and do not contain enough textual overlap. Thus, one of the most challenging issues concerning this problem is to acquire proper features to characterize the text segments. For those text segments extracted from documents, such as key terms, the source documents can be used to characterize them. However, in many cases, such as when dealing with search-engine query strings, there may not exist sufficient relevant documents to represent the target text segments. A lack of domain-specific corpora for describing text segments is usually the case in practice. Therefore, relying on a predetermined corpus cannot be a general approach to this problem.

Fortunately, the Web, as the largest and most accessible data repository in the world, provides an alternative way to deal with this difficulty; that is, it provides a general way to supplement the insufficient information of various text segments with the rich resources on the Web. Many search engines constantly crawl Web resources and retrieve relevant Web pages for large numbers of free-text queries, including single terms and longer word strings. In the proposed approach, we incorporate the search-result snippets returned from search engines into the process of acquiring features for text segments. A query relaxation technique is also developed to obtain adequate snippets for long text segments, which are more likely to retrieve few search results. The overall concept of the proposed approach is shown in Figure 2.

In addition to the way we acquire features, a hierarchical clustering algorithm was developed for creating the hierarchical topic structure of text segments.

Different from traditional clustering algorithms, which tend to produce clusters and hierarchies with a very unnatural shape, our algorithm aims to produce a more natural and comprehensive hierarchy structure. In the literature, many algorithms for hierarchical data clustering have been developed. However, they are mostly binary-tree-based; i.e., they generate binary tree hierarchies [Willet 1988]. Only a few of them deal with multi-way-tree hierarchies, e.g., model-based hierarchical clustering [Vaithyanathan and Dom 2000], but they suffer from high computational cost or a need for predetermined constants on the number of branches or threshold values for similarity scores. Our initial intention in discovering topic hierarchies for text segments is to provide humans a basis for the in-depth analysis of text segments. The broad and shallow multi-way-tree representation, instead



of the narrow and deep binary-tree one, is believed easier and more suitable for humans to browse, interpret, and analyze in depth. With this motivation, we develop a hierarchical clustering algorithm, called HAC+P: an extension of the Hierarchical Agglomerative Clustering algorithm (HAC), which builds a binary hierarchy in a bottom-up fashion, followed by a top-down hierarchical partitioning technique, named min-max partitioning, to partition the binary hierarchy into a natural and comprehensive multi-way-tree hierarchy.

Extensive experiments have been conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The promising results show the potential of the proposed approach in clustering similar text segments and creating natural topic structures; this is believed to benefit the design of information systems in many ways, such as text summarization, query clustering, and thesaurus construction. In the rest of this paper, we first examine the possibilities of using search-result snippets for feature extraction of text segments and introduce the data representation model used in this study. Then the proposed hierarchical clustering algorithm is presented in detail, followed by the experiments and their results. Further, the query relaxation technique and more experiments are introduced. An evaluation based on user studies is also conducted on the results of our approach. Finally, we provide further discussion, review related work, and draw conclusions.

2. FEATURE EXTRACTION USING SEARCH-RESULT SNIPPETS

Compared with general documents, text segments are much shorter and typically do not contain enough information for extracting adequate and reliable features. To assist the relevance judgment between text segments, additional knowledge sources should be exploited. It is helpful to consider the process by which a human expert determines the meaning(s) of a text segment beyond his/her knowledge.

From our observations, when facing an unknown text segment, humans may refer to the various contexts in which it occurs in documents, from which the meaning(s) of the segment can be inferred. The proposed approach is, therefore, designed to simulate such human behavior.

Our basic idea is to exploit the Web, the largest and most ubiquitously accessible data repository in the world. Adequate contexts of a text segment, e.g., the sentences neighboring the given text segment, can be retrieved from large numbers of Web pages. This idea is somewhat analogous to that of determining the sense of a word by means of its context words extracted from a predetermined document corpus in linguistic analysis [Manning and Schutze 1999]. However, there are differences. With a document corpus of limited size and domains, a conventional approach normally extracts all possible context words of the given word from the corpus. The situation is different when using the Web as the corpus. The number of matched contexts on the Web might be huge. A practical approach should adopt only those considered relevant to the intended domain(s) of the given text segment. From this perspective, the proposed approach favors the contexts obtained from pages relevant to the given text segment.

We found that it is convenient to implement our idea using existing search engines. A text segment can be treated as a query with a certain search request.


Its contexts are then obtained directly from the highly ranked search-result snippets, e.g., the titles and descriptions of search-result entries, and the texts surrounding the matched terms. This is analogous to the technique of pseudo-relevance feedback, which improves retrieval performance with expansion terms extracted from the top-ranked documents [Buckley et al. 1992]. Mostly, this scenario works fine for short text segments. However, some long text segments might be too specific to match exact text strings and obtain effective search results directly via search engines; besides, the returned snippets might not be sufficient.

For this reason, a specific query processing technique, named query relaxation (refer to Section 5), was developed to obtain adequate relevant search results for long text segments through a bootstrapping process of search requests to search engines.

Below we first introduce the text representation model used in this study.

2.1 Representation Model

We adopt the vector-space model as our data representation. Suppose that, for each text segment p, we collect up to Nmax search-result entries, denoted as Dp. Each text segment can then be converted into a bag of feature terms by applying normal text processing techniques, e.g., removing stop words1 and stemming, to the contents of Dp. Let T be the feature term vocabulary, and let ti be the i-th term in T. With simple processing, a text segment p can be represented as a term vector vp in a |T|-dimensional space, where vp,i is the weight of ti in vp. The term weights in this work are determined according to one of the conventional tf-idf term weighting schemes [Salton and Buckley 1988], in which each term weight vp,i is defined as

vp,i = (1 + log2 fp,i) × log2(n/ni),   (1)

where fp,i is the frequency of ti in vp's corresponding feature term bag, n is the total number of text segments, and ni is the number of text segments that contain ti in their corresponding bags of feature terms. The similarity between a pair of text segments is computed as the cosine of the angle between the corresponding vectors, i.e.,

sim(va, vb) = cos(va, vb).

Further, we define the average similarity between two sets of vectors, Ci and Cj, as the average of all pairwise similarities among the vectors in Ci and Cj:

simA(Ci, Cj) = (1 / (|Ci||Cj|)) Σ_{va∈Ci} Σ_{vb∈Cj} sim(va, vb).
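The weighting and similarity computations above can be sketched in a few lines of Python. This is only a minimal illustration of Eq. (1), the cosine measure, and simA; the sparse dictionary representation and the function names are our own choices, not part of the paper's system:

```python
import math

def tfidf_vectors(term_bags):
    """Build sparse tf-idf vectors following Eq. (1):
    v_{p,i} = (1 + log2 f_{p,i}) * log2(n / n_i).
    Terms occurring in every bag get idf 0 and are dropped, matching the
    nonzero-weight filtering described in the text."""
    n = len(term_bags)
    vocab = {t for bag in term_bags for t in bag}
    # n_i: number of segments whose feature bag contains term t_i
    df = {t: sum(1 for bag in term_bags if t in bag) for t in vocab}
    vectors = []
    for bag in term_bags:
        freq = {}
        for t in bag:
            freq[t] = freq.get(t, 0) + 1
        v = {}
        for t, f in freq.items():
            idf = math.log2(n / df[t])
            if idf > 0:
                v[t] = (1 + math.log2(f)) * idf
        vectors.append(v)
    return vectors

def cosine(va, vb):
    """sim(va, vb) = cos(va, vb) over sparse dict vectors."""
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def sim_avg(Ci, Cj):
    """simA(Ci, Cj): average pairwise similarity between two vector sets."""
    return sum(cosine(a, b) for a in Ci for b in Cj) / (len(Ci) * len(Cj))
```
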

It should be noted that our purpose in using search-result snippets is not to fulfill the search request, but mainly to get adequate information to reflect the feature-distribution characteristics of a text segment's intended topic domain(s) and thereby aid the relevance measurement among text segments. It is not strictly required that the extracted contexts be obtained from the most relevant pages. In fact, as the experimental results in Section 5 show, the ranking of search

1 In this work, stop words are determined by a stop-word list, and the one we used is obtained from the Smart system, available at ftp://ftp.cs.cornell.edu/pub/smart/.



Laptop-Notebook-Handheld-Palm-size-PDA-Pocket-PC-Companion- ...

... makes carrying cases for Handheld PC, Palm-size PC, PDA, Pocket PC, PC Companion, Laptop, Notebook, and Tablet PC computers - as well as Cellphones, Pagers ...

Carrying cases for Portable Electronics - Cellular Phones, Laptop ...

Since there is a flood of new notebook, PDA computers, and cellular phones constantly coming to market, The Pouch, Inc., must do a lot of research, in order to ...

PocketLOOX - Fujitsu Siemens Computers

... PDA & Tablet PC notebooks personal computers thin clients broadband solutions workstations intel based servers UNIX servers BS2000/OSD servers ...

pen tablet pc - Fujitsu Siemens Computers

Fujitsu Siemens Computers .com, HOME SEARCH CONTACT COUNTRIES JOBS SITEMAP, ... PDA & Tablet PC notebooks personal computers thin clients broadband solutions ...

PDA Street - The PDA Network for Handheld Computers, PDA Software ...

... you can really fit in any pocket? ... a PDA Talk about PDAs PDA News Windows ... REX PocketMail Smartphones Tablet Computers Other Gadgets. ...

PDAStreet: News

... Zire Stands Out in Sluggish PDA Stats As enterprise users back off buying ... has released version 1.1 of mcBank, its financial manager software for the Pocket PC. ...

Fujitsu Siemens Computers - Serwis FSC - PRODUKTY - PDA & Tablet ...

... Nowe urzadzenie kieszonkowe Fujitsu Siemens Computers to krok w przyszlosc ... male i lekkie urzadzenie z oprogramowaniem Microsoft Pocket PC 2002 zapewnia ...

Yahoo! News Full Coverage - Technology - Handheld and Palmtop ...

... more. Audio. -, PDA Tech Tips from ... , Sales of handheld computers fall - All Things Considered/NPR (Aug 7, 2001). more. ... , Is Microsoft’s Tablet PC innovative enough? ...

Hardware

... Computers, Computers, Copiers & Fax Machines, Copiers & Fax Machines. ... Hand- held/PalmPC/PDA/Pocket PC/Tablet PC, Hard Drives, ...

Medical Pocket PC - Medical Resources for the Pocket PC

... devices on the WLAN, including Tablet PCs ... considerations involved in implementing handheld computers into residency ... Dictionary, 27th edition for PDA gives you ...

Fig. 3. The top ten search-result snippets of text segment “PDA, Tablet computers, and Pocket PCs.”

results by search engines did not affect the clustering performance much; the proposed approach is therefore believed to be robust and not highly dependent on the search engines employed. Before we move to the clustering algorithm, let us examine the feasibility of using search-result snippets through a preliminary observation.
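The feature-acquisition step described above (collect up to Nmax snippets for a segment, strip stop words, keep the remaining tokens as feature terms) can be sketched as follows. `snippet_term_bag` is a hypothetical helper operating on already-retrieved snippet strings; a real system would also call an actual search engine and apply stemming:

```python
import re

# tiny illustrative stop-word list; the paper uses the Smart system's list
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is", "are", "with"}

def snippet_term_bag(snippets, stop_words=STOP_WORDS):
    """Turn collected search-result snippets (title + description strings)
    into a bag of feature terms: lowercase, tokenize, drop stop words.
    Stemming, which the paper also applies, is omitted for brevity."""
    bag = []
    for snippet in snippets:
        for token in re.findall(r"[a-z0-9]+", snippet.lower()):
            if token not in stop_words:
                bag.append(token)
    return bag
```

The resulting bag feeds directly into the tf-idf weighting of Section 2.1.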

2.2 An Illustrative Example

Figure 3 shows Google2's top ten search-result snippets for the text segment "PDA, Tablet computers, and Pocket PCs," where the titles and descriptions are represented in bold and normal faces, respectively. Obviously, these snippets, composed of only fragmented texts, do not directly provide an answer to a comparison between pda, tablet computer, and pocket pc, nor to any question about these three products that users might want to ask. However, the snippets do contain many words related to the corresponding text segment. To provide readers a clearer understanding of this phenomenon, Figure 4 lists the eighty words that occur most frequently in the top 100 search-result snippets together with their frequency counts. All stop words have been removed, and the remaining words are converted to lower case but not stemmed (e.g., plural and singular forms are treated as different in this preliminary observation). In the list, the words with extremely high frequency are those occurring in the text segment itself, i.e., "pc," "pda," "computers," "tablet," and "pocket." Besides these, many words can be found that are highly related to the corresponding text segment; for example, the product manufacturers and brand names, e.g., "fujitsu," "palm," "compaq," "microsoft," etc.; the related products, e.g., "pen," "notebook," "phone," "laptop," etc.; the product characteristics, e.g., "handheld," "mobile," "wireless," "digital," "portable," etc.; and other software- or hardware-related terms, e.g., "windows," "ce," "xp," "pocketmail," etc. Intuitively, all these related terms give a higher

2 http://www.google.com/


pc 157, pda 127, computers 113, tablet 108, pocket 89, handheld 29, news 25, fujitsu 22,
siemens 19, pen 18, handhelds 15, new 15, palm 16, com 15, mobile 15, compaq 14,
microsoft 14, pcs 13, software 13, windows 13, pdas 12, wireless 11, computer 9, notebooks 9,
systems 9, toshiba 9, digital 8, notebook 8, servers 8, accessories 7, ce 7, computing 7,
gadgets 7, phone 7, 256mb 6, epoc 6, internet 6, laptop 6, loox 6, personal 6,
talk 6, 0ghz 5, 30gb 5, buy 5, devices 5, hardware 5, network 5, online 5,
os 5, pocketmail 5, smartphones 5, tc1000t 5, technology 5, companion 5, ipaq 5, magazine 5,
medical 5, prices 5, product 5, available 4, based 4, desktop 4, edition 4, portable 4,
printers 4, products 4, rex 4, shareware 4, submit 4, thin 4, warnty 4, xp 4,
xpp 4, components 4, dell 4, drives 4, infocater 4, ink 4, resources 4, rim 4

Fig. 4. The word and frequency list of the 80 most frequent words in the top 100 search-result snippets of text segment “PDA, Tablet computers, and Pocket PCs.”

chance to make an association between the corresponding text segment and other segments with similar topics, e.g., "The newest product in the IBM Thinkpad X Series" and "The cellular phone with the GSM-1900 system" (refer to Figure 1). Of course, there exist very few terms, such as "based," that are considered less related, i.e., noisy terms, but they seem not to hurt much. In conclusion, most of the highly frequent terms found in the snippets reasonably characterize the corresponding text segment.

To further examine whether using search-result snippets for feature extraction can help the relevance judgment between text segments and reflect their topic similarity, we selected five text segments from those listed in Figure 1, listed as p1–p5 in Figure 5, as the testing samples3. For each text segment, we collected up to 100 search-result snippets and selected the 80 most frequent words as the segment's feature terms4. We then computed the term weights of those five segments' features according to Equation 1 and kept the feature terms with nonzero weight. Figure 5 lists the corresponding information: the head column lists the feature terms, and each inner cell lists the weight of a feature term with respect to the segment indicated in the head row. The matrix has been arranged so as to clearly reveal the relationships between text segments and their highly correlated features.

Suppose that the hierarchy shown in Figure 1 is the target topical relationship between the testing text segments that we want to discover. The similarity scores between all pairs of testing text segments are listed as follows:

sim(p1,p2)=.091  sim(p1,p3)=.037  sim(p1,p4)=.011  sim(p1,p5)=.008
sim(p2,p3)=.068  sim(p2,p4)=.014  sim(p2,p5)=.023  sim(p3,p4)=.041
sim(p3,p5)=.020  sim(p4,p5)=.178

From the data, segments p4 and p5 have the highest similarity score, indicating they are most related based on the similarity measure strategy used. This result

3 The testing text segments were chosen because they are all related to technology and consumer electronics topics, yet still differ enough to be distinguished.

4 Again, the feature terms are shown without stemming in order to provide readers a more understandable form of those terms. Notice that the eighty most frequent words are chosen mainly for the purpose of illustration. Our approach uses all of the words found in the retrieved search-result snippets, except stop words.



feature term     p1    p2    p3    p4    p5
ibm             .091  .164  0     0     0
model           .078  .065  0     0     0
port            .061  .072  0     0     0
review          .061  .072  0     0     0
reviews         .078  .086  0     0     0
series          .078  .132  0     0     0
thinkpad        .078  .156  0     0     0
work            .101  .077  0     0     0
acer            .182  0     .041  0     0
buy             .030  0     .068  0     0
computer        .056  .052  .048  0     0
computing       .034  .040  .044  0     0
hardware        .056  .040  .038  0     0
laptop          .044  .046  .041  0     0
microsoft       .061  0     .097  0     0
news            .044  .046  .061  0     0
notebook        .068  .052  .041  0     0
notebooks       .056  .069  .048  0     0
pc              .051  .062  .084  0     0
cable           .034  .043  0     .041  0
connect         .061  0     0     .073  0
new             .015  .025  .023  .023  0
price           .051  .046  0     .041  0
communications  .034  0     0     .055  .039
mobile          .022  0     .022  .034  .029
time            .061  0     0     0     .096
computers       0     .086  .152  0     0
desktop         0     .072  .062  0     0
os              0     .065  .068  0     0
prices          0     .065  .062  0     0
product         0     .163  .062  0     0
toshiba         0     .077  .078  0     0
products        0     .036  .034  .041  0
systems         0     .048  .041  .051  0
business        0     .072  0     0     .075
information     0     .086  0     0     .063
page            0     .036  .011  0     .039
services        0     .040  0     .051  .063
system          0     .036  0     .078  .039
technology      0     .020  .017  .018  .021
wireless        0     .017  .022  .025  .030
www             0     .086  0     0     .063
accessories     0     0     .074  .079  0
pcs             0     0     .097  .115  0
siemens         0     0     .099  .066  0
based           0     0     .062  0     .091
digital         0     0     .044  .075  .035
internet        0     0     .068  0     .091
network         0     0     .038  .063  .049
personal        0     0     .038  .053  .044
phone           0     0     .044  .094  .035
cdma            0     0     0     .092  .107
cell            0     0     0     .112  .054
cellular        0     0     0     .166  .070
communication   0     0     0     .088  .063
data            0     0     0     .084  .093
europe          0     0     0     .073  .063
global          0     0     0     .095  .054
gprs            0     0     0     .073  .110
gsm             0     0     0     .166  .118
networks        0     0     0     .103  .063
pdc             0     0     0     .066  .108
phones          0     0     0     .128  .087
radio           0     0     0     .092  .063
service         0     0     0     .103  .096
tdma            0     0     0     .098  .091
technologies    0     0     0     .079  .087
users           0     0     0     .066  .084
wap             0     0     0     .066  .135
world           0     0     0     .114  .063

Text Patterns:
p1  Find the cheapest ACER TravelMate without a CD-Rom drive
p2  The newest product in the IBM Thinkpad X Series
p3  PDAs, Tablet computers, and Pocket PCs
p4  The cellular phone with the GSM-1900 system
p5  What is the difference between PHS, WAP, and GPRS?

Fig. 5. The weight matrix of feature terms with respect to the example text segments.

truly reflects the fact (as shown in Figure 1) that they belong to the same topic, namely mobile phone systems, which can be clearly distinguished from the remaining three text segments. The pair with the second-highest similarity score is p1 and p2, which can be identified as similar in a topic about notebooks. The pair with the third-highest score is p2 and p3, indicating that p3 is more related to the notebook topic (to which p1 and p2 belong) than to the mobile-phone-system topic (to which p4 and p5 belong). The discovered relationships between the testing text segments are almost the same as the topic hierarchy shown in Figure 1. This result supports our approach of extracting features from search-result snippets to reveal the topic similarity between text segments.
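The pairwise comparison above can be reproduced mechanically: given sparse weight vectors like those in Figure 5, compute all pairwise cosines and sort. The sketch below uses a small hypothetical subset of weights, loosely echoing Figure 5, purely for illustration:

```python
import math
from itertools import combinations

def cosine(va, vb):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_pairs(vectors):
    """Given a mapping segment-id -> sparse vector, return every pair of
    segments sorted by descending cosine similarity."""
    pairs = [((a, b), cosine(vectors[a], vectors[b]))
             for a, b in combinations(sorted(vectors), 2)]
    return sorted(pairs, key=lambda kv: -kv[1])
```

With toy vectors sharing "gsm" and "phones" weights only between p4 and p5, that pair surfaces at the top of the ranking, mirroring the sim(p4,p5)=.178 observation.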

From these observations, several questions must be addressed to justify using search-result snippets as a general approach to feature extraction, e.g.:

—How many snippets are appropriate to obtain adequate features?

—Does the ranking of snippets by search engines seriously affect the quality of retrieved features?

—If a text segment has several different meanings, can highly ranked search-result snippets reveal all of its meanings?

—How should we deal with text segments that retrieve too few or even no search-result snippets?

Some of the questions will be answered through experiments, and some will be clarified further in the discussion section. Overall, the above preliminary observation shows that the idea of using search-result snippets for feature extraction has tremendous potential. Of course, such an approach certainly has disadvantages, and both its strengths and weaknesses will be discussed in Section 7. In conclusion, with regard to the problem addressed in this paper, the clustering hypothesis can be intuitively stated as follows: two text segments are clustered together because they retrieve similar context contents.

3. HIERARCHICAL CLUSTERING ALGORITHM: HAC+P

The purpose of clustering in our approach is to generate a cluster hierarchy for organizing text segments. The hierarchical clustering problem has been studied extensively in the literature, and many different clustering algorithms exist. They are mainly of two major types: agglomerative and divisive. We have adopted HAC as the backbone mechanism and developed a new algorithm, called HAC+P, for our clustering problem. The algorithm consists of two phases: HAC-based clustering to construct a binary-tree cluster hierarchy and min-max partitioning to generate a natural and comprehensive multi-way-tree hierarchy structure from the binary one. The algorithmic procedure is formally shown in Figure 6, and the details will be introduced in the following subsections.

3.1 HAC-Based Binary-Tree Hierarchy Generation

An HAC algorithm operates on a set of objects with a matrix of inter-object similarities and builds a binary-tree cluster hierarchy in a bottom-up fashion [Mirkin 1996]. Let v1, v2, . . ., vn be the input object vectors, and let C1, C2, . . ., Cn be the corresponding singleton clusters. In the HAC clustering process, at each iteration step, the two most-similar clusters are merged to form a new one, and the whole process halts when there exists only one un-merged cluster, i.e., the root node of the binary-tree hierarchy. Let Cn+i be the new cluster created at the i-th step. The output binary-tree hierarchy can be unambiguously expressed as a list, C1, . . . , Cn, Cn+1, . . . , C2n−1, with two functions, left(Cn+i) and right(Cn+i), 1 ≤ i < n, indicating the left and right children of the internal cluster node Cn+i, respectively.
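This bottom-up process can be sketched naively as follows (a straightforward O(n^3) implementation with an arbitrary inter-cluster similarity callback; the data layout is our own choice, not the paper's):

```python
import math

def hac_binary_tree(vectors, sim):
    """Naive HAC mirroring GenerateHACBinaryTree in Figure 6: repeatedly
    merge the two most-similar active clusters until one remains. Returns
    the merge list C_1..C_{2n-1}; each entry records its member object
    indices and its left/right child positions (None for leaves)."""
    n = len(vectors)
    clusters = [{"members": {i}, "left": None, "right": None} for i in range(n)]
    active = set(range(n))                    # plays the role of the flag f
    while len(active) > 1:
        best, best_pair = -math.inf, None
        for a in sorted(active):              # choose the most-similar pair
            for b in sorted(active):
                if a < b:
                    s = sim(clusters[a], clusters[b], vectors)
                    if s > best:
                        best, best_pair = s, (a, b)
        a, b = best_pair
        clusters.append({"members": clusters[a]["members"] | clusters[b]["members"],
                         "left": a, "right": b})
        active -= {a, b}                      # merged clusters become inactive
        active.add(len(clusters) - 1)
    return clusters
```

The `sim` callback is where the linkage functions discussed next plug in.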

The core of an HAC algorithm is a specific function used to measure the similarity between any pair of clusters Ci and Cj (steps 8 and 11 in Figure 6). Here, we consider four well-known inter-cluster similarity functions: (SL) the single-linkage function, defined as the largest similarity between two objects in both clusters:

simSL(Ci, Cj) = max_{va∈Ci, vb∈Cj} sim(va, vb);

(CL) the complete-linkage function, defined as the smallest similarity between two objects in both clusters:

simCL(Ci, Cj) = min_{va∈Ci, vb∈Cj} sim(va, vb);

(AL) the average-linkage function, defined as the average of all similarities among the objects in both clusters:

simAL(Ci, Cj) = simA(Ci, Cj);

(CE) the centroid function, defined as the similarity between the centroids of the



HAC+P(v1, . . . , vn)
  vi, 1 ≤ i ≤ n: the vectors of the objects
   1: C1, . . . , C2n−1 ← GenerateHACBinaryTree(v1, . . . , vn)
   2: return MinMaxPartition(1, C1, . . . , C2n−1)

GenerateHACBinaryTree(v1, . . . , vn)
  vi, 1 ≤ i ≤ n: the vectors of the objects
   3: for all vi, 1 ≤ i ≤ n do
   4:   Ci ← {vi}
   5:   f(i) ← true        {f: whether a cluster can be merged}
   6: calculate the pairwise cluster similarity matrix
   7: for all 1 ≤ i < n do
   8:   choose the most-similar pair {Ca, Cb} with f(a) ∧ f(b) ≡ true
   9:   Cn+i ← Ca ∪ Cb, left(Cn+i) ← Ca, right(Cn+i) ← Cb
  10:   f(n + i) ← true, f(a) ← false, f(b) ← false
  11:   update the similarity matrix with the new cluster Cn+i
  12: return C1, . . . , C2n−1 together with functions left and right

MinMaxPartition(d, C1, . . . , Cn, Cn+1, . . . , C2n−1)
  d: the current depth
  Ci, 1 ≤ i ≤ 2n − 1: the binary-tree hierarchy
  13: if n < ε ∨ d > ρ then
  14:   return C1, C2, . . . , Cn
  15: minq ← ∞, bestcut ← 0
  16: for all cut levels l, 1 ≤ l < n do
  17:   q ← Q(LC(l)) / N(LC(l))
  18:   if minq > q then
  19:     minq ← q, bestcut ← l
  20: for all Ci ∈ LC(bestcut) do
  21:   children(Ci) ← MinMaxPartition(d + 1, CH(Ci))
  22: return LC(bestcut)

Fig. 6. The HAC+P algorithm.

two clusters:

simCE(Ci, Cj) = sim(ci, cj)

where ci and cj are the centroids of Ci and Cj, respectively, and, for a cluster Cl, the k-th feature weight of its centroid, cl, is defined as cl,k =P

vi∈Clvi,k/|Cl|.

Usually, the clusters produced by the single-linkage method are isolated but not cohesive, and some clusters may be undesirably elongated. At the other extreme, the complete-linkage method produces cohesive clusters that may not be well isolated. The average-linkage method is a compromise between these two extremes. The centroid method is another commonly used similarity measure, different from the linkage-based ones. These methods are compared in a later section.
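As a concrete illustration, the four inter-cluster similarity functions can be written as follows. This is our own Python sketch, not code from the paper; it assumes objects are dense feature vectors and that sim is cosine similarity, as is common for text clustering.

```python
import math

def sim(a, b):
    """Cosine similarity between two feature vectors (an assumption;
    the paper only requires some pairwise object similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sim_sl(Ci, Cj):
    """(SL) single linkage: largest pairwise similarity."""
    return max(sim(a, b) for a in Ci for b in Cj)

def sim_cl(Ci, Cj):
    """(CL) complete linkage: smallest pairwise similarity."""
    return min(sim(a, b) for a in Ci for b in Cj)

def sim_al(Ci, Cj):
    """(AL) average linkage: mean of all pairwise similarities."""
    sims = [sim(a, b) for a in Ci for b in Cj]
    return sum(sims) / len(sims)

def centroid(C):
    """Per-feature mean of the cluster's vectors."""
    return [sum(v[k] for v in C) / len(C) for k in range(len(C[0]))]

def sim_ce(Ci, Cj):
    """(CE) centroid: similarity between cluster centroids."""
    return sim(centroid(Ci), centroid(Cj))
```

By construction, for any pair of clusters the complete-linkage value is never larger than the average-linkage value, which is never larger than the single-linkage value.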

3.2 Min-Max Partitioning

The HAC algorithm produces a binary-tree cluster hierarchy. However, the proposed approach aims to produce a natural and comprehensive hierarchical organization like that of Yahoo!, which has 13-15 major categories, each containing an appropriate number of sub-categories, and so on. This broad and shallow multi-way-tree representation, rather than the narrow and deep binary-tree one, is easier and more suitable for humans to browse, interpret, and analyze in depth.

To generate a multi-way-tree hierarchy from a binary-tree representation, a top-down approach first decomposes the hierarchy into several sub-hierarchies and then recursively applies the same decomposition procedure to each sub-hierarchy.

Fig. 7. An illustrative example for cluster partitioning: (A) a binary-tree hierarchy over clusters C1-C9 with cut levels 1-4 marked; (B) the clusters obtained after partitioning.

Our idea is to determine a suitable level at which to cut the binary-tree hierarchy so as to create the most appropriate sub-hierarchies; that is, the sub-hierarchies that achieve the best balance between cluster quality and the preferred number of clusters among those produced by cutting at the other levels. By recursively decomposing the sub-hierarchies, a new multi-way-tree hierarchy can be constructed.

The problem of cutting at a suitable level can be cast as determining between which pair of adjacent clusters {Cn+i−1, Cn+i}, 1 ≤ i < n, in the binary-tree hierarchy C1, . . . , C2n−1 to place the partition point. Let the level between {C2n−2, C2n−1} be 1, the level between {C2n−3, C2n−2} be 2, and so on, such that the level between {Cn+i−1, Cn+i} is n − i (refer to Figure 7). For further illustration, let LC(l) be the set of clusters produced after cutting the binary-tree hierarchy at level l, and let CH(Ci) be the cluster hierarchy rooted at node Ci, i.e., CH(Ci) = C^i_1, . . . , C^i_{ni}, C^i_{ni+1}, . . . , C^i_{2ni−1}, where C^i_1, . . . , C^i_{ni} are the leaf (singleton) clusters, C^i_{ni+1}, . . . , C^i_{2ni−1} are the internal (merged) clusters, and C^i_{2ni−1} = Ci. For example, in Figure 7(A), LC(1) is {C7, C8}, LC(2) is {C5, C6, C7}, and CH(C8) is {C3, C4, C5, C6, C8}. Suppose the best cut level is chosen as 2; then the first-level clusters of the generated hierarchy are LC(2) = {C5, C6, C7} (refer to Figure 7(B)). If the sub-hierarchy CH(C7) is then partitioned and the chosen level is 1, the second-level clusters include C1 and C2. Note that all the above information, e.g., the values of the functions LC and CH, can be collected while the HAC clustering process proceeds, so it is available without much extra computational effort. In the following, we describe the two criteria used to determine the best cut level.
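The LC(l) bookkeeping can be sketched as follows (a hypothetical helper of our own, not from the paper): cutting at level l removes the top l internal nodes C2n−l, . . . , C2n−1, and LC(l) is every surviving cluster whose parent was removed.

```python
def cut_levels(n, children):
    """Compute LC(l) for every cut level of an HAC binary tree.

    n: number of leaf (singleton) clusters, indexed 1..n.
    children: dict mapping each internal node m (n < m <= 2n-1)
              to its (left, right) child indices.
    Returns a dict: level l -> sorted list of cluster indices in LC(l).
    """
    total = 2 * n - 1
    parent = {}
    for m, (a, b) in children.items():
        parent[a] = m
        parent[b] = m
    lc = {}
    for l in range(1, n):
        keep = total - l  # nodes 1..keep survive the cut at level l
        lc[l] = sorted(k for k in range(1, keep + 1)
                       if parent.get(k, total + 1) > keep)
    return lc
```

For instance, with n = 3 leaves and merges C4 = C1 ∪ C2 and C5 = C4 ∪ C3, cutting at level 1 yields {C3, C4} (the root's children) and cutting at level 2 yields the three singletons.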

Cluster Set Quality. The generally accepted requirement for "natural" clusters is that they be cohesive and isolated from the other clusters. Our criterion for determining a proper cut level, given a binary-tree hierarchy of clusters, heuristically satisfies this requirement. Let the inter-similarity between two clusters Ci and Cj be defined as the average of all pairwise similarities among the objects in Ci and Cj, i.e., simA(Ci, Cj), and let the intra-similarity within a cluster Ci be defined as the average of all pairwise similarities within Ci, i.e., simA(Ci, Ci). Our partitioning approach finds a particular level that minimizes the inter-similarities among the clusters produced at that level and maximizes the intra-similarities of all those clusters; this is why the approach is named min-max partitioning. Let C be a set of clusters; our quality measurement of C, based on its cohesion and isolation, is defined as

    Q(C) = (1/|C|) Σ_{Ci ∈ C} simA(Ci, C̄i) / simA(Ci, Ci),

where C̄i = ∪_{k≠i} Ck is the complement of Ci. Note that the smaller the Q(C) value, the better the quality of the given set of clusters C.
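A direct sketch of Q(C) in Python (our own, assuming clusters are lists of objects and sim gives pairwise object similarity; simA here averages over all object pairs, including self-pairs in the intra-similarity case):

```python
def sim_a(Ci, Cj, sim):
    """Average of all pairwise similarities between objects in Ci and Cj."""
    return sum(sim(a, b) for a in Ci for b in Cj) / (len(Ci) * len(Cj))

def quality(clusters, sim):
    """Q(C): mean ratio of each cluster's similarity to its complement
    over its internal similarity. Smaller values mean better-isolated,
    more cohesive clusters."""
    q = 0.0
    for i, Ci in enumerate(clusters):
        comp = [v for j, Cj in enumerate(clusters) if j != i for v in Cj]
        q += sim_a(Ci, comp, sim) / sim_a(Ci, Ci, sim)
    return q / len(clusters)
```

Under this score, a partition that groups similar objects together receives a smaller (better) value than one that mixes dissimilar objects.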

Cluster-Number Preference. Usually, a partition with neither too few nor too many clusters is preferable. Given n objects, there is at least one cluster and at most n clusters. To obtain a natural and comprehensive hierarchy, the number of clusters at each level should be appropriate for humans, but a proper number is difficult to anticipate automatically because we have no idea how many meaningful groups exist among the objects. To make the generated hierarchy adaptable to the personal preference of each individual who constructs the taxonomy, a parameter Nclus, the default expected number of generated clusters at each layer, is assumed given by the corresponding taxonomy constructor. Notice that Nclus could be a constant value or a function.

A simplified gamma distribution function is then used to measure the degree of preference for a given number of clusters. Its definition is as follows:

    f(x) = x^(α−1) e^(−x/β) / (α! β^α);    N(C) = f(|C|),

where |C| is the number of target clusters in C, α is a positive integer, and the constraint (α − 1)β = Nclus is required to ensure f(Nclus) ≥ f(x) for 0 < x ≤ n.

This constraint follows from the fact that the gamma distribution function f(x) is unimodal: the maximum can be computed by differentiating f(x) to obtain f′(x) and solving f′(x) = 0. The detailed derivation is omitted here.

The two parameters α and β allow us to tune the smoothness of the preference function; they are empirically set to α = 3 and β = Nclus/2 in our study. We also empirically define Nclus as the square root of the number of objects in each partitioning step. These parameter settings are unavoidably subjective to the individual taxonomy constructor; therefore, no further experimental comparison of different values is made. Figure 8 depicts the curves of this cluster-number preference function, f(x), with respect to the number of generated clusters for object-set sizes n = 100, 200, and 400. Note that the function favors cluster numbers close to Nclus.
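The preference function might be implemented as follows (a sketch; the names preference and n_clus are ours). With α = 3, the constraint (α − 1)β = Nclus gives β = Nclus/2, so the curve peaks at x = Nclus.

```python
import math

def preference(x, n_clus, alpha=3):
    """Simplified gamma-style preference f(x) = x^(a-1) e^(-x/b) / (a! b^a),
    with beta chosen so that the mode falls at n_clus."""
    beta = n_clus / (alpha - 1)  # enforces (alpha - 1) * beta = n_clus
    return (x ** (alpha - 1) * math.exp(-x / beta)
            / (math.factorial(alpha) * beta ** alpha))
```

For example, with n_clus = 10 the score is maximized at x = 10 and decays smoothly for both smaller and larger cluster counts, which is exactly the bias the text describes.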

Finally, to partition the given hierarchy, we estimate the quality and the cluster-number preference of the cluster sets produced at each possible level. The suitable cut level is chosen as the level l with the minimum Q(LC(l))/N(LC(l)) value (refer to steps 17-19 in Figure 6). The detailed partitioning procedure is shown in Figure 6. To avoid partitioning a cluster with too few objects or making the resulting hierarchy too deep, two constants ǫ and ρ are provided to restrict, respectively, the size of a cluster to be further processed and the depth of the generated hierarchy (refer to steps 13-14 in Figure 6).

Fig. 8. The cluster number preference function on n = 100, 200, and 400.
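The best-cut selection (steps 15-19 in Figure 6) then reduces to a small argmin; sketched here with hypothetical quality and preference callables standing in for Q and N:

```python
def best_cut(lc, quality, preference):
    """Return the cut level l minimizing Q(LC(l)) / N(LC(l)).

    lc: dict mapping each level l to its cluster set LC(l).
    quality: callable scoring a cluster set (smaller is better).
    preference: callable scoring a cluster count (larger is better).
    """
    best_l, min_q = 0, float("inf")
    for l, clusters in lc.items():
        q = quality(clusters) / preference(len(clusters))
        if q < min_q:
            min_q, best_l = q, l
    return best_l
```

Because quality sits in the numerator and preference in the denominator, a level wins only when its clusters are both well-formed and close to the preferred count.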

In the literature, several criteria for determining the number of clusters have been suggested [Milligan and Cooper 1985], but they are typically based on predetermined constants, e.g., the number of final clusters or a threshold on similarity scores. Relying on predefined constants is harmful in practice because such criteria are very sensitive to the data set and hard to set properly. Another branch of clustering methods, called model-based clustering [Vaithyanathan and Dom 2000], has the advantage of offering a way to estimate the number of groups present in the data. However, these methods suffer from high computational cost and the risk of assuming a wrong data model. Our approach combines an objective function that measures the quality of the generated clusters with a heuristically acceptable preference function on the number of clusters. It can automatically determine a reasonable cluster number for the given data set while retaining the efficiency of non-parametric clustering methods.

3.3 Cluster Naming

To provide users with a more comprehensible hierarchy of clusters, the internal nodes, i.e., the nodes between the root and the leaves, should be labeled with concise names. Although labeling clusters is essential, only a few works have really dealt with it [Muller et al. 1999; Lawrie et al. 2001; Glover et al. 2002]. In Muller et al. [1999], the labels of a cluster were chosen as the n most frequent terms in the cluster. Lawrie et al. [2001] extracted salient words and phrases of the instances in a cluster from retrieved documents and organized them hierarchically using a type of co-occurrence known as subsumption. Glover et al. [2002] inferred hierarchical relationships and descriptions by employing a statistical model they created to distinguish among the parent, self, and child features in a set of documents.

Naming a cluster is rather intellectual and challenging work. As mentioned, this work focuses on how to link clusters with close concepts and how to decide the appropriate levels in the hierarchy at which to position them. We use a hierarchical clustering technique to put similar instances together in a cluster and relevant clusters at the same or nearby levels. Cluster naming is not fully investigated at the current stage of our study. In this work, we simply take the most frequently co-occurring feature terms of the constituent instances to name each cluster. Even so, as illustrated in Figure 12, such a primitive approach still offers users an easier way to understand the concepts of the generated cluster hierarchy.

Table I. The information of the paper data set.

    Conference    # Paper titles
    AAAI'02            29
    ACL'02             65
    JCDL'02            69
    SIGCOMM'02         25
    SIGGRAPH'02        67
    SIGIR'02           44

4. EXPERIMENTS

Extensive experiments have been conducted to test the feasibility and performance of the proposed approach on different domains of text segments, including category names from a popular Web directory, famous people names, academic paper titles, and natural language questions.

To have a standard basis for performance evaluation, we collected the following experimental data:

YahooCS. The category names in the top three levels of the Yahoo! Computer Science (CS) directory were collected. There were 36, 177, and 278 category names in the first, second, and third levels, respectively. These category names were short in length and specifically expressed some key concepts in the CS domain; therefore, they could play the role of typical text segments and would be used as the major experimental data in our study. Notice that some category names could be placed in multiple categories; i.e., they had multi-class information.

People. Named entities are an important concern in information extraction. To assess the performance of our approach on this kind of data, we collected the people names listed in the Yahoo! People/Scientist directory. This data set contains 250 famous people distributed over nine science fields, e.g., mathematicians and physicists.

Paper. For people in research communities, papers are a major search need, and paper titles or author names are often submitted as queries to search engines. Therefore, the paper titles from several well-known conferences were collected for the test; Table I lists the details of this data set.

QuizNLQ. Besides the data described above, we also collected a data set of natural language questions from a Web site that provides quiz problems5. Our collection contains 163 questions distributed over seven categories; Figure 9 lists several examples. With this data set, we want to see whether the proposed approach is helpful in clustering common NLQs using the retrieved relevant contexts, even though no advanced natural language understanding techniques are applied.

5http://www.coolquiz.com/trivia/


Questions About People
  What is a Booger made of?
  What is a FART and why does it smell?
  How come tears come out of our eyes when we cry?
  What makes people sneeze?
  Why do my feet smell?

Questions About Insects and Animals
  How do houseflies land upside down on the ceiling?
  Why do some animals hibernate in the winter?
  What is honey? How do honey bees make honey?
  Why and how do cats purr?
  How does catnip work? Why is it only for cats?

Questions About Inventions and Machines
  Why do golf balls have dimples?
  Did Thomas Edison really invent the light bulb?
  How a newspaper tears?
  How did coins get their names?
  How do mirrors work?

Questions About The Environment, Space and Science
  How does a helium balloon float?
  Why do bubbles attract?
  What would happen if there was no dust?
  What are sunspots?
  Why are there 5,280 feet to a mile?

Questions About Food, Drinks and Snacks
  Why is Milk white?
  Why is it called a "hamburger" when there is no ham in it?
  What are hot dogs made of?
  Why do doughnuts have holes?
  Why do Onions make us cry?

Fig. 9. Examples of the testing natural language questions.

Notice that all the target instances in the above data sets have class information, e.g., the conference of a paper and the field(s) of a person. The class information will be taken as the external information to evaluate the clustering results.

We adopted the F-measure [Larsen and Aone 1999] as the evaluation metric for the generated cluster hierarchy. The F-measure of cluster j with respect to class i is defined as

    Fi,j = 2 Ri,j Pi,j / (Ri,j + Pi,j),

where Ri,j and Pi,j are recall and precision, defined as ni,j/ni and ni,j/nj, respectively; here ni,j is the number of members of class i in cluster j, nj is the number of members in cluster j, and ni is the number of members of class i. For the entire cluster hierarchy, the F-measure of a class is the maximum value it attains at any node in the tree, and the overall F-measure is computed by taking the weighted average of these values:

    F = Σ_i (ni/n) max_j {Fi,j},    (2)

where the maximum is taken over all clusters at all levels, n is the total number of instances, and ni is the number of instances in class i.
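Equation 2 can be computed directly from the class memberships and the set of all cluster nodes in the tree; the following is a minimal sketch with helper names of our own:

```python
def f_measure(classes, clusters):
    """Overall hierarchy F-measure of Equation 2.

    classes: dict class_id -> set of instance ids.
    clusters: iterable of sets, one per cluster node at every tree level.
    """
    n = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = 0.0
        for c in clusters:
            overlap = len(members & c)  # n_{i,j}
            if overlap:
                r = overlap / len(members)  # recall R_{i,j}
                p = overlap / len(c)        # precision P_{i,j}
                best = max(best, 2 * r * p / (r + p))
        total += len(members) / n * best  # weight by class size
    return total
```

A hierarchy containing a node that exactly matches each class scores 1.0; the more each class is split across or diluted within nodes, the lower the score.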

For comparison, a k-means method was modified to be hierarchical (HKMeans) and used as the reference method. The basic k-means algorithm consists of the following steps:

(1) Select randomly k instances as the initial elements of the k clusters.

(2) Assign each instance to its most-similar cluster.

(3) Repeat Step 2 until the clusters do not change or the predetermined number of iterations is used up.
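The three steps above can be sketched as follows. This is a minimal, hypothetical implementation of our own (not the authors' code) that uses squared Euclidean distance to cluster centroids, i.e., the centroid-style assignment strategy:

```python
import random

def kmeans(vectors, k, max_iters=100, seed=0):
    """Basic k-means: returns a cluster index for each input vector."""
    rng = random.Random(seed)
    # Step 1: pick k random instances as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = None
    for _ in range(max_iters):  # Step 3: repeat until stable or budget spent
        # Step 2: assign each instance to its most-similar (nearest) cluster.
        new_assign = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2
                                  for x, c in zip(v, centroids[j])))
            for v in vectors
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign
```

On well-separated data, a few iterations suffice; as the text notes below, the random initialization means the resulting clusters can vary in quality from run to run.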

With HKMeans, all the instances are first clustered into k clusters using k-means, and then the same k-means procedure is recursively applied to each cluster until the specified depth ρ is reached. The similarity between an instance and a cluster is measured using not only the conventional centroid method but also the average- and complete-linkage methods described previously. The single-linkage method is not suitable for k-means because, under single-linkage similarity, an instance's most-similar cluster is the one that contains the instance itself; the resulting clusters would therefore depend entirely on the k initial clusters. Because of the random selection of the k initial clusters, the resulting clusters vary in quality; some sets of initial clusters lead to poor convergence rates or poor cluster quality.

Table II. The F-measure results of clustering Yahoo! CS category names using various clustering methods and various similarity measure strategies.

             HAC     HAC+P   HAC+P (ρ = 2)   HKMeans (ρ = 2, k = √n)
    AL       .8626   .8385   .8197           .5873
    CL       .8324   .8116   .7838           .2892
    SL       .6848   .6529   .2716           N/A
    CE       .4177   .4080   .2389           .5873
    Random                                   .2893

Table II shows the resulting F-measure values for clustering the 177 Yahoo! CS second-level category names, with the 36 first-level categories as the target classes, using various clustering methods and similarity measure strategies. The F-measure values of the binary-tree hierarchies produced by conventional HAC are provided as upper-bound reference values for the other HAC variants6. HAC+P is HAC with the partitioning procedure, and ρ = 2 means that the depth of the resulting hierarchy was constrained to be at most 2. The parameter k in HKMeans was dynamically set to the nearest integer of √n, where n was the number of instances to be clustered at each step. Besides the various similarity measure strategies, a random clustering experiment was also performed with HKMeans; that is, at each step, each instance was randomly assigned to one of the k clusters. The reported F-measure score of random clustering is an average over twenty trials, and serves as a reference reflecting the goodness of the cluster hierarchies produced by the various clustering and similarity measure methods. The result obtained using HKMeans with complete-linkage was poor because the specified maximum number of iterations was used up without convergence to a set of stable clusters. Comparatively, HKMeans did not perform very well in this task.

From the experimental results shown in Table II, we find that the average- and complete-linkage methods performed much better than the single-linkage and centroid ones under the F-measure metric, with the average-linkage method slightly better still. The incorporation of partitioning and the constraint on hierarchy depth caused only a very small decrease in the F-measure. To give readers a more comprehensive and clearer understanding of the clustering results, Figure 10

6 Since every cluster node in the hierarchy produced by HAC+P is also a node in the hierarchy produced by HAC, the F-measure value of HAC is, by the definition of F-measure in Equation 2, an upper bound for the variants of HAC+P.
