A New Method for Fuzzy Information Retrieval Based on Fuzzy Hierarchical Clustering and Fuzzy Inference Techniques

Yih-Jen Horng, Shyi-Ming Chen, Senior Member, IEEE, Yu-Chuan Chang, and Chia-Hoang Lee

Abstract—In this paper, we extend the work of Kraft et al. to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and obtaining the document cluster centers of the document clusters. Then, we present a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user’s query for query expansion and to guide the information retrieval system to retrieve documents relevant to the user’s request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association relationship, fuzzy specialization relationship, and fuzzy generalization relationship) between index terms. The proposed fuzzy information retrieval method is more flexible and more intelligent than the existing methods because it can expand users’ queries for fuzzy information retrieval in a more effective manner.

Index Terms—Fuzzy agglomerative hierarchical clustering, fuzzy information retrieval systems, fuzzy logic rules, fuzzy relationships, query expansion.

I. INTRODUCTION

The goal of an information retrieval system is to evaluate the degrees of relevance of the collected documents with respect to a user’s queries and retrieve the documents with a high degree of satisfaction to the user. In order to get good retrieval performance, the user’s query must be able to appropriately describe the user’s requests. Currently, most of the commercial information retrieval systems are based on the Boolean logic model. They assume that a user’s queries can precisely be characterized by the index terms. However, this assumption is inappropriate due to the fact that the user’s queries may contain fuzziness [22]. The reason for the fuzziness contained in the user’s queries is that the user may not know much about the subject he/she is searching or may not be familiar with the information retrieval system. Therefore, the query specified by the user may not describe the information request properly. Since fuzzy set theory [29] can be used to describe imprecise or fuzzy information, many researchers have applied the fuzzy set theory to information retrieval systems [2], [3], [8], [9], [13], [18]–[22].

Manuscript received October 7, 2002; revised November 6, 2003, March 2, 2004, and May 25, 2004. This work was supported in part by the National Science Council, Republic of China, under Grant NSC 91-2213-E-011-052.

Y.-J. Horng and C.-H. Lee are with the Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. S.-M. Chen and Y.-C. Chang are with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C.

Digital Object Identifier 10.1109/TFUZZ.2004.840134

In [2], Bordogna et al. presented a relevance feedback model based on associative neural networks to provide an association mechanism in information retrieval systems. The purpose of the association mechanism in information retrieval systems is to build the association relationships between index terms and to modify the user’s queries by adding or replacing index terms associated with the queries. Generally speaking, the modified queries should retrieve more relevant documents than the original queries and thus improve the retrieval performance. Therefore, the study of the association mechanism is very important in the field of information retrieval. In [3], Chen et al. presented a fuzzy-based concept information system that integrates human categorization and numerical clustering. In [8], Chen et al. presented a method for document retrieval using knowledge-based fuzzy information retrieval techniques. In [9], Chen et al. presented fuzzy information retrieval techniques based on multi-relationship fuzzy concept networks. In [13], Horng et al. presented a fuzzy information retrieval method based on document terms reweighting techniques.

In [20], Kraft et al. explored several ways of using fuzzy clustering techniques in information retrieval systems, where the most important one is to capture the relationships among index terms. They use fuzzy logic rules to represent the association relationships between index terms and to form the basis of the association mechanism. After a user submits his/her queries, the fuzzy logic rules are then applied under a fuzzy logic system to modify the user’s original queries. Experimental results show that the modified user’s queries can get a better retrieval performance than the original queries. In [20], Kraft et al. utilized the complete link clustering method and the fuzzy c-means clustering method to partition documents for information retrieval.

In this paper, we extend the work of Kraft et al. [20] to present a new method to modify a user’s queries for fuzzy information retrieval. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and obtaining the document cluster center of each document cluster. Then, we present a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user’s query for query expansion and to guide the information retrieval system to retrieve documents relevant to a user’s request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association relationship, fuzzy specialization relationship, and fuzzy generalization relationship) between index terms. The proposed document retrieval method is more flexible and more intelligent than the existing methods because it can expand users’ queries for fuzzy information retrieval in a more effective manner.

The rest of this paper is organized as follows. In Section II, we briefly review some clustering methods. In Section III, we propose a new fuzzy agglomerative hierarchical clustering method. In Section IV, we compare the clustering performance of the proposed fuzzy agglomerative hierarchical clustering method with that of the complete link clustering method. In Section V, we propose a new method for fuzzy logic rules discovery. In Section VI, we propose a new method for query modification for fuzzy information retrieval based on the constructed fuzzy logic rules. The conclusions are discussed in Section VII.

II. A REVIEW OF CLUSTERING METHODS

The single link clustering (SLC) method is one of the hierarchical agglomerative clustering methods (HACM). In [24], Rohlf reviewed some algorithms of the single link clustering method. The time complexity of these algorithms ranges from $O(n \log n)$ to $O(n^5)$, where $n$ is the number of items. Many of these algorithms are not suitable for information retrieval applications when the data sets have large $n$ and high dimensionality [23]. The SLINK algorithm [28] is one of the single link clustering methods. In [28], Sibson pointed out that the SLINK algorithm is an optimally efficient algorithm for the single link clustering method. In [23], Rasmussen pointed out that the SLINK algorithm generates the hierarchy in a form known as the pointer representation. Like every SLC method [24], the SLINK method defines the degree of similarity between two document clusters as the maximal degree of similarity between any pair of documents, one taken from each of the two clusters.

In [20], Kraft et al. utilized the agglomerative hierarchical clustering (AHC) method [23] and the fuzzy c-means clustering method [1] to partition documents. In the following, we briefly review these two clustering methods.

The agglomerative hierarchical clustering method for partitioning the given documents is reviewed from [20] as follows.

Step 1) Let each document be a document cluster.

Step 2) Merge two document clusters which have the largest degree of similarity into one document cluster.

Step 3) If the degree of similarity between any two document clusters is smaller than a heuristic threshold value, then Stop; else, go to Step 2).

In [20], Kraft et al. pointed out that the degree of similarity between two document clusters can be measured by the complete link clustering (CLC) method [23], which takes the minimum degree of similarity between any pair of documents, one from the first document cluster and the other from the second document cluster.

It is obvious that the agglomerative hierarchical clustering method is a crisp clustering method. That is, in the resulting document clusters, each document can only belong to exactly one document cluster.
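The three steps above, combined with the complete link measure, can be sketched as follows. This is only an illustrative sketch, not the implementation used in [20]: the function name `complete_link_clusters`, the input format (a precomputed symmetric similarity matrix), and the choice to test the stopping condition before each merge rather than after it are assumptions made for the example.

```python
from itertools import combinations


def complete_link_clusters(sim, threshold):
    """Crisp agglomerative clustering with the complete link measure.

    sim: symmetric matrix (list of lists) of document similarities in [0, 1].
    threshold: merging stops once no pair of clusters is at least this similar.
    Returns a list of clusters, each a sorted list of document indices.
    """
    # Step 1: every document starts as its own cluster.
    clusters = [[i] for i in range(len(sim))]

    def cluster_sim(a, b):
        # Complete link: the least similar cross-cluster pair of documents.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Step 2: find the most similar pair of clusters.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]),
        )
        # Step 3: stop when even the best pair is below the threshold.
        if cluster_sim(clusters[a], clusters[b]) < threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]
```

Every document index appears in exactly one returned cluster, which is the crisp behavior described above.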

The other clustering method used in [20] is the fuzzy c-means algorithm [1]. Assume that there is a set of $n$ documents, $D = \{d_1, d_2, \ldots, d_n\}$, where each document $d_i$ can be regarded as a data point represented by a vector of dimension $s$ shown as follows:

$$d_i = \langle w_{i1}, w_{i2}, \ldots, w_{is} \rangle$$

where $1 \le i \le n$. Here, $w_{ij}$ denotes the degree of significance of term $t_j$ in $d_i$, where $0 \le w_{ij} \le 1$ and $1 \le j \le s$.

Furthermore, assume that we want to partition the documents into $c$ document clusters, where $1 < c < n$. Then, the fuzzy c-means algorithm will get $c$ document cluster centers and the membership degree of each document belonging to each document cluster by minimizing the following objective function:

$$J_m = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m} \, \lVert d_i - v_k \rVert^{2} \tag{1}$$

Here, $\mu_{ik}$ denotes the membership degree of document $d_i$ with respect to document cluster $C_k$, $v_k$ denotes the document cluster center of document cluster $C_k$, and $\lVert d_i - v_k \rVert$ denotes the distance from document $d_i$ to document cluster center $v_k$. Moreover, $d_i$ and $v_k$ are represented as vectors of dimension $s$, and $m$ is a control parameter, where $m > 1$. Note that $m$ stands for the degree of fuzziness of the clustering algorithm. The larger the value of $m$, the higher the degree of fuzziness, i.e., the likelihood of a document belonging to multiple clusters at the same time is higher. It should be noted that the membership degree of each document belonging to each document cluster must be nonnegative, i.e., $\mu_{ik} \ge 0$. Moreover, the summation of the membership degrees of each document belonging to each cluster must be equal to one, i.e., $\sum_{k=1}^{c} \mu_{ik} = 1$, where $1 \le i \le n$. The membership degree $\mu_{ik}$ of document $d_i$ belonging to document cluster $C_k$ is calculated as follows:

$$\mu_{ik} = \left[ \sum_{l=1}^{c} \left( \frac{\lVert d_i - v_k \rVert}{\lVert d_i - v_l \rVert} \right)^{2/(m-1)} \right]^{-1} \tag{2}$$

and the document cluster center $v_k$ of document cluster $C_k$ is calculated as follows:

$$v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{m} \, d_i}{\sum_{i=1}^{n} \mu_{ik}^{m}} \tag{3}$$

Initially, the fuzzy c-means algorithm randomly assigns a value for each $\mu_{ik}$, where $\mu_{ik} \ge 0$ and $\sum_{k=1}^{c} \mu_{ik} = 1$ for each $d_i$. Then, the following two steps are performed iteratively until the differences between the values of $\mu_{ik}$ (and $v_k$) in the current iteration and those in the previous iteration are smaller than some convergence threshold $\varepsilon$, where $\varepsilon > 0$. First, it uses the existing $\mu_{ik}$ in formula (3) to obtain the document cluster center $v_k$. Then, it uses the newly derived document cluster center $v_k$ in formula (2) to update the value of $\mu_{ik}$. In [1], Bezdek et al. have proven that the fuzzy c-means algorithm will converge.
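The alternation between formulas (3) and (2) can be sketched as follows. This is a minimal pure-Python illustration of the standard fuzzy c-means iteration, not the implementation used in [20]; the function name `fuzzy_c_means`, the defaults `m=2.0` and `eps=1e-5`, the iteration cap `max_iter`, the fixed random seed, and the use of the Euclidean distance are all assumptions made for the example.

```python
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Return (memberships, centers) for the given data points.

    points: list of equal-length numeric vectors (one per document).
    c: number of clusters; m > 1 controls the degree of fuzziness.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to one.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])

    centers = []
    for _ in range(max_iter):
        # Formula (3): centers are means weighted by membership ** m.
        centers = []
        for k in range(c):
            wts = [u[i][k] ** m for i in range(n)]
            total = sum(wts)
            centers.append(
                [sum(w * p[j] for w, p in zip(wts, points)) / total
                 for j in range(dim)]
            )
        # Formula (2): memberships from relative distances to the centers.
        new_u = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, v)) ** 0.5
                     for v in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((dk / dl) ** (2.0 / (m - 1.0)) for dl in dists)
                       for dk in dists]
            new_u.append(row)
        change = max(abs(a - b)
                     for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:  # convergence threshold on the membership change
            break
    return u, centers
```

Each row of the returned membership matrix sums to one, so a document lying between two centers receives a substantial membership degree in both clusters.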

In [20], Kraft et al. used several experts to evaluate the document clusters produced by the CLC method [23] and the ones produced by the fuzzy c-means clustering method [1]. The fuzzy c-means algorithm produces “fuzzy” document clusters in the sense that each document may belong to multiple document clusters, in contrast to the “crisp” document clusters produced by the complete link clustering method, in which each document can belong to exactly one document cluster. In order to compare the performance of these two clustering methods, “hardening” is performed on the fuzzy document clusters obtained by the fuzzy c-means algorithm. That is, for each document, they found the document cluster in which the document has the maximum membership degree among all the document clusters. Then, they set the membership degree of the document in this document cluster to one and set its membership degrees in the other document clusters to zero. From the experts’ opinions [20], the complete link clustering method seems to perform better than the fuzzy c-means method. However, since Kraft et al. performed the “hardening” operation on the fuzzy document clusters, the experts have only compared two sets of “crisp” document clusters derived from the two clustering methods, respectively. Therefore, the advantage of the fuzzy clustering method is not revealed in their experiments. In fact, the fuzzy clustering method should perform better than the crisp clustering method when the boundaries between document clusters are not crisp or when the document clusters are overlapping.
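The “hardening” operation described above is simple enough to sketch directly. The function name `harden` and the list-of-rows representation of the membership degrees are assumptions made for illustration; ties are broken in favor of the first cluster, which the source does not specify.

```python
def harden(memberships):
    """Turn fuzzy memberships into crisp ones.

    memberships: one row per document, one membership degree per cluster.
    For each document, the cluster with the maximum membership degree gets
    degree 1.0 and every other cluster gets 0.0 (first maximum wins on ties).
    """
    hardened = []
    for row in memberships:
        best = max(range(len(row)), key=row.__getitem__)
        hardened.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return hardened
```

For instance, a document with membership degrees 0.45 and 0.55 in two clusters is reduced to 0.0 and 1.0, so the information that it almost equally belongs to both clusters is discarded. This is the loss the paragraph above points out.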

Because the traditional agglomerative hierarchical clustering method is a crisp clustering method, where each document can only belong to exactly one document cluster, it lacks flexibility. Therefore, in this paper, we extend the work of [20] to present a fuzzy agglomerative hierarchical clustering method to overcome the drawback of the traditional agglomerative hierarchical clustering method, where a document cannot belong to multiple document clusters at the same time. Both the proposed fuzzy agglomerative hierarchical clustering method and the fuzzy c-means clustering method [1] have the advantage of flexibility to allow a document to belong to multiple document clusters at the same time. The difference between the two methods is that the fuzzy c-means clustering method requires the user to predefine the number of clusters, whereas the proposed method uses the “similarity threshold value” and the “difference threshold value” to control the clustering process automatically.

III. FUZZY AGGLOMERATIVE HIERARCHICAL CLUSTERING METHOD

In this section, we present a fuzzy agglomerative hierarchical clustering (FAHC) method for clustering documents. Let the membership degree of document $d_i$ belonging to cluster $C_k$ be denoted by $\mu_{C_k}(d_i)$, where $\mu_{C_k}(d_i) \in [0, 1]$. If $d_i$ is an element of cluster $C_k$, then $\mu_{C_k}(d_i) > 0$. The proposed fuzzy agglomerative hierarchical clustering method is now presented as follows.

Fuzzy Agglomerative Hierarchical Clustering Algorithm:

Input: The similarity threshold value and the difference threshold value.

Output: Document clusters.

Step 1: Let each document be a document cluster and set the membership degree of each document belonging to its document cluster to 1.

Step 2: Among the set of document clusters, find the pair of document clusters that has the largest degree of similarity, where this degree of similarity is not less than the similarity threshold value.

Step 3: Among the rest of the document clusters, find the set of pairs of document clusters that have degrees of similarity larger than or equal to the similarity threshold value.

Step 4: For each pair of document clusters and in the set do

begin

if and the pair of document clusters and obtained in Step 2 and the pair of document clusters and obtained in Step 3 share the same document cluster (i.e., ), then

begin

make a copy of document cluster ;

merge document cluster with document cluster into a new cluster ;

for each element in document cluster , the membership degree of document belonging to the new document cluster is equal to ;

for each element in document cluster , the membership degree of document belonging to the new document cluster is equal to ;

end

else

document cluster is equal to ;

end

end.

Step 5: Recalculate the degree of similarity between each pair of document clusters by taking the minimum of the degrees of similarity between any pair of documents, where the documents in each pair are taken from different document clusters.

Step 6: If the degree of similarity between any two document clusters is smaller than the similarity threshold value, then Stop; else go to Step 2.

After partitioning documents into several document clusters, the document cluster center of each document cluster can be obtained by the following formula:

(4)

where (4) is defined in terms of the weight of each term in each document and the membership degree of each document belonging to the document cluster.
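As a hedged illustration of how a cluster center can be derived from term weights and membership degrees, the sketch below assumes that the center is the membership-weighted average of the member documents’ term-weight vectors. This weighting scheme is an assumption made for the example and may differ from the exact form of (4); the function name `cluster_center` and the dictionary-based inputs are likewise hypothetical.

```python
def cluster_center(doc_vectors, memberships):
    """Membership-weighted average of document vectors (assumed form).

    doc_vectors: {doc_id: [w_1, ..., w_s]} term weights of each document.
    memberships: {doc_id: degree} for the documents in one cluster, with
                 every degree in (0, 1].
    """
    total = sum(memberships.values())
    dim = len(next(iter(doc_vectors.values())))
    center = [0.0] * dim
    for doc_id, mu in memberships.items():
        for j, w in enumerate(doc_vectors[doc_id]):
            center[j] += mu * w  # documents with higher membership count more
    return [x / total for x in center]
```

Under this assumption, a document that belongs to the cluster only partially pulls the center toward its own term weights less strongly than a full member does.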

IV. COMPARISON OF THE CLUSTERING RESULTS OF THE CLC METHOD AND THE PROPOSED FAHC METHOD

We have implemented the SLINK method [23], the CLC method [23], and the proposed FAHC method on a Pentium IV PC using Delphi version 5.0. We have chosen 247 research reports in the field of computer science [30] as the set of documents for clustering, which are a subset of a collection of research reports of the National Science Council (NSC), Taiwan, R.O.C. Each report consists of several parts, including a report identifier, a title, the researchers’ names, a Chinese abstract, an English abstract, etc. Since the proposed method intends to deal with English documents, we take the English abstracts of the reports to represent the contents of the documents. However, since each selected document contains a large number of words, these documents should be preprocessed to reduce the set of words to a manageable size before the clustering algorithms are applied to the selected documents. The selected documents are preprocessed in two steps. First, words appearing with high frequencies in all documents are eliminated. Then, the word extractor stems each remaining word to its “root form” [10]. The collection of these root-formatted words forms a set of index terms for the document set.

The weight $w_{ij}$ of term $t_j$ in document $d_i$ is calculated as follows:

(5)

where (5) is derived from [26] and depends on the frequency of term $t_j$ appearing in document $d_i$, the number of documents containing term $t_j$, the number of index terms contained in document $d_i$, and the number of documents in the corpus. The larger the value of $w_{ij}$, the more important is the term $t_j$ in document $d_i$. Note that the value of $w_{ij}$ is normalized and is between zero and one.

After the weight of each index term in each document has been calculated, we can represent each document $d_i$ as a vector shown as follows:

$$d_i = \langle w_{i1}, w_{i2}, \ldots, w_{is} \rangle \tag{6}$$

where $w_{ij}$ denotes the weight of term $t_j$ in document $d_i$, $1 \le j \le s$, and $s$ denotes the number of terms in the set of index terms.

The degree of similarity between any two documents $d_i$ and $d_j$ is calculated by the following formula:

(7)

We use the 247 research reports [30] as the set of documents for clustering. Table I shows the SLINK method partitioning the 247 documents into document clusters for different “similarity threshold values” between 0.04 and 0.08. For a sufficiently small “similarity threshold value,” only one cluster is produced. As the “similarity threshold value” increases, more document clusters containing more than one document are produced, and more and more document clusters containing only one document are produced as well. Table II shows the complete link method [23] partitioning the 247 NSC documents into different numbers of document clusters for different “similarity threshold values” ranging from 0 to 0.025; no document cluster contains only one document until the threshold value reaches 0.025. From Tables I and II, we can see that the SLINK method produced more document clusters containing only one document than the complete link method. In this case, the SLINK method has a poorer clustering result than the complete link method because it produces too many document clusters containing only one document.

TABLE I

DOCUMENT CLUSTERS PRODUCED BY THE SINGLE LINK METHOD [23] FOR DIFFERENT SIMILARITY THRESHOLD VALUES

TABLE II

DOCUMENT CLUSTERS PRODUCED BY THE COMPLETE LINK METHOD [23] FOR DIFFERENT SIMILARITY THRESHOLD VALUES

In the following, we compare the document clusters generated by the hierarchical clustering method [23] with the ones generated by the proposed FAHC method. It is obvious that the document clusters generated by a good clustering method should have a maximum degree of within-cluster similarity and a minimum degree of between-cluster similarity [3]. That is, the degree of similarity between documents in the same document clusters should be high and the degree of similarity between documents in different document clusters should be low. Thus, we adopt the heuristic measure called the category binding (CB), presented in [3], to evaluate the clustering results.

Assume that a set of document clusters is obtained by the clustering methods; then the value of CB of the set of document clusters is calculated by the following formula:

(8)

which combines the average degree of within-cluster similarity and the average degree of between-cluster similarity of the set of document clusters. From (8), we can see that a good clustering result should get a large value of CB.

The average degree of within-cluster similarity is calculated as follows:

(9)

where the cohesion of the documents in each document cluster is calculated by (10). Formula (10) depends on the number of documents in the document cluster, the membership degrees of documents $d_i$ and $d_j$ belonging to the document cluster, the degree of similarity between documents $d_i$ and $d_j$ calculated by (7), and a parameter representing the cohesion of documents in a single-instance document cluster [3].

The average degree of between-cluster similarity is calculated as follows:

(11)

Here, the degree of similarity between two document clusters is calculated as follows:

(12)

where (12) depends on the number of documents in each of the two document clusters, the membership degree of document $d_i$ belonging to the first document cluster, the membership degree of document $d_j$ belonging to the second document cluster, and the degree of similarity between documents $d_i$ and $d_j$ calculated by (7).

By observing the 247 NSC research reports [30], we can see that it is appropriate to partition these documents into 23–30 document clusters. Therefore, we tune the “similarity threshold value” of the complete link method [23] to partition the 247 NSC research reports into an appropriate number of document clusters and compute the CB value based on (8)–(12) for each clustering result as shown in Table III. Moreover, we also tune the “similarity threshold value” of the proposed fuzzy agglomerative hierarchical clustering method to partition the 247 NSC documents into an appropriate number of document clusters and compute the CB value based on (8)–(12) for each clustering result as shown in Table IV (when the difference threshold value is 0.001) and Table V (when the difference threshold value is 0.0005), where the two difference threshold values are set heuristically.

TABLE III

CB VALUES OF THE DOCUMENT CLUSTERS PRODUCED BY THE COMPLETE LINK METHOD [23]

TABLE IV

CB VALUES OF THE DOCUMENT CLUSTERS PRODUCED BY THE PROPOSED FUZZY AGGLOMERATIVE HIERARCHICAL CLUSTERING METHOD WHEN THE DIFFERENCE THRESHOLD VALUE IS 0.001

TABLE V

CB VALUES OF THE DOCUMENT CLUSTERS PRODUCED BY THE PROPOSED FUZZY AGGLOMERATIVE HIERARCHICAL CLUSTERING METHOD WHEN THE DIFFERENCE THRESHOLD VALUE IS 0.0005

The “difference threshold value” of the proposed fuzzy agglomerative hierarchical clustering method can be regarded as a parameter to control the “fuzziness” of the resulting document clusters. The larger the “difference threshold value,” the higher the “fuzziness” of the resulting document clusters. When the “difference threshold value” is set to 0.001, we found that many documents are partitioned into more than one document cluster. When the “difference threshold value” is set to 0.0005, only a few documents are partitioned into more than one document cluster. From Tables III–V, we can see that the CB values of the document clusters produced by the complete link method [23] are larger than the ones of the document clusters produced by the proposed fuzzy agglomerative hierarchical clustering method when the “difference threshold value” is set to 0.001, but are smaller than the ones of the document clusters produced by the proposed fuzzy agglomerative hierarchical clustering method when the “difference threshold value” is set to 0.0005. The reason is that the resulting document clusters produced by the proposed fuzzy agglomerative hierarchical clustering method, when the “difference threshold value” is set to 0.001, are too “fuzzy” and the average degree of between-cluster similarity is high due to the fact that many document clusters contain common documents. Therefore, we should decrease the fuzziness of the document clusters. The experimental results show that the document clusters produced by the proposed fuzzy agglomerative hierarchical clustering method with a low degree of fuzziness (i.e., with the “difference threshold value” set to 0.0005) are better than the crisp document clusters produced by the complete link method [23].

V. FUZZY LOGIC RULES DISCOVERY

In [20], Kraft et al. presented a method to construct fuzzy logic rules based on the document clusters and their cluster centers. The fuzzy logic rules constructed in [20] have the following format:

$$[t_i \ge w_i] \rightarrow [t_j \ge w_j] \tag{13}$$

where $w_i$ and $w_j$ denote the weights of index terms $t_i$ and $t_j$ in the document cluster center representation. The meaning of the rule is that whenever term $t_i$’s weight (in a document or query) is at least $w_i$, the related term $t_j$’s weight (in the same document or query) should be at least $w_j$. A fuzzy logic rule is constructed only when terms $t_i$ and $t_j$ both have high weights in the same document cluster center representation. Therefore, the fuzzy logic rule can represent the association relationship between index terms $t_i$ and $t_j$. These rules can then be applied to modify the user’s original query and thus increase the retrieval effectiveness.

However, the fuzzy logic rules discovery method presented in [20] only considers the clustering results of documents obtained by the two clustering methods (i.e., the complete link clustering method [23] and the fuzzy c-means clustering method [1]). We believe that the document clusters formed in the middle of the clustering process can also provide some useful information. For example, in the clustering process using the agglomerative hierarchical clustering method (e.g., the complete link method and the proposed fuzzy agglomerative hierarchical clustering method), two document clusters are merged to form a larger document cluster. The resulting document cluster can be regarded as the parent document cluster of the original document clusters. The index terms having large weights in the parent document cluster center should be more general than the ones with large weights in the child document cluster center, due to the fact that the index terms having large weights in the parent document cluster center are contained in more documents.

In this section, we present a method for fuzzy logic rules discovery which constructs fuzzy logic rules representing more kinds of relationships between index terms than the ones presented in [20]. The fuzzy logic rules are constructed based on the document cluster centers. Similar to the fuzzy logic rule construction method presented in [20], for each document cluster center, terms are sorted in a descending sequence according to their weights in the document cluster center. Then, the top-ranked terms, as well as their weights, are extracted. However, unlike the method presented in [20], we build term pairs not only with chosen terms from the same document cluster center but also from document cluster centers that have parent–children relationships. Assume that term $t_i$ is chosen from the document cluster center of document cluster $C_x$ and term $t_j$ is chosen from the document cluster center of document cluster $C_y$; then the term pair consists of $t_i$ and $t_j$ together with their weights $w_i$ and $w_j$, where $w_i$ is the weight of term $t_i$ in the document cluster center of document cluster $C_x$ and $w_j$ is the weight of term $t_j$ in the document cluster center of document cluster $C_y$. The generated term pairs can be categorized into three categories according to the relationship between the source document cluster centers containing terms $t_i$ and $t_j$. These three categories are as follows.

1) Positive Association Category: If terms $t_i$ and $t_j$ are in the same document cluster center, then the corresponding term pair belongs to this category. The term pairs in this category represent a positive association relationship [20] between terms $t_i$ and $t_j$ due to the fact that terms $t_i$ and $t_j$ are contained in similar documents and should describe similar concepts.

2) Generalization Category: If the document cluster center containing term $t_i$ is the parent of the document cluster center containing $t_j$, then the corresponding term pair belongs to this category. The term pairs in this category denote that term $t_i$ is more general than term $t_j$ due to the fact that term $t_i$ belongs to more documents than term $t_j$ does.

3) Specialization Category: If the document cluster center containing term $t_i$ is a child of the document cluster center containing $t_j$, then the corresponding term pair belongs to this category. The term pairs in this category denote that term $t_i$ is more specific than term $t_j$ due to the fact that term $t_i$ belongs to fewer documents than term $t_j$ does.

If the same term pair occurs several times in the same category with different weights, then the minimal weight for each term among all term pairs will be taken as the aggregated weight of the term. Finally, for each term pair consisting of terms $t_i$ and $t_j$ with weights $w_i$ and $w_j$, we can build rules according to the following cases.

Case 1) If the term pair belongs to the positive association category, then we build two fuzzy logic rules, one in each direction, for this term pair. The meaning of the pair of rules is that the occurrence of the term $t_i$ with a weight at least $w_i$ should always be accompanied by the term $t_j$ with a weight at least $w_j$, and vice versa.

Case 2) If the term pair of $t_i$ and $t_j$ belongs to the generalization category and the term pair of $t_j$ and $t_i$ belongs to the specialization category, then we build two fuzzy logic rules for these term pairs. The meaning of the above rules is that the occurrence of the term $t_i$ with a weight at least $w_i$ should always be accompanied by the term $t_j$ with a weight at least $w_j$, and vice versa.
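The construction of categorized term pairs described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors’ implementation: the function name `build_term_pairs`, the dictionary inputs (`centers` mapping a cluster identifier to its term weights, `parent` mapping a child cluster to its parent cluster), the cutoff parameter `k`, and the category labels are all hypothetical.

```python
from itertools import permutations


def build_term_pairs(centers, parent, k):
    """Build categorized term pairs from document cluster centers.

    centers: {cluster_id: {term: weight}} cluster center representations.
    parent:  {child_cluster_id: parent_cluster_id} merge hierarchy.
    k:       number of top-weighted terms kept per cluster center.
    Returns {(category, t_i, t_j): (w_i, w_j)}.
    """
    # Keep the k terms with the largest weights in each cluster center.
    top = {cid: sorted(c.items(), key=lambda tw: -tw[1])[:k]
           for cid, c in centers.items()}
    pairs = {}

    def add(category, t_i, w_i, t_j, w_j):
        if t_i == t_j:
            return  # a term is not paired with itself
        key = (category, t_i, t_j)
        old = pairs.get(key)
        # Repeated pairs in one category keep the minimal weight per term.
        pairs[key] = (w_i, w_j) if old is None else (min(old[0], w_i),
                                                     min(old[1], w_j))

    # Positive association: both terms come from the same cluster center.
    for terms in top.values():
        for (t_i, w_i), (t_j, w_j) in permutations(terms, 2):
            add("association", t_i, w_i, t_j, w_j)

    # Generalization / specialization: terms from parent and child centers.
    for child, par in parent.items():
        for t_p, w_p in top[par]:
            for t_c, w_c in top[child]:
                add("generalization", t_p, w_p, t_c, w_c)  # parent term first
                add("specialization", t_c, w_c, t_p, w_p)  # child term first
    return pairs
```

Each entry of the result can then be turned into a rule whose antecedent is the first term with its weight and whose consequent is the second term with its weight. Because `permutations` yields both orderings, every positive association pair produces rules in both directions, matching Case 1.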

VI. QUERY MODIFICATION

After the fuzzy logic rules are constructed by the proposed fuzzy logic rule discovery method, we can use these fuzzy logic rules to modify the user’s queries based on [4]. According to the definition of the fuzzy logic system [4], the fuzzy logic rule has the following form:

which is a well-formed formula consisting of propositions of the

form or and the logical connectives, i.e., ,

, and .

A user’s query can be represented by a query descriptor vector $q = \langle q_1, q_2, \ldots, q_s \rangle$, where each element $q_i$ denotes the desired strength of term $t_i$ in the retrieved documents and is either a value in $[0, 1]$ or a special symbol labeling a neglected term. If $q_i = 0$, then it indicates that the user hopes that the retrieved documents do not possess the term $t_i$. Furthermore, if the user considers that some terms may be neglected, i.e., including those terms or not would have no substantial effect on the result, then the user does not have to assign degrees of strength to such terms in the query descriptor vector. A term labeled as neglected will not be considered in the document retrieval process.

Let be a set of the generated fuzzy logic rules, , where each fuzzy logic rule has the following form:

which is applied to modify the user’s query descriptor vector

if and or if and .

The modified user’s query descriptor vector is similar to the original user’s query descriptor vector except that is set to . The modified user’s query descriptor vector will be used to retrieve relevant documents. Assume that the modified user’s query descriptor vector is as follows:
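The query modification step can be sketched as below. This is a hedged illustration under stated assumptions: a rule is taken to be a tuple `(t_i, w_i, t_j, w_j)` read as "t_i with weight at least w_i implies t_j with weight at least w_j"; a rule fires when the query weight of `t_i` is at least `w_i` and the query weight of `t_j` is either neglected or smaller than `w_j`; and rules are applied in a single pass against the original query. The function name, the use of `None` for a neglected term, and the rule thresholds in the example are hypothetical.

```python
NEGLECTED = None  # stands in for the paper's "neglected term" symbol


def apply_rules(query, rules):
    """Return a modified copy of `query` (a dict mapping term -> weight or NEGLECTED).

    Each rule is a tuple (t_i, w_i, t_j, w_j), assumed to mean
    [t_i >= w_i] -> [t_j >= w_j].
    """
    modified = dict(query)
    for t_i, w_i, t_j, w_j in rules:
        q_i = query.get(t_i, NEGLECTED)
        if q_i is NEGLECTED or q_i < w_i:
            continue  # antecedent not satisfied by the original query
        current = modified.get(t_j, NEGLECTED)
        if current is NEGLECTED or current < w_j:
            modified[t_j] = w_j  # raise the consequent term to the rule's weight
    return modified


original = {"natural": 0.9, "language": 0.9, "processing": 0.8}
example_rules = [
    ("language", 0.5, "word", 0.46),     # fires: 0.9 >= 0.5 and "word" is neglected
    ("language", 0.95, "speech", 0.37),  # does not fire: 0.9 < 0.95
]
print(apply_rules(original, example_rules))
```

When several rules share a consequent term, this sketch keeps the largest weight, which is one possible design choice rather than something the text specifies.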

where or . If , then it indicates

that the term is a neglected term and it will not be considered in the retrieval process. Furthermore, assume that each document can be represented by a vector shown as follows:

where represents the weight of term in document , and . Then, the retrieval status value of document with respect to the user’s query can be calculated as follows:

and

(14)

where is the number of nonneglected terms. Here, is a similarity function [8] to calculate the degree of similarity between two real values between zero and one and is given as

(15)

After the retrieval status value of each document with respect to the user’s query is obtained, we normalize the value of by dividing it by the maximum value among the values of , and , where is the number of collected documents. For a document to be retrieved, the retrieval status value of document should be larger than or equal to the “query threshold value” given by the user, where .

Fig. 1. Set of generated fuzzy logic rules based on the term pairs belonging to the positive association category.
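The retrieval computation described above (average similarity over the nonneglected terms, normalization by the maximum value, then thresholding) can be sketched as follows. Because the equations did not survive in this copy, the sketch makes two labeled assumptions: the similarity between two values in [0, 1] is taken to be `1 - |x - y|`, and a term absent from a document vector has weight 0. All function names are hypothetical.

```python
def similarity(x, y):
    """Assumed similarity between two values in [0, 1]: 1 - |x - y|."""
    return 1.0 - abs(x - y)


def retrieval_status_value(doc, query):
    """Average similarity over the nonneglected query terms (weight None = neglected)."""
    terms = [t for t, w in query.items() if w is not None]
    if not terms:
        return 0.0
    # Assumption: a term missing from the document vector has weight 0.
    return sum(similarity(doc.get(t, 0.0), query[t]) for t in terms) / len(terms)


def retrieve(docs, query, threshold):
    """Normalize each value by the maximum, then keep documents at or above the threshold."""
    rsv = {name: retrieval_status_value(vec, query) for name, vec in docs.items()}
    top = max(rsv.values(), default=0.0)
    if top == 0.0:
        return {}
    normalized = {name: value / top for name, value in rsv.items()}
    return {name: value for name, value in normalized.items() if value >= threshold}


query = {"natural": 0.9, "language": 0.9, "other": None}
docs = {
    "d1": {"natural": 0.9, "language": 0.9},
    "d2": {},
}
print(retrieve(docs, query, threshold=0.5))
```

Normalizing by the maximum means the best-matching document always has value 1, so the threshold is interpreted relative to the top-ranked document.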

We have implemented the proposed query modification method for document retrieval using Delphi version 5.0 on a Pentium 4 PC with the 247 NSC research reports [30] described in Section IV. For comparison, we also implemented the query modification method proposed by Kraft et al. [20]. The experimental results show that although the queries modified by the method presented in [20] can improve the precision rate of the retrieval results, the queries modified by the proposed method achieve a higher precision rate. For example, assume that the user wants to retrieve documents about the topic “natural language processing” and assume that 13 documents among the 247 NSC research reports are relevant to this topic. Assume that weights of the terms “natural,” “language,” and “processing” in the user’s query are 0.9, 0.9, and 0.8, respectively, and the other terms are neglected, denoted by the symbol . Assume that the documents have been clustered by the proposed fuzzy hierarchical clustering method with the similarity threshold value and the difference threshold value . Then, the fuzzy logic rules are generated as shown in Figs. 1–3.

The query modification method presented in [20] uses the generated fuzzy logic rules based on the term pairs belonging to the “Positive Association Category” to modify the original query. On the other hand, the proposed query modification method uses the generated fuzzy logic rules based on the term pairs not only belonging to the “Positive Association Category,” but also belonging to the “Generalization Category” and belonging to the “Specialization Category” to modify the original user’s query . By applying the query modification method presented in [20], the original user’s query can be modified into the user’s query which has the same weights of the terms “natural,” “language,” and “processing” as (i.e., 0.9, 0.9, and 0.8) and one additional term “word” with the weight 0.46. By applying the proposed query modification method, the original user’s query can be modified into query which also has the same weights of the terms “natural,” “language,” and “processing” as (i.e., 0.9, 0.9, and 0.8) and four additional terms “word,” “dictionari” (root of “dictionary”), “corpu” (root of “corpus”), and “speech” with the weights 0.46, 0.44, 0.39, and 0.37, respectively. Based on the queries , and for document retrieval, the precision rates and the recall rates with respect to these three queries are shown in Table VI, where the retrieval threshold value is given by the user and . For a document to be retrieved, the retrieval status value of document should be larger than or equal to the “retrieval threshold value” given by the user. The curves of the precision rate and the recall rate with respect to the user’s queries , and are shown in Figs. 4–6, respectively.

TABLE VI

PRECISION RATES AND THE RECALL RATES WITH RESPECT TO DIFFERENT USER’S QUERIES FOR DIFFERENT RETRIEVAL THRESHOLD VALUES

Fig. 4. Precision rate and the recall rate with respect to the original user’s query q for different query threshold values.

From Table VI and Figs. 4–6, we can see that the precision rates with respect to the modified user’s query are larger than the ones with respect to the original user’s query and the modified user’s query for each query threshold value.

In the following, we compare the precision rate and the recall rate for the top p retrieved documents with respect to the original user’s query , the modified user’s query and the modified user’s query , respectively, where p is a positive integer. In our experiment, we let the query threshold value be 0.2. Then, 116 documents, 55 documents, and 25 documents are retrieved with respect to the original user’s query , the modified user’s query and the modified user’s query , respectively. Let us consider the following cases, remembering that there are 13 relevant documents in the collection.

Fig. 5. Precision rate and the recall rate with respect to the modified user’s query q for different retrieval threshold values.

Fig. 6. Precision rate and the recall rate with respect to the modified user’s query q for different retrieval threshold values.

Case 1) When , both the precision rate and the recall rate with respect to the modified user’s query are the largest. For example, when p = 10, the original user’s query gets five relevant documents among the ten retrieved documents. In this case, the precision rate is 50% and the recall rate is 38%. The modified user’s query gets nine relevant documents among the ten retrieved documents. In this case, the precision rate is 90% and the recall rate is 69%. The modified user’s query gets ten relevant documents among the ten retrieved documents. In this case, the precision rate is 100% and the recall rate is 77%.

Case 2) When , the precision rate and the recall rate of the modified user’s query are equal to the precision rate and the recall rate of the modified user’s query , respectively, and are larger than the precision rate and the recall rate of the original user’s query , respectively. For example, when p = 15, the original user’s query gets eight relevant documents among the 15 retrieved documents. In this case, the precision rate is 53% and the recall rate is 62%. Both the modified user’s query and the modified user’s query get 12 relevant documents among the 15 retrieved documents. In this case, the precision rate is 80% and the recall rate is 92%.

Case 3) When , the precision rate and the recall rate of the modified user’s query are equal to the precision rate and the recall rate of the modified user’s query , respectively, and are equal to the precision rate and the recall rate of the original user’s query , respectively. For example, when p = 20, the original user’s query , the modified user’s query and the modified user’s query all get 12 relevant documents among the 20 retrieved documents. In this case, the precision rate is 60% and the recall rate is 92%.

TABLE VII

NUMBER OF RELEVANT DOCUMENTS, THE PRECISION RATES AND THE RECALL RATES WITH RESPECT TO THE RETRIEVED TOP p DOCUMENTS OF THE USER’S QUERY q , q AND q , RESPECTIVELY, WHERE THE RETRIEVAL THRESHOLD VALUE IS 0.2

The number of relevant documents, the precision rates and the recall rates with respect to the retrieved top p documents of the user’s queries , , and , respectively, are summarized in Table VII, where the query threshold value is 0.2.

From Table VII, we can see that the retrieval results with respect to the modified user’s query based on the proposed query modification method are better than those with respect to the user’s queries and from the point of view that most of the top ranked documents are relevant.
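The precision and recall figures quoted in the three cases follow from the standard definitions: precision is the number of relevant documents among the top p divided by p, and recall is that number divided by the 13 relevant documents in the collection. The short check below recomputes them; the function name and the row labels are illustrative only.

```python
TOTAL_RELEVANT = 13  # relevant documents in the 247-report collection


def precision_recall(relevant_in_top_p, p, total_relevant=TOTAL_RELEVANT):
    """Return (precision, recall) for the top-p retrieved documents."""
    return relevant_in_top_p / p, relevant_in_top_p / total_relevant


# (label, relevant documents found, p), taken from the cases in the text.
rows = [
    ("original query, p = 10", 5, 10),
    ("query modified by [20], p = 10", 9, 10),
    ("query modified by proposed method, p = 10", 10, 10),
    ("original query, p = 15", 8, 15),
    ("either modified query, p = 15", 12, 15),
    ("all three queries, p = 20", 12, 20),
]

for label, hits, p in rows:
    precision, recall = precision_recall(hits, p)
    print(f"{label}: precision {precision:.0%}, recall {recall:.0%}")
```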

VII. CONCLUSION

In this paper, we have extended the work of Kraft et al. [20] to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. We have presented a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and obtaining the document cluster center of each document cluster. We have also presented a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we have applied the constructed fuzzy logic rules to modify a user’s query for query expansion and to guide the information retrieval system to retrieve documents relevant to the user’s request. The proposed fuzzy information retrieval method is more flexible and more intelligent than the existing methods due to the fact that it can expand users’ queries for fuzzy information retrieval in a more effective manner.

REFERENCES

[1] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, “Convergence theory for fuzzy c-means: Counterexamples and repairs,” IEEE Trans. Syst., Man, Cybern., vol. SMC-17, no. 5, pp. 873–877, Sep.-Oct. 1987.

[2] G. Bordogna and G. Pasi, “A user-adaptive neural network supporting a rule-based relevance feedback,” Fuzzy Sets Syst., vol. 82, no. 2, pp. 201–211, 1996.

[3] C. L. P. Chen and Y. Lu, “FUZZ: A fuzzy-based concept formation system that integrates human categorization and numerical clustering,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 27, no. 1, pp. 79–94, Feb. 1997.

[4] J. Chen and S. Kundu, “A sound and complete fuzzy logic system using Zadeh’s implication operator,” in Proc. 9th Int. Symp. Methodologies for Intelligent Systems, Zakopane, Poland, 1996, pp. 233–242.

[5] S. M. Chen and Y. J. Horng, “Fuzzy query processing for document retrieval based on extended fuzzy concept networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 1, pp. 126–135, Feb. 1999.

[6] S. M. Chen, Y. J. Horng, and C. H. Lee, “Document retrieval using fuzzy-valued concept networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 31, no. 1, pp. 111–118, Feb. 2001.

[7] S. M. Chen, W. H. Hsiao, and Y. J. Horng, “A knowledge-based method for fuzzy query processing for document retrieval,” Cybern. Syst.: Int. J., vol. 28, no. 1, pp. 99–119, 1997.

[8] S. M. Chen and J. Y. Wang, “Document retrieval using knowledge-based fuzzy information retrieval techniques,” IEEE Trans. Syst., Man, Cybern., vol. 25, no. 5, pp. 793–803, May 1995.

[9] S. M. Chen, Y. J. Horng, and C. H. Lee, “Fuzzy information retrieval based on multi-relationship fuzzy concept networks,” Fuzzy Sets Syst., vol. 140, no. 1, pp. 183–205, 2003.

[10] W. B. Frakes, “Stemming algorithms,” in Information Retrieval: Data Structure & Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Upper Saddle River, NJ: Prentice-Hall, 1992.

[11] Y. J. Horng and S. M. Chen, “Fuzzy query processing for document retrieval based on extended fuzzy concept networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 1, pp. 96–104, Feb. 1999.

[12] Y. J. Horng and S. M. Chen, “Finding inheritance hierarchies in fuzzy-valued concept-networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 1, pp. 126–135, Feb. 1999.


[13] Y. J. Horng, S. M. Chen, and C. H. Lee, “A new fuzzy information retrieval method based on document terms reweighting techniques,” Int. J. Inf. Manage. Sci., vol. 14, no. 4, pp. 63–82, 2003.

[14] Y. J. Horng, S. M. Chen, and C. H. Lee, “Automatically constructing multi-relationship fuzzy concept networks for document retrieval,” Appl. Art. Intell.: Int. J., vol. 17, no. 4, pp. 303–328, 2003.

[15] Y. J. Horng, S. M. Chen, and C. H. Lee, “Fuzzy information retrieval using fuzzy hierarchical clustering and fuzzy inference techniques,” in Proc. 13th Int. Conf. Information Management, vol. 1, Taipei, Taiwan, R.O.C., 2002, pp. 215–222.

[16] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[17] Y. Jung, H. Park, and D. Du, “An effective term weighting scheme for information retrieval,” Univ. Minnesota, Dept. Comput. Sci., Minneapolis, MN, Comput. Sci. Tech. Rep. TR008, 2000.

[18] K. J. Kim and S. B. Cho, “A personalized web search engine using fuzzy concept network with link structure,” in Proc. Joint 9th IFSA Congr. 20th NAFIPS Int. Conf., Vancouver, BC, Canada, 2001, pp. 81–86.

[19] B. M. Kim, J. Y. Kim, and J. Kim, “Query term expansion and reweighting using term co-occurrence similarity and fuzzy inference,” in Proc. Joint 9th IFSA World Congr. 20th NAFIPS Int. Conf., vol. 2, Vancouver, BC, Canada, 2001, pp. 715–720.

[20] D. H. Kraft, J. Chen, and A. Mikulcic, “Combining fuzzy clustering and fuzzy inferencing in information retrieval,” in Proc. 9th IEEE Int. Conf. Fuzzy Systems, vol. 1, San Antonio, TX, 2000, pp. 375–380.

[21] H. M. Lee, S. K. Lin, and C. W. Huang, “Interactive query expansion based on fuzzy association thesaurus for web information retrieval,” in Proc. 10th IEEE Int. Conf. Fuzzy Systems, vol. 2, Melbourne, Australia, 2001.

[22] S. Miyamoto, “Information retrieval based on fuzzy associations,” Fuzzy Sets Syst., vol. 38, no. 2, pp. 191–205, 1990.

[23] E. Rasmussen, “Clustering algorithms,” in Information Retrieval: Data Structure and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Upper Saddle River, NJ: Prentice-Hall, 1992, pp. 419–442.

[24] F. J. Rohlf, “Single-link clustering algorithms,” in Classification, Pattern Recognition, and Reduction of Dimensionality, P. R. Krishnaiah and J. N. Kanal, Eds. Amsterdam, The Netherlands: North-Holland, pp. 267–284.

[25] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing. Upper Saddle River, NJ: Prentice-Hall, 1971.

[26] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988.

[27] G. Salton and M. J. Mcgill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.

[28] R. Sibson, “SLINK: An optimally efficient algorithm for the single link cluster method,” Comput. J., vol. 16, pp. 30–34, 1973.

[29] L. A. Zadeh, “Fuzzy sets,” Inf. Control, vol. 8, pp. 338–353, 1965.

[30] A subset of the collection of the research reports of the national science council. NTUST, Taiwan, R.O.C. [Online]. Available: http://fuzzylab.et.ntust.edu.tw/NSC_Report_Database/247documents.html

Yih-Jen Horng received the B.S., M.S., and Ph.D. degrees, all in computer and information science, from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1994, 1996, and 2003, respectively.

His current research interests include fuzzy systems, database systems, and artificial intelligence.

Dr. Horng was the winner of the 1996 Acer Dragon Outstanding M.S. Thesis Award, Republic of China. He was also the winner of the 2003 Acer Dragon Outstanding Ph.D. Dissertation Award, Republic of China.

Shyi-Ming Chen (M’88–SM’96) received the B.S. degree, and the M.S. and Ph.D. degrees in electrical engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 1982, 1986, and 1991, respectively.

From August 1987 to July 1989, and from August 1990 to July 1991, he was with the Department of Electronic Engineering, Fu-Jen University, Taipei, Taiwan. From August 1991 to July 1996, he was an Associate Professor in the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan. From August 1996 to July 1998, he was a Professor in the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan. From August 1998 to July 2001, he was a Professor in the Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan. Since August 2001, he has been a Professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. He was a Visiting Scholar in the Department of Electrical Engineering and Computer Science, University of California, Berkeley, in 1999, and with the Institute of Information Science, Academia Sinica, Republic of China, in 2003. His current research interests include fuzzy systems, database systems, knowledge-based systems, artificial intelligence, data mining, and genetic algorithms. He has published more than 200 papers in refereed journals, book chapters, and conference proceedings.

Dr. Chen was the winner of the 1994 Outstanding Paper Award of the Journal of Information and Education and the winner of the 1995 Outstanding Paper Award of the Computer Society of the Republic of China. He was the winner of the 1997 Outstanding Youth Electrical Engineer Award of the Chinese Institute of Electrical Engineering, Republic of China. He was the winner of the Best Paper Award of the 1999 National Computer Symposium, Republic of China. He was the winner of the 1999 Outstanding Paper Award of the Computer Society of the Republic of China. He was the winner of the 2001 Outstanding Talented Person Award, Republic of China, for the contributions in Information Technology. He was the winner of the Outstanding Electrical Engineering Professor Award granted by the Chinese Institute of Electrical Engineering (CIEE), Republic of China, in 2002. He was the winner of the 2003 Outstanding Paper Award of the Technological and Vocational Education Society, Republic of China. He is a Member of the ACM, the International Fuzzy Systems Association (IFSA), and the Phi Tau Phi Scholastic Honor Society. He is currently the President of the Taiwanese Association for Artificial Intelligence (TAAI). He is also an Executive Committee Member of the Chinese Fuzzy Systems Association (CFSA). He is an Editor of the Journal of the Chinese Grey System Association, an Associate Editor of the International Journal of New Mathematics and Natural Computation, and an Associate Editor of the International Journal of Fuzzy Systems. He is listed in International Who’s Who of Professionals, Marquis Who’s Who in the World, and Marquis Who’s Who in Science and Engineering.

Yu-Chuan Chang received the B.S. degree from the Department of Computer Science and Information Engineering, Fu-Jen University, Taipei, Taiwan, R.O.C., in 2000, and the M.S. degree in computer science and information engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C., in 2004. He is currently working toward the Ph.D. degree in computer science and information engineering at National Taiwan University of Science and Technology.

His current research interests include fuzzy information systems and artificial intelligence.

Chia-Hoang Lee received the Ph.D. degree in computer science from the University of Maryland, College Park, in 1983.

From 1984 to 1985, he was with the Department of Mathematics and Computer Science, University of Maryland. From 1985 to 1992, he was with the Department of Computer Science, Purdue University, West Lafayette, IN. He is currently a Professor in the Department of Computer and Information Science and also serves as the Deputy Director of MediaTek research center at National Chiao Tung University, Hsinchu, Taiwan, R.O.C. His current research interests include artificial intelligence, man machine interface systems, and natural language processing. He was an Associate Editor of the International Journal of Pattern Recognition.