

An Architecture and Category Knowledge for Intelligent Information Retrieval Agents

Hsieh-Chang Tu and Jieh Hsiang

Department of Computer Science and Information Engineering

National Taiwan University

Taipei, Taiwan

email: {tu,hsiang}@csie.ntu.edu.tw

Abstract

Information overload has become a serious problem for users of the World Wide Web. In this paper we propose to use intelligent information retrieval (IIR) agents as a solution to this problem. We identify the desirable features of an IIR agent, including intelligent search, navigation guide, auto-notification, personal information management, personal preferred interface, and tools for easy page-reading. A modularized agent architecture is then proposed. We describe the responsibility of each component and how the components are combined to perform the various tasks of the IIR agent. We point out that group knowledge, acquired from the preferences of other users in the same group, may be useful. The knowledge of our agents is primarily represented by categories. After defining and clarifying the difference between clusters, directories, and categories, we present category representations as an abstraction of certain desired information. Possible applications of category knowledge are also examined.

1 Introduction

In a few short years, the World Wide Web has become one of the most important media through which people share information resources. Web information is primarily exhibited through Web pages, designed and written by content providers. The enormous amount of available information induces the problem of information overload: there is too much information for people to digest. To alleviate this problem, one needs better information retrieval (IR) software to serve as a filter between the user and the information retrieved over the Web. Such software should provide intelligent search, which provides the user with more interesting Web pages and fewer uninteresting ones. When the user is surfing the net, the software should also be able to suggest interesting URLs to visit. Finally, there may be "hot pages" whose contents change frequently. The IR system should be able to notify the user of such changes automatically.

Partly supported by Grant NSC 87-2213-E-002-012 of the National Science Council of the Republic of China.

Attempts have been made to reduce the problem of information overload. There are browsers that are not only equipped with user-friendly interfaces but also support Java and other powerful languages. Web directories, such as Yahoo!¹, organize "important" Web pages in a way similar to yellow pages. Search engines such as AltaVista², Excite³, Infoseek⁴, and Lycos⁵ use indexing techniques so that users can retrieve potentially relevant pages through queries. Intelligent agents, programs which are supposed to exhibit human behavior, have also been proposed to help users retrieve, locate, and manage Web information [Lee97]. Examples include PointCast Network⁶ and Pathfinder⁷, which offer personalized news and information, Firefly⁸, which makes movie and music recommendations, and WebWatcher [Arm95], which interactively helps users locate desired information. The design of softbots [Etz94], on the other hand, aims at providing integrated solutions for utilizing Web resources.

In this paper we propose an architecture of intelligent information retrieval (IIR) agents. We first discuss what we think are desirable features of a good IIR agent. We describe the notion of an agent community in section 3, and identify an IIR agent as one agent in the community. We propose an integrated architecture to carry out the features of an IIR agent. We further decompose an IIR agent into components called subagents, each of which can be regarded as an independent module that performs a specific function. We also discuss cooperation among subagents. Section 4 discusses the issue of category information. Category information plays a crucial role in an agent's knowledge. We distinguish the meaning of a category from the well-known definition of a cluster. Possible applications of category knowledge are addressed next. A concluding remark about IIR agents is given at the end.

¹ http://www.yahoo.com/
² http://altavista.digital.com/
³ http://www.excite.com/
⁴ http://www.infoseek.com/
⁵ http://www.lycos.com/
⁶ http://www.pointcast.com/
⁷ http://pathfinder.com/
⁸ http://www.firefly.com/

2 Essential Features of Web IIR Agents

Before designing an appropriate agent architecture, it is essential to first decide what an IIR agent should do. To answer this question, one should examine what kinds of difficulty people encounter when they try to get information over the Web. Some of these obstacles come from the difficulty of using the software, but most problems are caused by information overload: there is too much information for the user to retrieve. Since the purpose of an IIR agent is to help people retrieve and manage information on the Web, it should have the following features:

1. Intelligent search: Effective and efficient search of information from a database is a major issue in information retrieval research. When people search for information on the Web, they often become frustrated when the search result contains too little useful information (or too much garbage). An intelligent agent should give the user an interactive environment so that the user's information need can be pinpointed exactly.

2. Navigational guide: When surfing through the Web it is easy to "go astray" in cyberspace. A good IIR agent should provide guides or roadmaps so that users can get assistance when stuck in their navigation on the Web. For instance, the agent may analyze pages recently read by the user in order to suggest related subject areas and pages. Another kind of navigational guide is to highlight potentially interesting hyperlinks.

3. Information auto-notification: It is tedious for people to check whether a page has been updated. After the user specifies the kind of information he needs, an IIR agent should be able to detect updated information or even download it automatically. Messages may be sent to the user to notify him that new data have become available. Furthermore, it is worthwhile for an agent to analyze the user's reading preference so that it may prompt interesting pages to the user automatically.

4. Personal information management: Categories, directories, or folders are familiar ways for people to manage tree-structured hierarchical data. It is useful for an IIR agent to manage personal categories in an intelligent way. For instance, the agent may provide suggestions to build a personal category tree for each user. This category information can later be used to help the user search or navigate on the Web. Another example of information management is to automatically organize bookmarks [Maa96] so that the user may handle bookmarks more easily.

5. Personal preferred interface: Each user may want to have his preferred interface. For instance, a user may want to set his default background colors for pages that do not indicate background colors. Another simple example is that, since pages may be written in languages with different language codes (e.g., the BIG-5 code for Chinese), a good agent should try to display appropriate characters after detecting what language (code) the page uses.

6. Tools as reading aids: A good IIR agent may also provide tools to help the user with reading retrieved pages. Such tools may include on-line dictionaries and translation programs. These programs are usually stand-alone agents themselves. The IIR agent should allow easy incorporation of such tools.

3 Proposed Agent Architecture

3.1 Agent Community

We consider an agent as a goal-oriented program with some learning ability. An agent can dynamically adapt to individual users and can perform certain tasks autonomously. An agent community is a group of agents working together to serve a group of users. In an agent community, agents interact with each other and can cooperate to solve problems if necessary. It is also convenient to further divide agents into task agents and interface agents. Each task agent offers a specific service, and task agents communicate with each other to execute more advanced functions. For instance, a typical IIR agent may not support functions such as looking up dictionaries or translating page contents into another language. These functions are done by other task agents (say $T$). An IIR agent should be able to communicate with $T$ to offer such services to the user. Interface agents are responsible for keeping and updating the user profiles and communicating a user's needs to the task agents.


Among the task agents in the agent community, there is a resource management agent, called the managent, which plays a special role. The managent keeps the list of the services provided by agents in the community. If a new user joins the group, the managent announces it to all task agents. Functions offered by agents can then be automatically prompted to the new user. This allows the user easy access to the services available in the agent community. An interface agent acts like the desktop manager in most systems with a graphical user interface. A browser which allows all possible Web information to be properly displayed can also be regarded as an interface agent.

An example of the architecture of an agent community is illustrated in Fig 1.

[Diagram not reproduced. Labelled elements: interface agents, task agents, the information-retrieval agent, and the resource management agent inside the agent community; task-specific knowledge and task-specific resources attached to agents; system resources and information sources outside the community. Legend: data access path, active communication between agents, possible communication among agents.]

Figure 1: The architecture of agent community

3.1.1 Group and Personal Agents

An IIR agent needs to keep two types of preferences: each user's own preference about search and navigation, and the preference of each group of users. The latter is needed because web pages that are interesting to most users in a group are likely to be interesting to others in the group. Thus some kind of group preference, computed from the profiles of users in the group, is required. Furthermore, in order to reduce network traffic load, web page requests by users in the same group should be handled by the same program. These considerations motivate the design of an IIR agent in two layers. Each user has his own personal agent (PA), which keeps a profile of user preference, and there is a group agent (GA) that handles group knowledge and preference. Intuitively, a PA offers all anticipated features and learns personal preference from the user it serves. A GA accumulates the knowledge about personal preferences it obtains from the PAs, transforms it into the collective group preference, finds interesting pages that reflect the group preference, and monitors web pages that the users wish to be watched. The various functions of the GA and PA are captured in modules which we call subagents. We shall describe these subagents and their collaboration in detail in the next section. Meanwhile, the proposed IIR agent architecture is presented in Fig 2.

[Diagram not reproduced. Labelled elements: World Wide Web resources, WWW search engines, and the WWW IR agent, which consists of one group agent and several personal agents. Legend: communication flow.]

Figure 2: A coarse view of the architecture of information retrieval agents

3.2 Agents and Subagents

Similar to decomposing a program into modules, an agent can also be organized as subagents. Each subagent, working independently, performs some prescribed feature of the agent. Subagents have their own local databases, and share the same knowledge base with other subagents in the same agent. A finer architecture of the IIR agent is described in Fig 3, in which a group agent and a personal agent are enclosed in dashed, rounded rectangles. Boxes within agents represent subagents, which are separate, independent program modules. Subagents work together to form a group or personal agent.

[Diagram not reproduced. The group agent (GA) contains the communication subagent, monitor subagent, Web spider, and group knowledge manager, together with the group preference and document databases. The personal agent (PA) contains the communication subagent, proxy subagent, search subagent, navigation subagent, notification subagent, and profile manager, together with the user preference and document databases. The browser and the WWW search engines are outside the agents. Legend: communications between modules; communications between (1) the proxy subagent and the profile/search/navigation/notification subagents and (2) the communication subagent and the profile/navigation/notification subagents.]

Figure 3: A finer view of the architecture of information retrieval agents

We briefly describe the functions of each subagent as follows:

Communication subagent: Each (group or personal) agent has a communication subagent which takes the responsibility of sending, receiving, and possibly interpreting messages from the external world. A communication subagent may be regarded as a program listening to certain communication ports in conventional network programming, except that the former can actively watch request queues. It is also desirable to equip the communication subagents with some learning capability or with a uniform and flexible protocol such as KQML, so that collaboration between agents can become more effective.

Proxy subagent: According to RFC 1945⁹, a proxy is an intermediary program which acts as both a server and a client for the purpose of making requests on behalf of other clients. A proxy subagent is a special program that intercepts messages between the user and the Web. It also serves as a communication subagent between the user and the personal agent. If the user wishes to set his own personal preference, he must interact with the profile manager via the proxy subagent. Since the proxy subagent knows which pages have been accessed by the user, it provides necessary information for the IIR agent to learn about user preference. A proxy subagent caches frequently accessed pages so that unnecessary network traffic can be reduced. It may also be desirable for the IIR agent to pre-fetch pages which may be interesting to the user. Although pre-fetching pages may increase network traffic, good interaction between the proxy subagent and the managent may allow pre-fetching to be done when the network traffic is light.

Search subagent: There are already many search engines on the Web. Instead of designing its own, an IIR agent may act as a "meta search engine", which collects results obtained from sending user queries to existing search engines. Since search engines may require different query formats, the search subagent is responsible for interacting with the user so that the user can format his queries properly. The subagent translates user queries into the formats acceptable to (a pre-defined set of) search engines, issues the formatted queries to these engines, and returns the collected results to the user. The subagent may ask the user for some category information (discussed in detail in Section 4) so that better search results can be presented to him.

⁹ Request for Comments, No. 1945, the specification of HTTP/1.0

Navigation subagent: Given a set of pages recently read by the user, the navigation subagent attempts to classify these pages into pre-defined categories. If the user gets lost in cyberspace, he may ask the navigation subagent to suggest interesting hyperlinks. The subagent will prompt the user with categories that are related to the page currently being browsed. Each category contains hyperlinks as well as titles or descriptions of the corresponding Web pages. Some hyperlinks may be manually coded (such as important Web sites), and some are obtained from recently browsed pages. Categorized hyperlinks thus provide the user with ways to jump to other pages which are related to the pages currently being browsed.

Notification subagent: The user may ask the notification subagent to monitor frequently changed pages. If these pages are changed, the subagent will notify the user automatically. On the other hand, the user may ask the IIR agent to search through the Web to find pages fulfilling requirements specified by the user. The notification subagent should provide the user with a comprehensive way to specify his information needs. The notification subagent does not monitor or search Web pages itself. Instead it sends monitor or search requests to the group agent. The monitor subagent and Web spider (described later) in the group agent are responsible for handling these requests.

Profile manager: The profile manager modifies the user preference, either by interacting with the user directly, or by communicating with the group agent to obtain group preference. It contains a knowledge explainer so that the user can read the knowledge stored in the profile. Some knowledge, such as the pages to be monitored or to be searched by the Web spider, can be specified simply by a form or a table. Statistical knowledge, such as a user's category preference, is more difficult to interpret. Since we will represent such knowledge by a set of keywords, the agent needs to let the user know the role of such keywords in the representation. It is also important to allow an experienced user to edit category preference manually. Details about categories will be addressed in section 4.


Web spider: The Web spider is a subagent of the group agent, and searches through the Web using a fish-search algorithm¹⁰. To bootstrap the search algorithm, we provide the spider with a list of index pages, which contain hyperlinks to related pages. The spider uses these hyperlinks to reach other pages, and in turn uses hyperlinks in the resulting pages to reach more pages. A pre-defined search width and search depth are required so that the spider will not lose its original searching goal by following too deep a chain of pages. The spider may also interact with the user so that search goals can be modified dynamically [Che97]. If the spider finds pages that may be interesting to all users in the group, these pages will be sent to the corresponding personal agents so that the users can be notified.

Monitor subagent: This subagent takes requests from personal agents and monitors the specified pages to see if their contents have been modified. Monitoring can be done by downloading the page and comparing it with an older version of the same page. The main reasons for putting the monitor subagent in the group agent instead of the personal agent are to reduce network traffic (since many users may want to monitor the same pages) and to reduce the complexity of personal agents.

Group-knowledge manager: Group knowledge is information pertinent to the interests of a group of users. There are basically two kinds of knowledge known to the group agent. The first kind is more of a record-keeping nature. It includes pages specified by the users to monitor, the number of pictures or voice files in a page, and pages satisfying certain conditions (such as containing at least two image files). The second kind of knowledge is obtained from statistics of pages accessed or read by the users. This knowledge is represented by a set of attributes and weights, which can be interpreted using notions from fuzzy sets or probability. A simple example of statistical knowledge is the histogram which counts the number of times each page is accessed by the users. Pages frequently requested may be regarded as "hot pages". Another kind of statistical knowledge, namely the category information, plays a central role in the knowledge of our IIR agents.
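The bounded traversal performed by the Web spider described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the fish-search algorithm itself: it only shows how a pre-defined search width and search depth bound the crawl starting from a list of index pages. The in-memory `LINKS` graph, the `crawl` function, and its parameter names are hypothetical; a real spider would fetch pages over HTTP and judge their relevance.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> outgoing hyperlinks.
LINKS = {
    "index": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": ["c1"],
    "a1": ["deep"],
    "a2": [], "b1": [], "c1": [], "deep": [],
}

def crawl(index_pages, max_width, max_depth, links=LINKS):
    """Breadth-first traversal that follows at most `max_width` hyperlinks
    per page and goes at most `max_depth` links away from an index page."""
    visited = []
    seen = set(index_pages)
    queue = deque((page, 0) for page in index_pages)
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth bound
        for target in links.get(page, [])[:max_width]:
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With `max_depth=2` the page named `deep`, which is three links away from `index`, is never reached, which is the effect the depth bound is meant to have.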

3.3 Collaboration among Subagents

Subagents collaborate to perform the functions of an IIR agent. In the rest of the section we describe how the collaboration is done for the processes of intelligent search, auto-notification, navigation guide, and personal information management.

¹⁰ A description of the algorithm can be found at http://www.eecs.wsu.edu/~bamberg/hypercourse/fishsearch.html.

3.3.1 Process of Intelligent Search

A typical working scenario of intelligent search is described in Fig 4. In order to make the search "intelligent", the group agent needs some initial knowledge about categories¹¹ (which amounts to the initial knowledge about group preference). It announces the category knowledge to personal agents so that each personal agent will have the same initial category knowledge. The left part of the off-line preference processing shown in Fig 4 illustrates this idea. After a user sends a query, the search subagent receives this message from the proxy subagent. It then translates the user query into queries that are acceptable to existing search engines on the Web, and gathers the results returned by the search engines. The search subagent uses category knowledge to filter out pages it deems uninteresting, and presents the final result (via the proxy subagent) to the user. The user marks pages as interesting, uninteresting, or no comment. These labelled pages, which indicate the types of pages interesting to the user within certain categories, will be used later in the learning process. After a pre-set period of time, the personal agent communicates with the group agent about what it has learned from the user. This makes it possible for the group agent to modify its category knowledge from the user preference. The group agent then communicates its newly gained knowledge back to the personal agents.
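The query translation, result gathering, and filtering steps of this scenario can be sketched as follows. This is only an illustration in Python under stated assumptions: the two engines, their query syntaxes, the returned URLs, and the `is_interesting` predicate (standing in for the category-based filter) are all hypothetical stubs, and no real search engine interface is implied.

```python
from typing import List

# Hypothetical query translators: each engine is assumed to expect its own syntax.
def engine_a_translate(terms: List[str]) -> str:
    return " AND ".join(terms)          # assumed boolean syntax

def engine_b_translate(terms: List[str]) -> str:
    return "+" + " +".join(terms)       # assumed "+term" syntax

# Hypothetical engines: in-memory stubs instead of HTTP requests.
def engine_a_search(query: str) -> List[str]:
    return ["http://example.org/agents", "http://example.org/sports"]

def engine_b_search(query: str) -> List[str]:
    return ["http://example.org/agents", "http://example.org/retrieval"]

ENGINES = {
    "engine_a": (engine_a_translate, engine_a_search),
    "engine_b": (engine_b_translate, engine_b_search),
}

def meta_search(terms, is_interesting, engines=ENGINES):
    """Send the query to every engine, merge the results without
    duplicates, and drop pages rejected by the category filter."""
    merged = []
    for translate, search in engines.values():
        for url in search(translate(terms)):
            if url not in merged and is_interesting(url):
                merged.append(url)
    return merged
```

For example, `meta_search(["agents"], lambda url: "sports" not in url)` merges the two result lists, removes the duplicate, and drops the page the filter rejects. In the agent, the filter would be derived from category knowledge instead of a substring test.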

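[Diagram not reproduced. The personal agent (PA) shows the browser, proxy subagent, search subagent, communication subagent, and profile manager with the document and user preference databases. The group agent (GA) shows the communication subagent and group knowledge manager with the document and group preference databases. The WWW search engines are outside the agents. Legend: off-line preference processing, data flow of a search process.]

Figure 4: The process of intelligent search, including the propagation of knowledge from the group agent to the personal agent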

¹¹ Intuitively, categories are classes in a classification. We shall


We remark that the user has control over the personal preference learned. This is illustrated in Fig 4 as the off-line preference processing between the profile manager and the user profile. The profile manager should explain to the user, as much as it can, what preference it has learned. If the user is experienced enough, he can modify the preference to match his information need.

3.3.2 Processes of Auto-Notification, Navigation Guide, and Personal Information Management

We illustrate the process of auto-notification in Fig 5. At the beginning, the user specifies what he needs to the notification subagent. One possible specification simply indicates page patterns, and pages satisfying these patterns can be regarded as interesting. The notification subagent sends pattern messages to the group knowledge manager so that personal requests can be stored in the group preference database. The Web spider analyzes group requests (from the group preference database) so that Web pages interesting to group users receive more attention. As soon as interesting pages are found, they are put into the group document database. Similarly, the monitor subagent watches the specified Web pages to see if they have been changed. It writes messages to the group document database if it wants to inform group users that some pages have changed. The group knowledge manager checks personal requests against notifiable documents, and transmits the necessary information to the corresponding notification subagent. The user receives a prompting message once the notification subagent decides to notify him.
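The change check performed by the monitor subagent can be sketched as follows. This Python illustration makes two assumptions that are not in the text above: instead of keeping the full older version of each page, it stores a SHA-256 digest of the content and compares digests, and it receives page content through a caller-supplied `fetch` function. The names `digest`, `check_pages`, and `known_digests` are hypothetical.

```python
import hashlib

def digest(content: bytes) -> str:
    """Fingerprint of a page's content."""
    return hashlib.sha256(content).hexdigest()

def check_pages(urls, fetch, known_digests):
    """Return the URLs whose content differs from the stored version and
    update `known_digests` in place. A page seen for the first time is
    recorded but not reported as changed."""
    changed = []
    for url in urls:
        new = digest(fetch(url))
        old = known_digests.get(url)
        if old is not None and old != new:
            changed.append(url)
        known_digests[url] = new
    return changed
```

Because the check runs once in the group agent, a page watched by many users is downloaded and compared only once per round, and the list returned by `check_pages` is what would be matched against personal requests.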

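[Diagram not reproduced. The group agent (GA) shows the communication subagent, monitor subagent, Web spider, and group knowledge manager with the group preference and document databases. The personal agent (PA) shows the browser, proxy subagent, communication subagent, and notification subagent with the user preference and document databases. Legend: data access path, communication path between agents and subagents.]

Figure 5: The process of auto-notification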

To perform navigation guide, the navigation subagent consults the user preference and the documents stored in the personal agent. It analyzes pages recently read by the user, and prompts related categories to the user. Hyperlinks stored in categories allow the user to visit related pages. The process of navigation guide, not including the propagation of knowledge (e.g., category knowledge) from the group agent to personal agents, is shown in Fig 6 (a).

Fig 6 (b) pictures the process of personal information management. The profile manager consults the user profile and displays the stored knowledge to the user. User preference includes the personalized interface, interesting categories and their representations, interesting-page patterns, and other system settings (e.g., the cache size used by the proxy subagent). Personal information management allows an experienced user to control his own preference profile.

[Diagram not reproduced. Panel (a): WWW resources, browser, and a personal agent (PA) showing the proxy subagent, navigation subagent, and the user preference and document databases. Panel (b): browser and a personal agent (PA) showing the proxy subagent, profile manager, and the user preference and document databases. Legend: data access path, communication path between subagents.]

Figure 6: (a) The process of navigation guide, not including the propagation of knowledge from the group agent. (b) Subagents involved in personal information management

4 The Formulation of Category Knowledge

Web pages normally contain multimedia information. However, it is difficult to handle information stored in non-text form. One solution is to describe multimedia information by a sequence of words [Gug96] so that it can be treated as normal text, and to use conventional IR techniques to handle the text information. In this section, we assume that information stored in Web pages can be processed via text processing. We call a page a document to emphasize that the retrieval is done by text processing.


Human beings are familiar with using classification techniques to manage a large number of objects. Similar objects are collected into the same group so that they can be retrieved conveniently. One of the most popular methods is clustering, which puts documents with similar features or keywords into the same cluster. However, since features or keywords may not reveal the semantic information of a document, a cluster may contain too much noise to be truly useful. A more effective way is to group documents according to semantic concept. We call a group of documents that captures a certain semantic concept a directory. The use of directories, although intuitively reasonable, is not computationally feasible since it is extremely difficult to compute the semantic information of a document precisely. We therefore introduce the notion of a category, which computes an approximation of a directory of documents. In the following subsections, we first review the definition of the document vector model, which is a popular method in document processing and will also be used in our method. We then discuss the relationship among a cluster, a directory, and a category in more detail in subsection 4.2. A special category called the User Category is then introduced to capture a user's recent browsing preference. Finally, we examine possible applications of category knowledge in an IIR agent.

4.1 The Document Vector Model

Allowing inputs in natural language is one way to make an IR system friendly. However, natural language understanding is a notoriously difficult task. Therefore it is common to employ some "approximation" method to analyze the queries and documents. One popular method for processing documents is the vector model [Sal89], which regards each document as a vector. Let $V$ be a finite vocabulary of words and let $v = |V|$. The word space $W$ is the $v$-dimensional vector space over the real numbers. Each document in a given database is represented by a vector $\mathbf{d}$ (called a document vector) in $W$. For convenience, from now on we use $\mathbf{d}$ to denote both a document and the associated document vector. We regard the set of all "valid" Web pages (i.e., pages known or accessible by the IIR agent) as a database $D$ (which is also regarded as a subset of $W$). Informally, the value of the $i$th component of a vector $\mathbf{d}$ is computed from some statistical property which is a function of the $i$th word in $V$, the document $\mathbf{d}$, and the database $D$. Given a vector $\mathbf{p} = (p_1, \ldots, p_v) \in W$, we use $|\mathbf{p}| = \sqrt{\sum_{i=1}^{v} p_i^2}$ to denote the length of $\mathbf{p}$. For simplicity, we normalize each document vector $\mathbf{d} \in D$ so that $|\mathbf{d}| = 1$. We remark that, in the vector model, it is possible to have the same document vector representing different documents.
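As an illustration of the vector model, the following Python sketch builds a unit-length document vector over a small vocabulary. The text leaves the per-word statistic unspecified, so the sketch assumes raw term frequency; the function name `document_vector` and the whitespace tokenization are likewise assumptions made for the example.

```python
import math
from collections import Counter

def document_vector(text, vocabulary):
    """Return a unit-length vector with one component per vocabulary word.
    Raw term frequency is used as the per-word statistic (an assumption)."""
    counts = Counter(text.lower().split())
    raw = [float(counts[word]) for word in vocabulary]
    length = math.sqrt(sum(x * x for x in raw))
    if length == 0.0:
        return raw  # no vocabulary word occurs; leave the zero vector as is
    return [x / length for x in raw]
```

Under this weighting, the texts "agent web" and "web agent" map to the same vector, which is a small instance of the remark that different documents may share one document vector.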

An intuitive way to represent the similarity between two document vectors is by the distance between them. A smaller distance means that the two vectors are closer and therefore the associated documents are more similar. Let $\mathbf{p}$, $\mathbf{q}$ be two points (i.e., document vectors) in $D$. We denote the distance between $\mathbf{p}$ and $\mathbf{q}$ by $d(\mathbf{p}, \mathbf{q})$, which is a non-negative real number. Let $\mathbf{p} = (p_1, \ldots, p_v)$ and $\mathbf{q} = (q_1, \ldots, q_v)$. One commonly used definition of $d(\mathbf{p}, \mathbf{q})$ is

$$d(\mathbf{p}, \mathbf{q}) = 1 - \sum_{i=1}^{v} p_i q_i,$$

which measures the similarity by the inner product of $\mathbf{p}$ and $\mathbf{q}$.
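A direct transcription of this distance into Python may help make the definition concrete. The function name `distance` is chosen for the example, and the vectors are assumed to be unit-length lists of floats of equal dimension.

```python
def distance(p, q):
    """d(p, q) = 1 - sum_i p_i * q_i for unit-length vectors p and q."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return 1.0 - sum(pi * qi for pi, qi in zip(p, q))
```

Identical unit vectors are at distance 0, and unit vectors with no word in common are at distance 1. For vectors with non-negative components the value therefore stays between 0 and 1.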

In order to automatically classify similar document vectors into the same group, one needs a definition to measure the distance between two sets of document vectors. Let $P$, $Q$ be two sets of points in the word space. There are various definitions of $d(P, Q)$, the distance between $P$ and $Q$. Two popular definitions are the single-link distance and the complete-link distance [Jai88, Sal89]. The former defines $d(P, Q) = \min_{\mathbf{p} \in P, \mathbf{q} \in Q} d(\mathbf{p}, \mathbf{q})$, while the latter defines $d(P, Q) = \max_{\mathbf{p} \in P, \mathbf{q} \in Q} d(\mathbf{p}, \mathbf{q})$. Intuitively, the single-link distance is the distance between the most similar pair of points from the two sets (one from each set), while the complete-link distance refers to the distance of the least similar pair of points in $P$ and $Q$.
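The two set distances can be written down in a few lines of Python. The sketch below repeats the inner-product distance so that it is self-contained; the function names `single_link` and `complete_link` are chosen for the example, and both sets are assumed to be non-empty lists of unit-length vectors.

```python
from itertools import product

def distance(p, q):
    """Point distance: 1 minus the inner product of two unit vectors."""
    return 1.0 - sum(pi * qi for pi, qi in zip(p, q))

def single_link(P, Q):
    """Distance between the most similar pair, one point from each set."""
    return min(distance(p, q) for p, q in product(P, Q))

def complete_link(P, Q):
    """Distance between the least similar pair, one point from each set."""
    return max(distance(p, q) for p, q in product(P, Q))
```

Both functions compare every pair of points, so their cost grows with the product of the two set sizes.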

4.2 Clusters, Directories, and Categories

Recall that $D$ denotes a set of document vectors representing pages on the Web. A cluster $X$ (of $D$) is a subset of $D$ such that there is a high degree of association (measured by a chosen distance function) between members in $X$. In practice we also require that members from different clusters have low degrees of association. Clusters are generated by unsupervised learning techniques, which means that the learning is performed without labelled training examples. By a labelled training example we mean that the classification result (i.e., whether a document belongs to a cluster) is explicitly specified in advance. Some commonly used clustering methods include the c-means algorithm [Sch92], Learning Vector Quantization (LVQ) [Mak85], and fuzzy clustering techniques [Zim91]. It has been shown that cluster-based approaches can help the user browse large document collections [Cut92]. However, clusters generated automatically are difficult to interpret, since "similar documents" defined from a distance measure may not be meaningful to most people.

Like a cluster, a directory is a subset of points in $D$. The difference between a cluster and a directory is that the former is a "syntactical" group of documents, while the latter is a "semantical" group of documents. A directory is pre-defined by the retrieval system (often manually by the designer), and its semantical content means that we can name it in a way familiar to most people. For instance, we may group documents (semantically) related to computers in a directory called /computer, and organize documents (semantically) related to computer architecture in a sub-directory called /computer/architecture. In the real world, people have a lot of experience in handling information with this kind of naming structure. For instance, in a computer system, users are familiar with attaching a mnemonic path name to each file directory.

The main problem with using directories in a Web database is that their construction is almost impossible to automate¹². Thus in this paper we propose the notion of a category, which is an approximation of a directory. Given a directory, we denote an associated category by $C$. A category is also a subset of points in $D$.¹³ Assume that we are given a set of (possibly hierarchical) directory names. Since in practice directories exist only in an abstract sense, we approximate them using a training set $T \subseteq D$. Each sample in the training set $T$ is explicitly labelled as belonging to one or more directories. Our task, then, is to create, for each directory, a category $C$ from the information provided by $T$. Since categories are generated by a computer, we shall introduce a vector representation of categories in the following section.

4.3 Category Representation

Hierarchical categories can be represented by a tree structure. We call each node in such a tree a category node. A category node is labelled by a category name, which is the path from the root of the tree to that node. We use $\nu(C)$ to denote the category node of the category $C$. A category $C'$ is said to be a subcategory of $C$ if $\nu(C)$ is an ancestor of $\nu(C')$. It is called a proper subcategory if $\nu(C)$ is the parent of $\nu(C')$. A category without subcategories is a leaf category. A category which is neither root nor leaf is an intermediate category.

A category $C$ is represented by a prototype vector $\mathbf{c} \in W$ and a radius $\rho(C)$, which is a positive number. The interpretation of the category representation is as follows. If $C$ is a leaf category, $C$ is defined as the set of document vectors $\mathbf{d}$ such that $d(\mathbf{c}, \mathbf{d}) \le \rho(C)$. If $C$ is an intermediate category, $C$ is defined as the union of its subcategories and the document vectors $\mathbf{d}$ such that $d(\mathbf{c}, \mathbf{d}) \le \rho(C)$. If $C$ is the root category, its radius is set to $\infty$ so that it consists of all document vectors in the database. Intuitively, a category should have a radius bigger than those of its subcategories.

[12] A case in point is Yahoo!, which has one of the best directory structures on the Web today.

[13] As indicated before, a category may also contain Web hyperlinks.

Notice that a parent category is defined from its subcategories. This is different from conventional hierarchical categorization techniques, where a subcategory is defined only when its parent category is defined. Our "bottom-up" definition is inspired by the observation that the vector representation of a subcategory is normally more precise than that of a parent category.

The generation of the categories (which approximate the intended directories) is done using supervised learning techniques from the labelled training set $T$. Several methods are known, such as least-squares functional approximation [Sch96] or the training algorithms for linear text classifiers [Lew96]. Some earlier experiments show that a prototype vector learned from specific user interests achieves encouraging results in selecting interesting Web pages [Sho95]. In practice, the documents satisfying a user interest may be defined as a personal directory. We remark that allowing one document vector to be classified into several directories may complicate the learning process.
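As a rough illustration of how a category could be derived from a labelled training set, the sketch below builds one prototype vector and one radius per directory name. It is not the least-squares approximation of [Sch96] or the linear classifier training of [Lew96]: it substitutes a plain centroid for the prototype, takes the largest member distance as the radius, and assumes a Euclidean distance measure. The names `learn_categories` and `min_radius`, and the toy data, are hypothetical.

```python
import math
from collections import defaultdict


def distance(a, b):
    """Euclidean distance; an assumed choice of distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def learn_categories(training_set, min_radius=1e-6):
    """Build {directory name: (prototype, radius)} from labelled samples.

    training_set is a list of (vector, labels) pairs. A sample may carry
    several labels, in which case it contributes to every one of them.
    """
    members = defaultdict(list)
    for vector, labels in training_set:
        for label in labels:
            members[label].append(vector)

    categories = {}
    for label, vectors in members.items():
        size = len(vectors[0])
        # Centroid of the labelled samples (a stand-in for a trained prototype).
        prototype = [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]
        # Smallest radius covering every training sample of this label,
        # floored so that the radius stays positive.
        radius = max(max(distance(prototype, v) for v in vectors), min_radius)
        categories[label] = (prototype, radius)
    return categories


# Hypothetical toy training set over a two-word vocabulary.
training = [
    ([0.0, 0.0], ["/a"]),
    ([2.0, 0.0], ["/a", "/b"]),  # labelled with two directories
    ([2.0, 2.0], ["/b"]),
]
categories = learn_categories(training)
```

In this sketch a sample with several labels is simply counted once per label; a real training method may need to treat such overlaps more carefully.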

Once appropriate category representations are computed from the training set, they can be used to perform document classification. That is, we may classify a document vector $\mathbf{d}$ (which may not belong to the training set) into an existing category $C$ if one of the following holds:

1. $C$ is a leaf category and $d(\mathbf{d}, \mathbf{c}) \le \rho(C)$.

2. $C$ is an intermediate category, and either $\mathbf{d}$ is classified to some subcategory of $C$, or $d(\mathbf{d}, \mathbf{c}) \le \rho(C)$.

3. $C$ is the root category.
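The three classification rules can be sketched as a recursive membership test over the category tree. This is a minimal illustration, assuming a Euclidean distance measure and a non-strict comparison against the radius; the `Category` class, the function names, and the sample tree are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str                      # path name, e.g. "/computer/architecture"
    prototype: list[float]         # prototype vector c
    radius: float                  # radius of the category
    children: list["Category"] = field(default_factory=list)


def distance(a, b):
    """Euclidean distance; an assumed choice of distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def belongs(cat, doc, is_root=False):
    """Apply rules 1-3 to decide whether doc is classified into cat."""
    if is_root:                                      # rule 3
        return True
    if distance(doc, cat.prototype) <= cat.radius:   # rules 1 and 2
        return True
    # Rule 2: membership in any descendant also counts.
    # For a leaf there are no children, so this is False.
    return any(belongs(child, doc) for child in cat.children)


def classify(root, doc):
    """Return the names of all categories that contain doc (pre-order)."""
    found = []

    def visit(cat, is_root):
        if belongs(cat, doc, is_root):
            found.append(cat.name)
        for child in cat.children:
            visit(child, False)

    visit(root, True)
    return found


# Hypothetical two-level tree over a two-word vocabulary.
arch = Category("/computer/architecture", [1.0, 1.0], 0.3)
computer = Category("/computer", [1.0, 0.0], 0.5, [arch])
root = Category("/", [0.0, 0.0], math.inf, [computer])

near_arch = classify(root, [1.0, 0.9])   # inside the leaf only
near_comp = classify(root, [1.0, 0.2])   # inside the parent's own radius
far_away = classify(root, [5.0, 5.0])    # only the root matches
```

The first document lies outside the radius of /computer yet is still classified into it, because it falls inside the subcategory /computer/architecture; this is the "bottom-up" behaviour described above.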

4.4 A Special Category: User Category

Interesting pages browsed by the user usually reveal information about the user's preferences. Such information can be used to form a special category called the User Category [14] (which is represented by a vector $\mathbf{u}$ and a radius $\rho(\mathbf{u})$). We assume that the radius (or threshold) $\rho(\mathbf{u})$ is set by the user. The agent adopts the following rules to modify the prototype vector $\mathbf{u}$ automatically:

1. Let $\mathbf{d}$ be the vector representing the page being browsed. We replace $\mathbf{u}$ by

$$\frac{\mathbf{u} + \alpha\,\mathbf{d}}{|\mathbf{u} + \alpha\,\mathbf{d}|},$$

where $\alpha < 1$ is a positive number called the learning rate.

2. After some period of time, the agent may decide to "forget" information collected from pages browsed a long time ago. Let $J$ be the set of pages recently browsed by the user. For any document vectors $\mathbf{p}$ and $\mathbf{q}$, we use $\mathbf{p} \ominus \mathbf{q}$ to denote a vector whose $i$th element is $p_i - q_i$ if $p_i - q_i > 0$, and is 0 otherwise. We construct a vector $\mathbf{w} = (w_1, \ldots, w_v) \in W$ by setting $w_i = 1$ if the $i$th word appears in $J$, and $w_i = 0$ otherwise. The vector $\mathbf{u}$ is replaced by

$$\frac{\mathbf{u} \ominus \beta\,\mathbf{w}}{|\mathbf{u} \ominus \beta\,\mathbf{w}|},$$

where $\beta \le 1$ is a positive number called the discarding rate.

[14] It is also possible to form several user categories for the same user. For simplicity we consider here only the case where one such category is formed.
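The two update rules can be sketched in a few lines. The sketch assumes that rule 1 replaces u by (u + a·d)/|u + a·d| with learning rate a, and that rule 2 replaces u by the normalized truncated difference of u and b·w with discarding rate b, reading the forgetting rule literally as stated; the symbol names, function names, and sample vectors are hypothetical.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = norm(v)
    return [x / n for x in v] if n > 0 else list(v)


def learn(u, d, alpha):
    """Rule 1: pull the user vector u towards the browsed page d."""
    return normalize([ui + alpha * di for ui, di in zip(u, d)])


def truncated_sub(p, q):
    """Element-wise p - q, with negative results replaced by 0."""
    return [max(pi - qi, 0.0) for pi, qi in zip(p, q)]


def forget(u, w, beta):
    """Rule 2: subtract beta * w from u (truncated at 0), then normalize.

    w is the 0/1 word-indicator vector built from the recent pages J.
    """
    return normalize(truncated_sub(u, [beta * wi for wi in w]))


# Hypothetical vectors over a two-word vocabulary.
u = [1.0, 0.0]
u_after_learning = learn(u, [0.0, 1.0], alpha=0.5)
u_after_forgetting = forget([0.6, 0.8], [1, 0], beta=0.5)
```

Both rules end with a normalization step, so the user vector keeps unit length and its radius remains comparable across updates.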

Knowledge of the User Category can be used in the auto-notification process to filter out uninteresting pages. On the other hand, the group recent preference may be built from each user's User Category knowledge so that the Web spider can tune its search direction to find more interesting pages.

4.5 Utilizing Category Knowledge in an IIR Agent

Since categories are obtained via supervised learning with a training set, they contain more semantic information and resemble the intended directories more closely than clusters do. The knowledge contained in the categories is beneficial to intelligent information retrieval. It enables document classification, which implies that we may use categories as filters to exclude documents that do not belong to categories that interest a specific user. In the following, we describe how category knowledge can be used in different aspects of an IIR agent.

Intelligent search: Category knowledge can be used to filter out uninteresting documents. In addition to the normal query, the agent may ask the user to provide category information (e.g., specify interesting categories) so that documents not belonging to the specified categories can be filtered out. To be more specific, let $P$ be the set of documents returned by search engines. If $C_1, \ldots, C_m$ are interesting categories, a document $d \in P$ will be presented to the user if $d$ is classified as belonging to some $C_i$, where $1 \le i \le m$.

Navigation guide: Let us assume that each category contains sample documents (stored in the group agent) which come from either training samples or pages identified by users as belonging to the category. When a user needs a navigation guide, the agent first analyzes recently browsed pages to determine which categories (say $\mathcal{C}$) are related to the user's recent interest. A category tree, with the categories in $\mathcal{C}$ highlighted or ranked high, is presented to the user. After the user clicks an interesting category, pages stored in the selected category can be presented to the user as suggested pages.

Auto-notification: The use of category knowledge in auto-notification is similar to that in intelligent search, namely to use categories to filter out uninteresting documents. To be more specific, the user sets document criteria so that documents (found by the Web spider) matching the constraints can be suggested to the user. The constraints specified typically consist of (natural language) queries, shallow multimedia information (i.e., the number of pictures, images, or voice files in a page), plain document information (such as document location, document size, the number of hyperlinks, or possibly document author), and interesting categories.

Personal information management: A user may preserve interesting pages as bookmarks. Storing bookmarks hierarchically is valuable for managing pages that are identified as interesting to the user. People may simply store bookmarks under hierarchical categories offered by the personal agent. The user can rename, add, delete, or modify the category names stored in the personal agent. If the personal agent finds that too many documents or bookmarks are stored, it may group them into several clusters and then ask the user to give a name to each cluster. In this way categories may grow semi-automatically (since the user has to give names to them), and cluster analysis will be helpful in constructing personal categories.
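The filtering rule described under intelligent search (present a returned document only if it is classified into at least one interesting category) can be sketched as follows. For brevity the sketch treats every interesting category as a leaf given by a (prototype, radius) pair and assumes a Euclidean distance measure; `filter_results`, the page identifiers, and the vectors are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance; an assumed choice of distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_results(results, interesting):
    """Keep the documents that fall inside at least one interesting category.

    results     -- list of (identifier, document vector) pairs, the set P
    interesting -- list of (prototype, radius) pairs, the categories C_1..C_m
    """
    kept = []
    for identifier, vector in results:
        if any(distance(vector, proto) <= radius for proto, radius in interesting):
            kept.append(identifier)
    return kept


# Hypothetical search results over a two-word vocabulary.
results = [
    ("page-a", [1.0, 0.1]),
    ("page-b", [0.0, 1.0]),
    ("page-c", [0.2, 0.9]),
]
interesting = [([1.0, 0.0], 0.3)]
shown = filter_results(results, interesting)
```

The same test could serve auto-notification, with pages found by the Web spider taking the place of the search engine results.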

A profile manager is responsible for explaining the meaning of category knowledge to the users. Explanation of category knowledge helps a naive user understand what has been stored in the personal agent. Recall that a category $C$ is represented by a vector $\mathbf{c}$ and a radius $\rho(C)$. Adjusting the $i$th component of $\mathbf{c}$ corresponds to tuning the "importance" of the $i$th word (in $V$) to $C$, while modifying $\rho(C)$ amounts to changing the size or range of $C$. An experienced user can probe whether the adjustment of weights satisfies his demand by classifying sample documents into the adjusted categories.


5 Concluding Remarks

The growing popularity of the World Wide Web worsens the problem of information overload. IIR agents are proposed as one possible solution to help people manage Web information. We point out the desirable features of an IIR agent and propose an agent architecture which supports the implementation of these features. Subagents of an IIR agent are also identified so that they can be designed and implemented separately. Collaboration among subagents is illustrated and discussed.

We then turn to the question of classification of Web documents and introduce a notion of categories which better captures the informal but "conceptually ideal" notion of directories. We describe how categories can be represented by vector models and obtained through supervised learning with a training set of documents. Our notion of categories can be automated and seems better than the more common approach of clusters, which are built via unsupervised learning and are harder for humans to understand. How categories can be utilized in an IIR agent is also described.

Clustering analysis has attracted much attention in traditional IR research. There is, however, little study about hierarchical categories and their representations. It is imperative to encourage more theoretical, as well as experimental, studies about categories.

References

[And73] M. R. Anderberg, Cluster Analysis for Applications, New York: Academic, 1973.

[Arm95] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, WebWatcher: A Learning Apprentice for the World Wide Web, AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 1995.

[Sho95] M. Balabanović and Y. Shoham, Learning Information Retrieval Agents: Experiments with Automated Web Browsing, AAAI-95 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 1995.

[Che97] H. Chen, Y. M. Chung, M. Ramsey, C. C. Yang, P. C. Ma, and J. Yen, Intelligent Spider for Internet Searching, Proceedings of the 31st Hawaii International Conference on System Sciences, HICSS-30, Vol. 4, pp. 242-252, 1997.

[Cut92] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR'92, 1992.

[Etz94] Oren Etzioni and Daniel Weld, A Softbot-Based Interface to the Internet, Communications of the ACM, Vol. 37, No. 7, pp. 72-76, July 1994.

[Gug96] Eugene J. Guglielmo and Neil C. Rowe, Natural-Language Retrieval of Images Based on Descriptive Captions, ACM Transactions on Information Systems, Vol. 14, No. 3, pp. 237-267, July 1996.

[Jai88] Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.

[Lee97] Joseph K. W. Lee, et al., Intelligent Agents for Matching Information Providers and Consumers on the World-Wide-Web, Proceedings of the Thirtieth Annual Hawaii International Conference on System Sciences, IEEE, 1997.

[Lew96] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, Training Algorithms for Linear Text Classifiers, ACM SIGIR'96, 1996.

[Maa96] Yoelle S. Maarek and Israel Z. Ben Shaul, Automatically Organizing Bookmarks per Contents, Fifth International World Wide Web Conference, May 1996.

[Mak85] J. Makhoul, S. Roucos, and H. Gish, Vector Quantization in Speech Coding, Proceedings of the IEEE, Vol. 73, No. 11, pp. 1551-1588, 1985.

[Sal89] Gerard Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.

[Sch92] Robert Schalkoff, Pattern Recognition: Statistical, Structural and Neural Approaches, John Wiley & Sons, 1992.

[Sch96] Jürgen Schürmann, Pattern Classification: A Unified View of Statistical and Neural Approaches, John Wiley & Sons, 1996.

[Zim91] H.-J. Zimmermann, Fuzzy Set Theory – and Its Applications, 2nd, Revised Edition, Kluwer
