Intelligent query agent for structural document databases
M.F. Jiang, S.S. Tseng*, C.J. Tsai
Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC
Abstract
Querying a database for document retrieval is often a process close to querying an answering expert system. In this work, we apply expert system techniques to the establishment of an intelligent query agent and regard the structural document database as the expertise that is the objective of knowledge acquisition. A new knowledge representation, named Structural Documents (SDs), is proposed as the basis of our model, and a transformation process from the raw data to the format of a database is applied. Based on the SDs, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. For implementation, an intelligent Chinese information retrieval system for personnel regulations, which integrates knowledge-based and full-text searching techniques, is proposed. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that the query process of users is simplified. © 1999 Published by Elsevier Science Ltd. All rights reserved.
Keywords: Information retrieval; Document database; Intelligent agent; Expert system; Hierarchy clustering
1. Introduction
Nowadays, database systems are useful in business and industrial environments due to their multiple applications. Many database systems are built for storing documents; these document databases deserve more attention. Recently, due to the rapid growth of the Internet and of hardware performance, research relating to digital libraries has become an important issue. A major portion of a digital library is electronic books, which have the advantage of saving a lot of retrieval time. There has been a recent trend to publish electronic books besides hard copy. Especially in professional fields, reference manuals are usually preserved as document databases in order to make them more convenient to query.
In a database system modeled as shown in Fig. 1, users formulate a query to retrieve documents, reformulate the query when they see the results, and so on, until they are satisfied with the answer (Navarro & Baeza-Yates, 1997). The query effectiveness therefore depends upon the user's knowledge of the query language.
In order to improve the convenience of querying a traditional database system, we wanted to design a query agent which can transform the user's demand into a query format, coordinate with the user's request and revise the query interactively. For example, the personnel regulations for civil servants in Taiwan are very complicated, and querying them requires a lot of access time. Building a query agent is one of the solutions to such problems. Moreover, we hope that the agent can be adapted to different document databases. Therefore, we built an intelligent query agent (IQA), which is illustrated in Fig. 2, to assist users in generating suitable queries and to adjust the queries according to the user's demand (Riecken, 1994).
Since querying a database for document retrieval is often a process close to querying an answering expert system (ES) (Celentano, Fugini, & Pozzi, 1995), in this work, we apply the ES techniques to the IQA establishment. In building IQA by the ES approach, we are concerned about the construction of the knowledge base, including the knowledge representation and the method of knowledge acquisition.
A document database consists of a large number of electronic books. In addition to the electronic form of the content stored in the database, some structural information, i.e. the chapter/section/paragraph hierarchy, may also be embedded in the database. Classical information retrieval usually allows little structuring (Navarro & Baeza-Yates, 1997), since it retrieves information only on data. However, the structural information is useful in querying the document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can therefore be regarded as the expertise and also as the objective of the knowledge acquisition.
Expert Systems with Applications 17 (1999) 105–113
0957-4174/99/$ - see front matter © 1999 Published by Elsevier Science Ltd. All rights reserved. PII: S0957-4174(99)00028-7
www.elsevier.com/locate/eswa
This work was supported by the National Science Council of the Republic of China under Grant No. NSC88-2213-E-009-078.
* Corresponding author. Tel: +886-3-5712121, ext. 56658; fax: +886-3-5721490.
In order to elicit the knowledge embedded in the document structure, we first propose a new knowledge representation, named Structural Documents (SD) (Jiang, Tseng, & Tsai, 1999), to be the basis of IQA. Second, we design a process that transforms the documents into a set of structural documents by merging two documents whose similarity is greater than a given threshold into one structural document. Based on this idea, we developed a clustering-based approach to construct the SD in this work. Moreover, based on the SD, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. The architecture of the IQA and of the whole structural document database system is also proposed.
As we know, a sound legal system and complete regulations are usually of great importance for government by law. At present, the personnel regulations for civil servants in Taiwan are very complicated. Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required.
Therefore, we design an intelligent Chinese information retrieval system for personnel regulations by integrating ES-based and full-text searching techniques. Our system may help users to retrieve the required personnel regulations. As mentioned above, a transformation process from the raw data to the format of a database is first applied. The embedded knowledge of the resulting data is then elicited by applying clustering techniques. In this way, the semantic indexes of the raw data can be established, and suitable results may be obtained. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that the query processes of users are simplified.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. The knowledge extraction process we propose is presented in Section 3. Section 4 presents the inference process of the IQA. Section 5 demonstrates the implementation of this approach. Finally, the conclusion and future work are discussed in Section 6.
2. Related works
An ES is a knowledge-based system that solves problems at the level of a human expert (Giarratano & Riley, 1993). It generally consists of three modules: a user interface, a knowledge base and an inference engine. The user interface helps users to communicate easily with the ES, the knowledge base contains the knowledge which forms the basis of inference, and the inference engine generates conclusions according to the knowledge base.
Generally, the process of building the knowledge base, which is called knowledge acquisition, consists of knowledge engineers interviewing experts. However, it may cost a lot of time, as the domain expert may not be familiar with computer techniques or the knowledge engineer may not be familiar with the domain knowledge (Hwang & Tseng, 1990). In order to decrease the intervention of experts and to smooth the knowledge acquisition, we established an adaptive process to transfer domain knowledge to the format of a knowledge base.
The approach undertaken in IQA is based on the assumption that the structure of reference books is a hierarchy. Let us take the book's structure, shown in Fig. 3, as an example.
In Fig. 3, there are nine chapters, where Chapter 1 consists of three sections and some sections may be composed of paragraphs. As most readers consult a book by following its tree-like hierarchy, the book's structure is likely to influence how queries are constructed. Intuitively, documents in the same chapter are likely to be more similar to each other than to documents in other chapters. Similarly, within a chapter, documents in the same section are likely to be more similar to each other than to documents in other sections. So, the information of the structural hierarchy may be useful in retrieving from document databases.
Besides, the semantic meaning of the documents in the book is expressed in the contents, including words, phrases, etc. Therefore, we propose a new knowledge representation mixing the contents and the structure of the document database in building the IQA.
Fig. 1. The model of the database system.
Fig. 2. An intelligent query agent for the document database system.
3. System structure
In this section, we will first introduce our system model and then the major modules of the system will be described in detail.
3.1. System overview
Fig. 4 shows the overview of the system, which consists of three components: IQA; query transformer (QT) module and knowledge extractor (KE) module; and a traditional database system. IQA and QT are used to transform the user’s demand into an accessible query for the database system, and the KE module is used to extract the knowledge from the database and then store the extracted knowledge into the knowledge base of IQA.
Based on the knowledge stored in the format of SD, which is the new knowledge representation proposed in this work, the IQA module infers and revises the suitable query for user’s demand, and the QT module transforms the result of IQA into an accessible query following the query syntax supported by the database system.
3.2. IQA and KE modules
The IQA module consists of a user interface, an inference engine and a knowledge base. The user interface helps users communicate easily with the IQA using a web-based approach. The inference engine derives the results from the knowledge; the process is addressed in Section 4. The knowledge base stores the knowledge for inference using a new knowledge representation called SD, which is formally defined in Definition 3.1. As mentioned above, the knowledge is extracted from the database by the KE module. The flow of knowledge extraction is shown in Fig. 5. The content of the document database is first transformed into sets of words by partition and transformation. In the same step, the index of the book is used as a knowledge source to identify the set of keywords for each document. By computing the similarity of each pair of documents, the similarity matrix for the documents is obtained and used as the basis for the subsequent clustering, which finds similar pairs of documents. After clustering, the hierarchy of structural documents can be obtained.
In this figure, two kinds of existing resources, the index and the table of contents of the books, are used as the knowledge source in the KE module. The three procedures in the KE module are described in detail as follows.
3.2.1. Partition and transformation
Fig. 4. Overview of the system.
The KE module is capable of processing English or Chinese document databases. As there is no obvious word boundary in Chinese text, a process for identifying each possible disyllabic word in the target database is needed. After the disyllabic words are identified from a sentence by applying the association measure, each document can be transferred to a set of keywords.
That is, for two different Chinese characters, $c_i$ and $c_j$, the association (Liang, 1995; Sproat and Shih, 1990) of the two characters can be computed by:
$$\log_2 \frac{\mathrm{freq}(c_i c_j)/n}{\left(\mathrm{freq}(c_i)/n\right)\left(\mathrm{freq}(c_j)/n\right)},$$
where the value of the function freq( ) is the occurrence frequency of any character or of two successive characters, and n is the number of all the characters in the target database. Since the association formula is originally designed for a large corpus, the obtained associations may not be good enough to identify words. Therefore, further manual tuning may be required.
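As an illustration of the association measure above, the following minimal Python sketch counts single characters and successive character pairs in a text and evaluates the formula. Treating the whole target database as one string, the function name `association`, and returning negative infinity for pairs that never occur are assumptions made for this sketch, not details given in this work.

```python
import math
from collections import Counter


def association(text, ci, cj):
    """Association of two successive characters ci, cj in `text`.

    Computes log2((freq(ci cj)/n) / ((freq(ci)/n) * (freq(cj)/n))),
    where n is the number of characters in the text.
    """
    n = len(text)
    singles = Counter(text)                                   # freq of each character
    pairs = Counter(text[k:k + 2] for k in range(n - 1))      # freq of successive pairs
    if not (singles[ci] and singles[cj] and pairs[ci + cj]):
        # Assumption of this sketch: an unseen pair gets the lowest possible score.
        return float("-inf")
    p_pair = pairs[ci + cj] / n
    return math.log2(p_pair / ((singles[ci] / n) * (singles[cj] / n)))
```

Pairs whose association exceeds some manually tuned cut-off would then be kept as candidate disyllabic words.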
After the disyllabic words are identified from the sentence by applying the association measure, each document can be transferred to a set of words. For example, the sentence [...] may be segmented into [...] and [...]. It should be noted here that there is no need to identify words in English text, and so the above segmentation method should be skipped for English text.
In our approach, the indexes of the book are also used to identify the set of keywords for each document. After the execution of the above two steps, each document is transformed into a set of keywords and words. The size of both sets varies depending on the document itself, because the words are identified by the association measure and the keywords are identified by comparing with the index of the reference books.
3.2.2. Similarity measure
To measure the similarity between two documents, we use the following heuristics. The similarity between two documents in the same chapter is higher than that between two documents in different chapters, and the similarity between two documents in the same section is higher than that between two documents in different sections. Without loss of generality, we assume the whole reference book to be divided into a three-tier hierarchy of chapter, section and paragraph. Based upon these heuristics, the Hierarchy Dependence (HD) between two documents can be easily computed by the following procedure:
Step 1: If two documents are not in the same chapter, HD ← 0 and stop.
Step 2: If two documents are not in the same section, HD ← 1/s, where s is the total number of sections in this chapter, and stop.
Step 3: If two documents are not in the same paragraph, HD ← 0.5 + 1/p, where p is the total number of paragraphs in this section, and stop.
Step 4: HD ← 1.
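The four steps above can be sketched in Python as follows. The representation of a document's position as a `(chapter, section, paragraph)` tuple and the function and parameter names are assumptions of this sketch; the value in Step 3 is read as 0.5 + 1/p.

```python
def hierarchy_dependence(loc_a, loc_b, sections_in_chapter, paragraphs_in_section):
    """Hierarchy Dependence (HD) of two documents.

    loc_a, loc_b: hypothetical (chapter, section, paragraph) positions.
    sections_in_chapter: s, number of sections in the shared chapter.
    paragraphs_in_section: p, number of paragraphs in the shared section.
    """
    if loc_a[0] != loc_b[0]:            # Step 1: different chapters
        return 0.0
    if loc_a[1] != loc_b[1]:            # Step 2: same chapter, different sections
        return 1.0 / sections_in_chapter
    if loc_a[2] != loc_b[2]:            # Step 3: same section, different paragraphs
        return 0.5 + 1.0 / paragraphs_in_section
    return 1.0                          # Step 4: same paragraph
```

For instance, two documents in different sections of a four-section chapter would receive HD = 1/4 under this reading.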
Let the two documents be denoted as $D_i$ and $D_j$. In the KE module, the similarity of $D_i$ and $D_j$, denoted by $S(i, j)$, is computed by the following formula:
$$S(i, j) = (1 - \delta) \times \mathrm{same}(i, j) + \delta \times HD(i, j),$$
where $\mathrm{same}(i, j)$ is the number of words and keywords which appear in both documents $D_i$ and $D_j$, normalized by dividing by the total number of keywords in documents $D_i$ and $D_j$; $HD(i, j)$ is the hierarchy dependence of documents $D_i$ and $D_j$; and $\delta$ is an adaptive weight value with $0 \le \delta \le 1$.
The default value of $\delta$ is 0.5, which can be adjusted according to the number of chapters of a given book. For example, when the number of chapters is close to 1 or to n for a book divided into n documents, we set $\delta$ to a value close to 0, as the structure then carries little meaning.
Example 3.1. Let $D_i = \{\ \}$, $|D_i| = 4$, and $D_j = \{\ \}$, $|D_j| = 3$. Therefore, the number of the words which appear in both documents $D_i$ and $D_j$ is 2, and we have $\mathrm{same}(i, j) = |D_i \cap D_j| / (|D_i| + |D_j|) = 2/(4 + 3)$.
Example 3.2. Let $D_i$ = {'intelligent', 'query', 'agent'}, $|D_i| = 3$, and $D_j$ = {'database', 'query'}, $|D_j| = 2$. Therefore, the number of the words which appear in both documents $D_i$ and $D_j$ is 1, and we have $\mathrm{same}(i, j) = 1/(3 + 2) = 0.2$.
By computing the similarity measure of every pair of different documents $D_i$ and $D_j$, a similarity matrix $[S_{ij}]$ can be formed by letting $S_{ij} = S(i, j)$. To simplify our further discussion, we assign $S_{ij} = 0$ for $i \ge j$. The matrix thus becomes an upper triangular matrix whose diagonal elements are 0's.
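The similarity measure and the upper triangular similarity matrix can be sketched in Python as below. Documents are modelled as Python sets of words and keywords, and the hierarchy dependence is supplied by a caller-provided function `hd(i, j)`; both choices, and all function names, are assumptions of this sketch.

```python
def same(di, dj):
    """Shared words and keywords, normalized by |Di| + |Dj|."""
    total = len(di) + len(dj)
    return len(di & dj) / total if total else 0.0


def similarity(di, dj, hd_value, delta=0.5):
    """S(i, j) = (1 - delta) * same(i, j) + delta * HD(i, j)."""
    return (1 - delta) * same(di, dj) + delta * hd_value


def similarity_matrix(docs, hd, delta=0.5):
    """Upper triangular matrix [S_ij]; entries with i >= j are left at 0."""
    n = len(docs)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = similarity(docs[i], docs[j], hd(i, j), delta)
    return S
```

With the documents of Example 3.2, `same({'intelligent', 'query', 'agent'}, {'database', 'query'})` evaluates to 1/5 = 0.2.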
3.2.3. Clustering and structuring documents
Clustering is an important step in the KE module for the purpose of structuring documents. To transfer the documents into a set of structural documents, the algorithm for similarity matrix updating in hierarchy clustering (Jain & Dubes, 1988) is used. In hierarchy clustering, several approaches to measuring the similarity of clusters are considered, including the single-link and the complete-link methods. It seems that the clustered hierarchy generated by the complete-link method is more balanced than the one generated by the single-link method. In this work, the complete-link method is implemented. To represent the knowledge in the clustering process, the definition and notation of SD are formally defined.
Definition 3.1. The SD with level l is defined recursively as follows:
• A document $D_i$ is a SD with level 0, denoted as $SD_i$, which can also be denoted as $(_{0}D_i)_{0}$.
• A pair $(_{l}SD_i, SD_j)_{l}$ of SDs whose similarity measure, which is greater than a threshold $\theta$, is the maximum among all the different pairs is also a SD with level l, where $l = \mathrm{Maximum}(m, n) + 1$, m is the level number of $SD_i$ and n is the level number of $SD_j$.
Definition 3.2. The similarity $S'$ between the structural documents $SD_i$ and $SD_j$ is defined as follows:
1. If $SD_i = D_i$ and $SD_j = D_j$, the similarity $S'(SD_i, SD_j) = S(i, j)$.
2. The similarity between $(SD_i, SD_j)$ and $SD_k$ is defined as $S'(SD_k, (SD_i, SD_j)) = Q\{S'(SD_k, SD_i), S'(SD_k, SD_j)\}$, where the operator Q means 'minimum' for the complete-link method, or 'maximum' for the single-link method.
Assume we have n documents; an $n \times n$ similarity matrix can then be generated by our method. First, each document is assigned to be a SD, i.e. $SD_i = D_i$ for document $D_i$. Moreover, the similarity matrix for documents is transferred to the initial similarity matrix for SDs. Each element of the similarity matrix $[S'_{ij}]$ is the similarity measure of the structural documents $SD_i$ and $SD_j$, i.e. $S'_{ij} = S_{ij}$. Let $C = \{SD_i\}$ be the set of all $SD_i$. Based upon Johnson's algorithms (1967) for hierarchy clustering (Jain & Dubes, 1988), we propose the following procedure for updating the similarity matrix.
Algorithm 3.1.
Step 1: Find the most similar pair of structural documents in the current similarity matrix, say pair $\{p, q\}$, where $S'_{p,q} = \mathrm{Maximum}\{S'_{i,j}$, for any $i, j\}$.
Step 2: Merge structural documents $SD_p$ and $SD_q$ into a new single structural document $(SD_p, SD_q)$.
Step 3: Delete the structural documents $SD_p$ and $SD_q$ from the set C, and insert the new structural document $(SD_p, SD_q)$ into the set, i.e. $C \leftarrow C - \{SD_p\} - \{SD_q\} + \{(_{l}SD_p, SD_q)_{l}\}$, where $l = \mathrm{Maximum}(m, n) + 1$, m is the level number of $SD_p$ and n is the level number of $SD_q$.
Step 4: Update the similarity matrix by deleting the rows and columns related to structural documents $SD_p$ and $SD_q$, and adding a row and a column corresponding to the new structural document $(SD_p, SD_q)$.
Step 5: If there are no two structural documents with similarity greater than $\theta$, stop. Otherwise, go to Step 1.
After clustering, the hierarchy of structural documents can be obtained according to the set C.
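A compact Python sketch of this clustering loop is given below. It is an illustration under stated assumptions rather than the implementation used in this work: SDs are modelled as nested tuples of 0-based document indices, the similarity between two SDs is recomputed from the original matrix (the minimum over all document pairs equals the recursive complete-link definition) instead of updating rows and columns in place, and the threshold is tested before each merge. The names `leaves`, `linkage`, `cluster` and `S_EXAMPLE` are hypothetical.

```python
from itertools import combinations


def leaves(sd):
    """Document indices contained in a (possibly nested) structural document."""
    if isinstance(sd, tuple):
        return leaves(sd[0]) + leaves(sd[1])
    return [sd]


def linkage(a, b, S, link=min):
    """S'(a, b): min over document pairs (complete-link) or max (single-link)."""
    return link(S[min(i, j)][max(i, j)] for i in leaves(a) for j in leaves(b))


def cluster(S, theta, link=min):
    """Merge the most similar pair of SDs until no pair exceeds theta."""
    sds = list(range(len(S)))                  # every document starts as a level-0 SD
    while len(sds) > 1:
        # Step 1: most similar pair (p < q) in the current set.
        p, q = max(combinations(range(len(sds)), 2),
                   key=lambda pq: linkage(sds[pq[0]], sds[pq[1]], S, link))
        if linkage(sds[p], sds[q], S, link) <= theta:
            break                              # Step 5: nothing above the threshold
        sds[p] = (sds[p], sds[q])              # Steps 2-3: merged SD replaces SD_p
        del sds[q]                             # ... and SD_q is removed
    return sds


# Upper triangular similarity matrix of six documents (same values as Section 3.3).
S_EXAMPLE = [
    [0, 0.8, 0.4, 0.5, 0.4, 0.1],
    [0, 0,   0.5, 0.6, 0.3, 0.2],
    [0, 0,   0,   0.7, 0.8, 0.1],
    [0, 0,   0,   0,   0.9, 0.3],
    [0, 0,   0,   0,   0,   0.2],
    [0, 0,   0,   0,   0,   0],
]
```

Calling `cluster(S_EXAMPLE, 0.5)` stops after three merges, while a threshold of 0 lets the loop run until a single nested SD remains.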
3.3. An example
Let $\theta = 0.1$, and assume we have six documents, $\{D_1, D_2, D_3, D_4, D_5, D_6\}$, with the similarity matrix:
$$
\begin{array}{c|cccccc}
 & D_1 & D_2 & D_3 & D_4 & D_5 & D_6 \\ \hline
D_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
D_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
D_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
D_4 &   &     &     & 0   & 0.9 & 0.3 \\
D_5 &   &     &     &     & 0   & 0.2 \\
D_6 &   &     &     &     &     & 0
\end{array}
$$
Before clustering, each document is assigned to be a structural document, i.e. $SD_i = (_{0}D_i)_{0}$ for document $D_i$. The set of all $SD_i$ is written as $C = \{SD_1, SD_2, SD_3, SD_4, SD_5, SD_6\}$, and the initial similarity matrix for SDs is generated as:
$$
\begin{array}{c|cccccc}
 & SD_1 & SD_2 & SD_3 & SD_4 & SD_5 & SD_6 \\ \hline
SD_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
SD_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
SD_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
SD_4 &   &     &     & 0   & 0.9 & 0.3 \\
SD_5 &   &     &     &     & 0   & 0.2 \\
SD_6 &   &     &     &     &     & 0
\end{array}
$$
In the first iteration, the structural documents $SD_4$ and $SD_5$ are merged into $(SD_4, SD_5)$, as the element in the 4th row and the 5th column is the maximum. The set of all $SD_i$ is written as $C = \{SD_1, SD_2, SD_3, (_{1}SD_4, SD_5)_{1}, SD_6\}$. After the similarity matrix is updated in Step 4, we get a new similarity matrix:
$$
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & 4,5 & 6 \\ \hline
1   & 0 & 0.8 & 0.4 & 0.4 & 0.1 \\
2   &   & 0   & 0.5 & 0.3 & 0.2 \\
3   &   &     & 0   & 0.7 & 0.1 \\
4,5 &   &     &     & 0   & 0.2 \\
6   &   &     &     &     & 0
\end{array}
$$
In the second iteration, we merge the structural documents $SD_1$ and $SD_2$ into $(_{1}SD_1, SD_2)_{1}$, as the element in the 1st row and 2nd column is the maximum. We have the new $C = \{(_{1}SD_1, SD_2)_{1}, SD_3, (_{1}SD_4, SD_5)_{1}, SD_6\}$. The similarity matrix is transformed to:
$$
\begin{array}{c|cccc}
 & 1,2 & 3 & 4,5 & 6 \\ \hline
1,2 & 0 & 0.4 & 0.3 & 0.1 \\
3   &   & 0   & 0.7 & 0.1 \\
4,5 &   &     & 0   & 0.2 \\
6   &   &     &     & 0
\end{array}
$$
Similarly, we merge the structural documents $SD_3$ and $(_{1}SD_4, SD_5)_{1}$ into $(_{2}SD_3, (_{1}SD_4, SD_5)_{1})_{2}$ in the third iteration, since the element in the 2nd row and 3rd column is the maximum. We have the new $C = \{(_{1}SD_1, SD_2)_{1}, (_{2}SD_3, (_{1}SD_4, SD_5)_{1})_{2}, SD_6\}$ and the new similarity matrix is transformed to:
$$
\begin{array}{c|ccc}
 & 1,2 & 3,4,5 & 6 \\ \hline
1,2   & 0 & 0.3 & 0.1 \\
3,4,5 &   & 0   & 0.1 \\
6     &   &     & 0
\end{array}
$$
In the last two iterations, we have the new $C = \{(_{3}(_{1}SD_1, SD_2)_{1}, (_{2}SD_3, (_{1}SD_4, SD_5)_{1})_{2})_{3}, SD_6\}$ and $C = \{(_{4}(_{3}(_{1}SD_1, SD_2)_{1}, (_{2}SD_3, (_{1}SD_4, SD_5)_{1})_{2})_{3}, SD_6)_{4}\}$. According to the set C, the hierarchy of SDs can be obtained as shown in Fig. 6.
4. Inference process based on the structural documents
After the knowledge extracting process described in the previous section, the set C of SDs has been built for retrieving documents. Based on the SD, more suitable results can be rapidly inferred by an inference engine. The flow of the inference process is described in Fig. 7.
The query result R is generated by some search engine. However, in most cases the result does not satisfy the user's demand very well: it contains either too many or too few documents. Two algorithms are designed to solve the problem. In this section, the details of the inference process are illustrated. First, we state the notations and definitions for the following algorithms.
To easily represent a set of documents, a data structure bit_map is defined as:
$$\mathrm{bit\_map} = b_1 b_2 \ldots b_n,$$
where $b_k = 1$ if document $D_k$ belongs to the set of documents, and $b_k = 0$ otherwise.
Based on the data structure bit_map, a given query result, i.e. a set of documents generated by some search engine, can be written as:
$$R = b_1 b_2 \ldots b_n,$$
such that $b_k = 1$ if document $D_k$ belongs to the query result, and $b_k = 0$ otherwise.
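A bit_map can be illustrated with a short Python sketch that stores the bits as a string of '0' and '1' characters. The string representation and the helper names `to_bit_map` and `from_bit_map` are assumptions of this sketch; a machine word or bit array would serve equally well.

```python
def to_bit_map(doc_numbers, n):
    """Bit string b1 b2 ... bn with bk = '1' iff document Dk is in the set."""
    return "".join("1" if k in doc_numbers else "0" for k in range(1, n + 1))


def from_bit_map(bits):
    """Inverse mapping: the set of document numbers whose bit is '1'."""
    return {k for k, b in enumerate(bits, start=1) if b == "1"}
```

For six documents, the query result {D2, D3, D4} is written as `to_bit_map({2, 3, 4}, 6)`, i.e. the string 011100.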
In the format of SDs, a pair of structural documents $(SD_i, SD_j)$ means that the similarity between $SD_i$ and $SD_j$ is greater than their similarity to the other SDs. For any structural document $SD_j$, the most similar structural document $SD_k$ can be found once the structural document $SD_i$ that is the pair $(SD_j, SD_k)$ has been found. To describe this idea, the definitions of immediate successor and successor are given as follows.
Definition 4.1. A structural document $SD_i$ is said to be an immediate successor of a structural document $SD_j$ if $SD_i = (_{l}SD_j, SD_k)_{l}$ or $SD_i = (_{l}SD_k, SD_j)_{l}$ for some structural document $SD_k$ and level l.
Definition 4.2. A structural document $SD_i$ is said to be a successor of a structural document $SD_j$ if there exists an immediate successor $SD_k$ of $SD_j$ such that $SD_i$ is also a successor of $SD_k$, or if $SD_i$ is an immediate successor of $SD_j$.
The key process, finding the immediate successor, directly scans the set C of structural documents and is described in the following algorithm.
Fig. 6. The hierarchy of structural documents according to the clustering result.
Algorithm 4.1 (Find_immediate_successor).
Input: C, $SD_i$
Output: The immediate successor of $SD_i$
Step 1: Find $SD_i$ in the set C.
Step 2: Check the next element of $SD_i$; two cases are possible.
Case 1: The next element of $SD_i$ is ",". Find the structural document $SD_j$, for some $j > 0$, in the next element of "$SD_i$," and return $(_{k}SD_i, SD_j)_{k}$, for $k = \mathrm{MAX}(i, j) + 1$.
Case 2: The next element of $SD_i$ is "$)_l$" for some l. Find "$(_l$" in the preceding element of $SD_i$ and return $(_{l}SD_j, SD_i)_{l}$, for some structural document $SD_j$.
In Step 2 of Algorithm 4.1, two cases are possible. In Case 1, the output is $(SD_i, SD_j)$ for some $SD_j \in C$, and in Case 2, the output is $(SD_j, SD_i)$ for some $SD_j \in C$. By Definitions 3.1 and 4.1, it can be easily seen that the output is the immediate successor of $SD_i$.
Now, a function bit_set is defined to convert the structural document $SD_i$ into a string of bits, such that the kth bit is 1 if the document $D_k$ belongs to $SD_i$, and is 0 otherwise.
Example 4.1. Assume the number of all documents is 5; then bit_set$((D_1, D_2, D_3))$ = "11100".
The ith cover for all n documents is defined, following the definition of bit_map, as:
$$C_i = b_1 b_2 \ldots b_n$$
Algorithm 4.2.
Input: Any structural document $SD_i$, number k
Output: $C_k, C_{k+1}, \ldots, C_{k+m}$, where the number m is the minimum integer such that $R \subseteq C_{k+m}$
Procedure Cover($k$, $SD_i$)
  $SD_j$ ← Find_immediate_successor(C, $SD_i$)
  $C_k$ ← bit_set($SD_j$)
  If NOT_EQUAL(R, AND(R, $C_k$)) Then Call Cover($k + 1$, $SD_j$)
End_Procedure
Example 4.2. Assume we have six documents, and the hierarchy of SDs is as shown in Fig. 6, with $C = \{(((SD_1, SD_2), (SD_3, (SD_4, SD_5))), SD_6)\}$. Now, for a user's query, the database returns the result $\{D_2, D_3, D_4\}$ and $R = 011100$. When Cover$(2, D_3)$ is called, we get $C_2 = 001110$ and $C_3 = 111110$.
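The interplay of bit_set, Find_immediate_successor and Cover can be sketched in Python as follows. The hierarchy is modelled as a nested tuple of 1-based document numbers, and the successor is found by a recursive search rather than by scanning a parenthesised string; the loop also replaces the recursion of Algorithm 4.2. These representation choices and the names `cover` and `HIERARCHY` are assumptions of this sketch.

```python
def bit_set(sd, n):
    """Bit string of length n; bit k is '1' iff document Dk belongs to sd."""
    bits = ["0"] * n
    stack = [sd]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            stack.extend(node)          # descend into both members of the pair
        else:
            bits[node - 1] = "1"        # a single document number
    return "".join(bits)


def find_immediate_successor(root, sd):
    """The pair inside `root` that has `sd` as one of its two members, or None."""
    if not isinstance(root, tuple):
        return None
    if sd in root:
        return root
    for child in root:
        found = find_immediate_successor(child, sd)
        if found is not None:
            return found
    return None


def cover(root, sd, r, n):
    """Covers C_k, C_{k+1}, ... of sd, stopping at the first one that contains r."""
    covers = []
    while True:
        sd = find_immediate_successor(root, sd)
        if sd is None:
            break                       # reached the top of the hierarchy
        c = bit_set(sd, n)
        covers.append(c)
        # Same test as NOT_EQUAL(R, AND(R, C_k)) being false: every bit of r is in c.
        if all(cb == "1" for rb, cb in zip(r, c) if rb == "1"):
            break
    return covers


# Hierarchy of Fig. 6 as a nested tuple of document numbers.
HIERARCHY = (((1, 2), (3, (4, 5))), 6)
```

With `HIERARCHY` and R = 011100, `cover(HIERARCHY, 3, "011100", 6)` yields the two covers 001110 and 111110 of Example 4.2.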
Algorithm 4.3. /* The users want to get a set of documents which can be deleted from the query result. */
Input: The query result R, C
Output: The set of documents which can be deleted from the query result.
Procedure Less
  Select a $D_i$ according to R
  $SD_j$ ← $D_i$
  $C_1$ ← $\{D_i\}$
  Call Cover(2, $SD_j$)
  Case COUNT_ONE(AND(R, $C_{k-1}$)) − COUNT_ONE(AND(R, SUB($C_k$, $C_{k-1}$)))
    < 0: RETURN AND(R, $C_{k-1}$)
    > 0: RETURN AND(R, SUB($C_k$, $C_{k-1}$))
    = 0: RETURN AND(R, $C_{k-1}$) or AND(R, SUB($C_k$, $C_{k-1}$)), determined by user
  End_Case
End_Procedure
Fig. 8. The architecture of CPRCS.
Algorithm 4.4. /* The users want to get a set of documents which could offer more information. */
Input: The query result R, C
Output: A set of documents which could offer more information.
Procedure More
  Select a $D_i$ according to R
  $SD_j$ ← $D_i$
  $C_1$ ← $\{D_i\}$
  Call Cover(2, $SD_j$)
  While COUNT_ONE(AND(R, $C_k$)) ≠ 0 do
    Case COUNT_ONE(AND(R, $C_{k-1}$)) − COUNT_ONE(AND(R, SUB($C_k$, $C_{k-1}$)))
      < 0: R ← AND(R, SUB($C_k$, $C_{k-1}$)) and Call More
      > 0: R ← SUB(R, AND(R, SUB($C_k$, $C_{k-1}$))), k ← k − 1
      = 0: RETURN AND(R, $C_{k-1}$) or AND(R, SUB($C_k$, $C_{k-1}$)), determined by user, and stop.
    End_Case
    RETURN $C_k$
  End_While
End_Procedure
Example 4.3. Assume we have six documents, and the hierarchy of structural documents is as shown in Fig. 6, with $C = \{(((SD_1, SD_2), (SD_3, (SD_4, SD_5))), SD_6)\}$. Now, for a user's query, the database returns the result $\{D_3, D_4, D_6\}$. For this result, we have the corresponding structural documents $\{SD_3, SD_4, SD_6\}$ generated by the IQA. When the user finds the result insufficient, the IQA may generate the set $\{SD_3, (SD_4, SD_5), SD_6\}$ and ask the database to return document $\{D_5\}$. When the user finds the result insufficient again, the IQA may generate the set $\{((SD_1, SD_2), (SD_3, (SD_4, SD_5))), SD_6\}$ and the documents $\{D_1, D_2\}$ are returned.
On the other hand, if the user feels that the result $\{D_3, D_4, D_6\}$ contains too many documents, the IQA generates the set $\{SD_3, SD_4\}$ and deletes the document $\{D_6\}$ from the query result. If the user again feels that the result contains too many documents, the IQA generates the set $\{SD_4\}$ and deletes the document $\{D_3\}$ from the query result.
Example 4.4. For another user's query, the database returns the result $\{D_3, D_4, D_5\}$. If the user feels that the result contains too many documents, the IQA generates the set $\{SD_4, SD_5\}$ and deletes the document $\{D_3\}$ from the query result.
Example 4.5. For another user's query, the database returns the result $\{D_2, D_3, D_4\}$. If the user feels that the result contains too many documents, the IQA generates the set $\{SD_3, SD_4\}$ and deletes the document $\{D_2\}$ from the query result.
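The decision taken by Procedure Less in these examples can be traced with the following Python sketch. It assumes that the covers $C_1, \ldots, C_k$ have already been computed and are passed in as bit strings, that AND, SUB and COUNT_ONE act bitwise on those strings, and that a tie is reported by returning both candidates so that the user can choose; the function name `less` and this return convention are assumptions of the sketch.

```python
def less(r, covers):
    """Candidate sets of documents that can be deleted from the result r.

    covers: [C_1, ..., C_k] as bit strings, where C_1 marks the selected
    document and C_k is the first cover that contains r (k >= 2).
    Returns one bit string, or two when the user has to decide.
    """
    def and_(a, b):
        return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

    def sub(a, b):
        return "".join("1" if x == "1" and y == "0" else "0" for x, y in zip(a, b))

    inner = and_(r, covers[-2])                        # AND(R, C_{k-1})
    outer = and_(r, sub(covers[-1], covers[-2]))       # AND(R, SUB(C_k, C_{k-1}))
    diff = inner.count("1") - outer.count("1")         # difference of COUNT_ONE values
    if diff < 0:
        return [inner]
    if diff > 0:
        return [outer]
    return [inner, outer]                              # tie: determined by the user
```

For Example 4.5, selecting $D_3$ gives the covers 001000, 001110 and 111110, and `less("011100", ["001000", "001110", "111110"])` returns the single candidate 010000, i.e. document $D_2$.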
5. Implementation
As we know, a sound legal system and complete regulations are usually of great importance for government by law. At present, the personnel regulations (PR) for civil servants in Taiwan are very complicated. Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required. Therefore, improving the methods of inquiry and annotation of the personnel regulations has become an important issue. Besides, by comparing the experimental results of the traditional database model and of our new model, we may verify the performance of the model based on SDs for the PR document databases. We implemented a database prototype, named CPRCS (Chinese Personnel Regulations Consultation System), for PR documents and used the IQA to assist the querying process. The CPRCS architecture follows the system structure of Fig. 4 and is shown in Fig. 8.
In CPRCS, users operate the system through a web page interface, which can be easily promoted and extended. Two different modes, with or without IQA, are provided for users to choose from. By observing the operating process of users, we found that the query process of users is simplified. Fig. 9 is the main window of the system.
Fig. 10. The amount of structural documents of similarity. For instance, the amount of structural documents is 224, when the similarity between any two documents in the same SD is greater than 0.7.
Furthermore, the relation between the chapter hierarchy and the number of SDs is explained in the following experiment. There are 365 documents in the target database for the experiment. All the documents are divided into 11 chapters and the total number of sections is 171. After the processing of the KE module, the number of SDs at different similarity values is shown in Fig. 10.
The figure shows the merging situation in the clustering process. When the similarity value is between 0 and 0.2, the number of structural documents is about 11, which is the number of chapters. Similarly, when the similarity value is between 0.4 and 0.7, the number of structural documents is about 171, which is the total number of sections. This indicates that the hierarchy of structural documents is similar to the chapter/section hierarchy of the book. The structural information of the documents can thus be acquired from the database using the KE module.
6. Conclusion
Classical information retrieval usually allows little structuring. However, the structural information is useful in querying the document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can be regarded as the expertise and also as the objective of the knowledge acquisition. Since querying a database for document retrieval is often a process close to querying an answering expert system, in this work, we apply the ES techniques to the IQA establishment. In building IQA by the ES approach, we are concerned about the construction of the knowledge base, including the knowledge representation and the method of knowledge acquisition. Therefore, a new knowledge representation, named SDs, is defined to construct the acquisition model for IQA, and the KE model is proposed to transform the data of the database into the knowledge stored in IQA. To evaluate the convenience of IQA, an intelligent retrieval system, CPRCS, is implemented. By observing the operating process of users, we found that the query processes of users are simplified. Besides, the experimental result has shown that the structural information of the documents can be acquired from the database using the KE module.
Future research will focus on several areas. First, better similarity measurements are necessary for increasing the performance of clustering. The analysis of the influence of different clustering methods is not covered in this work. The general formula proposed by Jain and Dubes (1988) includes most of the common hierarchical clustering methods and is the basis for future work. Another significant focus is to extend the model based on SD, with the goal of allowing different kinds of documents to be merged into the same knowledge structure in order to increase the practicability of the system.
References
Celentano, A., Fugini, M. G., & Pozzi, S. (1995). Knowledge-based document retrieval in office environments: the Kabiria system. ACM Transactions on Information Systems, 13 (3), 237–268.
Giarratano, J., & Riley, G. (1993). Expert systems, 2. PWS Publishing Company.
Hwang, G. J., & Tseng, S. S. (1990). EMCUD: A knowledge acquisition method which captures embedded meanings under uncertainty. International Journal of Man Machine Studies, 33, 431–451.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data, Englewood Cliffs, NJ: Prentice Hall, pp. 58–86.
Jiang, M. F., Tseng, S. S., & Tsai, C. J. (1999). Discovering structure from document databases, The Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD-99, Beijing, China.
Liang, T. (1995). The Study of Character-based Signature Methods in Chinese Text Retrieval, PhD thesis, National Chiao Tung University, Taiwan.
Navarro, G., & Baeza-Yates, R. (1997). Proximal nodes: a model to query document database by content and structure. ACM Transactions on Information Systems, 15 (4), 400–435.
Riecken, D. (1994). Intelligent agent. Communications of ACM, 37 (7), 18–21.
Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Proceedings of Chinese and Oriental