
Integrating knowledge flow mining and collaborative filtering to support document recommendation

Chin-Hui Lai, Duen-Ren Liu *

Institute of Information Management, National Chiao Tung University, 1001 Ta Hseuh Rd., Hsinchu 300, Hsinchu, Taiwan

Article info

Article history:
Received 11 June 2008
Received in revised form 29 June 2009
Accepted 29 June 2009
Available online 2 July 2009

Keywords: Knowledge flow; Knowledge flow mining; Knowledge sharing; Document recommendation; Collaborative filtering; Sequential rule mining; Recommender system

Abstract

Knowledge is a critical resource that organizations use to gain and maintain competitive advantages. In the constantly changing business environment, organizations must exploit effective and efficient methods of preserving, sharing and reusing knowledge in order to help knowledge workers find task-relevant information. Hence, an important issue is how to discover and model the knowledge flow (KF) of workers from their historical work records. The objectives of a knowledge flow model are to understand knowledge workers’ task-needs and the ways they reference documents, and then provide adaptive knowledge support. This work proposes hybrid recommendation methods based on the knowledge flow model, which integrates KF mining, sequential rule mining and collaborative filtering techniques to recommend codified knowledge. These KF-based recommendation methods involve two phases: a KF mining phase and a KF-based recommendation phase. The KF mining phase identifies each worker’s knowledge flow by analyzing his/her knowledge referencing behavior (information needs), while the KF-based recommendation phase utilizes the proposed hybrid methods to proactively provide relevant codified knowledge for the worker. Therefore, the proposed methods use workers’ preferences for codified knowledge as well as their knowledge referencing behavior to predict their topics of interest and recommend task-related knowledge. Using data collected from a research institute laboratory, experiments are conducted to evaluate the performance of the proposed hybrid methods and compare them with the traditional CF method. The results of experiments demonstrate that utilizing the document preferences and knowledge referencing behavior of workers can effectively improve the quality of recommendations and facilitate efficient knowledge sharing.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

Organizational knowledge can be used to create core competitive advantages and achieve commercial success in a constantly changing business environment. Hence, organizations need to adopt appropriate strategies to preserve, share and reuse such a valuable asset, as well as to support knowledge workers effectively (Nonaka and Takeuchi, 1995; Polanyi, 1966). Knowledge and expertise are generally codified in textual documents, e.g., papers, manuals and reports, and preserved in a knowledge database. This codified knowledge is then circulated in an organization to support workers engaged in management and operational activities (Brown and Duguid, 2002). Because most of these activities are knowledge-intensive tasks, the effectiveness of knowledge management depends on providing task-relevant documents to meet the information needs of knowledge workers.

In task-based business environments, knowledge management systems (KMSs) can facilitate the preservation, reuse and sharing of knowledge. Moreover, workers may need to obtain task-relevant knowledge to complete a knowledge-intensive task by referencing codified knowledge (documents). For example, based on a task’s specifications and the process-context of the task, the KnowMore system (Abecker et al., 2000a,b) provides context-aware knowledge retrieval and delivery to support workers’ procedural activities. The task-based K-support system (Liu et al., 2005; Wu et al., 2005) adaptively provides knowledge support to meet a worker’s dynamic information needs by analyzing his/her access behavior or relevance feedback on documents. To help knowledge workers complete multiple tasks, TaskTracer (Dragunov et al., 2005) was developed to monitor workers’ activities and help them rapidly locate and reuse processes employed previously. However, previous research on task-based knowledge support did not analyze and utilize the flow of knowledge among various types of codified knowledge (documents) to provide effective recommendations about task-relevant documents.

Knowledge flow (KF) research focuses on how KF can transmit, share, and accumulate knowledge when it passes from one team member/process to another. In a workflow situation, work knowledge may flow among workers in an organization, while process knowledge may flow among various tasks (Zhuge, 2002, 2006b; Zhuge and Guo, 2007). Thus, KF reflects the level of knowledge cooperation between workers or processes and influences the effectiveness of teamwork/workflow. Zhuge (2002) proposed a management mechanism for realizing ordered knowledge sharing, and integrated the knowledge flow with the workflow to assist people working in a complex and knowledge-intensive environment. Also, KF plays an important role in academic research, as researchers often devise novel concepts based on previous research reported in the literature (Zhuge, 2006a). However, to the best of our knowledge, there is no systematic method that can flexibly identify KF in order to understand the information needs of workers. Furthermore, conventional KF approaches do not analyze knowledge flow from the perspective of information needs and recommend relevant documents based on the discovered KF.

* Corresponding author. E-mail address: [email protected] (D.-R. Liu).
The Journal of Systems and Software. doi:10.1016/j.jss.2009.06.044

Knowledge workers normally have various task-needs over time. Moreover, they may need to obtain task-relevant knowledge to complete a task by referencing several types of codified knowledge (documents); and the knowledge in one document may prompt a worker to reference another related document. Based on a worker’s referencing behavior, KF can be used to describe the evolution of information needs, preferences, and knowledge accumulated for a specific task. From the perspective of information needs, some knowledge in a KF may have a higher priority for accomplishing a task. For example, before taking a Data Mining course, a student must take courses in Statistics and Database Systems, which represent the fundamental knowledge of Data Mining. Thus, these two courses are significant and have a high priority for the student. Additionally, academic knowledge may flow between different courses and thereby help students accumulate more knowledge. Similarly, the codified knowledge for a task also has different referencing priorities and ordering based on its perceived importance. In other words, important basic knowledge about a task should be referenced first. Therefore, KF can be utilized to provide effective recommendations about task-relevant knowledge to suit workers’ information needs for tasks. This issue has not been addressed by previous research.

In an attempt to resolve the limitations of previous research, we propose KF-based recommendation methods for recommending task-related codified knowledge. To adaptively provide relevant knowledge, collaborative filtering (CF), the most frequently used method, predicts a target worker’s preference(s) based on the opinions of similar workers. However, the target worker’s referencing behavior may change over the period of the task’s execution, because his/her information needs may vary. Traditional CF methods only consider workers’ preferences for codified knowledge. They neglect the effect of the time factor, i.e., workers’ referencing behavior for knowledge over time. To fill this research gap, we propose a KF-based sequential rule method (KSR) that recommends codified knowledge by utilizing the KF-based sequential rules. However, the method is based on the target worker’s referencing behavior without considering the opinions of his/her neighbors who may have similar preference for documents. Therefore, to take advantage of the merits of typical CF and KSR methods, we propose hybrid recommendation methods that combine CF and KSR methods to enhance the quality of document recommendation. The hybrid methods consider workers’ preferences for codified knowledge, as well as their knowledge referencing behavior, in order to predict topics of interest and recommend task-related knowledge.

The proposed hybrid methods consist of two phases: a KF mining phase and a KF-based recommendation phase. To determine a knowledge worker’s referencing behavior, the KF mining phase analyzes his/her historical work records to identify the knowledge flow, i.e., the target worker’s information needs. Then, the KF-based recommendation phase selects and recommends documents based on the document preferences and KF-based sequential rules derived from the target worker’s neighbors. In other words, the proposed methods trace a worker’s information needs by analyzing his/her knowledge referencing behavior for a task over time, and also proactively provide relevant codified knowledge for the worker based on the KFs of the worker’s neighbors.

The remainder of this paper is organized as follows. Section 2 provides a brief overview of related works. In Section 3, we describe the knowledge flow-based recommendation framework and knowledge flow model. In Sections 4 and 5, we discuss the knowledge flow mining phase and KF-based recommendation phase respectively. Section 6 details our experimental work, including an evaluation and comparison of our proposed methods and a discussion of the experiment results. Then, in Section 7, we summarize our conclusions and consider future research directions.

2. Background

In this section, we discuss the background of our research, including knowledge flow, information retrieval and task-based knowledge support, document clustering, dynamic programming algorithm, rule-based recommendations, and collaborative filtering.

2.1. Knowledge flow

Knowledge can flow among people and processes to facilitate knowledge sharing and reuse. The concept of knowledge flow has been applied in various domains, e.g., scientific research, communities of practice, teamwork, industry, and organizations (Anjewierden et al., 2005; Kim et al., 2003; Zhuge, 2006a). Scholarly articles represent the major medium for disseminating knowledge among scientists to inspire new ideas (Anjewierden et al., 2005; Zhuge, 2006a). A citation implies that there is knowledge flow between the citing article and the cited article. Such citations form a knowledge flow network that enables knowledge to flow between different scientific projects to promote interdisciplinary research and scientific development.

KM enhances the effectiveness of teamwork by accumulating and sharing knowledge among team members to facilitate peer-to-peer knowledge sharing (Zhuge, 2002). To improve the efficiency of teamwork, Zhuge (2006b) proposed a pattern-based approach that combines codification and personalization strategies to design an effective knowledge flow network. Kim et al. (2003) proposed a knowledge flow model combined with a process-oriented approach to capture, store, and transfer knowledge. KF in weblogs (blogs) is a communication pattern where the post of one blogger links to that of another blogger to exchange knowledge (Anjewierden et al., 2005). Similarly, knowledge flow in communities of practice helps members share their knowledge and experience about a specific domain to complete their tasks (Rodriguez et al., 2004).

2.2. Information retrieval and task-based knowledge support

Information retrieval (IR) facilitates access to specific items of information (Baeza-Yates and Ribeiro-Neto, 1999; Feldman and Sanger, 2007). The vector space model (Salton and Buckley, 1988) is typically used to represent documents as vectors of index terms, where the weights of the terms are measured by the tf-idf approach. tf denotes the occurrence frequency of a particular term in the document, while idf denotes the inverse document frequency of the term. Terms with higher tf-idf weights are used as discriminating terms to filter out common terms. The weight of a term i in a document j, denoted by $w_{i,j}$, is expressed as follows:

$$ w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \left( \log_2 \frac{N}{n} + 1 \right), \tag{1} $$

where $tf_{i,j}$ is the frequency of term i in document j, $idf_i$ is measured by $\log_2(N/n) + 1$, N is the total number of documents in the collection, and n is the number of documents in which term i occurs at least once.
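The weighting in Eq. (1) can be illustrated with a minimal Python sketch. It assumes that each document has already been tokenized into a list of terms; the function name `tfidf_weights` and the toy corpus are hypothetical and are not part of the original work.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, following Eq. (1)."""
    n_docs = len(docs)                  # N: total number of documents
    doc_freq = Counter()                # n: number of documents containing a term
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)            # tf_{i,j}: raw frequency of term i in document j
        weights.append({
            term: count * (math.log2(n_docs / doc_freq[term]) + 1)
            for term, count in tf.items()
        })
    return weights

# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["knowledge", "flow", "mining"],
    ["knowledge", "sharing"],
    ["flow", "flow", "rule", "mining"],
    ["knowledge", "recommendation"],
]
w = tfidf_weights(docs)
```

In this toy corpus, "rule" occurs in only one of the four documents and therefore receives a larger idf than "knowledge", which occurs in three of them.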

Information retrieval techniques coupled with workflow management systems (WfMS) have been used to support proactive delivery of task-specific knowledge based on the context of tasks within a process (Abecker et al., 2000a,b). For example, the KnowMore system (Abecker et al., 2000a,b) provides context-aware delivery of task-specific knowledge. The Kabiria system assists knowledge workers with knowledge-based document retrieval by considering the operational context of task-associated procedures (Augusto et al., 1995).

Information filtering with a similarity-based approach is often used to locate knowledge items relevant to the task-at-hand. The discriminating terms of a task are usually extracted from a knowledge item/task to form a task profile, which is used to model a worker’s information needs. Holz et al. (2005) proposed a similarity-based approach to organize desktop documents and proactively deliver task-specific information. Liu et al. (2005) proposed a K-Support system to provide effective task support for a task-based working environment.

2.3. Document clustering

Document clustering or unsupervised document classification methods are used in many applications. Most methods apply pre-processing steps to the document set and represent each document as a vector of index terms. To cluster similar documents, the similarity between documents is usually measured by the cosine measure (Baeza-Yates and Ribeiro-Neto, 1999; Van RijsBergen, 1979), which computes the cosine of the angle between their corresponding feature vectors. Two documents are considered similar if the cosine similarity value is high. The cosine similarity of two documents, X and Y, is

$$ sim_{cos}(X, Y) = \frac{\vec{X} \cdot \vec{Y}}{\lVert \vec{X} \rVert \, \lVert \vec{Y} \rVert}, $$

where $\vec{X}$ and $\vec{Y}$ are the feature vectors of X and Y respectively. Documents within a cluster are very similar, while documents in different clusters are very dissimilar.
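As a small illustration of the cosine measure, the following Python sketch compares two documents represented as sparse term-weight dictionaries. The dictionary representation, the function name and the handling of empty vectors are assumptions made for this example.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty vector is treated as similar to nothing
    return dot / (norm_x * norm_y)

# Hypothetical feature vectors.
doc_x = {"knowledge": 1.0, "flow": 1.0}
doc_y = {"knowledge": 1.0}
doc_z = {"clustering": 2.0}
```

Here `cosine_similarity(doc_x, doc_y)` is about 0.71 because the two vectors share one of two equally weighted terms, while `cosine_similarity(doc_x, doc_z)` is 0 because they share none.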

Agglomerative hierarchical clustering (Johnson, 1967; Kaufman and Rousseeuw, 1990) is a popular document clustering method. In this work, we use the single-link clustering method (Dubes and Jain, 1988; Jain et al., 1999) to cluster codified knowledge (documents). Initially, each document is regarded as a cluster. Next, the single-link method computes the similarity between two clusters, which is equal to the greatest similarity between any document in one cluster and any document in the other cluster. Then, based on the similarity measurement, the two most similar clusters are merged to form a new cluster. The merging process continues until all documents have been merged into one cluster at the top of a hierarchy, or a pre-specified threshold is satisfied (Jain et al., 1999).
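The single-link procedure described above can be sketched in Python as follows. This is a naive illustration, not an efficient implementation: it assumes a precomputed symmetric similarity matrix with non-negative entries (e.g., cosine similarities), and the `threshold` stopping parameter and the example matrix are hypothetical.

```python
def single_link_clustering(sim, threshold):
    """Agglomerative single-link clustering over a symmetric similarity matrix.

    sim[a][b] is the similarity of documents a and b. Merging stops when the
    best inter-cluster similarity drops below `threshold`, or when only one
    cluster remains.
    """
    clusters = [[i] for i in range(len(sim))]  # initially one cluster per document
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: greatest similarity between any two members.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, best_pair = link, (a, b)
        if best < threshold:
            break
        a, b = best_pair
        clusters[a] = sorted(clusters[a] + clusters[b])  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical pairwise similarities for four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
clusters = single_link_clustering(sim, threshold=0.5)
```

With a threshold of 0.5, documents 0 and 1 are merged first, then documents 2 and 3, and merging stops because the remaining link (0.3) is below the threshold. A threshold of 0 merges everything into one cluster.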

2.3.1. Clustering quality

A good clustering method generates clusters that are cohesive and isolated from other clusters. For this reason, the measurement of clustering quality takes both inter-cluster similarity and intra-cluster similarity into account (Chuang and Chien, 2004). Let C be a set of clusters. The inter-cluster similarity between two clusters $C_i$ and $C_j$, $similarity_A(C_i, C_j)$, is defined as the average of all pairwise similarities between the documents in $C_i$ and $C_j$; and the intra-cluster similarity within a cluster $C_i$, $similarity_A(C_i, C_i)$, is defined as the average of all pairwise similarities between documents in $C_i$. On the basis of the cohesion and isolation of C, the quality measure of C, CQ(C), is defined as:

$$ CQ(C) = \frac{1}{|C|} \sum_{C_i \in C} \frac{similarity_A(C_i, \overline{C_i})}{similarity_A(C_i, C_i)}, \quad \text{where } \overline{C_i} = \bigcup_{j \neq i} C_j. \tag{2} $$

Note that the smaller the value of CQ(C), the better the quality of the derived set of clusters, C, will be.
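A minimal Python sketch of the quality measure in Eq. (2) is given below. It assumes a precomputed similarity matrix and at least two clusters. Two details are assumptions of this sketch rather than statements from the text: the intra-cluster average is taken over distinct document pairs only, and a singleton cluster is given an intra-cluster similarity of 1.

```python
from itertools import combinations, product

def inter_similarity(sim, a, b):
    """Average pairwise similarity between documents of two disjoint sets."""
    pairs = list(product(a, b))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)

def intra_similarity(sim, a):
    """Average pairwise similarity between distinct documents of one cluster."""
    pairs = list(combinations(a, 2))
    if not pairs:
        return 1.0  # assumption: a singleton cluster is perfectly cohesive
    return sum(sim[i][j] for i, j in pairs) / len(pairs)

def clustering_quality(sim, clusters):
    """CQ(C) of Eq. (2); expects at least two clusters. Smaller is better."""
    total = 0.0
    for idx, cluster in enumerate(clusters):
        # Complement of the cluster: all documents in the other clusters.
        rest = [d for k, other in enumerate(clusters) if k != idx for d in other]
        total += inter_similarity(sim, cluster, rest) / intra_similarity(sim, cluster)
    return total / len(clusters)

# Hypothetical pairwise similarities for four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
good = clustering_quality(sim, [[0, 1], [2, 3]])
bad = clustering_quality(sim, [[0, 2], [1, 3]])
```

The grouping that keeps similar documents together yields the smaller CQ value, which agrees with the interpretation that lower values indicate better clusters.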

2.4. Dynamic programming algorithm for sequence alignment

In this work, each worker’s knowledge flow is represented as a sequence. We use sequence alignment techniques to analyze the similarity of workers’ knowledge flows, which corresponds to a sequence alignment problem. Such techniques are used to compare or align strings in many application domains, such as biology, speech recognition, and web session clustering. A number of methods can be used for sequence alignment, e.g., the sequence alignment method (SAM) (Hay et al., 2001) and dynamic programming. SAM, also called the string edit distance method (Kruskal, 1983), considers the sequential order of elements in a sequence and then measures the similarity/dissimilarity of sequences. The measurements reflect the operations necessary to equalize the sequences by computing the costs of deleting and inserting unique elements as well as the costs of reordering common elements (Hay et al., 2001; Mannila and Ronkainen, 1997). In addition, Charter et al. (2000) proposed a dynamic programming algorithm that solves the sequence alignment problem efficiently.

The algorithm consists of three steps: initialization, FindScore and FindPath (Charter et al., 2000; Oguducu and Ozsu, 2006). The first step creates a dynamic programming matrix with N + 1 columns and M + 1 rows, where N and M correspond to the sizes of the sequences to be aligned. One sequence is placed at the top of the matrix and the other is placed on the left-hand side of the matrix. There is a gap at the end of each sequence to allow calculation of the alignment score. The FindScore step calculates the two-dimensional alignment score of sequences. If two aligned sequences have an identical matching in the same column, the column is given a positive score s (e.g., +1 or +2); but if the values in a column are mismatches, the score s is zero or negative (e.g., 0, −1 or −2). In addition, if a column contains a gap, it is given a penalty score w (e.g., 0, −1 or −2). Therefore, starting from the bottom right-hand corner, each position in the dynamic programming matrix is given the maximal score $M_{ij}$. For each position in the matrix, $M_{ij}$ is defined as follows:

$$ M_{ij} = \max \{ (M_{i-1,j-1} + s_{ij}), \; (M_{i,j-1} + w), \; (M_{i-1,j} + w) \}, \tag{3} $$

where i is the row number, j is the column number, $s_{ij}$ is the match/mismatch score, and w is the penalty score. The third step, FindPath, determines the actual KF alignment that derives the maximal score. It traverses the matrix from the destination point (top left-hand corner) to the starting point (bottom right-hand corner) to find an optimal alignment path in order to determine the maximal alignment score d. We calculate the flow similarity based on the maximal alignment score. The details are given in Section 5.1.
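The recurrence in Eq. (3) can be sketched in Python as a standard global alignment score. Note the assumptions: this sketch fills the matrix from the top left-hand corner, which is the conventional orientation and may differ from the traversal direction described above; the scores (match +2, mismatch −1, gap −1) are example values; and only the maximal score is returned, without the FindPath traceback.

```python
def alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Maximal global alignment score of two sequences using Eq. (3)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    m = [[0] * cols for _ in range(rows)]
    # Initialization: aligning a prefix against nothing costs one gap per element.
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + gap
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + gap
    # FindScore: each cell takes the best of match/mismatch or a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s, m[i][j - 1] + gap, m[i - 1][j] + gap)
    return m[-1][-1]

# Hypothetical topic sequences of two workers.
flow_a = ["T1", "T2", "T3"]
flow_b = ["T1", "T3"]
```

For these two sequences the best alignment matches T1 and T3 and inserts one gap for T2, giving a score of 2 − 1 + 2 = 3.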

2.5. Collaborative filtering recommendation

Collaborative filtering (CF) is a well-known approach for recommender systems: GroupLens (Konstan et al., 1997), Ringo (Shardanand and Maes, 1995), Siteseer (Rucker and Polanco, 1997), and Knowledge Pump (Glance et al., 1998). CF recommends items, e.g., products, movies, and documents, based on the preferences of people who have the same or similar interests to those of the target user (Breese et al., 1998; Liu et al., 2008; Liu and Shih, [...]) [...] formation and prediction. The neighborhood of a target user is selected according to his/her similarity to other users, and is computed by Pearson correlation coefficient or the cosine measure. Either the k-NN (nearest neighbor) approach or a threshold-based approach is used to choose n users that are most similar to the target user. Here, we use the k-NN approach. In the prediction step, the predicted rating is calculated from the aggregated weights of the selected n nearest neighbors’ ratings, as shown in Eq. (4):

$$ P_{u,j} = \bar{r}_u + \frac{\sum_{i=1}^{n} w(u, i) \, (r_{i,j} - \bar{r}_i)}{\sum_{i=1}^{n} |w(u, i)|}, \tag{4} $$

where $P_{u,j}$ denotes the prediction rating of item j for the target user u; $\bar{r}_u$ and $\bar{r}_i$ are the average ratings of user u and user i, respectively; w(u, i) is the similarity between target user u and user i; $r_{i,j}$ is the rating of user i for item j; and n is the number of users in the neighborhood.
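The prediction step of Eq. (4) can be sketched in Python as follows. The sketch assumes ratings are stored as nested dictionaries, uses the Pearson correlation computed over co-rated items with each user’s overall mean rating, and selects the k most similar users who rated the item. The function names, the parameter `k` and the data are hypothetical.

```python
import math

def mean_rating(user_ratings):
    return sum(user_ratings.values()) / len(user_ratings)

def pearson(ratings_u, ratings_i):
    """Pearson-style correlation over co-rated items (assumption: overall means)."""
    common = set(ratings_u) & set(ratings_i)
    mean_u, mean_i = mean_rating(ratings_u), mean_rating(ratings_i)
    num = sum((ratings_u[d] - mean_u) * (ratings_i[d] - mean_i) for d in common)
    den = (math.sqrt(sum((ratings_u[d] - mean_u) ** 2 for d in common))
           * math.sqrt(sum((ratings_i[d] - mean_i) ** 2 for d in common)))
    return num / den if den else 0.0

def predict_rating(ratings, target, item, k=2):
    """Predicted rating of `item` for `target`, following Eq. (4)."""
    candidates = [(pearson(ratings[target], r), user)
                  for user, r in ratings.items()
                  if user != target and item in r]
    neighbors = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    base = mean_rating(ratings[target])
    den = sum(abs(weight) for weight, _ in neighbors)
    if den == 0:
        return base  # no informative neighbor: fall back to the user's mean
    num = sum(weight * (ratings[user][item] - mean_rating(ratings[user]))
              for weight, user in neighbors)
    return base + num / den

# Hypothetical document ratings on a 1-5 scale.
ratings = {
    "u": {"d1": 5, "d2": 1},
    "a": {"d1": 5, "d2": 1, "d3": 4, "d4": 2},
    "b": {"d1": 1, "d2": 5, "d3": 2, "d4": 4},
}
prediction = predict_rating(ratings, "u", "d3", k=2)
```

In this example, user a agrees with the target user and rated d3 above average, while user b disagrees with the target user and rated d3 below average, so both push the prediction above the target user’s mean of 3.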

Similar to the PCF method, the item-based collaborative filtering (ICF) algorithm (Linden et al., 2003; Sarwar et al., 2001) analyzes the relationships between items (e.g., documents) first, rather than the relationships between users. Then, the item relationships are used to compute recommendations for workers indirectly by finding items that are similar to other items the worker has accessed previously. Thus, the prediction for an item j for a user u is calculated by the weighted sum of the ratings given by the user for items similar to j and weighted by the item similarity, as shown in Eq. (5).

$$ p_{u,j} = \frac{\sum_{m=1}^{n} w(j, m) \times r_{u,m}}{\sum_{m=1}^{n} |w(j, m)|}, \tag{5} $$

where $p_{u,j}$ represents the predicted rating of item j for user u; w(j, m) is the similarity between two items j and m; and $r_{u,m}$ denotes the rating of user u for item m. A number of methods can be used to determine the similarity between items, e.g., the cosine-based similarity, correlation-based similarity, and adjusted cosine similarity methods. Since the adjusted cosine similarity method performs better than the others (Sarwar et al., 2001), we use it as the similarity measure for the ICF method. The adjusted cosine similarity between two items i and j is given by Eq. (6).

$$ sim(i, j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_u)^2}}, \tag{6} $$

where $r_{u,i}$/$r_{u,j}$ is the rating of item i/j given by user u; and $\bar{r}_u$ is the average item rating of user u.
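Eqs. (5) and (6) can be illustrated together with a short Python sketch. It assumes that U is the set of users who rated both items and that the sum in Eq. (5) runs over all other items the user has rated; both choices, as well as the function names and data, are assumptions of this example.

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Eq. (6): compare two items after subtracting each user's mean rating."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dev_i = user_ratings[item_i] - mean
            dev_j = user_ratings[item_j] - mean
            num += dev_i * dev_j
            den_i += dev_i ** 2
            den_j += dev_j ** 2
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0

def predict_item_based(ratings, user, item):
    """Eq. (5): similarity-weighted average of the user's own ratings."""
    num = den = 0.0
    for other, rating in ratings[user].items():
        if other == item:
            continue
        weight = adjusted_cosine(ratings, item, other)
        num += weight * rating
        den += abs(weight)
    return num / den if den else None  # None: no similar rated item available

# Hypothetical document ratings on a 1-5 scale.
ratings = {
    "u1": {"d1": 5, "d2": 5, "d3": 1},
    "u2": {"d1": 4, "d2": 4, "d3": 2},
    "u3": {"d1": 1, "d2": 1, "d3": 5},
    "u4": {"d2": 4, "d4": 2},
}
prediction = predict_item_based(ratings, "u4", "d1")
```

Documents d1 and d2 are always rated identically, so their adjusted cosine similarity is 1, whereas d4 shares no raters with d1 and contributes nothing. The prediction for u4 on d1 therefore equals u4’s rating of d2.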

2.6. Rule-based recommendations

Association rule mining (Agrawal et al., 1993; Agrawal and Srikant, 1994; Yun et al., 2003) is a widely used data mining technique that generates recommendations in recommender systems. An association rule describes the relationships between items, such as products, documents, or movies, based on patterns of co-occurrence across transactions. The Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994) is usually employed to identify such rules. Two measures, support and confidence, are used to indicate the quality of an association rule (Agrawal et al., 1993). The discovered rules should satisfy two user-defined requirements, namely minimum support and minimum confidence.

To improve the quality of traditional CF, Cho et al. (2005) proposed a sequential rule-based recommendation method that considers the evolution of customers’ purchase sequences. Transactions are clustered into a set of q transaction clusters, $C = \{C_1, C_2, \ldots, C_q\}$, where each $C_j$ is a subset of transactions. Each customer’s transactions over l periods are then transformed into transaction clusters as a behavior locus, $L_i = \langle C_{i,T-l+1}, \ldots, C_{i,T-1}, C_{i,T} \rangle$, where $C_{i,T-k} \in C$, $k = 1, 2, \ldots, l-1$, $l = 2$. Finally, sequential purchase patterns are extracted from the behavior locus of customers by time-based association rule mining to keep track of customers’ preferences during l periods, with T as the current (latest) period. A sequential rule is expressed in the form $C_{T-l+1}, \ldots, C_{T-1} \Rightarrow C_T$, where $C_T$ represents the customers’ purchase behavior in period T. If a target customer’s purchase behavior prior to period T was similar to the conditional part of the rule, then it is predicted that his/her purchase behavior in period T will be $C_T$. Accordingly, $C_T$ is used to recommend products to the target customer in T.
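The idea of sequential rules with support and confidence can be sketched in Python. This is a deliberately simplified illustration, not the time-based association rule mining of Cho et al. (2005): it assumes all behavior loci have the same length, treats the first l − 1 clusters as the rule body and the last cluster as the head, and requires exact matches. The function names and data are hypothetical.

```python
from collections import Counter

def mine_sequential_rules(loci, min_support, min_confidence):
    """Derive rules (C_{T-l+1}, ..., C_{T-1}) => C_T from equal-length loci."""
    total = len(loci)
    body_counts = Counter(tuple(locus[:-1]) for locus in loci)
    rule_counts = Counter((tuple(locus[:-1]), locus[-1]) for locus in loci)
    rules = []
    for (body, head), count in rule_counts.items():
        support = count / total                 # share of loci matching body and head
        confidence = count / body_counts[body]  # share of matching bodies ending in head
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, head, support, confidence))
    # Strongest rules first: by confidence, then by support.
    return sorted(rules, key=lambda rule: (-rule[3], -rule[2]))

def predict_next(rules, recent_behavior):
    """Return the head of the first rule whose body equals the recent behavior."""
    for body, head, _, _ in rules:
        if body == tuple(recent_behavior):
            return head
    return None

# Hypothetical behavior loci over l = 3 periods.
loci = [
    ("C1", "C2", "C3"),
    ("C1", "C2", "C3"),
    ("C1", "C2", "C4"),
    ("C2", "C2", "C1"),
]
rules = mine_sequential_rules(loci, min_support=0.5, min_confidence=0.6)
```

Only the rule C1, C2 ⇒ C3 survives the thresholds here (support 0.5, confidence 2/3), so a customer whose two most recent periods were C1 and C2 would be predicted to behave as C3 in period T.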

3. Knowledge flow-based recommendation framework

In this work, we propose three hybrid recommendation methods based on knowledge flow (KF), which is a sequence of codified knowledge (documents) or topics referenced by a worker during a task’s execution. KF represents a worker’s information needs and the evolution of knowledge requirements, and is identified by analyzing a worker’s work log. To support workers effectively, our methods consider workers’ preferences as well as their referencing behavior in order to recommend task-related knowledge. During the recommendation phase, the user-based collaborative filtering (CF) is used to predict a target worker’s preferences based on the opinions of similar workers, while the item-based collaborative filtering (Sarwar et al., 2001) is used to predict the rating of a document based on the target worker’s interest in similar items (documents). However, the limitation of these traditional CF methods is that they only consider workers’ preferences for codified knowledge and neglect workers’ referencing behavior. A worker’s referencing behavior may change during the task’s execution to suit his/her current information needs. To address this issue, we propose a KF-based sequential rule method that improves the recommendation quality by tracking workers’ referencing behavior based on sequential rules. However, this method does not consider the opinions of the target worker’s neighbors who have similar preferences for documents. To overcome the limitations of CF and KF-based sequential rule methods, we combine the advantages of the two approaches and propose three hybrid recommendation methods that integrate KF mining, KF-based sequential rule mining and CF techniques to enhance the quality of recommendations.

3.1. Recommendation processes based on the knowledge flow model

The proposed recommendation methods are illustrated in Fig. 1. Our methods consist of two phases, a knowledge flow mining phase and a KF-based recommendation phase. The first phase identifies the worker’s knowledge flow from the large amount of knowledge in the worker’s log. Then, the second phase recommends codified knowledge to the target worker by using the proposed recommendation methods.

In the knowledge flow mining phase, KFs are identified from the task requirements and the referencing behavior of workers recorded in their logs. As tasks are performed at various times, each knowledge worker requires different kinds of knowledge to achieve a goal or complete a task. This phase involves three steps: document profiling, document clustering, and knowledge flow extraction. In the first step, each document is represented as a document profile, which is an n-dimensional vector comprised of significant terms and their weights. Then, based on the document profiles, documents with higher similarity measures are grouped in clusters by the hierarchical clustering method. In the third step, topic-level and codified-level KFs are generated from the document clustering results. A topic-level KF is expressed as a sequence of topics referenced by a worker, while a codified-level KF is represented as a sequence of codified knowledge accessed by a worker. Further details are given in Section 4.

The proposed hybrid recommendation methods combine a KF-based sequential rule (KSR) method with a user-based/item-based collaborative filtering (CF). The KSR method is regarded as the core process of the proposed hybrid methods. In the KSR method, workers with similar KFs to that of the target worker are deemed neighbors of the target worker and their knowledge referencing behavior patterns are identified by a sequential rule mining method. Based on the discovered sequential rules and the neighbors’ KFs, relevant topics and codified knowledge are recommended to the target worker to support the task-at-hand. Moreover, by considering workers’ preferences for codified knowledge, the CF method makes recommendations to the target worker based on the opinions of similar workers. Three approaches are used to find similar workers to the target worker. The preference-similarity-based CF method (PCF) chooses workers with similar preferences, while the KF-similarity-based CF method (KCF) chooses workers with similar KFs. Different from these two user-based methods, the item-based CF method (ICF) predicts a document rating based on its similar documents that have been rated by a target user. To adaptively and proactively recommend codified knowledge, we consider workers’ referencing behavior as well as their preferences for codified knowledge. Therefore, three hybrid recommendation methods are used in the KF-based recommendation phase: (1) a hybrid of PCF and KSR (PCF–KSR), (2) a hybrid of KCF and KSR (KCF–KSR), and (3) a hybrid of ICF and KSR (ICF–KSR). Further details are given in Section 5. In the following sections, we describe our methods in detail, including the knowledge flow model, the knowledge flow mining phase and the KF-based recommendation phase.

3.2. Knowledge flow model

In a knowledge-intensive and task-based environment, workers may need to access a large number of documents (codified knowledge) to accomplish a task. From the perspective of information needs, a worker’s knowledge flow (KF) represents the evolution of his/her information needs and preferences during a task’s execution. Workers’ KFs are identified by analyzing their knowledge referencing behavior based on their historical work logs, which contain information about previously executed tasks, task-related documents and when the documents were accessed.

A KF consists of two levels: a codified-level and a topic-level, as shown in Fig. 2. The knowledge in the codified-level indicates the knowledge flow between documents based on the access time. In most situations, the knowledge obtained from one document prompts a knowledge worker to access the next relevant document (codified knowledge). Hence, the task-related documents are sorted by their access time to obtain a document sequence as the codified-level KF.

Documents with similar concepts can be grouped together automatically to form a topic-level abstraction of knowledge. Note that each topic may contain several task-related documents. The codified-level KF can be abstracted to form a topic-level KF, which represents the transitions between various topics. Since the task knowledge in the topic-level may flow among topics, it could prompt the worker(s) to retrieve knowledge from the next related topic. Formally, we define knowledge flow as follows.

Definition 1 (Knowledge Flow (KF)). Let a worker’s knowledge flow be $KFlow_w^v = \{TKF_w^v, CKF_w^v\}$, where $TKF_w^v$ is the topic-level KF of the worker w for a task v, and $CKF_w^v$ is his/her codified-level KF for the task v.

Definition 2 (Codified-Level KF). A codified-level KF is a time-ordered sequence arranged according to the access times of the documents it contains. Thus, it is defined as $CKF_w^v = \langle d_w^{t_1}, d_w^{t_2}, \ldots, d_w^{t_f} \rangle$ and $t_1 < t_2 < \cdots < t_f$, where $d_w^{t_j}$ denotes the document that the worker w accessed at time $t_j$ for a specific task v. Each document can be represented by a document profile, which is an n-dimensional vector containing weighted terms that indicate the key content of the document.

Fig. 1. Document recommendation based on knowledge flows. (Figure not reproduced. The knowledge flow mining phase comprises document profiling, document clustering and knowledge flow extraction, and yields the topic-level KF and codified-level KF. The KF-based recommendation phase applies the hybrid PCF-KSR, KCF-KSR and ICF-KSR methods to the knowledge space and outputs a document recommendation list.)

Fig. 2. (Figure not reproduced. A knowledge flow plotted over time, with a topic level, Topic 1 to Topic i, above a codified level, Doc 1 to Doc i.)

Definition 3 (Topic-Level KF). A topic-level KF is a time-ordered topic sequence derived by mapping documents in the codified-level KF to corresponding topics. Thus, it is defined as $TKF_w^v = \langle TP_w^{t_1}, TP_w^{t_2}, \ldots, TP_w^{t_f} \rangle$ with $t_1 < t_2 < \cdots < t_f$, where $TP_w^{t_j}$ denotes the corresponding topic of the document that worker $w$ accessed at time $t_j$ for a specific task $v$. Each topic is represented by a topic profile, which is an n-dimensional vector containing weighted terms that indicate the key content of the topic.

4. Knowledge flow mining phase

The objective of the knowledge flow (KF) mining phase is to identify the KF of each knowledge worker. In this section, we describe how the KF mining method identifies KFs from workers’ logs. This phase consists of three steps: document profiling, document clustering and KF extraction, which we discuss in the following subsections.

4.1. Document profiling and document clustering

Two profiles, a document profile and a topic profile, are used to represent a worker’s KF. A document profile can be represented as an n-dimensional vector composed of terms and their respective weights derived by the normalized tf-idf approach based on Eq. (1). Based on the term weights, terms with higher values are selected as discriminative terms to describe the characteristics of a document. The document profile of $d_j$ is comprised of these discriminative terms. Let the document profile be $DP_j = \langle dt_{1j}: dtw_{1j}, dt_{2j}: dtw_{2j}, \ldots, dt_{nj}: dtw_{nj} \rangle$, where $dt_{ij}$ is the term $i$ in $d_j$ and $dtw_{ij}$ is the degree of importance of term $i$ to the document $d_j$, which is derived by the normalized tf-idf approach. The document profiles are used to measure the similarity of the documents.

We adopt the single-link hierarchical clustering method (Jain et al., 1999) to group documents with similar profiles into clusters by using the cosine measure to calculate the similarity between the profiles of two documents. The single-link method computes the cluster similarity between two clusters $C_r$ and $C_t$ by $\max_{d_i \in C_r,\, d_j \in C_t} \{ sim_{cos}(d_i, d_j) \}$ (Zhao et al., 2005), and then merges the two most similar clusters into a single cluster. The similarity computation and cluster combination steps are repeated until the similarity of the most similar pair of clusters is lower than a pre-specified threshold value. Different clustering results can be obtained by setting different threshold values. We adjust the threshold value systematically and use the quality measure described in Section 2.3.1 to evaluate each clustering result. Then, we take the one with the best quality measure as our clustering result. Note that a cluster represents a topic set and has a topic profile (derived from the document cluster) that describes the features of the topic.

4.1.1. Topic profile

Documents in the same cluster contain similar content and form a topic set. The key features of the cluster are described by a topic profile, which is derived from the profiles of documents that belong to the cluster. Let $TP_x = \langle tt_{1x}: ttw_{1x}, tt_{2x}: ttw_{2x}, \ldots, tt_{nx}: ttw_{nx} \rangle$ be the profile of a topic (cluster) $x$, where $tt_{ix}$ is a topic term and $ttw_{ix}$ is the weight of the topic term. In addition, let $D_x$ be the set of documents in cluster $x$. The weight of a topic term is determined by Eq. (7) as follows:

$$ttw_{ix} = \frac{\sum_{j \in D_x} dtw_{ij}}{|D_x|}, \tag{7}$$

where $dtw_{ij}$ is the weight of term $i$ in document $j$, and $|D_x|$ is the number of documents in cluster $x$. The weight of a topic term is obtained from the average weight of the terms in the document set.
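The clustering step of Section 4.1 and the topic profile of Eq. (7) can be illustrated with a minimal Python sketch. It assumes that document profiles are stored as sparse `{term: weight}` dictionaries; the function names, the toy profiles and the threshold value are hypothetical and chosen only for illustration, and the naive pairwise search is not meant to reflect the efficiency of an actual implementation.

```python
import math
from itertools import combinations

def cosine(p, q):
    """Cosine similarity of two sparse profiles given as {term: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * \
           math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

def single_link_clusters(profiles, threshold):
    """Merge the most similar pair of clusters until the best pair
    falls below the threshold. Clusters hold document indices."""
    clusters = [[i] for i in range(len(profiles))]

    def cluster_sim(a, b):
        # Single link: similarity of the closest pair of documents.
        return max(cosine(profiles[i], profiles[j]) for i in a for j in b)

    while len(clusters) > 1:
        r, t = max(combinations(range(len(clusters)), 2),
                   key=lambda rt: cluster_sim(clusters[rt[0]], clusters[rt[1]]))
        if cluster_sim(clusters[r], clusters[t]) < threshold:
            break
        clusters[r] = clusters[r] + clusters[t]
        del clusters[t]  # r < t, so index r is unaffected
    return clusters

def topic_profile(cluster, profiles):
    """Eq. (7): average weight of each term over the documents in a cluster."""
    terms = set().union(*(profiles[j] for j in cluster))
    return {t: sum(profiles[j].get(t, 0.0) for j in cluster) / len(cluster)
            for t in terms}

# Hypothetical toy profiles: documents 0 and 1 share terms, document 2 does not.
doc_profiles = [
    {"kf": 1.0, "mining": 0.5},
    {"kf": 0.9, "mining": 0.6},
    {"rating": 1.0},
]
clusters = single_link_clusters(doc_profiles, threshold=0.5)
profiles_by_topic = [topic_profile(c, doc_profiles) for c in clusters]
```

In this toy run the first two documents form one topic and the third remains a topic of its own; a term absent from a document contributes a weight of zero to the average.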

4.2. Knowledge flow extraction

In this section, we describe the method used to extract a worker’s KF from his/her data log when performing a task. We define a task as a unit of work, which denotes either a previously executed (i.e., historical) task or the current task. When performing a task in a knowledge-intensive and task-based environment, a worker usually requires a large amount of task-related knowledge to accomplish the task. By analyzing a worker’s referencing behavior for a specific task, the corresponding knowledge flow of the task is derived by the knowledge flow extraction method. Note that if a worker performs more than one task, more than one knowledge flow will be extracted. For a specific task, the method derives two kinds of KF, codified-level KF and topic-level KF, to represent the worker’s information needs for the task.

4.2.1. Codified-level knowledge flow

The codified-level KF is extracted from the documents recorded in the worker’s work log. In most situations, workers are motivated to access a document about a specific task because of knowledge derived from other documents. The documents are arranged according to the times they were accessed, and a document sequence, i.e., a codified-level KF, is obtained. The order of documents in the sequence is subjective, since it is determined by the worker. In other words, each worker has his/her own codified-level KF, which represents his/her knowledge accumulation process for a specific task at the codified-level.

4.2.2. Topic-level knowledge flow

The topic-level KF is derived by mapping documents in the codified-level KF of a specific task into corresponding clusters and is represented by a topic sequence. In the previous step, documents with similar content were grouped into clusters. We use the document clustering results to map the documents in the codified-level KF into topics (clusters) in order to compile the topic-level KF. Since the codified-level KF is the basis of the topic-level KF, the knowledge in the latter is an abstraction of the former, and indicates how knowledge flows among various topics. A topic in the topic-level KF may be duplicated because the worker may read about the same topic frequently to obtain essential knowledge while executing a task.
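The two extraction steps can be sketched in a few lines of Python. The sketch assumes that a work log for one worker and one task is available as `(access_time, document_id)` pairs and that the clustering result is a document-to-topic dictionary; both data layouts, the function name and the identifiers in the example are hypothetical.

```python
def extract_knowledge_flows(access_log, doc_topic):
    """Derive the codified-level and topic-level KF of one worker for one task.

    access_log: list of (access_time, doc_id) pairs (assumed log layout).
    doc_topic:  dict mapping doc_id -> topic id from document clustering.
    """
    # Codified-level KF: documents ordered by access time.
    ordered = sorted(access_log, key=lambda record: record[0])
    ckf = [doc for _, doc in ordered]
    # Topic-level KF: each document replaced by the topic of its cluster.
    tkf = [doc_topic[doc] for doc in ckf]
    return ckf, tkf

# Hypothetical log: d2 is read twice, and d2 and d7 belong to the same topic.
log = [(3, "d7"), (1, "d2"), (2, "d5"), (4, "d2")]
topics = {"d2": "T1", "d5": "T2", "d7": "T1"}
ckf, tkf = extract_knowledge_flows(log, topics)
print(ckf)  # ['d2', 'd5', 'd7', 'd2']
print(tkf)  # ['T1', 'T2', 'T1', 'T1']
```

The repeated `T1` entries show how a topic can be duplicated in the topic-level KF when the worker returns to it.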

5. KF-based recommendation phase

The KF-based recommendation phase consists of three hybrid recommendation methods: (1) PCF and KSR (PCF–KSR), (2) KCF and KSR (KCF–KSR), and (3) ICF and KSR (ICF–KSR), as shown in Fig. 1. We note that PCF denotes the preference-similarity-based CF method; KCF denotes the KF-similarity-based CF method; ICF denotes the item-based CF method; and KSR denotes the KF-based sequential rule method. To adaptively recommend documents, both the PCF method and the KCF method select neighbors based on the similarity of preferences, while the ICF method chooses documents similar to a given document based on the preferences expressed by the target user. The three methods differ in the way they compute the similarity between workers’ preferences to select the target worker’s neighbors. The PCF method (traditional CF) uses preference ratings to compute the similarity, while the KCF method uses workers’ KFs to derive the similarity. The ICF method applies a similarity measure to evaluate the similarity between two items (i.e., documents), rather than the similarity between two workers. The proposed KSR method traces workers’ knowledge referencing behavior by using the KF-based sequential rules. The proposed hybrid recommendation methods take advantage of the merits of the KSR, PCF, KCF and ICF methods.


5.1. Identifying similar workers based on their knowledge flows

To find a target worker’s neighbors, his/her topic-level KF is compared with those of other workers to compute the similarity of their KFs. The resulting similarity measure indicates whether the KF referencing behavior of two workers is similar. In this work, we regard each knowledge flow as a sequence. Since comparing knowledge flows is very similar to aligning sequences, the sequence alignment method (SAM) (Hay et al., 2001) and the dynamic programming approach (Charter et al., 2000; Oguducu and Ozsu, 2006) can be used to measure the similarity of two KF sequences.

To determine which of the two methods would be more appropriate for comparing workers’ knowledge flows, we applied both methods in our experiments and found that dynamic programming is better than SAM. Therefore, we employ the dynamic programming algorithm (Charter et al., 2000; Oguducu and Ozsu, 2006) to measure the similarity of workers’ knowledge flows.

Unlike the sequence alignment problem, a worker’s KF contains task-related documents. Thus, we have to consider the sequential order of topics in a knowledge flow, as well as the worker’s aggregated profile, which accumulates the task-related documents based on the times they were accessed during the task’s execution. We propose a hybrid similarity measure, comprised of the KF alignment similarity and the aggregated profile similarity, to evaluate the similarity of two workers’ KFs, as shown in Eq. (8).

$$sim(TKF_i^v, TKF_j^l) = \alpha \times sim_a(TKF_i^v, TKF_j^l) + (1 - \alpha) \times sim_p(AP_i^v, AP_j^l), \tag{8}$$

where $sim_a(TKF_i^v, TKF_j^l)$ represents the KF alignment similarity between worker $i$ and worker $j$, who execute task $v$ and task $l$, respectively; $TKF_i^v$/$TKF_j^l$ is the topic-level KF of worker $i$/$j$ for task $v$/$l$; $sim_p(AP_i^v, AP_j^l)$ represents the aggregated profile similarity of two workers’ KFs; $AP_i^v$/$AP_j^l$ is the aggregated profile of worker $i$/$j$ for task $v$/$l$; and $\alpha$ is a parameter used to adjust the relative importance of the two types of similarity.

The KF alignment similarity is based on the topic sequence and topic coverage, while the aggregated profile similarity is based on the aggregated profiles derived from the profiles of referenced documents in the KFs. Note that the KF alignment similarity considers the topic sequence in the KF without considering the content of workers’ profiles; while the aggregated profile similarity considers the content of profiles without considering the topic sequence in the KF. By linearly combining these two similarities, we can balance the tradeoff between KF alignment and the aggregated profile. We discuss the rationale behind these two similarity measures next.
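As a small illustration of the linear combination in Eq. (8), the sketch below takes the two component similarities as already computed numbers. The function name and the example values are hypothetical, and restricting the weighting parameter to [0, 1] is an assumption that follows from reading Eq. (8) as a weighted average.

```python
def hybrid_kf_similarity(sim_a, sim_p, alpha):
    """Eq. (8): weighted combination of the KF alignment similarity (sim_a)
    and the aggregated profile similarity (sim_p)."""
    if not 0.0 <= alpha <= 1.0:
        # Assumption: alpha is a weight in [0, 1].
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_a + (1.0 - alpha) * sim_p

# Hypothetical values: two workers read similar content in a different order.
low_alignment, high_profile = 0.4, 0.8
for alpha in (0.0, 0.25, 1.0):
    print(alpha, hybrid_kf_similarity(low_alignment, high_profile, alpha))
```

With the weighting parameter at 0 only the profile content counts, at 1 only the topic order counts, and intermediate values trade one off against the other.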

5.1.1. KF alignment similarity

The KF alignment similarity is comprised of two parts: the KF alignment score, which measures the topics in sequence; and the join coefficient, which estimates the topic’s coverage in two compared topic-level KFs. We modify the sequence alignment method (Charter et al., 2000) to derive the KF alignment score. In addition to computing the sequence alignment score, we estimate the overlap of the topics in two compared topic-level KFs by using Dice’s coefficient (Van Rijsbergen, 1979). The rationale is that if the topic overlap is high, the KF alignment similarity of the two compared KFs will also be high. In other words, the two compared KFs will be very similar. The KF alignment similarity, $sim_a(TKF_i^v, TKF_j^l)$, is defined as follows:

$$sim_a(TKF_i^v, TKF_j^l) = Norm(\eta) \times \frac{2 \times |TPS_i^v \cap TPS_j^l|}{|TPS_i^v| + |TPS_j^l|}, \tag{9}$$

where $TKF_i^v$/$TKF_j^l$ denotes the topic-level KF of worker $i$/worker $j$ for task $v$/task $l$; $\eta$ is the KF alignment score; $Norm$ is a normalization function used to transform the value of $\eta$ into a number between 0 and 1; $TPS_i^v$ and $TPS_j^l$ are the sets of topics in $TKF_i^v$ and $TKF_j^l$, respectively; $TPS_i^v \cap TPS_j^l$ is the intersection of topics common to $TKF_i^v$ and $TKF_j^l$; and $|TPS_i^v|$ and $|TPS_j^l|$ represent the number of topics in $TKF_i^v$ and $TKF_j^l$, respectively. The KF alignment score, which is based on the sequence alignment method (Oguducu and Ozsu, 2006), is defined in Eq. (10):

$$\eta = \frac{\delta}{ms \times n}, \tag{10}$$

where $\delta$ is the maximal alignment score derived by the dynamic programming approach, $ms$ is the identical matching score (+2), and $n$ is the length of the aligned KF. To obtain the maximal alignment score $\delta$, we set the matching score $ms$, the mismatching score $md$ and the gap penalty score $mg$ to +2, −1 and −2, respectively, in the dynamic programming approach (Charter et al., 2000) discussed in Section 2.4. The maximum value of $\eta$ is 1 if the two compared KFs are exactly the same. On the other hand, the value of $\eta$ is negative if most of the topics in the two compared KFs do not match. Thus, the value of $\eta$ may range from a negative value to 1. To alter the range of the KF alignment score, the value of $\eta$ is transformed into a value in the range [0, 1] by the normalization function. The normalized KF alignment score $Norm(\eta)$ is then used to calculate the KF alignment similarity.
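Eqs. (9) and (10) can be sketched in Python as follows. Several points are assumptions rather than details given here: the alignment is taken to be a standard global alignment computed by dynamic programming, the length of the aligned KF is taken to be the number of alignment columns, and, because the form of the normalization function is not specified in this section, the default `norm` simply clips negative scores to 0. All function names are hypothetical.

```python
def alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of two topic sequences by dynamic programming.
    Returns (maximal alignment score, number of alignment columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], length[i][0] = i * gap, i
    for j in range(1, m + 1):
        score[0][j], length[0][j] = j * gap, j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            # Track the alignment length along the chosen move (ties favor diag).
            if best == diag:
                length[i][j] = length[i - 1][j - 1] + 1
            elif best == up:
                length[i][j] = length[i - 1][j] + 1
            else:
                length[i][j] = length[i][j - 1] + 1
    return score[n][m], length[n][m]

def kf_alignment_similarity(tkf_i, tkf_j, norm=lambda eta: max(0.0, eta)):
    """Eq. (9): normalized alignment score times Dice's coefficient.
    The default norm (clipping at 0) is an assumed placeholder."""
    delta, n = alignment(tkf_i, tkf_j)
    if n == 0:
        return 0.0
    eta = delta / (2 * n)  # Eq. (10) with ms = +2
    topics_i, topics_j = set(tkf_i), set(tkf_j)
    dice = 2 * len(topics_i & topics_j) / (len(topics_i) + len(topics_j))
    return norm(eta) * dice

print(kf_alignment_similarity(["A", "B", "C"], ["A", "B", "C"]))  # 1.0
print(kf_alignment_similarity(["A", "B"], ["A", "C"]))            # 0.125
```

In the second call the alignment has one match and one mismatch (score 1 over 2 columns, so the alignment score is 0.25), and the two KFs share one of two topics each (Dice's coefficient 0.5).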

5.1.2. Aggregated profile similarity

The aggregated profile similarity, defined as $sim_p(AP_i^v, AP_j^l)$, computes the similarity of two workers’ KFs based on their aggregated profiles, which are derived from the profiles of documents they have referenced; $AP_i^v$ and $AP_j^l$ are the respective vectors of the aggregated profiles of workers $i$/$j$ for task $v$/$l$. We use the cosine formula to calculate the similarity between two aggregated profiles. The value of the similarity score ranges from 0 to 1. The aggregated profile of a worker $i$ for task $v$ is defined as

$$AP_i^v = \sum_{t=1}^{T} tw_{t,T} \times DP_t^v, \tag{11}$$

where $tw_{t,T}$ is the time weight of the document referenced at time $t$ in the KF; $T$ is the index of the time at which the worker accessed the most recent document in his/her KF; and $DP_t^v$ is the profile of the document referenced by worker $i$ at time $t$ for task $v$. The aggregation process considers the time decay effect of the documents. Each document profile is assigned a time weight according to the time it was referenced. Thus, higher time weights are given to documents referenced in the recent past. The time weight of each document profile is defined as $tw_{t,T} = \frac{t - St}{T - St}$, where $St$ is the start time of the worker’s KF. For example, with $St = 0$ and $T = 10$, a document referenced at $t = 5$ receives a time weight of 0.5.

5.2. KF-based sequential rule method

The KF-based sequential rule method (KSR) considers the referencing behavior of neighbors whose KFs were very similar before time T, and then recommends documents at time T for the target worker. Fig. 3 provides an overview of the KSR method. To determine the similarity of various topic-level KFs, the target worker’s KF is compared with those of other workers by measuring their KF similarity, as discussed in Section 5.1. Workers with similar KFs to that of the target worker are regarded as the latter’s neighbors and their topic-level KFs are used to discover frequent knowledge referencing behavior by applying sequential rule mining to the target worker’s referencing behavior. The discovered sequential rules with high degrees of rule matching are selected to recommend topics at time T. Documents belonging to the recommended topics have a high priority of being recommended. The KSR recommendation method involves four steps: identifying similar workers, mining their knowledge referencing behavior, identifying the target worker’s knowledge referencing behavior, and document recommendation.

5.2.1. Mining knowledge referencing behavior

Knowledge workers whose referencing behavior is highly similar to that of the target worker are regarded as neighbors of the target worker. We modify the association rule mining method (Agrawal et al., 1993; Agrawal and Srikant, 1994) and sequential pattern mining method (Agrawal and Srikant, 1995) to discover topic-level sequential rules from the neighbors’ topic-level KFs. The extracted rules can be used to keep track of the referenced topics among workers with similar referencing behavior. Let $R_y$ be a sequential rule, as defined in Eq. (12).

$$R_y: g_{y,T-s}, \ldots, g_{y,T-1} \Rightarrow g_{y,T} \quad (Support_y, Confidence_y), \tag{12}$$

where $g_{y,T-f} \in TPS$; $f = 0$ to $s$; and $TPS$ is the set of all topics.

The conditional part of the sequential rule is $\langle g_{y,T-s}, \ldots, g_{y,T-1} \rangle$, and the consequent part is $g_{y,T}$. The items that appear in the rules are topics extracted from the neighbors’ topic-level KFs (TKF). The support and confidence values, $Support_y$ and $Confidence_y$, are used to evaluate the importance of rule $R_y$. We use the support and confidence scores to measure the degree of match between the referencing behavior and the conditional part of a rule for a target worker, as illustrated in the third step. Note that if the knowledge referencing behavior of the target worker is similar to the conditional part of $R_y$, then the topic predicted for him/her at $T$ will be $g_{y,T}$.
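To make the form of the rules in Eq. (12) concrete, the following Python sketch derives rules with support and confidence from a set of neighbors’ topic-level KFs. It is a deliberately simplified stand-in, not the modified association and sequential pattern mining procedure referred to above: it assumes that only contiguous runs of topics are considered, that support is the fraction of KFs containing the whole pattern, and that confidence is measured against the KFs in which the conditional part is followed by some topic. All names and thresholds are hypothetical.

```python
from collections import Counter

def mine_sequential_rules(tkfs, s, min_support=0.0, min_confidence=0.0):
    """Return rules as (condition, consequent, support, confidence) tuples.

    tkfs: list of topic-level KFs (lists of topic ids) of the neighbors.
    s:    length of the conditional part of each rule.
    Simplification: only contiguous runs of s + 1 topics are counted.
    """
    total = len(tkfs)
    pattern_count = Counter()    # KFs containing condition followed by consequent
    condition_count = Counter()  # KFs containing the condition followed by any topic
    for seq in tkfs:
        patterns, conditions = set(), set()
        for i in range(len(seq) - s):
            run = tuple(seq[i:i + s + 1])
            patterns.add(run)
            conditions.add(run[:-1])
        # Each KF contributes at most once per pattern and per condition.
        pattern_count.update(patterns)
        condition_count.update(conditions)
    rules = []
    for run, count in pattern_count.items():
        support = count / total
        confidence = count / condition_count[run[:-1]]
        if support >= min_support and confidence >= min_confidence:
            rules.append((run[:-1], run[-1], support, confidence))
    return rules

# Hypothetical neighbors' topic-level KFs.
neighbor_tkfs = [["T1", "T2", "T3"], ["T1", "T2", "T4"], ["T1", "T2", "T3"]]
for rule in mine_sequential_rules(neighbor_tkfs, s=2):
    print(rule)
```

In this toy example the rule with conditional part (T1, T2) and consequent T3 holds in two of the three KFs, so both its support and its confidence are 2/3.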

5.2.2. Identifying the knowledge referencing behavior of the target worker

This step identifies the target worker’s knowledge referencing behavior by matching his/her KF with the sequential rules discovered in the previous step. Specifically, the rules are matched with the topic-level KF of the target worker to predict the topics required at time T. We set a knowledge window on the KF before time T. The size of the window is determined by the user. Let $KW_u = \langle TP_u^{T-s}, TP_u^{T-s+1}, \ldots, TP_u^{T-1} \rangle$ be the knowledge window for the topic-level KF of a target worker $u$ before time $T$. Note that $TP_u^{T-f}$ is the topic referenced by $u$ at time $T - f$, $f = 1, \ldots, s$. The knowledge window $KW_u$ covers several topics previously referenced by the target worker and arranged in time order. The steps of sequential rule matching are as follows.

Step 1. Set a knowledge window $KW_u$

The reference time of topics in the window may range from $T - s$ to $T - 1$, where $s$ is the window size determined by the worker. The referencing behavior within the knowledge window is then compared with the sequential rules extracted from the KFs of the target worker’s neighbors (Step 3).

Step 2. Generate topic subsequences and compare them with the knowledge window

All generated rules are compared with the given knowledge window to obtain the matching scores of rules. A sequential rule may partially or fully match a knowledge window. To identify sequential rules that match the target worker’s referencing behavior, we consider all partial matches of the rules. Therefore, all possible topic subsequences are generated from the conditional part of the rule first.

The topic subsequences are enumerated according to the topic order in the conditional part of a rule. Let $TS_y^k = \langle TP_y^{k_1}, \ldots, TP_y^{k_i}, \ldots, TP_y^{k_m} \rangle$ be a topic subsequence in the conditional part of a sequential rule $y$, and let $TP_y^{k_i}$ be a topic with the index position $k_i$ in the sequence $TS_y^k$. In addition, let $KW_u$ be a knowledge window in a worker’s KF, and let $TP_u^{h_j}$ be a topic with the index position $h_j$ in the sequence $KW_u$. Then, each topic subsequence of a rule is examined by checking whether it exists in the knowledge window.

Instead of using identical matches, all the topics in a topic subsequence are compared with those in the knowledge window by using topic similarities to determine their matches. The characteristics of a KF are different from those of a general sequence, because a topic in a KF is composed of abstract knowledge concepts. We therefore use the topic similarity, i.e., $sim_{cos}(TP_y^{k_i}, TP_u^{h_j})$, to determine if two topics match. That is, they match if their similarity is greater than the user-specified threshold $\theta$.

We define a similarity matching score to compare a topic subsequence with a knowledge window. A topic subsequence $TS_y^k$ matches the knowledge window $KW_u$ if their corresponding topic similarities are larger than the user-defined threshold, i.e., $sim_{cos}(TP_y^{k_1}, TP_u^{h_1}) > \theta$, $sim_{cos}(TP_y^{k_2}, TP_u^{h_2}) > \theta$, $\ldots$, $sim_{cos}(TP_y^{k_m}, TP_u^{h_m}) > \theta$, where the integers $k_1 < k_2 < \cdots < k_m$, $h_1 < h_2 < \cdots < h_m$, and $\theta$ is the user-defined threshold. The similarity matching score is the summation of the topic similarities, as defined in Eq. (13).

$$SM_{TS_y^k, KW_u} = \sum_{i=1}^{m} sim_{cos}(TP_y^{k_i}, TP_u^{h_i}). \tag{13}$$

[Fig. 3. Overview of the KF-based sequential rule (KSR) method.]

Step 3. Find the matching degree of a sequential rule

Given the similarity matching scores of all topic subsequences extracted from a sequential rule, we choose the subsequence with the highest score to compute the matching degree of the rule. The matching degree is defined as follows:

$$RMD_{R_y, KW_u} = \max_{k=1,\ldots,q} \{ SM_{TS_y^k, KW_u} \} \times Support_y \times Confidence_y, \tag{14}$$

where $RMD_{R_y, KW_u}$ is the matching degree of rule $R_y$ and $KW_u$ of the target worker $u$; and $\max_{k=1,\ldots,q} \{ SM_{TS_y^k, KW_u} \}$ is the highest similarity matching score of all topic subsequences of sequential rule $y$. The matching degree is used to identify the sequential rules qualified to recommend topics at time T.

Step 4. Choose sequential rules for recommendation

A sequential rule with a high matching degree means that the referencing behavior of the target worker matches the conditional part of the rule, so the consequent part of the rule can be selected as a predicted topic for the target worker at time T. Hence, the Top-N approach can be used to derive a set of predicted topics by selecting N rules with the highest matching degree scores.
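Steps 2 to 4 can be sketched together in Python. Instead of enumerating every topic subsequence explicitly, the sketch uses a dynamic program that returns the largest similarity matching score of Eq. (13) over all in-order pairings whose similarities exceed the threshold; when a subsequence can be placed in the window in more than one way, taking the best placement is an assumption of this sketch. Rules are assumed to be `(condition, consequent, support, confidence)` tuples, topic similarities are assumed to be non-negative, and `toy_sim` is a hypothetical stand-in for the cosine similarity of topic profiles.

```python
def best_matching_score(condition, window, sim, theta):
    """Highest Eq. (13) score over all topic subsequences of `condition`
    that match `window` in order, pairing only topics with sim > theta."""
    m, n = len(condition), len(window)
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            skip = max(best[i - 1][j], best[i][j - 1])
            s = sim(condition[i - 1], window[j - 1])
            # A pair is allowed only if its similarity exceeds the threshold.
            paired = best[i - 1][j - 1] + s if s > theta else 0.0
            best[i][j] = max(skip, paired)
    return best[m][n]

def rule_matching_degree(rule, window, sim, theta):
    """Eq. (14): best matching score times support times confidence."""
    condition, _, support, confidence = rule
    return best_matching_score(condition, window, sim, theta) * support * confidence

def top_n_topics(rules, window, sim, theta, n):
    """Step 4: consequents of the N rules with the highest matching degree."""
    scored = [(rule_matching_degree(r, window, sim, theta), r[1]) for r in rules]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [topic for degree, topic in scored[:n] if degree > 0]

def toy_sim(a, b):
    """Hypothetical topic similarity: 1 for identical topics, else a lookup."""
    table = {("T4", "T1"): 0.3}
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))

rules = [(("T1", "T2"), "T3", 0.5, 0.8), (("T4", "T2"), "T5", 0.6, 0.5)]
window = ["T1", "T6", "T2"]  # knowledge window before time T
print(top_n_topics(rules, window, toy_sim, theta=0.5, n=1))  # ['T3']
```

The first rule matches the window fully (score 2.0, matching degree 0.8), while the second matches only partially because the similarity of T4 and T1 is below the threshold (score 1.0, matching degree 0.3).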

5.2.3. Document recommendation

The KSR method predicts a document rating based on sequential rules derived from the KFs of a target worker’s neighbors. Let $KNB_u^v$ be a set of neighbors of target worker $u$ for a task $v$, selected according to the KF similarity (using Eq. (8)). The sequential rules derived from $KNB_u^v$ with high degrees of rule matching are selected to recommend topics for the target worker at time T. However, the referencing behavior of some workers in $KNB_u^v$ may not match the selected sequential rules. Therefore, we apply the sequential rule matching method discussed in Section 5.2.2 to compare the KFs of workers in $KNB_u^v$ with the selected sequential rules. If a worker’s KF matches a selected sequential rule, that worker’s referencing behavior conforms to the sequential rule, and can therefore be used to make recommendations based on the selected sequential rules. The reason for checking the KFs of workers in $KNB_u^v$ is to identify neighbors whose referencing behavior conforms to the selected sequential rule.

For a task $v$, let $KNBR_u^v$ denote the neighbors in $KNB_u^v$ whose KFs are very similar to the target worker’s KF and whose referencing behavior matches the selected sequential rules. In addition, let $RTS$ be a set of recommended topics derived from the consequent parts of the recommended sequential rules; $\tau$ be a recommended topic, where $\tau \in RTS$; and the topic of a document $d$ be $\tau$. Based on the KFs of the neighbors in $KNBR_u^v$, the predicted rating of a document $d$ belonging to the recommended topic $\tau$ for the target worker $u$ is calculated by Eq. (15):

$$\hat{p}_{u,d,\tau}^{v} = \bar{r}_{u,\tau}^{v} + \frac{\sum_{x^l \in KNBR_u^v} sim(TKF_u^v, TKF_x^l) \times (r_{x,d,\tau}^{l} - \bar{r}_{x,\tau}^{l})}{\sum_{x^l \in KNBR_u^v} |sim(TKF_u^v, TKF_x^l)|}, \tag{15}$$

where $\bar{r}_{u,\tau}^{v}$/$\bar{r}_{x,\tau}^{l}$ is the topic rating of the target worker $u$/worker $x$ for task $v$/$l$, derived from the worker’s average rating of documents in the recommended topic $\tau$; $TKF_u^v$/$TKF_x^l$ is the topic-level KF of the target worker $u$/worker $x$ for task $v$/task $l$; $r_{x,d,\tau}^{l}$ is the rating given by worker $x$ for a document $d$ belonging to the recommended topic $\tau$ in task $l$; and $sim(TKF_u^v, TKF_x^l)$ is the KF similarity of worker $u$ and worker $x$, derived by Eq. (8). If the target worker $u$ does not rate any documents in $\tau$, then $\bar{r}_{u,\tau}^{v}$ is replaced by the average rating of all his/her documents.
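A minimal Python sketch of the prediction in Eq. (15) is given below. It assumes that each qualifying neighbor is summarized by a `(KF similarity, rating of the document, neighbor's topic rating)` triple; this layout, the function name and the example numbers are hypothetical. Returning the target worker’s topic rating when no neighbor carries weight is a fallback added for the sketch, not something stated above.

```python
def ksr_predict(target_topic_rating, neighbors):
    """Eq. (15): predicted rating of a document in a recommended topic.

    target_topic_rating: target worker's average rating in the topic.
    neighbors: list of (kf_similarity, rating_of_doc, neighbor_topic_rating)
               for neighbors whose KFs match the selected rules (assumed layout).
    """
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        # Assumed fallback: no usable neighbor, keep the worker's own average.
        return target_topic_rating
    numerator = sum(sim * (rating - topic_avg)
                    for sim, rating, topic_avg in neighbors)
    return target_topic_rating + numerator / denominator

# Hypothetical neighbors: one rated the document above and one below
# his/her own average for the recommended topic.
neighbors = [(0.8, 5.0, 4.0), (0.4, 3.0, 3.5)]
predicted = ksr_predict(3.5, neighbors)
print(round(predicted, 2))  # 4.0
```

The more similar neighbor rated the document one point above his/her topic average, so the prediction is pulled above the target worker’s own topic rating of 3.5.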

To recommend task-related documents to a target worker, it is necessary to collect data with explicit ratings. Many recommender systems and recommendation methods use such ratings to represent users’ preferences. Similarly, our recommendation methods use knowledge workers’ document ratings to predict other documents that may be useful to a target worker’s task, as shown in Eq. (15). Each knowledge worker gives explicit ratings to the documents referenced during the task’s execution, while documents related to different tasks are re-rated by different workers. The ratings are used to gauge a worker’s perceptions about the usefulness and relevance of documents for a specific task. The stronger the worker’s perceptions of the usefulness or relevance of a document for the task at hand, the higher the rating he/she will give the document. Such ratings are subjective because they are based on the worker’s perspective. Moreover, since a document may be referenced by different workers as they execute their specific tasks, it will be given different ratings based on how the workers perceive its usefulness and relevance to their tasks.

The sequential rules with high matching scores are selected to recommend topics. In other words, topics with high scores in the consequent part of a rule are recommended to the target worker at time T. The KSR method predicts ratings for documents that belong to the recommended topics and gives them a high priority for recommendation. Unlike traditional methods, KSR recommends documents to the target worker based on the selected sequential rules and the document ratings. Note that the KSR method does not consider the similarity of workers’ preferences when calculating the predicted rating of a document.

5.3. The hybrid PCF–KSR method

The hybrid PCF–KSR recommendation method linearly combines the preference-similarity-based CF method (PCF) with the KSR method to recommend documents to a target worker, as shown in Fig. 4. The PCF method is the traditional CF method that makes recommendations according to workers’ preferences for codified knowledge. To recommend a document, the neighbors of a target worker are selected based on the similarities of the workers’ preference ratings. Pearson’s correlation coefficient is used to find similar workers based on the document rating vectors. Then, PCF–KSR predicts the rating of a document by linearly combining the predicted ratings calculated by the two methods. One part of the rating is derived by the PCF method based on the document ratings and the preferences of the target worker’s neighbors. The other part is derived by the KSR method described in Section 5.2. Because a worker’s knowledge flow may change over time, the hybrid method considers the worker’s preference for documents as well as topic changes in his/her KF to make recommendations adaptively.

The predicted rating of a document $d$ for a worker $u$ executing a task $v$ is derived by combining the PCF and KSR methods, as defined in Eq. (16):

$$\hat{p}_{u,d}^{v} = \beta_{PCF\text{-}KSR} \times \left[ \bar{r}_u^v + \frac{\sum_{x^l \in PNB_u^v} PSim(u^v, x^l) \times (r_{x,d}^{l} - \bar{r}_x^l)}{\sum_{x^l \in PNB_u^v} |PSim(u^v, x^l)|} \right] + (1 - \beta_{PCF\text{-}KSR}) \times \hat{p}_{u,v,d}^{KSR}, \tag{16}$$

where $\bar{r}_u^v$/$\bar{r}_x^l$ is the average rating of documents for task $v$/task $l$ given by the target worker $u$/worker $x$; $PSim(u^v, x^l)$ is the similarity between the target worker $u$ for task $v$ and the neighbor worker $x$ for task $l$, derived by Pearson’s correlation coefficient; $PNB_u^v$ is the set of neighbors of the target worker $u$ for task $v$, selected by $PSim(u^v, x^l)$; $r_{x,d}^{l}$ is the rating of a document $d$ for task $l$ given by worker $x$; $\hat{p}_{u,v,d}^{KSR}$ is the predicted rating of a document $d$ for the target worker $u$ engaged in task $v$ based on the KSR method; and $\beta_{PCF\text{-}KSR}$ is the weighting used to adjust the relative importance of the PCF method and the KSR method.
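The PCF part of Eq. (16) and its combination with a KSR prediction can be sketched as follows. The sketch assumes that each worker’s ratings for a task are held in a `{document: rating}` dictionary, that Pearson’s correlation is computed over co-rated documents with means taken over those documents (one common variant), and that the KSR prediction is supplied as a number computed elsewhere. Function names and all values are hypothetical.

```python
import math

def pearson(ratings_u, ratings_x):
    """Pearson correlation over the documents rated by both workers."""
    common = sorted(set(ratings_u) & set(ratings_x))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[d] for d in common) / len(common)
    mean_x = sum(ratings_x[d] for d in common) / len(common)
    num = sum((ratings_u[d] - mean_u) * (ratings_x[d] - mean_x) for d in common)
    den = math.sqrt(sum((ratings_u[d] - mean_u) ** 2 for d in common)) * \
          math.sqrt(sum((ratings_x[d] - mean_x) ** 2 for d in common))
    return num / den if den else 0.0

def pcf_predict(target_ratings, neighbor_ratings, doc):
    """Bracketed term of Eq. (16): the target worker's average rating plus
    the similarity-weighted deviations of the neighbors' ratings of doc."""
    target_avg = sum(target_ratings.values()) / len(target_ratings)
    num = den = 0.0
    for ratings_x in neighbor_ratings:
        if doc not in ratings_x:
            continue  # this neighbor cannot contribute a rating for doc
        weight = pearson(target_ratings, ratings_x)
        neighbor_avg = sum(ratings_x.values()) / len(ratings_x)
        num += weight * (ratings_x[doc] - neighbor_avg)
        den += abs(weight)
    return target_avg + num / den if den else target_avg

def hybrid_predict(beta, cf_rating, ksr_rating):
    """Linear combination of a CF prediction and a KSR prediction."""
    return beta * cf_rating + (1.0 - beta) * ksr_rating

# Hypothetical ratings; d4 is the document to be predicted for the target.
target = {"d1": 5, "d2": 3, "d3": 4}
neighbor = {"d1": 4, "d2": 2, "d3": 3, "d4": 4}
pcf = pcf_predict(target, [neighbor], "d4")
print(round(pcf, 2))                                  # 4.75
print(round(hybrid_predict(0.5, pcf, ksr_rating=4.0), 3))  # 4.375
```

The same `hybrid_predict` step applies unchanged when the CF prediction comes from a different neighbor selection, since only the bracketed term of the equation differs.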

According to Eq. (16), a document in a recommended topic has a higher priority for recommendation than documents that are not in the recommended topics, based on their predicted ratings derived by the KSR method. Documents with high predicted ratings are used to compile a recommendation list, from which the top-N documents are chosen and recommended to the target worker.

5.4. The hybrid KCF–KSR method

The hybrid KCF–KSR method linearly combines the KF-similarity-based CF method (KCF) with the KSR method to recommend documents to a target worker, as shown in Fig. 5. The KCF method is based on the referencing behavior of neighbors with similar KFs, while the PCF method is based on the similarity of preference ratings derived by the Pearson correlation coefficient. Like the PCF–KSR method, the predicted rating of a document is also derived by integrating two parts of the ratings. One part is obtained by the KCF method, while the other is obtained by the KSR method described in Section 5.2.

The hybrid KCF–KSR method predicts the rating of a document $d$ for worker $u$ engaged in task $v$ by Eq. (17), and then determines which documents should be recommended.

$$\hat{p}_{u,d}^{v} = \beta_{KCF\text{-}KSR} \times \left[ \bar{r}_u^v + \frac{\sum_{x^l \in KNB_u^v} sim(TKF_u^v, TKF_x^l) \times (r_{x,d}^{l} - \bar{r}_x^l)}{\sum_{x^l \in KNB_u^v} |sim(TKF_u^v, TKF_x^l)|} \right] + (1 - \beta_{KCF\text{-}KSR}) \times \hat{p}_{u,v,d}^{KSR}, \tag{17}$$

where $\bar{r}_u^v$/$\bar{r}_x^l$ is the average rating of documents given by the target worker $u$/worker $x$ engaged in task $v$/$l$; $r_{x,d}^{l}$ is the rating of a document $d$ for task $l$ given by worker $x$; $TKF_u^v$/$TKF_x^l$ denotes the topic-level KF of the target worker $u$/worker $x$ for task $v$/task $l$; $sim(TKF_u^v, TKF_x^l)$ is the KF similarity of worker $u$ and worker $x$, derived by Eq. (8); $KNB_u^v$ is the set of neighbors of the target worker $u$ for task $v$, selected according to their KF similarity scores; $\hat{p}_{u,v,d}^{KSR}$ is the predicted rating of a document $d$ based on the KSR method; and $\beta_{KCF\text{-}KSR}$ is the weighting used to adjust the relative importance of the KCF method and the KSR method.

Fig. 4. The framework of the hybrid PCF–KSR method.

[Fig. 5. The framework of the hybrid KCF–KSR recommendation method.]

According to Eq. (17), a document in a recommended topic has a higher priority for recommendation than those documents that are not in the recommended topic. The KCF–KSR method considers the KF similarity of two workers, their preferences for documents, and topic sequences in the KF when making recommendations.

5.5. The hybrid ICF–KSR method

The hybrid ICF–KSR recommendation method linearly combines the item-based CF method (ICF) with the KSR method to recommend documents to a target worker, as shown in Fig. 6. The ICF method is the traditional item-based CF method (Sarwar et al., 2001) described in Section 2.6. The similar documents (neighbors) of a target document are selected based on the adjusted cosine similarities of the documents (Eq. (6)). Then, the predicted rating of the target document is computed by taking the weighted average of the target worker’s ratings for similar documents (Eq. (5)).

The ICF method does not consider workers’ referencing behavior when they perform tasks. To address this issue, we propose the hybrid ICF–KSR method, which integrates traditional item-based collaborative filtering and the KSR method to recommend documents that may meet workers’ information needs. The ICF–KSR approach predicts the rating of a document by linearly combining the predicted ratings calculated by the two methods. One part of the rating is derived by the ICF method based on the target worker’s ratings for documents similar to the target document. The other part is derived by the KSR method described in Section 5.2. A worker’s knowledge flow may change over time. Thus, to make recommendations adaptively, the hybrid method considers documents similar to the target document, the worker’s perceptions about the usefulness of the documents, and the topic sequences in his/her KF.
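The item-based part and the linear combination described above can be sketched in Python. The sketch assumes the standard adjusted cosine similarity of item-based CF, in which each worker’s mean rating is subtracted before item vectors are compared, and a weighted average of the target worker’s ratings over the similar documents. The rating layout (`{worker: {document: rating}}`), the function names, the weighting value and the KSR prediction passed in as a plain number are all hypothetical.

```python
import math

def adjusted_cosine(ratings, d, i):
    """Adjusted cosine similarity of documents d and i over the workers who
    rated both; each rating is offset by that worker's mean rating."""
    num = den_d = den_i = 0.0
    for worker_ratings in ratings.values():
        if d in worker_ratings and i in worker_ratings:
            mean = sum(worker_ratings.values()) / len(worker_ratings)
            dev_d = worker_ratings[d] - mean
            dev_i = worker_ratings[i] - mean
            num += dev_d * dev_i
            den_d += dev_d ** 2
            den_i += dev_i ** 2
    den = math.sqrt(den_d) * math.sqrt(den_i)
    return num / den if den else 0.0

def icf_predict(target_ratings, similar_docs):
    """Weighted average of the target worker's ratings for documents similar
    to the target document. similar_docs maps doc id -> similarity."""
    den = sum(abs(s) for s in similar_docs.values())
    if den == 0:
        return None  # no similar document rated by the target worker
    return sum(s * target_ratings[i] for i, s in similar_docs.items()) / den

def icf_ksr_predict(beta, icf_rating, ksr_rating):
    """Linear combination of the ICF and KSR predictions."""
    return beta * icf_rating + (1.0 - beta) * ksr_rating

# Hypothetical ratings by two workers.
ratings = {
    "w1": {"a": 5, "b": 4, "c": 1},
    "w2": {"a": 4, "b": 5, "c": 2},
}
similarity_ab = adjusted_cosine(ratings, "a", "b")

# Hypothetical similarities of the target document to two rated documents.
icf = icf_predict({"d1": 5, "d2": 3}, {"d1": 0.9, "d2": 0.3})
print(round(icf, 2))                             # 4.5
print(round(icf_ksr_predict(0.6, icf, 4.0), 2))  # 4.3
```

Because the item-based part uses only the target worker’s own ratings, the KSR term is what brings the neighbors’ referencing behavior into the prediction.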

The hybrid ICF–KSR method predicts a rating for a document d for worker u performing a task $v$ by using Eq. (18), and then determines the documents that should be recommended.

$$\hat{p}^{v}_{u,d} = \beta_{\text{ICF-KSR}} \left[ \frac{\sum_{i \in I_d} ACSim(d,i) \times r^{v}_{u,i}}{\sum_{i \in I_d} \left| ACSim(d,i) \right|} \right] + \left(1 - \beta_{\text{ICF-KSR}}\right) \times \hat{p}^{KSR}_{u,v,d}, \qquad (18)$$

where $r^{v}_{u,i}$ is the rating of the usefulness of a document i given by worker u for task $v$; ACSim(d, i) is the adjusted cosine similarity between document d and document i; $I_d$ is the set of documents similar to document d, selected according to their adjusted cosine similarities; $\hat{p}^{KSR}_{u,v,d}$ is the predicted rating of document d for the target worker u engaged in task $v$ based on the KSR method; and $\beta_{\text{ICF-KSR}}$ is the weighting used to adjust the relative importance of the ICF method and the KSR method. According to Eq. (18), a document in a recommended topic has a higher priority for recommendation than documents that are not in the recommended topic.

6. Experiments and evaluations

In this section, we conduct experiments to compare and evaluate the recommendation quality of the hybrid PCF–KSR, KCF–KSR and ICF–KSR methods, and then discuss the experimental results. We describe the experiment setup in Section 6.1, present the experiment results and evaluations in Section 6.2, and discuss the findings in Section 6.3.

6.1. Experiment setup

To demonstrate that knowledge flows can support the recommendation of task-relevant knowledge (documents) to knowledge workers, experiments were conducted on a dataset from a real application domain, namely, research tasks in the laboratory of a research institute. The dataset contained information about the access behavior of each knowledge worker engaged in performing a specific task, e.g., writing a research paper or conducting a research project. To accomplish their tasks, the workers needed various documents (research papers). Besides the documents, other information, such as when the documents were referenced and the document ratings, is necessary for implementing our methods. Since it is difficult to obtain such a dataset, using the real application domain restricts the sample size of the data in our experiments.

The dataset, which is used to evaluate the proposed methods, is based on the referencing behavior of 14 knowledge workers in a research laboratory and comprises 424 research papers. Specifically, it contains information about the content of the documents, the times they were referenced, and the document ratings given by workers. For each worker, the documents and the times at which they were referenced are used to identify the worker's referencing behavior when performing a task.

Fig. 6. The framework of the hybrid ICF–KSR method. [Figure omitted: starting from the workers' KFs and the document ratings, the KF-based sequential rule method (identifying similar workers, mining knowledge referencing behavior, identifying the target worker's referencing behavior) and the item-based CF method (computing adjusted cosine similarity) are linearly combined for document recommendation, yielding a recommendation list of the top-N documents.]


The document rating, which is given by a worker on a scale of 1–5, indicates whether a document is perceived as useful and relevant to a task. A high rating, i.e., 4 or 5, indicates that the document is perceived as useful and relevant to the task at hand, while a low rating, i.e., 1 or 2, suggests that the document is deemed not useful. If a document has been referenced by a worker without being assigned a rating value, it is given a default rating of 3.

In our experiment, the dataset is divided according to the time order of the documents accessed by knowledge workers as follows: 70% for training and 30% for testing. The testing set contains the documents with the most recent access times. The training set is used to generate recommendation lists, while the test set is used to verify the quality of the recommendations. In the experiments, we evaluate and compare the performance of traditional CF methods and our KF-based recommendation methods, namely the hybrid PCF–KSR method, the hybrid KCF–KSR method, and the hybrid ICF–KSR method.
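The time-ordered 70/30 division can be sketched as follows. This is a minimal illustration under stated assumptions: each access record is a hypothetical `(access_time, doc_id, rating)` tuple, the split is applied to the records of a single worker (the text does not say whether the split is per worker or over the whole dataset), and the boundary is rounded to the nearest record.

```python
def chronological_split(records, train_fraction=0.7):
    """Split access records into an older training part and a newer test part.

    records: iterable of (access_time, doc_id, rating) tuples (hypothetical layout).
    Returns (training_records, test_records), both sorted by access time.
    """
    ordered = sorted(records, key=lambda record: record[0])
    # Assumption: round the 70% boundary to the nearest whole record.
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Usage with made-up records; integers stand in for timestamps.
records = [
    (3, "c", 4), (1, "a", 5), (2, "b", 3), (5, "e", 2), (4, "d", 1),
    (6, "f", 3), (7, "g", 3), (8, "h", 4), (9, "i", 5), (10, "j", 2),
]
train, test = chronological_split(records)
```

Splitting by time rather than at random keeps every test document later than every training document, which mirrors the situation of recommending documents for a worker's upcoming needs.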

We use the Mean Absolute Error (MAE), which is widely used in recommender systems (Breese et al., 1998; Herlocker et al., 1999; Herlocker et al., 2004; Shardanand and Maes, 1995), to evaluate the quality of recommendations derived by our methods. MAE measures the average absolute deviation between a predicted rating and the user's true rating (Shardanand and Maes, 1995), as shown in Eq. (19).

$$\mathrm{MAE} = \frac{\sum_{i \in Z,\, i=1}^{n} \left| p_i - q_i \right|}{n}, \qquad (19)$$

where MAE is the mean absolute error; Z is the test set of a target worker, which consists of n predicted documents; $p_i$ is the predicted rating of document i; and $q_i$ is the real rating of document i. The lower the MAE, the more accurate the method will be. The advantages of this measurement are that its computation is simple and easy to understand and it has well-studied statistical properties for testing the significance of a difference.
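As a small worked illustration of Eq. (19), the following Python sketch computes the MAE for one worker's test set. The function name `mean_absolute_error` and the sample numbers are made up for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute deviation between predicted and real ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("the test set must contain at least one document")
    total = sum(abs(p - q) for p, q in zip(predicted, actual))
    return total / len(predicted)


# Made-up test set of three documents: deviations are 1, 0 and 2.
mae = mean_absolute_error([4.0, 3.0, 5.0], [5, 3, 3])
```

Here the three absolute deviations sum to 3, so the MAE over the n = 3 documents is 1.0.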

6.2. Experiment results

We conduct several experiments to measure the quality of recommendations derived by our methods. To generate topic-level KFs, the documents in the data set are grouped into clusters by the single-link hierarchical clustering method described in Section 4.1. To determine the threshold value that yields the best clustering result, we adjust the threshold systematically from 0.5 down to 0.2 in decrements of 0.05 to generate different clustering results, each of which is evaluated by using the quality measure defined in Section 2.3.1. The clustering with the best quality measure, obtained by setting the threshold value at 0.3, is selected as our clustering result; it contains 8 clusters. Based on the clustering results, topic-level KFs are generated by mapping documents from the codified-level KFs into their corresponding clusters for each knowledge worker. Finally, by considering the topic-level and codified-level KFs, the hybrid PCF–KSR and KCF–KSR methods recommend task-related documents to users. In the following subsections, we discuss the experiment results.
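The threshold sweep can be sketched in Python as below. This is an illustrative sketch only: it relies on the general fact that cutting a single-link hierarchy at a similarity threshold yields the connected components of the graph whose edges join documents with similarity at or above that threshold. The quality measure of Section 2.3.1 is not reproduced here, so `quality` is a hypothetical caller-supplied function, and the sketch assumes that a higher quality score is better. The names `single_link_clusters` and `sweep_thresholds` are not from the paper.

```python
def single_link_clusters(similarity, threshold):
    """Single-link clusters at a similarity cut, via union-find.

    similarity: symmetric n x n list of lists of pairwise document similarities.
    Returns a sorted list of clusters, each a sorted list of document indices.
    """
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def sweep_thresholds(similarity, quality):
    """Try thresholds 0.50, 0.45, ..., 0.20 and keep the best-scoring one.

    quality: hypothetical function mapping a clustering to a score,
    assumed here to be higher-is-better; ties go to the larger threshold.
    """
    thresholds = [round(0.5 - 0.05 * step, 2) for step in range(7)]
    scored = [(quality(single_link_clusters(similarity, t)), t) for t in thresholds]
    _, best_threshold = max(scored)
    return best_threshold, single_link_clusters(similarity, best_threshold)


# Made-up similarities for four documents and a toy quality function
# that simply prefers clusterings with exactly two clusters.
sim = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.35],
    [0.1, 0.1, 0.35, 1.0],
]
best_t, best_clusters = sweep_thresholds(sim, lambda cs: -abs(len(cs) - 2))
```

In practice the toy quality function would be replaced by the measure the paper defines, and the similarity matrix would come from the document profiles.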

6.2.1. Evaluation of the hybrid PCF–KSR method

In this experiment, we evaluate the performance of the hybrid PCF–KSR method. Two parameters, $\alpha$ and $\beta_{\text{PCF-KSR}}$, may affect the quality of the recommendations: $\alpha$ is used to calculate the KF similarity (Eq. (8)), while $\beta_{\text{PCF-KSR}}$ is used to predict a document's rating. We set various values for these parameters and determine the settings that yield the best recommendation performance. The value of $\alpha$ was adjusted systematically in increments of 0.1, and the value with the lowest MAE was chosen as the best setting. Based on the experiment results, we set $\alpha = 0.3$ in all the following experiments.

We evaluate how the $\beta_{\text{PCF-KSR}}$ values and the number of neighbors, k, affect the recommendation quality, as shown in Fig. 7. The parameter $\beta_{\text{PCF-KSR}}$, whose value ranges from 0.1 to 1, represents the relative importance of the PCF method and the KSR method in Eq. (16). The experiment was conducted using various numbers of neighbors (parameter k) to derive the predicted ratings. Fig. 7 shows that the lowest MAE value generally occurs when $\beta_{\text{PCF-KSR}}$ is 0.5.
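The parameter search described above amounts to a small grid search that keeps the setting with the lowest MAE. The following Python sketch shows one way to organize it. The function `tune_parameters` and the callable `evaluate_mae(beta, k)` are hypothetical: the latter stands in for a full run of the recommendation method on the training and test sets, which is not reproduced here. The grid values (weights from 0.1 to 1 in steps of 0.1, and k from 4 to 14 in steps of 2) follow the axes of Fig. 7.

```python
def tune_parameters(evaluate_mae, betas=None, ks=(4, 6, 8, 10, 12, 14)):
    """Return (best_beta, best_k, best_mae) over a grid of weights and neighbors.

    evaluate_mae: hypothetical callable (beta, k) -> MAE on the test set.
    """
    if betas is None:
        # 0.1, 0.2, ..., 1.0, rounded to avoid floating-point drift.
        betas = [round(0.1 * step, 1) for step in range(1, 11)]
    scores = {(beta, k): evaluate_mae(beta, k) for beta in betas for k in ks}
    best_beta, best_k = min(scores, key=scores.get)
    return best_beta, best_k, scores[(best_beta, best_k)]


# Toy stand-in for the real evaluation, minimal at beta = 0.5 and k = 8.
def toy_mae(beta, k):
    return abs(beta - 0.5) + 0.01 * abs(k - 8)


best = tune_parameters(toy_mae)
```

Keeping every score in a dictionary, rather than only the minimum, also makes it straightforward to plot MAE against the weight for each k, as Fig. 7 does.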

Fig. 8 compares the hybrid PCF–KSR method with the traditional CF method (PCF method). The predicted rating of a document is derived in two parts, by the PCF method and the KSR method respectively. The part derived by the PCF method is based on the document ratings of the target worker's neighbors, while the other part is derived by the KSR method based on documents in the recommended topics and sequential rules generated from the KFs of the target worker's neighbors. If a document is in the recommended topic, the KSR part of PCF–KSR can be used to adjust the predicted rating of the document. Therefore, the PCF–KSR method ensures that documents in the recommended topics have a high priority for recommendation to the target worker. In the experiment, we set $\alpha = 0.3$ and $\beta_{\text{PCF-KSR}} = 0.5$, and select the top-5 sequential rules with high rule matching scores. The experiment results show that the PCF–KSR method outperforms the traditional CF method (PCF method) under various numbers of neighbors (parameter k). That is, adding the KSR method improves the quality of the recommendations derived by the PCF method alone.

Fig. 7. The performance of the hybrid PCF–KSR method with various k and $\beta_{\text{PCF-KSR}}$ values. [Plot omitted: MAE (y-axis, 0.6 to 0.95) against $\beta_{\text{PCF-KSR}}$ (x-axis, 0.1 to 1) for k = 4, 6, 8, 10, 12 and 14.]

[Fig. 8 plot omitted: MAE (y-axis, 0.6 to 1) against the number of neighbors k (x-axis, 4 to 14) for the PCF and PCF–KSR methods.]

