Mining group-based knowledge flows for sharing task knowledge


Duen-Ren Liu ⁎, Chin-Hui Lai

Institute of Information Management, National Chiao Tung University, Taiwan

Article info

Article history:
Received 16 December 2009
Received in revised form 9 September 2010
Accepted 26 September 2010
Available online 1 October 2010

Keywords:
Knowledge flow
Group-based knowledge flow
Knowledge graph
Knowledge sharing
Data mining
Topic
Task

Abstract

In an organization, knowledge is the most important resource in the creation of core competitive advantages. It is circulated and accumulated by knowledge flows (KFs) in the organization to support workers' task needs. Because workers accumulate knowledge of different domains, they may cooperate and participate in several task-based groups to satisfy their needs. In this paper, we propose algorithms that integrate information retrieval and data mining techniques to mine and construct group-based KFs (GKFs) for task-based groups. A GKF is expressed as a directed knowledge graph which represents the knowledge referencing behavior, or knowledge flow, of a group of workers with similar task needs. Task-related knowledge topics and their relationships (flows) can be identified from the knowledge graph so as to fulfill workers' task needs and promote knowledge sharing for collaboration of group members. Moreover, the frequent knowledge referencing path can be identified from the knowledge graph to indicate the frequent knowledge flow of the workers. To demonstrate the efficacy of the proposed methods, we implement a prototype of the GKF mining system. Our GKF mining methods can enhance organizational learning and facilitate knowledge management, sharing, and reuse in an environment where collaboration and teamwork are essential.

1. Introduction

In an organization, knowledge is the most important resource used to create core competitive advantages. Generally, knowledge and expertise in an organization are codified in textual documents, e.g., papers, manuals and reports, and preserved in a knowledge database. Large amounts of such codified knowledge are circulated and accumulated in an organization to support knowledge workers engaged in diverse tasks and activities. To preserve, share and reuse these valuable assets, organizations need to adopt appropriate knowledge management strategies to support knowledge workers intelligently [26,28]. Knowledge management, which is widely utilized in organizations, is important for preserving and sharing knowledge efficiently[14,36].

Knowledge management systems (KMS) facilitate the preservation, reuse and sharing of knowledge, and also support collaboration among workers. Based on a task's specifications and the process context of the task, the KnowMore system [1] provides context-aware knowledge retrieval and delivery functions to support the procedural activities of workers. The task-based K-Support system [23,24,38] provides knowledge support adaptively to meet a worker's dynamic information needs by analyzing his/her access behavior. Moreover, knowledge workers may cooperate with each other to accomplish their tasks. Task knowledge can be transmitted, shared and accumulated from one team member/process to another. Therefore, working knowledge flows between workers in an organization, while process knowledge flows between various tasks [39,41]. Zhuge [39] proposed a management mechanism for knowledge sharing, and integrated the knowledge flow with the workflow to assist workers. Furthermore, knowledge flows (KFs) can be used to represent the long-term evolution of workers' information needs [22]. Based on those needs, the knowledge flow-based document recommendation method proactively delivers task-relevant topics and documents to the workers.

To work more efficiently, workers conducting similar (relevant) tasks or cooperative tasks generally have similar task-related information needs, and can form a group to facilitate knowledge reuse, cooperation and sharing. A group of workers can be identified explicitly by specifying which tasks are similar or cooperative tasks so that the knowledge workers conducting those tasks can form a group. Alternatively, a group of workers can be identified implicitly based on their referencing behavior as proposed in this work. Under this approach, it is assumed that workers with similar knowledge flows work on similar or cooperative tasks, and thus have similar task-related information needs. Workers in the same group may not have exactly the same knowledge flows, but they may adopt similar referencing behavior when performing tasks. Common or frequently referenced task-related knowledge items in group members' knowledge flows represent the core knowledge that the workers require to perform their tasks. Since such knowledge is implicit in the knowledge flows of a group of workers, it cannot be represented solely by an individual's personal knowledge flow. To facilitate knowledge reuse, cooperation and sharing, discovering such common/frequently accessed knowledge is important for workers who perform similar tasks and have similar referencing behavior patterns. Accordingly, we propose group-based knowledge flow mining algorithms that model a group's frequent referencing behavior by identifying frequent topics of interest, major referencing behavior patterns, and the long-term evolution of the group's information needs.

⁎ Corresponding author. E-mail address: dliu@mail.nctu.edu.tw (D.-R. Liu).

Decision Support Systems, journal homepage: www.elsevier.com/locate/dss

Because the information needs of workers or groups may change over time, it is difﬁcult to model their knowledge referencing behavior. Obviously, recognizing those needs, delivering the required task-related knowledge, and facilitating knowledge sharing/reuse are important issues that must be addressed in a knowledge intensive organization. However, to the best of our knowledge, there is no appropriate approach for analyzing and constructing KFs from the perspective of a group's information needs; and very little research effort has been expended on KF mining for task-based groups.

To address the above research gaps, we propose algorithms that integrate information retrieval and data mining techniques for mining and constructing the KFs of groups. In our previous work [22], we presented a KF mining approach that identifies each knowledge worker's KF. Here, we extend that approach and focus on discovering group-based knowledge flows (GKFs). From the group-based knowledge flow, workers can discover the knowledge frequently accessed by group members. They can also share their own knowledge with others to facilitate knowledge reuse, cooperation and sharing. From the perspective of a task's execution, a group-based knowledge flow is an important knowledge asset when conducting a task that is similar or relevant to the tasks from which the group-based knowledge flow was derived. For example, a group-based knowledge flow derived from the knowledge flows of several researchers working on Social Network Analysis (SNA) related tasks would be helpful to a new researcher who has just started working on an SNA-related research task.

Specifically, we discover a group's KF from the KFs of workers who exhibit similar knowledge referencing behavior patterns. First, based on the workers' logs, we analyze each worker's referencing behavior when acquiring task-related knowledge, and then construct his/her KF as described in[22]. We then use a clustering method to identify a group of workers with similar task-related information needs based on the workers' KF similarities. Workers in the same group generally need similar codified knowledge to perform their tasks. In addition, workers in the same group may adopt different behavior when referencing task-related knowledge. Therefore, we propose GKF mining algorithms to discover the referencing behavior patterns of a group of workers. Second, we apply the concepts of graph theory to visualize the GKF as a knowledge graph in which a vertex and an edge indicate, respectively, a topic domain and a direct flow relation between two topic domains. Task-related knowledge topics and their relationships (flows) can be identified from the knowledge graph to fulfill workers' task needs when they reference task-relevant knowledge.

Frequent knowledge referencing paths (patterns) can also be identified based on the edge weights in the graph. The paths represent the workers' frequent knowledge referencing behavior and important knowledge flows in the group. Finally, to demonstrate the efficacy of the proposed method, we implement a prototype system for mining the GKF of a group of workers. The system provides useful functions that allow users to simplify the KF mining process and visualize KFs graphically.

The remainder of this paper is organized as follows. Section 2 provides a brief overview of related work. In Sections 3 and 4, we introduce our proposed algorithms for mining a group-based knowledge flow (GKF), which is then used to construct a knowledge graph based on the workers' referencing behavior. Section 5 describes a prototype system that we implement based on the proposed algorithms. Then, in Section 6, we summarize our conclusions and consider future research directions.

2. Background and related work

In this section, we review work related to our research, including the concepts of knowledge flow, information retrieval and task-based knowledge support, document clustering and process mining.

2.1. Knowledge flow

Knowledge flows among people and processes to facilitate knowledge sharing and reuse. The concept of knowledge flow has been applied in various domains, e.g., scientific research, communities of practice, teamwork, industry, and organizations [5,21,40]. Scholarly articles represent the major medium for disseminating knowledge among scientists to inspire new ideas [40]. A citation implies that there is a knowledge flow between the citing article and the cited article. Such citations form a knowledge flow network that enables knowledge to flow between different scientific projects to promote interdisciplinary research and scientific development.

A knowledge flow model enhances the effectiveness of teamwork by accumulating and sharing knowledge among team members to facilitate peer-to-peer knowledge sharing [39]. To improve the efficiency of teamwork, Zhuge [41] proposed a pattern-based approach that combines codification and personalization strategies in order to design an effective knowledge flow network. Kim et al. [21] proposed a knowledge flow model combined with a process-oriented approach to capture, store, and transfer knowledge. Knowledge flows in communities of practice help members share their knowledge and experience about a specific domain to complete certain tasks [29]. Luo et al. [25] proposed the discovery of textual knowledge flow based on the semantic link network. A context-based knowledge flow has been proposed to reflect the major characteristics of a knowledge flow [13]. In an organization, knowledge workers normally have various information needs over time when performing tasks. Thus, a knowledge flow is defined from the perspective of a worker's information needs to represent the evolution of referencing behavior and the knowledge accumulated for a specific task [22].

2.2. Information retrieval and task-based knowledge support

Information retrieval (IR) facilitates access to specific items of information [7]. The vector space model [30] is typically used to represent documents as vectors of index terms, where the weights of the terms are measured by the tf-idf approach; tf denotes the occurrence frequency of a particular term in the document, while idf denotes the term's inverse document frequency. Terms with higher tf-idf weights are used as discriminating terms to filter out common terms. The weight of a term i in a document j, denoted by $w_{i,j}$, is expressed as follows:

\[ w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \left( \log_2 \frac{N}{n} + 1 \right), \tag{1} \]

where $tf_{i,j}$ is the frequency of term i in document j, $idf_i$ is measured by $\log_2(N/n) + 1$, N is the total number of documents in the collection, and n is the number of documents in which term i occurs.
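As a minimal sketch of the weighting in Eq. (1), the following Python snippet computes tf-idf weights for a toy collection. The function name `tfidf_weights` and the sample documents are illustrative assumptions, and tokenization, stemming and stop-word removal are left out.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with w = tf * (log2(N/n) + 1)."""
    n_docs = len(docs)  # N: total number of documents in the collection
    # n: number of documents in which each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence frequency of each term in this document
        weights.append({
            term: count * (math.log2(n_docs / doc_freq[term]) + 1)
            for term, count in tf.items()
        })
    return weights

# Hypothetical, already tokenized documents.
docs = [
    ["knowledge", "flow", "mining"],
    ["knowledge", "graph"],
    ["process", "mining", "mining"],
]
w = tfidf_weights(docs)
```

In this toy collection, "flow" occurs in one document and "knowledge" in two, so "flow" receives the larger weight within the first document.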

Information retrieval techniques coupled with workflow management systems (WfMS) have been used to support proactive delivery of task-specific knowledge based on the context of tasks within a process [2]. For example, the KnowMore system [1] provides context-aware delivery of task-specific knowledge, while the Kabiria system considers the operational context of task-associated procedures to help knowledge workers retrieve knowledge-based documents [6].


Information filtering with a similarity-based approach is often used to locate knowledge items relevant to the task at hand. The discriminating terms of a task are usually extracted from a knowledge item/task to form a task profile, which is then used to model a worker's information needs. Holz et al. [15] proposed a similarity-based approach that organizes desktop documents and proactively delivers task-specific information to the user; and Liu et al. [23,24] presented a K-Support system to provide effective task support in a task-based work environment.

2.3. Hierarchical document clustering and CLIQUE clustering methods

Document clustering or unsupervised document classification is used in many applications. Most document clustering methods apply pre-processing steps to the document set and represent each document as a vector of index terms. To cluster similar documents, the similarity between documents is usually measured by the cosine measure [7,35], which computes the cosine of the angle between the documents' corresponding feature vectors. Two documents are considered similar if the cosine similarity value is high. The cosine similarity of two documents, X and Y, is

\[ sim_{cos}(X, Y) = \frac{\vec{X} \cdot \vec{Y}}{\lVert \vec{X} \rVert \, \lVert \vec{Y} \rVert}, \]

where $\vec{X}$ and $\vec{Y}$ are the respective feature vectors of X and Y.
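The cosine measure can be sketched in a few lines of Python. Here the feature vectors are assumed to be sparse dictionaries mapping terms to weights; the function name and the sample vectors are illustrative only.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen in this sketch for empty vectors
    return dot / (norm_x * norm_y)

# Hypothetical document vectors.
doc_x = {"knowledge": 1.0, "flow": 2.0}
doc_y = {"knowledge": 2.0, "flow": 4.0}   # same direction as doc_x
doc_z = {"process": 3.0}                  # no terms shared with doc_x
```

Vectors pointing in the same direction (`doc_x`, `doc_y`) yield a similarity of about 1, while vectors with no shared terms (`doc_x`, `doc_z`) yield 0.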

Agglomerative hierarchical clustering [18,20] is a popular document clustering method. In this work, we use the single-link clustering method [17] to cluster codified knowledge (documents) into topic domains. The single-link method computes the similarity between two clusters, which is equal to the greatest similarity between any document in one cluster and any document in the other cluster.
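The single-link rule can be illustrated with a small agglomerative loop. This is only a sketch under stated assumptions: documents are dense tuples of weights, merging stops when the best inter-cluster similarity falls below a hypothetical `threshold`, and no attempt is made at efficiency.

```python
import math
from itertools import product

def cosine(x, y):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def single_link_similarity(cluster_a, cluster_b):
    """Greatest similarity between any document pair across the two clusters."""
    return max(cosine(a, b) for a, b in product(cluster_a, cluster_b))

def single_link_clustering(docs, threshold):
    """Repeatedly merge the two most similar clusters (stopping rule is assumed)."""
    clusters = [[d] for d in docs]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = single_link_similarity(clusters[i], clusters[j])
                if s > best_sim:
                    best_sim, best_pair = s, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Four hypothetical documents forming two obvious topic domains.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters = single_link_clustering(docs, threshold=0.8)
```

With this threshold the first two and the last two documents end up in separate topic domains.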

We also apply the CLIQUE clustering method [3,17] to derive worker groups. CLIQUE starts with the definition of a unit, i.e., an elementary rectangular cell in a subspace, and uses a bottom-up approach to find units whose densities exceed a threshold. All the subspaces are sorted by their coverage and those with less coverage are pruned. A cluster is then defined as a maximal set of connected dense units.

2.4. Process mining

In a workflow system, a process mining technique is used to extract the description of a structural process from a set of real process executions [33]. It then infers the relations between the tasks/ activities and generates a process model from event-based data (log data) automatically[4,32,34]. The relations between processes (tasks/ activities) are defined as causal relations and parallel relations[31], and are modeled by a directed graph[4,12]or an instance graph[34]. Because a workflow log contains information about workflow processes, a loop may occur in a process. Most process mining algorithms assume that loops do not exist[12,34]. However, some algorithms have been proposed to handle the problem of process loops[4,11,33]. For example, Agrawal et al.'s algorithm[4]builds a general directed graph with cycles for mining process models from the logs of executed processes. The algorithm labels multiple instances of the same activity with different identifiers to differentiate them in the workflow graph. Vertices with different instances of the same activity form an equivalent set and can be merged to form one vertex. A directed edge is added if there is an edge between two vertices of different equivalent sets.

Process mining is used in various applications. Discovering frequently occurring temporal patterns in process instances facilitates intelligent and automatic extraction of useful knowledge to support business decision-making [16]. Similarly, data mining techniques are exploited in workflow management contexts to mine frequent workflow execution patterns [12]. The sequence of activities within a process, the execution cost and the reliability of the process can be predicted by using the process path mining technique [8]. Based on the process patterns and process paths, unexpected and useful knowledge about the process is extracted to help the user make appropriate decisions. In addition, a formal approach is proposed to discover process models from business policies [37].

3. Group-based knowledge flow mining

A knowledge flow (KF) represents a knowledge worker's long-term information needs and accumulated task-related knowledge when he/she performs a task. In a previous work, we proposed a KF mining method to obtain each worker's KF from his/her work log [22]. We also presented document recommendation methods to support workers in the execution of tasks and facilitate knowledge sharing in an organization. In the context of collaboration, workers usually have similar referencing behavior patterns, in which they share common topics or documents they find useful, or they reference task-related knowledge in a similar order. To model the common referencing behavior of a group, we propose a method for mining a group-based knowledge flow (GKF) from the KFs of a group of workers.

Fig. 1 provides an overview of the proposed method for mining GKFs. Based on the workers' KFs, workers with similar topic-level KFs are clustered together to form a task-based group. Members of the group have task-related knowledge or similar referencing behavior in terms of the topics of interest and the order in which the topics were referenced in their KFs. To identify similar referencing behavior from the KFs, we propose KF mining algorithms based on process mining and graph theory to discover a group's knowledge flow. The algorithms identify common information needs and referencing patterns from the KFs of a group of workers, and then build a group-based knowledge flow (GKF) model. Then, a frequent knowledge path is identified from the model to represent the referencing (learning) patterns of the group and to support group members, especially novices, in accessing and learning a group's knowledge. In this work, we focus on two issues: 1) how to construct a group-based knowledge flow (GKF) model for a group of knowledge workers with similar KFs; and 2) how to identify frequent referencing patterns (paths) from the GKF model.

A group-based knowledge flow (GKF) is derived from the knowledge flows (KFs) of the group members. Mining and constructing the GKF is the main focus of this work. As such, it extends our previous work on mining an individual's knowledge flows [22]. To make the explanation of the proposed GKF mining method clear, Section 3.1 summarizes the fundamental definitions and methods used to generate a worker's codified-level and topic-level KFs. In Section 3.2, we cluster workers with similar KFs into groups, based on the KFs described in Section 3.1. The information in Sections 3.1 and 3.2 is a summary of the fundamental concepts of knowledge flow mining discussed in our previous paper [22]. The concepts provide the basis for this work. Readers may refer to our previous publication for further details. In Section 3.3, we describe the steps of the proposed group-based KF mining. Moreover, several important concepts and features used in the GKF mining algorithms are presented.

3.1. Knowledge flow mining

From the perspective of information needs, a worker's knowledge ﬂow (KF) represents the evolution of his/her information needs and preferences during a task's execution. Workers' KFs are identiﬁed by analyzing their knowledge referencing behavior based on their historical work logs, which contain information about previously executed tasks, task-related documents and when the documents were accessed.

A KF consists of two levels: a codified level and a topic level. The knowledge in the codified level indicates the knowledge flow between documents based on the access time. In most situations, the knowledge obtained from one document prompts a knowledge worker to access the next relevant document (codified knowledge). Hence, the task-related documents are sorted by their access time to obtain a document sequence as the codified-level KF. Documents are clustered into topic domains by using the agglomerative hierarchical clustering method described in Section 2.3. Each topic may contain several task-related documents. The codified-level KF can be abstracted to form a topic-level KF, which represents the transitions between various topics. Formally, we define the knowledge flows as follows.

Deﬁnition 1. Knowledge ﬂow (KF).

Let the knowledge flow of a worker, w, for a specific task be $KFlow_w = \{TKF_w, CKF_w\}$, where $TKF_w$ is the worker's topic-level KF and $CKF_w$ is his/her codified-level KF.

Deﬁnition 2. Codiﬁed-level KF.

A codified-level KF is a time-ordered sequence arranged according to the access times of the documents it contains. Formally, it is defined as $CKF_w = \langle d_w^{t_1}, d_w^{t_2}, \cdots, d_w^{t_f} \rangle$ with $t_1 < t_2 < \cdots < t_f$, where $d_w^{t_j}$ denotes the document that the worker w accessed at time $t_j$ for a specific task. Each document can be represented by a document profile, which is an n-dimensional vector containing weighted terms that indicate the key content of the document, as described in Section 2.2.

Deﬁnition 3. Topic-level KF.

A topic-level KF is a time-ordered topic sequence derived by mapping documents in the codified-level KF to corresponding topics. Formally, it is defined as $TKF_w = \langle TP_w^{t_1}, TP_w^{t_2}, \cdots, TP_w^{t_f} \rangle$ with $t_1 < t_2 < \cdots < t_f$, where $TP_w^{t_j}$ denotes the corresponding topic of the document that worker w accessed at time $t_j$ for a specific task. Each topic is represented by a topic profile, which is an n-dimensional vector containing weighted terms that indicate the key content of the topic. The topic profile of a topic is derived from the document profiles of documents contained in that topic by using the centroid approach.

By analyzing a worker's referencing behavior for a specific task, the corresponding knowledge flow is derived by the knowledge flow extraction method. The codified-level KF is extracted from the documents recorded in the worker's work log. The documents are arranged according to the times they were accessed to obtain a document sequence. The topic-level KF, which is derived by mapping documents in the codified-level KF of a specific task into corresponding clusters, is represented by a topic sequence.
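The extraction step described above can be sketched as follows. The `LogEntry` record, the `doc_to_topic` mapping (standing in for the document clustering result) and the sample data are hypothetical; the sketch simply sorts by access time and maps each document to its topic, without merging repeated topics.

```python
from typing import NamedTuple

class LogEntry(NamedTuple):
    access_time: int  # any sortable timestamp works in this sketch
    doc_id: str

def extract_knowledge_flow(work_log, doc_to_topic):
    """Return (topic-level KF, codified-level KF) for one worker and one task."""
    ordered = sorted(work_log, key=lambda entry: entry.access_time)
    ckf = [entry.doc_id for entry in ordered]        # documents in access order
    tkf = [doc_to_topic[doc_id] for doc_id in ckf]   # each document mapped to its topic
    return tkf, ckf

# Hypothetical work log (unordered) and clustering result.
work_log = [LogEntry(3, "d3"), LogEntry(1, "d1"), LogEntry(2, "d2")]
doc_to_topic = {"d1": "A", "d2": "A", "d3": "B"}
tkf, ckf = extract_knowledge_flow(work_log, doc_to_topic)
```

Here the codified-level KF is the document sequence d1, d2, d3 and the topic-level KF is the topic sequence A, A, B.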

Fig. 1. An overview of mining group-based knowledge flows.

3.2. Clustering similar workers based on their knowledge flows

To find a target worker's neighbors, his/her topic-level KF is compared with those of other workers to compute the similarity of their KFs. The resulting similarity measure indicates whether the KF referencing behavior of two workers is similar. Since the KFs are sequences, the sequence alignment method [9,27], which computes the cost of aligning two sequences, can be used to measure the similarity of two KF sequences. Based on this concept, we use a hybrid similarity measure, comprised of the KF alignment similarity and the aggregated profile similarity, to evaluate the similarity of two workers' KFs, as shown in Eq. (2).

\[ sim(TKF_i, TKF_j) = \alpha \times sim_a(TKF_i, TKF_j) + (1 - \alpha) \times sim_p(AP_i, AP_j), \tag{2} \]

where $sim_a(TKF_i, TKF_j)$ represents the KF alignment similarity, $sim_p(AP_i, AP_j)$ represents the aggregated profile similarity, and α is a parameter used to adjust the relative importance of the two types of similarity. Here, we give a brief explanation. Further details are provided in [22].

The KF alignment similarity is comprised of two parts: the KF alignment score, which measures the topics in the sequence; and the join coefficient, which estimates the topic's coverage in two compared topic-level KFs. We modify the sequence alignment method [9] to derive the KF alignment score. We also estimate the overlap of the topics in two compared topic-level KFs by using Dice's coefficient [35]. The rationale is that if the topic overlap is high, the KF alignment similarity of the two compared KFs will also be high. The KF alignment similarity, $sim_a(TKF_i, TKF_j)$, is defined as follows:

\[ sim_a(TKF_i, TKF_j) = Norm(\eta) \times \frac{2 \times |T_i \cap T_j|}{|T_i| + |T_j|}, \tag{3} \]

where $TKF_i$ and $TKF_j$ are the topic-level KFs of workers i and j respectively; η is the KF alignment score; Norm is a normalization function used to transform the value of η into a number between 0 and 1; $T_i$ and $T_j$ are the sets of topics in $TKF_i$ and $TKF_j$ respectively; and $T_i \cap T_j$ is the set of topics common to $TKF_i$ and $TKF_j$.

The aggregated profile similarity, defined as $sim_p(AP_i, AP_j)$, computes the similarity of two workers' KFs based on their aggregated profiles; $AP_i$ and $AP_j$ are the vectors of the aggregated profiles of workers i and j respectively. We use the cosine formula to calculate the similarity between two aggregated profiles. The aggregated profile of a worker i is defined in Eq. (4).

\[ AP_i = \sum_{t=1}^{T} tw_{t,T} \times DP_t, \tag{4} \]

where $tw_{t,T}$ is the time weight of a document referenced at time t in the KF; T is the index of the time the worker accessed the most recent documents in his/her KF; and $DP_t$ is the profile of the document referenced by worker i at time t. The aggregation considers the time decay effect of the documents. Hence, if a document was referenced in the recent past, it is given a higher time weight. The time weight of each document profile is defined as $tw_{t,T} = \frac{t - St}{T - St}$, where St is the start time of the worker's KF.
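The pieces of Eqs. (2) to (4) can be combined in a short sketch. Several assumptions are made: the normalized alignment score Norm(η) is passed in as a number because the alignment method itself is detailed in [22]; the time weight is read as (t − St)/(T − St), which agrees with the statement that recently referenced documents get higher weights; and all function names and sample values are hypothetical.

```python
import math

def dice_coefficient(topics_i, topics_j):
    """Dice's coefficient 2|Ti ∩ Tj| / (|Ti| + |Tj|) over the topic sets of two TKFs."""
    ti, tj = set(topics_i), set(topics_j)
    if not ti and not tj:
        return 0.0
    return 2 * len(ti & tj) / (len(ti) + len(tj))

def aggregated_profile(doc_profiles, start_time):
    """doc_profiles: list of (access_time, {term: weight}); returns the sum of tw * DP."""
    latest = max(t for t, _ in doc_profiles)  # T: most recent access time
    span = latest - start_time
    ap = {}
    for t, profile in doc_profiles:
        tw = (t - start_time) / span if span else 1.0  # assumed time weight
        for term, weight in profile.items():
            ap[term] = ap.get(term, 0.0) + tw * weight
    return ap

def cosine(x, y):
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

def kf_similarity(norm_alignment, tkf_i, tkf_j, ap_i, ap_j, alpha=0.5):
    """Hybrid similarity: alpha * sim_a + (1 - alpha) * sim_p."""
    sim_a = norm_alignment * dice_coefficient(tkf_i, tkf_j)
    sim_p = cosine(ap_i, ap_j)
    return alpha * sim_a + (1 - alpha) * sim_p

# Hypothetical inputs.
tkf_i, tkf_j = ["A", "B", "C"], ["B", "C", "D"]
ap = aggregated_profile([(1, {"x": 1.0}), (3, {"y": 2.0})], start_time=0)
score = kf_similarity(0.5, tkf_i, tkf_j, ap, ap, alpha=0.4)
```

A larger α emphasizes agreement in topic order and overlap, while a smaller α emphasizes the content of the referenced documents.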

In this paper, we use the CLIQUE clustering method [3,17], as described in Section 2.3, to cluster knowledge workers based on a similarity matrix of their KFs. Each entry in the similarity matrix represents the degree of KF similarity between two workers, derived by Eq. (2). Workers in the same cluster are highly connected with each other because they have similar referencing behavior and information needs in topic domains. To identify each group's GKF, we apply our group-based knowledge flow mining method to the clustering results.

3.3. The group-based knowledge flow mining process

The proposed method comprises three phases: worker clustering, group-based knowledge flow (GKF) mining, and identifying knowledge referencing paths, as shown in Fig. 2. Based on the extracted KFs (Section 3.1), the worker clustering step (Section 3.2) is used to cluster workers with similar KFs as an interest group because they have similar information needs and task-related knowledge to fulfill their tasks. Given the KFs of the workers, we formalize the GKF model to represent the group's information needs by applying the proposed GKF mining algorithms, as described in Section 4. The group-based knowledge flow (GKF) represents the information needs and common referencing behavior of a group of workers. Based on the GKF, workers can share their task knowledge to complete the target task. Moreover, managers can comprehend the information needs of workers and groups to provide knowledge support adaptively.

The GKF is represented by a directed acyclic graph comprised of vertices and edges. Each vertex denotes a topic in a KF, while each directed edge represents the referencing order of two topics. We use graph theory to model a GKF. A GKF graph models the relations between topics, the direction of the knowledge flow and the frequent knowledge paths to describe a group's information needs and referencing behavior. Moreover, a GKF contains several knowledge referencing paths, which indicate the referencing behavior patterns of the group of workers. In addition, frequent referencing behavior patterns of the group of workers, i.e., the paths with scores higher than a user-specified threshold, can be identified from the GKF. Before describing the details of the GKF mining algorithms, we first define several important concepts and features used in the algorithms as follows.

Deﬁnition 4. Knowledge graph.

A knowledge graph is defined as G = (V, E), where V is a finite set of vertices, and E is a finite set of directed edges connecting two topics. Each vertex in V denotes a topic in the knowledge domain, and each edge in E denotes the knowledge flow from one topic to the other topic.

Example. Given a directed knowledge graph comprised of two vertices (topics) $v_x$ and $v_y$ and an edge $e_{x,y}$, the edge connects vertex $v_x$ to $v_y$ directly. In addition, $v_x$ is said to be an adjacent predecessor of $v_y$, while $v_y$ is said to be an adjacent successor of $v_x$.
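One possible in-memory representation of such a knowledge graph is sketched below. The class name `KnowledgeGraph` and its attributes are illustrative assumptions; the sketch only records vertices and the adjacent predecessor/successor relations induced by directed edges.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Directed graph whose vertices are topics and whose edges are knowledge flows."""

    def __init__(self):
        self.vertices = set()
        self.successors = defaultdict(set)    # v_x -> adjacent successors of v_x
        self.predecessors = defaultdict(set)  # v_y -> adjacent predecessors of v_y

    def add_edge(self, vx, vy):
        """Add the directed edge e_{x,y} from topic vx to topic vy."""
        self.vertices.update((vx, vy))
        self.successors[vx].add(vy)
        self.predecessors[vy].add(vx)

    def edges(self):
        return {(vx, vy) for vx, targets in self.successors.items() for vy in targets}

# The two-vertex example from the definition.
g = KnowledgeGraph()
g.add_edge("x", "y")
```

After adding the edge, topic x is an adjacent predecessor of y and y is an adjacent successor of x.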

Deﬁnition 5. Knowledge sub-graph.

Given a knowledge graph G = (V, E), a knowledge sub-graph of G is a graph G' = (V', E'), where V' and E' are subsets of V and E respectively, i.e., V′⊂V and E′⊂E.

A GKF graph represents the referencing behavior of a group of workers as a directed knowledge graph, which consists of aﬁnite set of vertices and edges, deﬁned as follows.

Deﬁnition 6. Group-based knowledge ﬂow (GKF).

As mentioned earlier, a GKF is derived from the KFs of workers who are in the same cluster and therefore have similar information needs. A GKF is defined as GKF = {G, W, TKFS}, where G is a directed knowledge graph; $W = \{w_i \mid i = 1, \cdots, n\}$ is a set of n workers who have similar KFs; and $TKFS = \{TKF_j \mid j = 1, \cdots, n\}$ is a set of topic-level KFs of the workers in W.

The properties of a TKF and the directed knowledge graph G are defined as follows.

Deﬁnition 7. Flow relation and direct ﬂow relation.

In a flow relation of a topic-level KF (TKF), topic x is followed by topic y, denoted by x > y, if topic x was accessed before topic y in the TKF. A topic x is followed directly by another topic y if there does not exist a distinct topic z such that x is followed by z and z is followed by y. In that case, the relation between topics x and y is a direct flow relation, denoted by x → y.
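The two relations in Definition 7 can be illustrated for the simple case of a TKF without duplicate topics, where the direct flow relations coincide with the pairs of consecutive topics. The function names are hypothetical, and the duplicate-topic case is deliberately not covered by this sketch.

```python
def flow_relations(tkf):
    """All ordered pairs (x, y) such that topic x was accessed before topic y."""
    return {(x, y) for i, x in enumerate(tkf) for y in tkf[i + 1:]}

def direct_flow_relations(tkf):
    """Pairs (x, y) with no topic z between x and y; here, consecutive topics."""
    if len(set(tkf)) != len(tkf):
        raise ValueError("this sketch assumes a TKF without duplicate topics")
    return set(zip(tkf, tkf[1:]))

tkf = ["A", "B", "C"]
flows = flow_relations(tkf)
direct = direct_flow_relations(tkf)
```

For the sequence A, B, C, topic A is followed by both B and C, but only A to B and B to C are direct flow relations.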

Deﬁnition 8. Path.

Given a directed graph G, if there is a path from a vertex $v_x$ to another vertex $v_y$, the path is denoted as $v_x \rightsquigarrow v_y$.

Deﬁnition 9. Topic cycle.

Let a flow relation x > y appear in a TKF and a flow relation y > x also appear in another TKF. The relations are represented by their corresponding paths, $v_x \rightsquigarrow v_y$ and $v_y \rightsquigarrow v_x$, on the graph of the GKF. Such relations form a topic cycle between the vertices $v_x$ (topic x) and $v_y$ (topic y) in the GKF.

Deﬁnition 10. Topic loop.

Let x be a duplicate topic in a TKF and let two flow relations x > y and y > x appear in the TKF. These relations are represented by their corresponding paths, $v_x \rightsquigarrow v_y$ and $v_y \rightsquigarrow v_x$, on the graph of the GKF. Such relations form a topic loop between the vertices $v_x$ (topic x) and $v_y$ (topic y) in the GKF.

Deﬁnition 11. Strongly connected component (SCC).

A strongly connected component is a maximal strongly connected sub-graph in which every vertex is reachable from every other vertex in the sub-graph.
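To make the notion concrete, the sketch below finds strongly connected components with Kosaraju's two-pass depth-first search. This is a standard textbook technique chosen here for illustration and is not claimed to be the procedure used in the GKF mining algorithms. The two sample TKFs are hypothetical; they reference topics B and C in opposite orders, which produces a topic cycle.

```python
def strongly_connected_components(graph):
    """graph: {vertex: set of adjacent successors}. Returns a list of vertex sets."""
    vertices = set(graph) | {v for targets in graph.values() for v in targets}
    order, visited = [], set()

    def visit(v):  # first pass: record vertices in order of DFS completion
        visited.add(v)
        for nxt in graph.get(v, ()):
            if nxt not in visited:
                visit(nxt)
        order.append(v)

    for v in vertices:
        if v not in visited:
            visit(v)

    reverse = {v: set() for v in vertices}  # graph with every edge reversed
    for v, targets in graph.items():
        for nxt in targets:
            reverse[nxt].add(v)

    components, assigned = [], set()

    def collect(v, component):  # second pass: DFS on the reversed graph
        assigned.add(v)
        component.add(v)
        for prev in reverse[v]:
            if prev not in assigned:
                collect(prev, component)

    for v in reversed(order):
        if v not in assigned:
            component = set()
            collect(v, component)
            components.append(component)
    return components

# Build a graph from the direct flow relations of two hypothetical TKFs.
tkfs = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]
graph = {}
for tkf in tkfs:
    for x, y in zip(tkf, tkf[1:]):
        graph.setdefault(x, set()).add(y)
        graph.setdefault(y, set())
components = strongly_connected_components(graph)
cycles = [c for c in components if len(c) > 1]
```

Components with more than one vertex, here the pair B and C, correspond to topic cycles whose ordering has to be resolved.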

Deﬁnition 12. Knowledge referencing path.

Given a directed graph G = (V, E) of a GKF, if there is a path from a start vertex to an end vertex, it is a knowledge referencing path. Such a path is defined as $p = \{s, d, V_p, E_p\}$, where s is a start vertex, d is an end vertex, $V_p$ is a set of topics on the path p, and $E_p$ is a set of edges, where each edge is an ordered pair $(v_i, v_j)$ such that $v_i, v_j \in V_p$, $v_i \neq v_j$, and $v_i$ is an adjacent predecessor of $v_j$.

Deﬁnition 13. Frequent referencing path.

Given a set of referencing paths derived from the graph of the GKF, a path p is said to be frequent if its path score, which is derived based on the frequency count of edges on the path, is greater than a certain threshold. A frequent referencing path indicates that workers accessed task-related knowledge in a particular topic order frequently.
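The following sketch enumerates referencing paths and filters the frequent ones. Definition 13 only states that the path score is derived from the frequency counts of the edges on the path, so the mean edge count used in `path_score` is an assumption made purely for illustration, as are the function names, the vertex labels and the sample counts.

```python
def simple_paths(edges, start, end):
    """Yield every path without repeated vertices from start to end."""
    successors = {}
    for vx, vy in edges:
        successors.setdefault(vx, []).append(vy)

    def extend(path):
        last = path[-1]
        if last == end:
            yield list(path)
            return
        for nxt in successors.get(last, ()):
            if nxt not in path:
                path.append(nxt)
                yield from extend(path)
                path.pop()

    yield from extend([start])

def path_score(path, edge_counts):
    # Assumption: score = mean frequency count of the edges on the path.
    counts = [edge_counts[(a, b)] for a, b in zip(path, path[1:])]
    return sum(counts) / len(counts)

def frequent_paths(edge_counts, start, end, threshold):
    """Paths from start to end whose score is greater than the threshold."""
    return [p for p in simple_paths(edge_counts, start, end)
            if path_score(p, edge_counts) > threshold]

# Hypothetical edge frequency counts; "s" and "e" are the start and end vertices.
edge_counts = {("s", "A"): 3, ("A", "B"): 3, ("A", "C"): 1,
               ("B", "e"): 3, ("C", "e"): 1}
frequent = frequent_paths(edge_counts, "s", "e", threshold=2)
```

With these counts, the path through topic B scores 3 and is kept, while the path through topic C scores about 1.67 and is discarded.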

Problem Statement: Given the TKFs of a group of workers, the GKF mining algorithms ﬁnds the GKF from the TKFs. The GKF is represented by a directed graph, which is used to model the referencing behavior of a group of workers.

4. GKF mining algorithm

To derive a GKF model from a set of TKFs, we propose two algorithms: one for cases where there are no duplicate topics in a TKF, and the other for cases where there are duplicate topics. Both algorithms model a group's information needs as a group-based knowledge flow. The referencing path of a GKF details the order in which topics are accessed when workers search for task-related knowledge. In Section 4.1, we present a GKF mining algorithm for cases without duplicate topics. The GKF mining algorithm for dealing with duplicate topics is presented in Section 4.2.

4.1. GKF mining algorithm without considering duplicate topics

In this algorithm, we assume that a topic appears just once in a TKF. That is, there is no duplicate topic in any TKF; hence, there will not be a topic loop in the GKF. However, the order of topics in different TKFs may vary, so topic cycles, which form strongly connected components, may appear in the graph G.

In a strongly connected component (SCC), where each vertex is reachable from every other vertex, it is difficult to determine the ordering relation among the vertices. To resolve the problem, the algorithm applies the Topic_Relation_Identification procedure to identify the vertex relation in the SCC. The relation, which can be classified as either a parallel relation or a sequential relation to characterize the topic relations in the GKF, represents part of the topic ordering in workers' referencing behavior.

The GKF mining algorithm discovers frequent referencing of topics from the TKFs of a group of workers. To discover frequent referencing behavior patterns, which are modeled as frequent edges or frequent referencing paths on the GKF graph, the algorithm uses the edge deletion procedure to remove infrequent edges whose weights are lower than a user-specified threshold. A start vertex and an end vertex are added to the discovered graph to indicate the start and end of the referencing behavior paths of the workers. Note that a topic is represented as a vertex on the graph. It would be odd to generate a GKF in which topic references were incomplete; that is, where a topic reference does not originate at the start vertex or reach the end vertex. The algorithm ensures that every topic can be referenced successfully from the start vertex to the end vertex. Thus, an infrequent edge can only be deleted if its removal does not make any vertex unreachable from the start vertex or unable to reach the end vertex.

Several knowledge paths may exist on a GKF graph. The paths represent the group's frequent referencing behavior when learning/referencing knowledge. Thus, the discovered graph can be used to inform a group of workers about topics of interest and the referencing behavior related to those topics.

The steps of the proposed algorithm are shown in Fig. 3. To generate a GKF model for a specific group (task), a set of TKFs is taken as the algorithm's input, and the graph of the GKF is the output. In the GKF graph, a topic domain in a TKF is represented as a knowledge vertex, and each flow that directly orders the knowledge between two topics is represented as an edge. For example, given a TKF ⟨A, B, E, C⟩, the four topics A, B, E and C are represented as four knowledge vertices, i.e., v_A, v_B, v_E and v_C, respectively; and the direct flows between two knowledge vertices are represented as three directed edges, i.e., e_{A,B}, e_{B,E}, and e_{E,C}, in the graph G. Note that an edge is used to order the flow between two topics directly, e.g., the edge e_{A,B} orders the flow from topic A to topic B. In contrast, if two topics have no direct flow relation, no edge exists between them. In the same example, there is no direct flow relation between topic A and topic E, so an edge e_{A,E} does not exist.

The algorithm for building the GKF model involves several steps. First, a start vertex s and an end vertex d are added to the directed graph. Second, each topic in a TKF is regarded as a vertex and is added to a vertex set V if it does not exist in V already. Then, to connect the vertices in V, the edges related to the inserted vertex are added to the edge set E as follows. Let x → y be a direct flow relation from topic x to topic y, which denotes that topic x is followed immediately by topic y in a TKF_w. When adding the edge e_{x,y} to E, the algorithm has to check two additional conditions for the edge to connect the start/end vertex with other vertices. First, if the vertex y is the first topic in a TKF, the edge e_{s,y} from the start vertex s to the vertex y is added to E; then, if the vertex y is the last topic in the TKF, the edge e_{y,d} from the vertex y to the end vertex d is added to E. When adding an edge to E, the algorithm counts the frequency of the edge. Adding all the vertices and their related edges to V and E respectively yields the initial graph of the GKF model.

Example. This example illustrates how to build a GKF graph by using the GKF algorithm without considering duplicate topics in a TKF. Five workers who have similar TKFs form a group. Their topic-level KFs are listed in Table 1.

The topic domains in each topic-level KF (TKF) are arranged as a topic sequence according to the times they were referenced. Based on the TKF of each worker, the proposed algorithm derives the group's GKF, which is represented by a directed graph, as shown in Fig. 4. The topic domains, including the start and end vertices, are represented by circles; an edge is represented by an arrow, which indicates the direction of knowledge flow from one knowledge vertex to another; and the number on each edge is the edge's frequency count.
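The construction steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 3 itself: the function name `build_initial_graph` and the labels `"s"` and `"d"` for the start and end vertices are chosen here for illustration, and the input is the set of TKFs listed in Table 1.

```python
from collections import Counter

START, END = "s", "d"  # assumed labels for the start and end vertices


def build_initial_graph(tkfs):
    """Return (vertices, edge_freq) for a collection of duplicate-free TKFs.

    edge_freq maps a directed edge (x, y) to the number of TKFs in which
    topic x is followed immediately by topic y.
    """
    vertices = {START, END}
    edge_freq = Counter()
    for topics in tkfs.values():
        if not topics:
            continue
        vertices.update(topics)
        # Connect the start vertex to the first topic of the TKF.
        edge_freq[(START, topics[0])] += 1
        # Add one edge per direct flow relation x -> y.
        for x, y in zip(topics, topics[1:]):
            edge_freq[(x, y)] += 1
        # Connect the last topic of the TKF to the end vertex.
        edge_freq[(topics[-1], END)] += 1
    return vertices, edge_freq


# TKFs of the five workers in Table 1.
TABLE_1 = {
    "John": ["A", "B", "C", "D", "E"],
    "Mary": ["A", "C", "G", "F", "D", "E"],
    "Lisa": ["B", "A", "C", "E"],
    "Tom": ["A", "B", "C", "D"],
    "Bob": ["C", "B", "G", "F", "D"],
}

vertices, edge_freq = build_initial_graph(TABLE_1)
```

For the TKFs of Table 1, this sketch counts the edge from s to A three times and the edge from B to A once, and it creates no edge from A to E because no worker references E immediately after A.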

In the initial graph, a strongly connected component (SCC) may be evident when some vertices appear in reverse order in any two TKFs. A strongly connected component G_s is a maximal strongly connected sub-graph that contains a path from each vertex to every other vertex in G_s. Because the vertices in a connected component are strongly connected, it is difficult to determine the ordering relationships between them. Even so, such relationships can be used to represent the characteristics of a TKF and they are important for modeling workers' referencing behavior. Thus, we use a procedure called Topic_Relation_Identification to determine the relationships among vertices in any strongly connected component.
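As an illustration of how such components can be located, the following sketch finds the strongly connected components of the direct flows derived from the TKFs in Table 1. The text does not state which SCC algorithm is used; Kosaraju's two-pass depth-first search is swapped in here as one standard choice, and the function name and edge list are assumptions made for this example.

```python
from collections import defaultdict


def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm: return a list of vertex sets, one per SCC."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    visited, finish_order = set(), []

    def dfs(u):
        # First pass: record vertices in order of DFS completion.
        visited.add(u)
        for w in adj[u]:
            if w not in visited:
                dfs(w)
        finish_order.append(u)

    for u in vertices:
        if u not in visited:
            dfs(u)

    assigned, components = set(), []

    def collect(u, component):
        # Second pass: explore the reversed graph.
        assigned.add(u)
        component.add(u)
        for w in radj[u]:
            if w not in assigned:
                collect(w, component)

    for u in reversed(finish_order):
        if u not in assigned:
            component = set()
            collect(u, component)
            components.append(component)
    return components


# Direct flows between topics in the TKFs of Table 1 (start/end edges omitted).
TOPICS = ["A", "B", "C", "D", "E", "F", "G"]
FLOWS = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "C"), ("C", "G"),
    ("G", "F"), ("F", "D"), ("B", "A"), ("C", "E"), ("C", "B"), ("B", "G"),
]

components = strongly_connected_components(TOPICS, FLOWS)
nontrivial = [c for c in components if len(c) > 1]
```

On this input the only component with more than one vertex is {A, B, C}, because A, B and C appear in different orders in different TKFs.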

In an SCC, two kinds of relations can be identified, namely, parallel and sequential relations. Any two vertices in an SCC indicate that two topics, x and y, may be referenced by different TKFs with the orderings x > y and y > x. This ordering is an example of a parallel relation, where either v_x ~> v_y or v_y ~> v_x would be appropriate; thus, there is no strict ordering between v_x and v_y. The referencing order of the vertices is not obvious, and the knowledge items represented by the vertices may be referenced simultaneously. As the vertices in an SCC are not in a specific order, conventional workflow mining methods consider the association between the vertices as a parallel relation. However, in contrast to such methods, a sequential relation pattern (SRP) rather than a parallel relation pattern (PRP) may be extracted if most of the referencing behavior in the SCC fits the SRP. That is, the SRP represents the most frequent knowledge referencing pattern in the SCC.

We explain how to recognize the above relations in Section 4.1.1, and how to evaluate the weight of each edge when measuring the importance of a flow in the GKF in Section 4.1.2. Then, we transform the initial graph of the GKF into a new directed acyclic graph G_N in which a strongly connected component G_s is regarded as a vertex.

Fig. 3. The algorithm for mining a GKF when TKFs do not contain duplicate topics.

Table 1
Five workers and their TKFs.

Worker   Topic-level KF (TKF)
John     ⟨A, B, C, D, E⟩
Mary     ⟨A, C, G, F, D, E⟩
Lisa     ⟨B, A, C, E⟩
Tom      ⟨A, B, C, D⟩
Bob      ⟨C, B, G, F, D⟩

After graph transformation, the topological sorting and edge deletion procedures are applied to G_N to remove any infrequent edges. An infrequent edge indicates that only a few workers in the group adopt a particular reference behavior pattern. Since such patterns are not representative of the group's general referencing behavior, they can be removed. The topological sorting procedure is used to sort all vertices in V_N in topological order, as discussed in Section 4.1.4. Based on the sorting result, the edge deletion procedure (described in Section 4.1.5) checks all the edges and removes infrequent and unqualified edges from E_N and E. After edge deletion, the graph G represents the group-based knowledge flow.

4.1.1. Topic relation identification

The topic relation identification procedure determines the relations between vertices in a strongly connected component, as shown in Fig. 5. Let the strongly connected component G_s = (V_s, E_s), where V_s is a vertex set and E_s is an edge set. Parallel and sequential relations can be discovered from G_s based on the frequency count of knowledge flow sequences (KFSs). To determine and rebuild the relationships between vertices in V_s, all possible non-duplicate KFSs of length |V_s|, which contain all vertices in V_s, are identified from G_s. The derived KFSs are then compared with a non-duplicate sequence SQ_w in a TKF_w, which contains the set of vertices that are common to both V_s and the vertex set V(TKF_w), i.e., V(SQ_w) = V_s ∩ V(TKF_w), where V(SQ_w) and V(TKF_w) denote the sets of vertices in the sequence SQ_w and in TKF_w, respectively. When the sequence SQ_w is a subsequence of a KFS, the frequency count of the KFS is increased. Next, all the KFSs are sorted in descending order of their frequencies and the top-2 frequent KFSs are selected to elicit the relations of vertices in V_s. The preceding pseudo node v_γ and the succeeding pseudo node v_ρ of G_s are also added to V.

Fig. 4. The initial graph of the GKF model.

If the difference in the frequency counts of the selected KFSs is lower than a user-specified threshold ε, the order of the vertices in V_s is not significant. In this case, the vertex relation is defined as parallel. For example, let us consider a strongly connected component where vertices v_x, v_y and v_z are in V_s, and let the user-specified threshold ε = 2. When the frequency counts of two KFSs ⟨v_x, v_y, v_z⟩ and ⟨v_z, v_y, v_x⟩ are 7 and 6 respectively, the relation between v_x, v_y and v_z is parallel because the difference in their frequency counts is lower than the threshold. However, if the difference is greater than the user-specified threshold, the KFS with the largest frequency count can be used to represent the relationship of vertices in V_s based on the majority principle. The ordering of these vertices is defined as a sequential relation. Next, we explain how to identify the order of vertices in a strongly connected component, i.e., parallel relations and sequential relations.
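The counting and thresholding steps above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the function names are hypothetical, "subsequence" is interpreted as an order-preserving (not necessarily contiguous) subsequence, a difference exactly equal to ε is treated as parallel (as in the Table 1 example of this section), and the value ε = 1 used for the Table 2 data is an assumption because the text gives no threshold for that example.

```python
from collections import Counter
from itertools import permutations


def is_subsequence(short, long):
    """True if `short` appears in `long` in the same order (gaps allowed)."""
    it = iter(long)
    return all(item in it for item in short)


def kfs_frequencies(scc_vertices, tkfs):
    """Count, for each candidate KFS over the SCC, the TKFs that agree with it."""
    vs = set(scc_vertices)
    # Edges of G_s: direct flows between two vertices of the SCC.
    edges = {(x, y) for t in tkfs for x, y in zip(t, t[1:]) if x in vs and y in vs}
    # Candidate KFSs: orderings of all SCC vertices that follow edges of G_s.
    candidates = [
        p for p in permutations(sorted(vs))
        if all((a, b) in edges for a, b in zip(p, p[1:]))
    ]
    counts = Counter({p: 0 for p in candidates})
    for t in tkfs:
        sq = [x for x in t if x in vs]  # SQ_w: the TKF restricted to V_s
        if not sq:
            continue
        for p in candidates:
            if is_subsequence(sq, p):
                counts[p] += 1
    return counts


def identify_relation(counts, epsilon):
    """Return ('sequential', order) or ('parallel', None) from the top-2 KFSs."""
    ranked = counts.most_common(2)
    if len(ranked) < 2:
        return ("sequential", ranked[0][0]) if ranked else ("parallel", None)
    (best, best_count), (_, second_count) = ranked
    if best_count - second_count > epsilon:
        return "sequential", best
    return "parallel", None


# TKFs of Table 1 (SCC {A, B, C}) and Table 2 (SCC {B, C, D}).
TABLE_1 = [list("ABCDE"), list("ACGFDE"), list("BACE"), list("ABCD"), list("CBGFD")]
TABLE_2 = [list("AFBCDH"), list("AGBCDI"), list("FBCDH"), list("AFCDBKH"),
           list("FCDBKH"), list("AGBCKH"), list("FBCD")]

relation_1 = identify_relation(kfs_frequencies("ABC", TABLE_1), epsilon=1)
relation_2 = identify_relation(kfs_frequencies("BCD", TABLE_2), epsilon=1)
```

Under these assumptions, the Table 1 component {A, B, C} is classified as parallel, since its two most frequent candidate sequences differ by one, while the Table 2 component {B, C, D} is classified as sequential with the ordering B, C, D, which agrees with the outcomes reported in the examples of this section.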

4.1.1.1. Identifying parallel relations in an SCC. For parallel relations, the order of the vertices in V_s is not important. The Topic_Relation_Identification procedure checks each edge in E_s for each TKF. Let e_{i,j} be an edge in E_s that connects vertex v_i to vertex v_j directly. If this direct flow relation v_i → v_j appears in a TKF and a flow relation v_j > v_i exists in another TKF, the edge e_{i,j} is removed from E and E_s, and the relation between vertex v_i and vertex v_j is regarded as parallel. That is, there is no specific ordering between vertex v_i and vertex v_j, and their corresponding topics can be referenced in any order.

After adding a preceding pseudo node v_γ and a succeeding pseudo node v_ρ to G, the edges connected to the vertices in V_s are redirected through the pseudo nodes. To connect a vertex in V to the pseudo nodes, each adjacent predecessor v_k of v_i, where v_k ∉ V_s and v_i ∈ V_s, and each adjacent successor v_l of v_i, where v_l ∉ V_s and v_i ∈ V_s, are examined. For vertex v_k, if edge e_{k,i}, which connects vertex v_k to vertex v_i, exists in E, it is removed. Then, the edges e_{k,γ} and e_{γ,i} are added to E and their frequency counts are calculated. If the two edges already exist in E, their frequency counts are simply updated. Briefly, the edge e_{k,i} is replaced by edges e_{k,γ} and e_{γ,i} to connect vertex v_k and vertex v_i through the pseudo node v_γ. Similarly, for a vertex v_l, if edge e_{i,l} exists in E, it is removed. Then, the edges e_{i,ρ} and e_{ρ,l} are added to E and their frequency counts are calculated. If the edges already exist in E, their frequency counts are simply updated.

Example. In Fig. 4, there is a strongly connected component G_s comprised of V_s = {A, B, C} and E_s = {e_{A,B}, e_{B,A}, e_{B,C}, e_{C,B}, e_{A,C}}. Let the threshold ε be 1. The graph of the GKF after topic relation identification is shown in Fig. 6. Based on the Topic_Relation_Identification procedure, two pseudo nodes, v_γ and v_ρ, are added to G. Then, the edges in E_s are examined to determine which ones should be removed. The non-duplicate sequences discovered in G_s are ⟨A, B, C⟩, ⟨C, B⟩ and ⟨B, A, C⟩; their frequency counts are 2, 1 and 2 respectively. Because the difference in the frequency counts of the top-2 sequences is equal to 1, the relation between vertex v_A, vertex v_C, and vertex v_B is regarded as parallel, and the edges e_{A,B}, e_{B,A}, e_{B,C} and e_{C,B} are removed from the graph.

Meanwhile, the relation between vertex v_A and vertex v_C is regarded as sequential because A → C exists in one TKF, but there is no C > A in any other TKF. Thus, e_{A,C} is not removed from the graph. The incoming edges of vertex v_A, vertex v_B and vertex v_C are changed to make connections through pseudo node v_γ. Similarly, the outgoing edges of vertex v_A, vertex v_B and vertex v_C are changed to make connections through pseudo node v_ρ. Then, the frequency counts of these edges are updated (Fig. 6).

4.1.1.2. Identifying sequential relations in an SCC. If the difference between the frequency counts of the selected top-2 KFSs is greater than a user-specified threshold, the ordering of the vertices in the KFSs is regarded as a sequential relation. That is, based on the majority principle w.r.t. knowledge referencing behavior discussed earlier, the vertices in V_s follow the ordering of the KFS with the highest frequency. Let KFS_y be the knowledge flow sequence with the highest frequency count; and let v_i and v_j be, respectively, the first and last vertices in the sequential order of KFS_y. All the edges in E_s are removed from E_s and E. Then, for each direct flow relation v_g → v_h in KFS_y, an edge e_{g,h} is added to E_s and E. Similarly, the edges connected to the vertices in V_s are redirected through the pseudo nodes.

For each adjacent predecessor v_k of v_f, where v_k ∈ V, v_k ∉ V_s, and v_f ∈ V_s, the edges e_{k,γ} and e_{γ,i} are added to E, and their frequency counts are calculated. If the edges already exist in E, their frequency counts are simply updated. The edge e_{k,f}, which connects vertex v_k to vertex v_f, is removed from E and replaced by the connections from v_k to v_γ and from v_γ to v_i, the first vertex of KFS_y. That is, the edge e_{k,f} is replaced by edges e_{k,γ} and e_{γ,i}, which connect vertex v_k to vertex v_i through the pseudo node v_γ. Similarly, for each adjacent successor v_l of v_f, where v_l ∈ V, v_l ∉ V_s, and v_f ∈ V_s, we use the same method to establish connections from the last vertex in KFS_y to the vertex v_l through the pseudo node v_ρ. The connection from v_f to v_l is replaced by the connections from the last vertex of KFS_y, i.e., v_j, to the pseudo node v_ρ and from v_ρ to v_l.

Example. Table 2 lists the knowledge flows of a group of seven workers. The GKF mining algorithm, described in Section 4.1, is used to generate the graph of the group-based KF, and a strongly connected component with vertices v_B, v_C, and v_D is identified from the GKF graph. Then, the Topic_Relation_Identification procedure is applied to determine the relation between those vertices. As shown in Fig. 7, the relation is sequential with the ordering v_B, v_C, and v_D. In addition, the edges connected to any vertex in V_s are changed. For example, the edge e_{B,K} is changed to edge e_{D,ρ} and edge e_{ρ,K} such that there is a path from vertex v_B to vertex v_K via the pseudo node v_ρ.

4.1.2. Measuring the importance of an edge

Our objective is to derive the referencing behavior of a group of workers by constructing a frequent knowledge path in a GKF graph. However, some infrequent edges in the graph may not be suitable for building the path. To measure the importance of each edge in a graph, the frequency count of each edge is normalized by the maximum edge frequency in E. The weighting function measures the importance of an edge in a GKF model, as defined in Eq. (5).

w_{e_{x,y}} = \frac{f_{x,y}}{\max\{ f_{i,j} \mid \forall i, j;\ e_{i,j} \in E \}}    (5)

where w_{e_{x,y}}, which ranges from 0 to 1, is the weight of the edge e_{x,y} that represents a direct flow from vertex v_x to vertex v_y; f_{x,y} is the frequency of the edge e_{x,y}; E is the edge set of the graph; and the denominator is a maximum function that derives the frequency count of the most frequent edge in the graph. The more frequently an edge occurs, the more important it is deemed to be. The most frequent edge represents the frequent referencing behavior of most members of the group. Thus, it is suitable for describing the group's referencing behavior.
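Eq. (5) amounts to dividing every edge frequency by the largest frequency in the graph, as the short sketch below shows. The function name `edge_weights` is hypothetical, and the four frequency counts are assumptions worked out from the TKFs of Table 1 after the component {A, B, C} is placed behind the pseudo nodes (labelled "gamma" and "rho" here); they are not read from Fig. 6.

```python
def edge_weights(edge_freq):
    """Normalize edge frequencies by the maximum frequency, following Eq. (5)."""
    max_freq = max(edge_freq.values())
    return {edge: freq / max_freq for edge, freq in edge_freq.items()}


# Assumed frequency counts for a few edges (illustrative subset only).
EDGE_FREQ = {
    ("s", "gamma"): 5,  # all five TKFs start inside the component
    ("rho", "E"): 1,    # one TKF moves from the component directly to E
    ("D", "E"): 2,
    ("E", "d"): 3,
}

weights = edge_weights(EDGE_FREQ)
```

With these counts the most frequent edge receives weight 1, and the edges into E receive weights 0.2 and 0.4, which are the values used for e_{ρ,E} and e_{D,E} in the edge deletion example of this section.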

Example. The weight of each edge in Fig. 6 is calculated by using the edge weighting method. The edge is then labeled with the weight to indicate its importance in the graph, as shown in Fig. 8.

4.1.3. Graph transformation

To simplify a strongly connected component in a graph, the proposed algorithm transforms the original GKF graph into a new graph G_N. After the transformation, the graph G_s is regarded as a vertex v_{Gs} in G_N. We create two pseudo nodes, v_γ and v_ρ, to represent, respectively, the split operator and the join operator of G_s. In addition, the incoming/outgoing edges of G_s, which connect to the pseudo nodes v_γ (the split operator)/v_ρ (the join operator), are merged to form a new edge whose weight is also updated. The weight of the incoming edge of v_{Gs}, which combines the incoming edges of G_s, is derived by combining the edge weights of the incoming edges of the node v_γ. Similarly, the weight of the outgoing edge of v_{Gs} is derived by combining the edge weights of the outgoing edges of the node v_ρ.

Example. We transform the graph G_s in Fig. 8 into a new graph for further analysis, as shown in Fig. 9. To simplify the strongly connected component, all the vertices in G_s are wrapped as a vertex v_{Gs} in the new graph. The incoming edges and outgoing edges of any vertex in G_s and the weights of those edges are adjusted. In Fig. 8, edge e_{γ,A} and edge e_{γ,B} are merged to form a new edge e_{γ,Gs} in Fig. 9 and their edge frequencies are combined as 1. In the same way, edge e_{C,ρ} and edge e_{B,ρ} are combined to form an edge e_{Gs,ρ}.

4.1.4. Topological sorting

The frequent referencing behavior of a group of workers is derived by mining the group's knowledge flow from a GKF graph. The workers may reference topics in a different order when performing tasks, but some referencing behavior is more frequent because the majority of workers in the group reference topics in the same order. In the GKF graph, a frequent knowledge path from the start vertex to the end vertex represents the workers' frequent referencing behavior. For any vertex v_i on the path, vertex v_i is reachable from the start vertex and the end vertex is reachable from vertex v_i. Note that a path with infrequent edges denotes an infrequent referencing behavior pattern. To derive a group's frequent referencing behavior, a topological sorting procedure is used to sort all vertices in the graph, after which infrequent edges whose weights are lower than a specified threshold are deleted. In graph theory, topological sorting [10,19] is a very efficient way to arrange the vertices of a directed acyclic graph in topological order in linear time. The key property of the topological order is that, for any two vertices x and y, if x is a predecessor of y in the graph then x precedes y in the topological order.
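A linear-time topological sort can be sketched with Kahn's algorithm, which is one standard choice; the text does not say which variant the authors use. The vertex names and the edge set below are assumptions: they describe a small acyclic graph reconstructed from the TKFs of Table 1 after the component {A, B, C} has been collapsed into a single vertex "Gs" between the pseudo nodes "gamma" and "rho", and they are not taken from Fig. 9.

```python
from collections import defaultdict, deque


def topological_sort(vertices, edges):
    """Kahn's algorithm: return the vertices of a DAG in topological order."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    # Start with the vertices that have no incoming edge.
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in successors[u]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    if len(order) != len(indegree):
        raise ValueError("the graph contains a cycle")
    return order


# Assumed acyclic graph G_N (illustrative reconstruction).
VERTICES = ["s", "gamma", "Gs", "rho", "D", "E", "F", "G", "d"]
EDGES = [
    ("s", "gamma"), ("gamma", "Gs"), ("Gs", "rho"),
    ("rho", "G"), ("rho", "D"), ("rho", "E"),
    ("G", "F"), ("F", "D"), ("D", "E"), ("D", "d"), ("E", "d"),
]

order = topological_sort(VERTICES, EDGES)
position = {v: i for i, v in enumerate(order)}
```

The key property stated above can be checked directly on the result: for every edge (x, y), `position[x] < position[y]`.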

In this work, we use topological sorting to arrange all vertices in G_N, which is a directed acyclic graph, before the edge deletion procedure is applied. Then, the edge deletion procedure examines the vertices in topological order to identify the infrequent incoming edges of each vertex that should be removed. However, before removing an infrequent edge, the procedure needs to ensure that each vertex in the GKF satisfies two criteria. First, any vertex v_i on a knowledge path must be reachable from the start vertex and the end vertex must be reachable from vertex v_i. Second, removing the edges of a vertex v_i does not affect the path from the start vertex to the preceding vertices of v_i in the topological order. In other words, topological ordering guarantees that 1) a predecessor will be processed before a successor; and 2) the predecessor's reachability (i.e., from the start vertex to v_i) will not be affected by its successors. Thus, when an infrequent edge of any vertex v_i in G is removed, there is no need to verify the reachability of the predecessors of vertex v_i from the start vertex. On the other hand, the path from the predecessors of vertex v_i to the end vertex may be affected by removing an infrequent edge of v_i; therefore, the predecessors should be examined again to ensure that they can still reach the end vertex.

Table 2
The TKFs of seven knowledge workers.

Worker   Topic-level KF (TKF)
W1       ⟨A, F, B, C, D, H⟩
W2       ⟨A, G, B, C, D, I⟩
W3       ⟨F, B, C, D, H⟩
W4       ⟨A, F, C, D, B, K, H⟩
W5       ⟨F, C, D, B, K, H⟩
W6       ⟨A, G, B, C, K, H⟩
W7       ⟨F, B, C, D⟩

Example. In Fig. 9, all the vertices are sorted in topological order, and the resulting list is ⟨s, γ, G_s, ρ, G, F, D, E, d⟩. According to the list, v_s is the first vertex to be checked, v_{Gs} is the second vertex and so on. The algorithm examines all the vertices in topological order and removes infrequent edges from the graph G_N via the edge deletion procedure.

4.1.5. Using the edge deletion procedure to remove infrequent edges

Based on the results of topological sorting of V_N, the edge deletion procedure examines the vertices and determines which incoming edges should be removed from them. It then removes infrequent edges whose weight is lower than a user-specified threshold, as shown in Fig. 10. The inputs of this procedure are the sorted list L derived by topological sorting and the edge set E_N of the GKF graph.

The algorithm checks the incoming edges of each vertex in ascending order of their weights, and those whose weights are lower than a user-specified threshold η are candidates for removal. If an edge is removed, it means that the knowledge referencing behavior between two vertices (topics) is infrequent among the group of workers.

However, an infrequent edge should only be deleted from the graph if removing it would not make any vertex unreachable. Let Q be the set of vertices that have been checked in topological order to remove their infrequent incoming edges. For a vertex v_y, if one of its incoming edges is removed and there is no other path from the start vertex to v_y, the removed edge should be returned to the edge sets E and E_N. In addition, the vertices checked before v_y should be reexamined to ensure that there is a path from each checked vertex v_i in Q to the end vertex. If removing an edge violates the above condition, the edge should be returned to the edge sets E and E_N.

Because of the characteristics of topological sorting, the edge deletion procedure ensures that 1) any vertex in the graph G_N can be reached from the start vertex; and 2) removing an edge of a vertex does not affect any path from the start vertex to the predecessors of the vertex. In other words, there exists at least one path from each vertex to the end vertex. Moreover, we can obtain several frequent knowledge paths from the GKF graph to help workers learn the group's knowledge. The following example explains how to remove an edge from the GKF graph.

Example. In Fig. 9, let vertex v_E be the examined vertex and let the user-specified threshold be 0.3. The vertex v_E has two incoming edges: e_{ρ,E} with weight 0.2 and e_{D,E} with weight 0.4. The edge e_{ρ,E} qualifies for removal, because its weight is lower than 0.3 and removing it would not make any vertex unreachable. Fig. 11 shows the resulting graph, which represents the GKF of the group. The graph is used to visualize the knowledge flows among the frequent topics and model the referencing behavior of the group.
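The edge deletion procedure can be approximated by the sketch below. It is a simplification rather than the procedure of Fig. 10: after each tentative removal it rechecks reachability for every vertex instead of only for the vertices already examined, which enforces the same two conditions at a higher cost. The function names, the vertex labels and all edge weights other than 0.2 for e_{ρ,E} and 0.4 for e_{D,E} are assumptions made for illustration.

```python
from collections import defaultdict, deque


def _reach(adjacency, source):
    """Return the set of vertices reachable from `source` (breadth-first)."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def keeps_graph_connected(vertices, edges, start, end):
    """True if every vertex is reachable from `start` and can reach `end`."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
    everything = set(vertices)
    return _reach(forward, start) >= everything and _reach(backward, end) >= everything


def delete_infrequent_edges(order, weights, eta, start="s", end="d"):
    """Visit vertices in topological order and drop removable infrequent edges."""
    kept = set(weights)
    for v in order:
        # Incoming edges of v, lightest first.
        incoming = sorted((e for e in kept if e[1] == v),
                          key=lambda e: (weights[e], e))
        for e in incoming:
            if weights[e] >= eta:
                break  # the remaining incoming edges are frequent
            trial = kept - {e}
            if keeps_graph_connected(order, trial, start, end):
                kept = trial  # safe to remove; otherwise the edge stays
    return kept


# Assumed graph: topological order and illustrative edge weights.
ORDER = ["s", "gamma", "Gs", "rho", "G", "F", "D", "E", "d"]
WEIGHTS = {
    ("s", "gamma"): 1.0, ("gamma", "Gs"): 1.0, ("Gs", "rho"): 1.0,
    ("rho", "G"): 0.4, ("G", "F"): 0.4, ("F", "D"): 0.4, ("rho", "D"): 0.4,
    ("rho", "E"): 0.2, ("D", "E"): 0.4, ("D", "d"): 0.4, ("E", "d"): 0.6,
}

kept = delete_infrequent_edges(ORDER, WEIGHTS, eta=0.3)
```

With the threshold 0.3, only the edge from "rho" to "E" is removed, because E remains reachable through D. If an infrequent edge were the only way into a vertex, the connectivity check would fail and the edge would be kept.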

4.1.6. Properties of the GKF

The generated knowledge graph has several properties. We define and prove the associated lemmas below.

Lemma 1. Let v_s be the start vertex in a graph, G_N, of a group-based knowledge flow. For any vertex v_h in G_N, there exists a path P_{s,h} from vertex v_s to v_h.

Proof. In the edge deletion procedure, removal of an incoming edge from a vertex v_h depends on the weight of the edge. All vertices in G_N are visited in topological order and their incoming edges are examined. For any vertex v_h, an incoming edge should be removed if its weight is lower than a user-specified threshold. However, if removing an edge from v_h also removes the path P_{s,h} from G_N, that edge should be returned to the vertex.

When deleting an incoming edge of a vertex, the edge deletion procedure ensures that 1) there is a path P_{s,h} from the start vertex v_s to vertex v_h; and 2) removing an incoming edge from a successor of v_h does not affect the path P_{s,h}. The proof is as follows. Let a vertex v_k be a succeeding vertex of v_h in the topological order. Based on the topological order, the edge deletion procedure processes the vertex v_h before vertex v_k and there exists a path P_{s,h}. Assume that a path P_{s,h} does not exist from v_s to v_h, because an incoming edge of v_k has been deleted. Thus, a path must have existed from vertex v_s through v_k to v_h before the edge was deleted. Consequently, v_k must be a predecessor of v_h. However, this statement contradicts the algorithm's processing of vertices in topological order. That is, v_k is a succeeding vertex of v_h and the path P_{s,h} exists in G_N. Thus, removing an incoming edge from a succeeding vertex of v_h does not affect the path P_{s,h}. According to the algorithm and the above explanation, for any vertex v_h in G_N, there exists a path P_{s,h} from vertex v_s to v_h.

Fig. 8. The edge weights in a GKF graph.

Lemma 2. Let v_d be an end vertex in the graph of the group-based knowledge flow G_N. For any vertex v_h in G_N, there exists a path P_{h,d} from vertex v_h to v_d.

Proof. Let vertex v_k be the succeeding vertex of the vertex v_h. Removing an incoming edge of vertex v_k will affect the reachability of the end vertex v_d from vertex v_h. When the edge deletion procedure removes an incoming edge of vertex v_k, it has to check whether the path P_{h,d} from vertex v_h to the end vertex v_d exists. If it does not exist, the incoming edge should not be removed. Therefore, the procedure ensures that a path P_{h,d} exists from vertex v_h to the end vertex v_d.

Lemma 3. Let G_N = {V_N, E_N} be the directed graph of a group-based knowledge flow. All vertices in V_N can be visited by traversing vertices from the start vertex v_s to the end vertex v_d. Then, for any vertex v_h in V_N, there exists a path from v_s to v_d through v_h.

Proof. According to Lemmas 1 and 2, for any vertex v_h in V_N, there exists a path P_{s,h} from the start vertex v_s to v_h and a path P_{h,d} from v_h to the end vertex v_d. Therefore, there exists a path from v_s to v_d through v_h.

Lemma 4. For any infrequent edge e_{h,k} on an infrequent path of G_N, either the path from the start vertex v_s to vertex v_k or the path from the vertex v_h to the end vertex v_d must pass through the edge e_{h,k}.

Proof. Let vertex v_h be a predecessor of vertex v_k in the topological order, and let e_{h,k} be an infrequent edge from vertex v_h to vertex v_k in G_N. Assume that there exist two paths, one from start vertex v_s to vertex v_k and the other from vertex v_h to the end vertex v_d, neither of which passes through the edge e_{h,k}. Our algorithm removes any infrequent edge if doing so will not make any vertex unreachable. Thus, the algorithm will remove the edge e_{h,k}. However, this contradicts the statement that e_{h,k} exists in G_N. Consequently, for any infrequent edge e_{h,k} of an infrequent path of G_N, either the path from the start vertex v_s to vertex v_k or the path from the vertex v_h to the end vertex v_d must pass through the edge e_{h,k}.

The vertex v_{Gs} in graph G_N represents a corresponding strongly connected component G_s in G. All vertices in G_s with parallel relations or sequential relations are reachable. Lemmas 1–4 also hold for G.

4.2. The GKF mining algorithm for dealing with topic loops

The GKF mining algorithm for dealing with topic loops (GKF-TL) is based on the GKF algorithm introduced in Section 4.1, which assumes there are no topic loops in workers' KFs when it generates the graph of the group-based KF. A topic loop means that a specific topic appears repeatedly in a TKF because it is referenced by a worker several times. This may happen because the worker needs the knowledge at different times during a task's execution. For example, given a worker's topic-level KF ⟨A, B, A, C, D⟩, topic A is referenced twice, so it appears as a topic loop in the corresponding graph of the TKF. Because the loop problem in workflow mining is difficult to resolve, regardless of the application domain, many researchers ignore it [12,34]. Agrawal et al. [4] proposed an algorithm for workflow systems that builds a general directed graph with cycles for mining process models from workflow logs. The algorithm gives activities different labels to differentiate them in a workflow instance. The problem of dealing with topic loops in TKFs is analogous to that of workflow systems. Thus, we adopt the above approach to solve the loop problem. Specifically, we propose an algorithm that considers duplicate topics (topic loops) in each TKF to build a directed graph for modeling the referencing behavior of a group of workers.

Different from our knowledge flow mining approach, the workflow mining approach [4] does not consider strongly connected components (SCCs) for differentiating the sequential and parallel relations. That is, it does not consider the issue of handling loops involving strongly connected components. We resolve the loop problems involving the vertices in an SCC.

Fig. 10. The edge deletion procedure.

The GKF-TL algorithm differs from the GKF algorithm. First, it identifies duplicated topics in a TKF and gives them different labels in order to solve the loop problem. For example, given a KF ⟨B, A, B, C, B⟩, because topic B appears three times, it is transformed into three instances, i.e., B1, B2 and B3, such that the original KF becomes ⟨B1, A, B2, C, B3⟩.
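The relabeling step can be sketched as follows. The function name `label_duplicates` is hypothetical, and the choice to leave topics that occur only once unlabeled follows the example above, where A and C keep their original names.

```python
from collections import Counter


def label_duplicates(tkf):
    """Give each occurrence of a repeated topic its own numbered label."""
    totals = Counter(tkf)   # how often each topic occurs in the TKF
    seen = Counter()        # how many occurrences have been labeled so far
    labeled = []
    for topic in tkf:
        if totals[topic] > 1:
            seen[topic] += 1
            labeled.append(f"{topic}{seen[topic]}")
        else:
            labeled.append(topic)
    return labeled


relabeled = label_duplicates(["B", "A", "B", "C", "B"])
```

For the KF in the example, `relabeled` is `["B1", "A", "B2", "C", "B3"]`, so the labeled sequence no longer contains a duplicate topic and can be processed by the algorithm of Section 4.1.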

After infrequent edges have been removed from the graph G, it is transformed into a new graph G_T as follows. The vertices with different instances of the same topic form an equivalent set and can be merged to make one vertex. For a topic TP in a TKF, each vertex in the equivalent set of TP is an instance of the topic. Then, a directed edge is added to the new graph G_T if there is an edge between two vertices of different equivalent sets in graph G. Initially, the merging process is applied to vertices of each equivalent set in G when a strongly connected component is not involved. To merge vertices involving a strongly connected component G_s, the steps are as follows.

Let vertices v_i/v_j be instances in the equivalent sets Q_a/Q_b, and let v_k be another instance in Q_a as well as a vertex in a strongly connected component, i.e., v_k ∈ G_s, where v_γ and v_ρ are two pseudo nodes of G_s. Note that because v_k and v_i are instances of the same topic, they are in the same equivalent set and are thus merged to form one vertex. In addition, v_i is in G_s, since v_k is in G_s. Generally, the vertices of an equivalent set Q_a in G are combined as a vertex v_a in the new graph G_T, while the vertices of an equivalent set Q_b are merged to form one vertex v_b. For a strongly connected component G_s with pseudo nodes v_γ and v_ρ, if a directed edge e_{i,j} between v_i and v_j exists in G, a directed edge e_{ρ,b} is added to the new graph G_T. Similarly, if a directed edge e_{j,i} exists in G, a directed edge e_{b,γ} is added to G_T.

Next, we consider how to combine vertices involving two strongly connected components. Let v_k/v_l be vertices in strongly connected components G_a/G_b; v_{γa} and v_{ρa} be pseudo vertices that connect with graph G_a; v_{γb} and v_{ρb} be pseudo vertices that connect with G_b; and Q_a/Q_b be the corresponding equivalent sets of vertices in G_a/G_b. In addition, let vertices v_i and v_k (resp. v_j and v_l) be instances of the equivalent set Q_a (resp. Q_b). Vertices in Q_a/Q_b are merged as vertex v_a/v_b. Because v_k/v_l is in G_a/G_b, v_i/v_j also belongs to G_a/G_b; however, some edges need to be adjusted. If there is a directed edge e_{i,j} from v_i to v_j in graph G, an edge e_{ρa,γb} with the same direction as edge e_{i,j} is added to the new graph G_T. Similarly, if a directed edge e_{j,i} exists in graph G, a directed edge e_{ρb,γa} is added to G_T. These newly added edges are used to merge two equivalent sets in different strongly connected components and make a connection between them. Note that the weights of the edges are updated during the merging process.
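One possible reading of these merging rules is sketched below. It is an interpretation under stated assumptions, not the authors' implementation. The function name `merge_equivalent_sets` and the pseudo-node names `rho_<scc>` and `gamma_<scc>` are hypothetical. The input graph is assumed to have each strongly connected component already collapsed, so only edges outside the components are rewired. The text does not say how edge weights are updated during merging, so the sketch simply keeps the larger weight when two edges coincide; that choice is an assumption.

```python
def merge_equivalent_sets(edges, topic_of, scc_of=None, combine=max):
    """Merge instance vertices into topic vertices and rewire their edges.

    edges:    {(u, v): weight} for graph G, with every SCC already collapsed.
    topic_of: {instance: topic}; vertices not listed keep their own name.
    scc_of:   {topic: scc_name} for topics with an instance inside an SCC.
    combine:  merges the weights of edges that coincide after merging.
              ASSUMPTION: the source does not specify this update rule.
    """
    scc_of = scc_of or {}
    merged = {}
    for (u, v), weight in edges.items():
        tu = topic_of.get(u, u)
        tv = topic_of.get(v, v)
        if tu == tv:
            # Both endpoints fall into the same equivalent set: no edge in G_T.
            continue
        # An edge leaving a topic that lives in an SCC starts at that SCC's
        # pseudo node rho; an edge entering such a topic ends at gamma.
        src = f"rho_{scc_of[tu]}" if tu in scc_of else tu
        dst = f"gamma_{scc_of[tv]}" if tv in scc_of else tv
        key = (src, dst)
        merged[key] = combine(merged[key], weight) if key in merged else weight
    return merged


if __name__ == "__main__":
    # Hypothetical weights, chosen only for illustration.
    g = {("rho_Gs", "B2"): 0.5, ("B2", "C"): 0.75, ("C", "D"): 1.0}
    print(merge_equivalent_sets(g, {"B2": "B"}, {"B": "Gs"}))
```

In the usage lines, instance B2 lies outside the component while topic B has another instance inside it, so the edge from B2 to C is re-attached to the pseudo node `rho_Gs`, and the edge into B2 is redirected to `gamma_Gs`.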

Note that we assume the instances of a topic exist in at most one strongly connected component after the vertices of each equivalent set have been merged to form one vertex. We defer the case where the same topic belongs to more than one strongly connected component to future work. Next, we provide an example of applying the GKF-TL algorithm.

4.2.1. Example of applying GKF-TL mining algorithm

The following example considers a group of four workers with similar KFs. Their topic-level KFs (TKFs) are listed in Table 3. Each element in a TKF is used to represent a topic domain. Thus, the elements in a TKF are arranged as a topic sequence based on the times they were referenced. As a topic may appear more than once in a specific KF, because the worker needs the knowledge at different times, we apply the GKF-TL mining algorithm to deal with topic loops. A topic that appears more than once in a TKF is labeled as a different instance of the topic, and a TKF with duplicate topics is transformed into a TKF'. Then, the algorithm uses TKF' to build the initial graph of the GKF model. In this example, we set the user-specified thresholds for topic relation identification and edge deletion as ε = 1 and θ = 0.3, respectively. The initial graph derived before graph transformation is shown in Fig. 12. A strongly connected component is discovered in the initial graph. To resolve the vertex relation problem in the strongly connected component, the algorithm applies the topic relation identification procedure detailed in Fig. 5. The vertex relation in the strongly connected component is shown in G_s in Fig. 12. The number on each edge represents the edge's weight. Recall that the weight is derived by Eq. (5) to indicate the importance of the edge.

Fig. 13 shows the result of removing the infrequent edges from the graph in Fig. 12. The sub-graph G_s in the initial graph is transformed into a vertex v_{Gs}; and the edge that connects a vertex in G_s with another vertex, i.e., e_{ρ,D}, is removed because its weight is less than 0.3. Finally, the algorithm merges vertices that are different instances of the same topic into one vertex. For example, in Fig. 12, vertices v_{B1} and v_{B2} are different instances of the same topic, so they are merged to form the vertex v_B. Moreover, the edge e_{ρ,B2} is replaced by an edge connecting v_ρ to v_γ; and the edge e_{B2,C} is changed to edge e_{ρ,C}. The vertices v_{A1} and v_{A2} are two instances of topic A; hence they are merged to form vertex v_A, and their edges are changed accordingly.

Fig. 14 shows the final GKF graph, which considers the duplicate topics in each worker's TKF. To illustrate all knowledge paths in the graph, the vertex v_{Gs} is converted into the original graph G_s.

4.3. Identifying knowledge referencing paths in a GKF graph

We have developed a method for identifying frequent knowledge paths from the GKF graph to describe the information needs of a group of workers, i.e., their knowledge referencing behavior. A knowledge path consists of several vertices and edges that can be traversed from the start vertex to the end vertex. To identify frequent knowledge paths, each path is evaluated by a path score derived from the weights of the edges on the path. Paths with scores higher than a user-specified threshold are regarded as frequent knowledge paths in the GKF and are selected for the group. Specifically, such knowledge paths (patterns) represent the frequent knowledge referencing behavior and important knowledge flows. The paths can also be provided to help group members access and learn group-related knowledge.
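The selection step can be sketched as follows, taking the path score to be the minimum edge weight on the path, as Eq. (6) defines it. This is a minimal illustration under assumptions, not the authors' implementation. The function names, the depth-first enumeration of simple paths, and the edge weights in the usage lines are all hypothetical.

```python
def path_score(path, weights):
    """Score a path by the minimum weight among its edges (cf. Eq. (6))."""
    return min(weights[(x, y)] for x, y in zip(path, path[1:]))


def frequent_paths(weights, start, end, threshold):
    """Return (path, score) for simple start-to-end paths scoring above threshold.

    weights: {(x, y): weight} for the directed edges of the GKF graph.
    ASSUMPTION: paths are enumerated by depth-first search and a vertex is
    never revisited within a path; the source does not fix an enumeration.
    """
    successors = {}
    for x, y in weights:
        successors.setdefault(x, []).append(y)

    results = []

    def dfs(path):
        node = path[-1]
        if node == end and len(path) > 1:
            score = path_score(path, weights)
            if score > threshold:
                results.append((path, score))
            return
        for nxt in successors.get(node, []):
            if nxt not in path:  # keep the path simple (no repeated vertex)
                dfs(path + [nxt])

    dfs([start])
    return results


if __name__ == "__main__":
    # Hypothetical edge weights, chosen only for illustration.
    w = {("B", "A"): 0.75, ("A", "C"): 0.5, ("C", "D"): 0.5,
         ("A", "D"): 0.25, ("D", "F"): 0.5}
    print(frequent_paths(w, "B", "F", threshold=0.3))
```

Because the score is a minimum, a single weak edge is enough to disqualify a path: in the usage lines, the path through the edge from A to D is dropped even though its other edges are heavily weighted.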

We are interested in finding frequent knowledge paths because they represent the frequent referencing behavior patterns of a group of workers. The discovered paths will be important references for workers, especially for novices in the group. A path's score indicates its importance and is based on the weights of the edges on the path, as defined in Eq. (6).

$$ps_i = \min\{\, w_{e_{x,y}} \mid \forall e_{x,y} \in path_i \,\}, \qquad (6)$$

where ps_i is the path score of path i, and w_{e_{x,y}} is the weight of edge e_{x,y}, which belongs to path i and represents a direct flow relation between vertex x and vertex y. Based on the weights of all the edges on a specific path, a path score is derived from the minimal weight among the edges to indicate the path's level of importance. Note that the edge weight derived by Eq. (5) denotes the importance of the direct flow in a GKF. A large edge weight means that the referencing flow between topics is highly significant for the group of workers.

Table 3
The TKFs of four workers.

Worker    Topic-level KF (TKF)    TKF'
John      ⟨A, B, A, C, D, F⟩      ⟨A1, B1, A2, C, D, F⟩
Mary      ⟨B, A, B, C, D⟩         ⟨B1, A1, B2, C, D⟩
Lisa      ⟨B, A, D, F⟩            ⟨B1, A1, D, F⟩