In this chapter, we discuss the background of our research, including knowledge flow, information retrieval and task-based knowledge support, document clustering methods, dynamic programming algorithm, rule-based recommendations, collaborative filtering and process mining.
2.1 Knowledge Flow
Knowledge can flow among people and processes to facilitate knowledge sharing and reuse. The concept of knowledge flow has been applied in various domains, e.g., scientific research, communities of practice, teamwork, industry, and organizations [33, 63]. Scholarly articles represent the major medium for disseminating knowledge among scientists to inspire new ideas [8, 63]. A citation implies that there is knowledge flow between the citing article and the cited article. Such citations form a knowledge flow network that enables knowledge to flow between different scientific projects to promote interdisciplinary research and scientific development.
Knowledge management (KM) enhances the effectiveness of teamwork by accumulating and sharing knowledge among team members to facilitate peer-to-peer knowledge sharing [61]. To improve the efficiency of teamwork, Zhuge [62] proposed a pattern-based approach that combines codification and personalization strategies to design an effective knowledge flow network.
Kim et al. [33] proposed a knowledge flow model combined with a process-oriented approach to capture, store, and transfer knowledge. Knowledge flow (KF) in weblogs (blogs) is a communication pattern where the post of one blogger links to that of another blogger to exchange knowledge [8].
Similarly, knowledge flow in communities of practice helps members share their knowledge and experience about a specific domain to complete their tasks [46].
2.2 Information Retrieval and Task-based Knowledge Support
Information retrieval (IR) facilitates access to specific items of information [10, 21]. The vector space model [48] is typically used to represent documents as vectors of index terms, where the weights of the terms are measured by the tf-idf approach. tf denotes the occurrence frequency of a particular term in the document, while idf denotes the inverse document frequency of the term. Terms with higher tf-idf weights are used as discriminating terms to filter out common terms. The weight of a term i in a document j, denoted by $w_{i,j}$, is expressed as follows:
\[ w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log\frac{N}{n}, \]
where $tf_{i,j}$ is the frequency of term i in document j, N is the total number of documents in the collection, and n is the number of documents in which term i occurs at least once.
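The weighting scheme can be illustrated with a minimal sketch. It assumes raw term counts for tf and a natural logarithm for idf; the function name `tfidf_weights` and the toy documents are hypothetical and are not taken from the cited systems.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N: total number of documents in the collection
    # n: number of documents in which each term occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence frequency of each term (assumed tf form)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical toy collection of three tokenized documents.
docs = [
    ["knowledge", "flow", "task"],
    ["knowledge", "retrieval", "retrieval"],
    ["knowledge", "cluster", "task"],
]
weights = tfidf_weights(docs)
```

In this toy collection the term "knowledge" occurs in every document, so its weight is zero, whereas "retrieval" occurs twice in a single document and receives the highest weight. This is the sense in which high tf-idf terms act as discriminating terms.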
Information retrieval techniques coupled with workflow management systems (WfMS) have been used to support proactive delivery of task-specific knowledge based on the context of tasks within a process [2]. For example, the KnowMore system [1] provides context-aware delivery of task-specific knowledge. The Kabiria system assists knowledge workers with knowledge-based document retrieval by considering the operational context of task-associated procedures [9].
Information filtering with a similarity-based approach is often used to locate knowledge items relevant to the task-at-hand. The discriminating terms of a task are usually extracted from a knowledge item/task to form a task profile, which is used to model a worker’s information needs. Holz et al. [27] proposed a similarity-based approach to organize desktop documents and proactively deliver task-specific information. Liu et al. [39] proposed a
K-Support system to provide effective task support for a task-based working environment.
2.3 Document Clustering Methods
Document clustering or unsupervised document classification methods are used in many applications. Most methods apply pre-processing steps to the document set and represent each document as a vector of index terms. To cluster similar documents, the similarity between documents is usually measured by the cosine measure [10, 57], which computes the cosine of the angle between their corresponding feature vectors. Two documents are considered similar if the cosine similarity value is high. The cosine similarity of two documents, X and Y, is
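\[ sim_{cos}(X, Y) = \frac{\vec{X} \cdot \vec{Y}}{\lVert \vec{X} \rVert \, \lVert \vec{Y} \rVert}, \]
where $\vec{X}$ and $\vec{Y}$ are the feature vectors of X and Y, respectively.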
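As a minimal sketch, the cosine measure can be computed over sparse term-weight vectors as follows. Representing each feature vector as a `{term: weight}` dictionary is an assumption made for illustration, as is returning 0 for an all-zero vector.

```python
import math

def cosine_similarity(x, y):
    """x, y: sparse feature vectors given as {term: weight} dicts."""
    # Dot product over the terms the two vectors share.
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_x * norm_y)

# Hypothetical vectors: identical documents score 1, disjoint ones score 0.
same = cosine_similarity({"flow": 1.0, "task": 1.0}, {"flow": 1.0, "task": 1.0})
disjoint = cosine_similarity({"flow": 1.0}, {"task": 1.0})
partial = cosine_similarity({"flow": 1.0}, {"flow": 1.0, "task": 1.0})
```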
Documents within a cluster are very similar, while documents in different clusters are very dissimilar.
Agglomerative hierarchical clustering [30, 32] is a popular document clustering method.
In this work, we use the single-link clustering method [20, 29] to cluster codified knowledge (documents). Initially, each document is regarded as a cluster. Next, the single-link method computes the similarity between two clusters, which is equal to the greatest similarity between any document in one cluster and any document in the other cluster. Then, based on the similarity measurement, the two most similar clusters are merged to form a new cluster.
The merging process continues until all documents have been merged into one cluster at the top of a hierarchy, or a pre-specified threshold is satisfied [29].
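The merging procedure described above can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: the function name `single_link_clustering`, the generic `sim` callback, and the one-dimensional toy data are assumptions, and the quadratic search for the best pair is chosen for readability, not efficiency.

```python
def single_link_clustering(items, sim, threshold):
    """Agglomerative single-link clustering; returns clusters as lists of item indices."""
    clusters = [[i] for i in range(len(items))]  # each item starts as its own cluster

    def cluster_sim(a, b):
        # Single link: the greatest similarity between any member of a and any member of b.
        return max(sim(items[p], items[q]) for p in a for q in b)

    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(clusters[a], clusters[b])
                if best is None or s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # the most similar pair is below the pre-specified threshold
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the two most similar clusters
        del clusters[b]
    return clusters

# Hypothetical 1-D points with a distance-based similarity in (0, 1].
points = [0.0, 0.1, 0.2, 5.0, 5.1]
clusters = single_link_clustering(points, lambda x, y: 1.0 / (1.0 + abs(x - y)), 0.5)
```

With the threshold set to 0.5, the three points near 0 and the two points near 5 end up in separate clusters. Setting the threshold to 0 would continue merging until a single cluster remains.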
2.3.1 The CLIQUE Clustering Method
We also apply the CLIQUE clustering method [6, 29] to derive worker groups. CLIQUE starts with the definition of a unit (an elementary rectangular cell in a subspace) and uses a bottom-up approach to find units whose densities exceed a threshold. The algorithm has four key steps. First, 1-dimensional units are determined by dividing each dimension into equal-width intervals (a grid). Next, candidate k-dimensional units are generated from (k-1)-dimensional dense units by self-joining (k-1)-dimensional units that share k-2 dimensions (Apriori-style reasoning). Then, all the subspaces are sorted by their coverage and those with less coverage are pruned. Finally, each cluster is identified as a maximal set of connected dense units.
2.3.2 Clustering Quality
A good clustering method generates clusters that are cohesive and isolated from other clusters. For this reason, the measurement of clustering quality takes both inter-cluster similarity and intra-cluster similarity into account [16]. Let C be a set of clusters. The inter-cluster similarity between two clusters Ci and Cj, similarityA(Ci, Cj), is defined as the average of all pairwise similarities between the documents in Ci and Cj; and the intra-cluster similarity within a cluster Ci, similarityA(Ci, Ci), is defined as the average of all pairwise similarities between documents in Ci. On the basis of the cohesion and isolation of C, the quality measure of C, CQ(C), is defined as:
Note that the smaller the value of CQ(C), the better the quality of the derived set of clusters, C, will be.
2.4 Dynamic Programming Algorithm for Sequence Alignment
In this work, each worker’s knowledge flow is represented as a sequence. We use sequence alignment techniques to analyze the similarity of workers’ knowledge flows, which corresponds to a sequence alignment problem. Such techniques are used to compare or align strings in many application domains, such as biology, speech recognition, and web session clustering. A number of methods can be used for sequence alignment, e.g., the sequence alignment method (SAM) [14, 24] and dynamic programming. SAM, also called the string edit distance method [35], considers the sequential order of elements in a sequence and then measures the similarity/dissimilarity of sequences. The measurements reflect the operations necessary to equalize the sequences by computing the costs of deleting and inserting unique elements as well as the costs of reordering common elements [24, 41]. In addition, Charter et al. [14] proposed a dynamic programming algorithm that solves the sequence alignment problem efficiently. The algorithm consists of three steps: initialization, FindScore and FindPath [14, 43].
The first step creates a dynamic programming matrix with N+1 columns and M+1 rows, where N and M correspond to the sizes of the sequences to be aligned. One sequence is placed at the top of the matrix and the other is placed on the left-hand side of the matrix. There is a gap at the end of each sequence to allow calculation of the alignment score. The FindScore step calculates the two-dimensional alignment score of sequences. If two aligned sequences have an identical matching in the same column, the column is given a positive score s (e.g., +1 or +2); but if the values in a column are mismatches, the score s is zero or negative (e.g., 0, -1 or -2). In addition, if a column contains a gap, it is given a penalty score w (e.g., 0, -1 or -2).
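The three steps (initialization, FindScore, and FindPath) can be sketched as a standard global alignment. This is an illustration under stated assumptions and not necessarily the exact procedure of [14]: each cell takes the maximum of the diagonal neighbor plus the match/mismatch score s and the two adjacent neighbors plus the gap penalty w; the matrix is indexed from its first row and column; the border cells accumulate gap penalties; and the function name `align` and the default scores (+1, -1, -1) are hypothetical.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (maximal alignment score, list of aligned pairs); None marks a gap."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Initialization: (M+1) x (N+1) matrix whose border accumulates gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # FindScore: each cell is the best of a match/mismatch step or a gap step.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
    # FindPath: trace back from the last cell to the first to recover one optimal path.
    pairs = []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((seq_a[i - 1], seq_b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((seq_a[i - 1], None))  # gap inserted in seq_b
            i -= 1
        else:
            pairs.append((None, seq_b[j - 1]))  # gap inserted in seq_a
            j -= 1
    pairs.reverse()
    return score[-1][-1], pairs

# Hypothetical knowledge flows expressed as sequences of topic labels.
best_score, path = align(["A", "B", "C", "D"], ["A", "C", "D"])
```

For these two sequences the sketch aligns A, C, and D and places a gap opposite B, giving three matches and one gap penalty.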
Therefore, starting from the bottom right-hand corner, each position in the dynamic programming matrix is given the maximal score Mij. For each position in the matrix, Mij is defined as follows:
\[ M_{i,j} = \max \{ M_{i-1,j-1} + s_{ij},\; M_{i,j-1} + w,\; M_{i-1,j} + w \}, \tag{3} \]
where i is the row number, j is the column number, $s_{ij}$ is the match/mismatch score, and w is the penalty score. The third step, FindPath, determines the actual KF alignment that derives the maximal score. It traverses the matrix from the destination point (top left-hand corner) to the starting point (bottom right-hand corner) to find an optimal alignment path in order to determine the maximal alignment score δ. We calculate the flow similarity based on the maximal alignment score. The details are given in Section 4.2.
2.5 Rule-based Recommendations
Association rule mining [3-4, 59] is a widely used data mining technique that generates recommendations in recommender systems. An association rule describes the relationships between items, such as products, documents, or movies, based on patterns of co-occurrence across transactions. The Apriori algorithm [3-4] is usually employed to identify such rules.
Two measures, support and confidence, are used to indicate the quality of an association rule [3]. The discovered rules should satisfy two user-defined requirements, namely minimum support and minimum confidence.
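The two measures can be illustrated with a short sketch that uses their standard definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X ⇒ Y is support(X ∪ Y) divided by support(X). The function names and the toy transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # assumed convention when the antecedent never occurs
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical transactions: the sets of documents accessed together.
transactions = [
    {"d1", "d2", "d3"},
    {"d1", "d2"},
    {"d1", "d3"},
    {"d2", "d3"},
]
rule_support = support(transactions, {"d1", "d2"})
rule_confidence = confidence(transactions, {"d1"}, {"d2"})
```

Here the rule d1 ⇒ d2 holds in two of the four transactions (support 0.5) and in two of the three transactions that contain d1 (confidence about 0.67). A rule is retained only if both values meet the user-defined minimums.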
To improve the quality of traditional collaborative filtering (CF), Cho et al. [15] proposed a sequential rule-based recommendation method that considers the evolution of customers’ purchase sequences.
Transactions are clustered into a set of q transaction clusters, $C = \{C_1, C_2, \ldots, C_q\}$, where each $C_j$ is a subset of transactions. Each customer’s transactions over l periods are then transformed into transaction clusters as a behavior locus, $L_i = \langle C_{i,T-l+1}, \ldots, C_{i,T-1}, C_{i,T} \rangle$, where $C_{i,T-k} \in C$, $k = 0, 1, \ldots, l-1$, and $l \geq 2$. Finally, sequential purchase patterns are extracted from the behavior loci of customers by time-based association rule mining to keep track of customers’ preferences during l periods, with T as the current (latest) period. A sequential rule is expressed in the form $C_{T-l+1}, \ldots, C_{T-1} \Rightarrow C_T$, where $C_T$ represents the customers’ purchase behavior in period T. If a target customer’s purchase behavior prior to period T was similar to the conditional part of the rule, then it is predicted that his/her purchase behavior in period T will be $C_T$. Accordingly, $C_T$ is used to recommend products to the target customer in T.
2.6 Collaborative Filtering Recommendation
Collaborative filtering (CF) is a well-known approach used in recommender systems such as GroupLens [34], Ringo [51], Siteseer [47], and Knowledge Pump [22]. CF recommends items, e.g., products, movies, and documents, based on the preferences of people who have the same or similar interests to those of the target user [11, 38, 40]. The CF approach involves two steps: neighborhood formation and prediction. The neighborhood of a target user is selected according to his/her similarity to other users, which is computed by the Pearson correlation coefficient or the cosine measure. Either the k-NN (nearest neighbor) approach or a threshold-based approach is used to choose the n users that are most similar to the target user.
Here, we use the k-NN approach. In the prediction step, the predicted rating is calculated from the aggregated weights of the selected n nearest neighbors’ ratings, as shown in Eq. (4):
\[ p_{u,j} = \bar{r}_u + \frac{\sum_{i=1}^{n} w(u,i)\,(r_{i,j} - \bar{r}_i)}{\sum_{i=1}^{n} |w(u,i)|}, \tag{4} \]
where $p_{u,j}$ is the predicted rating of item j for target user u; $\bar{r}_u$ and $\bar{r}_i$ are the average ratings of user u and user i, respectively; w(u,i) is the similarity between target user u and user i; $r_{i,j}$ is the rating of user i for item j; and n is the number of users in the neighborhood.
Similar to the PCF method, the item-based collaborative filtering (ICF) algorithm [37, 40, 50] analyzes the relationships between items (e.g., documents) first, rather than the relationships between users. Then, the item relationships are used to compute recommendations for workers indirectly by finding items that are similar to other items the worker has accessed previously. Thus, the prediction of an item j for a user u is calculated as the sum of the ratings given by the user to items similar to j, weighted by the item similarity, as shown in Eq. (5):
\[ p_{u,j} = \frac{\sum_{m} w(j,m)\, r_{u,m}}{\sum_{m} |w(j,m)|}, \tag{5} \]
where $p_{u,j}$ represents the predicted rating of item j for user u; w(j,m) is the similarity between two items j and m; $r_{u,m}$ denotes the rating of user u for item m; and the sums run over the items m similar to j that user u has rated. A number of methods can be used to determine the similarity between items, e.g., the cosine-based similarity, correlation-based similarity, and adjusted cosine similarity methods. Since the adjusted cosine similarity method performs better than the others [50], we use it as the similarity measure for the ICF method. The adjusted cosine similarity between two items i and j is given by Eq. (6):
\[ w(i,j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_u)^2}}, \tag{6} \]
where U is the set of users who have rated both items i and j, and $\bar{r}_u$ is the average rating of user u.
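The user-based and item-based predictions of Eqs. (4)-(6) can be sketched as follows, using the standard forms of these formulas. Several choices here are assumptions made for illustration: the Pearson correlation is computed over co-rated items only, the neighborhood is restricted to users who rated the target item, the item-based prediction uses only positively similar items, and all function names and the toy rating matrix are hypothetical.

```python
import math

def mean_rating(ratings, user):
    values = ratings[user].values()
    return sum(values) / len(values)

def pearson(ratings, u, v):
    """Pearson correlation of users u and v over their co-rated items (assumed variant)."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings[u][j] for j in common) / len(common)
    mu_v = sum(ratings[v][j] for j in common) / len(common)
    cov = sum((ratings[u][j] - mu_u) * (ratings[v][j] - mu_v) for j in common)
    var_u = sum((ratings[u][j] - mu_u) ** 2 for j in common)
    var_v = sum((ratings[v][j] - mu_v) ** 2 for j in common)
    denom = math.sqrt(var_u * var_v)
    return cov / denom if denom else 0.0

def predict_user_based(ratings, u, item, k):
    """User-based prediction with the k most similar users who rated the item."""
    candidates = [(pearson(ratings, u, v), v)
                  for v in ratings if v != u and item in ratings[v]]
    neighbors = sorted(candidates, reverse=True)[:k]
    num = sum(w * (ratings[v][item] - mean_rating(ratings, v)) for w, v in neighbors)
    den = sum(abs(w) for w, _ in neighbors)
    base = mean_rating(ratings, u)
    return base + num / den if den else base

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity of items i and j over users who rated both."""
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    dev_i = [ratings[u][i] - mean_rating(ratings, u) for u in users]
    dev_j = [ratings[u][j] - mean_rating(ratings, u) for u in users]
    denom = math.sqrt(sum(a * a for a in dev_i) * sum(b * b for b in dev_j))
    return sum(a * b for a, b in zip(dev_i, dev_j)) / denom if denom else 0.0

def predict_item_based(ratings, u, item):
    """Item-based prediction; only positively similar items are used (assumption)."""
    sims = [(adjusted_cosine(ratings, item, m), m) for m in ratings[u] if m != item]
    sims = [(w, m) for w, m in sims if w > 0]
    den = sum(abs(w) for w, _ in sims)
    if not den:
        return mean_rating(ratings, u)  # assumed fallback when no similar item exists
    return sum(w * ratings[u][m] for w, m in sims) / den

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "u1": {"a": 4, "b": 2, "c": 3},
    "u2": {"a": 5, "b": 3, "c": 4, "d": 5},
    "u3": {"a": 1, "b": 5, "d": 2},
}
user_based = predict_user_based(ratings, "u1", "d", k=1)
item_based = predict_item_based(ratings, "u1", "d")
```

In this toy matrix, u2 rates the shared items exactly one point above u1, so the two users are perfectly correlated and u2 is the single nearest neighbor. The user-based prediction therefore adds u2's deviation on item d to u1's own average rating.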
2.7 Process Mining
In a workflow system, a process mining technique is used to extract the description of a structural process from a set of real process executions [54]. It then infers the relations between the tasks/activities and generates a process model from event-based data (log data) automatically [7, 53, 55-56]. The relations between processes (tasks/activities) are defined as causal relations and parallel relations, and are modeled by a directed graph [7, 23] or an instance graph [56]. Because a workflow log records actual process executions, in which an activity may be repeated, a loop may occur in a process. Most process mining algorithms assume that loops do not exist [23, 56]. However, some algorithms have been proposed to handle the problem of process loops [18, 54]. For example, Agrawal et al.’s algorithm [7] builds a general directed graph with cycles for mining process models from the logs of executed processes. The algorithm labels multiple instances of the same activity with different identifiers to differentiate them in the workflow graph. Vertices with different instances of the same activity form an equivalent set and can be merged to form one vertex. A directed edge is added if there is an edge between two vertices of different equivalent sets.
Process mining is used in various applications. Discovering frequently occurring temporal patterns in process instances facilitates intelligent and automatic extraction of useful knowledge to support business decision-making [7, 28]. Similarly, data mining techniques are exploited in workflow management contexts to mine frequent workflow execution patterns [23]. The frequent patterns represent blocks of activities that have been scheduled together
more frequently during the execution of a process. The sequence of activities within a process, the time required to complete it, the execution cost and the reliability of the process can be predicted by using the process path mining technique [13]. Based on the process patterns and process paths, unexpected and useful knowledge about the process is extracted to help the user make appropriate decisions. In addition, combining the concepts of process mining and social network analysis is useful for mining social networks from event logs [52].
Another benefit of process mining is that it is useful for discovering how people and/or procedures work [54]. In this work, we use process mining to analyze the relations between knowledge topics in a knowledge flow and model the referencing behavior of a group of workers. We design algorithms for mining the group-based knowledge flow (GKF) and construct a GKF as a directed knowledge graph. In such graphs, frequent knowledge paths can be derived to represent the most common referencing behavior of the group.
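The idea of deriving a directed graph with frequent edges from a group's sequences can be conveyed with a deliberately simplified sketch. It is not the GKF mining algorithm designed in this work, nor the algorithm of [7]: it only counts how many flows contain each directly-follows pair of topics and keeps the pairs that meet a minimum support. The function name `mine_flow_graph`, the support definition, and the toy flows are assumptions.

```python
from collections import Counter

def mine_flow_graph(flows, min_support):
    """Return {(source, target): count} for edges found in enough flows."""
    edge_counts = Counter()
    for flow in flows:
        # Count each directly-follows edge at most once per flow.
        edge_counts.update(set(zip(flow, flow[1:])))
    threshold = min_support * len(flows)
    return {edge: count for edge, count in edge_counts.items() if count >= threshold}

# Hypothetical knowledge flows of four workers, as sequences of topic labels.
flows = [
    ["t1", "t2", "t3"],
    ["t1", "t2", "t4"],
    ["t1", "t3"],
    ["t2", "t3"],
]
graph = mine_flow_graph(flows, min_support=0.5)
```

With a minimum support of 0.5, only the edges t1 → t2 and t2 → t3 remain, so t1 → t2 → t3 is the frequent path in this toy group.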