Knowledge Flow Mining Phase - The Overview of Knowledge Flow-Based Research

Chapter 3. The Overview of Knowledge Flow-Based Research

3.3 Knowledge Flow Mining Phase

The objective of the knowledge flow (KF) mining phase is to identify the KF of each knowledge worker. In this section, we describe how the KF mining method identifies KFs from workers’ logs. This phase consists of three steps: document profiling, document clustering and KF extraction. In the first step, each document is represented as a document profile, which is an n-dimensional vector comprised of significant terms and their weights.

Then, based on the document profiles, documents with higher similarity measures are grouped in clusters by the hierarchical clustering method. In the third step, topic-level and codified-level KFs are generated from the document clustering results. A topic-level KF is expressed as a sequence of topics referenced by a worker, while a codified-level KF is represented as a sequence of codified knowledge accessed by a worker. Further details are given in the following subsections.

3.3.1 Document Profiling and Document Clustering

Two profiles, a document profile and a topic profile, are used to represent a worker’s KF.

A document profile can be represented as an n-dimensional vector composed of terms and their respective weights, derived by the normalized tf-idf approach based on Eq. (1). Terms with higher weights are selected as discriminative terms that describe the characteristics of a document, and the document profile of $d_j$ is comprised of these discriminative terms. Let the document profile be $DP_j = \langle dt_{1j}:dtw_{1j},\ dt_{2j}:dtw_{2j},\ \ldots,\ dt_{nj}:dtw_{nj} \rangle$, where $dt_{ij}$ is term $i$ in document $d_j$ and $dtw_{ij}$ is the degree of importance of term $i$ to document $d_j$, derived by the normalized tf-idf approach. The document profiles are used to measure the similarity of the documents.
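The profiling step can be illustrated with a minimal Python sketch. Eq. (1) is not reproduced in this section, so the weighting used here (raw term frequency times log(N/df), followed by L2 normalization) is an assumption rather than the authors' exact formula. The function name `document_profiles` and the `top_k` cut-off are hypothetical.

```python
import math
from collections import Counter


def document_profiles(docs, top_k=3):
    """Build a profile {term: weight} for each document.

    docs: dict mapping doc_id -> list of tokens.
    The tf-idf variant below is an assumed stand-in for Eq. (1).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    profiles = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Raw tf-idf weight (assumed form: tf * log(N / df)).
        raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        if norm == 0:
            profiles[doc_id] = {}
            continue
        weights = {t: w / norm for t, w in raw.items()}
        # Keep the top_k highest-weighted terms as discriminative terms;
        # terms with zero weight carry no information and are skipped.
        ranked = sorted(
            ((t, w) for t, w in weights.items() if w > 0),
            key=lambda kv: (-kv[1], kv[0]),
        )
        profiles[doc_id] = dict(ranked[:top_k])
    return profiles
```

A term that occurs in every document receives zero weight under this variant and therefore never appears in a profile, which matches the idea of keeping only discriminative terms.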

We adopt the single-link hierarchical clustering method [29] to group documents with similar profiles into clusters, using the cosine measure to calculate the similarity between the profiles of two documents. The single-link method computes the cluster similarity between two clusters $C_r$ and $C_t$ by $\max_{d_i \in C_r,\ d_j \in C_t} \{ sim_{cos}(d_i, d_j) \}$ [60], and then merges the two most similar clusters into a single cluster. The similarity computation and cluster combination steps are repeated until the similarity of the most similar pair of clusters is no greater than a pre-specified threshold value. Different clustering results can be obtained by setting different threshold values. We adjust the threshold value systematically and use the quality measure described in Section 2.3.2 to evaluate each clustering result. Then, we take the one with the best quality measure as our clustering result. Note that a cluster represents a topic set and has a topic profile (derived from the document cluster) that describes the features of the topic.
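The merge loop described above can be sketched in Python as follows. This is a brute-force illustration for small inputs, not the authors' implementation. The names `cosine` and `single_link` are hypothetical, and profiles are assumed to be sparse dictionaries mapping terms to weights.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles {term: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def single_link(profiles, threshold):
    """Merge the most similar clusters until the best similarity <= threshold."""
    clusters = [[doc_id] for doc_id in profiles]

    def cluster_sim(a, b):
        # Single link: similarity of the closest pair of documents.
        return max(cosine(profiles[i], profiles[j]) for i in a for j in b)

    while len(clusters) > 1:
        best_sim, r, t = max(
            (cluster_sim(clusters[r], clusters[t]), r, t)
            for r in range(len(clusters))
            for t in range(r + 1, len(clusters))
        )
        if best_sim <= threshold:
            break  # no pair is similar enough to merge
        clusters[r] = clusters[r] + clusters[t]
        del clusters[t]  # t > r, so index r stays valid
    return clusters
```

Running `single_link` with several threshold values and scoring each result with a quality measure corresponds to the systematic threshold adjustment described in the text. The quality measure itself is defined in Section 2.3.2 and is not sketched here.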

Topic Profile

Documents in the same cluster contain similar content and form a topic set. The key features of the cluster are described by a topic profile, which is derived from the profiles of documents that belong to the cluster. Let $TP_x = \langle tt_{1x}:ttw_{1x},\ tt_{2x}:ttw_{2x},\ \ldots,\ tt_{nx}:ttw_{nx} \rangle$ be the profile of a topic (cluster) $x$, where $tt_{ix}$ is a topic term and $ttw_{ix}$ is the weight of the topic term.

In addition, let $D_x$ be the set of documents in cluster $x$. The weight of a topic term is determined by Eq. (7) as follows:

$$ttw_{ix} = \frac{\sum_{j \in D_x} dtw_{ij}}{|D_x|}, \qquad (7)$$

where $dtw_{ij}$ is the weight of term $i$ in document $j$, and $|D_x|$ is the number of documents in cluster $x$. The weight of a topic term is obtained from the average weight of the terms in the document set.
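Eq. (7) can be illustrated with a short Python sketch. The function name `topic_profile` is hypothetical, and the sketch assumes that a term missing from a document's profile contributes a weight of zero to the sum.

```python
def topic_profile(doc_profiles, cluster):
    """Average each term's weight over all documents in the cluster (Eq. 7).

    doc_profiles: dict doc_id -> {term: weight}
    cluster: list of doc_ids belonging to topic x (the set D_x)
    """
    totals = {}
    for doc_id in cluster:
        for term, weight in doc_profiles[doc_id].items():
            totals[term] = totals.get(term, 0.0) + weight
    size = len(cluster)  # |D_x|
    return {term: total / size for term, total in totals.items()}
```

For example, if a cluster holds two documents with profiles `{"kf": 0.6, "mining": 0.8}` and `{"kf": 0.4}`, the topic weight of `kf` is (0.6 + 0.4) / 2 = 0.5 and that of `mining` is 0.8 / 2 = 0.4.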

3.3.2 Knowledge Flow Extraction

In this section, we describe the method used to extract a worker’s KF from his/her data log when performing a task. We define a task as a unit of work, which denotes either a previously executed (i.e., historical) task or the current task. When performing a task in a knowledge-intensive and task-based environment, a worker usually requires a large amount of task-related knowledge to accomplish the task. By analyzing a worker’s referencing behavior for a specific task, the corresponding knowledge flow of the task is derived by the knowledge flow extraction method. Note that if a worker performs more than one task, more than one knowledge flow will be extracted. For a specific task, the method derives two kinds of KF, codified-level KF and topic-level KF, to represent the worker’s information needs for the task.

Codified-Level Knowledge Flow

The codified-level KF is extracted from the documents recorded in the worker’s work log. In most situations, workers are motivated to access a document about a specific task because of knowledge derived from other documents. The documents are arranged according to the times they were accessed, and a document sequence, i.e., a codified-level KF, is obtained. The order of documents in the sequence is subjective, since it is determined by the worker. In other words, each worker has his/her own codified-level KF, which represents his/her knowledge accumulation process for a specific task at the codified level.
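As a minimal sketch, extracting a codified-level KF amounts to filtering the log by task and ordering the accessed documents by time. The log layout assumed here, a list of `(task_id, access_time, doc_id)` records, is hypothetical, as is the function name `codified_level_kf`. Repeated accesses to the same document are kept, which is also an assumption.

```python
def codified_level_kf(log, task_id):
    """Return the documents accessed for one task, ordered by access time.

    log: iterable of (task_id, access_time, doc_id) records (assumed layout).
    """
    records = [(time, doc) for task, time, doc in log if task == task_id]
    records.sort(key=lambda rec: rec[0])  # stable sort by access time
    return [doc for _, doc in records]
```

Because a separate sequence is produced per task, a worker who performs several tasks ends up with several knowledge flows.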

Topic-Level Knowledge Flow

The topic-level KF is derived by mapping documents in the codified-level KF of a specific task into corresponding clusters and is represented by a topic sequence. In the previous step, documents with similar content were grouped into clusters. We use the document clustering results to map the documents in the codified-level KF into topics (clusters) in order to compile the topic-level KF. Since the codified-level KF is the basis of the topic-level KF, the knowledge in the latter is an abstraction of the former, and indicates how knowledge flows among various topics. A topic in the topic-level KF may be duplicated because the worker may read about the same topic frequently to obtain essential knowledge while executing a task.
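The mapping from a codified-level KF to a topic-level KF can be sketched as a lookup from document to cluster. The function name `topic_level_kf` and the topic labels are hypothetical. The sketch assumes each document belongs to exactly one cluster, as produced by hierarchical clustering.

```python
def topic_level_kf(codified_kf, clusters):
    """Map each document in a codified-level KF to its topic (cluster).

    codified_kf: list of doc_ids in access order.
    clusters: dict topic_label -> list of doc_ids in that cluster.
    """
    doc_to_topic = {
        doc: topic for topic, docs in clusters.items() for doc in docs
    }
    # Duplicates are kept: a worker may return to the same topic repeatedly.
    return [doc_to_topic[doc] for doc in codified_kf]
```

With clusters `{"T1": ["d1", "d3"], "T2": ["d2"]}`, the codified-level KF `["d1", "d2", "d3"]` maps to the topic-level KF `["T1", "T2", "T1"]`, in which topic `T1` appears twice.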

In the document: Providing Knowledge Support through Knowledge Flow Mining and Document Recommendation (pp. 27-30)