

5.5 Experimental setup

5.5.2 Data, participants and evaluation metrics

Experiments were carried out in a real application domain: research tasks conducted in a research institute laboratory. The tasks consist of writing research papers or conducting research projects. The real application domain restricts the sample size of the data and the number of participants in the experiments.

Fifty research tasks were collected, with 31 existing tasks and 19 executing tasks.

Over 250 documents accessed by the tasks were collected during the period 2002–2003. The smallest meaningful information elements of each document, such as title, abstract, journal and author, were extracted. Each document contained an average of 90 distinct terms after information extraction and document pre-processing (e.g., case folding, stemming, and stop-word removal).
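As a rough illustration of this pre-processing step, the following is a minimal Python sketch; the tokenizer, stop-word list and suffix stemmer used here are simplified stand-ins, since the exact tools employed are not specified in this section:

```python
import re

# Hypothetical stop-word list; the actual list used in the experiments is not specified.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "and", "on", "based"}

def simple_stem(term: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer (e.g., the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text: str) -> set:
    """Case folding, stop-word removal and stemming; returns the distinct terms of a document."""
    tokens = re.findall(r"[a-z]+", text.lower())            # case folding + tokenization
    return {simple_stem(t) for t in tokens if t not in STOP_WORDS}

# Example: distinct terms extracted from a title-like information element
print(preprocess("Modeling of Process-View based Workflow Management"))
```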

An existing task is a historical task that has been accomplished in the research institute. The task corpus of each existing task, namely a feature vector of weighted terms, is derived by analyzing the documents generated and accessed by the existing task, as described in Section 4.1. Fuzzy classification is performed to derive the relevance degrees of existing tasks to categories, as described in Section 4.2. Existing tasks are classified into five categories, and the task categorization database records the relevance degree of each existing task to each category. On the other hand, an executing task is the target task that the knowledge worker conducts at hand. The task profile of an executing task (or an on-going task) is derived from the task corpora of existing tasks and their relevance to the executing task, as described in Section 5.3.
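The detailed derivation is given in Section 5.3; purely for illustration, the sketch below assumes the executing-task profile is formed as a relevance-weighted combination of the existing tasks' term-weight vectors, which is an assumed aggregation scheme rather than the exact formula of Section 5.3:

```python
from collections import defaultdict

def derive_task_profile(task_corpora, relevance):
    """Illustrative only: combine existing-task corpora (dicts of term -> weight) into an
    executing-task profile, weighted by each existing task's relevance degree.
    The weighted sum and the normalization used here are assumptions for illustration."""
    profile = defaultdict(float)
    for task_id, corpus in task_corpora.items():
        rel = relevance.get(task_id, 0.0)
        for term, weight in corpus.items():
            profile[term] += rel * weight
    total = sum(profile.values()) or 1.0          # normalize weights to sum to 1
    return {term: w / total for term, w in profile.items()}

# Toy example: two existing tasks and their relevance degrees to the executing task
corpora = {"T1": {"workflow": 0.6, "process": 0.4},
           "T2": {"recommender": 0.7, "filtering": 0.3}}
print(derive_task_profile(corpora, {"T1": 0.8, "T2": 0.2}))
```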

Participants

Knowledge workers usually require a long time (e.g., one year) to accomplish knowledge-intensive tasks. However, it is difficult to design experiments relevant to real-world problems when the task performance process spans such a long period.

Thus, we chose evaluators according to their task execution progress, classified into two levels: familiar or unfamiliar with the executing task. Consequently, two user groups were chosen for conducting the experiments: experienced workers familiar with the executing task, and novices unfamiliar with it. The number of experimental participants is likewise restricted.

Six executing tasks were chosen as the testing set for the evaluations, as listed in Table 6, and eighteen workers were selected to participate in the experiments. Note that two participants gave up during the testing. To evaluate the effectiveness of collaborative relevance assessment, the executing tasks in the testing set are those with more than one knowledge worker participating in the task. Moreover, we chose executing tasks conducted by at least one novice and one experienced worker to evaluate the effectiveness of the proposed methods for different user groups. We randomly selected one or two experienced workers and one or two novices from each testing task as participants in the testing set. These selection constraints of the problem domain also restrict the size of the testing set.

Table 6. Six selected executing tasks (on-going tasks)

Task  Task Name                                                Task Characteristic
1     A Study of Feature-Weighting Clustering in Recommender   Proposal of Thesis
2     Comparisons of Collaborative Filtering for               Proposal of Thesis
3     News Detection and Tracking based on Event Hierarchy     Proposal of Thesis
4     Deployment of Composite e-Service Framework              System Development
5     Recommendation in Composite e-Service                    Research project
6     Modeling of Process-View based Workflow Management       Research project

Performance evaluation metrics

The retrieval effectiveness is plotted as a recall-precision curve, which treats precision as a function of recall [7][83].

Precision and recall. Precision is the fraction of retrieved items (tasks or documents) that are relevant, while recall is the fraction of the total known relevant items that are retrieved, defined as Eq. 5.3 and Eq. 5.4.

precision = \frac{\text{retrieved items that are relevant}}{\text{total retrieved items}}    (5.3)

recall = \frac{\text{relevant items that are retrieved}}{\text{total known relevant items}}    (5.4)
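A minimal sketch of Eq. 5.3 and Eq. 5.4 in Python, assuming the retrieved items and the known relevant items are given as sets of item identifiers:

```python
def precision(retrieved: set, relevant: set) -> float:
    """Eq. 5.3: fraction of retrieved items that are relevant (retrieved set must be non-empty)."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    """Eq. 5.4: fraction of known relevant items that are retrieved (relevant set must be non-empty)."""
    return len(retrieved & relevant) / len(relevant)

# Example: 4 retrieved items, 5 known relevant items, 3 in common
retrieved, relevant = {"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6"}
print(precision(retrieved, relevant), recall(retrieved, relevant))  # 0.75 0.6
```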

Notably, both the number of total retrieved items and the number of total known relevant items must be greater than zero. Increasing the number of retrieved items tends to reduce precision and increase recall. Generally, precision is high at low recall levels and low at high recall levels. Thus, the recall-precision curve is used to show the interpolated precision at each recall level, as follows. The recall values can be divided into different recall levels, with rv_i, i \in \{1, 2, \ldots, n\}, denoting the reference point of the i-th recall level. The interpolated precision IP_r(rv_i) can thus be expressed as

IP_r(rv_i) = \max \{ P_r(rv) \mid rv_i \le rv < rv_{i+1} \},

where P_r(rv) represents the precision value given a recall value of rv.

The interpolated precision at each recall level can be derived for each task being evaluated. For evaluating a set of tasks, the average interpolated precision is derived as Eq. 5.5,

aveIP(rv_i) = \frac{1}{k} \sum_{j=1}^{k} IP_{r,j}(rv_i)    (5.5)

where aveIP(rv_i) denotes the average interpolated precision at the i-th recall level, IP_{r,j}(rv_i) denotes the interpolated precision of the j-th evaluated task, and k denotes the number of evaluated tasks.
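To make the computation concrete, the sketch below derives IP_r(rv_i) from a ranked result list and then averages over tasks as in Eq. 5.5; the eleven standard recall levels 0.0, 0.1, ..., 1.0 are an assumption, since the number of levels is not stated here:

```python
def interpolated_precision(ranked, relevant, levels=None):
    """IP_r(rv_i) = max P_r(rv) for rv_i <= rv < rv_{i+1}, computed from a ranked list of
    item ids and the set of known relevant ids; recall/precision observed after each rank."""
    if levels is None:
        levels = [i / 10 for i in range(11)]               # 0.0, 0.1, ..., 1.0 (assumption)
    points, hits = [], 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))  # (recall, precision) pair
    bounds = levels[1:] + [float("inf")]                    # upper bound of the last level
    return [max((p for r, p in points if lo <= r < hi), default=0.0)
            for lo, hi in zip(levels, bounds)]

def average_interpolated_precision(per_task_ip):
    """Eq. 5.5: aveIP(rv_i) = (1/k) * sum over the k evaluated tasks of IP_r(rv_i)."""
    k = len(per_task_ip)
    return [sum(ip[i] for ip in per_task_ip) / k for i in range(len(per_task_ip[0]))]

# Example: two evaluated tasks with their ranked results and known relevant sets
ip1 = interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
ip2 = interpolated_precision(["d5", "d6", "d7"], {"d6"})
print(average_interpolated_precision([ip1, ip2]))
```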

Evaluation criteria

The effectiveness of the task-based knowledge support method is measured by recall and precision. In this experiment, we want to evaluate the tradeoff between recall and precision and the overall system performance at each recall level. Thus, the relationship between precision and recall is depicted as a recall-precision curve. That is, we observe the trend of the recall-precision curve (decreasing, increasing, or even a horizontal line) and the precision value at each recall level to discuss the search capability of the system.

We do not choose precision alone as the performance metric, since it does not readily reveal the overall system performance of the various methods and only shows the performance under top-N document or task support. Even if we chose the F-measure as the performance metric, it would only show the tradeoff between recall and precision and could not depict the overall system performance. For a "perfect" search system, precision would be 100 percent at every recall level, and the recall-precision curve would therefore be a horizontal line at 100 percent. In practice, however, precision is generally high at low recall levels and low at high recall levels, so the curve tends to decrease. Thus, if one method's curve lies completely above another's, the former method is better. Accordingly, we compare the methods, including the query-based method (baseline), B-RA, F-RA, 2-F-RA, and the Collaborative 2-F-RA method, by their recall-precision curves, and we discuss the advantages and disadvantages of the methods from three aspects.

(1) The average interpolated precision, aveIP(rv_i), to compare the search performance of the methods.

(2) Looking more specifically, the precision value at each recall level, to examine the tradeoff between the methods.

(3) The recall-precision curves, to show the overall performance of the methods (a minimal plotting sketch follows this list).
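For aspect (3), a plotting sketch assuming matplotlib is available; the method names and precision values below are placeholders for illustration only, not experimental results:

```python
import matplotlib.pyplot as plt

# Placeholder interpolated-precision values at recall levels 0.0, 0.1, ..., 1.0;
# these numbers are illustrative, not the experimental results reported later.
recall_levels = [i / 10 for i in range(11)]
curves = {
    "Query-based (baseline)": [0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2],
    "Collaborative 2-F-RA":   [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4],
}

# One curve per method; a curve lying completely above another indicates a better method.
for method, precisions in curves.items():
    plt.plot(recall_levels, precisions, marker="o", label=method)

plt.xlabel("Recall")
plt.ylabel("Interpolated precision")
plt.title("Recall-precision curves of the compared methods")
plt.legend()
plt.show()
```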

Ideally, a method has high precision at every recall level and is far better than the other methods. However, some methods may have a low average precision value yet high precision over a certain interval of recall levels.