VII. Culture and the Accuracy of Judgment-Driven and Data-driven Severity Rating
7.2 Research Problem and Hypothesis
7.3.3 Procedure
Initial inspection of the software indicated twelve usability problems of interest.
Fifteen US users then completed typical tasks using the software, during which they encountered all of the identified UPs. Based on user performance, data were derived on the frequency, impact, and persistence of the twelve UPs. The UPs’ effect on effectiveness, efficiency, and user satisfaction was also derived from measures of task completion, time taken on task, and a satisfaction survey.
During completion of a task, after experiencing a UP, users were asked to rate its severity. The ratings were based on a five-point scale as follows.
1. An extremely minor problem
2. A minor problem
3. A somewhat important problem
4. A major problem
5. An extremely major problem
Afterwards, twenty-one Taiwanese usability evaluators rated the same UPs, also with the scale shown above, using both judgment-driven and data-driven methods,
with data drawn from the user tests described earlier provided to evaluators when using data-driven rating methods.
Evaluators were walked through each UP, with each problem described in terms of when and where users encountered it, its cause, and the resulting user behavior. Explanations were given first in English and then again in Chinese, although the evaluators’ English was of a high level. Of the twelve UPs, evaluators assessed the first six with no empirical data, i.e., the assessments were judgment-driven, while the second six were assessed with empirical data provided, i.e., the assessments were data-driven. Within each set of six judgment-driven and data-driven severity ratings, three were based on UP frequency, impact, and persistence criteria, while the other three were based on measures of effectiveness, efficiency, and satisfaction.
The results were processed in a number of ways. To test whether the mean severity rating for each UP differed significantly between evaluators and users, a MANOVA test was applied to the results after removing outliers in the user ratings. Evaluator ratings were also compared for internal consistency, since to be effective, ratings must agree with each other as well as being accurate. By setting median user ratings (after removing outliers) as a gold standard against which evaluator ratings could be compared, I was able to use attribute agreement analysis to assess the accuracy of the different severity rating methods. Finally, I compared the observed number of accurate ratings, based on the gold standard, with the expected number using chi-squared analysis.
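The gold-standard step described above can be sketched as follows. This is a minimal illustration, not the analysis actually run in the study: the function names and all ratings are hypothetical, an evaluator rating is assumed to count as accurate only when it exactly equals the median user rating, and the user ratings are assumed to have had outliers removed already, since the removal rule is not specified here.

```python
from statistics import median

def gold_standard(user_ratings):
    """Median user rating per UP (outliers assumed already removed)."""
    return {up: median(ratings) for up, ratings in user_ratings.items()}

def count_accurate(evaluator_ratings, standard):
    """Count evaluator ratings that exactly match the gold standard."""
    hits = total = 0
    for up, ratings in evaluator_ratings.items():
        for rating in ratings:
            total += 1
            hits += int(rating == standard[up])
    return hits, total

# Hypothetical ratings on the 1-5 scale; these are NOT data from the study.
user_ratings = {"UP1": [2, 3, 3, 3, 4], "UP2": [4, 4, 5, 4, 3]}
evaluator_ratings = {"UP1": [3, 3, 2, 4], "UP2": [4, 5, 4, 4]}

standard = gold_standard(user_ratings)
hits, total = count_accurate(evaluator_ratings, standard)
print(standard, hits, total)
```

The resulting counts of matching and non-matching ratings per method are the kind of input that a chi-squared comparison of observed and expected accuracy would take.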
Table VII-2 Severity Rating Methods
Question Type   Method description
Method 1        Judgment-driven, rating based on UP frequency, impact, and user persistence
Method 2        Judgment-driven, rating based on the effect of the UP on product effectiveness, efficiency and user satisfaction
Method 3        Data-driven, rating based on UP frequency, impact, and user persistence
Method 4        Data-driven, rating based on the effect of the UP on product effectiveness, efficiency and user satisfaction
7.4 Results
Overall, results show greater accuracy, if not consistency, for data-driven methods and for methods based on measures of effectiveness, efficiency, and user satisfaction.
MANOVA analysis shows an overall significant difference between ratings by evaluators and users, as well as significant differences for each of Methods 1-3. The small Wilks’ lambda for Method 1 indicates a large difference between mean user and evaluator ratings, while for Methods 2 and 3 evaluator ratings still differ significantly from user ratings but are closer to them. For Method 4, the difference is not significant.
Table VII-3 Association between Method and Rating Accuracy
Question Type   Wilks’ lambda   p-value
Method 1        0.18            0.000**
Method 2        0.69            0.016*
Method 3        0.80            0.049*
Method 4        0.77            0.112
Overall         0.22            0.004**
*significant (alpha = 0.05)
**highly significant (alpha = 0.01)
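For reference, Wilks’ lambda in a one-way MANOVA is commonly defined as the ratio below. The symbols are notation introduced here rather than taken from the study: E is the within-group (error) sums-of-squares-and-cross-products matrix and H is the between-group (hypothesis) matrix.

```latex
\Lambda = \frac{\det(\mathbf{E})}{\det(\mathbf{H} + \mathbf{E})}, \qquad 0 \le \Lambda \le 1
```

Values near 0 mean that group membership (here, user versus evaluator) accounts for most of the variation in ratings, while values near 1 mean that it accounts for little.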
The inter-rater consistency of each severity rating method, as indicated by Kendall’s coefficient of concordance, shows that raters using Method 3 agreed with each other far more than raters using the other methods, while raters using Method 4 were the least consistent.
Table VII-4 Rating Method and Inter-Rater Agreement
Question Type Kendall’s coefficient of concordance p-value
Method 1 0.41 0.0002**
Method 2 0.49 0.0000**
Method 3 0.79 0.0000**
Method 4 0.40 0.0002**
Overall 0.53 0.0000**
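As an illustration of how a coefficient of this kind is obtained, the sketch below computes Kendall's coefficient of concordance (W) with the standard correction for tied ranks, which matters on a five-point scale where ties are frequent. The function names and the example ratings are hypothetical, and whether the software used in the study applied the same tie correction is an assumption.

```python
def average_ranks(scores):
    """Rank scores from 1 to n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W for m raters (rows) scoring the same n items (columns).

    Raises ZeroDivisionError if every rater gives all items the same score,
    because the ranking then carries no information.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of t tied scores.
    ties = sum(row.count(v) ** 3 - row.count(v) for row in ratings for v in set(row))
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)

# Hypothetical severity ratings: three raters, three UPs (not study data).
example = [[2, 4, 5], [1, 3, 5], [2, 2, 4]]
print(round(kendalls_w(example), 3))
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings).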
The results of attribute agreement analysis, as indicated by Kendall’s coefficient of concordance, show significant and moderate to high levels of accuracy for most methods, apart from Method 1. Overall, data-driven methods were more accurate than judgment-driven methods, while effectiveness, efficiency and user satisfaction proved a more reliable guide to rating UP severity than did UP frequency, impact, and
persistence.
Table VII-5 Rating Method Accuracy
Table VII-6 Rating Method Accuracy (judgment vs. data-driven)
Table VII-7 Rating Method Accuracy based on UP Severity Criteria (columns: Rating Method Type, Kendall’s coefficient of concordance)
A comparison of the number of evaluator ratings that matched or did not match the median user ratings identified significant differences in the accuracy of the rating methods, χ² (3, N = 252) = 33.341, p < 0.001. The association between rating method and accuracy is weak, as indicated by the lambda value (λ = 0.22), perhaps because Methods 2 and 3 gave results similar to what would be expected. However, Method 1 resulted in a much lower number of accurate ratings than expected, while Method 4 was more accurate than expected.
Table VII-8 Rating Method Accuracy
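The comparison of observed and expected accurate ratings can be sketched as below. The counts are hypothetical placeholders chosen only so that each method has 63 ratings (N = 252); they are not the values from Table VII-8. The sketch also assumes that the reported λ is Goodman and Kruskal's lambda with accuracy as the predicted variable, which the text does not state.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(table):
    """Pearson chi-squared statistic and its degrees of freedom."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def goodman_kruskal_lambda(table):
    """Proportional reduction in error when predicting the column from the row."""
    n = sum(sum(row) for row in table)
    best_without = max(sum(col) for col in zip(*table))  # guess the modal column
    best_with = sum(max(row) for row in table)           # guess the modal column per row
    return (best_with - best_without) / (n - best_without)

# Rows: Methods 1-4; columns: [accurate, inaccurate]. HYPOTHETICAL counts.
observed = [[12, 51], [24, 39], [28, 35], [40, 23]]

stat, df = chi_squared(observed)
lam = goodman_kruskal_lambda(observed)
print(expected_counts(observed)[0], round(stat, 3), df, round(lam, 3))
```

With four methods and two outcomes the test has (4 - 1)(2 - 1) = 3 degrees of freedom, which matches the 3 reported in the chi-squared result above.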
7.5 Discussion
My hypotheses are supported, if not confirmed, by the results. As a type, data-driven ratings are more accurate than judgment-driven ratings, and methods based on measures of effectiveness, efficiency, and user satisfaction are more accurate than those based on UP frequency, impact, and persistence.
Between individual methods, however, discrepancies are found. Method 4, which is both data-driven and based on ISO 9241 usability measures, is much less accurate than Method 3, which is data-driven and based on frequency, impact, and persistence. The reason for this discrepancy may lie in the observation that Method 4 also shows lower levels of inter-rater agreement, suggesting a greater spread of ratings for Method 4.
This suggests that Method 4 may produce more accurate ratings, but that its inaccurate ratings are spread much further from the correct ratings than they are for Method 3. The greater familiarity of HCI students with methods based on UP frequency, impact, and persistence may play a part in the greater consistency of Method 3. If so, improvements in inter-rater consistency and accuracy could be achieved with Method 4 through training and increasing familiarity.