As mentioned in Section 6.6, six datasets (cal500, majorminer, dlc1, dlc2, dlc3, and dlc4) come from the social tagging domain. These datasets contain tag count information and are therefore used for the cost-sensitive social tagging experiments.
The experimental setup is the same as in Section 6.6. However, we compare GLE only with RAkEL, BR, and MLKNN, since these three methods perform better than CC, IBLR, and BPMLL in the experiments of Section 6.6. As in previous work [38], we replace the base classifier in BR with a cost-sensitive binary classifier and refer to the resulting method as CSBR. The cost-sensitive binary classifier is implemented using LIBSVM.
The experimental results of cost-sensitive social tag annotation and retrieval are summarized in Table 5.5. In most cases, GLE outperforms RAkEL, CSBR, and MLKNN. In only one case (cost-sensitive annotation on dlc1) does GLE perform slightly worse than RAkEL, and the difference is not significant. The average ranks of GLE over the six datasets are 1.2 for annotation and 1.0 for retrieval.
Table 5.5: Experimental Results in Terms of Two Cost-Sensitive Evaluation Metrics. The Average Rank is the Average of the Ranks Across All Datasets. •/◦ indicates whether GLE is statistically superior/inferior to the compared algorithm (pairwise t-test at the 5% significance level).
Cost-Sensitive F-Measure for Annotation
Dataset        GLE          RAkEL          CSBR           MLKNN
cal500         0.6544 (1)   0.6436 (2) •   0.4916 (4) •   0.5889 (3) •
majorminer     0.4938 (1)   0.4885 (2) •   0.2607 (4) •   0.2964 (3) •
dlc1           0.2048 (2)   0.2052 (1)     0.0780 (4) •   0.1537 (3) •
dlc2           0.1498 (1)   0.1433 (2) •   0.0588 (4) •   0.1213 (3) •
dlc3           0.1555 (1)   0.1502 (2) •   0.0621 (4) •   0.1174 (3) •
dlc4           0.1875 (1)   0.1835 (2) •   0.0529 (4) •   0.1449 (3) •
Average Rank   1.2          1.8            4.0            3.0
Cost-Sensitive F-Measure for Retrieval
Dataset        GLE          RAkEL          CSBR           MLKNN
cal500         0.4699 (1)   0.2929 (4) •   0.3341 (2) •   0.3053 (3) •
majorminer     0.3157 (1)   0.3066 (2) •   0.1147 (4) •   0.1427 (3) •
dlc1           0.2141 (1)   0.2094 (2) •   0.1721 (4)     0.1874 (3) •
dlc2           0.2801 (1)   0.2678 (2) •   0.1887 (3) •   0.1725 (4) •
dlc3           0.2515 (1)   0.2416 (2)     0.1864 (3)     0.1464 (4) •
dlc4           0.2572 (1)   0.2479 (2) •   0.1329 (4) •   0.1466 (3) •
Average Rank   1.0          2.3            3.3            3.3
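The Average Rank rows of Table 5.5 can be reproduced from the per-dataset scores. The following is a minimal sketch in Python, not the code used in the experiments; the function name average_ranks and the list layout are illustrative assumptions. It ranks the four methods on each dataset (rank 1 for the highest F-measure, assuming no ties) and averages the ranks over the six datasets.

```python
METHODS = ["GLE", "RAkEL", "CSBR", "MLKNN"]

# Scores copied from Table 5.5, one row per dataset (cal500 ... dlc4),
# one column per method in the order given by METHODS.
ANNOTATION = [
    [0.6544, 0.6436, 0.4916, 0.5889],
    [0.4938, 0.4885, 0.2607, 0.2964],
    [0.2048, 0.2052, 0.0780, 0.1537],
    [0.1498, 0.1433, 0.0588, 0.1213],
    [0.1555, 0.1502, 0.0621, 0.1174],
    [0.1875, 0.1835, 0.0529, 0.1449],
]
RETRIEVAL = [
    [0.4699, 0.2929, 0.3341, 0.3053],
    [0.3157, 0.3066, 0.1147, 0.1427],
    [0.2141, 0.2094, 0.1721, 0.1874],
    [0.2801, 0.2678, 0.1887, 0.1725],
    [0.2515, 0.2416, 0.1864, 0.1464],
    [0.2572, 0.2479, 0.1329, 0.1466],
]


def average_ranks(rows):
    """Return the mean rank of each method, rounded to one decimal.

    Rank 1 goes to the highest score on a dataset; ties are assumed absent.
    """
    totals = [0] * len(rows[0])
    for row in rows:
        # Method indices sorted from best (highest score) to worst.
        order = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        for rank, k in enumerate(order, start=1):
            totals[k] += rank
    return [round(t / len(rows), 1) for t in totals]


annotation_ranks = average_ranks(ANNOTATION)
retrieval_ranks = average_ranks(RETRIEVAL)
print(dict(zip(METHODS, annotation_ranks)))
print(dict(zip(METHODS, retrieval_ranks)))
```

The significance markers (•/◦) are not reproduced here, since they depend on per-fold results that the table does not report.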
Figure 5.2: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Scene Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.3: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Enron Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.4: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Cal500 Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.5: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Majorminer Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.6: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Medical Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.7: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Bibtex Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.8: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Dlc1 Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.9: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Dlc2 Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.10: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Dlc3 Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Figure 5.11: Experimental Results of GLE with Different γ and ν in Terms of Five Different Evaluation Metrics on the Dlc4 Dataset. Panels: (a) Hamming Loss, (b) Ranking Loss, (c) Subset 0/1 Loss, (d) One Error, (e) Average Precision.
Chapter 6
Patient-Balanced Learning for Medical Image Classification
KDD Cup is an annual worldwide competition on knowledge discovery and data mining (KDD). It is organized by the ACM Special Interest Group on KDD and has been held since 1997. It is now the most prestigious data mining competition. In both KDD Cup 2006 and KDD Cup 2008, the prediction task was medical image classification, and the medical image datasets were provided by Siemens Medical Solutions, USA. In KDD Cup 2006¹, the task was pulmonary embolism (PE) classification using pre-processed computed tomography images [29], while in KDD Cup 2008², the task was breast cancer classification using mammogram images [48]. We participated in KDD Cup 2008 and were a joint winner of the competition.
In this chapter, we begin by discussing some practical issues of model selection for medical image classification. Since the performance evaluation is based on patient-based metrics rather than traditional instance-based metrics, general model selection strategies may not work well. We describe the model selection strategies used in our winning method. Then, we describe a class-imbalance issue and a class-balanced SVM. Furthermore, we discuss a patient-imbalance problem that might seriously hurt the generalization ability of the image classifier. To the best of our knowledge, this problem has not been addressed or solved in previous research. We believe that it occurs in general medical image classification tasks and is not specific to the KDD Cup competition. We design a patient-balanced learning strategy based on cost-sensitive binary classification. The experiments are conducted on both the breast cancer dataset and the pulmonary embolism dataset. In terms of AUC, patient-balanced learning improves on the traditional learning method by about 5% (absolute) on the test data, a margin that we consider crucial for winning the competition.
¹ http://www.cs.unm.edu/kdd cup 2006
² http://www.kddcup2008.com/
6.1 Background
Data mining techniques have been widely used for computer-aided diagnosis (CAD) on medical image data (e.g., CT scans, X-rays, and MRI). Given a set of labeled images, one can design a learning program that predicts whether an unlabeled image contains cancer regions. Developing a CAD system generally involves three steps [29]:
1. Identify some potentially unhealthy regions (or regions of interest, ROIs) from a medical image.
2. Extract descriptive features from each candidate region.
3. Design a classifier to identify the labels of the candidates.
The third step in the CAD scenario can be formulated as a supervised learning problem. That is, we are given a training data set $\{(x_i, y_i, p_i)\}_{i=1}^{N}$, where $x_i$ is the feature vector of an ROI, $y_i \in \{1, -1\}$ is a class label indicating whether this ROI is unhealthy (positive) or not (negative), and $p_i \in \{1, 2, \cdots, M\}$ is the index of the patient with whom the instance is associated, so that $p_i = j$ means the instance belongs to the $j$-th patient. We note that the instances of a patient may not come from a single image, but from images of different viewpoints or organs (e.g., left/right breast). We define a patient as unhealthy if and only if at least one of the patient's ROI instances is regarded as unhealthy; conversely, a patient is healthy if and only if none of the patient's ROI instances is regarded as unhealthy. Let $X_j^{\mathrm{bag}}$ be the set of ROI feature vectors associated with the $j$-th patient. We define a patient classifier $F(X_j^{\mathrm{bag}})$ with input $X_j^{\mathrm{bag}}$ as:
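$$F(X_j^{\mathrm{bag}}) = \begin{cases} 1 \ (\text{unhealthy}) & \text{if } \exists\, x_i \in X_j^{\mathrm{bag}},\ f(x_i) = 1, \\ -1 \ (\text{healthy}) & \text{otherwise,} \end{cases} \tag{6.1}$$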
where $f(x_i)$ can be any binary classifier, such as an SVM, that takes a single feature vector as its input. The patient classifier can also be expressed compactly as:
$$F(X_j^{\mathrm{bag}}) = \max_{x_i \in X_j^{\mathrm{bag}}} f(x_i). \tag{6.2}$$
Suppose the classifier $f(x_i)$ outputs a confidence score for each ROI, where a higher score means greater confidence that the ROI is unhealthy. Then, according to (6.2), the largest score among a patient's ROIs can be used as the confidence that the patient is unhealthy.
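The max rule in (6.1) and (6.2) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the implementation used in our system: the function names patient_scores and patient_labels, the toy scores, and the decision threshold of 0 (as for the sign of an SVM decision value) are all hypothetical.

```python
from collections import defaultdict


def patient_scores(instance_scores, patient_ids):
    """Aggregate ROI scores into one score per patient via the max rule of (6.2)."""
    best = defaultdict(lambda: float("-inf"))
    for score, pid in zip(instance_scores, patient_ids):
        best[pid] = max(best[pid], score)
    return dict(best)


def patient_labels(instance_scores, patient_ids, threshold=0.0):
    """Apply (6.1): a patient is unhealthy (+1) if any ROI score exceeds the threshold.

    The threshold value is an assumption made for this sketch.
    """
    scores = patient_scores(instance_scores, patient_ids)
    return {pid: (1 if s > threshold else -1) for pid, s in scores.items()}


# Toy data made up for illustration: five ROIs belonging to three patients.
scores = [-1.2, 0.7, -0.3, -0.8, 0.1]
pids = [1, 1, 2, 2, 3]

per_patient = patient_scores(scores, pids)
labels = patient_labels(scores, pids)
print(per_patient)
print(labels)
```

Because only the largest ROI score of a patient matters, a single high-scoring false-positive ROI is enough to mark a healthy patient as unhealthy, which is one reason patient-based evaluation behaves differently from instance-based evaluation.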
This scenario is similar to the multi-instance learning (MIL) problem [2, 42, 62, 68] if the instances belonging to a patient are viewed as a bag of instances as defined in MIL. The major difference between MIL and our scenario for medical image classification is that in MIL the label information is provided at the bag level but not at the instance level. Consequently, treating this medical image classification problem as a conventional MIL problem would discard the detailed instance-level label information. Nevertheless, such a fine-grained MIL-like problem is important enough that it was posed as the major challenge of the KDD Cup 2006 and 2008 competitions. We propose learning methods to improve the performance of medical classifiers in such fine-grained MIL-like problems.
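To make the loss of label information concrete, the sketch below collapses instance-level labels into bag-level labels under the standard MIL assumption (a bag is positive if and only if at least one of its instances is positive). It is a hypothetical illustration; the function name bag_labels and the toy labelings are assumptions, not part of our method.

```python
def bag_labels(instance_labels, patient_ids):
    """Collapse instance labels in {1, -1} into one label per bag (patient)."""
    bags = {}
    for y, pid in zip(instance_labels, patient_ids):
        # A bag becomes positive as soon as one of its instances is positive.
        bags[pid] = max(bags.get(pid, -1), y)
    return bags


# Two made-up instance-level labelings of the same five ROIs.
pids = [1, 1, 1, 2, 2]
labels_a = [1, -1, -1, -1, -1]  # patient 1 has one unhealthy ROI
labels_b = [1, 1, 1, -1, -1]    # patient 1 has three unhealthy ROIs

same = bag_labels(labels_a, pids) == bag_labels(labels_b, pids)
print(same)
```

Both labelings yield the same bag-level labels, so a conventional MIL learner cannot distinguish them, whereas the fine-grained setting considered here keeps the label of each ROI.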