Data Mining Analysis (breast-cancer data)

(1)

Data Mining Analysis (breast-cancer data)

Jung-Ying Wang

Register number: D9115007, May, 2003

Abstract

In this AI term project, we compare some world renowned machine learning tools.

Including WEKA data mining software (developed at the University of Waikato, Hamilton, New Zealand); MATLIB 6.1 and LIBSVM (developed at the Nation Taiwan University, by Chih-Jen Lin).

Contents

1 Breast-cancer-Wisconsin dataset summary ... 2

2 Classification - 10 fold cross validation on breast-cancer-Wisconsin dataset ... 8

2.1 Results for: Naive Bayes... 8

2.2 Results for: BP Neural Network ... 9

2.3 Results for: J48 decision tree (implementation of C4.5) ... 10

2.4 Results for: SMO (Support Vector Machine) ... 11

2.5 Results for: JRip (implementation of the RIPPER rule learner)... 12

3 Classification – Compare with other paper results ... 14

3.1.1 Results for training data: Naive Bayes... 14

3.1.2 Results for test data: Naive Bayes ... 15

3.2.1 Results for training data: BP Neural Network ... 16

3.2.2 Results for test data: BP Neural Network ... 16

3.3.1 Results for training data: J48 decision tree (implementation of C4.5) ... 17

3.3.2 Results for test data: J48 decision tree (implementation of C4.5) ... 19

3.3.1 Results for training data: J48 decision tree (implementation of C4.5) ... 19

3.4.1 Results for training data: SMO (Support Vector Machine) ... 20

3.4.2 Results for test data: SMO (Support Vector Machine)... 21

3.5.1 Results for training data: JRip (implementation of the RIPPER rule learner). 22 3.5.2 Results for test data: JRip (implementation of the RIPPER rule learner)... 23

4 Summary... 25

5. Reference ... 26

(2)

1. Breast-cancer-Wisconsin dataset summary

In our AI term project, all chosen machine learning tools will be use to diagnose cancer Wisconsin dataset. To be consistent with the literature [1, 2] we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.

Brief information from the UC Irvine machine learning repository:

Located in breast-cancer-Wisconsin sub-directory, filenames root:

breast-cancer-Wisconsin.

Currently contains 699 instances.

2 classes (malignant and benign).

9 integer-valued attributes.

Attribute Information:

Table 1 shows data attribute information

# Attribute Domain

1. Sample code number id number

2. Clump Thickness 1 - 10

3. Uniformity of Cell Size 1 - 10 4. Uniformity of Cell Shape 1 - 10

5. Marginal Adhesion 1 - 10

6. Single Epithelial Cell Size 1 - 10

7. Bare Nuclei 1 - 10

8. Bland Chromatin 1 - 10

9. Normal Nucleoli 1 - 10

10. Mitoses 1 - 10

11. Class 2 for benign,

4 for malignant

Missing attribute values: 16

There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".

Class distribution:

Benign: 458 (65.5%) Malignant: 241 (34.5%)

(3)

Clump Thickness

139

50 104

79 128

33 23

44 14

69

0 20 40 60 80 100 120 140 160

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 1: bar graph summaries for the clump thickness attributes in the training data.

Uniformity of Cell Size

373

45 52

38 30 25 19 28

6 67

0 50 100 150 200 250 300 350 400

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 2: Bar graph summaries for the Uniformity of Cell Size attributes in the training data

(4)

Uniformity of Cell Shape

346

58 53 43 32 29 30 27

7 58

0 50 100 150 200 250 300 350 400

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 3: Bar graph summaries for the Uniformity of Cell Shape attributes in the training data

Marginal Adhesion

393

58 58

33 23 21 13 25

4 55

0 50 100 150 200 250 300 350 400 450

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 4: Bar graph summaries for the Marginal Adhesion attributes in the training data

(5)

Single Epithelial Cell Size

44 376

71 48 39 40

11 21 2

31 0

50 100 150 200 250 300 350 400

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 5: Bar graph summaries for the Single Epithelial Cell Size attributes in the training data

Bare Nuclei

402

30 28 19 30

4 8 21 9

132

0 50 100 150 200 250 300 350 400 450

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 6: Bar graph summaries for the Bare Nuclei attributes in the training data

(6)

Bland Chromatin

150 160 161

39 34

9 71

28

11 20

0 20 40 60 80 100 120 140 160 180

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 7: Bar graph summaries for the Bland Chromatin attributes in the training data

Normal Nucleoli

432

36 42

18 19 22 16 23 15

60

0 50 100 150 200 250 300 350 400 450 500

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 8: Bar graph summaries for the Normal Nucleoli attributes in the training data

(7)

Mitoses 563

35 33

12 6 3 9 8 0 14

0 100 200 300 400 500 600

1 2 3 4 5 6 7 8 9 10

Domain

Freq.

Figure 9: Bar graph summaries for the Mitoses attributes in the training data Table 2 shows data summary statistics.

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683 Uniformity of

Cell Size

373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape

346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion

393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size

44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

(8)

2 Classification - 10 fold cross validation on breast-cancer-Wisconsin dataset

First we use the data mining tools WEKA to do the training data prediction. In here, we will use 10 fold cross validation on training data to calculate the machine learning rules their performance. The results are as follows:

2.1 Results for: Naive Bayes

=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes Relation: breast

Instances: 683 Attributes: 10

Test mode: 10-fold cross-validation Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 659 96.4861 % Incorrectly Classified Instances 24 3.5139 % Kappa statistic 0.9238

K&B Relative Info Score 62650.9331 %

K&B Information Score 585.4063 bits 0.8571 bits/instance Class complexity | order 0 637.9242 bits 0.934 bits/instance Class complexity | scheme 1877.4218 bits 2.7488 bits/instance Complexity improvement (Sf) -1239.4976 bits -1.8148 bits/instance Mean absolute error 0.0362

Root mean squared error 0.1869 Relative absolute error 7.9508 % Root relative squared error 39.192 % Total Number of Instances 683

=== Confusion Matrix ===

a b <-- classified as 425 19 | a = 2

5 234 | b = 4

(9)

2.2 Results for: BP Neural Network

Scheme: weka.classifiers.functions.neural.NeuralNetwork -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a

Relation: breast Instances: 683 Attributes: 10

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Time taken to build model: 32.06 seconds

=== Summary ===

K&B Information Score 562.8499 bits 0.8241 bits/instance Class complexity | order 0 637.9242 bits 0.934 bits/instance Class complexity | scheme 176.4694 bits 0.2584 bits/instance Complexity improvement (Sf) 461.4548 bits 0.6756 bits/instance Mean absolute error 0.0526

14 225 | b = 4

(10)

2.3 Results for: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2 Relation: breast

J48 pruned tree --- a2 <= 2

| a6 <= 3: 2 (395.0/2.0)

| a6 > 3

| | a1 <= 3: 2 (11.0)

| | a1 > 3

| | | a7 <= 2

| | | | a4 <= 3: 4 (2.0)

| | | | a4 > 3: 2 (2.0)

| | | a7 > 2: 4 (8.0) a2 > 2

| a3 <= 2

| | a1 <= 5: 2 (19.0/1.0)

| | a1 > 5: 4 (4.0)

| a3 > 2

| | a2 <= 4

| | | a6 <= 2

| | | | a4 <= 3: 2 (11.0/1.0)

| | | | a4 > 3: 4 (3.0)

| | | a6 > 2: 4 (54.0/7.0)

| | a2 > 4: 4 (174.0/3.0) Number of Leaves : 11 Size of the tree : 21

=== Summary ===

(11)

17 222 | b = 4

2.4 Results for: SMO (Support Vector Machine)

Scheme: weka.classifiers.functions.supportVector.SMO -C 1.0 -E 1.0 -G 0.01 -A 1000003 -T 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1

Relation: breast Instances: 683 Attributes: 10

SMO

Classifier for classes: 2, 4 BinarySMO

Machine linear: showing attribute weights, not support vectors.

1.5056 * (normalized) a1 + 0.2163 * (normalized) a2 + 1.2795 * (normalized) a3 + 0.6631 * (normalized) a4 + 0.901 * (normalized) a5

(12)

+ 1.5154 * (normalized) a6 + 1.2332 * (normalized) a7 + 0.7335 * (normalized) a8 + 1.2115 * (normalized) a9 - 2.598

Number of kernel evaluations: 16169 Time taken to build model: 0.53 seconds

=== Summary ===

K&B Information Score 595.2169 bits 0.8715 bits/instance Class complexity | order 0 637.9242 bits 0.934 bits/instance Class complexity | scheme 21480 bits 31.4495 bits/instance Complexity improvement (Sf) -20842.0758 bits -30.5155 bits/instance Mean absolute error 0.0293

8 231 | b = 4

2.5 Results for: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1 Relation: breast

JRIP rules:

(13)

===========

(a2 >= 4) and (a7 >= 5) => class=4 (148.0/2.0) (a6 >= 3) and (a1 >= 7) => class=4 (50.0/0.0) (a3 >= 4) and (a4 >= 4) => class=4 (22.0/2.0) (a6 >= 4) and (a3 >= 3) => class=4 (19.0/5.0) (a7 >= 4) and (a1 >= 5) => class=4 (8.0/3.0) (a8 >= 3) and (a1 >= 6) => class=4 (2.0/0.0) => class=2 (434.0/2.0)

Number of Rules : 7

=== Summary ===

12 227 | b = 4

(14)

3. Classification – Compare with other paper results

Above machine learning tools will use in this section to diagnose cancer Wisconsin dataset. To be consistent with the literature [1, 2] we removed the 16 instances with missing values from the dataset to construct a dataset with 683 instances. The first 400 instances in the dataset are chosen as the training data, and the remaining 283 as the test data.

3.1.1 Results for training data: Naive Bayes

Scheme: weka.classifiers.bayes.NaiveBayes Relation: breast_training

Test mode: evaluate on training data

Naive Bayes Classifier

Class 2: Prior probability = 0.57 Class 4: Prior probability = 0.43

=== Evaluation on training set ===

=== Summary ===

5 166 | b = 4

(15)

3.1.2 Results for test data: Naive Bayes

Scheme: weka.classifiers.bayes.NaiveBayes Relation: breast_training

Test mode: user supplied test set: 283 instances

Naive Bayes Classifier

Class 2: Prior probability = 0.57 Class 4: Prior probability = 0.43

=== Evaluation on test set ===

=== Summary ===

1 67 | b = 4

(16)

3.2.1 Results for training data: BP Neural Network

Relation: breast_training Instances: 400

Attributes: 10

=== Summary ===

0 171 | b = 4

3.2.2 Results for test data: BP Neural Network

(17)

Attributes: 10

Time taken to build model: 5 seconds

=== Summary ===

1 67 | b = 4

3.3.1 Results for training data: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2 Relation: breast_training

J48 pruned tree ---

(18)

a3 <= 2

| a7 <= 3: 2 (194.0/1.0)

| a7 > 3

| | a1 <= 4: 2 (7.0)

| | a1 > 4: 4 (6.0/1.0) a3 > 2

| a6 <= 2

| | a5 <= 4: 2 (20.0/1.0)

| | a5 > 4: 4 (12.0)

| a6 > 2: 4 (161.0/9.0) Number of Leaves : 6 Size of the tree : 11

=== Summary ===

Correctly Classified Instances 388 97 % Incorrectly Classified Instances 12 3 % Kappa statistic 0.9391

2 169 | b = 4

(19)

3.3.2 Results for test data: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2 Relation: breast_training

J48 pruned tree --- a3 <= 2

| a7 <= 3: 2 (194.0/1.0)

| a7 > 3

| | a1 <= 4: 2 (7.0)

| | a1 > 4: 4 (6.0/1.0) a3 > 2

| a6 <= 2

| | a5 <= 4: 2 (20.0/1.0)

| | a5 > 4: 4 (12.0)

| a6 > 2: 4 (161.0/9.0) Number of Leaves : 6 Size of the tree : 11

=== Summary ===

Root mean squared error 0.1646

(20)

Relative absolute error 9.9595 % Root relative squared error 35.2651 % Total Number of Instances 283

5 63 | b = 4

3.4.1 Results for training data: SMO (Support Vector Machine)

Attributes: 10

SMO

1.4364 * (normalized) a1 + 0.4204 * (normalized) a2 + 1.0846 * (normalized) a3 + 1.0712 * (normalized) a4 + 0.9297 * (normalized) a5 + 1.409 * (normalized) a6 + 1.0571 * (normalized) a7 + 0.6458 * (normalized) a8 + 1.1078 * (normalized) a9 - 2.3339

(21)

=== Summary ===

4 167 | b = 4

3.4.2 Results for test data: SMO (Support Vector Machine)

Attributes: 10

SMO

1.4364 * (normalized) a1

(22)

+ 0.4204 * (normalized) a2 + 1.0846 * (normalized) a3 + 1.0712 * (normalized) a4 + 0.9297 * (normalized) a5 + 1.409 * (normalized) a6 + 1.0571 * (normalized) a7 + 0.6458 * (normalized) a8 + 1.1078 * (normalized) a9 - 2.3339

=== Summary ===

1 67 | b = 4

3.5.1 Results for training data: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1 Relation: breast_training

(23)

JRIP rules:

===========

(a2 >= 3) and (a2 >= 5) => class=4 (116.0/2.0) (a6 >= 3) and (a3 >= 3) => class=4 (55.0/7.0) (a1 >= 6) and (a8 >= 4) => class=4 (5.0/0.0) (a2 >= 4) => class=4 (4.0/1.0)

=> class=2 (220.0/1.0) Number of Rules : 5

=== Summary ===

1 170 | b = 4

3.5.2 Results for test data: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1 Relation: breast_training

Instances: 400

(24)

Attributes: 10

JRIP rules:

===========

(a2 >= 3) and (a2 >= 5) => class=4 (116.0/2.0) (a6 >= 3) and (a3 >= 3) => class=4 (55.0/7.0) (a1 >= 6) and (a8 >= 4) => class=4 (5.0/0.0) (a2 >= 4) => class=4 (4.0/1.0)

=> class=2 (220.0/1.0) Number of Rules : 5

=== Summary ===

3 65 | b = 4

(25)

4. Summary

This section presents summary tables for scheme accuracy and running times.

Table 4.1: Accuracy and running time summary table for 10 fold cross validation

Model Running time 10 fold cross val.

Naive Bayes 0.08 seconds 96.4861%

BP neural network 32.06 seconds 95.1684%

J48 decision tree (C4.5) 0.08 seconds 95.7540%

SMO (support vector machine) 0.53 seconds 97.0717%

JRip (RIPPER rule learner) 0.19 seconds 95.6076%

Table 4.2: Accuracy for training and test data between different models

Model Training data Test data

Naive Bayes 95.50% 97.8799%

BP neural network 98.75% 98.2332%

J48 decision tree (C4.5) 97.00% 97.1731%

SMO (support vector machine) 96.75% 98.5866%

JRip (RIPPER rule learner) 97.25% 97.5265%

Table 4.2: A comparison with the others papers

Method Testing Error

WEKA 3.3.6

Naive Bayes 97.8799%

BP neural network 98.2332%

J48 decision tree (C4.5) 97.1731%

SMO (support vector machine) 98.5866%

JRip (RIPPER rule learner) 97.5265%

Fogel et al. [1] 98.1000%

Abbass et. al. [2] 97.5000%

Abbass H. A. [3] 98.1000%

(26)

5. Reference

1. Fogel DB, Wasson EC, Boughton EM. Evolving neural networks for detecting breast cancer. Cancer lett 1995; 96(1): 49-53.

2. Abbass HA, Towaey M, Finn GD. C-net: a method for generating non-deterministic and dynamic multivariate decision trees. Knowledge Inf. Syst. 2001; 3:184-97.

3. Abbass HA. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine 2002; 25, 265-281.