Data Structures and Algorithms (NTU, Spring 2012) instructor: Hsuan-Tien Lin
Homework #5
TA in charge: Yu-Cheng Chou, Ya-Hsuan Chang and Wei-Yuan Shen
RELEASE DATE: 05/04/2012 DUE DATE: 05/18/2012, 17:00
Specification of 5.3(2) and 5.3(5)
We provide two data sets, heart and wine, for you to test your program. We may use other data sets to evaluate your program's performance. If you are interested in trying more data sets, you can find them in the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets.html
Input Format
For both 5.3(2) and 5.3(5), the first argument of your program should be the training data file, which contains the examples. For the Random Forest in 5.3(5), the second argument of your program should be the number of decision trees to build. For example,
./tree heart
./forest heart 20
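The argument convention above could be checked along the following lines. This is only a sketch for the ./forest case: the struct ForestArgs and the function parse_forest_args are hypothetical names introduced here, and rejecting non-positive tree counts is an assumption rather than part of the specification.

```cpp
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical holder for the two command-line arguments of ./forest.
struct ForestArgs {
    std::string train_file;  // first argument: training data file
    int num_trees;           // second argument: number of trees to build
};

// Parses "./forest <train_file> <num_trees>".
// Returns false if the argument count is wrong or <num_trees> is not a
// positive integer that fits in an int (an assumption of this sketch).
bool parse_forest_args(int argc, char **argv, ForestArgs &out) {
    if (argc != 3) return false;
    char *end = nullptr;
    long t = std::strtol(argv[2], &end, 10);
    if (end == argv[2] || *end != '\0') return false;  // not a pure number
    if (t <= 0 || t > INT_MAX) return false;
    out.train_file = argv[1];
    out.num_trees = static_cast<int>(t);
    return true;
}
```

A main function would call parse_forest_args(argc, argv, args) first and exit with a usage message when it returns false.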
Data Format
The first line contains two integers n and m, where n is the number of examples and m is the number of factors in each example. Each of the following n lines represents one example in the following format, with the numbers separated by spaces:
label factor[0] factor[1] ... factor[m-1]
For instance, in the line
1 14.23 1.71 2.43 15.6 127 2.8 3.06 0.28 2.29 5.64 1.04 3.92 1065
the leading 1 is the label and the remaining numbers are the factors.
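A minimal sketch of reading this format is given below. The struct Example and the function read_examples are hypothetical names, and returning an empty vector on malformed input is a choice made for this sketch, not something the specification requires.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical container for one example.
struct Example {
    int label;                    // first number on the line
    std::vector<double> factors;  // factor[0] ... factor[m-1]
};

// Reads "n m" followed by n examples of the form
// "label factor[0] ... factor[m-1]".
// Returns an empty vector if the input is malformed or truncated.
std::vector<Example> read_examples(std::istream &in) {
    std::size_t n = 0, m = 0;
    if (!(in >> n >> m)) return {};
    std::vector<Example> examples;
    for (std::size_t i = 0; i < n; ++i) {
        Example ex;
        double raw_label = 0;
        // Read the label as a double so that "1", "+1" and "1.0" all parse.
        if (!(in >> raw_label)) return {};
        ex.label = static_cast<int>(raw_label);
        ex.factors.resize(m);
        for (std::size_t j = 0; j < m; ++j)
            if (!(in >> ex.factors[j])) return {};
        examples.push_back(ex);
    }
    return examples;
}
```

To read from the file named on the command line, pass a std::ifstream opened on argv[1] to read_examples.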
Output Format
Decision Tree
Please output your tree as a function in C/C++. The function must follow this signature:
int tree_predict(double *attr);
The only argument is a double array that contains the factors of one example, in the same order as in the input. The function should return the predicted label of the example (1 or -1 for heart, for instance). Please name your output file "tree_pred.h". You can then compile and run the provided "tree_predictor.cpp" to check how good your decision tree is (see README). For example, your "tree_pred.h" should look like:
int tree_predict(double *attr){
    if(attr[0] > 5){
        return 1;
    } else{
        return -1;
    }
}
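One possible way to produce such a file from a tree held in memory is sketched below. The Node struct, its field names, and the functions emit_node and emit_tree are all hypothetical; the sketch assumes every internal node tests attr[factor] > threshold, as in the example above, and says nothing about how the tree itself is learned.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical in-memory tree node; a node is a leaf by default.
struct Node {
    bool is_leaf = true;
    int label = 0;         // returned label, used when is_leaf is true
    int factor = 0;        // index into attr[], used when is_leaf is false
    double threshold = 0;  // branch taken when attr[factor] > threshold
    std::unique_ptr<Node> yes, no;  // subtrees for test true / false
};

// Writes the subtree rooted at node as nested if/else statements.
void emit_node(const Node &node, std::ostream &out, int depth) {
    const std::string pad(4 * depth, ' ');
    if (node.is_leaf) {
        out << pad << "return " << node.label << ";\n";
        return;
    }
    out << pad << "if(attr[" << node.factor << "] > "
        << node.threshold << "){\n";
    emit_node(*node.yes, out, depth + 1);
    out << pad << "} else{\n";
    emit_node(*node.no, out, depth + 1);
    out << pad << "}\n";
}

// Writes the whole tree as the tree_predict function.
void emit_tree(const Node &root, std::ostream &out) {
    out.precision(17);  // enough digits to round-trip a double threshold
    out << "int tree_predict(double *attr){\n";
    emit_node(root, out, 1);
    out << "}\n";
}
```

Calling emit_tree with a std::ofstream opened on "tree_pred.h" would write the header file; a root that tests attr[0] > 5 with leaves 1 and -1 reproduces the example function above.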
Random Forest
As with the decision tree, you need to output your forest as a function in C/C++.
The function must follow this signature:
int forest_predict(double *attr);
The argument and return value are specified in the same way as for the decision tree function. Please name your output file "forest_pred.h". You can then compile and run the provided "forest_predictor.cpp" to check how good your Random Forest is (see README). For example, your "forest_pred.h" should look like:
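int forest_predict(double *attr){
    /* tree1_predict: prediction of tree 1 */
    /* tree2_predict: prediction of tree 2 */
    /* ... */
    /* treeT_predict: prediction of tree T */
    /* voting: combine the T predictions by voting and return the result */
}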
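The outline above leaves the voting step open. A minimal sketch of one way to fill it in follows, assuming a plain majority vote over the trees' predictions. The helper majority_label is a hypothetical name, the three toy trees and their thresholds are made up purely for illustration, and breaking ties in favour of the smallest label is an assumption of this sketch.

```cpp
#include <map>
#include <vector>

// Returns the most frequent label in votes. Ties go to the smallest label
// (a tie-breaking choice made for this sketch). Returns 0 if votes is empty.
int majority_label(const std::vector<int> &votes) {
    std::map<int, int> counts;
    for (int v : votes) ++counts[v];
    int best_label = 0, best_count = -1;
    for (const auto &kv : counts) {
        if (kv.second > best_count) {
            best_label = kv.first;
            best_count = kv.second;
        }
    }
    return best_label;
}

// Toy stand-ins for generated trees; the thresholds are invented.
static int tree1_predict(double *attr) { return attr[0] > 5 ? 1 : -1; }
static int tree2_predict(double *attr) { return attr[1] > 2 ? 1 : -1; }
static int tree3_predict(double *attr) { return attr[0] > 8 ? 1 : -1; }

int forest_predict(double *attr) {
    std::vector<int> votes;
    votes.push_back(tree1_predict(attr));
    votes.push_back(tree2_predict(attr));
    votes.push_back(tree3_predict(attr));
    return majority_label(votes);
}
```

Building an odd number of trees avoids ties altogether when there are only two labels.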
Data Set Description
In this section we describe the meaning of the factors and the label in each data set.
Wine
The label distinguishes two types of wine, from two different cultivars.
The factors are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
Heart
The data set describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images.
The label distinguishes two categories of patients: normal and abnormal.
The factors are:
1. F1R: continuous (count in ROI (region of interest) 1 in rest)
2. F1S: continuous (count in ROI 1 in stress)
3. F2R: continuous (count in ROI 2 in rest)
4. F2S: continuous (count in ROI 2 in stress)
5. F3R: continuous (count in ROI 3 in rest)
6. F3S: continuous (count in ROI 3 in stress)
7. F4R: continuous (count in ROI 4 in rest)
8. F4S: continuous (count in ROI 4 in stress)
...
- All continuous attributes have integer values from 0 to 100.