Feature Engineering and Classifier Ensemble for KDD Cup 2010
Chih-Jen Lin
Department of Computer Science National Taiwan University
Joint work with HF Yu, HY Lo, HP Hsieh, JK Lou, T McKenzie, JW Chou, PH Chung, CH Ho, CF Chang, YH Wei, JY Weng, ES Yan, CW Chang, TT Kuo, YC Lo, PT Chang, C Po, CY Wang,
YH Huang, CW Hung, YX Ruan, YS Lin, SD Lin and HT Lin July 25, 2010
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
KDD Cup
Annual data mining and knowledge discovery competition
Organized by ACM special interest group on knowledge discovery and data mining
1997-present
Now considered the most prestigious data mining competition
KDD Cup 2010
Educational data mining competition
https://pslcdatashop.web.cmu.edu/KDDCup/
Predicting student performance on algebraic problems, given information regarding past performance
Training data: summaries of the logs of student interaction with intelligent tutoring systems
Two data sets: algebra 2008-2009 and bridge to algebra 2008-2009
We refer to them as A89 and B89, respectively.
KDD Cup 2010 (Cont’d)
Each data set: logs for a large number of interaction steps
A89: 8,918,055 steps; B89: 20,012,499 steps
Log Fields
student ID
problem hierarchy, including step name, problem name, unit name, section name
knowledge components (KC) used in the problem
number of times a problem has been viewed
Some log fields are only available in the training set:
whether the student was correct on the first attempt for this step (CFA)
number of hints requested (hint)
step duration information
Log Fields (Cont’d)
Hierarchy: step ⊂ problem ⊂ section ⊂ unit
Examples at each level:
Unit: CTA1_02, CTA1_01, ES_01, UNIT-CONVERSIONS-ONE-STEP
Section: CTA1_02-4, CTA1_01-4, ES_01-11, UNIT-CONVERSIONS-ONE-STEP-2
Problem: EG27, -5=-y, PROP03, RATIO4-135, L2FB14B
Step: Series1AddPoint1, 5=-y*(-1), ValidEquations, R5C2
Log Fields (Cont’d)
KC examples:
KC subskills:
Using simple numbers~~Find Y, any form~~Find Y, positive slope~~Using small numbers
Enter unit conversion
Entering a given~~Enter given, reading words
Entering a given~~Enter given, reading numerals
KC KTracedSkills:
Identifying units-1
Convert linear units-1~~Convert decimal units greater than one-1
Select form of one with denominator of one-1
Enter unit conversion-1
Generation of Training/Testing Data
• Testing data: generated by randomly drawing a problem from each unit
• Problems before the drawn one are used for training; those after are discarded

A unit of problems:
problem 1 ∈ T
problem 2 ∈ T
...
problem i ∈ T̃
problem i+1, ...: not used
T: training; T̃: testing
Competition Goal
Predict CFA
0 (i.e., incorrect on the first attempt) or 1
Training: CFA is available to participants
A testing set with unknown CFA is left for evaluation
Evaluation criterion: root mean squared error (RMSE):

RMSE = \sqrt{\|p - y\|^2 / l}

l: # testing data; p ∈ [0,1]^l: predictions; y ∈ {0,1}^l: true answers
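As a quick illustration (ours, not from the competition materials), the criterion in Python:

import numpy as np

def rmse(p, y):
    # RMSE between predictions p in [0, 1] and true labels y in {0, 1}
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.sqrt(np.mean((p - y) ** 2))

print(rmse([0.9, 0.2, 0.7], [1, 0, 1]))  # about 0.216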
KDD Cup 2010 Schedule
April 1: Registration opens at 2pm EDT, development data sets available
April 19: Competition starts at 2pm EDT, challenge data sets available
June 8: Competition ends at 11:59pm EDT
June 14: Fact sheet and team composition info due by 11:59pm EDT
June 21: Winners announced
July 25: Workshop at ACM KDD 2010
Leaderboard
Based on the results of an unidentified portion of the testing data
Leaderboard (Cont’d)
[screenshot of the leaderboard]
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Course at NTU
At National Taiwan University, we organized a course for KDD Cup 2010
Course page: http://www.csie.ntu.edu.tw/~cjlin/courses/dmcase2010/
Wiki: used to record progress
Team Members
Three instructors, two TAs, 19 students, and one RA
The 19 students were split into six sub-teams
Named after animals: armyants, starfish, weka, trilobite, duck, sunfish
Every week each sub-team reported its progress
Armyants
麥陶德 (Todd G. McKenzie), 羅經凱 (Jing-Kai Lou) and 解巽評 (Hsun-Ping Hsieh)
Starfish
Chia-Hua Ho (何家華), Po-Han Chung (鐘博翰), and Jung-Wei Chou (周融瑋)
Weka
Yin-Hsuan Wei (魏吟軒), En-Hsu Yen (嚴恩勗), Chun-Fu Chang (張淳富) and Jui-Yu Weng (翁睿妤)
Trilobite
Yi-Chen Lo (羅亦辰), Che-Wei Chang (張哲維) and Tsung-Ting Kuo (郭宗廷)
Duck
Chien-Yuan Wang (王建元), Chieh Po (柏傑), and Po-Tzu Chang (張博詞).
Sunfish
Yu-Xun Ruan (阮昱勳), Chen-Wei Hung (洪琛洧) and Yi-Hung Huang (黃曳弘)
Tiger (RA)
Yu-Shi Lin (林育仕)
Snoopy (TAs)
Hsiang-Fu Yu (余相甫) and Hung-Yi Lo (駱宏毅)
Snoopy and Pikachu were the IDs of our team in the final stage of the competition
Instructors
林智仁 (Chih-Jen Lin), 林軒田 (Hsuan-Tien Lin) and 林 守德 (Shou-De Lin)
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Initial Thoughts and Our Approach
We suspected that this competition would be very different from past KDD Cups
Domain knowledge seems to be extremely important for educational systems
Temporal information may be crucial
At first, we explored a temporal approach
We tried Bayesian networks
But quickly found that using a traditional classification approach is easier
Initial Thoughts and Our Approach (Cont’d)
Traditional classification:
Data points: independent Euclidean vectors
Suitable features to reflect domain knowledge and temporal information
Domain knowledge and temporal information turned out to be important, but not as crucial as we initially thought
Our Framework
Problem ⇒ Sparse Features and Condensed Features ⇒ Ensemble
Validation Sets
• Avoid overfitting the leaderboard
• Standard validation ⇒ ignores the time series
• Our validation set: the last problem of each unit in the training set
• Simulates the procedure used to construct the testing sets

A unit of problems:
problem 1 ∈ V
problem 2 ∈ V
...
last problem ∈ Ṽ
V: internal training; Ṽ: internal validation
• In the early stage, we focused on validation sets
Validation Sets (Cont’d)
A89: algebra 2008-2009
B89: bridge to algebra 2008-2009

                     A89        B89
Internal training    8,407,752  19,264,097
Internal validation  510,303    748,402
External training    8,918,055  20,012,499
External testing     508,913    756,387

In the early stages, we focused on validation sets
Each sub-team submitted to the leaderboard only once per week
Validation Sets (Cont’d)
This avoids overfitting the leaderboard
Of course, near the end, many teams slightly violated the rule and submitted more results per week
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Basic Sparse Features
Categorical features: expanded to binary features (student, unit, section, problem, step, KC)
For example, A89 has 3,310 students ⇒ the feature vector contains 3,310 binary features indicating which student finished the step
Numerical features: scaled by log(1 + x) (opportunity value, problem view)
original range of opportunity in [1, 1504], problem view in [1, 18] for A89
original range of opportunity in [1, 2402], problem view in [1, 29] for B89
We have tried other scaling methods (e.g., linear scaling)
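A minimal sketch of this construction (our illustration with hypothetical field names, not the team's code), using scikit-learn's one-hot encoder:

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy log; the real field names and values come from the tab-separated data
df = pd.DataFrame({
    "student": ["s1", "s2", "s1"], "unit": ["u1", "u1", "u2"],
    "section": ["sec1", "sec1", "sec2"], "problem": ["p1", "p2", "p3"],
    "step": ["st1", "st2", "st3"], "kc": ["kc_a", "kc_b", "kc_a"],
    "opportunity": [1, 4, 1500], "problem_view": [1, 2, 3],
})

# One binary feature per observed value of each categorical field
encoder = OneHotEncoder(handle_unknown="ignore")
X_cat = encoder.fit_transform(df[["student", "unit", "section", "problem", "step", "kc"]])
# Numerical fields scaled by log(1 + x)
X_num = np.log1p(df[["opportunity", "problem_view"]].to_numpy(float))
X = sparse.hstack([X_cat, sparse.csr_matrix(X_num)], format="csr")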
Basic Sparse Features (Cont’d)
A89: algebra 2008-2009
B89: bridge to algebra 2008-2009

Data  stud.  unit  sec.  prob.      step     KC
A89   3,310  42    165   192,811×2  725,652  2,097×2
B89   6,043  50    186   53,375×2   129,349  1,699×2

Number of features: about 1M for A89, 200K for B89
prob.: problem and problem view (hence ×2)
KC: KC and opportunity (hence ×2)
Basic Sparse Features (Cont’d)
Results:
RMSE (leaderboard)     A89     B89
Basic sparse features  0.2895  0.2985
Best leaderboard       0.2759  0.2777

Five of six student sub-teams used variants of this approach
From this basic set, we add more features
Extensions from Basic Sparse Features
Different scaling methods
Slightly different ways to generate features
Slightly different subsets of features
Different regularization (L1 and L2) for classification
We will discuss some in detail
Feature Combination
Due to the large training size, nonlinear classifiers (e.g., kernel SVM) are not practical
A linear classifier is viable, but it does not exploit possible feature dependence
Following polynomial mappings in kernel methods and bigrams/trigrams in NLP, we use feature combinations to indicate relationships
We manually identify some useful combinations for experiments
Feature Combination (Cont’d)
Example: hierarchical information
(student name, unit name), (unit name, section name), (section name, problem name) and (problem name, step name)
We have also explored combinations of higher-order features (i.e., more than two)
We released two data sets using feature combinations at
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
We thank Carnegie Learning and Datashop for allowing us to release them
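One simple way to realize such combinations (a sketch continuing the toy data frame above) is to concatenate the values of the combined fields into a new categorical column, which is then one-hot encoded like any other field:

def add_combination(df, fields, sep="||"):
    # New categorical column whose values are the joined values of `fields`
    name = "+".join(fields)
    df[name] = df[fields[0]].astype(str)
    for f in fields[1:]:
        df[name] = df[name] + sep + df[f].astype(str)
    return name

combo_cols = [add_combination(df, list(c))
              for c in [("problem", "step"), ("student", "unit")]]
# combo_cols can now be appended to the OneHotEncoder input above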
Temporal Information
Learning is a process of improving skills over time, so temporal information should be taken into consideration
We considered a simple and common approach:
For each step, step name and KC values from the previous few steps were added as features.
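A rough sketch of this idea with pandas (our illustration; it assumes rows are already in chronological order within each student):

# Previous steps' step name and KC become extra categorical features
for lag in (1, 2, 3):
    for col in ("step", "kc"):
        df[f"prev{lag}_{col}"] = (df.groupby("student", sort=False)[col]
                                    .shift(lag).fillna("none"))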
Feature Combination and Temporal Information
Leaderboard RMSE:

Features           A89     B89
Basic              0.2895  0.2985
+Combination       0.2843  0.2883
+Temporal          0.2816  0.2875
+More combination  0.2815  0.2836
Feature Combination and Temporal Information (Cont’d)
Feature combinations are very useful for B89; temporal features are more useful for A89
Adding more features improves RMSE, but the improvement is less dramatic
The information is already captured by the earlier feature combinations
Details of Features
+Combination: (student name, unit name), (unit name, section name), (section name, problem name), (problem name, step name), (student name, unit name, section name), (unit name, section name, problem name), (section name, problem name, step name), (student name, unit name, section name, problem name) and (unit name, section name, problem name, step name)
+Temporal: Given a student and a problem, add the KCs and step name of each of the previous three steps as temporal features.
+More combination: (student name, section name), (student name, problem name), (student name, step name), (student name, KC) and (student name, unit name, section name, problem name, step name)
Number of Features
Features           A89         B89
Basic              1,118,985   245,776
+Combination       6,569,589   4,083,376
+Temporal          8,752,836   4,476,520
+More combination  21,684,170  30,971,151
Important Feature Combinations
RMSE (leaderboard)                                         A89     B89
Basic                                                      0.2895  0.2985
+ (problem name, step name)                                0.2851  0.2941
+ (student name, unit name)                                0.2881  0.2942
+ (problem name, step name) and (student name, unit name)  0.2842  0.2898
+ Combination                                              0.2843  0.2883
(problem name, step name) and (student name, unit name) are very useful
Other Feature Generations
We tried many other ways; we will discuss some of them
They may be less effective than the feature combinations mentioned earlier
Knowledge Component Feature
Originally, binary features indicate whether a KC appears. An alternative way:
Each token in a KC becomes a feature
“Write expression, positive one slope” is similar to “Write expression, positive slope”
Use “write”, “expression”, “positive”, “slope”, and “one” as binary features
This performs well on A89 only
Grouping Similar Names
Two step names “−18 + x = 15” and “5 + x = −39” differ only in their numbers
For problem names and step names, we tried to group similar names together
By replacing numbers with a symbol, they become the same string and hence the same step name
The number of features is reduced without deteriorating the performance
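A sketch of the grouping with a regular expression (our illustration; the exact placeholder symbol the team used is not specified):

import re

def canonical(name):
    # Replace every (possibly signed, possibly decimal) number with a symbol
    return re.sub(r"-?\d+(\.\d+)?", "<num>", name)

# canonical("-18 + x = 15") == canonical("5 + x = -39") == "<num> + x = <num>"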
Training via Linear Classification
Large numbers of instances and features
The largest number of features used is 30,971,151

      #instances  #features
A89   8,918,055   ≥ 20M
B89   20,012,499  ≥ 30M

It is impractical to use nonlinear classifiers
Use LIBLINEAR developed at National Taiwan University (Fan et al., 2008)
We consider logistic regression instead of SVM
Training time: about 1 hour for 20M instances and 30M features (B89)
Training via Linear Classification (Cont’d)
Logistic regression uses CFA as the label y_i:

y_i = +1 if CFA = 1; y_i = −1 if CFA = 0

Assume the training set includes (x_i, y_i), i = 1, ..., l. Logistic regression assumes the following probability model:

P(y \mid x) = \frac{1}{1 + e^{-y w^T x}}
Training via Linear Classification (Cont’d)
Regularized logistic regression solves

\min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right)    (1)

w: weight vector of the decision function; w^T w / 2: L2-regularization term; C: penalty parameter
C is often decided by validation; we used C = 1 most of the time
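The team used LIBLINEAR directly; a rough scikit-learn equivalent (which calls the LIBLINEAR solver underneath) looks like this, reusing X from the sparse-feature sketch and toy labels:

from sklearn.linear_model import LogisticRegression

y = [1, -1, 1]  # toy CFA labels mapped to +1/-1
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
clf.fit(X, y)
p = clf.predict_proba(X)[:, 1]  # probability of CFA = 1, submitted as the prediction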
Training via Linear Classification (Cont’d)
L2 regularization gives a dense vector w; we have also considered L1 regularization to obtain a sparse w:

\min_w \; \|w\|_1 + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right)    (2)

Once w is obtained, we submitted either the label

y = 1 if w^T x ≥ 0, 0 otherwise

or the probability value 1/(1 + e^{-w^T x})
Training via Linear Classification (Cont’d)
Using probability values gives a smaller RMSE than using 0/1 labels
Assume the true label is 0
Wrong prediction (predicted probability p1 ≥ 0.5): the squared error is 1 using the label, but p1^2 using the probability
Correct prediction (predicted probability p2 ≤ 0.5): the errors are 0 and p2^2, respectively
Training via Linear Classification (Cont’d)
The quadratic function is increasing on [0, 1], so the gain from reducing the error 1 to p1^2 is often larger than the loss from increasing 0 to p2^2
Example (true label 0):

          p     error   label  error
Wrong     0.75  0.5625  1      1
Correct   0.25  0.0625  0      0
Training via Linear Classification (Cont’d)
We also checked linear support vector machine (SVM) solvers in LIBLINEAR
The results were slightly worse than with logistic regression
Result Using Sparse Features
Leaderboard results:

                       A89     B89
Basic sparse features  0.2895  0.2985
Best sparse features   0.2784  0.2830
Best leaderboard       0.2759  0.2777
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Condensed Features
Categorical feature ⇒ numerical feature
Use the correct first attempt rate (CFAR). Example for a student named sid:

CFAR = (# steps with student = sid and CFA = 1) / (# steps with student = sid)

We compute CFARs for student, step, KC, problem, (student, unit), (problem, step), (student, KC) and (student, problem)
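With pandas, each CFAR is just a group mean of CFA (a sketch on the toy frame above; in practice the rates come from the training logs only):

df["cfa"] = [1, 0, 1]  # toy CFA labels
groups = [["student"], ["step"], ["kc"], ["problem"], ["student", "unit"],
          ["problem", "step"], ["student", "kc"], ["student", "problem"]]
for key in groups:
    # Group mean of a 0/1 label = (# steps with CFA = 1) / (# steps)
    df["cfar_" + "_".join(key)] = df.groupby(key)["cfa"].transform("mean")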
Condensed Features (Cont’d)
Temporal features from the previous ≤ 6 steps with the same student and KC:
An indicator for the existence of such steps
Average of their CFAs
Average hints (up to six, depending on availability)
Other temporal features:
When was a step with the same student name and KC last seen?
Binary features to model four levels: same day, 1-6 days, 7-30 days, > 30 days
Condensed Features (Cont’d)
Opportunity and problem view:
First scaled by x/(x + 1), then linearly scaled to [0, 1]
In total, 17 condensed features:
Eight CFARs
Seven temporal features
Two scaled numerical features (opportunity and problem view)
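One plausible reading of the two-stage scaling (our sketch; the exact second stage is an assumption):

def scale01(x):
    x = np.asarray(x, float)
    s = x / (x + 1.0)  # first map to (0, 1); compresses large values
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

df["opportunity_scaled"] = scale01(df["opportunity"])
df["problem_view_scaled"] = scale01(df["problem_view"])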
Training by Random Forest
Due to the small number of features, we could try several classifiers via Weka (Hall et al., 2009)
To save training time, we considered a subset of the training data and split the classification task into several independent sets according to unit name
That is, for each unit name, we collected the last problem of each unit to form its training set.
In testing, we checked the testing point’s unit name to know which model to use.
Random Forest (Breiman, 2001) showed the best performance:
10 decision trees with depth 7
                         A89     B89
Basic sparse features    0.2895  0.2985
Best sparse features     0.2784  0.2830
Best condensed features  0.2824  0.2847
Best leaderboard         0.2759  0.2777

This small feature set works well
Due to the small feature size, training a Random Forest on the subset for one unit takes only a few minutes
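The team used Weka; an approximate scikit-learn analogue of the per-unit Random Forests with the stated settings, reusing the toy frame and CFAR columns from the sketches above:

from sklearn.ensemble import RandomForestClassifier

feature_cols = [c for c in df.columns if c.startswith("cfar_")]  # condensed features
models = {}
for unit, part in df.groupby("unit"):
    rf = RandomForestClassifier(n_estimators=10, max_depth=7)  # 10 trees, depth 7
    rf.fit(part[feature_cols], part["cfa"])
    models[unit] = rf
# At test time, pick the model by the step's unit name and take its predicted probability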
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Linear Regression for Ensemble
Past competitions (e.g., the Netflix Prize) showed that ensembles of results from different methods often boost performance
We find a weight vector to linearly combine the predicted probabilities from student sub-teams
We did not use a nonlinear method because a complex ensemble may cause overfitting
Linear Regression for Ensemble (Cont’d)
We checked linear models:
simple averaging, linear SVM, linear regression, logistic regression
Linear regression gave the best leaderboard result
Probably because linear regression minimizes RMSE (the evaluation criterion)
Linear Regression for Ensemble (Cont’d)
Given l testing steps and k vectors of predicted probabilities p_i ∈ [0,1]^l, i = 1, ..., k, solve

\min_w \; \|y - Pw\|^2 + \frac{\lambda}{2} \|w\|^2    (3)

λ: regularization parameter; y: CFA vector; P = [p_1, ..., p_k]
If λ = 0, (3) is just a standard least-squares problem
Linear Regression for Ensemble (Cont’d)
In SVM or logistic regression, we may add a bias term b, so

Pw ⇒ Pw + b1, where 1 = [1, ..., 1]^T

We also replaced \|w\|^2 with \|w\|^2 + b^2
The obtained weight w is used to calculate Pw for combining prediction results
Pw may be outside the interval [0, 1], so we employ a simple truncation:

min(1, max(0, Pw))    (4)

1: vector of all ones; 0: vector of all zeros
Linear Regression for Ensemble (Cont’d)
We also explored a sigmoid transformation and linearly scaling Pw to [0, 1]^l, but results did not improve
The analytical solution of (3) is

w = \left(P^T P + \frac{\lambda}{2} I\right)^{-1} P^T y    (5)

where I is the identity matrix
The problem is that y is unknown
Estimating y: First Approach
Use validation data to estimate w
Split the training set into V and Ṽ internally
Student sub-teams generated two prediction results, on Ṽ and T̃:
Train on V ⇒ predict Ṽ to obtain p̃_i
Train on T ⇒ predict T̃ to obtain p_i
Let P̃ be the matrix collecting all p̃_i; the true ỹ is known
Use ỹ and P̃ in (3) to obtain w
Final prediction: calculate Pw and apply the truncation in (4)
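A sketch of this approach: solve (3) in closed form on the internal validation predictions, then combine the test predictions (toy numbers below are ours):

import numpy as np

def ensemble_weights(P_val, y_val, lam):
    # Closed form of (3): w = (P^T P + (lam/2) I)^{-1} P^T y
    k = P_val.shape[1]
    return np.linalg.solve(P_val.T @ P_val + 0.5 * lam * np.eye(k),
                           P_val.T @ y_val)

P_val = np.array([[0.8, 0.6], [0.3, 0.4], [0.9, 0.7]])  # k = 2 sub-team predictions
y_val = np.array([1.0, 0.0, 1.0])                        # true CFA on validation set
w = ensemble_weights(P_val, y_val, lam=1.0)
# Final submission on the test predictions P, truncated as in (4):
# p_final = np.clip(P @ w, 0.0, 1.0)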
Estimating y: Second Approach
Use leaderboard information to estimate P^T y in (5)
We follow Töscher and Jahrer (2009)
Since

r_i ≡ \sqrt{\|p_i - y\|^2 / l},

we have

p_i^T y = \frac{\|p_i\|^2 + \|y\|^2 - l r_i^2}{2}    (6)

r_i and \|y\| are unavailable; they are estimated by r_i ≈ r̂_i and \|y\|^2 ≈ l r̂_0^2, where
r̂_i: RMSE on the leaderboard from submitting p_i
r̂_0: RMSE from submitting the zero vector
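A sketch of this second approach, combining (5) and (6) with the leaderboard estimates:

import numpy as np

def ensemble_from_leaderboard(P, r_hat, r0_hat, lam):
    # P: l x k test predictions; r_hat[i]: leaderboard RMSE of p_i;
    # r0_hat: leaderboard RMSE of the all-zero submission
    l, k = P.shape
    y_norm2 = l * r0_hat ** 2  # ||y||^2 estimated from the zero submission
    Pty = ((P ** 2).sum(axis=0) + y_norm2 - l * np.asarray(r_hat) ** 2) / 2  # (6)
    return np.linalg.solve(P.T @ P + 0.5 * lam * np.eye(k), Pty)             # (5)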
Ensemble Results
We collected 19 results from 7 sub-teams
Each result comes from training a single classifier
To select λ, we gradually increased it until the leaderboard result started to decline
This procedure, conducted in the last several hours before the deadline, was not very systematic
Ensemble Results (Cont’d)
Best A89 result: estimating P^T y in (5) with λ = 10, i.e., the second approach
Best B89 result: using the validation set to estimate w with λ = 0 (no regularization), i.e., the first approach
Ensemble Results (Cont’d)
Ensemble significantly improves the results
                         A89     B89     Avg.
Basic sparse features    0.2895  0.2985  0.2940
Best sparse features     0.2784  0.2830  0.2807
Best condensed features  0.2824  0.2847  0.2835
Best ensemble            0.2756  0.2780  0.2768
Best leaderboard         0.2759  0.2777  0.2768
Our team ranked 2nd on the leaderboard
The difference from the 1st was small; we hoped that our solution did not overfit the leaderboard too much and might be better on the complete challenge set
Final Results
Rank  Team name                   Leaderboard  Cup
1     National Taiwan University  0.276803     0.272952
2     Zhang and Su                0.276790     0.273692
3     BigChaos @ KDD              0.279046     0.274556
4     Zach A. Pardos              0.279695     0.276590
5     Old Dogs With New Tricks    0.281163     0.277864

Team names used during the competition:
Snoopy ⇒ National Taiwan University
BbCc ⇒ Zhang and Su
Cup scores are generally better than leaderboard scores
Final Results (Cont’d)
There were many submissions in the last week before the deadline, particularly in the last two hours
Everyone (including ourselves) tried to achieve better leaderboard results
Overfitting may be a concern
It is not very clear how serious this problem is
Leaderboard Immediately After the Deadline
[screenshot of the leaderboard]

Web Page of Final Competition Results
[screenshot of the official results page]
Outline
Introduction
Course at NTU
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions
Discussion and Conclusions
Diversities in Learning
We believe that one key to our ensemble's success is its diversity:
Feature diversity
Classifier diversity
Different sub-teams tried different ideas guided by their human intelligence
Our student sub-teams even had biodiversity:
Mammals: snoopy, tiger
Birds: weka, duck
Insects: armyants, trilobite
Marine animals: starfish, sunfish
Conclusions
Feature engineering and classifier ensembles seem to be useful for educational data mining
All our team members worked very hard, but we were also a bit lucky
We thank the organizers for organizing this interesting and fruitful competition
We also thank National Taiwan University for providing a stimulating research environment