Machine Learning Techniques
(機器學習技法)
Lecture 16: Finale
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
Finale
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa), jointly optimized with stochastic gradient descent
Lecture 16: Finale
Feature Exploitation Techniques
Error Optimization Techniques
Overfitting Elimination Techniques
Machine Learning in Practice
Finale Feature Exploitation Techniques
Exploiting Numerous Features via Kernel
numerous features within some Φ: embedded in kernel K_Φ with the inner product operation
• Polynomial Kernel: ‘scaled’ polynomial transforms
• Gaussian Kernel: infinite-dimensional transforms
• Stump Kernel: decision stumps as transforms
• Sum of Kernels: transform union
• Product of Kernels: transform combination
• Mercer Kernels: transforms defined implicitly
models: kernel ridge regression; kernel logistic regression; SVM; SVR; probabilistic SVM
possibly: Kernel PCA, Kernel k-Means, . . .
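As a quick illustration of ‘numerous features embedded in a kernel’, here is a minimal NumPy sketch (not from the lecture) checking that the degree-2 polynomial kernel (1 + x·x')² computes the same inner product as an explicitly constructed transform Φ; the 2-D toy inputs are assumptions for illustration only.

```python
import numpy as np

def poly2_transform(x):
    """Explicit degree-2 transform whose inner product equals (1 + x.x')^2 (d = 2 here)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly2_kernel(x, xp):
    """Kernel computes the same inner product without ever building the transform."""
    return (1.0 + np.dot(x, xp)) ** 2

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
print(np.dot(poly2_transform(x), poly2_transform(xp)))  # explicit Φ(x)·Φ(x')
print(poly2_kernel(x, xp))                              # same value via the kernel
```

The same idea scales to transforms too large to write out (Gaussian, stump, Mercer kernels): the kernel is the inner product, so the transform never needs to be materialized.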
Finale Feature Exploitation Techniques
Exploiting Predictive Features via Aggregation
predictive features within some Φ: φ_t(x) = g_t(x)
• Decision Stump: simplest perceptron; simplest DecTree
• Decision Tree: branching (divide) + leaves (conquer)
• (Gaussian) RBF: prototype (center) + influence
aggregation of the g_t:
• Uniform: Bagging; Random Forest
• Non-Uniform: AdaBoost; GradientBoost; probabilistic SVM
• Conditional: Decision Tree; Nearest Neighbor
possibly: Infinite Ensemble Learning, Decision Tree SVM, . . .
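To make the ‘predictive features φ_t(x) = g_t(x)’ view concrete, below is a small illustrative sketch (not from the lecture) of uniform aggregation: decision stumps learned on bootstrap samples and combined by a uniform vote, bagging-style. The toy data and the brute-force stump learner are assumptions for illustration only.

```python
import numpy as np

def learn_stump(X, y):
    """Fit a stump g(x) = s * sign(x[i] - theta) by brute force on 0/1 error."""
    best = None
    for i in range(X.shape[1]):
        for theta in np.concatenate(([-np.inf], np.sort(X[:, i]))):
            for s in (+1, -1):
                pred = s * np.sign(X[:, i] - theta)
                pred[pred == 0] = s
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, i, theta, s)
    return best[1:]  # (feature index, threshold, direction)

def stump_predict(stump, X):
    i, theta, s = stump
    pred = s * np.sign(X[:, i] - theta)
    pred[pred == 0] = s
    return pred

# toy data: label depends on a noisy combination of two features
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(200))

# uniform aggregation: bootstrap, learn a stump, then vote uniformly
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))      # bootstrap sample
    stumps.append(learn_stump(X[idx], y[idx]))
G = np.sign(np.mean([stump_predict(s, X) for s in stumps], axis=0))
print("single-stump error:", np.mean(stump_predict(stumps[0], X) != y))
print("uniform-vote error:", np.mean(G != y))
```

Non-uniform (AdaBoost-style) and conditional (tree-style) aggregation differ only in how the votes of the g_t are weighted or gated.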
Finale Feature Exploitation Techniques
Exploiting Hidden Features via Extraction
hidden features within some Φ: as hidden variables to be ‘jointly’ optimized with the usual weights, possibly with the help of unsupervised learning
• Neural Network; Deep Learning: neuron weights
• AdaBoost; GradientBoost: g_t parameters
• RBF Network: RBF centers
• k-Means: cluster centers
• Matrix Factorization: user/movie factors
• Autoencoder; PCA: ‘basis’ directions
possibly: Gradient-Boosted Neurons, NNet on Factorized Features, . . .
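A minimal sketch (assumptions: toy ratings, two hidden dimensions, plain SGD without regularization) of how matrix factorization jointly optimizes the hidden user/movie factors, in the spirit of Lecture 15.

```python
import numpy as np

# toy ratings: (user, movie, rating); the rating matrix is only partially observed
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 3.0), (3, 2, 4.0)]
n_users, n_movies, d = 4, 3, 2                # d: number of hidden (extracted) features

rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((n_users, d))   # user factors = hidden features
W = 0.1 * rng.standard_normal((n_movies, d))  # movie factors = hidden features

eta = 0.05
for epoch in range(200):
    rng.shuffle(ratings)                      # SGD: one observed rating at a time
    for u, m, r in ratings:
        err = r - V[u] @ W[m]                 # residual on this rating
        # jointly update both sets of hidden variables with the same gradient step
        V[u], W[m] = V[u] + eta * err * W[m], W[m] + eta * err * V[u]

print("reconstructed ratings:\n", np.round(V @ W.T, 2))
```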
Finale Feature Exploitation Techniques
Exploiting Low-Dim. Features via Compression
low-dimensional features within some Φ: compressed from the original features
• Decision Stump; DecTree Branching: ‘best’ naïve projection to R
• Random Forest Tree Branching: ‘random’ low-dim. projection
• Autoencoder; PCA: info.-preserving compression
• Matrix Factorization: projection from abstract to concrete
• Feature Selection: ‘most-helpful’ low-dimensional projection
possibly: other ‘dimension reduction’ models
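A small sketch of PCA as info.-preserving compression, assuming toy data that mostly varies along two hidden directions; the top singular vectors give the low-dimensional projection and allow reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 5-dimensional points that mostly vary along 2 hidden directions
Z = rng.standard_normal((100, 2))
A = rng.standard_normal((2, 5))
X = Z @ A + 0.05 * rng.standard_normal((100, 5))

# PCA: center, then take the top-k right singular vectors as the projection
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 2
W = Vt[:k]                       # k x 5 projection matrix ('basis' directions)

X_low = (X - mean) @ W.T         # compressed, low-dimensional features
X_rec = X_low @ W + mean         # decode back to the original space
print("avg. squared reconstruction error:", np.mean((X - X_rec) ** 2))
```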
Finale Feature Exploitation Techniques
Fun Time
Consider running AdaBoost-Stump on a PCA-preprocessed data set. Then, in terms of the original features x, what does the final hypothesis G(x) look like?
1 a neural network with tanh(·) in the hidden neurons
2 a neural network with sign(·) in the hidden neurons
3 a decision tree
4 a random forest
Reference Answer: 2
PCA results in a linear transformation of x. Applying a decision stump on the transformed data is then equivalent to applying a perceptron on the original data, so the resulting G is simply a linear aggregation of perceptrons.
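The reference answer can be checked numerically. The sketch below (an illustration, not part of the quiz) verifies that a decision stump on PCA-transformed data equals a perceptron sign(w·x + b) in the original space, with w and b derived from the PCA projection; the stump parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))

# PCA transform: z = W (x - mean), a linear (affine) map of the original x
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:2]
Z = (X - mean) @ W.T

# a decision stump on PCA feature i: g(z) = s * sign(z_i - theta)
i, theta, s = 1, 0.3, -1
stump_out = s * np.sign(Z[:, i] - theta)

# the same hypothesis written as a perceptron in the original x-space
w = s * W[i]
b = -s * (W[i] @ mean + theta)
perceptron_out = np.sign(X @ w + b)

print("identical on all points:", np.array_equal(stump_out, perceptron_out))
```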
Finale Error Optimization Techniques
Numerical Optimization via Gradient Descent
when ∇E ‘approximately’ defined, use it for a 1st-order approximation:
new variables = old variables − η∇E
• SGD/Minibatch/GD: (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe)
• Steepest Descent: AdaBoost; GradientBoost
• Functional GD: AdaBoost; GradientBoost
possibly: 2nd-order techniques, GD under constraints, . . .
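A minimal sketch of the update rule ‘new variables = old variables − η∇E’, here applied to the logistic-regression error on toy data; the step size and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def gradient(w, X, y):
    """∇E for logistic regression, E(w) = mean ln(1 + exp(-y w·x)), y in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # sigmoid(-y w·x)
    return X.T @ (-y * s) / len(y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 2))])   # constant feature for bias
true_w = np.array([0.5, 2.0, -1.0])
y = np.sign(X @ true_w + 0.2 * rng.standard_normal(100))

w, eta = np.zeros(3), 0.5
for t in range(500):
    w = w - eta * gradient(w, X, y)          # new variables = old variables − η∇E
print("learned w:", np.round(w, 2), " training 0/1 error:", np.mean(np.sign(X @ w) != y))
```

SGD/minibatch variants only change how many examples enter each gradient estimate; functional GD applies the same idea to a function (the ensemble) instead of a weight vector.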
Finale Error Optimization Techniques
Indirect Optimization via Equivalent Solution
when the original problem is difficult to solve, seek an equivalent solution
• Dual SVM: equivalence via convex QP
• Kernel LogReg / Kernel RidgeReg: equivalence via the representer theorem
• PCA: equivalence to an eigenproblem
some other boosting models and modern solvers of kernel models rely heavily on such a technique
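For instance, kernel ridge regression is solved indirectly: by the representer theorem, the ridge problem in Φ-space is equivalent to the linear system (λI + K)β = y. A minimal sketch, assuming a Gaussian kernel and toy 1-D data.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

# representer theorem: the optimal w lies in the span of the Φ(x_n),
# so ridge regression in Φ-space is equivalent to solving (λI + K) β = y
lam = 0.1
K = gaussian_kernel_matrix(X, X)
beta = np.linalg.solve(lam * np.eye(len(X)) + K, y)

# predict at new points with g(x) = Σ_n β_n K(x_n, x)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(np.round(gaussian_kernel_matrix(X_test, X) @ beta, 2))
print(np.round(np.sin(X_test[:, 0]), 2))   # target values, for comparison
```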
Finale Error Optimization Techniques
Complicated Optimization via Multiple Steps
when the original problem is difficult to solve, seek ‘easier’ sub-problems
• Multi-Stage: probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training
• Alternating Optim.: k-Means; alternating LeastSqr; (steepest descent)
• Divide & Conquer: decision tree
useful for complicated models
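k-Means is the canonical alternating-optimization example: with the centers fixed, the best assignment is the nearest center; with the assignments fixed, the best center is the cluster mean. A minimal sketch on toy two-cluster data.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Alternate between the two 'easier' sub-problems until (near) convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # step 1: centers fixed -> optimal assignment is the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # step 2: assignments fixed -> optimal center is the cluster mean
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 2)) + [3, 3],
               rng.standard_normal((50, 2)) - [3, 3]])
centers, assign = kmeans(X, k=2)
print("centers:\n", np.round(centers, 2))
```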
Finale Error Optimization Techniques
Fun Time
When running the DeepNet algorithm introduced in Lecture 13 on a PCA-preprocessed data set, which optimization technique is used?
1 variants of gradient descent
2 locating equivalent solutions
3 multi-stage optimization
4 all of the above
Reference Answer: 4
minibatch GD for training; equivalent eigenproblem solution for PCA; multi-stage for pre-training
Finale Overfitting Elimination Techniques
Overfitting Elimination via Regularization
when model too ‘powerful’: add brakes somewhere
• large-margin: SVM; AdaBoost (indirectly)
• denoising: autoencoder
• pruning: decision tree
• L2: SVR; kernel models; NNet [weight-decay]
• weight-elimination: NNet
• early stopping: NNet (any GD-like)
• voting/averaging: uniform blending; Bagging; Random Forest
• constraining: autoenc. [weights]; RBF [# centers]
arguably the most important techniques
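A small sketch of the L2 ‘brake’ (weight decay) on an over-parameterized toy regression: the regularization term 2λw is simply added to the gradient-descent update, shrinking the weights as λ grows. All numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# deliberately over-parameterized linear regression: 30 features, only 20 examples
X = rng.standard_normal((20, 30))
y = X[:, 0] + 0.1 * rng.standard_normal(20)     # only the first feature matters

def fit(lam, eta=0.01, steps=5000):
    """Gradient descent on the squared error plus the L2 'brake' (weight decay)."""
    w = np.zeros(30)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
        w -= eta * grad
    return w

for lam in (0.0, 1.0):
    w = fit(lam)
    print(f"lambda={lam}: |w| = {np.linalg.norm(w):.2f}, "
          f"weight on the true feature = {w[0]:.2f}")
```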
Finale Overfitting Elimination Techniques
Overfitting Elimination via Validation
when model too ‘powerful’: check performance carefully and honestly
• # SV: SVM/SVR
• OOB: Random Forest
• Internal Validation: blending; DecTree pruning
simple but necessary
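A minimal sketch of internal validation: hold out part of the training data, pick the regularization parameter that looks best on the held-out part, then retrain on everything. The ridge solver and the candidate λ grid are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(200)

# internal validation: hold out part of the data to pick lambda honestly
train, val = np.arange(0, 150), np.arange(150, 200)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

best = None
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    w = ridge(X[train], y[train], lam)
    val_err = np.mean((X[val] @ w - y[val]) ** 2)
    print(f"lambda={lam:6}: validation error = {val_err:.3f}")
    if best is None or val_err < best[0]:
        best = (val_err, lam)

print("chosen lambda:", best[1])
# after choosing, retrain on all the data with the chosen lambda
w_final = ridge(X, y, best[1])
```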
Finale Overfitting Elimination Techniques
Fun Time
What is the major technique for eliminating overfitting in Random Forest?
1 voting/averaging
2 pruning
3 early stopping
4 weight-elimination
Reference Answer: 1
Random Forest, based on uniform blending, relies on voting/averaging for regularization.
Finale Machine Learning in Practice
NTU KDDCup 2010 World Champion Model
Feature engineering and classifier ensemble for KDD Cup 2010, Yu et al., KDDCup 2010
linear blending of
• Logistic Regression + many raw-encoded features
• Random Forest + human-designed features
yes, you’ve learned everything! :-)
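A hedged sketch of linear blending in the spirit of the champion model: treat each trained model’s validation-set predictions as features and fit blending weights by least squares. The two ‘model’ prediction vectors below are simulated stand-ins, not real LogReg/Random Forest outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300
y = rng.integers(0, 2, N)                      # true labels of a validation set

# stand-ins for two already-trained models (e.g. LogReg and Random Forest):
# predicted probabilities that are individually noisy in different ways
p_logreg = np.clip(y + 0.35 * rng.standard_normal(N), 0, 1)
p_forest = np.clip(y + 0.35 * rng.standard_normal(N), 0, 1)

# linear blending: least-squares fit of blending weights on the validation predictions
Z = np.column_stack([np.ones(N), p_logreg, p_forest])
alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
p_blend = Z @ alpha

def error(p):                                  # 0/1 error when thresholding at 0.5
    return np.mean((p > 0.5).astype(int) != y)

print("LogReg:", error(p_logreg), " Forest:", error(p_forest), " Blend:", error(p_blend))
```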
Finale Machine Learning in Practice
NTU KDDCup 2011 Track 1 World Champion Model
A linear ensemble of individual and blended models for music rating prediction, Chen et al., KDDCup 2011
NNet, DecTree-like, and then linear blending of
• Matrix Factorization variants, including probabilistic PCA
• Restricted Boltzmann Machines: an ‘extended’ autoencoder
• k Nearest Neighbors
• Probabilistic Latent Semantic Analysis: an extraction model that has ‘soft clusters’ as hidden variables
• linear regression, NNet, & GBDT
yes, you can ‘easily’ understand everything! :-)
Finale Machine Learning in Practice
NTU KDDCup 2012 Track 2 World Champion Model
A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012, Wu et al., KDDCup 2012
NNet, GBDT-like, and then linear blending of
• Linear Regression variants, including linear SVR
• Logistic Regression variants
• Matrix Factorization variants
• . . .
the ‘key’ is to blend properly without overfitting
Finale Machine Learning in Practice
NTU KDDCup 2013 Track 1 World Champion Model
Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013, Li et al., KDDCup 2013
linear blending of
• Random Forest with many, many, many trees
• GBDT variants
with tons of effort in designing features
‘another key’ is to construct features with domain knowledge
Finale Machine Learning in Practice
ICDM 2006 Top 10 Data Mining Algorithms
1 C4.5: another decision tree
2 k-Means
3 SVM
4 Apriori: for frequent itemset mining
5 EM: ‘alternating optimization’ algorithm for some models
6 PageRank: for link analysis, similar to matrix factorization
7 AdaBoost
8 k Nearest Neighbor
9 Naive Bayes: a simple linear model with ‘weights’ decided by data statistics
10 C&RT
personal view of five missing ML competitors: LinReg, LogReg, Random Forest, GBDT, NNet
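To illustrate why Naive Bayes is ‘a simple linear model with weights decided by data statistics’, here is a sketch (not from the lecture) of Bernoulli Naive Bayes on toy binary features: the per-class feature frequencies, with Laplace smoothing, directly determine a linear decision function sign(w·x + b).

```python
import numpy as np

rng = np.random.default_rng(0)
# binary features, binary labels: Bernoulli Naive Bayes
X = rng.integers(0, 2, size=(500, 6))
y = np.where(X[:, 0] + X[:, 1] + rng.random(500) > 1.5, +1, -1)

# 'weights decided by data statistics': per-class feature frequencies (Laplace-smoothed)
p_pos = (X[y == +1].sum(axis=0) + 1) / (np.sum(y == +1) + 2)
p_neg = (X[y == -1].sum(axis=0) + 1) / (np.sum(y == -1) + 2)
prior = np.mean(y == +1)

# the log-odds of Bernoulli Naive Bayes is linear in x, so prediction is sign(w·x + b)
w = np.log(p_pos / p_neg) - np.log((1 - p_pos) / (1 - p_neg))
b = np.sum(np.log((1 - p_pos) / (1 - p_neg))) + np.log(prior / (1 - prior))

pred = np.sign(X @ w + b)
print("training 0/1 error:", np.mean(pred != y))
```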