(1)

Machine Learning Techniques (機器學習技法)

Lecture 16: Finale

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)


(2)

Finale

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa), jointly optimized with stochastic gradient descent
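To make that recipe concrete, here is a minimal sketch (not the course's reference implementation) of the per-rating SGD update for matrix factorization, assuming ratings are given as (user index, movie index, rating) triples and squared error is used; the function name and defaults are illustrative:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_movies, d=10, eta=0.01, epochs=20, seed=0):
    """Jointly optimize user features W and movie features V with SGD.
    ratings: list of (user_index, movie_index, rating) triples."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_users, d))   # extracted user features
    V = 0.1 * rng.standard_normal((n_movies, d))  # per-movie linear models
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - W[u] @ V[m]                 # residual of current prediction
            # move both factors along the negative gradient of the squared error
            W[u], V[m] = W[u] + eta * err * V[m], V[m] + eta * err * W[u]
    return W, V
```

A rating prediction for user u on movie m is then simply W[u] @ V[m].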

Lecture 16: Finale

Feature Exploitation Techniques

Error Optimization Techniques

Overfitting Elimination Techniques

Machine Learning in Practice

(3)

Finale Feature Exploitation Techniques

Exploiting Numerous Features via Kernel

numerous features within some Φ: embedded in kernel K_Φ with inner product operation

• Polynomial Kernel: ‘scaled’ polynomial transforms
• Gaussian Kernel: infinite-dimensional transforms
• Stump Kernel: decision stumps as transforms
• Sum of Kernels: transform union
• Product of Kernels: transform combination
• Mercer Kernels: transform implicitly

kernel models: SVM, SVR, probabilistic SVM, kernel ridge regression, kernel logistic regression

possibly: Kernel PCA, Kernel k-Means, . . .
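As a small illustration (a sketch only, independent of any particular library), the two workhorse kernels can be written directly as inner-product-style computations, and sums or products of valid kernels remain valid kernels:

```python
import numpy as np

def poly_kernel(x1, x2, zeta=1.0, gamma=1.0, Q=2):
    # 'scaled' polynomial transform: K(x, x') = (zeta + gamma * x.x')^Q
    return (zeta + gamma * np.dot(x1, x2)) ** Q

def gaussian_kernel(x1, x2, gamma=1.0):
    # infinite-dimensional transform: K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# transform union / combination: still valid (Mercer) kernels
def sum_kernel(x1, x2):
    return poly_kernel(x1, x2) + gaussian_kernel(x1, x2)

def product_kernel(x1, x2):
    return poly_kernel(x1, x2) * gaussian_kernel(x1, x2)
```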


(4)

Finale Feature Exploitation Techniques

Exploiting Predictive Features via Aggregation

predictive features within some Φ: φ_t(x) = g_t(x)

• Decision Stump: simplest perceptron; simplest DecTree
• Decision Tree: branching (divide) + leaves (conquer)
• (Gaussian) RBF: prototype (center) + influence

aggregation flavors:
• Uniform: Bagging; Random Forest
• Non-Uniform: AdaBoost; GradientBoost; probabilistic SVM
• Conditional: Decision Tree; Nearest Neighbor

possibly: Infinite Ensemble Learning, Decision Tree SVM, . . .
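To make the φ_t(x) = g_t(x) view concrete, here is a minimal sketch (the stump parameters are made up purely for illustration) of decision stumps acting as predictive features that are aggregated uniformly, non-uniformly, or conditionally:

```python
import numpy as np

def stump(s, i, theta):
    """Decision stump g(x) = s * sign(x[i] - theta): the simplest perceptron."""
    return lambda x: s * np.sign(x[i] - theta)

# a few stumps serving as predictive features phi_t(x) = g_t(x)
stumps = [stump(+1, 0, 0.3), stump(-1, 1, -0.1), stump(+1, 1, 0.5)]

def uniform_vote(x):                           # Bagging / Random Forest flavor
    return np.sign(sum(g(x) for g in stumps))

def weighted_vote(x, alpha=(0.8, 0.5, 1.2)):   # AdaBoost / GradientBoost flavor
    return np.sign(sum(a * g(x) for a, g in zip(alpha, stumps)))

def conditional_vote(x):                       # decision-tree flavor: pick one g(x) by a condition
    return stumps[0](x) if x[0] > 0 else stumps[1](x)

print(uniform_vote(np.array([0.4, 0.2])), weighted_vote(np.array([0.4, 0.2])))
```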

(5)

Finale Feature Exploitation Techniques

Exploiting Hidden Features via Extraction

hidden features within some Φ: as hidden variables to be ‘jointly’ optimized with the usual weights
—possibly with the help of unsupervised learning

• Neural Network; Deep Learning: neuron weights
• AdaBoost; GradientBoost: g_t parameters
• RBF Network: RBF centers
• k-Means: cluster centers
• Matrix Factorization: user/movie factors
• Autoencoder; PCA: ‘basis’ directions

possibly: GradientBoosted Neurons, NNet on Factorized Features, . . .
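One way to see the ‘hidden variables plus usual weights’ pattern in code: the sketch below extracts RBF centers with k-means (the unsupervised help) and then fits the output weights by least squares, which is the stage-wise shortcut rather than a fully joint optimization. It assumes scikit-learn's KMeans is available; the function names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_network_fit(X, y, k=5, gamma=1.0, seed=0):
    """Stage 1: extract hidden features (RBF centers) without labels.
    Stage 2: fit the usual linear weights on top of the extracted features."""
    centers = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).cluster_centers_
    # Gaussian RBF features: Phi[n, m] = exp(-gamma * ||x_n - mu_m||^2)
    Phi = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def rbf_network_predict(X, centers, w, gamma=1.0):
    Phi = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    return Phi @ w
```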


(6)

Finale Feature Exploitation Techniques

Exploiting Low-Dim. Features via Compression

low-dimensional features within some Φ: compressed from the original features

• Decision Stump; DecTree Branching: ‘best’ naïve projection to R
• Random Forest Tree Branching: ‘random’ low-dim. projection
• Autoencoder; PCA: info.-preserving compression
• Matrix Factorization: projection from abstract to concrete
• Feature Selection: ‘most-helpful’ low-dimensional projection

possibly: other ‘dimension reduction’ models
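A minimal sketch of info.-preserving compression with PCA: center the data, take the top directions from an eigen-decomposition (the equivalent eigenproblem discussed under error optimization below), and project:

```python
import numpy as np

def pca_compress(X, d_tilde):
    """Keep the top d_tilde 'basis' directions of the centered data."""
    X_bar = X - X.mean(axis=0)                            # center the data
    eigvals, eigvecs = np.linalg.eigh(X_bar.T @ X_bar)    # eigh: ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d_tilde]]   # top-d_tilde directions
    return X_bar @ W, W                                   # compressed features + projection

# the compressed Z can feed any downstream model
X = np.random.default_rng(0).standard_normal((100, 10))
Z, W = pca_compress(X, d_tilde=3)                         # Z has shape (100, 3)
```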

(7)

Finale Feature Exploitation Techniques

Fun Time

Consider running AdaBoost-Stump on a PCA-preprocessed data set. Then, in terms of the original features x, what does the final hypothesis G(x) look like?

1 a neural network with tanh(·) in the hidden neurons

2 a neural network with sign(·) in the hidden neurons

3 a decision tree

4 a random forest

Reference Answer: 2

PCA results in a linear transformation of x. Then, applying a decision stump on the transformed data is like applying a perceptron on the original data, so the resulting G is simply a linear aggregation of perceptrons.



(9)

Finale Error Optimization Techniques

Numerical Optimization via Gradient Descent

when ∇E is ‘approximately’ defined, use it for a 1st-order approximation:
new variables = old variables − η∇E

• SGD/Minibatch/GD: (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe)
• Steepest Descent: AdaBoost; GradientBoost
• Functional GD: AdaBoost; GradientBoost

possibly: 2nd-order techniques, GD under constraints, . . .
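The shared update rule in a minimal sketch, using the logistic-regression gradient as the example of ∇E (plain GD shown here; the SGD/minibatch variants simply estimate the gradient on one example or a small batch):

```python
import numpy as np

def logistic_gradient(w, X, y):
    """Gradient of E_in for logistic regression: average of theta(-y_n w.x_n) * (-y_n x_n)."""
    return np.mean((-y * X.T) * (1.0 / (1.0 + np.exp(y * (X @ w)))), axis=1)

def gradient_descent(X, y, eta=0.1, T=1000):
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = w - eta * logistic_gradient(w, X, y)   # new variables = old variables - eta * gradient
    return w
```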


(10)

Finale Error Optimization Techniques

Indirect Optimization via Equivalent Solution

when the original problem is difficult to solve, seek an equivalent solution

• Dual SVM: equivalence via convex QP
• Kernel LogReg / Kernel RidgeReg: equivalence via the representer theorem
• PCA: equivalence to an eigenproblem

some other boosting models and modern solvers of kernel models rely heavily on this technique
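As one small example of the route (a sketch following the representer theorem, not a production solver): the kernel ridge regression optimum can be found by solving a single linear system for β instead of optimizing w in the possibly infinite-dimensional transformed space:

```python
import numpy as np

def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
    """Optimal w = sum_n beta_n * z_n, where beta = (lambda * I + K)^{-1} y."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)                                  # Gaussian kernel matrix
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)      # beta

def kernel_ridge_predict(X_train, beta, x, gamma=1.0):
    k = np.exp(-gamma * ((X_train - x) ** 2).sum(axis=1))    # kernel values against training points
    return k @ beta
```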

(11)

Finale Error Optimization Techniques

Complicated Optimization via Multiple Steps

when the original problem is difficult to solve, seek ‘easier’ sub-problems

• Multi-Stage: probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training
• Alternating Optim.: k-Means; alternating LeastSqr; (steepest descent)
• Divide & Conquer: decision tree

useful for complicated models
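A minimal sketch of the alternating-optimization pattern with k-means: with the centers fixed, optimally reassign each point to its nearest center; with the assignments fixed, optimally recompute each center as the mean of its points; repeat until nothing changes:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]               # initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # sub-problem 1: best assignments with centers fixed
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # sub-problem 2: best centers with assignments fixed
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                            # converged
            break
        mu = new_mu
    return mu, labels
```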


(12)

Finale Error Optimization Techniques

Fun Time

When running the DeepNet algorithm introduced in Lecture 13 on a PCA-preprocessed data set, which optimization technique is used?

1 variants of gradient descent

2 locating equivalent solutions

3 multi-stage optimization

4 all of the above

Reference Answer: 4

minibatch GD for training; the equivalent eigenproblem solution for PCA; multi-stage optimization for pre-training



(14)

Finale Overfitting Elimination Techniques

Overfitting Elimination via Regularization

when model too ‘powerful’: add brakes somewhere

• large-margin: SVM; AdaBoost (indirectly)
• denoising: autoencoder
• pruning: decision tree
• L2: SVR; kernel models; NNet [weight-decay]
• weight-elimination: NNet
• early stopping: NNet (any GD-like)
• voting/averaging: uniform blending; Bagging; Random Forest
• constraining: autoencoder [weights]; RBF Network [# centers]; . . .

arguably the most important techniques
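A minimal sketch of the most common ‘brake’, L2 regularization, in its two familiar guises: the closed-form ridge solution and the weight-decay form of a gradient-descent step (assuming the augmented error E_in(w) + (λ/N)·wᵀw; the helper names are illustrative only):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """L2-regularized linear regression: w = (X^T X + lambda * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def weight_decay_step(w, grad, eta=0.1, lam=0.01, N=100):
    """One GD step with weight decay: shrink w slightly before the usual update."""
    return (1 - 2 * eta * lam / N) * w - eta * grad
```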

(15)

Finale Overfitting Elimination Techniques

Overfitting Elimination via Validation

when model too ‘powerful’: check performance carefully and honestly

• # SV: SVM/SVR
• OOB: Random Forest
• Internal Validation: blending; DecTree pruning

simple but necessary
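A minimal sketch of checking performance carefully and honestly with one held-out validation set to pick a regularization parameter; `train` and `error` are hypothetical callables standing in for any learning algorithm and error measure:

```python
import numpy as np

def select_lambda(X, y, lambdas, train, error, val_ratio=0.25, seed=0):
    """Split once, train each candidate on D_train, compare E_val, then retrain on all data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_ratio)
    val, tr = idx[:n_val], idx[n_val:]
    best = min(lambdas, key=lambda lam: error(train(X[tr], y[tr], lam), X[val], y[val]))
    return train(X, y, best), best
```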


(16)

Finale Overfitting Elimination Techniques

Fun Time

What is the major technique for eliminating overfitting in Random Forest?

1 voting/averaging

2 pruning

3 early stopping

4 weight-elimination

Reference Answer: 1

Random Forest, based on uniform blending, relies on voting/averaging for regularization.



(18)

Finale Machine Learning in Practice

NTU KDDCup 2010 World Champion Model

Feature engineering and classifier ensemble for KDD Cup 2010, Yu et al., KDDCup 2010

linear blending of
• Logistic Regression + many rawly encoded features
• Random Forest + human-designed features

yes, you’ve learned everything! :-)

(19)

Finale Machine Learning in Practice

NTU KDDCup 2011 Track 1 World Champion Model

A linear ensemble of individual and blended models for music rating prediction, Chen et al., KDDCup 2011

NNet, DecTree-like, and then linear blending of
• Matrix Factorization variants, including probabilistic PCA
• Restricted Boltzmann Machines: an ‘extended’ autoencoder
• k Nearest Neighbors
• Probabilistic Latent Semantic Analysis: an extraction model that has ‘soft clusters’ as hidden variables
• linear regression, NNet, & GBDT

yes, you can ‘easily’ understand everything! :-)


(20)

Finale Machine Learning in Practice

NTU KDDCup 2012 Track 2 World Champion Model

A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012, Wu et al., KDDCup 2012

NNet, GBDT-like, and then linear blending of
• Linear Regression variants, including linear SVR
• Logistic Regression variants
• Matrix Factorization variants
• . . .

‘key’ is to blend properly without overfitting
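A minimal sketch of that key step: collect each already-trained model's predictions on a held-out blending set (never the data the models were trained on) and learn a regularized linear combination there; the `.predict` interface below is an assumption for illustration:

```python
import numpy as np

def linear_blend(models, X_blend, y_blend, lam=1.0):
    """models: trained predictors exposing .predict(X) (an assumption for this sketch).
    Learn blending weights on held-out data with a ridge-style closed form."""
    Z = np.column_stack([m.predict(X_blend) for m in models])   # each model becomes one feature
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y_blend)

def blend_predict(models, alpha, X):
    Z = np.column_stack([m.predict(X) for m in models])
    return Z @ alpha
```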

(21)

Finale Machine Learning in Practice

NTU KDDCup 2013 Track 1 World Champion Model

Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013, Li et al., KDDCup 2013

linear blending of
• Random Forest with many many many trees
• GBDT variants
with tons of effort in designing features

‘another key’ is to construct features with domain knowledge


(22)

Finale Machine Learning in Practice

ICDM 2006 Top 10 Data Mining Algorithms

1 C4.5: another decision tree

2 k-Means

3 SVM

4 Apriori: for frequent itemset mining

5 EM: ‘alternating optimization’ algorithm for some models

6 PageRank: for link analysis, similar to matrix factorization

7 AdaBoost

8 k Nearest Neighbor

9 Naive Bayes: a simple linear model with ‘weights’ decided by data statistics

10 C&RT

personal view of five missing ML competitors: LinReg, LogReg, Random Forest, GBDT, NNet

(23)

Finale Machine Learning in Practice

Machine Learning Jungle

(word cloud of course keywords) soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin, prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, functional gradient, bagging, decision tree, support vector machine, neural network, kernel

welcome to the jungle!


(24)

Finale Machine Learning in Practice

Fun Time

Which of the following is the official lucky number of this class?

1 9876

2 1234

3 1126

4 6211

Reference Answer: 3

May the luckiness always be with you!



(26)

Finale Machine Learning in Practice

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 16: Finale

Feature Exploitation Techniques: kernel, aggregation, extraction, low-dimensional

Error Optimization Techniques: gradient, equivalence, stages

Overfitting Elimination Techniques: (lots of) regularization, validation

Machine Learning in Practice: welcome to the jungle

next: happy learning!
