(1)

Machine Learning Techniques (機器學習技法)

Lecture 16: Finale

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)


(2)

Finale

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa), jointly optimized with stochastic gradient descent
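To make that recipe concrete, here is a minimal sketch (not the course's reference implementation) of the per-rating SGD update for matrix factorization, assuming ratings are given as (user index, movie index, rating) triples and squared error is used; the function name and defaults are illustrative:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_movies, d=10, eta=0.01, epochs=20, seed=0):
    """Jointly optimize user features W and movie features V with SGD.
    ratings: list of (user_index, movie_index, rating) triples."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_users, d))   # extracted user features
    V = 0.1 * rng.standard_normal((n_movies, d))  # per-movie linear models
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - W[u] @ V[m]                 # residual of current prediction
            # move both factors along the negative gradient of the squared error
            W[u], V[m] = W[u] + eta * err * V[m], V[m] + eta * err * W[u]
    return W, V
```

A rating prediction for user u on movie m is then simply W[u] @ V[m].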

Lecture 16: Finale

Feature Exploitation Techniques

Error Optimization Techniques

Overfitting Elimination Techniques

Machine Learning in Practice

(3)

Finale Feature Exploitation Techniques

Exploiting Numerous Features via Kernel

numerous features within some Φ: embedded in kernel K_Φ with inner product operation

• Polynomial Kernel: ‘scaled’ polynomial transforms
• Gaussian Kernel: infinite-dimensional transforms
• Stump Kernel: decision stumps as transforms
• Sum of Kernels: transform union
• Product of Kernels: transform combination
• Mercer Kernels: transform implicitly

kernel models: SVM, SVR, probabilistic SVM, kernel ridge regression, kernel logistic regression

possibly: Kernel PCA, Kernel k-Means, . . .
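As a small illustration (a sketch only, independent of any particular library), the two workhorse kernels can be written directly as inner-product-style computations, and sums or products of valid kernels remain valid kernels:

```python
import numpy as np

def poly_kernel(x1, x2, zeta=1.0, gamma=1.0, Q=2):
    # 'scaled' polynomial transform: K(x, x') = (zeta + gamma * x.x')^Q
    return (zeta + gamma * np.dot(x1, x2)) ** Q

def gaussian_kernel(x1, x2, gamma=1.0):
    # infinite-dimensional transform: K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# transform union / combination: still valid (Mercer) kernels
def sum_kernel(x1, x2):
    return poly_kernel(x1, x2) + gaussian_kernel(x1, x2)

def product_kernel(x1, x2):
    return poly_kernel(x1, x2) * gaussian_kernel(x1, x2)
```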


(4)

Finale Feature Exploitation Techniques

Exploiting Predictive Features via Aggregation

predictive features within some Φ: φ_t(x) = g_t(x)

• Decision Stump: simplest perceptron; simplest DecTree
• Decision Tree: branching (divide) + leaves (conquer)
• (Gaussian) RBF: prototype (center) + influence

aggregation flavors:
• Uniform: Bagging; Random Forest
• Non-Uniform: AdaBoost; GradientBoost; probabilistic SVM
• Conditional: Decision Tree; Nearest Neighbor

possibly: Infinite Ensemble Learning, Decision Tree SVM, . . .
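To make the φ_t(x) = g_t(x) view concrete, here is a minimal sketch (the stump parameters are made up purely for illustration) of decision stumps acting as predictive features that are aggregated uniformly, non-uniformly, or conditionally:

```python
import numpy as np

def stump(s, i, theta):
    """Decision stump g(x) = s * sign(x[i] - theta): the simplest perceptron."""
    return lambda x: s * np.sign(x[i] - theta)

# a few stumps serving as predictive features phi_t(x) = g_t(x)
stumps = [stump(+1, 0, 0.3), stump(-1, 1, -0.1), stump(+1, 1, 0.5)]

def uniform_vote(x):                           # Bagging / Random Forest flavor
    return np.sign(sum(g(x) for g in stumps))

def weighted_vote(x, alpha=(0.8, 0.5, 1.2)):   # AdaBoost / GradientBoost flavor
    return np.sign(sum(a * g(x) for a, g in zip(alpha, stumps)))

def conditional_vote(x):                       # decision-tree flavor: pick one g(x) by a condition
    return stumps[0](x) if x[0] > 0 else stumps[1](x)

print(uniform_vote(np.array([0.4, 0.2])), weighted_vote(np.array([0.4, 0.2])))
```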

(5)

Finale Feature Exploitation Techniques

Exploiting Hidden Features via Extraction

hidden features within some Φ: as hidden variables to be ‘jointly’ optimized with the usual weights
—possibly with the help of unsupervised learning

• Neural Network; Deep Learning: neuron weights
• AdaBoost; GradientBoost: g_t parameters
• RBF Network: RBF centers
• k-Means: cluster centers
• Matrix Factorization: user/movie factors
• Autoencoder; PCA: ‘basis’ directions

possibly: GradientBoosted Neurons, NNet on Factorized Features, . . .
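One way to see the ‘hidden variables plus usual weights’ pattern in code: the sketch below extracts RBF centers with k-means (the unsupervised help) and then fits the output weights by least squares, which is the stage-wise shortcut rather than a fully joint optimization. It assumes scikit-learn's KMeans is available; the function names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_network_fit(X, y, k=5, gamma=1.0, seed=0):
    """Stage 1: extract hidden features (RBF centers) without labels.
    Stage 2: fit the usual linear weights on top of the extracted features."""
    centers = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).cluster_centers_
    # Gaussian RBF features: Phi[n, m] = exp(-gamma * ||x_n - mu_m||^2)
    Phi = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def rbf_network_predict(X, centers, w, gamma=1.0):
    Phi = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    return Phi @ w
```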


(6)

Finale Feature Exploitation Techniques

Exploiting Low-Dim. Features via Compression

low-dimensional features within some Φ: compressed from the original features

• Decision Stump; DecTree Branching: ‘best’ naïve projection to R
• Random Forest Tree Branching: ‘random’ low-dim. projection
• Autoencoder; PCA: info.-preserving compression
• Matrix Factorization: projection from abstract to concrete
• Feature Selection: ‘most-helpful’ low-dimensional projection

possibly: other ‘dimension reduction’ models
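A minimal sketch of info.-preserving compression with PCA: center the data, take the top directions from an eigen-decomposition (the equivalent eigenproblem discussed under error optimization below), and project:

```python
import numpy as np

def pca_compress(X, d_tilde):
    """Keep the top d_tilde 'basis' directions of the centered data."""
    X_bar = X - X.mean(axis=0)                            # center the data
    eigvals, eigvecs = np.linalg.eigh(X_bar.T @ X_bar)    # eigh: ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d_tilde]]   # top-d_tilde directions
    return X_bar @ W, W                                   # compressed features + projection

# the compressed Z can feed any downstream model
X = np.random.default_rng(0).standard_normal((100, 10))
Z, W = pca_compress(X, d_tilde=3)                         # Z has shape (100, 3)
```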

(7)

Finale Feature Exploitation Techniques

Fun Time

Consider running AdaBoost-Stump on a PCA-preprocessed data set. Then, in terms of the original features x, what does the final hypothesis G(x) look like?

1 a neural network with tanh(·) in the hidden neurons

2 a neural network with sign(·) in the hidden neurons

3 a decision tree

4 a random forest

Reference Answer: 2

PCA results in a linear transformation of x. Then, applying a decision stump on the transformed data is like applying a perceptron on the original data, so the resulting G is simply a linear aggregation of perceptrons.



(9)

Finale Error Optimization Techniques

Numerical Optimization via Gradient Descent

when ∇E is ‘approximately’ defined, use it for a 1st-order approximation:
new variables = old variables − η∇E

• SGD/Minibatch/GD: (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe)
• Steepest Descent: AdaBoost; GradientBoost
• Functional GD: AdaBoost; GradientBoost

possibly: 2nd-order techniques, GD under constraints, . . .
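The shared update rule in a minimal sketch, using the logistic-regression gradient as the example of ∇E (plain GD shown here; the SGD/minibatch variants simply estimate the gradient on one example or a small batch):

```python
import numpy as np

def logistic_gradient(w, X, y):
    """Gradient of E_in for logistic regression: average of theta(-y_n w.x_n) * (-y_n x_n)."""
    return np.mean((-y * X.T) * (1.0 / (1.0 + np.exp(y * (X @ w)))), axis=1)

def gradient_descent(X, y, eta=0.1, T=1000):
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = w - eta * logistic_gradient(w, X, y)   # new variables = old variables - eta * gradient
    return w
```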


(10)

Finale Error Optimization Techniques

Indirect Optimization via Equivalent Solution

when the original problem is difficult to solve, seek an equivalent solution

• Dual SVM: equivalence via convex QP
• Kernel LogReg / Kernel RidgeReg: equivalence via the representer theorem
• PCA: equivalence to an eigenproblem

some other boosting models and modern solvers of kernel models rely heavily on this technique
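As one small example of the route (a sketch following the representer theorem, not a production solver): the kernel ridge regression optimum can be found by solving a single linear system for β instead of optimizing w in the possibly infinite-dimensional transformed space:

```python
import numpy as np

def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
    """Optimal w = sum_n beta_n * z_n, where beta = (lambda * I + K)^{-1} y."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)                                  # Gaussian kernel matrix
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)      # beta

def kernel_ridge_predict(X_train, beta, x, gamma=1.0):
    k = np.exp(-gamma * ((X_train - x) ** 2).sum(axis=1))    # kernel values against training points
    return k @ beta
```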

(11)

Finale Error Optimization Techniques

Complicated Optimization via Multiple Steps

when the original problem is difficult to solve, seek ‘easier’ sub-problems

• Multi-Stage: probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training
• Alternating Optim.: k-Means; alternating LeastSqr; (steepest descent)
• Divide & Conquer: decision tree

useful for complicated models
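A minimal sketch of the alternating-optimization pattern with k-means: with the centers fixed, optimally reassign each point to its nearest center; with the assignments fixed, optimally recompute each center as the mean of its points; repeat until nothing changes:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]               # initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # sub-problem 1: best assignments with centers fixed
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # sub-problem 2: best centers with assignments fixed
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                            # converged
            break
        mu = new_mu
    return mu, labels
```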


(12)

Finale Error Optimization Techniques

Fun Time

When running the DeepNet algorithm introduced in Lecture 13 on a PCA-preprocessed data set, which optimization technique is used?

1 variants of gradient descent

2 locating equivalent solutions

3 multi-stage optimization

4 all of the above

Reference Answer: 4

minibatch GD for training; the equivalent eigenproblem solution for PCA; multi-stage optimization for pre-training



(14)

Finale Overfitting Elimination Techniques

Overfitting Elimination via Regularization

when model too ‘powerful’: add brakes somewhere

• large-margin: SVM; AdaBoost (indirectly)
• denoising: autoencoder
• pruning: decision tree
• L2: SVR; kernel models; NNet [weight-decay]
• weight-elimination: NNet
• early stopping: NNet (any GD-like)
• voting/averaging: uniform blending; Bagging; Random Forest
• constraining: autoencoder [weights]; RBF Network [# centers]; . . .

arguably the most important techniques
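A minimal sketch of the most common ‘brake’, L2 regularization, in its two familiar guises: the closed-form ridge solution and the weight-decay form of a gradient-descent step (assuming the augmented error E_in(w) + (λ/N)·wᵀw; the helper names are illustrative only):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """L2-regularized linear regression: w = (X^T X + lambda * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def weight_decay_step(w, grad, eta=0.1, lam=0.01, N=100):
    """One GD step with weight decay: shrink w slightly before the usual update."""
    return (1 - 2 * eta * lam / N) * w - eta * grad
```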

(15)

Finale Overfitting Elimination Techniques

Overfitting Elimination via Validation

when model too ‘powerful’: check performance carefully and honestly

• # SV: SVM/SVR
• OOB: Random Forest
• Internal Validation: blending; DecTree pruning

simple but necessary
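A minimal sketch of checking performance carefully and honestly with one held-out validation set to pick a regularization parameter; `train` and `error` are hypothetical callables standing in for any learning algorithm and error measure:

```python
import numpy as np

def select_lambda(X, y, lambdas, train, error, val_ratio=0.25, seed=0):
    """Split once, train each candidate on D_train, compare E_val, then retrain on all data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_ratio)
    val, tr = idx[:n_val], idx[n_val:]
    best = min(lambdas, key=lambda lam: error(train(X[tr], y[tr], lam), X[val], y[val]))
    return train(X, y, best), best
```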


(16)

Finale Overfitting Elimination Techniques

Fun Time

What is the major technique for eliminating overfitting in Random Forest?

1 voting/averaging

2 pruning

3 early stopping

4 weight-elimination

Reference Answer: 1

Random Forest, based on uniform blending, relies on voting/averaging for regularization.



(18)

Finale Machine Learning in Practice

NTU KDDCup 2010 World Champion Model

Feature engineering and classifier ensemble for KDD Cup 2010, Yu et al., KDDCup 2010

linear blending of
• Logistic Regression + many rawly encoded features
• Random Forest + human-designed features

yes, you’ve learned everything! :-)

(19)

Finale Machine Learning in Practice

NTU KDDCup 2011 Track 1 World Champion Model

A linear ensemble of individual and blended models for music rating prediction, Chen et al., KDDCup 2011

NNet, DecTree-like, and then linear blending of
• Matrix Factorization variants, including probabilistic PCA
• Restricted Boltzmann Machines: an ‘extended’ autoencoder
• k Nearest Neighbors
• Probabilistic Latent Semantic Analysis: an extraction model that has ‘soft clusters’ as hidden variables
• linear regression, NNet, & GBDT

yes, you can ‘easily’ understand everything! :-)


(20)

Finale Machine Learning in Practice

NTU KDDCup 2012 Track 2 World Champion Model

A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012, Wu et al., KDDCup 2012

NNet, GBDT-like, and then linear blending of
• Linear Regression variants, including linear SVR
• Logistic Regression variants
• Matrix Factorization variants
• . . .

‘key’ is to blend properly without overfitting
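A minimal sketch of that key step: collect each already-trained model's predictions on a held-out blending set (never the data the models were trained on) and learn a regularized linear combination there; the `.predict` interface below is an assumption for illustration:

```python
import numpy as np

def linear_blend(models, X_blend, y_blend, lam=1.0):
    """models: trained predictors exposing .predict(X) (an assumption for this sketch).
    Learn blending weights on held-out data with a ridge-style closed form."""
    Z = np.column_stack([m.predict(X_blend) for m in models])   # each model becomes one feature
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y_blend)

def blend_predict(models, alpha, X):
    Z = np.column_stack([m.predict(X) for m in models])
    return Z @ alpha
```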

(21)

Finale Machine Learning in Practice

NTU KDDCup 2013 Track 1 World Champion Model

Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013, Li et al., KDDCup 2013

linear blending of
• Random Forest with many many many trees
• GBDT variants
with tons of effort in designing features

‘another key’ is to construct features with domain knowledge


(22)

Finale Machine Learning in Practice

ICDM 2006 Top 10 Data Mining Algorithms

1 C4.5: another decision tree

2 k-Means

3 SVM

4 Apriori: for frequent itemset mining

5 EM: ‘alternating optimization’ algorithm for some models

6 PageRank: for link analysis, similar to matrix factorization

7 AdaBoost

8 k Nearest Neighbor

9 Naive Bayes: a simple linear model with ‘weights’ decided by data statistics

10 C&RT

personal view of five missing ML competitors: LinReg, LogReg, Random Forest, GBDT, NNet

(23)

Finale Machine Learning in Practice

Machine Learning Jungle

(word cloud of course keywords) soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin, prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, functional gradient, bagging, decision tree, support vector machine, neural network, kernel

welcome to the jungle!


(24)

Finale Machine Learning in Practice

Fun Time

Which of the following is the official lucky number of this class?

1 9876

2 1234

3 1126

4 6211

Reference Answer: 3

May the luckiness always be with you!



(26)

Finale Machine Learning in Practice

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 16: Finale

Feature Exploitation Techniques: kernel, aggregation, extraction, low-dimensional

Error Optimization Techniques: gradient, equivalence, stages

Overfitting Elimination Techniques: (lots of) regularization, validation

Machine Learning in Practice: welcome to the jungle

next: happy learning!
