## Machine Learning Techniques (機器學習技法)

### Lecture 16: Finale

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)


## Roadmap

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

### Lecture 15: Matrix Factorization

**linear models of movies** on **extracted user features** (or vice versa), jointly optimized with **stochastic gradient descent**

### Lecture 16: Finale

1. Feature Exploitation Techniques
2. Error Optimization Techniques
3. Overfitting Elimination Techniques
4. Machine Learning in Practice


## Exploiting Numerous Features via Kernel

numerous features within some Φ: embedded in a **kernel** K_Φ with the inner-product operation

| Kernel | Transform |
| --- | --- |
| Polynomial Kernel | 'scaled' polynomial transforms |
| Sum of Kernels | transform union |
| Gaussian Kernel | infinite-dimensional transforms |
| Product of Kernels | transform combination |
| Stump Kernel | decision stumps as transforms |
| Mercer Kernels | transforms, implicitly |

coupled with: **kernel ridge regression, kernel logistic regression, SVM, SVR, probabilistic SVM**

possibly **Kernel PCA, Kernel k-Means, . . .**
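To make the combinations concrete, here is a minimal NumPy sketch of the sum and product of a polynomial and a Gaussian kernel; the γ, degree, and data are arbitrary illustrative choices:

```python
import numpy as np

def polynomial_kernel(X1, X2, gamma=1.0, coef0=1.0, degree=2):
    # 'scaled' polynomial transform: K(x, x') = (gamma <x, x'> + coef0)^degree
    return (gamma * X1 @ X2.T + coef0) ** degree

def gaussian_kernel(X1, X2, gamma=1.0):
    # infinite-dimensional transform: K(x, x') = exp(-gamma ||x - x'||^2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.random.randn(5, 3)
K_sum = polynomial_kernel(X, X) + gaussian_kernel(X, X)   # transform union
K_prod = polynomial_kernel(X, X) * gaussian_kernel(X, X)  # transform combination
```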


## Exploiting Predictive Features via Aggregation

predictive features within some Φ: φ_t(x) = g_t(x)

| Hypothesis | Structure |
| --- | --- |
| Decision Stump | simplest perceptron; simplest DecTree |
| Decision Tree | branching (divide) + leaves (conquer) |
| (Gaussian) RBF | prototype (center) + influence |

| Aggregation | Models |
| --- | --- |
| Uniform | Bagging; Random Forest |
| Non-Uniform | AdaBoost; GradientBoost |
| Conditional | probabilistic SVM; Decision Tree; Nearest Neighbor |

possibly **Infinite Ensemble Learning, Decision Tree SVM, . . .**
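For instance, uniform versus non-uniform aggregation of decision stumps in a minimal NumPy sketch; the stumps and the AdaBoost-style weights α_t below are hypothetical:

```python
import numpy as np

def stump(i, theta, s):
    # decision stump g(x) = s * sign(x_i - theta): simplest perceptron/DecTree
    return lambda X: s * np.sign(X[:, i] - theta)

# three hypothetical stumps playing the role of features phi_t(x) = g_t(x)
stumps = [stump(0, 0.0, +1), stump(1, 0.5, -1), stump(0, -1.0, +1)]
alphas = np.array([0.9, 0.5, 0.2])          # e.g. AdaBoost's alpha_t

X = np.random.randn(4, 2)
votes = np.stack([g(X) for g in stumps])    # one row of votes per stump
G_uniform = np.sign(votes.sum(axis=0))      # uniform: Bagging-style voting
G_weighted = np.sign(alphas @ votes)        # non-uniform: AdaBoost-style
```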


## Exploiting Hidden Features via Extraction

hidden features within some Φ: treated as **hidden variables** to be 'jointly' optimized with the usual **weights**, possibly with the help of **unsupervised learning**

| Model | Hidden Features |
| --- | --- |
| Neural Network; Deep Learning | neuron weights |
| AdaBoost; GradientBoost | g_t parameters |
| RBF Network | RBF centers |
| k-Means | cluster centers |
| Matrix Factorization | user/movie factors |
| Autoencoder; PCA | 'basis' directions |

possibly **GradientBoosted Neurons, NNet on Factorized Features, . . .**
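A miniature of the Lecture 15 recipe: a NumPy sketch of SGD jointly updating user and movie factors on toy ratings; the data, rank, and learning rate are arbitrary:

```python
import numpy as np

# hypothetical (user, movie, rating) triples and a rank-2 factorization
ratings = [(0, 1, 5.0), (1, 0, 3.0), (2, 1, 4.0), (0, 2, 1.0)]
n_users, n_movies, d = 3, 3, 2

rng = np.random.default_rng(1126)
V = rng.normal(scale=0.1, size=(n_users, d))   # user factors (extracted features)
W = rng.normal(scale=0.1, size=(n_movies, d))  # movie factors (linear models)

eta = 0.05
for epoch in range(200):
    for u, m, r in ratings:                    # SGD: one rating at a time
        err = r - V[u] @ W[m]                  # residual of the current prediction
        # jointly update both factors along the per-example gradient
        V[u], W[m] = V[u] + eta * err * W[m], W[m] + eta * err * V[u]
```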


## Exploiting Low-Dim. Features via Compression

low-dimensional features within some Φ: **compressed** from the original features

| Model | Compression |
| --- | --- |
| Decision Stump; DecTree Branching | 'best' naïve projection to ℝ |
| Random Forest Tree Branching | 'random' low-dim. projection |
| Autoencoder; PCA | info.-preserving compression |
| Matrix Factorization | projection from abstract to concrete |
| Feature Selection | 'most-helpful' low-dim. projection |

possibly **other 'dimension reduction' models**
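As one concrete case, PCA's info.-preserving compression is just a projection of centered data onto the top-variance eigenvectors; a minimal NumPy sketch keeping 3 of 10 made-up dimensions:

```python
import numpy as np

X = np.random.randn(100, 10)            # made-up raw features
Xc = X - X.mean(axis=0)                 # center first

# eigenvectors of the covariance matrix (np.linalg.eigh sorts ascending)
cov = Xc.T @ Xc / len(Xc)
_, eigvecs = np.linalg.eigh(cov)
W = eigvecs[:, -3:]                     # 3 top-variance 'basis' directions
Z = Xc @ W                              # compressed low-dimensional features
```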


## Fun Time

Consider running AdaBoost-Stump on a PCA-preprocessed data set. Then, in terms of the original features **x**, what does the final hypothesis G(x) look like?

1. a neural network with tanh(·) in the hidden neurons
2. a neural network with sign(·) in the hidden neurons
3. a decision tree
4. a random forest

**Reference Answer: 2**

PCA results in a linear transformation of **x**. Then, when applying a decision stump on the transformed data, it is as if a perceptron is applied on the original data. So the resulting G is simply a linear aggregation of perceptrons.


## Numerical Optimization via Gradient Descent

when ∇E is (approximately) defined, use it for a **1st-order approximation**:

new variables = old variables − η∇E

| Technique | Models |
| --- | --- |
| SGD/Minibatch/GD | (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe) |
| Steepest Descent | AdaBoost; GradientBoost |
| Functional GD | AdaBoost; GradientBoost |

possibly **2nd-order techniques, GD under constraints, . . .**
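The update rule above, instantiated as SGD for logistic regression in a minimal NumPy sketch (y ∈ {−1, +1}; the step size and number of steps are arbitrary):

```python
import numpy as np

def sgd_logreg(X, y, eta=0.1, steps=1000, seed=0):
    # minimize the cross-entropy error, one randomly picked example per step
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(X))                            # pick one example
        grad = -y[n] * X[n] / (1 + np.exp(y[n] * (w @ X[n])))
        w = w - eta * grad                  # new w = old w - eta * gradient
    return w
```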


## Indirect Optimization via Equivalent Solution

when the original problem is difficult to solve, seek an **equivalent solution**

| Model | Equivalence |
| --- | --- |
| Dual SVM | equivalence via convex QP |
| Kernel LogReg; Kernel RidgeReg | equivalence via representer theorem |
| PCA | equivalence to an eigenproblem |

some **other boosting models** and **modern solvers of kernel models** rely heavily on such a technique
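As one concrete equivalence, the representer theorem reduces kernel ridge regression to the linear system β = (λI + K)⁻¹y; a minimal NumPy sketch with a made-up Gaussian kernel matrix:

```python
import numpy as np

def kernel_ridge(K, y, lam=1.0):
    # representer theorem: the optimal w is a combination of the training
    # examples' transforms, so solving for the coefficients beta suffices
    return np.linalg.solve(lam * np.eye(len(K)) + K, y)

# usage with a Gaussian kernel matrix on 20 made-up training points
X = np.random.randn(20, 3)
y = np.random.randn(20)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
beta = kernel_ridge(K, y)
y_hat = K @ beta  # fitted values on the training points
```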


## Complicated Optimization via Multiple Steps

when the original problem is difficult to solve, seek **'easier' sub-problems**

| Technique | Models |
| --- | --- |
| Multi-Stage | probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training |
| Alternating Optim. | k-Means; alternating LeastSqr; (steepest descent) |
| Divide & Conquer | decision tree |

useful for **complicated models**
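Alternating optimization in its purest form is k-Means: fix the prototypes to optimize the assignments, then fix the assignments to optimize the prototypes. A minimal NumPy sketch (initialization and iteration count arbitrary; empty clusters are not guarded against):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes
    for _ in range(iters):
        # sub-problem 1: prototypes fixed, choose the closest cluster per point
        labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # sub-problem 2: clusters fixed, the optimal prototype is the mean
        mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return mu, labels
```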


## Fun Time

When running the DeepNet algorithm introduced in Lecture 13 on a PCA-preprocessed data set, which optimization technique is used?

1. variants of gradient descent
2. locating equivalent solutions
3. multi-stage optimization
4. all of the above

**Reference Answer: 4**

minibatch GD for training; the equivalent eigenproblem solution for PCA; multi-stage optimization for pre-training


## Overfitting Elimination via Regularization

when the model is too 'powerful': add **brakes** somewhere

| Regularizer | Models |
| --- | --- |
| large-margin | SVM; AdaBoost (indirectly) |
| denoising | autoencoder |
| pruning | decision tree |
| L2 | SVR; kernel models; NNet [weight-decay] |
| weight-elimination | NNet |
| early stopping | NNet (any GD-like) |
| voting/averaging | uniform blending; Bagging; Random Forest |
| constraining | autoencoder [weights]; RBF Network [# centers] |

arguably the **most important techniques**
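The L2 brake in closed form: regularized least squares gives w_reg = (λI + XᵀX)⁻¹Xᵀy (up to the λ/N scaling convention); a minimal NumPy sketch:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    # L2-regularized least squares: larger lam = stronger brake = smaller weights
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

w = ridge(np.random.randn(50, 5), np.random.randn(50), lam=10.0)
```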


## Overfitting Elimination via Validation

when the model is too 'powerful': check **performance carefully and honestly**

| Validation | Models |
| --- | --- |
| # SV | SVM/SVR |
| OOB | Random Forest |
| Internal Validation | blending; DecTree pruning |

simple but **necessary**
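For example, OOB gives Random Forest built-in 'self-validation'; a short sketch assuming scikit-learn is available, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# oob_score=True lets each tree's out-of-bag examples act as a free
# validation set, with no data held out explicitly
X, y = make_classification(n_samples=500, n_features=10, random_state=1126)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=1126)
forest.fit(X, y)
print("OOB estimate of accuracy:", forest.oob_score_)
```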


## Fun Time

What is the major technique for eliminating overfitting in Random Forest?

1. voting/averaging
2. pruning
3. early stopping
4. weight-elimination

**Reference Answer: 1**

Random Forest, based on uniform blending, relies on voting/averaging for regularization.


## NTU KDDCup 2010 World Champion Model

*Feature engineering and classifier ensemble for KDD Cup 2010*, Yu et al., KDDCup 2010

linear blending of

- **Logistic Regression** + many **rawly encoded features**
- **Random Forest** + **human-designed features**

**yes, you've learned everything! :-)**


## NTU KDDCup 2011 Track 1 World Champion Model

*A linear ensemble of individual and blended models for music rating prediction*, Chen et al., KDDCup 2011

NNet, DecTree-like, and then linear blending of

- **Matrix Factorization** variants, including probabilistic **PCA**
- **Restricted Boltzmann Machines**: an 'extended' autoencoder
- **k Nearest Neighbors**
- **Probabilistic Latent Semantic Analysis**: an extraction model with 'soft clusters' as hidden variables
- linear regression, NNet, & GBDT

**yes, you can 'easily' understand everything! :-)**


## NTU KDDCup 2012 Track 2 World Champion Model

*A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012*, Wu et al., KDDCup 2012

NNet, GBDT-like, and then linear blending of

- **Linear Regression** variants, including linear SVR
- **Logistic Regression** variants
- **Matrix Factorization** variants
- . . .

the 'key' is to **blend properly without overfitting**
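A minimal sketch of that key: fit the blending weights by linear regression on held-out predictions, never on the data the individual models were trained on; all models and data below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for held-out predictions of three already-trained models
y_val = rng.normal(size=200)
P_val = np.column_stack(
    [y_val + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.8)]
)

# linear blending = linear regression with model predictions as features,
# fit on held-out data so the blending weights do not overfit
alpha, *_ = np.linalg.lstsq(P_val, y_val, rcond=None)

def blend(P):
    """Blend an (N, 3) matrix of the three models' predictions."""
    return P @ alpha
```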


## NTU KDDCup 2013 Track 1 World Champion Model

*Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013*, Li et al., KDDCup 2013

linear blending of

- **Random Forest** with many, many, many trees
- **GBDT** variants

with tons of effort put into designing features

'another key' is to **construct features with domain knowledge**


## ICDM 2006 Top 10 Data Mining Algorithms

1. C4.5: another **decision tree**
2. k-Means
3. SVM
4. Apriori: for frequent itemset mining
5. EM: an **'alternating optimization'** algorithm for some models
6. PageRank: for link analysis, similar to **matrix factorization**
7. AdaBoost
8. k Nearest Neighbor
9. Naive Bayes: a simple **linear model** with 'weights' decided by data statistics
10. C&RT

a personal view of five missing ML competitors: **LinReg, LogReg, Random Forest, GBDT, NNet**


## Machine Learning Jungle

soft-margin, k-means, OOB error, **RBF network**, probabilistic SVM, GBDT, **PCA**, random forest, matrix factorization, **Gaussian kernel**, kernel LogReg, large-margin, prototype, **quadratic programming**, **SVR**, **dual**, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, **functional gradient**, bagging, decision tree, support vector machine, **neural network**, **kernel**

welcome to the **jungle!**


## Fun Time

Which of the following is the official lucky number of this class?

1. 9876
2. 1234
3. 1126
4. 6211

**Reference Answer: 3**

May the luckiness always be with you!
