# Machine Learning Techniques (機器學習技法)

### Lecture 15: Matrix Factorization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

3 Distilling Implicit Features: Extraction Models

### Lecture 15: Matrix Factorization

- Linear Network Hypothesis
- Basic Matrix Factorization

Matrix Factorization Linear Network Hypothesis

## Recommender System Revisited

- competition held by Netflix in 2006
- data D_m for the m-th movie:

  {(x̃_n = (n), y_n = r_nm) : user n rated movie m}

  —abstract feature x̃_n = (n)

how to learn our preferences from data?

## Binary Vector Encoding of Categorical Feature

x̃_n = (n): user IDs, such as 1126, 5566, 6211, . . .
—called categorical features

- many ML models operate on numerical features
  —except for decision trees

need: encoding (transform) from categorical to numerical

binary vector encoding, e.g. for four blood types:

A = [1 0 0 0]^T, B = [0 1 0 0]^T, AB = [0 0 1 0]^T, O = [0 0 0 1]^T
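The encoding step can be sketched in a few lines of NumPy (a minimal sketch; `binary_vector_encode` is a hypothetical helper name, and the four-category example mirrors the [0 0 0 1]^T vector above):

```python
import numpy as np

def binary_vector_encode(index, num_categories):
    """Encode a 1-based categorical index as a one-hot (binary) vector."""
    v = np.zeros(num_categories)
    v[index - 1] = 1.0
    return v

# e.g. the 4th of 4 categories -> [0 0 0 1]^T
print(binary_vector_encode(4, 4))  # [0. 0. 0. 1.]
```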


## Feature Extraction from Encoded Vector

data: {(x_n = BinaryVectorEncoding(n), y_n) : user n}, or,
{(x_n = BinaryVectorEncoding(n), y_n = [r_n1 ? ? r_n4 . . . r_nM]^T)}
—with ? marking the movies user n has not rated

idea: try feature extraction using an N-d̃-M NNet

(figure: an N-d̃-M network with tanh units in the hidden and output layers)

is tanh necessary? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/22


## Linear Network Hypothesis

(figure: N-d̃-M linear network; first-layer weights w_ni^(1), second-layer weights w_im^(2))

rename: V^T = [w_ni^(1)] and W = [w_im^(2)]

hypothesis: h(x) = W^T V x

per-user output: h(x_n) = W^T v_n, where v_n is the n-th column of V

linear network for recommender system: learn V and W
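As a sanity check of the hypothesis above, a small NumPy sketch (all names and dimensions are illustrative) confirms that feeding the one-hot encoding of user n through h(x) = W^T V x picks out W^T v_n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d_tilde = 5, 3, 2                  # users, movies, latent features (illustrative)

V = rng.standard_normal((d_tilde, N))    # columns are user vectors v_n
W = rng.standard_normal((d_tilde, M))    # columns are movie vectors w_m

def h(x):
    """Linear network hypothesis h(x) = W^T V x (no tanh needed for one-hot x)."""
    return W.T @ (V @ x)

n = 2
x_n = np.zeros(N)
x_n[n] = 1.0                             # binary vector encoding of user n
v_n = V[:, n]                            # n-th column of V

# per-user output: h(x_n) = W^T v_n, one predicted rating per movie
assert np.allclose(h(x_n), W.T @ v_n)
```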


## Fun Time

For N users, M movies, and d̃ 'features', how many variables need to be used to specify a hypothesis?

1. N + M + d̃
2. N · M · d̃
3. (N + M) · d̃
4. (N · M) + d̃

Answer: 3. Simply N · d̃ for V and M · d̃ for W.


Matrix Factorization Basic Matrix Factorization

## Linear Network: Linear Model Per Movie

linear network: h(x) = W^T (V x), with Φ(x) = V x

—for the m-th movie, just a linear model h_m(x) = w_m^T Φ(x)
subject to the shared transform Φ

for every D_m, want r_nm = y_n ≈ w_m^T v_n

E_in over all D_m with squared error measure:

E_in({w_m}, {v_n}) = (1 / Σ_m |D_m|) Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

—learn both V (the transform) and the w_m from all D_m
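The error measure above can be computed directly; a minimal NumPy sketch with a hypothetical toy rating set:

```python
import numpy as np

# hypothetical toy data: known ratings as (user n, movie m, r_nm) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 0, 1.0)]

rng = np.random.default_rng(1)
V = rng.standard_normal((2, 3))          # user factors v_n as columns (d_tilde=2, N=3)
W = rng.standard_normal((2, 2))          # movie factors w_m as columns (M=2)

def E_in(W, V, ratings):
    """Average squared error over the known ratings: sum of (r_nm - w_m^T v_n)^2."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

print(E_in(W, V, ratings))               # a nonnegative number; smaller is better
```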


## Matrix Factorization

r_nm ≈ w_m^T v_n ⇐⇒ R ≈ V^T W

(figure: rating matrix R with rows for users 1 . . . N and columns for movies 1 . . . M, factored into V^T with rows v_1^T, v_2^T, . . . , v_N^T and W with columns w_1, w_2, . . . , w_M)

match movie and viewer factors: predicted rating matches movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?)

learning: known ratings → learned v_n and w_m → unknown rating prediction

similar modeling can be used for other abstract features
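The factorization view R ≈ V^T W means every predicted rating comes from one matrix product; a short NumPy sketch (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, d_tilde = 4, 3, 2                  # illustrative sizes
V = rng.standard_normal((d_tilde, N))    # user factors, one column per user
W = rng.standard_normal((d_tilde, M))    # movie factors, one column per movie

R_hat = V.T @ W                          # R ≈ V^T W: the full N-by-M table of predictions
# entry (n, m) equals the per-pair prediction v_n^T w_m
assert np.allclose(R_hat[1, 2], V[:, 1] @ W[:, 2])
```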


## Matrix Factorization Learning

min_{W,V} E_in({w_m}, {v_n}) ∝ Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

two sets of variables: can consider alternating minimization

- when v_n fixed, minimizing w_m ≡ minimizing E_in within D_m
  —simply per-movie (per-D_m) linear regression
- when w_m fixed, minimizing v_n?
  —per-user linear regression, by symmetry between users/movies

called alternating least squares algorithm


## Alternating Least Squares

1. initialize d̃-dimension vectors {w_m}, {v_n}
2. alternating optimization of E_in: repeatedly
   1. optimize w_1, w_2, . . . , w_M: update w_m by m-th-movie linear regression on {(v_n, r_nm)}
   2. optimize v_1, v_2, . . . , v_N: update v_n by n-th-user linear regression on {(w_m, r_nm)}

   until converge

convergence guaranteed as E_in decreases during alternating minimization

alternating least squares: the 'tango' dance between users/movies
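The alternating steps can be sketched as follows (a minimal NumPy implementation under simplifying assumptions: no regularization, no bias terms, and `np.linalg.lstsq` standing in for each per-movie/per-user linear regression):

```python
import numpy as np

def als(ratings, N, M, d_tilde=2, n_iters=20, seed=0):
    """Alternating least squares for matrix factorization (minimal sketch).

    ratings: list of (user n, movie m, r_nm) triples for the known ratings.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d_tilde, N))   # user factors (columns v_n)
    W = rng.standard_normal((d_tilde, M))   # movie factors (columns w_m)
    by_movie = [[(n, r) for n, m2, r in ratings if m2 == m] for m in range(M)]
    by_user = [[(m, r) for n2, m, r in ratings if n2 == n] for n in range(N)]
    for _ in range(n_iters):
        # fix {v_n}: per-movie linear regression on {(v_n, r_nm)}
        for m, rows in enumerate(by_movie):
            if rows:
                A = np.array([V[:, n] for n, _ in rows])
                y = np.array([r for _, r in rows])
                W[:, m] = np.linalg.lstsq(A, y, rcond=None)[0]
        # fix {w_m}: per-user linear regression on {(w_m, r_nm)}
        for n, rows in enumerate(by_user):
            if rows:
                A = np.array([W[:, m] for m, _ in rows])
                y = np.array([r for _, r in rows])
                V[:, n] = np.linalg.lstsq(A, y, rcond=None)[0]
    return V, W

def E_in(W, V, ratings):
    """Average squared error over the known ratings."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

# hypothetical toy data: (user n, movie m, r_nm)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
V, W = als(ratings, N=3, M=3)
r_hat_01 = W[:, 1] @ V[:, 0]             # predicted rating of user 0 on movie 1
```

Because each least-squares step minimizes E_in exactly with the other side fixed, E_in never increases across iterations, which is the stated convergence guarantee.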


## Linear Autoencoder versus Matrix Factorization

|  | Linear Autoencoder: X ≈ W W^T X | Matrix Factorization: R ≈ V^T W |
| --- | --- | --- |
| motivation | special d-d̃-d linear NNet | N-d̃-M linear NNet |
| error measure | squared on all x_ni | squared on known r_nm |
| solution | global optimal at eigenvectors of X^T X | local optimal via alternating least squares |
| usefulness | extract dimension-reduced features | extract hidden user/movie features |

linear autoencoder ≡ special matrix factorization of complete X
