# Machine Learning Techniques (機器學習技法)

### Lecture 15: Matrix Factorization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

3 Distilling Implicit Features: Extraction Models

### Lecture 15: Matrix Factorization

- Linear Network Hypothesis
- Basic Matrix Factorization

Matrix Factorization Linear Network Hypothesis

## Recommender System Revisited

- competition held by Netflix in 2006
- data D_m for the m-th movie:

  {(x̃_n = (n), y_n = r_nm) : user n rated movie m}

  —abstract feature x̃_n = (n)

how to learn our preferences from data?

## Binary Vector Encoding of Categorical Feature

x̃_n = (n): user IDs, such as 1126, 5566, 6211, . . .
—called categorical features

- many ML models operate on numerical features
  —except for decision trees

need: encoding (transform) from categorical to numerical

binary vector encoding, e.g. for four blood types:

A = [1 0 0 0]^T, B = [0 1 0 0]^T, AB = [0 0 1 0]^T, O = [0 0 0 1]^T
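The encoding step can be sketched in a few lines of NumPy (a minimal sketch; `binary_vector_encode` is a hypothetical helper name, and the four-category example mirrors the [0 0 0 1]^T vector above):

```python
import numpy as np

def binary_vector_encode(index, num_categories):
    """Encode a 1-based categorical index as a one-hot (binary) vector."""
    v = np.zeros(num_categories)
    v[index - 1] = 1.0
    return v

# e.g. the 4th of 4 categories -> [0 0 0 1]^T
print(binary_vector_encode(4, 4))  # [0. 0. 0. 1.]
```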


## Feature Extraction from Encoded Vector

data: {(x_n = BinaryVectorEncoding(n), y_n) : user n}, or,
{(x_n = BinaryVectorEncoding(n), y_n = [r_n1 ? ? r_n4 . . . r_nM]^T)}
—with ? marking the movies user n has not rated

idea: try feature extraction using an N-d̃-M NNet

(figure: an N-d̃-M network with tanh units in the hidden and output layers)

is tanh necessary? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/22


## Linear Network Hypothesis

(figure: N-d̃-M linear network; first-layer weights w_ni^(1), second-layer weights w_im^(2))

rename: V^T = [w_ni^(1)] and W = [w_im^(2)]

hypothesis: h(x) = W^T V x

per-user output: h(x_n) = W^T v_n, where v_n is the n-th column of V

linear network for recommender system: learn V and W
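As a sanity check of the hypothesis above, a small NumPy sketch (all names and dimensions are illustrative) confirms that feeding the one-hot encoding of user n through h(x) = W^T V x picks out W^T v_n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d_tilde = 5, 3, 2                  # users, movies, latent features (illustrative)

V = rng.standard_normal((d_tilde, N))    # columns are user vectors v_n
W = rng.standard_normal((d_tilde, M))    # columns are movie vectors w_m

def h(x):
    """Linear network hypothesis h(x) = W^T V x (no tanh needed for one-hot x)."""
    return W.T @ (V @ x)

n = 2
x_n = np.zeros(N)
x_n[n] = 1.0                             # binary vector encoding of user n
v_n = V[:, n]                            # n-th column of V

# per-user output: h(x_n) = W^T v_n, one predicted rating per movie
assert np.allclose(h(x_n), W.T @ v_n)
```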


## Fun Time

For N users, M movies, and d̃ 'features', how many variables need to be used to specify a hypothesis?

1. N + M + d̃
2. N · M · d̃
3. (N + M) · d̃
4. (N · M) + d̃

Answer: 3. Simply N · d̃ for V and M · d̃ for W.


Matrix Factorization Basic Matrix Factorization

## Linear Network: Linear Model Per Movie

linear network: h(x) = W^T (V x), with Φ(x) = V x

—for the m-th movie, just a linear model h_m(x) = w_m^T Φ(x)
subject to the shared transform Φ

for every D_m, want r_nm = y_n ≈ w_m^T v_n

E_in over all D_m with squared error measure:

E_in({w_m}, {v_n}) = (1 / Σ_m |D_m|) Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

—learn both V (the transform) and the w_m from all D_m
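The error measure above can be computed directly; a minimal NumPy sketch with a hypothetical toy rating set:

```python
import numpy as np

# hypothetical toy data: known ratings as (user n, movie m, r_nm) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 0, 1.0)]

rng = np.random.default_rng(1)
V = rng.standard_normal((2, 3))          # user factors v_n as columns (d_tilde=2, N=3)
W = rng.standard_normal((2, 2))          # movie factors w_m as columns (M=2)

def E_in(W, V, ratings):
    """Average squared error over the known ratings: sum of (r_nm - w_m^T v_n)^2."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

print(E_in(W, V, ratings))               # a nonnegative number; smaller is better
```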


## Matrix Factorization

r_nm ≈ w_m^T v_n ⇐⇒ R ≈ V^T W

(figure: rating matrix R with rows for users 1 . . . N and columns for movies 1 . . . M, factored into V^T with rows v_1^T, v_2^T, . . . , v_N^T and W with columns w_1, w_2, . . . , w_M)

match movie and viewer factors: predicted rating matches movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?)

learning: known ratings → learned v_n and w_m → unknown rating prediction

similar modeling can be used for other abstract features
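The factorization view R ≈ V^T W means every predicted rating comes from one matrix product; a short NumPy sketch (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, d_tilde = 4, 3, 2                  # illustrative sizes
V = rng.standard_normal((d_tilde, N))    # user factors, one column per user
W = rng.standard_normal((d_tilde, M))    # movie factors, one column per movie

R_hat = V.T @ W                          # R ≈ V^T W: the full N-by-M table of predictions
# entry (n, m) equals the per-pair prediction v_n^T w_m
assert np.allclose(R_hat[1, 2], V[:, 1] @ W[:, 2])
```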


## Matrix Factorization Learning

min_{W,V} E_in({w_m}, {v_n}) ∝ Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

two sets of variables: can consider alternating minimization

- when v_n fixed, minimizing w_m ≡ minimizing E_in within D_m
  —simply per-movie (per-D_m) linear regression
- when w_m fixed, minimizing v_n?
  —per-user linear regression, by symmetry between users/movies

called alternating least squares algorithm


## Alternating Least Squares

1. initialize d̃-dimension vectors {w_m}, {v_n}
2. alternating optimization of E_in: repeatedly
   1. optimize w_1, w_2, . . . , w_M: update w_m by m-th-movie linear regression on {(v_n, r_nm)}
   2. optimize v_1, v_2, . . . , v_N: update v_n by n-th-user linear regression on {(w_m, r_nm)}

   until converge

convergence guaranteed as E_in decreases during alternating minimization

alternating least squares: the 'tango' dance between users/movies
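The alternating steps can be sketched as follows (a minimal NumPy implementation under simplifying assumptions: no regularization, no bias terms, and `np.linalg.lstsq` standing in for each per-movie/per-user linear regression):

```python
import numpy as np

def als(ratings, N, M, d_tilde=2, n_iters=20, seed=0):
    """Alternating least squares for matrix factorization (minimal sketch).

    ratings: list of (user n, movie m, r_nm) triples for the known ratings.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d_tilde, N))   # user factors (columns v_n)
    W = rng.standard_normal((d_tilde, M))   # movie factors (columns w_m)
    by_movie = [[(n, r) for n, m2, r in ratings if m2 == m] for m in range(M)]
    by_user = [[(m, r) for n2, m, r in ratings if n2 == n] for n in range(N)]
    for _ in range(n_iters):
        # fix {v_n}: per-movie linear regression on {(v_n, r_nm)}
        for m, rows in enumerate(by_movie):
            if rows:
                A = np.array([V[:, n] for n, _ in rows])
                y = np.array([r for _, r in rows])
                W[:, m] = np.linalg.lstsq(A, y, rcond=None)[0]
        # fix {w_m}: per-user linear regression on {(w_m, r_nm)}
        for n, rows in enumerate(by_user):
            if rows:
                A = np.array([W[:, m] for m, _ in rows])
                y = np.array([r for _, r in rows])
                V[:, n] = np.linalg.lstsq(A, y, rcond=None)[0]
    return V, W

def E_in(W, V, ratings):
    """Average squared error over the known ratings."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

# hypothetical toy data: (user n, movie m, r_nm)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
V, W = als(ratings, N=3, M=3)
r_hat_01 = W[:, 1] @ V[:, 0]             # predicted rating of user 0 on movie 1
```

Because each least-squares step minimizes E_in exactly with the other side fixed, E_in never increases across iterations, which is the stated convergence guarantee.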


## Linear Autoencoder versus Matrix Factorization

|  | Linear Autoencoder: X ≈ W W^T X | Matrix Factorization: R ≈ V^T W |
| --- | --- | --- |
| motivation | special d-d̃-d linear NNet | N-d̃-M linear NNet |
| error measure | squared on all x_ni | squared on known r_nm |
| solution | global optimal at eigenvectors of X^T X | local optimal via alternating least squares |
| usefulness | extract dimension-reduced features | extract hidden user/movie features |

linear autoencoder ≡ special matrix factorization of complete X
