Machine Learning Techniques
(機器學習技法)

Lecture 15: Matrix Factorization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 14: Radial Basis Function Network
linear aggregation of distance-based similarities, using k-Means clustering for prototype finding

Lecture 15: Matrix Factorization
Linear Network Hypothesis
Basic Matrix Factorization
Stochastic Gradient Descent

Linear Network Hypothesis

Recommender System Revisited

data → ML → skill
• data: how 'many users' have rated 'some movies'
• skill: predict how a user would rate an unrated movie

A Hot Problem
• competition held by Netflix in 2006
• 100,480,507 ratings that 480,189 users gave to 17,770 movies
• 10% improvement = 1-million-dollar prize

data $\mathcal{D}_m$ for the m-th movie:
$\{(\tilde{\mathbf{x}}_n = (n),\ y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$
—abstract feature $\tilde{\mathbf{x}}_n = (n)$

how to learn our preferences from data?

Binary Vector Encoding of Categorical Feature

$\tilde{\mathbf{x}}_n = (n)$: user IDs, such as 1126, 5566, 6211, . . .
—called categorical features

categorical features, e.g.
• IDs
• blood type: A, B, AB, O
• programming languages: C, C++, Java, Python, . . .

many ML models operate on numerical features
• linear models
• extended linear models such as NNet
—except for decision trees

need: encoding (transform) from categorical to numerical

binary vector encoding:
$\text{A} = [1\ 0\ 0\ 0]^T$, $\text{B} = [0\ 1\ 0\ 0]^T$, $\text{AB} = [0\ 0\ 1\ 0]^T$, $\text{O} = [0\ 0\ 0\ 1]^T$
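A minimal NumPy sketch of this encoding (the function name and the 1-indexed ID convention are mine, not from the slides):

```python
import numpy as np

def binary_vector_encoding(n, N):
    """Encode categorical ID n (assumed 1-indexed, out of N categories) as a one-hot vector."""
    x = np.zeros(N)
    x[n - 1] = 1.0
    return x

print(binary_vector_encoding(1, 4))  # blood type A -> [1. 0. 0. 0.]
print(binary_vector_encoding(4, 4))  # blood type O -> [0. 0. 0. 1.]
```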

Feature Extraction from Encoded Vector

encoded data $\mathcal{D}_m$ for the m-th movie:
$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$

or, joint data $\mathcal{D}$:
$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ \mathbf{y}_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T)\}$

idea: try feature extraction using an N–$\tilde{d}$–M NNet without all $x_0^{(\ell)}$

[figure: N-input, $\tilde{d}$-hidden, M-output network with tanh hidden units; weights $w_{ni}^{(1)}$ into the hidden layer and $w_{im}^{(2)}$ out of it; outputs $\approx y_1, y_2, \ldots, y_M$]

is tanh necessary? :-)

'Linear Network' Hypothesis

[figure: the same N–$\tilde{d}$–M network with linear (no-tanh) hidden units; first-layer weights labeled $V^T : \big[w_{ni}^{(1)}\big]$, second-layer weights labeled $W : \big[w_{im}^{(2)}\big]$]

$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ \mathbf{y}_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T)\}$

rename: $V^T$ for $\big[w_{ni}^{(1)}\big]$ and $W$ for $\big[w_{im}^{(2)}\big]$

hypothesis: $h(\mathbf{x}) = W^T V \mathbf{x}$

per-user output: $h(\mathbf{x}_n) = W^T \mathbf{v}_n$, where $\mathbf{v}_n$ is the n-th column of $V$

linear network for recommender system: learn $V$ and $W$
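A minimal NumPy sketch of this hypothesis (the toy sizes and random factors are assumptions for illustration): since $\mathbf{x}_n$ is one-hot, $V\mathbf{x}_n$ is just a column lookup.

```python
import numpy as np

# toy sizes, assumed for illustration
N, M, d_tilde = 6, 4, 2
rng = np.random.default_rng(0)
V = rng.normal(size=(d_tilde, N))   # n-th column v_n: latent vector of user n
W = rng.normal(size=(d_tilde, M))   # m-th column w_m: latent vector of movie m

def h(x):
    """Full hypothesis on an encoded input: W^T (V x)."""
    return W.T @ (V @ x)

x_1 = np.eye(N)[:, 0]                       # BinaryVectorEncoding of the first user
assert np.allclose(h(x_1), W.T @ V[:, 0])   # per-user output: h(x_n) = W^T v_n
print(h(x_1))                               # predicted ratings of all M movies
```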

Fun Time

For N users, M movies, and $\tilde{d}$ 'features', how many variables need to be used to specify a linear network hypothesis $h(\mathbf{x}) = W^T V \mathbf{x}$?

1  $N + M + \tilde{d}$
2  $N \cdot M \cdot \tilde{d}$
3  $(N + M) \cdot \tilde{d}$
4  $(N \cdot M) + \tilde{d}$

Reference Answer: 3

simply $N \cdot \tilde{d}$ for $V^T$ and $\tilde{d} \cdot M$ for $W$
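A quick check of the count at Netflix scale (N and M come from the slides; $\tilde{d} = 100$ is an assumed example value):

```python
N, M, d_tilde = 480_189, 17_770, 100      # N users, M movies from the Netflix data
print(N * d_tilde + d_tilde * M)          # N*d_tilde for V^T plus d_tilde*M for W
print((N + M) * d_tilde)                  # same total: 49,795,900 variables
```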

Basic Matrix Factorization

Linear Network: Linear Model Per Movie

linear network: $h(\mathbf{x}) = W^T \underbrace{V\mathbf{x}}_{\Phi(\mathbf{x})}$
—for the m-th movie, just a linear model $h_m(\mathbf{x}) = \mathbf{w}_m^T \Phi(\mathbf{x})$ subject to the shared transform $\Phi$

for every $\mathcal{D}_m$, want $r_{nm} = y_n \approx \mathbf{w}_m^T \mathbf{v}_n$

$E_{\text{in}}$ over all $\mathcal{D}_m$ with squared error measure:

$E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) = \dfrac{1}{\sum_{m=1}^{M} |\mathcal{D}_m|} \sum_{\text{user } n \text{ rated movie } m} \big(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\big)^2$

linear network: transform and linear models jointly learned from all $\mathcal{D}_m$
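A minimal sketch of this error measure, assuming (my convention) the ratings sit in an N × M array R with np.nan marking unrated pairs:

```python
import numpy as np

def E_in(R, V, W):
    """Average of (r_nm - w_m^T v_n)^2 over the rated entries only."""
    known = ~np.isnan(R)          # mask of (n, m) pairs where user n rated movie m
    pred = V.T @ W                # N x M matrix whose (n, m) entry is v_n^T w_m
    return np.mean((R[known] - pred[known]) ** 2)
```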

Matrix Factorization

$r_{nm} \approx \mathbf{w}_m^T \mathbf{v}_n = \mathbf{v}_n^T \mathbf{w}_m \iff R \approx V^T W$

R         movie 1   movie 2   · · ·   movie M
user 1      100        80     · · ·      ?
user 2       ?         70     · · ·     90
· · ·      · · ·     · · ·    · · ·    · · ·
user N       ?         60     · · ·      0

with $V^T$ stacking rows $\mathbf{v}_1^T, \mathbf{v}_2^T, \ldots, \mathbf{v}_N^T$ and $W$ holding columns $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M$

[figure: match movie and viewer factors — movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) matched against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); the predicted rating adds contributions from each factor]

Matrix Factorization Model
learning: known ratings → learned factors $\mathbf{v}_n$ and $\mathbf{w}_m$ → unknown rating prediction

similar modeling can be used for other abstract features

Matrix Factorization Learning

$\min_{W,V} E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} \big(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\big)^2 = \sum_{m=1}^{M}\ \sum_{(\mathbf{x}_n, r_{nm}) \in \mathcal{D}_m} \big(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\big)^2$

• two sets of variables: can consider alternating minimization, remember? :-)
• when $\mathbf{v}_n$ fixed, minimizing $\mathbf{w}_m$ ≡ minimizing $E_{\text{in}}$ within $\mathcal{D}_m$
—simply per-movie (per-$\mathcal{D}_m$) linear regression without $w_0$
• when $\mathbf{w}_m$ fixed, minimizing $\mathbf{v}_n$?
—per-user linear regression without $v_0$, by symmetry between users/movies

called the alternating least squares algorithm
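A sketch of one half of such an alternating step, under the same nan-masked R convention as above:

```python
import numpy as np

def update_movie_factors(R, V, W):
    """With user factors fixed, refit each w_m by least squares over movie m's raters."""
    for m in range(R.shape[1]):
        rated = ~np.isnan(R[:, m])                      # users who rated movie m
        A = V[:, rated].T                               # their factor vectors v_n as rows
        b = R[rated, m]                                 # their ratings r_nm
        W[:, m] = np.linalg.lstsq(A, b, rcond=None)[0]  # regression without intercept w_0
```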

Alternating Least Squares

1 initialize $\tilde{d}$-dimensional vectors $\{\mathbf{w}_m\}$, $\{\mathbf{v}_n\}$
2 alternating optimization of $E_{\text{in}}$: repeatedly
   1 optimize $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M$:
     update $\mathbf{w}_m$ by m-th-movie linear regression on $\{(\mathbf{v}_n, r_{nm})\}$
   2 optimize $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N$:
     update $\mathbf{v}_n$ by n-th-user linear regression on $\{(\mathbf{w}_m, r_{nm})\}$
until converge

initialize: usually just randomly
converge: guaranteed, as $E_{\text{in}}$ decreases during alternating minimization

alternating least squares: the 'tango' dance between users/movies
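A sketch of the full loop, reusing update_movie_factors from the previous sketch; by the user/movie symmetry, updating V is the same routine run on R transposed:

```python
import numpy as np

def als(R, d_tilde, iters=20, seed=0):
    """Alternating least squares on a nan-masked rating matrix R (N x M)."""
    N, M = R.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(d_tilde, N))   # initialize: usually just randomly
    W = rng.normal(scale=0.1, size=(d_tilde, M))
    for _ in range(iters):                         # 'until converge' (fixed iters here)
        update_movie_factors(R, V, W)              # optimize w_1, ..., w_M
        update_movie_factors(R.T, W, V)            # optimize v_1, ..., v_N by symmetry
    return V, W

# predicted rating of user n on movie m: V[:, n] @ W[:, m]
```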

Linear Autoencoder versus Matrix Factorization

Linear Autoencoder: $X \approx W W^T X$
• motivation: special d–$\tilde{d}$–d linear NNet
• error measure: squared on all $x_{ni}$
• solution: global optimal at eigenvectors of $X^T X$
• usefulness: extract dimension-reduced features

Matrix Factorization: $R \approx V^T W$
• motivation: N–$\tilde{d}$–M linear NNet
• error measure: squared on known $r_{nm}$
• solution: local optimal via alternating least squares
• usefulness: extract hidden user/movie features

linear autoencoder: special matrix factorization of complete $X$
