# Machine Learning Techniques (機器學習技法)

### Lecture 15: Matrix Factorization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

3 Distilling Implicit Features: Extraction Models

### Lecture 15: Matrix Factorization

- Linear Network Hypothesis
- Basic Matrix Factorization

Matrix Factorization Linear Network Hypothesis

## Recommender System Revisited

- competition held by Netflix in 2006
- data D_m for the m-th movie:

  {(x̃_n = (n), y_n = r_nm) : user n rated movie m}

  —abstract feature x̃_n = (n)

how to learn our preferences from data?

## Binary Vector Encoding of Categorical Feature

x̃_n = (n): user IDs, such as 1126, 5566, 6211, . . .
—called categorical features

- many ML models operate on numerical features
  —except for decision trees

need: encoding (transform) from categorical to numerical

binary vector encoding, e.g. for four blood types:

A = [1 0 0 0]^T, B = [0 1 0 0]^T, AB = [0 0 1 0]^T, O = [0 0 0 1]^T
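The encoding step can be sketched in a few lines of NumPy (a minimal sketch; `binary_vector_encode` is a hypothetical helper name, and the four-category example mirrors the [0 0 0 1]^T vector above):

```python
import numpy as np

def binary_vector_encode(index, num_categories):
    """Encode a 1-based categorical index as a one-hot (binary) vector."""
    v = np.zeros(num_categories)
    v[index - 1] = 1.0
    return v

# e.g. the 4th of 4 categories -> [0 0 0 1]^T
print(binary_vector_encode(4, 4))  # [0. 0. 0. 1.]
```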


## Feature Extraction from Encoded Vector

data: {(x_n = BinaryVectorEncoding(n), y_n) : user n}, or,
{(x_n = BinaryVectorEncoding(n), y_n = [r_n1 ? ? r_n4 . . . r_nM]^T)}
—with ? marking the movies user n has not rated

idea: try feature extraction using an N-d̃-M NNet

(figure: an N-d̃-M network with tanh units in the hidden and output layers)

is tanh necessary? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/22


## Linear Network Hypothesis

(figure: N-d̃-M linear network; first-layer weights w_ni^(1), second-layer weights w_im^(2))

rename: V^T = [w_ni^(1)] and W = [w_im^(2)]

hypothesis: h(x) = W^T V x

per-user output: h(x_n) = W^T v_n, where v_n is the n-th column of V

linear network for recommender system: learn V and W
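As a sanity check of the hypothesis above, a small NumPy sketch (all names and dimensions are illustrative) confirms that feeding the one-hot encoding of user n through h(x) = W^T V x picks out W^T v_n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d_tilde = 5, 3, 2                  # users, movies, latent features (illustrative)

V = rng.standard_normal((d_tilde, N))    # columns are user vectors v_n
W = rng.standard_normal((d_tilde, M))    # columns are movie vectors w_m

def h(x):
    """Linear network hypothesis h(x) = W^T V x (no tanh needed for one-hot x)."""
    return W.T @ (V @ x)

n = 2
x_n = np.zeros(N)
x_n[n] = 1.0                             # binary vector encoding of user n
v_n = V[:, n]                            # n-th column of V

# per-user output: h(x_n) = W^T v_n, one predicted rating per movie
assert np.allclose(h(x_n), W.T @ v_n)
```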


## Fun Time

For N users, M movies, and d̃ 'features', how many variables need to be used to specify a hypothesis?

1. N + M + d̃
2. N · M · d̃
3. (N + M) · d̃
4. (N · M) + d̃

Answer: 3. Simply N · d̃ for V and M · d̃ for W.


Matrix Factorization Basic Matrix Factorization

## Linear Network: Linear Model Per Movie

linear network: h(x) = W^T (V x), with Φ(x) = V x

—for the m-th movie, just a linear model h_m(x) = w_m^T Φ(x)
subject to the shared transform Φ

for every D_m, want r_nm = y_n ≈ w_m^T v_n

E_in over all D_m with squared error measure:

E_in({w_m}, {v_n}) = (1 / Σ_m |D_m|) Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

—learn both V (the transform) and the w_m from all D_m
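The error measure above can be computed directly; a minimal NumPy sketch with a hypothetical toy rating set:

```python
import numpy as np

# hypothetical toy data: known ratings as (user n, movie m, r_nm) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 0, 1.0)]

rng = np.random.default_rng(1)
V = rng.standard_normal((2, 3))          # user factors v_n as columns (d_tilde=2, N=3)
W = rng.standard_normal((2, 2))          # movie factors w_m as columns (M=2)

def E_in(W, V, ratings):
    """Average squared error over the known ratings: sum of (r_nm - w_m^T v_n)^2."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

print(E_in(W, V, ratings))               # a nonnegative number; smaller is better
```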


## Matrix Factorization

r_nm ≈ w_m^T v_n ⇐⇒ R ≈ V^T W

(figure: rating matrix R with rows for users 1 . . . N and columns for movies 1 . . . M, factored into V^T with rows v_1^T, v_2^T, . . . , v_N^T and W with columns w_1, w_2, . . . , w_M)

match movie and viewer factors: predicted rating matches movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?)

learning: known ratings → learned v_n and w_m → unknown rating prediction

similar modeling can be used for other abstract features
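The factorization view R ≈ V^T W means every predicted rating comes from one matrix product; a short NumPy sketch (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, d_tilde = 4, 3, 2                  # illustrative sizes
V = rng.standard_normal((d_tilde, N))    # user factors, one column per user
W = rng.standard_normal((d_tilde, M))    # movie factors, one column per movie

R_hat = V.T @ W                          # R ≈ V^T W: the full N-by-M table of predictions
# entry (n, m) equals the per-pair prediction v_n^T w_m
assert np.allclose(R_hat[1, 2], V[:, 1] @ W[:, 2])
```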


## Matrix Factorization Learning

min_{W,V} E_in({w_m}, {v_n}) ∝ Σ_{user n rated movie m} (r_nm − w_m^T v_n)^2

two sets of variables: can consider alternating minimization

- when v_n fixed, minimizing w_m ≡ minimizing E_in within D_m
  —simply per-movie (per-D_m) linear regression
- when w_m fixed, minimizing v_n?
  —per-user linear regression, by symmetry between users/movies

called alternating least squares algorithm


## Alternating Least Squares

1. initialize d̃-dimension vectors {w_m}, {v_n}
2. alternating optimization of E_in: repeatedly
   1. optimize w_1, w_2, . . . , w_M: update w_m by m-th-movie linear regression on {(v_n, r_nm)}
   2. optimize v_1, v_2, . . . , v_N: update v_n by n-th-user linear regression on {(w_m, r_nm)}

   until converge

convergence guaranteed as E_in decreases during alternating minimization

alternating least squares: the 'tango' dance between users/movies
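The alternating steps can be sketched as follows (a minimal NumPy implementation under simplifying assumptions: no regularization, no bias terms, and `np.linalg.lstsq` standing in for each per-movie/per-user linear regression):

```python
import numpy as np

def als(ratings, N, M, d_tilde=2, n_iters=20, seed=0):
    """Alternating least squares for matrix factorization (minimal sketch).

    ratings: list of (user n, movie m, r_nm) triples for the known ratings.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d_tilde, N))   # user factors (columns v_n)
    W = rng.standard_normal((d_tilde, M))   # movie factors (columns w_m)
    by_movie = [[(n, r) for n, m2, r in ratings if m2 == m] for m in range(M)]
    by_user = [[(m, r) for n2, m, r in ratings if n2 == n] for n in range(N)]
    for _ in range(n_iters):
        # fix {v_n}: per-movie linear regression on {(v_n, r_nm)}
        for m, rows in enumerate(by_movie):
            if rows:
                A = np.array([V[:, n] for n, _ in rows])
                y = np.array([r for _, r in rows])
                W[:, m] = np.linalg.lstsq(A, y, rcond=None)[0]
        # fix {w_m}: per-user linear regression on {(w_m, r_nm)}
        for n, rows in enumerate(by_user):
            if rows:
                A = np.array([W[:, m] for m, _ in rows])
                y = np.array([r for _, r in rows])
                V[:, n] = np.linalg.lstsq(A, y, rcond=None)[0]
    return V, W

def E_in(W, V, ratings):
    """Average squared error over the known ratings."""
    return sum((r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings) / len(ratings)

# hypothetical toy data: (user n, movie m, r_nm)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
V, W = als(ratings, N=3, M=3)
r_hat_01 = W[:, 1] @ V[:, 0]             # predicted rating of user 0 on movie 1
```

Because each least-squares step minimizes E_in exactly with the other side fixed, E_in never increases across iterations, which is the stated convergence guarantee.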


## Linear Autoencoder versus Matrix Factorization

|  | Linear Autoencoder: X ≈ W W^T X | Matrix Factorization: R ≈ V^T W |
| --- | --- | --- |
| motivation | special d-d̃-d linear NNet | N-d̃-M linear NNet |
| error measure | squared on all x_ni | squared on known r_nm |
| solution | global optimal at eigenvectors of X^T X | local optimal via alternating least squares |
| usefulness | extract dimension-reduced features | extract hidden user/movie features |

linear autoencoder ≡ special matrix factorization of complete X
