## Machine Learning Techniques (機器學習技法)

### Lecture 15: Matrix Factorization

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

### Lecture 14: Radial Basis Function Network

**linear aggregation** of **distance-based similarities** using **k-means clustering** for **prototype finding**

### Lecture 15: Matrix Factorization

- Linear Network Hypothesis
- Basic Matrix Factorization
- Stochastic Gradient Descent

Matrix Factorization Linear Network Hypothesis

## Recommender System Revisited

data → ML → skill

- data: how ‘many users’ have rated ‘some movies’
- skill: predict how a user would rate an unrated movie

### A Hot Problem

- competition held by Netflix in 2006
- 100,480,507 ratings that 480,189 users gave to 17,770 movies
- 10% improvement = **1 million dollar prize**
- data $\mathcal{D}_m$ for the $m$-th movie: $\{(\tilde{\mathbf{x}}_n = (n),\ y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$ (abstract feature $\tilde{\mathbf{x}}_n = (n)$)

how to **learn our preferences** from data?

## Binary Vector Encoding of Categorical Feature

$\tilde{\mathbf{x}}_n = (n)$: user IDs, such as 1126, 5566, 6211, ... (called **categorical** features)

- **categorical** features, e.g.
  - IDs
  - blood type: A, B, AB, O
  - programming languages: C, C++, Java, Python, ...
- many ML models operate on **numerical** features
  - **linear** models
  - **extended linear** models such as NNet
  (except for **decision trees**)
- need: **encoding** (transform) from categorical to numerical

**binary vector encoding:**

$$\text{A} = [1\ 0\ 0\ 0]^T,\quad \text{B} = [0\ 1\ 0\ 0]^T,\quad \text{AB} = [0\ 0\ 1\ 0]^T,\quad \text{O} = [0\ 0\ 0\ 1]^T$$
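The encoding above is what is usually called one-hot encoding; a minimal sketch in Python (the helper name `binary_vector_encoding` is ours, not from the lecture):

```python
import numpy as np

def binary_vector_encoding(category, categories):
    """Return the binary (one-hot) vector for one categorical value."""
    x = np.zeros(len(categories))
    x[categories.index(category)] = 1.0
    return x

blood_types = ["A", "B", "AB", "O"]
print(binary_vector_encoding("AB", blood_types))  # [0. 0. 1. 0.]
```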


## Feature Extraction from Encoded Vector

**encoded** data $\mathcal{D}_m$ for the $m$-th movie:

$$\left\{\left(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ y_n = r_{nm}\right) : \text{user } n \text{ rated movie } m\right\}$$

or, **joint** data $\mathcal{D}$:

$$\left\{\left(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ \mathbf{y}_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T\right)\right\}$$

idea: try **feature extraction** using an $N$–$\tilde{d}$–$M$ NNet without all $x_0^{(\ell)}$

[network diagram: one-hot input $\mathbf{x} = (x_1, x_2, \ldots)$, a tanh hidden layer with weights $w_{ni}^{(1)}$, and outputs $\approx y_1, y_2, \ldots = \mathbf{y}$ with weights $w_{im}^{(2)}$]

is tanh **necessary?** :-)

Hsuan-Tien Lin (NTU CSIE), Machine Learning Techniques, 4/22


## ‘Linear Network’ Hypothesis

[network diagram: one-hot input $\mathbf{x} = (x_1, x_2, \ldots)$, a linear hidden layer $V^T$ with weights $w_{ni}^{(1)}$, and an output layer $W$ with weights $w_{im}^{(2)}$ producing $\approx \mathbf{y}$]

$$\left\{\left(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\ \mathbf{y}_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T\right)\right\}$$

- rename: $V^T$ for $\left[w_{ni}^{(1)}\right]$ and $W$ for $\left[w_{im}^{(2)}\right]$
- hypothesis: $\mathbf{h}(\mathbf{x}) = W^T V \mathbf{x}$
- per-user output: $\mathbf{h}(\mathbf{x}_n) = W^T \mathbf{v}_n$, where $\mathbf{v}_n$ is the $n$-th column of $V$

linear network for recommender system: **learn** $V$ and $W$
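The per-user identity above holds because a one-hot input simply selects one column of $V$; a small NumPy check (the sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d_tilde = 5, 3, 2             # users, movies, latent dimension d~; illustrative sizes
V = rng.normal(size=(d_tilde, N))   # one column v_n per user
W = rng.normal(size=(d_tilde, M))   # one column w_m per movie

n = 2
x_n = np.zeros(N)
x_n[n] = 1.0                        # BinaryVectorEncoding(n)

h = W.T @ (V @ x_n)                 # h(x) = W^T V x
# the one-hot x picks out the n-th column of V, so h(x_n) = W^T v_n:
assert np.allclose(h, W.T @ V[:, n])
```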


## Fun Time

For $N$ users, $M$ movies, and $\tilde{d}$ ‘features’, how many variables need to be used to specify a linear network hypothesis $\mathbf{h}(\mathbf{x}) = W^T V \mathbf{x}$?

1. $N + M + \tilde{d}$
2. $N \cdot M \cdot \tilde{d}$
3. $(N + M) \cdot \tilde{d}$
4. $(N \cdot M) + \tilde{d}$

**Reference Answer: 3**

simply $N \cdot \tilde{d}$ for $V^T$ and $\tilde{d} \cdot M$ for $W$

Hsuan-Tien Lin (NTU CSIE), Machine Learning Techniques, 6/22


Matrix Factorization Basic Matrix Factorization

## Linear Network: Linear Model Per Movie

linear network: $\mathbf{h}(\mathbf{x}) = W^T \underbrace{V\mathbf{x}}_{\Phi(\mathbf{x})}$

for the $m$-th movie, this is just a linear model $h_m(\mathbf{x}) = \mathbf{w}_m^T \Phi(\mathbf{x})$, subject to the shared transform $\Phi$

- for every $\mathcal{D}_m$, want $r_{nm} = y_n \approx \mathbf{w}_m^T \mathbf{v}_n$
- $E_{\text{in}}$ over all $\mathcal{D}_m$ with the squared error measure:

$$E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) = \frac{1}{\sum_{m=1}^{M} |\mathcal{D}_m|} \sum_{\text{user } n \text{ rated movie } m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2$$

linear network: transform and linear modelS **jointly learned** from all $\mathcal{D}_m$
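The in-sample error above can be computed directly from a partially observed rating matrix; a minimal sketch, using NaN to mark unrated entries (our convention, not the lecture's):

```python
import numpy as np

def e_in(R, V, W):
    """Average squared error over the observed ratings (NaN = unrated)."""
    pred = V.T @ W                 # pred[n, m] = v_n^T w_m
    observed = ~np.isnan(R)        # the union of all D_m
    return np.mean((R[observed] - pred[observed]) ** 2)

R = np.array([[5.0, np.nan],
              [np.nan, 3.0]])
V = np.ones((2, 2))
W = np.ones((2, 2))                # every prediction v_n^T w_m equals 2
print(e_in(R, V, W))               # ((5-2)^2 + (3-2)^2) / 2 = 5.0
```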

## Matrix Factorization

$$r_{nm} \approx \mathbf{w}_m^T \mathbf{v}_n = \mathbf{v}_n^T \mathbf{w}_m \iff R \approx V^T W$$

| $R$ | movie 1 | movie 2 | $\cdots$ | movie $M$ |
|---|---|---|---|---|
| user 1 | 100 | 80 | $\cdots$ | ? |
| user 2 | ? | 70 | $\cdots$ | 90 |
| $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ |
| user $N$ | ? | 60 | $\cdots$ | 0 |

$$R \approx V^T W = \begin{bmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_N^T \end{bmatrix} \begin{bmatrix} \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_M \end{bmatrix}$$

[figure: match movie and viewer factors. Movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) are matched against viewer factors (likes comedy? likes action? prefers blockbusters? likes Tom Cruise?); the predicted rating adds contributions from each factor.]

### Matrix Factorization Model

learning: known rating → learned **factors** $\mathbf{v}_n$ and $\mathbf{w}_m$ → unknown rating prediction

similar modeling can be used for other **abstract features**
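The factor view can be made concrete with a tiny hand-built example; the factor labels (comedy, action) are illustrative assumptions, not learned quantities:

```python
import numpy as np

# Hypothetical d~ = 2 factors: factor 0 ~ comedy, factor 1 ~ action (labels assumed).
V = np.array([[0.9, 0.1],     # user 1: likes comedy, not action
              [0.2, 0.8]]).T  # user 2: likes action          -> V is d~ x N
W = np.array([[1.0, 0.0],     # movie 1: pure comedy
              [0.0, 1.0],     # movie 2: pure action
              [0.5, 0.5]]).T  # movie 3: a bit of both        -> W is d~ x M

R_hat = V.T @ W               # r_nm = v_n^T w_m: contributions added per factor
print(R_hat[0])               # user 1's predicted ratings: [0.9 0.1 0.5]
```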


## Matrix Factorization Learning

$$\min_{W,V} E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2 = \sum_{m=1}^{M} \sum_{(\mathbf{x}_n, r_{nm}) \in \mathcal{D}_m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2$$

- **two sets** of variables: can consider **alternating minimization, remember? :-)**
- when $\mathbf{v}_n$ fixed, minimizing over $\mathbf{w}_m$ $\equiv$ minimizing $E_{\text{in}}$ within $\mathcal{D}_m$, simply per-movie (per-$\mathcal{D}_m$) **linear regression** without $w_0$
- when $\mathbf{w}_m$ fixed, minimizing over $\mathbf{v}_n$? per-user linear regression without $v_0$, by **symmetry** between users/movies

called the **alternating least squares** algorithm
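The alternating loop described above can be sketched as follows; this is a minimal version with no regularization and no bias terms, and `als` is our own helper name, not from the lecture:

```python
import numpy as np

def als(R, d, iters=20, rng=None):
    """Alternating least squares on a rating matrix R (NaN = unrated).

    Fix V and solve a per-movie least-squares problem for each w_m, then
    by symmetry fix W and solve per-user for each v_n.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    N, M = R.shape
    V = rng.normal(size=(d, N))
    W = rng.normal(size=(d, M))
    for _ in range(iters):
        for m in range(M):                      # per-movie linear regression
            n_idx = ~np.isnan(R[:, m])
            if n_idx.any():
                W[:, m] = np.linalg.lstsq(V[:, n_idx].T, R[n_idx, m], rcond=None)[0]
        for n in range(N):                      # per-user linear regression
            m_idx = ~np.isnan(R[n, :])
            if m_idx.any():
                V[:, n] = np.linalg.lstsq(W[:, m_idx].T, R[n, m_idx], rcond=None)[0]
    return V, W

R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0]])
V, W = als(R, d=2)
```

On a small noiseless low-rank matrix this typically fits the observed entries almost exactly; the unrated entries are then predicted by $\mathbf{v}_n^T \mathbf{w}_m$.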
Matrix Factorization Basic Matrix Factorization

## Alternating Least Squares

### Alternating Least Squares

1. initialize d̃-dimensional vectors {**w**_m}, {**v**_n}
2. **alternating optimization** of E_in: repeatedly
   1. optimize **w**_1, **w**_2, . . . , **w**_M: update **w**_m by m-th-movie linear regression on {(**v**_n, r_nm)}
   2. optimize **v**_1, **v**_2, . . . , **v**_N: update **v**_n by n-th-user linear regression on {(**w**_m, r_nm)}

   until **converge**

- **initialize**: usually just **randomly**
- **converge**: guaranteed, as E_in **decreases** during alternating minimization

alternating least squares: the 'tango' dance between **users/movies**
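The whole loop can be sketched in a few lines of numpy on a toy rank-d̃ rating matrix with randomly observed entries (sizes, seed, and variable names are illustrative assumptions, not from the lecture). Each per-movie and per-user solve can only lower the observed squared error, so E_in is non-increasing across sweeps:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, d_tilde = 30, 20, 4
# exactly low-rank "true" ratings; mask[n, m] is True iff user n rated movie m
R = rng.normal(size=(N, d_tilde)) @ rng.normal(size=(d_tilde, M))
mask = rng.random((N, M)) < 0.5

def e_in(V, W):
    """mean squared error over the known ratings only"""
    err = (R - V @ W.T)[mask]
    return float(np.mean(err ** 2))

# 1. initialize randomly
V = rng.normal(size=(N, d_tilde))         # user vectors v_n as rows
W = rng.normal(size=(M, d_tilde))         # movie vectors w_m as rows

# 2. alternate per-movie / per-user linear regressions
history = [e_in(V, W)]
for _ in range(20):
    for m in range(M):                    # V fixed: solve for each w_m on D_m
        n_idx = mask[:, m]
        W[m], *_ = np.linalg.lstsq(V[n_idx], R[n_idx, m], rcond=None)
    for n in range(N):                    # W fixed: solve for each v_n
        m_idx = mask[n]
        V[n], *_ = np.linalg.lstsq(W[m_idx], R[n, m_idx], rcond=None)
    history.append(e_in(V, W))
```

The `history` list records E_in after each full sweep; its monotone decrease is the convergence guarantee noted above (to a local, not necessarily global, optimum).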


## Linear Autoencoder versus Matrix Factorization

### Linear Autoencoder: X ≈ W W^T X

- motivation: **special** d-d̃-d linear NNet
- error measure: squared on **all** x_ni
- solution: global optimal at **eigenvectors** of X^T X
- usefulness: extract **dimension-reduced features**

### Matrix Factorization: R ≈ V^T W

- motivation: N-d̃-M linear NNet
- error measure: squared on **known** r_nm
- solution: local optimal via **alternating least squares**
- usefulness: extract **hidden user/movie features**

linear autoencoder ≡ **special** matrix factorization of **complete** X
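The 'complete X' case really is easier: because every entry is known, no alternating is needed and the optimum is analytic. A small numpy sketch of the eigenvector solution (toy sizes, assumed names), where keeping the top d̃ eigenvectors of X^T X gives the best rank-d̃ reconstruction X ≈ W W^T X:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, d_tilde = 40, 6, 2
X = rng.normal(size=(N, d))               # complete data matrix, rows x_n^T

# global optimum: top-d_tilde eigenvectors of X^T X (eigh sorts ascending)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
W = eigvecs[:, -d_tilde:]                 # keep the top d_tilde directions

# reconstruct every row: x_n ≈ W W^T x_n, i.e. X_hat = X W W^T in row form
X_hat = X @ W @ W.T

err = float(np.sum((X - X_hat) ** 2))     # squared error over ALL entries
```

The leftover squared error equals the sum of the discarded eigenvalues, which is why no other rank-d̃ projection can do better on a complete X; matrix factorization gives up this closed form because it only scores the known r_nm.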