Academic year: 2022


Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman

Stanford University

http://www.mmds.org

¡ Assumption: Data lies on or near a low d-dimensional subspace

¡ The axes of this subspace are an effective representation of the data


¡ Compress / reduce dimensionality:

§ $10^6$ rows; $10^3$ columns; no updates

§ Random access to any cell(s); small error: OK

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]

¡ Q: What is rank of a matrix A?

¡ A: Number of linearly independent columns of A

¡ For example:

§ Matrix $A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}$ has rank r = 2

§ Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3.

¡ Why do we care about low rank?

§ We can write the rows of A using two “basis” vectors: [1 2 1] [-2 -3 1]

§ And the new coordinates of the rows are: [1 0] [0 1] [1 -1]
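The rank argument above can be checked numerically. The sketch below assumes A has rows [1 2 1], [-2 -3 1] and [3 5 0]; the third row is inferred from the coordinates [1 -1] in the two basis vectors, and the variable names are illustrative only.

```python
import numpy as np

# Assumed matrix: third row = 1*[1 2 1] - 1*[-2 -3 1], as the coordinates [1 -1] imply.
A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]], dtype=float)

rank = np.linalg.matrix_rank(A)

# The two "basis" vectors quoted on the slide (the first two rows of A).
basis = np.array([[ 1,  2, 1],
                  [-2, -3, 1]], dtype=float)

# Solve coords @ basis = A by least squares; the fit is exact because rank(A) = 2.
solution, *_ = np.linalg.lstsq(basis.T, A.T, rcond=None)
coords = solution.T          # one row of new coordinates per row of A

print(rank)
print(np.round(coords, 6))
```

Each row of A needs 3 numbers in the old basis but only 2 in the new one, which is the saving that low rank buys.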


¡ Cloud of points in 3D space:

§ Think of point positions as a matrix:

¡ We can rewrite coordinates more efficiently!

§ Old basis vectors: [1 0 0] [0 1 0] [0 0 1]

§ New basis vectors: [1 2 1] [-2 -3 1]

§ Then the points have new coordinates: A: [1 0], B: [0 1], C: [1 -1]

§ Notice: We reduced the number of coordinates!

(Figure: the point positions written as a matrix, 1 row per point, for points A, B, C.)

¡ Goal of dimensionality reduction is to discover the axes of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line).

By doing this we incur a bit of error, as the points do not lie exactly on the line.

Why reduce dimensions?

¡ Discover hidden correlations/topics

§ Words that occur commonly together

¡ Remove redundant and noisy features

§ Not all words are useful

¡ Interpretation and visualization

¡ Easier storage and processing of the data


$A_{[m \times n]} = U_{[m \times r]} \, \Sigma_{[r \times r]} \, (V_{[n \times r]})^T$

¡ A: Input data matrix

§ m x n matrix (e.g., m documents, n terms)

¡ U: Left singular vectors

§ m x r matrix (m documents, r concepts)

¡ Σ: Singular values

§ r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix A)

¡ V: Right singular vectors

§ n x r matrix (n terms, r concepts)

(Figure: the m x n matrix A drawn as the product of U, Σ and $V^T$.)

$A \approx \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$

$\sigma_i$ … scalar, $u_i$ … vector, $v_i$ … vector

It is always possible to decompose a real matrix A into $A = U \Sigma V^T$, where

¡ U, Σ, V: unique

¡ U, V: column orthonormal

§ $U^T U = I$; $V^T V = I$ (I: identity matrix)

§ (Columns are orthogonal unit vectors)

¡ Σ: diagonal

§ Entries (singular values) are positive, and sorted in decreasing order ($\sigma_1 \ge \sigma_2 \ge \dots \ge 0$)

Nice proof of uniqueness: http://www.mpi-inf.mpg.de/~bast/ir-seminar-ws04/lecture2.pdf
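The listed properties are easy to check numerically. This is a minimal sketch on a random matrix (the seed and the 6 x 4 shape are arbitrary assumptions), using NumPy's `np.linalg.svd` in its compact form.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, purely illustrative data
A = rng.normal(size=(6, 4))

# full_matrices=False returns the compact factors: U is 6 x 4, Vt is 4 x 4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

cols_U_orthonormal = np.allclose(U.T @ U, np.eye(U.shape[1]))
cols_V_orthonormal = np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))
sorted_decreasing = bool(np.all(np.diff(s) <= 0))
reconstructs_A = np.allclose(U @ np.diag(s) @ Vt, A)

print(cols_U_orthonormal, cols_V_orthonormal, sorted_decreasing, reconstructs_A)
```

Note that NumPy returns $V^T$ (here `Vt`) rather than V, and the singular values as a 1-D array rather than a diagonal matrix.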

¡ $A = U \Sigma V^T$ - example: Users to Movies

A (1 row per user, 1 column per movie):

           Matrix  Alien  Serenity  Casablanca  Amelie
SciFi         1      1       1          0         0
              3      3       3          0         0
              4      4       4          0         0
              5      5       5          0         0
Romance       0      2       0          4         4
              0      0       0          5         5
              0      1       0          2         2

The inner dimension of U, Σ and $V^T$ indexes the “concepts”, AKA latent dimensions, AKA latent factors.

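¡ $A = U \Sigma V^T$ - example: Users to Movies

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
3 & 3 & 3 & 0 & 0 \\
4 & 4 & 4 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 2 & 0 & 4 & 4 \\
0 & 0 & 0 & 5 & 5 \\
0 & 1 & 0 & 2 & 2
\end{bmatrix}
=
\begin{bmatrix}
0.13 & 0.02 & -0.01 \\
0.41 & 0.07 & -0.03 \\
0.55 & 0.09 & -0.04 \\
0.68 & 0.11 & -0.05 \\
0.15 & -0.59 & 0.65 \\
0.07 & -0.73 & -0.67 \\
0.07 & -0.29 & 0.32
\end{bmatrix}
\times
\begin{bmatrix}
12.4 & 0 & 0 \\
0 & 9.5 & 0 \\
0 & 0 & 1.3
\end{bmatrix}
\times
\begin{bmatrix}
0.56 & 0.59 & 0.56 & 0.09 & 0.09 \\
0.12 & -0.02 & 0.12 & -0.69 & -0.69 \\
0.40 & -0.80 & 0.40 & 0.09 & 0.09
\end{bmatrix}
$$

Rows of A: SciFi users (first four) and Romance users (last three). Columns of A: Matrix, Alien, Serenity, Casablanca, Amelie.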

¡ In this example the first concept is the SciFi-concept and the second is the Romance-concept: they correspond to the first and second columns of U and to the first and second rows of $V^T$.

¡ U is the “user-to-concept” similarity matrix: its first column relates users to the SciFi-concept and its second column to the Romance-concept

¡ Σ: the first singular value, 12.4, is the “strength” of the SciFi-concept

¡ V is the “movie-to-concept” similarity matrix: the first row of $V^T$ relates movies to the SciFi-concept

‘movies’, ‘users’ and ‘concepts’:

¡ U: user-to-concept similarity matrix

¡ V: movie-to-concept similarity matrix

¡ Σ: its diagonal elements give the ‘strength’ of each concept
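The users-to-movies example can be reproduced with a few lines of NumPy. This is a sketch, not the authors' code; the singular values it prints should be close to the 12.4, 9.5 and 1.3 quoted on the slides, which appear to be rounded, and the signs of the singular vectors may be flipped relative to the slides.

```python
import numpy as np

movies = ["Matrix", "Alien", "Serenity", "Casablanca", "Amelie"]
A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Only three singular values are non-zero, because A has rank 3.
print(np.round(s, 2))

# Absolute loadings of each movie on the first concept (sign is arbitrary).
first_concept = dict(zip(movies, np.round(np.abs(Vt[0]), 2)))
print(first_concept)
```

The first concept loads mostly on Matrix, Alien and Serenity, which is why the slides call it the SciFi-concept.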

(Figure: users plotted by Movie 1 rating against Movie 2 rating; $v_1$ is the first right singular vector.)

¡ Instead of using two coordinates $(x, y)$ to describe point locations, let’s use only one coordinate $z$

¡ Point’s position is its location along vector $v_1$

¡ How to choose $v_1$? Minimize reconstruction error

¡ Goal: Minimize the sum of reconstruction errors:

$$\sum_{i=1}^{N} \sum_{j=1}^{D} \lVert x_{ij} - z_{ij} \rVert^2$$

§ where $x_{ij}$ are the “old” and $z_{ij}$ are the “new” coordinates

¡ SVD gives ‘best’ axis to project on:

§ ‘best’ = minimizing the reconstruction errors

¡ In other words, minimum reconstruction error
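The claim that the first right singular vector is the ‘best’ axis can be tested by brute force. The sketch below uses made-up 2-D data near a line through the origin (an assumption, not the slides' data) and compares the reconstruction error along $v_1$ with the error along a sweep of other directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D "ratings" that lie near a line through the origin.
t = rng.uniform(0, 5, size=50)
X = np.column_stack([t, 0.8 * t]) + rng.normal(scale=0.1, size=(50, 2))

def reconstruction_error(X, direction):
    """Sum of squared distances between the points and their projections on a direction."""
    v = direction / np.linalg.norm(direction)
    Z = np.outer(X @ v, v)          # projected points, expressed in the original space
    return float(np.sum((X - Z) ** 2))

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                          # first right singular vector
err_v1 = reconstruction_error(X, v1)

# Errors for 181 other directions through the origin, one per degree.
angles = np.linspace(0, np.pi, 181)
errs = [reconstruction_error(X, np.array([np.cos(a), np.sin(a)])) for a in angles]

print(err_v1, min(errs))
```

No direction in the sweep does better than $v_1$. The data is not mean-centered here, so the axis is constrained to pass through the origin.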


¡ $A = U \Sigma V^T$ - example:

§ V: “movie-to-concept” matrix

§ U: “user-to-concept” matrix

¡ $A = U \Sigma V^T$ - example:

§ Σ: the first singular value, 12.4, reflects the variance (‘spread’) on the $v_1$ axis

$A = U \Sigma V^T$ - example:

¡ $U \Sigma$: Gives the coordinates of the points along the projection axes

$$
U \Sigma =
\begin{bmatrix}
1.61 & 0.19 & -0.01 \\
5.08 & 0.66 & -0.03 \\
6.82 & 0.85 & -0.05 \\
8.43 & 1.04 & -0.06 \\
1.86 & -5.60 & 0.84 \\
0.86 & -6.93 & -0.87 \\
0.86 & -2.75 & 0.41
\end{bmatrix}
$$

The first column is the projection of users on the “Sci-Fi” axis.
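A short sketch of the same computation: the user coordinates $U \Sigma$ can be obtained either by scaling the columns of U or by projecting A onto the right singular vectors, since $A V = U \Sigma$. Variable names are illustrative.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

coords = U * s              # scales column i of U by s[i]; same as U @ np.diag(s)
coords_via_V = A @ Vt.T     # projecting each user's ratings on the v_i gives the same numbers

# Magnitude of each user's coordinate on the first axis (sign is arbitrary).
print(np.round(np.abs(coords[:, 0]), 2))
```

The first four users have ratings in proportion 1 : 3 : 4 : 5, and so do their coordinates on every axis.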


More details

¡ Q: How exactly is dim. reduction done?

¡ A: Set smallest singular values to zero

Dropping the smallest singular value (1.3), together with the third column of U and the third row of $V^T$, gives:

$$
A \approx
\begin{bmatrix}
0.13 & 0.02 \\
0.41 & 0.07 \\
0.55 & 0.09 \\
0.68 & 0.11 \\
0.15 & -0.59 \\
0.07 & -0.73 \\
0.07 & -0.29
\end{bmatrix}
\times
\begin{bmatrix}
12.4 & 0 \\
0 & 9.5
\end{bmatrix}
\times
\begin{bmatrix}
0.56 & 0.59 & 0.56 & 0.09 & 0.09 \\
0.12 & -0.02 & 0.12 & -0.69 & -0.69
\end{bmatrix}
$$

The resulting matrix B is close to A:

$$
A \approx B =
\begin{bmatrix}
0.92 & 0.95 & 0.92 & 0.01 & 0.01 \\
2.91 & 3.01 & 2.91 & -0.01 & -0.01 \\
3.90 & 4.04 & 3.90 & 0.01 & 0.01 \\
4.82 & 5.00 & 4.82 & 0.03 & 0.03 \\
0.70 & 0.53 & 0.70 & 4.11 & 4.11 \\
-0.69 & 1.34 & -0.69 & 4.78 & 4.78 \\
0.32 & 0.23 & 0.32 & 2.01 & 2.01
\end{bmatrix}
$$

Frobenius norm:

$\lVert M \rVert_F = \sqrt{\sum_{ij} M_{ij}^2}$

$\lVert A - B \rVert_F = \sqrt{\sum_{ij} (A_{ij} - B_{ij})^2}$ is “small”
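As a sketch of the truncation step, the code below keeps the two largest singular values of the ratings matrix and measures the Frobenius error. Individual entries of B may differ slightly from the slide, whose factors are rounded.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                              # number of singular values to keep
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k reconstruction

err = np.linalg.norm(A - B, ord="fro")             # Frobenius norm of the difference
print(np.round(B, 2))
print(err, s[2])
```

The error equals the dropped singular value, which is small compared with the two that were kept.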

(Figure: $A = U \Sigma V^T$ and $B = U S V^T$; B is the best approximation of A.)

¡ Theorem:

Let $A = U \Sigma V^T$ and $B = U S V^T$ where S = diagonal r x r matrix with $s_i = \sigma_i$ (i = 1…k) else $s_i = 0$; then B is a best rank(B) = k approximation to A

What do we mean by “best”:

§ B is a solution to $\min_B \lVert A - B \rVert_F$ where rank(B) = k

$$\Sigma = \mathrm{diag}(\sigma_{11}, \dots, \sigma_{rr})$$

$$\lVert A - B \rVert_F = \sqrt{\sum_{ij} (A_{ij} - B_{ij})^2}$$

¡ Theorem: Let $A = U \Sigma V^T$ ($\sigma_1 \ge \sigma_2 \ge \dots$, rank(A) = r)

then $B = U S V^T$

§ S = diagonal r x r matrix where $s_i = \sigma_i$ (i = 1…k) else $s_i = 0$

is a best rank-k approximation to A:

§ B is a solution to $\min_B \lVert A - B \rVert_F$ where rank(B) = k

¡ We will need 2 facts:

§ $\lVert M \rVert_F^2 = \sum_i (q_{ii})^2$ where M = P Q R is SVD of M

§ $U \Sigma V^T - U S V^T = U (\Sigma - S) V^T$

For the first fact we apply:

-- P column orthonormal

-- R row orthonormal

-- Q is diagonal
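One way to fill in the details of the first fact is a short trace argument. This is a sketch, not necessarily the derivation shown on the original slide. It assumes M = P Q R with $P^T P = I$, $R R^T = I$ and Q diagonal, and uses $\lVert M \rVert_F^2 = \mathrm{tr}(M^T M)$.

```latex
\begin{align*}
\lVert M \rVert_F^2
  &= \operatorname{tr}(M^T M)
   = \operatorname{tr}\!\left(R^T Q^T P^T P \, Q R\right) \\
  &= \operatorname{tr}\!\left(R^T Q^2 R\right)
     && \text{($P^T P = I$ and $Q$ diagonal)} \\
  &= \operatorname{tr}\!\left(Q^2 R R^T\right)
     && \text{(cyclic property of the trace)} \\
  &= \operatorname{tr}\!\left(Q^2\right)
   = \sum_i (q_{ii})^2
     && \text{($R R^T = I$)}
\end{align*}
```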

¡ $A = U \Sigma V^T$, $B = U S V^T$ ($\sigma_1 \ge \sigma_2 \ge \dots \ge 0$, rank(A) = r)

§ S = diagonal r x r matrix where $s_i = \sigma_i$ (i = 1…k) else $s_i = 0$

then B is a solution to $\min_B \lVert A - B \rVert_F$, rank(B) = k

¡ Why?

$$\min_{B,\ \mathrm{rank}(B)=k} \lVert A - B \rVert_F^2 = \min_{S} \lVert \Sigma - S \rVert_F^2 = \min_{s_i} \sum_{i=1}^{r} (\sigma_i - s_i)^2$$

We used: $U \Sigma V^T - U S V^T = U (\Sigma - S) V^T$

¡ We want to choose $s_i$ to minimize $\sum_i (\sigma_i - s_i)^2$, with at most k of the $s_i$ non-zero

¡ Solution is to set $s_i = \sigma_i$ (i = 1…k) and other $s_i = 0$:

$$\min_{s_i} \left[ \sum_{i=1}^{k} (\sigma_i - s_i)^2 + \sum_{i=k+1}^{r} \sigma_i^2 \right] = \sum_{i=k+1}^{r} \sigma_i^2$$
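The result can be checked numerically. The sketch below uses a random 8 x 6 matrix and k = 3 (both arbitrary assumptions): the squared error of the truncated SVD should equal the sum of the dropped $\sigma_i^2$, and random rank-k competitors should never do better.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
B = (U[:, :k] * s[:k]) @ Vt[:k, :]              # keep the k largest singular values

err_sq = np.linalg.norm(A - B, "fro") ** 2      # squared Frobenius error of B
tail_sq = float(np.sum(s[k:] ** 2))             # sum of the dropped sigma_i^2

# Random rank-k matrices as competitors; none should beat B.
competitor_errs = []
for _ in range(100):
    C = rng.normal(size=(8, k)) @ rng.normal(size=(k, 6))
    competitor_errs.append(np.linalg.norm(A - C, "fro") ** 2)

print(err_sq, tail_sq, min(competitor_errs))
```

Random competitors are a weak check, not a proof, but the equality between `err_sq` and `tail_sq` is exactly what the derivation predicts.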

Equivalent:

‘spectral decomposition’ of the matrix:

$$
A =
\begin{bmatrix} u_1 & u_2 \end{bmatrix}
\times
\begin{bmatrix} \sigma_1 & \\ & \sigma_2 \end{bmatrix}
\times
\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}
= \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots
$$

(k terms; with A drawn as n x m, each $u_i$ is n x 1 and each $v_i^T$ is 1 x m)

Assume: $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \dots \ge 0$

Why is setting small $\sigma_i$ to 0 the right thing to do?

Vectors $u_i$ and $v_i$ are unit length, so $\sigma_i$ scales them.

So, zeroing small $\sigma_i$ introduces less error.

Q: How many σs to keep?

A: Rule of thumb: keep 80-90% of ‘energy’ $= \sum_i \sigma_i^2$
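The rule of thumb can be turned into a small helper. `rank_for_energy` is a hypothetical name introduced here for illustration; it returns the smallest k whose singular values retain at least the requested fraction of the energy.

```python
import numpy as np

def rank_for_energy(s, fraction=0.9):
    """Smallest k such that the k largest singular values keep `fraction` of sum(s**2)."""
    energy = s ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(s))            # guard against rounding when fraction is close to 1

A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
k = rank_for_energy(s, 0.9)
print(k, np.round(np.cumsum(s ** 2) / np.sum(s ** 2), 3))
```

For the ratings matrix the first singular value alone keeps roughly 63% of the energy and the first two keep over 99%, so two concepts are enough.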

¡ To compute SVD:

§ $O(nm^2)$ or $O(n^2 m)$ (whichever is less)

¡ But:

§ Less work, if we just want singular values

§ or if we want first k singular vectors

§ or if the matrix is sparse

¡ Implemented in linear algebra packages like

§ LINPACK, Matlab, SPlus, Mathematica ...


¡ SVD: $A = U \Sigma V^T$: unique

§ U: user-to-concept similarities

§ V: movie-to-concept similarities

§ Σ: strength of each concept

¡ Dimensionality reduction:

§ keep the few largest singular values (80-90% of ‘energy’)

§ SVD: picks up linear correlations


¡ SVD gives us:

§ $A = U \Sigma V^T$

¡ Eigen-decomposition:

§ $A = X \Lambda X^T$

§ A is symmetric

§ U, V, X are orthonormal ($U^T U = I$),

§ Λ, Σ are diagonal

¡ Now let’s calculate:

§ $A A^T = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T (V \Sigma^T U^T) = U \Sigma \Sigma^T U^T$

§ $A^T A = V \Sigma^T U^T (U \Sigma V^T) = V \Sigma \Sigma^T V^T$

Both results have the form $X \Lambda^2 X^T$.

Shows how to compute SVD using eigenvalue decomposition!
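A minimal sketch of that connection: the eigenvalues of $A^T A$ are the squared singular values of A. It uses `np.linalg.eigh`, which is meant for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

# Eigen-decomposition of the symmetric matrix A^T A.
evals, evecs = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]                 # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

# Clip tiny negative round-off before taking square roots.
sing_from_eig = np.sqrt(np.clip(evals, 0.0, None))
s = np.linalg.svd(A, compute_uv=False)

print(np.round(sing_from_eig[:3], 4), np.round(s[:3], 4))
```

The eigenvectors in `evecs` play the role of V; the same computation on $A A^T$ would give U.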

¡ $A A^T = U \Sigma^2 U^T$

¡ $A^T A = V \Sigma^2 V^T$

¡ $(A^T A)^k = V \Sigma^{2k} V^T$

§ E.g.: $(A^T A)^2 = V \Sigma^2 V^T V \Sigma^2 V^T = V \Sigma^4 V^T$

¡ $(A^T A)^k \sim v_1 \sigma_1^{2k} v_1^T$ for k >> 1
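Because high powers of $A^T A$ are dominated by $v_1$, repeatedly multiplying a vector by $A^T A$ and normalizing converges to the first right singular vector. The sketch below is a plain power iteration; `power_iteration`, the iteration count and the seed are illustrative assumptions.

```python
import numpy as np

def power_iteration(M, iters=200, seed=0):
    """Approximate the dominant eigenvector of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)       # renormalize so the vector does not overflow
    return x

A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

M = A.T @ A
v1 = power_iteration(M)
sigma1 = np.sqrt(v1 @ M @ v1)        # Rayleigh quotient gives sigma_1 squared

_, s, Vt = np.linalg.svd(A, full_matrices=False)
print(sigma1, s[0])
```

Convergence speed depends on the ratio $(\sigma_2 / \sigma_1)^2$, which is roughly 0.58 for this matrix.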

¡ Q: Find users that like ‘Matrix’

¡ A: Map query into a ‘concept space’ – how?


q = [5 0 0 0 0] (ratings for Matrix, Alien, Serenity, Casablanca, Amelie)

(Figure: q plotted on the Matrix and Alien rating axes together with the concept vectors $v_1$ and $v_2$; the projection of q on $v_1$ is $q \cdot v_1$.)

Project into concept space:

Inner product with each ‘concept’ vector $v_i$

Compactly, we have:

$q_{concept} = q \, V$

E.g.:

$$
q_{concept} =
\begin{bmatrix} 5 & 0 & 0 & 0 & 0 \end{bmatrix}
\times
\begin{bmatrix}
0.56 & 0.12 \\
0.59 & -0.02 \\
0.56 & 0.12 \\
0.09 & -0.69 \\
0.09 & -0.69
\end{bmatrix}
=
\begin{bmatrix} 2.8 & 0.6 \end{bmatrix}
$$

The matrix holds the movie-to-concept similarities (V); the first coordinate of the result is the SciFi-concept.

¡ How would the user d that rated (‘Alien’, ‘Serenity’) be handled?

$d_{concept} = d \, V$

E.g.:

$$
d_{concept} =
\begin{bmatrix} 0 & 4 & 5 & 0 & 0 \end{bmatrix}
\times
\begin{bmatrix}
0.56 & 0.12 \\
0.59 & -0.02 \\
0.56 & 0.12 \\
0.09 & -0.69 \\
0.09 & -0.69
\end{bmatrix}
=
\begin{bmatrix} 5.2 & 0.4 \end{bmatrix}
$$

The matrix holds the movie-to-concept similarities (V); the first coordinate of the result is the SciFi-concept.

¡ Observation: User d that rated (‘Alien’, ‘Serenity’) will be similar to user q that rated (‘Matrix’), although d and q have zero ratings in common!

d = [0 4 5 0 0], mapped to [5.2 0.4] in concept space

q = [5 0 0 0 0], mapped to [2.8 0.6] in concept space

Zero ratings in common, yet similarity ≠ 0 (both load mainly on the SciFi-concept)
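The mapping into concept space can be sketched with the rounded V from the slides. Because V is rounded to two decimals here, the coordinates computed for d need not match the slide exactly. `cosine` is a helper introduced for illustration.

```python
import numpy as np

# Movie-to-concept similarities (first two columns of V), as printed on the slides.
V = np.array([
    [0.56,  0.12],   # Matrix
    [0.59, -0.02],   # Alien
    [0.56,  0.12],   # Serenity
    [0.09, -0.69],   # Casablanca
    [0.09, -0.69],   # Amelie
])

q = np.array([5, 0, 0, 0, 0], dtype=float)   # user who rated only 'Matrix'
d = np.array([0, 4, 5, 0, 0], dtype=float)   # user who rated 'Alien' and 'Serenity'

q_concept = q @ V
d_concept = d @ V

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(q_concept, d_concept)
print(q @ d, cosine(q_concept, d_concept))
```

In the original movie space the two users are orthogonal, while in concept space they point in nearly the same direction.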

+ Optimal low-rank approximation in terms of Frobenius norm

- Interpretability problem:

§ A singular vector specifies a linear combination of all input columns or rows

- Lack of sparsity:

§ Singular vectors are dense!


¡ Goal: Express A as a product of matrices C, U, R

Make $\lVert A - C \cdot U \cdot R \rVert_F$ small

¡ “Constraints” on C and R:

§ C consists of actual columns of A, and R of actual rows of A

¡ U: pseudo-inverse of the intersection of C and R

(Figure: A ≈ C · U · R.)

Frobenius norm:

$\lVert X \rVert_F = \sqrt{\sum_{ij} X_{ij}^2}$

¡ Let:

$A_k$ be the “best” rank k approximation to A (that is, $A_k$ is obtained from the SVD of A)

Theorem [Drineas et al.]

CUR in O(m·n) time achieves

§ $\lVert A - CUR \rVert_F \le \lVert A - A_k \rVert_F + \varepsilon \lVert A \rVert_F$

with probability at least $1 - \delta$, by picking

§ $O(k \log(1/\delta) / \varepsilon^2)$ columns, and

§ $O(k^2 \log^3(1/\delta) / \varepsilon^6)$ rows

In practice: Pick 4k cols/rows

¡ Sampling columns (similarly for rows):

Note: this is a randomized algorithm, so the same column can be sampled more than once

¡ Let W be the “intersection” of sampled columns C and rows R

§ Let SVD of $W = X Z Y^T$

¡ Then: $U = Y (Z^+)^2 X^T$

§ $Z^+$: reciprocals of non-zero singular values: $Z^+_{ii} = 1 / Z_{ii}$

(Figure: A with the sampled columns C, the sampled rows R, and their intersection W.)

¡ For example:

§ Select $c = O\!\left(\frac{k \log k}{\varepsilon^2}\right)$ columns of A using ColumnSelect algorithm

§ Select $r = O\!\left(\frac{k \log k}{\varepsilon^2}\right)$ rows of A using ColumnSelect algorithm

§ Set $U = W^+$

¡ Then, with probability 98%:

(Formula: a bound on the CUR error in terms of the SVD error.)

In practice: Pick 4k cols/rows for a “rank-k” approximation
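A minimal CUR sketch under stated assumptions, not the authors' ColumnSelect algorithm: columns and rows are sampled with replacement with probability proportional to their squared norm, no rescaling of the sampled columns or rows is applied, and U is taken to be the pseudo-inverse $W^+$ of the intersection, as in the “Set U = W+” step. The function names and the rank-3 test matrix are hypothetical.

```python
import numpy as np

def sample_indices(A, count, axis, rng):
    """Sample `count` column (axis=0) or row (axis=1) indices with replacement,
    with probability proportional to the squared Euclidean norm."""
    sq_norms = np.sum(A ** 2, axis=axis)
    probs = sq_norms / sq_norms.sum()
    return rng.choice(len(probs), size=count, replace=True, p=probs)

def pseudo_inverse(W, tol=1e-10):
    """W+ from the SVD W = X Z Y^T: invert only the non-negligible singular values."""
    X, z, Yt = np.linalg.svd(W, full_matrices=False)
    z_plus = np.zeros_like(z)
    keep = z > tol * z.max()
    z_plus[keep] = 1.0 / z[keep]
    return Yt.T @ np.diag(z_plus) @ X.T

def cur(A, k, rng):
    """Pick 4k columns and 4k rows (the 'in practice' rule) and build C, U, R."""
    n_pick = 4 * k
    cols = sample_indices(A, n_pick, axis=0, rng=rng)
    rows = sample_indices(A, n_pick, axis=1, rng=rng)
    C = A[:, cols]                    # actual columns of A
    R = A[rows, :]                    # actual rows of A
    W = A[np.ix_(rows, cols)]         # intersection of the sampled rows and columns
    return C, pseudo_inverse(W), R

rng = np.random.default_rng(3)
# Hypothetical test matrix of exact rank 3.
A = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))

C, U, R = cur(A, k=3, rng=rng)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(C.shape, U.shape, R.shape, rel_err)
```

For a matrix of exact rank k the reconstruction is essentially exact as soon as W also has rank k; for noisy data the error depends on how many columns and rows are sampled.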

+ Easy interpretation

• Since the basis vectors are actual columns and rows

+ Sparse basis

• Since the basis vectors are actual columns and rows

- Duplicate columns and rows

• Columns of large norms will be sampled many times

(Figure: a singular vector compared with an actual column.)

¡ If we want to get rid of the duplicates:

§ Throw them away

§ Scale (multiply) the columns/rows by the square root of the number of duplicates

(Figure: the sampled columns $C_d$ and rows $R_d$ of A, which contain duplicates, are reduced to $C_s$ and $R_s$, from which a small U is constructed.)

SVD: $A = U \Sigma V^T$

§ A: huge but sparse

§ U, $V^T$: big and dense

§ Σ: sparse and small

CUR: A = C U R

§ A: huge but sparse

§ C, R: big but sparse

§ U: dense but small

¡ DBLP bibliographic data

§ Author-to-conference big sparse matrix

§ A ij : Number of papers published by author i at conference j

§ 428K authors (rows), 3659 conferences (columns)

§ Very sparse

¡ Want to reduce dimensionality

§ How much time does it take?

§ What is the reconstruction error?

§ How much space do we need?


¡ Accuracy:

§ 1 – relative sum squared errors

¡ Space ratio:

§ #output matrix entries / #input matrix entries

¡ CPU time

(Figures: plots comparing SVD, CUR, and CUR with no duplicates.)

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM ’07.

¡ SVD is limited to linear projections:

§ Lower-dimensional linear projection that preserves Euclidean distances

¡ Non-linear methods: Isomap

§ Data lies on a nonlinear low-dim curve aka manifold

§ Use the distance as measured along the manifold

§ How?

§ Build adjacency graph

§ Geodesic distance is graph distance

§ SVD/PCA the graph pairwise distance matrix
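The three Isomap steps can be sketched in plain NumPy. Everything below is an illustrative assumption: the data (30 points on a half circle), the choice of 2 nearest neighbours, Floyd-Warshall for shortest paths, and classical MDS as the final “SVD/PCA” step.

```python
import numpy as np

# Hypothetical data: 30 points along a half circle (a 1-D manifold embedded in 2-D).
t = np.linspace(0, np.pi, 30)
X = np.column_stack([np.cos(t), np.sin(t)])
n = len(X)

# 1) Build the adjacency graph: connect each point to its k nearest neighbours.
k = 2
E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Euclidean distances
G = np.full((n, n), np.inf)
np.fill_diagonal(G, 0.0)
for i in range(n):
    nearest = np.argsort(E[i])[1:k + 1]      # position 0 is the point itself
    G[i, nearest] = E[i, nearest]
    G[nearest, i] = E[i, nearest]            # keep the graph symmetric

# 2) Geodesic distance = shortest-path distance in the graph (Floyd-Warshall).
for m in range(n):
    G = np.minimum(G, G[:, [m]] + G[[m], :])

# 3) Classical MDS on the geodesic distances: double-center, then eigen-decompose.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (G ** 2) @ J
evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
embedding = evecs[:, -1] * np.sqrt(evals[-1])

print(E[0, -1], G[0, -1])                    # straight-line vs along-the-curve distance
```

The 1-D embedding orders the points by their position along the curve, which a linear projection of the half circle onto one axis cannot do for distances.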


¡ Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.

¡ J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007

¡ P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas: Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 96-107 (2007)

¡ M. W. Mahoney, M. Maggioni, and P. Drineas: Tensor-CUR Decompositions For Tensor-Based Data, Proc. 12-th Annual SIGKDD, 327-336 (2006)
