(1)

§ Ref:

http://en.wikipedia.org/wiki/Eigenvector

§ For a square matrix A:

Ax = λx

where x is a vector (the eigenvector) and λ is a scalar (the eigenvalue)

§ E.g.

[1 3; 3 1] [1; 1] = 4 [1; 1]

[1 3; 3 1] [1; -1] = -2 [1; -1]

i.e. (1, 1) is an eigenvector of the matrix with eigenvalue 4, and (1, -1) is an eigenvector with eigenvalue -2.
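The eigenpairs in this kind of example can be checked numerically; a minimal NumPy sketch (the matrix below is the one used in the example above):

    import numpy as np

    # 2x2 matrix from the example above
    A = np.array([[1.0, 3.0],
                  [3.0, 1.0]])

    # A @ x = lam * x for each eigenpair returned by eig
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)       # ~ [ 4., -2.] (order and signs may vary)
    print(eigenvectors)      # columns ~ (1, 1)/sqrt(2) and (1, -1)/sqrt(2), i.e. normalized eigenvectors

    # verify the defining relation for the first returned pair
    x, lam = eigenvectors[:, 0], eigenvalues[0]
    assert np.allclose(A @ x, lam * x)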

(2)

§ Linear algebra: A x = b

§ Eigenvector: A x = λ x

§ Eigenvectors (of a symmetric matrix) are orthogonal and can be treated as independent directions

§ The eigenvectors form a basis for the space on which A acts

§ Useful for

§ Solving linear equations

§ Determining the natural frequencies of a bridge

(3)
(4)

§ A = U S V^T

- example: Users to Movies

(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

A (users × movies; columns: Matrix, Alien, Serenity, Casablanca, Amelie; the first four users are the "SciFi" group, the last three the "Romance" group):

    1 1 1 0 0
    3 3 3 0 0
    4 4 4 0 0
    5 5 5 0 0
    0 2 0 4 4
    0 0 0 5 5
    0 1 0 2 2

= U x S x V^T, with

U (user-to-concept):

    0.13  0.02 -0.01
    0.41  0.07 -0.03
    0.55  0.09 -0.04
    0.68  0.11 -0.05
    0.15 -0.59  0.65
    0.07 -0.73 -0.67
    0.07 -0.29  0.32

S (concept strengths):

    12.4  0    0
     0    9.5  0
     0    0    1.3

V^T (concept-to-movie):

    0.56  0.59  0.56  0.09  0.09
    0.12 -0.02  0.12 -0.69 -0.69
    0.40 -0.80  0.40  0.09  0.09
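As a sanity check, the factorization above can be reproduced with a few lines of NumPy; this is only a sketch, and the signs of the singular-vector columns may differ from the values printed on the slide:

    import numpy as np

    # User-to-movie ratings (columns: Matrix, Alien, Serenity, Casablanca, Amelie)
    A = np.array([[1, 1, 1, 0, 0],
                  [3, 3, 3, 0, 0],
                  [4, 4, 4, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 2, 0, 4, 4],
                  [0, 0, 0, 5, 5],
                  [0, 1, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(np.round(s, 1))         # ~ [12.4, 9.5, 1.3, 0, 0]: three significant "concepts"
    print(np.round(U[:, :3], 2))  # user-to-concept matrix
    print(np.round(Vt[:3, :], 2)) # concept-to-movie matrix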

(5)
(6)

§ SVD decomposes the Term × Document matrix X as X = U S V^T

§ where U and V are the left and right singular vector matrices, and S is a diagonal matrix of singular values

§ U: matrix of eigenvectors of X X^T

§ V: matrix of eigenvectors of X^T X

§ S: diagonal matrix of singular values, the square roots of the corresponding eigenvalues of X X^T and X^T X (see the numerical check below)
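These relationships can be verified numerically; a small sketch on a random matrix X (the matrix here is made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((6, 4))                 # toy term-by-document matrix, illustration only

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # squared singular values = eigenvalues of X X^T (and of X^T X)
    eigvals = np.linalg.eigvalsh(X @ X.T)  # ascending order
    assert np.allclose(np.sort(s ** 2), eigvals[-len(s):])

    # columns of U are eigenvectors of X X^T (up to sign); compare the leading ones
    w, E = np.linalg.eigh(X @ X.T)
    k = 3
    E_top = E[:, np.argsort(w)[::-1][:k]]
    assert np.allclose(np.abs(E_top), np.abs(U[:, :k]))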

(7)

Note: M is a diagonal matrix.

M v1 = σ1 u1
M v2 = σ2 u2

(8)

The vectors v1 and v2 are orthogonal unit vectors, so any vector x can be written as

x = (v1 · x) v1 + (v2 · x) v2

Hence:

M x = (v1 · x) M v1 + (v2 · x) M v2
M x = (v1 · x) σ1 u1 + (v2 · x) σ2 u2

Since v · x = v^T x, this becomes:

M x = u1 σ1 v1^T x + u2 σ2 v2^T x

Finally:

M = u1 σ1 v1^T + u2 σ2 v2^T

which is usually written as:

M = U Σ V^T
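The last line is the outer-product form of the SVD, M = Σ_i σ_i u_i v_i^T, which is easy to check numerically; a minimal sketch with an arbitrary illustrative matrix:

    import numpy as np

    M = np.array([[3.0, 0.0],
                  [4.0, 5.0]])    # made-up 2x2 matrix, for illustration only

    U, s, Vt = np.linalg.svd(M)

    # M equals the sum of rank-one terms sigma_i * u_i * v_i^T
    M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
    assert np.allclose(M, M_rebuilt)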

(9)
(10)

§ Matlab SVD

§ Command:

§ [U,S,V] = svds(z, 2)
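A rough Python equivalent of this command, assuming z is available as a NumPy array or SciPy sparse matrix, is scipy.sparse.linalg.svds; the z below is only a placeholder, and the results are reordered because SciPy does not guarantee Matlab's descending order of singular values:

    import numpy as np
    from scipy.sparse.linalg import svds

    z = np.random.rand(100, 50)             # placeholder for the slide's matrix z

    U, s, Vt = svds(z, k=2)                 # two largest singular triplets, like svds(z, 2)

    order = np.argsort(s)[::-1]             # sort singular values in descending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    S = np.diag(s)                          # so that z ~ U @ S @ Vt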

(11)

http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

§ LSI: find the k dimensions that minimize the Frobenius norm of A - A' (a truncated-SVD sketch of this is given below).

§ Frobenius norm of A: ||A||_F = sqrt( Σ_i Σ_j a_ij² )

§ pLSI: defines its own objective function to optimize (maximize the likelihood of the data)
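A minimal sketch of this LSI objective: the truncated SVD keeps the k strongest dimensions and is the rank-k matrix A' that minimizes the Frobenius norm of A - A' (A below is the users-to-movies matrix from the earlier SVD example):

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
                  [0, 2, 0, 4, 4], [0, 0, 0, 5, 5], [0, 1, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                                              # number of latent dimensions to keep
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation A'

    # the approximation error equals the energy in the discarded singular values
    print(np.linalg.norm(A - A_k, 'fro'))              # ~ 1.3
    print(np.sqrt(np.sum(s[k:] ** 2)))                 # same value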

(12)

What is a generative probabilistic model?

It has roughly the following procedure:

1 Assume the data we see is generated by some parameterized random process.

2 Learn the parameters that best explain the data.

3 Use the model to predict (infer) new data, based on data seen so far.

Benefits compared to non-generative models

Assumptions and model are explicit.

For the inference and learning step we can use well-known algorithms (e.g. EM, Markov Chain Monte Carlo).

pLSI – a probabilistic approach

Models each word in a document as a sample from a mixture model.

p(w) = Σ_{i=1..k} p(z_i) p(w | z_i),   subject to   Σ_{i=1..k} p(z_i) = 1

Introduces the concept of a topic.

[Figure: example topic-word distributions p(w|z) over the words proof, induction, object, bouquet, memory for three topics: "maths", "computer science", and "wine".]

Each word is generated from one topic; different words in the same document can be generated from different topics (a toy sampling sketch follows below).

Problems: (1) not a well-defined generative model at the document level, (2) overfitting (kV + kM parameters).
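As a toy illustration of this mixture, generating a word means first drawing a topic z from p(z) and then a word from p(w|z); the sketch below uses made-up topic weights and the example topic-word probabilities that appear later in these slides:

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["proof", "induction", "object", "bouquet", "memory"]
    p_z = np.array([0.5, 0.3, 0.2])            # made-up p(z), sums to 1
    p_w_given_z = np.array([                    # p(w|z): maths, computer science, wine
        [0.45, 0.35, 0.20, 0.00, 0.00],
        [0.23, 0.17, 0.35, 0.00, 0.25],
        [0.00, 0.00, 0.10, 0.75, 0.15],
    ])

    z = rng.choice(len(p_z), p=p_z)             # draw a topic
    w = rng.choice(vocab, p=p_w_given_z[z])     # draw a word from that topic
    print(z, w)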

(13)

§ Assume a multinomial distribution

§ Distribution of topics (z)

§ Question: how to determine z?

§ Likelihood

§ E-step

§ M-step
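For reference, a standard form of these pLSI EM steps (using the symmetric parameterization P(d, w) = Σ_z P(z) P(d|z) P(w|z) from the next slide, following Hofmann, 1999) is:

    P(z \mid d, w) = \frac{P(z) P(d \mid z) P(w \mid z)}{\sum_{z'} P(z') P(d \mid z') P(w \mid z')}   (E-step)

    P(w \mid z) \propto \sum_{d} n(d, w) P(z \mid d, w), \qquad
    P(d \mid z) \propto \sum_{w} n(d, w) P(z \mid d, w), \qquad
    P(z) \propto \sum_{d, w} n(d, w) P(z \mid d, w)   (M-step)

where n(d, w) is the number of occurrences of word w in document d.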

(14)

§ Relation

§ Difference:

§ LSI: minimizes the Frobenius (L2) norm ~ an additive Gaussian noise assumption on the counts

§ pLSI: maximizes the log-likelihood of the training data ~ cross-entropy / KL-divergence

P(d, w) = Σ_{z ∈ Z} P(z) P(d | z) P(w | z)

Mixture of Unigrams Model (this is just Naïve Bayes)

For each of M documents,

§ Choose a topic z.

§ Choose N words by drawing each one independently from a multinomial conditioned on z.

In the Mixture of Unigrams model, we can only have one topic per document!

[Graphical model: a single topic node z_i generating the words w_i1 ... w_i4.]

(15)

Probabilistic Latent Semantic Indexing (pLSI) Model

For each word of document d in the training set,

§ Choose a topic z according to a multinomial conditioned on the index d.

§ Generate the word by drawing from a multinomial conditioned on z.

In pLSI, documents can have multiple topics.

[Graphical model: the document index d generates topics z_d1 ... z_d4, and each topic z_dn generates the corresponding word w_dn.]

§ It is not a proper generative model for documents:

§ a document is generated from a mixture of topics,

§ but the number of parameters (one topic mixture per training document) grows linearly with the size of the corpus,

§ and it is difficult to generate a new (unseen) document.

(16)

§ In the LDA model, we would like to say that the topic mixture proportions for each document are drawn from some distribution.

§ So, we want to put a distribution on multinomials, that is, on k-tuples of non-negative numbers that sum to one.

§ The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex, which is just a generalization of a triangle to (k-1) dimensions.

§ Criteria for selecting our prior:

§ It needs to be defined over the (k-1)-simplex.

§ Algebraically speaking, we would like it to play nicely with the multinomial distribution.

Useful facts about the Dirichlet distribution:

§ This distribution is defined over a (k-1)-simplex. That is, it takes k non-negative arguments which sum to one. Consequently it is a natural distribution to use over multinomial distributions.

§ In fact, the Dirichlet distribution is the conjugate prior to the multinomial distribution. (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)

§ The Dirichlet parameter α_i can be thought of as a prior count of the i-th class.
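These facts are easy to see numerically; a small NumPy sketch with made-up α and counts (a Dirichlet sample lies on the simplex, and a multinomial observation simply adds its counts to the Dirichlet parameters):

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = np.array([2.0, 1.0, 1.0])    # made-up Dirichlet parameters, k = 3
    theta = rng.dirichlet(alpha)         # one sample: a point on the (k-1)-simplex
    print(theta, theta.sum())            # non-negative entries that sum to 1

    # conjugacy: multinomial counts observed -> posterior is Dirichlet(alpha + counts)
    counts = np.array([5, 0, 3])         # made-up observed counts per class
    posterior_alpha = alpha + counts     # alpha_i behaves like a prior count of class i
    print(posterior_alpha)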

(17)

[Graphical models: plate diagrams of the form θ → z_n → w_n; in the LDA version, θ is drawn from a Dirichlet with parameter α and each word w_n additionally depends on the topic-word parameter β.]

§ For each document,

§ Choose θ ~ Dirichlet(α)

§ For each of the N words w_n:

§ Choose a topic z_n ~ Multinomial(θ)

§ Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
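A minimal sketch of this generative process in NumPy; α is made up, and β reuses the k × V example table shown on the next slides:

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["proof", "induction", "object", "bouquet", "memory"]
    alpha = np.array([1.0, 1.0, 1.0])          # made-up Dirichlet parameter, k = 3 topics
    beta = np.array([                           # k x V topic-word probabilities
        [0.45, 0.35, 0.20, 0.00, 0.00],         # "maths"
        [0.23, 0.17, 0.35, 0.00, 0.25],         # "comp sc."
        [0.00, 0.00, 0.10, 0.75, 0.15],         # "wine"
    ])

    N = 8                                       # words in this document
    theta = rng.dirichlet(alpha)                # theta ~ Dirichlet(alpha)
    doc = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        doc.append(rng.choice(vocab, p=beta[z]))  # w_n ~ p(w | z_n, beta)
    print(doc)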

(18)

§ Document = mixture of topics (as in pLSI), but according to a Dirichlet prior

§ When we use a uniform Dirichlet prior, pLSI = LDA

§ A word is also generated according to another variable β:

LDA: The algorithm (2/2)

[Figure: the LDA plate diagram, with topic assignments z and words w inside plates of size N (words) and M (documents).]

More about α and β

α and β are corpus-level parameters, sampled once per corpus.

β is a k × V matrix. For the pLSI example:

    topic      proof   induction   object   bouquet   memory
    maths      0.45    0.35        0.2      0         0
    comp sc.   0.23    0.17        0.35     0         0.25
    wine       0       0           0.1      0.75      0.15

α is a k-vector. It tells us how much the Dirichlet prior scatters around the different topics.

(19)

Inference: The problem

To which topics does a given document belong? We thus want to compute the posterior distribution of the hidden variables given a document:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

where

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1..N} p(z_n | θ) p(w_n | z_n, β)

and

p(w | α, β) = [ Γ(Σ_i α_i) / ∏_i Γ(α_i) ] ∫ ( ∏_{i=1..k} θ_i^{α_i - 1} ) ( ∏_{n=1..N} Σ_{i=1..k} ∏_{j=1..V} (θ_i β_ij)^{w_n^j} ) dθ.

This not only looks awkward, it is also computationally intractable in general, because of the coupling between θ and β_ij. Solution: approximations.

• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.
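For reference, the coordinate-ascent updates that this approximation yields (as derived in Blei et al., 2003, with the variational distribution factorized as q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n)) are, per document:

    \phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{k} \gamma_j \Big) \right)

    \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where Ψ is the digamma function; the two updates are iterated until convergence.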

(20)

Variational inference

Replace the graphical model of LDA by a simpler one.

[Figure: the LDA graphical model next to the simplified variational model.]

Minimize the Kullback-Leibler divergence between the two distributions. This is a standard trick in machine learning, often equivalent to maximum likelihood, and it leads to a gradient-descent procedure.

The variational approach stands in contrast to the often-used Markov chain Monte Carlo. Variational inference is not a sampling approach where one samples from the posterior distribution; rather, we try to find the tightest lower bound.

The problematic coupling between β and θ is not present in the simpler graphical model.

How LDA performs on Reuters data (1/2)

About the experiments

A 100-topic LDA model was trained on a 16,000-document corpus of news articles by Reuters (the news agency). Some standard stop words were removed.

Top words from some of the p(w|z)

    "Arts"      "Budgets"    "Children"   "Education"
    new         million      children     school
    film        tax          women        students
    show        program      people       schools
    music       budget       child        education
    movie       billion      years        teachers
    play        federal      families     high
    musical     year         work         public

(21)

How LDA performs on Reuters data (2/2)

Inference on a held-out document

Again: "Arts", "Budgets", "Children", "Education".

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants.

§ A widely used topic model

§ Complexity is an issue

§ Use in IR:

§ Interpolate a topic model with a traditional LM (see the schematic formula after this list)

§ Improvements over a traditional LM,

§ but no improvement over the Relevance Model (Wei and Croft, SIGIR 06)
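Schematically, the interpolation mentioned above scores a word under a mixture of the document language model and the topic model; a simplified form of such LDA-based document models (the λ here is a tuning parameter, not something specified in these slides) is:

    P(w \mid d) = \lambda \, P_{LM}(w \mid d) + (1 - \lambda) \sum_{z} P(w \mid z) P(z \mid d)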

(22)


§ LSI

§ Deerwester, S., et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.

§ Michael W. Berry, Susan T. Dumais and Gavin W. O'Brien, Using Linear Algebra for Intelligent Information Retrieval, UT-CS-94-270, 1994.

§ pLSI

§ Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.

§ LDA

§ Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003.

§ Finding Scientific Topics. Griffiths, T., & Steyvers, M. (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.

§ Hierarchical topic models and the nested Chinese restaurant process. D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. In S. Thrun, L. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems (NIPS) 16, Cambridge, MA, 2004. MIT Press.

§ Also see Wikipedia articles on LSI, pLSI and LDA
