(1)

§ Ref:

http://en.wikipedia.org/wiki/Eigenvector

§ For a square matrix A:

Ax = λx

where x is a vector (the eigenvector) and λ is a scalar (the eigenvalue)

§ E.g.

[1 3; 3 1] [1; 1] = 4 [1; 1]

[1 3; 3 1] [1; -1] = -2 [1; -1]

i.e. (1, 1) is an eigenvector of the matrix with eigenvalue 4, and (1, -1) is an eigenvector with eigenvalue -2.
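The eigenpairs in this kind of example can be checked numerically; a minimal NumPy sketch (the matrix below is the one used in the example above):

    import numpy as np

    # 2x2 matrix from the example above
    A = np.array([[1.0, 3.0],
                  [3.0, 1.0]])

    # A @ x = lam * x for each eigenpair returned by eig
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)       # ~ [ 4., -2.] (order and signs may vary)
    print(eigenvectors)      # columns ~ (1, 1)/sqrt(2) and (1, -1)/sqrt(2), i.e. normalized eigenvectors

    # verify the defining relation for the first returned pair
    x, lam = eigenvectors[:, 0], eigenvalues[0]
    assert np.allclose(A @ x, lam * x)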

(2)

§ Linear algebra: A x = b

§ Eigenvector: A x = λ x

§ Eigenvectors (of a symmetric matrix) are orthogonal and can be treated as independent directions

§ The eigenvectors form a basis for the space on which A acts

§ Useful for

§ Solving linear equations

§ Determining the natural frequencies of a bridge

(3)
(4)

§ A = U S V^T

- example: Users to Movies

(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

A (users × movies; columns: Matrix, Alien, Serenity, Casablanca, Amelie; the first four users are the "SciFi" group, the last three the "Romance" group):

    1 1 1 0 0
    3 3 3 0 0
    4 4 4 0 0
    5 5 5 0 0
    0 2 0 4 4
    0 0 0 5 5
    0 1 0 2 2

= U x S x V^T, with

U (user-to-concept):

    0.13  0.02 -0.01
    0.41  0.07 -0.03
    0.55  0.09 -0.04
    0.68  0.11 -0.05
    0.15 -0.59  0.65
    0.07 -0.73 -0.67
    0.07 -0.29  0.32

S (concept strengths):

    12.4  0    0
     0    9.5  0
     0    0    1.3

V^T (concept-to-movie):

    0.56  0.59  0.56  0.09  0.09
    0.12 -0.02  0.12 -0.69 -0.69
    0.40 -0.80  0.40  0.09  0.09
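As a sanity check, the factorization above can be reproduced with a few lines of NumPy; this is only a sketch, and the signs of the singular-vector columns may differ from the values printed on the slide:

    import numpy as np

    # User-to-movie ratings (columns: Matrix, Alien, Serenity, Casablanca, Amelie)
    A = np.array([[1, 1, 1, 0, 0],
                  [3, 3, 3, 0, 0],
                  [4, 4, 4, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 2, 0, 4, 4],
                  [0, 0, 0, 5, 5],
                  [0, 1, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(np.round(s, 1))         # ~ [12.4, 9.5, 1.3, 0, 0]: three significant "concepts"
    print(np.round(U[:, :3], 2))  # user-to-concept matrix
    print(np.round(Vt[:3, :], 2)) # concept-to-movie matrix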

(5)
(6)

§ SVD decomposes the Term × Document matrix X as X = U S V^T

§ where U and V are the left and right singular vector matrices, and S is a diagonal matrix of singular values

§ U: matrix of eigenvectors of X X^T

§ V: matrix of eigenvectors of X^T X

§ S: diagonal matrix of singular values, the square roots of the corresponding eigenvalues of X X^T and X^T X (see the numerical check below)
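These relationships can be verified numerically; a small sketch on a random matrix X (the matrix here is made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((6, 4))                 # toy term-by-document matrix, illustration only

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # squared singular values = eigenvalues of X X^T (and of X^T X)
    eigvals = np.linalg.eigvalsh(X @ X.T)  # ascending order
    assert np.allclose(np.sort(s ** 2), eigvals[-len(s):])

    # columns of U are eigenvectors of X X^T (up to sign); compare the leading ones
    w, E = np.linalg.eigh(X @ X.T)
    k = 3
    E_top = E[:, np.argsort(w)[::-1][:k]]
    assert np.allclose(np.abs(E_top), np.abs(U[:, :k]))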

(7)

Note: M is a diagonal matrix.

M v1 = σ1 u1
M v2 = σ2 u2

(8)

The vectors v1 and v2 are orthogonal unit vectors, so any vector x can be written as

x = (v1 · x) v1 + (v2 · x) v2

Hence:

M x = (v1 · x) M v1 + (v2 · x) M v2
M x = (v1 · x) σ1 u1 + (v2 · x) σ2 u2

Since v · x = v^T x, this becomes:

M x = u1 σ1 v1^T x + u2 σ2 v2^T x

Finally:

M = u1 σ1 v1^T + u2 σ2 v2^T

which is usually written as:

M = U Σ V^T
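The last line is the outer-product form of the SVD, M = Σ_i σ_i u_i v_i^T, which is easy to check numerically; a minimal sketch with an arbitrary illustrative matrix:

    import numpy as np

    M = np.array([[3.0, 0.0],
                  [4.0, 5.0]])    # made-up 2x2 matrix, for illustration only

    U, s, Vt = np.linalg.svd(M)

    # M equals the sum of rank-one terms sigma_i * u_i * v_i^T
    M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
    assert np.allclose(M, M_rebuilt)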

(9)
(10)

§ Matlab SVD

§ Command:

§ [U,S,V] = svds(z, 2)
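A rough Python equivalent of this command, assuming z is available as a NumPy array or SciPy sparse matrix, is scipy.sparse.linalg.svds; the z below is only a placeholder, and the results are reordered because SciPy does not guarantee Matlab's descending order of singular values:

    import numpy as np
    from scipy.sparse.linalg import svds

    z = np.random.rand(100, 50)             # placeholder for the slide's matrix z

    U, s, Vt = svds(z, k=2)                 # two largest singular triplets, like svds(z, 2)

    order = np.argsort(s)[::-1]             # sort singular values in descending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    S = np.diag(s)                          # so that z ~ U @ S @ Vt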

(11)

http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

§ LSI: find the k dimensions that minimize the Frobenius norm of A - A' (a truncated-SVD sketch of this is given below).

§ Frobenius norm of A: ||A||_F = sqrt( Σ_i Σ_j a_ij² )

§ pLSI: defines its own objective function to optimize (maximize the likelihood of the data)
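A minimal sketch of this LSI objective: the truncated SVD keeps the k strongest dimensions and is the rank-k matrix A' that minimizes the Frobenius norm of A - A' (A below is the users-to-movies matrix from the earlier SVD example):

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
                  [0, 2, 0, 4, 4], [0, 0, 0, 5, 5], [0, 1, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                                              # number of latent dimensions to keep
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation A'

    # the approximation error equals the energy in the discarded singular values
    print(np.linalg.norm(A - A_k, 'fro'))              # ~ 1.3
    print(np.sqrt(np.sum(s[k:] ** 2)))                 # same value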

(12)

What is a generative probabilistic model?

It has roughly the following procedure:

1 Assume the data we see is generated by some parameterized random process.

2 Learn the parameters that best explain the data.

3 Use the model to predict (infer) new data, based on data seen so far.

Benefits compared to non-generative models

Assumptions and model are explicit.

For the inference and learning step we can use well-known algorithms (e.g. EM, Markov Chain Monte Carlo).

pLSI – a probabilistic approach

Models each word in a document as a sample from a mixture model.

p(w) = Σ_{i=1..k} p(z_i) p(w | z_i),   subject to   Σ_{i=1..k} p(z_i) = 1

Introduces the concept of a topic.

[Figure: example topic-word distributions p(w|z) over the words proof, induction, object, bouquet, memory for three topics: "maths", "computer science", and "wine".]

Each word is generated from one topic; different words in the same document can be generated from different topics (a toy sampling sketch follows below).

Problems: (1) not a well-defined generative model at the document level, (2) overfitting (kV + kM parameters).
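As a toy illustration of this mixture, generating a word means first drawing a topic z from p(z) and then a word from p(w|z); the sketch below uses made-up topic weights and the example topic-word probabilities that appear later in these slides:

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["proof", "induction", "object", "bouquet", "memory"]
    p_z = np.array([0.5, 0.3, 0.2])            # made-up p(z), sums to 1
    p_w_given_z = np.array([                    # p(w|z): maths, computer science, wine
        [0.45, 0.35, 0.20, 0.00, 0.00],
        [0.23, 0.17, 0.35, 0.00, 0.25],
        [0.00, 0.00, 0.10, 0.75, 0.15],
    ])

    z = rng.choice(len(p_z), p=p_z)             # draw a topic
    w = rng.choice(vocab, p=p_w_given_z[z])     # draw a word from that topic
    print(z, w)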

(13)

§ Assume a multinomial distribution

§ Distribution of topics (z)

§ Question: how to determine z?

§ Likelihood

§ E-step

§ M-step
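For reference, a standard form of these pLSI EM steps (using the symmetric parameterization P(d, w) = Σ_z P(z) P(d|z) P(w|z) from the next slide, following Hofmann, 1999) is:

    P(z \mid d, w) = \frac{P(z) P(d \mid z) P(w \mid z)}{\sum_{z'} P(z') P(d \mid z') P(w \mid z')}   (E-step)

    P(w \mid z) \propto \sum_{d} n(d, w) P(z \mid d, w), \qquad
    P(d \mid z) \propto \sum_{w} n(d, w) P(z \mid d, w), \qquad
    P(z) \propto \sum_{d, w} n(d, w) P(z \mid d, w)   (M-step)

where n(d, w) is the number of occurrences of word w in document d.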

(14)

§ Relation

§ Difference:

§ LSI: minimizes the Frobenius (L2) norm ~ an additive Gaussian noise assumption on the counts

§ pLSI: maximizes the log-likelihood of the training data ~ cross-entropy / KL-divergence

P(d, w) = Σ_{z ∈ Z} P(z) P(d | z) P(w | z)

Mixture of Unigrams Model (this is just Naïve Bayes)

For each of M documents,

§ Choose a topic z.

§ Choose N words by drawing each one independently from a multinomial conditioned on z.

In the Mixture of Unigrams model, we can only have one topic per document!

[Graphical model: a single topic node z_i generating the words w_i1 ... w_i4.]

(15)

Probabilistic Latent Semantic Indexing (pLSI) Model

For each word of document d in the training set,

§ Choose a topic z according to a multinomial conditioned on the index d.

§ Generate the word by drawing from a multinomial conditioned on z.

In pLSI, documents can have multiple topics.

[Graphical model: the document index d generates topics z_d1 ... z_d4, and each topic z_dn generates the corresponding word w_dn.]

§ It is not a proper generative model for documents:

§ a document is generated from a mixture of topics,

§ but the number of parameters (one topic mixture per training document) grows linearly with the size of the corpus,

§ and it is difficult to generate a new (unseen) document.

(16)

§ In the LDA model, we would like to say that the topic mixture proportions for each document are drawn from some distribution.

§ So, we want to put a distribution on multinomials, that is, on k-tuples of non-negative numbers that sum to one.

§ The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex, which is just a generalization of a triangle to (k-1) dimensions.

§ Criteria for selecting our prior:

§ It needs to be defined over the (k-1)-simplex.

§ Algebraically speaking, we would like it to play nicely with the multinomial distribution.

Useful facts about the Dirichlet distribution:

§ This distribution is defined over a (k-1)-simplex. That is, it takes k non-negative arguments which sum to one. Consequently it is a natural distribution to use over multinomial distributions.

§ In fact, the Dirichlet distribution is the conjugate prior to the multinomial distribution. (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)

§ The Dirichlet parameter α_i can be thought of as a prior count of the i-th class.
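These facts are easy to see numerically; a small NumPy sketch with made-up α and counts (a Dirichlet sample lies on the simplex, and a multinomial observation simply adds its counts to the Dirichlet parameters):

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = np.array([2.0, 1.0, 1.0])    # made-up Dirichlet parameters, k = 3
    theta = rng.dirichlet(alpha)         # one sample: a point on the (k-1)-simplex
    print(theta, theta.sum())            # non-negative entries that sum to 1

    # conjugacy: multinomial counts observed -> posterior is Dirichlet(alpha + counts)
    counts = np.array([5, 0, 3])         # made-up observed counts per class
    posterior_alpha = alpha + counts     # alpha_i behaves like a prior count of class i
    print(posterior_alpha)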

(17)

[Graphical models: plate diagrams of the form θ → z_n → w_n; in the LDA version, θ is drawn from a Dirichlet with parameter α and each word w_n additionally depends on the topic-word parameter β.]

§ For each document,

§ Choose θ ~ Dirichlet(α)

§ For each of the N words w_n:

§ Choose a topic z_n ~ Multinomial(θ)

§ Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
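A minimal sketch of this generative process in NumPy; α is made up, and β reuses the k × V example table shown on the next slides:

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["proof", "induction", "object", "bouquet", "memory"]
    alpha = np.array([1.0, 1.0, 1.0])          # made-up Dirichlet parameter, k = 3 topics
    beta = np.array([                           # k x V topic-word probabilities
        [0.45, 0.35, 0.20, 0.00, 0.00],         # "maths"
        [0.23, 0.17, 0.35, 0.00, 0.25],         # "comp sc."
        [0.00, 0.00, 0.10, 0.75, 0.15],         # "wine"
    ])

    N = 8                                       # words in this document
    theta = rng.dirichlet(alpha)                # theta ~ Dirichlet(alpha)
    doc = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        doc.append(rng.choice(vocab, p=beta[z]))  # w_n ~ p(w | z_n, beta)
    print(doc)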

(18)

§ Document = mixture of topics (as in pLSI), but according to a Dirichlet prior

§ When we use a uniform Dirichlet prior, pLSI = LDA

§ A word is also generated according to another variable β:

LDA: The algorithm (2/2)

[Figure: the LDA plate diagram, with topic assignments z and words w inside plates of size N (words) and M (documents).]

More about α and β

α and β are corpus-level parameters, sampled once per corpus.

β is a k × V matrix. For the pLSI example:

    topic      proof   induction   object   bouquet   memory
    maths      0.45    0.35        0.2      0         0
    comp sc.   0.23    0.17        0.35     0         0.25
    wine       0       0           0.1      0.75      0.15

α is a k-vector. It tells us how much the Dirichlet prior scatters around the different topics.

(19)

Inference: The problem

To which topics does a given document belong? We thus want to compute the posterior distribution of the hidden variables given a document:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

where

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1..N} p(z_n | θ) p(w_n | z_n, β)

and

p(w | α, β) = [ Γ(Σ_i α_i) / ∏_i Γ(α_i) ] ∫ ( ∏_{i=1..k} θ_i^{α_i - 1} ) ( ∏_{n=1..N} Σ_{i=1..k} ∏_{j=1..V} (θ_i β_ij)^{w_n^j} ) dθ.

This not only looks awkward, it is also computationally intractable in general, because of the coupling between θ and β_ij. Solution: approximations.

• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.
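For reference, the coordinate-ascent updates that this approximation yields (as derived in Blei et al., 2003, with the variational distribution factorized as q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n)) are, per document:

    \phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{k} \gamma_j \Big) \right)

    \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where Ψ is the digamma function; the two updates are iterated until convergence.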

(20)

Variational inference

Replace the graphical model of LDA by a simpler one.

[Figure: the LDA graphical model next to the simplified variational model.]

Minimize the Kullback-Leibler divergence between the two distributions. This is a standard trick in machine learning, often equivalent to maximum likelihood, and it leads to a gradient-descent procedure.

The variational approach stands in contrast to the often-used Markov chain Monte Carlo. Variational inference is not a sampling approach where one samples from the posterior distribution; rather, we try to find the tightest lower bound.

The problematic coupling between β and θ is not present in the simpler graphical model.

How LDA performs on Reuters data (1/2)

About the experiments

A 100-topic LDA model was trained on a 16,000-document corpus of news articles by Reuters (the news agency). Some standard stop words were removed.

Top words from some of the p(w|z)

    "Arts"      "Budgets"    "Children"   "Education"
    new         million      children     school
    film        tax          women        students
    show        program      people       schools
    music       budget       child        education
    movie       billion      years        teachers
    play        federal      families     high
    musical     year         work         public

(21)

How LDA performs on Reuters data (2/2)

Inference on a held-out document

Again: "Arts", "Budgets", "Children", "Education".

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants.

§ A widely used topic model

§ Complexity is an issue

§ Use in IR:

§ Interpolate a topic model with a traditional LM (see the schematic formula after this list)

§ Improvements over a traditional LM,

§ but no improvement over the Relevance Model (Wei and Croft, SIGIR 06)
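Schematically, the interpolation mentioned above scores a word under a mixture of the document language model and the topic model; a simplified form of such LDA-based document models (the λ here is a tuning parameter, not something specified in these slides) is:

    P(w \mid d) = \lambda \, P_{LM}(w \mid d) + (1 - \lambda) \sum_{z} P(w \mid z) P(z \mid d)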

(22)


§ LSI

§ Deerwester, S., et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.

§ Michael W. Berry, Susan T. Dumais and Gavin W. O'Brien, Using Linear Algebra for Intelligent Information Retrieval, UT-CS-94-270, 1994.

§ pLSI

§ Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.

§ LDA

§ Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003.

§ Finding Scientific Topics. Griffiths, T., & Steyvers, M. (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.

§ Hierarchical topic models and the nested Chinese restaurant process. D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. In S. Thrun, L. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems (NIPS) 16, Cambridge, MA, 2004. MIT Press.

§ Also see Wikipedia articles on LSI, pLSI and LDA
