© Deng Cai, College of Computer Science, Zhejiang University

(1)

So Far…

It’s time for

• Unsupervised learning

• We are only given inputs

• Goal: find “interesting patterns”

• Discovering clusters

– Clustering

• Discovering latent factors

– Dimensionality reduction

– Topic modeling

– Matrix factorization


(2)

Deng Cai (蔡登)

College of Computer Science, Zhejiang University



Topic Modeling

(3)

Text Analysis

Text data

• Web page

• Emails

• Documents

• …

www.betaversion.org/~stefano/linotype/news/26/

(4)

Bag‐of‐Words (BOW)

Assumes that the order of words has no significance

e.g., the phrase “home made” has the same probability as “made home”

It is a simplifying assumption used in natural language processing and information retrieval
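The bag-of-words assumption can be illustrated with a minimal sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this example, not part of the lecture.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to the multiset of its lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


# Word order is discarded, so both phrases get the same representation.
print(bag_of_words("home made") == bag_of_words("made home"))
```

A `Counter` keeps term frequencies but no positions, which is exactly the information a bag-of-words model retains.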

(5)

Salton’s Vector Space Model (Prior to 1988)

Represent each document by a high-dimensional vector in the space of words

(6)

Document‐Term Matrix

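$D = \{d_1, \ldots, d_I\}$: document collection

$W = \{w_1, \ldots, w_J\}$: lexicon/vocabulary

Example document $d_i$: “Texas Instruments said it has developed the first 32‐bit computer chip designed specifically for artificial intelligence applications [...]”, with example term $w_j$ = “intelligence”

Term weighting turns each document into a vector over the vocabulary: $d_i = (\ldots, 1, 0, 0, \ldots, 2, \ldots)^t$

[Figure: document-term matrix $X$ with rows $d_1, \ldots, d_i, \ldots, d_I$ (documents $D$) and columns $w_1, \ldots, w_j, \ldots, w_J$ (terms $W$)]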
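As a rough sketch of how such a matrix can be built, the following uses raw term counts as the term weighting. The function name `document_term_matrix`, the whitespace tokenizer, and the two toy documents are assumptions for illustration only.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, X) where X[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    X = []
    for tokens in tokenized:
        counts = Counter(tokens)
        X.append([counts[term] for term in vocabulary])
    return vocabulary, X


# Toy documents (hypothetical, not taken from the slides).
docs = [
    "artificial intelligence chip",
    "intelligence applications need intelligence",
]
vocabulary, X = document_term_matrix(docs)
print(vocabulary)
for row in X:
    print(row)
```

Each row is one document vector; a real system would store the rows sparsely, since most entries are zero.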

(7)

Query

Compute the similarity between queries (q) and documents (d)

Simple, intuitive

Fast to compute, because both are sparse

Retrieval Methods

• Rank documents according to similarity with query

• Term weighting schemes, for example, TF‐IDF

$$\mathrm{sim}(q, d) = \cos\angle(q, d) = \frac{\langle q, d \rangle}{\|q\| \, \|d\|}$$
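A minimal sketch of ranking by cosine similarity with TF-IDF weights follows. The specific weighting (raw count times log(N / document frequency)), the function names, and the toy documents are assumptions; many other TF-IDF variants exist.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Weight raw term counts by idf = log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Return (document indices sorted by decreasing similarity, scores)."""
    token_lists = [doc.lower().split() for doc in documents]
    doc_vectors, idf = tfidf_vectors(token_lists)
    # Query terms outside the vocabulary are ignored.
    q_counts = Counter(t for t in query.lower().split() if t in idf)
    q_vector = {term: count * idf[term] for term, count in q_counts.items()}
    scores = [cosine(q_vector, d) for d in doc_vectors]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy documents (hypothetical).
documents = [
    "computer chip design",
    "artificial intelligence applications",
    "chip for artificial intelligence",
]
order, scores = rank("artificial intelligence", documents)
print(order)
```

Only the terms shared by query and document contribute to the dot product, which is why sparsity makes the computation cheap.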

(8)

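One 100‐Millionth of a Typical Document‐Term Matrix

[Figure: small excerpt of a document-term matrix; the visible entries are mostly 0, with occasional counts such as 1 and 2]

Typical:

• Number of documents ≈ 1,000,000

• Vocabulary ≈ 100,000

• Sparseness < 0.1 %

• Fraction depicted ≈ 1e-8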
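The figures above imply the following back-of-the-envelope sizes. This sketch assumes that "sparseness" means the fraction of nonzero entries; integer arithmetic is used to avoid floating-point rounding.

```python
n_documents = 1_000_000
vocabulary_size = 100_000

# Total number of cells in the document-term matrix.
total_entries = n_documents * vocabulary_size

# Sparseness < 0.1 % is read here as: fewer than 1 in 1000 cells are nonzero.
max_nonzero = total_entries // 1000

# A fraction of 1e-8 of all cells is depicted on the slide.
cells_depicted = total_entries // 10**8

print(total_entries, max_nonzero, cells_depicted)
```

So the full matrix has 10^11 cells, fewer than 10^8 of them nonzero, and the depicted excerpt covers about a thousand cells.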

(9)

Robust Information Retrieval: Beyond Keyword‐based Search

• Query “labour immigrants Germany”: match

• Query “German job market for immigrants”: ?

• Query “foreign workers in Germany”: ?

• Query “German green card”: ?

G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, The Vocabulary Problem in Human‐System Communication: an Analysis and a Solution, Bell Communications Research, 1987

Vocabulary Mismatch Problem

• Different people use different vocabulary to describe the same concept

• Matching queries and documents based on keywords is insufficient
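The mismatch can be made concrete by counting the keywords that the example phrases on this slide share. The helper `shared_terms` and the naive lowercase, whitespace tokenization (no stemming) are assumptions of this sketch.

```python
def shared_terms(a, b):
    """Lowercased terms that two texts have in common."""
    return set(a.lower().split()) & set(b.lower().split())


reference = "labour immigrants Germany"
others = [
    "German job market for immigrants",
    "foreign workers in Germany",
    "German green card",
]
for text in others:
    print(text, "->", sorted(shared_terms(reference, text)))
```

Under exact keyword matching, the related phrases share at most one term with the reference, and "German" does not even match "Germany".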

(10)

The lost meaning of words

Polysemy: words with multiple meanings

• The vector space model is unable to discriminate between different meanings of the same word.

Synonymy: separate words that have the same meaning

• No associations between words are made in the vector space representation.

There is a disconnect between topics and words

[Formulas relating $\mathrm{sim}(\cdot, \cdot)$ to $\cos\angle(\cdot, \cdot)$]

(11)

Language Model Paradigm in IR

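Probabilistic relevance model

• Random variables: query $q$ and a binary relevance indicator $R_d \in \{0, 1\}$ for each document $d$

• Bayes’ rule:

$$P(R_d = 1 \mid q) \propto P(q \mid R_d = 1)\, P(R_d = 1)$$

• $P(R_d = 1 \mid q)$: probability that document d is relevant for query q

• $P(q \mid R_d = 1)$: probability of generating a query q to ask for relevant d

• $P(R_d = 1)$: prior probability of relevance for document d (e.g. quality, popularity)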

J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.

(12)

Language Model Paradigm

First contribution: prior probability of relevance

• simplest case: uniform (drops out for ranking)

• popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)

Second contribution: query likelihood

• query terms q are treated as a sample drawn from an (unknown) relevant document

(13)

Language Model Paradigm

Query generation model: what might a query look like that asks for a specific document?

• Maron & Kuhns: Indexer manually assigns probabilities for a pre‐specified set of tags/terms

• Ponte & Croft: Statistical estimation problem

Think of a relevant document. Formulate a query by picking some of the keywords as query terms.

Example document: “Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests.”

(14)

Query Likelihood

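$$P(q \mid R_d = 1) \equiv P(q \mid d), \qquad q = (w_1, \cdots, w_n)$$

where $R_d = 1$ denotes that document $d$ is relevant.

Independence assumption:

$$P(q \mid d) = \prod_{w \in q} P(w \mid d)$$

$$P(w \mid d) = \; ?$$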

(15)

Naive Approach

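[Figure: terms linked to documents]

Maximum Likelihood Estimation

$$\hat{P}_{\mathrm{ML}}(w \mid d) = \frac{n(d, w)}{\sum_{w'} n(d, w')}$$

$n(d, w)$: number of occurrences of term w in document d

Zero frequency problem: terms not occurring in a document get zero probability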
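The naive estimate and its zero frequency problem can be sketched as follows. The function names and the toy document are assumptions; the query likelihood is the product over query terms under the independence assumption.

```python
from collections import Counter


def mle_language_model(document):
    """Maximum likelihood estimate P(w|d) = n(d, w) / sum over w' of n(d, w')."""
    counts = Counter(document.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query, model):
    """P(q|d) as a product of per-term probabilities (independence assumption)."""
    likelihood = 1.0
    for term in query.lower().split():
        likelihood *= model.get(term, 0.0)  # unseen term gets probability zero
    return likelihood


# Toy document (hypothetical).
doc = "logging ban in national forests protects forests"
model = mle_language_model(doc)
print(query_likelihood("forests logging", model))  # both terms occur in doc
print(query_likelihood("forests timber", model))   # "timber" never occurs in doc
```

A single unseen query term drives the whole product to zero, which is why the estimate needs help from the rest of the collection.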

(16)

Estimation Problem

Crucial question: In which way can the document collection be utilized to improve probability estimates?

[Figure: a document is treated as an (i.i.d.) sample for estimation; can the estimate also learn from the other documents in the collection?]

(17)

Probabilistic Latent Semantic Analysis

Latent Concepts

[Figure: terms (e.g. “economic”, “imports”, “trade”) are linked to documents through latent concepts (e.g. TRADE)]

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$$

Concept expression probabilities are estimated based on all documents that deal with a concept.

“Unmixing” of superimposed concepts is achieved by a statistical learning algorithm.

(18)

Probabilistic Latent Semantic Analysis

PLSA evolved from latent semantic analysis, adding a sounder probabilistic model

It was introduced in 1999 by Thomas Hofmann (UAI’99)

It is related to non‐negative matrix factorization (NMF)

(19)

pLSA – Latent Variable Model

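Structural modeling assumption (mixture model):

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$$

• $P(w \mid d)$: document language model

• $z$: latent concepts or topics

• $P(w \mid z)$: concept expression probabilities

• $P(z \mid d)$: document‐specific mixture proportions

Model fitting: estimate $P(w \mid z)$ and $P(z \mid d)$ from the data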

T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.

(20)

pLSA via Likelihood Maximization

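Log‐Likelihood

$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)$$

• $n(d, w)$: observed word frequencies

• $\sum_{z} P(w \mid z)\, P(z \mid d)$: predictive probability of pLSA mixture model

Goal: Find model parameters that maximize the log‐likelihood, i.e. maximize the average predictive probability for observed word occurrences (non‐convex optimization problem)

$$\operatorname*{argmax}_{P(w \mid z),\; P(z \mid d)} \; \mathcal{L}$$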

(21)

EM Algorithm: Derivation

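Q‐parameterized lower bound on log‐likelihood

$$\sum_{r} \log \sum_{z} P(w_r \mid z)\, P(z \mid d_r) \;\ge\; \sum_{r} \sum_{z} Q_r(z) \log \frac{P(w_r \mid z)\, P(z \mid d_r)}{Q_r(z)}$$

• $(d_r, w_r)$: observed pairs with index r

• $Q_r$: Q distribution over concepts z for pair r

Follows from Jensen’s inequality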

(22)

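Expectation Maximization Algorithm

E step: posterior probability of latent variables (“concepts”)

$$P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}$$

Probability that the occurrence of term w in document d can be “explained” by concept z

M step: parameter estimation based on “completed” statistics

$$P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w)$$

how often is term w associated with concept z?

$$P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w)$$

how often is document d associated with concept z?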

A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1‐38, 1977.
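The E and M steps can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the lecture's reference code: the count matrix is dense, every document contains at least one word, initialization is random, and the names `plsa`, `normalize`, and the toy counts are hypothetical.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a dense count matrix counts[d][w] = n(d, w).

    Returns (p_w_z, p_z_d, history) with p_w_z[z][w] = P(w|z),
    p_z_d[d][z] = P(z|d), and history[i] = log-likelihood of the
    parameters at the start of iteration i.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    history = []
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_likelihood = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E step: posterior P(z|d, w) is joint[z] / p_w_d.
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_likelihood += n_dw * math.log(p_w_d)
                for z in range(n_topics):
                    expected = n_dw * joint[z] / p_w_d
                    # Accumulate the "completed" statistics for the M step.
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        history.append(log_likelihood)
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history


# Toy count matrix (hypothetical): documents 0-1 and 2-3 use different words.
counts = [
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 3, 4],
]
p_w_z, p_z_d, history = plsa(counts, n_topics=2)
print(history[0], history[-1])
```

The recorded log-likelihood should never decrease from one iteration to the next, which is a convenient sanity check; because the objective is non-convex, different seeds may reach different local maxima.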

(23)

[Figure: pLSA in matrix form, with latent concepts z]

(24)

Variations of pLSA

Hierarchical extensions:

• Asymmetric: MASHA (“Multinomial Asymmetric Hierarchical Analysis”)

• Symmetric: HPLSA (“Hierarchical Probabilistic Latent Semantic Analysis”)

Manifold regularizer:

• Probabilistic Dyadic Data Analysis with Local and Global Consistency

Generative models:

• Latent Dirichlet allocation: adds a Dirichlet prior on the per‐document topic distribution, trying to address an often‐criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents, and at the same time avoid the overfitting problem.
