
© Deng Cai, College of Computer Science, Zhejiang University


Academic year: 2021


So Far…

It’s time for unsupervised learning

We are only given inputs

Goal: find “interesting patterns”

Discovering clusters

Clustering

Discovering latent factors

Dimensionality reduction

Topic modeling

Matrix factorization


Deng Cai (蔡登)

College of Computer Science Zhejiang University

[email protected]


Topic Modeling


Text Analysis

Text data

• Web pages
• Emails
• Documents
• …

www.betaversion.org/~stefano/linotype/news/26/


Bag‐of‐Words (BOW)

Assumes the order of words has no significance

e.g., the term “home made” has the same probability as “made home”

It is a simplifying assumption used in natural language processing and information retrieval
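The bag-of-words assumption can be illustrated with a minimal sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this example, not part of the slides.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word counts; word order is discarded.

    Tokenization by lowercasing and splitting on whitespace is a
    simplifying assumption for this sketch.
    """
    return Counter(text.lower().split())

# Both orderings produce the same representation.
print(bag_of_words("home made") == bag_of_words("made home"))

# Repeated words are kept as counts.
print(bag_of_words("the cat saw the dog"))
```

Because only the counts survive, any model built on this representation assigns the same probability to every permutation of a text.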


Salton’s Vector Space Model (Prior to 1988)

Represent each document by a high-dimensional vector in the space of words


Document‐Term Matrix

D = document collection $\{d_1, \ldots, d_I\}$, W = lexicon/vocabulary $\{w_1, \ldots, w_J\}$

X = document‐term matrix with one row per document $d_i$ and one column per term $w_j$; the entry $X_{ij}$ is the term weighting of $w_j$ in $d_i$ (mostly 0, with small counts such as 1 or 2)

Example document $d_i$, with term $w_j$ = “intelligence”: “Texas Instruments said it has developed the first 32‐bit computer chip designed specifically for artificial intelligence applications [...]”
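As a rough illustration of how a document-term matrix is assembled, the sketch below uses raw counts as the term weighting. The three toy documents and the names `docs`, `vocab`, `col`, and `X` are made up for this example.

```python
# Toy collection (hypothetical documents, not from the slides).
docs = [
    "artificial intelligence chip",
    "computer chip chip design",
    "intelligence applications",
]

# Vocabulary W: every distinct term, in a fixed (sorted) column order.
vocab = sorted({w for doc in docs for w in doc.split()})
col = {w: j for j, w in enumerate(vocab)}

# X[i][j] = number of occurrences of term j in document i.
X = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for w in doc.split():
        X[i][col[w]] += 1

print(vocab)
for row in X:
    print(row)
```

Raw counts are only one possible weighting; a scheme such as TF-IDF would rescale the same entries.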


Query

Compute the similarity between queries (q) and documents (d):

$\mathrm{sim}(q, d) = \cos\angle(q, d) = \frac{q^\top d}{\|q\|\,\|d\|}$

Simple and intuitive

Fast to compute, because both vectors are sparse

Retrieval Methods

• Rank documents according to similarity with the query
• Term weighting schemes, for example, TF‐IDF
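A minimal sketch of cosine-based ranking with TF-IDF weights follows. It assumes one common TF-IDF variant (raw term frequency times log(N / df)); the slides do not say which variant is meant, and the toy documents are borrowed loosely from the query examples elsewhere in the deck.

```python
import math
from collections import Counter

# Toy collection (hypothetical).
docs = {
    "d1": "german job market for immigrants",
    "d2": "foreign workers in germany",
    "d3": "labour immigrants germany",
}

counts = {name: Counter(text.split()) for name, text in docs.items()}
# Document frequency: in how many documents does each term occur?
df = Counter(w for c in counts.values() for w in c)
idf = {w: math.log(len(docs) / df[w]) for w in df}

def tfidf(term_counts):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary terms are dropped."""
    return {w: tf * idf[w] for w, tf in term_counts.items() if w in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; only shared terms contribute."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm > 0 else 0.0

query = tfidf(Counter("labour immigrants germany".split()))
scores = {name: cosine(query, tfidf(c)) for name, c in counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The dot product only visits terms present in both vectors, which is why sparsity makes the computation fast.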


A 100 Millionth of a Typical Document‐Term Matrix

[Figure: a small block of the matrix; almost all entries are 0, with a few small counts such as 1 and 2]

Typical:

• Number of documents ≈ 1,000,000
• Vocabulary ≈ 100,000
• Sparseness < 0.1 %
• Fraction depicted ≈ 1e-8
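The figures above can be checked with a few lines of integer arithmetic. The 8-bytes-per-entry estimate for dense storage is an assumption (64-bit floats) added for illustration.

```python
n_docs = 1_000_000
vocab_size = 100_000

cells = n_docs * vocab_size    # entries in the full matrix: 10**11
depicted = cells // 10**8      # a 1e-8 fraction of the matrix: 1000 entries
max_nonzero = cells // 1000    # sparseness < 0.1 % means fewer than 10**8 nonzeros
dense_bytes = cells * 8        # dense storage, assuming 8 bytes per entry

print(cells, depicted, max_nonzero, dense_bytes)
```

Under these assumptions a dense copy would need about 800 GB, so only the nonzero entries are stored in practice.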


Robust Information Retrieval Beyond Keyword‐based Search

• Query “labour immigrants Germany”: match
• Query “German job market for immigrants”: ?
• Query “foreign workers in Germany”: ?
• Query “German green card”: ?

G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, The Vocabulary Problem in Human‐System Communication: an Analysis and a Solution, Bell Communications Research, 1987

Vocabulary Mismatch Problem

• Different people use different vocabulary to describe the same concept
• Matching queries and documents based on keywords is insufficient


The lost meaning of words

Polysemy: words with multiple meanings

• The vector space model is unable to discriminate between different meanings of the same word.

Synonymy: separate words that have the same meaning

• No associations between words are made in the vector space representation.

There is a disconnect between topics and words: $\mathrm{sim}(q, d) = \cos\angle(q, d)$ depends only on the words that q and d share.
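Both failure modes can be reproduced with raw-count cosine similarity. The query and the two documents below are invented for illustration: one document shares the ambiguous word “bank” with the query but is about rivers, while the other is on topic but uses different words.

```python
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    u, v = Counter(a.split()), Counter(b.split())
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm > 0 else 0.0

query = "bank loan"

# Polysemy: "bank" matches, although the document is about a river bank.
print(cos_sim(query, "river bank erosion"))

# Synonymy: the document is on topic, but no word is shared.
print(cos_sim(query, "credit from a lender"))
```

The off-topic document receives the higher score, and the on-topic one scores zero.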


Language Model Paradigm in IR

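Probabilistic relevance model

• Random variables: query q and a binary relevance indicator $R_d \in \{0, 1\}$ for each document d
• Bayes’ rule:

$P(R_d = 1 \mid q) \propto P(q \mid R_d = 1)\, P(R_d = 1)$

• $P(R_d = 1 \mid q)$: probability that document d is relevant for query q
• $P(q \mid R_d = 1)$: probability of generating a query q to ask for relevant d
• $P(R_d = 1)$: prior probability of relevance for document d (e.g. quality, popularity)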

J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.


Language Model Paradigm

First contribution: prior probability of relevance

• simplest case: uniform (drops out for ranking)
• popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)

Second contribution: query likelihood

• query terms q are treated as a sample drawn from an (unknown) relevant document


Language Model Paradigm

Query generation model: what might a query look like that asks for a specific document?

• Maron & Kuhns: Indexer manually assigns probabilities for a pre‐specified set of tags/terms
• Ponte & Croft: Statistical estimation problem

Think of a relevant document. Formulate a query by picking some of the keywords as query terms.

Example document: “Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests.”


Query Likelihood

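$P(q \mid R_d = 1) \equiv P(q \mid d)$ (the likelihood of query q given that document d is relevant), with $q = (w_1, \cdots, w_n)$

Independence Assumption

$P(q \mid d) = \prod_{i=1}^{n} P(w_i \mid d)$

$P(w \mid d) = \,?$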


Naive Approach

[Figure labels: Terms, Documents]

Maximum Likelihood Estimation

$\hat{P}(w \mid d) = \frac{n(d, w)}{\sum_{w'} n(d, w')}$

where $n(d, w)$ is the number of occurrences of term w in document d

Zero frequency problem: terms not occurring in a document get zero probability
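The naive approach and its zero frequency problem can be sketched as follows. The function names `mle_language_model` and `query_likelihood` are hypothetical, and the toy document is a shortened text made up for this example.

```python
from collections import Counter

def mle_language_model(document):
    """Maximum likelihood estimate P(w|d) = n(d, w) / total words in d."""
    counts = Counter(document.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    """P(q|d) under the independence assumption: product of P(w|d) over query terms."""
    p = 1.0
    for w in query.split():
        p *= model.get(w, 0.0)  # unseen terms have probability zero under MLE
    return p

# Toy document (hypothetical).
model = mle_language_model("ban on logging in national forests")

print(query_likelihood("logging forests", model))    # both terms occur in the document
print(query_likelihood("forest protection", model))  # no term occurs in the document
```

A single unseen query term drives the whole product to zero, which is what motivates borrowing statistics from the rest of the collection.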


Estimation Problem

Crucial question: In which way can the document collection be utilized to improve probability estimates?

[Figure: each document is treated as an (i.i.d.) sample for estimation; the other documents in the collection are shown alongside it]

Learning from other documents in a collection?


Probabilistic Latent Semantic Analysis

[Figure: documents linked to terms through latent concepts, e.g. the concept TRADE with the terms “economic”, “imports”, “trade”]

$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$

Concept expression probabilities are estimated based on all documents that are dealing with a concept.

“Unmixing” of superimposed concepts is achieved by a statistical learning algorithm.


Probabilistic Latent Semantic Analysis

PLSA evolved from latent semantic analysis, adding a sounder probabilistic model

It was introduced in 1999 by Thomas Hofmann (UAI’99)

It is related to non‐negative matrix factorization (NMF)


pLSA – Latent Variable Model

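Structural modeling assumption (mixture model)

$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$

• $P(w \mid d)$: document language model
• $z$: latent concepts or topics
• $P(w \mid z)$: concept expression probabilities
• $P(z \mid d)$: document‐specific mixture proportions

Model fitting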

T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999. 
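A small numeric sketch of the mixture model: the document language model P(w|d) is a weighted sum of concept expression probabilities P(w|z), with the document-specific proportions P(z|d) as weights. All numbers, the second concept, and the variable names are invented for illustration.

```python
terms = ["economic", "imports", "trade", "forest"]

# P(w|z): one row per concept z, one column per term; each row sums to 1.
p_w_given_z = [
    [0.3, 0.3, 0.4, 0.0],  # hypothetical "TRADE" concept
    [0.1, 0.0, 0.0, 0.9],  # hypothetical "ENVIRONMENT" concept
]

# P(z|d) for a single document d: mostly about trade.
p_z_given_d = [0.75, 0.25]

# P(w|d) = sum over z of P(w|z) * P(z|d)
p_w_given_d = [
    sum(p_z_given_d[z] * p_w_given_z[z][w] for z in range(len(p_z_given_d)))
    for w in range(len(terms))
]

for term, p in zip(terms, p_w_given_d):
    print(term, round(p, 3))
```

Note that “forest” receives nonzero probability in this document even if it never occurs there, as long as the document puts some weight on a concept that expresses it.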


pLSA via Likelihood Maximization

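Log‐Likelihood

$L = \sum_{d, w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)$

• $n(d, w)$: observed word frequencies
• $\sum_{z} P(w \mid z)\, P(z \mid d)$: predictive probability of the pLSA mixture model

Goal: Find model parameters that maximize the log‐likelihood, i.e. maximize the average predictive probability for observed word occurrences (non‐convex optimization problem):

$\arg\max_{P(w \mid z),\, P(z \mid d)} L$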


EM Algorithm: Derivation

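Q‐parameterized lower bound on the log‐likelihood (follows from Jensen’s inequality), written over the observed pairs $(d_r, w_r)$ with index r:

$L = \sum_{r} \log \sum_{z} P(w_r \mid z)\, P(z \mid d_r) \;\ge\; \sum_{r} \sum_{z} Q_r(z) \log \frac{P(w_r \mid z)\, P(z \mid d_r)}{Q_r(z)}$

where each $Q_r$ is a distribution over the latent concepts z (the Q distribution)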


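Expectation Maximization Algorithm

E step: posterior probability of latent variables (“concepts”)

$P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}$

Probability that the occurrence of term w in document d can be “explained” by concept z

M step: parameter estimation based on “completed” statistics

$P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w)$ (how often is term w associated with concept z?)

$P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w)$ (how often is document d associated with concept z?)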

A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society B, vol. 39, no. 1, pp. 1‐38, 1977
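The E and M steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `plsa_em`, the random initialization, the fixed iteration count, and the toy count matrix are all choices made for this example, and the code assumes every document contains at least one word.

```python
import math
import random

def plsa_em(n, num_topics, num_iters=50, seed=0):
    """Fit pLSA by EM on a count matrix n[d][w].

    Returns (P(w|z) as rows per topic, P(z|d) as rows per document,
    log-likelihood before each update).
    """
    rng = random.Random(seed)
    D, W, Z = len(n), len(n[0]), num_topics

    def normalize(row):
        s = sum(row)
        return [x / s for x in row]

    # Random strictly positive initialization (an assumption of this sketch).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(W)]) for _ in range(Z)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(Z)]) for _ in range(D)]
    history = []

    for _ in range(num_iters):
        acc_w_z = [[0.0] * W for _ in range(Z)]  # sum over d of n(d,w) P(z|d,w)
        acc_z_d = [[0.0] * Z for _ in range(D)]  # sum over w of n(d,w) P(z|d,w)
        ll = 0.0
        for d in range(D):
            for w in range(W):
                if n[d][w] == 0:
                    continue
                # E step: P(z|d,w) is proportional to P(w|z) P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(Z)]
                p_w_d = sum(joint)
                ll += n[d][w] * math.log(p_w_d)
                for z in range(Z):
                    weighted = n[d][w] * joint[z] / p_w_d
                    acc_w_z[z][w] += weighted
                    acc_z_d[d][z] += weighted
        history.append(ll)
        # M step: renormalize the "completed" counts.
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d, history

# Toy counts: rows are documents, columns are the terms
# ["trade", "imports", "forest", "logging"] (made-up data).
n = [
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 4, 4],
    [0, 1, 5, 3],
]
p_w_z, p_z_d, history = plsa_em(n, num_topics=2)
print(history[0], history[-1])
```

Each EM iteration does not decrease the log-likelihood, but because the problem is non-convex the result depends on the initialization; in practice several random restarts are often compared.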


z Matrix


Variations of pLSA

Hierarchical extensions:

• Asymmetric: MASHA ("Multinomial Asymmetric Hierarchical Analysis")
• Symmetric: HPLSA ("Hierarchical Probabilistic Latent Semantic Analysis")

Manifold regularizer:

• Probabilistic Dyadic Data Analysis with Local and Global Consistency

Generative models:

• Latent Dirichlet allocation: adds a Dirichlet prior on the per‐document topic distribution, trying to address an often‐criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents, and at the same time to avoid the overfitting problem.
