
Accessing the Models

Prof. Shou‐de Lin, CSIE/GINM, NTU

Sdlin@csie.ntu.edu.tw

2009/11/30 1

(2)

Questions

For a theory (or hypothesis, or model) T, how do we know if one set of parameter estimates is better than another?

Which is better: theory T with parameter estimates X, or theory S with parameter estimates P?

Knowledge discovery is about finding the best model with optimal parameters that fits the given data, then using that model to find something useful.

(3)

Model Assessment: A Bayesian Approach

• d = data (observation), m = model (how the data are generated)

• We want the most likely model given the data:

argmax_m p(m|d) = argmax_m p(d|m) p(m) / p(d)     (Bayes' rule)
                = argmax_m p(m) p(d|m)

since p(d) does not depend on m.

• The two factors ask two questions. p(m): does this model look reasonable? p(d|m): given the fixed model m, does the observed data stream look reasonable?

(4)

Maximum Likelihood Estimation (MLE)

• If p(m) is unknown, then we can only evaluate

argmax_m p(d|m)

, which is usually quantitative!! Thank god ☺

• E.g. d = H H T H

M1: coin is unbiased, p(d|m) = 0.5^4 = 0.0625

M2: coin is biased s.t. p(H) = 3/4, p(d|m) = (3/4) * (3/4) * (1/4) * (3/4) = 27/256 ≈ 0.105

M3: coin is biased s.t. p(H) = 0.9, p(d|m) = 0.9 * 0.9 * 0.1 * 0.9 = 0.0729

(5)

argmax_m p(m) p(d|m)

• What if p(m) is not uniform (e.g. we examine the coin and find nothing wrong with it)?

• E.g. P(M1) = 0.9, P(M2) = 0.05, P(M3) = 0.05

• Then in the previous example,

P(M1) * P(d|M1) = 0.9 * 0.0625 ≈ 0.056
P(M2) * P(d|M2) = 0.05 * 0.105 ≈ 0.0053
P(M3) * P(d|M3) = 0.05 * 0.073 ≈ 0.0037

so MAP picks M1 even though MLE alone would pick M2.
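As a sanity check, a minimal Python sketch (my own, not from the slides) that recomputes the likelihoods and the prior-weighted scores for the three coin models on d = H H T H:

```python
# Sketch of the MLE vs. MAP comparison: three coin models for the
# observation sequence d = H H T H. Model names and numbers follow
# the slides; the code itself is illustrative.
d = "HHTH"

def likelihood(p_h, data):
    """P(data | model) for a coin with P(H) = p_h, assuming independent tosses."""
    out = 1.0
    for toss in data:
        out *= p_h if toss == "H" else 1.0 - p_h
    return out

models = {"M1": 0.5, "M2": 0.75, "M3": 0.9}   # P(H) under each model
prior  = {"M1": 0.9, "M2": 0.05, "M3": 0.05}  # P(m)

lik  = {m: likelihood(p, d) for m, p in models.items()}
post = {m: prior[m] * lik[m] for m in models}  # unnormalized P(m)P(d|m)

mle_pick = max(lik, key=lik.get)    # MLE ignores the prior
map_pick = max(post, key=post.get)  # MAP weighs it in
print(mle_pick, map_pick)           # M2 M1
```

MLE prefers M2 (likelihood 27/256 ≈ 0.105), but the strong prior on M1 flips the MAP decision, which is the point of the slide.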

(6)

Unsupervised Learning

Prof. Shou‐de Lin, CSIE/GINM, NTU

(7)

To Bring You Back to Earth

In the "whatever I want to do" lecture, I'll teach:

Supervised learning (2 hours)

Generative learning algorithms. Gaussian discriminant analysis.

Unsupervised learning (3 hours)

EM (why? Because it is as magical as you should know).

Note: Last year I used 3 full lectures teaching EM.

Clustering: K‐means (why? Because it is as simple as you should know).

Reinforcement learning (0.5 hour)

Value iteration and policy iteration.

Q‐learning & SARSA

(8)

What is Unsupervised Learning?

• Supervised learning: we are given a set of training data X with class labels, and we want to learn a function f(x) = y that maps x to y.

• Unsupervised learning:

Clustering: given x, group the x's into different clusters.

EM: given x and partial information about y, try to learn f(x).

EM is the key solution to many knowledge discovery tasks.
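To make the clustering bullet concrete, a minimal 1-D K-means sketch (my own illustration; the lecture introduces K-means properly later):

```python
import random

# Minimal K-means on 1-D points with k = 2: alternate between assigning
# each point to its nearest center and moving each center to the mean
# of its assigned points.
def kmeans(xs, k=2, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # pick k distinct points as initial centers
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for x in xs:
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # update step: each center moves to its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

xs = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans(xs))  # centers converge near 1.0 and 9.07
```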

(9)

Analogy: Decipherment

• SL: given a bunch of words X and their cipher Y, try to figure out f(X) = Y. For example, (X,Y) = (byf, axe), (hppe, good), (bqqmf, apple); f = ?

• However, this is not how decipherment works in the real world. People didn't decipher Egyptian or Maya this way. They did it in an unsupervised manner (only X is given, and they need to translate it into Y):

X = (byf, hppe, bqqmf, …), f = ?

(10)

Data and Model

Supervised learning:  complete data + incomplete model (parameters unknown)
                      → argmax_m P(m|data) → complete model

Data generation:      incomplete data + complete model
                      → argmax_d P(d|m) → complete data

EM:                   incomplete data + incomplete model
                      → argmax_m P(incomplete data|m) → complete data & model

(11)

Ideal vs. Available Data – Sequential Labeling (POS tagging)

• Part-of-speech tagging as a noisy channel: P(T) generates the tag sequence T, and the channel P(W|T) generates the word sequence W.

• Ideal: t1 t2 t3 …
        w1 w2 w3 …

• Available: w1 w2 w3 …

(12)

Ideal vs. Available Data – Cryptography

• Cryptography as a noisy channel: P(E) generates the plaintext E, and the channel P(C|E) generates the ciphertext C.

• Ideal: e1 e2 e3 …  (solvable by SL)
        c1 c2 c3 …

• Available: c1 c2 c3 …  (need EM)

(13)

Introducing EM

Expectation Maximization (EM) is perhaps the most often used, and most often half-understood, algorithm for unsupervised learning.

It is very intuitive.

Many people rely on their intuition to apply the algorithm in different problem domains.

It is not an algorithm but a framework: different algorithms can be designed based on the EM framework.

Note: The following slides integrate several people's materials and viewpoints about EM, including Kevin Knight, Dekang Lin, D. Prescher, and Dan Klein.

(14)

EM Framework

Expectation step: use the current parameters (and observations) to reconstruct the hidden structure.

Maximization step: use that hidden structure (and observations) to re‐estimate the parameters.

(15)

Handling Incomplete Data

Our goal is to build a probabilistic model of data (e.g. a language model), defined by a set of parameters θ.

The model parameters can be estimated from a set of IID training examples: x1, x2, …, xn.

Unfortunately, we only get to observe partial information about the x's. For example:

xi = (ti, yi) and we can only observe yi. The ti's are the so‐called "hidden" data that will be modeled by the "hidden" variables in EM.

How can we still construct the model?

(16)

Example: MLE

A coin with P(H) = p, P(T) = q. We observed m H's and n T's.

Q: What are p and q according to MLE?

Solution:

Maximize Σi log Pθ(yi) = log (p^m q^n) = m log p + n log q, under the constraint p + q = 1.

Lagrange method:

Define g(p,q) = m log p + n log q + λ(p + q − 1), and solve the equations:

∂g/∂p = m/p + λ = 0,   ∂g/∂q = n/q + λ = 0,   p + q = 1

which gives p = m/(m+n), q = n/(m+n).
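A quick numeric check of this result (my own sketch): with m = 7 heads and n = 3 tails, the log-likelihood m log p + n log(1 − p) should peak at p = m/(m+n) = 0.7.

```python
import math

# Numeric check of the Lagrange result: maximize m*log(p) + n*log(1-p)
# over a fine grid and compare against the closed form p = m/(m+n).
m, n = 7, 3
p_closed = m / (m + n)  # 0.7

def log_likelihood(p):
    return m * math.log(p) + n * math.log(1.0 - p)

grid = [i / 1000 for i in range(1, 1000)]  # p in (0, 1)
p_grid = max(grid, key=log_likelihood)
print(p_closed, p_grid)  # both 0.7
```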

(17)

But What If the Data Is Incomplete?

• Suppose we have two coins. Coin 1 is fair. Coin 2 has probability p of generating H.

• Coin 1 is chosen with probability x (coin 2 with probability 1 − x).

• We only know the result of each toss, but we don't know which coin was chosen.

The complete data would be (1, H), (1, T), (2, T), (1, H), (2, T). The observed data is H, T, T, H, T.

• What are p and x?
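The E- and M-steps for this two-coin problem can be sketched as follows (a hypothetical minimal implementation, not from the lecture; `p` is coin 2's head probability and `x` the probability of choosing coin 1):

```python
# Minimal EM sketch for the two-coin mixture: coin 1 is fair, coin 2 lands H
# with probability p, and coin 1 is chosen with probability x. Only the toss
# outcomes are observed; which coin was used is the hidden variable.
obs = ["H", "T", "T", "H", "T"]

def em(obs, p=0.6, x=0.5, iters=100):
    for _ in range(iters):
        # E-step: posterior probability that coin 1 produced each toss
        post1 = []
        for o in obs:
            l1 = x * 0.5                                  # coin 1 is fair
            l2 = (1.0 - x) * (p if o == "H" else 1.0 - p)
            post1.append(l1 / (l1 + l2))
        # M-step: re-estimate x and p from the expected counts
        x = sum(post1) / len(obs)
        w2 = [1.0 - r for r in post1]                     # weight on coin 2
        heads2 = sum(w for w, o in zip(w2, obs) if o == "H")
        p = heads2 / sum(w2)
    return p, x

p, x = em(obs)
# Per-toss data can't separate p from x: EM converges to some (p, x) whose
# mixture P(H) = x*0.5 + (1-x)*p matches the empirical fraction 2/5.
print(p, x)
```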

(18)

EM Properties

EM is a general technique for learning anytime we have incomplete data (x, y).

Each step of EM is guaranteed to increase the data likelihood (a hill-climbing procedure).

EM is not guaranteed to find the global maximum of the data likelihood.

The data likelihood typically has many local maxima for a general model class and rich feature set.

There are many "patterns" in the data that we can fit our model to…

(19)

Ideal vs. Available Data – Alignment Problem for Machine Translation

• MT as a noisy channel: P(E) generates the English sentence E, and the channel P(F|E) generates the French sentence F.

• Ideal: e1 e2 e3 …  (solvable by SL)
        f1 f2 f3 …

• Available: e1 e2 e3 …  (need EM: the word alignments are missing)
             f1 f2 f3 …

(20)

Ex: English‐French Alignment

• Data: the house ↔ la maison, house ↔ maison

• The alignments are missing!!

• Theory: English words are translated first, then permuted.

• Parameters: P(la|the), P(maison|the), P(la|house), P(maison|house)

(21)

Ex: EM Training on MT

Model to learn:

P(la|the) = ?   P(maison|the) = ?   P(la|house) = ?   P(maison|house) = ?

Possible alignments for "the house ↔ la maison": (a) the→maison, house→la; (b) the→la, house→maison. The pair "house ↔ maison" forces house→maison, so every score below includes that link.

Initialize uniformly:
P(la|the) = 1/2, P(maison|the) = 1/2, P(la|house) = 1/2, P(maison|house) = 1/2

Score: p(a) = 1/8, p(b) = 1/8

Fractional counts:
C(la|the) = 0*1/8 + 1*1/8 = 1/8
C(maison|the) = 1*1/8 + 0*1/8 = 1/8
C(la|house) = 1*1/8 + 0*1/8 = 1/8
C(maison|house) = 1*1/8 + 2*1/8 = 3/8

Normalize:
p(la|the) = 1/2, p(maison|the) = 1/2, p(la|house) = 1/4, p(maison|house) = 3/4

Score: P(a) = 3/32, P(b) = 9/32

Fractional counts:
C(la|the) = 9/32
C(maison|the) = 3/32
C(la|house) = 3/32
C(maison|house) = 3/32 + 2*9/32 = 21/32

Normalize:
p(la|the) = 3/4, p(maison|the) = 1/4, p(la|house) = 1/8, p(maison|house) = 7/8

Score: P(a) = 7/256, P(b) = 147/256

The alignment (b) that links the→la and house→maison keeps winning more and more of the probability mass.
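The iterations above can be reproduced mechanically; here is a small Python sketch (my own, using exact fractions) of the same E/M loop for this two-parameter alignment model:

```python
from fractions import Fraction as F

# EM for the toy alignment example: corpus = ("the house" -> "la maison")
# and ("house" -> "maison"). Alignment (a): the->maison, house->la;
# alignment (b): the->la, house->maison; sentence 2 forces house->maison.
t = {("la", "the"): F(1, 2), ("maison", "the"): F(1, 2),
     ("la", "house"): F(1, 2), ("maison", "house"): F(1, 2)}

def em_step(t):
    # E-step: score each alignment (including sentence 2's forced link)
    p_a = t[("maison", "the")] * t[("la", "house")] * t[("maison", "house")]
    p_b = t[("la", "the")] * t[("maison", "house")] * t[("maison", "house")]
    # fractional counts collected from both alignments
    c = {("la", "the"): p_b, ("maison", "the"): p_a,
         ("la", "house"): p_a, ("maison", "house"): p_a + 2 * p_b}
    # M-step: normalize the counts per English word
    new_t = {}
    for e in ("the", "house"):
        z = c[("la", e)] + c[("maison", e)]
        new_t[("la", e)] = c[("la", e)] / z
        new_t[("maison", e)] = c[("maison", e)] / z
    return new_t, p_a, p_b

t1, pa1, pb1 = em_step(t)    # scores 1/8, 1/8 -> P(maison|house) = 3/4
t2, pa2, pb2 = em_step(t1)   # scores 3/32, 9/32 -> P(la|the) = 3/4
```

Running two steps reproduces the slide's tables exactly: the first normalization gives P(maison|house) = 3/4, and the second gives P(la|the) = 3/4, P(maison|house) = 7/8.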
