Assessing the Models
Prof. Shou-de Lin, CSIE/GINM, NTU
Sdlin@csie.ntu.edu.tw
2009/11/30
Questions
• For a theory (or hypothesis or model) T, how do we know if one set of parameter estimates is better than another?
• Which is better? Theory T with parameter estimates X, or theory S with parameter estimates P?
• Knowledge discovery is about finding the best model with optimal parameters that fits the given data, then using the model to find something useful.
[Figure: two candidate models, M1 and M2]
Model Assessment: A Bayesian Approach
• d= data (observation), m=model (how the data are generated)
• Most likely model given the data:

  argmax_m P(m|d)

• By Bayes’ rule (P(d) does not depend on m, so it can be dropped from the argmax):

  argmax_m P(m|d) = argmax_m P(d|m) * P(m) / P(d) = argmax_m P(m) * P(d|m)

• P(m): does this model look reasonable?
• P(d|m): given the fixed model m, does the observed data stream look reasonable?
Maximum Likelihood Estimation (MLE)
• If p(m) is unknown, then we can only evaluate
  argmax_m P(d|m), which is usually quantitative!! Thank god ☺
• E.g. d = H H T H
  – M1: coin is unbiased, p(d|m) = 0.5^4 = 0.0625
  – M2: coin is biased s.t. p(H) = 3/4, p(d|m) = 3/4 * 3/4 * 1/4 * 3/4 = 27/256 ≈ 0.105 ☺
  – M3: coin is biased so that P(H) = 0.9, p(d|m) = 0.9^3 * 0.1 ≈ 0.073
argmax_m P(m) * P(d|m)
• What if p(m) is not uniform? (e.g. we examine the coin and find nothing wrong with it)
• E.g. P(M1) = 0.9, P(M2) = 0.05, P(M3) = 0.05
• Then, in the previous example,
  – P(M1) * P(d|M1) = 0.9 * 0.0625 ≈ 0.056 ☺
  – P(M2) * P(d|M2) = 0.05 * 0.105 ≈ 0.0053
  – P(M3) * P(d|M3) = 0.05 * 0.073 ≈ 0.0037
Unsupervised Learning
Prof. Shou-de Lin, CSIE/GINM, NTU
To Bring you Back to the Earth
In the “whatever I want to do” lecture, I’ll teach:
• Supervised learning. (2 hours)
– Generative learning algorithms. Gaussian discriminant analysis.
• Unsupervised learning. (3 hours)
– EM (why? Because it is as magical as you should know).
Note: Last year I used 3 full lectures teaching EM
– Clustering: K-means (why? Because it is as simple as you should know)
• Reinforcement learning (0.5 hour)
– Value iteration and policy iteration.
– Q‐learning & SARSA
What is Unsupervised Learning?
• Supervised learning: we are given a set of training data X together with class labels y, and we want to learn a function f(x) = y that maps x to y
• Unsupervised Learning:
– Clustering: given x, grouping x into different clusters.
– EM: given x and partial information about y, trying to learn f(x).
• EM is the key solution to many knowledge discovery tasks.
Analogy: Decipherment
• SL: given a bunch of ciphered words X and their decipherment Y, trying to figure out f(X) = Y. For example, (X, Y) = (byf, axe), (hppe, good), (bqqmf, apple); f = ?
• However, this is not how decipherment works in the real world. People didn’t decipher Egyptian or Maya this way. They did it in an unsupervised manner (only X is given, and they need to translate it into Y):
  X = (byf, hppe, bqqmf, …), f = ?
Data and Model

• Supervised learning: complete data + incomplete model (parameters unknown) → complete model
  argmax_m P(m|data)
• Data generation: complete model + incomplete data → complete data
  argmax_d P(d|m)
• EM: incomplete data + incomplete model → complete data & model
  argmax_m P(incomplete data|m)
Ideal vs. Available Data – Sequential Labeling (POS tagging)
• Part-of-speech tagging as a noisy channel: T ~ P(T) → [noisy channel P(W|T)] → W
• Ideal: t1 t2 t3 …
         w1 w2 w3 …
• Available: w1 w2 w3 …
Ideal vs. Available Data – Cryptography

• Cryptography as a noisy channel: E ~ P(E) → [noisy channel P(C|E)] → C
• Ideal: e1 e2 e3 … (solvable by SL)
         c1 c2 c3 …
• Available: c1 c2 c3 … (need EM)
Introducing EM
• Expectation Maximization (EM) is perhaps the most often used and most half-understood algorithm for unsupervised learning.
– It is very intuitive.
– Many people rely on their intuition to apply the algorithm in different problem domains.
– It is not an algorithm but a framework. Different algorithms can be designed based on the EM framework.
• Note: The following slides integrate some people’s materials and viewpoints about EM, including Kevin Knight, Dekang Lin, D. Prescher, and Dan Klein.
EM framework
• Expectation step: Use current parameters (and observations) to reconstruct hidden structure
• Maximization step: Use that hidden structure (and observations) to re-estimate parameters
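Stated a bit more formally (standard EM notation, not taken from these slides): with observed data $x$, hidden variables $z$, and parameters $\theta$,

$$\text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\big[\log p(x, z \mid \theta)\big]$$

$$\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$

The E-step fills in the hidden structure in expectation; the M-step re-fits the parameters as if that expected structure were observed.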
Handling Incomplete Data
• Our goal is to build a probabilistic model of data (e.g. LM), defined by a set of parameters θ
• The model parameters can be estimated from a set of IID training examples: x1, x2, …, xn
• Unfortunately, we only get to observe partial information about x’s, for example:
– xi = (ti, yi) and we can only observe yi. The ti’s are the so-called “hidden” data that will be modeled by the “hidden” variables in EM.
• How can we still construct the model?
Example: MLE

• A coin with P(H) = p, P(T) = q. We observed m H’s and n T’s.
• Q: What are p and q according to MLE?
• Solution:
• Maximize Σ_i log P_θ(y_i) = log(p^m q^n) = m log p + n log q, under the constraint p + q = 1
• Lagrange Method:
– Define g(p, q) = m log p + n log q + λ(p + q − 1)
– Solve the equations:
  ∂g(p, q)/∂p = 0,  ∂g(p, q)/∂q = 0,  p + q = 1
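Carrying the calculation one step further (not spelled out on the slide, but it follows directly from the equations above):

$$\frac{\partial g}{\partial p} = \frac{m}{p} + \lambda = 0,\qquad \frac{\partial g}{\partial q} = \frac{n}{q} + \lambda = 0 \;\Rightarrow\; p = -\frac{m}{\lambda},\; q = -\frac{n}{\lambda}$$

$$p + q = 1 \;\Rightarrow\; \lambda = -(m+n) \;\Rightarrow\; p = \frac{m}{m+n},\quad q = \frac{n}{m+n}$$

So the MLE is simply the empirical frequency of heads (and tails).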
But if the data is incomplete

• Suppose we have two coins. Coin 1 is fair. Coin 2 has probability p of generating H.
• On each toss one of the coins is chosen; say coin 1 is chosen with probability x.
• We only know the result of the toss, but don’t know which coin was chosen.
  – The complete data is (1, H), (1, T), (2, T), (1, H), (2, T)
  – The observed data is H, T, T, H, T.
• What are p, q and x?
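One way to answer this with EM is sketched below (my own illustrative Python, not code from the lecture; the initial guesses, the iteration count, and the convention that x is the probability of picking coin 1 are assumptions):

```python
# EM sketch for the two-coin example: coin 1 is fair, coin 2 has unknown P(H) = p,
# and each toss uses coin 1 with unknown probability x (coin 2 with 1 - x).
# Only the outcomes are observed; which coin produced each toss is hidden.
observed = ["H", "T", "T", "H", "T"]

p = 0.6   # initial guess for P(H | coin 2)
x = 0.5   # initial guess for P(coin 1 is chosen)

for _ in range(50):
    # E-step: posterior probability that each toss came from coin 1
    posteriors = []
    for toss in observed:
        lik1 = 0.5                            # P(toss | coin 1), fair coin
        lik2 = p if toss == "H" else (1 - p)  # P(toss | coin 2)
        posteriors.append(x * lik1 / (x * lik1 + (1 - x) * lik2))

    # M-step: re-estimate parameters from the expected (fractional) counts
    x = sum(posteriors) / len(observed)
    heads2 = sum(1 - w for w, t in zip(posteriors, observed) if t == "H")
    total2 = sum(1 - w for w in posteriors)
    p = heads2 / total2           # fraction of coin-2 mass that landed heads

print(f"x ~ {x:.3f}, p ~ {p:.3f}, q = 1 - p ~ {1 - p:.3f}")
```

With only five tosses the parameters are not really identifiable (many (x, p) pairs explain the data equally well), so EM settles on a local optimum that depends on the initial guesses; this is exactly the caveat on the next slide.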
EM Properties
• EM is a general technique for learning anytime we have incomplete data (x,y)
• Each step of EM is guaranteed to increase the data likelihood - a hill-climbing procedure
• Not guaranteed to find the global maximum of the data likelihood
– Data likelihood typically has many local maxima for a general model class and rich feature set
– Many “patterns” in the data that we can fit our model to…
Ideal vs. Available Data – Alignment Problem for Machine Translation
• MT as a noisy channel: E ~ P(E) → [noisy channel P(F|E)] → F
• Ideal: e1 e2 e3 … (solvable by SL)
         f1 f2 f3 …
• Available: e1 e2 e3 …
             f1 f2 f3 … (need EM; the word alignments are missing)
Ex: English-French Alignment

• Data: (the house ↔ la maison), (house ↔ maison)
• Alignments are missing!!
• Theory: English words are translated first, then permuted.
• Parameters: P(la|the), P(maison|the), P(la|house), P(maison|house)
Ex: EM Training on MT

• Model to learn: P(la|the) = ?, P(maison|the) = ?, P(la|house) = ?, P(maison|house) = ?
• Possible alignments of “the house ↔ la maison”:
  (a) the→maison, house→la    (b) the→la, house→maison
  (“house ↔ maison” has only one possible alignment.)
• Initialize uniformly:
  P(la|the) = P(maison|the) = P(la|house) = P(maison|house) = 1/2
• Iteration 1:
  – Score the alignments: p(a) = 1/8, p(b) = 1/8
  – Fractional counts:
    C(la|the) = 0*1/8 + 1*1/8 = 1/8
    C(maison|the) = 1*1/8 + 0*1/8 = 1/8
    C(la|house) = 1*1/8 + 0*1/8 = 1/8
    C(maison|house) = 1*1/8 + 2*1/8 = 3/8
  – Normalize: p(la|the) = 1/2, p(maison|the) = 1/2, p(la|house) = 1/4, p(maison|house) = 3/4
• Iteration 2:
  – Score the alignments: p(a) = 3/32, p(b) = 9/32
  – Fractional counts: C(la|the) = 9/32, C(maison|the) = 3/32, C(la|house) = 3/32, C(maison|house) = 21/32
  – Normalize: p(la|the) = 3/4, p(maison|the) = 1/4, p(la|house) = 1/8, p(maison|house) = 7/8
• Iteration 3:
  – Score the alignments: p(a) = 7/256, p(b) = 147/256
  – The correct alignment (b) keeps gaining probability, and the translation table keeps sharpening.
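For completeness, here is a small EM loop over this toy corpus in Python. It is my own IBM-Model-1-style sketch rather than the exact permutation model on the slide, so the intermediate fractional counts differ slightly, but it converges to the same answer: P(la|the) → 1 and P(maison|house) → 1.

```python
from collections import defaultdict

# Toy parallel corpus from the slides: (English words, French words)
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["house"], ["maison"]),
]

english = {e for es, _ in corpus for e in es}
french = {f for _, fs in corpus for f in fs}

# Initialize translation probabilities t(f|e) uniformly
t = {e: {f: 1.0 / len(french) for f in french} for e in english}

for _ in range(10):
    counts = defaultdict(lambda: defaultdict(float))  # fractional counts c(f, e)
    totals = defaultdict(float)                       # fractional counts c(e)

    # E-step: each French word spreads one fractional count over the English
    # words in the same sentence, in proportion to the current t(f|e)
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[e][f] for e in es)
            for e in es:
                frac = t[e][f] / norm
                counts[e][f] += frac
                totals[e] += frac

    # M-step: re-normalize the fractional counts into new probabilities
    for e in english:
        for f in french:
            t[e][f] = counts[e][f] / totals[e]

for e in sorted(english):
    for f in sorted(french):
        print(f"t({f}|{e}) = {t[e][f]:.3f}")
```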