Assessing the Models
Prof. Shou-de Lin, CSIE/GINM, NTU
Sdlin@csie.ntu.edu.tw
2009/11/30
Questions
• For a theory (or hypothesis or model) T, how do we know if one set of parameter estimates is better than another?
• Which is better? Theory T with parameter estimates X, or theory S with parameter estimates P?
• Knowledge discovery is about finding the best model with optimal parameters that fits the given data, then using the model to find something useful.
[Figure: two candidate models, M1 and M2]
Model Assessment: A Bayesian Approach
• d= data (observation), m=model (how the data are generated)
• Most likely model given the data:

  argmax_m P(m|d)

• By Bayes’ rule (P(d) does not depend on m, so it can be dropped from the argmax):

  argmax_m P(m|d) = argmax_m P(d|m) * P(m) / P(d) = argmax_m P(m) * P(d|m)

• P(m): does this model look reasonable?
• P(d|m): given the fixed model m, does the observed data stream look reasonable?
Maximum Likelihood Estimation (MLE)
• If p(m) is unknown, then we can only evaluate
  argmax_m P(d|m), which is usually quantitative!! Thank god ☺
• E.g. d = H H T H
  – M1: coin is unbiased, p(d|m) = 0.5^4 = 0.0625
  – M2: coin is biased s.t. p(H) = 3/4, p(d|m) = 3/4 * 3/4 * 1/4 * 3/4 = 27/256 ≈ 0.105 ☺
  – M3: coin is biased so that P(H) = 0.9, p(d|m) = 0.9^3 * 0.1 ≈ 0.073
argmax_m P(m) * P(d|m)
• What if p(m) is not uniform? (e.g. we examine the coin and find nothing wrong with it)
• E.g. P(M1) = 0.9, P(M2) = 0.05, P(M3) = 0.05
• Then, in the previous example,
  – P(M1) * P(d|M1) = 0.9 * 0.0625 ≈ 0.056 ☺
  – P(M2) * P(d|M2) = 0.05 * 0.105 ≈ 0.0053
  – P(M3) * P(d|M3) = 0.05 * 0.073 ≈ 0.0037
Unsupervised Learning
Prof. Shou-de Lin, CSIE/GINM, NTU
To Bring you Back to the Earth
In the “whatever I want to do” lecture, I’ll teach:
• Supervised learning. (2 hours)
– Generative learning algorithms. Gaussian discriminant analysis.
• Unsupervised learning. (3 hours)
– EM (why? Because it is as magical as you should know).
Note: Last year I used 3 full lectures teaching EM
– Clustering: K-means (why? Because it is as simple as you should know)
• Reinforcement learning (0.5 hour)
– Value iteration and policy iteration.
– Q‐learning & SARSA
What is Unsupervised Learning?
• Supervised learning: we are given a set of training data X together with class labels y, and we want to learn a function f(x) = y that maps x to y
• Unsupervised Learning:
– Clustering: given x, grouping x into different clusters.
– EM: given x and partial information about y, trying to learn f(x).
• EM is the key solution to many knowledge discovery tasks.
Analogy: Decipherment
• SL: given a bunch of ciphered words X and their decipherment Y, trying to figure out f(X) = Y. For example, (X, Y) = (byf, axe), (hppe, good), (bqqmf, apple); f = ?
• However, this is not how decipherment works in the real world. People didn’t decipher Egyptian or Maya this way. They did it in an unsupervised manner (only X is given, and they need to translate it into Y):
  X = (byf, hppe, bqqmf, …), f = ?
Data and Model

• Supervised learning: complete data + incomplete model (parameters unknown) → complete model
  argmax_m P(m|data)
• Data generation: complete model + incomplete data → complete data
  argmax_d P(d|m)
• EM: incomplete data + incomplete model → complete data & model
  argmax_m P(incomplete data|m)
Ideal vs. Available Data – Sequential Labeling (POS tagging)
• Part-of-speech tagging as a noisy channel: T ~ P(T) → [noisy channel P(W|T)] → W
• Ideal: t1 t2 t3 …
         w1 w2 w3 …
• Available: w1 w2 w3 …
Ideal vs. Available Data – Cryptography

• Cryptography as a noisy channel: E ~ P(E) → [noisy channel P(C|E)] → C
• Ideal: e1 e2 e3 … (solvable by SL)
         c1 c2 c3 …
• Available: c1 c2 c3 … (need EM)
Introducing EM
• Expectation Maximization (EM) is perhaps the most often used and most half-understood algorithm for unsupervised learning.
– It is very intuitive.
– Many people rely on their intuition to apply the algorithm in different problem domains.
– It is not an algorithm but a framework. Different algorithms can be designed based on the EM framework.
• Note: The following slides integrate some people’s materials and viewpoints about EM, including Kevin Knight, Dekang Lin, D. Prescher, and Dan Klein.
EM framework
• Expectation step: Use current parameters (and observations) to reconstruct hidden structure
• Maximization step: Use that hidden structure (and observations) to re-estimate parameters
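Stated a bit more formally (standard EM notation, not taken from these slides): with observed data $x$, hidden variables $z$, and parameters $\theta$,

$$\text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\big[\log p(x, z \mid \theta)\big]$$

$$\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$

The E-step fills in the hidden structure in expectation; the M-step re-fits the parameters as if that expected structure were observed.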
Handling Incomplete Data
• Our goal is to build a probabilistic model of data (e.g. LM), defined by a set of parameters θ
• The model parameters can be estimated from a set of IID training examples: x1, x2, …, xn
• Unfortunately, we only get to observe partial information about x’s, for example:
– xi = (ti, yi) and we can only observe yi. The ti’s are the so-called “hidden” data that will be modeled by the “hidden” variables in EM.
• How can we still construct the model?
Example: MLE

• A coin with P(H) = p, P(T) = q. We observed m H’s and n T’s.
• Q: What are p and q according to MLE?
• Solution:
• Maximize Σ_i log P_θ(y_i) = log(p^m q^n) = m log p + n log q, under the constraint p + q = 1
• Lagrange Method:
– Define g(p, q) = m log p + n log q + λ(p + q − 1)
– Solve the equations:
  ∂g(p, q)/∂p = 0,  ∂g(p, q)/∂q = 0,  p + q = 1
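Carrying the calculation one step further (not spelled out on the slide, but it follows directly from the equations above):

$$\frac{\partial g}{\partial p} = \frac{m}{p} + \lambda = 0,\qquad \frac{\partial g}{\partial q} = \frac{n}{q} + \lambda = 0 \;\Rightarrow\; p = -\frac{m}{\lambda},\; q = -\frac{n}{\lambda}$$

$$p + q = 1 \;\Rightarrow\; \lambda = -(m+n) \;\Rightarrow\; p = \frac{m}{m+n},\quad q = \frac{n}{m+n}$$

So the MLE is simply the empirical frequency of heads (and tails).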
But if the data is incomplete

• Suppose we have two coins. Coin 1 is fair. Coin 2 has probability p of generating H.
• On each toss one of the coins is chosen; say coin 1 is chosen with probability x.
• We only know the result of the toss, but don’t know which coin was chosen.
  – The complete data is (1, H), (1, T), (2, T), (1, H), (2, T)
  – The observed data is H, T, T, H, T.
• What are p, q and x?
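One way to answer this with EM is sketched below (my own illustrative Python, not code from the lecture; the initial guesses, the iteration count, and the convention that x is the probability of picking coin 1 are assumptions):

```python
# EM sketch for the two-coin example: coin 1 is fair, coin 2 has unknown P(H) = p,
# and each toss uses coin 1 with unknown probability x (coin 2 with 1 - x).
# Only the outcomes are observed; which coin produced each toss is hidden.
observed = ["H", "T", "T", "H", "T"]

p = 0.6   # initial guess for P(H | coin 2)
x = 0.5   # initial guess for P(coin 1 is chosen)

for _ in range(50):
    # E-step: posterior probability that each toss came from coin 1
    posteriors = []
    for toss in observed:
        lik1 = 0.5                            # P(toss | coin 1), fair coin
        lik2 = p if toss == "H" else (1 - p)  # P(toss | coin 2)
        posteriors.append(x * lik1 / (x * lik1 + (1 - x) * lik2))

    # M-step: re-estimate parameters from the expected (fractional) counts
    x = sum(posteriors) / len(observed)
    heads2 = sum(1 - w for w, t in zip(posteriors, observed) if t == "H")
    total2 = sum(1 - w for w in posteriors)
    p = heads2 / total2           # fraction of coin-2 mass that landed heads

print(f"x ~ {x:.3f}, p ~ {p:.3f}, q = 1 - p ~ {1 - p:.3f}")
```

With only five tosses the parameters are not really identifiable (many (x, p) pairs explain the data equally well), so EM settles on a local optimum that depends on the initial guesses; this is exactly the caveat on the next slide.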
EM Properties
• EM is a general technique for learning anytime we have incomplete data (x,y)
• Each step of EM is guaranteed to increase the data likelihood - a hill-climbing procedure
• Not guaranteed to find the global maximum of the data likelihood
– Data likelihood typically has many local maxima for a general model class and rich feature set
– Many “patterns” in the data that we can fit our model to…
Ideal vs. Available Data – Alignment Problem for Machine Translation
• MT as a noisy channel: E ~ P(E) → [noisy channel P(F|E)] → F
• Ideal: e1 e2 e3 … (solvable by SL)
         f1 f2 f3 …
• Available: e1 e2 e3 …
             f1 f2 f3 … (need EM; the word alignments are missing)
Ex: English-French Alignment

• Data: (the house ↔ la maison), (house ↔ maison)
• Alignments are missing!!
• Theory: English words are translated first, then permuted.
• Parameters: P(la|the), P(maison|the), P(la|house), P(maison|house)
Ex: EM Training on MT

• Model to learn: P(la|the) = ?, P(maison|the) = ?, P(la|house) = ?, P(maison|house) = ?
• Possible alignments of “the house ↔ la maison”:
  (a) the→maison, house→la    (b) the→la, house→maison
  (“house ↔ maison” has only one possible alignment.)
• Initialize uniformly:
  P(la|the) = P(maison|the) = P(la|house) = P(maison|house) = 1/2
• Iteration 1:
  – Score the alignments: p(a) = 1/8, p(b) = 1/8
  – Fractional counts:
    C(la|the) = 0*1/8 + 1*1/8 = 1/8
    C(maison|the) = 1*1/8 + 0*1/8 = 1/8
    C(la|house) = 1*1/8 + 0*1/8 = 1/8
    C(maison|house) = 1*1/8 + 2*1/8 = 3/8
  – Normalize: p(la|the) = 1/2, p(maison|the) = 1/2, p(la|house) = 1/4, p(maison|house) = 3/4
• Iteration 2:
  – Score the alignments: p(a) = 3/32, p(b) = 9/32
  – Fractional counts: C(la|the) = 9/32, C(maison|the) = 3/32, C(la|house) = 3/32, C(maison|house) = 21/32
  – Normalize: p(la|the) = 3/4, p(maison|the) = 1/4, p(la|house) = 1/8, p(maison|house) = 7/8
• Iteration 3:
  – Score the alignments: p(a) = 7/256, p(b) = 147/256
  – The correct alignment (b) keeps gaining probability, and the translation table keeps sharpening.
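For completeness, here is a small EM loop over this toy corpus in Python. It is my own IBM-Model-1-style sketch rather than the exact permutation model on the slide, so the intermediate fractional counts differ slightly, but it converges to the same answer: P(la|the) → 1 and P(maison|house) → 1.

```python
from collections import defaultdict

# Toy parallel corpus from the slides: (English words, French words)
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["house"], ["maison"]),
]

english = {e for es, _ in corpus for e in es}
french = {f for _, fs in corpus for f in fs}

# Initialize translation probabilities t(f|e) uniformly
t = {e: {f: 1.0 / len(french) for f in french} for e in english}

for _ in range(10):
    counts = defaultdict(lambda: defaultdict(float))  # fractional counts c(f, e)
    totals = defaultdict(float)                       # fractional counts c(e)

    # E-step: each French word spreads one fractional count over the English
    # words in the same sentence, in proportion to the current t(f|e)
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[e][f] for e in es)
            for e in es:
                frac = t[e][f] / norm
                counts[e][f] += frac
                totals[e] += frac

    # M-step: re-normalize the fractional counts into new probabilities
    for e in english:
        for f in french:
            t[e][f] = counts[e][f] / totals[e]

for e in sorted(english):
    for f in sorted(french):
        print(f"t({f}|{e}) = {t[e][f]:.3f}")
```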