• 沒有找到結果。

# Advanced Topics in Learning and Vision

N/A
N/A
Protected

Share "Advanced Topics in Learning and Vision"

Copied!
26
0
0

(1)

## Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

### Announcements

• Project midterm presentation: Nov 23

- Forsyth et al.: Structure from motion and color constancy using MCMC.

- Tu et al.: Image parsing using DDMCMC

D. Forsyth, J. Haddon and S. Ioffe. The joy of sampling. International Journal of Computer Vision, volume 41, no. 1/2, pages 109–134, 2001.

Z. Tu, X. Chen, A. Yuille and S. Zhu. Image parsing: unifying segmentation, detection, and recognition. In Proceedings of International Conference on Computer Vision, pages

18–25, 2003.

- David MacKay. Introduction to Monte Carlo methods.

- C. Bishop. Lecture notes for British Computer Society Summer School on Pattern Recognition.

(3)

### Overview

• EM algorithm

• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loop belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications

(4)

### Conditional Independence

• If A is independent of B given C

p(A|B, C) = p(A|C) (1)

• Equivalently,

p(A, B|C) = p(A|B, C)p(B|C)

= p(A|C)p(B|C) (2)

• Conditional independence is crucial in applying graphical models as it significantly reduces the complexity of the graph.

• Need to be careful when nodes are not observable. See Bishop’s tutorials for details.

(5)

### Markov Properties

• Markov properties allows us to exploit conditional independence in graphical models.

- Time domain: p(st|s1, . . . , st−1) = p(st|st−1), e.g., Hidden Markov Model (HMM)

- Space domain (Markov blanket): p(A|M B(A), B) = p(A|M B(A)), e.g., Markov Random Field (MRF).

• M B(A) include the parents and children of node A, and it’s children’s parents, because they can be used to explain away the children nodes.

• Shield off nodes that are conditionally independent of node A given the M B(A).

(6)

### ML, MAP and Bayesian Learning

• Data set: D = {x1, . . . , xN} with distribution p(x).

• Likelihood (independent observations): L(θ) = p(D|θ) = QN

i=1 p(xi|θ).

• Maximum likelihood (ML) estimate:

θM L = arg max

θ L(θ) = arg max

θ p(D|θ) (3)

• Predictive distribution: p(x|D) ≈ p(x|θM L).

• Maximum a posteriori (MAP) estimate:

θM AP = arg max

θ p(θ|D) = arg max

θ p(D|θ)p(θ) (4)

where p(D) is usually assumed as a constant.

• Predictive distribution: p(x|D) ≈ p(x|θM AP).

(7)

Bayesian learning: marginalize over unknown parameters, rather than point estimates

p(x|D) = Z

p(x|θ)p(θ|D)dθ (5)

• Avoids overfitting problems of ML and MAP

p(D|θ) = QN

i=1 p(xi|θ)

p(D, θ, x) = p(D|θ)p(x|θ)p(θ) (6)

(8)

• Predictive distribution:

p(x|D) = 1 p(D)

Z

p(x|θ)p(D|θ)p(θ)dθ (7)

• Model evidence:

p(D) = Z

p(D|θ)p(θ)dθ (8)

(9)

### Exponential family

• Many distributions can be written as

p(x|θ) = exp{θTu(x) + g(θ) + f (x)} (9)

• Exponential family: Gaussian, Dirichlet, Gamma, Multinomial, Wishart, Bernoulli, . . .

• Use precision (inverse of variance) τ = 1/σ2. N (x|µ, τ−1) = ( τ

2π)1/2 exp{−τ

2(x − µ)2} (10)

θ =

 µτ

−τ /2

 u(x) =

 x x2



g(θ) = 12 ln(τ ) − 12τ µ2 f (x) = 0

(11)

(10)

### Bayesian Estimation

• Likelihood function

p(D|θ) =

N

Y

i=1

p(xi|θ) (12)

• Conjugate prior: prior of θ has same functional form as likelihood

p(θ|η, ν) = exp{θTη + νg(θ) + h(η, ν)} (13)

• Posterior:

p(θ|x) = p(x|θ)p(θ) ∝ exp{θTη + ¯¯ νg(θ)} (14) where η = η + u(x),¯ ν = ν + 1.¯

• Can interpret prior as ν effective observations of value η.

• Examples (which we put prior on parameters):

- Gaussian for the mean of a Gaussian.

- Gaussian-Wishart of mean and precision of Gaussian.

- Dirichlet for the parameters of a discrete distribution.

(11)

### The EM Algorithm Revisited

• Consider a model with observed data y, hidden/latent variables x, and parameters θ. The log likelihood is bounded by

L(θ) = log p(y|θ) = logR p(x, y|θ)dx

= logR q(x)p(x,y|θ)q(x) dx

≥ R q(x) log p(x,y|θ)q(x) dx ≡ F (q, θ)

(15)

where q(x) is an arbitrary density function over the hidden variables, and the lower bound holds due to the concavity of the log function, i.e., using Jensen’s inequality.

• Recall Jensen’s inequality for convex function

f (λx1 + (1 − λ)x2) ≤ λf (x1) + (1 − λ)f (x2) (16) for x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1.

• The lower bound F is a functional of q(x) and θ, F (q, θ) = PN

i=1 F(i)(q(i), θ).

(12)

• Idea: Iterate between optimizing the lower bound as a function of q and as a function of θ.

• Can prove that the log likelihood is never decreased.

E step: Optimize F w.r.t. q while holding θ fixed (i.e., compute expectation of hidden variables).

qk(x) = arg max

q(x)

Z

q(x) log p(x, y|θk−1)

q(x) (17)

qk(x) = p(x|y, θk−1) (18)

M step: Optimize F w.r.t. θ while holding q fixed (i.e., maximizing the log likelihood).

θk = arg maxθ R qk(x) log p(x,y|θ)q

k(x)

θk = arg maxθ R qk(x) log p(x, y|θ)dx (19)

• In the E step, for each data point, the distribution over the hidden variables is set to the posterior for that data point, qk(i) = p(x|y(i), θk−1), ∀i.

(13)

• In the M step, the set of parameters is re-estimated by maximizing the sum of the expected log likelihood: θk = arg maxθ P

i R qk(i) log p(x, y(i)|θ)dx.

• Note that from (17) to (18),

qk(x) = arg max

q(x)

h

log p(y|θk−1) + R q(x) log p(x|y,θq(x)k−1)dx

i (20)

• The first term is constant w.r.t. q(x) and the second term is the negative of KL divergence

KL(q(x)||p(x|y, θk−1)) = Z

q(x) log q(x)

p(x|y, θk−1) (21) which is minimized when q(x) = p(x|y, θk−1).

• In the E step, the goal is to find the posterior distribution of the hidden variables given the observed variables and the current settlings of the parameters.

(14)

• Also, since the KL divergence is zero, F (qk, θk−1) = L(θk−1) at the end of E step.

• In the M step, F is increased w.r.t. θ. Thus, F (qk, θk) ≥ F (qk, θk−1). And after the next E step, L(θk) = F (qk+1, θk) ≥ F (qk, θk). Thus

L(θk) ≥ L(θk−1).

• Can be applied to all the latent variable models including factor analysis, probabilistic PCA, mixture of factor analyzers, mixture of probabilistic PCA, mixture of Gaussians, etc.

• The likelihood often has many local optimal and EM may converge to some local optimal rather than the global one.

• Can be extended to MAP or Bayesian estimates.

• See Bishop’s lectures notes on latent variables, mixture models and EM (BCS summer school, Exeter, 2003).

(15)

### Markov Chain Monte Carlo (MCMC)

• It is difficult to compute joint distribution exactly.

• Goals:

- Aim to approximate the joint distribution p(x) so that we can draw samples

- Estimate expectation of functions under p(x), e.g.,

Φ = Z

φ(x)p(x)dx (22)

• Focus on sampling problems as the expectation can be estimated by drawing random samples {x(r)}.

Φ =ˆ 1 R

X

r

φ(x(r)) (23)

(16)

• As R → ∞, Φ → Φˆ since the variance σ2 =

Z

(φ(x) − Φ)2dx (24)

decreases as σR2.

• Good news: the accuracy of Monte Carlo estimate in (23) is independent of the dimensionality of the space sampled!

Bad news: It is difficult to draw independent samples in the high dimensional space.

(17)

### Why Sampling p(x) Is Difficult?

• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p(x).

p(x) = p(x)/Z (25)

• Two difficulties in evaluating p(x).

- Typically we do not know Z

Z = Z

p(x)dx (26)

- Even if we know Z, it is difficult to draw samples to well represent or cover p(x) in the high dimensional space.

(18)

• Example:

• Decompose x = (x1, . . . , xd) in every dimension.

• Discretize x and ask for samples from discrete probability distribution over a set of uniformly spaced points {xi}, and

Z = P

i p(xi)

p(xi) = p(xi)/Z (27)

• Suppose we draw 50 samples uniformly spaced in 1-dimensional space, we need 501000 samples in 1000-dimensional space!

(19)

• Even if we draw 2 samples in each dimension, we still need 21000 samples in 1000-dimensional space.

• Related to Ising model, Boltzmann machine and Markov field.

• See MacKay for more detail on the number of samples are required to have a good approximation.

(20)

### Importance Sampling

• Recall we evaluate p(x) and evaluate with p: p(x) = p(x)/Z.

• p is often complicated and difficult to draw samples from.

• Proposal density function: Assume that we have a simpler density q(x) which we can evaluate with a multiplicative constant q(x), where

q(x) = q/Zq, and from which we can generate samples.

• Introduce weights to adjust the “importance” of each sample wr = p(x(r))

q(x(r)) (28)

(21)

, and

Φ =ˆ P

r wrφ(x(r)) P

r wr (29)

• It can be shown that Φˆ converges to Φ, the mean value of φ(x) as R increases (under some constraints).

• Problem: difficult to estimate how reliable Φˆ is.

• Examples of proposal functions: Gaussian and Cauchy distributions

(p(x) ∼ 1

πγ

h 1 + (x−xγ 0)2 i where γ is a scale )

(22)

• The results suggest we should use “heavy tailed” importance sampler

• Heavy tailed: a high proportion of the population is comprised of extreme values.

Left: Gaussian Right: Cauchy

(23)

### Rejection Sampling

• Further assume that the proposal function q

cq(x) > p(x) ∀x (30)

• Steps:

- First generate x from q(x) and evaluated with cq(x)

- Second generate uniformly distributed variable u from the interval [0, cq(x)].

- If u > p(x), then x is rejected. Otherwise, x is added to out set of samples {x(r)}.

(24)

### Metropolis Sampling

• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).

• The Metropolis algorithm makes use of q(x) which depends on the current state x(t).

• Example: q(x0; x(t)) may be a simple Gaussian distribution centered at x(t).

• A tentative state x0 is generated from the proposal density q(x0; x(t)).

(25)

• Compute

a = p(x0)

p(x(t))

q(x(t);x0) q(x0;x(t))

If a ≥ 1, then the new state is accepted

Otherwise, the new state is accepted with probability a

(31)

• If the step is accepted, we set x(t+1) = x0. Otherwise, we set x(t+1) = x(t).

• We need to compute p(x0)

p(x(t)) and q(x(t);x0)

q(x0;x(t))

• If proposal density is a simple symmetric density as a Gaussian, then the latter factor is unity and the Metropolis algorithm simply involves comparing the value of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings algorithm.

• Metropolis method is an example of Markov Chain Monte Carlo (MCMC) method.

(26)

• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence of {x(r)} is generated where each sample x(t) having a probability distribution that depends on the previous value, x(t−1).

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problems: slow convergence

• Many methods have been proposed for speed up.

Randomly permute the list of candidates best=0. for i=1

• To achieve small expected risk, that is good generalization performance ⇒ both the empirical risk and the ratio between VC dimension and the number of data points have to be small..

• Learn the mapping between input data and the corresponding points the low dimensional manifold using mixture of factor analyzers. • Learn a dynamical model based on the points on

For a directed graphical model, we need to specify the conditional probability distribution (CPD) at each node.. • If the variables are discrete, it can be represented as a

Hofmann, “Collaborative filtering via Gaussian probabilistic latent semantic analysis”, Proceedings of the 26th Annual International ACM SIGIR Conference on Research and

Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp... Annealed

• Global coordination of local generative models: Global coordination [1], Alignment of local representation

By University students of the Chung Hua University for objects, to make use of the descriptive statistics, the reliability analysis, and the factor analysis, considers the