
Advanced Topics in Learning and Vision


(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Announcements

• Reading (due Nov 15): Tipping: Relevance Vector Machine

M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

• Supplementary Reading:

- K. Murphy. A brief introduction to graphical models and Bayesian networks.

- Zoubin Ghahramani. Unsupervised learning.

- B. Frey. Inference and Learning in Graphical Models (CVPR 00 tutorial).

- Blake, Freeman, Bishop and Viola. Learning and vision tutorial (ICCV 03 tutorial).

- C. Bishop. Machine learning techniques for computer vision (ECCV 04 tutorial).

• Toolbox:

- Intel Open Source Probabilistic Network Library:

http://www.intel.com/technology/computing/pnl/

- Kevin Murphy’s Bayes Net Toolbox for Matlab:

http://bnt.sourceforge.net/

- Microsoft Bayesian Network Editor and Toolkit:

http://research.microsoft.com/adapt/MSBNx/

(3)

Overview

• Graphical model

• Bayesian inference

• Approximate inference

• Markov Chain Monte Carlo (MCMC)

• Belief propagation, Loopy belief propagation

• Variational inference

• Gaussian process, Gaussian process latent variable model

• Applications

(4)

Graphical Model

• A marriage of probability and graph theory.

Provides a natural tool for dealing with uncertainty and complexity.

• Widely used in design and analysis of machine learning algorithms.

Notion of modularity: a complex system is built by combining simpler parts.

• Probability theory provides the glue for combining the parts, ensuring that the system as a whole is consistent, and provides the machinery for inference.

• Graph theory provides tools for encoding prior knowledge about the structure of the data.

• Many classical multivariate probabilistic models studied in statistics, systems engineering, information theory, pattern recognition, and statistical mechanics are special cases of graphical models.

(5)

• Examples:

- factor analysis

- mixture model

- hidden Markov model

- linear dynamic system

- Markov random field

- Kalman filter

- Ising model

- ...

• Issues:

- Representation: how to come up with a graph structure that encodes the relationships among variables?

- Learning: how to learn the model parameters or models?

- Inference: how to answer probabilistic queries?

- Decision theory: convert beliefs into actions.

- Applications: how to apply them?

(6)

Solving Problems as Probabilistic Inference

• Image

- p(John|Image)

- p(Happy|Image)

- p(Happy|Image, John)

- p(John|Image, Happy)

• Video

- p(The kind of sport|Video)

- p(Human pose|Video)

- p(Suspicious activity|Video)

- p(Shape, appearance, position at time t|Video)

• Bioinformatics

- p(disease|sequence)

(7)

Representation

• Graphical models are graphs with

- node: observable/hidden random variables

- arc: conditional dependence (that is, the lack of an arc represents conditional independence)

where the relationships are probabilistic.

• Undirected graphical models, e.g., Markov random field.

• Directed graphical models, e.g., Bayesian networks, belief networks.

• An arc from node A to B means A “causes” B.

• Need to specify the model parameters. For a directed graphical model, we need to specify the conditional probability distribution (CPD) at each node.

• If the variables are discrete, the CPD can be represented as a conditional probability table (CPT).

(8)

Directed Acyclic Graphs (DAG)

• Joint distribution:

p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i | pa_i)   (1)

where pa_i denotes the parents of node i. For a seven-node example,

p(x_1, \ldots, x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5)   (2)

(9)

Undirected Acyclic Graphs

• The joint distribution is a product of non-negative functions over the maximal cliques of the graph:

p(x) = \frac{1}{Z} \prod_C ψ_C(x_C)   (3)

where ψ_C(x_C) is the clique potential function and Z is a normalization constant. For a four-node example with maximal cliques {x_1, x_2, x_3} and {x_2, x_3, x_4},

p(x_1, x_2, x_3, x_4) = \frac{1}{Z} ψ_A(x_1, x_2, x_3) ψ_B(x_2, x_3, x_4)   (4)
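To make Eqs. (3)-(4) concrete, the following minimal sketch uses a toy model over four binary variables with the two maximal cliques above; the potential functions ψ_A and ψ_B are made-up illustrative values, and Z is computed by brute-force enumeration.

```python
import itertools

# Toy potentials for the two maximal cliques of Eq. (4); the values are
# arbitrary, chosen only to illustrate the computation.
def psi_A(x1, x2, x3):
    return 1.0 + 2.0 * (x1 == x2) + 0.5 * (x2 == x3)

def psi_B(x2, x3, x4):
    return 1.0 + 1.5 * (x2 == x4) + 0.5 * (x3 != x4)

# Unnormalized joint: product of clique potentials.
def unnormalized(x):
    x1, x2, x3, x4 = x
    return psi_A(x1, x2, x3) * psi_B(x2, x3, x4)

# Normalization constant Z: sum over all 2^4 configurations.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=4))

def joint(x):
    return unnormalized(x) / Z

# The normalized joint sums to one.
print(Z, sum(joint(x) for x in itertools.product([0, 1], repeat=4)))
```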

(10)

Special Cases

• Factorized model:

- p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i).

- example: Naive Bayes

• Fully connected:

- p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i | x_1, \ldots, x_{i-1}).

- always true, by the chain rule of probability.

• Neither case exploits prior knowledge about the problem at hand.

(11)

Example

• The event “grass is wet” (W = true) has two possible causes: the sprinkler and rain.

• For example, p(W = T |S = T, R = F ) = 0.9 and thus p(W = F |S = T, R = F ) = 1 − 0.9 = 0.1.

• Conditional independence: a node is independent of its ancestors given its parents.

(12)

• Using chain rule, the joint probability of all nodes in the graph is

p(C, S, R, W) = p(W|C, S, R) p(R|C, S) p(S|C) p(C)   (5)

• But using conditional independence relationships, we have

p(C, S, R, W) = p(W|S, R) p(R|C) p(S|C) p(C)   (6)

• Notice that the first and second terms have been simplified.

(13)

• In general, if we have N binary nodes, the full joint probability would require O(2^N) space to represent, but the factored form requires only O(N 2^k) space, where k is the maximum fan-in of a node.

• Fewer parameters make the learning problem much easier.
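For example, with N = 20 binary nodes and maximum fan-in k = 3, the full joint table has 2^20 ≈ 10^6 entries, while the factored form needs only on the order of 20 × 2^3 = 160 entries.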

(14)

Inference

• What is the possible cause of observing the grass is wet?

p(S = 1 | W = 1) = \frac{p(S = 1, W = 1)}{p(W = 1)} = \frac{\sum_{c,r} p(C = c, S = 1, R = r, W = 1)}{p(W = 1)} = 0.2781/0.6471 ≈ 0.430

p(R = 1 | W = 1) = \frac{p(R = 1, W = 1)}{p(W = 1)} = \frac{\sum_{c,s} p(C = c, S = s, R = 1, W = 1)}{p(W = 1)} = 0.4581/0.6471 ≈ 0.708   (7)

where p(W = 1) = \sum_{c,s,r} p(C = c, S = s, R = r, W = 1) = 0.6471 is a normalization term.

• We see that rain is the more likely cause of the wet grass.
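As a sanity check, these posteriors can be reproduced by brute-force enumeration of the joint in Eq. (6). The sketch below assumes the commonly used CPTs for this example (which reproduce the numbers quoted above): p(C=1)=0.5; p(S=1|C=0)=0.5, p(S=1|C=1)=0.1; p(R=1|C=0)=0.2, p(R=1|C=1)=0.8; p(W=1|S,R) = 0, 0.9, 0.9, 0.99.

```python
import itertools

# CPTs for the sprinkler network (assumed standard values; 1 = true).
p_C = {1: 0.5, 0: 0.5}
p_S_given_C = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.1, 0: 0.9}}              # p(S|C)
p_R_given_C = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.8, 0: 0.2}}              # p(R|C)
p_W_given_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # p(W=1|S,R)

def joint(c, s, r, w):
    """Joint probability p(C,S,R,W) using the factorization of Eq. (6)."""
    pw = p_W_given_SR[(s, r)] if w == 1 else 1.0 - p_W_given_SR[(s, r)]
    return p_C[c] * p_S_given_C[c][s] * p_R_given_C[c][r] * pw

def prob(query, evidence):
    """p(query | evidence), with query and evidence given as dicts like {'S': 1}."""
    names = ['C', 'S', 'R', 'W']
    def total(fixed):
        tot = 0.0
        for vals in itertools.product([0, 1], repeat=4):
            assign = dict(zip(names, vals))
            if all(assign[k] == v for k, v in fixed.items()):
                tot += joint(assign['C'], assign['S'], assign['R'], assign['W'])
        return tot
    return total({**evidence, **query}) / total(evidence)

print(prob({'S': 1}, {'W': 1}))   # 0.2781 / 0.6471 ~ 0.430
print(prob({'R': 1}, {'W': 1}))   # 0.4581 / 0.6471 ~ 0.708
```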

(15)

Explaining Away

• Two causes “compete” to explain the observed data.

• If we observe that the grass is wet and also know that it is raining, then we can compute

p(S = 1 | W = 1, R = 1) = 0.1945   (8)

• That is, the posterior probability that the sprinkler is on goes down.

• This is called “explaining away”: given the observation, knowing that one cause is present makes the other, competing cause less probable.
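With the CPTs assumed in the enumeration sketch above, this value follows directly: p(S = 1, R = 1, W = 1) = 0.0891 and p(R = 1, W = 1) = 0.4581, so p(S = 1 | W = 1, R = 1) = 0.0891/0.4581 ≈ 0.1945, lower than p(S = 1 | W = 1) ≈ 0.430; equivalently, the call prob({'S': 1}, {'W': 1, 'R': 1}) in that sketch returns the same value.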

(16)

• Another example

p(I, L, S) = p(I|L, S) p(L) p(S)
p(L, S) = p(L) p(S)
p(L, S|I) ≠ p(L|I) p(S|I)   (9)

• Either L or S alone can explain the observation I.

• Consider a population of students for which I is observed to be true. Within this population, being brainy makes you less likely to be sporty and vice versa, because each property alone is sufficient to explain I.

(17)

Top-down and Bottom-up Reasoning

• Bottom-up reasoning: Given an observation, we want to find the causes.

What is the likely cause given that the grass is wet? That is, p(S = 1 | W = 1) or p(R = 1 | W = 1)?

• Top-down (causal) reasoning: determine the probability of an effect given its causes.

What is the probability that the grass will be wet given that it is cloudy? That is, p(W = 1 | C = 1)?

• Bayes nets are called “generative” models, since they specify how causes generate effects.

(18)

Bayes Nets with Discrete and Continuous Variables

• So far we have used discrete variables in the examples.

• Create Bayesian networks with continuous variables using Gaussians.

• For discrete nodes with continuous parents, we can use logistic/softmax distributions.

FA: p(x) = \int p(x|z) p(z) dz
MFA: p(x) = \sum_{k=1}^{K} \int p(x|z, ω_k) p(z|ω_k) p(ω_k) dz   (10)
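As a small illustration of the FA case in Eq. (10), the sketch below sets up a toy factor analysis model with made-up parameters Λ, μ, Ψ. For Gaussian p(z) = N(0, I) and p(x|z) = N(Λz + μ, Ψ), the marginal p(x) is N(μ, ΛΛᵀ + Ψ), which the code checks against ancestral samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factor analysis model: x = mu + Lambda z + noise, with z ~ N(0, I).
# Lambda, mu, Psi are made-up illustrative values.
d, k = 3, 2
Lambda = rng.normal(size=(d, k))
mu = np.zeros(d)
Psi = np.diag([0.1, 0.2, 0.3])          # diagonal noise covariance

# For Gaussian p(z) and p(x|z), the marginal p(x) = \int p(x|z) p(z) dz
# is N(mu, Lambda Lambda^T + Psi).
cov_analytic = Lambda @ Lambda.T + Psi

# Check by ancestral sampling: draw z, then x given z.
n = 200_000
z = rng.normal(size=(n, k))
noise = rng.multivariate_normal(np.zeros(d), Psi, size=n)
x = mu + z @ Lambda.T + noise

print(np.round(cov_analytic, 2))
print(np.round(np.cov(x, rowvar=False), 2))   # close to the analytic covariance
```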

(19)

Temporal Model

• Dynamic Bayesian Network (DBN): directed graphical models of stochastic processes.

• Generalize hidden Markov models (HMMs) and linear dynamical systems (LDS).

• Often assume model structure and parameters do not change over time.

• Many variations of HMM and LDS.

(20)

State Space Methods (SSM)

p(x_{1:T}, y_{1:T} | θ) = \prod_{t=1}^{T} p(x_t | x_{t-1}, θ) p(y_t | x_t, θ)   (11)

• Hidden Markov Model (HMM): Discrete state variables.

• Kalman Filter: Continuous state variables modeled by Gaussians.

• Extensively used in pattern recognition (e.g., speech recognition) and vision (e.g., tracking and texture synthesis) problems.

• See also nonlinear switching state space models.
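The following minimal sketch implements forward filtering, p(x_t | y_{1:t}), for a two-state HMM under the factorization in Eq. (11); the transition matrix A, emission matrix B, initial distribution, and observation sequence are made-up illustrative values.

```python
import numpy as np

# Toy 2-state HMM; all numbers are made up for illustration.
A = np.array([[0.9, 0.1],      # A[i, j] = p(x_t = j | x_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],      # B[i, k] = p(y_t = k | x_t = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # p(x_1)
y = [0, 1, 1, 0, 1]            # observed symbols

def forward_filter(y, A, B, pi):
    """Return p(x_t | y_{1:t}) for each t and the log-likelihood log p(y_{1:T})."""
    alpha = pi * B[:, y[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    beliefs = [alpha]
    for obs in y[1:]:
        alpha = (alpha @ A) * B[:, obs]   # predict with A, then weight by the emission
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
        beliefs.append(alpha)
    return np.array(beliefs), loglik

beliefs, loglik = forward_filter(y, A, B, pi)
print(np.round(beliefs, 3))
print(loglik)
```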

(21)

Markov Random Field (MRF)

p(x, y) = \frac{1}{Z} \prod_i φ_i(x_i, y_i) \prod_{i,j} ψ_{ij}(x_i, x_j)   (12)

• y are observations and x are hidden states.

• Z is a normalization term.

• φ_i(x_i, y_i) is the potential between hidden variable x_i and its observation y_i.

• ψ_ij(x_i, x_j) is the potential between hidden variables x_i and x_j.

• Widely used in pattern recognition (e.g., texture modeling, segmentation) problems.

(22)

Conditional Random Field (CRF)

p(x|y) = \frac{1}{Z} \prod_i φ_i(x_i|y) \prod_{i,j} ψ_{ij}(x_i, x_j|y)   (13)

• Note that the potentials now depend on y.

• Relaxes the strong independence assumptions made in MRFs.

• Related to maximum entropy models.

• Used in text labeling, scene labeling problems.

(23)

Big Picture of Generative Models Revisited

S. Roweis and Z. Ghahramani. A Unifying Review of Linear Gaussian Models. Neural Computation, 11(2):305–345, 1999.

(24)

Inference

• A graphical model specifies a complete joint probability distribution over all variables.

• With this joint distribution, we can answer all possible inference queries by marginalization.

• For example, with N discrete nodes with binary states, the joint distribution has size O(2^N).

• Summing over the joint distribution takes exponential time.

p(W = w)
= \sum_c \sum_s \sum_r p(C = c, S = s, R = r, W = w)
= \sum_c \sum_s \sum_r p(C = c) p(S = s|C = c) p(R = r|C = c) p(W = w|S = s, R = r)
= \sum_c p(C = c) \sum_s p(S = s|C = c) \sum_r p(R = r|C = c) p(W = w|S = s, R = r)
= \sum_c p(C = c) \sum_s p(S = s|C = c) T_1(c, w, s)   (14)

(25)

where

T_1(c, w, s) = \sum_r p(R = r|C = c) p(W = w|S = s, R = r)

• Further pushing the sums in...

p(W = w)
= \sum_c p(C = c) \sum_s p(S = s|C = c) T_1(c, w, s)
= \sum_c p(C = c) T_2(c, w)   (15)

where

T_2(c, w) = \sum_s p(S = s|C = c) T_1(c, w, s)

• This algorithm is called variable elimination (a code sketch is given below).

• The principle of distributing sums over products can be generalized greatly to apply to any commutative semi-ring.

• It forms the basis of many common algorithms, such as Viterbi decoding and the fast Fourier transform (FFT).
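A short sketch of the elimination steps in Eqs. (14)-(15), reusing the CPTs assumed in the earlier enumeration sketch: R is summed out to form T_1(c, w, s), then S to form T_2(c, w), and finally C.

```python
# CPTs as in the earlier enumeration sketch (assumed standard values).
p_C = {1: 0.5, 0: 0.5}
p_S = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.1, 0: 0.9}}               # p(S=s | C=c)
p_R = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.8, 0: 0.2}}               # p(R=r | C=c)
p_W1 = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}   # p(W=1 | S, R)

def p_W(w, s, r):
    return p_W1[(s, r)] if w == 1 else 1.0 - p_W1[(s, r)]

# Eliminate R: T1(c, w, s) = sum_r p(R=r|C=c) p(W=w|S=s, R=r)
T1 = {(c, w, s): sum(p_R[c][r] * p_W(w, s, r) for r in (0, 1))
      for c in (0, 1) for w in (0, 1) for s in (0, 1)}

# Eliminate S: T2(c, w) = sum_s p(S=s|C=c) T1(c, w, s)
T2 = {(c, w): sum(p_S[c][s] * T1[(c, w, s)] for s in (0, 1))
      for c in (0, 1) for w in (0, 1)}

# Eliminate C: p(W=w) = sum_c p(C=c) T2(c, w)
p_W_marg = {w: sum(p_C[c] * T2[(c, w)] for c in (0, 1)) for w in (0, 1)}

print(p_W_marg[1])   # 0.6471, matching the normalization term used earlier
```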

(26)

• For details, see

- R. McEliece and S. Aji. The Generalized Distributive Law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

- F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

(27)

Dynamic Programming

• Dynamic programming (DP):

- Solve an optimization problem by caching subproblem solutions rather than recomputing them.

- A method for reducing the runtime of algorithms that exhibit overlapping subproblems and optimal substructure.

• Directed graphical model: can use DP to avoid the redundant computation that arises if we run variable elimination repeatedly.

• Undirected graphical model:

- Acyclic: can use Pearl’s local message passing algorithm.

- Cyclic: first transform the graph into a junction tree, then use Pearl’s message passing algorithm or others.

(28)

Approximate Algorithms

• Exact inference is NP-hard, so we resort to approximate algorithms.

• Approximate algorithms work well in practice.

• Sampling (Monte Carlo) algorithms:

- Importance sampling: the simplest approach is to draw random samples x from p(x), the prior distribution of the hidden variables, and then weight the samples by their likelihoods p(y|x), where y is the observation (evidence); see the sketch after this list.

- Markov Chain Monte Carlo (MCMC): a more efficient approach in high-dimensional spaces.

- Metropolis-Hastings algorithm.

- Gibbs sampling and others.

• Variational learning:

- Mean-field approximation: exploits the law of large numbers to approximate large sums of random variables by their means.

- In general, we decouple all the nodes and introduce a new parameter, i.e., a variational parameter, for each node.

(29)

- Iteratively update these parameters so as to minimize the KL divergence (relative entropy) between the approximate and the true probability distributions.

• Belief propagation:

- Based on local message passing

- Extended to loopy belief propagation, which performs well in practice.

- Bethe free energy, Kikuchi approximation, ...

• Laplace approximation

• Expectation propagation

• ...
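As referenced in the sampling bullet above, here is a minimal sketch of that weighting scheme for the sprinkler network, again with the CPTs assumed earlier: the hidden variables (C, S, R) are drawn from their prior and each sample is weighted by the likelihood p(W=1 | S, R), so the weighted fraction of samples with S = 1 approximates p(S=1 | W=1) ≈ 0.430.

```python
import random

random.seed(0)

# CPTs as in the earlier sketches (assumed standard values).
p_C1 = 0.5
p_S1_given_C = {0: 0.5, 1: 0.1}
p_R1_given_C = {0: 0.2, 1: 0.8}
p_W1_given_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}

def bernoulli(p):
    return 1 if random.random() < p else 0

# Likelihood weighting: sample the hidden variables from the prior and
# weight each sample by the likelihood of the evidence W = 1.
num_s1, total = 0.0, 0.0
for _ in range(200_000):
    c = bernoulli(p_C1)
    s = bernoulli(p_S1_given_C[c])
    r = bernoulli(p_R1_given_C[c])
    w = p_W1_given_SR[(s, r)]        # weight = p(W=1 | S=s, R=r)
    total += w
    num_s1 += w * s

print(num_s1 / total)   # approximately 0.430 = p(S=1 | W=1)
```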

(30)

Learning

• Need two things to describe a BN: the graph structure (topology) and the parameters.

Structure | Observability | Method
Known     | Full          | Maximum likelihood estimate
Known     | Partial       | EM (or gradient descent)
Unknown   | Full          | Search through model space
Unknown   | Partial       | EM + search through model space

• We mainly focus on problems with known structure and partial observability.

(31)

Bayes Rule

• Product rule

p(x, y) = p(x|y)p(y) (16)

• Sum rule

p(x) = \sum_y p(x, y)   (17)

p(x) is the marginal probability.

• Bayes rule

p(y|x) = \frac{p(x|y) p(y)}{p(x)}   (18)

where the normalization term is

p(x) = \sum_y p(x|y) p(y)   (19)
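For instance, with the sprinkler-network quantities computed earlier (under the CPTs assumed in the code sketches), p(R = 1) = 0.5 and p(W = 1 | R = 1) = 0.4581/0.5 = 0.9162, so Bayes rule gives p(R = 1 | W = 1) = 0.9162 × 0.5 / 0.6471 ≈ 0.708, the same value obtained by direct enumeration.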

(32)

Kullback-Leibler Divergence

• Data set: D = {x1, . . . , xN} with distribution p(x).

• How do we transmit x over a communication line optimally?

• Shannon showed that the optimal number of bits to use for encoding a symbol x with probability p(x) is −log_2 p(x).

• The expected cost or entropy of the distribution is

H(p) = −\sum_x p(x) \log_2 p(x)   (20)

• In general, we do not know p, and thus approximate it with q.

• The expected coding cost with q is −\sum_x p(x) \log_2 q(x).

(33)

• The difference between these two coding costs is called the Kullback-Leibler (KL) divergence.

KL(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}   (21)

• Measures a notion of distance between two distributions (though it is not symmetric).

• KL divergence is non-negative and zero if and only if p = q.

• KL measures the extra coding cost in bits incurred by using q to compress data when the true distribution is p.

• The better our model q of the data, the more efficiently we can compress and communicate new data.

• Important link between machine learning, statistics, and information theory.
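A small sketch of Eqs. (20)-(21) for two made-up discrete distributions p and q: it computes the entropy H(p), the expected coding cost under q (here in bits), and verifies that their difference equals KL(p||q) and is non-negative.

```python
import math

# Two made-up distributions over a four-symbol alphabet.
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

# Entropy of p: optimal expected code length in bits (Eq. 20).
H_p = -sum(pi * math.log2(pi) for pi in p)

# Expected coding cost if we build the code from q instead of p.
cost_q = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

# KL divergence in bits (Eq. 21): equals the extra cost and is non-negative.
kl_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(H_p, cost_q, kl_pq, cost_q - H_p)   # kl_pq == cost_q - H_p >= 0
```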
