
(1)

Fundamentals, Challenges, Advances, and Applications of Statistical Learning for Knowledge Discovery and Problem Solving:

A BYY Harmony Perspective

Lei Xu

http://www.cse.cuhk.edu.hk/~lxu/

Department of Computer Science and Engineering,

The Chinese University of Hong Kong

(2)

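Outline

• Two types of Intelligent Ability: Learning from Samples
  Discovering knowledge and solving problems are the two basic abilities that embody intelligence; both are acquired through learning.

• Key Ingredients of Statistical Learning
  Learning from a finite number of samples: the three key ingredients of statistical learning.

• Two Key Challenges and Advances on Seeking Solutions
  The two main challenges, and an outline of how the field has responded to them over the past decades.

• A Unified Theory: Bayesian Ying-Yang Harmony Learning
  A unified theoretical framework for statistical learning.

• Several Applications
  A brief introduction to several applications.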

(3)

exploration observation experiment

thinking communication collaboration

TYPE I

Knowledge about the world it survives

Why ?

(interpret what are observed)

The World

Two types of Intelligent Ability

(4)

The World

TYPE II

skill of handling each issue encountered in

the world How to do ? (problem solving)

Two types of Intelligent Ability

operation competition

driving cooperation

(5)

The World

TYPE II

Obtain Skills of Problem Solving via

• reasoning , inference, optimization

•learning from samples (fast implementation)

How to get the abilities

TYPE I

Obtain World Knowledge via

• loading from authorized sources (e.g., textbooks)

• learning from samples (pieces of uncertain evidence)

(6)

[Figure: three Gaussian components $G(x \mid m_1, \Sigma_1)$, $G(x \mid m_2, \Sigma_2)$, $G(x \mid m_3, \Sigma_3)$]

TYPE I

Discovering World Knowledge

via mining invariant dependence underlying a set of all samples

The World

TYPE II

Training Skills of Problem Solving

via building up input-response type dependence per sample

Two Types of Learning from Samples

(7)

• Singularity in samples gathering

Uncertainties in Learning from Samples

• Uncertainty due to noise

• Not enough samples

(8)

Statistical Learning

Sampling: Uniformly, More Samples

How to remove these uncertainties

statistical approaches

(accumulating, averaging, inference, estimation, etc)

Statistics, literally "gathering together and counting": Noise + others

(9)

Roles of Statistical Learning

Take key roles in

•Machine Learning

•Neural Networks

•Pattern Recognition

•Data Mining

•Bioinformatics

…

(10)

Learner (hardware)

Sample gathering

(world)

Learning theory (software)

Key Ingredients of Statistical Learning

The three basic elements

(11)

Key Challenge I

Learner’s hardware appropriately represents dependences among data

(matching structures of underlying world)

Learner (hardware)

Sample gathering

(world)

Learning theory

(software)

(12)

1. General Purpose

(13)

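Nonparametric joint density

Memory based: memorizing the samples one by one

Empirical density: $p_0(x) = \frac{1}{N}\sum_{t=1}^{N} \delta(x - x_t)$

Histogram (curse of dimensionality)

Memorizing with appropriate smoothing

Parzen window density: $p_h(x) = \frac{1}{N}\sum_{t=1}^{N} K_h(x, x_t)$

[Figure: delta functions $\delta(x - x_i)$ placed at the samples, and kernels $K_h(x, x_t)$, along the $x$ axis]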
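As a rough illustration of the Parzen window estimate $p_h(x) = \frac{1}{N}\sum_t K_h(x, x_t)$, the sketch below uses a one-dimensional Gaussian kernel. The kernel choice, the bandwidth `h = 0.5`, and the sample values are assumptions made only for this example; the slides do not specify them.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h (an illustrative choice of K_h)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def parzen_density(x, samples, h):
    """p_h(x) = (1/N) * sum_t K_h(x - x_t)."""
    return sum(gaussian_kernel(x - xt, h) for xt in samples) / len(samples)

# Hypothetical 1-D samples, used only to exercise the estimator.
samples = [-1.2, -0.8, 0.1, 0.4, 0.9, 1.5]

# Riemann-sum check that the estimate integrates to roughly one.
grid = [i * 0.01 for i in range(-1000, 1001)]
area = sum(parzen_density(x, samples, h=0.5) for x in grid) * 0.01
```

A smaller `h` moves the estimate toward the empirical (delta-function) density, while a larger `h` smooths it more strongly.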

(14)

Ensemble feature based: overall features

Mean and covariance matrix

$E(x) = m$, with $m = \frac{1}{N}\sum_{t=1}^{N} x_t$

$\Sigma = E[(x - \mu)(x - \mu)^T]$

Higher order statistics

third-order: skewness

fourth-order: kurtosis

…

the number increases exponentially

[Figure: scatter plot over the axes $X_1$ and $X_2$]
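The ensemble features listed above can be computed directly from samples. The following is a minimal sketch for one-dimensional data, using population (divide-by-N) moments; the helper `count_moments` is a hypothetical name introduced here to show how quickly the number of distinct order-r moments of a d-dimensional vector grows.

```python
import math

def moments(xs):
    """Return mean, variance, skewness and kurtosis of a 1-D sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    skew = sum(((x - mean) / std) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / std) ** 4 for x in xs) / n
    return mean, var, skew, kurt

def count_moments(d, r):
    """Number of distinct order-r moments of a d-dimensional vector."""
    return math.comb(d + r - 1, r)

mean, var, skew, kurt = moments([1.0, 2.0, 3.0, 4.0, 5.0])
# For d = 10, the count grows quickly with the order r.
counts = [count_moments(10, r) for r in (1, 2, 3, 4)]
```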

(15)

Specific purpose: parametric families

Domain specific densities, e.g., exponential family

Case by case: too narrow for a general purpose

Mean and covariance matrix: equivalent to assuming that $p(x)$ is Gaussian

$E(x) = m$, with $m = \frac{1}{N}\sum_{t=1}^{N} x_t$

$\Sigma = E[(x - \mu)(x - \mu)^T]$

[Figure: scatter plot over the axes $X_1$ and $X_2$]

(16)

3. Recent decades

Especially in the fields of Neural Networks and Machine Learning

(17)

Structures that indirectly specify distribution families

Typical structures

[Figure: network with inputs $x_1, x_2, \ldots, x_d$ and outputs $y_1, \ldots, y_k$]

(18)

Seek a general framework to integrate

• existing studies

• investigating new structures

(19)

Dependence structures among samples from one-body world

One-body world

Dependence structures among samples from multi-body world

Multi-bodies world

Dependence structures

within each body and across multi-bodies

VS

(20)

Dependence structures among samples from one-body world

Three Architectures

[Figure: three diagrams, each linking a learner to The World]

(21)

The World

TYPE II

Training Skills of Problem Solving

via building up input-response type dependence per sample

Forward Architecture

There are two

typical classes

(22)

Pair-wise structures $y = f(x)$

[Figure: The World produces pairs $\{x^t\}$ and $\{y^t\}$ related through $p(y|x)$]

Classification: a network system $Y = F(X)$ with 10 output units (0 to 9); discriminant functions $d_i(x)$, $d_j(x)$ and $g_{ij}(x) = d_i(x) - d_j(x)$

Regression: $y = f(x)$, with $E(y|x) = \int y\,p(y|x)\,dy$

(23)

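Mixture of Experts (ME)

[Figure: input $x$ fed to Expert 1, Expert 2, Expert 3 with outputs $P_{M1}(z|1,x)$, $P_{M1}(z|2,x)$, $P_{M1}(z|3,x)$, and to a Gating Network with outputs $P_{M2}(1|x)$, $P_{M2}(2|x)$, $P_{M2}(3|x)$, combined into the output $z$]

Neural Networks

[Figure: The World produces pairs $\{x^t\}$ and $\{y^t\}$ related through $p(y|x)$, $y = f(x)$]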

(24)

Transform structures $y = f(x)$

[Figure: The World, with $p(x)$ and $p(y|x)$]

(a) PCA: $[x_1, x_2] \to [y_1, y_2]$

(b) ICA: $[x_1, x_2] \to [y_1, y_2]$, with $q(y) = \prod_{j=1}^{k} q(y^{(j)})$

(c) CDF transform: $[x_1, x_2] \to [y_1, y_2]$ with a uniform output; Maximum Information Transfer, no redundancy

Redundancy's role for understanding perception (Attneave, 1954), sensory pathways (Barlow 1959, 1989), and pattern recognition (Watanabe, 1960)

(25)

TYPE I

Discovering World Knowledge

via mining invariant dependence

underlying a set of all samples

The World

Backward Architecture

There are also two

typical classes

(26)

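Finite Mixture (multi-objects): Gaussian Mixture

$q(x) = \sum_{r=1}^{k} \alpha_r\,G(x \mid \mu_r, \Sigma_r)$, with $\alpha_j \ge 0$ and $\sum_{j=1}^{k} \alpha_j = 1$

Empirical density of the samples: $p(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$

[Figure: input $X$ feeding the components $G(x \mid \mu_1, \Sigma_1), \ldots, G(x \mid \mu_k, \Sigma_k)$ with mixing weights $\alpha_1, \ldots, \alpha_k$, component index $l = 1, \ldots, k$, and outputs $\hat{x}_1, \ldots, \hat{x}_k$]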

(27)

Clustering analyses and Gaussian mixture

• A degenerated case: K-means for MSE clustering

• The EM algorithm for ML on Gaussian mixture: stated its three advantages and clarified a widespread misunderstanding

$\Theta_t = \Theta_{t-1} + P(\Theta_{t-1})\,\nabla_\Theta L(\Theta_{t-1})$, with a detailed formula of $P(\Theta_{t-1})$

(see Xu & Jordan, MIT AI Memo, 1992, then Xu, L., and M.I. Jordan, "On Convergence Properties of The EM Algorithm for Gaussian Mixtures", Neural Computation, Vol. 8, No. 1, 1996, pp. 129-151).

(Cited 54 times according to SCI; cited 159 times according to Google Scholar)
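To make the EM iteration for maximum likelihood on a Gaussian mixture concrete, here is a minimal one-dimensional sketch. It is a textbook form of EM, not the specific analysis of the cited paper; the data, the initial values, and the function name `em_gmm_1d` are assumptions for illustration.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density G(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, alphas, n_iter=30):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihood history."""
    mus, variances, alphas = list(mus), list(variances), list(alphas)
    k, n = len(mus), len(xs)
    history = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        resp, loglik = [], 0.0
        for x in xs:
            w = [alphas[j] * gauss(x, mus[j], variances[j]) for j in range(k)]
            total = sum(w)
            loglik += math.log(total)
            resp.append([wj / total for wj in w])
        history.append(loglik)
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            alphas[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
    return mus, variances, alphas, history

# Hypothetical data with two well separated groups around -2 and +2.
xs = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
mus, variances, alphas, history = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The recorded log-likelihood should not decrease from one iteration to the next. Replacing the soft responsibilities by hard 0/1 assignments with equal spherical variances gives the K-means case mentioned above.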

(28)

Independence subspace (Linear Latent structures)

$x = Ay + e$, with Gaussian noise $e$ and $q(y) = \prod_{j=1}^{k} q(y^{(j)})$

Cases: Factor analysis; Independent Factor analysis; $y^{(j)} = 0$ or $1$; Independent Component analysis (ICA); PCA, e - ball

[Figure: data in the $[x_1, x_2]$ plane with the 1st and 2nd Principal Components; a local subspace with mean $m_j$, directions $w_j^{(1)}, w_j^{(2)}, w_j^{(3)}$ and residual $e_{t,j}$; (b) ICA mapping $[x_1, x_2] \to [y_1, y_2]$]

(29)

NFA: $x = Ay + e$, where $y^{(1)}, y^{(2)}, y^{(3)}$ are real

Maximum Likelihood Learning: $\max_\theta L(\theta)$, $L(\theta) = \frac{1}{N}\sum_{t=1}^{N} \ln q(x_t)$

$q(x) = \int G(x \mid Ay, \sigma^2 I)\,p(y)\,dy$: difficult to compute

$q(y) = \prod_{j=1}^{k} q(y^{(j)})$, with $q(y^{(j)}) = \sum_i \beta_{ji}\,G(y^{(j)} \mid m_{ji}, \sigma^2_{ji})$

Exact ML learning by an exact EM algorithm (Moulines, Cardoso, & Gassiat, 97; Attias, 99). The product of $m$ Gaussian mixtures becomes a summation of $\prod_j n_j$ products, so the complexity increases exponentially: expensive to compute.

BFA: $x = Ay + e$, with $y^{(j)} = 0$ or $1$

$q(y) = \prod_{j=1}^{m} p_j^{y^{(j)}} (1 - p_j)^{1 - y^{(j)}}$

$q(x) = \sum_{y} G(x \mid Ay, \sigma^2 I)\,p(y)$

(Belouchrani & Cardoso, 1995; Grellier & Comon, 1998; Xu, 1998)

Nonlinear ICA, $p(y|x)$. Approximate solution: Nonlinear LMSER (Xu, 91 & 93)

$\min_W E\|x - W^T s(Wx)\|^2$, $y = s(Wx)$

Citations: 96 (SCI), 104 (Google Scholar)
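The LMSER criterion $E\|x - W^T s(Wx)\|^2$ can be evaluated on a finite sample as a mean reconstruction error. The sketch below assumes a sigmoid $s(r) = 1/(1 + e^{-r})$ and represents $W$ as a plain list of rows; the data and the two weight matrices are hypothetical and only show how the cost is computed, not how it is minimized.

```python
import math

def sigmoid(r):
    return 1.0 / (1.0 + math.exp(-r))

def lmser_cost(xs, W):
    """Mean of ||x - W^T s(W x)||^2 over the samples; W has k rows of length d."""
    total = 0.0
    for x in xs:
        # Upward pass: y = s(W x).
        y = [sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x))) for row in W]
        # Downward pass with the transposed weights: x_hat = W^T y.
        x_hat = [sum(W[j][i] * y[j] for j in range(len(W))) for i in range(len(x))]
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(xs)

# Hypothetical 2-D samples and two candidate weight matrices.
xs = [[1.0, 0.0], [0.0, 2.0]]
cost_zero = lmser_cost(xs, [[0.0, 0.0], [0.0, 0.0]])
cost_identity = lmser_cost(xs, [[1.0, 0.0], [0.0, 1.0]])
```

With all-zero weights the reconstruction is zero and the cost reduces to the mean squared norm of the samples, which gives a simple sanity check.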

(30)

Bi-directional structures

Grossberg's ART (1977)

[Figure: ART architecture with an Attentional Subsystem (STM F1, STM F2, Gain Control, LTM connections, Dipole Field) and an Orienting Subsystem (STM Reset Wave), driven by the Input Pattern]

Helmholtz Machine (1995)

[Figure: layers $i$, $j$, $k$ linked by recognition weights $\phi$ and generative weights $\theta$]

Transmission line: Input Pattern, coding, decoding, Reconstruction

Others

• Kawato et al's Forward-inverse optics model

• Pattern Theory (Mumford, Grenander)

[Figure: The World, with TYPE I (Discovering World Knowledge) and TYPE II (Training Skills of Problem Solving)]

(31)

LMSER

$x = Ay + e$ with Gaussian noise $e$; $x$ real, $y^{(j)} = 0$ or $1$

$y = s(Wx)$, with $s(r) = \frac{1}{1 + e^{-r}}$

$q(y) = \prod_{j=1}^{m} q_j^{y^{(j)}} (1 - q_j)^{1 - y^{(j)}}$

Auto-Association with $A = W^T$: $\sigma^2 = \frac{1}{N}\sum_{t=1}^{N} \|x_t - A\,s(W x_t)\|^2$

Self-organization LMSER (Xu, IJCNN01, Neural Networks)

[Equations: extension with $x = [\xi, \zeta]$, $y = s(W\xi + \nu)$, $x = Ay + e$]

LMSER citations: 96 (SCI), 104 (Google Scholar)

[Figure: Helmholtz Machine (1995), layers $i$, $j$, $k$ linked by recognition weights $\phi$ and generative weights $\theta$]

(32)

Motor Control

Representation Space Y q(y)

Input Pattern Space X p(d)

q(d|y) p(y|d)

x² x¹

d

y

q(d-x|y)

p(y|d) q(y)

x d p(d)

(33)

Dependence structures among samples from one-body world

pair-wise architectures: Regression; Function fitting; Co-occurrence association

transform architectures: Principal component analysis (PCA); Independent component analysis (ICA); CDF transform

factor architectures: Gaussian factor; Binary factor; Non-Gaussian factor; Positive factor

bi-directional architectures: ART; LMSER; three layer net; Helmholtz machine

Boundary structure

(see Proceedings for details)

(34)

Dependence structures across multi-objects

Across multi-bodies: described qualitatively by the topology

Within each body: described quantitatively by the dependence structures among variables within and across objects

[Figure: topology graph over the nodes A, B, C, D, E]

(35)

It is very difficult to learn topology from samples.

Only three special cases have been studied:

[Figure: topology graph over the nodes A, B, C, D, E]

(36)

1. Given topology: learn quantitative dependence among variables

Temporal dependences:

• Autoregressive, Moving-average model

• Markov Chain

(37)

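Choice 1: AR (independent state space), TFA vs TNFA

State Space Y: $y_t = B\,y_{t-1} + \varepsilon_t$

Observation Space X: $x_t = A\,y_t + \mu + e_t$, observing $x_1, \ldots, x_t, \ldots, x_T$

[Figure: factors $y^{(1)}, y^{(2)}, y^{(3)}$ generating $\hat{x}$ along the directions $a_1, a_2, a_3$, with residual $e$ between $x$ and $\hat{x}$]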

(38)

State Space Model, Linear Case: Kalman Filter

State Space Y: $y_t = R\,y_{t-1} + \varepsilon_t$

Observation Space X: $x_t = A\,y_t + v_t$, observing $x_1, \ldots, x_t, \ldots, x_T$

Known: $A$, $R$, $\Sigma_\varepsilon$, $\Sigma_v$

$E(y_t \mid x_t, \hat{y}_{t-1}) = C_t x_t + D_t \hat{y}_{t-1}$: exactly equivalent to the Kalman Filter

With $u_{t-1} = \hat{y}_{t-1}$: $P(y_t \mid x_t, u_{t-1}) = G(y_t \mid C_t x_t + D_t \hat{y}_{t-1}, \Sigma_t)$
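For a scalar linear state space model, one Kalman filter step can be written so that the estimate takes the form $\hat{y}_t = C_t x_t + D_t \hat{y}_{t-1}$. The sketch below is the standard scalar Kalman recursion; the names `a`, `r`, `var_eps`, `var_v` stand for the observation coefficient, the state coefficient and the two noise variances, and all numeric values are assumptions for illustration.

```python
def kalman_step(x, y_prev, p_prev, a, r, var_eps, var_v):
    """One scalar Kalman step for y_t = r*y_{t-1} + eps_t, x_t = a*y_t + v_t.

    Returns the new estimate, its variance, and the coefficients (c, d)
    such that y_hat = c * x + d * y_prev.
    """
    p_pred = r * r * p_prev + var_eps              # predicted state variance
    gain = p_pred * a / (a * a * p_pred + var_v)   # Kalman gain
    c = gain
    d = (1.0 - gain * a) * r
    y_hat = c * x + d * y_prev                     # same as prediction + gain * innovation
    p_new = (1.0 - gain * a) * p_pred
    return y_hat, p_new, c, d

# Hypothetical values: unit coefficients and unit noise variances.
y_hat, p_new, c, d = kalman_step(x=2.0, y_prev=0.0, p_prev=1.0,
                                 a=1.0, r=1.0, var_eps=1.0, var_v=1.0)
```

When the observation noise variance is much smaller than the state noise variance, the gain approaches `1/a` and the estimate follows the observation closely.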

(39)

Choice 2: HMM (independent state space), HMM vs TBFA

State Space Y, Observation Space X, observing $x_1, \ldots, x_t, \ldots, x_T$

$q(y_t) = \prod_{j=1}^{m} \nu_j^{y_t^{(j)}} (1 - \nu_j)^{1 - y_t^{(j)}}$

$q(y_t^{(j)} \mid y_{t-1}^{(j)}) = \begin{bmatrix} \pi_j(0) & 1 - \pi_j(0) \\ 1 - \pi_j(1) & \pi_j(1) \end{bmatrix}$

$q(y_t^{(j)}) = \sum_{y_{t-1}^{(j)}} q(y_t^{(j)} \mid y_{t-1}^{(j)})\,q(y_{t-1}^{(j)})$

(40)

Belief networks in AI (Pearl, 1988; Jensen, 1996)

[Figure: topology graph over the nodes A, B, C, D, E; (c) 2-D lattice]

(41)

2. Null topology

Each sample vector $x$ comes from one of the objects $1, \ldots, k$, but its ID is missing

[Figure: nodes A, B, C, D, E with no links between them]

(42)

MSE Clustering

Describing each density by its center: input $X$, centers $m_1, \ldots, m_k$, label $y = 1, \ldots, k$

$\sigma^2 = \frac{1}{N}\sum_{t=1}^{N}\sum_{j=1}^{k} I(y = j \mid x_t)\,\|x_t - m_j\|^2$

$I(y \mid x_t) = 1$ if $y = \arg\min_r \|x_t - m_r\|^2$, and $0$ otherwise

[Figure: The World, with TYPE I (Discovering World Knowledge) and TYPE II (Training Skills of Problem Solving)]
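MSE clustering with hard nearest-center assignment is what the K-means iteration does. The following is a minimal sketch: assign each sample to its nearest center, then move each center to the mean of its members. The two-dimensional points and the initial centers are made up for the example.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(x, centers):
    """Index r minimizing ||x - m_r||^2."""
    return min(range(len(centers)), key=lambda r: sqdist(x, centers[r]))

def kmeans(xs, centers, n_iter=20):
    centers = [list(c) for c in centers]
    for _ in range(n_iter):
        labels = [nearest(x, centers) for x in xs]
        for j in range(len(centers)):
            members = [x for x, y in zip(xs, labels) if y == j]
            if members:  # keep an empty cluster's center unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(x, centers) for x in xs]
    mse = sum(sqdist(x, centers[y]) for x, y in zip(xs, labels)) / len(xs)
    return centers, labels, mse

# Hypothetical data: two tight groups of three points each.
xs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, mse = kmeans(xs, [(0, 0), (10, 10)])
```

The returned `mse` is the sample version of the clustering cost: the average squared distance from each point to its assigned center.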

(43)

3. Topological map structures

Retrieve similar objects from neighbors, useful in

• content based retrieval

• missing pattern recovering

• tracing temporal patterns

Kohonen Map: input $X$, units $m_1, \ldots, m_k$ arranged on a map with index $y = [\xi, \eta]$

Winner: $y = \arg\min_r \|x_t - m_r\|$

Update: $m_i^{new} = m_i^{old} + \eta\,(x_t - m_i^{old})$, $i \in N_y$, where $N_y$ is the neighborhood of the winner $y$
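One step of the Kohonen map rule (find the winner, then pull every unit in the winner's neighborhood toward the sample) can be sketched as follows. Units are stored in a dictionary keyed by their grid position; the square neighborhood of a given radius, the learning rate, and the tiny grid are assumptions made for this illustration.

```python
def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def som_update(units, x, rate=0.5, radius=1):
    """One Kohonen step: m_i <- m_i + rate * (x - m_i) for i in the winner's neighborhood."""
    winner = min(units, key=lambda idx: sqdist(x, units[idx]))
    for idx, m in list(units.items()):
        # Neighborhood N_y: grid positions within `radius` of the winner.
        if max(abs(idx[0] - winner[0]), abs(idx[1] - winner[1])) <= radius:
            units[idx] = [mi + rate * (xi - mi) for mi, xi in zip(m, x)]
    return winner

# Hypothetical 4x1 grid of one-dimensional weight vectors.
units = {(0, 0): [0.0], (1, 0): [1.0], (2, 0): [2.0], (3, 0): [3.0]}
winner = som_update(units, [0.2])
```

In practice both the learning rate and the neighborhood radius are usually decreased over time; the sketch keeps them fixed for clarity.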

(44)

Beyond point structures

RPCL learning (Xu, 94, 02); Finite Mixture (Xu, 02, 03)

• Mixture of PCA (local PCA)

• Mixture of ICA (local ICA)

• Mixture of BFA (local BFA)

• Local Gaussian FA

• Local NFA

• Local LMSER

• Local MCA

…

[Figure: local subspaces over a topology A, B, C, D, E, each with mean $m_j$, directions $w_j^{(1)}, w_j^{(2)}, w_j^{(3)}$ and residual $e_{t,j}$; 1st and 2nd principal components; independent component]

(45)

Mixture of independent state spaces

Each component is an independent state space model:

State Space Y: $y_t = B\,y_{t-1} + \varepsilon_t$

Observation Space X: $x_t = A\,y_t + \mu + e_t$, observing $x_1, \ldots, x_t, \ldots, x_T$

Xu, L., "Temporal BYY Encoding, Markovian State Spaces, and Space Dimension Determination", IEEE Tr. Neural Networks, Vol. 15, No. 5, pp. 1276-1295.

(46)

Dependence structures among samples from multi-body world

visible topology among multi-bodies: temporal topology; Image and video lattice; Bayesian networks; Combination of time and spatial topology

invisible topology among multi-bodies: Clustering structure; Gaussian mixture; Multi-set mixture; Mixture of any one-body structure; Mixture of experts and RBF nets; Topological map and maps of any one-body structure

(47)

A general framework: Bayesian Ying Yang System

[Figure: The World, with $p(x)$, $p(y|x)$, $q(x|y)$, $q(y)$]

YANG representation (Machine): $p(x, y) = p(y|x)\,p(x)$

YING representation (Machine): $q(x, y) = q(x|y)\,q(y)$

From $x$ to $y$:

• Stochastic: randomly pick $y$ by $p(y|x)$

• Maximum posteriori: $y = \arg\max_y p(y|x)$

• Regression: $y = E_{p(y|x)}[y]$

From $y$ to $x$:

• Stochastic: randomly pick $x$ by $q(x|y)$

• Maximum posteriori: $x = \arg\max_x q(x|y)$

• Regression: $x = E_{q(x|y)}[x]$

(48)

Integrated structures

[Table: design choices of a BYY system along five aspects: (observation), (inner-coding), (time), (topology), (architecture). Recoverable entries include: full vector or pair of vectors for $X$; real, binary (0 or 1), Gaussian or non-Gaussian $y$; one learner or multi-learners arranged in a lattice or a binary tree; forward, backward and bi-directional architectures; each component either free or parametric]

$P(x) = \frac{1}{N}\sum_{t=1}^{N} K_h(x, x_t)$ or $P(x) = \frac{1}{N}\sum_{t=1}^{N} \delta(x - x_t)$

[Figure: Representation Space $Y$ with $q(y)$ and Input Pattern Space $X$ with $P(x)$, linked by $q(x|y)$ and $P(y|x)$; topology graph over the nodes A, B, C, D, E; neighborhood after update]

(49)

Key Challenge II

Complexity of the learner's structure → matching the size of samples

Example: the number of hidden units $k$ in $x = Ay + e$, with $X = [x_0, x_1, \ldots, x_d]$ and $Y = [y_0, y_1, \ldots, y_k]$

(reliable structures of underlying world)

Learner (hardware)

Sample gathering

(world)

Learning theory

(software)

(50)

The large number law

Memorizing the samples one by one: $q_0(x) = \frac{1}{N}\sum_{t=1}^{N} \delta(x - x_t)$

One piece of evidence, take it by 100%

Two pieces of evidence, take each by 50%

…

More pieces of evidence

Histogram: curse of dimension

[Equation: bound of Kolmogorov & Smirnov (1930) on the probability that the largest deviation between the empirical and the true distribution exceeds $\varepsilon$, shown for $\varepsilon = 0.01$]

(51)

The large number law on parametric model

e.g., Maximum Likelihood (ML): best matching of the model

$p(x \mid \hat{\theta}(X))$, $X = \{x_t\}_{t=1}^{N}$

$\max_\theta L(\theta)$, $L(\theta) = \frac{1}{N}\sum_{t=1}^{N} \ln p(x_t \mid \theta)$

$\hat{\theta}(X) \to \theta_0$ and $p_*(x \mid \hat{\theta}) \to p_*(x \mid \theta_0)$ as $N \to \infty$

One piece of evidence, take it by 100%

Two pieces of evidence, take each by 50%

More pieces of evidence

In general: optimizing a matching cost $F(p(x \mid \theta), X)$
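As a small worked instance of maximizing $L(\theta) = \frac{1}{N}\sum_t \ln p(x_t \mid \theta)$, the sketch below assumes a one-dimensional Gaussian model, for which the maximizer is available in closed form (sample mean and divide-by-N variance). The data are hypothetical.

```python
import math

def avg_log_likelihood(xs, mu, var):
    """L(theta) = (1/N) * sum_t ln G(x_t | mu, var)."""
    return sum(-0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var
               for x in xs) / len(xs)

def gaussian_mle(xs):
    """Closed-form ML estimate of mean and variance for a Gaussian model."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, var_hat = gaussian_mle(xs)
best = avg_log_likelihood(xs, mu_hat, var_hat)
```

Evaluating `avg_log_likelihood` at nearby parameter values gives smaller numbers, which is an easy way to check that the estimate is a maximizer.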

(52)

[Figure: classification with a network system $Y = F(X)$ (10 output units, 0 to 9) and $g_{ij}(x) = d_i(x) - d_j(x)$; regression with $y = f(x)$ and $E(y|x) = \int y\,p(y|x)\,dy$; (a) $q(x|y)$]

This is seeking the best match!

(53)

[Figure: independence subspace $x = Ay + e$ with $q(y) = \prod_{j=1}^{k} q(y^{(j)})$; ICA mapping $[x_1, x_2] \to [y_1, y_2]$; CDF transform of $[x_1, x_2]$ to a uniform $[y_1, y_2]$ on $[0, 1]$]

(b) Maximum entropy; (c) Minimum reconstruction error

All of these are seeking the best match!

(54)

The number of hidden units: $X = [x_0, x_1, \ldots, x_d]$, $Y = [y_0, y_1, \ldots, y_k]$, $x = Ay + e$

We do not know the structure of the template $p_*(x \mid \cdot)$

Consider a family with the same structure but in different scales: $p(x \mid \theta_k)$, $k = 1, 2, \ldots, \infty$

Provided that there is a $k^*$ and a $\theta^*_{k^*}$ such that $p(x \mid \theta^*_{k^*})$ is equal or close to the true $p_*(x \mid \theta_0)$

(55)

[Figure: fitting error and generalization error plotted against $k$]

Example: $y = ax^3 + bx^2 + cx + d$

Number of clusters = 8 → no error

We do not have $N \to \infty$

(56)

Existing Efforts

• VC Dimension based SRM

• AIC

• BIC, SIC

• Cross Validation

• MML/MDL

• Bayesian Approach

The existing efforts usually lead to a rough estimate $\hat{\Delta}(k)$, an estimated bound of the generalization error:

$k^* = \arg\min_k \left[ F(p(x \mid \theta_k^*), X) + \hat{\Delta}(k) \right]$

[Figure: fitting error, generalization error and $\Delta(k)$ plotted against $k$]

(57)

Two Steps of Solving

Step 1: Enumerate $k$ over a set of candidate values; with $k$ fixed at each candidate, perform learning

$\theta_k^* = \arg\min_\theta F(p(x \mid \theta_k), X)$

Step 2: Select the best one $k^*$ by

$k^* = \arg\min_k \left[ F(p(x \mid \theta_k^*), X) + \hat{\Delta}(k) \right]$

Very computationally intensive!

[Figure: fitting error plotted against $k$]
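The two-step procedure (fit every candidate $k$, then pick the $k$ that minimizes fitting cost plus a complexity penalty) can be sketched with polynomial regression, where $k$ is the polynomial order. The penalty used here is BIC, chosen only as one example of a rough estimate of the penalty term; the data, the noise term and all function names are assumptions for illustration.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_poly(xs, ys, k):
    """Least-squares polynomial of order k via the normal equations; returns (coef, rss)."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(k + 1)] for i in range(k + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k + 1)]
    coef = solve(A, b)
    rss = sum((y - sum(c * x ** i for i, c in enumerate(coef))) ** 2
              for x, y in zip(xs, ys))
    return coef, rss

def select_order(xs, ys, candidates):
    n = len(xs)
    fit_errors, scores = {}, {}
    for k in candidates:                      # Step 1: learn with k fixed
        _, rss = fit_poly(xs, ys, k)
        fit_errors[k] = rss / n
        scores[k] = n * math.log(rss / n) + (k + 1) * math.log(n)  # BIC-style score
    best_k = min(scores, key=scores.get)      # Step 2: select k*
    return best_k, fit_errors, scores

# Hypothetical data: a cubic plus a small deterministic perturbation.
xs = [-1.0 + 2.0 * t / 29 for t in range(30)]
ys = [x ** 3 - x + 0.05 * math.sin(12.9898 * t) for t, x in enumerate(xs)]
best_k, fit_errors, scores = select_order(xs, ys, range(0, 7))
```

The fitting error alone never increases with `k`, which is why a penalty term is needed; the cost of the procedure is one full learning run per candidate value.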

(58)

Bayesian Ying-Yang Harmony Learning

[Figure: Representation Space with $q(y)$ and Input Pattern Space with $P(x)$, linked by $q(x|y)$ and $P(y|x)$]

Basic Learning Principle: Ying-Yang Harmony

(a) Best matching (least difference) between $p(x, y) = p(y|x)\,p(x)$ and $q(x, y) = q(x|y)\,q(y)$

(b) The simplest one in complexity, or the most firm

$\max_{\theta, k} H(\theta, k)$, $H(\theta, k) = H(p \| q) = \int p(y|x)\,p(x) \ln[q(x|y)\,q(y)]\,dx\,dy - \ln z_q$

Best matching alone is half the job only:

$\min KL(p \| q)$, $KL(p \| q) = \int p(y|x)\,p(x) \ln \frac{p(y|x)\,p(x)}{q(x|y)\,q(y)}\,dx\,dy$
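For discrete $x$ and $y$ the harmony measure and the KL divergence become finite sums, which makes their relation easy to check numerically. The sketch below drops the $\ln z_q$ term (equivalently assumes $z_q = 1$) and uses made-up 2x2 distributions; under that assumption $H(p \| q) = -E_p - KL(p \| q)$, where $E_p$ is the entropy of the Yang joint distribution.

```python
import math

# Hypothetical Yang machine: p(x) and p(y|x), indexed [x][y].
p_x = [0.4, 0.6]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
# Hypothetical Ying machine: q(y) and q(x|y), indexed [y][x].
q_y = [0.5, 0.5]
q_x_given_y = [[0.7, 0.3], [0.3, 0.7]]

# Joint distributions, both indexed [x][y].
P = [[p_y_given_x[x][y] * p_x[x] for y in range(2)] for x in range(2)]
Q = [[q_x_given_y[y][x] * q_y[y] for y in range(2)] for x in range(2)]

def harmony(p, q):
    """H(p||q) = sum p ln q (the ln z_q term is omitted here)."""
    return sum(p[x][y] * math.log(q[x][y]) for x in range(2) for y in range(2))

def kl(p, q):
    """KL(p||q) = sum p ln(p / q)."""
    return sum(p[x][y] * math.log(p[x][y] / q[x][y]) for x in range(2) for y in range(2))

def entropy(p):
    return -sum(p[x][y] * math.log(p[x][y]) for x in range(2) for y in range(2))

h_pq, kl_pq, e_p = harmony(P, Q), kl(P, Q), entropy(P)
```

Minimizing KL only asks for the best match, whereas maximizing the harmony measure additionally favors a Yang distribution with low entropy, which is one way to read the "half job only" remark.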

(59)