Machine Learning Techniques
(機器學習技法)

Lecture 7: Blending and Bagging

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

Blending and Bagging

Roadmap

1 Embedding Numerous Features: Kernel Models

Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem;
support vector regression (sparse) via regularized tube error + Lagrange dual

2 Combining Predictive Features: Aggregation Models

Lecture 7: Blending and Bagging
Motivation of Aggregation
Uniform Blending
Linear and Any Blending
Bagging (Bootstrap Aggregation)

3 Distilling Implicit Features: Extraction Models

Blending and Bagging / Motivation of Aggregation

An Aggregation Story

Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x).

You can . . .

• select the most trust-worthy friend from their usual performance—validation!
• mix the predictions from all your friends uniformly—let them vote!
• mix the predictions from all your friends non-uniformly—let them vote, but give some more ballots
• combine the predictions conditionally—if [t satisfies some condition] give some ballots to friend t
• . . .

aggregation models: mix or combine hypotheses (for better performance)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/23

Blending and Bagging / Motivation of Aggregation

Aggregation with Math Notations

Your T friends g_1, · · · , g_T predict whether the stock will go up, as g_t(x).

• select the most trust-worthy friend from their usual performance:
  G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t)

• mix the predictions from all your friends uniformly:
  G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )

• mix the predictions from all your friends non-uniformly:
  G(x) = sign( Σ_{t=1}^{T} α_t · g_t(x) ) with α_t ≥ 0
  • include select: α_t = ⟦E_val(g_t) smallest⟧
  • include uniformly: α_t = 1

• combine the predictions conditionally:
  G(x) = sign( Σ_{t=1}^{T} q_t(x) · g_t(x) ) with q_t(x) ≥ 0
  • include non-uniformly: q_t(x) = α_t

aggregation models: a rich family

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/23
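The four forms above translate directly into code. Below is a minimal sketch (not part of the lecture), assuming each g_t is a Python callable x → {−1, +1} and that a list of validation errors e_val is available for the select variant; all names here are hypothetical.

```python
import numpy as np

def select_best(g_list, e_val):
    """Selection: G(x) = g_{t*}(x) with t* = argmin_t E_val(g_t)."""
    t_star = int(np.argmin(e_val))
    return g_list[t_star]

def blend(g_list, alpha=None, q=None):
    """Blending: G(x) = sign(sum_t weight_t(x) * g_t(x)).

    alpha is None, q is None -> uniform blending (one ballot each)
    alpha given              -> non-uniform (linear) blending, alpha_t >= 0
    q given                  -> conditional combination, q_t(x) >= 0
    """
    def G(x):
        total = 0.0
        for t, g in enumerate(g_list):
            if q is not None:
                w = q[t](x)      # conditional weight q_t(x)
            elif alpha is not None:
                w = alpha[t]     # fixed weight alpha_t
            else:
                w = 1.0          # uniform: one ballot per hypothesis
            total += w * g(x)
        return 1 if total >= 0 else -1   # ties broken toward +1; the slides leave sign(0) unspecified
    return G
```

With T trained hypotheses in g_list, blend(g_list) gives the uniform vote, blend(g_list, alpha=...) the weighted vote, and blend(g_list, q=[...]) the conditional combination.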

Blending and Bagging / Motivation of Aggregation

Recall: Selection by Validation

G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,··· ,T}} E_val(g_t)

• simple and popular
• what if use E_in(g_t) instead of E_val(g_t)? complexity price on d_VC, remember? :-)
• need one strong g_t to guarantee small E_val (and small E_out)

selection: rely on one strong hypothesis
aggregation: can we do better with many (possibly weaker) hypotheses?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/23

Blending and Bagging / Motivation of Aggregation

Why Might Aggregation Work?

• mix different weak hypotheses uniformly—G(x) 'strong'
  aggregation ⇒ feature transform (?)

• mix different random-PLA hypotheses uniformly—G(x) 'moderate'
  aggregation ⇒ regularization (?)

proper aggregation ⇒ better performance

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/23

Blending and Bagging / Motivation of Aggregation

Fun Time

Consider three decision stump hypotheses from R to {−1, +1}:
g_1(x) = sign(1 − x), g_2(x) = sign(1 + x), g_3(x) = −1.
When mixing the three hypotheses uniformly, what is the resulting G(x)?

1  2⟦|x| ≤ 1⟧ − 1
2  2⟦|x| ≥ 1⟧ − 1
3  2⟦x ≤ −1⟧ − 1
4  2⟦x ≥ +1⟧ − 1

Reference Answer: 1

The 'region' that gets two positive votes from g_1 and g_2 is |x| ≤ 1, and thus G(x) is positive within the region only. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/23
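The vote can also be checked mechanically. The short sketch below (not part of the lecture) blends the three stumps uniformly on a few sample points away from the boundaries x = ±1 and compares the result against answer 1:

```python
import numpy as np

# Uniformly blend the three stumps and compare against 2*[|x| <= 1] - 1 (answer 1).
g1 = lambda x: np.sign(1 - x)
g2 = lambda x: np.sign(1 + x)
g3 = lambda x: -1.0

def G(x):
    # uniform vote; ties are only possible at the stump boundaries x = +1 or x = -1
    return np.sign(g1(x) + g2(x) + g3(x))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    answer_1 = 2 * (abs(x) <= 1) - 1
    print(x, G(x), answer_1)   # the two columns agree at every sampled point
```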

Blending and Bagging / Uniform Blending

Uniform Blending (Voting) for Classification

uniform blending: known g_t, each with 1 ballot

G(x) = sign( Σ_{t=1}^{T} 1 · g_t(x) )

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy): majority can correct minority
• similar results with uniform voting for multiclass:
  G(x) = argmax_{1 ≤ k ≤ K} Σ_{t=1}^{T} ⟦g_t(x) = k⟧

how about regression?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/23
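For the multiclass case, the argmax vote is just a ballot count. A minimal sketch (not from the lecture), assuming each g_t is a callable that returns a label in {1, ..., K}:

```python
from collections import Counter

def uniform_vote(g_list, x):
    """G(x) = argmax_k sum_t [g_t(x) = k]: the class with the most ballots."""
    counts = Counter(g(x) for g in g_list)   # one ballot per hypothesis
    return max(counts, key=counts.get)       # ties broken arbitrarily; the slide gives no rule

# toy usage with three hypothetical hypotheses over labels {1, 2, 3}
g_list = [lambda x: 1, lambda x: 1, lambda x: 2]
print(uniform_vote(g_list, x=0.0))           # prints 1: the majority corrects the minority
```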

Blending and Bagging / Uniform Blending

Uniform Blending for Regression

G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

• same g_t (autocracy): as good as one single g_t
• very different g_t (diversity + democracy):
  ⇒ some g_t(x) > f(x), some g_t(x) < f(x)
  ⇒ average could be more accurate than individual

diverse hypotheses: even simple uniform blending can be better than any single hypothesis

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/23
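A tiny numeric illustration (not from the lecture) of the cancellation argument above, using made-up predictions scattered around a hypothetical target value f(x); they are chosen here so the over- and under-estimates cancel exactly:

```python
import numpy as np

f_x = 2.0                                # hypothetical target f(x) at one point x
g_x = np.array([1.4, 2.5, 2.9, 1.2])     # made-up g_t(x): some above f(x), some below
G_x = g_x.mean()                         # uniform blend at this x

print(np.mean((g_x - f_x) ** 2))         # average individual squared error: 0.515
print((G_x - f_x) ** 2)                  # blend's squared error: 0.0 (errors cancel here)
```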

Blending and Bagging / Uniform Blending

Theoretical Analysis of Uniform Blending

G(x) = (1/T) Σ_{t=1}^{T} g_t(x)

avg( (g_t(x) − f(x))² )
  = avg( g_t² − 2 g_t f + f² )
  = avg( g_t² ) − 2 G f + f²
  = avg( g_t² ) − G² + (G − f)²
  = avg( g_t² ) − 2 G² + G² + (G − f)²
  = avg( g_t² − 2 g_t G + G² ) + (G − f)²
  = avg( (g_t − G)² ) + (G − f)²

avg( E_out(g_t) ) = avg( 𝔼[(g_t − G)²] ) + E_out(G)
                  ≥ E_out(G)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/23
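The last identity is easy to sanity-check numerically. The sketch below (not from the lecture) verifies the pointwise version avg_t((g_t − f)²) = avg_t((g_t − G)²) + (G − f)² at one fixed x, with made-up numbers:

```python
import numpy as np

g = np.array([1.3, 0.7, 2.1, -0.4, 1.0])    # hypothetical g_t(x) values at one x
f = 0.9                                      # hypothetical target f(x)
G = g.mean()                                 # uniform blend G(x)

lhs = np.mean((g - f) ** 2)                  # average individual squared error
rhs = np.mean((g - G) ** 2) + (G - f) ** 2   # spread around consensus + consensus error
print(lhs, rhs)                              # equal up to floating-point rounding
print(lhs >= (G - f) ** 2)                   # True: the blend is never worse than the average g_t
```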

Blending and Bagging / Uniform Blending

Some Special g_t

consider a virtual iterative process that for t = 1, 2, . . . , T

1  request size-N data D_t from P^N (i.i.d.)
2  obtain g_t by A(D_t)

ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^{T} g_t = 𝔼_D A(D)

avg( E_out(g_t) ) = avg( 𝔼[(g_t − ḡ)²] ) + E_out(ḡ)

expected performance of A = expected deviation to consensus + performance of consensus

• performance of consensus: called bias
• expected deviation to consensus: called variance

uniform blending: reduces variance for more stable performance

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/23
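The virtual process can be simulated directly when fresh data are cheap. The sketch below (not from the lecture; the target, noise level, and learner A are all made up for illustration) draws T fresh size-N datasets, fits a simple g_t on each, and compares the average E_out of the individual g_t against E_out of their uniform blend G:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                  # hypothetical target f
x_test = np.linspace(-1, 1, 400)                 # dense grid standing in for E_out

T, N = 200, 10
preds = []
for t in range(T):
    x = rng.uniform(-1, 1, N)                    # step 1: size-N data D_t from P (i.i.d.)
    y = f(x) + rng.normal(0, 0.3, N)             # noisy labels
    w = np.polyfit(x, y, deg=1)                  # step 2: A(D_t) fits a line, giving g_t
    preds.append(np.polyval(w, x_test))
preds = np.array(preds)                          # shape (T, number of test points)

G = preds.mean(axis=0)                           # uniform blend G(x)
avg_eout = np.mean((preds - f(x_test)) ** 2)     # avg_t E_out(g_t): variance + bias
eout_G = np.mean((G - f(x_test)) ** 2)           # E_out(G): close to the bias term alone
print(avg_eout, eout_G)                          # the blend has the smaller error
```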
