All Kinds of GAN
Hung-yi Lee
Zoo of GAN: https://github.com/hindupuravinash/the-gan-zoo
General Idea of GAN
Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016
Basic Idea of GAN
• A generator G is a network. The network defines a probability distribution.
• Sample z from a normal distribution; the generator outputs x = G(z), which defines a distribution P_G(x).
• Goal: make P_G(x) as close as possible to P_data(x).
• It is difficult to compute P_G(x); we can only sample from the distribution.
https://blog.openai.com/generative-models/
Basic Idea of GAN
• Generator G
• G is a function: input z, output x
• Given a prior distribution P_prior(z), a probability distribution P_G(x) is defined by the function G
• Discriminator D
• D is a function: input x, output a scalar
• Evaluate the “difference” between P_G(x) and P_data(x)
• There is a function V(G,D):
G* = arg min_G max_D V(G,D)
Hard to learn by maximum likelihood
Basic Idea
G* = arg min_G max_D V(G,D)
V(G,D) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 − D(x))]
(Figure: for each candidate generator G_1, G_2, G_3, the curves V(G_1,D), V(G_2,D), V(G_3,D) are plotted over D; the inner max picks the peak of each curve.)
Given a generator G, max_D V(G,D) evaluates the JS divergence between P_G and P_data.
Pick the G defining the P_G most similar to P_data:
G* = arg min_G max_D V(G,D)
• In each training iteration:
• Sample m examples {x^1, x^2, …, x^m} from the data distribution P_data(x)
• Sample m noise samples {z^1, z^2, …, z^m} from the prior P_prior(z)
• Obtain generated data {x̃^1, x̃^2, …, x̃^m}, x̃^i = G(z^i)
• Update discriminator parameters θ_d to maximize
• Ṽ = (1/m) Σ_{i=1}^m log D(x^i) + (1/m) Σ_{i=1}^m log(1 − D(x̃^i))
• θ_d ← θ_d + η ∇Ṽ(θ_d)
• Sample another m noise samples {z^1, z^2, …, z^m} from the prior P_prior(z)
• Update generator parameters θ_g to minimize
• Ṽ = (1/m) Σ_{i=1}^m log(1 − D(G(z^i)))   (the term (1/m) Σ_i log D(x^i) is constant with respect to θ_g)
• θ_g ← θ_g − η ∇Ṽ(θ_g)
Algorithm
Initialize θ_d for D and θ_g for G
• Learning D: repeat k times
• Learning G: only once
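The inner maximization above can be checked numerically. A minimal sketch (not from the slides): for two discrete distributions standing in for P_data and P_G, plug in the optimal discriminator D*(x) = P_data(x)/(P_data(x)+P_G(x)) and verify that max_D V(G,D) = −2 log 2 + 2·JS(P_data‖P_G).

```python
import math

# Two discrete distributions over a small support, standing in for P_data and P_G.
p_data = [0.5, 0.3, 0.2]
p_g    = [0.2, 0.3, 0.5]

# Optimal discriminator of the minimax objective: D*(x) = P_data(x) / (P_data(x) + P_G(x))
d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

# V(G, D*) = E_{x~P_data}[log D*(x)] + E_{x~P_G}[log(1 - D*(x))]
v_max = sum(pd * math.log(d) for pd, d in zip(p_data, d_star)) + \
        sum(pg * math.log(1 - d) for pg, d in zip(p_g, d_star))

# Direct JS divergence: JS(P||Q) = KL(P||M)/2 + KL(Q||M)/2 with M = (P+Q)/2
m = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# max_D V(G, D) = -2 log 2 + 2 * JS(P_data || P_G)
assert abs(v_max - (-2 * math.log(2) + 2 * js)) < 1e-9
```

When P_data = P_G, the JS term vanishes and the maximum of V is exactly −2 log 2.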
Objective Function for Generator in Real Implementation
V = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 − D(x))]   — Minimax GAN (MMGAN)
The generator's term log(1 − D(x)) is slow at the beginning of training: when D(x) is near 0, this term is nearly flat.
V = E_{x~P_G}[−log D(x)]   — Non-saturating GAN (NSGAN)
Real implementation: label x from P_G as positive, i.e. the generator minimizes −log D(x) instead of log(1 − D(x)).
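The “slow at the beginning” claim can be seen directly from the gradients with respect to D(x). A quick numerical check (not from the slides):

```python
import math

# At the start of training, D(x) is small for generated x (D easily rejects them).
d = 0.01

# Gradient magnitude of the MMGAN generator term log(1 - D(x)) w.r.t. D(x): 1 / (1 - D)
grad_mmgan = 1.0 / (1.0 - d)

# Gradient magnitude of the NSGAN generator term -log D(x) w.r.t. D(x): 1 / D
grad_nsgan = 1.0 / d

# log(1 - D) is nearly flat when D is small ("slow at the beginning"),
# while -log D gives a large gradient exactly there.
assert grad_nsgan > 50 * grad_mmgan
```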
f-divergence
D_f(P‖Q) = ∫_x q(x) f(p(x)/q(x)) dx, where f is convex and f(1) = 0
P and Q are two distributions; p(x) and q(x) are the probabilities of sampling x.
If p(x) = q(x) for all x:
D_f(P‖Q) = ∫_x q(x) f(p(x)/q(x)) dx = ∫_x q(x) f(1) dx = 0
Because f is convex (Jensen's inequality):
D_f(P‖Q) = ∫_x q(x) f(p(x)/q(x)) dx ≥ f(∫_x q(x) (p(x)/q(x)) dx) = f(1) = 0
If P and Q are the same distribution, D_f(P‖Q) attains its smallest value, which is 0.
D_f(P‖Q) evaluates the difference between P and Q.
f-divergence examples:
f(x) = (x − 1)²:  D_f(P‖Q) = ∫_x q(x) (p(x)/q(x) − 1)² dx = ∫_x (p(x) − q(x))²/q(x) dx   — Chi Square
f(x) = x log x:  D_f(P‖Q) = ∫_x q(x) (p(x)/q(x)) log(p(x)/q(x)) dx = ∫_x p(x) log(p(x)/q(x)) dx   — KL
f(x) = −log x:  D_f(P‖Q) = ∫_x q(x) (−log(p(x)/q(x))) dx = ∫_x q(x) log(q(x)/p(x)) dx   — Reverse KL
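These three examples can be checked numerically with the discrete form of D_f. A small sketch (not from the slides):

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

def f_divergence(p, q, f):
    # D_f(P||Q) = sum_x q(x) f(p(x)/q(x))  (discrete version of the integral)
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl         = f_divergence(p, q, lambda x: x * math.log(x))   # f(x) = x log x
reverse_kl = f_divergence(p, q, lambda x: -math.log(x))      # f(x) = -log x
chi_square = f_divergence(p, q, lambda x: (x - 1) ** 2)      # f(x) = (x - 1)^2

# KL computed via the f-divergence matches the direct formula sum_x p log(p/q)
assert abs(kl - sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))) < 1e-9
# Every f-divergence is 0 when p = q
assert abs(f_divergence(p, p, lambda x: x * math.log(x))) < 1e-12
```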
Fenchel Conjugate
• Every convex function f has a conjugate function f*:
f*(t) = max_{x∈dom(f)} {xt − f(x)}
(Figure: for a fixed t_1, f*(t_1) = max_{x∈dom(f)} {xt_1 − f(x)} is the largest among x_1 t_1 − f(x_1), x_2 t_1 − f(x_2), x_3 t_1 − f(x_3); likewise f*(t_2) = max_{x∈dom(f)} {xt_2 − f(x)}.)
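The maximization can be approximated on a grid of x values. A small check (not from the slides) against the known closed form: the conjugate of f(x) = x log x is f*(t) = exp(t − 1).

```python
import math

def conjugate(f, t, xs):
    # f*(t) = max_{x in dom(f)} { x*t - f(x) }, approximated on a grid of x values
    return max(x * t - f(x) for x in xs)

f = lambda x: x * math.log(x)             # f(x) = x log x (the f behind the KL divergence)
xs = [i / 1000 for i in range(1, 20001)]  # grid over (0, 20]

# Known closed form: (x log x)* = exp(t - 1); the grid maximum should agree closely.
for t in [-1.0, 0.0, 0.5, 1.5]:
    assert abs(conjugate(f, t, xs) - math.exp(t - 1)) < 1e-3
```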
Fenchel Conjugate
(Figure: viewed over t, each x contributes a line xt − f(x); f*(t) is the upper envelope of these lines.)
The relation is symmetric:
f*(t) = max_{x∈dom(f)} {xt − f(x)}  ⟺  f(x) = max_{t∈dom(f*)} {xt − f*(t)}
Substituting into the f-divergence:
D_f(P‖Q) = ∫_x q(x) f(p(x)/q(x)) dx
         = ∫_x q(x) [ max_{t∈dom(f*)} {(p(x)/q(x)) t − f*(t)} ] dx
Let D be a function whose input is x and whose output is t.
Connection with GAN
For any function D (whose output need not achieve the maximum at every x):
D_f(P‖Q) ≥ ∫_x q(x) [ (p(x)/q(x)) D(x) − f*(D(x)) ] dx
         = ∫_x p(x) D(x) dx − ∫_x q(x) f*(D(x)) dx
Therefore:
D_f(P‖Q) ≈ max_D { ∫_x p(x) D(x) dx − ∫_x q(x) f*(D(x)) dx }
Connection with GAN
D_f(P‖Q) ≈ max_D { ∫_x p(x) D(x) dx − ∫_x q(x) f*(D(x)) dx }
         = max_D { E_{x~P}[D(x)] − E_{x~Q}[f*(D(x))] }   (estimated with samples from P and samples from Q)
D_f(P_data‖P_G) = max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[f*(D(x))] }
G* = arg min_G D_f(P_data‖P_G)   — familiar? ☺
   = arg min_G max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[f*(D(x))] }
   = arg min_G max_D V(G,D)
The original GAN has a different V(G,D). You can use the f-divergence you like ☺
https://arxiv.org/pdf/1606.00709.pdf
Experimental Results
• Approximate a mixture of Gaussians by a single Gaussian
WGAN
Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN”, arXiv preprint, 2017
Earth Mover’s Distance
• Consider one distribution P as a pile of earth and another distribution Q as the target.
• The earth mover's distance is the average distance the earth mover has to move the earth.
(Figure: moving all of P a distance d to reach Q gives W(P,Q) = d.)
Earth Mover’s Distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
(Figure: distributions P and Q.)
There are many possible “moving plans”, some with smaller and some with larger average distance. Use the “moving plan” with the smallest average distance to define the earth mover's distance.
(Figure: the best “moving plan” of this example.)
A “moving plan” is a matrix; the value of each element is the amount of earth moved from one position to another.
A moving plan is denoted γ; Π is the set of all possible plans.
Average distance of a plan γ:  B(γ) = Σ_{x_p, x_q} γ(x_p, x_q) ‖x_p − x_q‖
Earth Mover's Distance:  W(P,Q) = min_{γ∈Π} B(γ)   — the cost of the best plan
(Figure: a plan γ as a matrix with rows indexed by positions x_p of P and columns by positions x_q of Q.)
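In one dimension with equal-size sample sets, the best moving plan simply matches the sorted samples one-to-one. A minimal sketch (not from the slides):

```python
# 1-D earth mover's distance between two equal-size sample sets:
# the best moving plan matches the sorted samples one-to-one.
def emd_1d(p_samples, q_samples):
    assert len(p_samples) == len(q_samples)
    p, q = sorted(p_samples), sorted(q_samples)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

# Moving a pile of earth at positions {0,1,2} to {5,6,7}: every unit moves distance 5.
assert emd_1d([0, 1, 2], [5, 6, 7]) == 5.0
# Identical piles need no moving: W(P, P) = 0.
assert emd_1d([0, 1, 2], [0, 1, 2]) == 0.0
```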
(Figure: over training, P_G0, P_G50, …, P_G100 move towards P_data; P_G0 is separated from P_data by distance d_0, P_G50 by d_50.)
JS(P_G0, P_data) = log 2        W(P_G0, P_data) = d_0
JS(P_G50, P_data) = log 2       W(P_G50, P_data) = d_50
JS(P_G100, P_data) = 0          W(P_G100, P_data) = 0
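A quick numerical illustration (not from the slides): for distributions with disjoint supports, JS stays at log 2 no matter how far apart they are, so it gives no signal for training, while the earth mover's distance tracks the separation.

```python
import math

def js(p, q):
    # JS(P||Q) = KL(P||M)/2 + KL(Q||M)/2 with M = (P+Q)/2; terms with p_i = 0 contribute 0
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# P_G and P_data as point masses on a grid of bins 0..10.
def shifted(pos, size=11):
    return [1.0 if i == pos else 0.0 for i in range(size)]

# As long as the supports are disjoint, JS stays at log 2 regardless of the separation:
assert abs(js(shifted(0), shifted(10)) - math.log(2)) < 1e-9
assert abs(js(shifted(0), shifted(1)) - math.log(2)) < 1e-9
# ... so JS cannot tell that a mass at bin 1 is "closer" to bin 0 than one at bin 10.
# The earth mover's distance between two point masses is just the separation d:
w = lambda a, b: abs(a - b)
assert w(0, 1) < w(0, 10)
```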
Why Earth Mover’s Distance?
D_f(P_data‖P_G) vs. W(P_data, P_G)
Back to the GAN framework:
D_f(P_data‖P_G) = max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[f*(D(x))] }
W(P_data, P_G) = max_{D∈1-Lipschitz} { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] }
Lipschitz function: ‖f(x_1) − f(x_2)‖ ≤ K‖x_1 − x_2‖ — the output change is bounded by K times the input change, so f does not change fast. K = 1 for “1-Lipschitz”.
(Figure: which of the plotted functions are 1-Lipschitz?)
Back to the GAN framework
W(P_data, P_G) = max_{D∈1-Lipschitz} { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] }
(Figure: P_data at x_1 and P_G at x_2, separated by distance d.) Without any constraint, the maximization drives D(x_1) → +∞ and D(x_2) → −∞. With the 1-Lipschitz constraint ‖D(x_1) − D(x_2)‖ ≤ ‖x_1 − x_2‖, the optimum is D(x_1) = k + d and D(x_2) = k, so W(P_data, P_G) = d.
(Figure: blue: D(x) for the original GAN; green: D(x) for WGAN.)
WGAN will provide a gradient to push P_G towards P_data.
Back to the GAN framework
W(P_data, P_G) = max_{D∈1-Lipschitz} { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] }
How to use gradient descent to optimize under the constraint?
Weight clipping: force the weights w between −c and c. After each parameter update, if w > c, then w = c; if w < −c, then w = −c.
This only ensures ‖D(x_1) − D(x_2)‖ ≤ K‖x_1 − x_2‖ for some K, so we do not truly find the function D maximizing the objective.
(Figure: the learned D with and without clipping.)
Algorithm of WGAN (modified from the original GAN algorithm)
• In each training iteration:
• (Learning D, repeat k times)
• Sample m examples {x^1, x^2, …, x^m} from the data distribution P_data(x)
• Sample m noise samples {z^1, z^2, …, z^m} from the prior P_prior(z)
• Obtain generated data {x̃^1, x̃^2, …, x̃^m}, x̃^i = G(z^i)
• Update discriminator parameters θ_d to maximize
• Ṽ = (1/m) Σ_{i=1}^m D(x^i) − (1/m) Σ_{i=1}^m D(x̃^i)   (no log — no sigmoid for the output of D)
• θ_d ← θ_d + η ∇Ṽ(θ_d), followed by weight clipping
• (Learning G, only once)
• Sample another m noise samples {z^1, z^2, …, z^m} from the prior P_prior(z)
• Update generator parameters θ_g to minimize
• Ṽ = −(1/m) Σ_{i=1}^m D(G(z^i))
• θ_g ← θ_g − η ∇Ṽ(θ_g)
https://arxiv.org/abs/1701.07875
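A toy numerical sketch of the critic update with weight clipping (not from the slides; the 1-D data and the linear critic D(x) = w·x are hypothetical illustrations):

```python
import random

random.seed(0)

# Toy 1-D setting: P_data centered at 3, P_G centered at 0; linear critic D(x) = w * x.
data = [3.0 + random.gauss(0, 0.1) for _ in range(200)]
gen  = [0.0 + random.gauss(0, 0.1) for _ in range(200)]

w, eta, c = 0.0, 0.05, 0.01   # critic weight, learning rate, clipping threshold

for _ in range(100):
    # Gradient ascent on V = E_{x~P_data}[D(x)] - E_{x~P_G}[D(x)] = w * (mean(data) - mean(gen))
    grad = sum(data) / len(data) - sum(gen) / len(gen)
    w += eta * grad
    # Weight clipping: force w into [-c, c] after every update
    w = max(-c, min(c, w))

# For this linear critic, the clipped maximum of V is c * |mean difference|,
# i.e. the Wasserstein distance (about 3 here) scaled down by the clip constant c.
estimate = w * (sum(data) / len(data) - sum(gen) / len(gen))
assert abs(w) <= c
assert estimate > 0
```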
Improved WGAN
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, “Improved Training of Wasserstein GANs”, arXiv preprint, 2017
W(P_data, P_G) = max_{D∈1-Lipschitz} { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] }
A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere: ‖∇_x D(x)‖ ≤ 1 for all x.
≈ max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] − λ ∫_x max(0, ‖∇_x D(x)‖ − 1) dx }
In practice, the integral is replaced by an expectation:
≈ max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] − λ E_{x~P_penalty}[max(0, ‖∇_x D(x)‖ − 1)] }
Improved WGAN
Instead of preferring ‖∇_x D(x)‖ ≤ 1 for all x (D ∈ 1-Lipschitz), prefer ‖∇_x D(x)‖ ≤ 1 for x sampled from P_penalty:
W(P_data, P_G) ≈ max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] − λ E_{x~P_penalty}[max(0, ‖∇_x D(x)‖ − 1)] }
(Figure: P_penalty lies between P_data and P_G.)
Only give the gradient constraint to the region between P_data and P_G, because that region influences how P_G moves to P_data.
“Given that enforcing the Lipschitz constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance.”
“One may wonder why we penalize the norm of the gradient for differing from 1, instead of just penalizing large gradients. The
reason is that the optimal critic … actually has gradients with norm 1 almost everywhere under Pr and Pg”
“Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima.”
Improved WGAN
In the actual implementation, the penalty is two-sided:
W(P_data, P_G) ≈ max_D { E_{x~P_data}[D(x)] − E_{x~P_G}[D(x)] − λ E_{x~P_penalty}[(‖∇_x D(x)‖ − 1)²] }
(check the proof in the appendix)
(Figure: D(x) between P_data and P_G; the largest gradient in this region is pushed to be 1.)
Improved WGAN
https://arxiv.org/abs/1704.00028
(Figure: training curves from the Improved WGAN paper comparing DCGAN, LSGAN, original WGAN, and Improved WGAN across architectures: G: CNN, D: CNN; G and D: CNN with no normalization; G and D: CNN with tanh; G: MLP, D: CNN; G: CNN with a bad structure; G and D: 101 layers.)
Energy-based GAN
Ref: Junbo Zhao, Michael Mathieu, Yann LeCun, “Energy-based Generative Adversarial Network”, ICLR 2017
Energy-based GAN (EBGAN)
• Using an autoencoder as the discriminator D
• An image is good if it can be reconstructed by the autoencoder.
(Figure: x → EN → DE → reconstruction; reconstruction error 0.1, multiplied by −1, gives the discriminator output D(x) = −0.1.)
• D(x) = 0 for the best images. The generator is the same as before.
EBGAN
(Figure: discriminator outputs for real and generated examples.)
• Hard to reconstruct, easy to destroy.
• Margin m: the score of generated images does not have to be very negative — 0 is for the best, and anything below the margin is treated the same.
• An autoencoder-based discriminator only gives a limited region a large value.
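A toy sketch of the autoencoder-as-discriminator idea with a margin (not from the slides; the fixed projection standing in for the autoencoder is a hypothetical illustration):

```python
# EBGAN-style discriminator: the energy of x is the reconstruction error of an
# autoencoder (lower = more "real"). Toy autoencoder: it can only reconstruct
# inputs along the "real" direction (1, 1).
def recon_error(x):
    # project onto (1,1)/sqrt(2) and reconstruct from the projection
    s = (x[0] + x[1]) / 2.0
    rec = [s, s]
    return sum((a - b) ** 2 for a, b in zip(x, rec))

m = 1.0  # margin: errors of generated images beyond m contribute nothing

def d_loss(x_real, x_gen):
    # push the reconstruction error of real images down, and push that of
    # generated images up, but only until it reaches the margin m
    return recon_error(x_real) + max(0.0, m - recon_error(x_gen))

real      = [1.0, 1.0]    # perfectly reconstructable: error 0
gen_bad   = [1.0, -1.0]   # orthogonal to the code direction: error 2 > m
gen_close = [1.0, 0.5]    # small error < m: still penalized

assert recon_error(real) == 0.0
assert d_loss(real, gen_bad) == 0.0    # error already beyond the margin: hinge is 0
assert d_loss(real, gen_close) > 0.0   # generated image not yet pushed past m
```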
Stack and Progressive GAN
Patch GAN
(Figure: instead of a single D producing one score for the whole image, D produces a score for each local patch.)
https://arxiv.org/pdf/1611.07004.pdf
Stack GAN
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”, ICCV, 2017
Stack GAN ++
• Tree-like structure
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, “StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks”, arXiv 2017
c.f. ACGAN
https://arxiv.org/pdf/1610.09585.pdf
Progressive Growing of GAN
Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, arXiv 2017
Ensemble
Observation
(Experimental results provided by classmate 陳柏文.)
Yaxing Wang, Lichao Zhang, Joost van de Weijer, “Ensembles of Generative Adversarial Networks”, NIPS workshop, 2016
Ensemble
Generator 1
Generator 2
(Image source: classmate 柯達方.)
This figure is from the original paper: https://arxiv.org/abs/1612.04021
Generative Adversarial Parallelization
Code: https://github.com/poolio/unrolled_gan/blob/master/Unrolled%20GAN%20demo.ipynb
Unrolled GAN – Experimental Results
Style Transfer
Cycle GAN Dual GAN
Disco GAN
Coupled GAN (CoGAN)
Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016
UNIT: Unsupervised Image-to- image Translation
https://arxiv.org/pdf/1703.00848.pdf https://github.com/mingyuliutw/unit
DTN: Domain Transfer Network
https://arxiv.org/pdf/1611.02200.pdf
XGAN
https://arxiv.org/pdf/1711.05139.pdf
StarGAN
https://arxiv.org/pdf/1711.09020.pdf
StarGAN
Feature
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther, “Autoencoding beyond pixels using a learned similarity metric”, ICML, 2016
InfoGAN
Regular GAN: modifying a specific dimension of z has no clear meaning.
(Figure: what we expect vs. what actually happens; the colors represent the characteristics.)
What is InfoGAN?
(Figure) The generator input is split into a code c and the remaining noise z': z = (z', c); the generator outputs x. The discriminator outputs a scalar as usual. A classifier predicts the code c that generated x. The generator and classifier together form an “auto-encoder”: the generator acts as the decoder (c → x) and the classifier as the encoder (x → c). The classifier and the discriminator share parameters (only the last layer is different).
What is InfoGAN?
Because the classifier must recover c from x, c must have a clear influence on x.
https://arxiv.org/abs/1606.03657
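The “recover c from x” objective can be sketched as a cross-entropy on the classifier's output (not from the slides; the classifier outputs below are hypothetical):

```python
import math

# InfoGAN auxiliary loss sketch: the classifier Q must recover the code c from x.
# If c has a clear influence on x, Q can put high probability on the true c,
# making the cross-entropy -log Q(c|x) small.
def code_recovery_loss(q_given_x, true_c):
    # cross-entropy between the true (one-hot) code and the classifier output
    return -math.log(q_given_x[true_c])

# A classifier that recovers the code well vs. one that cannot
confident = [0.9, 0.05, 0.05]
uniform   = [1 / 3, 1 / 3, 1 / 3]

assert code_recovery_loss(confident, 0) < code_recovery_loss(uniform, 0)
# Perfect recovery drives the loss to 0
assert abs(code_recovery_loss([1.0, 0.0, 0.0], 0)) < 1e-12
```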
VAE-GAN
(Figure: x → Encoder → z → Generator (Decoder) → x̃ → Discriminator → scalar; x̃ as close as possible to x.)
• VAE part: minimize the reconstruction error; keep z close to normal.
• GAN part: the generator minimizes the reconstruction error and cheats the discriminator; the discriminator discriminates real, generated, and reconstructed images.
• The discriminator provides the similarity measure.
Algorithm
• Initialize En, De, Dis
• In each iteration:
• Sample M images {x^1, x^2, ⋯, x^M} from the database
• Generate M codes {z̃^1, z̃^2, ⋯, z̃^M} from the encoder: z̃^i = En(x^i)
• Generate M images {x̃^1, x̃^2, ⋯, x̃^M} from the decoder: x̃^i = De(z̃^i)
• Sample M codes {z^1, z^2, ⋯, z^M} from the prior P(z)
• Generate M images {x̂^1, x̂^2, ⋯, x̂^M} from the decoder: x̂^i = De(z^i)
• Update En to decrease ‖x̃^i − x^i‖ and to decrease KL(P(z̃^i|x^i)‖P(z))
• Update De to decrease ‖x̃^i − x^i‖, and to increase Dis(x̃^i) and Dis(x̂^i)
• Update Dis to increase Dis(x^i), and to decrease Dis(x̃^i) and Dis(x̂^i)
Another kind of discriminator: (Figure) Dis takes x and classifies it as real, gen, or recon.
BiGAN
(Figure)
• Encoder: input a (real) image x, output a code z.
• Decoder: input a code z sampled from the prior distribution, output a (generated) image x.
• Discriminator: input an (image x, code z) pair, output whether the pair comes from the encoder or from the decoder.
Algorithm
• Initialize encoder En, decoder De, discriminator Dis
• In each iteration:
• Sample M images {x^1, x^2, ⋯, x^M} from the database
• Generate M codes {z̃^1, z̃^2, ⋯, z̃^M} from the encoder: z̃^i = En(x^i)
• Sample M codes {z^1, z^2, ⋯, z^M} from the prior P(z)
• Generate M images {x̃^1, x̃^2, ⋯, x̃^M} from the decoder: x̃^i = De(z^i)
• Update Dis to increase Dis(x^i, z̃^i) and decrease Dis(x̃^i, z^i)
• Update En and De to decrease Dis(x^i, z̃^i) and increase Dis(x̃^i, z^i)
Let P(x, z) be the joint distribution from the encoder and Q(x, z) the joint distribution from the decoder. The discriminator evaluates the difference between P and Q; training makes P and Q the same.
Optimal encoder and decoder:
En(x') = z' and De(z') = x' for all x'; En(x'') = z'' and De(z'') = x'' for all z''.
How about training En and De as an auto-encoder instead? (Figure: x → En → z → De → x̃, and z → De → x → En → z̃.) The optimal encoder and decoder are the same as BiGAN's.
Triple GAN
Chongxuan Li, Kun Xu, Jun Zhu, Bo Zhang, “Triple Generative Adversarial Nets”, arXiv 2017
Domain-adversarial training
• Training and testing data are in different domains.
(Figure: the same feature-extracting generator maps training data and testing data to features with the same distribution.)
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, “Domain-Adversarial Training of Neural Networks”, JMLR, 2016
Domain-adversarial training
This is a big network, but different parts have different goals:
• Feature extractor: maximize label classification accuracy + minimize domain classification accuracy
• Label predictor: maximize label classification accuracy
• Domain classifier: maximize domain classification accuracy
The feature extractor must not only cheat the domain classifier, but also satisfy the label predictor at the same time.
Feature Disentangle
Emily Denton, Vighnesh Birodkar, “Unsupervised Learning of Disentangled Representations from Video”, NIPS, 2017
Experimental Results
https://arxiv.org/pdf/1705.10915.pdf
Evaluation
Ref: Lucas Theis, Aäron van den Oord, Matthias Bethge, “A note on the evaluation of generative models”, arXiv preprint, 2015
Likelihood
(Figure: a prior distribution feeds the generator, defining P_G; the x^i are real data, not observed during training.)
Log likelihood: L = (1/N) Σ_i log P_G(x^i)
We cannot compute P_G(x^i); we can only sample from P_G.
Likelihood – Kernel Density Estimation
• Estimate the distribution P_G(x) from samples of the generator.
• Each sample is the mean of a Gaussian with the same covariance.
• Now we have an approximation of P_G, so we can compute P_G(x^i) for each real data point x^i, and then compute the likelihood.
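A minimal 1-D sketch of this estimator (not from the slides; the generated samples and test points are hypothetical):

```python
import math

# Kernel density estimation: place a Gaussian (same variance) on every
# generated sample, then evaluate the density at real test points.
def kde_log_density(x, samples, sigma=0.5):
    n = len(samples)
    # p(x) = (1/n) * sum_i N(x; sample_i, sigma^2)
    dens = sum(math.exp(-(x - s) ** 2 / (2 * sigma ** 2)) /
               (sigma * math.sqrt(2 * math.pi)) for s in samples) / n
    return math.log(dens)

generated = [0.0, 0.1, -0.1, 0.2, -0.2]  # samples from P_G, clustered near 0

# A test point near the samples gets a higher estimated log-density than a far one
assert kde_log_density(0.0, generated) > kde_log_density(5.0, generated)

# Average log-likelihood of held-out "real" data under the estimated P_G
real_test = [0.05, -0.05, 0.15]
log_likelihood = sum(kde_log_density(x, generated) for x in real_test) / len(real_test)
```

The result is sensitive to the kernel width sigma, which is one reason this likelihood estimate can be unreliable.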
Likelihood v.s. Quality
• Low likelihood, high quality? Consider a model generating good images with small variance: real evaluation data x^i can still have P_G(x^i) = 0, so L = (1/N) Σ_i log P_G(x^i) is low even though the samples look good.
• High likelihood, low quality? Suppose generator G2 outputs G1's samples with probability 0.01 and something else with probability 0.99. Then P_G2(x^i) ≥ 0.01 · P_G1(x^i), so its log likelihood is at least −log 100 + (1/N) Σ_i log P_G1(x^i) — only about 4.6 lower than G1's, even though 99% of its outputs are bad.
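The arithmetic behind the 4.6 (a sketch; the value of L below is hypothetical):

```python
import math

# G2 outputs G1's (good) samples with probability 0.01 and garbage otherwise.
# For every evaluation point x^i:  P_G2(x^i) >= 0.01 * P_G1(x^i),
# so the average log-likelihood drops by at most log(100) ≈ 4.6.
drop = math.log(100)
assert abs(drop - 4.6) < 0.01

# With G1's average log-likelihood L, G2 still scores about L - 4.6,
# a tiny gap on the likelihood scale despite 99% garbage outputs:
L = -200.0          # hypothetical average log-likelihood of G1
L2 = L - drop
assert L2 > L - 5
```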
Evaluate by Other Networks
(Figure) Feed a generated x into a well-trained CNN classifier to obtain P(y|x).
• Lower entropy of P(y|x) means higher visual quality (the classifier is confident about what the image is).
• Feed many samples x^1, x^2, x^3, … and average: P(y) = (1/N) Σ_n P(y^n|x^n). Higher entropy of P(y) means higher variety.
Evaluate by Other Networks - Inception Score
Score = Σ_x Σ_y P(y|x) log P(y|x) − Σ_y P(y) log P(y)
      = (negative entropy of P(y|x), summed over samples) + (entropy of P(y))
https://arxiv.org/pdf/1606.03498.pdf
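A small sketch of this quantity (not from the slides; it averages the first term over samples, and the classifier outputs are hypothetical):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def inception_score_log(p_y_given_x):
    # Slide's formula: average negative entropy of P(y|x) plus entropy of P(y),
    # where P(y) is the class distribution averaged over generated samples.
    n = len(p_y_given_x)
    k = len(p_y_given_x[0])
    p_y = [sum(p[c] for p in p_y_given_x) / n for c in range(k)]
    avg_neg_ent = -sum(entropy(p) for p in p_y_given_x) / n
    return avg_neg_ent + entropy(p_y)

# Confident (low-entropy) predictions spread over many classes: high score
sharp_varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Uniform (high-entropy) predictions: score 0, the minimum
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

assert abs(inception_score_log(sharp_varied) - math.log(3)) < 1e-9
assert abs(inception_score_log(blurry)) < 1e-9
```

The commonly reported Inception Score is the exponential of this expected-KL form.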
Inception Score
Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017
FID (smaller is better): https://arxiv.org/pdf/1706.08500.pdf
Mode Collapse
(Figure: the generated distribution covers only part of the data distribution.)
• Mode collapse is easy to detect; missing modes are harder to spot.
• We don’t want a memory GAN (a generator that just memorizes training data).
• Use k-nearest neighbors to check whether the generator generates new objects.
https://arxiv.org/pdf/1511.01844.pdf
Concluding Remarks
from A to Z
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017