We will examine techniques to address the difficulty of storing or inverting the Hessian.
But before that, let us derive its mathematical form.
Hessian Matrix I
For a CNN, the gradient of $f(\theta)$ is
\[
\nabla f(\theta) = \frac{1}{C}\,\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i}), \tag{1}
\]
where
\[
J^i =
\begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_1^{L+1,i}}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_n}
\end{bmatrix}
\in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l, \tag{2}
\]
Hessian Matrix II
is the Jacobian of $z^{L+1,i}(\theta)$.
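As a sanity check, the pieces of (1) can be assembled numerically. Below is a minimal NumPy sketch (all names hypothetical: a single training instance, a one-layer tanh model, and the squared loss) that builds $J^i$ by finite differences and verifies formula (1) against a finite-difference gradient of $f$; it illustrates the formula, it is not the implementation used in practice.

```python
import numpy as np

# Toy setup (hypothetical): one instance (l = 1), squared loss,
# f(theta) = ||theta||^2 / (2C) + (1/l) * xi(z(theta); y),
# so that grad f = theta / C + (1/l) * J^T * grad_z xi, as in (1).
rng = np.random.default_rng(0)
x = rng.standard_normal(2)          # fixed input
y = rng.standard_normal(2)          # target
C, l = 10.0, 1
n, n_out = 4, 2                     # n parameters, n_{L+1} outputs

def z(theta):                       # network output z^{L+1}(theta)
    W = theta.reshape(n_out, 2)     # a single tanh layer
    return np.tanh(W @ x)

def f(theta):
    return theta @ theta / (2 * C) + np.sum((z(theta) - y) ** 2) / l

theta = rng.standard_normal(n)
eps = 1e-6

# Jacobian J[j, k] = dz_j / dtheta_k by central differences
J = np.zeros((n_out, n))
for k in range(n):
    e = np.zeros(n); e[k] = eps
    J[:, k] = (z(theta + e) - z(theta - e)) / (2 * eps)

grad_z_xi = 2 * (z(theta) - y)                  # gradient of xi w.r.t. z
grad_formula = theta / C + J.T @ grad_z_xi / l  # right-hand side of (1)

grad_fd = np.array([(f(theta + eps * np.eye(n)[k])
                     - f(theta - eps * np.eye(n)[k])) / (2 * eps)
                    for k in range(n)])
print(np.allclose(grad_formula, grad_fd, atol=1e-5))  # True
```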
The Hessian matrix of $f(\theta)$ is
\[
\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i
+ \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_{L+1}}
\frac{\partial \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z_j^{L+1,i}}
\begin{bmatrix}
\frac{\partial^2 z_j^{L+1,i}}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 z_j^{L+1,i}}{\partial \theta_1 \partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 z_j^{L+1,i}}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 z_j^{L+1,i}}{\partial \theta_n \partial \theta_n}
\end{bmatrix},
Hessian Matrix III
where $I$ is the identity matrix and $B^i$ is the Hessian of $\xi(\cdot)$ with respect to $z^{L+1,i}$:
\[
B^i = \nabla^2_{z^{L+1,i} z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i}).
\]
More precisely,
\[
B^i_{ts} = \frac{\partial^2 \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z_t^{L+1,i} \, \partial z_s^{L+1,i}}, \quad \forall t, s = 1, \ldots, n_{L+1}. \tag{3}
\]
Usually $B^i$ is very simple.
Hessian Matrix IV
For example, if the squared loss
\[
\xi(z^{L+1,i}; y^i) = \| z^{L+1,i} - y^i \|^2
\]
is used, then
\[
B^i =
\begin{bmatrix}
2 & & \\
& \ddots & \\
& & 2
\end{bmatrix}
= 2I.
\]
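To see this, differentiate the squared loss twice with respect to the outputs:
\[
\frac{\partial \xi}{\partial z_t^{L+1,i}} = 2 \left( z_t^{L+1,i} - y_t^i \right)
\quad \Rightarrow \quad
\frac{\partial^2 \xi}{\partial z_t^{L+1,i} \, \partial z_s^{L+1,i}} =
\begin{cases}
2 & t = s, \\
0 & t \neq s,
\end{cases}
\]
so $B^i = 2I$.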
Usually we consider a loss function $\xi(z^{L+1,i}; y^i)$ that is convex with respect to $z^{L+1,i}$.
Hessian Matrix V
Thus $B^i$ is positive semi-definite.
The last term of $\nabla^2 f(\theta)$, however, may not be positive semi-definite.
Note that a twice differentiable function $f(\theta)$ is convex if and only if $\nabla^2 f(\theta)$ is positive semi-definite.
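A one-variable example makes the possible indefiniteness concrete. The sketch below (a hypothetical toy: $z(\theta) = \tanh(\theta x)$ with squared loss and no regularization) computes the two curvature terms separately; the Gauss-Newton part is always nonnegative, while the last term drives the total below zero:

```python
import numpy as np

# Toy: z(theta) = tanh(theta * x), xi = (z - y)^2, a single parameter.
# d2 xi / d theta^2 = 2 (dz)^2 + 2 (z - y) d2z  -- the 1-D analogue of
# the Gauss-Newton term plus the last term of the Hessian.
x, y = 1.0, -5.0        # target chosen so that (z - y) * d2z < 0
theta = 1.5

z = np.tanh(theta * x)
dz = (1 - z**2) * x                  # dz / dtheta
d2z = -2 * z * (1 - z**2) * x**2     # d^2 z / dtheta^2

B = 2.0                              # Hessian of (z - y)^2 w.r.t. z
gn_term = dz * B * dz                # (J)^T B J: always >= 0
last_term = 2 * (z - y) * d2z        # can be negative
print(gn_term, last_term, gn_term + last_term)  # total is negative here
```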
Jacobian Matrix
The Jacobian matrix of $z^{L+1,i}(\theta) \in \mathbb{R}^{n_{L+1}}$ is
\[
J^i =
\begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_1^{L+1,i}}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_n}
\end{bmatrix}
\in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l.
\]
$n_{L+1}$: number of neurons in the output layer; $n$: total number of variables.
$n_{L+1} \times n$ can be large.
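Because $J^i \in \mathbb{R}^{n_{L+1} \times n}$ is usually too large to form explicitly, practical codes work with Jacobian-vector products instead. A minimal sketch (hypothetical helper; a finite-difference stand-in for the exact products used in real implementations):

```python
import numpy as np

def jvp(z, theta, v, eps=1e-6):
    """Approximate J v, the directional derivative of z(theta) along v,
    without ever storing the n_{L+1} x n Jacobian."""
    return (z(theta + eps * v) - z(theta - eps * v)) / (2 * eps)
```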
Gauss-Newton Matrix I
The Hessian matrix $\nabla^2 f(\theta)$ is therefore not guaranteed to be positive definite.
We may need a positive definite approximation.
Many existing Newton methods for NN have considered the Gauss-Newton matrix (Schraudolph, 2002):
\[
G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i,
\]
obtained by removing the last term of $\nabla^2 f(\theta)$.
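For intuition, here is a minimal sketch (hypothetical helper; only feasible when $n$ is small enough to store an $n \times n$ matrix) that forms $G$ explicitly:

```python
import numpy as np

def gauss_newton(J_list, B_list, C, l):
    """G = (1/C) I + (1/l) sum_i (J^i)^T B^i J^i, formed explicitly."""
    n = J_list[0].shape[1]
    G = np.eye(n) / C
    for J, B in zip(J_list, B_list):
        G += J.T @ B @ J / l
    return G
```

Since each $(J^i)^T B^i J^i$ is positive semi-definite whenever $B^i$ is, every eigenvalue of $G$ is at least $1/C > 0$.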
Gauss-Newton Matrix II
The Gauss-Newton matrix is positive definite if each $B^i$ is positive semi-definite.
This can be achieved if we use a loss function that is convex in $z^{L+1,i}(\theta)$.
We then solve
\[
G d = -\nabla f(\theta)
\]
for the update direction $d$.
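When $G$ is small enough to store, this linear system can be solved directly; a minimal sketch (hypothetical helper) using a Cholesky factorization, which is valid because $G$ is symmetric positive definite:

```python
import numpy as np

def newton_direction(G, grad):
    """Solve G d = -grad via G = L L^T."""
    L = np.linalg.cholesky(G)       # fails if G is not positive definite
    return -np.linalg.solve(L.T, np.linalg.solve(L, grad))
```

At CNN scale, $G$ cannot be stored, so one instead applies an iterative solver driven by $G$-vector products, which is where the techniques previewed at the beginning come in.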
References I
N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent.
Neural Computation, 14(7):1723–1738, 2002.