We will examine techniques to address the difficulty of storing or inverting the Hessian.
But before that, let us derive its mathematical form.
Hessian Matrix I
For a CNN, the gradient of $f(\theta)$ is
\[
\nabla f(\theta) = \frac{1}{C}\,\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i}), \tag{1}
\]
where
\[
J^i =
\begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_1^{L+1,i}}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_n}
\end{bmatrix}
\in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l, \tag{2}
\]
Hessian Matrix II
is the Jacobian of $z^{L+1,i}(\theta)$.
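As a sanity check, the pieces of (1) can be assembled numerically. Below is a minimal NumPy sketch (all names hypothetical: a single training instance, a one-layer tanh model, and the squared loss) that builds $J^i$ by finite differences and verifies formula (1) against a finite-difference gradient of $f$; it illustrates the formula, it is not the implementation used in practice.

```python
import numpy as np

# Toy setup (hypothetical): one instance (l = 1), squared loss,
# f(theta) = ||theta||^2 / (2C) + (1/l) * xi(z(theta); y),
# so that grad f = theta / C + (1/l) * J^T * grad_z xi, as in (1).
rng = np.random.default_rng(0)
x = rng.standard_normal(2)          # fixed input
y = rng.standard_normal(2)          # target
C, l = 10.0, 1
n, n_out = 4, 2                     # n parameters, n_{L+1} outputs

def z(theta):                       # network output z^{L+1}(theta)
    W = theta.reshape(n_out, 2)     # a single tanh layer
    return np.tanh(W @ x)

def f(theta):
    return theta @ theta / (2 * C) + np.sum((z(theta) - y) ** 2) / l

theta = rng.standard_normal(n)
eps = 1e-6

# Jacobian J[j, k] = dz_j / dtheta_k by central differences
J = np.zeros((n_out, n))
for k in range(n):
    e = np.zeros(n); e[k] = eps
    J[:, k] = (z(theta + e) - z(theta - e)) / (2 * eps)

grad_z_xi = 2 * (z(theta) - y)                  # gradient of xi w.r.t. z
grad_formula = theta / C + J.T @ grad_z_xi / l  # right-hand side of (1)

grad_fd = np.array([(f(theta + eps * np.eye(n)[k])
                     - f(theta - eps * np.eye(n)[k])) / (2 * eps)
                    for k in range(n)])
print(np.allclose(grad_formula, grad_fd, atol=1e-5))  # True
```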
The Hessian matrix of $f(\theta)$ is
\[
\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i
+ \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_{L+1}}
\frac{\partial \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z_j^{L+1,i}}
\begin{bmatrix}
\frac{\partial^2 z_j^{L+1,i}}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 z_j^{L+1,i}}{\partial \theta_1 \partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 z_j^{L+1,i}}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 z_j^{L+1,i}}{\partial \theta_n \partial \theta_n}
\end{bmatrix},
Hessian Matrix III
where $I$ is the identity matrix and $B^i$ is the Hessian of $\xi(\cdot)$ with respect to $z^{L+1,i}$:
\[
B^i = \nabla^2_{z^{L+1,i} z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i}).
\]
More precisely,
\[
B^i_{ts} = \frac{\partial^2 \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z_t^{L+1,i} \, \partial z_s^{L+1,i}}, \quad \forall t, s = 1, \ldots, n_{L+1}. \tag{3}
\]
Usually $B^i$ is very simple.
Hessian Matrix IV
For example, if the squared loss
\[
\xi(z^{L+1,i}; y^i) = \| z^{L+1,i} - y^i \|^2
\]
is used, then
\[
B^i =
\begin{bmatrix}
2 & & \\
& \ddots & \\
& & 2
\end{bmatrix}
= 2I.
\]
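To see this, differentiate the squared loss twice with respect to the outputs:
\[
\frac{\partial \xi}{\partial z_t^{L+1,i}} = 2 \left( z_t^{L+1,i} - y_t^i \right)
\quad \Rightarrow \quad
\frac{\partial^2 \xi}{\partial z_t^{L+1,i} \, \partial z_s^{L+1,i}} =
\begin{cases}
2 & t = s, \\
0 & t \neq s,
\end{cases}
\]
so $B^i = 2I$.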
Usually we consider a loss function $\xi(z^{L+1,i}; y^i)$ that is convex with respect to $z^{L+1,i}$.
Hessian Matrix V
Thus $B^i$ is positive semi-definite.
The last term of $\nabla^2 f(\theta)$, however, may not be positive semi-definite.
Note that a twice differentiable function $f(\theta)$ is convex if and only if $\nabla^2 f(\theta)$ is positive semi-definite.
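A one-variable example makes the possible indefiniteness concrete. The sketch below (a hypothetical toy: $z(\theta) = \tanh(\theta x)$ with squared loss and no regularization) computes the two curvature terms separately; the Gauss-Newton part is always nonnegative, while the last term drives the total below zero:

```python
import numpy as np

# Toy: z(theta) = tanh(theta * x), xi = (z - y)^2, a single parameter.
# d2 xi / d theta^2 = 2 (dz)^2 + 2 (z - y) d2z  -- the 1-D analogue of
# the Gauss-Newton term plus the last term of the Hessian.
x, y = 1.0, -5.0        # target chosen so that (z - y) * d2z < 0
theta = 1.5

z = np.tanh(theta * x)
dz = (1 - z**2) * x                  # dz / dtheta
d2z = -2 * z * (1 - z**2) * x**2     # d^2 z / dtheta^2

B = 2.0                              # Hessian of (z - y)^2 w.r.t. z
gn_term = dz * B * dz                # (J)^T B J: always >= 0
last_term = 2 * (z - y) * d2z        # can be negative
print(gn_term, last_term, gn_term + last_term)  # total is negative here
```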
Jacobian Matrix
The Jacobian matrix of $z^{L+1,i}(\theta) \in \mathbb{R}^{n_{L+1}}$ is
\[
J^i =
\begin{bmatrix}
\frac{\partial z_1^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_1^{L+1,i}}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_1} & \cdots & \frac{\partial z_{n_{L+1}}^{L+1,i}}{\partial \theta_n}
\end{bmatrix}
\in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l.
\]
$n_{L+1}$: number of neurons in the output layer; $n$: total number of variables.
$n_{L+1} \times n$ can be large.
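Because $J^i \in \mathbb{R}^{n_{L+1} \times n}$ is usually too large to form explicitly, practical codes work with Jacobian-vector products instead. A minimal sketch (hypothetical helper; a finite-difference stand-in for the exact products used in real implementations):

```python
import numpy as np

def jvp(z, theta, v, eps=1e-6):
    """Approximate J v, the directional derivative of z(theta) along v,
    without ever storing the n_{L+1} x n Jacobian."""
    return (z(theta + eps * v) - z(theta - eps * v)) / (2 * eps)
```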
Gauss-Newton Matrix I
The Hessian matrix $\nabla^2 f(\theta)$ is therefore not guaranteed to be positive definite.
We may need a positive definite approximation.
Many existing Newton methods for NN have considered the Gauss-Newton matrix (Schraudolph, 2002):
\[
G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i,
\]
obtained by removing the last term of $\nabla^2 f(\theta)$.
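For intuition, here is a minimal sketch (hypothetical helper; only feasible when $n$ is small enough to store an $n \times n$ matrix) that forms $G$ explicitly:

```python
import numpy as np

def gauss_newton(J_list, B_list, C, l):
    """G = (1/C) I + (1/l) sum_i (J^i)^T B^i J^i, formed explicitly."""
    n = J_list[0].shape[1]
    G = np.eye(n) / C
    for J, B in zip(J_list, B_list):
        G += J.T @ B @ J / l
    return G
```

Since each $(J^i)^T B^i J^i$ is positive semi-definite whenever $B^i$ is, every eigenvalue of $G$ is at least $1/C > 0$.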
Gauss-Newton Matrix II
The Gauss-Newton matrix is positive definite if each $B^i$ is positive semi-definite.
This can be achieved if we use a loss function that is convex in $z^{L+1,i}(\theta)$.
We then solve
\[
G d = -\nabla f(\theta)
\]
for the update direction $d$.
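When $G$ is small enough to store, this linear system can be solved directly; a minimal sketch (hypothetical helper) using a Cholesky factorization, which is valid because $G$ is symmetric positive definite:

```python
import numpy as np

def newton_direction(G, grad):
    """Solve G d = -grad via G = L L^T."""
    L = np.linalg.cholesky(G)       # fails if G is not positive definite
    return -np.linalg.solve(L.T, np.linalg.solve(L, grad))
```

At CNN scale, $G$ cannot be stored, so one instead applies an iterative solver driven by $G$-vector products, which is where the techniques previewed at the beginning come in.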
References I
N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent.
Neural Computation, 14(7):1723–1738, 2002.