
Chapter 4

Main Theorems

This chapter introduces the main theorems and lemmas stated and proved in [17], which show that the modified neural network eliminates all suboptimal local minima of the objective function of the original network.

4.1 Lemmas

The following two lemmas are used to prove the main theorems in the next section. In this section, apart from presenting the proofs of the lemmas in [17], we discuss the differentiability of the modified objective function and the use of the chain rule in Lemma 1 in more detail, and we prove four claims to make the proof of Lemma 2 in [17] more complete.
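Throughout this chapter, the modified network adds one exponential unit cj exp(wj x + bj) to the j-th output of the original network. As a reminder of this construction from [17], the following is a minimal Python/NumPy sketch of ˜f = f + g and ˜L; the base network f, the per-example loss, and the quadratic form of the regularization term on c are illustrative assumptions rather than the thesis's exact choices.

import numpy as np

# Sketch of the modified model: g(x; b, c, W)_j = c_j * exp(w_j x + b_j) is added
# to the j-th output of the original network f(x; theta).
def g(x, b, c, W):
    return c * np.exp(W.T @ x + b)            # shape (d_bar,); w_j is the j-th column of W

def f_tilde(x, theta, b, c, W, f):
    return f(x, theta) + g(x, b, c, W)        # modified prediction

def L_tilde(theta, b, c, W, f, loss, xs, ys, lam=1.0):
    # modified objective: per-example losses of f_tilde plus a regularization
    # term on c (assumed to be (lam/2)*||c||^2 here, for illustration only)
    data = sum(loss(y, f_tilde(x, theta, b, c, W, f)) for x, y in zip(xs, ys))
    return data + 0.5 * lam * np.sum(c ** 2)

Here W has shape (d, d¯), matching W ∈ Rd×d¯ in the text, so that W.T @ x stacks the inner products wj x.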

Lemma 1. Let ℓyi : Rd¯ → R be differentiable for all i ∈ {1, . . . , n}. For any (θ, W ), if (b, c) is a stationary point of ˜L|(θ,W ), then c = 0.

Proof. For i ∈ {1, . . . , n}, (b, c) ∈ Rd¯ × Rd¯, and W ∈ Rd×d¯ defined above, let

qi(θ, b, c, W ) = f (xi; θ) + g(xi; b, c, W ),

whose j-th component is

qi(θ, b, c, W )j = f (xi; θ)j + cj exp(wj xi + bj),   j ∈ {1, . . . , d¯}.

Fix (θ, W ); then qi|(θ,W ) is differentiable at (b, c). Since ℓyi is differentiable on Rd¯, the composition ℓyi ◦ qi|(θ,W ) is differentiable at (b, c) by the chain rule, and hence so is ˜L|(θ,W ). From the definition of a stationary point of the differentiable function ˜L|(θ,W ), for all j ∈ {1, 2, . . . , d¯}, the partial derivatives of ˜L|(θ,W ) with respect to bj and cj vanish at (b, c), and combining these equations yields cj = 0 for every j, that is, c = 0.

From the proof of Lemma 1, we can see that the regularization of the modified objective function ˜L is necessary.
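To see concretely how the stationarity conditions force c = 0, the following is a small symbolic check on a one-sample, one-dimensional instance (n = 1, d = d¯ = 1); the squared loss and the regularizer (lam/2)·c² are assumed only for illustration and mirror the structure, not the exact form, of the argument in [17].

import sympy as sp

# Toy instance of the stationarity computation behind Lemma 1 (one sample,
# scalar output); p stands for f(x; theta), which is held fixed here.
b, c, p, y, w, x = sp.symbols('b c p y w x', real=True)
lam = sp.symbols('lam', positive=True)

L_tilde = (p + c * sp.exp(w * x + b) - y) ** 2 + lam / 2 * c ** 2
dLdb = sp.diff(L_tilde, b)        # stationarity condition in b
dLdc = sp.diff(L_tilde, c)        # stationarity condition in c

# Combining the two equations eliminates the loss term entirely:
print(sp.simplify(c * dLdc - dLdb))   # -> c**2*lam, so dLdb = dLdc = 0 forces c = 0

Without the regularization term, the same combination vanishes identically and yields no information, which is consistent with the remark above that the regularization of ˜L is necessary.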

Lemma 2. Let ℓyi : Rd¯ → R be differentiable for all i ∈ {1, . . . , n}. Then, for any θ, if

Proof. We know that for a multivariate real-valued function f on Rn, it is differentiable at x if and only if there exist a vector ∇f(x) and a function ρ(x; ·) defined on D = {u ∈ Rn : 0 < ∥u − 0∥ < δ} such that

f(x + u) = f(x) + ∇f(x)u + ∥u∥ ρ(x; u)   for all u ∈ D,   with lim_{u→0} ρ(x; u) = 0.

We will also need the following claim, which is proved below.

Claim 1. g(xi; b, ∆c, W + ∆W ) is arbitrarily small for sufficiently small ∆c and ∆W .

Proof. The function f2 is continuous at (0, 0), which can be deduced from a similar idea as above. Third, we need to show that if f2 is continuous at some vector x0 and h is continuous at f2(x0), then h ◦ f2 is continuous at x0. Given ϵ > 0, there exists δ1 > 0 such that |h(u) − h(f2(x0))| < ϵ for all u ∈ Domain(h) with |u − f2(x0)| < δ1, since h is continuous at f2(x0). Moreover, for ϵ = δ1, there exists δ > 0 such that |f2(x) − f2(x0)| < δ1 for all x ∈ Domain(f2) with ∥x − x0∥ < δ, since f2 is continuous at x0. Hence, ∥x − x0∥ < δ implies |h(f2(x)) − h(f2(x0))| < ϵ, and we have finished the proof of this step. Finally, let h(t) = exp(t). Since f2 is continuous at (0, 0) and h is continuous at f2(0, 0), h ◦ f2 is continuous at (0, 0). Also, f1 is continuous at (0, 0), so f1 · (h ◦ f2) is continuous at (0, 0), and the components of g(xi; b, ∆c, W + ∆W ) tend to 0 simultaneously as (∆c, ∆W ) → (0, 0), which proves Claim 1.

Due to this, we obtain the desired expansion from the Maclaurin series of the natural exponential function,

exp(y) = Σ_{l=0}^{∞} y^l / l!,   ∀ y ∈ R,

which we prove below.

Proof. From the equality displayed above, the second equality follows by letting m → ∞. Here, the second equality holds because the existence of lim_{t→∞} Σ_{l=0}^{t} (ϵ̄j^l / l!) zl implies the existence of lim_{t→∞} Σ_{l=1}^{t} (ϵ̄j^l / l!) zl.

Proof. We use the fact that if Σ_{l=1}^{∞} fl converges uniformly on a set S ⊆ R and each fl is continuous at 0, a limit point of S, then lim_{x→0} Σ_{l=1}^{∞} fl(x) = Σ_{l=1}^{∞} fl(0). Hence, if Σ_{l} (ϵ̄j^l / l!) zl converges uniformly on a neighborhood of 0, then its limit as ϵ̄j → 0 can be computed term by term.

Initially, fix R > 0; we show that Σ_{l=0}^{∞} R^l / l! converges by the ratio test:

lim_{l→∞} ( R^{l+1}/(l+1)! ) / ( R^l/l! ) = lim_{l→∞} R/(l+1) = 0 < 1.

Since |x^l / l!| ≤ R^l / l! for all x ∈ [−R, R], the series Σ_{l=0}^{∞} x^l / l! converges uniformly on [−R, R] by the Weierstrass M-test. Consequently, we deduce the statement Σ_{l=0}^{∞} x^l / l! = exp(x) uniformly on [−R, R] by uniqueness of the limit.

From this, we have

Σ_{l=0}^{∞} (ϵ̄j uj xi)^l / l! = exp(ϵ̄j uj xi).
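The uniform convergence on [−R, R] can also be observed numerically; the following quick check (with R and the truncation order chosen arbitrarily) compares a partial sum of the Maclaurin series with exp in the sup norm, assuming NumPy.

import numpy as np
from math import factorial

# Sup-norm distance between a partial sum of the Maclaurin series and exp on [-R, R].
R, order = 3.0, 30
ys = np.linspace(-R, R, 1001)
partial = sum(ys ** l / factorial(l) for l in range(order + 1))
print(np.max(np.abs(partial - np.exp(ys))))   # tiny, reflecting uniform convergence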

Secondly, we show that the series obtained after summing over i ∈ {1, . . . , n} converges uniformly on I. It suffices to show that the operations of summation and scalar multiplication preserve uniform convergence; that is, if both Σ_{l=0}^{∞} fl = f and Σ_{l=0}^{∞} gl = g uniformly on a set S, then Σ_{l=0}^{∞} (fl + gl) = f + g and Σ_{l=0}^{∞} αfl = αf uniformly on S for every α ∈ R. Hence, we complete the proof and get the fact that Σ_{l=0}^{∞} (ϵ̄j^l / l!) zl = 0 uniformly on I, by uniqueness of the limit.

Last, we show that Σ_{l=1}^{∞} (ϵ̄j^l / l!) zl converges uniformly on I. Since this series differs from Σ_{l=0}^{∞} (ϵ̄j^l / l!) zl only by the single term z0, which does not affect uniform convergence, we conclude that Σ_{l=1}^{∞} (ϵ̄j^l / l!) zl converges uniformly on I and complete the proof of claim 3.

By the assumption of induction, we can neglect the term Σ_{l=0}^{k−1} (ϵ̄j^l / l!) zl on the left-hand and right-hand side of the equation, and we can divide both sides by ϵ̄j^k (nonzero) to get

zk / k! + Σ_{l=k+1}^{∞} (ϵ̄j^{l−k} / l!) zl = 0.

The series Σ_{l=k+1}^{∞} (ϵ̄j^{l−k} / l!) zl converges uniformly on I, and thus its limit as ϵ̄j → 0 can be taken term by term and equals 0. Therefore, we can conclude that zk = 0, which completes the proof of claim 4. From the above analysis, we finish the proof of Lemma 2.

4.2 Theorems

The following two theorems establish theoretically the feasibility of eliminating the suboptimal local minima of the original objective function. The former holds for arbitrary datasets, and the latter holds for realizable datasets. In this section, apart from presenting the proofs of the theorems in [17], we introduce some related notation and theorems, propose a claim, and discuss the case k = 0 to give more details of the proof of Theorem 1, and we also correct a mistake and prove statements (i) and (ii) more clearly in Theorem 2.


Theorem 1. For any i ∈ {1, . . . , n}, the function ℓyi : Rd¯ → R is differentiable and convex. If ˜L has a local minimum at (θ, b, c, W ), then we have

(i) L has a global minimum at θ.

(ii) ˜f (x; θ, b, c, W ) = f (x; θ) for all x ∈ Rd and ˜L(θ, b, c, W ) = L(θ).

Now, before presenting the proof of Theorem 1, we first introduce some additional notations as follows.

A k-th order tensor A in a d-dimensional space is a mathematical object that has k indices, each ranging from 1 to d, and is denoted by

A = (a_{i1,...,ik})_{1≤im≤d, m∈{1,...,k}}.

For instance, a scalar is a 0-th order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor.

A tensor

A = (a_{i1,...,ik})_{1≤im≤d, m∈{1,...,k}}

is called symmetric if the element a_{i1,...,ik} is invariant under any permutation of its indices.

A tensor product ⊗ is an operation between tensors which obeys the following rule. For a p-th order n-dimensional tensor A and a q-th order m-dimensional tensor B, the tensor product A ⊗ B of A and B is the (p + q)-th order tensor

C = (c_{i1,...,ip+q})_{1≤il≤n, 1≤ir≤m, l∈{1,...,p}, r∈{p+1,...,p+q}},   where c_{i1,...,ip+q} = a_{i1,...,ip} b_{ip+1,...,ip+q}.

Moreover, x⊗k = x ⊗ · · · ⊗ x, where x appears k times.

For a k-th order tensor A ∈ Rd×···×d and k vectors u(1), u(2), . . . , u(k) ∈ Rd, define the operation

A(u(1), u(2), . . . , u(k)) = Σ_{1≤i1,...,ik≤d} A_{i1,...,ik} u(1)_{i1} · · · u(k)_{ik}.
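The following NumPy sketch illustrates these definitions numerically on a random example: it builds x⊗k by repeated outer products and contracts every index with the same vector u, checking the identity x⊗k(u, . . . , u) = (u x)^k used in the claim below; the dimension d, the order k, and the random seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
x, u = rng.normal(size=d), rng.normal(size=d)

A = x.copy()
for _ in range(k - 1):                 # x^{⊗k}: k-fold tensor product of x with itself
    A = np.multiply.outer(A, x)

form = A
for _ in range(k):                     # A(u, ..., u): contract each index with u
    form = np.tensordot(form, u, axes=([0], [0]))

print(float(form), (u @ x) ** k)       # the two numbers agree up to rounding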

Proof.

Fix θ; then ˜L|θ has a local minimum at (b, c, W ). Here, we define x⊗0i = 1 and let si,j be the coefficient defined in the corresponding proof in [17].

Here, the first equality follows from Theorem 2.1 in [22], which is stated as follows. Suppose that A ∈ Sym_m(Rn) (an m-th order n-dimensional symmetric tensor). Then the optimization problem

(1)   max |A(x(1), . . . , x(m))|   subject to x(1), . . . , x(m) ∈ Rn, ∥x(i)∥2 = 1,

always has a global optimal solution satisfying x(1) = · · · = x(m). This implies that the value of (1) is the same as the value of

max |A x^m|   subject to x ∈ Rn, ∥x∥2 = 1.

Hence, if we can prove that x⊗k is symmetric, then the first equality holds. Since (x⊗k)_{i1,...,ik} = xi1 · · · xik, the tensor x⊗k is symmetric, and we have the second line. The third line follows from the claim below.

Claim.

Proof. To prove the first equality, it suffices to show that x⊗k(u, . . . , u) = (u x)^k for every vector x ∈ Rd and every k ≥ 1. We prove it by induction on k. For the case k = 1, x⊗1(u) = Σ_{1≤i1≤d} xi1 ui1 = u x. Assuming the statement holds for k, we have

x⊗(k+1)(u, . . . , u) = Σ_{1≤i1,...,ik+1≤d} xi1 · · · xik+1 ui1 · · · uik+1 = (Σ_{1≤i≤d} xi ui) · x⊗k(u, . . . , u) = (u x)(u x)^k = (u x)^{k+1},

which completes the proof of the claim. Note that for the case k = 0, the identity holds trivially since x⊗0i = 1 by definition.

Next, the second equality follows directly from Lemma 2, which applies here because ˜L|θ has a local minimum at (b, c, W ). The resulting identity is obvious if k = 0; now we prove that it is true for k ∈ N by contradiction: if the left-hand side is nonzero, then the corresponding tensor is nonzero as well, which leads to a contradiction.

Now, define {I1, . . . , In} as a partition of {1, . . . , n} such that

where the second line follows from the differentiability and convexity of ℓyi. We know that a differentiable multivariate function is convex if and only if its graph lies above all of its tangent hyperplanes, that is,

f (x) ≥ f (y) + ∇f (y)(x − y)   for all x and y in its domain.
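The tangent inequality can be checked numerically on a simple convex example; the following sketch uses f(x) = ∥x∥², whose gradient is ∇f(y) = 2y (an assumed example, not tied to the losses in the thesis), assuming NumPy.

import numpy as np

# First-order convexity condition f(x) >= f(y) + ∇f(y)(x - y) for f(x) = ||x||^2.
rng = np.random.default_rng(3)
f = lambda v: float(v @ v)
grad = lambda v: 2 * v
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert f(x) >= f(y) + grad(y) @ (x - y) - 1e-12   # holds: the gap is ||x - y||^2
print("first-order convexity condition verified on random samples")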

The remaining step follows from multivariate polynomial interpolation in [23]. Before illustrating the polynomial interpolation, let us introduce some notation first.

Let Π^d_n be the space of all d-variate polynomials of degree at most n with real coefficients. Every p ∈ Π^d_n can be written as

p(x) = Σ_{|α|≤n} cα x^α,

where x ∈ Rd and the sum ranges over multi-indices α with |α| ≤ n. Hence, we can find a unique d-variate polynomial of degree at most n − 1 passing through (x̄1, q1,j), . . . , (x̄n, qn,j) by the polynomial interpolation method. More precisely, with k sufficiently large (k = n − 1 is sufficient), for all j ∈ {1, . . . , d¯}, there exists a polynomial pj ∈ Π^d_{n−1} such that pj(x̄i) = qi,j for all i ∈ {1, . . . , n}.
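To make the interpolation step concrete, here is a small univariate sketch (d = 1) assuming NumPy: through n points with distinct nodes there is a unique polynomial of degree at most n − 1, obtained below by solving the Vandermonde system.

import numpy as np

rng = np.random.default_rng(2)
n = 5
x_bar = np.sort(rng.normal(size=n))          # distinct interpolation nodes x̄_1, ..., x̄_n
q = rng.normal(size=n)                        # prescribed values q_{1,j}, ..., q_{n,j} for one fixed j

V = np.vander(x_bar, N=n, increasing=True)    # Vandermonde matrix [x̄_i^k], k = 0, ..., n-1
coef = np.linalg.solve(V, q)                  # coefficients of the interpolating polynomial

p = np.polynomial.Polynomial(coef)
print(np.max(np.abs(p(x_bar) - q)))           # ~ 0: p passes through every (x̄_i, q_{i,j})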

Theorem 1 shows that given an original neural network f and its objective function L, we can construct a modified neural network ˜f with its objective function ˜L so as to obtain the following result. If ˜L attains a local minimum at (θ, b, c, W ) (for example, a local minimum found by a gradient descent algorithm), then the corresponding parameter θ makes L attain a global minimum. Moreover, the value ˜L(θ, b, c, W ) is equal to L(θ), which means that in this case the error we obtain after training the network ˜f is exactly the minimum error we can get from the original neural network.

In addition, since the assumptions of Theorem 1 are mild, the conclusion of the theorem applies to an arbitrary dataset.
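The following toy experiment is an illustrative sketch (not the experiments of the thesis or of [17]): it runs plain gradient descent, with numerical gradients, on ˜L for a small linear model with squared loss and a quadratic regularizer on c, all of which are assumed choices. If the iterates approach a local minimum, Lemma 1 and Theorem 1 predict that c is driven towards 0 and that θ approaches a global minimizer of L.

import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true                              # a realizable toy dataset

def L_tilde(params, lam=1.0):
    theta, w = params[:d], params[d:2 * d]
    b, c = params[2 * d], params[2 * d + 1]
    pred = X @ theta + c * np.exp(X @ w + b)    # f(x; θ) + c·exp(w x + b), with f linear
    return np.sum((pred - y) ** 2) + 0.5 * lam * c ** 2

def num_grad(fun, p, eps=1e-6):
    g = np.zeros_like(p)
    for i in range(p.size):                     # central finite differences
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (fun(p + e) - fun(p - e)) / (2 * eps)
    return g

params = 0.1 * rng.normal(size=2 * d + 2)
for _ in range(5000):
    params -= 1e-3 * num_grad(L_tilde, params)

print("c ~", params[-1])                        # close to 0, as Lemma 1 suggests
print("theta error:", np.linalg.norm(params[:d] - theta_true))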

Theorem 2. For any i ∈ {1, . . . , n}, the function ℓyi : Rd¯ → R is differentiable and, for any q ∈ Rd¯ with ∇ℓyi(q) = 0, ℓyi has a global minimum at q. Suppose in addition that the dataset is realizable, that is, there exists a function f∗ such that f∗(xi) = yi for all i ∈ {1, . . . , n}. If ˜L has a local minimum at (θ, b, c, W ), then we have

(i) L has a global minimum at θ.

(ii) ˜f (x; θ, b, c, W ) = f (x; θ) for all x ∈ Rd and ˜L(θ, b, c, W ) = L(θ).

(iii) ℓyi has a global minimum at f (xi; θ) for all i ∈ {1, . . . , n}.

Here, the second line follows from the assumption that there exists a function f∗ such that f∗(xi) = yi for all i ∈ {1, . . . , n}, together with the preceding notation. The third line follows from the explanation given below.

As for the last line, it follows from the equation in the proof of Theorem 1, which still holds here because it is obtained under the assumptions that ℓyi is differentiable for all i ∈ {1, . . . , n} and ˜L|θ has a local minimum at (b, c, W ). This implies that ∇ℓyi(f (xi; θ)) = 0 for all i ∈ {1, . . . , n}. Hence, statement (iii) holds from the assumption.

Now, we prove statement (i) as follows. Since ℓyi has a global minimum at f (xi; θ) for all i ∈ {1, . . . , n}, we have ℓyi(f (xi; θ′)) ≥ ℓyi(f (xi; θ)) for any θ′ and all i ∈ {1, . . . , n}. This implies L(θ′) ≥ L(θ); namely, L has a global minimum at θ, which shows (i). Finally, the proof of statement (ii) is the same as the one in Theorem 1.

Theorem 2 shows that if the dataset we use is realizable (that is, consistent with some labeling function), then we obtain an additional result beyond those of Theorem 1: for each training input, the loss between its predicted output and the correct output attains the minimum possible value of ℓyi.
