
Chapter 4

Main Theorems

This chapter introduces the main theorems and lemmas stated and proved in [17], which show that the modified neural network eliminates all suboptimal local minima of the objective function of the original network.

4.1 Lemmas

The following two lemmas are used to prove the main theorems in the next section. In this section, apart from presenting the proofs of the lemmas in [17], we discuss the differentiability of the modified objective function and the use of the chain rule in Lemma 1 in more detail, and we prove four claims to make the proof of Lemma 2 in [17] more complete.
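Throughout this chapter, the modified network adds one exponential unit cj exp(wj x + bj) to the j-th output of the original network. As a reminder of this construction from [17], the following is a minimal Python/NumPy sketch of ˜f = f + g and ˜L; the base network f, the per-example loss, and the quadratic form of the regularization term on c are illustrative assumptions rather than the thesis's exact choices.

import numpy as np

# Sketch of the modified model: g(x; b, c, W)_j = c_j * exp(w_j x + b_j) is added
# to the j-th output of the original network f(x; theta).
def g(x, b, c, W):
    return c * np.exp(W.T @ x + b)            # shape (d_bar,); w_j is the j-th column of W

def f_tilde(x, theta, b, c, W, f):
    return f(x, theta) + g(x, b, c, W)        # modified prediction

def L_tilde(theta, b, c, W, f, loss, xs, ys, lam=1.0):
    # modified objective: per-example losses of f_tilde plus a regularization
    # term on c (assumed to be (lam/2)*||c||^2 here, for illustration only)
    data = sum(loss(y, f_tilde(x, theta, b, c, W, f)) for x, y in zip(xs, ys))
    return data + 0.5 * lam * np.sum(c ** 2)

Here W has shape (d, d¯), matching W ∈ Rd×d¯ in the text, so that W.T @ x stacks the inner products wj x.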

Lemma 1. Let ℓyi : Rd¯ → R be differentiable for all i ∈ {1, . . . , n}. For any (θ, W ), if (b, c) is a stationary point of ˜L|(θ,W ), then c = 0.

Proof. For i ∈ {1, . . . , n}, (b, c) ∈ Rd¯ × Rd¯, and W ∈ Rd×d¯ defined above, let

qi(θ, b, c, W ) = f (xi; θ) + g(xi; b, c, W ),

whose j-th component is

qi(θ, b, c, W )j = f (xi; θ)j + cj exp(wj xi + bj),   j ∈ {1, . . . , d¯}.

Fix (θ, W ); then qi|(θ,W ) is differentiable at (b, c). Since ℓyi is differentiable on Rd¯, the composition ℓyi ◦ qi|(θ,W ) is differentiable at (b, c) by the chain rule, and hence so is ˜L|(θ,W ). From the definition of a stationary point of the differentiable function ˜L|(θ,W ), for all j ∈ {1, 2, . . . , d¯}, the partial derivatives of ˜L|(θ,W ) with respect to bj and cj vanish at (b, c), and combining these equations yields cj = 0 for every j, that is, c = 0.

From the proof of Lemma 1, we can see that the regularization of the modified objective function ˜L is necessary.
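To see concretely how the stationarity conditions force c = 0, the following is a small symbolic check on a one-sample, one-dimensional instance (n = 1, d = d¯ = 1); the squared loss and the regularizer (lam/2)·c² are assumed only for illustration and mirror the structure, not the exact form, of the argument in [17].

import sympy as sp

# Toy instance of the stationarity computation behind Lemma 1 (one sample,
# scalar output); p stands for f(x; theta), which is held fixed here.
b, c, p, y, w, x = sp.symbols('b c p y w x', real=True)
lam = sp.symbols('lam', positive=True)

L_tilde = (p + c * sp.exp(w * x + b) - y) ** 2 + lam / 2 * c ** 2
dLdb = sp.diff(L_tilde, b)        # stationarity condition in b
dLdc = sp.diff(L_tilde, c)        # stationarity condition in c

# Combining the two equations eliminates the loss term entirely:
print(sp.simplify(c * dLdc - dLdb))   # -> c**2*lam, so dLdb = dLdc = 0 forces c = 0

Without the regularization term, the same combination vanishes identically and yields no information, which is consistent with the remark above that the regularization of ˜L is necessary.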

Lemma 2. Let ℓyi : Rd¯ → R be differentiable for all i ∈ {1, . . . , n}. Then, for any θ, if

Proof. We know that for a multivariate real-valued function f on Rn, it is differentiable at x if and only if there exist a vector ∇f(x) and a function ρ(x; ·) defined on D = {u ∈ Rn : 0 < ∥u − 0∥ < δ} such that

f(x + u) = f(x) + ∇f(x)u + ∥u∥ ρ(x; u)   for all u ∈ D,   with lim_{u→0} ρ(x; u) = 0.

We will also need the following claim, which is proved below.

Claim 1. g(xi; b, ∆c, W + ∆W ) is arbitrarily small for sufficiently small ∆c and ∆W .

Proof. The function f2 is continuous at (0, 0), which can be deduced from a similar idea as above. Third, we need to show that if f2 is continuous at some vector x0 and h is continuous at f2(x0), then h ◦ f2 is continuous at x0. Given ϵ > 0, there exists δ1 > 0 such that |h(u) − h(f2(x0))| < ϵ for all u ∈ Domain(h) with |u − f2(x0)| < δ1, since h is continuous at f2(x0). Moreover, for ϵ = δ1, there exists δ > 0 such that |f2(x) − f2(x0)| < δ1 for all x ∈ Domain(f2) with ∥x − x0∥ < δ, since f2 is continuous at x0. Hence, ∥x − x0∥ < δ implies |h(f2(x)) − h(f2(x0))| < ϵ, and we have finished the proof of this step. Finally, let h(t) = exp(t). Since f2 is continuous at (0, 0) and h is continuous at f2(0, 0), h ◦ f2 is continuous at (0, 0). Also, f1 is continuous at (0, 0), so f1 · (h ◦ f2) is continuous at (0, 0), and the components of g(xi; b, ∆c, W + ∆W ) tend to 0 simultaneously as (∆c, ∆W ) → (0, 0), which proves Claim 1.

Due to this, we obtain the desired expansion from the Maclaurin series of the natural exponential function,

exp(y) = Σ_{l=0}^{∞} y^l / l!,   ∀ y ∈ R,

which we prove below.

Proof. From the equality displayed above, the second equality follows by letting m → ∞. Here, the second equality holds because the existence of lim_{t→∞} Σ_{l=0}^{t} (ϵ̄j^l / l!) zl implies the existence of lim_{t→∞} Σ_{l=1}^{t} (ϵ̄j^l / l!) zl.

Proof. We use the fact that if Σ_{l=1}^{∞} fl converges uniformly on a set S ⊆ R and each fl is continuous at 0, a limit point of S, then lim_{x→0} Σ_{l=1}^{∞} fl(x) = Σ_{l=1}^{∞} fl(0). Hence, if Σ_{l} (ϵ̄j^l / l!) zl converges uniformly on a neighborhood of 0, then its limit as ϵ̄j → 0 can be computed term by term.

Initially, fix R > 0; we show that Σ_{l=0}^{∞} R^l / l! converges by the ratio test:

lim_{l→∞} ( R^{l+1}/(l+1)! ) / ( R^l/l! ) = lim_{l→∞} R/(l+1) = 0 < 1.

Since |x^l / l!| ≤ R^l / l! for all x ∈ [−R, R], the series Σ_{l=0}^{∞} x^l / l! converges uniformly on [−R, R] by the Weierstrass M-test. Consequently, we deduce the statement Σ_{l=0}^{∞} x^l / l! = exp(x) uniformly on [−R, R] by uniqueness of the limit.

From this, we have

Σ_{l=0}^{∞} (ϵ̄j uj xi)^l / l! = exp(ϵ̄j uj xi).
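The uniform convergence on [−R, R] can also be observed numerically; the following quick check (with R and the truncation order chosen arbitrarily) compares a partial sum of the Maclaurin series with exp in the sup norm, assuming NumPy.

import numpy as np
from math import factorial

# Sup-norm distance between a partial sum of the Maclaurin series and exp on [-R, R].
R, order = 3.0, 30
ys = np.linspace(-R, R, 1001)
partial = sum(ys ** l / factorial(l) for l in range(order + 1))
print(np.max(np.abs(partial - np.exp(ys))))   # tiny, reflecting uniform convergence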

Secondly, we show that the series obtained after summing over i ∈ {1, . . . , n} converges uniformly on I. It suffices to show that the operations of summation and scalar multiplication preserve uniform convergence; that is, if both Σ_{l=0}^{∞} fl = f and Σ_{l=0}^{∞} gl = g uniformly on a set S, then Σ_{l=0}^{∞} (fl + gl) = f + g and Σ_{l=0}^{∞} αfl = αf uniformly on S for every α ∈ R. Hence, we complete the proof and get the fact that Σ_{l=0}^{∞} (ϵ̄j^l / l!) zl = 0 uniformly on I, by uniqueness of the limit.

Last, we show that Σ_{l=1}^{∞} (ϵ̄j^l / l!) zl converges uniformly on I. Since this series differs from Σ_{l=0}^{∞} (ϵ̄j^l / l!) zl only by the single term z0, which does not affect uniform convergence, we conclude that Σ_{l=1}^{∞} (ϵ̄j^l / l!) zl converges uniformly on I and complete the proof of claim 3.

By the assumption of induction, we can neglect the term Σ_{l=0}^{k−1} (ϵ̄j^l / l!) zl on the left-hand and right-hand side of the equation, and we can divide both sides by ϵ̄j^k (nonzero) to get

zk / k! + Σ_{l=k+1}^{∞} (ϵ̄j^{l−k} / l!) zl = 0.

The series Σ_{l=k+1}^{∞} (ϵ̄j^{l−k} / l!) zl converges uniformly on I, and thus its limit as ϵ̄j → 0 can be taken term by term and equals 0. Therefore, we can conclude that zk = 0, which completes the proof of claim 4. From the above analysis, we finish the proof of Lemma 2.

4.2 Theorems

The following two theorems establish theoretically the feasibility of eliminating the suboptimal local minima of the original objective function. The former holds for arbitrary datasets, and the latter holds for realizable datasets. In this section, apart from presenting the proofs of the theorems in [17], we introduce some related notation and theorems, propose a claim, and discuss the case k = 0 to give more details of the proof of Theorem 1, and we also correct a mistake and prove statements (i) and (ii) more clearly in Theorem 2.


Theorem 1. For any i ∈ {1, . . . , n}, the function ℓyi : Rd¯ → R is differentiable and convex. If ˜L has a local minimum at (θ, b, c, W ), then we have

(i) L has a global minimum at θ.

(ii) ˜f (x; θ, b, c, W ) = f (x; θ) for all x ∈ Rd and ˜L(θ, b, c, W ) = L(θ).

Now, before presenting the proof of Theorem 1, we first introduce some additional notations as follows.

A k-th order tensor A in a d-dimensional space is a mathematical object that has k indices, each ranging from 1 to d, and is denoted by

A = (a_{i1,...,ik})_{1≤im≤d, m∈{1,...,k}}.

For instance, a scalar is a 0-th order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor.

A tensor

A = (a_{i1,...,ik})_{1≤im≤d, m∈{1,...,k}}

is called symmetric if the element a_{i1,...,ik} is invariant under any permutation of its indices.

A tensor product ⊗ is an operation between tensors which obeys the following rule. For a p-th order n-dimensional tensor A and a q-th order m-dimensional tensor B, the tensor product A ⊗ B of A and B is the (p + q)-th order tensor

C = (c_{i1,...,ip+q})_{1≤il≤n, 1≤ir≤m, l∈{1,...,p}, r∈{p+1,...,p+q}},   where c_{i1,...,ip+q} = a_{i1,...,ip} b_{ip+1,...,ip+q}.

Moreover, x⊗k = x ⊗ · · · ⊗ x, where x appears k times.

For a k-th order tensor A ∈ Rd×···×d and k vectors u(1), u(2), . . . , u(k) ∈ Rd, define the operation

A(u(1), u(2), . . . , u(k)) = Σ_{1≤i1,...,ik≤d} A_{i1,...,ik} u(1)_{i1} · · · u(k)_{ik}.
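The following NumPy sketch illustrates these definitions numerically on a random example: it builds x⊗k by repeated outer products and contracts every index with the same vector u, checking the identity x⊗k(u, . . . , u) = (u x)^k used in the claim below; the dimension d, the order k, and the random seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
x, u = rng.normal(size=d), rng.normal(size=d)

A = x.copy()
for _ in range(k - 1):                 # x^{⊗k}: k-fold tensor product of x with itself
    A = np.multiply.outer(A, x)

form = A
for _ in range(k):                     # A(u, ..., u): contract each index with u
    form = np.tensordot(form, u, axes=([0], [0]))

print(float(form), (u @ x) ** k)       # the two numbers agree up to rounding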

Proof.

Fix θ; then ˜L|θ has a local minimum at (b, c, W ). Here, we define x⊗0i = 1 and let si,j be the coefficient defined in the corresponding proof in [17].

Here, the first equality follows from Theorem 2.1 in [22], which is stated as follows. Suppose that A ∈ Sym_m(Rn) (an m-th order n-dimensional symmetric tensor). Then the optimization problem

(1)   max |A(x(1), . . . , x(m))|   subject to x(1), . . . , x(m) ∈ Rn, ∥x(i)∥2 = 1,

always has a global optimal solution satisfying x(1) = · · · = x(m). This implies that the value of (1) is the same as the value of

max |A x^m|   subject to x ∈ Rn, ∥x∥2 = 1.

Hence, if we can prove that x⊗k is symmetric, then the first equality holds. Since (x⊗k)_{i1,...,ik} = xi1 · · · xik, the tensor x⊗k is symmetric, and we have the second line. The third line follows from the claim below.

Claim.

Proof. To prove the first equality, it suffices to show that x⊗k(u, . . . , u) = (u x)^k for every vector x ∈ Rd and every k ≥ 1. We prove it by induction on k. For the case k = 1, x⊗1(u) = Σ_{1≤i1≤d} xi1 ui1 = u x. Assuming the statement holds for k, we have

x⊗(k+1)(u, . . . , u) = Σ_{1≤i1,...,ik+1≤d} xi1 · · · xik+1 ui1 · · · uik+1 = (Σ_{1≤i≤d} xi ui) · x⊗k(u, . . . , u) = (u x)(u x)^k = (u x)^{k+1},

which completes the proof of the claim. Note that for the case k = 0, the identity holds trivially since x⊗0i = 1 by definition.

Next, the second equality follows directly from Lemma 2, which applies here because ˜L|θ has a local minimum at (b, c, W ). The resulting identity is obvious if k = 0; now we prove that it is true for k ∈ N by contradiction: if the left-hand side is nonzero, then the corresponding tensor is nonzero as well, which leads to a contradiction.

Now, define {I1, . . . , In} as a partition of {1, . . . , n} such that

where the second line follows from the differentiability and convexity of ℓyi. We know that a differentiable multivariate function is convex if and only if its graph lies above all of its tangent hyperplanes, that is,

f (x) ≥ f (y) + ∇f (y)(x − y)   for all x and y in its domain.
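The tangent inequality can be checked numerically on a simple convex example; the following sketch uses f(x) = ∥x∥², whose gradient is ∇f(y) = 2y (an assumed example, not tied to the losses in the thesis), assuming NumPy.

import numpy as np

# First-order convexity condition f(x) >= f(y) + ∇f(y)(x - y) for f(x) = ||x||^2.
rng = np.random.default_rng(3)
f = lambda v: float(v @ v)
grad = lambda v: 2 * v
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert f(x) >= f(y) + grad(y) @ (x - y) - 1e-12   # holds: the gap is ||x - y||^2
print("first-order convexity condition verified on random samples")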

The remaining step follows from multivariate polynomial interpolation in [23]. Before illustrating the polynomial interpolation, let us introduce some notation first.

Let Π^d_n be the space of all d-variate polynomials of degree at most n with real coefficients. Every p ∈ Π^d_n can be written as

p(x) = Σ_{|α|≤n} cα x^α,

where x ∈ Rd and the sum ranges over multi-indices α with |α| ≤ n. Hence, we can find a unique d-variate polynomial of degree at most n − 1 passing through (x̄1, q1,j), . . . , (x̄n, qn,j) by the polynomial interpolation method. More precisely, with k sufficiently large (k = n − 1 is sufficient), for all j ∈ {1, . . . , d¯}, there exists a polynomial pj ∈ Π^d_{n−1} such that pj(x̄i) = qi,j for all i ∈ {1, . . . , n}.
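To make the interpolation step concrete, here is a small univariate sketch (d = 1) assuming NumPy: through n points with distinct nodes there is a unique polynomial of degree at most n − 1, obtained below by solving the Vandermonde system.

import numpy as np

rng = np.random.default_rng(2)
n = 5
x_bar = np.sort(rng.normal(size=n))          # distinct interpolation nodes x̄_1, ..., x̄_n
q = rng.normal(size=n)                        # prescribed values q_{1,j}, ..., q_{n,j} for one fixed j

V = np.vander(x_bar, N=n, increasing=True)    # Vandermonde matrix [x̄_i^k], k = 0, ..., n-1
coef = np.linalg.solve(V, q)                  # coefficients of the interpolating polynomial

p = np.polynomial.Polynomial(coef)
print(np.max(np.abs(p(x_bar) - q)))           # ~ 0: p passes through every (x̄_i, q_{i,j})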

Theorem 1 shows that given an original neural network f and its objective function L, we can construct a modified neural network ˜f with its objective function ˜L so as to obtain the following result. If ˜L attains a local minimum at (θ, b, c, W ) (for example, a local minimum found by a gradient descent algorithm), then the corresponding parameter θ makes L attain a global minimum. Moreover, the value ˜L(θ, b, c, W ) is equal to L(θ), which means that in this case the error we obtain after training the network ˜f is exactly the minimum error we can get from the original neural network.

In addition, since the assumptions of Theorem 1 are mild, the conclusion of the theorem applies to an arbitrary dataset.
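The following toy experiment is an illustrative sketch (not the experiments of the thesis or of [17]): it runs plain gradient descent, with numerical gradients, on ˜L for a small linear model with squared loss and a quadratic regularizer on c, all of which are assumed choices. If the iterates approach a local minimum, Lemma 1 and Theorem 1 predict that c is driven towards 0 and that θ approaches a global minimizer of L.

import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true                              # a realizable toy dataset

def L_tilde(params, lam=1.0):
    theta, w = params[:d], params[d:2 * d]
    b, c = params[2 * d], params[2 * d + 1]
    pred = X @ theta + c * np.exp(X @ w + b)    # f(x; θ) + c·exp(w x + b), with f linear
    return np.sum((pred - y) ** 2) + 0.5 * lam * c ** 2

def num_grad(fun, p, eps=1e-6):
    g = np.zeros_like(p)
    for i in range(p.size):                     # central finite differences
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (fun(p + e) - fun(p - e)) / (2 * eps)
    return g

params = 0.1 * rng.normal(size=2 * d + 2)
for _ in range(5000):
    params -= 1e-3 * num_grad(L_tilde, params)

print("c ~", params[-1])                        # close to 0, as Lemma 1 suggests
print("theta error:", np.linalg.norm(params[:d] - theta_true))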

Theorem 2. For any i ∈ {1, . . . , n}, the function ℓyi : Rd¯ → R is differentiable and, for any q ∈ Rd¯ with ∇ℓyi(q) = 0, ℓyi has a global minimum at q. Suppose in addition that the dataset is realizable, that is, there exists a function f∗ such that f∗(xi) = yi for all i ∈ {1, . . . , n}. If ˜L has a local minimum at (θ, b, c, W ), then we have

(i) L has a global minimum at θ.

(ii) ˜f (x; θ, b, c, W ) = f (x; θ) for all x ∈ Rd and ˜L(θ, b, c, W ) = L(θ).

(iii) ℓyi has a global minimum at f (xi; θ) for all i ∈ {1, . . . , n}.

Here, the second line follows from the assumption that there exists a function f∗ such that f∗(xi) = yi for all i ∈ {1, . . . , n}, together with the preceding notation. The third line follows from the explanation given below.

As for the last line, it follows from the equation in the proof of Theorem 1, which still holds here because it is obtained under the assumptions that ℓyi is differentiable for all i ∈ {1, . . . , n} and ˜L|θ has a local minimum at (b, c, W ). This implies that ∇ℓyi(f (xi; θ)) = 0 for all i ∈ {1, . . . , n}. Hence, statement (iii) holds from the assumption.

Now, we prove statement (i) as follows. Since ℓyi has a global minimum at f (xi; θ) for all i ∈ {1, . . . , n}, we have ℓyi(f (xi; θ′)) ≥ ℓyi(f (xi; θ)) for any θ′ and all i ∈ {1, . . . , n}. This implies L(θ′) ≥ L(θ); namely, L has a global minimum at θ, which shows (i). Finally, the proof of statement (ii) is the same as the one in Theorem 1.

Theorem 2 shows that if the dataset we use is realizable (that is, consistent with some labeling function), then we obtain an additional result beyond those of Theorem 1: for each training input, the loss between its predicted output and the correct output attains the minimum possible value of ℓyi.
