
Department of Mathematical Sciences, National Chengchi University
Master's Thesis

消除深度學習目標函數中局部極小值之研究
A Survey on Eliminating Local Minima of the Objective Function in Deep Learning

Author: 季佳琪
Advisor: Dr. 蔡炎龍
July 2019
DOI: 10.6814/NCCU201900936

謝辭 (Acknowledgements)

Three years have passed since I came to Applied Mathematics at NCCU. It is not a short time: looking back on the stumbles along the way and the thorns cleared, I genuinely feel these years were not lived in vain. Yet it is not a long time either: only now, on the verge of graduation, do I realize that the person who kept clamoring to graduate is still not quite ready for goodbyes, and I think it is entirely because of the wonderful people I met here.

First, I especially thank my advisor, Professor 蔡炎龍. From choosing the topic and proving the theory to the direction of the programming, this thesis relied on his patient help and tolerance. He not only let me work on the theory I am better at and gave me plenty of room, but also suggested that I write some programs to support the theory, as a stepping stone toward finding a job. As the saying goes, "behind every successful thesis there is a Professor 炎龍" — that is roughly how it feels. In my eyes he is someone who teaches without expecting anything in return and loves helping students, a model of what a teacher should be; in him one truly sees that enthusiasm for teaching is more than just a job, and I hope that one day I can hold such an attitude toward something myself.

Next, I especially thank my second advisor, Professor 陳天進 ("the Boss"), from whom most of my mathematical foundation comes. Before entering NCCU I could barely do calculus; what greeted me after entering was his several-times-harder course on real analysis. I still remember staring blankly at the chalk flying across the blackboard in the first lecture, thinking, "so this is what mathematics tastes like." The only thing I can be proud of is that my earlier training in essay questions meant taking notes never felt laborious. Even now my view of the mathematical world is still a glimpse through a tube, but at least I am able to appreciate its beauty. He, too, is a great teacher in my eyes: newcomers hold him in awe, then marvel at the mathematics in his head — "the more I look up, the higher it is; the more I drill, the harder it gets" — and finally are moved by how he takes care of his students. Thank you, Boss!

Next, I thank the classmates I met in the master's program, especially those of my year, for the laughter and help they brought to my graduate life. Over these three years I cannot say whether the happy days or the sad ones were more numerous, but the happy days were surely the days spent with you.

During this time I may have caused a lot of trouble and made many mistakes; thank you for treating me kindly and letting me become a better person.

Special thanks to 阿孝, for understanding 70% of me and still putting up with me, and for listening when I was at my lowest; in the future, when writing your thesis becomes unbearable, I, who understand 70% of 阿孝, should be qualified to stay by your side as your trash can. Thanks to 黃賴: although our meeting was a mistake — you a literary talent delayed by mathematics, I a dreamer delayed by mathematics — meeting you was nothing short of beautiful. Thanks also to 項涵 for pointing out many future directions and for the laughter; I keep it all in mind. Good luck with your thesis!

I am also very grateful to my labmates 大澤佑, 許嘉宏, and 瑄正 for their help with the thesis and the TA course, and I especially thank 守朋 for taking over the TA position for me; a thousand words would not suffice to express my apology and gratitude, and I believe you will one day become the teacher you aspire to be — keep going! Of course I have not forgotten the few women of the lab: thanks to 庭恩 for her advice and example in life and coursework; I was a little afraid of you at first, but you turned out to be truly good to your friends — let us realize those loving dreams together. Finally, thanks to 芳誼: the days of riding the Green 1 and Brown 18 buses together are unforgettable, and being each other's support was a stroke of luck; thank you for accepting my ramblings during the final meltdown of my thesis, for your company all along, and for the tolerance we showed each other.

Last of all, I thank my good friend 思予, for accompanying me through countless joys and sorrows over the years, for chatting with me about everything when thesis anxiety struck, and for never giving up on me despite knowing how silly, willful, and difficult I am; thank you for the effort we both put into this friendship — you will always be the example I chase. Finally, I thank my family: for your unreserved support when I decided to sit the applied mathematics entrance exam, for caring about my progress while I wrote this thesis but caring about my health even more, and for letting go so I could fly while making sure I knew home would always be my harbor. I know I am not the best daughter or sister, but luckily I have the best of you in the world.

中文摘要 (Chinese Abstract)

In this thesis, we mainly study the method, and the accompanying theorems, for eliminating the non-optimal local minima of the objective function. More specifically, we find that, given an original neural network, we can construct a modified neural network by adding external layers to it. Under this construction, if the objective function of the modified network attains a local minimum, then the objective function of the original network attains a global minimum. In what follows, we first review related previous literature, give an overview of deep learning, and prove the convexity of common loss functions so as to satisfy the assumptions of the theorems. Next, we prove some details of the main theorems, discuss the effect of the method, and study its limitations. Finally, we run a series of experiments showing that the method can be used for practical work.

Keywords: deep learning, neural networks, objective function, loss function, local minima

Abstract

In this paper, we mainly survey the method and theorems for eliminating suboptimal local minima of the objective function. More specifically, we find that, given an original neural network, we can construct a modified network by adding external layers to it. Then, if the objective function of the modified network achieves a local minimum, the objective function of the original neural network reaches a global minimum. We first review some previous related literature, give an overview of deep learning, and then prove the convexity of common loss functions to satisfy the assumptions of the theorems. Next, we prove some details in these theorems, discuss the effects of the method, and investigate its limitations. Finally, we perform a series of experiments to show that the method can be used for practical work.

Keywords: Deep Learning; Neural Network; Objective Function; Loss Function; Local Minima.

Contents

謝辭 (Acknowledgements)
中文摘要 (Chinese Abstract)
Abstract
Contents
List of Tables
List of Figures

1 Introduction

2 Deep Learning
  2.1 Definition of Deep Learning
  2.2 Standard Neural Network
    2.2.1 The Structure of The Neural Network
    2.2.2 The Operation of The Neural Network
    2.2.3 Activation Function
  2.3 Optimization for Training Deep Network
  2.4 Convolutional Neural Network
    2.4.1 Definition of Convolution
    2.4.2 The Structure of The Convolutional Neural Network
  2.5 Recurrent Neural Network
    2.5.1 The Structure of The Recurrent Neural Network
    2.5.2 The Operation of The Recurrent Neural Network

3 Model Description
  3.1 The Architecture
  3.2 Loss and Objective Functions
    3.2.1 Construction of Objective Functions
    3.2.2 Convexity of Loss Functions

4 Main Theorems
  4.1 Lemmas
  4.2 Theorems

5 Effects of Eliminating Local Minima
  5.1 Effects of All Situations
  5.2 Examples

6 Challenges of Eliminating Local Minima
  6.1 Theorem
  6.2 Example

7 Experiments
  7.1 Standard Neural Network Results
  7.2 Convolutional Neural Network Results

8 Conclusion

Appendix A Code of The Models
  A.1 The NN Model
  A.2 The mNN Model
  A.3 The CNN Model
  A.4 The nCNN Model
  A.5 The mnCNN Model

Bibliography

List of Tables

2.1 Iterative method of gradient descent algorithm
7.1 Accuracy on the training set: NN vs. mNN
7.2 Accuracy on the test set: NN vs. mNN
7.3 Accuracy on the training set among three versions of CNN
7.4 Accuracy on the test set among three versions of CNN

List of Figures

2.1 The structure of the standard neural network
2.2 Behavior of a neuron in the standard neural network
2.3 Three common activation functions in deep learning
2.4 A neural network used to explain the backpropagation algorithm
2.5 The backward pass of backpropagation
2.6 Geometric understanding of convolution on f and g
2.7 The schematic diagram of the structure of a convolutional neural network
2.8 The operation in the convolutional layer
2.9 The operation in the max pooling layer
2.10 The operation between the max pooling layer and the fully connected layer
2.11 The schematic diagram of the structure of a recurrent neural network
2.12 Behavior of a neuron in a recurrent neural network
3.1 The structure of the modified neural network f̃

Chapter 1

Introduction

Deep learning has achieved significant success in the fields of computer vision, speech and audio processing, natural language processing, and artificial intelligence in recent years. However, our understanding of deep neural networks depends mostly on their empirical success, whereas our theoretical understanding of them is still insufficient. In fact, one of the major difficulties in theoretically explaining the results of deep neural networks arises from the non-convexity and high dimensionality of the objective functions used to train the networks. As far as we know, finding a global minimum of a non-convex function is an NP-hard problem [24]. Therefore, it is problematic for non-convex and high-dimensional objective functions to attain a global minimum [2], since it cannot be guaranteed that they do not get stuck in a bad local minimum in practical applications after training. Whether networks have many local minima, and when optimization algorithms are faced with them, remain unclear. For many years, most scientists have believed that this is a major problem plaguing the training process of neural networks, and they have devoted themselves to studying it.

There have been a great number of studies that analyze the surface and local minima of objective functions. Several studies have provided useful results on the quality of local minima under the conditions of specific types of deep neural networks, important simplifications [5, 12, 14], and strong over-parameterization [25, 26]. Moreover, there have been many more positive results based on shallow networks and strong assumptions, including simplification, over-parameterization, and Gaussian inputs [1, 3, 6, 8, 9, 22, 29, 32, 33, 35]. So far, there have been two significant results in related research. The first is that all local minima of certain deep neural networks are no worse than the global minima

of the corresponding machine learning models [15, 16, 30], and that further improvements can be guaranteed by using residual representations [15] along with increasing the depth and width of the networks, even under weaker assumptions [16]. The second is that all suboptimal local minima (i.e., all local minima that are not global minima) can be eliminated by adding one neuron per output unit, for a binary classification model with smoothed hinge loss functions [23]. However, that result cannot be used in many practical tasks because of the limitations of its assumptions. Therefore, [17] generalizes the assumptions and obtains the same result: without any unrealistic assumption, adding one neuron per output unit can eliminate all suboptimal local minima for multi-class classification, binary classification, and regression with an arbitrary loss function. Compared with the result in [23], it has been proven to be applicable to common deep learning tasks. To the best of our knowledge, this is a major breakthrough in related work.

In this paper, we survey the main results of [17], which provide a way of eliminating suboptimal local minima of the objective function. Moreover, we prove those statements that are confusing or not mentioned in [17], and we discuss the limitations of the method. Finally, we perform several experiments to test whether the method can be used for practical work.

Chapter 2

Deep Learning

In this paper, we present an optimization method to improve the performance of deep learning algorithms. Hence, this chapter gives a brief overview of the so-called deep learning algorithm.

2.1 Definition of Deep Learning

In order to clarify what deep learning is, we start with some related terminologies. Artificial intelligence (AI) is a field that studies how to make computers able to do the same things as humans. Machine learning is the science of letting machines learn from datasets and experiences to acquire skills without human instruction, which is regarded as a way of making the AI dream come true. As for deep learning [10, 20], it is a class of machine learning algorithms that uses deep artificial neural networks (inspired by the biological neural networks of human brains) to learn from large amounts of data and understand it so as to make predictions; here, the word "deep" refers to the multiple hidden layers (defined later) of the neural networks. The core of deep learning is the technique of neural networks. Ideally, a deep learning algorithm can use neural networks to deal with any problem that can be transformed into a function. Notably, among these neural networks there are three major types, namely the standard neural network, the convolutional neural network, and the recurrent neural network. Next, we introduce them in this order.

2.2 Standard Neural Network

2.2.1 The Structure of The Neural Network

More precisely, the standard neural network discussed here refers to the feedforward fully connected neural network [28]. The structure of such a neural network is shown in Figure 2.1.

Figure 2.1: The structure of the standard neural network.

Let us take a closer look at it. The nodes in the network are called neurons. The leftmost layer is called the input layer, because the data enters the network through the neurons within it. The rightmost layer is called the output layer, because the neurons within it are responsible for sending out information. The middle layers are called the hidden layers, since they are neither the input layer nor the output layer. Moreover, as we can see, each neuron in every layer is connected to every neuron in the previous layer, which is the meaning of the term "fully connected". On the other hand, the term "feedforward" refers to the fact that in the network the data moves in only one direction — forward — from the input neurons, through the hidden neurons, and to the output neurons. As for the variables in Figure 2.1, x_1, x_2, . . . , x_n are the current inputs (i.e., the components of some data x) and ŷ_1, ŷ_2, . . . , ŷ_m are the components of the corresponding

predicted output.

2.2.2 The Operation of The Neural Network

We explain how the neural network works by illustrating how each neuron acts. In fact, every neuron has the same behavior; Figure 2.2 illustrates the general situation with the behavior of a neuron in the first hidden layer.

Figure 2.2: Behavior of a neuron in the standard neural network.

The operation of a neuron is divided into two parts. First, several inputs x_1, x_2, . . . , x_n are transmitted to the neuron, and each input corresponds to a weight (w_1 for x_1, w_2 for x_2, . . . , w_n for x_n) that indicates the importance of that input for the neuron. The weighted sum ∑_{i=1}^{n} w_i x_i of the inputs and their weights then enters the neuron and is added to its bias b. Second, the activation function φ (discussed later) acts on the value we just obtained, giving the output h of the neuron. That is,

h = φ(∑_{i=1}^{n} w_i x_i + b).

Usually, we adopt the vector representation h = φ(w^⊤ x + b) for convenience, where w = (w_1, . . . , w_n)^⊤.
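To make the two-step operation concrete, the following sketch (illustrative, not from the thesis appendix) computes a neuron's output in NumPy, using the sigmoid as the activation function φ:

```python
import numpy as np

def sigmoid(z):
    # Activation function: phi(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Step 1: weighted sum of inputs plus bias, z = w^T x + b
    z = np.dot(w, x) + b
    # Step 2: apply the activation function, h = phi(z)
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.2])   # weights w1, w2, w3
b = 0.3                          # bias
print(neuron_output(x, w, b))
```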

Hence, the operation of the network is as follows. The input layer receives the data and transmits it to the first hidden layer. All of the hidden layers then have the same behavior: receive the signals from the layer above, and, after each neuron within performs its work, pass the output to the layer below, providing a feedforward path to the output layer. Finally, the output layer sends information directly to the outside world.

2.2.3 Activation Function

In this section, we first introduce the role of the activation function in a neural network and three common types of it. We then focus on the reasons why the ReLU function is the most common activation function.

The main purpose of using an activation function is to introduce nonlinearity, which makes for more accurate predictions. If no activation function is used in the neural network, the predicted output keeps a linear relationship with the input, causing the deep neural network to lose its meaning. There are three common activation functions — the sigmoid function, the hyperbolic tangent, and the rectified linear unit (ReLU) — shown in Figure 2.3. Scientists have preferred ReLU as the activation function of neural networks since it has the following advantages.

First, it reduces the probability of the vanishing gradient problem occurring. For deep neural networks trained by the gradient descent and backpropagation algorithms (discussed later), using the sigmoid function or the hyperbolic tangent as the activation function may cause the vanishing gradient problem. This is because gradient descent optimizes the neural network by the gradient, and the derivatives of those two functions approach zero on most of their domain. This phenomenon means that some parameters of the neural network cannot be updated. On the other hand, because the derivative of ReLU is nonzero on R⁺, using it as the activation function can effectively overcome the vanishing gradient problem.

Second, it enhances the sparsity of neural networks. ReLU can let the output of some neurons be zero, which makes the neural network more sparse and alleviates the problem of overfitting.

Third, it imitates biological neural networks. In neurophysiology, when a stimulus does not reach a certain intensity, the neuron has no response to it; only when the intensity of the stimulus exceeds some threshold is there a nerve impulse. The definition of ReLU captures this characteristic of biological neurons.

Last, it requires less computation. Compared to the other two functions, running ReLU only needs to determine whether the input is greater than zero, and it does not require the exponential operation.
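As a quick illustration of these three functions, and of the saturation that causes vanishing gradients, here is a small sketch (illustrative values, not part of the thesis appendix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

# Derivatives: sigmoid and tanh saturate (derivative -> 0) for large |z|,
# while ReLU keeps a derivative of 1 on the whole positive axis.
z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid(z) * (1 - sigmoid(z)))  # sigmoid'(z): ~0 at z = +/-10
print(1 - tanh(z) ** 2)               # tanh'(z):    ~0 at z = +/-10
print((z > 0).astype(float))          # relu'(z):    1 for all z > 0
```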

Figure 2.3: Three common activation functions in deep learning — (a) sigmoid function, (b) hyperbolic tangent, (c) ReLU.

2.3 Optimization for Training Deep Network

As we said, deep learning uses deep neural networks to find a nice function for prediction. Hence, this section focuses on how the neural network finds such an ideal function. In fact, all credit goes to the gradient descent algorithm [27]: it provides an effective way for networks to learn from the given dataset. Next, we begin by introducing the algorithm, and then use a simple example to illustrate more details.

When the structure of a neural network — including the number of layers, the number of neurons in each layer, and the activation function — is determined, the only network variables left are the weights and biases, which we call the model parameters θ; namely, θ is the vector consisting of all weights w and biases b of the neural network. Moreover, since a well-defined neural network represents a function, the neural network hence becomes a function of θ. Now, let {f_θ} be the space of neural networks determined by all possible parameters, and let f* be the optimal function. Our goal is to find a parameter θ* such that f_θ* is closest to f*; in other words, such that f_θ*(x) is closest to y = f*(x) for each training sample (x, y). Therefore, we construct an objective function to measure the difference between f_θ(x) and y for any model parameter θ.

Assume that there is a training dataset {(x_i, y_i)}_{i=1}^{k}, and let

L(θ) = (1/k) ∑_{i=1}^{k} ℓ(f_θ(x_i), y_i)

be an objective function, where ℓ is a loss function used to measure the error. Hence, ℓ(f_θ(x_i), y_i) represents the error between the predicted output and the target for input x_i. Furthermore, L(θ) signifies the average error between f_θ and f* on the entire dataset. This means that finding a parameter θ* such that L(θ*) is a global minimum achieves our goal. Therefore, scientists came up with the gradient descent algorithm to help minimize the objective function, which is described below. (Note that sometimes we add a regularization term to the objective function to avoid the overfitting problem.)

Algorithm: Gradient descent
  Require: learning rate η and initial parameter θ
  while repeating until convergence, over the k examples of the training dataset {(x_i, y_i)}_{i=1}^{k}:
    Compute: the gradient estimate ∇_θ L and the velocity update η∇_θ L.
    Apply update: θ := θ − η∇_θ L.
  end while

Table 2.1: Iterative method of gradient descent algorithm.

Gradient descent is probably the most used optimization algorithm for deep learning. Let us explain the update formula

θ := θ − η∇_θ L

in the table above. First, −∇_θ L determines the direction of updating the parameter. Because the gradient of a real-valued function at a point θ is the direction of greatest increase of the function at θ, the negative gradient is the direction of greatest decrease of the function, which helps minimize the objective function. Second, the learning rate η determines the magnitude of updating the parameter. If it is too large, gradient descent may overshoot the minimum; if it is too small, the rate of convergence of gradient descent can be very slow. Hence, the learning rate is usually chosen by trial and error. Finally, −η∇_θ L tells the computer how to update the parameter, which constructs the update formula.

Note that we use the backpropagation algorithm [34] to compute the gradient of the objective function with respect to the model parameters. Here, we use a simple neural network (see Figure 2.4) as an example to explain how the backpropagation algorithm works.
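The update rule θ := θ − η∇_θ L is short enough to run directly. The sketch below (illustrative, not the thesis code) minimizes a one-parameter least-squares objective with gradient descent; the gradient is computed in closed form rather than by backpropagation:

```python
import numpy as np

# Toy dataset and model f_theta(x) = theta * x
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.1, 5.9])

def grad_L(theta):
    # Objective L(theta) = (1/k) * sum (theta*x_i - y_i)^2,
    # so dL/dtheta = (2/k) * sum (theta*x_i - y_i) * x_i
    return 2.0 * np.mean((theta * xs - ys) * xs)

theta, eta = 0.0, 0.05        # initial parameter and learning rate
for _ in range(200):          # "repeat until convergence"
    theta = theta - eta * grad_L(theta)   # theta := theta - eta * grad
print(theta)                  # converges near 2.0
```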

To begin with, we need some additional assumptions. Let {(x, y)} be a single-sample dataset, where x = (x_1, x_2) and y = (y_1, y_2), let φ be an activation function, and let ℓ be the squared loss (discussed later). We then observe how the inputs move forward through the network:

net_h1 = w¹_{1,1} x_1 + w¹_{2,1} x_2 + b¹_1,   out_h1 = φ(net_h1),
net_h2 = w¹_{1,2} x_1 + w¹_{2,2} x_2 + b¹_2,   out_h2 = φ(net_h2),
net_o1 = w²_{1,1} out_h1 + w²_{2,1} out_h2 + b²_1,   out_o1 = φ(net_o1),
net_o2 = w²_{1,2} out_h1 + w²_{2,2} out_h2 + b²_2,   out_o2 = φ(net_o2).

Figure 2.4: A neural network used to explain the backpropagation algorithm.

We now have the objective function L(θ) = ½(y_1 − out_o1)² + ½(y_2 − out_o2)². Here, let L_1 = ½(y_1 − out_o1)² and L_2 = ½(y_2 − out_o2)² for future use.

The core of backpropagation is the "backward pass", which makes the computer calculate the gradient (the partial derivatives with respect to all parameters) of L, so as to update the parameters. Let us see how the backward pass works. First, we update the parameters related to the output neurons. Take w²_{1,1} for example; we compute ∂L/∂w²_{1,1} by the chain rule as described below:

∂L/∂w²_{1,1} = (∂L/∂net_o1)(∂net_o1/∂w²_{1,1}) = −(y_1 − out_o1) φ′(net_o1) out_h1.

Second, we update the parameters related to the hidden neurons. Consider w¹_{1,1}:

∂L/∂w¹_{1,1} = (∂L/∂out_h1)(∂out_h1/∂net_h1)(∂net_h1/∂w¹_{1,1})
            = (∂L_1/∂out_h1 + ∂L_2/∂out_h1)(∂out_h1/∂net_h1)(∂net_h1/∂w¹_{1,1})
            = ((∂L_1/∂net_o1)(∂net_o1/∂out_h1) + (∂L_2/∂net_o2)(∂net_o2/∂out_h1))(∂out_h1/∂net_h1)(∂net_h1/∂w¹_{1,1})
            = (−(y_1 − out_o1) φ′(net_o1) w²_{1,1} − (y_2 − out_o2) φ′(net_o2) w²_{1,2}) φ′(net_h1) x_1.
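The two chain-rule formulas above can be checked numerically. Here is a minimal sketch (illustrative weights; sigmoid chosen for φ) that runs the forward pass of the 2-2-2 network of Figure 2.4 and then evaluates both partial derivatives exactly as written in the text:

```python
import numpy as np

def phi(z):  return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
def dphi(z): return phi(z) * (1.0 - phi(z))    # its derivative

# Illustrative parameters: W1[i][j] = w^1_{i,j}, W2[i][j] = w^2_{i,j}
x, y = np.array([0.05, 0.10]), np.array([0.01, 0.99])
W1, b1 = np.array([[0.15, 0.25], [0.20, 0.30]]), np.array([0.35, 0.35])
W2, b2 = np.array([[0.40, 0.50], [0.45, 0.55]]), np.array([0.60, 0.60])

# Forward pass: net_h = W1^T x + b1, net_o = W2^T out_h + b2
net_h = W1.T @ x + b1;      out_h = phi(net_h)
net_o = W2.T @ out_h + b2;  out_o = phi(net_o)

# Backward pass: the two formulas derived in the text
dL_dw2_11 = -(y[0] - out_o[0]) * dphi(net_o[0]) * out_h[0]
dL_dw1_11 = (-(y[0] - out_o[0]) * dphi(net_o[0]) * W2[0, 0]
             - (y[1] - out_o[1]) * dphi(net_o[1]) * W2[0, 1]) * dphi(net_h[0]) * x[0]
print(dL_dw2_11, dL_dw1_11)
```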

The schematic diagram of the backward pass is shown in Figure 2.5. In conclusion, we use both the gradient descent and backpropagation algorithms to update each of the parameters in the network so that the predicted output becomes closer to the target, thereby minimizing the objective function of the network as a whole.

Figure 2.5: The backward pass of backpropagation.

2.4 Convolutional Neural Network

This section introduces a classical type of neural network, the convolutional neural network (CNN) [18], which is applied to tasks whose data has a grid-like topology, including sequential data and image data — a 1-dimensional grid of samples and a 2-dimensional grid of pixels, respectively. In fact, convolutional networks have achieved great success in image recognition, video analysis, etc. We first explain where the word "convolutional" comes from, and then illustrate the structure of the convolutional neural network.

2.4.1 Definition of Convolution

The name "convolutional neural network" originates from a mathematical operation called convolution, stated below.

Let F be the space of all integrable real-valued functions on R. The convolution ∗ : F × F → F is a binary operation defined by

(f ∗ w)(t) = ∫_{−∞}^{∞} f(τ) w(t − τ) dτ,   for any f, w ∈ F.

In other words, f ∗ w is the function defined by the integral of the product of f and a composite function g = w(t − τ), where the latter is obtained from w by translation and reflection. In particular, it has a geometric sense in the following case. Let R_1, R_2 be the regions between f, g and the x-axis, respectively. We observe that if 0 ≤ f, w ≤ 1, then f ∗ w can be regarded as the function of the area of the intersection of R_1 and R_2, which can be seen in Figure 2.6. Note that f is colored blue, g is colored orange, and the convolutional value (f ∗ w)(t) is the y-coordinate of the rightmost point of the red line in that figure.

Figure 2.6: Geometric understanding of convolution on f and g (panels at t = 0, 0.25, 0.5, 1, 1.5, 2).

Hence, we find that the more g looks like f, the larger the convolutional value will be. That is the reason why we use a feature filter to scan an image in a convolutional neural network, which is actually a convolutional operation: the more similar the filter and the image are, the more of that type of feature the image has. Therefore, convolution in a neural network is used to judge whether an image has a certain characteristic. However, the convolutional neural network uses the discrete version of convolution, as follows.

Let f, w be defined on Z, with f having finite support. The discrete convolution of f and w is given

by

(f ∗ w)[n] = ∑_{k=−K}^{K} f[k] w[n − k],

where f and w[n − k] represent the input and some filter of the network, respectively. Notice that if we do not apply normalization to the data and filters, the operation in a convolutional network is not necessarily the same as the one described above; this sometimes makes the convolutional neural network unstable [10].
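The discrete formula can be evaluated directly, and NumPy ships the same operation as np.convolve. A small check (illustrative values):

```python
import numpy as np

def conv1d(f, w):
    # (f*w)[n] = sum_k f[k] * w[n-k], for finitely supported f and w
    n_out = len(f) + len(w) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for k in range(len(f)):
            if 0 <= n - k < len(w):
                out[n] += f[k] * w[n - k]
    return out

f = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.5])
print(conv1d(f, w))          # matches the library routine below
print(np.convolve(f, w))
```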

2.4.2 The Structure of The Convolutional Neural Network

The structure of such a neural network is shown in Figure 2.7. A convolutional neural network consists of an input layer, an output layer, and several hidden layers. Generally speaking, the hidden layers consist of two convolutional layers, two max pooling layers, and two fully connected layers, with normalization layers sometimes added. Their order starts with a convolutional layer, subsequently followed by a max pooling layer; this pattern then repeats, and finally the fully connected layer is applied twice. Next, we introduce how the convolutional and max pooling layers work (the fully connected layer was discussed in Section 2.2).

Figure 2.7: The schematic diagram of the structure of a convolutional neural network.

Convolutional layer. The convolutional layer is designed to extract certain features in the image by convolving the original image with specific filters (one filter corresponds to one feature). Let us look at Figure 2.8(a). The convolutional operation is as follows. First, multiply each component of the filter with each component of the 3 × 3 submatrix of the image in the upper left, then add them together; this becomes the component in the upper left of the feature map. That is,

0×1 + 0×0 + 0×0 + 0×0 + 0×1 + 1×0 + 0×0 + 0×0 + 0×1 = 0.

Then let the filter move one unit to the right and perform the convolutional operation again. Repeat this procedure until every 3 × 3 submatrix in the image has been convolved with the filter, so that we get a complete feature map, as shown in Figure 2.8(b). Here, we denote the convolution operation in deep learning by the notation ⊗.

Figure 2.8: The operation in the convolutional layer.
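The scan just described is a 2-D "valid" convolution over every 3 × 3 window. A compact sketch (illustrative; the 0/1 image and the filter are made up):

```python
import numpy as np

def conv2d_valid(image, filt):
    # Slide the filter over every submatrix of the image, multiply
    # component-wise and sum, producing one feature-map entry per window.
    fh, fw = filt.shape
    oh = image.shape[0] - fh + 1
    ow = image.shape[1] - fw + 1
    fmap = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            fmap[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    return fmap

image = np.array([[0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1],
                  [0, 0, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1]], dtype=float)
filt = np.eye(3)          # a "diagonal" feature detector
print(conv2d_valid(image, filt))
```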

Max pooling layer. Let us look at Figure 2.9(a). The operation of the max pooling layer is as follows. First, choose the pooling size (2 × 2 in the figure), then pick the maximum value of the 2 × 2 submatrix of the feature map in the upper left; this becomes the component in the upper left of the pooled feature map. Then move two units to the right and pick the maximum; note that the next submatrix must not overlap the previous one. Repeat this procedure to get a complete pooled feature map, as shown in Figure 2.9(b). There are three major advantages of max pooling: the first is that translating the image by a few pixels will not change the result, the second is that it is good at image denoising, and the last is that it reduces the dimensions of the data.

Figure 2.9: The operation in the max pooling layer.
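A matching sketch for non-overlapping 2 × 2 max pooling (illustrative values):

```python
import numpy as np

def max_pool_2x2(fmap):
    # Partition the feature map into disjoint 2x2 blocks and keep
    # the maximum of each block (stride 2, no overlap).
    h, w = fmap.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = np.max(fmap[i:i+2, j:j+2])
    return pooled

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 5, 1],
                 [7, 2, 9, 8],
                 [0, 1, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))   # [[6, 5], [7, 9]]
```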

After the operations of convolution and max pooling, the network finally flattens the output of the second max pooling layer and feeds it as the input of the first fully connected layer, as illustrated in Figure 2.10.

Figure 2.10: The operation between the max pooling layer and the fully connected layer.

2.5 Recurrent Neural Network

This section introduces another classical type of neural network, the recurrent neural network (RNN) [7], which is applied to tasks with sequential data such as time series. The recurrent neural network deals with these time-related problems by accounting for the effect the previous results have on the current result; in other words, it is a network with memory. Note that although the convolutional neural network can also process sequential data, it does not preserve long-term memory, because each term in the sequence of convolutional outputs is only affected by some neighboring members of the corresponding input. Recurrent networks, on the other hand, work in a different way: the output of the current time step is affected by all the previous outputs, and hence they perform better on sequential data than other networks. However, the recurrent network still has some disadvantages, so many variants of it have appeared to address these issues, such as long short-term memory (LSTM) [13] and the gated recurrent unit (GRU) [4]. Recently, recurrent neural networks have achieved great success in machine translation, speech and video recognition, and other natural language processing problems. Now, we illustrate the structure and operation of the recurrent neural network.

2.5.1 The Structure of The Recurrent Neural Network

The structure of the basic recurrent neural network is shown in Figure 2.11. We briefly introduce the notation in this figure as follows: x_t and y_t denote the input and the predicted output at time step t, respectively, and for all j ∈ {1, . . . , n}, h_t^j denotes the output of the j-th hidden layer at time step t. Note that each hidden layer is represented by a single neuron for convenience, and the directed cycle on each hidden layer represents that its output returns to the layer (as a hidden state). Moreover, we can see in the figure that the output of each hidden layer is affected by the inputs, and the hidden layers also play a role in determining the predicted output.

Figure 2.11: The schematic diagram of the structure of a recurrent neural network.

2.5.2 The Operation of The Recurrent Neural Network

In this section, we describe how the recurrent neural network works by illustrating how each layer acts, as depicted in Figure 2.12. The layer is represented by a neuron as above. Even though the figure only presents the behavior of one layer of the network, it is enough to see the whole recurrent network architecture, because each layer has the same behavior.

Figure 2.12: Behavior of a neuron in a recurrent neural network.

Now, we start by introducing the setting in Figure 2.12. Let x = (x_1, . . . , x_L) and y = (y_1, . . . , y_L) be an input and output sequence; the figure describes the operation of the j-th hidden layer at time step t (x_t is the current input). Clearly, for all t ∈ {1, . . . , L} and j ∈ {2, . . . , n}, the output vector h_t^j of the j-th hidden layer is computed by the following formula [11]:

h_t^j = φ(W_{xh^j} x_t + W_{h^{j−1}h^j} h_t^{j−1} + W_{h^j h^j} h_{t−1}^j + b_h^j).

Here, W_{xh^j} represents the weights from the input layer to the j-th hidden layer, W_{h^l h^m} represents the weights from the l-th hidden layer to the m-th hidden layer, b_h^j represents the bias of the j-th hidden layer, and φ is an activation function. Likewise, for j = 1 we have

h_t^1 = φ(W_{xh^1} x_t + W_{h^1 h^1} h_{t−1}^1 + b_h^1).

Finally, the predicted output vector is defined by

y_t = φ(∑_{j=1}^{n} W_{h^j y} h_t^j + b_y).
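A single-layer version of this recurrence is easy to state in code. The sketch below (illustrative shapes and random values, not the thesis code) iterates h_t = φ(W_{xh} x_t + W_{hh} h_{t−1} + b_h) over a short sequence and emits y_t at each step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, L = 3, 4, 2, 5           # illustrative dimensions

W_xh = rng.normal(size=(d_h, d_in)) * 0.1  # input -> hidden weights
W_hh = rng.normal(size=(d_h, d_h)) * 0.1   # hidden -> hidden (recurrent)
W_hy = rng.normal(size=(d_out, d_h)) * 0.1 # hidden -> output
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(L, d_in))            # input sequence x_1 .. x_L
h = np.zeros(d_h)                          # initial hidden state
for t in range(L):
    # h_t = phi(W_xh x_t + W_hh h_{t-1} + b_h), with phi = tanh
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
    y = np.tanh(W_hy @ h + b_y)            # predicted output at step t
    print(t, y)
```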

From this, we can see that the output of the recurrent neural network at time step t is a function of the current input and all the previous hidden states h_i^j, 1 ≤ i ≤ t, which makes such networks powerful in time-related prediction.

Chapter 3

Model Description

This chapter introduces the model used in [17]. Section 3.1 describes the architecture of the modified neural network constructed by [17]; such a network can remove all suboptimal local minima of any objective function. Section 3.2 illustrates the loss function and the objective function used for optimization, and discusses the convexity of common loss functions for later use.

3.1 The Architecture

The architecture of the modified neural network presented in [17] is depicted in Figure 3.1. As you can see, it can be roughly divided into two parts, the left side and the right side. We first introduce the left side of the network briefly. Let {(x_i, y_i)}_{i=1}^{n} be a training dataset, where x_i = (x_{i1}, . . . , x_{id})^⊤ ∈ R^d is an input vector and y_i = (y_{i1}, . . . , y_{id̄})^⊤ ∈ R^d̄ is a target vector. Here, we replace x_i with x = (x_1, . . . , x_d)^⊤ in the figure for convenience. Given an arbitrary function (deep neural network) f and the set θ of all parameters of the network, f(x; θ) = (f(x; θ)_1, . . . , f(x; θ)_d̄)^⊤ is the corresponding predicted output vector.

Next, we introduce the right side of the network. In the modified neural network, one neuron is added for each output neuron of the original network f. These added neurons skip the hidden layers and output layer of the original network and are only connected (fully connected) with the input x. Here, the activation function of the new layer is the natural exponential function exp(x). Then one more layer with the same number of neurons is added again; unlike the connections of the previous layers, each neuron of this layer is only connected with the neuron in front of it. Finally, adding the output f(x; θ) of the original network and the output g(x; b, c, W)

of the external network up, we get the predicted output f̃(x; θ̃) = (f̃(x; θ̃)_1, . . . , f̃(x; θ̃)_d̄)^⊤ of the modified neural network f̃, where θ̃ is the set of all parameters of the modified network. In conclusion, given any deep neural network f and its parameters θ, the modified neural network f̃ is defined by

f̃(x; θ̃) = f(x; θ) + g(x; b, c, W),

where θ̃ = (θ, b, c, W), b = (b_1, . . . , b_d̄)^⊤, c = (c_1, . . . , c_d̄)^⊤ ∈ R^d̄,

W = [w_1 w_2 · · · w_d̄] ∈ R^{d×d̄} with w_j ∈ R^d for all j ∈ {1, . . . , d̄},

and

g(x; b, c, W) = (c_1 exp(w_1^⊤ x + b_1), c_2 exp(w_2^⊤ x + b_2), . . . , c_d̄ exp(w_d̄^⊤ x + b_d̄))^⊤ ∈ R^d̄.

Hence, the modified neural network can be seen as adding one neuron g(x; b, c, W)_j for each output neuron f(x; θ)_j of the original neural network, for all j ∈ {1, . . . , d̄}. The results of the following chapter hold under such a network.

Figure 3.1: The structure of the modified neural network f̃.
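The construction is mechanical enough to write down directly. The following NumPy sketch (a minimal illustration, assuming some given callable f standing in for the original network) forms f̃(x; θ̃) = f(x; θ) + g(x; b, c, W) with g_j(x) = c_j exp(w_j^⊤ x + b_j):

```python
import numpy as np

def g(x, b, c, W):
    # One exponential neuron per output unit:
    # g_j(x) = c_j * exp(w_j^T x + b_j), with W = [w_1 ... w_dbar]
    return c * np.exp(W.T @ x + b)

def f_tilde(x, f, theta, b, c, W):
    # Modified network: original output plus the external branch
    return f(x, theta) + g(x, b, c, W)

# Illustrative original network: a linear map as a stand-in for f
d, dbar = 3, 2
theta = np.ones((dbar, d))
f = lambda x, th: th @ x

x = np.array([0.2, -0.5, 1.0])
b, c = np.zeros(dbar), np.array([0.1, -0.3])
W = np.zeros((d, dbar))
print(f_tilde(x, f, theta, b, c, W))
# With c = 0 the external branch vanishes and f_tilde(x) = f(x),
# which is the situation described in Theorem 1(ii) below.
print(f_tilde(x, f, theta, b, np.zeros(dbar), W))
```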

3.2 Loss and Objective Functions

3.2.1 Construction of Objective Functions

In order to train the neural network, we must choose a loss function and construct an objective function to measure the error between the targets and the predicted outputs. Here, there is no strong assumption on either function of the original and modified neural networks, as below. Let {(x_i, y_i)}_{i=1}^{n} be a training dataset; then the objective function L of the original neural network f is defined by

L(θ) = (1/n) ∑_{i=1}^{n} ℓ(f(x_i; θ), y_i),

where ℓ : R^d̄ × R^d̄ → R is an arbitrary common loss function. On the other hand, the objective function L̃ of the modified neural network is defined by

L̃(θ̃) = (1/n) ∑_{i=1}^{n} ℓ(f(x_i; θ) + g(x_i; b, c, W), y_i) + λ∥c∥_2²,

which has a regularization term λ∥c∥_2², where λ > 0.

3.2.2 Convexity of Loss Functions

In this section, we prove that the most common loss functions are all convex, which is an assumption of the theorems in the following chapters. More precisely, given a sample (x_i, y_i), let ℓ_{y_i} : R^d̄ → R be the loss function of y_i defined by ℓ_{y_i}(q) = ℓ(q, y_i), where ℓ can be the squared error, the cross entropy, or the smoothed hinge loss; we would like to show that ℓ_{y_i} is convex on R^d̄. In fact, the loss function also needs to be differentiable to meet the assumptions of the theorems, but we will not prove this since it is trivial.

(i) Convexity of squared error

Proof. In this case, ℓ_{y_i}(q) = ∥q − y_i∥_2² for all q ∈ R^d̄. Obviously, ℓ_{y_i} is a quadratic polynomial, which makes it convex on R^d̄.

(ii) Convexity of cross entropy

In this case, we need to introduce some terminology first. Note that we may use notations different from the model setting above.

(a) Entropy [31]

In information theory, entropy represents the average amount of information contained in each message. The frequency of a message is negatively correlated with the amount of information it carries. Hence entropy is defined by

H(X) = E[I(X)] = E[−log(p(X))],

where X is a random variable of the message, p(X) is the probability density function of X, and I(X) represents the quantity of information of X. When the sample space (dataset) is finite, we can write

H(X) = ∑_x p(x) I(x) = −∑_x p(x) log p(x).

(b) Cross entropy

Given the true distribution p and an estimated probability distribution q of the message, entropy measures the average number of bits needed to encode the message through p. Conversely, the cross entropy measures the average number of bits needed to encode the message through q rather than p. Hence the cross entropy (with respect to p and q) is defined by

H(p, q) = E_p[−log q] = −∑_{x∈S} p(x) log q(x),

for discrete probability distributions p and q with the same support S. Note that when the cross entropy is used in deep learning, the so-called true distribution p represents the ideal function we are looking for, and q refers to the function we found.

(c) Kullback–Leibler divergence (KL divergence) [19]

The KL divergence (with respect to p and q) is used to measure the extra number of bits needed to encode the message using q instead of p. Hence the KL divergence (with

respect to p and q) is defined by

D_KL(p∥q) = H(p, q) − H(p) = −∑_x p(x) log q(x) + ∑_x p(x) log p(x) = ∑_x p(x) log (p(x)/q(x)).

Now, we are ready to prove the convexity of cross entropy.

Proof. Given the cross entropy H(p, q), from the definition of the KL divergence we get

H(p, q) = H(p) + D_KL(p∥q).

Since H(p) is fixed, we need only consider the convexity of D_KL(p∥q). By the definition of convexity, it suffices to show that for all t ∈ [0, 1],

D_KL(p∥tq_1 + (1 − t)q_2) ≤ t D_KL(p∥q_1) + (1 − t) D_KL(p∥q_2).

Obviously, the inequality holds for t = 0 and t = 1. As for t ∈ (0, 1),

D_KL(p∥tq_1 + (1 − t)q_2)
= ∑_x p(x) log ( p(x) / (tq_1(x) + (1 − t)q_2(x)) )
= ∑_x (tp(x) + (1 − t)p(x)) log ( (tp(x) + (1 − t)p(x)) / (tq_1(x) + (1 − t)q_2(x)) )
≤ ∑_x ( tp(x) log (tp(x)/(tq_1(x))) + (1 − t)p(x) log ((1 − t)p(x)/((1 − t)q_2(x))) )
= t ∑_x p(x) log (p(x)/q_1(x)) + (1 − t) ∑_x p(x) log (p(x)/q_2(x))
= t D_KL(p∥q_1) + (1 − t) D_KL(p∥q_2).

Note that the inequality in the third line follows from the log sum inequality. Therefore, D_KL(p∥q) is convex in q, which makes the cross entropy convex in q.
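A small numeric sketch of these definitions and of the convexity inequality just proved (the distributions are made up for illustration):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p * np.log(q))

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) log(p(x) / q(x)) = H(p, q) - H(p)
    return np.sum(p * np.log(p / q))

p  = np.array([0.5, 0.3, 0.2])
q1 = np.array([0.4, 0.4, 0.2])
q2 = np.array([0.2, 0.5, 0.3])

# Check D_KL(p || t q1 + (1-t) q2) <= t D_KL(p||q1) + (1-t) D_KL(p||q2)
t = 0.7
lhs = kl(p, t * q1 + (1 - t) * q2)
rhs = t * kl(p, q1) + (1 - t) * kl(p, q2)
print(lhs <= rhs, lhs, rhs)                # True: convexity in q

# H(p, q) - H(p) equals the KL divergence (H(p) = cross entropy of p with itself)
print(cross_entropy(p, q1) - cross_entropy(p, p), kl(p, q1))
```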

(iii) Convexity of smoothed hinge loss

Proof. Here, we only consider the smoothed hinge loss used in the binary classification problem (q ∈ R), which is defined by

ℓ_{y_i}(q) = (max{0, 1 − q})^p if y_i = 1,   and   ℓ_{y_i}(q) = (max{0, 1 + q})^p if y_i = −1,

where p ≥ 2. Hence, for y_i = 1, we have

ℓ_{y_i}(q) = 0 for q ≥ 1,   and   ℓ_{y_i}(q) = (1 − q)^p for q < 1,

which implies

ℓ′_{y_i}(q) = 0 for q ≥ 1,   and   ℓ′_{y_i}(q) = −p(1 − q)^{p−1} for q < 1.

Furthermore,

ℓ″_{y_i}(q) = 0 for q ≥ 1,   and   ℓ″_{y_i}(q) = p(p − 1)(1 − q)^{p−2} for q < 1.

On the other hand, for y_i = −1, we have

ℓ_{y_i}(q) = 0 for q ≤ −1,   and   ℓ_{y_i}(q) = (1 + q)^p for q > −1,

which implies

ℓ′_{y_i}(q) = 0 for q ≤ −1,   and   ℓ′_{y_i}(q) = p(1 + q)^{p−1} for q > −1.

Furthermore,

ℓ″_{y_i}(q) = 0 for q ≤ −1,   and   ℓ″_{y_i}(q) = p(p − 1)(1 + q)^{p−2} for q > −1.

Since in every case ℓ″_{y_i}(q) ≥ 0 for all q ∈ R, ℓ_{y_i} is convex on R. The smoothed hinge loss used in the multi-class classification problem is just a generalization of the binary one, so its convexity can be proved in a similar way, which we pass over here.
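For reference, a direct transcription of this loss (illustrative, with p = 2):

```python
import numpy as np

def smoothed_hinge(q, y, p=2):
    # l_y(q) = max(0, 1 - q)^p if y = +1, and max(0, 1 + q)^p if y = -1
    margin = 1 - q if y == 1 else 1 + q
    return max(0.0, margin) ** p

qs = np.linspace(-2, 2, 9)
print([smoothed_hinge(q, 1) for q in qs])    # zero once q >= 1
print([smoothed_hinge(q, -1) for q in qs])   # zero once q <= -1
```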

Chapter 4

Main Theorems

This chapter introduces the main theorems and lemmas stated and proved in [17], which illustrate the fact that the modified neural network can delete all suboptimal local minima of the objective function of the original network.

4.1 Lemmas

The following two lemmas are used to prove the main theorems in the next section. In this section, apart from presenting the proofs of the lemmas in [17], we discuss the differentiability of the modified objective function and the use of the chain rule more clearly in Lemma 1, and we prove four claims to make the proof of Lemma 2 in [17] more complete.

Lemma 1. Let ℓ_{y_i} : R^d̄ → R be differentiable for all i ∈ {1, . . . , n}. For any (θ, W), if (b, c) is a stationary point of L̃|_{(θ,W)}, then c = 0.

Proof. For i ∈ {1, . . . , n}, (b, c) ∈ R^d̄ × R^d̄, and W ∈ R^{d×d̄} defined above, let

q_i(θ, b, c, W) = f(x_i; θ) + g(x_i; b, c, W) = (f(x_i; θ)_1 + c_1 exp(w_1^⊤ x_i + b_1), . . . , f(x_i; θ)_d̄ + c_d̄ exp(w_d̄^⊤ x_i + b_d̄))^⊤.

Fix (θ, W); then q_i|_{(θ,W)} is differentiable at (b, c). Since ℓ_{y_i} is differentiable on R^d̄, ℓ_{y_i}|_{(θ,W)} is differentiable at (b, c). This implies that L̃|_{(θ,W)} is differentiable at (b, c). From the definition of a stationary point of the differentiable function L̃|_{(θ,W)}, we have, for all j ∈ {1, 2, . . . , d̄},

∂L̃(θ, b, c, W)/∂c_j = 0   and   ∂L̃(θ, b, c, W)/∂b_j = 0.

Hence,

n c_j ∂L̃(θ, b, c, W)/∂c_j
= n c_j ( (1/n) ∑_{i=1}^{n} ∂ℓ_{y_i}/∂c_j + 2λc_j )
= c_j ∑_{i=1}^{n} (∂ℓ_{y_i}/∂q_{i,j})(∂q_{i,j}/∂c_j) + 2nλc_j²
= c_j ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ) + g(x_i; b, c, W)))_j exp(w_j^⊤ x_i + b_j) + 2nλc_j²
= 2nλc_j² = 0,

where the last step uses the fact that the first term equals n ∂L̃(θ, b, c, W)/∂b_j = 0. Since n and λ are positive, we conclude that c_j = 0, and thus c = 0.

Note that, from the proof of Lemma 1, we can see that the regularization term of the modified objective function L̃ is necessary.

Lemma 2. Let ℓ_{y_i} : R^d̄ → R be differentiable for all i ∈ {1, . . . , n}. Then, for any θ, if L̃|_θ has a local minimum at (b, c, W), then for all unit vectors u_j ∈ R^d, all j ∈ {1, 2, . . . , d̄}, and all k ∈ N₀, we have

∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^k = 0.

Proof. Fix θ. Since L̃|_θ has a local minimum at (b, c, W), (b, c) is a stationary point of L̃|_{(θ,W)}, so we get c = 0 from Lemma 1. That is, L̃|_{(θ,b)} has a local minimum at (0, W). Hence, choosing Δc ∈ R^d̄ and ΔW = [Δw_1 Δw_2 · · · Δw_d̄] ∈ R^{d×d̄} sufficiently small, we can find a vector ∇ℓ_{y_i}(f(x_i; θ)) and a function ρ such that

L̃|_{(θ,b)}(Δc, W + ΔW) − L̃|_{(θ,b)}(0, W)
= L̃(θ, b, Δc, W + ΔW) − L̃(θ, b, 0, W)
= (1/n) ∑_{i=1}^{n} ℓ_{y_i}(f(x_i; θ) + g(x_i; b, Δc, W + ΔW)) + λ∥Δc∥_2² − (1/n) ∑_{i=1}^{n} ℓ_{y_i}(f(x_i; θ))
= (1/n) ∑_{i=1}^{n} ( ℓ_{y_i}(f(x_i; θ)) + ∇ℓ_{y_i}(f(x_i; θ))^⊤ g(x_i; b, Δc, W + ΔW) + ∥g(x_i; b, Δc, W + ΔW)∥_2 ρ(f(x_i; θ); g(x_i; b, Δc, W + ΔW)) ) + λ∥Δc∥_2² − (1/n) ∑_{i=1}^{n} ℓ_{y_i}(f(x_i; θ))
= (1/n) ∑_{i=1}^{n} ∇ℓ_{y_i}(f(x_i; θ))^⊤ g(x_i; b, Δc, W + ΔW) + (1/n) ∑_{i=1}^{n} ∥g(x_i; b, Δc, W + ΔW)∥_2 ρ(f(x_i; θ); g(x_i; b, Δc, W + ΔW)) + λ∥Δc∥_2²
≥ 0,

where lim_{g(x_i;b,Δc,W+ΔW)→0} ρ(f(x_i; θ); g(x_i; b, Δc, W + ΔW)) = 0. Here, the third and fourth lines hold since ℓ_{y_i} is differentiable for all i ∈ {1, . . . , n}: recall that a multivariate real-valued function f on Rⁿ is differentiable at x if and only if there exist a vector ∇f(x) and a function ρ(x; ·) defined on D = {u ∈ Rⁿ : 0 < ∥u − 0∥ < δ} such that lim_{Δx→0} ρ(x; Δx) = 0 and

f(x + Δx) = f(x) + ∇f(x)^⊤ Δx + ∥Δx∥ ρ(x; Δx)

for all non-zero vectors Δx ∈ Rⁿ sufficiently close to 0. Replacing f with ℓ_{y_i}, x with f(x_i; θ), and Δx with g(x_i; b, Δc, W + ΔW), we get the third and fourth lines, provided g(x_i; b, Δc, W + ΔW) is arbitrarily small for sufficiently small Δc and ΔW, which we prove below.

Claim 1. g(x_i; b, Δc, W + ΔW) is arbitrarily small for sufficiently small Δc and ΔW.

Proof. It suffices to show that lim_{(Δc_j,Δw_j)→(0,0)} Δc_j exp(w_j^⊤ x_i + Δw_j^⊤ x_i + b_j) = 0 for all j ∈ {1, 2, . . . , d̄}. First, we show that f_1(Δc_j, Δw_j) = Δc_j is continuous at (0, 0): given ε > 0, choose δ = ε; then for all (Δc_j, Δw_j) ∈ R^{d+1} with ∥(Δc_j, Δw_j)∥ = √(Δc_j² + Δw_{j1}² + · · · + Δw_{jd}²) < δ, we have

|f_1(Δc_j, Δw_j) − f_1(0, 0)| = |Δc_j| = √(Δc_j²) ≤ √(Δc_j² + Δw_{j1}² + · · · + Δw_{jd}²) < δ = ε.

Second, f_2(Δc_j, Δw_j) = w_j^⊤ x_i + Δw_j^⊤ x_i + b_j is continuous at (0, 0), which can be deduced by a similar idea. Third, we show that if f_2 is continuous at some vector x_0 and h is continuous at f_2(x_0), then h ∘ f_2 is continuous at x_0: given ε > 0, there exists δ_1 > 0 such that |h(u) − h(f_2(x_0))| < ε for all |u − f_2(x_0)| < δ_1 with u ∈ Domain(h), since h is continuous at f_2(x_0); moreover, for ε′ = δ_1 there exists δ > 0 such that |f_2(x) − f_2(x_0)| < δ_1 for all ∥x − x_0∥ < δ with x ∈ Domain(f_2), since f_2 is continuous at x_0; hence |h(f_2(x)) − h(f_2(x_0))| < ε. Finally, let h(t) = exp(t). Since f_2 is continuous at (0, 0) and h is continuous at f_2(0, 0), h ∘ f_2 is continuous at (0, 0); also, f_1 is continuous at (0, 0), so f_1 · (h ∘ f_2) is continuous at (0, 0). Therefore, lim_{(Δc_j,Δw_j)→(0,0)} Δc_j exp(w_j^⊤ x_i + Δw_j^⊤ x_i + b_j) = 0.

Let Δc = εv, with ε > 0 sufficiently small and any unit vector v = (v_1, . . . , v_d̄) ∈ R^d̄. Then

(g(x_i; b, Δc, W + ΔW))_j = εv_j exp(w_j^⊤ x_i + Δw_j^⊤ x_i + b_j) = ε(g(x_i; b, v, W + ΔW))_j

for all j ∈ {1, . . . , d̄}, and, dividing the inequality above by ε,

(1/n) ∑_{i=1}^{n} ∇ℓ_{y_i}(f(x_i; θ))^⊤ g(x_i; b, v, W + ΔW) ≥ −(1/n) ∑_{i=1}^{n} ∥g(x_i; b, v, W + ΔW)∥_2 ρ(f(x_i; θ); εg(x_i; b, v, W + ΔW)) − λε∥v∥_2².

Because lim_{ε→0} εg(x_i; b, v, W + ΔW) = 0 implies lim_{ε→0} ρ(f(x_i; θ); εg(x_i; b, v, W + ΔW)) = 0, and lim_{ε→0} λε∥v∥_2² = 0, we have

∑_{i=1}^{n} ∇ℓ_{y_i}(f(x_i; θ))^⊤ g(x_i; b, v, W + ΔW) ≥ 0.

Moreover, for j ∈ {1, 2, . . . , d̄}, taking v with |v_m| = 1 if m = j and v_m = 0 if m ≠ j, where m ∈ {1, 2, . . . , d̄}, we obtain

v_j ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + Δw_j^⊤ x_i + b_j) ≥ 0   for |v_j| = 1.

Hence,

∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) exp(Δw_j^⊤ x_i) ≥ 0   and   ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) exp(Δw_j^⊤ x_i) ≤ 0

simultaneously. Due to this, we get

∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) exp(Δw_j^⊤ x_i) = 0.

Similarly, letting Δw_j = ε̄_j u_j, with ε̄_j > 0 sufficiently small and ∥u_j∥_2 = 1, we have

∑_{l=0}^{∞} (ε̄_j^l / l!) ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l = 0,

from the Maclaurin series of the natural exponential function, exp(y) = ∑_{l=0}^{∞} y^l / l! for all y ∈ R, which we prove below.

Claim 2. ∑_{l=0}^{∞} (ε̄_j^l / l!) ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l = 0.

Proof. From the equality ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) exp(Δw_j^⊤ x_i) = 0, with Δw_j = ε̄_j u_j and exp(y) = ∑_{l=0}^{∞} y^l / l! for all y ∈ R, we can get

0 = ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) ∑_{l=0}^{∞} (ε̄_j u_j^⊤ x_i)^l / l!
= lim_{m→∞} ∑_{i=1}^{n} ∑_{l=0}^{m} (ε̄_j^l / l!) (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l
= lim_{m→∞} ∑_{l=0}^{m} (ε̄_j^l / l!) ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l
= ∑_{l=0}^{∞} (ε̄_j^l / l!) ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l.

Here, the second equality follows from lim_{m→∞} ∑_{i=1}^{n} a_{m,i} = ∑_{i=1}^{n} lim_{m→∞} a_{m,i}, where a_{m,i} = ∑_{l=0}^{m} (ε̄_j^l / l!)(∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l and lim_{m→∞} a_{m,i} exists for all i ∈ {1, . . . , n} since their sum exists; this completes the proof of Claim 2.

Now, let z_l = ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j)(u_j^⊤ x_i)^l. We have the fact that, as ε̄_j → 0,

∑_{l=0}^{∞} (ε̄_j^l / l!) z_l = 0.

Here, we prove that z_k = 0 for all k ∈ N₀ by induction on k, which finishes Lemma 2. For the case k = 0, we have

lim_{t→∞} ∑_{l=0}^{t} (ε̄_j^l / l!) z_l = z_0 + lim_{t→∞} ∑_{l=1}^{t} (ε̄_j^l / l!) z_l = 0   as ε̄_j → 0,

where the second equality holds because the existence of lim_{t→∞} ∑_{l=0}^{t} (ε̄_j^l / l!) z_l implies that lim_{t→∞} ∑_{l=1}^{t} (ε̄_j^l / l!) z_l exists. Since lim_{ε̄_j→0} lim_{t→∞} ∑_{l=1}^{t} (ε̄_j^l / l!) z_l = 0, which we prove as follows, z_0 = 0.

Claim 3. lim_{ε̄_j→0} lim_{t→∞} ∑_{l=1}^{t} (ε̄_j^l / l!) z_l = 0.

Proof. We use the fact that if ∑_{l=1}^{∞} f_l converges uniformly on a set S ⊆ R and each f_l is continuous at 0, a limit point of S, then

lim_{ε̄_j→0} ∑_{l=1}^{∞} f_l(ε̄_j) = ∑_{l=1}^{∞} lim_{ε̄_j→0} f_l(ε̄_j).

Hence, if we can show that ∑_{l=1}^{∞} (ε̄_j^l / l!) z_l converges uniformly on a neighborhood of 0, then

lim_{ε̄_j→0} lim_{t→∞} ∑_{l=1}^{t} (ε̄_j^l / l!) z_l = ∑_{l=1}^{∞} lim_{ε̄_j→0} (ε̄_j^l / l!) z_l = 0.

Initially, we show that ∑_{l=0}^{∞} x^l / l! = exp(x) uniformly on [−R, R], where R > 0. Given an interval [−R, R], we have 0 ≤ |x^l / l!| ≤ R^l / l! for all x ∈ [−R, R] and l ∈ N₀. Since ∑_{l=0}^{∞} R^l / l! converges by the ratio test,

lim_{l→∞} |a_{l+1}/a_l| = lim_{l→∞} (R^{l+1}/(l + 1)!)(l!/R^l) = R lim_{l→∞} 1/(l + 1) = 0 < 1,   where a_l = R^l / l!,

∑_{l=0}^{∞} x^l / l! converges uniformly on [−R, R] by the Weierstrass M-test. Consequently, we deduce that ∑_{l=0}^{∞} x^l / l! = exp(x) uniformly on [−R, R] by uniqueness of the limit. From this, we have

∑_{l=0}^{∞} (ε̄_j u_j^⊤ x_i)^l / l! = exp(ε̄_j u_j^⊤ x_i)

uniformly on I := (−R/|u_j^⊤ x_i|, R/|u_j^⊤ x_i|).

Secondly, we show that ∑_{i=1}^{n} (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j) ∑_{l=0}^{∞} (ε̄_j u_j^⊤ x_i)^l / l! converges uniformly on I. It suffices to show that the operations of summation and scalar multiplication preserve uniform convergence; that is, if both ∑_{l=0}^{∞} f_l = f and ∑_{l=0}^{∞} g_l = g uniformly on I, then

a_1 ∑_{l=0}^{∞} f_l + a_2 ∑_{l=0}^{∞} g_l = ∑_{l=0}^{∞} (a_1 f_l + a_2 g_l) = a_1 f + a_2 g

uniformly on I, where a_1, a_2 ≠ 0 (the case of zero scalars is trivial). Given ε > 0, there exist L_1, L_2 ∈ N such that for all l_1 > L_1 and l_2 > L_2,

|∑_{m=0}^{l_1} f_m(x) − f(x)| < ε/(2|a_1|)   and   |∑_{m=0}^{l_2} g_m(x) − g(x)| < ε/(2|a_2|).

Choose L = max(L_1, L_2); then for all l > L,

|∑_{m=0}^{l} (a_1 f_m + a_2 g_m)(x) − (a_1 f + a_2 g)(x)|
= |a_1 (∑_{m=0}^{l} f_m(x) − f(x)) + a_2 (∑_{m=0}^{l} g_m(x) − g(x))|
≤ |a_1| |∑_{m=0}^{l} f_m(x) − f(x)| + |a_2| |∑_{m=0}^{l} g_m(x) − g(x)|
< ε/2 + ε/2 = ε,   for all x ∈ I.

Hence, we complete the proof and get the fact that ∑_{l=0}^{∞} (ε̄_j^l / l!) z_l = 0 converges uniformly on I.

Last, we show that ∑_{l=1}^{∞} (ε̄_j^l / l!) z_l converges uniformly on I. Given ε > 0, there exists L ∈ N such that for all l > L,

|∑_{m=0}^{l} (ε̄_j^m / m!) z_m| < ε,   for all x ∈ I,

which implies that

|∑_{m=1}^{l} (ε̄_j^m / m!) z_m − (−z_0)| < ε,   for all x ∈ I.

Therefore, ∑_{l=1}^{∞} (ε̄_j^l / l!) z_l converges uniformly on I, and we complete the proof of Claim 3.

For the remaining cases, assume that z_l = 0 for l ∈ {0, 1, . . . , k − 1}. Then

∑_{l=0}^{∞} (ε̄_j^l / l!) z_l = ∑_{l=0}^{k−1} (ε̄_j^l / l!) z_l + (ε̄_j^k / k!) z_k + lim_{t→∞} ∑_{l=k+1}^{t} (ε̄_j^l / l!) z_l = 0   as ε̄_j → 0.

By the induction hypothesis, we can neglect the term ∑_{l=0}^{k−1} (ε̄_j^l / l!) z_l. Moreover, we can multiply both sides of the equation by k!/ε̄_j^k (nonzero) to get

z_k + lim_{t→∞} ∑_{l=k+1}^{t} (ε̄_j^{l−k} k! / l!) z_l = 0   as ε̄_j → 0.

Since lim_{ε̄_j→0} lim_{t→∞} ∑_{l=k+1}^{t} (ε̄_j^{l−k} k! / l!) z_l = 0, which we prove as follows, z_k = 0.

Claim 4. lim_{ε̄_j→0} lim_{t→∞} ∑_{l=k+1}^{t} (ε̄_j^{l−k} k! / l!) z_l = 0.

Proof. Since

∑_{l=k+1}^{∞} (ε̄_j^{l−k} k! / l!) z_l = (k!/ε̄_j^k) ∑_{l=k+1}^{∞} (ε̄_j^l / l!) z_l

and ∑_{l=k+1}^{∞} (ε̄_j^l / l!) z_l converges uniformly on I (as in Claim 3), ∑_{l=k+1}^{∞} (ε̄_j^{l−k} k! / l!) z_l converges uniformly on I. Therefore, we can conclude that

lim_{ε̄_j→0} lim_{t→∞} ∑_{l=k+1}^{t} (ε̄_j^{l−k} k! / l!) z_l = ∑_{l=k+1}^{∞} lim_{ε̄_j→0} (ε̄_j^{l−k} k! / l!) z_l = 0,

which completes the proof of Claim 4.

From the above analysis, we finish the proof of Lemma 2.

4.2 Theorems

The following two theorems prove the feasibility of eliminating local minima of the original objective function theoretically. Among them, the former holds for arbitrary datasets, and the latter holds for realizable datasets. In this section, apart from presenting the proofs of the theorems in [17], we introduce some related notations and theorems, propose a claim, and discuss the case k = 0 to give more details of the proof of Theorem 1, as well as correct a mistake and prove statements (i) and (ii) more clearly in Theorem 2.

Theorem 1. For any i ∈ {1, . . . , n}, let the function ℓ_{y_i} : R^d̄ → R be differentiable and convex. If L̃ has a local minimum at (θ, b, c, W), then we have:

(i) L has a global minimum at θ;
(ii) f̃(x; θ, b, c, W) = f(x; θ) for all x ∈ R^d, and L̃(θ, b, c, W) = L(θ).

Now, before presenting the proof of Theorem 1, we first introduce some additional notations as follows. A k-th order tensor A in a d-dimensional space is a mathematical object that has k indices ranging from 1 to d, which is denoted by

A = (a_{i_1,...,i_k})_{1≤i_m≤d, m∈{1,...,k}}.

For instance, a scalar is a 0-th order tensor, a vector is a first order one, and a matrix is a second order one. A tensor A = (a_{i_1,...,i_k})_{1≤i_m≤d, m∈{1,...,k}} is called symmetric if the element a_{i_1,...,i_k} is invariant under any permutation of the indices. A tensor product ⊗ is an operation between tensors which obeys the following rule: for a p-th order n-dimensional tensor A and a q-th order m-dimensional tensor B, the tensor product A ⊗ B of A and B is a (p + q)-th order tensor

C = (c_{i_1,...,i_{p+q}})_{1≤i_l≤n, 1≤i_r≤m, l∈{1,...,p}, r∈{p+1,...,p+q}}.

Moreover, x^⊗k = x ⊗ · · · ⊗ x means that x appears k times. For a k-th order tensor A ∈ R^{d×···×d} and k vectors u^(1), u^(2), . . . , u^(k) ∈ R^d, define the operation

A(u^(1), u^(2), . . . , u^(k)) = ∑_{1≤i_1,...,i_k≤d} A_{i_1,···,i_k} u^(1)_{i_1} · · · u^(k)_{i_k}.

Proof. Fix θ such that L̃|_θ has a local minimum at (b, c, W). Here, we define x_i^⊗0 = 1 and let s_{i,j} = (∇ℓ_{y_i}(f(x_i; θ)))_j exp(w_j^⊤ x_i + b_j). Then for all j ∈ {1, 2, . . . , d̄} and k ∈ N₀, we have

max_{u^(1),...,u^(k) : ∥u^(1)∥_2=···=∥u^(k)∥_2=1} ( ∑_{i=1}^{n} s_{i,j} x_i^⊗k )(u^(1), . . . , u^(k))
= max_{u : ∥u∥_2=1} ( ∑_{i=1}^{n} s_{i,j} x_i^⊗k )(u, u, . . . , u)
= max_{u : ∥u∥_2=1} ∑_{i=1}^{n} s_{i,j} (u^⊤ x_i)^k = 0.

Here, the first equality follows from Theorem 2.1 in [22], which is stated as follows. Suppose that A ∈ Sym_m(Rⁿ) (an m-th order n-dimensional symmetric tensor); then the optimal objective value of

max_{x^(1),...,x^(m)∈Rⁿ s.t. ∥x^(i)∥_2=1} A(x^(1), . . . , x^(m))

can be attained with x^(1) = · · · = x^(m).
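Lemma 1 can also be watched happening numerically: at any stationary point of L̃, the regularization term forces c to 0. The following toy sketch (entirely illustrative: scalar data, squared loss, a linear stand-in f(x; θ) = θx for the original network) runs gradient descent on L̃ and prints the final value of c:

```python
import numpy as np

# Toy data and a linear original network f(x; theta) = theta * x
xs = np.array([0.5, -1.0, 2.0]); ys = np.array([1.0, -2.1, 3.9])
lam = 0.1                          # regularization weight lambda > 0

def L_tilde(theta, b, c, w):
    # Squared error plus the exponential branch g(x) = c * exp(w*x + b)
    pred = theta * xs + c * np.exp(w * xs + b)
    return np.mean((pred - ys) ** 2) + lam * c ** 2

def num_grad(params, eps=1e-6):
    # Finite-difference gradient of L_tilde in (theta, b, c, w)
    grads = np.zeros_like(params)
    for k in range(len(params)):
        e = np.zeros_like(params); e[k] = eps
        grads[k] = (L_tilde(*(params + e)) - L_tilde(*(params - e))) / (2 * eps)
    return grads

params = np.array([0.0, 0.1, 0.5, 0.1])    # theta, b, c, w
for _ in range(5000):
    params -= 0.05 * num_grad(params)
print("c at the minimum found:", params[2])  # close to 0, as Lemma 1 predicts
```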
