國立政治大學應用數學系 碩士學位論文
National Chengchi University, Department of Mathematical Sciences
Master's Thesis

深度學習在不平衡數據集之研究
Survey on Deep Learning with Imbalanced Data Sets

研究生 (Student): 蔡承孝
指導教授 (Advisor): 蔡炎龍 博士
中華民國 108 年 8 月 (August 2019)

DOI: 10.6814/NCCU201901175

致謝 (Acknowledgments)

The first person I must thank for making this thesis possible is my advisor, Professor 蔡炎龍. From the moment he accepted me as his student, I felt like a person given a second chance at life. Despite several twists and turns, I completed my oral defense by the end of August, without letting down my friends, my family, or myself. Thank you, Professor!

Next, I thank my committee members, 老大 and 張媽, for seeing me through the oral defense. The countless meals at 四川, 大團圓, 敘園, and 新馬辣 are our shared memories, and the clinking of green bottles is the song we wrote together. Let's have another drink sometime!

Thanks to my senior 佳琪 for bringing me into the 炎龍 group. She not only cleared the path ahead of me on this thesis, but also kept encouraging me and pushing me forward in life; without her I might still be wallowing in self-pity. Thanks to 振維, my classmate of seven years: being able to drink the cocktails you mix must be a blessing earned in a past life. Thank you for helping me understand myself better; between you and alcohol I would still choose you, and it really would be a pity if you never open a bar. Thanks to my former labmate 黃賴, whose care and persistence constantly pushed me to work harder and showed me my shortcomings; I sincerely hope you graduate!

Thanks to my parents, who never put financial or time pressure on me even though I am 25 and have not yet started working, who let me finish this master's degree, and who brought me into this world. Where else could one find parents like these? I am deeply grateful, and I hope that one day I will be able to let you do whatever you want to do as well.

Finally, thanks to everyone who has appeared in my life; you made me who I am today. I used to say that entering NCCU was one of the biggest mistakes of my life, but as I type these acknowledgments with tears in my eyes, I no longer regret it, because having met all of you makes me the luckiest person. Thank you!

中文摘要 (Chinese Abstract)

This thesis surveys deep learning methods for handling imbalanced data sets and anomaly detection. We generate two highly imbalanced data sets from MNIST, with an imbalance ratio as high as 2500, and apply them to a multi-class classification task and a binary classification task. In the binary task, class 0 is the minority class; in the multi-class task, the minority classes are 0, 1, 4, 6, and 7. We train our models with convolutional neural networks. For anomaly detection, we use a pretrained handwritten-digit CNN classifier to judge whether 18 cat and dog pictures are handwritten-digit images.

Because the data are highly imbalanced, the original classification models perform poorly. We therefore apply 6 and 7 different methods, respectively, to adjust the models in the two classification tasks. We find that the new loss function, focal loss, performs best on the multi-class task, while random oversampling performs best on the binary task; the cost-sensitive learning method is not suitable for the imbalanced data sets we generated. Using confidence estimation, our classifier correctly judges that none of the cat and dog pictures are handwritten-digit images.

Keywords: deep learning, convolutional neural network, imbalanced data set, anomaly detection, image classification

Abstract

This thesis is a survey on deep learning with imbalanced data sets and anomaly detection. We create two imbalanced data sets from MNIST: one for a multi-class classification task with minority classes 0, 1, 4, 6, and 7, and one for a binary classification task with minority class 0. Both data sets are highly imbalanced, with imbalance ratio ρ = 2500, and we train convolutional neural networks (CNNs) on them. For anomaly detection, we use a pretrained CNN handwritten-digit classifier to decide whether 18 cat and dog pictures are handwritten-digit images.

Because the data sets are imbalanced, the baseline models perform poorly on the minority classes. We therefore apply 6 and 7 different methods to adjust the models. We find that the focal loss function and random oversampling (ROS) give the best performance on the multi-class task and the binary task, respectively, while the cost-sensitive learning method is not suitable for our imbalanced data sets. With confidence estimation, our classifier successfully judges that none of the cat and dog pictures are handwritten-digit images.

Keywords: Deep Learning, CNN, Imbalanced Data Sets, Anomaly Detection, Image Classification

Contents

致謝  i
中文摘要  ii
Abstract  iii
Contents  iv
List of Tables  vii
List of Figures  ix
1 Introduction  1
2 Deep Learning  3
2.1 Neurons and Neural Networks  4
2.2 Activation Function  7
2.3 Loss Function  9
2.4 Gradient Descent Method  10
3 Convolutional Neural Network (CNN)  11
3.1 Convolutional Layer  12
3.2 Max Pooling Layer  12
4 Abnormal Condition and Imbalanced Data Set  14
4.1 Abnormal Condition  14
4.2 Imbalanced Data Set  15
5 Anomaly Detection  17
5.1 Confidence Estimation  17
5.2 Gaussian Distribution  18
5.3 Experiment for Confidence Estimation  20
6 Methods for Imbalanced Data Problem  23
6.1 Data-level Methods  23
6.1.1 Random oversampling (ROS)  23
6.1.2 Synthetic minority over-sampling technique (SMOTE)  24
6.1.3 Random undersampling (RUS)  25
6.2 Algorithm-level Methods  26
6.2.1 Mean false error (MFE)  26
6.2.2 Mean squared false error (MSFE)  27
6.2.3 Focal loss  28
6.2.4 Cost sensitive learning  30
7 Experiment for Multi-classification Task  32
7.1 Baseline Model  33
7.2 Random-Oversampling Model  35
7.3 Synthetic Minority Over-sampling Technique Model  36
7.4 Random-Undersampling Model  37
7.5 Mean False Error Model  38
7.6 Focal Loss Model  39
7.7 Cost Sensitive Learning Model  42
7.8 Result for Multi-classification Task  43
8 Experiment for Binary Classification Task  45
8.1 Baseline Model  45
8.2 Random-Oversampling Model  46
8.3 Synthetic Minority Over-sampling Technique Model  47
8.4 Random-undersampling (RUS)  48
8.5 Mean False Error Model  48
8.6 Mean Squared False Error Model  49
8.7 Focal Loss Model  50
8.8 Cost Sensitive Learning Model  52
8.9 Result for Binary Classification Task  53
9 Conclusion  55
9.1 Contribution  55
9.2 Future Work  55
Appendix A Python Code  56
A.1 Baseline Model  56
A.2 Random-Oversampling Model  68
A.3 Synthetic Minority Over-sampling Technique Model  81
A.4 Random-Undersampling Model  100
A.5 Mean False Error Model  113
A.6 Focal Loss Model  125
A.7 Cost Sensitive Learning Model  138
A.8 Mean Squared False Error Model  145
A.9 Anomaly Detection Model  154
Bibliography  164

List of Tables

5.1 Confidence score of cat and dog pictures  21
5.2 Confidence score of inputs  21
6.1 Confusion matrix  27
6.2 Cost matrix  30
7.1 Average accuracy of M_b  34
7.2 Average accuracy of M_b and M_os  35
7.3 Average accuracy of M_b and M_sm  37
7.4 Average accuracy of every class in M_us  38
7.5 Average accuracy of M_b and M_fe  39
7.6 Average accuracy of M_fl with γ = 0  40
7.7 Average accuracy of M_fl with γ = 0.5  40
7.8 Average accuracy of M_fl with γ = 1  40
7.9 Average accuracy of M_fl with γ = 2  40
7.10 Average accuracy of M_fl with γ = 5  41
7.11 Average accuracy of M_fl with different parameters  41
7.12 Average accuracy of M_b and M_c  42
7.13 Average accuracy of minority class with different models  43
8.1 Average accuracy of M_b2  46
8.2 Average accuracy of M_b2 and M_os2  46
8.3 Average accuracy of M_b2 and M_sm2  47
8.4 Average accuracy of M_b2 and M_us2  48
8.5 Average accuracy of M_b2 and M_fe2  49
8.6 Average accuracy of M_b2 and M_fse  49
8.7 Average accuracy of M_fl2 with γ = 0  50
8.8 Average accuracy of M_fl2 with γ = 0.5  51
8.9 Average accuracy of M_fl2 with γ = 1  51
8.10 Average accuracy of M_fl2 with γ = 2  51
8.11 Average accuracy of M_fl2 with γ = 5  51
8.12 Average accuracy of M_fl2 with different parameters  52
8.13 Average probability predicted from M_b2  53
8.14 Average accuracy of class 0 and 1 with different models  54

List of Figures

2.1 Three steps of deep learning  4
2.2 The Structure of a Neuron  5
2.3 Fully Connected Feedforward Network  6
2.4 Rectified linear unit (ReLU)  7
2.5 Sigmoid function  8
2.6 Hyperbolic tangent (tanh)  8
2.7 Gradient Descent  10
3.1 Structure of CNN  11
3.2 Convolutional operation  12
3.3 Operation of max pooling layer (a)  13
3.4 Operation of max pooling layer (b)  13
5.1 Normal condition for health cell classifier  18
5.2 Anomaly condition for health cell classifier  18
5.3 Car engine scatter diagram  19
5.4 Defective detection  20
5.5 Example of cat and dog pictures  20
6.1 Random oversampling (ROS)  24
6.2 Algorithm of SMOTE  25
6.3 Random undersampling (RUS)  26
6.4 Focal loss  29
7.1 Sample number of MNIST  32
7.2 Sample from MNIST  33
7.3 Imbalanced MNIST  33
7.4 Structure of baseline CNN model  34
7.5 Confusion matrix of M_b  35
7.6 Confusion matrix of M_b and M_os  36
7.7 Average accuracy of M_b and M_os  36
7.8 Confusion matrix of M_b and M_sm  37
7.9 Confusion matrix of M_us  38
7.10 Confusion matrix of M_b and M_fe  39
7.11 Average accuracy of M_fl with different parameters  41
7.12 Confusion matrix of M_b and M_c  42
7.13 Comparison of average accuracy of minority classes with different methods  44
8.1 Binary imbalanced MNIST  45
8.2 Confusion matrix of M_b2  46
8.3 Confusion matrix of M_b2 and M_os2  47
8.4 Confusion matrix of M_b2 and M_sm2  47
8.5 Confusion matrix of M_b2 and M_us2  48
8.6 Confusion matrix of M_b2 and M_fe2  49
8.7 Confusion matrix of M_b2 and M_fse  50
8.8 Average accuracy of M_fl2 with different parameters  52
8.9 Confusion matrix of M_b2 and M_c2  53
8.10 Comparison of average accuracy of class 0 and 1 with different methods  54

Chapter 1
Introduction

Deep learning is a kind of machine learning, and its development is more mature and complete than before [24]. Many fields now use deep learning to address their problems, such as translation, finance, networking, and image recognition [1, 14, 15, 29]. For example, deep learning can translate Chinese into English, or take a car image as input and output the label of the car image. This thesis uses deep learning to recognize handwritten digits from 0 to 9, but the data set is imbalanced, which degrades the predictions of the model. For anomaly detection, we use deep learning to pretrain a handwriting classifier and expect it to judge whether an input is a handwritten-digit picture or not.

The convolutional neural network (CNN) is one of the main deep learning architectures and is good at image recognition [20], video analysis [19], and natural language processing [8]. A CNN has two main kinds of layers: convolutional layers and pooling layers. Convolutional layers capture features of a picture such as shapes or lines, and these features pass through pooling layers to reduce their dimension. The result is then usually flattened and connected to fully connected layers to obtain the output, which is the probability of each label.

In real life we encounter many different situations; some of them are bad or rare, and we call these abnormal conditions, for example machine malfunction [31] or system failure in IT services [39]. Such abnormal conditions do not happen often, but when they do, they may cost a lot of time and money to fix. Identifying an abnormal condition is usually treated as a binary classification problem in deep learning: we classify whether a condition is abnormal or normal.

Imbalanced data sets are common in real life.

For example, suppose we want to predict whether someone will get cancer within three months based on their medical records. Usually, healthy people far outnumber cancer patients; assume their ratio is 9:1. A model that simply predicts that everyone is healthy still achieves 90% accuracy. If accuracy were our only criterion, the model would seem good enough, but it does not actually reach our goal of predicting whether someone will get cancer within three months. This situation is common, and we usually call such an uncommon or bad event an abnormal condition, such as machine malfunction [31] or system failure in IT services [39]. In this thesis, we create two imbalanced subsets of MNIST, which has 10 classes of handwritten-digit images from 0 to 9. In the multi-class task, the minority classes are 0, 1, 4, 6, and 7, and the imbalance ratio is ρ = 2500. The binary task also has imbalance ratio ρ = 2500, with class 0 as the minority class and class 1 as the majority class.

In anomaly detection there are two types of training data: with labels and without labels. If we already have a classifier for normal samples, we want the classifier to judge whether an input is normal or anomalous. We can use confidence estimation to make the classifier output a confidence score and decide whether the input is an anomaly. If we do not have labels, we may assume the data follow a Gaussian distribution and treat the outliers as anomalous samples. We apply confidence estimation to a handwriting CNN classifier and successfully judge that 18 cat and dog pictures are not handwritten-digit pictures.

To address the imbalanced data problem, we collect 7 different methods, divided into two categories: data-level methods and algorithm-level methods. Among the data-level methods we introduce ROS [12, 30], SMOTE [5], and RUS [10, 12, 25], which use sampling to increase or decrease the number of samples so that the data set becomes balanced. Among the algorithm-level methods we introduce three loss functions, MFE [36], MSFE [36], and focal loss [26], as well as cost-sensitive learning [11, 13, 16, 22, 27, 40]; these methods adjust the output or the loss function so that the model becomes sensitive to minority-class samples. We use these methods to improve the performance of two baseline models on imbalanced MNIST, compare the modified models with the baselines in the different tasks, and determine which is the better solution to the problem.

Chapter 2
Deep Learning

Deep learning is a branch of machine learning: it lets the computer learn by itself. Given training data, deep learning is a way to find the most suitable function for the input data. Let us look at some examples to see how such a function works.

f("image of a tree") = "Tree"
f("我很帥") = "I am handsome."
f("How are you?") = "I'm fine, thank you."

When deep learning is used for image recognition [14], we hope the computer can find a function that distinguishes images: if we feed a tree image to the function as input, the computer recognizes it and outputs the label "tree", as in the first function above, where the inputs are images and the outputs are their labels. When we use deep learning for translation [1], we expect the machine to translate the words into the language we want. In the second case, the Chinese sentence "我很帥" ("I am handsome") is the input, and the function outputs "I am handsome."; the Chinese characters are the input and the English sentence translated from them is the output. When deep learning is applied to predicting conversation [38], we hope the computer can tell us what the next step is: we set the sentence "How are you?" as input, and "I'm fine, thank you." is the output, the reply we should give. So in the third function, the input is the current sentence and the output is the responding sentence we should reply with.

After going through these three cases, we understand deep learning much better than before, and it is time to see how it works. Deep learning can be divided into three steps: building the model, selecting a loss function, and training the model, as shown in figure 2.1. Building the model means constructing a neural network, which defines a set of candidate functions according to its structure; in this step we need to decide the structure of the neural network and implement it. To determine which function is best, we select a loss function that evaluates how good each candidate function is, so the second step defines what the "goodness" of a function means. Finally, according to the training data, the computer chooses the best function by training the model.

Figure 2.1: Three steps of deep learning

2.1 Neurons and Neural Networks

Because artificial neural networks are inspired by biological neural networks, their structure is similar to that of real human brains [17]. In the brain, neurons are connected into networks, so artificial neural networks in deep learning likewise connect neurons into networks. In this section, we describe the structure of a neuron and the operations between neurons in a neural network.

As in figure 2.2, a basic neuron consists of inputs, an output, weights, a bias, and an activation function. On the left side of figure 2.2, x1, x2, x3 are the inputs of the neuron; on the right side, y1 is its output. The circle in figure 2.2 is the neuron itself, and the symbol σ inside it is the activation function, which we introduce later. Between the inputs and the neuron are the weights, denoted w1, w2, w3. Finally, at the bottom of figure 2.2 is the bias of the neuron, b1.

Figure 2.2: The Structure of a Neuron

Now we introduce the operation of a neuron. First, each input value is multiplied by its corresponding weight. Then the products and the bias are added together. Finally, this value is passed through the activation function to obtain the output. As a simple instance, set x1, x2, x3, w1, w2, w3, b1 to 4, 2, −3, −1, 3, 2, 3, respectively, and let the activation function be ReLU, which we introduce in detail later. Following the operation above, we have 4 × (−1) + 2 × 3 + (−3) × 2 + 3 = −1. Since ReLU maps negative values to 0, the output is

σ(4 × (−1) + 2 × 3 + (−3) × 2 + 3) = σ(−1) = 0.

In deep learning, the weights and biases are called parameters. Parameters are adjustable: when training the model on training data, the computer adjusts them automatically to fit the data.

After discussing the structure of a neuron and its operation, we introduce the neural network. A neural network contains many neurons, and they can connect to other neurons freely. Figure 2.3 shows a typical neural network model in deep learning.
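Before moving on to full networks, the single-neuron computation above can be written as a short NumPy sketch. This is only an illustration of the weighted sum, bias, and ReLU activation; it is not code from the thesis appendix.

import numpy as np

def relu(z):
    # ReLU sends negative values to 0
    return np.maximum(z, 0.0)

def neuron(x, w, b):
    # weighted sum of inputs plus bias, passed through the activation
    return relu(np.dot(w, x) + b)

x = np.array([4.0, 2.0, -3.0])   # inputs x1, x2, x3
w = np.array([-1.0, 3.0, 2.0])   # weights w1, w2, w3
b = 3.0                          # bias b1

print(neuron(x, w, b))  # 4*(-1) + 2*3 + (-3)*2 + 3 = -1, ReLU gives 0.0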

Figure 2.3: Fully Connected Feedforward Network

The fully connected feedforward network, shown in figure 2.3, is the simplest and earliest model in deep learning [35]. In figure 2.3 there are 3 layers: the first and third layers contain 2 neurons each, and the second layer contains 3 neurons. The numbers of layers and neurons are arbitrary and decided by ourselves. The input layer and output layer are the leftmost and rightmost layers in figure 2.3, respectively; the layers between them are hidden layers, and a network with 3 or more hidden layers is said to have a deep structure. Each neuron in a fully connected feedforward network is connected to all neurons of the adjacent layers, as shown in figure 2.3, which is what "fully connected" means. This model contains no cycles or loops; models with loops are introduced later.

Once we decide on a structure for the neural network, that structure defines a set of functions. The machine then finds the most suitable parameters according to the training data; this procedure is called training, and it is equivalent to choosing one solution from the set of functions. There may be no solution, or only a bad one, if the set of functions defined by the structure is not suitable for the problem. Different problems call for different models, and it is difficult to let the machine decide its model by itself, so this task is still done by us.

2.2 Activation Function

The activation functions introduced below are non-linear. Without activation functions, the output of the neurons in every layer would be a linear combination of the inputs, so the activation function is what gives the network its non-linear behaviour. Let us look at the three most common activation functions in detail.

1. Rectified linear unit (ReLU)
Equation: f(x) = x if x ≥ 0, and f(x) = 0 if x < 0
Range: [0, ∞)

Figure 2.4: Rectified linear unit (ReLU)

2. Sigmoid function
Equation: f(x) = 1 / (1 + e^(−x))
Range: (0, 1)

Figure 2.5: Sigmoid function

3. Hyperbolic tangent (tanh)
Equation: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Range: (−1, 1)

Figure 2.6: Hyperbolic tangent (tanh)
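For reference, the three activation functions above can be written in a few lines of NumPy. This is an illustrative sketch rather than part of the thesis code.

import numpy as np

def relu(x):
    # f(x) = x for x >= 0, 0 otherwise; range [0, inf)
    return np.maximum(x, 0.0)

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); range (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), sigmoid(x), tanh(x))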

2.3 Loss Function

After deciding the connections between the neurons, we need to adjust the parameters, i.e. the weights and biases. The set containing these parameters is called θ, every θ defines a function, and the set collecting all these functions is denoted {F_θ}. We want to find the best function for the problem, i.e. the most suitable parameters; the optimal function is denoted F_θ*. Defining a loss function is the way to find the optimal function. A loss function maps from the parameter space to the real numbers. For example, if the training data consist of k pairs (x_1, y_1), (x_2, y_2), ..., (x_k, y_k) and θ is the set of neural network parameters, i.e. θ = {w_1, w_2, ..., w_n, b_1, b_2, ..., b_m}, then the loss function is a map L : R^(n+m) → R. We use the loss function to evaluate the difference between the real value y and the predicted value F_θ(x). Since we want the predicted value to be close to the real value, we need to minimize the loss function.

Choosing a suitable loss function is important; classification problems, for example, usually use binary cross-entropy. The following are three basic and simple loss functions.

1. Mean absolute error (MAE)
   L(θ) = (1/k) Σ_{i=1}^{k} ||y_i − F_θ(x_i)||

2. Mean squared error (MSE)
   L(θ) = (1/k) Σ_{i=1}^{k} ||y_i − F_θ(x_i)||^2

3. Binary cross-entropy
   L(θ) = −(1/k) Σ_{i=1}^{k} [ y_i log(F_θ(x_i)) + (1 − y_i) log(1 − F_θ(x_i)) ]
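The three loss functions can be sketched directly from the formulas above. The NumPy code below is an illustration that assumes y_true and y_pred are arrays of the same shape; it is not taken from the thesis appendix.

import numpy as np

def mae(y_true, y_pred):
    # mean absolute error
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # mean squared error
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # clip predictions to avoid log(0)
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(mae(y_true, y_pred), mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))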

2.4 Gradient Descent Method

In deep learning, optimization differs from the usual setting. Generally, when optimizing, we know exactly what the data look like and what goal we want to reach. In deep learning we do not know what new cases will look like, so we optimize on the training data and use validation data to test performance. The method used to optimize on the training data is gradient descent, which is the standard approach in deep learning.

Assume the parameters of our neural network are θ = {w_1, w_2, ..., w_n, b_1, b_2, ..., b_m}; we want to find the optimal solution θ* that minimizes the loss function L(θ). The gradient descent method proceeds as follows. First, choose a random value as the initial value of w_1, denoted w_1^(1). Second, compute the first derivative ∂L/∂w_1 at w_1^(1) and update the parameter:

w_1^(2) = w_1^(1) − η (∂L/∂w_1),

where η is the learning rate, introduced below. We continue in this way until ∂L/∂w_1 approaches zero, and likewise for the other parameters, so the iteration takes the form shown in figure 2.7; we repeat it until the norm of the gradient for all parameters is small enough.

Figure 2.7: Gradient Descent

The learning rate represents how far we move toward the minimum in each iteration and is set by ourselves. Usually the learning rate is small, because if it is large, the update may overshoot the minimum or jump back and forth around it. However, if the learning rate is too small, training becomes slow and costs too much time. Therefore, setting the learning rate is an important choice in deep learning.
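As an illustration of the update rule above, the sketch below runs plain gradient descent on a one-dimensional quadratic loss. The loss function, initial value, and learning rate are chosen only for demonstration and are not from the thesis.

import numpy as np

def loss(w):
    # a simple convex loss with minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # derivative of the loss with respect to w
    return 2.0 * (w - 3.0)

w = 10.0      # initial value w^(1)
eta = 0.1     # learning rate
for step in range(100):
    g = grad(w)
    if abs(g) < 1e-6:   # stop when the gradient is small enough
        break
    w = w - eta * g     # w^(t+1) = w^(t) - eta * dL/dw

print(step, w, loss(w))  # w converges to about 3.0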

Chapter 3
Convolutional Neural Network (CNN)

The neural network (NN), convolutional neural network (CNN), and recurrent neural network (RNN) are three basic models in deep learning [24]. The NN is the standard model of deep learning, and the other models are modified versions of it. The CNN is named after its convolutional layers and is good at image recognition and classification problems. The RNN, in which the output of the previous step becomes part of the input of the next step so that the whole model is recursive, is good at processing time-related or sequential data.

The standard CNN structure consists of two main kinds of layers: convolutional layers and pooling layers. Features of the input image are captured in the convolutional layers, and their dimension is reduced in the pooling layers. After that, the result is flattened and passed through fully connected layers, which use softmax as the activation function, to obtain the output probabilities of the labels. Figure 3.1 shows the structure of a CNN in detail, and the following sections introduce the two main layers.

Figure 3.1: Structure of CNN

3.1 Convolutional Layer

A convolutional layer uses filters to capture features of the input image, such as shapes or lines; one filter corresponds to one feature. The convolutional operation, denoted ⊗, is shown in figure 3.2. Each element of the filter is multiplied by the corresponding element of the 3 × 3 block in the upper left of the input image, and the products are summed to give the upper-left element of the feature map. For example, we get 6 in the upper left of the feature map from

1 × 1 + 1 × 2 + 1 × 3 + 0 × 6 + 0 × 7 + 0 × 8 = 6.

Figure 3.2: Convolutional operation

After we get the output 6, the filter moves one grid cell across the input image and the convolutional operation is applied again, giving the output 9. Continuing in this way until every 3 × 3 block of the input image has been convolved with the filter, we obtain the complete feature map shown in figure 3.2.
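The sliding-window sum of products can be sketched as follows. The exact input image and filter of figure 3.2 are not fully recoverable from the text, so the 5 × 5 image and 3 × 3 filter below are assumptions chosen so that the first two outputs reproduce the values 6 and 9 mentioned above; CNN libraries implement this operation (cross-correlation) in the same way.

import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over every position where it fits ("valid" convolution,
    # i.e. cross-correlation as used in CNN layers) and sum the element-wise products
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(1, 26, dtype=float).reshape(5, 5)  # a 5x5 toy "image"
kernel = np.array([[1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])                 # a 3x3 toy filter
print(conv2d_valid(image, kernel))   # first row of the 3x3 feature map is [6, 9, 12]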

3.2 Max Pooling Layer

After obtaining the feature map from the convolutional layer in section 3.1, we use a pooling layer to make the features more compact. Max pooling has three advantages: it reduces the dimension of the input, it helps with image denoising, and translating the image by a few pixels does not change the result.

Figure 3.3 shows the operation of a max pooling layer. Usually we choose a pooling size of 2 × 2 and select the maximum value in each 2 × 2 block; hence we get 24 as the upper-left element of the pooled feature map. Next, the window moves across the feature map and the operation is repeated until the complete pooled feature map is obtained.

Figure 3.3: Operation of max pooling layer (a)

Unlike the convolutional operation, the next block in the feature map does not overlap the previous block, and it may go out of bounds, as shown in figure 3.4.

Figure 3.4: Operation of max pooling layer (b)

Finally, we flatten the pooled feature map obtained from the max pooling layer and send it to the fully connected layers to get the output of the whole model.
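A minimal sketch of 2 × 2 non-overlapping max pooling is shown below; the toy feature map is an assumption for illustration, and the padding with −∞ mimics the out-of-bounds case in figure 3.4.

import numpy as np

def max_pool2d(feature_map, size=2):
    # non-overlapping max pooling: take the maximum of each size x size block;
    # pad the borders with -inf if the map does not divide evenly (window out of bounds)
    h, w = feature_map.shape
    ph, pw = -(-h // size), -(-w // size)          # ceiling division
    padded = np.full((ph * size, pw * size), -np.inf)
    padded[:h, :w] = feature_map
    out = np.zeros((ph, pw))
    for r in range(ph):
        for c in range(pw):
            out[r, c] = padded[r*size:(r+1)*size, c*size:(c+1)*size].max()
    return out

fmap = np.array([[24.,  3.,  5.,  1.],
                 [ 2., 10.,  7.,  8.],
                 [ 6.,  4., 12.,  9.],
                 [ 0., 11., 13., 15.]])
print(max_pool2d(fmap))   # [[24., 8.], [11., 15.]]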

Chapter 4
Abnormal Condition and Imbalanced Data Set

In this thesis, we want to use deep learning to recognize handwritten-digit images. Because the data set is imbalanced, not all predictions interest us equally: we care more about the accuracy on the minority classes, whether in the multi-class task or the binary task. This chapter introduces abnormal conditions and imbalanced data sets and describes the problems they cause in deep learning.

4.1 Abnormal Condition

In real life we encounter many situations; they may be good, bad, or neutral, and we usually call a bad situation an abnormal condition, for example machine malfunction [31], system failure in IT services [39], unexpected situations in radar data corresponding to six military targets [9], and heart failure in electronic health records (EHRs) [6]. Such abnormal conditions do not happen often, but when they do, fixing and addressing them may cost a lot of money and time. We therefore want to predict, detect, and classify abnormal conditions, and we use deep learning to find them, regarding the problem as a binary classification task in which the normal condition is the positive class and the abnormal condition is the negative class. This binary classification task usually suffers from a data imbalance problem, which can make the performance of the model very poor.

To judge the imbalance level of a data set, the following formula [2] gives the imbalance ratio, that is, the ratio of the maximum class size to the minimum class size:

ρ = max_i(|C_i|) / min_j(|C_j|),

where C_i is the set of samples in class i. For example, if the largest class of a data set has 48 samples and the smallest class has 16 samples, then the imbalance ratio is ρ = 3. Note that imbalanced data sets occur not only in binary classification tasks but also in multi-class tasks, which we discuss in the next section. In our work, we create a binary imbalanced data set from MNIST consisting of two classes, 0 and 1. The sample numbers of classes 0 and 1 are 2 and 5000, respectively, which gives an imbalance ratio of ρ = 2500.

4.2 Imbalanced Data Set

An abnormal condition can usually be regarded as a binary classification task consisting of one positive group and one negative group, i.e. the normal condition and the abnormal condition. But imbalanced data sets do not arise only in binary classification; they arise in multi-class classification too. This section discusses imbalanced data in multi-class tasks and describes the problem addressed in this thesis.

Imbalanced data arise in many applications in which the positive class is rare, such as computer security [7], disease diagnosis [32], and image recognition [21]. In this thesis, we create an imbalanced multi-class data set from MNIST with minority classes 0, 1, 4, 6, 7 and imbalance ratio ρ = 2500. If we do not adjust the model, it may achieve good accuracy while failing to serve the purpose we had in mind. For example, suppose we want to predict whether someone will get cancer within three months from their medical records, so our purpose is to classify a person as healthy or as having cancer. Usually healthy people far outnumber cancer patients; assume their ratio is 9:1. Then a model that simply predicts that everyone is healthy still achieves 90% accuracy. If accuracy is our only criterion, the model seems good enough, but it does not actually reach our goal.

Worse, if someone has cancer and the model tells him that he is healthy, it may cause irreparable consequences. To prevent this situation, chapter 6 introduces 7 different methods to address the imbalanced data problem.
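As a small concrete check of the imbalance ratio ρ defined in section 4.1, the snippet below counts class sizes and divides the largest by the smallest; the toy labels reproduce the 48-versus-16 example, and the last line checks the binary MNIST subset (2 versus 5000 samples).

from collections import Counter

def imbalance_ratio(labels):
    # rho = size of the largest class / size of the smallest class
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# toy label list: 48 samples of class "a", 16 of class "b"  ->  rho = 3
labels = ["a"] * 48 + ["b"] * 16
print(imbalance_ratio(labels))        # 3.0

# binary imbalanced MNIST subset in this thesis: 2 zeros vs. 5000 ones
print(5000 / 2)                       # rho = 2500.0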

Chapter 5
Anomaly Detection

Anomaly detection is closely related to the abnormal conditions above: it finds the outliers, novelties, noise, deviations, and exceptions in a data set. It is applied in many fields, such as bank fraud [37], medical problems [33], structural defects [3], and errors in text [28]. We can regard anomaly detection as a binary classification task that classifies samples as normal or anomalous. Usually we are interested in the anomalous condition, but it happens rarely or may not appear in the data set at all, which makes the data highly imbalanced. For example, if we want to detect cancer cells [23], a healthy cell is normal and detecting a cancer cell is the anomalous condition.

Although anomaly detection can be regarded as binary classification, it has some differences and difficulties. First, the distribution of anomalous samples is unknown because there are too many kinds of anomalies; in cancer detection, the cancer cell is the anomaly, but if we feed a car or a tree to the classifier, it should be regarded as anomalous too. Second, it is hard to collect anomalous data because most samples in the data set are normal. Hence there are two kinds of training sets: one is clean, that is, all the data are normal, and the other contains some anomalous data.

5.1 Confidence Estimation

Suppose we already have a classifier for healthy cells and want to use it for anomaly detection. Then, besides the output label, the classifier must output a confidence score [4], which is used to determine whether the sample is a healthy cell or not.

Let x be the input, let f be the classifier, and let c(x) be the confidence score. Given a threshold λ, we decide

f(x) is normal if c(x) > λ, and f(x) is anomalous if c(x) ≤ λ.

For example, with λ = 0.5, the healthy cell classifier in figure 5.1 is highly confident that the blood cell input image is a blood cell; since 0.98 > 0.5, the input is normal.

Figure 5.1: Normal condition for health cell classifier

But as in figure 5.2, when a cancer cell image is used as input, the confidence is low; since 0.34 < 0.5, the input is anomalous.

Figure 5.2: Anomaly condition for health cell classifier

Hence we can use the maximum probability predicted by the classifier as the confidence score and check whether it is greater than λ to judge the condition of the input.

5.2 Gaussian Distribution

In this section, we introduce a method for finding outliers when the training data have no labels. Without labels it is hard to say which samples are anomalous. We believe that anomalous conditions rarely happen.

We therefore assume that the data set follows a Gaussian distribution and regard the outliers of that distribution as the anomalous samples [34]. For example, suppose we want to know whether a car engine from the production line is defective, and we have some information about each engine, namely its temperature and rotation speed. We use this information to draw a scatter diagram in which each point x represents an engine, as shown in figure 5.3.

Figure 5.3: Car engine scatter diagram

In figure 5.3, a point in the middle has a higher probability of occurring than a point in the upper right corner. Hence, given a threshold ε, we can define the normal and anomalous conditions as follows:

x is normal if P(x) > ε, and x is anomalous if P(x) ≤ ε.

Since we assume the data set follows a Gaussian distribution, we can use it to calculate the P(x) we want. We need the mean and variance of each feature x_i, which in our example are the temperature and the rotation speed:

u_i = (1/m) Σ_{j=1}^{m} x_i^(j),    σ_i^2 = (1/m) Σ_{j=1}^{m} (x_i^(j) − u_i)^2,

where m is the number of samples.

After obtaining the mean and variance of each feature, for any new data point x we can use the Gaussian distribution to calculate its probability:

P(x) = Π_{i=1}^{n} (1 / (√(2π) σ_i)) exp(−(x_i − u_i)^2 / (2σ_i^2)).

Hence, given a threshold ε, we can find the outliers in the data set, that is, the points with P(x) < ε. Continuing the previous example, if we take the red line as the threshold ε, we obtain 4 anomalous samples, as shown in figure 5.4.

Figure 5.4: Defective detection

5.3 Experiment for Confidence Estimation

In this section, we use a pretrained handwriting classifier to perform confidence estimation. We want to judge whether an input is a handwritten digit or not, so we collect 18 pictures of cats and dogs as inputs and hope the classifier predicts, via confidence estimation, that they are not handwritten digits. Figure 5.5 shows examples of the cat and dog pictures.

(a) Picture of cat  (b) Picture of dog
Figure 5.5: Example of cat and dog pictures

Our pretrained model is a CNN with very high accuracy on the training set and testing set, 99.98% and 99.35%, respectively. After feeding the cat and dog pictures to the handwriting classifier, we obtain their confidence scores. We set the threshold to λ = 0.6; table 5.1 shows the confidence scores of the cat and dog pictures.

Cat 1  0.2014    Dog 1  0.3286
Cat 2  0.1873    Dog 2  0.3873
Cat 3  0.1689    Dog 3  0.4854
Cat 4  0.5715    Dog 4  0.2744
Cat 5  0.5491    Dog 5  0.2345
Cat 6  0.4481    Dog 6  0.178
Cat 7  0.3339    Dog 7  0.4345
Cat 8  0.3436    Dog 8  0.4574
Cat 9  0.3723    Dog 9  0.3637

Table 5.1: Confidence score of cat and dog pictures

From table 5.1, none of the confidence scores of the cat and dog pictures is greater than 0.6, so our classifier judges that none of the cat and dog pictures is a handwritten-digit picture.

After the classifier successfully recognizes that the cat and dog pictures are not handwritten digits, we want to know whether it can still recognize handwritten digits. Hence we train the model on handwritten digits 0-9 with classes 5 and 6 left out, obtaining 99.3% and 80.96% accuracy on the training set and testing set, respectively. We then perform confidence estimation using the cat and dog pictures and handwritten pictures of 5 and 6 as inputs. Table 5.2 shows their confidence scores.

Cat 1    0.4773    Dog 1    0.9333
Cat 2    0.4771    Dog 2    0.8598
Cat 3    0.3947    Dog 3    0.8056
Cat 4    0.3414    Dog 4    0.9103
Cat 5    0.4708    Dog 5    0.858
Cat 6    0.546     Dog 6    0.7704
Cat 7    0.7042    Dog 7    0.7726
Cat 8    0.7597    Dog 8    0.6686
Cat 9    0.7722    Dog 9    0.5838
Class 5  0.9987    Class 6  0.9993

Table 5.2: Confidence score of inputs

With λ = 0.8, our classifier successfully recognizes classes 5 and 6 as handwritten-digit pictures and recognizes most of the cat and dog pictures as not being handwritten-digit pictures.
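A minimal sketch of the confidence-estimation rule used in this chapter is given below: the confidence score is the maximum softmax probability of the pretrained classifier, and an input is flagged as anomalous when the score does not exceed the threshold λ. The model file name and the placeholder inputs are assumptions for illustration; the actual experiment code is in appendix A.9.

import numpy as np
from tensorflow import keras

def confidence_score(model, x):
    # score = maximum class probability predicted by the softmax classifier
    probs = model.predict(x, verbose=0)
    return probs.max(axis=1)

def is_anomaly(model, x, lam=0.6):
    # anomalous if the confidence score is not greater than the threshold lambda
    return confidence_score(model, x) <= lam

# assumed: a pretrained handwriting CNN saved as "handwriting_cnn.h5",
# and inputs already preprocessed to the model's 28x28x1 format
model = keras.models.load_model("handwriting_cnn.h5")
x = np.random.rand(18, 28, 28, 1).astype("float32")   # placeholder for the 18 cat/dog images
print(is_anomaly(model, x, lam=0.6))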

Chapter 6
Methods for Imbalanced Data Problem

This chapter introduces several methods for addressing the problem of imbalanced data. They fall roughly into two categories, data-level methods and algorithm-level methods [18]; some of them fit the binary classification task and others suit the multi-class task.

6.1 Data-level Methods

In this section, we adjust the number of samples of the different classes until all classes are balanced. We introduce 3 main methods for handling our original imbalanced data sets; all 3 can be used on both the binary and the multi-class classification task.

6.1.1 Random oversampling (ROS)

Random oversampling (ROS) is a popular solution to the imbalanced data problem because it is easy to apply and performs well [12, 30]. ROS randomly chooses samples from the minority classes and copies them until the sample number of each minority class equals that of the majority class. For example, suppose we have three classes A, B, C with 10, 4, and 5 samples, respectively. As shown in figure 6.1, A is the majority class and B and C are the minority classes; after ROS, the sample numbers of B and C equal that of A. A short sketch of this resampling step follows.
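The following is a minimal NumPy sketch of random oversampling, duplicating randomly chosen minority samples until every class reaches the majority-class size. It is an illustration, not the code of appendix A.2.

import numpy as np

def random_oversample(X, y, seed=0):
    # duplicate randomly chosen samples of each minority class
    # until every class has as many samples as the largest class
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        idx = np.where(y == cls)[0]
        extra = rng.choice(idx, size=target - count, replace=True)
        X_parts.append(X[extra])
        y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# classes A, B, C with 10, 4, 5 samples -> all classes end with 10 samples
y = np.array(["A"] * 10 + ["B"] * 4 + ["C"] * 5)
X = np.arange(len(y)).reshape(-1, 1)
X_bal, y_bal = random_oversample(X, y)
print(np.unique(y_bal, return_counts=True))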

Figure 6.1: Random oversampling (ROS)

6.1.2 Synthetic minority over-sampling technique (SMOTE)

Although ROS is the usual solution to the imbalanced data problem, it may cause overfitting because the new balanced data set contains many repeated samples. To avoid this, Chawla et al. created a more advanced method, the Synthetic Minority Over-sampling Technique (SMOTE) [5], which produces new minority samples. Instead of simply copying minority samples like ROS, SMOTE uses an algorithm to create new minority samples and so helps prevent overfitting: it randomly chooses a sample from a minority class, finds its k-nearest neighbors, and creates a new minority sample between them. The algorithm for creating a new minority sample is as follows.

1. Select a minority sample A at random.
2. Find its k-nearest neighbors.
3. Randomly choose one of the k-nearest neighbors and call it B.
4. Create a new sample C = λA + (1 − λ)B, where λ is a random number between 0 and 1.

The k-nearest neighbors are the k samples closest to A; they could be taken from all classes, but we usually choose them within the minority class. The new sample C lies on the line segment between A and B. Continuing in this way, we obtain many artificial minority samples until the sample numbers of all classes are equal. Figure 6.2 shows the algorithm in detail, and the short sketch below illustrates the interpolation step.
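Below is a minimal sketch of the SMOTE interpolation step, generating one synthetic sample from a minority sample A and a randomly chosen neighbor B. In practice an existing implementation (for example, the SMOTE class in the imbalanced-learn package) would be used; this toy version only illustrates C = λA + (1 − λ)B.

import numpy as np

def smote_one_sample(X_min, k=3, seed=0):
    # X_min: array of minority-class samples, one row per sample
    rng = np.random.default_rng(seed)
    a_idx = rng.integers(len(X_min))          # 1. pick a minority sample A at random
    A = X_min[a_idx]
    dists = np.linalg.norm(X_min - A, axis=1) # 2. find its k nearest minority neighbors
    neighbors = np.argsort(dists)[1:k + 1]    #    (skip A itself)
    B = X_min[rng.choice(neighbors)]          # 3. choose one neighbor B at random
    lam = rng.random()                        # 4. interpolate: C = lam*A + (1-lam)*B
    return lam * A + (1.0 - lam) * B

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
print(smote_one_sample(X_min))   # a synthetic minority sample on a segment between A and B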

(a) SMOTE: step 1  (b) SMOTE: step 2  (c) SMOTE: step 3  (d) SMOTE: step 4
Figure 6.2: Algorithm of SMOTE

SMOTE usually performs well on the classification of imbalanced data sets, but since several samples are synthetic, the interpretability of the model is greatly reduced.

6.1.3 Random undersampling (RUS)

Undersampling is another common solution for imbalanced data [12]; like oversampling, it makes all classes have the same sample number. Unlike ROS, random undersampling (RUS) removes data from the majority class until all classes have the same sample number. Some articles use transfer learning with RUS to classify imbalanced data sets of plankton images [25], and some research shows that in some situations RUS performs better than ROS [10]. In practice, however, RUS often performs much worse than ROS because it may delete important data from the majority class. Figure 6.3 shows an example of how random undersampling works.

Figure 6.3: Random undersampling (RUS)

Among the data-level methods, we usually use ROS or SMOTE to eliminate class imbalance, and they work well on small data sets. But if the data set is much bigger or has extreme class imbalance, i.e. ρ is large, oversampling no longer performs well, because it produces too many repeated samples, which leads to overfitting and increases the training time. On the other hand, we believe RUS can obtain better results on large imbalanced data sets, since it reduces the training time and the number of samples in the minority class is already enough for model training.

6.2 Algorithm-level Methods

In this section, we introduce 3 different loss functions and cost-sensitive learning. MFE and focal loss are loss functions that can be used on both binary and multi-class classification. Because of its properties, MSFE can only be used on the binary classification task. Finally, cost-sensitive learning can be used on both classification tasks.

6.2.1 Mean false error (MFE)

Since MFE is inspired by the concepts of false positive rate and false negative rate [36], we first introduce the confusion matrix together with these two concepts. The confusion matrix is a common indicator for judging whether a model is good. As shown in table 6.1, the rows are indexed by the predicted condition and the columns by the actual condition. True positive, false positive, false negative, and true negative mean, respectively, a correct prediction of a positive sample, an incorrect prediction of a negative sample, an incorrect prediction of a positive sample, and a correct prediction of a negative sample.

                               Condition Positive   Condition Negative
Predicted Condition Positive   True Positive        False Positive
Predicted Condition Negative   False Negative       True Negative

Table 6.1: Confusion matrix

For example, suppose we want to classify pictures of dogs and cats, where dog pictures are positive samples and cat pictures are negative samples. If we predict dog pictures as dogs and cat pictures as cats, these conditions are called true positive and true negative, respectively. If we make wrong predictions, taking dog pictures as cats and cat pictures as dogs, these conditions are called false negative and false positive, also known as Type II error and Type I error, respectively.

The loss function MFE focuses on the false conditions and is much more sensitive to them than MSE. MFE combines two errors, the mean false positive error (FPE) and the mean false negative error (FNE):

FPE = (1/N) Σ_{i=1}^{N} ||y_i − F_θ(x_i)||^2
FNE = (1/P) Σ_{i=1}^{P} ||y_i − F_θ(x_i)||^2
MFE = FPE + FNE,

where N and P are the numbers of negative and positive samples, respectively, the first sum runs over the negative samples, and the second over the positive samples.

In this thesis we also apply MFE to the multi-class task, so we modify the MFE loss function as follows:

MFE = Σ_{c=1}^{C} F_c E = Σ_{c=1}^{C} (1/N_c) Σ_{i=1}^{N_c} ||y_i − F_θ(x_i)||^2,

where N_c is the number of samples in class c and c = 1, 2, ..., C, with C the number of classes.

6.2.2 Mean squared false error (MSFE)

Since we want high classification accuracy on the positive class, the false negative error must be quite low [36]. Hence Wang et al. designed an improved loss function, the mean squared false error (MSFE), which is more sensitive than MFE to the error on the positive class. MFE only minimizes the sum of FPE and FNE, which is not enough to guarantee high classification accuracy on the positive class.

Usually, FPE contributes more of the loss than FNE in MFE, since in an imbalanced data set there are many more negative samples than positive samples. Hence MFE is not sensitive enough to the error on the positive class. To solve this problem, MSFE was proposed, with the following formula:

MSFE = FPE^2 + FNE^2 = (1/2)((FPE + FNE)^2 + (FPE − FNE)^2).

As shown above, MSFE minimizes (FPE + FNE)^2 and (FPE − FNE)^2 at the same time, so the errors of the positive class and the negative class are minimized together. As a result, we get the same effect as MFE while also guaranteeing the accuracy of the positive class. However, this property of minimizing both (FPE + FNE)^2 and (FPE − FNE)^2 does not carry over to the multi-class task, so we only use MSFE on the binary classification task.

6.2.3 Focal loss

Lin et al. proposed a new loss function, the focal loss, which is modified from the traditional cross-entropy (CE) [26]. Focal loss reduces the weight of samples that are easy to classify and lets the model focus on samples that are hard to classify. Here we follow Lin et al. and derive the loss function from CE to focal loss. In binary classification, the binary cross-entropy is

L(θ) = −(1/k) Σ_{i=1}^{k} [ y_i log(F_θ(x_i)) + (1 − y_i) log(1 − F_θ(x_i)) ].

For convenience, we write

L(θ) = L(p_i) = −(1/k) Σ_{i=1}^{k} log(p_i),  where p_i = F_θ(x_i) if y_i = 1 and p_i = 1 − F_θ(x_i) if y_i = 0.

Then, to control the weight of the positive and negative samples in the loss, we add a new hyperparameter α_i, where α_i = α when y_i = 1 and α_i = 1 − α when y_i = 0. Usually the negative samples far outnumber the positive samples, i.e. the number of indices i with y_i = 0 is larger than the number with y_i = 1, so we set α between 0.5 and 1 to increase the weight of the positive samples in the loss.

Hence, we have

L_α(p_i) = −(1/k) Σ_{i=1}^{k} α_i log(p_i),  where p_i = F_θ(x_i) if y_i = 1 and p_i = 1 − F_θ(x_i) if y_i = 0.

Although L_α(p_i) can control the weight of the positive and negative samples in the loss, it cannot control the weight of samples that are easy or hard to classify, so Lin et al. proposed the focal loss:

FL(p_i) = −(1/k) Σ_{i=1}^{k} (1 − p_i)^γ log(p_i),  with p_i defined as above.

We call γ the focusing parameter, with γ ≥ 0, and (1 − p_i)^γ the modulating factor. When a sample is wrongly classified, p_i is small and (1 − p_i)^γ approaches 1; for example, if y_i = 0 and the classification is wrong, then p_i must be less than 0.5, so (1 − p_i)^γ is close to 1, and the loss does not change much compared with the original binary cross-entropy. Conversely, when p_i approaches 1, the modulating factor is small, so the loss it contributes is small too. Focal loss thus restricts the loss of samples that are easy to classify, which raises the relative contribution of samples that are hard to classify. Figure 6.4 shows the loss for different focusing parameters γ; note that focal loss reduces to the traditional cross-entropy when γ = 0.

Figure 6.4: Focal loss
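A minimal NumPy sketch of the binary focal loss defined above follows; setting γ = 0 recovers the ordinary binary cross-entropy. It illustrates the formula only and is not the loss implementation used in appendix A.6.

import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    # p_i = predicted probability of the true class
    p = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    p = np.clip(p, eps, 1.0 - eps)
    # FL = -(1/k) * sum (1 - p_i)^gamma * log(p_i); gamma = 0 gives cross-entropy
    return -np.mean((1.0 - p) ** gamma * np.log(p))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.6, 0.4])        # predicted probability of class 1
print(focal_loss(y_true, y_pred, gamma=0.0))   # equals binary cross-entropy
print(focal_loss(y_true, y_pred, gamma=2.0))   # easy samples are down-weighted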

Finally, Lin et al. found that combining L_α(p_i) and FL(p_i) both adjusts the weight of the positive and negative samples in the loss and controls the loss of samples that are easy to classify. So we have

FL_α(p_i) = −(1/k) Σ_{i=1}^{k} α_i (1 − p_i)^γ log(p_i),  with p_i defined as above.

In this thesis, we use focal loss on the multi-class task and do not set α_i as above. Instead we set α = (α_1, α_2, ..., α_C), where C is the number of classes, so the focal loss for the multi-class task is

MFL_α(F_θ(x_i)) = −(1/k) Σ_{i=1}^{k} Σ_{c=1}^{C} α_c y_ic (1 − F_θ(x_i)_c)^γ log(F_θ(x_i)_c).

6.2.4 Cost sensitive learning

In binary classification we often treat all classification errors as equally costly, but this is wrong in the real world. For example, suppose we have a classifier that decides whether someone has cancer, where a positive sample means having cancer and a negative sample means not. If we predict that someone has cancer when in reality he does not, it may not be a big deal. But if someone has cancer and the prediction says he does not, he may die because of our wrong classification. Hence the costs of different misclassifications are not the same, and they change with the situation [11, 13].

Cost-sensitive learning assigns a different cost to each kind of misclassification, as shown in the cost matrix for binary classification in table 6.2 [27]. In the cost matrix, C(c, j) is the cost of misclassifying a sample of the c-th class as the j-th class; note that C(c, j) = 0 when c = j.

                     Predicted Condition Negative   Predicted Condition Positive
Condition Negative           C(0, 0)                        C(0, 1)
Condition Positive           C(1, 0)                        C(1, 1)

Table 6.2: Cost matrix

There are many ways to apply cost-sensitive learning to deep learning [13]. In this thesis we use threshold moving, applied in the testing stage after the classifier has already been trained [11, 22, 40]. In general, we set a threshold t for the classifier: if the output is greater than t, the sample is considered positive, otherwise negative, and usually t = 0.5. Threshold moving adds the concept of misclassification cost and moves the threshold t so that the minority class becomes easier to classify.

Threshold moving adds the cost of misclassification to this rule and moves the threshold t so that the minority class becomes easier to classify correctly. In the case of the cancer patient, threshold moving shifts the threshold toward the inexpensive misclassification, i.e., predicting that someone has cancer when he does not, so that the samples with high misclassification costs become hard to misclassify. Threshold moving trains on the original data set and only adds the idea of cost sensitive learning in the testing stage. The output of our neural network is F_θ(x_{i,c}) for all c = 1, 2, ..., C, which is a probability, where C is the number of classes. After threshold moving we obtain a new output F_θ^*(x_{i,c}) for all c = 1, 2, ..., C, and the sample is assigned to the class c^* = arg max_c F_θ^*(x_{i,c}).

Clearly, the probabilities sum to 1, so we have

\[
\sum_{c=1}^{C} F_\theta(x_{i,c}) = \sum_{c=1}^{C} F_\theta^*(x_{i,c}) = 1
\]

The concept of cost sensitive learning enters through the effect of C(c, j): the new output is obtained by the following equation,

\[
F_\theta^*(x_{i,c}) = \beta \sum_{j=1}^{C} F_\theta(x_{i,c})\, C(c, j),
\]

where β is a normalizing parameter chosen so that F_θ^*(x_{i,c}) satisfies the probability constraint \sum_{c=1}^{C} F_\theta^*(x_{i,c}) = 1.

Cost sensitive learning can also be applied through sampling [16], changing the learning rate [22], modifying the output [13], or creating a new loss function [13, 27]. Through the ratios of the different C(c, j), we can decide how many samples to duplicate or delete when building a new data set. Changing the learning rate or modifying the output leads the neural network to pay attention to the costs during the training stage. On the other hand, instead of a traditional loss function, we can train the neural network by minimizing the total cost of misclassification. There is still much room for further work: in the future, we can combine cost sensitive learning with other methods to address imbalanced data and compare the performance of the different models.
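A minimal NumPy sketch of this threshold-moving rule: the trained network's probabilities are rescaled with the cost matrix exactly as in the equation above, renormalized by β, and the predicted class is the argmax of the rescaled output. The 3-class cost matrix and probabilities below are purely illustrative, not values from the thesis.

```python
import numpy as np

def threshold_moving(probs, cost):
    """Rescale class probabilities by misclassification cost at test time.

    probs: (n_samples, C) softmax outputs F_theta(x_{i,c})
    cost:  (C, C) matrix with C(c, j) = cost of predicting class j for a
           true class-c sample, and C(c, c) = 0
    Returns the rescaled probabilities F*_theta and the predicted classes.
    """
    # F*(x_{i,c}) = beta * sum_j F(x_{i,c}) * C(c, j), as in the text
    scaled = probs * cost.sum(axis=1)                 # multiply by sum_j C(c, j) per class
    beta = 1.0 / scaled.sum(axis=1, keepdims=True)    # renormalize so each row sums to 1
    rescaled = beta * scaled
    return rescaled, rescaled.argmax(axis=1)

# Illustrative example: mistakes on class 0 (a minority class) cost 10 times more.
cost = np.array([[0., 10., 10.],
                 [1.,  0.,  1.],
                 [1.,  1.,  0.]])
probs = np.array([[0.30, 0.35, 0.35]])
print(threshold_moving(probs, cost))   # class 0 now wins despite a lower raw probability
```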

Chapter 7

Experiment for Multi-classification Task

In this chapter, we introduce the structure of our models. We design an imbalanced data set from MNIST with imbalance rate ρ = 2500 and use a CNN for training. This model M_b is our baseline model; it performs poorly because of the imbalanced data, and we then use 7 different methods to build 7 further models, to show that our solutions work and to compare their performance.

MNIST is a data set of handwritten digit images in 10 classes, the digits 0 to 9. There are 60000 images in the training set and 10000 images in the testing set. MNIST is balanced on both the training set and the testing set, and every class has more than 5000 samples. Figure 7.1 shows the number of samples of each class in the training set and the testing set.

(a) Training set  (b) Testing set
Figure 7.1: Sample number of MNIST

Each image has size 28 × 28 and its label is the digit it shows, as illustrated in figure 7.2.
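For reference, a short sketch that reproduces these per-class counts with the Keras built-in MNIST loader; the variable names are ours.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()   # 60000 / 10000 images of 28 x 28
print(x_train.shape, x_test.shape)                          # (60000, 28, 28) (10000, 28, 28)

# Number of samples of each digit class, as plotted in figure 7.1
print(np.bincount(y_train))   # every class has more than 5000 samples
print(np.bincount(y_test))
```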

Figure 7.2: Sample from MNIST

7.1 Baseline Model

In this section we create an imbalanced data set from MNIST by randomly choosing 5 minority classes, namely 0, 1, 4, 6, 7, with imbalance rate ρ = 2500. Figure 7.3 shows the number of samples in the imbalanced MNIST.

Figure 7.3: Imbalanced MNIST
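The text does not list the exact per-class counts of this imbalanced set, so the following is only a sketch of the general recipe (the thesis's actual construction may also reduce the majority classes): keep the majority classes and randomly keep roughly 1/ρ of each minority class, with ρ = 2500. The rounding rule and random seed are assumptions.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()

rho = 2500                      # imbalance rate used in the thesis
minority = {0, 1, 4, 6, 7}      # the 5 randomly chosen minority classes
rng = np.random.default_rng(0)  # seed is an assumption

keep = []
for c in range(10):
    idx = np.where(y_train == c)[0]
    if c in minority:
        n_keep = max(1, round(len(idx) / rho))   # only a few samples per minority class
        idx = rng.choice(idx, size=n_keep, replace=False)
    keep.append(idx)

keep = np.concatenate(keep)
x_imb, y_imb = x_train[keep], y_train[keep]
print(np.bincount(y_imb, minlength=10))   # highly imbalanced counts, cf. figure 7.3
```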

After creating the imbalanced data set, we train a CNN on it and expect poor performance, so that we can then apply other methods to improve it. In this baseline model, we have three convolutional layers, each followed by a max pooling layer. After the input image passes through these layers, it is flattened and fed through two dense layers, and the final output represents the probability that the input belongs to each class. We set the filter size in all convolutional layers to 3 × 3, the pooling size to 2 × 2, and the dropout rate to 0.2, and the layers contain 32, 64, 128, 200, and 10 neurons, respectively. We choose ReLU as the activation function in every convolutional layer and in the first dense layer; in the second dense layer we use softmax because we want a probability output. Figure 7.4 shows the structure of the CNN model.

Figure 7.4: Structure of baseline CNN model

We use stochastic gradient descent as our optimizer, set the learning rate η = 0.05, and use mean square error (MSE) as the loss function. The number of training samples is 15016, the batch size is 100, and we train for 200 epochs. M_b is trained on the imbalanced training set we created but tested on the original balanced testing set. After 5 training runs, we find that M_b appears to perform very well, with 99.04% average accuracy on the training set, but we suspect this is an illusion because the average accuracy on the testing set is only 48.63%. Table 7.1 shows the performance of M_b, namely the average accuracy on the training set, the testing set, and each minority class, and in the confusion matrix in figure 7.5 we can see that classes 0, 1, 4, 6, 7 perform badly because their sample numbers are small.

        Training   Testing   0     1     4     6     7
M_b     99.04%     48.63%    0%    0%    0%    0%    0%

Table 7.1: Average accuracy of M_b
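A Keras sketch of the baseline CNN described above: three 3 × 3 convolution blocks with 2 × 2 max pooling and 32, 64, 128 filters, a 200-unit ReLU dense layer, and a 10-unit softmax output, trained with SGD at η = 0.05 and MSE loss. The padding, the placement of the dropout layer, and the input preprocessing are not specified in the text and are assumptions here.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.utils import to_categorical

def build_baseline():
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.2),                     # dropout rate 0.2; placement is an assumption
        layers.Dense(10, activation="softmax"),  # probability output over the 10 classes
    ])
    model.compile(optimizer=optimizers.SGD(learning_rate=0.05),
                  loss="mse", metrics=["accuracy"])
    return model

# model = build_baseline()
# model.fit(x_imb[..., None] / 255.0, to_categorical(y_imb, 10),
#           batch_size=100, epochs=200)   # x_imb, y_imb from the sketch in section 7.1
```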

Figure 7.5: Confusion matrix of M_b

To improve M_b, we use 7 different methods to adjust the model and compare their effects.

7.2 Random-Oversampling Model

In this section, we adjust the training data set so that it becomes balanced. As introduced in section 6.1.1, we randomly choose samples from the minority classes and copy them until the sample number of each minority class equals the sample number of the majority class. After ROS, the data set is balanced; we train on it with the same structure as M_b and call this new model M_os. The model M_os also performs well on the training set, with 99.76% average accuracy, and its performance on the testing set is better than M_b, namely 79.44% average accuracy; we surmise that overfitting happened, which is why the accuracy is still not good enough. Table 7.2 compares the average accuracy of M_os and M_b and shows that M_os is better than the baseline model M_b.

        Training   Testing   0        1        4        6       7
M_b     99.04%     48.63%    0%       0%       0%       0%      0%
M_os    99.76%     79.44%    74.11%   89.56%   37.47%   56.8%   40.44%

Table 7.2: Average accuracy of M_b and M_os

We can see that ROS improves the accuracy on the minority classes 0, 1, 4, 6, 7, as shown in the confusion matrices in figure 7.6.
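A sketch of ROS using the imbalanced-learn package (the choice of implementation is an assumption; the thesis does not say how the resampling was coded).

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

def oversample(x, y, seed=0):
    """Randomly duplicate minority samples until every class matches the majority count."""
    flat = x.reshape(len(x), -1)                  # imblearn expects 2-D feature arrays
    x_res, y_res = RandomOverSampler(random_state=seed).fit_resample(flat, y)
    return x_res.reshape((-1,) + x.shape[1:]), y_res

# x_ros, y_ros = oversample(x_imb, y_imb)   # x_imb, y_imb from the sketch in section 7.1
# print(np.bincount(y_ros))                 # every class now has the majority count
```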

(a) M_b  (b) M_os
Figure 7.6: Confusion matrix of M_b and M_os

7.3 Synthetic Minority Over-sampling Technique Model

Different from ROS, here we create artificial minority samples to prevent overfitting. We follow the algorithm in section 6.1.2 and produce minority samples until the data set is balanced. Figure 7.7 shows a new class-1 sample produced by SMOTE; it looks like two pictures of class 1 overlapped.

Figure 7.7: A class-1 sample generated by SMOTE

Similarly, we use the same structure as M_b to train our model M_sm and get 99.73% average accuracy on the training set; the testing set performs similarly to ROS, with 77.54% average accuracy.
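A corresponding sketch with SMOTE from imbalanced-learn (again, the implementation choice is an assumption). With only a couple of real samples per minority class, k_neighbors has to be very small; interpolating between a pair of real images is consistent with the overlapped look of the synthetic digit in figure 7.7.

```python
from imblearn.over_sampling import SMOTE

def smote_balance(x, y, k_neighbors=1, seed=0):
    """Generate synthetic minority samples by interpolating between real ones.
    k_neighbors must be smaller than the minority class size; with k_neighbors=1
    each synthetic digit is a blend of two real images."""
    flat = x.reshape(len(x), -1)
    x_res, y_res = SMOTE(k_neighbors=k_neighbors, random_state=seed).fit_resample(flat, y)
    return x_res.reshape((-1,) + x.shape[1:]), y_res

# x_sm, y_sm = smote_balance(x_imb, y_imb)   # x_imb, y_imb from the sketch in section 7.1
```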

The comparison between M_b and M_sm is shown in table 7.3.

        Training   Testing   0        1        4       6        7
M_b     99.04%     48.63%    0%       0%       0%      0%       0%
M_sm    99.73%     76.8%     60.85%   82.07%   39.5%   53.94%   37.33%

Table 7.3: Average accuracy of M_b and M_sm

Figure 7.8 compares the confusion matrices of M_b and M_sm; we can see that SMOTE also improves the accuracy of the minority classes.

(a) M_b  (b) M_sm
Figure 7.8: Confusion matrix of M_b and M_sm

Although the performance of M_sm is better than M_b, SMOTE creates fake samples, and since some of them do not look like real digits, the interpretability of the model is greatly reduced and the performance is still not perfect.

7.4 Random-Undersampling Model

Sections 7.2 − 7.3 used oversampling; in this section we use undersampling to try to address the poor performance caused by the imbalanced data set. As in section 6.1.3, we delete samples of the majority classes until the data set is balanced. Similarly, we use the same CNN structure to train our model M_us, but since it is hard to converge, we increase the number of training epochs to 500. M_us gets very low accuracy in many classes on the testing set, as shown in table 7.4, and its confusion matrix in figure 7.9 also shows the bad predictions.
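A sketch of RUS with imbalanced-learn (the implementation choice is ours): majority samples are discarded at random until every class is as small as the smallest minority class, which is why so little training data remains.

```python
from imblearn.under_sampling import RandomUnderSampler

def undersample(x, y, seed=0):
    """Randomly drop majority samples until every class is as small as the smallest class."""
    flat = x.reshape(len(x), -1)
    x_res, y_res = RandomUnderSampler(random_state=seed).fit_resample(flat, y)
    return x_res.reshape((-1,) + x.shape[1:]), y_res

# x_rus, y_rus = undersample(x_imb, y_imb)   # only a handful of images per class remain
```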

Class   0       1        2        3    4    5    6    7    8    9
M_us    56.5%   95.42%   59.49%   0%   0%   0%   0%   0%   0%   0%

Table 7.4: Average accuracy of every class in M_us

Figure 7.9: Confusion matrix of M_us

We believe that the bad performance of RUS is due to the training set being too small after undersampling, so our model cannot capture the features of the pictures. Hence, we consider RUS unsuitable for our problem; it is more appropriate for a larger data set, which still has enough samples to train on after RUS. However, RUS is not without advantages: it saves training time, training about 180 times faster than M_b.

7.5 Mean False Error Model

Different from the data-level methods in sections 7.2 − 7.4, in this section we use the new loss function MFE introduced in section 6.2.1. MFE is more sensitive to the loss contributed by the minority classes. We use the same structure as M_b to train our model M_fe; it also reaches a high average accuracy of 99.92% on the training set, and its performance on the testing set is again better than M_b, with 75.55% average accuracy. The comparison between M_b and our mean false error model M_fe is shown in Table 7.5, and the confusion matrix in figure 7.10 shows that MFE increases the accuracy of the minority classes.
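Section 6.2.1 defines MFE precisely; as a hedged illustration only, the sketch below implements one common formulation, the per-sample squared error averaged separately within each class and then summed over classes, so the handful of minority samples is not drowned out by the majority classes. The per-sample error term and the fixed class count are assumptions and may differ from the thesis's exact definition.

```python
import tensorflow as tf

def mean_false_error(y_true, y_pred):
    """One formulation of MFE: average the per-sample error within each class
    separately, then sum the class means, so every class weighs equally."""
    per_sample = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)  # squared error per sample
    class_id = tf.argmax(y_true, axis=-1)                             # true class of each sample
    total = 0.0
    for c in range(10):                                               # 10 MNIST classes
        mask = tf.cast(tf.equal(class_id, c), per_sample.dtype)
        n_c = tf.reduce_sum(mask)
        mean_c = tf.math.divide_no_nan(tf.reduce_sum(per_sample * mask), n_c)
        total += mean_c
    return total

# Usage sketch: model.compile(optimizer="sgd", loss=mean_false_error)
```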

         Training   Testing   0       1        4        6        7
M_b      99.04%     48.63%    0%      0%       0%       0%       0%
M_fe     99.92%     75.55%    58.3%   81.13%   36.47%   44.58%   38.59%

Table 7.5: Average accuracy of M_b and M_fe

(a) M_b  (b) M_fe
Figure 7.10: Confusion matrix of M_b and M_fe

7.6 Focal Loss Model

As in section 7.5, we use a new loss function: focal loss makes the model more sensitive to the minority classes. Another advantage of focal loss is that it reduces the contribution of samples that are easy to classify, as mentioned in section 6.2.3. In this section we try different values of the focusing parameter γ and of α. Different from Lin et al., α in our model is not between 0 and 1; we set α = (a, a, 1, 1, a, 1, a, a, 1, 1), which gives weight a to the minority classes and 1 to the majority classes, where a = 1, 5, 10, 50, 100. On the other hand, we set γ = 0, 0.5, 1, 2, 5. Note that when a = 1 and γ = 0, the focal loss is the familiar categorical cross-entropy.
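A sketch of the multi-class focal loss MFL_α with this kind of per-class weight vector; the implementation itself is ours, and the usage line only indicates how one point of the (a, γ) grid below could be run.

```python
import numpy as np
import tensorflow as tf

def multiclass_focal_loss(alpha, gamma=2.0, eps=1e-7):
    """MFL_alpha = -(1/k) sum_i sum_c alpha_c * y_ic * (1 - F(x_ic))^gamma * log F(x_ic)."""
    alpha = tf.constant(alpha, dtype=tf.float32)
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        per_class = alpha * y_true * tf.pow(1.0 - p, gamma) * tf.math.log(p)
        return -tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))
    return loss

# Weight a for the minority classes 0, 1, 4, 6, 7 and 1 for the other classes.
a = 100.0
alpha = np.where(np.isin(np.arange(10), [0, 1, 4, 6, 7]), a, 1.0)
# model.compile(optimizer="sgd", loss=multiclass_focal_loss(alpha, gamma=0.0))
```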

The performance of M_fl with the different parameter settings is as follows.

1. γ = 0:

         Training   Testing   0        1        4        6        7
a=1      99.99%     74.46%    55.13%   78.26%   37.04%   41.87%   36.12%
a=5      99.99%     77.88%    65.04%   85.9%    44.84%   46.53%   39.49%
a=10     99.99%     78.29%    65.99%   87.8%    44.1%    44.25%   43.46%
a=50     99.99%     79.31%    64.18%   86.42%   55%      47.53%   43.65%
a=100    99.97%     79.73%    68.13%   88.52%   48.18%   53.87%   42.27%

Table 7.6: Average accuracy of M_fl with γ = 0

2. γ = 0.5:

         Training   Testing   0        1        4        6        7
a=1      99.98%     74.83%    54.68%   80.93%   38.42%   41.39%   36.04%
a=5      99.98%     77.96%    63.19%   86.88%   41.99%   51.66%   39.21%
a=10     99.98%     77.64%    68.56%   88.22%   40.5%    45%      37.13%
a=50     99.99%     78.05%    60.58%   86.14%   50.26%   46.61%   41.12%
a=100    99.98%     77.84%    64.56%   88.04%   40.62%   49.99%   38.86%

Table 7.7: Average accuracy of M_fl with γ = 0.5

3. γ = 1:

         Training   Testing   0        1        4        6        7
a=1      99.98%     74.07%    55.22%   81.06%   36.96%   35.8%    32.35%
a=5      99.98%     75.91%    57.58%   85.33%   38.28%   41.89%   38.84%
a=10     99.97%     77.65%    63.48%   88.1%    38.97%   50.47%   38.34%
a=50     99.98%     78.24%    61.4%    86.25%   51.29%   46.17%   40.77%
a=100    99.97%     77.23%    63.4%    84.89%   43.33%   47.92%   36.82%

Table 7.8: Average accuracy of M_fl with γ = 1

4. γ = 2:

         Training   Testing   0        1        4        6        7
a=1      99.95%     72.85%    55.77%   80.18%   36.22%   27.15%   32.05%
a=5      99.97%     75.34%    59.41%   82.77%   38.04%   38.72%   37.48%
a=10     99.98%     74.65%    61.18%   83.16%   33.35%   32.54%   38.86%
a=50     99.98%     76.38%    58.36%   84.18%   42.31%   42.41%   40.3%
a=100    99.96%     77.15%    63.89%   85.84%   43%      44.54%   38.4%

Table 7.9: Average accuracy of M_fl with γ = 2

5. γ = 5:

         Training   Testing   0        1        4        6        7
a=1      99.83%     69.21%    41.07%   75.34%   24.8%    25.36%   28.73%
a=5      99.85%     71.61%    48.46%   78.69%   28.99%   27.59%   35.15%
a=10     99.88%     72.23%    53.32%   79.48%   31.76%   27.86%   33.22%
a=50     99.81%     73.22%    57.66%   78.3%    38.97%   29.93%   31.95%
a=100    99.83%     72.58%    49.62%   85.31%   34.84%   31.54%   28.43%

Table 7.10: Average accuracy of M_fl with γ = 5

Table 7.11 shows the average accuracy on the testing set with the different parameters a and γ in model M_fl. We find that the model performs best when a = 100 and γ = 0.

          a=1       a=5       a=10      a=50      a=100
γ=0       74.46%    77.88%    78.29%    79.31%    79.73%
γ=0.5     74.83%    77.96%    77.64%    78.05%    77.84%
γ=1       74.07%    75.91%    77.65%    78.24%    77.23%
γ=2       72.85%    75.34%    74.65%    76.38%    77.15%
γ=5       69.21%    71.68%    72.23%    73.22%    72.58%

Table 7.11: Average accuracy of M_fl with different parameters

Furthermore, we find that almost every model gets higher accuracy when a is larger and γ is smaller. The details are shown in figure 7.11.

Figure 7.11: Average accuracy of M_fl with different parameters
