Combining Ensemble Technique of Support Vector Machines with the Optimal Kernel Method for High Dimensional Data Classification


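National Taichung University of Education, Graduate Institute of Educational Measurement and Statistics. Master of Science Thesis. Advisor: Dr. 郭伯臣. Combining Ensemble Technique of Support Vector Machines with the Optimal Kernel Method for High Dimensional Data Classification. Graduate student: 陳怡伶. July 2011 (Republic of China year 100).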


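Acknowledgments

I thank the Lord for filling me with gratitude for, and appreciation of, all the growth and learning of these three years of master's study. From my first contact with an unfamiliar yet challenging research field, and out of responsibility for my own decision, I moved from arduous, head-down study at the beginning to developing methods with drive and determination. For everything accumulated along the way I owe thanks to the guidance of my teachers, the company of my research partners, and the unfailing support of my family, which allowed me to persevere on this academic path, complete this thesis, and find a direction for further study.

First, I thank my advisor, Professor 郭伯臣, who played an extremely important role in my research, not only guiding how I approached methods but also constantly reminding me of the proper attitude toward work. Professor 黃孝雲 of Fu Jen Catholic University, Professor 劉湘川 of Asia University, my senior 吳慧珉 of the National Academy for Educational Research, my senior 李政軒 of National Chiao Tung University, and the graduated seniors 鈞翔, 文俊, 志勝, 士勛, and 仁傑 all shared valuable experience and opinions with me in both life and research. I also thank the three members of my thesis oral examination committee, Professor 張志永 of National Chiao Tung University, Professor 陶金旭 of National Chung Hsing University, and Assistant Professor 楊晉民 of the Department of Mathematics Education, for their support and guidance; their many effective suggestions made this thesis more complete.

On the eve of graduation, looking back on a wonderful time in the master's program, I thank all the laboratory members who worked together and created unforgettable memories: my classmate 林辰育, who trained and struggled alongside me for three full years; the graduated seniors 佳穎, 任婕, 秀聿, 銘豪, and 境蔚; the ambitious doctoral seniors 軒博, 筱倩, 佳樺, 智為, and 育隆; my senior 典佑, who persevered to earn his doctorate; my juniors 敏嫻, 淑瑜, and 俊彥, who worked hard on their own master's theses; the new laboratory members 偉民, 韋任, 宗恩, and 芷寧; and my former roommates 慧珊 and 孟君, who gave me many opportunities to learn. With everyone's company I gained much outside my own research field, and my life became richer.

Finally, I especially thank 依蕾, who polished my drafts, and my family and my boyfriend 鎧誌, who always gave me the greatest support and affirmation so that I could persist without giving up. With everyone's help, and thanks to the Lord, I was admitted to the Computer Science and Information Engineering and Biomedical Engineering programs at National Taiwan University and the Electrical and Control Engineering program at National Chiao Tung University. Whatever the future holds, I will carry everyone's blessings with me and give what I have learned back to society as soon as I can. I dedicate this thesis to everyone who helped me during my master's studies.

Respectfully, Graduate Institute of Educational Measurement and Statistics, National Taichung University of Education, July 2011 (Republic of China year 100).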


Abstract

In recent years, support vector machines (SVM) and classifier combination have been widely and successfully used to improve the classification of high-dimensional data and to mitigate the Hughes phenomenon. Many studies have demonstrated that multiple classifier systems, also called ensembles, can alleviate the problems caused by small sample sizes and high dimensionality and can obtain better and more robust results than single models. Examples are the random subspace method (RSM) and the dynamic subspace method (DSM), both of which are effective approaches for generating an ensemble of diverse base classifiers from different feature subsets. In addition, SVM, which is regarded as a useful and effective classifier, can serve as the base classifier in these two methods to achieve higher classification accuracy. However, the performance of SVM depends greatly on choosing a proper kernel function or proper parameters of a kernel function. Therefore, the objectives of this research are to develop an SVM-based ensemble technique using the optimal kernel method and to propose a novel subspace selection mechanism, named the kernel-based dynamic subspace method (KDSM). KDSM combines the optimal kernel method with the advantages of DSM to improve SVM-based classification. The experimental results show that the proposed method performs better than the other conventional methods; moreover, compared with DSM, it not only improves classification accuracy but also reduces computation time.

Keywords: pattern recognition, dynamic subspace method, optimal kernel, SVM.

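Abstract (Chinese version)

In recent years, support vector machines and classifier fusion have been widely and successfully applied to improve recognition performance on high-dimensional data and to address the problems caused by the Hughes phenomenon. Many studies have confirmed that multiple classifier systems, such as the random subspace method and the dynamic subspace method, which build a group of diverse base classifiers by generating different feature subspaces, can ease the small-sample, high-dimensionality concern and achieve better recognition than a single classifier. Many studies have also shown that the support vector machine is a sound and effective classifier, and that it attains good classification accuracy when used as the base classifier in the two multiple classifier systems above. However, the main factor governing the classification performance of a support vector machine is the kernel function. Selecting a suitable kernel function, or suitable parameters for a kernel function, is therefore very important for support vector machines. This study integrates the strengths of the methods above: for the support vector machine classifier, it uses an optimal kernel method to develop a multiple classifier system suited to high-dimensional data analysis, and it proposes a multiple support vector machine that incorporates the optimal kernel method to select the subspace dimensionality and the subspace features automatically. An automatic method for selecting the best parameter of the RBF kernel function is used to find a kernel-induced space suitable for classifying data of each dimensionality. The concept of the dynamic subspace method is introduced into the subspace selection step by adding two importance density distribution functions, one for automatically selecting the subspace dimensionality and one for selecting the features of that subspace, with the aim of strengthening the recognition performance of the existing dynamic subspace method. The experimental results show that the proposed approach performs better in selecting a suitable kernel function and that, compared with DSM, it achieves clear improvements in both shortening computation time and raising recognition accuracy.

Keywords: pattern recognition, dynamic subspace method, optimal kernel function, support vector machine.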

Table of Contents
CHAPTER 1: INTRODUCTION
  1.1 Statement of Research
  1.2 Organization of Thesis
  1.3 Major Notation and Acronyms
CHAPTER 2: LITERATURE REVIEW
  2.1 Ensemble Method
    2.1.1 Random subspace method
    2.1.2 Dynamic subspace method
  2.2 Support Vector Machine
    2.2.1 Kernel Method
    2.2.2 SVM Algorithm
  2.3 An Optimal Kernel Method for selecting RBF Kernel Parameter
CHAPTER 3: KERNEL BASED DYNAMIC SUBSPACE METHOD
  3.1 Importance Distribution of Band Membership
  3.2 Importance Distribution of Dimensionality Weight
  3.3 Optimal Kernel-based Dynamic Subspace Ensemble
CHAPTER 4: EXPERIMENTAL DESIGN AND RESULTS
  4.1 Experimental Design
    4.1.1 Datasets of experiment
  4.2 Experimental Results
CHAPTER 5: CONCLUSION AND FUTURE WORK
APPENDIX A: THE TEST OF “SECTOR” UNIT
REFERENCES

List of Tables
Table 2.1. The algorithm of RSM
Table 2.2. The algorithm of DSM
Table 3.1. The algorithm of KDSM
Table 4.1. The description of algorithms used for comparison.
Table 4.2. Number of samples in the Washington DC Mall dataset used for experiments.
Table 4.3. Nine categories of remedial instructions.
Table 4.4. Number of subjects (samples) in the educational measurement dataset used for experiments.
Table 4.5. The average classification accuracy ± standard deviation and average computer time of ten test data in Washington DC Mall dataset.
Table 4.6. The outcome of classification in educational measurement dataset.
Table 4.7. The accuracies and corresponding ratios of KDSM to DSM with relation to requisite time for classification in Washington DC Mall dataset.

List of Figures
Figure 1.1. Hughes phenomenon (Hughes, 1968). With a fixed number of training samples $N$, the accuracy first begins to rise with $n$. Ultimately, it falls back as $n/N \to \infty$.
Figure 2.1. (a)-(c) The three fundamental reasons why an ensemble is used in place of single classifier from the aspects of statistical, computational and representational.
Figure 2.2. Four levels to construct classifier ensembles (Kuncheva, 2004).
Figure 2.3. The framework of RSM based on SVM.
Figure 2.4. The framework of DSM based on SVM.
Figure 2.5. The concept of SVM supervised algorithm.
Figure 3.1. The framework of KDSM.
Figure 3.2. OJ vs. accuracy of classification with samples on each bands.
Figure 4.1. (a) The original size color IR image of the urban site in Washington, DC Mall, U.S.
Figure 4.1. (b) The color IR image of the selected region based on the Washington, DC Mall dataset.
Figure 4.1. (c) The identified class according to the corresponding color IR Washington, DC Mall image.
Figure 4.2. Experts’ structure of “Sector” unit.
Figure 4.3. (a)-(e) are the results about the classified map by using five methods based on SVM respectively with $N_i = 20$ on Washington DC Mall dataset.
Figure 4.4. (a)-(e) are the results about the classified map by using five methods based on SVM respectively with $N_i = 40$ on Washington DC Mall dataset.
Figure 4.5. (a)-(e) are the results about the classified map by using five methods based on SVM respectively with $N_i = 300$ on Washington DC Mall dataset.


CHAPTER 1: INTRODUCTION

1.1 Statement of Research

In pattern recognition, the Hughes phenomenon (Hughes, 1968), shown in Figure 1.1, is a well-known classification issue. Given a finite, fixed amount of training data, increasing the number of features offers the potential to discriminate classes in more detail, but this benefit is gradually diluted by poor parameter estimation. Moreover, the classification result may degrade when the number of training samples is small compared with the number of features (Hughes, 1968; Bellman, 1961; Raudys & Jain, 1991). A general assumption is that enough training samples are available to obtain reasonably accurate class descriptions. However, gathering enough training samples to estimate these quantitative descriptions accurately is difficult and expensive, and the assumption is frequently unsatisfied, so high-dimensional data are prone to the Hughes phenomenon. To overcome this difficult classification problem, many studies (Bertoni, Folgieri, & Valentini, 2004a; Kuo, Hsieh, Liu & Chao, 2005; Yang, Kuo, Yu & Chung, 2010) applied ensemble techniques to classifiers to alleviate the problems arising from small sample sizes and high dimensionality. These researchers demonstrated that, on hyperspectral image datasets, multiple classifier systems based on subspace selection can classify better than a single classifier by using different feature subsets to generate an ensemble of diverse base classifiers. Among these methods, the dynamic subspace method (DSM) proposed by Yang, Kuo, Yu & Chung (2010) performs outstandingly and is more useful than one of the well-known multiple classifier systems, the random subspace method (Ho, 1998a; 1998b), because it successfully remedies that method's shortcomings.

Figure 1.1. Hughes phenomenon (Hughes, 1968). [Plot: mean recognition accuracy versus measurement complexity $n$ (total discrete values), with one curve per training sample size $N$ from 2 to $\infty$.] With a fixed number of training samples $N$, the accuracy first begins to rise with $n$. Ultimately, it falls back as $n/N \to \infty$.

In addition, among various classifiers, the support vector machine (SVM) has been used as the base classifier for every subspace in DSM and has achieved better, even the best, classification results on high-dimensional datasets (Yang et al., 2010). However, the kernel function critically influences the performance of SVM, so choosing a proper kernel function, or a better parameter of the kernel function, is quite important for SVM, though generally very time-consuming. In particular, the two importance distributions of features and of subspace dimensionality are the keys to DSM, but DSM spends much time obtaining the resubstitution accuracy, computed with the SVM in each subspace, as the index for drawing these two distributions. Thus both the accuracy and the computation time of DSM should improve if a more suitable index is used to estimate the distributions and a faster method is applied to select the kernel parameter for each SVM. For these reasons, this study takes a systematic approach to developing a novel SVM-based ensemble technique that strengthens the competitiveness of DSM in high-dimensional data classification. By combining the advantages of ensemble techniques with the optimal kernel-based algorithm (Li et al., 2010), the study proposes novel importance indexes for features and subspace dimensionality and, at the same time, finds the proper parameter of the radial basis function (RBF) kernel for SVM automatically. The purposes of this research are listed below.

1. To draw the importance distributions of band membership and subspace dimensionality by using an automatic method of selecting the proper parameter of the RBF kernel function, replacing the resubstitution classification accuracy or the class separability of Fisher’s linear discriminant analysis (LDA) (Fukunaga, 1990) used in the dynamic subspace method (Yang, Kuo, Yu & Chung, 2010).
2. To develop a novel dynamic subspace method, based on the optimal kernel-based method, that constructs adaptive subspaces for support vector machines by selecting the profitable bands and deciding the dimensionality of each subspace, thereby fixing the shortcomings of conventional subspace methods.
3. To compare the classification accuracy and computation time of different subspace methods using support vector machines, and to explore the influence of different training sample sizes on classification results.

1.2 Organization of Thesis

Chapter 1: States the problem and the purpose of this research, then lists the organization and notation of the study.

Chapter 2: Reviews, in three sections, the background and related studies on techniques and tools commonly used for classification under limited samples and high dimensionality. Section 2.1 discusses two ensemble methods, the random subspace method (RSM) and the dynamic subspace method (DSM), which can outperform a single classifier; they motivate the use of multiple classifier systems (MCS) and serve as comparisons for the method proposed in this research. Section 2.2 presents the well-known kernel-based classifier, the support vector machine, whose supervised classification accuracy is higher than or at least equal to that of other classifiers, and which serves as the base classifier in all the ensembles used in this study. Section 2.3 describes an effective method that automatically selects the proper RBF kernel parameter; this optimal kernel-based method is used to develop a novel multiple classifier system in the following chapter.

Chapter 3: Proposes a novel subspace selection mechanism, named the kernel-based dynamic subspace method. It is an SVM-based ensemble technique that automatically determines the dimensionality and selects the component dimensions of diverse subspaces via an optimal kernel-based method adapted to each base classifier.

Chapter 4: Reports real-data experiments that apply the single classifier and the multiple classifier systems, including the dynamic subspace method and the kernel-based dynamic subspace method, all based on support vector machines. The computation time and classification accuracy of the different techniques for constructing component classifiers in their subspaces are explored and compared.

Chapter 5: Presents comprehensive conclusions and expected developments of this research.

1.3 Major Notation and Acronyms

$X = [X_1, X_2, \ldots, X_{N_X}]$: A training dataset with $N_X$ samples
$X_i = [x_1, x_2, \ldots, x_d]$ ($i = 1, 2, \ldots, N_X$): A training sample with $d$ dimensions
$Y = [Y_1, Y_2, \ldots, Y_{N_Y}]$: A testing dataset with $N_Y$ samples
$Y_i = [y_1, y_2, \ldots, y_d]$ ($i = 1, 2, \ldots, N_Y$): A testing sample with $d$ dimensions
$d$: Dimensionality
$L$: Number of classes
$N_X$: Number of total training samples
$N_Y$: Number of total testing samples
$N_i$ ($i = 1, 2, \ldots, L$): Number of samples in class $i$
$S$: Number of base classifiers
KS: Kernel smoothing
MCS: Multiple classifier system
SVM: Support vector machine
RSM: Random subspace method
DSM: Dynamic subspace method
KDSM: Kernel-based dynamic subspace method

CHAPTER 2: LITERATURE REVIEW

In a typical supervised classification task, the size of the training data fundamentally affects how well classifiers generalize when the Hughes phenomenon is encountered. To address this issue, many studies (Christopher, 2004; Ham et al., 2005; Chuang, et al., 2008) used multiple classifier systems such as the random subspace method (RSM) and the dynamic subspace method (DSM) and successfully demonstrated that, under small sample sizes and high dimensionality, they give better and more robust results than a single classifier. Unfortunately, an ensemble inevitably takes more time to classify than a single classifier. In particular, selecting the proper kernel function is time-consuming when support vector machines (SVM) are used as the base classifiers in the ensemble, even though SVM was more accurate than other classifiers in subspace-selection-based multiple classifier systems (MCS) for remote sensing data classification (Yang, Kuo, Yu & Chung, 2010). In this chapter, section 2.1 introduces the advantages of ensemble methods for classification, and RSM and DSM are presented in sections 2.1.1 and 2.1.2, respectively. Section 2.2 then describes the SVM, which is used in the MCS as the base classifier, and its important technique, the kernel trick. Finally, section 2.3 covers the basic concept of the optimal kernel-based method and an automatic method of selecting the RBF kernel parameter, which is used to develop a novel ensemble method with more effective classification results.

2.1 Ensemble Method

Using a learning algorithm to construct a set of classifiers, and then classifying an unknown pattern (testing sample) by taking a vote of their predictions, is one kind of ensemble method. By learning several models instead of one and integrating them in some way, the error rate of a learner can be greatly reduced and more correct predictions can be made. The techniques of multiple classifier systems expanded rapidly over the last decade, from the original ensemble method, Bayesian averaging, which was based on Bayesian learning theory (Buntine, 1990; Neal, 1993; Bernardo & Smith, 1994), to more recently proposed algorithms including error-correcting output coding (Kong & Dietterich, 1995), bagging (Breiman, 1996) and boosting (Freund & Schapire, 1996). Furthermore, many well-known studies (Drucker et al., 1994; Quinlan, 1996; Freund & Schapire, 1996; Maclin & Opitz, 1997; Bauer & Kohavi, 1999) have confirmed that ensembles are powerful for classification.

To explain why classifier ensembles often perform better than any single classifier, Dietterich (2000) offered three fundamental reasons, statistical, computational, and representational, as shown in Figure 2.1. The statistical reason in Figure 2.1 (a) is that a learning algorithm can find a set of good classifiers, here $h_1$, $h_2$, and $h_3$, in the classifier space $H$, all of which give the same good accuracy on the training data. Constructing an ensemble out of all the good classifiers and combining their outputs to approximate the best classifier $h$ for the problem is therefore a safer option, because it avoids the risk of adopting a single inadequate classifier when data are insufficient.

Figure 2.1. (a)-(c) The three fundamental reasons why an ensemble is used in place of single classifier from the aspects of statistical, computational and representational. [Diagram: three panels (a)-(c), each showing the classifier space $H$ with classifiers $h_1$, $h_2$, $h_3$ and the best classifier $h$; panel (a) marks the set of “good” classifiers.]

Furthermore, finding the best classifier is still computationally very difficult even if there is enough training data. The computational reason in Figure 2.1 (b) is that the training of each classifier starts somewhere in $H$ and, with poorly chosen initial values or an improper search method, may get stuck in a local optimum. An ensemble built from searches with different starting points may therefore provide a better approximation to $h$ than any individual classifier. The final reason is representational, as shown in Figure 2.1 (c). When the best classifier falls outside the classifier space, no classifier that the learning algorithm can produce in $H$ represents the true function, which is the case in most applications. Again, an ensemble might be a better choice for such problems.

The flowchart in Figure 2.2 illustrates the four levels involved in constructing an MCS. First, the dataset can be modified by sampling different objects (Level A) or different features (Level B) to generate various data subsets and feature subsets, respectively, which can be used to train different classifiers. Next, the base classifiers can be built with any learning algorithm (Level C). Finally, different fusion techniques can be applied to combine the classifier decisions (Level D).

Figure 2.2. Four levels to construct classifier ensembles (Kuncheva, 2004). [Flowchart: the dataset $X$ yields subsets $\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(S)$, which train classifiers $h_1, h_2, \ldots, h_S$, whose outputs feed a combiner. A. Data level: use different data subsets. B. Feature level: use different feature subsets. C. Classifier level: use different base classifiers. D. Combination level: design different combiners.]

This study focuses on one effective approach to generating an ensemble of diverse base classifiers, namely using different feature subsets, as in the RSM (Ho, 1998a; 1998b), the weighted random subspace method (WRSM) (Kuo, Hsieh, Liu & Chao, 2005), and the DSM (Yang, Kuo, Yu & Chung, 2010). Building on the improvement that DSM made over RSM in Level B, a novel multiple classifier system will be developed that employs SVM for all base classifiers in Level C and uses simple majority voting to combine all classifier decisions in Level D.
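The Level D combiner named above, simple majority voting, can be illustrated with a minimal Python sketch. The function name `majority_vote` and the tie rule (the smallest label wins) are assumptions made for illustration; the text does not specify how ties are broken.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per test sample.

    predictions[b][k] is the label that base classifier b assigns to sample k.
    Ties go to the smallest label (an assumption; the text gives no tie rule).
    """
    n_samples = len(predictions[0])
    decisions = []
    for k in range(n_samples):
        votes = Counter(pred[k] for pred in predictions)
        top = max(votes.values())
        decisions.append(min(label for label, c in votes.items() if c == top))
    return decisions

# Three base classifiers, four test samples.
preds = [
    [1, 2, 3, 1],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
]
print(majority_vote(preds))  # prints [1, 2, 3, 3]
```

Because the combiner only sees predicted labels, it is independent of how the base classifiers in Level C were trained.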

2.1.1 Random subspace method

The random subspace method (RSM) proposed by Ho (1998a, 1998b) is a multiple classifier system that alleviates small-sample-size and high-dimensionality concerns by constructing classifiers in random subspaces. RSM is a general technique that can be applied with any type of base classifier, such as the decision tree (Ho, 1998a; Banfield, Hall, Bowyer, & Kegelmeyer, 2007; Sun, Zhang, & Zhang, 2007), the k-nearest-neighbor classifier (Ho, 1998b; Sun et al., 2007), linear classifiers (Skurichina & Duin, 2001; Skurichina & Duin, 2002), the maximum likelihood classifier (Kuo, Pai, Sheu, & Chen, 2004), and the support vector machine (Bertoni, Folgieri, & Valentini, 2004a, 2004b; Tao, Tang, Li, & Wu, 2006; Sun et al., 2007). In this study, only SVM is considered as the base classifier. To study the usefulness of combining techniques for SVMs in relation to RSM, let every $d$-dimensional training sample $X_i = [x_{i1}, x_{i2}, \ldots, x_{id}]$ with a class label $\ell_i \in \{1, 2, \ldots, L\}$ constitute a training dataset $X = [X_1, X_2, \ldots, X_{N_X}]$, where $i = 1, 2, \ldots, N_X$, and $N_X$ and $L$ are the numbers of training samples and classes, respectively. A framework of RSM that uses SVM as the base classifier is illustrated in Figure 2.3.

Figure 2.3. The framework of RSM based on SVM. [Flowchart: the original dataset $X$ ($N_X$ samples in $d$-dimensional space); set $w$; random feature selection, repeated $S$ times, yields the reduced datasets $\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(S)$ ($N_X$ samples in $w$-dimensional space); these train $SVM_1, SVM_2, \ldots, SVM_S$ (the ensemble of $S$ classifiers), whose decisions are combined.]

For all subspace-selection-based multiple classifier systems in this study, the main idea is to construct $S$ classifiers, each trained on a reduced-dimensional dataset $\tilde{X}(b) = [\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_{N_X}]$, where $b = 1, 2, \ldots, S$, used as a new training dataset, and to combine them into the final decision. What distinguishes the multiple classifier systems from one another is how they obtain the reduced-dimensional dataset through their feature selection processes. In RSM, a predefined subspace dimensionality, $w < d$, has to be set before the random feature selection (RFS) process. RFS randomly selects $w$ features from the original $d$-dimensional feature space to form a training sample

$$\tilde{X}_i = RFS(X_i, w) = [x_{i m_1}, x_{i m_2}, \ldots, x_{i m_w}],$$

where $m_j \in \{1, 2, \ldots, d\}$ and $j = 1, 2, \ldots, w$. The reduced-dimensional dataset $\tilde{X}(b)$ is then passed to a learning algorithm $\Psi$ to build a classifier $SVM_b = \Psi(\tilde{X}(b))$, and these steps are repeated $S$ times until all the ensemble classifiers $SVM_1, SVM_2, \ldots, SVM_S$ have been trained, each on its own $w$-dimensional dataset, as $\Psi(\tilde{X}(1)), \Psi(\tilde{X}(2)), \ldots, \Psi(\tilde{X}(S))$. Finally, each classifier labels the testing dataset $Y$, and a simple majority vote on every testing sample $Y_k$, where $k = 1, 2, \ldots, N_Y$ and $N_Y$ is the number of testing samples, gives the final decision

$$A_k = \arg\max_{\ell \in \{1, 2, \ldots, L\}} \mathrm{card}(\{\, b \mid SVM_b(Y_k) = \ell \,\}),$$

where $\mathrm{card}(Q)$ denotes the cardinality of the set $Q$ and $b = 1, 2, \ldots, S$. The algorithm of RSM is summarized in Table 2.1.

Table 2.1. The algorithm of RSM
A. Training Procedure
Begin
  for $b = 1, 2, \ldots, S$
    $\tilde{X}(b) = RFS(X, w)$
    $SVM_b = \Psi(\tilde{X}(b))$
  end
End
B. Classification Procedure
$A_k = \arg\max_{\ell \in \{1, 2, \ldots, L\}} \mathrm{card}(\{\, b \mid SVM_b(Y_k) = \ell \,\})$, where $b = 1, 2, \ldots, S$ and $k = 1, 2, \ldots, N_Y$

Ho (1998a) summarized the results of RSM as follows: “the subspace method is better in some cases, about the same or worse in other cases when compared to the other two forest building techniques (bagging and boosting)” and “the subspace is best when the dataset has a large number of features and samples, and it is not good when the dataset has very few features coupled with a very small number of samples or a large number of classes”. In addition, Skurichina & Duin (2002) concluded that “the usefulness of the RSM, as well as the efficient dimensionality of random subspaces, also depends upon the level of redundancy in the data feature space”.

However, RSM has two inadequacies. The first is the lack of an explicit rule for the subspace dimensionality. From experimental results, Ho (1998a) suggested that desirable results can be obtained by setting the subspace dimensionality to approximately half of the original space size when the decision tree is the base classifier in RSM. But this rule does not suit every kind of classifier. For example, Ho (1998b) also found that, with k-nearest-neighbor base classifiers, the best accuracy appears at different subspace dimensionalities for different choices of k; moreover, without an appropriate subspace dimensionality for the employed classifier, RSM might be inferior to a single classifier. The second inadequacy is the unprincipled rule for selecting the features that make up the reduced-dimensional dataset in each subspace. Each individual feature potentially has a different discriminative power for classification, but the randomized selection strategy in RSM amounts to a uniform distribution over the selection probabilities of all features; therefore, it cannot distinguish informative features from redundant ones.

These flaws in RSM raise two questions. One is how to choose a suitable subspace dimensionality for the employed classifier, and the other is how to select the features that are advantageous for classification.
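The training and classification procedures of Table 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: to stay dependency-free, a nearest-class-mean learner stands in for the SVM learning algorithm Ψ, feature indices are 0-based, and all function names are hypothetical.

```python
import random
from collections import Counter

def rfs(d, w, rng):
    """Random feature selection: w distinct feature indices out of d (0-based)."""
    return sorted(rng.sample(range(d), w))

def project(X, idx):
    """Keep only the selected features of every sample."""
    return [[x[m] for m in idx] for x in X]

def train_centroid(X, labels):
    """Stand-in base learner (NOT an SVM): one mean vector per class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(X, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict_centroid(model, x):
    """Label of the nearest class mean (squared Euclidean distance)."""
    return min(sorted(model),
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def rsm_train(X, labels, w, S, seed=0):
    """Table 2.1, part A: train S classifiers on random w-dimensional subspaces."""
    rng = random.Random(seed)
    d = len(X[0])
    ensemble = []
    for _ in range(S):
        idx = rfs(d, w, rng)
        ensemble.append((idx, train_centroid(project(X, idx), labels)))
    return ensemble

def rsm_predict(ensemble, Y):
    """Table 2.1, part B: majority vote (ties go to the smallest label)."""
    out = []
    for y in Y:
        votes = Counter(predict_centroid(model, [y[m] for m in idx])
                        for idx, model in ensemble)
        top = max(votes.values())
        out.append(min(l for l, c in votes.items() if c == top))
    return out

# Toy data: class 1 lies near 0 and class 2 near 5 in every feature.
X = [[0, 0, 1, 0], [1, 0, 0, 1], [5, 6, 5, 5], [6, 5, 5, 6]]
labels = [1, 1, 2, 2]
ensemble = rsm_train(X, labels, w=2, S=5)
print(rsm_predict(ensemble, [[0, 1, 0, 0], [5, 5, 6, 5]]))  # prints [1, 2]
```

Each ensemble member stores its feature indices next to its model, because a test sample must be projected onto the same subspace before that member can classify it.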

2.1.2 Dynamic subspace method

The dynamic subspace method (DSM) proposed by Yang, Kuo, Yu & Chung (2010) aims to overcome the two drawbacks of RSM noted at the end of the previous section. In DSM, two distributions are added to the RFS process to form the dynamic feature selection (DFS) process, shown as the red part of Figure 2.4. That is, to construct component classifiers with adaptive subspaces and so remedy the shortcomings of RSM, DSM works on the basis of the $W$ and $R$ distributions, which denote the distributions of feature weights and of subspace dimensionality, respectively. Figure 2.4 illustrates the whole framework of DSM with SVM as the base classifier.

Figure 2.4. The framework of DSM based on SVM. [Flowchart: the original dataset $X$ ($N_X$ samples in $d$-dimensional space) enters dynamic feature selection, which uses the distribution of subspace dimensionality $R_{j-1}$ and the distribution of feature weightings $W$ to produce reduced datasets $\tilde{X}(b+1), \tilde{X}(b+2), \ldots, \tilde{X}(b+S)$ of dimensionalities $r_{b+1}, r_{b+2}, \ldots, r_{b+S}$; these train $SVM_{b+1}, SVM_{b+2}, \ldots, SVM_{b+S}$ (the ensemble of $S$ classifiers), whose decisions are combined; an update process revises the $R$ distribution.]

During the DFS process (see Figure 2.4), the number of desired features, $r_{b+j}$, is determined automatically from the $R_{j-1}$ distribution, and the component features that form the reduced-dimensional dataset $\tilde{X}(b+j)$ of the $j$-th subspace are selected with probabilities given by the $W$ distribution.
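The two draws that make up dynamic feature selection can be sketched as follows. This is a hedged illustration: the function names are hypothetical, feature indices are 0-based, and drawing features one at a time without replacement, with probability proportional to the remaining weights, is one plausible reading of "selected with probabilities given by the W distribution" rather than a confirmed detail of DSM.

```python
import random

def draw_dimensionality(R, rng):
    """Draw a subspace size r in 1..d, where R[r - 1] is proportional to P(r)."""
    return rng.choices(range(1, len(R) + 1), weights=R, k=1)[0]

def wfs(W, r, rng):
    """Weighted feature selection: r distinct feature indices (0-based), drawn
    one at a time with probability proportional to the remaining weights.
    Assumes at least r features have a positive weight."""
    remaining = list(range(len(W)))
    chosen = []
    for _ in range(r):
        pick = rng.choices(remaining, weights=[W[m] for m in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)

def dfs(W, R, rng):
    """Dynamic feature selection: draw the size from R, then the features from W."""
    return wfs(W, draw_dimensionality(R, rng), rng)

rng = random.Random(0)
W = [0.4, 0.1, 0.3, 0.2]   # feature weights (hypothetical values)
R = [0.1, 0.5, 0.3, 0.1]   # probabilities of subspace sizes 1..4 (hypothetical)
print(dfs(W, R, rng))      # a sorted list of distinct feature indices
```

Setting every entry of `W` to the same value and putting all the mass of `R` on one size recovers the random feature selection of RSM, which shows how DSM generalizes it.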

(26) Compared with RSM, not only W and R distributions are proposed to model the selected probability of each feature and make the subspace dimensionality is neither predefined nor a fixed number in subspace selection procedure but also the update process of R distribution is introduced into DSM in each overproduction. Notice that the numbers of SVMs, b and S , have to be set at the start for obtaining the initial distribution R0 and limiting the amount of SVMs in the ensemble, respectively. In the cause of more direct understanding the detail of this method, see the algorithm of DSM that is summarized in Table 2.2. Table 2.2. The algorithm of DSM A. Training Procedure Begin for i = 1, 2, ..., b  (i − 1) × ( d − 1)  ri = 1 +   b −1  ~ X (i ) = WFS ( X ,W , ri ) ~ SVM i = Ψ ( X (i )) Estimate φ ( SVM i ) end R0 = KS ( [ rk , φ ( SVM k )] ) , where k = 1, 2, , b . for j = 1, 2, ..., S Draw rb + j ~ R j −1 ~ X (b + j ) = WFS ( X ,W , rb + j ) ~ SVM b + j = Ψ ( X (b + j )) Estimate φ ( SVM b + j ) R j = KS ( [ rk , φ ( SVM k )] ) , where k = 1, 2, , b, , b + j . end End B. Classification Procedure. Ak = arg max ∈{1, 2 ,, L} card ( j | SVM b + j (Yk )) = ). 16.

(27) where $j = 1, 2, \ldots, S$ and $k = 1, 2, \ldots, N_Y$.

According to Kuo et al. (2005), the first for loop in Table 2.2 estimates the initial importance distribution of subspace dimensionality, and the second produces the ensemble of classifiers. Following this algorithm, once a subspace dimensionality r is given, the reduced-dimensional dataset is obtained as $\tilde{X} = WFS(X, W, r)$, where WFS denotes feature selection based on W. The probability mass functions of $W_{ACC}$ and $W_{LDA}$ are defined for each individual feature by the resubstitution accuracy of the training samples, as in Equation (2.1), and by the normalized class separability of Fisher's linear discriminant analysis (Fukunaga, 1990), as in Equation (2.2), respectively.

$$p(w_m) = \frac{\phi_m}{\sum_{k=1}^{d} \phi_k} \tag{2.1}$$

where $w_m \overset{iid}{\sim} W_{ACC}$, $\phi_m$ is the resubstitution classification accuracy of the training dataset with the m-th feature alone, and $m = 1, 2, \ldots, d$.

$$p(w_m) = \frac{J_m}{\sum_{k=1}^{d} J_k} \tag{2.2}$$

$$J_m = \mathrm{trace}\left( (S_m^w)^{-1} S_m^b \right) = (S_m^w)^{-1} S_m^b$$

where $S_m^w$ and $S_m^b$ are the within-class scatter matrix and the between-class scatter matrix, respectively, estimated from the training dataset with the m-th feature alone.
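As an illustration of Equation (2.2) and of the WFS step in Table 2.2, the following is a minimal Python sketch, not the implementation used in this study. It assumes that the scatter values are plain sums of squares (for a single feature, $J_m$ reduces to the ratio of between-class to within-class scatter) and that WFS draws features without replacement; the function names `fisher_weights` and `weighted_feature_selection` and the toy data are hypothetical.

```python
import random


def fisher_weights(X, y):
    """Return p(w_m) = J_m / sum_k J_k, with J_m = S_b / S_w for feature m alone."""
    classes = sorted(set(y))
    scores = []
    for m in range(len(X[0])):
        column = [row[m] for row in X]
        grand_mean = sum(column) / len(column)
        s_within = 0.0
        s_between = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = sum(values) / len(values)
            s_within += sum((v - class_mean) ** 2 for v in values)
            s_between += len(values) * (class_mean - grand_mean) ** 2
        # Assumed convention: zero within-class scatter gives score 0
        # instead of a division by zero.
        scores.append(s_between / s_within if s_within > 0 else 0.0)
    total = sum(scores)
    return [s / total for s in scores]


def weighted_feature_selection(weights, r, rng):
    """Draw r distinct feature indices, each draw proportional to the weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(r):
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


# Toy data: features 0 and 2 separate the two classes, feature 1 barely does.
X = [[0.0, 1.0, 5.0], [0.2, 3.0, 5.5], [1.0, 1.4, 9.0], [1.2, 3.0, 9.5]]
y = [0, 0, 1, 1]
p = fisher_weights(X, y)
subspace = weighted_feature_selection(p, r=2, rng=random.Random(0))
print(p, subspace)
```

In this sketch the weakly separating feature receives a probability close to zero, so it rarely enters a subspace, which is the behavior the weighted strategy is meant to produce.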

(28) Additionally, the R distribution records the importance of every possible subspace size. Let the subspace dimensionality, r, be an outcome of the R distribution with probability function $f(r)$, where $1 \le r \le d$. $f(r)$ is obtained by applying the SVM to estimate the resubstitution accuracy in r-dimensional space and then smoothing the estimates by kernel smoothing (KS) (Parzen, 1962; Silverman, 1985).

Although DSM remedies two disadvantages of RSM, some issues remain. The first is that neither RSM nor DSM defines the ensemble size. In theory and in practice, the more classifiers are trained, the more accurate and robust the result tends to be; however, there is a trade-off between the ensemble size and the computation time. The second issue is how to construct training sub-datasets that suit classification in the subspace selection procedure. DSM achieves better accuracy than RSM by replacing the randomization-based strategy with a weight-based strategy in feature selection. Nevertheless, which of the two W distributions, $W_{ACC}$ or $W_{LDA}$, gives the best accuracy varies from dataset to dataset. Moreover, DSM is time-consuming because updating the R distribution requires the base classifier to estimate the resubstitution accuracy in every subspace. Training an SVM in particular takes much longer than training other base classifiers, since five-fold cross-validation and a grid search are used to find the best SVM parameters. Therefore, DSM must spend much more time to reach sufficient accuracy when SVM is the base classifier.

(29) Instead of directly searching for the best ensemble size and feature combination for every kind of base classifier, this study focuses on SVM and on reducing the computation time of the feature and dimensionality selection processes. Furthermore, the study seeks higher classification accuracy by using an importance distribution of features derived from a novel feature membership as a substitute for $W_{ACC}$ and $W_{LDA}$.

2.2 Support Vector Machine

The support vector machine (SVM) is a powerful tool for classification problems with small samples and high-dimensional data, in which the Hughes phenomenon arises (Bruzzone & Persello, 2009; Hughes, 1968; Tarabalka, Benediktsson & Chanussot, 2009); moreover, in many studies it performed more accurately than other classifiers, or at least equally well (Camps-Valls et al., 2004, 2005, 2008 & 2009; Melgani & Bruzzone, 2004; John & Nello, 2004). SVM is ordinarily used as a binary classifier that separates the data space into two regions. For linearly separable samples, the idea of SVM is to find the optimal hyperplane that divides the two classes with the maximum margin. However, real data are often not linearly separable in the input space. To overcome this, the data are mapped into a high-dimensional feature space in which they are sparse and possibly more separable; the key technique for doing so is the kernel function. A kernel function defined on the input space implicitly computes the inner product in the high-dimensional feature space. Consequently, the choice of kernel function affects the learning and generalization abilities of SVM. The kernel trick is introduced next.

(30) 2.2.1 Kernel Method

Linearly inseparable data in the original space can be mapped to a high-dimensional nonlinear space, where the inner product $\langle \phi(x), \phi(a) \rangle$ is computed more easily through a kernel function. Some popular kernel functions are as follows.

Linear kernel:

$$\kappa(x, a) = \langle x, a \rangle \tag{2.3}$$

Polynomial kernel:

$$\kappa(x, a) = (\langle x, a \rangle + 1)^r, \quad r \in Z^+ \tag{2.4}$$

Gaussian radial basis function (RBF) kernel:

$$\kappa(x, a) = \exp\left( \frac{-\|x - a\|^2}{2\sigma^2} \right), \quad \sigma \in R - \{0\} \tag{2.5}$$

where x and a are samples in $R^d$.

These kernels rest on the following fact: a function $\kappa : R^d \times R^d \to R$ satisfies Mercer's theorem, i.e., there is a feature mapping $\phi$ into a Hilbert space H such that

$$\kappa(x, a) = \langle \phi(x), \phi(a) \rangle \tag{2.6}$$

where $x, a \in X$, if and only if it is a symmetric function for which the matrices

(31) $$K = [\kappa(x_i, x_j)]_{1 \le i, j \le n} \tag{2.7}$$

formed by restriction to any finite subset $\{x_1, \ldots, x_n\}$ of the space X are positive semi-definite. The use of this kernel trick in the SVM classifier is presented next.

2.2.2 SVM Algorithm

The SVM algorithm provides an effective way of applying a kernel-based method to supervised classification. Its concept is based on an optimal linear separating hyperplane fitted to the training patterns of two classes within a kernel space, as illustrated in Figure 2.5.

Figure 2.5. The concept of SVM supervised algorithm.

The SVM supervised algorithm estimates the entire classification using the

(32) principle of statistical risk minimization (Boser, Guyon & Vapnik, 1992). If the patterns $x_i \in R^d$, $i = 1, \ldots, n$, in the training data $\{(x_i, y_i)\}$ with classes $y_i \in \{\pm 1\}$ are linearly separable, the basic statistical risk minimization task is to estimate the decision function $f : R^d \to \{\pm 1\}$ from a training set of two classes, so that f correctly classifies unlabeled patterns via an optimal vector w and an optimal scalar b such that

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n. \tag{2.8}$$

The optimal hyperplane,

$$w^T x + b = 0 \tag{2.9}$$

is the one farthest from the closest patterns of the two classes, and the margin of separation in Euclidean distance is $\frac{2}{\|w\|}$.

Unfortunately, real data are not always completely linearly separable. Because the data may be mixed and complex, the optimal hyperplane cannot easily be found directly. Hence, given the training set and a feature mapping $\phi(\cdot) : R^d \to H$, which usually maps to a higher-dimensional feature space, the SVM method solves the following optimization problem:

$$\min_{w, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \tag{2.10}$$

subject to

(33) $$y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \forall i = 1, \ldots, l$$
$$\xi_i \ge 0, \quad \forall i = 1, \ldots, l \tag{2.11}$$

where $C > 0$ is a penalty parameter, $\xi_i$ is a nonnegative slack variable, and the optimal hyperplane in the feature space is

$$w^T \phi(x) + b = 0 \tag{2.12}$$

By the Lagrange multiplier theorem, the primal optimization problem is converted into the following dual problem:

$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j) \tag{2.13}$$

subject to

$$\sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad \forall i = 1, \ldots, l \tag{2.14}$$

where the variables $\alpha_i$ are the Lagrange multipliers corresponding to the primal constraints, and $\kappa$ is a kernel function satisfying $\kappa(x, a) = \langle \phi(x), \phi(a) \rangle$, $x, a \in R^d$. The decision function of the support vector classifier for any new test vector $x_{new}$ can be written as

$$f(x_{new}) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i \kappa(x_i, x_{new}) + b \right) \tag{2.15}$$

where $\alpha_i$ and b are the optimal solutions of the dual problem.
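To make Equations (2.3) to (2.5) and the decision function in Equation (2.15) concrete, the following is a minimal Python sketch. The multipliers `alphas` and the bias `b` below are hand-picked for illustration only and are not obtained by solving the dual problem; the function names are hypothetical, and a score of exactly zero is assigned to the positive class by assumption.

```python
import math


def linear_kernel(x, a):
    """Equation (2.3): inner product in the input space."""
    return sum(xi * ai for xi, ai in zip(x, a))


def polynomial_kernel(x, a, r=2):
    """Equation (2.4): (<x, a> + 1)^r with a positive integer r."""
    return (linear_kernel(x, a) + 1) ** r


def rbf_kernel(x, a, sigma=1.0):
    """Equation (2.5): exp(-||x - a||^2 / (2 sigma^2))."""
    squared_distance = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def decision(x_new, support_vectors, labels, alphas, b, kernel):
    """Equation (2.15): sign of the kernel expansion plus the bias."""
    score = sum(
        y * alpha * kernel(sv, x_new)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score + b >= 0 else -1


# Illustrative values only: they satisfy sum(alpha_i * y_i) = 0 but are
# not the solution of any particular training problem.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
b = 0.0

print(decision([1.8, 1.9], support_vectors, labels, alphas, b, rbf_kernel))
print(decision([0.1, 0.0], support_vectors, labels, alphas, b, rbf_kernel))
```

Only kernel evaluations between the test vector and the support vectors are needed, so the mapping $\phi$ never has to be computed explicitly.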

(34) Although SVM is one of the most powerful techniques for supervised classification, its performance depends on choosing a proper kernel function or proper parameters of a kernel function. Next, the basic ideas behind optimal kernel-based methods are introduced, and an automatic method for selecting the parameter of the RBF kernel function is recommended.

2.3 An Optimal Kernel Method for Selecting the RBF Kernel Parameter

Kernel-based methods have attracted much attention in pattern recognition and machine learning. Their advantage is that data can be mapped from the original space to a proper feature space via a proper kernel function. The choice of kernel function therefore greatly influences the classification performance of kernel-based algorithms. For this reason, many studies (Chen, Liu, & Bao, 2008; Chen, Li, Kuo & Huang, 2010) concentrated on optimizing a data-dependent kernel and proposed various optimal kernel functions to find a feature space suited to the training samples of remote sensing datasets. Another useful way of optimizing a kernel function is to select its optimal parameter. This section focuses on the finding that the performance of SVM depends on choosing proper parameters of a kernel function (Chang & Lin, 2001; Camps-Valls et al., 2004; Xiong, Swamy, Omair & Ahmad, 2005) and on an automatic method for selecting the parameter of the RBF kernel function proposed by Lin et al. (2010). This method presents a novel and simple criterion for choosing a proper parameter ($\sigma$) of the RBF kernel function automatically; furthermore, it saves a lot of time

(35) compared with the time-consuming k-fold cross-validation (CV) that is generally used for choosing the parameter (Chang & Lin, 2001; Camps-Valls et al., 2004).

From the RBF kernel function in Equation (2.5), two properties can be observed. First, the value of the RBF kernel function for any pair of patterns lies between 0 and 1. Second, the value approaches 1 when the two patterns are very similar and approaches 0 when they are very dissimilar. Hence, the algorithm seeks a parameter that makes the RBF kernel values close to 1 for pairs of patterns from the same class and close to 0 for pairs from different classes. The main idea is formulated as follows.

Suppose $\{x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)}\} \subset R^d$ is the set of samples in class i, $i = 1, 2, \ldots, L$, and $n_i$ is the number of samples from class i. Based on the two properties above, one can try to find a proper parameter $\sigma$ such that

$$\kappa(x_\ell^{(i)}, x_r^{(i)}, \sigma) \approx 1, \quad \ell, r = 1, 2, \ldots, n_i \tag{2.16}$$

and

$$\kappa(x_\ell^{(i)}, x_r^{(j)}, \sigma) \approx 0, \quad \ell = 1, 2, \ldots, n_i, \ r = 1, 2, \ldots, n_j, \ i \ne j \tag{2.17}$$

where $\kappa(x, a, \sigma)$ is the RBF kernel function, $\sigma$ is the parameter of the RBF kernel, and $x_\ell^{(i)}$ is the $\ell$-th sample from the i-th class. According to the properties in Equation (2.16) and Equation (2.17), two criteria are applied. The first is the

(36) average of the RBF kernel values over all pairs of training samples that come from the same class, defined as follows:

$$C(\sigma) = \frac{1}{\sum_{i=1}^{L} n_i^2} \sum_{i=1}^{L} \sum_{\ell=1}^{n_i} \sum_{r=1}^{n_i} \kappa(x_\ell^{(i)}, x_r^{(i)}, \sigma) \tag{2.18}$$

The second is the average of the RBF kernel values over all pairs of training patterns that come from different classes, defined as follows:

$$B(\sigma) = \frac{1}{\sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} n_i n_j} \sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} \sum_{\ell=1}^{n_i} \sum_{r=1}^{n_j} \kappa(x_\ell^{(i)}, x_r^{(j)}, \sigma) \tag{2.19}$$

Since $0 \le C(\sigma) \le 1$ and $0 \le B(\sigma) \le 1$, and $C(\sigma)$ and $B(\sigma)$ should be close to 1 and 0, respectively, the optimal parameter $\sigma$ can be obtained by minimizing the criterion $J(\sigma)$, defined as

$$\min_{\sigma > 0} J(\sigma) = (1 - C(\sigma)) + (B(\sigma) - 0) = 1 - C(\sigma) + B(\sigma) \tag{2.20}$$
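The criterion in Equations (2.18) to (2.20) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the cited method's implementation: the text does not say how $J(\sigma)$ is minimized, so a plain search over a candidate grid is assumed here, and the names `criterion_j`, `select_sigma`, `class_a`, `class_b` and `grid` are hypothetical.

```python
import math


def rbf(x, a, sigma):
    """RBF kernel of Equation (2.5)."""
    squared_distance = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def criterion_j(classes, sigma):
    """J(sigma) = 1 - C(sigma) + B(sigma); classes holds one sample list per class."""
    within_sum, within_pairs = 0.0, 0
    between_sum, between_pairs = 0.0, 0
    for i, class_i in enumerate(classes):
        for j, class_j in enumerate(classes):
            total = sum(rbf(x, a, sigma) for x in class_i for a in class_j)
            if i == j:
                within_sum += total
                within_pairs += len(class_i) * len(class_j)
            else:
                between_sum += total
                between_pairs += len(class_i) * len(class_j)
    c_value = within_sum / within_pairs      # Equation (2.18)
    b_value = between_sum / between_pairs    # Equation (2.19)
    return 1.0 - c_value + b_value           # Equation (2.20)


def select_sigma(classes, grid):
    """Assumed strategy: return the grid value with the smallest J."""
    return min(grid, key=lambda sigma: criterion_j(classes, sigma))


# Two well-separated one-dimensional classes (toy data).
class_a = [[0.0], [0.1], [0.2]]
class_b = [[3.0], [3.1], [3.2]]
grid = [0.01, 0.1, 0.5, 1.0, 5.0, 50.0]
best = select_sigma([class_a, class_b], grid)
print(best, criterion_j([class_a, class_b], best))
```

A very small $\sigma$ drives both averages toward zero except for the self-pairs, and a very large $\sigma$ drives both toward one, so in this toy example the minimizer lies strictly inside the grid. No SVM is trained at any point, which is where the time saving over cross-validation comes from.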

(37) CHAPTER 3: KERNEL BASED DYNAMIC SUBSPACE METHOD

The major objective of this study is to develop a novel kernel-based ensemble technique, built on optimal algorithms, that strengthens RSM and DSM. Compared with these two conventional methods, the proposed method not only aims to overcome the two drawbacks of RSM noted in the previous chapter, as DSM does, but also intends to solve the problem of choosing the proper kernel parameter in DSM based on SVMs. The component dimensions of each subspace are selected according to membership values obtained from the proposed optimal kernel-based method. Furthermore, the dimensionality of the subspace is determined dynamically during training by a learning algorithm based on the same optimal kernel-based method that is used in each base classifier.

In this chapter, a novel multiple classifier system named the kernel based dynamic subspace method (KDSM) is proposed, and it is shown how the drawbacks of RSM and DSM are overcome. The framework of KDSM is displayed in Figure 3.1. All feature vectors are fed into an automatic algorithm that chooses proper kernel spaces by exploring the similarity among the samples, and two new distributions are imposed on the feature selection process. One is the importance distribution of feature membership, M, which models the probability that each feature is selected as a component dimension of the subspace. The selected probabilities

(38) of the features are assumed to differ and are modeled by membership degrees obtained from optimal kernel-based algorithms. The other is the importance distribution of subspace dimensionality weight, W, which indicates how many dimensions are suitable for better classification. As in DSM, and unlike RSM, the subspace dimensionality is neither predefined nor fixed; it is drawn from the novel importance distribution, W, each time a subspace is produced. Additionally, we propose an automatic update algorithm for the W distribution through the iteration process. Finally, base SVM classifiers are added until the change in the classification outcome falls below a threshold, i.e., until convergence, and the results of all classifiers are fused by the majority voting rule to obtain the final decision.

Figure 3.1. The framework of KDSM.

(39) The following sections introduce, in turn, the importance distribution of band membership, M, for selecting the profitable bands in each subspace; the importance distribution of dimensionality weight, W, for deciding the dimensionality of each subspace; and the steps of the proposed kernel-based dynamic subspace ensemble technique, including how the reduced dataset for the base classifier in each subspace is created from the M and W distributions.

3.1 Importance Distribution of Band Membership

In RSM, the importance distribution of all features is a discrete uniform distribution. DSM improves on it with two distributions, $W_{ACC}$ and $W_{LDA}$, defined by the normalized resubstitution classification accuracy and by the separability of Fisher's linear discriminant analysis, respectively. In this study, M is proposed to model the selection probability of each feature using an automatic method for selecting the parameter of the RBF kernel (Li, Lin, Kuo, & Chu, 2010), and it is introduced into the subspace selection procedure. The design of the M distribution follows the same principle as the W distributions in DSM: beneficial bands carry larger probabilities of being selected.

Solving the optimization problem in Equation (2.20), $\min_{\sigma > 0} J(\sigma) = 1 - C(\sigma) + B(\sigma)$, yields the optimal parameter, $\sigma_{op}$, of the RBF kernel function, as verified by Li, Lin, Kuo, & Chu (2010). An index representing the membership degree of each feature, used for drawing the M distribution, is then obtained by substituting $\sigma_{op}$ into the complement of $J(\sigma)$, as follows,

(40) $$OJ(r) = 1 - J_r(\sigma_{op}^{(r)}) \tag{3.1}$$

where $OJ(r)$ denotes the membership degree of the r-th feature for classification, and $\sigma_{op}^{(r)}$ is the optimal value of the problem of minimizing $J_r(\sigma)$ over $\sigma > 0$ on 1-dimensional data containing the r-th band only.

According to the properties of the RBF kernel function described in Section 2.3, $J_r(\sigma_{op}^{(r)})$ can be calculated by substituting the reduced-dimensional data $x_{(r)}$, which contain only the r-th band, into the two measures $C(\sigma)$ and $B(\sigma)$ of Equation (2.18) and Equation (2.19), which were proposed for selecting the parameter of the RBF kernel automatically (Li, Lin, Kuo, & Chu, 2010), and by solving the following optimization problem:

$$\min_{\sigma > 0} J_r(\sigma) = 1 - C_r(\sigma) + B_r(\sigma) \tag{3.2}$$

where $C(\sigma)$ and $B(\sigma)$ are redefined as

$$C_r(\sigma) = \frac{1}{\sum_{i=1}^{L} N_i^2} \sum_{i=1}^{L} \sum_{\ell=1}^{N_i} \sum_{v=1}^{N_i} \kappa(x_{\ell(r)}^{(i)}, x_{v(r)}^{(i)}, \sigma) \tag{3.3}$$

$$B_r(\sigma) = \frac{1}{\sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} N_i N_j} \sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} \sum_{\ell=1}^{N_i} \sum_{v=1}^{N_j} \kappa(x_{\ell(r)}^{(i)}, x_{v(r)}^{(j)}, \sigma) \tag{3.4}$$

and $N_i$ is the number of training samples in the i-th class, $x_{\ell(r)}^{(i)}$ is the reduced-dimensional $\ell$-th sample, which has only the r-th feature and belongs to

(41) the i-th class.

To draw the M distribution from OJ, assume that M is a discrete random variable taking d different values, with the probability that $M = r$ defined as

$$f_M(r) = P(M = r) = \frac{OJ(r)}{\sum_{r=1}^{d} OJ(r)}, \quad r = 1, 2, \ldots, d \tag{3.5}$$

The probability mass function (pmf) of M, $f_M = (f_M(1), f_M(2), \ldots, f_M(d))$, is the list of probabilities associated with its possible values, which here represent the d different features.

The agreement between OJ and the membership degrees of the features for classification with support vector machines is discussed next. Figure 3.2 plots OJ against the classification accuracy obtained with each single band of the Indian Pines Site dataset. There are 220 bands and 9 classes, with 20 training samples and 100 testing samples in every class. For each band r, three values are displayed, all computed on 1-dimensional data containing the r-th band only. The red curve shows $OJ(r)$ computed from the training samples. The green and blue curves show the classification accuracies of the training and testing samples, respectively, obtained by applying SVM with the penalty parameter C fixed at 1000 and $\sigma_{op}^{(r)}$ selected by the automatic method for optimizing the RBF parameter.

(42) Figure 3.2. OJ vs. accuracy of classification with samples on each band (horizontal axis: feature).

In Figure 3.2, all three curves follow a similar trend, so OJ can be used to choose the proper bands for each subspace and thereby improve the subspace-selection-based ensemble. It is intuitive that the proper features should be chosen during the selection of subspace dimensions, since these features are highly useful for representing the data for classification. The membership degree $OJ(r)$ of each r-th feature is therefore used to model the importance distribution of feature membership and serves as one basis for reducing the dimension of the data in this study. Next, another distribution is proposed for automatically determining the subspace dimensionality, replacing the fixed number in RSM and the R distribution in DSM.
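The construction of the M distribution in Equations (3.1) to (3.5) can be sketched as follows. This is a minimal Python illustration under assumptions: each feature is scored on its own 1-dimensional data, $\sigma_{op}^{(r)}$ is approximated by a search over a candidate grid (the minimization procedure is not specified in the text), and the names `criterion_j`, `feature_membership_pmf` and `sigma_grid` are hypothetical.

```python
import math


def criterion_j(classes, sigma):
    """J_r(sigma) = 1 - C_r(sigma) + B_r(sigma) on one-dimensional samples."""
    within_sum, within_pairs = 0.0, 0
    between_sum, between_pairs = 0.0, 0
    for i, class_i in enumerate(classes):
        for j, class_j in enumerate(classes):
            total = sum(
                math.exp(-((x - a) ** 2) / (2 * sigma ** 2))
                for x in class_i
                for a in class_j
            )
            if i == j:
                within_sum += total
                within_pairs += len(class_i) * len(class_j)
            else:
                between_sum += total
                between_pairs += len(class_i) * len(class_j)
    return 1.0 - within_sum / within_pairs + between_sum / between_pairs


def feature_membership_pmf(X, y, sigma_grid):
    """Return (OJ, f_M): membership degrees and the pmf of Equation (3.5)."""
    labels = sorted(set(y))
    oj = []
    for r in range(len(X[0])):
        # Keep the r-th feature only and split the samples by class.
        classes = [
            [row[r] for row, label in zip(X, y) if label == c] for c in labels
        ]
        # Assumed grid search standing in for the minimization in (3.2).
        j_min = min(criterion_j(classes, sigma) for sigma in sigma_grid)
        oj.append(1.0 - j_min)  # Equation (3.1)
    total = sum(oj)
    return oj, [value / total for value in oj]


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.5], [0.1, 0.1], [0.2, 0.9], [3.0, 0.4], [3.1, 0.8], [3.2, 0.2]]
y = [0, 0, 0, 1, 1, 1]
sigma_grid = [0.01, 0.1, 0.5, 1.0, 5.0]
oj, f_m = feature_membership_pmf(X, y, sigma_grid)
print(oj, f_m)
```

In this toy example the separating feature receives the larger share of the probability mass, in line with the design principle that beneficial bands carry larger probabilities of being selected.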

(43) 3.2 Importance Distribution of Dimensionality Weight

In KDSM, the way a suitable subspace size is chosen for constructing the reduced-dimensional dataset of each subspace resembles DSM: the subspace dimensionality is drawn from a continually updated importance distribution of dimensionality weight before each subspace is produced, which improves on the predefined, fixed subspace size of RSM. There is, however, a notable difference between DSM and KDSM in the algorithms and steps used to obtain the R distribution and the W distribution, respectively. In KDSM, the dimensionality weight is obtained directly from the process of selecting the optimal kernel parameter for SVM, and it replaces the resubstitution accuracy used in DSM (Yang, Kuo, Yu & Chung, 2010). This saves the computation time that DSM spends on estimating the classification accuracy of the training data to draw the importance distribution of dimensionality weight.

Let the subspace dimensionality, w, be an outcome of the probability function $f_W$ of the distribution of subspace dimensionality weight, W, where $1 \le w \le d$ and d is the total number of features (dimensions). Establishing the W distribution, so that $f_W$ indicates a proper w for each reduced-dimensional dataset and base classifier in the proposed method, involves several steps and concepts. First, in the same way that $OJ(r)$ was obtained in Section 3.1 by solving the criterion $J(\sigma)$ on reduced-dimensional data, an index $OJ(R_s^w)$ is defined as follows to estimate the dimensionality weights of reduced-dimensional datasets with different combinations of features.

(44) $$OJ(R_s^w) = 1 - J_{R_s^w}(\sigma_{op}^{R_s^w}) \tag{3.6}$$

It represents the dimensionality weight of the w-dimensional dataset composed of the features $R_s^w$ in the s-th subspace, and $\sigma_{op}^{R_s^w}$ is the optimal value of the optimization problem

$$\min_{\sigma > 0} J_{R_s^w}(\sigma) = 1 - C_{R_s^w}(\sigma) + B_{R_s^w}(\sigma) \tag{3.7}$$

$C(\sigma)$ and $B(\sigma)$ are redefined to solve $J(\sigma_{op})$ on reduced-dimensional data in the same way as in Equations (3.1) to (3.4), following the results of Li, Lin, Kuo, & Chu (2010); the only difference is that the reduced-dimensional data change from $x_{(r)}$ to $x_{(R_s^w)}$ in the two measures below.

$$C_{R_s^w}(\sigma) = \frac{1}{\sum_{i=1}^{L} N_i^2} \sum_{i=1}^{L} \sum_{\ell=1}^{N_i} \sum_{v=1}^{N_i} \kappa(x_{\ell(R_s^w)}^{(i)}, x_{v(R_s^w)}^{(i)}, \sigma) \tag{3.8}$$

$$B_{R_s^w}(\sigma) = \frac{1}{\sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} N_i N_j} \sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} \sum_{\ell=1}^{N_i} \sum_{v=1}^{N_j} \kappa(x_{\ell(R_s^w)}^{(i)}, x_{v(R_s^w)}^{(j)}, \sigma) \tag{3.9}$$

where $x_{\ell(R_s^w)}^{(i)}$ is the w-dimensional $\ell$-th sample in the i-th class, which contains the features $R_s^w$, and $R_s^w = [m_1, m_2, \ldots, m_w]$ is the combination of features, in which all $m_k$, $k = 1, 2, \ldots, w$, are distinct.

By applying the importance distribution of feature membership M with a

(45) probability mass function $f_M$ as discussed before, the w features $R_s^w$ suitable for classification are selected automatically to construct the s-th reduced-dimensional subspace. The technique for choosing the proper $m_i$-th feature is detailed in Section 3.3.

The W distribution, which indicates the proper w through $f_W$, is then completed with another important technique, kernel smoothing (KS) density estimation (Parzen, 1962; Silverman, 1985). KS is a popular nonparametric technique for cases in which prior knowledge about the functional form of the conditional probability distributions is not available or is not used explicitly (Yang, Kuo, Yu & Chung, 2010). A continuous probability density function $f_{W_s}(w)$ is obtained by using KS to estimate and smooth the probability of the random variable $W_s$ from $OJ_s$, given below, in the s-th subspace.

$$OJ_s = [OJ_{s-1}, OJ(R_s^w)] \tag{3.10}$$

Note that since adding $OJ(R_s^w)$ increases the information about the dimensionality weight in the s-th subspace, the $W_s$ distribution is continually updated with a renewed $OJ_s$ as s changes. Finally, the whole KDSM multiple classifier system built from these concepts and techniques is described step by step in the next section.

(46) 3.3 Optimal Kernel-based Dynamic Subspace Ensemble

According to the two importance distributions of feature membership and subspace dimensionality weight described above, the proper features $R_s^w = [m_1, m_2, \ldots, m_w]$ and their number w can be selected for the base classifier by applying the probability mass function $f_M(m)$ and the probability density function $f_{W_s}(w)$, respectively. The algorithm of KDSM is shown in Table 3.1.

Table 3.1. The algorithm of KDSM

Input:
  The training dataset, $X = \{x_1, x_2, \ldots, x_N\} \subset R^d$
  The testing sample, $Y \in R^d$
  A learning algorithm (classifier), $\Psi$
  The number of classifiers (subspaces) for initializing the $W_0$ distribution, b
  The feature (band) selection based on M in the reduced dataset, MFS
  The subspace dimensionality selection based on W, WDS
  The number of classifiers in the ensemble if convergence is not reached, B
Output:
  Final hypothesis $F : Y \to \{1, 2, \ldots, L\}$ obtained by the ensemble of classifiers.

A. Training Procedure
Begin
  Estimate the importance distribution of band membership, M
  for $i = 1, 2, \ldots, b$
    $w_i = 1 + \left[ (i-1) \times \left( \frac{d-1}{b-1} \right) \right]$
    $\tilde{X}_i = MFS(X, M, w_i)$
  end

(47)   Estimate the W distribution, $W = \{W(\tilde{X}_1), W(\tilde{X}_2), \ldots, W(\tilde{X}_b)\}$
  $W_0$ = Kernel Smoothing($[w_k, W]$), where $k = 1, 2, \ldots, b$.
  for $j = 1, 2, \ldots, B$
    $w_{b+j} = WDS(W_{j-1})$
    $\tilde{X}_j = MFS(X, M, w_{b+j})$
    Estimate the W distribution, $W = \{W, W(\tilde{X}_{b+j})\}$
    $W_j$ = Kernel Smoothing($[w_k, W]$), where $k = 1, 2, \ldots, b+j$.
    $h_j = \Psi(\tilde{X}_j)$
    if SUM($h_j(Y) \ne h_{j-1}(Y)$) $< 10^{-4}$
      break
    end
  end
End

B. Classification Procedure
  $F = \arg\max_{\ell \in \{1, 2, \ldots, L\}} \mathrm{card}(j \mid h_j(Y) = \ell)$, where $j \le B$.

The procedure for selecting the suitable subspace dimensionality based on the W distribution and the desired features based on the M distribution in KDSM is as follows. First, to draw $W_0$, the initial distribution of W:

1. Build $OJ_0 = [OJ(R_0^{w_1}), OJ(R_0^{w_2}), \ldots, OJ(R_0^{w_b})]$ by applying $f_M(m)$ to select the $m_i$-th feature in $R_s^w$.
2. Apply KS with $OJ_0$ to get the $W_0$ distribution as

(48) $$f_{W_0}(w) = \frac{1}{\sigma \sum_{j=1}^{b} OJ(R_0^{w_j})} \sum_{j=1}^{b} OJ(R_0^{w_j}) \, \kappa\left( \frac{w - w_j}{\sigma} \right) \tag{3.11}$$

where $1 \le w \le d$, $\sigma$ is the smoothing parameter, called the bandwidth, and $\kappa$ is the kernel function

$$\kappa\left( \frac{w - w_j}{\sigma} \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(w - w_j)^2}{2\sigma^2} \right) \tag{3.12}$$

Then, M and W are sampled through the cumulative distribution functions of $f_M(m)$ and $f_{W_s}(w)$ as follows:

1. Generate a uniform random number, v, and enough further uniform random numbers, $u_1, u_2, \ldots, u_w$, between zero and one.
2. Set the subspace dimensionality to w if $F_{W_s}(w-1) < v < F_{W_s}(w)$, where $F_{W_s}$ denotes the cumulative distribution function of the $W_s$ distribution and $1 \le w \le d$.
3. Select the feature $r_i = r$ if $F_M(r-1) < u_i < F_M(r)$, where $F_M$ denotes the cumulative distribution function of the M distribution, the $r_i$, $i = 1, 2, \ldots, w$, are all distinct, and $1 \le r_i \le d$.

After the M and $W_0$ distributions have been estimated, the ensemble of classifiers is constructed by the following steps. Let S be the number of subspaces (classifiers) in the ensemble, with index $s = 1, 2, \ldots, S$.

(49) Step 1. Draw a new subspace dimensionality $w_{b+s}$ from the $W_{s-1}$ distribution.

Step 2. Select $w_{b+s}$ features based on the M distribution in the feature selection process, and obtain a $w_{b+s}$-dimensional dataset via the cumulative distribution function of the M distribution.

Step 3. Estimate $OJ(R_s^w)$ with the $w_{b+s}$-dimensional dataset, which comprises the features of $R_s^w$, as feedback to renew the $W_s$ distribution, whose probability density is re-estimated via the KS technique as

$$f_{W_s}(w) = \frac{1}{\sigma \sum_{k=1}^{b+s} OJ_k} \left( \sum_{j=1}^{b} OJ(R_0^{w_j}) \, \kappa\left( \frac{w - w_j}{\sigma} \right) + \sum_{t=1}^{s} OJ(R_t^{w_t}) \, \kappa\left( \frac{w - w_t}{\sigma} \right) \right) \tag{3.13}$$

Step 4. Return to Step 1 until S classifiers have been trained.

Step 5. Finally, combine the S classifiers by simple majority voting in the final decision rule.

These S classifiers are constructed in subspaces of different dimensionalities, since the $W_s$ distribution changes continually during training as the proper kernel space is chosen for each SVM. This is why the proposed method is named the "kernel-based" dynamic subspace method.
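The smoothing in Equations (3.11) to (3.13), the draw in Step 1 and the vote in Step 5 can be sketched together in Python. This is an illustrative sketch under assumptions, not the implementation used in this study: the smoothed density is evaluated only at the integer sizes 1 to d and renormalized into a pmf (so the constant factors of the Gaussian kernel cancel), and the names `smoothed_dimension_pmf`, `draw_dimension`, `majority_vote` and `bandwidth`, as well as the example weights, are hypothetical.

```python
import math
import random
from collections import Counter


def smoothed_dimension_pmf(sizes, weights, d, bandwidth):
    """Weighted Gaussian smoothing of (w_j, OJ_j) pairs over w = 1, ..., d."""
    density = []
    for w in range(1, d + 1):
        density.append(
            sum(
                oj * math.exp(-((w - wj) ** 2) / (2 * bandwidth ** 2))
                for wj, oj in zip(sizes, weights)
            )
        )
    total = sum(density)
    return [value / total for value in density]


def draw_dimension(pmf, rng):
    """Inverse-CDF draw: return w such that F(w - 1) <= v < F(w)."""
    v = rng.random()
    cumulative = 0.0
    for w, probability in enumerate(pmf, start=1):
        cumulative += probability
        if v < cumulative:
            return w
    return len(pmf)  # guard against floating-point round-off


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Initial sizes for d = 10 and b = 4, with made-up OJ values as weights.
d = 10
sizes = [1, 4, 7, 10]
weights = [0.2, 0.9, 0.5, 0.1]
pmf = smoothed_dimension_pmf(sizes, weights, d, bandwidth=1.0)

rng = random.Random(0)
draws = [draw_dimension(pmf, rng) for _ in range(5)]
print(pmf, draws, majority_vote([1, 2, 2, 3]))
```

Updating the distribution after a new subspace, as in Step 3, amounts to appending the new size and its OJ value to `sizes` and `weights` and calling `smoothed_dimension_pmf` again.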

(50)

(51) CHAPTER 4: EXPERIMENTAL DESIGN AND RESULTS

4.1 Experimental Design

In this study, a common hyperspectral image source, the Washington DC Mall dataset (Landgrebe, 2003), and an educational measurement dataset about "Sector" are used to demonstrate the effect of the proposed classifier, KDSM. The classification performance of a single classifier is compared with that of different multiple classifier systems, namely KDSM and DSM with two different techniques for selecting the features (bands) of each subspace, all based on support vector machines and applied to real data. The algorithms used to investigate the multiclass classification performance of the proposed method are listed in Table 4.1.

Table 4.1. The description of algorithms used for comparison.

Algorithm    Description
SVM_CV       A single SVM without any dimension reduction, with the CV method
SVM_OP       A single SVM without any dimension reduction, with the OP method
DSM_WACC     DSM with the resubstitution accuracy as the band weights
DSM_WLDA     DSM with the separability of Fisher's LDA as the band weights
KDSM         Kernel-based dynamic subspace method proposed in this research

For all SVM-based classifiers, the experiments adopt the soft-margin SVM with the Gaussian radial basis function (RBF) kernel and use a grid search within a given set to decide the parameter C, which controls the trade-off between the margin and the size of the slack variables. Two different methods, CV and OP, of choosing

(52) proper kernel parameters are considered for the single SVM classifier. In the CV method, 5-fold cross-validation is used to select the best C and the best parameter $\sigma$ of the RBF kernel function $k(x, y) = \exp\left( \frac{-\|x - y\|^2}{2\sigma^2} \right)$, $\sigma \in R$, from the sets {0.1, 1, 10, 20, 60, 100, 160, 200, 1000} and {0.07, 0.29, 0.52, 0.74, 0.97, 1.19, 1.41, 1.64, 1.86, 2.08, 2.24}, respectively, as suggested by Bruzzone & Persello (2009). In the OP method, the automatic method (Li, Lin, Kuo, and Chu, 2010) mentioned in Section 3.1 is used to select the optimal parameter $\sigma$ of the RBF kernel, so only the parameter C has to be determined by 5-fold cross-validation.

For the multiple classifier systems, DSM and KDSM, the different ensemble techniques for constructing the reduced-dimensional data are compared. For DSM, two different band weight distributions, $W_{ACC}$ and $W_{LDA}$, which select the important features of the data in each subspace by the resubstitution accuracy and by the separability of Fisher's LDA, respectively, are both taken into account in the comparison with KDSM. In these experiments, the number of subspaces in the multiple classifier system is limited by a threshold: the ensemble stops growing when the degree of divergence of the classification outcome between the current classifier and the preceding classifier, built on different subspaces, is smaller than $10^{-4}$. The maximum number of classifiers in the ensemble, B, is set to 50 to handle the case in which the outcome still changes considerably and has not converged after B classifiers have been added.

To investigate the performance of the proposed method, the operation time and classification accuracy are recorded for comparing not only

(53) with the different SVM-based methods but also across different training sample sizes for each classifier on high-dimensional datasets.

4.1.1 Datasets of experiment

Two real datasets, a hyperspectral image of the urban site over the Washington, DC Mall, U.S., and an educational measurement dataset (see Appendix A), are used to compare the performance of the five algorithms described in Table 4.1. The first dataset is a Hyperspectral Digital Imagery Collection Experiment (HYDICE) airborne hyperspectral flight line over the Washington, DC Mall with an original size of 1208 × 307 pixels (Landgrebe, 2003), shown in Figure 4.1 (a). Because it includes most of the land-cover types, only a subregion of 205 × 307 pixels is used in this study, as shown in Figure 4.1 (b). Each pixel has 191 channels, which remain after some water-absorption channels are discarded from the 210 bands collected in the 0.4–2.4 μm region of the visible and infrared spectrum. To observe the influence of the training sample size, three cases are considered for this hyperspectral image dataset: N_i = 20 < N < d (case 1), N_i = 40 < d < N (case 2), and d = 220 < N_i = 300 < N = 2100 (case 3), where N_i is the number of training samples in each class and N is the total number of training samples. Using MultiSpec (Landgrebe, 2003), regions were labeled for the seven information classes (water, tree, path, roof, grass, road and shadow). The training samples and a fixed set of 100 testing pixels per class are drawn from these labeled regions, which are shown in Figure 4.1 (c). And the corresponding available samples of different

(54) classes are also listed in Table 4.2.

Figure 4.1. (a) The original-size color IR image of the urban site in Washington, DC Mall, U.S. (b) The color IR image of the selected region of the Washington, DC Mall dataset. (c) The identified classes corresponding to the color IR Washington, DC Mall image. Legend: Background, Water, Tree, Path, Grass, Roof, Road, Shadow.
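As an illustration of the settings described in Section 4.1 and of the three training-size cases, the following Python sketch collects the RBF kernel, the CV parameter grids, the ensemble stopping rule and the case sizes. The function names, the scalar `change` that stands in for the divergence between consecutive classifiers, and the default of seven classes are assumptions made for this example and are not details taken from the experiments.

```python
import math

# Parameter grids quoted in Section 4.1 for the CV method.
C_GRID = [0.1, 1, 10, 20, 60, 100, 160, 200, 1000]
SIGMA_GRID = [0.07, 0.29, 0.52, 0.74, 0.97, 1.19, 1.41, 1.64, 1.86, 2.08, 2.24]


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def cv_candidates():
    """All (C, sigma) pairs that a full grid search would score."""
    return [(c, s) for c in C_GRID for s in SIGMA_GRID]


def ensemble_should_stop(change, n_classifiers, tol=1e-4, max_classifiers=50):
    """Stop when the outcome change falls below tol or B classifiers exist.

    Assumption: `change` is a non-negative number summarizing how much the
    classification outcome moved after adding the latest classifier; how it
    is measured is left to the caller.
    """
    return change < tol or n_classifiers >= max_classifiers


def total_training_samples(n_per_class, n_classes=7):
    """Total training size N for N_i samples in each of the classes."""
    return n_per_class * n_classes


if __name__ == "__main__":
    print(len(cv_candidates()), "candidate (C, sigma) pairs for the CV method")
    for n_i in (20, 40, 300):
        print("N_i =", n_i, "gives N =", total_training_samples(n_i))
```

Under these assumptions the CV method scores 9 × 11 = 99 parameter pairs, whereas the OP method only has to search the 9 values of C, and the three cases correspond to N = 140, 280 and 2100 training samples.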

(55) Table 4.2. Number of samples in the Washington DC Mall dataset used for experiments (191 available dimensions per sample).

Class     Number of samples   Number of testing samples   Number of training samples
Water     1156                100                         Ni
Tree      1430                100                         Ni
Path      737                 100                         Ni
Roof      3776                100                         Ni
Grass     2870                100                         Ni
Road      1982                100                         Ni
Shadow    840                 100                         Ni
Total     12791               700                         N

For each case, ten experimental datasets were created, each containing different training samples for every class, by randomly selecting from the samples in the labeled regions of the portion of the Washington, DC Mall dataset shown in Figure 4.1 (b). Averaging the results over all ten datasets avoids over-fitting to a single dataset.

Furthermore, the other dataset, the educational measurement data, is analyzed by the proposed method and the multiple classifier systems to evaluate students’ learning profiles and to arrange suitable remedial instruction. Adaptive testing simulation processes are applied to data from a paper-based Mathematics test whose content, designed for sixth-grade students, covers “Sector”-related concepts (see Appendix A). The experts’ structure of the unit for this test, shown in Figure 4.2, was developed by seven elementary school teachers and three researchers. These structures differ from usual concept maps in that they emphasize the ordering of nodes. Additionally, every node can be assessed by an item. There

(56) are 21 items in this test, and data from 828 subjects were collected for the “Sector” test.

Figure 4.2. Experts’ structure of the “Sector” unit. Nodes: Finding the areas of compound sectors; Drawing compound graphs; Finding the areas of simple sectors; Finding the areas of circles; Drawing sectors; Definition of sector.

According to this structure, the subjects of the “Sector” test can be divided into nine categories of remedial instruction. Table 4.3 shows the remedial concepts and the number of subjects for each category. To evaluate the performance of the proposed method, eight of the nine categories were used, and the subjects belonging to category zero were discarded. Twenty subjects in each category are randomly selected to form the training dataset, and the remaining subjects form the testing dataset. Table 4.4 shows the numbers of training and testing subjects in the experiment. Ten training and testing datasets are randomly selected, for estimating the system parameters and for computing the mean testing accuracies of the different algorithms, respectively.

(57) Table 4.3. Nine categories of remedial instructions.

Category (Class)   Number of subjects (samples)   The concepts of remedial instruction
0                  80                             All concepts are known; they don’t need remedial instructions.
1                  50                             They are careless and need to practice more.
2                  36                             “Finding the areas of compound sectors” and “Finding the areas of simple sectors”.
3                  47                             “Definition of sector”, “Finding the areas of compound sectors”, and “Finding the areas of simple sectors”.
4                  221                            “Drawing sectors”.
5                  53                             “Finding the areas of simple sectors”, “Drawing compound graphs”, and “Definition of sector”.
6                  30                             “Finding the areas of compound sectors” and “Drawing sectors”.
7                  25                             “Finding the areas of compound sectors”, “Drawing sectors”, and “Finding the areas of simple sectors”.
8                  286                            They need to learn all concepts of remedial instruction.
Total              828

Table 4.4. Number of subjects (samples) in the educational measurement dataset used for experiments (21 items as features (dimensions) per subject (sample)).

Category   Available subjects   Number of training subjects   Number of testing subjects
1          50                   20                            30
2          36                   20                            16
3          47                   20                            27
4          221                  20                            201
5          53                   20                            33
6          30                   20                            10
7          25                   20                            5
8          286                  20                            266
Total      748                  160                           588
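The split summarized in Table 4.4 can be sketched as follows: twenty subjects per category are drawn at random for training, the rest are kept for testing, and the draw is repeated ten times. This is a minimal Python sketch; the subject indices, the function names and the fixed random seed are hypothetical and only serve to make the example reproducible.

```python
import random

# Available subjects per category (categories 1-8), as reported in Table 4.4.
AVAILABLE = {1: 50, 2: 36, 3: 47, 4: 221, 5: 53, 6: 30, 7: 25, 8: 286}
N_TRAIN_PER_CATEGORY = 20
N_REPETITIONS = 10


def split_subjects(available, n_train, rng):
    """Pick n_train subject indices per category for training, rest for testing."""
    train, test = {}, {}
    for category, count in available.items():
        indices = list(range(count))  # hypothetical subject indices
        rng.shuffle(indices)
        train[category] = sorted(indices[:n_train])
        test[category] = sorted(indices[n_train:])
    return train, test


def make_datasets(seed=0):
    """Create the ten random training/testing splits (seed is an assumption)."""
    rng = random.Random(seed)
    return [
        split_subjects(AVAILABLE, N_TRAIN_PER_CATEGORY, rng)
        for _ in range(N_REPETITIONS)
    ]


if __name__ == "__main__":
    train, test = make_datasets()[0]
    print("training subjects:", sum(len(v) for v in train.values()))
    print("testing subjects:", sum(len(v) for v in test.values()))
```

Each split produced this way contains 160 training and 588 testing subjects, which matches the totals in Table 4.4. Category 7 is left with only 5 testing subjects.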

(58) 4.2 Experimental Results

The average classification accuracies and computer (CPU) times of the five SVM-based classifiers on the two experimental sources, the hyperspectral image and the educational measurement datasets, are listed in Table 4.5 and Table 4.6, respectively. Moreover, classification maps produced by the compared classifiers on the Washington DC Mall dataset are shown in Figures 4.3 to 4.5 for 20, 40 and 300 training samples per class, respectively.

Table 4.5. The average classification accuracy ± standard deviation and average computer time over the ten test datasets of the Washington DC Mall dataset.

             Accuracy (%)                              CPU Time (sec)
Method       Case 1       Case 2       Case 3         Case 1     Case 2      Case 3
SVM_CV       83.66±1.98   86.39±1.31   94.69±0.61     30.35      116.02      5858.18
SVM_OP       83.79±1.07   87.89±0.73   95.31±0.50     3.10       6.65        376.99
DSM_WACC     85.49±1.54   88.74±1.32   95.94±0.70     6045.31    21113.75    1165048.6
DSM_WLDA     86.83±3.06   90.76±2.38   96.94±1.18     2902.00    6767.77     220121.62
KDSM         88.64±1.74   92.53±1.27   97.43±0.66     155.31     308.26      17847.7

Table 4.6. The outcome of classification in the educational measurement dataset.

Method       Accuracy (%)   CPU Time (sec)   Ratio of CPU time
SVM_CV       68.03          64.97            1
SVM_OP       60.12          7.0122           0.10793
DSM_WACC     69.19          2349.20          36.15823
DSM_WLDA     70.52          2476.44          38.11667
KDSM         69.46          229.17           3.52732
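The “Ratio of CPU time” column in Table 4.6 appears to be each method's CPU time divided by that of SVM_CV. The short Python sketch below recomputes the column from the reported times under that reading; the dictionary and function names are chosen only for this illustration.

```python
# CPU times in seconds, copied from Table 4.6.
CPU_TIME_SEC = {
    "SVM_CV": 64.97,
    "SVM_OP": 7.0122,
    "DSM_WACC": 2349.20,
    "DSM_WLDA": 2476.44,
    "KDSM": 229.17,
}


def cpu_time_ratios(times, baseline="SVM_CV"):
    """Divide every method's CPU time by the baseline method's CPU time."""
    base = times[baseline]
    return {method: t / base for method, t in times.items()}


if __name__ == "__main__":
    for method, ratio in cpu_time_ratios(CPU_TIME_SEC).items():
        print(f"{method:10s} {ratio:.5f}")
```

The recomputed values agree with the tabulated ratios to within rounding, which supports reading the column as time relative to SVM_CV.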