National Chiao Tung University
Department of Industrial Engineering and Management
Doctoral Dissertation

Kernel-Based SVM: Theory and Application
(核心函數為基礎的支持向量分類器:理論與應用)

Student: Yang, Chien-Hsin
Advisors: Su, Chao-Ton; Chen, Wen-Chih

November 2006

Kernel-Based SVM: Theory and Application

Student: Yang, Chien-Hsin
Advisors: Su, Chao-Ton; Chen, Wen-Chih

A Dissertation Submitted to the Department of Industrial Engineering and Management, College of Management, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Industrial Engineering and Management.

November 2006
Hsinchu, Taiwan, Republic of China

摘要 (Chinese abstract, translated)

Kernel-Based SVM: Theory and Application

Student: Yang, Chien-Hsin. Advisors: Su, Chao-Ton; Chen, Wen-Chih. Department of Industrial Engineering and Management, National Chiao Tung University.

For two-class classification tasks, the support vector machine (SVM) is a good data mining tool: with relatively simple computation, the user can classify data with a hyperplane and its margin. To handle nonlinear problems, mathematicians have suggested combining the SVM with kernel functions. Although this helps classification, the SVM is still limited when handling large and complex data. In practice, the large databases found in many fields contain a great deal of implicit information and knowledge, and feature selection is one procedure for extracting it. How to quickly remove unimportant attributes and retain the correct features is therefore an important issue.

This study first proposes a new SVM-based classification method that builds classifiers from the commonly used kernel functions (polynomial and RBF) and provides guidelines for parameter setting and kernel selection. The proposed SVM classifier is then combined with the attribute selection method of Hermes and Buhmann (2000) to construct a feature selection procedure. Finally, a case study on hypertension detection is carried out with the proposed procedure, covering model construction and the removal of unimportant attributes, and the results are compared with the backpropagation neural network, decision tree, and rough set methods. The results show that the proposed method outperforms the other methods in terms of both accuracy and specificity.

Keywords: feature selection, support vector machine, kernel function, backpropagation neural network, decision tree, rough sets, hypertension.

ABSTRACT

Kernel-Based SVM: Theory and Application

Student: Chien-Hsin Yang. Advisors: Chao-Ton Su, Wen-Chih Chen. Department of Industrial Engineering and Management, National Chiao Tung University.

The SVM is a good data mining tool for two-class classification, in which the classification task is carried out by a hyperplane and its margin. To solve nonlinear classification problems, mathematicians have provided related kernel functions to deal with them. Although the SVM with a kernel function is useful for classification, its performance must still be improved for some data, particularly large and complex data sets. In practice, large data sets in many fields carry implicit information and knowledge, and feature selection is one of the procedures used to extract that information. How to reduce attributes and select the correct features is therefore an important issue in this field. In this dissertation, we investigate the theory and application of the SVM classifier, aiming to increase classification performance through a new classifier. Two popular kernel functions, the polynomial kernel and the Gaussian radial basis function (RBF) kernel, are used. The relevant strategies, including the setting of parameters and the selection of kernels, are provided. Next, we apply Hermes and Buhmann's (2000) idea to the proposed classifier and construct a feature selection procedure based on it. Finally, we demonstrate a case study of feature selection for hypertension detection, in which prediction models are constructed with the developed approaches.

Implementation results show that the performance of the developed approach is better than those of the backpropagation neural network, decision tree (DT), and rough set (RS) methods in terms of accuracy and specificity. In addition, this dissertation provides some medical discussion of the importance of the anthropometric factors retained after feature selection.

Keywords: feature selection, support vector machine, kernel function, backpropagation neural network, decision tree, rough sets, hypertension.

CONTENTS

摘要 (Chinese Abstract)
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Overview
  1.2 Motivations
  1.3 Objectives
  1.4 Organization
CHAPTER 2 RELATED WORKS
  2.1 Support Vector Machine
  2.2 Kernel Function
  2.3 Properties of the Kernel Function
  2.4 Feature Selection
    2.4.1 Wrappers Approach
    2.4.2 Filters Approach
    2.4.3 Information Theoretic Ranking Criterion
    2.4.4 Embedded Approach
  2.5 L-J Method
  2.6 Data Complexity
CHAPTER 3 PROPOSED APPROACHES
  3.1 SVM with Combined Kernel Functions
  3.2 Feature Selection for the SVM by Using the L-J Method
CHAPTER 4 ILLUSTRATION
  4.1 Data Sets
  4.2 Implementation Results
  4.3 Discussions
CHAPTER 5 A CASE STUDY: HYPERTENSION DETECTION
  5.1 Problem Description

  5.2 Implementation
  5.3 Comparisons
  5.4 Discussion
CHAPTER 6 CONCLUSIONS
  6.1 Summary
  6.2 Further Research
REFERENCES

LIST OF TABLES

Table 4.1 Data sets used in this study
Table 4.2 Comparison of classification accuracy with the larger and smaller data sets
Table 4.3 The accuracy of feature selection for the SVM using the L-J method (larger data sets)
Table 4.4 The accuracy of feature selection for the SVM using the L-J method (smaller data sets)
Table 4.5 The strategies of parameter setting of polynomial and RBF kernels
Table 5.1 Classification of blood pressure for adults aged 18 and older
Table 5.2 A comparison of performance of feature selection

LIST OF FIGURES

Figure 1.1 Research framework
Figure 2.1 Hyperplane with the maximal margin by a linear SVM
Figure 2.2 Original space (input space)
Figure 2.3 Transformed space (feature space)
Figure 2.4 A wrappers model of feature selection
Figure 2.5 A filters model of feature selection
Figure 4.1 The relationship between parameters and accuracy for the larger data set
Figure 4.2 The relationship between parameters and accuracy for the smaller data set

ACKNOWLEDGMENTS (誌謝, translated)

For the completion of this dissertation I must first thank my advisor, Professor Su, Chao-Ton. Over these fifty-two months I learned from him both the craft of writing research and the way to conduct oneself. For the oral defense, I thank Professors 駱景堯, 洪瑞雲, 邱文科, 陳穆臻, and 陳文智 for their guidance, which made the dissertation more complete. I also thank the members of our laboratory, 志華, 俊欽, 隆昇, 敬森, 宗銘, 家任, and 宇翔, for their discussions and encouragement.

Over these years, my family's support has been the greatest motivation of my studies. I dedicate the honor and joy of completing this doctoral degree to my dear parents and family; thank you for letting me finish my studies free of worry. Finally, I express my deepest gratitude to everyone who has helped me along the way.

CHAPTER 1 INTRODUCTION

1.1 Overview

Knowledge Discovery in Databases (KDD) is concerned with extracting useful information from databases (Fayyad et al., 1996). Data mining is a set of techniques used in an automated approach to exhaustively explore and bring to the surface complex relationships in very large data sets (Liu and Motoda, 1998). Two objectives in data mining are obtaining model accuracy and extracting important information. Recently, many algorithms and tools have been developed to construct more precise models that explain the relationship between input and output variables. The support vector machine (SVM) is one of the tools that has emerged among classification applications. The SVM is a promising classification technique proposed by Vapnik and his group at AT&T Bell Laboratories (Cortes and Vapnik, 1995). It is a universal approximator that can learn a variety of representations from training samples and can also be used for regression tasks. It has been successfully applied to a number of real-world problems such as handwritten character and digit recognition (Schoelkopf, 1997; Cortes and Vapnik, 1995; LeCun et al., 1995; Vapnik, 1995), face detection (Osuna, 1997), and speaker identification (Schmidt, 1996). The SVM is a good tool for two-class classification. It separates the classes with a particular hyperplane that maximizes a quantity called the margin, the distance from the separating hyperplane to the nearest point in the data set. This maximum-margin criterion has the advantage of being robust against noise in the data and of making the solution unique for linearly separable problems. In addition, the SVM has strong theoretical support from the statistical learning theory framework.

An important finding of statistical learning theory is that the generalization error can be bounded by the sum of the empirical error and a term that depends on the Vapnik-Chervonenkis (VC) dimension, which characterizes the complexity of the approximating function class (Vapnik, 1998; Pardo and Sberveglieri, 2005).

However, not all classification problems are linear and separable. In fact, data that are both vague and overlapping are common, so many interactions occur, particularly in the input space. Based on available studies (Oyang et al., 2005; Scholkopf et al., 1995), the original SVM does not seem to perform well in these cases. To solve this problem, mathematicians have provided kernel functions to handle nonlinear classification problems under the above limitations (Muller et al., 2001). Several types of kernels are used for different kinds of problems, and each kernel may suit only some of them. For instance, some well-known special problems, such as text classification (Joachims, 2000) and DNA problems (Yeang et al., 2001), are reported to be classified more accurately with the linear kernel.

1.2 Motivations

Although the SVM with a kernel function is useful for classification, its performance must still be improved, especially for complex data. This is particularly important for users who require a high level of accuracy in advanced areas such as precision engineering and medical diagnosis. Because the available kernel functions each have their own advantages, users may obtain better classification accuracy by combining different kernel functions; further study of this question is therefore highly desirable.

In addition to accuracy, feature selection is another substantial issue for classification. Feature selection can avoid unnecessary computation in the

classification process. Limiting the number of features can sometimes be helpful because it reduces model capacity and thus the risk of over-fitting. However, reducing the features always carries the danger of reducing the expected classification performance. How to maintain the expected classification performance and avoid a drop in accuracy after feature selection is therefore an important problem. In disease diagnosis, diagnosticians and physicians need to discover information or knowledge from the data set using fewer features or subsets. Because most such data sets contain a large number of variables, they need a good tool or algorithm to carry out feature selection quickly and precisely.

1.3 Objectives

In this study, we investigate the theory and application of the SVM classifier. First, a kernel-based SVM is developed, with the aim of increasing classification performance through the new classifier. Two popular kernel functions, the polynomial kernel and the Gaussian radial basis function kernel (the RBF kernel), transform the raw data from a low-dimensional space to a high-dimensional one. SVMs with single and combined kernels are examined in this study. Furthermore, the relevant strategies, including the setting of parameters and the selection of kernels, are tested on data sets collected from the UCI repository and from a hospital. Next, feature selection for the SVM is discussed. We investigate the feature selection problem for the proposed classifier, applying Hermes and Buhmann's (2000) idea to our method. Finally, we demonstrate a case study of feature selection for hypertension detection. The study constructs a prediction model for hypertension using anthropometric body-surface scanning data. In addition to the proposed approaches, several benchmark feature selection methods, namely the backpropagation neural network,

the decision tree, and rough sets, are also used to predict hypertension. The relevant epidemiological indices, such as sensitivity and specificity, are used to evaluate the importance of the anthropometric factors after feature selection. Finally, technical and medical discussions are provided. In summary, the framework of the kernel-based SVM discussed in this dissertation is shown in Figure 1.1.

Figure 1.1 Research framework: data collection and preparation; kernel selection (a combined kernel function is proposed); feature selection (a kernel-based SVM using the L-J method); and application (a case study of hypertension detection).

1.4 Organization

The remainder of this dissertation is organized as follows. Chapter 2 describes related research, including a brief introduction to the SVM, relevant kernel functions, and feature selection approaches. In addition, an indicator used in the kernel selection criterion, the mess level, is briefly introduced in this chapter. The proposed approaches, including the combined kernel method and the feature selection method, are described in Chapter 3. In Chapter 4, we illustrate the effectiveness of the proposed approaches using

various real-world data sets. A case study on hypertension detection is described in Chapter 5. Finally, the conclusions and directions for further research are given in Chapter 6.

CHAPTER 2 RELATED WORKS

2.1 Support Vector Machine

The SVM has recently gained popularity in the learning community. In its simplest linear form, an SVM is a hyperplane that separates a set of positive elements from a set of negative elements with maximum interclass distance, the so-called margin. Figure 2.1 shows such a hyperplane with the associated margin.

Figure 2.1 Hyperplane with the maximal margin by a linear SVM (two classes separated by the hyperplane, with the margin between them).

The output of a linear SVM is

$$u = \mathbf{w}^T \mathbf{x}_i + b \qquad (2.1)$$

where $\mathbf{w}$ is the normal (weight coefficient) vector, $\mathbf{x}_i$ is the input vector, and $b$ is the bias term. Based on the sign of $u$, the class label 1 or -1 is obtained. The distance between a training vector $\mathbf{x}_i$ and the separating hyperplane, the margin, is expressed as

$$\frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\|\mathbf{w}\|}.$$

According to the original theory of Vapnik (1995), we seek a margin such that $\mathbf{w}^T \mathbf{x}_i + b \geq 1$ for elements of the positive class and $\mathbf{w}^T \mathbf{x}_i + b \leq -1$ for elements of the

negative class. To compute the boundary, we need to maximize the margin, i.e. to minimize $\frac{1}{2}\|\mathbf{w}\|^2$. Consequently, the optimization problem is

$$\min_{\mathbf{w},b} \ \mathbf{w}^T\mathbf{w} \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \qquad (2.2)$$

where $\mathbf{x}_i$ is the $i$th training element and $y_i \in \{-1, 1\}$ is the correct output of the SVM for the $i$th training element. Note that the hyperplane is determined by the training elements $\mathbf{x}_i$ on the margin, the so-called support vectors. As seen in Figure 2.1, they "physically support" the final hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$. In practice, however, some problems have nonseparable patterns. Hence Cortes and Vapnik (1995) introduced a penalty term $C\sum_{i=1}^{l}\xi_i$ into the objective function and allowed training errors:

$$\min_{\mathbf{w},b,\boldsymbol{\xi}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, \dots, l. \qquad (2.3)$$

That is, the constraints in equation (2.3) allow training data to fall on the wrong side of the separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ while the training error $C\sum_{i=1}^{l}\xi_i$ is minimized in the objective function. Hence, if the penalty parameter $C$ is large enough and the data are linearly separable, problem (2.3) reduces to problem (2.2), since all $\xi_i$ become zero (Lin, 2001). In addition to linear cases, most problems exhibit nonlinear patterns, and the kernel function is often used to handle them. For nonlinear cases, the separating plane is found by solving the following constrained quadratic programming problem:

$$\max_{\boldsymbol{\alpha}} \ W(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \qquad (2.4)$$

under the constraints $\sum_{i=1}^{n}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$ for $i = 1, 2, \dots, n$, where the $\mathbf{x}_i \in \mathbb{R}^n$ are the training sample vectors and $y_i \in \{-1, +1\}$ the corresponding class labels (Cortes and Vapnik, 1995). The kernel function $k(\mathbf{x}, \mathbf{x}')$ is detailed in the next section.
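The soft-margin problem (2.3) is what most SVM software solves in practice. As an illustrative sketch (not part of the dissertation's original toolchain), the following Python fragment, assuming scikit-learn and synthetic data, trains a linear soft-margin SVM and shows the role of the penalty parameter C:

```python
# A minimal sketch of the soft-margin formulation (2.3), using scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, roughly linearly separable classes (illustrative data only).
X_pos = rng.normal(loc=+2.0, scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

# A large C penalizes the slack variables heavily (approaching the hard margin);
# a small C tolerates more training errors in exchange for a wider margin.
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("number of support vectors:", clf.n_support_.sum())
```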

2.2 Kernel Function

Kernel functions, originating in kernel methods (Aronszajn, 1950), have in the last few years become one of the most popular approaches to learning from examples, with many potential applications in science and engineering (Cristianini and Taylor, 2000; Vapnik, 1998; Scholkopf, 1997; Scholkopf et al., 1998; Roth and Steinhage, 2000). Kernel functions have the form $k(\mathbf{x}_1, \mathbf{x}_2) = \langle \varphi(\mathbf{x}_1), \varphi(\mathbf{x}_2) \rangle$, where $\langle \cdot,\cdot \rangle$ is an inner product and $\varphi$ is in general a nonlinear mapping from the input space $X$ to the feature space $Z$. In fact, the kernel function $k$ is defined directly; $\varphi$ and the feature space $Z$ are simply derived from its definition. Kernel substitution of the inner product can be applied to generate SVMs for classification based on margin maximization (Sanchez, 2003). In other words, the SVM finds a hyperplane in a space different from that of the input data $\mathbf{x}$: a hyperplane in the feature space induced by the kernel $k$ (the kernel defines an inner product in that space). Through the kernel $k$, the hypothesis space is defined as a set of "hyperplanes" in the feature space induced by $k$. We can also say that the fundamental idea of the kernel method is the deformation of the (lower-dimensional) vector space itself into a higher-dimensional space, where the data are often easier to classify.

Consider the following linearly non-separable example. Six points $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_6$ are shown in Figure 2.2; they are not linearly separable in two dimensions. Therefore, we need a mapping $\Phi$ into a higher-dimensional space,

$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \qquad \Phi(\mathbf{x}) = \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix}.$$

After this transformation we obtain vectors $\mathbf{t} \in \mathbb{R}^3$, for which the problem becomes linearly separable, as shown in Figure 2.3.
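The mapping above can be checked numerically: the inner product of the mapped vectors equals the value of a degree-2 polynomial kernel computed directly in the input space, so the mapping never has to be carried out explicitly. A minimal Python sketch (the numerical values are invented for illustration):

```python
import numpy as np

def phi(x):
    """Explicit map R^2 -> R^3 used in the example above."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def k_poly2(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.5])

# The inner product in the transformed space equals the kernel value in the input space.
print(np.dot(phi(x), phi(z)))   # 6.25
print(k_poly2(x, z))            # 6.25
```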

Figure 2.2 Original space (input space). Figure 2.3 Transformed space (feature space).

As described above, using a kernel function to generate an SVM for classification is a popular approach to solving more complex problems. The kernel function is usually written as $k(\mathbf{x}, \mathbf{x}')$; it defines a distance in the transformed space that is related to the distance in the original space. Several examples of such kernel functions are known:

(1) Polynomial kernel

$$k(\mathbf{x}_i, \mathbf{x}') = (a + b\,\mathbf{x}_i \cdot \mathbf{x}')^d \qquad (2.5)$$

where $a$ and $b$ are constants and $d$ is the degree. For this kernel there are $\binom{n+d-1}{d}$ distinct features, being the monomials up to and including degree $d$ over the $n$ attributes of an instance of the data set. The special case $a = 0$ and $b = d = 1$ gives the linear kernel; for some simple problems the linear kernel is good enough for SVM-based classification (Zien et al., 2000).

(2) RBF kernel

$$k(\mathbf{x}_i, \mathbf{x}') = \exp\!\left(-\frac{1}{\gamma}\|\mathbf{x}_i - \mathbf{x}'\|^2\right) \qquad (2.6)$$

where $\gamma$ is the kernel width, $\gamma = 2\sigma^2$. The kernel width, common to all the kernels, is specified a priori by the user.

(3) Sigmoid kernel

$$k(\mathbf{x}_i, \mathbf{x}') = \tanh(\kappa\,\langle \mathbf{x}_i, \mathbf{x}' \rangle + \vartheta) \qquad (2.7)$$

where $\kappa > 0$ and $\vartheta < 0$. However, Mercer's theorem is satisfied only for some values of $\kappa$ and $\vartheta$.

To summarize the statements concerning kernels, we give the mathematical definition of a kernel function in Proposition 2.2.1 (Scholkopf and Smola, 2002).

Proposition 2.2.1 (Positive definite kernel). Let $X$ be a nonempty set. A function $k$ on $X \times X$ which for all $m \in \mathbb{N}$ and all $\mathbf{x}_1, \dots, \mathbf{x}_m \in X$ gives rise to a positive definite Gram matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a kernel.
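The three kernels (2.5)-(2.7) and the Gram-matrix condition of Proposition 2.2.1 can be sketched in a few lines of Python. The data and parameter values below are illustrative assumptions, and the positive-definiteness test is a numerical check via eigenvalues rather than a proof:

```python
import numpy as np

def k_polynomial(X, Z, a=0.0, b=1.0, d=3):
    """Polynomial kernel (2.5): k(x, x') = (a + b x.x')^d."""
    return (a + b * X @ Z.T) ** d

def k_rbf(X, Z, gamma=1.0):
    """RBF kernel (2.6): k(x, x') = exp(-||x - x'||^2 / gamma)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / gamma)

def k_sigmoid(X, Z, kappa=0.5, theta=-1.0):
    """Sigmoid kernel (2.7); Mercer's condition holds only for some settings."""
    return np.tanh(kappa * X @ Z.T + theta)

def is_psd(K, tol=1e-8):
    """Numerical check of the Gram-matrix condition in Proposition 2.2.1."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 4))
for name, K in [("poly", k_polynomial(X, X)),
                ("rbf", k_rbf(X, X)),
                ("sigmoid", k_sigmoid(X, X))]:
    print(name, "Gram matrix PSD:", is_psd(K))
```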

The selection of the kernel function is very important for the performance of the classifier (Papadopoulos et al., 2005). Wahba (2000) suggested using the kernel function to increase the dimensionality, after which it is easier to classify the data in the higher-dimensional space with a hyperplane. Although it is well known that the choice of kernel affects the SVM's performance, only a few kernels have been used in practice, because proper tuning of the parameters is difficult (Dudoit et al., 2002). Parameter selection for the SVM is another issue that strongly affects its performance. Scholkopf and Smola (2002) indicated that a smaller $C$ is often suitable for SVM classification; they also note that both the kernel parameters and the SVM parameter (the value of $C$) are often chosen by cross-validation. Zhu and Zhang (2004) noted that excessively large parameter values make the computation very time consuming. Still, parameter selection lacks consistency across researchers.

2.3 Properties of the Kernel Function

The use of a kernel function is an attractive computational short-cut. Without it, we would apparently need to first create a complicated feature space, then work out what the inner product in that space would be, and finally find a direct method of computing that value in terms of the original inputs. In practice, the approach taken is to define a kernel function directly, hence implicitly defining the feature space. In this way, we avoid the feature space not only in the computation of inner products, but also in the design of the learning machine itself (Cristianini and Taylor, 2000). We will argue that defining a kernel function for an input space is frequently more natural than creating a complicated feature space. Before we can follow this route, however, we must first determine what properties of a function $k(\mathbf{x}, \mathbf{x}')$ are necessary to ensure that it is a kernel for some feature space (Cristianini and Taylor, 2000). Clearly, the function must be symmetric,

$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \langle \phi(\mathbf{x}'), \phi(\mathbf{x}) \rangle = k(\mathbf{x}', \mathbf{x}) \qquad (2.8)$$

and satisfy the inequality that follows from the Cauchy-Schwarz inequality,

$$k(\mathbf{x}, \mathbf{x}')^2 = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle^2 \leq \|\phi(\mathbf{x})\|^2 \|\phi(\mathbf{x}')\|^2 = \langle \phi(\mathbf{x}), \phi(\mathbf{x}) \rangle \langle \phi(\mathbf{x}'), \phi(\mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x})\,k(\mathbf{x}', \mathbf{x}'). \qquad (2.9)$$

However, these conditions are not sufficient to guarantee the existence of a feature space. In practice, Mercer's theorem provides the characterization of when a function $k(\mathbf{x}, \mathbf{x}')$ is a kernel (Cristianini and Taylor, 2000). Mercer's theorem can be formally stated as follows (Mercer, 1908; Courant and Hilbert, 1970):

Proposition 2.3.1 Let $K(\mathbf{x}, \mathbf{x}')$ be a continuous symmetric kernel that is defined on the closed interval $a \leq \mathbf{x} \leq b$, and likewise for $\mathbf{x}'$. The kernel $K(\mathbf{x}, \mathbf{x}')$ can be expanded in the series

$$K(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$$

with positive coefficients $\lambda_i > 0$ for all $i$. For this expansion to be valid and to converge absolutely and uniformly, it is necessary and sufficient that the condition

$$\int_a^b \int_a^b K(\mathbf{x}, \mathbf{x}') \psi(\mathbf{x}) \psi(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \geq 0$$

holds for all $\psi(\cdot)$ for which $\int_a^b \psi^2(\mathbf{x})\, d\mathbf{x} < \infty$.

The functions $\phi_i(\mathbf{x})$ are called eigenfunctions and the $\lambda_i$ are called eigenvalues. The fact that all of the eigenvalues are positive means that the kernel $K(\mathbf{x}, \mathbf{x}')$ is positive definite (Haykin, 1999).

2.4 Feature Selection

Features are also called attributes, properties, variables, or characteristics. Feature selection is a process by which a sample in the measurement space is described by a finite and usually smaller set of numbers called features, say $x_1, x_2, \dots, x_n$; the features become components of the pattern space. Feature selection is regarded as

a procedure for determining which variables (attributes) should be measured first or last. Liu and Motoda (1998) defined feature selection as a process that chooses an optimal subset of features according to a certain criterion. Feature selection may be a multistage process used to enhance the accuracy or performance of classification (Chiang, 2002). Feature selection (also called variable selection) has become the focus of much research in application areas for which data sets with tens or hundreds of thousands of variables are available. Feature selection problems are found in many machine learning tasks, including classification, regression, and time series prediction. An appropriate feature selection can enhance the effectiveness and domain interpretability of an inference model. Liu and Motoda (1998) indicated that the effects of feature selection are (1) to improve performance (speed of learning, predictive accuracy, or simplicity of rules); (2) to visualize the data for model selection; and (3) to reduce dimensionality and remove noise. Guyon and Elisseeff (2003) indicated that there are many potential benefits of feature selection: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance. Liu and Motoda (1998) provided a detailed survey and overview of the existing methods for feature selection. They suggest a feature selection process that includes four parts: feature generation, feature evaluation, stopping criteria, and testing. In addition to the classic evaluation measures (accuracy, information, distance, and dependence) used for removing irrelevant features, they provided consistency measures (the inconsistency rate) to determine a minimum set of relevant features.

Two methods, feature selection and feature extraction, are usually used for dimensionality reduction of data sets (Jain et al., 2000). As described

above, feature selection selects sub-features in the measurement space, whereas feature extraction determines an appropriate subspace of dimensionality $m$ (in either a linear or a nonlinear way) within the original feature space of dimensionality $d$. It should be noted that the features or attributes are not changed in the feature selection process, while new attributes are created by feature extraction. Because it preserves the original attributes, feature selection is preferable to feature extraction in interdisciplinary applications.

The universal algorithms of feature selection are often divided along three lines: wrappers, filters, and embedded methods (Kohavi and John, 1997; Guyon and Elisseeff, 2003). Both wrappers and filters work by selecting subsets of variables. The wrappers approach is one of the subset selection methods: it assesses subsets of variables according to their usefulness to a given predictor. The filters approach is a preprocessing step, independent of the choice of the predictor; still, under certain dependence or orthogonality assumptions, it may be optimal with respect to a given predictor. Obviously, an exhaustive search can conceivably be performed if the number of variables is not too large, but the problem is known to be NP-hard (Amaldi and Kann, 1998) and the search quickly becomes computationally intractable; these methods may waste considerable computational cost when the number of variables is large, and the embedded method has similar disadvantages. In addition to these algorithms, variable ranking is used as a principal or auxiliary selection mechanism because of its simplicity, scalability, and good empirical success. Several papers (Bekkerman et al., 2003; Caruana and de Sa, 2003; Weston et al., 2003) use variable ranking as a baseline method. Furthermore, the information theoretic ranking criterion is also a common approach for variable classification (Bekkerman et al., 2003; Dhillon et al., 2003; Torkkola, 2003).

2.4.1 Wrappers Approach

Regardless of the approach used in feature selection, the goal is always to provide the classifier with data of better quality and to improve classification performance; selecting the relevant features and removing noise makes this possible, and this is exactly the role of the wrappers approach. A wrappers model (see Figure 2.4) consists of two phases (Liu and Motoda, 1998). Phase 1 is feature subset selection, which selects the best subset using a classifier's accuracy on the training data as the criterion. Phase 2 is learning and testing: a classifier is learned from the training data with the best feature subset and tested on the test data.

Figure 2.4 A wrappers model of feature selection (Liu and Motoda, 1998): feature generation and evaluation by a learning algorithm on the training data (Phase 1), followed by learning and testing with the best subset (Phase 2).

The wrappers approach consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. When feature subsets are generated, a classifier is built from the data with the chosen features for each subset. If the number of variables is not too large, an exhaustive search can conceivably be performed; however, the problem is NP-hard (Amaldi and Kann, 1998). Although the wrappers approach is often criticized as a "brute force"

method because of the massive amount of computation it requires, some researchers hold different opinions. For instance, Reunanen (2003) indicated that coarse search strategies may alleviate the problem of overfitting, and greedy search strategies are effective against overfitting. Forward selection and backward elimination are two such strategies. In forward selection, variables are progressively incorporated into larger and larger subsets, whereas in backward elimination one starts with the set of all variables and progressively eliminates the least promising ones. However, neither forward selection nor backward elimination avoids time-consuming computation when the number of variables is very large. To ease this limitation, researchers often use heuristic learning methods such as naive Bayesian classifiers or decision tree induction (Kohavi and John, 1997).

2.4.2 Filters Approach

The filters approach is built on the intrinsic properties of the data, not on the bias of a particular classifier. The essence of filters is to seek the relevant features and to eliminate the irrelevant ones. According to the classification guideline of Kohavi and John (1997), the preprocessing step of the filters approach is independent of the choice of the predictor; still, under certain independence or orthogonality assumptions, it may be optimal with respect to a given predictor. A filters model of feature selection (see Figure 2.5) also consists of two phases (Liu and Motoda, 1998). In Phase 1, features are selected using measures such as information, distance, dependence, or consistency, and no classifier is engaged. Phase 2 is the same as in the wrappers model: a classifier is learned on the training data with the selected features and tested on the test data.

Figure 2.5 A filters model of feature selection (Liu and Motoda, 1998): feature generation and evaluation by a measure in Phase 1, followed by learning and testing with the best subset in Phase 2.

In addition to being built on the intrinsic properties of the data, the filters approach has the following characteristics. First, measuring information gain, distance, dependence, or consistency is usually cheaper than measuring the accuracy of a classifier, so a filters method can produce a subset faster, other things being equal. Second, because of the simplicity of the measures and their low time complexity, a filters method can handle larger data than a classifier can; therefore, when a classifier cannot be learned directly from a large data set, filtering can be used to reduce the data dimensionality so that the classifier can be learned from the reduced data. However, there is a danger that the features selected by a filters model do not allow a learning algorithm to fully exploit its bias. Compared with wrappers, filters are faster, although recently proposed efficient embedded methods are competitive. In addition, some filters provide a generic selection of variables that is not tuned for a given learning machine. It is an advantage that the filters approach can be used as a preprocessing step to reduce space dimensionality and overfitting.

2.4.3 Information Theoretic Ranking Criterion

The information theoretic ranking criterion gathers empirical estimates of the

mutual information between each variable and the target:

$$I(i) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\,p(y)}\, dx_i\, dy \qquad (2.10)$$

where $p(x_i)$ and $p(y)$ are the probability densities of $x_i$ and $y$, and $p(x_i, y)$ is their joint density. The criterion $I(i)$ is a measure of the dependency between the density of variable $x_i$ and the density of the target $y$. If the variable is discrete or nominal, $I(i)$ is written as

$$I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\,P(Y = y)}. \qquad (2.11)$$

It should be noted that the estimation obviously becomes harder with larger numbers of classes and variable values.

2.4.4 Embedded Approach

Guyon and Elisseeff (2003) defined the embedded approach as performing variable selection in the process of training; this approach is usually specific to given learning machines. In other words, the embedded approach relies on a built-in mechanism to perform variable selection, such as Classification and Regression Trees (CART) (Breiman et al., 1984).
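As a concrete illustration of the discrete criterion (2.11), the following sketch estimates the mutual information between each feature and the target from empirical frequencies; the small data set is invented for illustration only:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical estimate of I(i) in equation (2.11) for a discrete feature x
    and a discrete target y."""
    n = len(x)
    p_x, p_y, p_xy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in p_xy.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((p_x[xv] / n) * (p_y[yv] / n)))
    return mi

# Rank the features of a tiny illustrative data set by their estimated MI.
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
scores = [mutual_information(X[:, j].tolist(), y.tolist()) for j in range(X.shape[1])]
print("MI per feature:", np.round(scores, 3))
```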

2.5 L-J Method

The L-J method is a feature selection approach that defines scores for the available features during training. It was developed by two authors, Lothar Hermes and Joachim M. Buhmann, in 2000. They use influence to determine the important features, where influence means the ability to affect the decision hyperplane of the SVM. Compared with the wrappers and filters approaches, the L-J method is a feature selection strategy that defines scores for the available features on the basis of a single training run (Hermes and Buhmann, 2000) and is easy for users to compute. A brief introduction to the L-J method follows.

The initial task of the L-J method is to construct an SVM with a given training set using the complete data components. After constructing the classifier $f(\mathbf{x})$ (equation 2.12), the importance of the separate feature components to $f(\mathbf{x})$ is estimated, where the $\lambda_i$ are the Lagrange multipliers. To rank the components of a given vector $\mathbf{x}$ according to their influence on the classification, the user then computes the gradient of $f(\mathbf{x})$ at position $\mathbf{x}$ (equation 2.13):

$$f(\mathbf{x}) = \sum_{i=1}^{l} \lambda_i y_i \mathbf{x}_i^T \mathbf{x} + b \qquad (2.12)$$

$$\nabla f(\mathbf{x}) = \sum_{i \in SV} \lambda_i y_i \nabla_{\mathbf{x}} K(\mathbf{x}_i, \mathbf{x}) \qquad (2.13)$$

where $\nabla f(\mathbf{x})$ is the gradient of $f(\mathbf{x})$, which is perpendicular to the optimal hyperplane. Next, the user projects the unit vector $\mathbf{e}_j$ onto $\nabla f(\mathbf{x})$. If the projection of $\mathbf{e}_j$ on $\nabla f(\mathbf{x})$ is small, feature $j$ is not important at position $\mathbf{x}$; in other words, if $\nabla f(\mathbf{x})$ is roughly orthogonal to $\mathbf{e}_j$, feature $j$ should not influence the distance to the decision hyperplane. In summary, we can compute the angle $\alpha_j(\mathbf{x}_i)$ between $\nabla f(\mathbf{x})$ and $\mathbf{e}_j$, $j = 1, 2, \dots, n$ (the indices of the individual features), to measure which factors are important:

$$\alpha_j(\mathbf{x}_i) = \min_{\beta \in \{0,1\}} \left\{ \beta\pi + (-1)^{\beta} \arccos\!\left( \frac{(\nabla f(\mathbf{x}))^T \mathbf{e}_j}{\|\nabla f(\mathbf{x})\|} \right) \right\} \qquad (2.14)$$

Values $\alpha_j(\mathbf{x}_i) \approx \pi/2$ indicate that feature $j$ has only a weak influence on the assignment $f(\mathbf{x})$ of $\mathbf{x}$, while small values (close to 0) indicate important features. Finally, we compute $\tilde{\alpha}_j$ to rank the features. The index $\tilde{\alpha}_j$ is defined as

$$\tilde{\alpha}_j = 1 - \frac{2}{\pi} \cdot \frac{\sum_{i \in I_\varepsilon} \alpha_j(\mathbf{x}_i)}{|I_\varepsilon|} \qquad (2.15)$$

where $\tilde{\alpha}_j$ is an index for ranking features. The features with smaller $\tilde{\alpha}_j$ can be dropped.

2.6 Data Complexity

To show the effectiveness of the proposed approach, data complexity is used to evaluate the corresponding performance. Many indicators of data complexity have been used in different fields; for instance, entropy can reveal the complexity of the input data. In this study we describe data complexity using the mess level (Wang, 2003). Before calculating the mess level, the related symbols are defined as follows.

$A_{ij}$: the value of the $j$th attribute of an instance.
$A_j^+$: the mean of the $j$th attribute over the instances with $y_i = 1$.
$A_j^-$: the mean of the $j$th attribute over the instances with $y_i = -1$.
$A_{j\max}$, $A_{j\min}$: the maximum and minimum values of the $j$th attribute.
$k$: the number of attributes in an instance, $k \geq 2$.
$\mathbf{x}_i^+$: an instance with $y_i = 1$; $\mathbf{x}_i^-$: an instance with $y_i = -1$.
$n$: the number of instances in the data set, $n \geq 2$.

Now we define $M_a(\cdot)$ and $M_b(\cdot)$ as follows:

$$M_a(\mathbf{x}_i^+) = \frac{1}{k-1}\sum_{j=1}^{k} \left( \frac{A_{ij} - A_j^+}{A_{j\max} - A_{j\min}} \right)^2 \qquad (2.16)$$

$$M_a(\mathbf{x}_i^-) = \frac{1}{k-1}\sum_{j=1}^{k} \left( \frac{A_{ij} - A_j^-}{A_{j\max} - A_{j\min}} \right)^2 \qquad (2.17)$$

$$M_a(S) = \sum_{i=1}^{n} \left( M_a(\mathbf{x}_i^+) + M_a(\mathbf{x}_i^-) \right) \qquad (2.18)$$

$$M_b(\mathbf{x}_i^+) = \frac{1}{k-1}\sum_{j=1}^{k} \left( \frac{A_{ij} - A_j^-}{A_{j\max} - A_{j\min}} \right)^2 \qquad (2.19)$$

$$M_b(\mathbf{x}_i^-) = \frac{1}{k-1}\sum_{j=1}^{k} \left( \frac{A_{ij} - A_j^+}{A_{j\max} - A_{j\min}} \right)^2 \qquad (2.20)$$

$$M_b(S) = \sum_{i=1}^{n} \left( M_b(\mathbf{x}_i^+) + M_b(\mathbf{x}_i^-) \right) \qquad (2.21)$$

If all the elements belong to the positive class, $M_a(\mathbf{x}_i^-)$ approaches 0; on the contrary, $M_a(\mathbf{x}_i^+)$ approaches 0 if all the elements belong to the negative class. Therefore, most of the elements approach the positive class if $M_a(S)$ is larger, and the elements are concentrated in their own space. $M_b(\mathbf{x}_i^+)$ approaches 0 if all the elements belong to the negative class; conversely, if all the elements belong to the positive class, $M_b(\mathbf{x}_i^-)$ approaches 0. According to this definition, if $M_b(S)$ is larger, the positive and negative elements are further apart. Dividing equation (2.21) by equation (2.18), we obtain equation (2.22), called the mess level (ML):

$$\text{Mess level} = \frac{M_b(S)}{M_a(S)} \qquad (2.22)$$

When the ML is close to 1, the data set is complex and is not easy to classify.
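A small Python sketch of the mess level computation is given below. It follows one plausible reading of equations (2.16)-(2.22) as reconstructed above (in particular the placement of the 1/(k-1) factor), so it should be taken as an illustration rather than a definitive implementation; the data are synthetic:

```python
import numpy as np

def mess_level(X, y):
    """Mess level (2.22): ratio of between-class to within-class scatter of
    min-max-scaled attributes, under the reading of (2.16)-(2.21) given above."""
    rng_ = X.max(axis=0) - X.min(axis=0)
    rng_[rng_ == 0] = 1.0                      # guard against constant attributes
    mean_pos = X[y == 1].mean(axis=0)
    mean_neg = X[y == -1].mean(axis=0)
    k = X.shape[1]

    def scatter(Xc, mean):
        # sum over instances of (1/(k-1)) * sum_j ((A_ij - mean_j) / range_j)^2
        return np.sum(((Xc - mean) / rng_) ** 2) / (k - 1)

    M_a = scatter(X[y == 1], mean_pos) + scatter(X[y == -1], mean_neg)
    M_b = scatter(X[y == 1], mean_neg) + scatter(X[y == -1], mean_pos)
    return M_b / M_a

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(0.5, 1.0, (100, 5))])
y = np.array([1] * 100 + [-1] * 100)
print("ML =", round(mess_level(X, y), 3))      # close to 1 => hard to separate
```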

CHAPTER 3 PROPOSED APPROACHES

3.1 SVM with Combined Kernel Functions

Several investigations (Yao et al., 2005; Wang et al., 2004) indicated that kernel functions are useful for classification. For example, some studies show that the SVM with a polynomial kernel provides good performance for prediction and classification (Wang et al., 2004), while others indicate that the SVM with the RBF kernel has stronger classification ability (Hammer and Gersmann, 2003; Dong et al., 2005; Lukas et al., 2004; Yao et al., 2005). The polynomial and RBF kernels each have their own advantages, which encourages us to propose a combined approach in pursuit of better classification accuracy.

Suppose we are given a training set of $M$ samples with input vectors $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_i, \dots, \mathbf{x}_M\}$ and known class labels $\{y_1, y_2, \dots, y_i, \dots, y_M\}$, $y_i \in \{+1, -1\}$. A new data point $\mathbf{x}$ is assigned a label by the SVM according to the decision function

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{M_s} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b \right) \qquad (3.1)$$

where

$$k(\mathbf{x}_i, \mathbf{x}) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}) \rangle \qquad (3.2)$$

is the kernel function that defines the feature space, $\Phi(\mathbf{x})$ is a nonlinear mapping function from the input space to the feature space, $\langle \cdot,\cdot \rangle$ denotes an inner product, $b$ is a bias value, and the $\alpha_i$ are positive real numbers obtained by solving a quadratic programming (QP) problem that yields the maximal-margin hyperplane (Vapnik, 1998).

Because the polynomial kernel offers the advantage of an adjustable degree $d$ in the feature space (see equation 2.5) and the Gaussian RBF kernel is itself a

normalized kernel (see equation 2.6), the kernels $k_P$ and $k_G$ are employed in this study to develop the new kernels $k_{P+G}$ and $k_{P \cdot G}$. First, to simplify the classification process, parameter $a$ is dropped and parameter $b$ is set to 1, so that equations (2.5) and (2.6) can be rewritten as

$$k_P(\mathbf{x}_i, \mathbf{x}) = (\mathbf{x}_i \cdot \mathbf{x})^d \qquad (3.3)$$

where $d$ is the (adjustable) degree, and

$$k_G(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{1}{2\gamma}\|\mathbf{x}_i - \mathbf{x}\|^2\right) \qquad (3.4)$$

where $\gamma$ is the (adjustable) kernel width. Consequently, the kernel function $k_{P+G}$ is defined as

$$k_{P+G}(\mathbf{x}_i, \mathbf{x}) = (\mathbf{x}_i \cdot \mathbf{x})^d + \exp\!\left(-\frac{1}{2\gamma}\|\mathbf{x}_i - \mathbf{x}\|^2\right) \qquad (3.5)$$

and the kernel function $k_{P \cdot G}$ is defined as

$$k_{P \cdot G}(\mathbf{x}_i, \mathbf{x}) = (\mathbf{x}_i \cdot \mathbf{x})^d \cdot \exp\!\left(-\frac{1}{2\gamma}\|\mathbf{x}_i - \mathbf{x}\|^2\right). \qquad (3.6)$$

As a result, the SVM decision functions using the new kernels $k_{P+G}$ and $k_{P \cdot G}$ can be rewritten as

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{M_s} \alpha_i y_i k_{P+G}(\mathbf{x}_i, \mathbf{x}) + b \right) \qquad (3.7)$$

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{M_s} \alpha_i y_i k_{P \cdot G}(\mathbf{x}_i, \mathbf{x}) + b \right) \qquad (3.8)$$
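The combined kernels (3.5) and (3.6) can be plugged into a standard SVM solver that accepts a callable kernel. The sketch below, assuming scikit-learn and synthetic data, is only an illustration of how k_{P+G} and k_{P.G} might be used, not the implementation employed in the experiments of Chapter 4:

```python
import numpy as np
from sklearn.svm import SVC

def k_poly(X, Z, d=3):
    return (X @ Z.T) ** d                                    # equation (3.3)

def k_rbf(X, Z, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2.0 * gamma))                       # equation (3.4)

def k_sum(X, Z, d=3, gamma=1.0):
    return k_poly(X, Z, d) + k_rbf(X, Z, gamma)              # k_{P+G}, (3.5)

def k_prod(X, Z, d=3, gamma=1.0):
    return k_poly(X, Z, d) * k_rbf(X, Z, gamma)              # k_{P.G}, (3.6)

# scikit-learn accepts a callable that returns the Gram matrix between two sets.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = np.where(np.sum(X[:, :3] ** 2, axis=1) > np.sum(X[:, 3:] ** 2, axis=1), 1, -1)

clf = SVC(kernel=lambda A, B: k_prod(A, B, d=3, gamma=1.0), C=1.0).fit(X, y)
print("training accuracy with k_{P.G}:", clf.score(X, y))
```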

Next, we need to prove that the new kernels are symmetric and satisfy the characterization of Mercer's theorem. The kernel $k_{P+G}$ is proved here as a representative case.

Lemma. Let $k_P$ and $k_G$ be kernels over $X \times X$, $X \subseteq \mathbb{R}^n$. Then the kernel $k_{P+G}$ in equation (3.5) is symmetric and satisfies the characterization of Mercer's theorem.

Proof. Let $k_P(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^d$ and $k_G(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\frac{1}{2\gamma}\|\mathbf{x} - \mathbf{z}\|^2\right)$. Then

$$k_{P+G}(\mathbf{x}, \mathbf{z}) = k_P(\mathbf{x}, \mathbf{z}) + k_G(\mathbf{x}, \mathbf{z}) = \langle \phi_P(\mathbf{x}), \phi_P(\mathbf{z}) \rangle + \langle \phi_G(\mathbf{x}), \phi_G(\mathbf{z}) \rangle = \langle \phi_P(\mathbf{z}), \phi_P(\mathbf{x}) \rangle + \langle \phi_G(\mathbf{z}), \phi_G(\mathbf{x}) \rangle = k_P(\mathbf{z}, \mathbf{x}) + k_G(\mathbf{z}, \mathbf{x}) = k_{P+G}(\mathbf{z}, \mathbf{x}).$$

Hence the kernel $k_{P+G}$ is symmetric. From the Cauchy-Schwarz inequality it follows that

$$k_{P+G}(\mathbf{x}, \mathbf{z}) = k_P(\mathbf{x}, \mathbf{z}) + k_G(\mathbf{x}, \mathbf{z}) \leq \sqrt{k_P(\mathbf{x}, \mathbf{x})\,k_P(\mathbf{z}, \mathbf{z})} + \sqrt{k_G(\mathbf{x}, \mathbf{x})\,k_G(\mathbf{z}, \mathbf{z})} \leq \sqrt{k_P(\mathbf{x}, \mathbf{x})k_P(\mathbf{z}, \mathbf{z}) + k_G(\mathbf{x}, \mathbf{x})k_P(\mathbf{z}, \mathbf{z}) + k_P(\mathbf{x}, \mathbf{x})k_G(\mathbf{z}, \mathbf{z}) + k_G(\mathbf{x}, \mathbf{x})k_G(\mathbf{z}, \mathbf{z})} = \sqrt{\bigl(k_P(\mathbf{x}, \mathbf{x}) + k_G(\mathbf{x}, \mathbf{x})\bigr)\bigl(k_P(\mathbf{z}, \mathbf{z}) + k_G(\mathbf{z}, \mathbf{z})\bigr)} = \sqrt{k_{P+G}(\mathbf{x}, \mathbf{x})\,k_{P+G}(\mathbf{z}, \mathbf{z})}.$$

Next, let $k_{P+G}$ be defined on the closed interval $a \leq \mathbf{x} \leq b$, and likewise for $\mathbf{z}$. The kernel $k_{P+G}(\mathbf{x}, \mathbf{z})$ can be expanded in the series

$$k_{P+G}(\mathbf{x}, \mathbf{z}) = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{z})$$

with positive coefficients $\lambda_i > 0$ for all $i$. According to the following condition,

$$\int_a^b \int_a^b k_{P+G}(\mathbf{x}, \mathbf{z}) \psi(\mathbf{x}) \psi(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} = \int_a^b \int_a^b k_P(\mathbf{x}, \mathbf{z}) \psi(\mathbf{x}) \psi(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} + \int_a^b \int_a^b k_G(\mathbf{x}, \mathbf{z}) \psi(\mathbf{x}) \psi(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} \geq 0 + 0 = 0,$$

which holds for all $\psi(\cdot)$ for which $\int_a^b \psi^2(\mathbf{x})\, d\mathbf{x} < \infty$, the kernel $k_{P+G}$ satisfies the characterization of Mercer's theorem, and its feature space exists. Thus, $k_{P+G}$ is a kernel.

3.2 Feature Selection for the SVM by Using the L-J Method

The L-J method suggests a good idea for feature selection using the influence of the $j$th feature. First, the L-J method computes the angle $\alpha_j(\mathbf{x}_i)$ between $\nabla f(\mathbf{x})$ and $\mathbf{e}_j$; the features are then ranked by the index $\tilde{\alpha}_j$. This process can be built on the SVM classifier. We expect that applying the L-J method to the combined kernels developed in Section 3.1 can give good classification performance. The algorithm is as follows (a concrete sketch is given after the steps):

Step 1: Construct an SVM structure $g(\mathbf{x})$ from a given training set using the complete data components.
Step 2: Select a combined kernel function for the SVM classification; the combined kernel functions are $k_{P+G}$ and $k_{P \cdot G}$, used in the decision functions (3.7) and (3.8).
Step 3: Compute the gradient of $g(\mathbf{x})$ at position $\mathbf{x}$, denoted $\nabla g(\mathbf{x})$.
Step 4: Project the unit vector $\mathbf{e}_j$ onto $\nabla g(\mathbf{x})$.

Step 5: Estimate the importance of the separate feature components to $g(\mathbf{x})$ through the influence $\alpha_j(\mathbf{x}_i)$.
Step 6: Rank the features by $\tilde{\alpha}_j = 1 - \frac{2}{\pi} \cdot \frac{\sum_{i \in I_\varepsilon} \alpha_j(\mathbf{x}_i)}{|I_\varepsilon|}$, $\tilde{\alpha}_j \in [0, 1]$, and drop the unimportant features, i.e. those with smaller $\tilde{\alpha}_j$ values.
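A rough sketch of Steps 1-6 for a single (RBF) kernel is given below. It is an approximation of the L-J scores: the gradient formula matches scikit-learn's RBF parameterization k(x_i, x) = exp(-gamma * ||x_i - x||^2), and the average in Step 6 is taken over all training points rather than only the set I_epsilon near the decision boundary; the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def lj_scores(clf, X_eval, gamma):
    """Approximate L-J scores (Step 6 / equation 2.15) for an RBF-kernel SVM
    trained with scikit-learn, where k(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    sv = clf.support_vectors_                       # support vectors x_i
    coef = clf.dual_coef_[0]                        # lambda_i * y_i
    diff = sv[:, None, :] - X_eval[None, :, :]      # (n_sv, n_eval, n_features)
    K = np.exp(-gamma * np.sum(diff**2, axis=2))    # (n_sv, n_eval)
    # gradient of f(x) = sum_i coef_i k(x_i, x) + b at each evaluation point
    grads = np.einsum("i,ie,ied->ed", coef, K, 2.0 * gamma * diff)
    norms = np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12
    cosines = np.abs(grads) / norms                 # |(grad f)^T e_j| / ||grad f||
    alpha = np.arccos(np.clip(cosines, 0.0, 1.0))   # acute angle to each axis
    return 1.0 - (2.0 / np.pi) * alpha.mean(axis=0)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)    # only two informative features
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(np.round(lj_scores(clf, X, gamma=0.5), 3))    # first two scores should be largest
```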

CHAPTER 4 ILLUSTRATION

To illustrate the effectiveness of the proposed approaches, we use twelve data sets for the classification tasks. In addition, the relevant strategies of kernel selection and parameter setting are investigated.

4.1 Data Sets

Three data sets, on hyperlipidemia, liver disease, and renal disease, were collected by the Department of Health Examination at Chang Gung Memorial Hospital in Tao-Yuan, Taiwan, from subjects seeking an annual physical health check-up. Thirty-one anthropometric measurements taken by the whole-body scanner were employed as the independent variables; the dependent variable was whether or not the subject suffered from the disease in each disease data set. In addition to the medical data, nine data sets from the UCI repository (Blake and Merz, 1998) were used: census income, shuttle, mushroom, letter, ionosphere, vehicle silhouettes, spambase, vowel, and sonar. Among the twelve data sets, seven were considered the larger ones, as each contained more than 5,000 samples (Oyang et al., 2005); the remaining five data sets were considered the smaller ones. Before the experiment, some data preprocessing was performed. Because some anthropometric records were incomplete, these records were deleted. To reduce the differences among the features, the data were normalized prior to running the SVM classifier: each value was scaled to the $[-1, 1]$ interval via equation (4.1). The meta-data, including the number of features, classes, cases, and the feature style, are presented in Table 4.1, together with the data complexity computed by the ML and the ratio of positive to negative samples.

$$x_{\text{new}} = \frac{x - \bar{x}}{\max(x) - \min(x)} \qquad (4.1)$$

4.2 Implementation Results

Five approaches, namely the linear kernel, the two popular kernels (polynomial and RBF), and the two proposed kernels (polynomial plus RBF, $k_{P+G}$; polynomial multiplied by RBF, $k_{P \cdot G}$), were implemented for the classification tasks. The one-against-one procedure was used to calculate the classification accuracy in the multi-class SVM models; otherwise the standard two-class procedure was employed. Furthermore, a popular classifier, the K-nearest-neighbor (KNN) method, was used as the benchmark in the experiment. To simplify the classification process, the parameter $a$ was set to 0 and $b$ to 1 in the polynomial kernel, and only the degree $d$ was changed; the RBF kernel remained in its original form, i.e. the kernel width $\gamma$ could be changed. In the experiment, parameter $d$ was varied from 2 to 10, and parameter $\gamma$ was set to $10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}$, respectively. The twelve data sets were separated into large ones (more than 5,000 samples) and small ones (fewer than 5,000 samples), as mentioned before; the imbalanced data set is shuttle. Table 4.2 compares the classification accuracy on the larger and smaller data sets. On the larger data sets, the classification accuracy of the SVM-based approaches is in general better than that of the KNN approach. Among them, the proposed kernel $k_{P \cdot G}$ (polynomial multiplied by RBF) has the best performance, followed by the other proposed kernel $k_{P+G}$. For the smaller data sets, the results are similar: in general, the SVM-based approaches outperform the KNN approach, and the combined kernel $k_{P \cdot G}$ again performs best. In addition, we found that the performance of the proposed kernels is not as good for the

classification of imbalanced data. Based on the ML, the proposed combined kernels perform better when the ML is small.

The implementation results of feature selection are shown in Tables 4.3 and 4.4. In this procedure, the kernels were applied within the L-J method for feature selection, and the optimal parameter settings were employed in the SVM models. As above, the original SVM technique was used when the data had two classes; otherwise, the SVM model worked through the one-against-one process. The average classification accuracies for the seven larger data sets and the five smaller data sets are shown in Tables 4.3 and 4.4, with standard deviations listed in brackets. The two tables indicate that the combined kernel $k_{P \cdot G}$ performs better than the other approaches. After feature selection (from 75% down to 25% of the features), the kernel $k_{P \cdot G}$ also shows better performance on both the larger and the smaller data. On the larger data, the combined kernel $k_{P+G}$ performs better than the polynomial and RBF kernels, and the same holds on the smaller data. Furthermore, the kernel $k_{P \cdot G}$ almost always has the lowest standard deviation among the four approaches on the larger data, and it also performs well on the smaller data sets.
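The parameter grid reported above (d from 2 to 10 for the polynomial kernel; gamma from 10^-3 to 10^2 for the RBF kernel) can be searched with cross-validation, as Scholkopf and Smola (2002) suggest. The sketch below uses scikit-learn, whose polynomial kernel is parameterized slightly differently ((gamma * x.x' + coef0)^degree), and synthetic data; it illustrates the protocol rather than reproducing the experiments:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Grid mirroring the settings reported above.
param_grid = [
    {"kernel": ["poly"], "degree": list(range(2, 11)), "coef0": [0.0], "C": [1.0]},
    {"kernel": ["rbf"], "gamma": [10.0**p for p in range(-3, 3)], "C": [1.0]},
]

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
y = np.where(np.sum(X[:, :4] ** 2, axis=1) > 4.0, 1, -1)

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy").fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```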

Table 4.1 Data sets used in this study

| No | Data set | # of samples | # of features | # of classes | Data style | Data complexity (ML) | Ratio of positive to negative |
|----|----------|--------------|---------------|--------------|------------|----------------------|-------------------------------|
| 1 | Hyperlipidemia | 6000 | 33 | 2 | c | 1.00 | 1:2.08 |
| 2 | Liver disease | 6000 | 33 | 2 | c | 1.46 | 1:2.81 |
| 3 | Renal disease | 6000 | 33 | 2 | c | 1.06 | 1:3.60 |
| 4 | Census income | 32561 | 14 | 2 | c, d | 1.41 | 1:3.15 |
| 5 | Shuttle* | 14500 | 9 | 7 | c | 3.63 | 1:14.20 |
| 6 | Mushroom | 8124 | 22 | 2 | d | 1.01 | 1:1.07 |
| 7 | Letter | 15000 | 16 | 26 | c | 1.00 | 1:1.13 |
| 8 | Sonar | 208 | 60 | 2 | c | 1.01 | 1:1.14 |
| 9 | Ionosphere | 351 | 34 | 2 | c | 1.08 | 1:1.79 |
| 10 | Vehicle silhouettes | 846 | 18 | 4 | c | 1.09 | 1:1.40 |
| 11 | Spambase | 4601 | 57 | 2 | c | 1.04 | 1:1.54 |
| 12 | Vowel | 990 | 13 | 11 | c, d | 1.05 | 1:1.00 |

c: continuous; d: discrete. *: imbalanced data set.

Table 4.2 Comparison of classification accuracy with the larger and smaller data sets

| No. | Data set | SVM linear | SVM Polynomial | SVM RBF | SVM Poly + RBF | SVM Poly × RBF | KNN (k=1) | KNN (k=3) |
|-----|----------|------------|----------------|---------|----------------|----------------|-----------|-----------|
| 1 | Hyperlipidemia | 68.06 | 51.92 | 68.06 | 69 | 69 | 59 | 68 |
| 2 | Liver disease | 73.75 | 73.5 | 73.5 | 77 | 77 | 60 | 68 |
| 3 | Renal disease | 78.5 | 78.58 | 78.5 | 82 | 82 | 67 | 81 |
| 4 | Census income | 70.5 | 75 | 71.5 | 74 | 75.8 | 74 | 73 |
| 5 | Shuttle* | 95.5 | 98.2 | 95.5 | 96.8 | 95.5 | 100 | 100 |
| 6 | Mushroom | 97.33 | 100 | 99.73 | 99.73 | 100 | 91.33 | 94 |
| 7 | Letter | 82.25 | 86.75 | 90.75 | 87.5 | 90.75 | 81 | 78.65 |
| | AVERAGE (larger) | 80.84 (11.65) | 80.56 (16.49) | 82.51 (12.65) | 83.72 (11.55) | 84.29 (11.39) | 76.05 (15.63) | 80.38 (12.48) |
| 8 | Sonar | 85.33 | 88.1 | 88.1 | 92.86 | 95.23 | 83.33 | 85.71 |
| 9 | Ionosphere | 81.43 | 84.29 | 84.29 | 91.43 | 91.43 | 81.43 | 78.57 |
| 10 | Vehicle silhouettes | 75 | 79.3 | 82.84 | 79.3 | 85.8 | 75 | 83 |
| 11 | Spambase | 94.5 | 94.5 | 94.5 | 94.5 | 95 | 88 | 87.5 |
| 12 | Vowel | 98.89 | 90.91 | 98.89 | 99.49 | 99.49 | 97.22 | 99.72 |
| | AVERAGE (smaller) | 87.03 (9.69) | 87.42 (5.88) | 89.72 (6.83) | 91.52 (7.48) | 93.39 (5.11) | 85 (8.27) | 86.9 (7.92) |

*: imbalanced data set. ( ): standard deviation. Nos. 1-7: larger data sets; Nos. 8-12: smaller data sets.

Table 4.3 The accuracy of feature selection for the SVM using the L-J method (larger data sets)

Full (100%) and Reduced (75%) feature sets:

| Data set | kP (100%) | kG (100%) | kP+G (100%) | kP·G (100%) | kP (75%) | kG (75%) | kP+G (75%) | kP·G (75%) |
|----------|-----------|-----------|-------------|-------------|----------|----------|------------|------------|
| Hyperlipidemia | 51.92 | 68.06 | 69 | 69 | 50.83 | 67.83 | 70 | 69.5 |
| Liver disease | 73.5 | 73.5 | 77 | 77 | 71.13 | 71.13 | 75.5 | 76.5 |
| Renal disease | 78.58 | 78.5 | 82 | 82 | 76.25 | 72.13 | 81.75 | 80.38 |
| Census income | 75 | 71.5 | 74 | 75.8 | 69.6 | 71.6 | 71.6 | 76 |
| Shuttle* | 98.2 | 95.5 | 96.8 | 95.5 | 98.4 | 98.2 | 99.8 | 98.8 |
| Mushroom | 100 | 99.73 | 99.73 | 100 | 96.8 | 98.2 | 98 | 98.8 |
| Letter | 86.75 | 90.75 | 87.5 | 90.75 | 81.25 | 81 | 83 | 83 |
| AVERAGE (Std. dev.) | 80.56 (16.49) | 82.51 (12.65) | 83.72 (11.55) | 84.29 (11.39) | 77.75 (16.53) | 80.01 (13.06) | 82.81 (12) | 83.28 (11.4) |

Reduced (50%) and Reduced (25%) feature sets:

| Data set | kP (50%) | kG (50%) | kP+G (50%) | kP·G (50%) | kP (25%) | kG (25%) | kP+G (25%) | kP·G (25%) |
|----------|----------|----------|------------|------------|----------|----------|------------|------------|
| Hyperlipidemia | 50.25 | 67 | 68.5 | 68.86 | 48.5 | 62.5 | 61.83 | 62.38 |
| Liver disease | 72 | 72 | 73 | 75 | 62.5 | 71.25 | 60.25 | 71.75 |
| Renal disease | 75 | 70 | 78 | 78.5 | 69.4 | 63.94 | 74.75 | 72.88 |
| Census income | 68.5 | 69.5 | 78 | 72.5 | 72 | 73.6 | 72 | 74 |
| Shuttle* | 91.5 | 91.41 | 91.5 | 95.5 | 81.4 | 82.6 | 81.4 | 82.8 |
| Mushroom | 98.82 | 99.73 | 99.73 | 99.73 | 100 | 99.89 | 100 | 99.92 |
| Letter | 67.5 | 73 | 73.5 | 79 | 46.5 | 52.5 | 52 | 57.5 |
| AVERAGE (Std. dev.) | 74.8 (16.12) | 77.52 (12.71) | 80.32 (11.2) | 81.3 (11.74) | 68.61 (18.67) | 72.33 (15.43) | 71.75 (15.92) | 74.46 (13.91) |

*: imbalanced data set. ( ): standard deviation.

Table 4.4 The accuracy of feature selection for the SVM using the L-J method (smaller data sets)

Full (100%) and Reduced (75%) feature sets:

| Data set | kP (100%) | kG (100%) | kP+G (100%) | kP·G (100%) | kP (75%) | kG (75%) | kP+G (75%) | kP·G (75%) |
|----------|-----------|-----------|-------------|-------------|----------|----------|------------|------------|
| Sonar | 88.1 | 88.1 | 92.86 | 95.23 | 85.71 | 88.1 | 88.1 | 95.23 |
| Ionosphere | 84.29 | 84.29 | 91.43 | 91.43 | 74.28 | 75.71 | 88.57 | 91.43 |
| Vehicle | 79.3 | 82.84 | 79.3 | 85.8 | 78.85 | 77.51 | 78.25 | 83.25 |
| Spambase | 94.5 | 94.5 | 94.5 | 95 | 92.5 | 94 | 91.5 | 94 |
| Vowel | 90.91 | 98.98 | 99.49 | 99.49 | 88.43 | 94.47 | 92.13 | 95 |
| AVERAGE (Std. dev.) | 87.42 (5.88) | 89.74 (6.86) | 91.52 (7.48) | 93.39 (5.11) | 83.95 (7.34) | 85.96 (8.92) | 87.71 (5.57) | 91.78 (5) |

Reduced (50%) and Reduced (25%) feature sets:

| Data set | kP (50%) | kG (50%) | kP+G (50%) | kP·G (50%) | kP (25%) | kG (25%) | kP+G (25%) | kP·G (25%) |
|----------|----------|----------|------------|------------|----------|----------|------------|------------|
| Sonar | 76.19 | 78.57 | 85.71 | 88.1 | 78.57 | 76.2 | 85.71 | 85.71 |
| Ionosphere | 71.42 | 78.57 | 85.71 | 90 | 72.86 | 77.14 | 82.56 | 88.57 |
| Vehicle | 78.1 | 72.19 | 78.1 | 80.47 | 69.29 | 69.41 | 72.92 | 79.51 |
| Spambase | 90.8 | 93.4 | 92 | 94.4 | 87.5 | 86.8 | 88 | 89.75 |
| Vowel | 84.48 | 93.93 | 89.39 | 94.94 | 44.44 | 39.71 | 44.95 | 45.51 |
| AVERAGE (Std. dev.) | 80.2 (7.55) | 83.33 (9.79) | 86.18 (5.24) | 89.58 (5.86) | 70.53 (16.13) | 69.85 (17.95) | 74.83 (17.66) | 77.81 (18.49) |

( ): standard deviation.

4.3 Discussions

In the experiment, we found that the parameters $d$ and $\gamma$ heavily influenced the classification accuracy, and that these two parameters have a different impact on larger and smaller data sets. For larger data sets, the degree $d$ should be higher and $\gamma$ should be lower; for smaller data sets, the degree $d$ should be lower and $\gamma$ should be higher. Figures 4.1 and 4.2 show the relationship between the parameters and the accuracy for a larger data set (renal disease) and a smaller data set (vowel), respectively.

Figure 4.1 The relationship between parameters and accuracy for the larger data set (classification accuracy versus polynomial degree d, and versus RBF kernel width from 10^-3 to 10^2).

Figure 4.2 The relationship between parameters and accuracy for the smaller data set (classification accuracy versus polynomial degree d, and versus RBF kernel width from 10^-3 to 10^2).

Some research indicated that the SVM with the kernel method provides better classification performance than linear methods (Tefas et al., 2001). In the present study, our experiment showed similar results (see Table 4.2). Although the linear kernel is not the best of the kernel-based approaches for large data sets, it is acceptable compared with the KNN approach. The other two popular kernels, polynomial and RBF, also provided acceptable performance on both the larger and

the smaller data sets. We found that the performance of the RBF kernel is better than that of the polynomial kernel on both the larger and the smaller data sets.

Concerning the setting of the parameters for the polynomial and RBF kernels, Pardo and Sberveglieri (2005) consider that larger values of the polynomial kernel parameter $d$ mean more complex classification functions (higher-order polynomials), which are useful for solving classification problems; at the same time, a smaller value of the RBF kernel parameter $\gamma$ is also good at solving classification problems. In this study, the results of our experiment are similar to theirs (see Figures 4.1 and 4.2). It is evident that a larger $d$ suits complex data because it yields a greater capacity for classification. Hence, transforming the input space of lower dimension into a feature space of higher dimension seems to make it easier to find a separating boundary.

In the following we discuss the effect of the parameters $d$ and $\gamma$ on classification accuracy. First, for the polynomial kernel with $a = b = 1$, if $d$ is adjusted from 3 to 5, the terms of the polynomial expand from 4 to 6. As the terms are expanded, the number of boundaries also increases. Although a larger $d$ value can worsen classification performance, the $d$ value can be adjusted slightly based on the complexity of the data. Next, suppose the width $\gamma$ of the RBF kernel is adjusted from $10^0$ to $10^1$; then the increment of the RBF kernel is positive. On the contrary, the increment of the RBF kernel is negative when the kernel width is decreased. Thus the user can change the kernel width until the kernel satisfies his need. From the mathematical viewpoint, when the smaller data sets lie in a lower-dimensional space, a larger width is useful for reaching the optimal solution easily and quickly. However, when the larger data sets lie in a higher-dimensional space with many local optima, it is easy to fall into a trap with a large kernel width. Thus, a small

width is best for larger data sets. Our experiment only shows the difference in classification accuracy between larger and smaller data sets under different kernel widths; we could not find a significant difference in classification accuracy for data sets with different data complexity. Based on the above discussion, some useful strategies for determining the parameters $d$ and $\gamma$ are summarized in Table 4.5. For the polynomial kernel, a larger $d$ is suitable for larger data sets and a smaller $d$ for smaller data sets; for the RBF kernel, a smaller $\gamma$ is suitable for larger data sets and a larger $\gamma$ for smaller data sets.

Table 4.5 The strategies of parameter setting of polynomial and RBF kernels

| Data set size | Polynomial (d) | RBF (γ) |
|---------------|----------------|---------|
| Larger data set | larger | smaller |
| Smaller data set | smaller | larger |

In our experiment, the multiplication kernel ($k_{P \cdot G}$) appears to be superior to the summation one ($k_{P+G}$). The reason may be that the multiplication kernel changes the degree and adjusts the width at the same time, which seems to increase classification performance, whereas the influence of these adjustments is not significant in the summation kernel. In addition, the ML was used to evaluate data complexity. As expected, the combined kernel $k_{P \cdot G}$ provides better classification performance when the ML approaches 1. However, the combined kernels do not seem superior to the other approaches when the ML is greater than approximately 1.5. A possible explanation is that, for simple problems, the SVM with an original kernel is good enough for classification; the combined kernels are not recommended for

Next, we examined the results with 100%, 75%, 50%, and 25% of the features retained after feature selection on the twelve data sets. As expected, classification performance decreases as the number of features is reduced. It is interesting to note that the more classes a data set has, the larger the percentage decrease in classification accuracy. As for the feature selection process itself, many investigators consider the most straightforward idea to be a leave-one-out or cross-validation procedure that assesses the generalization error as a function of the number of features and chooses the number of attributes that minimizes the test error; this is computationally unfavorable. Compared with such a process, the L-J method selects variables directly by the influence index α_j and thus avoids this predicament. However, kernel selection plays an important role in the L-J method and greatly affects the classification performance.
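The exact influence index α_j follows the L-J method of Hermes and Buhmann (2000) described in Chapter 2. Purely as an illustration of the underlying idea — ranking features by how strongly they affect the trained decision function rather than re-training on every candidate feature subset — the sketch below scores each feature by the mean absolute gradient of an RBF-kernel SVM decision function over the training points. This gradient score is an assumed stand-in, not the α_j of the L-J method itself, and the data are synthetic.

```python
# Illustrative influence-style ranking for a trained RBF-kernel SVM: score each
# feature j by the mean |df/dx_j| over the training points.  This is an assumed
# stand-in for the L-J influence index, not its exact definition.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

GAMMA = 0.1
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
svc = SVC(kernel="rbf", gamma=GAMMA).fit(X, y)

sv, beta = svc.support_vectors_, svc.dual_coef_.ravel()   # beta_i = alpha_i * y_i

def decision_gradient(x):
    """Gradient of f(x) = sum_i beta_i * exp(-gamma*||x - s_i||^2) + b."""
    diff = x - sv                                       # (n_SV, n_features)
    k = np.exp(-GAMMA * np.sum(diff ** 2, axis=1))      # kernel values k(s_i, x)
    return (-2.0 * GAMMA) * (beta * k) @ diff           # (n_features,)

influence = np.mean(np.abs([decision_gradient(x) for x in X]), axis=0)
ranking = np.argsort(influence)[::-1]                   # most influential first
print("feature ranking:", ranking)
```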

CHAPTER 5
A CASE STUDY: HYPERTENSION DETECTION

In this chapter, a real case from medical diagnosis is presented. We show that the L-J method, using the SVM with the selected kernel function, can be applied to reduce the attributes in hypertension detection based on anthropometric data. Further explanation and discussion are also provided.

5.1 Problem Description

Hypertension is a major disease and a significant cause of death all over the world. Related research shows that hypertension is closely associated with cardiovascular disease (Mykkanen et al., 1997; Jeppesen et al., 2000). As defined by the National High Blood Pressure Education Program (NHPEP), hypertension can be classified as shown in Table 5.1.

Table 5.1 Classification of blood pressure for adults aged 18 and older (NHPEP, 2002)

  Category        Systolic (mm Hg)          Diastolic (mm Hg)
  Optimal         <120             and      <80
  Normal          <130             and      <85
  High-normal     130-139          or       85-89
  Hypertension
    Stage 1       140-159          or       90-99
    Stage 2       160-179          or       100-109
    Stage 3       ≥180             or       ≥110

Recently, syndrome X has been investigated more and more intensively (Chen et al., 2000a). In fact, there is a significant relationship between body size and syndrome X (Lin et al., 2002). Hence, it is feasible to explore the relationship between hypertension and body size indirectly via syndrome X.

In the past, human body size was measured manually by a worker relying on his experience. The drawbacks of this approach are that it is inaccurate and time-consuming. Hence, 3D anthropometric measurement now prevails in this area.

This kind of measurement has many advantages, such as convenience and time saving, and the technique can also be applied to medical diagnosis.

A memorial hospital in Taiwan has been engaged in disease diagnosis for several years. Recently, it introduced a whole-body 3D scanning technique for patients in its Department of Health Examination. The purpose of the technique is to explore the relationship between body size and certain chronic diseases through 3D body-surface anthropometric scanning data. In practice, however, the large number of anthropometric items collected by this equipment and listed on the diagnosis report makes interpretation difficult for the physicians. Hence, it is necessary to remove the unimportant or noisy features. Here, we implement hypertension detection using the proposed feature selection approach.

5.2 Implementation

A total of thirty-one anthropometric items were collected from the hospital's 3D whole-body data bank: height, weight, head circumference, breast circumference, waist circumference, hip circumference, left upper arm circumference, right upper arm circumference, left forearm circumference, right forearm circumference, right thigh circumference, left thigh circumference, right leg circumference, left leg circumference, breast width, waist width, hip width, breast profile area, hip profile area, volume of head, surface area of head, volume of trunk, surface area of trunk, volume of left arm, surface area of left arm, volume of right arm, surface area of right arm, volume of left leg, surface area of left leg, volume of right leg, and surface area of right leg. In addition to these measurements, the subjects' age and gender were recorded, and the patients who suffered from hypertension were noted.

A total of 6,000 records were randomly selected from the original database after data pre-processing. Four kernel functions, k_P, k_G, k_{P+G}, and k_{P·G}, were employed to construct the SVM models. The polynomial kernel parameter d was set between 2 and 10, and the RBF kernel parameter γ was set to 10^-3, 10^-2, 10^-1, 10^0, 10^1, and 10^2, respectively. The results show that the combined kernel k_{P·G} performs better than the other approaches.

Next, these kernels were applied to the L-J method for feature selection. In addition to accuracy, the important features were identified by means of the influence index α_j. For instance, when k_{P·G} was employed, a total of thirteen anthropometric attributes were selected: age, weight, waist circumference, right thigh circumference, left thigh circumference, right leg circumference, left leg circumference, breast width, volume of trunk, surface area of trunk, volume of left arm, volume of right arm, and volume of right leg.

5.3 Comparisons

In order to demonstrate the effectiveness of the proposed approach, the collected data were also analyzed by three other approaches: the backpropagation neural network (BPNN), rough sets, and the decision tree. In this study, the Professional II Plus software was used to perform the BPNN computation. The results showed that the structure 33-12-1 provided the best performance with a learning rate of 0.15 and a momentum of 0.75. After training, we pruned the network based on the index P_i, the priority index of the input nodes in the trained backpropagation network structure, defined as follows (Su et al., 2002):

  P_i = Σ_{j=1}^{m} Σ_{k=1}^{s} | W_ij × V_jk |                        (5.1)

where W_ij is the weight between the i-th input node and the j-th hidden node, V_jk is the weight between the j-th hidden node and the k-th output node, and P_i is the sum of the absolute values of the products of the weights W_ij and V_jk.
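A short sketch of Eq. (5.1) is given below; the weight matrices are random placeholders standing in for the trained 33-12-1 network, and 1.65 is the cut-off used later in this section.

```python
# Sketch of the pruning index P_i of Eq. (5.1): the sum of |W_ij * V_jk| over all
# hidden nodes j and output nodes k.  W and V are random placeholders for the
# weights of the trained 33-12-1 network.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 33, 12, 1
W = rng.normal(size=(n_inputs, n_hidden))    # input-to-hidden weights W_ij
V = rng.normal(size=(n_hidden, n_outputs))   # hidden-to-output weights V_jk

# P_i = sum_j sum_k |W_ij * V_jk|
P = np.abs(W[:, :, None] * V[None, :, :]).sum(axis=(1, 2))

keep = np.where(P >= 1.65)[0]                # input nodes retained after pruning
print("retained input nodes:", keep)
```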

Based on this definition, the input nodes with P_i < 1.65 were removed from the trained 33-12-1 network. Fourteen anthropometric factors remained: weight, waist circumference, left forearm circumference, right forearm circumference, right thigh circumference, left thigh circumference, right leg circumference, breast width, hip profile area, volume of trunk, surface area of trunk, volume of left arm, volume of right arm, and volume of left leg.

The rough sets theory proposed by Pawlak (1982) provides a mathematical tool for representing and reasoning about vagueness and uncertainty. It can be viewed as an extension of classical set theory that deals with sets having fuzzy boundaries, that is, sets that cannot be precisely characterized using the available attributes. The basic concept of rough sets is the notion of an approximation space, an ordered pair A = (U, R), where U is a nonempty set of objects, called the universe, and R is an equivalence relation on U, called the indiscernibility relation. If x, y ∈ U and xRy, then x and y are indistinguishable in A. Each equivalence class induced by R, i.e., each element of the quotient set R̃ = U/R, is called an elementary set in A; an approximation space can thus alternatively be written as A = (U, R̃). It is assumed that the empty set is elementary for every approximation space A, and a definable set in A is any finite union of elementary sets in A. For x ∈ U, let [x]_R denote the equivalence class of R containing x. Each X ⊆ U is characterized in A by a pair of sets, its lower and upper approximations in A, defined respectively as

  A_low(X) = {x ∈ U | [x]_R ⊆ X}
  A_upp(X) = {x ∈ U | [x]_R ∩ X ≠ ∅}.

A rough set in A is a family of subsets of U having the same lower and upper approximations. Once the lower and upper approximations have been found, rough sets theory can be used to derive both certain and uncertain information and to induce certain and possible rules from them. In this case study, the important anthropometric factors selected by the rough sets approach are nearly the same as those selected by the BPNN approach, differing only in the breast width.
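To make the set-theoretic definitions concrete, the sketch below computes the lower and upper approximations of a target concept X for a tiny hypothetical decision table, with the indiscernibility relation induced by equality of the condition attributes.

```python
# Sketch: lower and upper approximations of a set X under the indiscernibility
# relation induced by a few condition attributes.  The tiny table is hypothetical.
from collections import defaultdict

# objects: (id, condition attribute vector, decision)
table = [(1, ("high", "yes"), 1), (2, ("high", "yes"), 0),
         (3, ("low", "no"), 0),   (4, ("low", "yes"), 1)]

# equivalence classes [x]_R: objects with identical condition attributes
classes = defaultdict(set)
for obj_id, cond, _ in table:
    classes[cond].add(obj_id)

X = {obj_id for obj_id, _, dec in table if dec == 1}   # target concept

lower = set().union(*(c for c in classes.values() if c <= X))   # [x]_R subset of X
upper = set().union(*(c for c in classes.values() if c & X))    # [x]_R intersects X
print("lower:", lower, "upper:", upper)   # lower: {4}   upper: {1, 2, 4}
```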

A decision tree is another feature selection approach. It is a popular classifier in machine learning applications and is also used as a diagnostic model in medicine. A decision tree is composed of nodes connected by branches, and the tree construction process is heuristically guided by choosing the "most informative" attribute at each step, with the aim of minimizing the expected number of tests needed for classification. Let E be the entire initial set of training examples and c_1, ..., c_N be the decision classes. A decision tree is constructed by repeatedly calling a tree construction algorithm in each generated node of the tree; construction stops when all examples in a node belong to the same class or when some other stopping criterion is satisfied. In brief, a decision tree is a flow-chart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or a class distribution. C4.5 and CART are two typical decision trees: C4.5 is an entropy-based algorithm, whereas CART builds a binary tree and uses the Gini Index (GI) to determine the splitting condition. In this study, the entropy-based tree (Quinlan, 1986) was chosen to induce the diagnostic tree because its flow-chart-like structure is more user-friendly.
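As an illustration of the "most informative attribute" criterion used by the entropy-based tree, the sketch below computes the information gain of two hypothetical attributes; the toy data are made up for illustration and do not come from the hypertension study.

```python
# Sketch of the entropy-based splitting criterion used by C4.5-style trees:
# choose the attribute with the largest information gain.  Toy data are hypothetical.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(attr) = H(labels) - sum_v (|S_v|/|S|) * H(labels in S_v)."""
    gain, n = entropy(labels), len(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [{"waist": "large", "age": "old"}, {"waist": "large", "age": "young"},
        {"waist": "small", "age": "old"}, {"waist": "small", "age": "young"}]
labels = [1, 1, 0, 0]
print({a: round(information_gain(rows, labels, a), 3) for a in ("waist", "age")})
# "waist" separates the classes perfectly (gain 1.0); "age" carries no information (gain 0.0)
```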

In our hypertension detection case, running C4.5 selected thirteen anthropometric factors: age, waist circumference, right thigh circumference, left thigh circumference, right leg circumference, left leg circumference, breast width, hip width, volume of trunk, surface area of trunk, volume of left arm, volume of right arm, and volume of right leg.

For medical applications, two measures, sensitivity and specificity, are frequently used to assess performance. Their four elements are defined as follows: true positives (TP) are the correct classifications of positive cases; true negatives (TN) are the correct classifications of negative cases; false positives (FP) are the incorrect classifications of negative cases into the positive class; and false negatives (FN) are the incorrect classifications of positive cases into the negative class. Sensitivity measures the fraction of positive cases that are classified as positive, and specificity measures the fraction of negative cases that are classified as negative. These two epidemiological measures, together with accuracy, are given by

  Sensitivity = TP / (TP + FN)                             (5.2)

  Specificity = TN / (TN + FP)                             (5.3)

  Accuracy = (TP + TN) / (TP + TN + FP + FN)               (5.4)

All of the feature selection approaches were assessed by the epidemiological indices, namely sensitivity and specificity; in addition, accuracy was employed to evaluate their performance.
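The three measures can be computed directly from the four counts, as in the sketch below; the counts shown are placeholders rather than results from this study.

```python
# Sensitivity, specificity and accuracy from the four confusion-matrix counts,
# following Eqs. (5.2)-(5.4).  The counts are placeholders, not study results.
def diagnostic_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                  # fraction of positives detected
    specificity = tn / (tn + fp)                  # fraction of negatives detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall fraction correct
    return sensitivity, specificity, accuracy

print(diagnostic_metrics(tp=50, tn=120, fp=30, fn=40))
```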

There were 13 or 14 features selected by the implemented approaches. As shown in Table 5.2, the neural-network-based model is the worst among the various approaches in terms of all three indices. We also found that, in the SVM-based approaches, the sensitivity decreased while the specificity increased after feature selection. This means that the ability to detect true negatives improved while the ability to detect true positives deteriorated, which is not favorable for diagnosis; fortunately, the observed decrease was small. Moreover, as the specificity increases, it becomes beneficial in minimizing the cost of developing new medicines for hypertension. Furthermore, the accuracy of the SVM-based model is better than those of the neural network, decision tree, and rough sets approaches. Although the results show that the decision tree and rough sets perform better on sensitivity, the SVM-based methods exhibit a smaller decrease after feature selection. In other words, the SVM-based methods are still better than the other approaches. Hence, we consider that the SVM-based method has the advantage of optimization computation and prevails over the other methods.

Table 5.2 A comparison of feature selection performance

  Methods            Features*        Sensitivity        Specificity        Accuracy
                     Full  Reduced    Full    Reduced    Full    Reduced    Full    Reduced
  Neural network     33    14         0.4478  0.3963     0.7186  0.7289     0.6883  0.6233
  SVM (L-J based)
    k_P              33    13         0.4689  0.4655     0.7356  0.7639     0.7033  0.6700
    k_G              33    13         0.4929  0.4805     0.7368  0.7693     0.7083  0.6767
    k_{P+G}          33    13         0.5143  0.4830     0.7396  0.7699     0.7133  0.6783
    k_{P·G}          33    13         0.5373  0.4987     0.7430  0.7790     0.7200  0.6900
  DT                 33    13         0.5970  0.5300     0.7178  0.7082     0.7040  0.6785
  Rough Sets         33    14         0.5996  0.5538     0.7189  0.7160     0.6876  0.6593
  *: number of features

5.4 Discussion

The aim of this study is to investigate the relationship between anthropometrical
