
Fuzzy Weighted Support Vector Regression Using the Dual Coordinate Descent Method


National Taiwan Normal University
College of Science, Department of Mathematics
Master's Thesis

Fuzzy Weighted Support Vector Regression Using the Dual Coordinate Descent Method

Author: 簡子嘉
Advisor: 張少同, Ph.D.

July 2020

謝辭 (Acknowledgments)

My time in graduate school is coming to an end. These three years of writing a thesis while taking the teacher-education program have been very full, and looking back I feel a sense of relief. This thesis was completed thanks to the support and encouragement of many people, who helped me persist to the end. First, I thank my advisor, Professor 張少同, for his guidance and supervision during the writing of this thesis, and for teaching me persistence and a proper attitude toward research. I thank the oral examination committee members, Professor 呂翠珊 and Professor 李孟峰, for their valuable comments and corrections, which made this thesis more complete; I express my deepest gratitude to them. I also thank Professor 蔡碧紋 and Professor 程毅豪, whose teaching benefited me greatly.

During my studies, the friendship among the members of our research group has been the most memorable part. I thank 怡雯 for her help with coursework, for sharing her experience, and for the ideas she raised in our meetings. I also thank my classmates 張榆, 世驊, and 楊修 for our mutual encouragement and growth in research, and 沛瑀, who studied with me and helped a great deal during the oral defense. I am glad to have had the chance to meet all of you at NTNU, and I will always treasure it. Finally, I thank my family for their quiet support and care, which allowed me to study without worries and cheered me on despite their busy lives. I share this joy with everyone.

簡子嘉
Department of Mathematics, National Taiwan Normal University
July 2020

摘要 (Abstract)

The Support Vector Machine (SVM) is a supervised learning method usually applied to classification problems. SVM has also been used for regression problems, where it is called Support Vector Regression (SVR). By adjusting the penalty parameter and the tolerable error bound, SVR is more flexible than most linear regression models, but it is more often used for regression problems with a single structure.

In the field of unsupervised learning, clustering and mixture regression problems are of great interest to us, so SVR has also been generalized to mixture regression problems. Based on fuzzy theory, this leads to Fuzzy Weighted Support Vector Regression (FWSVR). By introducing memberships into the penalty term, FWSVR can handle a mixture regression problem directly, instead of applying SVR one component at a time as before. However, FWSVR is time-consuming when handling large-scale data.

In this thesis, we introduce how support vector regression solves regression problems, and we use memberships as fuzzy weights of different data points to construct the Fuzzy Weighted Support Vector Regression (FWSVR) model. We then use the dual coordinate descent method to find the updating equations of the Lagrange multipliers in FWSVR. Finally, we consider the alpha-cut method to make the results of FWSVR more efficient and stable. Experiments show that FWSVR-DCD performs well on large-scale data with reduced computation time, and that the estimates are stable for data with outliers.

關鍵字 (Keywords): support vector regression, fuzzy weighted support vector regression, dual coordinate descent method

Abstract

The Support Vector Machine (SVM) is a supervised learning method commonly used for classification problems. SVM has also been applied to regression problems, where it is called Support Vector Regression (SVR). By tuning the penalty parameter and the acceptable error margin, SVR is more flexible than many linear regression models; however, it is mostly used for regression problems with a single structure. In the field of unsupervised learning, clustering and mixture regression problems are of great interest, so SVR has been generalized to mixture regression problems. Based on fuzzy theory, the Fuzzy Weighted Support Vector Regression (FWSVR) was developed: by introducing memberships into the penalty term, FWSVR can handle a mixture regression problem all at once, instead of applying SVR to each component separately. However, FWSVR takes a long time when it processes large-scale data. In this thesis, we introduce how SVR solves regression problems, and we use the memberships as fuzzy weights of different data points to construct the Fuzzy Weighted Support Vector Regression model. We then use the dual coordinate descent method to derive the updating equations for the parameters of FWSVR; this method is called FWSVR-DCD. Finally, we consider the alpha-cut method to make FWSVR more robust. The simulations show that FWSVR-DCD performs well in large-scale cases with reduced computing time, and that the estimated results are robust for data with outliers.

Keywords: SVR, fuzzy weighted support vector regression, dual coordinate descent method

Table of Contents
1. Introduction
2. FWSVR for mixture regression problem
   2.1. Support vector regression
   2.2. Sequential minimal optimization
   2.3. Fuzzy weighted support vector regression
3. FWSVR by DCD method
4. Simulation studies and real data application
5. Conclusions
6. References

List of Tables
Table 1. The comparative results of FWSVR-DCD with and without alpha cut
Table 2. The comparison of FWSVR-DCD, FWSVR, and FCR for large-scale data
Table 3. The comparison of FWSVR-DCD, FWSVR, and FCR for data with outliers

List of Figures
Figure 1. Hard-margin SVM would easily overfit because it cannot tolerate noise
Figure 2. The ε-insensitive loss function in soft-margin support vector regression
Figure 3. The degree of fuzziness m vs the average MSE
Figure 4. The results of FWSVR-DCD with and without alpha cut
Figure 5. Results of FWSVR-DCD (a), FWSVR (b), and FCR (c) in the large-scale case
Figure 6. The comparison of FWSVR-DCD (a), FWSVR (b), and FCR (c) for data with outliers
Figure 7. The results of Example 1
Figure 8. The results of Example 2
Figure 9. The results for data with too many outliers
Figure 10. The scatter plot of the tone perception data and the two fitted lines
Figure 11. The scatter plot of the Aphids data with fitted regression lines

1. Introduction

The Support Vector Machine (SVM) [1] is a supervised learning method developed by Cortes and Vapnik (1995) [2]. It is best known as a binary classification method, and related extensions such as support vector clustering [3] have also been developed. The main idea is to find an optimal line or hyperplane that divides the data into two groups while staying as far as possible from both groups. In addition, SVM can also be applied to regression problems, because the model of SVM is similar to that of logistic regression. These models are known as Support Vector Regression (SVR) [4]; like SVM, they are based on statistical learning theory, which explains the relationship between the expected risk and the empirical risk. Different from most linear regression models, which minimize the sum of squared errors, the objective function of SVR minimizes the structural risk. SVR gives us the flexibility to define how much error is acceptable in our model and finds an appropriate line, or hyperplane in higher dimensions, to fit the data.

In scientific data analysis, it is very common to solve regression problems with multiple structures or to deal with outliers in some data sets. However, SVR is used to estimate the parameters of single-structure regression problems. Franck et al. (2007) [5] extended SVR to regression problems with multiple structures by introducing fuzzy weights in the penalty term and an additional fuzzy constraint. Basically, the influence of a datum is weighted according to its membership, which represents the relative importance of a data point for a given regression model. This method is called Fuzzy Weighted Support Vector Regression (FWSVR), and Franck shows that the approach keeps the robustness property of SVR and performs better than the standard fuzzy c-prototypes algorithm in a noisy real situation. However, FWSVR takes more time when processing large-scale data, so we try to solve this problem in this thesis.

Because of the robustness property of SVR, SVR is often used to process data with outliers. Another advantage of SVR is that it can fit nonlinear regressions by the kernel trick, called kernel SVR. But the use of kernels is often time-consuming.

Ho et al. (2012) [6] therefore proposed a new method and showed that linear SVR training is much faster and can very efficiently produce models that are as good as kernel SVR. Therefore, we follow Ho et al. (2012) and compute the FWSVR models by the dual coordinate descent (DCD) method. Because the DCD algorithm has a shrinking implementation and simplifies the complex four-variable sub-problem into a simple two-variable one, it can save some computing time. Finally, we apply FWSVR with DCD to several cases and compare the computing cost with Franck et al. (2007). We also use the alpha-cut method of Yang et al. (2008) [7] to mitigate the influence of outliers on the results of FWSVR.

The rest of this thesis is organized as follows. In Section 2 we introduce the basic formulation of standard SVR, sequential minimal optimization, and the fuzzy weighted support vector regression. In Section 3 we present the proposed FWSVR-DCD method; moreover, we apply the alpha-cut method to make FWSVR more robust. We conduct experiments in Section 4 on several cases, including data with outliers. Section 5 is the conclusion of this thesis.

2. FWSVR for mixture regression problem

2.1. Support vector regression

SVR is an extension of SVM that can handle prediction problems. As will be seen in equations (1) and (2), by choosing the tuning parameters, the acceptable error margin (ε) and the tolerance of the loss function (γ), with cross validation, SVR gives us the flexibility to define how much error is acceptable in our model. Based on them, we find an appropriate line (or hyperplane in higher dimensions) to fit the data.

SVR was developed by Vapnik, Smola and co-workers [8]. Suppose we are given training data {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × R; in ε-SV regression (Vapnik, 1995 [2]) the goal is to find a function

f(x) = \langle w, x \rangle + b, \quad w \in R^d, \ b \in R,

that has at most ε deviation from the actual targets y_i. Here ε is a given value, which means we can tolerate an error as long as it is smaller than ε, and at the same time we make the margin as flat as possible. Flatness means that one seeks a small w, which is equivalent to minimizing the Euclidean norm \|w\|^2. Formally we can write this problem as the convex optimization problem

\min_{w, b} \ \frac{1}{2}\|w\|^2
\quad \text{s.t. } y_i - \langle w, x_i \rangle - b \le \varepsilon, \quad \langle w, x_i \rangle + b - y_i \le \varepsilon,     (1)

and call it the hard-margin support vector regression. Because the hard-margin support vector regression tries to fit all the data points, it cannot tolerate even a small error. The margin may be too flat or the model may overfit in some cases, for example as in Fig 1.

Figure 1. Hard-margin SVM would easily overfit because it cannot tolerate noise.

So an extended method was proposed, as mentioned in Smola (2004) [4]: the approach is to introduce slack variables ξ_i, ξ_i^* into the SVR problem to be optimized. Hence we can write the primal problem as

\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + \gamma \sum_{i=1}^{n} (\xi_i + \xi_i^*)
\quad \text{s.t. } y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \quad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0,     (2)

where the ε-insensitive loss function |ξ|_ε is described by

|\xi|_\varepsilon = \begin{cases} 0 & \text{if } |f(x) - y| \le \varepsilon, \\ |f(x) - y| - \varepsilon & \text{otherwise}, \end{cases}     (3)

and the constant γ > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This is called the soft-margin support vector regression. We then find the optimal w and b from the objective function as follows.

Figure 2. The ε-insensitive loss function in soft-margin support vector regression.
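To make the primal problem (2) concrete, the following small sketch sets it up with a generic convex-optimization package and also evaluates the ε-insensitive loss (3). This is only an illustration under stated assumptions: the cvxpy package, the function names, and the default parameter values are not part of the thesis, which works with the dual problem instead.

    import numpy as np
    import cvxpy as cp

    def svr_primal(X, y, gamma=1.0, eps=0.1):
        # Soft-margin SVR primal, problem (2): slacks absorb deviations beyond eps.
        n, d = X.shape
        w, b = cp.Variable(d), cp.Variable()
        xi, xi_star = cp.Variable(n, nonneg=True), cp.Variable(n, nonneg=True)
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + gamma * cp.sum(xi + xi_star))
        constraints = [y - X @ w - b <= eps + xi,
                       X @ w + b - y <= eps + xi_star]
        cp.Problem(objective, constraints).solve()
        return w.value, b.value

    def eps_insensitive_loss(residual, eps=0.1):
        # Loss (3): zero inside the eps-tube, linear outside it.
        return np.maximum(0.0, np.abs(residual) - eps)

Solving the primal directly with a generic QP solver scales poorly with the number of data points, which is one reason the thesis works with the dual formulation and, later, the dual coordinate descent method.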

First, the key step is using the Lagrange multiplier method to construct the Lagrange function and the corresponding constraints. Because the objective function has inequality restrictions, we can use the Lagrange multiplier method to help us find the minimum; it can be shown that this function has a saddle point with respect to the primal and dual variables at the optimal solution. Hence we proceed as follows:

L = \frac{1}{2}\|w\|^2 + \gamma\sum_{i=1}^{n}(\xi_i + \xi_i^*)
  - \sum_{i=1}^{n}\alpha_i(\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b)
  - \sum_{i=1}^{n}\alpha_i^*(\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b)
  - \sum_{i=1}^{n}(\eta_i\xi_i + \eta_i^*\xi_i^*),     (4)

where \alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0 are Lagrange multipliers. We can get the following relationships by partial differentiation:

\partial_b L = \sum_{i=1}^{n}(\alpha_i^* - \alpha_i) = 0,     (5)
\partial_w L = w - \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)x_i = 0,     (6)
\partial_{\xi_i^{(*)}} L = \gamma - \alpha_i^{(*)} - \eta_i^{(*)} = 0.     (7)

By solving equations (5), (6), (7), we obtain

w = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)x_i,     (8)
\sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0,     (9)
\eta_i^{(*)} = \gamma - \alpha_i^{(*)} \ge 0.     (10)

Substituting equations (8), (9), (10) into the primal objective function (4) yields the dual optimization problem:

\max_{\alpha, \alpha^*} \ -\frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x_i, x_j \rangle - \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i(\alpha_i - \alpha_i^*)
\quad \text{s.t. } \sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \in [0, \gamma].     (11)

Through equation (10), we have already eliminated the Lagrange multipliers η_i, η_i^* in the process of simplifying the formula, and the given parameters γ and ε are chosen by cross validation. Therefore, our goal is to find the Lagrange multipliers α_i, α_i^* to calculate the coefficients and the intercept of the regression model:

w = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)x_i, \qquad f(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)\langle x_i, x \rangle + b.     (12)

2.2. Sequential minimal optimization

There are many methods to solve for the Lagrange multipliers in the dual objective function, of which the most commonly used is Sequential Minimal Optimization (SMO). SMO was proposed by Platt (1998) [9] for the SVM classification case, and the method chooses to solve the smallest possible optimization problem at every step. Flake and Lawrence (2002) [10] generalized SMO to the regression problem.

Flake and Lawrence transformed the dual objective function to implement caching along with heuristics that assist the caching policy. By substituting λ_i = α_i − α_i^* and |λ_i| = α_i + α_i^*, the new Lagrange multipliers obey the constraint −γ ≤ λ_i ≤ γ, and the regression and objective functions can be rewritten as

f(x) = \sum_{j=1}^{n} \lambda_j \langle x_j, x \rangle + b     (13)

and

W(\lambda) = \frac{1}{2}\sum_{i,j=1}^{n}\lambda_i\lambda_j\langle x_i, x_j \rangle + \varepsilon\sum_{i=1}^{n}|\lambda_i| - \sum_{i=1}^{n} y_i\lambda_i,     (14)

where the Lagrange multipliers must obey the linear equality constraint \sum_{i=1}^{n}\lambda_i = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0.

Unlike the previous methods, SMO mainly wants to solve the smallest possible optimization problem, which involves only two Lagrange multipliers. Let these two multipliers have indices a and b, so that λ_a and λ_b are the two Lagrange multipliers; we can then express the objective function as

W(\lambda_a, \lambda_b) = \varepsilon|\lambda_a| + \varepsilon|\lambda_b| - y_a\lambda_a - y_b\lambda_b + \frac{1}{2}K_{aa}\lambda_a^2 + \frac{1}{2}K_{bb}\lambda_b^2 + K_{ab}\lambda_a\lambda_b + \lambda_a v_a^* + \lambda_b v_b^* + W_c,     (15)

where K_{ij} = \langle x_i, x_j \rangle, W_c is a term that is strictly constant with respect to λ_a and λ_b, and v_a^* and v_b^* are defined as

v_a^* = f^*(x_a) - b^* - \lambda_a^* K_{aa} - \lambda_b^* K_{ab}, \qquad v_b^* = f^*(x_b) - b^* - \lambda_a^* K_{ab} - \lambda_b^* K_{bb},     (16)

with f^* = f(x; \lambda^*, b^*). Note that the superscript * above is used to indicate values computed with the old parameters; for example, b^* is the intercept from the last iteration.

Because of the constraint \sum_i\lambda_i = 0, the sum of the two multipliers before the update is equal to their sum after the update when the other Lagrange multipliers are fixed. So we can let s^* = \lambda_a^* + \lambda_b^* = \lambda_a + \lambda_b and transform the objective function W(\lambda_a, \lambda_b) into a single-variable function W(\lambda_b) by substituting \lambda_a = s^* - \lambda_b:

W(\lambda_b) = \varepsilon|s^* - \lambda_b| + \varepsilon|\lambda_b| - y_a(s^* - \lambda_b) - y_b\lambda_b + \frac{1}{2}K_{aa}(s^* - \lambda_b)^2 + \frac{1}{2}K_{bb}\lambda_b^2 + K_{ab}(s^* - \lambda_b)\lambda_b + (s^* - \lambda_b)v_a^* + \lambda_b v_b^* + W_c.     (17)

To solve the objective function (17), we need to compute its partial derivative with respect to λ_b; however, the equation is not strictly differentiable because of the absolute value function

|λ_b|. Nevertheless, if we take \mathrm{sign}(\lambda_b) = \partial|\lambda_b|/\partial\lambda_b, the resulting derivative is algebraically consistent:

\frac{\partial W(\lambda_b)}{\partial \lambda_b} = \varepsilon[\mathrm{sign}(\lambda_b) - \mathrm{sign}(s^* - \lambda_b)] + y_a - y_b - K_{aa}(s^* - \lambda_b) + K_{bb}\lambda_b + K_{ab}(s^* - 2\lambda_b) - v_a^* + v_b^*.     (18)

Now, setting this equation to zero yields

\eta\,\lambda_b = y_b - y_a + \varepsilon[\mathrm{sign}(s^* - \lambda_b) - \mathrm{sign}(\lambda_b)] + s^*(K_{aa} - K_{ab}) + v_a^* - v_b^*,     (19)

where \eta = K_{aa} + K_{bb} - 2K_{ab}, and we can write an update rule for λ_b in terms of its old value:

\lambda_b^{new} = \lambda_b^* + \frac{1}{\eta}\big(y_b - y_a + \varepsilon[\mathrm{sign}(s^* - \lambda_b) - \mathrm{sign}(\lambda_b)] + f^*(x_a) - f^*(x_b)\big).     (20)

Hence, we just use this equation to update the Lagrange multipliers until the convergence conditions are satisfied. In optimization theory, the Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for an optimal solution of nonlinear programming. The KKT conditions generalize the constrained optimization problem with equality constraints handled by Lagrange multipliers to inequality constraints.

Since SVR is a convex optimization problem, we want to find a solution that makes the partial derivatives of equation (4) equal to zero. By the property of complementary slackness, the product of each Lagrange multiplier and its constraint must also equal zero, e.g.,

\alpha_i(\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) = 0.     (21)

In other words, if a data point falls within the constraint, the constraint is inactive and can be ignored directly; if the point falls outside the constraint, the constraint is active. We therefore discuss all data points in three cases. First, the Lagrange multiplier should be equal to zero if the data point is within the boundary; second, the Lagrange multiplier is not equal to zero or γ (we call these points support vectors) if the data point is on the boundary; third, the Lagrange multiplier should be equal to γ if the data point is outside the boundary. Hence, we can get the convergence conditions for support vector regression from the KKT conditions:

\lambda_i = 0 \Leftrightarrow |f(x_i) - y_i| < \varepsilon,     (22)
\lambda_i \ne 0, \ |\lambda_i| < \gamma \Leftrightarrow |f(x_i) - y_i| = \varepsilon,     (23)
|\lambda_i| = \gamma \Leftrightarrow |f(x_i) - y_i| > \varepsilon.     (24)

So we check whether the Lagrange multipliers violate equations (22), (23), (24) to judge whether the global minimum has been reached.

2.3. Fuzzy weighted support vector regression

SVM was originally designed for binary classification, and SVR is mostly used for problems with a single regression line. How to extend them effectively to multi-class classification and mixture regression problems is an important issue. In classification problems, the common way is to construct combinations of multiple classifiers, for example one-against-all or one-against-one. In mixture regression problems there is also a method that considers all components at once: Franck et al. (2007) [5] proposed Fuzzy Weighted Support Vector Regression (FWSVR) for multiple linear model estimation.

For the FWSVR algorithm, Franck used the fuzzy weighting strategy from fuzzy theory and considered the fuzzy weight u_{ik}, which denotes the membership of the i-th data point in the k-th regression line. It solves the following primal regularized optimization problem:

\min \ \sum_{k=1}^{c}\Big[\frac{1}{2}\|w_k\|^2 + \gamma\sum_{i=1}^{n} u_{ik}^m(\xi_{ik} + \xi_{ik}^*)\Big]
\quad \text{s.t. } y_i - \langle w_k, x_i \rangle - b_k \le \varepsilon + \xi_{ik}, \quad \langle w_k, x_i \rangle + b_k - y_i \le \varepsilon + \xi_{ik}^*,
\qquad \ \xi_{ik}, \xi_{ik}^* \ge 0, \quad i = 1, \dots, n, \ k = 1, \dots, c, \quad \sum_{k=1}^{c} u_{ik} = 1,     (25)

where γ > 0 is the regularization parameter, u_{ik} is the fuzzy membership, c is the number of groups, and m > 1 is a weighting exponent on each fuzzy membership.

Then the Lagrange function can be written as follows:

L = \sum_{k=1}^{c}\Big[\frac{1}{2}\|w_k\|^2 + \gamma\sum_{i=1}^{n} u_{ik}^m(\xi_{ik} + \xi_{ik}^*)
  - \sum_{i=1}^{n}\alpha_{ik}(\varepsilon + \xi_{ik} - y_i + \langle w_k, x_i \rangle + b_k)
  - \sum_{i=1}^{n}\alpha_{ik}^*(\varepsilon + \xi_{ik}^* + y_i - \langle w_k, x_i \rangle - b_k)
  - \sum_{i=1}^{n}(\eta_{ik}\xi_{ik} + \eta_{ik}^*\xi_{ik}^*)\Big]
  - \sum_{i=1}^{n}\delta_i\Big(\sum_{k=1}^{c} u_{ik} - 1\Big),     (26)

where \alpha_{ik}, \alpha_{ik}^*, \eta_{ik}, \eta_{ik}^* \ge 0 and \delta_i are Lagrange multipliers. Then the partial derivatives of L with respect to the primal variables have to equal 0 for optimality:

\partial_{w_k} L = w_k - \sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)x_i = 0,     (27)
\partial_{b_k} L = -\sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*) = 0,     (28)
\partial_{\xi_{ik}^{(*)}} L = \gamma u_{ik}^m - \alpha_{ik}^{(*)} - \eta_{ik}^{(*)} = 0,     (29)
\partial_{u_{ik}} L = m\gamma u_{ik}^{m-1}(\xi_{ik} + \xi_{ik}^*) - \delta_i = 0.     (30)

By solving equations (27), (28), (29), (30), we obtain

w_k = \sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)x_i,     (31)
\sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*) = 0,     (32)
0 \le \alpha_{ik}, \alpha_{ik}^* \le \gamma u_{ik}^m,     (33)
u_{ik} = \frac{\big(1/(\xi_{ik} + \xi_{ik}^*)\big)^{1/(m-1)}}{\sum_{l=1}^{c}\big(1/(\xi_{il} + \xi_{il}^*)\big)^{1/(m-1)}}.     (34)

Because the slack variables \xi_{ik}, \xi_{ik}^* are obtained from the loss function, they satisfy

\xi_{ik} = \max(0, \ r_{ik} - \varepsilon), \qquad \xi_{ik}^* = \max(0, \ -r_{ik} - \varepsilon),     (35)

where r_{ik} = y_i - \langle w_k, x_i \rangle - b_k. By the above equations we know that \xi_{ik} and \xi_{ik}^* are greater than or equal to zero, and if one of them is positive the other is 0. Then the equation for u_{ik} can be expressed by

u_{ik} = \frac{\max(0, \ r_{ik} - \varepsilon, \ -r_{ik} - \varepsilon)^{-1/(m-1)}}{\sum_{l=1}^{c}\max(0, \ r_{il} - \varepsilon, \ -r_{il} - \varepsilon)^{-1/(m-1)}}.     (36)

Then substituting the results into the Lagrange function yields the following dual problem:

\max_{\alpha_k, \alpha_k^*} \ W_k \quad \text{s.t. } \sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*) = 0, \quad \alpha_{ik}, \alpha_{ik}^* \in [0, \gamma u_{ik}^m],

where

W_k = -\frac{1}{2}\sum_{i,j=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)(\alpha_{jk} - \alpha_{jk}^*)\langle x_i, x_j \rangle - \varepsilon\sum_{i=1}^{n}(\alpha_{ik} + \alpha_{ik}^*) + \sum_{i=1}^{n} y_i(\alpha_{ik} - \alpha_{ik}^*).     (37)

As we can notice, the Lagrange function of FWSVR, equation (26), is similar to the Lagrange function (4) of SVR; the main difference is the bounds on the Lagrange multipliers α_{ik}, α_{ik}^*, whose upper bound changes from γ to γ u_{ik}^m. Because of the meaning of the Lagrange multipliers mentioned before, a data point has no contribution if its Lagrange multiplier equals zero. Therefore, the lower the membership, the closer the upper bound γ u_{ik}^m is to zero, and the Lagrange multiplier will also approach 0, indicating that this data point does not contribute much to the model.

Then, following the previous remarks, Franck et al. (2007) proposed the FWSVR algorithm with the following steps.

Algorithm 1. Fuzzy Weighted Support Vector Regression (Franck et al., 2007)
Step 1. Randomly initialize u_{ik}, 1 ≤ i ≤ n, 1 ≤ k ≤ c.
Step 2. For each model k (1 ≤ k ≤ c):
Step 2.1. Compute the residuals r_{ik}.
Step 2.2. Update u_{ik} with (36).

Step 2.3. Estimate w_k and b_k from the solution of the dual problem (37).
Step 3. Apply Step 2 until stabilization of the u_{ik} term.

The FWSVR extends the original SVR to mixture regression problems, and it keeps the advantage of the well-known robustness property of SVR. In problems with outliers, FWSVR also performs well. But in large-scale problems, FWSVR costs more computing time. Hsieh et al. (2008) [11] presented a dual coordinate descent method for linear SVM that makes training faster. In the next section we will apply the dual coordinate descent method to FWSVR.
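As a concrete illustration of the alternating scheme in Algorithm 1, the sketch below gives one plausible reading of it: each weighted SVR sub-problem is solved by scikit-learn's SVR with per-sample weights, which scale the penalty per point and thereby mimic the bound γ u_ik^m of (33), and the memberships are then recomputed from the ε-insensitive residuals as in (35)-(36). The library call, the parameter defaults, and the Dirichlet initialization are assumptions made for illustration, not the solver used by Franck et al. (2007).

    import numpy as np
    from sklearn.svm import SVR

    def fwsvr_alternate(X, y, c=2, m=2.0, gamma=100.0, eps=0.1, n_iter=20, seed=0):
        # Alternate between (i) fitting one weighted eps-SVR per component and
        # (ii) updating the memberships from the eps-insensitive residuals.
        rng = np.random.default_rng(seed)
        n = len(y)
        U = rng.dirichlet(np.ones(c), size=n)            # Step 1: random memberships
        models = [SVR(kernel="linear", C=gamma, epsilon=eps) for _ in range(c)]
        for _ in range(n_iter):
            for k in range(c):                           # Step 2: weighted SVR fits
                models[k].fit(X, y, sample_weight=U[:, k] ** m)
            R = np.column_stack([y - mdl.predict(X) for mdl in models])
            loss = np.maximum(np.abs(R) - eps, 1e-8)     # eps-insensitive loss, floored
            inv = loss ** (-1.0 / (m - 1.0))
            U = inv / inv.sum(axis=1, keepdims=True)     # Step 2.2: membership update
        return models, U

After fitting, models[k].coef_ and models[k].intercept_ give the k-th regression line, and each row of U indicates how strongly a point belongs to each component.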

3. Fuzzy weighted support vector regression using the dual coordinate descent method (proposed)

Although Franck et al. (2007) [5] proposed Fuzzy Weighted Support Vector Regression so that SVR can solve the mixture regression problem, it takes a long computing time when processing large-scale data. Ho et al. (2012) [6] proposed large-scale linear support vector regression, which solves the optimization problem by the dual coordinate descent method, and showed that linear-SVR training can very efficiently produce models that are as good as kernel SVR for some problems. In this section, we apply the optimization method of Ho et al. (2012) to train the FWSVR model, treating a combination of the Lagrange multipliers as a new variable so as to reduce the number of constraints. The dual coordinate descent method implicitly has a shrinking implementation and can omit some computing time compared with the ordinary method. Finally, we apply the alpha-cut method to make the results of FWSVR more stable. We call this method FWSVR-DCD and expect it to reduce the computing time while giving good results in large-scale cases.

Given a training set S = \{(x_1, y_1), \dots, (x_n, y_n)\}, where x_i \in R^d, y_i \in R, i = 1, \dots, n, we assume there are c structures w_1, \dots, w_c in R^d. In Franck et al. (2007), the FWSVR problem appears with bias terms b_k. One often deals with this term by appending each instance with an additional dimension,

x_i \leftarrow [x_i^T, 1]^T, \qquad w_k \leftarrow [w_k^T, b_k]^T,     (38)

and it solves the following primal regularized optimization problem:

\min \ \sum_{k=1}^{c}\Big[\frac{1}{2}\|w_k\|^2 + \gamma\sum_{i=1}^{n} u_{ik}^m(\xi_{ik} + \xi_{ik}^*)\Big]
\quad \text{s.t. } y_i - w_k^T x_i \le \varepsilon + \xi_{ik}, \quad w_k^T x_i - y_i \le \varepsilon + \xi_{ik}^*,
\qquad \ \xi_{ik}, \xi_{ik}^* \ge 0, \quad i = 1, \dots, n, \ k = 1, \dots, c, \quad \sum_{k=1}^{c} u_{ik} = 1,     (39)

where γ > 0 is the regularization parameter, u_{ik} is the fuzzy membership, and c is the number of groups. Then the Lagrange function can be written as

L = \sum_{k=1}^{c}\Big[\frac{1}{2}\|w_k\|^2 + \gamma\sum_{i=1}^{n} u_{ik}^m(\xi_{ik} + \xi_{ik}^*)
  - \sum_{i=1}^{n}\alpha_{ik}(\varepsilon + \xi_{ik} - y_i + w_k^T x_i)
  - \sum_{i=1}^{n}\alpha_{ik}^*(\varepsilon + \xi_{ik}^* + y_i - w_k^T x_i)
  - \sum_{i=1}^{n}(\eta_{ik}\xi_{ik} + \eta_{ik}^*\xi_{ik}^*)\Big]
  - \sum_{i=1}^{n}\delta_i\Big(\sum_{k=1}^{c} u_{ik} - 1\Big),     (40)

where \alpha_{ik}, \alpha_{ik}^*, \eta_{ik}, \eta_{ik}^* \ge 0 and \delta_i are Lagrange multipliers. Then the partial derivatives of L with respect to the primal variables have to equal 0 for optimality:

\partial_{w_k} L = w_k - \sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)x_i = 0,     (41)
\partial_{\xi_{ik}^{(*)}} L = \gamma u_{ik}^m - \alpha_{ik}^{(*)} - \eta_{ik}^{(*)} = 0,     (42)
\partial_{u_{ik}} L = m\gamma u_{ik}^{m-1}(\xi_{ik} + \xi_{ik}^*) - \delta_i = 0.     (43)

By solving equations (41), (42), (43), we obtain

w_k = \sum_{i=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)x_i,     (44)
0 \le \alpha_{ik}, \alpha_{ik}^* \le \gamma u_{ik}^m,     (45)
u_{ik} = \frac{\big(1/(\xi_{ik} + \xi_{ik}^*)\big)^{1/(m-1)}}{\sum_{l=1}^{c}\big(1/(\xi_{il} + \xi_{il}^*)\big)^{1/(m-1)}}.     (46)

However, there is a problem in equation (46): u_{ik} is calculated from \xi_{ik} and \xi_{ik}^*, but both \xi_{ik} and \xi_{ik}^* could be equal to 0 at the same time when the data point is within the margin. In order to avoid this situation, we replace \xi_{ik} + \xi_{ik}^* with \max(\xi_{ik} + \xi_{ik}^*, \delta) for a small tolerance δ. Hence, equation (46) can be rewritten as

u_{ik} = \frac{\big(1/\max(\xi_{ik} + \xi_{ik}^*, \delta)\big)^{1/(m-1)}}{\sum_{l=1}^{c}\big(1/\max(\xi_{il} + \xi_{il}^*, \delta)\big)^{1/(m-1)}}.     (47)
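A small sketch of the membership update (47), assuming the slack sums are computed from the ε-insensitive residuals as in (35); the function name and the default tolerance are illustrative assumptions, not taken from the thesis.

    import numpy as np

    def update_memberships(R, eps=0.1, m=2.0, delta=1e-8):
        # R: (n, c) residual matrix with R[i, k] = y_i - w_k^T x_i.
        # xi + xi* equals the eps-insensitive loss of each residual, floored at
        # delta so that points inside the tube do not cause division by zero.
        slack = np.maximum(np.abs(R) - eps, delta)
        inv = slack ** (-1.0 / (m - 1.0))
        return inv / inv.sum(axis=1, keepdims=True)   # rows sum to one, cf. (47)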

Basically, the Lagrange function and the solutions of the partial derivatives are similar to Franck et al. (2007); the intercept b_k and its constraint are absorbed into the coefficient w_k and equation (44). Substituting the results into the Lagrange function, the terms involving the multipliers of the k-th component become

-\frac{1}{2}\sum_{i,j=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)(\alpha_{jk} - \alpha_{jk}^*)\, x_i^T x_j - \varepsilon\sum_{i=1}^{n}(\alpha_{ik} + \alpha_{ik}^*) + \sum_{i=1}^{n} y_i(\alpha_{ik} - \alpha_{ik}^*).     (48)

Now the dual problem of FWSVR is

\min_{\alpha_k, \alpha_k^*} \ f_A(\alpha_k, \alpha_k^*) \quad \text{s.t. } 0 \le \alpha_{ik}, \alpha_{ik}^* \le \gamma u_{ik}^m, \quad \forall \ i = 1, \dots, n, \ k = 1, \dots, c,     (49)

where

f_A(\alpha_k, \alpha_k^*) = \frac{1}{2}\sum_{i,j=1}^{n}(\alpha_{ik} - \alpha_{ik}^*)(\alpha_{jk} - \alpha_{jk}^*)\, x_i^T x_j + \varepsilon\sum_{i=1}^{n}(\alpha_{ik} + \alpha_{ik}^*) - \sum_{i=1}^{n} y_i(\alpha_{ik} - \alpha_{ik}^*).     (50)

Basically the way of finding the Lagrange function is similar to that of standard SVR; the main difference is that the membership u_{ik} is introduced into the penalty term. This provides a simple closed-form solution for the sub-problem of equation (50), and it not only changes the bounds of the Lagrange multipliers but also reduces the number of constraints, since the equality constraint disappears once the bias term is absorbed into w_k. The next step is to solve the dual problem of FWSVR with the dual coordinate descent method.

Consider \lambda_{ik} = \alpha_{ik} - \alpha_{ik}^*; because of the property \alpha_{ik}\alpha_{ik}^* = 0, \alpha_{ik} and \alpha_{ik}^* would not be greater than 0 at the same time. Then |\lambda_{ik}| = \alpha_{ik} + \alpha_{ik}^*, and the dual problem can be transformed as

\min_{\lambda_k} \ f_B(\lambda_k) \quad \text{s.t. } -\gamma u_{ik}^m \le \lambda_{ik} \le \gamma u_{ik}^m, \ \forall i,     (51)

where

f_B(\lambda_k) = \frac{1}{2}\sum_{i,j=1}^{n}\lambda_{ik}\lambda_{jk}\, x_i^T x_j + \varepsilon\sum_{i=1}^{n}|\lambda_{ik}| - \sum_{i=1}^{n} y_i\lambda_{ik},     (52)

with Q = [Q_{ij}]_{n \times n}, Q_{ij} = x_i^T x_j, and w_k = \sum_{i=1}^{n}\lambda_{ik}x_i. Assume \lambda_k = [\lambda_{1k}, \dots, \lambda_{nk}]^T is the current iterate and its i-th component, denoted as a scalar variable s, is

being updated. Then the following one-variable sub-problem is solved:

\min_{s} \ g(s), \quad -\gamma u_{ik}^m \le s \le \gamma u_{ik}^m,     (53)

where

g(s) = \frac{1}{2}Q_{ii}(s - \lambda_{ik})^2 + \nabla_i \bar{f}_B(\lambda_k)(s - \lambda_{ik}) + \varepsilon|s| + \text{constant},     (54)

and \bar{f}_B denotes the smooth part of f_B (all terms except \varepsilon\sum_i|\lambda_{ik}|). To solve the sub-problem, we start by checking the derivative of g(s). Although g(s) is not differentiable at s = 0, its derivatives at s \ge 0 and s < 0 are respectively

g_p'(s) = \varepsilon + \nabla_i \bar{f}_B(\lambda_k) + Q_{ii}(s - \lambda_{ik}), \quad s \ge 0,
g_n'(s) = -\varepsilon + \nabla_i \bar{f}_B(\lambda_k) + Q_{ii}(s - \lambda_{ik}), \quad s < 0.     (55)

Both g_p'(s) and g_n'(s) are linear functions of s, and we consider three cases according to the values of g_p'(0) and g_n'(0).

Case (1): 0 < g_n'(0) < g_p'(0). Then

g_n'(s) = 0 \ \text{has a root for} \ s \le 0,     (56)

and the solution is

\lambda_{ik} - \frac{g_n'(\lambda_{ik})}{Q_{ii}} \quad \text{if} \ \lambda_{ik} - \frac{g_n'(\lambda_{ik})}{Q_{ii}} > -\gamma u_{ik}^m.     (57)

Case (2): g_n'(0) < g_p'(0) < 0. Then

g_p'(s) = 0 \ \text{has a root for} \ s \ge 0,     (58)

and the solution is

\lambda_{ik} - \frac{g_p'(\lambda_{ik})}{Q_{ii}} \quad \text{if} \ \lambda_{ik} - \frac{g_p'(\lambda_{ik})}{Q_{ii}} < \gamma u_{ik}^m.     (59)

Case (3): g_n'(0) \le 0 \le g_p'(0). Then g(s) is a decreasing function at s \le 0 but an increasing function at s \ge 0. Thus, an optimal solution occurs at s = 0.

Hence, the summary of the three cases shows that the sub-problem has the following closed-form solution:

\lambda_{ik} \leftarrow \min\big(\max(\lambda_{ik} + d, \ -\gamma u_{ik}^m), \ \gamma u_{ik}^m\big),
\quad \text{where } d = \begin{cases} -g_p'(\lambda_{ik})/Q_{ii} & \text{if } g_p'(\lambda_{ik}) < Q_{ii}\lambda_{ik}, \\ -g_n'(\lambda_{ik})/Q_{ii} & \text{if } g_n'(\lambda_{ik}) > Q_{ii}\lambda_{ik}, \\ -\lambda_{ik} & \text{otherwise}, \end{cases}     (60)

and we calculate \nabla_i \bar{f}_B(\lambda_k) by

\nabla_i \bar{f}_B(\lambda_k) = w_k^T x_i - y_i, \qquad w_k = \sum_{j=1}^{n}\lambda_{jk}x_j.     (61)

Now we have the equations for updating the Lagrange multipliers, and we should then find the violation of the optimality condition for the stopping condition and the shrinking procedure.

From case (1) mentioned above, we see that if -\gamma u_{ik}^m < \lambda_{ik} < 0 is optimal, then

g_n'(\lambda_{ik}) = 0.     (62)

Thus, if -\gamma u_{ik}^m < \lambda_{ik} < 0, |g_n'(\lambda_{ik})| can be considered as the violation of the optimality. Case (2) is similar to case (1); therefore, if 0 < \lambda_{ik} < \gamma u_{ik}^m, |g_p'(\lambda_{ik})| can be considered as the violation of the optimality. From case (3), we have that if \lambda_{ik} = 0 is optimal, then

g_n'(\lambda_{ik}) \le 0 \le g_p'(\lambda_{ik}).     (63)

Therefore, the violation is

g_n'(\lambda_{ik}) \ \text{if} \ \lambda_{ik} = 0 \ \text{and} \ g_n'(\lambda_{ik}) > 0, \qquad -g_p'(\lambda_{ik}) \ \text{if} \ \lambda_{ik} = 0 \ \text{and} \ g_p'(\lambda_{ik}) < 0.     (64)

After discussing these situations, we can measure the violation of the optimality conditions as

\lambda_{ik} \ \text{is optimal if and only if} \ v_{ik} = 0,     (65)

where

v_{ik} = \begin{cases}
|g_n'(\lambda_{ik})| & \text{if } \lambda_{ik} \in (-\gamma u_{ik}^m, 0), \\
|g_p'(\lambda_{ik})| & \text{if } \lambda_{ik} \in (0, \gamma u_{ik}^m), \\
g_n'(\lambda_{ik}) & \text{if } \lambda_{ik} = 0 \text{ and } g_n'(\lambda_{ik}) > 0, \\
-g_p'(\lambda_{ik}) & \text{if } \lambda_{ik} = 0 \text{ and } g_p'(\lambda_{ik}) < 0, \\
-g_n'(\lambda_{ik}) & \text{if } \lambda_{ik} = -\gamma u_{ik}^m \text{ and } g_n'(\lambda_{ik}) \le 0, \\
g_p'(\lambda_{ik}) & \text{if } \lambda_{ik} = \gamma u_{ik}^m \text{ and } g_p'(\lambda_{ik}) \ge 0, \\
0 & \text{otherwise}.
\end{cases}     (66)

We follow a stopping condition and shrinking scheme similar to those in Ho et al. (2012):

v^{(t)} / v^{(0)} < \epsilon,     (67)

where v^{(0)} and v^{(t)} are the initial violation and the violation in the t-th iteration, and ϵ is a constant used to determine how small a violation we can accept.

For shrinking, we remove bounded variables (i.e., \lambda_{ik} = 0, \gamma u_{ik}^m, or -\gamma u_{ik}^m) if they may not be changed in the final iterations. We shrink \lambda_{ik} if it satisfies one of the following conditions:

\lambda_{ik} = 0 \ \text{and} \ g_n'(\lambda_{ik}) < -\bar{M} < 0 < \bar{M} < g_p'(\lambda_{ik}),     (68)
\lambda_{ik} = \gamma u_{ik}^m \ \text{and} \ g_p'(\lambda_{ik}) < -\bar{M},     (69)
\lambda_{ik} = -\gamma u_{ik}^m \ \text{and} \ g_n'(\lambda_{ik}) > \bar{M},     (70)

where \bar{M} \equiv \max_i v_{ik} is the maximal violation of the previous iteration.

Finally, in order to further improve the convergence efficiency and stability, we consider the alpha-cut method proposed by Yang et al. (2008) [7]. The alpha-cut implementation can be used to form cluster cores such that the data points inside a cluster core have a membership value of 1. The basic idea of the alpha cut is that if the membership value of data point x_i in the k-th regression model is larger than a given threshold α, then the membership value u_{ik} is large enough to let us set it to 1 and set u_{il} = 0 for all l ≠ k. Generally, α is set to 0.7, so we also use this setting and observe the robustness for data with noise and outliers. In the next section, the performance of this algorithm is analyzed in several cases.
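The closed-form coordinate update (60)-(61) and the violation measure (66) can be sketched as follows. This is an illustrative NumPy transcription of the reconstruction above (the function names and the handling of boundary ties are assumptions), not the thesis implementation.

    import numpy as np

    def dcd_update_one(i, lam, w, X, y, ub, eps):
        # One coordinate step for component k: lam = lambda_k (length n),
        # w = sum_j lam_j x_j, ub[i] = gamma * u_ik^m is the box bound.
        Qii = X[i] @ X[i]
        grad = w @ X[i] - y[i]            # (61): gradient of the smooth part
        gp, gn = grad + eps, grad - eps   # (55) evaluated at s = lam[i]
        if gp < Qii * lam[i]:
            d = -gp / Qii
        elif gn > Qii * lam[i]:
            d = -gn / Qii
        else:
            d = -lam[i]
        new = min(max(lam[i] + d, -ub[i]), ub[i])   # (60): project onto the box
        w = w + (new - lam[i]) * X[i]               # keep w_k = sum_j lam_j x_j in sync
        lam[i] = new
        return lam, w

    def violation_one(i, lam, w, X, y, ub, eps):
        # Violation v_ik of the optimality conditions, cf. (66).
        grad = w @ X[i] - y[i]
        gp, gn = grad + eps, grad - eps
        if -ub[i] < lam[i] < 0:
            return abs(gn)
        if 0 < lam[i] < ub[i]:
            return abs(gp)
        if lam[i] == 0:
            return gn if gn > 0 else (-gp if gp < 0 else 0.0)
        if lam[i] == -ub[i]:
            return -gn if gn <= 0 else 0.0
        return gp if gp >= 0 else 0.0      # lam[i] == ub[i]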

This FWSVR by the dual coordinate descent method, which we call FWSVR-DCD, is similar to the one proposed by Franck et al. (2007). But there are some differences, especially in the selection method for the updated variables. First, following the gradient-based variable selection, if the Lagrange multiplier \alpha_{ik} > 0 (or \alpha_{ik}^* > 0), then \alpha_{ik}^* = 0 (or \alpha_{ik} = 0) in the remaining update iterations. This is true because of the property \alpha_{ik}\alpha_{ik}^* = 0. Therefore, the dual coordinate descent method has a shrinking implementation. Second, by transforming the Lagrange multipliers \alpha_{ik}, \alpha_{ik}^* into a new variable \lambda_{ik}, the DCD method is switched to a one-variable sub-problem for which a simple closed-form solution is available.

Algorithm 2. Fuzzy Weighted Support Vector Regression using the dual coordinate descent method (FWSVR-DCD)
Step 1. Given the initial \lambda_{ik}.
Step 2. Calculate w_k = \sum_i \lambda_{ik}x_i and the corresponding residuals r_{ik} = y_i - w_k^T x_i.
Step 3. Calculate the membership u_{ik} with (47) and apply the alpha cut.
Step 4. Update the Lagrange multipliers \lambda_{ik} by the updating equation (60).
Step 5. Calculate the violation v^{(t)} by the violation equation (66).
Step 6. If v^{(t)} / v^{(0)} < \epsilon, then stop. Otherwise, return to Step 2.
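Putting the pieces together, here is a compact end-to-end sketch of the FWSVR-DCD loop (Steps 1-6) under one plausible ordering of the steps: the bias is absorbed via the appended constant feature (38), memberships follow (47) with the alpha cut, and each coordinate is updated by the closed form (60). Everything here (names, initialization, parameter defaults) is an illustrative assumption rather than the thesis code, and the plain cyclic sweep omits the shrinking bookkeeping of (68)-(70).

    import numpy as np

    def fwsvr_dcd(X, y, c=2, m=2.0, gamma=100.0, eps=0.01, alpha_cut=0.7,
                  tol=1e-3, max_sweeps=100, delta=1e-8, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        Xb = np.hstack([X, np.ones((n, 1))])           # (38): absorb the bias term
        Qdiag = np.einsum("ij,ij->i", Xb, Xb)          # Q_ii = x_i^T x_i
        U = rng.dirichlet(np.ones(c), size=n)          # Step 1: random initial memberships
        Lam = np.zeros((n, c))                         # lambda_ik = alpha_ik - alpha*_ik
        W = np.zeros((c, Xb.shape[1]))                 # w_k = sum_i lambda_ik x_i
        v0 = None
        for _ in range(max_sweeps):
            ub = gamma * U ** m                        # per-coordinate box bound gamma*u^m
            viol = 0.0
            # Step 4 (with Step 5 inline): one cyclic pass of closed-form updates (60)
            for k in range(c):
                for i in range(n):
                    if ub[i, k] <= 0.0:                # membership ~0: coordinate fixed at 0
                        if Lam[i, k] != 0.0:
                            W[k] -= Lam[i, k] * Xb[i]
                            Lam[i, k] = 0.0
                        continue
                    grad = W[k] @ Xb[i] - y[i]         # (61): gradient of the smooth part
                    gp, gn = grad + eps, grad - eps
                    lam = Lam[i, k]
                    # violation v_ik of (66), accumulated before the update
                    if -ub[i, k] < lam < 0:
                        viol = max(viol, abs(gn))
                    elif 0 < lam < ub[i, k]:
                        viol = max(viol, abs(gp))
                    elif lam == 0:
                        viol = max(viol, gn if gn > 0 else (-gp if gp < 0 else 0.0))
                    elif lam == -ub[i, k] and gn <= 0:
                        viol = max(viol, -gn)
                    elif lam == ub[i, k] and gp >= 0:
                        viol = max(viol, gp)
                    # closed-form coordinate step (60), projected onto the box
                    if gp < Qdiag[i] * lam:
                        d = -gp / Qdiag[i]
                    elif gn > Qdiag[i] * lam:
                        d = -gn / Qdiag[i]
                    else:
                        d = -lam
                    new = min(max(lam + d, -ub[i, k]), ub[i, k])
                    W[k] += (new - lam) * Xb[i]        # keep w_k consistent with lambda_k
                    Lam[i, k] = new
            # Steps 2-3: residuals, memberships (47), and the alpha cut
            R = y[:, None] - Xb @ W.T
            slack = np.maximum(np.abs(R) - eps, delta)
            inv = slack ** (-1.0 / (m - 1.0))
            U = inv / inv.sum(axis=1, keepdims=True)
            top = U.argmax(axis=1)
            hard = U[np.arange(n), top] > alpha_cut
            U[hard] = 0.0
            U[np.arange(n)[hard], top[hard]] = 1.0
            # Step 6: relative stopping condition (67)
            v0 = viol if v0 is None else v0
            if v0 == 0 or viol / v0 < tol:
                break
        return W, U

Because of the appended constant feature in (38), the last entry of each returned W[k] plays the role of the intercept b_k, and each row of U gives the final (possibly alpha-cut) membership of a data point.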

4. Simulation studies and real data application

In this section, the performance of our algorithm on mixture regression problems is demonstrated, and its stability for data with outliers or for large-scale data is compared with the FWSVR of Franck et al. (2007) and fuzzy c-regression (FCR). The performance measure is based on the following expression:

MSE = \frac{1}{n \times c}\sum_{k=1}^{c}\sum_{i=1}^{n}(\hat{y}_{ik} - y_i)^2,     (71)

where \hat{y}_{ik} is the estimated value for the i-th data point under the k-th regression model.

In the following experiments, we can control the penalty applied to data assigned to the wrong regression line by tuning γ. In other words, a bigger γ means less tolerance in the penalty term, while a smaller γ means higher tolerance. The parameter ε gives the margin of tolerance: we accept the errors of data within the margin. So the larger ε is, the larger the tolerance margin and the more data are ignored; with a small ε more data errors are considered, but it is also easy to overfit. We fixed ε = 0.01 and γ = 100 for FWSVR-DCD by cross validation. The degree m of the fuzzy weights is fixed to 2, following the setting adopted by most papers, e.g., Yang et al. (2008).

Figure 3. The degree of fuzziness m vs the average MSE of 100 simulations.

As the curve in Fig 3 shows, the MSE converges to a small value and is acceptable when m is greater than or equal to 2. We assume that every line is corrupted by Gaussian noise with zero mean and variance \sigma_k^2, and that the k-th line has n_k samples.

In the first case, we compare FWSVR-DCD with and without the alpha cut. The training data sets are defined as follows:

y_i = \beta_{k0} + \beta_{k1}x_i + e_i, \quad i = 1, \dots, n_k, \ k = 1, \dots, c,
\beta_1 = (5, -5)^T, \quad \beta_2 = (-9, 1)^T,
x_i generated on (1, 10), \ e_i \sim N(0, \sigma_k^2), \ c = 2, \ n_1 = n_2 = 100, \ \sigma_1 = \sigma_2 = 1.

Table 1. The comparative results of FWSVR-DCD with and without alpha cut

                     beta_1               beta_2               MSE      Mean iterations   Mean execution time
True coefficients    (5, -5)              (-9, 1)
With alpha cut       (4.9090, -5.0054)    (-9.3492, 1.0461)    0.3357   20.2650           6.9672
Without alpha cut    (5.1937, -4.9901)    (-9.5584, 1.0397)    0.3368   31.4530           13.4473

We compare the MSEs, the numbers of iterations, and the mean execution times over 1000 simulations in Table 1, and Fig 4 shows the fitted regression lines. Obviously, both fit well and the MSEs are close, but the execution time with the alpha cut is smaller than without it. Therefore, introducing the alpha-cut method can indeed reduce the computing time and obtain the results more efficiently.
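For reference, a sketch of how the first simulated case might be generated; the uniform draw for x and the random seed are assumptions, since the thesis only states the coefficient vectors, the Gaussian noise, and n_k = 100 samples per line.

    import numpy as np

    def make_case1(n_per_line=100, sigma=1.0, seed=0):
        # Two-line mixture: beta_1 = (5, -5), beta_2 = (-9, 1), noise N(0, sigma^2).
        rng = np.random.default_rng(seed)
        betas = np.array([[5.0, -5.0], [-9.0, 1.0]])
        x = rng.uniform(1, 10, size=(len(betas), n_per_line))      # assumed uniform on (1, 10)
        y = betas[:, [0]] + betas[:, [1]] * x + rng.normal(0, sigma, x.shape)
        labels = np.repeat(np.arange(len(betas)), n_per_line)
        return x.ravel()[:, None], y.ravel(), labels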

Figure 4. The results of FWSVR-DCD with (left) and without (right) the alpha cut.

In the second case, we compare FWSVR-DCD, the FWSVR proposed by Franck et al. (2007), and fuzzy c-regression in large-scale cases. The training data are similar to the setting of the three crossing lines in Franck (2007):

y_i = \beta_{k0} + \beta_{k1}x_i + e_i, \quad i = 1, \dots, n_k, \ k = 1, \dots, c,
\beta_1 = (3, 1)^T, \quad \beta_2 = (35, 0)^T, \quad \beta_3 = (100, -1)^T,
x_i generated on (0, 100), \ e_i \sim N(0, \sigma_k^2), \ c = 3, \ n_k = 10000, \ \sigma = 4.

Table 2. The comparison of FWSVR-DCD, FWSVR, and FCR for large-scale data

                     beta_1              beta_2                beta_3                 MSE      Mean iterations   Mean execution time
True coefficients    (3, 1)              (35, 0)               (100, -1)
FWSVR-DCD            (2.7089, 1.0281)    (34.3886, 0.0030)     (100.2180, -1.0328)    0.2189   78.1290           546.3297
FWSVR                (2.6439, 0.9734)    (34.3937, -0.0615)    (101.0311, -0.9742)    0.2297   603.1430          3601.4302

Because the amount of data is large enough, the estimated results are basically good. Fig 5 shows the results of FWSVR-DCD and the FWSVR proposed by Franck (2007); both fit well. But as we can see in Table 2, the mean number of iterations and the mean execution time of FWSVR-DCD are smaller than those of FWSVR, so we keep the more effective convergence of the dual

coordinate descent method when applying it to FWSVR.

Figure 5. Results of FWSVR-DCD (a), FWSVR (b), and FCR (c) in the large-scale case.

To illustrate the robustness of FWSVR-DCD, we also apply the method to data with different percentages of outliers. The training data continue to use the above settings:

y_i = \beta_{k0} + \beta_{k1}x_i + e_i, \quad i = 1, \dots, n_k, \ k = 1, \dots, c,
\beta_1 = (3, 1)^T, \quad \beta_2 = (35, 0)^T, \quad \beta_3 = (100, -1)^T,
x_i generated on (0, 100), \ e_i \sim N(0, \sigma_k^2), \ c = 3, \ n_1 = n_2 = n_3 = 100, \ \sigma = 4,

with some outliers added. The outliers are generated from the multivariate normal distribution MN\big((50, 50)^T, \mathrm{diag}(20, 20)\big), and we tried several cases for the number of outliers n_o. The result for n_o = 50 is shown in Fig 6 and the estimated coefficients are shown in Table 3. Although FCR is the fastest method, it is easily affected by outliers. In comparison, both FWSVR-DCD and FWSVR are more stable in this situation, and

the former requires less execution time.

Table 3. The comparison of FWSVR-DCD, FWSVR, and FCR for data with outliers

                     beta_1               beta_2                beta_3                MSE      Mean iterations   Mean execution time
True coefficients    (3, 1)               (35, 0)               (100, -1)
FWSVR-DCD            (1.7646, 1.8034)     (35.0311, -0.0027)    (99.3735, -1.0097)    0.6167   51.1805           53.6005
FWSVR                (1.7646, 1.0566)     (34.9454, 0.0052)     (99.0104, -0.9208)    0.6219   132.6414          169.8776
FCR                  (17.3080, 0.2343)    (35.1593, 0.0059)     (99.2216, -0.9992)    0.6861   24.4106           0.0349

Figure 6. The comparison of FWSVR-DCD (a), FWSVR (b), and FCR (c) for data with 50 outliers.

In the previous cases, we considered different straight lines with background noise, which are the special outliers considered by Franck. Furthermore, we also consider another case where the outliers are concentrated in one place.

Example 1. Two curves with 10 outliers:

y_i = \beta_{k0} + \beta_{k1}x_i + \beta_{k2}x_i^2 + e_i, \quad i = 1, \dots, n_k, \ k = 1, \dots, c,
\beta_1 = (0, 4.5, -0.4)^T, \quad \beta_2 = (10, -3.5, 0.4)^T,
x_i generated on (1, 10), \ e_i \sim N(0, \sigma_k^2), \ c = 2, \ n_1 = n_2 = 100, \ \sigma_1 = \sigma_2 = 1.

Example 2. A curve and a straight line with 10 outliers:

y_i = \beta_{k0} + \beta_{k1}x_i + \beta_{k2}x_i^2 + e_i, \quad i = 1, \dots, n_k, \ k = 1, \dots, c,
\beta_1 = (7, 0, 0)^T, \quad \beta_2 = (10, -3.5, 0.4)^T,
x_i generated on (1, 10), \ e_i \sim N(0, \sigma_k^2), \ c = 2, \ n_1 = n_2 = 100, \ \sigma_1 = \sigma_2 = 1.

The results are shown in Fig 7 and Fig 8. If the outliers are concentrated in one place, as shown in Fig 7, Fig 8, and Fig 9, the estimated coefficients will be seriously affected when the number of outliers is too large. In the above cases, when the number of outliers n_o exceeds 50 percent of the overall data, the fitted regression models perform badly; the results are shown in Fig 9.

Figure 7. The results of Example 1.
Figure 8. The results of Example 2.
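A sketch of how Example 1 might be generated; the uniform draw for x and the exact location of the concentrated outliers are assumptions consistent only with the verbal description above, and the quadratic structure can be handled in the fitting stage by using (x, x^2) as the input features.

    import numpy as np

    def make_example1(n_per_curve=100, n_outliers=10, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        betas = np.array([[0.0, 4.5, -0.4], [10.0, -3.5, 0.4]])    # (intercept, x, x^2)
        x = rng.uniform(1, 10, size=(2, n_per_curve))
        design = np.stack([np.ones_like(x), x, x ** 2], axis=-1)   # shape (2, n, 3)
        y = (design * betas[:, None, :]).sum(-1) + rng.normal(0, sigma, x.shape)
        # Outliers concentrated in one place (location chosen for illustration only)
        x_out = rng.normal(5.0, 0.5, n_outliers)
        y_out = rng.normal(30.0, 0.5, n_outliers)
        X = np.concatenate([x.ravel(), x_out])[:, None]
        Y = np.concatenate([y.ravel(), y_out])
        return X, Y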

Figure 9. The results for data with too many outliers.

As we can note first, the influence of the fuzzy weights on the data is more effective and the MSE is smaller when m increases. The introduction of DCD and the alpha cut apparently makes the convergence more efficient and stable. From the different cases shown above, the results of FWSVR-DCD and FWSVR are stable. However, FCR sometimes does not perform well on data with outliers, because FCR is easily affected by initial values. FWSVR-DCD reduces the sensitivity to outliers due to the design of its model and the adjustment of the parameters. As noted previously, these results confirm that FWSVR-DCD reduces the time consumption compared with FWSVR on large-scale data, and it keeps the robustness property for data with outliers.

Finally, we applied FWSVR-DCD to real data; a typical data set is the tone perception data (Cohen 1984) [13]. In the tone perception experiment, a pure fundamental tone was played to a trained musician, with electronically generated overtones added that were determined by a stretching ratio. 150 data points were obtained from the same musician, and the aim of this experiment was to find out how the tuning ratio affects the perception of the tone and to decide whether either of two musical perception theories was reasonable (see Cohen [13] for more details). We set γ = 100, ε = 0.01, and m = 2 by cross validation, and the result is shown in Fig 10. These data have also been

analyzed by Bai et al. (2012) [12], and our result is consistent with theirs. Based on Fig 10, two lines are evident, which correspond to the behavior indicated by the two musical perception theories.

Figure 10. The scatter plot of the tone perception data and the fitted two lines by Bai (left) and FWSVR-DCD (right).

We also applied FWSVR-DCD to another real data set, the Aphids data (Boiteau 1998) [14]. These data contain the results of 51 independent experiments in which varying numbers of aphids were released in a flight chamber containing 12 infected and 69 healthy tobacco plants. After 24 hours, the flight chamber was fumigated to kill the aphids, and the previously healthy plants were moved to a greenhouse and monitored to detect symptoms of infection. The number of plants displaying such symptoms was recorded. These data have also been analyzed by Grun and Leisch (2008) [15]. By cross validation, we set γ = 1, ε = 0.1, and m = 2, and the result is shown in Fig 11. Because we consider FWSVR-DCD for linear models, the fitted regression lines are a little different from Grun's results, but roughly it can be seen that the fitted regression models of the two groups are actually quite similar.

Figure 11. The scatter plot of the Aphids data with fitted regression lines by Grun (left) and FWSVR-DCD (right).

Through the simulation studies and real data applications, the execution time of FWSVR-DCD is shorter than that of the standard FWSVR, and it keeps the robustness in outlier cases. FWSVR-DCD also performs well in real data applications. The above results are all discussed for linear models, but SVM can handle nonlinear cases with different kernel functions, for example the polynomial kernel and the Gaussian kernel. So it may be possible to apply kernel functions to extend FWSVR to nonlinear cases in the future.

5. Conclusion

The contribution of this thesis is to improve the computing time of fuzzy weighted support vector regression for large-scale problems and for data with outliers. We applied the dual coordinate descent method to FWSVR and, based on the properties of the DCD method, reduced the computing time of FWSVR while obtaining stable results for data with noise or outliers. In addition, since the membership is developed from fuzzy theory, we applied the alpha-cut method to further improve the efficiency of the calculation. The simulation results show that the proposed method yields an efficient and accurate regression model for large-scale problems and keeps its robustness for data with outliers. By analyzing a real example, the tone perception data, we demonstrate the practicality of the proposed method: it can effectively obtain good estimates from the tone perception data.

6. References

[1] Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.
[2] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
[3] Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2(Dec), 125-137.
[4] Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.
[5] Dufrenois, F., & Hamad, D. (2007). Fuzzy weighted support vector regression for multiple linear model estimation: application to object tracking in image sequences. In 2007 International Joint Conference on Neural Networks (pp. 1289-1294). IEEE.
[6] Ho, C. H., & Lin, C. J. (2012). Large-scale linear support vector regression. The Journal of Machine Learning Research, 13(1), 3323-3348.

[7] Yang, M. S., Wu, K. L., Hsieh, J. N., & Yu, J. (2008). Alpha-cut implemented fuzzy clustering algorithms and switching regressions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(3), 588-603.
[8] Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems (pp. 155-161).
[9] Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.
[10] Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46(1-3), 271-290.
[11] Hsieh, C. J., Chang, K. W., Lin, C. J., Keerthi, S. S., & Sundararajan, S. (2008). A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning (pp. 408-415).
[12] Bai, X., Yao, W., & Boyer, J. E. (2012). Robust fitting of mixture regression models. Computational Statistics & Data Analysis, 56(7), 2347-2359.
[13] Cohen, E. A. (1984). Some effects of inharmonic partials on interval perception. Music Perception, 1(3), 323-349.
[14] Boiteau, G., Singh, M., Singh, R. P., Tai, G. C. C., & Turner, T. R. (1998). Rate of spread of PVYn by alate Myzus persicae (Sulzer) from infected to healthy plants under laboratory conditions. Potato Research, 41(4), 335-344.
[15] Grün, B., & Leisch, F. (2008). Finite mixtures of generalized linear regression models. In Recent Advances in Linear Models and Related Areas (pp. 205-230). Physica-Verlag HD.
