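National Science Council, Executive Yuan: Research Project Final Report
Subproject 1: Architectures of Multi-strategy Learning for Distributed Intelligent Agents in Mobile-Commerce (3/3)
Project type: Integrated project
Project number: NSC92-2213-E-002-009-
Project period: 1 August 2003 to 31 July 2004 (ROC years 92 to 93)
Executing institution: Department and Graduate Institute of Electrical Engineering, National Taiwan University
Principal investigator: 王勝德 (Sheng-de Wang)
Project participants: 林俊甫 (Chun-fu Lin), 吳國賓, 黃健豐, 許以達, 黃乙丞, 費廣瀚
Report type: Full report
Handling: This project is open to public query
1 November 2004 (Republic of China year 93)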
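National Science Council, Executive Yuan: Subsidized Research Project
■ Final report
□ Interim progress report
Research, Development, and Implementation of Key Technologies for Mobile Electronic Commerce Systems
Subproject 1: Architectures of Multi-strategy Learning for Distributed Intelligent Agents in Mobile-Commerce (3/3)
Project type: □ Individual project ■ Integrated project
Project number: NSC 92-2213-E-002-009
Project period: 1 August 2003 to 31 July 2004
Co-principal investigator: 王勝德 (Sheng-de Wang)
Project participants: 林俊甫 (Chun-fu Lin), 吳國賓, 黃健豐, 許以達, 黃乙丞, 費廣瀚
Report type (submitted according to the approved funding list): □ Condensed report ■ Full report
This report includes the following required attachments:
□ One report on overseas business travel or study
□ One report on business travel or study in mainland China
□ One report on attendance at an international academic conference, with one copy of each paper presented
□ One foreign research report for an international cooperative research project
Handling: Except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be queried publicly at once
□ Involves patents or other intellectual property rights; open to public query after ■ one year □ two years
Executing institution: Graduate Institute of Electrical Engineering, National Taiwan University
30 October 2004 (Republic of China year 93)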
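Abstract (translated from the Chinese):
This project studies intelligent agents based on multi-strategy learning and applies them to mobile electronic commerce. We study a learning approach based on support vector machines (SVMs), with which an intelligent agent learns the associations among data in order to build an initial knowledge base and rule base. Like other data classification methods, an SVM learns a decision surface from the input data, and this decision surface can be used to solve classification or regression problems. In many applications, for example those with noisy data, it is appropriate to weight the data points according to their importance. In our previous research, we expressed these weights through fuzzy membership functions and reformulated the SVM as the fuzzy support vector machine (FSVM), so that a better decision surface can be learned.
FSVMs provide a way to handle data mixed with less important exceptions or noise, but there is no fixed method for determining the fuzzy membership function. One goal of this research is to propose a more systematic or automatic method for determining the parameters of the membership function. To this end, we propose two factors, the confident factor and the trashy factor, and use a mapping function to find the parameters of the fuzzy membership function. Finally, we verify the effectiveness of the proposed method on several examples.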
Distributed Intelligent Agents in
Mobile-Commerce: The Approach of Fuzzy
Support Vector Machines with Automatic
Membership Setting
Chun-fu Lin and Sheng-de Wang
Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
Abstract. In this project we investigated the architectures and applications of multistrategy learning for distributed intelligent agents in mobile commerce. The approach uses multistrategy learning and a hybrid knowledge base so that the intelligent agents can adapt to environmental changes by constructing the knowledge base and the rule base. Support vector machines (SVMs), like other classification approaches, aim to learn a decision surface from the input points for classification or regression problems. In many applications, each input point may be associated with a different weight that reflects how strongly it should conform to the decision surface. In our previous research, we applied a fuzzy membership to each input point and reformulated SVMs as fuzzy support vector machines (FSVMs), so that different input points can make different contributions to the learning of the decision surface.
FSVMs provide a method for classification problems with noise or outliers. However, there is no general rule for determining the membership of each data point. One can manually associate each data point with a fuzzy membership that reflects its relative degree as meaningful data. To enable automatic setting of memberships, we introduce two factors for the training data points, the confident factor and the trashy factor, and automatically generate the fuzzy memberships from a heuristic strategy that uses these two factors and a mapping function. We investigate and compare two strategies in the experiments, and the results show that the generalization error of FSVMs is comparable to that of other methods on benchmark datasets.
Keywords: intelligent agents, support vector machines, fuzzy membership, fuzzy
1 Introduction
Support vector machines (SVMs) make use of statistical learning techniques and have drawn much attention in recent years [1, 2, 3, 4]. This learning theory can be seen as an alternative training technique for polynomial, radial basis function, and multi-layer perceptron classifiers [5]. SVMs are based on the structural risk minimization (SRM) induction principle [6], which aims at minimizing a bound on the generalization error rather than minimizing the mean square error. In many applications, SVMs have been shown to provide better performance than traditional learning machines. SVMs can also be used as powerful tools for solving classification [7] and regression problems [8]. For classification, SVMs have been used for isolated handwritten digit recognition [1, 9], speaker identification [10, 11], face detection in images [11, 12], knowledge-based classifiers [13], and text categorization [14, 15]. For regression estimation, SVMs have been evaluated on benchmark time series prediction tests [16, 17], financial forecasting [18, 19], and the Boston housing problem [20].
When learning to solve a classification problem, SVMs find a separating hyperplane that maximizes the margin between two classes. Maximizing the margin is a quadratic programming (QP) problem and can be solved from its dual problem by introducing Lagrangian multipliers [21]. In most cases, searching for a suitable hyperplane in the input space is too restrictive to be of practical use. The solution is to map the input space into a higher-dimensional feature space and search for the optimal hyperplane in that feature space [22, 23]. Without any knowledge of the mapping, SVMs find the optimal hyperplane by using dot product functions in the feature space, which are called kernels. The solution of the optimal hyperplane can be written as a combination of a few input points, which are called support vectors. By introducing the ε-insensitive loss function and making small modifications to the formulation, the theory of SVMs can easily be applied to regression problems. The formulated equations for regression problems are similar to those for classification problems except for the target variables.
Solving the quadratic programming problem with a dense, structured, positive semidefinite matrix is expensive with traditional quadratic programming algorithms [24]. Platt's Sequential Minimal Optimization (SMO) [25, 26] is a simple algorithm that quickly solves the SVM QP problem without any extra matrix storage. LIBSVM [27], which is a simplification of both SMO and SVMlight [28], is provided as integrated software for the implementation of support vector machines. These studies make SVMs simple and easy to use.
More and more applications can be solved by using SVM techniques. However, in many applications it is not appropriate to assign the same importance to all input points in the training process. For classification, some data points deserve to be treated as more important so that SVMs can separate them more correctly. For regression, some data points corrupted by noise are less meaningful, and the machine would do better to discard them. The original SVMs do not consider the effect of noise or outliers and thus cannot treat data points differently.
In our previous research, we applied a fuzzy membership to each input point of SVMs and reformulated SVMs into FSVMs, so that different input points can make different contributions to the learning of the decision surface. The proposed method enhances SVMs by reducing the effect of outliers and noise in the data points. FSVMs are suitable for applications in which data points have unmodeled characteristics.
For classification, since the optimal hyperplane obtained by the SVM depends on only a small part of the data points, it may become sensitive to noise or outliers in the training set [29, 30]. FSVMs address this problem by introducing fuzzy memberships for the data points. We can treat noise or outliers as less important and give these points lower fuzzy memberships. Like classical SVMs, the method is based on the maximization of the margin, but it uses fuzzy memberships to prevent noisy data points from narrowing the margin. This equips FSVMs with the ability to train on data with noise or outliers by assigning lower fuzzy memberships to the data points that are more likely to be noise or outliers.
We design a noise model that introduces two factors for the training data points, the confident factor and the trashy factor, and automatically generates the fuzzy memberships from a heuristic strategy that uses these two factors and a mapping function. This model is used to estimate the probability that a data point is noisy information, and this probability is used to tune the fuzzy membership in FSVMs. This simplifies the use of FSVMs when training on data points with noise or outliers. The experiments show that the generalization error of FSVMs is comparable to that of other methods on benchmark datasets.
The rest of this report is organized as follows. A brief review of the theory of FSVMs will be given in Section 2. The training algorithm which reduces effects of noises or outliers in classification problems is illustrated in Section 3. Some concluding remarks are given in Section 4.
2 Fuzzy Support Vector Machines
In this section, we describe in detail the idea and formulation of fuzzy support vector machines [31].
2.1 Fuzzy property of data points
The theory of SVMs is a powerful tool for solving classification problems [7], but it still has some limitations. In the training set and the formulation, each training point belongs to either one class or the other. Within each class, all training points are treated uniformly in the theory of SVMs.
In many real-world applications, the effects of the training points differ. Often some training points are more important than others for the classification problem. We would require that the meaningful training points be classified correctly, and would not care whether training points such as noise are misclassified.
That is, a training point no longer belongs exactly to one of the two classes. It may belong 90 percent to one class and be 10 percent meaningless, or it may belong 20 percent to one class and be 80 percent meaningless. In other words, there is a fuzzy membership $0 < s_i \le 1$ associated with each training point $x_i$. This fuzzy membership $s_i$ can be regarded as the attitude of the corresponding training point toward one class in the classification problem, and the value $(1 - s_i)$ can be regarded as the attitude of meaninglessness.
The same situation occurs in regression problems. In the standard SVM regression algorithm, all training points have the same effect. The fuzzy membership $s_i$ can be regarded as the importance of the corresponding training point in the regression problem. For example, in a time series prediction problem, we can associate the older training points with lower fuzzy memberships so as to reduce their effect on the optimization of the regression function.
We extend the concept of SVMs with fuzzy memberships and call the result fuzzy SVMs, or FSVMs.
2.2 Reformulated SVMs for Classification Problems
Suppose we are given a set $S$ of labeled training points with associated fuzzy memberships

$$(y_1, x_1, s_1), \ldots, (y_l, x_l, s_l). \tag{1}$$

Each training point $x_i \in \mathbb{R}^N$ is given a label $y_i \in \{-1, 1\}$ and a fuzzy membership $\sigma \le s_i \le 1$, with $i = 1, \ldots, l$ and a sufficiently small $\sigma > 0$. Let $z = \varphi(x)$ denote the corresponding feature space vector, with a mapping $\varphi$ from $\mathbb{R}^N$ to a feature space $Z$.
Since the fuzzy membership $s_i$ is the attitude of the corresponding point $x_i$ toward one class and the parameter $\xi_i$ is a measure of error in the SVMs, the term $s_i \xi_i$ is a measure of error with different weighting. The setting of the fuzzy membership $s_i$ is critical to the application of FSVMs. Although the formulation of the problem assumes that the fuzzy memberships are given in advance, it is beneficial to have the membership parameters set automatically in the course of training. To this end, we design a noise model that introduces two factors for the training data points, the confident factor and the trashy factor, and automatically generates the fuzzy memberships from a heuristic strategy that uses these two factors and a mapping function. This model is used to estimate the probability that a data point is noisy and can serve as an aid for tuning the fuzzy memberships in FSVMs. This simplifies the application of FSVMs to training data that are noisy or polluted with outliers.
The optimal hyperplane problem is regarded as the solution to

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} s_i \xi_i \\ \text{subject to} \quad & y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l, \\ & \xi_i \ge 0, \quad i = 1, \ldots, l, \end{aligned} \tag{2}$$

where $C$ is a constant. Note that a smaller $s_i$ reduces the effect of the parameter $\xi_i$ in problem (2), so that the corresponding point $z_i = \varphi(x_i)$ is treated as less important.
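The effect of $s_i$ on the objective in (2) can be illustrated with a minimal sketch. It assumes a linear kernel (so $z_i = x_i$) and evaluates the objective for a fixed, hand-chosen $(w, b)$ rather than solving the optimization; the function name `fsvm_objective` and the toy data are hypothetical and not part of the report.

```python
def fsvm_objective(w, b, C, points):
    """Value of the FSVM primal objective (2) for a linear kernel.

    points: list of (y, x, s) with label y in {-1, +1}, feature tuple x,
    and fuzzy membership s in (0, 1].
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for y, x, s in points:
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * f)  # smallest xi_i satisfying both constraints
        penalty += s * slack           # the membership scales the error term
    return margin_term + C * penalty


# Toy 1-D data: the last point is a suspected outlier on the wrong side.
data_plain = [(+1, (2.0,), 1.0), (-1, (-2.0,), 1.0), (-1, (1.5,), 1.0)]
data_fuzzy = [(+1, (2.0,), 1.0), (-1, (-2.0,), 1.0), (-1, (1.5,), 0.1)]

w, b, C = (1.0,), 0.0, 10.0
print(fsvm_objective(w, b, C, data_plain))  # outlier contributes C * 1.0 * 2.5
print(fsvm_objective(w, b, C, data_fuzzy))  # outlier contributes C * 0.1 * 2.5
```

With all memberships equal to 1 the function reduces to the standard soft-margin objective, so lowering one membership only cheapens the violation of that single point.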
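To solve this optimization problem, we construct the Lagrangian

$$L = \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} s_i \xi_i - \sum_{i=1}^{l} \beta_i \xi_i - \sum_{i=1}^{l} \alpha_i \bigl( y_i (w \cdot z_i + b) - 1 + \xi_i \bigr) \tag{3}$$

and find the saddle point of $L$, where $\alpha_i \ge 0$ and $\beta_i \ge 0$. The parameters must satisfy the following conditions:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{l} \alpha_i y_i z_i = 0, \tag{4}$$

$$\frac{\partial L}{\partial b} = - \sum_{i=1}^{l} \alpha_i y_i = 0, \tag{5}$$

$$\frac{\partial L}{\partial \xi_i} = s_i C - \alpha_i - \beta_i = 0. \tag{6}$$

Applying these conditions to the Lagrangian (3), problem (2) can be transformed into

$$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{l} y_i \alpha_i = 0, \\ & 0 \le \alpha_i \le s_i C, \quad i = 1, \ldots, l, \end{aligned} \tag{7}$$

where $K(x_i, x_j) = z_i \cdot z_j$ is the kernel, and the Karush-Kuhn-Tucker conditions are

$$\bar{\alpha}_i \bigl( y_i (\bar{w} \cdot z_i + \bar{b}) - 1 + \bar{\xi}_i \bigr) = 0, \quad i = 1, \ldots, l, \tag{8}$$

$$(s_i C - \bar{\alpha}_i)\, \bar{\xi}_i = 0, \quad i = 1, \ldots, l, \tag{9}$$

where $\bar{\alpha}_i$, $\bar{w}$, $\bar{b}$, and $\bar{\xi}_i$ denote a solution to the optimization problem (7).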
The point $x_i$ with corresponding $\bar{\alpha}_i > 0$ is called a support vector. There are two types of support vectors. One with $0 < \bar{\alpha}_i < s_i C$ lies on the margin of the hyperplane. One with $\bar{\alpha}_i = s_i C$ is a margin error: it lies inside the margin or is misclassified. An important difference between SVMs and FSVMs is that points with the same value of $\bar{\alpha}_i$ may be support vectors of different types in FSVMs because of the factor $s_i$.
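The last remark can be made concrete with a small sketch. It assumes the dual variables have already been obtained from some solver; the function name `support_vector_type`, its return labels, and the tolerance `tol` are illustrative choices, not part of the report.

```python
def support_vector_type(alpha, s, C, tol=1e-9):
    """Classify a training point from its dual variable alpha in [0, s*C]."""
    upper = s * C  # per-point upper bound from the dual problem (7)
    if alpha <= tol:
        return "none"    # not a support vector
    if alpha >= upper - tol:
        return "bound"   # alpha = s*C: margin error
    return "margin"      # 0 < alpha < s*C: lies on the margin


C = 1.0
# The same alpha value gives a different type once the membership changes.
print(support_vector_type(0.5, s=1.0, C=C))  # margin
print(support_vector_type(0.5, s=0.5, C=C))  # bound
```

In a standard SVM the bound is the same $C$ for every point, so the value of $\bar{\alpha}_i$ alone determines the type; here it must be compared with the point's own bound $s_i C$.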
2.3 Reformulated SVMs for Regression Problems
Suppose we are given a set $S$ of labeled training points with associated fuzzy memberships

$$(y_1, x_1, s_1), \ldots, (y_l, x_l, s_l). \tag{10}$$

Each training point $x_i \in \mathbb{R}^N$ is given a label $y_i \in \mathbb{R}$ and a fuzzy membership $\sigma \le s_i \le 1$, with $i = 1, \ldots, l$ and a sufficiently small $\sigma > 0$. Let $z = \varphi(x)$ denote the corresponding feature space vector, with a mapping $\varphi$ from $\mathbb{R}^N$ to a feature space $Z$.
Since the fuzzy membership $s_i$ is the importance of the corresponding point $x_i$ and the parameter $\xi_i^{(*)}$ is a measure of error in the SVMs, the term $s_i \xi_i^{(*)}$ is a measure of error with different weighting. The regression problem is then regarded as the solution to

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} s_i (\xi_i + \xi_i^*) \\ \text{subject to} \quad & y_i - (w \cdot z_i + b) \le \varepsilon + \xi_i, \\ & (w \cdot z_i + b) - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \end{aligned} \tag{11}$$

where $C$ is a constant. Note that a smaller $s_i$ reduces the effect of the parameter $\xi_i^{(*)}$ in problem (11), so that the corresponding point $z_i = \varphi(x_i)$ is treated as less important.
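To solve this optimization problem, we construct the Lagrangian

$$\begin{aligned} L = {} & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} s_i (\xi_i + \xi_i^*) - \sum_{i=1}^{l} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\ & - \sum_{i=1}^{l} \alpha_i (\varepsilon + \xi_i - y_i + w \cdot z_i + b) - \sum_{i=1}^{l} \alpha_i^* (\varepsilon + \xi_i^* + y_i - w \cdot z_i - b) \end{aligned} \tag{12}$$

and find the saddle point of $L$, where $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$. The parameters must satisfy the following conditions:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) = 0, \tag{13}$$

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, z_i = 0, \tag{14}$$

$$\frac{\partial L}{\partial \xi_i^{(*)}} = s_i C - \alpha_i^{(*)} - \eta_i^{(*)} = 0. \tag{15}$$

Applying these conditions to the Lagrangian (12), problem (11) can be transformed into

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \\ & 0 \le \alpha_i^{(*)} \le s_i C, \quad i = 1, \ldots, l, \end{aligned} \tag{16}$$

and the Karush-Kuhn-Tucker conditions are

$$\bar{\alpha}_i (\varepsilon + \bar{\xi}_i - y_i + \bar{w} \cdot z_i + \bar{b}) = 0, \quad i = 1, \ldots, l, \tag{17}$$

$$\bar{\alpha}_i^* (\varepsilon + \bar{\xi}_i^* + y_i - \bar{w} \cdot z_i - \bar{b}) = 0, \quad i = 1, \ldots, l, \tag{18}$$

$$(s_i C - \bar{\alpha}_i)\, \bar{\xi}_i = 0, \quad i = 1, \ldots, l, \tag{19}$$

$$(s_i C - \bar{\alpha}_i^*)\, \bar{\xi}_i^* = 0, \quad i = 1, \ldots, l. \tag{20}$$

The point $x_i$ with corresponding $\bar{\alpha}_i^{(*)} > 0$ is called a support vector. There are two types of support vectors. One with $0 < \bar{\alpha}_i^{(*)} < s_i C$ lies on the $\varepsilon$-insensitive tube around the function $f_R$. One with $\bar{\alpha}_i^{(*)} = s_i C$ is outside the tube. An important difference between SVMs and FSVMs is that points with the same value of $\bar{\alpha}_i^{(*)}$ may be support vectors of different types in FSVMs because of the factor $s_i$.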
2.4 Dependence on the fuzzy membership
The only free parameter $C$ in SVMs controls the trade-off between the maximization of the margin and the amount of error. In classification problems, a larger $C$ leads to fewer misclassifications and a narrower margin, whereas a smaller $C$ makes SVMs ignore more training points and yields a wider margin. In regression problems, a larger $C$ leads to less error in the regression function, and a smaller $C$ makes the regression function flatter.
In FSVMs, we can set $C$ to a sufficiently large value. FSVMs are the same as standard SVMs if we set all $s_i = 1$. With different values of $s_i$, we can control the trade-off for each training point $x_i$ individually. A smaller value of $s_i$ makes the corresponding point $x_i$ less important in the training.
There is only one free parameter in SVMs, while the number of free parameters in FSVMs equals the number of training points.
2.5 Generating the fuzzy memberships
Choosing appropriate fuzzy memberships for a given problem is straightforward. First, the lower bound of the fuzzy memberships must be defined. Second, we need to select the main property of the data set and connect this property to the fuzzy memberships.
Consider a sequential learning problem. First, we choose $\sigma > 0$ as the lower bound of the fuzzy memberships. Second, we identify time as the main property of this kind of problem and make the fuzzy membership $s_i$ a function of the time $t_i$,

$$s_i = f(t_i), \tag{21}$$

where $t_1 \le \ldots \le t_l$ are the times at which the points arrived in the system. We make the last point $x_l$ the most important and choose $s_l = f(t_l) = 1$, and make the first point $x_1$ the least important and choose $s_1 = f(t_1) = \sigma$. If we want the fuzzy membership to be a linear function of time, we can select

$$s_i = f(t_i) = a t_i + b. \tag{22}$$

Applying the boundary conditions gives

$$s_i = f(t_i) = \frac{1 - \sigma}{t_l - t_1}\, t_i + \frac{t_l \sigma - t_1}{t_l - t_1}. \tag{23}$$

If we want the fuzzy membership to be a quadratic function of time, we can select

$$s_i = f(t_i) = a (t_i - b)^2 + c. \tag{24}$$

Applying the boundary conditions gives

$$s_i = f(t_i) = (1 - \sigma) \left( \frac{t_i - t_1}{t_l - t_1} \right)^2 + \sigma. \tag{25}$$
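The two time-based memberships (23) and (25) can be sketched as follows. The function names and the sample arrival times are illustrative assumptions; only the formulas come from the text.

```python
def linear_membership(t, t1, tl, sigma):
    """Equation (23): linear in t, equal to sigma at t1 and 1 at tl."""
    return (1.0 - sigma) / (tl - t1) * t + (tl * sigma - t1) / (tl - t1)


def quadratic_membership(t, t1, tl, sigma):
    """Equation (25): quadratic in t, equal to sigma at t1 and 1 at tl."""
    return (1.0 - sigma) * ((t - t1) / (tl - t1)) ** 2 + sigma


times = [0.0, 2.5, 5.0, 7.5, 10.0]  # hypothetical arrival times
sigma = 0.1
for t in times:
    print(t,
          linear_membership(t, times[0], times[-1], sigma),
          quadratic_membership(t, times[0], times[-1], sigma))
```

Both functions meet the boundary conditions $s_1 = \sigma$ and $s_l = 1$; between the endpoints the quadratic form stays below the linear one, so it discounts older points more strongly.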
2.6 Data with time property
Sequential learning and inference methods are important in many applications involving real-time signal processing [32]. For example, we would like a learning machine in which points from the recent past are given more weight than points from far back in the past. For this purpose, we can select the fuzzy membership as a function of the time at which the point was generated, and this kind of problem can be implemented easily with FSVMs.
A Simple Example
Consider a simple classification problem as an example. Suppose we are given a sequence of training points

$$(y_1, x_1, s_1, t_1), \ldots, (y_l, x_l, s_l, t_l), \tag{26}$$

where $t_1 \le \ldots \le t_l$ are the times at which the points arrived in the system. Let the fuzzy membership $s_i$ be a function of the time $t_i$,

$$s_i = f(t_i), \tag{27}$$

such that $s_1 = \sigma \le \ldots \le s_l = 1$.
The left part of Fig. 1 shows the result of SVMs and the right part of Fig. 1 shows the result of FSVMs obtained by setting

$$s_i = f(t_i) = (1 - \sigma) \left( \frac{t_i - t_1}{t_l - t_1} \right)^2 + \sigma. \tag{28}$$

The underlined numbers are grouped as one class and the non-underlined numbers as the other class. The value of each number indicates the arrival sequence within the same interval; a smaller number denotes an older point. We can easily check that the FSVMs classify the last ten points with high accuracy while the SVMs do not.
Fig. 1. The left part is the result of SVMs learning for data with time property and the right part is the result of FSVMs learning for data with time property.

Financial Time Series Forecasting
The distribution of financial time series changes over time [33]. Because of this property, solving this kind of problem with FSVMs is more feasible than with SVMs. Cao et al. [19] proposed an exponential function

$$s_i = \frac{1}{1 + \exp\left( a - 2a\, \dfrac{t_i - t_1}{t_l - t_1} \right)}, \tag{29}$$

whose behavior can be summarized as follows:
• When $a \to 0$, $\lim_{a \to 0} s_i = 1/2$: the fuzzy membership $s_i$ approaches 1/2. In this case, the same fuzzy membership applies to all the training data points.
• When $a \to \infty$,
$$\lim_{a \to \infty} s_i = \begin{cases} 0, & t_i < (t_l + t_1)/2, \\ 1, & t_i > (t_l + t_1)/2. \end{cases}$$
In this case, the fuzzy memberships of the data points that arrived in the first half are reduced to zero, and the fuzzy memberships of the data points that arrived in the second half are equal to 1.
• When $a \in (0, \infty)$ increases, the fuzzy memberships of the data points that arrived in the first half become smaller, while the fuzzy memberships of the data points that arrived in the second half become larger.
The simulation results in [19] demonstrated that FSVMs are effective in dealing with the structural change of financial time series.
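The limiting behavior of (29) can be checked numerically with a short sketch. The function name `sigmoid_membership` and the sample values of $a$ and $t$ are illustrative assumptions; the formula itself is equation (29).

```python
import math


def sigmoid_membership(t, t1, tl, a):
    """Equation (29): s = 1 / (1 + exp(a - 2a (t - t1) / (tl - t1)))."""
    u = (t - t1) / (tl - t1)  # normalized arrival time in [0, 1]
    return 1.0 / (1.0 + math.exp(a - 2.0 * a * u))


t1, tl = 0.0, 10.0
for a in (0.0, 2.0, 50.0):
    # memberships of an early point, the midpoint, and a late point
    print(a, [sigmoid_membership(t, t1, tl, a) for t in (2.0, 5.0, 8.0)])
```

For $a = 0$ every point receives 1/2; as $a$ grows, the early point tends to 0 and the late point to 1, while the midpoint $(t_l + t_1)/2$ stays at 1/2.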
2.7 Two classes with different weighting
In some problems, we are more concerned about one situation than the other. For example, in a medical diagnosis problem we are more concerned about the accuracy of classifying a disease than about that of classifying its absence. The fault detection problem in materials has the same characteristic [34]. For example, given a point, if the machine says 1, the point belongs to this class with very high accuracy, but if the machine says -1, the point may belong to this class with lower accuracy or may really belong to the other class. For this purpose, we can select the fuzzy membership as a function of the respective class [5, 35].
Suppose we are given a sequence of training points

$$(y_1, x_1, s_1), \ldots, (y_l, x_l, s_l). \tag{30}$$

Let the fuzzy membership $s_i$ be a function of the class $y_i$,

$$s_i = \begin{cases} s_+, & \text{if } y_i = 1, \\ s_-, & \text{if } y_i = -1. \end{cases} \tag{31}$$

The left part of Fig. 2 shows the result of SVMs and the right part of Fig. 2 shows the result of FSVMs obtained by setting

$$s_i = \begin{cases} 1, & \text{if } y_i = 1, \\ 0.1, & \text{if } y_i = -1. \end{cases} \tag{32}$$

A point $x_i$ with $y_i = 1$ is indicated by a cross, and a point $x_i$ with $y_i = -1$ is indicated by a square. In the left part of Fig. 2, SVMs find the optimal hyperplane with errors appearing in each class. In the right part of Fig. 2, where we apply different fuzzy memberships to different classes, the FSVMs find the optimal hyperplane with errors appearing in only one class. We can easily check that FSVMs classify the class of crosses with high accuracy and the class of squares with low accuracy, while SVMs do not.
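A minimal sketch of the class-dependent membership (31), using the values from (32) as defaults. The function name `class_membership`, the label list, and the value of `C` are hypothetical; the sketch only shows how each class ends up with its own effective upper bound $s_i C$ in the dual problem (7).

```python
def class_membership(y, s_pos=1.0, s_neg=0.1):
    """Equation (31): the membership depends only on the class label."""
    return s_pos if y == 1 else s_neg


labels = [1, -1, -1, 1]  # hypothetical training labels
C = 100.0                # hypothetical regularization constant

memberships = [class_membership(y) for y in labels]
upper_bounds = [s * C for s in memberships]  # per-point bound on alpha_i
print(memberships)
print(upper_bounds)
```

Errors on the class with the smaller membership are penalized less, which is why the hyperplane in the right part of Fig. 2 concentrates its errors in that class.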
Fig. 2. The left part is the result of SVMs learning for data sets and the right part
3 Reducing Effects of Noises in Classification Problems
Since the optimal hyperplane obtained by the SVM depends on only a small part of the data points, it may become sensitive to noise or outliers in the training set [29, 30]. One approach to this problem is to preprocess the training data to remove noise or outliers and then use the remaining set to learn the decision function [36]. This method is hard to implement if we do not have enough knowledge about the noise or outliers. In many real-world applications, we are given a set of training data without such knowledge, and there is a risk of removing meaningful data points as noise or outliers.
This topic has been discussed widely, and some of the proposed methods perform well. The theory of Leave-One-Out SVMs [37] (LOO-SVMs) is a modified version of SVMs. This approach differs from classical SVMs in that it is not based on the maximization of the margin, but instead minimizes the expression given by the bound in an attempt to minimize the leave-one-out error. Having no free parameter makes this algorithm easy to use, but it lacks the flexibility to tune the relative degree to which outliers count as meaningful data points. Its generalization, the theory of Adaptive Margin SVMs (AM-SVMs) [38], uses a parameter λ to adjust the margin for a given learning problem. It improves the flexibility of LOO-SVMs and shows better performance. The experiments for both methods show robustness against outliers.
FSVMs address this kind of problem by introducing fuzzy memberships for the data points. We can treat noise or outliers as less important and give these points lower fuzzy memberships. Like classical SVMs, the method is based on the maximization of the margin, but it uses fuzzy memberships to prevent noisy data points from narrowing the margin. This equips FSVMs with the ability to train on data spoiled by noise or outliers by assigning lower fuzzy memberships to the data points that are more likely to be noise or outliers.
We need to assume a noise model for the training data points and then tune the fuzzy membership of each data point in the training. Without any knowledge of the distribution of the data points, it is hard to associate a fuzzy membership with each data point.
In this section, we design a noise model that introduces two factors for the training data points, the confident factor and the trashy factor, and automatically generates the fuzzy memberships from a heuristic strategy that uses these two factors and a mapping function [39]. This model is used to estimate the probability that a data point is noisy information, and this probability is used to tune the fuzzy membership in FSVMs. This simplifies the use of FSVMs when training on data points polluted with noise or outliers. The experiments show that the generalization error of FSVMs is comparable to that of other methods on benchmark datasets.
3.1 The Error Function
For efficient computation, SVMs use the least absolute value to estimate the error function, that is, $\sum_{i=1}^{l} \xi_i$, and use a regularization parameter $C$ to balance the minimization of the error function against the maximization of the margin of the optimal hyperplane. There are other ways to estimate this error function. LS-SVMs [40, 41] use the least squares value and differ in their constraints and optimization process. When the underlying error probability distribution can be estimated, we can use the maximum likelihood method to estimate the error function. Let $\xi_i$ be i.i.d. with probability density function $p_e(\xi)$, where $p_e(\xi) = 0$ if $\xi < 0$. The optimal hyperplane problem is then modified to be the solution to the problem

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} \phi(\xi_i) \\ \text{subject to} \quad & y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l, \\ & \xi_i \ge 0, \quad i = 1, \ldots, l, \end{aligned} \tag{33}$$

where $\phi(\xi) = -\ln p_e(\xi)$. Clearly, $\phi(\xi) \propto \xi$ when $p_e(\xi) \propto e^{-\xi}$, which reduces problem (33) to the original optimal hyperplane problem. Thus, with different knowledge of the error probability model, a variety of error functions can be chosen and different optimization problems can be generated.
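As a brief worked illustration of how the choice of $p_e$ determines $\phi$: maximizing the likelihood $\prod_i p_e(\xi_i)$ is the same as minimizing $\sum_i -\ln p_e(\xi_i)$. The exponential case is the one stated in the text; the half-Gaussian case is an added example chosen for illustration, not one taken from the report.

```latex
% Exponential error density (the case stated in the text):
p_e(\xi) = e^{-\xi}, \ \xi \ge 0
  \quad\Longrightarrow\quad
  \phi(\xi) = -\ln p_e(\xi) = \xi ,
  \qquad \sum_{i=1}^{l} \phi(\xi_i) = \sum_{i=1}^{l} \xi_i .

% Half-Gaussian error density (illustrative assumption):
p_e(\xi) = \sqrt{2/\pi}\; e^{-\xi^{2}/2}, \ \xi \ge 0
  \quad\Longrightarrow\quad
  \phi(\xi) = \tfrac{1}{2}\,\xi^{2} + \tfrac{1}{2}\ln\frac{\pi}{2} ,
  \qquad \sum_{i=1}^{l} \phi(\xi_i) = \tfrac{1}{2}\sum_{i=1}^{l} \xi_i^{2} + \text{const}.
```

The additive constant does not affect the minimizer, so the second case yields a squared-error penalty of the kind used by least squares formulations.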
However, there are some critical issues in this kind of application. First, the training program is hard to implement, since solving the optimal hyperplane problem (33) is in general NP-complete [1]. Second, the error estimator $\xi_i$ is related to the optimal hyperplane by

$$\xi_i = \begin{cases} 0, & \text{if } f_H(x_i) \ge 1, \\ 1 - f_H(x_i), & \text{otherwise.} \end{cases} \tag{34}$$

Therefore, one needs the correct error model in the optimization process, but one needs to know the underlying function to estimate the error model. In practice it is impossible to estimate the error distribution reliably without a good estimate of the underlying function. This is a so-called "catch 22" situation [42]. Moreover, the probability density function $p_e(\xi)$ is unknown for almost all applications.
In contrast, when the noise distribution model of the data set is known, let $p_x(x)$ be the probability density function of a data point $x$ not being noise. For a data point $x_i$ with a higher value of $p_x(x_i)$, which means that this data point is more likely to be real data, we expect the data point to obtain a lower value of $\xi_i$ in the training process, since $\xi_i$ stands for a penalty weight in the error function. To achieve this, we can modify the error function to be

$$\sum_{i=1}^{l} p_x(x_i)\, \xi_i. \tag{35}$$

Hence, the optimal hyperplane problem is modified to be the solution to the problem

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} p_x(x_i)\, \xi_i \\ \text{subject to} \quad & y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l, \\ & \xi_i \ge 0, \quad i = 1, \ldots, l. \end{aligned} \tag{36}$$

When the probability density function $p_x(x)$ in problem (36) is viewed as a kind of fuzzy membership, we can simply set $s_i = p_x(x_i)$ in problem (2), so that problem (36) can be solved with the FSVM algorithm.
3.2 The Noise Distribution Model
There are many alternatives for setting up the fuzzy memberships of the training data in FSVMs, depending on how much information the data set contains. If the data points are already associated with fuzzy memberships, we can use this information directly in training FSVMs. If a noise distribution model of the data set is given, we can set the fuzzy membership to the probability that the data point is not spoiled by noise, or to a function of it. In other words, let $p_i$ be the probability that the data point $x_i$ is not spoiled by noise. If this kind of information exists in the training data, we can assign the value $s_i = p_i$ or $s_i = f_p(p_i)$ as the fuzzy membership of each data point and use this information to obtain the optimal hyperplane in the training of FSVMs. Since almost all applications lack this information, we need some other method to predict this probability.
Suppose we are given a heuristic function $h(x)$ that is highly relevant to the probability density function $p_x(x)$. Under this assumption, we can build a relationship between the probability density function $p_x(x)$ and the heuristic function $h(x)$, defined as

$$p_x(x) = \begin{cases} 1, & \text{if } h(x) > h_C, \\ \sigma, & \text{if } h(x) < h_T, \\ \sigma + (1 - \sigma) \left( \dfrac{h(x) - h_T}{h_C - h_T} \right)^{d}, & \text{otherwise,} \end{cases} \tag{37}$$

where $h_C$ is the confident factor and $h_T$ is the trashy factor. These two factors control the mapping region between $p_x(x)$ and $h(x)$, and $d$ is the parameter that controls the degree of the mapping function, as shown in Fig. 3.
Fig. 3. The mapping between the probability density function $p_x(x)$ and the heuristic function $h(x)$. (Vertical axis: $p_x(x)$, from $\sigma$ to 1; horizontal axis: $h(x)$, with $h_T$ and $h_C$ marked; curve shown for $d = 1$.)

The training points are divided into three regions by the confident factor $h_C$ and the trashy factor $h_T$. A data point whose heuristic value $h(x)$ is larger than the confident factor $h_C$ lies in the region $h(x) > h_C$; it can be viewed as a valid example with high confidence, and its fuzzy membership is equal to 1. In contrast, a data point whose heuristic value $h(x)$ is lower than the trashy factor $h_T$ lies in the region $h(x) < h_T$; it can be regarded as noisy data with high confidence, and its fuzzy membership is assigned the lowest value $\sigma$. The data points in the remaining region are considered noisy with different probabilities and can make different contributions in the training process. There is not enough knowledge to choose the proper function for this mapping. For simplicity, a polynomial function is selected, and the parameter $d$ is used to control the degree of the mapping.
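The mapping (37) translates directly into code. The sketch below assumes $h_T < h_C$; the function name `membership_from_heuristic` and the default values of `sigma` and `d` are illustrative choices, not values prescribed by the report.

```python
def membership_from_heuristic(h, h_T, h_C, sigma=0.1, d=1.0):
    """Equation (37): map a heuristic value h to a fuzzy membership.

    h_T is the trashy factor, h_C the confident factor (h_T < h_C assumed),
    sigma the lower bound of the membership, d the degree of the mapping.
    """
    if h > h_C:
        return 1.0    # confident region: treated as valid data
    if h < h_T:
        return sigma  # trashy region: treated as noise
    return sigma + (1.0 - sigma) * ((h - h_T) / (h_C - h_T)) ** d


# Hypothetical heuristic values mapped with h_T = 0 and h_C = 1.
for h in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(h, membership_from_heuristic(h, 0.0, 1.0, d=1.0),
          membership_from_heuristic(h, 0.0, 1.0, d=2.0))
```

The mapping is continuous at both thresholds, and a larger `d` pushes the memberships in the middle region toward `sigma`.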
3.3 The Heuristic Function
To discriminate between noisy and noiseless data, we propose two strategies: one is based on kernel-target alignment and the other uses k-NN.
Strategy of Using Kernel-Target Alignment
The idea of kernel-target alignment is introduced in [43]. Let $f_K(x_i, y_i) = \sum_{j=1}^{l} y_i y_j K(x_i, x_j)$. The kernel-target alignment is defined as

$$A_{KT} = \frac{\sum_{i=1}^{l} f_K(x_i, y_i)}{l \sqrt{\sum_{i,j=1}^{l} K^2(x_i, x_j)}}. \tag{38}$$

This definition provides a method for selecting kernel parameters, and the experimental results show that adapting the kernel to improve alignment on the training data enhances the alignment on the test data and thus improves classification accuracy.
To discover a relation between the noise distribution and the data points, we focus on the value $f_K(x_i, y_i)$. Suppose $K(x_i, x_j)$ is a kind of similarity measure between the data points $x_i$ and $x_j$ in the feature space $F$. For example, with the RBF kernel $K(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^2}$, the data points live on the surface of a hypersphere in the feature space $F$, as shown in Fig. 4. Then $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$ is the cosine of the angle between $\varphi(x_i)$ and $\varphi(x_j)$.

Fig. 4. The value $f_K(x_1, y_1)$ is lower than $f_K(x_2, y_2)$ in the RBF kernel. (Figure: data points on a hypersphere in the feature space $F$, with the outlier $\varphi(x_1)$, the representative $\varphi(x_2)$, and a generic point $\varphi(x_i)$ marked.)

For the outlier $\varphi(x_1)$ and the representative $\varphi(x_2)$, we have

$$\begin{aligned} f_K(x_1, y_1) &= \sum_{y_i = y_1} K(x_1, x_i) - \sum_{y_i \ne y_1} K(x_1, x_i), \\ f_K(x_2, y_2) &= \sum_{y_i = y_2} K(x_2, x_i) - \sum_{y_i \ne y_2} K(x_2, x_i). \end{aligned} \tag{39}$$

We can easily check that

$$\begin{aligned} \sum_{y_i = y_1} K(x_1, x_i) &< \sum_{y_i = y_2} K(x_2, x_i), \\ \sum_{y_i \ne y_1} K(x_1, x_i) &> \sum_{y_i \ne y_2} K(x_2, x_i), \end{aligned} \tag{40}$$

so that the value $f_K(x_1, y_1)$ is lower than $f_K(x_2, y_2)$.
We observe this situation and assume that a data point $x_i$ with a lower value of $f_K(x_i, y_i)$ can be considered an outlier and should contribute less to the classification. Hence, we can use the function $f_K(x, y)$ as the heuristic function $h(x)$.
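To illustrate why a low value of $f_K$ flags a likely outlier, the sketch below scores every point of a toy data set in which one point carries the label of the opposite cluster. The data, the value of `gamma`, and the helper names are assumptions chosen for the illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def heuristic_f_k(X, y, gamma=1.0):
    """Return the list [f_K(x_i, y_i)] over all training points."""
    n = len(X)
    return [
        sum(y[i] * y[j] * rbf_kernel(X[i], X[j], gamma) for j in range(n))
        for i in range(n)
    ]


# Three positive points near the origin, three negative points near (2, 2),
# and one point labelled negative that sits inside the positive cluster.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
     (2.0, 2.0), (2.2, 2.0), (2.0, 2.2),
     (0.1, 0.1)]
y = [1, 1, 1, -1, -1, -1, -1]

scores = heuristic_f_k(X, y)
suspect = min(range(len(X)), key=scores.__getitem__)   # index of the lowest f_K
```

Here the mislabelled point is the only one with a negative score, because its kernel values to the opposite class outweigh those to its own class.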
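This heuristic function assumes that a data point is noisy with high probability if it is closer to the other class than to its own class. For a more theoretical discussion, let $D_\pm(x)$ be the mean squared distance between the data point $x$ and the data points $x_i$ with $y_i = \pm 1$, defined as

$$D_\pm(x) = \frac{1}{l_\pm} \sum_{y_i = \pm 1} \|x - x_i\|^2, \tag{41}$$

where $l_\pm$ is the number of data points with $y_i = \pm 1$, respectively. Then the value $y_k (D_+(x_k) - D_-(x_k))$ can be considered an indication of noise. For the same case in feature space, $D_\pm(x)$ is reformulated as

$$\begin{aligned} D_\pm(x) &= \frac{1}{l_\pm} \sum_{y_i = \pm 1} \|\varphi(x) - \varphi(x_i)\|^2 \\ &= \frac{1}{l_\pm} \sum_{y_i = \pm 1} \left( K(x, x) - 2K(x, x_i) + K(x_i, x_i) \right) \\ &= K(x, x) + \frac{1}{l_\pm} \sum_{y_i = \pm 1} \left( K(x_i, x_i) - 2K(x, x_i) \right). \end{aligned} \tag{42}$$

Assuming that $l_+ \approx l_-$, we can replace $l_\pm$ by $l/2$, and the value of $K(x, x)$ is 1 for the RBF kernel. Then

$$\begin{aligned} y_k (D_+(x_k) - D_-(x_k)) &= y_k \left( \frac{2}{l} \sum_{y_i = 1} (1 - 2K(x_k, x_i)) - \frac{2}{l} \sum_{y_i = -1} (1 - 2K(x_k, x_i)) \right) \\ &= \frac{4 y_k}{l} \sum_i -y_i K(x_k, x_i) \\ &= -\frac{4}{l} f_K(x_k, y_k), \end{aligned} \tag{43}$$

which reduces to the heuristic function $f_K(x_k, y_k)$ up to the constant factor $-4/l$. Because this factor is negative, a strong indication of noise corresponds to a low value of $f_K(x_k, y_k)$.

Strategy of Using k-NN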
For each data point $x_i$, we can find a set $S_i^k$ that consists of the $k$ nearest neighbors of $x_i$. Let $n_i$ be the number of data points in the set $S_i^k$ whose class label is the same as that of $x_i$. It is reasonable to assume that a data point with a lower value of $n_i$ is more likely to be noisy. It is natural to select the heuristic function $h(x_i) = n_i$. However, for data points near the margin between the two classes, the value $n_i$ may also be low. Performance will be poor if these data points are given lower fuzzy memberships. To avoid this situation, the confident factor $h_C$, which controls the threshold below which a data point has its fuzzy membership reduced, must be chosen carefully.
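The k-NN heuristic can be sketched as follows. This is a minimal brute-force version under two assumptions that the text does not fix: neighbors are found with the Euclidean distance in input space, and a point is not counted as its own neighbor. The function name `knn_same_label_counts` is illustrative.

```python
def knn_same_label_counts(X, y, k):
    """For each point, count how many of its k nearest neighbours share its label."""
    counts = []
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other point (self excluded),
        # paired with the neighbour index so that ties break deterministically.
        dists = sorted(
            (sum((p - q) ** 2 for p, q in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        counts.append(sum(1 for j in neighbours if y[j] == y[i]))
    return counts


# One point labelled -1 sits in the middle of the +1 cluster.
X = [(0.0,), (0.1,), (0.2,), (0.15,), (5.0,), (5.1,), (5.2,)]
y = [1, 1, 1, -1, -1, -1, -1]
counts = knn_same_label_counts(X, y, k=2)   # n_i for every point
```

The mislabelled point gets $n_i = 0$, but its correctly labelled neighbors also lose one count each, which is the same effect that lowers $n_i$ near the class margin and is why the threshold $h_C$ has to be chosen with care.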
3.4 The Overall Procedure
There is no explicit way to solve the problem of choosing parameters for SVMs. The use of a gradient descent algorithm over the set of parameters, minimizing some estimate of the generalization error of SVMs, is discussed in [44]. On the other hand, exhaustive search or grid search is the popular method for choosing the parameters, but it becomes intractable in this application as the number of parameters grows.
In order to select parameters in this kind of problem, we divide the training procedure into two main parts and propose the following procedures.
1. Use the original algorithm of SVMs to get the optimal kernel parameters and the regularization parameter $C$.
2. Fix the kernel parameters and the regularization parameter $C$ obtained in the previous step, and find the other parameters in FSVMs.
a) Define the heuristic function $h(x)$.
b) Use exhaustive search or grid search to choose the confident factor $h_C$, the trashy factor $h_T$, the mapping degree $d$, and the fuzzy membership lower bound $\sigma$.
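The two-stage procedure can be sketched as two nested grid searches. The sketch is deliberately independent of any SVM library: `svm_error` and `fsvm_error` are hypothetical callbacks that would train a model with the given parameters and return its validation error, and the grid names are assumptions for the example.

```python
from itertools import product


def grid_search(evaluate, grid, fixed=None):
    """Return (best_params, best_error) over the Cartesian product of `grid`."""
    fixed = dict(fixed or {})
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, values))}
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def two_stage_search(svm_error, fsvm_error, svm_grid, fsvm_grid):
    """Stage 1 tunes the plain SVM; stage 2 tunes only the membership parameters."""
    svm_params, _ = grid_search(svm_error, svm_grid)
    return grid_search(fsvm_error, fsvm_grid, fixed=svm_params)
```

Splitting the search this way evaluates the sum of the two grid sizes rather than their product, which is what keeps the search tractable when the membership parameters are added.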
3.5 Experiments
In these simulations, we use the RBF kernel

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}. \tag{44}$$
We conducted computer simulations of SVMs and FSVMs using the same data sets as in [45]. Each data set is split into 100 sample sets, each consisting of a training set and a test set. For each sample set, the test set is independent of the training set. For each data set, we train and test on the first 5 sample sets to find the parameters with the best average test error. Then we use these parameters to train and test on all sample sets and obtain the average test error. Since there are more parameters than in the original algorithm of SVMs, we use two procedures to find the parameters, as described in the previous section. In the first procedure, we search for the kernel parameters and $C$ using the original algorithm of SVMs. In the second procedure, we fix the kernel parameters and $C$ found in the first procedure, and search for the parameters of the fuzzy membership mapping function.
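The evaluation protocol (select parameters on the first 5 sample sets, then report on all 100) can be sketched as below. `split_error(params, s)` is a hypothetical callback that trains on the training part of sample set `s` and returns the error on its test part; reporting a standard deviation next to the mean is an assumption about what the ± values in Table 1 denote.

```python
from statistics import mean, stdev


def select_and_report(candidates, split_error, n_splits=100, n_select=5):
    """Choose parameters on the first n_select splits, then evaluate on all splits.

    Returns (best_params, mean_error, std_error).
    """
    def average_error(params, splits):
        return mean(split_error(params, s) for s in splits)

    # Model selection uses only the first n_select sample sets.
    best = min(candidates, key=lambda p: average_error(p, range(n_select)))
    # The reported figure uses every sample set with the selected parameters.
    errors = [split_error(best, s) for s in range(n_splits)]
    return best, mean(errors), stdev(errors)
```

Using only a few sample sets for selection keeps the search cheap, at the cost of choosing parameters from a small part of the available splits.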
To find the parameters of the strategy using kernel-target alignment, we first fix $h_C = \max_i f_K(x_i, y_i)$ and $h_T = \min_i f_K(x_i, y_i)$, and perform a two-dimensional search over the parameters $\sigma$ and $d$. The value of $\sigma$ is chosen from 0.1 to 0.9 in steps of 0.1. For some cases, we also compare the result of $\sigma = 0.01$. The value of $d$ is chosen from $2^{-8}$ to $2^{8}$, multiplying by 2 at each step. Then, we fix $\sigma$ and $d$, and perform a two-dimensional search over the parameters $h_C$ and $h_T$. The value of $h_C$ is chosen such that 0%, 10%, 20%, 30%, 40%, and 50% of data points have a fuzzy membership of 1. The value of $h_T$ is chosen such that 0%, 10%, 20%, 30%, 40%, and 50% of data points have a fuzzy membership of $\sigma$.
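The search grids for this strategy are small enough to write out explicitly. The sketch below builds the grids for $\sigma$ and $d$ and shows one possible way to turn a target fraction of points into a threshold $h_T$. The helper `trashy_factor_for` and its tie handling are assumptions, since the text only states the target fractions.

```python
def sigma_grid():
    """Lower bounds of the fuzzy membership: 0.1, 0.2, ..., 0.9."""
    return [i / 10 for i in range(1, 10)]


def degree_grid():
    """Mapping degrees 2^-8, 2^-7, ..., 2^8 (each twice the previous one)."""
    return [2.0 ** e for e in range(-8, 9)]


def trashy_factor_for(h_values, fraction):
    """Pick h_T so that about `fraction` of the points satisfy h < h_T.

    Hypothetical helper: with tied heuristic values the achieved fraction
    can be smaller than requested.
    """
    ordered = sorted(h_values)
    n_low = min(round(fraction * len(ordered)), len(ordered) - 1)
    return ordered[n_low]   # points strictly below this value get membership sigma


fractions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
h_values = list(range(10))   # toy heuristic values
thresholds = [trashy_factor_for(h_values, f) for f in fractions]
```

A threshold for $h_C$ could be derived in the same way from the upper end of the sorted heuristic values.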
To find the parameters of the strategy using k-NN, we perform only a two-dimensional search over the parameters $\sigma$ and $k$. We fix $h_C = k/2$, $h_T = 0$, and $d = 1$, since we did not find much gain or loss with other values of these parameters, and we skip searching them to save time. The value of $\sigma$ is chosen from 0.1 to 0.9 in steps of 0.1. For some cases, we also compare the result of $\sigma = 0.01$. The value of $k$ is chosen from $2^{1}$ to $2^{8}$, multiplying by 2 at each step.

Table 1 shows the results of our simulations. Compared with SVMs, FSVMs with kernel-target alignment perform better on 9 data sets, and FSVMs with k-NN perform better on 5 data sets. By checking the average training error of SVMs on each data set, we find that FSVMs perform well on the data sets where the average training error is high. These results show that our algorithm can improve the performance of SVMs when the data set contains noisy data.
Table 1. The test error (in %) of SVMs, FSVMs using the strategy of kernel-target alignment (KT), and FSVMs using the strategy of k-NN (k-NN), and the average training error of SVMs (TR) on 13 datasets.

Dataset      SVMs        KT          k-NN        TR
Banana       11.5±0.7    10.4±0.5    11.4±0.6    6.7
B. Cancer    26.0±4.7    25.3±4.4    25.2±4.1    18.3
Diabetes     23.5±1.7    23.3±1.7    23.5±1.7    19.4
F. Solar     32.4±1.8    32.4±1.8    32.4±1.8    32.6
German       23.6±2.1    23.3±2.3    23.6±2.1    16.2
Heart        16.0±3.3    15.2±3.1    15.5±3.4    12.8
Image        3.0±0.6     2.9±0.7     -           1.3
Ringnorm     1.7±0.1     -           -           0.0
Splice       10.9±0.7    -           -           0.0
Thyroid      4.8±2.2     4.7±2.3     -           0.4
Titanic      22.4±1.0    22.3±0.9    22.3±1.1    19.6
Twonorm      3.0±0.2     2.4±0.1     2.9±0.2     0.4
Waveform     9.9±0.4     9.9±0.4     -           3.5
4 Conclusions
In this report, we reviewed the concept of fuzzy support vector machines and proposed training procedures for FSVMs. By associating the data points with fuzzy memberships, FSVMs let data points with different memberships contribute differently to learning the decision function. However, the extra freedom in selecting the memberships poses an issue for learning. Thus, systematic methods are required for FSVMs to be applicable. The proposed training procedures, along with the two strategies for setting the fuzzy memberships, can effectively address the membership selection problem. This makes FSVMs more feasible for reducing the effects of noise or outliers. The experiments show that the performance is better in applications with noisy data.
It is still an open issue how to select a proper fuzzy model for FSVMs for a given problem. Some problems may involve domains that are outside the discipline of learning techniques. For example, economic trend prediction may work better when the domain knowledge of economics is combined with the learning techniques of computer science. The examples illustrated in this report show only the basic application of FSVMs. More versatile applications are expected in the near future.
References
1. C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
2. V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer, 1995.
3. V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
4. B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola, “Input space vs. feature space in kernel-based methods,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, 1999.
5. E. Osuna, R. Freund, and F. Girosi, “Support vector machines: Training and applications,” Tech. Rep. AIM-1602, MIT A.I. Lab., 1996.
6. V. Vapnik, Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
7. C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
8. A. Smola and B. Schölkopf, “A tutorial on support vector regression,” Tech. Rep. NC2-TR-1998-030, Neural and Computational Learning II, 1998.
9. C. J. C. Burges and B. Schölkopf, “Improving the accuracy and speed of support vector learning machines,” in Advances in Neural Information Processing Systems 9 (M. Mozer, M. Jordan, and T. Petsche, eds.), pp. 375–381, Cambridge, MA: MIT Press, 1997.
10. M. Schmidt, “Identifying speaker with support vector networks,” in Interface ’96 Proceedings, (Sydney), 1996.
11. S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, “Fusion of face and speech data for person identity verification,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1065–1074, 1999.
12. E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” in 1997 IEEE Workshop on Neural Networks for Signal Processing, pp. 276–285, 1997.
13. G. Fung, O. L. Mangasarian, and J. Shavlik, “Knowledge-based support vector machine classifiers,” in Advances in Neural Information Processing, 2002.
14. T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in Proceedings of ECML-98, 10th European Conference on Machine Learning (C. Nédellec and C. Rouveirol, eds.), (Chemnitz, DE), pp. 137–142, Springer Verlag, Heidelberg, DE, 1998.
15. K. Crammer and Y. Singer, “On the learnability and design of output codes for multiclass problems,” in Computational Learning Theory, pp. 35–46, 2000.
16. K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, “Predicting time series with support vector machines,” in Artificial Neural Networks - ICANN’97 (W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, eds.), pp. 999–1004, 1997.
17. S. Mukherjee, E. Osuna, and F. Girosi, “Nonlinear prediction of chaotic time series using support vector machines,” in 1997 IEEE Workshop on Neural Networks for Signal Processing, pp. 511–519, 1997.
18. F. E. H. Tay and L. Cao, “Application of support vector machines in financial time series forecasting,” Omega, vol. 29, pp. 309–317, 2001.
19. L. J. Cao, K. S. Chua, and L. K. Guan, “c-ascending support vector machines for financial time series forecasting,” in 2003 International Conference on Computational Intelligence for Financial Engineering (CIFEr2003), (Hong Kong), pp. 317–323, 2003.
20. H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” in Advances in Neural Information Processing Systems, vol. 9, p. 155, The MIT Press, 1997.
21. R. Fletcher, Practical methods of optimization. Chichester and New York: John Wiley and Sons, 1987.
22. M. Aizerman, E. Braverman, and L. Rozonoer, “Theoretical foundations of the potential function method in pattern recognition learning,” Automation and Remote Control, vol. 25, pp. 821–837, 1964.
23. N. J. Nilsson, Learning machines: Foundations of trainable pattern classifying systems. McGraw-Hill, 1965.
24. L. Kaufman, “Solving the quadratic programming problem arising in support vector classification,” in Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), pp. 147–168, Cambridge, MA: MIT Press, 1998.
25. J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Tech. Rep. 98-14, Microsoft Research, Washington, 1998.
26. J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), pp. 185–208, Cambridge, MA: MIT Press, 1998.
27. C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
28. T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), pp. 169–184, Cambridge, MA: MIT Press, 1998.
29. B. E. Boser, I. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers,” in Computational Learning Theory, pp. 144–152, 1992.
30. X. Zhang, “Using class-center vectors to build support vector machines,” in 1999 IEEE Workshop on Neural Networks for Signal Processing, pp. 3–11, 1999.
31. C.-F. Lin and S.-D. Wang, “Fuzzy support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 464–471, 2002.
32. N. D. Freitas, M. Milo, P. Clarkson, M. Niranjan, and A. Gee, “Sequential support vector machines,” in 1999 IEEE Workshop on Neural Networks for
33. Y. S. Abu-Mostafa and A. F. Atiya, “Introduction to financial forecasting,” Applied Intelligence, vol. 6, pp. 205–213, 1996.
34. K. K. Lee, S. R. Gunn, C. J. Harris, and P. A. S. Reed, “Classification of unbalanced data with transparent kernels,” in International Joint Conference on Neural Networks (IJCNN ’01), vol. 4, pp. 2445–2450, July 2001.
35. A. T. Quang, Q.-L. Zhang, and X. Li, “Evolving support vector machine parameters,” in 2002 International Conference on Machine Learning and Cybernetics, vol. 1, pp. 548–551, 2002.
36. L. J. Cao, H. P. Lee, and W. K. Chong, “Modified support vector novelty detector using training data with outliers,” Pattern Recognition Letters, vol. 24, pp. 2479–2487, 2003.
37. J. Weston, “Leave-one-out support vector machines,” in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99 (T. Dean, ed.), pp. 727–733, Morgan Kaufmann, 1999.
38. J. Weston and R. Herbrich, “Adaptive margin support vector machines,” in Advances in Large Margin Classifiers, pp. 281–295, Cambridge, MA: MIT Press, 2000.
39. C.-F. Lin and S.-D. Wang, “Training algorithms for fuzzy support vector machines with noisy data,” in 2003 IEEE Workshop on Neural Networks for Signal Processing, 2003.
40. J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
41. K. S. Chua, “Efficient computations for large least square support vector machine classifiers,” Pattern Recognition Letters, vol. 24, pp. 75–80, 2003.
42. D. S. Chen and R. C. Jain, “A robust back propagation learning algorithm for function approximation,” IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 467–479, 1994.
43. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola, “On kernel-target alignment,” in Advances in Neural Information Processing Systems 14, pp. 367–373, MIT Press, 2002.
44. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 131–159, 2002.
45. G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine