Accurate Online Support Vector Regression
Junshui Ma, James Theiler, and Simon Perkins
MS-D436, NIS-2, Los Alamos National Laboratory, Los Alamos, NM 87545, USA {junshui, jt, s.perkins}@lanl.gov
Abstract
Conventional batch implementations of Support Vector Regression (SVR) are inefficient when used for applications such as online learning or cross-validation, because one must retrain from scratch every time the training set is modified. We introduce an Accurate Online Support Vector Regression (AOSVR) algorithm which efficiently updates a trained SVR function whenever a sample is added to or removed from the training set.
The updated SVR function is identical to the one that would be produced by a batch algorithm. Examples are presented that compare the performance of AOSVR to a batch algorithm in an online and in a cross-validation scenario.
Keywords: Accurate Online Support Vector Regression; Support Vector Regression;
Online Time-series Prediction; Leave-one-out Cross-validation.
1. Introduction
Support Vector Regression (SVR) fits a continuous-valued function to data in a way that shares many of the advantages of support vector machine (SVM) classification. Most algorithms for SVR (Smola et al. 1998; Chang et al. 2002) require that training samples be delivered in a single batch. For applications such as online time-series prediction or leave-one-out cross-validation, a new model is desired each time a new sample is added to (or removed from) the training set. Retraining from scratch for each new data point can
be very expensive. Approximate online training algorithms have previously been proposed for SVMs (Syed et al. 1999; Csato et al. 2001; Gentile 2001; Graepel et al. 2001; Herbster 2001; Li et al. 1999; Kivinen et al. 2002; Ralaivola et al. 2001). We propose an accurate online support vector regression (AOSVR) algorithm that follows the approach of Cauwenberghs et al. (2001) for incremental SVM classification.
This paper is organized as follows. The formulation of the SVR problem, and the development of the Karush-Kuhn-Tucker (KKT) conditions that its solution must satisfy, are presented in Section 2. The incremental SVR algorithm is derived in Section 3, and a decremental version is described in Section 4. Two applications of the AOSVR algorithm are presented in Section 5.
2. Support Vector Regression and The Karush-Kuhn-Tucker conditions
A more detailed version of the following presentation of SVR theory can be found in Smola et al. (1998).
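Given a training set $T = \{(\mathbf{x}_i, y_i),\ i = 1 \ldots l\}$, where $\mathbf{x}_i \in R^N$ and $y_i \in R$, we construct a linear regression function

$$f(\mathbf{x}) = \mathbf{W}^T \Phi(\mathbf{x}) + b \tag{1}$$

on a feature space F. Here, $\mathbf{W}$ is a vector in F, and $\Phi(\mathbf{x})$ maps the input $\mathbf{x}$ to a vector in F. The $\mathbf{W}$ and b in (1) are obtained by solving an optimization problem:

$$\begin{aligned}
\min_{\mathbf{W},\,b}\quad & P = \frac{1}{2}\mathbf{W}^T\mathbf{W} + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \\
\text{s.t.}\quad & y_i - (\mathbf{W}^T\Phi(\mathbf{x}_i) + b) \le \varepsilon + \xi_i \\
& (\mathbf{W}^T\Phi(\mathbf{x}_i) + b) - y_i \le \varepsilon + \xi_i^* \\
& \xi_i,\ \xi_i^* \ge 0, \quad i = 1 \ldots l
\end{aligned} \tag{2}$$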
The optimization criterion penalizes data points whose y-values differ from $f(\mathbf{x})$ by more than ε. The slack variables, $\xi$ and $\xi^*$, correspond to the size of this excess deviation for positive and negative deviations, respectively, as shown in Figure 1.
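The role of the slack variables in (2) can be illustrated with a minimal Python sketch. The function names `eps_insensitive_slacks` and `primal_objective` are illustrative and do not come from the paper; the squared norm of W is passed in as a plain number so that the example stays independent of any feature map.

```python
def eps_insensitive_slacks(y, f_x, eps):
    """Smallest slacks (xi, xi_star) satisfying the constraints of (2) for one sample."""
    xi = max(0.0, y - f_x - eps)        # target lies above the eps-tube
    xi_star = max(0.0, f_x - y - eps)   # target lies below the eps-tube
    return xi, xi_star


def primal_objective(w_norm_sq, C, targets, predictions, eps):
    """Value of P in (2); w_norm_sq stands for W^T W (a hypothetical input here)."""
    total_slack = 0.0
    for y, f_x in zip(targets, predictions):
        xi, xi_star = eps_insensitive_slacks(y, f_x, eps)
        total_slack += xi + xi_star
    return 0.5 * w_norm_sq + C * total_slack
```

At most one of the two slacks is nonzero for any sample, and both vanish when the prediction lies within ε of the target.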
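Figure 1. The ε-insensitive loss function and the role of the slack variables $\xi$ and $\xi^*$.

Introducing Lagrange multipliers $\alpha$, $\alpha^*$, $\eta$ and $\eta^*$, we can write the corresponding Lagrangian as:

$$\begin{aligned}
L_P ={}& \frac{1}{2}\mathbf{W}^T\mathbf{W} + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) - \sum_{i=1}^{l}(\eta_i\xi_i + \eta_i^*\xi_i^*) \\
& - \sum_{i=1}^{l}\alpha_i\left(\varepsilon + \xi_i - y_i + \mathbf{W}^T\Phi(\mathbf{x}_i) + b\right) - \sum_{i=1}^{l}\alpha_i^*\left(\varepsilon + \xi_i^* + y_i - \mathbf{W}^T\Phi(\mathbf{x}_i) - b\right) \\
\text{s.t.}\quad & \alpha_i,\ \alpha_i^*,\ \eta_i,\ \eta_i^* \ge 0
\end{aligned}$$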
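This in turn leads to the dual optimization problem:

$$\begin{aligned}
\min_{\alpha,\,\alpha^*}\quad & D = \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} Q_{ij}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) + \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) - \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*) \\
\text{s.t.}\quad & 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1 \ldots l, \\
& \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0
\end{aligned} \tag{3}$$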
where $Q_{ij} = \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$. Here $K(\mathbf{x}_i, \mathbf{x}_j)$ is a kernel function (Smola et al. 1998). Given the solution of (3), the regression function (1) can be written:

$$f(\mathbf{x}) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K(\mathbf{x}_i, \mathbf{x}) + b \tag{4}$$
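As an illustration of (4), the sketch below evaluates the regression function from a list of training inputs and their coefficient differences (the quantities written α_i minus α_i* in (4)). The names `rbf_kernel` and `svr_predict` are illustrative; the kernel shown is the Gaussian radial basis function used in the experiments of Section 5.1.1, and any other kernel function could be passed instead.

```python
import math


def rbf_kernel(u, v):
    """Gaussian RBF kernel exp(-||u - v||^2), as used in Section 5.1.1."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, train_x, coef, b, kernel=rbf_kernel):
    """Evaluate (4): sum_i coef_i * K(x_i, x) + b, where coef_i = alpha_i - alpha_i^*."""
    return sum(c * kernel(x_i, x) for x_i, c in zip(train_x, coef)) + b
```

Samples whose coefficient difference is zero contribute nothing to the sum, so in practice only the support vectors need to be stored.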
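The Lagrange formulation of (3) can be represented as:

$$\begin{aligned}
L_D ={}& \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} Q_{ij}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) + \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) - \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*) \\
& - \sum_{i=1}^{l}(\delta_i\alpha_i + \delta_i^*\alpha_i^*) + \sum_{i=1}^{l}\left[u_i(\alpha_i - C) + u_i^*(\alpha_i^* - C)\right] + \beta\sum_{i=1}^{l}(\alpha_i - \alpha_i^*)
\end{aligned} \tag{5}$$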
where $\delta_i^{(*)}$, $u_i^{(*)}$, and $\beta$ are the Lagrange multipliers. Optimizing this Lagrangian leads to the Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{aligned}
\frac{\partial L_D}{\partial \alpha_i} &= \sum_{j=1}^{l} Q_{ij}(\alpha_j - \alpha_j^*) + \varepsilon - y_i + \beta - \delta_i + u_i = 0 \\
\frac{\partial L_D}{\partial \alpha_i^*} &= -\sum_{j=1}^{l} Q_{ij}(\alpha_j - \alpha_j^*) + \varepsilon + y_i - \beta - \delta_i^* + u_i^* = 0 \\
\delta_i^{(*)} &\ge 0, \quad \delta_i^{(*)}\alpha_i^{(*)} = 0 \\
u_i^{(*)} &\ge 0, \quad u_i^{(*)}(\alpha_i^{(*)} - C) = 0
\end{aligned} \tag{6}$$

Note that $\beta$ in (6) is the same as b in (1) and (4) at optimality (Chang et al. 2002). Define a margin function $h(\mathbf{x}_i)$ for the $i$-th sample $\mathbf{x}_i$ as:

$$h(\mathbf{x}_i) \equiv f(\mathbf{x}_i) - y_i = \sum_{j=1}^{l} Q_{ij}(\alpha_j - \alpha_j^*) - y_i + b \tag{7}$$
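By combining (6) and (7), $h(\mathbf{x}_i)$ can be rearranged as:

$$\begin{cases}
h(\mathbf{x}_i) \ge -\varepsilon, & \alpha_i = 0 \\
h(\mathbf{x}_i) = -\varepsilon, & 0 < \alpha_i < C \\
h(\mathbf{x}_i) \le -\varepsilon, & \alpha_i = C \\
h(\mathbf{x}_i) \le \varepsilon, & \alpha_i^* = 0 \\
h(\mathbf{x}_i) = \varepsilon, & 0 < \alpha_i^* < C \\
h(\mathbf{x}_i) \ge \varepsilon, & \alpha_i^* = C
\end{cases} \tag{8}$$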
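As a worked step, the three rows of (8) that involve $\alpha_i$ can be traced from (6) and (7), using the fact that $\beta = b$ at optimality. This is a sketch of the argument rather than a full proof; the three rows for $\alpha_i^*$ follow in the same way from the second stationarity condition in (6).

```latex
\begin{aligned}
&\text{Substituting (7) into the first line of (6) with } \beta = b: \qquad
  h(\mathbf{x}_i) + \varepsilon = \delta_i - u_i . \\[4pt]
&\alpha_i = 0:        && u_i = 0,\ \delta_i \ge 0
   && \Rightarrow\ h(\mathbf{x}_i) + \varepsilon = \delta_i \ge 0
   && \Rightarrow\ h(\mathbf{x}_i) \ge -\varepsilon \\
&0 < \alpha_i < C:    && \delta_i = 0,\ u_i = 0
   && \Rightarrow\ h(\mathbf{x}_i) + \varepsilon = 0
   && \Rightarrow\ h(\mathbf{x}_i) = -\varepsilon \\
&\alpha_i = C:        && \delta_i = 0,\ u_i \ge 0
   && \Rightarrow\ h(\mathbf{x}_i) + \varepsilon = -u_i \le 0
   && \Rightarrow\ h(\mathbf{x}_i) \le -\varepsilon
\end{aligned}
```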
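According to the KKT conditions (6), at most one of $\alpha_i$ and $\alpha_i^*$ will be nonzero, and both are nonnegative. That is,

$$\alpha_i > 0 \Rightarrow \alpha_i^* = 0 \quad \text{and} \quad \alpha_i^* > 0 \Rightarrow \alpha_i = 0 \tag{9}$$

Therefore, we can define a coefficient difference $\theta_i$ as

$$\theta_i = \alpha_i - \alpha_i^* \tag{10}$$

and note that $\theta_i$ determines both $\alpha_i$ and $\alpha_i^*$. Combining (8), (9), and (10), we can obtain:

$$\begin{cases}
h(\mathbf{x}_i) \ge \varepsilon, & \theta_i = -C \\
h(\mathbf{x}_i) = \varepsilon, & -C < \theta_i < 0 \\
-\varepsilon \le h(\mathbf{x}_i) \le \varepsilon, & \theta_i = 0 \\
h(\mathbf{x}_i) = -\varepsilon, & 0 < \theta_i < C \\
h(\mathbf{x}_i) \le -\varepsilon, & \theta_i = C
\end{cases} \tag{11}$$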
Equation (11) suggests that the samples in training set T can be classified into three subsets:

E Set: Error Support Vectors: $E = \{i \mid |\theta_i| = C\}$
S Set: Margin Support Vectors: $S = \{i \mid 0 < |\theta_i| < C\}$ (12)
R Set: Remaining Samples: $R = \{i \mid \theta_i = 0\}$
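The partition into the E, S, and R sets, together with the five cases of the KKT conditions (11), can be sketched in Python as follows. The function names and the numerical tolerance `tol` are assumptions made for this illustration; the paper does not prescribe how equality is tested in floating point.

```python
def partition_samples(theta, C, tol=1e-9):
    """Split sample indices into (E, S, R) according to |theta_i|, as in (12)."""
    E, S, R = [], [], []
    for i, t in enumerate(theta):
        if abs(t) <= tol:
            R.append(i)                   # theta_i = 0
        elif abs(abs(t) - C) <= tol:
            E.append(i)                   # |theta_i| = C
        else:
            S.append(i)                   # 0 < |theta_i| < C
    return E, S, R


def satisfies_kkt(h, theta, C, eps, tol=1e-9):
    """Check one sample's margin value h and coefficient theta against (11)."""
    if abs(theta) <= tol:                 # R set: inside the eps-tube
        return abs(h) <= eps + tol
    if abs(abs(theta) - C) <= tol:        # E set: outside or on the tube
        return h <= -eps + tol if theta > 0 else h >= eps - tol
    if 0 < theta < C:                     # S set, positive coefficient
        return abs(h + eps) <= tol
    if -C < theta < 0:                    # S set, negative coefficient
        return abs(h - eps) <= tol
    return False                          # |theta| > C is infeasible
```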
3. Incremental Algorithm
The incremental algorithm updates the trained SVR function whenever a new sample $\mathbf{x}_c$ is added to the training set T. The basic idea is to gradually change the coefficient difference $\theta_c$ corresponding to the new sample $\mathbf{x}_c$ until it meets the KKT conditions, while ensuring that the existing samples in T continue to satisfy the KKT conditions. In this section, we first derive the relation between the change of $\theta_c$, or $\Delta\theta_c$, and the change of other coefficients under the KKT conditions, and then propose a method to determine the largest allowed $\Delta\theta_c$ for each step. A pseudo-code description of this algorithm is provided in the Appendix.
3.1 Derivation of the Incremental Relations
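Let $\mathbf{x}_c$ be a new training sample that is added to T. We initially set $\theta_c = 0$ and then gradually change (increase or decrease) the value of $\theta_c$ under the KKT conditions (11).

According to (6), (7), (8), and (11), the incremental relation between $\Delta h(\mathbf{x}_i)$, $\Delta\theta_i$, and $\Delta b$ is given by:

$$\Delta h(\mathbf{x}_i) = Q_{ic}\,\Delta\theta_c + \sum_{j=1}^{l} Q_{ij}\,\Delta\theta_j + \Delta b \tag{13}$$

From the equality condition in (3), we have

$$\theta_c + \sum_{i=1}^{l}\theta_i = 0 \tag{14}$$

Combining (11), (12), (13), and (14), we obtain:

$$\begin{aligned}
\sum_{j \in S} Q_{ij}\,\Delta\theta_j + \Delta b &= -Q_{ic}\,\Delta\theta_c, \quad \text{where } i \in S \\
\sum_{j \in S}\Delta\theta_j &= -\Delta\theta_c
\end{aligned} \tag{15}$$

If we define the index of the samples in the S set as:

$$S = \{s_1, s_2, \ldots, s_{l_s}\} \tag{16}$$

Equation (15) can be represented in matrix form as:

$$\begin{bmatrix}
0 & 1 & \cdots & 1 \\
1 & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{l_s}} \\
\vdots & \vdots & \ddots & \vdots \\
1 & Q_{s_{l_s} s_1} & \cdots & Q_{s_{l_s} s_{l_s}}
\end{bmatrix}
\begin{bmatrix} \Delta b \\ \Delta\theta_{s_1} \\ \vdots \\ \Delta\theta_{s_{l_s}} \end{bmatrix}
= -\begin{bmatrix} 1 \\ Q_{s_1 c} \\ \vdots \\ Q_{s_{l_s} c} \end{bmatrix}\Delta\theta_c \tag{17}$$

That is,

$$\begin{bmatrix} \Delta b \\ \Delta\theta_{s_1} \\ \vdots \\ \Delta\theta_{s_{l_s}} \end{bmatrix} = \boldsymbol{\beta}\,\Delta\theta_c \tag{18}$$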
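where

$$\boldsymbol{\beta} = \begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_{l_s}} \end{bmatrix}
= -\mathbf{R}\begin{bmatrix} 1 \\ Q_{s_1 c} \\ \vdots \\ Q_{s_{l_s} c} \end{bmatrix},
\qquad
\mathbf{R} = \begin{bmatrix}
0 & 1 & \cdots & 1 \\
1 & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{l_s}} \\
\vdots & \vdots & \ddots & \vdots \\
1 & Q_{s_{l_s} s_1} & \cdots & Q_{s_{l_s} s_{l_s}}
\end{bmatrix}^{-1} \tag{19}$$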
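To make the relation between (17), (18), and (19) concrete, the sketch below obtains the sensitivity vector by solving the linear system (17) directly for a unit change of the new coefficient, rather than by forming the inverse matrix R. The helper names `solve_linear` and `sensitivity_beta` are illustrative, and the small Gaussian elimination routine merely stands in for whatever linear solver is available.

```python
def solve_linear(A, rhs):
    """Solve A z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def sensitivity_beta(Q_ss, Q_sc):
    """Return [beta, beta_s1, ..., beta_sls]: the change of b and of each
    theta_s per unit change of theta_c, from (17) and (18).

    Q_ss: kernel matrix among the S-set samples (list of lists).
    Q_sc: kernel values between each S-set sample and the new sample.
    """
    ls = len(Q_ss)
    A = [[0.0] + [1.0] * ls] + [[1.0] + list(Q_ss[i]) for i in range(ls)]
    rhs = [-1.0] + [-q for q in Q_sc]
    return solve_linear(A, rhs)
```

For a single margin support vector with Q_ss = [[2.0]] and Q_sc = [1.0], the result is [1.0, -1.0]: the existing coefficient moves opposite to the new one, as required by (14), and b shifts to keep the margin value of that sample unchanged.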
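Define a non-S, or N set, as $N = E \cup R = \{n_1, n_2, \ldots, n_{l_n}\}$. Combining (11), (12), (13), and (18), we obtain

$$\begin{bmatrix} \Delta h(\mathbf{x}_{n_1}) \\ \Delta h(\mathbf{x}_{n_2}) \\ \vdots \\ \Delta h(\mathbf{x}_{n_{l_n}}) \end{bmatrix} = \boldsymbol{\gamma}\,\Delta\theta_c \tag{20}$$

where

$$\boldsymbol{\gamma} = \begin{bmatrix} Q_{n_1 c} \\ Q_{n_2 c} \\ \vdots \\ Q_{n_{l_n} c} \end{bmatrix}
+ \begin{bmatrix}
1 & Q_{n_1 s_1} & \cdots & Q_{n_1 s_{l_s}} \\
1 & Q_{n_2 s_1} & \cdots & Q_{n_2 s_{l_s}} \\
\vdots & \vdots & \ddots & \vdots \\
1 & Q_{n_{l_n} s_1} & \cdots & Q_{n_{l_n} s_{l_s}}
\end{bmatrix}\boldsymbol{\beta} \tag{20b}$$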
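Equation (20b) amounts to one inner product per non-S sample. A minimal sketch, assuming the sensitivity vector of (19) is already available as a Python list whose first entry is the component for b (the function name `sensitivity_gamma` is illustrative):

```python
def sensitivity_gamma(Q_nc, Q_ns, beta):
    """Compute gamma of (20b).

    Q_nc: kernel values between each N-set sample and the new sample.
    Q_ns: rows of kernel values between each N-set sample and the S-set samples.
    beta: [beta, beta_s1, ..., beta_sls] as defined in (19).
    """
    gamma = []
    for q_nc, row in zip(Q_nc, Q_ns):
        # The leading 1 in each matrix row of (20b) multiplies beta[0].
        gamma.append(q_nc + beta[0] + sum(q * b for q, b in zip(row, beta[1:])))
    return gamma
```

Each entry gives the change of the margin function of one E-set or R-set sample per unit change of the new coefficient, as stated in (20).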
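In the special case when the S set is empty, according to (13) and (14), Equation (20) simplifies to:

$$\begin{bmatrix} \Delta h(\mathbf{x}_{n_1}) \\ \Delta h(\mathbf{x}_{n_2}) \\ \vdots \\ \Delta h(\mathbf{x}_{n_{l_n}}) \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}\Delta b \tag{21}$$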
Given $\Delta\theta_c$, we can update $\theta_i,\ i \in S$ and b according to (18), and update $h(\mathbf{x}_i),\ i \in N$ according to (20). Moreover, (11) suggests that $\theta_i,\ i \in N$ and $h(\mathbf{x}_i),\ i \in S$ are constant if the S set stays unchanged. Therefore, the results presented in this section enable us to update all the $\theta_i$ and $h(\mathbf{x}_i)$ given $\Delta\theta_c$. In the next section, we address the question of how to find an appropriate $\Delta\theta_c$.
3.2. AOSVR Bookkeeping Procedure
Equations (18) and (20) hold only when the samples in the S set do not change. Therefore, $\Delta\theta_c$ is chosen to be the largest value that either maintains the S set unchanged or leads to the termination of the incremental algorithm.

The first step is to determine whether the change $\Delta\theta_c$ should be positive or negative. According to (11),

$$\mathrm{sign}(\Delta\theta_c) = \mathrm{sign}(y_c - f(\mathbf{x}_c)) = \mathrm{sign}(-h(\mathbf{x}_c)) \tag{22}$$

The next step is to determine a bound on $\Delta\theta_c$ imposed by each sample in the training set. To simplify exposition we only consider the case $\Delta\theta_c > 0$; the case $\Delta\theta_c < 0$ can be obtained similarly.

For the new sample $\mathbf{x}_c$, there are two cases:
Case 1: If $h(\mathbf{x}_c)$ changes from $h(\mathbf{x}_c) < -\varepsilon$ to $h(\mathbf{x}_c) = -\varepsilon$, the new sample $\mathbf{x}_c$ is added to the S set, and the algorithm terminates.
Case 2: If $\theta_c$ increases from $\theta_c < C$ to $\theta_c = C$, the new sample $\mathbf{x}_c$ is added to the E set, and the algorithm terminates.

For each sample $\mathbf{x}_i$ in the set S,
Case 3: If $\theta_i$ changes from $0 < |\theta_i| < C$ to $|\theta_i| = C$, sample $\mathbf{x}_i$ changes from the S set to the E set; if $\theta_i$ changes to $\theta_i = 0$, sample $\mathbf{x}_i$ changes from the S set to the R set.

For each sample $\mathbf{x}_i$ in the set E,
Case 4: If $h(\mathbf{x}_i)$ changes from $|h(\mathbf{x}_i)| > \varepsilon$ to $|h(\mathbf{x}_i)| = \varepsilon$, $\mathbf{x}_i$ is moved from the E set to the S set.

For each sample $\mathbf{x}_i$ in the set R,
Case 5: If $h(\mathbf{x}_i)$ changes from $|h(\mathbf{x}_i)| < \varepsilon$ to $|h(\mathbf{x}_i)| = \varepsilon$, $\mathbf{x}_i$ is moved from the R set to the S set.

The bookkeeping procedure is to trace each sample in the training set T against these five cases, and determine the allowed $\Delta\theta_c$ for each sample according to (18) or (20). The final $\Delta\theta_c$ is defined as the one with minimal absolute value among all the possible $\Delta\theta_c$.
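One pass of the bookkeeping procedure can be sketched as below. This is a simplified illustration, not the authors' implementation: the function name `largest_step` and its argument layout are assumptions, the per-sample sensitivities are taken as given inputs, and samples that move away from their set boundary are simply skipped instead of being assigned a bound.

```python
def largest_step(q, C, eps, theta_c, h_c, gamma_c,
                 theta_S, beta_S, h_E, gamma_E, h_R, gamma_R):
    """Return (step, case): the largest allowed |delta theta_c| and the case
    (1 to 5) that limits it. q is +1 or -1, the sign of delta theta_c from (22)."""
    candidates = []
    # Case 1: h(x_c) reaches -q*eps, so x_c joins the S set.
    if gamma_c > 0:
        candidates.append((abs((-h_c - q * eps) / gamma_c), 1))
    # Case 2: theta_c reaches q*C, so x_c joins the E set.
    candidates.append((abs(q * C - theta_c), 2))
    # Case 3: an S-set coefficient reaches 0, C, or -C.
    for th, be in zip(theta_S, beta_S):
        direction = q * be                  # sign of the change of theta_i
        if direction > 0:
            target = C if th >= 0 else 0.0
        elif direction < 0:
            target = 0.0 if th > 0 else -C
        else:
            continue
        candidates.append((abs((target - th) / be), 3))
    # Case 4: an E-set sample returns to the tube boundary.
    for h, g in zip(h_E, gamma_E):
        direction = q * g                   # sign of the change of h(x_i)
        if h > eps and direction < 0:
            candidates.append((abs((eps - h) / g), 4))
        elif h < -eps and direction > 0:
            candidates.append((abs((-eps - h) / g), 4))
    # Case 5: an R-set sample reaches the tube boundary.
    for h, g in zip(h_R, gamma_R):
        direction = q * g
        if direction > 0:
            candidates.append((abs((eps - h) / g), 5))
        elif direction < 0:
            candidates.append((abs((-eps - h) / g), 5))
    return min(candidates)
```

The signed update is then q times the returned step, and the returned case number indicates which set membership changes before the next pass.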
3.3. Efficiently Updating R Matrix
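When the S set is not empty, the matrix R that is used in (19),

$$\mathbf{R} = \begin{bmatrix}
0 & 1 & \cdots & 1 \\
1 & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{l_s}} \\
\vdots & \vdots & \ddots & \vdots \\
1 & Q_{s_{l_s} s_1} & \cdots & Q_{s_{l_s} s_{l_s}}
\end{bmatrix}^{-1} \tag{23}$$

must be updated whenever the S set is changed. Following Cauwenberghs et al. (2001), we can efficiently update R without explicitly computing the matrix inverse. When the $k$-th sample $\mathbf{x}_{s_k}$ in the S set is removed from the S set, it occupies row and column $k+1$ of R (the first row and column correspond to b), and the new R can be obtained as follows:

$$\mathbf{R}^{new} = \mathbf{R}_{I,I} - \frac{\mathbf{R}_{I,k+1}\,\mathbf{R}_{k+1,I}}{R_{k+1,k+1}}, \quad \text{where } I = [1\ \cdots\ k\ \ k+2\ \cdots\ l_s+1] \tag{24}$$

When the new sample $\mathbf{x}_c$, or a sample $\mathbf{x}_i$ in the E set or R set, is added to the S set, the new R can be updated as follows:

$$\mathbf{R}^{new} = \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix}
+ \frac{1}{\gamma_i}\begin{bmatrix} \boldsymbol{\beta} \\ 1 \end{bmatrix}\begin{bmatrix} \boldsymbol{\beta}^T & 1 \end{bmatrix} \tag{25}$$

where $\boldsymbol{\beta}$ is defined as

$$\boldsymbol{\beta} = -\mathbf{R}\begin{bmatrix} 1 \\ Q_{s_1 i} \\ \vdots \\ Q_{s_{l_s} i} \end{bmatrix}$$

and $\gamma_i$ is defined as

$$\gamma_i = Q_{ii} + \begin{bmatrix} 1 & Q_{s_1 i} & \cdots & Q_{s_{l_s} i} \end{bmatrix}\boldsymbol{\beta}$$

when the sample $\mathbf{x}_i$ was moved from the E set or R set. In contrast, when the sample $\mathbf{x}_c$ is the sample added to the S set, $\boldsymbol{\beta}$ can be obtained according to (18), and $\gamma_i$ is the last element of $\boldsymbol{\gamma}$ defined in (20).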
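The two update rules (24) and (25) can be sketched with plain nested lists. The function names `expand_R` and `shrink_R` are illustrative; `shrink_R` takes the 1-based position k of the sample within the S set, which corresponds to 0-based row and column index k of R because row 0 belongs to b.

```python
def expand_R(R, beta, gamma_i):
    """Equation (25): grow R by one row and column when a sample joins S.

    beta is the vector -R [1, Q_s1i, ..., Q_slsi]^T and gamma_i the matching
    scalar, both assumed to have been computed beforehand.
    """
    n = len(R)
    ext = list(beta) + [1.0]
    new_R = [[0.0] * (n + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        for c in range(n + 1):
            base = R[r][c] if (r < n and c < n) else 0.0   # R padded with zeros
            new_R[r][c] = base + ext[r] * ext[c] / gamma_i
    return new_R


def shrink_R(R, k):
    """Equation (24): remove the k-th S-set sample (k = 1, ..., l_s) from R."""
    p = k                                   # 0-based index of the dropped row/column
    keep = [i for i in range(len(R)) if i != p]
    return [[R[r][c] - R[r][p] * R[p][c] / R[p][p] for c in keep] for r in keep]
```

As a small check, starting from R = [[-2, 1], [1, 0]] (the inverse of [[0, 1], [1, 2]]) and adding a sample with kernel value 1 to the existing support vector and 2 to itself gives beta = [1, -1] and gamma_i = 2; applying `expand_R` and then `shrink_R` with k = 2 returns the original matrix.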
3.4. Initialization of the Incremental Algorithm
An initial SVR solution can be obtained from a batch SVR solution, and in most cases that is the most efficient approach. But it is sometimes convenient to use AOSVR to produce a full solution from scratch. An efficient starting point is the two-sample solution, which can be written analytically. Assume training set $T = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2)\}$, and $y_1 \ge y_2$. The solution of (3) will be

$$\begin{aligned}
\theta_1 &= \max\left(0,\ \min\left(C,\ \frac{y_1 - y_2 - 2\varepsilon}{2(K_{11} - K_{12})}\right)\right) \\
\theta_2 &= -\theta_1 \\
b &= (y_1 + y_2)/2
\end{aligned} \tag{26}$$
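The analytic two-sample solution (26) translates directly into a few lines of Python. The function name `two_sample_init` is illustrative, and the sketch assumes that K11 exceeds K12, which holds for two distinct points under the Gaussian kernel of Section 5.1.1.

```python
def two_sample_init(y1, y2, K11, K12, C, eps):
    """Return (theta1, theta2, b) from (26); requires y1 >= y2 and K11 > K12."""
    if y1 < y2:
        raise ValueError("order the two samples so that y1 >= y2")
    if K11 <= K12:
        raise ValueError("this sketch assumes K11 > K12")
    theta1 = max(0.0, min(C, (y1 - y2 - 2.0 * eps) / (2.0 * (K11 - K12))))
    return theta1, -theta1, (y1 + y2) / 2.0
```

For example, with y1 = 1, y2 = 0, K11 = 1, K12 = 0.5, C = 10 and ε = 0.1, the coefficients are 0.8 and -0.8 with b = 0.5, and the margin value of the first sample, 0.8 - 0.4 + 0.5 - 1, equals -ε as required by (11).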
4. Decremental Algorithm
The decremental (or “unlearning”) algorithm is employed when an existing sample is removed from the training set. If a sample $\mathbf{x}_c$ is in the R set, then it does not contribute to the SVR solution, and removing it from the training set is trivial; no adjustments are needed. If, on the other hand, $\mathbf{x}_c$ has a nonzero coefficient, then the idea is to gradually reduce the value of the coefficient to zero, while ensuring all the other samples in the training set continue to satisfy the KKT conditions.

The general algorithm is almost the same as the incremental algorithm except for a few small adjustments:
(i) The direction of the change of $\theta_c$ is now: $\mathrm{sign}(\Delta\theta_c) = \mathrm{sign}(f(\mathbf{x}_c) - y_c) = \mathrm{sign}(h(\mathbf{x}_c))$. (27)
(ii) There is no Case 1 because the removed $\mathbf{x}_c$ need not satisfy the KKT conditions.
(iii) The condition in Case 2 becomes: $\theta_c$ changing from $|\theta_c| > 0$ to $\theta_c = 0$.
5. Applications
The accurate online SVR (AOSVR) learning algorithm produces exactly the same SVR as the conventional batch SVR learning algorithm, and can be applied in all scenarios where batch SVR is currently employed. In this section, two particular applications of AOSVR are used to illustrate the particular efficiency of AOSVR for incremental learning.
In both applications, experimental results are presented to compare the efficiency of AOSVR with that of the batch SVR. Our version of AOSVR is implemented in Matlab, and for the batch SVR, we used LibSVM (Chang et al. 2001), which is implemented in C++. This leads to an apples-and-oranges comparison, but we found the currently available Matlab codes for batch SVR to be prohibitively slow. For example, although AOSVR should be slower than batch SVR on a batch SVR problem, we found that it took
AOSVR 4.34 seconds to train a predictor for the 292-point sunspot yearly time-series, while the Matlab SVM Toolbox (Gunn 1998) took 143.06 seconds. We expect a C++
implementation to be faster than Matlab, so the comparison with LibSVM gives batch SVR an advantage – but despite this, we find that AOSVR outperforms batch SVR in the online scenarios presented here.
5.1. Online Time-series Prediction
In recent years, the use of SVR for time-series prediction has attracted increased attention (Müller et al. 1997; Fernández 1999; Tay et al. 2001). In an online scenario, one updates a model from incoming data and at the same time makes predictions based on that model. This arises, for instance, in market forecasting scenarios. Another potential application is the (near) real-time prediction of electron density around a satellite in the magnetosphere, because high charge densities can damage satellite equipment (Friedel et al. 2002).
In time-series prediction, the prediction origin, denoted O, is the time from which the prediction is generated. The time between the prediction origin and the predicted data point is the prediction horizon, which for simplicity we will take as one time step.
A typical online time-series prediction scenario can be represented as follows (Tashman 2000):
(1) Given a time series $\{x(t),\ t = 1, 2, 3, \ldots\}$ and prediction origin O, construct a set of training samples, $A_{O,B}$, from the segment of time series $\{x(t),\ t = 1 \ldots O\}$ as $A_{O,B} = \{(\mathbf{X}(t), y(t)),\ t = B \ldots O-1\}$, where $\mathbf{X}(t) = [x(t)\ \cdots\ x(t-B+1)]^T$, $y(t) = x(t+1)$, and B is the embedding dimension of the training set $A_{O,B}$.
(2) Train a predictor $P(A_{O,B}; \mathbf{X})$ from the training set $A_{O,B}$.
(3) Predict $x(O+1)$ using $\hat{x}(O+1) = P(A_{O,B}; \mathbf{X}(O))$.
(4) When $x(O+1)$ becomes available, update the prediction origin: $O = O+1$. Then, go to (1) and repeat the above procedure.
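Step (1) of the scenario above, the construction of the training set from the time series, can be sketched as follows. The function names are illustrative, and the series is stored in a 0-based Python list, so that x(t) in the text corresponds to `x[t - 1]` in the code.

```python
def build_training_set(x, O, B):
    """Return the samples (X(t), y(t)) for t = B, ..., O-1.

    X(t) = [x(t), x(t-1), ..., x(t-B+1)] and y(t) = x(t+1).
    """
    samples = []
    for t in range(B, O):                        # t = B, ..., O-1
        X_t = [x[t - 1 - k] for k in range(B)]   # x(t) down to x(t-B+1)
        y_t = x[t]                               # x(t+1) in 1-based time
        samples.append((X_t, y_t))
    return samples


def query_vector(x, O, B):
    """Return X(O), the input used in step (3) to predict x(O+1)."""
    return [x[O - 1 - k] for k in range(B)]
```

When the prediction origin advances from O to O+1, exactly one new pair is appended to the set, which is the situation the incremental algorithm is designed for.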
Note that the training set $A_{O,B}$ keeps growing as O increases, so the training of the predictor in step (2) becomes increasingly expensive. Therefore, many SVR-based time-series predictions are implemented in a compromised way (Tay et al. 2001), with a fixed prediction origin O. That is, after the predictor is obtained, it stays fixed, and is not updated as new data arrives. A direct consequence of this compromise is the degrading of the prediction performance, which is demonstrated by the experimental results listed in Table 2.
In contrast, an online prediction algorithm, such as AOSVR, can take advantage of the fact that the training set is augmented one sample at a time, and the enhanced efficiency that an online algorithm provides is shown in the next section.
5.1.1. Experiments
Two experiments were performed to compare the AOSVR algorithm with the batch SVR algorithm. We are careful to use the same algorithm parameters for online and batch SVR, but since our purpose is to compare computational performance, we did not attempt to optimize these parameters for each data set. In these experiments, the kernel function is a Gaussian radial basis function, $\exp(-\|\mathbf{X}_i - \mathbf{X}_j\|^2)$, the regularization coefficient C and the insensitivity parameter ε in (2) are set to 10 and 0.1 respectively, and the embedding dimension, B, of the training set $A_{O,B}$ is 5. Also, we scale all the time-series to [-1,1].
Three widely used benchmark time-series are employed in both experiments: (a) the Santa Fe Institute Competition time series A (Weigend et al. 1994), (b) the Mackey-Glass equation with τ=17 (Mackey et al. 1977), and (c) the yearly average sunspot numbers recorded from 1700 to 1995. Some basic information about these time-series is listed in Table 1. The SV Ratio is the number of support vectors divided by the number of training samples. This is based on a prediction of the last data point using all previous data for training. In general, a higher SV ratio suggests that the underlying problem is harder (Vapnik 1998).
Time Series           # Data Points    SV Ratio
Santa Fe Institute    1000             4.52%
Mackey-Glass          1500             1.54%
Yearly Sunspot        292              41.81%
Table 1. Information Regarding Experimental Time Series
The first experiment demonstrates that using a fixed predictor produces less accurate predictions than using a predictor that is updated as new data becomes available. Two measurements are used to quantify the prediction performance: mean squared error (MSE), and mean absolute error (MAE). The predictors are initially trained on the first half of the data in the time-series. In the fixed case, the same predictor is used to predict the second half of the time-series. In the online case, the predictor is updated whenever a new data point is available. The performance measurements for both cases are calculated from the prediction and actual value of the second half data points in the time-series. As shown in Table 2, the online predictor outperforms the fixed predictor. We also note that the errors for the three time-series in Table 2 coincide with the estimated prediction difficulty in Table 1 based on SV Ratio.
Time Series           Predictor    MSE       MAE
Santa Fe Institute    Online       0.0072    0.0588
Santa Fe Institute    Fixed        0.0097    0.0665
Mackey-Glass          Online       0.0034    0.0506
Mackey-Glass          Fixed        0.0036    0.0522
Yearly Sunspot        Online       0.0263    0.1204
Yearly Sunspot        Fixed        0.0369    0.1365
Table 2. Performance Comparison For Online and Fixed Predictors
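The two performance measurements reported in Table 2 are standard; a minimal sketch of how they are computed from paired predictions and actual values is given below (the function names are illustrative).

```python
def mean_squared_error(predicted, actual):
    """MSE: average of the squared prediction errors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual):
    """MAE: average of the absolute prediction errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```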
The second experiment illustrates that AOSVR is more efficient than a batch implementation in the online prediction scenario. For each benchmark time-series, an initial SVR predictor is trained on the first 20% of the data points using a batch SVR algorithm. Afterwards, both AOSVR and batch SVR algorithms are employed to work in the online prediction mode for the remaining 80% of the data points in the time-series.
AOSVR and the batch SVR algorithm produce exactly the same prediction errors in this experiment, so the comparison is only of prediction speed. The experimental results of the three time-series are presented in Figures 2, 3, and 4 respectively. The x-axis of these plots is the number of data points, to which the online prediction model is applied.
Figure 2. Log and linear plots of prediction time of SFI time series
Figure 3. Log and linear plots of prediction time of Mackey-Glass time series
Figure 4. Log and linear plots of prediction time of yearly sunspot time series

These experimental results demonstrate that the AOSVR algorithm is generally much faster than the batch SVR algorithm when applied to online prediction. This is because the batch SVR algorithm must train a new predictor from scratch every time a new point is added. Comparison of Figures 3 and 4 furthermore suggests that more speed improvement is achieved on the sunspot data than on the Mackey-Glass data. We speculate that this is because the sunspot problem is “harder” than the Mackey-Glass problem (it has a higher support vector ratio), and that the performance of the AOSVR algorithm is less sensitive to problem difficulty.
To test this hypothesis, we compared the performance of AOSVR to batch SVR on a single dataset (the sunspots) whose difficulty was adjusted by changing the value of ε. A smaller ε leads to a higher support vector ratio and a more difficult problem. Both the AOSVR and batch SVR algorithms were employed for online prediction of the full time-series. The overall prediction times are plotted against ε in Figure 5. Where AOSVR performance varied by a factor of about ten over the range of ε, the batch SVR performance varied by a factor of almost 200.
Figure 5. Log and linear plots of prediction time of yearly sunspot time series
5.1.2. Limited-Memory Version of the Online Time-series Prediction Scenario
One problem with online time-series prediction in general is that the longer the prediction goes on, the bigger the training set $A_{O,B}$ will become, and the more SVs will be involved in the SVR predictor (Kimeldorf et al. 1971). A complicated SVR predictor imposes both memory and computation stress on the prediction system. One way to deal with this problem is to impose a “forgetting” time W. When the training set $A_{O,B}$ grows to this maximum W, the decremental algorithm of AOSVR is used to remove the oldest sample before the next new sample is added to the training set.
We note that this variant of the online prediction scenario is also potentially suitable for non-stationary time-series, as it can be updated in real-time to fit the most recent behavior of the time-series. More rigorous investigation in this direction will be a future effort.
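The limited-memory variant can be sketched as a fixed-size window wrapped around the incremental and decremental updates. In the sketch below, the class name `ForgettingWindow` and the callbacks `learn` and `unlearn` are hypothetical placeholders for the incremental algorithm of Section 3 and the decremental algorithm of Section 4; only the ordering of the two operations is taken from the text.

```python
from collections import deque


class ForgettingWindow:
    """Keep at most max_size samples in the training set."""

    def __init__(self, max_size, learn, unlearn):
        self.max_size = max_size
        self.learn = learn        # hypothetical: incremental update with one sample
        self.unlearn = unlearn    # hypothetical: decremental update removing one sample
        self.window = deque()

    def push(self, sample):
        # Remove the oldest sample first, then add the new one.
        if len(self.window) >= self.max_size:
            oldest = self.window.popleft()
            self.unlearn(oldest)
        self.window.append(sample)
        self.learn(sample)
```

A usage example with list-recording callbacks: after pushing samples 1, 2, 3 into a window of size 2, the window holds 2 and 3, and sample 1 has been unlearned.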
5.2. Leave-One-Out Cross-validation
Cross-validation is a useful tool for assessing the generalization ability of a machine-learning algorithm. The idea is to train on one subset of the data, and then to test the accuracy of the predictor on a separate disjoint subset. In leave-one-out cross-validation (LOOCV), only a single sample is used for testing, and all the rest are used for training.
Generally, this is repeated for every sample in the dataset. When the batch SVR is employed, LOOCV can be very expensive, since a full retraining is done for each sample.
One compromise approach is to approximate LOOCV by estimating some related but less computation-demanding factors, such as the Xi-Alpha Bound (Joachims 2000), and Approximate Span Bound (Vapnik et al. 1999). Although (Lee et al. 2001) proposed a numerical solution to reduce the computation for directly implementing LOOCV, the amount of computation required is still considerable. Also, the accuracy of the LOOCV result obtained using this method can be potentially compromised due to the special parameter set employed by the method.
The decremental algorithm of AOSVR provides us with an efficient implementation of LOOCV for SVR:
(1) Given a dataset D, construct the SVR function f(x) from the whole dataset D using batch SVR learning algorithm;
(2) For each non-support vector xi in the dataset D, calculate error ei corresponding to xi as: ei = yi-f (xi), where yi is the target value corresponding to xi;
(3) For each support vector xi involved in the SVR function f(x),
a. Unlearn xi from the SVR function f(x) using the decremental algorithm to obtain the SVR function fi(x) which would be constructed from the dataset Di=D-{xi};
b. Calculate error ei corresponding to support vector xi as: ei = yi-fi(xi), where yi
is the target value corresponding to xi.
(4) Knowing the error for each sample xi in D, it is possible to construct a variety of overall measures; a simple choice is the MSE:

$$MSE_{LOOCV}(D) = \frac{1}{N}\sum_{i=1}^{N} e_i^2 \tag{28}$$

where N is the number of samples in dataset D. Other choices of error metric, such as MAE, can be obtained just by altering (28) appropriately.
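The four steps of the LOOCV procedure can be sketched as a short driver function. The callbacks are hypothetical stand-ins: `predict` plays the role of the full SVR function f(x), and `unlearn(i)` is assumed to return the function fi(x) obtained by the decremental algorithm after removing sample i. Neither is implemented here.

```python
def loocv_mse(samples, predict, support_indices, unlearn):
    """Leave-one-out MSE following steps (1) to (4) and equation (28).

    samples: list of (x, y) pairs.
    predict: callable for the model trained on all samples (hypothetical).
    support_indices: indices of the support vectors of that model.
    unlearn: callable i -> predictor trained without sample i (hypothetical).
    """
    support = set(support_indices)
    errors = []
    for i, (x, y) in enumerate(samples):
        if i in support:
            f_i = unlearn(i)                 # step (3a): remove the support vector
            errors.append(y - f_i(x))        # step (3b)
        else:
            errors.append(y - predict(x))    # step (2): model is unchanged
    return sum(e * e for e in errors) / len(errors)
```

The saving comes from step (2): removing a non-support vector leaves the SVR function unchanged, so only the support vectors require any unlearning work.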
5.2.1. Experiment
The algorithm parameters in this experiment are set the same as those in the experiments in Subsection 5.1.1. Two famous regression datasets, the auto-mpg and Boston housing datasets, are chosen from the UCI machine-learning repository. Some basic information of these datasets is listed in Table 3.
Dataset           # Attributes    # Samples    SV Ratio
Auto-MPG          7               392          41.07%
Boston Housing    13              506          36.36%
Table 3. Information Regarding Experimental Regression Datasets
The experimental results of both datasets are presented in Figure 6. The x-axis is the size of the training set, upon which the LOOCV is implemented. These plots show that AOSVR-based LOOCV is much faster than its batch SVR counterpart when the training set is relatively large.
Figure 6. Linear plots of LOOCV time of Auto-MPG and Boston Housing dataset
6. Conclusions
We have developed and implemented an accurate online support vector regression (AOSVR) algorithm that permits efficient retraining when a new sample is added to, or when an existing sample is removed from, the training set. AOSVR is applied to online time-series prediction and to leave-one-out cross-validation, and the experimental results
demonstrate that the AOSVR algorithm is more efficient than the conventional batch SVR in these scenarios. Moreover, AOSVR appears less sensitive than batch SVR to the difficulty of the underlying problem.
Appendix
Pseudo-code for Incrementing AOSVR with a New Data Sample

Inputs:
  Training set $T = \{(\mathbf{x}_i, y_i),\ i = 1 \ldots l\}$
  Coefficients $\{\theta_i,\ i = 1 \ldots l\}$, and bias b
  Partition of samples into sets S, E, and R
  Matrix R defined in (23)
  New sample $(\mathbf{x}_c, y_c)$

Outputs:
  Updated coefficients $\{\theta_i,\ i = 1 \ldots l+1\}$ and bias b
  Updated Matrix R
  Updated partition of samples into sets S, E, and R

AOSVR Incremental Algorithm:
• Initialize $\theta_c = 0$
• Compute $f(\mathbf{x}_c) = \sum_{i \in E \cup S} \theta_i Q_{ic} + b$
• Compute $h(\mathbf{x}_c) = f(\mathbf{x}_c) - y_c$
• If $|h(\mathbf{x}_c)| \le \varepsilon$, then assign $\mathbf{x}_c$ to R, and terminate.
• Let $q = \mathrm{sign}(-h(\mathbf{x}_c))$ be the sign that $\Delta\theta_c$ will take
• Do until the new sample $\mathbf{x}_c$ meets the KKT condition
  o Update $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ according to (19) and (20b)
  o Start bookkeeping procedure:
      Check the new sample $\mathbf{x}_c$,
        - $L_{c1} = (-h(\mathbf{x}_c) - q\varepsilon)/\gamma_c$ (Case 1)
        - $L_{c2} = qC - \theta_c$ (Case 2)
      Check each sample $\mathbf{x}_i$ in the set S (Case 3)
        - If $q\beta_i > 0$ and $C > \theta_i \ge 0$, $L_i^S = (C - \theta_i)/\beta_i$
        - If $q\beta_i > 0$ and $0 > \theta_i \ge -C$, $L_i^S = -\theta_i/\beta_i$
        - If $q\beta_i < 0$ and $C \ge \theta_i > 0$, $L_i^S = -\theta_i/\beta_i$
        - If $q\beta_i < 0$ and $0 \ge \theta_i > -C$, $L_i^S = (-C - \theta_i)/\beta_i$
      Check each sample $\mathbf{x}_i$ in the set E (Case 4)
        - $L_i^E = (-h(\mathbf{x}_i) - \mathrm{sign}(q\gamma_i)\,\varepsilon)/\gamma_i$
      Check each sample $\mathbf{x}_i$ in the set R (Case 5)
        - $L_i^R = (-h(\mathbf{x}_i) + \mathrm{sign}(q\gamma_i)\,\varepsilon)/\gamma_i$
      Set $\Delta\theta_c = q \min(|L_{c1}|, |L_{c2}|, |L^S|, |L^E|, |L^R|)$,
        where $L^S = \{L_i^S,\ i \in S\}$, $L^E = \{L_i^E,\ i \in E\}$, and $L^R = \{L_i^R,\ i \in R\}$.
      Let Flag be the case number that determines $\Delta\theta_c$.
      Let $\mathbf{x}_I$ be the particular sample in T that determines $\Delta\theta_c$.
  o End Bookkeeping Procedure.
  o Update $\theta_c$, b, and $\theta_i,\ i \in S$ according to (18)
  o Update $h(\mathbf{x}_i),\ i \in E \cup R$ according to (20)
  o Switch Flag
      (Flag = 1):
        Add new sample $\mathbf{x}_c$ to set S; update matrix R according to (25)
      (Flag = 2):
        Add new sample $\mathbf{x}_c$ to set E
      (Flag = 3):
        If $\theta_I = 0$, move $\mathbf{x}_I$ to set R; update R according to (24)
        If $|\theta_I| = C$, move $\mathbf{x}_I$ to set E; update R according to (24)
      (Flag = 4):
        Move $\mathbf{x}_I$ to set S; update R according to (25)
      (Flag = 5):
        Move $\mathbf{x}_I$ to set S; update R according to (25)
  o End Switch Flag
  o If Flag ≤ 2, terminate; otherwise continue the Do-Loop.
• Terminate incremental algorithm; ready for the next sample.
Acknowledgements
We would like to thank Professor Chih-Jen Lin of National Taiwan University for his suggestions on some implementation issues. This work is supported by the NASA project NRA-00-01-AISR-088 and by the Los Alamos Laboratory Directed Research and Development (LDRD) program.
References
Cauwenberghs, G., and T. Poggio (2001). Incremental and Decremental Support Vector Machine Learning, in: T. K. Leen, T. G. Dietterich, and V. Tresp, ed., Advances in Neural Information Processing Systems 13, Cambridge, MA, MIT Press, 409-415.
Chang, C.-C., and C.-J. Lin (2001). LIBSVM: a library for support vector machines, Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chang, C.-C., and C.-J. Lin (2002). Training ν-support vector Regression: Theory and Algorithms, Neural Computation, Vol.14(8), 1959-1977.
Csato, L., and M. Opper (2001). Sparse Representation for Gaussian Process Models, in:
T. K. Leen, T. G. Dietterich, and V. Tresp, ed., Advances in Neural Information Processing Systems 13, Cambridge, MA, MIT Press, 444-450.
Fernández, R.(1999). "Predicting Time Series with a Local Support Vector Regression Machine," Advanced Course on Artificial Intelligence 1999 (ACAI '99), July 14, Chania, Greece.
Friedel, R.H, G. D. Reeves, T. Obara (2002). "Relativistic electron dynamics in the inner magnetosphere - a Review", Journal of Atmospheric and Solar-Terrestrial Physics 64, 265-282.
Gentile, C. (2001). A New Approximate Maximal Margin Classification Algorithm, Journal of Machine Learning Research, 2, 213-242.
Graepel, T., R. Herbrich, and R. C. Williamson (2001). From Margin To Sparsity, in: T.
K. Leen, T. G. Dietterich, and V. Tresp, ed., Advances in Neural Information Processing Systems 13, Cambridge, MA, MIT Press, 210-216.
Gunn, S. (1998). Matlab SVM Toolbox, Software package is available at http://www.isis.ecs.soton.ac.uk/resources/svminfo/.
Herbster, M. (2001). Learning Additive Models Online with Fast Evaluating Kernels, in:
Proceedings of 14th Annual Conference on Computational Learning Theory (COLT), Springer, 444-460.
Joachims, T. (2000). Estimating the Generalization Performance of a SVM Efficiently, in: Proceedings of the International Conference on Machine Learning, Morgan Kaufmann.
Kimeldorf, G. S., and G. Wahba (1971). Some Results on Tchebycheffian Spline Functions, Journal of Mathematical Analysis and Applications, 33, 82-95.
Kivinen, J., A. J. Smola, and R. C. Williamson (2002). Online Learning With Kernels, in: T. G. Dietterich, S. Becker, and Z. Ghahramani, ed., Advances in Neural Information Processing Systems 14, Cambridge, MA, MIT Press.
Lee, J-H, and C.-J. Lin (2001). Automatic Model Selection for Support Vector Machines, Machine Learning.
Li, Y., and P.M. Long (1999). The Relaxed Online Maximum Margin Algorithm, in: S.
A. Solla, T. K. Leen, and K.-R. Müller, ed., Advances in Neural Information Processing Systems 12, Cambridge, MA, MIT Press, 498-504.
Mackey, M.C., and L. Glass (1977). Oscillation and Chaos in Physiological Control Systems, Science, 197, 287-289.
Müller, K.R., A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik (1997). Predicting Time Series with Support Vector Machines, in: Proceedings of International Conference on Artificial Neural Networks, Lausanne, Switzerland.
Ralaivola, L., and F. d'Alche-Buc (2001). Incremental Support Vector Machine Learning:
a Local Approach, in: Proceedings of International Conference on Artificial Neural Networks, Aug. 21-25, Vienna, Austria.
Smola, A. J., and B. Schölkopf (1998). A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.
Syed, N. A., H. Liu, and K.K. Sung (1999). Incremental Learning With Support Vector Machines, in: Proceeding of International Joint Conference on Artificial Intelligence.
Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: an analysis and review, International Journal of Forecasting,16, 437-450.
Tay, F. E. H., and L. Cao (2001). Application of Support Vector Machines in Financial Time Series Forecasting, Omega, 29, 309-317.
Vapnik, V. (1998). Statistical Learning Theory, Wiley, New York.
Vapnik, V., and O. Chapelle (1999). Bounds on error expectation for support vector machine, in: A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, Ed., Advances in Large Margin Classifiers, Cambridge, MA, MIT Press.
Weigend, A. S., and N. A. Gershenfeld (1994). Time-series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley.