
Analysis of Switching Dynamics with Competing Support Vector Machines

Ming-Wei Chang and Chih-Jen Lin

Department of Computer Science and Information Engineering

National Taiwan University, Taipei 106, Taiwan

E-mail: lincj@ccms.ntu.edu.tw

Ruby C. Weng

Department of Statistics

National Chengchi University

Taipei 116, Taiwan

Abstract - We present a framework for the unsupervised segmentation of time series using support vector regression. It applies to non-stationary time series whose underlying dynamics change over time. We follow the architecture by Pawelzik et al. [13], which consists of competing predictors. In [13] competing neural networks were used, while here we exploit the use of support vector machines, a new learning technique. Results indicate that the proposed approach is as good as that in [13]. Differences between the two approaches are also discussed.

I. Introduction

Recently the support vector machine (SVM) [17] has become a promising method for data classification and regression. However, its use on other types of problems has not been exploited much. In this paper we apply it to the unsupervised segmentation of time series. We consider the case in Pawelzik et al. [13] where samples (x_t, y_t) are generated by a number m of unknown functions f_{r_t}, r_t \in \{1, \dots, m\}, which alternate according to r_t, i.e., y_t = f_{r_t}(x_t). We would like to determine the functions f_i and their respective r_t given the time series \{x_t, y_t\}_{t=1}^{l}. In other words, given points lying on different function surfaces, the task is to separate these points into groups, each corresponding to the points on one surface.

Practical applications of time-series segmentation include, for example, speech recognition [14], signal classification [5], and brain data [12].

Since no training information is available, this problem must be treated in an unsupervised manner. To correctly separate the points, we cannot rely only on the information in \{x_t, y_t\}; previous approaches usually need some additional properties.

In [13], the authors assumed that the time series has a low switching rate. That is, in general, data before and after any given time point t are from the same time series. Therefore, in addition to the spatial relation of x_t, t = 1, \dots, l, such an assumption provides more connections among the x_t. As another example, Feldkamp et al. [4] do not consider \{x_t, y_t\} to be from slowly changing time series. Instead they assume, for example, that there is a binary sequence like

01001011 \cdots,

where each \{x_t, y_t\} is associated with one of the four categories 00, 01, 10, and 11. Hence if \{x_t, y_t\} is in the 00 class, \{x_{t+1}, y_{t+1}\} must be in 00 or 01. Such additional information is typically available depending on the application. Here we focus on time series, so the same condition of a slow switching rate is assumed. Many papers have considered this issue by using neural networks; see [13], [10], [6] and references therein. Another important approach is via hidden Markov models, with examples in [1], [7], [16]. Basically, [13] proposed to use several competing neural networks weighted by their relative performance. Weights of different networks are adjusted in an annealed manner where the degree of competition is gradually increased. The neural network used is a radial basis function (RBF) network of the Moody-Darken type [9]. Here we follow a similar framework but discuss it more from the point of view of solving a global minimization problem. In addition, instead of RBF networks we use SVMs; their differences are also discussed.

This paper is organized as follows. In Section II, we discuss our approach and present how SVM can be incorporated. An important parameter in our algorithm is β, whose calculation is discussed in Section III. Section IV demonstrates experimental results on some data sets. We present some discussions in Section V.


II. Annealed Competition of Support Vector Machines

Without considering noise, we have y_t = f_{r_t}(x_t), t = 1, \dots, l. If we assign

p_i^t = 1 if r_t = i, and 0 otherwise,   (1)

then

\sum_{t=1}^{l} \sum_{i=1}^{m} p_i^t (y_t - f_i(x_t))^2 = 0.

Therefore, p_i^t, f_i, i = 1, \dots, m, t = 1, \dots, l, is an optimal solution of the following non-convex optimization problem:

\min_{p,f} \sum_{t=1}^{l} \sum_{i=1}^{m} p_i^t (y_t - f_i(x_t))^2
subject to \sum_{i=1}^{m} p_i^t = 1, \quad p_i^t \ge 0, \quad t = 1, \dots, l.   (2)

Of course we can always find a single function which fits all data so that the objective value of (2) is zero and hence already optimal. What we need is to avoid overfitting and adjust the values of p_i^t so that (1) is obtained. Then, according to whether p_i^t is zero or one, we can find out which group a point (x_t, y_t) belongs to.
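As a concrete illustration, the objective of (2) and the recovery of group labels from p_i^t can be sketched in a few lines of numpy; the function names are ours, not from the paper, and p is stored as an (l, m) array.

```python
import numpy as np

# Sketch of objective (2): weighted squared error over m candidate functions.
def seg_objective(p, preds, y):
    """p: (l, m) memberships p_i^t; preds: (l, m) values f_i(x_t); y: (l,)."""
    return np.sum(p * (y[:, None] - preds) ** 2)

def hard_assign(p):
    """Recover the group of each point from its largest membership."""
    return np.argmax(p, axis=1)

# Two known functions and perfect memberships: the objective is exactly zero.
x = np.linspace(0.0, 1.0, 8)
f1, f2 = 4 * x * (1 - x), x ** 2
r = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # true switching labels
y = np.where(r == 0, f1, f2)
p = np.eye(2)[r]                          # the 0/1 assignment of (1)
print(seg_objective(p, np.column_stack([f1, f2]), y))
print(hard_assign(p))
```

With the correct 0/1 memberships the objective vanishes, which is exactly the situation characterized by (1).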

In [13], the authors consider an iterative process where in each iteration the p_i^t are fixed first and m radial basis function (RBF) networks are used to minimize the quadratic functions:

\min_{\hat{f}_i} \sum_{t=1}^{l} p_i^t (y_t - \hat{f}_i(x_t))^2, \quad i = 1, \dots, m,   (3)

where

\hat{f}_i(x) = \sum_{t=1}^{\bar{l}} \alpha_t e^{-\|x - c_t\|^2 / (2\sigma^2)}   (4)

is the ith approximate predictor. Here c_1, \dots, c_{\bar{l}} are the "centers" used for constructing the functions. After new \hat{f}_i are obtained, they update p_i^t by

p_i^t = \frac{\exp(-\beta \sum_{\delta=-\Delta}^{\Delta} (e_i^{t-\delta})^2)}{\sum_{j=1}^{m} \exp(-\beta \sum_{\delta=-\Delta}^{\Delta} (e_j^{t-\delta})^2)},   (5)

where

e_i^t = y_t - \hat{f}_i(x_t)   (6)

and β is a parameter which controls the degree of competition.

This updating rule on p_i^t follows from Bayes' rule under the assumption that ((x_{t-\Delta}, y_{t-\Delta}), \dots, (x_{t+\Delta}, y_{t+\Delta})) are from the same time series. Therefore, using (5) we can put subsequent time-series data into the same group. In addition, if β is large, then p_i^t \approx 1 for i = \arg\min_j \sum_{\delta=-\Delta}^{\Delta} (e_j^{t-\delta})^2. This is the so-called hard competition (winner-takes-all). Here, instead of (2), we consider

\min_{p,f} \sum_{t=1}^{l} \sum_{i=1}^{m} p_i^t |y_t - f_i(x_t)|
subject to \sum_{i=1}^{m} p_i^t = 1, \quad p_i^t \ge 0, \quad t = 1, \dots, l.   (7)

When p_i^t is fixed, by considering \hat{f}_i(x) = w_i^T \phi(x) + b_i, we then solve

\min_{w_i, b_i} \frac{1}{2} w_i^T w_i + C \sum_{t=1}^{l} p_i^t (\xi_i^t + \xi_i^{t,*})
subject to -\epsilon - \xi_i^{t,*} \le y_t - (w_i^T \phi(x_t) + b_i) \le \epsilon + \xi_i^t,   (8)
\xi_i^t \ge 0, \quad \xi_i^{t,*} \ge 0, \quad t = 1, \dots, l,

which is a modification of the standard support vector regression. Note that without the term \frac{1}{2} w_i^T w_i, with \epsilon = 0 and the p_i^t fixed, (8) is equivalent to (7). The original idea of support vector regression is to find a function which approximates the hidden relationships of the given data. Here the data x_t are mapped into a higher-dimensional space by the function \phi. SVM uses the \epsilon-insensitive loss function, where data points in a tube of width \epsilon are considered correctly approximated. Note that we have a different weight C p_i^t in each term of the objective function; the original SVM usually considers a uniform penalty parameter C for all data. It is essential to use the so-called "regularization term" \frac{1}{2} w_i^T w_i in (8). Otherwise, if the kernel matrix K with K_{to} = \phi(x_t)^T \phi(x_o) is positive definite, the solution of (8) will have

y_t = w_i^T \phi(x_t) + b_i, \quad \text{if } p_i^t > 0.   (9)

That is, overfitting occurs and we are trapped at a local minimum, which is not what we want. Adding \frac{1}{2} w_i^T w_i remedies this problem so that (9) does not happen in early iterations. Then we can calculate the errors in (6) and use them for updating p_i^t in (5).
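The update (5) is a softmax over windowed squared errors. A minimal numpy sketch, with the window clipped at the series boundaries (a choice of ours; the paper does not specify boundary handling):

```python
import numpy as np

# Update rule (5). e has shape (m, l) with e[i, t] = e_i^t from (6);
# Delta is the window half-width.
def update_p(e, beta, delta):
    m, l = e.shape
    s = np.empty_like(e)
    for t in range(l):
        lo, hi = max(0, t - delta), min(l, t + delta + 1)  # clip at the ends
        s[:, t] = (e[:, lo:hi] ** 2).sum(axis=1)
    w = np.exp(-beta * s)
    return w / w.sum(axis=0)      # each column sums to one, as (2) requires

e = np.array([[0.1] * 6,          # machine 0 fits much better everywhere
              [1.0] * 6])
p = update_p(e, beta=5.0, delta=1)
print(p[0])                       # near 1: soft winner-takes-all
```

As β grows, the weights approach the hard-competition limit described above.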


At the optimal solution of (8) we have

w_i = \sum_{t=1}^{l} \alpha_i^t \phi(x_t),   (10)

where the \alpha_i^t are obtained from the dual formulation described later. So the ith predictor is

\hat{f}_i(x) = \sum_{t=1}^{l} \alpha_i^t K(x_t, x),

where K(x_t, x) = \phi(x_t)^T \phi(x) is usually called the kernel function.

Next we compare the use of RBF networks and SVM. If the RBF kernel is used for SVM,

K(x_t, x) = e^{-\|x_t - x\|^2 / (2\sigma^2)}.

Therefore, if we choose \bar{l} = l and c_t = x_t in (4), \epsilon = 0 in (8), and use the quadratic loss function

C \sum_{t=1}^{l} p_i^t ((\xi_i^t)^2 + (\xi_i^{t,*})^2)   (11)

in (8), then (11) is in fact the same as (3). Another difference is in the regularization term. Usually RBF networks are implemented with an additional regularization term \frac{1}{2} \sum_{t=1}^{\bar{l}} (\alpha_i^t)^2. This is different from \frac{1}{2} w^T w in SVR, which can be rewritten as

\frac{1}{2} \sum_{t=1}^{l} \sum_{o=1}^{l} \alpha_i^t \alpha_i^o K(x_t, x_o).

A possible advantage of SVM is that, with the linear \epsilon-insensitive loss function, it automatically decides the number of nonzero \alpha_i^t, and hence which \phi(x_t) will be used to construct \hat{f}_i. On the contrary, RBF networks have to decide \bar{l} and the c_t in advance. An example comparing RBF networks and SVM on classification problems is in [15]. Of course, how to set an appropriate \epsilon is also an issue. More discussion of the relation between RBF networks and SVM is in [3].

Implementations of RBF networks and SVM are also different. As (3) is an unconstrained minimization whose first-order optimality condition is a linear system, sometimes a direct method such as Gaussian elimination is used, while sometimes iterative methods using the steepest descent direction are considered. For the modified form of SVR (8), we usually consider its dual:

\min_{\bar{\alpha}, \bar{\alpha}^*} \frac{1}{2} (\bar{\alpha} - \bar{\alpha}^*)^T K (\bar{\alpha} - \bar{\alpha}^*) + \epsilon \sum_{t=1}^{l} (\bar{\alpha}_t + \bar{\alpha}_t^*) + \sum_{t=1}^{l} y_t (\bar{\alpha}_t - \bar{\alpha}_t^*)
subject to \sum_{t=1}^{l} (\bar{\alpha}_t - \bar{\alpha}_t^*) = 0,   (12)
0 \le \bar{\alpha}_t, \bar{\alpha}_t^* \le C p_i^t, \quad t = 1, \dots, l,   (13)

where K is a square matrix with K_{t,o} = K(x_t, x_o). Then \bar{\alpha}_t - \bar{\alpha}_t^* is the \alpha_i^t used in (10). The main difficulty in solving (12) is that K is a large dense matrix. This issue has also occurred in the classification case, where methods such as the decomposition method (e.g. [11]) have been proposed. The decomposition method starts from the zero vector and can avoid the memory problem if the percentage of support vectors (i.e., \bar{\alpha}_t - \bar{\alpha}_t^* \ne 0) is small. It has been extended to regression, but further modifications are needed to handle the different upper bounds C p_i^t. Here we consider the decomposition method used by the software LIBSVM [2], which can easily be modified for our purpose. The modified code of LIBSVM is available upon request.
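The only structural change relative to standard SVR is the weighted upper bound C p_i^t in (13). The numpy sketch below illustrates those box constraints with a plain projected-gradient method on the dual, dropping the bias term b so that the equality constraint (12) disappears; this is an illustration only, not the LIBSVM decomposition method, and it uses the textbook sign convention in which the predictor coefficient is \bar{\alpha}_t - \bar{\alpha}_t^*.

```python
import numpy as np

# Projected-gradient sketch of the weighted SVR dual: only the per-point
# bounds 0 <= alpha, alpha* <= C*p of (13) are kept (no bias, no equality
# constraint). All names here are ours.
def weighted_svr_dual(K, y, p, C=10.0, eps=0.01, lr=0.01, iters=2000):
    l = len(y)
    a, a_star = np.zeros(l), np.zeros(l)
    ub = C * p                                   # weighted upper bounds
    for _ in range(iters):
        g = K @ (a - a_star)                     # gradient of the quadratic term
        a      = np.clip(a      - lr * (g - y + eps), 0.0, ub)
        a_star = np.clip(a_star - lr * (-g + y + eps), 0.0, ub)
    return a - a_star

x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.02)  # RBF, 1/(2*sigma^2) = 50
p = np.ones(30)                                     # uniform weights here
coef = weighted_svr_dual(K, y, p)
pred = K @ coef
print(np.max(np.abs(pred - y)))                     # small on the training data
```

Setting some entries of p near zero shrinks the corresponding boxes to almost nothing, which is exactly how a machine ignores points assigned to its competitors.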

When the RBF kernel is used, the parameter σ affects how the data are fitted: it adjusts the smoothness of the predictors. Thus, if initially 1/\sigma^2 is not small, each machine will fit some data and become saturated; that is, we are trapped in a local minimum. Hence, we can start from a small 1/\sigma^2 and gradually increase it. Then, in the final steps when the data have been correctly separated, 1/\sigma^2 becomes large so that the predictors try to fit the different groups of data. Our experience also shows that the algorithm can work with a fixed σ which is, in some sense, neither too small nor too large. Of course the choice of such a σ depends on the smoothness of all the functions f_i, i = 1, \dots, m. For example, if a random sample of points on these functions shows a high degree of nonlinearity, it is safe to use a large fixed 1/\sigma^2 from the beginning.

The stopping criterion of our algorithm is

\frac{|\hat{obj} - obj|}{|obj|} \le 0.05,

where \hat{obj} and obj are the objective values of two consecutive iterations. Here the objective function is defined as

\sum_{t=1}^{l} \sum_{i=1}^{m} p_i^t |y_t - f_i(x_t)|.
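A minimal sketch of the linear objective of (7) and this 5% relative-change test (helper names are ours):

```python
import numpy as np

# Linear objective from (7): membership-weighted absolute error.
def objective(p, preds, y):
    """p, preds: (l, m); y: (l,)."""
    return np.sum(p * np.abs(y[:, None] - preds))

# Stop when the objective changes by at most 5% between iterations.
def converged(obj_prev, obj_new, tol=0.05):
    return abs(obj_new - obj_prev) / abs(obj_prev) <= tol

print(converged(10.0, 9.7))   # True: 3% relative change
print(converged(10.0, 8.0))   # False: 20% relative change
```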


Fig. 1. First four iterations (data without noise)

III. The Adjustment of β

In this section we describe our method for adjusting β, an important parameter which controls the update of p_i^t. From (6) we have

y_t = \hat{f}_i(x_t) + e_i^t, \quad i = 1, \dots, m, \quad t = 1, \dots, l.

Assume that the e_i^t are i.i.d. N(0, \tau). Define a_i to be the percentage of data in the ith group:

a_i \equiv \frac{\#\{t : r_t = i\}}{l};

that is, a_i = p(r_t = i). The log-likelihood function of y is

L(\tau) = \sum_{t=1}^{l} \log p(y_t | x_t, \tau),

where

p(y_t | x_t, \tau) = \sum_{i=1}^{m} p(y_t, r_t = i | x_t, \tau) = \sum_{i=1}^{m} a_i p_i(y_t | x_t, \tau),   (14)

with

p_i(y_t | x_t, \tau) \equiv p(y_t | r_t = i, x_t, \tau) = \frac{1}{\sqrt{2\pi\tau}} \exp\{-(y_t - \hat{f}_i(x_t))^2 / (2\tau)\} = \frac{1}{\sqrt{2\pi\tau}} \exp\{-(e_i^t)^2 / (2\tau)\}.   (15)

Hence

p(y_t, r_t | x_t, \tau) = a_{r_t} p_{r_t}(y_t | x_t, \tau).   (16)

Fig. 2. First six iterations (data with noise)

Let \hat{\tau} be an estimate of \tau. Then we can estimate p_i^t by

\hat{p}_i^t \equiv p(r_t = i | x_t, y_t, \hat{\tau}) = \frac{p(y_t, r_t = i | x_t, \hat{\tau})}{\sum_{k=1}^{m} p(y_t, r_t = k | x_t, \hat{\tau})} = \frac{a_i \exp\{-(e_i^t)^2 / (2\hat{\tau})\}}{\sum_{k=1}^{m} a_k \exp\{-(e_k^t)^2 / (2\hat{\tau})\}}   (17)

using (15) and (16). By comparing (5) and (17), we suggest choosing 1/(2\hat{\tau}) as our next β. Since \hat{\tau} is a measure of the variation of the e_i^t, it is intuitively clear that the next \hat{\tau} will decrease if the \hat{f}_i of the next iteration fit the data better. So the new β is likely to increase (corresponding to the fact that the temperature is decreasing).

Let \tau^{(g)} and p_i^{t(g)} \equiv p(r_t = i | x_t, y_t, \tau^{(g)}) be the information from the previous iteration. We shall show how to obtain \tau^{(g+1)}. Let X \equiv (x_1, \dots, x_l), Y \equiv (y_1, \dots, y_l), and R \equiv (r_1, \dots, r_l). Define

Q(\tau, \tau^{(g)}) \equiv E[\log p(Y, R | X, \tau) | X, Y, \tau^{(g)}]   (18)
= E[\log \prod_t p(y_t, r_t | x_t, \tau) | X, Y, \tau^{(g)}]   (19)
= \sum_{t=1}^{l} E[\log p(y_t, r_t | x_t, \tau) | X, Y, \tau^{(g)}]
= \sum_{t=1}^{l} E[\log(a_{r_t} p_{r_t}(y_t | x_t, \tau)) | X, Y, \tau^{(g)}]   (from (16))
= \sum_{t=1}^{l} \sum_{i=1}^{m} \log(a_i p_i(y_t | x_t, \tau)) \, p(r_t = i | X, Y, \tau^{(g)})
= \sum_{t=1}^{l} \sum_{i=1}^{m} (\log a_i) p_i^{t(g)} + \sum_{t=1}^{l} \sum_{i=1}^{m} [\log p_i(y_t | x_t, \tau)] p_i^{t(g)},   (20)

where (19) and (20) follow from the independence of the observations. Note that from (15),

\log p_i(y_t | x_t, \tau) = -\frac{1}{2} \left[\log(2\pi) + \log \tau + \frac{(e_i^t)^2}{\tau}\right].   (21)

Let \tau^{(g+1)} = \arg\max_{\tau} Q(\tau, \tau^{(g)}). Using (20), simple calculations show that the maximum of (20) occurs at

\tau^{(g+1)} = \frac{\sum_{t=1}^{l} \sum_{i=1}^{m} (e_i^t)^2 p_i^{t(g)}}{\sum_{t=1}^{l} \sum_{i=1}^{m} p_i^{t(g)}} = \frac{\sum_{t=1}^{l} \sum_{i=1}^{m} (e_i^t)^2 p_i^{t(g)}}{l},   (22)

where

p_i^{t(g)} = p(r_t = i | x_t, y_t, \tau^{(g)}) = \frac{a_i \exp\{-(e_i^t)^2 / (2\tau^{(g)})\}}{\sum_{k=1}^{m} a_k \exp\{-(e_k^t)^2 / (2\tau^{(g)})\}}   (23)

can be obtained similarly to (17).

Therefore, at the (g+1)st iteration of the implementation, we replace β in (5) by 1/(2\tau^{(g+1)}). In practice we do not actually compute (23) but use (5) instead; therefore, we also do not have to worry about the a_i, which are unknown in advance.

Because we are using a linear loss function in support vector regression, we feel that linear instead of quadratic terms should be used in all formulations. Therefore, in (5), (22), and (23), every (e_i^t)^2 is replaced by |e_i^t|. In other words, though the derivation in this section assumes that e_i^t has a normal distribution, if we instead take it to have a Laplace (double exponential) distribution, we obtain the corresponding results with |e_i^t|.
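The β schedule can be sketched as follows; `next_beta` is our name for the helper, `absolute=True` gives the |e_i^t| variant adopted here, and `absolute=False` follows the Gaussian derivation (22):

```python
import numpy as np

# tau is the membership-weighted mean deviation of (22); the next beta is
# 1/(2*tau). e, p have shape (m, l); each column of p sums to one, so the
# denominator in (22) is just l.
def next_beta(e, p, absolute=True):
    l = e.shape[1]
    dev = np.abs(e) if absolute else e ** 2
    tau = np.sum(p * dev) / l
    return 1.0 / (2.0 * tau)

e = np.array([[0.1, 0.2],
              [0.3, 0.1]])
p = np.array([[1.0, 0.0],
              [0.0, 1.0]])                  # hard memberships
print(next_beta(e, p, absolute=False))      # tau = (0.01 + 0.01)/2 = 0.01 -> beta = 50
print(next_beta(e, p))                      # tau = (0.1 + 0.1)/2 = 0.1 -> beta = 5
```

As the predictors improve, tau shrinks and β rises, which is the annealing behaviour described above.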

IV. Experiments

A. Four Chaotic Time Series

We test the extreme case of a completely overlapping input manifold used in [13]. For all (x_t, y_t), y_t = f_{r_t}(x_t). They consider all x_t \in [0, 1] and four different functions: f_1(x) = 4x(1 - x); f_2(x) = 2x if x \in [0, 0.5) and 2(1 - x) if x \in [0.5, 1]; f_3(x) = f_1(f_1(x)); and f_4(x) = f_2(f_2(x)). An illustration of these functions is in Figure 1. It is easily seen that all these functions map x from [0, 1] to [0, 1].

In the beginning we randomly assign p_i^t to be 0 or 1 while keeping the condition \sum_{i=1}^{m} p_i^t = 1, t = 1, \dots, l. We set 1/(2\sigma^2) of the RBF kernel to 50. For updating p_i^t, we consider \Delta = 3. Following [6], the four series are activated consecutively, each for 100 time steps, giving 400 time steps overall. We use three such periods, so in total there are 1,200 steps.
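Data of this kind can be generated as below. Drawing each x_t uniformly from [0, 1] is our simplification: it matches the completely overlapping input manifold, but not necessarily the exact sampling used in [13].

```python
import numpy as np

# The four chaotic maps on [0, 1], activated consecutively for 100 steps
# each, over three periods: 3 * 4 * 100 = 1,200 points.
f1 = lambda x: 4 * x * (1 - x)
f2 = lambda x: np.where(x < 0.5, 2 * x, 2 * (1 - x))
f3 = lambda x: f1(f1(x))
f4 = lambda x: f2(f2(x))
funcs = [f1, f2, f3, f4]

rng = np.random.default_rng(0)
xs, ys, rs = [], [], []
for period in range(3):
    for i, f in enumerate(funcs):
        x_seg = rng.uniform(0.0, 1.0, 100)   # x_t drawn uniformly (our choice)
        xs.append(x_seg)
        ys.append(f(x_seg))
        rs.extend([i] * 100)
x, y, r = np.concatenate(xs), np.concatenate(ys), np.array(rs)
print(len(y), float(y.min()), float(y.max()))
```

All four maps send [0, 1] into [0, 1], so every y_t stays in the unit interval, as noted above.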

For this case the algorithm stops in five iterations. We present the first four in Figure 1 where it can be seen that points are well separated. We assume the number of functions is unknown so we start from six competing SVMs for this case. Our experience indicates that if we use exactly four SVMs, sometimes it may fall into local minima. Thus, using more SVMs may be necessary.

We also consider cases where the groups contain different numbers of data points. Our implementation has been able to handle such data with different ratios.

To further test our implementation, we add noise to these four functions using 0.1N(0, 0.5). The algorithm stops in seven iterations; the first six are shown in Figure 2.

B. Mackey-Glass Time Series

Similar to the earlier results, we also examine time series obtained from the Mackey-Glass delay-differential equation [8]:

\frac{dx(t)}{dt} = -0.1 x(t) + \frac{0.2 x(t - t_d)}{1 + x(t - t_d)^{10}}.

Following earlier experiments, points are selected every six time steps. Sequentially we generate 300 points in each segment using the order t_d = 23, 17, 23, 30. Thus, in total there are 1,200 points for testing. The embedding dimension is d = 6; that is, y_t is the one-step-ahead value of six consecutive x_t. For this problem we set 1/(2\sigma^2) to 1. Other settings are the same as in Section IV-A. Results are in Figure 3, where it can be seen that the different segments are well separated.
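Segments of this kind can be reproduced with a simple Euler integration of the delay equation; the step size dt = 1 and the constant initial history are our choices, since the paper does not state its integrator.

```python
import numpy as np

# Euler integration of the Mackey-Glass equation with delay td, then
# subsampling every six steps as in the experiment.
def mackey_glass(td, n, dt=1.0, x0=1.2):
    hist = int(td / dt)                  # length of the delay buffer
    x = np.full(n + hist, x0)            # constant initial history (our choice)
    for t in range(hist, n + hist - 1):
        x_d = x[t - hist]                # delayed value x(t - td)
        x[t + 1] = x[t] + dt * (-0.1 * x[t] + 0.2 * x_d / (1.0 + x_d ** 10))
    return x[hist:]

# Four segments in the order td = 23, 17, 23, 30; 300 subsampled points each.
series = np.concatenate([mackey_glass(td, 1800)[::6] for td in (23, 17, 23, 30)])
print(series.shape)
```

The d = 6 embedding then takes six consecutive subsampled values as x_t and the next value as y_t.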

V. Discussion

For SVM, the number of support vectors directly affects the training and testing time. A zero p_i^t means that \alpha_i^t in (10) is not necessary, so the corresponding variables in the dual problem can be removed. However, in theory p_i^t can never be exactly zero due to (5). Thus, we use a threshold of 0.01 for removing points with small p_i^t. The computational time can then be largely reduced.

One difference between SVR and neural networks is the use of \epsilon, the width of the insensitive tube in (8). With an appropriate \epsilon, SVR can be smoother and tolerate more noise. In the above case with noise, setting \epsilon = 0.05 results in fewer support vectors and less running time.
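The pruning described above can be sketched as follows (the helper name is ours):

```python
import numpy as np

# Points whose membership p_i^t falls below the threshold are dropped from
# machine i's training set, shrinking its dual problem.
def active_sets(p, threshold=0.01):
    """p: (m, l). Returns, for each machine, the indices it still trains on."""
    return [np.flatnonzero(row > threshold) for row in p]

p = np.array([[0.995, 0.001, 0.6],
              [0.005, 0.999, 0.4]])
print([idx.tolist() for idx in active_sets(p)])   # [[0, 2], [1, 2]]
```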

Fig. 3. p_i^t, i = 1, \dots, 3, at each time point t

References

[1] T. W. Cacciatore and S. J. Nowlan. Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 719-726. Morgan Kaufmann Publishers, Inc., 1994.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[3] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

[4] L. A. Feldkamp, T. M. Feldkamp, and D. V. Prokhorov. An approach to adaptive classification. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, 2001.

[5] S. Haykin and D. Cong. Classification of radar clutter using neural networks. IEEE Transactions on Neural Networks, 2:589-600, 1991.

[6] A. Kehagias and V. Petridis. Time-series segmentation using predictive modular neural networks. Neural Computation, 9:1691-1709, 1997.

[7] S. Liehr, K. Pawelzik, J. Kohlmorgen, and K.-R. Müller. Hidden Markov mixtures of experts with an application to EEG recordings from sleep. Theory in Biosciences, 118(3-4):246-260, 1999.

[8] M. C. Mackey and L. Glass. Oscillation and chaos in physiological control systems. Science, 197:287-289, 1977.

[9] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-293, 1989.

[10] K.-R. Müller, J. Kohlmorgen, and K. Pawelzik. Analysis of switching dynamics with competing neural networks. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E78-A(10):1306-1315, 1995.

[11] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR'97, 1997.

[12] K. Pawelzik. Detecting coherence in neuronal data. In L. Van Hemmen and K. Schulten, editors, Physics of Neural Networks. Springer, 1994.

[13] K. Pawelzik, J. Kohlmorgen, and K.-R. Müller. Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8(2):340-356, 1996.

[14] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.

[15] B. Schölkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758-2765, 1997.

[16] S. Shi and A. Weigend. Taking time seriously: Hidden Markov experts applied to financial engineering. In CIFEr '97: Proceedings of the Conference on Computational Intelligence for Financial Engineering, pages 244-252. IEEE, 1997.

[17] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.
