Analysis of Switching Dynamics With Competing Support Vector Machines

Ming-Wei Chang, Chih-Jen Lin, Member, IEEE, and Ruby Chiu-Hsing Weng

Abstract—We present a framework for the unsupervised segmentation of switching dynamics using support vector machines. Following the architecture by Pawelzik et al., where annealed competing neural networks were used to segment a nonstationary time series, in this paper we exploit the use of support vector machines, a well-known learning technique. First, a new formulation of support vector regression is proposed. Second, an expectation-maximization step is suggested to adaptively adjust the annealing parameter. Results indicate that the proposed approach is promising.

Index Terms—Annealing, competing experts, expectation maximization (EM), support vector machines (SVMs), unsupervised time series segmentation.

I. INTRODUCTION

Recently, support vector machines (SVMs) [26] have emerged as a promising method for data classification and regression. For an account of various applications of the method, see [15], [16], and the references therein. However, its application to unsupervised learning problems has not been exploited much. In this paper, we aim to apply it to the unsupervised segmentation of time series. Practical applications of unsupervised time series segmentation include, for example, speech recognition [23], signal classification [6], and brain data [9], [20].

The topic of unsupervised segmentation has been investigated by researchers in various fields. See, for example, an early survey of various approaches in [18], the fuzzy c-regression models (FCRMs) [5] with applications to models with known forms and known distribution of the noise term, a combination of supervised and unsupervised learning using hidden Markov models based on neural networks [4], and other competing neural network approaches in [8], [9], [11], [14], and [21].

In [14] and [21], annealed competing neural networks were used to segment a nonstationary time series, where the nonstationarities are caused by switching dynamics. This method is called "annealed competition of experts" (ACE). Unlike the mixtures-of-experts architecture [7], which uses an input-dependent gating network, the ACE method drives the competition of experts by an evaluation of prediction performance, so that the underlying dynamics can have overlapping input domains. Two main features of this approach are memory derived from a slow switching rate and deterministically annealed competition of the experts during training. The assumption of a slow switching rate is imposed to resolve problems caused by overlapping input-output relations. The idea of annealing is to avoid getting stuck in local minima and to resolve the underlying dynamics in a hierarchical manner. (The deterministic annealing method was described in the context of clustering [24].) The neural network used is a radial basis function (RBF) network of the Moody–Darken type [13].

Manuscript received February 11, 2002; revised February 5, 2003. This work was supported in part by the National Science Council of Taiwan under Grants NSC 90-2213-E-002-111 and NSC 91-2118-M-004-003.

M.-W. Chang and C.-J. Lin are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: cjlin@csie.ntu.edu.tw).

R. C.-H. Weng is with the Department of Statistics, National Chengchi University, Taipei 116, Taiwan.

Digital Object Identifier 10.1109/TNN.2004.824270

The present paper aims to solve the same unsupervised segmentation problem as in [14] and [21]. We propose a framework using competing SVMs. The standard SVM assumes equal weights on all error terms, which means that each data point is equally important. However, due to the switching nature of the problem, the data points actually come from different sources, so the contribution of each data point to each predictor should not be the same. To address this, we propose a modified formulation of SVMs that allows different weights on the error terms. The dual problem and the implementation for this formulation are also new. Here, the weights are adjusted by relative prediction performance, as in [14] and [21]. In addition to the new formulation, this paper is novel in presenting an adaptive annealing method. A key observation is that the annealing parameter characterizes a statistical property of the error terms. Therefore, at each iteration during training, we treat the annealing parameter as an unknown parameter and estimate it based on the current weighting coefficients and error terms. This estimate is essentially the maximum likelihood estimate, which can be obtained by an expectation-maximization (EM) step. We shall call this annealing method adaptive deterministic annealing.

The paper is organized as follows. In Section II, we briefly describe the framework of [14] and [21] and present a modified SVM formulation. In Section III, we give the motivation behind the proposed annealing method and derive an estimate of the annealing parameter. The implementation of the modified SVM formulation is in Section IV. Section V demonstrates experimental results on some data sets. Then, we give a discussion in Section VI.

II. MODIFIED SVM FORMULATION

First, we describe the segmentation problem using input–output pairs and outline the approach of [21]. Then, we review the standard SVM and introduce a modified formulation.

Let the data be generated by several unknown functions, where each data point is produced by exactly one of them. If the input at each step is the previous output, we get a time series. The task is to determine which function generated each point and to recover the functions themselves. For notational simplicity, we take the embedding order to be one in this section. The reader should note that the method presented below is not restricted to time series with embedding order 1; the extension to a higher embedding order can be done by simply replacing the scalar input by vectors obtained from the time-delay embedding of the time series [10], [21].

Define an indicator that equals one if the corresponding data point is generated by a given function, and zero otherwise. Then this indicator is an optimal solution of the following nonconvex optimization problem:

(1)

The main computations for solving (1) can be divided into two steps: 1) minimize the objective function in (1) over the predictors with the weighting coefficients fixed; and 2) adjust the weighting coefficients given the current predictors. In [21], the authors use RBF networks for step 1), i.e., for each predictor they solve

(2)

where the predictors are RBF networks. For step 2), they assume that the prediction errors are Gaussian distributed and then, by Bayes' rule,

(3)

where

(4)

and the annealing parameter controls the degree of competition among the predictors. To incorporate the property of low switching rates, short sequences of consecutive points are assumed to come from the same source. Using Bayes' rule again, the weighting coefficients can be updated as

(5)

We will further discuss the annealing parameter in Section III and the choice of the memory length in Section V.

We summarize the algorithm in [21] as follows.

Algorithm II.1:

1) Given the annealing parameter and the parameters of the RBF networks, randomly pick initial weighting coefficients that sum to one over the predictors at each point.
2) Solve (2) using the RBF networks and obtain the predictors.
3) Increase the annealing parameter slightly and update the weighting coefficients by (5).
4) If certain stopping criteria are met, terminate the algorithm. Otherwise, go to step 2).
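
As an illustration of this loop, the sketch below is a minimal, hedged implementation of the annealed-competition procedure, assuming an ACE-style update in the spirit of [21]: each expert's weight at a point is a softmax over the experts' recent squared prediction errors, averaged over a short memory window to reflect the slow switching rate. The regressor class (any estimator accepting per-sample weights, such as a scikit-learn regressor), the geometric annealing schedule, and the window length are illustrative assumptions, not the paper's exact settings.

```python
# A hedged sketch of the annealed competition of experts (Algorithm II.1), not the
# authors' implementation.  X: embedded inputs (T, d); y: targets (T,).
import numpy as np

def ace_segment(X, y, make_regressor, n_experts=6, beta=1.0,
                anneal_factor=1.2, memory=7, n_iters=50, tol=1e-4):
    T = len(y)
    rng = np.random.default_rng(0)
    p = rng.random((n_experts, T))            # step 1: random initial weights,
    p /= p.sum(axis=0, keepdims=True)         # normalized over experts at each point
    prev_obj = np.inf
    for _ in range(n_iters):
        # Step 2: fit each expert with its current per-point weights (cf. (2)).
        experts = [make_regressor().fit(X, y, sample_weight=p[i])
                   for i in range(n_experts)]
        err2 = np.array([(y - m.predict(X)) ** 2 for m in experts])
        # Slow-switching memory: average errors over a short window (cf. (5)).
        kernel = np.ones(memory) / memory
        smoothed = np.array([np.convolve(e, kernel, mode="same") for e in err2])
        # Step 3: increase beta slightly and recompute the weights (softmax, cf. (3)).
        beta *= anneal_factor
        logits = -0.5 * beta * smoothed
        logits -= logits.max(axis=0, keepdims=True)    # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=0, keepdims=True)
        # Step 4: stop when the weighted error no longer improves (cf. (9)).
        obj = float((p * err2).sum())
        if np.isfinite(prev_obj) and abs(prev_obj - obj) <= tol * max(obj, 1.0):
            break
        prev_obj = obj
    return p, experts
```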

Next, we propose to modify support vector regression (SVR) for solving (2). The idea of SVR is to map the input data into a higher dimensional space and find a function that approximates the hidden relationships in the given data. The standard formulation is as follows:

(6)

where the weight vector and the bias are unknown parameters to be estimated, the slack variables indicate errors, the tube width is the tolerance, and the mapping is the function that maps the data to a higher dimensional space. SVR uses the so-called "regularization term" in (6) to avoid overfitting. Without this term, for certain mapping functions, the solution of (6) always fits the given data exactly; see, for example, [2, Cor. 1]. Thus, it overfits the data. If we substitute the optimal values of the error terms, then (6) can be rewritten as

(7)

where the loss is the ε-insensitive loss function.

The standard SVR considers a uniform penalty parameter C for all data. By comparing (2) and (7), we propose to use a different weight on each error term of the objective function. Thus, for each predictor, we have the following modification:

(8)

Note that if we remove the regularization term and take the tube width to be zero, then (8) is equivalent to (2) with the squared loss replaced by an absolute loss. The role of the regularization term is similar to that in the standard SVR. Without it, we would overfit the data, because in that case the solution fits exactly every data point that has a nonzero weight.
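
As a concrete, hedged illustration of (8): scikit-learn's SVR accepts a per-sample weight that rescales the penalty C for each point, which gives each dual variable an upper bound proportional to its weight, the same structure as (8) and the dual (17) derived in Section IV. The tube width 0.03 and the RBF parameter 50 follow the settings reported in Section V-A; the value of C and the toy data are assumptions.

```python
# A hedged sketch: per-point weights via scikit-learn's `sample_weight`, which scales
# the penalty C per sample, mirroring the modified formulation (8).  epsilon = 0.03 and
# gamma = 50 follow Section V-A; C = 1000 and the toy data are assumptions.
import numpy as np
from sklearn.svm import SVR

def fit_weighted_svr(X, y, weights, C=1000.0, epsilon=0.03, gamma=50.0):
    """Fit one competing predictor whose error terms are weighted as in (8) (approximately)."""
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
    model.fit(X, y, sample_weight=weights)
    return model

# Toy usage: points with weight near zero barely influence the fitted expert.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.01 * rng.standard_normal(200)
weights = rng.random(200)                 # stand-in for the current weighting coefficients
expert = fit_weighted_svr(X, y, weights)
```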

The stopping criterion of our algorithm is as follows:

(9)

where the two quantities compared are the objective values of two consecutive iterations. Here, the objective function is defined as the total training error of the current predictors weighted by the current coefficients.

A summary of our changes from Algorithm II.1 is as follows.

Algorithm II.2:

1') Given the weighting coefficients, the SVR parameters C and the tube width, and the SVR mapping function, solve (8) and obtain the predictors.
3') Update the annealing parameter by the method described in Section III. If (9) is satisfied, stop the algorithm.

For an account of the relation between RBF networks and SVRs with RBF kernels, see [3], [25], and the references therein.

III. ADJUSTMENT OF THE ANNEALING PARAMETER USING MAXIMUM LIKELIHOOD ESTIMATION

Recall that the annealing parameter in (3) and (5) controls the degree of competition. If it is large, the weight of the best-performing predictor at each point tends to one and the others tend to zero. This is the so-called hard competition (winner-takes-all). If one uses hard competition right from the beginning, it is very likely to get stuck in an undesired local minimum. In [21], this problem was resolved by using deterministic annealing.

In this section, we describe an adaptive method for adjusting the annealing parameter. Let the error, defined in (4), denote the difference between the observed value and the estimated value. From (3), the annealing parameter can be viewed as a parameter that characterizes a certain statistical property of this error. Therefore, at each iteration, we suggest updating it by maximizing the conditional likelihood function given the current predictors. To do so, we assume that, given the current predictors, the observation follows a mixture of Gaussian distributions whose means are the predicted values, with a common unknown variance and mixing coefficients. Then, the density of the observation given the input is

(10)

Let an estimate of the variance be given. Then, using (10), we can estimate the posterior weights by Bayes' rule

(11)

By comparing (5) and (11), we suggest using this variance estimate to set the next value of the annealing parameter. Since the variance measures the variation of the errors, it is intuitively clear that the next variance estimate will decrease if the predictors fit the data better in the next iteration. So the new annealing parameter is likely to increase, which corresponds to the annealing property used in [24] and [21].

We propose to estimate the variance in each iteration by the maximum likelihood principle. The estimate can be obtained by the EM method. To see how, let the source indicator be as in Section II; it is a latent (unobservable) variable. We augment the observations with this indicator and call the result the complete data. Then, the complete-data log-likelihood function is

(12)

Given the current parameter estimates, the E-step calculates the expectation of the complete-data log-likelihood function conditional on the observed data and the current estimates

(13)

(14)

where the second equality follows from the independence of each observation and the third equality follows from (12) and the definition of the latent indicator.

The M-step finds the maximizer of

(15)

By (14) and the Gaussian assumption, we find that the maximum of (14) occurs at

(16)

where the posterior weights are obtained by replacing the unknown parameters in (11) with their current estimates.

To apply this method, the Gaussian assumption in (10) is not necessary. For example, if we replace the mixture of Gaussians by a mixture of Laplace (double exponential) distributions with a common scale parameter, the expectation and maximization steps can be carried out similarly. The resulting estimate is similar to (16), but with squared errors replaced by absolute errors. Because a linear loss function is used for SVR here, we suggest using absolute errors instead of squared errors when updating the annealing parameter. Therefore, at each iteration of our implementation, we update the weighting coefficients by (5) with the annealing parameter set from the new estimate and the squared errors replaced by absolute errors. Note that we use (5) rather than (11) because (5) incorporates the slowly switching property [21], while the mixing coefficients are less important in the implementation.
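
The following is a minimal sketch of this adaptive annealing step under the Laplace variant just described: the M-step estimate of the common scale parameter is the weight-averaged absolute error, and the next annealing parameter is taken to be its reciprocal. The exact proportionality and the use of the current weighting coefficients as the responsibilities are assumptions consistent with the description above, not the authors' exact formulas.

```python
# A hedged sketch of the adaptive annealing update of Section III (Laplace variant):
# the common scale parameter is estimated as the weight-averaged absolute error
# (cf. (16) with |e| in place of e^2); the next annealing parameter is assumed to be
# its reciprocal.  abs_err[i, t] is |error| of expert i at point t, p[i, t] the weights.
import numpy as np

def update_annealing(abs_err, p, floor=1e-12):
    # M-step: weight-averaged absolute error (weights sum to one at each point,
    # so p.sum() equals the number of points).
    scale_hat = float((p * abs_err).sum() / p.sum())
    # Larger beta = harder competition; a better fit (smaller scale) raises beta.
    beta_next = 1.0 / max(scale_hat, floor)
    return beta_next, scale_hat
```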

IV. SOLVING THE MODIFIED SUPPORT VECTOR REGRESSION

Fig. 1. First four iterations (data without noise).

In this section, we discuss how to solve the new SVR formulation (8). As SVRs map the data into higher dimensional spaces, the weight vector is a long vector variable; thus, the dual problem is usually easier to solve. The Lagrangian dual of (8) can be derived in a similar way to that of the standard SVR, so here we give only the resulting formulation

(17)

where the kernel matrix is a square matrix whose entries are kernel evaluations on pairs of training points. Here, we mainly consider the RBF kernel. The main difference from the standard dual is that each dual variable has its own upper bound, namely the penalty parameter C scaled by the corresponding weight. At the optimal solution, the approximating function is expressed through the dual variables, and if a dual variable is nonzero, the corresponding point is a support vector of the modified SVR (17).

The main difficulty in solving (17) is that the kernel matrix is large and dense. This issue also arises in classification, and methods such as the decomposition method (e.g., [19]) have been proposed. The decomposition method avoids the memory problem by iteratively working on a few variables at a time. The extreme case is sequential minimal optimization (SMO) [22], where in each iteration only two variables are updated. Here, we consider the SMO-type implementation in LIBSVM [1]. Originally, in LIBSVM, the two variables of the working set are selected by the following rule: the first element is the variable attaining the smallest value among certain quantities, and the second element is the variable attaining the largest value among the corresponding quantities. These quantities are derived from the optimality condition of the dual of (7). We change the uniform upper bound C in this rule to the per-variable bound, and the convergence of the algorithm still holds. The modified code of LIBSVM is available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools.
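
The working-set selection above is specific to LIBSVM; a simpler way to see the structure of (17) is to hand the dual directly to a generic quadratic programming solver. The sketch below does this with cvxopt for an RBF kernel, where the per-variable upper bounds (C times the corresponding weight) are the only change from the standard ε-SVR dual. The variable layout, the small ridge added for numerical stability, and the bias-recovery rule are implementation assumptions, not part of the paper.

```python
# A minimal sketch (not the SMO-type LIBSVM solver used by the authors): solve the
# weighted epsilon-SVR dual with per-variable upper bounds C*p_i using a generic QP
# solver.  Conventions (alpha/alpha*, the bias rule, the ridge term) are assumptions.
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X, Z, gamma):
    # K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_svr_dual(X, y, p, C=1.0, eps=0.03, gamma=50.0):
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    # Stack z = [alpha; alpha_star]; objective 0.5 z'Pz + q'z.
    P = np.block([[K, -K], [-K, K]]) + 1e-8 * np.eye(2 * n)   # ridge for numerical PSD
    q = np.concatenate([eps - y, eps + y])
    # Box constraints 0 <= alpha_i, alpha_star_i <= C * p_i (the per-variable bounds).
    ub = np.concatenate([C * p, C * p])
    G = np.vstack([-np.eye(2 * n), np.eye(2 * n)])
    h = np.concatenate([np.zeros(2 * n), ub])
    # Equality constraint sum(alpha - alpha_star) = 0.
    A = np.concatenate([np.ones(n), -np.ones(n)]).reshape(1, -1)
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix([0.0]))
    z = np.array(sol["x"]).ravel()
    alpha, alpha_star = z[:n], z[n:]
    coef = alpha - alpha_star                 # expansion coefficients of the predictor
    f0 = K @ coef                             # decision values without the bias
    tol = 1e-6
    free = (alpha > tol) & (alpha < C * p - tol)              # "free" support vectors
    free_star = (alpha_star > tol) & (alpha_star < C * p - tol)
    if free.any():
        b = float(np.mean(y[free] - f0[free] - eps))
    elif free_star.any():
        b = float(np.mean(y[free_star] - f0[free_star] + eps))
    else:
        b = float(np.mean(y - f0))
    return coef, b   # predict with: rbf_kernel(X, X_new, gamma).T @ coef + b
```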

V. EXPERIMENTS

A. Four Chaotic Time Series

We test the case of completely overlapping input manifolds used in [21]. They consider four different piecewise-defined functions, all of which map the same interval onto itself, so the input domains of the underlying dynamics overlap completely. They activate the four functions consecutively, each for 100 time steps, giving 400 time steps overall, and then repeat the procedure three times to obtain a time series of 1200 points. An illustration of these functions is in Fig. 1. As the number of functions is considered unknown, we start from six competing SVMs. In the end, some SVMs represent the same function, and the six SVMs represent four different functions. If we use only four competing SVMs in the beginning, we find it is easier to fall into local minima than when using six SVRs.

Fig. 2. First six iterations (data with noise).

For the SVR parameters, we simply fix the penalty parameter C and set the width of the insensitive tube to 0.03 (actually, the tube width may be set to any small nonnegative number, say 0 to 0.05). We set the parameter of the RBF kernel to 50. Since a smaller kernel parameter means greater smoothness of the predictors, in order to avoid getting trapped in a local minimum, we may start with a smaller value and gradually increase it during training. Therefore, we also try adjusting it adaptively during training, using the estimate in (16); this works well too. The choice of the memory length is the same as that in [8]: as the small switching rate 1/100 guarantees a large probability that short sequences of length 7 contain no switching event, a memory of that length works. We also find that this parameter is not crucial during training. For this case, the algorithm stops in about six iterations and the data points are well separated; see Fig. 1 for the first four iterations. Here, the initial weighting coefficients are chosen to be around 1/6, so in the beginning the six predictors are about the same. As the iterations proceed, the underlying structure resolves in a hierarchical manner.

We then add noise to these four functions. The algorithm stops in seven iterations. In Fig. 2, we present the first six iterations. For this case, we randomly assign the initial weighting coefficients to be 0 or 1, subject to the constraint that they sum to one at each point. By comparing the first graphs in Figs. 1 and 2, we find that if the initial coefficients are chosen to be around 1/6, the six predictors obtained at the first iteration are closer to each other.

B. Mackey–Glass Chaotic System

Next, we consider a high-dimensional chaotic time series obtained from the Mackey–Glass delay-differential equation [12].

Fig. 3. p_i, i = 1, ..., 3, at each time point t (x-axis: time; y-axis: p_i).

Following earlier experiments, points are selected every six time steps. We generate 300 points with one setting of the delay parameter; we then switch the delay to 17, 23, and 30, generating another 300 points for each setting. In total, there are 1200 points from three underlying dynamics. The series is embedded so that the target is the one-step-ahead value of the embedded input vector. For this problem, we change one parameter setting and leave the others the same as in the previous example. As can be seen in Fig. 3, the underlying dynamics are well segmented.
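
For concreteness, the sketch below generates data of this kind: it integrates the Mackey–Glass delay-differential equation in its common benchmark form dx/dt = 0.2 x(t-τ)/(1 + x(t-τ)^10) - 0.1 x(t) (an assumption here, since the display is not legible in this copy), samples every six time steps, concatenates segments with different delays, and builds a time-delay embedding. The Euler step, the constant initial history, and the embedding helper are illustrative choices.

```python
# A hedged sketch of the data generation in Section V-B.  The Mackey-Glass coefficients
# (0.2, 0.1, exponent 10), the Euler integration, and the constant initial history are
# assumptions; the delays 17, 23, 30 are the ones stated in the text (the first regime's
# delay is not legible in this copy).
import numpy as np

def mackey_glass(n_points, tau, subsample=6, x0=1.2):
    """Generate n_points samples (one every `subsample` unit steps) with delay `tau`."""
    n_steps = n_points * subsample + tau
    x = np.full(n_steps, x0)
    for t in range(tau, n_steps - 1):
        x[t + 1] = x[t] + 0.2 * x[t - tau] / (1.0 + x[t - tau] ** 10) - 0.1 * x[t]
    return x[tau::subsample][:n_points]

def switching_series(taus=(17, 23, 30), points_per_regime=300):
    """Concatenate regimes generated with different delay parameters."""
    return np.concatenate([mackey_glass(points_per_regime, tau) for tau in taus])

def embed(series, dim):
    """Time-delay embedding: X[t] = (y_{t-dim}, ..., y_{t-1}), target y[t]."""
    X = np.array([series[t - dim:t] for t in range(dim, len(series))])
    return X, series[dim:]
```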

C. Santa Fe Time Series: Data Set D

We also consider the benchmark Data Set D from the Santa Fe Time Series Prediction Competition [27]. This is an artificial data set generated from a nine-dimensional periodically driven system with an asymmetrical four-well potential and a drift in the parameters. The system operates in one well for some time and then switches to another well with a different dynamical behavior. Each participant is asked to predict the first 25 steps of the continuation, and the evaluation is based on the root mean squared error (RMSE) of the prediction. The RMSE obtained by the winner of the Santa Fe Competition [27, pp. 219–241] is 0.0665.

One may train a model on the whole dataset and then make predictions. However, if the data actually come from different sources, it is better to perform segmentation before prediction. The authors of [21] propose using the ACE method to segment the time series into regimes of approximately stationary dynamics with six RBF networks; the prediction is then done simply by iterating the particular predictor responsible for generating the latest training data. Similar to [21], the authors of [17] use ACE with six RBF networks to segment the time series. Instead of RBF networks, they then use SVR to train on the particular class of data that includes the data points at the end of the full training set. They also use SVR to train on the full dataset and compare its prediction performance with that obtained by using segmentation.

Here, we start with the standard SVR (6) and train a model on the whole dataset. We set the embedding dimension to 20 and the kernel parameter as in [17]. The SVR tube width is set to 0, and C is determined by five-fold cross validation. By five-fold cross validation, we mean that after the data are transformed into input-output pairs, they are randomly separated into five groups; sequentially, one group is validated by training on the rest. We then use this model to predict the first 25 steps.

Fig. 4. Distribution of the selected C values.

Unfortunately, the C value with the smallest validation error does not always produce the best prediction. For example, a typical run shows that the smallest validation error leads to a prediction error of 0.0748, whereas the smallest prediction error, 0.0521, is attained at a C value with a larger validation error of 0.0357. As the result of a single run is not always representative, we investigate the long-run performance of this method by repeating the same procedure 30 times with different, randomly chosen cross-validation splits. The distribution of the selected C values is shown in Fig. 4 (the light bars). On average, the RMSE of the 25-step prediction is 0.0871.

Next, we use Algorithm II.1 with six competing SVRs to segment the time series. The parameters in the modified SVR

(8) are set following [17]. Then, we evaluate the weighting coefficients of each predictor at the last 25 points of the training data; the predictor with the largest coefficient for the majority of these 25 points is chosen. After doing so, we use the standard SVR (6) to train on this particular class of data. Here, we again set the tube width to 0 and select C by five-fold cross validation. The resulting model is used to predict the first 25 steps of the testing data. To study the long-run performance of the method, we repeat this procedure 30 times. The average RMSE of the predictions is 0.0656. The distribution of the selected C values is given in Fig. 4 (the dark bars). Of the 30 experiments, 15 outperform the result obtained by the winner. Due to the different initial weighting coefficients, the number of data points assigned to the chosen predictor after segmentation varies; for our 30 runs, this number ranges from 400 to 1100. The best prediction performances occur when this number is around 800 to 1000. Our finding is that the prediction result using segmentation is better than that obtained using the whole dataset directly.

As mentioned in [17], determining the SVR parameters is computationally intensive. They suggest determining them at the minimum of the one-step prediction error measured on a randomly chosen validation set. To obtain a more stable parameter selection, here we use five-fold cross validation to select C. Moreover, instead of setting the tube width to 0, we also tried positive values, but the results were not better. It would be desirable to compare our parameter settings with those in [17]; however, they did not report their final choice of parameters, so we cannot make the comparison.
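
A minimal sketch of the selection procedure described here, assuming a standard RBF-kernel SVR with the tube width fixed to 0 and RMSE as the validation score; the candidate grid for C, the kernel parameter, and the use of scikit-learn are illustrative choices rather than the authors' setup.

```python
# A hedged sketch of five-fold cross validation for selecting C (Section V-C style).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def select_C_by_cv(X, y, C_grid, gamma, epsilon=0.0, n_splits=5, seed=0):
    best_C, best_rmse = None, np.inf
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)  # random 5-fold split
    for C in C_grid:
        errs = []
        for train_idx, val_idx in cv.split(X):
            model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx])
            errs.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))   # fold RMSE
        rmse = float(np.mean(errs))
        if rmse < best_rmse:
            best_C, best_rmse = C, rmse
    return best_C, best_rmse
```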


VI. DISCUSSION

In this paper, we present a framework for the unsupervised segmentation of nonstationary time series using SVMs. The problems we consider here are the same as those in [14] and [21]. The method used in this paper is novel in two aspects. First, a new formulation of SVM and its implementation are proposed for the switching problems. Second, we use statistical reasoning to adjust the annealing parameter. The annealing property of the proposed method is explained intuitively in Section III. From the figures we obtained in Section V, our results are comparable to those in [21]. Moreover, as we adjust the annealing parameter adaptively, our algorithm requires far fewer iterations than theirs.

Several parameters in our algorithm must be chosen at the beginning. For the mapping function, we use the RBF kernel, which is the most popular choice. The selection of C, the width of the insensitive tube in (8), and the kernel parameter is usually done by trial and error. For the two chaotic time series examples in Section V, we can easily select suitable values; see the discussion in that section. However, for the prediction of Data Set D from the Santa Fe Time Series Competition, it takes quite a long time to tune the parameters.

For slowly changing dynamics, [21] is the first paper to propose using a memory parameter when updating the weighting coefficients. Intuitively, this method works better for smaller switching rates. The switching rates of the first two examples in Section V are set to 1/100 and 1/300, respectively, the same as those in [21]. For the chaotic time series example in Section V-A, an unreported experiment shows that our algorithm also works very well for a switching rate as high as 1/5.

As for the computational time, the major cost in each iteration of the proposed approach is in solving the different SVM problems. It is known that, for most SVM implementations, fewer support vectors (i.e., points whose dual variables are nonzero at the optimal solution of (8)) lead to less training time. We also consider free support vectors (i.e., dual variables strictly between their lower and upper bounds), because for some SVM implementations they are more closely related to the training time than the number of support vectors. The reason is that these implementations are iterative procedures, and in the final iterations mainly the free support vectors are still considered. More discussion can be found in [1]. Fig. 5 presents the number of support vectors and the number of free support vectors at each iteration for the chaotic time series example in Section V-A. Here, the data are generated without adding noise. To interpret the figure: in the first iteration, the average number of support vectors of the six SVMs is 133 and the average number of free support vectors is 11; in the second iteration, the average number of support vectors is 701 and that of free support vectors is 26; and so on.

Fig. 5. Number of support vectors (i.e., nonzero dual variables) and number of free support vectors (i.e., dual variables strictly between -Cp and 0 or between 0 and Cp).

Recall that there are 1200 training data points and six competing SVRs. In Section V-A, we randomly assign the initial weighting coefficients to be 0 or 1 under the constraint that they sum to one at each point. Therefore, at the first iteration, approximately 200 data points are assigned to each predictor, and the average number of support vectors of the six SVRs is less than 200. After updating the coefficients, every data point with a nonzero coefficient for a predictor contributes to that predictor, so the number of support vectors increases substantially at the second iteration. Then, as the underlying structure is gradually resolved, the number of support vectors decreases again. An important trick we used here is to set the very small coefficients to zero, which effectively decreases the number of support vectors.

As for the numbers of free support vectors, they are all small compared with the numbers of support vectors. At an optimal solution of an SVM, some bounded variables have small values (i.e., small corresponding coefficients), while some are large (i.e., corresponding coefficients close to 1). Our current implementation, LIBSVM, can quickly identify the variables with small bounds, so the training time of the first iteration is similar to that of the second iteration (less than two seconds on a Pentium III-500 machine). In other words, although the number of support vectors increases dramatically at the second iteration, many of the bounds are small, so LIBSVM quickly identifies the bounded variables. Then the decomposition method mainly works on the free variables, so the difference between the two iterations is not that large.
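
A small hedged sketch of the two quantities plotted in Fig. 5, reusing the dual coefficients and per-variable bounds from the QP sketch in Section IV: support vectors are the points with nonzero dual coefficients, and free support vectors are those whose coefficients lie strictly inside their box.

```python
# Counts the quantities shown in Fig. 5, assuming `coef` holds the dual coefficients
# (alpha - alpha_star) of one modified SVR and C * p their per-variable bounds.
import numpy as np

def count_support_vectors(coef, C, p, tol=1e-6):
    nonzero = np.abs(coef) > tol                     # support vectors
    free = nonzero & (np.abs(coef) < C * p - tol)    # strictly inside (-C*p_i, C*p_i), nonzero
    return int(nonzero.sum()), int(free.sum())
```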

ACKNOWLEDGMENT

The authors would like to thank J. Kohlmorgen for some helpful comments. The second author would like to thank D. Prokhorov and other organizers of the IJCNN Challenge 2001 for bringing him to this subject.

REFERENCES

[1] C.-C. Chang and C.-J. Lin. (2001) LIBSVM: A library for support vector machines [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[2] ——, "Training ν-support vector classifiers: Theory and algorithms," Neural Comput., vol. 13, no. 9, pp. 2119–2147, 2001.
[3] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Adv. Comput. Math., vol. 13, pp. 1–50, 2000.
[4] L. A. Feldkamp, T. M. Feldkamp, and D. V. Prokhorov, "An approach to adaptive classification," in Intelligent Signal Processing, S. Haykin and B. Kosko, Eds. Piscataway, NJ: IEEE Press, 2001.
[5] R. J. Hathaway and J. C. Bezdek, "Switching regression models and fuzzy clustering," IEEE Trans. Fuzzy Syst., vol. 1, pp. 195–204, June 1993.
[6] S. Haykin and D. Cong, "Classification of radar clutter using neural networks," IEEE Trans. Neural Networks, vol. 2, pp. 589–600, Nov. 1991.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Comput., vol. 3, no. 1, pp. 79–87, 1991.
[8] A. Kehagias and V. Petridis, "Time-series segmentation using predictive modular neural networks," Neural Comput., vol. 9, pp. 1691–1709, 1997.
[9] J. Kohlmorgen, K.-R. Müller, J. Rittweger, and K. Pawelzik, "Identification of nonstationary dynamics in physiological recordings," Biol. Cybern., vol. 83, pp. 73–84, 2000.
[10] W. Liebert, K. Pawelzik, and H. G. Schuster, "Optimal embeddings of chaotic attractors from topological considerations," Europhys. Lett., vol. 14, p. 521, 1991.
[11] S. Liehr, K. Pawelzik, J. Kohlmorgen, and K.-R. Müller, "Hidden Markov mixtures of experts with an application to EEG recordings from sleep," Theory Biosci., vol. 118, no. 3–4, pp. 246–260, 1999.
[12] M. C. Mackey and L. Glass, "Oscillation and chaos in physiological control systems," Science, vol. 197, pp. 287–289, 1977.
[13] J. Moody and C. J. Darken, "Fast learning in networks of locally tuned processing units," Neural Comput., vol. 1, pp. 281–293, 1989.
[14] K.-R. Müller, J. Kohlmorgen, and K. Pawelzik, "Analysis of switching dynamics with competing neural networks," IEICE Trans. Fundamentals Electron., Commun., Comput. Sci., vol. E78-A, no. 10, pp. 1306–1315, 1995.
[15] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, pp. 181–201, Mar. 2001.
[16] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, "Predicting time series with support vector machines," in Proc. Int. Conf. Artificial Neural Networks, W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., 1997, pp. 999–1004.
[17] ——, "Predicting time series with support vector machines," in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 243–254.
[18] R. Murray-Smith and T. A. Johansen, Eds., Multiple Model Approaches to Modeling and Control. London, U.K.: Taylor and Francis, 1997.
[19] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. CVPR'97, New York, 1997, pp. 130–136.
[20] K. Pawelzik, "Detecting coherence in neuronal data," in Physics of Neural Networks, L. Van Hemmen and K. Schulten, Eds. New York: Springer-Verlag, 1994.
[21] K. Pawelzik, J. Kohlmorgen, and K.-R. Müller, "Annealed competition of experts for a segmentation and classification of switching dynamics," Neural Comput., vol. 8, no. 2, pp. 340–356, 1996.
[22] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods—Support Vector Learning, 1998.
[23] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–285, Feb. 1989.
[24] K. Rose, E. Gurewitz, and G. Fox, "Statistical mechanics and phase transitions in clustering," Phys. Rev. Lett., vol. 65, no. 8, pp. 945–948, 1990.
[25] B. Schölkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," IEEE Trans. Signal Processing, vol. 45, pp. 2758–2765, Nov. 1997.
[26] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[27] A. S. Weigend and N. A. Gershenfeld, Eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Reading, MA: Addison-Wesley, 1994.

Ming-Wei Chang received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taiwan, in 2001 and 2003, respectively.

His research interests include machine learning, numerical optimization, and algorithm analysis.

Chih-Jen Lin (S'91–M'98) received the B.S. degree in mathematics from National Taiwan University, Taiwan, in 1993 and the M.S. and Ph.D. degrees from the Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, in 1998.

He is currently an Associate Professor in the Department of Computer Science and Information Engineering at National Taiwan University, Taiwan. His research interests include machine learning, numerical optimization, and applications of operations research.

Ruby Chiu-Hsing Weng received the B.S. and M.S. degrees in mathematics from National Taiwan University, Taiwan, in 1993 and 1995, respectively, and the Ph.D. degree in statistics from the University of Michigan, Ann Arbor, in 1999.

She is an Associate Professor with the Department of Statistics, National Chengchi University, Taipei, Taiwan. Her research interests include sequential analysis and time series analysis.
