HSIANG-CHUAN LIU1, SHIN-WU LIU2, PEI-CHUN CHANG1, WEN-CHUN HUANG3, CHIEN-HSIUNG LIAO1

1Department of Bioinformatics, Asia University, Taiwan

2National Institute of Allergy and Infectious Diseases, National Institutes of Health, USA

3Graduate Institute of Educational Measurement and Statistics, Taichung University, Taiwan

E-MAIL: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract:

Finding a good classifier for the hosts of influenza A viruses is an important step toward preventing pandemic flu. The hemagglutinin protein encoded in the virus genome is the major molecule that determines the host range. In this paper, we propose a novel classification algorithm for hemagglutinin proteins that integrates SVM and logistic regression and is based on four kinds of Hurst exponents computed for each protein sequence. This method has not been used before and is the first to integrate physicochemical properties, the fractal property, SVM and a logistic regression classifier. To evaluate the performance of the new algorithm, an experiment on real data is conducted using 5-fold cross-validation accuracy. The experimental results show that the new classification algorithm is useful and performs better than either SVM or logistic regression alone.

Keywords:

Influenza A viruses; Hurst exponent; SVM; Logistic regression; SVM-Logistic regression

1. Introduction

Influenza A viruses are negative-strand RNA viruses that infect a wide variety of animals in nature. Infection of humans may cause significant mortality and morbidity worldwide [1]. The hemagglutinin (HA) protein encoded in the virus genome is the major molecule that determines the host range. Viruses from the natural reservoir of influenza, such as avian flu, may emerge as strains infectious to humans through mutation of the HA protein and cause pandemic flu. Finding a good classification algorithm for HA proteins is therefore an important step toward preventing pandemic flu. In this paper, a novel classification algorithm for HA proteins combining Hurst exponents, SVM and logistic regression is proposed [2], [3], [4], [5]. This method has not been used before and is the first to integrate physicochemical properties, the fractal property, the support vector machine (SVM) and a logistic regression classifier.

The protein residues were coded according to their physicochemical quantities of acidity, van der Waals volume, surface area and hydrophobicity, each taken for the single amino acid [6], [7].

First, the HA sequence data of serotype H5 of influenza A viruses, in two classes, were downloaded from a public database, the Influenza Sequence Database (http://www.flu.lanl.gov). The sample included 90 HA protein sequences from human infections and 90 HA protein sequences from bird infections.

Second, each amino acid residue in the HA protein sequences was replaced with its 4 physicochemical quantities.

Third, the Hurst exponent of each resulting numeric sequence was computed, which gives four Hurst-exponent features for each HA protein sequence [2], [6], [7].

Last, two well-known classifiers, the support vector machine (SVM) and logistic regression (LR), and our new hybrid classifier combining SVM and LR were used to predict the class of each of the 180 HA proteins from the four Hurst-exponent features.

To evaluate the performance of these three classifiers, an experiment on the above HA protein data is conducted using 5-fold cross-validation accuracy.

This paper is organized as follows: the four physicochemical quantities of the 20 amino acids are introduced in section 2, the Hurst exponent in section 3, the support vector machine classifier in section 4, logistic regression in section 5, and the new hybrid classifier combining SVM and logistic regression in section 6. The experiment and results are described in section 7, and the final section presents conclusions and future works.

2. Four physicochemical properties of amino acids

There are four physicochemical quantities of acidity,

[3]

Table 1. The 20 amino acids and their 4 physicochemical quantities
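The coding step can be sketched as a lookup from residue letter to a numeric value. The values of Table 1 are not reproduced in this text, so the sketch below uses the Kyte-Doolittle hydropathy index as a stand-in for the hydrophobicity column. The scale, the name `encode_sequence` and the example peptide are assumptions for illustration and may differ from the authors' data.

```python
# Kyte-Doolittle hydropathy index, used here only as a stand-in scale.
# The paper's own Table 1 values may differ.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence, scale):
    """Replace each residue letter with its value on the given scale."""
    return [scale[residue] for residue in sequence.upper()]


# Hypothetical short peptide, not taken from the HA data set.
profile = encode_sequence("MKAIL", KYTE_DOOLITTLE)
print(profile)
```

Calling the same function with four different scales would give the four numeric sequences per protein that the later steps work on.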

3. Hurst exponent

The Hurst exponent occurs in several areas of applied mathematics, including fractals and chaos theory, long-term memory processes and spectral analysis [8]. Hurst exponent estimation has been applied in areas ranging from biophysics to computer networking. Estimation of the Hurst exponent was originally developed in hydrology. However, the modern techniques for estimating the Hurst exponent come from fractal mathematics.

Estimating the Hurst exponent for a data set provides a measure of whether the data is a pure random walk or has underlying trends.

The Hurst exponent (H) is a statistical measure used to classify time series. H=0.5 indicates a random series while H>0.5 indicates a trend reinforcing series. The larger the H value is, the stronger the trend. Experiments with backpropagation Neural Networks show that series with large Hurst exponent can be predicted more accurately than those with H value close to 0.50. Thus the Hurst exponent

of the Hurst exponent: the R/S method, the roughness–length (R–L) method and a variogram. The R/S method (Hurst et al., 1965) [9] is commonly perceived as the most suitable for the time series analysis, because it presents the relationship between irregular (singular) rescaled ranges, signal value and their local statistical properties relative to the scale factor.

In this study the R/S method is used. The R/S method [10] is based on empirical observations by Hurst, and its estimates of H are based on the R/S statistic. It indicates (asymptotically) second-order self-similarity. H is roughly estimated as the slope of the straight line in a log-log plot of the R/S statistic against the number of points of the aggregated series. That is, given a time sequence of observations $w_t$, define the series

a line whose slope determines the Hurst exponent.
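A minimal sketch of an R/S estimate follows. It assumes non-overlapping windows whose sizes double from `min_chunk`, and an ordinary least-squares fit through the log-log points. The window scheme, the function names and the synthetic series are assumptions, since the exact aggregation used in the paper is not given in this text.

```python
import math
import random


def rescaled_range(chunk):
    """Return R/S for one window, or None if the window has zero variance."""
    n = len(chunk)
    mean = sum(chunk) / n
    # Cumulative sums of deviations from the window mean.
    cumulative, running = [], 0.0
    for value in chunk:
        running += value - mean
        cumulative.append(running)
    r = max(cumulative) - min(cumulative)
    s = math.sqrt(sum((v - mean) ** 2 for v in chunk) / n)
    return r / s if s > 0 else None


def hurst_rs(series, min_chunk=8):
    """Estimate H as the slope of log(mean R/S) against log(window size)."""
    xs, ys = [], []
    size = min_chunk
    while size <= len(series) // 2:
        values = []
        for start in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2
    if len(xs) < 2:
        raise ValueError("series too short for an R/S fit")
    # Ordinary least-squares slope of ys on xs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic check: white noise versus its running sum (a random walk).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(1024)]
walk, total = [], 0.0
for value in noise:
    total += value
    walk.append(total)
print(hurst_rs(noise), hurst_rs(walk))
```

In the setting of this paper the input series would be one residue-coded HA sequence, giving one Hurst exponent per physicochemical quantity.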

4. Support vector machine (SVM) [11~14]

Given the training set of instance-label pairs $(x_i, y_i)$, $i = 1, 2, \ldots, N$, where

The support vector machine (SVM) algorithm (Boser, Guyon, and Vapnik 1992 [11]; Cortes and Vapnik 1995 [12]) requires

an assignment according to the following formula.

5. Multiple logistic regression classifier

5.1. Multiple logistic regression model [4], [5]

Let $(x_{i1}, x_{i2}, \ldots, x_{in}, y_i)$, $i = 1, 2, \ldots, N$, be sample data satisfying

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in R^n, \quad y_i \in \{0, 1\}, \quad Y_i \overset{\perp\!\!\!\perp}{\sim} B(1, p_i), \quad i = 1, 2, \ldots, N \tag{12}$$

The multiple logistic regression model is denoted as follows

5.2. Multiple logistic regression classifier [5]

We can obtain the likelihood function and the log-likelihood function as in equations (14) and (15).

Using the Newton-Raphson iterative algorithm, we can obtain the estimated regression coefficients of the multiple logistic regression model and the estimated multiple logistic regression equation as follows:

( )

Increment k; until 1 1
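The Newton-Raphson fit can be sketched as below, assuming the standard model P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 x1 + ... + bn xn))). The function names, the stopping rule on the step size and the toy data are assumptions for illustration. The sketch also assumes the classes overlap, because the maximum-likelihood estimate does not exist for perfectly separable data.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, max_iter=50, tol=1e-8):
    """Newton-Raphson estimate of [b0, b1, ..., bn]."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k                      # X'(y - p)
        hess = [[0.0] * k for _ in range(k)]  # X'WX with W = diag(p(1-p))
        for x, y in zip(design, labels):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            weight = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += weight * x[i] * x[j]
        step = solve_linear(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def predict_proba(beta, row):
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))


# Toy one-feature data with overlapping classes (not the HA data set).
rows = [[float(v)] for v in range(8)]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
beta = fit_logistic(rows, labels)
print(beta, predict_proba(beta, [7.0]))
```

Each iteration solves (X'WX) step = X'(y - p) and adds the step to the current coefficients, which is the Newton-Raphson update for the log-likelihood.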

6. SVM-Logistic regression classifier

In this paper, an improved hybrid classifier combining SVM and logistic regression is proposed.

First step: using the SVM classifier, we find the signed distance $d(x_i)$ between the point $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and the hyperplane of the SVM.

Second step: taking the sample data $(d(x_i), y_i)$, $i = 1, 2, \ldots, N$, we use simple logistic regression to classify $y_i$.
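The two steps can be sketched as follows. The sketch assumes a linear kernel and takes the hyperplane (w, b) as already trained. The values of `w` and `b`, the toy points and the gradient-ascent fit are hypothetical stand-ins, not the authors' implementation, and the plain geometric distance used here omits the margin and slack terms of the paper's own distance formula.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w.x + b = 0 (linear kernel assumed)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def fit_simple_logistic(distances, labels, rate=0.1, epochs=2000):
    """Fit P(y=1|d) = sigmoid(a0 + a1*d) by gradient ascent on the log-likelihood."""
    a0 = a1 = 0.0
    n = len(distances)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for d, y in zip(distances, labels):
            residual = y - sigmoid(a0 + a1 * d)
            g0 += residual
            g1 += residual * d
        a0 += rate * g0 / n
        a1 += rate * g1 / n
    return a0, a1


def hybrid_predict(w, b, a0, a1, x):
    """Step 1: signed distance. Step 2: logistic decision on that distance."""
    return 1 if sigmoid(a0 + a1 * signed_distance(w, b, x)) >= 0.5 else 0


# Hypothetical hyperplane standing in for a trained SVM.
w, b = [1.0, 1.0], -1.0
points = [(0.0, 0.0), (0.2, 0.3), (0.4, 0.5), (0.6, 0.5), (0.9, 0.8), (1.0, 1.0)]
labels = [0, 0, 1, 0, 1, 1]
distances = [signed_distance(w, b, p) for p in points]
a0, a1 = fit_simple_logistic(distances, labels)
print(a0, a1, hybrid_predict(w, b, a0, a1, (1.0, 1.0)))
```

The second stage has only two parameters, so it mainly recalibrates the threshold and scale of the SVM output rather than learning a new boundary in the feature space.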

6.1. Mathematical formulas

Let $(x_{i1}, x_{i2}, \ldots, x_{in}, y_i)$, $i = 1, 2, \ldots, N$, be sample data satisfying

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in R^n, \quad y_i \in \{0, 1\} \tag{26}$$

Using the above support vector machine (SVM) algorithm, from equation (11), for any point $x_i \in R^n$, we can obtain the signed distance as below

$$d(x_i) = \left[ w'\phi(x_i) + b - (1 - \xi_i) \right] \tag{27}$$

6.2. Simple logistic regression classifier of the working sample data

Let the working sample data $(d(x_i), y_i)$, $i = 1, 2, \ldots, N$, satisfy

$$d(x_i) \in R, \quad y_i \in \{1, 0\}, \quad Y_i \overset{\perp\!\!\!\perp}{\sim} B(1, p_i), \quad i = 1, 2, \ldots, N \tag{28}$$

The simple logistic regression model is denoted as follows

$$P(Y_i = 1 \mid d(x_i)) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1 d(x_i)\right)\right)}, \quad i = 1, 2, \ldots, N$$

Similarly to the multiple logistic regression classifier, we can obtain the log-likelihood function, the estimated regression coefficients of the simple logistic regression model and the estimated simple logistic regression equation as follows:

7. Experiment and result

The sequence data of serotype H5 of influenza A viruses, in two classes, were obtained from a public database, the Influenza Sequence Database (http://www.flu.lanl.gov). The sample included 90 HA protein sequences from human infections and 90 HA protein sequences from bird infections.

The protein residues were coded according to their physicochemical quantities of acidity, van der Waals volume, surface area and hydrophobicity, each taken for the single amino acid. Computing the Hurst exponent of each resulting numeric sequence, we obtain four Hurst-exponent features for each HA protein sequence.

The above real data, with four features given by Hurst exponents, are used to evaluate the performance of the support vector machine (SVM), logistic regression and the proposed classifier combining SVM and logistic regression, using 5-fold cross-validation to compute the classification accuracy.
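The evaluation protocol can be sketched as a generic k-fold accuracy helper that accepts any `fit` and `predict` pair. The shuffling seed, the midpoint-threshold classifier and the synthetic data are hypothetical stand-ins used only to make the sketch runnable. They are not the three classifiers or the HA data of the experiment.

```python
import random


def k_fold_accuracy(xs, ys, fit, predict, k=5, seed=0):
    """Mean held-out accuracy over k folds of a shuffled index list."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Hypothetical stand-in classifier: threshold at the midpoint of the class means.
def fit_midpoint(xs, ys):
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0


def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0


# Synthetic one-feature data with two well-separated classes.
xs = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
ys = [0] * 10 + [1] * 10
accuracy = k_fold_accuracy(xs, ys, fit_midpoint, predict_midpoint)
print(accuracy)
```

With 180 sequences, each of the five folds holds 36 held-out proteins. If the folds are of equal size, the mean of the fold accuracies equals the overall fraction of correct held-out predictions.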

The accuracies of the three classifiers are listed in Table 2. The new classification algorithm is useful and performs better than either SVM or logistic regression alone.

Table 2. Accuracies of three classifiers

Classifier    5-fold CV accuracy
SVM           0.8056
LR            0.8833
SVM-LR        0.9000

8. Conclusions and future works

Finding a good classifier for influenza viruses is an important step toward preventing pandemic flu. In this paper, a novel classification algorithm for HA proteins integrating SVM and logistic regression, based on four kinds of Hurst exponents for each protein sequence, is proposed. This method has not been used before and is the first to integrate physicochemical properties, the fractal property, SVM and a logistic regression classifier. To evaluate the performance of the new algorithm, an experiment on real data is conducted using 5-fold cross-validation accuracy.

The experimental results show that the new classification algorithm is useful and performs better than either SVM or logistic regression alone.

The proposed classifier can be used to classify not only influenza A virus data but also other biological sequence data.

In future work, we will look for further improved classification algorithms using the Hurst exponent and other hybrid classifiers.

Acknowledgements

This paper is partially supported by the National Science Council grant (NSC 96-2413-H-468-001).

[1] P. Palese, “Influenza: old and new threats”, Nat. Med., Vol. 10, pp. 82–87, 2004.

[2] H. E. Hurst, “Long term storage capacity of reservoirs”, Transactions of the American Society of Civil Engineers 116, pp. 770-799, 1951.

[3] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, Vol. 20, pp. 273-297, 1995.

[4] D. R. Cox, and E. J. Snell, The analysis of binary data (2nd ed.) London, Chapman & Hall, 1989.

[5] Hsiang-Chuan Liu, Yu-Du Jheng, Guey-Shya Chen, Bai-Cheng Jeng, “A new classification algorithm combining Choquet integral and logistic regression”, 2008 International Conference on Machine Learning and Cybernetics, 12-15 July 2008 Kunming, China (accepted).

[6] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka, “Evolution and ecology of influenza A viruses”, Microbiol. Rev., Vol. 56, pp. 152-179, 1992.

[7] N. J. Cox and K. Subbarao, “Global epidemiology of influenza: Past and present”, Annu. Rev. Med., Vol. 51, pp. 407-421, 2000.

[8] T. Di Matteo, T. Aste and M. M. Dacorogna, “Long-term memories of developed and emerging markets: using the scaling analysis to characterize their stage of development”, Journal of Banking & Finance 29/4, pp. 827-851, 2005.

[9] H. E. Hurst, R. Black, Y. M. Sinaika, “Long term storage capacity of reservoirs”, An experimental study, Constable, London, 1965.

[10] Roger Kalden, Sami Ibrahim, “Searching for Self-Similarity in GPRS”, PAM, pp. 83-92, 2004.

[11] B.E. Boser, I.M. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers”, In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM.

[12] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, Vol. 20, pp. 273-297, 1995.

[13] V. Vapnik, The Nature of Statistical Learning Theory. New York, NY. Springer-Verlag, 1995.

[14] C.-C. Chang, and C.-J. Lin, LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm, 2004.

Hsiang-Chuan Liu, Yu-Chieh Tu, Wen-Chun Huang, Chin-Chun Chen2,3

When multicollinearity occurs among the independent variables of a multiple regression model, the model performs poorly. Replacing it with the ridge regression model is the traditional remedy. In our previous work, we found that the Choquet integral regression model with the λ-measure based on our new support, the γ-support, outperformed the earlier models. In this study, seeking a further improvement, we replaced two well-known fuzzy measures, the P-measure and the λ-measure, with our new fuzzy measure, the R-measure, in the Choquet integral regression model with the γ-support.

To compare the Choquet integral regression models with the P-measure, λ-measure and R-measure, each based on two different fuzzy supports, the V-support and the γ-support, against the traditional multiple regression model and the ridge regression model, an experiment on real data is conducted using the 5-fold cross-validation mean square error (MSE). The experimental results show that the Choquet integral regression model with the R-measure based on the γ-support has the best performance.

1. Introduction 

When interactions among independent variables exist in forecasting problems, multiple linear regression models perform poorly. The traditional remedy is the ridge regression model [1]. Recently, our previous works used Choquet integral regression models based on different fuzzy measures to further improve this situation [2], [3], [4], [5].

In our previous work [6], we found that Choquet integral regression models based on the same fuzzy measure but derived from different fuzzy supports may perform differently. In other words, a better-performing Choquet integral regression model depends not only on a better fuzzy measure but, first of all, on a better fuzzy support. Hence, before looking for a better fuzzy measure for a Choquet integral regression model, we first need to find a better fuzzy support for that measure. We found that the Choquet integral regression model with the λ-measure based on our new support, the γ-support, outperformed the earlier models.

In this study, Choquet integral regression models with two well-known fuzzy measures, the P-measure and the λ-measure, and our new fuzzy measure, the R-measure, each based on the V-support and the γ-support, are considered. To compare the performance of these Choquet integral regression models with the multiple regression model and the ridge regression model, an experiment on real data is conducted using the 5-fold cross-validation mean square error (MSE).

This paper is organized as follows: multiple linear regression and ridge regression are introduced in section 2; two well-known fuzzy measures, the P-measure and the λ-measure, are introduced in section 3; the R-measure is introduced in section 4; two kinds of fuzzy supports, the V-support and the γ-support, are described in section 5; the Choquet integral regression model based on fuzzy measures is described in section 6; the experiment and results are described in section 7; and the final section presents conclusions and future works.

2. The multiple linear regression, ridge regression [1]

A fuzzy measure μ on a finite set X is a set function

A singleton measure of a fuzzy measure μ on a finite set X is a function $s: X \to [0, 1]$ satisfying:

$$s(x) = \mu(\{x\}), \quad x \in X \tag{4}$$

$s(x)$ is called the density of singleton $x$.

3.3. P-measure [10]

obtain the values of λ uniquely by using the previous polynomial equation. In other words, the λ-measure has a unique solution without closed form.
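Because λ has no closed form, it is usually found numerically. The sketch below assumes the standard Sugeno condition, prod(1 + λ s_i) = 1 + λ with λ > -1, and solves it by bisection. The function names and example densities are illustrative. The sketch also assumes at least two densities, each strictly between 0 and 1, whose sum is not extremely close to 1.

```python
def solve_lambda(densities):
    """Solve prod(1 + lam*s_i) = 1 + lam for the nonzero root lam > -1."""
    if len(densities) < 2:
        raise ValueError("need at least two densities")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities are already additive

    def f(lam):
        product = 1.0
        for s in densities:
            product *= 1.0 + lam * s
        return product - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -1e-9   # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0     # root is positive; widen until bracketed
        while f(hi) <= 0.0:
            hi *= 2.0
    lo_positive = f(lo) > 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if (f(mid) > 0.0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lambda_measure(densities, subset, lam):
    """Measure of a subset (iterable of indices) under the lambda-measure."""
    if lam == 0.0:
        return sum(densities[i] for i in subset)
    product = 1.0
    for i in subset:
        product *= 1.0 + lam * densities[i]
    return (product - 1.0) / lam


# Illustrative densities for three singletons.
densities = [0.2, 0.3, 0.1]
lam = solve_lambda(densities)
print(lam, lambda_measure(densities, [0, 1], lam))
```

For two singletons the root has the closed form (1 - s1 - s2) / (s1 s2), which gives a quick check on the bisection. In general the measure of the whole set comes out as 1 by construction.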

4. R-measure [4]

For a given singleton measure s, an R-measure, g, is a fuzzy measure on a finite set X, $|X| = n$, satisfying:

(i) $R \in [0, \infty)$ (8)

(i) The R-measure has infinitely many solutions with closed form.

(ii) When R = 0, the R-measure is just a P-measure with closed form.

(iii) g is an increasing function of R.

5. Fuzzy supports 

For given singleton measures s of a fuzzy measure μ on a finite set X, if $\sum_{x \in X} s(x) =$ , then s is called a fuzzy support measure of μ, or a fuzzy support of μ, or a support of μ. Two kinds of fuzzy supports are introduced

scores of subject i for singleton x, satisfying:

6.2. Choquet integral regression models [2], [3], [4], [5], [6]

A real data set with 59 samples from a junior high school in Taiwan, listed in Table 2, includes the independent variables, the examination scores of four courses, and the dependent variable, the score on the Basic Competence Test of junior high school. It is used to evaluate the performance of three Choquet integral regression models, with the P-measure, λ-measure and R-measure, each based on the V-support and the γ-support, as well as a ridge regression model and a multiple linear regression model, using 5-fold cross-validation to compute the mean square error (MSE) of the dependent variable. The formula for MSE is

For any fuzzy measure μ, once the fuzzy support of μ is given, all the event measures of μ can be found, and then the Choquet integral based on μ and the Choquet integral regression equation based on μ can also be found.

The singleton measures, V-support and γ-support, of the P-measure, λ-measure and R-measure can be obtained by using formulas (12) and (16), respectively.
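Once the measure of every event is available, the discrete Choquet integral can be computed by sorting the scores. The sketch below assumes the usual definition: the sum over the sorted scores of each increment times the measure of the set of items scoring at least that much. The two example measures, one additive and one maximum-based, are illustrative set functions only and are not the paper's P-, λ- or R-measure definitions.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative values w.r.t. a set function.

    `measure` maps a frozenset of indices to a number and should give 1
    for the full index set.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, previous = 0.0, 0.0
    for position, index in enumerate(order):
        upper_set = frozenset(order[position:])  # items scoring >= current value
        total += (values[index] - previous) * measure(upper_set)
        previous = values[index]
    return total


def additive_measure(weights):
    """Additive set function: the measure of a set is the sum of its weights."""
    return lambda subset: sum(weights[i] for i in subset)


def max_measure(densities):
    """Maximum-based set function, used here as a simple non-additive example."""
    return lambda subset: max(densities[i] for i in subset)


# Course scores C1..C4 of subject 1 in the data table.
scores = [77, 75, 79, 83]
equal = additive_measure([0.25, 0.25, 0.25, 0.25])
print(choquet_integral(scores, equal))
print(choquet_integral(scores, max_measure([1.0, 0.5, 0.5, 0.5])))
```

With an additive measure the integral reduces to an ordinary weighted mean, which is why a non-additive measure is needed to capture interaction among the courses.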

8. Conclusions and future works 

When the sub-tests of a composite test interact, the traditional additive scale method performs poorly. Non-additive fuzzy measures and the fuzzy integral can be applied to improve this situation. In this study, a real data set from a junior high school, including the independent variables, the test scores of four courses with interaction, and the dependent variable, the junior high school graduates' scores on the Basic Competence Test (BCT), is used to evaluate the performance of the Choquet integral regression model with three fuzzy measures, the P-measure, λ-measure and R-measure, each based on two different supports, the V-support and the γ-support, together with the traditional multiple linear regression model and the ridge regression model. The experimental results show the following:

(i) The Choquet integral regression model with the R-measure based on the γ-support has the best performance.

(ii) Based on the same fuzzy support, whether the γ-support or the V-support, the Choquet integral regression model with the R-measure is better than those with the λ-measure and the P-measure.

(iii) For the Choquet integral regression model with the same measure, whether the P-measure, λ-measure or R-measure, the performance derived from the γ-support is better than that derived from the V-support.

(iv) The Choquet integral regression model with the λ-measure and the R-measure based on the V-support and γ-

regression model with the better measure based on the best fuzzy support, γ-support, to develop multiple classifier system.

and Yu-Du Jheng, “A new weighting method for detecting outliers in IPA based on Choquet integral”, IEEE International Conference on Industrial Engineering and Engineering Management 2007, December 2-5, 2007, Singapore.

[3] Hsiang-Chuan Liu, “The Choquet integral regression model based on r-complete measure”, Journal of educational research and development, Vol. 2, No. 4, pp. 87-107, 2006 (in Chinese).

[4] Hsiang-Chuan Liu, Wen-Chih Lin, and Wei-Sheng Weng, “A Choquet Integral Regression Model Based on a New Fuzzy Measure”, The 12th International Conference on Fuzzy Theory & Technology, July 19-24, 2007, Salt Lake City, Utah, U.

[5] Hsiang-Chuan Liu, Wen-Chih Lin, Kei-Yi Chang, and Wei-Sheng Weng, “A Nonlinear Regression Model Based on Choquet Integral with ε-Measure”, 2007 WSEAS International Conferences, Venice, Italy, November 21-24, 2007.

[6] Hsiang-Chuan Liu, Yu-Chieh Tu, Chin-Chun Chen, and Wei-Sheng Weng (2008), “The Choquet integral with respect to λ-measure based on γ-support”, 2008 International Conferences on Machine Learning and Cybernetics, Kunming, China, July 12-15, 2008 (Accepted).

[7] G. Choquet, “Theory of capacities”, Annales de l’Institut Fourier, Vol. 5, pp. 131-295, 1953.

[8] M. Sugeno, “Theory of fuzzy integrals and its applications”, unpublished doctoral dissertation, Tokyo Institute of Technology, Tokyo, Japan, 1974.

No.  C1  C2  C3  C4  BCT  No.  C1  C2  C3  C4  BCT 

1  77  75  79  83  31  31  74  70  80  75  35 

2  71  72  78  75  26  32  56  61  75  68  22 

3  78  86  86  86  33  33  62  68  72  74  29 

4  58  64  68  66  32  34  86  80  82  81  35 

5  48  59  65  68  16  35  63  78  88  83  31 

6  68  74  77  80  28  36  56  66  76  71  21 

7  62  72  84  78  47  37  77  74  80  76  42 

8  51  53  65  59  9  38  73  78  84  81  24 

9  62  64  76  70  36  39  63  60  68  69  17 

10  63  70  81  75  41  40  53  68  80  74  31 

11  66  68  75  74  25  41  74  86  87  88  44 

12  66  72  80  76  23  42  78  83  81  85  50 

13  75  75  85  80  39  43  47  58  66  62  15 

14  74  63  69  75  12  44  51  60  63  64  18 

15  68  78  85  75  27  45  60  65  75  70  23 

16  71  74  80  77  26  46  68  68  80  74  26 

17  49  60  69  64  13  47  52  60  70  65  20 

18  73  78  84  81  39  48  57  65  75  70  24 

19  68  70  74  76  40  49  70  66  70  74  13 

20  54  56  62  68  7  50  53  68  74  80  30 

21  53  68  74  71  11  51  68  68  78  76  35 

22  56  63  69  75  21  52  57  60  68  64  23 

23  70  80  78  70  31  53  61  62  70  70  25 

24  51  74  82  75  49  54  59  70  80  76  37 

25  61  66  72  78  33  55  59  62  70  78  29 

26  67  70  80  75  35  56  52  64  76  70  27 

27  59  75  80  82  27  57  68  70  80  75  33 

28  53  56  70  63  22  58  71  76  74  78  38 

29  56  56  65  61  6  59  72  66  78  72  19 

30  52  57  67  62  15

Fuzzy c-Mean Algorithm Based on Complete Mahalanobis Distances and