LOGISTIC REGRESSION - National Science Council, Executive Yuan: Research Project Final Report

HSIANG-CHUAN LIU¹, SHIN-WU LIU², PEI-CHUN CHANG¹, WEN-CHUN HUANG³, CHIEN-HSIUNG LIAO¹

¹Department of Bioinformatics, Asia University, Taiwan

²National Institute of Allergy and Infectious Diseases, National Institutes of Health, USA

³Graduate Institute of Educational Measurement and Statistics, Taichung University, Taiwan

E-MAIL: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract:

Finding a good classifier for the hosts of influenza A viruses is an important step in preventing pandemic flu. The hemagglutinin protein encoded in the virus genome is the major molecule that determines the host range. In this paper, a novel classification algorithm for hemagglutinin proteins is proposed that integrates SVM and logistic regression, based on four kinds of Hurst exponents computed for each protein sequence. This method is the first to integrate physicochemical properties, the fractal property, SVM and a logistic regression classifier. To evaluate the performance of the new algorithm, an experiment on real data is conducted using 5-fold cross-validation accuracy. The experimental results show that the new classification algorithm is useful and performs better than either SVM or logistic regression alone.

Keywords:

Influenza A viruses; Hurst exponent; SVM; Logistic regression; SVM-Logistic regression

1. Introduction

Influenza A viruses are negative-strand RNA viruses that infect a wide variety of animals in nature. Human infection may cause significant mortality and morbidity worldwide [1]. The hemagglutinin (HA) protein encoded in the virus genome is the major molecule that determines the host range. Viruses from the natural reservoir of influenza, such as avian flu, may give rise to strains infectious to humans through mutation of the HA protein and bring about pandemic flu. Finding a good classification algorithm for HA proteins is therefore an important step in preventing pandemic flu. In this paper, a novel classification algorithm for HA proteins combining Hurst exponents, SVM and logistic regression is proposed [2], [3], [4], [5]. This method is the first to integrate physicochemical properties, the fractal property, the support vector machine (SVM) and a logistic regression classifier.

The protein residues were coded according to their physicochemical quantities of acidity, van der Waals volume, surface area and hydrophobicity for a single amino acid [6], [7].

In the first step, the HA sequence data of serotype H5 of influenza A viruses, in two classes, were downloaded from a public database, the Influenza Sequence Database (http://www.flu.lanl.gov). The sample included 90 HA protein sequences from human infections and 90 HA protein sequences from bird infections.

In the second step, each amino acid residue in the HA protein sequences was replaced with its 4 physicochemical quantities, turning each symbolic sequence into numeric sequences.

In the third step, the Hurst exponent of each numeric sequence was computed, giving four Hurst-exponent features for each HA protein sequence [2], [6], [7].

In the last step, two well-known classifiers, the support vector machine (SVM) and logistic regression (LR), and our new hybrid classifier combining SVM and LR were used to classify the 180 HA proteins from the four Hurst-exponent features.

To evaluate the performance of the three classifiers, an experiment on the above HA protein data was conducted using 5-fold cross-validation accuracy.
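The second step above, replacing residues with numbers, can be sketched as follows. This is only an illustration: the values of Table 1 are not reproduced here, so the widely used Kyte-Doolittle hydropathy scale stands in for the hydrophobicity quantity, and the short sequence is a hypothetical fragment rather than one of the 180 proteins.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in for the
# hydrophobicity column of Table 1 (an assumption, not the paper's values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence, scale):
    """Replace each residue letter with its numeric value from `scale`.

    Residues missing from the scale (for example 'X' for unknown) are skipped.
    """
    return [scale[residue] for residue in sequence.upper() if residue in scale]


# Hypothetical 10-residue fragment, for illustration only.
profile = encode_sequence("MEKIVLLLAI", KYTE_DOOLITTLE)
print(profile)
```

Repeating the call with one scale per physicochemical quantity would give the four numeric sequences from which the four Hurst exponents are computed.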

This paper is organized as follows: the four physicochemical quantities of the 20 amino acids are introduced in section 2, the Hurst exponent in section 3, the support vector machine classifier in section 4, logistic regression in section 5, and the new hybrid classifier combining SVM and logistic regression in section 6. The experiment and results are described in section 7, and the final section gives conclusions and future work.

2. Four physicochemical properties of amino acids

Each amino acid is described by four physicochemical quantities: acidity, van der Waals volume, surface area and hydrophobicity [3].

Table 1. The 20 amino acids and their 4 physicochemical quantities

3. Hurst exponent

The Hurst exponent occurs in several areas of applied mathematics, including fractals and chaos theory, long-memory processes and spectral analysis [8]. Hurst exponent estimation has been applied in areas ranging from biophysics to computer networking. Estimation of the Hurst exponent was originally developed in hydrology; the modern techniques for estimating it, however, come from fractal mathematics.

Estimating the Hurst exponent for a data set provides a measure of whether the data is a pure random walk or has underlying trends.

The Hurst exponent (H) is a statistical measure used to classify time series. H = 0.5 indicates a random series, while H > 0.5 indicates a trend-reinforcing series. The larger the H value, the stronger the trend. Experiments with backpropagation neural networks show that series with a large Hurst exponent can be predicted more accurately than those with an H value close to 0.5. Thus the Hurst exponent provides a measure of predictability.

Three methods are considered for the estimation of the Hurst exponent: the R/S method, the roughness-length (R-L) method and the variogram method. The R/S method (Hurst et al., 1965) [9] is commonly perceived as the most suitable for time series analysis, because it presents the relationship between irregular (singular) rescaled ranges, signal values and their local statistical properties relative to the scale factor.

In this study the R/S method is used. The R/S method [10] is based on empirical observations by Hurst and estimates H from the R/S statistic. It indicates (asymptotically) second-order self-similarity. H is roughly estimated through the slope of the straight line in a log-log plot of the R/S statistic against the number of points of the aggregated series. That is, given a time sequence of observations, the R/S statistics computed over increasing numbers of points define a line whose slope determines the Hurst exponent.
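The log-log slope described above can be sketched in a few lines. This is a minimal illustration of one common R/S variant, not necessarily the exact procedure used in this study: the non-overlapping windows, the doubling window sizes and the `min_window` parameter are assumptions.

```python
import math
import random


def rescaled_range(window):
    """R/S statistic of one window: range of cumulative deviations over std."""
    n = len(window)
    mean = sum(window) / n
    cumulative, low, high = 0.0, 0.0, 0.0
    for value in window:
        cumulative += value - mean
        low = min(low, cumulative)
        high = max(high, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    return (high - low) / std if std > 0 else None


def hurst_rs(series, min_window=8):
    """Estimate H as the least-squares slope of log(R/S) against log(n)."""
    xs, ys = [], []
    size = min_window
    while size <= len(series) // 2:
        values = []
        # Average R/S over non-overlapping windows of the current size.
        for start in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs is not None and rs > 0:
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope /= sum((x - x_mean) ** 2 for x in xs)
    return slope


# Synthetic check: white noise versus its running sum (a random walk).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(1024)]
walk, level = [], 0.0
for step in noise:
    level += step
    walk.append(level)
h_noise = hurst_rs(noise)
h_walk = hurst_rs(walk)
print(round(h_noise, 2), round(h_walk, 2))
```

The strongly trending walk should give a clearly larger estimate than the noise it was built from, which is the behaviour the classifier features rely on.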

4. Support vector machine (SVM) [11~14]

Given the training set of instance-label pairs $(x_i, y_i)$, $i = 1, 2, \ldots, N$, the support vector machine (SVM) algorithm (Boser, Guyon, and Vapnik 1992 [11], Cortes and Vapnik 1995 [12]) requires the solution of an optimization problem, after which each point receives an assignment according to a decision formula.
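For reference, the standard soft-margin SVM of Cortes and Vapnik [12] is usually written as below. The symbols $w$, $b$, $\xi_i$ and $\phi$ are the ones that appear in the signed-distance formula of section 6; the penalty parameter $C$ and the label coding $y_i \in \{1, -1\}$ are assumptions taken from the standard formulation and may differ from the notation used in this paper.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} w'w + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( w'\phi(x_i) + b \right) \ge 1 - \xi_i, \quad
\xi_i \ge 0, \quad i = 1, 2, \ldots, N

\hat{y}(x) = \operatorname{sgn}\left( w'\phi(x) + b \right)
```

The first line is the training problem and the second is the usual assignment rule for a new point $x$.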

5. Multiple logistic regression classifier

5.1. Multiple logistic regression model [4], [5]

Let $(x_{i1}, x_{i2}, \ldots, x_{in}, y_i)$, $i = 1, 2, \ldots, N$ be a sample data set satisfying

$x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in R^n, \quad y_i \in \{0, 1\}, \quad Y_i \overset{\perp\!\!\!\perp}{\sim} B(1, p_i), \quad i = 1, 2, \ldots, N$ (12)

The multiple logistic regression model is denoted as follows

5.2. Multiple logistic regression classifier [5]

We can obtain the likelihood function and the log-likelihood function as equations (14) and (15).

Using the Newton-Raphson iterative algorithm, we can obtain the estimated regression coefficients of the multiple logistic regression model and the estimated multiple logistic regression equation. The iteration counter k is incremented until successive estimates converge.
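As a point of reference, the textbook form of the model, its likelihood and the Newton-Raphson update can be written in the notation of (12) as below. The coefficient symbols $\beta_0, \ldots, \beta_n$, the design matrix $X$ (with a leading column of ones) and the weight matrix $W$ are assumed notation; the numbering of the paper's own equations is not reproduced.

```latex
p_i = P(Y_i = 1 \mid x_i)
    = \frac{1}{1 + \exp\left( -\left( \beta_0 + \sum_{j=1}^{n} \beta_j x_{ij} \right) \right)},
    \quad i = 1, 2, \ldots, N

L(\beta) = \prod_{i=1}^{N} p_i^{\,y_i} (1 - p_i)^{1 - y_i}

\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

\beta^{(k+1)} = \beta^{(k)} + \left( X' W^{(k)} X \right)^{-1} X' \left( y - p^{(k)} \right),
\quad W^{(k)} = \operatorname{diag}\left( p_i^{(k)} \left( 1 - p_i^{(k)} \right) \right)
```

The update is repeated, incrementing $k$, until the change in $\beta$ falls below a chosen tolerance.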

6. SVM-Logistic regression classifier

In this paper, an improved hybrid classifier combining SVM and logistic regression is proposed.

In the first step, using the SVM classifier, we find the signed distance $d(x_i)$ between the point $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and the hyperplane in the SVM.

In the second step, we take the sample data $(d(x_i), y_i)$, $i = 1, 2, \ldots, N$, and use simple logistic regression to classify $y_i$.

6.1. Mathematical formulas

Let $(x_{i1}, x_{i2}, \ldots, x_{in}, y_i)$, $i = 1, 2, \ldots, N$ be a sample data set satisfying

$x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in R^n, \quad y_i \in \{0, 1\}$ (26)

Using the above support vector machine (SVM) algorithm, from equation (11), for any point $x_i \in R^n$ we can obtain the signed distance as below

$d(x_i) = \left[ w'\phi(x_i) + b - (1 - \xi_i) \right]$ (27)

6.2. Simple logistic regression classifier of the working sample data

Let the working sample data $(d(x_i), y_i)$, $i = 1, 2, \ldots, N$ satisfy

$d(x_i) \in R, \quad y_i \in \{1, 0\}, \quad Y_i \overset{\perp\!\!\!\perp}{\sim} B(1, p_i), \quad i = 1, 2, \ldots, N$ (28)

The simple logistic regression model is denoted as follows

$P\left( Y_i = 1 \mid d(x_i) \right) = \dfrac{1}{1 + \exp\left( -\left( \beta_0 + \beta_1 d(x_i) \right) \right)}, \quad i = 1, 2, \ldots, N$

Similarly to the multiple logistic regression classifier, we can obtain the log-likelihood function, the estimated regression coefficients of the simple logistic regression model and the estimated simple logistic regression equation.
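The second stage of the hybrid classifier can be sketched as follows, assuming the signed distances have already been produced by a trained SVM. The distances and labels below are hypothetical values chosen for illustration, and the function names and the fixed iteration count are assumptions rather than part of the method as published.

```python
import math


def fit_simple_logistic(d_values, labels, iterations=25):
    """Fit P(y=1|d) = 1 / (1 + exp(-(b0 + b1*d))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d, y in zip(d_values, labels):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * d)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * d
            h00 += w             # entries of the 2x2 matrix X'WX
            h01 += w * d
            h11 += w * d * d
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break                # nearly singular: stop updating
        # Solve the 2x2 system (X'WX) * step = gradient in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def classify(d, b0, b1):
    """Assign class 1 when the estimated probability is at least 0.5."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * d)))
    return 1 if p >= 0.5 else 0


# Hypothetical signed distances d(x_i) and class labels y_i (not real data).
distances = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
classes = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_simple_logistic(distances, classes)
predicted = [classify(d, b0, b1) for d in distances]
accuracy = sum(p == y for p, y in zip(predicted, classes)) / len(classes)
print(b0, b1, accuracy)
```

Because the logistic stage sees only one feature, it can shift the decision threshold away from the SVM's own zero point and attach a probability to each assignment.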

7. Experiment and result

The sequence data of serotype H5 of influenza A viruses, in two classes, were obtained from a public database, the Influenza Sequence Database (http://www.flu.lanl.gov). The sample included 90 HA protein sequences from human infections and 90 HA protein sequences from bird infections.

The protein residues were coded according to their physicochemical quantities of acidity, van der Waals volume, surface area and hydrophobicity for a single amino acid. By computing the Hurst exponents of the resulting numeric sequences of the HA proteins, we obtain four features, represented as Hurst exponents, for each HA protein sequence.

These real data, with four Hurst-exponent features, are used to evaluate the performance of the support vector machine (SVM) algorithm, logistic regression and the proposed classifier combining SVM and logistic regression, using the 5-fold cross-validation method to compute the accuracy on the response category variable.
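The 5-fold procedure can be sketched as below. The shuffling seed, the way folds are cut and the toy threshold classifier are assumptions made for illustration; any of the three classifiers could be plugged in through the `fit` and `predict` arguments.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle 0..n_samples-1 and split into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(features, labels, fit, predict, k=5):
    """Average held-out accuracy of a classifier over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def fit_threshold(xs, ys):
    """Toy one-feature classifier: midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Well-separated toy data, not the HA protein features.
features = [i / 10 for i in range(10)] + [5 + i / 10 for i in range(10)]
labels = [0] * 10 + [1] * 10
accuracy = cross_validated_accuracy(features, labels, fit_threshold, predict_threshold)
print(accuracy)
```

With 180 samples and k = 5, each fold holds 36 proteins, so every reported accuracy is an average over five held-out sets of that size.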

The experimental accuracies of the three classifiers are listed in Table 2. The new classification algorithm is useful and performs better than either SVM or logistic regression alone.

Table 2. Accuracies of three classifiers

Classifier    5-fold CV accuracy
SVM           0.8056
LR            0.8833
SVM-LR        0.9000

8. Conclusions and future works

Finding a good classifier for influenza viruses is an important step in preventing pandemic flu. In this paper, a novel classification algorithm for HA proteins is proposed that integrates SVM and logistic regression, based on four kinds of Hurst exponents computed for each protein sequence. This method is the first to integrate physicochemical properties, the fractal property, SVM and a logistic regression classifier. To evaluate the performance of the new algorithm, an experiment on real data was conducted using 5-fold cross-validation accuracy.

The experimental results show that the new classification algorithm is useful and performs better than either SVM or logistic regression alone.

The proposed classifier can be used to classify not only influenza A virus data but also other biological sequence data.

In the future, we will look for further improved classification algorithms using the Hurst exponent and other hybrid classifiers.

Acknowledgements

This paper is partially supported by the National Science Council grant (NSC 96-2413--H-468-001).

References

[1] P. Palese, “Influenza: old and new threats”, Nat. Med., Vol. 10, pp. 82-87, 2004.

[2] H. E. Hurst, “Long term storage capacity of reservoirs”, Transactions of the American Society of Civil Engineers, Vol. 116, pp. 770-799, 1951.

[3] C. Cortes and V. Vapnik, “Support-vector network”, Machine Learning, Vol. 20, pp. 273-297, 1995.

[4] D. R. Cox and E. J. Snell, The analysis of binary data (2nd ed.), London, Chapman & Hall, 1989.

[5] Hsiang-Chuan Liu, Yu-Du Jheng, Guey-Shya Chen, and Bai-Cheng Jeng, “A new classification algorithm combining Choquet integral and logistic regression”, 2008 International Conference on Machine Learning and Cybernetics, 12-15 July 2008, Kunming, China (accepted).

[6] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka, “Evolution and ecology of influenza A viruses”, Microbiol. Rev., Vol. 56, pp. 152-179, 1992.

[7] N. J. Cox and K. Subbarao, “Global epidemiology of influenza: Past and present”, Annu. Rev. Med., Vol. 51, pp. 407-421, 2000.

[8] T. Di Matteo, T. Aste, and M. M. Dacorogna, “Long-term memories of developed and emerging markets: using the scaling analysis to characterize their stage of development”, Journal of Banking & Finance, Vol. 29, No. 4, pp. 827-851, 2005.

[9] H. E. Hurst, R. Black, and Y. M. Simaika, “Long term storage capacity of reservoirs”, An experimental study, Constable, London, 1965.

[10] Roger Kalden and Sami Ibrahim, “Searching for Self-Similarity in GPRS”, PAM, pp. 83-92, 2004.

[11] B. E. Boser, I. M. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers”, in Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992.

[12] C. Cortes and V. Vapnik, “Support-vector network”, Machine Learning, Vol. 20, pp. 273-297, 1995.

[13] V. Vapnik, The Nature of Statistical Learning Theory, New York, NY, Springer-Verlag, 1995.

[14] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm, 2004.

Hsiang-Chuan Liu¹, Yu-Chieh Tu², Wen-Chun Huang², Chin-Chun Chen²,³

When multicollinearity occurs among the independent variables of a multiple regression model, its performance is poor. Replacing the model with the ridge regression model is the traditional remedy. In our previous work, we found that the Choquet integral regression model with λ-measure based on our new support, γ-support, performed better than the earlier models. In this study, to find a further improved model, we replaced the two well-known fuzzy measures, P-measure and λ-measure, with our new fuzzy measure, R-measure, in the Choquet integral regression model with γ-support.

To compare the Choquet integral regression models with P-measure, λ-measure and R-measure based on two different fuzzy supports, V-support and γ-support, with the traditional multiple regression model and the ridge regression model, an experiment on real data is conducted using the 5-fold cross-validation mean square error (MSE). The experimental results show that the Choquet integral regression model with R-measure based on γ-support has the best performance.

1. Introduction

When interactions among independent variables exist in forecasting problems, the performance of multiple linear regression models is poor. The traditional remedy is the ridge regression model [1]. Recently, Choquet integral regression models based on different fuzzy measures were used in our previous works to further improve this situation [2], [3], [4], [5].

In our previous work [6], we found that a Choquet integral regression model based on the same fuzzy measure may perform differently when derived from different fuzzy supports. In other words, the better performance of a Choquet integral regression model comes not only from a better fuzzy measure but, first of all, from a better fuzzy support. Hence, before looking for a better fuzzy measure for a Choquet integral regression model, we first need to find a better fuzzy support for that measure. We also found that the Choquet integral regression model with λ-measure based on our new support, γ-support, performed better than the earlier models.

In this study, the Choquet integral regression model with two well-known fuzzy measures, P-measure and λ-measure, and our new fuzzy measure, R-measure, based on V-support and γ-support, respectively, is considered. To compare the performance of these Choquet integral regression models with the multiple regression model and the ridge regression model, an experiment on real data is conducted using the 5-fold cross-validation mean square error (MSE).

This paper is organized as follows: multiple linear regression and ridge regression are introduced in section 2; two well-known fuzzy measures, P-measure and λ-measure, in section 3; R-measures in section 4; and two kinds of fuzzy supports, V-support and γ-support, in section 5. The Choquet integral regression models based on fuzzy measures are described in section 6. The experiment and results are described in section 7, and the final section gives conclusions and future work.

2. The multiple linear regression, ridge regression [1]

A fuzzy measure μ on a finite set X is a set function

A singleton measure of a fuzzy measure μ on a finite set X is a function $s: X \to [0, 1]$ satisfying:

$s(x) = \mu(\{x\}), \quad x \in X$ (4)

$s(x)$ is called the density of singleton $x$.

3.3. P-measure [10]

obtain the values of λ uniquely by using the previous polynomial equation. In other words, the λ-measure has a unique solution without closed form.
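Because λ has no closed form, it is found numerically. The sketch below assumes the standard normalization condition of Sugeno's λ-measure, $\prod_i (1 + \lambda s_i) = 1 + \lambda$ with $\lambda > -1$, as the polynomial equation; the bisection scheme, the function name and the example densities are illustrative assumptions.

```python
def solve_lambda(densities, iterations=200):
    """Solve prod(1 + lam * s_i) = 1 + lam for the nonzero root lam > -1.

    Returns 0.0 when the densities already sum to 1 (the additive case).
    Assumes at least two positive densities, each below 1.
    """
    total = sum(densities)
    if abs(total - 1.0) < 1e-12:
        return 0.0

    def f(lam):
        product = 1.0
        for s in densities:
            product *= 1.0 + lam * s
        return product - (1.0 + lam)

    if total < 1.0:
        # Root is positive: f < 0 just right of 0 and f > 0 for large lam.
        lo, hi = 0.0, 1.0
        while f(hi) < 0.0 and hi < 1e12:
            hi *= 2.0
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
    else:
        # Root lies in (-1, 0): f > 0 at -1 and f < 0 just left of 0.
        lo, hi = -1.0, 0.0
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    return 0.5 * (lo + hi)


# Hypothetical densities for two singletons.
lam = solve_lambda([0.2, 0.3])
print(lam)
```

For two singletons the equation reduces to $s_1 s_2 \lambda^2 = (1 - s_1 - s_2) \lambda$, which gives a quick hand check of the numerical root.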

4. R-measure [4]

For a given singleton measure s, an R-measure, $g_R$, is a fuzzy measure on a finite set X, $|X| = n$, satisfying:

(i) $R \in [0, \infty)$ (8)

(i) The R-measure has infinitely many solutions with closed form.

(ii) When R = 0, the R-measure is just a P-measure with closed form.

(iii) $g_R$ is an increasing function of R.

5. Fuzzy supports

For given singleton measures s of a fuzzy measure μ on a finite set X, if $\sum_{x \in X} s(x) = 1$, then s is called a fuzzy support measure of μ, or a fuzzy support of μ, or a support of μ. Two kinds of fuzzy supports are introduced

scores of subject i for singleton $x_j$, satisfying:

6.2. Choquet integral regression models [2], [3], [4], [5], [6]

7. Experiment and result

A real data set with 59 samples from a junior high school in Taiwan, listed in Table 2, is used in the experiment. The independent variables are the examination scores of four courses, and the dependent variable is the score of the Basic Competence Test of junior high school. The data are applied to evaluate the performance of three Choquet integral regression models, with P-measure, λ-measure and R-measure, each based on V-support and γ-support, together with a ridge regression model and a multiple linear regression model, using the 5-fold cross-validation method to compute the mean square error (MSE) of the dependent variable. The formula of MSE is
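A common definition of the mean square error over N held-out samples is given below. The symbols $y_i$ (observed BCT score) and $\hat{y}_i$ (model prediction) are assumed notation, and the paper's own formula may differ in detail, for instance in how the five folds are averaged.

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```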

For any fuzzy measure μ, once the fuzzy support of μ is given, all the event measures of μ can be found, and then the Choquet integral based on μ and the Choquet integral regression equation based on μ can also be found.

The singleton measures, V-support and γ-support, of the P-measure, λ-measure and R-measure can be obtained by using formulas (12) and (16), respectively.
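Once the event measures are available, the discrete Choquet integral itself is short to compute. The sketch below uses the standard definition over scores sorted in ascending order; the two example measures (an additive one built from hypothetical densities, and a cardinality-based one) are illustrations only and are not the P-, λ- or R-measures of this study. The scores are those of sample No. 1 in the data table.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of `values` (dict: criterion -> score)
    with respect to `measure` (callable on frozensets of criteria)."""
    ordered = sorted(values, key=values.get)        # ascending scores
    total, previous = 0.0, 0.0
    for position, criterion in enumerate(ordered):
        # Criteria whose score is at least the current one.
        upper_set = frozenset(ordered[position:])
        total += (values[criterion] - previous) * measure(upper_set)
        previous = values[criterion]
    return total


# Scores of sample No. 1 on the four courses.
scores = {"C1": 77, "C2": 75, "C3": 79, "C4": 83}

# Hypothetical densities summing to 1, giving an additive measure.
densities = {"C1": 0.1, "C2": 0.2, "C3": 0.3, "C4": 0.4}


def additive_measure(subset):
    return sum(densities[c] for c in subset)


def cardinality_measure(subset):
    """Hypothetical non-additive measure depending only on subset size."""
    return (len(subset) / 4) ** 2


weighted_mean = choquet_integral(scores, additive_measure)
non_additive = choquet_integral(scores, cardinality_measure)
print(weighted_mean, non_additive)
```

With an additive measure the integral reduces to the ordinary weighted mean, which is why the multiple linear regression model appears as the baseline; a non-additive measure lets the interaction among courses change the result.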

8. Conclusions and future works

When the subtests of a composite test interact, the performance of the traditional additive scale method is poor. Non-additive fuzzy measures and fuzzy integrals can be applied to improve this situation. In this study, a real data set from a junior high school, including the independent variables, the test scores of four courses with interaction, and the dependent variable, junior high school graduates' scores on the Basic Competence Test (BCT), is applied to evaluate the performance of the Choquet integral regression model with three fuzzy measures, P-measure, λ-measure and R-measure, based on two different supports, V-support and γ-support, respectively, the traditional multiple linear regression model, and the ridge regression model. The experimental results show the following:

(i) The Choquet integral regression model with R-measure based on γ-support has the best performance.

(ii) Based on the same fuzzy support, whether γ-support or V-support, the Choquet integral regression model with R-measure is better than those with λ-measure and P-measure.

(iii) For the Choquet integral regression model with the same measure, whether P-measure, λ-measure or R-measure, the performance derived from γ-support is better than that derived from V-support.

(iv) The Choquet integral regression model with λ-measure and R-measure based on V-support and γ-support

regression model with the better measure based on the best fuzzy support, γ-support, to develop multiple classifier system.

and Yu-Du Jheng, “A new weighting method for detecting outliers in IPA based on Choquet integral”, IEEE International Conference on Industrial Engineering and Engineering Management 2007, December 2-5, 2007, Singapore.

[3] Hsiang-Chuan Liu, “The Choquet integral regression model based on r-complete measure”, Journal of Educational Research and Development, Vol. 2, No. 4, pp. 87-107, 2006 (in Chinese).

[4] Hsiang-Chuan Liu, Wen-Chih Lin, and Wei-Sheng Weng, “A Choquet Integral Regression Model Based on a New Fuzzy Measure”, The 12th International Conference on Fuzzy Theory & Technology, July 19-24, 2007, Salt Lake City, Utah, U.

[5] Hsiang-Chuan Liu, Wen-Chih Lin, Kei-Yi Chang, and Wei-Sheng Weng, “A Nonlinear Regression Model Based on Choquet Integral with e Measure”, 2007 WSEAS International Conferences, Venice, Italy, November 21-24, 2007.

[6] Hsiang-Chuan Liu, Yu-Chieh Tu, Chin-Chun Chen, and Wei-Sheng Weng, “The Choquet integral with respect to λ-measure based on γ-support”, 2008 International Conference on Machine Learning and Cybernetics, Kunming, China, July 12-15, 2008 (accepted).

[7] G. Choquet, “Theory of capacities”, Annales de l’Institut Fourier, Vol. 5, pp. 131-295, 1953.

[8] M. Sugeno, “Theory of fuzzy integrals and its applications”, unpublished doctoral dissertation, Tokyo Institute of Technology, Tokyo, Japan, 1974.

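Table 2. Examination scores of four courses (C1 to C4) and Basic Competence Test (BCT) scores of the 59 samples

No. C1 C2 C3 C4 BCT No. C1 C2 C3 C4 BCT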

1 77 75 79 83 31 31 74 70 80 75 35

2 71 72 78 75 26 32 56 61 75 68 22

3 78 86 86 86 33 33 62 68 72 74 29

4 58 64 68 66 32 34 86 80 82 81 35

5 48 59 65 68 16 35 63 78 88 83 31

6 68 74 77 80 28 36 56 66 76 71 21

7 62 72 84 78 47 37 77 74 80 76 42

8 51 53 65 59 9 38 73 78 84 81 24

9 62 64 76 70 36 39 63 60 68 69 17

10 63 70 81 75 41 40 53 68 80 74 31

11 66 68 75 74 25 41 74 86 87 88 44

12 66 72 80 76 23 42 78 83 81 85 50

13 75 75 85 80 39 43 47 58 66 62 15

14 74 63 69 75 12 44 51 60 63 64 18

15 68 78 85 75 27 45 60 65 75 70 23

16 71 74 80 77 26 46 68 68 80 74 26

17 49 60 69 64 13 47 52 60 70 65 20

18 73 78 84 81 39 48 57 65 75 70 24

19 68 70 74 76 40 49 70 66 70 74 13

20 54 56 62 68 7 50 53 68 74 80 30

21 53 68 74 71 11 51 68 68 78 76 35

22 56 63 69 75 21 52 57 60 68 64 23

23 70 80 78 70 31 53 61 62 70 70 25

24 51 74 82 75 49 54 59 70 80 76 37

25 61 66 72 78 33 55 59 62 70 78 29

26 67 70 80 75 35 56 52 64 76 70 27

27 59 75 80 82 27 57 68 70 80 75 33

28 53 56 70 63 22 58 71 76 74 78 38

29 56 56 65 61 6 59 72 66 78 72 19

30 52 57 67 62 15
