Acquisition of rule-based knowledge for predicting and analyzing DNA-binding domains

Hui-Ling Huang¹,², Shinn-Jang Ho³, Li-Sun Shu⁴, and Shinn-Ying Ho¹,²

1Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan

2Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan

hlhuang@mail.nctu.edu.tw

3Department of Automation Engineering, National Formosa University, Yunlin 632, Taiwan

4Department of Information Management, Overseas Chinese University, Taichung 40721, Taiwan

syho@mail.nctu.edu.tw

Abstract—DNA-binding domains are functional proteins in a cell that play a vital role in various essential biological activities. It is desirable to predict and analyze novel proteins from protein sequences alone using machine learning approaches. Numerous prediction methods have been proposed that identify informative features and design effective classifiers. The support vector machine (SVM) is well recognized as an accurate and robust classifier. However, the black-box mechanism of SVM offers low interpretability for biologists. It is preferable to design a prediction method with interpretable features and prediction results. In this study, we propose an interpretable physicochemical property classifier (named iPPC) with an accurate and compact fuzzy rule base using a scatter partition of feature space for DNA-binding data analysis. In designing iPPC, the flexible membership functions, fuzzy rules, and physicochemical property selection are simultaneously optimized. An intelligent genetic algorithm (IGA) is used to efficiently solve the design problem with a large number of tuning parameters to maximize prediction accuracy, minimize the number of features selected, and minimize the number of fuzzy rules. Using benchmark datasets of DNA-binding domains, iPPC obtains a training accuracy of 81% and a test accuracy of 79% with three fuzzy rules and two physicochemical properties. Compared with the decision tree method with a training accuracy of 77%, iPPC has a more compact and interpretable knowledge base. The two physicochemical properties are Number of hydrogen bond donors and Helix-coil equilibrium constant in the AAindex database.

Keywords- knowledge acquisition; fuzzy classifier; genetic algorithm; DNA-binding; physicochemical properties; prediction

X. INTRODUCTION

DNA-binding domains are functional proteins in a cell that play a vital role in various essential biological activities, such as DNA transcription, replication, packaging, repair and rearrangement [1]. These transcription factors are mainly DNA-binding proteins (DNA-BPs) coded by 2~3% of the genome in prokaryotes and 6~7% in eukaryotes [2].

DNA-BPs play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Previous research reveals that the DNA-protein recognition mechanism is complicated and that there is no simple rule for this recognition problem [3].

Stawiski et al. found that nucleic acid-binding proteins could be separated using a neural network trained on features that included secondary structure and charged patches, among others [4]. Ahmad and Sarai used a simple linear predictor to model a trivial system with few descriptors, and they identified cutoff values for charge and dipole moment at which binding and non-binding proteins could be separated [5]. Kumar et al. proposed a method for predicting DNA-binding proteins using SVM and PSSM profiles [6].

These methods can analyze and predict DNA-binding proteins reasonably well, but they do not readily yield human-interpretable knowledge.

The study of Ho et al. [7] aims to analyze DNA-binding proteins via the acquisition of interpretable knowledge that can accurately predict binding sites in proteins, in order to understand the DNA-protein recognition mechanism. Their study investigates a novel feature set consisting of 11 features, including solvent accessibility, secondary structure, charge information near the residue, amino acid group and neighbor property. The derived binding and non-binding rules reveal that, besides the well-known solvent accessibility, the electric charge distribution near the residue and the amino acid groups also play important roles in the prediction of binding sites.

We have proposed Auto-IDPCPs [8], which investigates the optimal design of predictors for DNA-binding domains from amino acid sequences using both informative features and an appropriate classifier. Furthermore, we obtained a set of relevant physicochemical properties that can advance prediction performance. Auto-IDPCPs identified m=22 features of properties belonging to five clusters for predicting DNA-binding domains with a fivefold cross-validation accuracy of 87.12%. Since the set of 22 physicochemical properties performs well, we apply it to acquire rule-based knowledge for predicting and analyzing DNA-binding domains.

In this paper, we propose an interpretable physicochemical properties classifier (named iPPC) with an accurate and compact fuzzy rule base using a scatter partition of feature space for DNA-binding data analysis.

Because physicochemical properties from the AAindex database [9] cluster naturally, fuzzy classifiers using a scatter partition of feature spaces often have a smaller number of rules than those using grid partitions. The design of iPPC has three objectives to be simultaneously optimized: maximal classification accuracy, minimal number of rules, and minimal number of used physicochemical properties. In designing iPPC, the flexible membership functions, fuzzy rules, and physicochemical property selection are simultaneously optimized. An intelligent genetic algorithm (IGA) is used to efficiently solve the design problem with a large number of tuning parameters [10].

XI. MATERIALS AND METHODS

A. Dataset DNAset

This dataset, also called the main dataset, is from Kumar et al. [6]. They obtained 146 non-redundant DNA-BPs in which no two proteins have a sequence identity of more than 25%. A non-redundant set of 250 non-binding proteins was obtained from Stawiski et al. [11]. They used the following criteria: i) no two protein chains have a similarity of more than 25%, and ii) the approximate size and electrostatics are similar to those of DNA-BPs. The final dataset, called DNAset (also the main dataset or domain dataset), consists of 146 DNA-binding and 250 non-binding protein chains or domains.

DNAiset

We used this dataset, called DNAiset, to evaluate the performance of our models. This dataset, from Kumar et al. [6], consists of 92 DNA-binding protein chains obtained from PDB and 100 non-binding proteins picked from Swiss-Prot.

B. Feature set

For the DNA-binding domain dataset DNAset, the set of m=22 informative physicochemical properties (PCPs) identified by Auto-IDPCPs performs best, and the robust solution with an accuracy of 87.12% is used. Auto-IDPCPs is a systematic approach to automatically identify a set of physicochemical and biochemical properties in the AAindex database to design SVM-based classifiers for predicting and analyzing DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 vectors in AAindex into 20 classes using a fuzzy c-means algorithm, 2) utilizing an efficient genetic-algorithm-based optimization method, IBCGA, to select an informative feature set of size m to represent sequences, and 3) analyzing the selected feature vectors to identify the related physicochemical properties which may affect the binding mechanism of DNA-binding domains/proteins.

We apply the set of m=22 PCPs identified by Auto-IDPCPs to acquire rule-based knowledge for predicting and analyzing DNA-binding domains. The set of 22 PCPs is described in Table 1.

Table 1 - The set of m=22 physicochemical properties identified by Auto-IDPCPs on DNAset.

Feature ID  AAindex ID  Description
53  CHOP780216  Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b)
56  CIDH920103  Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
64  DAYM780101  Amino acid composition (Dayhoff et al., 1978a)
86  FAUJ880109  Number of hydrogen bond donors (Fauchere et al., 1988)
91  FINA770101  Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977)
188  NAGK730103  Normalized frequency of coil (Nagano, 1973)
202  NAKH920101  AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992)
227  PALJ810105  Normalized frequency of turn from LG (Palau et al., 1981)
228  PALJ810106  Normalized frequency of turn from CF (Palau et al., 1981)
255  PRAM900104  Relative frequency in reverse-turn (Prabhakaran, 1990)
262  QIAN880105  Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988)
274  QIAN880117  Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988)
286  QIAN880129  Weights for coil at the window position of -4 (Qian-Sejnowski, 1988)
363  SUEM840101  Zimm-Bragg parameter s at 20 C (Sueki et al., 1984)
383  WEBA780101  RF value in high salt chromatography (Weber-Lacey, 1978)
388  WOEC730101  Polar requirement (Woese, 1973)
412  AURR980110  Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998)
430  MUNV940102  Free energy in alpha-helical region (Munoz-Serrano, 1994)
434  WIMW960101  Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)
443  KUMS000104  Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000)
486  BASU050102  Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005)
513  JACR890101  Weights from the IFH scale (Jacobs-White, 1989)

C. Method for acquiring rule-based knowledge

The high performance of iPPC mainly arises from two aspects. One is the simultaneous optimization of all parameters in the design of iPPC, where all the elements of the fuzzy classifier design are encoded as parameters of a large parameter optimization problem. The other is the use of an efficient optimization algorithm, IGA, which is a specific variant of the intelligent evolutionary algorithm [10]. The intelligent evolutionary algorithm uses a divide-and-conquer strategy to effectively solve large parameter optimization problems. IGA has been shown to be effective in the design of accurate classifiers with a compact fuzzy-rule base using an evolutionary scatter partition of feature space [10].

1) Flexible membership function

The classifier design of iPPC uses flexible generic parameterized fuzzy regions which can be determined by flexible generic parameterized membership functions (FGPMFs) and a hyperbox-type fuzzy partition of feature space. Each fuzzy region corresponds to a parameterized fuzzy rule. In this study, each feature value is normalized into a real number in the unit interval [0,1]. An FGPMF with a single fuzzy set is defined as

$$\mu(x)=\begin{cases}\dfrac{x-a}{b-a}, & a\le x<b\\ 1, & b\le x\le c\\ \dfrac{d-x}{d-c}, & c<x\le d\\ 0, & \text{otherwise}\end{cases}\qquad (1)$$

where x ∈ [0,1]. The variables a, b, c, and d (a ≤ b ≤ c ≤ d) determining the shape of a trapezoidal fuzzy set are the parameters to be optimized. It is well recognized that confining evolutionary searches within feasible regions is often much more reliable than penalty approaches for handling constrained problems [12]. Therefore, five parameters V¹, V², …, V⁵ ∈ [0,1] without constraints, instead of a, b, c, and d, are encoded into a GA-chromosome for facilitating IGA. Let an additional variable L=V¹ determine the location of the fuzzy set characterizing the occurrence of training patterns. When the Vⁱ are obtained, variables a, b, c, and d can be derived as follows: a=L−(V²+V³), b=L−V³, c=L+V⁴, and d=L+(V⁴+V⁵). This transformation always makes the derived values of a, b, c, and d feasible and reduces interactions among encoded parameters of GA-chromosomes. Some illustrations of FGPMF are shown in Fig. 1.

Figure 1. Illustrations of FGPMF: (a) a>0 and d<1; (b) a<0<b; (c) b≤0; (d) b≤0 and c≥1.
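The decoding step above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `decode_fgpmf` and `trapezoid` are hypothetical, and the trapezoidal form of the membership grade is assumed from the description of a, b, c, and d as the corners of a trapezoidal fuzzy set.

```python
def decode_fgpmf(v):
    """Map five unconstrained genes V1..V5 in [0, 1] to trapezoid corners.

    Because every gene is non-negative, the result always satisfies
    a <= b <= c <= d, so no repair or penalty step is needed.
    """
    v1, v2, v3, v4, v5 = v
    loc = v1                   # L = V1, the location of the fuzzy set
    a = loc - (v2 + v3)
    b = loc - v3
    c = loc + v4
    d = loc + (v4 + v5)
    return a, b, c, d


def trapezoid(x, a, b, c, d):
    """Membership grade of x for a trapezoid with corners a <= b <= c <= d."""
    if b <= x <= c:            # plateau (checked first so a == b or c == d is safe)
        return 1.0
    if a < x < b:              # rising edge
        return (x - a) / (b - a)
    if c < x < d:              # falling edge
        return (d - x) / (d - c)
    return 0.0


# Hypothetical gene values, chosen only for illustration.
corners = decode_fgpmf((0.5, 0.25, 0.25, 0.125, 0.125))
print(corners)                       # corners of the decoded trapezoid
print(trapezoid(0.125, *corners))    # a point on the rising edge
```

Corners may fall outside [0, 1] (for example a < 0), which is what produces the truncated shapes of Fig. 1; the membership grade is still only evaluated for x in [0, 1].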

2) Fuzzy rule and fuzzy reasoning method

The following fuzzy if-then rules for n-dimensional pattern classification problems are used in the design of iPPC:

Rj: If x1 is Aj1 and ... and xn is Ajn, then Class CLj with CFj, j = 1, ..., N,

where Rj is a rule label, xi denotes a PCP variable, Aji is an antecedent fuzzy set, CLj denotes a consequent class label, CFj is a certainty grade of the rule in the unit interval [0, 1], and N is the number of initial fuzzy rules in the training phase.

To enhance the interpretability of fuzzy rules, linguistic variables can be used in fuzzy rules. Each variable xi has a linguistic set U = {L, ML, M, MH, H}. Each linguistic value of xi equally represents 1/5 of the domain [0, 1]. Following this quantization criterion, each PCP value can be described by a qualitative level. For example, xi is Low for PCPs with small values, Medium for PCPs with intermediate values, and High for PCPs with large values. An antecedent fuzzy set Aji belongs to A^U, where A^U denotes a set of subsets of U.

Examples of linguistic antecedent fuzzy sets are shown in Fig. 2.

In the training phase, all the variables CLj and CFj are treated as parametric genes of GA (GA-genes) encoded in chromosomes of GA (GA-chromosomes), and their values are obtained using IGA. The following fuzzy reasoning method is adopted to determine the class of an input pattern xp = (xp1, xp2, ..., xpn) based on voting using multiple fuzzy if-then rules.

Step 1: Calculate SClassv for each class v:

$$S_{\mathrm{Class}\,v}=\sum_{R_j\in FC,\ CL_j=\mathrm{Class}\,v}\mu_j(x_p)\,CF_j,\qquad \mu_j(x_p)=\prod_{i}\mu_{ji}(x_{pi})\qquad (2)$$

where FC denotes the fuzzy classifier, the product is taken over the selected features, and μji(·) represents the membership function of the antecedent fuzzy set Aji.

Step 2: Classify xp as the class with a maximal value of SClassv.
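The voting scheme can be illustrated with a short Python sketch. It assumes the common voting form in which each rule contributes its compatibility grade (the product of its antecedent memberships) times its certainty grade CF to the score of its consequent class; the rule set `RULES`, the function names, and all numeric values are hypothetical and are not taken from the trained iPPC.

```python
from math import prod


def trapezoid(x, a, b, c, d):
    """Membership grade of x for a trapezoid with corners a <= b <= c <= d."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def classify(x, rules):
    """Vote over fuzzy rules and return (winning class, per-class scores).

    Each rule is (antecedents, class_label, cf), where antecedents maps a
    feature index to the trapezoid corners (a, b, c, d) of its fuzzy set.
    """
    scores = {}
    for antecedents, label, cf in rules:
        # Step 1: compatibility grade of the pattern with this rule.
        grade = prod(trapezoid(x[i], *abcd) for i, abcd in antecedents.items())
        scores[label] = scores.get(label, 0.0) + grade * cf
    # Step 2: pick the class with the maximal accumulated score.
    return max(scores, key=scores.get), scores


# Hypothetical two-rule classifier on a single normalized feature.
RULES = [
    ({0: (0.0, 0.0, 0.4, 0.6)}, "non-binding", 0.8),
    ({0: (0.4, 0.6, 1.0, 1.0)}, "binding", 0.9),
]

for value in (0.2, 0.5, 0.9):
    print(value, classify([value], RULES)[0])
```

In the overlap region between the two fuzzy sets both rules fire partially, and the certainty grades decide the vote.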

3) Fitness function

We define the fitness function Fit() of IGA for designing iPPC as follows:

$$\max\ Fit(FC) = ACC - W_r N_r - W_f N_f \qquad (3)$$

where Wr and Wf are positive weights. In this study, the fitness function is used to optimize the three objectives in the following order: to maximize the accuracy rate ACC of correctly classified training patterns, to minimize the number Nr of fuzzy rules, and to minimize the number Nf of selected PCPs. Generally, the final number of fuzzy rules is smaller than 10. Therefore, we set Wr = 0.1 to ensure that classification accuracy has the first priority to be optimized.

When the two objectives ACC and Nr are simultaneously optimized for DNA-binding data, the best number of used PCPs is almost determined. Hence, a very small value of 0.001 is assigned to Wf. For a sensitivity analysis of different settings of Wr and Wf, refer to [10].

Figure 2. Examples of an antecedent fuzzy set Aji with linguistic values (L: low, ML: medium low, M: medium, MH: medium high, H: high): (a) Aji represents {ML, M, MH}; (b) Aji represents {ML, M, MH, H}, i.e., not Low; (c) Aji represents {L, ML, M, MH, H} or ALL.
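A small numeric sketch may help show how the weights order the three objectives. The function name `fitness` is hypothetical, and ACC is assumed here to be expressed in percent, which is an interpretation consistent with the statement that fewer than 10 rules at Wr = 0.1 cannot outweigh accuracy.

```python
def fitness(acc_percent, n_rules, n_features, w_r=0.1, w_f=0.001):
    """Fit(FC) = ACC - Wr*Nr - Wf*Nf, with ACC assumed to be in percent."""
    return acc_percent - w_r * n_rules - w_f * n_features


# One extra percentage point of accuracy outweighs two extra rules
# and one extra feature.
print(fitness(81, 3, 2) > fitness(80, 1, 1))

# At equal accuracy, the rule count decides before the feature count.
print(fitness(81, 2, 5) > fitness(81, 3, 2))
```

With these weights the rule penalty stays below one percentage point as long as Nr < 10, and the feature penalty acts only as a tie-breaker.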

4) GA-chromosome representation

A GA-chromosome consists of control GA-genes for selecting useful PCPs and significant fuzzy rules, and parametric GA-genes for encoding the membership functions and fuzzy rules. The control GA-genes comprise two types of parameters. One is the parameter rj, j=1, ..., N, represented by one bit for eliminating unnecessary fuzzy rules. If rj = 0, the fuzzy rule Rj is excluded from the rule base. Otherwise, Rj is included. The other is the parameter fi, i=1, ..., n, represented by one bit for eliminating useless PCPs. If fi = 0, the PCP xi is excluded from the classifier. Otherwise, xi is included. The parametric GA-genes consist of three types: V_ji^k ∈ [0, 1], k = 1, ..., 5, for determining the antecedent fuzzy set Aji; CLj for determining the consequent class label of rule Rj; and CFj ∈ [0, 1] for determining the certainty grade of rule Rj, where j=1, ..., N and i=1, ..., n. A rule base with N fuzzy rules is represented as an individual. The number of encoding parameters to be optimized is Np = n+3N+5Nn. A GA-chromosome representation uses a binary string for encoding control and parametric GA-genes. Eight bits are used for encoding each of the parameters V_ji^k and CFj. Since each fuzzy region defines a fuzzy rule, the initial setting of N is independent of n but dependent on the number of fuzzy regions. Generally, N is set to the maximal number of possible fuzzy regions. In this study, N=3C, where C is the number of classes. The design of an efficient fuzzy classifier is formulated as a large parameter optimization problem. Once the solution of IGA is obtained, an accurate classifier with a compact fuzzy rule base can be derived.
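To make the size of the optimization problem concrete, the parameter count can be tallied as below. This is a worked illustration only: the function name is hypothetical, and the setting of two classes (binding and non-binding), giving N = 6 initial rules, is an assumption drawn from N = 3C and the binary nature of the task.

```python
def n_parameters(n_features, n_classes):
    """Count the encoded parameters Np = n + 3N + 5Nn for N = 3C initial rules."""
    n_rules = 3 * n_classes                 # N = 3C
    control = n_rules + n_features          # r_j bits plus f_i bits
    membership = 5 * n_rules * n_features   # V_ji^k, k = 1..5
    consequents = 2 * n_rules               # CL_j and CF_j for every rule
    return control + membership + consequents


# 22 PCPs and an assumed two-class problem.
print(n_parameters(22, 2))
```

Under these assumptions the count is dominated by the membership-function genes, which is why an optimizer designed for large parameter sets is used.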

XII. RESULTS

The parameter settings of IGA from Ho et al. (2004a) are Npop = 20, Pc = 0.7, Ps =1−Pc, Pm = 0.01, and α = 15.

Because the search space of the optimal design of iPPC is proportional to the number Np of parameters to be optimized, the stopping condition is a fixed number 100Np of fitness evaluations, as suggested by Ho et al. (2004a).

A. Performance

In the datasets, all the domains/sequences have a variable length l. A sequence forms an l-dimensional profile where the value of each amino acid is obtained from the AAindex database for encoding a specific physicochemical property. The l-dimensional profiles are transformed into vectors with the same constant length L for use by a classifier. The transformation can be any known effective representation provided that the L features can effectively classify the l-dimensional profiles of positive and negative sequences. The simplest feature is the mean of the profile, for which L=1. Therefore, the sequences with m properties are represented as m-dimensional feature vectors.

The training dataset DNAset with m=22 properties is represented as 22-dimensional feature vectors. These 22 physicochemical properties are pre-identified by Auto-IDPCPs and are described in Table 1. The training accuracy is 87% and the testing accuracy is 70%.

Finally, all values of the feature vectors are normalized into [0, 1] to apply iPPC.
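The encoding described above (one mean value per property, followed by normalization) can be sketched as follows. The property tables `TOY_PROP_1` and `TOY_PROP_2` are invented values over a three-letter alphabet and are not AAindex entries; min-max scaling per feature is also an assumption, since the text only states that values are normalized into [0, 1].

```python
def mean_profile(sequence, prop):
    """Mean of the per-residue property profile (the L = 1 representation)."""
    values = [prop[residue] for residue in sequence]
    return sum(values) / len(values)


def encode(sequences, props):
    """Represent each sequence as an m-dimensional vector, one mean per property."""
    return [[mean_profile(seq, prop) for prop in props] for seq in sequences]


def min_max_normalize(rows):
    """Scale every feature column into [0, 1] (assumed normalization scheme)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy property tables: hypothetical values, NOT taken from AAindex.
TOY_PROP_1 = {"A": 0.0, "K": 2.0, "S": 1.0}
TOY_PROP_2 = {"A": 1.5, "K": 0.5, "S": 1.0}

vectors = encode(["AKS", "KKK", "AAA"], [TOY_PROP_1, TOY_PROP_2])
print(min_max_normalize(vectors))
```

In practice the minimum and maximum would be taken from the training set and reused for the test set, so that both are scaled consistently.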

Because of the non-deterministic characteristic of GA, the experimental results are the average values of 30 independent runs. In each run, we obtain a fuzzy classifier with the accuracy rate ACC, the number Nr of fuzzy rules, and the number Nf of selected PCPs. The classifiers obtained are then evaluated on the test dataset DNAiset. The training and testing results are shown in Table 2. The six most frequently selected PCPs in the 30 runs are shown in Table 3.

Table 2- The average values of 30 independent runs of the proposed iPPC.

DNAset DNAiset

Table 3 - The six most frequently selected PCPs in the 30 runs.

Frequency  Feature No.  AAindex No.
26  86  FAUJ880109
10  274  QIAN880117
6  255  PRAM900104
6  286  QIAN880129
6  513  JACR890101
5  91  FINA770101

B. Comparison with decision tree

We randomly select the result of one run from the 30 independent runs of iPPC. The training ACC is 81%, the number Nf of selected PCPs is 2, the number Nr of fuzzy rules is 3, and the testing ACC is 79%. The 2 selected PCPs are FAUJ880109 (86) and FINA770101 (91).

Using J48 in Weka 3-6-4, decision trees are built from the set of 22 PCPs and from the 2 selected PCPs (FAUJ880109 (86) and FINA770101 (91)), as shown in Fig. 3(a) and Fig. 3(b), respectively. The training accuracy of the decision tree using the 22 PCPs is 77.16%, and that of the tree using the 2 PCPs is 77.67%. The decision values are fixed floating-point thresholds, which are not easy to interpret.

Figure 3. The decision trees built from (a) the set of 22 PCPs and (b) the 2 PCPs selected by iPPC. Cid: clustering id. A: Alpha and turn propensities. B: Beta propensity. C: Composition. H: Hydrophobicity. P: Physicochemical properties. O: Other properties.

Fig. 4 shows an example of iPPC using the 2 PCPs with 3 rules. The classifier has three fuzzy rules using the two PCPs FAUJ880109(C9H) and FINA770101(C10A), with a training ACC of 81% and a testing ACC of 79%.

Figure 4. Fuzzy rules of the selected 2 PCPs, the training ACC is 81% and testing ACC is 79%. 0: binding, 1: non-binding.

Using the 2 selected PCPs, the proposed iPPC can obtain rule-based knowledge. The fuzzy rules are linguistically interpretable as follows:

R1: If FAUJ880109(C9H) is all and FINA770101(C10A) is all, then DNA is binding (CF=0.290).

R2: If FAUJ880109(C9H) is {low, medium low, medium} and FINA770101(C10A) is all, then DNA is non-binding (CF=0.325).

R3: If FAUJ880109(C9H) is all and FINA770101(C10A) is {medium low, medium, medium high, high}, then DNA is binding (CF=0.992).
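As a rough reading aid, the three rules can be applied with crisp linguistic levels, each covering 1/5 of [0, 1]. This is a simplification and an assumption: the trained classifier uses trapezoidal membership functions whose parameters are not given in the text, so the sketch below (with hypothetical names `level` and `classify_rules`) only approximates its behavior.

```python
LEVELS = ("low", "medium low", "medium", "medium high", "high")
ALL = set(LEVELS)


def level(x):
    """Crisp linguistic level of a normalized value x in [0, 1]."""
    return LEVELS[min(int(x * 5), 4)]


# (FAUJ880109 levels, FINA770101 levels, class, CF) as listed in R1 to R3.
RULES = [
    (ALL, ALL, "binding", 0.290),
    ({"low", "medium low", "medium"}, ALL, "non-binding", 0.325),
    (ALL, {"medium low", "medium", "medium high", "high"}, "binding", 0.992),
]


def classify_rules(fauj, fina):
    """Sum the CF of every firing rule per class and return the winner."""
    scores = {"binding": 0.0, "non-binding": 0.0}
    for fauj_levels, fina_levels, label, cf in RULES:
        if level(fauj) in fauj_levels and level(fina) in fina_levels:
            scores[label] += cf
    return max(scores, key=scores.get)


print(classify_rules(0.3, 0.1))   # few donors, low helix-coil constant
print(classify_rules(0.3, 0.5))   # few donors, medium helix-coil constant
```

Under this crisp reading, a protein is labeled non-binding only when FAUJ880109 is at most medium and FINA770101 is low, because R2 (CF=0.325) outvotes R1 (CF=0.290) but not R3 (CF=0.992).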

XIII. CONCLUSION

This paper proposes an interpretable physicochemical property classifier (named iPPC) with an accurate and compact fuzzy rule base using a scatter partition of feature space for DNA-binding data analysis. In designing iPPC, the flexible membership functions, fuzzy rules, and physicochemical property selection are simultaneously optimized. The obtained fuzzy rules are easy for biologists to interpret and help analyze DNA-binding domains.

REFERENCES

[1] M. Gao, J. Skolnick, “A threading-based method for the prediction of DNA-binding proteins with application to the human genome,” PLoS Comput Biol, 5(11), 2009, e1000567.

[2] D. Lejeune, N. Delsaux, B. Charloteaux, A. Thomas, R. Brasseur, “Protein-Nucleic Acid Recognition: Statistical Analysis of Atomic Interactions and Influence of DNA Structure,” PROTEINS: Structure, Function, and Bioinformatics, no. 61, 2005, pp. 258-271.

[3] R.A. O'Flanagan, G. Paillard, R. Lavery, A.M. Sengupta, “Non-additivity in protein-DNA binding,” Bioinformatics, 21, 2005, pp. 2254-2263.

[4] E. W. Stawiski, L. M. Gregoret, Y. Mandel-Gutfreund, “Annotating nucleic acid binding function based on protein structure,” J. Mol. Biol., no. 326, 2003, pp. 1065-1079.

[5] D. C. Chan, D. Fass, J. M. Berger, “Core Structure of gp41 from the HIV Envelope Glycoprotein,” Cell, vol. 89, April 18, 1997, pp. 263-273.

Decision tree (J48): if FAUJ880109(C9H) <= 0.39017 then non-binding, else binding. Fuzzy rule table columns: FAUJ880109(C9H), FINA770101(C10A), Class, CF; R1: Class 0, CF 0.290.

[6] M. Kumar, M.M. Gromiha, G.P. Raghava, “Identification of DNA-binding proteins using support vector machines and evolutionary profiles,” BMC Bioinformatics, 8, 2007, 463.

[7] S-J Ho, C-Y Chang, L-T Huang, S-F Hwang, S-Y Ho, “Acquisition of Rule-based Knowledge for Analyzing DNA-binding Sites in Proteins,” Conference: Infoscale, June 6-8, 2007, Suzhou, China.

[8] H-L Huang, I-C Lin, Y-F Liou, C-T Tsi, K-T Hsu, W-L Huang, S-J Ho, S-Y Ho, “Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties,” BMC Bioinformatics, 2010 (Accepted).

[9] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, M. Kanehisa, “AAindex: amino acid index database, progress report 2008,” Nucleic Acids Res, 36(Database issue), 2008, D202-205.

[10] S-Y Ho, L-S Shu, J-H Chen, “Intelligent evolutionary algorithms for large parameter optimization problems,” IEEE T Evolut Comput, 8(6), 2004, pp. 522-541.

[11] X. Yu, J. Cao, Y. Cai, T. Shi, Y. Li, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” J Theor Biol, no. 240, 2006, pp. 175-184.

[12] Z. Michalewicz, D. Dasgupta, R.G. Le Riche, M. Schoenauer, “Evolutionary algorithms for constrained engineering problems,” Comput. Ind. Eng., 30(4), 1996, pp. 851-870.
