
3.2.1 Machine Learning Methods

Machine learning is the programming of computers to optimize a performance criterion using example data or past experience. In bioinformatics, machine learning usually refers to classification, which learns a predictive model from training data sets to distinguish between different exemplars based on their differentiating patterns. Several common machine learning algorithms, such as k-Nearest Neighbor (KNN), decision tree, Bayesian decision theory (BDT), neural network (NN), hidden Markov model (HMM), and support vector machine (SVM), are described as follows.

k-Nearest Neighbor (KNN)

Arguably the simplest method is the k-Nearest Neighbor classifier (Cover and Hart, 1967). Here the k points of the training data closest to the test point are found, and a label is given to the test point by a majority vote among these k points. This method is highly intuitive and attains, given its simplicity, remarkably low classification errors, but it is computationally expensive and requires a large memory to store the training data.
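As a concrete illustration, the following minimal Python sketch (the toy data and the function name knn_predict are invented for this example, not taken from any library used in this work) classifies a test point by a majority vote among its k nearest training points:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by a majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy usage: two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```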

Decision Tree

Another intuitive class of classification algorithms is decision trees. As shown in Figure 3.4, these algorithms solve the classification problem by repeatedly partitioning the input space, so as to build a tree whose nodes are as pure as possible (that is, they contain points of a single class). Classification of a new test point is achieved by moving from top to bottom along the branches of the tree, starting from the root node, until a terminal node is reached. Decision trees are simple yet effective classification schemes for small datasets, but the computational complexity scales unfavorably with the number of dimensions of the data, and large datasets tend to result in complicated trees, which in turn require a large memory for storage. The C4.5 implementation by Quinlan (1992) is frequently used and can be downloaded at http://www.rulequest.com/Personal.

Figure 3.4 An example of a decision tree.
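A minimal usage sketch is shown below; it relies on scikit-learn's DecisionTreeClassifier, which builds a CART-style tree rather than Quinlan's C4.5, and the toy data are invented for the example:

```python
from sklearn.tree import DecisionTreeClassifier

# toy training data: each row is a feature vector, labels are the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=3)   # limit depth to keep the tree simple
tree.fit(X, y)
print(tree.predict([[1, 0]]))                # follows the branches to a leaf -> [1]
```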

Bayesian Decision Theory (BDT)

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions. It assumes that the decision problem is posed in probabilistic terms and that all of the relevant probability values are known. Suppose that we have an unclassified sample x that belongs to one of two categories: C1 (defined as phosphorylation sites) and C2 (defined as non-phosphorylation sites), and suppose that we know both the prior probabilities P(C_j) and the class-conditional densities p(x|C_j). The posterior probabilities of the two categories given x, P(C_1|x) and P(C_2|x), are obtained from Bayes' formula:

P(C_j \mid x) = \frac{p(x \mid C_j)\, P(C_j)}{p(x)}, \qquad j = 1, 2,

where in this case of two categories

p(x) = \sum_{j=1}^{2} p(x \mid C_j)\, P(C_j).

If category C_i is chosen, the probability of a wrong prediction is

P(\mathrm{error} \mid x) = 1 - P(C_i \mid x).

To minimize the expectation of the error probability, which is defined as [97]

P(\mathrm{error}) = \int P(\mathrm{error} \mid x)\, p(x)\, dx,

it is obvious that one should choose the more probable category as the prediction result, which can be formulated as the Bayesian decision rule: decide C_1 if P(C_1|x) > P(C_2|x), and decide C_2 otherwise.

More generally, let \lambda(\alpha_i \mid C_j) denote the loss incurred by taking action \alpha_i (choosing one possible solution) when the true category is C_j. The expected loss (risk) of taking action \alpha_i is:

R(\alpha_i \mid x) = \sum_{j=1}^{2} \lambda(\alpha_i \mid C_j)\, P(C_j \mid x).

In this condition, the goal of optimization becomes to minimize the overall risk for every x. Similar to the rationale of the Bayesian decision rule, we can obtain the best performance by computing R(\alpha_i \mid x) for each solution \alpha_i and choosing the one with the minimal risk [97].
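A minimal Python sketch of this rule follows; the one-dimensional feature, the Gaussian class-conditional densities, and the prior values are hypothetical, chosen only to show how the posterior probabilities and the decision are computed:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D feature: the class-conditional densities p(x|C1), p(x|C2) are
# assumed Gaussian, and the priors P(C1), P(C2) are assumed known.
priors = {"C1": 0.3, "C2": 0.7}                    # P(C1): phosphorylation, P(C2): non-phosphorylation
densities = {"C1": norm(loc=2.0, scale=1.0),       # p(x|C1)
             "C2": norm(loc=0.0, scale=1.0)}       # p(x|C2)

def posteriors(x):
    """Bayes' formula: P(Cj|x) = p(x|Cj) P(Cj) / p(x)."""
    joint = {c: densities[c].pdf(x) * priors[c] for c in priors}
    evidence = sum(joint.values())                 # p(x) = sum_j p(x|Cj) P(Cj)
    return {c: joint[c] / evidence for c in joint}

def bayes_decision(x):
    """Bayesian decision rule: choose the more probable category."""
    post = posteriors(x)
    return max(post, key=post.get), post

label, post = bayes_decision(1.5)
print(label, post)                                 # P(error|x) = 1 - max_j P(Cj|x)
```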

Neural Network (NN)

Neural network (NN) is one of the most commonly used approaches to classification.

Artificial neural network (ANN) is a computational model inspired by the connectivity of neurons in animate nervous systems [98]. A simple scheme for ANN is shown in Figure 3.5 [98]. Each circle denotes a computational element referred to as a neuron, which computes a weighted sum of its inputs, and possibly performs a nonlinear function on this sum. If certain classes of nonlinear functions are used, the function computed by the network can approximate any function (specifically a mapping from the training patterns to the training targets), provided enough neurons exist in the network and enough training examples are provided.

Figure 3.5 A schematic diagram of artificial neural network. Each circle in the hidden and output layer is a computation element known as a neuron (Haykin et al., 1999).

ANN is capable of classifying highly complex and nonlinear biological sequence patterns, where correlations between positions are important. Not only does the network recognize the patterns seen during training, but it also retains the ability to generalize and recognize similar, though not identical, patterns. Artificial neural network algorithms have been used extensively in biological sequence analysis. An artificial neural network library, ANNLIB [99], which was implemented in the C programming language, is available.
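As a minimal sketch of the computation described above, the following Python code performs a forward pass through a one-hidden-layer network with sigmoid activations; the layer sizes and random weights are arbitrary, and training (e.g., backpropagation) is not shown:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Each neuron computes a weighted sum of its inputs followed by a
    nonlinear (sigmoid) activation; one hidden layer and one output layer."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden-layer activations
    o = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # output-layer activation
    return o

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # 4 hidden neurons -> 1 output
print(forward(np.array([0.2, -0.1, 0.5]), W1, b1, W2, b2))
```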

Hidden Markov Model (HMM)

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. A HMM can be considered as the simplest dynamic Bayesian network. The key idea is that an HMM is a finite model that describes a probability distribution over an infinite number of possible sequences. The HMM is composed of some number of states, which might correspond to positions in a three-dimensional structure or columns of a multiple alignment.

Each state “emits” symbols (residues) according to symbol emission probabilities, and the states are interconnected by state transition probabilities. Starting from an initial state, a sequence of states is generated by moving from state to state according to the state transition probabilities until an end state is reached. Each state then emits symbols according to that state’s emission probability distribution, creating an observable sequence of symbols.

The state path is a Markov chain, meaning that the state we go to next depends only on the state we are in. Since we are only given the observed sequence, this underlying state path is hidden; these are the residue labels that we would like to infer. The state path is therefore a hidden Markov chain, and the probability P(S, \pi \mid \theta) that an HMM with parameters \theta generates a state path \pi and an observed sequence S is the product of all the emission probabilities and transition probabilities that were used. Why are they called hidden Markov models? The sequence of states is a Markov chain, because the choice of the next state to occupy depends on the identity of the current state. However, this state sequence is not observed; it is hidden. Only the symbol sequence that these hidden states generate is observed. The most likely state sequence must be inferred from an alignment of the HMM to the observed sequence.
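A minimal sketch of this product is given below; the two-state model, its two-symbol alphabet, and all probability values are hypothetical and serve only to illustrate how P(S, π | θ) is computed from the emission and transition probabilities:

```python
# Hypothetical 2-state HMM over a 2-symbol alphabet ('A', 'B'); all numbers are illustrative.
start = {"s1": 0.6, "s2": 0.4}                      # initial state probabilities
trans = {"s1": {"s1": 0.7, "s2": 0.3},              # state transition probabilities
         "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"A": 0.9, "B": 0.1},                 # symbol emission probabilities
        "s2": {"A": 0.2, "B": 0.8}}

def joint_probability(seq, path):
    """P(S, pi | theta): product of the emission and transition probabilities used."""
    p = start[path[0]] * emit[path[0]][seq[0]]
    for t in range(1, len(seq)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][seq[t]]
    return p

print(joint_probability("AAB", ["s1", "s1", "s2"]))  # probability of this sequence and path
```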

Figure 3.6 An example of a small profile HMM representing a short multiple alignment of five sequences with three consensus columns (Eddy et al., 1998).

Hidden Markov models now provide a coherent theory for profile methods, namely profile hidden Markov models (profile HMMs) [63], which are statistical models (maximum likelihood) of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. An example of a small profile HMM is shown in Figure 3.6 [63]. The three columns are modeled by three match states (squares labeled m1, m2, and m3), each of which has 20 residue emission probabilities, shown with black bars. Insert states (diamonds labeled i0 - i3) also have 20 emission probabilities each. Delete states (circles labeled d1 - d3) are ‘mute’ states that have no emission probabilities. A begin and an end state are included (b, e). State transition probabilities are shown as arrows.

Support Vector Machine (SVM)

Support vector machine (SVM) [100] is a useful technique for data classification. A classification task usually involves training and testing data, which consist of data instances. Each instance in the training set contains one “target value” (class label) and several “attributes” (features). The goal of SVM is to produce a model which predicts the target values of data instances in the testing set, given only the attributes. The basic concept of SVM is to transform the samples into a high dimensional space and find a separating hyperplane with the maximal margin between the two classes in that space (Figure 3.7).

Figure 3.7 Basic concept of support vector machine.7

Basically, SVM is a binary classifier. Given training vectors x_i, i = 1, …, l, and a label vector y defined as y_i = 1 if x_i is in class I and y_i = −1 if x_i is in class II, the support vector technique tries to find the separating hyperplane w^T x + b = 0 with the largest distance between the two classes, measured along a line perpendicular to this hyperplane. This requires the solution of the following optimization problem (Figure 3.8):

7 The figure was obtained from http://www.imtech.res.in/raghava/rbpred/svm.jpg

\min_{w,\, b,\, \xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i

subject to y_i\left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0.

Here the training vectors x_i are mapped into a higher dimensional space by the function \phi. The constraints y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i allow that the training data may not lie on the correct side of the separating hyperplane w^T x + b = 0. SVM then finds a linear separating hyperplane with the maximal margin in this higher dimensional space, and C is the penalty parameter of the error term to be optimized. Furthermore, K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j) is called the kernel function.

Four basic kernel functions are listed as follows:

• Linear: K(x_i, x_j) = x_i^T x_j
• Polynomial: K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d, \gamma > 0
• Radial basis function (RBF): K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \gamma > 0
• Sigmoid: K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r)

Here, \gamma, r, and d are kernel parameters. The most commonly used kernel function is the RBF kernel.
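The four kernels can be written out directly; the following Python sketch uses arbitrary example values for the kernel parameters γ, r, and d:

```python
import numpy as np

# The four basic kernel functions; gamma, r, and d are kernel parameters (example values).
def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=0.5, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, gamma=0.5, r=-1.0):
    return np.tanh(gamma * (xi @ xj) + r)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), rbf_kernel(xi, xj))
```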

Figure 3.8 Principle of the separating hyperplane in a support vector machine.8

Recently, SVM has been successfully applied to many biological problems, such as predicting protein subcellular localization [101], protein secondary structures [102], tumor classification [103], and phosphorylation sites [78], and it has been shown to be an effective machine learning method. A public SVM library, namely LIBSVM [104], is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
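As a usage sketch, the following example trains an RBF-kernel SVM with scikit-learn, whose SVC class wraps the LIBSVM library; the toy data and parameter values are invented for illustration:

```python
from sklearn.svm import SVC   # scikit-learn's SVC wraps the LIBSVM library

# toy two-class training data
X = [[0, 0], [0.2, 0.1], [1, 1], [0.9, 1.2]]
y = [-1, -1, 1, 1]

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)  # C: penalty parameter, gamma: RBF kernel parameter
clf.fit(X, y)
print(clf.predict([[0.8, 0.9]]))           # expected to predict the positive class, [1]
```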

Boosting

The basic idea of boosting, and of ensemble learning algorithms in general, is to iteratively combine relatively simple base hypotheses, sometimes called rules of thumb, for the final prediction. One uses a so-called base learner that generates the base hypotheses. In boosting, the base hypotheses are linearly combined. In the case of two-class classification, the final prediction is the weighted majority of the votes. The combination of these simple rules can boost the performance drastically. It has been shown that boosting has strong ties to support vector machines and large margin classification (Rätsch, 2001; Meir and Rätsch, 2003).

Boosting techniques have been used on very high dimensional data sets and can quite easily deal with more than hundreds of thousands of examples. Research papers and implementations can be downloaded from http://www.boosting.org.
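A minimal AdaBoost-style sketch (one common boosting algorithm, not necessarily the exact variant discussed by Rätsch) with decision stumps as base hypotheses is shown below; the function names and toy data are assumptions made for illustration:

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) decision stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, rounds=10):
    """AdaBoost: labels y must be in {-1, +1}; returns a list of (alpha, stump) pairs."""
    w = np.full(len(y), 1.0 / len(y))               # uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump, err = train_stump(X, y, w)
        err = max(err, 1e-10)                       # guard against a zero-error stump
        alpha = 0.5 * np.log((1.0 - err) / err)     # weight of this base hypothesis
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)              # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """Final prediction: weighted majority, i.e. sign of the linear combination of base hypotheses."""
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))

# toy usage with a one-dimensional, linearly separable data set
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y), X))                   # -> [-1. -1.  1.  1.]
```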