

Chapter 1 Literature review

1.4 Machine learning algorithms

Because amino acid propensity scales could not be used directly for accurate continuous epitope prediction, machine learning approaches were adopted to improve prediction accuracy. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that learn from existing data and perform accurately on new, unseen data. Machine learning has become increasingly necessary in dealing with biological data because huge amounts of data are now available and need to be turned into useful information.

Machine learning algorithms have the potential to uncover important patterns or valuable knowledge embedded in vast amounts of data. Machine learning algorithms identify B-cell epitopes by inferring a function from the training data that maps inputs to desired outputs. The inferred function, obtained after training on a finite dataset, can then be applied to predict output values for new, unseen examples. Because the training examples are drawn from some probability distribution, the objective of the algorithm is to extract information about that distribution, which allows it to predict output values for a new dataset. Among the numerous machine learning algorithms, decision trees, k-nearest neighbor (k-NN), support vector machines (SVM), and artificial neural networks (ANN) have been applied to the prediction of continuous B-cell epitopes.

1.4.1 Decision tree

Decision tree learning classifies an example using a tree-like structure, in which each interior node corresponds to a test on an attribute and each branch represents an outcome of that test. Classification starts at the root node at the top of the tree, follows a path determined by the values of the input variables, and ends at a leaf node representing the predicted class. Several decision tree algorithms exist. Among them, the C4.5 algorithm builds decision trees from a training dataset based on information entropy. At each node of the tree, C4.5 chooses the attribute with the highest information gain for splitting the data into subsets; the chosen attribute splits the instances into the purest possible groups, i.e., those with the lowest entropy. The process is applied recursively, with the attribute with the next highest information gain chosen to make the next decision.
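To make the entropy criterion concrete, the following sketch computes the information gain of splitting a toy dataset on each attribute. This is illustrative Python written for this review; the function names and the toy data are invented, not taken from any published predictor.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr_index, labels):
    """Reduction in entropy obtained by splitting the examples on one attribute."""
    total = len(labels)
    # group the class labels by the attribute's value
    groups = {}
    for x, y in zip(examples, labels):
        groups.setdefault(x[attr_index], []).append(y)
    # weighted entropy of the subsets after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# toy data: attribute 0 perfectly predicts the class, attribute 1 is uninformative
X = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
y = ["pos", "pos", "neg", "neg"]
print(information_gain(X, 0, y))  # 1.0, a pure split
print(information_gain(X, 1, y))  # 0.0, no information
```

C4.5 would select attribute 0 here, since it yields the purest (zero-entropy) subsets.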

1.4.2 k-Nearest Neighbor (k-NN)

The k-NN algorithm belongs to the instance-based family of algorithms, which compare new, unseen examples with training examples. k-NN is sensitive to the local structure of the data, and it classifies new examples based on nearby training examples in the feature space. A new example is assigned the class most common amongst its k nearest neighbors; if k = 1, for instance, the example is simply assigned the class of its nearest neighbor. The Euclidean distance is commonly used as the distance metric.
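The classification rule above can be sketched in a few lines. This is illustrative Python with invented toy data and an assumed Euclidean metric, not the implementation of any particular epitope predictor.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training examples.
    train is a list of (feature_vector, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # sort the training set by distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"),
         ((1.0, 1.0), "pos"), ((0.9, 1.1), "pos"), ((1.2, 0.8), "pos")]
print(knn_classify(train, (1.0, 0.9), k=3))   # pos
print(knn_classify(train, (0.05, 0.1), k=1))  # neg
```

Note that no model is built ahead of time; all computation is deferred until a query arrives, which is characteristic of instance-based learning.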

1.4.3 Support Vector Machine (SVM)


SVM also performs classification in a feature space, where training examples are represented as points in space, mapped such that examples belonging to different classes can be separated by a hyperplane or a set of hyperplanes. New examples are mapped into the same space and assigned a predicted class based on which side of the hyperplane they fall on. It is possible, however, that no linear hyperplane can separate the data points of the different classes. In such a case, the original finite-dimensional space may be mapped into a higher-dimensional space, in the hope that the data become more easily separable there. The mapping is computed using a kernel function, whose choice depends on the problem and on the kind of information one expects to extract from the data.

A polynomial kernel, for example, is suited to problems where all the training data are normalized, whereas the radial basis function (RBF) kernel models spherical decision boundaries in a multi-dimensional space. The effectiveness of an SVM depends on the selection of the kernel and of the kernel's parameters.
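The two kernels mentioned above have simple closed forms, sketched below in illustrative Python. The parameter values (degree, c, gamma) are arbitrary choices for demonstration; in practice they are the tunable kernel parameters on which SVM performance depends.

```python
import math

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (x . z + c)^degree, an implicit map into a higher-dimensional space."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, -1.0), (1.0, 1.0)
print(polynomial_kernel(x, x))  # 9.0
print(rbf_kernel(x, x))         # 1.0
print(rbf_kernel(x, z))         # about 0.018
```

The kernel value plays the role of an inner product in the higher-dimensional space, so the separating hyperplane never has to be computed in that space explicitly (the so-called kernel trick).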

1.4.4 Artificial Neural Network (ANN)

ANN is a computational model inspired by the structure and operation of biological neural networks. In a biological neural network, neurons are connected to form a network in which each neuron collects input stimuli and sends an output signal to the next neuron. In an ANN, artificial neurons, also referred to as nodes, are connected to form a network that simulates the processing abilities of a biological neural system, such as learning and drawing conclusions from experience. A simple ANN is composed of three layers of neurons: the input layer, the hidden layer, and the output layer. The input layer collects data and sends it to the hidden layer via synapses. Each neuron in the hidden layer sums its inputs from the previous layer and converts the sum into an output activation. The data is then passed via further synapses to the output layer, where the results are presented to the user. The synapses between neurons carry weights, which modulate the data during the calculations. Depending on the interconnections between neurons, ANNs fall into two classes. The first is the feed-forward neural network (FNN), in which neurons in one layer are connected only to those in the next layer, so an FNN involves no feedback. In the second type, the output of each neuron is fed back to itself as well as to other neurons; such networks are referred to as recurrent neural networks.

In the supervised training of an ANN, the network is initialized by assigning small random weights to the synapses. The resulting output is compared with the observed data, and the network adjusts the weights to minimize the error between the predicted and actual outputs. The most frequently used learning method is backpropagation, in which the gradient of the error with respect to the network's modifiable weights is calculated and used to find weights that minimize the error.

When the model and learning algorithm are selected appropriately, an ANN is powerful in its ability to handle adaptive, parallel, and non-linear processes. Its adaptive nature is demonstrated through the adaptation of internal parameters, which enables the ANN to determine the relationships between different examples. In the parallel structure of an ANN, information processing is carried out by a great number of computational neurons, each of which sends excitatory or inhibitory signals to the other neurons in the network. Because the calculations are distributed among many neurons, a single neuron deviating from its expected behavior has little effect on the behavior of the network as a whole. Beyond that, the neurons in an ANN can be linear or non-linear; a network formed by interconnecting non-linear neurons is itself non-linear, and can therefore model the non-linear behavior found in most real situations. Taken together, this flexibility allows an ANN to provide solutions to complex problems.
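The forward pass and one backpropagation update for a small network with one hidden layer can be sketched as follows. This is illustrative Python with hand-picked initial weights, sigmoid activations, and a squared-error loss; all of these are common textbook choices assumed for the example, not the setup of any specific predictor.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """Forward pass through a 2-input, 2-hidden, 1-output network.
    Each weight row ends with a bias weight (its input is fixed at 1.0)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update of all weights under a squared-error loss."""
    hidden, out = forward(x, w_hidden, w_out)
    # error signal at the output neuron
    delta_out = (out - target) * out * (1.0 - out)
    # propagate the error signal back to each hidden neuron
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                    for j, h in enumerate(hidden)]
    # adjust the synapse weights down the error gradient
    for j, h in enumerate(hidden + [1.0]):
        w_out[j] -= lr * delta_out * h
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] -= lr * delta_hidden[j] * v

# small hand-picked initial weights; the target output for input [1.0, 0.0] is 1.0
w_h = [[0.1, -0.2, 0.0], [0.3, 0.2, 0.0]]
w_o = [0.2, -0.1, 0.0]
x, t = [1.0, 0.0], 1.0
before = abs(forward(x, w_h, w_o)[1] - t)
for _ in range(100):
    backprop_step(x, t, w_h, w_o)
after = abs(forward(x, w_h, w_o)[1] - t)
print(after < before)  # True: repeated updates reduce the error
```

Each call to `backprop_step` performs the comparison-and-adjustment cycle described above: the output is compared with the target, an error gradient is computed with respect to every weight, and each weight moves a small step against that gradient.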


1.5 Machine learning approaches for the prediction of continuous B-cell epitopes
