Pattern Classification - Statistical Methods for Emotion Detection, Gene Expression Clustering, and Biological Network Reconstruction


2.4 Pattern Classification

Machine learning tools can be applied to discriminate emotional states from physiological signals. After the daily and personal correction, we used the estimator X_ijkl = Z_ijkl − Z̄_ijk + Z̄_i as the attribute for pattern classification of the emotional state Y_jkl, which represents the emotional state of the j^th subject on the k^th day for the l^th sample. We let the variable Y represent the emotional state and the variable X_i represent the value of the i^th feature after the daily and personal correction. Six selected classifiers were tested for their performance and accuracy using leave-one-out cross-validation. All six classification methods were run in the software Weka (http://www.cs.waikato.ac.nz/ml/weka) with its default options. Other classifier options in Weka could be investigated in the future. The classifiers are described below.
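The correction above can be illustrated with a minimal Python sketch for a single feature i. It assumes that Z̄_ijk is the mean over the samples of one subject on one day and that Z̄_i is the mean over all samples of the feature; the function name, the record layout, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def correct_feature(records):
    """Apply X = Z - (subject, day) mean + overall mean to one feature.

    records: list of (subject, day, value) tuples for a single feature.
    Assumption: the subject-day mean and the overall mean are plain averages.
    """
    by_day = defaultdict(list)
    for subject, day, z in records:
        by_day[(subject, day)].append(z)
    day_mean = {key: mean(values) for key, values in by_day.items()}
    overall_mean = mean(z for _, _, z in records)
    return [z - day_mean[(s, d)] + overall_mean for s, d, z in records]

# Made-up measurements: two subjects, one day each, two samples per day.
records = [("s1", 1, 10.0), ("s1", 1, 12.0), ("s2", 1, 20.0), ("s2", 1, 24.0)]
corrected = correct_feature(records)
```

After the correction, every subject-day group has the same mean (the overall mean), so only the within-day variation remains as input to the classifiers.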

2.4.1 Bayesian Network

A Bayesian network, also called a Bayes net, is a model defined on a directed acyclic graph (DAG) and consists of two components. The first component G comprises vertices corresponding to a set of variables V = {V₁, V₂, ..., V_N} and a set of directed edges between the variables that satisfy the Markov properties. The second component θ attaches a potential table P(V_i | U_{V_i}) to each variable V_i in V, where U_{V_i} denotes the parent nodes of V_i (Pearl, 1988; Jensen, 2001). Given the structure G and the parameter θ, the joint probability distribution can be written as Eq. (2.3):

P(V) = \prod_{i=1}^{N} P(V_i \mid U_{V_i}).    (2.3)

To learn a Bayesian network, we have to reconstruct the network structure and estimate the entries of the potential tables. In this study, we applied the hill climbing algorithm and the simple estimator to reconstruct the network and estimate the parameters. After obtaining the network structure, we used the junction tree method, which converts the DAG to a tree by clustering variables (Lauritzen and Spiegelhalter, 1988). An efficient belief propagation algorithm can then be applied for inference. In our study, we used the estimators X₁, X₂, ..., X_I and Y as the prediction variables V = {V₁, V₂, ..., V_{I+1}} and calculated the conditional distribution of Y given the observations X₁, X₂, ..., X_I in the constructed Bayesian network structure.

[Figure 2.5: The network structure of the naive Bayesian classifier. The attribute nodes X₁, X₂, ..., X_n carry the conditional probabilities P(X₁|Y), P(X₂|Y), ..., P(X_n|Y).]
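The factorization of the joint distribution and the conditional distribution of Y given fully observed attributes can be illustrated with a small Python sketch. It uses brute-force enumeration over the states of Y rather than the junction tree method, and the network, the state names, and all probabilities are made up for illustration.

```python
def joint_probability(assignment, parents, cpt):
    """Product over all variables of P(V_i | parents of V_i)."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[var][parent_values][value]
    return p

def posterior(target, evidence, parents, cpt):
    """P(target | evidence); evidence must fix every other variable."""
    states = next(iter(cpt[target].values())).keys()
    weights = {s: joint_probability({**evidence, target: s}, parents, cpt)
               for s in states}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical network: Y is the only parent of X1 and of X2.
parents = {"Y": (), "X1": ("Y",), "X2": ("Y",)}
cpt = {
    "Y": {(): {"calm": 0.5, "stressed": 0.5}},
    "X1": {("calm",): {"low": 0.8, "high": 0.2},
           ("stressed",): {"low": 0.25, "high": 0.75}},
    "X2": {("calm",): {"low": 0.5, "high": 0.5},
           ("stressed",): {"low": 0.25, "high": 0.75}},
}

evidence = {"X1": "high", "X2": "high"}
p_joint = joint_probability({"Y": "calm", **evidence}, parents, cpt)
p_y = posterior("Y", evidence, parents, cpt)
```

Enumeration is only practical for small networks with full evidence; for larger networks or partially observed attributes, clustering variables into a junction tree avoids summing over every joint state.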

2.4.2 Naive Bayesian

A naive Bayesian classifier is a simple approach based on Bayes' theorem. The network structure is illustrated in Figure 2.5. The naive Bayesian classifier makes two assumptions (John and Langley, 1995): (i) given the class attribute (Y), the predictive attributes (X₁, X₂, ..., X_I) are independent; (ii) no other attributes affect the prediction process. By Bayes' theorem,

P(Y = y \mid X = x) = \frac{P(Y = y)\, P(X = x \mid Y = y)}{P(X = x)}.    (2.4)

We can predict the class attribute by finding y that maximizes P (Y = y|X = x) in Eq. (2.4) given the predictive attributes x. As the predictive attributes (X₁, X₂, ..., X_I) are assumed to be conditionally independent, we have

P(X = x \mid Y = y) = \prod_{i=1}^{I} P(X_i = x_i \mid Y = y).    (2.5)

For the numeric attributes, we assume that X_i is distributed as N(µ_iy, σ_iy²) given the class Y = y for every i = 1, 2, ..., I. Hence, the parameters µ_iy and σ_iy² can be estimated by maximum likelihood within each class.
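A minimal Python sketch of the Gaussian naive Bayesian classifier described above is given below. It is not the Weka implementation; the function names, the small variance floor, and the toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class priors and the per-class mean and variance of each attribute."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            # Maximum likelihood variance (divide by n, not n - 1).
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, max(var, 1e-9)))  # floor is an assumption
        model[label] = (len(rows) / len(y), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class y that maximizes log P(Y = y) + sum_i log P(x_i | y)."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Made-up training data with two attributes and two classes.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]]
y = ["a", "a", "b", "b"]
model = fit_gaussian_nb(X, y)
pred_low = predict_gaussian_nb(model, [1.1, 2.0])
pred_high = predict_gaussian_nb(model, [3.1, 4.0])
```

The denominator P(X = x) in Eq. (2.4) is the same for every class, so the sketch compares only the numerators, on the log scale to avoid numerical underflow.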

2.4.3 Support Vector Machine

The support vector machine (SVM) (Vapnik, 1998) is a popular classification method that is widely used in current research on emotion recognition (Kim et al., 2004; Chuang and Shih, 2006). Suppose {(x^∗_1, y^∗_1), (x^∗_2, y^∗_2), ..., (x^∗_n, y^∗_n)} is the training set, where y^∗_i is 1 or -1, denoting which of the two classes x^∗_i belongs to. The SVM aims to minimize the cost function (1/2) w^T w + C Σ_{i=1}^{n} ξ_i under the constraints y^∗_i (w^T x^∗_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, 2, ..., n. By using the Lagrange multiplier method, the original problem can be transformed into optimizing the α_i's in Eq. (2.6).
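For reference, the textbook Lagrangian dual of this soft-margin problem, written with a kernel function K, has the following form. It is given here as the standard formulation and may differ in notation from Eq. (2.6).

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i^{*} y_j^{*} \, K(x_i^{*}, x_j^{*})
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i^{*} = 0 .
```

Only the training points with α_i > 0 (the support vectors) contribute to the resulting classifier.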

After obtaining the α_i, we can apply the following decision function to predict the class of a new predictive attribute x^∗_new: f(x^∗_new) = sign(Σ_{i=1}^{n} y^∗_i α_i K(x^∗_new, x^∗_i) + b), where K(·, ·) is the kernel function. In this study, we used the Gaussian kernel and the sequential minimal optimization (SMO) algorithm (Keerthi et al., 2001).
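The decision function can be sketched in Python as follows. The multipliers, the offset b, and the kernel width gamma are hand-picked for illustration rather than obtained by SMO, and mapping a zero score to the class +1 is an arbitrary convention of this sketch.

```python
import math

def gaussian_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x_new, support_x, support_y, alphas, b, gamma):
    """sign(sum_i y_i * alpha_i * K(x_new, x_i) + b), returned as +1 or -1."""
    score = b + sum(
        y_i * a_i * gaussian_kernel(x_new, x_i, gamma)
        for x_i, y_i, a_i in zip(support_x, support_y, alphas)
    )
    return 1 if score >= 0 else -1

# Hypothetical support vectors and multipliers (not the result of training).
support_x = [[0.0, 0.0], [2.0, 2.0]]
support_y = [-1, 1]
alphas = [1.0, 1.0]

near_positive = svm_decision([1.8, 2.1], support_x, support_y, alphas, b=0.0, gamma=0.5)
near_negative = svm_decision([0.1, -0.2], support_x, support_y, alphas, b=0.0, gamma=0.5)
```

With the Gaussian kernel, each support vector influences mainly the points close to it, so a new point is assigned to the class of the support vectors that dominate its neighborhood.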

In addition, because our problem has multiple classes (three emotional states), we used pairwise classification, that is, the one-against-one approach, in the SVM classification method.
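The one-against-one scheme can be sketched in Python as a majority vote over all pairs of classes. The class labels, the toy threshold classifiers standing in for the pairwise SVMs, and the tie-breaking rule (earliest class in the list) are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise_classifiers):
    """Each pairwise classifier votes for one of its two classes; most votes wins."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_classifiers[(a, b)](x)] += 1
    best = max(votes.values())
    # Assumed tie-breaking: the earliest class in `classes` with the top count.
    return next(c for c in classes if votes[c] == best)

# Hypothetical labels and 1-D threshold rules in place of trained binary SVMs.
classes = ["A", "B", "C"]
pairwise_classifiers = {
    ("A", "B"): lambda x: "A" if x < 1.0 else "B",
    ("A", "C"): lambda x: "A" if x < 2.0 else "C",
    ("B", "C"): lambda x: "B" if x < 3.0 else "C",
}

label_low = one_vs_one_predict(0.5, classes, pairwise_classifiers)
label_mid = one_vs_one_predict(2.5, classes, pairwise_classifiers)
label_high = one_vs_one_predict(5.0, classes, pairwise_classifiers)
```

For three classes this requires three binary classifiers, one for each pair of classes.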

2.4.4 Decision Tree of C4.5

The decision tree is also a common classification method (Hunt et al., 1966). C4.5 builds a hierarchical data structure using a divide-and-conquer strategy to grow decision trees (Quinlan, 1993). In a decision tree, each decision node uses a test function T with m outcomes to partition the original data D into subsets D₁, D₂, ..., D_m. Suppose the set D consists of C classes and p(D, j) denotes the proportion of cases in D that belong to the j^th class. We can define the information gain by a test T with m outcomes as Eq. (2.7):

Gain(D, T) = Info(D) - \sum_{i=1}^{m} \frac{|D_i|}{|D|} Info(D_i), where Info(D) = -\sum_{j=1}^{C} p(D, j) \log_2 p(D, j).    (2.7)

The information gain favors tests with many outcomes: it is maximal when there is one case left in each subset D_i. The split information is defined as Eq. (2.8):

Split(D, T) = -\sum_{i=1}^{m} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}.    (2.8)

For every possible test, the ratio of its information gain over its split information is assessed and the test with maximum gain ratio is selected.
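The gain ratio criterion can be illustrated with a short Python sketch. It assumes the usual entropy-based definitions of information gain and split information from C4.5 (Quinlan, 1993); the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of the class proportions in a set of cases."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Information gain of a partition divided by its split information."""
    n = len(labels)
    nonempty = [s for s in subsets if s]
    gain = info(labels) - sum(len(s) / n * info(s) for s in nonempty)
    split = -sum(len(s) / n * math.log2(len(s) / n) for s in nonempty)
    return gain / split if split > 0 else 0.0

labels = ["a", "a", "b", "b"]
# A test with two pure outcomes versus a test that isolates every case.
two_way = gain_ratio(labels, [["a", "a"], ["b", "b"]])
one_case_each = gain_ratio(labels, [["a"], ["a"], ["b"], ["b"]])
```

Both tests reach the same information gain of one bit, but the test that leaves one case in each subset has twice the split information, so its gain ratio is halved and the two-way test is preferred.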

2.4.5 Logistic Model

Logistic regression is a classical method for modeling categorical data for classification (Le Cessie and Van Houwelingen, 1992). Suppose there are n samples with c classes and I attributes. The parameter matrix B is an I × (c − 1) matrix whose j^th column is B_j. The probability that the i^th sample, given the value of x^∗_i, belongs to the j^th class, for any class other than the last (c^th) one, is shown in Eq. (2.9).

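P_j(x_i^{*}) = \frac{\exp(x_i^{*} B_j)}{\sum_{k=1}^{c-1} \exp(x_i^{*} B_k) + 1}, where j = 1, 2, ..., c − 1.    (2.9)

The probability that the i^th sample, given the value of x^∗_i, belongs to the last (c^th) class is shown in Eq. (2.10).

P_c(x_i^{*}) = 1 - \sum_{j=1}^{c-1} P_j(x_i^{*}) = \frac{1}{\sum_{k=1}^{c-1} \exp(x_i^{*} B_k) + 1}.    (2.10)

The log-likelihood l of the data (K, X) under this model is shown in Eq. (2.11).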

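l(β) = \sum_{i=1}^{n} \left[ \sum_{j=1}^{c-1} K_{ij}^{*} \log P_j(x_i^{*}) + \left(1 - \sum_{j=1}^{c-1} K_{ij}^{*}\right) \log\left(1 - \sum_{j=1}^{c-1} P_j(x_i^{*})\right) \right].    (2.11)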

The indicator variable K_ij^∗ = 1 if the i^th sample belongs to the j^th class, where j ≠ c, and K_ij^∗ = 0 otherwise; in particular, K_ij^∗ = 0 for every j if the i^th sample belongs to the last (c^th) class. The parameter matrix B can be estimated by maximum likelihood, that is, by maximizing the log-likelihood l(β).
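The class probabilities and the log-likelihood of this model can be sketched in Python as follows. The sketch omits an intercept and any penalty term, stores B as a list of its c − 1 columns, and codes the last class as index c − 1; these choices and the toy data are assumptions for illustration.

```python
import math

def class_probabilities(x, B):
    """Return [P_1, ..., P_c] for one sample x; the last class is the reference.

    B is a list of c - 1 coefficient vectors (the columns of the matrix B).
    """
    scores = [math.exp(sum(x_i * b_i for x_i, b_i in zip(x, column))) for column in B]
    denominator = sum(scores) + 1.0
    return [s / denominator for s in scores] + [1.0 / denominator]

def log_likelihood(X, labels, B):
    """Sum over samples of the log probability of the observed class.

    labels[i] is the class index of sample i in 0..c-1 (c-1 is the last class).
    """
    return sum(math.log(class_probabilities(x, B)[k]) for x, k in zip(X, labels))

# Made-up data: I = 2 attributes, c = 3 classes.
X = [[1.0, 0.5], [0.2, 1.5]]
labels = [0, 2]
B_zero = [[0.0, 0.0], [0.0, 0.0]]
B_other = [[1.0, -0.5], [0.3, 0.8]]

probs_zero = class_probabilities(X[0], B_zero)
probs_other = class_probabilities(X[0], B_other)
ll_zero = log_likelihood(X, labels, B_zero)
```

Picking the log probability of the observed class for each sample is equivalent to the indicator form of the log-likelihood, because exactly one class is observed per sample. With all coefficients equal to zero, every class receives probability 1/c.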

2.4.6 K-Nearest Neighbor (KNN)

The k-nearest neighbor (KNN) algorithm is one of the classical classification methods with wide applications (Aha et al., 1991). KNN compares the similarity between a testing datum and every training datum. It then uses the categories of the k most similar training data to decide the category of the testing datum by a weighted vote. For any testing datum H and training data {G₁, G₂, ..., G_n}, we classify the category of H as in Eq. (2.12).

C(H) = \arg\max_{C_m} \sum_{G_i \in S} Sim(H, G_i)\, I(G_i, C_m).    (2.12)

Here Sim(H, G_i) is the similarity measure between H and G_i. The set S = {G̃₁, G̃₂, ..., G̃_k} contains the k training data closest to the testing point H, and I(G_i, C_m) ∈ {0, 1} indicates whether G_i belongs to the category C_m. If there are ties in the classification, we assign the testing datum to the tied category with the smallest index. In this study, we used the Euclidean distance as the similarity measure and chose the number of nearest neighbors k = 3.
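The voting rule, together with the leave-one-out cross-validation used to evaluate the classifiers, can be sketched in Python as follows. The text does not state how the Euclidean distance d is turned into a vote weight, so the similarity 1 / (1 + d) used here is an assumption, as are the integer category indices and the toy data.

```python
import math

def knn_predict(h, train_x, train_y, k=3):
    """Weighted vote among the k training points closest to h."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(h, train_x[i]))
    scores = {}
    for i in order[:k]:
        similarity = 1.0 / (1.0 + math.dist(h, train_x[i]))  # assumed weighting
        scores[train_y[i]] = scores.get(train_y[i], 0.0) + similarity
    best = max(scores.values())
    # Ties go to the category with the smallest index.
    return min(c for c, s in scores.items() if s == best)

def leave_one_out_accuracy(X, y, k=3):
    """Classify each sample using all remaining samples as training data."""
    hits = 0
    for i in range(len(X)):
        rest_x = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(X[i], rest_x, rest_y, k) == y[i]
    return hits / len(X)

# Made-up data: two well-separated categories indexed 0 and 1.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

predicted = knn_predict([0.2, 0.1], X, y)
accuracy = leave_one_out_accuracy(X, y)
```

Any weighting that decreases with distance fits Eq. (2.12); setting the similarity to 1 for every neighbor reduces the rule to a plain majority vote.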

Chapter 3 Data Collection and Analysis on
