Classifier - LITERATURE REVIEW - Multiple Classifier Systems Based on Subspace Selection for High-Dimensional Data Classification

CHAPTER 2: LITERATURE REVIEW

2.4 Classifier

Once dimensionality reduction or other pre-processing procedures have transformed the original data into a suitable representation, a classifier can be constructed with any of a number of learning algorithms. Among these, the Bayes decision rule is one of the most commonly used decision rules.

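The Bayes decision rule is a probabilistic approach based on Bayes theorem. An input sample x is assigned to one of the L classes {ω_1, ..., ω_L} by the maximum a posteriori (MAP) rule

$$\omega_{MAP} = \arg\max_{\omega_i,\ 1 \le i \le L} \{ P(\omega_i \mid x) \} \tag{2.5}$$

where the posterior probability P(ω_i|x) may be calculated from the prior probability P_i and the class-conditional density function P(x|ω_i), using Bayes theorem, as

$$P(\omega_i \mid x) = \frac{P_i \, P(x \mid \omega_i)}{\sum_{j=1}^{L} P_j \, P(x \mid \omega_j)} \tag{2.6}$$

Since the mixture density function ∑_j P_j P(x|ω_j) is the same for every class and therefore does not affect the optimization, formula (2.5) can be simplified as

$$\omega_{MAP} = \arg\max_{\omega_i} \{ P_i \, P(x \mid \omega_i) \} \tag{2.7}$$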
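As a minimal sketch of the rule above, the following Python snippet evaluates the posteriors and the MAP class for a one-dimensional problem. The two normal class densities, the equal priors, and the function names `gaussian_pdf` and `map_decision` are assumptions made only for illustration; they are not taken from the thesis.

```python
import math

def gaussian_pdf(x, mean, std):
    """One-dimensional normal density, used here as a stand-in for P(x|w_i)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def map_decision(x, priors, densities):
    """Return (index of the MAP class, list of posteriors P(w_i|x))."""
    # Numerators P_i * P(x|w_i): the quantity maximized by the simplified rule.
    joint = [p * f(x) for p, f in zip(priors, densities)]
    evidence = sum(joint)  # mixture density, identical for every class
    posteriors = [j / evidence for j in joint]
    best = max(range(len(joint)), key=lambda i: joint[i])
    return best, posteriors

# Hypothetical two-class problem: equal priors, unit-variance classes at 0 and 2.
priors = [0.5, 0.5]
densities = [lambda x: gaussian_pdf(x, 0.0, 1.0),
             lambda x: gaussian_pdf(x, 2.0, 1.0)]
label, post = map_decision(0.5, priors, densities)
```

Because the evidence term is shared by all classes, the class that maximizes the joint term P_i P(x|ω_i) is also the class with the largest posterior, which is why the normalization can be skipped when only the decision is needed.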

Density estimates can be either parametric or nonparametric. Well-known classifiers based on the Bayes decision rule include the Gaussian classifier, which is parametric, and the k-nearest-neighbor and Parzen classifiers, which are nonparametric. Additionally, in hyperspectral image classification, Markov random field-based contextual classification, which uses both spectral and spatial information in statistics estimation and classification, is also based on the Bayes decision rule with a parametric model in which the class-conditional probability density is assumed to be normal. In the next four sections, these four classifiers will be introduced respectively.

Figure 2.6. An example of Bayes decision rule in one-dimensional space. In this case, the input sample x is assigned to class i according to formula (2.7).

Two other types of classifier are commonly used as the base classifier in multiple classifier systems: the decision tree classifier and the support vector machine. Their decision rules find the separating hyperplane(s) by selecting the most salient feature at each node of the tree and by maximizing the margin defined by the support vectors, respectively. These two classifiers are introduced in the last two sections of this chapter.

2.4.1 Gaussian classifier

The Gaussian classifier (GC) assumes that the samples of each class ω_i are drawn from a multivariate normal population with mean μ_i and covariance matrix Σ_i. Hence, the class-conditional probability density function can be represented as follows (Fukunaga, 1990):

$$P(x \mid \omega_i) = (2\pi)^{-n/2} \, |\Sigma_i|^{-1/2} \exp\left\{ -\tfrac{1}{2} d_i^2(x) \right\} \tag{2.8}$$

$$d_i^2(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \tag{2.9}$$

where n is the dimensionality of x.

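Moreover, combining the MAP estimate (2.7) with Equations (2.8) and (2.9) and taking the logarithm gives

$$\omega_{MAP} = \arg\max_{\omega_i} \left\{ \ln P_i - \tfrac{1}{2} \ln |\Sigma_i| - \tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \text{const.} \right\} \tag{2.10}$$

Here, the term const. does not depend on the particular class assignment.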
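The Gaussian decision rule can be sketched in a few lines of Python. This is an illustrative sketch only: it is restricted to two-dimensional features so that the covariance inverse has a closed form, it uses maximum-likelihood estimates of the mean and covariance, and the helper names (`fit_gaussian`, `discriminant`, `classify`) and the toy data are hypothetical.

```python
import math

def fit_gaussian(samples):
    """ML estimates of the mean and the 2x2 covariance (sxx, sxy, syy) of 2-D samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / n
    syy = sum((s[1] - my) ** 2 for s in samples) / n
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    return (mx, my), (sxx, sxy, syy)

def discriminant(x, prior, mean, cov):
    """ln P_i - 0.5*ln|S_i| - 0.5*d_i^2(x); the class-independent constant is dropped."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed positive (non-singular covariance)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * d2

def classify(x, models):
    """models: list of (prior, mean, cov); returns the index of the best class."""
    scores = [discriminant(x, p, m, c) for p, m, c in models]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical training data for two classes with equal priors.
class_0 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_1 = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
models = [(0.5,) + fit_gaussian(class_0), (0.5,) + fit_gaussian(class_1)]
```

With high-dimensional data and few training samples the estimated covariance can become singular, in which case the determinant and the inverse used here are undefined; that situation is outside the scope of this sketch.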

2.4.2 k-nearest-neighbor classifier

The general idea of the k-nearest-neighbor classifier (kNN) is to find the k nearest neighbors of an input sample x in the training set and then assign x to the most frequent class among them. From a probabilistic perspective, the kNN rule extends a local region ν around x until the kth nearest neighbor is found. Hence, if the density is high around x, the region ν is relatively small, which provides good resolution. Conversely, if the density is low, the region ν grows large and stops only when it reaches regions of higher density (Duda, Hart, & Stork, 2001).

Figure 2.7. The class-conditional region around x of the kNN estimate for the case of k = 3. The region (left) centered at x and extending to the third nearest neighbor in class i is smaller than the corresponding region (right) in class j.

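The class-conditional density estimate of kNN can be represented as (Fukunaga, 1990)

$$\hat{P}(x \mid \omega_i) = \frac{k - 1}{N_i \, v_i(x)} \tag{2.11}$$

where N_i is the number of training samples in class ω_i and v_i(x) is the volume of the region around x that extends to the kth nearest neighbor in class ω_i. Therefore, the MAP estimate (2.7) using the class-conditional distribution of kNN becomes

$$\omega_{MAP} = \arg\max_{\omega_i} \left\{ P_i \, \frac{k - 1}{N_i \, v_i(x)} \right\} \tag{2.12}$$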
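The voting rule and the growing region around x described in this section can be illustrated with a short Python sketch. The Euclidean distance, the function names `knn_classify` and `kth_neighbor_radius`, and the toy training set are assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_classify(x, training, k):
    """training: list of (feature_tuple, label); majority vote among the k nearest."""
    by_distance = sorted(training, key=lambda s: math.dist(x, s[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def kth_neighbor_radius(x, samples, k):
    """Radius of the region around x that reaches the kth nearest sample."""
    distances = sorted(math.dist(x, s) for s in samples)
    return distances[k - 1]

# Hypothetical training set with two well-separated classes.
training = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
            ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
x = (0.5, 0.5)
class_a = [f for f, label in training if label == "a"]
class_b = [f for f, label in training if label == "b"]
radius_a = kth_neighbor_radius(x, class_a, 3)
radius_b = kth_neighbor_radius(x, class_b, 3)
```

For this x the region needed to reach the third neighbor in class "a" is much smaller than the one for class "b", which corresponds to a higher estimated class-conditional density for "a", the situation depicted in Figure 2.7.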

2.4.3 Parzen classifier

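The Parzen window classifier essentially superposes kernel functions placed at each training sample. The class-conditional density at a point x is the normalized sum of the contributions of all observations of the class, and for a given kernel function κ(·) it can be defined as follows (Fukunaga, 1990):

$$\hat{P}(x \mid \omega_i) = \frac{1}{N_i} \sum_{j=1}^{N_i} \kappa\left(x - x_j^{(i)}\right) \tag{2.13}$$

where N_i is the number of training samples in class ω_i and x_j^{(i)} is the jth training sample of that class. In principle, the kernel function can have any complex shape. However, in a high-dimensional space, because of its complexity, the practical selection of the kernel function is limited to either a normal or a uniform kernel. The following kernel, which includes both normal and uniform kernels as special cases in n-dimensional space, is used:

$$\kappa(x) = \frac{m \, \Gamma\left(\frac{n}{2}\right) \Gamma^{n/2}\left(\frac{n+2}{2m}\right)}{(n\pi)^{n/2} \, \Gamma^{n/2+1}\left(\frac{n}{2m}\right)} \cdot \frac{1}{r^n |A|^{1/2}} \exp\left\{ -\left[ \frac{\Gamma\left(\frac{n+2}{2m}\right)}{n \, \Gamma\left(\frac{n}{2m}\right)} \, x^T (r^2 A)^{-1} x \right]^m \right\} \tag{2.14}$$

where Γ(·) is the gamma function, m determines the shape of the kernel function, r controls the size of the kernel, and A determines the shape of the hyper-ellipsoid. For m = 1, (2.14) reduces to a normal kernel; as m becomes large, (2.14) approaches a uniform kernel.

Using the Parzen density estimate, the MAP estimate in connection with Equations (2.13) and (2.14) becomes

$$\omega_{MAP} = \arg\max_{\omega_i} \left\{ \frac{P_i}{N_i} \sum_{j=1}^{N_i} \kappa\left(x - x_j^{(i)}\right) \right\} \tag{2.15}$$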
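A minimal Python sketch of Parzen classification follows. It assumes the simplest special case of the kernel, an isotropic normal kernel of width `h` (that is, m = 1 with an identity shape matrix), and the function names and toy data are hypothetical.

```python
import math

def normal_kernel(u, h):
    """Isotropic normal kernel of width h evaluated at the difference vector u."""
    n = len(u)
    sq = sum(c * c for c in u)
    return math.exp(-0.5 * sq / (h * h)) / ((2.0 * math.pi) ** (n / 2.0) * h ** n)

def parzen_density(x, samples, h):
    """Average of the kernels centered at each training sample of one class."""
    total = sum(normal_kernel([a - b for a, b in zip(x, s)], h) for s in samples)
    return total / len(samples)

def parzen_classify(x, class_samples, priors, h):
    """class_samples: one list of samples per class; returns the MAP class index."""
    scores = [p * parzen_density(x, s, h) for p, s in zip(priors, class_samples)]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical two-class training data with equal priors.
class_samples = [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
                 [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]]
priors = [0.5, 0.5]
```

The width `h` plays the role of the size parameter of the kernel: a small value gives a spiky estimate that follows individual samples, while a large value oversmooths the density.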

2.4.4 Decision tree classifier

The decision tree is the most widely used base classifier for constructing ensembles, and it can be regarded as a non-metric method for classification (Duda, Hart, & Stork, 2001). A conventional tree is made up of nodes, branches, and leaves, as shown in Figure 2.8.

Figure 2.8. Classification in a decision tree proceeds from top to bottom through a sequence of questions. The first of a sequence of decisions leading finally to a terminal node is made at the root node by asking a question with different answers. Based on the answer, we select the appropriate branch and visit the child node. Another decision is made at this node, and so on, until a leaf is reached. The leaf contains a single class label, which is assigned to the object being classified.

CART (classification and regression tree) (Breiman, Friedman, Olshen, & Stone, 1984), a common way of automatically building decision trees, is used in this thesis.

[Figure 2.8 labels: Root. Each node tests an attribute; each branch corresponds to an attribute value; each leaf assigns a classification.]
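To make the idea of selecting the most salient feature at a node concrete, the sketch below searches for the best binary split of a small dataset. It assumes the Gini impurity, the criterion commonly associated with CART; the text does not state which impurity measure was used in the thesis, and the function names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and midpoint threshold; keep the lowest weighted impurity."""
    best = None  # (weighted impurity, feature index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [(1.0, 7.0), (2.0, 3.0), (3.0, 8.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

A full tree is obtained by applying the same search recursively to the left and right subsets until a stopping rule is met, for example when a node contains a single class.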

2.4.5 Support vector machine

The support vector machine (SVM) (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995), a successful learning algorithm commonly used for classification and regression problems, is motivated by designing a linear discriminant function that takes the margin into account, as illustrated in Figure 2.9.

Figure 2.9. Data from two different categories {−1,+1} can be well separated by an optimal hyperplane with an appropriate nonlinear mapping ϕ, which embeds the data from the original feature space (left) into a sufficiently high-dimensional space (right).

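Given a training dataset of instance-label pairs (x_i, y_i), i = 1, ..., N, where x_i ∈ ℜⁿ and y_i ∈ {−1, +1}, the SVM finds the hyperplane w^Tϕ(x) + b = 0 that maximizes the margin, and it requires the solution of the following optimization problem:

$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N \tag{2.19}$$

where C and ξ_i are the penalty parameter and the slack variables, respectively, of the soft-margin SVM. Using the Kuhn-Tucker theorem, the optimization of (2.19) can be reformulated as the following dual problem with respect to the undetermined multipliers α_i ≥ 0:

$$\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

where K(x_i, x_j) = ϕ(x_i)^Tϕ(x_j) is the kernel function. The kernel function will utilize the radial basis function (RBF) as the following:

$$K(x_i, x_j) = \exp\left( -\gamma \, \| x_i - x_j \|^2 \right), \quad \gamma > 0$$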
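The sketch below shows how a trained kernel SVM classifies a new sample with the RBF kernel. It assumes the standard decision function f(x) = sign(∑_i α_i y_i K(x_i, x) + b), which follows from the dual solution but is not written out in the text, and the common parameterization K(a, b) = exp(−γ‖a − b‖²). The support vectors, multipliers, bias, and γ are hand-picked hypothetical values, not the result of solving the dual problem.

```python
import math

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * sq)

def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return bias + sum(a * y * rbf_kernel(sv, x, gamma)
                      for sv, y, a in zip(support_vectors, labels, alphas))

def predict(x, support_vectors, labels, alphas, bias, gamma):
    """Class label in {-1, +1} given by the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if value >= 0 else -1

# Hypothetical model: multipliers chosen by hand for illustration only.
svs = [(0.0, 0.0), (2.0, 0.0)]
ys = [1, -1]
alphas = [1.0, 1.0]  # satisfies the dual constraint sum_i alpha_i * y_i = 0
bias, gamma = 0.0, 0.5
```

Only training samples with nonzero multipliers (the support vectors) contribute to the sum, so the cost of a prediction grows with the number of support vectors rather than with the size of the training set.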

CHAPTER 3: CLUSTER BASED DYNAMIC SUBSPACE
