
A Dynamic Subspace Method for Hyperspectral Image Classification

Jinn-Min Yang, Bor-Chen Kuo, Pao-Ta Yu, Member, IEEE, and Chun-Hsiang Chuang

Abstract—Many studies have demonstrated that multiple classifier systems, such as the random subspace method (RSM), obtain more outstanding and robust results than a single classifier on extensive pattern recognition issues. In this paper, we propose a novel subspace selection mechanism, named the dynamic subspace method (DSM), to improve RSM on automatically determining dimensionality and selecting component dimensions for diverse subspaces. Two importance distributions are proposed to impose on the process of constructing ensemble classifiers. One is the distribution of subspace dimensionality, and the other is the distribution of band weights. Based on the two distributions, DSM becomes an automatic, dynamic, and adaptive ensemble. The real data experimental results show that the proposed DSM obtains better performance than RSM, and that the classification maps produce remarkably fewer speckles.

Index Terms—Kernel smoothing (KS), random subspace method (RSM), small sample size (SSS) classification.

I. INTRODUCTION

In hyperspectral imaging, data from the new generation sensors consist of a large number of spectral bands that provide the potential to improve the discrimination of objects. However, one of the difficulties for supervised classification inhibiting this potential is the constraint of training sample size because the ground truth is generally expensive and difficult to acquire. Therefore, we have to face the small sample size (SSS) problem, that is, the number of available training samples is much smaller than the dimensionality. Under this circumstance, the generalization ability of the resulting classifier is weak, and the variances of its classification results are large [1], [2]. In other words, the classifier suffers from the well-known Hughes phenomenon [3] or the curse of dimensionality [4] in classification results.

Manuscript received April 2, 2009; revised August 12, 2009, December 7, 2009, and January 12, 2010. Date of publication April 5, 2010; date of current version June 23, 2010. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 98-2221-E-142-005.

J.-M. Yang is with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan, and also with the Department of Mathematics Education, National Taichung University, Taichung 403, Taiwan.

B.-C. Kuo is with the Graduate School of Educational Measurement and Statistics, National Taichung University, Taichung 40306, Taiwan (e-mail: kbc@mail.ntcu.edu.tw).

P.-T. Yu is with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan.

C.-H. Chuang is with the Institute of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu 300, Taiwan.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2010.2043533

The random subspace method (RSM) proposed by Ho [5], [6] is one of the multiple classifier systems, providing a way of alleviating sample size and high-dimensionality concerns. It is a general technique that can be used with any type of base classifier [7]–[12]. Moreover, much research [13]–[15] has demonstrated its validity for hyperspectral image classification. In RSM, each weak classifier is constructed in a subspace with bands randomly selected from the original ones, and the subspace dimensionality is usually predefined. Then, a final decision rule of the weak classifiers is obtained by a simple majority vote. However, there are two inadequacies in RSM. One is that the dimensionality of the subspace is not clearly defined, and the other is its random rule for selecting bands.

Ho suggests that desirable results are obtained by setting the dimensionality of subspace to approximately half of the dimensionality of original space [5], [16]. This result is based on the decision tree classifier, but it may not be extended to all kinds of classifiers. For instance, the suitable dimensionality of subspaces for a maximum likelihood (ML) classifier depends on the size of the training samples. The question of how to choose a suitable subspace size for the employed classifier will then arise. In addition, the random strategy assumes that the selected probability of each band to form a subspace is the same, but the discriminating power of each band is actually different.

In this paper, we propose the dynamic subspace method (DSM) for constructing component classifiers with adaptive subspaces to remedy the shortcomings of RSM. DSM works on the basis of two major distributions, namely, W and R, denoting the distributions of band weights and subspace dimensionality, respectively. The component bands forming a subspace are selected with probabilities given by the W distribution, and the number of selected bands is automatically determined from the R distribution. In fact, the R distribution records the importance of all possible subspace sizes, which is estimated by the kernel density estimation technique [17], [18] from the resubstitution performances of a few candidate dimensionalities. Most importantly, it is updated during the training process of constructing DSM. Compared to a heuristic search or manual tuning, this scheme selects the applicable dimensionality dynamically with respect to the employed classifiers.

Recently, classification techniques integrating both spectral and spatial information have developed rapidly for hyperspectral image classification [19]–[22], where the Markov random field (MRF) is one of the popular models to exploit the spatial context between neighboring pixels in an image. In this study, an MRF-based contextual classifier [23] is also applied within the proposed DSM as the base learner because it, too, suffers from the SSS problem.


Fig. 1. Framework of producing ensemble classifiers by RSM.

Fig. 2. Framework of producing ensemble classifiers by DSM, where the update process is to estimate the resubstitution accuracy of h_k as the feedback to update the R distribution.

The rest of this paper is organized as follows. A brief review of the RSM will be described in Section II. New methods will be derived in Section III. For evaluating the performance of the proposed method, real hyperspectral image data experiments are designed in Section IV, and the experimental results are reported in Section V. Section VI contains some comments and conclusions.

II. RSM

The RSM proposed by Ho [5], [6] is an ensemble technique based on random band selection (RBS). Let D = {(x_i, c_i) | 1 ≤ i ≤ N} represent the original p-dimensional data set composed of N training samples, where x_i ∈ ℝ^p with class label c_i ∈ C = {1, 2, . . . , L}, and L is the total number of classes. In RSM, given a predefined subspace dimensionality r < p, the RBS process randomly selects r bands from the original p-dimensional space such that D reduces to D̃ = RBS(D, r) = {(x̃_i, c_i) | 1 ≤ i ≤ N}, where x̃_i ∈ ℝ^r; then, D̃ is passed to the learning algorithm Ψ, which outputs a classifier h = Ψ(D̃). This process is repeated B times to construct the ensemble of classifiers H = {h_1, h_2, . . . , h_B}. In the classification procedure, the class labels of a test sample Y predicted by these classifiers are combined by simple majority voting to obtain the final decision F = arg max_{c ∈ {1,2,...,L}} card({k | h_k(Y) = c, k = 1, 2, . . . , B}), where card(A) denotes the cardinality of the set A. The framework of RSM is shown in Fig. 1, where D̃_k denotes the kth reduced-dimensional data set of the original data set D.

The RSM has been theoretically and experimentally proven to be beneficial for the SSS problem, but there are still two prominent weaknesses that need to be improved. One is that the dimensionality of the subspace is fixed and needs to be predefined, generally by trial and error; the other is that its randomized band selection mechanism gives informative and noninformative bands the same probability of being selected. In Section III, a novel multiple classifier system, named DSM, will be proposed to overcome these weaknesses of RSM.
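To make the procedure above concrete, the following is a minimal sketch of RSM, not the authors' implementation: it assumes scikit-learn-style estimators with fit/predict methods and integer class labels, and the names rsm_train, rsm_predict, and base_learner are introduced here for illustration only.

    import numpy as np
    from sklearn.base import clone
    from sklearn.neighbors import KNeighborsClassifier

    def rsm_train(X, y, base_learner, r, B, rng=np.random.default_rng(0)):
        """Train B weak classifiers, each on r bands drawn uniformly without replacement."""
        p = X.shape[1]
        ensemble = []
        for _ in range(B):
            bands = rng.choice(p, size=r, replace=False)    # RBS: uniform band selection
            h = clone(base_learner).fit(X[:, bands], y)     # h = Psi(D_tilde)
            ensemble.append((bands, h))
        return ensemble

    def rsm_predict(ensemble, X):
        """Combine the B predictions by simple majority voting."""
        votes = np.stack([h.predict(X[:, bands]) for bands, h in ensemble])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

    # Example (hypothetical data): rsm_train(X_train, y_train, KNeighborsClassifier(1), r=95, B=20)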

III. DSM

In this section, DSM is introduced, and how the drawbacks of RSM are overcome is shown. The design of DSM is displayed in Fig. 2, where two innovative distributions, namely, W and R, are imposed in the process of subspace selection. In addition, D̃_k and r_k represent the kth reduced-dimensional data set of D and its corresponding dimensionality, respectively. Compared to RSM, the contributions of the bands are assumed to differ, that is, band selection is no longer according to the uniform distribution. We propose the importance distribution of band weights W to model the probability of bands being selected. Importantly, the subspace dimensionality is neither predefined nor a fixed number but is drawn from the importance distribution of subspace dimensionality R. An update process for R is also proposed in each overproduction. DSM constructs D̃_k with r_k bands based on the W and R distributions. The algorithm of DSM is summarized in Algorithm 1.

In the following, the W and R distributions are defined, respectively, and the DSM algorithm is explained.

A. W Distribution

The design of the W distribution is based on the principle that beneficial bands carry larger probabilities of being selected, and smaller probabilities are given to the futile ones. A class-based band selection for creating an ensemble of classifiers was proposed in [24]; it is time consuming and not suitable for our DSM. Hence, two simple multiclass-based band selection methods are proposed for DSM. The band selection processes are based on two W distributions, W_ACC(Ψ) and W_LDA, where the subscripts ACC(Ψ) and LDA represent the resubstitution accuracy obtained by applying the classifier Ψ and the class separability of Fisher's linear discriminant analysis (LDA) [25], respectively. Note that the histogram approach [18] is utilized for density estimation of both distributions. The following are their formulations, and the procedure of selecting bands based on the W distribution is also introduced.

Fig. 3. Initial procedure for estimating the R_0 distribution.

1) W_ACC(Ψ) Distribution: The W_ACC(Ψ) distribution is built according to the so-called resubstitution accuracy [25], which is the classification accuracy on the training data. In this paper, the resubstitution accuracy is obtained by applying the base classifier Ψ to each individual band. Assume that W_ACC(Ψ) is a random variable with a probability mass function (pmf) given by a probability vector f_{W_ACC(Ψ)} = (f_{W_ACC(Ψ)}(1), f_{W_ACC(Ψ)}(2), . . . , f_{W_ACC(Ψ)}(p)), where

f_{W_ACC(Ψ)}(j) = \frac{φ_j}{\sum_{k=1}^{p} φ_k}, \quad j = 1, 2, . . . , p   (1)

where φ_j denotes the resubstitution classification accuracy obtained by applying the base classifier Ψ to the jth band only.
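As a concrete illustration of (1), the following is a minimal sketch (not the authors' code) that computes the per-band resubstitution accuracies and normalizes them into the W_ACC(Ψ) pmf; it assumes a scikit-learn-style base classifier, and the function name w_acc is introduced here for illustration.

    import numpy as np
    from sklearn.base import clone

    def w_acc(X, y, base_learner):
        """Per-band resubstitution accuracy, normalized to a pmf as in (1)."""
        p = X.shape[1]
        phi = np.empty(p)
        for j in range(p):
            Xj = X[:, [j]]                          # the jth band only
            h = clone(base_learner).fit(Xj, y)
            phi[j] = np.mean(h.predict(Xj) == y)    # resubstitution accuracy phi_j
        return phi / phi.sum()                      # f_WACC(j) = phi_j / sum_k phi_k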

2) W_LDA Distribution: Another measurement used to assign weights to individual bands in this study is based on the class separability of Fisher's LDA [25], which refers to the power of discrimination and is measured by J = tr(S_w^{-1} S_b). The value of J, computed as the trace of the inverse of the within-class scatter matrix (S_w) times the between-class scatter matrix (S_b), should be large for beneficial bands and small for futile ones. Assume that W_LDA is a random variable with pmf given by a probability vector f_{W_LDA} = (f_{W_LDA}(1), f_{W_LDA}(2), . . . , f_{W_LDA}(p)), where

f_{W_LDA}(j) = \frac{J_j}{\sum_{k=1}^{p} J_k}, \quad J_j = tr(S_{w_j}^{-1} S_{b_j}), \quad j = 1, 2, . . . , p.   (2)

Note that J_j denotes the discrimination power of the jth spectral band.
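For a single band, S_w and S_b in (2) reduce to scalars, so the Fisher ratio can be computed band by band with plain NumPy. The sketch below is illustrative only (the function name w_lda is an assumption), and it pools the within-class scatter over all classes.

    import numpy as np

    def w_lda(X, y):
        """Per-band Fisher ratio J_j = S_b / S_w, normalized to a pmf as in (2)."""
        classes, counts = np.unique(y, return_counts=True)
        mean_all = X.mean(axis=0)
        Sw = np.zeros(X.shape[1])
        Sb = np.zeros(X.shape[1])
        for c, n_c in zip(classes, counts):
            Xc = X[y == c]
            mean_c = Xc.mean(axis=0)
            Sw += ((Xc - mean_c) ** 2).sum(axis=0)   # within-class scatter, band by band
            Sb += n_c * (mean_c - mean_all) ** 2     # between-class scatter, band by band
        J = Sb / Sw                                  # discrimination power of each band
        return J / J.sum()                           # f_WLDA(j) = J_j / sum_k J_k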

3) Band Selection Based on the W Distribution: The foundation of the band selection algorithm based on the W distribution (W_ACC, W_LDA, or the uniform distribution) is the theory of pseudorandom number generation [26]. The inversion method of pseudorandom number generation is used to implement the algorithm for selecting the desired bands. Assume that r bands need to be selected; the steps for selecting these bands are as follows.

1) Generate a uniform random number υ on [0, 1].

2) Select the kth band if F_W(k − 1) < υ < F_W(k), where F_W denotes the cumulative distribution function of the W distribution, and 1 ≤ k ≤ p.

3) Set f_W(k) = 0 and renormalize the W distribution.

4) Go back to Step 1 until r bands have been selected.

Finally, a reduced-dimensional data set D̃ = WBS(D, r, W) is obtained, where "WBS" is the acronym of W-based band selection.
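The following is a minimal sketch of this selection procedure (the name wbs is introduced for illustration and is not from the paper): it inverts the cumulative distribution of W, zeroes out each chosen band, and renormalizes, so the r selected band indices are distinct.

    import numpy as np

    def wbs(weights, r, rng=np.random.default_rng(0)):
        """Select r distinct band indices with probabilities given by the W distribution."""
        w = np.asarray(weights, dtype=float).copy()
        selected = []
        for _ in range(r):
            w = w / w.sum()                    # renormalize the remaining W distribution
            cdf = np.cumsum(w)
            u = rng.random()                   # uniform random number on [0, 1)
            k = int(np.searchsorted(cdf, u))   # smallest k with F_W(k-1) < u <= F_W(k)
            selected.append(k)
            w[k] = 0.0                         # set f_W(k) = 0 so this band is not reselected
        return np.array(selected)

    # Example (hypothetical): D_reduced = X[:, wbs(w_lda(X, y), r=50)]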

B. R Distribution

The function of the R distribution is to indicate how many dimensions are suitable for the employed base classifier. The procedure to establish the R distribution includes two steps. First, we build R_0, an initial distribution of R, by applying Ψ to b data sets of different dimensionalities r_1, . . . , r_b. Second, kernel smoothing (KS) density estimation [17], [18] (or "Parzen density estimation" [25]) is utilized to smooth R_0, making it a continuous distribution. Fig. 3 illustrates the aforementioned procedure. KS is an important and popular nonparametric technique used when prior knowledge about the functional form of the conditional probability distributions is not available or is not used explicitly [27].

As shown in the left plot of Fig. 3, the b training data sets for building R_0 are generated based on W, i.e., D̃_t = WBS(D, r_t, W), t = 1, 2, . . . , b, where the dimensionality r_t is given by

r_t = 1 + \left\lfloor \frac{(t − 1)(p − 1)}{b − 1} \right\rfloor.   (3)

TABLE I
DESCRIPTION OF ALGORITHMS USED FOR COMPARISON

Next, we compute the resubstitution classification accuracy φ(h_t) by applying Ψ to D̃_t. Then, the R_0 distribution is built. Finally, we obtain the continuous R_0 distribution by KS as

f_R(r) = \frac{1}{σ \sum_{t=1}^{b} φ(h_t)} \sum_{t=1}^{b} φ(h_t) K\left(\frac{r − r_t}{σ}\right), \quad r = 1, 2, . . . , p   (4)

where K is the kernel function, and σ is the smoothing parameter called the bandwidth.
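To illustrate (3) and (4), here is a small sketch (the names initial_dims and r0_distribution are illustrative, not from the paper) that spaces the b candidate dimensionalities over 1..p and forms the accuracy-weighted, kernel-smoothed estimate; the result is renormalized to sum to one so it can be used directly as a pmf over r.

    import numpy as np

    def initial_dims(p, b):
        """Candidate dimensionalities r_t from (3), e.g. 1, 48, 96, 143, 191 for p=191, b=5."""
        t = np.arange(1, b + 1)
        return 1 + (t - 1) * (p - 1) // (b - 1)

    def r0_distribution(r_values, phi, sigma, p):
        """Accuracy-weighted Gaussian kernel smoothing as in (4), evaluated at r = 1..p."""
        r = np.arange(1, p + 1)[:, None]
        diff = (r - np.asarray(r_values)) / sigma
        K = np.exp(-0.5 * diff ** 2) / np.sqrt(2 * np.pi * sigma ** 2)   # kernel of (6)
        f = (np.asarray(phi) * K).sum(axis=1) / (sigma * np.sum(phi))
        return f / f.sum()                                               # renormalized pmf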

The subspace dimensionality is drawn from the R distribution in this study. Again, the inversion method from the theory of pseudorandom number generation is used to implement the algorithm for determining the subspace dimensionality based on R.

1) Generate a uniform random number υ on [0, 1].

2) The subspace dimensionality is r if F_R(r − 1) < υ < F_R(r), where F_R denotes the cumulative distribution function of the R distribution, and 1 ≤ r ≤ p.

The R distribution will be updated during the construction of the B classifiers in the ensemble, and the updating process is described in Section III-C.

C. DSM

After estimating the W distribution and the R_0 distribution, the classifiers in the ensemble start being constructed. The R distribution is automatically updated by the performance of each subsequent classifier. The steps of the proposed DSM are described as follows, and the algorithm of DSM is presented in Algorithm 1.

Let B be the number of classifiers in the ensemble and the index k = 1, 2, . . . , B.

1) Draw a new subspace dimensionality r_k from the R_{k−1} distribution.

2) Obtain a reduced-dimensional data set by D̃_k = WBS(D, r_k, W).

3) Obtain the kth component classifier of the ensemble by h_k = Ψ(D̃_k).

4) Estimate the resubstitution accuracy φ(h_k) as the feedback to obtain the updated R_k distribution by

f_R(r) = \frac{1}{σ \left[ \sum_{t=1}^{b} φ(h_t) + \sum_{ℓ=1}^{k} φ(h_ℓ) \right]} \left[ \sum_{t=1}^{b} φ(h_t) K\left(\frac{r − r_t}{σ}\right) + \sum_{ℓ=1}^{k} φ(h_ℓ) K\left(\frac{r − r_ℓ}{σ}\right) \right]   (5)

5) Go back to Step 1 until B classifiers have been trained.

Algorithm 1. The algorithm of DSM

Input:
    The training data set D
    The test sample Y
    A learning algorithm (classifier) Ψ
    The ensemble size B
    The band selection based on W, WBS
Output:
    Final hypothesis F : Y → c ∈ {1, 2, . . . , L} computed by the ensemble H = {h_1, h_2, . . . , h_B}.
A. Training procedure
Begin
    Estimate the W distribution.
    Estimate the R_0 distribution.
    for k = 1, 2, . . . , B
        Draw a subspace dimensionality r_k from R_{k−1}.
        D̃_k = WBS(D, r_k, W)
        h_k = Ψ(D̃_k)
        Obtain the R_k distribution by formula (5).
    end
End
B. Classification procedure
    F = arg max_{c ∈ {1,2,...,L}} card({k | h_k(Y) = c, k = 1, 2, . . . , B}).
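A compact, illustrative sketch of Algorithm 1 is given below; it is not the authors' implementation. It reuses the w_acc/w_lda, wbs, and r0_distribution sketches shown earlier, implements the update (5) simply by re-running the weighted kernel estimate over all accuracies collected so far, and assumes integer class labels and a scikit-learn-style base classifier.

    import numpy as np
    from sklearn.base import clone

    def dsm_train(X, y, base_learner, f_w, r_init, phi_init, sigma, B,
                  rng=np.random.default_rng(0)):
        """Train the DSM ensemble: draw r_k from R_(k-1), select bands by WBS, fit, update R."""
        p = X.shape[1]
        dims, accs = list(r_init), list(phi_init)        # r_1..r_b and phi(h_1)..phi(h_b)
        ensemble = []
        for _ in range(B):
            f_r = r0_distribution(dims, accs, sigma, p)  # current R distribution over 1..p
            r_k = min(p, 1 + int(np.searchsorted(np.cumsum(f_r), rng.random())))  # inversion
            bands = wbs(f_w, r_k, rng)                   # D~_k = WBS(D, r_k, W)
            h_k = clone(base_learner).fit(X[:, bands], y)
            acc_k = np.mean(h_k.predict(X[:, bands]) == y)  # resubstitution accuracy phi(h_k)
            dims.append(r_k); accs.append(acc_k)         # feedback that updates R_k, cf. (5)
            ensemble.append((bands, h_k))
        return ensemble

    def dsm_predict(ensemble, X):
        """Simple majority vote of the B component classifiers."""
        votes = np.stack([h.predict(X[:, bands]) for bands, h in ensemble])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)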

IV. EXPERIMENTAL DESIGN

A. Methods

For investigating the multiclass classification performances of the proposed methods, there are five different algorithms used for comparison. All algorithms and their descriptions are listed in Table I. The value of b for building the initial R distribution (R_0) is set to 5, and the ensemble size B is set to 20 in RSM and the DSMs. In RSM and the DSMs, simple majority voting is used for the fusion of all ensemble classifiers.

In the DSMs, the kernel function K used in the R distribution is taken to be a Gaussian function

K\left(\frac{t}{σ}\right) = \frac{1}{\sqrt{2πσ^2}} e^{−\frac{1}{2}\left(\frac{t}{σ}\right)^2}.   (6)

From [17], if σ is large, then the R distribution is flatter, and the differences among the selection probabilities are small. The bandwidth σ suggested by [17] is set as

σ = 0.9 A n^{−1/5}   (7)

where A = min(standard deviation, interquartile range/1.34), and n is the number of subspace dimensionalities that have been input.
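A small sketch of the rule-of-thumb bandwidth (7) follows (the name silverman_bandwidth is illustrative); the statistics are computed from the subspace dimensionalities that have been fed into the kernel estimate so far.

    import numpy as np

    def silverman_bandwidth(dims):
        """Rule-of-thumb bandwidth (7): 0.9 * A * n^(-1/5), with A = min(std, IQR/1.34)."""
        dims = np.asarray(dims, dtype=float)
        q75, q25 = np.percentile(dims, [75, 25])
        A = min(dims.std(ddof=1), (q75 - q25) / 1.34)
        return 0.9 * A * dims.size ** (-1 / 5)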

B. Base Classifiers

To explore the performances of RSM and DSM with different base classifiers, we employ the Gaussian ML classifier [25], the k-nearest-neighbor classifier (kNN, k = 1) [25], the support vector machine (SVM) [28] using a radial basis function (RBF) kernel, and the Bayesian contextual classifier (BCC) [23] in all algorithms. The following explains why we select these classifiers as the base learners. Here, we give the term "weak classifier" a general definition that refers to a classifier that does not perform well enough on hyperspectral image classification with insufficient training data. We try to apply DSM with these classifiers to obtain better performance.

The most widely used statistical classifier, namely the ML classifier, belongs to the parametric model that is made up of mean vector and covariance matrix for a normal distribution [25]. However, the covariance matrix of ML may be singular or near-singular (i.e., noninvertible) and leads to inaccurate estimation when the data dimensionality exceeds the number of training samples [29]. Consequently, the classifier performs poorly.

The kNN classifier is a simple and appealing approach, which assigns an unknown point to the class most common among its k nearest neighbors. However, high dimensionality is an obstacle for kNN since the nearest neighbors of a point can be very far away, causing bias and degrading the performance of the rule [30]. Since kNN is sensitive to the input bands [31], DSM generates a diverse kNN ensemble to overcome this problem in a reduced-dimensional space. In this study, PRTools [32] is used to implement the kNN classifier.

The SVM, a successful learning algorithm commonly used for classification and regression issues, is designed by solving a constrained optimization problem. Geometrically, the SVM aims at finding a linear discriminant function with the maximal margin in a potentially very high-dimensional space. Given a training data set D = {(x_i, c_i)}, where x_i ∈ ℝ^n, c_i ∈ {+1, −1}, and i = 1, 2, . . . , N, the goal of the SVM is to find the separating hyperplane w^T ϕ(x) that maximizes the margin, and it requires the solution of the following optimization problem:

\min_{w, b, ξ} \frac{1}{2} w^T w + C \sum_{i=1}^{N} ξ_i
subject to c_i (w^T ϕ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0   (8)

where C and ξ_i are the penalty parameter and slack variables, respectively, for the soft-margin SVM. Using the so-called Kuhn–Tucker theorem [33], the optimization of (8) can then be reformulated as the following dual problem with respect to the Lagrange multipliers α_i ≥ 0:

\max_{α} \sum_{i=1}^{N} α_i − \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j c_i c_j κ(x_i, x_j)
subject to \sum_{i=1}^{N} α_i c_i = 0 and 0 ≤ α_i ≤ C, ∀ i = 1, 2, . . . , N   (9)

where κ(x_i, x_j) is called the kernel function.

In this study, an RBF kernel is used:

κ(x_i, x_j) = exp(−γ ‖x_i − x_j‖^2).   (10)

LIBSVM [34] is used to implement the SVM classifier. Here, we use fivefold cross-validation and a grid search to find the best C within the given set {2^{−5}, 2^{−3}, . . . , 2^{15}} and the best γ within the given set {2^{−15}, 2^{−13}, . . . , 2^{3}} of parameters (as suggested by Hsu et al. [35]).
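An equivalent model-selection step can be sketched with scikit-learn (whose SVC is itself backed by LIBSVM); this is an illustrative setup rather than the authors' exact script, and X_train/y_train are placeholder names.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C":     [2.0 ** k for k in range(-5, 16, 2)],   # {2^-5, 2^-3, ..., 2^15}
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # {2^-15, 2^-13, ..., 2^3}
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # fivefold cross-validation
    # search.fit(X_train, y_train); the selected (C, gamma) are in search.best_params_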

Although SVM has been found to provide better classification results than other widely used classifiers in hyperspectral image classification [36], [37], combining a band-reduction procedure with SVM has also proved effective for obtaining higher accuracies [38], [39]. Hence, we include SVM as a base learner to investigate the effectiveness of the ensemble method.

The MRF-based BCC [23] is also applied as a base classifier. Let u(i, j) denote a field that contains the classification of the pixel at the ith row and jth column of an image X, where u ∈ {1, 2, . . . , L}. According to [23], a decision rule is derived as follows:

u(i, j) = \arg\max_{u ∈ \{1,2,...,L\}} \left\{ −\left[ \ln|Σ_u| + (X(i, j) − μ_u)^T Σ_u^{−1} (X(i, j) − μ_u) + 2mβ \right] \right\} + const.   (11)

where Σ_u and μ_u are the covariance matrix and mean vector of class u, respectively. The coefficient β, which emphasizes the significance of the interaction among adjacent pixels inside a clique, is empirically set to 30, and m is the total number of occurrences of classes different from u(i, j) in all cliques, where MRFs are used to model the context-dependent information. The 4-neighborhood system and the corresponding cliques of order 2 are used in this study. Additionally, [23] also provides a recursive process for adaptively estimating the statistics of mean vectors and covariance matrices. In this study, we omit this step to save computational time.
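A per-pixel sketch of the decision rule (11) is shown below under stated assumptions: class labels are indexed 0..L−1, the class statistics (μ_u, Σ_u) are given, m is counted over the 4-neighborhood labels supplied by the caller, and β = 30 as in the text; the function name bcc_decide is illustrative and this is not the full recursive scheme of [23].

    import numpy as np

    def bcc_decide(x, neighbor_labels, means, covs, beta=30.0):
        """Pick the label minimizing ln|Sigma_u| + (x-mu_u)^T Sigma_u^-1 (x-mu_u) + 2*beta*m."""
        energies = []
        for u, (mu, cov) in enumerate(zip(means, covs)):
            diff = x - mu
            maha = diff @ np.linalg.solve(cov, diff)            # squared Mahalanobis distance
            m = int(np.sum(np.asarray(neighbor_labels) != u))   # neighbors labeled differently
            _, logdet = np.linalg.slogdet(cov)
            energies.append(logdet + maha + 2.0 * beta * m)
        return int(np.argmin(energies))                         # equivalent to the arg max in (11)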

Although the BCC can achieve satisfactory classification results [19], [23], [40], [41], it still suffers from the singular or near-singular problem in the estimation of the inverse of the covariance matrix, which makes the classifier weak and yields poor classification performance. In this study, we apply BCC in the proposed DSM as the base classifier to try to overcome this problem. Additionally, we also want to investigate the mutual effect on the behavior of selecting subspaces when using both spectral and spatial information.

Fig. 4. (a) Test image of a portion of the Washington, DC Mall data set with a size of 205 × 307 pixels. Bands 63, 52, and 36 of 191 bands were used for this image space presentation. (b) Corresponding labeled field map.

TABLE II
NUMBERS OF PIXELS IN THE WASHINGTON, DC MALL DATA SET

Fig. 5. (a) Test image of the Indian Pines data set. Bands 50, 27, and 17 of 220 bands were used for this image space presentation. (b) Corresponding labeled field map.

C. Data Sets

In this study, two hyperspectral image data sets are applied to compare the performances of the five algorithms described in Table I. They are the urban site over Washington, DC Mall, U.S. [42] and the mixed forest/agricultural site over northwest Indiana's Indian Pines site, U.S. [43]. The first data set is a Hyperspectral Digital Imagery Collection Experiment (HYDICE) airborne hyperspectral-data flightline over Washington, DC Mall with an original size of 1280 × 307 pixels, and we use a size of 205 × 307 in our study. Two hundred and ten bands are collected in the 0.4–2.4 μm region of the visible and infrared spectrum. Some water absorption channels are discarded, resulting in 191 channels. In the experiment, seven information classes, namely, Roof, Road, Trail, Grass, Tree, Water, and Shadow, are selected by using MultiSpec [42], which is shown in Fig. 4(b), and the number of samples of each class is displayed in Table II.

For exploring the effects of the training sample size relative to the dimensionality, three different cases, namely, N_i = 20 < N < p (case 1: ill-posed problem), N_i = 40 < p < N (case 2: poorly posed problem), and p < N_i = 300 < N (case 3: well-posed problem), are investigated. For the test sample size, we use a fixed size of 300 pixels for each class of the Washington, DC Mall data set. In the Indian Pines data set, 37.24% of the labeled samples of each class are used as test samples because, in the training sample size N_i = 300 case, the maximum number of available test samples of the Hay-windrowed class is 178, which is 37.24% of its labeled samples. This way, smaller classes will be tested with a smaller number of pixels, and larger classes will have a larger number of samples. In each experiment, ten spatially disjoint training and test data sets are randomly assembled for estimating the parameters and computing the overall classification accuracy of the test data sets.

The Indian Pines data set was gathered by a sensor known as the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). These data were obtained from an aircraft flown at 19 812 m altitude and operated by the National Aeronautics and Space Administration/Jet Propulsion Laboratory; the image has a size of 145 × 145 pixels and 220 spectral bands, with pixels measuring approximately 20 m across on the ground. The test image is shown in Fig. 5. From the 16 different land-cover classes available in the

original ground truth [42], eight are discarded due to the constraint of the three sample sizes. The remaining eight classes, namely, Corn-notill, Corn-min, Grass/Pasture, Hay-windrowed, Soybeans-notill, Soybeans-min, Soybean-clean, and Wood, are selected for the experiments, and the number of samples of each class is displayed in Table III.

TABLE III
NUMBERS OF PIXELS IN THE INDIAN PINES DATA SET

Fig. 6. Update process of the R distribution using DSMw2 with the ML classifier on the Washington, DC data set (case 3). (a) Initialize R_0. (b) KS (R_0). (c) Iteration 1 (R_1). (d) Iteration 2 (R_2). (e) Iteration 5 (R_5). (f) Iteration 10 (R_10). (g) Iteration 15 (R_15). (h) Iteration 20 (R_20).

TABLE IV
AVERAGE CLASSIFICATION ACCURACY ± STANDARD DEVIATION AND KAPPA STATISTIC ± STANDARD DEVIATION OF TEN TEST DATA ON THE WASHINGTON, DC MALL DATA SET (IN PERCENT)

TABLE V
AVERAGE CLASSIFICATION ACCURACY ± STANDARD DEVIATION AND KAPPA STATISTIC ± STANDARD DEVIATION OF TEN TEST DATA ON THE INDIAN PINES DATA SET (IN PERCENT)

V. EXPERIMENT RESULTS

Fig. 6 demonstrates the update process of the R distribution using DSMw2 with the ML classifier on the Washington, DC data set. Initially, R starts from five specific subspace sizes, namely, 1, 48, 96, 143, and 191, at approximately equal intervals, as shown in Fig. 6(a); then, kernel density estimation is introduced to form R_0 [Fig. 6(b)], which serves as the first guide for selecting the first subspace size. Fig. 6(c)–(h) shows the change of the R distribution, and the corresponding subspace size selections are based on these distributions. After several updates, the R distribution for subspace size selection tends to become stable.

Tables IV and V display the classification accuracies of the test data for cases 1, 2, and 3 on the Washington, DC Mall and Indian Pines data sets, respectively. Note that the shaded parts indicate the best accuracy of each case, and the best accuracy of each applied classifier among all algorithms is written in bold type for each case. Figs. 7–12 show the three types of W distributions and the corresponding R distributions for the two data sets. Note that all R distributions are the final result after 20 iterations. The following are some findings based on these results.

A. Washington, DC Mall Data Set

1) In the Washington, DC data set, the highest accuracies among all methods are 94.4%, 95.4%, and 97.0% in cases 1, 2, and 3, respectively, and all occur with DSMw2 with BCC. Additionally, as the training sample size increases, the accuracies show ascending tendencies in all combinations.

2) In terms of each classifier, the best accuracies occur mostly when applying DSMw2 among the three cases. Additionally, the proposed methods, namely, DSM, DSMw1, and DSMw2, are better than the single classifiers and RSM regardless of the base classifier applied.

3) The W distribution in Fig. 9 is significantly different from those in Figs. 7 and 8, possibly explaining the sound results of DSMw2. This suggests that using the LDA separability as the band weights is a better choice for selecting the component bands of subspaces.

4) The W distribution of kNN in Fig. 8(a) shows a different behavior with respect to the three others and is closer to a uniform distribution. This may explain the similar classification accuracies of DSM and DSMw1.


Fig. 7. (a) W distribution and (b)–(d) corresponding R distributions of DSM using ML, kNN, SVM, and BCC, respectively, on the Washington, DC Mall data set: (a) RBS (the W distribution is a uniform distribution); (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

Fig. 8. (a) W_ACC distribution and (b)–(d) corresponding R distributions of DSMw1 using ML, kNN, SVM, and BCC, respectively, on the Washington, DC Mall data set: (a) W_ACC distributions; (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

5) RSM is mostly better than single classifiers. ML and BCC do not work in cases 1 and 2 under RSM due to the singularity problem. DSM can reduce the singularity problem.

6) From the R distributions of Figs. 7–9, the proposed method automatically estimates R distributions for different base classifiers. In terms of subspace dimensionality, lower dimensionality is suitable for ML and BCC, whereas higher dimensionality is suitable for kNN and SVM. Furthermore, because it uses additional spatial information, BCC definitively uses lower dimensionality than ML.

B. Indian Pines Data Set

1) The highest accuracies among all methods are 77.1% (DSMw2 with SVM), 81.9% (DSMw1 with BCC), and 96.2% (DSMw2 with BCC) in cases 1, 2, and 3, respectively.

2) The best accuracies are distributed over DSM, DSMw1, and DSMw2 when using SVM and BCC. For the kNN and SVM classifiers, the performances of the three DSMs seem to be similar to that of RSM, which means that half of the original space size is suitable for RSM with the kNN and SVM classifiers. The R distributions in Figs. 10–12 support this claim; more importantly, they demonstrate that the suitable subspace size for the kNN and SVM classifiers is close to half of the original space size, which matches Ho's suggestion. Additionally, these R distributions reveal that BCC uses lower dimensionality than the other classifiers.

Fig. 9. (a) W_LDA distribution and (b)–(d) corresponding R distributions of DSMw2 using ML, kNN, SVM, and BCC, respectively, on the Washington, DC Mall data set: (a) W_LDA distribution; (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

Fig. 10. (a) W distribution and (b)–(d) corresponding R distributions of DSM using ML, kNN, SVM, and BCC, respectively, on the Indian Pines data set: (a) RBS (the W distribution is a uniform distribution); (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

3) In cases 1 and 2, ML and BCC suffer from the singularity problem when the subspace dimensionality exceeds the training sample size. The proposed dynamic selection scheme can avoid this situation.

4) In Figs. 7–9, the W distributions are dissimilar; therefore, the performances of DSM, DSMw1, and DSMw2 are different as well. In Figs. 10–12, the W distributions are flat and similar; therefore, the performances of DSM, DSMw1, and DSMw2 are close as well.

Due to length constraints, only some classified images are shown for comparison, and three methods (single classifier, RSM, and DSMw2) are selected to generate the classified images under case 3. Figs. 13–15 show the classification results for the area of Fig. 4 using the single classifier, RSM, and DSMw2 with ML, kNN, SVM, and BCC, respectively. Generally, we can find that all single classifiers do not perform well compared to RSM and DSMw2. In Fig. 13, although BCC shows less speckle error than the other classifiers, there are many pixels from roads that are incorrectly identified as roofs. Compared to the single classifier method, RSM and DSMw2 both obtain improvement in roofs; DSMw2 with kNN and SVM significantly outperforms the single classifier method and RSM in grass. The best classification result occurs in Fig. 15(d) by using DSMw2 with BCC.

Fig. 11. (a) W_ACC distribution and (b)–(d) corresponding R distributions of DSMw1 using ML, kNN, SVM, and BCC, respectively, on the Indian Pines data set: (a) W_ACC distributions; (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

Fig. 12. (a) W_LDA distribution and (b)–(d) corresponding R distributions of DSMw2 using ML, kNN, SVM, and BCC, respectively, on the Indian Pines data set: (a) W_LDA distributions; (b) case 1 (N_i = 20 < N < p); (c) case 2 (N_i = 40 < p < N); and (d) case 3 (p < N_i = 300 < N).

Fig. 13. Thematic maps resulting from the classification of Fig. 4 in case 3: (a)–(d) are the results of the single classifier. (a) ML. (b) kNN. (c) SVM. (d) BCC.

Fig. 14. Thematic maps resulting from the classification of Fig. 4 in case 3: (a)–(d) are the results of using RSM. (a) ML. (b) kNN. (c) SVM. (d) BCC.

Fig. 15. Thematic maps resulting from the classification of Fig. 4 in case 3: (a)–(d) are the results of using DSMw2. (a) ML. (b) kNN. (c) SVM. (d) BCC.

Fig. 16. Thematic maps resulting from the classification of Fig. 5 in case 3: (a)–(d) are the results of the single classifier. (a) ML. (b) kNN. (c) SVM. (d) BCC.

Fig. 17. Thematic maps resulting from the classification of Fig. 5 in case 3: (a)–(d) are the results of using RSM. (a) ML. (b) kNN. (c) SVM. (d) BCC.

Figs. 16–18 show the classification results for the area of Fig. 5 using the single classifier, RSM, and DSMw2 with ML, kNN, SVM, and BCC, respectively. Compared to the ground truth in Fig. 5(b), we can observe that the classification results of RSM and DSMw2 are better than those of the single classifiers, particularly in Soybeans-min, Soybeans-notill, and Corn-notill, which are the most difficult parts to classify accurately; additionally, RSM and DSMw2 have similar performances when using ML, kNN, and SVM. The best classification result occurs in Fig. 18(d) by using DSMw2 with BCC, which performs much better than RSM with BCC in Soybeans-min.

Fig. 18. Thematic maps resulting from the classification of Fig. 5 in case 3. (a)–(d) Results of using DSMw2. (a) ML. (b) kNN. (c) SVM. (d) BCC.

VI. CONCLUSION AND COMMENTS

In this paper, a new multiple classifier system named DSM has been proposed for classifying hyperspectral image data, and we have investigated the effects of using four different base classifiers and three training sample sizes. Compared to the original RSM, DSM has a statistical foundation for selecting better subspaces and their sizes, while at the same time having the robust ability to accommodate every situation within this study.

In the original RSM, the probability of each band being selected is based on a uniform distribution, whereas it is replaced by the W distribution in DSM. Two criteria, namely, the resubstitution accuracy and the separability of Fisher's LDA, are used to model the density of the W distribution. Two "dynamic" strategies are carried in the R distribution. One is that the R distribution is able to automatically select suitable subspace sizes, which are usually troublesome to set beforehand. The other is the updating technique, which makes the R distribution change progressively toward a stable status. Experimental results show that these modifications improve the classification accuracy.

There are two theoretical drawbacks to the proposed DSM. The first is that, in the estimation of the W distribution, the resubstitution classification accuracy is actually evaluated using only a single band. Theoretically, this approach does not precisely measure the importance of each band because there is cross-information between bands. In terms of the algorithm, once the subspace dimensionality has been estimated, the importance of each band should be evaluated statistically by trying different band sets of the estimated dimensionality that include the band being evaluated. However, there are many combinations of band sets; therefore, the computational load of obtaining a better W would increase.

The second drawback is that, in the estimation of the R distribution, the classification performances of b classifiers built in b spaces of different dimensionalities (r_1, . . . , r_b) are adopted, which means that for each r_i-dimensional space, only one classifier is trained. The results could be unstable because the extracted data set may be unrepresentative. The ways to alleviate this problem are to create many sets with identical dimensionality, run the single classifier on each set, and then average the results, or to use a sort of k-fold validation.

As the two approaches are applied to W and R, they may yield better results but may greatly increase the computational load. However, we have experimentally found that the proposed DSM indeed yields better results when B is smaller than 15, but the results of the two approaches tend to be similar when B is larger than 15. By using the proposed method, computation time is reduced while similar results are obtained. Due to length constraints, these experimental results are not presented.

In conclusion, the proposed DSM fixes the inadequacies of RSM by employing two importance distributions in the process of subspace selection, and furthermore, it not only alleviates the Hughes effect but also obtains sound classification performance. The experimental results also show that DSMw2 with BCC has the best performance in terms of accuracy and classification map. An interesting finding from the R distribution is that BCC performs well in a much smaller dimensional space. Comparing the performances of DSM with ML and BCC, we find that the spatial information does, in fact, improve DSM.

ACKNOWLEDGMENT

The authors would like to thank Dr. Landgrebe for providing the Indian Pines site and Washington, DC mall data sets and the reviewers for their constructive comments.

REFERENCES

[1] M. Skurichina and R. P. W. Duin, “Bagging, boosting and the random subspace method for linear classifiers,” Pattern Anal. Appl., vol. 5, no. 2, pp. 121–135, Jun. 2002.

[2] L. Breiman (1998, Jun.). Arcing classifiers. Ann. Stat. [Online]. 26(3), pp. 801–824. Available: http://www.jstor.org/stable/120055

[3] G. F. Hughes, “On the mean accuracy of statistical pattern recognition,”

IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 55–63, Jan. 1968.

[4] R. E. Bellman, Adaptive Control Processes—A Guided Tour. Princeton, NJ: Princeton Univ. Press, 1961.

[5] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832– 844, Aug. 1998.

[6] T. K. Ho, “Nearest neighbors in random subspaces,” in Proc. IAPR,

Lecture Notes in Computer Science: Advances in Pattern Recognition, 1998,

pp. 640–648.

[7] M. Skurichina and R. P. W. Duin, “Bagging and the random subspace method for redundant feature spaces,” in Proc. 2nd Int. Workshop Multiple

Classifier Syst., 2001, pp. 1–10.

[8] B.-C. Kuo, C.-H. Pai, T.-W. Sheu, and G.-S. Chen, “Hyperspectral data classification using classifier overproduction and fusion strategies,” in

Proc. IEEE Int. Geosci. Remote Sens. Symp., 2004, vol. 5, pp. 2937–2940.

[9] A. Bertoni, R. Folgieri, and G. Valentini, “Feature selection combined with random subspace ensemble for gene expression based diagnosis of malignancies,” in Proc. 15th Italian Workshop Neural Nets, Perugia, Italy, 2004, pp. 29–35.

[10] A. Bertoni, R. Folgieri, and G. Valentini, “Random subspace ensembles of support vector machines,” in Neurocomputing, vol. 63, pp. 535–539, 2005.

[11] D. Tao, X. Tang, X. Li, and X. Wu, “Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1088–1099, Jul. 2006.

[12] S. Sun, C. Zhang, and D. Zhang, “An experimental evaluation of ensemble methods for EEG signal classification,” Pattern Recognit. Lett., vol. 28, no. 15, pp. 2157–2163, Nov. 2007.


[13] J. W. Christopher, “Hyperspectral image classification with limited training data samples using feature subspaces,” Proc. SPIE, vol. 5425, pp. 170–181, 2004.

[14] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, “Investigation of the random forest framework for classification of hyperspectral data,” IEEE

Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 492–501, Mar. 2005.

[15] C.-H. Chuang, B.-C. Kuo, and H.-P. Wang, “Fuzzy fusion method for combining small number of classifiers in hyperspectral image classification,” in Proc. 8th Int. Conf. Intell. Syst. Des. Appl., 2008, vol. 1, pp. 26–28.

[16] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Hoboken, NJ: Wiley, 2004.

[17] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman & Hall, 1985.

[18] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.

[19] R. Cossu, S. Chaudhuri, and L. Bruzzone, “A context-sensitive Bayesian technique for the partially supervised classification of multitemporal im-ages,” IEEE Geosci. Remote Sens. Lett., vol. 2, no. 3, pp. 352–356, Jul. 2005.

[20] L. O. Jimenez, J. L. Rivera-Medina, E. Rodriguez-Diaz, E. Arzuaga-Cruz, and M. Ramirez-Velez, “Integration of spatial and spectral information by means of unsupervised extraction and classification for homogenous objects applied to multispectral and hyperspectral data,”

IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 844–851,

Apr. 2005.

[21] G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, J. Vila-Francés, and J. Calpe-Maravilla, “Composite kernels for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 3, no. 1, pp. 93–97, Jan. 2006.

[22] X. Jia and J. A. Richards, “Managing the spectral–spatial mix in context classification using Markov random fields,” IEEE Geosci. Remote Sens.

Lett., vol. 5, no. 2, pp. 311–314, Apr. 2008.

[23] Q. Jackson and D. A. Landgrebe, “Adaptive Bayesian contextual classi-fication based on Markov random fields,” IEEE Trans. Geosci. Remote

Sens., vol. 40, no. 11, pp. 2454–2463, Nov. 2002.

[24] Y. Maghsoudi, A. Alimohammadi, M. J. Valadan Zoej, and B. Mojaradi, “Application of feature selection and classifier ensembles for the classification of hyperspectral data,” in Proc. 26th Asian Conf. Remote Sens., Hanoi, Vietnam, 2005.

[25] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego, CA: Academic, 1990.

[26] L. Devroye, Non-Uniform Random Variate Generation. New York: Springer-Verlag, 1986.

[27] F. van der Heiden, R. P. W. Duin, D. de Ridder, and D. M. J. Tax, Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB. Chichester, U.K.: Wiley, 2004.

[28] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998. [29] J. P. Hoffbeck and D. A. Landgrebe, “Covariance matrix estimation and

classification with limited training data,” IEEE Trans. Pattern Anal. Mach.

Intell., vol. 18, no. 7, pp. 763–767, Jul. 1996.

[30] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical

Learning: Data Mining, Inference, and Prediction. New York:

Springer-Verlag, 2001.

[31] S. D. Bay, “Nearest neighbor classification from multiple feature subsets,” Intell. Data Anal., vol. 3, no. 3, pp. 191–209, Sep. 1999. [32] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder,

D. M. J. Tax, and S. Verzakov, PRTools4.1, A Matlab Toolbox for

Pattern Recognition. Delft, The Netherlands: Delft Univ. Technol.,

2007.

[33] F. J. Kampas, “Tricks of the trade: using reduce to solve the Kuhn–Tucker equations,” Mathematica J., vol. 9, pp. 686–689, 2005.

[34] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector

Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/

libsvm

[35] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support

Vector Classification, 2004. [Online]. Available: http://www.csie.ntu.edu.

tw/~cjlin/papers/guide/guide.pdf

[36] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote

Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[37] L. Bruzzone and C. Persello, “A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples,” IEEE Trans.

Geosci. Remote Sens., vol. 47, no. 7, pp. 2142–2154, Jul. 2009.

[38] B.-C. Kuo and K.-Y. Chang, “Feature extractions for small sample size classification problem,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 756–764, Mar. 2007.

[39] B.-C. Kuo, C.-H. Li, and J.-M. Yang, “Kernel nonparametric weighted feature extraction for hyperspectral image classification,” IEEE Trans.

Geosci. Remote Sens., vol. 47, no. 4, pp. 1139–1155, Apr. 2008.

[40] Y. Jhung and P. H. Swain, “Bayesian contextual classification based on modified M-estimates and Markov random fields,” IEEE Trans. Geosci.

Remote Sens., vol. 34, no. 1, pp. 67–75, Jan. 1996.

[41] A. H. S. Solberg, T. Taxt, and A. K. Jain, “A Markov random field model for classification of multisource satellite imagery,” IEEE Trans. Geosci.

Remote Sens., vol. 34, no. 1, pp. 100–113, Jan. 1996.

[42] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote

Sensing. Hoboken, NJ: Wiley, 2003.

[43] AVIRIS NW Indiana’s Indian Pines 1992 Data Set, [Online]. Available: ftp://ftp.ecn.purdue.edu/biehl/MultiSpec/92AV3C, (original files) and ftp://ftp.ecn.purdue.edu/biehl/PC_MultiSpec/ThyFiles.zip, (ground truth)

Jinn-Min Yang received the B.S. and M.S. degrees from the National Taichung Teachers College, Taichung, Taiwan, in 1994 and 2000, respectively. He is currently working toward the Ph.D. degree in the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan.

He is also with the Department of Mathematics Education, National Taichung University, Taichung. His research interests include pattern recognition, remote sensing, and machine learning.

Bor-Chen Kuo received the B.S. and M.S. degrees from National Taichung Teachers College, Taichung, Taiwan, in 1993 and 1996, respectively, and the Ph.D. degree from Purdue University, West Lafayette, IN, in 2001.

He is currently a Professor with the Graduate Institute of Educational Measurement and Statistics, National Taichung University, Taichung. His research interests include pattern recognition, remote sensing, image processing, and nonparametric functional estimation.

Pao-Ta Yu (S’88–M’90) received the B.S. degree in mathematics from the National Taiwan Normal University, Taipei, Taiwan, in 1979, the M.S. degree in computer science from the National Taiwan University, Taipei, in 1985, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1989.

Since 1990, he has been with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan, where he is currently a Professor. His research interests include e-learning, neural networks and fuzzy systems, nonlinear filter design, intelligent networks, and XML technology.

Chun-Hsiang Chuang received the B.S. degree from Taipei Municipal Teachers College, Taipei, Taiwan, in 2004, and the M.S. degree from the National Taichung University, Taichung, Taiwan, in 2009. He is currently working toward the Ph.D. degree in the Institute of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan.

He is also with the Brain Research Center, National Chiao-Tung University. His research interests include pattern recognition, remote sensing, and biomedical signal processing.
