In recent years, researchers have developed numerous statistical learning algorithms for applications in science, finance, and industry. Statistical learning comprises several paradigms, such as classification, regression, feature extraction, dimensionality reduction, and density estimation [3]. The basic idea of classification methods for feature space data is to partition the entire feature space into L exhaustive, nonoverlapping regions, where L is the number of classes present in the scene, so that every point in the feature space is uniquely associated with one of the L classes [22].
Classification algorithms can be divided into two main categories according to the learning process. Supervised classification, or simply classification, is the process of inferring a function that classifies unknown patterns from training data [66]; that is, a set of training samples is available and the classifier exploits this a priori known information [2].
The other type of learning process is called unsupervised classification, or simply clustering. It is referred to as unsupervised because it does not use training samples [22]. Clustering assesses the relationships among samples of a data set by organizing the patterns into different groups without any prior information, so that patterns in one group are more similar to each other than to patterns in other groups [1]. Cluster analysis can detect underlying structures within data, which is useful for classification and pattern recognition as well as for model reduction and optimization [2], [4]-[5].
Clustering algorithms are most commonly used as an aid to selecting a class list and training samples for the classes in that list. That is, clustering may be a means of preprocessing the data for a supervised classification procedure. A clustering scheme may be applied to the data for each class separately, and representative samples for each group within the class can then be used as the prototypes for that class [66]. Fundamentally, to be optimally useful, a classification must have classes that are (simultaneously) “of information value, exhaustive, and separable.” The training samples for supervised learning are generally selected with emphasis on the first of these properties, and clustering is a useful tool in the training process for achieving the latter two. It helps define spectral classes and train for them by breaking up the distribution of pixels in feature space into subunits, so that one can observe what is likely to be separable from what. It also allows one to locate the prevailing modes in the feature space, if any exist [22].
Recent statistical learning algorithms [17]-[19] use both labeled and unlabeled samples for training. These are called semi-supervised learning algorithms, and they fall between unsupervised and supervised pattern recognition. The aim of this thesis is to develop an unsupervised clustering algorithm and a semi-supervised classification algorithm. The former is a fuzzy-based clustering algorithm that considers both within- and between-cluster information, and the latter is a classification algorithm that takes into account both spectral and spatial information.
Fuzzy-based clustering, which determines the degree to which a vector belongs to each cluster, has been the subject of intensive research in the past three decades [2], [4]-[8]. Fuzzy c-means (FCM) clustering is one of the best-known clustering methods [7]-[8], and researchers have developed many advanced FCM-type clustering algorithms. The Gustafson-Kessel (GK) algorithm [9] is a well-known algorithm in this category. It employs an adaptive distance norm to detect clusters of different geometrical shapes in one data set [2].
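As a point of reference for the FCM-type algorithms discussed here, the following is a minimal sketch of the standard fuzzy c-means iteration with a Euclidean norm. The function name `fcm`, the fuzzifier default `m=2.0`, the random initialization, and the stopping rule are illustrative assumptions, not the implementation used in this thesis.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means sketch (Euclidean norm, fuzzifier m > 1).

    X: (n, d) data matrix; c: number of clusters.
    Returns the (n, c) membership matrix U and the (c, d) cluster centers.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialization: random memberships, each row normalized to sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared Euclidean distances from every sample to every center, shape (n, c).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)), rows sum to 1.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers
```

Each row of `U` sums to 1, which is the constraint that the possibilistic variants discussed next relax.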
Krishnapuram and Keller [52] proposed a new clustering model, called possibilistic c-means (PCM), which relaxes the following constraint: “the sum of the membership values of every sample to all clusters is 1.” This relaxation keeps outliers from being forced into one or more clusters. In 1997, the fuzzy-possibilistic c-means (FPCM) [10] was proposed to generate both possibility and membership values. However, the possibility values generated by FPCM become very small as the size of the data set increases. To eliminate this problem of FPCM while retaining the benefits of FCM and PCM, the possibilistic fuzzy c-means (PFCM) was proposed in 2005 [11].
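The constraint quoted above can be written compactly. The notation here is an assumption introduced for illustration: u_{ik} denotes the membership of sample k in cluster i, c the number of clusters, and n the number of samples.

```latex
\begin{align*}
\text{FCM:}\quad & \sum_{i=1}^{c} u_{ik} = 1, \quad u_{ik} \in [0, 1], \qquad k = 1, \dots, n, \\
\text{PCM:}\quad & u_{ik} \in [0, 1] \quad \text{(no sum-to-one requirement)}.
\end{align*}
```

Without the sum-to-one requirement, a sample far from every cluster center can receive a low value for all clusters at once.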
Some FCM-type algorithms, such as the Gath-Geva (GG) algorithm, employ an adaptive distance norm based on fuzzy maximum likelihood estimates [5], [12]. Chatzis and Varvarigou [13] proposed a robust fuzzy clustering algorithm based on the fuzzy treatment of finite mixtures of multivariate Student’s t distributions (FSMM). This approach uses finite mixtures of multivariate Student’s t distributions instead of finite Gaussian mixture models (GMMs). Chatzis and Varvarigou [56] also exploited the advantages of factor analysis and proposed a fuzzy mixture of Student’s t factor analyzers (FMSFA). FMSFA provides a well-established framework, based on factor analysis, for reducing the dimensionality of the observation space in fuzzy clustering algorithms. It simultaneously achieves fuzzy clustering and a reduction in local dimensionality within each cluster. Their experimental results show that FMSFA outperforms finite mixtures of Student’s t factor analyzers (tMFA) [57], a modification of the fuzzy c-varieties algorithm with regularization by Kullback–Leibler information (KLFCV) [58], and the mixture of factor analyzers (MFA) model [59].
Most fuzzy-based clustering algorithms operate by minimizing a cost function based only on the sum of distances between the samples and their cluster centers [2], which is equal to the trace of the within-cluster scatter matrix [14]-[15].
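The equality between the distance-based cost and the trace of the within-cluster scatter matrix can be checked numerically. The sketch below assumes squared Euclidean distances weighted by u_ik^m and a fuzzy within-cluster scatter matrix of the form S_w = sum_i sum_k u_ik^m (x_k - v_i)(x_k - v_i)^T; this definition and the function names are assumptions for illustration, and the exact definitions in [14]-[15] may differ.

```python
import numpy as np

def fuzzy_within_scatter(X, U, centers, m=2.0):
    """Assumed form: S_w = sum_i sum_k u_ik^m (x_k - v_i)(x_k - v_i)^T."""
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for i, v in enumerate(centers):
        diff = X - v                      # (n, d) deviations from center i
        weights = U[:, i] ** m            # (n,) fuzzy weights for cluster i
        S_w += (weights * diff.T) @ diff  # weighted sum of outer products
    return S_w

def distance_cost(X, U, centers, m=2.0):
    """Sum over clusters and samples of u_ik^m * ||x_k - v_i||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d2).sum())

# Hypothetical random data, memberships, and centers for the check.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
U = rng.random((50, 4))
U /= U.sum(axis=1, keepdims=True)
centers = rng.normal(size=(4, 3))

cost = distance_cost(X, U, centers)
trace_sw = float(np.trace(fuzzy_within_scatter(X, U, centers)))
```

The identity holds because the trace of an outer product (x - v)(x - v)^T is the squared norm of x - v, so `cost` and `trace_sw` agree up to floating-point error.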
Researchers have recently used linear discriminant analysis (LDA) [14] for dimensionality reduction in supervised classification problems. LDA uses the mean vector and covariance matrix of each class to formulate within-class, between-class, and mixture-class scatter matrices. Two similar fuzzy-based clustering algorithms based on fuzzy within-cluster, between-cluster, and total scatter matrices are proposed in [15] and [16]. The objective function of fuzzy compactness and separation (FCS) [15] is based on the difference of the fuzzy within- and between-cluster scatter matrices. Minimizing it minimizes the compactness measure while simultaneously maximizing the separation measure. However, the within- and between-class scatter matrices of LDA are not a special case of these fuzzy within- and between-cluster scatter matrices in the supervised learning problem. Moreover, based on the Fisher criterion, the LDA method finds features such that the ratio of the between-class scatter to the average within-class scatter is maximized in a lower dimensional space. In terms of class scatter and class separation, the Fisher criterion takes large values when the samples are well clustered around their mean within each class and the clusters of the different classes are well separated [2]. The Fisher criterion is formulated as a function of class statistics. For these reasons, this thesis proposes a clustering algorithm based on the Fisher criterion [4].
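The behavior of the Fisher criterion described above can be illustrated with a small sketch for the ordinary (crisp) LDA case. It assumes prior-weighted scatter matrices and the trace form J = tr(S_w^{-1} S_b), which is one common way to write the criterion; the function names and the synthetic data are hypothetical, and the fuzzy version proposed in this thesis is not reproduced here.

```python
import numpy as np

def scatter_matrices(X, y):
    """Prior-weighted within-class (S_w) and between-class (S_b) scatter matrices."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        prior = len(Xc) / len(X)
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        # Class covariance, weighted by the class prior.
        S_w += prior * (centered.T @ centered) / len(Xc)
        # Outer product of the class-mean offset from the global mean.
        gap = (mean_c - mean_all)[:, None]
        S_b += prior * (gap @ gap.T)
    return S_w, S_b

def fisher_criterion(X, y):
    """Assumed trace form of the Fisher criterion: tr(S_w^{-1} S_b)."""
    S_w, S_b = scatter_matrices(X, y)
    return float(np.trace(np.linalg.solve(S_w, S_b)))

# Hypothetical two-class data: identical noise, different class separation.
rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 2))
y = np.repeat([0, 1], 100)
shift = np.where(y[:, None] == 1, np.array([1.0, 0.0]), 0.0)
J_near = fisher_criterion(noise + 1.0 * shift, y)   # class means about 1 unit apart
J_far = fisher_criterion(noise + 10.0 * shift, y)   # class means about 10 units apart
```

Because the within-class scatter is the same in both cases, `J_far` exceeds `J_near`: the criterion grows as the classes move apart relative to their spread.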
The first part of the thesis proposes a fuzzy-based clustering algorithm based on fuzzy within- and between-cluster scatter matrices. In addition, the Fisher criterion is used to form the objective function. This means that the proposed clustering algorithm takes into account not only the within- and between-cluster information of the data distribution but also the interaction between the two. Chapters 2-3 present the fuzzy-based clustering algorithm. Chapter 2 introduces some recently proposed fuzzy-based clustering algorithms. Chapter 3 details the proposed clustering algorithm based on both within- and between-cluster scatter matrices, extended from linear discriminant analysis (LDA) [4].
Figure 1. The spectral values obtained from the Indian Pine Site data set. The purple represents the Soybeans-min till patterns and the yellow represents the Corn-no till patterns. These two classes have similar spectral properties.
In hyperspectral image classification, spectral-domain based classifiers often lead to imprecise estimation of different land-cover classes that have very similar spectral properties, which makes it difficult to distinguish unlabeled patterns [20]-[21]. Fig. 1 shows the spectral values obtained from patterns of two categories in the Indian Pine Site data set: Soybeans-min till (purple color) and Corn-no till (yellow color) [22]. These two different classes have very similar spectral properties. Hence, employing these classes to train conventional classifiers (e.g., maximum likelihood classifier (ML) [2], [15], k-nearest neighbor classifier (k-NN) [2], [15], and support vector machine (SVM) [23]-[24]) leads to poor classification performance, producing a speckle-like classification map [20]-[21], [25]. Fig. 2 shows that the support vector machine (SVM) classification map of Indian Pine Site includes a number of speckle-like errors.
Figure 2. The support vector machine (SVM) classification results of the Indian Pine Site image, containing speckle-like errors.
Using a semi-supervised learning algorithm that considers both spectral and spatial-contextual information is an effective way to decrease speckle-like errors when interpreting a hyperspectral image. There are two main methods for combining spectral and spatial-contextual information.
The first is the graph-based technique [18]-[19], [26]-[32], which typically performs a regularization under which “similar” features belong to the same class. This method associates the vertices of a graph with the complete set of samples, and then builds the regularization depending on the variables defined on the vertices [18]. The other approach is to use fixed-window-based methods, such as Markov random fields [20]-[21], morphological filtering [28], or morphological leveling [29]-[30]. This approach improves the classification performance of hyperspectral images compared to pixel-wise methods [31].
Jackson and Landgrebe [20] applied a Gaussian function to the Bayesian decision rule with Markov random fields (MRF), forming a Bayesian contextual classifier based on MRF (ML_MRF), to mitigate speckle-like errors. Their method achieves improved performance in classification maps. Another study applies similar concepts to develop an MRF-based k-nearest neighbors classifier and Parzen classifier [21]. However, MRF-based classifiers are still constrained by statistical estimation (e.g., the covariance matrix of ML based on a Gaussian distribution) or by the amount of learning data.
The support vector machine [23] is a pattern classification technique proposed by Vapnik et al. Unlike traditional methods, which minimize empirical training errors, SVM attempts to minimize the upper bound of the generalization error by maximizing the margin between the separating hyperplane and the training data. Hence, SVM is a distribution-free algorithm that can overcome the problem of poor statistical estimation.
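For reference, the margin maximization mentioned above is usually written as the following hard-margin problem for linearly separable data. The notation is an assumption introduced here: x_i are the training samples, y_i in {-1, +1} their labels, and (w, b) the parameters of the separating hyperplane.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n.
```

Under these constraints the margin equals 2 / ||w||, so minimizing ||w||^2 maximizes the margin. The formulation involves no distributional assumption about the data, which is the sense in which SVM is distribution-free.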
SVM also achieves greater empirical accuracy and better generalization capabilities than other standard supervised classifiers [3], [34]-[35]. In particular, SVM performs well for high-dimensional data classification with few training samples [37]-[38], and is robust to the Hughes phenomenon [32]-[33], [35], [37]-[38].
Moreover, many studies [30]-[33] show that support vector machines that use both spectral and spatial information achieve effective and stable hyperspectral image classification. A context-sensitive semi-supervised support vector machine (CS4VM) [32] uses the context of neighborhood patterns as semi-patterns to solve the problem of noisy training patterns. In this case, noisy training patterns are mislabeled patterns that introduce distorted information to a classifier. CS4VM is a semi-supervised learning approach in which the computational cost increases as the number of semi-samples increases.
Tarabalka et al. [31] presented a spectral-spatial classification scheme based on partitional clustering techniques (SVM+EM). This approach segments an image into more homogeneous regions and combines the results of these regions using pixel-wise SVM classification. A spatial post-regularization (PR) of the classification map reduces the noise. This approach is particularly suitable for classifying images with large spatial structures, when spectral responses of different classes are dissimilar, and when classes contain a comparable number of pixels. If the spectral responses are not significantly different, this approach may result in misclassification [31].
The second part of this thesis uses two neighborhood systems, one in the original space and the other in the feature space, to modify the constraints and decision rule of the support vector machine, and proposes a spatial-contextual support vector machine to overcome speckle-like errors. Chapters 4-5 focus on the spectral-spatial classification schemes.
Chapter 4 introduces the SVM and some recent spectral-spatial classification algorithms. Chapter 5 describes two spatial-contextual support vector machine classification algorithms (SCSVMs) [39] that modify the decision function and constraints of a support vector machine (SVM) using a spatial-contextual term. In one algorithm the term is defined in the original space, based on the concept of Markov random fields; in the other it is defined in the feature space, based on k-nearest neighborhoods.
This thesis is devoted to a fuzzy-based clustering algorithm, fuzzy linear discriminant clustering (FLDC), and a semi-supervised image classification algorithm, the spatial-contextual support vector machine. First, in Chapter 3, fuzzy-based within- and between-cluster scatter matrices extended from the within- and between-class scatter matrices of LDA are introduced. Furthermore, the Fisher criterion composed of the fuzzy-based scatter matrices is used to form the objective function. FLDC considers not only the within- and between-cluster information of the data distribution but also the interaction between the two. The results of experiments on both synthetic and real data show that the proposed clustering algorithm generates similar or better clustering results than eleven popular clustering algorithms: K-means, K-medoid, FCM, Gustafson-Kessel, Gath-Geva, possibilistic c-means, fuzzy-possibilistic c-means, possibilistic fuzzy c-means, fuzzy compactness and separation, the fuzzy clustering algorithm based on a fuzzy treatment of finite mixtures of multivariate Student’s t distributions, and the fuzzy mixture of Student’s t factor analyzers model.
Then, in Chapter 5, two neighborhood systems are used to overcome the similar-spectrum problem in the support vector machine. Two semi-supervised classifiers, spatial-contextual support vector machines (SCSVMs), are proposed by modifying the constraints and the decision function of the support vector machine. To evaluate the effectiveness of SCSVM, the experiments in this study compare its performance with that of other classifiers: a support vector machine (SVM), context-sensitive semi-supervised support vector machine (CS4VM), maximum likelihood classifier (ML), Bayesian contextual classifier based on Markov random fields (ML_MRF), and k-nearest-neighbor classifier (k-NN). Experimental results show that the proposed method achieves good classification performance on well-known hyperspectral images (the Indian Pine site and the Washington, D.C. Mall data sets). The overall classification accuracy for the hyperspectral image of the Indian Pine site data set with 16 classes is 95.5%. The kappa accuracy is up to 94.9%, and the average accuracy of each class is up to 94.2%.
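The three accuracy measures reported above are conventionally computed from a confusion matrix. The sketch below shows the usual definitions, assuming rows index the true classes and columns the predicted classes; the function name and the 2-class toy matrix are hypothetical and are not data from this thesis.

```python
import numpy as np

def accuracy_measures(conf):
    """Overall accuracy, average per-class accuracy, and Cohen's kappa.

    conf[i, j] = number of samples of true class i assigned to class j
    (row/column convention is an assumption).
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    # Overall accuracy: fraction of all samples on the diagonal.
    overall = np.trace(conf) / total
    # Average accuracy: mean of the per-class accuracies.
    per_class = np.diag(conf) / conf.sum(axis=1)
    average = per_class.mean()
    # Kappa: agreement corrected for the agreement expected by chance.
    expected = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa

# Hypothetical toy confusion matrix for two classes.
toy = [[45, 5],
       [10, 40]]
overall, average, kappa = accuracy_measures(toy)
```

For the toy matrix, the overall and average accuracies are both 0.85 and kappa is 0.70; kappa is lower because it discounts the 0.50 agreement expected by chance.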