CHAPTER 2. INTEGRATION OF INDEPENDENT COMPONENT ANALYSIS
2.2 MATERIALS AND METHODS
2.2.3 DATA ANALYSIS
2.2.3.1 INDEPENDENT COMPONENT ANALYSIS (ICA)
Independent component analysis (ICA) is a method that transforms observed
multivariate data into statistically independent components (ICs) and expresses them as
linear combinations of the observed variables. ICA requires that the number of receptors
be greater than or equal to the number of sources, and that the signals emitted by the
sources follow non-Gaussian distributions (Hyvärinen and Oja, 2000). The ICs are
latent variables that cannot be observed directly, and the mixing matrix is likewise
unknown. The purpose of the ICA algorithm is to determine the mixing matrix (M) or
the separating matrix (W). In order to predict the unknown sources, it is assumed that
$W = M^{-1}$, so that
$\hat{s} = Wx = M^{-1}Ms$ (2.1)
where ŝ is the estimate of the sources (s) and x represents the observed spectra of
the objects.
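As a small numerical illustration of Eq. (2.1), the following sketch mixes two made-up
source signals with a known 2 × 2 mixing matrix and recovers them with W = M^{-1}. It
assumes NumPy is available; the signal shapes and matrix values are hypothetical and
chosen only for illustration. In practice M is unknown and has to be estimated by an ICA
algorithm, a step that is not shown here.

```python
import numpy as np

# Hypothetical example: two non-Gaussian sources sampled at 200 points.
t = np.linspace(0.0, 1.0, 200)
s = np.vstack([
    np.sign(np.sin(2 * np.pi * 5 * t)),   # square wave
    2.0 * ((t * 7) % 1.0) - 1.0,          # sawtooth
])

# Assumed mixing matrix M (known in this toy case, unknown in practice).
M = np.array([[0.8, 0.3],
              [0.4, 0.9]])

x = M @ s                 # observed mixtures, x = Ms
W = np.linalg.inv(M)      # separating matrix, W = M^{-1}
s_hat = W @ x             # estimated sources, s_hat = Wx

# Largest absolute difference between estimated and true sources.
max_error = float(np.max(np.abs(s_hat - s)))
print(max_error)
```

With an exactly known M the recovery is limited only by floating-point error; an
estimated W would typically recover the sources only up to scaling and ordering.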
In the present study, the JADE (joint approximate diagonalization of eigenmatrices)
algorithm (Cardoso and Souloumiac, 1993; Cardoso, 1999) was employed to conduct
ICA. In general, JADE processes spectral data rapidly because it works off-the-shelf,
which is an advantage over other multivariate approaches such as principal component
regression (PCR) and partial least squares regression (PLSR). Assuming that the
measured spectra of the unknown mixtures are linear combinations of the spectra of
their components, they can be expressed as:
$A = MI$ (2.2)
The spectra of all samples are linear combinations of m ICs. Matrix $A_{l \times n}$
holds the spectra of l samples, each containing n values; $I_{m \times n}$ is the matrix
of ICs, comprising m independent components; and $M_{l \times m}$ is the mixing matrix,
which is related to the component concentrations in the mixture. The linear relationship
between the mixing matrix (M) and the component concentration (C) can be expressed as:
$C = MB$ (2.3)
where B is the matrix of regression coefficients. The concentration of each component
in the mixture can thus be determined by combining ICA with linear regression.
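The regression step of Eq. (2.3) can be sketched as an ordinary least-squares fit. The
example below is a minimal sketch, not the implementation used in this study: it assumes
NumPy, uses no intercept term (matching the form C = MB), and the function names and
numerical values are hypothetical.

```python
import numpy as np

def fit_regression_coefficients(M_cal, C_cal):
    """Least-squares estimate of B in C = MB (no intercept, as in Eq. 2.3)."""
    B, *_ = np.linalg.lstsq(M_cal, C_cal, rcond=None)
    return B

def predict_concentration(M_new, B):
    """Predict concentrations for new mixing-matrix rows."""
    return M_new @ B

# Hypothetical calibration data: 5 samples, m = 2 ICs, 1 component.
M_cal = np.array([[1.0, 0.2],
                  [2.0, 0.1],
                  [3.0, 0.4],
                  [4.0, 0.3],
                  [5.0, 0.5]])
B_true = np.array([[2.0],
                   [-1.0]])
C_cal = M_cal @ B_true            # noise-free concentrations for the demo

B = fit_regression_coefficients(M_cal, C_cal)
C_pred = predict_concentration(np.array([[6.0, 0.2]]), B)
print(B.ravel(), C_pred.ravel())
```

Because the demo data are noise-free, the fitted B reproduces the coefficients used to
generate them; with measured data the fit would only approximate the relationship.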
2.2.3.2 PARTIAL LEAST SQUARES REGRESSION (PLSR)
Partial least squares regression (PLSR), a typical method in chemometrics (Wold et
al., 2001), has been widely applied in the chemical and engineering fields. When PLSR
is applied to spectral analysis, the spectra can be regarded as being composed of several
principal components (PCs), each of which is expressed as a ‘factor’ in the PLSR
algorithm. The factors are ordered by their influence, with the more important factors
ranked earlier (factor 1, factor 2, and so on). Since PLSR uses information from
spectral bands, the analysis results can be improved by selecting an appropriate number
of factors and specific wavelength ranges. To avoid overfitting the PLSR model with too
many factors, the factors were selected based on the following principles in this study:
(1) the maximum number of factors was limited to 1/10 of the calibration set size plus
2 to 3 factors; (2) new factors were not added if they caused a rise in the prediction
error; and (3) new factors were not added if they resulted in a standard error of
validation (SEV) smaller than the standard error of calibration (SEC).
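The three factor-selection principles above can be sketched as a simple stopping
procedure. This is one possible reading rather than the procedure implemented in the
chemometric software: it assumes that the prediction error in principle (2) is the SEV,
that the first factor is always retained, and that the 1/10 limit uses integer division.
The function names and error values are hypothetical.

```python
def max_factors(n_calibration, extra=3):
    """Principle (1): one tenth of the calibration set size plus 2 or 3 factors."""
    return n_calibration // 10 + extra

def select_n_factors(sec, sev, n_calibration, extra=3):
    """Return the number of factors to retain.

    sec[k] and sev[k] are the errors of the model built with k + 1 factors.
    """
    limit = min(max_factors(n_calibration, extra), len(sev))
    chosen = 1                      # assumption: the first factor is always kept
    for k in range(1, limit):       # candidate model has k + 1 factors
        if sev[k] > sev[k - 1]:     # principle (2): prediction error rises
            break
        if sev[k] < sec[k]:         # principle (3): SEV falls below SEC
            break
        chosen = k + 1
    return chosen

# Hypothetical error curves for models with 1 to 5 factors.
sec = [1.0, 0.8, 0.6, 0.5, 0.4]
sev = [1.2, 0.9, 0.7, 0.75, 0.6]
n_selected = select_n_factors(sec, sev, n_calibration=40)
print(n_selected)
```

In this made-up case the fourth factor is rejected because it raises the SEV, so three
factors are retained.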
2.2.3.3 SPECTRAL PRETREATMENTS
The purpose of spectral pretreatment is to eliminate spectral variation that is not
caused by the chemical information contained in the samples (de Noord, 1994). For
the raw NIR spectra of sucrose solutions and wax jambu, three different spectral
pretreatments were employed in this study: (1) normalization; (2) 1st derivative with
normalization; and (3) 2nd derivative with normalization. Normalization scaled the
spectral absorbance of all samples to fall within an interval of -1 to 1. With a view to
applying ICA to fast on-line inspection of fruits, the procedure of selecting the best
pretreatment parameters, including the number of smoothing points and the derivative
gap, was not employed, in order to save computational time. The derivative gap was set
at a minimal value of 2, so as to retain as many wavelength values as possible as inputs
for the model.
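The pretreatments described above can be sketched as follows. The exact formulas are not
given in the text, so this sketch makes several assumptions: normalization is taken as
min-max scaling of each spectrum to the interval -1 to 1, the derivative is taken as a
simple difference across the gap, and the derivative is applied before normalization. All
function names and values are hypothetical.

```python
def normalize(spectrum):
    """Min-max scale one spectrum to [-1, 1] (assumed form of normalization)."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        return [0.0 for _ in spectrum]   # flat spectrum: map to zero
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in spectrum]

def gap_derivative(spectrum, gap=2):
    """Difference across `gap` points; the output is shorter by `gap` values."""
    return [spectrum[i + gap] - spectrum[i] for i in range(len(spectrum) - gap)]

def pretreat(spectrum, order=0, gap=2):
    """order=0: normalization only; order=1 or 2: derivative, then normalization."""
    out = list(spectrum)
    for _ in range(order):
        out = gap_derivative(out, gap)
    return normalize(out)

# Hypothetical absorbance values at six wavelengths.
spectrum = [1, 2, 4, 7, 11, 16]
normalized = pretreat(spectrum, order=0)
first_derivative = pretreat(spectrum, order=1)
print(normalized, first_derivative)
```

Each derivative step shortens the spectrum by the gap, which is why a small gap keeps
more wavelength values available as model inputs.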
2.2.3.4 MODEL ESTABLISHMENT
This study used the mathematical software MATLAB (The MathWorks, Inc., Natick,
MA, U.S.A.) to write ICA programs based on the JADE algorithm for establishing ICA
spectral calibration models. The results of ICA were compared with the PLSR spectral
calibration models built with the WinISI II chemometric software package (Infrasoft
International, LLC., Port Matilda, PA, U.S.A.). The analysis procedure of both
ICA and PLSR for wax jambu and sucrose solution samples included: (1) selecting the
calibration set and validation set, (2) applying spectral pretreatments, and (3)
determining the best calibration model. Since the sucrose solutions were mixtures of
sucrose powder and water, their composition was rather simple. Therefore, the data of
the full wavelength range (400 to 2498 nm) were used to compare the tolerance of ICA
and PLSR to noise, since spectral bands with more noise (e.g. 2200 to 2498 nm) often
affect the analysis results. For wax jambu, whose composition is more complicated than
that of sucrose solutions, specific wavelength ranges had to be identified, which
required additional correlation analysis between wavelengths and sugar content. All of
the sucrose solution and wax jambu samples were used, respectively, in the analysis to
assess the tolerance of ICA and PLSR. The samples were divided into calibration and
validation sets at a ratio of 2:1 according to their sugar content. All samples were
ranked in ascending order of sugar content; within each consecutive group of three, the
first and second samples were assigned to calibration and the third to validation. The
same calibration and validation sets were used for both ICA and PLSR analyses.
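The 2:1 assignment of samples to the calibration and validation sets can be sketched as
below. The function name and the sugar content values are hypothetical; the sketch
returns sample indices so that the same split can be reused for both modelling
approaches.

```python
def split_calibration_validation(sugar_contents):
    """Rank samples by sugar content; every third ranked sample goes to validation."""
    order = sorted(range(len(sugar_contents)), key=lambda i: sugar_contents[i])
    calibration, validation = [], []
    for rank, idx in enumerate(order):
        if rank % 3 == 2:            # third sample of each group of three
            validation.append(idx)
        else:                        # first and second samples of each group
            calibration.append(idx)
    return calibration, validation

# Hypothetical sugar contents of seven samples.
sugar = [12.1, 8.4, 10.0, 9.2, 11.5, 7.9, 13.3]
cal_idx, val_idx = split_calibration_validation(sugar)
print(cal_idx, val_idx)
```

Splitting on the ranked values spreads both sets across the full range of sugar content
rather than leaving one set concentrated at either end.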
After the respective spectral calibration models of sucrose solution and wax jambu
were built, these models were used to predict the sugar contents of the calibration
and validation sets. Predictive performance was evaluated using the following
statistical parameters: the coefficient of correlation of the calibration set (Rc), the
standard error of calibration (SEC), the coefficient of correlation of the validation
set (rv), the standard error of validation (SEV), bias, and the ratio of [standard error
of] performance to [standard] deviation (RPD), as defined by:
where Yc and Yv represent the estimated sugar contents of the calibration set and the
validation set, respectively. Yr is the reference sugar content, nc and nv are the numbers
of samples in the calibration set and validation set, and SD is the standard deviation of
sugar content within the validation set. RPD is one of the indices used to evaluate the
performance of a model. A greater RPD value indicates better model performance, and a
sufficiently large RPD is considered adequate for analytical purposes in most NIR
spectroscopy applications for agricultural products (Williams and Sobering, 1993).
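The evaluation statistics can be illustrated with a short sketch. The definitions used
here are commonly used forms and are assumptions, not necessarily the exact expressions
of this study: the standard error is computed as the root mean square of the differences
between estimated and reference values, bias as their mean difference, SD as the sample
standard deviation of the reference values, and RPD as SD divided by SEV. All function
names and data values are hypothetical.

```python
import math

def correlation(pred, ref):
    """Pearson coefficient of correlation between estimated and reference values."""
    mp, mr = sum(pred) / len(pred), sum(ref) / len(ref)
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_r = sum((r - mr) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)

def standard_error(pred, ref):
    """Assumed form of SEC/SEV: root mean square of (estimated - reference)."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def bias(pred, ref):
    """Mean difference between estimated and reference values."""
    return sum(p - r for p, r in zip(pred, ref)) / len(ref)

def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

# Hypothetical validation set: reference (Yr) and estimated (Yv) sugar contents.
y_ref = [10.0, 12.0, 14.0, 16.0]
y_val = [10.5, 11.5, 14.5, 15.5]

rv = correlation(y_val, y_ref)
sev = standard_error(y_val, y_ref)
b = bias(y_val, y_ref)
rpd = sample_sd(y_ref) / sev        # assumed definition: RPD = SD / SEV
print(rv, sev, b, rpd)
```

Applying the same functions to the calibration set estimates would give Rc and SEC.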