Chapter 2 Fundamentals of Feature Extraction and Machine Learning
SUV formula terms: isotope half-life, pixel value, rescale slope, body weight, time difference.
2.2 Feature Extraction
One way to describe a quantified area is by its texture features.
Feature extraction is a computer vision and image processing concept that so far has no universal or precise definition. Image textures provide intuitive measures of properties such as smoothness, roughness and regularity, which support direct image understanding. Image textures can be described by three main methods: statistical, structural and spectral. Statistical methods characterize textures as smooth, rough, and so on. Structural methods describe the arrangement of image primitives (basic texture units). Spectral analysis is based on the properties of the spectrum, detecting an image's overall periodicity by identifying high-energy peaks in the spectrum.
This study calculated the area of the nodule on the CT image, the maximum SUV on the PET image, and the mean values of the CT and PET images, and then applied the Gray-level Co-occurrence Matrix (GLCM) technique to measure the image textures.
A GLCM created from a gray-scale image is defined as the distribution of co-occurring values over that image. It is an approach for extracting second or higher order statistical texture features. It counts how often two pixels, one with grayscale intensity i and another with intensity j, occur when they are separated by a given pixel distance (Albregtsen, F., 2008).
As shown in Figure 2-1, the GLCM was formed by using the distance vectors d = [0 1], d = [-1 1], d = [-1 0] and d = [-1 -1] as a set of offsets sweeping through 0, 45, 90 and 135 degrees. Figure 2-2 shows a matrix representation of a 5 × 5 pixel image with three grey values together with its GLCM P(i, j) for d = [1, 1], where i and j are the gray values of a pixel pair taken from two different locations in the same grayscale image. Texture features can be computed from the normalized GLCM.
The normalized GLCM as the second-order histogram H(yq, yr, d) provides the probability of occurrence of a pair of gray values yq and yr separated by a distance vector d (Dhawan, A.P., 2011).
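The counting and normalization described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the implementation used in the study: the function names `glcm` and `normalize`, the 3 × 3 toy image, and the choice of a non-symmetric count are all assumptions.

```python
def glcm(image, offset, levels):
    """Count how often gray level i is followed by gray level j at the given
    offset (d_row, d_col). Non-symmetric counting is an assumption."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def normalize(counts):
    """Turn co-occurrence counts into probabilities (second-order histogram)."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("no pixel pairs found for this offset")
    return [[value / total for value in row] for row in counts]


# Hypothetical toy image with three gray values (0, 1, 2).
toy_image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
P = glcm(toy_image, (0, 1), levels=3)  # offset d = [0 1], i.e. 0 degrees
H = normalize(P)
```

Calling `glcm` once per distance vector gives one matrix per direction, which matches the four-direction setup described in the text.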
Figure 2-1 Geometrical relationships of GLCM measurements made for the four distance vectors d: 0°, d = [0, 1]; 45°, d = [-1, 1]; 90°, d = [-1, 0]; 135°, d = [-1, -1].
Figure 2-2 Illustration of texture calculation using the GLCM technique. (a) A matrix representation of a 5 × 5 pixel image with three grey values; (b) the GLCM P(i, j) for d = [1, 1].
(1) Entropy of H(yq, yr, d):
Entropy measures the texture non-uniformity; low values of entropy indicate greater structural variation.
(2) Angular second moment (ASMH) of H(yq, yr, d):
ASM measures the degree of homogeneity among textures and also represents the energy of the image. Low values of ASM indicate finer textures.
(3) Contrast of H(yq, yr, d):
Contrast measures the local intensity variation between the paired pixels.
(4) Inverse difference moment (IDMH) of H(yq, yr, d):
IDM measures the local homogeneity of the texture rather than the homogeneity of the whole image; the MATLAB formula differs in the denominator.
(5) Inverse difference moment (IDMH) of H(yq, yr, d) (in MATLAB):
In addition to the formulas as shown in (Albregtsen, F., 2008), we also applied the statistical texture formulas as described in (Dhawan A.P., 2011).
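As a rough illustration of features (1) to (5), the sketch below evaluates them on a normalized GLCM using the conventional Haralick-style definitions. These definitions, the natural logarithm used for entropy, and the function name `texture_features` are assumptions. The two inverse-difference variants differ only in the denominator: 1 + (i - j)^2 for the classical form and 1 + |i - j| for the form used by MATLAB's homogeneity property.

```python
import math


def texture_features(H):
    """Compute basic texture features from a normalized GLCM H (list of lists)."""
    entropy = asm = contrast = idm = idm_matlab = 0.0
    for i, row in enumerate(H):
        for j, p in enumerate(row):
            if p > 0:
                entropy -= p * math.log(p)    # log base is an assumption
            asm += p * p                      # angular second moment (energy)
            contrast += (i - j) ** 2 * p      # weights pairs by squared difference
            idm += p / (1 + (i - j) ** 2)     # classical inverse difference moment
            idm_matlab += p / (1 + abs(i - j))  # MATLAB-style denominator
    return {
        "entropy": entropy,
        "asm": asm,
        "contrast": contrast,
        "idm": idm,
        "idm_matlab": idm_matlab,
    }


# Hypothetical 2-level GLCM in which every gray-level pair is equally likely.
uniform = [[0.25, 0.25], [0.25, 0.25]]
features = texture_features(uniform)
```

A GLCM concentrated on the diagonal gives zero contrast and an inverse difference moment of 1, while mass far from the diagonal raises contrast and lowers both homogeneity measures.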
(7) Correlation:
The correlation is greater for similar elements within the second-order histogram.
(8) Variance:
Variance measures the heterogeneity of an image area, showing the extent of deviation from the mean of the sample.
(9) Sum Entropy (SENT):
Sum Entropy is the entropy of a new matrix produced by summing along the X-axis or Y-axis of the GLCM, from which a larger or smaller entropy can be obtained. The more intensely the gray values in the new matrix change, the more information is generated.
(10) Difference Entropy (DENT):
(2-13)
Difference Entropy is calculated from the histogram of region difference images. The minimum value of DENT indicates the most similar regions.
(11) Cluster Shade (SHADE):
The cluster shade value is weighted for the neighborhood size.
(12) Cluster Prominence (PROM):
(2-16)
Cluster prominence and cluster shade are quite similar: shade is a cubic (third-power) expression, whereas prominence is a fourth-power expression.
The following equations are based on (Haralick, R.M. et al, 1973).
(13) Sum Average (SA):
(2-17)
where Ng = number of distinct gray levels in the quantized image.
(14) Sum Variance (SV):
(2-18)
SENT is Sum Entropy, which can be calculated as shown by equation (2-11).
(15) Difference Variance (DV):
DV = variance of p_{x-y} (2-19)
(16) Information Measure of Correlation 1:
(2-20)
(17) Information Measure of Correlation 2:
(2-21)
where HX and HY are entropies of Px and Py, and
(2-22)
(2-23)
(2-24)
The following equations are based on (Clausi D.A., 2002).
(18) Inverse difference normalized (INN):
(2-25)
(19) Inverse difference moment normalized (IDN):
(2-26)
The following equations are based on (Soh, L.K. and Tsatsoulis, C.)
(20) Autocorrelation (AC):
(2-27)
(2-28)
Maximum probability (MP):
(2-29)
2.3 Genetic Algorithms
The Genetic Algorithm (GA) concept applied in our study was based on Mitchell's 1999 book An Introduction to Genetic Algorithms. Through recombination, each locus having two possible alleles (0 and 1) in the chromosome is replaced with loci in other chromosomes. Each chromosome was assigned an average fitness value. The populations of chromosomes were processed by GA to screen out chromosomes (candidates) with excellent fitness values for the next generation. We set the number of generations to 10 to select candidates for SVM classification.
The GA operation is summarized as follows: an initialized population of binary strings was evaluated and selected for the next population. Mutation and crossover operators were applied within this population to generate new strings, which were evaluated and added to the population. The process is repeated until a stopping criterion is met, that is, the population for the next generation is filled.
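The loop described above can be sketched as follows. This is only an illustrative outline under stated assumptions: binary tournament selection, one-point crossover, the rates, the population size, and the stand-in fitness function `toy_fitness` are all hypothetical choices, since the text does not specify them. Only the binary encoding, the use of crossover and mutation, and the 10 generations come from the description.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, generations=10,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve binary chromosomes; return the best one and the best-so-far history."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        next_population = []
        # Fill the next generation; this is the stopping criterion per generation.
        while len(next_population) < pop_size:
            # Binary tournament selection (an assumed selection scheme).
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            child = parent_a[:]
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation applied independently to each locus.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        best = max(population + [best], key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical stand-in fitness: agreement with an arbitrary target mask.
target = [1, 0] * 13  # 26 bits, one per candidate variable


def toy_fitness(bits):
    return sum(b == t for b, t in zip(bits, target)) / len(target)


best, history = run_ga(toy_fitness, n_bits=26)
```

In a feature-selection setting, bit k set to 1 would mean that variable k is passed to the classifier, and the fitness would be a classification score rather than the toy agreement measure used here.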
2.4 Support Vector Machine
Support vector machine (SVM) is a supervised learning model, as well as a binary linear classifier that can be used for linear and non-linear classification. It performs data classification, often achieving higher accuracy than traditional methods. Groups of data (vectors) are mapped to form two support hyperplanes, between which a classification hyperplane is constructed that maximizes the distance (margin) between them. The idea of SVM is to find this classification hyperplane and use it to classify the data.
The classification hyperplane is defined as
$w^{T}x + b = y$ (2-29)
and the two support hyperplanes, which use a constant, are written as:
The distance between the two support hyperplanes is given by
To optimize the distance between the two support hyperplanes, $\frac{2}{\lVert w \rVert}$ needs to be maximized, which means that $\frac{\lVert w \rVert^{2}}{2}$ needs to be minimized. The formula can be rewritten as
$y_i (w^{T}x_i + b) \geq 1$ (2-32)
The Lagrangian can be written as
(2-33)
Making use of the Lagrangian duality theory and optimizing L(w, b, α) with respect to w, b and α, this Lagrangian can be solved:
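To make the margin idea concrete, the sketch below checks a candidate hyperplane against the standard hard-margin formulation. It assumes class labels of +1 and -1 and support hyperplanes normalized to w^T x + b = ±1, so that the margin width is 2 / ||w||. The function names and the toy data are hypothetical.

```python
import math


def margin_width(w):
    """Distance between the two support hyperplanes w^T x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, samples, labels):
    """Check y_i * (w^T x_i + b) >= 1 for every training sample."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
        for x, y in zip(samples, labels)
    )


# Hypothetical, linearly separable toy data in two dimensions.
samples = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-3.0, 2.0)]
labels = [1, 1, -1, -1]

w, b = (1.0, 0.0), 0.0
width = margin_width(w)                          # 2 / ||w||
feasible = satisfies_margin(w, b, samples, labels)
```

Among all (w, b) that pass this feasibility check, training selects the one with the smallest ||w||, which is equivalent to the widest margin.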
Chapter 3 Diagnosis of Solitary Pulmonary Nodule in PET/CT Using GA for Feature Extraction and SVM Classifiers
3.1 Procedure
As shown in Figure 3-1, the experimental process started with obtaining the CT and PET images from the DICOM file, then handled the CT and PET images separately, and lastly combined them into PET/CT images. The suspicious nodules in the images had been diagnosed by physicians as benign or malignant. Image pre-processing was then performed, applying the thresholding method first and contour delineation second. After the images were processed, the variables of the treated image area were computed and then passed to GA and SVM for subsequent screening.
Lastly, we identified which combinations of variables indicate malignant tumors.
Figure 3-1 The flow chart of the experiment. Stages shown: CT, PET, pre-processing, image segmentation, variable computation, GA applied for feature extraction and SVM applied for classification, output (benign, malignant).
In this study, we obtained the PET/CT images of 68 patients, together with their medical diagnostic reports, from Digital Imaging and Communications in Medicine (DICOM) files provided by nuclear medicine physicians from National Cheng Kung University (NCKU) College of Medicine in Tainan City, Taiwan.
Of the 68 patients, 28 were diagnosed as having a benign nodule; the remaining 40 were diagnosed as having a malignant tumor.
3.3 Merge PET/CT
CT and PET images from the DICOM file were merged into PET/CT images in this study. Since the CT images of blood vessels, kidneys, liver and other organs produced by X-ray are grayscale images, their gray levels need to be brightened until the various human organs are displayed clearly. The purpose of PET imaging is to identify abnormal metabolism in cancerous organs and tissues, but PET images are less sharp than CT images. The two modalities also differ in resolution: the CT images are 512 × 512 pixels, while the PET images are only 168 × 168 pixels.
CT and PET each have specific benefits and shortcomings, but by merging them, physicians can diagnose and localize tumors more accurately. In order to merge the two images, the PET image was up-sampled by a factor of 3.04 in each dimension (approximately 512/168) and mapped to the CT image by making adjustments on the CT image. The position of a nodule on the CT image can then be found immediately on the PET image.
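A minimal sketch of the resolution matching step is shown below. It resamples a 168 × 168 grid to 512 × 512 with nearest-neighbor lookup. The interpolation method is an assumption, since the text only states the scale factor, and the function name `upsample_nearest` and the synthetic PET grid are hypothetical.

```python
def upsample_nearest(image, out_rows, out_cols):
    """Resize a 2D list to (out_rows, out_cols) by nearest-neighbor lookup."""
    in_rows, in_cols = len(image), len(image[0])
    return [
        [image[r * in_rows // out_rows][c * in_cols // out_cols]
         for c in range(out_cols)]
        for r in range(out_rows)
    ]


# Synthetic stand-in for a PET slice: each pixel stores its own index.
pet = [[r * 168 + c for c in range(168)] for r in range(168)]
pet_on_ct_grid = upsample_nearest(pet, 512, 512)

scale = 512 / 168  # about 3.0476, reported as 3.04 in the text
```

After this step a pixel coordinate on the CT slice indexes the corresponding location in the resampled PET slice directly.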
Figure 3-2 The axial, coronal and sagittal cross sections of a pulmonary nodule from top to bottom. (a) CT image; (b) PET image; (c) merged PET/CT image. The highlighted regions shown in the figure are the pulmonary nodules with higher SUV.
Computer-aided diagnosis technology assists physicians in interpreting medical images, such as X-ray and MRI images, in radiology. Physicians diagnose patients by viewing the axial, coronal and sagittal cross sections of a pulmonary nodule. The merged PET/CT image is shown in Figure 3-2. The highlighted regions shown in the figure are the pulmonary nodules with higher SUV.
3.4 Image Segmentation
Image segmentation is a straightforward way to analyze a digital image. It partitions the image into sets of pixels, so that computations can be carried out on each partition. Clinical physicians usually compute the maximum SUV (SUV max) of the lesion area for pathological decisions. An alternative is to calculate the average SUV of the region of interest (ROI).
However, the position, size and shape of individual tumors differ, so it is important to select the right region in the experiment.
We obtained the DICOM files from NCKU College of Medicine in Tainan City, Taiwan. We then analyzed the CT and PET images, merged them into PET/CT images, and finally selected an ROI of 31 × 31 pixels. The tumors inside the 31 × 31 pixel ROI were limited to less than 3 cm in size, because 3 cm is the upper limit considered in this study. After reading the CT and PET images into the program, we chose the tumors by adjusting the coordinates of their axial, coronal and sagittal planes, and kept every segmented image for later processing.
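Selecting a fixed 31 × 31 pixel ROI around a chosen coordinate could look like the sketch below. The function name `crop_roi`, the synthetic slice, and the decision to raise an error when the window leaves the image are assumptions made for illustration.

```python
def crop_roi(image, center_row, center_col, size=31):
    """Return a size x size window centered on (center_row, center_col)."""
    half = size // 2
    top, left = center_row - half, center_col - half
    if (top < 0 or left < 0
            or top + size > len(image) or left + size > len(image[0])):
        raise ValueError("ROI extends beyond the image")
    return [row[left:left + size] for row in image[top:top + size]]


# Synthetic 512 x 512 slice in which each pixel stores its own index.
slice_512 = [[r * 512 + c for c in range(512)] for r in range(512)]
roi = crop_roi(slice_512, 256, 256)

roi_max = max(max(row) for row in roi)                    # analogous to SUV max
roi_mean = sum(sum(row) for row in roi) / (31 * 31)       # analogous to mean SUV
```

The last two lines mirror the two summary values mentioned in the text, the maximum and the average over the region of interest.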
3.5 Preprocessing
The imaging data taken from the PET/CT images cannot be assessed directly in the experiment, because they contain a lot of unwanted noise that interferes with the correct information. Therefore, image preprocessing is needed to reduce background noise and enhance the images before the variables are computed. We set a threshold value for the boundaries of the nodules so that unnecessary areas around the nodules could be left out. The computer then automatically drew the contour by tracing the coordinate points formed by the images.
3.5.1 Thresholding
Thresholding is an intuitive choice for image segmentation because it is simple to implement. It accelerates the computation by turning the gray-scale images of pulmonary nodules into binary images in which each pixel is either black or white. We selected a threshold value of 0.5 to exclude unnecessary areas around the nodules and so prevent poor performance during computation.
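The binarization step can be illustrated with the short sketch below. It assumes the gray values have been scaled to the range 0 to 1 and that pixels strictly above the threshold become white. Both choices, as well as the function name `threshold`, are assumptions.

```python
def threshold(image, level=0.5):
    """Map each pixel to 1 (white) if it exceeds level, otherwise 0 (black)."""
    return [[1 if value > level else 0 for value in row] for row in image]


# Hypothetical 3 x 3 patch with intensities already scaled to [0, 1].
patch = [
    [0.10, 0.40, 0.20],
    [0.30, 0.90, 0.70],
    [0.05, 0.60, 0.50],
]
binary = threshold(patch, level=0.5)
```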
Figure 3-3 Thresholded image of a pulmonary nodule. (a) The original image; (b) the thresholded image using a gray-value threshold of 0.5 on image (a).
3.5.2 Contour
In this study, we used the variational formulation for geometric active contours presented by Li et al. It includes an internal energy term that penalizes the deviation of the level set function from a signed distance function, and the contour is driven toward object boundaries by minimizing the overall energy functional. The variational level set formulation has the following advantages: it speeds up the curve evolution, the level set function can be initialized with general functions that are easier to construct than a signed distance function, and it is easy to implement (Li, C., et al., 2005).
As shown in Figure 3-4, indented coordinate points were set up on the image after thresholding. The points were joined together to form a closed red line surrounding the tumor boundary. The red line was iterated along the tumor boundary according to its internal energy term until a clear dividing line was reached, which gave the contoured image.
Figure 3-4 The image contouring of a pulmonary nodule. (a) The original image; (b)-(e) a red line was iterated to form a circle around the tumor according to its internal energy term, until (f) a dividing line was caught around the tumor boundary; (g) a contoured image of a pulmonary nodule.
Contouring cannot be performed well when the tumor boundary is too close to the image frame. In such circumstances, a manual dividing line is set up to surround the desired part of the tumor to obtain the contoured image, as shown in Figure 3-5.
Figure 3-5 Manual dividing line set up to obtain a contoured image. (a) The original image; (b)-(c) a red line was iterated to circle the tumor according to its internal energy term; (d) the dividing line exceeded the image boundary; (e) a manual dividing line was set up to surround the desired part of the tumor; (f) a contoured image of a pulmonary nodule.
3.6 Variable Computation
Variables such as the mean SUV and maximum SUV of the PET image, and the area and mean of the CT image, were computed with the formulas stated in Chapter 2 after noise had been removed during pre-processing.
Physicians at NCKU College of Medicine gave us the diagnoses and medical images of two kinds of patients: those with benign nodules and those with malignant nodules.
The state of illness and the description of each patient's suspicious nodules were written in the diagnostic report, so we could find the nodules according to diagnostic keywords such as "hypermetabolic cavitated nodule", "lung mass", "pulmonary nodule", "lung nodule" and "lung lesion", as well as the size, position and SUV.
a hypermetabolic cavitated nodule in right upper lung
size: 2.2 cm, SUV: 4.8
Figure 3-6 Images of a segmented malignant pulmonary nodule with its description. From left to right: the original CT image, the thresholded and contoured image, and the PET image.
Figure 3-6 shows the description of one of the segmented malignant nodules. The full description of segmented malignant nodules is listed in Appendix A. The images and description of the segmented benign nodules in Appendix B are arranged in the same way as shown in Appendix A.
The left column of the table consists of three images, from left to right: the original CT image, the image after preprocessing, and the PET image corresponding to the CT image. The right column of the table lists the information obtained from the images in the left column. We then used this information to locate the tumor, segment it, and compute the variables with the related formulas.
3.7 Variable Selection
GA was applied to the variables computed in the experiment to select features for the classification of tumors. At generation 0, the frequency of bit 1 for each variable is randomly distributed. By the 10th generation, the frequencies of some variables become 0, while others become 1 (Figure 3-8).
This is known as the local optimum resulting from GA development. In Figure 3-7, the horizontal axis shows the GA generations, from 0 to 10; the vertical axis shows the fitness of the variables, which is the area under the ROC curve, ranging between 0 and 1.
Figure 3-7 Maximum and average fitness.
Figure 3-8 The frequency of bit 1 at (a) generation = 0, and (b) generation = 10 resulting from GA development.
3.8 Cross-validation
Cross-validation, also known as rotation estimation, is a practical statistical method that partitions a data sample into smaller subsets. One subset is analyzed first, while the others are used to confirm and verify that analysis. The subset used for analysis is called the training set, while the others are called the validation set or testing set. Usually multiple rounds of cross-validation are performed over the subsets and the validation results are averaged. Common types of cross-validation include K-fold cross-validation, 2-fold cross-validation, repeated random sub-sampling validation and leave-one-out cross-validation.
In K-fold cross-validation, the initial sample is randomly partitioned into k subsamples. A single subsample is retained as testing data, and the other k-1 subsamples are used as training data. The cross-validation is repeated k times, once with each subsample as the testing data, and a single estimate is finally obtained by averaging the k results (Moreno-Torres, J.G. et al., 2010). A 10-fold cross-validation is commonly used; we performed a 5-fold cross-validation instead because of the small number of variables used in our GA and SVM classification.
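The partitioning scheme described above can be sketched as follows, using 68 samples and k = 5 as in this study. The function name `k_fold_indices`, the fixed random seed, and the way the remainder is distributed across folds are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random partition of the sample
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsamples
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(68, 5))
fold_sizes = [len(test) for _, test in splits]
```

Each sample is used for testing exactly once, and the per-fold results would then be averaged into a single estimate.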
Chapter 4 Experiment Results
Texture calculations were done on the suspicious nodules taken from the PET/CT images, and the results were passed to the genetic algorithm program for feature selection before proceeding to the SVM classifier.
A set of input variables was first randomly selected and evaluated by the GA program, and the SVM classifier then searched for the optimal hyperplane to classify the nodules.
We ran 5-fold cross-validation five times for each set of variables produced by GA screening and SVM classification. All 26 input variables were subjected to 5-fold cross-validation. Finally, the sensitivity, specificity and accuracy were obtained from each experiment and used to calculate their means and standard deviations.
Sensitivity
= number of true positives / (number of true positives + number of false negatives)
= probability of a positive test, given that the patient is ill
Specificity
= number of true negatives / (number of true negatives + number of false positives)
= probability of a negative test, given that the patient is well
Accuracy rate
= (number of benign nodules × sensitivity + number of malignant nodules × specificity) / total number of nodules
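The three definitions above translate directly into code. The sketch below transcribes them as written, including the accuracy rate as the weighted combination given in the text. The function names and the confusion counts in the example are hypothetical and chosen only for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Probability of a positive test, given that the patient is ill."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Probability of a negative test, given that the patient is well."""
    return true_neg / (true_neg + false_pos)


def accuracy_rate(n_benign, n_malignant, sens, spec):
    """Accuracy rate exactly as defined in the text above."""
    return (n_benign * sens + n_malignant * spec) / (n_benign + n_malignant)


# Hypothetical confusion counts for one run.
sens = sensitivity(true_pos=23, false_neg=6)
spec = specificity(true_neg=28, false_pos=8)
acc = accuracy_rate(n_benign=28, n_malignant=40, sens=sens, spec=spec)
```

Because the accuracy rate is a weighted average of sensitivity and specificity, it always lies between the two.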
Each of the 22 GLCM variables (that is, excluding the first 4 variables) can be computed for four different distance vectors. If the variables for all distance vectors were pooled, the number of variables would exceed the number of nodules in the experiment, and the most appropriate variables could not be screened properly.
Therefore, the experiments were done separately for each direction. We repeated the experiment five times in each direction, obtained the sensitivity, specificity and accuracy, and calculated their means and standard deviations, as shown in Table 4-1.
Table 4-1 Sensitivity, specificity and accuracy of the GA classifications in each direction over five runs, with their mean and standard deviation (SD)

Distance vector [0,1]     (1)     (2)     (3)     (4)     (5)     Mean    SD
Sensitivity (%)           79.31   68.97   68.97   68.97   72.41   71.73   4.49
Specificity (%)           77.78   69.44   72.22   69.44   75.00   72.78   3.62
Accuracy (%)              78.63   69.18   70.42   69.18   73.57   72.20   4.02

Distance vector [1,1]     (1)     (2)     (3)     (4)     (5)     Mean    SD
Sensitivity (%)           72.41   72.41   68.97   65.52   65.52   68.97   3.45
Specificity (%)           69.44   72.22   69.44   61.11   72.22   68.89   4.56
Accuracy (%)              71.09   72.33   69.18   63.55   68.51   68.93   3.37

Distance vector [-1,1]    (1)     (2)     (3)     (4)     (5)     Mean    SD
Sensitivity (%)           68.97   55.17   68.97   72.41   58.62   64.83   7.48
Specificity (%)           69.44   55.56   69.44   72.22   58.33   65.00   7.50
Accuracy (%)              69.18   55.34   69.18   72.33   58.49   64.90   7.49

Distance vector [-1,-1]   (1)     (2)     (3)     (4)     (5)     Mean    SD
Sensitivity (%)           65.52   62.07   65.52   68.97   72.41   66.90   3.93
Specificity (%)           63.89   61.11   69.44   66.67   72.22   66.67   4.39
Accuracy (%)              64.79   61.64   67.27   67.94   72.33   66.79   3.96
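The summary columns of Table 4-1 can be reproduced from the five run values. The sketch below uses the sensitivity values for distance vector [0,1] and Python's standard `statistics` module. With the sample standard deviation (divisor n - 1) the tabulated values are recovered after rounding, whereas the population form would give a smaller figure.

```python
import statistics

# Sensitivity (%) for distance vector [0,1], runs (1) to (5), from Table 4-1.
runs = [79.31, 68.97, 68.97, 68.97, 72.41]

mean_value = statistics.mean(runs)
sample_sd = statistics.stdev(runs)        # divisor n - 1
population_sd = statistics.pstdev(runs)   # divisor n, shown for comparison
```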
The final statistical results of the GA screening were then calculated.
Each result showed that, out of the 26 variables, some were used and some were not. The results were summed per variable. The experiment was run five times for each direction, represented by [0, 1], [1, 1], [-1, 0] and [-1, -1] respectively, and each run included five rounds of cross-validation with SVM. Therefore, each variable appears at most 25 times in each direction, as shown in Table 4-2, which lists the number of occurrences of each variable.
Table 4-2 The number of occurrences of each variable

No.   Variable name                               Occurrence
                                                  [0,1]   [1,1]   [-1,0]   [-1,-1]
20    Sum variance                                5       8       4        6
21    Difference variance                         21      20      21       22
22    Difference entropy                          8       7       8        7
23    Information measure of correlation1         16      17      17       13
24    Information measure of correlation2         18      15      16       17
25    Inverse difference normalized (INN)         6       11      8        5
26    Inverse difference moment normalized        25      21      19       23
After calculating the number of occurrences of each variable, we selected the variables with more than 15, 20 and 22 occurrences. The number of variables occurring more than 15 times was 23. Those occurring more than 20 and 22 times numbered 14 and 5, respectively. Table 4-3 lists these variables.