
Chapter 3 Methodology

3.4 Morphological methods for diagnosing oral cancer

3.4.2 Classification by K-Nearest Neighbor

K-nearest Neighbor (KNN) is an instance-based machine-learning algorithm. KNN classifies a test sample according to the closest training samples in the feature space. For each test sample, we calculated the distance between it and every training sample. Among the k closest training samples, the category held by the majority was assigned to the test sample.
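The majority-vote rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function name `knn_classify`, the Euclidean distance and the value k=3 are assumptions, because the text does not specify the distance measure or k.

```python
import math
from collections import Counter


def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Label test_vec by majority vote among its k nearest training vectors."""
    # Euclidean distance to every training vector (assumed metric).
    dists = [math.dist(test_vec, v) for v in train_vecs]
    # Indices of the k closest training vectors.
    nearest = sorted(range(len(train_vecs)), key=dists.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional feature vectors (hypothetical data).
train_vecs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["A", "A", "A", "B", "B", "B"]

label_near_a = knn_classify((0.15, 0.15), train_vecs, train_labels, k=3)
label_near_b = knn_classify((0.95, 1.0), train_vecs, train_labels, k=3)
```

In this work the feature vectors would be the 121-element vectors obtained from 11x11 matrices; the two-element vectors here only keep the example readable.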

Figure 3-5: The flow chart of analysis by KNN classifying.

When we observed sections of cancerous tissue, we found that the image of the lamina propria changed because of erosion by the cancer cells.

Because the morphology differed from sample to sample, we constructed training data and used KNN to quantify the change of the image in the lamina propria. When we classified the nuclei in the basal-cell layer and the lamina propria by KNN, the correct rate of classification was lower for cancerous samples, because erosion by the cancer cells changed the image of the lamina propria.

First, we had to establish the training data and the test data. To establish the training data, we chose nuclei of the basal-cell layer and of the prickle-cell layer in normal tissue and cancer cells in cancerous tissue, which we labeled as group A. Then we chose some nuclei of the lamina propria, which we labeled as group B. At every chosen coordinate, we took an 11x11 matrix, similar to the size of a cell, and expanded it into a one-dimensional matrix, as in Figure 3-6; we recorded the one-dimensional matrix as training data. To establish the test data, we moved an 11x11 matrix across the image. When the average intensity of the 5x5 matrix in the center was lower than 80% of the average intensity of the 11x11 matrix, we recorded the 11x11 matrix as a nucleus. After scanning the image, we expanded each recorded 11x11 matrix into a one-dimensional matrix, as in Figure 3-6, and recorded it as test data.
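The scanning step for the test data can be sketched as below. This is a simplified illustration under stated assumptions: the image is a plain list of rows of gray values, "intensity" is taken to mean the average over the matrix, and the names `scan_for_nuclei`, `block_at` and `flatten` are hypothetical.

```python
def block_at(image, row, col, size):
    """Return the size x size block centred on (row, col) as a list of rows."""
    half = size // 2
    return [r[col - half:col + half + 1]
            for r in image[row - half:row + half + 1]]


def flatten(block):
    """Expand a two-dimensional block into a one-dimensional list."""
    return [value for r in block for value in r]


def mean(block):
    values = flatten(block)
    return sum(values) / len(values)


def scan_for_nuclei(image, outer=11, inner=5, ratio=0.8):
    """Slide an outer x outer window; keep windows with a dark centre."""
    half = outer // 2
    found = []
    for row in range(half, len(image) - half):
        for col in range(half, len(image[0]) - half):
            window = block_at(image, row, col, outer)
            centre = block_at(image, row, col, inner)
            # Centre darker than 80% of the window average: record as nucleus.
            if mean(centre) < ratio * mean(window):
                found.append(((row, col), flatten(window)))
    return found


# Hypothetical 15x15 bright image with one dark 5x5 blob in the middle.
image = [[100] * 15 for _ in range(15)]
for r in range(5, 10):
    for c in range(5, 10):
        image[r][c] = 10

detections = scan_for_nuclei(image)
```

Each detection holds the window centre and a 121-element vector, which is the form that would be passed to the KNN classifier. Neighbouring window positions around one nucleus can all satisfy the criterion, so a real pipeline would probably need to merge overlapping detections.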

Figure 3-6: Expanding an 11x11 matrix into a one-dimensional matrix.

Using the training data, we classified the test data by KNN: for each set of test data, we calculated the vector difference to each set of training data in the feature space. Lastly, we calculated the correct rate of classification of the lamina propria by KNN.

Before analyzing the spectrum, we needed to remove the noise. For the halogen spectrum, both the dark noise and the light noise needed to be removed; for the fluorescence spectrum, only the dark noise was removed because there was no data for the light source.
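The text does not give the formula used for noise removal. A common form, assumed here purely for illustration, subtracts the dark spectrum and divides by the dark-corrected light-source spectrum, so that the halogen result lies between 0 and 1. The function names are hypothetical.

```python
def halogen_transmittance(raw, dark, light):
    """Assumed correction: (raw - dark) / (light - dark), per wavelength."""
    return [(r - d) / (l - d) for r, d, l in zip(raw, dark, light)]


def fluorescence_corrected(raw, dark):
    """Only the dark noise is subtracted; no light-source spectrum exists."""
    return [r - d for r, d in zip(raw, dark)]


# Hypothetical three-channel spectra.
raw = [60, 110, 10]
dark = [10, 10, 10]
light = [110, 110, 110]

transmittance = halogen_transmittance(raw, dark, light)
fluorescence = fluorescence_corrected(raw, dark)
```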

After removing the noise, we marked nuclei in the image of the sample and recorded their coordinates. From the coordinates chosen by hand, we obtained the spectra of these nuclei. We had 11 methods to analyze these spectral data: four in halogen, four in fluorescence with 330~385nm excitation and three in fluorescence with 470~490nm excitation.

Figure 3-7: The flow chart of spectral analysis.

3.5.1 Intensity in the specific wavelength range

In some wavelength ranges, the spectral curve showed an obvious characteristic. To find the specific wavelength range, we compared the spectral intensity in every 30nm wavelength range and selected the range with the largest difference in intensity. For example, the halogen transmittance of normal cells was higher than that of cancer cells in the range 460~480nm, as shown in Figure 3-8a, and we also compared the fluorescence intensity with 470~490nm excitation in the wavelength range 540~560nm, as shown in Figure 3-8d.
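The search for the most discriminating wavelength range can be sketched as follows. This is an illustrative sketch only: the function names, the non-overlapping stepping of the window and the synthetic spectra are assumptions.

```python
def band_mean(wavelengths, intensities, lo, hi):
    """Average intensity over samples with lo <= wavelength <= hi."""
    picked = [i for w, i in zip(wavelengths, intensities) if lo <= w <= hi]
    return sum(picked) / len(picked)


def most_discriminating_band(wavelengths, normal, cancer, width=30):
    """Step a fixed-width window along the spectrum and return the window
    with the largest absolute difference between the two band means."""
    best, best_diff = None, -1.0
    lo = wavelengths[0]
    while lo + width <= wavelengths[-1]:
        diff = abs(band_mean(wavelengths, normal, lo, lo + width)
                   - band_mean(wavelengths, cancer, lo, lo + width))
        if diff > best_diff:
            best, best_diff = (lo, lo + width), diff
        lo += width
    return best, best_diff


# Synthetic spectra sampled every 10 nm (hypothetical values):
# the two curves differ only between 460 nm and 490 nm.
wavelengths = list(range(400, 701, 10))
cancer = [0.5 for _ in wavelengths]
normal = [0.8 if 460 <= w <= 490 else 0.5 for w in wavelengths]

best_band, best_diff = most_discriminating_band(wavelengths, normal, cancer)
```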

Figure 3-8: The spectrum of one sample. (a) Halogen transmittance. (b) Fluorescent intensity with 330-385nm excitation. (c) Fluorescent intensity compensated by spline with 330-385nm excitation. (d) Fluorescent intensity with 470-490nm excitation. Red line is normal cells, and blue line is cancer cells.

A: the peak in wavelength range 460~480nm. B: the peak in wavelength range 700~710nm. C: the intensity in wavelength range 540~570nm. D: the intensity in wavelength range 680~710nm. E: the maximum of spectral curve. F: the full width at half maximum (FWHM) of spectral curve. G: the peak in wavelength range 540~570nm. H: the FWHM of spectral curve.

3.5.2 Ratio of intensity in two different wavelength ranges

Generally, there were several peaks in the spectral curve. We calculated the ratio of the intensity between different peaks or specific wavelength ranges. In our research, we calculated the ratio of the average halogen transmittance in the range 460~480nm to that in the range 700~710nm.
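The ratio of the two band averages can be written as a short sketch. The function names and the synthetic transmittance values are assumptions for illustration; only the two wavelength ranges come from the text.

```python
def band_mean(wavelengths, values, lo, hi):
    """Average of the values whose wavelength lies in [lo, hi]."""
    picked = [v for w, v in zip(wavelengths, values) if lo <= w <= hi]
    return sum(picked) / len(picked)


def band_ratio(wavelengths, values, band_a=(460, 480), band_b=(700, 710)):
    """Ratio of the average in band_a to the average in band_b."""
    return (band_mean(wavelengths, values, *band_a)
            / band_mean(wavelengths, values, *band_b))


# Synthetic halogen transmittance sampled every 5 nm (hypothetical values).
wavelengths = list(range(400, 801, 5))
transmittance = [0.6 if 460 <= w <= 480 else 0.3 if 700 <= w <= 710 else 0.5
                 for w in wavelengths]

ratio = band_ratio(wavelengths, transmittance)
```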

3.5.3 Wavelength of the specific peak

The wavelength of the peak may differ between spectral curves. In halogen, we compared the wavelength of the maximum in the wavelength range 700~710nm. We tried to find the shift of the peak caused by the cancer cells.
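Locating the peak inside a fixed range and comparing its position between two curves can be sketched as below. The function name and the parabolic test curves are hypothetical.

```python
def peak_wavelength(wavelengths, intensities, lo=700, hi=710):
    """Wavelength of the largest intensity among samples in [lo, hi]."""
    in_range = [(w, i) for w, i in zip(wavelengths, intensities)
                if lo <= w <= hi]
    return max(in_range, key=lambda pair: pair[1])[0]


# Two synthetic curves with peaks at 703 nm and 707 nm (hypothetical).
wavelengths = list(range(690, 721))
normal = [-(w - 703) ** 2 for w in wavelengths]
cancer = [-(w - 707) ** 2 for w in wavelengths]

shift = peak_wavelength(wavelengths, cancer) - peak_wavelength(wavelengths, normal)
```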

3.5.4 Area under spectral curve

By the integration of the spectral curve, we calculated the area under the curve. Before the integration, we needed to normalize the spectral curve. First, we recorded the maximum of the spectral curve, and then we divided the intensity by the maximum. We wanted to diagnose cancer cells by the change of area under the spectral curve.
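The normalization and integration can be sketched as follows. The trapezoidal rule is an assumption, since the text does not state which integration rule was used, and the function name is hypothetical.

```python
def normalized_area(wavelengths, intensities):
    """Divide the curve by its maximum, then integrate by the trapezoidal rule."""
    peak = max(intensities)
    norm = [i / peak for i in intensities]
    area = 0.0
    for k in range(1, len(wavelengths)):
        step = wavelengths[k] - wavelengths[k - 1]
        area += 0.5 * (norm[k] + norm[k - 1]) * step
    return area


# A triangular test curve (hypothetical): the area after normalization
# depends on the shape of the curve, not on its absolute intensity.
wavelengths = [500, 550, 600]
area = normalized_area(wavelengths, [0, 4, 0])
area_scaled = normalized_area(wavelengths, [0, 40, 0])
```

Because of the normalization, the two curves above give the same area although one is ten times brighter.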

3.5.5 Maximum of spectral curve compensated by spline

In fluorescence with 330~385nm excitation, two peaks occurred in the wavelength ranges 540~560nm and 700~710nm. The decrease between the peaks might have been caused by the absorption of hemoglobin, and the spectral curve of fluorescence excitation approximates a Gaussian distribution and a parabola [43]. We compensated for the decrease and fitted the spectral curve as a parabola by spline. Spline used the slope of the left and right borders to fit a

Figure 3-9: Spectrum compensated by Spline. Solid line is fluorescent spectrum with 330-385nm excitation, dotted line is the spectrum curve compensated by Spline.

3.5.6 Full Width at Half Maximum (FWHM) of spectral curve

Full width at half maximum (FWHM) refers to the width of a peak measured between the two points at which the curve reaches half of its maximum value, as shown in Figure 3-10. After the spectral curve in fluorescence with 330~385nm excitation was compensated by spline, we calculated the FWHM of the curve.
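A sketch of the FWHM calculation on a sampled, single-peaked curve is given below. The linear interpolation of the half-maximum crossings and the function name `fwhm` are assumptions; the sketch works on any sampled curve and does not include the spline compensation.

```python
def fwhm(wavelengths, intensities):
    """Full width at half maximum of a sampled single-peaked curve."""
    peak = max(intensities)
    if peak <= 0:
        raise ValueError("curve has no positive peak")
    half = peak / 2
    k_peak = intensities.index(peak)

    # Walk outwards from the peak until the curve drops to half or below.
    left = k_peak
    while left > 0 and intensities[left] > half:
        left -= 1
    right = k_peak
    while right < len(intensities) - 1 and intensities[right] > half:
        right += 1
    if intensities[left] > half or intensities[right] > half:
        raise ValueError("curve does not fall to half maximum on both sides")

    # Linear interpolation of the two half-maximum crossings.
    w, i = wavelengths, intensities
    x_left = w[left] + (half - i[left]) / (i[left + 1] - i[left]) * (w[left + 1] - w[left])
    x_right = w[right - 1] + (i[right - 1] - half) / (i[right - 1] - i[right]) * (w[right] - w[right - 1])
    return x_right - x_left


# Triangular test curve (hypothetical): half maximum is crossed at 525 and 575.
width = fwhm([500, 525, 550, 575, 600], [0, 2, 4, 2, 0])
```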

Figure 3-10: Full width at half maximum.

3.6 Sensitivity and specificity

Sensitivity means the correct rate of determining the normal cells as normal cells, and specificity means the correct rate of determining the cancer cells as cancer cells.

Using the Gaussian distribution, we determined the critical value of each method, for example, by using the intersection of the two curves. We defined the right region as normal tissue and the left region as cancerous tissue. Then we could determine the sensitivity and specificity: sensitivity was the ratio of the area under the solid line to the left of the critical point to the total area under the solid line, as shown in Figure 4-9; specificity was the ratio of the area under the dotted line to the right of the critical point to the total area under the dotted line, as

Figure 3-11: The Gaussian distribution of the analysis. Red line is normal cells, and blue line is cancer cells.
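The critical value at the intersection of two Gaussian curves, and the two area ratios that the text uses for sensitivity and specificity, can be sketched with the standard-library `statistics.NormalDist`. The bisection search, the function name and the two example distributions are assumptions; the sketch also assumes that the curves cross exactly once between the two means.

```python
from statistics import NormalDist


def critical_value(lower, upper, tol=1e-9):
    """Point between the two means at which the two densities are equal."""
    lo, hi = lower.mean, upper.mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lower.pdf(mid) > upper.pdf(mid):
            lo = mid  # still on the side dominated by the lower curve
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical distributions of one measured value for two groups.
lower_group = NormalDist(mu=2.0, sigma=0.5)
upper_group = NormalDist(mu=4.0, sigma=0.5)

cut = critical_value(lower_group, upper_group)
# Area of the lower curve to the left of the cut, relative to its total area.
left_fraction = lower_group.cdf(cut)
# Area of the upper curve to the right of the cut, relative to its total area.
right_fraction = 1.0 - upper_group.cdf(cut)
```

With measured data, `NormalDist.from_samples` could be used to estimate each curve from the values of one group.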

Patients with oral cancer | Condition Positive | Condition Negative
Test Outcome Positive | True Positive (TP) | False Positive (FP)
Test Outcome Negative | False Negative (FN) | True Negative (TN)

Sensitivity = TP/(TP+FN)

Specificity = TN/(FP+TN)

Figure 3-12: Calculation of sensitivity and specificity.
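The two formulas in Figure 3-12 translate directly into code. The counts in the example are hypothetical and serve only to show the arithmetic.

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (FP + TN)."""
    return tn / (fp + tn)


# Hypothetical counts: 45 true positives, 5 false negatives,
# 80 true negatives, 20 false positives.
example_sensitivity = sensitivity(tp=45, fn=5)
example_specificity = specificity(tn=80, fp=20)
```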

Chapter 4 Results

4.1 Experiments

Figure 4-1 shows the flow chart of our experiment. First, we used the hyperspectral scanning system to convert the optical image of the sample into a hyperspectral matrix. The hyperspectral scanning system includes a microscope, three light sources, a motor, a relay lens, a spectrometer and an EMCCD. We used the microscope to enlarge the image of the sample. There are three light sources: a halogen lamp, fluorescence with 330~385nm excitation and fluorescence with 470~490nm excitation. The motor moved the relay lens to scan the image, and the spectrometer transformed the image into the hyperspectral image. Lastly, the EMCCD stored the hyperspectral image in a three-dimensional matrix whose three axes are x, y and λ.

After we converted the hyperspectral data to the BIL format, we chose the nuclei in the image. These nuclei data, which were chosen by hand, were used to diagnose cancer cells with the spectral method; they also served as the training data in the topological method. There are two kinds of cancer cell diagnosis: the difference of spectral curves and the change of topology. In the diagnosis by the difference of spectral curves, we have 11 methods to compare the spectra of normal cells and cancer cells: 4 in halogen, 4 in fluorescence with 330~385nm excitation and 3 in fluorescence with 470~490nm excitation.

In the diagnosis of the change in topology, we present two methods: calculating the fractal dimension of the image and the correct rate of the classification by KNN. Lastly, we can plot the Gaussian distribution curve of the value generated by the methods, and calculate the sensitivity and specificity of this method.

Figure 4-1: The flow chart of experiment.

4.2 Morphological analysis of oral cancer

Before our topological analysis, we needed to remove the light noise. Because we only had the spectrum of the light source for the samples numbered 1 to 12, we only analyzed samples 1 to 12 in topology.

4.2.1 Calculation of Fractal Dimension after Threshold Method

Figure 4-2 shows the result of the threshold method for the samples numbered 1 to 12. After the threshold method was applied, the basal cells in normal tissue and the cancer cells in cancerous tissue were set to 1 (white), and everything else to 0 (black). The images on the left in Figure 4-2 are all normal tissue; the images on the right are cancerous tissue.

We can see that the thresholded images exhibit a large difference between normal cells and cancer cells. The white area in normal tissue forms a continuous curve, whereas the white area in cancerous tissue forms a discontinuous curve and spreads over the whole image. Table 4-1 shows the fractal dimension of each image after the threshold method.

According to the Gaussian distribution, we can determine 1.75 as the critical value. If the fractal dimension is lower than 1.75, we identify the image as normal tissue; on the other hand, if the fractal dimension is higher than 1.75, we identify the image as cancerous tissue. The sensitivity is 83.44%, and specificity is 91.46%.

Figure 4-2: After the threshold method, the basal cells are white, and the others are black in the images. The numbers 1 to 12 are the sample numbers. The label (a) means normal tissue, and the label (b) means cancer tissue.

Table 4-1: The fractal dimension of samples numbered 1 to 12 after the threshold method.

Number of patients | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
(a) Normal cells | 1.73 | 1.73 | 1.57 | 1.73 | 1.24 | 1.59 | 1.65 | 1.66 | 1.68 | 1.59 | 1.66 | 1.50
(b) Cancer cells | 1.90 | 1.82 | 1.81 | 1.67 | 1.78 | 1.80 | 1.85 | 1.97 | 1.95 | 1.89 | 1.76 | 1.91
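As a worked illustration, the critical value of 1.75 can be applied directly to the twelve value pairs of Table 4-1. This per-sample count is only a sketch of the decision rule; the sensitivity and specificity reported in the text are derived from the Gaussian curves, so the counts below are not expected to reproduce those percentages.

```python
# Fractal dimensions from Table 4-1, samples 1 to 12.
normal = [1.73, 1.73, 1.57, 1.73, 1.24, 1.59, 1.65, 1.66, 1.68, 1.59, 1.66, 1.50]
cancer = [1.90, 1.82, 1.81, 1.67, 1.78, 1.80, 1.85, 1.97, 1.95, 1.89, 1.76, 1.91]

CRITICAL_VALUE = 1.75

# Below the critical value: identified as normal tissue.
normal_identified = sum(fd < CRITICAL_VALUE for fd in normal)
# Above the critical value: identified as cancerous tissue.
cancer_identified = sum(fd > CRITICAL_VALUE for fd in cancer)
```

With these table values, all twelve normal images fall below 1.75, and eleven of the twelve cancerous images fall above it.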

4.2.2 K-nearest Neighbor Classification

Using the training data chosen by hand, we can classify the nuclei as basal-cell or lamina propria. Figure 4-3 shows the result of classification by KNN. A red point means basal-cell, and a blue point means lamina propria.

Table 4-2 shows the correct rate of classification in lamina propria by KNN.

After calculating the Gaussian distribution, we determined the critical value as 0.76. This means that if the correct rate is lower than 0.76, we identify the sample as cancerous; conversely, if the correct rate is higher than 0.76, we identify the sample as normal tissue. The sensitivity is 81.36%, and the specificity is 55.34%.

Figure 4-3: The result of classification by KNN in the samples numbered 1 to 12. The numbers 1 to 12 are the sample numbers. The label (a) means normal tissue, and the label (b) means cancer tissue.

Table 4-2: The correct rate of classification by KNN in the sample

4.3 Spectral analysis of oral cancer

We chose nuclei in the image by hand as the data for diagnosing cancer cells; the number of data is shown in Table 4-3. Except for sample number 5, we have at least 100 sets of data per sample. Because we have three light sources, we have to diagnose cancer cells in three kinds of spectrum.

We have 11 methods to diagnose cancer cells in spectrum: 4 in halogen, 4 in fluorescence 330~385nm excitation and 3 in fluorescence 470~490nm excitation.

Table 4-3: The number of spectral data.

Number

10 241 263 27 425 449

11 334 278 28 399 316

12 242 413 29 311 330

13 215 120 30 315 387

14 393 759 31 229 632

15 313 342 32 507 296

16 432 351 33 359 693

17 623 565

4.3.1 Transmitting spectral mode

Figure 4-4 shows the penetration in halogen for the samples numbered 1 to 12. Penetration means the spectral data after the light noise and dark noise have been removed; we only have the light-source data for the samples numbered 1 to 12. The maximum of penetration is 1, and the minimum is 0. We have four analytical methods in the spectrum of halogen. Method 1-1 calculates the penetration in the wavelength range 460~480nm. Method 1-2 calculates the ratio of the penetration in the range 460~480nm to the penetration in the range 700~710nm. Method 1-3 calculates the intensity of the peak at 700~710nm.

Table 4-4 shows the specificity of each method in halogen. Except for sample number 2, methods 1-1 and 1-2 have the highest specificity, with means of 87.15% and 86.27%, respectively, so we combined methods 1-1 and 1-2 to analyze the spectrum in halogen. If we ignore sample number 2, the mean specificity of the combined method is 98.45%.

Figure 4-4: The spectral curves of the samples numbered 1 to 12 in halogen. The red line means normal cells, and the blue line means cancer cells. The number of each panel is the number of the patient. X axis is wavelength (nm), and Y axis is halogen transmittance.

Table 4-4: The correct rate of classification by KNN in the sample number 1 to 12.

Number of patient | Specificity of method 1-1 (%) | Specificity of method 1-2 (%) | Specificity of method 1-3 (%) | Specificity of method 1-1+1-2 (%)
1 | 92.5 | 90.4 | 43.4 | 97.4
2 | 0 | 0 | 42.0 | 0.0
3 | 99.6 | 98.9 | 77.1 | 100.0
4 | 98.0 | 90.4 | 65.8 | 100.0
5 | 95.1 | 99.7 | 71.1 | 100.0
6 | 92.3 | 88.4 | 74.4 | 92.2
7 | 83.6 | 86.6 | 76.1 | 97.3
8 | 96.6 | 86.4 | 35.0 | 100.0
9 | 99.8 | 93.7 | 64.2 | 100.0
10 | 99.9 | 65.2 | 37.7 | 100.0
11 | 94.7 | 69.9 | 71.3 | 96.1
12 | 96.6 | 79.4 | 66.7 | 100
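The mean specificity of each column of Table 4-4 with sample number 2 left out can be checked with a few lines of code. The values are copied from the table; the dictionary layout and variable names are only illustrative.

```python
from statistics import mean

# Specificity (%) per sample, in sample order 1 to 12, from Table 4-4.
specificity = {
    "1-1": [92.5, 0, 99.6, 98.0, 95.1, 92.3, 83.6, 96.6, 99.8, 99.9, 94.7, 96.6],
    "1-2": [90.4, 0, 98.9, 90.4, 99.7, 88.4, 86.6, 86.4, 93.7, 65.2, 69.9, 79.4],
    "1-3": [43.4, 42.0, 77.1, 65.8, 71.1, 74.4, 76.1, 35.0, 64.2, 37.7, 71.3, 66.7],
    "1-1+1-2": [97.4, 0.0, 100.0, 100.0, 100.0, 92.2, 97.3, 100.0, 100.0, 100.0, 96.1, 100],
}

EXCLUDED_SAMPLE = 2  # sample number 2 is ignored, as in the text

means_without_sample_2 = {
    method: mean(v for number, v in enumerate(values, start=1)
                 if number != EXCLUDED_SAMPLE)
    for method, values in specificity.items()
}
```

Rounded to two decimals, this gives 95.34 for method 1-1, 86.27 for method 1-2 and 98.45 for the combination of methods 1-1 and 1-2.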

4.3.2 Fluorescence with 330~385nm excitation

Figure 4-5 shows the spectral curves in fluorescence with 330~385nm excitation, and Figure 4-6 shows the spectral curves that have been compensated by the spline. We have four methods to compare the normal cells and cancer cells. Method 2-1 compares the ratio of the intensity in the range 540~570nm to the intensity in the range 680~710nm. Method 2-2 calculates the area under the spectral curve that has been normalized by the intensity at 540~570nm. Method 2-3 compares the maximum of the spectral curve that has been compensated by spline. Method 2-4 compares the full width at half maximum (FWHM) of the spectral curve compensated by the spline.

Table 4-5 shows the specificity of the methods that analyze the spectrum in fluorescence with 330~385nm excitation. The specificity varies widely between samples, which signifies that these methods are not robust enough.

Figure 4-5: The spectral curves of each sample in fluorescence 330~385nm excitation. The number of each panel is the number of the sample (patient). The red line means normal cells, and the blue line means cancer cells. X axis is wavelength (nm), and Y axis is fluorescence intensity (μW).

Figure 4-6: The spectral curves of each sample which have been compensated by spline in fluorescence 330~385nm excitation. The number of each panel is the number of the sample (patient). The red line means normal cells, and the blue line means cancer cells. X axis is wavelength (nm), and Y axis is fluorescence intensity (μW).

Table 4-5: The specificity of analysis in fluorescence 330~385nm

24 87.6 80.3 79.4 89.0

25 71.0 69.5 84.2 50.3

26 61.8 73.5 32.4 48.2

27 93.1 89.3 85.8 92.0

28 78.6 70.5 69.9 67.1

29 64.6 78.7 82.5 65.8

30 89.1 85.0 83.2 92.3

31 83.4 72.4 59.4 78.9

32 97.8 92.8 90.7 98.2

33 86.9 82.4 77.9 88.4

4.3.3 Fluorescence with 470~490nm excitation

Figure 4-7 shows the spectrum in fluorescence 470~490nm excitation. We have 3 methods to analyze the spectrum to distinguish normal cells and cancer cells. Method 3-1 compares the intensity of the peak in the wavelength range 540~570nm. Method 3-2 calculates the area under the spectral curve that we normalize. Method 3-3 calculates the full width at half maximum (FWHM) of the spectral curve.

Table 4-6 shows the result of the analysis as well as the specificity of each method in each sample. We can see that the variation of the specificity between different samples is large, and the mean of specificity in one method is too low (about 50%).

Figure 4-7: The spectral curve of each sample in fluorescence 470~490nm excitation. The number of figure means the number of the sample. The red line means normal cells, and blue line means cancer cells. The number means the

Chapter 5 Discussions, Conclusions and Future Works

5.1 Discussions

In calculating the fractal dimension of the image, the critical value of the threshold method has an enormous impact on the value of the fractal dimension, so determining the critical value is an important issue. Because the area whose value is 1 after the threshold method and the coordinates of the training data both correspond to the nuclei in the basal-cell layer, we determined the critical value in accordance with the values of the training data chosen by hand.

The fractal dimension method shows a high sensitivity and specificity, except in sample number 3. The fractal dimension of number 3 in normal tissue is 1.73 and 1.67 in cancer cells. The goal of the threshold method is setting the value of the nuclei in the basal-cell layer as 1 and the other as 0; the image 55.34%. The distribution of nuclei in the lamina propria decides the correct rate.

If the nuclei are densely distributed, it is difficult for KNN to distinguish the nuclei of the basal-cell layer from those of the lamina propria, which causes a low correct rate. The morphology of the lamina propria shows low correlation with the erosion of cancer cells (Table 4-2). The topology of the lamina propria may change because of different composition or location.

In order to establish data for spectral diagnosis, we can mark the points on the nuclei, cytoplasm or intercellular bridge. Therefore, we have to verify the best target on the image for diagnosis. In the beginning, we mark the nuclei, cytoplasm and intercellular bridge on the image of the sample and diagnose oral cancer using the spectrum. After comparison, nuclei are chosen as the targets because of their highest sensitivity and specificity.

In the halogen spectrum, there is a large difference between normal cells and cancer cells. Except for sample number 2, methods 1-1 and 1-2 have high sensitivity and specificity. When we observe the normal cells in Figure 4-4(2), we see that the transmittance decreases abnormally in the wavelength ranges 300~700nm and 750~1100nm. In fluorescence, there is no obvious decrease in the spectrum of normal cells. Therefore, the cause of the decrease in Figure 4-4(2) is the thickness of the sample, because the thickness has an effect on transmittance but no effect on reflectance.

We do not remove the light noise from the fluorescence intensity because it is difficult to record the light noise in fluorescence. To record the light noise, the light source has to irradiate the black area of the biopsy, but the black area emits no autofluorescence, because autofluorescence is emitted by the excited cells.

The sensitivity and specificity are higher in the diagnosis of the halogen spectrum than in the fluorescence spectrum. The average specificity of diagnosis in halogen is 79.545%, in fluorescence 330~385nm excitation is 68.48% and in fluorescence 470~490nm excitation is 70.12%. The reason for this is that we do not remove the light noise in fluorescence. Removing the light noise can remove the effect of the light source on different locations of the sample and normalize the intensity; however, the light noise cannot be recorded in fluorescence.

If we define the calculation of the fractal dimension as method 4-1 and the correct rate of classification by KNN as method 4-2, then ranking the methods by specificity gives the following order: 1-1, 4-1, 1-2, 1-3, 3-3, 2-4, 3-1, 2-3, 2-2, 3-2, 2-1, 1-4 and 4-2. The specificity of the fractal dimension method (method 4-1) is 91.46%; the specificity of the halogen penetration in the wavelength range 460~480nm (method 1-1) is 95.34% if sample number 2 is ignored; and the specificity of the ratio of the halogen penetration in the range 460~480nm to that in the range 700~710nm (method 1-2) is 86.45% if sample number 2 is ignored. Methods 1-1, 1-2 and 4-1 have the highest specificity, exceeding the others by at least 13%. Finally, the specificity is 98.45% when methods 1-1, 1-2 and 4-1 are combined.

Table 5-1 shows the comparison with other articles. Our light sources include halogen and fluorescence with 330-385nm and 470-490nm excitation, and the method includes analysis using morphology and spectrum. Compared to the research presented in other articles, our research presents more information (more light sources, larger spectral range), is more convincing (based on 33 patients and using more than 100 sets of data in one sample), more novel (analysis combining morphology and spectrum) and shows high specificity (98.45%).

Table 5-1: Comparison of other articles.

400-1100 33 Fractal dimension, classification by

350-700 unknown Combination of fluorescence,

Halogen 1000-2500 10 Standard deviation, support vector

300-700 96 Combination of PLS andlysis and logistic

450-750 13 500-549/657-700nm ratio

Sens. 97%, spec. 95%

5.1.1 Cocktail method in accordance to the sample data

The cocktail method combines the methods with fluorescence excitation, and the methods show higher specificity and better correlation with sample data.

The sample data includes the age of patient, the differentiated level, the location of biopsy, the stage of oral cancer, T, N, and M. T is the diagnosis of tumor, N is the diagnosis of regional lymph nodes, and M is the diagnosis of distant metastasis. For finding the correlation between specificity of the methods and
