

4. THE MULTI-LAYER SEGMENTATION METHOD FOR COMPLEX DOCUMENT

4.4 Experimental results and discussions

The test images are full-page, 24-bit true color or 8-bit monochromatic document images scanned at 300 dpi. The proposed method for automatic text segmentation has been tested on numerous magazine images, cover images and advertisement images.

Figures 27(a)~32(a) display parts of the test images. The backgrounds in Figs. 27(a) to 32(a) include the following features: 1) monochromatic background with/without text; 2) slowly varying background with/without text; 3) highly varying background with/without text; and 4) complex varying background with/without text of various colors.

Figures 27(b)~32(b) present the text planes of Figs. 27(a)~32(a) after the proposed text segmentation method is applied, and Figs. 27(c) to 32(c) show parts of the object layers of Figs. 27(a) to 32(a). The ratio of success of the proposed text segmentation method is defined as

Ratio of success = (number of texts correctly segmented into the text plane / total number of texts in the document image) × 100%.

The ratios of success in Figs. 27(b)~32(b) are 100%, 98.5%, 99.2%, 98.7%, 100%, and 97%, respectively. The proposed text segmentation method can successfully extract texts with different typefaces or sizes, as well as texts spread over compound document images with monochromatic, slowly varying, highly varying and complex varying backgrounds.
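As a concrete illustration of this arithmetic, the short sketch below computes the ratio for a single test image; it assumes the counts are taken per text (for example, per character or per text region), and the function name and example numbers are illustrative only.

def ratio_of_success(num_correct: int, num_total: int) -> float:
    """Percentage of texts correctly segmented into the text plane.

    num_correct : texts that end up in the text plane (counted by
                  visual inspection in this illustrative sketch).
    num_total   : texts present in the original document image.
    """
    if num_total == 0:
        return 100.0
    return 100.0 * num_correct / num_total

# Example: 394 of 400 texts extracted correctly -> 98.5%
print(f"{ratio_of_success(394, 400):.1f}%")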

The MLSM decomposes the document image into several object layers, and all of the texts are spread into different object layers according to their colors. The text extraction algorithm then extracts the text from all of the object layers. Because different object layers may contain text-like blocks at a particular position, the text extraction algorithm may make wrong decisions; consequently, the text extraction algorithm can be further improved. For instance, although most of the text in Fig. 28(a) overlays a complex varying background - a map - all of the text that overlaps the map is segmented into one of the object layers in Fig. 28(c). Thus, although the ratios of success in Figs. 28(b), 29(b), 30(b) and 32(b) are not 100%, the MLSM itself successfully segments all of the texts.

According to our results, the texts can be extracted from different backgrounds, regardless of whether the texts overlap a simple, slowly or rapidly varying background. This method overcomes various issues raised by the complexity of the background images. Consequently, the multi-layer segmentation algorithm constitutes an effective solution for extracting text from various document images.

In the block-based clustering algorithm, the parameters THJDF and THσ are the threshold values used to decide whether the clusters have converged; convergence is declared when the conditions JDF > THJDF and σ < THσ are met. The JDF value measures the separability between two adjacent clusters in the block-based clustering algorithm.

The JDF value lies within the range 0 ≤ JDF ≤ 1, so maximizing the JDF can be used as an objective function to optimize the segmentation result. When the JDF approaches 1.0, the two adjacent clusters are ideally and completely separated. When there are more than two clusters, the average JDF is used to measure the separability of the clusters. This study employs THJDF = 0.9.

The standard deviation, σ, measures the compactness of the pixel values of each cluster. Ideally, σ approaches zero for a monochromatic object. In a pilot experiment, we analyzed the pixel-value distributions of monochromatic texts in different document images; these distributions are widened by the scanner or by the original document itself. The variation of monochromatic texts of different sizes or styles is around 0~50. In general, THσ = 25 preserves the texts well, but it is insufficient for our needs when the texts overlap a background with rapidly varying texture and similar grayscale; therefore, this study employs THσ = 14 to obtain a better outcome in that case. When THσ is below 25, the extracted texts are thinner than the original texts and the boundaries of the texts are clustered into different object layers, as in Fig. 23(j).
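To make the interplay of the two thresholds concrete, the sketch below tests the convergence condition JDF > THJDF and σ < THσ for one block. It assumes the per-pair JDF values are already supplied by the clustering step (the JDF formula itself is defined earlier in the chapter and is not reproduced here), and the function name, the data layout, and the use of the largest per-cluster standard deviation are illustrative assumptions rather than the exact implementation.

import numpy as np

TH_JDF = 0.9    # separability threshold on the (average) JDF
TH_SIGMA = 14   # compactness threshold on the standard deviation

def clusters_converged(pairwise_jdf, cluster_pixels):
    """Illustrative convergence test: JDF > TH_JDF and sigma < TH_SIGMA.

    pairwise_jdf   : JDF value(s) of adjacent cluster pairs; the average
                     JDF is used when there are more than two clusters.
    cluster_pixels : list of 1-D arrays holding the pixel values of
                     each cluster.
    """
    avg_jdf = float(np.mean(pairwise_jdf))                # separability
    sigmas = [float(np.std(p)) for p in cluster_pixels]   # compactness
    return avg_jdf > TH_JDF and max(sigmas) < TH_SIGMA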

Because THσ is set to 14, the standard deviation, σ, of each LSB is less than 14. In other words, if two LSBs belong to the same object layer, the difference between their average values should be less than 14. The threshold values ThLM and ThSI are used to judge, from the difference of the average values, whether two LSBs belong to the same object layer.

Therefore, the threshold values ThLM and ThSI are both set to 14. In the decision procedure for constructing a new object layer, ThSI determines whether an unclassified LSB should be merged with an existing object layer or used to set up a new object layer. In the pre-match condition of the matching procedure, ThLM is used to filter out unreasonable object layers in order to save computation.
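The way the two thresholds steer the layer construction can be sketched as follows. The actual matching procedure involves more than a comparison of average values; this simplified sketch only illustrates ThLM acting as a pre-match filter and ThSI deciding between merging and constructing a new object layer, and the function name and data structures are assumptions made for illustration.

TH_LM = 14  # pre-match filter threshold (matching procedure)
TH_SI = 14  # merge / new-layer threshold (decision procedure)

def assign_lsb(lsb_mean, layer_means):
    """Return the index of the object layer an unclassified LSB should
    join, or None if a new object layer should be constructed.

    lsb_mean    : average pixel value of the unclassified LSB.
    layer_means : average pixel values of the existing object layers.
    """
    # Pre-match condition: skip layers whose average value is already
    # too far from the LSB, saving the detailed matching computation.
    candidates = [i for i, m in enumerate(layer_means)
                  if abs(m - lsb_mean) < TH_LM]
    if not candidates:
        return None  # construct a new object layer

    # Decision procedure: merge with the closest candidate layer only
    # if the difference of the average values stays below TH_SI.
    best = min(candidates, key=lambda i: abs(layer_means[i] - lsb_mean))
    if abs(layer_means[best] - lsb_mean) < TH_SI:
        return best
    return None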

The segmentation method proposed in this chapter has been tested on a large number of different document images scanned from book covers, advertisements, brochures, and magazines. We find that monochromatic objects, text or non-text, can be successfully separated from a document image by the MLSM; nevertheless, a few texts may fail to be extracted when their pixel values are multicolored, gradually changing, or too close to the pixel values of the background. A multicolored or gradually changing text becomes fragmented and distributed over different clusters. A text may be merged with its background when its values are too close to those of its background. Although decreasing the parameter THσ (below 14) can separate such a text from the overlapping background whose values are close to it, thereby solving the merging problem, it also causes the text to become fragmented and distributed over different clusters. Therefore, an adaptive threshold THσ is left for future work to solve the merging problem.

4.5 Concluding remarks

This study presents a viable method for extracting texts from a complex compound document image in which texts overlay various background images. The proposed segmentation algorithm uses a multi-layer segmentation method to segment the texts from various compound document images, regardless of whether the texts overlap the background. This method overcomes various issues raised by the complexity of the background images. Experimental results obtained with various document images reveal that the proposed algorithm can successfully segment Chinese and English text strings from various backgrounds, regardless of whether the texts overlap a simple, slowly or rapidly varying background. The method can be used to improve the effectiveness of compression; the technique has many applications, including compressing color faxes and documents. Moreover, the segmentation algorithm can be used in Optical Character Recognition (OCR) to search for characters in complex documents with strong text/background overlap.

Fig. 27 Test image 1 (image size = 2262x3263): (a) original image; (b) text plane; (c) parts of layer planes.

Fig. 28 Test image 2 (image size = 1829x2330): (a) original image; (b) text plane; (c) parts of layer planes.

Fig. 29 Test image 3 (image size = 2462x3250): (a) original image; (b) text plane; (c) parts of layer planes.

Fig. 30 Test image 4 (image size = 2333x3153): (a) original image; (b) text plane; (c) parts of layer planes.

Fig. 31 Test image 5 (image size = 2469x3535): (a) original image; (b) text plane; (c) parts of layer planes.

Fig. 32 Test image 6 (image size = 2469x3535): (a) original image; (b) text plane; (c) parts of layer planes.

CHAPTER 5

CONCLUSIONS AND PERSPECTIVE

This dissertation presents three segmentation methods for document image compression. In Chapter 2, a compression method for color document images based on the wavelet transform and fuzzy picture-text segmentation was presented.

This approach introduces a fuzzy picture-text segmentation method that separates pictures and texts by using the wavelet coefficients of color document images. The number of colors, the ratio of projection variance, and the fractal dimension are used to distinguish pictures from texts, and a fuzzy rule built on the fuzzy characteristics of these parameters performs the picture-text segmentation. The two resulting components, text strings and pictures, are processed by different compression algorithms: the picture components are encoded by zerotree wavelet coding and the text components by modified run-length Huffman coding.

However, the fuzzy picture-text segmentation method is not suitable for document images whose texts overlap a complex background. Therefore, two algorithms for compressing image documents with large text/background overlap are proposed in Chapter 3. The proposed algorithms apply a new segmentation method to separate the text from the image in a compound document in which the text and background overlap. The segmentation method classifies document images into three planes: the text plane, the background (non-text) plane, and the text's color plane.

Different compression techniques are used to process the text plane, the background plane and the text's color plane. The text plane is compressed using a pattern matching technique called JB2, while the wavelet transform and zerotree coding are used to compress the background plane and the text's color plane. The proposed algorithms greatly outperform the well-known image compression methods JPEG and DjVu, and enable the effective extraction of the text from a complex background, achieving a high compression ratio for compound document images.

Although the segmentation method in Chapter 3 outperforms the well-known image compression methods JPEG and DjVu, it does not apply when backgrounds include sharply varying contours or overlap with texts. These background images include 1) monochromatic backgrounds with/without texts; 2) slowly varying backgrounds with/without texts; 3) highly varying backgrounds with/without texts; and 4) complex varying backgrounds with/without texts of different colors. Extracting the texts is particularly difficult when the compound document image includes all of these backgrounds.

In Chapter 4, a viable method for extracting texts from a complex compound document image in which texts overlay various background images is presented. The proposed segmentation algorithm uses a multi-layer segmentation method (MLSM) to segment the texts from various compound document images, regardless of whether the texts overlap the background. The MLSM overcomes various issues raised by the complexity of the background images. Experimental results obtained with various document images reveal that the proposed algorithm can successfully segment Chinese and English text strings from various backgrounds, regardless of whether the texts overlap a simple, slowly or rapidly varying background. The method can be used to improve the effectiveness of compression; the technique has many applications, including compressing color faxes and documents. Moreover, the segmentation algorithm can be used in Optical Character Recognition (OCR) to search for characters in complex documents with strong text/background overlap.

According to our results, the texts can be extracted from different backgrounds, regardless of whether the texts overlap a simple, slowly or rapidly varying background. This method overcomes various issues raised by the complexity of background images. Consequently, the multi-layer segmentation algorithm constitutes an effective solution for extracting text from various document images.

The MLSM has been tested on a large number of different document images scanned from book covers, advertisements, brochures, and magazines. We find that monochromatic objects, text or non-text, can be successfully separated from a document image by the MLSM; nevertheless, a few texts may fail to be extracted when their pixel values are multicolored, gradually changing, or too close to the pixel values of the background. A multicolored or gradually changing text becomes fragmented and distributed over different clusters. A text may be merged with its background when its values are too close to those of its background.

Although decreasing the parameter THσ (below 14) can separate such a text from the overlapping background whose values are close to it, thereby solving the merging problem, it also causes the text to become fragmented and distributed over different clusters.

Therefore, an adaptive threshold THσ is left for future work to solve the merging problem.


VITA

Curriculum Vitae of the Doctoral Candidate

Name: Chung-Cheng Chiu (瞿忠正)    Gender: Male

Date of birth: January 18, 1967    Place of birth: Tainan City

Dissertation title:

Chinese: 複雜型複合式文件影像壓縮方法之研究

English: THE STUDY OF THE COMPRESSION ALGORITHMS FOR COMPLEX COMPOUND DOCUMENT IMAGES

Education and experience:

1. September 1986 - July 1990: Department of Electrical Engineering, Chung Cheng Institute of Technology
2. July 1990 - July 1992: Teaching Assistant, Department of Electrical Engineering, Chung Cheng Institute of Technology
3. September 1992 - July 1994: Institute of Electronic Engineering, Chung Cheng Institute of Technology
4. July 1994 - present: Lecturer, Department of Electrical Engineering, Chung Cheng Institute of Technology
5. September 1998 - present: In-service Ph.D. student, Institute of Electrical and Control Engineering, National Chiao Tung University

Honors:

1. Champion, Winter Camp of the 5th TIC100 Entrepreneurship Competition
2. Silver Award, Final Round of the 5th TIC100 Entrepreneurship Competition
3. Gold Award, 17th Acer Dragon Thesis Award on the Knowledge Economy


PUBLICATION LIST

Publication List of the Doctoral Candidate

Name: Chung-Cheng Chiu (瞿忠正)

Journal

[1] Bing-Fei Wu, Chung-Cheng Chiu and Wen-Long Lin, “Wavelet-Based Images Compression of Color Document by Fuzzy Picture-Text Segmentation,” Journal of The Chinese Institute of Engineers, Vol. 26, No.1, pp.113-118, 2003.

[2] Bing-Fei Wu, Chung-Cheng Chiu and Yen-Lin Chen, “Algorithms for Compressing Compound Document Images with Large Text/Background Overlap”, accepted by IEE

