

1. INTRODUCTION

1.2 Organization of the dissertation

In this dissertation, three segmentation methods are proposed to extract text from compound document images.

In Chapter 2, a compression method for color document images based on the wavelet transform and fuzzy picture-text segmentation is presented. The approach introduces a fuzzy picture-text segmentation method, which separates pictures and text by using wavelet coefficients derived from color document images. Two components, text strings and pictures, are generated and processed by different compression algorithms.

The fuzzy picture-text segmentation method separates text and pictures from a monochromatic background. However, the rapid development of multimedia technology has produced increasing numbers of real-life documents that include stylistic text strings with decorated objects and colorful, slowly or highly varying background components. In these documents, the text strings overlap the background images.

The fuzzy picture-text segmentation method therefore cannot effectively segment all important objects; it is insufficient when the background includes sharply varying contours or overlaps the text.

Chapter 3 therefore proposes a new segmentation algorithm for separating text from a document image with a complex background. The text cannot easily be separated directly from the background image because the difference between their gray values is too small, so two phases are used to accomplish the desired purpose. In the first phase, which involves color transformation and clustering analysis, the monochromatic document image is partitioned into three planes: the dark plane, the medium plane, and the bright plane. The color of the text is almost uniform, so the variance of the text's grayscale is small, and all of the text can be grouped into the same plane. When the text is black, the text and the parts of the background with gray values close to that of the text are put in the dark plane; when the text is not black, it is put into another plane. Thus, the text and some noise are coarsely separated from the background. In the second phase, an adaptive threshold is determined to refine the text by adaptive binarization and block extraction.
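The first-phase grouping described above can be sketched with a simple 1-D clustering of gray values. A k-means with k=3 is used here purely as an illustrative stand-in for the chapter's clustering analysis; the function name and details are assumptions, not the dissertation's exact algorithm.

```python
import numpy as np

def partition_three_planes(gray, iters=20):
    """Sketch: cluster gray values into dark/medium/bright planes with
    a simple 1-D k-means (k=3). Illustrative only; the dissertation's
    actual color transformation and clustering analysis may differ."""
    pixels = gray.astype(float).ravel()
    # seed the three centers at the darkest, mean, and brightest values
    centers = np.array([pixels.min(), pixels.mean(), pixels.max()])
    for _ in range(iters):
        # assign each pixel to the nearest center
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for k in range(3):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean()
    # relabel so 0 = dark plane, 1 = medium plane, 2 = bright plane
    order = np.argsort(centers)
    remap = np.empty(3, dtype=int)
    remap[order] = np.arange(3)
    return remap[labels].reshape(gray.shape)
```

Pixels with gray values close to the text fall into the same plane as the text, matching the coarse separation described above.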

Two compression algorithms that yield a high compression ratio are then proposed.

The segmentation algorithm focuses on processing images whose text overlaps a complex background, and it is powerful in extracting text from such backgrounds. However, many advertisements and magazines have background images that contain several different cases, including 1) a monochromatic background with or without text, 2) a slowly varying background with or without text, 3) a highly varying background with or without text, and 4) a complex varying background with or without text of different colors. Extracting the text is especially hard when all of these cases appear in a single compound document image. Furthermore, the text may appear in more than three colors. Therefore, the segmentation algorithm in Chapter 3 may be insufficient to extract the text from document images in all cases. Text segmentation for such complex images remains a great challenge and a novel research field.

To conquer this challenge, Chapter 4 presents a text segmentation algorithm for various document images. The proposed algorithm, incorporating a new multi-layer segmentation method (MLSM), can separate the text from various compound document images, regardless of whether the text and background overlap. This method solves various problems associated with the complexity of background images.

The MLSM provides an effective way to extract objects from different complex images. A complex image includes many different objects, such as text of different colors, figures, scenes, and complex backgrounds, and these objects may or may not overlap one another. Because the objects have different features, the image can be partitioned into many object layers by means of the features of the objects embedded in it. The block-based clustering algorithm is then performed on the layered image sub-blocks to cluster them into several object layers. Consequently, different text objects, non-text objects, and background components are segmented into separate object layers. The proposed method can separate objects from 8-bit grayscale or 24-bit true-color images, no matter whether the objects overlap a simple, slowly varying, or highly varying background. The block-based clustering algorithm decomposes each sub-block image into different layered sub-block images (LSBs), ordered from darkest to lightest with respect to the original sub-block image. In the jigsaw-puzzle layer construction algorithm, statistical and spatial features of adjacent LSBs are introduced to assemble all LSBs of the same text paragraph or object.
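As a rough illustration of the darkest-to-lightest decomposition of a sub-block into LSBs, the sketch below cuts the intensity range at quantiles. The MLSM's actual block-based clustering is more elaborate; the layer count and helper name here are assumptions.

```python
import numpy as np

def layered_subblocks(block, n_layers=3):
    """Sketch: split a grayscale sub-block into layered sub-block
    images (LSBs) ordered from darkest to lightest. Layer boundaries
    are simple quantile cuts, standing in for the MLSM's block-based
    clustering."""
    flat = block.astype(float).ravel()
    # quantile cut points give n_layers intensity bands
    cuts = np.quantile(flat, np.linspace(0, 1, n_layers + 1))
    cuts[-1] += 1  # make the brightest band inclusive of the maximum
    layers = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        layers.append(((block >= lo) & (block < hi)).astype(np.uint8))
    return layers  # layers[0] is the darkest LSB
```

Each returned mask is one LSB; adjacent LSBs from neighboring sub-blocks would then be assembled by the jigsaw-puzzle layer construction described above.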

The different text and non-text objects and background components are clearly segmented into several independent object layers for further extraction. When applied to real-life complex document images, the proposed method can successfully extract text strings of various colors and illuminations from overlapping non-text objects or complex backgrounds, as determined experimentally. Experimental results obtained using document images scanned from book covers, advertisements, brochures, and magazines reveal that the proposed algorithm can successfully segment Chinese and English text strings from various backgrounds, regardless of whether the text is over a simple, slowly varying, or rapidly varying background texture.

CHAPTER 2

THE FUZZY-BASED TEXT SEGMENTATION METHOD

This chapter presents a compression method for color document images based on the wavelet transform and fuzzy picture-text segmentation. The approach introduces a fuzzy picture-text segmentation method, which separates pictures and text by using wavelet coefficients from color document images. The number of colors, the ratio of projection variance, and the fractal dimension are utilized to segment the pictures and text. By using the fuzzy characteristics of these parameters, a fuzzy rule is proposed to achieve picture-text image segmentation. Two components, text strings and pictures, are generated and processed by different compression algorithms.

The picture components and the text components are encoded by zerotree wavelet coding and by modified run-length Huffman coding, respectively. Experimental results show that the method achieves a promisingly high compression ratio for color document images.

2.1 Introduction

Digitized images of printed documents typically consist of a mixture of text, pictures, and graphic elements, which have to be separated for further processing and efficient representation. Because the text carries the most information, segmenting the text from printed document images is an important step in document analysis. Accordingly, various techniques have been developed to segment document images. Many approaches devoted to processing monochrome documents have been proposed over the past years. Wahl et al. [6] designed a prototype system for document analysis and a constrained run length algorithm (CRLA) for block segmentation.

Nagy et al. [7] presented an expert system with two tools, the X-Y tree and a formal block-labeling schema, to accomplish document analysis. Fletcher and Kasturi [8] proposed a robust algorithm, which uses the Hough transform to group connected components into local character strings, to separate text from mixed text/graphics document images. Kamel and Zhao [9] presented two new extraction techniques: a logical level technique and a mask-based subtraction technique. Tsai [10] proposed an approach to automatic threshold selection using the moment-preserving principle. Other systems, based on prior knowledge of the statistical properties of various blocks [11]-[15] or on texture analyses [16],[17], have also been developed. These systems all focus on processing monochrome documents. In contrast, few approaches have been proposed for dealing with color documents. Suen and Wang [19] presented a text string extraction algorithm that uses edge detection and text block identification to extract text strings. Haffner et al. [20] proposed an image compression technique called "DjVu" that is specially geared toward the compression of color document images.

In this chapter, we present a compression method for color document images that uses a new fuzzy picture-text segmentation algorithm. The proposed algorithm separates the text from color document images in the frequency domain by using coefficients derived from the discrete wavelet transform. The coefficients are used to separate the text and picture components via fuzzy classification. Then, the coefficients of the picture components are encoded by zerotree coding, while a modified run-length Huffman code (MRLHC) encodes the coefficients of the text components.

2.2 The characteristics of coefficients in wavelet transform

The basic idea of the wavelet transform is to represent an arbitrary function f as a superposition of wavelets. After the first level of the two-dimensional discrete wavelet transform, the image is arranged with the lowest-frequency band in the upper-left corner, the highest-frequency band in the lower-right corner, and the middle-frequency bands in the upper-right and lower-left corners. The coefficients in the lowest-frequency band are more strongly correlated than those in the other bands; therefore, this band is decomposed by a second level of the wavelet transform, yielding the seven frequency bands shown in Fig.1. The coefficients of the seven bands obtained from the original image by the wavelet transform are treated as textures of different frequencies.

Fig.1 Image after 2-level discrete wavelet transformation.

In Fig.1, the LL2 band contains the lowest-frequency coefficients. The LHi, HLi, and HHi (i=1, 2) bands carry the edge information of the original image. The LL2 band is very similar to the original image but only 1/16 its size, so the cost of computation can be reduced by using the LL2 band for picture-text segmentation. For picture components, the signals in the LH1, HL1, and HH1 bands are not sensitive to the human eye and can be directly discarded. For text components, however, those frequency bands include prominent edge information that should be coded to preserve the text.
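A minimal two-level decomposition can be sketched with a plain Haar transform in NumPy. This is a stand-in for whatever wavelet basis the chapter uses, and the LH/HL naming convention below is an assumption, since conventions vary between authors.

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2-D Haar wavelet transform (illustrative).
    Returns LL, LH, HL, HH, each half the size of img.
    img dimensions must be even."""
    a = img.astype(float)
    # transform rows: low-pass = average of pairs, high-pass = difference
    lo = (a[:, 0::2] + a[:, 1::2]) / 2.0
    hi = (a[:, 0::2] - a[:, 1::2]) / 2.0
    # transform columns of each half-band
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh

def two_level(img):
    """Two-level decomposition: LL1 is decomposed again, giving the
    seven bands of Fig.1 (LL2, LH2, HL2, HH2, LH1, HL1, HH1)."""
    ll1, lh1, hl1, hh1 = haar_dwt2(img)
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)
    return {"LL2": ll2, "LH2": lh2, "HL2": hl2, "HH2": hh2,
            "LH1": lh1, "HL1": hl1, "HH1": hh1}
```

For a 64×64 input, `bands["LL2"]` is 16×16, i.e. 1/16 of the original area, which is why segmentation on LL2 is cheap.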

After the wavelet transform, the coefficients extracted from text components exhibit more edge information than the coefficients from picture components. That is, the coefficients of text components show higher-frequency characteristics than those of picture components, so the characteristics of the wavelet transform coefficients differ between text components and picture components. Therefore, the edge feature is a good parameter for segmenting picture and text components.

The number of colors, another useful feature, can be used for color document image segmentation. Because the coefficients of the LL2 band are very similar to the original image after the wavelet transform, the number of colors can be obtained by counting the colors of the UV plane from the LL2 coefficients. Obtaining the number of colors directly from the color document image is difficult, so this chapter proposes a new algorithm to extract the number of colors from the coefficients of the LL2 band.

The fractal dimension indicates the complexity of an image. It is found that the fractal dimension [22] differs considerably between text components and picture components: because picture components are more complicated than text components, they have a higher fractal dimension. The fractal dimensions of the original image and of a low-resolution version are similar, so only the coefficients of the LL2 band are used to compute the fractal dimension of text and picture components. In this way, the processing time for computing the fractal dimension is reduced.
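A common way to estimate the fractal dimension of a binary block is box counting; the sketch below is a generic implementation and not necessarily the method of [22].

```python
import numpy as np

def box_counting_dimension(binary):
    """Estimate the fractal dimension of a binary image by box
    counting: count occupied boxes N(s) at several box sizes s and
    fit log N(s) against log(1/s). Generic sketch, not the exact
    method of reference [22]."""
    n = min(binary.shape)
    sizes, counts = [], []
    s = n // 2
    while s >= 1:
        count = 0
        for r in range(0, binary.shape[0], s):
            for c in range(0, binary.shape[1], s):
                if binary[r:r + s, c:c + s].any():
                    count += 1
        sizes.append(s)
        counts.append(count)
        s //= 2
    # slope of the log-log fit is the dimension estimate
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)),
                          np.log(np.array(counts)), 1)
    return slope
```

A solid filled block gives a dimension near 2, while sparse, uniformly spaced text pixels give a lower value, matching the intuition above.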

2.3 Fuzzy picture-text segmentation algorithm

As mentioned in the previous section, the new document image segmentation method utilizes the wavelet transform coefficients to extract the features of picture components and text components, which are then separated by a fuzzy algorithm applied to the extracted features. We use spreading and region growing to mark all the foreground blocks, including picture images, text images, and other kinds of images, on the LL2 band; these blocks are then classified as text components or picture components (non-text components).

The segmentation method uses the color number, the energy of the edge projection, and the fractal dimension to segment the foreground blocks. The reasons for using these three parameters are as follows:

(1) Color number: Since picture components are more colorful than text components, the color number can distinguish them.

(2) The energy of edge projection: This shows the distribution of the edge projection in a block. In general, the variation of the edge projection is regular in text components and irregular in picture components.

(3) Fractal dimension (FD): This shows the complexity of the image. In most cases, the pixels in text components are distributed more uniformly, and the fractal dimension is lower than that of picture components.

The three parameters, the color number, the energy of the edge projection, and the fractal dimension, are used at the same time to reduce misjudgment. For example, if the reliabilities of the three parameters are 9/10, 19/20, and 4/5, the decision error of picture-text segmentation diminishes to 1/1000 (1/10 × 1/20 × 1/5) when the three parameters are considered together appropriately. Fuzzy rule calculation [23] is well suited to analyzing this kind of variable. Therefore, we propose a fuzzy picture-text segmentation algorithm to separate text components and picture components from color document images.
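The combined-misjudgment arithmetic in the example above works out as follows:

```python
# If the three features err independently at rates 1/10, 1/20, and
# 1/5, a misjudgment requiring all three to err simultaneously occurs
# at the product of the individual error rates.
reliabilities = [9/10, 19/20, 4/5]
error = 1.0
for r in reliabilities:
    error *= (1.0 - r)
print(error)  # ~0.001, i.e. 1/1000
```

This assumes the three errors are independent; in practice the features are correlated, which is one motivation for combining them with fuzzy rules rather than a plain product.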

Fig.2 The flowchart of fuzzy picture-text segmentation algorithm.


The flowchart of the algorithm is shown in Fig.2. Details of the algorithm are explained in the following subsections.

A. Spreading and region growing for block extraction

The coefficients of the LL2 band are used to perform block extraction. The proposed block extraction method divides the foreground of the document image into text components and picture components. Before this process, the coefficients of the LL2 band must be converted into bi-level data, and a thresholding method is used to decide the locations of the foreground and background. However, the background contains more pixels than the foreground, so to make the boundary of the foreground more obvious, the algorithm biases the threshold away from the mean value: the threshold is (Mean − Variance). Fig.3 illustrates the influence of this bias.

Fig.3 The influence of the threshold: the original gray-scale image, foreground/background separation by Mean, and separation by (Mean − Variance).
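The biased binarization can be sketched as below. The text gives the threshold as (Mean − Variance); whether "Variance" denotes the variance itself or the standard deviation is not stated here, so the standard deviation is used as one reading, and taking dark pixels as foreground is likewise an assumption.

```python
import numpy as np

def separate_foreground(ll2):
    """Binarize LL2 coefficients with the biased threshold described
    above. Assumptions: 'Variance' is read as the standard deviation,
    and pixels below the threshold (dark) are taken as foreground."""
    mean = ll2.mean()
    threshold = mean - ll2.std()  # bias the cut below the mean
    return (ll2 < threshold).astype(np.uint8)
```

Biasing the threshold below the mean keeps the abundant bright background out of the foreground mask, making the foreground boundary more obvious, as Fig.3 illustrates.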

A thresholding method extracts the foreground pixels and the noise at the same time, so we use the constrained run length algorithm (CRLA) to remove the noise pixels. The algorithm, proposed by Wahl et al. [6], preserves a pixel when it belongs to a valid continuous run. For example, given the binary string 11001000001000011 and a constraint C=4 on the run length of 0s, if the number of consecutive 0s is less than or equal to C, those 0s are replaced with 1s; otherwise, they are preserved. As a result, the above binary string is converted into the sequence 11111000001111111. Some noise is eliminated by this method.
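The CRLA smoothing on a binary string can be written directly from the description above:

```python
def crla(bits, c):
    """Constrained run length algorithm of Wahl et al. [6]: runs of
    0s whose length is at most the constraint c are flipped to 1s;
    longer runs are preserved."""
    out, i, n = [], 0, len(bits)
    while i < n:
        if bits[i] == '1':
            out.append('1')
            i += 1
        else:
            # measure the run of consecutive 0s
            j = i
            while j < n and bits[j] == '0':
                j += 1
            run = j - i
            out.append(('1' if run <= c else '0') * run)
            i = j
    return ''.join(out)
```

Applied to the example string with C=4, `crla("11001000001000011", 4)` reproduces the sequence 11111000001111111.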

The CRLA is performed in the horizontal and vertical directions to obtain the bi-level images Mv and Mh, respectively. Then we apply the OR operator to Mv and Mh pixel by pixel to get a bi-level spreading image, Mhv, which merges the neighboring pixels in both directions.

The thresholding, CRLA, and logic operations together are called the spreading process. After the spreading process, the bi-level spreading image Mhv is processed by the region growing method to gather the foreground pixels into rectangular blocks. The steps of the region growing method are described below:

Step 1. Collect the foreground pixels of image Mhv row by row.

Step 2. Compare the foreground pixels collected in Step 1 with the current blocks. If there is any overlap between the foreground pixels and a block, merge them into the same block. If there is no overlap, create a new block for the foreground pixels.

Step 3. After region growing, check every block. If a block is neither growing bigger nor newly created, stop its growth and regard it as an isolated block.

Step 4. Check whether there is any overlap between blocks. Merge overlapping blocks into the same block.

Step 5. If Mhv comes to the last row, then go to Step 6; if not, return to Step 1.

Step 6. Change all existing blocks into isolated blocks. If there is any overlap between blocks, merge the overlapping blocks into the same block.

Step 7. Delete those smaller noise blocks.

Step 8. The end.
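Steps 1 to 6 can be sketched as follows. This simplified version works on axis-aligned bounding boxes, omits the noise-block deletion of Step 7, and the function name and block representation are assumptions.

```python
def grow_blocks(mhv):
    """Sketch of the region growing method: gather foreground pixels
    (1s) of a bi-level image into rectangular blocks row by row
    (Steps 1-6 above; Step 7's noise-block deletion is omitted)."""
    blocks = []  # each block: [top, bottom, left, right]
    for r, row in enumerate(mhv):
        # Step 1: collect foreground runs (left, right) in this row
        runs, c = [], 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                runs.append((start, c - 1))
            else:
                c += 1
        # Step 2: merge each run into an adjacent overlapping block,
        # otherwise create a new block for it
        for left, right in runs:
            merged = False
            for b in blocks:
                if b[1] >= r - 1 and not (right < b[2] or left > b[3]):
                    b[1] = r
                    b[2] = min(b[2], left)
                    b[3] = max(b[3], right)
                    merged = True
                    break
            if not merged:
                blocks.append([r, r, left, right])
    # Steps 4 and 6: merge blocks whose rectangles overlap
    changed = True
    while changed:
        changed = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if not (a[1] < b[0] or b[1] < a[0]
                        or a[3] < b[2] or b[3] < a[2]):
                    blocks[i] = [min(a[0], b[0]), max(a[1], b[1]),
                                 min(a[2], b[2]), max(a[3], b[3])]
                    del blocks[j]
                    changed = True
                    break
            if changed:
                break
    return blocks
```

Each returned rectangle corresponds to one foreground block of Mhv, ready for the feature computations that follow.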

After the spreading and region growing processes, we obtain all the foreground blocks of the image Mhv. An example is shown in Fig.4.

We can calculate the local edge projection variance ratio, the color number, and the fractal dimension from each foreground block.

(a) Original document image.(200 dpi, image size=768×256)

(b) The binary image of sub-band LL2.(image size=384×128)

(c) The binary image of sub-band LL2 after CRLA and region growing. (image size=384×128)

Fig.4 Example of CRLA and region growing.

These calculation methods are described as follows.

B. The calculation of local edge projection variance ratio

It is assumed that text is written in the horizontal or vertical direction. When the edge information is projected perpendicular to the text strings, the projection histogram varies regularly. In addition, the projection magnitudes of text components are larger than those of picture components. If the edge information is not projected perpendicular to the text strings, the variation of the projection histogram is irregular. This property can be used to decide the direction of the text strings: the horizontal or vertical edge projection is used to determine their direction.

Furthermore, since the variation of the edge projection differs between text components and picture components, it can be used to distinguish text components from picture components among the foreground blocks. After the discrete wavelet transform, the edge projection in the high-frequency bands (LH, HL, and HH) is more obvious than in the low-frequency band (LL). Therefore, the edge projection is calculated from the binary image that combines the binary images of the high-frequency bands (LH, HL, and HH) with a logical OR operator. Fig.5 shows the vertical and horizontal edge projections of a text image in the high-frequency band.

Fig.5 The edge projections of text components in the high-frequency band: the horizontal projection histogram, the vertical projection histogram, and the binary image of the high-frequency band.

The variation of the edge projection is regular in text components and irregular in picture components. The edge projection variance ratio is defined by

Edge projection variance ratio (EPVR) = (1/N) × Σ_{i=1}^{N} (P(i) − Mean)² / Mean²    (1-1)

in which N is the number of projection positions, P(i) is the magnitude of the ith projection, and Mean is the average of the projection histogram.

The edge projection variance ratio shows the variation in projection histogram.

Consider the two histograms shown in Fig.6: the left one is the projection histogram of a text component, and the right one is that of a picture component. The two projection histograms share the same EPVR, but only the left one reveals the property of text.

(a) The projection of a text component. (b) The projection of a picture component.
Fig.6 Two kinds of projection histogram.

So we modify equation (1-1) as follows:

Local edge projection variance ratio (LEPVR) = (1/N) × Σ_{i=1}^{N} (P(i) − Mean(i))² / Mean(i)²

where P(i) is the magnitude of the ith projection and Mean(i) is the local average of the projection around position i, which substitutes for the global one. Therefore, the LEPVR of text components is larger than that of picture components.
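One plausible reading of EPVR and its local variant LEPVR, with the variance of the projection histogram normalized by the squared mean, can be sketched as follows; the sliding-window size and the exact local-mean form are assumptions, and projection magnitudes are assumed positive.

```python
import numpy as np

def epvr(p):
    """EPVR sketch: variance of the projection histogram about its
    global mean, normalized by the squared mean. An illustrative
    reading of the definition, not the dissertation's exact formula."""
    p = np.asarray(p, dtype=float)
    mean = p.mean()
    return ((p - mean) ** 2).mean() / (mean ** 2)

def lepvr(p, window=3):
    """LEPVR sketch: the same measure, but with a local sliding-window
    mean substituted for the global one, so rapid regular text-like
    oscillation scores higher than a smooth picture-like profile.
    Assumes strictly positive projection magnitudes."""
    p = np.asarray(p, dtype=float)
    half = window // 2
    total = 0.0
    for i in range(len(p)):
        local = p[max(0, i - half):i + half + 1].mean()
        total += ((p[i] - local) ** 2) / (local ** 2)
    return total / len(p)
```

With the local mean, a rapidly alternating text projection deviates strongly from its neighborhood average, while a slowly varying picture projection tracks it closely, which is why LEPVR separates the two Fig.6 cases even when their global EPVRs coincide.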
