

4.3 Text extraction algorithm

After MLSM is performed, the whole document image is decomposed into various object layers. Each object layer may include significant information about characters, foreground objects, background textures or other objects. Each object layer is then binarized by setting its valid pixels to “1” and its invalid pixels to “0”. Note that in this work only text lines are of interest.
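As a minimal sketch of this binarization step (the per-pixel label-map representation of the MLSM output is an assumption of this sketch, not a structure specified in this work):

```python
import numpy as np

def binarize_layer(label_map: np.ndarray, layer_id: int) -> np.ndarray:
    """Binarize one object layer: valid pixels of the layer become 1,
    all other pixels become 0.

    `label_map` is assumed to hold, for each pixel, the index of the
    object layer assigned by MLSM (an assumption of this sketch).
    """
    return (label_map == layer_id).astype(np.uint8)
```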

The bounding boxes of all connected-components are extracted by the connected-component extraction step. The blocks that contain characters must then be identified and organized into text lines or text regions. A connected-component-based projection profile method is applied to separate all bounding boxes into different “general text lines” (GTLs). Each GTL contains a group of bounding boxes.

The following notation is defined:

(a). CCi is the i-th connected-component of the binarized object layer.

(b). CG is a group of connected-components, CG = {CCi, i = 0, 1, 2, …, p}.

(c). The connected-component CCi has the top, left, bottom and right coordinates denoted by t(CCi), l(CCi), b(CCi) and r(CCi), respectively, where t(CCi) < b(CCi) and l(CCi) < r(CCi).

(d). The width and height of CCi are denoted as W(CCi) and H(CCi), respectively.

(e). The horizontal and vertical distances between two bounding boxes are defined as

Dh(CCi, CCj) = max(l(CCi), l(CCj)) − min(r(CCi), r(CCj)),
Dv(CCi, CCj) = max(t(CCi), t(CCj)) − min(b(CCi), b(CCj)).

If the two bounding boxes overlap in the horizontal or vertical direction, the value of Dh(CCi, CCj) or Dv(CCi, CCj) is negative.

(f). The horizontal and vertical projection overlap measures of the two bounding boxes are defined as

Ph(CCi, CCj) = −Dh(CCi, CCj) = min(r(CCi), r(CCj)) − max(l(CCi), l(CCj)),
Pv(CCi, CCj) = −Dv(CCi, CCj) = min(b(CCi), b(CCj)) − max(t(CCi), t(CCj)),

so that Ph(CCi, CCj) > 0 (or Pv(CCi, CCj) > 0) exactly when the projections of the two boxes on the x-axis (or y-axis) overlap.
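A small sketch of the notation above may help; the tuple representation of a bounding box and the helper names are illustrative assumptions, while the formulas follow definitions (c)-(f):

```python
from typing import NamedTuple

class CC(NamedTuple):
    """Bounding box of a connected-component: t < b and l < r."""
    t: int  # top
    l: int  # left
    b: int  # bottom
    r: int  # right

def width(cc: CC) -> int:
    return cc.r - cc.l

def height(cc: CC) -> int:
    return cc.b - cc.t

def d_h(ci: CC, cj: CC) -> int:
    """Horizontal distance; negative when the boxes overlap horizontally."""
    return max(ci.l, cj.l) - min(ci.r, cj.r)

def d_v(ci: CC, cj: CC) -> int:
    """Vertical distance; negative when the boxes overlap vertically."""
    return max(ci.t, cj.t) - min(ci.b, cj.b)

def p_h(ci: CC, cj: CC) -> int:
    """Overlap of the projections on the x-axis; positive iff they overlap."""
    return -d_h(ci, cj)

def p_v(ci: CC, cj: CC) -> int:
    """Overlap of the projections on the y-axis; positive iff they overlap."""
    return -d_v(ci, cj)
```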

Using the functions and notation defined above, the details of the text extraction method are introduced as follows. The method includes two procedures:

The horizontal segmentation procedure H-seg(CGin) (the subscript “in” indicates the input CG) is performed as follows:

(1). Project all the bounding boxes of the CCs in the CGin horizontally onto the vertical y-axis.

(2). Sort all the CCs in the CGin according to their corresponding t(CCi), where all CCi ∈ CGin. Then scan the horizontal projections of these CCs on the y-axis and determine the “shadow segments” of these CCs on the y-axis. The CCs said to share the same shadow segment must have overlapping horizontal projections of their bounding boxes on the y-axis, which can be detected when Pv(CCi, CCj) > 0.

(3). For each shadow segment, group the CCs which are covered by the same shadow segment into an individual CG.

(4). After the above steps are performed, many CGs are produced: CGK, where K = 0, 1, 2, …, k−1. For each CGK, perform the vertical segmentation procedure V-seg(CGK).
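A sketch of steps (1)-(3) of H-seg, reusing the CC type from the earlier snippet; this is one way the shadow-segment detection could be implemented, as a single sweep over the boxes sorted by their top coordinates:

```python
def h_seg(cg: list[CC]) -> list[list[CC]]:
    """Split a CG into sub-groups whose horizontal projections on the
    y-axis share the same shadow segment (steps (1)-(3) of H-seg)."""
    boxes = sorted(cg, key=lambda cc: cc.t)   # step (2): sort by t(CCi)
    groups: list[list[CC]] = []
    seg_bottom = None                         # bottom of the current shadow segment
    for cc in boxes:
        # cc joins the current segment iff its projection overlaps it,
        # i.e. Pv > 0 against the running segment.
        if seg_bottom is not None and cc.t < seg_bottom:
            groups[-1].append(cc)
            seg_bottom = max(seg_bottom, cc.b)
        else:
            groups.append([cc])               # step (3): start a new CG
            seg_bottom = cc.b
    return groups
```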

The vertical segmentation procedure V-seg(CGK) is performed as follows:

(1). Project all the bounding boxes of CCs of the CGK vertically onto the x-axis.

(2). Sort all the CCs in the CGK according to their corresponding l(CCi), where all CCi ∈ CGK. Then scan the vertical projections of these CCs on the x-axis and determine their shadow segments. The CCs sharing the same shadow segment of their vertical projections on the x-axis can be detected when Ph(CCi, CCj) > 0.

(3). For each shadow segment, group the CCs which are covered by the same shadow segment into individual CGs.

(4). Check the two merge conditions for the adjacent CGs, CGK1 and CGK2: i) whether the horizontal space between the two adjacent CGs is sufficiently small, that is, small relative to the average width of all CCs belonging to the CG; ii) whether the average heights of the CCs belonging to the two CGs are similar, that is, the ratio of the two average heights should be within a reasonable range. If the above two conditions are both satisfied, then merge the two adjacent CGs (see the sketch after this list).

(5). After the above steps are performed, many CGs of CCs are produced: CGL, where L = 0, 1, 2, …, l−1. If only one resultant CG0 is obtained, then terminate the segmentation procedure; otherwise, for each CGL, perform H-seg(CGL).
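A corresponding sketch of V-seg, including the merge check of step (4), reusing the width and height helpers from the earlier snippet; the constants `max_gap_ratio` and `max_height_ratio` are placeholders, since the actual thresholds of the two merge conditions are not reproduced in this text:

```python
def v_seg(cg: list[CC]) -> list[list[CC]]:
    """Split a CG by shadow segments of the vertical projections on the
    x-axis (steps (1)-(3)), then apply the merge check of step (4)."""
    boxes = sorted(cg, key=lambda cc: cc.l)   # step (2): sort by l(CCi)
    groups: list[list[CC]] = []
    seg_right = None
    for cc in boxes:
        if seg_right is not None and cc.l < seg_right:   # Ph > 0
            groups[-1].append(cc)
            seg_right = max(seg_right, cc.r)
        else:
            groups.append([cc])
            seg_right = cc.r
    return merge_adjacent(groups)

def merge_adjacent(groups: list[list[CC]],
                   max_gap_ratio: float = 1.0,      # placeholder constant
                   max_height_ratio: float = 2.0    # placeholder constant
                   ) -> list[list[CC]]:
    """Merge neighbouring CGs when i) the horizontal gap is small relative
    to the average CC width and ii) their average CC heights are similar."""
    if not groups:
        return []
    merged = [groups[0]]
    for nxt in groups[1:]:
        cur = merged[-1]
        gap = min(c.l for c in nxt) - max(c.r for c in cur)
        both = cur + nxt
        avg_w = sum(width(c) for c in both) / len(both)
        h1 = sum(height(c) for c in cur) / len(cur)
        h2 = sum(height(c) for c in nxt) / len(nxt)
        ratio = max(h1, h2) / min(h1, h2)
        if gap < max_gap_ratio * avg_w and ratio < max_height_ratio:
            merged[-1] = both                # conditions i) and ii) hold
        else:
            merged.append(nxt)
    return merged
```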

As defined above, the text extraction procedure is performed on all CCs of a given object image by recursive segmentation, alternating the H-seg and V-seg procedures. The sets of CCs extracted from the processed object image are defined as the CGs; the process and the results of the text extraction algorithm are illustrated in Fig. 26.
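The recursive alternation described above can be sketched as follows, under the same assumptions as the previous snippets; termination follows because any split produces strictly smaller groups:

```python
def extract_gtls(ccs: list[CC]) -> list[list[CC]]:
    """Recursively alternate H-seg and V-seg until no CG can be split
    further; the resulting CGs are the general text lines (GTLs)."""
    gtls: list[list[CC]] = []
    for cg_h in h_seg(ccs):
        sub = v_seg(cg_h)
        if len(sub) == 1:                 # step (5): one resultant CG -> terminate
            gtls.append(sub[0])
        else:
            for cg_v in sub:              # otherwise recurse via H-seg
                gtls.extend(extract_gtls(cg_v))
    return gtls
```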

(a). The result of the H-seg procedure on the CCs of Fig. 25(d).

(b). The result of the V-seg procedure on the first CG of Fig. 26(a).

(c). The text plane after the text extraction algorithm.

(d). The binary image of the text plane.

Fig. 26. An example of the text extraction algorithm applied to Fig. 25(d).

We denote the resultant CGs as the “general text lines”, GTLs. Figure 26(a) shows the result of the H-seg procedure on the CCs of Fig. 25(d); five CGs are obtained. The V-seg procedure is then performed on the five CGs in turn. Taking the first CG as an example, the result is shown in Fig. 26(b): the CG obtained from the V-seg procedure is divided into two CGs according to condition (4-23). The H-seg procedure is then performed on the two CGs in turn, and neither can be divided into further CGs. Hence, the two CGs are the resultant GTLs, which are then checked by the text-line decision rules.

We state a set of knowledge-based decision rules to determine whether each GTL is a text line or a non-text region. If a GTL meets the required text-line rules, it is identified as a text line. The shape and contents of the bounding box are determined from the GTL, and features of the text line, such as the transition pixel ratio, the foreground pixel density ratio, and the block size, are considered. A “1” represents a valid pixel and a “0” represents an invalid pixel. A transition pixel is a pixel at the boundary of the foreground pixels.

The horizontal transition pixel ratio of the GTL block is defined as

Th = (total number of the transition pixels of the GTL) / NCol,   (4-24)

where NCol is the number of the pixel columns in which valid pixels are present.

The valid pixel density of the GTL is defined as

D = (number of the valid pixels of the GTL) / A,   (4-25)

where A is the area of the bounding box of the GTL.

The width and height of the bounding box of the GTL are denoted as W and H, respectively, and the number of the CCs belonging to the GTL is Nc.
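As a sketch of the feature computation, assuming the GTL block is available as a binary NumPy array (1 = valid pixel); reading a transition pixel as a 0-to-1 step along a pixel row is an assumption of this sketch, one plausible interpretation of “at the boundary of the foreground pixels”:

```python
import numpy as np

def gtl_features(block: np.ndarray, n_ccs: int) -> dict:
    """Compute the decision features of a GTL block (binary array)."""
    h, w = block.shape
    steps = np.diff(block.astype(np.int8), axis=1)
    transitions = int(np.count_nonzero(steps == 1))    # 0 -> 1 steps per row
    n_col = int(np.count_nonzero(block.any(axis=0)))   # columns with valid pixels
    t_h = transitions / n_col if n_col else 0.0        # Th, eq. (4-24)
    density = float(block.sum()) / (h * w)             # D, eq. (4-25)
    return {"Th": t_h, "D": density, "W": w, "H": h, "Nc": n_ccs}
```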

Using these features defined above, a GTL block is identified as a text-line block if all of the following decision rules are met.

(i). 1.1 < Th < 4.0   (4-26)

The transition pixel ratio used in condition (i) evaluates the complexity of the area of the GTL. The valid pixel density used in condition (ii) measures the density of the valid pixels. Conditions (iii)-(v) determine whether the CCs in the GTL are well aligned; that is, if a series of CCs forms a text line, the CCs should be well aligned. The above decision conditions were determined by analyzing many experimental results of processing document images containing text strings of various types, lengths and sizes. The constant values used in the decision conditions were determined experimentally and achieve good performance in most general cases.
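A hedged sketch of the decision step: only rule (i) with its thresholds (4-26) survives in this excerpt, so rules (ii)-(v) appear as a placeholder comment rather than real checks:

```python
def is_text_line(features: dict) -> bool:
    """Identify a GTL block as a text-line block if all decision rules hold."""
    if not (1.1 < features["Th"] < 4.0):   # rule (i), eq. (4-26)
        return False
    # Rules (ii)-(v) -- the valid-pixel density check and the CC alignment
    # checks -- are not reproduced in this excerpt; their thresholds would
    # be tested here.
    return True
```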

After all object layers have been processed, all extracted text lines are collected as the final segmentation result, as depicted in Fig. 26(c). Figure 26(d) is the binarized text image of Fig. 26(c).
