Field data extraction for form document processing using a gravitation-based algorithm

* Corresponding author. Tel.: 3-5731802; fax: +886-3-5721500.

E-mail addresses: [email protected] (J.-L. Chen), [email protected] (H.-J. Lee).

Jiun-Lin Chen, Hsi-Jian Lee*

Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan 30050, ROC

Received 6 June 1999; received in revised form 24 April 2000; accepted 20 June 2000

Abstract

This paper presents a novel approach to grouping Chinese handwritten field data filled in form documents using a gravitation-based algorithm. The algorithm extracts handwritten field data that may be written outside their form fields. First, form lines are extracted and removed from the input form image. Connected-components are then detected in the remaining data, and the gravitation of each connected-component is computed using its black pixel count as its mass. Next, the connected-components are moved according to their gravitation. Filled-in data generally have the locality property, i.e., data of the same field are normally written consecutively in a local area, so the relationship among these components can be determined from this property. Repeatedly moving the connected-components according to their neighboring connected-components allows us to determine which connected-components should be extracted for a particular field. Experimental results demonstrate the effectiveness of the proposed method in grouping field data. © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Form document processing; Field-data grouping; Gravitation-based algorithm; Connected-component; Locality property

1. Introduction

The main purpose of a form document processing system is to collect the information filled in by individuals for further applications. Retrieval and recognition of filled-in data are the essential functions of such a system. However, automatically extracting the filled-in data of each field is difficult because handwritten data may be written outside their fields, as shown in Fig. 1. In the office and medical form documents collected for our experiments, this situation occurs in 17.63% of data fields and 30.04% of filled data fields. To overcome this problem, we present a novel approach for extracting handwritten filled-in data from form documents.

Several form document processing systems have been developed in recent years [1-7]. Casey et al. [1] developed an intelligent form-processing system that acquires filled-in data from a form image. Taylor et al. [2] proposed a system for extracting data from preprinted forms. They located pixels within the interior of the field and searched slightly above and below the field for characters that extend outside it. Yu et al. [3] proposed a system for form dropout. Although their system can separate input characters from form frames, it does not extract any relationship between input characters and fields, so it cannot determine the fields into which characters are filled. Fan et al. [4,5] presented a clustering-based method for extracting characters from form documents, in which feature points are clustered to distinguish characters from the form. Fletcher et al. [7] presented a connected-component based method for separating text strings from mixed text/graphics images. To extract text strings, their algorithm groups collinear components together, groups strings logically into words and phrases, and then performs area/ratio filtering.

Fig. 1. Three different cases of data written outside a field. (a) A stroke of a character extends outside the field. (b) An entire character is written outside the field. (c) Several characters are written outside the field.

Extracting filled-in field data is an important part of a complete form document processing system. Several techniques have been proposed, including connected-component analysis [7], searching outside the field [2] and feature point clustering [4,5]. Connected-component analysis can distinguish characters in mixed text/graphics images; however, it cannot identify the field that the characters belong to. Searching outside the field [2], although a preferred means of handling characters that extend out of the field, has two problems: (1) determining how far to search is difficult and (2) determining which field should be extended is difficult as well. Feature point clustering, although it efficiently distinguishes characters from the form structure, cannot determine which field the characters belong to either. The field data shown in Fig. 1(a) can be extracted correctly by connected-component analysis, but not by feature point clustering. However, the field data in Fig. 1(b) and Fig. 1(c) cannot be extracted successfully by any of these techniques.

In this paper, we present a novel gravitation-based algorithm for grouping and extracting filled-in field data of form documents. In the form documents collected herein, we found that filled-in data have the locality property, meaning that filled-in data of the same field are usually written consecutively in a local area. If each connected-component of the filled-in data is regarded as a planet in the universe, the gravitation among the components causes consecutive data to gravitate towards one another. Such a gravitation model can therefore be applied to group field data, since data of the same field are usually closer together than data of different fields. Fig. 2 presents a form document in which filled-in data, the telephone number, are written outside the fields.

The proposed method contains three major parts: preprocessing, gravitation processing and postprocessing. In the following sections, we discuss the details of the proposed method.

2. Preprocessing

In our proposed method, an input form document is first scanned at 300 dpi (dots per inch) as a gray-scale image. Next, noise removal and binarization are performed on the input form image using the method of Niblack [8]. Then, the method of Chen and Lee [9] is used to extract the form fields and to remove all extracted form lines from the form image. Finally, connected-component detection is performed to locate the filled-in data.
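Niblack's method is commonly described as comparing each pixel with a local threshold T = m + k·s, where m and s are the mean and standard deviation of the gray levels in a window around the pixel. The sketch below illustrates that idea in plain Python. The window size, the weight k = -0.2 and the function name are illustrative assumptions; the paper does not report the settings it used.

```python
from statistics import mean, pstdev


def niblack_binarize(gray, window=15, k=-0.2):
    """Binarize a gray-scale image given as a list of rows.

    Returns 1 for black (ink) and 0 for background.
    `window` and `k` are assumed values, not taken from the paper.
    """
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gray levels in the local window, clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            threshold = mean(vals) + k * pstdev(vals)
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out
```

A production implementation would normally use integral images or a vectorized library instead of recomputing each window, but the thresholding rule would be the same.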

2.1. Field extraction and form line removing

To extract form lines and fields, this work adopts the strip projection method [9], which extracts the form structure efficiently. Because meaningful filled-in data often appear around the border of a form, we add four additional virtual fields (Fig. 3) to the list of extracted fields. After the form lines have been located by their two end-points, all form lines can be removed so that processing can focus on the form data.

Fig. 2. A form document with filled-in data, the phone number, written outside the field.

Fig. 3. Four additional virtual fields, labeled as gray regions, for a form document.

The line extraction method employed herein yields the two end-points of each extracted line on the original form image. Thus, the 3 × 3 window shown in Fig. 4 can be used to trace along a given form line and remove it. The steps to trace a horizontal form line are as follows:

Step 1. Set $P$ as the left end-point.

Step 2. If $P_1$ is a black pixel, move $P$ to $P_1$. Repeat step 2.

Step 3. If $P_2$ is a black pixel, move $P$ to $P_2$, and go to step 2.

Step 4. If $P_3$ is a black pixel, move $P$ to $P_3$, and go to step 2.

Step 5. If there is a black pixel $P_i$ on the right side of $P$ and in the same row as $P$, and Distance($P_i$, $P$) < 20 pixels, then move $P$ to $P_i$ and go to step 2.

Step 6. Stop. (The right end of the line has been reached.)

Here $P_1$, $P_2$ and $P_3$ denote neighbors of $P$ in the window of Fig. 4.

Fig. 4. The 3 × 3 windows used in line removal.

Fig. 5. Result of line removing and connected-component detection.

To remove a horizontal form line, we go through the line twice. The first pass estimates the line width. At each point, we trace up and down to locate the top and bottom end-points by a procedure similar to line tracing. The distance between the top and bottom end-points is recorded as the line width at that point. After the whole line has been traversed, the width of the line is defined as the median of all recorded width values. This estimated line width is used as the line width threshold in the next pass.

Next, the horizontal line is removed in the second pass. If the width at a point $P$ is larger than the width threshold, the vertical line segment at $P$ is preserved, since it is assumed to lie on a long vertical line or on a character. Otherwise, the vertical line segment at $P$ is removed. Preserving the vertical segments at points whose width exceeds the threshold keeps other data intact when they touch the horizontal line being removed.
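The two passes described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the line lies exactly on one known row between two known columns, whereas the paper traces the line with a 3 × 3 window so that it can follow small deviations. The function names and the image representation (a list of rows with 1 for black) are assumptions.

```python
from statistics import median


def vertical_run(img, x, y):
    """Return (top, bottom) of the black vertical run through pixel (x, y)."""
    top = y
    while top > 0 and img[top - 1][x]:
        top -= 1
    bottom = y
    while bottom < len(img) - 1 and img[bottom + 1][x]:
        bottom += 1
    return top, bottom


def remove_horizontal_line(img, y, x0, x1):
    """Remove a horizontal line on row y between columns x0 and x1 (inclusive)."""
    # Pass 1: record the line width at every black point of the line.
    runs = {x: vertical_run(img, x, y)
            for x in range(x0, x1 + 1) if img[y][x]}
    if not runs:
        return img
    # The median of the recorded widths is the width threshold.
    threshold = median(bottom - top + 1 for top, bottom in runs.values())
    # Pass 2: erase segments no wider than the threshold; keep wider ones,
    # which are assumed to belong to a vertical line or a character.
    for x, (top, bottom) in runs.items():
        if bottom - top + 1 <= threshold:
            for yy in range(top, bottom + 1):
                img[yy][x] = 0
    return img
```

In this sketch a character stroke crossing the line survives intact because its vertical run is much longer than the median line width.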

The procedure to remove vertical form lines resembles the above procedure.

2.2. Connected-component detection

After all form lines have been removed, connected-component detection is performed. The bounding box, the black pixels and the number of black pixels of each component are recorded. Overlapping connected-components are merged, and components that are too small are eliminated.
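A minimal sketch of this step is given below. It labels 8-connected components with a breadth-first search, merges components whose bounding boxes overlap, and then drops small ones. Treating "overlapped" as bounding-box overlap, the use of 8-connectivity, the `min_pixels` value and all function names are assumptions made for illustration.

```python
from collections import deque


def make_component(pixels):
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return {"bbox": (min(xs), min(ys), max(xs), max(ys)),
            "pixels": pixels,
            "mass": len(pixels)}  # black pixel count


def label_components(img):
    """Find 8-connected components of black pixels (value 1)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(x, y)]), []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            comps.append(make_component(pixels))
    return comps


def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(comps):
    """Merge components with overlapping bounding boxes until none remain."""
    comps = list(comps)
    changed = True
    while changed:
        changed = False
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                if boxes_overlap(comps[i]["bbox"], comps[j]["bbox"]):
                    other = comps.pop(j)
                    comps[i] = make_component(comps[i]["pixels"] + other["pixels"])
                    changed = True
                    break
            if changed:
                break
    return comps


def detect_components(img, min_pixels=5):
    comps = merge_overlapping(label_components(img))
    return [c for c in comps if c["mass"] >= min_pixels]
```

Each returned component carries the three items the text lists: its bounding box, its black pixels, and its black pixel count.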

In the gravitation-based algorithm proposed herein, we operate on connected-components rather than on characters or black pixels. Fig. 5 presents the result of form line removal and connected-component detection for Fig. 2. In the next section, we explain how connected-components of the same field are grouped together.

3. Gravitation-based algorithm

Fig. 6. Partial center area of a field.

Fig. 7. Flow chart of gravitation-based algorithm.

Fig. 8. (a) The gravitation of a connected-component C. (b) The gravitating result of C.

Once the form data are represented as connected-components, the connected-components can be moved to their corresponding fields with a gravitation-based algorithm. According to physics theory, the gravitation between objects depends on their masses and the distance between them. Herein, the black pixel count of each connected-component is used as its mass. The gravitation acting on each connected-component is computed from the masses of the other components and the distances to them. To reduce the processing time, only those components whose center of mass is not located inside the center area of a field are moved. The gray area shown in Fig. 6 denotes a partial center area of a field, and the components '1' and ')' are located in it. The rest of the components are moved according to their gravitation. Once a connected-component has moved into the center area of a form field, we determine that this component belongs to that field. By repeating the above steps, we can move the connected-components that are near field boundaries into the fields they belong to. The iteration of the gravitation process stops when a stop criterion is satisfied. Fig. 7 shows the flow chart of the gravitation-based algorithm, and the details are as follows.

First, a boundary checking process is applied to each connected-component. If the center of mass of a component is in the center area of a field, this component is assigned to that field and marked as "ASSIGNED," indicating that no further processing of this component is necessary. The height and width of the center area of a field are determined as follows.

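$$
\begin{aligned}
B &= \min(\text{field height},\ \text{field width}) \times 0.4,\\
\text{height} &= \text{field height} - 2B,\\
\text{width} &= \text{field width} - 2B.
\end{aligned}
$$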
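The boundary check can be sketched in a few lines. The sketch assumes that the center area is centered in the field, i.e., that the margin B is removed on every side, which is consistent with the height and width shrinking by 2B. The field representation (left, top, right, bottom) and the function names are hypothetical.

```python
def center_area(field):
    """Return the center area of a field given as (left, top, right, bottom)."""
    left, top, right, bottom = field
    # B = min(field height, field width) * 0.4, as in the definition above.
    b = min(bottom - top, right - left) * 0.4
    # Assumption: the margin B is trimmed from each side of the field.
    return (left + b, top + b, right - b, bottom - b)


def center_of_mass(pixels):
    """Mean position of a component's black pixels, given as (x, y) pairs."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def is_assigned(pixels, field):
    """True if the component's center of mass lies in the field's center area."""
    cx, cy = center_of_mass(pixels)
    left, top, right, bottom = center_area(field)
    return left <= cx <= right and top <= cy <= bottom
```

For a field 100 pixels wide and 40 pixels high, B is 16, so the center area is 68 pixels wide and 8 pixels high; the center area of a wide, flat field is therefore a thin horizontal band.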

Fig. 9. The gravitating progress of connected-components in the sequence (a)-(d). (a) The bottom-right area of Fig. 2.

After the field boundary check has been performed on each connected-component, the gravitation of the components that are not set as "ASSIGNED" is computed by applying the gravitation function below:

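$$
G_f(C_a, C_b) = \frac{M_a M_b}{\mathrm{Distance}(W_a, W_b)}\,\vec{u}_{ab},
$$

where $W_i$ is the centre of mass of $C_i$, $\vec{u}_{ab}$ is the unit vector of $\overrightarrow{(W_a - W_b)}$, and $M_i$ is the mass of $C_i$.

$$
G(C_x) = k\,\frac{\sum_i G_f(C_x, C_i)}{M_x}, \quad \text{for all } C_i \neq C_x.
$$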

The constant $k$ in the above expression translates the gravitation into a moving distance in pixels. Experimental results indicate that setting $k = 5$ yields good results.
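The displacement of one component can be sketched as below. Two points are assumptions of this sketch rather than statements of the paper. First, the force on a component is taken to point towards the other component, so that components attract each other. Second, the exponent on the distance is exposed as the parameter `power`: the default 1 follows the formula as it reads in the text, while 2 would correspond to Newton's inverse-square law. The dictionary layout and function names are hypothetical.

```python
import math


def pair_force(w_a, m_a, w_b, m_b, power=1):
    """Force exerted on component a by component b, as an (x, y) vector."""
    dx, dy = w_b[0] - w_a[0], w_b[1] - w_a[1]
    d = math.hypot(dx, dy)
    if d == 0:
        return (0.0, 0.0)  # coincident centers: no defined direction
    magnitude = m_a * m_b / d ** power
    # Assumption: the unit vector points from a towards b (attraction).
    return (magnitude * dx / d, magnitude * dy / d)


def displacement(index, comps, k=5.0, power=1):
    """G(C_x): the move, in pixels, of comps[index] caused by all others."""
    target = comps[index]
    gx = gy = 0.0
    for i, other in enumerate(comps):
        if i == index:
            continue
        fx, fy = pair_force(target["com"], target["mass"],
                            other["com"], other["mass"], power)
        gx += fx
        gy += fy
    # Dividing by the component's own mass and scaling by k gives pixels.
    return (k * gx / target["mass"], k * gy / target["mass"])
```

Because the sum is divided by the component's own mass, the displacement depends only on the masses and distances of the other components: a light stroke next to a heavy character moves much further than the character does.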

After the gravitation has been computed for all connected-components that are not set as "ASSIGNED," these connected-components are moved according to their gravitation. Fig. 8 illustrates the movement of a component.

Repeating these steps assigns most data components to the fields they belong to. The iteration stops when either of the following two conditions is satisfied:

(1) $G(C_x)$ of every connected-component is less than three pixels.

(2) The iteration has been repeated $n$ times.

Although criterion 1 can be satisfied after a certain number of iterations, a larger $n$ increases the computation time. Experimental results indicate that setting $n = 10$ yields excellent performance.

Details of the gravitation-based algorithm are as follows:

Repeat {

    for each component $C_i$ among all connected-components

        if $C_i$ is inside the center area of a field, then set $C_i$ as "ASSIGNED."

    for each $C_i$ that is not set as "ASSIGNED," compute the gravitation of $C_i$.

    move all connected-components according to their gravitation.

} until (a stop criterion is met.)
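The loop above can be sketched as a small runnable program. It is an illustration under stated assumptions, not the authors' C implementation: fields are (left, top, right, bottom) rectangles, each component is a dictionary holding its center of mass, mass and assigned field, forces are attractive, the distance exponent `power` is a parameter (1 as the formula reads in the text, 2 for an inverse-square law), and a final boundary check after the loop is added for convenience.

```python
import math


def center_area(field):
    left, top, right, bottom = field
    b = min(bottom - top, right - left) * 0.4
    return (left + b, top + b, right - b, bottom - b)


def find_field(com, fields):
    """Index of the field whose center area contains com, or None."""
    for idx, field in enumerate(fields):
        left, top, right, bottom = center_area(field)
        if left <= com[0] <= right and top <= com[1] <= bottom:
            return idx
    return None


def mark_assigned(comps, fields):
    for c in comps:
        if c["field"] is None:
            c["field"] = find_field(c["com"], fields)


def gravitate(comps, fields, k=5.0, n=10, min_move=3.0, power=1):
    for _ in range(n):                      # stop criterion 2: n iterations
        mark_assigned(comps, fields)
        moves = {}
        for i, c in enumerate(comps):
            if c["field"] is not None:      # ASSIGNED components do not move
                continue
            gx = gy = 0.0
            for j, other in enumerate(comps):
                if j == i:
                    continue
                dx = other["com"][0] - c["com"][0]
                dy = other["com"][1] - c["com"][1]
                d = math.hypot(dx, dy)
                if d == 0:
                    continue
                # G_f / M_x reduces to M_i / d**power along the unit vector.
                pull = other["mass"] / d ** power
                gx += pull * dx / d
                gy += pull * dy / d
            moves[i] = (k * gx, k * gy)
        # Stop criterion 1: every remaining move is shorter than min_move.
        if all(math.hypot(mx, my) < min_move for mx, my in moves.values()):
            break
        for i, (mx, my) in moves.items():
            x, y = comps[i]["com"]
            comps[i]["com"] = (x + mx, y + my)
    mark_assigned(comps, fields)            # added convenience check
    return comps
```

As a usage example, a light component just outside the center area of a field, next to a heavy component already inside it, is pulled into the center area after one move and is then marked with that field's index.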

Fig. 9 shows the gravitating progress of the connected-components in the bottom-right area of Fig. 2. We can see the digit '5' gravitating towards the field it belongs to, and all components gravitating together.

4. Postprocessing

Fig. 10. (a) Moving status of a field data extracted from Fig. 2. (b) Extracted results of (a).

After the gravitation process, some connected-components that are not set as "ASSIGNED" will remain because their gravitation is not large enough to move them into the center area of any field. However, these components may already have gravitated back into the fields they belong to. For these connected-components, two postprocessing operations are performed to determine the fields they belong to.

Density merging: Remaining connected-components are merged with components already located according to the black pixel density of the merged results. For a connected-component $C_i$ that is not set as "ASSIGNED," if the maximum density of the additional area introduced by adding $C_i$ to a given field is larger than 60%, we add $C_i$ to this field and set $C_i$ as "ASSIGNED."

This operation is repeated until no more connected-components can be merged.

Direct assignment: The remaining connected-components are assigned directly to the fields in which they are located.

5. Experimental results

In this section, some experimental results are presented to demonstrate the validity of the proposed method. The proposed system was implemented in C on a Pentium-300 personal computer running Linux with 96 MB RAM. The input forms were typical office forms, scanned at a resolution of 300 dpi. Twelve form documents were tested in this experiment. Filled-in data were correctly extracted in 95.42% of fields, and 72.86% of the ambiguous connected-components that were written outside fields were extracted correctly.

To demonstrate the effectiveness of our approach, we tested the same data set with the algorithm proposed by Taylor et al. [2], since the other approaches [4,5,7] do not group data into fields. With Taylor's method, filled-in data were correctly extracted in 94.86% of fields, and 41.24% of the ambiguous connected-components that were written outside fields were extracted correctly.

Fig. 10 shows a field extracted from Fig. 2. In this figure, the digit '5' that is written outside its field is correctly extracted. Fig. 11 shows another result. The borders shown in Figs. 10 and 11(d) are added to emphasize the extracted field data. According to Figs. 11(c) and (d), most field data are extracted correctly.

Some field data cannot be extracted correctly with the proposed approach. Some errors are introduced by the four additional surrounding fields, as shown in the circled area in Fig. 11(c). We observed that some form documents require filled-in data in these surrounding areas, while others have printed characters in these areas but no filled-in data. In addition, the data filled in these areas are usually written beside the boundary form lines. Thus, determining which field the filled-in data belong to is quite difficult. The left circled area in Fig. 12(c) shows another kind of error, which is caused by preprinted characters along the field boundaries. Such preprinted characters can be mis-extracted if the adjacent field contains a large amount of filled data. The other circled area in Fig. 12(c) illustrates input data that are written outside their original field (the field on top) and close to the data in another field. In this situation the data are mis-grouped with the data of the field below.

Since the proposed gravitation-based algorithm works on the connected-components located near field boundaries, the computation time of our method does not increase with the resolution of input images. The execution time for our test images is shown in Table 1.
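A short calculation on the figures reported in Table 1 makes this point concrete. The snippet below converts each image size to megapixels and divides the reported time by it; the helper name and the dictionary layout are hypothetical, while the numbers are those of Table 1.

```python
# (width, height, seconds) for each test image, as reported in Table 1.
runs = {
    "Fig. 2": (2088, 1356, 0.16),
    "Fig. 11": (1965, 2976, 1.64),
    "Fig. 12": (1000, 1404, 0.99),
}


def seconds_per_megapixel(width, height, seconds):
    """Execution time normalized by image size."""
    return seconds / (width * height / 1e6)


for name, (width, height, seconds) in runs.items():
    megapixels = width * height / 1e6
    rate = seconds_per_megapixel(width, height, seconds)
    print(f"{name}: {megapixels:.2f} MP, {rate:.3f} s/MP")
```

The smallest image (Fig. 12) has the highest time per megapixel and the image of Fig. 2, about twice its size, has the lowest, which suggests that the cost is driven by the connected-components being processed rather than by the pixel count.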

6. Conclusion

This paper presents a gravitation-based algorithm for grouping field data of form documents based on the locality property of filled-in data. The filled-in data of form documents are closely related to their fields. Understanding a form document requires exact knowledge of which field the handwritten data belong to. By adopting the locality feature of filled-in data, the proposed method can effectively extract filled-in data together with their fields.

Fig. 11. (a) Original binary form image. (b) Results of line removal and connected-component detection. (c) Data extracted from (a). (d) An extracted field which contains data written outside it.

In addition, the proposed method is integrated into our form document processing system. Our upcoming work will add a blank form drop-out procedure to the system to improve the results of filled-in data extraction. In the future, we will also attempt to optimize the source code of this method to further enhance its performance. The character segmentation and recognition methods we have developed will also be added to the system based on the results of this work. Furthermore, a comprehensive form document processing system will be constructed on the basis of this approach.

Fig. 12. (a) Original binary form image. (b) Results of line removal and connected-component detection. (c) Extracted field data.

Table 1
Execution time for the gravitation-based algorithm

Form      Resolution (pixels)   Time (s)
Fig. 2    2088 × 1356           0.16
Fig. 11   1965 × 2976           1.64
Fig. 12   1000 × 1404           0.99

7. Summary

In this paper, we present a gravitation-based algorithm for grouping field data of form documents based on the locality property of filled-in data. The filled-in data of form documents are closely related to their fields. Understanding a form document requires exact knowledge of which field the handwritten data belong to. By adopting the locality feature of filled-in data, the proposed method can extract filled-in data with their fields correctly.

To utilize the locality property, we extract and remove form lines and apply connected-component detection to locate form data. We compute the gravitation of each connected-component using its black pixel count as its mass. By repeatedly moving these data according to the gravitation exerted by their neighboring components, we can determine which connected-components should be extracted for a particular field.

Experimental results show that the proposed method can group data that are written outside fields. Filled-in data were correctly extracted in 95.42% of fields, and 72.86% of the ambiguous connected-components that were written outside fields were extracted correctly.

References

[1] R.G. Casey, D.R. Ferguson, K. Mohiuddin, E. Walach, Intelligent forms processing system, Mach. Vision Appl. 5 (1992) 143-155.

[2] S.L. Taylor, R. Fritzson, J.A. Pastor, Extraction of data from preprinted forms, Mach. Vision Appl. 5 (1992) 211-222.

[3] B. Yu, A.K. Jain, A generic system for form dropout, IEEE Trans. Pattern Anal. Mach. Intell. 18 (11) (1996) 1127-1134.

[4] K.C. Fan, J.M. Lu, L.S. Wang, H.Y. Liao, Extraction of characters from form documents by feature point clustering, Pattern Recognition Lett. 16 (1995) 963-970.

[5] L.H. Chen, J.Y. Wang, H.Y. Liao, K.C. Fan, A robust algorithm for separation of Chinese characters from line drawings, Image Vision Comput. 14 (1996) 753-761.

[6] Y.Y. Tang, S.W. Lee, C.Y. Suen, Automatic document processing: a survey, Pattern Recognition 29 (12) (1996) 1931-1952.

[7] L.A. Fletcher, R. Kasturi, A robust algorithm for text string separation from mixed text/graphics images, IEEE Trans. Pattern Anal. Mach. Intell. 10 (6) (1988) 910-918.

[8] W. Niblack, An Introduction to Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 115-116.

[9] J.L. Chen, H.J. Lee, An efficient algorithm for form structure extraction using strip projection, Pattern Recognition 31 (9) (1998) 1353-1368.

About the Author: JIUN-LIN CHEN received his B.S. degree in Computer Science and Information Engineering from National Chiao Tung University in 1992. He is currently a Ph.D. student in Computer Science and Information Engineering at National Chiao Tung University. His research interests are document processing and image processing.

About the Author: HSI-JIAN LEE received the B.S., M.S. and Ph.D. degrees in Computer Engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1976, 1980, and 1984, respectively. From 1981 to 1984, he was a lecturer in the Department of Computer Engineering, National Chiao Tung University, and from 1984 to 1989 an associate professor in the same department. Since August 1989, he has been with National Chiao Tung University as a professor. He was the chairman of the Department of Computer Science and Information Engineering from August 1991 to July 1997. From January 1997 to July 1998, he was a deputy director of the Microelectronic and Information Research Center (MIRC). Since August 1998, he has been the general secretary to the president of National Chiao Tung University. He was the president of the Oriental Language Computer Society (OLCS) and the editor-in-chief of the International Journal of Computer Processing of Oriental Languages (CPOL), and has been an associate editor of the International Journal of Pattern Recognition and Artificial Intelligence, and of Pattern Analysis and Applications. He has been a member of the executive committee of the Chinese Society on Image Processing and Pattern Recognition. He was responsible for the 1992 ROC Computational Linguistic Workshop and the 1993 ROC Conference on Computer Vision, Graphics, and Image Processing. He was the program chair of the 1994 International Computer Symposium and of the Fourth International Workshop on Frontiers in Handwriting Recognition (IWFHR). In 1997, he was a winner of the ten outstanding information persons of ROC. In 1992-94, he was a winner of outstanding researchers of the National Science Council, ROC. He was the general chair of the Fourth Asian Conference on Computer Vision (ACCV), January 2000. His current research interests include document analysis, optical character recognition, image processing, pattern recognition, digital library and artificial intelligence. He is a member of Phi Tau Phi.