5 Status of OCR 2
5.4 OCR performance evaluation
No standardized test sets exist for character recognition, and because the performance of an OCR system depends heavily on the quality of the input, it is difficult to evaluate and compare different systems. Still, recognition rates are often given, usually as the percentage of characters correctly classified. However, this says nothing about the errors committed. Therefore, in the evaluation of an OCR system, three different performance rates should be investigated:
• Recognition rate.
The proportion of correctly classified characters.
• Rejection rate.
The proportion of characters which the system was unable to recognize. Rejected characters can be flagged by the OCR system and are therefore easily traced for manual correction.
• Error rate.
The proportion of characters erroneously classified. Misclassified characters go undetected by the system, and manual inspection of the recognized text is necessary to detect and correct these errors.
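The three rates can be computed from simple counts, as in the following minimal sketch. The class `OcrCounts`, the function `performance_rates` and the sample counts are hypothetical and chosen only for illustration. The sketch assumes that every input character falls into exactly one of the three categories.

```python
from dataclasses import dataclass


@dataclass
class OcrCounts:
    """Hypothetical tally of how the characters in a test document were handled."""
    correct: int        # characters classified correctly
    rejected: int       # characters the system flagged as unrecognizable
    misclassified: int  # characters given a wrong label without any flag

    @property
    def total(self) -> int:
        return self.correct + self.rejected + self.misclassified


def performance_rates(counts: OcrCounts) -> dict:
    """Return recognition, rejection and error rates as fractions of all characters."""
    total = counts.total
    if total == 0:
        raise ValueError("no characters were evaluated")
    return {
        "recognition": counts.correct / total,
        "rejection": counts.rejected / total,
        "error": counts.misclassified / total,
    }


# Illustrative numbers only: a 2000-character page.
rates = performance_rates(OcrCounts(correct=1960, rejected=30, misclassified=10))
for name, value in rates.items():
    print(f"{name} rate: {value:.2%}")
```

Under this assumption the three rates always sum to one, so reporting the recognition rate alone leaves the split between rejections and errors unknown.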
There is usually a tradeoff between the three rates. A lower error rate may come at the cost of a higher rejection rate and a lower recognition rate. Because of the time required to detect and correct OCR errors, the error rate is the most important when evaluating whether an OCR system is cost-effective or not. The rejection rate is less critical. An example from barcode reading may illustrate this. Here a rejection while reading a barcoded price tag will only lead to rescanning of the code or manual entry, while a misdecoded price tag might result in the customer being charged the wrong amount. In the barcode industry the error rates are therefore as low as one in a million labels, while a rejection rate of one in a hundred is acceptable.
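One common way such a tradeoff arises is through a rejection threshold on the classifier's confidence. The sketch below assumes a classifier that reports a confidence score for each character, which is an assumption made here and not a mechanism described in the text. The function `rates_at_threshold` and the data in `SAMPLE` are hypothetical.

```python
def rates_at_threshold(results, threshold):
    """Compute (recognition, rejection, error) rates for a given rejection threshold.

    results: list of (confidence, is_correct) pairs, one per character.
    A character is rejected when its confidence is below the threshold.
    """
    total = len(results)
    rejected = sum(1 for conf, _ in results if conf < threshold)
    correct = sum(1 for conf, ok in results if conf >= threshold and ok)
    errors = total - rejected - correct
    return correct / total, rejected / total, errors / total


# Made-up scores for ten characters: (confidence, classified correctly?)
SAMPLE = [
    (0.99, True), (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, True), (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]

for threshold in (0.0, 0.75, 0.88):
    recognition, rejection, error = rates_at_threshold(SAMPLE, threshold)
    print(f"threshold {threshold:.2f}: recognition {recognition:.0%}, "
          f"rejection {rejection:.0%}, error {error:.0%}")
```

On this toy data, raising the threshold drives the error rate from 40% to 10% to 0%, while the rejection rate rises from 0% to 50% to 70% and the recognition rate falls from 60% to 40% to 30%.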
In view of this, it is apparent that it is not sufficient to look solely at the recognition rate of a system. A recognition rate of 99% might imply an error rate of 1%. In the case of text recognition on a printed page, which on average contains about 2000 characters, an error rate of 1% means 20 undetected errors per page. In postal applications for mail sorting, where an address contains about 50 characters, an error rate of 1% implies, on average, an error on every other piece of mail.
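The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical. The second function additionally assumes that errors on different characters are independent, which is a simplification introduced here and not a claim made in the text.

```python
def expected_errors(num_chars, error_rate):
    """Average number of undetected errors in a document of num_chars characters."""
    return num_chars * error_rate


def prob_at_least_one_error(num_chars, error_rate):
    """Probability that a document contains one or more errors.

    Assumes errors on different characters are independent (a simplification).
    """
    return 1.0 - (1.0 - error_rate) ** num_chars


page_errors = expected_errors(2000, 0.01)          # printed page, about 2000 characters
mail_errors = expected_errors(50, 0.01)            # postal address, about 50 characters
mail_affected = prob_at_least_one_error(50, 0.01)  # share of addresses with an error

print(f"expected errors per page:    {page_errors:.1f}")
print(f"expected errors per address: {mail_errors:.1f}")
print(f"addresses with >= 1 error:   {mail_affected:.1%}")
```

The expected counts are 20 errors per page and 0.5 errors per address, matching the figures above. Under the independence assumption, roughly 39% of addresses would contain at least one error, since some addresses would contain more than one.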
Chapter 6
The Future of OCR
Through the years, the methods of character recognition have improved from quite primitive schemes, suitable only for reading stylized printed numerals, to more complex and sophisticated techniques for the recognition of a great variety of typeset fonts and also handprinted characters. Below, the future of OCR is briefly discussed, with regard to both research and areas of application.
6.1 Future improvements
New methods for character recognition are still expected to appear, as computer technology develops and decreasing computational restrictions open up for new approaches. There might, for instance, be a potential in performing character recognition directly on grey level images. However, the greatest potential seems to lie in the exploitation of existing methods, by mixing methodologies and making more use of context.
Integration of segmentation and contextual analysis can improve recognition of joined and split characters. Also, higher level contextual analysis which looks at the semantics of entire sentences may be useful. Generally there is a potential in using context to a larger extent than is done today. In addition, combinations of multiple independent feature sets and classifiers, where the weakness of one method is compensated by the strength of another, may improve the recognition of individual characters.
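One simple way of combining several classifiers is majority voting with rejection, sketched below. The text does not prescribe a particular combination scheme, so voting is an assumption made here for illustration, and the function `combine_by_vote` and the example labels are hypothetical.

```python
from collections import Counter


def combine_by_vote(labels, min_votes=None):
    """Combine the labels proposed by several classifiers for one character.

    Returns the label that received at least min_votes votes, or None to
    signal a rejection. By default a strict majority is required.
    """
    if not labels:
        return None
    if min_votes is None:
        min_votes = len(labels) // 2 + 1
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Three hypothetical classifiers looking at the same character image.
agreed = combine_by_vote(["O", "0", "O"])    # two of three agree
disputed = combine_by_vote(["l", "1", "I"])  # no majority, so the character is rejected

print(agreed)
print(disputed)
```

Rejecting when the classifiers disagree turns some potential misclassifications into rejections, which are flagged for manual correction instead of going undetected.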
The frontiers of research within character recognition have now moved towards the recognition of cursive script, that is, handwritten connected or calligraphic characters. Promising techniques within this area deal with the recognition of entire words instead of individual characters.
6.2 Future needs
Today optical character recognition is most successful for constrained material, that is, documents produced under some control. However, in the future it seems that the need for constrained OCR will decrease. The reason for this is that control of the production process usually means that the document is produced from material already stored on a computer. Hence, if a computer readable version is already available, data may be exchanged electronically or printed in a more computer readable form, for instance barcodes.
The applications for future OCR systems lie in the recognition of documents where control over the production process is impossible. This may be material where the recipient is cut off from an electronic version and has no control of the production process, or older material which at production time could not be generated electronically. This means that future OCR systems intended for reading printed text must be omnifont.
Another important area for OCR is the recognition of manually produced documents. Within postal applications, for instance, OCR must focus on reading addresses on mail produced by people without access to computer technology. Already, it is not unusual for companies and other senders with access to computer technology to mark mail with barcodes. The relative importance of handwritten text recognition is therefore expected to increase.
Chapter 7
Summary
Character recognition techniques associate a symbolic identity with the image of a character. Character recognition is commonly referred to as optical character recognition (OCR), as it deals with the recognition of optically processed characters. The modern version of OCR appeared in the mid-1940s with the development of digital computers. OCR machines have been commercially available since the mid-1950s. Today OCR systems are available both as hardware devices and software packages, and a few thousand systems are sold every week.
In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a preprocessor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. The feature extraction is critical and many different techniques exist, each having its strengths and weaknesses. After classification the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.
Optical character recognition has many different practical applications. The main areas where OCR has been of importance are text entry (office automation), data entry (banking environment) and process automation (mail sorting).
The present state of the art in OCR has moved from primitive schemes for limited character sets to the application of more sophisticated techniques for omnifont and handprint recognition. The main problems in OCR usually lie in the segmentation of degraded symbols which are joined or fragmented. Generally, the accuracy of an OCR system is directly dependent upon the quality of the input document. Three figures are used in ratings of OCR systems: the correct classification rate, the rejection rate and the error rate. The performance should be rated primarily by the system's error rate, as these errors go undetected by the system and must be located manually for correction.
In spite of the great number of algorithms that have been developed for character recognition, the problem is not yet solved satisfactorily, especially not in cases where there are no strict limitations on the handwriting or quality of print. Up to now, no recognition algorithm can compete with humans in quality. However, as an OCR machine is able to read much faster, it is still attractive.
In the future the need for recognition of constrained print is expected to decrease. Emphasis will then be on the recognition of unconstrained writing, like omnifont and handwriting. This is a challenge which requires improved recognition techniques. The potential for OCR algorithms seems to lie in the combination of different methods and the use of techniques that are able to utilize context to a much larger extent than current methodologies.