
2 The History of OCR

2.6 OCR today

Although OCR machines became commercially available as early as the 1950s, only a few thousand systems had been sold worldwide by 1986. The main reason for this was the cost of the systems. However, as hardware became cheaper and OCR systems became available as software packages, sales increased considerably. Today a few thousand systems are sold every week, and the cost of an omnifont OCR system has dropped by a factor of ten every other year for the last six years.

Table 1 : A short OCR chronology.

Chapter 3 Methods of OCR

The main principle in automatic pattern recognition is first to teach the machine which classes of patterns may occur and what they look like. In OCR the patterns are letters, numbers and some special symbols like commas, question marks etc., while the different classes correspond to the different characters. The machine is taught by showing it examples of characters of all the different classes. Based on these examples the machine builds a prototype or a description of each class of characters.

Then, during recognition, the unknown characters are compared to the previously obtained descriptions, and assigned the class that gives the best match.

In most commercial systems for character recognition, the training process has been performed in advance. Some systems do, however, include facilities for training in case new classes of characters need to be included.

3.1 Components of an OCR system

A typical OCR system consists of several components. A common setup is illustrated in figure 3. The first step in the process is to digitize the analog document using an optical scanner. When the regions containing text are located, each symbol is extracted through a segmentation process. The extracted symbols may then be preprocessed, eliminating noise, to facilitate the extraction of features in the next step.

Figure 3 : Components of an OCR-system: optical scanning, location and segmentation, preprocessing, feature extraction, recognition, post-processing.

The identity of each symbol is found by comparing the extracted features with descriptions of the symbol classes obtained through a previous learning phase. Finally, contextual information is used to reconstruct the words and numbers of the original text. In the next sections these steps and some of the methods involved are described in more detail.

3.1.1 Optical scanning.

Through the scanning process a digital image of the original document is captured. In OCR optical scanners are used, which generally consist of a transport mechanism plus a sensing device that converts light intensity into gray-levels. Printed documents usually consist of black print on a white background. Hence, when performing OCR, it is common practice to convert the multilevel image into a bilevel image of black and white. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort.

The thresholding process is important as the results of the subsequent recognition are totally dependent on the quality of the bilevel image. Still, the thresholding performed on the scanner is usually very simple. A fixed threshold is used, where gray-levels below this threshold are said to be black and levels above are said to be white. For a high-contrast document with uniform background, a prechosen fixed threshold can be sufficient. However, many documents encountered in practice have a rather large range in contrast. In these cases more sophisticated methods for thresholding are required to obtain a good result.

Figure 4 : Problems in thresholding: Top: Original greylevel image, Middle: Image thresholded with global method, Bottom: Image thresholded with an adaptive method.

The best methods for thresholding are usually those which are able to vary the threshold over the document, adapting to local properties such as contrast and brightness. However, such methods usually depend upon a multilevel scanning of the document which requires more memory and computational capacity. Therefore such techniques are seldom used in connection with OCR systems, although they result in better images.
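The difference between a fixed and a locally varying threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the window size, the offset and the rule "darker than the local mean by some margin" are choices made here, not a method prescribed by the text. Black is coded as 1 and white as 0.

```python
def global_threshold(image, threshold=128):
    """Fixed threshold: gray-levels below the threshold become black (1)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def adaptive_threshold(image, window=1, offset=20):
    """Local threshold (assumed rule): a pixel is black if it is darker than
    the mean of its (2*window+1) x (2*window+1) neighbourhood by `offset`."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                image[j][i]
                for j in range(max(0, y - window), min(height, y + window + 1))
                for i in range(max(0, x - window), min(width, x + window + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result
```

On a scan line whose background darkens from left to right, the fixed threshold turns the whole dark end black, while the local rule still isolates the strokes. The price is that the full gray-level image must be kept in memory, which is the cost mentioned above.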

3.1.2 Location and segmentation.

Segmentation is a process that determines the constituents of an image. It is necessary to locate the regions of the document where data have been printed and distinguish them from figures and graphics. For instance, when performing automatic mail-sorting, the address must be located and separated from other print on the envelope like stamps and company logos, prior to recognition.

Applied to text, segmentation is the isolation of characters or words. The majority of optical character recognition algorithms segment the words into isolated characters which are recognized individually. Usually this segmentation is performed by isolating each connected component, that is each connected black area. This technique is easy to implement, but problems occur if characters touch or if characters are fragmented and consist of several parts. The main problems in segmentation may be divided into four groups:

• Extraction of touching and fragmented characters.

Such distortions may lead to several joined characters being interpreted as one single character, or to a piece of a character being taken for an entire symbol. Joints will occur if the document is a dark photocopy or if it is scanned at a low threshold. Joints are also common if the fonts are serifed. The characters may be split if the document stems from a light photocopy or is scanned at a high threshold.

• Distinguishing noise from text.

Dots and accents may be mistaken for noise, and vice versa.

• Mistaking graphics or geometry for text.

This leads to nontext being sent to recognition.

• Mistaking text for graphics or geometry.

In this case the text will not be passed to the recognition stage. This often happens if characters are connected to graphics.

Figure 5 : Degraded symbols.
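Segmentation by connected components can be sketched as follows. The code is a minimal illustration, not a production algorithm: it assumes a binary image given as a list of rows with 1 for black, uses 4-connectivity (an assumption; 8-connectivity is equally common), and returns one bounding box per connected black area.

```python
def connected_components(bitmap):
    """Return sorted bounding boxes (left, top, right, bottom), one per
    4-connected black region of a binary image."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill from this unvisited black pixel.
            stack = [(y, x)]
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while stack:
                cy, cx = stack.pop()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)
```

The sketch also shows the weaknesses listed above: the dot of an "i" comes out as a component of its own, and two touching characters come out as a single box.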

3.1.3 Preprocessing

The image resulting from the scanning process may contain a certain amount of noise. Depending on the resolution of the scanner and the success of the applied technique for thresholding, the characters may be smeared or broken. Some of these defects, which may later cause poor recognition rates, can be eliminated by using a preprocessor to smooth the digitized characters.

The smoothing implies both filling and thinning. Filling eliminates small breaks, gaps and holes in the digitized characters, while thinning reduces the width of the line. The most common techniques for smoothing move a window across the binary image of the character, applying certain rules to the contents of the window.
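A window-based smoothing pass might look like the sketch below. The two rules used here (fill a white pixel with at least five black neighbours, delete a black pixel with no black neighbours) are assumptions chosen for illustration; real systems use their own rule sets, and thinning is not shown.

```python
def smooth(bitmap):
    """Apply two simple 3x3 window rules to a binary image (1 = black)."""
    height, width = len(bitmap), len(bitmap[0])

    def black_neighbours(y, x):
        # Count black pixels among the up to eight neighbours of (y, x).
        count = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy or dx) and 0 <= y + dy < height and 0 <= x + dx < width:
                    count += bitmap[y + dy][x + dx]
        return count

    result = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            n = black_neighbours(y, x)
            if bitmap[y][x] == 0 and n >= 5:
                result[y][x] = 1  # filling: close a small hole or gap
            elif bitmap[y][x] == 1 and n == 0:
                result[y][x] = 0  # remove an isolated speck of noise
    return result
```

All decisions are taken from the unmodified input image, so the result does not depend on the order in which the window visits the pixels.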

In addition to smoothing, preprocessing usually includes normalization. The normalization is applied to obtain characters of uniform size, slant and rotation. To be able to correct for rotation, the angle of rotation must be found. For rotated pages and lines of text, variants of the Hough transform are commonly used for detecting skew. However, finding the rotation angle of a single symbol is not possible until after the symbol has been recognized.

Figure 6 : Normalization and smoothing of a symbol.

3.1.4 Feature extraction

The objective of feature extraction is to capture the essential characteristics of the symbols, and it is generally accepted that this is one of the most difficult problems of pattern recognition. The most straightforward way of describing a character is by the actual raster image. Another approach is to extract certain features that still characterize the symbols, but leave out the unimportant attributes. The techniques for extraction of such features are often divided into three main groups, where the features are found from:

• The distribution of points.

• Transformations and series expansions.

• Structural analysis.

The different groups of features may be evaluated according to their sensitivity to noise and deformation and the ease of implementation and use. The results of such a comparison are shown in table 2. The criteria used in this evaluation are the following:

• Robustness.

1) Noise.

Sensitivity to disconnected line segments, bumps, gaps, filled loops etc.

2) Distortions.

Sensitivity to local variations like rounded corners, improper protrusions, dilations and shrinkage.

3) Style variation.

Sensitivity to variation in style like the use of different shapes to represent the same character or the use of serifs, slants etc.

4) Translation.

Sensitivity to movement of the whole character or its components.

5) Rotation.

Sensitivity to change in orientation of the characters.

• Practical use.

1) Speed of recognition.

2) Complexity of implementation.

3) Independence.

The need of supplementary techniques.

Each of the techniques evaluated in table 2 is described in the next sections.

Table 2 : Evaluation of feature extraction techniques (rating scale: high or easy, medium, low or difficult).

3.1.4.1 Template-matching and correlation techniques.

These techniques are different from the others in that no features are actually extracted. Instead the matrix containing the image of the input character is directly matched with a set of prototype characters representing each possible class. The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to the pattern.

The technique is simple and easy to implement in hardware and has been used in many commercial OCR machines. However, this technique is sensitive to noise and style variations and has no way of handling rotated characters.
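As a minimal sketch of template matching, the following code compares a binary character image pixel by pixel with one prototype per class. The distance used here, the number of differing pixels (Hamming distance), is an assumption; the text does not fix a particular distance. The image and all templates are assumed to have the same size.

```python
def hamming_distance(image, template):
    """Number of pixel positions where image and template differ."""
    return sum(
        pixel != proto
        for image_row, template_row in zip(image, template)
        for pixel, proto in zip(image_row, template_row)
    )


def match_template(image, templates):
    """Return the label of the prototype with the smallest distance."""
    return min(templates, key=lambda label: hamming_distance(image, templates[label]))
```

Because the comparison is made position by position, a character that is shifted, rotated or printed in another style quickly accumulates a large distance even to its own prototype, which is the sensitivity noted above.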

3.1.4.2 Feature based techniques.

In these methods, significant measurements are calculated and extracted from a character and compared to descriptions of the character classes obtained during a training phase.

The description that matches most closely provides recognition. The features are given as numbers in a feature vector, and this feature vector is used to represent the symbol.

Distribution of points.

This category covers techniques that extract features based on the statistical distribution of points. These features are usually tolerant to distortions and style variations. Some of the typical techniques within this area are listed below.

Zoning.

The rectangle circumscribing the character is divided into several overlapping, or non-overlapping, regions and the densities of black points within these regions are computed and used as features.
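Zoning with non-overlapping regions can be sketched as below. The grid size and the function name are illustrative assumptions, and the bitmap is assumed to be already cropped to the rectangle circumscribing the character and to be at least as large as the grid in each direction.

```python
def zoning_features(bitmap, zone_rows=2, zone_cols=2):
    """Density of black points (1s) in each cell of a zone_rows x zone_cols grid."""
    height, width = len(bitmap), len(bitmap[0])
    features = []
    for zr in range(zone_rows):
        y0, y1 = zr * height // zone_rows, (zr + 1) * height // zone_rows
        for zc in range(zone_cols):
            x0, x1 = zc * width // zone_cols, (zc + 1) * width // zone_cols
            black = sum(bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(black / ((y1 - y0) * (x1 - x0)))
    return features
```

For a 4 x 4 image of the letter "L" and a 2 x 2 grid this gives one density per quadrant, read row by row, and these four numbers form the feature vector.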

Moments.

The moments of black points about a chosen centre, for example the centre of gravity, or a chosen coordinate system, are used as features.
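A small sketch of moment features, assuming the centre of gravity is chosen as the centre. The central moment of order (p, q) is taken here as the sum of (x - cx)^p (y - cy)^q over all black points; the function name and the handling of an empty image are assumptions made for the example.

```python
def central_moment(bitmap, p, q):
    """Central moment of order (p, q) of the black points (1s) in a binary image."""
    points = [(x, y) for y, row in enumerate(bitmap) for x, value in enumerate(row) if value]
    if not points:
        return 0.0  # assumed convention for an empty image
    cx = sum(x for x, _ in points) / len(points)  # centre of gravity, x
    cy = sum(y for _, y in points) / len(points)  # centre of gravity, y
    return sum((x - cx) ** p * (y - cy) ** q for x, y in points)
```

Since the coordinates are measured relative to the centre of gravity, moving the character within the image leaves these values unchanged.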

Crossings and distances.

In the crossing technique features are found from the number of times the character shape is crossed by vectors along certain directions. This technique is often used by commercial systems because it can be performed at high speed and has low complexity.

When using the distance technique certain lengths along the vectors crossing the character shape are measured, for instance the length of the vectors within the boundary of the character.
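The crossing technique can be illustrated with horizontal and vertical scan lines. In this sketch a crossing is counted as a white-to-black transition along the line, and one scan line is used per row and per column; both are assumptions made to keep the example small.

```python
def crossings(line):
    """Count white-to-black transitions along one scan line (1 = black)."""
    count, previous = 0, 0
    for pixel in line:
        if pixel == 1 and previous == 0:
            count += 1
        previous = pixel
    return count


def crossing_features(bitmap):
    """Crossing counts for every row, followed by every column."""
    rows = [crossings(row) for row in bitmap]
    columns = [crossings(column) for column in zip(*bitmap)]
    return rows + columns
```

For a small "H", the top and bottom rows cross the shape twice and the middle row once. Only a single pass over the pixels is needed, which fits the remark about speed and low complexity.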

n-tuples.

The relative joint occurrence of black and white points (foreground and background) in certain specified orderings is used as features.

Characteristic loci.

For each point in the background of the character, vertical and horizontal vectors are generated. The number of times the line segments describing the character are intersected by these vectors is used as features.

Figure 7 : Zoning

Transformations and series expansions.

These techniques help to reduce the dimensionality of the feature vector and the extracted features can be made invariant to global deformations like translation and rotation. The transformations used may be Fourier, Walsh, Haar, Hadamard, Karhunen-Loeve, Hough, principal axis transform etc.

Figure 8 : Elliptical Fourier descriptors

Many of these transformations are based on the curve describing the contour of the characters. This means that these features are very sensitive to noise affecting the contour of the character, like unintended gaps in the contour. In table 2 these features are therefore characterized as having a low tolerance to noise. However, they are tolerant to noise affecting the inside of the character and to distortions.

Structural analysis.

During structural analysis, features that describe the geometric and topological structures of a symbol are extracted. By these features one attempts to describe the physical make-up of the character, and some of the commonly used features are strokes, bays, end-points, intersections between lines and loops. Compared to other techniques the structural analysis gives features with high tolerance to noise and style variations. However, the features are only moderately tolerant to rotation and translation. Unfortunately, the extraction of these features is not trivial, and to some extent still an area of research.

Figure 9 : Strokes extracted from the capital letters F, H and N.

3.1.5 Classification.

Classification is the process of identifying each character and assigning to it the correct character class. In the following sections two different approaches for classification in character recognition are discussed. First decision-theoretic recognition is treated. These methods are used when the description of the character can be numerically represented in a feature vector.

We may also have pattern characteristics derived from the physical structure of the character which are not as easily quantified. In these cases the relationship between the characteristics may be of importance when deciding on class membership. For instance, if we know that a character consists of one vertical and one horizontal stroke, it may be either an “L” or a “T”, and the relationship between the two strokes is needed to distinguish the characters. A structural approach is then needed.

3.1.5.1 Decision-theoretic methods.

The principal approaches to decision-theoretic recognition are minimum distance classifiers, statistical classifiers and neural networks. Each of these classification techniques is briefly described below.

Matching.

Matching covers the groups of techniques based on similarity measures, where the distance between the feature vector describing the extracted character and the description of each class is calculated. Different measures may be used, but the most common is the Euclidean distance. This minimum distance classifier works well when the classes are well separated, that is when the distance between the means is large compared to the spread of each class.

When the entire character is used as input to the classification, and no features are extracted (template-matching), a correlation approach is used. Here the distance between the character image and prototype images representing each character class is computed.
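A minimum distance classifier with the Euclidean distance reduces to a few lines. The sketch assumes that each class is described by a mean feature vector; the labels and numbers in the usage note are invented for illustration. `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_class(feature_vector, class_means):
    """Assign the class whose mean vector is closest in Euclidean distance."""
    return min(
        class_means,
        key=lambda label: math.dist(feature_vector, class_means[label]),
    )
```

With hypothetical class means `{"I": [0.1, 0.9], "O": [0.8, 0.8]}`, the vector `[0.2, 0.7]` lies closer to the mean of "I" and is assigned to that class. The spread of each class plays no role in the decision, which is why the method relies on well separated classes.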

Optimum statistical classifiers.

In statistical classification a probabilistic approach to recognition is applied. The idea is to use a classification scheme that is optimal in the sense that, on average, its use gives the lowest probability of making classification errors.

A classifier that minimizes the total average loss is called the Bayes’ classifier. Given an unknown symbol described by its feature vector, the probability that the symbol belongs to class c is computed for all classes c=1...N. The symbol is then assigned the class which gives the maximum probability.

For this scheme to be optimal, the probability density functions of the symbols of each class must be known, along with the probability of occurrence of each class. The latter is usually solved by assuming that all classes are equally probable. The density function is usually assumed to be normally distributed, and the closer this assumption is to reality, the closer the Bayes’ classifier comes to optimal behaviour.

The minimum distance classifier described above is specified completely by the mean vector of each class, and the Bayes classifier for Gaussian classes is specified completely by the mean vector and covariance matrix of each class. These parameters specifying the classifiers are obtained through a training process. During this process, training patterns of each class are used to compute these parameters, and descriptions of each class are obtained.
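The training and use of a Bayes classifier for Gaussian classes can be sketched as follows. Two simplifications are assumed here and are not part of the text: the covariance matrix is taken to be diagonal (one variance per feature), which avoids matrix inversion, and all classes are given equal prior probability, so the class with the largest likelihood is chosen. The small variance floor is also an assumption, added to avoid division by zero.

```python
import math


def train(samples_by_class):
    """Estimate a mean and a variance per feature for each class."""
    model = {}
    for label, samples in samples_by_class.items():
        n, dims = len(samples), len(samples[0])
        means = [sum(s[d] for s in samples) / n for d in range(dims)]
        variances = [
            max(sum((s[d] - means[d]) ** 2 for s in samples) / n, 1e-6)
            for d in range(dims)
        ]
        model[label] = (means, variances)
    return model


def log_likelihood(x, means, variances):
    """Log of a Gaussian density with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(x, model):
    """Equal priors assumed: pick the class with the largest likelihood."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))
```

Unlike the minimum distance classifier, this rule takes the spread of each class into account: a point may be assigned to a class with a more distant mean if that class is much more widely spread.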

Neural networks.

Recently, the use of neural networks to recognize characters (and other types of patterns) has resurfaced. A back-propagation network, for example, is composed of several layers of interconnected elements. A feature vector enters the network at the input layer. Each element of a layer computes a weighted sum of its inputs and transforms it into an output by a nonlinear function. During training the weights at each connection are adjusted until a desired output is obtained. A problem of neural networks in OCR may be their limited predictability and generality, while an advantage is their adaptive nature.
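The forward computation of such a layered network can be sketched as below. The sigmoid is assumed as the nonlinear function, bias terms are left out, and the weights in the usage note are arbitrary illustrative values; the training step that adjusts the weights is not shown.

```python
import math


def sigmoid(z):
    """Nonlinear function applied to each weighted sum (assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights):
    """Each element computes a weighted sum of its inputs, then the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)))
        for unit_weights in weights
    ]


def forward(feature_vector, layers):
    """Pass a feature vector through the layers; `layers` holds one weight
    matrix per layer, with one row of weights per element."""
    activation = feature_vector
    for weights in layers:
        activation = layer_output(activation, weights)
    return activation
```

A call such as `forward([0.5, 0.5], [[[1.0, -1.0], [2.0, 2.0]], [[1.0, 1.0]]])` passes a two-dimensional feature vector through a hidden layer of two elements and an output layer of one element.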

3.1.5.2 Structural Methods.

Within the area of structural recognition, syntactic methods are among the most prevalent approaches. Other techniques exist, but they are less general and will not be treated here.

Syntactic methods.

Measures of similarity based on relationships between structural components may be formulated by using grammatical concepts. The idea is that each class has its own grammar defining the composition of the character. A grammar may be represented as strings or trees, and the structural components extracted from an unknown character are matched against the grammars of each class. Suppose that we have two different character classes which can be generated by the two grammars G₁ and G₂, respectively. Given an unknown character, we say that it is more similar to the first class if it may be generated by the grammar G₁, but not by G₂.
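A toy version of this idea, under assumptions that are entirely hypothetical: the character is described as a string of stroke primitives, "d" for a short downward stroke and "r" for a short rightward stroke, and each class grammar is a regular grammar written as a regular expression. Real syntactic recognizers use richer primitives and grammars.

```python
import re

# Hypothetical grammars over the primitives "d" (down) and "r" (right).
GRAMMARS = {
    "L": re.compile(r"d+r+"),  # G1: S -> dS | dR, R -> rR | r
    "7": re.compile(r"r+d+"),  # G2: S -> rS | rD, D -> dD | d
}


def classes_generating(primitives):
    """Return the classes whose grammar can generate the primitive string."""
    return [
        label for label, grammar in GRAMMARS.items()
        if grammar.fullmatch(primitives)
    ]
```

The string "dddrr" can be generated by the first grammar but not by the second, so the unknown character is taken to be an "L". A string that neither grammar generates is rejected.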

3.1.6 Post processing.

Grouping.

The result of plain symbol recognition on a document is a set of individual symbols. However, these symbols in themselves usually do not contain enough information. Instead we would like to associate the individual symbols that belong to the same string with each other, making up words and numbers. The process of performing this association of symbols into strings is commonly referred to as grouping. The grouping of the symbols into strings is based on the symbols’ location in the document. Symbols that are found to be sufficiently close are grouped together.

For fonts with fixed pitch the process of grouping is fairly easy as the position of each character is known. For typeset characters the distance between characters is variable. However, the distance between words is usually significantly larger than the distance between characters, and grouping is therefore still possible. The real problems occur for handwritten characters or when the text is skewed.
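Grouping by distance can be sketched for a single text line. The representation of a symbol as (character, left edge, right edge) and the fixed gap limit are assumptions made for the example; a real system would derive the limit from the observed spacing and would also handle several lines.

```python
def group_symbols(symbols, max_gap):
    """Group recognized symbols on one text line into strings.

    symbols: list of (character, left, right) with horizontal pixel positions.
    A gap larger than max_gap between two neighbours starts a new string.
    """
    words, current, previous_right = [], "", None
    for char, left, right in sorted(symbols, key=lambda s: s[1]):
        if previous_right is not None and left - previous_right > max_gap:
            words.append(current)
            current = ""
        current += char
        previous_right = right
    if current:
        words.append(current)
    return words
```

With gaps of two pixels between characters and twelve pixels between words, a limit of five pixels separates the words cleanly. Skewed or handwritten text breaks the assumption that one fixed limit fits the whole line.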

Error-detection and correction.

Up until the grouping each character has been treated separately, and the context in which each character appears has usually not been exploited. However, in advanced optical text recognition problems, a system consisting only of single-character recognition will not be sufficient. Even the best recognition systems will not give 100 percent correct identification of all characters, but some of these errors may be detected or even corrected by the use of context.

There are two main approaches, where the first utilizes the possibility of sequences of characters appearing together. This may be expressed as rules on the syntax of the word, by saying for instance that after a period there should usually be a capital letter. Also, for different languages the probabilities of two or more characters appearing together in a sequence can be computed and may be utilized to detect errors. For instance, in the English language the probability of a “k” appearing after an “h” in a word is zero, and if such a combination is detected an error is assumed.
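Error detection from character pairs can be sketched as follows. The example counts two-character sequences in a word list and flags pairs that never occur; treating "never seen in the list" as "probability zero" is an assumption that only makes sense for a large and representative word list, and the tiny list used in the usage note is purely illustrative.

```python
from collections import Counter


def bigram_counts(words):
    """Count every pair of adjacent characters in a list of words."""
    counts = Counter()
    for word in words:
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def unseen_bigrams(word, counts):
    """Return the character pairs of `word` that never occur in the counts."""
    return [
        word[i:i + 2]
        for i in range(len(word) - 1)
        if counts[word[i:i + 2]] == 0
    ]
```

With counts taken from "the", "think", "this" and "that", the recognized string "thknk" is flagged because of the pairs "hk" and "kn", while "think" passes without remarks.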

Another approach is the use of dictionaries, which has proven to be the most efficient method for error detection and correction. Given a word, in which an error may be present, the word is looked up in the dictionary. If the word is not in the dictionary, an error has been detected, and may be corrected by changing the word into the most similar word.
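Dictionary lookup with correction can be sketched with the Python standard library. The notion of "most similar word" is implemented here with `difflib.get_close_matches`, which is a substitute chosen for brevity; the text does not specify a similarity measure, and the cutoff value is an assumption.

```python
import difflib


def correct_word(word, dictionary):
    """Return the word unchanged if it is in the dictionary, otherwise the
    most similar dictionary word, or the word itself if nothing is close."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word
```

A misread such as "rec0gnition" is not in the dictionary and is replaced by "recognition", while a string with no close match is left unchanged so that it can be flagged for manual inspection.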

Probabilities obtained from the classification may help to identify the character which has been erroneously classified. If the word is present in the dictionary, this does unfortunately
