

Chapter 2 Related Work

2.2 Back-Propagation Network

The back-propagation (BP) algorithm was proposed in 1986 by Werbos et al. [], and is based on the gradient steepest descent method, updating the weights so as to minimize the total square error of the output. The BP algorithm has been widely used in a diversity of applications with supervised learning. To clearly explain the BP algorithm, an example is given in Fig. 2.3: a neural network with I input nodes, J output nodes, and K hidden nodes. Let the inputs and outputs be xi and yj, where i=1,2,…,I and j=1,2,…,J, respectively. For the hidden layer, the k-th hidden node, k=1,2,…,K, receives information from the input layer and sends out hk to the output layer.

These three layers are connected by two sets of weights, vik and wkj. The weight vik connects the i-th input node to the k-th hidden node, while the weight wkj connects the k-th hidden node to the j-th output node.

Fig. 2.3 Neural network with one hidden layer.


Based on the neural network in Fig. 2.3, the BP algorithm for supervised learning generally proceeds in nine steps as below:

Step 1: Set the maximum tolerable error Emax, and then set the learning rate η between 0.1 and 1.0 to reduce the computing time or increase the precision.

Step 2: Set the initial weights and biases of the network randomly.

Step 3: Input the training data $\mathbf{x} = [x_1\; x_2\; \cdots\; x_I]^T$ and the desired output $\mathbf{d} = [d_1\; d_2\; \cdots\; d_J]^T$.

Step 4: Calculate the output of each of the K neurons in the hidden layer

$$h_k = f_h\!\left(\sum_{i=1}^{I} v_{ik}\, x_i + \theta_k\right), \quad k = 1, 2, \ldots, K$$

where $f_h(\cdot)$ is the activation function, and then the output of each of the J neurons in the output layer

$$y_j = f_y\!\left(\sum_{k=1}^{K} w_{kj}\, h_k + \theta_j\right), \quad j = 1, 2, \ldots, J$$

where $f_y(\cdot)$ is the activation function and $\theta_k$, $\theta_j$ are the biases.

Step 5: Calculate the following error function

$$E = \frac{1}{2} \sum_{j=1}^{J} \left(d_j - y_j\right)^2$$

where $d_j$ is the j-th desired output.

Step 6: According to the gradient descent method, determine the correction of the weights as below:

$$\Delta w_{kj} = -\eta\, \frac{\partial E}{\partial w_{kj}}, \qquad \Delta v_{ik} = -\eta\, \frac{\partial E}{\partial v_{ik}}$$

Step 7: Update the weights:

$$w_{kj} \leftarrow w_{kj} + \Delta w_{kj}, \qquad v_{ik} \leftarrow v_{ik} + \Delta v_{ik}$$

Step 8: Check the next training data. If it exists, then go to Step 3, otherwise, go to Step 9.

Step 9: Check whether the network converges or not. If E < Emax, terminate the training process; otherwise, begin another learning cycle by going to Step 1.

The maximum tolerable error Emax is the bound against which the error function is checked, and the learning rate η is the parameter that controls the speed of the weight correction. The BP learning algorithm can be used to model various complicated nonlinear functions, and it has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering. In this thesis, the BP algorithm is used to learn the input-output relationship for the clustering problem.
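To make the steps above concrete, the following is a minimal sketch of the BP learning cycle in Python with NumPy. The layer sizes, the logistic activations for both layers, and the training sample are illustrative assumptions, and the biases of Step 2 are omitted for brevity; this is a sketch of the generic algorithm, not the exact configuration used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: I inputs, K hidden, J outputs.
I, K, J = 3, 4, 2
rng = np.random.default_rng(0)
v = rng.uniform(-0.5, 0.5, (I, K))    # Step 2: random input-to-hidden weights v_ik
w = rng.uniform(-0.5, 0.5, (K, J))    # Step 2: random hidden-to-output weights w_kj
eta, E_max = 0.5, 1e-3                # Step 1: learning rate and tolerable error

x = np.array([0.1, 0.8, 0.3])         # Step 3: one training sample
d = np.array([1.0, 0.0])              # Step 3: desired output

for cycle in range(10000):
    h = sigmoid(x @ v)                # Step 4: hidden outputs h_k
    y = sigmoid(h @ w)                # Step 4: network outputs y_j
    E = 0.5 * np.sum((d - y) ** 2)    # Step 5: error function
    if E < E_max:                     # Step 9: convergence check
        break
    delta_y = (d - y) * y * (1 - y)           # Step 6: output-layer correction term
    delta_h = (delta_y @ w.T) * h * (1 - h)   # Step 6: back-propagated hidden term
    w += eta * np.outer(h, delta_y)           # Step 7: update the weights
    v += eta * np.outer(x, delta_h)
```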


2.3 Foreground Segmentation

Moving objects are often the parts of interest in a real-time detection system: a motion detection method that reliably identifies the moving objects in a picture greatly helps the subsequent classification or tracking, so a good motion detection method can provide more accurate information for the follow-up processing. There are three common approaches: background subtraction [7], optical flow [8], and frame difference [9].

Background subtraction is the most common method for segmenting the interesting regions in videos. This method first has to build an initial background model. The purpose of the background model is to subtract the background image from the current image to obtain the interesting foreground regions. Background subtraction can detect the most complete set of feature points of the interesting foreground regions and can be implemented in real time.

Optical flow reflects the image changes due to motion during a time interval, and the optical flow field is the velocity field that represents the three-dimensional motion of foreground points across a two-dimensional image. Compared with the other two methods, optical flow can detect the interesting foreground region more accurately. However, optical flow computations are very intensive and difficult to realize in real time.

The frame difference method performs pixel-based subtraction on successive frames. Its rationale is that the background is consistent across continuous images, so the changed regions can be segmented by image subtraction:

$$I_{sub}(x,y) = \left| I_t(x,y) - I_{t-1}(x,y) \right|$$

where $I_{sub}$ is the image subtraction matrix and $I_t$ and $I_{t-1}$ represent the RGB values at times t and t-1. An image subtraction threshold (Subth) is then set for change detection: when $I_{sub}(x,y)$ is larger than this threshold, the pixel can be regarded as a dynamic pixel; otherwise it is identified as background. Compared with the other two methods, frame difference requires lower computation. Considering the environment, the processing time, and other factors, this study uses the frame difference as the foreground capture method.
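As a minimal sketch of the frame difference described above, the function below computes Isub for a pair of RGB frames and applies the threshold Subth; the default threshold value and the use of the per-channel maximum are illustrative assumptions.

```python
import numpy as np

def frame_difference(frame_t, frame_t_1, sub_th=30):
    """Pixel-based subtraction of successive frames.

    frame_t, frame_t_1: H x W x 3 uint8 RGB images at times t and t-1.
    Returns a boolean mask where True marks dynamic (foreground) pixels.
    """
    # |I_t - I_(t-1)| per channel; int16 avoids uint8 wrap-around
    diff = np.abs(frame_t.astype(np.int16) - frame_t_1.astype(np.int16))
    i_sub = diff.max(axis=2)     # largest change over the RGB channels
    return i_sub > sub_th        # threshold Sub_th for change detection
```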

2.4 Introduction to Morphology Operation

Morphology has two simple functions, dilation and erosion [10]. Dilation is defined as:

$$A \oplus B = \left\{\, x : (\hat{B})_x \cap A \neq \varnothing \,\right\} \tag{2.13}$$

where A and B are sets in $Z^2$. This equation simply means that B is moved over A and the intersection of B, reflected and translated, with A is found. Usually A will be the signal or image being operated on and B will be the structuring element. Fig. 2.4 shows how dilation works.

The opposite of dilation is known as erosion. This is defined as:

$$A \ominus B = \left\{\, x : (B)_x \subseteq A \,\right\} \tag{2.14}$$

which simply says that the erosion of A by B is the set of points x such that B, translated by x, is contained in A. Fig. 2.5 shows how erosion works, in much the same way as dilation. However, equation (2.14) essentially says that for the output to be a one, all of the inputs must be the same as the structuring element. Thus, erosion will remove runs of ones that are shorter than the structuring element. This thesis applies both of these operations to process the image.
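A from-scratch sketch of (2.13) and (2.14) on binary images is given below; it follows the set definitions directly. Note that np.roll wraps around the image border, so the sketch assumes the objects keep a margin from the edges; a production implementation would typically use a library routine instead.

```python
import numpy as np

def dilate(A, B):
    """Dilation (2.13): x such that B, reflected and translated to x, hits A."""
    out = np.zeros_like(A)
    offsets = np.argwhere(B) - np.array(B.shape) // 2   # 1-pixels of B, centered
    for dy, dx in offsets:
        out |= np.roll(np.roll(A, dy, axis=0), dx, axis=1)
    return out

def erode(A, B):
    """Erosion (2.14): x such that B translated by x is contained in A."""
    out = np.ones_like(A)
    offsets = np.argwhere(B) - np.array(B.shape) // 2
    for dy, dx in offsets:
        out &= np.roll(np.roll(A, -dy, axis=0), -dx, axis=1)
    return out
```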


Fig. 2.4 Example of dilation.

Fig. 2.5 Example of erosion.


2.5 Color Detection

Color is an important source of information in human visual perception. Detecting and tracking human faces and gestures are popular research topics in this area. Color detection has been applied to a variety of tasks: we can choose the color of interest and use it as a filter for web image contents, for example using skin color to detect and track human faces and gestures, or to diagnose diseases [11],[12],[13].

As the first task in both the detection of a moving object in a special color and the character extraction technique of our schemes, color detection can greatly reduce the computational cost [14] before the potential object regions and characters are extracted. Furthermore, color image segmentation is computationally fast while being relatively robust to changes in scale, viewpoint, and complex backgrounds.

According to the characteristics of the target's distribution in a color space, the color of a pixel can be detected quickly by the target's color model. However, the use of different color spaces for different races and different illuminations often results in different detection accuracies [15]. In this thesis, the experimental environment is our laboratory and the lighting condition is fixed.

Usually, color detection involves two aspects: selecting the color space and using the color distribution to establish a good color model. Nowadays the main color spaces include RGB, HSV, HSI, YCbCr, and some of their variants, while RGB is the foundational way to represent color.
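As a minimal illustration of a color model built on a color space choice, the sketch below converts RGB to HSV and thresholds an assumed green hue range; all bounds are illustrative and would be tuned to the lighting condition, and this is not the neural-network color model used later in this thesis.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv   # RGB in [0,1] -> HSV in [0,1]

def green_mask(rgb_u8, h_lo=0.22, h_hi=0.45, s_min=0.35, v_min=0.2):
    """Mark pixels whose HSV values fall in an assumed green range.

    rgb_u8: H x W x 3 uint8 image; all bounds are illustrative assumptions.
    """
    hsv = rgb_to_hsv(rgb_u8.astype(np.float64) / 255.0)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h >= h_lo) & (h <= h_hi) & (s >= s_min) & (v >= v_min)
```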


2.6 Character Recognition

Character recognition applications can be divided into two categories: Optical Character Recognition (OCR) and On-Line Character Recognition (OLCR).

The OLCR uses a handwriting board or a digital pen as the input tool to get the characters and then performs the character recognition. Different from the OLCR, the OCR uses a scanner to scan a document, saves it as an image file, and then identifies the characters in the image file. This thesis adopts the OCR approach for character recognition, since the characters to be recognized are attained from a sequence of images.

The basic flow chart of the character recognition is shown in Fig. 2.6. In general, the pre-processing of an image covers the object location, size normalization, binarization, angle of rotation, tilt, etc. There are two main parts in the character recognition system, feature extraction and feature classification, and they determine the speed and accuracy of the text recognition. For these steps, many methods have been proposed; they can be divided into three types: the statistical method, the structural method, and the merged statistical and structural method.

The main part of the statistical method is to measure the composition of some particular physical quantities in the image; it usually extracts text features or characteristics and classifies them by matching the patterns in a built-in database. Basically, the structural method instead analyzes the structure of the characters: a word is split into several parts and compared with the built-in database in order to determine the most similar result. In general, the structural method can tolerate the variability of the character itself, but its reaction to the interference of noise is unstable. An example is the template matching method.

The merged statistical and structural method combines the advantages of the two approaches.

This thesis uses the merged statistical and structural method, and the feature extraction mainly refers to Torres-Mendez, L.A., "Translation, Rotation, and Scale-Invariant Object Recognition" [16]. That paper presents a method for object recognition that achieves excellent invariance under translation, rotation, and scaling. In the feature extraction, it takes into account the invariant properties of the normalized moment of inertia [17] and a novel coding that extracts topological object characteristics [18].

The feature extraction is based on a set of concentric circles, which are naturally and perfectly invariant to rotation (in 2-D). Fig. 2.7(a) shows an example with 8 concentric circles. Each circle is cut into some arcs by the character. Heuristically, the number of arcs of the i-th circle outside the character can be used as the first feature, denoted as Mi. This simple coding scheme extracts the topological characteristics of the object regardless of its position, orientation, and size. However, in some cases, two different objects could have the same or very similar Mi values (for example, the digits 2 and 5). For the second feature, we take into account the difference of the two largest arcs of each circle outside the object and normalize the difference by the circumference, denoted as

$$Dr_i = \frac{d_{i2} - d_{i1}}{2\pi r_i} \tag{2.15}$$

for the i-th circle. Fig. 2.7(b) shows d31 and d32 of the third circle as an example.

Fig. 2.7 Example of feature extraction. (a) Example with 8 concentric circles. (b) d31 and d32 of the third circle.
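A sketch of computing both features is given below: each concentric circle is sampled, the arcs outside the character are counted to give Mi, and the normalized difference of the two largest outside arcs gives Dri as in (2.15). The centroid-based center, the sampling resolution, and the handling of circles with fewer than two outside arcs are illustrative assumptions.

```python
import numpy as np

def circular_runs(mask):
    """Lengths of runs of True in a circular boolean array."""
    if mask.all():
        return np.array([mask.size])
    if not mask.any():
        return np.array([], dtype=int)
    m = np.roll(mask, -int(np.argmin(mask)))       # rotate so m[0] is False
    d = np.diff(np.r_[0, m.astype(int), 0])
    return np.where(d == -1)[0] - np.where(d == 1)[0]

def circle_features(char_img, n_circles=8, n_samples=360):
    """char_img: binary array with 1 on the character. Returns (M, Dr)."""
    ys, xs = np.nonzero(char_img)
    cy, cx = ys.mean(), xs.mean()                  # centroid taken as the center
    r_max = np.hypot(ys - cy, xs - cx).max()
    t = np.linspace(0.0, 2 * np.pi, n_samples, endpoint=False)
    M, Dr = [], []
    for i in range(1, n_circles + 1):
        r = r_max * i / n_circles
        py = np.clip((cy + r * np.sin(t)).round().astype(int), 0, char_img.shape[0] - 1)
        px = np.clip((cx + r * np.cos(t)).round().astype(int), 0, char_img.shape[1] - 1)
        arcs = np.sort(circular_runs(char_img[py, px] == 0))[::-1]
        M.append(len(arcs))                        # M_i: arcs outside the character
        # (2.15): difference of the two largest arcs over the circumference
        # (arc lengths are in samples, so n_samples plays the role of 2*pi*r_i)
        Dr.append((arcs[0] - arcs[1]) / n_samples if len(arcs) >= 2 else 0.0)
    return np.array(M), np.array(Dr)
```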


Chapter 3 Intelligent Word Card Image Recognition System

The intelligent recognition system is implemented in several steps, as shown in Fig. 3.1, including potential object localization, character extraction, and character recognition. Each step adopts some schemes of image processing; for example, image subtraction, morphology operations, and connected components labeling (CCL) are used in the first step to extract moving objects. Different from conventional image processing, this thesis adopts intelligent neural networks, trained with supervised learning, to accomplish part of the schemes: the detection of a moving object in a special color, color extraction, and object recognition.

Fig. 3.1 The flow chart of the intelligent system.


3.1 Detection of Moving Object in Special Color

This is the first part of the potential object localization. Usually, there are two fundamental steps to detect a moving object of a special color: moving object detection and color extraction. The two steps are often processed separately, but this thesis presents a scheme based on an artificial neural network to extract the color and detect the moving object simultaneously.

To detect a moving object from a sequence of images, the algorithm proceeds as below, where Ic represents the special color in the image (green is chosen as an example) and Imosc is the resulting moving object in the special color, shown in Fig. 3.2(d).

Fig. 3.2 (a) Input image (T = t-1). (b) Input image (T = t). (c) Moving object detection. (d) Detection of the moving object in the special color (Imosc).

In supervised learning, training data are required, as shown in Fig. 3.2(d). The RGB information is learned by the neural network structure in Fig. 3.3 based on the back-propagation. After learning, a moving object of the special color can be distinguished from the background according to the output value of the neural network. Usually, a pixel of the moving object of the special color has an output value near 1, while a pixel in the background has an output value near 0. To efficiently extract the moving object of the special color in an image, a threshold value should be carefully selected, with the lighting condition of the environment properly controlled.

Fig. 3.3 MCNN’s structure.


The neural network MCNN that extracts a moving object of a special color is shown in Fig. 3.3. It is composed of one input layer with 6 neurons, one hidden layer with 7 neurons, and one output layer with 1 neuron. The RGB values are sent into the 6 neurons of the input layer, represented by MC(p), where p=1,2,3 for frame t-1 and p=4,5,6 for frame t. The p-th input neuron is connected to the q-th neuron, q=1,2,…,7, of the hidden layer with weighting WMC1(p,q), which is a weighting array of dimension 6×7. Besides, the q-th neuron of the hidden layer has an extra bias bMC1(q). Finally, the q-th neuron of the hidden layer is connected to the output neuron with weighting WMC2(q), q=1,2,…,7, and a bias bMC2 is added to the output neuron.

Let the activation function of the hidden layer be the hyperbolic tangent sigmoid transfer function; then the q-th hidden neuron output OMC1(q) is expressed as:

$$O_{MC1}(q) = \tanh\!\left(\sum_{p=1}^{6} W_{MC1}(p,q)\, MC(p) + b_{MC1}(q)\right), \quad q = 1, 2, \ldots, 7$$

Let the activation function of the output layer be the log-sigmoid transfer function; then the single output neuron OMC2 is expressed as:

$$O_{MC2} = \frac{1}{1 + \exp\!\left(-\left(\sum_{q=1}^{7} W_{MC2}(q)\, O_{MC1}(q) + b_{MC2}\right)\right)}$$

The above operations are shown in Fig. 3.4.


Fig. 3.4 Operations of the MCNN.
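Following the reconstructed expressions above, a forward-pass sketch of the MCNN applied to every pixel of a frame pair could look as follows; the weights shown are untrained placeholders just to fix the shapes, and the 0.5 threshold is an assumed value.

```python
import numpy as np

def mcnn_forward(rgb_t_1, rgb_t, W1, b1, W2, b2):
    """Per-pixel MCNN output for a pair of frames.

    rgb_t_1, rgb_t: H x W x 3 arrays (frames t-1 and t), scaled to [0, 1].
    W1: 6 x 7 hidden weights, b1: 7 biases, W2: 7 output weights, b2: scalar.
    Returns an H x W array in (0, 1); values near 1 mark the moving object
    of the special color.
    """
    mc = np.concatenate([rgb_t_1, rgb_t], axis=2)      # MC(1..6) per pixel
    hidden = np.tanh(mc @ W1 + b1)                     # tansig hidden layer, O_MC1
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # logsig output, O_MC2

# Untrained placeholder parameters, only to demonstrate the shapes:
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(6, 7)), rng.normal(size=7)
W2, b2 = rng.normal(size=7), 0.0
mask = mcnn_forward(np.zeros((4, 4, 3)), np.ones((4, 4, 3)), W1, b1, W2, b2) > 0.5
```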

3.2 Morphology Operation

Three operations are used in this section: erosion, dilation, and hole filling.

3.2.1 Erosion and Dilation

After applying color extraction, the color regions are extracted from the original image, but some noise still exists therein. One of the conventional ways to eliminate the noise regions is to use morphology operations. In this thesis, the noise is eliminated by the morphology erosion operation (2.14), expressed as

$$A \ominus B = \left\{\, x : (B)_x \subseteq A \,\right\} \tag{3.7}$$

where B is a disk-shaped structuring element with radius 4, as shown in Fig. 3.5(a); the noise regions in image A smaller than B are erased by the operation. However, some gaps may also be generated in isolated regions after erosion. In order to repair these gaps, the morphology dilation operation (2.13) is further employed, expressed as

$$A \oplus C = \left\{\, x : (\hat{C})_x \cap A \neq \varnothing \,\right\} \tag{3.8}$$

where C is a disk-shaped structuring element with radius 6, as shown in Fig. 3.5(b); the gaps in image A are repaired by the operation. Fig. 3.6 shows an example of erosion and dilation using the structuring elements B and C.


Fig. 3.5 (a) Structuring element B. (b) Structuring element C.

Fig. 3.6 Steps for morphology operations. (a) Initial image. (b) Result of erosion using structuring element B. (c) Result of dilation using structuring element C.



3.2.2 Hole Filling

In the current application, this is appropriately called conditional dilation or inside connected components. A hole may be defined as a background region surrounded by a connected border of foreground pixels. In this section, we develop an algorithm based on set dilation, complementation, and intersection for filling holes in an image. Let A denote a set whose elements are 8-connected boundaries as in Fig. 3.7(a), each boundary enclosing a background region. Given a point in each hole, the objective is to fill all the holes with 1s.

We begin by forming an array X0 of 0s, except at the locations in X0 corresponding to the given point in each hole, which we set to 1. Then, the following procedure fills all holes with 1s:

$$X_k = (X_{k-1} \oplus B) \cap A^c, \quad k = 1, 2, 3, \ldots \tag{3.9}$$

where B is the symmetric structuring element in Fig. 3.7(c). The algorithm terminates at iteration step k if Xk = Xk-1; the set Xk then contains all the filled holes. The set union of Xk and A contains all the filled holes and their boundaries. Fig. 3.8(a) shows an example image, and Fig. 3.8(b) shows the result of the hole filling.

Fig. 3.7 (a) Set A. (b) Complement Ac. (c) Structuring element B.


Fig. 3.8 (a) Original frame. (b) Result of hole filling.
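A direct sketch of iteration (3.9) is given below, with the cross-shaped (4-connected) structuring element of Fig. 3.7(c) assumed and the seed points given as coordinates; since np.roll wraps at the border, the boundaries are assumed not to touch the image edge.

```python
import numpy as np

def fill_holes(A, seeds):
    """Iterate X_k = (X_{k-1} dilated by B) intersect A^c until X_k = X_{k-1} (3.9).

    A: binary boundary image; seeds: list of (row, col), one point per hole.
    Returns the union of the filled holes and the boundaries A.
    """
    Ac = ~A.astype(bool)
    X = np.zeros_like(Ac)
    for r, c in seeds:
        X[r, c] = True                       # X_0: 1s only at the given points
    while True:
        # dilation by the cross-shaped structuring element B
        grown = (X
                 | np.roll(X, 1, axis=0) | np.roll(X, -1, axis=0)
                 | np.roll(X, 1, axis=1) | np.roll(X, -1, axis=1))
        X_next = grown & Ac                  # intersect with the complement of A
        if np.array_equal(X_next, X):        # terminate when X_k = X_{k-1}
            return X_next | A.astype(bool)   # filled holes plus their boundaries
        X = X_next
```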

3.3 Connected Components Labeling

After the morphology operations, different components are identified by using Connected Components Labeling (CCL), which is often used in computer vision to detect 4-connected or 8-connected regions in a binary digital image. In this thesis, the 4-pixel connected component will be used to label the potential object regions.

Fig. 3.9 Scanning the image.



The 4-pixel connected CCL algorithm can be partitioned into two processes, labeling and componentizing. During the labeling, the image is scanned pixel by pixel, from left to right and top to bottom, as shown in Fig. 3.9, where p is the pixel being processed, and r and t are respectively the upper and left pixels of p. Define v(·) and l(·) as the binary value and the label of a pixel. If v(p)=0, then move on to the next pixel; otherwise, i.e., v(p)=1, the label l(p) is determined by the following rules:

R1. For v(r)=0 and v(t)=0, assign a new label to l(p).

R2. For v(r)=1 and v(t)=0, assign l(r) to l(p), i.e., l(p)=l(r).

R3. For v(r)=0 and v(t)=1, assign l(t) to l(p), i.e., l(p)=l(t).

R4. For v(r)=1, v(t)=1 and l(t)=l(r), then assign l(r) to l(p), i.e., l(p)=l(r).

R5. For v(r)=1, v(t)=1 and l(t)≠l(r), then assign l(r) to both l(p) and l(t), i.e., l(p)=l(r) and l(t)= l(r).

For example, after the labeling process, Fig. 3.10(a) is changed into Fig. 3.10(b). It is clear that some connected components contain pixels with different labels. Hence, it is required to further execute the process of componentizing, which sorts all the pixels connected in one component and assigns them the same label, the smallest number among the labels in that component. Fig. 3.10(c) is the result of Fig. 3.10(b) after componentizing.
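A compact two-pass sketch of the 4-connected CCL following rules R1-R5 is shown below; the equivalence table that resolves R5 merges is a simple parent map (union-find without path compression), which is an implementation choice rather than part of the algorithm described above.

```python
import numpy as np

def ccl_4(img):
    """Two-pass 4-connected CCL on a binary image (nonzero = foreground)."""
    H, W = img.shape
    labels = np.zeros((H, W), dtype=int)
    parent = {}                                  # label equivalences from R5

    def find(a):                                 # follow merges to the root label
        while parent[a] != a:
            a = parent[a]
        return a

    next_label = 1
    for y in range(H):                           # first pass: labeling
        for x in range(W):
            if not img[y, x]:
                continue
            r = labels[y - 1, x] if y > 0 else 0     # upper neighbor r
            t = labels[y, x - 1] if x > 0 else 0     # left neighbor t
            if r == 0 and t == 0:                    # R1: assign a new label
                parent[next_label] = next_label
                labels[y, x] = next_label
                next_label += 1
            elif r != 0 and t == 0:                  # R2: inherit from r
                labels[y, x] = r
            elif r == 0 and t != 0:                  # R3: inherit from t
                labels[y, x] = t
            else:                                    # R4 / R5: both labeled
                labels[y, x] = r
                if r != t:                           # R5: record the equivalence
                    parent[find(max(r, t))] = find(min(r, t))
    for y in range(H):                           # second pass: componentizing
        for x in range(W):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels
```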


Fig. 3.10 Example of 4-pixel connected CCL. (a) Digital image. (b) Labeling. (c) Componentizing.



3.4 Character Extraction from Potential Region

There are two main parts in this section: color extraction and character extraction.

3.4.1 Color Extraction

This system performs color extraction based on an artificial neural network. In supervised learning, training data of the colors are required and are obtained from images composed of object and background. Examples of training data are shown in Fig. 3.11, where each original image is separated into an object region and the background. The RGB information is learned by the neural network structure in Fig. 3.12 based on the back-propagation. After learning, the pixels of green color can be distinguished from the background according to the output value of the neural network. Usually, a pixel of the target color has an output value near 1, while a pixel in the background has an output value near 0. To efficiently extract the color in an image, a threshold value should be carefully selected under the given lighting condition.

Fig. 3.11 Examples of training data for color extraction. (a) Original image. (b) Color region. (c) Background.


Fig. 3.12 Neural network structure for color extraction.

The green color extraction neural network (GNN), based on the structure in Fig. 3.12, includes one input layer with 3 neurons, one hidden layer with 5 neurons, and one output layer with 1 neuron. The RGB values are sent into the 3 neurons of the input layer, represented by G(p), p=1,2,3, correspondingly. The p-th input neuron is connected to the q-th neuron, q=1,2,…,5, of the hidden layer with weighting WG1(p,q). Hence, there exists a weighting array WG1(p,q) of dimension 3×5. Besides, the q-th neuron of the hidden layer has an extra bias bG1(q). Finally, the q-th neuron of the hidden layer is connected to the output neuron with weighting WG2(q), q=1,2,…,5, and a bias bG2 is added to the output neuron.

Let the activation function of the hidden layer be the hyperbolic tangent sigmoid transfer function; then the q-th hidden neuron output OG1(q) is expressed as:

$$O_{G1}(q) = \tanh\!\left(\sum_{p=1}^{3} W_{G1}(p,q)\, G(p) + b_{G1}(q)\right), \quad q = 1, 2, \ldots, 5$$


Let the activation function of the output layer be the log-sigmoid transfer function; then the single output neuron OG2 is expressed as:

$$O_{G2} = \frac{1}{1 + \exp\!\left(-\left(\sum_{q=1}^{5} W_{G2}(q)\, O_{G1}(q) + b_{G2}\right)\right)}$$

The above operations are shown in Fig. 3.13.
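Since the GNN has the same tansig/logsig structure as the MCNN of Section 3.1, only with per-frame RGB inputs, its pixel classification can be sketched as below; the trained weights and the 0.5 threshold are assumed placeholders.

```python
import numpy as np

def gnn_mask(rgb, W1, b1, W2, b2, threshold=0.5):
    """Classify each pixel as the target color (True) or background.

    rgb: H x W x 3 array in [0, 1]; W1: 3 x 5, b1: 5, W2: 5, b2: scalar.
    The threshold is an assumed value, tuned to the lighting condition.
    """
    hidden = np.tanh(rgb @ W1 + b1)                    # O_G1, tansig hidden layer
    out = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))    # O_G2, logsig output layer
    return out > threshold
```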

3.4.2 Character Extraction

After the green color of the word card is detected, the word card can be extracted. Three steps are taken in this section to extract the character from the word card. First, since some noise still exists in the extracted region, erosion (3.7) with a structuring element of radius 2 and dilation (3.8) with a structuring element of radius 3 are applied to reduce the noise, giving the frame shown in Fig. 3.14(a).

The word card is thus obtained as a binary image with the shape of the character on it, as shown in Fig. 3.14(a). The character is then extracted by using hole filling and image subtraction. First, a plain binary card is generated by hole filling, so that the holes of the word card are filled as shown in Fig. 3.14(b). Then the character on the word card can be extracted by subtracting the original binary card from the plain binary card. The result is shown in Fig. 3.14(c).

Fig. 3.14 (a) Result after noise reduction. (b) Image with holes filled. (c) Result of image subtraction.
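The three-step extraction can be summarized as a short pipeline, reusing the dilate/erode sketch of Section 2.4 and the fill_holes sketch of Section 3.2.2; the disk helper and the seed points are assumptions of this sketch.

```python
import numpy as np

def disk(radius):
    """Disk-shaped structuring element with the given radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def extract_character(card_mask, seeds):
    """card_mask: binary word card with the character appearing as holes."""
    clean = dilate(erode(card_mask, disk(2)), disk(3))  # noise reduction, Fig. 3.14(a)
    plain = fill_holes(clean, seeds)                    # plain binary card, Fig. 3.14(b)
    return plain & ~clean                               # subtraction leaves the character
```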

3.5 Character Recognition

Two parts are covered in this section: feature extraction and classification.

3.5.1 Feature Extraction

After object extraction, the character recognition is executed as the following step, whose performance directly affects the overall accuracy rate. There are two main parts of the character recognition: one is feature extraction and the other is feature classification.
