

Chapter 1 Introduction

1.2 System Overview

For the hardware architecture, the system shown in Fig. 1.1 is established by setting two cameras on a horizontal line with their lines of sight parallel and fixed. The distance between the two cameras is set to a constant 10 cm. The experimental environment for testing is our laboratory, and the maximum depth of the background is 400 cm. The two cameras, QuickCam™ Communicate Deluxe, have the specifications listed below:

1.3-megapixel sensor with RightLight™2 Technology

Built-in microphone with RightSound™ Technology

Video capture: Up to 1280 x 1024 pixels (HD quality) (HD Video 960 x 720 pixels)

Frame rate: Up to 30 frames per second

Still image capture: 5 megapixels (with software enhancement)

USB 2.0 certified

Optics: Manual focus


Fig. 1.1 The humanoid vision system.

For the software architecture, Fig. 1.2 shows the flow chart of the proposed system. The system is implemented in three steps: potential object localization, character extraction and character recognition. In the first step, the moving object, a word card of a special color, must be detected and its location in the image determined. A supervised learning neural network (MCNN) is used to extract the color and detect the moving word card simultaneously. After applying the MCNN, the green region of the word card is extracted from a sequence of images; unfortunately, some noise exists therein. Using morphology operations and connected components labeling (CCL), the noise is removed and the region of the word card can be located correctly.

In the second step, another supervised learning neural network (GNN) detects the green color of the word card, and the morphology operation is then applied to reduce noise. The word card is thus obtained as a binary image with the shape of the character on it. By generating a plain binary card, the character on the word card can be extracted by subtracting the plain binary card. Besides, the total number of pixels of the character is calculated to determine whether the result is a character or not. In the third step, a scheme based on a set of concentric circles is adopted to extract the character features, which are then fed into a third supervised learning neural network (CRNN) to recognize which word it is; the designed CRNN can robustly identify characters under different translations, sizes, tilts and angles of rotation [3]. The overall system processing time is about 0.15 s.

Chapter 2 describes the related work of the system. Chapter 3 describes the intelligent word card image recognition system. Chapter 4 shows the experimental results. Chapter 5 gives the conclusions of the thesis and future work.

Fig. 1.2 The software architecture.


Chapter 2

Related Work

2.1 Introduction to ANNs

The human nervous system consists of a large number of neurons, each comprising a soma, axon, dendrites and synapses. Each neuron is capable of receiving, processing, and passing signals from one to another. Researchers have developed an intelligent algorithm that mimics the characteristics of the human nervous system, called artificial neural networks (ANNs). In the artificial intelligence field, ANNs have been applied successfully to speech recognition, image analysis and adaptive control. This thesis applies ANNs to character recognition in an eyeball system through supervised learning.

Fig. 2.1 Basic structure of a neuron.


Fig. 2.1 shows the basic structure of a neuron, whose input-output relationship can be described as y = f( Σi wi xi + b ), where xi are the inputs, wi the weights, b the bias, and y represents the output. As for the activation function f(·), it can be linear or nonlinear, and its activation level is determined by the sum of wi xi and the bias b. The three most common activation functions, the linear function, the log-sigmoid function and the tan-sigmoid function, are respectively expressed as below:

(1) Linear function: f(x) = x

(2) Log-sigmoid function: f(x) = 1 / (1 + e^(−x))

(3) Tan-sigmoid function: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
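These three functions can be written down directly; the following sketch implements them in NumPy (the closed forms are the standard definitions, stated here as an assumption since the thesis's typeset formulas were not preserved):

```python
import numpy as np

def linear(x):
    """Linear (identity) activation: f(x) = x."""
    return x

def log_sigmoid(x):
    """Log-sigmoid: f(x) = 1 / (1 + e^(-x)), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tan_sigmoid(x):
    """Tan-sigmoid (tanh): f(x) = (e^x - e^-x) / (e^x + e^-x), output in (-1, 1)."""
    return np.tanh(x)
```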

The structure of a multilayer feed-forward network is generally composed of one input layer, one output layer, and some hidden layers. For example, Fig. 2.2 shows a multilayer feed-forward network with one input layer, one output layer, and two hidden layers. Each layer is formed by neurons with the basic structure depicted in Fig. 2.1. The input layer receives signals from the outside world and responds through the hidden layers to the output layer. Note that in some cases only the input layer and output layer are required, with the hidden layers omitted.

Compared with a network using a single hidden layer, a network with multiple hidden layers can solve more complicated problems; however, its training process may become more difficult.


In addition to the architecture, the method of setting the weight values is important for a neural network, which may be trained via supervised learning or unsupervised learning. In supervised learning, training maps a given set of inputs to a specified set of target outputs, and the weights are adjusted according to various learning algorithms. In unsupervised learning, the neural network is trained to group similar input vectors together without any training data specifying what a typical member of each group looks like or which group each vector belongs to. In this thesis, the neural network learns the behavior from many input-output pairs, hence it belongs to supervised learning.

Fig. 2.2 A multilayer feed-forward network with two hidden layers.


2.2 Back-Propagation Network

The back-propagation (BP) algorithm was proposed in 1986 by Werbos et al. [], and is based on the gradient steepest-descent method, updating the weights to minimize the total squared error of the output. The BP algorithm has been widely used in a diversity of applications with supervised learning. To explain the BP algorithm clearly, an example is given in Fig. 2.3: a neural network with I input nodes, J output nodes, and K hidden nodes. Let the inputs and outputs be xi and yj, where i = 1, 2, …, I and j = 1, 2, …, J, respectively. For the hidden layer, the k-th hidden node, k = 1, 2, …, K, receives information from the input layer and sends hk to the output layer.

These three layers are connected by two sets of weights, vik and wkj. The weight vik connects the i-th input node to the k-th hidden node, while the weight wkj connects the k-th hidden node to the j-th output node.

Fig. 2.3 Neural network with one hidden layer.


Based on the neural network in Fig. 2.3, the BP algorithm for supervised learning is generally processed by nine steps as below:

Step 1: Set the maximum tolerable error Emax and the learning rate η, between 0.1 and 1.0, to trade off computing time against precision.

Step 2: Set the initial weight and bias value of the network randomly.

Step 3: Input the training data x = [x1 x2 … xI]^T and the desired output d = [d1 d2 … dJ]^T.

Step 4: Calculate the output of each of the K neurons in the hidden layer,

hk = fh( Σi vik xi + bk ),  k = 1, 2, …, K,

where fh(·) is the activation function, and then the output of each of the J neurons in the output layer,

yj = fy( Σk wkj hk + bj ),  j = 1, 2, …, J,

where fy(·) is the activation function.

Step 5: Calculate the error function

E = (1/2) Σj (dj − yj)²

where dj is the desired output.

Step 6: According to the gradient descent method, determine the corrections of the weights as

Δwkj = η (dj − yj) fy′(netj) hk,  Δvik = η Σj [ (dj − yj) fy′(netj) wkj ] fh′(netk) xi

where netj and netk denote the weighted-sum inputs of the output and hidden neurons.

Step 7: Update the weights by wkj ← wkj + Δwkj and vik ← vik + Δvik.

Step 8: Check the next training data. If it exists, then go to Step 3; otherwise, go to Step 9.

Step 9: Check whether the network converges or not. If E ≤ Emax, terminate the training process; otherwise, begin another learning cycle by going to Step 1.

The maximum tolerable error Emax is measured with the same error function E. The learning rate η is a parameter that controls how fast the weights are corrected. The BP learning algorithm can model various complicated nonlinear functions and has been successfully applied in many domains, such as pattern recognition, adaptive control and clustering problems. In this thesis, the BP algorithm is used to learn the input-output relationship for a clustering problem.
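As an illustration, the step-by-step BP procedure above can be sketched for a small network in NumPy. The layer sizes, learning rate, activation choices and XOR-style toy data below are illustrative assumptions, not the configuration used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, J = 2, 4, 1                          # input, hidden, output node counts
eta, E_max, max_epochs = 0.5, 1e-3, 5000   # Step 1: learning rate and tolerance

# Step 2: random initial weights and biases
V = rng.normal(0, 0.5, (I, K)); bv = np.zeros(K)   # input -> hidden (v_ik)
W = rng.normal(0, 0.5, (K, J)); bw = np.zeros(J)   # hidden -> output (w_kj)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # Step 3: training data
D = np.array([[0.], [1.], [1.], [0.]])                  # desired outputs (XOR)

E = 0.0
for epoch in range(max_epochs):
    E = 0.0
    for x, d in zip(X, D):
        h = np.tanh(x @ V + bv)                   # Step 4: hidden outputs h_k
        y = 1 / (1 + np.exp(-(h @ W + bw)))       # Step 4: network outputs y_j
        E += 0.5 * np.sum((d - y) ** 2)           # Step 5: error function
        delta_y = (d - y) * y * (1 - y)           # Step 6: output-layer delta
        delta_h = (delta_y @ W.T) * (1 - h ** 2)  # Step 6: hidden-layer delta
        W += eta * np.outer(h, delta_y); bw += eta * delta_y   # Step 7: update
        V += eta * np.outer(x, delta_h); bv += eta * delta_h
    if E <= E_max:                                # Step 9: convergence check
        break
```

Here the weights are updated after every pattern (online learning); Steps 3-8 loop over the data and Step 9 checks convergence once per epoch.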


2.3 Foreground Segmentation

Moving regions are often the part of interest in a real-time detection system, and a good motion detection method that identifies the moving objects in the picture greatly helps subsequent classification or tracking by providing more accurate information for follow-up actions. There are three common approaches: background subtraction [7], optical flow [8], and frame difference [9].

Background subtraction is the most common method for segmenting regions of interest in videos. This method first has to build an initial background model. The purpose of training the background model is to subtract the background image from the current image to obtain the foreground regions of interest. Background subtraction can detect the feature points of the foreground regions most completely and can be implemented in real time.

Optical flow reflects the image changes due to motion during a time interval, and the optical flow field is the velocity field that represents the three-dimensional motion of foreground points across a two-dimensional image. Compared with the other two methods, optical flow can detect the foreground regions of interest more accurately, but its computations are very intensive and difficult to realize in real time.

The frame difference method performs pixel-based subtraction on successive frames, relying on the consistency of the background across consecutive images. The subtraction image is computed as

Isub(x, y) = | It(x, y) − It−1(x, y) |

where Isub is the image subtraction matrix and It and It−1 are the RGB values at times t and t−1. An image subtraction threshold (Subth) is then applied for change detection: when Isub(x, y) is larger than this threshold, the pixel is regarded as dynamic; otherwise, it is identified as background. The frame difference method is easy to implement and requires little computation. Considering the environment, the processing time and other factors, this study uses frame difference as the foreground capture method.
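A minimal sketch of the frame-difference rule above; the threshold value and toy frames are illustrative assumptions:

```python
import numpy as np

def frame_difference(frame_t, frame_t1, sub_th=30):
    """Return a binary mask: 1 = dynamic pixel, 0 = background.

    frame_t, frame_t1: HxWx3 uint8 RGB images at times t and t-1.
    sub_th: illustrative threshold on the summed RGB difference.
    """
    diff = np.abs(frame_t.astype(np.int16) - frame_t1.astype(np.int16))
    i_sub = diff.sum(axis=2)              # summed RGB difference per pixel
    return (i_sub > sub_th).astype(np.uint8)

# A still background with one object appearing in the centre:
prev = np.zeros((4, 4, 3), dtype=np.uint8)
curr = prev.copy()
curr[1:3, 1:3] = 200                      # 2x2 block of bright pixels
mask = frame_difference(curr, prev)
```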

2.4 Introduction to Morphology Operation

Morphology has two simple functions, dilation and erosion [10]. Dilation is defined as:

A ⊕ B = { x : (B̂)x ∩ A ≠ ∅ }   (2.13)

where A and B are sets in Z². This equation simply means that B is moved over A and the intersection of the reflected and translated B with A is found. Usually A is the signal or image being operated on and B is the structuring element. Fig. 2.4 shows how dilation works.

The opposite of dilation is known as erosion, defined as:

A ⊖ B = { x : (B)x ⊆ A }   (2.14)

which simply says that the erosion of A by B is the set of points x such that B, translated by x, is contained in A. Fig. 2.5 shows how erosion works, in exactly the same way as dilation. However, equation (2.14) essentially says that for the output to be a one, all of the inputs must match the structuring element. Thus, erosion removes runs of ones that are shorter than the structuring element. This thesis applies both operations to process the image.


Fig. 2.4 Example of dilation.

Fig. 2.5 Example of erosion.


2.5 Color Detection

Color is an important source of information in human visual perception. Color detection has been applied to a variety of tasks: one can choose a target color and filter web image content accordingly, and skin-color detection in particular is used for detecting and tracking human faces and gestures and for diagnosing disease [11],[12],[13].

As the first task in our schemes for detecting a moving object of special color and extracting characters, color detection greatly reduces the computational cost [14] before the potential object regions and characters are extracted. Furthermore, color image segmentation is computationally fast while being relatively robust to changes in scale, viewpoint, and complex backgrounds.

According to the characteristics of an object's distribution in a color space, the color of a pixel can be detected quickly using the object's color model. However, using different color spaces for different subjects and different illuminations often results in different detection accuracy [15]. In this thesis, the experimental environment is our laboratory and the lighting condition is fixed.

Usually, color detection involves two aspects: selecting a color space and using the color distribution to establish a good color model. The main color spaces nowadays include RGB, HSV, HSI, YCbCr and some of their variants, while RGB is the foundational way to represent color.
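As a sketch of a simple RGB color model under fixed lighting, a pixel can be classified by channel comparisons; the thresholds below are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def green_mask(img, margin=40, g_min=80):
    """Return a binary mask of 'green enough' pixels in an HxWx3 RGB image.

    A pixel is green when its G channel is bright and dominates R and B by
    at least `margin` (both thresholds are illustrative assumptions).
    """
    r = img[..., 0].astype(np.int16)
    g = img[..., 1].astype(np.int16)
    b = img[..., 2].astype(np.int16)
    return ((g > g_min) & (g - r > margin) & (g - b > margin)).astype(np.uint8)

img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (30, 180, 40)     # a green pixel
img[1, 1] = (200, 210, 190)   # bright grey: G is high but not dominant
mask = green_mask(img)
```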


2.6 Character recognition

Character recognition applications can be divided into two categories: Optical Character Recognition (OCR) and On-Line Character Recognition (OLCR).

OLCR uses a handwriting board or digital pen as an input tool to capture the characters and then performs the recognition. Different from OLCR, OCR uses a scanner to scan a document, saves it as an image file, and then identifies the characters in the image file. This thesis adopts OCR for character recognition since the characters to be recognized are obtained from a sequence of images.

The basic flow chart of character recognition is shown in Fig. 2.6. In general, the pre-processing of an image handles object location, size normalization, binarization, rotation angle, tilt, etc. The two main parts of a character recognition system are feature extraction and feature classification, which determine the speed and accuracy of text recognition. For these steps, many methods have been proposed; they can be divided into three types: the statistics method, the structure method, and the merged statistics-and-structure method.

The statistics method mainly measures the composition of some particular physical quantities in the image: text features are extracted and classified by matching patterns in a built-in database, as in the template matching method. The structure method splits a character into several parts and compares them with a built-in database to determine the most similar result. In general, the structure method can tolerate a character's own variability, but its reaction to noise interference is unstable.

The merged statistics-and-structure method combines the advantages of both.

This thesis uses the merged statistics-and-structure method, and its feature extraction mainly follows Torres-Mendez, L.A., "Translation, Rotation, and Scale-Invariant Object Recognition" [16]. That paper presents an object recognition method with excellent invariance under translation, rotation, and scaling. Its feature extraction exploits the invariant properties of the normalized moment of inertia [17] and a novel coding that extracts topological object characteristics [18].

The feature extraction is based on a set of concentric circles, which are naturally invariant to rotation (in 2-D). Fig. 2.7(a) shows an example with 8 concentric circles. Each circle is cut into several arcs by the character. Heuristically, the number of arcs of the i-th circle outside the character can be used as the first feature, denoted as Mi. This simple coding scheme extracts the topological characteristics of the object regardless of its position, orientation, and size. However, in some cases, two different objects can have the same or very similar Mi values (for example, the digits 2 and 5). For the second feature, we take the difference of the two largest arcs of each circle outside the object and normalize it by the circumference, denoted as

Dri = (di1 − di2) / (2πri)   (2.15)

for the i-th circle. Fig. 2.7(b) shows d31 and d32 of the third circle as an example.

Fig. 2.7 Example of feature extraction. (a) Example with 8 concentric circles. (b) d31 and d32 of the third circle.
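The two features Mi and Dri can be sketched by sampling each circle and counting runs of background samples. The sampling density, the toy image, and the normalization by the sampled circumference are illustrative assumptions:

```python
import numpy as np

def circle_features(img, cx, cy, r, n=360):
    """Return (M_i, Dr_i) for one circle of radius r centred at (cx, cy).

    M_i  = number of arcs outside the character on this circle.
    Dr_i = (d_i1 - d_i2) / circumference, using the two largest arcs,
           with the circumference measured in samples (an assumption).
    """
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    xs = np.clip((cx + r * np.cos(theta)).round().astype(int), 0, img.shape[1] - 1)
    ys = np.clip((cy + r * np.sin(theta)).round().astype(int), 0, img.shape[0] - 1)
    on_char = img[ys, xs] > 0                 # samples covered by the character
    if on_char.all():
        return 0, 0.0                         # circle fully inside the character
    if not on_char.any():
        return 1, 0.0                         # one full background arc
    # Rotate so the scan starts on the character, making arc runs unambiguous.
    on_char = np.roll(on_char, -int(np.argmax(on_char)))
    arcs, run = [], 0
    for v in on_char:                         # count maximal background runs
        if not v:
            run += 1
        elif run:
            arcs.append(run); run = 0
    if run:
        arcs.append(run)
    arcs.sort(reverse=True)
    d1, d2 = arcs[0], arcs[1] if len(arcs) > 1 else 0
    return len(arcs), (d1 - d2) / n

img = np.zeros((21, 21))
img[:, 10] = 1                                # a vertical stroke through the centre
m3, dr3 = circle_features(img, cx=10, cy=10, r=5)
```

For the vertical stroke, the circle is cut into two symmetric arcs, so Mi = 2 and Dri is close to zero.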


Chapter 3 Intelligent Word Card Image Recognition System

The intelligent recognition system is implemented in several steps, as shown in Fig. 3.1, including potential object localization, character extraction and character recognition. Each step adopts some image-processing schemes; for example, image subtraction, morphology operations and connected components labeling (CCL) are used in the first step to extract moving objects. Different from conventional image processing, this thesis adopts intelligent neural networks based on supervised learning to accomplish part of the schemes: detection of a moving object in special color, color extraction and object recognition.

Fig. 3.1 The flow chart of the intelligent system.


3.1 Detection of Moving Object in Special Color

This is the first part of potential object localization. Usually, there are two fundamental steps to detect a moving object of special color: moving object detection and color extraction. The two steps are often processed separately, but this thesis presents a scheme based on an artificial neural network that extracts the color and detects the moving object simultaneously.

To detect a moving object of special color from a sequence of images, the moving regions found by frame difference are intersected with the pixels of the chosen color Ic (green in our example); the result Imosc, the moving object of special color, is shown in Fig. 3.2(d).


Fig. 3.2 (a) Input image (T = t−1). (b) Input image (T = t). (c) Moving detection. (d) Detection of moving object in special color.

In supervised learning, training data are required, as shown in Fig. 3.2(d). The RGB information is learned by the neural network structure in Fig. 3.3 based on back-propagation. After learning, a moving object of special color can be distinguished from the background according to the output value of the neural network.

Usually, a pixel of moving object of special color has an output value near to 1, while a pixel in the background has an output value near to 0. To efficiently extract the moving object of special color in an image, a threshold value should be carefully selected under the lighting condition of the environment being properly controlled.

Fig. 3.3 MCNN’s structure.


The neural network MCNN that extracts a moving object of special color is shown in Fig. 3.3; it is composed of one input layer with 6 neurons, one hidden layer with 7 neurons, and one output layer with 1 neuron. The RGB values are sent into the 6 neurons of the input layer, represented by MC(p), where p = 1, 2, 3 for frame t−1 and p = 4, 5, 6 for frame t. The p-th input neuron is connected to the q-th neuron, q = 1, 2, …, 7, of the hidden layer with weighting WMC1(p, q), a weighting array of dimension 6×7. Besides, the q-th neuron of the hidden layer has an extra bias bMC1(q). Finally, the q-th neuron of the hidden layer is connected to the output neuron with weighting WMC2(q), q = 1, 2, …, 7, and a bias bMC2 is added to the output neuron.

Let the activation function of the hidden layer be the hyperbolic tangent sigmoid transfer function; then the q-th hidden output OMC1(q) is expressed as

OMC1(q) = tansig( Σp WMC1(p, q)·MC(p) + bMC1(q) ),  q = 1, 2, …, 7,

and the single output neuron OMC2 is expressed as

OMC2 = f( Σq WMC2(q)·OMC1(q) + bMC2 )

where f(·) is the activation function of the output layer. The above operations are shown in Fig. 3.4.
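The MCNN forward pass described above can be sketched as follows; the random weights stand in for trained ones, and the log-sigmoid output activation is an assumption, since the thesis's formulas were not preserved:

```python
import numpy as np

rng = np.random.default_rng(1)
W_MC1 = rng.normal(0, 0.5, (6, 7))   # input -> hidden weights WMC1(p, q)
b_MC1 = np.zeros(7)                  # hidden biases bMC1(q)
W_MC2 = rng.normal(0, 0.5, 7)        # hidden -> output weights WMC2(q)
b_MC2 = 0.0                          # output bias bMC2

def mcnn_forward(mc):
    """mc: 6-vector of RGB values for one pixel in frames t-1 and t."""
    o1 = np.tanh(mc @ W_MC1 + b_MC1)                # hidden layer (tansig)
    o2 = 1 / (1 + np.exp(-(o1 @ W_MC2 + b_MC2)))    # output in (0, 1), assumed
    return o2

# Classify one pixel: an output near 1 means moving object of special color.
pixel = np.array([0.1, 0.2, 0.1, 0.1, 0.8, 0.2])  # normalized RGB, frames t-1, t
score = mcnn_forward(pixel)
is_object = score > 0.5              # threshold chosen per lighting conditions
```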


Fig. 3.4 Operations of the MCNN.

3.2 Morphology Operation

Three operations are used in this section: erosion, dilation and hole filling.

3.2.1 Erosion and Dilation

After applying color extraction, color regions are extracted from the original image, but some noise still exists therein. A conventional way to eliminate noise regions is the morphology operations. In this thesis, the noise is eliminated by the morphology erosion operation (2.14), expressed as

A ⊖ B = { x : (B)x ⊆ A }   (3.7)

where B is a disk-shaped structuring element with radius 4, as shown in Fig. 3.5(a), and the noise regions in image A smaller than B are erased by the operation. However, erosion may also generate gaps in isolated regions. To repair these gaps, the morphology dilation operation (2.13) is further employed, expressed as

A ⊕ C = { x : (Ĉ)x ∩ A ≠ ∅ }   (3.8)

where C is a disk-shaped structuring element with radius 6, as shown in Fig. 3.5(b), and the gaps in image A are repaired by the operation. Fig. 3.6 shows an example of erosion and dilation using the structuring elements B and C.



Fig. 3.5 (a) Structuring element B. (b) Structuring element C.


Fig. 3.6 Steps for morphology operations. (a) Initial image. (b) Result of erosion using structuring element B. (c) Result of dilation using structuring element C.



3.2.2 Hole Filling

In the current application, it is appropriately called conditional dilation or inside connected components. A hole may be defined as a background region surrounded by

