2. Related Work

2.2. Intelligent Character Recognition

that the database will become huge since it has to cover both character datasets. Using the wrong model to perform recognition can result in unpredictable outcomes. In practical situations, printed characters can be mixed with handwritten text. Therefore, intelligent character recognition with flexible features is desirable in applications such as signboard or menu recognition.

With the recent development of China, the number of Chinese learners is increasing steadily. Application developers have started to put effort into creating programs that utilize Chinese character recognition on mobile platforms. The next section briefly describes the status of mobile intelligent character recognition and mobile applications.

2.2.3. Mobile Character Recognition and Apps

Mobile learning has received much attention thanks to recent advances in information technology. A smartphone equipped with intelligent character recognition can play an effective assistive role in language learning. Worldictionary [22] and Pleco [23] (Fig. 2-6) are two notable examples that apply modern optical character recognition technology.

Fig. 2-6: Enhancing recognition results by post-processing

1) Worldictionary: Following the fundamental principles of user interface design, this App features a friendly interface that gives users instant translation when they point the camera phone at the text they wish to look up. It is reported to recognize and translate more than ten languages, including Traditional Chinese, Simplified Chinese, English, Japanese and Korean. It also provides related information such as pronunciation.

2) Pleco: Similar to Worldictionary, Pleco lets users look up the meaning of a text instantly by positioning the camera over the text area. Pleco also provides associated information for users.

Both applications can retrieve learning materials instantly. However, according to their official descriptions, they are reliable only for purely printed or purely handwritten character recognition; they cannot handle mixed-type characters very well. Returning to the objective of this thesis, we hope to devise an intelligent character recognition engine that can deal with most situations occurring in tour books. Since the text can consist of either printed or handwritten Chinese characters, we have attempted to develop a solution that can handle mixed-type characters at the same time.

3. Case Studies

The proposed system aims to provide a mobile service that helps foreign tourists plan their trips effectively and effortlessly. Since mobile applications are receiving much attention and experiencing rapid growth, this chapter studies and analyzes two use cases, HuayuNavi and iConference, in both of which I have been deeply involved over the past few months. We will then elucidate the key components of the proposed framework.

3.1. HuayuNavi

HuayuNavi is inspired by people who travel, work or study in Chinese-speaking countries and need to understand Chinese. The application is designed to make Chinese learning easy for users who have never taken any Chinese lessons [24, 25] (see Fig. 3-1).

Fig. 3-1: HuayuNavi webpage

The motivation to design this product can be linked to three phenomena summarized as follows:

1) Chinese is becoming a dominant language in the world.

2) According to the Taiwan Tourism Bureau, more foreigners have visited Taiwan and wanted to explore Taiwanese culture in the past decade.

3) Traditional learning platforms, such as textbooks, laptops or language learning centers, are not portable and interactive at the same time.

The system flowchart of HuayuNavi is illustrated in Fig. 3-2, and the operation details are described below.

Fig. 3-2: System flowchart of the HuayuNavi platform

The interface of HuayuNavi allows users to take a picture and select the rectangular area they want to perform recognition on. In order to transmit the picture under limited bandwidth, the acquired image is resized and compressed in advance. The interface also listens for the response from the remote server and formats the retrieved data to be user-readable. The user can also report problems caused by the program through the interface.

The recognition engine is the core of the system. Once a picture is transmitted to the server, noise removal and segmentation are applied to clean up the image and extract a candidate region. Next, feature vectors are extracted from the image and passed on to the classification stage. For each candidate character, the recognition engine generates a corresponding probability and feeds it into the vocabulary selector component, which searches for the best matching term under domain-specific vocabulary models. This completes one recognition cycle, and the system waits for the next user request.
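The vocabulary selection step can be illustrated with a minimal sketch. It assumes that the classifier reports, for each segmented character position, a small set of candidate characters with probabilities, and that a term is scored by the sum of the log-probabilities of its characters. The function name `select_terms`, the `floor` probability for unseen characters and the toy vocabulary are all hypothetical; the actual scoring rule used in HuayuNavi is not specified here.

```python
import math

def select_terms(char_probs, vocabulary, top_n=3, floor=1e-6):
    """Rank vocabulary terms by the joint log-probability of their characters.

    char_probs: one dict per segmented character position, mapping a candidate
                character to the probability reported by the classifier.
    vocabulary: terms of the selected domain (e.g. the food category).
    floor:      probability assumed for characters the classifier did not
                propose (an assumption of this sketch).
    """
    scored = []
    for term in vocabulary:
        if len(term) != len(char_probs):
            continue  # keep only terms whose length matches the segmentation
        score = sum(math.log(probs.get(ch, floor))
                    for ch, probs in zip(term, char_probs))
        scored.append((score, term))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:top_n]]

# Toy example: two segmented characters with per-character candidates.
char_probs = [
    {"珍": 0.6, "玲": 0.3, "診": 0.1},
    {"珠": 0.7, "株": 0.2, "殊": 0.1},
]
food_vocabulary = ["珍珠", "玲瓏", "奶茶", "診所"]
ranked = select_terms(char_probs, food_vocabulary)
```

Restricting the search to a domain-specific vocabulary lets a weak per-character result be corrected by its neighbors, since only combinations that form a known term are considered.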

Fig. 3-3: The user interface of HuayuNavi

From the user’s point of view, an intuitive interface is a crucial part of our system, especially for foreigners. The easier and clearer the interface is, the better the user experience. Fig. 3-3 depicts the system interface (from left to right) of the HuayuNavi application. The main menu (Fig. 3-3(a)) consists of 9 different subjects: food, travel, position, art, culture, book, business, entertainment and landmark. Users choose the subject they are interested in, e.g., the food category. The user then takes a picture of a signboard (pearl milk drink in our example) using the camera phone. A rectangular box appears, and the user can resize it by moving the anchors. The user is advised to crop the desired area as accurately as possible in order to eliminate irrelevant content (Fig. 3-3(c)). After cropping is finished, recognition starts when the user clicks the “OK” button. The top three candidates with English translations are returned by the server and appear on the screen, as shown in Fig. 3-3(d). Furthermore, the detailed explanation, pronunciation and phonetic spelling are presented if the user touches the corresponding buttons. The overall operation finishes within 3 seconds on average. We believe that a short response time is the key factor in retaining users.

3.2. iConference

iConference [26] (Fig. 3-4) is a mobile augmented reality application designed to facilitate social functions and ice-breaking among conference attendees by combining social networking, face recognition, intelligent character recognition and augmented reality technologies. Users can identify faces in the crowd and swipe name cards using their mobile devices, after which they can obtain the relevant user profiles from social networking sites. Faces are tracked or name cards are recognized, and the information is then overlaid on the screen.

Fig. 3-4: Snapshots of iConference

In this project, my responsibility is name recognition and its user interface design. iConference follows a client-server architecture, just like most mobile image search applications, one reason being that name recognition also requires substantial processing power. The frontend captures an image and sends it to the remote server for recognition. In addition, iConference pays close attention to the user interface and features an innovative way to capture query information (see Fig. 3-5).

Fig. 3-5: iConference obtains query by swiping finger

Fig. 3-5(a) depicts one scenario that attendees might encounter when they participate in a conference: a name card left somewhere. In such common situations, one way to locate the owner is to recognize his or her name first. Through recognition techniques, it is easy to obtain the information that the attendee provided when registering for the conference, such as affiliation, nationality, research interests and publications.

In practice, the user picks up a name card with one hand and points the camera phone at the card with the other (Fig. 3-5(b)). The user then swipes one finger across the area of interest to define the query image (Fig. 3-5(c)). This is an intuitive way to obtain information from users. Technically speaking, it also filters out unnecessary data, since the recognition engine only needs to process the region specified by the user. Displaying the segmented region on the screen helps users decide whether to modify the input if the recognition result is not satisfactory. For the recognition engine, we employ the same framework adopted in the HuayuNavi project. The HuayuNavi database focuses on Chinese with mixed-type characters, whereas the iConference database is concerned with English only. Therefore, iConference usually returns better recognition results.

After surveying the HuayuNavi and iConference cases, we can observe that these services integrate intelligent character recognition with mobile applications in a sensible way. Returning to our problem, we are also concerned with offering a useful service on a mobile platform, but the objective is quite distinct. We wish to provide quick routing information for tourists while they search for their destinations according to the information depicted in the travel guide. Hence, we will develop a recognition system that can recognize images or texts, depending on what type of information the travel guide provides. Next, we will introduce our system architecture and the flowchart of the detailed processing steps.

4. Proposed Methodology

In this research, we intend to propose a recognition system that can deal with image or character recognition on mobile platforms. This chapter introduces the proposed system flowchart first, and then presents the core methodologies we developed, including the descriptors for mobile visual search as well as character recognition in a detailed manner.

4.1. System Flowchart

Following the common mobile recognition framework based on client-server architecture, we design a system that captures a query image or text and then returns the recognition results at the client side, i.e., the smartphone. At the backend, the server initiates different recognition engines based on the type of information sent from the client, as illustrated in Fig. 4-1.

Fig. 4-1: System flowchart

First, we will illustrate the role played by the client, including input and output. Next, we will briefly mention the core components of the backend server. Detailed information regarding feature representation, feature extraction and object recognition will be elucidated in the next section.

• Input at frontend:

1) First, we will ask users to turn on GPS in order to detect the user’s current location. The recorded location serves as the starting point in route planning.

2) [The application provides two modes of operation: image search and] intelligent character recognition. Users should select the proper mode of operation before taking the photo. After that, the user can take a picture of interest, such as a landmark photo or textual information from tour books, travel brochures or photo-sharing websites.

3) The camera phone must be connected to either a Wi-Fi or a 3G network in order to send the query to the remote server.

• Output at frontend:

4) The frontend receives the top-N (N ≤ 5 in most cases) matches and the corresponding geographic information after the query. The App displays multiple matching results so that users can manually select the desired destination.

5) After the user selects one landmark photo from the candidates, the corresponding routing information will be presented. In other words, the navigation guide from the starting location (detected by GPS) to the final destination will be shown on the screen.

• Image search at backend:

All feature extraction tasks for the images stored in the database can be carried out offline. It is therefore possible to build a recognition engine that responds in real time. After receiving the query image from the client, we first normalize the image, and then extract global features from the query data to prepare it for recognition. The output is a list of top-N matches.

• Character recognition at backend:

Similar to the visual search framework, we also extract all features from the vocabulary model offline to build a character recognition service with fast response. For an incoming query, noise removal is applied first, followed by image segmentation, feature extraction and recognition. Finally, the vocabulary selector picks the top-N candidates.

Simply speaking, the frontend is just an interface responsible for capturing the input and presenting the recognition results. The backend plays the crucial role of image and text recognition. The core recognition technologies we propose will be discussed in the next section.
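The division of labor between frontend and backend can be sketched as a simple dispatch on the mode selected by the user. This is only an illustrative outline: the mode names `"image"` and `"text"`, the function names and the placeholder results are hypothetical, and the two engines are stubs standing in for the processing chains described above.

```python
def image_search(payload, top_n):
    # Stub for: normalize image -> extract global features -> top-N matches.
    return [f"landmark-{i}" for i in range(1, top_n + 1)]

def character_recognition(payload, top_n):
    # Stub for: noise removal -> segmentation -> feature extraction
    # -> recognition -> vocabulary selector.
    return [f"term-{i}" for i in range(1, top_n + 1)]

# Hypothetical mode names; the client sends the mode chosen by the user.
ENGINES = {"image": image_search, "text": character_recognition}

def handle_query(mode, payload, top_n=5):
    """Route a client query to the engine that matches the selected mode."""
    try:
        engine = ENGINES[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return engine(payload, top_n)

matches = handle_query("image", b"<jpeg bytes>", top_n=3)
```

Keeping the engines behind a single entry point means the client only needs to send the mode and the captured data; all feature extraction stays on the server.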

4.2. Image Descriptors

To strike a balance between efficiency and precision, this research attempts to combine two global features, namely the weighted gist descriptor [27] and the average effective number of neighbors (AENN) [28], to perform the image matching task. This section first introduces the motivation for assigning weights to the gist descriptor. The detailed process of computing the weights based on a saliency measure is then presented. Next, a novel global descriptor, AENN, is defined. We discuss its properties and the type of image characteristic it captures, and then obtain the matching results using a mixture of these two global descriptors with a linear combination scheme.

4.2.1. Weighted Gist Descriptor

Gist [29] is a low-dimensional representation of an image. It is designed to capture the overall structure of the scene by partitioning the image into n × n blocks and computing the response of each block to multi-scale oriented filters. Each filtered image patch is represented by a k-dimensional vector. Therefore, a gist descriptor of dimension n × n × k is constructed by concatenating the vectors from all image blocks. For example, if the input image is partitioned into 4 × 4 blocks, and each block generates 20 coefficients, the dimension of the final gist feature will be 320.

Not all blocks in an image contain meaningful content for matching. The original gist descriptor, however, treats all image blocks equally. In order to assign more influence to image regions containing interesting features, we propose to incorporate a saliency measure so that visually salient regions play more important roles during the matching process. In [30], color, intensity and orientation have been identified as important factors in attentional allocation. Similarly, graph-based visual saliency [31], which uses graph theory to concentrate mass on activation maps built from feature vectors, shows remarkable consistency with the attentional deployment of human subjects. It also predicts human fixations and achieves 98% of the ROC area of a human-based control. The term ‘gist’ is derived from the spatial envelope [29], which provides a holistic description of the scene in which global perceptual properties including naturalness, openness, roughness, ruggedness and expansion are extracted. Gist has been employed to classify scene categories by different spectral signatures, such as amplitude (which captures roughness) or orientation (which captures dominant edges).

Our proposed modification associates weights with different blocks in the image according to the saliency measure so that blocks with higher saliency scores contribute more to the recognition process.

In our proposed framework, the saliency map of the input image is computed using the method described in [31]. The input photo is then partitioned into 4 × 4 sub-images to prepare for the extraction of the gist descriptor. The saliency score, which is the mean of the saliency measure in each sub-region, is used to weight the contribution of the corresponding gist descriptor. Unlike many previous works, where the distance between two images is computed using the sum-of-squared-differences (SSD) between gist descriptors, we feed the weighted gist descriptor to a modified support vector machine (SVM) to generate the list of possible matches.

Fig. 4-2 summarizes the key steps of the proposed landmark photo matching algorithm.

To begin with, the query image is scaled to 512 × 512 since both the saliency map and the gist descriptor depend on the layout (Fig. 4-2(a)). A graph-based visual saliency algorithm is applied to construct the saliency map and locate regions of interest (Fig. 4-2(b)). The image is then divided into 4 × 4 blocks, each of size 128 × 128. The saliency measure in each block is averaged to arrive at a single weight factor for that particular block (Fig. 4-2(c)). Next, we compute the gist descriptor for each image block. We use 8 orientation channels at two different frequencies and 4 orientation channels at another frequency, totaling 20 coefficients for each block. The gist coefficients are concatenated to form a 320-dimensional feature vector. For color images, the dimension increases to 960, as we extract the gist features from the R, G and B color channels separately (Fig. 4-2(d)). The concatenated gist vector is weighted according to the saliency score calculated in Fig. 4-2(c) to produce a weighted gist feature, which is then forwarded to the SVM-based classifier to perform the recognition (Fig. 4-2(e)).

Fig. 4-2: Computing the weighted gist descriptor
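The weighting step of Fig. 4-2(c) and (d) can be sketched as follows for a single channel. The sketch assumes that the saliency map and the gist vector are already available (the saliency and gist computations themselves are not reproduced), that the gist vector stores the k coefficients of each block contiguously in row-major block order, and that weighting means multiplying each block's coefficients by its mean saliency. The function names and the toy 8 × 8 map are hypothetical.

```python
def block_weights(saliency, n=4):
    """Mean saliency of each of the n x n blocks, in row-major block order."""
    height, width = len(saliency), len(saliency[0])
    bh, bw = height // n, width // n  # block size; assumes divisibility by n
    weights = []
    for by in range(n):
        for bx in range(n):
            total = 0.0
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    total += saliency[y][x]
            weights.append(total / (bh * bw))
    return weights

def weight_gist(gist, weights, k=20):
    """Scale the k coefficients of each block by that block's weight."""
    if len(gist) != len(weights) * k:
        raise ValueError("gist length must equal number of blocks times k")
    return [value * weights[i // k] for i, value in enumerate(gist)]

# Toy 8x8 saliency map in which only the top-left 2x2 block is salient.
saliency = [[1.0 if (y < 2 and x < 2) else 0.0 for x in range(8)]
            for y in range(8)]
weights = block_weights(saliency, n=4)
gist = [1.0] * (4 * 4 * 20)  # placeholder gist coefficients, 320 in total
weighted = weight_gist(gist, weights, k=20)
```

For a 512 × 512 input the same code yields 128 × 128 blocks; for color images the weighting would be applied to each of the three 320-dimensional channel vectors before concatenation.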

4.2.2. Average ENN Descriptor

ENN defines a computationally efficient approach to assess the sharpness of an image so that poorly focused images can be identified. The basic idea is that well-focused images usually contain clearly defined edges and fine structures. Therefore, if we apply an edge detector and retain the same amount of edges, we will get stronger and more isolated peaks in the edge map for sharp images than for blurred images. As a result, if we retain a fixed percentage of the edge pixels, say q, and calculate the effective number of neighbors (ENN) for each edge point p within a neighborhood N(p) (usually a D × D window) according to:

$\mathrm{ENN}(p) = \sum_{p' \in N(p),\, p' \neq p} \frac{I(p')}{\lVert p - p' \rVert_1}$               (1)

then out-of-focus images will generally produce larger values of ENN because of the clustering phenomenon. Notice that the denominator of Eq. 1 is the Manhattan distance, so that farther neighbors receive less weight; hence the term effective number of neighbors.

Each edge pixel generates an ENN value according to Eq. 1. If we wish to examine the distribution of edges within a specific region, we can compute the average ENN for all edge pixels belonging to that region. According to our previous arguments, the smaller the average ENN is, the sharper the area. In a sense, the average ENN signifies the distribution of edge pixels in a certain image block, and can therefore be used to describe the overall structure of an image. Several examples of average ENN output are given in Fig. 4-3. For classification tasks, the average ENN from each image block is concatenated to form a feature vector, which is then forwarded to a probabilistic SVM to generate an ordered list of possible matches.

Fig. 4-3: Some examples of average ENN
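A minimal sketch of the ENN computation is given below. It assumes a binary edge map in which retained edge pixels are marked 1, so that every neighboring edge pixel inside the D × D window contributes the reciprocal of its Manhattan distance. The edge detection and the step that retains a fixed fraction q of edge pixels are omitted, and the function names and toy maps are hypothetical.

```python
def enn(edge_map, y, x, d=5):
    """Effective number of neighbors of edge pixel (y, x) in a d x d window.

    Each neighboring edge pixel contributes 1 / (Manhattan distance),
    so farther neighbors count less.
    """
    r = d // 2
    height, width = len(edge_map), len(edge_map[0])
    total = 0.0
    for ny in range(max(0, y - r), min(height, y + r + 1)):
        for nx in range(max(0, x - r), min(width, x + r + 1)):
            if (ny, nx) != (y, x) and edge_map[ny][nx]:
                total += 1.0 / (abs(ny - y) + abs(nx - x))
    return total

def average_enn(edge_map, d=5):
    """Mean ENN over all edge pixels of a region; 0.0 if there are none."""
    values = [enn(edge_map, y, x, d)
              for y, row in enumerate(edge_map)
              for x, is_edge in enumerate(row) if is_edge]
    return sum(values) / len(values) if values else 0.0

# Two 6x6 toy maps with the same number of edge pixels (six each):
# a thin line, as in a sharp image, and a compact cluster, as in a blurred one.
line = [[1 if x == 2 else 0 for x in range(6)] for y in range(6)]
cluster = [[1 if (y < 2 and x < 3) else 0 for x in range(6)] for y in range(6)]
line_score = average_enn(line)
cluster_score = average_enn(cluster)
```

With the same number of edge pixels, the clustered map yields the larger average ENN, which matches the argument that blurred regions produce larger values. Applying `average_enn` to each image block and concatenating the results gives the feature vector described above.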

4.2.3. Information Fusion

We have defined two global features in the previous subsections. To integrate information obtained from these two modules, we adopt a late fusion principle. That is, we obtain a list of probable matches with corresponding matching probabilities using each feature separately. The final matching probability $p_m$ is computed according to:

$p_m = \alpha \times p_{\mathrm{weighted\_gist}} + (1 - \alpha) \times p_{\mathrm{enn}}$               (2)

where α (0 ≤ α ≤ 1)  is a parameter to adjust the contribution of each component.

Eq. (2) can be further modified to enable a consensus check between the two matching results. Whereas local features tend to complement each other, global features capture the overall structure and should generate consistent results if the candidate is truly the best match. Therefore, we can check the difference between $p_{\mathrm{weighted\_gist}}$ and $p_{\mathrm{enn}}$, and if these two results are not consistent (meaning the difference is larger than a threshold δ), we discount the final matching probability by an amount $p_{\mathrm{discount}}$.
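The late fusion rule and the consensus check can be sketched together as below. The sketch assumes that the discount is subtracted from the fused probability and that the result is clamped at zero; the text only states that the probability is discounted by some amount, so both choices are assumptions. The function name `fuse` and the parameter values in the example are hypothetical.

```python
def fuse(p_weighted_gist, p_enn, alpha=0.5, delta=None, p_discount=0.0):
    """Late fusion of two matching probabilities following Eq. (2).

    If delta is given and the two probabilities differ by more than delta,
    the fused value is reduced by p_discount (clamped at zero).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_m = alpha * p_weighted_gist + (1.0 - alpha) * p_enn
    if delta is not None and abs(p_weighted_gist - p_enn) > delta:
        p_m = max(0.0, p_m - p_discount)  # assumed form of the discount
    return p_m

# The two descriptors agree: no discount is applied.
consistent = fuse(0.8, 0.6, alpha=0.5, delta=0.4, p_discount=0.1)
# The two descriptors disagree by more than delta: the score is discounted.
inconsistent = fuse(0.9, 0.3, alpha=0.5, delta=0.4, p_discount=0.1)
```

In practice the values of α, δ and the discount would be tuned on validation data; the numbers above are for illustration only.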

4.3. Intelligent Character Recognition

Intelligent character recognition refers to the processing and classification of non-printed texts. In [25], we surveyed and experimented with many character recognition algorithms and achieved a certain level of success under some constraints. That experience is readily applicable to the proposed route planning service. We describe the approach employed for robust character recognition in terms of feature extraction and recognition.

4.3.1. Feature Extraction

In order to balance efficiency and accuracy, we adopted the framework proposed in [32] and modified it to better suit our problem. The detailed steps of our algorithm are illustrated in Fig. 4-4.

Fig. 4-4: Feature extraction stage
