4. Proposed Methodology
4.2. Image Descriptors
4.3.2. Recognition
國
立立 政 治 大
㈻㊫學
•‧
N a tio na
l C h engchi U ni ve rs it y
Fig. 4-4: Feature extraction stage
For feature extraction, we divide the input image into 4 × 4 overlapping sub-regions, with the overlap set to 50%. Each sub-region is further divided into four concentric rectangles, which do not overlap one another. Gradient features in 16 orientations are then extracted from these four concentric rectangles, and a weighted combination of the features from the individual rectangles yields one 16-orientation gradient vector per sub-region. The 16-orientation vectors from the 16 sub-regions are concatenated to form a feature vector of 256 dimensions. This design increases the robustness of feature extraction when the input character is deformed or skewed.
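The layout of the sub-regions and the resulting feature dimension can be illustrated with a minimal sketch. The function name `sub_region_boxes` and the 100 × 100 image size are assumptions made for illustration only; the weights used to combine the concentric rectangles are not specified in the text and are therefore not modeled here.

```python
def sub_region_boxes(width, height, grid=4, overlap=0.5):
    """Return (left, top, right, bottom) boxes for a grid x grid layout in
    which neighbouring windows overlap by the given fraction."""
    # With stride = (1 - overlap) * window, `grid` windows span
    # window * (1 + (grid - 1) * (1 - overlap)) pixels along each axis.
    span = 1 + (grid - 1) * (1 - overlap)
    win_w, win_h = width / span, height / span
    step_w, step_h = win_w * (1 - overlap), win_h * (1 - overlap)
    boxes = []
    for row in range(grid):
        for col in range(grid):
            left, top = col * step_w, row * step_h
            boxes.append((left, top, left + win_w, top + win_h))
    return boxes


ORIENTATIONS = 16  # gradient orientation bins per sub-region (from the text)

boxes = sub_region_boxes(100, 100)        # hypothetical 100 x 100 character image
feature_dim = len(boxes) * ORIENTATIONS   # 16 sub-regions x 16 orientations
```

For a 100-pixel side, each window is 40 pixels wide and starts every 20 pixels, so the four windows along an axis cover the full side with 50% overlap.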
4.3.2. Recognition
We adopt a support vector machine (SVM) to train on the training instances and to classify the testing instances. The problem can be viewed as N-category classification, where N is the number of distinct character classes. The benefit of the SVM is that the recognition model can be built in advance, i.e., the model can be trained off-line. The recognition engine executes efficiently because classification is simply a matter of finding the region of the feature space in which the feature vector is located. However, an SVM usually returns only the best match. In our work, we wish to retrieve the top k matches so that domain-specific knowledge can be incorporated through the vocabulary selector component to further improve the accuracy. As reported in [33], it is possible to record the likelihood information between categories when building the SVM model. Therefore, generating the top candidates becomes feasible, and the probability of each candidate can be taken into consideration to increase the matching confidence in the later stage.
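Once per-class probabilities are available, retrieving the top k candidates reduces to a sort. The following is a minimal sketch; the function name `top_k_candidates` and the probability values are hypothetical, and the probabilities are assumed to come from a probabilistic SVM as described above.

```python
def top_k_candidates(class_probs, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical per-class probabilities for one character image.
probs = {"A": 0.05, "B": 0.60, "C": 0.25, "D": 0.10}
candidates = top_k_candidates(probs, k=3)
```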
The result of the recognition stage is a list of candidates, together with their probabilities, for each character image. In the vocabulary selector stage, the goal is to search for the most suitable term among the different combinations of candidates across the character images (Fig. 4-5).
Fig. 4-5: Compute the probability of a vocabulary from recognition results
This process can increase the matching confidence even when only a portion of the character images are recognized correctly. To this end, we adopt a conservative but efficient approach: we collect the Chinese terms belonging to a category to form a domain-specific vocabulary model. In other words, we maintain different vocabulary models for different scenarios, e.g., Taiwanese snacks in our preliminary study. When searching for the most suitable term, the user has to specify the desired domain in advance. Once the recognition stage is complete, the list of candidates and their probabilities for each character image is used to calculate the scores of the terms that appear in the vocabulary model. Note that a term must exist in the repository to be scored. Here, a term’s score is defined as the sum of the probabilities of the corresponding candidate from each character image. For example, given a recognition result with a length of 4 characters, as depicted in Fig. 4-5, the vocabulary selector uses the candidates’ probabilities to exhaustively compute the scores of all terms of the same length within the selected model. The term with the highest score is the best answer for the request. In general, only terms whose length equals the number of character images are considered. Allowing a flexible range, such as plus or minus one in length, is more computationally demanding but can compensate for errors made in the segmentation stage.
The current scoring mechanism can be replaced with other approaches. For example, a ranking strategy substitutes each candidate’s rank (or order) for its probability, so that the lower the score, the better the term. A hybrid approach, in which only the top-N candidates for each character image are included in the score accumulation, has also been considered. This avoids accumulating a large number of candidates with very small likelihoods.
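The three scoring variants described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function names, the example vocabulary and candidate probabilities are hypothetical, and the penalty assigned to a character missing from a candidate list in rank mode (one position past the end of the list) is an assumption.

```python
def score_term(term, candidates, mode="prob", top_n=None):
    """Score a vocabulary term against per-image candidate lists.

    candidates[i] is a best-first list of (character, probability) pairs for
    the i-th character image. Returns None when the lengths differ.
    """
    if len(term) != len(candidates):
        return None
    total = 0.0
    for ch, cand in zip(term, candidates):
        if top_n is not None:
            cand = cand[:top_n]          # hybrid: keep only the top-N candidates
        chars = [c for c, _ in cand]
        if mode == "prob":
            total += dict(cand).get(ch, 0.0)   # sum of candidate probabilities
        else:
            # Rank mode: 1 for the best candidate; a missing character gets
            # an assumed penalty of one position past the end of the list.
            total += (chars.index(ch) + 1) if ch in chars else (len(chars) + 1)
    return total


def best_term(vocabulary, candidates, mode="prob", top_n=None):
    """Return the best-scoring term of matching length, or None."""
    scored = [(t, score_term(t, candidates, mode, top_n)) for t in vocabulary]
    scored = [(t, s) for t, s in scored if s is not None]
    if not scored:
        return None
    pick = max if mode == "prob" else min   # lower is better in rank mode
    return pick(scored, key=lambda ts: ts[1])[0]


# Hypothetical vocabulary (Taiwanese snacks) and recognition output.
vocabulary = ["臭豆腐", "蚵仔煎", "珍珠奶茶"]
candidates = [
    [("奧", 0.5), ("臭", 0.4), ("蚵", 0.1)],
    [("豆", 0.7), ("仔", 0.2), ("頭", 0.1)],
    [("腐", 0.6), ("煎", 0.3), ("府", 0.1)],
]
answer = best_term(vocabulary, candidates)
```

In this example the top-1 candidate of the first image is wrong, yet the term 臭豆腐 still obtains the highest probability sum, which is the effect the vocabulary selector is meant to achieve.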
5. Performance Evaluation
To evaluate the performance of the proposed recognition systems, we conduct experiments using publicly available benchmark data as well as datasets we collected ourselves. In this chapter, we first provide a general description of the data. We then report the results of testing the proposed algorithms on these datasets.
5.1. Data Collection
We collected two datasets for visual search and character recognition, respectively.
Further details are provided in the following subsections.
5.1.1. Visual Search Dataset
In the area of mobile location recognition, the authors of [34] have made great efforts to push the research forward by publishing extensive datasets, focusing on scenes in the San Francisco area. Along similar lines, we build our dataset using images of popular Taiwan landmarks.
5.1.1.1. Field Study
To observe the image sources that foreign visitors might encounter upon arriving in Taiwan, we conducted a quick field study by collecting brochures from two important transportation hubs, namely Taipei Main Station and Taiwan Taoyuan International Airport (Fig. 5-1). We gathered 88 brochures from the train station and 71 brochures from the airport. To identify the distribution of public attractions in Taiwan, we classified these brochures by county, as illustrated in Fig. 5-2.
Fig. 5-1: Brochures
Fig. 5-2: Statistics of travel brochures collected in the train station and at the airport
At the train station, over fifty percent of the brochures introduce Taipei’s attractions. The free travel booklets distributed at the airport cover more cities and counties. This result guided our collection of experimental materials.
5.1.1.2. Taiwan Landmark Image Set
Our target users are likely to visit more than one area or county during their stay in Taiwan. Therefore, our strategy is to collect famous tourist destinations around the island by referring to the official website of the Taiwan Tourism Bureau. Sample images are shown in Fig. 5-3.
Fig. 5-3: Images from the Taiwan landmark database
Based on the above guideline, we collected images of popular landmarks from tour books, travel brochures, official websites and photo-sharing networks. A total of 9530 images from 50 landmarks have been gathered. The dataset covers all major areas in Taiwan (refer to Fig. 5-4). On average, each landmark has about 190 images.
Fig. 5-4: Attraction distribution of Taiwan landmark database
• Test dataset:
We randomly select 20 images from each category to form the test set, giving 1000 test images in total.
• Training dataset:
The remaining 8530 images are used to train the support vector machine using a mixture of weighted gist and average ENN features.
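The per-category split can be sketched as below. The function name `split_per_class`, the fixed random seed and the toy data are assumptions made for illustration; only the rule of drawing 20 test images per category comes from the text.

```python
import random


def split_per_class(images_by_class, test_per_class=20, seed=0):
    """Randomly hold out a fixed number of images per class for testing."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    train, test = [], []
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        test += [(img, label) for img in shuffled[:test_per_class]]
        train += [(img, label) for img in shuffled[test_per_class:]]
    return train, test


# Toy data: 3 hypothetical landmarks with 25 image names each.
images_by_class = {
    f"landmark_{i}": [f"img_{i}_{j}.jpg" for j in range(25)] for i in range(3)
}
train, test = split_per_class(images_by_class)
```

Applied to the full dataset, 50 categories with 20 test images each yield 1000 test images, leaving 9530 − 1000 = 8530 images for training.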
5.1.2. Intelligent Character Recognition Dataset
In a typical pattern recognition problem based on supervised learning, two separate datasets are required, namely a training dataset and a test dataset. The training dataset is used to train a model; when test data are presented for recognition, the classifier compares them against what it has learned from the training data and reports the classification result. The training and test datasets we adopted are as follows.
• Test dataset:
We have collected 100 character images: 60 are generated by computer and 40 are taken from real signboards (Fig. 5-5). After pruning redundant characters, a total of 326 distinct characters have been recorded.
Fig. 5-5: Test data (a) generated by computer (b) real signboards
Table 5-1: Test data information
Independently Generated by Computer    Real Signboard    Number of Instances
60                                     40                100
• Training dataset:
We include 945 words in the training dataset initially. These words are related to the night market topic. After pruning redundant characters, the dataset contains 612 distinct Chinese characters.
For each character, we include six different fonts (Fig. 5-6) to increase the number of training samples. For each font, five versions with different rotation angles, each with and without noise, are considered. Since there are sixty variations (6 × 5 × 2) for each character, the total number of samples in the training set increases to 36720 (612 × 60).
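The sample count can be verified by enumerating the variations. In the sketch below, the font names and rotation angles are placeholders, since the text specifies only how many of each are used.

```python
from itertools import product

fonts = [f"font_{i}" for i in range(6)]   # six fonts (placeholder names)
rotations = [-10, -5, 0, 5, 10]           # five angles (hypothetical values)
noise = [False, True]                     # without / with noise

# Every (font, rotation, noise) combination is one rendering of a character.
variations = list(product(fonts, rotations, noise))

distinct_characters = 612                 # from the text
total_samples = distinct_characters * len(variations)
```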
Fig. 5-6: Six different Chinese fonts
5.2. Experimental Results
In this section, we evaluate the performance of visual search and intelligent character recognition on the datasets described above. At the end, we give a quick demonstration of the routing information.
5.2.1. Visual Search
A total of 9530 images from 50 landmarks were used in the experiments. On average, each landmark has about 190 images. We randomly select 20 images from each category to form the test set. The remaining 8530 images are used to train the support vector machine using a mixture of weighted gist and average ENN features. It is possible to expand the database by adding different levels of noise, applying Gaussian blur or scaling, as shown in Fig. 5-7.
Fig. 5-7: Generating more samples by adding noise, applying blur and resizing (a) Presidential Office (b) Yeh-Liu
For the weighted gist descriptor, we compute the vector from the R, G, and B channels separately, for a total of 960 dimensions in the weighted gist representation. For average ENN, parameter selection is important; we discuss experimental results under different parameter settings in Section 5.2.1.1. Here, we first partition the input image into 8×8 blocks, thereby generating a feature vector of 64 dimensions. The percentage of edges retained (q) is set to 10%. The neighborhood for computing the effective number of neighbors is set to 7×7.
The results are summarized in Table 5-2 using different linear combinations of $p_{\mathrm{weighted\_gist}}$ and $p_{\mathrm{ENN}}$. It turns out that $p_{\mathrm{weighted\_gist}}$ plays a more important role during the matching process. However, the overall accuracy does improve when the results from both modules are integrated. The recognition rate is comparable to the results reported in previous research articles, with top-1 matches approaching 66% and top-3 matches near 80%.
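The linear combination of the two probability outputs can be sketched as follows. The exact form is an assumption: it is written as α · p_weighted_gist + (1 − α) · p_ENN, which matches the convention of Table 6-1, where α = 0 corresponds to ENN only and α = 1 to weighted gist only. The function names and probability values are hypothetical.

```python
def fuse_scores(p_gist, p_enn, alpha):
    """Combine per-landmark probabilities: alpha * gist + (1 - alpha) * ENN."""
    labels = set(p_gist) | set(p_enn)
    return {
        label: alpha * p_gist.get(label, 0.0) + (1 - alpha) * p_enn.get(label, 0.0)
        for label in labels
    }


def rank_landmarks(fused, k=3):
    """Return the k landmark labels with the highest fused score, best first."""
    ordered = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ordered[:k]]


# Hypothetical classifier outputs for one query image.
p_gist = {"Presidential Office": 0.5, "Yeh-Liu": 0.3, "Landmark C": 0.2}
p_enn = {"Presidential Office": 0.2, "Yeh-Liu": 0.6, "Landmark C": 0.2}

top_mixed = rank_landmarks(fuse_scores(p_gist, p_enn, alpha=0.7))
top_enn_only = rank_landmarks(fuse_scores(p_gist, p_enn, alpha=0.0))
```

With α = 0.7 the weighted gist output dominates and the first landmark is ranked highest, whereas with α = 0 the ranking follows the ENN output alone.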
Table 5-2: Recognition rate using Taiwan landmark dataset
Accuracy
5.2.1.1. Setting Different Parameters in ENN
In the previous section, we partitioned the input image into 8×8 blocks and computed the effective number of neighbors using a 7×7 window to obtain the matching results shown in Table 5-2. However, different numbers of blocks result in different feature dimensions, for example, 16 dimensions for 4×4 blocks and 256 dimensions for 16×16 blocks. The size of the neighborhood for calculating ENN also has a direct impact on both the estimated value and the computation time. Parameter selection is therefore a crucial factor in evaluating recognition accuracy and efficiency.
We experiment with different parameter settings in an attempt to obtain a proper combination. Table 5-3 to Table 5-5 summarize the results for different block sizes and neighborhood sizes.
results is about 2 percent. According to these results, and taking computational resources into account, we set the partition to 8×8 and the neighborhood size to 7×7.
5.2.1.2. Comparison of Individual and Hybrid Approaches
In this research, we not only use weighted gist in combination with the average ENN descriptor, but also evaluate the individual descriptors and other combinations. Specifically, we have experimented with the Quadratic-Chi (QC) distance proposed in [35]. The chi-square distance is sensitive to quantization effects caused by, for example, lighting variations or shape deformations. Quadratic-Chi therefore takes cross-bin relationships into account to alleviate the quantization problem. In addition, its computation time is linear in the number of non-zero entries in the bin-similarity matrix. Results are summarized in Table 5-6.
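The cross-bin behavior of QC can be illustrated with a small sketch. It follows the Quadratic-Chi definition as it is commonly stated, with a bin-similarity matrix `sim` and a normalization exponent `m`; the function name, the default `m = 0.5` and the example histograms are assumptions and do not reflect the settings used in our experiments.

```python
import math


def quadratic_chi(p, q, sim, m=0.5):
    """Quadratic-Chi distance between histograms p and q.

    sim[i][j] is the similarity between bins i and j (1 on the diagonal).
    """
    n = len(p)
    # Normaliser for bin i: (sum_c (p[c] + q[c]) * sim[c][i]) ** m
    norm = [
        sum((p[c] + q[c]) * sim[c][i] for c in range(n)) ** m for i in range(n)
    ]
    # Normalised bin differences; 0/0 is defined as 0.
    d = [(p[i] - q[i]) / norm[i] if norm[i] > 0 else 0.0 for i in range(n)]
    # Only non-zero similarity entries contribute. This dense loop is O(n^2);
    # a sparse matrix representation would be needed for linear-time behaviour.
    total = sum(
        d[i] * d[j] * sim[i][j]
        for i in range(n)
        for j in range(n)
        if sim[i][j] != 0
    )
    return math.sqrt(max(total, 0.0))


p, q = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # no cross-bin similarity
related = [[1.0, 0.5], [0.5, 1.0]]    # neighbouring bins are half-similar

d_identity = quadratic_chi(p, q, identity)
d_related = quadratic_chi(p, q, related)
```

When mass shifts to a similar neighbouring bin, as a quantization effect would cause, the distance computed with `related` is smaller than the one computed with `identity`.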
Table 5-6: Comparison of individual and hybrid approaches
According to Table 5-6, the accuracy obtained with a single descriptor, be it gist, ENN or QC, is not very good, particularly for the Top1 results. Among the hybrid approaches, weighted gist + QC has the best accuracy, followed by weighted gist + ENN and then weighted gist. Regarding efficiency, QC is robust but slow. In our experiments with the current dataset, QC must be evaluated 1000 × 8530 times to obtain the distances between all test and training images, and the entire procedure takes over six days. In contrast, the weighted gist + ENN descriptor is forwarded to a probabilistic SVM to generate an ordered list of possible matches, and each query takes less than one second to return the classification results. The computational time for QC vs. ENN is summarized in Table 5-7.
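A rough per-distance cost can be derived from the figures quoted above. The calculation below treats six days as a lower bound, since the text states only that the procedure takes over six days; the variable names are illustrative.

```python
queries, training_images = 1000, 8530
pairs = queries * training_images          # QC distances to compute

six_days_seconds = 6 * 24 * 3600           # lower bound on the total time
seconds_per_pair = six_days_seconds / pairs
seconds_per_query = six_days_seconds / queries
```

Under this bound, each QC distance takes at least about 0.06 seconds and each query at least about 518 seconds, compared with less than one second per query for the SVM-based approach.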
Table 5-7: Comparison of computational complexity
Training Matching
5.2.2. Intelligent Character Recognition
As mentioned in Section 5.1.2, the training set contains a total of 36720 samples, which are used to train an SVM with an RBF kernel. We also independently collected 100 character images, of which 60 are generated by computer and 40 are taken from real signboards, to evaluate the recognition performance. In the experiments, the top-3 accuracy is 83%, the top-10 accuracy is 89% and the top-20 accuracy is 92%. If the results are further combined with the vocabulary selector, the final outcome can be improved.
Table 5-8: Accuracy of character recognition
Top3 Top10 Top20
Accuracy 0.83 0.89 0.92
5.2.3. Routing Implementation on Mobile Platform
After recognition at the backend, we wish to return not only the recognized image or textual information, but also the associated geographic information. Toward this goal, we employ the Google Maps API to implement the navigation function. The result is demonstrated in Fig. 5-8.
The GPS detects the present location, which serves as the starting point, and the returned location data are used as the destination (Fig. 5-8(a)). Users can also choose the mode of transportation, such as driving, public transportation or walking. With this information, we present a map view to provide a quick overview of the route (Fig. 5-8(b)). There is also a list view that contains turn-by-turn directions to the destination (Fig. 5-8(c)).
Fig. 5-8: Snapshots of routing implementation
authors argue that image search is more challenging than text search, since average users are accustomed to typing short keywords to search for a specific document, whereas for images this task is not easy. In recent years, many researchers have tried to address this problem. This chapter introduces approaches proposed by other researchers and presents a comparative analysis using a publicly available benchmark dataset.
6.1. Benchmark Database – Oxford Buildings Dataset
We first introduce a publicly available dataset with ground truth, the Oxford Buildings Dataset [37]. The Oxford dataset contains 5062 high-resolution (1024×768) images automatically retrieved from Flickr by searching for particular Oxford landmarks.
Some examples are shown in Fig. 6-1. The collection has been manually annotated to generate a comprehensive ground truth for 11 different landmarks. For each landmark, images are labeled as Good, OK, Bad or Junk.
1) Good – A nice, clear picture.
2) OK – More than 25% of the object is clearly visible.
3) Bad – The object is not present.
4) Junk – Less than 25% of the object is visible, or there are very high levels of occlusion or distortion.
Fig. 6-1: Images from Oxford benchmark
6.2. Experiments on Oxford Building Dataset
In our experiment, we use all Good and OK images as our dataset, totaling 568 images for 11 landmarks. We randomly select 3 images from each category as query data, giving 33 queries, and use the remaining images as training data. The matching results of the proposed mixture of global features approach, with the same parameter settings, are summarized in Table 6-1.
Table 6-1: Recognition rate using Oxford dataset
α                       Top1    Top3    Top5    Top10
ENN (α = 0)             0.4545  0.5152  0.6667  0.9394
α = 0.5                 0.5758  0.6364  0.8182  1.0000
α = 0.6                 0.5758  0.6667  0.8485  1.0000
α = 0.7                 0.5455  0.6970  0.8485  1.0000
α = 0.8                 0.5455  0.7576  0.8485  1.0000
α = 0.9                 0.5455  0.7576  0.8485  1.0000
Weighted gist (α = 1)   0.5455  0.7879  0.8485  1.0000
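The Top-k recognition rates in Table 6-1 follow the usual definition: a query counts as correct if its true landmark appears among the first k returned matches. A minimal sketch is given below; the function name and the toy rankings are hypothetical.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of queries whose true label is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Toy example: three queries, all with true label "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
toy_top1 = top_k_accuracy(ranked, truths, 1)
toy_top2 = top_k_accuracy(ranked, truths, 2)

# With 3 queries for each of 11 landmarks there are 33 queries, so each rate
# in the table should be a multiple of 1/33.
enn_top1 = round(15 / 33, 4)
mixed_top1 = round(19 / 33, 4)
```

The reported values are consistent with this query count: for example, 0.4545 corresponds to 15 correct queries out of 33 and 0.5758 to 19 out of 33.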
6.3. Other Approaches on Oxford Dataset
The authors of [36] noted the advantages and disadvantages of local and global features and chose the PCA-SIFT local feature as their principal feature descriptor. According to [7], PCA-SIFT is faster than SIFT, but SIFT outperforms PCA-SIFT. Therefore, the authors also utilize shape context to describe the local configuration. Table 6-2 summarizes their experimental results under different parameter settings.
Table 6-2: Recognition results on Oxford dataset [36]. The parameters are the scale of the shape context in pixels and the number of segmentations in scale/angle.
Shape Context    Recognition Rate
200px, 2/4       0.4779
approximate nearest neighbor in large-scale image database [38]. The authors also experimented with isolated and hybrid features to evaluate the performance separately, including the original BOF, Hamming embedding (HE), weak geometric consistency constraints (WGC) and multiple assignment (MA) approaches. Experimental results using the Hessian-Affine detector and SIFT are shown in Table 6-3.
Table 6-3: Recognition of Oxford dataset using different methods [38]
Methods Recognition Rate
BOF 0.384
HE 0.489
HE + weights 0.507
HE + weights + MA 0.561
WGC, no prior 0.404
WGC 0.462
HE + WGC + weights 0.545
HE + WGC + weights + MA 0.615
As Table 6-3 shows, the recognition rates of the isolated and hybrid methods differ considerably. The best recognition rate is 61.5%, which is achieved by combining four methods. As such, combining multiple features boosts recognition accuracy effectively.
7. Conclusion and Future Work
We have proposed an innovative idea of integrating mobile visual search, intelligent character recognition and location-based services to provide a convenient tool for route planning. The tool can assist users who are unfamiliar with the local area in searching for destinations of interest and can provide suggestions for planning the trip.
We have surveyed the current status of related apps on several digital distribution channels, which motivated us to take part in real projects including HuayuNavi and iConference. The practical experience gained in these two projects has helped to lay a good foundation for this research.
The proposed framework deals with the object recognition problem in two aspects: mobile visual search and intelligent character recognition. The motivation is to distinguish landmark photos and textual photos within a large collection of images and texts. Since efficiency is vital in mobile visual search, we follow a client-server architecture to guarantee performance. We employ a mixture of global features to perform image search on the server side. Regarding text information, we have also developed an intelligent character recognition system using a novel feature that captures the overall structure as well as the distribution of the strokes. The proposed feature is designed specifically to address mixed-type characters.
Experimental results using different parameters, together with the comparative analysis, have validated the efficacy of our proposed strategy.
Mobile applications are in great demand nowadays. In our research, all the components necessary to build a total solution are ready, with special emphasis on the recognition engines. We have yet to integrate these modules into a seamless mobile application.
A standalone mobile application is an alternative if the quality of the network connection is poor. However, the scope must be drastically reduced if we wish to satisfy precision and user experience at the same time.
[3] T. C. Government. (June 2011). Taipei-Free. Available: http://www.tpe-free.taipei.gov.tw/TPE/
[4] UDN. (2012/04/15). 一機在手 跟著「旅遊雲」玩遍全世界 [One device in hand: tour the whole world with the "travel cloud"]. Available: http://mag.udn.com/mag/digital/storypage.jsp?f_ART_ID=383884
[5] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
[6] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," Computer Vision–ECCV 2006, pp. 404-417, 2006.
[7] L. Juan and O. Gwun, "A comparison of SIFT, PCA-SIFT and SURF," International Journal of Image Processing, vol. 3, pp. 143-152, 2009.
[8] V. Chandrasekhar, S. S. Tsai, G. Takacs, D. M. Chen, N. M. Cheung, Y. Reznik, R. Vedantham, R. Grzeszczuk, and B. Girod, "Low Latency Image Retrieval with Embedded Compressed Histogram of Gradient Descriptors."
[9] V. Chandrasekhar, D. M. Chen, A. Lin, G. Takacs, S. S. Tsai, N. M. Cheung, Y. Reznik, R. Grzeszczuk, and B. Girod, "Comparison of local feature descriptors for mobile visual search," 2010, pp. 3885-3888.
[10] Y. Cao, H. Zhang, Y. Gao, X. Xu, and J. Guo, "Matching Image with Multiple Local Features," 2010.
[11] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," 2006, pp. 2161-2168.
[12] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, R. Vedantham, R. Grzeszczuk, and B. Girod, "Fast geometric re-ranking for image-based retrieval," 2010, pp. 1029-1032.
[13] S. S. Tsai, D. Chen, V. Chandrasekhar, G. Takacs, N. M. Cheung, R. Vedantham, R. Grzeszczuk, and B. Girod, "Mobile product recognition," 2010, pp. 1587-1590.
[14] S. S. Tsai, D. Chen, J. P. Singh, and B. Girod, "Rate-efficient, real-time CD cover recognition on a camera-phone," 2008, pp. 1023-1024.
[15] D. Chen, S. Tsai, C. H. Hsu, J. P. Singh, and B. Girod, "Mobile augmented reality for books on a shelf," 2011, pp. 1-6.
[16] S. S. Tsai, H. Chen, D. Chen, R. Vedantham, R. Grzeszczuk, and B. Girod, "Mobile Visual Search Using Image and Text Features."
[17] G. Inc. (2009). Google Goggles. Available: http://www.google.com/mobile/goggles/ - text
[18] Amazon. (2011). Flow powered by Amazon. Available: http://itunes.apple.com/us/app/flow-powered-by-amazon/id474664425?mt=8
[19] L. Earnest, "Machine reading of cursive script," in Proc. IFIP Congress,