Robotic Vision for Fast 3D Object Recognition (II)


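National Science Council, Executive Yuan: Research Project Report

Robotic Vision for Fast 3D Object Recognition (II): Research Report (Condensed Version)

Project type: Integrated

Project number: NSC 100-2221-E-011-057-

Project period: August 1, 2011 to July 31, 2012 (ROC years 100 to 101)

Executing institution: Department of Mechanical Engineering, National Taiwan University of Science and Technology

Principal investigator: 徐繼聖 (Gee-Sern Hsu)

Project participants: Master's student and part-time assistant: 林鈺山; Master's student and part-time assistant: 胡凱閎; Master's student and part-time assistant: Trung Tan

Report attachments: report on attending an international conference, and the published paper

Public access: this project involves patents or other intellectual property rights; open to public query after 1 year

November 6, 2012 (ROC year 101)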


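Chinese abstract: This project is the second-year phase that continues the Phase I project "Robotic Vision for Fast 3D Object Recognition" carried out in the previous year (ROC year 99). Its aim is to develop a vision system that can quickly recognize 3D objects under different viewpoints and lighting conditions. This year's (ROC year 100) research builds on Phase I and continues to develop more mature object recognition techniques. The results can be described in three directions. The first uses a Kinect sensor to acquire color and depth images: color information is first used to segment the objects in the scene, and depth information is then used to integrate the 3D information in the segmented regions and build accurate 3D scene and object models. Preliminary results were published at Automation 2012. The second is the design and implementation of a real-time object recognition system that combines appearance-based features with a Naive Bayes classifier. This study first evaluates the performance of appearance features and local invariant features in object recognition, and then selects the best appearance feature and the most suitable classifier based on the evaluation. The results have been accepted by ICPR 2012, a major conference in pattern recognition, and one reviewer considered the work the most complete performance evaluation of appearance features to date. A more complete version has been submitted to MVA (J. Machine Vision and Applications). The third continues the dice dot recognition that had produced partial results in the previous phase, and takes on recognition of multiple colors and at large angles. The new results were published at SMC 2012, a well-known conference in human-computer interaction, and a complete version has also been submitted to MVA. This report describes these three results in turn.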

Chinese keywords: RGBD camera, scene reconstruction, object recognition, stereo vision

English abstract: This is the second phase of the research on Robotic Vision for Fast 3D Object Recognition. Based on the outcomes of the first phase, new progress has been made in advancing the technology for recognizing generic objects. The progress can be described in three studies. The first is 3D scene and object segmentation using RGBD images. The color-based scene segmentation is used as a prior for achieving a refined segmentation when the depth is taken into account. The result is presented at Automation 2012. The second is a study on using appearance-based features with a Bayesian classifier, which is experimentally shown to be the most compelling approach in a comparative study covering various appearance features and classifiers for generic object recognition. The result will be presented at ICPR 2012, a major conference in pattern recognition. The third is the extension of the dice recognition studied in the first phase to dice of multiple colors and to views with large tilt angles. The outcome of this study has been presented at SMC 2012, a major conference in human-computer systems and cybernetics.

The progress in these three studies is summarized in this report.

English keywords: RGBD camera, scene reconstruction, object recognition, stereo vision


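National Science Council, Executive Yuan: Funded Research Project ■ Final Report □ Interim Progress Report

Robotic Vision for Fast 3D Object Recognition, Phase II

Project type: ■ Individual project □ Integrated project

Project number: NSC 100-2221-E-011-057

Project period: August 1, 2011 to July 31, 2012 (ROC years 100 to 101)

Principal investigator: 徐繼聖 (Gee-Sern Hsu)

Project participants: 林鈺山, 胡凱閎, Trung Tan Loc

Report type (submitted according to the approved budget list): ■ Condensed report □ Full report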

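This report includes the following required attachments:

□ Data sheet of R&D results available for promotion

□ One report on overseas travel or study

□ One report on travel or study in mainland China

□ One report on attending an international academic conference, and one copy of each paper presented

Handling: except for industry-academia cooperative research projects, projects for upgrading industrial technology and talent cultivation, controlled projects, and the cases below, the report may be opened to public query immediately.

□ Involves patents or other intellectual property rights; open to public query after ■ one year □ two years

Executing institution: National Taiwan University of Science and Technology

October 31, 2012 (ROC year 101)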


Abstracts in Chinese and English

Introduction

Research Objectives

Research Methods

Literature Review

Self-Evaluation of Project Results

Report on Attending an International Conference


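Introduction

This is the second-year project in the development of "Robotic Vision for Fast 3D Object Recognition". Building on the results of the first year, it continues to improve the vision techniques for fast recognition of 3D objects. The second-year results can be described in three directions. The first uses the color and depth images acquired with a Kinect sensor: color information is first used to segment the scene and objects in 2D, and depth information is then used to integrate and analyze the 3D information in the scene and to carry out segmentation in 3D, so as to build accurate 3D scene and object models. Preliminary results were published at Automation 2012. The second is the development and design of a real-time object recognition system that combines appearance-based features with a Naive Bayes classifier; the method was formulated from the results of a detailed evaluation of the object recognition performance of appearance features and local invariant features. This work has been accepted by ICPR (IEEE Int'l Conference on Pattern Recognition), one of the major conferences in pattern recognition, where one reviewer considered it the most complete evaluation to date of appearance features for object recognition; it will be presented in mid-November. A more complete version has been submitted to MVA (Journal of Machine Vision and Applications). The third continues the dice dot recognition that had produced partial results in the previous phase, and takes on recognition of multiple colors and at large angles. The new results were published at SMC (IEEE Int'l Conf on Systems, Man and Cybernetics), a well-known conference in human-computer interaction, and a complete version has also been submitted to MVA. This report brings these three results together.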

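Research Objectives

The objective of this project is to develop a machine vision system that can be integrated with a robot for fast 3D object recognition. The first year was a sub-project under an integrated project; the second year was approved as an individual research project. The second-year goal is to develop a real-time 3D object recognition vision system that builds object models from the image features obtained in the first year and then incorporates a depth sensor (Kinect) to improve the accuracy and reliability of the recognition system. In line with this objective, the project planned and carried out the three work items listed in the Introduction, which were set according to the following aims: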

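1. Object segmentation in 3D scenes using the RGBD images provided by Kinect: unlike earlier stereo vision techniques based only on 2D images, an RGBD sensor provides both a 2D color image and a 2D depth image, so accurate depth values of the scene and objects can be obtained. This greatly reduces the errors that can arise during object segmentation or during feature extraction against complex backgrounds.

2. Determining the appearance features and classifier best suited to generic object recognition: although appearance features are widely used in recognizing specific objects such as faces and vehicles, few publications examine their performance and bottlenecks in recognizing generic objects. This research is the first to study this basic question in depth, from which an effective combination of appearance feature and classifier is determined and a real-time object recognition system is built.

3. Dice dot recognition under complex variables: dice dot recognition is a key object recognition technique needed by entertainment robots and can be extended to the recognition of other special objects. This phase therefore continues to improve the original method and takes on the parameters that must be considered in real applications, such as dice of different colors and widely varying viewpoints.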


Research Methods

The methods proposed in this research are detailed in the following papers and summarized below:

1. A Comparison Study on Appearance-Based Object Recognition, accepted to IEEE Int'l Conf. Pattern Recognition (ICPR 2012), Tsukuba, Japan, Nov. 11-15, 2012.

2. A Comparative Study on Appearance-based Object Recognition with Silhouette Alignment for Object Segmentation, submitted to Journal of Machine Vision and Applications.

3. Color and Illumination Invariant Dice Recognition, Proc. IEEE Int'l Conf. Systems, Man, and Cybernetics (SMC 2012), Seoul, Korea, Oct. 14-17, 2012.

4. Stereo Vision and Invariant Features for Dice Recognition in Uncontrolled Conditions, submitted to Journal of Machine Vision and Applications.

5. 3D Scene Segmentation and Modeling Using RGBD Images, accepted to Automation 2012, Taoyuan, Taiwan, Nov. 2-3, 2012.

Abstract of Papers 1 and 2 (A Comparative Study on Appearance-based Object Recognition with Silhouette Alignment for Object Segmentation; Paper 2 is the journal extension of Paper 1)

The features extracted for object recognition can be split into two categories, local invariant and appearance-based. The former is commonly selected for the recognition of generic objects, while the latter is a popular choice for the recognition of specific objects, for example, faces. Because most specific objects can be aligned to appearance features, the deviations from the aligned features offer some appearance characteristics good for recognition. Such an alignment can be difficult to define in generic objects. Therefore, the works on the recognition of generic objects using appearance features in the literature are significantly outnumbered by those using local invariant features. The performance of many appearance features and associated classifiers on face recognition has been widely studied and reported; however, their performance on generic object recognition is only studied in a limited scope. To extend our understanding in this regard and be able to determine the appearance features and classifiers good for generic object recognition, this paper reports a comprehensive comparison study in which different combinations of appearance features and classifiers are evaluated on a benchmark database. To detect and segment the object of interest from a scene, which is often the first step in object recognition, we propose a scheme, called Silhouette Alignment, to align the features extracted from a test image to those in the database.

Although the appearance features considered in this study are holistic, the comparison also includes SIFT (Scale Invariant Feature Transform), one of the most popular local invariant features for object recognition, as a reference for the performance of the appearance features and associated classifiers. Experiments on the COIL-100 database show that DCT features with a Naive Bayesian classifier give the best performance among the combinations tested for recognition across viewpoints. SIFT outperforms most appearance features when the image is blurred by noise. However, a few classifiers with appearance features outperform SIFT in both noise-free conditions and cluttered backgrounds.
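The combination reported as best above, DCT features with a Naive Bayesian classifier, can be illustrated with a minimal sketch. This is not the implementation used in the papers: the number of retained coefficients (`k`), the Gaussian likelihood model, the variance floor, and the toy 4 x 4 images are all assumptions made for illustration, and the code uses only the Python standard library.

```python
import math

def dct2(image):
    """Unnormalized 2D DCT-II of a small grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0.0
            for y in range(h):
                for x in range(w):
                    s += (image[y][x]
                          * math.cos(math.pi * (2 * y + 1) * u / (2 * h))
                          * math.cos(math.pi * (2 * x + 1) * v / (2 * w)))
            out[u][v] = s
    return out

def dct_features(image, k=2):
    """Keep the k x k lowest-frequency DCT coefficients as a holistic feature vector."""
    coeffs = dct2(image)
    return [coeffs[u][v] for u in range(k) for v in range(k)]

class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian per feature and class (assumed model)."""

    def fit(self, X, y, var_floor=1e-3):
        self.stats = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            # The variance floor keeps constant features from producing zero variance.
            variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                         for col, m in zip(zip(*rows), means)]
            self.stats[label] = (math.log(n / len(y)), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, means, variances = self.stats[label]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                for v, m, s in zip(x, means, variances))
        return max(self.stats, key=log_posterior)

# Hypothetical toy classes: images bright on the left half vs. bright on the top half.
def left_bright(a):
    return [[a, a, 0, 0] for _ in range(4)]

def top_bright(a):
    return [[a] * 4, [a] * 4, [0] * 4, [0] * 4]

train = [left_bright(9), left_bright(10), top_bright(9), top_bright(10)]
labels = ["left", "left", "top", "top"]
model = GaussianNaiveBayes().fit([dct_features(im) for im in train], labels)
prediction = model.predict(dct_features(left_bright(8)))
```

In this toy setting the two classes differ only in which first-order DCT coefficient is non-zero, so a dimmer unseen image of the same pattern is still assigned to the right class.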

Abstract of Papers 3 and 4 (Stereo Vision and Invariant Features for Dice Recognition in Uncontrolled Conditions; Paper 4 is the journal extension of Paper 3)

A system is proposed for automatically reading the number of dots on dice in general table game settings. Unlike previous dice recognition systems, which recognize dice of a specific color using a single top-view camera in an enclosure with controlled settings, the proposed system uses multiple cameras to recognize dice of various colors posed over a wide range of viewing angles and under uncontrolled conditions. It is composed of three modules. Module-1 locates the dice using gradient-conditioned color segmentation (GCCS), proposed in this paper, to segment dice of arbitrary colors from the background. Module-2 exploits local invariant features suitable for building homographies across multiple views and lighting conditions. The homographies are used to enhance coplanar features and weaken non-coplanar features, giving a way to segment the top faces of the dice and recover the features ruined by possible specular reflection. To identify the dots on the segmented top faces, an MSER detector is embedded in Module-3 for its consistency in locating the dot regions regardless of illumination and viewpoint variations. Experiments show that the proposed system performs satisfactorily in various test conditions.

Abstract of Paper 5 (3D Scene Segmentation and Modeling Using RGBD Images)

Unlike stereo-image-based scene segmentation and modeling, which depends strongly on image feature correspondences across different views, the proposed method exploits the color and depth readings from an RGBD camera and segments the scene using color and depth features. The color is used as a prior for the initial segmentation, and the depth follows as a cue to split or merge the color-segmented regions. Mean shift is applied to segment the colors in the scene. The color-segmented scene is then processed by a plane search scheme based on RANSAC (Random Sample Consensus) that can remove the measurement noise in the depth data and identify planes of various gradients. The depth, along with the identified gradient, is then processed by moving windows of various sizes so that the split of a windowed region or the merge of different regions can be determined. Experiments reveal that the proposed method outperforms scene segmentation and modeling by either the color or the depth image alone.
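The RANSAC-based plane search mentioned above can be sketched as follows. This is a generic RANSAC plane fit on 3D points, not the paper's scheme: the inlier threshold, iteration count, random seed, and toy point set are assumptions, and the mean shift color segmentation and moving-window split/merge steps are not shown.

```python
import random

def plane_from_points(p, q, r):
    """Return (a, b, c, d) with a*x + b*y + c*z + d = 0 and unit normal, or None if collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # The plane normal is the cross product of two edge vectors.
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p[0] + b * p[1] + c * p[2])

def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Keep the plane hypothesis supported by the most points within `threshold`."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c * pt[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers

# Hypothetical toy data: a 5 x 5 grid on the plane z = 1 plus five off-plane points.
plane_points = [(0.1 * i, 0.1 * j, 1.0) for i in range(5) for j in range(5)]
outliers = [(0.05, 0.15, 1.5), (0.2, 0.3, 2.0), (0.35, 0.1, 2.4),
            (0.1, 0.4, 3.0), (0.3, 0.3, 1.8)]
plane, inliers = ransac_plane(plane_points + outliers)
```

Points rejected as outliers correspond to the depth measurements that do not belong to the detected plane; in a full pipeline the search would be repeated on the remaining points to find further planes.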


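Literature Review

(Excerpted from the papers above to present in full the literature review that was carried out.)

Literature review of Papers 1 and 2 (A Comparative Study on Appearance-based Object Recognition with Silhouette Alignment for Object Segmentation; Paper 2 is the journal extension of Paper 1)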

Object recognition can generally be split into two phases, feature extraction and classification. Commonly used features include local interest regions and appearance-based features. The former aim at capturing local regions that are invariant, or more precisely covariant, to affine transformations [1]. The latter aim at capturing the holistic visual properties using appearance characteristics [2, 3, 4, 5]. It is interesting to observe that local interest regions are mostly applied to the recognition of generic objects [6], whereas appearance-based features are often exploited in the recognition of specific objects, particularly faces [3, 4]. This can be attributed to the following:

• Alignment can be well defined by appearance features: most specific objects can be aligned to appearance features. Faces, for example, are commonly aligned to the eyes, and the deviations from the aligned features offer important characteristics for recognition. However, appearance features suitable for alignment can be difficult or even impossible to define in generic objects.

• A region of interest (ROI) with limited views can be easily defined on specific objects: many specific objects are considered only from a relatively limited set of viewpoints. For faces, again, mostly frontal to profile poses are of concern. However, there can be many more poses to consider for a generic object. The detection and segmentation of a specific object can thus be performed using the appearance features from limited views, but on generic objects it is better performed using local interest regions, which cover a large number of different views.

The above explains why, in the literature, the works on generic object recognition using appearance features are significantly outnumbered by those using local invariant features. Among the few using appearance features for object recognition, several deserve attention. SVM is studied in [2] with feature vectors each formed by concatenating the columns of a given image, and is shown to be good at handling such high-dimensional features. The kNN classifier is used in [4] to determine the optimal linear representations of images for appearance-based object recognition. Gabor features are experimentally shown in [8] to be better than PCA and LDA features for object classification with cosine similarity measure and maximum correlation classifiers. A sparse representation of objects proposed in [5] to capture the object appearance is shown to be better than PCA in basis storage and in handling new samples added to the training set. However, few, if any, offer an in-depth performance evaluation of appearance features for generic object recognition. This study is one of the very few on such an evaluation. Our comparison also includes the SIFT feature [6], one of the most popular local invariant features for object recognition, so that the performance of appearance-based features and classification methods can be better assessed.

The impacts of the following parameters are investigated: (1) Viewpoint, to recognize the object from views different from those available for feature extraction and classifier training; (2) Complex background, to recognize the object when it is in a cluttered scene; (3) Blur, to recognize the object when its image is blurred by noise.

When appearance features are considered, many works skip the detection, segmentation and size normalization of the target object in a given image, and assume that the object has been located, normalized in size, and made ready for feature extraction [2, 4, 5, 8]. In real life, however, a training set is often provided for determining features and training a classifier, and the trained classifier is evaluated on a disjoint test set or query set. To better handle real-life scenarios, we assume that detection, segmentation and normalization must be solved for an image from the test set before recognition is carried out, and we propose a method, Silhouette Alignment, for handling this issue.

[1] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, T. Kadir, and L. J. V. Gool, “A comparison of affine region detectors,” IJCV, vol. 65, no. 1-2, pp. 43–72, 2005.

[2] M. Pontil and A. Verri, “Support vector machines for 3D object recognition,” TPAMI, vol. 20, no. 6, pp. 637–646, June 1998.

[3] A. M. Martinez and A. C. Kak, “PCA versus LDA,” TPAMI, vol. 23, no. 2, pp. 228–233, Feb. 2001.

[4] X. Liu, A. Srivastava, and K. Gallivan, “Optimal linear representations of images for object recognition,” TPAMI, vol. 26, no. 5, pp. 662–666, 2004.

[5] T. V. Pham and A. W. M. Smeulders, “Sparse representation for coarse and fine object recognition,” TPAMI, vol. 28, no. 4, pp. 555–567, April 2006.

[6] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, pp. 91–110, 2004.

[7] R. O. Duda and P. E. Hart, Pattern Classification, 2nd ed., Wiley Interscience, 2000.

[8] I. Buciu, I. Nafornita, and I. Pitas, “Gabor features for rotation invariant object classification,” in ICCP, 2008, pp. 41–46.

Literature review of Papers 3 and 4 (Stereo Vision and Invariant Features for Dice Recognition in Uncontrolled Conditions; Paper 4 is the journal extension of Paper 3)

Unlike existing dice recognition systems [1, 2, 3], which use a single top-view camera and work only in an enclosure with controlled settings, the proposed system exploits stereo views and works in general uncontrolled settings, including dice of various colors, different illumination conditions, and various camera viewpoints. In controlled settings, edges are the prominent features considered for dice recognition. But specular reflection, often observed under uncontrolled illumination, makes approaches based solely on edges ineffective. Besides being unable to recognize dice under various illumination conditions, the existing systems can consider only the top surfaces of the dice, which also prevents their use in general conditions. To capture the top surfaces, the camera must be aligned with the dice. However, such an alignment can be very difficult to maintain in a table game, as the banker may roll the dice and place the dice roller back at a spot misaligned with the camera.

To broaden its application scope, the proposed system also considers various viewpoints of the dice captured by multiple cameras. Possible applications include integration with table games and new designs of computer-controlled dice games. When integrated with table games, it can help or replace human bankers in offering services such as counting and collecting wagers. Because non-top views reveal both the top and side surfaces of the dice, the top surfaces must be segmented and the side surfaces removed or ignored. To meet this requirement, the proposed system exploits stereo vision with invariant features to extract the top surfaces and identify the dots on them.

[1] Kuo-Yi Huang, “An auto-recognizing system for dice games using a modified unsupervised grey clustering algorithm,” Sensors, 2008.

[2] Wen-Yuan Chen and Chin-Ho Chung, “Image automatic-recognition scheme for dice game using structure features technique,” Optical Engineering, 2011.

[3] Y.-N. Lai, S.-T. Hsu, C.-Y. Wang, and M.-T. Tsai, “Method for recognizing dice dots,” U.S. Patent No. 2009/0263008 A1, October 2009.

Literature review of Paper 5 (3D Scene Segmentation and Modeling Using RGBD Images)

Image-based scene segmentation and modeling can use the intensity and color extracted from 2D images or the range data given by depth images. Most previous works use one or the other, and few are based on both. As RGBD cameras such as Microsoft Kinect and Asus Xtion become popular, many researchers are focusing on combining color/intensity and depth for scene segmentation.

Silberman and Fergus [1] use a CRF-based model on different representations of the depth image for dense scene labeling. Their experiments demonstrate that the depth information gives a large performance gain over a method limited to intensity information only. Several works infer the geometric structure after scene segmentation. Zhang et al. [2] estimate geometric structure solely from the dense depth maps captured by a moving vehicle. The accuracy of semantic segmentation depends on the depth map quality and can be improved with better input data. Silberman et al. [3] segment the major surfaces in indoor scenes and infer the support relations between objects and surfaces, which are needed in robotics and scene analysis.

Recently, RGBD images have been applied to plane detection, for example in [4] and [5]. Guan et al. [4] segment 3D planes by iteratively refining a plane function that considers both appearance and geometry information from an RGBD camera. Holz et al. [5] extract the local surface normals from an RGBD camera, and segment the planes in normal space and distance space.

This research takes advantage of both the color channel and the depth channel. The former enables color-based segmentation, and the segmented regions are further split or merged depending on the depth and distance information offered by the latter. Our preliminary results are better than those obtained using either channel alone. We are now focusing on further improving the accuracy and reducing the computational cost.

[1] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” ICCV Workshops 2011, 601-608.

[2] Chenxi Zhang, Liang Wang, and Ruigang Yang, “Semantic Segmentation of Urban Scenes Using Dense Depth Maps,” ECCV (4) 2010, 708-721.

[3] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” ECCV 2012.

[4] Li Guan, Ting Yu, Peter Tu, and Ser-Nam Lim, “Simultaneous Image Segmentation and 3D Plane Fitting for RGB-D Sensors - An Iterative Framework,” CVPR Workshop 2012.

[5] Dirk Holz, Stefan Holzer, Radu Bogdan Rusu, and Sven Behnke, “Real-Time Plane Segmentation Using RGB-D Cameras,” RoboCup 2011, 306-317.


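Self-Evaluation of Project Results

Taking the following table as the self-evaluation standard:

Grade: evaluation criterion

Outstanding (特優): Most of the work planned in the original proposal was completed, and the work received international recognition (for example, well-known journals or patents) or attracted wide industry attention.

Excellent (優): Most of the work planned in the original proposal was completed, and the work received broad recognition, such as major international conferences in the field or related patents.

Good (佳): Most of the work planned in the original proposal was completed, and the work received some recognition, such as general conferences.

Fair (可): Most of the work planned in the original proposal was completed, but the work has not yet been recognized by other parties.

Poor (欠佳): Most of the work planned in the original proposal was not completed.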

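The results of this project include two international conference papers, one domestic conference paper, and two journal papers under review. Of the international conferences, ICPR (IEEE Int'l Conference on Pattern Recognition) is one of the important conferences in pattern recognition and image processing, and SMC (IEEE Int'l Conf on Systems, Man and Cybernetics) is one of the important conferences in human-computer interaction. Although AUTOMATION is a domestic conference, it is the most important academic conference on automation in Taiwan. Taking into account also the two journal papers under review, the results are self-rated as Excellent (優).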


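Report on Attending an International Conference

Gee-Sern Hsu

Because the international conference papers accepted this year are all for conferences held after October 2012, and this year's overseas travel subsidy had already been used for the CAIP conference at the end of August 2011, the following report still covers the experience of attending CAIP. The paper presented at CAIP was already counted among the results in last year's (ROC year 99) report and is not counted again here.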

I attended CAIP 2011, held in Seville, Spain, on Aug. 29-31, and presented the paper Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features. According to the conference ranking at http://www.cs.ucla.edu/~eklee/paper/CS_conf_rank.htm and a few other websites, CAIP is considered a fine conference in the area of computer vision. It is rated 0.84 on a unit scale, compared with ICCV (0.96) and ICIP (0.71), at http://perso.crans.org/~genest/conf.html.

Quite a few attendees were interested in how we determine matches across different views, given that features from different dice show the same characteristics. Some were impressed by the extraction of features across dice, on top of the features within each die, using invariant features. Many considered our work an interesting application of invariant features to the entertainment and game technology sector. I attended several talks on face recognition, object recognition, kernel methods and stereo vision. Among those I have kept in contact with, the work by Herrera, a Ph.D. student from the University of Oulu, is closely related to the continuing phase of this research. He proposes a simplified method to calibrate the Kinect, a depth and color camera pair. We talked about possible collaboration in the near future, and he offered me the package he developed and introduced at the conference. Since the second phase of this research, already in progress, will exploit the depth and color cameras in establishing 3D perception, Herrera's toolbox will be studied and compared with other tools available on the web. We expect this interaction to open up collaborative research opportunities that benefit both parties.

Attachment 1

Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features

Gee-Sern Hsu¹, Hsiao-Chia Peng, Chyi-Yeu Lin, and Pendry Alexandra

Department of Mechanical Engineering, National Taiwan University of Science and Technology,

[email protected]¹

Abstract. A system is proposed for recognizing the number of dots on dice in general table game settings. Unlike previous dice recognition systems, which use a single top-view camera and work only under controlled illumination, the proposed system uses multiple cameras and works under uncontrolled illumination. Under controlled illumination, edges are the prominent features considered by most approaches. But strong specular reflection, often observed in uncontrolled illumination, defeats approaches based solely on edges. The proposed system exploits local invariant features that are robust to illumination variation and suitable for building homographies across multiple views. The homographies are used to enhance coplanar features and weaken non-coplanar features, giving a way to segment the top faces of the dice and recover the features ruined by possible specular reflection. To identify the dots on the segmented top faces, an MSER detector is applied for its consistency in rendering local interest regions across large illumination variation. Experiments show that the proposed system achieves a high recognition rate in various uncontrolled illumination conditions.

Keywords: Object recognition, invariant feature, local descriptor.

1 Introduction

Dice is a popular table game in casinos, especially in Asia. As automatic or computer-controlled games are emerging and becoming popular, many are interested in technologies able to assist or replace human bankers. A computer vision system is proposed in this paper for dice recognition, which refers to the automatic recognition of the number of dots on dice, in normal table game settings. Unlike existing dice recognition systems, for example [4] and [5], which work under controlled illumination, the proposed system can work in uncontrolled illumination conditions. Under controlled illumination, edges are the prominent features considered. But specular reflection, often observed in uncontrolled illumination, defeats approaches based solely on edges. Fig. 1 shows in the middle an image with strong specular reflection, and on the left its edge map obtained by previous methods. Because it is not limited to controlled illumination, the proposed system allows a much wider scope of applications, e.g., integration with table games or different designs of automatic dice games.

Fig. 1. Middle: specular reflection on the dice; Left: the edge map obtained by previous methods; Right: the edge map obtained by the proposed method.

Existing dice recognition systems consider only the top view of the dice. But a top-view camera is difficult to install on a game table, as a specially designed camera support is needed. To enable easy integration with a game table, the proposed system considers tilted views of the dice captured by cameras held on peripheral supports around the table. Peripheral cameras are easier to install on a game table than top-view ones. However, whereas top views capture only the top faces of the dice, tilted views reveal both the top and side surfaces. The latter are harder to handle, as a method is required to segment the top faces and remove the side surfaces.

The proposed system consists of two major modules: dice segmentation and dot identification. To segment the dice, it exploits local invariant features that are robust to illumination variation and suitable for building homographies across multiple views. The homographies are used to enhance coplanar features, segment the top faces of the dice and recover the features ruined by possible specular reflection. To identify the dots on the segmented top faces, an MSER (Maximally Stable Extreme Region) [8] detector is applied for its consistency in rendering local interest regions across large illumination variation. The rest of this paper is organized as follows: dice segmentation is presented in Section 2. Dot identification is elaborated in Section 3. Section 4 presents an experimental study of the proposed methods, followed by a conclusion in Section 5.

2 Dice Segmentation Using Local Invariant Features

Because dice can lie at arbitrary locations and orientations on a dice roller base, and their apparent sizes vary slightly with the distance to the camera, local invariant features are explored to capture these variations. Many local invariant feature detectors have been proposed and applied in a broad range of applications. Reviews of these detectors can be found in [3], [9], and [10]. The invariant feature detectors can be generally categorized into three types [11]. The first detects corner-like features, e.g., the Harris-affine, Harris-Laplace, and multi-scale Harris detectors. The second detects blob-like features, e.g., Hessian-affine, Hessian-Laplace, multi-scale Hessian and Difference of Gaussians (DoG) [7]. Different from the former two types, region detectors extract homogeneous local areas, e.g., the MSER detector [8], which is used in this work for identifying the dots on dice and is addressed in detail in Sec. 3.

Fig. 2. Correspondences across two different views on the local invariant features detected by a multi-scale Harris-Hessian detector. Many of the detected correspondences are removed for better visual inspection.

Due to the limitation of Harris and Hessian detectors in handling multiple scales, both are modified with multiple scales and made scale-invariant in [1]. To determine the most appropriate scale for a local feature, Harris-Laplace and Hessian-Laplace both search for the characteristic scale with a Laplace operator added on top of the multiple scales. Harris-affine and Hessian-affine obtain the affine invariant corners or blobs by an iterative estimation of elliptical affine regions proposed by Lindeberg et al. [6]. The shape of the feature region is adapted to ensure that the same region is covered when extracted from a different viewpoint.

The performance of the aforementioned 8 invariant feature detectors in rendering the most accurate homographies between different viewpoints is evaluated by comparison with the ground truth obtained from manually selected correspondences. All of the invariant regions (or interest regions) are represented by the SIFT descriptor [7], as it is experimentally shown to be one of the most effective descriptors [10]. The match of the invariant features across views is measured by the Euclidean distance between the feature descriptors, and a threshold on this distance is determined to select correspondences. Because a dot on a die in a given view can appear quite similar to a different dot in another view, the scale factor in the local feature detectors is first chosen as the one that yields the maximum number of correct correspondences. RANSAC [2] is then applied to filter out outliers and determine the most appropriate homographies across different views from the matched correspondences. Our experiments reveal that the multi-scale Harris-Hessian detector gives the best performance. Fig. 2 shows an example of the correspondences across two viewpoints obtained using this detector. The settings and other details of the performance evaluation are reported in Section 4.
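The correspondence selection described above, nearest descriptor by Euclidean distance followed by a distance threshold, can be sketched as below. The function name `match_descriptors`, the threshold value, and the toy 4-dimensional descriptors are assumptions for illustration (real SIFT descriptors are 128-dimensional, and the paper does not state its threshold); the subsequent RANSAC outlier filtering is not shown.

```python
import math

def match_descriptors(desc_a, desc_b, max_distance):
    """For each descriptor in view A, keep its nearest neighbour in view B
    (Euclidean distance) if that distance is within max_distance.

    Returns a list of (index_in_a, index_in_b, distance) tuples.
    """
    matches = []
    for i, da in enumerate(desc_a):
        j, dist = min(((j, math.dist(da, db)) for j, db in enumerate(desc_b)),
                      key=lambda pair: pair[1])
        if dist <= max_distance:
            matches.append((i, j, dist))
    return matches

# Hypothetical toy descriptors for two views.
view_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]
view_b = [[0.0, 1.0, 0.1, 0.0], [1.0, 0.1, 0.0, 0.0]]
matches = match_descriptors(view_a, view_b, max_distance=0.5)
pairs = [(i, j) for i, j, _ in matches]
```

The third descriptor in view A has no close counterpart in view B and is dropped by the threshold. Matches between similar-looking dots on different dice would survive this step, which is why a geometric check such as RANSAC is still needed afterwards.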

Given N different viewpoints of dice images, N (N −1)/2 homographies would be obtained using the invariant feature correspondences. In most cases 2 ≤ N ≤ 4 suffices. Each homography and its inverse define the transformation between a

(19)

pair of different viewpoints, and such a transformation only works for the top faces of the dices as these surfaces are coplanar. This property motivates the stacking of coplanar surfaces to segment the top faces of the dices even when specular reflection appears in certain viewpoints. One can choose a dice image of any viewpoint as a reference image and transform the rest N − 1 images of different viewpoints to the reference one using the corresponding homographies.

Stacking the reference image and the N − 1 transformed images not only enhances the coplanar features but also weakens the non-coplanar features, as those on the lateral sides of the dice are overlapped with features from different planes. Since specular reflection can be considered a view-dependent feature, different from the coplanar features observed in the majority of the other views, it can be removed by imposing a threshold on a similarity measure. An example with N = 3 is shown in Fig. 1: the middle shows a view of the dice with strong specular reflection, and the right shows the edge map obtained by stacking the homography-transformed images from the other two views.
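The stacking idea can be sketched as a voting scheme over edge pixels. This is only an illustration under stated assumptions: the text does not specify the similarity measure, so a simple vote count stands in for it, and the names `apply_homography`, `stack_edges`, and `min_votes` are hypothetical.

```python
from collections import Counter

def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 homography H (nested lists, row-major)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def stack_edges(ref_edges, other_views, min_votes):
    """Vote for edge pixels in the reference frame.

    ref_edges:   set of (x, y) edge pixels in the reference view.
    other_views: list of (H, edges) pairs, H mapping that view into the reference.
    Returns the pixels supported by at least min_votes views.
    """
    votes = Counter(ref_edges)  # one vote per reference edge pixel
    for H, edges in other_views:
        # Warp each edge pixel into the reference frame and snap to the pixel grid.
        warped = {tuple(round(c) for c in apply_homography(H, x, y)) for x, y in edges}
        votes.update(warped)
    return {p for p, n in votes.items() if n >= min_votes}

# Toy example with N = 3 views; (9, 9) and (7, 7) play the role of
# view-dependent artifacts such as specular highlights.
shift = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]      # pure translation by (2, 0)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ref = {(2, 0), (3, 0), (9, 9)}
others = [(shift, {(0, 0), (1, 0)}), (identity, {(2, 0), (3, 0), (7, 7)})]
stacked = stack_edges(ref, others, min_votes=2)
print(sorted(stacked))
```

Pixels that appear in only one view receive a single vote and fall below the threshold, which mirrors how view-dependent features are suppressed while coplanar features reinforce each other.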

3 Dot Identification and Dice Recognition

Given a segmented top face of a die, an MSER detector [8] is used to extract the dots from the segmented area because of its stability in rendering persistent or slowly varying edges around the dots as illumination varies. The extraction of MSERs considers the set of all possible thresholds that binarize an intensity image $I(x)$ into a binary image $E_{t_M}(x)$,

$$E_{t_M}(x) = \begin{cases} 1 & \text{if } I(x) \le t_M \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where $t_M$ is the threshold. An MSER is a connected region in $E_{t_M}(x)$ whose size changes little over a range of thresholds; it is extracted with a watershed-like segmentation algorithm. The homogeneous intensity regions extracted are stable over a wide range of thresholds. The number of thresholds over which the connected region remains similar in size is known as the margin of the region.
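The thresholding in Eq. (1) and the notion of margin can be illustrated with a small sketch. It follows a single seed pixel through a sweep of thresholds rather than building the full set of extremal regions as a real MSER detector does; the names `region_area`, `margin`, and `tol` are hypothetical.

```python
from collections import deque

def region_area(image, seed, t):
    """Area of the 4-connected component of {x : I(x) <= t} that contains seed."""
    rows, cols = len(image), len(image[0])
    r0, c0 = seed
    if image[r0][c0] > t:
        return 0  # the seed itself is not in the binarized region
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and image[nr][nc] <= t):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen)

def margin(image, seed, thresholds, tol=0.0):
    """Number of thresholds in the longest run over which the region area
    changes by at most tol (relative) from one threshold to the next."""
    areas = [region_area(image, seed, t) for t in thresholds]
    best = run = 0
    prev = None
    for a in areas:
        if a == 0:
            run = 0
        elif prev and abs(a - prev) <= tol * prev:
            run += 1
        else:
            run = 1
        best = max(best, run)
        prev = a
    return best, areas

# A dark 2x2 "dot" (intensity 20) on a bright background (intensity 200).
img = [
    [200, 200, 200, 200],
    [200,  20,  20, 200],
    [200,  20,  20, 200],
    [200, 200, 200, 200],
]
dot_margin, areas = margin(img, seed=(1, 1), thresholds=range(0, 256, 20))
print(dot_margin, areas)
```

The dot keeps an area of 4 pixels for every threshold from 20 to 180 and only merges with the background at 200, so its margin is large, which is the property that makes dark dots on a bright face good MSER candidates.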

The dots on dice are blob-like objects, and MSER usually anchors on the boundaries of such objects, so the dots can be located better by MSER than by other interest region detectors. With some preprocessing, such as histogram equalization, MSER can achieve a highly accurate identification rate. Fig. 3 shows a case with the segmented top faces and the regions detected by MSER before and after preprocessing. Note that MSER can detect incomplete or partial interest regions, which can be due to imperfect segmentation.

Fig. 3. The performance of MSER in the identification of the dots. (a) Segmentation of top faces. (b) Regions detected before preprocessing. (c) Regions detected after preprocessing.

The dots identified by MSER are clustered by k-means (with k equal to the number of dice), subject to the constraints that the number of dots in a cluster must be less than 7 and the distance between the farthest dots must be less than the diagonal of the die. The spatial distribution of the dots in each cluster must be verified against the 6 known patterns. For example, a 6-dot pattern must contain two parallel rows of three dots each. A 5-dot pattern must have two crossing rows of three dots each, crossing at the same central dot. Specific patterns are configured for the 4-, 3-, and 2-dot cases. Depending on the number of dots in a given cluster, the distribution pattern for that number is examined first, and if it is found incompatible, two possibilities are verified: a non-dot spot has been falsely taken as a dot, or a valid dot has failed to be identified. A large number of casts and experiments, with details given in Section 4, reveal that this combination of size-constrained clustering and spatial pattern confirmation yields a high recognition rate.
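The size constraints and one of the pattern checks can be sketched as follows. This is an illustrative reading of the rules above, not the original implementation: the function names, the tolerance `tol`, and the point-symmetry test used for the 5-dot pattern are assumptions.

```python
import math
from itertools import combinations

def cluster_is_plausible(dots, dice_diagonal):
    """Size constraints: fewer than 7 dots, and the two farthest dots
    closer together than the diagonal of a die face."""
    if not 1 <= len(dots) <= 6:
        return False
    spread = max((math.dist(p, q) for p, q in combinations(dots, 2)), default=0.0)
    return spread < dice_diagonal

def looks_like_five(dots, tol):
    """5-dot pattern: one dot at the centre, and the other four forming two
    rows of three that cross at it (each outer dot mirrored through the centre)."""
    if len(dots) != 5:
        return False
    cx = sum(x for x, _ in dots) / 5
    cy = sum(y for _, y in dots) / 5
    centre = min(dots, key=lambda p: math.dist(p, (cx, cy)))
    if math.dist(centre, (cx, cy)) > tol:
        return False  # no dot sits at the centroid
    others = [p for p in dots if p != centre]
    for x, y in others:
        mirror = (2 * centre[0] - x, 2 * centre[1] - y)
        if not any(math.dist(mirror, q) <= tol for q in others):
            return False  # this outer dot has no partner across the centre
    return True

five = [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2)]
broken = [(0, 0), (2, 0), (1, 1), (0, 2), (3, 3)]
print(cluster_is_plausible(five, dice_diagonal=3.0), looks_like_five(five, tol=0.2))
print(looks_like_five(broken, tol=0.2))
```

A cluster that fails its pattern check would then be re-examined under the two hypotheses named in the text, a spurious dot or a missed dot, for example by testing the patterns for one dot fewer and one dot more.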

4 Experiments

The experimental setup follows a common dice table game, Sic Bo, with three dice; three cameras with different viewpoints are installed on the sides of a game table. Twelve different illumination conditions are configured to study the performance of the proposed system, 3 of them chosen as the training set and the remaining 9 as the test set, as shown in Fig. 4. The average intensity on the dice in the training set is 67, 108, and 138, in 8-bit gray scale, with deviations of 8, 10, and 11, respectively. The average intensity in the test set is between 45 and 158, with deviations from 7 to 12. Under each illumination condition, 120 random cast sessions and 30 manual placement sessions are carried out. The manual placement attempts to create special layouts of the dice, such as three dice in a row.

4.1 Homography based on Local Invariant Features

Fig. 4. First column from the left is the training set with 3 illumination conditions; the rest is the test set with 9 illumination conditions.

The training set is used to evaluate the 8 invariant feature detectors mentioned in Section 2 in creating homographies with the least error across different illumination conditions. The error $E^{(a,b)}_{F_i}$ is measured by the difference between the correspondences from the invariant-feature-based homography $H^{(a,b)}_{F_i}$ and those from the ground truth $H^{(a,b)}_G$ obtained using manually selected correspondences, i.e.,

$$E^{(a,b)}_{F_i} = \frac{\left\| \left( H^{(a,b)}_{F_i} - H^{(a,b)}_G \right) x^b_{F_i} \right\|}{N_{F_i}} \tag{2}$$

where $H^{(a,b)}_{F_i}$ is the homography that transforms the invariant features $x^b_{F_i}$, detected by the invariant feature detector $F_i$ in the image $I_b$, to the corresponding ones in $I_a$; $H^{(a,b)}_G$ is the ground-truth homography obtained from manually selected correspondences between $I_a$ and $I_b$; $N_{F_i}$ is the number of features detected by $F_i$; and a, b denote two different viewpoints.
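One way to read Eq. (2) in code is sketched below. It assumes that the norm is accumulated as the sum of per-feature Euclidean distances between the two projections, taken after dividing out the homogeneous coordinate; the text does not spell this out, so that interpretation and the names `project` and `homography_error` are assumptions.

```python
import math

def project(H, pt):
    """Apply the 3x3 homography H (nested lists) to the 2-D point pt."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def homography_error(H_F, H_G, features):
    """Average distance between where H_F and H_G send each feature of view b."""
    if not features:
        raise ValueError("at least one feature is required")
    total = sum(math.dist(project(H_F, p), project(H_G, p)) for p in features)
    return total / len(features)

# Toy example: the estimated homography is off by 1.5 pixels vertically.
H_G = [[1, 0, 3], [0, 1, 0.0], [0, 0, 1]]   # ground truth: shift by (3, 0)
H_F = [[1, 0, 3], [0, 1, 1.5], [0, 0, 1]]   # estimate:     shift by (3, 1.5)
feats = [(10, 20), (40, 5), (7, 7)]
err = homography_error(H_F, H_G, feats)
print(err)
```

Dividing the result by an image dimension would give a normalized error of the kind plotted against scale, although the normalization actually used is not stated in the text.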

Additionally, it is desirable that the correspondences from the feature-based homographies be consistent across different scales, as some features change with scale. To investigate which features are better than others in rendering the desired homographies across illumination and scale, the original images of 320 × 240 pixels are scaled down to smaller sizes, and the error is computed at each size and averaged over the three illumination conditions in the training set. Fig. 5 shows this comparison. The smallest scale, 128 × 96, reveals relatively high errors, indicating that some details between the dice are lost at such a small scale and thus the accuracy of the homography estimation is degraded.

Fig. 5. Normalized error of feature-based homography across scales and three illumination conditions.

Among the eight invariant feature detectors we tested, the multi-scale Harris-Hessian detector gives the lowest error at 0.87%, which is about 1.7 pixels in a 192 × 144 image.

4.2 Dice Identification

The performance evaluation on the 9 test sets reveals the following observations and results:

– As long as the correspondences from the feature-based homography are consistent over at least two scales, the average match error can be kept below or near 1%, and the top faces of the dice are perfectly segmented in all tested conditions.

– Two identification rates are measured in each test illumination condition: one is the identification of the dots and the other is the identification of the dot number on each die. The former is shown by the left bar and the latter by the right bar at each indexed illumination condition in Fig. 6. Because the MSER dot detector has been adjusted on the training set to a zero miss rate at the cost of additional false positives, the imperfections in the dot identification in Fig. 6 are all caused by false positives. For example, in the brightest illumination condition, indexed "1", 1.8% (= 100% − 98.2%) of the dots identified are false positives. All false positives are found to be caused by specular reflection or insufficient lighting. As the intensity of the illumination increases, specular reflection becomes stronger, causing more false positives to appear.

– The combination of size-constrained clustering and spatial pattern confirmation effectively removes the false positives and yields high dice recognition rates in all tested conditions, as shown by the right bar at each indexed illumination condition in Fig. 6.

Fig. 6. Identification rates in 9 illumination conditions, indexed from 1 to 9; at each index the left bar shows the rate of dot identification, and the right bar shows the rate of dice number identification.

5 Conclusion

A solution with invariant features across multiple views is proposed for dice recognition under uncontrolled illumination. An extensive comparison of the performance of various invariant feature detectors in rendering correct homographies under various test conditions and parameters shows that the multi-scale Harris-Hessian detector is the best, and better than the commonly selected SIFT features.

The homographies built on the multi-scale Harris-Hessian features are exploited to enhance the coplanar features and weaken the non-coplanar features on the dice. This leads to the extraction of the coplanar features and the segmentation of the top faces of the dice even when the features, observed from some viewpoint, are ruined by specular reflection. An MSER detector is applied to identify the dots on the top faces, followed by a pattern-specific confirmation of the spatial distribution of the dots. Experiments reveal that, although false positives of dots are observed in a few cases, such as under strong or insufficient illumination, the numbers of dots on the dice can still be recognized accurately by the proposed solution.

References

1. Dufournaud, Y., Schmid, C., Horaud, R.: Matching images with different resolutions. In: CVPR. pp. 1612–1618 (2000)

2. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

3. Hsu, G.S.J.: Advances in Theory and Applications of Stereo Vision, chap. 7: Stereo Correspondence with Local Descriptors for Object Recognition, pp. 129–150. InTech (2011)

4. Huang, K.Y.: An auto-recognizing system for dice games using a modified unsupervised grey clustering algorithm. Sensors 8(2), 1212–1221 (2008)

5. Lai, Y.N., Hsu, S.T., Wang, C.Y., Tsai, M.T.: Method for recognizing dice dots. U.S. Patent No. 2009/0263008 A1 (October 2009)

6. Lindeberg, T., Gårding, J.: Shape-adapted smoothing in estimation of 3-d shape cues from affine deformations of local 2-d brightness structure. Image Vision Comput. 15(6), 415–434 (1997)

7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

8. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC (2002)

9. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65(1-2), 43–72 (2005)

10. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)

11. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision 3(3), 177–280 (2007)

Report on Attendance at an International Academic Conference

Gee-Sern Hsu

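The papers accepted at international conferences this year are all for conferences held after October 2012, and the overseas travel subsidy in this year's research budget was already used for the CAIP conference at the end of August 2011. The report below therefore still covers the CAIP attendance. The paper presented at CAIP was already included in the results report for last year (ROC year 99) and is not listed again in this report.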

I attended CAIP 2011, held in Seville, Spain on Aug 29~31, and presented the paper "Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features". According to the conference ranking at http://www.cs.ucla.edu/~eklee/paper/CS_conf_rank.htm and a few other websites, CAIP is considered a fine conference in the area of computer vision. It is given 0.84 on a unity scale, compared with ICCV (0.96) and ICIP (0.71), at http://perso.crans.org/~genest/conf.html.

Quite a few attendees were interested in more details about how we determine matches across different views, as features from different dice reveal the same characteristics. Some were impressed by the extraction of features across dice, on top of the features within each die, using invariant features. Many considered our work an interesting application of invariant features to the entertainment and game technology sector. I attended several talks on face recognition, object recognition, kernel methods and stereo vision. Among those I kept in contact with, the work by Herrera, a Ph.D. student from the University of Oulu, is closely related to the continuing phase of this research. He proposes a simplified method to calibrate Kinect, the depth and color camera pair. We talked about possible collaboration in the near future, and he offered me the package he developed and introduced at the conference. Since the second phase of this research will exploit the depth and color cameras in establishing 3D perception, which is already in progress, Herrera's toolbox will be studied and compared with other tools available on the web. We expect this interaction to initiate collaborative research opportunities that benefit both parties.

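Promotion Data Sheet for R&D Results Derived from NSC-Funded Projects

Date: 2012/11/05

NSC-funded project

Project title: Robotic Vision for Fast 3D Object Recognition (II)
Principal investigator: Gee-Sern Hsu (徐繼聖)

Project number: 100-2221-E-011-057-
Discipline: Automation system integration technology

No R&D results available for promotion.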

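Summary Table of Research Results for the Year-100 Research Project

Principal investigator: Gee-Sern Hsu (徐繼聖)
Project number: 100-2221-E-011-057-

Project title: Robotic Vision for Fast 3D Object Recognition (II)

Quantitative results. Columns: result item; number achieved (accepted or published); expected total (including the number achieved); actual contribution of this project; unit; remarks (qualitative notes, e.g. results shared by several projects, or results featured as a journal cover story).

Result item                               Achieved   Expected   Contribution   Unit

Domestic
  Publications
    Journal papers                            0          0         100%        papers
    Research/technical reports                3          3         100%        papers
    Conference papers                         1          1         100%        papers
    Books                                     0          0         100%
  Patents
    Pending                                   0          1         100%        cases
    Granted                                   0          0         100%        cases
  Technology transfer
    Cases                                     0          1         100%        cases
    Royalties                                 0          0         100%        thousand NT$
  Project personnel (Taiwanese nationals)
    Master's students                         2          2         100%        person-times
    Ph.D. students                            0          1         100%        person-times
    Postdoctoral researchers                  0          0         100%        person-times
    Full-time assistants                      0          0         100%        person-times

International
  Publications
    Journal papers                            0          2         100%        papers
    Research/technical reports                0          0         100%        papers
    Conference papers                         2          2         100%        papers
    Books                                     0          0         100%        chapters/volumes
  Patents
    Pending                                   0          1         100%        cases
    Granted                                   0          0         100%        cases
  Technology transfer
    Cases                                     0          0         100%        cases
    Royalties                                 0          0         100%        thousand NT$
  Project personnel (foreign nationals)
    Master's students                         1          1         100%        person-times
    Ph.D. students                            0          0         100%        person-times
    Postdoctoral researchers                  0          0         100%        person-times
    Full-time assistants                      0          0         100%        person-times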

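Other results

(Results that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international collaborations, the international impact of the research results, and other concrete benefits to the development of industrial technology; please describe in text.)

None

Additional items for Department of Science Education projects

Result item                                                  Quantity   Name or brief description
Test instruments (qualitative and quantitative)                 0
Courses/modules                                                 0
Computer and network systems or tools                           0
Teaching materials                                              0
Activities/competitions held                                    0
Seminars/workshops                                              0
E-newsletters, websites                                         0
Participants (audience) in the promotion of project results     0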

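Self-Evaluation Form for the Results Report of NSC-Funded Research Projects

Please give an overall evaluation covering the consistency of the research content with the original plan, the achievement of the expected goals, the academic or practical value of the research results (briefly describe the significance, value, and impact of the results, or the possibility of further development), whether the results are suitable for publication in academic journals or for patent application, and the main findings or other relevant value.

1. Overall evaluation of the consistency of the research content with the original plan and the achievement of the expected goals

■ Goals achieved

□ Goals not achieved (please explain, 100 characters max)

□ Experiment failed

□ Experiment interrupted for some reason

□ Other reasons: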

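2. Publication of the research results in academic journals, patent applications, and so on:

Papers: ■ Published □ Unpublished manuscript □ In preparation □ None
Patents: □ Granted □ Pending ■ None
Technology transfer: □ Completed ■ Under negotiation □ None
Other: (100 characters max)

The results can be described in three directions. The first uses Kinect to acquire color and depth images, first segmenting the scene by color and then integrating the 3D information in the segmented regions using depth. The results were presented at Automation 2012. The second is real-time object recognition using appearance-based features combined with a Naive Bayes classifier: the performance of appearance features in object recognition is evaluated first, and the best appearance feature is then selected. The results have been accepted by ICPR 2012, a major conference in pattern recognition, and a more complete version has been submitted to MVA (J. Mach. Vision & Appl.). The third continues the dice dot recognition research of the previous phase, taking on performance with multiple colors and large tilt angles. The results were presented at SMC 2012, a well-known conference in human-computer interaction, and a complete version has been submitted to MVA.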

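3. Please evaluate the academic or practical value of the research results in terms of academic achievement, technical innovation, social impact, and so on (briefly describe the significance, value, and impact of the results, or the possibility of further development) (500 characters max)

Please see the uploaded results report.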