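Rapid Task Definition and Fully Autonomous Execution Technology for Next-Generation Service Robots, Subproject 3: Robot Vision Technology for Fast Recognition of 3D Objects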


National Science Council, Executive Yuan: Research Project Results Report


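Research Results Report (Condensed Version)

Project type: Integrated

Project number: NSC 99-2221-E-011-098-

Project period: August 1, 2010 to July 31, 2011 (ROC years 99 to 100)

Host institution: Department of Mechanical Engineering, National Taiwan University of Science and Technology

Principal investigator: 徐繼聖 (Gee-Sern Hsu)

Project participants:
Master's student, part-time assistant: 郭昱慶
Master's student, part-time assistant: 卓佑霖
Master's student, part-time assistant: Pendry Ale

Report attachments: report on attending international conferences, and the published papers

Disposition: this project report is open to public access

October 26, 2011 (ROC year 100)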


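Research Project Funded by the National Science Council, Executive Yuan

Final report

□ Interim progress report

Robot Vision Technology for Fast Recognition of 3D Objects: Phase 1 Project

Project type: □ Individual project ■ Integrated project

Project number: NSC 99-2221-E-011-098-

Project period: August 1, 2010 to July 31, 2011 (ROC years 99 to 100)

Principal investigator: 徐繼聖 (Gee-Sern Hsu)

Project participants: 郭昱慶, 卓佑霖, Pendry Alexandra

Report type (submitted as required by the approved budget list): ■ Condensed report □ Full report

This report includes the following required attachments:

■ Data sheet of R&D results available for promotion

□ One report on an overseas business trip or study

□ One report on a business trip or study in mainland China

■ One report on attending international academic conferences and one copy of each published paper

Disposition: except for industry-academia cooperative research projects, research projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be opened to public access immediately

□ Involves patents or other intellectual property rights; open to public access after □ one year □ two years

Host institution: National Taiwan University of Science and Technology

October 10, 2011 (ROC year 100)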


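Abstract (Chinese version)

The purpose of this project is to develop a vision system that can quickly recognize 3D objects under different viewpoints and illumination conditions.

The research first reviews the invariant feature detectors and descriptors that have developed rapidly in recent years, summarizes which object characteristics each type of invariant feature is suited to recognize, identifies the features that are more robust to changes in illumination and viewpoint, and applies them to the recognition of a real 3D object, dice.

Dice recognition will be a key technology for one type of entertainment robot. The main steps are separating the dice from the background, segmenting the dice faces, verifying coplanarity, and detecting and recognizing the dots. Experiments show that certain invariant features can effectively separate foreground from background, segment dice faces of different orientations, and verify coplanarity. They help establish the transformations between viewpoints and are suitable for feature matching across viewpoints.

These invariant features also tolerate large changes in lighting, which helps object recognition under different illumination conditions. This research also applies invariant features to 3D scene modeling and to character segmentation in license plates, both with positive results. The concrete outcomes of this project are one book chapter, three international conference papers, two domestic conference papers, one pending patent application, and one journal paper under review.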

English Abstract

The objective of this research is to develop a monocular vision system for fast 3D object recognition that is robust to illumination and viewpoint variation. Invariant feature detectors and descriptors are studied for their potential in this development. An extensive review has been completed that summarizes the findings and comments reported by many researchers who made substantial contributions to the study of invariant features and interest regions in the last decade. Guided by this review, several invariant features are exploited and tested in an automatic dice reading system, which involves the segmentation of 3D objects, the dice, and recognition of the dots on the dice. This dice recognition system will be the core part of an entertainment robot. It is experimentally confirmed that some invariant features are better than others in rendering consistent matches across viewpoints and illumination conditions, leading to a fast and robust dice recognition solution. Invariant features are also applied in character segmentation and 3D scene modeling, both yielding satisfactory performance. This research has led to five conference papers, one book chapter, one patent application, and one journal paper under review.

Keywords: Invariant feature, covariant feature, interest region, local descriptor, stereo vision, object recognition.


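Introduction

This is the first-year project for developing "robot vision technology for fast recognition of 3D objects". As planned in the original proposal, the R&D focus is a single-camera vision system that uses feature points in 2D images from different viewpoints to define the features of a 3D object under those viewpoints, and then uses the transformation matrix computed from the viewpoint change to decide which feature points to add or delete.

The experiments focus on testing how well 3D object registration and recognition perform when environmental parameters change, and on optimizing the recognition core to meet the needs of fast registration and real-time recognition. The work actually carried out comprises three tasks: (1) reviewing and studying the invariant or covariant feature detectors and descriptors published in the last ten years, compiling the related performance evaluations, and selecting the better performers for experimental confirmation; (2) applying the better-performing invariant features to 3D object extraction and dice dot recognition (dot recognition has been studied by our team for several years, and this project is the first to apply invariant features to segmenting the different planes of a 3D die and to extracting and matching coplanar features); (3) applying the invariant features selected in the first two tasks to the recognition of other 3D objects.

The main results of the first year are as follows: (1) a book chapter was completed that compiles and summarizes the invariant features published in recent years and their performance evaluations, providing reference directions for continued research on 3D object recognition; (2) invariant features were used to extract the coplanar regions of several dice and to detect the spatial distribution of the dots under different viewpoints and lighting, from which the dots can be read accurately and their 3D attributes recognized; the papers have been published at domestic and international conferences, and a patent has been applied for; (3) the invariant features shape context and MSER were applied to pedestrian detection, 3D scene reconstruction, and character segmentation, all with positive results; the related results have been published at an international conference and in theses, and new submissions that incorporate the latest experimental results are in progress.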

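Research Objectives

The purpose of this project is to develop a machine vision system that can be integrated with a robot for fast 3D object recognition. The original plan spans three years. The first-year goal is to develop a single-camera vision system and define image features that can recognize 3D objects under different viewpoints and illumination conditions, and then to use these features for fast registration and recognition of 3D objects.

In line with these objectives, the project planned and carried out the three tasks listed in the Introduction. The tasks were drawn up for the following purposes:

1. Research on image invariant feature detectors and descriptors has developed rapidly in recent years, with many published results and many performance comparison reports. This study should examine these published results in depth and infer how different types of invariant features relate to the object categories they are or are not suited to recognize.

2. Use image invariant features for object recognition under different viewpoints, illumination conditions, and distances.

3. To extend the impact of this research, invariant features will be applied to related topics the team is currently working on, such as 3D scene reconstruction, pedestrian detection, and face and license plate recognition.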

Research Methods:

The methods proposed in this research are detailed in the following papers:

1. Stereo Correspondence with Local Descriptors for Object Recognition, Chapter 7, Advances in Theory and Applications of Stereo Vision, ed. Asim Bhatti, 129~150, Jan 2011, InTech, ISBN: 978-953-307-516-7.

2. Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features, Proc. 14th Int’l Conf Computer Analysis of Images and Patterns (CAIP), vol. 2, 188~195, Seville, Spain, Aug. 2011.

3. (Best paper award) Invariant Features for Dice Recognition Across Illumination, Proc. 2nd IEEE Int’l Conf Multimedia Technology (ICMT), 3112~3115, Hangzhou, China, July 2011.

4. License Plate Recognition for Categorized Applications, Proc. IEEE Conf Vehicular Electronics and Safety (ICVES), 220~225, Beijing, China, July 2011.

Parts of the above were also presented at the domestic conferences CVGIP 2011 and Automation 2011.

Abstract of the first paper (a book chapter, hence richer in content)

Many methods on local descriptors consider each image from stereo or multiple views a single instance without exploring much of the relationship between these instances, ending up with models of multiple independent instances. Using such a model for object recognition is like matching between a training image and a test image.

This chapter, however, is especially interested in models that integrate the information across multiple training images. The central concern is how to extract local features from stereo or multiple images so that the information from different views can be integrated in the modeling phase and applied in the recognition phase.

This chapter is organized as follows. A few promising affine invariant region detectors are first reviewed in Section 2. Many invariant feature detectors have been proposed in the last decade. Because the detected features are invariant to variations in viewpoint, scale, illumination, and other variables, they serve well for establishing correspondences across images. Section 3 reviews a couple of local region descriptors that outperform many others in a performance comparison study. These descriptors transform affine invariant regions into vectors or distributions so that a distance measure can be applied to discern the similarity or difference between interest regions.

Those with better invariance to viewpoint changes are of particular interest. Section 4 reviews a couple of methods that develop models by combining the information from local descriptors extracted across multiple views. These methods offer good examples of how to integrate local invariant features across different views. A case study on performance evaluation and benchmark databases is presented in Section 5, with an introduction to its database, followed in Section 6 by a snapshot of other databases that are also suited to the study of 3D object recognition.

Abstract of the second paper

A system is proposed for recognizing the number of dots on dice in general table game settings. Unlike previous dice recognition systems, which use a single top-view camera and work only under controlled illumination, the proposed one uses multiple cameras and works under uncontrolled illumination. The proposed system exploits local invariant features that are robust to illumination variation and good for building homographies across multiple views. The homographies are used to enhance coplanar features and weaken non-coplanar features, giving a way to segment the top faces of the dice and make up for the features ruined by possible specular reflection. To recognize the number of dots on dice, MSER (Maximally Stable Extremal Region) detectors are applied for dice localization and dot identification, followed by certain constraints to conform to the six known dice patterns. Experiments show that the proposed system can achieve a superb recognition rate in various uncontrolled illumination conditions.

Abstract of the third paper (partly similar to the previous one, but the main method here is shape context, whereas the previous one uses MSER)

A system is proposed for recognizing the number of dots on dice in general table game settings. Unlike previous dice recognition systems, which use a single top-view camera and work only under controlled illumination, the proposed one uses multiple cameras and works under uncontrolled illumination. Under controlled illumination, edges are the prominent features considered in most approaches. But reflection, often observed in uncontrolled illumination, makes approaches based solely on edges ineffective. The proposed system exploits local invariant features that are robust to illumination variation and good for building the homographies between multiple viewpoints. The homographies can enhance coplanar features and weaken non-coplanar features, giving a way to segment the top surfaces of the dice and make up for the features ruined by reflection. A coarse-to-fine search with a shape-dependent local descriptor is designed to identify the dots on the segmented top surfaces. The identified dots are clustered subject to certain constraints so that each cluster conforms to one of the six known dice patterns. Experiments show that the proposed system gives satisfactory performance for various uncontrolled illumination conditions.

Abstract of the fourth paper (an extension of this research that applies invariant features to license plate character segmentation)

The variables, and the scope of variation considered in each variable, differ across applications of vehicle license plate recognition (VLPR). This research splits major VLPR applications into three categories: access control, traffic law enforcement, and road patrol. Each category is characterized by variables, including plate size, illumination condition, and camera viewpoint, with different scopes of variation.

Applications with more variables or larger variation scopes, as in road patrol cases, require more sophisticated methods and higher computational cost than those with fewer variables or smaller variation scopes, as in access control cases. It is uneconomical to apply the methods developed for road patrol to access control. Conversely, a method developed for access control cannot solve most cases in road patrol. Unlike most previous works, which do not specify applications, this paper redefines the VLPR problem using the variables and their variation scopes for the three major applications. Because no benchmark database is available for evaluating VLPR methods on the three major applications, a database called AOLP (Application-Oriented License Plate), composed of three application-oriented subsets, is introduced and made available to the research community. No method has been commonly acknowledged as a baseline, although VLPR has been an active research topic for more than a decade. A modular method, whose components can be adjusted for the three applications, is proposed and benchmarked on our database. In the character segmentation module, the MSER (Maximally Stable Extremal Region) is exploited and yields a performance better than previous approaches. The proposed method is compared with a few competitive ones to highlight its value as a benchmark.

Literature Review (excerpted from the papers above to present the completed literature survey in full)

As object recognition is the central concern of this research, the literature survey mostly covers works on invariant features, interest region detection, and local descriptors, especially those that are robust against variation in illumination and viewpoint.

Harris and Hessian affine detector:

The Harris-Affine region detector combines the Harris corner detector, Gaussian scale-space, and affine shape adaptation (Mikolajczyk, 2005). Given an image, the algorithm for detecting Harris-Affine regions consists of the following steps: (1) Detection of scale-invariant interest regions using the Harris-Laplace detector and characteristic scale selection; (2) Normalization of the scale-invariant interest regions using affine shape adaptation; (3) Iterative estimation of the affine region; (4) Affine region update on scale and localization. In addition to the Harris-Affine region based on the Harris-Laplace detector, a similar alternative is the Hessian-Affine region detector based on the Hessian matrix. Both are effective for detecting blobs and ridges, but the latter performs better in detecting long blobs.
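As a rough illustration of the corner measure that underlies the Harris-based detectors, the sketch below computes the standard Harris response R = det(M) - k * trace(M)^2 at single pixels of a small grayscale image. It is not the Harris-Affine detector: scale selection and affine shape adaptation are omitted, and the function name `harris_response`, the unweighted 3x3 window, the value k = 0.04, and the test image are assumptions made for illustration.

```python
def harris_response(img, r, c, k=0.04):
    """Harris corner measure at pixel (r, c) of a 2D list of intensities.

    Gradient products are summed over an unweighted 3x3 window (an
    assumption; practical detectors use a Gaussian window and several scales).
    The pixel must lie at least two pixels away from the image border.
    """
    sxx = syy = sxy = 0.0
    for i in range(r - 1, r + 2):
        for j in range(c - 1, c + 2):
            # Central differences approximate the image gradient.
            ix = (img[i][j + 1] - img[i][j - 1]) / 2.0
            iy = (img[i + 1][j] - img[i - 1][j]) / 2.0
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace


# Hypothetical 9x9 image: a bright square filling the lower-right part.
image = [[100.0 if (i >= 4 and j >= 4) else 0.0 for j in range(9)]
         for i in range(9)]

corner = harris_response(image, 4, 4)  # at the corner of the square
edge = harris_response(image, 6, 4)    # on the straight left edge
flat = harris_response(image, 2, 2)    # in the uniform background
```

In this toy image the response is positive at the corner, negative on the straight edge, and zero in the flat region, which is the behavior that lets a threshold on R pick out corner-like points.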

Maximally Stable Extremal Region (MSER):

MSER considers the set of all possible thresholds able to binarize an intensity image, and an MSER is a connected region with little change in its size for a range of thresholds (Matas et al., 2002). Because it is defined exclusively by the intensity function in the region and the outer border, and the local binarization is stable over a large range of thresholds, it possesses many favorable characteristics, such as robustness to changes in viewpoint, illumination, scale, and even occlusion.

SIFT and GLOH Descriptors:

Local region descriptors are mostly in vector form and characterize the pattern of an interest point together with its neighboring region. Ten different descriptors were reviewed by Mikolajczyk (2005), and the review revealed that GLOH (Mikolajczyk, 2005) performs the best, closely followed by SIFT (Lowe, 2004) and shape context (Belongie et al., 2002), in generating more correct matches under viewpoint and scale changes. These three descriptors also outperform the others in most tests with different variables and settings. The SIFT (Scale-Invariant Feature Transform) descriptor, proposed by Lowe (2004), is derived from a 3D histogram of gradient location and orientation. GLOH is a modified version of SIFT that computes a SIFT descriptor for a log-polar location grid with bins in both radial and angular directions.

Shape Context Descriptor:

Shape context, proposed by Belongie et al. (2002), is a descriptor that characterizes the shape of an object. Given a shape, which can be obtained by an edge detector, one can pick a point on the shape and compute the histogram of the relative coordinates of the remaining points in log-polar bins. Because the log-polar bins are finer near the chosen point, this design makes the descriptor more sensitive to the locations of nearby shape points than to those farther apart. Belongie et al. use 5 radial bins and 12 angular bins, giving a descriptor of dimension 60, while Mikolajczyk splits the radial direction into 9 bins and the angular direction into 4 bins, resulting in a descriptor of dimension 36.
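The histogram construction can be sketched as follows for the 5 x 12 configuration, which yields the 60-dimensional descriptor mentioned above. This is a minimal sketch under stated assumptions rather than the reference implementation: the function name `shape_context`, the normalization by mean distance, the radial ring edges (0.125 to 2.0, doubling at each ring), and the sample outline are all illustrative choices.

```python
import math


def shape_context(points, index, n_radial=5, n_angular=12, r_max=2.0):
    """Log-polar histogram of the other shape points relative to points[index].

    Distances are divided by the mean distance to the reference point
    (an assumption, made here for scale invariance). Radial ring edges
    double from one ring to the next, so bins are finer near the point.
    """
    px, py = points[index]
    others = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != index]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    mean_dist = sum(dists) / len(dists)

    # Upper edges of the rings, e.g. 0.125, 0.25, 0.5, 1.0, 2.0.
    edges = [r_max / 2 ** (n_radial - 1 - b) for b in range(n_radial)]

    hist = [0] * (n_radial * n_angular)
    for (dx, dy), d in zip(others, dists):
        r = d / mean_dist
        r_bin = next((b for b, e in enumerate(edges) if r <= e), None)
        if r_bin is None:
            continue  # farther away than the outermost ring
        theta = math.atan2(dy, dx) % (2 * math.pi)
        a_bin = min(int(theta / (2 * math.pi) * n_angular), n_angular - 1)
        hist[r_bin * n_angular + a_bin] += 1
    return hist


# Hypothetical shape: eight points sampled on the outline of a square.
outline = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
descriptor = shape_context(outline, 0)
```

Two shapes can then be compared point by point through a histogram distance between their descriptors.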


Integration of Local Descriptors from Multi-Views:

Depending on how the model of a given object is built, the approaches that use local invariant regions for object recognition can be split into two categories. One takes a single view of the object for developing the model, while the other uses multiple views. Both recognize the object in different views, with occlusions and under different geometric and photometric conditions. Because multiple views of the object are considered in the modeling phase, the multi-view methods can recognize the object in a much broader range of conditions. As this research is concerned with stereo vision for 3D object recognition, only the methods using multiple views are considered in this section. Two methods are selected: one, given by Lowe (2001), fuses the SIFT features from multiple views of an object into a single model with view-dependent clusters, and the other, proposed by Rothganger et al. (2006), builds a patch-based 3D model using affine descriptors and multi-view spatial constraints.

Databases for 3D Object Recognition:

The database used by Rothganger et al. (2006) consists of 9 objects and 80 test images.

The training images are stereo views of each of the 9 objects, roughly equally spaced around the equatorial ring of each object. The number of stereo views ranges from 7 to 12 for the different objects. The test images are monocular images of the objects under varying amounts of clutter and occlusion and under different lighting conditions. Several other databases can also be considered for benchmarking stereo vision algorithms for object recognition. An ideal database must offer stereo images for training, and test images collected with variations in viewpoint, scale, illumination, and partial occlusion. A few samples taken from the dataset used by Rothganger et al. (2006) are shown in Fig. 1.

Fig. 1. Samples from the dataset by Rothganger et al. (2006). The top two rows are training images, and the bottom two rows are test images.


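Self-Evaluation of Project Results:

If the following table is taken as the self-evaluation standard:

Grade: Evaluation criterion

Outstanding: Most of the originally planned work was completed, and the results received international recognition (e.g., well-known journals or patents) or attracted broad industry attention.

Excellent: Most of the originally planned work was completed, and the results received broad recognition, e.g., top international conferences in the field or related patents.

Good: Most of the originally planned work was completed, and the results received some recognition, e.g., general conferences in the field.

Fair: Most of the originally planned work was completed, but the results have not yet been recognized by other parties.

Poor: Most of the originally planned work was not completed.

The outcomes of this project are one book chapter, three international conference papers, two domestic conference papers, one pending patent application, and one journal paper under review. The online version of the book chapter, Stereo Correspondence with Local Descriptors for Object Recognition, has been downloaded 117 times since its publication in January (the figure can be obtained from http://www.intechopen.com/articles/show/title/stereo-correspondence-with-local-descriptors-for-object-recognition). Among the international conferences, CAIP is one of the important conferences in image processing and pattern recognition, and ICVES is one of the important conferences in intelligent vehicle research. Although ICMT is a newer conference in multimedia technology, it is well attended, and our paper was selected for a Best Paper Award among the more than 120 accepted papers (16 papers received the award). Taking into account also the pending patent application, the journal paper under review, and the two domestic conference papers, the results are self-evaluated as Excellent.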


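Data Sheet of R&D Results Available for Promotion

▓ Patentable (patent pending) ▓ Available for technology transfer

Date: August 10, 2011 (ROC year 100)

NSC-funded project

Project title: Robot Vision Technology for Fast Recognition of 3D Objects

Principal investigator: 徐繼聖 (Gee-Sern Hsu)

Project number: NSC 99-2221-E-011-098-

Field: Automation system integration technology

Technology/work title: Stereo Vision Based Dice Recognition Method and System for Uncontrolled Environments

Inventors/creators: 徐繼聖, 張訓嘉

Technical description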

1. Unlike existing dice recognition systems, which work only under controlled illumination, the present invention can work in uncontrolled illumination conditions. To enable easy integration with a game table, the present invention considers tilted views of the dice captured by cameras held on peripheral supports.

2. The present invention is composed of two major modules: dice segmentation and dot identification. Given dice images from different viewpoints, local invariant features are extracted and compared on how well they render homographies with the least error. The homographies can be used to enhance the coplanar features and weaken the non-coplanar features. This leads to the extraction of the coplanar features and the segmentation of the top surfaces of the dice even when some features are ruined by specular reflection.

3. The dots on the segmented top surfaces of the dice must be identified with care, as some lighting can blur the dots and specular reflection spots can appear as valid dots. An MSER (Maximally Stable Extremal Region) detector is applied for its consistency in rendering local interest regions across large illumination variation.

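Applicable industries and developable products

1. Leisure and entertainment industry 2. Intelligent robotics industry 3. Premium toy industry

Technical features

Unlike the enclosed dice recognition systems already seen in entertainment venues in Las Vegas and Macau, this invention is currently the only automatic dice recognition system applicable to open environments, and it can therefore be used in product development for the industries above. For example, the dice cups used in ordinary entertainment venues need not be replaced; adding the system designed in this invention is enough to enable automatic dice recognition. The system uses multiple cameras to capture dice images from different viewpoints, at different distances, and under different illumination conditions, and uses stereo vision to obtain the geometric spatial information of the dot distribution for accurate dot recognition. Besides the cameras, the system includes a shield and an auxiliary light source design, which filter out the ambient light that degrades recognition and effectively raise the recognition rate.

Value for promotion and application: This technology can be extended to the innovative design and manufacture of products in the leisure and entertainment, intelligent robotics, and premium toy industries, or used to improve the functions of existing products and strengthen their international competitiveness.

※ Patent applications for the above R&D results have been filed in Taiwan, China, and the United States through the university's technology transfer center.

Attachment 1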


Report on Attending International Academic Conferences

Gee-Sern Hsu

I attended CAIP 2011, held in Seville, Spain, on Aug 29~31, and presented the paper Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features. According to the conference ranking at http://www.cs.ucla.edu/~eklee/paper/CS_conf_rank.htm and a few other websites, CAIP is considered a fine conference in the area of computer vision. It is rated 0.84 on a unit scale, compared with ICCV (0.96) and ICIP (0.71), at http://perso.crans.org/~genest/conf.html.

Quite a few attendees wanted to know more about how we determine matches across different views, since features from different dice show the same characteristics. Some were impressed by the extraction of features across dice, on top of the features within each die, using invariant features. Many considered our work an interesting application of invariant features to the entertainment and game technology sector. I attended several talks on face recognition, object recognition, kernel methods, and stereo vision. Among those I kept in contact with, the work by Herrera, a Ph.D. student from the University of Oulu, is closely related to the continuing phase of this research. He proposes a simplified method to calibrate Kinect, the depth and color camera pair. We talked about possible collaboration in the near future, and he offered me the package he developed and introduced at the conference. Since the second phase of this research, which is already in progress, will exploit depth and color cameras in establishing 3D perception, Herrera’s toolbox will be studied and compared with other tools available on the web. We expect this interaction to open collaborative research opportunities that benefit both parties.

I also attended ICMT, held in Hangzhou, China, on July 26~28, and presented the paper Invariant Features for Dice Recognition Across Illumination. ICMT is a new conference on multimedia-related areas; this was its second edition after the first in 2010.

Although new, it seemed to draw no fewer attendees than other major conferences, and it featured keynotes by senior researchers such as E. Hancock, D. Terzopoulos, and others. I went to a few talks on topics of interest to me, and to the keynote by Hancock on “facial shape, texture and reflectance from a single view”. We met again at CAIP and have kept in contact since, as he also works on face-related vision research. Our paper received one of the 16 Best Paper Awards, selected from the 120 accepted papers out of more than 300 submissions.

Many were impressed by the live demo system that we showed at the conference, which could precisely recognize dice in various illumination conditions.

Attachment 2

Report on Attending International Academic Conferences

Gee-Sern Hsu, September 10, 2011



Dice Recognition in Uncontrolled Illumination Conditions by Local Invariant Features

Gee-Sern Hsu, Hsiao-Chia Peng, Chyi-Yeu Lin, and Pendry Alexandra

Department of Mechanical Engineering, National Taiwan University of Science and Technology


Abstract. A system is proposed for recognizing the number of dots on dice in general table game settings. Unlike previous dice recognition systems, which use a single top-view camera and work only under controlled illumination, the proposed one uses multiple cameras and works under uncontrolled illumination. Under controlled illumination, edges are the prominent features considered by most approaches. But strong specular reflection, often observed in uncontrolled illumination, defeats approaches based solely on edges. The proposed system exploits local invariant features that are robust to illumination variation and good for building homographies across multiple views. The homographies are used to enhance coplanar features and weaken non-coplanar features, giving a way to segment the top faces of the dice and make up for the features ruined by possible specular reflection. To identify the dots on the segmented top faces, an MSER detector is applied for its consistency in rendering local interest regions across large illumination variation. Experiments show that the proposed system can achieve a superb recognition rate in various uncontrolled illumination conditions.

Keywords: Object recognition, invariant feature, local descriptor.

1 Introduction

Dice is a popular table game in casinos, especially in Asia. As automatic or computer-controlled games are emerging and becoming popular, many are interested in technologies able to assist or replace human bankers. A computer vision system is proposed in this paper for dice recognition, which refers to the automatic recognition of the numbers of dots on dice, in normal table game settings. Unlike existing dice recognition systems, for example [4] and [5], which work under controlled illumination, the proposed system can work in uncontrolled illumination conditions. Under controlled illumination, edges are the prominent features considered. But specular reflection, often observed in uncontrolled illumination, defeats approaches based solely on edges. Fig. 1 shows in the middle an image with strong specular reflection; on the left is its edge map obtained by previous methods. Because it is not limited to controlled illumination, the proposed system allows a much wider scope of applications, e.g., integration with table games or different designs of automatic dice games.

Fig. 1. Middle: specular reflection on the dice; Left: the edge map obtained by previous methods; Right: the edge map obtained by the proposed method

A. Berciano et al. (Eds.): CAIP 2011, Part II, LNCS 6855, pp. 188–195, 2011. © Springer-Verlag Berlin Heidelberg 2011

Existing dice recognition systems only consider the top view of dice. But a top-view camera is difficult to install on a game table, as a specially designed camera support is needed. To enable easy integration with a game table, the proposed system considers tilted views of the dice captured by cameras held on peripheral supports around the table. Peripheral cameras are easier to install on a game table than top-view ones. However, whereas top views capture only the top faces of the dice, tilted views reveal both the top and side surfaces. The latter case is harder to handle, as a method is required to segment the top faces and remove the side surfaces.

The proposed system consists of two major modules: dice segmentation and dot identification. To segment dice, it exploits local invariant features that are robust to illumination variation and good for building homographies across multiple views. The homographies are used to enhance coplanar features, segment the top faces of the dice, and make up for the features ruined by possible specular reflection.

To identify the dots on the segmented top faces, an MSER (Maximally Stable Extremal Region) [8] detector is applied for its consistency in rendering local interest regions across large illumination variation. Although one can consider classifiers for the segmentation and identification, such as that proposed by Viola and Jones [12], they are not considered here because a large number of training samples is required. The proposed system needs only a few samples as references.

The rest of this paper is organized as follows: the dice segmentation is presented in Section 2. The dot identification is elaborated in Section 3. Section 4 presents an experimental study of the proposed methods, followed by a conclusion in Section 5.

2 Dice Segmentation Using Local Invariant Features

Because dice can lie at arbitrary locations and orientations on a dice roller base and their sizes vary slightly according to the distance to the camera, local invariant features are explored for capturing these variations. Many local invariant feature detectors have been proposed and applied in a broad range of applications. Reviews of these detectors can be found in [3], [9], and [10]. The invariant feature detectors can be generally categorized into three types [11]. One detects corner-like features, e.g., Harris-affine, Harris-Laplace, and multi-scale Harris detectors. One detects blob-like features, e.g., Hessian-affine, Hessian-Laplace, multi-scale Hessian, and Difference of Gaussians (DoG) [7]. Different from the former two types, region detectors extract homogeneous local areas, e.g., the MSER detector [8], which is used in this work for identifying the dots on dice and is addressed in detail in Sec. 3.

Fig. 2. Correspondences across two different views on the local invariant features detected by a multi-scale Harris-Hessian detector. Many of the detected correspondences are removed for better visual inspection.

Because Harris and Hessian detectors are limited in handling multiple scales, both are extended to multiple scales and made scale-invariant in [1]. To determine the most appropriate scale for a local feature, Harris-Laplace and Hessian-Laplace both search for the characteristic scale with a Laplace operator added on top of the multiple scales. Harris-affine and Hessian-affine obtain the affine invariant corners or blobs by an iterative estimation of elliptical affine regions proposed by Lindeberg et al. [6]. The shape of the feature region is adapted to ensure that the same region is covered when extracted from a different viewpoint.

The performance of the aforementioned 8 invariant feature detectors in rendering the most accurate homographies between different viewpoints is evaluated by comparison with the ground truth obtained from manually selected correspondences. All of the invariant regions (or interest regions) are represented in the form of the SIFT descriptor [7], as it has been experimentally shown to be one of the most effective descriptors [10]. The match of the invariant features across views is measured by the Euclidean distance between the feature descriptors, and a threshold on this distance is determined to select correspondences. Because a dot on a die in a given view can appear quite similar to a different dot in another view, the scale factor in the local feature detectors is first chosen as the one that yields the maximum number of correct correspondences. RANSAC [2] is then applied to filter out outliers and determine the most appropriate homographies across different views from the matched correspondences.
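The distance-threshold matching step can be sketched as below. This is a simplified illustration, not the system's implementation: it pairs each descriptor with its nearest neighbor and keeps the pair only when the Euclidean distance falls under a threshold, and it leaves out the scale selection and the RANSAC filtering. The function name `match_descriptors`, the threshold value, and the 4-dimensional toy descriptors are hypothetical.

```python
import math


def match_descriptors(desc_a, desc_b, max_dist):
    """Pair each descriptor in desc_a with its nearest neighbor in desc_b.

    A pair is kept only when its Euclidean distance is below max_dist.
    Returns a list of (index_a, index_b, distance) tuples.
    """
    matches = []
    for i, da in enumerate(desc_a):
        best_j, best_d = None, float("inf")
        for j, db in enumerate(desc_b):
            d = math.dist(da, db)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d < max_dist:
            matches.append((i, best_j, best_d))
    return matches


# Hypothetical toy descriptors from two views.
view_a = [[0.1, 0.9, 0.0, 0.2], [0.8, 0.1, 0.5, 0.0], [0.4, 0.4, 0.4, 0.4]]
view_b = [[0.82, 0.12, 0.48, 0.02], [0.1, 0.88, 0.02, 0.21], [0.0, 0.0, 1.0, 1.0]]

matches = match_descriptors(view_a, view_b, max_dist=0.1)
pairs = [(i, j) for i, j, _ in matches]
```

Here the third descriptor of the first view has no close counterpart and is left unmatched. The surviving pairs are the kind of candidate correspondences that a RANSAC stage would then filter.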

Our experiments reveal that the multi-scale Harris-Hessian detector gives the best performance. Fig. 2 shows an example of the correspondences across two viewpoints obtained using this detector. The settings and other details of the performance evaluation are reported in Section 4.


Given N different viewpoints of dice images, N(N − 1)/2 homographies are obtained using the invariant feature correspondences. In most cases 2 ≤ N ≤ 4 suffices. Each homography and its inverse define the transformation between a pair of viewpoints, and such a transformation only holds for the top faces of the dice, as these surfaces are coplanar. This property motivates the stacking of coplanar surfaces to segment the top faces of the dice even when specular reflection appears in certain viewpoints. One can choose the dice image of any viewpoint as the reference image and transform the remaining N − 1 images to the reference view using the corresponding homographies.
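The pair count and the role of a homography and its inverse can be illustrated with a short sketch. The helper names `apply_homography` and `invert_3x3` and the matrix values are hypothetical; in the system the homographies are estimated from matched features rather than written by hand.

```python
from itertools import combinations


def apply_homography(h, point):
    """Map a 2D point with a 3x3 homography h given as nested lists."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)  # divide by the homogeneous coordinate


def invert_3x3(m):
    """Inverse of a 3x3 matrix computed from its adjugate and determinant."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[value / det for value in row] for row in adj]


# With N = 3 views there are N(N - 1)/2 = 3 view pairs.
view_pairs = list(combinations(range(3), 2))

# Hypothetical homography from view 1 to the reference view 0.
H_1_to_0 = [[1.0, 0.2, 5.0], [0.1, 1.1, -3.0], [0.001, 0.002, 1.0]]

p_view1 = (10.0, 20.0)
p_ref = apply_homography(H_1_to_0, p_view1)
p_back = apply_homography(invert_3x3(H_1_to_0), p_ref)
```

Mapping a point to the reference view and back with the inverse returns the original coordinates up to rounding, which is why a single estimated homography serves both directions of a view pair.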

Stacking the reference image and the N − 1 transformed images not only enhances the coplanar features but also weakens the non-coplanar ones, because features on the lateral sides of the dice are overlaid with features from different planes. Since specular reflection can be regarded as a view-dependent feature, different from the coplanar features observed in the majority of the other views, it can be removed by imposing a threshold on a similarity measure. An example with N = 3 is shown in Fig. 1: the middle panel shows a view of the dice with strong specular reflection, and the right panel shows the edge map obtained by stacking the homography-transformed images from the other two views.
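A toy version of the stacking step is sketched below. The similarity measure is not specified in the text, so the example assumes the simplest possible one: a per-pixel vote count over binary edge maps that have already been warped to the reference view. The maps and the `min_votes` value are hypothetical.

```python
def stack_edge_maps(edge_maps, min_votes):
    """Sum aligned binary edge maps and keep pixels with enough votes."""
    rows, cols = len(edge_maps[0]), len(edge_maps[0][0])
    stacked = [[sum(m[r][c] for m in edge_maps) for c in range(cols)]
               for r in range(rows)]
    kept = [[1 if v >= min_votes else 0 for v in row] for row in stacked]
    return stacked, kept

# Three aligned 3x4 edge maps (1 = edge pixel).
reference = [[0, 1, 1, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 1]]   # bottom-right: lateral-side edge seen only here
view_1 =    [[0, 1, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 0]]   # bottom-left: lateral-side edge seen only here
view_2 =    [[0, 1, 1, 0],
             [0, 0, 0, 0],   # edge at (1, 1) wiped out by specular reflection
             [0, 0, 0, 0]]

stacked, kept = stack_edge_maps([reference, view_1, view_2], min_votes=2)
for row in kept:
    print(row)
```

Edges supported by at least two views survive, including the one lost to the reflection in `view_2`, while the single-view lateral edges are discarded.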

3 Dot Identification and Dice Recognition

Given a segmented top face of a die, an MSER detector [8] is exploited to extract the dots from the segmented area because of its stability in rendering persistent or slowly varying edges around the dots as illumination varies. The extraction of MSER considers the set of all possible thresholds able to binarize an intensity image $I(x)$ into a binary image $E_{t_M}(x)$,

$$E_{t_M}(x) = \begin{cases} 1 & \text{if } I(x) \le t_M \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $t_M$ is the threshold. An MSER is a connected region in $E_{t_M}(x)$ whose size changes little over a range of thresholds, extracted with a watershed-like segmentation algorithm. The homogeneous intensity regions extracted are stable over a wide range of thresholds. The number of thresholds over which the connected region remains similar in size is known as the margin of the region.
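The idea of a margin can be made concrete with a deliberately simplified threshold sweep. This is not the watershed-like MSER algorithm itself: the sketch tracks only the connected region around one seed pixel, treats "similar in size" as "identical in size", and uses a hypothetical 5×5 patch with one dark dot.

```python
from collections import Counter, deque

def binarize(image, t):
    """Eq. (1): 1 where intensity <= t, 0 otherwise."""
    return [[1 if v <= t else 0 for v in row] for row in image]

def region_size(binary, seed):
    """Size of the 4-connected region of ones that contains the seed."""
    r0, c0 = seed
    if not binary[r0][c0]:
        return 0
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(binary) and 0 <= nc < len(binary[0])
            if inside and binary[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen)

# Dark 2x2 dot (intensity 20) on a bright die face (intensity 200).
dot_patch = [
    [200, 200, 200, 200, 200],
    [200,  20,  20, 200, 200],
    [200,  20,  20, 200, 200],
    [200, 200, 200, 200, 200],
    [200, 200, 200, 200, 200],
]

thresholds = list(range(10, 250, 10))
sizes = [region_size(binarize(dot_patch, t), (1, 1)) for t in thresholds]

# The most persistent non-empty region size and the number of thresholds
# over which it persists (its margin in this simplified setting).
stable_size, margin = Counter(s for s in sizes if s).most_common(1)[0]
print("stable region size:", stable_size, "margin:", margin)
```

The dot keeps the same 4-pixel extent for every threshold between its own intensity and that of the face, which is the kind of stability that makes it an extremal region.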

The dots on dice are blob-like objects, and MSER usually anchors on the boundaries of such objects, so the dots can be located better by MSER than by other interest region detectors. With some preprocessing, such as histogram equalization, MSER achieves a highly accurate identification rate. Fig. 3 shows a case with the segmented top faces and the regions detected by MSER before and after preprocessing. Note that MSER can detect incomplete or partial interest regions, which can result from imperfect segmentation.

Fig. 3. The performance of MSER in the identification of the dots: (a) segmentation of top faces; (b) regions detected before preprocessing; (c) regions detected after preprocessing

The dots identified by the MSER are clustered by k-means (with k equal to the number of dice), subject to the constraints that the number of dots in a cluster must be less than 7 and that the distance between the farthest dots must be less than the diagonal of a die. The spatial distribution of the dots in each cluster is then verified against the 6 known patterns. For example, a 6-dot face must contain two parallel rows of 3 dots each, and a 5-dot face must have two crossing rows of 3 dots each that share the same central dot. Specific patterns are likewise configured for the 4-, 3-, and 2-dot cases. Depending on the number of dots in a given cluster, the distribution pattern for that number is examined first; if it is found incompatible, two possibilities are verified: either a non-dot spot has been falsely taken as a dot, or a valid dot has failed to be identified. A large number of casts and experiments, with details given in Section 4, reveal that this combination of size-constrained clustering and spatial pattern confirmation yields a superb recognition rate.
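The cluster constraints and the two patterns spelled out in the text (5 and 6 dots) can be sketched as simple geometric tests. The k-means step itself is omitted, and everything else here is an assumption made for illustration: the helper names, the collinearity tolerance `tol`, and the idealized dot coordinates.

```python
import math
from itertools import combinations

def within_die(dots, die_diagonal):
    """Cluster constraints: fewer than 7 dots, span below the die diagonal."""
    span = max((math.dist(p, q) for p, q in combinations(dots, 2)), default=0.0)
    return len(dots) < 7 and span < die_diagonal

def collinear_triples(dots, tol=0.05):
    """All rows of three (nearly) collinear dots."""
    rows = []
    for a, b, c in combinations(dots, 3):
        cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        longest = max(math.dist(a, b), math.dist(b, c), math.dist(a, c))
        if abs(cross) <= tol * longest ** 2:
            rows.append((a, b, c))
    return rows

def rows_parallel(row1, row2, tol=0.05):
    """True if the directions of two rows are (nearly) parallel."""
    d1 = (row1[1][0] - row1[0][0], row1[1][1] - row1[0][1])
    d2 = (row2[1][0] - row2[0][0], row2[1][1] - row2[0][1])
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    return abs(cross) <= tol * math.hypot(*d1) * math.hypot(*d2)

def matches_face(dots):
    """Check the 5- and 6-dot layouts described in the text."""
    if len(dots) not in (5, 6):
        raise ValueError("only the 5- and 6-dot patterns are described in the text")
    rows = collinear_triples(dots)
    if len(rows) != 2:
        return False
    shared = set(rows[0]) & set(rows[1])
    if len(dots) == 6:
        return not shared and rows_parallel(rows[0], rows[1])
    return len(shared) == 1  # 5 dots: rows cross at one central dot

six = [(0, 0), (0, 1), (0, 2), (2, 0), (2, 1), (2, 2)]
five = [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2)]
bad_five = [(0, 0), (2, 0), (1, 1), (0, 2), (3, 1)]  # one corner replaced by a false positive

print(within_die(six, die_diagonal=4.0), matches_face(six))
print(matches_face(five), matches_face(bad_five))
```

A cluster that fails its pattern, like `bad_five`, is the case where the two fallback hypotheses (a false dot or a missed dot) would be examined.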

4 Experiments

The experimental setup follows the common dice table game "sic bo", which uses three dice; three cameras with different viewpoints are installed on the sides of a game table. Twelve illumination conditions are configured to study the performance of the proposed system, 3 of which are chosen as the training set and the remaining 9 as the test set, as shown in Fig. 4. The average intensities on the dice in the training set are 67, 108, and 138 in 8-bit gray scale, with deviations of 8, 10, and 11, respectively. The average intensity in the test set ranges from 45 to 158, with deviations from 7 to 12. Under each illumination condition, 120 random cast sessions and 30 manual placement sessions are carried out. The manual placements are intended to create special layouts of the dice, such as three dice in a row.

4.1 Homography Based on Local Invariant Features

Fig. 4. First column from the left is the training set with 3 illumination conditions; the rest is the test set with 9 illumination conditions

The training set is used to evaluate the 8 invariant feature detectors, mentioned in Section 2, in creating homographies with the least error across different illumination conditions. The error $E_{F_i}^{(a,b)}$ is measured by the difference between the correspondences from the invariant-feature-based homography $H_{F_i}^{(a,b)}$ and those from the ground-truth homography $H_G^{(a,b)}$ obtained using manually selected correspondences, i.e.,

$$E_{F_i}^{(a,b)} = \frac{\left\| \left( H_{F_i}^{(a,b)} - H_G^{(a,b)} \right) x_{F_i}^{b} \right\|}{N_{F_i}} \qquad (2)$$

where $H_{F_i}^{(a,b)}$ is the homography that transforms the invariant features $x_{F_i}^{b}$ detected by the invariant feature detector $F_i$ in the image $I_b$ to the corresponding ones in $I_a$; $H_G^{(a,b)}$ is the ground-truth homography obtained from manually selected correspondences between $I_a$ and $I_b$; $N_{F_i}$ is the number of features detected by $F_i$; and $a$, $b$ denote two different viewpoints.
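One way to read Eq. (2) is as the average displacement between where the feature-based homography and the ground-truth homography send the same features. The sketch below follows that reading, which is an assumption: it dehomogenizes the mapped points before taking distances, and the two matrices and the feature coordinates are made up for the example.

```python
import math

def project(h, p):
    """Apply homography h to the 2D point p."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def homography_error(h_f, h_g, features_b):
    """Mean distance between features mapped by h_f and by h_g."""
    total = sum(math.dist(project(h_f, p), project(h_g, p)) for p in features_b)
    return total / len(features_b)

# Hypothetical ground truth: a pure 10-pixel shift along x.
h_g = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical feature-based estimate: off by one pixel along x.
h_f = [[1.0, 0.0, 11.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

features_b = [(20.0, 30.0), (100.0, 60.0), (150.0, 90.0)]
error = homography_error(h_f, h_g, features_b)
print("mean error in pixels:", error)
```

Every feature is displaced by exactly one pixel in this example, so the mean error is one pixel; dividing by an image dimension would turn it into a normalized error of the kind reported in Fig. 5.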

Additionally, it is desirable that the correspondences from the feature-based homographies be consistent across different scales, as some features change with scale. To investigate which features are better than others at rendering the desired homographies across illumination and scale, the original images of 320×240 pixels are scaled down to smaller sizes, and the error is computed at each size and averaged over the three illumination conditions in the training set. Fig. 5 shows this comparison. The smallest scale, 128×96, reveals relatively high errors, indicating that some details between the dice are lost at such a small scale and thus the accuracy of the homography estimation is degraded. Among the eight invariant feature detectors tested, the multi-scale Harris-Hessian detector gives the lowest error at 0.87%, which is about 1.7 pixels in a 192 × 144 image.

Fig. 5. Normalized error of feature-based homography across scales and three illumination conditions
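The pixel figure quoted above is consistent with reading the 0.87% error as a fraction of the image width. The normalization is not stated explicitly, so this reading is an assumption; a quick check over the three image sizes mentioned in the text:

```python
# Image sizes mentioned in the text: the original and two reduced scales.
sizes = [(320, 240), (192, 144), (128, 96)]
normalized_error = 0.0087  # 0.87%

for width, height in sizes:
    print(f"{width}x{height}: "
          f"{normalized_error * width:.2f} px relative to width, "
          f"{normalized_error * height:.2f} px relative to height")

error_px = normalized_error * 192
```

Relative to the 192-pixel width the error is about 1.67 pixels, which rounds to the quoted 1.7; relative to the 144-pixel height it would be about 1.25 pixels.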

4.2 Dice Identification

The performance evaluation on the 9 test sets reveals the following observations and results:

– As long as the correspondences from the feature-based homography are consistent over at least two scales, the average match error can be kept below or near 1%, and the top faces of the dice can be perfectly segmented in all tested conditions.

– Two identification rates are measured in each test illumination condition: the identification rate of the dots and the identification rate of the dot number on each die. The former is shown by the left bar and the latter by the right bar at each indexed illumination condition in Fig. 6. Because the MSER dot detector has been tuned on the training set to a zero miss rate at the price of additional false positives, the imperfections in the dot identification in Fig. 6 are all caused by false positives. For example, in the brightest illumination condition, indexed "1", 1.8% (= 1 − 98.2%) of the dots identified are false positives. All false positives are found to be caused by specular reflection or insufficient lighting. As the intensity of the illumination increases, specular reflection becomes stronger, causing more false positives to appear.

– The combination of size-constrained clustering and spatial pattern confirmation can effectively remove the false positives and yield superb dice recognition rates in all tested conditions, as shown by the right bar at each indexed illumination condition in Fig. 6.

Fig. 6. Identification rates in 9 illumination conditions, indexed from 1 to 9; at each index the left bar shows the rate of dot identification, and the right bar shows the rate of dice number identification


5 Conclusion

A solution based on invariant features across multiple views is proposed for dice recognition under uncontrolled illumination. An extensive comparison of the performance of various invariant feature detectors in rendering correct homographies under various test conditions and parameters shows that the multi-scale Harris-Hessian detector is the best, outperforming the commonly used SIFT features.

The homographies built on the multi-scale Harris-Hessian features are exploited to enhance the coplanar features and weaken the non-coplanar features on the dice. This leads to the extraction of the coplanar features and the segmentation of the top faces of the dice, even when the features observed from some viewpoint are ruined by specular reflection. An MSER detector is applied to identify the dots on the top faces, followed by a pattern-specific confirmation of the spatial distribution of the dots. Experiments reveal that, although false positives of dots are observed in a few cases, such as under strong or insufficient illumination, the numbers of dots on the dice can still be recognized accurately by the proposed solution.

References

1. Dufournaud, Y., Schmid, C., Horaud, R.: Matching images with different resolutions. In: CVPR, pp. 1612–1618 (2000)

2. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

3. Hsu, G.S.J.: Stereo Correspondence with Local Descriptors for Object Recognition. In: Advances in Theory and Applications of Stereo Vision, ch. 7, pp. 129–150. InTech (2011)

4. Huang, K.Y.: An auto-recognizing system for dice games using a modified unsupervised grey clustering algorithm. Sensors 8(2), 1212–1221 (2008)

5. Lai, Y.N., Hsu, S.T., Wang, C.Y., Tsai, M.T.: Method for recognizing dice dots. U.S. Patent No. 2009/0263008 A1 (October 2009)

6. Lindeberg, T., Gårding, J.: Shape-adapted smoothing in estimation of 3-d shape cues from affine deformations of local 2-d brightness structure. Image Vision Comput. 15(6), 415–434 (1997)

7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

8. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC (2002)

9. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65(1-2), 43–72 (2005)

10. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)

11. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision 3(3), 177–280 (2007)

12. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR, vol. (1), pp. 511–518 (2001)


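NSC-Funded Project: Derived R&D Outcomes Promotion Data Sheet

Date: 2011/10/18

NSC-funded project

Project title: Subproject 3: Robot Vision Technology for Fast Recognition of 3D Objects

Principal investigator: 徐繼聖

Project number: 99-2221-E-011-098-

Field: Automation system integration technology

No R&D outcome promotion data.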


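Summary Table of Research Outcomes, Year 99 (ROC calendar) Research Projects

Principal investigator: 徐繼聖

Project number: 99-2221-E-011-098-

Project title: Fast Task Definition and Fully Autonomous Execution Technology for Next-Generation Service Robots, Subproject 3: Robot Vision Technology for Fast Recognition of 3D Objects

| Scope | Category | Outcome item | Achieved (accepted or published) | Expected total (including achieved) | Actual contribution of this project | Unit | Remarks |
|---|---|---|---|---|---|---|---|
| Domestic | Publications | Journal papers | 0 | 0 | 0% | papers | |
| Domestic | Publications | Research/technical reports | 0 | 0 | 0% | papers | |
| Domestic | Publications | Conference papers | 2 | 1 | 100% | papers | |
| Domestic | Publications | Monographs | 0 | 0 | 0% | | |
| Domestic | Patents | Pending | 0 | 0 | 0% | items | |
| Domestic | Patents | Granted | 0 | 0 | 0% | items | |
| Domestic | Technology transfer | Number of cases | 0 | 0 | 0% | items | |
| Domestic | Technology transfer | Royalties | 0 | 0 | 0% | NT$ thousand | |
| Domestic | Project personnel (domestic nationals) | Master's students | 2 | 2 | 100% | person-times | |
| Domestic | Project personnel (domestic nationals) | Doctoral students | 0 | 0 | 0% | person-times | |
| Domestic | Project personnel (domestic nationals) | Postdoctoral researchers | 0 | 0 | 0% | person-times | |
| Domestic | Project personnel (domestic nationals) | Full-time assistants | 0 | 0 | 0% | person-times | |
| International | Publications | Journal papers | 0 | 1 | 50% | papers | Journal paper under review |
| International | Publications | Research/technical reports | 0 | 0 | 100% | papers | |
| International | Publications | Conference papers | 3 | 2 | 100% | papers | |
| International | Publications | Monographs | 1 | 0 | 100% | chapters/books | |
| International | Patents | Pending | 1 | 1 | 100% | items | |
| International | Patents | Granted | 0 | 0 | 0% | items | |
| International | Technology transfer | Number of cases | 0 | 0 | 0% | items | |
| International | Technology transfer | Royalties | 0 | 0 | 0% | NT$ thousand | |
| International | Project personnel (foreign nationals) | Master's students | 1 | 1 | 100% | person-times | |
| International | Project personnel (foreign nationals) | Doctoral students | 0 | 0 | 0% | person-times | |
| International | Project personnel (foreign nationals) | Postdoctoral researchers | 0 | 0 | 0% | person-times | |
| International | Project personnel (foreign nationals) | Full-time assistants | 0 | 0 | 0% | person-times | |

Remarks column: qualitative notes, e.g. an outcome shared by several projects, or an outcome featured as the cover story of a journal.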


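Other outcomes

(Outcomes that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international collaborations, international impact of the research results, and other concrete benefits in assisting industrial technology development; please describe in text.)

1. The paper presented at ICMT received the Best Paper Award.

2. Collaboration with the Machine Vision Group, University of Oulu, Finland is being initiated.

Additional items for Department of Science Education projects

| Outcome item | Quantity | Name or brief description of content |
|---|---|---|
| Assessment tools (qualitative and quantitative) | 0 | |
| Courses/modules | 0 | |
| Computer and network systems or tools | 0 | |
| Teaching materials | 0 | |
| Activities/competitions held | 0 | |
| Seminars/workshops | 0 | |
| E-newsletters, websites | 0 | |
| Number of participants (audience) in the promotion of project outcomes | 0 | |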


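NSC-Funded Research Project Final Report: Self-Evaluation Form

Please give an overall evaluation covering: the degree to which the research content matches the original plan; the achievement of the expected goals; the academic or application value of the research results (briefly describe their significance, value, and impact, or the possibility of further development); whether the results are suitable for publication in academic journals or for patent application; and the main findings or other relevant value.

1. Overall evaluation of how well the research content matches the original plan and how well the expected goals were achieved

■ Goals achieved

□ Goals not achieved (please explain, within 100 characters)

□ Experiment failed

□ Experiment interrupted for some reason

□ Other reasons

Explanation:

2. Publication of the research results in academic journals, patent applications, etc.:

Papers: ■ Published □ Unpublished manuscript □ In preparation □ None

Patents: □ Granted ■ Pending □ None

Technology transfer: □ Transferred ■ Under negotiation □ None

Other: (within 100 characters)

The outcomes of this research project to date include one book chapter, three international conference papers (one of which received a Best Paper Award), two domestic conference papers, one pending patent application, and one journal paper under review.

3. Evaluation of the academic or application value of the research results in terms of academic achievement, technological innovation, social impact, etc. (briefly describe the significance, value, and impact of the results, or the possibility of further development) (within 500 characters)

This technology can be extended to innovative product design and manufacturing in industries such as leisure and entertainment, intelligent robots, and premium toys, or used to enhance the functions of existing products and strengthen their international competitiveness.

Unlike the enclosed dice-reading systems already seen in entertainment venues in Las Vegas and Macau, the system developed in this work is currently the only automatic dice-reading system applicable to open environments, and it can be applied to product development in the industries mentioned above. For example, the dice cups used in ordinary entertainment venues need not be replaced; automatic recognition only requires adding the system designed in this invention. The system uses multiple cameras to capture dice images from different viewpoints, at different distances, and under different illumination conditions, and uses stereo vision to obtain the geometric spatial information of the dot distribution, enabling accurate dice recognition.

In addition to the cameras, the system includes a light shield and an auxiliary light source design that filter out the ambient light affecting the recognition rate, effectively improving the recognition rate. The system can be extended to the leisure and entertainment industry, the intelligent robot industry, or industries related to premium toys.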