Chapter 3 Incremental Similarity-Based Aspect-Graph 3D Object Recognition

3.4 Flexible 3D Object Recognition Framework

$d^1(V_j^i, C_m^n)$ (Eq. (3-24.1)) are regarded as the top-three matches. In Eq. (3-24.1), $\omega_1^k$ is a weighting parameter for combining different similarity measures. When no assistant feature is utilized, $\omega_1^k$ is set to zero. Otherwise, the objects with the smallest half of the similarity distances $d^k(V_j^i, C_m^n)$ are defined as a set $N^{k+1}$. The objects in $N^{k+1}$ are preserved for further recognition in the $(k+1)$-th AOD.

If the framework comprises two or more AODs, the characteristic views with the three smallest similarity distances $d^k(V_j^i, C_m^n)$ (Eq. (3-24.2)) are regarded as the top-three matches. In Eq. (3-24.2), $d_{\min}^{k-1}(n)$ denotes the minimum similarity distance between the unknown object and the $n$-th candidate object in the $(k-1)$-th database. Moreover, $\omega_2^k$ is a weighting parameter for combining the similarity measures of the $k$-th and $(k-1)$-th AODs.

3.4.3 Applications

The proposed framework is evaluated on various object recognition problems, including 3D object recognition, human posture recognition, and scene recognition.

Three assumptions are made in applying the proposed framework to these three applications. First, different features are used in different applications. In this dissertation, the features described in Section 3.3 are employed in all three applications to demonstrate the efficiency of the proposed framework. Second, the training images used to extract the aspect-graph of the objects in each application are randomly sampled from the viewing sphere of a robot platform, on which a Pan-Tilt-Zoom camera is mounted in a fixed position. Third, the efficiency of the proposed framework is evaluated with an object database and testing images that belong to the same category. For example, 2D rigid-object testing images are tested against the rigid object database.

The similarity measures described in Section 3.3 are employed in the three applications, and three kinds of combining structures are built from them. In 3D rigid object recognition, two AODs are utilized with two main features, MAG and PPL. The weighted combination of the similarity measures is described in Eqs. (3-25) and (3-26).

$$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), \quad n \in N^1 \tag{3-25}$$

$$d^2(V_j^i, C_m^n) = d_{\min}^1(n) + \omega_2^2 \cdot d_m^2(V_j^i, C_m^n), \quad n \in N^2 \tag{3-26}$$

Moreover, $d_m^1(V_j^i, C_m^n)$ is calculated with MAG using Eq. (3-14), and $d_m^2(V_j^i, C_m^n)$ is calculated with PPL using Eq. (3-15). Furthermore, the weighting parameters $\omega_1^1$ and $\omega_1^2$ are both set to zero, and the weighting parameter $\omega_2^2$ is defined by Eq. (3-27). $T_4^{d_m^1}$ and $T_4^{d_m^2}$ are the threshold values applied in the ISAG and are defined in Section 4.1.

$$\omega_2^2 = T_4^{d_m^1} / T_4^{d_m^2} \tag{3-27}$$
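The two-AOD combination of Eqs. (3-25) and (3-26) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation: the dictionary layout, the function names `first_stage` and `second_stage`, and the reading of the candidate set as the half of the objects with the smallest minimum distance are all assumptions. The weight `omega_22` is passed in as a number rather than computed from the thresholds.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: distances[obj] holds the distances between the test view
# and every characteristic view of object `obj` in one AOD.
Distances = Dict[int, List[float]]


def first_stage(d_m1: Distances) -> Tuple[Dict[int, float], List[int]]:
    """Eq. (3-25): with omega_1^1 = 0 the first-AOD distance is d_m^1 itself.

    Returns d_min^1(n) for every object n and the candidate set N^2.
    Assumption: N^2 is the half of the objects with the smallest d_min^1.
    """
    d_min1 = {obj: min(ds) for obj, ds in d_m1.items()}
    ranked = sorted(d_min1, key=lambda obj: d_min1[obj])
    keep = max(1, len(ranked) // 2)
    return d_min1, ranked[:keep]


def second_stage(
    d_min1: Dict[int, float],
    candidates: List[int],
    d_m2: Distances,
    omega_22: float,
) -> List[Tuple[float, int, int]]:
    """Eq. (3-26): d^2 = d_min^1(n) + omega_2^2 * d_m^2 for n in N^2.

    Returns the three smallest (distance, object, view index) triples.
    """
    scored = [
        (d_min1[obj] + omega_22 * d, obj, view)
        for obj in candidates
        for view, d in enumerate(d_m2[obj])
    ]
    return sorted(scored)[:3]
```

Only the objects that survive the first stage are scored in the second AOD, so the PPL distances need to be evaluated for the candidate set alone.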

In human posture recognition, only one AOD is utilized, with one main feature, MAG, and one assistant feature, $\theta_z$. The weighted combination of the similarity measures is described in Eq. (3-28).

$$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n) + \omega_1^1 \cdot d_a^1(V_j^i, C_m^n), \quad n \in N^1 \tag{3-28}$$

In Eq. (3-28), $d_m^1(V_j^i, C_m^n)$ is calculated using Eq. (3-14) and $d_a^1(V_j^i, C_m^n)$ is calculated using Eq. (3-17). Furthermore, the weighting parameter $\omega_1^1$ is defined by Eq. (3-29), where $T_5^{d_a^1}$ is the threshold value applied in the ISAG and is defined in Section 4.1.

$$\omega_1^1 = 1 / T_5^{d_a^1} \tag{3-29}$$
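As a small illustration of Eqs. (3-28) and (3-29), the sketch below ranks candidate characteristic views by the combined main and assistant distances. The function names, the tuple layout `(label, d_main, d_assist)` and the numeric inputs a caller would supply are hypothetical; only the two formulas come from the text.

```python
from typing import List, Tuple


def assistant_weight(t5: float) -> float:
    """Eq. (3-29): omega_1^1 = 1 / T_5, with T_5 the assistant-feature threshold."""
    if t5 <= 0:
        raise ValueError("threshold must be positive")
    return 1.0 / t5


def combined_distance(d_main: float, d_assist: float, omega: float) -> float:
    """Eq. (3-28): d^1 = d_m^1 + omega_1^1 * d_a^1."""
    return d_main + omega * d_assist


def top_three(views: List[Tuple[str, float, float]], omega: float) -> List[str]:
    """Rank (label, d_main, d_assist) triples and keep the three closest labels."""
    ranked = sorted(views, key=lambda v: combined_distance(v[1], v[2], omega))
    return [label for label, _, _ in ranked[:3]]
```

Dividing the assistant distance by its threshold puts it on a scale relative to that threshold before it is added to the main distance. Calling `top_three` with `omega` set to zero ranks by the main feature alone.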

In scene recognition, only one AOD is utilized, with one main feature, BM. The weighted combination of the similarity measures is described in Eq. (3-30).

$$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), \quad n \in N^1 \tag{3-30}$$

In Eq. (3-30), $d_m^1(V_j^i, C_m^n)$ is calculated using Eq. (3-16). Furthermore, the weighting parameter $\omega_1^1$ is set to zero.

Chapter 4 Experimental Results

This chapter provides experimental results to assess the efficiency of the proposed 3D object recognition system. In Section 4.1, five experiments test the robustness of the BSHSR against a complex background in an indoor environment. The BSHSR is then applied to extract foreground regions for building a 3D object database using the ISAG and for testing the performance of the proposed 3D object recognition system. In Section 4.2, the proposed method is applied to three object recognition problems, namely rigid object recognition, human posture recognition and scene recognition.

4.1 BSHSR

The video data for the experiments was obtained using a SONY EVI-D30 PTZ camera in an indoor environment. A morphological filter was applied to remove noise, and the camera controls were set to automatic mode. The same threshold values were used for all experiments: $N_G = 15$, $\alpha = 0.02$, $P_B = 0.1$, $B_0 = 0.7$, $B_1 = 300$ and $B_2 = 0.8$. The computational speed was around five frames per second on a P4 2.8 GHz PC with a frame size of 320 × 240.

4.1.1 Local Illumination Changes

The first experiment tested the robustness of the proposed method against local illumination changes. Local illumination changes caused by desk lights, which are usually white or yellow, occur constantly in indoor environments. Two video clips containing several desk-light changes were collected to simulate local illumination changes. Figure 4-1(a) shows 15 representative samples from the first video clip. Figure 4-1(b) shows the pixel classification obtained by the proposed method with the CBM and CSIM, where red indicates shadow, green indicates highlight and blue indicates foreground. Figure 4-1(c) displays the final background subtraction result, where white and black represent the foreground and background pixels, respectively.

The image sequence comprises different levels of illumination change. The desk light was turned on at the 476th frame, and its brightness increased until the 1000th frame. In the corresponding frames of Fig. 4-1(b), almost the whole picture is classified as foreground because the CBM lacks such information. However, the final background subtraction result for these frames in Fig. 4-1(c) remains good because the proposed scheme combines the CBM, CSIM and GBM. The desk light was then turned off at the 1030th frame, and the scene became darker until the 1300th frame. The original Gaussian distribution in the ECBM became a component of the CCBM, and a new representative Gaussian distribution was constructed in the ECBM, because the frames newly collected between the 476th and 1000th frames outnumbered the 300 frames collected initially and thus introduced new background information. Consequently, the 1300th frame in Fig. 4-1(b) contains many foreground regions, although its final result is still good. As the background model records the background changes, the illumination changes are all modeled into the LTCBM, and the area of the red, blue and green regions shrinks after the 1300th frame.

Table 4-1 compares the proposed scheme with the method proposed by Hoprasert [60]. The ground truth for the comparison was obtained by manually labeling the foreground regions of each frame. Because the CSIM can be constructed from an appropriate representative Gaussian distribution chosen from the LTCBM and STCBM, both the ability to handle illumination variation and the accuracy of the background subtraction are improved, as shown in Table 4-1.

Table 4-1 Robustness comparison between the proposed method and that proposed by Hoprasert [60] under local illumination changes with a yellow desk light

| Frame | Proposed (%*) | Hoprasert [60] (%*) |
|---|---|---|
| 476 | 100.00 | 94.05 |
| 480 | 99.84 | 36.40 |
| 500 | 99.93 | 22.50 |
| 580 | 99.91 | 15.38 |
| 650 | 83.96 | 23.42 |
| 750 | 91.50 | 31.51 |
| 900 | 93.10 | 30.91 |
| 1000 | 95.44 | 34.26 |
| 1030 | 97.75 | 38.28 |
| 1120 | 99.15 | 32.90 |
| 1150 | 93.79 | 50.72 |
| 1300 | 99.95 | 99.84 |
| 1330 | 93.31 | 92.40 |
| 1400 | 96.22 | 13.03 |
| 1600 | 99.30 | 34.66 |

*: Each value is the recognition rate (%), that is, the number of correct background pixels in a frame divided by the total number of pixels in the frame.
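The recognition rate defined in the table footnote can be computed as in the following sketch. It takes the footnote literally: a pixel counts as a correct background pixel when it is background in the manually labeled ground truth and is also classified as background. The function name and the mask layout (rows of booleans, `True` meaning foreground) are assumptions made for illustration.

```python
from typing import List

Mask = List[List[bool]]  # True marks a foreground pixel


def background_recognition_rate(predicted_fg: Mask, truth_fg: Mask) -> float:
    """Percentage of pixels that are background in both masks."""
    total = 0
    correct_background = 0
    for pred_row, truth_row in zip(predicted_fg, truth_fg):
        for pred, truth in zip(pred_row, truth_row):
            total += 1
            if not pred and not truth:
                correct_background += 1
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    return 100.0 * correct_background / total
```

Under this reading, a frame without any foreground object reaches 100% only when every pixel is classified as background.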

Figure 4-1 Results under local illumination changes with a yellow desk light. Each of panels (a)-(c) shows the background image followed by frames 476, 480, 500, 580, 650, 750, 900, 1000, 1030, 1120, 1150, 1300, 1330, 1400 and 1600; the number below each picture is the frame index. (a) Original images. (b) Results of pixel classification, where red indicates shadow, green indicates highlight and blue indicates foreground. (c) Results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.

Figure 4-2(a) shows an image sequence similar to that in Fig. 4-1(a). The two sequences differ only in the color of the desk light. The desk light was turned on at the 660th frame, and the same brightness was maintained until the 950th frame. The desk light was then turned off at the 1006th frame and turned on again at the 1180th frame. The results of shadow and highlight removal are shown in Fig. 4-2(b), and the results of the final background subtraction are shown in Fig. 4-2(c). Together with the comparison in Table 4-2, these results demonstrate the robustness of the proposed scheme.

Table 4-2 Robustness comparison between the proposed method and that proposed by Hoprasert [60] under local illumination changes with a white desk light

| Frame | Proposed (%*) | Hoprasert [60] (%*) |
|---|---|---|
| 660 | 99.02 | 99.48 |
| 665 | 97.93 | 79.81 |
| 670 | 95.92 | 92.22 |
| 860 | 96.73 | 93.81 |
| 950 | 97.44 | 94.46 |
| 1006 | 98.12 | 95.65 |
| 1020 | 99.94 | 98.85 |
| 1150 | 99.78 | 99.68 |
| 1180 | 98.94 | 99.08 |
| 1250 | 97.28 | 93.81 |
| 1300 | 97.49 | 95.26 |
| 1375 | 97.73 | 87.50 |
| 1377 | 98.83 | 98.92 |
| 1380 | 99.73 | 99.32 |
| 1445 | 100.00 | 99.71 |

*: Each value is the recognition rate (%), that is, the number of correct background pixels in a frame divided by the total number of pixels in the frame.

Figure 4-2 Results under local illumination changes with a white desk light. Each of panels (a)-(c) shows the background image followed by frames 660, 665, 670, 860, 950, 1006, 1020, 1150, 1180, 1250, 1300, 1375, 1377, 1380 and 1445; the number below each picture is the frame index. (a) Original images. (b) Results of pixel classification, where red indicates shadow, green indicates highlight and blue indicates foreground. (c) Results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.

4.1.2 Global Illumination Changes

The second experiment tested the robustness of the proposed method against global illumination changes. In the image sequence, a fluorescent lamp was turned on at the 381st frame and more lamps were turned on at the 430th frame. The illumination changes are modeled into the LTCBM as the proposed background model records the background changes. Notably, the area of the red, blue and green regions has decreased by the 580th frame. After the third daylight lamp is switched on at the 650th frame, clearly fewer blue regions appear at the 845th frame because the illumination changes have been modeled in the LTCBM. In all frames, the final background subtraction results shown in Fig. 4-3(c) are better than those of the pure color-based background subtraction shown in Fig. 4-3(b). Table 4-3 compares the proposed scheme with that proposed by Hoprasert [60]. The comparison demonstrates that the proposed scheme is robust to global illumination changes.

Table 4-3 Comparison between the proposed method and that proposed by Hoprasert [60] under global illumination changes with fluorescent lamps

| Frame | Lamps on** | Proposed (%*) | Hoprasert [60] (%*) |
|---|---|---|---|
| 381 | 1 | 98.24 | 93.54 |
| 385 | 1 | 88.35 | 82.14 |
| 405 | 1 | 83.85 | 78.24 |
| 430 | 2 | 56.50 | 68.42 |
| 560 | 2 | 66.85 | 69.82 |
| 565 | 2 | 79.87 | 69.30 |
| 570 | 2 | 96.88 | 69.69 |
| 580 | 2 | 99.08 | 69.55 |
| 650 | 3 | 99.23 | 45.62 |
| 700 | 3 | 99.49 | 46.22 |
| 845 | 3 | 99.56 | 46.18 |
| 910 | 3 | 99.39 | 53.58 |
| 1000 | 3 | 99.85 | 57.87 |
| 1050 | 3 | 99.93 | 60.83 |
| 1110 | 3 | 99.64 | 60.32 |

*: Each value is the recognition rate (%), that is, the number of correct background pixels in a frame divided by the total number of pixels in the frame.

**: The number of fluorescent lamps that have been turned on.

Figure 4-3 Results under global illumination changes with fluorescent lamps. Each of panels (a)-(c) shows the background image followed by frames 381, 385, 405, 430, 560, 565, 570, 580, 650, 700, 845, 910, 1000, 1050 and 1110; the number below each picture is the frame index. (a) Original images. (b) Results of pixel classification, where red indicates shadow, green indicates highlight and blue indicates foreground. (c) Results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.

4.1.3 Foreground Detection

In the third experiment (Fig. 4-4), a person walks into the monitored area, and the foreground region is extracted effectively regardless of the shadows and highlights in the indoor environment. Because the captured video clip contains little illumination variation and little dynamic background variation, the recognition rates of the final background subtraction obtained by the proposed method and by that of Hoprasert [60] are about the same, as listed in Table 4-4.

Table 4-4 Comparison between the proposed method and that proposed by Hoprasert [60] for foreground detection

| Frame | Proposed (%*) | Hoprasert [60] (%*) |
|---|---|---|
| 380 | 90.45 | 89.18 |
| 450 | 86.50 | 85.80 |
| 530 | 89.38 | 88.87 |
| 590 | 88.45 | 87.72 |
| 620 | 88.67 | 88.76 |
| 680 | 91.07 | 90.62 |
| 700 | 85.63 | 85.15 |
| 735 | 82.76 | 80.71 |
| 755 | 92.44 | 92.46 |
| 840 | 100.00 | 99.61 |

*: Each value is the recognition rate (%), that is, the number of correct background pixels in a frame divided by the total number of pixels in the frame.

4.1.4 Dynamic Background

In the fourth experiment (Fig. 4-5), the image sequence contains swaying clothes hung on a frame. The proposed method gradually recognizes the clothes as background owing to the ability of the LTCBM to record the history of background changes. In situations involving large variations of a dynamic background, a representative initial color-based background model can be established by using more training frames to handle the variations.

Figure 4-4 Results of foreground detection. Each of panels (a)-(c) shows the background image followed by frames 380, 450, 530, 590, 620, 680, 700, 735, 755 and 840. (a) Original images. (b) Results of pixel classification, where red indicates shadow, green indicates highlight and blue indicates foreground. (c) Results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.

Figure 4-5 Results of background subtraction with a dynamic background. Each of panels (a)-(c) shows the background image followed by frames 500, 540, 580, 620, 660, 700, 740, 780, 820, 860, 900, 940, 980, 1020 and 1060. (a) Original images. (b) Results of pixel classification, where red indicates shadow, green indicates highlight and blue indicates foreground. (c) Results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.

4.1.5 Short-Term Color-based Background Model (STCBM)

The final experiment (Fig. 4-6) shows the advantage of adding the STCBM. A doll is placed on the desk at the 360th frame. Initially it is regarded as foreground, and by the 560th frame the foreground region has become background owing to the LTCBM. However, the Gaussian component belonging to the doll still does not have the highest weight. Without the STCBM, when a hand is placed above the doll at the 590th frame, the foreground regions at the 670th frame remain the same as those at the 590th frame, as shown in Fig. 4-6(b). With the STCBM, the foreground regions under the hand become shadows at the 670th frame in Fig. 4-6(c), because shadow and highlight removal works well when a representative Gaussian component is selected based on the STCBM.

This experiment demonstrates the benefit of the STCBM: a representative Gaussian component of the CBM can be selected by considering both the long-term and the short-term tendency. In addition, the STCBM helps to reduce the computing time used in the GBM and to increase the recognition rate of foreground detection.

Figure 4-6 The advantage of the STCBM, where red indicates shadow, green indicates highlight and blue indicates foreground. (a) Original images. (b) Results of background subtraction without the STCBM. (c) Results of background subtraction with the STCBM.

4.2 3D Object Recognition

This section describes several experiments demonstrating the effectiveness of the proposed 3D object recognition. A SONY EVI-D30 PTZ camera was employed to capture object views. The following three databases were built to test the proposed method: Fig. 4-7 shows the 12 3D rigid objects, Fig. 4-8 shows the eight 3D human postures, and Fig. 4-9 shows the 11 scenes.

[Figure 4-6, panels (a)-(c): frames 301, 360, 560, 590, 670 and 740.]

Figure 4-7 The first database containing 12 3D rigid objects (Object 1 to Object 12).

Figure 4-8 The second image database containing eight 3D human postures (Posture 1 to Posture 8).

Figure 4-9 The third image database containing 11 scenes (Scene 1 to Scene 11).

The notations $V_d^{1,j}$ and $V_d^{2,j}$ denote the sets of training views captured at 5° intervals, where $V_d^{1,j}$ is employed for rigid object recognition and $V_d^{2,j}$ for human posture recognition. The notation $V_d^{3,j}$, which denotes the set of training views captured at each location in 1° increments, is utilized for scene recognition. Moreover, $V_t^{1,j}$ and $V_t^{2,j}$ denote the sets of testing views captured at the trisection points between each pair of adjacent training points separated by 5°, where $V_t^{1,j}$ is utilized for rigid object recognition and $V_t^{2,j}$ for human posture recognition. Finally, $V_t^{3,j}$, which is utilized for scene recognition, denotes the set of testing views captured at locations displaced from the original locations in four directions (forward, backward, left and right), at five distances (5 cm, 10 cm, 15 cm, 20 cm and 50 cm) and with five covering rates (5%, 10%, 15%, 20% and 50%). The descriptions of the captured views are given by Eqs. (4-1)-(4-6).

$$V_d^{1,j} = \{ v_d^{1,j}(i) \}, \quad \text{where } 1 \le j \le 12,\ 1 \le i \le 72 \tag{4-1}$$
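The relation between the training and testing viewpoints for the rigid object and human posture databases can be illustrated with the short sketch below. It assumes, for illustration only, that the views lie on a single circle around the object, so that 5° intervals give 72 training angles, and that the trisection points are the two interior points at one third and two thirds of each 5° gap. The function names are hypothetical, and exact fractions are used to avoid rounding the thirds of a degree.

```python
from fractions import Fraction
from typing import List


def training_angles(step_deg: int = 5) -> List[Fraction]:
    """Training viewpoints at regular intervals over a full turn."""
    return [Fraction(angle) for angle in range(0, 360, step_deg)]


def trisection_test_angles(step_deg: int = 5) -> List[Fraction]:
    """Testing viewpoints at the two interior trisection points of each gap."""
    third = Fraction(step_deg, 3)
    return [
        angle + k * third
        for angle in training_angles(step_deg)
        for k in (1, 2)
    ]
```

Under these assumptions no testing angle coincides with a training angle, so every testing view is unseen during database construction.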

In the following experiments, $T_0$ denotes the number of objects, and is 12 for rigid object recognition, 8 for human posture recognition and 11 for scene recognition; $T_1$ denotes the number of training views, and is 72 for rigid object recognition and human posture recognition and 61 for scene recognition. $T_2$ denotes the number of low-frequency components retained in the FD, and is 40 in the following experiments. Moreover, the threshold values used in the ISAG are listed in Table 4-5.

Table 4-5 The threshold values for the ISAG

| The first AOD: main feature | The first AOD: assistant feature | The second AOD: main feature | The second AOD: assistant feature |
|---|---|---|---|

The computing time for calculating the similarity between a test view and a view in the database was approximately 0.006 seconds for rigid object recognition, 0.004 seconds for human posture recognition and 0.01 seconds for scene recognition on a P4 3.2 GHz CPU with 1 GB of RAM.

4.2.1 Rigid Object Recognition

In the first experiment, the efficiency of the proposed framework was assessed on the first database (Fig. 4-7) using 2D views captured at random intervals. To determine the average performance of the proposed method, training views were generated by sampling the views in $V_d^{1,j}$ in 200 different random orders. Background subtraction was first performed on the training 2D views to extract the foreground objects. After that, Canny edge detection and GVF were applied to the extracted foreground objects to obtain the object contour. Two features, the MAG and PPL, were then extracted from the object contour and used for building the AODs with the ISAG (Fig. 3-2). The characteristic views of the aspects in each AOD are utilized for object matching. A recognition result is calculated with a weighted combination of the similarity measures from both AODs. Figure 4-10 illustrates the system architecture of the proposed framework for 3D rigid object recognition.

Figure 4-10 The system architecture of the proposed framework applied to the first experiment (3D rigid object recognition). Diagram labels: a 2D view sampled from an unknown object; feature extraction (MAG feature extraction, PPL feature extraction); similarity measure (the 1st similarity measure, the 2nd similarity measure); the 1st AOD, the 2nd AOD and the MOD; weighted combination; recognition results (top three matches).

Table 4-6 presents the mean numbers of aspects obtained using MAG and PPL. Symmetrical objects, such as objects 2, 5, 6 and 7, had few aspects, which reduces the computing time for recognizing them. The views in $V_t^{1,j}$ were adopted as unknowns and tested each time the aspect-graph representations were built (200 times). The high recognition rates of the Top 1 to Top 3 matches in Table 4-6 indicate that the proposed aspect-graph generation is effective.

Table 4-6 The result of rigid object recognition using 2D views via MAG and PPL. Objects are indexed as in the first database (Fig. 4-7).

| Object | Number of aspects (MAG) | Number of aspects (PPL) | Top 1 match (%) | Top 2 match (%) | Top 3 match (%) |
|---|---|---|---|---|---|
| 1 | 34.66 | 38.72 | 98.25 | 99.21 | 99.61 |
| 2 | 3.84 | 14.08 | 99.97 | 100 | 100 |
| 3 | 27.83 | 14.32 | 97.71 | 98.96 | 99.39 |
| 4 | 24.75 | 22.84 | 97.39 | 98.73 | 99.34 |
| 5 | 6.87 | 10.98 | 100 | 100 | 100 |
| 6 | 9.47 | 20.12 | 99.81 | 99.96 | 99.98 |
| 7 | 2.04 | 8.41 | 99.79 | 99.86 | 99.89 |
| 8 | 25.62 | 31.07 | 99.35 | 99.67 | 99.78 |
| 9 | 17.14 | 25.79 | 99.90 | 99.97 | 99.99 |
| 10 | 16.16 | 17.68 | 97.97 | 98.68 | 98.98 |
| 11 | 16.62 | 23.61 | 98.44 | 99.47 | 99.77 |
| 12 | 28.75 | 19.88 | 96.83 | 98.17 | 98.64 |
| Avg. | 17.81 | 20.63 | 98.78 | 99.39 | 99.62 |

The proposed method, which constructs an aspect-graph representation from views sampled at random intervals, provides a practicable updating mechanism that integrates newly collected views into the database. In this experiment, 18 random views sampled from $V_d^{1,j}$ are first utilized to construct a coarse aspect-graph representation of each object, called $D_{18}$. Eighteen additional random views are then taken from the remaining views in $V_d^{1,j}$ to increase the accuracy of the database $D_{18}$, yielding $D_{36}$. Similarly, $D_{54}$ and $D_{72}$ are constructed using the views remaining in $V_d^{1,j}$. Additionally, $D_{90}$ and $D_{108}$ are further constructed with extra random views sampled from $V_t^{1,j}$. Table 4-7 presents the average aspect numbers for each rigid object over 200 iterations. Although the aspect numbers increase when new views are employed to update the coarse database, the number of stored views remains significantly smaller than the number of original views. Figure 4-11 presents the recognition rates obtained with the coarse-to-fine databases, and Figure 4-12 presents the standard deviations of the recognition rates. The recognition rate increases when the aspect-graph representations are trained using additional object views. Moreover, stability increases, as shown by the decreasing standard deviation. Therefore, the proposed method is effective for updating aspect-graph representations without re-sorting all of the collected views or re-calculating all of the similarity measures.
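The Top 1 to Top 3 match rates and their spread over repeated random orderings can be computed along the lines of the sketch below. The function names and the input layout (one ranked list of object indices per test view) are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def top_k_rate(rankings: Sequence[Sequence[int]], truths: Sequence[int], k: int) -> float:
    """Percentage of test views whose true object appears among the first k matches."""
    if not truths:
        raise ValueError("at least one test view is required")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return 100.0 * hits / len(truths)


def summarize(trial_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of the rates over repeated trials."""
    return mean(trial_rates), pstdev(trial_rates)
```

Running `top_k_rate` once per random ordering and passing the resulting rates to `summarize` gives one mean and one standard deviation per database size, which is the kind of summary plotted against the coarse-to-fine databases.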

Table 4-7 Results for numbers of aspects using MAG and PPL after updating with additional training views

The index of the objects in the first database listed in Figure 4-7

