Chapter 3 Incremental Similarity-Based Aspect-Graph 3D Object Recognition
3.4 Flexible 3D Object Recognition Framework
$d^1(V_j^i, C_m^n)$ (Eq. (3-24.1)) are regarded as the top-three matches. In Eq. (3-24.1), $\omega_1^k$ is a weighting parameter for combining different similarity measures. When no assistant feature is utilized, $\omega_1^k$ is set to zero. Otherwise, the objects included in the smallest half of the similarity distances $d^k(V_j^i, C_m^n)$ are defined as a set $N^{k+1}$. The objects in $N^{k+1}$ are preserved for further recognition in the $(k+1)$th AOD.
If the framework comprises two or more AODs, the characteristic views with the three smallest similarity distances in $d^k(V_j^i, C_m^n)$ (Eq. (3-24.2)) are regarded as the top-three matches. In Eq. (3-24.2), $d_{\min}^{k-1}(n)$ denotes the minimum similarity distance between the unknown object and the $n$th candidate object in the $(k-1)$th database. Moreover, $\omega_2^k$ is a weighting parameter for combining the similarity measures of the $k$th and $(k-1)$th AODs.
3.4.3 Applications
The proposed framework is evaluated on various object recognition problems, including 3D object recognition, human posture recognition, and scene recognition.
Three assumptions are made in applying the proposed framework to these three applications. First, different features are used in different applications. In this dissertation, the features described in Section 3.3 are employed in the three applications to evaluate the efficiency of the proposed framework. Second, the training images used to extract the aspect-graph of objects in each application are randomly sampled from the viewing sphere of a robot platform, on which a Pan-Tilt-Zoom camera is mounted in a fixed position.
Third, the efficiency of the proposed framework is evaluated with an object database and testing images that belong to the same category as that database. For example, 2D rigid-object testing images are tested against the rigid object database, and so on.
The similarity measures described in Section 3.3 are employed in the three applications, and three kinds of combining structures are built from them. In 3D rigid object recognition, two AODs are utilized with two main features, MAG and PPL. The weighted combination of the similarity measures is described in Eqs. (3-25) and (3-26).
$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), \quad n \in N^1$ (3-25)
$d^2(V_j^i, C_m^n) = d_{\min}^1(n) + \omega_2^2 \cdot d_m^2(V_j^i, C_m^n), \quad n \in N^2$ (3-26)
Moreover, $d_m^1(V_j^i, C_m^n)$ is calculated with MAG using Eq. (3-14), and $d_m^2(V_j^i, C_m^n)$ is calculated with PPL using Eq. (3-15). Furthermore, the weighting parameters $\omega_1^1$ and $\omega_1^2$ are both set to zero, and the weighting parameter $\omega_2^2$ is defined by Eq. (3-27). $T_{4dM}^1$ and $T_{4dM}^2$ are the threshold values applied in the ISAG, and are defined in Section 4.1.
$\omega_2^2 = T_{4dM}^2 / T_{4dM}^1$ (3-27)
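The two-AOD combination of Eqs. (3-25) and (3-26) can be sketched in a few lines. This is a minimal illustration rather than the dissertation's implementation: the function names, the dictionary layout (object index mapped to a list of distances, one per characteristic view), the toy distance values and the weight 0.5 are all assumptions, and the pruning step follows a literal reading of "the objects included in the smallest half of the similarity distances".

```python
import math

def first_stage(d1_main):
    """Eq. (3-25): in the first AOD, d^1 is just the main-feature distance.

    d1_main maps an object index n to the distances between the unknown
    view and each characteristic view of that object (assumed layout).
    Returns d_min^1(n) for the surviving objects and the surviving set N^2.
    """
    flat = sorted((d, n) for n, dists in d1_main.items() for d in dists)
    half = flat[: math.ceil(len(flat) / 2)]      # smallest half of all distances
    survivors = {n for _, n in half}             # objects appearing in that half
    d_min = {n: min(d1_main[n]) for n in survivors}
    return d_min, survivors

def second_stage(d_min1, survivors, d2_main, omega22, top=3):
    """Eq. (3-26): d^2 = d_min^1(n) + omega_2^2 * d_m^2, only for n in N^2."""
    scored = []
    for n in sorted(survivors):
        for m, d2 in enumerate(d2_main[n]):
            scored.append((d_min1[n] + omega22 * d2, n, m))
    scored.sort()
    return [(n, m) for _, n, m in scored[:top]]  # (object, characteristic view)

# Toy distances (hypothetical): four objects, two characteristic views each.
d1 = {1: [0.2, 0.9], 2: [0.5, 0.6], 3: [1.5, 1.8], 4: [2.0, 2.5]}
d2 = {1: [4.0, 2.0], 2: [1.0, 6.0], 3: [0.1, 0.2], 4: [0.3, 0.4]}

d_min1, n2 = first_stage(d1)
top_three = second_stage(d_min1, n2, d2, omega22=0.5)
print(n2, top_three)
```

In this toy run object 3 has the smallest second-stage distances, yet it never reaches the second AOD because it was pruned in the first stage, which is the intended effect of carrying only $N^2$ forward.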
In human posture recognition, only one AOD is utilized, with one main feature, MAG, and one assistant feature, $\theta_z$. The weighted combination of the similarity measures is described in Eq. (3-28).
$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n) + \omega_1^1 \cdot d_a^1(V_j^i, C_m^n), \quad n \in N^1$ (3-28)
In Eq. (3-28), $d_m^1(V_j^i, C_m^n)$ is calculated using Eq. (3-14) and $d_a^1(V_j^i, C_m^n)$ is calculated using Eq. (3-17). Furthermore, the weighting parameter $\omega_1^1$ is defined by Eq. (3-29), where $T_{5da}^1$ is the threshold value applied in the ISAG, and is defined in Section 4.1.
$\omega_1^1 = 1 / T_{5da}^1$ (3-29)
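As a small worked example of Eqs. (3-28) and (3-29), the sketch below adds the assistant-feature distance, scaled by the reciprocal of its threshold, to the main-feature distance and ranks the candidates. The function names, the candidate labels, the distance values and the threshold of 100 are hypothetical and serve only to illustrate the arithmetic.

```python
def combined_distance(d_main, d_assist, t_assist):
    """Eq. (3-28) with omega_1^1 = 1 / T (Eq. (3-29))."""
    omega = 1.0 / t_assist
    return d_main + omega * d_assist

def top_matches(candidates, t_assist, top=3):
    """candidates: (label, d_main, d_assist) tuples, one per characteristic view."""
    ranked = sorted(candidates,
                    key=lambda c: combined_distance(c[1], c[2], t_assist))
    return [label for label, _, _ in ranked[:top]]

# Hypothetical distances for four candidate characteristic views.
candidates = [
    ("posture 1", 0.30, 10.0),   # combined: 0.30 + 0.10 = 0.40
    ("posture 2", 0.25, 40.0),   # combined: 0.25 + 0.40 = 0.65
    ("posture 3", 0.50, 5.0),    # combined: 0.50 + 0.05 = 0.55
    ("posture 4", 0.90, 0.0),    # combined: 0.90
]
ranking = top_matches(candidates, t_assist=100.0)
print(ranking)
```

With the main feature alone, "posture 2" would rank first; the assistant term moves it to third place, which shows how the assistant feature can reorder candidates whose main-feature distances are close.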
In scene recognition, only one AOD is utilized, with one main feature, BM. The weighted combination of the similarity measures is described in Eq. (3-30).
$d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), \quad n \in N^1$ (3-30)
In Eq. (3-30), $d_m^1(V_j^i, C_m^n)$ is calculated using Eq. (3-16). Furthermore, the weighting parameter $\omega_1^1$ is set to zero.
Chapter 4
Experimental Results
This chapter provides experimental results to assess the efficiency of the proposed 3D object recognition system. In Section 4.1, five experiments are performed to test the robustness of the BSHSR against a complex background in an indoor environment.
After that, the BSHSR is applied to extract foreground regions, both for building a 3D object database using the ISAG and for testing the performance of the proposed 3D object recognition system. Three object recognition problems, namely rigid object recognition, human posture recognition, and scene recognition, are addressed with the proposed method in Section 4.2.
4.1 BSHSR
The video data for the experiments were obtained using a SONY EVI-D30 PTZ camera in an indoor environment. A morphological filter was applied to remove noise, and the camera controls were set to automatic mode. The same threshold values were used for all experiments: $N_G = 15$, $\alpha = 0.002$, $P_B = 0.1$, $B_0 = 0.7$, $B_1 = 300$ and $B_2 = 0.8$. The computational speed was around five frames per second on a P4 2.8 GHz PC with a frame size of 320 × 240.
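The text does not say which morphological filter was used for noise removal. As one possible reading, the sketch below applies a binary opening (erosion followed by dilation) with a 3 × 3 structuring element to a foreground mask; the choice of opening, the element size and the function names are assumptions made for illustration only.

```python
def erode(mask):
    """3x3 binary erosion; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out

def dilate(mask):
    """3x3 binary dilation, clipped at the image border."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[yy][xx] = 1
    return out

def opening(mask):
    """Erosion then dilation: removes blobs smaller than the 3x3 element."""
    return dilate(erode(mask))

# A 3x3 foreground block plus one isolated noise pixel at row 2, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(noisy)
for row in cleaned:
    print(row)
```

The isolated pixel disappears while the 3 × 3 block survives unchanged, which is the behaviour one would want from a noise-removal step applied to a foreground mask.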
4.1.1 Local Illumination Changes
The first experiment tested the robustness of the proposed method to local illumination changes. Local illumination changes resulting from desk lights occur constantly in indoor environments, and desk lights are usually white or yellow. Two video clips containing several changes of desk light were collected to simulate local illumination changes. Figure 4-1(a) shows 15 representative samples of the first video clip. Figure 4-1(b) shows the classification of foreground pixels obtained with the CBM and CSIM of the proposed method, where red indicates shadow, green indicates highlight and blue indicates foreground. Figure 4-1(c) displays the final background subtraction results to demonstrate the robustness of the proposed method, where white and black represent foreground and background pixels, respectively. The image sequence comprises different levels of illumination change. The desk light was turned on at the 476th frame and its brightness increased until the 1000th frame. Almost the whole picture is classified as foreground in the corresponding frames of Fig. 4-1(b) owing to the lack of such information in the CBM. However, the final background subtraction results for the corresponding frames in Fig. 4-1(c) are still good owing to the proposed scheme, which combines the CBM, CSIM and GBM. The desk light was then turned off at the 1030th frame, and the scene became darker until the 1300th frame. The original Gaussian distribution in the ECBM became a component in the CCBM, and a new representative Gaussian distribution was constructed in the ECBM because more background information had been collected from the frames between the 476th and the 1000th frame than from the 300 initially collected frames. Consequently, the 1300th frame in Fig. 4-1(b) has many foreground regions. However, the final result for the 1300th frame is still good. The illumination changes are all modeled into the LTCBM as the background model records the background changes, and the area of the red, blue and green regions decreases after the 1300th frame.
Table 4-1 compares the proposed scheme with the method proposed by Hoprasert [60]. The comparison criteria were obtained by manually labeling the foreground regions of each frame. Because the CSIM can be constructed from an appropriate representative Gaussian distribution chosen from the LTCBM and STCBM, both the ability to handle illumination variation and the accuracy of the background subtraction are improved, as shown in Table 4-1.
Table 4-1 The robustness test between the proposed method and that proposed by Hoprasert [60] via local illumination changes with a yellow desk light
Frame                    476     480     500     580     650
Proposed (%*)         100.00   99.84   99.93   99.91   83.96
Hoprasert [60] (%*)    94.05   36.40   22.50   15.38   23.42

Frame                    750     900    1000    1030    1120
Proposed (%*)          91.50   93.10   95.44   97.75   99.15
Hoprasert [60] (%*)    31.51   30.91   34.26   38.28   32.90

Frame                   1150    1300    1330    1400    1600
Proposed (%*)          93.79   99.95   93.31   96.22   99.30
Hoprasert [60] (%*)    50.72   99.84   92.40   13.03   34.66
*: The value in the table is the recognition rate, i.e., the number of correct background pixels in a frame divided by the total number of pixels in the frame (%).
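The recognition rate defined in the table footnote can be written out as a short function. This is a minimal sketch assuming binary masks stored as nested lists with 0 for background and 1 for foreground; the function name, the mask encoding and the toy masks are illustrative and not taken from the dissertation.

```python
def recognition_rate(predicted, ground_truth):
    """Correct background pixels divided by total pixels, in percent.

    A pixel counts as a correct background pixel when both the predicted
    mask and the manually labeled mask mark it as background (0).
    """
    total = 0
    correct_background = 0
    for pred_row, truth_row in zip(predicted, ground_truth):
        for pred, truth in zip(pred_row, truth_row):
            total += 1
            if pred == 0 and truth == 0:
                correct_background += 1
    return 100.0 * correct_background / total

# Toy 2x5 frame: two true foreground pixels, one false alarm in the prediction.
truth = [[0, 0, 0, 0, 1],
         [0, 0, 0, 0, 1]]
pred = [[0, 0, 0, 1, 1],
        [0, 0, 0, 0, 1]]
rate = recognition_rate(pred, truth)
print(rate)
```

Under this definition a frame that contains a real foreground object cannot reach 100%, because the object's pixels are never counted as correct background.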
Figure 4-1 The results of illumination changes with a yellow desk light; the number below each picture is the frame index (Background, 476, 480, 500, 580, 650, 750, 900, 1000, 1030, 1120, 1150, 1300, 1330, 1400 and 1600). (a) Original images. (b) The results of pixel classification, where red indicates the shadow, green indicates the highlight and blue indicates the foreground. (c) The results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.
Figure 4-2(a) shows an image sequence similar to that in Fig. 4-1(a). The two sequences differ only in the color of the desk light. The desk light was turned on at the 660th frame and the same brightness was maintained until the 950th frame. The desk light was then turned off at the 1006th frame and turned on again at the 1180th frame.
The results of shadow and highlight removal are shown in Fig. 4-2(b), and the final background subtraction results are shown in Fig. 4-2(c). Together with the comparison in Table 4-2, these results demonstrate the robustness of the proposed scheme.
Table 4-2 The robustness test between the proposed method and that proposed by Hoprasert [60] via local illumination changes with a white desk light
Frame                    660     665     670     860     950
Proposed (%*)          99.02   97.93   95.92   96.73   97.44
Hoprasert [60] (%*)    99.48   79.81   92.22   93.81   94.46

Frame                   1006    1020    1150    1180    1250
Proposed (%*)          98.12   99.94   99.78   98.94   97.28
Hoprasert [60] (%*)    95.65   98.85   99.68   99.08   93.81

Frame                   1300    1375    1377    1380    1445
Proposed (%*)          97.49   97.73   98.83   99.73  100.00
Hoprasert [60] (%*)    95.26   87.50   98.92   99.32   99.71
*: The value in the table is the recognition rate, i.e., the number of correct background pixels in a frame divided by the total number of pixels in the frame (%).
Figure 4-2 The results of illumination changes with a white desk light; the number below each picture is the frame index (Background, 660, 665, 670, 860, 950, 1006, 1020, 1150, 1180, 1250, 1300, 1375, 1377, 1380 and 1445). (a) Original images. (b) The results of pixel classification, where red indicates the shadow, green indicates the highlight and blue indicates the foreground. (c) The results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.
4.1.2 Global Illumination Changes
The second experiment tested the robustness of the proposed method to global illumination changes. In the image sequence, one fluorescent lamp was turned on at the 381st frame and a second lamp was turned on at the 430th frame. The illumination changes are then modeled into the LTCBM as the proposed background model records the background changes. Notably, the area of the red, blue and green regions decreases at the 580th frame. After the third daylight lamp is switched on at the 650th frame, fewer blue regions appear by the 845th frame because the illumination changes have been modeled in the LTCBM. In all cases, the final background subtraction results shown in Fig. 4-3(c) are better than those of pure color-based background subtraction shown in Fig. 4-3(b). Table 4-3 compares the proposed scheme with that proposed by Hoprasert [60]. The comparison demonstrates that the proposed scheme is robust to global illumination changes.
Table 4-3 The comparison between the proposed method and that proposed by Hoprasert [60] via global illumination changes with fluorescent lamps
Frame                 381 (1**)   385 (1**)   405 (1**)   430 (2**)   560 (2**)
Proposed (%*)             98.24       88.35       83.85       56.50       66.85
Hoprasert [60] (%*)       93.54       82.14       78.24       68.42       69.82

Frame                 565 (2**)   570 (2**)   580 (2**)   650 (3**)   700 (3**)
Proposed (%*)             79.87       96.88       99.08       99.23       99.49
Hoprasert [60] (%*)       69.30       69.69       69.55       45.62       46.22

Frame                 845 (3**)   910 (3**)  1000 (3**)  1050 (3**)  1110 (3**)
Proposed (%*)             99.56       99.39       99.85       99.93       99.64
Hoprasert [60] (%*)       46.18       53.58       57.87       60.83       60.32
*: The value in the table is the recognition rate, i.e., the number of correct background pixels in a frame divided by the total number of pixels in the frame (%).
**: The number inside the parentheses indicates the number of fluorescent lamps that have been turned on.
Figure 4-3 The results of global illumination changes with fluorescent lamps; the number below each picture is the frame index (Background, 381, 385, 405, 430, 560, 565, 570, 580, 650, 700, 845, 910, 1000, 1050 and 1110). (a) Original images. (b) The results of pixel classification, where red indicates the shadow, green indicates the highlight and blue indicates the foreground. (c) The results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.
4.1.3 Foreground Detection
In the third experiment (Fig. 4-4), a person enters the monitored area, and the foreground region is extracted effectively regardless of the influence of shadow and highlight in the indoor environment. Because the captured video clip has little illumination variation and little dynamic background variation, the recognition rates of the final background subtraction for the proposed method and that of Hoprasert [60] are about the same, as listed in Table 4-4.
Table 4-4 The comparison between the proposed method and that proposed by Hoprasert [60] via foreground detection
Frame                    380     450     530     590     620
Proposed (%*)          90.45   86.50   89.38   88.45   88.67
Hoprasert [60] (%*)    89.18   85.80   88.87   87.72   88.76

Frame                    680     700     735     755     840
Proposed (%*)          91.07   85.63   82.76   92.44  100.00
Hoprasert [60] (%*)    90.62   85.15   80.71   92.46   99.61
*: The value in the table is the recognition rate, i.e., the number of correct background pixels in a frame divided by the total number of pixels in the frame (%).
4.1.4 Dynamic Background
In the fourth experiment (Fig. 4-5), the image sequence consists of swaying clothes hung on a frame. The proposed method gradually recognizes the clothes as background owing to the ability of the LTCBM to record the history of background changes. In situations involving large variations of a dynamic background, a representative initial color-based background model can be established by using more training frames to handle the variations.
Figure 4-4 The results of foreground detection; the number below each picture is the frame index (Background, 380, 450, 530, 590, 620, 680, 700, 735, 755 and 840). (a) Original images. (b) The results of pixel classification, where red indicates the shadow, green indicates the highlight and blue indicates the foreground. (c) The results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.
Figure 4-5 The results of background subtraction with a dynamic background; the number below each picture is the frame index (Background, 500, 540, 580, 620, 660, 700, 740, 780, 820, 860, 900, 940, 980, 1020 and 1060). (a) Original images. (b) The results of pixel classification, where red indicates the shadow, green indicates the highlight and blue indicates the foreground. (c) The results of background subtraction with shadow removal using the proposed method, where dark indicates the background and white indicates the foreground.
4.1.5 Short-Term Color-based Background Model (STCBM)
The final experiment (Fig. 4-6) shows the advantage of adding the STCBM. A doll is placed on the desk at the 360th frame. Initially, it is regarded as foreground; by the 560th frame, the foreground region has become background owing to the LTCBM.
However, the Gaussian component belonging to the doll still does not have the highest weight. Without the STCBM, when a hand is placed above the doll at the 590th frame, the foreground regions at the 670th frame remain the same as those at the 590th frame, as shown in Fig. 4-6(b). With the STCBM, the foreground regions under the hand become shadows at the 670th frame in Fig. 4-6(c), because shadow and highlight removal works well with a representative Gaussian component selected on the basis of the STCBM.
This experiment demonstrates the benefit of the STCBM: a representative Gaussian component of the CBM can be selected by considering both the long-term and the short-term tendency. In addition, the STCBM helps to reduce the computing time used in the GBM and to increase the recognition rate of foreground detection.
Figure 4-6 The advantage of the STCBM. (a) Original images. (b) The results of background subtraction without the STCBM. (c) The results of background subtraction with the STCBM. Red indicates the shadow, green indicates the highlight and blue indicates the foreground.
4.2 3D Object Recognition
This section describes several experiments demonstrating the effectiveness of the proposed 3D object recognition method. A SONY EVI-D30 PTZ camera was employed to capture object views. The following three databases were built to test the proposed method: Fig. 4-7 contains 12 3D rigid objects, Fig. 4-8 contains eight 3D human postures, and Fig. 4-9 contains 11 scenes.
[Figure 4-6, panels (a) to (c): frames 301, 360, 560, 590, 670 and 740.]
Figure 4-7 The first database containing 12 3D rigid objects (Object 1 to Object 12).
Figure 4-8 The second image database containing eight 3D human postures (Posture 1 to Posture 8).
Figure 4-9 The third image database containing 11 scenes (Scene 1 to Scene 11).
The notations $V_d^{1,j}$ and $V_d^{2,j}$ denote the sets of training views captured at 5° intervals, where $V_d^{1,j}$ is employed in rigid object recognition and $V_d^{2,j}$ in human posture recognition. The notation $V_d^{3,j}$, which denotes the set of training views captured at each location at 1° increments, is utilized in scene recognition. Moreover, $V_t^{1,j}$ and $V_t^{2,j}$ denote the sets of testing views captured from the trisection points between each pair of points separated by 5°, where $V_t^{1,j}$ is utilized in rigid object recognition and $V_t^{2,j}$ in human posture recognition. Finally, $V_t^{3,j}$ denotes the set of testing views captured at locations away from the original locations in four directions (forward, backward, left and right), at five distances (5 cm, 10 cm, 15 cm, 20 cm and 50 cm) and with five covering rates (5%, 10%, 15%, 20% and 50%); $V_t^{3,j}$ is utilized in scene recognition. The descriptions of the captured views are given by Eqs. (4-1)-(4-6).
1,j { 1,j( )}, where 1 12, 1 72
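The 5° training grid and the trisection test points described above can be sketched as follows. This is only an illustration under the assumption that the views lie on a single 360° rotation, which is consistent with 360/5 = 72 training views per object; the function names are hypothetical and the actual sampling on the viewing sphere may differ.

```python
def training_angles(step=5):
    """Training view angles in degrees: 0, 5, ..., 355."""
    return [k * step for k in range(360 // step)]

def trisection_test_angles(step=5):
    """Two test views per interval, at one third and two thirds of the step."""
    angles = []
    for start in training_angles(step):
        angles.append(start + step / 3)
        angles.append(start + 2 * step / 3)
    return angles

train = training_angles()
test = trisection_test_angles()
print(len(train), len(test))
print(train[:3], [round(a, 2) for a in test[:4]])
```

Under this assumption no test view coincides with a training view, so every test view is at least 5/3 of a degree away from the nearest training view.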
In the following experiments, $T_0$ denotes the number of objects, and is 12 for rigid object recognition, 8 for human posture recognition, and 11 for scene recognition; $T_1$ denotes the number of training views, and is 72 for rigid object recognition and human posture recognition, and 61 for scene recognition. $T_2$ denotes the number of low-frequency terms retained in the FD, and is 40 in the following experiments. Moreover, the threshold values used in the ISAG are listed in Table 4-5.
Table 4-5 The threshold values for the ISAG
The first AOD                       The second AOD
Main Feature   Assistant Feature    Main Feature   Assistant Feature
Computing time for calculating similarity between a test view and a view in the database was approximately 0.006 seconds for rigid object recognition, 0.004 seconds for human posture recognition and 0.01 seconds for scene recognition on a P4 3.2G CPU with 1GB RAM.
4.2.1 Rigid Object Recognition
In the first experiment, the efficiency of the proposed framework was assessed using 2D views captured at random intervals with the first database (Fig. 4-7). To determine the average performance of the proposed method, training views were generated by sampling the views in $V_d^{1,j}$ in 200 different random orders. Background subtraction was first performed on the training 2D views to extract the foreground objects.
After that, Canny edge detection and GVF were performed on the extracted foreground objects to extract the object contour. Two features, the MAG and PPL, were then extracted from the object contour and used to build the AODs with the ISAG (Fig. 3-2). The characteristic views of the aspects in each AOD are utilized for object matching. A recognition result is calculated from a weighted combination of the similarity measures from both AODs. Figure 4-10 illustrates the system architecture of the proposed framework for 3D rigid object recognition.
Figure 4-10 The system architecture of the proposed framework applied in the first experiment (3D rigid object recognition).
[Diagram blocks: a 2D view sampled from an unknown object; feature extraction (MAG feature extraction, PPL feature extraction); similarity measure (the 1st and 2nd similarity measures); the 1st AOD, the 2nd AOD, MOD; weighted combination; recognition results (top three matches).]
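The Top 1 to Top 3 match rates reported for this experiment can be computed from ranked match lists as sketched below. The sketch assumes that a test view counts as a Top-k match when the true object appears among the first k returned objects; that criterion, the function name and the toy data are assumptions, since the section does not spell out the definition.

```python
def top_k_rates(ranked_lists, true_labels, max_k=3):
    """Percentage of test views whose true object is within the first k matches."""
    rates = []
    for k in range(1, max_k + 1):
        hits = sum(1 for ranked, truth in zip(ranked_lists, true_labels)
                   if truth in ranked[:k])
        rates.append(100.0 * hits / len(true_labels))
    return rates

# Four hypothetical test views of object 1 and their top-three matches.
ranked = [[1, 2, 3],   # correct at rank 1
          [2, 1, 4],   # correct at rank 2
          [5, 3, 1],   # correct at rank 3
          [4, 6, 7]]   # not among the top three
truth = [1, 1, 1, 1]
rates = top_k_rates(ranked, truth)
print(rates)
```

By construction the rates are non-decreasing in k, which matches the pattern of Top 1, Top 2 and Top 3 values one expects in a results table.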
Table 4-6 The result of rigid object recognition using 2D views via MAG and PPL
Columns 1 to 12: the index of the objects in the first database listed in Fig. 4-7
Recognition results           1      2      3      4      5      6      7      8      9     10     11     12   Avg.
Number of aspects (MAG)   34.66   3.84  27.83  24.75   6.87   9.47   2.04  25.62  17.14  16.16  16.62  28.75  17.81
Number of aspects (PPL)   38.72  14.08  14.32  22.84  10.98  20.12   8.41  31.07  25.79  17.68  23.61  19.88  20.63
Top 1 Match (%)           98.25  99.97  97.71  97.39    100  99.81  99.79  99.35  99.90  97.97  98.44  96.83  98.78
Top 2 Match (%)           99.21    100  98.96  98.73    100  99.96  99.86  99.67  99.97  98.68  99.47  98.17  99.39
Top 3 Match (%)           99.61    100  99.39  99.34    100  99.98  99.89  99.78  99.99  98.98  99.77  98.64  99.62
Table 4-6 presents the mean numbers of aspects obtained using MAG and PPL. Symmetrical objects, such as objects 2, 5, 6 and 7, had few aspects, thereby reducing the computing time for recognizing objects. The views in $V_t^{1,j}$ were adopted as unknowns and tested each time the aspect-graph representations were built (200 times). The proposed aspect-graph generation is effective, as shown by the high recognition rates for the Top 1 to Top 3 matches in Table 4-6.
The proposed method, which constructs an aspect-graph representation using views sampled at random intervals, provides a practicable updating mechanism that integrates newly collected views into the database. In this experiment, 18 random views sampled from $V_d^{1,j}$ are first utilized to construct a coarse aspect-graph representation of each object, called $D_{18}$. Eighteen additional random views are then taken from the remaining views in $V_d^{1,j}$ to increase the accuracy of the database $D_{18}$; the result is called $D_{36}$. Similarly, $D_{54}$ and $D_{72}$ are constructed using the remaining views in $V_d^{1,j}$. Additionally, $D_{90}$ and $D_{108}$ are further constructed with extra random views sampled from $V_t^{1,j}$. Table 4-7 presents the average aspect numbers for each rigid object from 200 iterations. Although the aspect numbers increase when new views are employed to update the coarse database, the number of stored views remains significantly smaller than the number of original views. Figure 4-11 presents the recognition rates obtained when using the coarse to fine databases. Figure 4-12 presents the standard deviations of the recognition rates. The recognition rate increases when the aspect-graph representations are trained using additional object views.
Moreover, stability increases, as indicated by the decreasing standard deviation. Therefore, the proposed method is shown to be effective for updating aspect-graph representations without re-sorting all the collected views or re-calculating all the similarity measures.
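The coarse-to-fine update schedule from $D_{18}$ to $D_{108}$ can be sketched as a sequence of random batches of 18 views. The sketch only produces the order in which views would be fed to the database and does not implement the ISAG itself; the function name, the seed and the assumed pool of 144 testing views per object (two trisection points per 5° interval) are illustrative assumptions.

```python
import random

def update_schedule(num_training=72, num_testing=144, batch=18,
                    extra_batches=2, seed=0):
    """Map database size (18, 36, ...) to the list of views it was built from.

    Views are tagged ("train", i) or ("test", i); each database extends the
    previous one, so earlier views never need to be processed again.
    """
    rng = random.Random(seed)
    train = list(range(num_training))
    test = list(range(num_testing))   # pool size is an assumption
    rng.shuffle(train)
    rng.shuffle(test)

    schedule = {}
    used = []
    for start in range(0, num_training, batch):
        used = used + [("train", i) for i in train[start:start + batch]]
        schedule[len(used)] = list(used)
    for b in range(extra_batches):
        used = used + [("test", i) for i in test[b * batch:(b + 1) * batch]]
        schedule[len(used)] = list(used)
    return schedule

schedule = update_schedule()
print(list(schedule))
```

Repeating this with 200 different seeds would mirror the 200 random orders over which the reported averages and standard deviations are taken.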
Table 4-7 Results for numbers of aspects using MAG and PPL after updating with additional training views
The index of the objects in the first database listed in Figure 4-7
The index of the objects in the first database listed in Figure 4-7