
Chapter 3. Proposed System

3.5. Temporal Interpolation

3.5.2. Label Propagation

As mentioned above, we use temporal information to propagate the labels for non-anchor frames. The labeling results of the preceding frame are warped to generate the labels of the current frame based on the temporal correspondence between these two frames. The use of temporal correspondence makes the labeling results reliable and accurate as long as we have obtained correct labels in the anchor frame.
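The warping step can be illustrated with a minimal sketch. It assumes an integer flow field that points from each pixel of the current frame to its corresponding pixel in the preceding frame. The function name `propagate_labels` and the list-based label maps are hypothetical and chosen only for illustration; our actual implementation is in Matlab and is not reproduced here.

```python
def propagate_labels(prev_labels, flow_u, flow_v):
    """Warp the label map of the preceding frame onto the current frame.

    prev_labels: 2-D list of integer class labels for the preceding frame.
    flow_u, flow_v: 2-D lists of integer horizontal and vertical
        displacements. ASSUMPTION: each displacement points from a pixel of
        the current frame to its match in the preceding frame.
    """
    height = len(prev_labels)
    width = len(prev_labels[0])
    cur_labels = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Clamp the matched position so that it stays inside the image.
            src_x = min(max(x + flow_u[y][x], 0), width - 1)
            src_y = min(max(y + flow_v[y][x], 0), height - 1)
            cur_labels[y][x] = prev_labels[src_y][src_x]
    return cur_labels


# Toy example: every pixel matches the pixel one column to its right.
prev = [[1, 1, 2, 2],
        [1, 1, 2, 2]]
u = [[1] * 4 for _ in range(2)]
v = [[0] * 4 for _ in range(2)]
print(propagate_labels(prev, u, v))  # [[1, 2, 2, 2], [1, 2, 2, 2]]
```

Because each propagated label is copied from the preceding frame, an error in the anchor frame is carried forward, which is why correct anchor labels matter.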


Figure 3-26 Histogram of SIFT flow magnitudes in the horizontal direction when the user turns left. The green words indicate the inferred camera status.

Figure 3-27 Histogram of SIFT flow magnitudes in the horizontal direction when the user walks straight. The green words indicate the inferred camera status.


Chapter 4. Experimental Results

In this chapter, we present our experimental results. Section 4.1 compares the results of label transformation by SIFT flow with the single-view approach and with the panoramic approach. Section 4.2 evaluates the performance of our system in a real outdoor environment at NCTU.

Our proposed system is tested on a personal computer with an Intel® Core™ i5-760 CPU at 2.8 GHz. Our algorithm is implemented in Matlab without code optimization.

4.1. Label Results of Different Approaches

First, we show the warping and labeling results of SIFT flow with the single-view approach and with the panoramic approach. An example is shown in Figure 4-1, in which the panoramic approach provides more accurate warping and labeling results.

The parameters of SIFT flow are the same in both cases. The input frame is captured at a resolution of 640×480 pixels and then down-sampled to 160×120 pixels. The panoramic image is 2.5 times wider than the input frame but has the same height.


Figure 4-1 (a) Single-view support image, (b) panoramic-view support image, (c) warped result from single-view image, (d) warped result from panoramic-view image, (e) ground truth labels, (f) mapped labels based on single-view support, and (g) mapped labels based on panoramic view.

Another example is shown in Figure 4-2. With the panoramic approach, even though the warped image and labels do not perfectly resemble the input frame, they still represent the input frame well.

Figure 4-2 (a) Single-view support image, (b) panoramic-view support image, (c) warped result from single-view image, (d) warped result from panoramic-view image, (e) ground truth labels, (f) mapped labels based on single-view support, and (g) mapped labels based on panoramic view.


4.2. Outdoor Experimental Results within NCTU

4.2.1. Database Setup

Our system is tested on two routes near the north gate of National Chiao Tung University (NCTU), as shown in Figure 4-3. The red lines indicate these two routes, which cover various kinds of scenes, such as a bus station, an intersection, and trees, as shown in Figure 4-4. The total length of these two routes is about 300 meters.

Figure 4-3 Our test routes in NCTU.

Figure 4-4 Scene appearances

To create the database along these two routes, we choose 11 sampling spots in total. The selection of sampling spots is based on three criteria: 1) the scene contains an intersection, 2) the scene contains informative landmarks, such as a crosswalk or a waymark, and 3) the scene contains a special construction, such as a bus station. We follow these criteria to build our panoramic database. As mentioned in Section 3.1.1, we take 16 photographs of different views at each sampling spot to create the panoramic image.

4.2.2. Experimental Results in Test Environments

We test our system on three video sequences captured under three different conditions: a cloudy day, a sunny day with some unexpected shadows in the scene, and evening time with low lighting. Here we show some inferred labels and detected dangerous situations. The resolution of the videos is 640×480. The test procedure includes the panoramic approach and the temporal interpolation described in Section 3.3.2 and Section 3.5. The sunny-day video was captured around 14:00, and the evening video around 18:00. While taking these video sequences, we mimicked the way blind people walk: straight ahead until a "real" dangerous situation occurs. Hence, the walking tracks follow a zigzag pattern.

We sample all the test videos with a sampling period of 0.6 seconds and pick an anchor frame every 6 seconds. In our experiments, the detection process is performed on the anchor frames only. For the remaining frames between anchor frames, we use temporal information to propagate labels.
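With a sampling period of 0.6 seconds and one anchor frame every 6 seconds, every tenth sampled frame is an anchor. The following sketch makes this schedule explicit. The function names, and the assumption that the first sampled frame is an anchor, are ours for illustration only.

```python
SAMPLING_PERIOD_S = 0.6  # one sampled frame every 0.6 s (from the text)
ANCHOR_PERIOD_S = 6.0    # one anchor frame every 6 s (from the text)


def anchor_stride(sampling_period_s=SAMPLING_PERIOD_S,
                  anchor_period_s=ANCHOR_PERIOD_S):
    """Number of sampled frames between two consecutive anchor frames."""
    return round(anchor_period_s / sampling_period_s)


def frame_roles(num_frames):
    """Label each sampled frame as 'anchor' (full procedure) or 'propagate'.

    ASSUMPTION: the first sampled frame (index 0) is an anchor frame.
    """
    stride = anchor_stride()
    return ["anchor" if i % stride == 0 else "propagate"
            for i in range(num_frames)]


print(anchor_stride())                    # 10
print(frame_roles(21).count("anchor"))    # 3 (frames 0, 10, and 20)
```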


4.2.2.1. Results of Database Retrieval

In Section 3.2, we discussed how to find the part of the panoramas that is most similar to the view in front of the blind user. Here we show some examples of the best matches after sub-database retrieval using gist feature matching with the spatio-temporal constraint.

Figure 4-5 Results of database retrieval for walking forward with some pedestrians passing by. (a) Frame index, (b) input frames, and (c) the best match from the panoramic sub-database.

In the previous example, we can see that interference in the local image structure, such as passing pedestrians, has little effect on the outcome of database retrieval. The best-matched part and the input frame share a similar local structure.

In comparison, in the following example, we show the case of a panning view.

Figure 4-6 Results of database retrieval for a panning case. (a) Frame index, (b) input frames, and (c) the best match from the panoramic sub-database.

4.2.2.2. Cloudy Day

For the test video captured on a cloudy day, the lighting condition is very similar to that in our database. Some results of the cloudy-day case are shown below. Here, we show different walking situations on the sidewalk and the detected dangerous situations. To visualize the outcome of our system, we use an exclamation mark to indicate that a dangerous situation has occurred. Moreover, a yellow arrow indicates the suggested safe direction to turn when a dangerous situation is detected. The case of walking forward is shown in Figure 4-7.

Figure 4-7 The case of walking forward in a safe situation. (a) Frame index. (b) Input frames. (c) Inferred labels. (d) Outcome of dangerous situation detection.

Next, we show an example in which the blind user turns in a wrong direction. In this case, our system warns the user that a dangerous situation has been detected. In this example, the system tells the user to turn right to walk safely.

Figure 4-8 The case of turning in a wrong direction. (a) Frame index. (b) Input frame. (c) Inferred labels. (d) Outcomes of dangerous situation detection. (e) Suggested turning direction.

In the previous example, the labels are not perfectly accurate for Frames 317 and 318. However, the trend of the mapped labels over time is correct. That is, the result of dangerous situation detection is still correct even if the labels are not perfectly correct. For Frames 314 to 316, the user walks on the sidewalk border; we also treat this situation as dangerous. Another dangerous situation occurs when the blind user walks on the border between the sidewalk and the road. In this case, our system detects the dangerous situation and suggests that the blind user turn right to walk safely.

Figure 4-9 The case of approaching the border between sidewalk and road. (a) Frame index. (b) Input frame. (c) Inferred labels. (d) Outcome of dangerous situation detection. (e) Suggested turning direction.

The final example, in Figure 4-10, shows the dangerous situation in which there is little sidewalk area in front of the blind user.

Figure 4-10 The case of little sidewalk area in front of the user. (a) Frame index. (b) Input frame. (c) Inferred labels. (d) Outcome of dangerous situation detection. (e) Suggested turning direction.


4.2.2.3. Sunny Day and Evening Time

On sunny days, shadows cast on objects usually make detection and recognition difficult. Because shadows produce strong edges, sub-database retrieval and label mapping are easily affected. In the evening, on the other hand, the lighting in outdoor environments is usually poor. In the following examples, we show the performance of our system under these two conditions.

Figure 4-11 Test results on a sunny day. (a) Input frames. (b) Inferred labels. (c) Outcome of dangerous situation detection. (d) Suggested turning direction.

The above figure shows some results for the test video captured on a sunny day. Our system still performs acceptably under slight shadow interference. However, for the rightmost frame in Figure 4-11, the shade of a tree causes a large dark area in front of the user. In this case, our system may infer incorrect labels.

In the following case, we show the results at evening time. In some scenes, such as places near trees, the lighting is very poor. Under such poor lighting, our system may not generate correct outcomes.

Figure 4-12 Some examples at evening time. (a) Input frame. (b) Inferred labels. (c) Outcome of dangerous situation detection. (d) Suggested turning direction.

4.2.2.4. Experimental Data

First, we analyze the accuracy of sub-database retrieval. Recall that each panorama in the sub-database is partitioned into 32 overlapping parts, representing 32 viewing directions. We define the best match to be accurate if its corresponding direction is within the 4 nearest directions of the user's true facing direction. We measure the accuracy of sub-database retrieval on the three aforementioned videos, using 438 frames from the cloudy-day video, 385 frames from the sunny-day video, and 425 frames from the evening-time video.
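The accuracy criterion can be sketched as follows. Since the 32 viewing directions wrap around, the distance between two directions is measured circularly. Reading "the 4 nearest directions" as two direction bins on either side of the true direction is an assumption made for this sketch, so the tolerance is left as a parameter; the function names are likewise hypothetical.

```python
NUM_DIRECTIONS = 32  # each panorama is split into 32 viewing directions


def circular_distance(a, b, n=NUM_DIRECTIONS):
    """Smallest number of direction bins between a and b on a circle of n."""
    d = abs(a - b) % n
    return min(d, n - d)


def is_accurate(matched_dir, true_dir, tolerance=2):
    """ASSUMPTION: 'within the 4 nearest directions' means at most
    `tolerance` = 2 bins away on either side of the true direction."""
    return circular_distance(matched_dir, true_dir) <= tolerance


def retrieval_accuracy(matched_dirs, true_dirs, tolerance=2):
    """Percentage of frames whose best match passes the criterion."""
    hits = sum(is_accurate(m, t, tolerance)
               for m, t in zip(matched_dirs, true_dirs))
    return 100.0 * hits / len(true_dirs)


# Toy example: direction 31 is one bin away from direction 0.
print(retrieval_accuracy([0, 1, 31, 10], [0, 0, 0, 0]))  # 75.0
```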

Table 4-1 Accuracy of sub-database retrieval

                Cloudy    Sunny     Evening
Accuracy (%)    97.717    97.143    94.479

When analyzing the experimental outcome of our system, we treat the task as a detection problem. We take a detected dangerous situation as the positive outcome. The definitions of false positive and false negative are listed in Table 4-2.

Table 4-2 Definition of false positive and false negative

                                True dangerous situation    True safe situation
Detected dangerous situation    True positive               False positive
Detected safe situation         False negative              True negative

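The equations for detection rate, false positive rate, and false negative rate are

\[
\text{Detection rate} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad
\text{False positive rate} = \frac{FP}{FP + TN}, \qquad
\text{False negative rate} = \frac{FN}{FN + TP},
\]

where $TP$, $FP$, $FN$, and $TN$ denote the numbers of true positives, false positives, false negatives, and true negatives, with a detected dangerous situation taken as positive.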

Figure 4-13 Ground truth definition. (a) to (c): apparent cases. (d) to (f): cases in which the location of the border is used to determine whether the situation is dangerous.

For the cloudy-day case, we test our algorithm on 438 images, of which 110 show dangerous situations. Comparing our panoramic approach with temporal information against the single-view approach shows that the panoramic approach is more reliable. To test the single-view approach, we create a database composed of 8 single-view images, representing 8 different viewing directions, at each sampling spot. For the spatio-temporal constraint, we search the 3 facing directions, out of all 8, that are nearest to the best match at the previous moment. The other procedures are the same as in the panoramic approach. The cases of walking toward the border between sidewalk and road (shown in Figure 4-9) are not detected as dangerous by the single-view approach.

The ground truth is defined manually. Some dangerous cases are quite apparent. For the case of approaching the border of the sidewalk, however, we use the position of the border to determine whether the situation is dangerous: if the sidewalk border at the bottom of the image lies within the central third of the image width, we define the situation as dangerous, as shown in Figure 4-13.
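The ground-truth rule for the border case reduces to a simple position test, sketched below. The function name, the pixel-coordinate convention (column 0 at the left edge), and the inclusive treatment of the two boundaries are assumptions for illustration; the ground truth itself was labeled manually.

```python
def border_is_dangerous(border_x, image_width):
    """Return True if the sidewalk border, measured on the bottom row of
    the image, lies within the central third of the image width.

    border_x: horizontal position of the border in pixels (0 = left edge).
    ASSUMPTION: positions exactly on the one-third and two-thirds
    boundaries count as dangerous.
    """
    left = image_width / 3.0
    right = 2.0 * image_width / 3.0
    return left <= border_x <= right


# For a 640-pixel-wide frame the central third spans about 213 to 427.
print(border_is_dangerous(320, 640))  # True
print(border_is_dangerous(100, 640))  # False
```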

Table 4-3 Experimental data for the cloudy day and comparison with the single-view approach

                        Detection rate (%)    False positive rate (%)    False negative rate (%)
Panoramic approach      95.205                3.354                      9.091
Single-view approach    77.854                9.756                      59.091
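As a consistency check on Table 4-3, the three rates can be recomputed from raw counts. The counts below are not reported in the text; they are the integer counts we back-computed from the reported totals (438 images, 110 of them dangerous) and percentages, under the assumption that the detection rate is the fraction of correctly classified images, the false positive rate is taken over the safe images, and the false negative rate is taken over the dangerous images.

```python
def detection_metrics(tp, fp, fn, tn):
    """Rates in percent, with 'dangerous situation detected' as positive.

    ASSUMED definitions: detection rate = correct decisions / all images,
    false positive rate = FP / safe images,
    false negative rate = FN / dangerous images.
    """
    total = tp + fp + fn + tn
    return {
        "detection_rate": 100.0 * (tp + tn) / total,
        "false_positive_rate": 100.0 * fp / (fp + tn),
        "false_negative_rate": 100.0 * fn / (fn + tp),
    }


# Back-computed counts (an inference, not data reported in the text):
# 110 dangerous and 328 safe images in the cloudy-day test set.
panoramic = detection_metrics(tp=100, fp=11, fn=10, tn=317)
single_view = detection_metrics(tp=45, fp=32, fn=65, tn=296)

for name, metrics in (("panoramic", panoramic), ("single-view", single_view)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under these assumptions both rows of Table 4-3 are reproduced to three decimal places.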

For the evening-time case, we test our system on 425 images, of which 77 show dangerous situations. For the sunny-day case, we test on 385 images, of which 79 show dangerous situations.

Table 4-4 Experimental data under different lighting conditions

Detection rate

phenomenon tells us that our system may be able to detect the situation when the user "really" approaches dangers.

The computing time of our algorithm is listed in Table 4-5. The full procedure, from sub-database retrieval to the detection of dangerous situations, takes about 4.25 seconds per frame. By using temporal information, we need only about three seconds per frame for non-anchor frames, and almost all of this time is spent computing SIFT flow. Recall that the resolution of the test videos is 640×480. When using SIFT flow for scene alignment, we do not need the full image resolution. Instead, the width and height of the input frame and the support image are each down-sampled to 0.25 times those of the original images. Because the number of nodes for belief propagation is reduced to 1/16, the computation is much faster while the performance of label mapping remains similar.
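The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the value of 10 sampled frames per anchor cycle is an assumption derived from the 0.6-second sampling period and the 6-second anchor interval of Section 4.2.2; the resulting average is a rough estimate, not a measured value.

```python
def downsampled_size(width, height, scale):
    """Image size after scaling the width and the height by `scale`."""
    return int(width * scale), int(height * scale)


def node_ratio(scale):
    """Fraction of belief-propagation nodes that remain after scaling."""
    return scale * scale


def mean_time_per_frame(full_s, propagate_s, frames_per_anchor):
    """Average time per sampled frame when one frame in every
    `frames_per_anchor` runs the full procedure and the rest only
    propagate labels."""
    return (full_s + (frames_per_anchor - 1) * propagate_s) / frames_per_anchor


print(downsampled_size(640, 480, 0.25))     # (160, 120)
print(node_ratio(0.25))                     # 0.0625, i.e. 1/16
# ASSUMPTION: 10 sampled frames per anchor cycle (6 s / 0.6 s).
print(mean_time_per_frame(4.25, 3.0, 10))   # 3.125
```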

Table 4-5 Computational speed for our system

Full procedure Using temporal


Chapter 5. Conclusions

In this thesis, we propose a vision-based travel aid system for blind people. Our system labels the walking area in front of the blind user and automatically detects dangerous situations. With the proposed system, a blind user can know which direction is safer to walk in. Our system adopts a database-driven framework. First, we use the blind user's position coordinates and the gist feature to find the part of the panoramas that is most similar to the view in front of the user.

After that, we use SIFT flow for image alignment. We map the labels of the best-matched sub-image onto the input frame to infer its labels. Finally, we use the label information to detect dangerous situations. Our system can run in different kinds of environments as long as the local database is installed beforehand. Our experimental results show that the system is reliable under different weather conditions.


