
Figure 4-7 VGA set

Figure 4-8 Scaled test set (320×240, 640×480, and 960×720)

The algorithm parameters for AdaBoost-based object detection are as follows. The entire experiment uses the scale-window implementation. We use scaling factors of 1.1, 1.2, and 1.3 to observe performance, and the initial window size is 30×30. Two sets of test images are used throughout the experiment. Four 640×480 images are collected as the VGA set to verify detection performance on VGA images. In addition, a single image is resized to 960×720, 640×480, and 320×240 to form the scaled test set. The test images vary in object number, object size, and background color tone, as shown in Figure 4-6.
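The effect of the scaling factor on the amount of work can be illustrated with a minimal sketch. It assumes that the detection window starts at 30×30 and is multiplied by the scaling factor until it no longer fits in the image. The stopping rule, the rounding, and the function name `window_sizes` are our assumptions for illustration and are not taken from the actual implementation.

```python
def window_sizes(img_w, img_h, init=30, factor=1.1):
    """List the square window sizes visited by a scale-window detector.

    Assumption: the window grows geometrically from `init` and stops
    once its rounded size exceeds the shorter image side.
    """
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1")
    limit = min(img_w, img_h)
    sizes = []
    size = float(init)
    while round(size) <= limit:
        sizes.append(round(size))
        size *= factor
    return sizes


if __name__ == "__main__":
    for factor in (1.1, 1.2, 1.3):
        count = len(window_sizes(640, 480, init=30, factor=factor))
        print(f"scale factor {factor}: {count} window sizes")
```

Under these assumptions a 640×480 image yields 30, 16, and 11 window sizes for scaling factors 1.1, 1.2, and 1.3, which is consistent with a smaller scaling factor producing more task instances and more computation.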

• VGA set (image size = 640×480)

Table 4-1 Performance comparison @ VGA set

                Improvement (vs. CellCV)            Improvement (vs. row-based splitting)
Scale factor    3-core  4-core  5-core  6-core      3-core  4-core  5-core  6-core
1.1             12.3%   15.0%   21.2%   22.8%       8.3%    13.1%   17.2%   15.6%
1.2             16.9%   26.1%   23.9%   20.2%       15.9%   15.8%   12.6%   10.0%
1.3             13.9%   11.0%   23.5%   35.1%       4.7%    7.7%    18.4%   22.2%

Table 4-1 compares the performance of the proposed method with CellCV and row-based splitting. We find that data-oriented task partition obtains more improvement at scale factor = 1.3 as the number of cores increases. At scale factor = 1.3, a small fraction of the execution time is spent on computing and a large fraction is wasted waiting for memory transfers. As more cores are deployed, these effects prevent linear speedup. The proposed data-oriented task partition reduces unnecessary memory transfers and avoids memory contention, and therefore reduces the weight of data transfer in the total execution time. The improvement reaches up to 35.1% compared with CellCV.

Table 4-2, Table 4-3, and Table 4-4 show the acceleration obtained with two to six cores for CellCV, row-based splitting, and data-oriented task partition, respectively. In Table 4-2, high speedup is obtained at scale factor = 1.1, as mentioned before. At scale factor = 1.1, resolution-based task partition generates enough task instances that workload imbalance is minor. In addition, the time spent waiting for data transfer is relatively low compared with the large amount of computation caused by the many detection candidates.
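The argument that waiting for memory transfers prevents linear speedup can be made concrete with a simple Amdahl-style model. This is only an illustrative sketch: it assumes that the computing part scales perfectly with the number of cores while the transfer part does not scale at all, and the function name `modeled_speedup` and the transfer fractions used below are hypothetical values, not measurements from this experiment.

```python
def modeled_speedup(cores, transfer_fraction):
    """Amdahl-style speedup when only the computing part is parallelized.

    transfer_fraction: share of single-core execution time spent waiting
    for memory transfers (assumed not to shrink as cores are added).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be in [0, 1]")
    compute_fraction = 1.0 - transfer_fraction
    return 1.0 / (compute_fraction / cores + transfer_fraction)


if __name__ == "__main__":
    for fraction in (0.0, 0.05, 0.10, 0.20):
        print(f"transfer share {fraction:.2f}: "
              f"6-core speedup {modeled_speedup(6, fraction):.2f}")
```

In this model a transfer share of 10% already limits the 6-core speedup to 4.0, which shows why reducing the transfer share matters more as cores are added.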

Table 4-2 Acceleration of CellCV @ VGA set

                Acceleration
Scale factor    2-core  3-core  4-core  5-core  6-core
1.1             1.65    2.43    3.16    3.73    4.33
1.2             1.64    2.40    2.86    3.64    4.40
1.3             1.84    2.52    3.45    3.67    3.66
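The speedups in Table 4-2 can also be read as parallel efficiency, that is, speedup divided by the number of cores. The sketch below only re-expresses the values of Table 4-2; the names `CELLCV_SPEEDUP` and `efficiency_table` are introduced here for illustration.

```python
# Speedup values copied from Table 4-2 (CellCV, VGA set).
CORES = [2, 3, 4, 5, 6]
CELLCV_SPEEDUP = {
    1.1: [1.65, 2.43, 3.16, 3.73, 4.33],
    1.2: [1.64, 2.40, 2.86, 3.64, 4.40],
    1.3: [1.84, 2.52, 3.45, 3.67, 3.66],
}


def efficiency_table(speedups=CELLCV_SPEEDUP, cores=CORES):
    """Return parallel efficiency (speedup / cores) per scale factor."""
    return {
        factor: [s / n for s, n in zip(values, cores)]
        for factor, values in speedups.items()
    }


if __name__ == "__main__":
    for factor, row in efficiency_table().items():
        cells = "  ".join(f"{e:.2f}" for e in row)
        print(f"scale factor {factor}: {cells}")
```

For scale factor = 1.3 the efficiency drops from 0.92 at 2 cores to 0.61 at 6 cores, which matches the observation that data transfer dominates at high scale factors.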

Table 4-3 shows the acceleration trend of row-based splitting. As the image size increases, the workload balance issue becomes more negligible because there are enough tasks. Therefore, row-based splitting is most useful for small image sizes and high scale factors, where resolution-based task partition generates fewer task instances. In Table 4-3, row-based splitting achieves a speedup of up to 4.4.
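As a rough illustration of how splitting by rows balances the workload when few task instances exist, the sketch below divides the rows of one image into contiguous bands, one per core. This is an assumption about the general idea only: the function `split_rows` is hypothetical, and the overlap of window rows between neighbouring bands that a real sliding-window detector would need is ignored.

```python
def split_rows(height, cores):
    """Split `height` image rows into `cores` contiguous, balanced bands.

    Returns a list of (start_row, end_row) pairs with end_row exclusive.
    Band sizes differ by at most one row.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    base, extra = divmod(height, cores)
    bands = []
    start = 0
    for core in range(cores):
        rows = base + (1 if core < extra else 0)
        bands.append((start, start + rows))
        start += rows
    return bands


if __name__ == "__main__":
    print(split_rows(480, 6))
    print(split_rows(240, 5))
```

Because every core receives a band of the same resolution, the workload stays balanced even when the number of resolutions is smaller than or not a multiple of the number of cores.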

Table 4-3 Acceleration of row-based splitting @ VGA set

Table 4-4 Acceleration of data-oriented task partition @ VGA set

2-core 3-core 4-core 5-core 6-core

... acceleration as we expect. In communication-intensive applications such as multi-resolution applications, data transfer must be considered carefully to obtain superior performance. As mentioned above, data-oriented task partition is most effective at high scale factors and small image sizes, because in these cases more time is wasted in data transfer than is spent on computing. Table 4-5 shows the performance in frame rate (fps). Because the entire analysis and design flow for multi-resolution applications targets general distributed scratchpad memory multicore platforms, we did not apply local optimizations specific to the Cell processor, such as the intrinsic functions of the Cell SDK, in these experimental results. Even for lightweight multicore platforms, this design flow provides a design methodology for multi-resolution applications. In Table 4-5, the performance reaches up to 12.7 fps at VGA size. With local optimization applied, we can obtain about 25 fps.

• Scaled test set (scale factor = 1.2)

Table 4-6 shows the performance comparison for different image sizes. As with the VGA test set, the proposed data-oriented task partition is more useful for small image sizes because of their high data transfer ratio. The improvement reaches up to 31.8% compared with CellCV. Table 4-7, Table 4-8, and Table 4-9 show the acceleration of CellCV, row-based splitting, and data-oriented task partition, respectively. In CellCV, large test images obtain much higher speedup than small ones because computation is more critical than communication, so the multicore platform can be fully utilized to accelerate the computing part. However, CellCV achieves a speedup of only 4.66 at 6 cores. Row-based splitting improves the speedup to as much as 4.91 (Table 4-8), and it is more effective when the number of tasks is insufficient, as with small images. With data-oriented task partition, an average speedup of 5.6 over single-core CellCV is achieved. Table 4-10 shows the performance of the scaled test set in frame rate. We achieve 42.8 fps at 320×240. With local optimization applied, we can achieve about 83 fps.

Table 4-6 Performance comparison @ scaled test set

                Improvement (vs. CellCV)            Improvement (vs. row-based splitting)
Image size      3-core  4-core  5-core  6-core      3-core  4-core  5-core  6-core
960×720         9.8%    19.1%   19.4%   18.9%       7.8%    14.1%   13.4%   13.9%
640×480         16.9%   26.1%   23.9%   20.2%       15.9%   15.8%   12.6%   10.0%
320×240         16.0%   23.8%   20.0%   31.8%       12.0%   10.8%   8.0%    15.8%

Table 4-7 Acceleration of CellCV @ scaled test set

2-core 3-core 4-core 5-core 6-core

Table 4-8 Performance of row-based splitting @ scaled test set

2-core 3-core 4-core 5-core 6-core

Table 4-9 Performance of data-oriented task partition @ scaled test set

2-core 3-core 4-core 5-core 6-core

Table 4-10 Performance in frame rate @ scaled test set

                Performance (fps)
Image size      1-core  2-core  3-core  4-core  5-core  6-core
960×720         0.6     1.1     1.6     2.2     2.7     3.3
640×480         2.0     3.2     5.0     6.7     8.3     9.5
320×240         8.5     14.4    22.0    30.2    36.6    42.8

4.2.1 Overall performance enhancement comparison

We use a scaling factor of 1.1 and a 640×480 test image in this series of simulations. As shown in Figure 4-9, data-oriented task partition obtains a 9% performance improvement compared with CellCV, because it exploits all available data locality across resolutions. Performance improves further through the platform-dependent optimization flow: data allocation, tile-shape exploration, and granularity exploration. In the optimization flow, the data allocation step improves performance by 8% over the previous step and is the most important step in the flow. Tile-shape exploration and granularity exploration improve performance by a further 4% and 2% over their respective previous steps. As a result, we achieve 4.96 fps, up from the original 3.88 fps.

[Figure: bar chart of frame rate (fps) for CellCV and the proposed flow. Steps: data-oriented task partition (9%), data allocation (8%), tile exploration (4%), granularity exploration (2%); 3.88 fps to 4.96 fps.]

Figure 4-9 Improvement comparing with CellCV

5 CONCLUSIONS

In this thesis, we propose a data-oriented task partition for multi-resolution applications. The proposed partition considers multi-dimensional data locality across resolutions to reduce redundant memory transfers. Instead of raster scanning, tile-based scanning is adopted, so that most of the available data locality can be utilized. This relieves the load on the interconnect network and avoids memory contention. Moreover, we propose an optimization flow for parallelizing multi-resolution applications on distributed scratchpad memory multicore platforms. The optimization goes through data allocation, tile-shape exploration, and granularity exploration to obtain superior performance.
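The difference between raster scanning and tile-based scanning can be sketched as two traversal orders over the same set of window positions. The code below is a minimal illustration under our own assumptions: the function names and the tile dimensions are hypothetical, and it shows only the visiting order, not the data allocation or the transfers of the actual implementation.

```python
def raster_order(width, height):
    """Visit positions row by row across the full width."""
    return [(x, y) for y in range(height) for x in range(width)]


def tile_order(width, height, tile_w, tile_h):
    """Visit positions tile by tile; inside a tile, scan row by row.

    Consecutive positions stay within a tile_w x tile_h region, so data
    loaded for one tile can be reused before moving to the next tile.
    """
    if tile_w < 1 or tile_h < 1:
        raise ValueError("tile dimensions must be positive")
    order = []
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            for y in range(ty, min(ty + tile_h, height)):
                for x in range(tx, min(tx + tile_w, width)):
                    order.append((x, y))
    return order


if __name__ == "__main__":
    print(raster_order(4, 2))
    print(tile_order(4, 2, tile_w=2, tile_h=2))
```

Both orders cover exactly the same positions; only the sequence changes, which is what allows the tile shape to be explored as a separate optimization step.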

We take Viola and Jones object detection, which plays an important role in intelligent multimedia processing, as a case study and implement it on the PlayStation 3, a typical distributed scratchpad memory multicore platform. Following the proposed optimization flow, we reduce data transfer by 95% compared with the conventional CellCV implementation. With 6 cores, the speedup reaches 5.6 times relative to the CellCV version, and the execution time is improved by 25% compared with CellCV using 6 cores.

In the future, we would like to apply this method to other multi-resolution applications. Augmented reality (AR), a popular smartphone application, is a live direct or indirect view of a physical real-world environment whose elements are augmented by virtual computer-generated imagery. In this application, recognizing objects in the real-world environment is a necessary step for the subsequent operations.

To further improve object detection, we would try to develop a fast algorithm for data-oriented task partition. By taking inter-resolution relationships into account, it may be possible to develop computation compression. We expect that such a method can maintain accuracy while reducing computation.


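Author Biography

甘禮源 was born on February 1, 1985, in Taoyuan County. The author received a bachelor's degree from the Department of Electronic Engineering, Chang Gung University, in 2007 and then pursued a master's degree at the Institute of Electronics Engineering, National Chiao Tung University, receiving it in 2011 under the supervision of Professor 劉志尉. This thesis, "Optimization of Multimedia Multi-Resolution Application Processing for Distributed Scratchpad Memory Multicore Platforms" (適用於分散式暫存記憶體多核心平台之多媒體多解析應用處理最佳化), is the author's master's thesis.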
