Performance Analysis - Multimedia Parallel Processing Techniques Based on a Multistage Pipeline Model for the Multicore PlayStation 3 Platform

Table 4-2 summarizes the overall performance of our H.264 decoder on sequences of different sizes. The performance of the optimized decoder scales well across all kinds of sequences.

Table 4-2 Performance of Our Optimized H.264 Decoder with Different Sequences

Frame Size    Sequence     FPS
1080P         SunFlower    25.32

Figure 4-25 Performance Improvement in Each Step of Design Flow (chart title: "Optimization Results in our Design Flow")

Figure 4-25 shows the performance improvement at each step of our design flow. The y-axis represents frames per second (fps) for the 1080P high-definition sequence, and the x-axis shows the technique applied to the H.264 decoder. The original JM decoder source reaches only 0.9 fps on the PPE. After computation optimization with loop unrolling and SIMD, we achieve 5.92 fps, a 6.6x gain. Allocating tasks to the SPEs raises this to 15.83 fps. Applying MFC-aware scheduling to hide DMA latency then yields 21.49 fps. Finally, buffering between processors as much as possible brings the decoder to 25.32 fps, meeting the high-definition real-time constraint.
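The gains at each step can be recomputed from the frame rates above. The following is a minimal sketch for checking the arithmetic. The step labels and function names are illustrative and are not taken from the thesis.

```python
# Frame rates (fps) quoted in the text for the 1080P sequence.
# The step labels are informal names chosen for this sketch.
BASELINE_FPS = 0.9  # original JM decoder on the PPE
STEPS = [
    ("computation optimization (loop unrolling + SIMD)", 5.92),
    ("task allocation on SPEs", 15.83),
    ("MFC-aware scheduling", 21.49),
    ("buffering between processors", 25.32),
]


def cumulative_speedups(baseline, steps):
    """Speedup of each step relative to the original decoder."""
    return [(name, round(fps / baseline, 2)) for name, fps in steps]


def step_gains(baseline, steps):
    """Speedup of each step relative to the step before it."""
    gains = []
    previous = baseline
    for name, fps in steps:
        gains.append((name, round(fps / previous, 2)))
        previous = fps
    return gains


if __name__ == "__main__":
    for name, speedup in cumulative_speedups(BASELINE_FPS, STEPS):
        print(f"{name}: {speedup}x over the original decoder")
    for name, gain in step_gains(BASELINE_FPS, STEPS):
        print(f"{name}: {gain}x over the previous step")
```

Ratios computed from the rounded fps values can differ slightly from gains quoted elsewhere in the text, which may have been derived from unrounded measurements.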

5 CONCLUSIONS AND FUTURE WORK

In this thesis, we proposed a design flow based on a strict multistage pipeline model. This model suits multimedia applications with strong dependencies and achieves load balance efficiently. It restricts the task migration choices and the data flow direction in order to simplify multicore programming.

We provide guidelines for several NP-complete multicore programming problems, including task allocation, MFC-aware scheduling, and task migration. We allocate tasks to SPEs according to their computation/communication ratio, use MFC-aware scheduling to run the MFC and the SPU in parallel as much as possible, and finally achieve load balance through task migration. These guidelines yield acceptable solutions efficiently.
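The benefit of running the MFC (DMA transfers) in parallel with the SPU (computation) can be illustrated with a simple timing model. The sketch below is not the thesis's scheduler. It assumes a two-buffer (double-buffering) scheme, models only input transfers, and uses made-up time units. The function names are hypothetical.

```python
def serial_time(dma, compute):
    """Total time when every transfer completes before its computation starts."""
    return sum(dma) + sum(compute)


def double_buffered_time(dma, compute):
    """Total time when the transfer of block i+1 overlaps the computation of block i.

    Assumes two buffers: while the SPU works on one buffer, the MFC fills the other.
    Only input transfers are modeled; write-back is ignored in this sketch.
    """
    total = dma[0]  # the first transfer cannot be hidden
    for i in range(len(compute)):
        next_dma = dma[i + 1] if i + 1 < len(dma) else 0
        # The next block can start only when both the current computation
        # and the prefetch of the next block have finished.
        total += max(compute[i], next_dma)
    return total


if __name__ == "__main__":
    dma = [2, 2, 2, 2]      # illustrative transfer times per block
    compute = [5, 5, 5, 5]  # illustrative computation times per block
    print("serial:", serial_time(dma, compute))
    print("double-buffered:", double_buffered_time(dma, compute))
```

In this model the transfer time is fully hidden whenever a block's computation takes at least as long as the next block's transfer, which is why the computation/communication ratio matters when tasks are placed on SPEs.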

Synchronization overhead is the most serious problem in the multistage pipeline model. It has two causes. The first is workload variance between kernels: the workload of each kernel differs from iteration to iteration because it depends on the content of the sequence being decoded. The second is OS activity handled by the PPE: OS threads occasionally request the PPE and disturb the synchronization of our application. We reduce this effect by buffering as much as possible in the limited local store (LS) of each SPE. Buffering mitigates the effect but does not eliminate it.
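The effect of buffer depth on a pipeline with varying per-iteration workloads can be sketched with a small deterministic simulation. This is an illustrative model only, not the decoder's actual synchronization scheme. It assumes that a stage may start item i only after the next stage has started item i - depth, so that an output slot is free. The workloads are made-up numbers.

```python
def pipeline_makespan(times, depth):
    """Finish time of the last item on the last stage.

    times[s][i] is the processing time of item i on stage s.
    depth is the number of buffer slots between adjacent stages (>= 1).
    """
    n_stages = len(times)
    n_items = len(times[0])
    start = [[0.0] * n_items for _ in range(n_stages)]
    finish = [[0.0] * n_items for _ in range(n_stages)]
    for i in range(n_items):
        for s in range(n_stages):
            ready = 0.0
            if i > 0:
                # The stage must finish its previous item first.
                ready = max(ready, finish[s][i - 1])
            if s > 0:
                # The item must have left the previous stage.
                ready = max(ready, finish[s - 1][i])
            if s + 1 < n_stages and i - depth >= 0:
                # An output slot is free only after the next stage
                # has picked up item i - depth.
                ready = max(ready, start[s + 1][i - depth])
            start[s][i] = ready
            finish[s][i] = ready + times[s][i]
    return finish[-1][-1]


if __name__ == "__main__":
    # Two stages whose workloads vary in opposite directions over time.
    workloads = [[1, 1, 4, 4], [4, 4, 1, 1]]
    for depth in (1, 2, 4):
        print(f"depth {depth}: makespan {pipeline_makespan(workloads, depth)}")
```

With these workloads, a deeper buffer lets the first stage run ahead while the second stage is slow, so the total time drops from 14 to 11 units. Beyond depth 2 there is no further gain in this example, consistent with the observation that buffering reduces synchronization stalls without removing them.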

We used the proposed design flow, based on a strict multistage pipeline model, to parallelize an H.264 decoder on PlayStation 3. We first optimized the decoder locally for a 6.6x performance gain. We then allocated the optimized kernels to the proposed multistage pipeline model on 3 SPEs, for a 17.3x gain over the original source code. MFC-aware scheduling, applied to hide DMA latency, brings the improvement to 23.88x over the original code. Task migration does not work well for the H.264 decoder because the task granularity is not suitable for migration. Finally, we buffer between all SPEs as deeply as possible to reduce synchronization overhead. The result is a 28.13x gain over the original code, which almost meets the real-time constraint for the 1080P test sequence with high efficiency. The load is well balanced among processors, and average utilization is nearly 80%.

We offload as many kernels as possible to the SPEs to ease the PPE workload. The branch-intensive Variable-Length Decoding is not offloaded, however, because its nature does not suit SPE execution. PPE load is unstable on the PlayStation 3 platform: OS threads must occasionally be handled by the PPE, which disturbs our multistage pipeline model. Easing the PPE workload as much as possible is therefore necessary, and it is feasible because several SPEs are available. The synchronization and communication overhead among a larger number of SPEs should then be taken into account in the design.

The proposed concept has so far been applied only to an H.264 decoder. We should study more cases with the proposed approach and revise the design flow for more multimedia applications. We will also try to extend our MFC-aware scheduling and task allocation strategies to get closer to optimal results efficiently.


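Author Biography

洪正堉 was born in Taipei City on December 28, 1983. The author received a bachelor's degree from the Department of Electronics Engineering, National Chiao Tung University, in 2006 and went on to pursue a master's degree at the Institute of Electronics, National Chiao Tung University. In 2008, the author received the master's degree under the supervision of Professor 劉志尉. This thesis, "Multimedia Parallel Processing Techniques Based on a Multistage Pipeline Model for the Multicore PlayStation 3 Platform", is the author's master's thesis.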
