Profile for dual-core P macroblock encoding module

Chapter 5 Experimental results

5.2 Experiment of Inter frame processing

5.2.4 Profile for dual-core P macroblock encoding module

The following figure shows the main computational components of the dual-core P macroblock processing module, running in the fastest mode described earlier.

Fig 37 Profile for dual-core P macroblock encoding module

We experimented with two implementations of the DMA control module in this thesis. The first transfers all memory sections of the reconstructed frames from the DSP to the ARM. It keeps the DMA handler simple, but it increases the amount of redundant data transferred. The second transfers memory sections of the reconstructed frames from the DSP to the ARM selectively, according to the encoding status. This makes the DMA handler more complex, but it removes the redundant transfers.
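The difference between the two schemes can be illustrated with a small model. The sketch below is not the thesis implementation: the section size, the number of sections, and the idea of representing the "encoding status" as the set of sections updated since the last transfer are all assumptions made for illustration.

```python
# Hypothetical model of the two DMA transfer strategies (not the thesis code).
# Assumption: the reconstructed frame is split into fixed-size sections, and
# the "encoding status" is modeled as the set of sections the DSP has updated.

SECTION_BYTES = 4096  # assumed section size
NUM_SECTIONS = 8      # assumed number of sections per frame


def sections_full_transfer(updated):
    """Strategy 1: always transfer every section, updated or not."""
    return list(range(NUM_SECTIONS))


def sections_status_transfer(updated):
    """Strategy 2: transfer only the sections marked as updated."""
    return sorted(updated)


def bytes_transferred(sections):
    """Total payload for a list of section indices."""
    return len(sections) * SECTION_BYTES


updated = {2, 3, 5}  # example status: three sections changed
full = bytes_transferred(sections_full_transfer(updated))
selective = bytes_transferred(sections_status_transfer(updated))
redundant = full - selective  # bytes the first strategy moves needlessly
```

In this toy setting the first strategy needs no bookkeeping at all, while the second must track `updated`, which mirrors the trade-off between handler complexity and transfer volume described above.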

The following two tables show the results for the first implementation, which includes redundant transfers.

Table 41 Experiment result of profile P MB encoding

Number      Execution time (ms)  Description
1           4261                 Time in dual-core processing
2           689                  Time in interrupt mode
3           357                  Time in handling DMA handler
4           126                  Time in handling Dual-core handler
5           727                  Time in transfer data to DSP by DMA
6           1452                 Time in receive data from DSP by DMA
7           523                  Time in post-processing
A/D         3.11
Total time  4785

Table 42 Calculated result of profile P MB encoding

Description                             Execution time (ms) or Percentage
ARM core execution time                 3572
DSP core execution time                 1393
Dual-core execution section percentage  89%
Post processing section percentage      10%

The following two tables show the results for the second implementation, which avoids redundant transfers.

Table 43 Experiment result of profile P MB encoding

Number      Execution time (ms)  Description
1           3813                 Time in dual-core processing
2           742                  Time in interrupt mode
3           397                  Time in handling DMA handler
4           136                  Time in handling Dual-core handler
5           756                  Time in transfer data to DSP by DMA
6           895                  Time in receive data from DSP by DMA
7           526                  Time in post-processing
A/D         3.74
Total time  4341

Table 44 Calculated result of profile P MB encoding

Description                             Execution time (ms) or Percentage
ARM core execution time                 3071
DSP core execution time                 1420
Dual-core execution section percentage  87%
Post processing section percentage      12%
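The thesis does not state how the calculated values in Tables 42 and 44 were obtained from the measured values in Tables 41 and 43. The sketch below uses relations inferred from the numbers themselves, so they should be read as an assumption: ARM time is dual-core time minus interrupt time, DSP time is ARM time minus both DMA transfer times, and the percentages are shares of the total time, truncated to whole percent. These relations reproduce all eight tabulated values.

```python
# Measured rows from Tables 41 and 43 (all times in ms).
measured = {
    "with redundant transfer": {
        "dual_core": 4261, "interrupt": 689,
        "to_dsp": 727, "from_dsp": 1452,
        "post": 523, "total": 4785,
    },
    "without redundant transfer": {
        "dual_core": 3813, "interrupt": 742,
        "to_dsp": 756, "from_dsp": 895,
        "post": 526, "total": 4341,
    },
}


def derive(m):
    """Apply the inferred relations; percentages are truncated, not rounded."""
    arm = m["dual_core"] - m["interrupt"]
    dsp = arm - m["to_dsp"] - m["from_dsp"]
    dual_pct = 100 * m["dual_core"] // m["total"]
    post_pct = 100 * m["post"] // m["total"]
    return arm, dsp, dual_pct, post_pct


derived = {name: derive(m) for name, m in measured.items()}
for name, row in derived.items():
    print(name, row)
```

Under these relations, the DSP core execution time is nearly the same in both experiments (1393 ms and 1420 ms), so the difference in ARM time comes almost entirely from the DMA receive time.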

These results show that the implementation without redundant transfers is more efficient. Its DMA handler is more complex, and the time spent in the DMA handler rises from 357 ms to 397 ms, but the time spent receiving data from the DSP by DMA falls from 1452 ms to 895 ms. Overall, the total time drops from 4785 ms to 4341 ms, a reduction of about 9%.

Chapter 6 Conclusions and Future Works

From these experiments, one can see that the proposed dual-core codec partitioning framework achieves better performance than using the DSP core alone. The thesis also shows that the data transfer overhead between the RISC core and the DSP core is crucial to the performance of the system, and that efficient use of the DMA module for data transfer plays an important role in this framework. For future improvements, instead of executing jobs at frame level, the data structures and execution flows of our codec should be modified for execution at slice level or macroblock level. This would allow multiple function modules to be combined into a single module and would reduce the large data transfer overhead. Fig. 38 illustrates this concept: the left-hand side of the figure shows the current execution flow, and the right-hand side shows the improved architecture proposed for future work.

Fig 38 Architecture of future work

In our current design, the motion compensation module runs on the RISC core alone. With the modification described above, the motion estimation/compensation subtasks can be hosted entirely on the same core (either RISC or DSP) without extra data transfer overhead.

In addition, as demonstrated by many studies, employing a dual-buffer mechanism on the DSP core can greatly increase the effective memory bandwidth, which is a key technique for improving system performance. We expect that a similar design would also improve the performance of our system, since one of the major bottlenecks of the proposed dual-core framework is the limited memory bandwidth between the RISC core and the DSP core.
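The usual benefit of a dual (ping-pong) buffer is that the transfer of the next data block overlaps the processing of the current one. The following is a minimal timing model of that idea, not a description of the thesis implementation: the function names and the per-block transfer and compute times are hypothetical, and the model assumes exactly two buffers and a single DMA channel.

```python
def sequential_time(transfer, compute):
    """Single buffer: each block is transferred, then processed, in turn."""
    return sum(transfer) + sum(compute)


def double_buffered_time(transfer, compute):
    """Two buffers: block i+1 is transferred while block i is processed."""
    n = len(transfer)
    if n == 0:
        return 0
    total = transfer[0]  # the first transfer cannot overlap anything
    for i in range(n - 1):
        # Processing of block i overlaps the transfer of block i+1;
        # the slower of the two decides when the next step can start.
        total += max(compute[i], transfer[i + 1])
    total += compute[-1]  # the last block has no transfer to overlap
    return total


# Hypothetical per-block costs in arbitrary time units.
transfer = [2, 2, 2, 2]
compute = [3, 3, 3, 3]
print(sequential_time(transfer, compute))       # 20
print(double_buffered_time(transfer, compute))  # 14
```

In this model the gain is largest when transfer and compute times are of similar size. When one of them dominates, the total approaches the sum of the dominant cost, and the second buffer hides only the smaller one.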

Finally, the research conducted in this thesis is a prelude to the design of a dynamic scheduling kernel for asymmetric multiprocessor (AMP) platforms. Based on the experiments conducted in this thesis and the simple shell-like DSP command processor developed for this work, one can design an AMP kernel that dispatches tasks to different processor cores on the fly.

