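Project Results Self-Evaluation - Research on MPEG-4/21 SoC Design and Next-Generation Mobile Communications, Subproject 2: Research on Digital Baseband SoC Acceleration Architectures for Multimedia Communications and Embedded Operating System Interfaces (III)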

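Taken together, the results of the three years closely match the goals proposed in the original plan. The achievements against the expected goals are as follows:

1. For the multimedia dual-core system, we developed a Heterogeneous Multi-Processor (HMP) operating system kernel with dynamic task partitioning and scheduling on the TI OMAP OSK5912 platform. In developing this technology, we achieved the following:

I. Developed our own DSP scheduler to help schedule tasks.

II. Ported eCos to the TI OMAP platform and developed an efficient communication protocol between the two cores.

III. Designed a dynamic task-partitioning module that works with the eCos MLQ scheduler.

IV. Designed a new programming model for heterogeneous multi-core systems.

V. Implemented development tools that support applications using dynamic task partitioning.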

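2. On the video coding hardware acceleration platform, we implemented the following IPs:

I. Motion estimation: search precision down to 1/4 pel, with support for multiple reference frames and all sub-block modes.

II. H.264 deblocking filter

III. MPEG-4 IDCT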

IV. H.264 transform/inverse transform unit

V. H.264 quantizer and de-quantizer

VI. H.264 intra predictor

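The platform is designed primarily around the Reconfigurable Video Coding (RVC) architecture that MPEG is currently developing, with the aim of supporting the dynamic generation of decoders for different video compression methods. Because RVC is an ongoing MPEG work item, the current design work is carried out in software (using C models or simulation platforms for other behavioral models, such as Moses for CAL). Because our team actively participates in the development of the MPEG standard, we can revise the design of this platform at any time according to the latest results.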

3. Implemented a combined hardware/software system for Java Dynamic Code Optimization for Java Processor.

Appendix 1. Reconfigurable Video Acceleration SoC Platform

I. INTRODUCTION

Most multimedia devices today have to support multiple codec standards. Take video codecs as an example: a portable multimedia player usually supports playback of MPEG-1/2, MPEG-4 SP, WMV, and H.264/MPEG-4 Part 10 video content. To reduce system cost, a single-chip SoC solution that supports all these standards is a sensible approach. From an IC designer's point of view this is not a serious problem, since most (if not all) popular video codecs share the same block-based motion-compensated transform coding data flow. In addition, many coding tools have similar architectures. However, there are some application issues that make traditional codec design approaches unsatisfactory [1].

A major problem with the existing approach to defining a codec standard is the lack of flexibility when new applications emerge. A video codec is composed of several coding tools (e.g. DCT/IDCT, MC, VLC/VLD). For a codec standard, however, the conformance point is defined at the codec level instead of the tool level. Different profiles/levels are created for each codec to address the needs of different classes of applications. This approach worked well in the past because the application scenarios were quite simple (e.g. DVD, DTV). However, with the exponential growth of new multimedia applications, defining the conformance point at the codec level has become awkward. Quite often, the designer of a new application finds it impossible to find a reasonable codec profile@level that fits the target application well. For example, the FMO tool of H.264 is useless for many applications, but a decoder may still need to support it simply because it is included in the AVC baseline profile. In general, application environments change faster than an international standard can keep up with, so there should be a more efficient way of allowing a codec to adapt to new applications while maintaining interoperability among different solutions.

MPEG recognized this issue and started a new work item called Video Coding Tools Repository (VCTR) in 2004. After some investigation, the direction and benefit of VCTR became clear [2]. This effort later became the Reconfigurable Video Coding (RVC) framework in 2006 [3]. The new framework defines the conformance point at the tool level. Therefore, in principle, an RVC-enabled codec can negotiate on the fly with the video bitstream encoder/sender about which coding tools are required and how the data path should be wired among these coding tools in order to decode the video bitstream. After this setup stage, the decoder can decode the bitstream correctly. With this approach, an SoC can support multiple codec standards, and can also create customized codecs in real time, as long as it contains all the standard-conforming tools that are necessary to decode bitstreams from different encoders.

So far, the RVC framework is still in development. Most of the investigations are done using C models and behavioral model simulators such as Moses [4]. In this report, an SoC architecture that can be used to implement the RVC framework is proposed.

The report is organized as follows. The RVC framework is introduced in Section II. The SoC architecture for direct support of RVC is presented in Section III, together with some comparisons of the RVC architecture to a common hard-wired solution. Section IV studies an implementation to estimate the cost of such flexibility. Finally, conclusions are given in Section V.

II. MPEG RVC FRAMEWORK

The concept of the MPEG RVC framework is illustrated in Fig. 1. The key difference between RVC and the old MPEG codec standards is that the interface of each coding tool is defined precisely, so that the tools can be used (like LEGO blocks) to build various codecs. The decoder configuration describes how the input bitstream is parsed so that the raw input data to each coding tool can be extracted. A decoder description language is under development so that the configuration of a specific codec (such as H.264) can be described using a (small) configuration bitstream. The decoder configuration bitstream is processed by an RVC decoder before it decodes a video bitstream conforming to the described standard. Note that after processing a configuration bitstream, the RVC decoder generates a Global Control Unit (GCU) that governs the operation of the coding tools.

In principle, the configuration description tells the RVC decoder how to wire the coding tools to form a data path. In the RVC framework, each coding tool is called a functional unit (FU) and is specified as in Fig. 2 [1]. In Fig. 2, a control signal is a signal embedded in the video bitstream (for example, the width and height of the video frame). A context signal is a signal generated from the processing of bitstream data (for example, the AC prediction direction in the MPEG-4 Part 2 video standard). The context-control unit reads in the context and control signals generated by previous FUs and generates (or passes on) context and control signals to the next FUs based on the result of the processing unit.
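The FU definition above can be sketched in software as a processing unit paired with a context-control unit. This is only a minimal illustration under stated assumptions: the class name `FunctionalUnit`, the signal names (`scale`, `width`, `max_abs`), and the toy "scaler" tool are hypothetical and do not come from the RVC specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Signals = Dict[str, int]  # control/context signals, keyed by hypothetical names


@dataclass
class FunctionalUnit:
    """Toy model of an FU: a processing unit plus a context-control unit."""
    name: str
    # Processing unit: maps (input data, incoming signals) to output data.
    process: Callable[[List[int], Signals], List[int]]
    # Context-control unit: derives new signals from incoming ones and the result.
    derive: Callable[[Signals, List[int]], Signals]

    def run(self, data: List[int], signals_in: Signals) -> Tuple[List[int], Signals]:
        out = self.process(data, signals_in)
        signals_out = dict(signals_in)                     # pass existing signals on
        signals_out.update(self.derive(signals_in, out))   # add generated signals
        return out, signals_out


# Hypothetical FU: scales samples by a control signal and emits a context signal.
scaler = FunctionalUnit(
    name="scaler",
    process=lambda data, sig: [x * sig["scale"] for x in data],
    derive=lambda sig, out: {"max_abs": max(abs(x) for x in out)},
)

data_out, signals_out = scaler.run([1, -3, 2], {"scale": 2, "width": 176})
print(data_out, signals_out)
```

Here `scale` and `width` play the role of control signals taken from the bitstream, while `max_abs` stands in for a context signal derived from the processed data.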

A partial example of a configured RVC codec that behaves like an MPEG-4 Simple Profile video decoder is shown in Fig. 3. In Fig. 3, VLD is the FU for variable length decoding, RLD is the FU for run-length decoding, and MBG is the 8x8 block coefficients composition FU.
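To make the idea of a configured data path concrete, the following sketch wires two of the FUs named above (RLD and MBG) from a toolbox according to a configuration list. It is a simplified illustration, not the RVC decoder description language: the `TOOLBOX` dictionary and `build_decoder` function are hypothetical, VLD is omitted because its code tables are codec-specific, and MBG places coefficients in raster order rather than applying a real scan pattern.

```python
# Hypothetical toolbox of FUs, keyed by name.
def rld(pairs):
    """Run-length decoding: expand (run, level) pairs into 64 coefficients."""
    coeffs = []
    for run, level in pairs:
        coeffs.extend([0] * run)   # 'run' zeros precede each nonzero level
        coeffs.append(level)
    coeffs.extend([0] * (64 - len(coeffs)))  # pad the rest of the block with zeros
    return coeffs


def mbg(coeffs):
    """Block composition: arrange 64 coefficients as an 8x8 block (raster order)."""
    return [coeffs[r * 8:(r + 1) * 8] for r in range(8)]


TOOLBOX = {"RLD": rld, "MBG": mbg}


def build_decoder(config):
    """Wire the FUs named in the configuration into a linear data path."""
    stages = [TOOLBOX[name] for name in config]

    def decode(data):
        for stage in stages:
            data = stage(data)
        return data

    return decode


decoder = build_decoder(["RLD", "MBG"])
block = decoder([(0, 12), (2, -3), (5, 1)])
print(block[0])
print(block[1])
```

Changing the configuration list changes the decoder without touching the FUs themselves, which is the property the tool-level conformance point is meant to enable.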

88 IDCT 44 GBT 44

intra-prediction ¼-Pel MC ½-Pel MC H.264 Decoder

Configuration and API

MPEG-4 Decoder Configuration and API

Tools in RVC Toolbox

Applications Old MPEG

conformance point

New RVC conformance point

Fig. 1. Concept of MPEG RVC framework

Fig. 2. Definition of an FU in RVC (figure not reproduced; labels: processing unit with input bitstream data and output bitstream data; context-control unit with context & control [in], e.g. coding parameters and mode selection signals, and context & control [out], e.g. derived parameters from the video data)

Fig. 3. Example of RVC configuration (figure not reproduced; labels: control & context; #MB, data_partition_flag; MB data)

III. SOC ARCHITECTURE FOR RVC

Since the specification of the video decoder configuration language and the actual mechanism of a GCU are still under development at MPEG, this report proposes a potential VLSI architecture that is suitable for supporting the RVC framework and performs some early analysis of such an architecture. The RVC framework actually fits the platform-based design principle of SoC quite well. For maximal flexibility, the GCU is implemented in software running on the processor core of the SoC. Each coding tool can be implemented as an IP on the bus with limited configurability via a private register file. The proposed architecture is shown in Fig. 4.

In Fig. 4, the coding tools are not attached directly to the main system bus (AMBA AHB). Instead, a local bus, the Multi-Media Bus (MMB), is used to off-load bandwidth from the main system bus. In our implementation, the MMB bus protocol is a simplified version of AHB. A two-way DMA is used to transfer data between external SDRAM and internal SRAM banks. The DMA can be invoked from either the ARM core or the coding tool IPs (as long as the tool is implemented as an MMB master). The reason for placing multiple SRAMs on the MMB is to reduce the memory bandwidth requirement when the coding tools operate in parallel.

Fig. 4. SoC architecture for RVC framework

Fig. 5. Hard-wired decoder example

Although a local bus and multiple SRAM banks are used to alleviate the bandwidth issue, the performance of this architecture still cannot match that of a hard-wired architecture. For example, a hard-wired H.264 baseline decoder may have a tighter MB decoding pipeline, as shown in Fig. 5. The architecture in Fig. 5 has two main advantages. First, the decoding pipeline is controlled by a hard-wired FSM with cycle-based synchronization. In the RVC framework, by contrast, the controller is implemented in software and hence cannot guarantee cycle-based operation of the pipeline. Second, the hard-wired approach does not require excessive accesses to external memory.

It is important to point out that the purpose of the RVC framework is not to obtain the most efficient design of a single codec, but to allow a flexible and extensible design of codec systems. Multi-standard codec support (or even the generation of customized codecs on the fly) can be achieved by configuring a new GCU via decoder description bitstreams. In the next section, we study an actual implementation of the architecture proposed in Fig. 4 to estimate the cost one has to pay for such flexibility.

IV. IMPLEMENTATION STUDY OF THE PROPOSED SYSTEM

In this section, an implementation of the proposed system architecture (Fig. 4) is investigated. The implementation is based on an SoC emulation platform, the ARM Integrator [6]. The platform is composed of a main board, an ARM9 processor core module, and a Xilinx Virtex-E XCV2000E FPGA logic module. The platform adopts the AMBA bus protocol. The RVC coding toolbox logic of the proposed system is implemented in the FPGA. The local bus protocol of the toolbox logic, MMB, is a reduced version of AHB with far fewer wires and a minimal implementation of the bus arbiter and decoder.

In the proposed system architecture, the finite state machine (FSM) that drives the operation of the coding tool FUs is implemented in software. As a result, the codec pipeline is not executed in lock step but is instead driven by the software FSM via control signals.

Each coding tool FU (see Fig. 2) is implemented so that its input bitstream data comes from an SRAM bank on the MMB and its output bitstream data is stored in another SRAM bank on the MMB. Block RAMs of the Virtex II FPGA and the ZBT SRAM of the ARM Integrator are used for this purpose.

Table I and Table II list the memory required for the input and output data. Such an implementation is clearly not as efficient as a tightly coupled pipeline [5], in which the pipeline stages are connected via registers or FIFOs.
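Every cycle count that appears in Tables I and II equals the row's total data size divided by four, which would be consistent with transferring one 32-bit word per cycle. The report does not state the bus width, so the 4 bytes per cycle figure in the sketch below is an assumption inferred from the tables, and the helper name `transfer_cycles` is hypothetical.

```python
BYTES_PER_CYCLE = 4  # assumption: one 32-bit word moves per cycle


def transfer_cycles(*sizes_in_bytes):
    """Return (total bytes, cycles needed) for one FU's memory traffic."""
    total = sum(sizes_in_bytes)
    cycles = (total + BYTES_PER_CYCLE - 1) // BYTES_PER_CYCLE  # round up
    return total, cycles


# Data sizes per macroblock, copied from Tables I and II.
rows = {
    "Intra predictor (input)": (256, 128),
    "Deblocking filter (input)": (384, 256, 120),
    "TQ/TQ^-1 (output)": (512, 256),
    "CAVLC (input)": (768, 64),
}

for name, sizes in rows.items():
    total, cycles = transfer_cycles(*sizes)
    print(f"{name}: {total} bytes, {cycles} cycles")
```

Under this assumption the computed values match the totals and cycle counts listed in the tables for these four rows.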

On the other hand, since the system control FSM is implemented in software, the Global Control Unit of the MPEG RVC framework can be dynamically implemented using this FSM. Any video decoder can therefore be emulated on the fly by the proposed architecture, as long as all the coding tools required by the target codec are supported by the architecture. This makes the proposed architecture very flexible and scalable. It is important to point out that, in order to support dynamic reconfiguration of the RVC decoder, the software-based system FSM must not be hard-coded. Instead, it should be implemented as a table-driven FSM whose table content can be modified by the RVC decoder configuration bitstream.
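A table-driven FSM of this kind might look like the following sketch. It only illustrates the general technique: the state names, event names, FU names, and the `TableDrivenFSM` class are hypothetical, and appending to a log stands in for whatever register write would actually start an FU. How a configuration bitstream would be turned into such a table is not defined in this report.

```python
from typing import Dict, List, Optional, Tuple

# Transition table: (state, event) -> (next state, FU to start or None).
Table = Dict[Tuple[str, str], Tuple[str, Optional[str]]]


class TableDrivenFSM:
    """Controller whose behavior lives entirely in a replaceable table."""

    def __init__(self, table: Table, start: str) -> None:
        self.table = dict(table)
        self.state = start
        self.log: List[str] = []  # FUs started, in order

    def load_table(self, table: Table, start: str) -> None:
        """Reconfigure the controller without changing any code."""
        self.table = dict(table)
        self.state = start

    def step(self, event: str) -> str:
        next_state, fu = self.table[(self.state, event)]
        if fu is not None:
            self.log.append(fu)  # stand-in for writing the FU's start register
        self.state = next_state
        return next_state


# Hypothetical table for decoding one intra macroblock.
INTRA_MB: Table = {
    ("idle", "mb_ready"): ("parse", "CAVLC"),
    ("parse", "done"): ("residual", "TQ_INV"),
    ("residual", "done"): ("predict", "INTRA_PRED"),
    ("predict", "done"): ("filter", "DEBLOCK"),
    ("filter", "done"): ("idle", None),
}

fsm = TableDrivenFSM(INTRA_MB, "idle")
for event in ["mb_ready", "done", "done", "done", "done"]:
    fsm.step(event)
print(fsm.state, fsm.log)
```

Because `step` only looks up the table, calling `load_table` with a different table yields a different control flow over the same set of FUs, which is the property needed for dynamic reconfiguration.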

The implementation of the processing unit and context-control unit of a coding tool FU follows traditional hard-wired IP design methodology: the processing unit is implemented as a data path, and the context-control unit is a hard-wired FSM with register files for memory-mapped I/O configuration and signaling. Currently, most of the FUs supported in the proposed platform are for H.264. The synthesis report of some of the implemented FUs is shown in Table III.

V. CONCLUSIONS

This report introduces the MPEG RVC framework and proposes an SoC architecture to support the framework. Since the RVC framework is still under development at MPEG, there is not much research on how the framework can be efficiently supported using an SoC platform design paradigm. The table-driven software FSM for dynamic generation of a GCU and the decoder configuration language are yet to be defined by MPEG. However, based on our study, the proposed architecture is feasible for practical SoC implementation of the RVC framework. Although a reconfigurable video codec cannot compete with a hard-wired codec in performance given current VLSI implementation technology, it is much more scalable in the sense that new codecs (coding tools) can be added to the platform with minimal effort.

TABLE I. Data size from external memory for the FUs in the proposed RVC architecture

FU                          Data from external memory                                Total      Cycles
Intra predictor (input)     Luma 256 bytes; Chroma 128 bytes                         384 bytes  96
Deblocking filter (input)   Luma 384 bytes; Chroma 256 bytes; Block info 120 bytes   760 bytes  190

TABLE II. Data size from internal memory for the FUs in the proposed RVC architecture

FU                           Data from internal memory                                        Total      Cycles
Intra predictor (output)     Residual luma 256 bytes; Residual chroma 128 bytes               384 bytes  (not given)
TQ/TQ^-1 (output)            Trans.&quant. luma 512 bytes; Trans.&quant. chroma 256 bytes     768 bytes  192
TQ/TQ^-1 (input)             Luma 256 bytes; Chroma 128 bytes                                 384 bytes  96
Deblocking filter (output)   Deblocked luma 256 bytes; Deblocked chroma 128 bytes             384 bytes  (not given)
CAVLC (input)                Residual data 768 bytes; MVD 64 bytes                            832 bytes  208

TABLE III. Synthesis report of some of the implemented logic modules

Module name                       Clock rate  Logic size                Bandwidth (outputs/clk)    Memory usage
H.264 transform                   72 MHz      252 LUTs                  16/18                      1 (16x16 bit)
Quantizer                         NA          197 LUTs with MULT18X18   1/1                        96 words by 14 bits; 52 words by 5 bits; 52 words by 3 bits
Intra predictor 1 (other modes)   198 MHz     879 LUTs                  4/1                        NA
Intra predictor 2 (DC mode)       158 MHz     188 LUTs                  1/1 (I4MB); 1/5 (I16MB)    NA
In-loop filter                    60 MHz      3105 LUTs                 2/1                        16x384 bits
CAVLC*                            50 MHz      3125 LUTs                 depends on content         128x22 bit; 16x16 bit
MPEG-2 IDCT                       77 MHz      3232 LUTs                 64/158                     64x16 bit

*CAVLC is based on a Spartan II FPGA device; the others are based on a Virtex-E FPGA device.

REFERENCES

[1] E. S. Jang, K. Asai, and C.-J. Tsai, Study of Video Coding Tool Repository v5.0, MPEG Meeting Document N7329, Poznan, July 2005.

[2] C.-J. Tsai, Suggestions on the Direction of VCTR, MPEG Input Document M12074, Busan, April 2005.

[3] ISO/IEC MPEG Video Group, Final Call for Proposals on Reconfigurable Video Coding, MPEG Meeting Document N8070, Montreux, April 2006.

[4] J. Janneck et al., Moses Tool Suite, https://sourceforge.net/projects/mosestoolsuite/.

[5] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, "Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture," Proc. of IEEE ISCAS 2004, Kobe, 2004.

[6] http://www.arm.com/products/DevTools/IntegratorAP.htm

Appendix 2. Dynamic Task-Partitioning Scheduler for a Heterogeneous Multi-Core Operating System
