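Overall, the results of the three project years match the goals stated in the original proposal closely. The expected goals were met as follows: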
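1. For the multimedia dual-core system, we developed an operating system kernel with dynamic task partitioning and scheduling for Heterogeneous Multi-Processor (HMP) systems on the TI OMAP OSK5912 platform. Developing this technology produced the following results:
I. Developed our own DSP scheduler to help with task scheduling.
II. Ported eCos to the TI OMAP platform and developed an efficient communication protocol between the two cores.
III. Designed a dynamic task-partitioning module that works with the eCos MLQ scheduler.
IV. Designed a new programming model for heterogeneous multi-core systems.
V. Implemented development tools that support applications with dynamic task partitioning.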
2. For the video coding hardware acceleration platform, we implemented the following IPs:
I. Motion estimation: computes down to 1/4-pel precision, referencing multiple reference frames and all sub-block modes
II. H.264 deblocking filter
III. MPEG4 IDCT
IV. H.264 transform/inverse transform unit
V. H.264 quantizer and de-quantizer
VI. H.264 intra predictor
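The platform is designed mainly around the Reconfigurable Video Coding (RVC) architecture that MPEG is currently developing, with the aim of supporting the dynamic generation of decoders for different video compression methods. Because RVC is still an active MPEG work item, the current designs are all studied in software (C models or other behavioral-model simulation platforms, such as Moses for CAL). Because our team actively participates in MPEG standardization, we can revise the platform design as new results become available.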
3. Implemented a hardware/software system for Java Dynamic Code Optimization for Java Processor.
Appendix 1. Reconfigurable Video Acceleration SoC Platform
I. INTRODUCTION
Most multimedia devices today have to support multiple codec standards. Take video codecs for example: a portable multimedia player usually supports playback of MPEG-1/2, MPEG-4 SP, WMV, and H.264/MPEG-4 Part 10 video content. To reduce system cost, a single-chip SoC solution that supports all these standards is a sensible approach. From an IC designer's point of view this is not a serious problem, since most (if not all) popular video codecs share the same block-based motion-compensated transform coding data flow. In addition, many coding tools have similar architectures. However, some application issues make traditional codec design approaches unsatisfactory [1].
A major problem with the existing approach to defining a codec standard is the lack of flexibility when new applications emerge. A video codec is composed of several coding tools (e.g. DCT/IDCT, MC, VLC/VLD, etc.). However, for a codec standard, the conformance point is defined at codec-level instead of tool-level. Different profiles/levels are created for each codec to address the needs of different classes of applications. This approach worked well in the past, when the application scenarios were quite simple (e.g. DVD, DTV). However, with the exponential growth of new multimedia applications, defining the conformance point at codec-level has become awkward. Quite often, a new application designer finds it impossible to find a reasonable codec profile@level that fits the target application well. For example, the FMO tool of H.264 is useless for many applications, but a decoder may still need to support it simply because it is included in the AVC baseline profile. In general, the application environment changes faster than an international standard can catch up, so there should be a more efficient way of allowing a codec to adapt to new applications while maintaining interoperability among different solutions.
MPEG recognized this issue and started a new work item called Video Coding Tools Repository (VCTR) in 2004. After some investigation, the direction and benefit of VCTR became clear [2]. In 2006, this effort became the Reconfigurable Video Coding (RVC) framework [3]. This new framework defines the conformance point at tool-level. Therefore, in principle, an RVC-enabled codec can negotiate on-the-fly with the video bitstream encoder/sender about which coding tools are required and how the data path should be wired among these coding tools in order to decode the video bitstream. After the setup stage, the decoder can decode the bitstream correctly. With this approach, an SoC can support multiple codec standards as well as create customized codecs in real time, as long as it contains all the standard-conforming tools that are necessary to decode bitstreams from different encoders.
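The availability condition in the previous paragraph (the SoC must contain every tool a bitstream needs) can be sketched as a simple set check. This is only an illustration and not part of the RVC specification: the tool names, the `TOOLBOX` set, and the helpers `missing_tools` and `can_decode` are hypothetical, and the tool lists are illustrative rather than complete.

```python
# Hypothetical toolbox of an RVC-enabled SoC (names are illustrative only).
TOOLBOX = {"VLD", "RLD", "IDCT_8x8", "MC_HALF_PEL", "MC_QUARTER_PEL",
           "INTRA_PRED", "DEBLOCK"}


def missing_tools(required, toolbox=TOOLBOX):
    """Return the sorted list of required tools that the toolbox lacks."""
    return sorted(set(required) - set(toolbox))


def can_decode(required, toolbox=TOOLBOX):
    """A bitstream is decodable only if every required tool is available."""
    return not missing_tools(required, toolbox)


# Illustrative (not complete) tool lists requested by two senders.
mpeg4_like = ["VLD", "RLD", "IDCT_8x8", "MC_HALF_PEL"]
unsupported = ["VLD", "WAVELET"]
```

With these assumed lists, `can_decode(mpeg4_like)` holds, while `missing_tools(unsupported)` reports the tool that would have to be added to the platform first.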
So far, the RVC framework is still in development. Most of the investigations are done using C models and behavioral model simulators such as Moses [4]. In this report, an SoC architecture that can be used to implement the RVC framework is proposed.
The report is organized as follows. The RVC framework is introduced in section II. The SoC architecture for direct support of RVC is presented in section III, together with some comparisons of the RVC architecture to a common hard-wired solution. Section IV studies an implementation to get an idea of the cost of such flexibility. Finally, some discussions are given in section V.
II. MPEG RVC FRAMEWORK
The concept of the MPEG RVC framework is illustrated in Fig. 1. The key difference between RVC and the old MPEG codec standards is that the interface of each coding tool is defined precisely so that the tools can be used (like LEGO blocks) to build various codecs. The decoder configuration describes how the input bitstream can be parsed so that the raw input data to each coding tool can be extracted. A decoder description language is under development so that the configuration of a specific codec (such as H.264) can be described using a (small) configuration bitstream. The decoder configuration bitstream will be processed by an RVC decoder before it decodes a video bitstream conforming to the described standard. Note that after processing a configuration bitstream, the RVC decoder will generate a Global Control Unit (GCU) that governs the operation of the coding tools.
In principle, the configuration description tells the RVC decoder how to wire the coding tools to form a data path. In the RVC framework, each coding tool is called a functional unit (FU) and is specified as in Fig. 2 [1]. In Fig. 2, a control signal is a signal embedded in the video bitstream (for example, the width and height of the video frame). A context signal is a signal generated from the processing of bitstream data (for example, the AC prediction direction in the MPEG-4 Part 2 video standard). The context-control unit reads in the context and control signals generated by previous FUs and generates (or passes on) some context and control signals to the next FUs based on the result of the processing unit.
A partial example of a configured RVC codec that behaves like an MPEG-4 Simple Profile video decoder is shown in Fig. 3. In Fig. 3, VLD is the FU for variable length decoding, RLD is the FU for run-length decoding, and MBG is the 8x8 block coefficients composition FU.
Fig. 1. Concept of MPEG RVC framework. [Figure labels: tools in the RVC toolbox (8x8 IDCT, 4x4 GBT, 4x4 intra-prediction, 1/4-pel MC, 1/2-pel MC); H.264 decoder configuration and API; MPEG-4 decoder configuration and API; applications; old MPEG conformance point; new RVC conformance point.]
Fig. 2. Definition of an FU in RVC. [Figure labels: processing unit with input bitstream data and output bitstream data; context-control unit with context & control [in] (e.g. coding parameters, mode selection signals) and context & control [out] (e.g. derived parameters from the video data).]
Fig. 3. Example of RVC configuration. [Figure labels: control & context (#MB, data_partition_flag); MB data.]
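The FU structure of Fig. 2 and the chaining of FUs into a data path can be sketched as follows. This is a minimal software model under stated assumptions, not the RVC definition: the `FunctionalUnit` class, the `run_path` helper, the toy run-length and block-composition functions, and the `num_coeffs` signal are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Signals = Dict[str, int]  # context and control signals passed between FUs


@dataclass
class FunctionalUnit:
    name: str
    # Processing unit: maps input data to output data, given the signals.
    process: Callable[[List[int], Signals], List[int]]
    # Context-control unit: derives the signals handed to the next FU.
    update: Callable[[List[int], Signals], Signals]

    def run(self, data, signals):
        out = self.process(data, signals)
        return out, self.update(out, signals)


def run_path(path, data, signals):
    """Run the FUs in the order given by the decoder configuration."""
    for fu in path:
        data, signals = fu.run(data, signals)
    return data, signals


def rld_process(pairs, signals):
    # Toy run-length decoding: input is a flat list [run, level, run, level, ...].
    out = []
    for run, level in zip(pairs[0::2], pairs[1::2]):
        out.extend([0] * run)
        out.append(level)
    return out


def rld_update(out, signals):
    # Pass the incoming signals on and add a derived context signal.
    return {**signals, "num_coeffs": len(out)}


def mbg_process(coeffs, signals):
    # Toy block composition: pad the coefficient list to one 8x8 block.
    return coeffs + [0] * (64 - len(coeffs))


def pass_on(out, signals):
    # Context-control unit that forwards the signals unchanged.
    return dict(signals)


RLD = FunctionalUnit("RLD", rld_process, rld_update)
MBG = FunctionalUnit("MBG", mbg_process, pass_on)

# The "configuration" here is simply the ordered list of FUs.
block, signals = run_path([RLD, MBG], [0, 5, 2, 3], {"width": 176, "height": 144})
```

The control signals (`width`, `height`) travel unchanged through the path, while the context signal is produced by one FU and made available to the next, which mirrors the two signal kinds described for Fig. 2.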
III. SOC ARCHITECTURE FOR RVC
Since the specification of the video decoder configuration language and the actual mechanism of a GCU are still under development at MPEG, this report proposes a potential VLSI architecture that is suitable for supporting the RVC framework and performs some early analysis of such an architecture. The RVC framework actually fits the platform-based design principle of SoC quite well. For maximal flexibility, the GCU is implemented in software and runs on the processor core of the SoC. Each coding tool can be implemented as an IP on the bus with limited configurability via a private register file. The proposed architecture is shown in Fig. 4.
In Fig. 4, the coding tools are not attached to the main system bus (AMBA AHB) directly. A local bus, MMB (Multi-Media Bus), is used to off-load bandwidth from the main system bus. In our implementation, the bus protocol of MMB is a simplified version of AHB. A two-way DMA is used to transfer data between external SDRAM and internal SRAM banks. The DMA can be invoked from either the ARM core or the coding tool IPs (as long as the tool is implemented as an MMB master). The reason for multiple SRAMs on the MMB is to reduce the memory bandwidth requirement for parallel operation of the coding tools.
Fig. 4. SoC architecture for RVC framework
Fig. 5. Hard-wired decoder example
Although a local bus and multiple SRAM banks are used to alleviate the bandwidth issue, the performance of this architecture still cannot match that of a hard-wired architecture. For example, a hard-wired H.264 baseline decoder may have a tighter MB decoding pipeline, as shown in Fig. 5. The architecture in Fig. 5 has two main advantages. First, the decoding pipeline is controlled by a hard-wired FSM with cycle-based synchronization, whereas in the RVC framework the controller is implemented in software and hence cannot guarantee cycle-based operation of the pipeline. Second, the hard-wired approach does not require excessive accesses to external memory.
It is important to point out that the purpose of the RVC framework is not to obtain the most efficient design of a single codec, but to allow a flexible and extensible design of codec systems. Multi-standard codec support (or even on-the-fly generation of customized codecs) can be achieved by configuring a new GCU via decoder description bitstreams. In the next section, we study an actual implementation of the architecture proposed in Fig. 4 to get an idea of the cost one has to pay for such flexibility.
IV. IMPLEMENTATION STUDY OF THE PROPOSED SYSTEM
In this section, an implementation of the proposed system architecture (Fig. 4) is investigated. The implementation is based on an SoC emulation platform, the ARM Integrator [6]. The platform is composed of a main board, an ARM9 processor core module, and a Xilinx VirtexE XCV2000E FPGA logic module. The platform adopts the AMBA bus protocol. The RVC coding toolbox logic of the proposed system is implemented in the FPGA. The local bus protocol of the toolbox logic, MMB, is a reduced version of AHB with far fewer wires and a minimal implementation of the bus arbiter and decoder.
In the proposed system architecture, the finite state machine (FSM) that drives the operation of the coding tool FUs is implemented in software. As a result, the codec pipeline is not executed in a lock-step fashion but is instead driven by the software FSM via control signals.
Each coding tool FU (please refer to Fig. 2) is implemented so that the input bitstream data comes from an SRAM bank on the MMB and the output bitstream data is stored in another SRAM bank on the MMB. Block RAMs of the Virtex II FPGA and the ZBT SRAM of the ARM Integrator are used for this purpose. Table I and Table II list the memory required for the input data and output data. Such an implementation is clearly not as efficient as a tightly-coupled pipeline [5], where different pipeline stages are connected via registers or FIFOs.
On the other hand, since the system control FSM is implemented in software, the Global Control Unit of the MPEG RVC framework can be dynamically implemented using this FSM. Therefore, any video decoder can be emulated on-the-fly by the proposed architecture as long as all the coding tools required by the target codec are supported by the architecture, which makes the proposed architecture flexible and scalable. Note that, in order to support dynamic reconfiguration of the RVC decoder, the software-based system FSM must not be hard-coded. Instead, it should be implemented as a table-driven FSM whose table content can be modified by the RVC decoder configuration bitstream.
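The table-driven FSM idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the GCU mechanism (which MPEG has not yet defined): the class `TableDrivenFSM`, the state and event names, and the two transition tables are hypothetical, and the order of FUs in the tables is illustrative rather than a faithful decoding order.

```python
class TableDrivenFSM:
    """Software controller whose behaviour is defined entirely by a table."""

    def __init__(self, table, start):
        self.load(table, start)

    def load(self, table, start):
        # Replacing the table reconfigures the controller without new code.
        # table: {(state, event): (next_state, fu_to_start or None)}
        self.table = dict(table)
        self.state = start

    def step(self, event):
        """Consume one event; return the FU to trigger next, if any."""
        next_state, action = self.table[(self.state, event)]
        self.state = next_state
        return action


# Hypothetical transition tables standing in for two decoder configurations.
H264_LIKE = {
    ("IDLE", "mb_ready"): ("PARSE", "CAVLC"),
    ("PARSE", "done"): ("RESIDUAL", "TQ_INV"),
    ("RESIDUAL", "done"): ("PREDICT", "INTRA_PRED"),
    ("PREDICT", "done"): ("FILTER", "DEBLOCK"),
    ("FILTER", "done"): ("IDLE", None),
}
MPEG4_LIKE = {
    ("IDLE", "mb_ready"): ("PARSE", "VLD"),
    ("PARSE", "done"): ("RESIDUAL", "IDCT"),
    ("RESIDUAL", "done"): ("IDLE", None),
}


def trace(fsm, events):
    """Feed a sequence of events and collect the FUs that get triggered."""
    return [fsm.step(e) for e in events]


fsm = TableDrivenFSM(H264_LIKE, "IDLE")
h264_trace = trace(fsm, ["mb_ready", "done", "done", "done", "done"])
fsm.load(MPEG4_LIKE, "IDLE")  # reconfigure the same controller
mpeg4_trace = trace(fsm, ["mb_ready", "done", "done"])
```

The same `step` code drives both configurations; only the table differs, which is the property a configuration bitstream would rely on.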
The implementation of the processing unit and context-control unit of a coding tool FU follows the traditional hard-wired IP design methodology: the processing unit is implemented as a data path, and the context-control unit is a hard-wired FSM with register files for memory-mapped I/O configuration and signaling. Currently, most of the FUs supported in the proposed platform are for H.264. The synthesis report of some of the implemented FUs is shown in Table III.
V. CONCLUSIONS
This report introduces the MPEG RVC framework and proposes an SoC architecture to support it. Since the RVC framework is still under development at MPEG, there is not much research on how the framework can be efficiently supported using an SoC platform design paradigm. The table-driven software FSM for dynamic generation of a GCU and the decoder configuration language are yet to be defined by MPEG. However, based on our study, the proposed architecture is feasible for practical SoC implementation of the RVC framework. Although a reconfigurable video codec cannot compete with a hard-wired codec in performance given current VLSI implementation technology, it is much more scalable in the sense that new codecs (coding tools) can be added to the platform with minimal effort.
TABLE I. Size of the data that the FUs in the proposed RVC architecture read from external memory
FU (direction)             Data components                                             Total      Cycles
Intra predictor (input)    Luma: 256 bytes; Chroma: 128 bytes                          384 bytes  96
Deblocking filter (input)  Luma: 384 bytes; Chroma: 256 bytes; Block info: 120 bytes   760 bytes  190
TABLE II. Size of the data that the FUs in the proposed RVC architecture exchange with internal memory
FU (direction)              Data components                                                       Total      Cycles
Intra predictor (output)    Residual luma: 256 bytes; Residual chroma: 128 bytes                  384 bytes  96
TQ/TQ^-1 (output)           Trans. & quant. luma: 512 bytes; Trans. & quant. chroma: 256 bytes    768 bytes  192
TQ/TQ^-1 (input)            Luma: 256 bytes; Chroma: 128 bytes                                    384 bytes  96
Deblocking filter (output)  Deblocked luma: 256 bytes; Deblocked chroma: 128 bytes                384 bytes  96
CAVLC (input)               Residual data: 768 bytes; MVD: 64 bytes                               832 bytes  208
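In every row of Tables I and II the cycle count equals the total byte count divided by four, which is consistent with moving one 32-bit word per cycle. The bus width is not stated in the tables, so the value of `BYTES_PER_CYCLE` below is an inference from the numbers, and the helper `transfer_cycles` is a hypothetical name used only for this check.

```python
BYTES_PER_CYCLE = 4  # assumption inferred from the tables (one 32-bit word per cycle)


def transfer_cycles(*component_bytes):
    """Return (total bytes, cycles) for one FU transfer."""
    total = sum(component_bytes)
    cycles, remainder = divmod(total, BYTES_PER_CYCLE)
    # A partial word still costs a full cycle.
    return total, cycles + (1 if remainder else 0)


# Byte counts taken from Tables I and II.
TRANSFERS = {
    "Intra predictor (input)": (256, 128),
    "Deblocking filter (input)": (384, 256, 120),
    "TQ/TQ-1 (output)": (512, 256),
    "CAVLC (input)": (768, 64),
}
RESULTS = {name: transfer_cycles(*parts) for name, parts in TRANSFERS.items()}
```

Under this assumption the computed pairs reproduce the Total and Cycles columns for the listed rows.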
TABLE III. Synthesis report of some of the logic modules
Module                           Clock rate  Logic size                  Bandwidth (outputs/clock)  Memory usage
H.264 Transform                  72 MHz      252 LUTs                    16/18                      1 (16x16 bit)
Quantizer                        NA          197 LUTs with MULT18X18 LE  1/1                        96 words by 14 bits; 52 words by 5 bits; 52 words by 3 bits
Intra predictor 1 (other modes)  198 MHz     879 LUTs                    4/1                        NA
Intra predictor 2 (DC mode)      158 MHz     188 LUTs                    1/1 (I4MB); 1/5 (I16MB)    NA
Inloop Filter                    60 MHz      3105 LUTs                   2/1                        16x384 bits
CAVLC*                           50 MHz      3125 LUTs                   depends on content         128x22-bit; 16x16-bit
MPEG-2 IDCT                      77 MHz      3232 LUTs                   64/158                     64x16 bit
*CAVLC is based on a Spartan II FPGA device, and the others are based on a VirtexE FPGA device.
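The clock rate and the outputs-per-clock figures of Table III can be combined into a peak output rate per module. The sketch below is a rough reading only: it assumes the clock rates and bandwidth figures pair up with the modules in the order listed in Table III, it leaves the unit of one "output" unspecified because the table does not define it, and `outputs_per_second` is a hypothetical helper.

```python
from fractions import Fraction


def outputs_per_second(clock_mhz, outputs, clocks):
    """Peak output rate in million outputs per second."""
    return clock_mhz * Fraction(outputs, clocks)


# (clock in MHz, outputs, clocks) as read from Table III.
PEAK = {
    "H.264 Transform": outputs_per_second(72, 16, 18),
    "Intra predictor 1": outputs_per_second(198, 4, 1),
    "Inloop Filter": outputs_per_second(60, 2, 1),
    "MPEG-2 IDCT": outputs_per_second(77, 64, 158),
}
```

These are upper bounds at the synthesized clock rate; the software-controlled pipeline and memory transfers described in this section would lower the sustained rate.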
REFERENCES
[1] E. S. Jang, K. Asai, and C.-J. Tsai, Study of Video Coding Tool Repository v5.0, MPEG Meeting Document N7329, Poznan, July 2005.
[2] C.-J. Tsai, Suggestions on the Direction of VCTR, MPEG Input Document M12074, Busan, April 2005.
[3] ISO/IEC MPEG Video Group, Final Call for Proposals on Reconfigurable Video Coding, MPEG Meeting Document N8070, Montreux, April 2006.
[4] J. Janneck et al., Moses Tool Suite, https://sourceforge.net/projects/mosestoolsuite/.
[5] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, "Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture," Proc. of IEEE ISCAS 2004, Kobe, 2004.
[6] http://www.arm.com/products/DevTools/IntegratorAP.htm