Design Verification

CHAPTER 4 PROPOSED ARCHITECTURE

4.3 Design Implementation and Verification

4.3.2 Design Verification

Fig. 26 shows the verification flow. We use both software and hardware to verify the proposed algorithm. The test patterns are generated by software and applied as the input of the hardware design. The software then computes the expected answer, which is compared with the hardware result, and the result is stored in memory. Afterward, the coded data is read from memory by both the software and the hardware decompressor. We check the coded data to confirm that the software and hardware results match.

Fig. 26 The verification flow

Chapter 5 System Integration

Section 5.1 introduces the Si2 H.264 decoder system. Access analysis and processing cycle analysis are then discussed in Sections 5.2 and 5.3, respectively.

5.1 System Analysis

Fig. 27 The block diagram of the overall H.264 decoder system

The overall H.264 decoder [3] with the embedded compression codec is shown in Fig. 27. Our H.264 decoder is specified for HD1080/HD720 at 30 fps and works at 150 MHz. The embedded compressor sits between the deblocking filter and the external memory, and the embedded decompressor sits between the external memory and the motion compensation. Designing the address controller of the EC is simple because the compression ratio is fixed at two. The system bus is 32 bits wide and the external memory has 32 bits per entry.

5.1.1 Interface Problem

Fig. 28 The system interface design for embedded codec

Fig. 28 shows the system interface design for the embedded codec. The embedded compression can be regarded as an interface between the chip and the off-chip memory.

In the original H.264 decoder system, there are two interface issues. The first occurs between the deblocking filter and the off-chip memory. The throughput of the deblocking filter is 4 pixels per clock. Therefore, to avoid a pipeline jam at the input of the embedded compressor, the processing time must be less than or equal to 4 cycles. The other issue occurs between the motion compensation (MC) and the off-chip memory. The input of MC requires 4 pixels per cycle, so the throughput of the embedded decompressor must be at least 4 pixels per cycle. Furthermore, since the compression ratio is fixed at two, the address converter can be easily implemented.

5.1.2 Processing Cycles Problem

In this part, we discuss the processing cycle problem of our H.264 decoder system.

Our H.264 decoder is specified for HD1080/HD720 at 30 fps and works at 150 MHz. From our simulation, MC requires 25 cycles on average to process a 4×4 block. Therefore, the embedded codec needs a design with few processing cycles so that it adds as little as possible to this load.

5.1.3 Overhead Problem

Fig. 29 An example of overhead problem

A 4×4 block is the basic coding unit in the H.264 standard. Moreover, because block-based approaches fit the block-oriented structure of the received bit-stream, they are the most popular techniques. However, they introduce an overhead problem [11], where the overhead is defined as the ratio between the number of pixels that are actually accessed during the motion compensation of a block and the number of pixels that are really useful in the reference block. In the original system without block-based approaches, the ratio is equal to 1 because only the required pixels are accessed. In contrast, in the original system with block-based approaches, the ratio is greater than 1 whenever the required data is not aligned with the block grid. As shown in Fig. 29, if the required 4×4 block straddles the grid, we need to fetch four 4×4 grid blocks. The overhead in this case is 48 pixels: 64 pixels are fetched for 16 useful ones.

TABLE 3 Overhead with block grid for six sequences

Sequence    4×4 block grid    8×8 block grid    16×16 block grid
Foreman     1.31              1.77              3.69
Flower      1.30              1.74              3.77
News        1.14              1.51              2.78
Silent      1.17              1.50              3.22
Stefan      1.51              2.44              6.95
Weather     1.17              1.49              3.18
All         1.27              1.73              3.93

TABLE 3 summarizes the statistical analysis provided in [12], simulated with six sequences. The table shows that a sequence with faster motion, such as Stefan, causes higher overhead. It also shows that a smaller block grid yields a smaller overhead.

Fig. 30 The flow of EC accesses

5.2 Access Analysis

Fig. 30 shows that the deblocking filter writes data into the external memory through the embedded compressor, and that MC reads data from the external memory through the embedded decompressor. Moreover, using SystemC, CoWare can build a simulation platform to analyze the related system problems. As shown in Fig. 31, the user-defined field includes the H.264 decoder and the EC, which are coded in Verilog.

Fig. 31 The Block Diagram of CoWare System (block labels: CoWare, User-defined, H.264 Decoder, Deblocking Filter, Motion Compensation, Embedded Compressor, Embedded Decompressor, AMBA Slave Interface, AMBA AHB, External Memory)

Fig. 32 shows the block diagram of our H.264 decoder with EC in the workspace of the CoWare platform. The external memory is a 128Mb Mobile LPSDR [14], and the bus protocol is AMBA 2.0 with a 32-bit data width. After the CoWare simulation, we obtain the data access information shown in Fig. 33 and Fig. 34.

Fig. 32 Block diagram in CoWare System

Fig. 33 Embedded compressor waveform over CoWare system (annotated regions: Write 4x4 Blocks, Read 4x4 Blocks)

Fig. 34 Data access trace

5.2.1 Write Reduction

The compression ratio of the proposed EC is fixed at 2. After the proposed EC (4×2 block unit and CR=2) is embedded into our H.264 decoder system, the number of writes is reduced by 50% compared with the original system (4×1 block-based access).

5.2.2 Read Reduction

In motion compensation, the required data is read according to the motion vector (MV). For MV (x, y), the x value and the y value can each be classified as follows:

1) Align: The value is a multiple of four, and the required 4 pixels fit the block grid.

2) Not Align: The value is an integer but not a multiple of four. The required 4 pixels span two 4×4 block grids.

3) Sub Pixel: The value is accurate to 1/2 or 1/4 pixel. The required 9 pixels are interpolated into 4 pixels.

TABLE 4 All Cases of read access required by MC with/without EC

Case of MV (x, y) | Access Cycles for System without EC | Access Cycles for System with EC | Reduction of

In TABLE 4, we analyze the number of read accesses of the motion compensation with and without the EC. The worst case is the (Sub, Sub) case: to finish the motion compensation of a 4×4 block, a 9×9 block is needed, so the system with/without the proposed embedded compressor takes 15/27 cycles. The best case is the (Align, Align) case, for which the system with/without the embedded compressor needs 2/4 cycles. In the other cases, when the data required by motion compensation does not fit the 4×2 block grid, the number of accesses increases. From our simulation with four sequences (Akiyo, Stefan, Mobile Calendar, Foreman) of 300 frames each, we can derive the probability of each case.

According to the probability of each case, the reduction ratio of the reading times is about 50%.
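The best-case and worst-case cycle counts quoted above are consistent with a simple counting model, sketched below under stated assumptions: each 32-bit access returns one 4×1 row segment in the original system, or one compressed 4×2 unit (CR = 2) with the EC, and the required region is assumed to span the minimum possible number of access units. The function name is hypothetical, and the model is not claimed to reproduce the remaining cases of TABLE 4.

```python
import math


def read_cycles(rows, cols, unit_rows, unit_cols=4):
    """Number of 32-bit accesses needed to cover a rows x cols pixel region.

    Assumption: one access returns a unit_rows x unit_cols pixel unit,
    and the region spans the minimum possible number of units.
    """
    return math.ceil(rows / unit_rows) * math.ceil(cols / unit_cols)


# (Align, Align): a 4x4 block is needed.
print(read_cycles(4, 4, unit_rows=1), read_cycles(4, 4, unit_rows=2))  # 4 2
# (Sub, Sub): a 9x9 block is needed for interpolation.
print(read_cycles(9, 9, unit_rows=1), read_cycles(9, 9, unit_rows=2))  # 27 15
```

In this model the saving comes entirely from halving the number of row accesses, which is why an odd row count such as 9 gives slightly less than a 50% reduction (15 instead of 13.5).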

5.3 Processing Cycle Analysis

The processing cycle problem was introduced in Section 5.1.2. In this section, we discuss the results of our system integration.

Our system specification is HD1080/720@30fps. This specification allows a budget of 25 cycles for each 4×4 block. Because we do not want to change the specification, MC with the proposed embedded decompressor must finish within 25 cycles. For the same reason, we do not change the data input structure. The processing cycles are therefore computed as

\[
\text{Processing Time}_{\text{MC with Decompressor}} = \text{Delay}_{\text{Decompressor}} + \text{Processing Time}_{\text{MC without Decompressor}} \tag{10}
\]
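As a rough check of the 25-cycle budget per 4×4 block, the sketch below divides the clock rate by the number of 4×4 blocks that must be processed per second. The 1920×1088 coded frame size and the 4:2:0 chroma format are assumptions that the text does not state, and the function name is hypothetical.

```python
def cycles_per_block(clock_hz, width, height, fps, chroma_factor=1.5, block=4):
    """Average clock cycles available per block x block unit.

    Assumption: chroma_factor = 1.5 models 4:2:0 sampling, where the two
    chroma planes together add half as many samples as the luma plane.
    """
    luma_blocks = (width // block) * (height // block)
    blocks_per_second = luma_blocks * chroma_factor * fps
    return clock_hz / blocks_per_second


budget = cycles_per_block(150e6, 1920, 1088, 30)
print(f"{budget:.2f} cycles per 4x4 block")  # 25.53 cycles per 4x4 block
```

Under these assumptions the budget comes out at about 25.5 cycles, in line with the 25 cycles used in the text.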

TABLE 5 shows the processing cycle analysis for all cases. Except for the (Sub, Sub) case, every case takes fewer than 25 cycles. The average number of processing cycles for MC without EC is 17.4 cycles. Therefore, the proposed embedded compression can be embedded into our H.264 decoder system.

TABLE 5 All Cases of Processing Cycle Analysis for EC

Case of MV (x, y)

5.3.1 Access Reduction Ratio

The access ratio of the system with/without EC is given in (11)

\[
\text{Access Ratio} = \frac{\text{Read}_{\text{System with EC}} + \text{Write}_{\text{System with EC}}}{\text{Read}_{\text{System without EC}} + \text{Write}_{\text{System without EC}}} \tag{11}
\]

From the simulation, the ratio of read times with/without EC is 0.517, the ratio of write times with/without EC is 0.5, and the average access ratio of read/write in the system without EC is about 3.51. The overall access ratio is given in (12)

\[
\text{Overall Access Ratio} = \frac{0.517 \times 3.51 + 0.5 \times 1}{3.51 + 1} = 51.3\% \tag{12}
\]

The average reduction ratio on memory accessed is given in (13)

\[
\text{Average Reduction Ratio} = 1 - \text{Overall Access Ratio} = 1 - 0.513 = 48.7\% \tag{13}
\]

Therefore, the average reduction ratio is 48.7%.
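The arithmetic in (11) to (13) can be reproduced in a few lines. The sketch below normalizes the write count of the system without EC to 1, so that its read count is 3.51; the function and variable names are illustrative only.

```python
def overall_access_ratio(read_ratio, write_ratio, reads_per_write):
    """Weighted access ratio of the system with EC to the system without EC.

    read_ratio, write_ratio: accesses with EC divided by accesses without EC.
    reads_per_write: read/write access ratio of the system without EC
    (writes normalized to 1).
    """
    with_ec = read_ratio * reads_per_write + write_ratio * 1.0
    without_ec = reads_per_write + 1.0
    return with_ec / without_ec


ratio = overall_access_ratio(0.517, 0.5, 3.51)
print(f"overall access ratio: {ratio:.1%}")      # overall access ratio: 51.3%
print(f"average reduction:    {1 - ratio:.1%}")  # average reduction:    48.7%
```

Because reads outnumber writes by about 3.5 to 1, the overall figure is dominated by the read ratio of 0.517 rather than the write ratio of 0.5.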

5.3.2 Simulation Result on Power Reduction

We use the system-power calculator [13] as the external memory power model and set its parameters according to [14]. The memory simulation is performed for HD1080/720 at 150 MHz. The simulation results are shown in Fig. 35 and include the core power of the H.264 decoder, the SDRAM background power, and the SDRAM access power (read/write).

Fig. 35 Power analysis on HD1080/720@150MHz (bars: Power_Original_System, Power_EC_System; legend: Power_EC, Power_SDRAM)

Chapter 6 Conclusion and Future Works

6.1 Conclusion

In this thesis, we have proposed a new embedded compression algorithm for mobile video applications. With the proposed EC algorithm, we can reduce the size of the external memory and the bandwidth utilization to save power. The pipelined architecture of the proposed decompressor requires only 1 cycle, which improves random accessibility. Because the CR is fixed, the proposed EC algorithm is easy to integrate with the H.264 decoder.

From the experimental results, the PSNR loss of the proposed EC algorithm ranges from 1.89 to 3.45 dB. The proposed architecture is synthesized with a 90-nm CMOS standard-cell library, and the gate counts of the compressor and decompressor are 4.0k and 0.9k, respectively. The working frequency is up to 150 MHz for HD1080/720. The power consumption is 158 μW for the compressor and 86 μW for the decompressor.

6.2 Future Work

For lossy embedded compression, reducing the visual quality distortion is the major objective. Our experimental results show that error propagation still needs to be improved. In the simulation results for all-I-frame sequences, differences between the original sequence and the compressed sequence are hardly noticeable. However, for 1I/29P frames, the drift effect is easily observed. Therefore, the proposed lossy embedded compression algorithm can be refined, for example with adaptive predefined bitplanes or an additional lossless embedded compression algorithm, to obtain better coding efficiency.

Bibliography

[1] “Draft ITU-T recommendation and final draft international standard of Joint Video Specification (ITU-T Rec. H264-ISO/IEC 14496-10:2005 AVC),” JVT G050, 2005.

[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.

[3] T. M. Liu et al., “A 125 μW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 1576-1585, 2006.

[4] J. Kim and C. M. Kyung, “A Lossless Embedded Compression Using Significant Bit Truncation for HD Video Coding,” IEEE Trans. Circuits Syst. Video Technol., early access, 2010.

[5] E. J. Delp and O. R. Mitchell, “Image compression using block truncation coding,” IEEE Trans. Comm., vol. 27, no. 7, pp. 1335-1342, Sep. 1979.

[6] C. K. Yang and W. H. Tsai, “Improving block truncation coding by line and edge information and adaptive bit plane selection for gray-scale image compression,” Pattern Recognition Letter, vol. 16, pp. 67-75, 1995.

[7] T. M. Amarunnishad, V. K. Govindan, and T. M. Abraham, “Block Truncation Coding Using a Set of Predefined Bit Planes,” in Proc. IEEE Int. Conf. Computational Intelligence and Multimedia Applications (ICCIMA), vol. 3, pp. 73-78, Dec. 2007.

[8] T. Y. Lee, “A New Frame-Recompression Algorithm and its Hardware Design for MPEG-2 Video Decoders,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 529-534, Jun. 2003.

[9] Y. D. Wu, Y. Li, and C. Y. Lee, “A Novel Embedded Bandwidth-Aware Frame Compressor for Mobile Video Applications,” in Proc. IEEE Intelligent Signal Processing and Communication Syst. (ISPACS), pp. 1-4, Feb. 2009.

[10] C. C. Yang, “An Embedded Codec based on Reduced Pattern Comparison Coding for Mobile Devices,” Master’s thesis, National Chiao Tung University, Jan. 2010.

[11] Y. D. Wu, “Design of An Embedded Compressor/Decompressor for Mobile Video Application,” Master’s thesis, National Chiao Tung University, Sep. 2009.

[12] A. Bourge and J. Jung, “Low-Power H.264 Video Decoder with Graceful Degradation,” in SPIE Proc. Visual Comm. and Image Processing, vol. 5308, pp. 372-383, Jan. 2004.

[13] Micron® Technology Inc., The Micron® System-Power Calculator: SDRAM. [Website]: http://www.micron.com/support/part_info/powercalc

[14] Micron® Technology Inc., MT48H4M32LFB5-6 128Mb Mobile LPSDR. [Website]: http://www.micron.com/products/partdetail?part=MT48H4M32LFB5-6

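Author Biography

Name: 林建辰

Registered domicile: Changhua City, Taiwan Province
Date of birth: August 9, 1980

Education:

Sep. 1995 to Jun. 2000: Department of Electrical Engineering, National Yunlin Institute of Technology (國立雲林工專)
Sep. 2000 to Jun. 2002: Department of Electrical Engineering, Ta Hwa University of Science and Technology (大華科技大學)
Feb. 2005 to Jun. 2009: Industrial R&D Master's Program in IC Design, National Chiao Tung University (國立交通大學)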

