S IMULATION R ESULTS

CHAPTER 3 PROPOSED ALGORITHM

3.2 S IMULATION R ESULTS

In the beginning, we first define the formula of MSE (Mean Square Error) and PSNR (Peak Signal to Noise Ratio). The MSE and PSNR are given in (8) and (9),

2 frame, and P is the compressed frame.

W H I

In this section, we focus on the coding efficiency for all CIF sequences. In Fig. 15, PSNR loss is from 1.68 dB to 3.45dB and PSNR loss average is 2.37dB. Then, we show the result of embedded result for different group of picture (GOP) in Fig. 16. Along with the number of P frame, we can see that PSNR loss is growing. Because each P frame is generated by the previous frame which is compressed by our proposed algorithm, the error is bigger and bigger along with the number of P frame. This phenomenon is also called error propagation or drift effect. Fig. 17 shows the results of drift effect with different QP and GOP.

2.41 2.64 2.31 2.20

Fig. 15 PSNR and PSNR Loss for all CIF sequences

Fig. 16 Drift Effect for Mobile_QP28

12.04

Fig. 17 Drift Effect with different QP and GOP

Chapter 4 Proposed Architecture

In these sections, we will introduce our proposed architecture. In section 4.1, we will describe our proposed embedded compressor. In section 4.2, we will describe our proposed embedded decompressor. In section 4.3, we will show the summary of the proposed architecture and the flow of the design verification.

4.1 Architecture of Compressor

Fig. 18 shows the pipeline architecture of compressor design. We use two pipeline stages and each stage requires one cycle. The first stage is the pixel truncation. The second stage is composed of selective start plane, rounding, selective pattern comparison, and packer. This compressor encodes a 4x2 block in 2 cycles.

Fig. 18 The Pipelined Architecture of Compressor Design

4.1.1 Architecture of Pixel Truncation

Fig. 19 shows the architecture of pixel truncation. There are seven combinational logics, one multiplexer, one de-multiplexer, and one register. The seven combinational

logics as follows: average, difference, type selector, quantizer 1, quantizer 2, quantizer 3, and quantizer 4. The type selector controls the multiplexer and the de-multiplexer. The register stores the truncated 4x2 block.

Fig. 19 Architecture of Pixel Truncation

4.1.2 Architecture of Selective Start Plane

LUT

Fig. 20 Architecture of Selective Start Plane

Fig. 20 shows the architecture of selective bitplane. The block of bitplane transform is a wrapper. There are five combinational logics, one de-multiplexer, and one look-up table. The five combinational logics as follows: Mode 1, Mode 2, Mode 3, Mode 4, and Mode selector. The look-up table records the information of B7 and B6 for each mode.

4.1.3 Architecture of Compensation

Fig. 21 shows the architecture of compensation. There are two combinational logics as follows: Comparison Rounding and No Comparison Rounding.

Comparison

Fig. 21 Architecture of Compensation

4.1.4 Architecture of Predefined Bitplanes Comparison

Fig. 22 shows the architecture of pattern comparison. There are five combinational logics, one de-multiplexer, one look-up table and one register. The five combinational logics as follows: Comparison Group A, Comparison Group B, Comparison Group C, and No Comparison. The pattern selector controls the de-multiplexer. The register is stored the coded data.

Comparison

Fig. 22 Architecture of Predefined Bitplanes Comparison

4.1.5 Data Packing

Fig. 23 shows the architecture of data packing. The representation format consists of 2-bit Mode, 2-bit Start Plane, 2-bit Decision L, 2-bit Decision R, 12-bit Coded Data L, and 12-bit Coded Data R.

Fig. 23 Architecture of Data Packing

4.2 Architecture of Decompressor

Fig. 24 shows the pipeline architecture of decompressor. The decompressor only needs one stage with one cycle. This decompressor reaches a higher throughput; therefore we can provide a higher random accessibility.

Fig. 24 The Pipeline Architecture of Decompressor

4.2.1 Data Rearrange

Fig. 25 shows the architecture of data rearrange. According to the representation format, the data rearrange can be considered as an inverse processing.

Fig. 25 Architecture of Data Rearrange

4.3 Design Implementation and Verification

In section 4.3.1 and 4.3.2, we will introduce the results of design implementation

and the flow of the design verification, respectively.

4.3.1 Design Implementation

TABLE 2 shows the summary of the hardware design. The proposed hardware architecture is synthesized with 90-nm CMOS standard-cell library and the gate count of the proposed algorithm for the compressor and the decompressor are 4.0k and 0.9k, respectively. The working frequency is up to 150MHz@HD1080/720. The proposed embedded compressor is divided into 2 pipelined stages and each stage requires 1 cycle.

The proposed embedded decompressor is divided into 1 pipelined stage and each stage requires 1 cycle. For the power consumption, the compressor and the decompressor are 158uW and 86uW@150MHz respectively. As above description, the proposed hardware provides less hardware complexity.

TABLE 2 Summary of the hardware implementation Proposed EC

Function Compressor Decompressor

Technology UMC 90nm

Working Frequency HD1080+HD720@150MHz

Latency/4x2 block 2 cycles 1 cycle

Gate count 4K 0.9K

Power Consumption 158uW 86uW

4.3.2 Design Verification

Fig. 26 shows the flow of verification. We utilize software and hardware to verify the proposed algorithm. The patterns are created by software and applied as the input of

hardware designs. Then the software calculates the answer to compare with the result of hardware and the result will be stored in memory. Afterward the coded data is accessed by software and hardware decompressor from memory. We check the coded data to confirm the result whether is matched in software and hardware.

Fig. 26 The verification flow

Chapter 5 System Integration

In section 5.1, we will introduce Si2 H.264 Decoder System. Then, both access analysis and processing analysis will be discussed in sections 0 and 5.3, respectively.

5.1 System Analysis

Fig. 27 The block diagram of the overall H.264 decoder system

The overall H.264 decoder [3] with the embedded compression codec is shown in Fig. 27. Our H.264 decoder specification is HD1080/HD720@30fps and works at

150MHz. The embedded compressor works between the deblocking filter and the external memory. The embedded decompressor works between the external memory and the motion compensation. To design address controller of EC is very simple since our compression ratio is fixed at two. Our system bus is 32 bits and the external memory is 32 bits per entry.

5.1.1 Interface Problem

Fig. 28 The system interface design for embedded codec

Fig. 28 shows the system; interface design for embedded codec. Between the chip and the off-chip memory, the embedded compression can be considered as an interface.

In original H.264 decoder system, here are two interface issues. First interface issue occurs between the deblocking filter and the off-chip memory. The throughput of the deblocking filter is 4 pixels per clock. Therefore, avoiding the pipelined jam at the input of embedded compressor, the processing clocks must be less or equal to 4 cycles. The other issue occurs between the motion compensation (MC) and the off-chip memory. The input of MC requires 4 pixels per cycle, thus the throughput of the embedded decompressor is at least 4 pixels per cycle. Furthermore, since the compression ratio is

fixed at two, the address converter can be easily implemented.

5.1.2 Processing Cycles Problem

In this part, we talk about processing cycle problem of out H.264 decoder system.

Our H.264 decoder specification is HD1080/HD720@30fps and works at 150MHz. From our simulation, MC requires average 25 cycles to deal with a 4 4× block. Therefore, embedded compressor requires a fewer-cycle design to reduce the loading cycles.

5.1.3 Overhead Problem

Fig. 29 An example of overhead problem

A block is basic coding unit in H.264 standard. Moreover, due to block-based approaches fit in with block-oriented structure of the received bit-stream, they are most popular techniques. However, here is an overhead problem

4×4

[11] that can be defined as:

the ratio between the number of pixels that are actually accessed during the motion compensation of a block and the number of pixels that are really useful in the reference block. In the original system without block-based approaches, the ratio is equal to 1 for

the required pixels accessed. On the contrary, in the original system with block-based approaches, the ratio is always bigger than 1. As shown in Fig. 29, if the required 4×4 block data, we need to fetch four 4×4 block-based data. The overhead in this case is 48.

TABLE 3 Overhead with block grid for six sequences

Sequence 4×4 block grid 8×8 block grid 16 16× block grid

Foreman 1.31 1.77 3.69

Flower 1.30 1.74 3.77

News 1.14 1.51 2.78

Silent 1.17 1.50 3.22

Stefan 1.51 2.44 6.95

Weather 1.17 1.49 3.18

All 1.27 1.73 3.93

As given in TABLE 3, [12] has been provided the summary of the statistical analysis simulated with six sequences. From this table, we can know that the faster motion sequence such as Stefan causes higher overhead. Consequently, it is important that the smaller block-grid can obtain smaller overhead.

Fig. 30 The flow of EC accesses

5.2 Access Analysis

Fig. 30 shows deblocking filter through the embedded compressor write the data into the external memory and MC through the embedded decompressor read the data from external memory. Moreover, exploiting SystemC, CoWare can build up a simulated platform to analyze the related system problem. As shown in Fig. 31, the user-defined field includes H.264 decoder and EC which is coded in Verilog.

AMBA Slave Interface

Embedded Decompressor

Motion Compensation

Embedded Compressor AMBA AHB

External Memory

Deblocking Filter

H.264 Decoder User-defined

CoWare Fig. 31 The Block Diagram of CoWare System

Fig. 32 shows the block diagram of our H.264 decoder with EC in the work space of CoWare platform. The external memory is accepted 128Mb Mobile LPSDR [14] and the bus protocol used AMBA 2.0 with 32-bit bandwidth. After CoWare simulating, we can get the information of the data access as shown in Fig. 33 and Fig. 34.

Fig. 32 Block diagram in CoWare System

Write 4x4 Blocks Read 4x4 Blocks

Fig. 33 Embedded compressor waveform over CoWare system

Fig. 34 Data access trace

5.2.1 Write Reduction

The compression ratio of the proposed EC is fixed at 2. After the proposed EC ( block unit and ) is embedded into our H.264 decoder system, comparing the original system ( block-based access), the reduction ratio of the writing times is 50%.

4×2 CR=2

4×1

5.2.2 Read Reduction

In Motion Compensation, reading required data is based on Motion Vector (MV).

Moreover, in MV (x, y), the x value and the y value can be classified as follows:

1) Align: The value is quadruple and the required 4 pixels fits with the block grid.

2) Not Align: The value is not quadruple and an integer. The required 4 pixels traverse two block grids.

4×4

3) Sub Pixel: The value accurate to 1 / or 1 / . The required 9 pixels can be interpolated into 4 pixels.

2 4

TABLE 4 All Cases of read access required by MC with/without EC

Case of MV (x, y)

Access Cycles for System without

Access Cycles for System with EC

Reduction of

In Table II, we analyze the read times of the motion compensation with/without EC.

The worst case is the (Sub, Sub) case. To finish the motion compensation, a 4x4 block needs a 9x9 block. Therefore, the system with/without proposed embedded compressor takes 15/27 cycles. The best case is the (Align, Align) case. Original system with/without embedded compressor needs 2/4 cycles to finish the best case. For the other cases when the required data of motion compensation are not fit for 4x2 block-grids, the access times become increased. From our simulation with four sequences (Akiyo, Stefan, Mobile Calendar, Foreman), each 300 frames, we can derive the probabilities of each case.

According to the probability of each case, the reduction ratio of the reading times is about 50%.

5.3 Processing Cycle Analysis

In section 5.1.2, the processing cycle problem had been mentioned. In this section, we will talk about the results of our system integration.

Our system specification is HD1080/720@30fps. This specification means each block accepts cycle count in 25 cycles. Because we do not want to change our specification, we wish that MC with the proposed embedded decompressor finishes in 25 cycles. Moreover, based on not to change our specification, we will not to change the data input structure. Here, we must compute the processing cycles as given by

4 4×

MC with Decompressor Decompressor MC without Decompressor

Proccesing Time = Delay +Proccesing Time (10)

TABLE 5 shows the processing cycle analysis for all cases. Excluding the (Sub, Sub) case, each case is less than 25 cycles. The average of processing cycles for MC without EC is 17.4 cycles. Therefore, the proposed embedded compression can be embedded into our H.264 decoder system.

TABLE 5 All Cases of Processing Cycle Analysis for EC

Case of MV (x, y)

5.3.1 Access Reduction Ratio

The access ratio of the system with/without EC is given in (11)

System with EC System with EC System without EC System without EC

Read +Write

Access Ratio =

Read +Write (11)

From the simulation, the ratio of read times with/without EC is 0.517, the ratio of write times with/without EC is 0.5, and the average access ratio of read/write in the system without EC is about 3.51. The overall access ratio is given in (12)

0.517 3.51+0.5 1 Overall Access Ratio =

3.51 1 = 51.3%

× ×

+ (12)

The average reduction ratio on memory accessed is given in (13)

(13) Average Reduction Ratio = 1 - Overall Access Ratio

= 1 - 0.513 = 48.7%

Therefore, the average reduction ratio is 48.7%.

5.3.2 Simulation Result on Power Reduction

We exploit the system-power calculator [13] as a external memory power model and set the parameter as [14]. The simulation of memory is employed on HD1080/720@150MHz. The simulation results are shown in Fig. 35. Including the core power of H.264 decoder, SDRAM background power and SDRAM access power (read/write).

0 50 100 150 200 250

Power_Original_System Power_EC_System

Power_EC Power_SDRAM

Fig. 35 Power analysis on HD1080/720@150MHz

Chapter 6 Conclusion and Future Works

6.1 Conclusion

In this thesis, we have proposed a new embedded compression algorithm for mobile video applications. With these advantages of the proposed EC algorithm, we can lessen the size of external memory and bandwidth utilization to achieve power saving. The pipelined architecture of the proposed decompressor requires 1 cycle, thus the random accessibility becomes better. Due to the fixed CR, the proposed EC algorithm is easier to be integrated with H.264 decoder.

From the experimental results, the PSNR loss of the proposed EC algorithm is from 1.89 to 3.45dB. The proposed architecture is synthesized with 90-nm CMOS standard-cell library and the gate counts of the proposed algorithm for compressor/decompressor are 4.0k/0.9k respectively. The working frequency is up to 150MHz@HD1080/720. For power consumption, the compressor is 158uW and the decompressor is 86uW.

6.2 Future Work

For the lossy embedded compression, reducing the visual quality distortion, it is the major objective. From our experimental results, error propagation is worth to be improved. For the simulation results of all I frames, between the original sequence and the compressed sequence, the differences are hardly found. However, for 1I/29P frames,

the drift effect can be found easily in the simulation results. Therefore, we can refine the proposed lossy embedded compression algorithm, such as adaptive predefined bitplanes, additive lossless embedded compression algorithm, and et al, to get better coding-efficiency.

Bibliography

[1] “Draft ITU-T recommendation and final draft international standard of Joint Video Specification (ITU-T Rec. H264-ISO/IEC 14496-10:2005 AVC),” JVT G050, 2005.

[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol.

13, no. 7, pp. 560-576, Jul. 2003.

[3] T. M. Liu, and et al., “A 125/spl mu/w, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications,” in Proc. IEEE Int. Solid-State Circuits Conf.

(ISSCC), pp. 1576-1585, 2006.

[4] J. Kim, and C. M. Kyung, “A Lossless Embedded Compression Using Significant Bit Truncation for HD Video Coding,” IEEE Trans. Circuits Syst. Video Technol., early access, 2010.

[5] E.J. Delp, O.R. Mitchell, “Image compression using block truncation coding,” IEEE Trans. Comm., vol. 27, issue 7, pp. 1335-1342, Sep. 1979.

[6] C. K. Yang and W. H. Tsai, “Improving block truncation coding by line and edge information and adaptive bit plane selection for gray-scale image compression,”

Pattern Recognition Letter, vol. 16, pp. 67-75, 1995.

[7] T. M. Amarunnishad, V. K. Govindan, and T. M. Abraham, “Block Truncation Coding Using a Set of Predefined Bit Planes,” in Proc. IEEE Int. Conf.

Computational Intelligence and Multimedia Applications (ICCIMA), vol. 3, pp.

73-78, Dec. 2007.

[8] T. Y. Lee, “A New Frame-Recompression Algorithm and its Hardware Design for MPEG-2 Video Decoders,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no.6, pp. 529-534, Jun. 2003.

[9] Y. D. Wu, Y. Li, and C. Y. Lee, “A Novel Embedded Bandwidth-Aware Frame Compressor for Mobile Video Applications,” in Proc. IEEE Intelligent Signal Processing and Communication Syst. (ISPACS), pp. 1-4, Feb. 2009.

[10] C. C. Yang, “An Embedded Codec based on Reduced Pattern Comparison Coding for Mobile Devices,” Master’s thesis, National Chiao Tung University, Jan. 2010.

[11] Y. D. Wu, “Design of An Embedded Compressor/Decompressor for Mobile Video Application,” Master’s thesis, National Chiao Tung University, Sep. 2009.

[12] A. Bourge and J. Jung, “Low-Power H.264 Video Decoder with Graceful Degradation,” in SPIE Proc. Visual Comm. And Image Processing, vol. 5308, pp.

372-383, Jan. 2004.

[13] Micron® Technology Inc. The Micron® System-Power Calculator: SDRAM.

[Website]:http://www.micron.com/support/part_info/powercalc

[14] Micron® Technology Inc. MT48H4M32LFB5-6 128Mb Mobile LPSDR.

[Website]:http://www.micron.com/products/partdetail?part=MT48H4M32LFB5-6

作者簡歷

姓名：林建辰

戶籍地：台灣省彰化市出生日期：1980 年 8 月 9 日

學歷：

1995 年 9 月~ 2000 年 6 月國立雲林工專電機科 2000 年 9 月~ 2002 年 6 月大華科技大學電機工程系

2005 年 2 月~ 2009 年 6 月國立交通大學 IC 設計產業研發碩士專班

在文檔中應用於行動式視訊裝置之預設位元平面比對之嵌入式編解碼器 (頁 30-0)

CHAPTER 3 PROPOSED ALGORITHM

3.2 S IMULATION R ESULTS

Chapter 4

Proposed Architecture

4.1 Architecture of Compressor

4.2 Architecture of Decompressor

4.3 Design Implementation and Verification

Chapter 5

System Integration