
CHAPTER 2 PREVIOUS WORKS

2.3 Summary

Lossless compression guarantees no quality loss, but the variable length of the compressed data makes the frame memory size irreducible. Existing lossless algorithms are therefore unsuitable for frame compression: their primary purpose is high coding efficiency rather than low latency, low computational complexity, and high random accessibility. On the contrary, a lossy compression algorithm with a fixed CR can guarantee a reduction of the frame memory size. Consequently, it is important to design a lossy algorithm with the following features: 1) low visual quality distortion, 2) low complexity, 3) low bandwidth requirement, and 4) low power consumption.


Chapter 3

Proposed Algorithm

The proposed algorithm compresses a 4x2 block (64 bits) taken from the output of the deblocking filter. The CR is fixed at 2, so after compression a 4x2 block becomes a 32-bit segment. With a fixed CR, the amount of coded data is constant, and the compression can therefore guarantee access times. Besides, in the H.264 standard a 4x4 block, the basic coding unit, can be partitioned into two 4x2 blocks.

For each 4x2 block, the probability that the difference is less than 16 is about 64%, less than 32 about 76%, and less than 64 about 89%. In [10], RPCC (Reduced Pattern Comparison Coding) uses pattern comparison to compress a 4x2 block, and the decoder requires only one cycle to reconstruct a 4x2 block. These two properties are exploited to create the proposed algorithm.

3.1 Algorithm of Embedded Compression

Fig. 9 Compression flow of the proposed algorithm

Fig. 9 shows the flowchart of the proposed compression algorithm. We divide the algorithm into four parts: 1) Pixel Truncation, 2) Selective Start Plane, 3) Compensation, and 4) Predefined Bitplanes Comparison. These parts will be described in the following paragraphs. The compressed 32-bit segment format is shown in Fig. 10. The representation format consists of 2-bit Mode, 2-bit Start Plane (SP), 2-bit Decision L, 2-bit Decision R, 12-bit Coded Data L, and 12-bit Coded Data R.
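The 32-bit layout can be sketched as a simple bit-packing routine. This is an illustrative sketch only: the field order (Mode in the top bits down to Coded Data R in the bottom bits) is my assumption, since Fig. 10 is not reproduced in this excerpt.

```python
def pack_segment(mode, sp, dec_l, dec_r, data_l, data_r):
    """Pack the six fields of the compressed segment into one 32-bit word.

    Assumed MSB-first field order: Mode(2) | SP(2) | Decision L(2) |
    Decision R(2) | Coded Data L(12) | Coded Data R(12) = 32 bits.
    """
    assert mode < 4 and sp < 4 and dec_l < 4 and dec_r < 4
    assert data_l < (1 << 12) and data_r < (1 << 12)
    return ((mode << 30) | (sp << 28) | (dec_l << 26) | (dec_r << 24)
            | (data_l << 12) | data_r)
```

Because every field has a fixed width, unpacking is the mirror image of this routine, which is what makes the decompressor's data rearrange step trivial.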

Fig. 10 Compressed 32-bit segment format

3.1.1 Pixel Truncation

Fig. 11 shows the flowchart of the pixel truncation. First, we calculate the average value (Avg.) of the 4x2 block and the difference value (Diff.) between the maximum and minimum pixels of the 4x2 block. Second, according to the average and the difference, we classify the 4x2 blocks into five types as follows:

1) Avg. from 0 to 63 and Diff. less than 32.

2) Avg. from 64 to 127 and Diff. less than 64.

3) Avg. from 128 to 191 and Diff. less than 64.

4) Avg. from 192 to 255 and Diff. less than 32.

5) No change.

In type 1, any pixel larger than or equal to 64 is forced to 63. In type 2, any pixel less than 64 is forced to 64, and any pixel larger than or equal to 128 is forced to 127. Types 3 and 4 are processed like types 2 and 1, respectively. In type 5, the original pixel values remain unchanged.
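The classification and clamping rules above can be sketched as follows. This is my reading of the five types, not the thesis RTL; the exact boundary handling (e.g. whether Avg. 63 with Diff. 31 falls in type 1) is assumed from the ranges listed.

```python
def truncate_block(pixels):
    """Classify a 4x2 block (8 pixels, 0-255) and clamp outlier pixels."""
    avg = sum(pixels) // len(pixels)
    diff = max(pixels) - min(pixels)
    if avg <= 63 and diff < 32:           # type 1: clamp high outliers to 63
        return [min(p, 63) for p in pixels]
    if 64 <= avg <= 127 and diff < 64:    # type 2: clamp into [64, 127]
        return [min(max(p, 64), 127) for p in pixels]
    if 128 <= avg <= 191 and diff < 64:   # type 3: like type 2, into [128, 191]
        return [min(max(p, 128), 191) for p in pixels]
    if 192 <= avg and diff < 32:          # type 4: like type 1, mirrored
        return [max(p, 192) for p in pixels]
    return list(pixels)                   # type 5: no change
```

The effect is to confine each block to a 64-value band when its pixels are already clustered, which increases the number of all-0/all-1 top bitplanes for the next stage.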


Fig. 11 Flowchart of the pixel truncation

3.1.2 Selective Start Plane

Fig. 12 shows the flowchart of the selective start plane. Bitplane coding is a well-known method; we exploit the bitplane, instead of the pixel, as the basic unit for grouping numbers.

First, consider a 4x2 block in which each pixel value is represented by 8 bits. A bitplane is formed by selecting a single bit from the same position in the binary representation of each pixel. We define B7 as the MSB plane and B0 as the LSB plane.

Second, the start plane (SP) is searched for four successive bitplanes from the MSB bitplane with four modes as follows:

1) From B7 to B5 are all-0.

2) B6 is all-1; B7 and B5 are all-0.

3) B7 is all-1; B6 and B5 are all-0.

4) B7 and B6 are all-1; B5 is all-0.

In the first mode, if both B7 and B6 are all-0 and B5 is not all-0, then SP is equal to 1. The other modes are handled analogously. Finally, the maximum start plane among the four modes is selected, and the mode and start plane are recorded.
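The search above can be sketched as follows. This is only my reading of the excerpt: the exact SP encoding (and how far below B5 the search continues) is not fully specified here, so the sketch assumes SP counts the predictable top planes below B7/B6 with a 2-bit cap.

```python
def bitplane(pixels, b):
    """Extract bitplane b (b = 7 is the MSB plane) as a tuple of bits."""
    return tuple((p >> b) & 1 for p in pixels)

def start_plane(pixels):
    """Return (mode, SP) for a 4x2 block (8 pixels), per the four modes."""
    # Expected (B7, B6) values for modes 1-4: (0,0), (0,1), (1,0), (1,1).
    best_mode, best_sp = 0, 0
    for mode, (v7, v6) in enumerate(((0, 0), (0, 1), (1, 0), (1, 1)), start=1):
        if bitplane(pixels, 7) != (v7,) * 8 or bitplane(pixels, 6) != (v6,) * 8:
            continue                      # top planes do not match this mode
        sp = 1                            # B7 and B6 match: SP is at least 1
        for b in (5, 4):                  # further all-0 planes push SP down
            if bitplane(pixels, b) != (0,) * 8:
                break
            sp += 1                       # capped at 3 by the loop (2-bit SP)
        if sp > best_sp:
            best_mode, best_sp = mode, sp
    return best_mode, best_sp
```

A block of small, similar pixels (e.g. all near 10) matches mode 1 with a deep start plane, so more low bitplanes survive the fixed 32-bit budget.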


Fig. 12 Flowchart of the selective start plane

3.1.3 Compensation

Since the lower bitplanes are truncated due to the limited budget, a simple rounding is applied. The rounding is applied when the most significant bit of the truncated bits is nonzero and the coded bits are not all 1's. Fig. 13(a) shows this simple idea, which yields a satisfying quality improvement. Two rounding modes are proposed because the pattern comparison has two compressed data formats. As shown in Fig. 13(b), the first is comparison rounding and the other is no-comparison rounding. For pattern comparison, the first rounding method is applied to the first three types and the second rounding method only to the final type.
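The no-comparison rounding rule stated above can be sketched as follows; this is an illustration of the stated condition (round up when the top truncated bit is 1 and the kept bits are not all 1's), not the thesis hardware.

```python
def round_truncated(value, kept_bits, total_bits=8):
    """Round an 8-bit value after truncating its low bitplanes."""
    drop = total_bits - kept_bits
    kept = value >> drop
    all_ones = (1 << kept_bits) - 1
    msb_of_dropped = (value >> (drop - 1)) & 1 if drop > 0 else 0
    # Round up only when it will not overflow the kept bits.
    if msb_of_dropped and kept != all_ones:
        kept += 1
    return kept << drop   # reconstructed pixel value
```

The all-1's guard is what prevents overflow: 0b11111000 truncated to 4 planes stays at 240 rather than wrapping around.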



Fig. 13 Flowchart of the rounding

3.1.4 Predefined Bitplanes Comparison

The final step encodes the preserved bitplanes. First, the truncated 4x2 block is partitioned into two 2x2 blocks, called the left 2x2 block and the right 2x2 block, as shown in Fig. 14(a). In Fig. 14(b), the left and right 2x2 blocks use the same SP and are compressed individually. Second, a 2x2 block is classified into one of four types: 1) Group A, 2) Group B, 3) Group C, and 4) No Comparison. The first three types compare a group of eight patterns with four successive bitplanes from the SP and select the type that can hit three successive bitplanes. The three groups of eight patterns are shown in TABLE 1. If none of the first three types can hit at least three bitplanes, type 4 is chosen and three successive bitplanes from the SP are stored.

TABLE 1 Three Groups of Eight Predefined Bitplane Patterns

Pattern No.  1     2     3     4     5     6     7     8
Group A      0000  1111  1110  0111  0011  1100  0001  1000
Group B      0000  1111  1110  0111  1010  1001  0110  0101
Group C      0000  1111  1110  0111  1101  1011  0010  0100
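The group selection can be sketched with the patterns from TABLE 1. One assumption is made: "three successive bitplanes" is read as three consecutive hits starting from the SP plane; the thesis may allow the run to start lower.

```python
# Eight predefined 4-bit patterns per group, from TABLE 1.
GROUPS = {
    "A": ("0000", "1111", "1110", "0111", "0011", "1100", "0001", "1000"),
    "B": ("0000", "1111", "1110", "0111", "1010", "1001", "0110", "0101"),
    "C": ("0000", "1111", "1110", "0111", "1101", "1011", "0010", "0100"),
}

def choose_group(bitplanes):
    """Pick the coding type for a 2x2 block.

    bitplanes: the four successive 4-bit plane strings from SP, MSB first.
    Returns the first group covering at least three successive planes,
    else "No Comparison" (the three planes from SP are stored verbatim).
    """
    for name, patterns in GROUPS.items():
        hits = 0
        for plane in bitplanes:
            if plane not in patterns:
                break                 # run of successive hits ends here
            hits += 1
        if hits >= 3:
            return name
    return "No Comparison"
```

Each hit costs 3 bits (a pattern index 1-8), so three or four matched planes fit comfortably inside the 12-bit coded-data field, which is why a three-plane hit is the acceptance threshold.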

Fig. 14 An example of partitioning a 4x2 block

3.2 Simulation Results

In the beginning, we define the MSE (Mean Square Error) and PSNR (Peak Signal to Noise Ratio). The MSE and PSNR are given in (8) and (9):

MSE = (1 / (W * H)) * sum_{i=1..W} sum_{j=1..H} (I(i, j) - P(i, j))^2   (8)

PSNR = 10 * log10(255^2 / MSE)   (9)

where W and H are the width and height of the frame, I is the original frame, and P is the compressed frame.
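Equations (8) and (9) translate directly into code; this sketch treats the frames as flat pixel lists rather than W-by-H arrays, which does not change the result.

```python
import math

def mse(orig, comp):
    """Mean square error between two equal-size frames, per (8)."""
    return sum((a - b) ** 2 for a, b in zip(orig, comp)) / len(orig)

def psnr(orig, comp):
    """Peak signal-to-noise ratio in dB for 8-bit pixels, per (9)."""
    e = mse(orig, comp)
    return float("inf") if e == 0 else 10 * math.log10(255 ** 2 / e)
```

PSNR loss, as used in Fig. 15, is simply the PSNR of the decoded sequence without EC minus the PSNR with EC.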

In this section, we focus on the coding efficiency for all CIF sequences. In Fig. 15, the PSNR loss ranges from 1.68 dB to 3.45 dB, with an average of 2.37 dB. Fig. 16 then shows the embedded-compression result for different group-of-picture (GOP) settings. The PSNR loss grows with the number of P frames: because each P frame is predicted from a previous frame that has been compressed by our algorithm, the error accumulates from frame to frame. This phenomenon is called error propagation or the drift effect. Fig. 17 shows the drift effect with different QP and GOP settings.



Fig. 15 PSNR and PSNR Loss for all CIF sequences

Fig. 16 Drift Effect for Mobile_QP28


Fig. 17 Drift Effect with different QP and GOP


Chapter 4

Proposed Architecture

In this chapter, we introduce our proposed architecture. Section 4.1 describes the proposed embedded compressor, section 4.2 the proposed embedded decompressor, and section 4.3 the summary of the proposed architecture and the flow of the design verification.

4.1 Architecture of Compressor

Fig. 18 shows the pipelined architecture of the compressor design. We use two pipeline stages, each requiring one cycle. The first stage is the pixel truncation. The second stage is composed of the selective start plane, rounding, selective pattern comparison, and packer. The compressor encodes a 4x2 block in 2 cycles.

Fig. 18 The Pipelined Architecture of Compressor Design

4.1.1 Architecture of Pixel Truncation

Fig. 19 shows the architecture of the pixel truncation. There are seven combinational logic blocks, one multiplexer, one de-multiplexer, and one register. The seven combinational logic blocks are: average, difference, type selector, quantizer 1, quantizer 2, quantizer 3, and quantizer 4. The type selector controls the multiplexer and the de-multiplexer. The register stores the truncated 4x2 block.

Fig. 19 Architecture of Pixel Truncation

4.1.2 Architecture of Selective Start Plane


Fig. 20 Architecture of Selective Start Plane


Fig. 20 shows the architecture of the selective start plane. The bitplane transform block is a wrapper. There are five combinational logic blocks, one de-multiplexer, and one look-up table. The five combinational logic blocks are: Mode 1, Mode 2, Mode 3, Mode 4, and the mode selector. The look-up table records the information of B7 and B6 for each mode.

4.1.3 Architecture of Compensation

Fig. 21 shows the architecture of the compensation. There are two combinational logic blocks: Comparison Rounding and No Comparison Rounding.


Fig. 21 Architecture of Compensation

4.1.4 Architecture of Predefined Bitplanes Comparison

Fig. 22 shows the architecture of the pattern comparison. There are five combinational logic blocks, one de-multiplexer, one look-up table, and one register. The combinational logic blocks are: Comparison Group A, Comparison Group B, Comparison Group C, No Comparison, and the pattern selector. The pattern selector controls the de-multiplexer. The register stores the coded data.


Fig. 22 Architecture of Predefined Bitplanes Comparison

4.1.5 Data Packing

Fig. 23 shows the architecture of data packing. The representation format consists of 2-bit Mode, 2-bit Start Plane, 2-bit Decision L, 2-bit Decision R, 12-bit Coded Data L, and 12-bit Coded Data R.

Fig. 23 Architecture of Data Packing


4.2 Architecture of Decompressor

Fig. 24 shows the pipeline architecture of the decompressor. The decompressor needs only one stage with one cycle. This gives the decompressor a higher throughput and therefore higher random accessibility.

Fig. 24 The Pipeline Architecture of Decompressor

4.2.1 Data Rearrange

Fig. 25 shows the architecture of data rearrange. According to the representation format, the data rearrange can be considered as an inverse processing.

Fig. 25 Architecture of Data Rearrange

4.3 Design Implementation and Verification

In sections 4.3.1 and 4.3.2, we introduce the results of the design implementation and the flow of the design verification, respectively.

4.3.1 Design Implementation

TABLE 2 summarizes the hardware design. The proposed architecture is synthesized with a 90-nm CMOS standard-cell library, and the gate counts of the compressor and the decompressor are 4.0K and 0.9K, respectively. The working frequency is up to 150MHz for HD1080/720. The proposed embedded compressor is divided into 2 pipeline stages, each requiring 1 cycle. The proposed embedded decompressor has 1 pipeline stage requiring 1 cycle. The power consumption of the compressor and the decompressor is 158uW and 86uW at 150MHz, respectively. As described above, the proposed hardware provides low hardware complexity.

TABLE 2 Summary of the hardware implementation

                     Proposed EC
Function             Compressor            Decompressor
Technology           UMC 90nm
Working Frequency    HD1080+HD720@150MHz
Latency/4x2 block    2 cycles              1 cycle
Gate count           4K                    0.9K
Power Consumption    158uW                 86uW

4.3.2 Design Verification

Fig. 26 shows the verification flow. We use both software and hardware to verify the proposed algorithm. Test patterns are created by software and applied as the input of the hardware designs. The software then computes the reference answer, which is compared with the hardware result, and the result is stored in memory. Afterward the coded data is read from memory by both the software and the hardware decompressor, and we check whether the software and hardware results match.

Fig. 26 The verification flow

Chapter 5

System Integration

In section 5.1, we introduce the Si2 H.264 decoder system. Then the access analysis and the processing analysis are discussed in sections 5.2 and 5.3, respectively.

5.1 System Analysis

Fig. 27 The block diagram of the overall H.264 decoder system

The overall H.264 decoder [3] with the embedded compression codec is shown in Fig. 27. Our H.264 decoder specification is HD1080/HD720@30fps at 150MHz. The embedded compressor sits between the deblocking filter and the external memory; the embedded decompressor sits between the external memory and the motion compensation. The address controller of the EC is very simple to design since our compression ratio is fixed at two. Our system bus is 32 bits and the external memory is 32 bits per entry.
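With CR fixed at 2 and 32-bit memory entries, the address conversion reduces to a constant offset per block. The helper names below are hypothetical, but the arithmetic follows directly from the fixed ratio.

```python
def ec_word_address(block_index, frame_base=0):
    """Word address of a compressed 4x2 block in 32-bit-per-entry memory.

    With CR fixed at 2, each 64-bit block becomes exactly one 32-bit
    word, so the compressed address is a plain offset from the base.
    """
    return frame_base + block_index        # compressed: 1 word per block

def raw_word_address(block_index, frame_base=0):
    """Same block in an uncompressed layout: two 32-bit words per block."""
    return frame_base + 2 * block_index
```

A variable-length code would instead need a per-block index table to locate data, which is exactly the random-access cost the fixed CR avoids.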

5.1.1 Interface Problem

Fig. 28 The system interface design for embedded codec

Fig. 28 shows the system interface design for the embedded codec. Between the chip and the off-chip memory, the embedded compression can be considered as an interface.

In the original H.264 decoder system, there are two interface issues. The first occurs between the deblocking filter and the off-chip memory. The throughput of the deblocking filter is 4 pixels per clock; therefore, to avoid a pipeline stall at the input of the embedded compressor, the compressor must process a block within 4 cycles. The second occurs between the motion compensation (MC) and the off-chip memory. The input of MC requires 4 pixels per cycle, so the throughput of the embedded decompressor must be at least 4 pixels per cycle. Furthermore, since the compression ratio is fixed at two, the address converter can be easily implemented.

5.1.2 Processing Cycles Problem

In this part, we discuss the processing cycle problem of our H.264 decoder system. Our H.264 decoder specification is HD1080/HD720@30fps at 150MHz. From our simulation, MC requires on average 25 cycles to process a 4×4 block. Therefore, the embedded compressor requires a low-cycle design to reduce the loading cycles.

5.1.3 Overhead Problem

Fig. 29 An example of overhead problem

A 4×4 block is the basic coding unit in the H.264 standard. Moreover, because block-based approaches fit the block-oriented structure of the received bit-stream, they are the most popular techniques. However, there is an overhead problem [11], defined as the ratio between the number of pixels actually accessed during the motion compensation of a block and the number of pixels that are really useful in the reference block. In the original system without block-based approaches, the ratio is equal to 1, since only the required pixels are accessed. On the contrary, in the original system with block-based approaches, the ratio is always greater than 1. As shown in Fig. 29, if the required 4×4 block is misaligned with the block grid, we need to fetch four 4×4 blocks; the overhead in this case is 48 extra pixels (64 fetched for 16 required).
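The overhead definition from [11] is a one-line ratio; the worked case from Fig. 29 is shown as a comment.

```python
def overhead_ratio(fetched_pixels, useful_pixels):
    """Overhead per [11]: pixels actually accessed over pixels required."""
    return fetched_pixels / useful_pixels

# Fig. 29 case: a misaligned 4x4 reference block straddles four 4x4 grid
# blocks, so 64 pixels are fetched for the 16 useful ones -- 48 extra
# pixels, an overhead ratio of 4.0.
```

TABLE 3 reports this ratio averaged over the MVs of each sequence, which is why the per-sequence values lie between the aligned case (1.0) and the fully misaligned case (4.0) for the 4×4 grid.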

TABLE 3 Overhead with block grid for six sequences

Sequence 4×4 block grid 8×8 block grid 16×16 block grid

Foreman 1.31 1.77 3.69

Flower 1.30 1.74 3.77

News 1.14 1.51 2.78

Silent 1.17 1.50 3.22

Stefan 1.51 2.44 6.95

Weather 1.17 1.49 3.18

All 1.27 1.73 3.93

As given in TABLE 3, [12] provides a summary of the statistical analysis simulated with six sequences. From this table, we see that faster-motion sequences such as Stefan cause higher overhead. Consequently, a smaller block grid obtains a smaller overhead.

Fig. 30 The flow of EC accesses

5.2 Access Analysis

Fig. 30 shows how the deblocking filter writes data through the embedded compressor into the external memory, and how MC reads data from the external memory through the embedded decompressor. Moreover, using SystemC, CoWare can build a simulation platform to analyze the related system problems. As shown in Fig. 31, the user-defined field includes the H.264 decoder and the EC, which are coded in Verilog.

Fig. 31 The Block Diagram of CoWare System

Fig. 32 shows the block diagram of our H.264 decoder with EC in the workspace of the CoWare platform. The external memory is a 128Mb Mobile LPSDR [14], and the bus protocol is AMBA 2.0 with 32-bit bandwidth. After the CoWare simulation, we obtain the data-access information shown in Fig. 33 and Fig. 34.


Fig. 32 Block diagram in CoWare System


Fig. 33 Embedded compressor waveform over CoWare system

Fig. 34 Data access trace

5.2.1 Write Reduction

The compression ratio of the proposed EC is fixed at 2. After the proposed EC (4×2 block unit and CR=2) is embedded into our H.264 decoder system, compared with the original system (4×1 block-based access), the reduction ratio of the writing times is 50%.

5.2.2 Read Reduction

In Motion Compensation, reading required data is based on Motion Vector (MV).

Moreover, in MV (x, y), the x value and the y value can be classified as follows:

1) Align: The value is a multiple of four, and the required 4 pixels fit the 4×4 block grid.

2) Not Align: The value is an integer but not a multiple of four. The required 4 pixels traverse two 4×4 block grids.

3) Sub Pixel: The value is accurate to 1/2 or 1/4 pixel. The required 9 pixels are interpolated into 4 pixels.
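The three cases above can be sketched per MV component. One assumption is made: the decoder is taken to store MV components in quarter-pel units (standard in H.264), so a non-multiple of 4 means a sub-pixel position and the full-pel value is the component divided by 4.

```python
def classify_mv_component(v):
    """Classify one MV component, given in quarter-pel units (assumed)."""
    if v % 4 != 0:
        return "Sub Pixel"    # fractional position: 9 pixels interpolated to 4
    full_pel = v // 4
    # Aligned when the full-pel offset is itself a multiple of four,
    # i.e. the required 4 pixels fit a single 4x4 grid block.
    return "Align" if full_pel % 4 == 0 else "Not Align"
```

Applying this to both components of MV (x, y) yields the nine (x-case, y-case) combinations analyzed in TABLE 4.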

TABLE 4 All Cases of read access required by MC with/without EC
(Columns: Case of MV (x, y); Access Cycles for System without EC; Access Cycles for System with EC; Reduction)

In TABLE 4, we analyze the read times of the motion compensation with and without EC. The worst case is the (Sub, Sub) case: to finish the motion compensation, a 4x4 block needs a 9x9 block, and the system with/without the proposed embedded compressor takes 15/27 cycles. The best case is the (Align, Align) case, which the system with/without the embedded compressor finishes in 2/4 cycles. For the other cases, when the required data of the motion compensation does not fit the 4x2 block grid, the access times increase. From our simulation with four sequences (Akiyo, Stefan, Mobile Calendar, Foreman), 300 frames each, we derive the probability of each case. Weighting by these probabilities, the reduction ratio of the reading times is about 50%.

5.3 Processing Cycle Analysis

In section 5.1.2, the processing cycle problem had been mentioned. In this section, we will talk about the results of our system integration.

Our system specification is HD1080/720@30fps, which means each 4×4 block is allotted 25 cycles. Because we do not want to change this specification, MC with the proposed embedded decompressor must finish within 25 cycles, and we also do not change the data input structure. Here, the processing cycles are given by

Processing Time(MC with decompressor) = Delay(decompressor) + Processing Time(MC without decompressor)   (10)

TABLE 5 shows the processing cycle analysis for all cases. Excluding the (Sub, Sub) case, each case takes fewer than 25 cycles. The average processing cycle count for MC without EC is 17.4 cycles. Therefore, the proposed embedded compression can be embedded into our H.264 decoder system.

TABLE 5 Processing Cycle Analysis for EC, for all cases of MV (x, y)

5.3.1 Access Reduction Ratio

The access ratio of the system with/without EC is given in (11):

Access Ratio = (Read(with EC) + Write(with EC)) / (Read(without EC) + Write(without EC))   (11)

From the simulation, the ratio of read times with/without EC is 0.517, the ratio of write times with/without EC is 0.5, and the ratio of reads to writes in the system without EC is about 3.51. The overall access ratio is given in (12):

Overall Access Ratio = (0.517 * 3.51 + 0.5 * 1) / (3.51 + 1) = 51.3%   (12)

The average reduction ratio on memory accessed is given in (13)

(13) Average Reduction Ratio = 1 - Overall Access Ratio

= 1 - 0.513 = 48.7%

Therefore, the average reduction ratio is 48.7%.
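The arithmetic of (11)-(13) is a read/write-weighted average and can be checked directly:

```python
def overall_access_ratio(read_ratio, write_ratio, reads_per_write):
    """Weighted average of the read and write ratios, per (12).

    reads_per_write is the read:write access ratio of the original
    system (about 3.51 from the simulation).
    """
    return ((read_ratio * reads_per_write + write_ratio * 1)
            / (reads_per_write + 1))

ratio = overall_access_ratio(0.517, 0.5, 3.51)   # about 0.513
reduction = 1 - ratio                             # about 0.487
```

Because reads dominate (3.51 reads per write), the overall ratio sits much closer to the read ratio of 0.517 than to the write ratio of 0.5.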

5.3.2 Simulation Result on Power Reduction

We use the system-power calculator [13] as the external memory power model, with parameters set as in [14]. The memory simulation is performed at HD1080/720@150MHz. The simulation results are shown in Fig. 35, including the core power of the H.264 decoder, the SDRAM background power, and the SDRAM access power (read/write).


Fig. 35 Power analysis on HD1080/720@150MHz


Chapter 6

Conclusion and Future Works

6.1 Conclusion

In this thesis, we have proposed a new embedded compression algorithm for mobile video applications. With the advantages of the proposed EC algorithm, we can reduce the external memory size and the bandwidth utilization to achieve power saving. The pipelined architecture of the proposed decompressor requires only 1 cycle, which improves random accessibility. Due to the fixed CR, the proposed EC algorithm is easy to integrate with an H.264 decoder.

From the experimental results, the PSNR loss of the proposed EC algorithm ranges from 1.89 dB to 3.45 dB. The proposed architecture is synthesized with a 90-nm CMOS standard-cell library, and the gate counts of the compressor and the decompressor are 4.0K and 0.9K, respectively.
