THESIS ORGANIZATION - An Embedded Codec with Predefined Bitplane Comparison for Mobile Video Devices

CHAPTER 1 INTRODUCTION

1.2 THESIS ORGANIZATION

The rest of this thesis is organized as follows. Chapter 2 introduces data compression and previous works. Chapter 3 briefly describes a novel algorithm. Chapter 4 gives the hardware architecture suitable for mobile video applications, with the design implementation and verification shown in Section 4.3. Chapter 5 discusses the integration with an available H.264 decoder [3] and the experimental results.

Finally, the conclusions and future work will be given in Chapter 6.

Chapter 2 Previous Works

In general, embedded compression algorithms can be categorized into two fundamental groups: lossless and lossy. First, we briefly explain the existing lossless embedded compression algorithms. Second, we introduce the existing lossy embedded compression algorithms. Finally, we summarize the merits and drawbacks of the two groups.

2.1 Lossless Embedded Compression Algorithm

Lossless embedded compression algorithms [4] guarantee no quality distortion of video sequences. Moreover, they cause no error propagation in an H.264 decoder.

However, losslessly compressed data has a variable length. Therefore, existing lossless approaches are not suitable for frame compression because their primary purpose is high coding efficiency rather than low latency, low visual quality distortion, low computation complexity, and high random accessibility.

2.2 Lossy Embedded Compression Algorithm

Compared with lossless compression algorithms, lossy compression algorithms achieve a fixed compression ratio (CR). Several lossy embedded compression algorithms have been proposed, such as Block Truncation Coding (BTC) [5], BTC improved by line and edge information and adaptive bitplane selection [6], BTC using a set of predefined bitplanes [7], Modified Hadamard Transform (MHT) and quantization with Golomb-Rice coding [8], and DCT with Modified Bitplane Zonal Coding (MBZC) [9].

2.2.1 Block Truncation Coding (BTC) Compression

The conventional Block Truncation Coding (BTC) [5] segments a frame into non-overlapping n×n blocks (usually 4×4), and a two-level quantizer is independently designed for each block. The threshold of the quantizer and the two reconstructed levels are adapted to the local statistics of each block. Fig. 1 shows the flow of the BTC compression algorithm. The compressed format therefore includes a 16-bit bitmap indicating the reconstructed level associated with each pixel and two 8-bit reconstructed levels, as shown in Fig. 2.

Fig. 1 The compression flow of Block Truncation Coding

Fig. 2 Compressed 32-bit segment format of Block Truncation Coding

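A two-level quantizer is designed to preserve the mean and variance of a block. First, a frame is divided into non-overlapping n×n blocks. Let m = n², and let X₁, X₂, ..., X_m be the pixel values of a block. The sample mean (α) and absolute moment (β) are given in (1) and (2).

$$\alpha = \frac{1}{m}\sum_{i=1}^{m} X_i \tag{1}$$

$$\beta = \frac{1}{m}\sum_{i=1}^{m} \lvert X_i - \alpha \rvert \tag{2}$$

The sample mean and absolute moment are preserved. By taking the mean (α) as the threshold, the two reconstructed levels, a and b, are given in (3) and (4),

$$a = \alpha - \frac{m\beta}{2(m-q)} \tag{3}$$

$$b = \alpha + \frac{m\beta}{2q} \tag{4}$$

where q is the number of pixels greater than or equal to the mean. Because BTC is a minimum mean square error (MMSE) quantizer, the reconstructed level a can be simplified as (5).

$$a = \frac{1}{m-q}\sum_{X_i < \alpha} X_i \tag{5}$$

Similarly, b becomes (6).

$$b = \frac{1}{q}\sum_{X_i \ge \alpha} X_i \tag{6}$$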

As the equations above show, only additions and comparisons are required. Therefore, BTC is very simple to implement in hardware, and the decoder is even simpler. However, the quality loss of BTC is too large for it to be embedded into the H.264 decoder. Nevertheless, we can learn from its architecture for the proposed design: in an H.264 decoder, a simpler decompressor also provides higher random accessibility.
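The quantizer described above can be illustrated with a minimal Python sketch. It assumes a block is given as a flat list of pixel values and that the reconstructed levels are the group means of (5) and (6), rounded to integers; the function names `btc_encode` and `btc_decode` are hypothetical and are not taken from [5].

```python
def btc_encode(block):
    """Encode one block (flat list of pixel values) with a two-level quantizer."""
    m = len(block)
    alpha = sum(block) / m                      # sample mean, used as the threshold
    high = [x for x in block if x >= alpha]     # pixels mapped to level b
    low = [x for x in block if x < alpha]       # pixels mapped to level a
    q = len(high)                               # q >= 1: some pixel always reaches the mean
    b = round(sum(high) / q)                    # mean of the upper group, as in (6)
    # Flat block: every pixel equals the mean, so a single level is enough.
    a = round(sum(low) / (m - q)) if low else b  # mean of the lower group, as in (5)
    bitmap = [1 if x >= alpha else 0 for x in block]
    return a, b, bitmap


def btc_decode(a, b, bitmap):
    """Rebuild the block: each bit selects one of the two reconstructed levels."""
    return [b if bit else a for bit in bitmap]
```

Only sums, comparisons, and two divisions by small counts appear in the encoder, and the decoder is a per-pixel selection, which is consistent with the remark that the decoder is the simpler side.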

2.2.2 Block Truncation Coding using a set of predefined bitplanes

As mentioned above, the BTC encoder generates two reconstructed levels and a bitplane. For a 4×4 block, there are 65536 (= 2¹⁶) possible bitplanes, so the bitplane occupies 16 bits of the limited data budget of the compressed format. Thus, [6] proposed an approach to reduce the number of bits of the bitplane in BTC. Fig. 3 shows the flowchart of the improved Block Truncation Coding. In the data packing, the 16-bit bitplane of the compressed format becomes a 6-bit bitplane, as shown in Fig. 4, because the generated bitplane is matched against the 64 predefined bitplanes shown in Fig. 5.

Fig. 3 The compression flow of improving Block Truncation Coding

Fig. 4 Compressed 16-bit segment format of improving Block Truncation Coding

Fig. 5 64 classes of line and edge bitplanes (reverse versions not shown)

A novel bitplane coding scheme [6] has been proposed based on the conventional BTC. Another approach [7] encodes visually continuous blocks as uniform regions, whereas visually discontinuous blocks are encoded as localized patterns interpreted as edges or lines. Fig. 6 shows the flowchart of Block Truncation Coding using a set of predefined bitplanes. By inverting and rotating, the ten basic predefined bitplanes shown in Fig. 7 can be extended to 32 predefined bitplanes. In the data packing, the 16-bit bitplane becomes a 6-bit bitplane, as shown in Fig. 8.

Both [6] and [7], which are based on BTC, reduce the number of bits of the bitplane. However, the H.264 decoder has the error propagation problem. Thus, they are not suitable for an H.264 decoder because their quality loss leads to unacceptable visual quality.

Fig. 6 The compression flow of Block Truncation Coding using a set of predefined bitplanes

Fig. 7 Ten basic bitplanes that can be extended to thirty-two bitplanes

Fig. 8 Compressed 16-bit segment format of Block Truncation Coding using a set of predefined bitplanes

2.2.3 Bitplane Truncation Coding

First, the integer sequence P is decomposed into its binary magnitude representation to form an 8×N binary matrix, as in (7),

$$B(P) = \begin{bmatrix} B_7 & B_6 & \cdots & B_0 \end{bmatrix}^{T} \tag{7}$$

where N is the number of pixels of a block, B₇ represents the MSB plane, and B₀ represents the LSB plane. Then, as shown in Algorithm 1, the start plane (SP) of four successive bitplanes is searched from the MSB bitplane. For example, if B₇ and B₆ are all-0, then SP is equal to 2.

Algorithm 1 (Bitplane Truncation Coding Algorithm)

Input: B(P), the binary matrix

Output: SP, the start plane

Starting from the MSB plane, each bitplane of B(P) is tested for being a zero vector; SP is set at the first bitplane that is NOT a zero vector, and SP is returned.
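The start plane search can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm as published: pixels are 8-bit integers in a flat list, SP counts the leading all-zero bitplanes from the MSB, and SP is capped (an assumption) so that four successive bitplanes always remain. The names `bitplanes` and `start_plane` are hypothetical.

```python
def bitplanes(pixels):
    """Return [B7, ..., B0]; each plane is a list holding one bit per pixel."""
    return [[(p >> k) & 1 for p in pixels] for k in range(7, -1, -1)]


def start_plane(pixels, kept_planes=4):
    """Count leading all-zero planes from the MSB, capped so kept_planes remain."""
    max_sp = 8 - kept_planes          # assumption: the window must fit inside B7..B0
    sp = 0
    for plane in bitplanes(pixels):
        if any(plane) or sp == max_sp:
            break                     # first non-zero plane found, or cap reached
        sp += 1
    return sp
```

For a block whose pixels are all below 64 but with at least one pixel of 32 or more, B₇ and B₆ are all-0 and B₅ is not, so the sketch returns 2, matching the example in the text.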

2.3 Summary

Lossless compression guarantees no quality loss, but the variable length of the compressed data means that the frame memory size cannot be reduced. Therefore, existing lossless algorithms are not suitable for frame compression because their primary purpose is high coding efficiency rather than low latency, low computation complexity, and high random accessibility. In contrast, a lossy compression algorithm with a fixed CR guarantees the reduction of the frame memory size. Consequently, it is important to design a lossy algorithm with the following features: 1) low visual quality distortion, 2) low complexity, 3) low bandwidth requirement, and 4) low power consumption.

Chapter 3 Proposed Algorithm

The proposed algorithm compresses a 4x2 block (64 bits) taken from the output of the deblocking filter. The CR is fixed at 2, so a 4x2 block becomes a 32-bit segment after compression. With a fixed CR, the amount of coded data is constant; therefore, the access time is guaranteed. Besides, in the H.264 standard, a 4x4 block, which is a basic coding unit, can be partitioned into two 4x2 blocks.

For each 4x2 block, the probability that the difference is less than 16 is about 64%, the probability that it is less than 32 is about 76%, and the probability that it is less than 64 is about 89%. In [10], RPCC (Reduced Pattern Comparison Coding) uses pattern comparison to compress a 4x2 block, and its decoder requires just one cycle to reconstruct a 4x2 block. Therefore, these two properties are exploited to create the proposed algorithm.

3.1 Algorithm of Embedded Compression

Fig. 9 Compression flow of the proposed algorithm

Fig. 9 shows the flowchart of the proposed compression algorithm. We divide the algorithm into four parts: 1) Pixel Truncation, 2) Selective Start Plane, 3) Compensation, and 4) Predefined Bitplanes Comparison. These parts will be described in the following paragraphs. The compressed 32-bit segment format is shown in Fig. 10. The representation format consists of 2-bit Mode, 2-bit Start Plane (SP), 2-bit Decision L, 2-bit Decision R, 12-bit Coded Data L, and 12-bit Coded Data R.

Fig. 10 Compressed 32-bit segment format
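The 32-bit segment described above (2 + 2 + 2 + 2 + 12 + 12 bits) can be illustrated with a short Python sketch. The bit positions are an assumption: the fields are placed from the most significant bits in the order in which the text lists them, which the source does not confirm. The names `pack_segment` and `unpack_segment` are hypothetical.

```python
def pack_segment(mode, sp, decision_l, decision_r, data_l, data_r):
    """Pack the six fields into one 32-bit word (field order assumed: Mode first)."""
    assert 0 <= mode < 4 and 0 <= sp < 4
    assert 0 <= decision_l < 4 and 0 <= decision_r < 4
    assert 0 <= data_l < 4096 and 0 <= data_r < 4096
    return ((mode << 30) | (sp << 28) | (decision_l << 26) | (decision_r << 24)
            | (data_l << 12) | data_r)


def unpack_segment(word):
    """Inverse of pack_segment: split a 32-bit word back into its six fields."""
    return ((word >> 30) & 0x3, (word >> 28) & 0x3, (word >> 26) & 0x3,
            (word >> 24) & 0x3, (word >> 12) & 0xFFF, word & 0xFFF)
```

Because every field has a fixed width and position, unpacking needs only shifts and masks, which is in line with a decompressor that rearranges the data in a single cycle.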

3.1.1 Pixel Truncation

Fig. 11 shows the flowchart of the pixel truncation. First, we calculate the average value (Avg.) of the 4x2 block and the difference value (Diff.) between the maximum pixel and the minimum pixel of the 4x2 block. Second, according to the average and the difference, we classify each 4x2 block into one of five types as follows:

1) Avg. from 0 to 63 and Diff. less than 32.

2) Avg. from 64 to 127 and Diff. less than 64.

3) Avg. from 128 to 191 and Diff. less than 64.

4) Avg. from 192 to 255 and Diff. less than 32.

5) No change.

In type 1, any pixel larger than or equal to 64 is forced to 63. In type 2, any pixel less than 64 is forced to 64, and any pixel larger than or equal to 128 is forced to 127. Types 3 and 4 are processed like types 2 and 1, respectively. In type 5, the original pixel values remain unchanged.

Fig. 11 Flowchart of the pixel truncation
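The classification and clamping steps can be sketched in Python as below. Two points are assumptions rather than statements from the text: the average is taken as an integer (floor) mean, and the clamp ranges of types 3 and 4 are obtained by mirroring types 2 and 1 (128 to 191, and 192 to 255). The names `classify`, `RANGES`, and `truncate` are hypothetical.

```python
def classify(block):
    """Return the type (1-5) of a 4x2 block given as a flat list of 8 pixels."""
    avg = sum(block) // len(block)        # assumption: integer (floor) average
    diff = max(block) - min(block)
    if avg <= 63 and diff < 32:
        return 1
    if 64 <= avg <= 127 and diff < 64:
        return 2
    if 128 <= avg <= 191 and diff < 64:
        return 3
    if avg >= 192 and diff < 32:
        return 4
    return 5


# Clamp range per type; types 3 and 4 mirror types 2 and 1 (assumed by symmetry).
RANGES = {1: (0, 63), 2: (64, 127), 3: (128, 191), 4: (192, 255), 5: (0, 255)}


def truncate(block):
    """Force every pixel of the block into the range of the block's type."""
    lo, hi = RANGES[classify(block)]
    return [min(max(p, lo), hi) for p in block]
```

After truncation, every pixel of a type 1 to type 4 block lies in the same quarter of the 8-bit range, so the two most significant bitplanes of such a block are constant.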

3.1.2 Selective Start Plane

Fig. 12 shows the flowchart of the selective start plane. Bitplane coding is a well-known method. We use the bitplane, rather than the pixel, as the basic unit for grouping numbers.

First, we consider a 4x2 block in which each pixel value is represented by 8-bit. A bitplane can be formed by selecting a single bit from the same position in the binary representation of each pixel. We define that B7 represents the MSB plane while B0 represents the LSB plane.

Second, the start plane (SP) is searched for four successive bitplanes from the MSB bitplane with four modes as follows:

1) From B7 to B5 are all-0.

2) B6 is all-1; B7 and B5 are all-0.

3) B7 is all-1; B6 and B5 are all-0.

4) B7 and B6 are all-1; B5 is all-0.

In the first mode, if both B7 and B6 are all-0 and B5 is not all-0, then SP is equal to 1. The other modes are handled in the same way as the first mode. Finally, the maximum start plane among the four modes is selected, and the corresponding mode and start plane are recorded.


Fig. 12 Flowchart of the selective start plane

3.1.3 Compensation

Since the lower bitplanes are truncated due to the limited budget, a simple rounding is applied here. The rounding is applied when the most significant bit of the truncated bits is nonzero and the coded bits are not all 1's. Fig. 13(a) shows this simple idea, which leads to a satisfactory quality improvement. Two rounding modes are proposed because the pattern comparison has two compressed data formats. As shown in Fig. 13(b), the first one is the comparison rounding and the other is the no comparison rounding. For pattern comparison, the first rounding method is applied to the first three types, and the second rounding method is used only for the final type.


Fig. 13 Flowchart of the rounding
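The rounding rule stated above can be written as a small Python sketch. It is a generic illustration for a single value, assuming the value has `total_bits` bits of which the lowest `dropped_bits` are truncated; how the two rounding modes differ inside the codec is not modelled. The name `round_truncate` is hypothetical.

```python
def round_truncate(value, total_bits, dropped_bits):
    """Drop the low bits of value, rounding up when the rule in the text allows it."""
    kept_bits = total_bits - dropped_bits
    kept = value >> dropped_bits
    if dropped_bits == 0:
        return kept                    # nothing is truncated, nothing to round
    msb_of_dropped = (value >> (dropped_bits - 1)) & 1
    all_ones = (1 << kept_bits) - 1
    if msb_of_dropped and kept != all_ones:
        kept += 1                      # round up without overflowing the kept field
    return kept
```

For example, keeping the upper four bits of 01011100 yields 0110 instead of 0101, because the most significant truncated bit is 1; a value such as 11111100 stays at 1111, since rounding up would overflow the coded bits.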

3.1.4 Predefined Bitplanes Comparison

The final step encodes the preserved bitplanes. First, the truncated 4x2 block is partitioned into two 2x2 blocks, called the left 2x2 block and the right 2x2 block, as shown in Fig. 14(a). In Fig. 14(b), the left 2x2 block and the right 2x2 block use the same SP and are compressed individually. Second, a 2x2 block is classified into one of four types: 1) Group A, 2) Group B, 3) Group C, and 4) No Comparison. Each of the first three types uses a group of eight patterns to compare with the four successive bitplanes from SP, and a type that can hit three successive bitplanes is selected. The three groups of eight patterns are shown in TABLE 1. If none of the first three types can hit at least three bitplanes, type 4 is chosen and three successive bitplanes from SP are stored.

TABLE 1 Three Groups of Eight Predefined Bitplanes

Pattern No.  1     2     3     4     5     6     7     8
Group A      0000  1111  1110  0111  0011  1100  0001  1000
Group B      0000  1111  1110  0111  1010  1001  0110  0101
Group C      0000  1111  1110  0111  1101  1011  0010  0100

Fig. 14 An example of partitioning a 4x2 block, panels (a) and (b)
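The group selection can be illustrated with the following Python sketch built on the patterns of TABLE 1. Several details are assumptions, since the text does not specify them: a "hit" is taken to mean that a 4-bit plane of the 2x2 block exactly equals one pattern of the group, ties between groups are resolved in the order A, B, C, and the handling of a plane that does not hit is not modelled. The names `GROUPS`, `count_hits`, and `choose_type` are hypothetical.

```python
GROUPS = {
    "A": ["0000", "1111", "1110", "0111", "0011", "1100", "0001", "1000"],
    "B": ["0000", "1111", "1110", "0111", "1010", "1001", "0110", "0101"],
    "C": ["0000", "1111", "1110", "0111", "1101", "1011", "0010", "0100"],
}


def count_hits(planes, group):
    """Number of 4-bit planes that exactly equal one pattern of the group."""
    return sum(1 for plane in planes if plane in GROUPS[group])


def choose_type(planes):
    """Pick the group with the most exact hits, or 'No Comparison' below three."""
    best = max(GROUPS, key=lambda g: count_hits(planes, g))
    return best if count_hits(planes, best) >= 3 else "No Comparison"
```

With eight patterns per group, each matched plane can be stored as a 3-bit pattern number, so four planes fit in the 12-bit coded data field; three raw 4-bit planes in the No Comparison type also occupy 12 bits.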

3.2 Simulation Results

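First, we define the MSE (Mean Square Error) and the PSNR (Peak Signal to Noise Ratio), which are given in (8) and (9),

$$\mathrm{MSE} = \frac{1}{W \times H}\sum_{x=0}^{W-1}\sum_{y=0}^{H-1}\bigl(I(x,y) - P(x,y)\bigr)^{2} \tag{8}$$

$$\mathrm{PSNR} = 10\log_{10}\frac{255^{2}}{\mathrm{MSE}} \tag{9}$$

where W and H are the width and height of the frame, I is the original frame, and P is the compressed frame.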
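The two quality metrics can be computed with a few lines of Python. This is a minimal sketch assuming 8-bit frames stored as lists of rows, using the standard definitions of MSE and PSNR with a peak value of 255; the names `mse` and `psnr` are illustrative.

```python
import math


def mse(original, compressed):
    """Mean square error between two frames given as lists of rows."""
    height, width = len(original), len(original[0])
    total = sum((o - c) ** 2
                for row_o, row_c in zip(original, compressed)
                for o, c in zip(row_o, row_c))
    return total / (width * height)


def psnr(original, compressed, peak=255):
    """PSNR in dB; identical frames give an infinite PSNR."""
    error = mse(original, compressed)
    return math.inf if error == 0 else 10 * math.log10(peak ** 2 / error)
```

The PSNR loss reported in this section is then the difference between the PSNR of the decoder output without embedded compression and the PSNR with it.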

In this section, we focus on the coding efficiency for all CIF sequences. In Fig. 15, the PSNR loss ranges from 1.68 dB to 3.45 dB, and the average PSNR loss is 2.37 dB. Then, Fig. 16 shows the result of the embedded compression for different group of pictures (GOP) settings. The PSNR loss grows with the number of P frames. Because each P frame is generated from the previous frame, which has been compressed by the proposed algorithm, the error accumulates with the number of P frames. This phenomenon is called error propagation or the drift effect. Fig. 17 shows the drift effect for different QP and GOP settings.


Fig. 15 PSNR and PSNR Loss for all CIF sequences

Fig. 16 Drift Effect for Mobile_QP28


Fig. 17 Drift Effect with different QP and GOP

Chapter 4 Proposed Architecture

In this chapter, we introduce the proposed architecture. Section 4.1 describes the proposed embedded compressor. Section 4.2 describes the proposed embedded decompressor. Section 4.3 gives the summary of the proposed architecture and the flow of the design verification.

4.1 Architecture of Compressor

Fig. 18 shows the pipeline architecture of compressor design. We use two pipeline stages and each stage requires one cycle. The first stage is the pixel truncation. The second stage is composed of selective start plane, rounding, selective pattern comparison, and packer. This compressor encodes a 4x2 block in 2 cycles.

Fig. 18 The Pipelined Architecture of Compressor Design

4.1.1 Architecture of Pixel Truncation

Fig. 19 shows the architecture of pixel truncation. There are seven combinational logic blocks, one multiplexer, one de-multiplexer, and one register. The seven combinational logic blocks are as follows: average, difference, type selector, quantizer 1, quantizer 2, quantizer 3, and quantizer 4. The type selector controls the multiplexer and the de-multiplexer. The register stores the truncated 4x2 block.

Fig. 19 Architecture of Pixel Truncation

4.1.2 Architecture of Selective Start Plane


Fig. 20 Architecture of Selective Start Plane

Fig. 20 shows the architecture of the selective start plane. The bitplane transform block is a wrapper. There are five combinational logic blocks, one de-multiplexer, and one look-up table. The five combinational logic blocks are as follows: Mode 1, Mode 2, Mode 3, Mode 4, and Mode selector. The look-up table records the information of B7 and B6 for each mode.

4.1.3 Architecture of Compensation

Fig. 21 shows the architecture of compensation. There are two combinational logic blocks: Comparison Rounding and No Comparison Rounding.


Fig. 21 Architecture of Compensation

4.1.4 Architecture of Predefined Bitplanes Comparison

Fig. 22 shows the architecture of pattern comparison. There are five combinational logic blocks, one de-multiplexer, one look-up table, and one register. The five combinational logic blocks are as follows: Comparison Group A, Comparison Group B, Comparison Group C, No Comparison, and the pattern selector. The pattern selector controls the de-multiplexer. The register stores the coded data.


Fig. 22 Architecture of Predefined Bitplanes Comparison

4.1.5 Data Packing

Fig. 23 shows the architecture of data packing. The representation format consists of 2-bit Mode, 2-bit Start Plane, 2-bit Decision L, 2-bit Decision R, 12-bit Coded Data L, and 12-bit Coded Data R.

Fig. 23 Architecture of Data Packing

4.2 Architecture of Decompressor

Fig. 24 shows the pipeline architecture of the decompressor. The decompressor needs only one stage of one cycle. Because the decompressor reaches a higher throughput, it provides higher random accessibility.

Fig. 24 The Pipeline Architecture of Decompressor

4.2.1 Data Rearrange

Fig. 25 shows the architecture of data rearrange. According to the representation format, data rearrange can be considered the inverse of data packing.

Fig. 25 Architecture of Data Rearrange

4.3 Design Implementation and Verification

In sections 4.3.1 and 4.3.2, we introduce the results of the design implementation and the flow of the design verification, respectively.

4.3.1 Design Implementation

TABLE 2 shows the summary of the hardware design. The proposed hardware architecture is synthesized with a 90-nm CMOS standard-cell library, and the gate counts of the compressor and the decompressor are 4.0k and 0.9k, respectively. The working frequency is up to 150 MHz for HD1080/HD720. The proposed embedded compressor is divided into 2 pipeline stages, and each stage requires 1 cycle. The proposed embedded decompressor has 1 pipeline stage, which requires 1 cycle. At 150 MHz, the power consumption of the compressor and the decompressor is 158 uW and 86 uW, respectively. As described above, the proposed hardware has low complexity.

TABLE 2 Summary of the hardware implementation

Proposed EC
Function            Compressor   Decompressor
Technology          UMC 90nm
Working Frequency   HD1080+HD720@150MHz
Latency/4x2 block   2 cycles     1 cycle
Gate count          4K           0.9K
Power Consumption   158uW        86uW

4.3.2 Design Verification

Fig. 26 shows the flow of verification. We use both software and hardware to verify the proposed algorithm. The test patterns are created by software and applied as the input of the hardware designs. The software then calculates the expected answer and compares it with the result of the hardware, and the result is stored in memory. Afterward, the coded data is read from memory by the software and hardware decompressors. We check the decoded data to confirm that the software and hardware results match.

Fig. 26 The verification flow

Chapter 5 System Integration

In section 5.1, we introduce the Si2 H.264 Decoder System. Then, the access analysis and the processing analysis are discussed in sections 5.2 and 5.3, respectively.

5.1 System Analysis

Fig. 27 The block diagram of the overall H.264 decoder system

The overall H.264 decoder [3] with the embedded compression codec is shown in Fig. 27. Our H.264 decoder specification is HD1080/HD720@30fps, and the decoder works at 150MHz. The embedded compressor works between the deblocking filter and the external memory. The embedded decompressor works between the external memory and the motion compensation. Designing the address controller of the EC is very simple since our compression ratio is fixed at two. Our system bus is 32 bits wide, and the external memory has 32 bits per entry.
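The simplicity of the address controller under a fixed compression ratio can be seen in the following Python sketch. The memory layout is an assumption made for illustration: 4x2 blocks are stored in raster order, and each compressed block occupies exactly one 32-bit memory entry. The names `segment_address` and `frame_words` are hypothetical.

```python
def segment_address(x, y, frame_width, base=0):
    """Word address of the 32-bit segment holding pixel (x, y).

    Assumes 4x2 blocks stored in raster order, one 32-bit word per block.
    """
    blocks_per_row = frame_width // 4
    return base + (y // 2) * blocks_per_row + (x // 4)


def frame_words(frame_width, frame_height):
    """Number of 32-bit words for one compressed 8-bit plane (CR = 2)."""
    return (frame_width // 4) * (frame_height // 2)
```

Because every segment has the same size, the address of any block follows from its coordinates by integer division, with no table of variable-length offsets.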

5.1.1 Interface Problem

Fig. 28 The system interface design for embedded codec

Fig. 28 shows the system interface design for the embedded codec. The embedded compression can be considered as an interface between the chip and the off-chip memory.

In the original H.264 decoder system, there are two interface issues. The first occurs between the deblocking filter and the off-chip memory. The throughput of the deblocking filter is 4 pixels per clock. Therefore, to avoid stalling the pipeline at the input of the embedded compressor, the processing time must be less than or equal to 4 cycles. The other issue occurs between the motion compensation (MC) and the off-chip memory. The input of MC requires 4 pixels per cycle; thus, the throughput of the embedded decompressor must be at least 4 pixels per cycle. Furthermore, since the compression ratio is fixed at two, the address converter can be easily implemented.

5.1.2 Processing Cycles Problem

In this part, we discuss the processing cycle problem of our H.264 decoder system. Our H.264 decoder specification is HD1080/HD720@30fps, and the decoder works at 150MHz. From our simulation, MC requires 25 cycles on average to process a 4×4 block. Therefore, the embedded compressor requires a design with few cycles to reduce the loading cycles.

5.1.3 Overhead Problem

Fig. 29 An example of overhead problem

A 4×4 block is the basic coding unit in the H.264 standard. Moreover, because block-based approaches fit the block-oriented structure of the received bitstream, they are the most popular techniques. However, there is an overhead problem [11], which can be defined as the ratio between the number of pixels that are actually accessed during the motion compensation of a block and the number of pixels that are really useful in the reference block. In the original system without block-based approaches, the ratio is equal to 1 because only the required pixels are accessed. In contrast, in the system with block-based approaches, the ratio is generally greater than 1. As shown in Fig. 29, to obtain the required 4×4 block, we need to fetch four 4×4 blocks of block-based data. The overhead in this case is 48 pixels (64 pixels accessed for 16 useful pixels).
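The overhead ratio defined above can be computed for a single reference block with a short Python sketch. It assumes non-negative integer coordinates for the top-left corner of the reference block and a square storage grid whose blocks must be fetched whole; the name `overhead_ratio` is hypothetical.

```python
def overhead_ratio(x, y, grid, block=4):
    """Accessed pixels / useful pixels for a block x block reference at (x, y)."""
    first_col, last_col = x // grid, (x + block - 1) // grid
    first_row, last_row = y // grid, (y + block - 1) // grid
    fetched_blocks = (last_col - first_col + 1) * (last_row - first_row + 1)
    return fetched_blocks * grid * grid / (block * block)
```

A reference block that straddles four 4×4 grid blocks, such as one starting at (1, 1), gives a ratio of 4 (64 accessed pixels for 16 useful ones), whereas a grid-aligned block gives a ratio of 1. A coarser grid raises the ratio even for aligned blocks, which agrees with the trend of the averages in TABLE 3.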

TABLE 3 Overhead with block grid for six sequences

Sequence   4×4 block grid   8×8 block grid   16×16 block grid
Foreman    1.31             1.77             3.69
Flower     1.30             1.74             3.77
News       1.14             1.51             2.78
Silent     1.17             1.50             3.22
Stefan     1.51             2.44             6.95
Weather    1.17             1.49             3.18
All        1.27             1.73             3.93

As given in TABLE 3, [12] provides the summary of the statistical analysis
