National Chiao Tung University
Department of Electronics Engineering
Institute of Electronics, Master Program

Master Thesis

應用於行動式視訊裝置之嵌入式壓縮器解壓縮器設計
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications

Student: Yu-De Wu
Advisor: Prof. Chen-Yi Lee

July 2008
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications
Student: Yu-De Wu
Advisor: Chen-Yi Lee

National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics
Master Program

A Thesis Submitted to Institute of Electronics
College of Electrical Engineering and Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of Master of Science
in Electronics Engineering

July 2008
Hsinchu, Taiwan, Republic of China
Design of An Embedded Compressor/Decompressor for Mobile Video Applications

Student: Yu-De Wu    Advisor: Prof. Chen-Yi Lee

National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics

Abstract (Chinese)
This thesis proposes a lossy embedded compressor/decompressor design suitable for mobile video devices. By using lossy data compression to reduce the amount of data transferred between the chip and the external memory, a slight loss of video quality is traded for several benefits at once: a smaller external memory requirement, lower bandwidth usage, and reduced energy consumption.

The proposed algorithm combines the two-dimensional discrete cosine transform with coarse grain bit-plane zonal coding. Under a fixed compression ratio of two, each 4x4 pixel array is compressed into a 64-bit packet. The 4x4 pixel array is first converted by the 2-D discrete cosine transform into sixteen coefficients of different frequencies; the coefficients are then encoded and packed by coarse grain bit-plane zonal coding and sent to the external memory. A simple compensation method is also proposed for the decompression process to mitigate the data loss caused by lossy compression.

The proposed hardware architecture can be embedded in a video decoder and supports HD1080 at 30 frames per second with a 100 MHz operating frequency. Since the compression ratio is fixed at two, the compressed packets have a fixed size, memory address translation is very simple, and random access by the motion compensation (MC) unit is supported. In UMC 90 nm process technology, the proposed hardware costs 30k logic gates. Compressing one macroblock (MB) takes 72 cycles, while decompressing one MB takes only 34 cycles. Overall, the memory accesses of the whole system are reduced by 40% of the original.
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications
Student : Yu-De Wu Advisor : Dr. Chen-Yi Lee
Department of Electronics Engineering
Institute of Electronics
National Chiao Tung University
ABSTRACT
This thesis proposes an embedded compressor/decompressor for mobile video
applications. It uses a lossy compression scheme to reduce the amount of data
transferred between the chip and the external memory. This lossy compression
maintains acceptable video quality while reducing the required external memory
size, the bandwidth requirement, and the power consumed on memory access.

The proposed algorithm is composed of the discrete cosine transform (DCT) with
coarse grain bit-plane zonal coding (CGBPZ). The compression ratio is two: each
4x4 pixel array is compressed into a 64-bit segment. First, the two-dimensional
discrete cosine transform converts the 16 pixels into 16 elementary frequency
components, which coarse grain bit-plane zonal coding then packs into the coded
segment. A compensation scheme is also proposed for decoding.

The hardware architecture of the proposed algorithm can be embedded into a
video decoder and supports HD1080 at 30 frames per second at 100 MHz. Since the
compression ratio is fixed at two, the coded segments have a fixed size and can be
randomly accessed by the motion compensation unit. The gate count is 30K when
synthesized with UMC 90 nm CMOS technology. It costs 72 cycles to encode an MB and
34 cycles to decode an MB. The overall reduction ratio on memory access is 40%.
Compared with the power consumed by the proposed design, the amount of power saving
Acknowledgements
The two years in the SI2 Lab have been precious days in my life. First, I would like to express my deepest gratitude to my advisor, Dr. Chen-Yi Lee. He always guided me enthusiastically and patiently and offered encouragement at the right moments, from which I benefited greatly during my two years in the master program. I sincerely wish him all the best. Next, I would like to thank the Ph.D. students of the multimedia group, 劉子明 and 李曜; their training and tireless guidance laid a solid foundation for my research. 阿龍 and 義閔 also gave me much help in the professional field. Special thanks go to Prof. 蔣迪豪 and Dr. 鍾菁哲, who offered not only sharp advice on research but also detailed analyses of careers and the job market. To my classmates 韋磬, Amos, bluer, 俊廷, 琇茹, 清峰, 建螢, 點子, 茗智, junior 昱帆, and every other member of SI2: thinking together, helping one another, and joking at the right time made my research life full yet never dull. I also thank my roommates 宗學, 良諺, 碩宇, 振祐, 育瑋 and 文炫 for filling my dormitory life with joy. Finally, I thank my family and my friends; with your support, devotion and encouragement, I could move forward wholeheartedly. May you always be healthy, happy and content.
Index

Chapter 1 Introduction
1.1 Motivation
1.2 Thesis Organization
Chapter 2 Previous Works
2.1 Lossless Embedded Compression Schemes
2.2 Lossy Embedded Compression Scheme
2.2.1 Transform-Based Lossy Embedded Compression
2.2.2 Delta Pulse Code Modulation Lossy Embedded Compression
2.2.3 Other Embedded Lossy Compression
2.3 Bit-Plane Coding
2.3.1 Bit-Plane Truncation Coding (BPT)
2.3.2 Bit-Plane Zonal Coding (BPZ)
2.3.3 Modified Bit Plane Zonal Coding
2.4 Summary
Chapter 3 Proposed Embedded Compression Algorithm
3.1 Overview
3.2 Algorithm of Embedded Compressor
3.2.1 Discrete Cosine Transform
3.2.2 Proposed Fine Grain Bit Plane Zonal Coding (FGBPZ)
3.3 Coarse Grain Bit-Plane Zonal Coding (CGBPZ)
3.4 Decoding Process and the Compensation
3.5 Embedded Result on Software Simulation
3.5.1 FGBPZ versus CGBPZ
Chapter 4 Proposed Embedded Compressor/Decompressor Architecture
4.1 Architecture of Encoder Design
4.1.1 The Architecture of Two Dimensions Discrete Cosine Transform
4.1.2 The Architecture of Coarse Grain Bit-Plane Zonal Encoding and Data Packing
4.1.3 The Architecture of End Plane Calculation
4.1.4 Overall Encoder Design
4.2 Architecture of Decoder Design
4.2.1 Architecture of Data Unpacking, Bit-Plane Zonal Decoding and Compensation
4.2.2 Architecture of Two Dimensions Discrete Cosine Transform
4.2.3 Overall Decoder Design
Chapter 5 Design Implementation and Verification
5.1 Design Implementation
5.2 Design Verification
Chapter 6 System Integration and Experimental Results
6.1 System Analysis
6.1.1 Interface
6.1.2 Overhead Problem
6.1.3 Processing Cycles Problem
6.2 System Integration
6.2.1 Access Reduction
6.2.2 Processing Cycles Problem
6.2.3 Access Reduction Ratio
Chapter 7 Conclusion and Future Work
7.1 Conclusions
7.2 Future Work
Figure Index

Fig. 1 Bit-plane truncation: AC coefficients are packed from the start plane; due to the limited packing budget, coefficient bits of the lower digit planes surrounded by the dashed line are truncated
Fig. 2 Coding format for bit-plane truncation coding (BPT)
Fig. 3 The concept of bit-plane
Fig. 4 Coding procedure of the BPZ algorithm
Fig. 5 An example of BPZ coding
Fig. 6 New packing data format (BPZ) versus BPT
Fig. 7 Coding procedure of the MBPZ algorithm
Fig. 8 An example of MBPZ coding
Fig. 9 Compensation for a bit-truncated AC coefficient
Fig. 10 Pixel-based (left) versus block-based (right)
Fig. 11 An example of the overhead problem
Fig. 12 The correlation between bit-rate and overhead (Stefan sequence)
Fig. 13 The flow chart of the proposed DCT-FGBPZ/CGBPZ embedded compression
Fig. 14 The occurrence probability of each type in MBPZ
Fig. 15 Coding flow of FGBPZ with the VLC codebook (types A, B, C and D are referred from [20])
Fig. 16 A coding example for FGBPZ
Fig. 17 Protecting mechanism for the unknown sign bit
Fig. 18 Final encoding flow chart
Fig. 19 CGBPZ coding format for the magnitude of AC coefficients
Fig. 20 The concept of deriving the RMAX/CMAX of the sign bit-plane from the coded bit-plane
Fig. 21 End plane decision
Fig. 22 Overall encoding flow of CGBPZ
Fig. 23 Proposed compensation technique
Fig. 24 Drift effects on Foreman_QP28_GOP20
Fig. 25 Drift effects on Mobile_QP28_GOP20
Fig. 26 PSNR loss considering different QP and different GOP (Foreman)
Fig. 27 PSNR loss considering different QP and different GOP (Mobile)
Fig. 28 Drift effects on Foreman_QP28_GOP20
Fig. 29 Drift effects on Mobile_QP28_GOP20
Fig. 30 PSNR loss results for different QP and different GOP (Foreman)
Fig. 31 PSNR loss results for different QP and different GOP (Mobile)
Fig. 32 Overall block diagram of the embedded compressor
Fig. 33 Content adaptive ripple connecter
Fig. 34 The architecture of a single connecter in Fig. 33
Fig. 35 The architecture of end plane calculation
Fig. 36 Overall encoder design
Fig. 37 Overall block diagram of the embedded decompressor
Fig. 38 Overall decoder design
Fig. 39 The flow of design verification
Fig. 40 The overall system block diagram
Fig. 41 System interface design for the embedded codec
Fig. 42 Best case on data fetching
Fig. 43 Worst case: sub-pixel case
Fig. 44 Power analysis on CIF @ 5.3 MHz
Table Index

Table 1 Coding types of bit-plane proposed in [20]
Table 2 Overhead with EC block grid for each sequence
Table 3 The complexity of the N-point DCT
Table 4 The needed codebook entries and their related RMAX/CMAX
Table 5 The final 40-entry VLC codebook
Table 6 The overall codewords in the VLC codebook
Table 7 Summary of hardware design
Table 8 Overall cases of read access requested by MC with/without EC
Table 9 Full cases of "EC decode" cycles plus original "MC data read" cycles
Chapter 1
Introduction
1.1 Motivation
To improve video coding efficiency, eliminating the temporal redundancy
between frames is a useful technique. It is widely used in today's video coding
standards such as MPEG-1/2/4, H.263 and H.264. But to apply this technique during
encoding or decoding, at least one previous frame must be stored in frame memory
as a reference. The accesses between the external memory and the decoder chip
consume a lot of power, and the rapid data accesses of motion compensation
dominate the power consumption of the whole system.
For a mobile device, power is always a critical issue. Although the power
consumed on chip can be reduced by many low-power techniques, data transfer
still consumes a lot of power. Therefore, minimizing memory access operations is
a key consideration in the hardware design of mobile video devices.

Embedded compression is a technique to reduce the data transfer and the
size of the off-chip frame memory. Since mobile video devices suffer from limited
battery life, and the visual quality criterion is not so strict due to the small
display screen, we aim to reduce the bandwidth requirement while maintaining
acceptable visual quality.

Nowadays, mobile devices become more and more powerful through their various
functions. Reducing the bandwidth and resource requirements of each hardware
component therefore benefits the whole system.
1.2 Thesis Organization
This thesis is organized as follows. First, a basic introduction to
compression schemes and a review of prior works are given in Chapter 2. The
proposed lossy embedded compression algorithm is presented in Chapter 3. To
integrate with an H.264/AVC decoder, some constraints need to be specified, and
the proposed algorithm must be modified to fit those constraints in the hardware
design. The modified algorithm and hardware architecture are presented in
Chapter 4. Moreover, the simulation results of the proposed algorithm integrated
with an H.264/AVC HDTV decoder are also presented in that chapter. The design
implementation, integration and verification are shown in Chapter 5. Chapter 6
shows the experimental results and performance comparison. Finally, the
conclusions and future work are given in Chapter 7.
Chapter 2
Previous Works
Basically, compression techniques can be divided into two types: lossless
compression and lossy compression. In this chapter, we briefly introduce the
algorithms that have been proposed before. Bit-plane coding, which can serve as
either a lossy or a lossless coding method, is introduced in Section 2.3; its
concept is used in our proposed methods.
2.1 Lossless Embedded Compression Schemes
Many lossless compression methods have been proposed. The benefit of
lossless compression is obvious: it preserves the information while cutting down
the data size. Embedding a lossless compression mechanism into a video system is
quite acceptable, since it causes no drift effect in either an encoder system or
a decoder system.

However, behind those advantages, lossless compression suffers from a
variable data amount after compression. By information theory, even for an ideal
lossless compressor, the information content of the source data still governs
the compression ratio: the more information the source data contains, the longer
the coded data is. This instability is the fatal weakness of lossless embedded
compression. Embedded compression schemes exist to reduce the number of accesses
to the external memory and to reduce the external memory size. However, the
variable data amount after lossless compression can guarantee neither the memory
size, which must be provisioned for the worst case, nor the bandwidth reduction,
since the compressed data amount is unknown. A study of lossless compression is
shown in [2].
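The variable output size is easy to observe with any off-the-shelf lossless codec. The following toy sketch is our own illustration (using Python's zlib, not a scheme from the cited works): the same codec shrinks a flat 16x16 block to a few bytes but cannot shrink a noise-like block at all.

```python
# Illustration of why lossless EC is hard to budget: the same lossless
# codec produces very different output sizes depending on how much
# information the source block contains.
import random
import zlib

random.seed(0)
flat = bytes([128] * 256)                                 # a uniform 16x16 block
noisy = bytes(random.randrange(256) for _ in range(256))  # a noise-like block

flat_size = len(zlib.compress(flat))    # a handful of bytes
noisy_size = len(zlib.compress(noisy))  # about the original size, or more
```

A frame buffer sized for the flat case would overflow on the noisy one, which is why a guaranteed reduction requires a fixed-ratio (lossy) scheme.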
2.2 Lossy Embedded Compression Scheme
Lossy compression with a fixed compression ratio is suitable for reducing
the size of the frame memory and the bandwidth, since the predictable amount of
compressed data can guarantee the reduction. Therefore, lossy embedded
compression is more popular than lossless embedded compression for solving this
bandwidth reduction problem. [3] – [14] are previous works on lossy compression.
2.2.1 Transform-Based Lossy Embedded Compression
Transform-based coding is a popular way to compose a lossy embedded
compression. It converts a signal into elementary frequency components. Given
the characteristics of the human visual system, a lower frequency component is
more noticeable than a higher frequency component, so applying quantization and
data collection to each component according to its visual priority is an
efficient way to collect data within a limited data budget. The work in [3] uses
the Hadamard transform, quantizes the coefficients by their priority, and then
encodes the quantized coefficients with Golomb-Rice coding. Golomb-Rice coding
is an efficient method that can nearly reach the coding ability of Huffman
coding when a suitable K factor is selected; however, since that paper pursues
low complexity, it chooses fixed K values according to simulation. The design
operates at 100 MHz, and encoding or decoding a MB each takes 33 cycles.
2.2.2 Delta Pulse Code Modulation Lossy Embedded Compression
Delta pulse code modulation (DPCM) is another popular basis for lossy
compression. Since neighboring data have relatively small differences, the
information after DPCM can be efficiently reduced compared with the source data.

[4] uses DPCM as the base coding method and takes the intra prediction
modes from the H.264 video coding standard to find the best direction in which
to perform DPCM. This idea makes the algorithm adapt to each video pattern and
achieves better quality than [3].

However, the good performance of the DPCM method comes at a cost. The DPCM
method needs to fit every difference into a limited budget, but those
differences are not always as small as we wish. To derive the best quantization
level that fits every difference into the budget, this DPCM-based method needs
several iterations, which prevents a pipelined scheme. And to avoid a large gate
count, it is more practical to process the subtractions clock by clock instead
of in a parallel architecture. This, however, leads to longer coding cycles and
becomes a heavy timing load on the original system. From the viewpoint of system
integration, the operating frequency must be increased or the system throughput
slowed down to perform this DPCM-based embedded compression scheme.
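The iteration issue can be sketched in a few lines. This is a minimal DPCM toy of our own, not the algorithm of [4]: the quantization shift is raised and the whole block re-coded until every quantized difference fits a hypothetical per-sample bit budget.

```python
# Minimal DPCM sketch: code differences between neighbors; if any
# quantized difference overflows the per-sample budget, coarsen the
# quantization (larger shift) and re-encode the whole block.
def dpcm_encode(samples, bits=4):
    limit = (1 << (bits - 1)) - 1          # signed range is [-limit-1, limit]
    shift = 0
    while True:
        prev, deltas, ok = samples[0], [], True
        for s in samples[1:]:
            d = (s - prev) >> shift
            if not -limit - 1 <= d <= limit:
                ok = False                 # budget blown: must restart
                break
            deltas.append(d)
            prev += d << shift             # track the reconstructed value
        if ok:
            return shift, deltas
        shift += 1                         # coarser quantization, retry

shift, deltas = dpcm_encode([100, 103, 110, 95, 96])
```

Because the pass/fail outcome is known only after scanning the whole block, results cannot stream out in pipeline fashion; this is the timing burden discussed above.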
2.2.3 Other Embedded Lossy Compression
There are still many other approaches to lossy compression, such as
adaptive vector quantization (VQ) [11], a down-sampling based compression
algorithm [12], and an adaptive mechanism that chooses which method to use. The
latter claims that better performance can be achieved by choosing the algorithm
adaptively to fit the features of the video sequence. DWT with SPIHT in [14] is
another transform approach, and the algorithm in [14] can perform both lossy and
lossless coding with the same architecture.
We can see that lossy embedded compression is truly the mainstream.
However, it suffers from quality loss and the drift effect. Therefore, how the
lossy coding methods are organized is very important: covering as much
information as possible within a limited budget is the main challenge of lossy
compression.
2.3 Bit-Plane Coding
Bit-plane zonal coding is a well-known coding method widely used in many
compression algorithms. It uses the bit-plane as its basic unit, encoding a
group of numbers instead of individual numbers. It can be combined into a lossy
or lossless compression scheme by adjusting the bit-storage budget: with a
sufficient bit budget it can fully represent the group of numbers, while with an
insufficient budget it may lose some information in the lower bits and thus
becomes lossy. The details of bit-plane zonal coding are shown in the following
sections.
2.3.1 Bit-Plane Truncation Coding (BPT)
Before introducing the proposed bit-plane zonal coding, we introduce the
basic concept first. Bit-plane truncation coding is the prototype of bit-plane
zonal coding. Consider the sixteen coefficients of a 4x4 block after the DCT: we
can simply classify them into one DC coefficient and 15 AC coefficients. The
idea of bit-plane coding is to collect data per bit-plane (that is, to take the
N-th bit of each coefficient as a union) rather than per individual coefficient.
When we want to further analyze a group of numbers and cut them into several
parts by their importance, separating them into bit-planes is a good idea.
Moreover, for a group of coefficients, the upper bit-planes are zero most of the
time, so recording the start plane is a smart way to improve the coding
efficiency. For a group of 4x4 coefficients of N bits each, about ⌈log2 N⌉ bits
are needed to record the start plane, but each skipped bit-plane saves 15 zero
bits. After bit-plane truncation coding, the coded format is as shown in Fig. 2.
Fig. 1 Bit-plane truncation: AC coefficients are packed from the start plane. Due to the limitation of packing budget, coefficient bits of lower digit plane surrounded by
dash line will be truncated.
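As a concrete illustration of the start-plane idea, the following software sketch (our own, not the thesis hardware) scans the magnitude bit-planes from the most significant one down, skipping the all-zero upper planes:

```python
# Sketch of start-plane detection for a group of AC coefficient
# magnitudes, as in bit-plane truncation coding.
def start_plane(coeffs, nbits=8):
    """Index of the highest bit-plane containing a nonzero bit (MSP = nbits-1)."""
    for plane in range(nbits - 1, -1, -1):
        if any((abs(c) >> plane) & 1 for c in coeffs):
            return plane
    return 0  # all-zero group

def bit_plane(coeffs, plane):
    """Collect the given bit of each coefficient magnitude as one bit-plane."""
    return [(abs(c) >> plane) & 1 for c in coeffs]

ac = [13, -6, 3, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0]   # 15 AC coefficients
sp = start_plane(ac)                     # 13 = 0b1101, so the start plane is 3
planes = [bit_plane(ac, p) for p in range(sp, -1, -1)]  # planes actually coded
```

Every plane above `sp` is all zeros and is represented implicitly by the recorded start plane.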
2.3.2 Bit-Plane Zonal Coding (BPZ)
However, BPT has poor performance, and the image quality must be enhanced
by another approach that reduces the energy loss of the DCT coefficients. In
this section, an improved coding algorithm named bit-plane zonal coding (BPZ)
[18] is described in detail. Like BPT, BPZ packs the DCT coefficients bit-plane
by bit-plane, but the packing scheme is quite different. We will show that the
packing efficiency of BPZ is much better than that of BPT.
The word "zonal" refers to encoding a bit-plane by its zonal
characteristic. Fig. 3 shows a possible bit-plane. The coefficients with larger
magnitude tend to be gathered at the upper-left corner (lower horizontal or
vertical frequencies) by the DCT, so the bits at the lower-right corner of a
bit-plane tend to be zero. Furthermore, the data of an individual DCT block
often has a bias toward either the horizontal or the vertical direction. By
describing the maximum row and column number of valid data in the scan zone,
named RMAX and CMAX respectively, we have a high probability of representing the
information of a bit-plane in fewer than 15 bits. Therefore, a signal-dependent
rectangular scan zone starting from the upper-left corner performs well.
Fig. 3 The concept of bit-plane
Two classes of coefficients, significant and insignificant, are defined. In
the encoding flow, a significant coefficient has a 1 in one of the higher,
already-coded bit-planes; on the contrary, an insignificant coefficient has all
0's in the higher bit-planes.

The zones represented by RMAX/CMAX are often very similar between
neighboring bit-planes. This data similarity allows a more efficient coding
mechanism to be developed.
The detailed coding flow is as follows. For a block of DCT coefficients,
the process is divided into a DC flow and an AC flow. In the DC flow, the DC
coefficient is completely packed to avoid significant quality degradation, as in
BPT. The AC flow follows the procedure shown in Fig. 4. Initially, all AC
coefficients are marked as insignificant. Then we start from the most
significant plane (MSP) and encode the subsequent bit-planes. The first plane
that contains a nonzero bit is defined as the start plane, and the nonzero bits
in the start plane are newly significant coefficients, so a sign bit is inserted
behind each nonzero bit. For each subsequent bit-plane there is only one
question: does it contain a newly significant bit? If so, a bit "1" is packed
first to announce that a newly significant bit is found, followed by the
corresponding sign bits. Already-significant and insignificant bits need not be
followed by sign bits, since the sign bits of significant coefficients are
already packed and the sign bits of insignificant coefficients are useless so
far. Notice that, unlike the fully packed sign bits in BPT, the sign bits in BPZ
are packed on demand.

If no newly significant bit appears in the current bit-plane, a bit "0" is
inserted to indicate that the RMAX/CMAX of the current bit-plane is the same as
that of the previous bit-plane, and only the bits at the positions of
significant coefficients need to be packed. BPZ repeats this procedure until all
bit-planes have been packed. From the on-demand sign bits and the handling of
bit-planes with no newly significant coefficient, we can see the efficiency of
BPZ and why it achieves better performance than BPT.
Fig. 4 Coding procedure of BPZ algorithm
An example of bit-plane classification is illustrated in Fig. 5. As in BPT,
the start plane of the DCT coefficients is packed as part of the header
information, but the sign bits of a DCT coefficient block are no longer part of
the header: they are dispersed and accompany the newly significant coefficients
found in certain bit-planes. The header information is shortened, leaving more
packing budget for the AC coefficients.
Fig. 5 An example of BPZ coding
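The zonal idea can be sketched as follows; this toy function is our own illustration for an assumed 4x4 bit-plane and derives the RMAX/CMAX pair that bounds all nonzero bits:

```python
# Sketch of the zonal scan in BPZ: the scan zone is the smallest
# upper-left rectangle covering every 1 bit of the plane, described
# by its maximum valid row (RMAX) and column (CMAX).
def zone(plane):
    """Return (RMAX, CMAX) for a 4x4 bit-plane given as 16 bits, row-major."""
    rmax = cmax = 0
    for i, bit in enumerate(plane):
        if bit:
            rmax = max(rmax, i // 4)
            cmax = max(cmax, i % 4)
    return rmax, cmax

plane = [1, 0, 0, 0,
         1, 1, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0]
r, c = zone(plane)            # (1, 1): all 1s lie in the top-left 2x2 zone
zone_bits = (r + 1) * (c + 1) # only 4 plane bits packed instead of 16
```

Packing the (RMAX+1) x (CMAX+1) zone plus 4 bits for RMAX/CMAX costs 8 bits here, versus 16 for the raw plane.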
2.3.3 Modified Bit Plane Zonal Coding
If we look more closely at the BPZ algorithm through the example shown in
Fig. 5, we discover that the original BPZ algorithm can be further improved. For
a software application, adding a little complexity can gain more coding
efficiency; a mechanism with a good trade-off between complexity and coding
efficiency is proposed in [20].

The starting point is to use the limited budget more efficiently. Looking
carefully at the coding types of bit-plane zonal coding (BPZ), we find an
inefficient format for handling the occurrence of a newly significant
coefficient, because it carries the longest header information: every time a
newly significant bit is found, we need to pack 4 bits for RMAX/CMAX and one bit
to distinguish the coding format. However, the four bits of RMAX/CMAX are not
really necessary, since the RMAX/CMAX may be the same as in the previous
bit-plane. Therefore, [20] proposes a new coding format for this situation. The
new format is adopted when a newly significant bit is found but the RMAX/CMAX of
the current bit-plane is the same as in the previous bit-plane; the overall
coding types are shown in Table 1. The drawback is one more bit to distinguish
the new type C from the original type B; the advantage is saving four bits
compared with the original coding format. Fig. 7 shows the coding flow of the
modified bit-plane zonal coding proposed in [20].
Table 1 Coding types of bit-plane proposed in [20]

Type | Newly Sig. Coef. | Rmax/Cmax Changed | Flag | Bits for Rmax/Cmax | Bits for Flag(s) and Rmax/Cmax
A    | Yes              | Yes               | None | 4                  | 4
B    | No               | No                | 00   | None               | 2
C    | Yes              | No                | 01   | None               | 2
D    | Yes              | Yes               | 1    | 4                  | 5
Fig. 7 Coding procedure of MBPZ algorithm
An example of the modified bit-plane zonal coding (MBPZ) proposed in [20]
is given in Fig. 8. The bit streams at the bottom of the figure are coded by the
original BPZ and by MBPZ respectively; through this comparison we can clearly
see the benefit brought by MBPZ. There is one small technique here: when packing
a bit-plane of AC coefficients, we collect bits in zigzag scan order. Since the
human visual system is more sensitive to low-frequency signal elements, the
zigzag scan order stores the relatively important low-frequency bits first when
the packing budget runs out.
Fig. 8 An example for MBPZ coding
When MBPZ encodes the AC coefficients within a limited budget, quality loss
is inevitable. To slightly compensate for the truncated data bits, [20] also
proposes a method to raise the quality. First, if the magnitude of a decoded
coefficient is greater than or equal to 4, scan it from the LSB to find the
first nonzero bit, and then set a "1" two digits below it. If the magnitude is
less than 4, nothing is changed. Finally, recover the coefficients with the
corresponding sign bits.
Fig. 9 Compensation for a bit-truncated AC coefficient.
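Our reading of this compensation rule can be sketched as follows (the exact bit position is our interpretation of the description in [20], so treat it as illustrative):

```python
# Sketch of the bit-truncation compensation: for a decoded magnitude
# >= 4, set a 1 two binary digits below its lowest nonzero bit, roughly
# centering the value within the truncated range. Smaller magnitudes,
# and cases where no such lower digit exists, are left untouched.
def compensate(magnitude):
    if magnitude < 4:
        return magnitude
    lowest = (magnitude & -magnitude).bit_length() - 1  # position of lowest 1 bit
    if lowest >= 2:
        magnitude |= 1 << (lowest - 2)
    return magnitude

# e.g. a coefficient decoded as 8 (0b1000) becomes 10 (0b1010)
```

The corrected magnitude is then re-signed with the decoded sign bit.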
2.4 Summary
From the introduction and discussion above, we classify the existing
algorithms into two basic types and briefly introduce their pros and cons. Lossy
compression is the popular way to implement an embedded compressor because of
its fixed compression ratio and fixed amount of coded data. However, good
performance usually comes at the cost of longer processing time, while low
complexity usually brings worse quality. The former kind of method achieves
better performance but may require a large buffer, and longer processing cycles
enlarge the load on the system and raise the barrier to embedding this extra
function. Although slowing down the system or increasing the operating frequency
can fix this problem, the former decreases the coding throughput and the latter
increases the power consumption; neither drawback is what we want. Some lossy
schemes are easy to embed into a decoder system as far as hardware is concerned,
but at the same time they often suffer from unsatisfactory quality.

For a real-time, low-power HDTV H.264/AVC decoder, low latency is a basic
requirement, and not increasing the load of the original system is another
target. Therefore, our design challenge for the embedded compressor is to find
the optimal trade-off.
Chapter 3
Proposed Embedded Compression Algorithm
3.1 Overview
Research on data compression has been developed for a long time, and the
developed algorithms show that increasing complexity can reach better
performance. The problem, however, is to find a suitable compression category
that combines with the H.264 system without affecting the performance of the
overall system. The discussion in Chapter 2 has shown that the barrier to
embedding an extra function may rise with a higher-complexity coding scheme. In
this chapter, further discussion is presented.
In practice, block-based schemes are the most convenient because they match
the block-oriented structure of the incoming bit-stream in the H.264 system and
allow on-the-fly processing. However, another problem arises: the overhead. The
overhead can be defined as the ratio between the number of pixels actually
accessed during the motion compensation of a block and the number of pixels
really useful in the reference block. In the original system the ratio is 1,
since every accessed pixel is on demand. After a block-based embedded
compression (EC) algorithm is adopted, this ratio is always greater than 1
because of the nature of block-based embedded compression. Fig. 10 shows the
difference between pixel-based and block-based access: the left of Fig. 10 is
pixel-based, representing the data without EC, and the right is block-based due
to the characteristics of EC. Fig. 11 is an example of the overhead problem.
Fig. 10 Pixel-based (left) versus block-based (right)
Fig. 11 An example of overhead problem
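For block grids aligned to multiples of the grid size, the overhead ratio defined above can be computed directly. This small sketch (ours, ignoring sub-pixel interpolation) counts how many whole EC blocks a motion-compensation request touches:

```python
# Overhead ratio: pixels actually fetched (whole EC blocks covering the
# request) divided by pixels the MC request really needs.
def overhead(grid, x, y, w, h):
    """EC blocks are grid x grid; the MC request is w x h at pixel (x, y)."""
    cols = (x + w - 1) // grid - x // grid + 1   # blocks touched horizontally
    rows = (y + h - 1) // grid - y // grid + 1   # blocks touched vertically
    return (cols * rows * grid * grid) / (w * h)

# A 4x4 request straddling an 8x8 grid: 4 blocks (256 pixels) for 16 pixels.
print(overhead(8, 6, 6, 4, 4))   # -> 16.0
```

An aligned request of the same size as the grid gives a ratio of exactly 1, which is why smaller grids keep the overhead down.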
According to the H.264 standard, a 16x16 macroblock can be divided into
8x8, 8x16 or 16x8 blocks during motion compensation (MC). Furthermore, an 8x8
block can be sub-divided into 8x4, 4x8 or 4x4 sub-blocks. If the compensated
block is not aligned with the coded block grid, overhead occurs as depicted in
Fig. 11: four coded blocks have to be loaded and decoded to get the required
pixels. If the EC scheme is 8x8 block-based and the compensated block is a 4x4
block, we need to load and decode 256 pixels to derive 16 useful pixels; the
overhead in this case is 16. Because of the overhead problem, the relation
between the compression ratio of EC and the gain in memory transfer is not
direct.

Statistical material about the overhead phenomenon is provided by [15] for
the Stefan sequence, with three kinds of EC block grid. Since an H.264 encoder
allows macroblock (MB) partitioning and larger motion vectors at high rate
(which also means a small quantization step and better quality) and favors null
vectors with 16x16 partitions at low rate, the overhead increases as the bit
rate increases.
Fig. 12 The correlation between bit-rate and overhead (Stefan sequence) simulated with 4x4, 8x8 and 16x16 block grid
Table 2 [15] summarizes the statistical analysis over six sequences. We can
see that relatively still sequences (News, Weather) generate smaller overhead,
since the motion vector is often zero, while a fast-motion sequence such as
Stefan generates more overhead. Finally, an important conclusion is that the
smaller block grid outperforms the larger block grids.
Table 2 Overhead with EC block grid for each sequence

Sequence | 4x4 block grid | 8x8 block grid | 16x16 block grid
Foreman  | 1.31 | 1.77 | 3.69
Flower   | 1.30 | 1.74 | 3.77
News     | 1.14 | 1.51 | 2.78
Silent   | 1.17 | 1.50 | 3.22
Stefan   | 1.51 | 2.44 | 6.95
Weather  | 1.17 | 1.49 | 3.18
All      | 1.27 | 1.73 | 3.93
3.2 Algorithm of Embedded Compressor
We adopt a transform-based algorithm on a 4x4 block grid. The first reason
is the smallest overhead, according to the statistical results presented in the
previous section. It is actually a trade-off between coding efficiency and
overhead: for a transform algorithm, the bigger the block grid, the better the
coding efficiency it can achieve. Since we want good coding efficiency with less
overhead, the 4x4 block grid is our best choice.
The basic concept of the proposed algorithm is the combination of the DCT
with bit-plane zonal coding. The DCT is a well-known technique, so we introduce
it only briefly; the two proposed bit-plane zonal codings are the main
characters. Fine grain bit-plane zonal coding (FGBPZ) is quite efficient and
suitable for software applications, while coarse grain bit-plane zonal coding
(CGBPZ) is relatively simple and suitable for hardware implementation. Fig. 13
shows the coding flow of the proposed DCT-FGBPZ/CGBPZ algorithm. It is a
one-way, open-loop coding scheme, and no iteration is needed. The discrete
cosine transform (DCT) is divided into two one-dimensional DCTs, and the DCT
coefficients are packed by the proposed fine grain bit-plane zonal coding
(FGBPZ) or coarse grain bit-plane zonal coding (CGBPZ).
Fig. 13 The flow chart of proposed DCT-FGBPZ/CGBPZ embedded compression
3.2.1 Discrete Cosine Transform
The discrete cosine transform (DCT) is a powerful technique for converting
a signal into elementary frequency components. It is widely used in image
compression, JPEG being the best-known example.

Human eyes are more sensitive to the low-frequency components of a picture
and less sensitive to the high-frequency components, so quality loss in the
high-frequency components is relatively unnoticeable. The DCT places the
relatively important low-frequency components in the upper-left corner and the
highest frequencies in the lower-right corner. Thus the DCT, combined with
bit-plane zonal coding whose origin is the upper-left corner, can collect the
information efficiently.
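The separable 2-D DCT described above can be sketched as a floating-point reference model (the hardware uses a fixed-point variant; this is only an illustration of the two 1-D passes):

```python
# Reference model of the separable 4x4 2-D DCT: an orthonormal 1-D
# DCT-II applied to every row, then to every column of the result.
import math

def dct1d(v):
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct2d(block):
    rows = [dct1d(r) for r in block]                 # transform each row
    cols = [dct1d(list(c)) for c in zip(*rows)]      # then each column
    return [list(r) for r in zip(*cols)]             # transpose back

coeffs = dct2d([[10] * 4 for _ in range(4)])         # a uniform block
```

For a uniform block all the energy lands in the upper-left (DC) coefficient, which is exactly the concentration the zonal scan exploits.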
The biggest disadvantage of the DCT, however, is its hardware complexity.
Since our coding unit is a 4x4 block grid, the complexity of the 4-point DCT is
minor while the advantage of the transform is preserved. The complexity of
different DCT sizes is evaluated in Table 3, which shows two designs: design A
is referenced from [16] and design B from [17]; design B focuses on reducing
multiplications at the cost of more additions. In both designs, the 4-point DCT
is much simpler than the 8-point and 16-point DCTs.
Table 3 The complexity of N-point DCT (N = 2^m)

            Number of Multiplications    Number of Additions
 m    N          A        B                  A        B
 2    4          2        4                  6        9
 3    8         16       12                 26       29
 4   16        116       80                194      209
3.2.2 Proposed Fine Grain Bit Plane Zonal Coding (FGBPZ)
The modified bit-plane zonal coding proposed in [20] already achieves good coding efficiency, but we are not satisfied yet. To further improve the coding efficiency, we introduce a pre-determined variable length coding with a small codebook.
3.2.2.1 VLC Codebook
Before modifying the MBPZ of [20] further, we ran a simulation to evaluate the occurrence of each MBPZ type; Fig. 14 shows the result. The names of types A, B, C and D follow [20] (see Fig. 7). The appearance probabilities of type B and type C are relatively small although they have better coding efficiency. Type D is the dominant type, but its header takes 5 bits: one bit for distinguishing between types and 4 bits for RMAX/CMAX. Therefore, we want to improve the efficiency by adding a small pre-determined VLC codebook.
Fig. 14 The occurrence probability of each type in MBPZ (Type B: 16%, Type C: 11%, Type D: 73%)
In the modified bit-plane zonal coding of [20], the RMAX/CMAX of each bit-plane is accumulated bit-plane by bit-plane and is always larger than or equal to the RMAX/CMAX of the previous plane. Recall that type D is applied when RMAX/CMAX changes. Therefore, when type D is applied, the possible outcomes of the RMAX/CMAX in the next bit-plane are limited: they must be larger than the RMAX/CMAX of the previous plane.
For example, if the RMAX/CMAX of the current plane is 2/2 and the next plane is coded by type D, the possible RMAX/CMAX outcomes of the next plane are 3/2, 2/3 and 3/3. Notice that 2/2 is also a possible RMAX/CMAX for the next bit-plane, but type D only deals with the case where RMAX/CMAX differs from the previous bit-plane. These 3 possible outcomes can be fully represented by 1~2 bits instead of the original 4 bits. This explains the opportunity for reducing the codeword length in type D. Fig. 15 shows the coding flow of FGBPZ with the VLC codebook; this method saves header bits whenever type D is applied.
Fig. 15 Coding flow of FGBPZ with VLC codebook. Recall that types A, B, C and D are referred from [20].
We generate these codes by Huffman coding; the probabilities of the next possible RMAX/CMAX (Pcurrent RMAX/CMAX [next RMAX/CMAX]) are derived from simulation over 3000 frames. The codewords in this codebook are fixed.
To cover every possible RMAX/CMAX of the next bit-plane given the current plane, the needed codebook entries and their related RMAX/CMAX are shown in Table 4. The number of possible outcomes of the next RMAX/CMAX is given in (1). For a 4x4 bit-plane, the rows/columns are marked 0, 1, 2, 3. When type D is applied, at least one of the row or column maxima changes. The equation therefore counts the outcomes that are larger than or equal to the current RMAX/CMAX and then subtracts the one outcome in which both RMAX and CMAX equal those of the current bit-plane.
Next possible outcomes = (4 − Current_RMAX) × (4 − Current_CMAX) − 1    (1)
Table 4 The needed codebook entries and their related RMAX/CMAX

 Current      Number of next possible    Huffman code
 RMAX/CMAX    RMAX/CMAX outcomes         length (bits)
 ( 0, 1 )     11  ( 4*3-1 )              3~4
 ( 1, 0 )     11  ( 3*4-1 )              3~4
 ( 1, 1 )      8  ( 3*3-1 )              2~4
 ( 2, 0 )      7  ( 2*4-1 )              2~4
 ( 0, 2 )      7  ( 4*2-1 )              2~4
 ( 2, 1 )      5  ( 2*3-1 )              2~3
 ( 1, 2 )      5  ( 3*2-1 )              2~3
 ( 2, 2 )      3  ( 2*2-1 )              1~2
 ( 3, 0 )      3  ( 1*4-1 )              1~2
 ( 0, 3 )      3  ( 4*1-1 )              1~2
 ( 3, 1 )      2  ( 1*3-1 )              1
 ( 1, 3 )      2  ( 3*1-1 )              1
 ( 3, 2 )      1                         0
 ( 2, 3 )      1                         0
 Summary      67                         0~4
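Equation (1) and the outcome counts in Table 4 can be checked with a one-line sketch (the function name is illustrative):

```python
def next_outcomes(rmax, cmax):
    """Number of possible next-plane RMAX/CMAX values when type D is applied.

    Row/column maxima range over 0..3 and can only grow or stay equal, and
    type D excludes the single case where both stay unchanged, hence "- 1".
    """
    return (4 - rmax) * (4 - cmax) - 1
```

Entries with a single possible outcome, such as (3, 2), need no codeword at all (0 bits), since the decoder can infer the only legal successor.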
There is still room for codebook improvement. Consider the following two cases: case 1), the current RMAX/CMAX is 2/3 and the next RMAX/CMAX is 3/4; case 2), the current RMAX/CMAX is 3/2 and the next RMAX/CMAX is 4/3. With the original codebook, the codebook index for case 1 is {(2, 3), (3, 4)} and for case 2 is {(3, 2), (4, 3)}. The only real difference between case 1 and case 2 is the direction of row versus column; the two cases are similar even in the probability distribution of each possible "next RMAX/CMAX". If we swap rows and columns, the two cases undergo exactly the same changes. Based on this idea, we introduce our symmetric VLC codebook. By eliminating the bias between row and column, symmetric cases can share the same codeword. This reduces the 67-entry codebook to 40 entries, shrinking the codebook size by about 40%. The time spent on codebook searching is also reduced.
We now show how to use the symmetric VLC codebook. Let the current RMAX/CMAX be Cm_cur, Rm_cur and the previous RMAX/CMAX be Cm_pre, Rm_pre. The table look-up can be described as follows:

If (Cm_pre ≥ Rm_pre)
    the codeword at index {(Cm_pre, Rm_pre), (Cm_cur, Rm_cur)} is applied;
Else
    the codeword at index {(Rm_pre, Cm_pre), (Rm_cur, Cm_cur)} is applied.

Therefore, 40 codewords are enough.
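The look-up rule above can be sketched as a small helper (the function name is illustrative). The two symmetric cases from the earlier example map to the same canonical codebook key:

```python
def codebook_key(cm_pre, rm_pre, cm_cur, rm_cur):
    """Canonical index into the symmetric VLC codebook.

    Symmetric row/column cases share one codeword: if CMAX < RMAX for the
    previous plane, swap the roles of row and column before the look-up.
    """
    if cm_pre >= rm_pre:
        return (cm_pre, rm_pre), (cm_cur, rm_cur)
    return (rm_pre, cm_pre), (rm_cur, cm_cur)
```

Both {(2, 3), (3, 4)} and {(3, 2), (4, 3)} resolve to the same key, which is why 40 entries suffice.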
We now explain the decoding procedure with the symmetric VLC codebook. After the start plane is decoded, its RMAX/CMAX is known and can be used as a reference. The decoding procedure for the subsequent bit-planes is illustrated in (2):

If (Cm_pre ≥ Rm_pre)
    the codeword in block {(Cm_pre, Rm_pre)} is searched, and the result is in {(Cm_cur, Rm_cur)} order;
Else
    the codeword in block {(Rm_pre, Cm_pre)} is searched, and the result is in {(Rm_cur, Cm_cur)} order.
(2)
These switches between RMAX and CMAX in the encoding procedure need not be recorded, since they can be re-derived during decoding. The final VLC codebook, formed by eliminating the symmetric entries of Table 4, is shown in Table 5. A coding example for FGBPZ is shown in Fig. 16, and Table 6 lists the detailed codewords of the VLC codebook.
Table 5 The final 40-entry VLC codebook

 Current      Number of next possible    Huffman code
 RMAX/CMAX    RMAX/CMAX outcomes         length (bits)
 ( 1, 0 )     11  ( 3*4-1 )              3~4
 ( 1, 1 )      8  ( 3*3-1 )              2~4
 ( 2, 0 )      7  ( 2*4-1 )              2~4
 ( 2, 1 )      5  ( 2*3-1 )              2~3
 ( 2, 2 )      3  ( 2*2-1 )              1~2
 ( 3, 0 )      3  ( 1*4-1 )              1~2
 ( 3, 1 )      2  ( 1*3-1 )              1
 ( 3, 2 )      1                         0
 Summary      40                         0~4
Table 6 The overall codewords in the VLC codebook

 Current      Next       Codeword   Code length (bits)
 ( 1, 0 )    (1, 1)      000        3
             (2, 0)      001        3
             (2, 1)      010        3
             (3, 0)      011        3
             (2, 2)      100        3
             (1, 2)      1010       4
             (3, 1)      1011       4
             (3, 2)      1100       4
             (3, 3)      1101       4
             (2, 3)      1110       4
             (1, 3)      1111       4
 ( 1, 1 )    (2, 2)      00         2
             (2, 1)      100        3
             (1, 2)      101        3
             (3, 3)      110        3
             (3, 2)      111        3
             (2, 3)      010        3
             (3, 1)      0110       4
             (1, 3)      0111       4
 ( 2, 0 )    (2, 1)      00         2
             (3, 0)      01         2
             (3, 1)      100        3
             (2, 2)      101        3
             (3, 2)      110        3
             (3, 3)      1110       4
             (2, 3)      1111       4
 ( 2, 1 )    (2, 2)      00         2
             (3, 2)      01         2
             (3, 3)      10         2
             (3, 1)      110        3
             (2, 3)      111        3
 ( 2, 2 )    (2, 3)      00         2
             (3, 2)      01         2
             (3, 3)      1          1
 ( 3, 0 )    (3, 1)      0          1
             (3, 2)      10         2
             (3, 3)      11         2
 ( 3, 1 )    (3, 2)      0          1
             (3, 3)      1          1

3.2.2.2 Data Packing
Since our compression ratio is fixed at two, the budget for coded data is 64 bits. After the DCT and bit-plane zonal coding, the coded data must be packed into a 64-bit segment before being sent to external memory. First, 8 bits are reserved for the DC coefficient because of its importance in the transform. Second, 4 bits are used to pack the start plane. The rest of the budget, namely 52 bits, is used for storing AC coefficients. With fine grain bit-plane zonal coding, the AC coefficients are divided into bit-planes and represented by the coding format of Fig. 15. The procedure keeps packing bit-plane by bit-plane until the bit-planes are exhausted or the bit budget runs out.
Fig. 17 Protecting mechanism for unknown sign bit
When the budget runs out, unpacked information is lost. Recall that a newly significant coefficient must be followed by its sign bit. If a newly significant bit is packed while its sign bit is cut off, the coefficient will be decoded incorrectly. We add a mechanism to avoid this situation, shown in Fig. 17: if the next bit to pack is a newly significant bit and the remaining budget is less than two bits, we abort packing that newly significant bit and pack a "0" instead.
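A behavioral sketch of this protection follows; the bitstream representation here (a list of bit/flag pairs) is illustrative, not the coded format itself:

```python
def pack_with_protection(bits, budget=52):
    """bits: (bit, is_newly_significant) pairs in packing order.

    Bits are packed one by one until the budget runs out. If the next bit
    is a newly significant magnitude bit but fewer than two bits remain
    (so its sign bit would be cut off), a '0' is packed in its place.
    """
    packed = []
    for bit, is_new_sig in bits:
        if len(packed) == budget:
            break
        if is_new_sig and budget - len(packed) < 2:
            packed.append(0)  # protect: suppress the orphaned significant bit
        else:
            packed.append(bit)
    return packed
```

This guarantees that every significant bit reaching the decoder carries its sign bit, at the cost of at most one padded zero per packet.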
The final encoding flow chart is shown in Fig. 18.
3.3 Coarse Grain Bit-Plane Zonal Coding (CGBPZ)
The FGBPZ introduced in section 3.2.2 is simple and efficient. The algorithm encodes the coefficients at the bit level, but by our estimation its encoding procedure may cost more than 30 cycles and its decoding procedure more than 10 cycles. FGBPZ is therefore better suited to software or hardware/software co-design systems. To implement the algorithm as a hardware accelerator, it must be simplified further.
The discussion in chapter 6.1 will show the critical problems of embedding a compressor into a system. Taking all these problems into consideration, we propose coarse grain bit-plane zonal coding (CGBPZ). CGBPZ is a trade-off among short cycle count, parallelism, and quality. The details are presented in this section.
Fig. 19 shows the coding format of CGBPZ. All magnitude bit-planes of the AC coefficients are coded in a uniform format: for each bit-plane, we record its RMAX/CMAX (4 bits) and then pack the bits enclosed by RMAX and CMAX. The dependencies between bit-planes are not exploited in CGBPZ.
Fig. 19 CGBPZ coding format for the magnitude of AC coefficients
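Under this format, the cost of one magnitude plane is easy to sketch. The zone shape here is an assumption (the rectangle of rows 0..RMAX by columns 0..CMAX, with 2 bits each for RMAX and CMAX):

```python
def plane_cost(rmax, cmax):
    """Bits needed for one CGBPZ magnitude bit-plane: a 4-bit RMAX/CMAX
    header plus the enclosed zone, assumed here to be the rectangle of
    rows 0..rmax and columns 0..cmax."""
    return 4 + (rmax + 1) * (cmax + 1)
```

A full 4x4 plane thus costs 20 bits, while a DC-only plane costs 5.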
In CGBPZ we introduce the concept of the sign bit-plane, which can be considered the union of the sign bits of all coefficients. We only pack the sign bits that are actually used: since the budget is limited, failing to pack all the information may happen frequently, and because not every coefficient can be packed, packing the whole sign bit-plane would be wasteful. We therefore take the maximum RMAX and CMAX over the packed bit-planes (from start plane to end plane) and pack the sign bit-plane within those two boundaries. In this way the fewest bits are wasted on unused sign bits. The RMAX/CMAX of the sign bit-plane need not be packed during encoding, because they can be derived from the coded bit-planes. Fig. 20 illustrates how the RMAX/CMAX of the sign bit-plane are derived.
Fig. 20 The concept of how to derive the RMAX/CMAX of sign bit-plane from coded bit plane.
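The derivation in Fig. 20 amounts to a running maximum over the packed planes, sketched here (the function name is illustrative):

```python
def sign_plane_bounds(packed_planes):
    """Derive the sign bit-plane's RMAX/CMAX from the coded planes.

    packed_planes: (rmax, cmax) pairs for each packed magnitude bit-plane
    from the start plane to the end plane. The sign plane is bounded by
    the maximum RMAX and the maximum CMAX over those planes, so these two
    values never need to be stored in the packet.
    """
    rmax = max(r for r, _ in packed_planes)
    cmax = max(c for _, c in packed_planes)
    return rmax, cmax
```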
Finally, in CGBPZ the end plane must be determined and packed so that decoding can proceed. Fig. 21 shows the simple concept of the end-plane decision. From the MSB plane to the LSB plane, a calculator accumulates the total bit usage from the most significant plane (MSP) down to the current plane. If the total bit usage would exceed 64 bits, the previous plane becomes the end plane.
Fig. 21 End plane decision
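The accumulation in Fig. 21 can be sketched as follows. The 12-bit overhead is an assumption (the 8-bit DC field plus the 4-bit start-plane field), and the per-plane costs are supplied by the caller:

```python
def decide_end_plane(plane_costs, budget=64, overhead=12):
    """End-plane decision sketch.

    plane_costs: bits needed by each magnitude bit-plane, ordered from the
    most significant plane (MSP) downward. Plane costs are accumulated and
    the scan stops before the plane that would push the total past the
    64-bit budget; the index of the last plane that fits is returned
    (-1 if no plane fits).
    """
    used = overhead
    end_plane = -1
    for i, cost in enumerate(plane_costs):
        if used + cost > budget:
            break
        used += cost
        end_plane = i
    return end_plane
```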
The overall encoding flow is shown in Fig. 22. Finally, there is one small trick: by the description above, the bit usage accumulated up to the end plane is less than the bit budget, so a few bits remain unused. To make full use of them, we keep putting information into those unused bits within the remaining budget.
Fig. 22 Overall encoding flow of CGBPZ
3.4 Decoding Process and the Compensation
Roughly speaking, the decoding process can be thought of as the inverse of encoding: we take the coded data segments and split them into the DC coefficient and the AC coefficients.
Since the proposed algorithm is a lossy compression and the lower bit-planes of the AC coefficients are often truncated by the limited budget, we apply a simple compensation. The basic concept is shown in Fig. 23. The compensation is applied when a coefficient is nonzero and the end plane is above the least significant bit-plane. It can be viewed as adding the median value of the lost bit-planes, and it yields a satisfying quality improvement. Notice that this compensation is performed entirely at the decoder and needs no extra coded information.
Fig. 23 Proposed compensation technique
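A minimal sketch of this compensation, assuming the truncated planes are the lowest `lost_planes` bit-planes of each coefficient:

```python
def compensate(coef, lost_planes):
    """Decoder-side compensation sketch.

    lost_planes: number of truncated lower bit-planes. A nonzero
    reconstructed coefficient gets the midpoint of the lost range,
    2**(lost_planes - 1), added back with the coefficient's sign;
    zero coefficients are left untouched.
    """
    if coef == 0 or lost_planes <= 0:
        return coef
    mid = 1 << (lost_planes - 1)
    return coef + mid if coef > 0 else coef - mid
```

Adding the midpoint halves the expected magnitude error of a uniformly distributed truncated tail.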
3.5 Embedded Result on Software Simulation
Before all the discussion, we want to define the formula of PSNR calculation
first. All the PSNR values in this section are the PSNR between compressed
sequences versus the original sequence. The reason why we choose original sequence
as reference is to establish an absolute quality level. The equation of PSNR is given in
(3):
PSNR = 10 × log10( (255 × 255 × R × C) / Σ_{r=0}^{R-1} Σ_{c=0}^{C-1} (P_origin(r, c) − P_compressed(r, c))^2 )    (3)

3.5.1 FGBPZ versus CGBPZ
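Equation (3) corresponds to the following per-frame computation (an illustrative sketch; frames are plain 2-D lists of pixel values):

```python
import math

def psnr(origin, compressed, peak=255):
    """Per-frame PSNR between the original and compressed frames, as in (3)."""
    rows, cols = len(origin), len(origin[0])
    sse = sum((origin[r][c] - compressed[r][c]) ** 2
              for r in range(rows) for c in range(cols))
    if sse == 0:
        return float('inf')  # identical frames
    return 10 * math.log10(peak * peak * rows * cols / sse)
```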
This section compares the proposed fine grain bit-plane zonal coding (FGBPZ) and coarse grain bit-plane zonal coding (CGBPZ), to show the result of the trade-off between them. Fig. 24 shows the embedded result on the Foreman sequence with a group of pictures (GOP) of 20. The PSNR value decays along the P-frame number because each P frame is formed by referencing blocks in the previous frame: since every reference frame is compressed by our lossy EC algorithm, the errors are propagated and accumulated through the P frames. This phenomenon is also called the drift effect. Fig. 25 shows the drift effect on the Mobile Calendar sequence. Mobile Calendar is famous for its complex content and fast motion; these features make the Mobile sequence difficult to compress, so the quality loss may be larger than in slow-motion sequences.
Fig. 24 Drift effects on Foreman_QP28_GOP20
Fig. 25 Drift effects on Mobile_QP28_GOP20
PSNR loss (dB) on Foreman, FGBPZ vs. CGBPZ:

 QP              20     24     28     32
 FGBPZ_IP=1/9    4.02   2.26   1.24   0.61
 FGBPZ_IP=1/19   5.27   3.23   1.89   0.99
 FGBPZ_IP=1/29   6.22   4.01   2.41   1.34
 CGBPZ_IP=1/9    5.45   3.19   1.81   0.90
 CGBPZ_IP=1/19   6.99   4.36   2.60   1.34
 CGBPZ_IP=1/29   8.24   5.41   3.33   1.82

Fig. 26 PSNR loss results for different QP and different GOP (Foreman)

PSNR loss (dB) on Mobile, FGBPZ (fine) vs. CGBPZ (coarse):

 QP              20     24     28     32
 fine_IP=1/9    10.90   7.34   4.52   2.29
 fine_IP=1/19   12.41   8.73   5.64   3.06
 fine_IP=1/29   13.41   9.68   6.45   3.65
 coarse_IP=1/9  13.16   9.59   6.60   4.01
 coarse_IP=1/19 14.62  10.95   7.80   4.94
 coarse_IP=1/29 15.61  11.90   8.65   5.60
Fig. 27 PSNR loss results different QP and different GOP (Mobile)
Fig. 26 and Fig. 27 show the PSNR loss for different QP and different GOP. The PSNR loss increases with increasing GOP and tails off at higher QP values.
According to our simulation results over the sequences Akiyo, Foreman, Mobile and Stefan, with GOP 10, 20, 30 and QP 20, 24, 28, 32, the average quality difference between CGBPZ and FGBPZ is 1.5 dB. This shows that CGBPZ is a good trade-off between complexity and quality: a 1.5 dB PSNR drop buys a fast encoding procedure, from over 30 cycles (FGBPZ) down to 2 cycles (CGBPZ).
3.5.2 CGBPZ versus MHT
This section compares the proposed coarse grain bit-plane zonal coding (CGBPZ) with the modified Hadamard transform (MHT). CGBPZ is what we use for hardware implementation and system integration; considering the requirement of high-speed processing, we compare CGBPZ with the MHT work. Fig. 28 shows the embedded result on Foreman with a GOP of 20. The proposed DCT with CGBPZ performs better and effectively slows the rate of quality decay compared with the MHT work. Fig. 29 shows the drift effect on the Mobile Calendar sequence.
Fig. 28 Drift effects on Foreman_QP28_GOP20
Fig. 29 Drift effects on Mobile_QP28_GOP20
Fig. 30 PSNR loss results for different QP and different GOP (Foreman)

PSNR loss (dB) on Mobile, MHT vs. CGBPZ:

 QP              20     24     28     32
 mht_IP=1/9     15.69  12.24   9.11   6.09
 mht_IP=1/19    17.97  14.53  11.24   7.99
 mht_IP=1/29    19.48  16.02  12.67   9.34
 CGBPZ_IP=1/9    8.71   5.64   3.29   1.53
 CGBPZ_IP=1/19  10.06   6.91   4.27   2.16
 CGBPZ_IP=1/29  11.05   7.83   4.94   2.64
Fig. 31 PSNR loss results different QP and different GOP (Mobile)
Fig. 30 and Fig. 31 show the PSNR drop for different QP and different GOP. According to our simulation results over the sequences Akiyo, Foreman, Mobile and Stefan, with GOP 10, 20, 30 and QP 20, 24, 28, 32, the average quality difference between DCT plus CGBPZ and MHT is 7.12 dB. This shows the coding gain of the proposed algorithm over the MHT work.
Chapter 4
Proposed Embedded Compressor/Decompressor Architecture
Sections 4.1 and 4.2 introduce the hardware designs of the proposed embedded compressor and decompressor, respectively. The architectures are designed to fit the specification in chapter 6.1.
4.1 Architecture of Encoder Design
The overall block diagram of the embedded compressor is shown in Fig. 32.
4.1.1 The Architecture of the Two-Dimensional Discrete Cosine Transform
The DCT hardware design follows Lee's architecture [16], which maintains the same performance as the original DCT while reducing the number of multiplications to about half of those required by existing efficient algorithms. This lets us take advantage of the DCT without suffering its full hardware complexity. Notice in Table 3 that [16] uses more multiplications than [17] for the 4-point DCT. However, each multiplication in [16] has one constant input and one variable input, while each multiplication in [17] has two variable inputs. In our experience, the synthesized area of a multiplier with one constant input is about 1/3 of that of a multiplier with two variable inputs. Therefore, the referenced design [16] still beats design [17] for the 4-point DCT.
4.1.2 The Architecture of Coarse Grain Bit-Plane Zonal Encoding and Data Packing
A combinational block processes the coefficients to derive the RMAX/CMAX and plane content of each plane. To serialize the plane information in one cycle, we propose a content-adaptive ripple connector; the basic concept is shown in Fig. 33. The 10 lines on the left represent the 9 plane contents plus 1 sign bit-plane content. Each connector represents a shifting stage, whose detailed structure is shown in Fig. 34. Through the ripple behavior, the wire at the end of the chain carries the fully connected result. Since this embedded compressor is integrated into our 100 MHz decoder, one cycle is enough to finish the ripple processing.
Fig. 33 Content adaptive ripple connecter
Fig. 34 The architecture of a single connecter in Fig. 33
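Behaviorally, the ripple chain concatenates the variable-length plane segments: each stage shifts the word accumulated so far by the length of the incoming segment and merges the segment in. A software sketch of this behavior (segment representation is illustrative):

```python
def ripple_connect(segments):
    """Serialize variable-length segments as the ripple connector does.

    segments: (value, bit_length) pairs -- e.g. the 9 coded magnitude
    planes plus the sign plane. Each stage shifts the accumulated word by
    the incoming segment's length and ORs the segment in; the final wire
    carries the packed word and its total bit length.
    """
    word, total = 0, 0
    for value, n in segments:
        word = (word << n) | value
        total += n
    return word, total
```

In hardware, the running bit count plays the role of the ripple offset that configures each shifting stage.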