National Chiao Tung University
Department of Electronics Engineering
Institute of Electronics, Master Program

Master Thesis

應用於行動式視訊裝置之嵌入式壓縮器解壓縮器設計
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications

Student: Yu-De Wu
Advisor: Prof. Chen-Yi Lee

July 2008
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications
Student: Yu-De Wu
Advisor: Chen-Yi Lee

National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics
Master Program

A Thesis Submitted to Institute of Electronics
College of Electrical Engineering and Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of Master of Science
in Electronics Engineering

July 2008
Hsinchu, Taiwan, Republic of China
Design of An Embedded Compressor/Decompressor for Mobile Video Applications

Student: Yu-De Wu    Advisor: Prof. Chen-Yi Lee

National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics

Abstract (Chinese)
This thesis proposes a lossy embedded compressor/decompressor design suitable for mobile video devices. By using lossy data compression to reduce the amount of data transferred between the chip and the external memory, a slight loss of video quality is traded for several benefits at once: a smaller external memory requirement, lower bandwidth usage, and reduced energy consumption.

The proposed algorithm combines the two-dimensional discrete cosine transform with coarse grain bit-plane zonal coding. Under a fixed compression ratio of two, each 4x4 pixel array is compressed into a 64-bit packet. The 4x4 pixel array is first converted by the 2-D discrete cosine transform into sixteen coefficients of different frequencies; the coefficients are then encoded and packed by coarse grain bit-plane zonal coding and sent to the external memory. A simple compensation method is also proposed for the decompression process to mitigate the data loss caused by lossy compression.

The proposed hardware architecture can be embedded in a video decoder and supports HD1080 at 30 frames per second with a 100 MHz operating frequency. Since the compression ratio is fixed at two, the compressed packets have a fixed size, memory address translation is very simple, and random access by the motion compensation (MC) unit is supported. In UMC 90 nm process technology, the proposed hardware costs 30k logic gates. Compressing one macroblock (MB) takes 72 cycles, while decompressing one MB takes only 34 cycles. Overall, the memory accesses of the whole system are reduced by 40% of the original.
Design of An Embedded Compressor/Decompressor
for Mobile Video Applications
Student : Yu-De Wu Advisor : Dr. Chen-Yi Lee
Department of Electronics Engineering
Institute of Electronics
National Chiao Tung University
ABSTRACT
This thesis proposes an embedded compressor/decompressor for mobile video
applications. It uses a lossy compression scheme to reduce the amount of data
transferred between the chip and the external memory. This lossy compression
maintains acceptable video quality while reducing the required external memory
size, the bandwidth requirement, and the power consumed on memory access.

The proposed algorithm is composed of the discrete cosine transform (DCT) with
coarse grain bit-plane zonal coding (CGBPZ). The compression ratio is two: each
4x4 pixel array is compressed into a 64-bit segment. First, the two-dimensional
discrete cosine transform converts the 16 pixels into 16 elementary frequency
components, which coarse grain bit-plane zonal coding then packs into the coded
segment. A compensation scheme is also proposed for decoding.

The hardware architecture of the proposed algorithm can be embedded into a
video decoder and supports HD1080 at 30 frames per second at 100 MHz. Since the
compression ratio is fixed at two, the coded segments have a fixed size and can be
randomly accessed by the motion compensation unit. The gate count is 30K when
synthesized with UMC 90 nm CMOS technology. It costs 72 cycles to encode an MB and
34 cycles to decode an MB. The overall reduction ratio on memory access is 40%.
Compared with the power consumed by the proposed design, the amount of power saving
Acknowledgements
The two years in the SI2 Lab have been precious days in my life. First, I would like to express my deepest gratitude to my advisor, Dr. Chen-Yi Lee. He always guided me enthusiastically and patiently and offered encouragement at the right moments, from which I benefited greatly during my two years in the master program. I sincerely wish him all the best. Next, I would like to thank the Ph.D. students of the multimedia group, 劉子明 and 李曜; their training and tireless guidance laid a solid foundation for my research. 阿龍 and 義閔 also gave me much help in the professional field. Special thanks go to Prof. 蔣迪豪 and Dr. 鍾菁哲, who offered not only sharp advice on research but also detailed analyses of careers and the job market. To my classmates 韋磬, Amos, bluer, 俊廷, 琇茹, 清峰, 建螢, 點子, 茗智, junior 昱帆, and every other member of SI2: thinking together, helping one another, and joking at the right time made my research life full yet never dull. I also thank my roommates 宗學, 良諺, 碩宇, 振祐, 育瑋 and 文炫 for filling my dormitory life with joy. Finally, I thank my family and my friends; with your support, devotion and encouragement, I could move forward wholeheartedly. May you always be healthy, happy and content.
Index

Chapter 1 Introduction
1.1 Motivation
1.2 Thesis Organization
Chapter 2 Previous Works
2.1 Lossless Embedded Compression Schemes
2.2 Lossy Embedded Compression Scheme
2.2.1 Transform-Based Lossy Embedded Compression
2.2.2 Delta Pulse Code Modulation Lossy Embedded Compression
2.2.3 Other Embedded Lossy Compression
2.3 Bit-Plane Coding
2.3.1 Bit-Plane Truncation Coding (BPT)
2.3.2 Bit-Plane Zonal Coding (BPZ)
2.3.3 Modified Bit Plane Zonal Coding
2.4 Summary
Chapter 3 Proposed Embedded Compression Algorithm
3.1 Overview
3.2 Algorithm of Embedded Compressor
3.2.1 Discrete Cosine Transform
3.2.2 Proposed Fine Grain Bit Plane Zonal Coding (FGBPZ)
3.3 Coarse Grain Bit-Plane Zonal Coding (CGBPZ)
3.4 Decoding Process and the Compensation
3.5 Embedded Result on Software Simulation
3.5.1 FGBPZ versus CGBPZ
Chapter 4 Proposed Embedded Compressor/Decompressor Architecture
4.1 Architecture of Encoder Design
4.1.1 The Architecture of Two Dimensions Discrete Cosine Transform
4.1.2 The Architecture of Coarse Grain Bit-Plane Zonal Encoding and Data Packing
4.1.3 The Architecture of End Plane Calculation
4.1.4 Overall Encoder Design
4.2 Architecture of Decoder Design
4.2.1 Architecture of Data Unpacking, Bit-Plane Zonal Decoding and Compensation
4.2.2 Architecture of Two Dimensions Discrete Cosine Transform
4.2.3 Overall Decoder Design
Chapter 5 Design Implementation and Verification
5.1 Design Implementation
5.2 Design Verification
Chapter 6 System Integration and Experimental Results
6.1 System Analysis
6.1.1 Interface
6.1.2 Overhead Problem
6.1.3 Processing Cycles Problem
6.2 System Integration
6.2.1 Access Reduction
6.2.2 Processing Cycles Problem
6.2.3 Access Reduction Ratio
Chapter 7 Conclusion and Future Work
7.1 Conclusions
7.2 Future Work
Figure Index

Fig. 1 Bit-plane truncation: AC coefficients are packed from the start plane; due to the limited packing budget, coefficient bits of the lower digit planes surrounded by the dashed line are truncated
Fig. 2 Coding format for bit-plane truncation coding (BPT)
Fig. 3 The concept of bit-plane
Fig. 4 Coding procedure of the BPZ algorithm
Fig. 5 An example of BPZ coding
Fig. 6 New packing data format (BPZ) versus BPT
Fig. 7 Coding procedure of the MBPZ algorithm
Fig. 8 An example of MBPZ coding
Fig. 9 Compensation for a bit-truncated AC coefficient
Fig. 10 Pixel-based (left) versus block-based (right)
Fig. 11 An example of the overhead problem
Fig. 12 The correlation between bit-rate and overhead (Stefan sequence)
Fig. 13 The flow chart of the proposed DCT-FGBPZ/CGBPZ embedded compression
Fig. 14 The occurrence probability of each type in MBPZ
Fig. 15 Coding flow of FGBPZ with the VLC codebook (types A, B, C and D are referred from [20])
Fig. 16 A coding example for FGBPZ
Fig. 17 Protecting mechanism for the unknown sign bit
Fig. 18 Final encoding flow chart
Fig. 19 CGBPZ coding format for the magnitude of AC coefficients
Fig. 20 The concept of deriving the RMAX/CMAX of the sign bit-plane from the coded bit-plane
Fig. 21 End plane decision
Fig. 22 Overall encoding flow of CGBPZ
Fig. 23 Proposed compensation technique
Fig. 24 Drift effects on Foreman_QP28_GOP20
Fig. 25 Drift effects on Mobile_QP28_GOP20
Fig. 26 PSNR loss considering different QP and different GOP (Foreman)
Fig. 27 PSNR loss considering different QP and different GOP (Mobile)
Fig. 28 Drift effects on Foreman_QP28_GOP20
Fig. 29 Drift effects on Mobile_QP28_GOP20
Fig. 30 PSNR loss results for different QP and different GOP (Foreman)
Fig. 31 PSNR loss results for different QP and different GOP (Mobile)
Fig. 32 Overall block diagram of the embedded compressor
Fig. 33 Content adaptive ripple connecter
Fig. 34 The architecture of a single connecter in Fig. 33
Fig. 35 The architecture of end plane calculation
Fig. 36 Overall encoder design
Fig. 37 Overall block diagram of the embedded decompressor
Fig. 38 Overall decoder design
Fig. 39 The flow of design verification
Fig. 40 The overall system block diagram
Fig. 41 System interface design for the embedded codec
Fig. 42 Best case on data fetching
Fig. 43 Worst case: sub-pixel case
Fig. 44 Power analysis on CIF @ 5.3 MHz
Table Index

Table 1 Coding types of bit-plane proposed in [20]
Table 2 Overhead with EC block grid for each sequence
Table 3 The complexity of the N-point DCT
Table 4 The needed codebook entries and their related RMAX/CMAX
Table 5 The final 40-entry VLC codebook
Table 6 The overall codewords in the VLC codebook
Table 7 Summary of hardware design
Table 8 Overall cases of read access requested by MC with/without EC
Table 9 Full cases of "EC decode" cycles plus original "MC data read" cycles
Chapter 1
Introduction
1.1 Motivation
To improve video coding efficiency, eliminating the temporal redundancy
between frames is a useful technique. It is widely used in today's video coding
standards such as MPEG-1/2/4, H.263 and H.264. But to apply this technique during
encoding or decoding, at least one previous frame must be stored in frame memory
as a reference. The accesses between the external memory and the decoder chip
consume a lot of power, and the rapid data accesses of motion compensation
dominate the power consumption of the whole system.
For a mobile device, power is always a critical issue. Although the power
consumed on chip can be reduced by many low-power techniques, data transfer
still consumes a lot of power. Therefore, minimizing memory access operations is
a key consideration in the hardware design of mobile video devices.

Embedded compression is a technique to reduce the data transfer and the
size of the off-chip frame memory. Since mobile video devices suffer from limited
battery life, and the visual quality criterion is not so strict due to the small
display screen, we aim to reduce the bandwidth requirement while maintaining
acceptable visual quality.

Nowadays, mobile devices become more and more powerful through their various
functions. Reducing the bandwidth and resource requirements of each hardware
component therefore benefits the whole system.
1.2 Thesis Organization
This thesis is organized as follows. First, a basic introduction to
compression schemes and a review of prior works are given in Chapter 2. The
proposed lossy embedded compression algorithm is presented in Chapter 3. To
integrate with an H.264/AVC decoder, some constraints need to be specified, and
the proposed algorithm must be modified to fit those constraints in the hardware
design. The modified algorithm and hardware architecture are presented in
Chapter 4. Moreover, the simulation results of the proposed algorithm integrated
with an H.264/AVC HDTV decoder are also presented in that chapter. The design
implementation, integration and verification are shown in Chapter 5. Chapter 6
shows the experimental results and performance comparison. Finally, the
conclusions and future work are given in Chapter 7.
Chapter 2
Previous Works
Basically, compression techniques can be divided into two types: lossless
compression and lossy compression. In this chapter, we briefly introduce the
algorithms that have been proposed before. Bit-plane coding, which can serve as
either a lossy or a lossless coding method, is introduced in Section 2.3; its
concept is used in our proposed methods.
2.1 Lossless Embedded Compression Schemes
Many lossless compression methods have been proposed. The benefit of
lossless compression is obvious: it preserves the information while cutting down
the data size. Embedding a lossless compression mechanism into a video system is
quite acceptable, since it causes no drift effect in either an encoder system or
a decoder system.

However, behind those advantages, lossless compression suffers from a
variable data amount after compression. By information theory, even for an ideal
lossless compressor, the information content of the source data still governs
the compression ratio: the more information the source data contains, the longer
the coded data is. This instability is the fatal weakness of lossless embedded
compression. Embedded compression schemes exist to reduce the number of accesses
to the external memory and to reduce the external memory size. However, the
variable data amount after lossless compression can guarantee neither the memory
size, which must be provisioned for the worst case, nor the bandwidth reduction,
since the compressed data amount is unknown. A study of lossless compression is
shown in [2].
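The variable output size is easy to observe with any off-the-shelf lossless codec. The following toy sketch is our own illustration (using Python's zlib, not a scheme from the cited works): the same codec shrinks a flat 16x16 block to a few bytes but cannot shrink a noise-like block at all.

```python
# Illustration of why lossless EC is hard to budget: the same lossless
# codec produces very different output sizes depending on how much
# information the source block contains.
import random
import zlib

random.seed(0)
flat = bytes([128] * 256)                                 # a uniform 16x16 block
noisy = bytes(random.randrange(256) for _ in range(256))  # a noise-like block

flat_size = len(zlib.compress(flat))    # a handful of bytes
noisy_size = len(zlib.compress(noisy))  # about the original size, or more
```

A frame buffer sized for the flat case would overflow on the noisy one, which is why a guaranteed reduction requires a fixed-ratio (lossy) scheme.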
2.2 Lossy Embedded Compression Scheme
Lossy compression with a fixed compression ratio is suitable for reducing
the size of the frame memory and the bandwidth, since the predictable amount of
compressed data can guarantee the reduction. Therefore, lossy embedded
compression is more popular than lossless embedded compression for solving this
bandwidth reduction problem. [3] – [14] are previous works on lossy compression.
2.2.1 Transform-Based Lossy Embedded Compression
Transform-based coding is a popular way to compose a lossy embedded
compression. It converts a signal into elementary frequency components. Given
the characteristics of the human visual system, a lower frequency component is
more noticeable than a higher frequency component, so applying quantization and
data collection to each component according to its visual priority is an
efficient way to collect data within a limited data budget. The work in [3] uses
the Hadamard transform, quantizes the coefficients by their priority, and then
encodes the quantized coefficients with Golomb-Rice coding. Golomb-Rice coding
is an efficient method that can nearly reach the coding ability of Huffman
coding when a suitable K factor is selected; however, since that paper pursues
low complexity, it chooses fixed K values according to simulation. The design
operates at 100 MHz, and encoding or decoding a MB each takes 33 cycles.
2.2.2 Delta Pulse Code Modulation Lossy Embedded Compression
Delta pulse code modulation (DPCM) is another popular basis for lossy
compression. Since neighboring data have relatively small differences, the
information after DPCM can be efficiently reduced compared with the source data.

[4] uses DPCM as the base coding method and takes the intra prediction
modes from the H.264 video coding standard to find the best direction in which
to perform DPCM. This idea makes the algorithm adapt to each video pattern and
achieves better quality than [3].

However, the good performance of the DPCM method comes at a cost. The DPCM
method needs to fit every difference into a limited budget, but those
differences are not always as small as we wish. To derive the best quantization
level that fits every difference into the budget, this DPCM-based method needs
several iterations, which prevents a pipelined scheme. And to avoid a large gate
count, it is more practical to process the subtractions clock by clock instead
of in a parallel architecture. This, however, leads to longer coding cycles and
becomes a heavy timing load on the original system. From the viewpoint of system
integration, the operating frequency must be increased or the system throughput
slowed down to perform this DPCM-based embedded compression scheme.
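The iteration issue can be sketched in a few lines. This is a minimal DPCM toy of our own, not the algorithm of [4]: the quantization shift is raised and the whole block re-coded until every quantized difference fits a hypothetical per-sample bit budget.

```python
# Minimal DPCM sketch: code differences between neighbors; if any
# quantized difference overflows the per-sample budget, coarsen the
# quantization (larger shift) and re-encode the whole block.
def dpcm_encode(samples, bits=4):
    limit = (1 << (bits - 1)) - 1          # signed range is [-limit-1, limit]
    shift = 0
    while True:
        prev, deltas, ok = samples[0], [], True
        for s in samples[1:]:
            d = (s - prev) >> shift
            if not -limit - 1 <= d <= limit:
                ok = False                 # budget blown: must restart
                break
            deltas.append(d)
            prev += d << shift             # track the reconstructed value
        if ok:
            return shift, deltas
        shift += 1                         # coarser quantization, retry

shift, deltas = dpcm_encode([100, 103, 110, 95, 96])
```

Because the pass/fail outcome is known only after scanning the whole block, results cannot stream out in pipeline fashion; this is the timing burden discussed above.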
2.2.3 Other Embedded Lossy Compression
There are still many other approaches to lossy compression, such as
adaptive vector quantization (VQ) [11], a down-sampling based compression
algorithm [12], and an adaptive mechanism that chooses which method to use. The
latter claims that better performance can be achieved by choosing the algorithm
adaptively to fit the features of the video sequence. DWT with SPIHT in [14] is
another transform approach, and the algorithm in [14] can perform both lossy and
lossless coding with the same architecture.
We can see that lossy embedded compression is truly the mainstream.
However, it suffers from quality loss and the drift effect. Therefore, how the
lossy coding methods are organized is very important: covering as much
information as possible within a limited budget is the main challenge of lossy
compression.
2.3 Bit-Plane Coding
Bit-plane zonal coding is a well-known coding method widely used in many
compression algorithms. It uses the bit-plane as its basic unit, encoding a
group of numbers instead of individual numbers. It can be combined into a lossy
or lossless compression scheme by adjusting the bit-storage budget: with a
sufficient bit budget it can fully represent the group of numbers, while with an
insufficient budget it may lose some information in the lower bits and thus
becomes lossy. The details of bit-plane zonal coding are shown in the following
sections.
2.3.1 Bit-Plane Truncation Coding (BPT)
Before introducing the proposed bit-plane zonal coding, we introduce the
basic concept first. Bit-plane truncation coding is the prototype of bit-plane
zonal coding. Consider the sixteen coefficients of a 4x4 block after the DCT: we
can simply classify them into one DC coefficient and 15 AC coefficients. The
idea of bit-plane coding is to collect data per bit-plane (that is, to take the
N-th bit of each coefficient as a union) rather than per individual coefficient.
When we want to further analyze a group of numbers and cut them into several
parts by their importance, separating them into bit-planes is a good idea.
Moreover, for a group of coefficients, the upper bit-planes are zero most of the
time, so recording the start plane is a smart way to improve the coding
efficiency. For a group of 4x4 coefficients of N bits each, about ⌈log2 N⌉ bits
are needed to record the start plane, but each skipped bit-plane saves 15 zero
bits. After bit-plane truncation coding, the coded format is as shown in Fig. 2.
Fig. 1 Bit-plane truncation: AC coefficients are packed from the start plane. Due to the limitation of packing budget, coefficient bits of lower digit plane surrounded by
dash line will be truncated.
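As a concrete illustration of the start-plane idea, the following software sketch (our own, not the thesis hardware) scans the magnitude bit-planes from the most significant one down, skipping the all-zero upper planes:

```python
# Sketch of start-plane detection for a group of AC coefficient
# magnitudes, as in bit-plane truncation coding.
def start_plane(coeffs, nbits=8):
    """Index of the highest bit-plane containing a nonzero bit (MSP = nbits-1)."""
    for plane in range(nbits - 1, -1, -1):
        if any((abs(c) >> plane) & 1 for c in coeffs):
            return plane
    return 0  # all-zero group

def bit_plane(coeffs, plane):
    """Collect the given bit of each coefficient magnitude as one bit-plane."""
    return [(abs(c) >> plane) & 1 for c in coeffs]

ac = [13, -6, 3, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0]   # 15 AC coefficients
sp = start_plane(ac)                     # 13 = 0b1101, so the start plane is 3
planes = [bit_plane(ac, p) for p in range(sp, -1, -1)]  # planes actually coded
```

Every plane above `sp` is all zeros and is represented implicitly by the recorded start plane.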
2.3.2 Bit-Plane Zonal Coding (BPZ)
However, BPT has poor performance, and the image quality must be enhanced
by another approach that reduces the energy loss of the DCT coefficients. In
this section, an improved coding algorithm named bit-plane zonal coding (BPZ)
[18] is described in detail. Like BPT, BPZ packs the DCT coefficients bit-plane
by bit-plane, but the packing scheme is quite different. We will show that the
packing efficiency of BPZ is much better than that of BPT.
The word "zonal" refers to encoding a bit-plane by its zonal
characteristic. Fig. 3 shows a possible bit-plane. The coefficients with larger
magnitude tend to be gathered at the upper-left corner (lower horizontal or
vertical frequencies) by the DCT, so the bits at the lower-right corner of a
bit-plane tend to be zero. Furthermore, the data of an individual DCT block
often has a bias toward either the horizontal or the vertical direction. By
describing the maximum row and column number of valid data in the scan zone,
named RMAX and CMAX respectively, we have a high probability of representing the
information of a bit-plane in fewer than 15 bits. Therefore, a signal-dependent
rectangular scan zone starting from the upper-left corner performs well.
Fig. 3 The concept of bit-plane
Two classes of coefficients, significant and insignificant, are defined. In
the encoding flow, a significant coefficient has a 1 in one of the higher,
already-coded bit-planes; on the contrary, an insignificant coefficient has all
0's in the higher bit-planes.

The zones represented by RMAX/CMAX are often very similar between
neighboring bit-planes. This data similarity allows a more efficient coding
mechanism to be developed.
The detailed coding flow is as follows. For a block of DCT coefficients,
the process is divided into a DC flow and an AC flow. In the DC flow, the DC
coefficient is completely packed to avoid significant quality degradation, as in
BPT. The AC flow follows the procedure shown in Fig. 4. Initially, all AC
coefficients are marked as insignificant. Then we start from the most
significant plane (MSP) and encode the subsequent bit-planes. The first plane
that contains a nonzero bit is defined as the start plane, and the nonzero bits
in the start plane are newly significant coefficients, so a sign bit is inserted
behind each nonzero bit. For each subsequent bit-plane there is only one
question: does it contain a newly significant bit? If so, a bit "1" is packed
first to announce that a newly significant bit is found, followed by the
corresponding sign bits. Already-significant and insignificant bits need not be
followed by sign bits, since the sign bits of significant coefficients are
already packed and the sign bits of insignificant coefficients are useless so
far. Notice that, unlike the fully packed sign bits in BPT, the sign bits in BPZ
are packed on demand.

If no newly significant bit appears in the current bit-plane, a bit "0" is
inserted to indicate that the RMAX/CMAX of the current bit-plane is the same as
that of the previous bit-plane, and only the bits at the positions of
significant coefficients need to be packed. BPZ repeats this procedure until all
bit-planes have been packed. From the on-demand sign bits and the handling of
bit-planes with no newly significant coefficient, we can see the efficiency of
BPZ and why it achieves better performance than BPT.
Fig. 4 Coding procedure of BPZ algorithm
An example of bit-plane classification is illustrated in Fig. 5. As in BPT,
the start plane of the DCT coefficients is packed as part of the header
information, but the sign bits of a DCT coefficient block are no longer part of
the header: they are dispersed and accompany the newly significant coefficients
found in certain bit-planes. The header information is shortened, leaving more
packing budget for the AC coefficients.
Fig. 5 An example of BPZ coding
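The zonal idea can be sketched as follows; this toy function is our own illustration for an assumed 4x4 bit-plane and derives the RMAX/CMAX pair that bounds all nonzero bits:

```python
# Sketch of the zonal scan in BPZ: the scan zone is the smallest
# upper-left rectangle covering every 1 bit of the plane, described
# by its maximum valid row (RMAX) and column (CMAX).
def zone(plane):
    """Return (RMAX, CMAX) for a 4x4 bit-plane given as 16 bits, row-major."""
    rmax = cmax = 0
    for i, bit in enumerate(plane):
        if bit:
            rmax = max(rmax, i // 4)
            cmax = max(cmax, i % 4)
    return rmax, cmax

plane = [1, 0, 0, 0,
         1, 1, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0]
r, c = zone(plane)            # (1, 1): all 1s lie in the top-left 2x2 zone
zone_bits = (r + 1) * (c + 1) # only 4 plane bits packed instead of 16
```

Packing the (RMAX+1) x (CMAX+1) zone plus 4 bits for RMAX/CMAX costs 8 bits here, versus 16 for the raw plane.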
2.3.3 Modified Bit Plane Zonal Coding
If we look more closely at the BPZ algorithm through the example shown in
Fig. 5, we discover that the original BPZ algorithm can be further improved. For
a software application, adding a little complexity can gain more coding
efficiency; a mechanism with a good trade-off between complexity and coding
efficiency is proposed in [20].

The starting point is to use the limited budget more efficiently. Looking
carefully at the coding types of bit-plane zonal coding (BPZ), we find an
inefficient format for handling the occurrence of a newly significant
coefficient, because it carries the longest header information: every time a
newly significant bit is found, we need to pack 4 bits for RMAX/CMAX and one bit
to distinguish the coding format. However, the four bits of RMAX/CMAX are not
really necessary, since the RMAX/CMAX may be the same as in the previous
bit-plane. Therefore, [20] proposes a new coding format for this situation. The
new format is adopted when a newly significant bit is found but the RMAX/CMAX of
the current bit-plane is the same as in the previous bit-plane; the overall
coding types are shown in Table 1. The drawback is one more bit to distinguish
the new type C from the original type B; the advantage is saving four bits
compared with the original coding format. Fig. 7 shows the coding flow of the
modified bit-plane zonal coding proposed in [20].
Table 1 Coding types of bit-plane proposed in [20]

Type | Newly Sig. Coef. | Rmax/Cmax Changed | Flag | Bits for Rmax/Cmax | Bits for Flag(s) and Rmax/Cmax
A    | Yes              | Yes               | None | 4                  | 4
B    | No               | No                | 00   | None               | 2
C    | Yes              | No                | 01   | None               | 2
D    | Yes              | Yes               | 1    | 4                  | 5
Fig. 7 Coding procedure of MBPZ algorithm
An example of the modified bit-plane zonal coding (MBPZ) proposed in [20]
is given in Fig. 8. The bit streams at the bottom of the figure are coded by the
original BPZ and by MBPZ respectively; through this comparison we can clearly
see the benefit brought by MBPZ. There is one small technique here: when packing
a bit-plane of AC coefficients, we collect bits in zigzag scan order. Since the
human visual system is more sensitive to low-frequency signal elements, the
zigzag scan order stores the relatively important low-frequency bits first when
the packing budget runs out.
Fig. 8 An example for MBPZ coding
When MBPZ encodes the AC coefficients within a limited budget, quality loss
is inevitable. To slightly compensate for the truncated data bits, [20] also
proposes a method to raise the quality. First, if the magnitude of a decoded
coefficient is greater than or equal to 4, scan it from the LSB to find the
first nonzero bit, and then set a "1" two digits below it. If the magnitude is
less than 4, nothing is changed. Finally, recover the coefficients with the
corresponding sign bits.
Fig. 9 Compensation for a bit-truncated AC coefficient.
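Our reading of this compensation rule can be sketched as follows (the exact bit position is our interpretation of the description in [20], so treat it as illustrative):

```python
# Sketch of the bit-truncation compensation: for a decoded magnitude
# >= 4, set a 1 two binary digits below its lowest nonzero bit, roughly
# centering the value within the truncated range. Smaller magnitudes,
# and cases where no such lower digit exists, are left untouched.
def compensate(magnitude):
    if magnitude < 4:
        return magnitude
    lowest = (magnitude & -magnitude).bit_length() - 1  # position of lowest 1 bit
    if lowest >= 2:
        magnitude |= 1 << (lowest - 2)
    return magnitude

# e.g. a coefficient decoded as 8 (0b1000) becomes 10 (0b1010)
```

The corrected magnitude is then re-signed with the decoded sign bit.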
2.4 Summary
From the introduction and discussion above, we classify the existing
algorithms into two basic types and briefly introduce their pros and cons. Lossy
compression is the popular way to implement an embedded compressor because of
its fixed compression ratio and fixed amount of coded data. However, good
performance usually comes at the cost of longer processing time, while low
complexity usually brings worse quality. The former kind of method achieves
better performance but may require a large buffer, and longer processing cycles
enlarge the load on the system and raise the barrier to embedding this extra
function. Although slowing down the system or increasing the operating frequency
can fix this problem, the former decreases the coding throughput and the latter
increases the power consumption; neither drawback is what we want. Some lossy
schemes are easy to embed into a decoder system as far as hardware is concerned,
but at the same time they often suffer from unsatisfactory quality.

For a real-time, low-power HDTV H.264/AVC decoder, low latency is a basic
requirement, and not increasing the load of the original system is another
target. Therefore, our design challenge for the embedded compressor is to find
the optimal trade-off.
Chapter 3
Proposed Embedded Compression Algorithm
3.1 Overview
Research on data compression has been developed for a long time, and the
developed algorithms show that increasing complexity can reach better
performance. The problem, however, is to find a suitable compression category
that combines with the H.264 system without affecting the performance of the
overall system. The discussion in Chapter 2 has shown that the barrier to
embedding an extra function may rise with a higher-complexity coding scheme. In
this chapter, further discussion is presented.
In practice, block-based schemes are the most convenient because they match
the block-oriented structure of the incoming bit-stream in the H.264 system and
allow on-the-fly processing. However, another problem arises: the overhead. The
overhead can be defined as the ratio between the number of pixels actually
accessed during the motion compensation of a block and the number of pixels
really useful in the reference block. In the original system the ratio is 1,
since every accessed pixel is on demand. After a block-based embedded
compression (EC) algorithm is adopted, this ratio is always greater than 1
because of the nature of block-based embedded compression. Fig. 10 shows the
difference between pixel-based and block-based access: the left of Fig. 10 is
pixel-based, representing the data without EC, and the right is block-based due
to the characteristics of EC. Fig. 11 is an example of the overhead problem.
Fig. 10 Pixel-based (left) versus block-based (right)
Fig. 11 An example of overhead problem
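For block grids aligned to multiples of the grid size, the overhead ratio defined above can be computed directly. This small sketch (ours, ignoring sub-pixel interpolation) counts how many whole EC blocks a motion-compensation request touches:

```python
# Overhead ratio: pixels actually fetched (whole EC blocks covering the
# request) divided by pixels the MC request really needs.
def overhead(grid, x, y, w, h):
    """EC blocks are grid x grid; the MC request is w x h at pixel (x, y)."""
    cols = (x + w - 1) // grid - x // grid + 1   # blocks touched horizontally
    rows = (y + h - 1) // grid - y // grid + 1   # blocks touched vertically
    return (cols * rows * grid * grid) / (w * h)

# A 4x4 request straddling an 8x8 grid: 4 blocks (256 pixels) for 16 pixels.
print(overhead(8, 6, 6, 4, 4))   # -> 16.0
```

An aligned request of the same size as the grid gives a ratio of exactly 1, which is why smaller grids keep the overhead down.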
According to the H.264 standard, a 16x16 macroblock can be divided into
8x8, 8x16 or 16x8 blocks during motion compensation (MC). Furthermore, an 8x8
block can be sub-divided into 8x4, 4x8 or 4x4 sub-blocks. If the compensated
block is not aligned with the coded block grid, overhead occurs as depicted in
Fig. 11: four coded blocks have to be loaded and decoded to get the required
pixels. If the EC scheme is 8x8 block-based and the compensated block is a 4x4
block, we need to load and decode 256 pixels to derive 16 useful pixels; the
overhead in this case is 16. Because of the overhead problem, the relation
between the compression ratio of EC and the gain in memory transfer is not
direct.

Statistical material about the overhead phenomenon is provided by [15] for
the Stefan sequence, with three kinds of EC block grid. Since an H.264 encoder
allows macroblock (MB) partitioning and larger motion vectors at high rate
(which also means a small quantization step and better quality) and favors null
vectors with 16x16 partitions at low rate, the overhead increases as the bit
rate increases.
Fig. 12 The correlation between bit-rate and overhead (Stefan sequence) simulated with 4x4, 8x8 and 16x16 block grid
Table 2 [15] summarizes the statistical analysis over six sequences. We can
see that relatively still sequences (News, Weather) generate smaller overhead,
since the motion vector is often zero, while a fast-motion sequence such as
Stefan generates more overhead. Finally, an important conclusion is that the
smaller block grid outperforms the larger block grids.
Table 2 Overhead with EC block grid for each sequence

Sequence | 4x4 block grid | 8x8 block grid | 16x16 block grid
Foreman  | 1.31 | 1.77 | 3.69
Flower   | 1.30 | 1.74 | 3.77
News     | 1.14 | 1.51 | 2.78
Silent   | 1.17 | 1.50 | 3.22
Stefan   | 1.51 | 2.44 | 6.95
Weather  | 1.17 | 1.49 | 3.18
All      | 1.27 | 1.73 | 3.93
3.2 Algorithm of Embedded Compressor
We adopt a transform-based algorithm on a 4x4 block grid. The first reason
is the smallest overhead, according to the statistical results presented in the
previous section. It is actually a trade-off between coding efficiency and
overhead: for a transform algorithm, the bigger the block grid, the better the
coding efficiency it can achieve. Since we want good coding efficiency with less
overhead, the 4x4 block grid is our best choice.
The basic concept of the proposed algorithm is the combination of the DCT
with bit-plane zonal coding. The DCT is a well-known technique, so we introduce
it only briefly; the two proposed bit-plane zonal codings are the main
characters. Fine grain bit-plane zonal coding (FGBPZ) is quite efficient and
suitable for software applications, while coarse grain bit-plane zonal coding
(CGBPZ) is relatively simple and suitable for hardware implementation. Fig. 13
shows the coding flow of the proposed DCT-FGBPZ/CGBPZ algorithm. It is a
one-way, open-loop coding scheme, and no iteration is needed. The discrete
cosine transform (DCT) is divided into two one-dimensional DCTs, and the DCT
coefficients are packed by the proposed fine grain bit-plane zonal coding
(FGBPZ) or coarse grain bit-plane zonal coding (CGBPZ).
Fig. 13 The flow chart of proposed DCT-FGBPZ/CGBPZ embedded compression
3.2.1 Discrete Cosine Transform
The discrete cosine transform (DCT) is a powerful technique for converting
a signal into elementary frequency components. It is widely used in image
compression, JPEG being the best-known example.

Human eyes are more sensitive to the low-frequency components of a picture
and less sensitive to the high-frequency components, so quality loss in the
high-frequency components is relatively unnoticeable. The DCT places the
relatively important low-frequency components in the upper-left corner and the
highest frequencies in the lower-right corner. Thus the DCT, combined with
bit-plane zonal coding whose origin is the upper-left corner, can collect the
information efficiently.
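The separable 2-D DCT described above can be sketched as a floating-point reference model (the hardware uses a fixed-point variant; this is only an illustration of the two 1-D passes):

```python
# Reference model of the separable 4x4 2-D DCT: an orthonormal 1-D
# DCT-II applied to every row, then to every column of the result.
import math

def dct1d(v):
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct2d(block):
    rows = [dct1d(r) for r in block]                 # transform each row
    cols = [dct1d(list(c)) for c in zip(*rows)]      # then each column
    return [list(r) for r in zip(*cols)]             # transpose back

coeffs = dct2d([[10] * 4 for _ in range(4)])         # a uniform block
```

For a uniform block all the energy lands in the upper-left (DC) coefficient, which is exactly the concentration the zonal scan exploits.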
The biggest disadvantage of the DCT, however, is its hardware complexity.
Since our coding unit is a 4x4 block grid, the complexity of the 4-point DCT is
minor while the advantage of the transform is preserved. The complexity of
different DCT sizes is evaluated in Table 3, which shows two designs: design A
is referenced from [16] and design B from [17]; design B focuses on reducing
multiplications at the cost of more additions. In both designs, the 4-point DCT
is much simpler than the 8-point and 16-point DCTs.
Table 3 The complexity of N-point DCT (N = 2^m)

            Number of Multiplications    Number of Additions
 m    N          A        B                  A        B
 2    4          2        4                  6        9
 3    8         16       12                 26       29
 4   16        116       80                194      209
3.2.2 Proposed Fine Grain Bit Plane Zonal Coding (FGBPZ)
The modified bit-plane zonal coding proposed in [20] already achieves good coding efficiency, but we are not satisfied yet. To further improve the coding efficiency, we introduce a pre-determined variable length coding with a small codebook.
3.2.2.1 VLC Codebook
Before modifying the MBPZ of [20] further, we ran a simulation to evaluate the occurrence of each MBPZ type; Fig. 14 shows the result. The names of types A, B, C and D follow [20] (see Fig. 7). The appearance probabilities of type B and type C are relatively small although they have better coding efficiency. Type D is the dominant type, but its header takes 5 bits: one bit for distinguishing between types and 4 bits for RMAX/CMAX. Therefore, we want to improve the efficiency by adding a small pre-determined VLC codebook.
Fig. 14 The occurrence probability of each type in MBPZ (Type B: 16%, Type C: 11%, Type D: 73%)
In the modified bit-plane zonal coding of [20], the RMAX/CMAX of each bit-plane is accumulated bit-plane by bit-plane and is always larger than or equal to the RMAX/CMAX of the previous plane. Recall that type D is applied when RMAX/CMAX changes. Therefore, when type D is applied, the possible outcomes of the RMAX/CMAX in the next bit-plane are limited: they must be larger than the RMAX/CMAX of the previous plane.
For example, if the RMAX/CMAX of the current plane is 2/2 and the next plane is coded by type D, the possible RMAX/CMAX outcomes of the next plane are 3/2, 2/3 and 3/3. Notice that 2/2 is also a possible RMAX/CMAX for the next bit-plane, but type D only deals with the case where RMAX/CMAX differs from the previous bit-plane. These 3 possible outcomes can be fully represented by 1~2 bits instead of the original 4 bits. This explains the opportunity for reducing the codeword length in type D. Fig. 15 shows the coding flow of FGBPZ with the VLC codebook; this method saves header bits whenever type D is applied.
Fig. 15 Coding flow of FGBPZ with VLC codebook. Recall that types A, B, C and D are referred from [20].
We generate these codes by Huffman coding; the probabilities of the next possible RMAX/CMAX (Pcurrent RMAX/CMAX [next RMAX/CMAX]) are derived from simulation over 3000 frames. The codewords in this codebook are fixed.
To cover every possible RMAX/CMAX of the next bit-plane given the current plane, the needed codebook entries and their related RMAX/CMAX are shown in Table 4. The number of possible outcomes of the next RMAX/CMAX is given in (1). For a 4x4 bit-plane, the rows/columns are marked 0, 1, 2, 3. When type D is applied, at least one of the row or column maxima changes. The equation therefore counts the outcomes that are larger than or equal to the current RMAX/CMAX and then subtracts the one outcome in which both RMAX and CMAX equal those of the current bit-plane.
Next possible outcomes = (4 − Current_RMAX) × (4 − Current_CMAX) − 1    (1)
Table 4 The needed codebook entries and their related RMAX/CMAX

 Current      Number of next possible    Huffman code
 RMAX/CMAX    RMAX/CMAX outcomes         length (bits)
 ( 0, 1 )     11  ( 4*3-1 )              3~4
 ( 1, 0 )     11  ( 3*4-1 )              3~4
 ( 1, 1 )      8  ( 3*3-1 )              2~4
 ( 2, 0 )      7  ( 2*4-1 )              2~4
 ( 0, 2 )      7  ( 4*2-1 )              2~4
 ( 2, 1 )      5  ( 2*3-1 )              2~3
 ( 1, 2 )      5  ( 3*2-1 )              2~3
 ( 2, 2 )      3  ( 2*2-1 )              1~2
 ( 3, 0 )      3  ( 1*4-1 )              1~2
 ( 0, 3 )      3  ( 4*1-1 )              1~2
 ( 3, 1 )      2  ( 1*3-1 )              1
 ( 1, 3 )      2  ( 3*1-1 )              1
 ( 3, 2 )      1                         0
 ( 2, 3 )      1                         0
 Summary      67                         0~4
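Equation (1) and the outcome counts in Table 4 can be checked with a one-line sketch (the function name is illustrative):

```python
def next_outcomes(rmax, cmax):
    """Number of possible next-plane RMAX/CMAX values when type D is applied.

    Row/column maxima range over 0..3 and can only grow or stay equal, and
    type D excludes the single case where both stay unchanged, hence "- 1".
    """
    return (4 - rmax) * (4 - cmax) - 1
```

Entries with a single possible outcome, such as (3, 2), need no codeword at all (0 bits), since the decoder can infer the only legal successor.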
There is still room for codebook improvement. Consider the following two cases: case 1), the current RMAX/CMAX is 2/3 and the next RMAX/CMAX is 3/4; case 2), the current RMAX/CMAX is 3/2 and the next RMAX/CMAX is 4/3. With the original codebook, the codebook index for case 1 is {(2, 3), (3, 4)} and for case 2 is {(3, 2), (4, 3)}. The only real difference between case 1 and case 2 is the direction of row versus column; the two cases are similar even in the probability distribution of each possible "next RMAX/CMAX". If we swap rows and columns, the two cases undergo exactly the same changes. Based on this idea, we introduce our symmetric VLC codebook. By eliminating the bias between row and column, symmetric cases can share the same codeword. This reduces the 67-entry codebook to 40 entries, shrinking the codebook size by about 40%. The time spent on codebook searching is also reduced.
We now show how to use the symmetric VLC codebook. Let the current RMAX/CMAX be Cm_cur, Rm_cur and the previous RMAX/CMAX be Cm_pre, Rm_pre. The table look-up can be described as follows:

If (Cm_pre ≥ Rm_pre)
    the codeword at index {(Cm_pre, Rm_pre), (Cm_cur, Rm_cur)} is applied;
Else
    the codeword at index {(Rm_pre, Cm_pre), (Rm_cur, Cm_cur)} is applied.

Therefore, 40 codewords are enough.
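The look-up rule above can be sketched as a small helper (the function name is illustrative). The two symmetric cases from the earlier example map to the same canonical codebook key:

```python
def codebook_key(cm_pre, rm_pre, cm_cur, rm_cur):
    """Canonical index into the symmetric VLC codebook.

    Symmetric row/column cases share one codeword: if CMAX < RMAX for the
    previous plane, swap the roles of row and column before the look-up.
    """
    if cm_pre >= rm_pre:
        return (cm_pre, rm_pre), (cm_cur, rm_cur)
    return (rm_pre, cm_pre), (rm_cur, cm_cur)
```

Both {(2, 3), (3, 4)} and {(3, 2), (4, 3)} resolve to the same key, which is why 40 entries suffice.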
We now explain the decoding procedure with the symmetric VLC codebook. After the start plane is decoded, its RMAX/CMAX is known and can be used as a reference. The decoding procedure for the subsequent bit-planes is illustrated in (2):

If (Cm_pre ≥ Rm_pre)
    the codeword in block {(Cm_pre, Rm_pre)} is searched, and the result is in {(Cm_cur, Rm_cur)} order;
Else
    the codeword in block {(Rm_pre, Cm_pre)} is searched, and the result is in {(Rm_cur, Cm_cur)} order.
(2)
These switches between RMAX and CMAX in the encoding procedure need not be recorded, since they can be re-derived during decoding. The final VLC codebook, formed by eliminating the symmetric entries of Table 4, is shown in Table 5. A coding example for FGBPZ is shown in Fig. 16, and Table 6 lists the detailed codewords of the VLC codebook.
Table 5 The final 40-entry VLC codebook

 Current      Number of next possible    Huffman code
 RMAX/CMAX    RMAX/CMAX outcomes         length (bits)
 ( 1, 0 )     11  ( 3*4-1 )              3~4
 ( 1, 1 )      8  ( 3*3-1 )              2~4
 ( 2, 0 )      7  ( 2*4-1 )              2~4
 ( 2, 1 )      5  ( 2*3-1 )              2~3
 ( 2, 2 )      3  ( 2*2-1 )              1~2
 ( 3, 0 )      3  ( 1*4-1 )              1~2
 ( 3, 1 )      2  ( 1*3-1 )              1
 ( 3, 2 )      1                         0
 Summary      40                         0~4
Table 6 The overall codewords in the VLC codebook

 Current      Next       Codeword   Code length (bits)
 ( 1, 0 )    (1, 1)      000        3
             (2, 0)      001        3
             (2, 1)      010        3
             (3, 0)      011        3
             (2, 2)      100        3
             (1, 2)      1010       4
             (3, 1)      1011       4
             (3, 2)      1100       4
             (3, 3)      1101       4
             (2, 3)      1110       4
             (1, 3)      1111       4
 ( 1, 1 )    (2, 2)      00         2
             (2, 1)      100        3
             (1, 2)      101        3
             (3, 3)      110        3
             (3, 2)      111        3
             (2, 3)      010        3
             (3, 1)      0110       4
             (1, 3)      0111       4
 ( 2, 0 )    (2, 1)      00         2
             (3, 0)      01         2
             (3, 1)      100        3
             (2, 2)      101        3
             (3, 2)      110        3
             (3, 3)      1110       4
             (2, 3)      1111       4
 ( 2, 1 )    (2, 2)      00         2
             (3, 2)      01         2
             (3, 3)      10         2
             (3, 1)      110        3
             (2, 3)      111        3
 ( 2, 2 )    (2, 3)      00         2
             (3, 2)      01         2
             (3, 3)      1          1
 ( 3, 0 )    (3, 1)      0          1
             (3, 2)      10         2
             (3, 3)      11         2
 ( 3, 1 )    (3, 2)      0          1
             (3, 3)      1          1

3.2.2.2 Data Packing
Since our compression ratio is fixed at two, the budget for coded data is 64 bits. After the DCT and bit-plane zonal coding, the coded data must be packed into a 64-bit segment before being sent to external memory. First, 8 bits are reserved for the DC coefficient because of its importance in the transform. Second, 4 bits are used to pack the start plane. The rest of the budget, namely 52 bits, is used for storing AC coefficients. With fine grain bit-plane zonal coding, the AC coefficients are divided into bit-planes and represented by the coding format of Fig. 15. The procedure keeps packing bit-plane by bit-plane until the bit-planes are exhausted or the bit budget runs out.
Fig. 17 Protecting mechanism for unknown sign bit
When the budget runs out, unpacked information is lost. Recall that a newly significant coefficient must be followed by its sign bit. If a newly significant bit is packed while its sign bit is cut off, the coefficient will be decoded incorrectly. We add a mechanism to avoid this situation, shown in Fig. 17: if the next bit to pack is a newly significant bit and the remaining budget is less than two bits, we abort packing that newly significant bit and pack a "0" instead.
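A behavioral sketch of this protection follows; the bitstream representation here (a list of bit/flag pairs) is illustrative, not the coded format itself:

```python
def pack_with_protection(bits, budget=52):
    """bits: (bit, is_newly_significant) pairs in packing order.

    Bits are packed one by one until the budget runs out. If the next bit
    is a newly significant magnitude bit but fewer than two bits remain
    (so its sign bit would be cut off), a '0' is packed in its place.
    """
    packed = []
    for bit, is_new_sig in bits:
        if len(packed) == budget:
            break
        if is_new_sig and budget - len(packed) < 2:
            packed.append(0)  # protect: suppress the orphaned significant bit
        else:
            packed.append(bit)
    return packed
```

This guarantees that every significant bit reaching the decoder carries its sign bit, at the cost of at most one padded zero per packet.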
The final encoding flow chart is shown in Fig. 18.
3.3 Coarse Grain Bit-Plane Zonal Coding (CGBPZ)
The FGBPZ introduced in section 3.2.2 is simple and efficient. The algorithm encodes the coefficients at the bit level, but by our estimation its encoding procedure may cost more than 30 cycles and its decoding procedure more than 10 cycles. FGBPZ is therefore better suited to software or hardware/software co-design systems. To implement the algorithm as a hardware accelerator, it must be simplified further.
The discussion in chapter 6.1 will show the critical problems of embedding a compressor into a system. Taking all these problems into consideration, we propose coarse grain bit-plane zonal coding (CGBPZ). CGBPZ is a trade-off among short cycle count, parallelism, and quality. The details are presented in this section.
Fig. 19 shows the coding format of CGBPZ. All magnitude bit-planes of the AC coefficients are coded in a uniform format: for each bit-plane, we record its RMAX/CMAX (4 bits) and then pack the bits enclosed by RMAX and CMAX. The dependencies between bit-planes are not exploited in CGBPZ.
Fig. 19 CGBPZ coding format for the magnitude of AC coefficients
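Under this format, the cost of one magnitude plane is easy to sketch. The zone shape here is an assumption (the rectangle of rows 0..RMAX by columns 0..CMAX, with 2 bits each for RMAX and CMAX):

```python
def plane_cost(rmax, cmax):
    """Bits needed for one CGBPZ magnitude bit-plane: a 4-bit RMAX/CMAX
    header plus the enclosed zone, assumed here to be the rectangle of
    rows 0..rmax and columns 0..cmax."""
    return 4 + (rmax + 1) * (cmax + 1)
```

A full 4x4 plane thus costs 20 bits, while a DC-only plane costs 5.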
In CGBPZ we introduce the concept of the sign bit-plane, which can be considered the union of the sign bits of all coefficients. We only pack the sign bits that are actually used: since the budget is limited, failing to pack all the information may happen frequently, and because not every coefficient can be packed, packing the whole sign bit-plane would be wasteful. We therefore take the maximum RMAX and CMAX over the packed bit-planes (from start plane to end plane) and pack the sign bit-plane within those two boundaries. In this way the fewest bits are wasted on unused sign bits. The RMAX/CMAX of the sign bit-plane need not be packed during encoding, because they can be derived from the coded bit-planes. Fig. 20 illustrates how the RMAX/CMAX of the sign bit-plane are derived.
Fig. 20 The concept of how to derive the RMAX/CMAX of sign bit-plane from coded bit plane.
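The derivation in Fig. 20 amounts to a running maximum over the packed planes, sketched here (the function name is illustrative):

```python
def sign_plane_bounds(packed_planes):
    """Derive the sign bit-plane's RMAX/CMAX from the coded planes.

    packed_planes: (rmax, cmax) pairs for each packed magnitude bit-plane
    from the start plane to the end plane. The sign plane is bounded by
    the maximum RMAX and the maximum CMAX over those planes, so these two
    values never need to be stored in the packet.
    """
    rmax = max(r for r, _ in packed_planes)
    cmax = max(c for _, c in packed_planes)
    return rmax, cmax
```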
Finally, in CGBPZ the end plane must be determined and packed so that decoding can proceed. Fig. 21 shows the simple concept of the end-plane decision. From the MSB plane to the LSB plane, a calculator accumulates the total bit usage from the most significant plane (MSP) down to the current plane. If the total bit usage would exceed 64 bits, the previous plane becomes the end plane.
Fig. 21 End plane decision
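The accumulation in Fig. 21 can be sketched as follows. The 12-bit overhead is an assumption (the 8-bit DC field plus the 4-bit start-plane field), and the per-plane costs are supplied by the caller:

```python
def decide_end_plane(plane_costs, budget=64, overhead=12):
    """End-plane decision sketch.

    plane_costs: bits needed by each magnitude bit-plane, ordered from the
    most significant plane (MSP) downward. Plane costs are accumulated and
    the scan stops before the plane that would push the total past the
    64-bit budget; the index of the last plane that fits is returned
    (-1 if no plane fits).
    """
    used = overhead
    end_plane = -1
    for i, cost in enumerate(plane_costs):
        if used + cost > budget:
            break
        used += cost
        end_plane = i
    return end_plane
```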
The overall encoding flow is shown in Fig. 22. Finally, there is one small trick: by the description above, the bit usage accumulated up to the end plane is less than the bit budget, so a few bits remain unused. To make full use of them, we keep putting information into those unused bits within the remaining budget.
Fig. 22 Overall encoding flow of CGBPZ
3.4 Decoding Process and the Compensation
Roughly speaking, the decoding process can be thought of as the inverse of encoding: we take the coded data segments and split them into the DC coefficient and the AC coefficients.
Since the proposed algorithm is a lossy compression and the lower bit-planes of the AC coefficients are often truncated by the limited budget, we apply a simple compensation. The basic concept is shown in Fig. 23. The compensation is applied when a coefficient is nonzero and the end plane is above the least significant bit-plane. It can be viewed as adding the median value of the lost bit-planes, and it yields a satisfying quality improvement. Notice that this compensation is performed entirely at the decoder and needs no extra coded information.
Fig. 23 Proposed compensation technique
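A minimal sketch of this compensation, assuming the truncated planes are the lowest `lost_planes` bit-planes of each coefficient:

```python
def compensate(coef, lost_planes):
    """Decoder-side compensation sketch.

    lost_planes: number of truncated lower bit-planes. A nonzero
    reconstructed coefficient gets the midpoint of the lost range,
    2**(lost_planes - 1), added back with the coefficient's sign;
    zero coefficients are left untouched.
    """
    if coef == 0 or lost_planes <= 0:
        return coef
    mid = 1 << (lost_planes - 1)
    return coef + mid if coef > 0 else coef - mid
```

Adding the midpoint halves the expected magnitude error of a uniformly distributed truncated tail.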
3.5 Embedded Result on Software Simulation
Before all the discussion, we want to define the formula of PSNR calculation
first. All the PSNR values in this section are the PSNR between compressed
sequences versus the original sequence. The reason why we choose original sequence
as reference is to establish an absolute quality level. The equation of PSNR is given in
(3):
PSNR = 10 × log10( (255 × 255 × R × C) / Σ_{r=0}^{R-1} Σ_{c=0}^{C-1} (P_origin(r, c) − P_compressed(r, c))^2 )    (3)

3.5.1 FGBPZ versus CGBPZ
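Equation (3) corresponds to the following per-frame computation (an illustrative sketch; frames are plain 2-D lists of pixel values):

```python
import math

def psnr(origin, compressed, peak=255):
    """Per-frame PSNR between the original and compressed frames, as in (3)."""
    rows, cols = len(origin), len(origin[0])
    sse = sum((origin[r][c] - compressed[r][c]) ** 2
              for r in range(rows) for c in range(cols))
    if sse == 0:
        return float('inf')  # identical frames
    return 10 * math.log10(peak * peak * rows * cols / sse)
```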
This section compares the proposed fine grain bit-plane zonal coding (FGBPZ) and coarse grain bit-plane zonal coding (CGBPZ), to show the result of the trade-off between them. Fig. 24 shows the embedded result on the Foreman sequence with a group of pictures (GOP) of 20. The PSNR value decays along the P-frame number because each P frame is formed by referencing blocks in the previous frame: since every reference frame is compressed by our lossy EC algorithm, the errors are propagated and accumulated through the P frames. This phenomenon is also called the drift effect. Fig. 25 shows the drift effect on the Mobile Calendar sequence. Mobile Calendar is famous for its complex content and fast motion; these features make the Mobile sequence difficult to compress, so the quality loss may be larger than in slow-motion sequences.
Fig. 24 Drift effects on Foreman_QP28_GOP20
Fig. 25 Drift effects on Mobile_QP28_GOP20
PSNR loss (dB) on Foreman, FGBPZ vs. CGBPZ:

 QP              20     24     28     32
 FGBPZ_IP=1/9    4.02   2.26   1.24   0.61
 FGBPZ_IP=1/19   5.27   3.23   1.89   0.99
 FGBPZ_IP=1/29   6.22   4.01   2.41   1.34
 CGBPZ_IP=1/9    5.45   3.19   1.81   0.90
 CGBPZ_IP=1/19   6.99   4.36   2.60   1.34
 CGBPZ_IP=1/29   8.24   5.41   3.33   1.82

Fig. 26 PSNR loss results for different QP and different GOP (Foreman)

PSNR loss (dB) on Mobile, FGBPZ (fine) vs. CGBPZ (coarse):

 QP              20     24     28     32
 fine_IP=1/9    10.90   7.34   4.52   2.29
 fine_IP=1/19   12.41   8.73   5.64   3.06
 fine_IP=1/29   13.41   9.68   6.45   3.65
 coarse_IP=1/9  13.16   9.59   6.60   4.01
 coarse_IP=1/19 14.62  10.95   7.80   4.94
 coarse_IP=1/29 15.61  11.90   8.65   5.60
Fig. 27 PSNR loss results different QP and different GOP (Mobile)
Fig. 26 and Fig. 27 show the PSNR loss for different QP and different GOP. The PSNR loss increases with increasing GOP and tails off at higher QP values.
According to our simulation results over the sequences Akiyo, Foreman, Mobile and Stefan, with GOP 10, 20, 30 and QP 20, 24, 28, 32, the average quality difference between CGBPZ and FGBPZ is 1.5 dB. This shows that CGBPZ is a good trade-off between complexity and quality: a 1.5 dB PSNR drop buys a fast encoding procedure, from over 30 cycles (FGBPZ) down to 2 cycles (CGBPZ).
3.5.2 CGBPZ versus MHT
This section compares the proposed coarse grain bit-plane zonal coding (CGBPZ) with the modified Hadamard transform (MHT). CGBPZ is what we use for hardware implementation and system integration; considering the requirement of high-speed processing, we compare CGBPZ with the MHT work. Fig. 28 shows the embedded result on Foreman with a GOP of 20. The proposed DCT with CGBPZ performs better and effectively slows the rate of quality decay compared with the MHT work. Fig. 29 shows the drift effect on the Mobile Calendar sequence.
Fig. 28 Drift effects on Foreman_QP28_GOP20
Fig. 29 Drift effects on Mobile_QP28_GOP20
Fig. 30 PSNR loss results for different QP and different GOP (Foreman)

PSNR loss (dB) on Mobile, MHT vs. CGBPZ:

 QP              20     24     28     32
 mht_IP=1/9     15.69  12.24   9.11   6.09
 mht_IP=1/19    17.97  14.53  11.24   7.99
 mht_IP=1/29    19.48  16.02  12.67   9.34
 CGBPZ_IP=1/9    8.71   5.64   3.29   1.53
 CGBPZ_IP=1/19  10.06   6.91   4.27   2.16
 CGBPZ_IP=1/29  11.05   7.83   4.94   2.64
Fig. 31 PSNR loss results different QP and different GOP (Mobile)
Fig. 30 and Fig. 31 show the PSNR drop for different QP and different GOP. According to our simulation results over the sequences Akiyo, Foreman, Mobile and Stefan, with GOP 10, 20, 30 and QP 20, 24, 28, 32, the average quality difference between DCT plus CGBPZ and MHT is 7.12 dB. This shows the coding gain of the proposed algorithm over the MHT work.
Chapter 4
Proposed Embedded Compressor/Decompressor Architecture
Sections 4.1 and 4.2 introduce the hardware designs of the proposed embedded compressor and decompressor, respectively. The architectures are designed to fit the specification in chapter 6.1.
4.1 Architecture of Encoder Design
The overall block diagram of the embedded compressor is shown in Fig. 32.
4.1.1 The Architecture of the Two-Dimensional Discrete Cosine Transform
The DCT hardware design follows Lee's architecture [16], which maintains the same performance as the original DCT while reducing the number of multiplications to about half of those required by existing efficient algorithms. This lets us take advantage of the DCT without suffering its full hardware complexity. Notice in Table 3 that [16] uses more multiplications than [17] for the 4-point DCT. However, each multiplication in [16] has one constant input and one variable input, while each multiplication in [17] has two variable inputs. In our experience, the synthesized area of a multiplier with one constant input is about 1/3 of that of a multiplier with two variable inputs. Therefore, the referenced design [16] still beats design [17] for the 4-point DCT.
4.1.2 The Architecture of Coarse Grain Bit-Plane Zonal Encoding and Data Packing
A combinational block processes the coefficients to derive the RMAX/CMAX and plane content of each plane. To serialize the plane information in one cycle, we propose a content-adaptive ripple connector; the basic concept is shown in Fig. 33. The 10 lines on the left represent the 9 plane contents plus 1 sign bit-plane content. Each connector represents a shifting stage, whose detailed structure is shown in Fig. 34. Through the ripple behavior, the wire at the end of the chain carries the fully connected result. Since this embedded compressor is integrated into our 100 MHz decoder, one cycle is enough to finish the ripple processing.
Fig. 33 Content adaptive ripple connecter
Fig. 34 The architecture of a single connecter in Fig. 33
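Behaviorally, the ripple chain concatenates the variable-length plane segments: each stage shifts the word accumulated so far by the length of the incoming segment and merges the segment in. A software sketch of this behavior (segment representation is illustrative):

```python
def ripple_connect(segments):
    """Serialize variable-length segments as the ripple connector does.

    segments: (value, bit_length) pairs -- e.g. the 9 coded magnitude
    planes plus the sign plane. Each stage shifts the accumulated word by
    the incoming segment's length and ORs the segment in; the final wire
    carries the packed word and its total bit length.
    """
    word, total = 0, 0
    for value, n in segments:
        word = (word << n) | value
        total += n
    return word, total
```

In hardware, the running bit count plays the role of the ripple offset that configures each shifting stage.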