
Chapter 1 Introduction

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 gives a basic introduction to compression schemes and reviews prior work. Chapter 3 describes the proposed lossy embedded compression algorithm. Integration with an H.264/AVC decoder imposes some constraints, and the proposed algorithm must be modified to fit those constraints in the hardware design. The modified algorithm and the hardware architecture are presented in Chapter 4, together with simulation results for the proposed algorithm integrated with an H.264/AVC HDTV decoder. The design implementation, integration, and verification are shown in Chapter 5. Chapter 6 presents the experimental results and a performance comparison. Finally, conclusions and future work are given in Chapter 7.

Chapter 2

Previous Works

Compression techniques can be divided into two types: lossless compression and lossy compression. This chapter briefly introduces previously proposed algorithms of both types. Bit-plane coding, which can be used for either lossy or lossless coding, is introduced in Section 2.3. Its concept is used in our proposed methods.

2.1 Lossless Embedded Compression Schemes

Many lossless compression methods have been proposed. The benefit of lossless compression is obvious: it preserves all the information while reducing the data size. Embedding a lossless compression mechanism into a video system is quite acceptable, since it causes no drift in either an encoder or a decoder system.

However, lossless compression suffers from a variable amount of output data. In theory, even for ideal lossless compression, the information content of the source data determines the compression ratio: the more information the source data contains, the longer the coded data is. This unpredictability is the fatal weakness of lossless embedded compression. Embedded compression schemes are meant to reduce the number of accesses to external memory and to reduce the size of that memory.

However, the variable amount of data after lossless compression can guarantee neither a reduction in frame memory size, since the memory must be provisioned for the worst case, nor a reduction in bandwidth, since the amount of compressed data is not known in advance. A study of lossless compression is presented in [2].
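The dependence of the lossless code length on the source information can be illustrated with a short sketch. It computes the empirical Shannon entropy of a block, which is a lower bound on the average bits per sample for a lossless code that treats samples independently. The function name and the two sample blocks are illustrative assumptions, not data from this thesis.

```python
import math
from collections import Counter

def empirical_entropy_bits(samples):
    """Shannon entropy, in bits per sample, of the empirical distribution."""
    counts = Counter(samples)
    total = len(samples)
    # p * log2(1 / p) summed over the distinct values
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical 16-sample blocks: one flat, one with 16 distinct values.
flat_block = [128] * 16
busy_block = list(range(16))

for name, block in (("flat", flat_block), ("busy", busy_block)):
    h = empirical_entropy_bits(block)
    print(f"{name}: {h:.2f} bits/sample, about {h * len(block):.0f} bits per block")
```

The two blocks have the same size but very different lower bounds, which is why a memory sized for lossless output must still be provisioned for the worst case.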

2.2 Lossy Embedded Compression Scheme

Lossy compression with a fixed compression ratio is suitable for reducing both the frame memory size and the bandwidth, since the predictable amount of compressed data guarantees the reduction. Therefore, lossy embedded compression is more popular than lossless embedded compression for solving the bandwidth reduction problem. Previous work on lossy compression includes [3] – [14].

2.2.1 Transform-Based Lossy Embedded Compression

Transform-based coding is a popular way to build lossy compression. It converts a signal into elementary frequency components. Because of the characteristics of the human visual system, lower-frequency components are more noticeable than higher-frequency ones. Thus, applying quantization and data collection to each component according to its visual priority can be an efficient way to collect data within a limited budget. The work in [3] uses the Hadamard transform, quantizes the coefficients by priority, and then encodes the quantized coefficients with Golomb-Rice coding. Golomb-Rice coding is an efficient coding method that can come close to the coding efficiency of Huffman coding when a suitable K factor is selected. However, [3] pursues low complexity and therefore uses fixed K values chosen by simulation. It operates at 100 MHz, and encoding or decoding one macroblock (MB) takes 33 cycles each, which makes it a high-speed design.
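As a minimal sketch of Golomb-Rice coding with a fixed K, the following code writes the quotient in unary and the remainder in K bits. The mapping of signed values to non-negative integers and the bit-string representation are assumptions made for illustration; the exact scheme used in [3] is not described here.

```python
def to_unsigned(v):
    """Map a signed integer to a non-negative one (assumed mapping: 0, -1, 1, -2, ...)."""
    return 2 * v if v >= 0 else -2 * v - 1

def rice_encode(value, k):
    """Rice code of a non-negative integer: unary quotient, a 0 terminator, k-bit remainder."""
    if value < 0:
        raise ValueError("map signed values with to_unsigned() first")
    bits = "1" * (value >> k) + "0"
    if k:
        bits += format(value & ((1 << k) - 1), f"0{k}b")
    return bits

def rice_decode(bits, k):
    """Decode one Rice code from the front of a bit string; returns (value, bits consumed)."""
    q = 0
    while bits[q] == "1":       # count the unary quotient
        q += 1
    pos = q + 1                 # skip the 0 terminator
    r = int(bits[pos:pos + k], 2) if k else 0
    return (q << k) | r, pos + k

code = rice_encode(to_unsigned(-3), 2)   # -3 maps to 5
print(code, rice_decode(code, 2))
```

With a fixed K, no parameter estimation is needed per block, which matches the low-complexity goal, at the price of longer codes when the chosen K does not suit the data.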

2.2.2 Delta Pulse Code Modulation Lossy Embedded Compression

Delta Pulse Code Modulation (DPCM) is another popular way to build lossy compression. Since neighboring data usually differ only slightly, the differences produced by DPCM carry less information than the source data and can therefore be coded more compactly.
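The idea can be sketched in one dimension as follows. This is a simplified illustration that assumes a previous-sample predictor and a uniform quantizer with a hypothetical step size; it is not the directional scheme of [4]. The encoder predicts from the reconstructed value rather than the original one, so that encoder and decoder stay in step.

```python
def dpcm_encode(samples, step):
    """Quantized differences against the previous reconstructed sample."""
    codes = []
    prediction = 0
    for s in samples:
        code = round((s - prediction) / step)   # uniform quantizer
        codes.append(code)
        prediction += code * step               # same value the decoder will hold
    return codes

def dpcm_decode(codes, step):
    """Rebuild the samples by accumulating the dequantized differences."""
    out = []
    prediction = 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

pixels = [100, 102, 101, 105, 110, 108]          # hypothetical neighboring pixels
codes = dpcm_encode(pixels, step=4)
print(codes, dpcm_decode(codes, step=4))
```

After the first sample, the codes are small integers, and the reconstruction error of each sample is bounded by half the step size.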

[4] uses DPCM as its base coding method and borrows the intra prediction modes of the H.264 video coding standard to find the best direction in which to perform DPCM. This idea makes the algorithm more adaptive to each video pattern and achieves better quality than [3].

However, the good performance of the DPCM method comes at a high cost. The DPCM method must fit every difference into a limited budget, but those differences are not always as small as we would wish. To derive the best quantization level and fit every difference into the budget, the DPCM-based method needs several iterations, which prevents the algorithm from being pipelined. Moreover, to avoid a large gate count, it is more acceptable to perform the subtractions cycle by cycle than with a parallel architecture. This, however, leads to longer coding cycles and places a heavy timing burden on the original system. From the viewpoint of system integration, either the operating frequency must be increased or the system throughput must be lowered to accommodate this DPCM-based embedded compression scheme.

2.2.3 Other Embedded Lossy Compression

There are many other approaches to lossy compression, such as adaptive vector quantization (VQ) [11], a down-sampling-based compression algorithm [12], and adaptive DPCM [13]. [15] provides two compression schemes and uses a pre-determination mechanism to choose which method to use. It claims that this mechanism achieves better performance by choosing the algorithm that fits the features of the video sequence. DWT with SPIHT in [14] is another transform-based approach, and the algorithm used in [14] can perform both lossy and lossless compression with the same architecture.

We can see that lossy embedded compression is truly the mainstream. However, it suffers from quality loss and the drift effect. Therefore, how the lossy coding methods are organized is very important. Covering as much information as possible within a limited budget is the main challenge of lossy compression.

2.3 Bit-Plane Coding

Bit-plane zonal coding is a well-known coding method that is widely used in many compression algorithms. It uses the bit-plane as its basic unit to encode a group of numbers rather than individual numbers. It can be used in a lossy or lossless compression scheme by adjusting the bit storage budget. With a sufficient bit budget it can fully represent the group of numbers. With an insufficient budget it loses some information in the lower bits and thus becomes a lossy compression. The details of bit-plane zonal coding are given in the following sections.

2.3.1 Bit-Plane Truncation Coding (BPT)

Before introducing bit-plane zonal coding, we first introduce the basic concept. Bit-plane truncation coding is the prototype of bit-plane zonal coding. An example is given in Fig. 1, which shows the coefficients after a 4x4 DCT. We can classify those coefficients into one DC coefficient and 15 AC coefficients. The idea of bit-plane coding is to collect data by bit-plane (that is, to take the N-th bit of every coefficient as a unit) rather than by individual coefficient. When we want to analyze a group of numbers further and split them into several parts by importance, separating them into bit-planes is a good idea.

Moreover, for a group of coefficients, the upper bit-planes are zero most of the time. Therefore, recording the start plane is a smart way to improve coding efficiency. For a 4x4 group of N-bit coefficients, about ceil(log2 N) bits are needed to record the start plane, but doing so saves 15 zero bits for each skipped bit-plane. The coded format after bit-plane truncation coding is shown in Fig. 2.
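The start-plane idea can be sketched as follows. This is a simplified illustration that handles only the magnitudes of the 15 AC coefficients and omits the DC coefficient and the sign bits; the function names, the bit-string output, and the bit order within a plane are assumptions. It also assumes N is at least 2 and does not distinguish an all-zero block from one whose start plane is 0.

```python
import math

def start_plane(magnitudes, n_bits):
    """Index of the highest bit-plane that contains a 1 (n_bits - 1 is the MSB plane), or -1."""
    for plane in range(n_bits - 1, -1, -1):
        if any((m >> plane) & 1 for m in magnitudes):
            return plane
    return -1

def bpt_pack(magnitudes, n_bits, budget_bits):
    """Start-plane header, then bit-planes from the start plane down, cut off at the budget."""
    sp = start_plane(magnitudes, n_bits)
    header_len = math.ceil(math.log2(n_bits))
    bits = format(max(sp, 0), f"0{header_len}b")
    for plane in range(sp, -1, -1):
        for m in magnitudes:
            if len(bits) == budget_bits:
                return bits          # budget exhausted: lower bits are truncated
            bits += str((m >> plane) & 1)
    return bits

ac = [5, 3, 0, 1] + [0] * 11         # hypothetical 8-bit AC magnitudes
print(start_plane(ac, 8), len(bpt_pack(ac, 8, 64)), len(bpt_pack(ac, 8, 20)))
```

In this example the five all-zero upper planes (75 bits) are replaced by a 3-bit header, and a tighter budget simply drops bits from the lowest planes.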

Fig. 1 Bit-plane truncation: AC coefficients are packed from the start plane. Because of the limited packing budget, the coefficient bits of the lower bit-planes, enclosed by the dashed line, are truncated.

Fig. 2 Coding format for bit-plane truncation coding (BPT).

2.3.2 Bit-Plane Zonal Coding (BPZ)

However, BPT performs poorly, and image quality must be improved by another approach that reduces the energy loss of the DCT coefficients. This section describes an improved coding algorithm named bit-plane zonal coding (BPZ) [18] in detail. Like BPT, BPZ packs DCT coefficients bit-plane by bit-plane, but its packing scheme is quite different. We will show that the packing efficiency of BPZ is much better than that of BPT.

The word “zonal” refers to encoding a bit-plane according to its zonal characteristic. Fig. 3 shows a possible outcome of a bit-plane. After the DCT, the coefficients with larger magnitudes tend to gather at the upper-left corner (lower horizontal or vertical frequencies), and the bits at the lower-right corner of the same bit-plane tend to be zero. Furthermore, the data of an individual DCT block often has a bias toward either the horizontal or the vertical direction. By describing the maximum row and column numbers of valid data in the scan zone, named RMAX and CMAX respectively, we have a high probability of representing the information of a bit-plane in fewer than 15 bits. Therefore, a signal-dependent rectangular scan zone starting from the upper-left corner codes the coefficients more efficiently [12].
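A minimal sketch of the scan zone computation is given below. It assumes a 4x4 bit-plane stored as a list of rows, zero-based RMAX and CMAX, and that the DC position at the upper-left corner is excluded from the count because the DC coefficient is coded separately. The function names are hypothetical.

```python
def scan_zone(plane):
    """(rmax, cmax) of the smallest upper-left rectangle covering every 1 bit, or None."""
    rows = [r for r, row in enumerate(plane) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(plane[0])) if any(row[c] for row in plane)]
    return max(rows), max(cols)

def zone_ac_bits(rmax, cmax):
    """Number of AC bits inside the zone; the DC slot at (0, 0) is not counted."""
    return (rmax + 1) * (cmax + 1) - 1

# Hypothetical bit-plane with its nonzero bits near the upper-left corner.
plane = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rmax, cmax = scan_zone(plane)
print(rmax, cmax, zone_ac_bits(rmax, cmax))
```

For this plane only 3 bits lie inside the zone, so even with the bits spent on RMAX and CMAX the plane costs fewer than the 15 bits needed to send every AC position.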

Fig. 3 The concept of bit-plane

Two classes of coefficients, significant and insignificant, are defined. During encoding, a significant coefficient has a 1 in at least one of the higher, already coded bit-planes. In contrast, an insignificant coefficient has all 0’s in the higher bit-planes.

Sometimes the zones represented by RMAX/CMAX are very similar in neighboring bit-planes. This similarity allows us to develop a more efficient coding mechanism.

The detailed coding flow is as follows. For DCT coefficient blocks, the process is divided into DC and AC flows. In the DC flow, the DC coefficient is packed completely, as in BPT, to avoid significant quality degradation. In the AC flow, the procedure is as shown in Fig. 4. Initially, all AC coefficients are marked as insignificant. Then, starting from the most significant plane (MSP), the bit-planes are encoded in turn. The first plane that contains a nonzero bit is defined as the start plane, and the nonzero bits in the start plane are the newly significant coefficients. A sign bit is therefore inserted after each nonzero bit. For each subsequent bit-plane there is only one question: does it contain a newly significant bit? If it does, a bit “1” is packed first to indicate that a newly significant bit has been found, and RMAX/CMAX must also be updated. The newly significant bits are followed by their corresponding sign bits. Bits of already significant coefficients and of insignificant coefficients need no sign bits, since the sign bits of significant coefficients have already been packed and the sign bits of insignificant coefficients are not yet needed. Note that, unlike BPT, which packs all sign bits in advance, BPZ packs sign bits on demand.

If no newly significant bit appears in the current bit-plane, a bit “0” is inserted to indicate that the RMAX/CMAX of the current bit-plane is the same as that of the previous bit-plane, and only the bits at the positions of significant coefficients need to be packed. BPZ repeats this procedure until all bit-planes have been packed. The on-demand packing of sign bits and the compact coding of bit-planes without newly significant bits account for the efficiency of BPZ and explain why it performs better than BPT.
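The AC flow described above can be sketched as an encoder and a matching decoder. This is one possible reading of the description, not the reference implementation of [18]: it assumes 2 bits each for RMAX and CMAX, raster order inside the zone, a sign bit of 1 for negative values, and no budget limit, and it omits the DC flow. All function names are hypothetical.

```python
AC = [(r, c) for r in range(4) for c in range(4) if (r, c) != (0, 0)]

def zone_positions(rmax, cmax):
    """AC positions inside the zone in raster order; the DC slot (0, 0) is skipped."""
    return [(r, c) for r in range(rmax + 1) for c in range(cmax + 1) if (r, c) != (0, 0)]

def bpz_encode_ac(block):
    """block: 4x4 signed integers. Returns (start plane, bit string) for the AC part."""
    mag = {p: abs(block[p[0]][p[1]]) for p in AC}
    start = max(m.bit_length() for m in mag.values()) - 1   # -1 when all AC are zero
    significant, bits = [], []
    for plane in range(start, -1, -1):
        ones = [p for p in AC if (mag[p] >> plane) & 1]
        has_new = any(p not in significant for p in ones)
        if plane != start:
            bits.append("1" if has_new else "0")             # coding-format flag
        if has_new:
            rmax, cmax = max(r for r, _ in ones), max(c for _, c in ones)
            bits.append(format(rmax, "02b") + format(cmax, "02b"))
            for p in zone_positions(rmax, cmax):
                bit = (mag[p] >> plane) & 1
                bits.append(str(bit))
                if bit and p not in significant:             # newly significant: add sign
                    bits.append("1" if block[p[0]][p[1]] < 0 else "0")
                    significant.append(p)
        else:
            bits.extend(str((mag[p] >> plane) & 1) for p in significant)
    return start, "".join(bits)

def bpz_decode_ac(start, bits):
    """Inverse of bpz_encode_ac; the DC position is returned as 0."""
    mag, sign, significant = {}, {}, []
    stream = iter(bits)
    take = lambda n: "".join(next(stream) for _ in range(n))
    for plane in range(start, -1, -1):
        if plane == start or take(1) == "1":
            rmax, cmax = int(take(2), 2), int(take(2), 2)
            for p in zone_positions(rmax, cmax):
                if take(1) == "1":
                    if p not in significant:
                        sign[p] = -1 if take(1) == "1" else 1
                        significant.append(p)
                    mag[p] = mag.get(p, 0) | (1 << plane)
        else:
            for p in significant:
                if take(1) == "1":
                    mag[p] |= 1 << plane
    block = [[0] * 4 for _ in range(4)]
    for p in significant:
        block[p[0]][p[1]] = sign[p] * mag[p]
    return block
```

In this sketch a plane without newly significant bits costs one flag bit plus one bit per significant coefficient, which is where the saving over sending a full plane comes from.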

Fig. 4 Coding procedure of BPZ algorithm

An example of bit-plane classification is illustrated in Fig. 5. As in BPT, the start plane of the DCT coefficients is packed as part of the header information. The sign bits of a DCT coefficient block, however, are no longer part of the header. They are dispersed and accompany the newly significant coefficients found in particular bit-planes. The header is thus shortened, and more of the packing budget is left for AC coefficients. The new packing data format is shown in Fig. 6.

Fig. 5 An example of BPZ coding

Fig. 6 New packing data format (BPZ) versus BPT

2.3.3 Modified Bit Plane Zonal Coding

If we look more closely at the BPZ algorithm through the example shown in Fig. 4, we find that the original BPZ algorithm can be improved further. For software applications, a small increase in complexity can yield better coding efficiency. A mechanism with a good trade-off between complexity and coding efficiency is proposed in [20].

The starting point is to use the limited budget more efficiently. Looking carefully at the coding types of bit-plane zonal coding (BPZ), we find that the format used when a newly significant coefficient occurs is costly because it has the longest header. Every time a newly significant bit is found, we need to pack 4 bits for RMAX/CMAX and one bit to distinguish the coding format. However, the four bits of RMAX/CMAX are not really necessary when RMAX/CMAX is the same as in the previous bit-plane. Therefore, [20] proposes a new coding format for this situation. The new format is adopted when a newly significant bit is found but the RMAX/CMAX of the current bit-plane is the same as that of the previous bit-plane. The overall coding types are shown in Table 1. The drawback is that one more bit is needed to distinguish the original type B from the newly proposed type C. The advantage is a saving of four bits compared with the original coding format. Fig. 7 shows the coding flow of the modified bit-plane zonal coding proposed in [20].

Table 1 Coding types of bit-plane proposed in [20]

Type   Newly Sig. Coef. Found   Rmax/Cmax Changed   Flag   Bits for Rmax/Cmax   Bits for Flag(s) and Rmax/Cmax
A      Yes                      Yes                 None   4                    4
B      No                       No                  00     None                 2
C      Yes                      No                  01     None                 2
D      Yes                      Yes                 1      4                    5
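The selection among the four coding types can be sketched as a small function. It assumes that type A, which carries no flag, is the start plane, and that RMAX and CMAX take 2 bits each with RMAX written first. The function name and argument layout are hypothetical and are not taken from [20].

```python
def mbpz_header(is_start_plane, newly_significant, zone, prev_zone):
    """Return (coding type, header bits) for one bit-plane, following Table 1.

    zone and prev_zone are (rmax, cmax) pairs with values in 0..3.
    """
    def zone_bits(z):
        return format(z[0], "02b") + format(z[1], "02b")

    if is_start_plane:
        return "A", zone_bits(zone)            # no flag, 4 zone bits
    if not newly_significant:
        return "B", "00"                       # nothing new, zone reused
    if zone == prev_zone:
        return "C", "01"                       # new coefficient, zone unchanged
    return "D", "1" + zone_bits(zone)          # new coefficient, zone updated

print(mbpz_header(False, True, (1, 2), (1, 2)))
print(mbpz_header(False, True, (2, 2), (1, 2)))
```

The flags 00, 01, and 1 form a prefix-free set, so a decoder can tell types B, C, and D apart by reading at most two bits.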

Fig. 7 Coding procedure of MBPZ algorithm

An example of the modified bit-plane zonal coding (MBPZ) proposed in [20] is given in Fig. 8. The bit streams at the bottom of the figure are coded by the original BPZ and by the modified BPZ (MBPZ), respectively. This comparison clearly shows the benefit of MBPZ. One small technique is worth noting: when packing a bit-plane of AC coefficients, we collect the bits in zigzag scan order. Since the human visual system is more sensitive to low-frequency signal components, zigzag scan order stores the relatively important signal first when the packing budget runs out, which gives better visual quality for the same packing budget.
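A zigzag scan order for a 4x4 block can be generated by walking the anti-diagonals in alternating directions, as sketched below. The sketch assumes the conventional zigzag pattern that starts at the DC position and moves right first; whether [20] uses exactly this pattern is not stated here.

```python
def zigzag_order(n=4):
    """(row, col) positions of an n x n block in conventional zigzag order."""
    order = []
    for d in range(2 * n - 1):
        # all positions on anti-diagonal d, listed from the top row downward
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        # even diagonals are walked from bottom-left to top-right, odd ones the other way
        order.extend(diag[::-1] if d % 2 == 0 else diag)
    return order

ac_scan = zigzag_order()[1:]   # drop the DC position, leaving the 15 AC positions
print(ac_scan)
```

Positions early in this list correspond to low horizontal and vertical frequencies, so a truncated bit stream keeps those bits and drops the high-frequency ones.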

Fig. 8 An example for MBPZ coding

When MBPZ encodes AC coefficients within a limited budget, quality loss is inevitable. To compensate slightly for the truncated data bits, [20] also proposes a method to raise the quality. First, if the value of a coefficient is greater than or equal to 4, the decoded AC coefficient is scanned from the LSB to find the first nonzero bit, and a “1” is pasted to its lower-two digit. If the value of the coefficient is less than 4, it is left unchanged. Finally, the coefficients are recovered using the corresponding sign bits (taking the two’s complement or not). The compensation procedure is illustrated in Fig. 9.

Fig. 9 Compensation for a bit-truncated AC coefficient.

2.4 Summary

From the introduction and discussion above, we have classified the existing algorithms into two basic types and briefly introduced their pros and cons. Lossy compression is the popular way to implement an embedded compressor because of its fixed compression ratio and fixed amount of coded data. However, good performance usually comes at the cost of processing time, while low complexity usually brings worse quality. High-performance methods may require a large buffer, and their longer processing cycles increase the load on the system and raise the barrier to embedding this extra function. Slowing down the system or increasing the operating frequency can fix this problem, but the former decreases coding throughput and the latter increases power consumption. Neither drawback is acceptable. Some lossy compression schemes are low in complexity, fast, and easy to embed into a decoder system as far as hardware is concerned, but they often suffer from unsatisfactory quality.

For a real-time, low-power HDTV H.264/AVC decoder, low latency is the basic requirement. Not increasing the load on the original system is another target. Therefore, our design challenge for the embedded compressor is to find the optimal trade-off among low latency, low complexity, and high performance.

Chapter 3

Proposed Embedded Compression Algorithm

3.1 Overview

Data compression has been researched for a long time. The algorithms developed so far show that increasing complexity can achieve better performance. The problem, however, is to find a suitable kind of compression that can be combined with an H.264 system without degrading the performance of the overall system. The discussion in Chapter 2 has shown that the barrier to embedding an extra function rises with the complexity of the coding scheme. This chapter presents further discussion.

In practice, block-based schemes are the most convenient because they match the block-oriented structure of the incoming bit-stream in an H.264 system and allow on-the-fly processing. However, another problem exists: overhead. Overhead can be defined as the ratio between the number of pixels actually accessed during the motion compensation of a block and the number of pixels in the reference block that are really needed. In the original system the ratio is 1, since every accessed pixel is needed. Once a block-based embedded compression (EC) algorithm is adopted, this ratio generally exceeds 1 because of the nature of block-based compression. Fig. 10 illustrates the difference between pixel-based and block-based access. The left of Fig. 10 is pixel-based and represents data without EC. The right of Fig. 10 is block-based, as imposed by EC. Fig. 11 gives an example of how overhead occurs.

Fig. 10 Pixel-based (left) versus block-based (right)

Fig. 11 An example of overhead problem

According to the H.264 standard, a 16x16 macroblock can be divided into 8x8, 8x16, or 16x8 blocks during motion compensation (MC). Furthermore, an 8x8 block can be subdivided into 8x4, 4x8, or 4x4 sub-blocks. If the compensated block is not aligned with the coded block grid, overhead occurs as depicted in Fig. 11: four coded blocks have to be loaded and decoded to get the required pixels. If the EC scheme is based on 8x8 blocks and the compensated block is a 4x4 block, we need to load and decode 256 pixels to obtain 16 useful pixels. The overhead in this case is 16. Because of the overhead problem, the relation between the compression ratio of EC and the gain in memory transfer is not direct.
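The overhead defined above can be computed with a short sketch. It assumes a square EC block grid anchored at the frame origin and counts only the pixels of the compensated block itself. The function name and arguments are hypothetical.

```python
def overhead(x, y, width, height, grid):
    """Pixels fetched (whole grid x grid coded blocks) divided by pixels actually needed.

    (x, y) is the top-left pixel of the compensated block in the reference frame.
    """
    first_col, last_col = x // grid, (x + width - 1) // grid
    first_row, last_row = y // grid, (y + height - 1) // grid
    blocks = (last_col - first_col + 1) * (last_row - first_row + 1)
    return blocks * grid * grid / (width * height)

# A 4x4 block straddling four 8x8 coded blocks: 256 fetched pixels for 16 useful ones.
print(overhead(6, 6, 4, 4, 8))
# The same 4x4 block aligned inside one 8x8 coded block.
print(overhead(0, 0, 4, 4, 8))
```

The ratio depends on both the block size and its alignment with the grid, which is why the memory-transfer gain cannot be read directly from the compression ratio.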

Statistics on the overhead phenomenon are provided in [15]. Fig. 12 shows the relation between overhead and encoding bit-rate, simulated with the Stefan sequence. Three kinds of EC block grid are presented. Since H.264 encoder

