
Chapter 2 Previous Works

2.4 Summary

From the introduction and discussion above, we classify the existing algorithms into two basic types and briefly introduce their pros and cons. We find that lossy compression is the popular way to implement an embedded compressor because of its fixed compression ratio and fixed amount of coded data. However, good performance usually comes at the cost of time, while low complexity usually brings worse quality. The former kind of method achieves better performance, but a large buffer may be required, and longer processing cycles will increase the loading of the system and raise the barrier to embedding this extra function. Slowing down the system or increasing the operating frequency can mask this problem, but the former decreases the coding throughput and the latter increases the power consumption; neither drawback is acceptable. Some lossy compression schemes are low-complexity, high-speed and easy to embed into a decoder system as far as hardware is concerned, but at the same time those schemes often suffer from unsatisfying quality.

For a real-time, low-power HDTV H.264/AVC decoder, low latency is the basic requirement, and not increasing the loading of the original system is another target.

Therefore, our design challenge for the embedded compressor is to find the optimal trade-off among low latency, low complexity and high performance.

Chapter 3

Proposed Embedded Compression Algorithm

3.1 Overview

Research on data compression has been under development for a long time. The developed algorithms show that increasing complexity can achieve better performance. The problem, however, is to find a suitable compression category that combines with the H.264 system without affecting the performance of the overall system. The discussion in chapter 2 has shown that the barrier to embedding an extra function may rise with a higher-complexity coding scheme. In this chapter, further discussion will be presented.

In practice, block-based schemes are the most convenient because they match the block-oriented structure of the incoming bit-stream in the H.264 system and allow on-the-fly processing. However, another problem arises: the overhead. The overhead can be defined as the ratio between the number of pixels that are actually accessed during the motion compensation of a block and the number of pixels that are really useful in the reference block. In the original system this ratio is 1, since every accessed pixel is on demand. After a block-based EC algorithm is adopted, this ratio is always greater than 1 because of the nature of block-based embedded compression. Fig. 10 compares the two concepts: the left of Fig. 10 is pixel-based, representing data access without EC, while the right of Fig. 10 is block-based, reflecting the characteristic of EC. Fig. 11 is an example showing how overhead occurs.

Fig. 10 Pixel-based (left) versus block-based (right)

Fig. 11 An example of overhead problem

According to the H.264 standard, a 16x16 macroblock can be divided into 8x8, 8x16 or 16x8 blocks during motion compensation (MC). Furthermore, an 8x8 block can be sub-divided into 8x4, 4x8 or 4x4 sub-blocks. If the compensated block is not aligned with the coded block grid, overhead occurs as depicted in Fig. 11: four coded blocks have to be loaded and decoded to get the required pixels. If the EC scheme is 8x8 block-based and the compensated block is a 4x4 block, we need to load and decode 256 pixels to derive 16 useful pixels, so the overhead in this case is 16. Because of the overhead problem, the relation between the compression ratio of EC and the gain in memory transfer is not direct.
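To make the overhead concrete, the following minimal sketch counts the pixels that must be fetched when a compensated block straddles the EC block grid; the function name and interface are our own illustration, not part of any decoder.

    #include <stdio.h>

    /* Illustrative sketch: a w x h compensated block at pixel position (x, y)
     * overlaps a g x g EC block grid; every touched coded block must be
     * loaded and decoded in full. */
    static int fetched_pixels(int x, int y, int w, int h, int g)
    {
        int bx0 = x / g, by0 = y / g;                      /* first touched block */
        int bx1 = (x + w - 1) / g, by1 = (y + h - 1) / g;  /* last touched block  */
        int blocks = (bx1 - bx0 + 1) * (by1 - by0 + 1);
        return blocks * g * g;                             /* pixels to fetch     */
    }

    int main(void)
    {
        /* A misaligned 4x4 block on an 8x8 grid touches four coded blocks:
         * 256 fetched pixels for 16 useful ones, i.e. an overhead of 16. */
        int fetched = fetched_pixels(6, 6, 4, 4, 8);
        printf("overhead = %d / 16 = %d\n", fetched, fetched / 16);
        return 0;
    }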

Statistics about the overhead phenomenon are provided in [15]. Fig. 12 shows the relation between overhead and encoding bit-rate, simulated with the Stefan sequence; three kinds of EC block grid are presented. Since the H.264 encoder allows macroblock (MB) partitioning and larger motion vectors at high rate (which also means a smaller quantization step and better quality) and favors null vectors with the 16x16 partition at low rate, the overhead increases as the bit rate increases.

Fig. 12 The correlation between bit-rate and overhead (Stefan sequence) simulated with 4x4, 8x8 and 16x16 block grid

Table 2 [15] summarizes the statistical analysis over six sequences. In this table, we can see that relatively still sequences (News, Weather) generate smaller overhead, since the motion vector is often zero, while a fast-motion sequence such as Stefan generates more overhead. Finally, an important conclusion is that the smaller block grid outperforms the larger ones and yields the smallest overhead.

Table 2 Overhead with EC block grid for each sequence

Sequence    4x4 block grid    8x8 block grid    16x16 block grid
Foreman         1.31              1.77               3.69
Flower          1.30              1.74               3.77
News            1.14              1.51               2.78
Silent          1.17              1.50               3.22
Stefan          1.51              2.44               6.95
Weather         1.17              1.49               3.18
All             1.27              1.73               3.93

3.2 Algorithm of Embedded Compressor

We adopt a transform-based algorithm on a 4x4 block grid as our coding algorithm. The first reason is the smallest overhead, according to the statistical results presented in the previous section. It is in fact a trade-off between coding efficiency and overhead: as far as transform algorithms are concerned, the bigger the block grid, the better the coding efficiency that can be achieved. Since we want to gain coding efficiency while keeping the overhead small, the 4x4 block grid is our best choice.

The basic concept of the proposed algorithm is the combination of the DCT with bit-plane zonal coding. The DCT is a well-known technique, so we introduce it only briefly. The two proposed bit-plane zonal coding schemes are the main characters: fine grain bit-plane zonal coding (FGBPZ) is quite efficient and suitable for software applications, while coarse grain bit-plane zonal coding (CGBPZ) is relatively simple and suitable for hardware implementation. Fig. 13 shows the coding flow of the proposed DCT-FGBPZ/CGBPZ algorithm. It is a one-way, open-loop coding scheme, and no iteration is needed. The discrete cosine transform (DCT) is divided into two one-dimensional DCTs, and the DCT coefficients are packed by the proposed FGBPZ or CGBPZ. The details of each part are introduced in the following sections.

Fig. 13 The flow chart of proposed DCT-FGBPZ/CGBPZ embedded compression

3.2.1 Discrete Cosine Transform

The discrete cosine transform (DCT) is a powerful technique for converting a signal into elementary frequency components. It is widely used in image compression, JPEG being the best-known example.

In the human visual system, the eyes are more sensitive to the low-frequency components of a picture and less sensitive to the high-frequency components. Therefore, quality loss in the high-frequency components is relatively unnoticeable. The DCT gathers the relatively important low-frequency components in the upper-left corner and the highest-frequency components in the lower-right corner. Thus the DCT, combined with bit-plane zonal coding whose origin is at the upper-left corner, can collect the information efficiently.

The biggest disadvantage of the DCT, however, is its complexity in hardware design. Since our coding unit is a 4x4 block grid, the complexity of the 4-point DCT is minor while still taking advantage of the transform. The complexity of different DCT sizes is evaluated in Table 3, where two designs are shown: design A is taken from [16] and design B from [17]. Design B focuses on reducing multiplications at the cost of additional additions. In both designs, the 4-point DCT is much simpler than the 8-point and 16-point ones, so the 4-point DCT can be considered the most economical type. Notice that N = 2^m.

Table 3 The complexity of N-point DCT

            Number of Multiplications    Number of Additions
m    N          A          B                A          B
2    4          2          4                6          9
3    8         16         12               26         29
4   16        116         80              194        209
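As a reference for the arithmetic counted in Table 3, the following sketch gives a direct 1-D 4-point DCT-II in floating point; the separable 4x4 transform applies it first to the rows and then to the columns, matching the two one-dimensional DCTs in Fig. 13. A hardware version would use a fixed-point or integer approximation instead of this illustrative form.

    #include <math.h>

    /* Minimal sketch of a 1-D 4-point DCT-II (floating point for clarity). */
    static void dct4(const double in[4], double out[4])
    {
        const double pi = 3.14159265358979323846;
        for (int k = 0; k < 4; k++) {
            double ck = (k == 0) ? sqrt(0.25) : sqrt(0.5);  /* normalization */
            double acc = 0.0;
            for (int n = 0; n < 4; n++)
                acc += in[n] * cos((2 * n + 1) * k * pi / 8.0);
            out[k] = ck * acc;
        }
    }

    /* A 4x4 2-D DCT is two passes of dct4: one over rows, one over columns. */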

3.2.2 Proposed Fine Grain Bit Plane Zonal Coding (FGBPZ)

The modified bit-plane zonal coding (MBPZ) proposed in [20] already achieves quite good coding efficiency, but we are not satisfied yet. To further improve the coding efficiency, we introduce a pre-determined variable length coding with a small codebook.

3.2.2.1 VLC Codebook

Before further modifying the MBPZ of [20], we run a simulation to evaluate the occurrence of each MBPZ type; Fig. 14 shows the result. The naming of types A, B, C and D follows [20] (see Fig. 7). We can see that the appearance probabilities of type B and type C are relatively small, although they have better coding efficiency. Type D is the dominant type, but its header information costs 5 bits: one bit for distinguishing between types and 4 bits for RMAX/CMAX. Therefore, we want to improve the efficiency by adding a small variable length code (VLC) codebook to type D.

[Pie chart: Type B 16%, Type C 11%, Type D 73%]

Fig. 14 The occurrence probability of each type in MBPZ

According to the modified bit-plane zonal coding of [20], the RMAX/CMAX of each bit-plane is accumulated bit-plane by bit-plane and is always larger than or equal to the RMAX/CMAX of the previous plane. Recall that type D is applied when RMAX/CMAX changes. Therefore, when type D is applied, the possible outcomes of the RMAX/CMAX of the next bit-plane are limited: they must be larger than or equal to the RMAX/CMAX of the previous plane, and cannot both remain unchanged.

For example, if the RMAX/CMAX of the current plane is 2/2 and the next plane is coded by type D, the possible RMAX/CMAX outcomes of the next plane must be 3/2, 2/3 or 3/3. Notice that 2/2 could also be the RMAX/CMAX of the next bit-plane, but type D only deals with the situation where RMAX/CMAX differs from the previous bit-plane. These 3 possible outcomes can be fully represented by 1~2 bits instead of the original 4 bits. This explains the opportunity to reduce the codeword length in type D.

Fig. 15 shows the coding flow of FGBPZ with the VLC codebook. This method can save up to four bits whenever type D is applied.

Fig. 15 Coding flow of FGBPZ with VLC codebook. Types A, B, C and D follow [20].

We generate these codes by the Huffman coding method; the probabilities of the next possible RMAX/CMAX, P_{current RMAX/CMAX}[next RMAX/CMAX], are derived from simulation over more than 3000 frames. The codewords in this codebook are fixed.

To cover all possible RMAX/CMAX of the next bit-plane according to the current plane, the needed codebook entries and their related RMAX/CMAX are shown in Table 4. The number of possible outcomes of the next RMAX/CMAX is given by (1). For a 4x4 bit-plane, the rows/columns are marked 0, 1, 2, 3. When type D is applied, at least one of the row or column indices changes. Therefore, equation (1) counts the outcomes that are larger than or equal to the current RMAX/CMAX and then subtracts the one outcome in which both RMAX and CMAX remain equal to those of the current bit-plane.

Next possible outcomes = (4 − Current_RMAX) × (4 − Current_CMAX) − 1        (1)
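As a quick check, for the current RMAX/CMAX of 2/2 in the example above, equation (1) gives (4 − 2) × (4 − 2) − 1 = 3, matching the three outcomes 3/2, 2/3 and 3/3.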

Table 4 The needed codebook entries and their related RMAX/CMAX
(Columns: current RMAX/CMAX; the number of next possible RMAX/CMAX outcomes)

Consider two cases. In case 1, the current RMAX/CMAX is 2/3 and the next RMAX/CMAX is 3/4; in case 2, the current RMAX/CMAX is 3/2 and the next RMAX/CMAX is 4/3. With the original codebook, the codebook index for case 1 is {(2, 3), (3, 4)} and for case 2 is {(3, 2), (4, 3)}. The main difference between case 1 and case 2 is merely the direction of row and column; the two cases are similar even in the probability distribution of each possible next RMAX/CMAX. If we exchange the row for the column, we find that the two cases undergo the same changes. Based on this idea, we introduce our symmetric VLC codebook: by eliminating the bias between row and column in the codebook, symmetric cases can share the same codeword. In this way we reduce the 67-entry codebook to 40 entries.

This idea does not reduce the number of comparisons, but it reduces the codebook size by about 40% (from 67 entries to 40), and the time spent on codebook searching is also reduced.

We now show how to use the symmetric VLC codebook. We denote the current RMAX/CMAX as Rm_cur/Cm_cur and the previous RMAX/CMAX as Rm_pre/Cm_pre. The table lookup can be described as follows:

If (Cm_pre >= Rm_pre)
    the codeword at index {(Cm_pre, Rm_pre), (Cm_cur, Rm_cur)} is applied;
Else
    the codeword at index {(Rm_pre, Cm_pre), (Rm_cur, Cm_cur)} is applied.

Therefore, 40 codewords are enough.
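A minimal sketch of this canonicalization is given below; the struct and function names are illustrative only. The idea is simply that the index pair is swapped into one canonical order, so symmetric cases map to a single codebook entry.

    /* Illustrative sketch: build the canonical codebook key by swapping row
     * and column indices whenever Cm_pre < Rm_pre, so that symmetric cases
     * share one codeword. */
    typedef struct { int cm, rm; } MaxPair;

    static void canonical_key(MaxPair pre, MaxPair cur, MaxPair key[2])
    {
        if (pre.cm >= pre.rm) {                    /* already canonical   */
            key[0] = pre;
            key[1] = cur;
        } else {                                   /* swap row and column */
            key[0] = (MaxPair){ pre.rm, pre.cm };
            key[1] = (MaxPair){ cur.rm, cur.cm };
        }
    }

    /* The decoder applies the same test to Cm_pre/Rm_pre, so the swap
     * never needs to be transmitted. */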

Next, we explain the decoding procedure with the symmetric VLC codebook. After the start plane is decoded, its RMAX/CMAX is known and can be used as the reference. The decoding procedure for the subsequent bit-planes is illustrated in (2):

If (Cm_pre >= Rm_pre)
    the codeword block {(Cm_pre, Rm_pre)} is searched,
    and the result is in (Cm_cur, Rm_cur) order;
Else
    the codeword block {(Rm_pre, Cm_pre)} is searched,
    and the result is in (Rm_cur, Cm_cur) order.        (2)

These switches between RMAX and CMAX in the encoding procedure need not be recorded, since they can be re-derived in the decoding procedure. The final VLC codebook, shown in Table 5, is formed by eliminating the symmetric entries of Table 4. A coding example for FGBPZ is shown in Fig. 16, and Table 6 is the detailed codebook with the codewords.

Table 5 The final 40-entry VLC codebook

Current RMAX/CMAX    Number of next possible RMAX/CMAX outcomes    Huffman code length (bits)
( 1, 0 )                 11  ( 3*4-1 )                                 3~4
( 1, 1 )                  8  ( 3*3-1 )                                 2~4
( 2, 0 )                  7  ( 2*4-1 )                                 2~4
( 2, 1 )                  5  ( 2*3-1 )                                 2~3
( 2, 2 )                  3  ( 2*2-1 )                                 1~2
( 3, 0 )                  3  ( 1*4-1 )                                 1~2
( 3, 1 )                  2  ( 1*3-1 )                                 1
( 3, 2 )                  1                                            0
Summary                  40                                            0~4

Fig. 16 A coding example for FGBPZ

Table 6 The overall codewords in the VLC codebook
(Columns: next RMAX/CMAX, codeword, code length)

Current ( 3, 0 ):
(3, 1)    0     1
(3, 2)    10    2
(3, 3)    11    2

Current ( 3, 1 ):
(3, 2)    0     1
(3, 3)    1     1

3.2.2.2 Data Packing

Since our compression ratio is fixed at two, the budget of coded data is 64 bits. After the DCT and bit-plane zonal coding, we need to pack the coded data into a 64-bit segment before sending it to external memory. First, we reserve 8 bits for the DC coefficient because of its importance in the transform. Second, we use 4 bits to pack the start plane. The rest of the budget, that is to say, 52 bits, is used for storing the AC coefficients. With the help of fine grain bit-plane zonal coding, the AC coefficients are divided into bit-planes and represented by the coding format in Fig. 15. The procedure keeps packing bit-plane by bit-plane until the last bit-plane is reached or the bit budget runs out.
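The 64-bit segment layout can be summarized as the sketch below. The struct is illustrative only (real hardware would pack one raw 64-bit word), and the 8-bit width of the DC field is our reading of the remaining budget, 64 − 4 − 52.

    #include <stdint.h>

    /* Illustrative layout of one coded 4x4 segment at compression ratio 2. */
    typedef struct {
        uint8_t  dc;           /* 8 bits reserved for the DC coefficient   */
        uint8_t  start_plane;  /* 4 bits recording the start plane         */
        uint64_t ac_payload;   /* 52 bits of FGBPZ-coded AC bit-plane data */
    } Segment64;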

[Flowchart: while packing bit by bit, if a newly significant bit is found and the residual bit budget equals 1, pack a "0" instead of the significant bit; otherwise pack the bit; then proceed to the next 4x4 coefficients]

Fig. 17 Protecting mechanism for unknown sign bit

When the budget runs out, the unpacked information will be lost. Recall that a newly significant coefficient must be followed by its sign bit. If a newly significant bit is packed while its sign bit is cut off, the coefficient will be decoded incorrectly. We design a mechanism, shown in Fig. 17, to avoid this situation: if the next bit to pack is a newly significant bit and the rest of the budget is less than two bits, we abort packing this newly significant bit.
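The protection rule can be sketched as follows; the buffer handling and names are our own illustration. A newly significant bit is only packed when at least two bits of budget remain, so its sign bit can never be cut off.

    /* Illustrative sketch of the sign-bit protection in Fig. 17.
     * pos counts bits already packed; budget is 64 for one 4x4 unit. */
    static void put_bit(unsigned char *buf, int *pos, int bit)
    {
        buf[*pos / 8] |= (unsigned char)(bit << (7 - *pos % 8));
        (*pos)++;
    }

    static void pack_next(unsigned char *buf, int *pos, int budget,
                          int bit, int newly_significant, int sign)
    {
        if (*pos >= budget)
            return;                            /* budget exhausted       */
        if (newly_significant) {
            if (budget - *pos < 2) {           /* no room for bit + sign */
                put_bit(buf, pos, 0);          /* pack "0" instead       */
                return;
            }
            put_bit(buf, pos, bit);            /* significant bit ...    */
            put_bit(buf, pos, sign);           /* ... then its sign bit  */
        } else {
            put_bit(buf, pos, bit);
        }
    }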

The final encoding flow chart is shown in Fig. 18. Each bold line in Fig. 18 represents a check point that tests whether we have run out of budget.


Fig. 18 Final encoding flow chart

3.3 Coarse Grain Bit-Plane Zonal Coding (CGBPZ)

FGBPZ, introduced in section 3.2.2, is simple and efficient. The algorithm encodes the coefficients at the bit level, but by our estimation its encoding procedure may cost more than 30 cycles and its decoding procedure more than 10 cycles. FGBPZ is therefore more suitable for software or hardware/software co-design systems. To implement the algorithm as a hardware accelerator, it must be further modified into a simpler version.

The discussion in chapter 6.1 will show the critical problems of embedding a compressor into the system. Taking all these problems into consideration, we propose coarse grain bit-plane zonal coding (CGBPZ). CGBPZ is a trade-off among short cycle count, parallelizability and quality. The details are presented in this section.

Fig. 19 shows the coding format of CGBPZ. All magnitude bit-planes of the AC coefficients are coded in a uniform format: for each bit-plane, we record its RMAX/CMAX with 4 bits and then pack the bits enclosed by RMAX and CMAX. The dependencies between bit-planes are not exploited in CGBPZ.

Fig. 19 CGBPZ coding format for the magnitude of AC coefficients

In CGBPZ, we introduce the concept of the sign bit-plane, which can be considered the union of the sign bits of all coefficients. We pack only the sign bits that are actually used. Because we have only a 64-bit budget for each 4x4 unit, the situation where not all information can be packed happens frequently. Since not every coefficient can be packed, packing the whole sign bit-plane would be wasteful. We therefore take the maximum RMAX and maximum CMAX over the packed bit-planes (from the start plane to the end plane) and pack the sign bit-plane within those two boundaries; this wastes the fewest bits on unused sign bits. The RMAX/CMAX of the sign bit-plane need not be packed during encoding, because they can be derived from the coded bit-planes. Fig. 20 illustrates how we derive the RMAX/CMAX of the sign bit-plane.

Fig. 20 The concept of how to derive the RMAX/CMAX of sign bit-plane from coded bit plane.
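The derivation amounts to a running maximum over the packed planes, as in the sketch below; the array layout and names are our own, with planes indexed so that the LSB plane is 0.

    /* Illustrative sketch: the sign bit-plane is bounded by the maximum
     * RMAX and maximum CMAX over the packed magnitude planes, so its
     * extent never needs to be transmitted. */
    static void sign_plane_bounds(const int rmax[], const int cmax[],
                                  int start_plane, int end_plane,
                                  int *sign_rmax, int *sign_cmax)
    {
        *sign_rmax = 0;
        *sign_cmax = 0;
        for (int p = start_plane; p >= end_plane; p--) {
            if (rmax[p] > *sign_rmax) *sign_rmax = rmax[p];
            if (cmax[p] > *sign_cmax) *sign_cmax = cmax[p];
        }
    }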

Finally, in CGBPZ, the end plane needs to be determined and packed to enable the decoding procedure. Fig. 21 shows the concept of the end plane decision. From the MSB plane to the LSB plane, a calculator estimates the total bit usage accumulated from the most significant plane (MSP) to the current plane. If the total bit usage exceeds 64 bits upon accumulating the Nth bit-plane, the (N+1)th bit-plane becomes the end plane.

Fig. 21 End plane decision
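Under the same plane numbering as above (LSB plane is 0), the decision reduces to the accumulation sketched below; plane_cost[] is a hypothetical per-plane cost of 4 header bits plus the zone bits enclosed by that plane's RMAX/CMAX.

    /* Illustrative sketch of the end plane decision in Fig. 21. */
    static int find_end_plane(const int plane_cost[], int msp)
    {
        int total = 0;
        for (int p = msp; p >= 0; p--) {       /* MSP down toward the LSB  */
            total += plane_cost[p];
            if (total > 64)                    /* Nth plane breaks budget  */
                return p + 1;                  /* (N+1)th plane is the end */
        }
        return 0;                              /* all planes fit           */
    }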

The overall encoding flow is shown in Fig. 22. Finally, there is one small trick: according to the description above, the bit usage accumulated up to the end plane is less than the bit budget, leaving a few bits unused. To make good use of this leftover budget, we keep packing information within the RMAX/CMAX of the sign bit-plane into those unused bits.

Fig. 22 Overall encoding flow of CGBPZ

3.4 Decoding Process and the Compensation

Roughly speaking, the decoding process can be thought of as the inverse of encoding: we take the coded data segments and divide them into the DC coefficient and the AC coefficients.

Since the proposed algorithm is a lossy compression and the lower bit-planes of the AC coefficients are often truncated due to the limited budget, we apply a simple compensation, whose basic concept is shown in Fig. 23. The compensation is applied when the coefficient is nonzero and the end plane is above the least significant bit-plane. This compensation technique can be considered as adding back the median value of the lost bit-planes, and it leads to a satisfying quality improvement. Notice that this compensation is slightly different from [20] and achieves a better quality improvement.

Fig. 23 Proposed compensation technique
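A minimal sketch of this compensation is given below, assuming the end plane index counts from the LSB as 0 and that all planes below it were truncated; the exact rule in Fig. 23 may differ in detail.

    /* Illustrative sketch: add back the midpoint of the truncated range
     * to every nonzero coefficient whose lower planes were cut. */
    static int compensate(int coeff, int end_plane)
    {
        if (coeff == 0 || end_plane == 0)
            return coeff;                      /* nothing was truncated */
        int mid = 1 << (end_plane - 1);        /* median of lost planes */
        return (coeff > 0) ? coeff + mid : coeff - mid;
    }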

3.5 Embedded Result on Software Simulation

Before the discussion, we first define the formula of the PSNR calculation. All the PSNR values in this section are the PSNR of the compressed sequence versus the original sequence. The reason why we choose original sequence

