

3.5 Simulation Result and Comparison

To enhance system performance, our design optimizes area, buffer size, and latency. We use two 0.64kb line SRAMs connected to a 32-bit system bus to keep the decoding pipeline simple, and a 0.688kb SRAM to store reused neighboring pixels.

Table 5: Average cycles needed for decoding a MB in different video sequences.

Test Video Sequence   Intra Prediction @ BL [17]   Proposed Intra Prediction @ HP   Cycle Overhead
Foreman               342.68                       355.81                           3.8%
Grandma               275.63                       285.28                           3.5%
Suzie                 294.90                       307.28                           4.2%

Table 5 shows the average cycles needed by our proposed design to decode an I-MB in different video sequences, for 30fps HD1080 video at a working frequency of 100MHz with MBAFF and Luma intra_8x8. The latency overhead is less than 5% compared to the preliminary architecture [17]. The overall area and buffer memory size for supporting H.264 BP/MP/HP are 14063 gates in UMC 0.18um technology and 688 bits, as shown in Table 6.

The overheads for supporting Luma intra_8x8 are 10% (area) and 7.5% (memory) compared to [16].

Table 6: Comparison results.

                    Chen et al. [16]   Proposal   Overhead
Profile             MP                 HP         -
Process             0.18um             0.18um     -
Working Frequency   87MHz              100MHz     -
Gate Count          12785              14063      10%
Memory (bit)        640                688        7.5%

Chapter 4

Proposed Power Efficient Inter-Layer Intra Prediction Engine in SVC

The Scalable Video Coding (SVC) extension of H.264/AVC is the latest video coding standard. It achieves significant improvements in coding efficiency with an increased degree of supported scalability relative to the scalable profiles of prior video coding standards.

In order to improve coding efficiency in comparison to simulcasting different spatial resolutions, additional so-called inter-layer prediction mechanisms are incorporated in SVC.

For the intra frame decoding, inter-layer intra prediction will be considered in SVC.

When the prediction signal of an enhancement layer macroblock is obtained by inter-layer intra prediction, a new prediction type called Intra_BL is used, for which the corresponding reconstructed intra signal of the co-located 8x8 submacroblock in the reference layer is upsampled. In other words, the enhancement layer offers more intra prediction types (intra_4x4, intra_8x8, intra_16x16, and intra_BL) than traditional H.264 intra prediction. However, the decoding process of Intra_BL is more complex than that of the other intra prediction modes. It resembles motion compensation, which needs an interpolation process to generate the predicted pixels; the difference is that the interpolation coefficients in intra interpolation are not fixed. Therefore, complexity, processing time, and power consumption become the major problems for the decoder.

In this chapter, we propose an architecture design of a high profile inter-layer intra prediction engine in SVC. It supports the new decoding type called Intra_BL and the 7 possible picture type combinations between spatial layers in intra prediction. Besides, we also propose an area efficient interpolator design which saves 26% hardware area compared to direct implementation. Moreover, we further propose a power efficient design, including memory hierarchy improvement and computational complexity reduction, which saves 46.43% of total power consumption compared to our preliminary design.

4.1 Overview

In Chapter 3, the proposed SVC intra prediction engine was illustrated in Figure 18, and the basic intra prediction used to decode traditional H.264 intra blocks for high profile video streams was also described. Another prediction module, Intra_BL prediction, is described in this chapter. Besides, the power efficient proposal based on our original Intra_BL prediction is further described in Section 4.5.

Figure 30: Proposed Intra_BL prediction decoding block diagram.

Figure 30 shows our proposed Intra_BL prediction decoding block diagram. We use a banked SRAM in our Intra_BL module to reduce the number of external SDRAM accesses and to enhance the reuse of prediction source pixels. As mentioned in Section 2.2.3.1, there are a total of 7 picture type prediction combinations between spatial layers. Hence the multiplexer and memory controller are applied to select the corresponding decoding flow and the locations of prediction source pixels in the internal banked SRAM, respectively. The main interpolation architecture designs of basic horizontal interpolation (Basic H_inter.), basic vertical interpolation (Basic V_inter.), and extended vertical interpolation (Extended V_inter.) are illustrated in Section 4.3.

4.2 System Power Modeling

Figure 31: SVC decoder system block diagram.

The power consumption of a video coder system can be considered as the sum of on-chip and off-chip memory power. Regarding off-chip power consumption, during video bitstream decoding many modules read/write large amounts of data from/to the external SDRAM through a system bus, such as the entropy decoder, motion compensation, intra prediction, and deblocking filter in H.264. However, when the new prediction type Intra_BL is taken into account in SVC, intra prediction becomes one of the major power consumers in the system due to the large amount of data fetched from external SDRAM, as shown in Figure 31. Therefore, it generates much more off-chip power consumption in SVC.

Consequently, we model the power consumption of the external SDRAM and internal SRAM and choose the most suitable configuration for our overall system. For off-chip memory, the power modeling is more complicated: not only data access but also I/O and background power must be considered (see Eq. (2)).

P_total = P_on-chip + P_off-chip = P_on-chip + (P_access + P_IO + P_BG)   (2)
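Eq. (2) can be sketched in code as follows; the component values in the demo call are purely illustrative placeholders, not measured figures from this work.

```python
# Sketch of the system power model in Eq. (2):
# P_total = P_on-chip + P_off-chip
#         = P_on-chip + (P_access + P_IO + P_BG)

def total_power(p_on_chip, p_access, p_io, p_bg):
    """Return total power; the off-chip term sums SDRAM access,
    I/O, and background power."""
    p_off_chip = p_access + p_io + p_bg
    return p_on_chip + p_off_chip

# Hypothetical mW figures, for illustration only.
print(total_power(p_on_chip=5.0, p_access=20.0, p_io=12.0, p_bg=8.0))  # 45.0
```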

Figure 32: Analysis of power dissipation on external SDRAM and internal SRAM.

Specifically, we adopt Micron's system-power calculator [20] to model SDRAM power. To estimate the power consumption, we use the memory size and the power calculator as the SRAM and SDRAM power indexes. The resulting curve of internal SRAM versus external SDRAM power consumption is shown in Figure 32. The Micron SDRAM model is configured with supply voltage = 3.3V, tCK = 7ns, DQ = 32, and CK frequency = 145MHz. A good compromise can be selected from the marked region in Figure 32, since it achieves a small SRAM size as well as low SDRAM power dissipation. Therefore, in our proposed Intra_BL prediction module (see Figure 30), the size of the banked SRAM is chosen according to this SDRAM/SRAM analysis.

Figure 33: Banked SRAM and required region for different picture type predictions between spatial layers.

Figure 33 shows the internal banked SRAM used in the Intra_BL prediction module. The total size of this banked SRAM is 3072 bits. We separate this SRAM into four banks; each bank has 24 entries with a 16-bit word length (i.e. 2 pixels). In this way, when the required reference pixels are stored in the banked SRAM, four pixels can be fetched and interpolated in one cycle, as shown on the right side of Figure 33. Note that we also mark the required pixel region in the reference layer for the different prediction types between spatial layers. Due to the position of the stored reference pixels in the external SDRAM, the first and last banks in Figure 33 are used to eliminate redundancy and waste during data reading from SDRAM. Moreover, this kind of partition also makes it convenient to update the pixel source inside the SRAM. When the current MB (or MB pair) has been decoded, we update the first half of each bank and reuse the other half as the new reference prediction source for the next MB (or MB pair). The corresponding update process is illustrated in Figure 34. This banked SRAM not only reduces the system power consumption but also makes data fetching and the update process efficient.

Figure 34: Updating process of banked SRAM.

4.3 Area Efficient Interpolator Design

The main interpolators in the Intra_BL decoding process are the basic interpolator and the extended vertical interpolator. The decoding process of Intra_BL was illustrated in Section 2.2.4.1. For upsampling the luma component, one-dimensional 4-tap FIR filters are applied horizontally and vertically. The chroma components are upsampled using a simple bilinear filter. First, one of the coefficient sets in Table 3 is chosen according to the phase_idx signal (i.e. the interpolation coefficients are not fixed), and then these coefficients are multiplied by 4 (or 2, for luma or chroma respectively) reference pixels. Finally, the scaled pixels are added and shifted to generate one prediction output, as described in Eq. (3) and Eq. (4).

(Luma)

L_Pred_out1 = L_coef1*L_ref_pixel1 + L_coef2*L_ref_pixel2
            + L_coef3*L_ref_pixel3 + L_coef4*L_ref_pixel4   (3)

(Chroma)

C_Pred_out1 = C_coef1*C_ref_pixel1 + C_coef2*C_ref_pixel2   (4)
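As a software sketch (not the RTL), the two filters in Eq. (3) and Eq. (4) can be written as below. The demo luma coefficients (-3, 19, 19, -3) are the phase_idx = 8 set from Table 8 taken with their signs, and the demo chroma coefficients (16, 16) are the phase_idx = 8 row of Table 7; the final add-and-shift normalization step is omitted here.

```python
# Software model of the basic interpolation in Eq. (3) and Eq. (4).

def luma_basic_interp(coefs, ref_pixels):
    """4-tap FIR: sum of coef_i * ref_pixel_i (Eq. (3))."""
    assert len(coefs) == len(ref_pixels) == 4
    return sum(c * p for c, p in zip(coefs, ref_pixels))

def chroma_basic_interp(coefs, ref_pixels):
    """Bilinear filter: coef1*ref1 + coef2*ref2 (Eq. (4))."""
    return coefs[0] * ref_pixels[0] + coefs[1] * ref_pixels[1]

# Four equal reference pixels of value 100 with the half-phase luma
# set: (-3 + 19 + 19 - 3) * 100 = 3200 before the final shift.
print(luma_basic_interp((-3, 19, 19, -3), (100, 100, 100, 100)))  # 3200
print(chroma_basic_interp((16, 16), (10, 20)))                    # 480
```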

The interpolation process mentioned above is referred to as basic interpolation. Another interpolation process, called extended vertical interpolation, uses the outputs of basic interpolation to further generate the vertically correlated pixels in some cases of prediction types between spatial layers, as described in Eq. (5) and Eq. (6).

(Luma)

L_Pred_out1 = (-3)*L_B_int_out1 + 19*L_B_int_out2
            + 19*L_B_int_out3 + (-3)*L_B_int_out4 + 16   (5)

(Chroma)

C_Pred_out1 = C_B_int_out1 + C_B_int_out2 + 1   (6)

4.3.1 Hybrid Basic Interpolator Design

Our proposed architecture, shown in Figure 35, is divided into three parts: coef_generator, pixel_shifter, and scaling_engine. Note that the bilinear filter used for chroma samples is embedded in our architecture to reduce the area cost.

Figure 35: Proposed architecture of basic interpolator.

4.3.1.1 Coef_generator

For the chroma interpolation coefficient sets, if the coefficients are translated into binary form as in Table 7, we can notice that both coef1 and coef2 are highly related to the phase_idx. The first coefficient coef1 is the left shift of phase_idx, and the second coefficient coef2 is the 2's complement of the shifted phase_idx. Hence, we can easily use these relationships to construct our hardware architecture, as shown in Figure 36. On the other hand, the luma table has weaker relationships than the chroma table. However, we can still use some techniques to reduce the area. In Table 8, if we split the table at the ninth coefficient set, each set has the same coefficients as its symmetrical counterpart, except that coef2 and coef3 should be exchanged, and likewise coef1 and coef4. Therefore, taking the 2's complement of the phase_idx as a control signal, we can easily merge the last 7 coefficient sets into the first 7 coefficient sets.

Furthermore, positive coefficients are used for coef1 and coef4 instead of the original negative representations. We then separate the positive values (left part in Figure 35) and the negative values (right part in Figure 35) to the two sides, and move the subtraction to the end of the interpolation calculation (i.e. the last adder in Figure 35). In this way, all the wires before the last adder can use fewer bits, which further reduces the area cost. The luma coef_generator architecture is illustrated in Figure 37. The luma_coef sets 2~4 are just like set 1; only the coefficients differ.

Table 7: Chroma Table in Binary Form.

Phase idx Chr Coef1 Chr Coef2

0000 000000 100000

0001 000010 011110

0010 000100 011100

0011 000110 011010

0100 001000 011000

0101 001010 010110

0110 001100 010100

0111 001110 010010

1000 010000 010000

1001 010010 001110

1010 010100 001100

1011 010110 001010

1100 011000 001000

1101 011010 000110

1110 011100 000100

1111 011110 000010

Figure 36: The architecture of chroma coef_generator.
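The two chroma relationships above can be sketched as follows and checked against Table 7; `chroma_coefs` is an illustrative function name, not an RTL module.

```python
# Sketch of the chroma coef_generator relationships: coef1 is
# phase_idx shifted left by one bit, and coef2 is its complement
# with respect to 32, reproducing every row of Table 7.

def chroma_coefs(phase_idx):
    coef1 = phase_idx << 1        # left shift of phase_idx
    coef2 = 32 - coef1            # 2's-complement-style counterpart
    return coef1, coef2

# The bilinear weights are complementary: every row sums to 32.
for phase in range(16):
    c1, c2 = chroma_coefs(phase)
    assert c1 + c2 == 32

print(chroma_coefs(0b0101))  # (10, 22), i.e. 001010 / 010110 in Table 7
```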

Table 8: Luma Table in Binary Form.

Phase_idx   Lu_Coef1 (use positive)   Lu_Coef2   Lu_Coef3   Lu_Coef4 (use positive)
0000        000000                    100000     000000     000000
0001        000001                    100000     000010     000001
0010        000010                    011111     000100     000001
0011        000011                    011110     000110     000001
0100        000011                    011100     001000     000001
0101        000100                    011010     001011     000001
0110        000100                    011000     001110     000010
0111        000011                    010110     010000     000011
1000        000011                    010011     010011     000011
1001        000011                    010000     010110     000011
1010        000010                    001110     011000     000100
1011        000001                    001011     011010     000100
1100        000001                    001000     011100     000011
1101        000001                    000110     011110     000011
1110        000001                    000100     011111     000010
1111        000001                    000010     100000     000001
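The symmetry that allows merging the last 7 coefficient sets into the first 7 can be checked numerically. The list below restates Table 8 in decimal, with coef1 and coef4 as positive magnitudes as in the text.

```python
# Table 8 in decimal form: (coef1, coef2, coef3, coef4) per phase_idx,
# coef1/coef4 stored as magnitudes (their true signs are negative).
LUMA_TABLE = [
    (0, 32, 0, 0),  (1, 32, 2, 1),  (2, 31, 4, 1),  (3, 30, 6, 1),
    (3, 28, 8, 1),  (4, 26, 11, 1), (4, 24, 14, 2), (3, 22, 16, 3),
    (3, 19, 19, 3), (3, 16, 22, 3), (2, 14, 24, 4), (1, 11, 26, 4),
    (1, 8, 28, 3),  (1, 6, 30, 3),  (1, 4, 31, 2),  (1, 2, 32, 1),
]

# For phase_idx p >= 9, the set equals the set for (16 - p) with
# coef1/coef4 and coef2/coef3 exchanged, so sets 9..15 can be merged
# into sets 1..7 in hardware.
for p in range(9, 16):
    c1, c2, c3, c4 = LUMA_TABLE[p]
    assert (c1, c2, c3, c4) == tuple(reversed(LUMA_TABLE[16 - p]))

print("symmetry holds for phase_idx 9..15")
```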

Figure 37: The architecture of luma coef_generator.

4.3.1.2 Pixel_shifter

In order to replace the multipliers, pixel_shifter1 and pixel_shifter2 are used to generate the scaled units of the reference pixels. As shown in Figure 38 (a) and Figure 38 (b), pixel_shifter1 generates six scaled units for chr_coef1, chr_coef2, lu_coef2, and lu_coef3, and pixel_shifter2 generates three scaled units for lu_coef1 and lu_coef4. In particular, lu_coef1 and lu_coef4 need only three scaled units thanks to their positive representations. Further, a multiplexer is needed to swap the source pixels so that they match the right coefficients when the phase_idx selects one of the last-half coefficient sets, which are merged into the upper coefficient sets of the luma table.


Figure 38: (a) Pixel_shifter1 generates six scaled sets, and (b) pixel_shifter2 generates three scaled sets.

4.3.1.3 Scaling_engine

The last parts are scaling_engine1 and scaling_engine2, composed of four adders and one adder, respectively, as shown in Figure 39 (a) and Figure 39 (b). Every coefficient in the luma or chroma table can be composed from the scaled units produced by the pixel_shifters; all that is needed is to select the right combination of scaled values in the scaling_engines. In fact, each bit of the coefficient in binary form is exactly the control signal for the corresponding multiplexer in the scaling_engine. For example, if the sixth coefficient of lu_coef3 is chosen, eleven times the reference pixel is needed. First, we write the coefficient 11 in binary form as 001011. Then the first, second, and fourth bits select the non-scaled, 2-times, and 8-times reference pixels, respectively, while the other three zero bits select the zero value. Finally, these selected scaled units are added (1+2+0+8+0+0=11) to obtain the final scaled result.
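The shift-and-add selection described above can be sketched as follows; `scale_by_coef` is an illustrative name for what a pixel_shifter and scaling_engine accomplish together.

```python
# Sketch of the scaling_engine idea: each set bit of the binary
# coefficient selects a pre-shifted copy of the reference pixel, and
# the adder tree sums the selected copies, replacing a multiplier.

def scale_by_coef(pixel, coef, width=6):
    """Shift-and-add multiply: sum of (pixel << i) for each set bit i."""
    total = 0
    for i in range(width):
        if (coef >> i) & 1:          # bit i acts as a mux select
            total += pixel << i      # pre-scaled unit: pixel * 2^i
        # unset bits contribute zero, as in the text's example
    return total

# Coefficient 11 = 0b001011 selects the x1, x2, and x8 units:
# (1 + 2 + 8) * pixel = 11 * pixel.
print(scale_by_coef(7, 11))  # 77
```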


Figure 39: The architecture of (a) scaling_engine1 and (b) scaling_engine2.

4.3.2 Share-Based Extended Vertical Interpolator Design

Another interpolator, the extended vertical interpolator, is proposed in Figure 40. We first expand and rewrite Eq. (5) as Eq. (7), and then recombine Eq. (7) into a new form in Eq. (8). In Eq. (8), the common term (A+B+1) is merged together; this common term also represents the equation for chroma extended vertical interpolation in Eq. (6). Therefore, because the common term and the two filters are merged together, the area cost becomes more efficient.

L_Pred_out1 = (-3)*L_B_int_out1 + (16*L_B_int_out2 + 3*L_B_int_out2)
            + (16*L_B_int_out3 + 3*L_B_int_out3) + (-3)*L_B_int_out4 + (16 - 3 + 3)
            = (16*L_B_int_out2 + 16*L_B_int_out3 + 16)
            - (3*L_B_int_out1 + 3*L_B_int_out4 + 3)
            + (3*L_B_int_out2 + 3*L_B_int_out3 + 3)   (7)

L_Pred_out1 = 16*(L_B_int_out2 + L_B_int_out3 + 1)
            - 3*[(L_B_int_out1 + L_B_int_out4 + 1) - (L_B_int_out2 + L_B_int_out3 + 1)]   (8)
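A quick numeric check (independent of the hardware) confirms that the recombined form in Eq. (8) matches the original filter in Eq. (5):

```python
# Verify that the share-based form Eq. (8) equals the original Eq. (5).

def eq5(b1, b2, b3, b4):
    """Original extended vertical filter, Eq. (5)."""
    return -3 * b1 + 19 * b2 + 19 * b3 - 3 * b4 + 16

def eq8(b1, b2, b3, b4):
    """Recombined form, Eq. (8), built around the shared (A+B+1) term
    that also appears in the chroma filter, Eq. (6)."""
    common = b2 + b3 + 1
    return 16 * common - 3 * ((b1 + b4 + 1) - common)

for sample in [(0, 0, 0, 0), (10, 20, 30, 40), (255, 1, 128, 7)]:
    assert eq5(*sample) == eq8(*sample)
print("Eq. (5) and Eq. (8) agree")
```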

Figure 40: Proposed architecture of extended vertical interpolator.

4.4 Short Summary

In short summary, we propose an Intra_BL prediction module which contains a banked SRAM, a basic interpolator, and an extended vertical interpolator. Our proposed basic interpolator design has better area efficiency than the direct implementation shown in Figure 41. Table 9 gives the area comparison between the direct implementation and the proposed basic interpolator design. Simulation results show that 26% of the total area cost is saved in our proposed basic interpolator. The area of the proposed share-based extended vertical interpolator is also listed in Table 9.

Figure 41: Direct implementation of basic interpolator.

Table 9: Simulation results of proposed basic interpolator and extended vertical interpolator.

In our SVC system, the target is to support two spatial layers: HD720 (base layer) and HD1080 (enhancement layer). However, we can also support more spatial layers, such as qcif (base layer) – cif (enhancement layer 1) – 4cif (enhancement layer 2) – HD1080 (enhancement layer 3), as long as the total number of MBs per second is less than 352800 MBs/second ({[(1920x1088) + (1280x720)] / (16x16)} x 30 frames/s).
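The quoted bound follows from straightforward arithmetic on the two target layers:

```python
# Checking the macroblock-rate budget quoted above: the HD720 + HD1080
# two-layer target at 30 fps gives the 352800 MBs/second bound.

MB_PIXELS = 16 * 16
layers = [(1920, 1088), (1280, 720)]   # enhancement + base layer
mbs_per_frame = sum((w * h) // MB_PIXELS for w, h in layers)
mbs_per_second = mbs_per_frame * 30
print(mbs_per_second)  # 352800
```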

Therefore, our Intra_BL prediction module uses a total of two basic horizontal interpolators, two basic vertical interpolators, and four extended vertical interpolators. The execution times for decoding a MB in each prediction type between spatial layers are listed in Table 10. Note that we calculate the processing cycle time in the worst case. Hence, for the frame-MBAFF and MBAFF-MBAFF prediction types, the worst cases for decoding the enhancement blocks are the bottom field MB and the frame MB, respectively.

Table 10: Interpolation execution time for a MB in worst case.


The details of the proposed Intra_BL prediction module are given in Table 11. We use UMC 90nm technology to simulate our work. The total gate count is 18673 at a working frequency of 145MHz, and the size of the internal SRAM used is 5376 bits (luma and chroma). Among the prediction types between spatial layers, the most critical case is the bottom field block in the enhancement layer of frame-MBAFF prediction. The results show that the worst-case interpolation execution time for this prediction type is 312 cycles/MB.

4.5 Memory Hierarchy Improvement and Computational Complexity Reduction

Based on our preliminary basic intra prediction and Intra_BL prediction designs in Chapter 3 and the previous sections of Chapter 4, Figure 42 shows the internal core power consumption of intra prediction at a working frequency of 145MHz. We can see that the power consumption becomes higher when Intra_BL prediction is supported in SVC, because both the memory fetching and the computational complexity increase.

The power consumption of SVC intra prediction consists of internal SRAM power, dynamic power, and leakage power, as shown in Figure 43. Among these three parts, dynamic and SRAM power dominate the total power consumption. Therefore, in the following sections, we reduce these two major power components and propose a power efficient Intra_BL prediction engine.

Figure 42: Intra power consumption in H.264/AVC and SVC.

Figure 43: Power organization of Intra_BL prediction engine.

4.5.1 Internal SRAM Access Improvement

In Figure 43, internal SRAM occupies about half of the total power consumption. In order to reduce this significant portion, we improve the memory hierarchy of the internal memory access. Four register sets are added as the second stage of the memory hierarchy, as described in Figure 44. These register sets shift a large part of the costly SRAM accesses into register accesses. Note that, in order to minimize the power overhead of these register sets, gated clocks are applied to each register set.

Figure 44: Memory hierarchy improvement of Intra_BL prediction engine.

These four register sets are V_BHI, V_BI, H_BI, and H_ST. The definitions of them are illustrated below.

 V_BHI: Store vertical direction values that passed through basic horizontal interpolation.

 V_BI: Store vertical direction values that passed through basic horizontal and vertical interpolations.

 H_BI: Store horizontal direction values that passed through basic horizontal and vertical interpolations.

 H_ST: Store horizontal direction values that come from internal banked SRAM.

V_BHI and V_BI store and reuse the vertical related values, while H_BI and H_ST store and reuse the horizontal related values. Most data fetches are served by the register sets instead of reading directly from the SRAM. As an example, we take the most critical case, which uses the frame picture type in the base layer and the MBAFF picture type (bottom field MB) in the enhancement layer. Some definitions of the representations used in the following figures of this example are listed in Figure 45.

Figure 45: Some definitions of representations.

Figure 46, Figure 47, Figure 48, and Figure 49 illustrate the decoding process of this example. We take two 4x4 blocks in the enhancement layer as a decoding unit. When the last three rows of the co-located region in the reference layer are decoded, the outputs of the basic horizontal interpolation and the basic vertical interpolation are stored in the V_BHI and V_BI sets, respectively, for use by the next unit below the current one. For the H_ST set, the data are stored in and read from the H_ST set except for the first basic horizontal interpolation. In particular, when 4x4 blocks 2 and 3 of the enhancement layer are being decoded, the basic horizontal interpolation does not have to start from the first row of the co-located region, because the required values have already been stored in the V_BHI and V_BI sets, as shown in Figure 47. Therefore, the required values can be obtained directly from the V_BHI and V_BI sets during cycles 31 to 42 in Figure 47, and some calculations can be removed. Furthermore, the prediction outputs are also stored in H_BI at the end of the decoding unit. These stored prediction outputs can be reused in the next decoding unit to the right of the current one, as described in cycles 43 to 48 and cycles 63 to 66 in Figure 48 and Figure 49, respectively.

Figure 46: An example of using these register sets.

Figure 47: An example of using these register sets.

Figure 48: An example of using these register sets.

Figure 49: An example of using these register sets.
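The reuse idea can be sketched abstractly as a two-stage memory hierarchy in which repeated reads hit small register sets instead of the banked SRAM. This is an illustrative model, not a cycle-accurate description of the design; the class and field names are invented for the sketch.

```python
# Abstract model of the second-stage register sets: values fetched or
# produced once are held in registers and replayed for a neighbouring
# decoding unit, so repeated reads avoid the banked SRAM.

class MemoryHierarchy:
    def __init__(self):
        self.sram_reads = 0
        self.reg_file = {}           # stand-in for V_BHI/V_BI/H_BI/H_ST

    def read(self, addr):
        if addr in self.reg_file:    # reuse hit: no SRAM access
            return self.reg_file[addr]
        self.sram_reads += 1         # miss: fetch from banked SRAM
        value = addr * 2             # stand-in for real pixel data
        self.reg_file[addr] = value  # keep for the neighbouring unit
        return value

mem = MemoryHierarchy()
# First decoding unit touches 8 addresses; the next unit re-reads them.
for _ in range(2):
    for addr in range(8):
        mem.read(addr)
print(mem.sram_reads)  # 8 (instead of 16 without the register sets)
```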

Table 12 shows the power reduction in internal SRAM when the four second-stage register sets are applied to the memory hierarchy. Simulation results show that 62.3% of the total internal SRAM power can be saved.
