Chapter 1. Introduction

1.5. Organization of This Thesis

In this thesis, we propose three designs: a new low-power, table-partition VLD for dual standards; a new group-based, high-throughput VLC codec system with full programmability for dual standards; and a new soft VLD that handles the error-resilience problem. The thesis is organized as follows. Chapter 2 gives an overview of CAVLC and presents the new low-power, table-partition VLD for dual standards. Chapter 3 describes the algorithm and architectures of the proposed group-based, high-throughput VLC codec system. Chapter 4 introduces the proposed error-resilient CAVLD. Finally, Chapter 5 gives conclusions and future work.

Chapter 2. A Low Power VLC Decoder Design

2.1. Overview of CAVLC Encoder and Decoder

2.1.1. Encoding Process Flow

Figure 2-1 : The encoding process flow of CAVLC

Figure 2-1 shows the encoding process flow. The detailed steps are as follows.

• When a 2x2 or 4x4 block is received, the coefficient-scanning procedure records the symbols to be encoded. There are six symbols: TotalCoeff, TrailingOnes, trailing_ones_sign_flag, level, total_zeros, and run_before. TotalCoeff is the total number of non-zero coefficients; TrailingOnes is the number of trailing +/-1 coefficients, and its value must be smaller than four; trailing_ones_sign_flag is the sign of each trailing one; level is the value of a non-zero coefficient; total_zeros is the number of zeros before the last non-zero coefficient in zigzag-scan order; run_before is the number of zeros immediately preceding each non-zero coefficient in zigzag-scan order. Figure 2-2 shows the results derived by the coefficient-scanning procedure.

Figure 2-2 : An example of CAVLC coefficients scanning

• Encode TotalCoeff and TrailingOnes (coeff_token). There are five look-up tables to choose from for encoding coeff_token. The choice of table depends on a variable named nC, and Figure 2-3 shows how to calculate the value of nC.

Figure 2-3 : How to calculate the value of nC

• Encode the sign of each trailing one in reverse order.

• Encode each level in reverse order. There are seven VLC tables to choose from, Level_VLC0 to Level_VLC6.

• Encode total_zeros.

• Encode run_before.
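The table choice in the coeff_token step can be sketched as follows. This is a minimal sketch that assumes the neighbor-averaging rule of the H.264/AVC standard, since the content of Figure 2-3 is not reproduced in the text. The function names and the table labels (`VLC0`, `VLC1`, `VLC2`, `FLC`, `CHROMA_DC`) are illustrative and are not the thesis's own identifiers.

```python
def compute_nc(n_a, n_b):
    """Return nC from the TotalCoeff of the left (n_a) and upper (n_b) blocks.

    A value of None means that the neighboring block is not available.
    """
    if n_a is not None and n_b is not None:
        return (n_a + n_b + 1) >> 1  # rounded average of both neighbors
    if n_a is not None:
        return n_a
    if n_b is not None:
        return n_b
    return 0


def select_coeff_token_table(nc):
    """Map nC to one of five coeff_token tables (labels are hypothetical)."""
    if nc == -1:
        return "CHROMA_DC"  # 2x2 chroma DC blocks use a dedicated table
    if nc < 2:
        return "VLC0"
    if nc < 4:
        return "VLC1"
    if nc < 8:
        return "VLC2"
    return "FLC"  # fixed-length 6-bit codewords


print(select_coeff_token_table(compute_nc(1, 2)))
```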

Table 2-1 lists the result of encoding the example in Figure 2-2 and the transmitted bitstream for this block is 000010001110010111101101.

Element | Value | Code
coeff_token | TotalCoeff = 5, TrailingOnes = 3 | 0000100
T1 sign (4) | + | 0
T1 sign (3) | - | 1
T1 sign (2) | - | 1
Level (1) | +1 | 1
Level (0) | +3 | 0010
total_zeros | 3 | 111
run_before (4) | zerosLeft = 3; run_before = 1 | 10
run_before (3) | zerosLeft = 2; run_before = 0 | 1
run_before (2) | zerosLeft = 2; run_before = 0 | 1
run_before (1) | zerosLeft = 2; run_before = 1 | 01
run_before (0) | zerosLeft = 1; run_before = 1 | No code required; last coefficient

Table 2-1 : The result of encoding the example in Figure 2-2
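The coefficient-scanning step that produces the values in Table 2-1 can be sketched in software. This is only an illustration, not the hardware scanner of the thesis. The function name `scan_symbols` is hypothetical, and the input is assumed to be the block already arranged in zigzag order (the array 0, 3, 0, 1, -1, -1, 0, 1 padded with zeros to 16 entries).

```python
def scan_symbols(zigzag):
    """Derive the CAVLC symbols of one block given in zigzag order."""
    nz = [i for i, c in enumerate(zigzag) if c != 0]  # non-zero positions
    total_coeff = len(nz)

    # Count up to three consecutive +/-1 values from the high-frequency end.
    trailing_ones = 0
    for i in reversed(nz):
        if abs(zigzag[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Zeros located before the last non-zero coefficient.
    total_zeros = (nz[-1] + 1 - total_coeff) if nz else 0

    # Zeros immediately preceding each non-zero coefficient, reverse order.
    run_before = []
    for k in range(total_coeff - 1, -1, -1):
        prev = nz[k - 1] if k > 0 else -1
        run_before.append(nz[k] - prev - 1)

    reversed_values = [zigzag[i] for i in reversed(nz)]
    return {
        "TotalCoeff": total_coeff,
        "TrailingOnes": trailing_ones,
        "T1_signs": ["+" if v > 0 else "-" for v in reversed_values[:trailing_ones]],
        "levels": reversed_values[trailing_ones:],
        "total_zeros": total_zeros,
        "run_before": run_before,
    }


block = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
print(scan_symbols(block))
```

The returned `run_before` list contains one entry per non-zero coefficient; as Table 2-1 shows, the entry for the last coefficient needs no codeword.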

2.1.2. Decoding Process Flow

Figure 2-4 : The decoding process of CAVLC

Figure 2-4 shows the decoding process flow of CAVLC. The decoding procedure mirrors the encoding steps, with two differences: the decoder does not scan coefficients, and each remaining step decodes the bitstream instead of encoding symbols. Table 2-2 shows an example of CAVLC decoding; the final output array is 0, 3, 0, 1, -1, -1, 0, 1.

Code | Element | Value | Output array
0000100 | coeff_token | TotalCoeff = 5, TrailingOnes = 3 | Empty
0 | T1 sign | + | 1
1 | T1 sign | - | -1,1
1 | T1 sign | - | -1,-1,1
1 | level | +1 | 1,-1,-1,1
0010 | level | +3 | 3,1,-1,-1,1
111 | total_zeros | 3 | 3,1,-1,-1,1
10 | run_before | 1 | 3,1,-1,-1,0,1
1 | run_before | 0 | 3,1,-1,-1,0,1
1 | run_before | 0 | 3,1,-1,-1,0,1
01 | run_before | 1 | 3,0,1,-1,-1,0,1

Table 2-2 : An example of CAVLC decoding from the result of Table 2-1
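The last stage of the decoding flow, which places the decoded levels into the output array by using total_zeros and run_before, can be sketched as follows. This is a software illustration under stated assumptions, not the thesis's hardware: the name `rebuild_coefficients` is hypothetical, the levels (including trailing ones) are assumed to arrive in reverse order, and a missing run_before entry is treated as zero.

```python
def rebuild_coefficients(levels_rev, total_zeros, run_before, max_coeff=16):
    """Place decoded levels into a zigzag-ordered coefficient array.

    levels_rev : non-zero coefficients, highest frequency first
    run_before : decoded run_before values in the same order; the last
                 coefficient has no codeword, so the list may be shorter
    """
    out = [0] * max_coeff
    total_coeff = len(levels_rev)
    pos = total_coeff + total_zeros - 1  # index of the last non-zero value
    for k, level in enumerate(levels_rev):
        out[pos] = level
        run = run_before[k] if k < len(run_before) else 0
        pos -= run + 1  # skip the zeros that precede this coefficient
    return out


coeffs = rebuild_coefficients([1, -1, -1, 1, 3], 3, [1, 0, 0, 1])
print(coeffs[:8])
```

With the symbols of Table 2-2, the first eight entries are 0, 3, 0, 1, -1, -1, 0, 1, which agrees with the final output array stated in the text.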

2.2. Overview of the Proposed Architecture

Figure 2-5 shows the functional diagram of the proposed CAVLC decoder architecture. As introduced in Section 2.1.2, decoding the symbols involves five major parts. To support MPEG-2 VLC decoding, we place the MPEG-2 VLC tables in the coeff_token part, because the two decoding procedures work in a similar manner. This part is described in a later section. The prefix-zero buffer and the bitstream buffer are used for table partition and for table realization with the arithmetic method. The coeffNum unit calculates the correct position in the coefficient buffer for the current level held in the level buffer. To reduce power, all function units are controlled by enable signals, because they never need to work at the same time. There is also a hold signal for the prefix-zero buffer, which prevents it from counting zeros that do not belong to the prefix. Without an enable or hold signal, an idle function unit would keep switching and dissipate power unnecessarily.

[Figure 2-5 block labels: bitstream buffer [11:0]; coeff_token & MPEG-2 VLC tables; trailing_ones_sign_flag; level; total_zeros; run_before; prefix-zero buffer; controller (is_cavlc, MPEG-2, is_b14, is_b15, maxNumCoeff, hold_prefix0); level buffers; run buffers; coefficients buffers; coeffNum; Enable; coefficients]

Figure 2-5 : Overview of the proposed low power architecture

2.3. Table Partition

In VLSI design, an effective way to reduce dynamic power consumption is to decrease data switching. However, most CAVLC decoder designs use an FSM to look up the VLC tables. When the input bitstream that addresses the look-up table changes frequently, it causes considerable power dissipation. Moreover, a change in a large look-up table dissipates more power than the same change in a small one. Good table partition therefore reduces both the size of the look-up table and the data switching, which lowers power consumption.

Figure 2-6 : An example of proposed table partition

Figure 2-6 shows an example of the proposed table partition. Although the original codeword table has only 10 entries, its longest codeword has 6 bits, so the FSM method requires a look-up table with 32 entries. That is, the length of the longest codeword, not the number of real entries, determines the size of the look-up table. With the proposed table partition method, the first access to the table involves only 4 entries, and each later access involves only the entries of the corresponding suffix table. Because this approach divides the tables according to the prefix zeros, we call it prefix-zero table partition (PZTP).

When we access the look-up table with PZTP, the number of entries searched in each cycle is much smaller than in the original table. The longer the longest codeword, the greater the difference between the entries searched with PZTP and the original entries.

The way to build the look-up table is as follows:

• Build the first layer of the look-up table from the leading zeros of each codeword, which we call the prefix (see the prefix item in Figure 2-6).

• Build the second layer of the look-up table from the suffix, which is the codeword without the leading zeros and the first 1.

The steps to look up the VLC tables are as follows:

• Count the leading zeros until the first 1 appears, and choose the corresponding suffix table by the prefix.

• Look up the suffix table with the following input bits, and find the required symbols.
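The build and look-up steps above can be sketched in software as follows. This is a minimal sketch, not the thesis's hardware implementation: the codebook `CODEBOOK` is a hypothetical prefix-free code chosen for illustration (it is not a CAVLC or MPEG-2 table), and the function names are assumptions. The sketch also assumes that every codeword contains at least one 1 and that the input bit string is well formed.

```python
def build_pztp(codebook):
    """Split a codeword table into suffix tables keyed by prefix zeros."""
    tables = {}
    for code, symbol in codebook.items():
        first_one = code.index("1")      # number of leading zeros
        suffix = code[first_one + 1:]    # bits after the first 1
        tables.setdefault(first_one, {})[suffix] = symbol
    return tables


def decode_pztp(bits, tables):
    """Decode a bit string by prefix counting, then suffix look-up."""
    out = []
    pos = 0
    while pos < len(bits):
        prefix = 0
        while bits[pos] == "0":          # step 1: count leading zeros
            prefix += 1
            pos += 1
        pos += 1                         # consume the first 1
        suffix_table = tables[prefix]    # select the small suffix table
        suffix = ""
        while suffix not in suffix_table:  # step 2: search the suffix only
            suffix += bits[pos]
            pos += 1
        out.append(suffix_table[suffix])
    return out


# Hypothetical prefix-free codebook used only for this illustration.
CODEBOOK = {"1": "a", "011": "b", "010": "c", "0011": "d", "0010": "e"}
TABLES = build_pztp(CODEBOOK)
print(decode_pztp("10100011", TABLES))
```

In this sketch each suffix table has at most two entries, whereas a flat table indexed by the 4-bit longest codeword would need 16 entries.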

[Figure 2-7 block labels: TotalCoeff Decoder & Table B14 or B15; Selector; is_cavlc; MPEG-2; Prefix zeros decoder]

Figure 2-7 : The PZTP VLC decoder architecture of coeff_token

Figure 2-7 shows the PZTP VLC decoder (VLD) architecture of coeff_token. CAVLD has five tables, NUM_VLC0, NUM_VLC1, NUM_VLC2, NUM_VLC3, and NUM_FLC; the other two tables, Table B14 and Table B15, belong to the MPEG-2 VLD. The implementation of NUM_FLC is introduced in the next section. If neither of the two enable signals, is_cavlc and MPEG-2, is active, the entire PZTP VLD is shut down to avoid dynamic power dissipation due to data switching. If either of them is active, the controller (TotalCoeff Decoder & Table B14 or B15) opens only one of the tables to save power. The two signals must not be active at the same time.

Assume that we are executing H.264/AVC decoding. Even if the present decoding procedure is coeff_token, the enable signal is_cavlc is not active at the beginning. To avoid unnecessary power consumption, we activate the enable signal only when we receive the first 1 of the codeword, that is, the boundary of the prefix. Therefore, while the prefix is being received, only the accumulator consumes power. We do the same when executing MPEG-2 VLD.

As shown in Figure 2-5, we store the value of the prefix in the prefix-zero buffer. When we begin receiving the suffix of the codeword and looking up the suffix table, the value of the prefix is fixed. Therefore, we can treat the output of the prefix zeros decoder in Figure 2-7 as an enable signal for the corresponding suffix table during the suffix look-up. At this point, the number of entries searched in the entire codeword table equals the number of entries in the suffix table. The largest suffix table has 8 entries for coeff_token and 16 entries for MPEG-2 VLD.

PZTP takes advantage of the structure of Huffman coding to decrease both the data switching when accessing the look-up table and the hardware cost of the VLC tables. Another advantage is that it is easy to implement, so total_zeros and run_before also adopt this method in the proposed CAVLD.

2.4. Table Realization with Arithmetic Method

2.4.1. NUM_FLC of coeff_token

Every codeword in this look-up table is 6 bits long, and the table has 62 entries. Building the table with the FSM method may seem a reasonable choice. However, if we analyze the relationship between the codewords and the symbols, we find some arithmetic rules.

Figure 2-8 : An example of NUM_FLC

Figure 2-8 shows an example of NUM_FLC. The left table is the original NUM_FLC table, and the right table is obtained by separating each codeword into two fields. Except for the first row, the following arithmetic relationship holds throughout NUM_FLC:

TotalCoeff = codeword[5:2] + 1, TrailingOnes = codeword[1:0]

Although the first row of NUM_FLC does not fit this rule, it is the only codeword whose prefix is 4. For power consideration, we only access this part when we receive the sixth bit of the codeword.

Based on this method, we can easily change the look-up table into arithmetic logic and greatly reduce hardware cost and power consumption.

Figure 2-9 : The architecture of proposed NUM_FLC
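The arithmetic realization of NUM_FLC can be illustrated with a short software sketch. It assumes the fixed-length coeff_token table of the H.264/AVC standard, in which the upper four bits give TotalCoeff - 1 and the lower two bits give TrailingOnes, with the codeword 000011 as the single exception. The function name `decode_num_flc` is hypothetical.

```python
def decode_num_flc(code6):
    """Decode a 6-bit fixed-length coeff_token without a look-up table.

    Returns (TotalCoeff, TrailingOnes).
    """
    if code6 == 0b000011:
        return (0, 0)                    # the one row that breaks the rule
    total_coeff = (code6 >> 2) + 1       # codeword[5:2] + 1
    trailing_ones = code6 & 0b11         # codeword[1:0]
    return (total_coeff, trailing_ones)


print(decode_num_flc(0b001011))
```

Only pairs with TrailingOnes not exceeding TotalCoeff correspond to valid codewords; the sketch does not check for invalid inputs such as 000010.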

2.4.2. Level Decoding

Level coding is built on seven VLC tables, VLC0 to VLC6. However, implementing the level decoder with VLC tables costs much hardware and power, because the longest codeword has 28 bits: 16 bits of prefix and 12 bits of suffix. Even if we use PZTP to construct the VLC tables of the level decoder, they are still huge. To meet the low-power demand, we realize the level decoder with an arithmetic approach instead, following the algorithm specified in [10].

Figure 2-10 shows the algorithm of level decoding. The variable suffixLength decides which VLC table is used. According to this algorithm, if we pipeline the level decoding and the suffixLength update well, we can decode level with the minimum number of function units. As a result, we obtain good power and hardware cost.

level_prefix = leading 0s

if (level_prefix == 15)
    levelSuffixSize = 12
else if (level_prefix == 14 && suffixLength == 0)
    levelSuffixSize = 4
else
    levelSuffixSize = suffixLength

level_suffix = bitstream [levelSuffixSize-1 : 0]

level decoding:

levelCode = (level_prefix << suffixLength)
if (suffixLength > 0 || level_prefix >= 14)
    levelCode += level_suffix
if (level_prefix == 15 && suffixLength == 0)
    levelCode += 15
if (first_level && TrailingOnes < 3)
    levelCode += 2
if (levelCode % 2 == 0)
    level = (levelCode + 2) >> 1
else
    level = (-levelCode - 1) >> 1

suffixLength:

if (TotalCoeff > 10 && TrailingOnes < 3)
    suffixLength = 1
if (|level| > (3 << (suffixLength - 1)) && suffixLength < 6)
    suffixLength++

Figure 2-10 : Algorithm of level decoding
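The algorithm of Figure 2-10 can be written as a small software sketch for checking individual codewords. The function names are hypothetical, and the bitstream parsing that extracts level_prefix and level_suffix is assumed to have been done already. One step is added that Figure 2-10 does not show: following the H.264/AVC standard, suffixLength is first raised from 0 to 1 in the update, which also avoids a negative shift amount.

```python
def level_suffix_size(level_prefix, suffix_length):
    """Number of suffix bits that follow the prefix of a level codeword."""
    if level_prefix == 15:
        return 12
    if level_prefix == 14 and suffix_length == 0:
        return 4
    return suffix_length


def decode_level(level_prefix, level_suffix, suffix_length,
                 first_level, trailing_ones):
    """Compute the signed level from the parsed prefix and suffix."""
    level_code = level_prefix << suffix_length
    if suffix_length > 0 or level_prefix >= 14:
        level_code += level_suffix
    if level_prefix == 15 and suffix_length == 0:
        level_code += 15
    if first_level and trailing_ones < 3:
        level_code += 2
    if level_code % 2 == 0:
        return (level_code + 2) >> 1
    return (-level_code - 1) >> 1  # arithmetic shift keeps the sign


def next_suffix_length(level, suffix_length):
    """Update suffixLength after one level has been decoded."""
    if suffix_length == 0:
        suffix_length = 1  # assumption taken from the standard, see lead-in
    if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
        suffix_length += 1
    return suffix_length


# The two level codewords of Table 2-1: "1" and "0010".
first = decode_level(0, 0, 0, first_level=True, trailing_ones=3)
sl = next_suffix_length(first, 0)
second = decode_level(2, 0, sl, first_level=False, trailing_ones=3)
print(first, sl, second)
```

For the example of Table 2-1, the codeword 1 decodes to +1 with suffixLength 0, and the codeword 0010 (prefix 2, one suffix bit 0) decodes to +3 with suffixLength 1.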

Figure 2-11 : The proposed architecture of level decoding

Figure 2-11 shows the proposed architecture of level decoding. It has two major parts: the left part calculates suffixLength, and the right part decodes the codeword of level. The gray rectangles represent registers. The level_prefix buffer is 10 bits wide, the bitstream buffer is 12 bits wide and is shared by all modules, and suffixLength needs 3 bits. The level_prefix is the number of leading zeros produced by the leading zeros counter shown in Figure 2-5, which is shared by four decoding modules: coeff_token, level, total_zeros, and run_before. The barrel shifter that rearranges level_prefix works only when we receive the first 1 of the level codeword. It also handles the special case in which level_prefix is 15 and suffixLength is 0, so we do not need to add an additional 15 to levelCode. This shortens the critical path of level decoding and reduces the hardware cost. The whole level decoding architecture is also controlled by an enable signal, which turns off level decoding when another procedure is executing.

The inverter performs the step (-levelCode - 1). In 2's complement, -levelCode is equal to (~levelCode + 1), so -levelCode - 1 is equal to (~levelCode + 1 - 1), that is, ~levelCode.
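The identity used for the inverter can be checked exhaustively for a fixed register width. The sketch below assumes a 16-bit datapath, which is an illustrative choice and not a width stated in the thesis; the helper names are hypothetical.

```python
WIDTH = 16                      # assumed register width for illustration
MASK = (1 << WIDTH) - 1


def negate_minus_one(x):
    """(-x - 1) truncated to WIDTH bits, as a subtractor would produce."""
    return (-x - 1) & MASK


def invert(x):
    """Bitwise inversion truncated to WIDTH bits, as an inverter produces."""
    return ~x & MASK


# Both circuits give the same bit pattern for every possible input.
identity_holds = all(negate_minus_one(x) == invert(x) for x in range(1 << WIDTH))
print(identity_holds)
```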

The part that calculates suffixLength is needed even if level decoding is implemented with look-up tables. As mentioned above, table searching depends on suffixLength to choose the correct VLC table, so this part cannot be omitted in any approach to level decoding. Our contribution is therefore to replace the VLC tables with the arithmetic method, which simplifies the level decoder.

2.5. Summary

Figure 2-12 : The throughput of foreman.yuv with the proposed VLD

Figure 2-13 : The throughput of mobile.yuv with the proposed VLD

Figure 2-12 and Figure 2-13 show the throughput of two test sequences with the proposed VLD. The simulation environment is JM 9.2, the C reference software of the H.264/AVC system. We use nine different values of QP to obtain the simulation results. In the two figures, the blue line is the throughput requirement of baseline@3.1 specified in the H.264/AVC standard when the clock frequency is 100 MHz, and the black line is the requirement of baseline@3.2. In Figure 2-12, the throughput of foreman meets the requirement of baseline@3.2 when QP is 20, and that of the I-frames in the same sequence also meets it when QP is 28. In Figure 2-13, the throughput of mobile meets the requirement when QP is 28. Therefore, the proposed design can support the H.264/AVC baseline profile.

Item | [3] | [4] | Proposed Design
Tech. | 0.25 um | 0.18 um | 0.18 um
Gate-count | 6100 | 4720 | CAVLC : 3267; MPEG2 : 945
Target Spec. | Baseline Profile | Main Profile @4.1 | Main Profile @4.2 & MPEG-2
Buffer | N.A. | 696 bits RAM | 3471 gate-count
Clock Constraint | 125 MHz | 125 MHz | 125 MHz

Table 2-3 : Hardware cost evaluation of proposed low power design

Table 2-3 compares the hardware cost. Although Figure 2-12 and Figure 2-13 show the throughput of the two sequences at a clock frequency of 100 MHz, the maximum speed of the proposed design is 180 MHz in a 0.18 um CMOS technology. This is fast enough to meet the real-time processing requirement of CAVLC decoding on main profile @4.2. Compared with the design proposed in [4], the CAVLC part of the proposed design reduces hardware cost by 30%, and the total design still has less hardware cost. The proposed design does not use RAM as storage, in order to save power.

Spec. | MPEG-2 I-frame | H.264 I-frame | H.264 P-frame
power (mW) | 1.719 | 1.302 | 1.376

Table 2-4 : The post layout power consumption under 0.18um CMOS Tech.

Table 2-4 shows the post-layout power consumption in 0.18 um CMOS technology. The proposed design consumes less than 2 mW in all three cases, and it is used in our dual-standard system [8], [9].

Chapter 3. A VLC Codec System for Dual Standards

Figure 3-1 : The architecture of our proposed system

Figure 3-1 shows the architecture of our proposed system for H.264/AVC main profile. The entropy decoder contains CABAD, UVLD, and CAVLD. UVLD and CAVLD belong to the same entropy decoding option: UVLD decodes the syntax elements, and CAVLD decodes the residual data. Therefore, the output of UVLD controls the decoding mode of the H.264/AVC decoder, and the results of CAVLD are the DCT coefficients of the residual data. After IDCT, the data are added to the predicted data to complete a unit block.

In Figure 3-1, CABAD has to use the slice memory to store the context model and the row storage. Figure 3-2 shows the memory usage of CABAD in our proposed H.264/AVC decoder system. The context model of CABAD occupies 349.1 bytes of the slice memory.

Figure 3-2 : The usage of memory of CABAD in our proposed H.264/AVC decoder

The context model of CABAD uses much memory, which suggests integrating CABAD and CAVLD: the same memory can provide space to store the VLC tables of CAVLD. In addition, our proposed H.264/AVC decoder receives the bitstream as a parallel input, so we have to try another approach to implement CAVLD. Besides, as mentioned in the motivation, if we add the CAVLC encoder to the entropy decoder, it can be integrated with an H.264/AVC encoder into an H.264/AVC codec system. Therefore, we looked for a method to implement a VLC codec system based on memory, and finally we propose a new group-based VLC codec system with reference to [6] and [7].

3.1. The Architecture of the Proposed VLC Codec System

Here, we describe the architecture of the proposed VLC codec system. We focus on the design of the CAVLC encoder/decoder and do not describe the MPEG-2 VLC codec in detail, because the major difference in the proposed MPEG-2 VLC codec is the group-based algorithm and its hardware implementation, while the other parts are basically similar to a conventional VLC codec design. Therefore, for the MPEG-2 VLC codec system we only discuss the proposed group-based modification, and we pay most attention to the CAVLC encoder/decoder.

[Figure 3-3 block labels: codeword boundary detector; coefficients scanner]

Figure 3-3 : Block diagram of the proposed VLC codec design

The block diagram of the proposed VLC codec design is shown in Figure 3-3. To fit the specification of our proposed H.264/AVC decoder system, the input bitstream is a parallel input that is 8 bits wide. The decoder is controlled by the enable signal is_decoding, and so is the encoder. The maxNum signal decides the type of block being decoded or encoded, and nC, introduced in Section 2.1, chooses the correct VLC table for coeff_token. The serial input data, coefficients, are the DCT coefficients for the encoder in reverse order. The codeword boundary detector has a FIFO to store the input bitstream, and its output signal FIFO_full indicates whether the bitstream FIFO is full. The symbols constructor sends out the arranged DCT coefficients, and the bitstream concatenater links the encoded codewords. The components are described as follows.

• The major functions of the codeword boundary detector are counting the leading ones and zeros, and fetching the required suffix for each decoding function unit according to the recorded bitstream boundary. It is also a controller that decides the activity of each decoding component, and it has to calculate the number of skipped run_before symbols and send this information to the symbols constructor. For MPEG-2 VLC, it has to detect special cases such as escape mode and end of block.

• After the coefficients scanner receives the serial input data, the DCT coefficients, it calculates and sends the necessary data to each encoding component. When doing MPEG-2 VLC encoding, it only counts the levels and runs. After sending the MPEG-2 level and run, it can receive the following coefficients. The more
