Scarce-State-Transition (SST) Algorithm - Some low-Power Schemes for Viterbi Decoder

Chapter 4 Some low-Power Schemes for Viterbi Decoder

4.1 Scarce-State-Transition (SST) Algorithm

The scarce-state-transition (SST) algorithm [7] [8] [9] was first proposed by Ishitani et al. in 1987. It is a low-power technique that significantly reduces the state transition activity of a Viterbi decoder under high-SNR conditions. This section introduces the decoding process of the SST algorithm.

Figure 4.1 shows the block diagram and data sequences of a convolutional code. In this block diagram, $u(D)$ denotes the information sequence and $C(D) = u(D) \cdot G(D)$ is the codeword sequence derived from the generator polynomial $G(D)$. From the received sequence $r(D)$, the Viterbi decoder estimates the decoded information sequence.

Figure 4.1 The block diagram of convolutional code

The SST Viterbi decoding model illustrated in Figure 4.2 includes two additional blocks: a pre-decoder and a re-encoder. The re-encoder is identical to the convolutional encoder of the transmitter, and the pre-decoder provides the inverse function of the convolutional encoder. The SST decoding scheme consists of four steps. First, the hard decision of the received sequence is pre-decoded by the pre-decoder. Next, the output of the pre-decoder is re-encoded by the re-encoder. The modulo-2 sum of the received sequence and the re-encoded sequence then becomes the new input of the Viterbi decoder. Finally, the decoded information sequence is obtained as the modulo-2 sum of the output of the Viterbi decoder and the output of the pre-decoder.

Figure 4.2 The model of SST Viterbi decoding

The relationships between the data sequences in Figure 4.2 are described as follows. The received sequence can be expressed as

$r(D) = u(D) \cdot G(D) + e(D) = C(D) + e(D)$ (4.1)

where $e(D)$ is the error sequence introduced by the noisy channel. Next, the pre-decoder directly decodes the information sequence from $r(D)$ by performing the inverse function of the encoder. The output of the pre-decoder is then re-encoded by the re-encoder.

The new input sequence of the Viterbi decoder equals the modulo-2 sum of the received sequence and the re-encoded sequence.

The Viterbi decoder then performs maximum likelihood decoding on this new input sequence. From equation (4.4), it is clear that the switching activity of this input depends on the channel noise. Under high-SNR conditions, the input of the Viterbi decoder is nearly all zero, and so is its output. The decoded information sequence equals the modulo-2 sum of the output of the Viterbi decoder and the output of the pre-decoder.
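
The relationships described in this section can be written out compactly. The following is a sketch of the algebra rather than a restatement of the numbered equations. The symbols $G^{-1}(D)$ (a right inverse of the encoder, so that $G(D) \cdot G^{-1}(D) = 1$), $i(D)$ (pre-decoder output), $r'(D)$ (new Viterbi decoder input), and $v(D)$ (Viterbi decoder output) are names assumed here for illustration, and $r(D)$ stands for the hard decision of the received sequence. All additions are modulo 2.

```latex
\begin{align*}
i(D)       &= r(D) \cdot G^{-1}(D) = u(D) + e(D) \cdot G^{-1}(D) \\
r'(D)      &= r(D) + i(D) \cdot G(D) = e(D) + e(D) \cdot G^{-1}(D) \cdot G(D) \\
\hat{u}(D) &= v(D) + i(D)
\end{align*}
```

The second line contains no term in $u(D)$, which is why the Viterbi decoder input depends only on the channel errors. If the Viterbi decoder corrects those errors, then $v(D) = e(D) \cdot G^{-1}(D)$ and the last line reduces to $\hat{u}(D) = u(D)$. With $e(D) = 0$, both $r'(D)$ and $v(D)$ are all zero.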

Figure 4.3 illustrates the SST decoding process for the (2, 1, 2) convolutional code described in Section 2.1.1. Assuming the encoder has been reset to the 00 state, the codeword symbols corresponding to information bit 0 and information bit 1 are (0,0) and (1,1), respectively. In BPSK modulation, the coded bit ‘0’ is mapped to ‘+1’ and ‘1’ is mapped to ‘-1’. As the codeword symbols pass through the noisy channel, the received symbols may not match the codeword symbols because of the errors denoted by e in Figure 4.3. In this example, 3-bit quantization is used to represent the received symbols. The hard decision of each received symbol is then processed by the pre-decoder and the re-encoder. As Figure 4.3 illustrates, the input of the Viterbi decoder becomes the error symbol e introduced by the channel noise. The output of the Viterbi decoder is expected to be zero when the channel noise is mild. Finally, the decoded information bit, which is the same as the one sent by the transmitter, is obtained.

Figure 4.3 The SST Viterbi decoding process
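
The SST front end can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (2, 1, 2) code is assumed to use the generators 1 + D + D^2 and 1 + D^2 (consistent with the (1,1) output for information bit 1 from state 00, but not confirmed by the text), the pre-decoder is one possible feed-forward inverse of that encoder, and all function names are hypothetical. The Viterbi decoder itself is omitted; the sketch only computes its input.

```python
def encode(u):
    """Rate-1/2 encoder; assumed generators G1 = 1 + D + D^2, G2 = 1 + D^2."""
    s1 = s2 = 0  # s1 = u[t-1], s2 = u[t-2]; encoder starts in state 00
    out = []
    for b in u:
        out.append((b ^ s1 ^ s2, b ^ s2))
        s1, s2 = b, s1
    return out


def predecode(r):
    """Feed-forward inverse of encode(): u = D*c1 + (1 + D)*c2 (mod 2)."""
    p1 = p2 = 0  # previous hard-decision symbol (c1[t-1], c2[t-1])
    out = []
    for c1, c2 in r:
        out.append(p1 ^ c2 ^ p2)
        p1, p2 = c1, c2
    return out


def sst_front_end(r_hard):
    """Return (pre-decoder output, new Viterbi decoder input)."""
    pre = predecode(r_hard)
    reencoded = encode(pre)
    viterbi_in = [(a ^ x, b ^ y) for (a, b), (x, y) in zip(r_hard, reencoded)]
    return pre, viterbi_in


if __name__ == "__main__":
    u = [1, 0, 1, 1, 0, 0, 1, 0]
    pre, viterbi_in = sst_front_end(encode(u))  # error-free channel
    print(pre == u)                             # pre-decoder recovers u
    print(set(viterbi_in))                      # only the all-zero symbol
```

Without channel errors the Viterbi decoder input is the all-zero sequence, so the decoder stays in the zero state. When hard-decision errors are present, the input produced by this sketch depends only on the error pattern and not on the transmitted information bits, because both the encoder and the pre-decoder are linear.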

The SST algorithm has the following properties. When the channel noise is mild, most of the decoded bits of the Viterbi decoder are zero. Therefore, the survivor path passes through the zero state most of the time, and the zero state is most likely the best state, that is, the one with the minimum path metric. Figure 4.4 shows the survivor paths of the conventional Viterbi algorithm and the SST algorithm over a noiseless channel. For the conventional Viterbi algorithm, the maximum likelihood state is distributed across all the states. With the SST algorithm, in contrast, the zero state has a higher probability of being the maximum likelihood state than the other states. This property is useful for the modified memory management we propose, which will be described later.

(a) Conventional Viterbi algorithm

(b) SST algorithm

Figure 4.4 The survivor paths over a noiseless channel

The SST algorithm performs a transformation that converts the original input sequence of the Viterbi decoder into an approximately all-zero sequence when the channel condition is good enough. As a result, the state transition activity is reduced, and the decoded sequence passes through the zero state with high probability in a high-SNR environment. The dynamic power therefore decreases as the channel improves.

4.2 Adaptive Viterbi Algorithm

The adaptive Viterbi algorithm [10] [11] combines the Viterbi algorithm with the principle of the T-algorithm [12]. Unlike the conventional Viterbi algorithm, which retains the survivor of every state at each trellis stage, the T-algorithm applies path pruning to reduce computation and storage requirements. Instead of computing and retaining all possible paths, it retains only the paths that satisfy certain path-cost conditions at each stage. Path retention is based on the following criteria:

• A path is retained if its path metric is less than $d_m + T$, where $T$ is a threshold value determined by the designer and $d_m$ is the best path metric among all survivor paths at the previous trellis stage.

• The total number of survivor paths per trellis stage is limited to a fixed number $N_{max}$, which is also determined by the designer and is less than the number of states.

The first criterion allows high-cost paths to be eliminated in the decoding process. When many paths have similar cost, the second criterion restricts the number of paths to $N_{max}$. At each stage, the minimum path metric $d_m$, the threshold $T$, and the maximum survivor number $N_{max}$ are used to prune the survivor paths.

For the adaptive Viterbi algorithm, it is important to select $T$ and $N_{max}$ carefully. If the threshold $T$ is set to a small value, the average number of paths retained at each trellis stage is reduced. However, the bit error rate may increase, since the most likely path has to be chosen from a smaller set of candidate paths. Conversely, if a large value of $T$ is selected, the average number of retained paths increases, which lowers the bit error rate but also increases the computation and path-storage requirements. The maximum number of survivor paths per stage, $N_{max}$, has a similar effect on the bit error rate as $T$. Therefore, $T$ and $N_{max}$ should be chosen so that the bit error rate stays within allowable limits while the design still fits the available hardware resources. Figure 4.5 shows the ACS unit of the adaptive Viterbi decoder.

Figure 4.5 The ACS unit of adaptive Viterbi decoder
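
The two retention criteria can be illustrated with a short Python sketch. It is not the ACS hardware of Figure 4.5; the function name `prune_survivors` and the dictionary representation of the path metrics are assumptions made for illustration.

```python
def prune_survivors(metrics, d_m, T, n_max):
    """Apply the two path-retention criteria to one trellis stage.

    metrics: dict mapping state -> path metric of its candidate survivor
    d_m:     best (minimum) path metric of the previous trellis stage
    T:       pruning threshold chosen by the designer
    n_max:   maximum number of survivors kept per stage
    """
    # Criterion 1: keep only paths whose metric is below d_m + T.
    kept = {s: pm for s, pm in metrics.items() if pm < d_m + T}
    # Criterion 2: if too many paths remain, keep the n_max best ones.
    if len(kept) > n_max:
        best = sorted(kept.items(), key=lambda item: item[1])[:n_max]
        kept = dict(best)
    return kept


if __name__ == "__main__":
    stage = {0: 3, 1: 5, 2: 9, 3: 4}
    print(prune_survivors(stage, d_m=3, T=3, n_max=2))  # {0: 3, 3: 4}
```

In the usage example, state 2 is removed by the threshold test (9 is not less than 3 + 3), and state 1 is then removed because only two survivors are allowed.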

4.3 Variable Truncation Length

As mentioned in Section 3.4, there are two well-known survivor memory management approaches: the register-exchange (RE) approach and the trace-back (TB) approach. The register-exchange approach is conceptually the simplest technique and eliminates the need to trace back, since the registers already contain the decoded information. Compared with the trace-back approach, the register-exchange approach has the advantages of a short critical path, short latency, and a simple structure. However, the register-exchange method is not power efficient, because the contents of all registers in a stage must be copied to the next stage. In this section, we propose a modified memory management scheme based on the path merging property of the Viterbi algorithm. This scheme provides a variable truncation length so that the register-exchange approach can access the survivor memory efficiently. Figure 4.6 illustrates a 64-state Viterbi decoder with radix-2x2 ACS and RE-based survivor memory.

Figure 4.6 A 64-state, radix-2x2 Viterbi decoder with RE-based memory

In Section 2.23, we introduced an important characteristic of the Viterbi algorithm, namely the path merging property. According to this property, all survivor paths merge with high probability if the truncation length L is long enough. By selecting a proper truncation length, the decoded data can be determined from L stages of information only. Moreover, it is unnecessary to search for the best state, because the decoded data from any state is the same once all survivor paths have merged. Based on this characteristic, the fixed-state approach is a suitable choice for a register-exchange based survivor memory when the number of states is large.

Once all survivor paths have merged, it is more efficient to store the merged path rather than all paths. Based on this principle, we propose a low-power scheme for the Viterbi decoder called variable truncation length. Figure 4.7 illustrates a 64-state, radix-2x2, RE-based survivor memory with variable truncation length. D0 to D63 are the decisions provided by the ACS units for selecting survivor paths. During decoding, the contents of the registers corresponding to the 64 states tend to become equal from the left stages to the right stages. The registers of each stage are connected to the path merging detection unit, which finds the merged stage in the memory and generates a clock gating signal for each stage to eliminate unnecessary data movement.

Figure 4.7 A RE-based survivor memory with variable truncation length

Figure 4.8 illustrates the survivor memory as a trellis diagram. In this example, the fixed-state approach is applied and the decoded data is obtained from state 0. After detecting the merge point, we apply clock gating to the registers in the shaded region and directly shift out the value of state 0. The path corresponding to state 0 is taken as the correct one, and the others are dropped. With this scheme, the truncation length can be adjusted dynamically according to the channel condition. In high-SNR environments, a shorter truncation length is sufficient and clock gating can be applied to more registers, resulting in a power-efficient survivor memory.

Figure 4.8 Trellis diagram representation of variable truncation length
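
The merge detection and clock gating idea can be sketched in Python as follows. The data layout is an assumption made for illustration: `regs[s][i]` is the bit held by state `s` at stage `i`, with stage 0 the newest and the last stage the oldest, and the function names are hypothetical. A small 4-state, 4-stage memory is used instead of the 64-state memory of Figure 4.7.

```python
def merge_stage(regs):
    """Return the newest stage index m such that every stage from m to the
    oldest one holds the same bit in all states (len if no stage does)."""
    depth = len(regs[0])
    stage = depth
    for i in range(depth - 1, -1, -1):      # scan from the oldest stage
        if len({row[i] for row in regs}) != 1:
            break                           # paths still differ here
        stage = i
    return stage


def clock_enables(regs):
    """enable[s][i] is True if the register of state s at stage i must be
    clocked. In the merged region only the state-0 row keeps shifting."""
    m = merge_stage(regs)
    depth = len(regs[0])
    return [[(i < m) or (s == 0) for i in range(depth)]
            for s in range(len(regs))]


if __name__ == "__main__":
    regs = [[1, 0, 1, 1],
            [0, 0, 1, 1],
            [1, 1, 1, 1],
            [0, 1, 1, 1]]
    enables = clock_enables(regs)
    gated = sum(not e for row in enables for e in row)
    print(merge_stage(regs), gated)         # 2 6
```

In this example the two oldest stages are identical in all four states, so six of the sixteen registers can be gated while the state-0 row continues to shift toward the output. Gating is safe in that region because every predecessor would supply the same bit.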

Chapter 5 The Proposed Low-power Viterbi Decoder

In this chapter, we propose a low-power Viterbi decoder that combines the scarce-state-transition (SST) algorithm with variable truncation length. The ACS computation and the survivor memory are the most power-critical blocks, consuming about 90% of the power in the Viterbi decoder. Therefore, most low-power designs focus on these two blocks. The SST algorithm reduces the switching activity of the input sequence to lower the dynamic power. In addition to applying SST, we propose a modified register-exchange approach that adjusts the truncation length dynamically. With variable truncation length, the survivor memory is accessed more efficiently.

The proposed Viterbi decoder targets the Multi-band OFDM UWB [13] system. This system uses a 64-state convolutional code and has a high throughput requirement of up to 480Mbps. Figure 5.1 shows the block diagram of the MB-OFDM UWB system.

At the beginning of this chapter, we present the architecture of the proposed Viterbi decoder. Next, we show the simulation and implementation results. Finally, we compare several different designs.

Figure 5.1 The block diagram of MB-OFDM UWB system

5.1 The Design of Proposed Viterbi Decoder

Figure 5.2 shows the block diagram of the proposed low-power Viterbi decoder, which combines the SST algorithm and variable truncation length. In this section, we present the implementation of these low-power schemes. The architecture of the SST unit is described first. Next, we introduce the radix-2x2 ACS structure used in the proposed design. Finally, we show how the modified memory management adjusts the truncation length dynamically.

Figure 5.2 The block diagram of proposed Viterbi decoder

5.1.1 Implementation of SST

To apply the SST algorithm in the Viterbi decoder, it is necessary to implement the pre-decoder and the re-encoder. Figure 5.3 shows the convolutional encoder of the MB-OFDM UWB system. The corresponding generator polynomials are $g_0 = 133_8$, $g_1 = 165_8$, and $g_2 = 171_8$.

The re-encoder has the same structure as the convolutional encoder.

Figure 5.3 The convolutional encoder of MB-OFDM UWB system

The pre-decoder provides the inverse function of the re-encoder. It operates on three sequences, and its function can be expressed as equation (5.4).

With the three sequences in equation (5.4), the pre-decoder can be implemented as shown in Figure 5.4. Both the pre-decoder and the re-encoder are composed only of shift registers and modulo-2 adders. Therefore, the hardware overhead of these two additional blocks for the SST algorithm is small.

Figure 5.4 The pre-decoder for the convolutional encoder in Figure 5.3
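
One way to see why a pre-decoder built only from shift registers and modulo-2 adders exists is to compute a feed-forward inverse with the extended Euclidean algorithm over GF(2). The sketch below is an assumption-laden illustration, not the pre-decoder of Figure 5.4: it assumes the MB-OFDM UWB generators 133 and 171 (octal) for the first and third encoder outputs, reads the octal digits with the most significant bit as the undelayed input, and uses only two of the three output streams, whereas the pre-decoder in the text involves three sequences. Polynomials are stored as integers with bit i holding the coefficient of D^i, and all function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less (GF(2)) polynomial multiplication."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(a, b):
    """GF(2) polynomial division: return (quotient, remainder) of a / b."""
    q = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        q ^= 1 << shift
        a ^= b << shift
    return q, a


def ext_gcd(a, b):
    """Return (g, x, y) with x*a + y*b = g over GF(2)."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b:
        q, r = poly_divmod(a, b)
        a, b = b, r
        x0, x1 = x1, x0 ^ clmul(q, x1)
        y0, y1 = y1, y0 ^ clmul(q, y1)
    return a, x0, y0


def from_taps(taps):
    """Build a polynomial from the exponents of D that are present."""
    return sum(1 << t for t in taps)


# Assumed generators: 133 (octal) -> 1 + D^2 + D^3 + D^5 + D^6,
#                     171 (octal) -> 1 + D + D^2 + D^3 + D^6.
G0 = from_taps([0, 2, 3, 5, 6])
G2 = from_taps([0, 1, 2, 3, 6])

if __name__ == "__main__":
    g, a, c = ext_gcd(G0, G2)
    u = 0b1011001                            # example information polynomial
    c0, c2 = clmul(u, G0), clmul(u, G2)      # two encoded streams
    print(g == 1, (clmul(a, c0) ^ clmul(c, c2)) == u)
```

Because the greatest common divisor of the two generators is 1, the polynomials `a` and `c` define a finite-length inverse: filtering the two streams with them and adding the results modulo 2 returns the information sequence.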

5.1.2 Radix-2x2 ACS Structure

The throughput requirement of the MB-OFDM UWB system is up to 480Mbps. As mentioned in Section 3.2.2, the ACS unit is the speed bottleneck of the Viterbi decoder because of its data-dependent feedback loop. For high-speed applications, a high-radix or multi-dimension ACS is often applied to improve the throughput. Both radix-4 ACS and radix-2x2 ACS complete the operations of two trellis stages in one clock cycle. In 0.13μm CMOS technology, both the radix-4 and the radix-2x2 ACS structures can meet the throughput requirement. Figure 5.5 shows a 4-state radix-4 trellis and a 4-state radix-2x2 trellis. The structures of the radix-4 and radix-2x2 ACS units for state S0 are shown in Figure 5.6.

(a) 4-state radix-4 trellis diagram

(b) 4-state radix-2x2 trellis diagram

Figure 5.5 The 4-state radix-4 and radix-2x2 trellis diagrams

Figure 5.6 The radix-4 and radix-2x2 ACS units
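
The difference between the two ACS styles can be sketched in Python for one destination state. This is one plausible reading of the two structures, not a description of Figure 5.6: the radix-4 unit is modeled as a single 4-way compare-select over two-stage paths, and the radix-2x2 unit as two cascaded 2-way compare-selects. The argument layout (sources 0 and 1 reach intermediate state 0, sources 2 and 3 reach intermediate state 1) and the function names are assumptions made for illustration.

```python
def acs_radix4(pm, bm_first, bm_second):
    """One 4-way compare-select over the four two-stage paths.

    pm[i]:        path metric of source state i (i = 0..3)
    bm_first[i]:  first-stage branch metric leaving source state i
    bm_second[m]: second-stage branch metric leaving intermediate state m
    Returns (new path metric, index of the selected source state).
    """
    sums = [pm[i] + bm_first[i] + bm_second[i // 2] for i in range(4)]
    best = min(range(4), key=sums.__getitem__)
    return sums[best], best


def acs_radix2x2(pm, bm_first, bm_second):
    """Two cascaded 2-way compare-selects covering the same two stages."""
    mids = []
    for m in range(2):
        cands = [pm[2 * m + k] + bm_first[2 * m + k] for k in range(2)]
        k = 0 if cands[0] <= cands[1] else 1        # first 2-way compare
        mids.append((cands[k], 2 * m + k))
    finals = [mids[m][0] + bm_second[m] for m in range(2)]
    m = 0 if finals[0] <= finals[1] else 1          # second 2-way compare
    return finals[m], mids[m][1]


if __name__ == "__main__":
    pm, bm1, bm2 = [5, 3, 7, 2], [1, 4, 0, 1], [2, 1]
    print(acs_radix4(pm, bm1, bm2))    # (4, 3)
    print(acs_radix2x2(pm, bm1, bm2))  # (4, 3)
```

Both functions select the same survivor. The cascaded version needs only 2-way comparators and 2-to-1 selections, at the price of two compare-select operations in series, which matches the trade-off between complexity and critical path discussed in this section.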

The complexity analysis of the radix-4 and radix-2x2 ACS units for a 64-state Viterbi decoder is summarized in Table 5.1. The main differences between the two ACS structures lie in the comparators and multiplexers. Table 5.2 lists their gate counts to show the hardware cost. Although its critical path is longer, the radix-2x2 ACS can meet the throughput requirement with lower complexity. To design a low-power Viterbi decoder, we therefore adopt the radix-2x2 ACS structure in the proposed design.

Table 5.1 Comparison of complexity between radix-4 and radix-2×2 ACS units

registers adders 2-way comparator

Table 5.2 The gate counts of different comparators and multiplexers

2-way

Note 1: UMC 0.13μm technology is applied.

Note 2: All input and output data are 9 bits wide.

5.1.3 Implementation of Variable Truncation Length

In Section 4.3, we proposed the variable truncation length scheme based on the path merging property of the Viterbi algorithm. Once all survivor paths have merged, the survivor memory stores the merged path rather than all paths to eliminate unnecessary data movement. To implement variable truncation length, it is necessary to find the merged stage of the survivor memory. After detecting the merge point, we can shift out the data on the merged path directly and apply clock gating to the registers corresponding to the other paths.

All survivor paths have merged when the contents of the 64 states are equal at the same stage. However, checking the equality of all 64 states concurrently is too complex. To reduce the hardware complexity, our proposal detects path merging by dividing the 64 states into several groups that are verified separately. For the radix-2x2 trellis, each state has four source states. Therefore, we divide the 64 states into 16 groups of 4 states each. Figure 5.7 illustrates the implementation of variable truncation length. Because the proposed Viterbi decoder uses the SST algorithm, the decoded data is obtained from state 0, which is most likely the best state. As the figure shows, the equality of each group is checked separately. The verification results of each stage are connected to the path merging detection unit. The signals Gi and Si generated by the path merging detection unit are the clock gating control of each stage and the selection signal of state 0, respectively. With the clock gating control signal Gi, the register clocks in the shaded region of Figure 5.7 are gated to reduce the power consumption. The selection signal Si controls whether the content of state 0 is updated by direct shift or by register exchange.

Figure 5.7 The implementation of variable truncation length
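
The group-wise check can be sketched in Python. The details below are assumptions made for illustration: groups are taken as consecutive state numbers (0 to 3, 4 to 7, and so on), a stage is treated as merged when every checked group is internally equal, `memory[i][s]` is the bit of state `s` at stage `i` with stage 0 the newest, and the function names are hypothetical. The optional `groups_checked` argument allows only the first few groups to be verified.

```python
def stage_merged(column, group_size=4, groups_checked=None):
    """Check one stage group by group.

    column: bits held by all states at this stage, indexed by state number.
    Returns True if every checked group holds identical bits.
    """
    n_groups = len(column) // group_size
    if groups_checked is None:
        groups_checked = n_groups
    for g in range(groups_checked):
        group = column[g * group_size:(g + 1) * group_size]
        if any(bit != group[0] for bit in group):
            return False
    return True


def gating_signals(memory, group_size=4, groups_checked=None):
    """Return one gating flag per stage; True means the stage lies in the
    merged region that starts at the oldest stage."""
    gated = [False] * len(memory)
    for i in range(len(memory) - 1, -1, -1):   # scan from the oldest stage
        if not stage_merged(memory[i], group_size, groups_checked):
            break
        gated[i] = True
    return gated


if __name__ == "__main__":
    memory = [[0, 1, 0, 1, 1, 1, 0, 0],   # newest stage, paths still differ
              [0, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1, 1, 1]]   # oldest stage
    print(gating_signals(memory))          # [False, True, True]
```

Checking groups separately is a weaker condition than comparing all states at once, since two groups could each be internally equal yet differ from one another. The sketch simply mirrors the grouping described above and does not model how the path merging detection unit combines the group results.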

Simulation results show that checking each group separately not only reduces the hardware complexity but also preserves the error performance. Some simulation results are shown in the following section.

5.2 Simulation and Implementation Results

This section presents simulation and implementation results. The performance simulations are carried out over an AWGN channel with BPSK modulation. We adopt the (3, 1, 6) convolutional code of the MB-OFDM UWB system with 3-bit soft decision and a code rate of 1/3. Because the variable truncation length scheme is based on the path merging property, it is necessary to choose a truncation length that ensures all survivor paths merge with high probability. Figure 5.8 shows the performance curves for different truncation lengths. The upper right corner of Figure 5.8 highlights the curves under low-SNR conditions. As these curves show, the performance improvement reaches a limit even if the truncation length keeps increasing. We select 64 as the maximum truncation length in the proposed design.

Figure 5.8 The performance curves under different truncation lengths

As described in Section 5.1.3, our proposal detects path merging by dividing the 64 states into several groups that are verified separately. In addition, we analyze the performance when only part of the 64 states is verified. Figure 5.9 shows the performance curves obtained by checking the equality of the first 16 states (4 groups), the first 32 states (8 groups), the first 48 states (12 groups), and all 64 states (16 groups). The simulation results show that checking only the first 48 states achieves the same performance as checking all 64 states. Therefore, we verify only the first 48 states to reduce the hardware complexity while preserving the error performance.

Figure 5.9 The performance curves under different verification conditions

Table 5.3 lists the design parameters of the proposed Viterbi decoder. To demonstrate that the proposed schemes reduce the power consumption, we implement three versions of the Viterbi decoder: the conventional register-exchange structure, the SST scheme only, and the proposed structure. Table 5.4 lists the gate counts of these three implementations.

Table 5.3 Design parameters of the proposed Viterbi decoder

Parameter            Value
Technology           UMC 0.13-μm process
State number         64
Code rate            1/3
Soft-decision        8 levels
BM width             6 bits
PM width             9 bits
Truncation length    64 (max)
ACS structure        radix-2x2

Table 5.4 The gate counts of different implementations

Implementation       Gate count
Conventional RE      57.8k
SST                  58.2k
Proposed             65.1k

Figure 5.10 shows the power simulation results under different channel conditions. The operating frequency is 250MHz and the corresponding data rate is 500Mbps. For the conventional structure, the channel condition has no effect on the power dissipation. In the SST-only implementation, the decoder power dissipation is reduced in high-SNR environments. In the proposed design, which combines SST and variable truncation length, the decoder power shows an obvious reduction, as shown in Figure 5.10(a). Figure 5.10(b) shows the survivor memory power alone to highlight the effect of the dynamic truncation length.

(a) The power consumption of the whole Viterbi decoder

(b) The power consumption of the survivor memory

Figure 5.10 The power simulation results in different channel conditions

Figure 5.11 shows the gate count distribution of the conventional and the proposed designs. For the conventional structure, the ACS and the survivor memory together account for more than 90% of the gate count. In the proposed design, the additional circuits for implementing the low-power schemes account for about 9%.

Figure 5.11 The gate count distribution of conventional and proposed designs

Figure 5.12 shows the power profiling of the conventional and the proposed designs when Eb/N0 is 4.0 dB. The corresponding bit error rate under this channel condition is 1.41e-5. In the conventional decoder design, the survivor memory is a power intensive
