Parallel Chien Search Architecture - Design of Chase-Type Soft Decoders


4.3 Design of Chase-Type Soft Decoders

4.3.6 Parallel Chien Search Architecture

In an (N, K; t) BCH/RS decoder, once the error location polynomial Λ(x) is obtained in the decoding process, a Chien search block shown in Fig. 4.15 can be used to exhaustively examine whether Λ(αⁱ) = 0 for i = 0 ∼ N − 1, where

$$\Lambda(\alpha^i) = \sum_{j=0}^{t} \Lambda_j (\alpha^i)^j = \sum_{j=1}^{t} \Lambda_j \alpha^{ij} + 1. \tag{4.14}$$

Figure 4.15: Conventional Chien Search Architecture.
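The register-and-multiplier loop behind (4.14) can be sketched in a few lines of Python. This is only an illustration: the field GF(2^4), the primitive polynomial x^4 + x + 1, the helper names and the example locator polynomial are assumptions chosen for the sketch and are not taken from the design above.

```python
# GF(2^4) built from x^4 + x + 1 (assumed primitive polynomial for this sketch).
M = 4
PRIM = 0b10011
ALPHA = 0b0010
N = (1 << M) - 1  # number of nonzero field elements

def gf_mul(a, b):
    """Multiply two field elements stored as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def chien_search(lam, n=N):
    """Return every i in [0, n) with Lambda(alpha^i) = 0; lam[0] must be 1."""
    regs = list(lam[1:])  # register j holds Lambda_j * alpha^(i*j), starting at i = 0
    consts = [gf_pow(ALPHA, j) for j in range(1, len(lam))]  # one alpha^j CFFM per register
    roots = []
    for i in range(n):
        acc = 1  # constant term Lambda_0 = 1 in (4.14)
        for r in regs:
            acc ^= r  # finite field addition is a bitwise XOR
        if acc == 0:
            roots.append(i)
        regs = [gf_mul(r, c) for r, c in zip(regs, consts)]
    return roots

# Hypothetical locator Lambda(x) = (1 + X1 x)(1 + X2 x) with X1 = alpha^2, X2 = alpha^7,
# so the roots are alpha^(-2) = alpha^13 and alpha^(-7) = alpha^8.
X1, X2 = gf_pow(ALPHA, 2), gf_pow(ALPHA, 7)
LAM = [1, X1 ^ X2, gf_mul(X1, X2)]
print(chien_search(LAM))  # [8, 13]
```

Each loop iteration corresponds to one clock cycle of the conventional block: t constant multiplications followed by one (t + 1)-input addition.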

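Notice that an arbitrary element over GF(2^m) is represented by m binary coordinates over the basis {α⁰, α¹, · · · , α^(m−1)}, i.e., as a sum of the form $\sum_{l=0}^{m-1}(\cdot)\,\alpha^l$.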

Notice that the binary element α^k_l stands for the l-th coordinate of α^k. We define ρ_ij as the density of the matrix constructed with the coordinates of α^ij ∼ α^(ij+m−1) as shown in (4.16), i.e., the ratio between the number of 1's and the total number of entries [68]. Then the complexity of an α^ij-CFFM, a constant finite field multiplier with α^ij as the constant operand, is around m × (m − 1) × ρ_ij XOR gates according to (4.16).
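As a rough illustration of the density measure, the sketch below builds the binary matrix whose columns are the coordinates of α^k ∼ α^(k+m−1) and applies the m × (m − 1) × ρ estimate. The field GF(2^4) with primitive polynomial x^4 + x + 1 and all function names are assumptions made for this example only.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def cffm_columns(k):
    """Coordinate vectors of alpha^k, alpha^(k+1), ..., alpha^(k+m-1) as bit masks."""
    return [gf_pow(ALPHA, k + c) for c in range(M)]

def density(k):
    """Ratio between the number of 1's and the number of entries of the m x m matrix."""
    ones = sum(bin(col).count("1") for col in cffm_columns(k))
    return ones / (M * M)

def xor_estimate(k):
    """Approximate XOR count of an alpha^k-CFFM: m * (m - 1) * rho."""
    return M * (M - 1) * density(k)

for k in (0, 1, 7):
    print(k, density(k), xor_estimate(k))
```

In this small field the matrix for α¹ has density 5/16 while the one for α⁷ has density 10/16, which illustrates why low-exponent constants tend to give cheaper multipliers.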

To improve the decoding efficiency for long BCH/RS codes, multiple successive locations can be examined with parallel Chien search architectures. Fig. 4.16 depicts two conventional p-parallel Chien search architectures that shorten the operating cycles from N to ⌈N/p⌉, where p is the parallel factor. Fig. 4.16(a) is the straightforward extension of Fig. 4.15, while Fig. 4.16(b) is the direct-unfolded version with unfolding factor p [69, 70]. Both designs have p × t CFFMs and p (t + 1)-input m-bit finite field adders (FFAs), so the hardware complexity grows linearly with p. The direct-unfolded architecture, which utilizes αⁱ-CFFM to replace α^ij-CFFM for j = 2 ∼ p, provides lower hardware complexity because the density ρi is much smaller than ρij for i < m. Nevertheless, the critical path in Fig. 4.16(b) is (Tmux + p × Tm + Ta) while that in Fig. 4.16(a) is only (Tmux + Tm + Ta), where Tmux, Tm, and Ta represent the delays of the multiplexer, CFFM, and FFA respectively. The direct-unfolded architecture therefore has a critical path about p times longer if the CFFM dominates the delay path.

Figure 4.16: Conventional p-Parallel Chien Search Architectures. (a) Straight-Forward. (b) Direct-Unfolded.
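A behavioural sketch of the straight-forward p-parallel search may help to see where the ⌈N/p⌉ cycle count comes from. It assumes GF(2^4) with primitive polynomial x^4 + x + 1, a hypothetical two-error locator polynomial, and a row numbering in which row r of cycle τ evaluates the exponent i = r + pτ; none of these choices is prescribed by the text.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def parallel_chien(lam, n, p):
    """Examine p exponents per cycle; return (sorted root exponents mod n, cycles used)."""
    t = len(lam) - 1
    regs = list(lam[1:])  # register j holds Lambda_j * alpha^(j*p*tau)
    # Row r uses the constants alpha^(j*r); row p doubles as the register update.
    rows = [[gf_pow(ALPHA, j * r) for j in range(1, t + 1)] for r in range(1, p + 1)]
    cycles = -(-n // p)  # ceil(n / p)
    roots = []
    for tau in range(cycles):
        for r in range(1, p + 1):
            i = r + p * tau
            if i > n:
                break  # the last cycle may be only partially used
            acc = 1
            for reg, c in zip(regs, rows[r - 1]):
                acc ^= gf_mul(reg, c)  # one CFFM per (row, coefficient) pair
            if acc == 0:
                roots.append(i % n)
        regs = [gf_mul(reg, c) for reg, c in zip(regs, rows[p - 1])]
    return sorted(roots), cycles

X1, X2 = gf_pow(ALPHA, 2), gf_pow(ALPHA, 7)  # hypothetical error locators
LAM = [1, X1 ^ X2, gf_mul(X1, X2)]
print(parallel_chien(LAM, 15, 4))  # ([8, 13], 4)
```

Every row multiplies the register contents by a single constant, which matches the (Tmux + Tm + Ta) critical path of the straight-forward version; chaining the rows instead would model the direct-unfolded variant.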

The hardware complexity of a highly parallel Chien search architecture is dominated by its numerous CFFMs. This section reformulates the Chien search equation with minimal polynomials and utilizes minimal polynomial combinational networks (MPCNs) to replace the CFFMs in the Chien search architecture. In addition, the proposed MPCN-based Chien search architecture can merge the syndrome calculator with small overhead, leading to significant hardware complexity reduction.

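To calculate Λjα^ij with minimal polynomials, the proposed Chien search scheme defines a polynomial of degree m − 1, T_j(x) = t_j,0 + t_j,1x¹ + · · · + t_j,m−1x^(m−1), and the relation Λj = T_j(x)|_x=α^j = t_j,0 + t_j,1α^j + · · · + t_j,m−1α^((m−1)j), which is written as a binary matrix operation in (4.18).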

After the binary matrix operation in (4.18), the representation of Λj is converted from the basis {α⁰, α^j, · · · , α^((m−1)j)} to the basis {α⁰, α¹, · · · , α^(m−1)}; therefore, this operation is called the j-th basis transformer (BTj). As a result, the coefficients of Tj(x) can be determined by applying the inverse of this matrix to the coordinates of Λj, as in (4.19), where the operation of the inverse matrix is called the j-th inverse basis transformer (IBT_j).
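The two transformers can be illustrated as binary matrix products. The sketch below assumes GF(2^4) with primitive polynomial x^4 + x + 1, stores each matrix row as a bit mask, and inverts the BT matrix by Gauss-Jordan elimination over GF(2); the function names are hypothetical. It also assumes that α^j has a minimal polynomial of degree m, since otherwise {α⁰, α^j, · · · , α^((m−1)j)} is not a basis and the matrix is singular.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def bt_rows(j):
    """Rows of the BT_j matrix; column k holds the coordinates of alpha^(j*k)."""
    cols = [gf_pow(ALPHA, j * k) for k in range(M)]
    return [sum(((cols[k] >> l) & 1) << k for k in range(M)) for l in range(M)]

def gf2_inverse(rows):
    """Gauss-Jordan inversion over GF(2) on rows augmented with the identity."""
    m = len(rows)
    aug = [rows[l] | (1 << (m + l)) for l in range(m)]
    for c in range(m):
        piv = next((r for r in range(c, m) if (aug[r] >> c) & 1), None)
        if piv is None:
            raise ValueError("powers of alpha^j do not form a basis")
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(m):
            if r != c and (aug[r] >> c) & 1:
                aug[r] ^= aug[c]
    return [a >> m for a in aug]

def apply(rows, vec):
    """Binary matrix-vector product: output bit l is the parity of (row l AND vec)."""
    return sum((bin(row & vec).count("1") & 1) << l for l, row in enumerate(rows))

def eval_t(t_bits, j):
    """Evaluate T_j(x) at x = alpha^j from its binary coefficients."""
    acc = 0
    for k in range(M):
        if (t_bits >> k) & 1:
            acc ^= gf_pow(ALPHA, j * k)
    return acc

j, lam_j = 3, 0b1011          # hypothetical coefficient Lambda_3
ibt = gf2_inverse(bt_rows(j))
t_bits = apply(ibt, lam_j)    # IBT_j: standard coordinates -> coefficients of T_j(x)
print(eval_t(t_bits, j) == lam_j, apply(bt_rows(j), t_bits) == lam_j)
```

Because both transformers are fixed binary matrices, they map to XOR trees in hardware; the inversion is done once at design time.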

Based on our definitions, Λj can be represented in terms of Tj(x) and (4.15) becomes

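$$\begin{aligned} P_{ij} &= \Lambda_j \alpha^{ij} \\ &= T_j(x)\big|_{x=\alpha^j} \times (\alpha^j)^i \\ &= x^i T_j(x)\big|_{x=\alpha^j} \\ &= M_j(x) \times W_j(x) + D_j(x)\big|_{x=\alpha^j}, \end{aligned} \tag{4.20}$$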

where Mj(x) is the minimal polynomial of α^j, Wj(x) is the quotient, and Dj(x) is the remainder polynomial resulting from dividing xⁱTj(x) by Mj(x). Since α^j is a root of Mj(x), the product term vanishes and Dj(α^j) is the only term that remains in (4.20). Then the Chien search equation can be reformulated as

$$\Lambda(\alpha^i) = \sum_{j=1}^{t} P_{ij} + 1 = \sum_{j=1}^{t} D_j(\alpha^j) + 1. \tag{4.21}$$

As shown in (4.21), the Chien search can be realized simply by summing the evaluation results of the 1st ∼ t-th BTs. Instead of executing the summation after the basis transformations, the addition can be moved before the transformation, leading to fewer transformation operations.
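The identity in (4.20) and (4.21) can be checked numerically. The following sketch assumes GF(2^4) with primitive polynomial x^4 + x + 1 and a hypothetical two-error locator polynomial; for brevity it finds Tj(x) and Mj(x) by brute-force search rather than with the matrix-based IBT, which is a simplification made only for this example.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def poly_eval(bits, beta):
    """Evaluate a binary-coefficient polynomial (bit k = coefficient of x^k) at beta."""
    acc, power = 0, 1
    while bits:
        if bits & 1:
            acc ^= power
        power = gf_mul(power, beta)
        bits >>= 1
    return acc

def poly_mod(a, b):
    """Remainder of a(x) divided by b(x) over GF(2)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a

def min_poly(beta):
    """Lowest-degree binary polynomial with beta as a root (brute-force search)."""
    cand = 2
    while poly_eval(cand, beta) != 0:
        cand += 1
    return cand

def chien_via_remainders(lam, i):
    """Lambda(alpha^i) computed as 1 + sum_j D_j(alpha^j), following (4.20)-(4.21)."""
    acc = 1
    for j in range(1, len(lam)):
        beta = gf_pow(ALPHA, j)
        t_j = next(t for t in range(1 << M) if poly_eval(t, beta) == lam[j])
        d_j = poly_mod(t_j << i, min_poly(beta))  # remainder of x^i T_j(x) by M_j(x)
        acc ^= poly_eval(d_j, beta)
    return acc

def chien_direct(lam, i):
    """Lambda(alpha^i) computed term by term as in (4.14)."""
    acc = 0
    for j, coeff in enumerate(lam):
        acc ^= gf_mul(coeff, gf_pow(ALPHA, i * j))
    return acc

X1, X2 = gf_pow(ALPHA, 2), gf_pow(ALPHA, 7)  # hypothetical error locators
LAM = [1, X1 ^ X2, gf_mul(X1, X2)]
print(all(chien_via_remainders(LAM, i) == chien_direct(LAM, i) for i in range(15)))
```

The remainder Dj(x) has only binary coefficients, so the multiplication by xⁱ followed by the reduction modulo Mj(x) never needs a full finite field multiplier.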

Hence, (4.21) can be reformulated with a group basis transformer (GBT) as

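$$\begin{aligned} \Lambda(\alpha^i) &= \sum_{j=1}^{t} \sum_{k=0}^{m-1} d_{j,k}\,\alpha^{jk} + 1 \\ &= \sum_{v=0}^{mt} \Big( \sum_{\forall\, jk=v} d_{j,k} \Big) \alpha^{v} + 1, \end{aligned} \tag{4.22}$$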

where dj,k is the k-th coefficient of Dj(x).
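The regrouping in (4.22) can be illustrated with a small sketch that first XORs all bits dj,k sharing the same exponent v = jk and only then weights them by α^v. The field GF(2^4), the primitive polynomial x^4 + x + 1, and the coefficient bits in `D` are assumptions made up for the example.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

# Hypothetical remainder coefficients: bit k of D[j] plays the role of d_{j,k}.
D = {1: 0b1011, 2: 0b0110, 3: 0b1101}

def per_bt_sum(d):
    """First line of (4.22): transform every D_j separately, then add."""
    acc = 1
    for j, bits in d.items():
        for k in range(M):
            if (bits >> k) & 1:
                acc ^= gf_pow(ALPHA, j * k)
    return acc

def group_coefficients(d):
    """XOR together all bits d_{j,k} that share the same exponent v = j*k."""
    merged = {}
    for j, bits in d.items():
        for k in range(M):
            merged[j * k] = merged.get(j * k, 0) ^ ((bits >> k) & 1)
    return merged

def gbt_sum(d):
    """Second line of (4.22): add the bits first, then apply one group transformation."""
    acc = 1
    for v, bit in group_coefficients(d).items():
        if bit:
            acc ^= gf_pow(ALPHA, v)
    return acc

print(per_bt_sum(D) == gbt_sum(D), len(group_coefficients(D)))
```

With t = 3 and m = 4, the twelve input bits collapse onto seven distinct exponents, which is the source of the saving in transformation logic.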

Fig. 4.17 shows the architectures of the three basic components: the j-th MPCN, the j-th BT, and the GBT. The j-th MPCN (MPCN_j) shown in Fig. 4.17(a) executes the modulo operation with the divisor Mj(x). It is constructed from the combinational circuit of the linear feedback shift register with the connection polynomial Mj(x). Each binary element m^k_j in Fig. 4.17(a) is the k-th coefficient of Mj(x) and indicates whether the corresponding wire is connected. In the j-th BT shown in Fig. 4.17(b), each α_l^k is a binary element as in (4.18) and likewise indicates whether the wire is connected. Fig. 4.17(c) illustrates the block diagram of the GBT. The additions are first executed over all the coefficients of Dj(x) for j = 1 ∼ t (mt bits in total), and then operations similar to those of a BT are applied with the basis α⁰ ∼ α^mt.

Figure 4.17: Basic components in the proposed Chien search architecture. (a) j-th minimal polynomial combinational network (MPCNj). (b) j-th basis transformer (BTj). (c) Group basis transformer (GBT).
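One way to picture the MPCN is as an unrolled LFSR stage: shift the state by one position and XOR in the connection polynomial whenever the bit shifted out is 1. The sketch below models this behaviour with polynomials stored as bit masks; the function names and the two example polynomials are assumptions for illustration, not a description of the actual netlist.

```python
def mpcn_step(state, m_poly):
    """One combinational stage: multiply by x and reduce modulo m_poly."""
    deg = m_poly.bit_length() - 1
    state <<= 1
    if (state >> deg) & 1:  # feedback bit is set
        state ^= m_poly
    return state

def mpcn(state, m_poly, p):
    """Cascade of p stages, i.e. multiplication by x^p modulo m_poly."""
    for _ in range(p):
        state = mpcn_step(state, m_poly)
    return state

def poly_mod(a, b):
    """Reference remainder of a(x) divided by b(x) over GF(2)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a

def xor_count(m_poly):
    """XOR gates per stage: one per nonzero coefficient strictly between x^0 and x^deg."""
    return bin(m_poly).count("1") - 2

M1 = 0b10011  # x^4 + x + 1, assumed minimal polynomial of alpha in GF(2^4)
M3 = 0b11111  # x^4 + x^3 + x^2 + x + 1, assumed minimal polynomial of alpha^3
print(mpcn(0b1001, M1, 3) == poly_mod(0b1001 << 3, M1), xor_count(M1), xor_count(M3))
```

The stage needs no XOR for the lowest bit, which simply receives the feedback, so a degree-m polynomial costs at most m − 1 XOR gates per stage.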

In the proposed MPCN-based parallel-p Chien search architecture shown in Fig. 4.18, the coefficients of Λ(x) are applied to the IBTs for transforming the operating basis. According to (4.20) ∼ (4.22), the transformed values are evaluated with minimal polynomials to obtain the Chien search results. All the multiplexers select the outputs of the IBTs in the first cycle and select the register data afterward. Searching from the (N − 1)-th to the 0-th location, the proposed design checks p locations in each cycle. In each row, mt bits of data are fed into a GBT to examine the error locations. An error is found at the (N + r − p(τ + 1) − 1)-th location if the output of the r-th row GBT equals zero at the τ-th cycle. In contrast to Fig. 4.16, the proposed Chien search architecture utilizes p × t MPCNs to replace p × t CFFMs. Notice that one MPCN requires at most m − 1 XOR gates, which is far fewer than one CFFM requires. Therefore, applying the MPCNs is area-efficient, especially at high parallelism.

Figure 4.18: MPCN-Based parallel-p Chien search architecture.

Figure 4.19: Parallel-p joint syndrome calculator and Chien search with MPCN-based architecture.

The proposed MPCN-based architecture can merge the syndrome calculator and the Chien search in the same hardware with small overhead. In the BCH/RS decoding process, the received polynomial R(x) is fed into the syndrome calculator to generate the syndrome polynomial S(x) = S1 + S2x¹ + · · · + S2tx^(2t−1), where each syndrome is expressed as [3]

$$\begin{aligned} S_j &= R(x)\big|_{x=\alpha^j} \\ &= M_j(x) \times Q_j(x) + B_j(x)\big|_{x=\alpha^j} \\ &= B_j(\alpha^j), \end{aligned} \tag{4.23}$$

where Bj(x) is the remainder polynomial resulting from dividing R(x) by Mj(x). Consequently, the j-th syndrome value can be calculated with Mj(x).
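The remainder form of the syndrome in (4.23) can be checked directly for a binary received word. The sketch assumes GF(2^4) with primitive polynomial x^4 + x + 1, a made-up 15-bit received word `R`, and a brute-force search for the minimal polynomial; all names are illustrative.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def poly_eval(bits, beta):
    """Evaluate a binary-coefficient polynomial (bit k = coefficient of x^k) at beta."""
    acc, power = 0, 1
    while bits:
        if bits & 1:
            acc ^= power
        power = gf_mul(power, beta)
        bits >>= 1
    return acc

def poly_mod(a, b):
    """Remainder of a(x) divided by b(x) over GF(2)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a

def min_poly(beta):
    """Lowest-degree binary polynomial with beta as a root (brute-force search)."""
    cand = 2
    while poly_eval(cand, beta) != 0:
        cand += 1
    return cand

def syndrome_direct(r_bits, j):
    """S_j = R(alpha^j), evaluated term by term."""
    return poly_eval(r_bits, gf_pow(ALPHA, j))

def syndrome_remainder(r_bits, j):
    """S_j = B_j(alpha^j) with B_j(x) = R(x) mod M_j(x), as in (4.23)."""
    beta = gf_pow(ALPHA, j)
    return poly_eval(poly_mod(r_bits, min_poly(beta)), beta)

R = 0b101100111000101  # hypothetical binary received word of length 15
print(all(syndrome_direct(R, j) == syndrome_remainder(R, j) for j in range(1, 7)))
```

The division by Mj(x) involves binary coefficients only, so the field arithmetic is deferred to a single evaluation of the short remainder.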

Fig. 4.19 illustrates the parallel-p joint syndrome calculator and Chien search with MPCN-based architecture. The syndrome calculator phase and the Chien search phase are selected by the SEL signal. When the SEL signal is high, the j-th syndrome value is formulated as

$$S_j = \Big(\big(\cdots\big(\big((R_{N-1}x^{p-1}+\cdots+R_{N-p}) \bmod M_j(x)\big)x^p + (R_{N-p-1}x^{p-1}+\cdots+R_{N-2p})\big) \bmod M_j(x)\big)x^p + \cdots\big)x^p + R_{p-1}x^{p-1}+\cdots+R_0\Big) \bmod M_j(x)\,\Big|_{x=\alpha^j} \tag{4.24}$$

The partial remainder stored in the register is multiplied by x^p and accumulated with the received symbols. After all the received symbols are processed, BT_j transforms the accumulated result into the j-th syndrome value. In contrast to Fig. 4.18, t BTs are applied instead of one GBT in the first row to evaluate the individual syndrome values. Note that the FFA in Fig. 4.19 is only a 1-bit operation because each coefficient of R(x) is a binary value. Therefore, except for the difference between the BT and GBT, the overhead of supporting syndrome calculation is only p NAND and p × t XOR gates.
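The nested form of (4.24) is a Horner-style recursion that consumes p received bits per cycle. A behavioural sketch, assuming GF(2^4) with primitive polynomial x^4 + x + 1, a made-up 15-bit received word, and a block length divisible by p, is given below; the function names are hypothetical.

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, assumed primitive polynomial
ALPHA = 0b0010

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def poly_eval(bits, beta):
    """Evaluate a binary-coefficient polynomial (bit k = coefficient of x^k) at beta."""
    acc, power = 0, 1
    while bits:
        if bits & 1:
            acc ^= power
        power = gf_mul(power, beta)
        bits >>= 1
    return acc

def poly_mod(a, b):
    """Remainder of a(x) divided by b(x) over GF(2)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a

def min_poly(beta):
    """Lowest-degree binary polynomial with beta as a root (brute-force search)."""
    cand = 2
    while poly_eval(cand, beta) != 0:
        cand += 1
    return cand

def syndrome_parallel(r_bits, n, p, j):
    """Process p bits per cycle: state <- (state * x^p + chunk) mod M_j(x)."""
    beta = gf_pow(ALPHA, j)
    m_j = min_poly(beta)
    state = 0  # partial remainder register
    for hi in range(n, 0, -p):  # chunk holds R_{hi-1} ... R_{hi-p}; assumes p divides n
        chunk = (r_bits >> (hi - p)) & ((1 << p) - 1)
        state = poly_mod((state << p) ^ chunk, m_j)
    return poly_eval(state, beta)  # final basis transformation (BT_j)

R = 0b101100111000101  # hypothetical binary received word of length 15
print([syndrome_parallel(R, 15, 3, j) == poly_eval(R, gf_pow(ALPHA, j)) for j in range(1, 5)])
```

The loop runs N/p times, and the only field-level operation is the final evaluation, mirroring the single BT per syndrome in the joint architecture.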

4.3.7 Architecture Comparison

For the RS (224, 216; 4) code, no soft RS (224, 216; 4) decoder has been published to the best of the author's knowledge. Thus, an LCC soft RS (255, 239; 8) decoder [34] is shortened to an RS (224, 216; 4) decoder for comparison. The proposed decision-eased soft RS decoder with a 2-stage pipeline architecture is compared with the LCC soft RS decoder with a 4-stage pipeline architecture in Table 4.3. Both soft decoders evaluate 3 LRPs for generating candidate sequences.

The LCC decoder has a 4-stage pipeline architecture and has to store every candidate codeword, resulting in a large number of storage elements. Notice that the complexity ratio over GF(2⁸) among XOR, CFFM, FFM, FFA, MUX, Register, ROM (byte) and RAM (byte) is 1 : 20 : 100 : 8 : 1 : 3 : 8 : 8. When the complexity of these designs is normalized to XOR gates, the proposed decision-eased soft RS decoder is around 13,248 XOR gates and the LCC soft RS decoder is about 32,991 XOR gates. Owing to the smaller number of FFMs and storage elements, the proposed decoder saves around 59.8% of the complexity as compared to the LCC decoder, even though the LCC figure excludes the decision making unit.

Table 4.3: Comparison Table for Soft RS (224, 216; 4) Decoder

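| Architecture | GF(2⁸) CFFM | GF(2⁸) FFM | GF(2⁸) FFA | 2-to-1 MUX (Bit) | Register (Bit) | ROM (Byte) | RAM (Byte) | Latency (Cycle) |
|---|---|---|---|---|---|---|---|---|
| Decision-Eased with η = 3 | | | | | | | | |
| Syndrome Calculator | 8 | 0 | 8 | 0 | 64 | 0 | 0 | 224 |
| Reliability Evaluator | 0 | 0 | 0 | 192 | 0 | 0 | 0 | 224 |
| Syndrome Updater | 0 | 1 | 1 | 128 | 64 | 232 | 0 | 16 |
| Key Equation Solver | 0 | 9 | 8 | 72 | 136 | 0 | 0 | 16 |
| Parallel-16 Chien Search | 64 | 0 | 64 | 128 | 224 | 0 | 0 | 16 |
| BP-Based Error Value Evaluator | 0 | 2 | 2 | 192 | 64 | 224 | 0 | 16 |
| Simplified Decision Making Unit | 0 | 0 | 3 | 80 | 80 | 0 | 0 | 16 |
| Total | 72 | 12 | 86 | 792 | 632 | 456 | 0 + 224 × 2 | 224 |
| LCC with η = 3 [34] | | | | | | | | |
| Re-encoder | 0 | 13 | 23 | 392 | 344 | 448 | 0 | 464 |
| Interpolation | 0 | 14 | 12 | 87 | 166 | 0 | 68 | 461 |
| Polynomial Select | 0 | 8 | 8 | 139 | 264 | 0 | 0 | 23 |
| Chien Search | 4 | 0 | 4 | 0 | 64 | 0 | 0 | 216 |
| Forney's Algorithm | 0 | 2 | 2 | 136 | 24 | 224 | 0 | 76 |
| Erasure Decoder | 0 | 13 | 23 | 243 | 168 | 224 | 0 | 464 |
| Total | 4 | 50 | 72 | 997 | 1430 | 896 | 68 + 224 × 8 | 464 |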

*The complexity ratio over GF(2⁸) among XOR, CFFM, FFM, FFA, MUX, Register, ROM (byte) and RAM (byte) is 1 : 20 : 100 : 8 : 1 : 3 : 8 : 8.

For the RS (255, 239; 8) code, the proposed decision-confined soft RS decoder with a 3-stage pipeline architecture is compared with the LCC soft RS decoder with a 4-stage pipeline architecture in Table 4.4. The proposed design evaluates 5 LRPs for generating candidate sequences while the LCC decoder evaluates 3 LRPs.

In the LCC decoder, each candidate codeword has to be stored; therefore, the LCC decoder with a 4-stage pipeline architecture requires a large number of storage elements. When the complexity of these designs is normalized to XOR gates, the proposed decision-confined soft RS decoder is around 22,534 XOR gates and the LCC soft RS decoder is about 38,671 XOR gates. Even though the LCC figure excludes the complexity of the decision making unit, the proposed decoder saves around 42% of the complexity as compared to the LCC decoder because of its smaller number of storage elements.

Table 4.4: Comparison Table for Soft RS (255, 239; 8) Decoder

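| Architecture | GF(2⁸) CFFM | GF(2⁸) FFM | GF(2⁸) FFA | 2-to-1 MUX (Bit) | Register (Bit) | ROM (Byte) | RAM (Byte) | Latency (Cycle) |
|---|---|---|---|---|---|---|---|---|
| Decision-Confined with η = 5 | | | | | | | | |
| Syndrome Calculator | 16 | 0 | 16 | 0 | 128 | 0 | 0 | 256 |
| Reliability Evaluator | 0 | 0 | 0 | 40 | 90 | 0 | 0 | 259 |
| Syndrome Updater | 0 | 4 | 16 | 288 | 128 | 264 | 0 | 8 |
| Half Iteration Key Equation Solver | 0 | 62 | 37 | 464 | 296 | 0 | 0 | 8 |
| Parallel-2 Chien Search | 16 | 0 | 16 | 64 | 64 | 0 | 0 | 128 |
| BP-Based Error Value Evaluator | 0 | 2 | 2 | 448 | 128 | 256 | 0 | 92 |
| Total | 32 | 68 | 87 | 1592 | 834 | 520 | 0 + 256 × 3 | 259 |
| LCC with η = 3 [34] | | | | | | | | |
| Re-encoder | 0 | 21 | 39 | 448 | 600 | 512 | 0 | 528 |
| Interpolation | 0 | 14 | 12 | 87 | 166 | 0 | 68 | 525 |
| Polynomial Select | 0 | 8 | 8 | 139 | 264 | 0 | 0 | 23 |
| Chien Search | 8 | 0 | 8 | 0 | 128 | 0 | 0 | 239 |
| Forney's Algorithm | 0 | 2 | 2 | 136 | 24 | 256 | 0 | 152 |
| Erasure Decoder | 0 | 21 | 39 | 299 | 424 | 256 | 0 | 528 |
| Total | 8 | 66 | 108 | 1109 | 1606 | 1024 | 68 + 256 × 8 | 528 |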

*The complexity ratio over GF (2⁸) among XOR, CFFM, FFM, FFA, MUX, Register, ROM (byte) and RAM (byte) is 1 : 20 : 100 : 8 : 1 : 3 : 8 : 8.
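The XOR-normalized figures quoted in the text can be reproduced from the "Total" rows of Tables 4.3 and 4.4. The short script below is only a bookkeeping aid: it uses the weights CFFM 20, FFM 100, FFA 8, MUX 1, Register 3, ROM 8 and RAM 8 per unit, which are the weights that reproduce the quoted totals, and the dictionary keys are labels invented for this example.

```python
# XOR-equivalent weight per unit (MUX = 1, Register = 3 reproduce the quoted totals).
WEIGHTS = {"CFFM": 20, "FFM": 100, "FFA": 8, "MUX": 1, "REG": 3, "ROM": 8, "RAM": 8}

# "Total" rows copied from Tables 4.3 and 4.4.
TOTALS = {
    "decision_eased_224": {"CFFM": 72, "FFM": 12, "FFA": 86, "MUX": 792,
                           "REG": 632, "ROM": 456, "RAM": 0 + 224 * 2},
    "lcc_224": {"CFFM": 4, "FFM": 50, "FFA": 72, "MUX": 997,
                "REG": 1430, "ROM": 896, "RAM": 68 + 224 * 8},
    "decision_confined_255": {"CFFM": 32, "FFM": 68, "FFA": 87, "MUX": 1592,
                              "REG": 834, "ROM": 520, "RAM": 0 + 256 * 3},
    "lcc_255": {"CFFM": 8, "FFM": 66, "FFA": 108, "MUX": 1109,
                "REG": 1606, "ROM": 1024, "RAM": 68 + 256 * 8},
}

def xor_equivalent(counts):
    """Weighted sum of component counts, expressed in XOR gates."""
    return sum(WEIGHTS[name] * count for name, count in counts.items())

def saving(proposed, reference):
    """Fractional complexity reduction of the proposed design over the reference."""
    return 1 - xor_equivalent(TOTALS[proposed]) / xor_equivalent(TOTALS[reference])

for name, counts in TOTALS.items():
    print(name, xor_equivalent(counts))
print(round(100 * saving("decision_eased_224", "lcc_224"), 1))      # 59.8
print(round(100 * saving("decision_confined_255", "lcc_255"), 1))   # 41.7
```

RAM dominates both LCC totals (14,880 and 16,928 XOR-equivalents respectively), which is consistent with the remark that storing every candidate codeword drives the LCC complexity.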
