Universal Architectures for Reed-Solomon Error-and-Erasure Decoder


Master's Thesis, Department of Electronics Engineering, National Chiao Tung University. Universal Architectures for Reed-Solomon Error-and-Erasure Decoder. Student: Fuke Chang. Advisor: Hsie-Chia Chang. August 2005.

Universal Architectures for Reed-Solomon Error-and-Erasure Decoder. Student: Fuke Chang. Advisor: Hsie-Chia Chang. A Thesis Submitted to the Department of Electronics Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering. August 2005, Hsinchu, Taiwan, R.O.C.

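Universal Architectures for Reed-Solomon Error-and-Erasure Decoder. Student: Fuke Chang. Advisor: Hsie-Chia Chang. Master's Program, Department of Electrical Engineering (Graduate Institute), National Chiao Tung University.

ABSTRACT (Chinese)

Reed-Solomon codes are mainly used to protect data against errors that may occur during transmission, and their arithmetic is based on finite field operations. They are found in many applications, such as CD and DVD drives, cable modems, and DVB-T systems. However, because each application has its own specification, each Reed-Solomon code has completely different parameters as well as a different finite field definition and p(x). Previous multi-mode designs have always needed a large amount of hardware and many cycles to handle different finite field definitions.

This thesis therefore proposes a fully multi-mode Reed-Solomon decoder that can handle different parameters, including the number of correctable errors and the finite field definition.

Two architectures are proposed. The first supports finite field degrees up to 10, and the second supports degrees up to 8. In addition, several small-area design considerations are applied to the degree-8 architecture to reduce its area. Both architectures are implemented in a 0.13 µm 1P8M process and require 110K and 53K logic gates, respectively. According to the simulation results, they can operate at up to 220 MHz and 250 MHz.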

Universal Architectures for Reed-Solomon Error-and-Erasure Decoder. Student: Fuke Chang. Advisor: Hsie-Chia Chang. Institute of Electronics, National Chiao Tung University.

ABSTRACT

Because it protects data from random and burst errors during transmission, the Reed-Solomon (RS) code has been widely accepted as a forward error correction scheme in systems such as xDSL, cable modem, and DVB-T. Because the RS parameters differ from one specification to another, a cost-efficient RS decoder that supports various applications is of practical importance for reducing time-to-market and design cost. This thesis presents two universal architectures for the RS error-and-erasure decoder that can accommodate any codeword with different code parameters and finite field definitions. The first architecture supports field degrees up to 10, and the second supports field degrees up to 8. An area-efficient design approach is also applied to the second architecture. Implemented in a 1.2 V, 0.13 µm 1P8M technology, the two decoders operate at 220 MHz and 300 MHz and reach data rates of 2.2 Gb/s and 2.4 Gb/s, respectively. The total gate counts of the two decoders are 110K with a core size of 0.78 mm² and 54K with a core size of 0.36 mm².

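ACKNOWLEDGEMENTS

My two years of graduate school passed quickly. In these two years I learned many ways of doing research as well as some lessons about dealing with people. There are of course many people to thank. First and foremost, I thank my advisor, Dr. Hsie-Chia Chang. His guidance over these two years not only led me to some research results but also pointed me in the right research direction whenever I was stuck, and how easy he is to get along with left the deepest impression on me. Next, I thank my senior labmate Chien-Ching Lin (林建青), who helped me the most. He was a reliable support in my research, both in the research direction and in the many problems I met in IC design, which allowed me to complete each stage of the work smoothly. I also thank every member of the FEC group, as well as my classmates and juniors in the wonderful Oasis laboratory. These two years were truly happy and fulfilling. Finally, once again, thank you all.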

CONTENTS

Abstract (Chinese)
Abstract (English)
Acknowledgements
Contents
List of Figures
List of Tables

Chapter 1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Thesis Organization
Chapter 2 Introduction to Reed-Solomon code
    2.1 Reed Solomon Encoding
    2.2 RS Code Decoding with Erasure Correction
        Berlekamp Massey algorithm
        Euclidean Algorithm
Chapter 3 Universal Finite Field Operator
    3.1 Montgomery Multiplication Algorithm
    3.2 Universal Finite Field Multiplier
    3.3 Universal Finite Field Inverter
        Fermat's algorithm
        On-the-fly Inversion Table
Chapter 4 Proposed Universal Architectures
    4.1 Universal Syndrome and Erasure Value Calculator
    4.2 Erasure Locator Polynomial Expansion and Key Equation Solver Block
        Decomposed Berlekamp-Massey Architecture
        Expansion Hardware of Erasure Locator Polynomial
        Computation for Errata Evaluator Polynomial
    4.3 Chien Search and Error Evaluator Block
        Error value evaluator
    4.4 Summary
Chapter 5 Area Efficient Design Approach
    5.1 Universal Syndrome and Erasure Value Calculator
    5.2 Chien Search and Error Evaluator Block
    5.3 8 ≤ t ≤ 16 Error-only Correction
    5.5 Summary
Chapter 6 Chip Implementation Result
    6.1 Design and Test Consideration
    6.2 Chip Implementation for Proposed Architecture I
    6.3 Chip Implementation for Proposed Architecture II
    6.4 Comparison
Chapter 7 Conclusion
Appendix: Hardware Sharing Design for (528, 518) RS codec IP
    Proposed Hardware Sharing Architecture
    Composite Field Inverter
    Implementation Result
    Conclusion
Bibliography

List of Figures

Figure 1.1: Block diagram of communication system
Figure 2.1: The systematic RS encoding architecture
Figure 2.2: The systematic RS decoding scheme
Figure 3.1: Montgomery multiplier structure for GF(2^m) with m ≤ 4
Figure 3.2: The finite field inverter based on Fermat's algorithm
Figure 3.3: On-the-fly inversion table
Figure 4.1: The syndrome cell of the syndrome calculator
Figure 4.2: Universal syndrome block
Figure 4.3: The decomposed key equation architecture for calculating the error locator polynomial
Figure 4.4: Using the decomposed architecture to compute the erasure locator polynomial
Figure 4.5: (a) The double check Chien search cell. (b) Chien search architecture with n-k correctable erasures
Figure 4.6: The serial error evaluator architecture with one FFM
Figure 4.7: The timing diagram for proposed architecture I
Figure 4.8: Block diagram of the proposed I RS decoder
Figure 5.1: The universal syndrome and erasure value calculator
Figure 5.2: The parallel Chien-search block with constant UFFM
Figure 5.3: The parallel error evaluator block with constant UFFM
Figure 5.4: Block diagram of 16-error correction
Figure 5.5: Decoding timing diagram for 16 errors-only correction
Figure 5.6: The timing diagram for proposed architecture II
Figure 5.7: Block diagram of the proposed II RS decoder
Figure 6.1: The entire design flow
Figure 6.2: The simulation environment
Figure 6.3: The die photo of proposed architecture I
Figure 6.4: The layout view of proposed architecture II
Figure A.1: (a) Encoder/syndrome calculator block, (b) Syndrome cell (SCi)
Figure A.2: The decomposed Berlekamp-Massey architecture with finite field inverter
Figure A.3: Chien-search and error-value evaluator block
Figure A.4: The hardware sharing architecture
Figure A.5: Composite field inverter over GF(2^10)

List of Tables

Table 1.1: Some applications of the Reed-Solomon decoder and their finite field definitions
Table 1.2: The number of primitive polynomials for different field degrees
Table 3.1: The comparison of universal finite field multipliers
Table 6.1: The chip summary of the proposed I universal RS decoder
Table 6.2: The chip summary of proposed architecture II
Table 6.3: The comparison of various multi-mode RS decoders
Table A.1: Comparison of required registers and finite-field function units. C-S and E-E denote the Chien search block and the error evaluator, respectively
Table A.2: The comparison table for the (528, 518) RS codec with different key-equation blocks

CHAPTER 1 Introduction

1.1 Background

Figure 1.1: Block diagram of communication system

Efficient and reliable data transmission has become increasingly important in communication systems in recent years. Fig. 1.1 shows a typical communication system, which is composed of source coding, channel coding, and modulation [1]. This thesis focuses only on the channel coding block, also known as error control coding, which is used to combat channel noise during data transmission. As shown in the figure, an error control code consists of a channel encoder and a channel decoder. The channel encoder encodes the information symbols with additional redundancy. The channel decoder decodes the received codeword and is capable of correcting errors. Error control codes can also be divided into two classes according to their encoding arithmetic: block codes and convolutional codes. The Reed-Solomon (RS) code, which is a block code and has

cyclic structure [2], is described in this thesis, covering both algorithm research and hardware implementation.

1.2 Motivation

In recent years, the Reed-Solomon code has been used in many applications, such as xDSL, cable modem, DVD, Blu-ray Disc, and DVB-T systems. Table 1.1 lists some RS code applications and their finite field (FF) definitions, and Table 1.2 gives the number of primitive polynomials for different field degrees [3]. Table 1.1 shows that a single system may contain several different RS specifications. For example, the ITU J.83 system includes two different finite field definitions and three different error-correction capabilities.

Table 1.1: Some applications of the Reed-Solomon decoder and their finite field definitions.

Application     Mode    RS code specification
-------------   -----   ----------------------------------------
Blu-ray Disc    LDC     (248, 216) RS code over GF(2^8), t = 16
Blu-ray Disc    BIS     (62, 30) RS code over GF(2^8), t = 16
Flash                   (526, 518) RS code over GF(2^10), t = 4
ITU J.83        A, B    (204, 188) RS code over GF(2^8), t = 8
ITU J.83        C       (128, 122) RS code over GF(2^7), t = 3
ITU J.83        D       (207, 187) RS code over GF(2^8), t = 10
DVB-T                   (204, 188) RS code over GF(2^8), t = 8

Table 1.2: The number of primitive polynomials for different field degrees.

Finite field degree m    Number of primitive polynomials
---------------------    -------------------------------
5                        6
6                        12
8                        34
10                       106

Because the RS parameters differ from one specification to another, a cost-efficient RS decoder that supports various applications is of practical importance for reducing time-to-market and design cost. The applications have many similarities, so hardware can be shared to lower the design cost. This thesis proposes two universal architectures for the Reed-Solomon error-and-erasure decoder that can accommodate any codeword with different code parameters and finite field definitions. Proposed architecture I supports field degrees up to ten and corrects up to eight errors. Proposed architecture II supports field degrees up to eight and corrects up to sixteen errors. An area-efficient approach is adopted in implementing proposed architecture II. Furthermore, the proposed decoders support erasure correction without adding any finite field multipliers.

The design challenge is how to realize a single RS decoder that is suitable for different finite field definitions. The Montgomery multiplication algorithm is used to handle the relation between different finite field definitions. Compared with other reconfigurable RS decoders, the proposed design, based on the Montgomery multiplication algorithm, supports various finite field degrees, different primitive polynomials, and erasure decoding.

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 introduces the Reed-Solomon code, including the encoding and decoding algorithms. Chapter 3 presents the Montgomery multiplication algorithm [8], the universal finite field multiplier, and the universal finite field inverter. The on-the-fly look-up table is also described in Section 3.3. The proposed universal RS decoder architecture is addressed in Chapter 4, where each block of the decoder and its design methodology are described in detail. Chapter 5 presents the area-efficient design used in proposed architecture II. The design and test considerations and the chip implementation are given in Chapter 6, which also compares the chip performance and size with other architectures. Finally, Chapter 7 gives the conclusion and future work.

CHAPTER 2 Introduction to Reed-Solomon code

The Reed-Solomon (RS) code, which is used to protect data during transmission, has been widely accepted as the forward error correction scheme for various optical storage and communication systems. The fundamental arithmetic of the RS code is built on the Galois field, denoted GF [9]. An RS code over $GF(2^m)$ is written as an (n, k, t) code with block length n and n-k parity symbols, where t is the maximum number of correctable errors. Note that $GF(2^m)$ indicates that RS codes are non-binary codes whose symbols are m-bit sequences, where m is any integer greater than two. This chapter is organized as follows. Section 2.1 describes the RS encoding procedure and its arithmetic. The RS decoding scheme is presented in Section 2.2, together with the Berlekamp-Massey algorithm and the Euclidean algorithm, which are used to solve the key equation [3, 4].

2.1 Reed Solomon Encoding

It is known that $GF(p^m)$, where p is a prime number, can be represented using 0 and the $p^m-1$ consecutive powers of a primitive field element $\alpha \in GF(p^m)$. Symbols from the field $GF(2^m)$ are used in the construction of the Reed-Solomon code. Each of the $2^m$ elements of the finite field $GF(2^m)$ can be represented as a distinct polynomial of degree at most m-1. For the nonzero elements:

$$\alpha^i = \alpha_i(x) = \alpha_{i,0} + \alpha_{i,1}x + \alpha_{i,2}x^2 + \cdots + \alpha_{i,m-1}x^{m-1}, \quad i = 0, 1, \ldots, 2^m-2 \tag{2.1}$$

Let $M(x)$, represented as $(m_{k-1}, m_{k-2}, \ldots, m_1, m_0)$, be the information polynomial with k symbols. The generator polynomial $G(x)$ is the product of the associated minimal polynomials:

$$G(x) = (x+\alpha^{b})(x+\alpha^{b+1})\cdots(x+\alpha^{b+2t-2})(x+\alpha^{b+2t-1}) \tag{2.2}$$

where the degree of $G(x)$ equals the number of parity symbols and b is a constant. Therefore, for an (n, k, t) RS code, the nonsystematic encoding procedure can be expressed as:

$$C(x) = G(x)M(x) = (x+\alpha^{b})(x+\alpha^{b+1})\cdots(x+\alpha^{b+2t-1})M(x) \tag{2.3}$$

where the codeword $C(x)$ has the 2t roots $\alpha^{b}, \alpha^{b+1}, \ldots, \alpha^{b+2t-1}$. Another way to encode the information symbols is systematic encoding [4], which appends parity check symbols to the message to form the codeword. First, the message polynomial $M(x)$ is multiplied by $x^{2t}$ and then reduced modulo the generator polynomial $G(x)$ to obtain the remainder polynomial $R(x)$:

$$R(x) = M(x)x^{2t} + Q(x)G(x) \tag{2.4}$$

where $Q(x)$ is the quotient of the dividend $M(x)x^{2t}$ and the divisor $G(x)$, and the remainder $R(x)$ has degree at most 2t-1. The systematic codeword can then be expressed as:

$$C(x) = M(x)x^{2t} + R(x) = G(x)Q(x) \tag{2.5}$$
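As a hedged illustration of the systematic encoding in (2.4) and (2.5), the following Python sketch encodes a (15, 11) RS code with t = 2. The field $GF(2^4)$ with $p(x) = x^4 + x + 1$, the choice b = 1, and all function names are assumptions made for this example only; they are not taken from the thesis.

```python
# Sketch only: GF(2^4) with p(x) = x^4 + x + 1, alpha = x and b = 1 are assumed.
P_POLY, M_BITS, ALPHA = 0b10011, 4, 0b0010

def gf_mul(a, b):
    """Multiply two field elements (bit i holds the coefficient of x^i)."""
    result = 0
    for _ in range(M_BITS):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_BITS) & 1:          # degree reached m: reduce by p(x)
            a ^= P_POLY
    return result

def gf_pow(a, e):
    """Raise a to the non-negative integer power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def generator_poly(t, b=1):
    """G(x) of (2.2), coefficients listed from the highest degree down."""
    g = [1]
    for i in range(2 * t):
        root = gf_pow(ALPHA, b + i)
        nxt = g + [0]                          # g(x) * x
        for j, c in enumerate(g):
            nxt[j + 1] ^= gf_mul(c, root)      # + g(x) * root
        g = nxt
    return g

def encode(msg, t, b=1):
    """Return M(x) x^(2t) + R(x) as in (2.5), highest degree first."""
    g = generator_poly(t, b)
    rem = list(msg) + [0] * (2 * t)            # M(x) * x^(2t)
    for i in range(len(msg)):                  # long division by the monic G(x)
        coef = rem[i]
        if coef:
            for j, gj in enumerate(g):
                rem[i + j] ^= gf_mul(coef, gj)
    return list(msg) + rem[len(msg):]          # message followed by 2t parity symbols

def poly_eval(poly, x):
    """Horner evaluation, coefficients highest degree first."""
    y = 0
    for c in poly:
        y = gf_mul(y, x) ^ c
    return y

message = list(range(1, 12))                   # k = 11 information symbols
codeword = encode(message, t=2)                # n = 15, 2t = 4 parity symbols
```

Because $C(x)$ is a multiple of $G(x)$, evaluating `codeword` at $\alpha^1, \ldots, \alpha^4$ with `poly_eval` should give zero in each case, which is a convenient self-check for the sketch.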

We can think of this as shifting the message polynomial $M(x)$ into the rightmost k locations of a codeword and appending the 2t parity check symbols in the leftmost locations. Fig. 2.1 shows the typical systematic encoding circuit with 2t registers, where $g_0, g_1, \ldots, g_{2t-1}$ are the coefficients of the generator polynomial. During the first k clock cycles, the output symbols are the message $M(x)$. During the remaining n-k cycles, the parity symbols $R(x)$ are shifted to the output. The total number of required clock cycles is n.

Figure 2.1: The systematic RS encoding architecture

2.2 RS Code Decoding with Erasure Correction

As mentioned earlier, the codeword polynomial is $C(x)$. Let the error polynomial be $e(x)$. The received polynomial $r(x)$ can be expressed as:

$$r(x) = C(x) + e(x) \tag{2.6}$$

Fig. 2.2 shows the error-only RS decoding flow, which can be divided into four steps: 1) calculation of the syndrome $S(x)$ from the received codeword, 2) computation of the error locator polynomial $\sigma(x)$ and the error evaluator polynomial $\Omega(x)$ by solving the key equation with the Berlekamp-Massey algorithm [5, 6] or the Euclidean algorithm, 3) search for the error locations by the Chien search, and 4) evaluation of the error values.

Figure 2.2: The systematic RS decoding scheme

The syndrome is the result of a parity check performed on the received polynomial $r(x)$ to determine whether $r(x)$ is a valid member of the codeword set. If the received polynomial has no errors, the syndrome polynomial $S(x)$ is always 0. On the other hand, any nonzero syndrome value indicates the presence of errors. The syndrome symbols are computed as follows:

$$S(x) = \sum_{i=1}^{2t} r(\alpha^{i})\,x^{i-1} \tag{2.7}$$

$$\begin{aligned}
S_1 &= r(\alpha^{1}) = e(\alpha^{1}) = e_1\chi_1 + e_2\chi_2 + \cdots + e_v\chi_v \\
S_2 &= r(\alpha^{2}) = e(\alpha^{2}) = e_1\chi_1^{2} + e_2\chi_2^{2} + \cdots + e_v\chi_v^{2} \\
&\;\;\vdots \\
S_{2t} &= r(\alpha^{2t}) = e(\alpha^{2t}) = e_1\chi_1^{2t} + e_2\chi_2^{2t} + \cdots + e_v\chi_v^{2t}
\end{aligned} \tag{2.8}$$

where $e_i$ is the i-th error value, v is the number of errors that occurred, and $\chi_i$ is the i-th error location number. A nonzero syndrome vector signifies that an error has occurred. In that case, the error locator polynomial $\sigma(x)$ and the error evaluator polynomial $\Omega(x)$ are computed. They are defined as
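The syndrome computation of (2.7) and (2.8) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field $GF(2^4)$ with $p(x) = x^4 + x + 1$, the first root $\alpha^1$, and the function names are chosen for the example and are not taken from the thesis.

```python
# Sketch only: GF(2^4), p(x) = x^4 + x + 1, alpha = x and first root alpha^1 are assumed.
P_POLY, M_BITS, ALPHA = 0b10011, 4, 0b0010

def gf_mul(a, b):
    """Multiply two field elements (bit i holds the coefficient of x^i)."""
    result = 0
    for _ in range(M_BITS):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_BITS) & 1:
            a ^= P_POLY
    return result

def gf_pow(a, e):
    """Raise a to the non-negative integer power e."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def syndromes(received, t):
    """Return [S_1, ..., S_2t] with S_i = r(alpha^i).

    received[0] is the coefficient of x^(n-1); received[-1] is the constant term.
    """
    out = []
    for i in range(1, 2 * t + 1):
        x = gf_pow(ALPHA, i)
        s = 0
        for c in received:                 # Horner's rule
            s = gf_mul(s, x) ^ c
        out.append(s)
    return out

n, t = 15, 2
received = [0] * n                         # the all-zero word is a valid codeword
received[n - 1 - 3] = 5                    # inject error value 5 at location x^3
S = syndromes(received, t)                 # each S_i equals 5 * alpha^(3 i), as in (2.8)
```

With a single error of value $e_1$ at location $\chi_1 = \alpha^3$, (2.8) reduces to $S_i = e_1\chi_1^{i}$, so the sketch can be checked directly against the formula.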

$$\sigma(x) = (1+\chi_1x)(1+\chi_2x)\cdots(1+\chi_vx) = \sigma_0 + \sigma_1x + \sigma_2x^{2} + \sigma_3x^{3} + \cdots + \sigma_vx^{v} \tag{2.9}$$

$$\begin{aligned}
\Omega(x) &= S(x)\,\sigma(x) \bmod x^{n-k} \\
&= e_1\chi_1(1+\chi_2x)(1+\chi_3x)\cdots(1+\chi_vx) + e_2\chi_2(1+\chi_1x)(1+\chi_3x)\cdots(1+\chi_vx) \\
&\quad + e_3\chi_3(1+\chi_1x)(1+\chi_2x)\cdots(1+\chi_vx) + \cdots
\end{aligned} \tag{2.10}$$

Addition and subtraction coincide in $GF(2^m)$, so the factors are written with plus signs throughout.

Berlekamp Massey algorithm

In 1960, Peterson provided the first explicit description of a decoding algorithm for binary BCH codes. He used the relation between the error locator polynomial and the syndrome vector to solve the key equation. The relation can be written in matrix form:

$$\begin{bmatrix}
S_1 & S_2 & S_3 & \cdots & S_v \\
S_2 & S_3 & S_4 & \cdots & S_{v+1} \\
S_3 & S_4 & S_5 & \cdots & S_{v+2} \\
\vdots & \vdots & \vdots & & \vdots \\
S_v & S_{v+1} & S_{v+2} & \cdots & S_{2v-1}
\end{bmatrix}
\begin{bmatrix} \sigma_v \\ \sigma_{v-1} \\ \vdots \\ \sigma_1 \end{bmatrix}
=
\begin{bmatrix} -S_{v+1} \\ -S_{v+2} \\ \vdots \\ -S_{2v} \end{bmatrix} \tag{2.11}$$

However, this algorithm is inefficient for codes with a large number of correctable errors. Consequently, the error locator polynomial and the error evaluator polynomial have usually been computed with the Berlekamp-Massey algorithm or the modified Euclidean algorithm. The Berlekamp-Massey algorithm, which was developed by Berlekamp and explained by Massey in terms of a linear feedback shift register, has a regular structure for solving the key equation. The entire Berlekamp-Massey algorithm is as follows:

1) Initially, set $\Lambda^{(b)}(x) = 1$, $\Lambda^{(a)}(x) = 1$, $l = 0$, $k = 1$, $\gamma = 1$.
2) (a) Compute $\Lambda^{(a)}(x) \leftarrow x\Lambda^{(a)}(x)$ and $\delta = \sum_{j=0}^{l} \Lambda_j^{(b)} S_{k-j}$.
   (b) $\Lambda^{(c)}(x) = \Lambda^{(b)}(x) + \dfrac{\delta}{\gamma}\Lambda^{(a)}(x)$.
   (c) If $\delta \neq 0$ and $2l \le k-1$, set $\Lambda^{(a)}(x) = \Lambda^{(b)}(x)$, $l = k - l$, $\gamma = \delta$.
   (d) $\Lambda^{(b)}(x) = \Lambda^{(c)}(x)$.
3) Set $k = k+1$. If $k < d$, go to step 2.
4) Stop.

Here $\delta$ is the discrepancy, which is the convolution of the syndrome vector and the error locator polynomial, and $\gamma$ is the dummy nonzero discrepancy, which keeps the value of a previous discrepancy. The discrepancy is used to verify, at each step, that the linear feedback shift register generates the given syndrome sequence. If the discrepancy is zero, the error locator polynomial and the dummy discrepancy are not updated. The Berlekamp-Massey algorithm takes 2t iterations in total.

Euclidean Algorithm

In 1975, Sugiyama et al. showed that the Euclidean algorithm can be used to decode the Reed-Solomon code. The Euclidean algorithm was originally used to calculate the greatest common divisor of two

polynomials. By rewriting the key equation, the Euclidean algorithm can be applied to produce the pair of solutions $(\sigma(x), \Omega(x))$ that satisfies

$$\Omega(x) = S(x)\,\sigma(x) \bmod x^{n-k} \;\Longrightarrow\; Q(x)\,x^{n-k} + S(x)\,\sigma(x) = \Omega(x) \tag{2.12}$$

where $Q(x)$ is the quotient of the dividend $S(x)\sigma(x)$ and the divisor $x^{n-k}$. The quotient $Q(x)$ itself is not needed; the pair $(\sigma(x), \Omega(x))$ is the solution of interest. The computation of $\Omega(x)$ is similar to calculating the GCD of $x^{n-k}$ and $S(x)$:

$$\begin{aligned}
R_1(x) &= x^{n-k} + S(x)\,Q_1(x) \\
R_2(x) &= S(x) + R_1(x)\,Q_2(x) \\
R_3(x) &= R_1(x) + R_2(x)\,Q_3(x) \\
&\;\;\vdots \\
\Omega(x) &= R_{n-2}(x) + R_{n-1}(x)\,Q_n(x)
\end{aligned} \tag{2.12}$$

where $Q_i(x)$ is the quotient polynomial and $R_i(x)$ is the remainder polynomial of step i. A polynomial division is performed at each step. According to the Euclidean algorithm, the error locator polynomial is computed as

$$\begin{aligned}
\sigma_1(x) &= 0 + 1 \cdot Q_1(x) \\
\sigma_2(x) &= 1 + \sigma_1(x)\,Q_2(x) \\
\sigma_3(x) &= \sigma_1(x) + \sigma_2(x)\,Q_3(x) \\
&\;\;\vdots \\
\sigma(x) &= \sigma_{n-2}(x) + \sigma_{n-1}(x)\,Q_n(x)
\end{aligned} \tag{2.13}$$

where the quotients $Q_i(x)$ are the same as those obtained when computing the error evaluator polynomial. From equation (2.12), the error locator polynomial can be calculated from the current quotient polynomial and the previous error locator polynomials.

During the Euclidean algorithm, the degree of $\sigma(x)$ increases while the degree of $\Omega(x)$ decreases. Hence, the Euclidean decoding procedure terminates when the degree of $\sigma(x)$ becomes larger than the degree of $\Omega(x)$. Because of the regularity of the Berlekamp-Massey algorithm, the proposed universal architecture is implemented with that algorithm.

After the error locator polynomial and the error evaluator polynomial have been calculated, the next step is to find the error locations by the Chien search. The Chien search substitutes the finite field elements into the error locator polynomial and checks whether the result equals zero:

$$\sigma(\alpha^{-i}) = 0 \quad \text{for } i = 0, 1, 2, \ldots, N \tag{2.14}$$

Then, according to the Forney algorithm [7], the error value can be computed as follows:

$$e_l = \frac{\Omega(x_l^{-1})}{\sigma'(x_l^{-1})} \tag{2.15}$$

where $x_l$ is the location root found in the Chien search step and $\sigma'(x)$ is the derivative of the error locator polynomial $\sigma(x)$.

An erasure is an error whose position is known. An RS decoder with erasure correction improves the performance of various systems. For an (n, k, t) RS code, the erasure correction capability of the code is

$$s = d_{\min} - 1 = n - k \tag{2.16}$$

where $d_{\min}$ is the minimum distance between any two codewords. For the RS code, the minimum distance is given by

$$d_{\min} = n - k + 1 \tag{2.17}$$
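The Chien search of (2.14) and the Forney formula of (2.15) can be sketched as follows. This is only an illustration under assumptions that are not in the thesis: the field $GF(2^4)$ with $p(x) = x^4 + x + 1$, block length n = 15, syndromes starting at $\alpha^1$, and hypothetical function names. To keep the example short, $\sigma(x)$ is built directly from known error locations instead of being produced by a key equation solver.

```python
# Sketch only: GF(2^4), p(x) = x^4 + x + 1, alpha = x and n = 15 are assumed.
P_POLY, M_BITS, ALPHA, N = 0b10011, 4, 0b0010, 15

def gf_mul(a, b):
    """Multiply two field elements (bit i holds the coefficient of x^i)."""
    result = 0
    for _ in range(M_BITS):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_BITS) & 1:
            a ^= P_POLY
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """a^(2^m - 2) is the inverse of a nonzero element."""
    return gf_pow(a, 2 ** M_BITS - 2)

def poly_eval(poly, x):
    """Evaluate a polynomial whose coefficients are listed lowest degree first."""
    y = 0
    for c in reversed(poly):
        y = gf_mul(y, x) ^ c
    return y

def poly_mul_mod(a, b, deg):
    """a(x) b(x) mod x^deg, coefficients lowest degree first."""
    out = [0] * deg
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < deg:
                out[i + j] ^= gf_mul(ai, bj)
    return out

def chien_forney(sigma, omega):
    """Return {location i: error value} using (2.14) and (2.15)."""
    # Formal derivative in characteristic 2: only odd-degree terms survive.
    deriv = [c if j % 2 == 1 else 0 for j, c in enumerate(sigma)][1:]
    errors = {}
    for i in range(N):
        x_inv = gf_pow(ALPHA, (N - i) % N)          # alpha^(-i)
        if poly_eval(sigma, x_inv) == 0:            # Chien search hit
            num = poly_eval(omega, x_inv)
            den = poly_eval(deriv, x_inv)
            errors[i] = gf_mul(num, gf_inv(den))    # Forney error value
    return errors

# Two known errors: value 5 at location 3 and value 9 at location 10.
true_errors = {3: 5, 10: 9}
t = 2
S = [0] * (2 * t)
for loc, val in true_errors.items():
    for i in range(1, 2 * t + 1):
        S[i - 1] ^= gf_mul(val, gf_pow(ALPHA, loc * i))
sigma = [1]
for loc in true_errors:
    sigma = poly_mul_mod(sigma, [1, gf_pow(ALPHA, loc)], 2 * t + 1)
omega = poly_mul_mod(S, sigma, 2 * t)               # S(x) sigma(x) mod x^(2t)
found = chien_forney(sigma, omega)
```

The sketch evaluates every location sequentially, which mirrors the definition rather than any particular hardware arrangement of the search.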

The capability of correcting errors and erasures simultaneously can be expressed by the requirement that

$$2v + s \le n - k < d_{\min} \tag{2.18}$$

where v is the number of symbol errors that can be corrected and s is the number of symbol erasures that can be corrected. For erasure decoding, it has been shown that the error-and-erasure locator polynomial (errata locator polynomial) can be obtained directly by initializing an inverse-free BM algorithm with the erasure locator polynomial. Consequently, only the expansion of the erasure locator polynomial needs to be considered. The erasure locator polynomial is computed by the following equation, where the product runs over the erasure positions i:

$$T(x) = \prod_{i} (1 + \alpha^{i}x) \bmod x^{2t} \tag{2.19}$$

Hence, the error-erasure locator polynomial $\Lambda(x)$ (the errata locator polynomial) and the key equation $W(x)$ for erasure correction can be rewritten respectively as follows:

$$\Lambda(x) = \sigma(x)\,T(x) \bmod x^{n-k} \tag{2.20}$$

$$W(x) = S(x)\,\Lambda(x) \bmod x^{n-k} \tag{2.21}$$
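As a worked illustration of the Berlekamp-Massey iteration listed in Section 2.2 (steps 1 to 4), the following Python sketch handles the error-only case, without the erasure initialization described above and with an explicit division instead of the inverse-free form. The field $GF(2^4)$ with $p(x) = x^4 + x + 1$ and all function names are assumptions made for this example.

```python
# Sketch only: GF(2^4), p(x) = x^4 + x + 1 and alpha = x are assumed.
P_POLY, M_BITS, ALPHA = 0b10011, 4, 0b0010

def gf_mul(a, b):
    """Multiply two field elements (bit i holds the coefficient of x^i)."""
    result = 0
    for _ in range(M_BITS):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_BITS) & 1:
            a ^= P_POLY
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """a^(2^m - 2) is the inverse of a nonzero element."""
    return gf_pow(a, 2 ** M_BITS - 2)

def berlekamp_massey(S):
    """S[0] = S_1, ..., S[2t-1] = S_2t. Returns Lambda(x), lowest degree first."""
    lam_b = [1]                    # current locator Lambda^(b)
    lam_a = [1]                    # auxiliary polynomial Lambda^(a)
    l, gamma = 0, 1
    for k in range(1, len(S) + 1):                 # 2t iterations
        lam_a = [0] + lam_a                        # step 2(a): multiply by x
        delta = 0
        for j in range(min(l, len(lam_b) - 1) + 1):
            delta ^= gf_mul(lam_b[j], S[k - j - 1])   # discrepancy
        if delta:
            scale = gf_mul(delta, gf_inv(gamma))
            lam_c = [0] * max(len(lam_a), len(lam_b))
            for i, c in enumerate(lam_b):
                lam_c[i] ^= c
            for i, c in enumerate(lam_a):
                lam_c[i] ^= gf_mul(scale, c)       # step 2(b)
            if 2 * l <= k - 1:                     # step 2(c): length change
                lam_a, l, gamma = lam_b, k - l, delta
            lam_b = lam_c                          # step 2(d)
    while len(lam_b) > 1 and lam_b[-1] == 0:       # drop zero high-order terms
        lam_b.pop()
    return lam_b

# Syndromes of two errors: value 5 at location 3 and value 9 at location 10.
t = 2
S = [gf_mul(5, gf_pow(ALPHA, 3 * i)) ^ gf_mul(9, gf_pow(ALPHA, 10 * i))
     for i in range(1, 2 * t + 1)]
locator = berlekamp_massey(S)      # coefficients of (1 + alpha^3 x)(1 + alpha^10 x)
```

When the discrepancy is zero, nothing is updated, which corresponds to the statement in Section 2.2 that the locator and the dummy discrepancy are left unchanged in that case.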

CHAPTER 3 Universal Finite Field Operator

This chapter describes the Montgomery multiplication algorithm [8] and the implementation of the universal finite field operators. The basic idea for achieving universality is to use a universal finite field multiplier (FFM) that can accommodate different finite field definitions [20]. Compared with other proposed approaches, the universal finite field multiplier needs only two cycles to perform a finite field multiplication for various definitions. If one input of the universal FFM is multiplied by the correction factor beforehand, only one cycle is required. Additionally, a universal finite field inverter is needed for the last step, the Forney algorithm. Two approaches are presented in Section 3.3: Fermat's algorithm [10] with the universal multiplier, and an on-the-fly lookup table in SRAM.

3.1 Montgomery Multiplication Algorithm

An element A of the field $GF(q^m)$, with q prime, can be interpreted in polynomial representation. In this representation, multiplication in $GF(q^m)$ corresponds to multiplication of polynomials modulo an irreducible polynomial of degree m. Suppose A and B are two elements of $GF(q^m)$ and $p(x)$ is the corresponding primitive polynomial of this field. Then the product C = AB is defined as follows:

$$C(x) = A(x)B(x) \bmod p(x) \tag{3.1}$$

where C is also an element of $GF(q^m)$.

Finite field addition and subtraction are simply exclusive-OR operations. Therefore, the operations of interest are multiplication and division (that is, inversion) in the finite field. Given the modular multiplication in (3.1), the Montgomery multiplication algorithm can be adopted to calculate the product $C(x)$. This algorithm has been proven to replace the modular operation with a series of multiplications. The following equation defines the Montgomery product of A and B:

$$\hat{C}(x) = A(x)B(x)R^{*}(x) \bmod p(x) \tag{3.2}$$

The polynomial $R^{*}(x)$ is a fixed element of $GF(q^m)$ satisfying $R(x)R^{*}(x) = 1 \bmod p(x)$ with $R(x) = x^{m}$. Note that the requirement that $R(x)$ and $p(x)$ be relatively prime is always satisfied. It has been proven in [8] that the result $\hat{C}(x)$ of (3.2) can be obtained by the following equations:

$$Q(x) = A(x)B(x)p^{*}(x) \bmod R(x) \tag{3.3}$$

$$\hat{C}(x) = [A(x)B(x) + Q(x)p(x)] / R(x) \tag{3.4}$$

The polynomial $p^{*}(x)$ in (3.3) is defined by $p(x)p^{*}(x) = 1 \bmod R(x)$. Compared with (3.2), the modulo $p(x)$ operation is replaced by a modulo $R(x)$ operation and a division by $R(x)$. Since $R(x) = x^{m}$, (3.3) and (3.4) are much easier to implement than (3.2). Furthermore, since A is interpreted in polynomial form and $R^{*}(x) = x^{-m} \bmod p(x)$, (3.2) can be rewritten as:

$$\hat{C}(x) = [a_{m-1}B(x)x^{-1} \bmod p(x)] + [a_{m-2}B(x)x^{-2} \bmod p(x)] + \cdots + [a_0B(x)x^{-m} \bmod p(x)] \tag{3.5}$$

Rearranging this equation gives an iterative representation:

$$\hat{C}(x) = \Big[a_{m-1}B(x) + \big[\cdots\big[a_1B(x) + [a_0B(x)\,x^{-1} \bmod p(x)]\big]\,x^{-1} \bmod p(x)\cdots\big]\Big]\,x^{-1} \bmod p(x) \tag{3.6}$$

Based on this equation and the transformation from (3.4) to (3.6), the Montgomery multiplication algorithm is derived as:

Montgomery multiplication algorithm:

    S_0(x) = 0;
    for (i = 0; i < m; i++) {
        ρ_i(x) = [(S_i(x) + a_i B(x)) p*(x)] mod x;
        S_{i+1}(x) = [S_i(x) + a_i B(x) + ρ_i(x) p(x)] / x;
    }
    Ĉ(x) = S_m(x);                                                  (3.7)

The term $p^{*}(x)$ is the multiplicative inverse of $p(x)$ modulo x. In $GF(2^m)$, elements are usually represented as binary digits, and the coefficients $a_i$ are the bits of A. The binary representation simplifies the Montgomery multiplication algorithm. Since $p(x)$ is irreducible, $p(x) \bmod x$ and $p^{*}(x) \bmod x$ are both equal to 1. The $p^{*}(x)$ term can therefore be eliminated from the algorithm, so that $\rho_i(x)$ equals the least significant bit of the sum $S_i(x) + a_iB(x)$.

The number of iterations in the Montgomery multiplication depends on the field degree m. However, the algorithm can be modified to remove the dependence on the variable m. In equations (3.3) and (3.4), $R(x)$ is replaced by $R_d(x) = x^{d}$, where d is a constant integer such that $d \ge m$. Since $R_d(x) \bmod p(x)$ is a nonzero element of $GF(q^m)$, there exists an element $R_d^{*}(x)$ in $GF(q^m)$ that satisfies $R_d(x)R_d^{*}(x) = 1 \bmod p(x)$. Therefore, the modified Montgomery multiplication algorithm for $GF(2^m)$ with $m \le d$ is constructed as follows:

(27) Modified Montgomery multiplication algorithm:

    MM(A(x), B(x), p(x)) {
        S_0(x) = 0;
        for (i = 0; i < d; i++) {
            if (i ≥ m) a_i = 0;
            T(x) = S_i(x) + a_i B(x);
            S_{i+1}(x) = [T(x) + t_0 p(x)] / x;
        }
        Ĉ(x) = S_d(x);
    }                                                           (3.8)

The term t_0 is the least significant bit of the temporary element T(x). If the field degree is less than d, the most significant bits of A are set to zero. The final result is the normal finite field product A(x)B(x) multiplied by a constant element R*_d(x) of GF(2^m). In other words, the output of the Montgomery multiplier carries a constant factor R*_d(x) mod p(x) on top of the standard product. This constant factor can be cancelled by applying one additional Montgomery multiplication. The product C(x) = A(x)B(x) is then computed using:

$$K(x) = x^{2d} \bmod p(x) \tag{3.9}$$

$$\hat{C}(x) = MM(A(x), B(x), p(x)) \tag{3.10}$$

$$C(x) = MM(\hat{C}(x), K(x), p(x)) \tag{3.11}$$

where K(x) is treated as a constant for a given p(x).

3.2 Universal Finite Field Multiplier

As mentioned in chapter 2, the circuit complexity of a universal finite field multiplier mainly depends on the modular operation for different primitive polynomials. To achieve universal finite field operation, the methodology proposed in the past uses a series of

(28) shift and multiplication operations to replace the modular operation. However, this approach costs two or more cycles more than the original dedicated finite field operation. This section presents a new multiplier architecture that can accommodate different finite field definitions. The proposed universal finite field multiplier is built on Montgomery multiplication and costs only two cycles per finite field multiplication.

According to the modified algorithm, the bit-level multiplier architecture can be implemented easily. The term t_0 is the LSB of T(x), and the division by x is replaced by a left shift operation. The Montgomery multiplier architecture for GF(2^m) with m ≦ 4 is shown in Fig. 3.1. Fig. 3.1(a) and Fig. 3.1(b) show the function units, and Fig. 3.1(c) illustrates the overall architecture in GF(2^4). The signals a_i and b_i are the bits of the two input elements A and B, which can be expressed as A = (a3 a2 a1 a0) and B = (b3 b2 b1 b0) respectively. In addition, m_i denotes the i-th bit of the primitive polynomial and S_i is the i-th output bit.

Figure 3.1: Montgomery multiplier structure for GF(2^m) with m ≦ 4.

Once the multiplier for maximum field degree d has been implemented, any multiplication in GF(2^m) with field degree less than d and the corresponding primitive polynomial is supported. As

(29) shown in Fig. 3.1(c), the proposed bit-parallel multiplier needs no additional control circuit, owing to its regular structure.

Table 3.1: Comparison of universal finite field multipliers.

                                         Instruction cycles
                 Critical path           C = AB    C = A/B    C = Σ_{i=0}^{n-1} A_i B_i
L. Song [11]     8 T_AND + 11 T_XOR      3         4m-4       3n-2
Proposed [20]    9 T_AND + 15 T_XOR      2         m          2n

Compared with the universal finite field multiplier proposed in [11], our approach needs no additional pre-shifting and post-shifting circuits. Table 3.1 compares the instruction cycles required by the proposed universal finite field multiplier [20] and by the multiplier of [11] when operating over GF(2^m) with a multiplier that supports a maximum field degree of 8. Note that one instruction cycle here denotes a single shift operation, multiplication, or addition, and that the finite field division in Table 3.1 is based on Fermat's algorithm. The table shows that the proposed multiplier needs fewer cycles for the finite field operations.

3.3 Universal Finite Field Inverter

The implementation of the Forney algorithm requires a universal finite field inverse operation. Two methods are generally used to realize the inverse operation: one is using Fermat's

(30) algorithm, which replaces inversion with a series of square and multiply operations [10], and the other is a look-up table.

Fermat's algorithm

$$\beta^{-1} = \beta^{2^{m}-2} = \beta^{2+2^{2}+\cdots+2^{m-1}} = \beta^{2(1+2(1+\cdots))} = \bigl(\beta\cdots\bigl(\beta(\beta\cdot\beta^{2})^{2}\bigr)^{2}\cdots\bigr)^{2} \tag{3.12}$$

Based on this algorithm, inversion in GF(2^m) can be replaced by a sequence of square and multiply operations. In addition, Fermat's algorithm shows that the inverse operation needs m-1 cycles, each of which includes two Montgomery multiplications. For example, in GF(16), β^{-1} = β^{16-2} = β^{2+4+8} = β^{2(1+2(1+2))} = (β(ββ^2)^2)^2, so three cycles are required. Fig. 3.2 shows the architecture of the inverter, which is composed of two universal FFMs. A control unit is added to realize the inverse operation: Montgomery multiplier A is used as a squarer and multiplier B performs the finite field multiplication.

Figure 3.2: The finite field inverter based on Fermat's algorithm.
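The modified Montgomery multiplication (3.8), the correction step (3.9)-(3.11), and the Fermat inversion (3.12) can be sketched in software as follows. This is only a bit-level model under stated assumptions, not the hardware of Fig. 3.1 or Fig. 3.2: polynomials are packed into Python integers (bit i is the coefficient of x^i), and the names `mont_mul`, `x_pow_mod`, `uff_mul`, and `fermat_inverse` are hypothetical. For clarity the inversion uses the corrected product for every step rather than modelling how the two multipliers are scheduled.

```python
def mont_mul(a, b, p, m, d):
    """Modified Montgomery product a*b*x^(-d) mod p(x) over GF(2^m), m <= d.

    Assumption: polynomials are ints, bit i holds the coefficient of x^i.
    """
    s = 0
    for i in range(d):
        a_i = (a >> i) & 1 if i < m else 0  # bits of A at or above m are zero
        t = s ^ (b if a_i else 0)           # T(x) = S_i(x) + a_i * B(x)
        if t & 1:                           # t0: least significant bit of T(x)
            t ^= p                          # add p(x) so that x divides T(x)
        s = t >> 1                          # S_{i+1}(x) = T(x) / x
    return s


def x_pow_mod(e, p, m):
    """Return x^e mod p(x); used for K(x) = x^(2d) mod p(x) of (3.9)."""
    r = 1
    for _ in range(e):
        r <<= 1
        if (r >> m) & 1:
            r ^= p
    return r


def uff_mul(a, b, p, m, d):
    """Plain product a*b mod p(x) from two Montgomery passes, (3.10)-(3.11)."""
    k = x_pow_mod(2 * d, p, m)
    return mont_mul(mont_mul(a, b, p, m, d), k, p, m, d)


def fermat_inverse(beta, p, m, d):
    """beta^(2^m - 2) by m-1 rounds of multiply-then-square, as in (3.12)."""
    r = 1
    for _ in range(m - 1):
        r = uff_mul(r, beta, p, m, d)  # role of multiplier B
        r = uff_mul(r, r, p, m, d)     # role of multiplier A (squarer)
    return r
```

With d fixed (for example d = 10), the same functions serve GF(2^4) with p(x) = x^4 + x + 1 (`0b10011`) and GF(2^8) with p(x) = x^8 + x^4 + x^3 + x^2 + 1 (`0x11D`) simply by changing `p` and `m`, which is the property the universal multiplier relies on.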

(31) On-the-fly Inversion Table

Fermat's algorithm needs many cycles to calculate the error value, which leads to a larger FIFO buffer. Therefore, for high-speed computation, an on-the-fly look-up table composed of a 2^m × m SRAM, a universal α generator, and a universal α^{-1} generator is proposed, as shown in Fig. 3.3. For each finite field definition, the universal α generator and α^{-1} generator update the finite field element and its corresponding inverse value, respectively, during the syndrome calculation stage. At the error evaluator stage, the inversion table is then available for the Forney algorithm. Hence, the number of decoding stages can be kept at four, and the length of the FIFO buffer can be decreased.

Figure 3.3: On-the-fly inversion table.
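One way to picture the table fill is sketched below, assuming that the α generator steps through α^i while the α^{-1} generator steps through α^{-i}, and that α^{-i} is written at address α^i on each cycle. That pairing is an interpretation of the description above, not a detail taken from Fig. 3.3; `build_inverse_table` is a hypothetical name, and field elements are packed into integers (bit i is the coefficient of x^i).

```python
def build_inverse_table(p, m):
    """Fill a 2^m-entry table so that table[a] = a^(-1) in GF(2^m).

    Assumption: p(x) is primitive, so alpha = x visits every nonzero element.
    """
    table = [0] * (1 << m)
    fwd = 1  # alpha generator output: alpha^i
    bwd = 1  # alpha^-1 generator output: alpha^(-i)
    for _ in range((1 << m) - 1):
        table[fwd] = bwd       # store the inverse at address alpha^i
        fwd <<= 1              # multiply by alpha = x
        if (fwd >> m) & 1:
            fwd ^= p           # reduce modulo p(x)
        if bwd & 1:
            bwd ^= p           # add p(x) so that x divides the value
        bwd >>= 1              # divide by alpha = x
    return table
```

The loop takes 2^m - 1 steps, one table write per step, so in this model the fill fits within the time needed to stream a full-length codeword through the syndrome stage.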

(32) CHAPTER 4 Proposed Universal Architectures

As mentioned in Chapter 2, the RS erasure decoder consists of a syndrome calculator, erasure locator polynomial expansion, a key-equation solver, a Chien-search block and error-value evaluator, and a finite field inverter. The syndrome calculator computes the syndrome vector for the key equation block. When the syndrome is equal to zero, the remaining decoding procedure is terminated. If not, the erasure locator polynomial is computed and the key equation block calculates the error locator polynomial based on the inversionless Berlekamp-Massey algorithm [12]. The error locations and the location roots are found in the Chien search step. Finally, according to the Forney algorithm, the error evaluator block calculates the error and erasure values. In addition, the FIFO buffer keeps the received codeword, and its size increases with the code block length.

The proposed architecture of the universal RS erasure decoder is presented in this chapter. The implementation of each of the components mentioned above is detailed in the following subsections. Subsection 4.1 addresses the universal syndrome and erasure value calculator. For erasure correction, the corresponding erasure values must be kept at the syndrome stage in order to compute the erasure locator polynomial, and must be passed to the next stage, the key equation block. For the key equation block design, the authors of [13, 14] present a decomposed inversionless BM architecture that reduces the complexity significantly. The proposals in [15, 16] require 2t~3t finite field multipliers, whereas the decomposed architecture requires only 3 finite field multipliers and no finite field inverter. For decoding erasures, the key equation must replace the initial condition with the erasure locator polynomial. Therefore, the expansion hardware of the erasure locator polynomial must run before

(33) the Berlekamp-Massey algorithm. For an area-efficient design, the combination of erasure locator polynomial expansion and the decomposed Berlekamp-Massey architecture is presented in subsection 4.2. Subsection 4.3 shows the Chien search and error evaluator architecture. This architecture can search the error and erasure roots of the errata locator polynomial for any of the variable parameters.

4.1 Universal Syndrome and Erasure Value Calculator

This section presents the universal syndrome architecture design. In past syndrome calculator designs, Horner's rule has been applied to reduce the hardware area of the substitution. Let R(x) be the received polynomial; the syndrome values are obtained by substituting the finite field elements α^1, α^2, ..., α^{2t}. This substitution can be expressed as follows:

$$S_i = R(\alpha^{i}) = (\cdots((R_{n-1}\alpha^{i} + R_{n-2})\alpha^{i} + R_{n-3})\alpha^{i} + \cdots + R_{1})\alpha^{i} + R_{0}, \quad \text{for } i = 1 \sim 2t \tag{4.1}$$

To suit the properties of the universal finite field multiplier, the syndrome has to be modified as:

$$\alpha^{m} S_i = \alpha^{m} R(\alpha^{i}), \quad \text{for } i = 1 \sim 2t \tag{4.2}$$

where α^m is the corrective factor for the universal finite field multiplier. Hence, each syndrome calculator cell requires one universal finite field multiplier. The cell architecture is shown in Fig. 4.1. Assuming eight correctable errors, sixteen syndrome cells are needed.
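The Horner form of (4.1) amounts to one multiply-add per received symbol in each syndrome cell. A minimal software sketch follows; it uses an ordinary (non-Montgomery) field multiplier, so the corrective factor α^m of (4.2) does not appear. The names `gf_mul`, `gf_pow`, and `syndromes` are hypothetical, symbols are packed into integers (bit i is the coefficient of x^i), and α is taken to be x.

```python
def gf_mul(a, b, p, m):
    """Product of a and b in GF(2^m) with field polynomial p(x)."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):  # reduce modulo p(x)
        if (r >> i) & 1:
            r ^= p << (i - m)
    return r


def gf_pow(a, e, p, m):
    """a^e in GF(2^m) for a non-negative integer e."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a, p, m)
    return r


def syndromes(received, two_t, p, m):
    """S_i = R(alpha^i) for i = 1..2t by Horner's rule.

    Assumption: received[0] is R_{n-1}, the first symbol to arrive.
    """
    result = []
    for i in range(1, two_t + 1):
        alpha_i = gf_pow(0b10, i, p, m)  # alpha = x
        s = 0
        for symbol in received:          # one multiply-add per symbol
            s = gf_mul(s, alpha_i, p, m) ^ symbol
        result.append(s)
    return result
```

For a valid codeword every returned value is zero; a single error of value e at the coefficient of x^j gives S_i = e·α^{ij}.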

(34) Figure 4.1: The syndrome cell of the syndrome calculator.

Fig. 4.2 shows the entire universal syndrome and erasure value calculator block for 16 correctable erasures. Besides calculating the syndrome values, this stage also calculates the erasure vectors. If the erasure flag is valid, an erasure has occurred, and the corresponding erasure value must be saved. Depending on the number of correctable errors and erasures, the syndrome selector has to transmit the appropriate syndrome vector to the key equation block.

Figure 4.2: Universal syndrome block.

(35) 4.2 Erasure Locator Polynomial Expansion and Key Equation Solver Block

Decomposed Berlekamp-Massey Architecture

As mentioned in chapter 2, the key-equation block can be implemented with either of two algorithms, the Euclidean algorithm or the Berlekamp-Massey algorithm. Many parallel architectures for the Berlekamp-Massey algorithm have been realized in the past, and they require 2t~3t finite field multipliers. However, a decomposed architecture that requires only three finite field multipliers has been proposed in [13] to reduce the circuit complexity significantly. This architecture is based on the inversionless Berlekamp-Massey algorithm, in which the finite field inverter is replaced by a multiplier without affecting the correctness of the result. The decomposed architecture slows down the key equation solver without impacting the decoding speed, and each iteration can be decomposed as follows:

$$\Lambda_j^{(c)} = \begin{cases} \gamma\cdot\Lambda_0^{(b)}, & \text{for } j = 0 \\ \gamma\cdot\Lambda_j^{(b)} + \delta^{(i)}\cdot\Lambda_{j-1}^{(a)}, & \text{for } 1 \le j \le s + v_i \end{cases}$$

$$\delta_j^{(i+1)} = \begin{cases} 0, & \text{for } j = 0 \\ \delta_{j-1}^{(i+1)} + S_{i-j+3}\cdot\Lambda_{j-1}^{(c)}, & \text{for } 1 \le j \le s + v_i \end{cases} \tag{4.5}$$

where Λ_j^{(a)} is a coefficient of Λ^{(a)}(x) and δ_j^{(i)} is the partial result in computing the discrepancy δ^{(i)}. From the above equations, only two finite field multipliers are used to compute the error locator polynomial Λ(x), and one finite field multiplier is needed to calculate the product of the syndrome value and the coefficient of the error locator polynomial.

(36) Figure 4.3: The decomposed key equation architecture for calculating the error locator polynomial.

Because of the regularity of the Berlekamp-Massey algorithm, the universal key equation block can be realized easily by replacing the dedicated finite field multipliers with universal finite field multipliers. Fig. 4.3 shows the error decoding steps of the decomposed key equation architecture with three finite field multipliers. A register buffer of (4t+1)*m bits keeps the latest error locator polynomial Λ^{(b)}(x) and the previous error locator polynomial Λ^{(a)}(x). Each initial coefficient, namely the discrepancy δ, the previous discrepancy γ, and the error locator polynomials Λ^{(a)}(x) and Λ^{(b)}(x), must be multiplied by the corresponding corrective factor α^m to obtain the correct result.

Expansion Hardware of Erasure Locator Polynomial

The paper [17] shows that the error-erasure locator polynomial (errata locator polynomial) can be obtained directly by initializing an inverse-free BM algorithm with the erasure locator polynomial. Hence, to implement erasure correction, only the erasure polynomial expansion hardware needs to be considered.

(37) The papers [18, 19] show two approaches to implementing the expansion hardware. One is a parallel architecture that costs n-k finite field multipliers, and the other is a serial architecture that needs two finite field multipliers. Additionally, after the erasure locator polynomial has been calculated, the initial discrepancy between the erasure locator polynomial and the syndrome vector must be computed as a coefficient of the inversionless Berlekamp-Massey algorithm. This step also incurs an additional penalty. Therefore, to achieve a regular and minimum-area design, the modified inversionless Berlekamp-Massey algorithm with erasure locator polynomial expansion is given as follows.

1) Initially, Λ^{(b)}(x) = 1, Λ^{(a)}(x) = 1, l = 0, k = 1, γ^{(k)} = 1, decode = 0.

2) If k < s (decode = 0), set γ = 1, δ = Z_k, Λ^{(a)}(x) = xΛ^{(b)}(x) = xΛ(x).
   Compute Λ^{(c)}(x) = γΛ^{(b)}(x) + δΛ^{(a)}(x) = (1 + Z_k x)Λ(x) and δ = Σ_{j=0}^{l} Λ_j^{(c)} S_{k-j}.
   Set k = k+1. If k < s, repeat step (2); else set decode = 1 and go to step (3).

3) Compute
   (a) Λ^{(a)}(x) ← xΛ^{(a)}(x) and δ = Σ_{j=0}^{l} Λ_j^{(b)} S_{k-j}
   (b) Λ^{(c)}(x) = γΛ^{(b)}(x) + δΛ^{(a)}(x)
   (c) If δ ≠ 0 and 2l ≤ k-1, set Λ^{(a)}(x) = Λ^{(b)}(x), l = k-l, γ = δ
   (d) Λ^{(b)}(x) = Λ^{(c)}(x)
   Set k = k+1. If k < d, go to step (3).

4) Stop.

(38) where k is the iteration number, s is the number of erasures, Z_k is the erasure value provided by the preceding syndrome stage, and Λ(x) is the error and erasure locator polynomial (that is, the errata locator polynomial). δ is the latest discrepancy and γ is the previous discrepancy. In this algorithm, if decode = 0 at the beginning and the iteration number is smaller than the number of erasures (k < s), the decomposed architecture calculates the erasure locator polynomial. Then the signal decode is asserted, and the inversionless Berlekamp-Massey algorithm is performed. Note that in step (2) and step (3) the computation of Λ(x) has a similar form. Thanks to this property, the erasure locator can be computed by the same architecture with an additional control circuit. Consequently, the erasure locator polynomial can be obtained in a regular way without additional finite field multipliers. Moreover, no extra cycles are required to handle the initial discrepancy. Fig. 4.4 shows the state of the decomposed architecture that computes the erasure locator polynomial.

Figure 4.4: Using the decomposed architecture to compute the erasure locator polynomial.
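The erasure locator expansion of step (2) can be illustrated with a short sketch. It covers only the repeated product (1 + Z_k x)Λ(x), not the discrepancy computation or step (3). The two weights in the update are chosen here, as an assumption, so that the weighted sum of Λ^{(b)}(x) and Λ^{(a)}(x) = xΛ^{(b)}(x) equals that product. The names `gf_mul`, `erasure_locator`, and `poly_eval` are hypothetical, and coefficients are packed into integers (bit i is the coefficient of x^i).

```python
def gf_mul(a, b, p, m):
    """Product of a and b in GF(2^m) with field polynomial p(x)."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):  # reduce modulo p(x)
        if (r >> i) & 1:
            r ^= p << (i - m)
    return r


def erasure_locator(z_values, p, m):
    """Expand prod_k (1 + Z_k x); coefficients are lowest degree first."""
    lam_b = [1]                       # Lambda^(b)(x) = 1
    for z_k in z_values:
        w_b, w_a = 1, z_k             # assumed weights, giving (1 + Z_k x)
        lam_a = [0] + lam_b           # Lambda^(a)(x) = x * Lambda^(b)(x)
        padded_b = lam_b + [0]        # same length as lam_a
        # Lambda^(c) = w_b * Lambda^(b) + w_a * Lambda^(a), then (b) <- (c)
        lam_b = [gf_mul(w_b, b, p, m) ^ gf_mul(w_a, a, p, m)
                 for a, b in zip(lam_a, padded_b)]
    return lam_b


def poly_eval(coeffs, x, p, m):
    """Evaluate a polynomial (lowest degree first) at x by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x, p, m) ^ c
    return acc
```

Each pass reuses the same multiply-add form as the Berlekamp-Massey update, which is why no extra multipliers are needed; the resulting polynomial has degree s and vanishes at Z_k^{-1} for every erasure.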

(39) Computation for Errata Evaluator Polynomial

The paper [13] also indicates that the errata evaluator polynomial can be computed by the decomposed architecture. After the errata locator polynomial is obtained, the errata evaluator polynomial can be derived as follows:

$$W(x) = S(x)\Lambda(x) \bmod x^{2t} = W^{(0)} + W^{(1)}x + \cdots + W^{(v-1)}x^{v-1}$$

$$W^{(i)} = S_{i+1}\Lambda_0 + S_i\Lambda_1 + \cdots + S_1\Lambda_i \tag{4.6}$$

where v is the degree of the errata locator polynomial and W^{(i)} is a coefficient of the errata evaluator polynomial. Computing the errata evaluator polynomial is similar to computing the discrepancy, and likewise requires multiply-and-add hardware. The errata evaluator polynomial can also be decomposed in the same way as the discrepancy calculation:

$$W_j^{(i)} = \begin{cases} S_{i+1}\Lambda_0, & \text{for } j = 0 \\ W_{j-1}^{(i)} + S_{i-j+1}\cdot\Lambda_j, & \text{for } 1 \le j \le i \end{cases} \tag{4.7}$$

This decomposed form is the same as that used to compute the discrepancy (equation (4.5)). Hence, the same hardware is used to solve for the errata evaluator polynomial after the errata locator polynomial has been obtained.

4.3 Chien Search and Error Evaluator Block

Chien search block

The Chien search checks whether each candidate is a root of the errata locator polynomial. If Λ(α^{-i}) = 0, there is an error or erasure at the i-th location of the received codeword. Similar to the syndrome block, the Chien search and error evaluator block

(40) are also implemented with Horner's rule. However, for the universal Chien-search architecture, the dedicated FFM in each cell is replaced with a universal FFM. Fig. 4.5(a) shows the circuit of the Chien search cell. Because the maximum field degree of the proposed design is ten, a one-stage FIFO buffer must hold 1024x10 bits, which costs a large area. To reduce the FIFO buffer length, the double-check Chien search tests two candidate roots at a time. Fig. 4.5(b) shows the entire Chien structure with n-k Chien search cells.

Figure 4.5: (a) The double-check Chien search cell. (b) Chien search architecture for n-k correctable erasures.

(41) Error value evaluator

The Forney algorithm mentioned in Chapter 2 is used to evaluate the error value. The Forney algorithm can be expressed as follows:

$$e_l = \frac{W(\beta_j)}{\Lambda'(\beta_j)} \tag{4.8}$$

where β_j is a root of the errata locator polynomial Λ(x) and Λ'(x) is the first derivative of Λ(x). In finite field arithmetic, the derivative can be replaced with a simple form composed of the original odd-degree coefficients:

$$\Lambda'(x) = \Bigl(\sum_{k=1}^{l}\Lambda_k x^{k}\Bigr)' = \sum_{k=1}^{l} k\,\Lambda_k x^{k-1} = \Lambda_1 + \Lambda_3 x^{2} + \cdots + \Lambda_{t_{odd}} x^{t_{odd}-1} = \frac{1}{x}\Lambda_{odd}(x) \tag{4.9}$$

where t_odd is the largest odd degree in Λ(x). Hence, the Forney algorithm can be rewritten as:

$$e_l = \frac{W(\beta_j)}{\Lambda'(\beta_j)} = \frac{W(\beta_j)\cdot\beta_j}{\Lambda_{odd}(\beta_j)}, \quad \text{for } j = 1 \sim t \tag{4.10}$$

There are two ways to realize the error evaluator block. One is a parallel approach similar to the Chien search architecture, and the other is a serial structure that uses a single FFM. However, a large FIFO buffer is the penalty of the serial architecture. Fig. 4.6 shows the serial error evaluator architecture, where UFFI denotes a universal finite field inverter and β_j denotes the roots of the errata locator polynomial. The finite field

(42) inverter is used to calculate the inverse of finite field elements. This architecture can calculate W(β_j) and 1/Λ_odd(β_j) at the same time.

Figure 4.6: The serial error evaluator architecture with one FFM.

4.4 Summary

In this chapter, the universal RS erasure decoder is proposed. If an (n, k, t, m) universal erasure RS decoder is designed, then any (n', k', t', m') RS code with n' ≤ n, k' ≤ k, t' ≤ t, m' ≤ m can be decoded by the proposed architecture.

Figure 4.7: The timing diagram for the proposed I architecture.

(43) The decoding timing scheme of the proposed RS decoder is shown in Fig. 4.7. As Fig. 4.7 shows, the double-check Chien search is used to reduce the search cycles, so that the serial error evaluator architecture can be applied. The finite field inverter of the proposed I architecture is based on Fermat's algorithm.

Figure 4.8: Block diagram of the proposed I RS decoder.

Fig. 4.8 shows the block diagram of proposed architecture I. Proposed architecture I supports a maximum field degree of 10 and up to 8 correctable errors. Two 2048x10 SRAMs are used to store the received codeword. Because the syndrome cells and Chien search cells are implemented with universal FFMs, the total gate count of proposed architecture I is large. Hence, serial computation architectures are used for the error evaluator block and the finite field inverter.

(44) CHAPTER 5 Area Efficient Design Approach

The RS erasure decoder consists of a syndrome calculator, erasure locator polynomial expansion, a key-equation solver, a Chien-search block, and an error-value evaluator. However, proposed architecture I has a larger design cost than a typical single-mode RS decoder. In proposed architecture II, constant universal finite field multipliers are used to reduce the gate count. In this chapter, the modified syndrome calculator is introduced in subsection 5.1, and the parallel Chien search and error evaluator architecture is shown in subsection 5.2. The parallel Chien search and error evaluator architecture can search the error and erasure roots and calculate the error values simultaneously. The key equation architecture is the same as in proposed architecture I. Because of the constant universal FFM, the total area of the syndrome block and the Chien search block is clearly reduced. Finally, the extension of the decoder to correct 16 errors is described in subsection 5.3.

5.1 Universal Syndrome and Erasure Value Calculator

The syndrome value represents the error information of the received codeword. Horner's rule is applied to compute the syndrome value. Hence, the typical substitution form of the syndrome value is as follows:

$$S_i = R(\alpha^{i}) = (\cdots((R_{n-1}\alpha^{i} + R_{n-2})\alpha^{i} + R_{n-3})\alpha^{i} + \cdots + R_{1})\alpha^{i} + R_{0}, \quad \text{for } i = 1 \sim 2t \tag{5.1}$$

The universal finite field multiplier costs a larger area than a dedicated constant multiplier. To achieve an area-efficient design, a constant UFFM (CUFFM)

(45) can be constructed by replacing one input of the UFFM with a fixed finite field element x^j. It can be expressed as follows:

$$\begin{aligned}
C(x) &= A(x)\,B(x)\,R^{*}(x) \bmod p(x)\,\Big|_{A(x)=x^{j}} \\
&= x^{j}B(x)x^{-m} \bmod p(x) \\
&= \Bigl[0 + \bigl[\cdots\bigl[1\cdot B(x) + \bigl[\cdots\bigl[0\cdot B(x)x^{-1} \bmod p(x)\bigr]\cdots\bigr]x^{-1} \bmod p(x)\bigr]\cdots\bigr]x^{-1} \bmod p(x)\Bigr]x^{-1} \bmod p(x) \\
&= \alpha^{(j-m)}B(x) \bmod p(x)\,\Big|_{x=\alpha}
\end{aligned} \tag{5.2}$$

This shows that multiplication by the constant α^{(j-m)} is a constant UFFM (CUFFM) function; for the syndrome S_i the constant is α^{(i-m)}. However, to suit the constant UFFM function, the original substitution of the syndrome polynomial must be modified. According to Horner's rule, each codeword symbol must be multiplied by the constant α^{m*n} before entering the syndrome cell, where n represents the location of the codeword symbol. The following equation details the modified syndrome substitution procedure.

$$\begin{aligned}
\alpha^{m}R(\alpha^{i}) &= \alpha^{m}R_{n-1}\alpha^{i(n-1)} + \alpha^{m}R_{n-2}\alpha^{i(n-2)} + \cdots + \alpha^{m}R_{0} \\
&= \alpha^{m}R_{n-1}\alpha^{(m+(i-m))(n-1)} + \alpha^{m}R_{n-2}\alpha^{(m+(i-m))(n-2)} + \cdots + \alpha^{m}R_{0} \\
&= (\cdots((R_{n-1}\alpha^{mn}\alpha^{(i-m)} + R_{n-2}\alpha^{m(n-1)})\alpha^{(i-m)} + \cdots)\cdots)\alpha^{(i-m)} + R_{0}\alpha^{m}
\end{aligned} \tag{4.4}$$

Based on the above equation, the modified syndrome calculator is constructed as shown in Fig. 5.1. Fig. 5.1(a) shows each cell of the syndrome architecture, and Fig. 5.1(b) shows the entire universal syndrome and erasure value calculator block for 16 correctable erasures. Since the area and critical path of the CUFFM increase in proportion to the minus degree of α^{(i-m)}, the maximum minus degree of the CUFFM is kept at eight.
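The CUFFM identity above, namely that a Montgomery multiplier with one input tied to x^j multiplies by the constant α^{(j-m)}, can be checked numerically with a small sketch. It assumes the multiplier runs for exactly m iterations (d = m), packs polynomials into integers (bit i is the coefficient of x^i), and uses the hypothetical names `mont_mul`, `make_cuffm`, `gf_mul`, and `alpha_pow`.

```python
def mont_mul(a, b, p, m):
    """Montgomery product a*b*x^(-m) mod p(x) over GF(2^m) (d = m assumed)."""
    s = 0
    for i in range(m):
        t = s ^ (b if (a >> i) & 1 else 0)  # T(x) = S_i(x) + a_i * B(x)
        if t & 1:
            t ^= p                          # make T(x) divisible by x
        s = t >> 1                          # divide by x
    return s


def make_cuffm(j, p, m):
    """Constant multiplier: input A is tied to the fixed element x^j, j < m."""
    return lambda b: mont_mul(1 << j, b, p, m)


def gf_mul(a, b, p, m):
    """Reference product of a and b in GF(2^m)."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= p << (i - m)
    return r


def alpha_pow(e, p, m):
    """alpha^e for any integer e, using alpha^(2^m - 1) = 1 with alpha = x."""
    r = 1
    for _ in range(e % ((1 << m) - 1)):
        r = gf_mul(r, 0b10, p, m)
    return r
```

For m = 8 and j = 0..7 the constants are α^{-8} through α^{-1}, so `make_cuffm(j, p, m)(b)` agrees with `gf_mul(b, alpha_pow(j - m, p, m), p, m)` for every b. Because the fixed input has a single nonzero bit, most of the conditional additions of B(x) disappear, which is one plausible reading of why the constant version is smaller.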

(46) Figure 5.1: The universal syndrome and erasure value calculator.

5.2 Chien Search and Error Evaluator Block

For an area-efficient design, the universal FFM in each cell of the Chien search and error evaluator block can be replaced with a constant universal FFM. Since the area and critical path of the CUFFM increase with the minus degree of α, the errata polynomial form must be modified to

(47) avoid larger minus degrees. Assuming 16 correctable erasures, the modified errata polynomial form is as follows:

$$\begin{aligned}
\Lambda(\alpha^{-i}) &= \Lambda_0 + \Lambda_1(\alpha^{-1})^{i} + \Lambda_2(\alpha^{-2})^{i} + \cdots + \Lambda_8(\alpha^{-8})^{i} + \Lambda_9(\alpha^{-9})^{i} + \Lambda_{10}(\alpha^{-10})^{i} + \cdots + \Lambda_{16}(\alpha^{-16})^{i} \\
&= \Lambda_0 + \Lambda_1(\alpha^{-1})^{i} + \cdots + \Lambda_8(\alpha^{-8})^{i} + (\alpha^{-8})^{i}\bigl\{\Lambda_9(\alpha^{-1})^{i} + \Lambda_{10}(\alpha^{-2})^{i} + \cdots + \Lambda_{16}(\alpha^{-8})^{i}\bigr\}
\end{aligned} \tag{5.2}$$

In this modified form, the maximum minus degree of α is always 8, and the Chien search block can be implemented easily from this polynomial form. Fig. 5.2 shows the area-efficient Chien-search architecture for t=8. The output of each cell whose α degree is larger than 8 is multiplied by the corrective factor α^{-8}.

Figure 5.2: The parallel Chien-search block with constant UFFM.
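The regrouping in the modified errata polynomial can be checked numerically. The sketch below evaluates Λ(α^{-i}) both directly and in the split form, using only powers α^{-1} through α^{-8} in the second case. It is a software illustration rather than the cell structure of Fig. 5.2; the names `exp_table`, `gf_mul`, `chien_direct`, and `chien_split` are hypothetical, and coefficients are packed into integers (bit i is the coefficient of x^i).

```python
def exp_table(p, m):
    """List of alpha^e for e = 0 .. 2^m - 2, with alpha = x."""
    table, v = [], 1
    for _ in range((1 << m) - 1):
        table.append(v)
        v <<= 1
        if (v >> m) & 1:
            v ^= p
    return table


def gf_mul(a, b, p, m):
    """Product of a and b in GF(2^m) with field polynomial p(x)."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= p << (i - m)
    return r


def chien_direct(lam, i, exp, p, m):
    """Lambda(alpha^-i) = sum_k Lambda_k * (alpha^-k)^i."""
    order = len(exp)
    acc = 0
    for k, coeff in enumerate(lam):
        acc ^= gf_mul(coeff, exp[(-k * i) % order], p, m)
    return acc


def chien_split(lam, i, exp, p, m, split=8):
    """Same value, with terms above `split` regrouped behind (alpha^-split)^i."""
    order = len(exp)
    low = 0
    for k in range(min(split, len(lam) - 1) + 1):
        low ^= gf_mul(lam[k], exp[(-k * i) % order], p, m)
    high = 0
    for k in range(split + 1, len(lam)):
        high ^= gf_mul(lam[k], exp[(-(k - split) * i) % order], p, m)
    return low ^ gf_mul(exp[(-split * i) % order], high, p, m)
```

Both functions return the same value for every position i, so a zero of either marks an error or erasure location. The split form never needs a constant multiplier beyond α^{-8}, at the cost of one extra multiplication by the corrective term.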

(48) The error evaluator polynomial is implemented in the same way as the error locator polynomial:

$$\begin{aligned}
\Omega(\alpha^{-i}) &= \Omega_0 + \Omega_1(\alpha^{-1})^{i} + \Omega_2(\alpha^{-2})^{i} + \cdots + \Omega_8(\alpha^{-8})^{i} + \Omega_9(\alpha^{-9})^{i} + \Omega_{10}(\alpha^{-10})^{i} + \cdots + \Omega_{16}(\alpha^{-16})^{i} \\
&= \Omega_0 + \Omega_1(\alpha^{-1})^{i} + \cdots + \Omega_8(\alpha^{-8})^{i} + (\alpha^{-8})^{i}\bigl\{\Omega_9(\alpha^{-1})^{i} + \Omega_{10}(\alpha^{-2})^{i} + \cdots + \Omega_{16}(\alpha^{-8})^{i}\bigr\}
\end{aligned} \tag{5.3}$$

Figure 5.3: The parallel error evaluator block with constant UFFM.

Figure 5.3 shows the parallel error evaluator architecture. The inversion RAM, which will be described in the next subsection, stores the corresponding finite field inverses. Each cell is built from one constant universal FFM. Compared with the serial architecture in Fig. 4.6, because the Chien search and error evaluation can be performed at the same time,

(49) this architecture saves one stage of FIFO buffer. However, it costs N cycles in total to perform the error evaluator function.

5.3 8 ≤ t ≤ 16 Error-only Correction

Since the proposed design supports up to 16 correctable erasures, it can be configured to correct 9~16 errors without any erasures. The basic idea is to calculate the syndrome twice, which costs 2N cycles (N is the block length). In the first N cycles, the syndromes S1~S16 are calculated. If the first sixteen syndromes S1~S16 are all zero, S17~S32 are all zero as well, and the remaining decoding process, including the second 16-syndrome calculation, the key equation solver, and the Chien search block, can be terminated. Based on this property, the power consumption can be reduced significantly. If the syndromes S1~S16 are not all zero, the syndromes S17~S32 are computed. The syndrome block reads the received codeword again from the FIFO buffer, and the next codeword is put on hold. Fig. 5.4 shows the hardware structure between the syndrome block and the FIFO buffer, and Fig. 5.5 shows the decoding procedure for 16 error-only correction.

Figure 5.4: Block diagram of 16-error correction.

(50) Figure 5.5: Decoding timing diagram for 16 error-only correction.

5.5 Summary

Figure 5.6: The timing diagram for the proposed II architecture.

In this chapter, the area-efficient architecture of the universal RS erasure decoder is introduced. Proposed architecture II supports a maximum field degree of 8 and up to 16 correctable errors. The decoding timing scheme of proposed decoder II is shown in Fig. 5.6. As this figure shows, the maximum latency is 4N+4, and two 512x8 SRAMs are used. Fig. 5.7 shows the block diagram of proposed architecture II. The area-efficient approach is adopted for implementing the proposed II architecture. A 256x8 SRAM is used to realize the on-the-fly inversion table, and the parallel Chien-search and error evaluator block is adopted for

(51) high-speed computation. In addition, techniques such as clock gating are adopted to reduce the power consumption. The on-the-fly SRAM supports the parallel Chien search and error evaluator block for high-speed computation. Overall, power consumption receives more attention in proposed architecture II.

Figure 5.7: Block diagram of the proposed II RS decoder.

(52) CHAPTER 6 Chip Implementation Result

This chapter describes the chip implementation and its design methodology. Subsection 6.1 describes the design and test considerations. The two implementations of the proposed universal RS erasure decoders are then shown in subsections 6.2 and 6.3. The proposed I architecture supports field degrees up to ten and up to eight correctable errors. The proposed II architecture, implemented with the area-efficient approach, supports a maximum field degree of eight and up to sixteen correctable errors. In addition, the simulation results of the two proposed architectures are compared with other single-mode or reconfigurable RS decoders published in the past.

6.1 Design and Test Consideration

Fig. 6.1 presents the entire design and testing flow with the various CAD tools. First, a high-level language such as C/C++ or MATLAB is used to construct the software simulation environment and to generate a large number of random codewords with AWGN noise. Then, after RTL coding, hardware-software co-simulation ensures the correctness of the behavior model. Fig. 6.2 shows the hardware-software co-simulation setup. The Verilog hardware description language is chosen for the RTL implementation. After the RTL level, the gate-level implementation is performed with the Synopsys Design Analyzer synthesis tool. The synthesis standard library of the proposed architecture is 0.13 µm 1P8M CMOS

(53) technology. In 0.13 µm technology the clock rate and the performance improve significantly, and the memory size of the FIFO buffer and inversion table decreases noticeably. After gate-level synthesis, pre-layout simulation is performed to verify the gate-level performance. In deep-submicron processes, wire delay plays an important role in circuit speed. Hence, pre-layout simulation cannot calculate the circuit speed precisely. Besides pre-layout simulation with the NC-Verilog compiler, PrimeTime is also an effective CAD tool for calculating the critical path.

Figure 6.1: The entire design flow.

(54) Figure 6.2: The simulation environment.

After successful pre-layout simulation, place and route is performed with the SOC Encounter tool. In deep-submicron design, problems such as signal integrity (SI), IR drop, and wire delay must be considered carefully. The power consumption, RC extraction, and timing estimation are computed accurately during place and route. Finally, post-layout verification, including DRC (design rule check) and LVS (layout versus schematic), verifies the integrity of the chip layout.

6.2 Chip Implementation for Proposed Architecture I

The proposed I architecture supports a maximum field degree of 10 and a maximum of 8 correctable errors (a maximum of 16 correctable erasures). In the syndrome block, the universal FFM is applied in the 16 syndrome cells. In the Chien search block, the double-check architecture is used to reduce the search cycles. The error evaluator block uses the serial architecture. Two 2048x10 SRAMs are used to realize the FIFO buffer, and the inverter is implemented based on Fermat's algorithm. In this design, the circuit complexity

(55) is not taken into account: the syndrome cells and Chien cells are all implemented with universal FFMs. This architecture is implemented in 0.13 µm 1P8M standard cell technology. The critical path of the synthesized gate-level model lies in the key equation block. Fig. 6.3 shows the chip die photo of the proposed I architecture. Table 6.1 shows the chip summary of proposed decoder I. The total gate count is about 110K and the maximum clock rate is 222 MHz. The core size is 1.25 x 0.63 mm2. The maximum power consumption is 23 mW at a clock rate of 222 MHz. The chip is packaged in an 84 CLCC package.

Figure 6.3: The die photo of the proposed I architecture.

(56) Table 6.1: The chip summary of the proposed I universal RS decoder.

Design                             Universal RS Erasure Decoder
Maximum field degree               10
Correctable errors                 1~8
Memory size                        40 K bits
Core area (mm2)                    0.78
Total gate count                   75K + 35K FIFO RAM
Maximum operating frequency        220 MHz
Data rate (M bits/s)               2200
Average power (supply voltage)     23.2 mW (1.2 V)

6.3 Chip Implementation for Proposed Architecture II

The proposed II universal RS erasure decoder is implemented with the area-efficient design. This architecture supports a maximum field degree of 8, and both the maximum number of correctable errors and the maximum number of correctable erasures are 16. Two 512x8 SRAMs are used to realize the FIFO buffer, and a 256x8 SRAM is used to construct the finite field inversion table. The error evaluator block uses the parallel architecture, which performs the Forney

(57) algorithm at the Chien search stage. To limit circuit complexity, the syndrome cells and Chien cells are implemented with constant UFFMs. Therefore, the total area of the entire RS decoder has only a small overhead relative to a single-mode RS decoder. This decoder is also implemented in 0.13 µm 1P8M standard cell technology. Fig. 6.4 shows the layout view of proposed decoder II. The critical path of the synthesized gate-level model again lies in the key equation block. Table 6.2 shows the chip summary of the proposed II decoder. The total gate count is about 54K, of which the FIFO buffer accounts for 14K, and the maximum clock rate is 300 MHz. The core size is 0.36 mm2. The maximum power consumption is 20.2 mW at a clock rate of 222 MHz. The chip is packaged in a 68 CLCC package.

Figure 6.4: The layout view of the proposed II architecture.

(58) Table 6.2: The chip summary of the proposed architecture II.

Design                               Universal RS Erasure Decoder
Maximum field degree                 8
Correctable errors                   1~16
Memory size                          10 K bits
Core area (mm²)                      0.36
Total gate count                     39K + 14K FIFO
Maximum operating frequency          300 MHz
Data rate (Mbits/s)                  2400
Average power, mW (supply voltage)   20.2 (1.2V)

6.4 Comparison

Table 6.3 compares RS decoders of various modes. The table shows that our proposed architectures support the largest number of correctable errors, erasure correction, and complete reconfigurability. Compared with the universal decoder proposed in [23], our proposed decoder improves the decoding speed by about 50 times thanks to its parallel decoding scheme. In addition, our proposed design offers more flexibility and a much higher decoding data rate.

(59) Table 6.3: The comparison of RS decoders of various modes.

                        [21]        [22]        [23]              Proposed I        Proposed II
Mode                    Single      Variable    Universal         Universal         Universal
                                    (n, t)      (n, t, m, p(x))   (n, t, m, p(x))   (n, t, m, p(x))
m (field degree)        8           8           1~8               1~10              1~8
t (correctable errors)  8           1~8         1~8               1~8               1~16
Erasure                 No          No          No                Yes               Yes
p(x)                    Single      Single      Variable          Variable          Variable
Data rate (Mbits/s)     1600        800         48                2200              2400
                        (200MHz)    (100MHz)                      (220MHz)          (300MHz)
                        (parallel)  (parallel)  (serial)          (parallel)        (parallel)
Gate count              21K         34K         44K               75K               39K
                                                                  + 35K FIFO        + 14K FIFO
Technology (µm)         0.25        0.35        0.25              0.13              0.13
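The data rates of the parallel designs in Table 6.3 can be cross-checked with simple arithmetic. The sketch below assumes that each parallel decoder processes one m-bit symbol per clock cycle; this is an inference that happens to reproduce every parallel entry in the table, not a statement taken from the cited papers. The serial design [23] is used only as the 48 Mbits/s baseline for the speed-up ratio.

```python
# (clock in MHz, symbol width m in bits, data rate reported in Table 6.3 in Mbits/s)
designs = {
    "[21]": (200, 8, 1600),
    "[22]": (100, 8, 800),
    "Proposed I": (220, 10, 2200),
    "Proposed II": (300, 8, 2400),
}

SERIAL_RATE_23 = 48  # Mbits/s, serial decoder [23] in Table 6.3


def symbol_per_cycle_rate(f_mhz, m_bits):
    """Data rate in Mbits/s, assuming one m-bit symbol per clock cycle."""
    return f_mhz * m_bits


computed = {name: symbol_per_cycle_rate(f, m) for name, (f, m, _) in designs.items()}
speedup_vs_23 = {
    name: computed[name] / SERIAL_RATE_23 for name in ("Proposed I", "Proposed II")
}

for name, (f, m, reported) in designs.items():
    print(f"{name}: {f} MHz x {m} bits = {computed[name]} Mbits/s (table: {reported})")
for name, ratio in speedup_vs_23.items():
    print(f"{name} vs [23]: {ratio:.1f}x")
```

Under this assumption the ratios against [23] come out to roughly 45.8 for decoder I and exactly 50 for decoder II, in line with the "about 50 times" figure quoted in Section 6.4.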

(60) CHAPTER 7 Conclusion

In this thesis, two universal architectures for the RS error-and-erasure decoder are presented. The proposed architectures accommodate variable codeword lengths, numbers of correctable errors, finite field degrees, and primitive polynomials. The proposed architecture I supports a maximum field degree of ten and corrects up to eight errors. The proposed architecture II supports a maximum field degree of eight and corrects up to sixteen errors. To achieve the universal property, the design challenge is to realize a dedicated RS decoder that can accommodate different finite field definitions. The main solution is to apply the Montgomery multiplication algorithm described in Section 2.1. Based on this algorithm, universal finite field operators, including the multiplier and the inverter, can be implemented. To limit complexity, a universal constant multiplier is applied in the syndrome block and the Chien search block to reduce the area. In addition, we combine the erasure locator expansion with the Berlekamp-Massey algorithm to achieve erasure correction at the cost of additional universal FFMs. In terms of design flow, the software simulation is built first, and the RTL code is then verified against the software results. Verilog is chosen as the hardware description language for the RTL implementation. After the RTL stage, the gate-level implementation is obtained by synthesis with a 0.13µm 1P8M CMOS standard cell library. Finally, the layout is constructed with the SOC Encounter software.
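To make the role of Montgomery multiplication concrete, here is a minimal software sketch of the textbook bit-serial form in GF(2^m), where both the field degree m and the polynomial p(x) are run-time parameters. Section 2.1 of the thesis is not reproduced here, so this should not be read as the thesis's exact formulation or hardware structure. The function names and the example polynomials 0x11D (degree 8) and 0x409 (degree 10) are assumptions chosen for illustration.

```python
def mont_mul(a, b, p, m):
    """Return a * b * x^(-m) mod p(x) over GF(2).

    p is the full polynomial including the x^m term; its constant term
    must be 1 so that adding p always clears the lowest bit.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:   # add b when bit i of a is set
            c ^= b
        if c & 1:          # make c divisible by x
            c ^= p
        c >>= 1            # exact division by x
    return c


def x_pow_mod(e, p, m):
    """Return x^e mod p(x) by repeated multiplication by x."""
    r = 1
    for _ in range(e):
        r <<= 1
        if (r >> m) & 1:
            r ^= p
    return r


def gf_mul(a, b, p, m):
    """Ordinary field product, built from two Montgomery products."""
    r2 = x_pow_mod(2 * m, p, m)          # x^(2m) mod p(x)
    return mont_mul(mont_mul(a, b, p, m), r2, p, m)


# The same routine serves two different field definitions (assumed examples).
print(hex(gf_mul(0x02, 0x80, 0x11D, 8)))    # x * x^7 in a degree-8 field
print(hex(gf_mul(0x02, 0x200, 0x409, 10)))  # x * x^9 in a degree-10 field
```

The reduction step only inspects the lowest bit of the running sum and never needs to locate the top bit of p(x), so the same loop body works for any degree up to the register width. This is a plausible reason the algorithm suits a decoder whose m and p(x) are programmable.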

(61) APPENDIX Hardware Sharing Design for (528, 518) RS codec IP

In this appendix, an area-efficient Reed-Solomon (RS) codec IP with a composite-field inverter is presented. In some specific applications, such as a flash memory controller, the RS decoder stops receiving new codewords until the erroneous codeword in progress has been corrected. The circuit complexity can therefore be reduced by sharing the registers and finite-field operation units. The proposed hardware sharing architecture also includes the RS encoder function. Moreover, for area considerations, a composite field inverter is used in the error evaluator.

Proposed Hardware Sharing Architecture

In a flash memory controller, the RS (528, 518) code over GF(2^10) is used to mitigate the errors that may be introduced during manufacturing or by user damage. Note that there are 518 message bytes in each codeword of 528 coded bytes. Since the specified RS code is constructed over GF(2^10), these 10 parity-checking bytes imply that the number of correctable errors is 4. In this section, the RS encoder & syndrome calculator, the key-equation solver, and the Chien-search & error-value evaluator are first introduced in the following subsections. The hardware sharing architecture is then addressed to optimize the usage of registers and operation units. By means of linear system theory transformations, Fettweis proposed a combined methodology to implement both the RS encoder block and the syndrome calculator [24]. In the key equation solver, the decomposed inversionless Berlekamp-Massey architecture uses 3