A Reed-Solomon Product-Code (RS-PC) decoder chip for DVD applications


Hsie-Chia Chang, C. Bernard Shung, Member, IEEE, and Chen-Yi Lee

Abstract—In this paper, a Reed–Solomon Product-Code (RS-PC) decoder for DVD applications is presented. It mainly contains two frame-buffer controllers, a (182, 172) row RS decoder, and a (208, 192) column RS decoder. The RS decoder features an area-efficient key equation solver using a novel modified decomposed inversionless Berlekamp–Massey algorithm.

The proposed RS-PC decoder solution was implemented using 0.6-μm CMOS single-poly double-metal (SPDM) standard cells. The chip size is 4.22 × 3.64 mm² with a core area of 2.90 × 2.88 mm², where the total gate count is about 26K. Test results show that the proposed RS-PC decoder chip can support 4× DVD speed with off-chip frame buffers or 8× DVD speed with embedded frame buffers operating at 3 V.

Index Terms—Reed–Solomon Product-Code decoder, DVD, decomposed inversionless Berlekamp–Massey algorithm.

I. INTRODUCTION

Due to increasing demand for high-quality video and audio consumer products, the digital versatile disc (DVD) was standardized in 1995 by a leading industrial consortium to provide higher storage capacity. In order to mitigate the errors that may be introduced during manufacturing or by user damage, a Reed–Solomon Product-Code (RS-PC) is used in DVD for error correction. In this paper, we report an RS-PC decoder chip for DVD applications.

As illustrated in Fig. 1, the DVD RS-PC is composed of a (182, 172) RS code in the row direction and a (208, 192) RS code in the column direction. We will refer to the matrix in Fig. 1 as a frame. An $(n, k)$ RS code contains $k$ message symbols and $n - k$ parity checking symbols, and is capable of correcting up to $t = \lfloor (n - k)/2 \rfloor$ symbol errors. For the (182, 172) and (208, 192) RS codes, each symbol is one byte.
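As a quick check of the code parameters, the following minimal sketch applies the standard bound that an RS code with $n - k$ parity symbols corrects up to $\lfloor (n - k)/2 \rfloor$ symbol errors. The function name `correctable_errors` is an illustrative choice, not something taken from the chip design.

```python
def correctable_errors(n: int, k: int) -> int:
    """Return t = floor((n - k) / 2) for an (n, k) RS code."""
    if not 0 < k < n:
        raise ValueError("expected 0 < k < n")
    return (n - k) // 2


# Row and column codes of the DVD product code.
row_t = correctable_errors(182, 172)
col_t = correctable_errors(208, 192)
print(row_t, col_t)
```

Under this bound the row code corrects up to five symbol errors and the column code up to eight, which agrees with the five and eight correction symbol times quoted for the frame-buffer controller.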

The most popular RS decoder architecture today [1], [2] can be summarized into four steps: 1) calculating the syndromes from the received codeword; 2) computing the error locator polynomial and the error evaluator polynomial; 3) finding the error locations; and 4) computing error values. The second step in the four-step procedure involves solving the key equation$^1$ [1], which is

$$S(x)\,\sigma(x) = \Omega(x) \bmod x^{2t} \tag{1}$$

where $S(x)$ is the syndrome polynomial, $\sigma(x)$ is the error locator polynomial, and $\Omega(x)$ is the error evaluator polynomial.

Manuscript received March 17, 2000; revised August 15, 2000. This work was supported in part by the NSC of Taiwan, R.O.C., under Grant NSC89-2215-E-029-053. This paper was presented at the International Solid-State Circuits Conference, San Francisco, CA, February 1998.

H.-C. Chang and C.-Y. Lee are with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

C. B. Shung was with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C. He is currently with Allayer Communications Corporation, San Jose, CA 95134 USA.

Publisher Item Identifier S 0018-9200(01)00929-5.

Fig. 1. DVD RS-PC frame structure.

While there has been a lot of research work reported on RS decoder designs, there has been little on RS-PC decoders. The architectural design of RS-PC decoders differs from that of RS decoders in the following ways. First, each symbol in the RS-PC decoder is subject to a row RS decoding and a column RS decoding. Since the column RS decoding cannot proceed until all the row RS decodings in the frame are finished, a frame buffer is required to parallelize the row and column decoding. Second, in most RS decoder designs, line buffers in the form of shift registers are used to store the received symbols while the error locations and error values are computed. When the code size is large, these line buffers constitute a major portion of the hardware complexity. In an RS-PC decoder, however, we can exploit the frame buffers by arranging the access pattern so that the line buffers are no longer needed. Third, it is a design choice whether to implement a programmable RS decoder which can serve as row RS decoder and column RS decoder at different times, or to implement one dedicated row RS decoder and one dedicated column RS decoder. All these design considerations will be explained in more detail in this paper.

$^1$In fact, the key equation defined in [1] was $(1 + S(x))\,\sigma(x) = \Omega(x) \bmod x^{2t+1}$, where the syndrome polynomial was defined to be $S(x) = \sum_{i=1}^{2t} S_i x^{i}$. In our notation, which follows [4], $S(x) = \sum_{i=1}^{2t} S_i x^{i-1}$, and hence our key equation is slightly different.

Fig. 3. Three-stage pipelining of the RS decoder using the column RS decoder as the example.

Section II describes the RS decoder architecture. Our RS decoder architecture is shown in Fig. 2, which contains a syndrome calculator, a key equation solver, a Chien search, and an error value evaluator. We present a novel implementation of the key equation solver which helps to reduce the hardware complexity significantly. Section III describes the dual-frame-buffer architecture of the RS-PC decoder. We explain the control flow and data flow of the two frame-buffer controllers. Section IV shows the chip implementation and chip testing. Finally, in Section V we conclude the paper.

II. RS DECODER DESIGN

As shown in Fig. 2, we divide the decoding process into four steps. The syndrome calculator calculates a set of syndromes from the received codewords. From the syndromes, the key equation solver produces the error locator polynomial $\sigma(x)$ and the error evaluator polynomial $\Omega(x)$, which are used by the Chien search and the error value evaluator to produce the error locations and error values, respectively. In Fig. 3, we illustrate the three-stage pipelining used in the (208, 192) column RS decoder.

In our RS decoder, an inversionless Berlekamp–Massey algorithm is adopted which not only eliminates the finite-field inverter (FFI) but also introduces additional parallelism. We discovered a scheduling of three finite-field multipliers to implement the algorithm, which we name the decomposed inversionless Berlekamp–Massey algorithm. Because of the decomposed algorithm, a specified sequence is added to the syndrome calculator, and we illustrate the modification in Section II-A. In Sections II-C and II-D, we introduce how to calculate error locations and error values.

A. Syndrome Calculator

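By definition the syndrome polynomial is $S(x) = S_1 + S_2 x + \cdots + S_{2t} x^{2t-1}$, where each syndrome $S_i$ is the value of the received polynomial $R(x) = R_{n-1} x^{n-1} + \cdots + R_1 x + R_0$ at the $i$th root of the generator polynomial, and $R_{n-1}$ is the first received symbol into a syndrome cell illustrated in Fig. 4(a).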

As shown in Fig. 4(a), at each cycle, the partial syndrome is multiplied by the cell's fixed power of $\alpha$ and accumulated with the received symbol. After all the received symbols are processed, the accumulated result is the $i$th syndrome. The upper side of Fig. 4(a) indicates a way to connect multiple syndrome cells to generate a controllable sequence of syndrome results.
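The multiply-and-accumulate behavior of a syndrome cell is Horner's rule over GF($2^8$). The sketch below is illustrative only: it assumes the field polynomial $x^8 + x^4 + x^3 + x^2 + 1$ and $\alpha = 2$, and the exponents passed to the cells (0 to 3 here) are an arbitrary choice, since the exact root indexing used on the chip is not recoverable from the text. All function names are hypothetical.

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)
ALPHA = 2          # assumed primitive element


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    """Raise a to the non-negative integer power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def syndrome_cell(received, root):
    """Per-cycle update s <- s * root + r; the first list item arrives first."""
    s = 0
    for r in received:
        s = gf_mul(s, root) ^ r
    return s


def direct_eval(received, root):
    """Reference: evaluate R(x) with the first symbol as the top coefficient."""
    n = len(received)
    acc = 0
    for idx, r in enumerate(received):
        acc ^= gf_mul(r, gf_pow(root, n - 1 - idx))
    return acc


received = [0x12, 0x00, 0xA5, 0x07, 0xFF]   # made-up symbols
syndromes = [syndrome_cell(received, gf_pow(ALPHA, e)) for e in range(4)]
```

Because the cell multiplies before it accumulates, the first symbol to arrive ends up as the highest-order coefficient of the evaluated polynomial, which `direct_eval` makes explicit.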

Fig. 4(b) shows how the 16 syndrome cells (for $t = 8$) are organized in our chip. By controlling the multiplexer in Fig. 4(b), we can generate different syndrome sequences for the calculation of the discrepancy in the key equation solver. Table I shows all 16 different syndrome sequences.

B. Key Equation Solver

The techniques frequently used to solve the key equation include the Berlekamp–Massey algorithm [1], [5], the Euclidean algorithm [2], and the continued-fraction algorithm [3]. Compared to the other two algorithms, the Berlekamp–Massey algorithm is generally considered to be the one with the least hardware complexity [6]. Another advantage of the Berlekamp–Massey algorithm is that it can be formulated to compute only $\sigma(x)$, and the computation of $\Omega(x)$ is similar to that of the discrepancy $\Delta$, thus saving a portion of the hardware used to compute $\Omega(x)$.

Fig. 4. (a) Syndrome cell $S_i$. (b) Syndrome calculator cell structure and its buffer. Note that for simplification, we take $t = 8$.

TABLE I. THE 16 SYNDROME SEQUENCES REQUIRED TO CALCULATE THE DISCREPANCY $\Delta^{(i+1)}$

Existing architectures to implement the Berlekamp–Massey algorithm in hardware were proposed by Berlekamp [7], Liu [8], and Oh and Kim [9]. These proposals require a number of finite-field multipliers (FFMs) that scales with $t$, the number of correctable errors. In addition, they all require an FFI to implement the division operation. An inversionless Berlekamp–Massey algorithm was proposed by Burton [10] for BCH decoders, and was implemented by Reed, Shih, and Truong [6] for BCH and RS codes. However, more FFMs are required in the existing implementation of the inversionless Berlekamp–Massey algorithm [6].

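1) Decomposed Inversionless Berlekamp–Massey Algorithm: An inversionless Berlekamp–Massey algorithm is adopted in our architecture that is a $2t$-step iterative algorithm, as shown in the following:

Initial condition:

$$D^{(-1)} = 0,\quad \delta = 1,\quad \sigma^{(-1)}(x) = \tau^{(-1)}(x) = 1,\quad \Delta^{(0)} = S_1$$

for ($i = 0$ to $2t - 1$)

$$\sigma^{(i)}(x) = \delta\,\sigma^{(i-1)}(x) + \Delta^{(i)}\,x\,\tau^{(i-1)}(x)$$

$$\Delta^{(i+1)} = S_{i+2}\,\sigma_0^{(i)} + S_{i+1}\,\sigma_1^{(i)} + \cdots + S_{i-v_i+2}\,\sigma_{v_i}^{(i)}$$

If ($\Delta^{(i)} = 0$ or $2D^{(i-1)} \ge i + 1$) then

$$D^{(i)} = D^{(i-1)},\quad \tau^{(i)}(x) = x\,\tau^{(i-1)}(x)$$

else

$$D^{(i)} = i + 1 - D^{(i-1)},\quad \delta = \Delta^{(i)},\quad \tau^{(i)}(x) = \sigma^{(i-1)}(x)$$

where $\sigma^{(i)}(x)$ is the $i$th step error locator polynomial and $\sigma_j^{(i)}$'s are the coefficients of $\sigma^{(i)}(x)$; $\Delta^{(i)}$ is the $i$th step discrepancy and $\delta$ is a previous nonzero discrepancy; $\tau^{(i)}(x)$ is an auxiliary polynomial and $D^{(i)}$ is an auxiliary degree variable in the $i$th step. Define

$$\sigma_j^{(i)} = \begin{cases} \delta\,\sigma_0^{(i-1)}, & \text{for } j = 0 \\ \delta\,\sigma_j^{(i-1)} + \Delta^{(i)}\,\tau_{j-1}^{(i-1)}, & \text{for } 1 \le j \le v_i \end{cases} \tag{2}$$

$$\Delta_j^{(i+1)} = \begin{cases} 0, & \text{for } j = 0 \\ \Delta_{j-1}^{(i+1)} + S_{i-j+3}\,\sigma_{j-1}^{(i)}, & \text{for } 1 \le j \le v_i + 1 \end{cases} \tag{3}$$

where $v_i$ is the degree of $\sigma^{(i)}(x)$, $\tau_j^{(i)}$'s are the coefficients of $\tau^{(i)}(x)$, and $\Delta_j^{(i+1)}$'s are the partial results in computing $\Delta^{(i+1)}$. At the first cycle of the $(i+1)$th step, we get

$$\Delta^{(i+1)} = \Delta_{v_i+1}^{(i+1)} = \Delta_{v_i}^{(i+1)} + S_{i-v_i+2}\,\sigma_{v_i}^{(i)} \tag{4}$$

In other words, we can decompose the $i$th iteration into $v_i + 1$ cycles. In each cycle $\sigma_j^{(i)}$ requires at most two FFMs and $\Delta_j^{(i+1)}$ requires only one FFM. The data dependency of the decomposed algorithm can be seen in Table II.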

It is evident from Table II that, at cycle $j$, the computation of $\Delta_j^{(i+1)}$ requires $\Delta_{j-1}^{(i+1)}$ and $\sigma_{j-1}^{(i)}$, which have been computed at cycle $j - 1$. Similarly, at cycle $j$, the computation of $\sigma_j^{(i)}$ requires $\Delta^{(i)}$ and $\sigma_j^{(i-1)}$, which have been computed at cycle 0 and the $(i-1)$th step, respectively. Note that the original Berlekamp–Massey algorithm cannot be scheduled as efficiently because the computation, $\sigma^{(i)}(x) = \sigma^{(i-1)}(x) - \Delta^{(i)}\,\delta^{-1}\,x\,\tau^{(i-1)}(x)$, requires two sequential multiplications and one inversion. The inversionless Berlekamp–Massey algorithm provides the necessary parallelism to allow our efficient scheduling. The scheduling and data dependency of the decomposed algorithm are further illustrated in Fig. 5.

The decomposed algorithm shown above suggests a three-FFM implementation of the inversionless Berlekamp–Massey algorithm, which is shown in Fig. 6. Compared to the previously proposed parallel architectures [6]–[9] our architecture reduces the hardware complexity significantly. Compared to a previously proposed serial architecture [11], our architecture reduces the time complexity significantly because of the reduction of cycle time and the number of clock cycles. Therefore, our proposed architecture achieves an optimization in the area-delay product.
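To make the iteration concrete, here is a minimal software sketch of a textbook inversionless Berlekamp–Massey recursion. It is not the three-FFM schedule of Fig. 6: it performs each step's multiplications in plain loops, and it assumes GF($2^8$) with field polynomial $x^8 + x^4 + x^3 + x^2 + 1$, $\alpha = 2$, and syndromes $S_k = \sum_l e_l X_l^{k}$. The function names and the two-error test pattern are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2          # assumed primitive element


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest-degree coefficient first) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def inversionless_bm(synd):
    """Return a scalar multiple of the error locator; synd[0] is S_1."""
    two_t = len(synd)
    sigma = [1] + [0] * two_t   # locator estimate
    tau = [1] + [0] * two_t     # auxiliary polynomial
    delta = 1                   # previous nonzero discrepancy
    degree = 0                  # auxiliary degree variable
    for i in range(two_t):
        # Discrepancy: coefficient of x^i in S(x) * sigma(x).
        disc = 0
        for j in range(i + 1):
            disc ^= gf_mul(synd[i - j], sigma[j])
        shifted_tau = [0] + tau[:-1]            # x * tau(x)
        new_sigma = [gf_mul(delta, s) ^ gf_mul(disc, u)
                     for s, u in zip(sigma, shifted_tau)]
        if disc == 0 or 2 * degree >= i + 1:
            tau = shifted_tau
        else:
            degree = i + 1 - degree
            delta = disc
            tau = sigma                         # sigma before this update
        sigma = new_sigma
    return sigma[:degree + 1]


errors = {3: 0x1D, 10: 0x07}   # position -> error value (illustrative)
synd = []
for k in range(1, 5):          # S_1..S_4, i.e. 2t = 4
    s = 0
    for pos, val in errors.items():
        s ^= gf_mul(val, gf_pow(gf_pow(ALPHA, pos), k))
    synd.append(s)
locator = inversionless_bm(synd)
```

No field inversion appears anywhere in the loop; the price is that the returned polynomial is only a nonzero scalar multiple of the true locator, which leaves its roots unchanged.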

Fig. 5. Scheduling and data dependency of the decomposed inversionless Berlekamp–Massey algorithm. The dotted line represents the data dependency.

Dual-basis finite-field arithmetic is adopted in the key equation solver for lower gate count [12]. A dual-basis FFM takes one input in standard basis and the other input in dual basis to produce a dual-basis output. In Fig. 6, the dotted lines correspond to the data symbols in dual basis while the solid lines correspond to the data symbols in standard basis, and D2S is a dual-to-standard basis converter.

2) Efficient Computation of $\Omega(x)$: The conventional way to compute the error evaluator polynomial $\Omega(x)$ is to do it in parallel with the computation of $\sigma(x)$. Using the Berlekamp–Massey algorithm, this involves an iterative algorithm to compute $\Omega^{(i)}(x)$. However, if $\sigma(x)$ is first obtained, from the key equation and Newton's identity we could derive $\Omega(x)$ as follows:

$$\Omega(x) = S(x)\,\sigma(x) \bmod x^{2t} = \Omega_0 + \Omega_1 x + \cdots + \Omega_{t-1} x^{t-1} \tag{5}$$

$$\Omega_i = S_{i+1}\,\sigma_0 + S_i\,\sigma_1 + \cdots + S_1\,\sigma_i, \quad i = 0, 1, \ldots, t - 1 \tag{6}$$

That is, the computation of $\Omega(x)$ can be performed directly after $\sigma(x)$ is computed. Note that the direct computation requires fewer multiplications than the iterative algorithm, which computes many unnecessary intermediate results. The penalty of this efficient computation is the additional latency because $\sigma(x)$ and $\Omega(x)$ are computed in sequence.

Furthermore, it can be seen that the computation of $\Omega_i$ is very similar to that of $\Delta^{(i+1)}$ except for some minor differences. Therefore, the same hardware used to compute $\sigma(x)$ can be reconfigured to compute $\Omega(x)$ after $\sigma(x)$ is computed. Like $\Delta_j^{(i+1)}$, $\Omega_i^{(j)}$'s are the partial results in computing $\Omega_i$ and we could derive it as follows:

$$\Omega_i^{(j)} = \begin{cases} S_{i+1}\,\sigma_0, & \text{for } j = 0 \\ \Omega_i^{(j-1)} + S_{i-j+1}\,\sigma_j, & \text{for } 1 \le j \le i \end{cases} \tag{7}$$

At the last cycle of the $i$th iteration in (7), $\Omega_i = \Omega_i^{(i)}$. In Fig. 7, we show how the same three-FFM architecture can be reconfigured to compute $\Omega(x)$.

Fig. 6. Three-FFM architecture for implementing the decomposed inversionless Berlekamp–Massey algorithm. Note that the $\sigma(x)$ buffer stores final coefficients of $\sigma(x)$ for the standard basis.

Fig. 7. Three-FFM architecture reconfigured to compute $\Omega(x)$. Note the labels in this figure are different from those in Fig. 6.
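The direct computation of the evaluator from the key equation can be sketched as a truncated polynomial product. This is an illustrative software model, not the reconfigured hardware of Fig. 7; it assumes GF($2^8$) with field polynomial $x^8 + x^4 + x^3 + x^2 + 1$, syndromes indexed so that `synd[0]` is $S_1$, and a made-up single-error pattern with $S_k = e X^k$.

```python
PRIM_POLY = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def evaluator_poly(synd, sigma):
    """Omega(x) = S(x) * sigma(x) mod x^(2t); lists are lowest degree first."""
    two_t = len(synd)
    omega = [0] * two_t
    for i in range(two_t):
        for j, s_j in enumerate(sigma):
            if j <= i:
                # Term S_{i+1-j} * sigma_j contributes to the x^i coefficient.
                omega[i] ^= gf_mul(synd[i - j], s_j)
    return omega


# Single error of value e at locator X: S_k = e * X^k and sigma(x) = 1 + X x.
X, e = 0x20, 0x53
synd, power = [], 1
for _ in range(4):                 # S_1..S_4
    power = gf_mul(power, X)
    synd.append(gf_mul(e, power))
omega = evaluator_poly(synd, [1, X])
```

For this single-error case every coefficient above $x^0$ cancels, leaving $\Omega(x) = eX$, which illustrates why only the low-order coefficients need to be computed once $\sigma(x)$ is known.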

3) Application Conditions for Errors and Erasures: For decoding errors and erasures, the key equation is modified

to , where is the errata

evaluator polynomial, is the

Forney syndrome polynomial, and

and is the erasure set [1]. Furthermore, we could rewrite the inversionless Berlekamp–Massey algorithm as follows:

Initial condition:

for to

If or

else

where is the number of erasures, is the th step errata locator polynomial with degree , and ’s are the coefficients

of .

Let us now calculate the total number of cycles required to compute and using our decomposed architecture. It is clear that the degree of at most increases by one during each iteration. Therefore, we use to set the upper bound of .

Because both errors and erasures are corrected, we need cycles to compute the initial and we have

for


(9) The number of cycles required to compute is

(10) Hence the total number of cycles is less than

.

Table III shows the maximum number of cycles for different RS codes with $t$ ranging from 4 to 16. If the codeword length $n$ is larger than the number of cycles required, then our area-efficient architecture can be applied to reduce the hardware complexity while maintaining the overall decoding speed.

C. Chien Search

In an RS decoding algorithm, a Chien search is used to check whether the error locator polynomial $\sigma(x)$ equals zero as $x$ steps through the field elements associated with the $n$ symbol positions. If $\sigma(x) = 0$ at one of these elements, there is an error at the corresponding symbol of the received polynomial $R(x)$. Fig. 8(a) shows the circuit of the $i$th Chien search cell. The upper side of the Chien search cell accumulates the result of this and the previous cell, and sends the sum to the next cell.

Fig. 8(b) shows the structure of the Chien search module with eight Chien search cells. An XOR gate is used to check if the final sum is zero.

It is instructive to observe the similarity between the syndrome cell and the Chien search cell in our architecture. The only difference is that the locations of the finite-field adder and the multiplexer are interchanged. In a fully custom layout, such similarity is very helpful to reduce chip area.
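A Chien search can be modeled in software as a bank of cells, each holding one term of the locator and multiplying it by a fixed power of $\alpha$ every cycle, with the sum of all cells tested for zero. The sketch below assumes GF($2^8$) with field polynomial $x^8 + x^4 + x^3 + x^2 + 1$ and $\alpha = 2$, and it steps through $\alpha^0, \alpha^1, \ldots$ in increasing order; the scan direction on the chip may differ. The example locator is made up.

```python
PRIM_POLY = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2          # assumed primitive element


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def chien_search(sigma, n_points=255):
    """Return exponents k with sigma(alpha^k) == 0."""
    terms = list(sigma)            # cell j holds sigma_j * alpha^(j*k)
    steps = [gf_pow(ALPHA, j) for j in range(len(sigma))]
    roots = []
    for k in range(n_points):
        total = 0
        for term in terms:         # sum of all cells
            total ^= term
        if total == 0:
            roots.append(k)
        # Each cell multiplies its register by its own constant alpha^j.
        terms = [gf_mul(term, step) for term, step in zip(terms, steps)]
    return roots


# Locator (1 + X1 x)(1 + X2 x) for error locators X1 = alpha^3, X2 = alpha^10.
X1, X2 = gf_pow(ALPHA, 3), gf_pow(ALPHA, 10)
sigma = [1, X1 ^ X2, gf_mul(X1, X2)]
roots = chien_search(sigma)
```

The roots appear at the inverses of the error locators, here $\alpha^{-10} = \alpha^{245}$ and $\alpha^{-3} = \alpha^{252}$. Each cell uses only a constant multiplier, the same building block as the syndrome cell.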

D. Error Value Evaluator

For calculating the error value, there are two popular methods, namely the transform decoding process in the frequency domain and the Forney algorithm in the time domain. Although the transform decoding process does not need any FFI or Chien search, it requires both variable–variable FFMs and constant–variable FFMs. When the code parameters are large, the Forney algorithm is preferred because of its lower circuit complexity.

Fig. 8. (a) Chien search cell $C_i$. (b) Chien search structure for $t = 8$.

Fig. 9. Error value evaluator structure for $t = 8$.

In the Forney algorithm, the error value becomes

(11) where indicates the root of , for . Be-cause of the fact that any element will be zero while multiplying an even constant value, and will be its original value while multiplying an odd constant, the first derivative of can be represented as

(12) Note that is the largest odd number less than or equal to

, and .

So we rewrite the Forney’s algorithm as

(13)

In Fig. 9, we calculate error values in parallel with the computation of the Chien search. Note that cells C1–C8 in Fig. 9 are all the same as the Chien search cells in Fig. 8(a). The only difference is that the loaded coefficients are those of $\Omega(x)$ instead of $\sigma(x)$. While the Chien search is in its $i$th iteration, Fig. 8(b) produces the value of $\sigma(x)$ and Fig. 9 produces the value of $\Omega(x)$ at the same field element. In other words, when $\sigma(x)$ evaluates to zero, the active signal goes high, and the output of the FFM in Fig. 9 is the error value of the corresponding received symbol.

Fig. 10. RS-PC decoder chip architecture.
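The remark that even multiples vanish and odd multiples are unchanged is the characteristic-2 rule for formal derivatives: the derivative of a polynomial over GF($2^m$) keeps only its odd-indexed coefficients. A minimal sketch, with a hypothetical function name and made-up coefficients:

```python
def formal_derivative(coeffs):
    """Derivative over GF(2^m): j * c_j equals c_j for odd j and 0 for even j.

    coeffs lists the polynomial lowest degree first; so does the result.
    """
    return [c if j % 2 == 1 else 0 for j, c in enumerate(coeffs)][1:]


sigma = [0x07, 0x05, 0x09, 0x03]        # sigma_0..sigma_3 (illustrative)
d_sigma = formal_derivative(sigma)       # sigma_1 + 0*x + sigma_3*x^2
odd_part = [0] + d_sigma                 # x * sigma'(x) = sigma_1 x + sigma_3 x^3
```

Since $x\,\sigma'(x)$ is just the odd part of $\sigma(x)$, it can be taken from the odd-indexed Chien search cells without extra multipliers.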

III. FRAME-BUFFER CONTROLLER

The DVD RS-PC decoder can be implemented by a pair of dedicated row ($t = 5$) and column ($t = 8$) RS decoders, or by a programmable RS decoder. Through some added control logic, a programmable RS decoder with $t = 8$ can support both row and column RS decoding. The main drawback of using one RS decoder is that the throughput rate is reduced. In our work, we used two dedicated row and column RS decoders to maximize the throughput rate. Taking advantage of the area-efficient architecture mentioned in the previous section, our two-RS-decoder architecture is both feasible in complexity and fast in speed.

In RS-PC decoding, each symbol is subject to a row RS decoding and a column RS decoding. There are two possible frame-buffer architectures: single and dual. In the single-frame-buffer architecture, the $i$th incoming row of frame $k + 1$ is stored at the location of the $i$th outgoing column of frame $k$. In other words, each adjacent frame is stored in a transposed fashion. The frame-buffer size in the single-frame-buffer architecture for DVD RS-PC is $208 \times 208$ bytes, i.e., $[\max(\text{row}, \text{column})]^2$. The drawback of the single-frame-buffer architecture is that the RS-PC decoder output sequence is different from the input sequence. The former is column-wise while the latter is row-wise. This effect is similar to passing the input data through an interleaver. To deinterleave the data, a frame buffer is also required. Therefore, unless the downstream processing (e.g., MPEG decoding) can be done using the interleaved data directly, the single-frame-buffer RS-PC decoder architecture is not preferred because it simply transfers the storage requirement to downstream processing.

In our design, we use a dual-frame-buffer architecture. Each frame buffer is controlled by a frame-buffer controller. The RS-PC decoder architecture is illustrated in Fig. 10 which contains two frame-buffer controllers that interface with two off-chip frame buffers, a (182, 172) row RS decoder and a (208, 192) column RS decoder. At any time, one (primary) frame buffer is serving the incoming data, the outgoing data, and the (182, 172) row RS decoder, and the other (secondary) frame buffer is serving the (208, 192) column RS decoder. The error locations and error values computed by the RS decoders are sent to the frame-buffer controllers to update the frame-buffer content accordingly. This parallel architecture minimizes the amount of frame-buffer access and timing constraint on the RS decoders. The architecture also allows the frame buffers to be incorporated as on-chip embedded SRAMs or DRAMs, which are not yet realized in the current chip.

Since we only need to correct the user data part of the frame, for each input row, the last ten parity checking bytes are used only by the row RS decoder and not stored in the primary frame buffer. The size of the frame buffer is therefore $208 \times 172$ bytes. The remaining memory bandwidth is used by the row RS decoder for error correction. Likewise, in the column RS decoding, only 172 columns are processed.

The frame-buffer controller consists of an address plane and a data plane, as shown in Fig. 11. The address plane consists of a row address generator and a column address generator, each selecting one out of three possible addresses: the counter, a modified counter value, and the error location address. The data plane provides a number of different data routes: input to buffer, buffer to output, buffer to column RS decoder input, and error correction. The detailed symbol and memory interface timing for the row and column decoders is illustrated in Figs. 11 and 12, respectively. During each DVD symbol time, each frame buffer undergoes one read and one write operation, both at the same address.

For the row decoder and the primary frame buffer, as shown in Fig. 12, in the first 172 symbol times, the (decoded) output of an earlier frame is read out immediately before the incoming symbol is stored. The DVD timing specification demands two sync symbols every 91 data symbols per row. After the 172 data symbols, the memory bandwidth is used to perform error correction of the second previous row. Since $t = 5$ for the row RS decoder, it takes up to five symbol times to finish the error correction. For error correction, the frame-buffer content is read out, XOR-ed with the row or column error value, and then written back in the same DVD symbol time.

Fig. 11. Frame-buffer controller diagram.

Fig. 12. Frame-buffer controller control signals.

For the column RS decoder and the secondary frame buffer, as shown in Fig. 12, in the first 208 symbol times, the corresponding data symbols are read out in the memory read cycle. The memory write cycle in this period is idle. After the first 208 symbol times, the memory bandwidth is used to perform error correction of the second previous column. Since $t = 8$ for the column RS decoder, it takes up to eight symbol times to finish the error correction. The total time to process one column is therefore 216 symbol times. The total time to finish the column RS decoding is $216 \times 172 = 37\,152$ symbol times.
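The buffer size and column timing follow from a few lines of arithmetic on the figures quoted in this section. The sketch below recomputes them; the variable names are illustrative.

```python
ROWS = 208                     # rows per frame, from the (208, 192) column code
DATA_COLS = 172                # columns kept after dropping ten row-parity bytes
T_COL = (208 - 192) // 2       # correctable errors per column

# One byte per stored symbol.
frame_buffer_bytes = ROWS * DATA_COLS

# Read 208 symbols, then spend up to T_COL symbol times on corrections.
column_time = ROWS + T_COL

# Only the 172 data columns are decoded.
column_decode_time = column_time * DATA_COLS

print(frame_buffer_bytes, column_time, column_decode_time)
```

This gives a 35 776-byte frame buffer, 216 symbol times per column, and 37 152 symbol times for a full pass over the columns.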

The two frame-buffer controllers change their roles under the control of a number of externally or internally generated control signals, illustrated in Fig. 12. The select signal selects the primary frame buffer, and is derived from the sync signal defined in DVD. Due to the pipeline latency, the secondary frame buffer does not start the column RS decoding until two row delays, indicated by the mode signal. The correction signal indicates the time period within which the error correction is performed.

IV. CHIP IMPLEMENTATION AND TESTING

We implemented the RS-PC decoder chip in Verilog, with all modules described at the gate level. The total Verilog code takes about 3000 lines. In our design, the delay between two registers is restricted to at most one FFM and one finite-field adder for speed consideration. For complexity consideration, we choose constant–variable FFMs to implement the syndrome calculator, the Chien search, and the error value evaluator. Note that the constant–variable FFM needs only XOR gates while the variable–variable FFM requires 73 XOR gates and 64 AND gates [12].

Fig. 13. Chip die photo.

The chip was designed using the Compass cell library in a 0.6-μm single-poly double-metal (SPDM) CMOS process. The chip size is 4.22 × 3.64 mm² with a core size of 2.90 × 2.88 mm². The chip die photo is shown in Fig. 13. The total gate count is about 26K, including 14K for the (208, 192) column RS decoder and 9K for the (182, 172) row RS decoder. The 99-pin chip is packaged in a 100 LD CQFP package, where 48 pins are for the frame-buffer interface and can be eliminated with embedded frame buffers. In the test mode, the column RS decoder can operate on the input data directly and bypass the frame buffer (a connection not shown in Fig. 10). While operating at 3 V, the row and column RS decoders have been tested to work successfully at 33 MHz. The RS-PC decoder, however, is currently limited in speed by the off-chip frame buffer to about 18 MHz. The power dissipation of the chip is 102 mW at 33 MHz.

As the DVD symbol rate is less than 4 Mbytes/s, our RS-PC decoder can support a speed of 4× DVD with off-chip frame buffers or 8× DVD with embedded frame buffers. The improved speed performance is attributed to the parallel RS decoder architecture, which is made feasible by the proposed area-efficient key equation solver.
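The speed multiples can be related to the measured clock rates with one division. The sketch below assumes that the decoder consumes one symbol per clock cycle and takes the 4 Mbytes/s upper bound as the 1× symbol rate; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
DVD_1X_SYMBOL_RATE = 4_000_000  # bytes/s, upper bound quoted in the text


def dvd_speed_factor(clock_hz: float) -> int:
    """Whole multiples of 1x DVD sustained at one symbol per clock (assumed)."""
    return int(clock_hz // DVD_1X_SYMBOL_RATE)


off_chip = dvd_speed_factor(18e6)   # limited by the off-chip frame buffer
embedded = dvd_speed_factor(33e6)   # RS decoders alone, tested at 33 MHz
print(off_chip, embedded)
```

Under these assumptions the 18-MHz and 33-MHz operating points correspond to multiples of 4 and 8, respectively.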

V. CONCLUSION

In this paper, the design and implementation of an area-efficient RS-PC decoder chip for DVD applications is presented. Based on a modified decomposed inversionless Berlekamp–Massey algorithm, a more efficient hardware structure for the key equation solver can be achieved. Moreover, the derived structure can be applied to other functional blocks, leading to a very regular structure and a favorable area-delay product. As a result, an area-efficient solution for the RS-PC decoder chip can be obtained.

The proposed chip solution contains two frame-buffer controllers, a row RS decoder, and a column RS decoder. The chip was implemented in 0.6-μm CMOS SPDM standard cells, and measurement results show that DVD speed can easily be achieved.

REFERENCES

[1] E. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.

[2] Y. Sugiyama, M. Kasahara, S. Hirasawa, and T. Namekawa, “A method for solving key equation for decoding Goppa codes,” Information and Control, vol. 27, pp. 87–99, Jan. 1975.

[3] L. Welch and R. Scholtz, “Continued fractions and Berlekamp’s algorithm,” IEEE Trans. Inform. Theory, vol. IT-25, pp. 19–27, Jan. 1979.

[4] T. Truong, W. Eastman, I. Reed, and I. Hsu, “Simplified procedure for correcting both errors and erasures of Reed–Solomon code using Euclidean algorithm,” Proc. IEE, pt. E, vol. 135, no. 6, pp. 318–324, 1988.

[5] J. Massey, “Shift-register synthesis and BCH decoding,” IEEE Trans. Inform. Theory, vol. IT-15, pp. 122–127, Jan. 1969.

[6] I. Reed, M. Shih, and T. Truong, “VLSI design of inverse-free Berlekamp–Massey algorithm,” Proc. IEE, pt. E, vol. 138, pp. 295–298, Sept. 1991.

[7] E. Berlekamp, “Galois field computer,” U.S. Patent 4 162 480, July 24, 1979.

[8] K. Liu, “Architecture for VLSI design of Reed–Solomon decoders,” IEEE Trans. Computers, vol. C-33, pp. 178–189, Feb. 1984.

[9] Y. Oh and D. Kim, “Method and apparatus for computing error locator polynomial for use in a Reed–Solomon decoder,” U.S. Patent 5 583 499, Dec. 10, 1996.

[10] H. Burton, “Inversionless decoding of binary BCH codes,” IEEE Trans. Inform. Theory, vol. IT-17, pp. 464–466, July 1971.

[11] R. Blahut, Theory and Practice of Error Control Codes. Boston: Addison-Wesley, 1983.

[12] S. T. J. Fenn, M. Benaissa, and D. Taylor, “GF($2^m$) multiplication and division over the dual basis,” IEEE Trans. Comput., vol. 45, pp. 319–327, Mar. 1996.

Hsie-Chia Chang was born in Keelung City, Taiwan, in 1973. He received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1991 and 1995, respectively. He is currently working toward the Ph.D. degree in electronics engineering at the same university.

His research interests include architectures and algorithms for communications and signal processing, and integrated circuit design for the Reed–Solomon decoder.