Vector instruction for FEC applications

CHAPTER 3 PROPOSED AS-DSP ARCHITECTURE

3.4 Vector Instructions

3.4.2 Vector instruction for FEC applications

This section introduces the vector instructions for Reed-Solomon (RS) and Viterbi decoding.

3.4.2.1. Viterbi Decoding

The Viterbi algorithm [12] consists of three parts: branch metric calculation (BM), add-compare-select (ACS), and survival memory (SM). In BM, the DSP calculates the branch metric for each state transition in the trellis. Since the branch metric takes only four different values for a rate-1/2 convolutional code, a branch metric look-up table (BM-LUT) approach is proposed to avoid unnecessary calculations.

Figure 3.11: Encoder of the 64-state convolutional code.

The branch metric is determined by the encoder output of the convolutional code, and the convolutional encoder can be regarded as a finite state machine, as shown in Figure 3.11. The BM-LUT can therefore derive the branch metric from the state transition and the generator polynomials. The architecture of the BM-LUT is illustrated in Figure 3.12. The constraint length and the generator polynomials decide the inputs of the modulo-two adders. The outputs of the modulo-two adders control the multiplexer (MUX), which selects the branch metric from pm00~pm11, the four possible branch metric values.

Figure 3.12: Architecture of BM-LUT.
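As a rough illustration of the BM-LUT idea, the following Python sketch derives the two encoder output bits from a state transition and the generator polynomials, then uses them to select one of four precomputed branch metrics. The bit ordering of the encoder register, the function names, and the sample metric values are assumptions made for illustration; they are not taken from the AS-DSP hardware.

```python
def parity(x):
    """Modulo-two sum of all bits of x (models a modulo-two adder tree)."""
    return bin(x).count("1") & 1


def bm_lut(state, bit, g1, g2, k, pm):
    """Return the branch metric for the transition (state, input bit).

    pm holds the four candidate metrics [pm00, pm01, pm10, pm11].
    Assumption: the newest input bit sits in the MSB of the k-bit register.
    """
    reg = (bit << (k - 1)) | state   # encoder register contents
    c1 = parity(reg & g1)            # output of the first modulo-two adder
    c2 = parity(reg & g2)            # output of the second modulo-two adder
    return pm[(c1 << 1) | c2]        # MUX select between pm00..pm11


# Hypothetical example values: hard-decision Hamming metrics for a
# received pair "00", with the 802.11a generators 133 and 171 (octal).
G1, G2, K = 0o133, 0o171, 7
pm = [0, 1, 1, 2]
metric = bm_lut(0, 1, G1, G2, K, pm)
```

The four metrics are computed once per received symbol pair, so each state transition costs only a table selection instead of a fresh distance calculation.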

The vector ACS operation performs two ACS operations according to the BM-LUT and saves the decisions to external memory. For instance, the instruction “viacs e2 e0 e1 r7” represents one vector ACS operation in Figure 3.13. Bank0 and Bank1 store the path metrics, indexed by the current states. The trellis controller generates the addresses of Bank0~Bank3, which correspond to the current states and next states in Figure 3.13.

Figure 3.13: ACS vector operations diagram.

Figure 3.14 shows the architecture of the vector ACS instruction. The trellis controller generates the read and write addresses, which are the current and next states. The BM-LUT calculates the branch metric for each ACS; the ACS unit then computes the new path metrics and stores them to Bank2 and Bank3 in this case. The decisions made during the ACS operation are buffered and stored into the external memory. The GPR r29, which is the stack pointer of the survival memory, is updated automatically when the decisions are stored.

Figure 3.14: Architecture of vector ACS instruction.
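The add-compare-select step can be sketched in Python as follows. This is a sequential software model under assumed conventions (minimum path metric wins, the newest input bit enters at the MSB of the state, and the decision bit records which predecessor was chosen); the hardware described above performs two such ACS operations per cycle and writes the decisions to external memory. The function and parameter names are hypothetical.

```python
def acs_step(pm_old, branch_metric, k):
    """One trellis step of add-compare-select for a constraint length k code.

    pm_old        : list of path metrics, indexed by current state
    branch_metric : callable (prev_state, input_bit) -> metric
    Returns (pm_new, decisions), both indexed by next state.
    """
    n_states = 1 << (k - 1)
    mask = n_states - 1
    pm_new = [0] * n_states
    decisions = [0] * n_states
    for nxt in range(n_states):
        candidates = []
        for lsb in (0, 1):
            reg = (nxt << 1) | lsb     # register contents before the shift
            prev = reg & mask          # predecessor state
            bit = reg >> (k - 1)       # input bit that causes the transition
            candidates.append(pm_old[prev] + branch_metric(prev, bit))  # add
        decisions[nxt] = 0 if candidates[0] <= candidates[1] else 1     # compare
        pm_new[nxt] = candidates[decisions[nxt]]                        # select
    return pm_new, decisions


# Toy example: 4-state code (k = 3) with a constant branch metric of 1.
new_pm, dec = acs_step([5, 0, 5, 5], lambda prev, bit: 1, 3)
```

Reading the old metrics from one pair of banks and writing the new ones to another pair, as in Figure 3.13, corresponds here to keeping pm_old and pm_new as separate lists.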

After the last ACS operation, the most probable state is determined by the minimum or maximum path metric. The trace back operation is used for SM, and the survival path is found through external memory accesses.

3.4.2.2. RS Decoding

The syndrome calculator and the Chien search constitute over 50% of the computations of RS decoding, and both are similar operations [13]. The syndrome calculator generates a set of syndromes S1~S2t from the received polynomial R(x). The syndrome calculation is expressed as follows:

$S_i = R(\alpha^i), \quad i = 1 \sim 2t$

The received polynomial can be written as (3.5) by Horner’s rule.

The result should be normalized by MMB as in Figure 3.8 (a). Function (3.6) can be represented as (3.7) if $\alpha^{-j}$ is replaced by $\alpha^{-j+m}$ to avoid normalizing $K^*$. Without normalization, the multiplications in the syndrome calculation can be executed by a single MM block, MMA or MMB. This replacement requires no extra computations. Each MM block can calculate one syndrome by itself, and therefore the syndrome calculation is twice as fast as in conventional designs.

Figure 3.15: Datapath of FMUL when calculating syndrome.

Figure 3.8 (b) is the datapath for finite field multiplication, and Figure 3.15 is the datapath for syndrome calculation. The registers r45[15:8], r45[7:0], r46[15:8] and r46[7:0] hold the constants $\alpha^{-j_0}$, $\alpha^{-j_1}$, $\alpha^{-j_2}$ and $\alpha^{-j_3}$, respectively. For example, the instruction “rssyn r0 e0 r7;” represents a syndrome calculation. The received polynomial has to be stored in Bank0 and r7 must equal N. The final results S0 and S1 are written back to register r0 according to the instruction. Furthermore, the results S2 and S3 are written back to register r28, as shown in Table 3.3.

At the first cycle, the multiplexers in Figure 3.15 select the input from iRAM by setting the control signal to zero. Then the function block MM calculates the product $R_{N-1}\alpha^{-j_i}$, $i \in 0 \sim 3$ ($R_{N-1}$ is the first output of iRAM). At the second cycle, the value of the modulo-2 adders can be represented as $S_i = R_{N-1}\alpha^{-j_i} + R_{N-2}$, $i \in 0 \sim 3$. The multiplexers select the values S0~S3 after the first cycle. The syndromes S0~S3 are obtained after N cycles.
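The cycle-by-cycle behaviour described above is Horner's rule: each cycle multiplies the running value by a constant and adds the next received symbol. The Python sketch below models this for one syndrome. It uses an ordinary shift-and-add GF(2^8) multiplier rather than the Montgomery multiplier of the AS-DSP, and the field polynomial 0x11D and the function names are assumptions chosen only to make the example runnable.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply a and b in GF(2^8), reducing by poly (assumed field polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^m) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly          # reduce when the degree reaches 8
        b >>= 1
    return result


def syndrome(received, c):
    """Evaluate R(x) at x = c by Horner's rule.

    received[0] is R_{N-1}, the first symbol read from memory;
    one loop iteration corresponds to one cycle of the datapath.
    """
    s = 0
    for symbol in received:
        s = gf_mul(s, c) ^ symbol   # multiply by the constant, add next symbol
    return s


# Hypothetical 3-symbol example: R(x) = 0x12*x^2 + 0x34*x + 0x56 at x = 3.
s = syndrome([0x12, 0x34, 0x56], 3)
```

Running four such evaluations side by side, each with its own constant, gives the four results S0~S3 after N iterations, which matches the N-cycle count stated above.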

Two 8-bit FFMs are implemented in this design. Because of the designed datapaths, more FFMs can easily be added according to the application, and the calculation speed rises linearly.

Chapter 4 Chip Implementation & FEC Applications

This chapter discusses the details of implementing RS decoding and Viterbi decoding on the AS-DSP. Before the discussion, the design flow is introduced. The design flow chart is illustrated in Figure 4.1.

Figure 4.1: Design flow chart.

First, we simulate the FEC algorithm in software (C or Matlab) and record the results of each decoding step. After the assembly code is finished, the translator translates it to machine code. To ensure the accuracy of the assembly code and machine code, the simulator compares the machine code with the assembly code. In addition, the simulator can calculate the values of every register, the i-RAM and the external memory for each instruction and display them if the user wishes. After the results of the software and the simulator have been compared, the machine code is tested in the pre-layout simulation of the AS-DSP. If the comparison is not correct, we need to check the RTL code until the results agree.

4.1 Viterbi decoding using AS-DSP

4.1.1 Some details in Viterbi decoding

In Section 2.1.2, we introduced the Viterbi decoder and explained why we selected the TM approach. Here we discuss other issues in Viterbi decoding.

In real applications, the sequence of received symbols may be quite long. It is impractical to store all the decision bits if trace back starts only after all symbols have been received. Thus, a suitable TB length (also called the truncation length) should be defined that causes no serious performance degradation. The rule of thumb is that the truncation length is about five times the constraint length.

In the Viterbi algorithm, the path metric accumulates at each time index and therefore keeps increasing as time goes by. The path metric must be limited to a range so that it can be expressed with a finite number of bits. There are several approaches, such as reset, rescaling subtraction, shift, and modulo normalization. The modulo normalization approach (also called the two’s complement arithmetic approach) is more efficient than the others. As shown in Figure 4.2, the maximum difference between the time indices t=k and t=2k-L is B × L, where B and L are the maximum value of the branch metrics and the truncation length, respectively.

Figure 4.2: The survival path of convolutional codes.

The concept of modulo normalization is not to avoid overflow but to accommodate it.

Figure 4.3 demonstrates the idea of modulo normalization. M1 and M2 are both positive numbers and $|M_1 - M_2| < 2^{C-1}$, where C is the number of bits representing the path metric.

$m_1 = M_1 \bmod 2^C, \quad m_2 = M_2 \bmod 2^C$ (4.1)

Thus, m1 and m2 can be represented on a half cycle without confusing their difference. The penalty of modulo normalization is one additional bit [14].

Figure 4.3: The idea of modulo normalization.

The AS-DSP has a 16-bit data type, which means it can tolerate a maximum difference of about 2¹⁵-1. This large path metric range can accommodate every specification of convolutional code without normalization errors.
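The comparison of two wrapped path metrics can be sketched in Python as below. The sketch assumes C = 16 bits, as for the AS-DSP data type, and relies on the condition that the true metrics differ by less than 2^(C-1); the helper names are hypothetical.

```python
C = 16                      # number of bits representing a path metric
MASK = (1 << C) - 1


def wrap(metric):
    """Store a metric modulo 2^C, letting it overflow freely."""
    return metric & MASK


def less_than(m1, m2):
    """True if M1 < M2, judged only from the wrapped values m1 and m2.

    Valid while |M1 - M2| < 2^(C-1): the sign bit of the C-bit
    two's complement difference then tells which metric is smaller.
    """
    diff = (m1 - m2) & MASK
    return diff >= (1 << (C - 1))


# The true metrics straddle the 16-bit overflow point (65536).
m1 = wrap(65530)
m2 = wrap(65540)
smaller = less_than(m1, m2)
```

Although m2 wraps around to a small stored value, the subtraction still ranks the two metrics correctly, so no rescaling step is needed.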

As shown in Figure 4.2, the truncation length is L. This means that the decoded data has acceptable accuracy if we trace back at least L steps before decoding the data. The AS-DSP has two decoding strategies for different decoding speeds. Strategy one decodes one bit after tracing back L steps, and strategy two decodes k bits. Take Figure 4.2 as an example. Since the accuracy is ensured before time index k, we can trace back from time index t = 2k to t = 0. The data from 0~k can then be decoded, and the decoding speed is k times as fast as strategy one.
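A minimal Python sketch of the trace back is given below. It assumes that each stored decision bit identifies the bit shifted out of the encoder register (and hence the predecessor state), and that the decoded bit is the MSB of the state; these conventions, the data layout, and the function name are illustrative assumptions rather than the AS-DSP memory format. Setting n_out to 1 corresponds to strategy one, and a larger n_out to strategy two.

```python
def trace_back(decisions, start_state, k, n_out):
    """Walk the survival path backwards and return the oldest n_out bits.

    decisions[t][s] : decision bit stored at time index t for state s
    start_state     : state with the best path metric at the last time index
    k               : constraint length (states have k-1 bits)
    """
    mask = (1 << (k - 1)) - 1
    state = start_state
    bits = []
    for dec in reversed(decisions):
        bits.append(state >> (k - 2))               # input bit that led here
        state = ((state << 1) | dec[state]) & mask  # step to the predecessor
    bits.reverse()                                  # oldest bit first
    return bits[:n_out]


# Toy 4-state example (k = 3): decisions recorded while the encoder
# followed the state sequence 0 -> 2 -> 1 -> 2 -> 3 for input 1, 0, 1, 1.
decisions = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
decoded = trace_back(decisions, 3, 3, 2)
```

Only the oldest bits are returned because the most recent part of the path has not yet been traced far enough to be reliable.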

4.1.2 Decoding procedure and data rate for Viterbi decoding

This section describes the points to note for Viterbi decoding using the AS-DSP and takes the convolutional code of 802.11a as an example. The decoding procedure consists of the following steps:

1. Initializing the AS-DSP.

The FUs and the state of the processor should be initialized by setting the SPR r47. The details of r47 are listed in Table 3.2. For this example, we enable the cache and set the access cycle to 3 (based on the specification of the asynchronous RAM [15]). Table 4.1 lists the fields that need to be initialized.

Table 4.1 The list of initialization.

SPR          Value  Function
r47[0]       0      Use min PM to trace back
r47[2]       1      Decode N bits per vitb (N = trace back length)
r47[12]      1      Enable the I-Cache
r47[15:13]   2      3 cycles for accessing external memory

2. Setting the coefficients of Viterbi decoding

In this step, we set up the coefficients of Viterbi decoding. First, we set the trace back control and the trace back method to “1”. The trace back control finds the maximum likelihood path by selecting the minimum or maximum path metric, depending on the application. The trace back method was discussed earlier; it decodes k bits of data when set to 1. Second, we set up the generator polynomials g1(x) and g2(x), the constraint length and the trace back length. The PMs stored in the i-RAM have to be initialized, too.

Table 4.2 lists the coefficients and explains their functions.

Table 4.2 The coefficients of Viterbi decoding.

GPR   Value         Function
r29   2000 (hex)    Pointer of survival memory
r33   133 (octal)   Generator polynomial g1(x)
r34   171 (octal)   Generator polynomial g2(x)
r35   7             Constraint length
r44   40            Trace back length (truncation length)

3. Execute ACS operations.

After initializing the AS-DSP and setting the coefficients, we start the Viterbi decoding. The first step of the ACS operation is to calculate the branch metrics (pm00~pm11). After that, the instruction viacs automatically updates the PMs, the survival memory and the pointer of the survival memory.

4. Trace back operation.

The instruction vitb traces back from the minimum PM and then decodes the information data. Since we set the trace back method to “1” and the trace back length is 40, we get 40 information bits after 80 (40 × 2) memory accesses. Thus, it takes 2 × L cycles to generate one information bit in the vitb operation, where L here is the number of cycles per external memory access.

An N-state convolutional code needs N ACS operations per time index. Since the instruction viacs performs two ACS operations per cycle, it takes N/2+8+L+2L cycles to decode one information bit. The 8+L term is the number of cycles for updating the branch metrics and 2L is the average number of cycles for tracing back. Table 4.3 lists the average operation cycles for decoding one information bit and the corresponding data rate at 133MHz.

Table 4.3 Operation cycles and data rate at different state numbers of convolutional code (L=3).

State number                  4     8     16    32    64    128   256   512
Operation cycles              19    21    25    33    49    81    145   273
Data rate at 133MHz (Mb/s)    7.00  6.33  5.32  4.03  2.71  1.64  0.92  0.48
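The entries of Table 4.3 follow from the cycle formula N/2+8+L+2L and the 133MHz clock. The short Python sketch below reproduces them; the function names are hypothetical, and the data rate is taken as the clock frequency divided by the cycles per information bit.

```python
def cycles_per_bit(n_states, L=3):
    """Average cycles per decoded bit: N/2 ACS cycles, 8+L cycles for
    updating the branch metrics, and 2L cycles for tracing back."""
    return n_states // 2 + (8 + L) + 2 * L


def data_rate_mbps(n_states, f_mhz=133.0, L=3):
    """Data rate in Mb/s at clock frequency f_mhz (in MHz)."""
    return f_mhz / cycles_per_bit(n_states, L)


states = [4, 8, 16, 32, 64, 128, 256, 512]
cycles = [cycles_per_bit(n) for n in states]
rates = [data_rate_mbps(n) for n in states]
```

For large state numbers the N/2 term dominates, which is why the data rate roughly halves each time the number of states doubles.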

4.2 RS decoding using AS-DSP

The decoding procedure is similar to that of Viterbi decoding. We take the (255, 239) RS code as an example here. The decoding steps are as follows:

1. Initializing the AS-DSP.

Table 4.4 lists the fields that need to be initialized.

Table 4.4 The list of initialization.

SPR          Value  Function
r47[1]       0      Use 8-bit data type
r47[4]       1      Use two stages of FMUL
r47[5]       0      Use two FMULs
r47[12]      1      Enable the I-Cache
r47[15:13]   2      3 cycles for accessing external memory

2. Setting the coefficients of RS decoding

Table 4.5 lists the coefficients and explains their functions.

Table 4.5 The coefficients of RS decoding.

GPR   Value      Function
r40   8E (hex)   k(x) of Montgomery multiplication
r41   4C (hex)   P(x) of Montgomery multiplication

3. RS decoding

The decoding flow is introduced in chapter 2. The decoder is implemented according to the decoding flow.

Table 4.6 lists the operation cycles at different error numbers and the corresponding data rates of the (255,239) RS code. The maximum number of correctable errors is $t = \frac{255 - 239}{2} = 8$. The codeword will not be corrected if the error number is larger than 8. Figure 4.4 shows the data rates of Table 4.6 at different SNRs.

Table 4.6 Operation cycles and data rate at different error numbers for (255, 239) RS (L=3)

Error number(s)   Operation cycles   Data rate at 133MHz (Mb/s)

0 2265 112.27

1 12250 20.76

2 12945 19.64

3 13705 18.55

4 14526 17.51

5 15409 16.50

6 16354 15.55

7 17361 14.65

8 18430 13.80

>8 13611 18.68
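The data rates in Table 4.6 appear to be consistent with counting the 239 information bytes (1912 bits) of each codeword against the cycle count at 133MHz. The Python sketch below checks this reading; the formula is an inference from the table values rather than something stated in the text, and the function names are hypothetical.

```python
def rs_correctable_errors(n=255, k=239):
    """Maximum number of correctable symbol errors, t = (n - k) / 2."""
    return (n - k) // 2


def rs_data_rate_mbps(cycles, k=239, symbol_bits=8, f_mhz=133.0):
    """Information bits per codeword divided by the decoding time, in Mb/s."""
    return k * symbol_bits * f_mhz / cycles


t = rs_correctable_errors()
rate_no_error = rs_data_rate_mbps(2265)     # error number = 0
rate_worst = rs_data_rate_mbps(18430)       # error number = 8
```

The error-free case is much faster because decoding can stop once all syndromes are found to be zero, whereas any nonzero syndrome triggers the key equation, Chien search and error value steps.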

Figure 4.4: Data rate of (255,239) RS on BPSK channel (x-axis: SNR (dB); y-axis: data rate (Mb/s)).

Table 4.7 lists the cycles of each step when the error number is 8. The syndrome calculation and the Chien search are accelerated by the instruction rssyn.

Table 4.7 Operation cycles for each step when error number = 8.

Syndrome calculation 2240

Key equation 8745

Chien search 2161

Error value 4906

Correction 378

If we use the 16-bit data type for RS decoding, the decoder can decode two codewords simultaneously. Thus, the data rate is almost twice that of the 8-bit data type. The data rate of the (255, 239) RS code is 27 Mb/s when the error number is 8.

4.3 Chip specification

The processor is implemented with a 0.18µm CMOS standard cell library in a 0.18µm 1P6M process. The chip size is 7.73mm², while the core occupies 3.5mm². The processor has 18k bits of embedded SRAM and the total gate count is 139.4k. After static timing analysis (STA) and post-layout simulation, the processor works successfully at 133MHz under 1.62V and the worst speed condition. At a 1.98V supply and 133MHz, the power dissipation is about 141mW, and the worst IR drop is 0.14V. Table 4.8 summarizes the chip features.

Table 4.8 Summary of the chip.

Proposed AS-DSP

Technology 0.18µm 1P6M

Package CQFP144

Supply Voltage 1.8V

Work Frequency 133MHz (1.62V,125°C, worst process)

Chip Size 2.78x2.78mm²

Core Size 1.87x1.87mm²

Gate Count 139.4k

Embedded SRAM 18k bits

Power Dissipation 141mW (1.98V)

Figure 4.5: Microphotograph of the chip.

4.4 Comparison with other similar work

Table 4.9 compares the performance with TI’s TMS320C64X and TMS320C54X DSP families. Compared with the TMS320C64X family, the data rate improves by about 15 times when decoding the 512-state convolutional code.

Table 4.9 Viterbi performance compared with the TMS320C family.

                                            TMS320C64X        TMS320C54X         16-bit ASDSP
Technology                                  0.13um            N.A.               0.18um
Clock rate (MHz)                            500~700           100~160            133
M support                                   5~9               N.A.               2~9
32 states convolutional code (code size)    N.A.              444 Bytes          110 Bytes
32 states convolutional code (data rate)    N.A.              3.1Mb/s (160MHz)   4.03Mb/s (133MHz)
512 states convolutional code (data rate)   32Kb/s (500MHz)   N.A.               480Kb/s (133MHz)

Table 4.10 shows the operation cycles for the syndrome calculation. The TMS320C64X has eight finite field multipliers and takes 470 cycles to finish one syndrome calculation. The execution cycle number using one FFM is 3760. The proposed processor has two multipliers and needs 816 cycles to complete this work.

Table 4.10 Performance of syndrome calculation compared with the TMS320C64X.

                                          TMS320C64X     16-bit ASDSP
(204,188)RS syndrome execution cycles     470 (8 FFMs)   816 (2 FFMs)
(204,188)RS syndrome code size (Bytes)    1100           48

Chapter 5 Conclusion and Future Work

5.1 Conclusion

A 16-bit AS-DSP supporting various FEC applications has been designed and implemented. The architecture uses vector operations with an optimized internal memory organization to increase the memory bandwidth efficiency. The datapaths also simplify the data flow control and improve both the system throughput and the program code size.

Implemented in a 0.18µm 1P6M CMOS process, the chip provides a data rate of at least 7Mb/s for 4-state convolutional code decoding and 13.8Mb/s for (255,239) Reed-Solomon decoding.

5.2 Future Work

As shown in Table 1.1, the data rates of DVB-T and 802.11a are 28Mb/s and 54Mb/s, respectively. The data rates of the corresponding decoding processes on the AS-DSP are 25Mb/s and 2.71Mb/s, respectively. The decoding speed is not high enough to support every specification. The design has to be improved in two directions, software and hardware. Since the programs for the AS-DSP are only translated by a simple translator, the non-optimized machine code reduces the performance; a compiler is needed to improve the system performance. In addition, the datapaths for FEC applications have to be more flexible and powerful to speed up the decoding process.

Bibliography

[1] ITU-T, Telecommunication Standardization Sector of ITU, “Digital multi-programme systems for television sound and data services for cable distribution”- Digital transmission of television signals, ITU-T Recommendation J.83, Apr. 1997.

[2] S. Lin and D. J. Costello, Jr., Error Control Coding, Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[3] G. D. Forney, Jr., “Convolutional Code II: Maximum likelihood decoding,” Information and Control, 25, pp 222-226, July 1974.

[4] R. Blahut, Theory and Practice of Error Control Codes. Boston: Addison-Wesley, 1983.

[5] T.-K. T. J.-H Jeng, “On decoding of both errors and erasures of a Reed-Solomon code using an inverse-free Berlekamp-Massey algorithm,” IEEE Trans. Comput., vol. 47, pp. 1488–1494, Oct. 1999.

[6] H.C. Chang, C.B. Shung, and C.Y. Lee, “A RS-PC decoder chip for DVD applications,” IEEE J. Solid-State Circuits, vol. 36, no. 2, pp. 229-238, February 2001.

[7] G. Forney, “On decoding BCH codes,” IEEE Trans. Inform. Theory, vol. IT-11, pp. 549–557, Oct. 1965.

[8] C.K. Koc and T Acar, “Fast Software Exponentiation in GF(2^k)”, 13th IEEE Symposium on Computer Arithmetic, pp. 225-231, 1997.

[9] J. Daemen and V. Rijmen., “AES Proposal: Rijndael,” submitted to NIST AES, June 1998.

[10] V. Rijmen, “Efficient implementation of the Rijndael S-box.” Available: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/ .

[11] SPI Block Guide V04.00, Freescale Semiconductor, Inc. S12SPIV4/D 21 Jun. 2004.

[12] G. Fettweis, H. Meyr, “A 100 Mbit/s Viterbi-decoder chip: Novel architecture and its realization,” IEEE International Conf. Communication (ICC'90) vol. 2, pp: 463-467, April 1990

[13] C.C. Lin, F.K. Chang, H.C. Chang, and C.Y. Lee, “An Universal VLSI Architecture for Bit-Parallel Computation in GF(2^m),” in IEEE Asia Pacific Conf. on Circuits and System, 2005.

[14] Andries P. Hekstra, “An Alternative to Metric Rescaling in Viterbi Decoders,” IEEE Trans. on Communications, vol. 37, No 11, pp 1220-1222, Nov. 1989.

[15] uPD4443362 data sheet, NEC Inc.

[16] G. Fettweis , H. Meyr, “Parallel Viterbi algorithm implementation : Breaking the ACS-Bottleneck,” IEEE Trans. Commun. ,8-89, 785-90; also paper 23.5, Proc. IEEE ICC’88, 719-23

[17] G. Fettweis, H. Meyr, “High rate Viterbi processor: a systolic array solution,” IEEE J. SAC, Oct. 1990.

[18] G. Fettweis, H. Meyr, “Cascaded feedforward architecture for parallel Viterbi decoding,” IEEE ICSAS, 978-81, 1990; subm. Kluwer J. VLSI Sig. Proc.

[19] TMS320C64x DSP Core Application Report, Texas Instrument Inc. SPRA686 - December 2000

[20] TMS320C54x DSP Core Application Report, Texas Instrument Inc. SPRA071A - January 2002

[21] G. Fettweis, H. Meyr, “Minimized method Viterbi decoding: 600Mbit/s per chip,” Global Telecommunications Conference, 1990, and Exhibition. 'Communications: Connecting the Future', GLOBECOM '90., IEEE, 2-5 Dec. 1990, Page(s): 1712-1716, vol. 3.

[22] H. Dawid,G. Fettweis , H. Meyr, “A CMOS IC for Gb/s Viterbi Decoding: System Design and VLSI Implement, ” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on , Volume: 4 Issue: 1 , March 1996 Page(s): 17 -31

[23] C. B. Shung, P. H. Siegel, G. Ungerboeck and H. K. Thapar, “VLSI architectures for metric normalization in the Viterbi algorithm,” IEEE International Conference on Communications, vol. 4, pp.1723-1728, Apr. 1990.

[24] J. Hagenauer and P. Hoeher, “A Viterbi Algorithm with Soft-decision Outputs and its Applications,” in IEEE GLOBE-COM, Dallas, TX, pp. 47.1.1-47.1.7, Nov. 1989.

Published Paper

Tien-Yuan Hsiao, Chien-Ching Lin, Hsie-Chia Chang, “An AS-DSP for Forward Error Correction Applications,” IEEE SIPs, 2-4 Nov. 2005.

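Author Biography

Name: 蕭添元 (Tien-Yuan Hsiao)

Place of birth: Chiayi County, Taiwan. Date of birth: 1981. 3. 7

Education:

1993. 9 ~ 1996. 6 Zhongxing Junior High School, Taoyuan County

1996. 9 ~ 1999. 6 Wu-Ling Senior High School, Taoyuan County

1999. 9 ~ 2003. 6 B.S., Department of Electronics Engineering, National Chiao Tung University

2003. 9 ~ 2005. 6 M.S., Systems Group, Institute of Electronics, National Chiao Tung University

Awards

Academic year 92 (ROC calendar): Highest Distinction, Xilinx graduate division, National Collegiate FPGA System Design Contest

Academic year 93 (ROC calendar): 巧手獎 (Skilled Craftsmanship Award), 演算科技 IC Design Contest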
