Hardware Designs for Function Evaluation and LDPC Coding


Imperial College London


A thesis submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computing

by

Dong-U Lee

October 2004


© Copyright by Dong-U Lee, October 2004


To my parents for their love and support, and my country Korea. . .


Acknowledgments

I thank my supervisor Prof. Wayne Luk for his advice and direction on both academic and non-academic issues. I would also like to thank Prof. John D. Villasenor from UCLA, Prof. Philip H.W. Leong from the Chinese University of Hong Kong, Prof. Peter Y.K. Cheung from the Department of EEE and Dr. Oskar Mencer from the Department of Computing for their help on my research topics.

Many thanks to my colleagues Altaf Abdul Gaffar, Andreas Fidjeland, Anthony Ng, Arran Derbyshire, Danny Lee, David Pearce, David Thomas, Henry Styles, Jose Gabriel de Fiqueiredo Coutinho, Jun Jiang, Ray Cheung, Shay Ping Seng, Sherif Yusuf, Tero Rissa and Tim Todman from Imperial College, Chris Jones, Connie Wang, David Choi, Esteban Vallés and Mike Smith from UCLA, and Dr. Guanglie Zhang from the Chinese University of Hong Kong for their assistance. I am especially thankful to Altaf Abdul Gaffar and Ray Cheung, who helped me with numerous Linux programming tasks, and Tim Todman, who proofread this thesis.

The financial support of Celoxica Limited, Xilinx Inc., the U.K. Engineering and Physical Sciences Research Council PhD Studentship from the Department of Computing, Imperial College, and the U.S. Office of Naval Research is gratefully acknowledged.


Abstract of the Thesis

Hardware-based implementations are desirable, since they can be several orders of magnitude faster than software-based methods. Reconfigurable devices such as Field-Programmable Gate Arrays (FPGAs) are ideal candidates for such implementations because of their speed and flexibility. Three main achievements are presented in this thesis: function evaluation, Gaussian noise generation, and Low-Density Parity-Check (LDPC) code encoding. First, our function evaluation research covers both elementary functions and compound functions. For elementary functions, we automate function evaluation unit design covering table look-up, table-with-polynomial and polynomial-only methods. We also illustrate a framework for adaptive range reduction based on a parametric function evaluation library. The proposed approach is evaluated by exploring the effects of several arithmetic functions on throughput, latency and area for FPGA designs.

For compound functions, which are often non-linear, we present an evaluation method based on piecewise polynomial approximation with a novel hierarchical segmentation scheme, which involves uniform segments and segments with size varying by powers of two. Second, our research on Gaussian noise generation results in two hardware architectures, which can be used for Monte Carlo simulations such as evaluating the performance of LDPC codes. The first design is based on the Box-Muller method and the central limit theorem, while the second design is based on the Wallace method. The quality of the noise produced by the two noise generators is characterized with various statistical tests. We also examine how design parameters affect the noise quality with the Wallace method.

Third, our research on LDPC encoding describes a flexible hardware encoder for regular and irregular LDPC codes. Our architecture, based on an encoding method proposed by Richardson and Urbanke, has linear encoding complexity.


List of Figures

1.1 Relations of the chapters in this thesis.

1.2 Design flow for evaluating elementary functions.

1.3 Design flow for evaluating non-linear functions using the hierarchical segmentation method.

1.4 The BenONE board from Nallatech used to run our LDPC simulation experiments.

1.5 Our LDPC hardware simulation framework.

1.6 LDPC encoding framework.

2.1 Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic cells.

2.2 Architecture of a typical FPGA.

2.3 Certain approximation methods are better than others for a given metric at different precisions.

2.4 Area comparison in terms of configurable logic blocks for different methods with varying data widths [122].

2.5 Comparison of (3,6)-regular LDPC code, Turbo code and optimized irregular LDPC code [151].

2.6 LDPC communication system model.

2.7 A bipartite graph of a (3,6)-regular LDPC code of length ten and rate 1/2. There are ten variable nodes and five check nodes. For each check node C_i the sum (over GF(2)) of all adjacent variable nodes is equal to zero.

2.8 An equivalent parity-check matrix in lower triangular form.

2.9 The parity-check matrix in approximate lower triangular form.

3.1 Block diagram of methodology for automation.

3.2 Principles behind automatic design optimization with ASC.

3.3 Accuracy graph: maximum error versus bitwidth for sin(x) with the three methods.

3.4 Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates the metric for which the design is optimized. Lower part: LUTs for logic; small top part: LUTs for routing.

3.5 Latency versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of latency optimization.

3.6 Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of throughput optimization.

3.7 Latency versus area for 12-bit approximations to sin(x). The Pareto-optimal points [124] in the latency-area space are shown.

3.8 Latency versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the latency-throughput space are shown.

3.9 Area versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the throughput-area space are shown.

3.10 Area versus bitwidth for the three functions with TABLE+POLY. Lower part: LUTs for logic; small top part: LUTs for routing.

3.11 Latency versus bitwidth for the three functions with TABLE+POLY.

3.12 Throughput versus bitwidth for the three functions with TABLE+POLY. Throughput is similar across functions, as expected.

3.13 Area versus bitwidth for sin(x) with the three methods. Note that the TABLE method already becomes too large at 14 bits.

3.14 Latency versus bitwidth for sin(x) with the three methods.

3.15 Throughput versus bitwidth for sin(x) with the three methods.

4.1 Design flow: MATLAB generates all the ASC code for the library. The user simply indexes into the library to obtain the specific function approximation unit.

4.2 Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x.

4.3 Circuit for evaluating sin(x).

4.4 Circuit for evaluating log(x).

4.5 Circuit for evaluating √x.

4.6 Plot of the three functions over the range reduced intervals.

4.7 Segmentation for evaluating log(y) with eight uniform segments. The leftmost three bits of the inputs are used as the segment index.

4.8 Architecture of table-with-polynomial unit for degree d polynomials. Horner’s rule is used to evaluate the polynomials.

4.9 ASC code for evaluating sin(x) for range 8 bits and precision 8 bits with tp2. This code is automatically generated from our MATLAB tool.

4.10 Area matrix which tells us for each input range/precision combination which design to use for minimum area.

4.11 Latency matrix which tells us for each input range/precision combination which design to use for minimum latency.

4.12 Area cost of range reduction (upper part) for sin(x) implemented using po with the designs optimized for area.

4.13 Area cost of range reduction (upper part) for sin(x) implemented using tp3 with the designs optimized for area.

4.14 Area cost of range reduction (upper part) for log(x) implemented using po with the designs optimized for area.

4.15 Area cost of range reduction (upper part) for log(x) implemented using tp3 with the designs optimized for area.

4.16 Area for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.

4.17 Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.

4.18 Area for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.

4.19 Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.

4.20 Area versus precision for sin(x) using tp3 for different ranges and optimization.

4.21 Latency versus precision for sin(x) using tp3 for different ranges and optimization.

4.22 Area versus range for all three functions using different methods with the precision fixed at eight bits optimized for area.

4.23 Latency versus range for all three functions using different methods with the precision fixed at eight bits optimized for latency.

4.24 Area versus range for all three functions using po for different precisions optimized for area.

4.25 Latency versus range for all three functions using po for different precisions optimized for latency.

4.26 Area versus range for all three functions using po for different precisions optimized for area.

4.27 Latency versus range for all three functions using po for different precisions optimized for latency.

5.1 MATLAB code for finding the optimum boundaries.

5.2 Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation.

5.3 Numbers of optimum segments for first order approximations to the functions for various operand bitwidths.

5.4 Numbers of optimum segments for second order approximations to the functions for various operand bitwidths.

5.5 Ratio of the number of optimum segments required for first and second order approximations to the functions.

5.6 Circuit to calculate the P2S address for a given input δ_i, where δ_i = a_{v−1}a_{v−2}...a_0. The adder counts the number of ones in the output of the two prefix circuits.

5.7 Main MATLAB code for finding the hierarchical boundaries and their polynomial coefficients.

5.8 Variation of total number of segments against v₀ for a 16-bit second order approximation to f₃.

5.9 The segmented functions generated by HFS for 16-bit second order approximations. f₁, f₂, f₃ and f₄ employ P2S(US), P2SL(US), US(US) and US(US) respectively. The black and grey vertical lines are the boundaries for the outer and inner segments respectively.

5.10 Design flow of our approach.

5.11 HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ‘:’ is a concatenation operator.

5.12 Variations of the table sizes to the four functions with varying polynomial degrees and operand bitwidths.

5.13 Variations of the HSM/Optimum segment ratio with polynomial degrees and operand bitwidths.

5.14 Xilinx System Generator design template used for first order US(US).

5.15 Xilinx System Generator design template used for second order P2SL(US).

5.16 Error in ulp for 16-bit second order approximation to f₃.

6.1 Gaussian noise generator architecture. The black boxes are buffers.

6.2 The f function. The asterisks indicate the boundaries of the linear approximations.

6.3 Circuit to calculate the segment address for a given input x. The adder counts the number of ones in the output of the two prefix circuits. Note that the least-significant bit x_0 is not required.

6.4 Function evaluator architecture based on non-uniform segmentation.

6.5 Variation of function approximation error with number of bits for the gradient of the f function.

6.6 The g functions. Only the thick line is approximated; see Figure 4. The most significant 2 bits of u₂ are used to choose which of the four regions to use; the remaining bits select a location within Region 0.

6.7 Approximation for g₁ over [0, 1/4). The asterisks indicate the segment boundaries of the linear approximations.

6.8 Approximation error to f. The worst case and average errors are 0.031 and 0.000048 respectively.

6.9 Approximation error to g₁. The worst case and average errors are 0.00079 and 0.0000012 respectively.

6.10 PDF of the generated noise with 17 approximations for f and 6 for g for a population of four million. The p-values of the χ² and A-D tests are 0.00002 and 0.0084 respectively.

6.11 PDF of the generated noise with 59 approximations for f and 21 for g for a population of four million. The p-values of the χ² and A-D tests are 0.0012 and 0.3487 respectively.

6.12 PDF of the generated noise with 59 approximations for f and 21 for g with two accumulated samples for a population of four million. The p-values of the χ² and A-D tests are 0.3842 and 0.9058 respectively.

6.13 Scatter plot of two successive accumulative noise samples for a population of 10000. No obvious correlations can be seen.

6.14 Variation of output rate against the number of noise generator instances.

7.1 Overview of the Wallace method.

7.2 Overview of our Gaussian noise generator architecture based on the Wallace method. The triangle in Stage 4 is a constant coefficient multiplier.

7.3 The transformation circuit of Stage 3. The square boxes are registers. The select signals for the multiplexors and the clock enable signals for the registers are omitted for simplicity.

7.4 Detailed timing diagram of the transformation circuit and the dual-port “Pool RAM”. A_z indicates the address of the data z and WE is the write enable signal of the “Pool RAM”.

7.5 Wallace architecture Stage 1 in Xilinx System Generator. The 30 LFSRs generate uniform random bits for Stage 2.

7.6 Wallace architecture Stage 2 in Xilinx System Generator. Pseudo random addresses for p, q, r, s are generated.

7.7 Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator. Orthogonal transformation is performed and sum of squares corrected.

7.8 Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA.

7.9 Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA.

7.10 Scatter plot of two successive noise samples for a population of 10000. No obvious correlations can be seen.

7.11 PDF of the generated noise from our design for a population of one million. The p-values of the χ² and A-D tests are 0.9994 and 0.2332 respectively.

7.12 PDF of the generated noise from our design for a population of four million. The p-values of the χ² and A-D tests are 0.7303 and 0.8763 respectively.

7.13 PDF of the generated noise from the Xilinx block for a population of one million. The p-values of the χ² and A-D tests are 0.0000 and 0.0002 respectively.

7.14 Variation of the χ² test p-value with sample size for the Xilinx block, 12-bit, 16-bit, 20-bit and 24-bit Wallace implementations.

7.15 Variation of output rate against the number of noise generator instances.

8.1 Pseudo code of the Wallace method.

8.2 Four million samples of blocks immediately following the block containing a 5σ output, evaluated with the χ² test with 200 bins over [−7, 7] for FastNorm2. The χ²₁₉₉ contributions of each of the bins are shown.

8.3 The χ²₁₉₉ values of blocks relative to a block containing a realization with absolute value of 5σ or higher. Four million samples are compiled for each block. The dotted horizontal line indicates the 0.05 confidence level.

8.4 Impact of various design choices on the χ²₁₉₉ value. Four million samples are compiled from the block immediately after each block containing an absolute value of 5σ or higher for each data point. The dotted horizontal line indicates the 0.05 confidence level.

8.5 Speed comparisons at various K at N = 4096 and R = 1. Lower part: arithmetic operations. Upper part: table accesses.

8.6 Speed comparisons for different parameter choices. The solid, dashed and dotted lines are for R = 1, R = 2 and R = 3 respectively.

8.7 Execution times for different pool sizes at R = 1 and K = 16. The solid and dotted lines are for the Athlon XP and the Pentium 4 processors respectively.

8.8 Level 2 cache miss rates on the SimpleScalar x86 simulator for different pool sizes at R = 1, K = 16 and various level 2 cache sizes. Level 1 cache is fixed at 16KB and 65536 noise samples are generated for each data point.

9.1 The parity-check matrix H in ALT form. A, B, C, and E are sparse matrices, D is a dense matrix, and T is a sparse lower triangular matrix.

9.2 LDPC encoding framework.

9.3 An equivalent parity-check matrix in lower triangular form. Note that n = block length and m = block length × (1 − code rate).

9.4 Different starting columns for H and H^T.

9.5 Overview of our hardware encoder architecture. Double buffering is used between the stages for concurrent execution. Grey and white boxes indicate RAMs and operations respectively.

9.6 Circuit for vector addition (VA).

9.7 Circuit for matrix-vector multiplication (MVM).

9.8 Circuit for forward-substitution (FS).

9.9 Scatter plot of a preprocessed irregular 500 × 1000 H matrix in ALT form with a gap of two. Ones appear as dots.

9.10 The four stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG.

9.11 LDPC encoder architecture Stage 2 and stage controller in Xilinx System Generator.

9.12 The matrix-vector multiplication (MVM) circuit in Xilinx System Generator.

9.13 The forward-substitution (FS) circuit in Xilinx System Generator.

9.14 Variation of throughput with the number of encoder instances.


List of Tables

2.1 Maximum absolute and average errors for various first order polynomial approximations to e^x over [−1, 1].

2.2 Efficient computation of p₁^T = −φ⁻¹(−ET⁻¹A + C)s^T.

2.3 Efficient computation of p₂^T = −T⁻¹(As^T + Bp₁^T).

2.4 Summary of the RU encoding procedure.

3.1 Various place and route results of 12-bit approximations to sin(x). The logic minimized LUT implementation of the tables minimizes latency and area, while keeping comparable throughput to the other methods, e.g. block RAM (BRAM) based implementation.

5.1 The ranges for P2S addresses for Λ₁ = P2S, n = 8, v₀ = 5 and v₁ = 3. The five P2S address bits δ₀ are highlighted in bold.

5.2 Number of segments for second order approximations to the four functions. Results for uniform, HSM and optimum are shown.

5.3 Comparison of direct look-up, SBTM, STAM and HSM for 16 and 24-bit approximations to f₂. The subscript for HSM denotes the polynomial degree, and the subscript for STAM denotes the number of multipartite tables used. Note that SBTM is equivalent to STAM₂.

5.4 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for 16 and 24-bit, first and second order approximations to f₂ and f₃.

5.5 Widths of the data paths, number of segments, table size and percentage of exactly rounded results for 16 and 24-bit second order approximations to f₂ and f₃.

5.6 Performance comparison: computation of f₂ and f₃ functions. The Athlon and the Pentium 4 PCs are equipped with 512MB and 1GB DDR-SDRAMs respectively.

6.1 Comparing two segmentation methods. Second column shows the comparison of the number of segments for non-uniform and uniform segmentation. Third column shows the number of bits used for the coefficients to approximate f and g₁.

6.2 Performance comparison: time for producing one billion Gaussian noise samples. All PCs are equipped with 1GB DDR-SDRAM.

7.1 Resource utilization for the four stages of the noise generator on a Xilinx Virtex-II XC2V4000-6 FPGA.

7.2 Hardware implementation results of the noise generator using different types of FPGA resources on a Xilinx Virtex-II XC2V4000-6 FPGA.

7.3 Comparisons of different hardware Gaussian noise generators implemented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs generate a noise sample every clock.

7.4 Hardware implementation results on a Xilinx Virtex-II XC2V4000-6 FPGA for different numbers of noise generator instances. The device has 23040 slices, 120 block RAMs and 120 embedded multipliers in total.

7.5 Performance comparison: time for producing one billion Gaussian noise samples.

8.1 Number of arithmetic operations per transform/sample for the transformation at various sizes of K.

8.2 Specifications of the AMD Athlon XP and Intel Pentium 4 platforms used in our experiments.

8.3 Details of the AMD Athlon XP and Intel Pentium 4 data caches.

8.4 Execution time in nanoseconds for the AMD Athlon XP and Intel Pentium 4 platforms at N = 4096.

8.5 Performance comparison of different software Gaussian random number generators. The Wallace implementations use N = 4096, R = 1 and K = 16.

9.1 Computation of p₁^T = −F⁻¹(−ET⁻¹A + C)s^T. Note that T⁻¹[As^T] = y^T ⇒ Ty^T = [As^T].

9.2 Computation of p₂^T = −T⁻¹(As^T + Bp₁^T).

9.3 Matrix X stored in memory. The location of the edges of each row and an extra bit indicating the end of a row are stored.

9.4 Preprocessing times and gaps for H matrices with rate 1/2 for various block lengths performed on a Pentium 4 2.4GHz PC equipped with 512MB DDR-SDRAM.

9.5 Dimensions and number of edges for the matrices A, B, T, C, F and E generated from a 1000 × 2000 irregular H matrix.

9.6 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for rate 1/2 for various block lengths.

9.7 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits for various rates.

9.8 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits and rate 1/2 for different numbers of encoder instances.

9.9 Performance comparison of block length of 2000 bits and rate 1/2 encoders: time for producing 410 million codeword bits.


Abbreviations

A-D Anderson-Darling

ALT Approximate Lower Triangular

ASC A Stream Compiler

ASIC Application-Specific Integrated Circuit

AWGN Additive White Gaussian Noise

BER Bit Error Rate

CDF Cumulative Distribution Function

CORDIC COordinate Rotations DIgital Computer

CPC Cycles Per Codeword

CPS Codewords Per Second

CWG CodeWord Generation

DDR Double Data Rate

DSP Digital Signal Processor

ECC Error Correcting Coding

FPGA Field-Programmable Gate Array

FS Forward-Substitution

GF Galois Field

HFS Hierarchical Function Segmenter

HSM Hierarchical Segmentation Method

K-S Kolmogorov-Smirnov

LDGM Low-Density Generator-Matrix

LDPC Low-Density Parity-Check

LFSR Linear Feedback Shift Register

LNS Logarithmic Number Systems

LRU Least Recently Used

LUT Look-Up Table

Mbps Mega bits per second

MVM Matrix-Vector Multiplication

P2S Powers of 2 Segments

PDF Probability Distribution Function

po polynomial only

RAM Random Access Memory

ROM Read Only Memory

RU Richardson and Urbanke

S1 Stage 1

SBTM Symmetric Bipartite Table Method

SNR Signal to Noise Ratio

STAM Symmetric Table Addition Method

tp2 table-with-polynomial of degree 2

ulp unit in the last place

US Uniform Segments

VA Vector Addition

VHDL Very high speed integrated circuits Hardware Description Language

WOR WithOut Range reduction

WRR With Range Reduction


Publications

Journal Papers

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Automating optimized hardware function evaluation”, submitted to IEEE Transactions on Computers, 2004.

P.H.W. Leong, G. Zhang, D. Lee, W. Luk and J.D. Villasenor, “A comment on the implementation of the Ziggurat method”, submitted to Journal of Statistical Software, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.H.W. Leong, “Design parameter optimization for the Wallace Gaussian random number generator”, submitted to ACM Transactions on Modeling and Computer Simulation, 2004.

D. Lee, W. Luk, J.D. Villasenor, G. Zhang and P.H.W. Leong, “A hardware Gaussian noise generator using the Wallace method”, submitted to IEEE Transactions on VLSI, 2004.

G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee and W. Luk, “Monte Carlo simulation using FPGAs”, submitted to IEEE Transactions on VLSI, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The hierarchical segmentation method for function evaluation”, submitted to IEEE Transactions on Circuits and Systems I, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise generator for hardware-based simulations”, IEEE Transactions on Computers, volume 53, number 12, pages 1523-1534, 2004.

Book Chapter

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The effects of polynomial degrees on the hierarchical segmentation method”, Chapter in New Algorithms, Architectures, and Applications for Reconfigurable Computing, W. Rosenstiel and P. Lysaght (Eds.), Kluwer Academic Publishers, 2004.

Conference Papers

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “MiniBit: Bit-width optimization via affine arithmetic”, submitted to ACM/IEEE Design Automation Conference, 2005.

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Adaptive range reduction for hardware function evaluation”, In Proceedings of IEEE International Conference on Field-Programmable Technology (FPT), pages 169-176, Brisbane, Australia, Dec 2004.

D. Lee, “Gaussian noise generation for Monte Carlo simulations in hardware”, In Proceedings of The Korean Scientists and Engineers Association in the UK 30th Anniversary Conference, pages 182-185, London, UK, Sep 2004.

D. Lee, O. Mencer, D.J. Pearce and W. Luk, “Automating optimized table-with-polynomial function evaluation for FPGAs”, In Proceedings of International Conference on Field Programmable Logic and its Applications (FPL), pages 364-373, LNCS 3203, Springer-Verlag, Antwerp, Belgium, Aug 2004.

D. Lee, W. Luk, C. Wang, C. Jones, M. Smith and J.D. Villasenor, “A flexible hardware encoder for low-density parity-check codes”, In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 101-111, Napa Valley, USA, Apr 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hierarchical segmentation schemes for function evaluation”, In Proceedings of IEEE International Conference on Field-Programmable Technology (FPT), pages 92-99, Tokyo, Japan, Dec 2003.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hardware function evaluation using non-linear segments”, In Proceedings of International Conference on Field Programmable Logic and its Applications (FPL), pages 796-807, LNCS 2778, Springer-Verlag, Lisbon, Portugal, Sep 2003.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise generator for channel code evaluation”, In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 69-78, Napa Valley, USA, Apr 2003.

D. Lee, T.K. Lee, W. Luk and P.Y.K. Cheung, “Incremental programming for reconfigurable engines”, In Proceedings of IEEE International Conference on Field-Programmable Technology (FPT), pages 411-415, Shatin, Hong Kong, Dec 2002.


CHAPTER 1 Introduction

1.1 Objectives and Contributions

The objective of this thesis is to explore hardware designs for function evaluation, Gaussian noise generation and Low-Density Parity-Check (LDPC) code encoding.

Our main contributions are:

• Methodology for the automation of function evaluation unit design, covering table look-up, table-with-polynomial and polynomial-only methods (Chapter 3).

• Framework for adaptive range reduction based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input and output ranges (Chapter 4).

• Efficient hierarchical segmentation method based on piecewise polynomial approximations suitable for non-linear compound functions, which involves uniform segments and segments with size varying by powers of two (Chapter 5).

• Hardware Gaussian noise generator based on the Box-Muller method and the central limit theorem, capable of producing 133 million samples per second with 10% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 6).

• Hardware Gaussian noise generator based on the Wallace method, capable of producing 155 million samples per second with 3% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 7).

• Design parameter optimization for software implementations of the Wallace method to reduce correlations and execution time (Chapter 8).

• Linear complexity hardware encoder for regular and irregular LDPC codes with an efficient architecture for storing and performing computation on sparse matrices (Chapter 9).
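
To make the table-with-polynomial idea in the first contribution more concrete, the following Python sketch shows one way such a unit can work: the leading bits of a fixed-point input select a segment, a small table supplies that segment's polynomial coefficients, and Horner's rule evaluates the polynomial on the remaining bits. It is a minimal illustration and not the design flow of Chapter 3. The function log(1 + x), the 12-bit input width, the eight segments, the Taylor-derived coefficients and all names (`build_table`, `evaluate`, `max_error`) are assumptions chosen for the example, and floating point stands in for the fixed-point datapath of a real circuit.

```python
import math

INPUT_BITS = 12      # assumed input width: x = k / 2**12, so x lies in [0, 1)
SEGMENT_BITS = 3     # assumed: the top 3 bits select one of 8 uniform segments
OFFSET_BITS = INPUT_BITS - SEGMENT_BITS


def build_table():
    """Return degree-2 coefficients (c0, c1, c2) for each segment.

    The coefficients come from a Taylor expansion of log(1 + x) about the
    segment midpoint; this is an illustrative choice, not a minimax fit.
    """
    table = []
    seg_width = 1.0 / (1 << SEGMENT_BITS)
    for seg in range(1 << SEGMENT_BITS):
        mid = (seg + 0.5) * seg_width
        c0 = math.log(1.0 + mid)            # f(mid)
        c1 = 1.0 / (1.0 + mid)              # f'(mid)
        c2 = -0.5 / (1.0 + mid) ** 2        # f''(mid) / 2
        table.append((c0, c1, c2))
    return table


def evaluate(k, table):
    """Approximate log(1 + k / 2**INPUT_BITS) for an integer input k."""
    seg = k >> OFFSET_BITS                   # table address: leading bits
    low = k & ((1 << OFFSET_BITS) - 1)       # position inside the segment
    seg_width = 1.0 / (1 << SEGMENT_BITS)
    # Offset from the segment midpoint, in [-seg_width / 2, seg_width / 2).
    d = (low / (1 << OFFSET_BITS) - 0.5) * seg_width
    c0, c1, c2 = table[seg]
    return (c2 * d + c1) * d + c0            # Horner's rule


def max_error(table):
    """Largest absolute error over every representable input."""
    return max(
        abs(evaluate(k, table) - math.log(1.0 + k / (1 << INPUT_BITS)))
        for k in range(1 << INPUT_BITS)
    )
```

With these assumed parameters the Taylor remainder keeps the error below 10⁻⁴ over the whole input range. Adding segment bits or raising the polynomial degree trades table size against arithmetic, which is the kind of design choice the automated methodology is meant to explore.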

The most exciting contribution of this thesis is perhaps the hierarchical segmentation method presented in Chapter 5. It is a systematic method for producing fast and efficient hardware function evaluators for both compound and elementary functions using piecewise polynomial approximations with a novel hierarchical segmentation scheme. This method is particularly useful for approximating non-linear functions or curves, using significantly less memory than the traditional uniform segmentation approach. Depending on the function and precision, the memory requirements can be reduced by several orders of magnitude.

We believe that numerous applications can benefit from our approach, including data compression, function evaluation, non-linear filtering, pattern recognition and picture processing.
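
The memory saving claimed for non-uniform segmentation can be illustrated with a small Python experiment. The sketch counts how many linear segments are needed to approximate a function with a steep region, first with uniform segments and then with outer segments whose widths grow by powers of two, each split into uniform inner segments. It is a simplified stand-in for the hierarchical scheme and not the method of Chapter 5. The test function √(−ln x), the interval [2⁻¹², 1/2], the chord-based linear fit, the error tolerance and all function names are assumptions made for the example.

```python
import math


def f(x):
    """Compound function used for illustration: sqrt(-ln(x)) on (0, 1)."""
    return math.sqrt(-math.log(x))


def chord_error(a, b, samples=16):
    """Largest sampled error of the straight line through (a, f(a)), (b, f(b))."""
    fa, fb = f(a), f(b)
    worst = 0.0
    for k in range(1, samples):
        x = a + (b - a) * k / samples
        line = fa + (fb - fa) * (x - a) / (b - a)
        worst = max(worst, abs(f(x) - line))
    return worst


def segments_ok(lo, hi, count, tol):
    """True if `count` equal segments on [lo, hi] all meet the tolerance."""
    width = (hi - lo) / count
    return all(
        chord_error(lo + i * width, lo + (i + 1) * width) <= tol
        for i in range(count)
    )


def uniform_count(lo, hi, tol):
    """Smallest power-of-two number of uniform segments meeting `tol`."""
    count = 1
    while not segments_ok(lo, hi, count, tol):
        count *= 2
    return count


def hierarchical_count(min_exp, tol):
    """Outer segments [2**-k, 2**-(k-1)) for k = min_exp..2, each split into
    the same power-of-two number of uniform inner segments."""
    outer = [(2.0 ** -k, 2.0 ** -(k - 1)) for k in range(min_exp, 1, -1)]
    inner = 1
    while not all(segments_ok(a, b, inner, tol) for a, b in outer):
        inner *= 2
    return len(outer) * inner


if __name__ == "__main__":
    TOL = 1e-3  # assumed error tolerance
    print("uniform segments:     ", uniform_count(2.0 ** -12, 0.5, TOL))
    print("hierarchical segments:", hierarchical_count(12, TOL))
```

Running the script prints both counts. The uniform scheme is forced to use its smallest segment width everywhere, even though only the steep region near zero needs it, whereas the power-of-two outer segments let the width grow as the function flattens.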

Although the designs in this thesis target FPGA technology, we believe that our methods are generic enough to be applied across different implementation technologies such as ASICs. FPGAs are simply used as a platform to demonstrate that our ideas can be efficiently mapped into hardware.

Figure 1.1: Relations of the chapters in this thesis. [Diagram: Chapter 1, Introduction; Chapter 2, Background; Function Evaluation group (Chapter 3, Automating Function Evaluation; Chapter 4, Range Reduction; Chapter 5, Hierarchical Segmentation); LDPC Coding group (Gaussian Noise Generation: Chapter 6, Box-Muller Method; Chapter 7, Wallace Method; Chapter 8, Wallace Optimization; and Chapter 9, LDPC Encoding); Chapter 10, Conclusions.]

Figure 1.1 illustrates how the various chapters in this thesis are related to each other. The chapters on function evaluation are 3, 4 and 5. The chapters on LDPC coding are 6, 7, 8 and 9. Within the LDPC coding framework, Chapters 6, 7 and 8 are on Gaussian noise generation, which is needed for exploring LDPC code behavior in hardware. The Box-Muller method in Chapter 6 requires the evaluation of functions and uses a variant of the hierarchical segmentation method presented in Chapter 5.
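
As a reminder of why Box-Muller noise generation depends on function evaluation, the sketch below implements the textbook Box-Muller transform in software. Each pair of uniform samples passes through the square root of a logarithm and a sine/cosine pair, which are the kinds of functions a hardware design has to approximate. This is the standard floating-point formulation and not the fixed-point architecture of Chapter 6; the function names and the use of Python's `random` module are assumptions of the example.

```python
import math
import random


def box_muller(rng):
    """Return two independent N(0, 1) samples from two uniform samples."""
    u1 = 1.0 - rng.random()          # in (0, 1], which avoids log(0)
    u2 = rng.random()                # in [0, 1)
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def sample_stats(n_pairs, seed=1):
    """Generate 2 * n_pairs samples and return their mean and variance."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_pairs):
        samples.extend(box_muller(rng))
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var
```

Calling `sample_stats(10000)` should give a mean close to zero and a variance close to one. A hardware version replaces the library calls to `log`, `sqrt`, `sin` and `cos` with approximations, which is where segmentation-based function evaluation comes in.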

The rest of this chapter provides historical information and an overview of the material in Chapters 3 to 8. Chapter 2 covers background material and previous work. Chapter 3 describes a methodology for the automation of elementary function evaluation unit design. Chapter 4 presents a framework for adaptive range reduction based on a parametric elementary function evaluation library. Chapter 5 presents an efficient hierarchical segmentation method suitable for non-linear compound functions. Chapter 6 describes a hardware Gaussian noise generator based on the Box-Muller method and the central limit theorem. Chapter 7 presents a hardware Gaussian noise generator based on the Wallace method. Chapter 8 analyzes correlations that can occur in the Wallace method, and examines parameters to reduce correlations and execution time for software implementations. Chapter 9 describes an efficient hardware encoder with linear encoding complexity for both regular and irregular LDPC codes, and Chapter 10 offers conclusions and future work.

1.2 Computer Arithmetic

Arithmetic has played an important role in human civilization, especially in the areas of science, engineering and technology. Machine arithmetic can be traced back as early as 500 BC in the form of the abacus used in China. Many numerically intensive applications, such as signal processing, require rapid execution of arithmetic operations. The evaluation of functions is often the performance bottleneck of many compute-bound applications. Examples of these functions include elementary functions such as log(x) and √x, and compound functions such as √(−log(x)) and x log(x). Computing these functions quickly and accurately is a major goal in computer arithmetic. For instance, over 60% of the total run time is devoted to function evaluation operations in a simulation of a jet engine reported by O’Grady and Wang [133].

Recent studies have shown the increasing importance of these mathematical functions in a wide variety of applications, including 3D computer graphics, animation, scientific computing, artificial neural networks, digital signal processing and multimedia applications. Software implementations are often too slow for numerically intensive or real-time applications. The increasing speed and performance constraints of such applications have led to the development of new dedicated hardware for the computation of these operations, providing high-speed solutions implemented in coprocessors, graphics cards, Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) [122] and numerical processors in general.


1.3 Error Correcting Coding and LDPC Codes

Error correcting coding (ECC) is a critical part of modern communications systems, where it is used to detect and correct errors introduced during transmission over a channel [11], [126]. It relies on transmitting the data in an encoded form, such that the redundancy introduced by the coding allows a decoding device at the receiver to detect and correct errors. In this way, no request for retransmission is required, unlike systems which only detect errors (usually by means of a checksum transmitted with the data). In many applications, a substantial portion of the baseband signal processing is dedicated to ECC. The wide range of ECC applications [30] includes space and satellite communications, data transmission, data storage and mobile communications.

NASA’s space missions including Galileo, Odyssey, Rovers and Voyager would not have been possible without the use of ECC [71]. Odyssey, NASA’s Mars spacecraft, currently boasts the highest data transmission rate at 128,000 bits per second via a radio link. However, for future space missions NASA is planning to use optical communications via laser beams [60]. The new laser will beam back between one million and 30 million bits per second, depending on the distance between Mars and Earth [119]. Projects like this make it a great challenge to implement high-speed and low-power ECC systems with good error correcting performance in deep space.

In 1948, Claude Shannon founded the field of study “Information Theory”, which is the basis of modern ECC, with his discovery of the noisy channel coding theorem [164]. The theoretical contribution of Shannon’s work was a useful definition of “information” and several “channel coding theorems” which gave explicit upper bounds, called the channel capacity, on the rate at which information could be transmitted reliably on a given communication channel. In the context of our work, the result of primary interest is the “noisy channel coding theorem for continuous channels with average power limitations”. This theorem states that the capacity C (which is now known as the Shannon limit) of a bandlimited additive white Gaussian noise (AWGN) channel with bandwidth W, a channel model that approximately represents many practical digital communication and storage systems, is given by

    C = W log₂(1 + E_s/N₀) bits per second (bps)    (1.1)

where E_s is the average signal energy in each signaling interval of duration T = 1/W, and N₀/2 is the two-sided noise power spectral density. Perfect Nyquist signalling is assumed. The proof of this theorem demonstrates that for any transmission rate R less than or equal to the channel capacity C, there exists a coding scheme that achieves an arbitrarily small probability of error; conversely, if R is greater than C, no coding scheme can achieve reliable performance. Since this theorem was published, an entire field of study has grown out of attempts to design coding schemes that approach the Shannon limit of various channels.
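As a quick numerical illustration of Equation (1.1), the following Python sketch evaluates the capacity for a given bandwidth and E_s/N₀. The function names and the example bandwidth are hypothetical and chosen only for illustration.

```python
import math


def db_to_linear(value_db):
    """Convert a power ratio from decibels to a linear ratio."""
    return 10.0 ** (value_db / 10.0)


def awgn_capacity_bps(bandwidth_hz, es_over_n0):
    """Capacity C = W * log2(1 + Es/N0) of a bandlimited AWGN channel (bps)."""
    return bandwidth_hz * math.log2(1.0 + es_over_n0)


# Hypothetical 1 MHz channel, swept over a few Es/N0 values given in dB.
for snr_db in (0.0, 5.0, 10.0):
    capacity = awgn_capacity_bps(1.0e6, db_to_linear(snr_db))
    print(snr_db, capacity)
```

With E_s/N₀ = 1 (0 dB) the capacity equals the bandwidth W, and each doubling of 1 + E_s/N₀ adds a further W bits per second.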

In the past few years, LDPC codes have received much attention because of their excellent performance, and have been widely considered as the most promising candidate ECC scheme for many applications in telecommunications and storage devices [132], [8]. LDPC codes were first proposed by Gallager in 1962 [48], [49]. He defined an (n, d_v, d_c) LDPC code as a code of block length n in which each column of the parity-check matrix contains d_v ones and each row contains d_c ones. Due to the regular structure (uniform column and row weight) of Gallager’s codes, they are now called regular LDPC codes. Gallager provided simulation results for codes with block lengths of the order of hundreds of bits. The results indicated that LDPC codes have very good potential for error correction. However, the high storage and computation requirements interrupted the research on LDPC codes. After the discovery of Turbo codes by Berrou et al. in 1993 [7], MacKay [110] re-established interest in LDPC codes during the mid to late 1990s.
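The definition of an (n, d_v, d_c) regular code can be checked with a small sketch. The construction below stacks d_v bands of rows, each band containing exactly one 1 per column, in the spirit of Gallager's original construction. The function name, the use of random column permutations from Python's random module and the parameter values are assumptions made for illustration, and no attempt is made to avoid short cycles.

```python
import random


def gallager_regular_ldpc(n, d_v, d_c, seed=0):
    """Build a parity-check matrix with column weight d_v and row weight d_c."""
    if n % d_c != 0:
        raise ValueError("n must be a multiple of d_c")
    rng = random.Random(seed)
    rows_per_band = n // d_c
    h = []
    for band in range(d_v):
        columns = list(range(n))
        if band > 0:
            rng.shuffle(columns)  # later bands are column permutations of the first
        for r in range(rows_per_band):
            row = [0] * n
            for c in columns[r * d_c:(r + 1) * d_c]:
                row[c] = 1
            h.append(row)
    return h


h = gallager_regular_ldpc(n=20, d_v=3, d_c=4)
row_weights = {sum(row) for row in h}
column_weights = {sum(col) for col in zip(*h)}
print(len(h), row_weights, column_weights)
```

Every column is used exactly once per band, which is what gives the uniform column weight d_v; the matrix has n·d_v/d_c rows.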

1.4 Overview of our Approach

1.4.1 Function Evaluation

The evaluation of elementary functions is at the core of many compute-intensive applications [133] which perform well on reconfigurable platforms. Yet, in order to implement function evaluation efficiently, the FPGA programmer has to choose between many function evaluation methods such as table look-up, polynomial approximation, or table look-up combined with polynomial approximation.

We present a methodology and a partially automated implementation to select the best function evaluation hardware for a given function, accuracy requirement, technology mapping and optimization metrics, such as area, throughput or latency. The automation of function evaluation unit design is combined with ASC [123], A Stream Compiler, for FPGAs. On the algorithmic side, we use MATLAB to design approximation algorithms with polynomial coefficients and minimize bitwidths. On the hardware implementation side, ASC provides partially automated design space exploration. We illustrate our approach for sin(x), log(1 + x) and 2^x, which are commonly used in a variety of applications. We provide a selection of graphs that characterize the design space with various dimensions, including accuracy, precision and function evaluation method. We also demonstrate design space exploration by implementing more than 400 distinct designs.

The evaluation of a function f(x) typically consists of range reduction, which transforms the input into a small interval, and the actual function evaluation on the small interval. We investigate optimization of range reduction given the range and precision of x and f(x). For every function evaluation there exists a convenient interval such as [0, π/2) for sin(x). The adaptive range reduction method that we propose introduces another, larger interval for which it makes sense to skip range reduction. The decision depends on the function being evaluated, precision, and optimization metrics such as area, latency and throughput. In addition, the input and output range has an impact on the choice of function evaluation method such as polynomial, table based, or combinations of the two. We explore this vast design space of adaptive range reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last place (ulp) using MATLAB and ASC. These tools enable us to study over 1000 designs resulting in over 40 million Xilinx equivalent circuit gates, in a few hours’ time. The final objective is to progress towards a fully automated library that provides optimal function evaluation hardware units given input and output range and precision. Our design flow for evaluating elementary functions is illustrated in Figure 1.2.
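The two stages of function evaluation can be sketched in software for sin(x). In the sketch below, range reduction maps any input to the interval [0, π/2] using the symmetries of the sine function, and the reduced argument is evaluated with a degree-9 Taylor polynomial. The Taylor polynomial is a stand-in chosen for brevity, not the approximations designed with MATLAB in this work, and all function names are hypothetical; floating point is used instead of fixed point.

```python
import math

HALF_PI = math.pi / 2.0


def reduce_sin_argument(x):
    """Return (r, sign) with r in [0, pi/2] such that sin(x) = sign * sin(r)."""
    x = math.fmod(x, 2.0 * math.pi)
    if x < 0.0:
        x += 2.0 * math.pi
    quadrant = int(x // HALF_PI)
    offset = x - quadrant * HALF_PI
    quadrant %= 4  # guards against x rounding up to exactly 2*pi
    if quadrant == 0:
        return offset, 1.0
    if quadrant == 1:
        return HALF_PI - offset, 1.0
    if quadrant == 2:
        return offset, -1.0
    return HALF_PI - offset, -1.0


def sin_core(r):
    """Degree-9 Taylor polynomial for sin(r), intended for r in [0, pi/2]."""
    r2 = r * r
    return r * (1.0 + r2 * (-1.0 / 6.0 + r2 * (1.0 / 120.0
               + r2 * (-1.0 / 5040.0 + r2 / 362880.0))))


def sin_with_range_reduction(x):
    """Evaluate sin(x) as range reduction followed by a small-interval core."""
    r, sign = reduce_sin_argument(x)
    return sign * sin_core(r)


for value in (-7.0, 0.3, 2.0, 100.0):
    print(value, sin_with_range_reduction(value), math.sin(value))
```

If the input is already known to lie in a small interval, the call to reduce_sin_argument can be skipped altogether, which is the kind of trade-off that adaptive range reduction explores.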

Figure 1.2: Design flow for evaluating elementary functions.

Compound functions often have non-linear properties, hence sophisticated approximation techniques are needed. We present a method for evaluating such functions based on piecewise polynomial approximation with a novel hierarchical segmentation scheme. The use of hierarchical schemes of uniform segments and segments with size varying by powers of two enables us to approximate non-linear regions of a function particularly well. This partitioning is automated: efficient look-up tables and their coefficients are generated for a given function, input range, degree of the polynomials, desired accuracy and finite precision constraints. Parameterized reference design templates are provided for various predefined hierarchical schemes. We describe an algorithm to find the optimum number of segments and the placement of their boundaries, which is used to analyze the properties of a function and to benchmark our hierarchical approach.

Our method is illustrated using four non-linear compound and elementary functions: √(−log(x)), x log(x), a high order rational function and cos(πx/2). We present results for various operand sizes between 8 and 24 bits for first and second order polynomial approximations. For 24-bit data, our method requires a look-up table 12 times smaller than that of the symmetric table addition method. Our framework for the hierarchical segmentation method is shown in Figure 1.3.
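One reason that segments with sizes varying by powers of two are attractive in hardware is that the segment containing an input can be found from the position of its leading one bit. The sketch below assumes an unsigned fixed-point input whose outer segments shrink by a factor of two towards zero, with each outer segment divided uniformly into 2^inner_bits inner segments. The function name, bit widths and this particular two-level scheme are illustrative assumptions rather than the exact schemes of Chapter 5.

```python
def segment_address(u, n_bits, inner_bits):
    """Return (outer, inner) segment indices for an n_bits-wide unsigned input u.

    outer counts the leading zeros of u (power-of-two segments);
    inner is taken from the inner_bits bits just below the leading one
    (uniform segments inside the outer segment).
    """
    if not 0 <= u < (1 << n_bits):
        raise ValueError("input out of range")
    if u == 0:
        return n_bits, 0  # dedicated segment for zero
    msb = u.bit_length() - 1
    outer = n_bits - 1 - msb
    shift = msb - inner_bits
    mask = (1 << inner_bits) - 1
    if shift >= 0:
        inner = (u >> shift) & mask
    else:
        inner = (u << -shift) & mask  # fewer than inner_bits bits remain
    return outer, inner


for value in (0b11100000, 0b00010100, 0b00000001):
    print(format(value, "08b"), segment_address(value, n_bits=8, inner_bits=2))
```

The resulting pair can be concatenated into a table address that selects the polynomial coefficients for that segment.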

1.4.2 Gaussian Noise Generation

Figure 1.3: Design flow for evaluating non-linear functions using the hierarchical segmentation method.

Evaluations of LDPC codes are based on computer simulations which can be time consuming, particularly when the behavior at low bit error rates (BERs) in the error floor region is being studied [57]. Tremendous efforts have been devoted to analyzing and improving their error-correcting performance, but little consideration has been given to practical LDPC codec hardware implementations. If the binary Hamming distance [148] between all combinations of codewords (the distance spectrum) is known, then analytic techniques for describing the performance of the codes in the presence of noise are available. However, in the case of capacity-achieving random linear codes (such as LDPC codes), the problem of finding the distance spectrum of the code is intractable, and researchers resort to Monte Carlo simulation in order to characterize various code constructions in terms of BER versus signal-to-noise ratio (SNR). At very low SNRs, errors occur often and sufficient statistics can be gathered readily on a PC. At higher SNRs, where errors occur rarely, the situation is different. Thorough characterization of a code in this region may require simulation of 10¹⁰ to 10¹² code symbols, and computer based simulations, which can take several weeks, provide an inadequate means of finding a statistically sufficient set of error events.
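The cost of Monte Carlo characterization can be seen in a toy example. The sketch below estimates the BER of uncoded BPSK over an AWGN channel and compares it with the closed-form value 0.5·erfc(√(E_b/N₀)). Uncoded BPSK is substituted for an LDPC codec purely to keep the example short, and the function names and parameters are hypothetical.

```python
import math
import random


def theoretical_bpsk_ber(ebn0_db):
    """Closed-form BER of uncoded BPSK over AWGN."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))


def simulate_bpsk_ber(ebn0_db, num_bits, seed=0):
    """Monte Carlo BER estimate for uncoded BPSK with unit-energy symbols."""
    rng = random.Random(seed)
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    sigma = math.sqrt(1.0 / (2.0 * ebn0))  # noise standard deviation
    errors = 0
    for _ in range(num_bits):
        bit = rng.getrandbits(1)
        received = (1.0 if bit else -1.0) + rng.gauss(0.0, sigma)
        if (received > 0.0) != bool(bit):
            errors += 1
    return errors / num_bits


print(simulate_bpsk_ber(4.0, 20000), theoretical_bpsk_ber(4.0))
```

Since a usable estimate needs a reasonable number of observed errors, the number of simulated symbols grows roughly in inverse proportion to the BER being measured, which is why the low-BER region is so expensive to explore in software.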

Hardware based simulation offers the potential of speeding up code evaluation by several orders of magnitude [99]. Such a simulation framework consists of three main blocks: encoder, noise channel and decoder, where the noise channel is generally modeled by Gaussian noise. Our LDPC code simulations are run on a reconfigurable engine, which consists of a PC and a reconfigurable hardware platform [85]. The reconfigurable hardware platform we use is a Xilinx Virtex-II FPGA prototyping board from Nallatech [131] shown in Figure 1.4. It consists of two Xilinx Virtex-II XC2V4000-6 FPGAs and 4MB of SRAM. The board can be connected to a PC via the PCI bus or USB. The grey wires are connected to a logic analyzer for debugging purposes. A block diagram of our LDPC simulation framework is provided in Figure 1.5. The LDPC encoder follows an algorithm suggested in [152]. Our noise generator block improves the overall value of the system as a Monte Carlo simulator, since noise quality at high SNRs (tails of the Gaussian distribution) is essential. Since the LDPC decoding process is iterative and the number of required iterations is non-deterministic, a flow control buffer is used to greatly increase the throughput of the overall system.

Figure 1.4: The BenONE board from Nallatech used to run our LDPC simulation experiments.

We present two methods for generating Gaussian noise. The first is based on the Box-Muller method [13] and the central limit theorem [78]. The Box-Muller method involves the computation of two functions: √(−ln(x)) and cos(2πx). The accuracy and speed in computing these functions are essential for generating high-quality Gaussian noise samples rapidly. The use of non-uniform segments enables us to approximate non-linear regions of a function particularly well. The appropriate segment address for a given function can be rapidly calculated at run time by a simple combinatorial circuit. Scaling factors are used to deal with large polynomial coefficients and to trade precision for range. Our function evaluator is based on first order polynomials, and is suitable for applications requiring high performance with small area, at the expense of accuracy. We exploit the central limit theorem to overcome quantization and approximation errors. An implementation at 133MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 10% of the device and produces 133 million samples per second, which is seven times faster than a 2.6GHz Pentium 4 PC.
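A floating-point software sketch of the first method is given below. It uses the standard Box-Muller transform, followed by summing a few outputs and rescaling, which is one simple way to apply the central limit theorem. The function names, the number of summed terms and the use of Python's random module as the uniform source are assumptions for illustration; the hardware design instead uses fixed-point piecewise polynomial approximations of the two functions.

```python
import math
import random


def box_muller(u1, u2):
    """Map u1 in (0, 1] and u2 in [0, 1) to two independent N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def gaussian_samples(count, clt_terms=2, seed=1):
    """Generate samples by averaging clt_terms Box-Muller outputs.

    Dividing the sum by sqrt(clt_terms) keeps the variance at 1.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(clt_terms)
    samples = []
    for _ in range(count):
        total = 0.0
        for _ in range(clt_terms):
            u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
            u2 = rng.random()
            total += box_muller(u1, u2)[0]
        samples.append(total * scale)
    return samples


noise = gaussian_samples(5)
print(noise)
```

Summing several approximately Gaussian values smooths out errors in each individual value, at the cost of generating more raw samples per output.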

The second method is based on the Wallace method [180]. Wallace proposed a fast algorithm for generating normally distributed pseudo-random numbers which generates the target distributions directly using their maximal-entropy properties. This algorithm is particularly suitable for high throughput hardware implementation since no transcendental functions such as √x, log(x) or sin(x) are required. The Wallace method starts from a pool of normally distributed random numbers. Through transformation steps, a new pool of normally distributed random numbers is generated. An implementation running at 155MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 3% of the device and produces 155 million samples per second.
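The pool transformation can be sketched as follows. The example assumes that each step applies an orthogonal 4×4 transform to randomly chosen groups of four pool values, which needs only additions, subtractions and a halving; an orthogonal transform of independent normal values again yields normal values and preserves the sum of squares of the pool. The function name, the pool size and the way groups are selected are illustrative assumptions, and refinements such as the correction of the pool's sum of squares are omitted.

```python
import random


def wallace_pass(pool, rng):
    """Produce a new pool by transforming random groups of four values."""
    n = len(pool)
    if n % 4 != 0:
        raise ValueError("pool size must be a multiple of 4")
    order = list(range(n))
    rng.shuffle(order)  # assumed addressing scheme: random grouping
    new_pool = [0.0] * n
    for i in range(0, n, 4):
        ia, ib, ic, id_ = order[i:i + 4]
        a, b, c, d = pool[ia], pool[ib], pool[ic], pool[id_]
        t = 0.5 * (a + b + c + d)
        # The four outputs below form an orthogonal transform of (a, b, c, d).
        new_pool[ia] = a - t
        new_pool[ib] = b - t
        new_pool[ic] = t - c
        new_pool[id_] = t - d
    return new_pool


rng = random.Random(7)
pool = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # initial normal pool
next_pool = wallace_pass(pool, rng)
print(sum(v * v for v in pool), sum(v * v for v in next_pool))
```

Because each new value is a linear combination of old pool values, successive pools are not independent, which is the source of the correlations discussed later in this section.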

Figure 1.5: Our LDPC hardware simulation framework.

The outputs of the two noise generators accurately model a true Gaussian PDF even at very high σ values (tails of the Gaussian distribution). Their properties are explored using: (a) several different statistical tests, including the chi-square test and the Anderson-Darling test [32], and (b) an application for decoding of LDPC codes. Although the Wallace design has smaller area and is faster than the Box-Muller design, it has slight correlations between successive transformations, which may be undesirable for certain types of simulations. We examine design parameter optimizations to reduce such correlations.

Figure 1.6: LDPC encoding framework.

1.4.3 LDPC Encoding

We describe a flexible hardware encoder for regular and irregular Low-Density Parity-Check (LDPC) codes. Although LDPC codes achieve better performance and lower decoding complexity than Turbo codes, a major drawback is their apparently high encoding complexity: whereas Turbo codes can be encoded in linear time, a straightforward implementation for an LDPC code has complexity quadratic in the block length due to dense matrix-vector multiplication. Using an efficient encoding method proposed by Richardson and Urbanke [152], we present a hardware LDPC encoder with linear encoding complexity. The encoder is flexible, supporting arbitrary H matrices, rates and block lengths. We develop a software preprocessor to bring the parity-check matrix H into an approximate lower triangular form. A hardware architecture with an efficient memory organization for storing and performing computations on sparse matrices is proposed.

An implementation of an encoder for a rate 1/2 irregular LDPC code of length 2000 bits on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 4% of the device. It runs at 143MHz and has a throughput of 45 million codeword bits per second (or 22 million information bits per second) with a latency of 0.18ms. An implementation of 16 instances of the encoder on the same device at 82MHz is capable of 410 million codeword bits per second, 80 times faster than an Intel Pentium 4 2.4GHz PC. The design flow of our LDPC encoder is illustrated in Figure 1.6. This block is placed in front of the noise generator in our LDPC simulation framework (Figure 1.5).
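To see why a triangular structure in H leads to linear-time encoding, consider the simplified case in which the parity part of H is exactly lower triangular. The approximate lower triangular form of [152] additionally handles a small number of remaining rows, which this sketch omits. Each parity bit then follows from one sparse row by back-substitution over GF(2). The row-list representation, the function names and the tiny example matrix are assumptions for illustration.

```python
def encode_lower_triangular(h_rows, message):
    """Encode with H = [A | T], where T is lower triangular with a unit diagonal.

    h_rows[i] lists the column indices of the ones in row i of H.
    Columns 0..k-1 hold message bits; column k+i holds parity bit i.
    The work is proportional to the number of ones in H.
    """
    k = len(message)
    codeword = list(message) + [0] * len(h_rows)
    for i, columns in enumerate(h_rows):
        diagonal = k + i
        if diagonal not in columns or max(columns) != diagonal:
            raise ValueError("row %d is not in lower triangular form" % i)
        parity = 0
        for c in columns:
            if c != diagonal:
                parity ^= codeword[c]  # already known message or parity bit
        codeword[diagonal] = parity
    return codeword


def syndrome(h_rows, codeword):
    """Return H * codeword over GF(2); all zeros for a valid codeword."""
    return [sum(codeword[c] for c in columns) % 2 for columns in h_rows]


# Hypothetical 3 x 6 sparse parity-check matrix for a rate 1/2 toy code.
H_ROWS = [[0, 1, 3], [1, 2, 3, 4], [0, 4, 5]]
word = encode_lower_triangular(H_ROWS, [1, 0, 1])
print(word, syndrome(H_ROWS, word))
```

Storing only the positions of the ones, as done here with index lists, mirrors the idea of a memory organization tailored to sparse matrices.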


CHAPTER 2 Background

2.1 Introduction

The purpose of this chapter is to present the background material and related work of this thesis. Section 2.2 introduces the basics of FPGAs and the design tools used in this thesis. Section 2.3 introduces six of the most popular methods for approximating functions and the existing work. Section 2.4 discusses various issues, such as range reduction, related to function evaluation. Section 2.5 presents different ways of generating Gaussian noise and explores the existing work in this area. Finally, Section 2.6 introduces the basics of LDPC codes and LDPC encoding, describes Richardson and Urbanke’s (RU) method for efficiently encoding LDPC codes and looks at previous work on hardware related issues on LDPC codes.

2.2 FPGAs

2.2.1 Introduction

Field-Programmable Gate Arrays (FPGAs) have long been used for glue logic and prototyping. More recently, they are being used for many real-life applications including communications [93], encryption [173], video image processing [168], [175], medical imaging [72], network security [96] and numerical computations [104].

Figure 2.1: Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic cells.

FPGAs can potentially combine the execution speed of application specific hardware with the rapid programming time of microprocessors. In recent years, the size of FPGAs has followed Moore’s law: the number of logic gates doubles every 18 months. FPGAs can exploit improvements following Moore’s law better than microprocessors because of their simpler and more regular structure.

The fundamental building block of Xilinx FPGAs is the logic cell [118]. A logic cell comprises a 4-input look-up table (which can also act as a 16 × 1 RAM or a 16-bit shift register), a multiplexer and a register. A simplified view of a logic cell is depicted in Figure 2.1. Two logic cells are paired together in an element called a slice. A slice contains additional resources such as multiplexers and carry logic to increase the efficiency of the architecture. These extra resources are equivalent to having more logic cells, and therefore a slice is counted as being equivalent to 2.25 logic cells. Recent-generation reconfigurable hardware has a large number of slices. For instance, the Xilinx Virtex-II XC2V4000-6 has 23040 slices.

The architecture of a typical FPGA is illustrated in Figure 2.2. In general, an FPGA will have an array of configurable logic blocks (which contain two or four slices depending on the FPGA family), programmable wires, and programmable switches to realize any function out of the logic blocks and implement any interconnection topology. Programming is done using one of the many popular technologies such as SRAM cells, antifuses, EPROM transistors and EEPROM transistors. In addition to logic blocks, state-of-the-art FPGAs such as the Xilinx Virtex-II or Virtex-4 devices contain embedded hardware elements for memory, multiplication, multiply-and-add and even a number of hard microprocessor cores (such as the IBM PowerPC) [189].

Figure 2.2: Architecture of a typical FPGA.

The long IC fabrication time is completely eliminated for these devices, and design realization times are only a few hours. User-programmability is attractive: most ASIC vendors now prefer FPGAs for low cost prototyping and for fine tuning of designs before fabrication. Also, from a marketing point of view, FPGA technology allows quick product announcements, which is commercially attractive. The two major FPGA vendors are Altera and Xilinx. A good review on configurable computing and FPGAs is given in [28].


2.2.2 Design Tools

The following three FPGA design tools are used for the implementations presented in this thesis:

• ASC [123], A Stream Compiler for FPGAs, adopts C++ custom types and operators to provide a programming interface at the algorithmic, architectural, arithmetic and gate levels. As a unique feature, all levels of abstraction are accessible from C++. This enables the user to program at the desired level for each part of the application. Semi-automated design space exploration further increases design productivity, while supporting optimization at all available levels of abstraction. Object-oriented design enables efficient code reuse; ASC includes an integrated arithmetic unit generation library, PAM-Blox II [121], which in turn builds upon the PamDC [137] gate library. The elementary function evaluation units in Chapters 3 and 4 are implemented with this tool.

• Handel-C [21] is based on ANSI-C with extensions to support flexible width variables, signals, parallel blocks, bit-manipulation operations and channel communication. A distinctive feature is that timing of the compiled circuit is fixed at one cycle per C assignment. This makes it easy for programmers to know in which cycle a statement will be executed at the expense of reducing the scope for optimization. It gives application developers the ability to schedule hardware resources manually, and Handel-C tools generate the resulting designs automatically. The ideas of Handel-C are based on work by Page and Luk in compiling Occam into FPGAs [134]. The Gaussian noise generator using the Box-Muller method in Chapter 6 is implemented with this tool.


• Xilinx System Generator [188] is a plug-in to the MATLAB Simulink software [117] and provides a bit-accurate model of FPGA circuits. It automatically generates synthesizable VHDL or Verilog code including a testbench. Other unique capabilities include MATLAB m-code compilation, fast system-level resource estimation, and high-speed hardware co-simulation interfaces, both a generic JTAG interface [31] and PCI based co-simulation for FPGA hardware platforms. The Xilinx Blockset in Simulink enables bit-true and cycle-true modeling, and includes common parameter blocks such as finite impulse response (FIR) filter, fast Fourier transform (FFT), logic gates, adders, multipliers, RAMs, etc. Moreover, most of these blocks utilize the Xilinx cores, which are highly optimized for Xilinx devices. The function evaluator using the hierarchical segmentation method (HSM) in Chapter 5, the Gaussian noise generator using the Wallace method in Chapter 7, and the LDPC encoder in Chapter 9 are implemented with this tool.

ASC designs are synthesized with PAM-Blox II and all others with Synplicity Synplify Pro (versions 7.3 ∼ 7.5). Place-and-route for all designs is performed with Xilinx ISE (versions 6.0 ∼ 6.2).

2.3 Function Evaluation Methods

Many FPGA applications including digital signal processing, computer graphics and scientific computing require the evaluation of elementary or special purpose functions. For applications that require low precision approximation at high speeds, full look-up tables are often employed. However, this becomes impractical for precisions higher than a few bits, because the size of the table grows exponentially with respect to the input size. Six well known methods are described below, which are better suited to high precision.
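The exponential growth is easy to quantify: a full table for an n-bit input and an m-bit output stores 2^n entries of m bits each. A minimal sketch, with a hypothetical function name:

```python
def full_table_bits(input_bits, output_bits):
    """Storage in bits of a full look-up table: 2**input_bits entries."""
    return (1 << input_bits) * output_bits


for width in (8, 16, 24):
    print(width, full_table_bits(width, width))
```

For 8-bit inputs and outputs the table needs 2048 bits, whereas at 24 bits it needs roughly 4 × 10⁸ bits, which motivates the table-with-polynomial and polynomial-only methods.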

2.3.1 CORDIC

CORDIC is an acronym for COordinate Rotation DIgital Computer and offers the opportunity to calculate desired functions in a rather simple and elegant way. The CORDIC algorithm was first introduced by Volder [178] for the computation of trigonometric functions, multiplication, division and data type conversion, and later generalized to hyperbolic functions by Walther [182]. It has found its way into diverse applications including the 8087 math coprocessor [38], the HP-35 calculator, radar signal processors and robotics.

It is based on simple iterative equations, involving only shift and add operations and was developed in an effort to avoid the time consuming multiply and divide operations. The general CORDIC algorithm consists of the following three iterative equations:

    x_{k+1} = x_k − m δ_k y_k 2^{−k}
    y_{k+1} = y_k + δ_k x_k 2^{−k}
    z_{k+1} = z_k − δ_k σ_k

The constants m, δ_k and σ_k depend on the specific computation being performed, as explained below.

• m is either 0, 1 or −1. m = 1 is used for trigonometric and inverse trigonometric functions. m = −1 is used for hyperbolic, inverse hyperbolic, exponential and logarithmic functions, as well as square roots. Finally, m = 0 is used for multiplication and division.
