### Imperial College London

## Hardware Designs for Function Evaluation and LDPC Coding

A thesis submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computing

by

### Dong-U Lee

October 2004

© Copyright by
Dong-U Lee
October 2004

*To my parents for their love and support,*
*and my country Korea. . .*

### Acknowledgments

I thank my supervisor Prof. Wayne Luk for his advice and direction on both academic and non-academic issues. I would also like to thank Prof. John D. Villasenor from UCLA, Prof. Philip H.W. Leong from the Chinese University of Hong Kong, Prof. Peter Y.K. Cheung from the Department of EEE and Dr. Oskar Mencer from the Department of Computing for their help on my research topics.

Many thanks to my colleagues Altaf Abdul Gaffar, Andreas Fidjeland, Anthony Ng, Arran Derbyshire, Danny Lee, David Pearce, David Thomas, Henry Styles, Jose Gabriel de Figueiredo Coutinho, Jun Jiang, Ray Cheung, Shay Ping Seng, Sherif Yusuf, Tero Rissa and Tim Todman from Imperial College, Chris Jones, Connie Wang, David Choi, Esteban Vallés and Mike Smith from UCLA, and Dr. Guanglie Zhang from the Chinese University of Hong Kong for their assistance. I am especially thankful to Altaf Abdul Gaffar and Ray Cheung, who helped me with numerous Linux programming tasks, and Tim Todman, who proofread this thesis.

The financial support of Celoxica Limited, Xilinx Inc., the U.K. Engineering and Physical Sciences Research Council PhD Studentship from the Department of Computing, Imperial College, and the U.S. Office of Naval Research is gratefully acknowledged.

### Abstract of the Thesis

Hardware-based implementations are desirable because they can be several orders of magnitude faster than software-based methods. Reconfigurable devices such as Field-Programmable Gate Arrays (FPGAs) are ideal candidates for this purpose because of their speed and flexibility. Three main achievements are presented in this thesis: function evaluation, Gaussian noise generation, and Low-Density Parity-Check (LDPC) code encoding. First, our function evaluation research covers both elementary functions and compound functions. For elementary functions, we automate the design of function evaluation units covering table look-up, table-with-polynomial and polynomial-only methods. We also illustrate a framework for adaptive range reduction based on a parametric function evaluation library. The proposed approach is evaluated by exploring the effects of several arithmetic functions on throughput, latency and area for FPGA designs.
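The table-with-polynomial approach can be sketched in software. The example below is a minimal Python illustration, not the thesis's MATLAB/ASC flow, and it uses interpolated rather than minimax coefficients: one set of degree-2 coefficients is stored per uniform segment, and each evaluation indexes the table and applies Horner's rule.

```python
import math

def build_table(f, a, b, n_segments):
    """Fit one degree-2 polynomial per uniform segment of [a, b] by
    interpolating the segment's endpoints and midpoint.  (A real design
    flow would use minimax fits; interpolation keeps the sketch
    self-contained.)  Coefficients are in the local variable t = x - lo."""
    h = (b - a) / n_segments
    table = []
    for i in range(n_segments):
        lo = a + i * h
        f0, fm, f1 = f(lo), f(lo + h / 2), f(lo + h)
        c2 = 2.0 * (f1 - 2.0 * fm + f0) / (h * h)
        c1 = (4.0 * fm - 3.0 * f0 - f1) / h
        table.append((lo, (c2, c1, f0)))
    return h, table

def evaluate(x, a, h, table):
    """Look up the segment from the input, then apply Horner's rule."""
    i = min(int((x - a) / h), len(table) - 1)
    lo, coeffs = table[i]
    t, y = x - lo, 0.0
    for c in coeffs:              # Horner: (c2*t + c1)*t + c0
        y = y * t + c
    return y

h, table = build_table(math.sin, 0.0, math.pi / 2, 64)
```

With 64 segments over [0, π/2], this approximation to sin(x) is accurate to well below 10^{-5}; the trade-off between table size and polynomial degree is the central design question explored in Chapters 3 to 5.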

For compound functions, which are often non-linear, we present an evaluation method based on piecewise polynomial approximation with a novel hierarchical segmentation scheme, which combines uniform segments and segments with sizes varying by powers of two. Second, our research on Gaussian noise generation results in two hardware architectures, which can be used for Monte Carlo simulations such as evaluating the performance of LDPC codes. The first design is based on the Box-Muller method and the central limit theorem, while the second is based on the Wallace method. The quality of the noise produced by the two generators is characterized with various statistical tests. We also examine how the design parameters of the Wallace method affect the noise quality.
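The mathematics behind the Box-Muller design, and the central-limit-theorem step used to improve tail behaviour, can be sketched as follows. This is a full-precision software illustration, not the piecewise-approximated hardware architecture of Chapter 6:

```python
import math
import random

def box_muller(uniform=random.random):
    """One Box-Muller transform: two uniform samples in (0, 1] become
    two independent N(0, 1) samples.  The hardware design approximates
    the sqrt/log and sin/cos terms with piecewise linear segments; this
    sketch uses full-precision arithmetic."""
    u1 = 1.0 - uniform()                  # shift to (0, 1] so log() is safe
    u2 = uniform()
    r = math.sqrt(-2.0 * math.log(u1))
    return (r * math.cos(2.0 * math.pi * u2),
            r * math.sin(2.0 * math.pi * u2))

def accumulated_sample(k=2):
    """Central-limit-theorem step: summing k independent samples and
    rescaling by 1/sqrt(k) preserves unit variance while smoothing
    residual approximation error in the tails."""
    return sum(box_muller()[0] for _ in range(k)) / math.sqrt(k)
```

In hardware, the two functions f(u) = sqrt(-2 ln u) and the trigonometric terms are evaluated with segmented linear approximations, and the accumulation step compensates for the resulting distortion of the distribution.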

Third, our research on LDPC encoding results in a flexible hardware encoder for regular and irregular LDPC codes. Our architecture, based on an encoding method proposed by Richardson and Urbanke, has linear encoding complexity.
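Besides sparse matrix-vector products over GF(2), the key primitive in this encoding method is forward substitution with the sparse lower-triangular matrix T. A minimal software sketch of that primitive follows; the sparse row representation is an assumption of the sketch, not the encoder's memory layout:

```python
def forward_substitution_gf2(T_rows, b):
    """Solve T y = b over GF(2), where T is lower triangular with a unit
    diagonal.  T_rows[i] lists the column indices of the ones in row i.
    GF(2) addition is XOR, so each bit of y follows directly from the
    already-solved bits."""
    y = [0] * len(b)
    for i, cols in enumerate(T_rows):
        acc = b[i]
        for j in cols:
            if j < i:        # strictly-lower entries reference solved bits
                acc ^= y[j]
            # j == i is the unit diagonal, contributing y[i] itself
        y[i] = acc
    return y

# A 3x3 example: rows {0}, {0,1}, {1,2} encode
# T = [[1,0,0],[1,1,0],[0,1,1]].
y = forward_substitution_gf2([[0], [0, 1], [1, 2]], [1, 0, 1])
```

Because each row of T is sparse, one pass of this loop costs time proportional to the number of ones in T, which is what gives the encoder its linear complexity.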

### Table of Contents

1 Introduction . . . 5

1.1 Objectives and Contributions . . . 5

1.2 Computer Arithmetic . . . 8

1.3 Error Correcting Coding and LDPC Codes . . . 9

1.4 Overview of our Approach . . . 11

1.4.1 Function Evaluation . . . 11

1.4.2 Gaussian Noise Generation . . . 13

1.4.3 LDPC Encoding . . . 18

2 Background . . . 20

2.1 Introduction . . . 20

2.2 FPGAs . . . 20

2.2.1 Introduction . . . 20

2.2.2 Design Tools . . . 23

2.3 Function Evaluation Methods . . . 24

2.3.1 CORDIC . . . 25

2.3.2 Digit-recurrence and On-line Algorithms . . . 26

2.3.3 Bipartite and Multipartite Methods . . . 27

2.3.4 Polynomial Approximation . . . 28

2.3.5 Polynomial Approximation with Non-uniform Segmentation . . . 30

2.3.6 Rational Approximation . . . 31

2.4 Issues on Function Evaluation . . . 31

2.4.1 Evaluation of Elementary and Compound Functions . . . . 32

2.4.2 Approximation Method Selection . . . 32

2.4.3 Range Reduction . . . 33

2.4.4 Types of Errors . . . 35

2.5 Gaussian Noise Generation . . . 36

2.6 LDPC Codes . . . 38

2.6.1 Basics of LDPC Codes . . . 38

2.6.2 LDPC Encoding . . . 42

2.6.3 RU LDPC Encoding Method . . . 43

2.6.4 Hardware Aspects of LDPC Codes . . . 49

2.7 Summary . . . 50

3 Automating Optimized Table-with-Polynomial Function Evaluation . . . 52

3.1 Introduction . . . 52

3.2 Overview . . . 53

3.3 Algorithmic Design Space Exploration with MATLAB . . . 54

3.4 Hardware Design Space Exploration with ASC . . . 57

3.5 Verification with ASC . . . 59

3.6 Results . . . 60

3.7 Summary . . . 70

4 Adaptive Range Reduction for Function Evaluation . . . 71

4.1 Introduction . . . 71

4.2 Overview . . . 72

4.3 Design . . . 73

4.3.1 Design Overview . . . 74

4.3.2 Degrees of Freedom . . . 80

4.4 Implementation . . . 82

4.4.1 Algorithmic Design Space Exploration . . . 83

4.4.2 ASC Code Generation and Optimizations . . . 87

4.5 Results . . . 88

4.6 Summary . . . 100

5 The Hierarchical Segmentation Method for Function Evaluation . . . 101

5.1 Introduction . . . 101

5.2 Related Work . . . 103

5.3 Optimum Placement of Segments . . . 104

5.4 The Hierarchical Segmentation Method . . . 113

5.5 Architecture . . . 124

5.6 Error Analysis . . . 125

5.7 The Effects of Polynomial Degrees . . . 127

5.8 Evaluation and Results . . . 133

5.9 Summary . . . 143

6 Gaussian Noise Generator using the Box-Muller Method . . . 144

6.1 Introduction . . . 144

6.2 Related Work . . . 147

6.3 Architecture . . . 148

6.4 Function Evaluation for Non-uniform Segmentation . . . 152

6.5 Function Evaluation for Noise Generator . . . 156

6.6 Implementation . . . 162

6.7 Evaluation and Results . . . 165

6.8 Summary . . . 173

7 Gaussian Noise Generator using the Wallace Method . . . 175

7.1 Introduction . . . 175

7.2 The Wallace Method . . . 176

7.3 Architecture . . . 178

7.3.1 The First Stage . . . 181

7.3.2 The Second Stage . . . 182

7.3.3 The Third Stage . . . 182

7.3.4 The Fourth Stage . . . 185

7.4 Implementation . . . 186

7.5 Evaluation and Results . . . 193

7.6 Summary . . . 203

8 Design Parameter Optimization for the Wallace Method . . . 204

8.1 Introduction . . . 204

8.2 Overview of the Wallace Method . . . 205

8.3 Measuring the Wallace Correlations . . . 208

8.4 Reducing the Wallace Correlations . . . 211

8.5 Performance Comparisons . . . 214

8.6 Hardware Design with Optimized Parameters . . . 220

8.7 Summary . . . 225

9 Flexible Hardware Encoder for LDPC Codes . . . 226

9.1 Introduction . . . 226

9.2 Overview . . . 228

9.3 Preprocessing . . . 231

9.4 Encoder Architecture . . . 235

9.5 Components for the Encoder . . . 239

9.5.1 Vector Addition . . . 239

9.5.2 Matrix-Vector Multiplication . . . 239

9.5.3 Forward-Substitution . . . 241

9.6 Implementation and Results . . . 242

9.7 Summary . . . 253

10 Conclusions . . . 257

10.1 Summary . . . 257

10.2 Future Work . . . 261

10.2.1 Function Evaluation . . . 261

10.2.2 Gaussian Noise Generation . . . 263

10.2.3 LDPC Coding . . . 264

References . . . 265

### List of Figures

1.1 Relations of the chapters in this thesis . . . 7

1.2 Design flow for evaluating elementary functions . . . 13

1.3 Design flow for evaluating non-linear functions using the hierarchical segmentation method . . . 14

1.4 The BenONE board from Nallatech used to run our LDPC simulation experiments . . . 16

1.5 Our LDPC hardware simulation framework . . . 17

1.6 LDPC encoding framework . . . 18

2.1 Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic cells . . . 21

2.2 Architecture of a typical FPGA . . . 22

2.3 Certain approximation methods are better than others for a given metric at different precisions . . . 33

2.4 Area comparison in terms of configurable logic blocks for different methods with varying data widths [122] . . . 34

2.5 Comparison of (3,6)-regular LDPC code, Turbo code and optimized irregular LDPC code [151] . . . 39

2.6 LDPC communication system model . . . 40

2.7 A bipartite graph of a (3,6)-regular LDPC code of length ten and rate 1/2. There are ten variable nodes and five check nodes. For each check node C_{i}, the sum (over GF(2)) of all adjacent variable nodes is equal to zero . . . 41

2.8 An equivalent parity-check matrix in lower triangular form . . . 43

2.9 The parity-check matrix in approximate lower triangular form . . . 44

3.1 Block diagram of methodology for automation . . . 55

3.2 Principles behind automatic design optimization with ASC . . . 56

3.3 Accuracy graph: maximum error versus bitwidth for sin(x) with the three methods . . . 58

3.4 Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates the metric for which the design is optimized. Lower part: LUTs for logic; small top part: LUTs for routing . . . 62

3.5 Latency versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of latency optimization . . . 62

3.6 Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of throughput optimization . . . 63

3.7 Latency versus area for 12-bit approximations to sin(x). The Pareto-optimal points [124] in the latency-area space are shown . . . 63

3.8 Latency versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the latency-throughput space are shown . . . 64

3.9 Area versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the throughput-area space are shown . . . 64

3.10 Area versus bitwidth for the three functions with TABLE+POLY. Lower part: LUTs for logic; small top part: LUTs for routing . . . 67

3.11 Latency versus bitwidth for the three functions with TABLE+POLY . . . 67

3.12 Throughput versus bitwidth for the three functions with TABLE+POLY. Throughput is similar across functions, as expected . . . 68

3.13 Area versus bitwidth for sin(x) with the three methods. Note that the TABLE method already becomes too large at 14 bits . . . 68

3.14 Latency versus bitwidth for sin(x) with the three methods . . . 69

3.15 Throughput versus bitwidth for sin(x) with the three methods . . . 69

4.1 Design flow: MATLAB generates all the ASC code for the library. The user simply indexes into the library to obtain the specific function approximation unit . . . 73

4.2 Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x . . . 75

4.3 Circuit for evaluating sin(x) . . . 76

4.4 Circuit for evaluating log(x) . . . 77

4.5 Circuit for evaluating √x . . . 78

4.6 Plot of the three functions over the range reduced intervals . . . 79

4.7 Segmentation for evaluating log(y) with eight uniform segments. The leftmost three bits of the inputs are used as the segment index . . . 82

4.8 Architecture of table-with-polynomial unit for degree d polynomials. Horner's rule is used to evaluate the polynomials . . . 83

4.9 ASC code for evaluating sin(x) for range 8 bits and precision 8 bits with tp2. This code is automatically generated from our MATLAB tool . . . 86

4.10 Area matrix which tells us for each input range/precision combination which design to use for minimum area . . . 91

4.11 Latency matrix which tells us for each input range/precision combination which design to use for minimum latency . . . 91

4.12 Area cost of range reduction (upper part) for sin(x) implemented using po with the designs optimized for area . . . 92

4.13 Area cost of range reduction (upper part) for sin(x) implemented using tp3 with the designs optimized for area . . . 92

4.14 Area cost of range reduction (upper part) for log(x) implemented using po with the designs optimized for area . . . 93

4.15 Area cost of range reduction (upper part) for log(x) implemented using tp3 with the designs optimized for area . . . 93

4.16 Area for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area . . . 94

4.17 Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency . . . 94

4.18 Area for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area . . . 95

4.19 Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency . . . 95

4.20 Area versus precision for sin(x) using tp3 for different ranges and optimization . . . 96

4.21 Latency versus precision for sin(x) using tp3 for different ranges and optimization . . . 96

4.22 Area versus range for all three functions using different methods with the precision fixed at eight bits optimized for area . . . 97

4.23 Latency versus range for all three functions using different methods with the precision fixed at eight bits optimized for latency . . . 97

4.24 Area versus range for all three functions using po for different precisions optimized for area . . . 98

4.25 Latency versus range for all three functions using po for different precisions optimized for latency . . . 98

4.26 Area versus range for all three functions using po for different precisions optimized for area . . . 99

4.27 Latency versus range for all three functions using po for different precisions optimized for latency . . . 99

5.1 MATLAB code for finding the optimum boundaries . . . 109

5.2 Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation . . . 110

5.3 Numbers of optimum segments for first order approximations to the functions for various operand bitwidths . . . 111

5.4 Numbers of optimum segments for second order approximations to the functions for various operand bitwidths . . . 111

5.5 Ratio of the number of optimum segments required for first and second order approximations to the functions . . . 112

5.6 Circuit to calculate the P2S address for a given input δ_{i}, where δ_{i} = a_{v−1}a_{v−2}...a_{0}. The adder counts the number of ones in the output of the two prefix circuits . . . 115

5.7 Main MATLAB code for finding the hierarchical boundaries and their polynomial coefficients . . . 119

5.8 Variation of total number of segments against v_{0} for a 16-bit second order approximation to f_{3} . . . 120

5.9 The segmented functions generated by HFS for 16-bit second order approximations. f_{1}, f_{2}, f_{3} and f_{4} employ P2S(US), P2SL(US), US(US) and US(US) respectively. The black and grey vertical lines are the boundaries for the outer and inner segments respectively . . . 121

5.10 Design flow of our approach . . . 123

5.11 HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ‘:’ is a concatenation operator . . . 130

5.12 Variations of the table sizes to the four functions with varying polynomial degrees and operand bitwidths . . . 131

5.13 Variations of the HSM/Optimum segment ratio with polynomial degrees and operand bitwidths . . . 132

5.14 Xilinx System Generator design template used for first order US(US) . . . 135

5.15 Xilinx System Generator design template used for second order P2SL(US) . . . 136

5.16 Error in ulp for 16-bit second order approximation to f_{3} . . . 137

6.1 Gaussian noise generator architecture. The black boxes are buffers . . . 150

6.2 The f function. The asterisks indicate the boundaries of the linear approximations . . . 153

6.3 Circuit to calculate the segment address for a given input x. The adder counts the number of ones in the output of the two prefix circuits. Note that the least-significant bit x_{0} is not required . . . 155

6.4 Function evaluator architecture based on non-uniform segmentation . . . 157

6.5 Variation of function approximation error with number of bits for the gradient of the f function . . . 158

6.6 The g functions. Only the thick line is approximated; see Figure 4. The most significant 2 bits of u_{2} are used to choose which of the four regions to use; the remaining bits select a location within Region 0 . . . 159

6.7 Approximation for g_{1} over [0, 1/4). The asterisks indicate the segment boundaries of the linear approximations . . . 160

6.8 Approximation error to f. The worst case and average errors are 0.031 and 0.000048 respectively . . . 161

6.9 Approximation error to g_{1}. The worst case and average errors are 0.00079 and 0.0000012 respectively . . . 162

6.10 PDF of the generated noise with 17 approximations for f and 6 for g for a population of four million. The p-values of the χ^{2} and A-D tests are 0.00002 and 0.0084 respectively . . . 169

6.11 PDF of the generated noise with 59 approximations for f and 21 for g for a population of four million. The p-values of the χ^{2} and A-D tests are 0.0012 and 0.3487 respectively . . . 169

6.12 PDF of the generated noise with 59 approximations for f and 21 for g with two accumulated samples for a population of four million. The p-values of the χ^{2} and A-D tests are 0.3842 and 0.9058 respectively . . . 170

6.13 Scatter plot of two successive accumulative noise samples for a population of 10000. No obvious correlations can be seen . . . 170

6.14 Variation of output rate against the number of noise generator instances . . . 173

7.1 Overview of the Wallace method . . . 177

7.2 Overview of our Gaussian noise generator architecture based on the Wallace method. The triangle in Stage 4 is a constant coefficient multiplier . . . 179

7.3 The transformation circuit of Stage 3. The square boxes are registers. The select signals for the multiplexors and the clock enable signals for the registers are omitted for simplicity . . . 183

7.4 Detailed timing diagram of the transformation circuit and the dual-port “Pool RAM”. A_{z} indicates the address of the data z and WE is the write enable signal of the “Pool RAM” . . . 184

7.5 Wallace architecture Stage 1 in Xilinx System Generator. The 30 LFSRs generate uniform random bits for Stage 2 . . . 188

7.6 Wallace architecture Stage 2 in Xilinx System Generator. Pseudo-random addresses for p, q, r, s are generated . . . 189

7.7 Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator. Orthogonal transformation is performed and sum of squares corrected . . . 190

7.8 Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA . . . 192

7.9 Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA . . . 192

7.10 Scatter plot of two successive noise samples for a population of 10000. No obvious correlations can be seen . . . 195

7.11 PDF of the generated noise from our design for a population of one million. The p-values of the χ^{2} and A-D tests are 0.9994 and 0.2332 respectively . . . 196

7.12 PDF of the generated noise from our design for a population of four million. The p-values of the χ^{2} and A-D tests are 0.7303 and 0.8763 respectively . . . 197

7.13 PDF of the generated noise from the Xilinx block for a population of one million. The p-values of the χ^{2} and A-D tests are 0.0000 and 0.0002 respectively . . . 198

7.14 Variation of the χ^{2} test p-value with sample size for the Xilinx block, 12-bit, 16-bit, 20-bit and 24-bit Wallace implementations . . . 200

7.15 Variation of output rate against the number of noise generator instances . . . 202

8.1 Pseudo code of the Wallace method . . . 207

8.2 Four million samples of blocks immediately following the block containing a 5σ output, evaluated with the χ^{2} test with 200 bins over [−7, 7] for FastNorm2. The χ^{2}_{199} contributions of each of the bins are shown . . . 209

8.3 The χ^{2}_{199} values of blocks relative to a block containing a realization with absolute value of 5σ or higher. Four million samples are compiled for each block. The dotted horizontal line indicates the 0.05 confidence level . . . 210

8.4 Impact of various design choices on the χ^{2}_{199} value. Four million samples are compiled from the block immediately after each block containing an absolute value of 5σ or higher for each data point. The dotted horizontal line indicates the 0.05 confidence level . . . 222

8.5 Speed comparisons at various K at N = 4096 and R = 1. Lower part: arithmetic operations. Upper part: table accesses . . . 223

8.6 Speed comparisons for different parameter choices. The solid, dashed and dotted lines are for R = 1, R = 2 and R = 3 respectively . . . 223

8.7 Execution times for different pool sizes at R = 1 and K = 16. The solid and dotted lines are for the Athlon XP and the Pentium 4 processors respectively . . . 224

8.8 Level 2 cache miss rates on the SimpleScalar x86 simulator for different pool sizes at R = 1, K = 16 and various level 2 cache sizes. Level 1 cache is fixed at 16KB and 65536 noise samples are generated for each data point . . . 224

9.1 The parity-check matrix H in ALT form. A, B, C, and E are sparse matrices, D is a dense matrix, and T is a sparse lower triangular matrix . . . 228

9.2 LDPC encoding framework . . . 229

9.3 An equivalent parity-check matrix in lower triangular form. Note that n = block length and m = block length × (1 − code rate) . . . 230

9.4 Different starting columns for H and H^{T} . . . 235

9.5 Overview of our hardware encoder architecture. Double buffering is used between the stages for concurrent execution. Grey and white boxes indicate RAMs and operations respectively . . . 236

9.6 Circuit for vector addition (VA) . . . 239

9.7 Circuit for matrix-vector multiplication (MVM) . . . 241

9.8 Circuit for forward-substitution (FS) . . . 243

9.9 Scatter plot of a preprocessed irregular 500 × 1000 H matrix in ALT form with a gap of two. Ones appear as dots . . . 245

9.10 The four stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG . . . 246

9.11 LDPC encoder architecture Stage 2 and stage controller in Xilinx System Generator . . . 247

9.12 The matrix-vector multiplication (MVM) circuit in Xilinx System Generator . . . 248

9.13 The forward-substitution (FS) circuit in Xilinx System Generator . . . 249

9.14 Variation of throughput with the number of encoder instances . . . 255

### List of Tables

2.1 Maximum absolute and average errors for various first order polynomial approximations to e^{x} over [−1, 1] . . . 29

2.2 Efficient computation of p^{T}_{1} = −φ^{−1}(−ET^{−1}A + C)s^{T} . . . 46

2.3 Efficient computation of p^{T}_{2} = −T^{−1}(As^{T} + Bp^{T}_{1}) . . . 47

2.4 Summary of the RU encoding procedure . . . 48

3.1 Various place and route results of 12-bit approximations to sin(x). The logic minimized LUT implementation of the tables minimizes latency and area, while keeping comparable throughput to the other methods, e.g. block RAM (BRAM) based implementation . . . 59

5.1 The ranges for P2S addresses for Λ_{1} = P2S, n = 8, v_{0} = 5 and v_{1} = 3. The five P2S address bits δ_{0} are highlighted in bold . . . 114

5.2 Number of segments for second order approximations to the four functions. Results for uniform, HSM and optimum are shown . . . 122

5.3 Comparison of direct look-up, SBTM, STAM and HSM for 16 and 24-bit approximations to f_{2}. The subscript for HSM denotes the polynomial degree, and the subscript for STAM denotes the number of multipartite tables used. Note that SBTM is equivalent to STAM_{2} . . . 139

5.4 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for 16 and 24-bit, first and second order approximations to f_{2} and f_{3} . . . 140

5.5 Widths of the data paths, number of segments, table size and percentage of exactly rounded results for 16 and 24-bit second order approximations to f_{2} and f_{3} . . . 141

5.6 Performance comparison: computation of f_{2} and f_{3} functions. The Athlon and the Pentium 4 PCs are equipped with 512MB and 1GB DDR-SDRAMs respectively . . . 142

6.1 Comparing two segmentation methods. Second column shows the comparison of the number of segments for non-uniform and uniform segmentation. Third column shows the number of bits used for the coefficients to approximate f and g_{1} . . . 163

6.2 Performance comparison: time for producing one billion Gaussian noise samples. All PCs are equipped with 1GB DDR-SDRAM . . . 171

7.1 Resource utilization for the four stages of the noise generator on a Xilinx Virtex-II XC2V4000-6 FPGA . . . 191

7.2 Hardware implementation results of the noise generator using different types of FPGA resources on a Xilinx Virtex-II XC2V4000-6 FPGA . . . 193

7.3 Comparisons of different hardware Gaussian noise generators implemented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs generate a noise sample every clock . . . 199

7.4 Hardware implementation results on a Xilinx Virtex-II XC2V4000-6 FPGA for different numbers of noise generator instances. The device has 23040 slices, 120 block RAMs and 120 embedded multipliers in total . . . 201

7.5 Performance comparison: time for producing one billion Gaussian noise samples . . . 202

8.1 Number of arithmetic operations per transform/sample for the transformation at various sizes of K . . . 214

8.2 Specifications of the AMD Athlon XP and Intel Pentium 4 platforms used in our experiments . . . 216

8.3 Details of the AMD Athlon XP and Intel Pentium 4 data caches . . . 217

8.4 Execution time in nanoseconds for the AMD Athlon XP and Intel Pentium 4 platforms at N = 4096 . . . 218

8.5 Performance comparison of different software Gaussian random number generators. The Wallace implementations use N = 4096, R = 1 and K = 16 . . . 220

9.1 Computation of p^{T}_{1} = −F^{−1}(−ET^{−1}A + C)s^{T}. Note that T^{−1}[As^{T}] = y^{T} ⇒ T y^{T} = [As^{T}] . . . 232

9.2 Computation of p^{T}_{2} = −T^{−1}(As^{T} + Bp^{T}_{1}) . . . 232

9.3 Matrix X stored in memory. The location of the edges of each row and an extra bit indicating the end of a row are stored . . . 240

9.4 Preprocessing times and gaps for H matrices with rate 1/2 for various block lengths performed on a Pentium 4 2.4GHz PC equipped with 512MB DDR-SDRAM . . . 244

9.5 Dimensions and number of edges for the matrices A, B, T, C, F and E generated from a 1000 × 2000 irregular H matrix . . . 250

9.6 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for rate 1/2 for various block lengths . . . 252

9.7 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits for various rates . . . 253

9.8 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits and rate 1/2 for different numbers of encoder instances . . . 254

9.9 Performance comparison of block length of 2000 bits and rate 1/2 encoders: time for producing 410 million codeword bits . . . 256

### Abbreviations

A-D Anderson-Darling
ALT Approximate Lower Triangular
ASC A Stream Compiler
ASIC Application-Specific Integrated Circuit
AWGN Additive White Gaussian Noise
BER Bit Error Rate
CDF Cumulative Distribution Function
CORDIC COordinate Rotations DIgital Computer
CPC Cycles Per Codeword
CPS Codewords Per Second
CWG CodeWord Generation
DDR Double Data Rate
DSP Digital Signal Processor
ECC Error Correcting Coding
FPGA Field-Programmable Gate Array
FS Forward-Substitution
GF Galois Field
HFS Hierarchical Function Segmenter
HSM Hierarchical Segmentation Method
K-S Kolmogorov-Smirnov
LDGM Low-Density Generator-Matrix
LDPC Low-Density Parity-Check
LFSR Linear Feedback Shift Register
LNS Logarithmic Number Systems
LRU Least Recently Used
LUT Look-Up Table
Mbps Mega bits per second
MVM Matrix-Vector Multiplication
P2S Powers of 2 Segments
PDF Probability Distribution Function
po polynomial only
RAM Random Access Memory
ROM Read Only Memory
RU Richardson and Urbanke
S1 Stage 1
SBTM Symmetric Bipartite Table Method
SNR Signal to Noise Ratio
STAM Symmetric Table Addition Method
tp2 table-with-polynomial of degree 2
ulp unit in the last place
US Uniform Segments
VA Vector Addition
VHDL Very high speed integrated circuits Hardware Description Language
WOR WithOut Range reduction
WRR With Range Reduction

### Publications

Journal Papers

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Automating optimized hardware function evaluation”, submitted to *IEEE Transactions on Computers*, 2004.

P.H.W. Leong, G. Zhang, D. Lee, W. Luk and J.D. Villasenor, “A comment on the implementation of the Ziggurat method”, submitted to *Journal of Statistical Software*, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.H.W. Leong, “Design parameter optimization for the Wallace Gaussian random number generator”, submitted to *ACM Transactions on Modeling and Computer Simulation*, 2004.

D. Lee, W. Luk, J.D. Villasenor, G. Zhang and P.H.W. Leong, “A hardware Gaussian noise generator using the Wallace method”, submitted to *IEEE Transactions on VLSI*, 2004.

G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee and W. Luk, “Monte Carlo simulation using FPGAs”, submitted to *IEEE Transactions on VLSI*, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The hierarchical segmentation method for function evaluation”, submitted to *IEEE Transactions on Circuits and Systems I*, 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise generator for hardware-based simulations”, *IEEE Transactions on Computers*, volume 53, number 12, pages 1523-1534, 2004.

Book Chapter

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The effects of polynomial degrees on the hierarchical segmentation method”, chapter in *New Algorithms, Architectures, and Applications for Reconfigurable Computing*, W. Rosenstiel and P. Lysaght (Eds.), Kluwer Academic Publishers, 2004.

Conference Papers

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “MiniBit: Bit-width optimization via affine arithmetic”, submitted to *ACM/IEEE Design Automation Conference*, 2005.

D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Adaptive range reduction for hardware function evaluation”, in *Proceedings of IEEE International Conference on Field-Programmable Technology (FPT)*, pages 169-176, Brisbane, Australia, Dec 2004.

D. Lee, “Gaussian noise generation for Monte Carlo simulations in hardware”, in *Proceedings of The Korean Scientists and Engineers Association in the UK 30th Anniversary Conference*, pages 182-185, London, UK, Sep 2004.

D. Lee, O. Mencer, D.J. Pearce and W. Luk, “Automating optimized table-with-polynomial function evaluation for FPGAs”, in *Proceedings of International Conference on Field Programmable Logic and its Applications (FPL)*, pages 364-373, LNCS 3203, Springer-Verlag, Antwerp, Belgium, Aug 2004.

D. Lee, W. Luk, C. Wang, C. Jones, M. Smith and J.D. Villasenor, “A flexible hardware encoder for low-density parity-check codes”, in *Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)*, pages 101-111, Napa Valley, USA, Apr 2004.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hierarchical segmentation schemes for function evaluation”, in *Proceedings of IEEE International Conference on Field-Programmable Technology (FPT)*, pages 92-99, Tokyo, Japan, Dec 2003.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hardware function evaluation using non-linear segments”, in *Proceedings of International Conference on Field Programmable Logic and its Applications (FPL)*, pages 796-807, LNCS 2778, Springer-Verlag, Lisbon, Portugal, Sep 2003.

D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise generator for channel code evaluation”, in *Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)*, pages 69-78, Napa Valley, USA, Apr 2003.

D. Lee, T.K. Lee, W. Luk and P.Y.K. Cheung, “Incremental programming for reconfigurable engines”, in *Proceedings of IEEE International Conference on Field-Programmable Technology (FPT)*, pages 411-415, Shatin, Hong Kong, Dec 2002.

## CHAPTER 1

## Introduction

### 1.1 Objectives and Contributions

The objective of this thesis is to explore hardware designs for function evaluation, Gaussian noise generation and Low-Density Parity-Check (LDPC) code encoding.

Our main contributions are:

• Methodology for the automation of function evaluation unit design, covering table look-up, table-with-polynomial and polynomial-only methods (Chapter 3).

• Framework for adaptive range reduction based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input and output ranges (Chapter 4).

• Efficient hierarchical segmentation method based on piecewise polynomial approximations suitable for non-linear compound functions, which involves uniform segments and segments with size varying by powers of two (Chapter 5).

• Hardware Gaussian noise generator based on the Box-Muller method and the central limit theorem, capable of producing 133 million samples per second with 10% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 6).

• Hardware Gaussian noise generator based on the Wallace method, capable of producing 155 million samples per second with 3% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 7).

• Design parameter optimization for software implementations of the Wallace method to reduce correlations and execution time (Chapter 8).

• Linear complexity hardware encoder for regular and irregular LDPC codes, with an efficient architecture for storing and performing computation on sparse matrices (Chapter 9).

The most exciting contribution of this thesis is perhaps the hierarchical segmentation method presented in Chapter 5. It is a systematic method for producing fast and efficient hardware function evaluators for both compound and elementary functions using piecewise polynomial approximations with a novel hierarchical segmentation scheme. This method is particularly useful for approximating non-linear functions or curves, using significantly less memory than the traditional uniform segmentation approach. Depending on the function and precision, the memory requirements can be reduced by several orders of magnitude.

We believe that numerous applications can benefit from our approach, including data compression, function evaluation, non-linear filtering, pattern recognition and picture processing.

Although the designs in this thesis target FPGA technology, we believe that our methods are generic enough to be applied across different implementation technologies such as ASICs. FPGAs are simply used as a platform to demonstrate that our ideas can be efficiently mapped into hardware.

Figure 1.1 illustrates how the various chapters in this thesis are related to each other. The chapters on function evaluation are 3, 4 and 5. The chapters on LDPC coding are 6, 7, 8 and 9. Within the LDPC coding framework, Chapters 6, 7 and 8 are on Gaussian noise generation, which is needed for exploring LDPC code behavior in hardware. The Box-Muller method in Chapter 6 requires the evaluation of functions and uses a variant of the hierarchical segmentation method presented in Chapter 5.

Figure 1.1: Relations of the chapters in this thesis.

The rest of this chapter provides historical information and an overview of the material in Chapters 3 to 9. Chapter 2 covers background material and
previous work. Chapter 3 describes a methodology for the automation of el-
ementary function evaluation unit design. Chapter 4 presents a framework for
adaptive range reduction based on a parametric elementary function evaluation li-
brary. Chapter 5 presents an efficient hierarchical segmentation method suitable
for non-linear compound functions. Chapter 6 describes a hardware Gaussian
noise generator based on the Box-Muller method and the central limit theorem.

Chapter 7 presents a hardware Gaussian noise generator based on the Wallace method. Chapter 8 analyzes correlations that can occur in the Wallace method, and examines parameters to reduce correlations and execution time for software implementations. Chapter 9 describes an efficient hardware encoder with linear encoding complexity for both regular and irregular LDPC codes, and Chapter 10

offers conclusions and future work.

### 1.2 Computer Arithmetic

Arithmetic has played important roles in human civilization, especially in the areas of science, engineering and technology. Machine arithmetic can be traced back as early as 500 BC in the form of the abacus used in China. Many numerically intensive applications, such as signal processing, require rapid execution of arithmetic operations. The evaluation of functions is often the performance bottleneck of many compute-bound applications. Examples of these functions include elementary functions such as log(x) and √x, and compound functions such as √(−log(x)) and x log(x). Computing these functions quickly and accurately is a major goal in computer arithmetic. For instance, over 60% of the total run time is devoted to function evaluation operations in a simulation of a jet engine reported by O'Grady and Wang [133].

Recent studies have shown the increasing importance of these mathematical functions in a wide variety of applications, including computer 3D graphics, animation, scientific computing, artificial neural networks, digital signal processing and multimedia. Software implementations are often too slow for numerically intensive or real-time applications. The increasing speed and performance constraints of such applications have led to the development of new dedicated hardware for the computation of these operations, providing high-speed solutions implemented in coprocessors, graphic cards, Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) [122] and numerical processors in general.

### 1.3 Error Correcting Coding and LDPC Codes

Error correcting coding (ECC) is a critical part of modern communications systems, where it is used to detect and correct errors introduced during transmission over a channel [11], [126]. It relies on transmitting the data in an encoded form, such that the redundancy introduced by the coding allows a decoding device at the receiver to detect and correct errors. In this way, no request for retransmission is required, unlike systems which only detect errors (usually by means of a checksum transmitted with the data). In many applications, a substantial portion of the baseband signal processing is dedicated to ECC. The wide range of ECC applications [30] includes space and satellite communications, data transmission, data storage and mobile communications.

NASA’s space missions including Galileo, Odyssey, Rovers and Voyager would not have been possible without the use of ECC [71]. Odyssey, NASA’s Mars spacecraft currently boasts the highest data transmission rate at 128,000 bits per second via a radio link. However, for future space missions NASA are planning to use optical communications via laser beams [60]. The new laser will beam back between one million and 30 million bits per second, depending on the distance between Mars and Earth [119]. Projects like this provide great challenges to implement high-speed and low-power ECC systems with good error correcting performance in deep space.

In 1948, Claude Shannon founded the field of study “Information Theory”

which is the basis of modern ECC with his discovery of the noisy channel cod- ing theorem [164]. The theoretical contribution of Shannon’s work was a useful definition of “information” and several “channel coding theorems” which gave ex- plicit upper bounds, called the channel capacity, on the rate at which information could be transmitted reliably on a given communication channel. In the context

of our work, the result of primary interest is the “noisy channel coding theorem
for continuous channels with average power limitations”. This theorem states
*that the capacity C (which is now known as the Shannon limit) of a bandlimited*
*additive white Gaussian noise (AWGN) channel with bandwidth W , a channel*
model that approximately represents many practical digital communication and
storage systems, is given by

C = W log_2(1 + E_s/N_0) bits per second (bps)    (1.1)

where E_s is the average signal energy in each signaling interval of duration T = 1/W, and N_0/2 is the two-sided noise power spectral density. Perfect Nyquist signalling is assumed. The proof of this theorem demonstrates that for any transmission rate R less than or equal to the channel capacity C, there exists a coding scheme that achieves an arbitrarily small probability of error; conversely, if R is greater than C, no coding scheme can achieve reliable performance. Since this theorem was published, an entire field of study has grown out of attempts to design coding schemes that approach the Shannon limit of various channels.
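As a quick numerical illustration of Equation (1.1) (a sketch only; the bandwidth and SNR values below are arbitrary examples, not taken from this work):

```python
import math

def awgn_capacity(bandwidth_hz, es_over_n0):
    # Shannon limit of a bandlimited AWGN channel: C = W * log2(1 + Es/N0)
    return bandwidth_hz * math.log2(1.0 + es_over_n0)

# A 1 MHz channel at Es/N0 = 15 (about 11.8 dB) has capacity
# 1e6 * log2(16) = 4 Mbps; no code can transmit reliably above this rate.
print(awgn_capacity(1e6, 15.0))
```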

In the past few years, LDPC codes have received much attention because
of their excellent performance, and have been widely considered as the most
promising candidate ECC scheme for many applications in telecommunications
and storage devices [132], [8]. LDPC codes were first proposed by Gallager in
1962 [48], [49]. He defined an (n, d_v, d_c) LDPC code as a code of block length n in which each column of the parity-check matrix contains d_v ones and each row contains d_c ones. Due to the regular structure (uniform column and row weight) of Gallager's codes, they are now called regular LDPC codes. Gallager provided simulation results for codes with block lengths of the order of hundreds of bits. The results indicated that LDPC codes have very good potential for error correction. However, the high storage and computation requirements halted research on LDPC codes for decades. After the discovery of Turbo codes by Berrou et al. in 1993 [7], MacKay [110] re-established interest in LDPC codes during the mid to late 1990s.

### 1.4 Overview of our Approach

1.4.1 Function Evaluation

The evaluation of elementary functions is at the core of many compute-intensive applications [133] which perform well on reconfigurable platforms. Yet, in or- der to implement function evaluation efficiently, the FPGA programmer has to choose between many function evaluation methods such as table look-up, polyno- mial approximation, or table look-up combined with polynomial approximation.

We present a methodology and a partially automated implementation to select
the best function evaluation hardware for a given function, accuracy require-
ment, technology mapping and optimization metrics, such as area, throughput
or latency. The automation of function evaluation unit design is combined with
ASC [123], A Stream Compiler, for FPGAs. On the algorithmic side, we use
MATLAB to design approximation algorithms with polynomial coefficients and
minimize bitwidths. On the hardware implementation side, ASC provides par-
*tially automated design space exploration. We illustrate our approach for sin(x),*
*log(1 + x) and 2** ^{x}*, which are commonly used in a variety of applications. We
provide a selection of graphs that characterize the design space with various di-
mensions, including accuracy, precision and function evaluation method. We also
demonstrate design space exploration by implementing more than 400 distinct
designs.

*The evaluation of a function f (x) typically consists of range reduction which*

transforms the input into a small interval, and the actual function evaluation
on the small interval. We investigate optimization of range reduction given the
*range and precision of x and f (x). For every function evaluation there exists*
*a convenient interval such as [0, π/2) for sin(x). An example of the adaptive*
range reduction method, which we propose in our work, introduces another larger
interval for which it makes sense to skip range reduction. The decision depends
on the function being evaluated, precision, and optimization metrics such as area,
latency and throughput. In addition, the input and output range has an impact
on the choice of function evaluation method such as polynomial, table based, or
combinations of the two. We explore this vast design space of adaptive range
reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last place (ulp) using MATLAB and ASC. These tools enable us to study over 1000 designs, resulting in over 40 million Xilinx equivalent circuit gates, in a few hours'

time. The final objective is to progress towards a fully automated library that provides optimal function evaluation hardware units given input and output range and precision. Our design flow for evaluating elementary functions is illustrated in Figure 1.2.
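The reduce-then-reconstruct principle can be sketched in software as follows (a floating-point illustration only; our hardware units operate in fixed point, and the core evaluation on [0, π/2) would be a table-with-polynomial unit rather than a library call):

```python
import math

def reduce_and_eval_sin(x):
    """Evaluate sin(x) via range reduction to the convenient interval [0, pi/2).

    Write x = k*(pi/2) + r with r in [0, pi/2); the quadrant k mod 4 then
    selects the sign and whether sin or cos of the reduced argument is needed.
    """
    k = math.floor(x / (math.pi / 2))
    r = x - k * (math.pi / 2)          # reduced argument in [0, pi/2)
    quadrant = k % 4
    if quadrant == 0:
        return math.sin(r)             # in hardware: small approximation unit
    elif quadrant == 1:
        return math.cos(r)
    elif quadrant == 2:
        return -math.sin(r)
    else:
        return -math.cos(r)
```

The adaptive part of our approach decides, per function and precision, when this reduction step is worth its cost at all.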

Compound functions often have non-linear properties, hence sophisticated approximation techniques are needed. We present a method for evaluating such functions based on piecewise polynomial approximation with a novel hierarchical segmentation scheme. The use of hierarchical schemes of uniform segments and segments with size varying by powers of two enables us to approximate non-linear regions of a function particularly well. This partitioning is automated: efficient look-up tables and their coefficients are generated for a given function, input range, degree of the polynomials, desired accuracy and finite precision constraints. Parameterized reference design templates are provided for various predefined hierarchical schemes. We describe an algorithm to find the optimum number of segments and the placement of their boundaries, which is used to analyze the properties of a function and to benchmark our hierarchical approach.

Figure 1.2: Design flow for evaluating elementary functions.

Our method is illustrated using four non-linear compound and elementary functions: √(−log(x)), x log(x), a high-order rational function and cos(πx/2). We present results for various operand sizes between 8 and 24 bits for first and second order polynomial approximations. For 24-bit data, our method requires a look-up table 12 times smaller than the symmetric table addition method.

Our framework for the hierarchical segmentation method is shown in Figure 1.3.
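To make the powers-of-two idea concrete, here is a simplified sketch (the actual scheme of Chapter 5 combines such segments hierarchically with uniform ones; the function name below is our illustrative invention): the index of a segment whose size doubles moving away from zero is simply the position of the input's leading one.

```python
def p2s_segment_index(x):
    """Segment index for an unsigned integer input under a partition into
    powers-of-two-sized segments: [0,1), [1,2), [2,4), [4,8), ...
    Segments are densest near zero, which suits functions such as -log(x)
    that are most non-linear there. int.bit_length() plays the role of a
    leading-one detector (priority encoder) in hardware."""
    return x.bit_length()

# Each index then addresses a small table of polynomial coefficients,
# far fewer entries than one uniform segment per possible input.
```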

1.4.2 Gaussian Noise Generation

Evaluations of LDPC codes are based on computer simulations, which can be time consuming, particularly when the behavior at low bit error rates (BERs) in the error floor region is being studied [57]. Tremendous efforts have been devoted to analyzing and improving their error-correcting performance, but little consideration has been given to practical LDPC codec hardware implementations. If the binary Hamming distance [148] between all combinations of codewords (the distance spectrum) is known, then analytic techniques for describing the performance of the codes in the presence of noise are available. However, in the case of capacity-achieving random linear codes (such as LDPC codes), the problem of finding the distance spectrum of the code is intractable, and researchers resort to Monte Carlo simulation to characterize various code constructions in terms of BER versus signal-to-noise ratio (SNR). At very low SNRs, errors occur often and a sufficient statistic can be gathered readily within a PC. At higher SNRs, where errors occur rarely, the situation is different: thorough characterization of a code in this region may require simulation of 10^10 to 10^12 code symbols, which can take several weeks, so computer-based simulations provide an inadequate means of finding a statistically sufficient set of error events.

Figure 1.3: Design flow for evaluating non-linear functions using the hierarchical segmentation method.

Hardware-based simulation offers the potential of speeding up code evaluation by several orders of magnitude [99]. Such a simulation framework consists of three main blocks: encoder, noise channel and decoder, where the noise channel is generally modeled by Gaussian noise. Our LDPC code simulations are run on a reconfigurable engine, which consists of a PC and a reconfigurable hardware platform [85]. The reconfigurable hardware platform we use is a Xilinx Virtex-II FPGA prototyping board from Nallatech [131], shown in Figure 1.4. It consists of two Xilinx Virtex-II XC2V4000-6 FPGAs and 4MB of SRAM. The board can be connected to a PC via the PCI bus or USB. The grey wires are connected to a logic analyzer for debugging purposes. A block diagram of our LDPC simulation framework is provided in Figure 1.5. The LDPC encoder follows an algorithm suggested in [152]. Our noise generator block improves the overall value of the system as a Monte Carlo simulator, since noise quality at high SNRs (tails of the Gaussian distribution) is essential. Since the LDPC decoding process is iterative and the number of required iterations is non-deterministic, a flow control buffer is used to greatly increase the throughput of the overall system.

We present two methods for generating Gaussian noise. The first is based on the Box-Muller method [13] and the central limit theorem [78], which involve the computation of two functions: √(−ln(x)) and cos(2πx). The accuracy and speed in computing these functions are essential for generating high-quality Gaussian noise samples rapidly. The use of non-uniform segments enables us to approximate non-linear regions of a function particularly well. The appropriate segment address for a given function can be rapidly calculated at run time by a simple combinatorial circuit. Scaling factors are used to deal with large polynomial coefficients and to trade precision for range. Our function evaluator is based on first order polynomials, and is suitable for applications requiring high performance with small area, at the expense of accuracy. We exploit the central limit theorem to overcome quantization and approximation errors. An implementation at 133MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 10% of the device and produces 133 million samples per second, which is seven times faster than a 2.6GHz Pentium 4 PC.

Figure 1.4: The BenONE board from Nallatech used to run our LDPC simulation experiments.
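A minimal software sketch of the underlying Box-Muller transform (standard floating-point form; the constant factor of 2 inside the square root can be folded into the evaluated function, and in our hardware the square-root-log and sinusoid are computed with segmented polynomial approximations):

```python
import math

def box_muller(u1, u2):
    """Map two independent uniforms, u1 in (0, 1] and u2 in [0, 1),
    to two independent N(0, 1) samples. The hard part in hardware is
    evaluating sqrt(-ln(u1)) and cos/sin(2*pi*u2) accurately."""
    r = math.sqrt(-2.0 * math.log(u1))   # radius from the first uniform
    theta = 2.0 * math.pi * u2           # angle from the second uniform
    return r * math.cos(theta), r * math.sin(theta)
```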

The second method is based on the Wallace method [180]. Wallace proposed a fast algorithm for generating normally distributed pseudo-random numbers which generates the target distributions directly using their maximal-entropy properties. This algorithm is particularly suitable for high-throughput hardware implementation since no transcendental functions such as √x, log(x) or sin(x) are required. The Wallace method takes a pool of normally distributed random numbers and, through transformation steps, generates a new pool of normally distributed random numbers. An implementation running at 155MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 3% of the device and produces 155 million samples per second.
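A much-simplified software sketch of the pool-transformation idea (this is not Wallace's exact algorithm, which uses cheap pseudo-random addressing and a correcting scale factor; the function name is our illustrative invention):

```python
import math
import random

def wallace_pass(pool, rng):
    """One simplified Wallace-style pass: combine random pairs of the pool
    with an orthogonal 2x2 transform (a 45-degree rotation here). Because
    the transform is orthogonal, a pool of N(0,1) variates maps to another
    pool of N(0,1) variates without evaluating any transcendental function
    of the samples themselves."""
    idx = list(range(len(pool)))
    rng.shuffle(idx)                   # random pairing to break up correlation
    new_pool = list(pool)              # unpaired leftover (odd pool) carries over
    s = 1.0 / math.sqrt(2.0)
    for a, b in zip(idx[::2], idx[1::2]):
        x, y = pool[a], pool[b]
        new_pool[a] = s * (x + y)      # orthogonal: preserves x**2 + y**2
        new_pool[b] = s * (x - y)
    return new_pool
```

The residual correlations discussed in Chapter 8 arise because each output pool is a deterministic linear function of the previous one.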

Figure 1.5: Our LDPC hardware simulation framework.

The outputs of the two noise generators accurately model a true Gaussian
*PDF even at very high σ values (tails of the Gaussian distribution). Their*
properties are explored using: (a) several different statistical tests, including
the chi-square test and the Anderson-Darling test [32], and (b) an application for
decoding of LDPC codes. Although the Wallace design has smaller area and is
faster than the Box-Muller design, it has slight correlations between successive
transformations, which may be undesirable for certain types of simulations. We
examine design parameter optimizations to reduce such correlations.

Figure 1.6: LDPC encoding framework.

1.4.3 LDPC Encoding

We describe a flexible hardware encoder for regular and irregular Low-Density Parity-Check (LDPC) codes. Although LDPC codes achieve better performance and lower decoding complexity than Turbo codes, a major drawback is their apparently high encoding complexity: whereas Turbo codes can be encoded in linear time, a straightforward implementation for an LDPC code has complexity quadratic in the block length due to dense matrix-vector multiplication. Using an efficient encoding method proposed by Richardson and Urbanke [152], we present a hardware LDPC encoder with linear encoding complexity. The encoder is flexible, supporting arbitrary H matrices, rates and block lengths. We develop a software preprocessor to bring the parity-check matrix H into an approximate lower triangular form. A hardware architecture with an efficient memory organization for storing and performing computations on sparse matrices is proposed.

An implementation of an encoder for a rate-1/2 irregular LDPC code of block length 2000 bits on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 4% of the device. It runs at 143MHz and has a throughput of 45 million codeword bits per second (or 22 million information bits per second) with a latency of 0.18ms. An implementation of 16 instances of the encoder on the same device at 82MHz is capable of 410 million codeword bits per second, 80 times faster than an Intel Pentium 4 2.4GHz PC. The design flow of our LDPC encoder is illustrated in Figure 1.6. This block is placed in front of the noise generator in our LDPC simulation framework (Figure 1.5).
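The core sparse-matrix operation over GF(2) can be illustrated as follows (a software sketch of the storage idea only; the hardware memory organization differs, and the names are ours): each row stores only the positions of its ones, so a parity-check row of weight d_c costs d_c indices instead of n bits, and each output bit is the XOR of d_c input bits.

```python
def gf2_sparse_matvec(rows, x_bits):
    """Multiply a sparse binary matrix by a bit vector over GF(2).
    `rows` lists, per matrix row, the column positions holding a 1;
    each output bit is the parity (XOR) of the selected input bits."""
    return [sum(x_bits[j] for j in cols) & 1 for cols in rows]

# Example: H = [[1,1,0,1],
#               [0,1,1,0]] stored sparsely as column-index lists.
H_rows = [[0, 1, 3], [1, 2]]
syndrome = gf2_sparse_matvec(H_rows, [1, 0, 1, 1])  # row parities of H*x
```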

## CHAPTER 2

## Background

### 2.1 Introduction

The purpose of this chapter is to present the background material and related work of this thesis. Section 2.2 introduces the basics of FPGAs and the design tools used in this thesis. Section 2.3 introduces six of the most popular methods for approximating functions and the existing work. Section 2.4 discusses various issues related to function evaluation, such as range reduction. Section 2.5 presents different ways of generating Gaussian noise and explores the existing work in this area. Finally, Section 2.6 introduces the basics of LDPC codes and LDPC encoding, describes Richardson and Urbanke's (RU) method for efficiently encoding LDPC codes, and reviews previous work on hardware-related issues of LDPC codes.

### 2.2 FPGAs

2.2.1 Introduction

Field-Programmable Gate Arrays (FPGAs) have long been used for glue logic and prototyping. More recently, they are being used for many real-life applications including communications [93], encryption [173], video image processing [168], [175], medical imaging [72], network security [96] and numerical computations [104].

Figure 2.1: Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic cells.

FPGAs can potentially approach the execution speed of application-specific hardware with the rapid programming time of microprocessors. In recent years, the size of FPGAs has followed Moore's law: the number of logic gates doubles every 18 months. FPGAs can exploit improvements following Moore's law better than microprocessors because of their simpler and more regular structure.

The fundamental building block of Xilinx FPGAs is the logic cell [118]. A logic cell comprises a 4-input look-up table (which can also act as a 16 × 1 RAM or a 16-bit shift register), a multiplexer and a register. A simplified view of a logic cell is depicted in Figure 2.1. Two logic cells are paired together in an element called a slice. A slice contains additional resources such as multiplexers and carry logic to increase the efficiency of the architecture. These extra resources are equivalent to having more logic cells, and therefore a slice is counted as being equivalent to 2.25 logic cells. Recent-generation reconfigurable hardware has a large number of slices; for instance, the Xilinx Virtex-II XC2V4000-6 has 23040 slices.
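To make the look-up-table idea concrete, a 4-input LUT can be modeled in software as 16 configuration bits indexed by the inputs (an illustrative model, not Xilinx's actual implementation):

```python
def make_lut4(truth_table_16):
    """Model a 4-input LUT: any boolean function of 4 inputs is one of
    2**16 possible truth tables, stored as 16 configuration bits."""
    assert len(truth_table_16) == 16
    def lut(a, b, c, d):
        # The four inputs form a 4-bit address into the configuration bits.
        return truth_table_16[(a << 3) | (b << 2) | (c << 1) | d]
    return lut

# Example: configure the LUT as a 4-input AND gate (only index 15 is 1).
and4 = make_lut4([0] * 15 + [1])
```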

The architecture of a typical FPGA is illustrated in Figure 2.2. In general, an FPGA will have an array of configurable logic blocks (which contain two or four slices depending on the FPGA family), programmable wires, and programmable switches to realize any function out of the logic blocks and implement any interconnection topology. Programming is done using one of many popular technologies such as SRAM cells, antifuses, EPROM transistors and EEPROM transistors. In addition to logic blocks, state-of-the-art FPGAs such as the Xilinx Virtex-II or Virtex-4 devices contain embedded hardware elements for memory, multiplication, multiply-and-add and even a number of hard microprocessor cores (such as the IBM PowerPC) [189].

Figure 2.2: Architecture of a typical FPGA.

The long IC fabrication time is completely eliminated for these devices, and design realization times are only a few hours. Because user-programmability is so attractive, most ASIC vendors now prefer FPGAs for low-cost prototyping and for fine-tuning designs before fabrication. Also, from a marketing point of view, FPGA technology allows quick product announcements, which is commercially attractive. The two major FPGA vendors are Altera and Xilinx. A good review of configurable computing and FPGAs is given in [28].

2.2.2 Design Tools

The following three FPGA design tools are used for the implementations pre- sented in this thesis:

• ASC [123], A Stream Compiler for FPGAs, adopts C++ custom types and operators to provide a programming interface at the algorithmic, architectural, arithmetic and gate levels. As a unique feature, all levels of abstraction are accessible from C++. This enables the user to program at the desired level for each part of the application. Semi-automated design space exploration further increases design productivity, while supporting optimization at all available levels of abstraction. Object-oriented design enables efficient code reuse; ASC includes an integrated arithmetic unit generation library, PAM-Blox II [121], which in turn builds upon the PamDC [137] gate library. The elementary function evaluation units in Chapters 3 and 4 are implemented with this tool.

• Handel-C [21] is based on ANSI-C with extensions to support flexible-width variables, signals, parallel blocks, bit-manipulation operations and channel communication. A distinctive feature is that timing of the compiled circuit is fixed at one cycle per C assignment. This makes it easy for programmers to know in which cycle a statement will be executed, at the expense of reducing the scope for optimization. It gives application developers the ability to schedule hardware resources manually, and Handel-C tools generate the resulting designs automatically. The ideas of Handel-C are based on work by Page and Luk in compiling Occam into FPGAs [134]. The Gaussian noise generator using the Box-Muller method in Chapter 6 is implemented with this tool.

• Xilinx System Generator [188] is a plug-in to the MATLAB Simulink software [117] and provides bit-accurate models of FPGA circuits. It automatically generates synthesizable VHDL or Verilog code including a testbench. Other unique capabilities include MATLAB m-code compilation, fast system-level resource estimation, and high-speed hardware co-simulation interfaces, both a generic JTAG interface [31] and PCI-based co-simulation for FPGA hardware platforms. The Xilinx Blockset in Simulink enables bit-true and cycle-true modeling, and includes common parameterized blocks such as finite impulse response (FIR) filters, fast Fourier transforms (FFTs), logic gates, adders, multipliers, RAMs, etc. Moreover, most of these blocks utilize Xilinx cores, which are highly optimized for Xilinx devices. The function evaluator using the hierarchical segmentation method (HSM) in Chapter 5, the Gaussian noise generator using the Wallace method in Chapter 7, and the LDPC encoder in Chapter 9 are implemented with this tool.

ASC designs are synthesized with PAM-Blox II and all others with Synplicity Synplify Pro (versions 7.3 to 7.5). Place-and-route for all designs is performed with Xilinx ISE (versions 6.0 to 6.2).

### 2.3 Function Evaluation Methods

Many FPGA applications including digital signal processing, computer graphics and scientific computing require the evaluation of elementary or special-purpose functions. For applications that require low-precision approximation at high speeds, full look-up tables are often employed. However, this becomes impractical for precisions higher than a few bits, because the size of the table grows exponentially with respect to the input size. Six well-known methods better suited to high precision are described below.
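To see why full tables become impractical: a direct table for an n-bit input and m-bit output stores one m-bit entry for every possible input value. A back-of-the-envelope helper (purely illustrative):

```python
def full_table_bits(n_in, m_out):
    # A direct look-up table holds one m_out-bit result for each of the
    # 2**n_in possible input values.
    return (2 ** n_in) * m_out

# 8-bit in/out: 2 Kbit -- trivial.  24-bit in/out: about 400 Mbit -- impractical.
for n in (8, 16, 24):
    print(n, full_table_bits(n, n))
```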

2.3.1 CORDIC

CORDIC is an acronym for COordinate Rotation DIgital Computer, and offers a simple and elegant way to calculate many useful functions.

The CORDIC algorithm was first introduced by Volder [178] for the computation of trigonometric functions, multiplication, division and data type conversion, and later generalized to hyperbolic functions by Walther [182]. It has found its way into diverse applications including the 8087 math coprocessor [38], the HP-35 calculator, radar signal processors and robotics.

It is based on simple iterative equations involving only shift and add operations, and was developed to avoid the time-consuming multiply and divide operations. The general CORDIC algorithm consists of the following three iterative equations:

x_{k+1} = x_k − m·δ_k·y_k·2^{−k}
y_{k+1} = y_k + δ_k·x_k·2^{−k}
z_{k+1} = z_k − δ_k·σ_k

The constants m, δ_k and σ_k depend on the specific computation being performed, as explained below.
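A floating-point software model of these iterations in circular rotation mode (a sketch under the assumptions m = 1, σ_k = arctan 2^{−k}, and δ_k chosen to drive the residual angle z to zero; real CORDIC hardware replaces the multiplications by 2^{−k} with shifts and uses a precomputed scale constant):

```python
import math

def cordic_rotation(angle, n_iters=32):
    """Rotate (x, y) = (K, 0) through `angle` with the CORDIC iteration;
    returns approximately (cos(angle), sin(angle)).
    Valid for |angle| < sum(arctan(2**-k)) ~ 1.743 radians."""
    # Scale constant K = prod 1/sqrt(1 + 2**(-2k)), precomputed in hardware.
    k_factor = 1.0
    for k in range(n_iters):
        k_factor *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * k))
    x, y, z = k_factor, 0.0, angle
    for k in range(n_iters):
        delta = 1.0 if z >= 0 else -1.0   # rotate toward z = 0
        x, y, z = (x - delta * y * 2.0 ** -k,
                   y + delta * x * 2.0 ** -k,
                   z - delta * math.atan(2.0 ** -k))
    return x, y
```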

• m is either 0, 1 or −1. m = 1 is used for trigonometric and inverse trigonometric functions. m = −1 is used for hyperbolic, inverse hyperbolic, exponential and logarithmic functions, as well as square roots. Finally, m = 0 is used for multiplication and division.