
2.3 Function Evaluation Methods

• Xilinx System Generator [188] is a plug-in to the MATLAB Simulink software [117] and provides a bit-accurate model of FPGA circuits. It automatically generates synthesizable VHDL or Verilog code, including a testbench. Other unique capabilities include MATLAB m-code compilation, fast system-level resource estimation, and high-speed hardware simulation interfaces, both a generic JTAG interface [31] and PCI-based co-simulation for FPGA hardware platforms. The Xilinx Blockset in Simulink enables bit-true and cycle-true modeling, and includes common parameterizable blocks such as finite impulse response (FIR) filters, fast Fourier transforms (FFTs), logic gates, adders, multipliers, RAMs, etc. Moreover, most of these blocks utilize Xilinx cores, which are highly optimized for Xilinx devices.

The function evaluator using the hierarchical segmentation method (HSM) in Chapter 5, the Gaussian noise generator using the Wallace method in Chapter 7, and the LDPC encoder in Chapter 8 are implemented with this tool.

ASC designs are synthesized with PAM-Blox II and all others with Synplicity Synplify Pro (versions 7.3 ∼ 7.5). Place-and-route for all designs is performed with Xilinx ISE (versions 6.0 ∼ 6.2).

The size of a direct look-up table grows exponentially with respect to the input size. Six well-known methods that are better suited to high precision are described below.

2.3.1 CORDIC

CORDIC is an acronym for COordinate Rotation DIgital Computer, and it offers a rather simple and elegant way of computing the desired functions.

The CORDIC algorithm was first introduced by Volder [178] for the computation of trigonometric functions, multiplication, division and data type conversion, and later generalized to hyperbolic functions by Walther [182]. It has found its way into diverse applications including the 8087 math coprocessor [38], the HP-35 calculator, radar signal processors and robotics.

It is based on simple iterative equations involving only shift and add operations, and was developed in an effort to avoid time-consuming multiply and divide operations. The general CORDIC algorithm consists of the following three iterative equations:

\begin{align*}
x_{k+1} &= x_k - m\,\delta_k\, y_k\, 2^{-k} \\
y_{k+1} &= y_k + \delta_k\, x_k\, 2^{-k} \\
z_{k+1} &= z_k - \delta_k\, \sigma_k
\end{align*}

The constants $m$, $\delta_k$ and $\sigma_k$ depend on the specific computation being performed, as explained below.

• $m$ is either 0, 1 or $-1$. $m = 1$ is used for trigonometric and inverse trigonometric functions. $m = -1$ is used for hyperbolic, inverse hyperbolic, exponential and logarithmic functions, as well as square roots. Finally, $m = 0$ is used for multiplication and division.

• $\delta_k$ is one of the following two signum functions:

$$\delta_k = \mathrm{sgn}(z_k) = \begin{cases} 1, & z_k \ge 0 \\ -1, & z_k < 0 \end{cases} \qquad \text{or} \qquad \delta_k = -\mathrm{sgn}(y_k) = \begin{cases} 1, & y_k < 0 \\ -1, & y_k \ge 0 \end{cases}$$

The first is often called the rotation mode, in which the $z$ values are driven to zero, whereas the second is the vectoring mode, in which the $y$ values are driven to zero. Note that $\delta_k$ requires nothing more than a comparison.

• The numbers $\sigma_k$ are constants which depend on the value of $m$ and are stored in a table. For $m = 1$, $\sigma_k = \tan^{-1} 2^{-k}$; for $m = 0$, $\sigma_k = 2^{-k}$; and for $m = -1$, $\sigma_k = \tanh^{-1} 2^{-k}$.

To use these equations, appropriate starting values $x_1$, $y_1$ and $z_1$ must be given. One of these inputs, say $z_1$, might be the number whose hyperbolic sine we wish to approximate, $\sinh(z_1)$. In all cases, the starting values must be restricted to a certain interval about the origin in order to ensure convergence. As the iterations proceed, one of the variables tends to zero while another variable approaches the desired approximation.
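As an illustration, the following is a minimal floating-point sketch of the circular ($m = 1$) recurrence in rotation mode, computing sine and cosine; the function name and iteration count are our own choices, and a hardware implementation would use fixed-point arithmetic and replace the multiplications by $2^{-k}$ with shifts.

```python
import math

def cordic_sincos(z, iterations=32):
    """Approximate (cos z, sin z) for z in [-pi/2, pi/2] with the circular
    (m = 1) CORDIC recurrence in rotation mode (z is driven to zero)."""
    # Table of rotation angles sigma_k = arctan(2^-k).
    sigmas = [math.atan(2.0 ** -k) for k in range(iterations)]
    # Starting from x = K (the inverse of the CORDIC gain) and y = 0 makes
    # the final (x, y) equal to (cos z, sin z) without a post-scaling step.
    K = 1.0
    for k in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    x, y = K, 0.0
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0                   # delta_k = sgn(z_k)
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * sigmas[k]                            # z_{k+1} = z_k - delta_k * sigma_k
    return x, y

print(cordic_sincos(0.5))  # roughly (0.8776, 0.4794)
```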

The major disadvantage of the CORDIC algorithm is its linear convergence resulting in an execution time which is linearly proportional to the number of bits in the operands. In addition, CORDIC is limited to a relatively small set of elementary functions. A comprehensive study of CORDIC algorithms on FPGAs can be found in [3].

2.3.2 Digit-recurrence and On-line Algorithms

Digit-recurrence [41] and on-line algorithms [40] belong to the same family of methods for approximating functions in hardware, usually known as digit-by-digit iterative methods owing to their linear convergence, meaning that a fixed number of bits of the result is obtained in each iteration. Implementations of this type of algorithm are typically of low complexity, occupy small area and have relatively large latencies. The fundamental choices in the design of a digit-by-digit algorithm are the radix, the allowed digit set and the representation of the partial remainder.
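As a concrete illustration of the one-digit-per-iteration behaviour, here is a small sketch of radix-2 restoring division, a classic digit-recurrence algorithm; the function name and the fixed-point framing are our own.

```python
def digit_recurrence_divide(n, d, bits=16):
    """Radix-2 restoring division of fractions 0 <= n < d, producing
    exactly one quotient bit per iteration (linear convergence)."""
    q, r = 0, n
    for _ in range(bits):
        r *= 2                  # shift the partial remainder left
        q <<= 1
        if r >= d:              # select the next quotient digit
            r -= d
            q |= 1
    return q / (1 << bits)

print(digit_recurrence_divide(0.3, 0.7))  # ~0.4286 after 16 iterations
```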

2.3.3 Bipartite and Multipartite Methods

The bipartite method, so called because the table is divided into two parts, was originally introduced by Das Sarma and Matula [159] with the aim of obtaining accurate reciprocals. Improvements were suggested by Schulte and Stine [162], [163] and Muller [129], and the generalization from the bipartite to the multipartite method is discussed by de Dinechin and Tisserand [34].

Assume an $n$-bit binary fixed-point system, and assume that $n$ is a multiple of 3, say $n = 3k$. We wish to design a table-based implementation of a function $f$. A full look-up table would lead to a table of size $n \times 2^n$ bits. Instead, we split the input word $x$ into three $k$-bit words $x_0$, $x_1$ and $x_2$, that is,

$$x = x_0 + x_1 2^{-k} + x_2 2^{-2k} \qquad (2.1)$$

where $x_0$, $x_1$ and $x_2$ are multiples of $2^{-k}$ that are less than 1. The original bipartite method consists in approximating the first order Taylor expansion

$$f(x) = f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0 + x_1 2^{-k}) + \frac{x_2^2\, 2^{-4k}}{2} f''(\xi), \quad \xi \in [x_0 + x_1 2^{-k},\, x] \qquad (2.2)$$

by

$$f(x) \approx f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0). \qquad (2.3)$$

That is, $f(x)$ is approximated by the sum of two functions $\alpha(x_0, x_1)$ and $\beta(x_0, x_2)$, where

$$\alpha(x_0, x_1) = f(x_0 + x_1 2^{-k}), \qquad \beta(x_0, x_2) = x_2 2^{-2k} f'(x_0).$$

The error of this approximation is roughly proportional to $2^{-3k}$. Instead of directly tabulating the function $f$, the functions $\alpha$ and $\beta$ are tabulated. Since they are functions of $2k$ bits only, each of these tables has $2^{2n/3}$ entries. This results in a total table size of $2n \times 2^{2n/3}$ bits, which is a significant improvement over the full look-up table. These methods essentially exploit the symmetry of the Taylor approximations and leading zeros in the table coefficients to reduce the look-up table size. Although they yield significant improvements in table size over direct table look-up, they can be inefficient for functions that are highly non-linear [88].
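The table construction can be sketched in a few lines. The code below builds the $\alpha$ and $\beta$ tables for $f(x) = e^x$ with $k = 4$ (a 12-bit input) and evaluates the approximation with two look-ups and one addition; the helper names are ours, and a real implementation would round the entries to fixed-point words.

```python
import math

def make_bipartite(f, df, k):
    """Build the alpha and beta tables for an n = 3k bit input
    x = x0 + x1*2^-k + x2*2^-2k, each xi a multiple of 2^-k below 1."""
    s = 2.0 ** -k
    # alpha(x0, x1) = f(x0 + x1 * 2^-k): 2^(2k) entries.
    alpha = [[f(i * s + j * s * s) for j in range(2 ** k)]
             for i in range(2 ** k)]
    # beta(x0, x2) = x2 * 2^-2k * f'(x0): 2^(2k) entries.
    beta = [[(l * s) * s * s * df(i * s) for l in range(2 ** k)]
            for i in range(2 ** k)]
    return alpha, beta

# Evaluate e^x at the 12-bit input split into the k-bit fields (i, j, l).
alpha, beta = make_bipartite(math.exp, math.exp, 4)
i, j, l = 9, 3, 5
x = (i * 256 + j * 16 + l) / 4096.0
print(alpha[i][j] + beta[i][l], math.exp(x))  # two look-ups and one add
```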

2.3.4 Polynomial Approximation

Polynomial approximation [58], [150] involves approximating a continuous function $f$ with one or more polynomials $p$ of degree $d$ on a closed interval $[a, b]$. The polynomials are of the form

$$p(x) = c_d x^d + c_{d-1} x^{d-1} + \cdots + c_1 x + c_0 \qquad (2.4)$$

and with Horner's rule, this becomes

$$p(x) = ((c_d x + c_{d-1})x + \cdots)x + c_0 \qquad (2.5)$$

where $x$ is the input. The aim is to minimize the distance $\|p - f\|$. In our work, we use minimax polynomial approximations, which minimize the maximum absolute error [128]. The distance for minimax approximations is

$$\|p - f\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| \qquad (2.6)$$

Table 2.1: Maximum absolute and average errors for various first order polynomial approximations to $e^x$ over $[-1, 1]$.

             Taylor   Legendre   Chebyshev   Minimax
Max. Error   0.718    0.439      0.372       0.279
Avg. Error   0.246    0.162      0.184       0.190

where $[a, b]$ is the approximation interval. Many researchers rely on methods such as Taylor series, which do not minimize the maximum absolute error. Table 2.1 shows the maximum and average errors of various first order polynomial approximations to $e^x$ over $[-1, 1]$. It can be seen that minimax generally gives the lowest maximum error and Legendre provides the lowest average error. Therefore, when low maximum absolute error is desired, minimax approximation should be used (unless the polynomial coefficients are computed at run-time from stored function values [100]). The minimax polynomial is found in an iterative manner using the Remez exchange algorithm [149], which is often used for determining optimal coefficients for digital filters.
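To make equations (2.5) and (2.6) concrete, the sketch below evaluates a polynomial with Horner's rule and constructs the degree-1 minimax approximation to $e^x$ on $[-1, 1]$ using the closed-form equioscillation conditions that hold for a convex function (the general case requires the Remez algorithm); the helper names are ours.

```python
import math

def horner(coeffs, x):
    """Evaluate ((c_d*x + c_{d-1})*x + ...)*x + c_0 as in equation (2.5)."""
    acc = 0.0
    for c in coeffs:            # coeffs ordered [c_d, ..., c_1, c_0]
        acc = acc * x + c
    return acc

def minimax_linear(f, df_inv, a, b):
    """Degree-1 minimax fit of a convex f on [a, b] via equioscillation:
    the optimal slope is the secant slope, and the intercept balances the
    endpoint error against the error at the interior tangent point."""
    c1 = (f(b) - f(a)) / (b - a)
    xs = df_inv(c1)                              # point where f'(xs) = c1
    c0 = (f(a) + f(xs)) / 2 - c1 * (a + xs) / 2
    return [c1, c0]

# First order minimax approximation to e^x over [-1, 1]; f' = exp, so
# its inverse is log.
coeffs = minimax_linear(math.exp, math.log, -1.0, 1.0)
err = max(abs(math.exp(t / 1e3) - horner(coeffs, t / 1e3))
          for t in range(-1000, 1001))
print(err)  # ~0.279, agreeing with the minimax column of Table 2.1
```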

Sidahao et al. [165] approximate functions over the whole interval with high order polynomials. This polynomial-only approach has the advantage of low memory requirements, but suffers from long latencies. In addition, it will not generate acceptable results when the function is highly non-linear. Piñeiro et al. [147] divide the interval into several uniform segments. For each segment, they store the second order minimax polynomial approximation coefficients, and accumulate the partial terms in a fused accumulation tree. This scheme performs well for the evaluation of elementary functions for moderate precisions (less than 24 bits).

2.3.5 Polynomial Approximation with Non-uniform Segmentation

Approximations using uniform segments are suitable for functions with relatively linear regions, but are inefficient for non-linear functions, especially when the function varies exponentially. It is desirable to choose the boundaries of the segments to cater for the non-linearities of the function: highly non-linear regions need smaller segments than linear regions. This approach minimizes the amount of storage required to approximate the function, leading to more compact and efficient designs. A number of techniques that utilize non-uniform segment sizes to cater for such non-linearities have been proposed in the literature. Cantoni [18] uses optimally placed segments and presents an algorithm to find such segment boundaries. However, although this approach minimizes the number of segments required, such arbitrarily placed segments are impractical for actual hardware implementation, since the circuitry to find the right segment for a given input would be too complex. Combet et al. [27] and Mitchell Jr. [75] use segments that increase by powers of two to approximate the base two logarithm.

Henkel [61] divides the interval into four arbitrarily placed segments based on the non-linearity of the function. The address for a given input is computed by another function that approximates the segment number for that input. This method only works if the number of segments is small and the desired accuracy is low. Also, the function for approximating the segment addresses is itself non-linear, so in effect the problem has been moved into a different domain. Coleman et al. [26] divide the input interval into seven P2S (powers of two segments: segments whose sizes vary by increasing or decreasing powers of two) that decrease by powers of two, and employ constant numbers of US (uniform segments: segments of the same size) nested inside each P2S, which we call P2S(US). Lewis [100] divides the interval into US that vary by multiples of three, and each US has a variable number of uniform segments nested inside, which we call US(US). However, in both cases the choice of inner and outer segment numbers is left to the user, and a more efficient segmentation could be achieved using a systematic segmentation scheme.
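A sketch of why power-of-two segments are hardware friendly: when segment boundaries sit at 1/2, 1/4, 1/8, and so on, the segment index of an input is simply its number of leading zeros, so the address computation reduces to a priority encoder. The function below is our own illustration, not code from any of the cited schemes.

```python
def p2s_index(x_bits, n):
    """Locate the power-of-two segment (P2S) of an n-bit fractional input:
    counting leading zeros is all the 'which segment' logic needed."""
    for lz in range(n):
        if x_bits & (1 << (n - 1 - lz)):
            return lz           # segment lz covers [2^-(lz+1), 2^-lz)
    return n - 1                # an all-zero input falls in the last segment

# An 8-bit input 0.0101... (x ~ 0.328) has one leading zero, so it lies in
# segment 1, i.e. the interval [0.25, 0.5).
print(p2s_index(0b01010100, 8))  # 1
```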

2.3.6 Rational Approximation

Rational approximation offers efficient evaluation of analytic functions represented by the ratio of two polynomials:

$$f(x) = \frac{c_n x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0}{d_m x^m + d_{m-1} x^{m-1} + \cdots + d_1 x + d_0} \qquad (2.7)$$

In general, rational approximations are the most efficient method for evaluating functions on a microprocessor. However, they are less attractive for FPGA implementations due to the presence of the divider. Typical polynomials for floating-point single precision have fewer than ten coefficients [122]. Hardware implementations of rational approximation are studied in [79].
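For illustration, the sketch below evaluates equation (2.7) with two Horner chains and a single division, the division being the operation that makes the scheme expensive on FPGAs. The Padé(1,1) coefficients for $e^x$ are a textbook example, and the function name is ours.

```python
def rational_eval(num, den, x):
    """Evaluate a rational approximation p(x)/q(x): two Horner chains
    followed by one division."""
    p = 0.0
    for c in num:               # num = [c_n, ..., c_0]
        p = p * x + c
    q = 0.0
    for d in den:               # den = [d_m, ..., d_0]
        q = q * x + d
    return p / q

# Pade(1,1) approximant of e^x about 0: (1 + x/2) / (1 - x/2).
print(rational_eval([0.5, 1.0], [-0.5, 1.0], 0.1))  # ~1.1053 (e^0.1 = 1.10517)
```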
