
and hence our design produces 133 million Gaussian noise samples per second.

We have also implemented our design on a Xilinx Spartan-IIE XC2S300E-7 FPGA. This design runs at 62MHz and occupies 2829 slices and eight block RAMs, over 90% of this device's resources. This implementation can produce 133 million samples in around two seconds.
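As a quick arithmetic check on the quoted figures, assuming (for this sketch only) that the design emits one sample per clock cycle:

```python
# Hedged sanity check: one sample per clock cycle is an assumption here.
clock_hz = 62e6      # Spartan-IIE design clock rate
n_samples = 133e6    # samples to produce
seconds = n_samples / clock_hz
print(f"{seconds:.2f} s")   # ~2.15 s, consistent with "around two seconds"
```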

It is possible to increase performance by exploiting parallelism. We have experimented with placing multiple instances of our noise generator in an FPGA, and find a small reduction in clock speed, probably due to the high fan-out of the clock tree. For instance, a design with three instances of our noise generator takes up around 32% of the resources in an XC2V4000-6 device; it runs at 126MHz, producing 378 million noise samples per second.

In Section 6.7, the performance of the hardware designs presented above is compared with those of software implementations.

and expected number of samples appearing in each bin, and using the results to derive a single number that serves as an overall quality metric. Let t be the number of observations, p_i the probability that an observation falls into category i, and Y_i the number of observations that actually fall into category i. The χ2 statistic is given by

\chi^2 = \sum_{i=1}^{k} \frac{(Y_i - t p_i)^2}{t p_i} \qquad (6.7)

This test, which is essentially a comparison between an experimentally determined histogram and the ideal PDF, is sensitive not only to the quality of the noise generator itself, but also to the number and size of the k bins used on the x axis. For example, a noise generator that models the true PDF accurately for low absolute values of x but fails for large x could still yield a good χ2 result if the examined regions are too closely centered around the origin. Yet it is precisely in these high-|x| regions that a noise generator is critically important, and most likely to be flawed.
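As an illustrative sketch (not the thesis's own test harness), Equation (6.7) can be evaluated directly; here ideal Gaussian draws stand in for the generator's output, and the bin probabilities p_i come from the Gaussian CDF:

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)   # stand-in for the generator's output

# k = 100 equal-width bins over [-7, 7], as used later in this section
edges = np.linspace(-7.0, 7.0, 101)
Y, _ = np.histogram(samples, bins=edges)             # observed counts Y_i
p = np.diff(np.array([norm_cdf(e) for e in edges]))  # ideal probabilities p_i
t = samples.size

chi2_stat = np.sum((Y - t * p) ** 2 / (t * p))
print(f"chi^2 = {chi2_stat:.1f} on k-1 = {len(p) - 1} degrees of freedom")
```

A p-value then follows by comparing the statistic against the χ2 distribution with k-1 degrees of freedom.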

Consider a simulation involving the generation of 10^12 noise samples, conducted with the goal of exploring the performance of a channel decoder at BERs between 10^-9 and 10^-10. Among 10^12 samples drawn from a true unit-variance Gaussian PDF, we would expect approximately half a million to have absolute value greater than x = 5. These high-σ noise values are precisely the ones likely to cause problems in decoding, so a hardware implementation that fails to produce them faithfully risks creating incorrect and deceptively optimistic simulation results. To counter this, we extend the tests to specifically examine the expected versus actual production of high-σ values.
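The "half a million" figure can be verified with the Gaussian tail probability:

```python
import math

n = 1e12                                # simulation size
p_tail = math.erfc(5 / math.sqrt(2))    # P(|x| > 5) for a unit-variance Gaussian
expected = n * p_tail
print(f"expected |x| > 5 samples: {expected:.3e}")   # roughly 5.7e5
```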

While the χ2 test deals with quantized aspects of a design, the A-D test deals with continuous properties. It is a modification of the Kolmogorov-Smirnov (K-S) test [78] and gives more weight to the tails than the K-S test does. The K-S test is distribution-free in the sense that its critical values do not depend on the specific distribution being tested, whereas the A-D test makes use of the specific distribution (normal in our case) in calculating critical values. For comparing a data set to a known CDF F(x), the A-D statistic A2 is defined by

A^2 = \sum_{i=1}^{N} \frac{1 - 2i}{N}\left[\ln F(x_i) + \ln\bigl(1 - F(x_{N+1-i})\bigr)\right] - N \qquad (6.8)

where x_i is the ith sorted and standardized sample value, and N is the sample size.
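Equation (6.8) translates almost directly into code; the sketch below (assumptions: samples are already standardized, and ideal Gaussian draws stand in for the generator):

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def anderson_darling_A2(samples):
    """A^2 of Equation (6.8) against a standard normal CDF."""
    x = np.sort(np.asarray(samples))
    N = x.size
    F = np.array([norm_cdf(v) for v in x])
    i = np.arange(1, N + 1)
    # sum of (1 - 2i)/N * [ln F(x_i) + ln(1 - F(x_{N+1-i}))], minus N
    return np.sum((1 - 2 * i) / N * (np.log(F) + np.log(1 - F[::-1]))) - N

rng = np.random.default_rng(1)
a2 = anderson_darling_A2(rng.standard_normal(10_000))
print(f"A^2 = {a2:.3f}")
```

For a fully specified null distribution, small values of A^2 (roughly below the 5% critical value of about 2.5) are consistent with normality.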

A p-value [32] can be obtained from each test: the probability that the observed deviation from the expected distribution is due to chance alone. A sample set with a small p-value is thus unlikely to follow the target distribution.

The general convention is to reject the null hypothesis – that the samples are normally distributed – if the p-value is less than 0.05.

Figures 6.10, 6.11 and 6.12 illustrate the effect on the PDF of different implementation choices. Figure 6.10 shows the PDF obtained when 17 and 6 linear approximations are used for f and g1 respectively. The figure (as well as the others in this section) is based on a simulation of four million Gaussian random variables. There are distinct error regions visible in the PDF, which occur when there are large errors in the approximation of f and g1. These distinct errors cause the χ2 and A-D tests to fail. Increasing the number of linear approximations to 59 and 21 respectively leads to the PDF shown in Figure 6.11. It is clear that the error regions have decreased significantly. However, although this design passes the A-D test, it fails the χ2 test when the sample size is sufficiently large. When the further enhancement of summing two successive samples, as discussed earlier, is added, the PDF of Figure 6.12 results.
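The summing enhancement just mentioned can be sketched as follows. The 1/√2 normalization is an assumption made here to keep unit variance (the design's actual scaling is described earlier in the chapter); by the central limit theorem, the sum of two samples is closer to Gaussian, which smooths residual PDF errors:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the piecewise-linear generator's output (ideal Gaussians here).
raw = rng.standard_normal(1_000_000)

# Sum successive pairs and rescale by 1/sqrt(2) to preserve unit variance.
combined = (raw[0::2] + raw[1::2]) / np.sqrt(2.0)
print(combined.var())   # close to 1.0
```

Note that the output rate of combined samples is half the raw sample rate.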

This implementation passes the statistical tests even with extremely large numbers of samples. We have run a simulation of 10^10 samples to calculate the p-values for the χ2 and A-D tests. For the χ2 test, we use 100 bins on the x axis over the range [-7,7]. The p-values for the χ2 and A-D tests are found to be 0.3842 and 0.9058 respectively, well above 0.05, indicating that the generated noise samples are indeed normally distributed. To test the noise quality in the high-σ regions, we run a simulation of 10^7 samples over the ranges [-7,-4] and [4,7] with 100 bins, which is equivalent to a simulation size of over 10^11 samples. The p-values for the χ2 and A-D tests are found to be 0.6432 and 0.9143, showing that the noise quality even in the high-σ regions is high.

In order to explore the possibility of temporal statistical dependencies [154] between the Gaussian variables, we generate scatter plots of the pairs (y_i, y_{i+1}). This tests for serial correlations between successive samples, which can occur if the noise generator is improperly designed; if correlations exist, patterns appear in the scatter plot [154]. An example based on 10000 Gaussian variables is shown in Figure 6.13, which displays no obvious correlations.
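The lag-1 check can be sketched numerically as well as visually; here ideal Gaussian draws stand in for the generator's output, and the sample correlation coefficient of successive pairs should sit near zero:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(10_000)          # stand-in for the generator's output

# Lag-1 pairs (y_i, y_{i+1}); an ideal generator shows no structure.
r = np.corrcoef(y[:-1], y[1:])[0, 1]
print(f"lag-1 correlation: {r:.4f}")
# For the scatter plot itself: plt.scatter(y[:-1], y[1:], s=1)
```

For N = 10000 independent samples, |r| is expected to be on the order of 1/√N ≈ 0.01.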

Our hardware implementations, described in Section 6.6, have been compared to several software implementations based on the polar method [78] and the Ziggurat method [115], which are the fastest methods for generating Gaussian noise on instruction processors. The software implementations are written in C, generate single-precision floating-point numbers, and are compiled with the GNU gcc 3.2.2 compiler. The uniform random number generator used is the mrand48 C function in UNIX, which uses a linear congruential algorithm [78] with 48-bit integer arithmetic (period 2^48). This algorithm can generate one billion 48-bit uniform random numbers on a Pentium 4 2.6GHz PC in just 23 seconds.
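For reference, the drand48/mrand48 family is a plain 48-bit LCG; a minimal model (POSIX constants a = 0x5DEECE66D, c = 0xB, modulus 2^48; the seed below is a hypothetical example) looks like:

```python
# Minimal model of the 48-bit LCG behind mrand48 (POSIX constants).
A, C, MASK = 0x5DEECE66D, 0xB, (1 << 48) - 1

def lcg48(seed):
    """Yield successive 48-bit LCG states: s' = (A*s + C) mod 2^48."""
    state = seed & MASK
    while True:
        state = (A * state + C) & MASK
        yield state

gen = lcg48(0x1234ABCD330E)              # hypothetical seed
xs = [next(gen) for _ in range(3)]
# mrand48 itself returns the top 32 bits of each state as a signed integer.
print([x >> 16 for x in xs])
```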

The results are shown in Table 6.2. The XC2V4000-6 FPGA belongs to


Figure 6.10: PDF of the generated noise with 17 approximations for f and 6 for g1 for a population of four million. The p-values of the χ2 and A-D tests are 0.00002 and 0.0084 respectively.


Figure 6.11: PDF of the generated noise with 59 approximations for f and 21 for g1 for a population of four million. The p-values of the χ2 and A-D tests are 0.0012 and 0.3487 respectively.


Figure 6.12: PDF of the generated noise with 59 approximations for f and 21 for g1 with two accumulated samples for a population of four million. The p-values of the χ2 and A-D tests are 0.3842 and 0.9058 respectively.


Figure 6.13: Scatter plot of two successive accumulated noise samples (y_i, y_{i+1}) for a population of 10000. No obvious correlations can be seen.

Table 6.2: Performance comparison: time for producing one billion Gaussian noise samples. All PCs are equipped with 1GB DDR-SDRAM.

platform              speed [MHz]   method      time [s]
XC2V4000-6 FPGA       105           96% usage   1
XC2V4000-6 FPGA       126           32% usage   2.6
XC2V4000-6 FPGA       133           10% usage   7.5
XC2S300E-7 FPGA       62            90% usage   16
Intel Pentium 4 PC    2600          Ziggurat    50
AMD Athlon PC         1400          Ziggurat    72
Intel Pentium 4 PC    2600          Polar       147
AMD Athlon PC         1400          Polar       214

the Xilinx Virtex-II family, while the XC2S300E-7 FPGA belongs to the Xilinx Spartan-IIE family. It can be seen that our hardware designs are faster than the software implementations by factors of 3 to 200, depending on the device used and the resource utilization. Such speedups are mainly due to the ability of FPGAs to perform bit-level and parallel operations, which results in more efficient use of silicon area for a given design than on general-purpose microprocessors.

Figure 6.14 shows how the number of noise generator instances affects the output rate. While ideally the output rate would scale linearly with the number of noise generator instances (dotted line), in practice it grows more slowly than this, because the clock speed of the design deteriorates as the number of noise generators increases. This deterioration is probably due to increased routing congestion and delay. We are able to fit up to nine instances on the Virtex-II XC2V4000-6, which together can generate almost one billion noise samples per second.
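The scaling behaviour can be illustrated from the clock rates quoted in this section (133MHz for one instance, 126MHz for three), assuming for this sketch that each instance produces one sample per cycle:

```python
# Output rate = instances * clock, assuming one sample per instance per cycle.
achieved_clock = {1: 133e6, 3: 126e6}   # instances -> achieved clock (Hz)
for n, clk in sorted(achieved_clock.items()):
    actual = n * clk / 1e6              # Msamples/s achieved
    ideal = n * 133.0                   # Msamples/s if the clock held at 133 MHz
    print(f"{n} instance(s): {actual:.0f} vs ideal {ideal:.0f} Msamples/s")
```

Already at three instances, the achieved 378 Msamples/s falls short of the ideal 399 Msamples/s.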

We have used our noise generator in LDPC decoding experiments [74]. Although the output precision of our noise generator is 32 bits, 16 bits are found to be sufficient for our LDPC decoding experiments (other applications, such as financial modeling [14], may require higher precision). To obtain a benchmark, we performed LDPC decoding using a full-precision (64-bit floating-point) software implementation of belief propagation in which the noise samples are also of full precision. We then performed decoding using the LDPC algorithm but with noise samples created using the design presented in this chapter. Over many simulations, we have found no distinguishable difference in code performance, even in the high Eb/N0 (high SNR) regions where the error floor in BER is as low as 10^-9 (10^12 codewords are simulated). Generating 10^12 noise samples takes over 11 hours on a 2.6GHz Pentium 4 PC, whereas a single instance of our hardware noise generator takes just over two hours. On a PC, where LDPC encoding, noise generation and LDPC decoding are performed sequentially, the simulation time for 10^12 codeword samples is considerably longer still, since all three modules must run; in our hardware simulation, however, all three modules run in parallel. Although our hardware LDPC decoder is currently at a preliminary stage (implemented serially), it has a throughput of around 500Kbps, over 20 times faster than our PC-based simulations. We are currently implementing a fully parallel, scalable decoder, which we predict will be several orders of magnitude faster than traditional software simulations.
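The hardware timing figure above can be checked directly, assuming a single instance sustains the 133 Msamples/s rate quoted earlier:

```python
# Time for 10^12 noise samples from one 133 Msample/s hardware instance.
hours = 1e12 / 133e6 / 3600
print(f"{hours:.2f} hours")   # about 2.1 hours, i.e. "just over two hours"
```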

Figure 6.14: Variation of output rate against the number of noise generator instances (x axis: number of instances; y axis: million samples per second).

Comparing our implementation with other hardware Gaussian noise generators, the only implementation known on a Xilinx FPGA is the AWGN core [186]

from Xilinx. This implementation follows the ideas presented in [12]. Although this core is around twice as fast as, and four times smaller than, our design, it is only capable of a maximum σ value of 4.7 (whereas we can achieve 6.7σ and beyond). In addition, we have subjected this design to our statistical tests, and found that its noise samples fail the χ2 test after around 200,000 samples. Hence, we find the design inadequate for our low-BER, high-quality LDPC decoding experiments.
