Bottleneck for the system when operating at high speed

Chapter 3 System Overview…

3.3 Bottleneck for the system when operating at high speed

In Sections 3.1 and 3.2, we designed a system following the 802.16a specification and picked a channel model for simulating it. In this section, we use a theoretical calculation to discuss the possible limit on the mobility rate that the 802.16a specification can support.

As described in Chapter 2, OFDM systems use a cyclic prefix (CP) to mitigate the ISI caused by delay spread. Roughly speaking, the CP time interval must be longer than the maximum delay tap. A better-defined constraint on the CP length, based on the ISI contribution to the decoder SIR (Signal to Interference Ratio), is given in [31]: the channel impulse response up to T_cp must contain at least a fraction 1-θ_cp of the total impulse energy, where θ_cp is chosen according to the system SIR requirements.

Typical values for θ_cp range from 0.02 to 0.25, depending on SIR[32].

If we set the SIR to 10 dB in our system, θ_cp is equal to 0.1. To capture (1-θ_cp) = 90% of the impulse energy of the worst-case delay spread in our chosen channel model [19], we require T_cp ≈ 10 µs, while in our PHY design the CP length is ≈ 20 µs, which is double the length we need. Thus, our system simulation should be free of ISI and orthogonality problems.
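The energy-capture rule above can be sketched in a few lines of Python. This is only an illustration: the function name `min_cp_length` and the 6-tap power delay profile are hypothetical and are not the profile of the channel model in [19].

```python
def min_cp_length(delays_us, powers_db, theta_cp):
    """Smallest tap delay T such that the taps with delay <= T carry
    at least a fraction (1 - theta_cp) of the total impulse energy."""
    powers = [10 ** (p / 10) for p in powers_db]  # dB -> linear power
    total = sum(powers)
    captured = 0.0
    for delay, power in sorted(zip(delays_us, powers)):
        captured += power
        if captured >= (1 - theta_cp) * total:
            return delay
    return max(delays_us)


# Hypothetical power delay profile, for illustration only.
delays_us = [0.0, 2.0, 4.0, 6.0, 10.0, 14.0]
powers_db = [0.0, -3.0, -6.0, -9.0, -12.0, -15.0]

t_cp_90 = min_cp_length(delays_us, powers_db, theta_cp=0.1)   # 90% of the energy
t_cp_75 = min_cp_length(delays_us, powers_db, theta_cp=0.25)  # 75% of the energy
```

A smaller θ_cp demands a larger share of the energy and therefore a longer CP, which is why the chosen SIR target drives the CP length.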

Due to the Doppler shift in the time varying channel response, the received OFDM symbol may be distorted by the time-domain smearing of the signal. We roughly define the constraint

T_b < θ_d · τ_chan    (3.20)

where T_b is the data period, τ_chan is the channel coherence time, and θ_d is a predetermined fraction of the channel coherence time, which in turn depends inversely on the mobility rate. Equation (3.20) means that the data period T_b should be shorter than this fraction of the channel coherence time, so that the signal is not distorted.

Previous simulation studies [32] and the parameters actually used in fielded OFDM systems indicate that θ_d is rarely larger than 0.1.

By (3.21), the maximum Doppler frequency is 279.32 Hz; through (3.2), the maximum supported mobility rate is 41.9 m/s, which means that about 150 km/h is supported.
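As a rough check of the last step, the sketch below converts a Doppler frequency into a speed. It assumes that (3.2) is the usual Doppler relation f_d = v·f_c/c, treats the 279.32 figure as a frequency in hertz, and assumes a 2 GHz carrier. The carrier value is not stated in this section; it is chosen here only because it reproduces the quoted 41.9 m/s.

```python
SPEED_OF_LIGHT = 3.0e8  # m/s


def max_speed(doppler_hz, carrier_hz):
    """Speed v that produces the given maximum Doppler shift: v = f_d * c / f_c."""
    return doppler_hz * SPEED_OF_LIGHT / carrier_hz


# Assumed carrier frequency (hypothetical, see the note above).
v_ms = max_speed(279.32, carrier_hz=2.0e9)
v_kmh = v_ms * 3.6
```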

In this chapter, we introduced our system design, the channel parameters, the channel coefficients for the simulator, and the possible maximum mobility rate. We move on to the implementation stage in the next chapter.

Chapter 4 Hardware Implementation

In the previous chapters, we gave the background, including IEEE 802.16a, the system parameters, and the channel environment. In this chapter, we discuss the hardware implementation of the channel simulator.

4.1 Quixote Board[34]

The DSP-FPGA embedded card used in our system is Innovative Integration's Quixote-III, shown in Figure 4.1. It is a PCI-bus-compatible card housing one Texas Instruments (TI) TMS320C6416 digital signal processor (DSP) and one Xilinx Virtex-II XC2V6000 field-programmable gate array (FPGA) in a symmetric multiprocessing relationship, with high-bandwidth communication links between the DSP and the FPGA. The block diagram of Quixote-III is shown in Figure 4.2.

Fig 4.1: Quixote-III [34]

The main features are as follows:

1. TMS320C6416 processor running at up to 600 MHz.

2. On-board 32 MB SDRAM for the DSP chip, enhanced cache controllers, 64 DMA channels, 3 McBSP synchronous serial ports, and two 32-bit timers.

3. A 32/64-bit PCI bus host interface with direct host memory access capability for bus-mastering data between the card and the host memory.

4. On-board 40 MB/s FIFO port for fast data transmission between Innovative Integration's DSP boards.

5. 2-in, 2-out A/D and D/A conversion, 14-bit, DC to 105 MHz.

Fig 4.2: Block diagram of Quixote-III [34]

4.1.1 DSP Chip Introduction

The DSP chip used is TI's TMS320C6416. It employs the “VelociTI” architecture, a variant of the traditional VLIW architecture, which consists of multiple execution units running in parallel and executing multiple instructions per cycle. The TMS320C6416 is a fixed-point DSP with 8 functional units running at 600 MHz, giving 4800 million instructions per second. Its internal memory includes a two-level cache architecture with 16 KB of L1 data cache, 16 KB of L1 program cache, and 1 MB of L2 cache for data/program allocation. On-chip peripherals include two multichannel buffered serial ports (McBSPs), two timers, a 16-bit host port interface (HPI), and a 32-bit external memory interface (EMIF). Internal buses include a 32-bit program address bus, a 256-bit program data bus to accommodate eight 32-bit instructions, two 32-bit data address buses, two 64-bit data buses, and two 64-bit store data buses. The block diagram of the TMS320C6416 is shown in Figure 4.3.

Fig. 4.3: Block diagram of TMS320C6416

Fig. 4.3 gives an overall picture of the chip, including the CPU. Fig. 4.4 further illustrates the detailed structure of the C64x CPU.

Fig. 4.4: C64X CPU structure [35]

The C64x CPU consists of 2 general-purpose register files (A and B), 8 functional units (L1, S1, M1, D1, L2, S2, M2, and D2), 2 load-from-memory paths (LD1 and LD2), 2 store-to-memory paths (SD1 and SD2), 2 data address paths (DA1 and DA2), and 2 register cross data paths (1X and 2X). The functional units support several operand widths, such as 16×16, 32×16, and 8×8; detailed information is given in Table 4.1 and Table 4.2.

Table 4.1: C64x CPU Function units and Operation performed [35]

Table 4.2: C64X CPU Function units and Operation performed [35]

4.1.2 Xilinx FPGA Chip [36]

The Xilinx Virtex-II XC2V6000 is produced in a 0.15 µm, 8-layer metal process, with a running frequency of up to about 300 MHz depending on the designed circuit. It has 33792 slices, amounting to 6M system gates, which explains the name “XC2V6000”.

A “slice” is the smallest area unit in FPGA design; it consists of 2 FF/LAT units (flip-flop/latch), 2 four-input LUTs (look-up tables), and other logic. Figure 4.5 shows a general slice diagram. By routing slices together in different combinations, an FPGA chip can implement many kinds of complex circuits.

Fig. 4.5 : General Slice Diagram [36]

The main constraints of the FPGA chip are the area and the number of multipliers. If a circuit needs more slices than the chip provides, it will fail to fit in the FPGA chip. Another barrier is the usage of multipliers. The Virtex-II XC2V6000 offers 144 multipliers in total; once the multiplier usage of the circuit exceeds this limit, the circuit will fail to fit in the chip, too. We list the key features and constraints below.

1. 6M system gates

2. 33792 slices

3. 144 multipliers

4. 16 GCLKs (global clocks)

5. 67584 slice flip-flops and 67584 4-input LUTs

4.1.3 Data Transmission Mechanism

In this section, we will introduce the data transmission mechanism between the Host PC and Quixote-III.

There are two transmission interfaces between the host PC and Quixote. The primary one is the busmaster interface, a streaming model in which the data is logically an infinite stream between the source and the destination. This model is efficient because the signaling between the two parties can be kept to a minimum and transfers can be buffered for maximum throughput. On the other hand, the streaming model can have relatively high latency for a particular piece of data, because a data item may remain in internal buffering until subsequent data accumulates to allow an efficient transfer. Theoretically, the transmission rate can reach the limit of the PCI interface, which is around 1065 Mbps.

The other transmission interface is the message interface. It is a lower-bandwidth communication link between the host PC and the target DSP for sending commands or side information, with a transmission rate of around 20 to 60 kbps.

4.2 Fundamental Functions for Channel Coefficients

In Section 3.2.2, we introduced three channel coefficient models for Rayleigh fading channel simulation: the Clarke model (3.4) to (3.6), the Jakes model (3.8) to (3.10), and the Xiao model (improved Jakes model) (3.13) to (3.15). Among them, the Clarke model is too complex for hardware implementation because it needs three random number generators. So, we implement only the Jakes model and the Xiao model on the FPGA chip, and we use the Xiao model to generate a set of 6-ray channels simultaneously, supporting a maximum Doppler shift of 2000 Hz for our system.

As we can observe from those channel model equations, their most important part is the trigonometric function. We introduce three trigonometric function generators in Section 4.2.1 and compare their advantages and disadvantages. In Section 4.2.2, we also introduce a binary pseudo random number generator for the implementation of the Xiao model. In this thesis, all syntheses are done with ISE 6.0, developed by Xilinx.

4.2.1 Trigonometric Function Generator

Generally speaking, there are many methods for generating trigonometric functions. We introduce three of them, all of which are popular.

Look-Up Tables

The look-up table (LUT) approximation is an extensively documented technique for trigonometric functions. A LUT of length 2^X samples describes one period of a unit-amplitude sinusoid: the LUT is addressed by the X most significant bits of the input phase φ, which yields the sinusoidal value at the nearest sample.

The advantages of LUT-based generation are its simple structure and its speed in producing sinusoidal values. Its main disadvantage is the large table size.

Fortunately, in a channel simulator implementation, the channel coefficients are characterized only statistically, so fine precision is not required. Thus, we can reduce the size of our LUTs.
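The addressing scheme described above (a table of 2^X samples indexed by the X most significant bits of the phase) can be sketched as follows. The widths `X = 8` and `PHASE_BITS = 16` are illustrative assumptions and not the values of our design; only the 10 fractional bits follow the sine/cosine output format used in this chapter.

```python
import math

X = 8             # address width: the table holds 2**X samples (assumed value)
PHASE_BITS = 16   # width of the phase word (assumed value)
FRAC_BITS = 10    # fractional bits of each stored sample

# One full period of a unit-amplitude sine, stored as integers.
SINE_LUT = [round(math.sin(2 * math.pi * k / 2**X) * 2**FRAC_BITS)
            for k in range(2**X)]


def lut_sine(phase):
    """phase is an unsigned PHASE_BITS-bit word; full scale is one period.
    Only the X most significant bits are used as the table address."""
    address = (phase >> (PHASE_BITS - X)) & (2**X - 1)
    return SINE_LUT[address] / 2**FRAC_BITS


quarter = lut_sine(1 << (PHASE_BITS - 2))        # phase = 1/4 period
three_quarters = lut_sine(3 << (PHASE_BITS - 2)) # phase = 3/4 period
```

Dropping the low phase bits bounds the error by roughly one table step (2π/2^X radians), which is the precision a smaller table trades away.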

In our work, we implement the LUT with three approaches. First, we divide 0 to π/2 into 180 parts and use the symmetry of the trigonometric functions to find the values outside 0 to π/2. This method takes 2 clock cycles to complete, and its advantage is that it consumes less FPGA chip area. The other two approaches directly divide 0 to 2π into 360 parts and 720 parts, respectively. They may consume more area, but they are fast. An example of the direct 720-part LUT is shown in Fig 4.6.

Fig 4.6: Example of direct LUTs-720 division

In the LUTs, sine and cosine are 12 bits wide: the first bit is the sign bit (0 for positive, 1 for negative), the second bit is the integer part, and the remaining 10 bits are the fractional part.

The input data are 18 bits wide: the first bit is the sign bit, the following 12 bits are the integer part, and the remaining 5 bits are the fractional part. We give an input of theta=18'b0_000011110000_00000, which is 240 degrees, and the cosine output is directly given as 12'b1_1_0111111111, which is approximately -0.5, as expected.

Other implementation results are shown in Table 4.3. From Table 4.3, we find that the data rates are all around 100 M per second, and the occupied areas are all small compared to the whole FPGA chip. So, we decide to use the LUT that directly divides 0 to 2π into 720 parts as our LUT method.
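The 240-degree example can be reproduced with a small software model of the 720-entry table. The sketch assumes that both fixed-point words are two's complement, which this section does not state explicitly; the helper names `parse_fixed` and `lut_cos` are hypothetical.

```python
import math

STEPS = 720     # the table covers 0..2*pi in 720 parts (0.5 degree each)
FRAC_BITS = 10  # fractional bits of the 12-bit sine/cosine word

COS_LUT = [round(math.cos(2 * math.pi * k / STEPS) * 2**FRAC_BITS)
           for k in range(STEPS)]


def parse_fixed(literal, frac_bits):
    """Decode a binary literal such as '0_000011110000_00000' as a
    two's-complement fixed-point number (assumed number format)."""
    digits = literal.replace("_", "")
    raw = int(digits, 2)
    if digits[0] == "1":          # sign bit set: subtract 2**width
        raw -= 1 << len(digits)
    return raw / (1 << frac_bits)


def lut_cos(theta_deg):
    """Cosine from the direct 720-entry table, nearest entry."""
    index = round(theta_deg * STEPS / 360) % STEPS
    return COS_LUT[index] / 2**FRAC_BITS


theta = parse_fixed("0_000011110000_00000", frac_bits=5)  # 240.0 degrees
model_output = lut_cos(theta)                             # software model
quoted_output = parse_fixed("1_1_0111111111", frac_bits=10)
```

Under the two's-complement assumption, the quoted output word decodes to -513/1024, which is within one least significant bit of the model's -0.5.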

Taylor Series

The Taylor series is one of the most popular methods for calculating sinusoidal values and is widely used in software. In hardware design, however, and especially on an FPGA, the number of multipliers is always limited. Evaluating (4.2) takes 4 multiplications.

If we calculate trigonometric functions with the Taylor series, the need for multipliers becomes a barrier. Moreover, the Taylor series can only handle angles from -π/2 to π/2, so we have to use symmetry to find the values outside that range. It therefore takes two clock cycles to find the right value. We also give a demonstration of the Taylor series cosine function.

Fig 4.7: Example of Taylor series cosine function

For the Taylor series, the input angle is 18 bits wide: the first bit is the sign bit, the following 6 bits are the integer part, and the remaining 11 bits are the fractional part. Sine and cosine are 12 bits wide: the first bit is the sign bit, the second bit is the integer part, and the remaining 10 bits are the fractional part.

In Fig 4.7, we input an angle "theta" of 18'b0_000011_00100101000, which is almost 3.14 radians. The first step is to transform "theta" into "thetaR", which lies between 0 and π/2; the result is thetaR=18'b0_000000_00000001000, which is almost 0. The corresponding cosine value is 12'b1_1_0000000000, which is -1, so our output is confirmed.
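The two steps of this example (folding the angle into the range the series handles, then evaluating a truncated series) can be sketched in floating point. The truncation after the x^6 term is an assumption made for illustration; it is not necessarily the form of (4.2), and the sketch does not model the fixed-point word lengths.

```python
import math


def cos_taylor(x):
    """cos(x) for |x| <= pi/2: 1 - x^2/2 + x^4/24 - x^6/720 in Horner form."""
    x2 = x * x
    return 1 - x2 / 2 * (1 - x2 / 12 * (1 - x2 / 30))


def cos_any(theta):
    """Cosine of any angle: fold into [-pi/2, pi/2] by symmetry first."""
    theta = math.fmod(theta, 2 * math.pi)   # now in (-2*pi, 2*pi)
    if theta > math.pi:
        theta -= 2 * math.pi
    elif theta < -math.pi:
        theta += 2 * math.pi                # now in [-pi, pi]
    if theta > math.pi / 2:
        return -cos_taylor(math.pi - theta)  # cos(t) = -cos(pi - t)
    if theta < -math.pi / 2:
        return -cos_taylor(theta + math.pi)  # cos(t) = -cos(t + pi)
    return cos_taylor(theta)


# Input of the example above: 3 + 296/2048 radians, slightly more than pi.
result = cos_any(3 + 296 / 2048)
```

With this truncation the worst error on [-π/2, π/2] stays just below 0.001, which is about one least significant bit of a 10-bit fraction.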

CORDIC [39]

The third method is CORDIC (COordinate Rotation DIgital Computer). CORDIC is an iterative algorithm for calculating trigonometric functions, including sine, cosine, magnitude, and phase. It generates trigonometric function values by rotating a vector step by step, multiplying it by a succession of constant values.

However, the multiplications can all be by powers of 2, so in binary arithmetic they can be replaced with shifts and additions. CORDIC is therefore particularly suited to hardware implementation, because it needs no multipliers.

Imagine a fan built from a sequence of right triangles. The longest leg of each triangle is hinged to the hypotenuse of the preceding one (as shown in Figure 4.8). The angles that meet at the vertex of the fan are all acute and decrease steadily toward zero. If the desired angle is larger than the accumulated angle, we add the next triangle in the positive direction, toward the desired angle, and vice versa, until the accumulated angle reaches the desired angle. The sizes of those angles are listed in Table 4.4. Note that each angle is larger than half of the preceding one, so the accumulated angle can converge.

Table 4.4: The accumulating angles

From the coordinate view, as shown in Figure 4.9, each rotation updates the x and y coordinates together, so CORDIC produces the cosine and sine values simultaneously. In addition, the range of CORDIC is limited to -π/2 to π/2, so we have to use symmetry to find the values outside this range.

Fig 4.10: Example of CORDIC

In CORDIC, sine and cosine are 12 bits wide: the first bit is the sign bit (0 for positive, 1 for negative), the second bit is the integer part, and the remaining 10 bits are the fractional part. The input data are 18 bits wide: the first bit is the sign bit, the following 12 bits are the integer part, and the remaining 5 bits are the fractional part. We let CORDIC rotate 8 times to reach the desired angle, and it takes another 2 clock cycles to transform the angle to less than 90 degrees.

For example, we input an angle "theta" of 120 degrees, which in binary is 18'b0_000001111000_00000. The circuit first transforms theta into "thetaR", which lies between 0 and 90 degrees; here it is 18'b0_000000111100_00000, representing 60 degrees. When the counter "p" is equal to 9, the cosine and sine outputs are 12'b1_1_1000000100 and 12'b0_0_1101110111, respectively. They are -0.49999 for cosine and 0.86718 for sine, while the true values are -0.5 and 0.86602. They are very close.
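The rotation procedure can be modeled in software with integer shifts and additions only. This is a minimal sketch of the standard rotation-mode CORDIC with 8 iterations, using the word formats described above (angles in degrees with 5 fractional bits, outputs with 10 fractional bits). The quadrant folding and rounding choices are assumptions and need not match our circuit bit for bit.

```python
import math

ITERATIONS = 8
FRAC_BITS = 10        # fractional bits of the sine/cosine outputs
ANGLE_FRAC_BITS = 5   # fractional bits of the angle word (degrees)

# Elementary rotation angles atan(2^-i), in degrees, as fixed-point integers.
ANGLES = [round(math.degrees(math.atan(2.0 ** -i)) * 2**ANGLE_FRAC_BITS)
          for i in range(ITERATIONS)]

# Each rotation stretches the vector; start from 1/gain to compensate.
gain = 1.0
for i in range(ITERATIONS):
    gain *= math.sqrt(1 + 4.0 ** -i)
X0 = round(2**FRAC_BITS / gain)


def cordic_first_quadrant(angle_fixed):
    """Rotate (X0, 0) toward angle_fixed (0..90 degrees, fixed point)."""
    x, y, z = X0, 0, angle_fixed
    for i in range(ITERATIONS):
        if z >= 0:   # accumulated angle too small: rotate counterclockwise
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:        # accumulated angle too large: rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return x, y


def cordic_cos_sin(theta_deg):
    """Fold the angle into 0..90 degrees, run CORDIC, restore the signs."""
    theta = theta_deg % 360
    if theta <= 90:
        folded, sign_cos, sign_sin = theta, 1, 1
    elif theta <= 180:
        folded, sign_cos, sign_sin = 180 - theta, -1, 1
    elif theta <= 270:
        folded, sign_cos, sign_sin = theta - 180, -1, -1
    else:
        folded, sign_cos, sign_sin = 360 - theta, 1, -1
    x, y = cordic_first_quadrant(round(folded * 2**ANGLE_FRAC_BITS))
    return sign_cos * x / 2**FRAC_BITS, sign_sin * y / 2**FRAC_BITS


cos_120, sin_120 = cordic_cos_sin(120)
```

With only 8 rotations the residual angle is about half a degree, so this model is accurate to roughly two decimal places, which is in line with the statistical precision a channel simulator needs.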

There is a pitfall when implementing the CORDIC algorithm: when the desired angle is close to 0, a simple shift and addition causes an error. If we simply shift a negative number, the value becomes positive because a 0 is shifted into the MSB. So, we have to detect this case and correct it. As shown in Fig. 4.10, SIN_minus and COS_minus are designed to prevent the error from happening.
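The shift problem can be reproduced on a 12-bit word. The sketch below contrasts a plain (logical) right shift with a sign-extending (arithmetic) one. It illustrates the general idea only and does not describe how SIN_minus and COS_minus are wired in our circuit; the two's-complement format is an assumption.

```python
WIDTH = 12
MASK = (1 << WIDTH) - 1


def to_signed(word):
    """Interpret a WIDTH-bit word as a two's-complement integer."""
    word &= MASK
    return word - (1 << WIDTH) if word >> (WIDTH - 1) else word


def logical_shift_right(word, n):
    """Plain shift: zeros enter at the MSB, so a negative value turns positive."""
    return (word & MASK) >> n


def arithmetic_shift_right(word, n):
    """Sign-extending shift: copies of the sign bit enter at the MSB."""
    word &= MASK
    shifted = word >> n
    if word >> (WIDTH - 1):                       # negative input
        shifted |= (MASK << (WIDTH - n)) & MASK   # refill the top n bits with 1s
    return shifted


minus_eight = -8 & MASK                           # 12'b111111111000
wrong = to_signed(logical_shift_right(minus_eight, 2))
right = to_signed(arithmetic_shift_right(minus_eight, 2))
```

Here `wrong` comes out as a large positive number while `right` is -2, the expected result of dividing -8 by 4.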

From the description above, we can conclude that CORDIC saves multipliers; on the other hand, it is slow at generating sinusoidal values, since it takes several iterations to reach the desired angle. In our design, we rotate the CORDIC 8 times to simplify the circuit design.

The implementation results of LUTs, Taylor series, and CORDIC are shown in Table 4.5.

From Table 4.5, we can clearly see that the Taylor series method gives the fastest data rate and the smallest FPGA chip area. However, its multiplier usage reaches 4.2% of the total multipliers the FPGA chip can offer. The LUT method generates the sinusoidal values without using any multipliers. As for CORDIC, the performance is a little disappointing: the data rate is only 14 M per second, and the occupied area is almost as large as that of the LUT method. We also list a comparison of these three algorithms in Table 4.6.

Table 4.5: Comparison of Trigonometric Functions

Table 4.6: Comparison of Trigonometric Algorithms

As mentioned at the beginning of this section, in a channel simulator implementation the channel coefficients are characterized only statistically, so fine precision is not required. The occupied FPGA chip area and the usage of multipliers are the main constraints. From this point of view, we conclude that the LUT is probably the most suitable trigonometric function algorithm for an FPGA implementation of a channel simulator.

4.2.2 Pseudo Random Binary Number Generator

After introducing the trigonometric function generator, we still need a random number generator to implement the Xiao model.

Generating random numbers has been a hard problem for engineers for decades. Security protocols and encryption algorithms fundamentally rely on random number generators. Over the years, the literature has proposed numerous random number generator architectures, but most of them are very complex and difficult to implement on an FPGA chip, especially when a channel model must be implemented on the same chip.

In this thesis, we use a simple algorithm, a pseudo random number generator, to implement the binary random number generator for our system. Its primary advantages are that it is easily generated by feedback shift registers and that its correlation function is highly peaked at zero delay, which means that the randomness of the sequence is high. However, because the sequence is predetermined, we call it pseudo random.

Figure 4.11 illustrates the generation of a pseudo random binary sequence of length 2ⁿ−1, which is accomplished with a set of shift registers. After each shift of the contents of the shift register to the right, the contents of some predetermined register positions are combined through an exclusive-or operation to produce the input to the first stage. As the procedure goes on, the bits in the registers form a set of pseudo random binary numbers. Proper feedback connections for several values of n are given in Table 4.8. We use X²³+X⁵+1.
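The shift-and-feedback procedure can be sketched as a software model of a feedback shift register for a polynomial of the form X^n + X^k + 1. The tap convention used here (recurrence a[t+n] = a[t+k] XOR a[t]) is one common choice and is an assumption; it may differ from the exact wiring of Figure 4.11. The seed in the last line is hypothetical.

```python
def lfsr_bits(n, k, seed, count):
    """Output `count` bits of a shift register for X^n + X^k + 1.
    Bit i of `state` holds a[t+i]; the output is the bit shifted out
    on the right, and the XOR of two taps enters the first stage."""
    if not 0 < seed < (1 << n):
        raise ValueError("seed must be a non-zero n-bit value")
    state = seed
    bits = []
    for _ in range(count):
        bits.append(state & 1)
        feedback = (state ^ (state >> k)) & 1       # a[t] XOR a[t+k]
        state = (state >> 1) | (feedback << (n - 1))
    return bits


# Small example that is easy to check by hand: X^4 + X + 1, period 2^4 - 1 = 15.
small = lfsr_bits(4, 1, seed=0b0001, count=30)

# Same routine with the polynomial used here, X^23 + X^5 + 1.
prbs23 = lfsr_bits(23, 5, seed=0x5A5A5, count=64)
```

Only shifts and a single exclusive-or are needed per output bit, which is why the generator is cheap in hardware.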

Table 4.8: Feedback connections for generation of pseudo random sequence

The autocorrelation of the sequence is shown in Figure 4.12. Generally speaking, for a sequence of length N, the minimal correlation is -1/N [33]. The autocorrelation function of a pseudo random binary sequence consists of a narrow triangle around zero and is essentially zero elsewhere, so the sequence looks random even though it is generated deterministically; this explains the name “pseudo random”.
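The two-level shape of the autocorrelation can be checked numerically on a short sequence. The sketch uses X^4 + X + 1 (length N = 15) so that the result is easy to verify; the generator follows the same assumed tap convention as a standard Fibonacci shift register and is not a model of our 23-stage design.

```python
def m_sequence(n, k):
    """One period (2^n - 1 bits) of the sequence generated by X^n + X^k + 1."""
    state, bits = 1, []
    for _ in range((1 << n) - 1):
        bits.append(state & 1)
        feedback = (state ^ (state >> k)) & 1
        state = (state >> 1) | (feedback << (n - 1))
    return bits


def periodic_autocorrelation(bits, lag):
    """Normalized periodic autocorrelation after mapping bit 0 -> +1, 1 -> -1."""
    length = len(bits)
    symbols = [1 - 2 * b for b in bits]
    total = sum(symbols[i] * symbols[(i + lag) % length] for i in range(length))
    return total / length


sequence = m_sequence(4, 1)                       # N = 15
peak = periodic_autocorrelation(sequence, 0)      # zero delay
off_peak = [periodic_autocorrelation(sequence, lag) for lag in range(1, 15)]
```

The value at zero delay is 1, and every other delay gives -1/N, which matches the minimal correlation quoted above.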

Fig 4.12: Correlation function of a Pseudo Random sequence [37]

n     Sequence Length     Feedback digit
13    8191                (12,6,4,1,0)
25    33554431            (24,4,3,1,0)

In our work, we use 8 pseudo random binary generators simultaneously in the Jakes model implementation. The random seeds (initial values of each stage) are given in Table 4.9. The FPGA implementation results follow.

Table 4.10: The FPGA implementation results of PN generator.

As we expected, the occupied FPGA chip area is small and the speed is high, since the generator uses only shift registers and exclusive-or operations (that is, binary addition without carry).

4.3 Channel Simulator

Having introduced the fundamental functions for the channel coefficients, we can now implement the Rayleigh fading channel simulator. We will implement two single-ray channels, following the Jakes model and the Xiao model introduced in the preceding chapter. We will also implement a 6-ray Rayleigh fading channel to match the channel model chosen for our system simulation.

4.3.1 Single-ray Channel Simulator

In the previous sections, we introduced two important components of the channel coefficient models. Now we can implement a single-ray channel with the Jakes model and the Xiao model.

Jakes model

Once again we review the equations of the Jakes model, (3.8) to (3.10).

In our implementation, we set N=34, which is sufficiently large for this model. Even though we concluded that the LUT is the most suitable trigonometric function algorithm for an FPGA channel simulator, we still implement the Jakes model with the LUT, Taylor series, and CORDIC algorithms for comparison.

We demonstrate the Jakes model waveform below. In this example, the input Doppler shift is 200 Hz, represented by f=12'b000011001000 (all 12 bits are integer bits). The other input is t=21'b000000100111000100110, which represents t = 0.0095125 second. If we calculate the Jakes model on a computer, the channel coefficient is 1.51276 for the real part and -0.18775 for the imaginary part.

We first show the Jakes model implemented with CORDIC. We can see that two clocks control the circuit. The one with the smaller period controls the CORDIC; it takes
