
Chapter 3 Fast Convergent Adaptive Frequency Domain Equalizer

3.5 System Architecture and Performance

The proposed FDE operates according to the equations in Sections 3.2, 3.3, and 3.4, and its detailed block diagram is shown in Fig. 3-11. In the simulation, the signals are distorted by the channel model, corrupted by AWGN, and assumed to be perfectly synchronized. The system flow is as follows:

1. First, the channel estimator computes the filter coefficients from the training sequences using the LS method.

2. Once the training sequences are finished, the cyclic-prefixed data stream is transmitted, and the adaptive FDE equalizes the received signal.

3. After equalization, the signal is sent to the decision circuit, which functions as a slicer.

4. Using the error between the equalized and sliced signals, the adaptive FDE updates the filter coefficients with the LMS algorithm.
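The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the C model used in the thesis: the block length is 8 instead of 512, the channel is static and noise-free, the training block is a unit impulse (chosen only because its spectrum has no zero bins), the slicer is plain QPSK without the π/2 rotation, and the step size `mu` is arbitrary. The LS coefficient is taken as the known training spectrum divided by the received spectrum, and the LMS update as `W += mu * E * conj(R)`; the exact scaling of Eqns. (3.10) and (3.16) may differ. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a small illustrative block."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def circular_channel(x, h):
    """Circular convolution: what the receiver sees after cyclic prefix removal."""
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]


def slicer(z):
    """Hard decision onto the QPSK points +-1 +-1j."""
    return complex(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1)


def ls_train(train_freq, rx_freq):
    """Step 1: per-bin LS coefficient W = X / R (assumed form)."""
    return [x / r for x, r in zip(train_freq, rx_freq)]


def equalize_block(w, rx_time, mu):
    """Steps 2 to 4: one-tap equalization, slicing, and LMS coefficient update."""
    r = dft(rx_time)
    z_time = dft([wk * rk for wk, rk in zip(w, r)], inverse=True)
    decisions = [slicer(z) for z in z_time]
    err = dft([d - z for d, z in zip(decisions, z_time)])
    w_new = [wk + mu * ek * rk.conjugate() for wk, ek, rk in zip(w, err, r)]
    return decisions, w_new


# Demo with a hypothetical two-tap channel.
h = [1.0, 0.3 + 0.2j]
train = [1.0] + [0.0] * 7
w = ls_train(dft(train), dft(circular_channel(train, h)))

data = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j]
decisions, w_new = equalize_block(w, circular_channel(data, h), mu=0.002)
print(decisions == data)
```

With a perfect LS estimate and no noise, the error is essentially zero and the LMS step leaves the coefficients unchanged; with noise or a drifting channel, the same update keeps tracking the coefficients block by block.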

To evaluate the performance, we use a channel model based on the IEEE 802.15.3c standard group with Jakes' model, as mentioned in Section 2.2. The transmitted sequence consists of the CMS, preamble, PW, data, and PCES. The CMS and preamble are used for training, and the PW serves as the cyclic prefix. The simulation results are shown in Fig. 3-12. The entire test environment is written in C. For each test point, the transmitted sequence is 448000 samples long. Based on the standard, the error rate criterion is set to 1.54×10^-4 after any error correcting method. From the figure, our adaptive FDE requires about 10 dB Eb/N0 to meet this criterion. Compared with the optimal receiver, the loss is only 1.5 dB for both π/2 BPSK and π/2 QPSK.

The fixed-point simulation model is determined by the following procedure. First, we quantize the input to the minimum word length that causes no significant performance loss. Then, we quantize the next data path in the same way. Proceeding step by step, we find the word length of each data path while keeping the overall performance loss within a reasonable range.
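The step-by-step procedure can be sketched as a greedy search. The code below is an illustration only: `quantize`, `search_word_lengths`, the path names, and the toy loss function are hypothetical, and in the thesis the loss would come from a BER simulation of the fixed-point C model rather than from a lookup of assumed minimum lengths.

```python
def quantize(x, word_len, frac_bits):
    """Round x to signed fixed point (word_len bits total, frac_bits fractional), saturating."""
    scale = 1 << frac_bits
    lo = -(1 << (word_len - 1))
    hi = (1 << (word_len - 1)) - 1
    code = max(lo, min(hi, round(x * scale)))
    return code / scale


def search_word_lengths(paths, evaluate_loss, max_len=24, tol=0.1):
    """Shrink one data path at a time, in order, while the loss stays within tol."""
    chosen = {p: max_len for p in paths}
    for p in paths:                          # input first, then the next data path
        for wl in range(max_len - 1, 0, -1):
            trial = dict(chosen)
            trial[p] = wl
            if evaluate_loss(trial) > tol:
                break                        # one bit too few: keep the previous length
            chosen[p] = wl
    return chosen


# Toy stand-in for a BER simulation: loss appears once any path drops
# below a hypothetical minimum word length.
needed = {"input": 6, "fft_out": 10, "coeff": 8}


def toy_loss(word_lengths):
    return 0.0 if all(word_lengths[p] >= needed[p] for p in needed) else 1.0


result = search_word_lengths(list(needed), toy_loss)
print(result)
print(quantize(0.3, 8, 6))
```

Because each path is fixed before the next one is examined, the result depends on the order of the paths; the text fixes that order as input first, then each following data path.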

Fig. 3-11 Detailed block diagram of the proposed adaptive FDE

Fig. 3-12 Eb/N0 vs. Bit Error Rate (horizontal axis: Eb/N0 in dB, 0 to 12; vertical axis: BER, 10^-5 to 10^0; curves: AWGN, π/2 BPSK, π/2 QPSK, π/2 BPSK (fixed-point), π/2 QPSK (fixed-point))

Chapter 4 Architecture Design and Hardware Reduction

4.1 Design Specifications and Architecture

The IEEE 802.15.3c standard targets wireless communication at data rates above 1 Gbps. Two key features of the standard make this possible. The first is the use of the 60 GHz RF band: the unlicensed spectrum is wide enough to support a large signal bandwidth, and since the transmission rate is proportional to the bandwidth, using the unlicensed 60 GHz band is essential. The second is the ultra-high sampling rate. Although there are many ways to reach a high data rate, such as higher-order modulation or a multi-input multi-output (MIMO) system [26], raising the sampling rate is the most direct one, since the data rate is proportional to the sampling rate. With a moderate modulation scheme, the data rate can be two or three times the sampling rate, so a data rate above 1 Gbps is easily achieved. Based on these features, we proposed the LS-LMS combined FDE in Chapter 3. The block diagram in Fig. 3-11 is redrawn in Fig. 4-1 to reflect hardware design considerations. The following sections discuss our hardware design. The FFT and IFFT, however, are not design targets of this thesis.

Fig. 4-1 Revised block diagram of the proposed FDE

In modern CMOS processes, power consumption is becoming more and more important. There are many ways to reduce power consumption in hardware design, such as using an algorithm with low computational complexity, replacing a high-complexity arithmetic unit with a simpler one, or sharing hardware resources. These methods reduce the chip area and the switching power consumption. The leakage power is also reduced when the chip area is reduced.

4.2 Divider Free LS Method

In Section 3.2, Eqn. (3.10) shows that the LS method needs a complex division. There are two ways to avoid the complex division. One is to use the phase operation, as shown in Eqn. (4.1); the other is to multiply both the numerator and the denominator by the conjugate of the divisor, as shown in Eqn. (4.2).

512, 512,

The phase operation replaces the complex division with one square root function, two square functions, one scalar division, and one subtraction. However, the transformation between phasor form and complex form requires trigonometric functions, as shown in Eqn. (4.3). Although some practical designs exist, the hardware cost is still too high.

512 512, 512,

Eqn. (4.2) transforms one complex division into one complex multiplication, one square function, and one scalar division. This is the usual way to compute a complex division. However, it still contains one scalar division, which is much more complex than a multiplier [27].
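The conjugate method can be illustrated with a few lines of Python. This is a sketch only: `a` and `b` are generic operands standing in for the numerator and the divisor of the LS estimate, not the symbols of the thesis, and the function name is hypothetical.

```python
def complex_divide_conj(a, b):
    """Compute a / b as a * conj(b) / |b|^2 using real arithmetic only."""
    ar, ai, br, bi = a.real, a.imag, b.real, b.imag
    # Complex multiplication with one conjugated input: a * conj(b).
    num_re = ar * br + ai * bi
    num_im = ai * br - ar * bi
    # Power measurement of the divisor: |b|^2.
    power = br * br + bi * bi
    # The single remaining scalar division.
    inv = 1.0 / power
    return complex(num_re * inv, num_im * inv)


quotient = complex_divide_conj(3 + 4j, 1 - 2j)
print(quotient)
```

The only division left is the real-valued `1.0 / power`, which is the operation the following paragraphs replace with a lookup table and a multiplier.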

Since division is the inverse of multiplication, multiplying by the inverse of the scalar is a commonly used alternative. To obtain the inverse, we can use a table holding the inverse of every possible scalar value, which is easily implemented as a ROM, as illustrated in Fig. 4-2. The bit width is determined by the accuracy of the inverse, and the word width is determined by the word length of the scalar. According to the fixed-point C simulation, a bit width of 13 bits and a word width of 14 bits are needed to maintain the performance. Therefore, the size of the ROM is 2¹⁴ × 13 bits, which is about 213k bits. The cost is lower than that of a divider, but the ROM still takes a large area.

To reduce the size of the ROM, we can reduce the bit width or the word width. Since the bit width is already fixed by the required accuracy, we focus on reducing the word width. Inspecting the inverses shows that they are almost identical for nearby words. An example is shown in Eqn. (4.4): the difference between 1/128 and 1/129 is so small that it cannot be represented in 13 bits. Therefore, nearby scalars can all be mapped to the same inverse stored in the table, as illustrated in Fig. 4-3.

Fig. 4-2 Table of inversed scalar

This effect is more pronounced when the scalar is large, as shown in Eqn. (4.5), where N is the reference scalar and Δn is the difference. Moreover, the scalar is always positive, since it is the result of the square function. Hence, we can reconsider the structure of the scalar as in Fig. 4-4 and look up the table according to the significant bits only, regardless of the sign bit and the Δn bits. From the simulation results, the optimal length of the significant bits is 4. The reduced table is shown in Fig. 4-5, and its size is 2⁴ × 11 × 13, which is 2288 bits.
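The trend can be made explicit with a short derivation. This is a sketch in the notation of the text, with N the reference scalar and Δn the offset of a neighbouring scalar; it is not necessarily the exact form of Eqn. (4.5).

```latex
\frac{1}{N} - \frac{1}{N + \Delta n}
  = \frac{\Delta n}{N\,(N + \Delta n)}
  \approx \frac{\Delta n}{N^{2}}
  \qquad (\Delta n \ll N)
```

For N = 128 and Δn = 1 this gives 1/16512 ≈ 0.0000606, and the gap between neighbouring inverses shrinks roughly as 1/N² as N grows.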

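$$\frac{1}{128} = 0.0078125, \qquad \frac{1}{129} \approx 0.0077519$$

$$\frac{1}{128} - \frac{1}{129} \approx 0.0000606_{10} \approx 0.00000000000001_{2} = 2^{-14} \qquad (4.4)$$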

Fig. 4-3 Reduced mapping

Fig. 4-4 Structure of the scalar

Inverse of 1

Inverse of 2

Inverse of 3

…

Inverse of 15

Inverse of 16, 17

Inverse of 18, 19

…

Inverse of 30, 31

Inverse of 32, 33, 34, 35

Inverse of 36, 37, 38, 39

…

Fig. 4-5 Reduced table

Since the Δn bits are ignored, they can be treated as zeros. The scalar can then be written in another form, as illustrated in Eqn. (4.6), where SB denotes the significant bits.

$$\text{scalar} = SB \times 2^{\Delta n} \qquad (4.6)$$

These Δn bits amount to a left shift of SB by Δn positions, so the inverse of the scalar can be written as Eqn. (4.7).

$$\frac{1}{\text{scalar}} = \frac{1}{SB} \times 2^{-\Delta n} \qquad (4.7)$$

Therefore, the table has to store only 16 inverse scalars, which costs 208 bits of storage. Through this procedure, we replace a division with one small ROM and one multiplier. The block diagram is shown in Fig. 4-6. Compared with a real divider from DesignWare, the modified version has a smaller area and can satisfy the requirement of high clock rate operation. Furthermore, the method above reduces the size of the ROM by about 99.9%, from about 213k bits to 208 bits.
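The table lookup described by Eqns. (4.6) and (4.7) can be sketched as follows. This is an illustration under assumptions, not the thesis's RTL: the scalar is taken to be a positive integer of at most 14 bits, the significant bits are taken as the top 4 bits starting at the leading one, each table entry is rounded to 13 fractional bits, and floating point stands in for the fixed-point multiplier and shifter. The names `INV_TABLE` and `approx_inverse` are hypothetical.

```python
FRAC_BITS = 13   # accuracy of each stored inverse (from the text)
SB_BITS = 4      # number of significant bits (from the text)
SCALAR_BITS = 14 # word length of the scalar (from the text)

# 16-entry table indexed by SB; entry 0 is unused because the scalar is nonzero.
INV_TABLE = [0.0] + [
    round((1 << FRAC_BITS) / sb) / (1 << FRAC_BITS) for sb in range(1, 1 << SB_BITS)
]


def approx_inverse(scalar):
    """Approximate 1/scalar as INV_TABLE[SB] * 2**(-dn), as in Eqn. (4.7)."""
    if scalar <= 0:
        raise ValueError("scalar must be a positive integer")
    dn = max(scalar.bit_length() - SB_BITS, 0)  # number of ignored low bits
    sb = scalar >> dn                           # significant bits
    return INV_TABLE[sb] * 2.0 ** -dn


# ROM sizes quoted in the text.
FULL_ROM_BITS = (1 << SCALAR_BITS) * FRAC_BITS   # one entry per scalar value
REDUCED_ROM_BITS = (1 << SB_BITS) * FRAC_BITS    # one entry per SB value
SAVING = 1 - REDUCED_ROM_BITS / FULL_ROM_BITS

print(approx_inverse(128), approx_inverse(129))
print(FULL_ROM_BITS, REDUCED_ROM_BITS, round(100 * SAVING, 2))
```

Scalars that share the same significant bits, such as 128 and 129, map to the same table entry, which is the behaviour shown in Fig. 4-3; the shift by Δn then restores the magnitude.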

Fig. 4-6 Block diagram of modified divider

4.3 Hardware Sharing

4.3.1 Multiplier Sharing

To balance performance and power consumption, we combine LMS and LS. Both have low computational complexity, which means low power consumption. With the aid of LS, we obtain a better training result and improve the performance of LMS. However, LS needs additional area and power, which conflicts with our target of low-power design. To resolve this conflict, we apply hardware reduction techniques.

According to Section 3.5, the system enters the training stage first. After the training stage, the transmitter starts to transmit the data stream. These two stages clearly do not overlap. In the training stage, only the LS-related circuits operate, while the LMS and one-tap equalizer circuits are idle. Conversely, the LS-related circuits are idle in the data transmission stage, as illustrated in Fig. 4-7.

Fig. 4-7 Execution order of stages

According to Fig. 3-11 and Eqn. (4.2), LS channel estimation takes one complex multiplication with one conjugated input, one complex power measurement unit, and one divider, which is replaced by two multipliers. In total, the LS computation needs 8 real multipliers in the training stage. On the other hand, the LMS takes only one complex multiplication with one conjugated input, as shown in Eqn. (3.16), and the one-tap equalizer needs one complex multiplier. The FDE therefore also requires 8 real multipliers in the data transmission stage. Based on these observations, we list the operations required in each stage in Table 4-1 and work out how to share the hardware resources.

Table 4-1 Operation requirement of the proposed FDE

Training stage Data transmission stage

LS LMS One-tap EQ

(4 real multipliers) 0 Complex

(2 real multipliers) 0 0

Modified divider (scalar multiplier)

11 bits * 13 bits

(2 real multipliers) 0 0

Only the complex multiplier with one conjugated input has to be widened to 19 bits × 16 bits; the multipliers of the complex power measurement unit and of the modified divider are all shared with the one-tap EQ. Hence, the area of the combined circuit is reduced by 47% compared with the circuit without sharing. This reduction is significant, since these multipliers occupy 60% of the area of the proposed FDE before sharing.
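The multiplier counts used in this section follow from how a complex product decomposes into real products. The sketch below is illustrative Python rather than the thesis's RTL: the function names and the per-stage budget dictionary are hypothetical, and the split of the modified divider into two real multipliers (one each for the real and imaginary parts) is an assumption consistent with the text.

```python
def cmul(a_re, a_im, b_re, b_im):
    """(a_re + j a_im) * (b_re + j b_im) with four real multiplications."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


def cmul_conj(a_re, a_im, b_re, b_im):
    """a * conj(b): the same four products with the signs of two terms flipped."""
    return a_re * b_re + a_im * b_im, a_im * b_re - a_re * b_im


def power(b_re, b_im):
    """|b|^2 with two real multiplications."""
    return b_re * b_re + b_im * b_im


# Real multipliers needed per stage, as counted in the text.
REAL_MULTIPLIERS = {
    "training": {
        "LS complex multiplication (conjugated input)": 4,
        "complex power measurement": 2,
        "modified divider (scalar multiplier)": 2,
    },
    "data transmission": {
        "LMS complex multiplication (conjugated input)": 4,
        "one-tap EQ complex multiplier": 4,
    },
}

totals = {stage: sum(units.values()) for stage, units in REAL_MULTIPLIERS.items()}
print(totals)
```

Both stages need the same number of real multipliers and never run at the same time, which is what makes a single shared set of multipliers sufficient.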

4.3.2 Register Sharing

The architecture has two storage components, SISO_1 and SISO_2 in Fig. 3-11, whose purpose is to store information temporarily. SISO_1 stores the summation of the received signal R, and SISO_2 stores the filter coefficients. In the hardware design, both blocks are implemented with a storage device, such as a register file built from random logic or a RAM. Since the filter coefficients are fetched by both the LMS and the one-tap EQ, we prefer a register file to a RAM. Furthermore, the two storage blocks can share the same register file, since they operate in different time slots. From the synthesis result, combining SISO_1 and SISO_2 reduces the area of the register file, which accounts for 23% of the whole area before sharing, by 44%.

In summary, by sharing hardware resources, we reduce the area of the proposed FDE by about 38% in total, as listed in Table 4-2. The block diagram of the reduced version is shown in Fig. 4-8(a). In addition, the divider is replaced with the table of inverses and one multiplier, and the synthesis result shows an area reduction of 86%. Note that the DesignWare divider cannot operate at the high clock rate. Pipelining it would solve this, but the gate count would increase, so the actual reduction is higher than 86%. Since we cannot modify the DesignWare divider, the actual reduction percentage for the divider is not discussed further in this thesis.
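As a rough consistency check of the figures quoted above, the per-block savings can be weighted by each block's share of the area before sharing. The weighting is an assumption about how the totals combine, and the variable names are illustrative.

```python
# Share of the pre-sharing area and the saving achieved in each block (from the text).
MULTIPLIER_SHARE, MULTIPLIER_SAVING = 0.60, 0.47
REGISTER_SHARE, REGISTER_SAVING = 0.23, 0.44

total_saving = MULTIPLIER_SHARE * MULTIPLIER_SAVING + REGISTER_SHARE * REGISTER_SAVING

# Divider replacement: gate counts from the synthesis result.
DIVIDER_GATES, TABLE_PLUS_MULTIPLIER_GATES = 7086, 1027
divider_saving = 1 - TABLE_PLUS_MULTIPLIER_GATES / DIVIDER_GATES

print(f"total area saving   ~ {100 * total_saving:.1f}%")
print(f"divider area saving ~ {100 * divider_saving:.1f}%")
```

Under this weighting the two sharing steps together come to roughly 38% of the total area, and the gate counts give roughly 86% for the divider, in line with the numbers stated in the text.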

Table 4-2 Chart of reduction percentage

Hardware sharing | Multiplier | Register

Reduction % | 47% | 44%

Divider replacement | Divider (cannot satisfy the timing criterion) | Table of inverse + real multiplier

Gate count | 7086 | 1027

Reduction % | 86%


Fig. 4-8 Block diagram of (a) the proposed FDE, (b) complex multiplier, and (c) complex multiplier with one conjugated input

4.4 FFT/IFFT Design Specifications

The IEEE 802.15.3c standard targets ultra-high data rate wireless communication. The sampling rate of the analog-to-digital converter is set to 1728 MHz, so the digital circuit must sustain the same throughput. Realizing such a high-throughput digital circuit is a challenge in hardware implementation, especially for components with high computational complexity. The FFT/IFFT has the highest computational complexity and is the most critical part of our FDE design.

In recent years, there has been much research on high-throughput FFTs. Pipelined structures and large-radix butterflies are commonly used to meet the requirement. In [28], a 128-point FFT designed for the ultra wideband (UWB) system uses a radix-8 butterfly and 4 parallel inputs and outputs to fulfill the 1 Gsps requirement. Its throughput is exactly 4 times the clock rate, which is 250 MHz. Hence, a high-throughput FFT must be realized with a large-radix butterfly and parallel input/output. This is even more evident for large FFT sizes. The FFT in [29] is a 512-point design with a maximum throughput of 2592 MHz. It is designed for the IEEE 802.15.3c HSI mode, which uses OFDM. It has three modes, 4-way, 8-way, and 16-way, each corresponding to a different throughput. The butterfly is radix-8 and the input/output is up to 16 times parallel. The throughput and the design challenge are thus closely tied to the large-radix and parallel input/output design.

In our FDE, the specifications of the FFT/IFFT are listed in Table 4-3. The FFT/IFFT size is 512 points, which equals the length of the subblock. Since the sampling rate is 1728 MHz, the throughput is also 1728 MHz. To use the same clock rate as the proposed FDE, the clock rate of the FFT/IFFT is set to 216 MHz and the input/output is 8 times parallel. From the fixed-point C simulation, the input/output word lengths are 10/21 bits for the FFT and 13/13 bits for the IFFT.

According to [29], the 16-way mode at a 216 MHz clock rate may fulfill our FFT specifications. The estimated gate count of the FFT is then about 415k, and the area is about 0.6 mm² in the TSMC 65 nm process.

Table 4-3 Specifications of FFT/IFFT in the proposed FDE

Parameter Value

Point 512 samples

Throughput 1728 MHz

Clock rate 216 MHz

Parallel input/output 8 times

FFT word length (input/output) 10 / 21 bits

IFFT word length (input/output) 13 / 13 bits

4.5 RTL and Gate Level Simulation Results

4.5.1 Design Considerations about High Sampling Rate

As noted in Section 2.1.1, the standard specifies a sampling rate of 1728 MHz, which corresponds to a sampling period of about 0.58 ns. For the hardware, this means the chip must also sustain a throughput of 1728 MHz. There are two ways to meet this requirement: a pipelined structure or a parallel structure. The process technology also affects the choice of structure. In this design, we use a 65 nm CMOS process with VDD = 1.2 V and set the clock rate to 216 MHz with an 8 times parallel structure for the proposed FDE, based on two considerations: the clock rate and the power consumption.

A pipelined structure is essential for such a computation-intensive, high-speed digital circuit. Although inserting more flip-flops shortens the operating time of each stage and raises the clock rate, the cost is a larger chip area due to the large number of flip-flops. Moreover, the maximum clock rate is limited by the switching time of the flip-flops. In the 65 nm process, the clock-to-Q delay and the setup time are 0.17 ns and 0.2 ns respectively, which means that only 0.21 ns of the 0.58 ns sampling period is left for logic delay. A single multiplier, for example, would therefore have to be cut into many stages, and the area grows dramatically as flip-flops are inserted. Hence, a fully pipelined structure is not a good design.
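The timing budget can be worked out directly from the numbers in the text. The sketch below compares a circuit clocked at the full sampling rate with one clocked at one-eighth of it; the function name is hypothetical, and clock skew and routing margins are ignored.

```python
SAMPLING_RATE_MHZ = 1728.0
CLK_TO_Q_NS = 0.17   # flip-flop clock-to-Q delay in the 65 nm process (from the text)
SETUP_NS = 0.20      # flip-flop setup time (from the text)


def logic_budget_ns(clock_mhz):
    """Time left for combinational logic in one clock period."""
    period_ns = 1e3 / clock_mhz
    return period_ns - CLK_TO_Q_NS - SETUP_NS


full_rate_budget = logic_budget_ns(SAMPLING_RATE_MHZ)       # fully pipelined at 1728 MHz
parallel_budget = logic_budget_ns(SAMPLING_RATE_MHZ / 8)    # 8 times parallel at 216 MHz

print(f"1728 MHz: {full_rate_budget:.2f} ns for logic")
print(f" 216 MHz: {parallel_budget:.2f} ns for logic")
```

At the full sampling rate the flip-flop overhead consumes most of the period, whereas at 216 MHz it is a small fraction of it, which is why a whole arithmetic unit can fit in one stage once the data path is parallelized.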

A parallel structure reduces the clock rate while maintaining the throughput. The drawback is that the area is proportional to the number of copies. Although a slower clock rate leads to lower power consumption, too many copies generate more leakage power than expected, which defeats our purpose of low-power design.

Based on the considerations above, we adopt a combined structure. To avoid using too many flip-flops, it is preferable that each complete arithmetic unit finishes in a single stage. The multiplier has the longest computation time in the circuit, so the maximum clock rate is determined by the multiplier. The synthesis result shows that 432 MHz, a quarter of 1728 MHz, is the maximum clock rate one multiplier can achieve. However, we use 216 MHz instead of 432 MHz to leave timing margin for the later chip layout and for a future dual-mode design with HSI. 216 MHz is one-eighth of 1728 MHz and matches the 8 times parallel structure shown in Fig. 4-8(a). The system parameters are listed in Table 4-4.

Table 4-4 System parameters

Parameter Value

Sampling rate 1728 MHz

Clock rate 216 MHz

Modulation π/2 BPSK, π/2 QPSK

Equalization LS-LMS FDE

FFT point/subblock length 512 symbols

CP length 64 symbols

Channel model LOS residential model [14]

RMS delay spread: 12.73 ns

Doppler Effect 250Hz frequency shift

Max. data rate (uncoded) 2.9 Gbps

Level of parallel 8 times

4.5.2 Synthesis and Simulation Results

The synthesis results are listed in Table 4-5. The process is TSMC 65 nm 1P9M at 1.2 V, and the sampling rate required by IEEE 802.15.3c is 1728 MHz. The synthesis tool is Synopsys Design Compiler, and the design constraints set the operating speed to 216 MHz. The area report comes from Design Compiler, and the power consumption report comes from PrimePower. The maximum power consumption occurs in the data transmission stage, since the whole circuit then executes both the equalization and the adaptive algorithm. The maximum power consumption of the main functional units is listed in Table 4-6. The gate-level simulation result is shown in Fig. 4-9, which shows that the design meets the criterion at an Eb/N0 of 10 dB under the channel model mentioned in Section 2.2.

Table 4-5 Synthesis result of the proposed FDE

Process TSMC 65 nm 1P9M (1.2V)

Clock rate 216 MHz

Gate count (including memory) 504k

Power 81.87 mW

Memory (generated by memory compiler) ROM: 64 × 144, RAM: 64 × 64

Table 4-6 Power consumption percentage of each functional unit

Total 81.87 mW (100%)

Complex multiplier 36.51 mW (44.6%)

Memory 12.53 mW (15.3%)

Others 32.83 mW (40.1%)

Fig. 4-9 BER vs. Eb/N0 of RTL simulation (horizontal axis: Eb/N0 in dB, 0 to 12; vertical axis: BER, 10^-5 to 10^0; curves: AWGN, π/2 BPSK, π/2 QPSK, π/2 BPSK (RTL), π/2 QPSK (RTL))

Related work can be found in [19], where the channel model is the NLOS residential model with 6.26 ns RMS delay spread. Other non-ideal effects considered there include a nonlinear power amplifier and phase noise. In contrast to [19], we choose the LS-LMS combined algorithm instead of MMSE. The difficulty in realizing MMSE is that it requires knowledge of the noise variance, which [19] assumes to be known. Since we take hardware design into account, we consider only realizable algorithms. Furthermore, our channel model includes the Doppler effect to simulate a time-variant channel. The comparison between the proposed FDE and the related work in [19] is listed in Table 4-7.

Table 4-7 Comparisons between the proposed FDE and related work

Item | Proposed | [19]

Sampling rate | 1728 MHz | 1728 MHz

Modulation | π/2 BPSK, π/2 QPSK | QPSK, 8PSK

Channel model | LOS residential model [14], RMS delay spread: 12.73 ns | NLOS residential model [14], RMS delay spread: 6.26 ns

Non-ideal effects | Doppler effect | Nonlinear power amplifier and phase noise

Synchronization | Perfect | Perfect

Equalization | LS-LMS FDE | MMSE FDE
