
Chapter 3 SC/OFDM Dual-Mode Frequency and Time Domain Equalizer

3.3 Proposed Architecture for IEEE 802.15.3c

3.3.2 Proposed LOS Golay-MPIC TDE

The proposed LOS Golay-MPIC TDE operates according to the equations in Sections 3.2.1 and 3.2.2, and its block diagram is shown in Fig. 3-8.

Fig. 3-8 Block diagram of the proposed LOS Golay-MPIC TDE

The pseudo code of the system flow is shown below:

1. If (received signals == training sequences), the Golay-sequence-aided channel estimation evaluates the channel impulse response; else the received signals are equalized by the MPIC TDE.

2. If (mode == HSI), the equalized data are transformed to the frequency domain by the FFT; else they pass directly to the next block.

3. While (received signals == valid data), the equalized data are sent to the decision circuit and demapper.
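The three steps above can be sketched in Python as follows. This only illustrates the control flow: the function name `tde_receive`, the `state` dictionary, and the callables `estimate_channel`, `mpic_equalize`, `fft` and `demap` are hypothetical placeholders for the blocks in Fig. 3-8, not interfaces defined in this thesis.

```python
def tde_receive(block, is_training, mode, state,
                estimate_channel, mpic_equalize, fft, demap):
    """Route one received block through the TDE flow (illustrative only).

    All callables are hypothetical stand-ins for hardware blocks.
    `state` is a dict that keeps the latest channel impulse response.
    """
    # Step 1: training sequences update the channel estimate,
    # everything else is equalized with the stored estimate.
    if is_training:
        state["cir"] = estimate_channel(block)
        return None  # nothing to demap while training
    equalized = mpic_equalize(block, state["cir"])

    # Step 2: only HSI (OFDM) mode needs the FFT; SC mode bypasses it.
    if mode == "HSI":
        equalized = fft(equalized)

    # Step 3: valid data go to the decision circuit and demapper.
    return demap(equalized)
```

Calling the function once per received block, with `is_training` set during the preamble and PCES periods, mirrors the periodic channel update described in this section.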

The proposed LOS Golay-MPIC TDE has low hardware complexity because it does not need an additional IFFT/FFT for the SC/OFDM dual-mode system. In each PCES period, the LOS Golay-MPIC TDE updates the channel impulse response again using Golay-sequence-aided channel estimation.

The LOS channel model provided by the IEEE 802.15.3c standard has only two paths with significant gain. The second path gain of the LOS channel model is at most 0.3, and the remaining paths are weaker than the main path [9] [10]. Therefore, the MPIC TDE can be implemented efficiently. To evaluate the influence of the multipath gain and delay, different test patterns with AWGN and test channels are created. The test channels have one main path and one delayed path. The second path gain varies from 0.1 to 0.5 and the second path delay varies from 8 to 56 samples, both normalized to the main path. The modulations of SC and HSI mode are pi/2 QPSK and QPSK, respectively. Fig. 3-9 and Fig. 3-10 show that the BER is very sensitive to the second path gain but is hardly affected by the second path delay.
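To make the two-path test channel concrete, the sketch below builds a noise-free channel with one main path and one delayed path, then removes the delayed-path interference using past hard decisions. This is a simplified decision-feedback cancellation loop written for illustration; it is an assumption and not the MPIC equations of Section 3.2.2. The function names, plain QPSK without the pi/2 rotation, and the chosen gain and delay are all illustrative.

```python
import random

# Plain QPSK constellation (assumed for illustration).
QPSK = [complex(i, q) for i in (1, -1) for q in (1, -1)]

def two_path_channel(symbols, gain, delay):
    """y[n] = x[n] + gain * x[n - delay], main path normalized to 1."""
    return [x + (gain * symbols[n - delay] if n >= delay else 0)
            for n, x in enumerate(symbols)]

def slice_qpsk(value):
    """Hard decision to the nearest QPSK point."""
    return complex(1 if value.real >= 0 else -1,
                   1 if value.imag >= 0 else -1)

def cancel_second_path(received, gain, delay):
    """Subtract the delayed-path interference rebuilt from past decisions."""
    decisions = []
    for n, y in enumerate(received):
        interference = gain * decisions[n - delay] if n >= delay else 0
        decisions.append(slice_qpsk(y - interference))
    return decisions

random.seed(0)
tx = [random.choice(QPSK) for _ in range(512)]
rx = two_path_channel(tx, gain=0.3, delay=8)
recovered = cancel_second_path(rx, gain=0.3, delay=8)
```

In this sketch the delay only changes which past decision is subtracted, while the gain scales the interference itself. With noise added, a larger gain would also make each wrong decision more costly.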

Fig. 3-9 BER of TDE SC mode for 2 channel paths

Fig. 3-10 BER of TDE HSI mode for 2 channel paths

If the LOS channel has more than two paths, the proposed Golay-MPIC TDE still works, but the BER degrades as the third path gain grows. Take SC mode with three channel paths as an example. Fig. 3-9 shows that the BER is 4.89*10^-4 and 6.64*10^-4 when the second path gain is 0.35 and the delay is 8 and 40, respectively. When a third channel path is added to these two cases, the BERs are as shown in Fig. 3-11 and Fig. 3-12. In Fig. 3-11 and Fig. 3-12, when the third path gain is small, the BER approaches 4.89*10^-4 and 6.64*10^-4, respectively. The performance is almost the same for different delays of the third path.

Fig. 3-11 TDE SC mode BER for 3 channel paths (2nd path gain = 0.35, delay = 8)

The proposed LOS Golay-MPIC TDE can meet the BER requirement of the IEEE 802.15.3c standard. If the LOS channel has more paths, the computation becomes more complex. Also, the number of registers grows with a longer path delay.

Therefore, we consider the general case of two channel paths with higher gain. This reduces hardware complexity while still achieving the required BER. Section 4.2.1 describes an efficient architecture for Golay-sequence-aided channel estimation.

Chapter 4 Architecture Design and Performance Analysis

This chapter describes the architecture design of the proposed adaptive LS-LMS FDE and LOS MPIC TDE in Section 4.1. The detailed sub-block designs are shown in Section 4.2. Section 4.3 gives the synthesis results and performance of the proposed adaptive LS-LMS FDE and LOS MPIC TDE. The comparison of the proposed adaptive LS-LMS FDE and LOS MPIC TDE is presented in Section 4.4.

4.1 Design Specifications and Architecture

The IEEE 802.15.3c and IEEE 802.11ad standards target wireless communication at data rates above 1 Gbps. Two key features of the standards make this possible. The first is the use of the 60 GHz RF band. This unlicensed band is wide enough to support a large signal bandwidth, and since the transmission rate is proportional to the bandwidth, using it is essential. The second is the ultra-high sampling rate. Although there are other ways to reach a high data rate, such as higher-order modulation or a multi-input multi-output (MIMO) system [28], raising the sampling rate is the most direct way because the data rate is proportional to the sampling rate. With a moderate modulation scheme, the data rate can be two or three times the sampling rate. In this way, a data rate above 1 Gbps is readily achieved.

In modern CMOS processes, power consumption is becoming more and more important. There are many ways to reduce power consumption in hardware design, such as using algorithms with low computational complexity, replacing high-complexity arithmetic units with simpler ones, or sharing hardware resources. These methods reduce both the chip area and the switching power. The leakage power also drops as the chip area shrinks.

As stated in Section 2.2.1, the sampling rate is 1760 MHz in SC mode and 2640 MHz in HSI mode. In hardware, this means the chip must sustain a throughput of 2640 MHz for the dual-mode system. There are two ways to fulfill the requirement: a pipeline structure or a parallel structure.

The pipeline structure is essential for such a computation-intensive, high-speed digital circuit. Although inserting more flip-flops shortens the operating time of each stage and raises the clock rate, the cost is a large chip area due to the large number of flip-flops. Moreover, each flip-flop adds an insertion delay (t_setup + t_C-Q), where t_setup and t_C-Q are the setup time and the clock-to-Q delay of the flip-flop. Thus, when the clock rate is very high and the computation path is long, a fully pipelined structure is not a good design. The parallel structure can reduce the clock rate while maintaining the throughput. Its drawback is that the area is proportional to the number of copies. Although the slower clock rate leads to lower power consumption, too many copies generate more leakage power than expected, which defeats the purpose of low-power design.

Based on the considerations above, we adopt a structure that combines pipelining and parallelism. The system parameters are listed in Table 4-1. For the dual-mode system, we set the clock rate to 330 MHz with 8 parallel paths, based on two considerations: the clock rate and the power consumption.

Table 4-1 System parameters

Parameter                     SC            HSI
Sampling rate (MHz)           1760          2640
Clock rate (MHz)              220           330
Modulation                    π/2 QPSK      QPSK
FFT point/sub-block length    512 symbols (both modes)
CP length                     64 symbols (both modes)
Channel model                 LOS residual model [9], RMS delay spread: 3.2 ns (both modes)
Max. data rate (uncoded)      3.52 Gbps     5.28 Gbps
Level of parallel             8 times (both modes)
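The clock rates and uncoded data rates in Table 4-1 follow from simple arithmetic, sketched below. The helper names are illustrative, and the data-rate formula assumes 2 bits per QPSK symbol with no overhead from the cyclic prefix or pilot words, which is what the table values imply.

```python
def clock_rate_mhz(sampling_rate_mhz, parallelism):
    """Clock rate needed when `parallelism` samples are processed per cycle."""
    return sampling_rate_mhz / parallelism

def uncoded_rate_gbps(sampling_rate_mhz, bits_per_symbol):
    """Raw bit rate, ignoring CP/PW overhead (assumption)."""
    return sampling_rate_mhz * bits_per_symbol / 1000

PARALLELISM = 8            # level of parallelism from Table 4-1
BITS_PER_QPSK_SYMBOL = 2   # QPSK and pi/2 QPSK both carry 2 bits per symbol

for mode, fs in (("SC", 1760), ("HSI", 2640)):
    print(mode,
          clock_rate_mhz(fs, PARALLELISM), "MHz clock,",
          uncoded_rate_gbps(fs, BITS_PER_QPSK_SYMBOL), "Gbps uncoded")
```

Because HSI mode sets the higher requirement, the dual-mode design is clocked at 330 MHz.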

4.1.1 Proposed Adaptive LS-LMS FDE and Baseband

Fig. 4-1 Proposed block diagram of baseband receiver design

The baseband receiver mainly consists of three blocks, which are shown in Fig. 4-1.

The first block is the synchronization block, which consists of the SCO compensator, symbol boundary detection (BD), and CFO synchronization. The SCO compensator applies the time interpolation method and the frequency rotation method for the two modes. Symbol boundary detection (BD) locates the incoming packet and finds the symbol boundary within the ISI-free region. CFO synchronization uses a correlation-based method to estimate the FCFO and can be realized with the same hardware as symbol synchronization. The second block is the FFT and frequency domain equalizer (FDE). The FFT transforms signals from the time domain into the frequency domain, and the FDE eliminates the channel effect. The final block is the LDPC decoder, which corrects error bits using the normalized min-sum algorithm with row-based layered scheduling and supports the four code rates of 802.15.3c applications.

The proposed LS channel estimation combined LMS FDE has been described in Section 3.2.1. The block diagram shown in Fig. 3-6 is redrawn in Fig. 4-2 due to the hardware design considerations. In the following sections, we will discuss our hardware design. FFT and IFFT are not the design target in this thesis and is

Fig. 4-2 Block diagram of the proposed adaptive LS-LMS FDE

The proposed LS-LMS FDE can be used in both the SC and HSI modes of the IEEE 802.15.3c specification. Fig. 4-2 shows the system architecture for the SC and HSI modes. The signal flows of the two modes differ at the FFT feedback loop. SC mode needs to transform the signal to the time domain to obtain the original data. After slicing, the errors are transformed back to the frequency domain for the LMS algorithm. SC mode has additional feedback delay, so it needs single-port memories to hold the received data until the feedback FFT output is available. The complex multiplier (|·|²) in LS can be shared with the one-tap equalizer, and the complex conjugate multiplier (Conj.) in LS can also be shared with LMS. Besides, in Fig. 4-2, the four dual-port memories marked with “No.1” in LS channel estimation are reused in the one-tap equalizer to save the coefficient W. The proposed equalizer equalizes the received data and updates the coefficient at the same time, and the sample rate is too high to use single-port memories with interleaved access. Hence, this work uses dual-port memories to implement the architecture. The pilot word recovery block is used in SC mode to insert known Golay sequences behind the sliced data to form one data sub-block. The single-port memories in the LMS adaptive algorithm are shared with the BD block in the baseband receiver, and the gray parts, which make up 69% of the hardware excluding FFT/IFFT, are shared by SC and HSI mode. Fig. 4-3 shows the hardware reduction of the proposed LS-LMS FDE, and the hardware of SC and HSI mode is listed in Table 4-2.

Fig. 4-3 Hardware reduction of the proposed LS-LMS FDE (chart labels: SC + HSI mode excluding FFT; shared parts between SC and HSI; shared memories with BD block; chart values: 100%, 69%, 31%, 22%, 9%)

Table 4-2 FDE hardware comparison between SC and HSI mode

Component                    SC    HSI
64x64 Single-Port Memory     12    0
64x64 Dual-Port Memory       8     8
FFT                          2     0
PW Recovery                  1     0
Clockwise pi/2 shifter       1     0

4.1.2 Proposed LOS Golay-MPIC TDE and Baseband Receiver

Fig. 4-4 Proposed block diagram of baseband receiver design

Three parts of the block diagram in Fig. 4-4 differ from the baseband receiver described in Section 4.1.1:

- The frequency domain equalizer (FDE) is replaced by a time domain equalizer (TDE).

- The system simulation takes the phase noise effect into account, so phase noise cancellation (PNC) is added to the baseband receiver.

- The two additional FFTs are removed.

Fig. 4-4 shows the equalization moves to time domain, and it only needs one FFT for OFDM mode. Phase noise cancellation is added after TDE and FFT. The inputs of PNC are connected to TDE outputs in SC mode and to FFT outputs in HSI mode.

The proposed Golay sequences aided channel estimation combined MPIC TDE has been described in Section 3.2.2. The block diagram shown in Fig. 3-8 is redrawn in Fig. 4-5 due to the hardware design considerations. In the following sections, we will discuss our hardware design. The only FFT is not the design target in this thesis,


Fig. 4-5 Block diagram of the proposed Golay-MPIC TDE

The proposed Golay-MPIC TDE can be used in both the SC and HSI modes of the IEEE 802.15.3c specification. Fig. 4-5 shows the system architecture for the SC and HSI modes. In the Golay-MPIC TDE, the signal flows of SC and HSI mode are almost the same. Only one block, the “clockwise pi/2 shifter”, is used by SC mode alone, since the modulation in SC mode is pi/2 M-PSK. For channel estimation, we use the Optimised Golay Correlator (OGC) [30] to implement the architecture described in Section 3.2.2; the OGC is discussed in Section 4.2.1. Because the length of the Golay sequence we use is 256, the correlator needs 8 stages (number of stages = log2(256) = 8) to finish the computation. Unlike the LS-LMS FDE, the proposed architecture has no feedback loop, so it does not need many memories to store the received data. The only memory, which is 144 bits by 16 rows, is used by channel estimation to store a256. The size of the second path delay register in the MPIC block depends on the maximum cyclic prefix length. The gray parts, which are 99% of the design, are shared by SC and HSI mode; the exception is the pi/2 phase shifter circuit. The single-port memories in OGC channel estimation are shared with the BD and PNC blocks in the baseband receiver. Fig. 4-6 shows the hardware reduction of the proposed Golay-MPIC TDE, and the hardware of SC and HSI mode is listed in Table 4-3.

Fig. 4-6 Hardware reduction of the proposed Golay-MPIC TDE (chart labels: SC + HSI mode; shared parts between SC and HSI excluding memories; shared memories with BD and PNC block; chart values: 100%, 94%, 6%, 5%, 1%)

Table 4-3 TDE hardware comparison between SC and HSI mode

Component                    SC    HSI
144x16 Single-Port Memory    2     2
Clockwise pi/2 shifter       1     0

4.2 Sub-block Architecture Design

4.2.1 Optimised Golay Correlator (OGC)

Section 3.2.2 described the channel estimation method based on the correlation of Golay sequences, whose computation is very large. Thus, it is not practical to implement the operation directly. The proposed Golay-MPIC TDE architecture in Section 4.1.2 uses an efficient algorithm to reduce the complexity. The Optimised Golay Correlator (OGC) [30][29] is an efficient way to compute the correlation of Golay sequences. The OGC is obtained by reordering some elements of the Efficient Golay Correlator (EGC) [25], which is shown in Fig. 4-7 and Fig. 4-8.

EGC is based on the way in which the sequences are generated.
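As a sketch of this idea, the Python code below generates a Golay complementary pair with the standard delay-and-weight recursion and then correlates an input against both sequences using one add/subtract stage per delay, which is the EGC principle. The delay order and the ±1 weights are assumed for illustration and are not the values specified by IEEE 802.15.3c. Real-valued weights are also assumed; complex weights would need conjugates in the correlator. `matched_filter` is a direct reference implementation used only for comparison.

```python
def golay_pair(delays, weights):
    """Build a complementary pair (a, b) stage by stage."""
    a, b = [1], [1]
    for d, w in zip(delays, weights):
        a_pad = a + [0] * d                      # a_{n-1}[k]
        b_shift = [0] * d + [w * v for v in b]   # W_n * b_{n-1}[k - D_n]
        a = [p + q for p, q in zip(a_pad, b_shift)]
        b = [p - q for p, q in zip(a_pad, b_shift)]
    return a, b

def egc(x, delays, weights):
    """Correlate x against a and b together, one stage per delay."""
    ra, rb = list(x), list(x)
    for d, w in zip(delays, weights):
        ra_delayed = [0] * d + ra[:len(ra) - d]
        ra, rb = ([p + w * q for p, q in zip(ra_delayed, rb)],
                  [p - w * q for p, q in zip(ra_delayed, rb)])
    return ra, rb

def matched_filter(x, seq):
    """Direct correlation (time-reversed FIR), truncated to len(x)."""
    h = seq[::-1]
    return [sum(h[m] * x[k - m] for m in range(min(k + 1, len(h))))
            for k in range(len(x))]

DELAYS = [1, 2, 4, 8, 16, 32, 64, 128]   # assumed order; 8 stages give length 256
WEIGHTS = [1, -1, 1, 1, -1, 1, -1, -1]   # assumed seeds, illustrative only
a, b = golay_pair(DELAYS, WEIGHTS)

x = a + [0] * (len(a) - 1)               # zero-padded so the full correlation fits
ra, rb = egc(x, DELAYS, WEIGHTS)
```

Each stage of `egc` uses one delay line, one weight multiplication and two add/subtract operations. The direct `matched_filter` needs one multiplication per sequence element for every output sample.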


Fig. 4-7 Efficient Golay Correlator (EGC)


In Fig. 4-9, we can see two changes. First, the adders and subtractors are interchanged with the delay and seed blocks. Second, the order of the EGC stages is reversed, placing the large delay stage at the input and the small delay stage at the output. Both changes allow the correlation of two inputs to be obtained simultaneously. The recursive algorithm of the OGC is:

[ ] [ ]

Fig. 4-9 Optimised Golay Correlator (OGC)

To compare the efficiency of the correlations in a signal detection system based on Golay sequences, three different architectures are considered: the straightforward correlator, the EGC and the OGC. The straightforward correlator, just like the EGC, uses two architectures to perform the correlations against a[k] and b[k] simultaneously. A final adder is included to obtain the sum of correlations in these three cases. The results are listed in Table 4-4.

Table 4-4 Number of calculations for each correlator

Operation          Straightforward    EGC             OGC
Multiplications    2L                 2log2(L)        log2(L)
Add/Sub.           2(L-1)             4log2(L)-1      2log2(L)-1
Delays             2(L-1)             2(L-1)          L-1
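To give a feel for the numbers, the formulas of Table 4-4 can be evaluated for the sequence length L = 256 used in this design. The function name and dictionary keys below are illustrative; the formulas are copied from the table.

```python
from math import log2

def correlator_costs(length):
    """Evaluate the operation counts of Table 4-4 for a power-of-two length."""
    stages = int(log2(length))
    return {
        "Straightforward": {"mult": 2 * length,
                            "add_sub": 2 * (length - 1),
                            "delays": 2 * (length - 1)},
        "EGC": {"mult": 2 * stages,
                "add_sub": 4 * stages - 1,
                "delays": 2 * (length - 1)},
        "OGC": {"mult": stages,
                "add_sub": 2 * stages - 1,
                "delays": length - 1},
    }

costs = correlator_costs(256)
for name, ops in costs.items():
    print(name, ops)
```

For L = 256, the OGC needs 8 multiplications and 255 delays, compared with 512 multiplications and 510 delays for the straightforward correlator.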

4.2.2 Divider Free LS Method [11]

In Section 3.1.1, Eqn. (3.10) indicates that the LS method needs a complex division. There are two ways to avoid the division. One is to use the phase operation, as shown in Eqn. (4.6); the other is to multiply both the numerator and the denominator by the conjugate of the divisor, as shown in Eqn. (4.7).

For the phase operation, the transformation between the phasor and the complex number requires trigonometric functions, as shown in Eqn. (4.8). Although there are some realistic designs, the hardware cost is still too high.

The conjugate method of Eqn. (4.7) uses multipliers to calculate the complex division. However, there is one scalar division, which is much more complex than a multiplier [31].
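The conjugate method can be illustrated with a short Python sketch. The function and variable names are illustrative and do not follow the symbols of Eqn. (4.7); the point is only that the complex division reduces to complex multiplications plus one division by a real, non-negative scalar.

```python
def divide_by_conjugate(numerator, divisor):
    """Compute numerator / divisor without a complex divider."""
    # |divisor|^2 is real and non-negative (a sum of squares).
    scalar = divisor.real ** 2 + divisor.imag ** 2
    # The one remaining scalar division.
    inverse = 1.0 / scalar
    # Multiply numerator and denominator by conj(divisor).
    return numerator * divisor.conjugate() * inverse

result = divide_by_conjugate(3 + 4j, 1 - 2j)
```

The scalar inverse is the only operation left that is not a multiplication or an addition.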

Since division is the inverse of multiplication, multiplying by the inverse of the scalar is a commonly used method. To obtain the inverse, we can use a table holding every possible inverse of the scalar, which is easily implemented with a ROM as illustrated in Fig. 4-10. The bit width is determined by the accuracy of the inverse, and the word width is determined by the word length of the scalar. According to the fixed-point C simulation results, the bit width should be 13 bits and the word width 14 bits to maintain the performance. Therefore, the size of the ROM is 2¹⁴*13, which is about 213k bits. The cost is reduced, but the ROM still takes a large area.

To reduce the size of the ROM, we can try to reduce the bit width and the word width. Since the accuracy is already determined by the bit width, we focus on reducing the word width. Observing the inverse, we find that it is almost the same for nearby words. An example is shown in Eqn. (4.9): the difference between 1/128 and 1/129 is too small to be represented in 13 bits. Therefore, nearby scalars can all map to the same inverse stored in the table, as illustrated in Fig. 4-11.
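The 1/128 versus 1/129 example can be checked numerically. The sketch below assumes the inverse is stored as an unsigned value with 13 fractional bits and round-to-nearest; the actual fixed-point format is not given in this section, so this is only an illustration.

```python
FRACTION_BITS = 13  # assumed format: 13 fractional bits, round to nearest

def quantized_inverse(scalar, fraction_bits=FRACTION_BITS):
    """Integer code stored in the table for 1/scalar."""
    return round((1 << fraction_bits) / scalar)

# Size of the full table: one 13-bit entry for each 14-bit scalar.
full_rom_bits = (1 << 14) * 13

code_128 = quantized_inverse(128)
code_129 = quantized_inverse(129)
```

Under this assumed format both scalars map to the same stored code, because 1/128 - 1/129 is smaller than one least significant bit (2^-13).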


Fig. 4-10 Table of inversed scalar

This effect is more obvious when the scalar is large, as shown in Eqn. (4.10), where N is the reference scalar and Δn is the difference. Taking the properties of the scalar into consideration, we can see that the scalar is always positive since it is the result of the square function. Hence, we can reconsider the scalar structure as in Fig. 4-12 and look up the table according to the significant bits only, regardless of the sign bits and Δn.

From the simulation results, the optimal length of the significant bits is 4. The reduced table is shown in Fig. 4-13 and its size is 2⁴*11*13, which is only 2288 bits.


Fig. 4-12 Structure of the scalar


Since the Δn bits are ignored, they look like zeros. We can represent the scalar in another form, as illustrated in Eqn. (4.11), where SB means the significant bits.

$$\text{scalar} \approx \mathrm{SB} \times 2^{\Delta n} \tag{4.11}$$

These Δn bits amount to a left shift of SB, so the inverse of the scalar can be represented as Eqn. (4.12).

$$\frac{1}{\text{scalar}} \approx \frac{1}{\mathrm{SB}} \times 2^{-\Delta n} \tag{4.12}$$

As a result, the table has to store only 16 inverse scalars, which costs 208 bits of storage. Through this procedure, we substitute the division with one small ROM and one multiplier. The block diagram is shown in Fig. 4-14. Compared with a real divider in DesignWare, the modified version has a smaller area and can satisfy the requirement of high clock rate operation. Furthermore, the ROM size is reduced by about 99.9% (from 2¹⁴*13 bits to 208 bits) by the method described above.

Fig. 4-14 Block diagram of modified divider (block labels: SB, Δn, Table, U512,k, R, Right Shifter, LS result)
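The table-plus-shift idea can be sketched in Python as follows. The 4 significant bits and the 13-bit entries come from the text. The rest is assumed for illustration: how the significant bits are selected (the top 4 bits below the leading one), the 12-fractional-bit scaling of the entries, and the saturated entry at index 0. None of this is taken from the actual hardware design.

```python
SIG_BITS = 4     # significant bits used as table index (from the text)
ENTRY_BITS = 13  # width of each stored inverse (from the text)
FRAC_BITS = 12   # assumed scaling, so that 1/1 = 4096 still fits in 13 bits

# 16-entry table. Index 0 cannot occur for a positive scalar; it is saturated.
INVERSE_TABLE = [(1 << ENTRY_BITS) - 1] + [
    round((1 << FRAC_BITS) / sb) for sb in range(1, 1 << SIG_BITS)
]

def approx_inverse(scalar):
    """Approximate 1/scalar for a positive integer scalar below 2**14."""
    shift = max(0, scalar.bit_length() - SIG_BITS)  # number of ignored low bits
    sb = scalar >> shift                            # significant bits
    # Table lookup followed by a right shift by the ignored bits.
    return INVERSE_TABLE[sb] / (1 << FRAC_BITS) / (1 << shift)

table_bits = len(INVERSE_TABLE) * ENTRY_BITS
```

In this sketch the table occupies 16 × 13 = 208 bits. The approximation is exact up to rounding when the ignored bits are zero, and the result is at most about 1/8 too large otherwise, since at least the leading one of the four significant bits is set.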

4.2.3 FFT/IFFT Design Specifications

The IEEE 802.15.3c standard focuses on ultra-high data rate wireless communication. However, realizing a high-throughput digital circuit is a challenge in hardware implementation, especially for components with high computational complexity. The FFT/IFFT has the highest computational complexity and is the most critical block in our design.

In recent years, there has been much research on high-throughput FFTs. Pipeline-based structures and large-radix butterflies are commonly used to meet the requirement. A high-throughput FFT must be realized with a large-radix butterfly and parallel input/output, and this is more evident for a large-point FFT. The FFT in [32] is a 512-point FFT with a maximum throughput of 2592 MHz. It is designed for the IEEE 802.15.3c HSI mode, which uses an OFDM system. There are three modes in that FFT:

butterfly is radix-8 and the input/output is up to 16 times parallel. The throughput and challenge are indeed highly related to the large radix and parallel input/output design.

In our equalization design, the specifications of FFT/IFFT are listed in Table 4-5.

The FFT/IFFT size is 512 points, which equals the length of the sub-block. Since the sampling rate is 2640 MHz, the throughput is also 2640 MHz. To use the same clock rate as the proposed FDE, the clock rate of the FFT/IFFT is set to 330 MHz and the input/output is 8 times parallel. From the fixed-point MATLAB simulation results, the input and output word lengths of the FFT and IFFT are both 14 bits.

Table 4-5 Specifications of FFT/IFFT in the proposed FDE

Parameter                                Value
Point                                    512 samples
Throughput                               2640 MHz
Clock rate                               330 MHz
Parallel input/output                    8 times
FFT/IFFT word length (input/output)      14 / 14 bits

4.3 Synthesis and Simulation Results

The following architectures are synthesized in a 65 nm CMOS process. In the system simulation, the signals are distorted by the channel model and AWGN and are assumed to be perfectly synchronized. To evaluate the performance, we use the channel model from the IEEE 802.15.3c standard group with Jakes’ model, mentioned in Section 2.2. The whole transmitted sequence is composed of the preamble, PW, data, and PCES. The preamble is used for training and the PW works as a cyclic prefix. The testing environment is built in MATLAB.

4.3.1 Proposed Adaptive LS-LMS FDE

The LS-LMS FDE is designed in a 65 nm low-power CMOS process, and the maximum operating rate can achieve 400 MHz (the required clock rate is 330 MHz). The area percentage of each part of the FDE is shown in Fig. 4-15. The LS channel estimation
