
Chapter 2 Synchronization Techniques for IEEE 802.16e Downlink

2.3 Mobile Station Synchronization Techniques

2.3.3 Integer CFO Estimation and Preamble Index Identification

After time and frequency synchronization in the time domain, we perform integer CFO estimation and preamble index identification in the frequency domain (see Fig. 2.1).

Since the preamble index must be estimated from the preamble itself, we keep the received preamble signal in a buffer after the FFT block. We then estimate the integer CFO and identify the preamble index using the compensated preamble in the frequency domain.

As noted in (1.1) of section 1.3.2, there are three types of preamble carrier-sets, and each segment uses only one carrier-set. Because of this property of the preamble, we cannot find the exact integer CFO until we find the correct preamble index, which carries the information about the carrier-set in use, and vice versa. A direct approach is the brute force correlation method, which finds the preamble index and the integer CFO jointly. Fortunately, since the preamble data is BPSK (1-bit) in the frequency domain, the correlation can be implemented without complex multipliers. Complexity can be reduced further by using the knowledge that the preamble is transmitted on every third sub-carrier.

Following [8], the joint integer CFO and preamble search can be summarized as

$$(\hat{m}, \hat{i}) = \arg\max_{m,\,i} \left| \sum_{n} R(n)_{i} \, P_m(n) \right|^{2} \qquad (2.9)$$

where $m \in \{0, 1, \ldots, 113\}$ indexes the set of preamble sequences, $i \in \{-N_{fo}, \ldots, +N_{fo}\}$ represents the range of integer CFO, $R$ is the received preamble symbol in the frequency domain (as estimated by the time acquisition step), $P_m$ is the $m$-th possible preamble sequence, $N_{fo}$ is the maximum frequency offset normalized to the sub-carrier spacing, and $v(n)_i$ denotes the shift of vector $v(n)$ by $i$ elements.

Further, to reduce the number of correlation operations, we present a method that uses the guard band power of the preamble to obtain a coarse estimate of the integer CFO first and then identifies the preamble index. Fig. 2.7 shows the sub-carrier structure of the different segments near the guard band: the sub-carrier permutations for segments 0, 1, and 2 are shown at the top, middle, and bottom of Fig. 2.7, respectively. All 114 preambles are classified into three segments (carrier-sets), as described in (1.1). The method of coarse integer CFO estimation using guard band power calculation, followed by preamble index identification, is as follows.

(1) Assume the preamble signal is segment 0.

(2) Set a window whose length equals the guard band range. Calculate the signal power inside the window, and shift the window to find the position at which the sub-carrier permutation matches that of segment 0 (top of Fig. 2.7). The window shift that gives this match is the coarse integer CFO.

(3) Compensate the coarse integer CFO.

(4) Calculate the correlation between the received frequency-domain preamble, compensated by the coarse integer CFO, and all segment 0 preambles using (2.9). Because the coarse integer CFO has already been compensated, the search range Nfo of i in (2.9) can be reduced. Set a threshold and find the maximum correlation value that exceeds it. The corresponding preamble index and the final integer CFO are the desired results.

(5) If no correlation value exceeds the threshold, the correct preamble index is not among the segment 0 preambles. Return to (1), assume segment 1, and repeat (2) to (4). If this also fails, return to (1), assume segment 2, and repeat (2) to (4) to find the true integer CFO and preamble index.
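Steps (1) to (5) can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the simulation code behind the results in this chapter: the FFT size, guard width, number of preambles per segment, random BPSK sequences, index-to-segment mapping, noise level, threshold, and all function names are hypothetical, and the guard window is taken as the whole unoccupied region of the assumed segment.

```python
import random

N_FFT, GUARD, N_SEG = 128, 16, 3   # toy sizes (assumptions), not the 802.16e values
PER_SEG = 4                        # toy count; the text has 114 / 3 = 38 per segment

def carriers(segment):
    """Indices used by a segment: every 3rd sub-carrier between the guard bands."""
    return list(range(GUARD + segment, N_FFT - GUARD, N_SEG))

def make_preamble(segment, rng):
    """Random BPSK stand-in for a preamble (not the standard's PN sequences)."""
    p = [0.0] * N_FFT
    for k in carriers(segment):
        p[k] = rng.choice((-1.0, 1.0))
    return p

def shift(v, i):
    """Cyclically shift vector v by i elements (positive i moves content upward)."""
    n = len(v)
    return [v[(k - i) % n] for k in range(n)]

def coarse_cfo(r, segment, max_shift):
    """Step (2): slide the guard window and keep the shift with the least power."""
    used = carriers(segment)
    window = [k for k in range(N_FFT) if k < used[0] or k > used[-1]]
    def guard_power(i):
        return sum(abs(r[(k + i) % N_FFT]) ** 2 for k in window)
    return min(range(-max_shift, max_shift + 1), key=guard_power)

def identify(r, book, max_shift=10, n_fo=3, threshold=0.7):
    """Steps (1)-(5): try each segment until a correlation exceeds the threshold."""
    for segment in range(N_SEG):                      # steps (1) and (5)
        coarse = coarse_cfo(r, segment, max_shift)    # step (2)
        r_comp = shift(r, -coarse)                    # step (3)
        best = None
        for m, p in book[segment]:                    # step (4), reduced range n_fo
            for i in range(-n_fo, n_fo + 1):
                r_i = shift(r_comp, -i)
                c = abs(sum(a * b for a, b in zip(r_i, p))) / len(carriers(segment))
                if best is None or c > best[0]:
                    best = (c, m, coarse + i)
        if best[0] > threshold:
            return best[1], best[2]                   # (preamble index, integer CFO)
    return None

rng = random.Random(1)
book = {s: [(s * PER_SEG + j, make_preamble(s, rng)) for j in range(PER_SEG)]
        for s in range(N_SEG)}
true_m, true_cfo = 6, -4                              # preamble 6 is in toy segment 1
tx = dict(book[1])[true_m]
rx = [x + rng.gauss(0.0, 0.05) for x in shift(tx, true_cfo)]
result = identify(rx, book)
```

In this toy setting the segment 0 trial fails the threshold test and the segment 1 trial succeeds, which mirrors the fallback in step (5). A fading channel, as in the SUI-3 simulations, is not modeled here.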

Fig. 2.7 The sub-carrier permutation of different segments near the guard band.

Fig. 2.8 shows the simulation results of the method described above for different Doppler frequencies with FFT size 512, where “error” means incorrect identification of the integer CFO, the preamble index, or both. We examine how the choice of the search range Nfo after the coarse integer CFO estimation affects the result. In Fig. 2.8, the results for Nfo = 4 and Nfo = 5 are almost the same, and both are better than the result for Nfo = 3. Setting the search range Nfo to 5 is therefore sufficient for our simulation.

[Fig. 2.8 plot: error rate of preamble index/integer CFO detection after coarse integer CFO estimation versus SNR (dB), for SNR from 0 to 20 dB; error rate on a logarithmic axis from 10^-3 to 10^0.]

Fig. 2.8 Error probability of preamble index identification after coarse integer CFO estimation under SUI-3 channel with different Doppler frequencies and search range.

Now we compare the performance and computational complexity of the two methods described above. Assume that the symbol timing and fractional CFO are perfectly estimated and compensated. We set the maximum search range of the integer CFO to ±10Δf and let the preamble index be 31 in our simulation. Fig. 2.9 shows the error probability over 10⁵ test samples under the SUI-3 channel for various SNRs and Doppler frequencies with FFT size 512. The brute force correlation method performs slightly better than the guard band power calculation method at low SNR. This is because the guard band contains only noise power, so at low SNR the measured guard band power is dominated by noise. The sub-carrier pattern shown in Fig. 2.7 may then not be located accurately, which degrades the coarse integer CFO estimate. In that case the true integer CFO probably falls outside the reduced search range, and the preamble identification fails. At high SNR, the performance of the two methods is almost the same and stops improving, i.e., there is an error floor. This is because the noise is independent of the signal and no longer affects the signal correlation.

Further, the threshold setting has some influence on performance.

[Fig. 2.9 plot: error rate of preamble index/integer CFO detection versus SNR (dB), for SNR from 0 to 20 dB; error rate on a logarithmic axis from 10^-3 to 10^-1. Curves: correlation method at fd = 0 Hz, 150 Hz, and 300 Hz; guard band power method (Nfo = 5) at fd = 0 Hz, 150 Hz, and 300 Hz.]

Fig. 2.9 Error probability of either the estimated integer CFO or the identified preamble index under SUI-3 channel with different methods.

We now analyze the computational complexity. The major contribution to the complexity is the number of multiplications. The brute force correlation method requires

$$114 \times (142 \times 21) = 339948$$

complex multiplications, where 114 is the number of all preambles, 142 is the number of BPSK PN symbols used in the preamble, and 21 (Nfo = 10) is the estimation range of the integer CFO. The guard band power calculation method requires

$$76 \times (142 \times 11) = 118712$$

complex multiplications, where

$$76 = \frac{1}{3} \times \frac{114}{3} + \frac{1}{3} \times \frac{114}{3} \times 2 + \frac{1}{3} \times \frac{114}{3} \times 3$$

is the expected number of preambles correlated in the method (each segment is taken to be equally likely, so one, two, or three groups of 114/3 = 38 preambles are tested with probability 1/3 each) and 11 (Nfo = 5) is the search range of the integer CFO. The guard band power method therefore has lower complexity. The complexity reduction depends on the search range Nfo, but it comes at the cost of performance at low SNR.
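The multiplication counts above can be checked with a few lines of Python. This is only an arithmetic check of the figures quoted in the text; the variable names are illustrative.

```python
from fractions import Fraction

N_PREAMBLES = 114      # number of all preambles
N_SYMBOLS = 142        # BPSK PN symbols used in the preamble
RANGE_BRUTE = 21       # 2 * 10 + 1 shifts for Nfo = 10
RANGE_GUARD = 11       # 2 * 5 + 1 shifts for Nfo = 5

# Brute force: every preamble is correlated at every candidate shift.
brute_force = N_PREAMBLES * N_SYMBOLS * RANGE_BRUTE

# Guard band method: 1, 2, or 3 segments of 38 preambles are tested,
# each case with probability 1/3.
per_segment = Fraction(N_PREAMBLES, 3)
expected_preambles = sum(Fraction(1, 3) * per_segment * k for k in (1, 2, 3))
guard_band = expected_preambles * N_SYMBOLS * RANGE_GUARD

ratio = float(guard_band / brute_force)   # fraction of the brute force cost
```

With these numbers the guard band method needs roughly 35 % of the multiplications of the brute force search.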

Given its better performance, the correlation method may be more suitable for finding the preamble index and the integer CFO jointly in IEEE 802.16e.

Chapter 3 DSP Implementation of IEEE 802.16e Downlink System

DSP implementation is the final goal of our work. We use the MSC8126ADS board (see Fig. 3.1), made by Freescale Semiconductor. In this chapter, we introduce the architecture of the DSP board.

This chapter is organized as follows. In section 3.1, we present the architecture of the MSC8126ADS board. In section 3.2, we introduce how to develop speed-optimized code on the SC140 cores.

3.1 Introduction to the DSP Platform

3.1.1 MSC8126ADS Board Architecture

The MSC8126ADS board uses the Freescale MSC8126 processor [10], a highly integrated system-on-a-chip device containing four StarCore SC140 DSP cores, along with an MSC8103 device as the host processor. The MSC8126ADS board serves as a platform for software and hardware development in the MSC8126 processor environment. Developers can use the on-board resources and the associated debugger to perform a variety of tasks, such as downloading and running code, setting breakpoints, displaying memory and registers, and connecting proprietary hardware via the expansion connectors. This board works seamlessly with the CodeWarrior Development Studio for StarCore. Following [10], we summarize the MSC8122/26ADS board features in Table 3.1.

Fig. 3.1 MSC8122/8126ADS top-side part location diagram. (Source: [10])

Table 3.1 MSC8126ADS Board Features

Feature Description

MSC8126ADS board

• Host debug through a single JTAG connector supports both the MSC8103 and MSC8126 processors.

• MSC8103 is the MSC8126 host. The MSC8103 system bus connects to the MSC8126 DSI.

• Emulates MSC8126 DSP farm by connecting to three other ADS boards.

3.1.2 MSC8126 Features

The MSC8126 (see Fig. 3.2) is a highly integrated system-on-a-chip that combines four SC140 extended cores with a turbo coprocessor (TCOP), a Viterbi coprocessor (VCOP), an RS-232 serial interface, four time-division multiplexed (TDM) serial interfaces, thirty-two general-purpose timers, a flexible system interface unit (SIU), an Ethernet interface, and a multi-channel DMA controller.

The SC140 extended core (see Fig. 3.3) is a flexible, programmable DSP core that handles compute-intensive communications applications, providing high performance, low power, and code density. It efficiently deploys a novel variable-length execution set (VLES), attaining maximum parallelism by allowing multiple address generation and data arithmetic logic units to execute multiple operations in a single clock cycle. A single SC140 core running at 500 MHz can perform 2000 MMACS. With four such cores, the MSC8126 can perform up to 8000 MMACS.

Based on [11], we summarize the features of the MSC8126 and the SC140 extended core in Table 3.2. The block diagram of the MSC8126 is shown in Fig. 3.2, and that of the SC140 extended core is shown in Fig. 3.3.

Fig. 3.2 MSC8126 block diagram. (Source: [11])

Fig. 3.3 SC140 extended core block diagram. (Source: [11])

Table 3.2 MSC8126 Features

Feature Description

MSC8126

• Four-core DSP with internal clock up to 500 MHz at 1.2 V.

• System bus frequency up to 166 MHz using 64 or 32 data lines, addressing up to 4 GB external memory, connected to:

— 16 MB of soldered, non-buffered on one 4-bank × 1 M × 32-bit device.

— 4 MB of buffered Flash memory organized as 4 M × 8-bit for configuration/boot/program storage.

• DSI frequency up to 100 MHz as a 32-bit or 64-bit slave on the MSC8103 system bus connects to:

— 2 MB of non-buffered SDRAM organized as 32-bit (default) or 64-bit.

— 16 MB of 100 MHz soldered, non-buffered SDRAM, organized on two 4-bank × 32-bit devices.

— 4 MB of 16-bit buffered Flash memory.

— Buffered board control and status register (BCSR) with eight byte-sized registers.

• SDRAM machine controls the SDRAM on the system bus.

• SMII support for MAC-to-PHY or MAC-to-MAC connections.

• RMII and MII support for MAC-to-PHY connections.

• Core power level adjustable via potentiometer.

• Includes Viterbi coprocessor and Turbo coprocessor.

3.1.3 Developing Optimized Code for Speed on SC140 Cores

Following [12], speed optimization techniques on the SC140 core are generally classified as follows.

• Loop unrolling

The most popular speed optimization technique, loop unrolling explicitly repeats the body of a loop with corresponding indices. As a stand-alone technique, loop unrolling increases the Data ALU usage per loop step. If the iterations are independent, each one is performed on a single Data ALU.
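As an illustration of the transformation only, the sketch below unrolls a simple gain loop by a factor of four. Python is used here just to show the structure; on the SC140 the same pattern would be written in C or assembly so that the four independent statements can be issued to separate Data ALUs. The function names and the assumption that the length is a multiple of four are ours.

```python
def scale(x, gain):
    """Reference loop: one multiplication per loop step."""
    y = [0] * len(x)
    for i in range(len(x)):
        y[i] = x[i] * gain
    return y

def scale_unrolled4(x, gain):
    """Same computation with the loop body repeated for indices i .. i+3."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4 for this sketch")
    y = [0] * len(x)
    for i in range(0, len(x), 4):
        # The four statements are independent of each other.
        y[i] = x[i] * gain
        y[i + 1] = x[i + 1] * gain
        y[i + 2] = x[i + 2] * gain
        y[i + 3] = x[i + 3] * gain
    return y

samples = [1, 2, 3, 4, 5, 6, 7, 8]
unrolled = scale_unrolled4(samples, 3)
```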

• Split computation

A frequent operation in DSP computations is to reduce one dimension of a data array (scalars are zero-dimensional, vectors are one-dimensional, and matrices are two-dimensional). The most frequently used reductions are the energy computation of a vector, the mean square error, and the maximum of a vector. If the reduction operator is associative and commutative, the reduction can be performed by splitting the original data array into several data arrays (usually four on the SC140 core), reducing each part separately, and combining the partial results.

The same conditions must be met as for loop unrolling (for example, the vector alignment and the loop counter). In addition, split computation applies only if the operator on the given data set is associative and commutative.
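A minimal sketch of split computation for the vector energy is given below, assuming integer samples and a length that is a multiple of four; the function names are illustrative. Four partial accumulators stand in for the four data arrays mentioned above.

```python
def energy(x):
    """Reference reduction: a single accumulator."""
    acc = 0
    for v in x:
        acc += v * v
    return acc

def energy_split4(x):
    """Split the reduction over four independent partial sums."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4 for this sketch")
    a0 = a1 = a2 = a3 = 0
    for i in range(0, len(x), 4):
        a0 += x[i] * x[i]
        a1 += x[i + 1] * x[i + 1]
        a2 += x[i + 2] * x[i + 2]
        a3 += x[i + 3] * x[i + 3]
    # Combine the partial results; valid because + is associative and commutative.
    return (a0 + a1) + (a2 + a3)

samples = [3, -1, 4, 1, -5, 9, 2, -6]
total = energy_split4(samples)
```

With integers the split result is identical to the single-accumulator result. With floating point or saturating fixed point arithmetic, the changed summation order can alter the last bits, which is one reason the associativity condition matters.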

• Multisampling

The multisampling technique is frequently used in nested loops and is a combination of primitive transformations. Given a nested loop formed out of OL (outer loop) and IL (inner loop containing one or two instructions), the multisampling transformation consists of the following.

(1) A loop unroll applied for OL to create a new OL with four IL inside (IL0, IL1, IL2, and IL3).

(2) A loop merge applied for IL0, IL1, IL2, and IL3 to create a new IL that makes more efficient use of the DALU units.

(3) A loop unroll applied to the newly-obtained IL so that the programmer can detail the reuse of already fetched values in the computations inside the new IL.

The speed increases by sample-factor times, but the code size also increases significantly. Therefore, multisampling should be used only if the speed constraints are much more important than the size constraints.
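The following sketch applies the idea to a sliding correlation (FIR-style) loop, chosen here as a typical nested loop; it is not taken from [12]. The outer loop is unrolled by four and the four inner loops are merged, so each fetched coefficient and sample is reused for four outputs. The further unrolling of the merged inner loop in step (3) is omitted, and the value reuse it enables is shown by rotating three local variables instead. Function names are hypothetical.

```python
def fir(x, h, n_out):
    """Reference nested loop: OL over outputs, IL over taps."""
    y = []
    for n in range(n_out):              # OL
        acc = 0
        for k in range(len(h)):         # IL
            acc += h[k] * x[n + k]
        y.append(acc)
    return y

def fir_multisample4(x, h, n_out):
    """Four outputs per outer step; one merged inner loop feeds four accumulators."""
    if n_out % 4 != 0:
        raise ValueError("n_out must be a multiple of 4 for this sketch")
    y = []
    for n in range(0, n_out, 4):        # unrolled OL
        a0 = a1 = a2 = a3 = 0
        x0, x1, x2 = x[n], x[n + 1], x[n + 2]
        for k in range(len(h)):         # merged IL
            x3 = x[n + k + 3]           # only one new sample is fetched per tap
            c = h[k]
            a0 += c * x0
            a1 += c * x1
            a2 += c * x2
            a3 += c * x3
            x0, x1, x2 = x1, x2, x3     # reuse already fetched samples
        y.extend((a0, a1, a2, a3))
    return y

signal = list(range(1, 12))             # needs n_out + len(h) - 1 = 11 samples
taps = [1, 2, 3, 4]
fast = fir_multisample4(signal, taps, 8)
```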

3.2 Implementation of Transmitter

Our IEEE 802.16e OFDMA downlink PHY implementation on the MSC8126 includes user domain processing (UP), frequency domain processing (FP), and time domain processing (TP). Fig. 3.4 gives a high-level view of the main building blocks for WiMAX OFDMA PHY processing. The upper PHY part on the MSC8126 includes the two main subsystems, UP and FP. The demo focuses on the data path implementation, assuming a fully synchronized system.

Fig. 3.4 WiMAX PHY interfaces.

The user domain processing covers the channel encoding and decoding steps. Specifically, these are:

• Randomization and derandomization

• Convolutional encoder and decoder

• Interleaving and deinterleaving

• Constellation mapping and demapping

Fig. 3.5 shows an overview of the UP steps throughout the PHY chain on the MSC8126.

The frequency domain processing is mainly responsible for the OFDMA signal formatting. It is a subsystem that is not tied to any specific user functionality. The processing steps in the DL direction are:

• Preamble generation

• Data modulation

• Data symbol mapping

• Pilot generation and mapping

• Carrier scrambling

From a processing point of view, the DL FP data flow is as follows (see Fig. 3.6).

(1) The first symbol in a frame is the preamble, which is a PN code pattern that depends on control variables such as the cell ID. It is independent of user data. The mobile station knows this sequence and hence may use it for initial channel estimation.

(2) After the preamble, user processing fills the sub-channels with user and control data. FP then performs the following steps, which are shown in detail in Fig. 3.7.

• Map data to logical slots on logical sub-channels (function MapDl()).

• Generate and insert the pilot symbols into the tiles.

• Translate the logical carriers to physical carriers with the function CarrierScrambler(). This function needs a permutation table, which is generated by GenerateDlTable(). This table is generated once per permutation zone.

• Modulate the resulting data vector on the physical carriers by a PN sequence and a static weight. This is done by the function DataModulation().
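The last two steps can be pictured with the following Python sketch. It only mirrors the data flow described above: the Python function names, their signatures, the seeded shuffle used as a stand-in for the permutation table, and the example PN bits are all assumptions, not the DSP functions CarrierScrambler(), GenerateDlTable(), or DataModulation() themselves, and the permutation formula of the standard is not reproduced.

```python
import random

def generate_dl_table(n_carriers, seed=0):
    """Toy permutation table: a seeded shuffle, NOT the standard's permutation."""
    table = list(range(n_carriers))
    random.Random(seed).shuffle(table)
    return table

def carrier_scrambler(logical, table):
    """Place the value of logical carrier k on physical carrier table[k]."""
    physical = [0] * len(logical)
    for logical_idx, value in enumerate(logical):
        physical[table[logical_idx]] = value
    return physical

def data_modulation(physical, pn_bits, weight=1.0):
    """Multiply each carrier by +1 or -1 from the PN bit and by a static weight."""
    return [weight * (1 - 2 * b) * v for v, b in zip(physical, pn_bits)]

table = generate_dl_table(8, seed=1)          # built once per permutation zone
logical = [complex(k, -k) for k in range(8)]  # placeholder constellation points
physical = carrier_scrambler(logical, table)
pn_bits = [0, 1, 0, 1, 1, 0, 0, 1]            # example PN bits (assumption)
modulated = data_modulation(physical, pn_bits)
```

Because the table is computed once and then only indexed, the per-symbol cost of the scrambling step is a simple table look-up per carrier.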

[Fig. 3.6 diagram: data from DL user processing enters MapDl(), which maps user bursts onto frames in the symbol/sub-channel space (see [1..3], sec. 8.4.3.4). CarrierScrambler() then scrambles the carriers using the carrier lookup table ausiDlCarrierMap[], which is generated by GenerateDlTable(..) (see [1..3], sec. 8.4.6.1.2.2.2). The output goes to the IFFT.]

Fig. 3.6 DL FP processing functions.


Fig. 3.7 DL carrier scrambling scheme.

The time domain processing includes the IFFT/FFT and the synchronization mechanism.

Our DSP implementation is limited to PUSC, and the configurations marked as “optional” in [4] and [5] are not considered. The functionality is confined to user-independent sub-channel management. Hence, specific control channels such as ranging, FCH, and the DL/UL MAP bursts are not considered specifically, because they are treated as normal bursts. Table 3.3 and Table 3.4 list the cycle counts of UP and FP, respectively, for the WiMAX OFDMA DL transmitter on a single MSC8126 SC140 core running at up to 500 MHz. Fig. 3.8 and Fig. 3.9 show the corresponding histograms. For DL UP, Fig. 3.8 shows that the interleaver and the modulation each take about 50 % of the total clock cycles. To speed up the implementation, the functions that use shift registers or memory rearrangement, such as the randomizer, convolutional encoder, puncturer, and interleaver, are written in assembly language; the resulting improvement in cycle count is shown in Table 3.3. For DL FP, we transmit the preamble and a data symbol, which requires executing the initialization function, and record the clock cycles of every function used for the preamble and data symbols in Table 3.4. Fig. 3.9 shows the histogram of all FP functions. Clearly, the preamble takes the fewest clock cycles and the data initialization takes the most. Fortunately, the data initialization needs to be done only once per data burst, so transmitting data symbols does not take much time, and the real-time speed is about 20842 symbols per second:

$$\frac{500\ \text{M (cycles/sec)}}{23990\ \text{(cycles/symbol)}} \approx 20842\ \text{symbols/sec}$$
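The symbol rate follows directly from the clock frequency and the per-symbol cycle count. The short check below uses only figures quoted in this section (500 MHz clock, 23990 cycles per data symbol, and the 17.51 Mbits/sec data rate of Table 3.4); the bits-per-symbol value it derives is our own back-calculation and is not stated in the text.

```python
CLOCK_HZ = 500e6                    # SC140 core clock, upper limit quoted in the text
CYCLES_PER_DATA_SYMBOL = 23990      # data symbol subtotal from Table 3.4
DATA_RATE_MBPS = 17.51              # data rate listed in Table 3.4

# Data symbols that one core can format per second.
symbols_per_sec = CLOCK_HZ / CYCLES_PER_DATA_SYMBOL

# Implied payload per symbol (derived here, not given in the text).
bits_per_symbol = DATA_RATE_MBPS * 1e6 / symbols_per_sec
```

The one-off initialization (99508 cycles in Table 3.4) is excluded here because it is paid once per data burst rather than once per symbol.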

Table 3.3 Cycle Count of DL UP

Function                           Cycle Count (cycles/bit)
Randomizer (assembly)              0.15
Randomizer                         0.23
Convolutional Encoder (assembly)   1.02
Convolutional Encoder              2.21
Puncture (Rate = 1) (assembly)     0.45
Puncture (Rate = 1)                0.71
Interleaver (assembly)             13.64
QPSK Modulation                    15.95
Total Cycles (assembly)            18.59
Bit Rate (Mbits/sec)               26.39

[Fig. 3.8 plot: cycle count of UP functions.]

Fig. 3.8 Histogram of UP function cycle count.

Table 3.4 Cycle Count of DL FP

Group                Function Description                      Function Name                  Cycle Count
Preamble Symbol      Preamble Generation                       PreambleGen()                  4175
Data Initialization  Carrier Permutation Table Generation      GenerateDlTable()              98089
Data Initialization  Initial Data Position within Sub-channel  InitialDataPositionVectorDl()  1360
Data Initialization  Subtotal Cycles                                                          99508
Data Symbol          Mapping Data onto Physical Sub-carrier    MapDl()                        3147
Data Symbol          Carrier Scramble                          CarrierScramble()              5661
Data Symbol          Carrier Manipulation                      DataModulation()               10945
Data Symbol          Subtotal Cycles                                                          23990
Total Cycles                                                                                  135549
Data Symbol Rate (symbols/sec)                                                                20842
Data Rate (Mbits/sec)                                                                         17.51

[Fig. 3.9 plot: cycle count of FP functions.]

Fig. 3.9 Histogram of FP function cycle count.

3.3 Performance Analysis of Synchronization Implementation

We implement the initial synchronization techniques described in section 2.3 for the WiMAX OFDMA downlink on the MSC8126 DSP. All simulation parameters and environments are similar to those in chapter 2, but we convert the floating point data type to the short integer data type. For comparison, floating point simulation results are presented together with the fixed point results.

3.3.1 Symbol Timing Estimation

As described in section 2.3.1, we use the preamble to estimate the symbol timing offset. Fig. 3.10 shows the RMSE of the symbol timing offset estimation in the SUI-3 channel for different Doppler frequencies. The fixed point curves are very close to the floating point curves.

[Fig. 3.10 plots: RMSE of frame offset versus SNR (dB), for SNR from 0 to 20 dB, each with a floating point and a fixed point curve; panels (a) fd = 0 Hz, (b) fd = 150 Hz, and (c) fd = 300 Hz.]

Fig. 3.10 RMSE of symbol timing offset synchronization under SUI-3 channel.

3.3.2 Fractional CFO Estimation

When implementing the fractional CFO estimation on the MSC8126 DSP platform, it is difficult to obtain the phase of the CP correlation directly. Taking the hardware signal processing complexity into account, we adopt a phase estimation algorithm called CORDIC, an acronym for COordinate Rotation DIgital Computer. This algorithm, described in detail in [13], provides an iterative method of performing vector rotations by arbitrary angles using only shifts, adds, and a small lookup table. It is therefore well suited to the fixed point environment of the DSP platform. In our implementation on the MSC8126, the phase of the CP correlation value, which represents the frequency offset, is normalized by π and lies in [-1, +1) in Q15 format (16 bits).
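A minimal sketch of CORDIC in vectoring mode is shown below, assuming integer real and imaginary parts of the CP correlation and 15 iterations. It is not the code running on the MSC8126: the function name, the iteration count, and the quadrant pre-rotation are our assumptions, and the arctangent table is computed with floating point here only for brevity, whereas a DSP implementation would store it as constants. The result is the phase divided by π as a Q15 integer.

```python
import math

N_ITER = 15                      # number of CORDIC iterations (assumption)
Q15_ONE = 1 << 15                # 1.0 in Q15, i.e. an angle of pi after normalization
# Lookup table: atan(2^-k) / pi in Q15.
ATAN_Q15 = [round(math.atan(2.0 ** -k) / math.pi * Q15_ONE) for k in range(N_ITER)]

def cordic_phase_q15(re, im):
    """Return atan2(im, re) / pi as a Q15 integer in [-32768, 32767]."""
    x, y, z = re, im, 0
    # Pre-rotate by 90 degrees so the vector lies in the right half-plane.
    if x < 0:
        if y >= 0:
            x, y, z = y, -x, Q15_ONE // 2
        else:
            x, y, z = -y, x, -(Q15_ONE // 2)
    # Drive y toward zero with shift-and-add micro-rotations, accumulating the angle.
    for k in range(N_ITER):
        if y > 0:
            x, y, z = x + (y >> k), y - (x >> k), z + ATAN_Q15[k]
        elif y < 0:
            x, y, z = x - (y >> k), y + (x >> k), z - ATAN_Q15[k]
    # Wrap into [-1, +1) as a 16-bit register would.
    return ((z + Q15_ONE) % (2 * Q15_ONE)) - Q15_ONE

phase = cordic_phase_q15(12000, 5000) / Q15_ONE   # normalized phase of 12000 + 5000j
```

Only comparisons, shifts, and additions appear inside the loop, which is why the method suits a fixed point DSP. The vector magnitude grows by the usual CORDIC gain during the iterations, which does not matter here because only the angle is used.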

Fig. 3.11 shows the RMSE of the fractional CFO estimation in the SUI-3 channel for different Doppler frequencies. The figure shows how the SNR affects the carrier frequency synchronization and that the corrected frequency offset is under 2% of the sub-carrier spacing, as required by IEEE 802.16e. It also shows that the performance curves of the fixed point and floating point implementations are almost the same.


Fig. 3.11 RMSE of fractional CFO synchronization under SUI-3 channel with 0 Hz, 150 Hz, and 300 Hz Doppler frequencies, comparing fixed point and floating point.

3.3.3 Summary of Implementation Analysis for Synchronization

Following the discussions in sections 3.3.1 and 3.3.2, we implement the initial synchronization, including symbol timing estimation, fractional CFO estimation, and correction (compensation), before the FFT on a single MSC8126 SC140 core running at up to 500 MHz. The initial synchronization relies on the preamble and CP correlation described in sections 2.3.1 and 2.3.2. Table 3.5 and Fig. 3.12 show the implementation results of the initial synchronization, which uses the preamble for symbol timing estimation. The correlation values are calculated and stored in the function SyncT(), which takes the most clock cycles, about 60 % of the total process. The load of the CFO estimation is very low because it only estimates the phase of the CP correlation using the CORDIC algorithm described above. After finding symbol timing and frequency offset,
