
Chapter 1 Introduction to IEEE 802.16e OFDMA Physical Layer

1.3 Overview of IEEE 802.16e Downlink

1.3.3 Sub-carrier Allocation

As mentioned in [4] and [5], the OFDMA PHY defines four scalable FFT sizes: 2048, 1024, 512, and 128. Here we use only the 2048-FFT OFDMA sub-carrier allocation as an example. The sub-carriers are divided into three types: null (guard band and DC), pilot, and data. Subtracting the guard tones from the scalable FFT size NFFT gives the set of "used" sub-carriers Nused, which are allocated to pilot sub-carriers and data sub-carriers. The allocation is introduced as follows:

• Preamble

The first symbol of the downlink transmission is the preamble. There are three types of preamble carrier-sets, defined by allocating different sub-carriers to each of them; those sub-carriers are modulated using a boosted BPSK modulation with a specific pseudo-noise (PN) code defined in Table 309 of [4]. The preamble carrier-sets are defined using

$$\mathrm{PreambleCarrierSet}_n = n + 3 \cdot k \qquad (1.1)$$

where:

$\mathrm{PreambleCarrierSet}_n$ specifies all sub-carriers allocated to the specific preamble, $n$ is the number of the preamble carrier-set, indexed 0...2, and $k$ is a running index 0...567.

Each segment uses a preamble composed of one of the three available carrier-sets, so each segment modulates every third sub-carrier. As an example, Fig. 1.7 depicts the preamble of segment 1 (in this figure sub-carrier 0 corresponds to the first sub-carrier used on the preamble symbol).

Because the DC carrier will not be modulated at all, it shall always be zeroed and the appropriate PN will be discarded. For the preamble symbol there will be 172 guard band sub-carriers on the left side and the right side of the spectrum.

Fig. 1.7 Downlink basic structure [4].
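Equation (1.1) can be illustrated with a short C sketch. The function name `preamble_carrier_set` and the constant `PREAMBLE_SET_SIZE` are hypothetical and only serve this illustration; the indices are relative to the first used sub-carrier, and the zeroing of the DC carrier is not handled here.

```c
#include <stddef.h>

#define PREAMBLE_SET_SIZE 568  /* k runs over 0...567 */

/* Fills out[k] = n + 3*k for carrier-set n (0, 1 or 2), as in (1.1).
 * Returns the number of indices written, or 0 if n is out of range. */
size_t preamble_carrier_set(int n, int out[PREAMBLE_SET_SIZE])
{
    if (n < 0 || n > 2)
        return 0;
    for (int k = 0; k < PREAMBLE_SET_SIZE; k++)
        out[k] = n + 3 * k;
    return PREAMBLE_SET_SIZE;
}
```

The three carrier-sets together cover 3 × 568 = 1704 sub-carriers, which matches 2048 minus the 172 guard sub-carriers on each side.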

• Symbol Structure for PUSC

The symbol structure is constructed using pilots, data, and zero sub-carriers.

Active (data and pilot) sub-carriers are grouped into subsets of sub-carriers called sub-channels. The minimum frequency-time resource unit of sub-channelization is one slot, which is equal to 48 data tones (sub-carriers).

With DL-PUSC, for each pair of OFDMA symbols, the available or usable sub-carriers are grouped into clusters containing 14 contiguous sub-carriers per symbol period, with pilot and data allocations in each cluster in the even and odd symbols as shown in Fig. 1.8. A re-arranging scheme is used to form groups of clusters such that each group is made up of clusters that are distributed throughout the sub-carrier space.

A slot contains two clusters and is made up of 48 data sub-carriers and eight pilot sub-carriers (2 clusters × 14 sub-carriers × 2 symbols = 56 tones). The data sub-carriers in each group are further permuted to generate sub-channels within the group. Therefore, only the pilot positions in the cluster are shown in Fig. 1.8; the data sub-carriers in the cluster are distributed to multiple sub-channels. Table 1.1 from [5] summarizes the parameters of the symbol structure.

Fig. 1.8 Cluster structure [3].

Table 1.1 OFDMA DL-PUSC Sub-carrier Allocations

| Parameter | Value | Comments |
|---|---|---|
| Number of DC Sub-carriers | 1 | Index 1024 (counting from 0) |
| Number of Guard Sub-carriers, Left | 184 | |
| Number of Guard Sub-carriers, Right | 183 | |
| Number of Used Sub-carriers (Nused) | 1681 | Number of all sub-carriers used within a symbol, including all possible allocated pilots and the DC carrier. |
| Number of Sub-carriers per Cluster | 14 | |
| Number of Clusters | 120 | |
| Renumbering Sequence | 1 | Used to renumber clusters before allocation to sub-channels: |
| Number of Data Sub-carriers in each Symbol per Sub-channel | 24 | |
| Number of Sub-channels | 60 | |
| Basic Permutation Sequence 12 (for 12 Sub-channels) | 6,9,4,8,10,11,5,2,7,3,1,0 | |
| Basic Permutation Sequence 8 (for 8 Sub-channels) | 7,4,0,2,1,5,3,6 | |

• Downlink Sub-channels Sub-carrier Allocation in PUSC

The carrier allocation to sub-channels is performed using the following procedure:

1) Divide the sub-carriers into Nclusters physical clusters containing 14 adjacent sub-carriers each (starting from carrier 0). The number of clusters, Nclusters, varies with the FFT size.

2) Renumber the physical clusters into logical clusters using the following formula:

$$\mathrm{LogicalCluster} = \begin{cases} \mathrm{RenumberingSequence}(\mathrm{PhysicalCluster}), & \text{first DL zone, or Use All SC indicator} = 0 \text{ in STC\_DL\_Zone\_IE,} \\ \mathrm{RenumberingSequence}\big((\mathrm{PhysicalCluster} + 13 \cdot \mathrm{DL\_PermBase}) \bmod N_{clusters}\big), & \text{otherwise.} \end{cases} \qquad (1.2)$$

3) Allocate logical clusters to groups. The allocation algorithm varies with the FFT size. For FFT size = 2048, the clusters are divided into six major groups. Group 0 includes clusters 0-23, group 1 includes clusters 24-39, group 2 includes clusters 40-63, group 3 includes clusters 64-79, group 4 includes clusters 80-103, and group 5 includes clusters 104-119. These groups may be allocated to segments; if a segment is being used, at least one group shall be allocated to it. By default, group 0 is allocated to sector 0, group 2 is allocated to sector 1, and group 4 is allocated to sector 2.
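Steps 2 and 3 can be sketched in C as follows. This is only an illustration under stated assumptions: the function names and the `plain_renumbering` flag (standing for "first DL zone, or Use All SC indicator = 0") are hypothetical, and the renumbering sequence itself is not reproduced in this text, so the caller has to supply it.

```c
#define N_CLUSTERS_2048 120

/* Equation (1.2). renumbering_seq must hold n_clusters entries taken from
 * the standard; plain_renumbering != 0 selects the first case of (1.2). */
int logical_cluster(const int *renumbering_seq, int n_clusters,
                    int physical_cluster, int dl_perm_base,
                    int plain_renumbering)
{
    if (plain_renumbering)
        return renumbering_seq[physical_cluster];
    return renumbering_seq[(physical_cluster + 13 * dl_perm_base) % n_clusters];
}

/* Major group (0..5) of a logical cluster for FFT size 2048,
 * or -1 if the cluster number is out of range. */
int major_group_2048(int logical)
{
    /* First cluster number that no longer belongs to group g. */
    static const int first_of_next[6] = {24, 40, 64, 80, 104, 120};
    if (logical < 0)
        return -1;
    for (int g = 0; g < 6; g++)
        if (logical < first_of_next[g])
            return g;
    return -1;
}
```

Even-numbered groups hold 24 clusters and odd-numbered groups hold 16, which is why two different basic permutation sequences (12 and 8 sub-channels) appear in Table 1.1.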

4) Allocating sub-carriers to sub-channels in each major group is performed separately for each OFDMA symbol, by first allocating the pilot carriers within each cluster. After all pilots are mapped, the remaining used sub-carriers define the data sub-channels. To allocate the data sub-channels, the remaining sub-carriers are partitioned into groups of contiguous sub-carriers, and each sub-channel consists of one sub-carrier from each of these groups. The number of groups is therefore equal to the number of sub-carriers per sub-channel, denoted Nsubcarriers. The number of sub-carriers in a group is equal to the number of sub-channels, denoted Nsubchannels. The number of data sub-carriers is thus equal to Nsubcarriers · Nsubchannels. The parameters vary with the FFT size. For FFT size = 2048, use the parameters from Table 1.1, with basic permutation sequence 12 for even-numbered major groups and basic permutation sequence 8 for odd-numbered major groups, to partition the sub-carriers into sub-channels containing 24 data sub-carriers in each symbol. The exact partitioning into sub-channels follows the permutation formula (1.3).

$$\mathrm{subcarrier}(k,s) = N_{subchannels} \cdot n_k + \{\, p_s[n_k \bmod N_{subchannels}] + \mathrm{DL\_PermBase} \,\} \bmod N_{subchannels} \qquad (1.3)$$

where:

subcarrier(k,s) is the sub-carrier index of sub-carrier k in sub-channel s,

s is the index number of a sub-channel, from the set {0,...,Nsubchannels-1},

$n_k = (k + 13 \cdot s) \bmod N_{subcarriers}$,

Nsubchannels is the number of sub-channels (for PUSC, use the number of sub-channels in the currently partitioned major group),

ps[j] is the series obtained by rotating the basic permutation sequence cyclically to the left s times,

DL_PermBase is an integer ranging from 0 to 31, which is set to the preamble IDCell in the first zone and determined by the DL-MAP for other zones.
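The permutation formula can be checked with a small C sketch. The function name `pusc_subcarrier` and the array name `BASIC_PERM_12` are hypothetical; the sequence values are those of Table 1.1, and the definition of n_k as (k + 13·s) mod Nsubcarriers follows the standard's formula.

```c
/* Basic permutation sequence 12 from Table 1.1. */
static const int BASIC_PERM_12[12] = {6, 9, 4, 8, 10, 11, 5, 2, 7, 3, 1, 0};

/* Index of sub-carrier k in sub-channel s within one major group, as in (1.3).
 * basic_perm must hold n_subchannels entries. */
int pusc_subcarrier(int k, int s, int dl_perm_base,
                    const int *basic_perm,
                    int n_subchannels, int n_subcarriers)
{
    int nk = (k + 13 * s) % n_subcarriers;
    /* p_s[j] is the basic sequence rotated cyclically to the left s times,
     * so p_s[j] = basic_perm[(j + s) mod n_subchannels]. */
    int ps = basic_perm[(nk % n_subchannels + s) % n_subchannels];
    return n_subchannels * nk + (ps + dl_perm_base) % n_subchannels;
}
```

For an even-numbered major group (12 sub-channels of 24 data sub-carriers each), the mapping assigns each of the 288 data sub-carrier positions to exactly one pair (k, s), which is a useful sanity check on any implementation.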

On initialization, an SS must search for the downlink preamble. After finding the preamble, the user shall know the IDcell used for the data Sub-channels.

Chapter 2 The DSP Hardware and Associated Software Development Environment

DSP implementation is the final goal of our work. Our platform is the MSC8126ADS board (see Fig. 2.1), made by Freescale Semiconductor. In this chapter, we introduce the architecture of this DSP board.

This chapter is organized as follows. In section 2.1, we present the architecture of the MSC8126ADS board. In section 2.2, we introduce how to develop speed-optimized code on the SC140 cores.

2.1 The MSC8126ADS board

2.1.1 MSC8126ADS Board Features

The MSC8126ADS board uses the Freescale MSC8126 processor [6], a highly integrated system-on-a-chip device containing four StarCore SC140 DSP cores, along with an MSC8103 device as the host processor. The MSC8126ADS board serves as a platform for software and hardware development in the MSC8126 processor environment. Developers can use the on-board resources and the associated debugger to perform a variety of tasks, such as downloading and running code, setting breakpoints, displaying memory and registers, and connecting proprietary hardware via the expansion connectors. This board works seamlessly with the CodeWarrior Development Studio for StarCore. Following [6], Table 2.1 lists the MSC8122/26ADS board features.

Fig. 2.1 MSC8122/8126ADS Top-side Part Location Diagram [6].

Table 2.1 MSC8126 ADS board features

Feature Description

MSC8126ADS board

• Host debug through a single JTAG connector supports both the MSC8103 and MSC8126 processors.

• MSC8103 is the MSC8126 host. The MSC8103 system bus connects to the MSC8126 DSI.

• Emulates MSC8126 DSP farm by connecting to three other ADS boards.

2.1.2 MSC8126 Features

The MSC8126 (see Fig. 2.2) is a highly integrated system-on-a-chip that combines four SC140 extended cores with a turbo coprocessor (TCOP), a Viterbi coprocessor (VCOP), an RS-232 serial interface, four time-division multiplexed (TDM) serial interfaces, thirty-two general-purpose timers, a flexible system interface unit (SIU), an Ethernet interface, and a multi-channel DMA controller.

The SC140 extended core (see Fig. 2.3) is a flexible, programmable DSP core that handles compute-intensive communications applications, providing high performance, low power, and code density. It efficiently deploys a novel variable-length execution set (VLES), attaining maximum parallelism by allowing multiple address generation and data arithmetic logic units to execute multiple operations in a single clock cycle. A single SC140 core running at 500 MHz can perform 2000 MMACS (four ALUs × 500 MHz). With four such cores, the MSC8126 can perform up to 8000 MMACS.

Based on [7], we organized the features of the MSC8126 and the SC140 extended core and listed them in Table 2.2. The block diagram of the MSC8126 is shown in Fig. 2.2, and that of the SC140 extended core is shown in Fig. 2.3.

Table 2.2 MSC8126 Features

Feature Description

MSC8126

• Four-core DSP with internal clock up to 500 MHz at 1.2 V.

• System bus frequency up to 166 MHz using 64 or 32 data lines, addressing up to 4 GB external memory

•DSI frequency up to 100 MHz as a 32-bit or 64-bit slave on the MSC8103 system bus

•Includes Viterbi coprocessor and Turbo coprocessor.

SC140 Core

Four SC140 cores:

•Up to 8000 MMACS using 16 ALUs running at up to 500 MHz.

•A total of 1436 KB of internal SRAM (224 KB per core + 16 KB ICache per core + the shared M2 memory).

Each SC140 core provides the following:

•Up to 2000 MMACS using an internal 500 MHz clock. A MAC operation includes a multiply-accumulate command with the associated data move and pointer update.

•4 ALUs per SC140 core.

•16 data registers, 40 bits each.

•27 address registers, 32 bits each.

•Hardware support for fractional and integer data types.

•Very rich 16-bit wide orthogonal instruction set.

•Up to six instructions executed in a single clock cycle.

•Variable-length execution set (VLES) that can be optimized for code density and performance.

•Enhanced on-device emulation (EOnCE) with real-time debugging capabilities.

Extended Core

Each SC140 core is embedded within an extended core that provides the following:

•224 KB M1 memory that is accessed by the SC140 core with zero wait states.

•Support for atomic accesses to the M1 memory.

•16 KB instruction cache, 16 ways.

•A four-entry write buffer that frees the SC140 core from waiting for a write access to finish.

Multi-Core Shared Memories

•M2 memory (shared memory):

—A 476 KB memory working at the core frequency.

—Accessible from the local bus.

—Accessible from all four SC140 cores using the MQBus.

•4 KB bootstrap ROM.

Fig. 2.2 MSC8126 Block Diagram [7].

Fig. 2.3 SC140 Extended Core Block Diagram [7].

2.2 Developing Optimized Code for Speed on the SC140 Cores

Speed optimization techniques on the SC140 core are generally classified as follows [8]:

• Loop unrolling

• Split computation

• Multisampling

2.2.1 Loop unrolling

The most popular speed optimization technique, loop unrolling explicitly repeats the body of a loop with corresponding indices. As a stand-alone technique, loop unrolling increases the Data ALU usage per loop step. If the iterations are independent, each one is performed on a single Data ALU. For example, the following code unrolls the loop three times to create four operations to be executed per one loop step:

Example 1. Loop Unrolling

    Word16 signal[SIG_LEN];
    #pragma align signal 8

    for ( i = 0; i < SIG_LEN; i+=4 ) {
        signal[i+0] = L_shr(signal[i+0], 2);
        signal[i+1] = L_shr(signal[i+1], 2);
        signal[i+2] = L_shr(signal[i+2], 2);
        signal[i+3] = L_shr(signal[i+3], 2);
    }

In this document, the unroll-factor refers to the number of copies of the original loop that are in the unrolled loop. For example, in Example 1, the unroll-factor is 4.
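Example 1 assumes that SIG_LEN is a multiple of the unroll-factor. A portable C sketch of the same idea, with a tail loop for the leftover samples, might look as follows. The function name `scale_down_unrolled` is hypothetical, plain `>>` replaces the `L_shr` intrinsic, and the sketch assumes that right-shifting a negative value is an arithmetic shift, as it is on mainstream compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* Shifts every sample right by 2 bits, four samples per loop step. */
void scale_down_unrolled(int16_t *signal, size_t len)
{
    size_t i = 0;

    /* Unrolled body: four independent operations per iteration. */
    for (; i + 4 <= len; i += 4) {
        signal[i + 0] = (int16_t)(signal[i + 0] >> 2);
        signal[i + 1] = (int16_t)(signal[i + 1] >> 2);
        signal[i + 2] = (int16_t)(signal[i + 2] >> 2);
        signal[i + 3] = (int16_t)(signal[i + 3] >> 2);
    }

    /* Tail loop: handles the 0 to 3 samples left over. */
    for (; i < len; i++)
        signal[i] = (int16_t)(signal[i] >> 2);
}
```

On the SC140 the tail loop is usually avoided by padding the buffer to a multiple of four, which keeps the alignment requirement of the unrolled loop intact.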

2.2.2 Split Computation

A frequent operation in DSP computations is to reduce one dimension of a data array (scalars are zero-dimensional, vectors are one-dimensional, and matrices are two-dimensional). The most frequently used reductions are the energy of a vector, the mean square error, and the maximum of a vector. If the reduction operator is associative and commutative, the reduction can be performed by splitting the original array into several smaller arrays (usually four on the SC140 core). The reduction is applied to each smaller array, and the partial results are combined to obtain the final result, as shown in Example 2.

Example 2. Split Computation

    /* Energy computation for the signal[] vector of */
    /* size SIG_LEN (multiple of 4). */
    L_e0 = L_e1 = L_e2 = L_e3 = 0;
    for ( i = 0; i < SIG_LEN; i+=4 ) {
        L_e0 = L_mac(L_e0, signal[i+0], signal[i+0]);
        L_e1 = L_mac(L_e1, signal[i+1], signal[i+1]);
        L_e2 = L_mac(L_e2, signal[i+2], signal[i+2]);
        L_e3 = L_mac(L_e3, signal[i+3], signal[i+3]);
    }
    L_e0 = L_add(L_e0, L_e1);
    L_e2 = L_add(L_e2, L_e3);
    L_e0 = L_add(L_e0, L_e2);

The same conditions must be met as for loop unrolling (for example, the vector alignment and the loop counter). In addition, split computations are used if the operator on the given data set is associative and commutative.
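The same split can be tried on any host compiler with a portable sketch such as the one below. The function name `energy_split` is hypothetical. Unlike `L_mac`, the plain integer arithmetic used here does not saturate and does not apply the fractional left shift of the intrinsic, so the numbers differ from the SC140 result by that scaling.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of squares of signal[0..len-1] using four partial accumulators. */
int64_t energy_split(const int16_t *signal, size_t len)
{
    int64_t e0 = 0, e1 = 0, e2 = 0, e3 = 0;
    size_t i = 0;

    /* Each accumulator covers every fourth sample, so the four
     * multiply-accumulates in one step do not depend on each other. */
    for (; i + 4 <= len; i += 4) {
        e0 += (int32_t)signal[i + 0] * signal[i + 0];
        e1 += (int32_t)signal[i + 1] * signal[i + 1];
        e2 += (int32_t)signal[i + 2] * signal[i + 2];
        e3 += (int32_t)signal[i + 3] * signal[i + 3];
    }

    /* Leftover samples when len is not a multiple of 4. */
    for (; i < len; i++)
        e0 += (int32_t)signal[i] * signal[i];

    /* Combining the partial sums is valid because addition is
     * associative and commutative. */
    return (e0 + e1) + (e2 + e3);
}
```

With saturating arithmetic the split and the sequential result can differ when an intermediate sum overflows, which is one reason the operator is required to be associative.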

2.2.3 Multisampling

The multisampling technique is frequently used in nested loops and is a combination of primitive transformations. Given a nested loop formed out of OL (outer loop) and IL (inner loop containing one or two instructions), the multisampling transformation consists of the following:

•A loop unroll applied for OL to create a new OL with four IL inside (IL0, IL1, IL2, and IL3)

•A loop merge applied for IL0, IL1, IL2, and IL3 to create a new IL that makes more efficient use of the DALU units.

•A loop unroll applied to the newly-obtained IL so that the programmer can detail the reuse of already fetched values in the computations inside the new IL.

In Example 3, the nested loop computes the maximum absolute value of the correlations between X[] and h[]:

Example 3. Code Before Multisampling

    L_max = 0;

Example 4 shows the result of applying the multisampling technique. The speed and size estimations are not as obvious as they are for loop unrolling and split computation. We have the following before multisampling:

Example 4. Code After Multisampling

        L_s0 = L_mac(L_s0, x_curr, h0);
        L_s1 = L_mac(L_s1, x_curr, h1);
        L_s2 = L_mac(L_s2, x_curr, h2);
        L_s3 = L_mac(L_s3, x_curr, h3);
        h3 = h[j+1-i]; x_curr = X[j+1];
        L_s0 = L_mac(L_s0, x_curr, h3);
        L_s1 = L_mac(L_s1, x_curr, h0);
        L_s2 = L_mac(L_s2, x_curr, h1);
        L_s3 = L_mac(L_s3, x_curr, h2);
        h2 = h[j+2-i]; x_curr = X[j+2];
        L_s0 = L_mac(L_s0, x_curr, h2);
        L_s1 = L_mac(L_s1, x_curr, h3);
        L_s2 = L_mac(L_s2, x_curr, h0);
        L_s3 = L_mac(L_s3, x_curr, h1);
        h1 = h[j+3-i]; x_curr = X[j+3];
        L_s0 = L_mac(L_s0, x_curr, h1);
        L_s1 = L_mac(L_s1, x_curr, h2);
        L_s2 = L_mac(L_s2, x_curr, h3);
        L_s3 = L_mac(L_s3, x_curr, h0);
        h0 = h[j+4-i]; x_curr = X[j+4];
        L_max0 = L_max(L_max0, L_s0);
        L_max1 = L_max(L_max1, L_s1);
        L_max2 = L_max(L_max2, L_s2);
        L_max3 = L_max(L_max3, L_s3);
    }
    L_max0 = L_max(L_max0, L_max1);
    L_max1 = L_max(L_max2, L_max3);
    L_max0 = L_max(L_max0, L_max1);

The speed increases by sample-factor times, but the code size also increases significantly. Therefore, multisampling should be used only if the speed constraints are much more important than the size constraints.
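The idea can be made concrete with a self-contained sketch in portable C. It is not the code of Examples 3 and 4: the function names are hypothetical, the correlation is indexed as the sum over j of x[i+j]·h[j], plain 64-bit arithmetic replaces the saturating intrinsics, and the final unroll of the inner loop is omitted. The sketch assumes that `lags` is a multiple of 4 and that x holds n + lags - 1 samples.

```c
#include <stdint.h>

static int64_t abs64(int64_t v) { return v < 0 ? -v : v; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

/* Before: one correlation lag per outer-loop iteration. */
int64_t max_abs_corr_ref(const int16_t *x, const int16_t *h, int n, int lags)
{
    int64_t best = 0;
    for (int i = 0; i < lags; i++) {
        int64_t s = 0;
        for (int j = 0; j < n; j++)
            s += (int32_t)x[i + j] * h[j];
        best = max64(best, abs64(s));
    }
    return best;
}

/* After: the outer loop is unrolled by 4 and the four inner loops are
 * merged, so every h[j] and every x sample is loaded once and reused. */
int64_t max_abs_corr_ms(const int16_t *x, const int16_t *h, int n, int lags)
{
    int64_t best = 0;
    for (int i = 0; i < lags; i += 4) {
        int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int32_t x0 = x[i], x1 = x[i + 1], x2 = x[i + 2];
        for (int j = 0; j < n; j++) {
            int32_t hj = h[j];
            int32_t x3 = x[i + j + 3];   /* the only new x load per step */
            s0 += x0 * hj;               /* lag i     */
            s1 += x1 * hj;               /* lag i + 1 */
            s2 += x2 * hj;               /* lag i + 2 */
            s3 += x3 * hj;               /* lag i + 3 */
            x0 = x1; x1 = x2; x2 = x3;   /* slide the sample window */
        }
        best = max64(best, max64(max64(abs64(s0), abs64(s1)),
                                 max64(abs64(s2), abs64(s3))));
    }
    return best;
}
```

Each inner step performs four multiply-accumulates for two memory loads, compared with one multiply-accumulate for two loads in the reference version, which is where the speed gain comes from.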

Chapter 3 Channel Estimation Techniques for IEEE 802.16e Downlink and DSP Implementation

In this chapter, we introduce three algorithms of channel estimation for IEEE 802.16e OFDMA transmission system and evaluate the performance of each channel estimation method mainly by the bit error rate (BER) and the mean square error (MSE).

This chapter is organized as follows. In section 3.1, we present the channel estimation algorithms. In section 3.2, we introduce our simulation environment. In section 3.3, we show floating-point simulation figures for all algorithms. In section 3.4, we show performance tables and fixed-point simulation figures for the DSP implementation. Section 3.5 describes the WiMAX system integration on the DSP platform.

3.1 Channel Estimation Techniques for 802.16e Downlink

In IEEE 802.16e OFDMA-PHY downlink PUSC, the sub-carriers are divided into many clusters containing 14 adjacent sub-carriers each. Fig. 3.1 depicts this cluster structure and the position of the pilot sub-carriers in each cluster for even and odd symbols.

According to the pilot arrangement, we adopt three different techniques to estimate the channel and discuss them in the following sections.

Fig. 3.1 Cluster structure [9].

3.1.1 Channel estimation with linear interpolation (LI)

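The received signal $y_k$ (with cyclic prefix removed) can be expressed as

$$y_k = \sum_{l=0}^{L-1} h_{k,l}\, x_{k-l} + w_k, \qquad 0 \le k \le N-1 \qquad (3.1)$$

where $h_{k,l}$ is the $l$-th channel tap at time $k$, $w_k$ represents the additive white Gaussian noise, and $N$ is the FFT size. Taking an FFT of $y_k$, we obtain the received signal in the frequency domain:

$$Y_i = H_{i,i} X_i + \sum_{m \neq i} H_{i,m} X_m + W_i \qquad (3.3)$$

When $m \neq i$, $H_{i,m}$ represents the effect of $X_m$ on $Y_i$, so we can see clearly how ICI is introduced by the time-varying channel. In the following, we use only linear interpolation techniques to estimate $H_{i,i}$.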

Fig. 3.2 Pilot distribution in successive clusters [9].

We ignore $H_{i,m}$ ($m \neq i$) in (3.3) and estimate $H_{i,i}$. This can be done as follows.

Step 1) Estimate $H_{i,i}$ at the pilot positions $i_n$, that is, obtain the frequency responses at the black sub-carriers in each cluster shown in Fig. 3.2. With the ICI terms ignored, the estimate at a pilot is $\hat{H}^{l}_{i_n,i_n} = Y^{l}_{i_n} / X^{l}_{i_n}$, where the superscript $l$ represents the symbol index.

Step 2) Interpolate between symbols to obtain the frequency responses at the brown sub-carriers in different clusters shown in Fig. 3.2. These $\hat{H}^{l}_{i_n,i_n}$ are obtained by linear interpolation.

Step 3) After Step 2, the pilot arrangement can be regarded as equally spaced. To obtain the frequency responses at the white sub-carriers in different clusters shown in Fig. 3.2, we apply linear interpolation once again, this time across sub-carriers; the values on the edge of clusters fall outside the interpolation range.

Step 4) The final step estimates the transmitted frequency-domain data, $\hat{X}_i = Y_i / \hat{H}_{i,i}$.

The steps above show how to use linear interpolation with pilots. If we use the preamble to estimate the channel frequency response, the equation is similar to (3.6) because of the preamble structure (see Fig. 1.11). Simulation results are shown and discussed in a later section.
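Two building blocks of this procedure, the per-pilot estimate and the linear interpolation between two pilot positions, can be sketched in C as follows. This is only an illustration under stated assumptions: the `cplx` type and both function names are hypothetical, the pilot estimate is taken to be the ratio Y/X with ICI ignored, and the handling of cluster edges is left out.

```c
typedef struct { double re, im; } cplx;

/* Per-pilot estimate H = Y / X, computed as Y * conj(X) / |X|^2. */
cplx ls_pilot(cplx y, cplx x)
{
    double d = x.re * x.re + x.im * x.im;
    cplx h = { (y.re * x.re + y.im * x.im) / d,
               (y.im * x.re - y.re * x.im) / d };
    return h;
}

/* Fills h[p0+1 .. p1-1] by linear interpolation between the pilot
 * estimates already stored in h[p0] and h[p1] (p0 < p1). */
void interp_between_pilots(cplx *h, int p0, int p1)
{
    for (int i = p0 + 1; i < p1; i++) {
        double t = (double)(i - p0) / (double)(p1 - p0);
        h[i].re = (1.0 - t) * h[p0].re + t * h[p1].re;
        h[i].im = (1.0 - t) * h[p0].im + t * h[p1].im;
    }
}
```

The same interpolation routine can serve Step 2 and Step 3 if it is called once along the symbol axis and once along the sub-carrier axis.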

3.1.2 Channel Estimation with circular interpolation (CI)

Circular interpolation [14] interpolates values along a circular trajectory. We write every complex value in the form $r e^{j\theta}$, where $r$ is the radius and $\theta$ is the phase in the complex plane. In this section, we interpolate linearly in the radius and in the phase, so the resulting complex values are not linearly interpolated in their real and imaginary parts. The algorithm discussed in section 3.1.1 can therefore be employed here, with all steps repeated on the phase and the radius. Simulation results are shown and discussed in a later section.

3.1.3 Least-Square (LS) Estimator with time-domain linear interpolation

The least-squares channel estimation algorithm is aimed at time-varying channels. In a time-varying environment, it amounts to estimating $N$ channels $h_k = [h_{k,0}, \ldots, h_{k,L-1}]^T$, $0 \le k \le N-1$; in other words, we need to estimate $N \times L$ parameters. We assume that the variation between channels $h_0$ and $h_{N-1}$ is not significant. To reduce complexity, we estimate only the $2 \times L$ parameters of channels $h_0$ and $h_{N-1}$ and obtain the remaining channels by linear interpolation between them. Based on [10], we use $P$ pilot tones to estimate channels $h_0$ and $h_{N-1}$ by the least-squares method, where $P$ must be chosen such that $P \ge 2L$. Revisiting (3.3) for a pilot tone $p$, the received tone $Y_p$ is

$$Y_p = H_{p,p} X_p + \sum_{q \in \text{pilot},\; q \neq p} H_{p,q} X_q + \sum_{n \notin \text{pilot}} H_{p,n} X_n + \text{noise}$$

The numbers $a_{m,r}$ are the linear interpolation coefficients between channels $h_0$ and $h_{N-1}$. With the above description, the pilot-based channel estimation can be achieved by

Step 1) Revisiting (3.9), express the received tone $Y_p$ in terms of the unknown taps of $h_0$ and $h_{N-1}$.

Step 2) Form the $P \times 2L$ system of linear equations (3.14).

Step 3) Obtain $h$ as the least-squares solution of the aforementioned system of linear equations (3.14).

The steps above show how to estimate channels with the least-squares method. Simulation results are shown and discussed in a later section.
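The interpolation of the remaining channels from the two estimated ones can be sketched in C as follows. The `cplx` type and the function name `interp_taps` are hypothetical, and the weight k/(N-1) is an assumption made for this illustration; the exact interpolation coefficients used in the text are not reproduced here.

```c
typedef struct { double re, im; } cplx;

/* Computes the n_taps taps of channel h_k from the first channel h_0 and
 * the last channel h_{N-1}, assuming a linear change over the symbol:
 * h_k = (1 - a) * h_0 + a * h_{N-1}, with a = k / (N - 1). */
void interp_taps(const cplx *h_first, const cplx *h_last, int n_taps,
                 int k, int n_fft, cplx *h_k)
{
    double a = (double)k / (double)(n_fft - 1);
    for (int l = 0; l < n_taps; l++) {
        h_k[l].re = (1.0 - a) * h_first[l].re + a * h_last[l].re;
        h_k[l].im = (1.0 - a) * h_first[l].im + a * h_last[l].im;
    }
}
```

Only 2L values have to be estimated from the pilots; the other N - 2 channels follow from this routine at a cost of a few multiplications per tap.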

3.1.4 ICI Cancellation by Equalization of Time-Varying Channels

For mobile applications, channel variations within an OFDM block period destroy the orthogonality between sub-carriers; the resulting effect, known as inter-carrier interference (ICI), degrades the system performance. Following [11], we use a block MMSE equalizer to cancel the ICI. The received signal $Y$ can be expressed by

$$Y = \Lambda X + \text{noise} \qquad (3.15)$$

where $Y$ is the $N \times 1$ received vector, $X$ is the $N \times 1$ transmitted vector, and $\Lambda$ is the $N \times N$ frequency-domain channel matrix given below:

$$\Lambda = \begin{bmatrix} H_{0,0} & H_{0,1} & \cdots & H_{0,N-1} \\ H_{1,0} & H_{1,1} & \cdots & H_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ H_{N-1,0} & H_{N-1,1} & \cdots & H_{N-1,N-1} \end{bmatrix} \qquad (3.16)$$

All elements of the channel matrix $\Lambda$ can be estimated by the LS channel estimation introduced in section 3.1.3. Linear block MMSE equalization can be expressed by

$$\hat{X}_{MMSE} = \Lambda^{H} \left( \Lambda \Lambda^{H} + \gamma^{-1} I_N \right)^{-1} Y \qquad (3.17)$$

where $\gamma$ is the signal-to-noise ratio (SNR), $I_N$ is the $N$-dimensional identity matrix, and the superscript $H$ denotes the conjugate transpose. The transmitted signal $X$ can be recovered by (3.17). In practice, the time-varying channel matrix cannot be estimated accurately, so the ICI cancellation is not very effective. Simulation results are shown and discussed in the following sections.
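When the channel matrix is diagonal, that is, when there is no ICI, (3.17) reduces to an independent one-tap equalizer on every sub-carrier. The C sketch below shows only this special case; the `cplx` type and the function name `mmse_one_tap` are hypothetical, and the full block equalizer additionally needs an N × N complex matrix inversion that is not shown.

```c
typedef struct { double re, im; } cplx;

/* One-tap MMSE estimate x_hat = conj(h) * y / (|h|^2 + 1/snr),
 * the diagonal special case of the block equalizer in (3.17).
 * snr is the linear (not dB) signal-to-noise ratio. */
cplx mmse_one_tap(cplx h, cplx y, double snr)
{
    double d = h.re * h.re + h.im * h.im + 1.0 / snr;
    cplx x = { (h.re * y.re + h.im * y.im) / d,
               (h.re * y.im - h.im * y.re) / d };
    return x;
}
```

As the SNR grows, the 1/snr term vanishes and the estimate approaches the zero-forcing result Y/H; at low SNR the term keeps the equalizer from amplifying noise on weak sub-carriers.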

3.1.5 Computational Complexity Analysis

Table 3.1 gives the computational complexity analysis of all algorithms, where Nused is the number of all data sub-carriers, N is the FFT size, Np is the number of all pilots, and L is the number of channel taps. The phase and amplitude are computed with CORDIC algorithms [13] and quantized to 14 bits. The matrix inversion in (3.17) requires $O(N^3)$ flops.

Table 3.1 Computational Complexity
