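DSP Implementation - Research and Design of Wireless Multimedia Transceivers Based on Orthogonal Frequency-Division Multiple Access: Subproject 4: Research on Wireless OFDMA Channel Usage Techniques and Overall System Integration (II)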

We employ simple least-squares (LS) channel estimation at the pilot subcarriers [15], [16], which merely divides the received signal at each pilot subcarrier by the known pilot value at that subcarrier to obtain the corresponding channel estimate. In our case, the division can even be avoided, because the pilot subcarriers are BPSK-modulated. After the LS channel estimation at the pilot subcarrier locations, we perform frequency-domain interpolation and time-domain filtering to obtain the response at non-pilot subcarrier locations and to reduce the effect of noise on the LS channel estimates. Several interpolation and filtering methods are considered: linear and second-order interpolation in the frequency domain, and two-dimensional (2D) interpolation and LMS adaptation in the time domain. The techniques and their performance as obtained from computer simulation are discussed in considerable detail in [2] and [3]. The simulation results show that linear interpolation in the frequency domain performs about as well as second-order interpolation, and that 2D interpolation in the time domain outperforms LMS adaptation. Hence we employ linear interpolation and 2D interpolation in the DSP implementation. The following discussion concentrates on linear interpolation.
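The two steps described above, LS estimation at BPSK pilots followed by linear interpolation across frequency, can be sketched as follows. This is a minimal floating-point illustration, not the authors' DSP code. The pilot amplitude 4/3 is taken from the description of the Pilot Location block later in this chapter; the function names, the pilot positions, and the choice to hold the nearest pilot estimate outside the outermost pilots are assumptions of this sketch.

```python
def ls_pilot_estimates(rx_pilots, pilot_signs, amplitude=4.0 / 3.0):
    """LS estimate H = Y / p at pilots whose value is p = sign * amplitude.

    Because p is real with a known magnitude, Y / p equals
    Y * sign * (1 / amplitude): a sign flip and one constant scaling,
    so no per-pilot division is required.
    """
    inv_amp = 1.0 / amplitude  # constant, computed once
    return [y * s * inv_amp for y, s in zip(rx_pilots, pilot_signs)]


def linear_interpolate(pilot_idx, pilot_h, num_subcarriers):
    """Linearly interpolate channel estimates between neighbouring pilots.

    pilot_idx must be sorted and contain at least two pilots.
    Subcarriers outside the outermost pilots reuse the nearest pilot
    estimate (an assumption of this sketch).
    """
    h = [0j] * num_subcarriers
    for k in range(0, pilot_idx[0]):
        h[k] = pilot_h[0]
    for k in range(pilot_idx[-1] + 1, num_subcarriers):
        h[k] = pilot_h[-1]
    for i in range(len(pilot_idx) - 1):
        k0, k1 = pilot_idx[i], pilot_idx[i + 1]
        h0, h1 = pilot_h[i], pilot_h[i + 1]
        for k in range(k0, k1 + 1):
            w = (k - k0) / (k1 - k0)  # interpolation weight in [0, 1]
            h[k] = (1.0 - w) * h0 + w * h1
    return h
```

For a channel that varies linearly between pilots and in the absence of noise, the interpolated response reproduces the true response up to rounding error.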

B. DSP Implementation

We consider a fixed-point DSP software implementation on the Texas Instruments (TI) TMS320C6416 DSP. Its CPU contains eight parallel 32-bit functional units, two of which are multipliers, while the remaining six can perform a number of arithmetic, logic, and memory access operations. The data can also be arranged so that each functional unit performs dual 16-bit or quadruple 8-bit operations. Running at 600 MHz, the CPU has a peak performance of 4800 MIPS.

2 This chapter is mainly excerpted from R.-C. Chen, D. W. Lin, and C.-J. Wu, “Pilot-aided channel estimation for IEEE 802.16 OFDMA TDD downlink transmission and its DSP software implementation,” to appear in Proc. Workshop Consumer Electronics Signal Processing, Yunlin, Taiwan, ROC, Nov. 2005.

TI provides a useful software development tool set with a convenient graphical user interface (GUI), called Code Composer Studio. It includes, among other things, a compiler, a debugger, and a profiler that helps programmers analyze the efficiency of their code. The compiler supports several options to optimize the code either for size or for execution speed. In our case, code size is not a concern, but speed is. Hence we use -o3, the highest level (program level) of optimization. TI's tools also provide a set of “intrinsics,” which are C-callable functions mapped directly to assembly instructions that are not easily expressible in C. Examples of such intrinsics are functions for parallel loading of multiple data and parallel multiplications of multiple 16-bit data. We make use of some such intrinsics in our implementation.
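To make the idea of a parallel 16-bit multiplication concrete, the sketch below emulates in Python the arithmetic of a packed dual 16-bit multiply-and-add: two signed 16-bit values are packed into each 32-bit word, and the two pairwise products are summed. This is the kind of operation TI exposes through intrinsics such as `_dotp2`. The source does not state which intrinsics were actually used, and the Python function names here are hypothetical.

```python
def pack16(hi, lo):
    """Pack two signed 16-bit integers into one 32-bit word (hi:lo)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def _signed16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def dotp2_emulated(a, b):
    """Return hi(a)*hi(b) + lo(a)*lo(b) for two packed 32-bit words.

    On the DSP this is a single instruction; 32-bit overflow of the
    result is not modelled in this sketch.
    """
    return (_signed16(a >> 16) * _signed16(b >> 16)
            + _signed16(a) * _signed16(b))
```

As a usage example, the real part of the complex product (ar + j·ai)(br + j·bi), which is ar·br − ai·bi, can be obtained as `dotp2_emulated(pack16(ar, ai), pack16(br, -bi))`.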

Now we turn to the fixed-point DSP implementation. This entails careful conversion of the original program based on floating-point computation, used in simulation, to fixed-point.
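As an illustration of what such a conversion involves, the following sketch quantizes floating-point values to a 16-bit fixed-point format and multiplies them with rounding and saturation. The Q15 format (one sign bit, 15 fractional bits) is an assumption made here for illustration; the source states only that 16-bit computation is used and does not specify the scaling. All names are hypothetical.

```python
Q15_ONE = 1 << 15                 # scale factor: 1.0 maps to 32768
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def to_q15(x):
    """Quantize a float in [-1, 1) to Q15 with rounding and saturation."""
    return saturate16(int(round(x * Q15_ONE)))


def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 numbers.

    The full product has 30 fractional bits; add half an LSB for
    rounding, shift back to 15 fractional bits, then saturate.
    """
    return saturate16((a * b + (1 << 14)) >> 15)
```

The quantization error of `to_q15` is at most half an LSB (2^-16) for inputs inside the representable range, which is one reason the dynamic range of every intermediate signal has to be checked during conversion.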

Fig. 3-1. Structure of the implemented channel estimation system.

Figure 3-1 shows the structure of the implemented system. As far as channel estimation is concerned, the key function is Linear_Interp, which performs linear interpolation; the other components play only a supporting role and are not the focus of this work (but are considered in greater depth in related studies such as [5]). The block Modulation (QPSK, 16-QAM, 64-QAM) maps binary data to the constellation points, Complex_Mul performs complex multiplications to simulate the channel filtering effect, Pilot Location is the LS estimator, which divides the received signal Y(f) at pilot locations by p = 4/3 or -4/3, Complex_Div is an equalizer, which divides the received data signal by the estimated channel response, and De_Modulation maps the equalized signal to the constellation points and then back to binary data. In a practical implementation, the functions Pilot Location, Complex_Div, and De_Modulation should be redesigned for efficiency, e.g., by avoiding divisions and by integrating them with the subsequent error-control decoder. In the present work, however, they are left as is.
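One standard way to carry out the complex division performed by an equalizer such as Complex_Div with real arithmetic is to multiply by the conjugate of the channel estimate and divide by its squared magnitude, Y/H = Y·conj(H)/|H|². The sketch below is an assumption about the form of the computation, not the authors' code, but its cost of 6 real multiplications, 3 real additions, and 2 divisions matches the per-sample operation count quoted for Complex_Div in the efficiency analysis of this chapter. The function name is hypothetical, and a zero channel estimate is not handled.

```python
def complex_div_real_ops(yr, yi, hr, hi):
    """Return Y / H as (real, imag) using only real arithmetic.

    Y = yr + j*yi is the received sample, H = hr + j*hi the channel
    estimate. Cost: 6 multiplications, 3 additions, 2 divisions.
    """
    den = hr * hr + hi * hi          # |H|^2: 2 mul, 1 add
    num_r = yr * hr + yi * hi        # Re{Y conj(H)}: 2 mul, 1 add
    num_i = yi * hr - yr * hi        # Im{Y conj(H)}: 2 mul, 1 add
    return num_r / den, num_i / den  # 2 div
```

For example, `complex_div_real_ops(1, 2, 3, -4)` evaluates (1 + 2j)/(3 − 4j).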

Table 3-1 lists the code sizes and the execution speed of Linear_Interp and some other function blocks in our final fixed-point implementation employing 16-bit computation, where “load” refers to the number of DSPs needed for real-time execution of the given function. Simulation shows that the noise performance of the 16-bit implementation is similar to that of the original floating-point program. Enhancement of the implementation is in progress. Reference [5] discusses optimization of the modulation function.

It is of interest to see how efficient the software is relative to the DSP’s computing power and how the efficiency improves in going from floating-point to fixed-point computation. For fixed-point computation, the DSP can perform six 32-bit additions and two 32-bit multiplications per cycle, or twelve 16-bit additions and four 16-bit multiplications per cycle. Measurement shows that a division takes 22 and 21 cycles in 32-bit and 16-bit fixed-point arithmetic, respectively. For Complex_Mul, each sample costs 4 real multiplications and 2 real additions; for Complex_Div, each sample costs 6 real multiplications, 3 real additions, and 2 divisions. Since a downlink OFDMA symbol contains 1702 used subcarriers (including pilots and data), the minimum cycle counts needed per symbol for Complex_Mul are roughly max{2/6,4/2}×1702 = 3404 and max{2/12,4/4}×1702 = 1702 with 32-bit and 16-bit fixed-point computation, respectively. Those needed for Complex_Div are roughly (max{3/6,6/2}+2×22/2)×1702 = 42550 and (max{3/12,6/4}+2×21/2)×1702 = 38295 with 32-bit and 16-bit computation, respectively.
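The minimum cycle counts above can be reproduced with a few lines of arithmetic. The sketch below simply encodes the formulas from the text: additions and multiplications are assumed to overlap, so the slower of the two sets the arithmetic cost, and the division term carries the same factor of 1/2 that appears in the text. The dictionary layout and function name are hypothetical.

```python
USED_SUBCARRIERS = 1702  # used subcarriers per downlink OFDMA symbol

# Per-cycle throughput and measured division latency, as quoted in the text.
DSP = {
    "32-bit": {"adds": 6, "muls": 2, "div_cycles": 22},
    "16-bit": {"adds": 12, "muls": 4, "div_cycles": 21},
}


def min_cycles(adds, muls, divs, word, samples=USED_SUBCARRIERS):
    """Minimum cycles per OFDMA symbol for a per-sample operation mix."""
    rate = DSP[word]
    # Additions and multiplications run in parallel units.
    per_sample = max(adds / rate["adds"], muls / rate["muls"])
    # Division term, with the factor 1/2 copied from the text's formula.
    per_sample += divs * rate["div_cycles"] / 2
    return per_sample * samples


complex_mul = {w: min_cycles(2, 4, 0, w) for w in DSP}
complex_div = {w: min_cycles(3, 6, 2, w) for w in DSP}
```

Here `complex_mul` and `complex_div` hold the per-symbol minimum cycle counts for the two word lengths.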

Table 3-1. Profile of Implemented Function Blocks Using 16-Bit Fixed-Point Computation

Function        Code Size (Bytes)   Load (# DSPs)
Complex_Mul     272                 0.02
Linear_Interp   332                 0.55
Complex_Div     428                 1.15
De_Modulation   1068                1.05

Table 3-2. Comparison of Minimum Cycles and Actual Cycles Consumed per OFDMA Symbol Under Different Data Types for Various Functions

Function   Type     Actual      Minimum   Efficiency
A          float    899,231     3,404     0.38%
A          32-bit   15,338      3,404     22.19%
A          16-bit   3,421       1,702     49.75%
B          float    1,900,051   42,550    2.24%
B          32-bit   688,850     42,550    6.18%
B          16-bit   162,960     38,295    23.50%
C          float    467,233     12,852    2.75%
C          32-bit   441,423     26,082    5.91%
C          16-bit   67,705      24,381    36.01%

A: Complex_Mul, B: Complex_Div, C: Linear_Interp

Table 3-2 lists the efficiency figures, where the efficiency is defined as the ratio of the minimum cycles needed to the actual cycles consumed. For the floating-point implementations of Complex_Mul and Complex_Div, we have used the same minimum cycle counts as for 32-bit fixed-point computation to gauge the efficiency. The efficiency improves significantly in going from floating-point computation to 32-bit and then to 16-bit fixed-point computation.

Now consider Linear_Interp, which uses 567×8 additions, 567×4 multiplications, and 567×4 divisions per OFDMA symbol. The minimum cycle counts are therefore 26082 with 32-bit fixed-point computation and 24381 with 16-bit fixed-point computation. The resulting efficiency is also listed in Table 3-2. The minimum cycle count for floating-point computation is calculated on a different basis, whose details are omitted here.
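The text does not spell out how the Linear_Interp minimum cycle counts are obtained. The sketch below assumes the same accounting as for Complex_Div (additions and multiplications overlap, and each division is charged half its measured latency); under that assumption the quoted figures of 26082 and 24381 cycles, and the corresponding efficiencies in Table 3-2, are reproduced. The names are hypothetical.

```python
# Operation counts per OFDMA symbol for Linear_Interp, from the text.
OPS = {"adds": 567 * 8, "muls": 567 * 4, "divs": 567 * 4}


def linear_interp_min_cycles(adds_per_cycle, muls_per_cycle, div_cycles):
    """Minimum cycles per symbol, assuming the Complex_Div accounting."""
    arith = max(OPS["adds"] / adds_per_cycle, OPS["muls"] / muls_per_cycle)
    return arith + OPS["divs"] * div_cycles / 2


def efficiency_percent(minimum_cycles, actual_cycles):
    """Efficiency as defined in the text: minimum over actual cycles."""
    return 100.0 * minimum_cycles / actual_cycles


min_32 = linear_interp_min_cycles(6, 2, 22)    # 32-bit fixed point
min_16 = linear_interp_min_cycles(12, 4, 21)   # 16-bit fixed point
eff_32 = efficiency_percent(min_32, 441423)    # actual cycles, Table 3-2
eff_16 = efficiency_percent(min_16, 67705)
```

In both cases the division term dominates the total, which is consistent with the remark that replacing the divisions is a promising way to speed up the code.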

The implemented channel estimator can achieve real-time execution speed for the considered transmission bandwidth of 10 MHz. The execution speed can be improved further, for example, by replacing the divisions with equivalent operations; such work is part of ongoing studies. Indeed, part of the inefficiency in the final code is due to checks that prevent division by zero. Conditional statements of this kind hamper the compiler’s ability to perform software pipelining. An enhanced version of the overall transceiver is currently being developed.

4. DSP Implementation and Integration of the TDD OFDMA Downlink Transceiver System³

A. Introduction

Figure 4-1 depicts the downlink (DL) transmitter and receiver structures. Not all blocks are treated in equal depth in this study. Some system parameters used in our study are listed in Table 4-1. We refer to the IEEE 802.16a and 802.16-2004 standards for detailed explanation of the parameters. Suffice it to say that the center frequency and the signal bandwidth are chosen arbitrarily but are typical of some foreseeable applications.

[Figure 4-1 block diagrams not reproduced. Legible block labels: scrambler, FEC, modulation, S/P, add virtual carriers. Some blocks are marked as not addressed in the present study.]

Fig. 4-1. (a) DL transmitter structure. (b) DL receiver structure. (From [17].)

Table 4-1. System Parameters Used in This Study

Parameter                  Value
Number of carriers (N)     2048
Center frequency           6 GHz
Signal bandwidth (BW)      10 MHz
Carrier spacing (Δf)       5.58 kHz
Sampling frequency (fs)    11.43 MHz
OFDM symbol time (Ts)      201.6 μs (2304 samples)
Cyclic prefix time (Tg)    22.4 μs (256 samples)
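The entries of Table 4-1 are mutually consistent, as the following short check shows. It assumes a sampling factor of 8/7 relative to the signal bandwidth, which is not stated in the text but reproduces the listed sampling frequency of 11.43 MHz; the remaining quantities then follow from N and the cyclic prefix length. Variable names are hypothetical.

```python
from fractions import Fraction

N_FFT = 2048                      # number of carriers (N)
BW_HZ = 10_000_000                # signal bandwidth
CP_SAMPLES = 256                  # cyclic prefix length in samples
SAMPLING_FACTOR = Fraction(8, 7)  # assumption: fs = (8/7) * BW

fs = SAMPLING_FACTOR * BW_HZ               # sampling frequency in Hz
carrier_spacing = fs / N_FFT               # subcarrier spacing in Hz
symbol_samples = N_FFT + CP_SAMPLES        # samples per OFDM symbol
symbol_time = Fraction(symbol_samples) / fs  # seconds, including prefix
cp_time = Fraction(CP_SAMPLES) / fs          # seconds

print(f"fs = {float(fs) / 1e6:.2f} MHz")
print(f"carrier spacing = {float(carrier_spacing) / 1e3:.2f} kHz")
print(f"Ts = {float(symbol_time) * 1e6:.1f} us ({symbol_samples} samples)")
print(f"Tg = {float(cp_time) * 1e6:.1f} us ({CP_SAMPLES} samples)")
```

Exact rational arithmetic is used so that the symbol and cyclic prefix times come out as exact values rather than rounded floats.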

3 This chapter is mainly excerpted from Y.-S. Chen, D. W. Lin, and C.-J. Wu, “DSP Software Implementation and Integration of IEEE 802.16 TDD-OFDMA-mode downlink transceiver functions,” to appear in Proc. Int.

We have employed four-times oversampled square-root raised cosine (SRRC) transmitter and receiver filters, where the four-times oversampling is for convenience in simulating non-integer spaced multipath propagation. Both filters have the same length of 57 taps and the rolloff factor is set to 0.155 to satisfy the power mask specification [17].
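A filter with these parameters can be generated from the standard closed-form SRRC impulse response, as in the sketch below. The tap count, oversampling factor, and rolloff are taken from the text; the unit-energy normalization, the symmetric placement of the taps about the center, and the function name are assumptions of this sketch, and the authors' actual coefficients may be scaled or quantized differently.

```python
import math


def srrc_taps(num_taps=57, oversampling=4, rolloff=0.155):
    """Square-root raised cosine taps, normalized to unit energy.

    Time is measured in units of the base (non-oversampled) sample
    period, so 57 taps at 4x oversampling span +/-7 sample periods.
    """
    beta = rolloff
    center = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = (n - center) / oversampling
        if t == 0.0:
            # Limit of the general expression at t = 0.
            h = 1.0 - beta + 4.0 * beta / math.pi
        elif abs(abs(4.0 * beta * t) - 1.0) < 1e-12:
            # Removable singularity at |t| = 1 / (4 * beta).
            h = (beta / math.sqrt(2.0)) * (
                (1.0 + 2.0 / math.pi) * math.sin(math.pi / (4.0 * beta))
                + (1.0 - 2.0 / math.pi) * math.cos(math.pi / (4.0 * beta)))
        else:
            num = (math.sin(math.pi * t * (1.0 - beta))
                   + 4.0 * beta * t * math.cos(math.pi * t * (1.0 + beta)))
            den = math.pi * t * (1.0 - (4.0 * beta * t) ** 2)
            h = num / den
        taps.append(h)
    norm = math.sqrt(sum(h * h for h in taps))
    return [h / norm for h in taps]
```

The same taps would serve for both the transmitter and the receiver filter, since the text states that the two filters have the same length and rolloff.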

Synchronization is the most complicated function in this work. Accordingly, Section B introduces the downlink synchronization method, Section C discusses the DSP implementation, and Section D contains the conclusion.
