A Low Complexity SLM with conversion matrix[13]

Chapter 1 Introduction

1.2 A Low Complexity SLM with conversion matrix[13]

Among the various PAPR reduction techniques, the selected mapping (SLM) approach offers high reduction performance, but it suffers from high computational complexity because it uses a bank of inverse fast Fourier transforms (IFFTs). This section therefore focuses on reducing the complexity of SLM.

In brief, SLM multiplies the original data sequence by M-1 statistically independent phase sequences, passes each product through an IFFT, and then transmits the candidate with the lowest PAPR. The block diagram of SLM is shown in Fig. 2 of Chapter 1.
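The selection step described above can be sketched in C. This is a minimal illustration, not the implementation used on the board: the function names `papr` and `slm_select` and the row-by-row candidate layout are our own assumptions, and PAPR is taken as peak power divided by mean power of one time-domain block.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* PAPR of one time-domain block: peak power divided by mean power. */
double papr(const double complex *x, size_t n)
{
    assert(n > 0);
    double peak = 0.0, sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        double p = creal(x[k]) * creal(x[k]) + cimag(x[k]) * cimag(x[k]);
        if (p > peak)
            peak = p;
        sum += p;
    }
    assert(sum > 0.0);              /* an all-zero block has no defined PAPR */
    return peak / (sum / (double)n);
}

/* Index of the candidate with the lowest PAPR.
 * cand holds m candidates of n samples each, stored one after another. */
size_t slm_select(const double complex *cand, size_t m, size_t n)
{
    assert(m > 0);
    size_t best = 0;
    double best_papr = papr(cand, n);
    for (size_t i = 1; i < m; i++) {
        double p = papr(cand + i * n, n);
        if (p < best_papr) {
            best_papr = p;
            best = i;
        }
    }
    return best;
}
```

With M candidates (the unrotated signal plus the M-1 rotated ones), the index returned by `slm_select` identifies which sequence is transmitted.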

First, we define the data sequence vector $X$. It is multiplied by a phase rotation vector $B_i$ ($i = 1, \ldots, M-1$) to generate $M-1$ different output signals $S_i$. We can write

$$S_i = R_i X$$

where $R_i$ is the phase rotation matrix generated by putting the phase rotation vector $B_i$ on its diagonal.

The idea of the conversion matrix method is that we do not need so many IFFTs, and hence such high complexity, if the multiplication by the phase rotation vector before the IFFT can be replaced by passing the IFFT output through a conversion matrix.

Figure 06 Equivalent SLM with conversion matrix diagram

We can generate an output signal $s_i$ from the original IFFT output signal $x$, because $x$ and $s_i$ are

$$x = \mathrm{IFFT}\{X\} = QX \qquad (4.1)$$

$$s_i = \mathrm{IFFT}\{S_i\} = QS_i = QR_iX \qquad (4.2)$$

where $Q$ is the IFFT matrix. From (4.1), we have $X = Q^{-1}x$, where $Q^{-1}$ is equivalent to the FFT matrix. Then we obtain

$$s_i = QR_iQ^{-1}x \qquad (4.3)$$

Therefore, the conversion matrix $T$ is represented as $T = QR_iQ^{-1}$.

In general, this conversion does not necessarily have lower complexity than the original SLM structure. But if we choose the phase rotation vectors with care, the conversion matrix takes a simple form that needs only additions and no multiplications. For example, for a 16-point FFT and a phase rotation vector of the form [ 1 , j , 1 , -j , 1 , j , 1 , -j , 1 , j , 1 , -j , … ], the first row of the conversion matrix T is, up to a constant scale factor,

1 0 0 0 1 0 0 0 1 0 0 0 -1 0 0 0

and each following row is a cyclic shift of the previous one. Each output sample therefore needs only three complex additions, so the conversion needs 3 * 16 = 48 complex additions and no multiplications.
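The 16-point example can be checked numerically. The sketch below is only an illustration under stated assumptions: the IFFT is modelled by a naive inverse DFT with a 1/N factor and a positive exponent, and all function names are ours. Under that convention the conversion kernel carries a factor of 1/2, which scales every sample equally and so does not change the PAPR. The code compares the reference path (rotate, then a second IFFT) with the addition-only path.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

#define SLM_N 16  /* FFT length of the example */

/* Naive inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+j*2*pi*k*n/N). */
void idft16(const double complex X[SLM_N], double complex x[SLM_N])
{
    assert(X != x);  /* not an in-place transform */
    /* Twiddle table w[m] = exp(j*2*pi*m/16), from cos(pi/8) + j*sin(pi/8). */
    const double complex w1 = 0.9238795325112867 + 0.3826834323650898 * I;
    double complex w[SLM_N];
    w[0] = 1.0;
    for (size_t m = 1; m < SLM_N; m++)
        w[m] = w[m - 1] * w1;

    for (size_t n = 0; n < SLM_N; n++) {
        double complex acc = 0.0;
        for (size_t k = 0; k < SLM_N; k++)
            acc += X[k] * w[(k * n) % SLM_N];
        x[n] = acc / SLM_N;
    }
}

/* Reference path: rotate by [1, j, 1, -j, ...], then run a second IFFT. */
void slm_candidate_ifft(const double complex X[SLM_N], double complex s[SLM_N])
{
    const double complex pattern[4] = {1.0, I, 1.0, -I};
    double complex S[SLM_N];
    for (size_t k = 0; k < SLM_N; k++)
        S[k] = pattern[k % 4] * X[k];
    idft16(S, s);
}

/* Conversion-matrix path: row n of T has +1, +1, +1, -1 (times 1/2)
 * in columns n, n+4, n+8, n+12 (mod 16), so only additions are needed. */
void slm_candidate_conv(const double complex x[SLM_N], double complex s[SLM_N])
{
    assert(x != s);
    for (size_t n = 0; n < SLM_N; n++)
        s[n] = 0.5 * (x[n] + x[(n + 4) % SLM_N]
                           + x[(n + 8) % SLM_N] - x[(n + 12) % SLM_N]);
}

/* Largest difference in any real or imaginary component. */
double max_component_diff(const double complex a[SLM_N],
                          const double complex b[SLM_N])
{
    double worst = 0.0;
    for (size_t n = 0; n < SLM_N; n++) {
        double dr = creal(a[n]) - creal(b[n]);
        double di = cimag(a[n]) - cimag(b[n]);
        if (dr < 0.0) dr = -dr;
        if (di < 0.0) di = -di;
        if (dr > worst) worst = dr;
        if (di > worst) worst = di;
    }
    return worst;
}
```

In a fixed-point implementation the factor 1/2 could be realised as a shift or dropped altogether, since only the relative peak matters for choosing the candidate.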

Chapter 2 The Brief of EWC PHY Specification For 802.11n

This document [14] specifies those features of a device that are necessary to achieve interoperability. It specifies the signals that may be transmitted by the device and received by the device's receivers. This device is referred to as an HT (High Throughput) device. The HT device is assumed to be compliant with 802.11a/b/g/j standards. This document describes the extensions needed in the physical layer for high throughput transmission.

2.1 PLCP Packet Format

Two new formats are defined for the PLCP (PHY Layer Convergence Protocol): Mixed Mode and Green Field. These two formats are called HT formats. Figure 07 shows the legacy format and the HT formats. In addition to the HT formats, there is a legacy duplicate format (specified in section 4.8 of [14]) that duplicates the 20 MHz legacy packet in the two 20 MHz halves of a 40 MHz channel.

Figure 07 PLCP packet format

The elements of the PLCP packet are:

L-STF: Legacy Short Training Field
L-LTF: Legacy Long Training Field
L-SIG: Legacy Signal Field
HT-SIG: High Throughput Signal Field
HT-STF: High Throughput Short Training Field
HT-LTF1: First High Throughput Long Training Field
HT-LTFs: Additional High Throughput Long Training Fields
Data: The data field includes the PSDU (PHY sub-layer Service Data Unit)

The HT-SIG, HT-STF and HT-LTFs exist only in HT packets. In the legacy and legacy duplicate formats only the L-STF, L-LTF, L-SIG and Data fields exist.

2.2 Operating Mode

The PHY will operate in one of three modes:

• Legacy Mode – in this mode packets are transmitted in the legacy 802.11a/g format.

• Mixed Mode – in this mode packets are transmitted with a preamble compatible with the legacy 802.11a/g: the legacy Short Training Field (STF), the legacy Long Training Field (LTF) and the legacy signal field are transmitted so that they can be decoded by legacy 802.11a/g devices. The rest of the packet has a new format. In this mode the receiver shall be able to decode both Mixed Mode packets and legacy packets.

• Green Field – in this mode high throughput packets are transmitted without a legacy compatible part. This mode is optional. In this mode the receiver shall be able to decode Green Field mode packets, Mixed Mode packets and legacy format packets.

The operation of the PHY in the frequency domain is divided into the following modes:

• LM – Legacy Mode – equivalent to 802.11a/g.

• HT-Mode – in HT mode the device operates in either 40 MHz bandwidth or 20 MHz bandwidth and with one to four spatial streams. This mode includes the HT-duplicate mode.

• Duplicate Legacy Mode – in this mode the device operates in a 40 MHz channel composed of two adjacent 20 MHz channels. The packets to be sent are in the legacy 11a format in each of the 20 MHz channels. To reduce the PAPR, the upper channel (higher frequency) is rotated by 90º relative to the lower channel.

• 40 MHz Upper Mode – used to transmit a legacy or HT packet in the upper 20 MHz channel of a 40 MHz channel.

• 40 MHz Lower Mode – used to transmit a legacy or HT packet in the lower 20 MHz channel of a 40 MHz channel.

LM is mandatory, and HT-Mode for 1 and 2 spatial streams is also mandatory.

2.3 Modulation and Coding Scheme (MCS)

The Modulation and Coding Scheme (MCS) is a value that determines the modulation, coding and number of spatial channels. It is a compact representation that is carried in the high throughput signal field.

Rate dependent parameters for the full set of modulation and coding schemes (MCS) are shown in Appendix A in Tables 2 to 5. These tables give rate dependent parameters for MCSs with indices 0 through 31 for 20 MHz.

Table 02 Rate dependent parameters for mandatory 20 MHz, Nss = 1 (NES = 1) modes

MCS index | Modulation | R | Nbpsc | Nsd | Nsp | Ncbps | Ndbps | Data rate (Mbps)

Table 03 Rate dependent parameters for mandatory 20 MHz, Nss = 2 (NES = 1) modes

MCS index | Modulation | R | Nbpsc | Nsd | Nsp | Ncbps | Ndbps | Data rate (Mbps)

Table 04 Rate dependent parameters for mandatory 20 MHz, Nss = 3 (NES = 2) modes

MCS index | Modulation | R | Nbpsc | Nsd | Nsp | Ncbps | Ndbps | Data rate (Mbps)
16 | BPSK | 1/2 | 1 | 52 | 4 | 156 | 78 | 19.5
17 | QPSK | 1/2 | 2 | 52 | 4 | 312 | 156 | 39.0

Table 05 Rate dependent parameters for mandatory 20 MHz, Nss = 4 (NES = 2) modes

MCS index | Modulation | R | Nbpsc | Nsd | Nsp | Ncbps | Ndbps | Data rate (Mbps)

Nss: Number of spatial streams
Nsd: Number of data subcarriers
Nsp: Number of pilot subcarriers
Nbpsc: Number of coded bits per subcarrier per spatial stream
Ncbps: Number of coded bits per OFDM symbol (total of all spatial streams)
Ndbps: Number of data bits per MIMO-OFDM symbol
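The relations between the table columns can be checked with a few lines of C. The sketch below assumes Ncbps = Nbpsc × Nsd × Nss and Ndbps = Ncbps × R, which agree with the MCS 16 and MCS 17 rows of Table 04, and takes the symbol duration as a parameter. The 4 µs value used in the note after the code (3.2 µs plus an assumed 0.8 µs guard interval) is our assumption, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Rate dependent parameters of one MCS (field names are illustrative). */
typedef struct {
    int nbpsc;     /* coded bits per subcarrier per spatial stream */
    int nsd;       /* data subcarriers */
    int nss;       /* spatial streams */
    int rate_num;  /* coding rate R = rate_num / rate_den */
    int rate_den;
} mcs_params;

/* Coded bits per OFDM symbol, summed over all spatial streams. */
int mcs_ncbps(const mcs_params *p)
{
    return p->nbpsc * p->nsd * p->nss;
}

/* Data bits per OFDM symbol after applying the coding rate. */
int mcs_ndbps(const mcs_params *p)
{
    assert(p->rate_den > 0);
    return mcs_ncbps(p) * p->rate_num / p->rate_den;
}

/* Data rate in Mbps for a symbol duration given in microseconds. */
double mcs_rate_mbps(const mcs_params *p, double symbol_us)
{
    assert(symbol_us > 0.0);
    return (double)mcs_ndbps(p) / symbol_us;
}
```

For MCS 16 (BPSK, R = 1/2, Nss = 3) this gives Ncbps = 156 and Ndbps = 78, and with a 4 µs symbol a data rate of 19.5 Mbps, matching the table row.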

2.4 Transmitter Block Diagram

Figure 08 Transmitter diagram

The transmitter is composed of the following blocks:

• Scrambler – scrambles the data to prevent long sequences of zeros or ones (see section 4.2 of [14]).

• Encoder Parser – de-multiplexes the scrambled bits among the NES FEC encoders in a round robin manner.

• FEC encoders – encode the data to enable error correction. An FEC encoder may include a binary convolutional encoder followed by a puncturing device, or an LDPC encoder.

• Stream Parser – divides the output of the encoders into blocks that will be sent to different interleaver and mapping devices. The sequences of bits sent to the interleavers are called spatial streams.

• Interleaver – interleaves the bits of each spatial stream (changes the order of bits) to prevent long sequences of noisy bits from entering the FEC decoder.

• QAM mapping – maps the sequence of bits in each spatial stream to constellation points (complex numbers).

• Spatial Mapping – maps spatial streams to different transmit chains. This may include one of the following:

° Direct mapping – each sequence of constellation points is sent to a different transmit chain.

° Spatial expansion – each vector of constellation points from all the sequences is multiplied by a matrix to produce the input to the transmit chains.

° Space Time Block coding – constellation points from one spatial stream are spread into two spatial streams using a space time block code.

° Beam Forming – similar to spatial expansion: each vector of constellation points from all the sequences is multiplied by a matrix of steering vectors to produce the input to the transmit chains.

• Inverse Fast Fourier Transform – converts a block of constellation points to a time domain block.

• Cyclic shift insertion – inserts the cyclic shift into the time domain block. When spatial expansion is applied and increases the number of transmit chains, the cyclic shift may be applied in the frequency domain as part of spatial expansion.

• Guard interval insertion – inserts the guard interval.

• Optional windowing – smooths the edges of each symbol to increase spectral decay.
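As an illustration of the first block in the list, the sketch below shows a bit-serial scrambler. It assumes the legacy 802.11a generator polynomial x^7 + x^4 + 1. That the HT specification reuses this polynomial is our assumption here and should be checked against [14]. The function name and the one-bit-per-byte data layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scramble (or descramble) n bits in place, one bit per byte (0 or 1).
 * state is the 7-bit shift register: bit 6 holds x^7 and bit 3 holds x^4.
 * Returns the register contents after the last bit. */
uint8_t scramble_bits(uint8_t *bits, size_t n, uint8_t state)
{
    state &= 0x7F;
    assert(state != 0);  /* an all-zero register would never change */
    for (size_t i = 0; i < n; i++) {
        uint8_t fb = (uint8_t)(((state >> 6) ^ (state >> 3)) & 1u);
        bits[i] = (uint8_t)((bits[i] ^ fb) & 1u);     /* XOR data with sequence */
        state = (uint8_t)(((state << 1) | fb) & 0x7F); /* shift feedback in */
    }
    return state;
}
```

Because the scrambling sequence depends only on the initial register state, running the same function again with the same initial state restores the original bits, so one routine can serve both transmitter and receiver.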

2.5 Timing Parameter

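[Table: timing parameters, listing each parameter with its value in legacy mode and in two further columns. Surviving entries: number of data subcarriers: 48, 52, 108; NSP: number of …; … period: 3.2 µs, 3.2 µs, 3.2 µs; TL-STF: Legacy …]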

Chapter 3 The Brief of Innovative Quixote DSP Board

3.1 About Quixote

Quixote is Innovative Integration’s Velocia-family baseboard. Velocia is an advanced architecture DSP baseboard that integrates a high performance Texas Instruments TMS320 C64xx DSP and Xilinx high density programmable logic with high performance peripherals such as PMC modules, analog IO and interconnectivity interfaces. The powerful combination of the DSP and FPGA provides signal processing speed and flexibility for almost any DSP application. Each baseboard features a PCI backbone connecting the DSP, PMC, peripherals and StarFabric interfaces. The StarFabric interface (PICMG 2.17) provides unlimited and extremely flexible interconnection to other DSP cards, IO cards and host processor systems. Each Velocia card incorporates a high performance IO system with either on-board peripherals like A/Ds and D/As, or one or more PMC sites accommodating a wide range of I/O options.

The Quixote DSP baseboard is intended for wireless, RADAR, ultrasound, high energy physics and other demanding applications requiring speed and processing power. Quixote features a powerful processing core built around the Texas Instruments TMS320C6416 and a Xilinx Virtex2 (2M or 8M gates) with 32MB of DSP RAM and 2MB of FPGA computation RAM (optional). Analog IO features include two channels of 14-bit, 105 Mbps analog-to-digital (A/D) conversion and two channels of 14-bit, 105 Mbps digital-to-analog (D/A) conversion, plus the 2M- or 8M-gate user-programmable FPGA. The card serves a variety of applications including RF processing, radio-communications, servo applications, data acquisition and many others. System expansion is over StarFabric in a PICMG 2.17 compatible compact PCI chassis.

3.2 Support libraries

In order to support the baseboard as a part of a complete system, a complete set of powerful software libraries is provided to program the DSP on the baseboard and also to allow the card to interact with a host program resident on the PC. The Pismo Class Library provides support for developing applications which run on the target baseboard. The Armada Library provides the library support for host application development.

The Pismo Class Library

Pismo provides extensive C++ class support for:

1. Dynamic creation and runtime control of tasks

2. Simplified management of and access to all TI Chip Support Library (CSL) and DSP/BIOS API functions including: Semaphores, Mutexes, Mailboxes, Timers, Edma, Qdma, Atoms, McBsp, Timebases, Counters, etc.

3. Data exchange using RTDX Streaming I/O

4. Foundation (base) classes for DMA-driven device driver development

5. Templatized queues

6. Partial standard-template library functionality via STL Port

The Armada Class Library

Armada is the Innovative Integration-authored component suite, which combines with the Borland BCB or Microsoft MSVC Integrated Development Environments (IDEs) to support programming of the Matador baseboard. Armada supports both high-speed data streaming plus asynchronous mailbox communications between the DSP and the Host PC, plus a wealth of Host functions to visualize and post-process data received from or to be sent to the target DSP.

The Armada suite shields the user from the nitty-gritty details of responding to asynchronous notifications of stream data and message reception, stream data requirements and message acknowledgements. Instead, a set of special C++ software class objects, called components, have been created to model each portion of the system. By employing software objects which model the true physical layout of the system, we can make a full-featured system more understandable.

The Caliente

Caliente is the internal, high performance data streaming support software within Armada. It is packaged both as an internal "component" for Borland VCL users and as a DLL for MSVC users. It handles bi-directional streaming of data between the host memory and the target DSP. The two streams are independent of each other, and may even be running at different rates.

When input streaming, the target DSP application and Host PC baseboard component must use identical channel configurations, in order for the channelization features of Caliente to function properly. The mechanism to create these configurations is described in a later chapter. When streaming starts, the baseboard collects data and busmasters it to the host memory. Caliente then moves the data into a set of internal buffers (called Pool Buffers), where the data is examined and split into individual data channels for use by the application. Caliente assumes that the data format that will be produced by the baseboard matches the configuration required by the pump components used within the Host application and can, therefore, properly separate the channels into independent streams and pump the data into the application for real-time processing.

Output streaming works similarly, but in the opposite direction. When data needs to be sent to a baseboard, Caliente requests sample data for the appropriate device channel from data pump components contained within Armada-based application software. When the data is received, the data is collated with data received from all other active channels, converted into peripheral-specific format and copied into bus-master memory. When the baseboard needs more data, it will automatically busmaster this data to its own onboard data storage, from which it sends the data to the appropriate output hardware.

Chapter 4 Developing A DSP Program

4.1 A recommended flow of developing a DSP program

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand coded DSP assembly language. This is both time consuming and error prone. This process tends to encounter difficulties that can arise from maintaining the code over several projects.

The recommended code development flow involves utilizing the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. These advantages allow the compiler to do all the laborious work of instruction selection, parallelizing, pipelining, and register allocation. This allows the programmer the ability to focus on getting the product to market quickly. These features simplify the maintenance of the code, as everything resides in a C framework that is simple to maintain, support, and upgrade.

It is recommended that we follow the code development flow below when writing and debugging our code.

Figure 09 A flow of developing a DSP program

4.2 Analyzing the C code performance

One of the preliminary measures of code is the time it takes the code to run. In large applications, it makes sense to optimize the most important sections of code first.

Use the clock( ) and printf( ) functions in C/C++ to time and display the performance of specific code regions. The following example demonstrates how to include the clock() function in your C code.

Figure 10 An example showing the execution time
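A minimal sketch of this kind of measurement is given here. The workload `sum_of_squares` is a hypothetical stand-in for the code region of interest, and `time_region` is our own helper name. The unit of the value returned by clock() depends on the platform, so the printed number should be read as platform ticks rather than seconds.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical workload standing in for the code region under test. */
long sum_of_squares(int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)i * i;
    return acc;
}

/* Run the region once, print the elapsed clock() ticks, and return them.
 * The result of the workload is stored through result so that the
 * compiler cannot discard the computation. */
clock_t time_region(int n, long *result)
{
    assert(result != NULL);
    clock_t start = clock();
    *result = sum_of_squares(n);
    clock_t stop = clock();
    printf("elapsed: %ld clock ticks\n", (long)(stop - start));
    return stop - start;
}
```

Timing the same region before and after each refinement step gives a simple way to see whether an optimization actually helped.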

4.3 Refine the C/C++ code

4.3.1 Using intrinsics to replace complicated C/C++ code

The C6000 compiler provides intrinsics, special functions that map directly to inline C62x/C64x/C67x instructions, to optimize your C/C++ code. All instructions that are not easily expressed in C/C++ code are supported as intrinsics. Intrinsics are specified with a leading underscore ( _ ) and are called like ordinary functions. The following table shows some intrinsics.

Table 06 C compiler intrinsic

C Compiler Intrinsic | Assembly Instruction | Description
int _abs(int src2); | ABS | Returns the saturated absolute value of src2.
int _add4(int src1, int src2); | ADD4 | Performs 2s-complement addition on pairs of packed 8-bit numbers.
unsigned _bitr(unsigned src); | BITR | Reverses the order of the bits.
int _dotpn2(int src1, int src2); | DOTPN2 | The product of the signed lower 16-bit values of src1 and src2 is subtracted from the product of the signed upper 16-bit values of src1 and src2.
double _mpy2(int src1, int src2); | MPY2 | Returns the products of the lower and higher 16-bit values in src1 and src2.
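To make the table descriptions concrete, the sketch below gives portable C models of three of the intrinsics that can be compiled on a host PC. These are our own reference models written from the descriptions in Table 06, not TI's implementations. The `_ref` names and the fixed-width types are assumptions, and on the target the intrinsic itself would be called instead.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of _abs: absolute value, saturating the most negative input. */
int32_t abs_sat_ref(int32_t v)
{
    if (v == INT32_MIN)
        return INT32_MAX;
    return v < 0 ? -v : v;
}

/* Model of _bitr: reverse the order of the 32 bits. */
uint32_t bitr_ref(uint32_t src)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (src & 1u);
        src >>= 1;
    }
    return r;
}

/* Model of _dotpn2: upper-half product minus lower-half product. */
int32_t dotpn2_ref(uint32_t src1, uint32_t src2)
{
    int32_t hi = sext16(src1 >> 16) * sext16(src2 >> 16);
    int32_t lo = sext16(src1) * sext16(src2);
    return hi - lo;
}

/* Quick host-side self-check of the models above. */
void intrinsics_ref_selfcheck(void)
{
    assert(abs_sat_ref(INT32_MIN) == INT32_MAX);
    assert(bitr_ref(0x00000001u) == 0x80000000u);
    assert(dotpn2_ref(0x00030002u, 0x00050004u) == 7);  /* 3*5 - 2*4 */
}
```

Models like these are useful for validating the C version of an algorithm on the PC before the intrinsic-based version is run on the DSP.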

4.3.2 Loop Unrolling

Another technique that improves performance is unrolling the loop; that is, expanding small loops so that each iteration of the loop appears in your code. This optimization increases the number of instructions available to execute in parallel.

Figure 11 An example showing loop unrolling
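A minimal sketch of the idea, using a 16-bit dot product as a hypothetical kernel: the second function unrolls the loop by four and keeps four independent partial sums, so that the multiplies of one iteration do not depend on each other. The function names and the unroll factor are our own choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one multiply-accumulate per iteration. */
int dot_rolled(const short *a, const short *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Loop unrolled by four; n must be a multiple of 4.
 * The four partial sums are independent of each other, which gives
 * the compiler more instructions to schedule in parallel. */
int dot_unrolled4(const short *a, const short *b, size_t n)
{
    assert(n % 4 == 0);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result. The unrolled version trades code size for fewer loop-control instructions and more exposed parallelism.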

4.3.3 Word access to the packed data

If we want to add vectors of 16-bit data, we can pack two 16-bit values into one 32-bit word and then perform a 32-bit addition with no carry across bit 16, which the TMS320C64xx supports. Using such word accesses to operate on 16-bit data stored in the high and low halves of a 32-bit register saves time.
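The packed addition can be modelled in portable C as follows. This is a host-side sketch with hypothetical names (`add2_ref`, `vec_add16_packed`). On the DSP, the inner operation would map to a single packed-add instruction (for example through the C6000 `_add2` intrinsic) rather than to the masking shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two packed pairs of 16-bit values held in 32-bit words.
 * A carry out of the low half is discarded instead of spilling
 * into the high half. */
uint32_t add2_ref(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return hi | lo;
}

/* Add n 16-bit elements, two per 32-bit word; n must be even.
 * Each loop iteration loads, adds and stores two samples at once. */
void vec_add16_packed(const uint32_t *a, const uint32_t *b,
                      uint32_t *sum, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; i++)
        sum[i] = add2_ref(a[i], b[i]);
}
```

Compared with a loop over individual 16-bit elements, the packed loop runs half as many iterations and halves the number of memory accesses.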

Figure 12 An example showing word access to packed data

4.3.4 Using compiler options

Compiler options control the operation of the compiler. They steer the translation of C code into assembly toward debug capability, execution time or code size.

-pm

Combines source files to perform program-level optimization by allowing visibility to the entire application.

-o#

Optimizes register usage, locally or globally, file or program level.

-ms#

Optimizes primarily for code size, and secondarily for performance. Code size is optimized at three levels (-ms0, -ms1, -ms2).

4.4 Write linear assembly code

If a function's performance still does not meet the requirement after refining the C code with the methods above, we can write assembly code ourselves. The TMS320C6x provides a linear assembly language for this purpose. Unlike ordinary assembly language, linear assembly does not require the programmer to specify which registers each instruction uses. Because registers are not assigned by hand, the parallel scheduling of instructions is likewise not specified by the programmer but is done by the linear assembly optimizer.

A linear assembly file has the filename extension .sa. Several points should be noted:

1. A program label must start at the first character of a line.

2. An instruction cannot start at the first character of a line; it must follow at least one space.

3. Some mnemonics are machine instructions and others are optimizer directives that can help your program.

Table 07 Some common linear assembler directives

Directive | Description | Restrictions
.call | Calls a function | Valid only within procedures
.cproc | Start a C/C++ callable procedure | Must use with .endproc
.endproc | End a C/C++ callable procedure | Must use with .cproc
.mptr | Avoid memory bank conflicts | Valid only within procedures; can use variables in the register parameter
.reg | Declare variables | Valid only within procedures
.reserve | Reserve register use | Valid only within procedures
.return | Return value to procedure | Valid only within .cproc procedures
.trip | Specify trip count value | Valid only within procedures

Here is a linear assembly function that can be called from a C function.

Figure 13 An example showing a C-callable function written in linear assembly

Chapter 5 DSP Board Implementation Result and Discussion

5.1 Choose M = 4 in SLM structure

802.11n is designed for wireless communication; it adopts a MIMO-OFDM structure with an FFT length of 64. Because 64 is not very long, choosing M = 4 in the SLM structure is enough to deal with its PAPR problem. Figure 14 shows that the PAPR of the original OFDM structure exceeds 9 dB with a probability of about 0.05, whereas with SLM and M = 4 the probability decreases to about 0.0001.

Figure 14 CCDF of PAPR in 64-pt FFT length SLM with different M

5.2 The use of the conversion matrix

Because we choose M = 4, we need to select three additional independent phase rotation vectors. As described in Section 1.2, using the conversion matrix can greatly lower the complexity of the original SLM structure. We select three phase rotation vectors of the form [1, j , 1, j ] , [1, j , 1, -j ] and [1, j , -1, j ]. Figure 13 shows that this conversion causes no performance degradation compared with the IFFT bank. Table 8 shows the comparison of computational complexity between the IFFT bank and the conversion matrix.
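Why these vectors lead to a cheap conversion can be seen from a short calculation. A phase rotation vector that repeats with period 4 gives a conversion matrix with at most four nonzero taps per row, spaced N/4 samples apart. The sketch below computes those taps and applies them. It assumes an IFFT with a 1/N factor and a positive exponent, and the function names are hypothetical.

```c
#include <assert.h>
#include <complex.h>

/* Taps of the conversion kernel for a phase pattern B of period 4:
 * taps[p] = (1/4) * sum_q B[q] * j^(q*p), which weights the IFFT
 * output sample delayed by p*N/4 positions (cyclically). */
void period4_taps(const double complex B[4], double complex taps[4])
{
    const double complex j_pow[4] = {1.0, I, -1.0, -I};
    for (int p = 0; p < 4; p++) {
        double complex acc = 0.0;
        for (int q = 0; q < 4; q++)
            acc += B[q] * j_pow[(q * p) % 4];
        taps[p] = acc / 4.0;
    }
}

/* Apply the taps: s[n] = sum_p taps[p] * x[(n - p*N/4) mod N]. */
void apply_taps(const double complex taps[4], const double complex *x,
                double complex *s, int n_fft)
{
    assert(n_fft > 0 && n_fft % 4 == 0);
    int step = n_fft / 4;
    for (int n = 0; n < n_fft; n++) {
        double complex acc = 0.0;
        for (int p = 0; p < 4; p++)
            acc += taps[p] * x[(n - p * step + n_fft) % n_fft];
        s[n] = acc;
    }
}
```

Under these assumptions the three vectors above give the taps (0.5+0.5j, 0, 0.5-0.5j, 0), (0.5, -0.5, 0.5, 0.5) and (0.5j, 0.5, -0.5j, 0.5). Every tap is either zero or a combination of ±1 and ±j with a common factor of 1/2, so applying it requires only sign changes, swaps of real and imaginary parts, and additions.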

[Figure: probability that the PAPR exceeds the threshold for SLM with 64-pt FFT, L=4. Horizontal axis: threshold (dB); vertical axis: P(PAPR > threshold).]
