
Chapter 3 A Brief Introduction to the Innovative Quixote DSP Board

3.2 Support libraries

To support the baseboard as part of a complete system, a set of software libraries is provided both to program the DSP on the baseboard and to let the card interact with a host program running on the PC. The Pismo Class Library supports the development of applications that run on the target baseboard. The Armada Library supports host application development.

The Pismo Class Library

Pismo provides extensive C++ class support for:

1. Dynamic creation and runtime control of tasks

2. Simplified management of and access to all TI Chip Support Library (CSL) and DSP/BIOS API functions including: Semaphores, Mutexes, Mailboxes, Timers, Edma, Qdma, Atoms, McBsp, Timebases, Counters, etc.

3. Data exchange using RTDX Streaming I/O

4. Foundation (base) classes for DMA-driven device driver development

5. Templatized queues

6. Partial Standard Template Library functionality via STLport

The Armada Class Library

Armada is a component suite written by Innovative Integration. It works with the Borland BCB or Microsoft MSVC Integrated Development Environments (IDEs) to support programming of the Matador baseboard. Armada supports both high-speed data streaming and asynchronous mailbox communication between the DSP and the host PC, and provides many host functions to visualize and post-process data received from, or to be sent to, the target DSP.

The Armada suite shields the user from the nitty-gritty details of responding to asynchronous notifications of stream data and message reception, stream data requirements and message acknowledgements. Instead, a set of special C++ software class objects, called components, have been created to model each portion of the system. By employing software objects which model the true physical layout of the system, we can make a full-featured system more understandable.

The Caliente

Caliente is the internal, high-performance data streaming support software within Armada. It is packaged both as an internal "component" for Borland VCL users and as a DLL for MSVC users. It handles bi-directional streaming of data between the host memory and the target DSP. The two streams are independent of each other, and may even be running at different rates.

When input streaming, the target DSP application and Host PC baseboard component must use identical channel configurations, in order for the channelization features of Caliente to function properly. The mechanism to create these configurations is described in a later chapter. When streaming starts, the baseboard collects data and busmasters it to the host memory. Caliente then moves the data into a set of internal buffers (called Pool Buffers), where the data is examined and split into individual data channels for use by the application. Caliente assumes that the data format that will be produced by the baseboard matches the configuration required by the pump components used within the Host application and can, therefore, properly separate the channels into independent streams and pump the data into the application for real-time processing.

Output streaming works similarly, but in the opposite direction. When data needs to be sent to a baseboard, Caliente requests sample data for the appropriate device channel from data pump components contained within Armada-based application software. When the data is received, the data is collated with data received from all other active channels, converted into peripheral-specific format and copied into bus-master memory. When the baseboard needs more data, it will automatically busmaster this data to its own onboard data storage, from which it sends the data to the appropriate output hardware.

Chapter 4 Developing A DSP Program

4.1 A recommended flow of developing a DSP program

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand-coded DSP assembly language. This is both time consuming and error prone, and the resulting code is difficult to maintain across several projects.

The recommended development flow uses the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. The compiler does the laborious work of instruction selection, parallelizing, pipelining and register allocation, which lets the programmer focus on getting the product to market quickly. It also simplifies maintenance, because everything resides in a C framework that is simple to maintain, support and upgrade.

It is recommended to follow the code development flow below when writing and debugging code.

Figure 09 A flow of developing a DSP program

4.2 Analyzing the C code performance

One of the preliminary measures of code is the time it takes the code to run. In large applications, it makes sense to optimize the most important sections of code first.

Use the clock( ) and printf( ) functions in C/C++ to time and display the performance of specific code regions. The following example demonstrates how to include the clock() function in your C code.

Figure 10 An example showing the execution time
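The timing approach described above can be sketched as follows. This is a minimal illustration, not the code of the original figure: the workload `sum_of_squares` and the helper names `time_region` and `report_timing` are hypothetical, and the units of the returned tick count depend on the platform's `clock()` implementation.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical workload standing in for the code region to be measured. */
long sum_of_squares(int n)
{
    long sum = 0;
    int i;
    for (i = 1; i <= n; i++)
        sum += (long)i * i;
    return sum;
}

/* Times one call of fn(n) with clock() and returns the elapsed ticks,
   reduced by the measured cost of calling clock() itself. */
clock_t time_region(long (*fn)(int), int n, long *result)
{
    clock_t start, stop, overhead, elapsed;

    /* Calibrate: two back-to-back calls give the clock() overhead. */
    start = clock();
    stop = clock();
    overhead = stop - start;

    /* Measure the region of interest. */
    start = clock();
    *result = fn(n);
    stop = clock();

    elapsed = stop - start;
    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}

/* Displays the measurement with printf(), as suggested in the text. */
void report_timing(int n)
{
    long result;
    clock_t ticks = time_region(sum_of_squares, n, &result);
    printf("sum = %ld, elapsed = %ld clock ticks\n", result, (long)ticks);
}
```

Calling `report_timing(1000)` before and after an optimization step gives a rough comparison of the two versions of the region.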

4.3 Refine the C/C++ code

4.3.1 Using intrinsics to replace complicated C/C++ code

The C6000 compiler provides intrinsics, special functions that map directly to inline C62x/C64x/C67x instructions, to optimize your C/C++ code. All instructions that are not easily expressed in C/C++ code are supported as intrinsics. Intrinsics are specified with a leading underscore ( _ ) and are accessed by calling them as you call a function. The following table shows some intrinsics.

Table 06 C compiler intrinsics

| C Compiler Intrinsic | Assembly Instruction | Description |
|---|---|---|
| int _abs(int src2); | ABS | Returns the saturated absolute value of src2. |
| int _add4(int src1, int src2); | ADD4 | Performs 2s-complement addition to pairs of packed 8-bit numbers. |
| unsigned _bitr(unsigned src); | BITR | Reverses the order of the bits. |
| int _dotpn2(int src1, int src2); | DOTPN2 | The product of the signed lower 16-bit values of src1 and src2 is subtracted from the product of the signed upper 16-bit values of src1 and src2. |
| double _mpy2(int src1, int src2); | MPY2 | Returns the products of the lower and higher 16-bit values in src1 and src2. |

4.3.2 Loop Unrolling

Another technique that improves performance is unrolling the loop; that is, expanding small loops so that each iteration of the loop appears in your code. This optimization increases the number of instructions available to execute in parallel.

Figure 11 An example showing loop unrolling
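As an illustration of the technique (not the code of the original figure), the sketch below shows a 16-bit dot product written as a plain loop and as a loop unrolled by four. The function names are hypothetical, and the unroll factor of four is an arbitrary choice.

```c
/* Straightforward loop: one multiply-accumulate per iteration. */
int dot_rolled(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by four: each iteration exposes four independent
   multiply-accumulates that the compiler can schedule in parallel. */
int dot_unrolled4(const short *a, const short *b, int n)
{
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;

    for (i = 0; i + 3 < n; i += 4) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }

    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; i++)
        sum0 += a[i] * b[i];

    return sum0 + sum1 + sum2 + sum3;
}
```

Using separate partial sums removes the dependency between consecutive accumulations, which is what makes the extra instructions available for parallel execution.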

4.3.3 Word access to the packed data

To add vectors of 16-bit data, we can pack two 16-bit values into one 32-bit word and then perform a 32-bit addition with no carry at bit 16, which the TMS320C64xx supports. By using such word accesses to operate on 16-bit data stored in the high and low halves of a 32-bit register, we save execution time.
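The idea can be illustrated with a portable C sketch. On the C6000 the packed addition is available as the `_add2` intrinsic; the functions below (`pack16` and `add2_emulated`, both hypothetical names) only emulate its behavior so that the example compiles on any host.

```c
#include <stdint.h>

/* Packs two 16-bit values into the high and low halves of a 32-bit word. */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Adds the two 16-bit halves of a and b independently.
   The carry out of bit 15 is discarded instead of spilling into bit 16. */
uint32_t add2_emulated(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return hi | lo;
}
```

For example, adding the packed pairs (1, 0xFFFF) and (2, 1) gives (3, 0): the low half wraps around without disturbing the high half.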

Figure 12 An example showing word access to packed data

4.3.4 Using compiler options

Compiler options control the operation of the compiler. They determine whether the C code is translated to assembly with an emphasis on debugging capability, execution time or code size.

-pm

Combines source files to perform program-level optimization by allowing visibility to the entire application.

-o#

Optimizes register usage, locally or globally, file or program level.

-ms#

Optimizes primarily for code size, and secondarily for performance. The code size emphasis is selected by level (-ms0, -ms1, -ms2).

4.4 Write linear assembly code

If the performance of a function still does not meet the requirement after the C code has been refined with the methods above, we can write the assembly code ourselves. The TMS320C6x provides a linear assembly language for this purpose. Unlike ordinary assembly language, linear assembly does not require the programmer to specify which registers an instruction uses. Because registers are not assigned by the programmer, the parallel execution of instructions is likewise not specified by the programmer but is scheduled by the assembly optimizer.

A linear assembly file has the file extension .sa. Note the following points:

1. A program label must start at the first character of a line.

2. An instruction cannot start at the first character of a line; it must follow a space.

3. Some mnemonics, which are machine instructions or optimizer directives, can help your program.

Table 07 Some common linear assembler directives

| Directive | Description | Restrictions |
|---|---|---|
| .call | Calls a function | Valid only within procedures |
| .cproc | Start a C/C++ callable procedure | Must use with .endproc |
| .endproc | End a C/C++ callable procedure | Must use with .cproc |
| .mptr | Avoid memory bank conflicts | Valid only within procedures; can use variables in the register parameter |
| .reg | Declare variables | Valid only within procedures |
| .reserve | Reserve register use | Valid only within procedures |
| .return | Return value to procedure | Valid only within .cproc procedures |
| .trip | Specify trip count value | Valid only within procedures |

Here is a linear assembly function that can be called from a C function.

Figure 13 An example of a C-callable function written in linear assembly

Chapter 5 DSP Board Implementation Results and Discussion

5.1 Choosing M = 4 in the SLM structure

802.11n is a wireless communication standard that adopts a MIMO-OFDM structure with an FFT length of 64. Because 64 is a relatively short FFT length, choosing M = 4 in the SLM structure is enough to deal with its PAPR problem. Figure 14 shows that the PAPR of the original OFDM structure exceeds 9 dB with a probability of about 0.05, whereas with SLM and M = 4 the probability decreases to about 0.0001.

Figure 14 CCDF of PAPR in 64-pt FFT length SLM with different M
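For reference, the selection step of SLM can be sketched in C as follows. This is only an illustration under stated assumptions: the M candidate time-domain symbols are assumed to be already computed and stored row by row, the PAPR is returned as a linear ratio (peak power over mean power) rather than in dB, and the names `papr_linear` and `slm_select` are hypothetical.

```c
/* PAPR of one block of n complex samples: peak power / mean power.
   Returns 0.0 for an empty or all-zero block. */
double papr_linear(const double *re, const double *im, int n)
{
    double peak = 0.0, sum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        double p = re[i] * re[i] + im[i] * im[i];
        if (p > peak)
            peak = p;
        sum += p;
    }
    if (n <= 0 || sum == 0.0)
        return 0.0;
    return peak / (sum / n);
}

/* Among m candidate blocks of n samples each (stored back to back),
   returns the index of the candidate with the lowest PAPR. */
int slm_select(const double *re, const double *im, int m, int n)
{
    int best = 0, k;
    double best_papr = papr_linear(re, im, n);

    for (k = 1; k < m; k++) {
        double p = papr_linear(re + k * n, im + k * n, n);
        if (p < best_papr) {
            best_papr = p;
            best = k;
        }
    }
    return best;
}
```

The returned index identifies the phase rotation vector that was used, which is the side information the receiver needs.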

5.2 The use of the conversion matrix

Because we choose M = 4, we need to select three additional independent phase rotation vectors. As described in Section 1.2, using the conversion matrix greatly lowers the complexity of the original SLM structure. We select three phase rotation vectors of the form [1, j, 1, j], [1, j, 1, -j] and [1, j, -1, j]. Figure 15 shows that this conversion causes no performance degradation compared with the IFFT bank. Table 8 compares the computational complexity of the IFFT bank and the conversion matrix.

[Figure: probability that the PAPR exceeds the threshold for SLM with 64-pt FFT, L = 4. x-axis: threshold (dB); y-axis: P(PAPR > threshold). Curves: original; SLM with M = 4 (using IFFT bank); SLM with M = 4 (using conversion matrix).]

Figure 15 The CCDF of PAPR comparison between IFFT bank and conversion matrix

Table 08 Computational complexity of the IFFT bank and the conversion matrix

5.3 Inserting the side information

Up to this point we have not considered side information. Inserting side information usually means that the bit rate decreases; this is the price of choosing SLM to lower the PAPR.

Several tones should be reserved to transmit the side information to the receiver.

When M = 4, log₂M = 2 bits are needed to indicate which phase rotation vector has been used.

To prevent the side information from being corrupted by noise, which would make the whole symbol wrong, we apply channel coding to the side information bits and use low-order modulation on the side information tones.

Figure 16 shows the simulated PAPR of the 64-pt FFT OFDM structure, of the SLM structure in which the side information tones are reserved but carry no side information, and of the SLM structure that transmits side information at different power levels on those tones. BPSK modulation is used to resist noise. The side information is channel coded with a (5, 2) shortened Hamming code, which can correct one error or detect two errors.
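A (5, 2) code of this kind can be sketched in C as follows. The generator rows used here are an assumption chosen only to give minimum distance 3; they are not necessarily the ones used in the simulation, and the function names are hypothetical. Decoding picks the nearest codeword, which corrects any single bit error.

```c
/* Assumed generator rows of a (5, 2) code with minimum distance 3. */
#define G_ROW_MSB 0x16u /* 10110, selected by message bit 1 */
#define G_ROW_LSB 0x0Bu /* 01011, selected by message bit 0 */

/* Encodes a 2-bit message (0..3) into a 5-bit codeword. */
unsigned hamming52_encode(unsigned msg)
{
    unsigned cw = 0;
    if (msg & 2u)
        cw ^= G_ROW_MSB;
    if (msg & 1u)
        cw ^= G_ROW_LSB;
    return cw;
}

/* Counts the set bits in the low five bits of x. */
int weight5(unsigned x)
{
    int count = 0;
    x &= 0x1Fu;
    while (x) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}

/* Decodes a received 5-bit word to the message whose codeword is
   closest in Hamming distance. */
unsigned hamming52_decode(unsigned rx)
{
    unsigned msg, best = 0;
    int best_dist = 6;

    for (msg = 0; msg < 4; msg++) {
        int dist = weight5(rx ^ hamming52_encode(msg));
        if (dist < best_dist) {
            best_dist = dist;
            best = msg;
        }
    }
    return best;
}
```

With M = 4, the 2-bit message is the index of the selected phase rotation vector, and the five coded bits would be sent as BPSK symbols on the reserved tones.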

Figure 16 SLM structure with transmitting the side information

5.3 Fixed point format on the DSP board

Most specifications place restrictions on modulation accuracy, and the fixed-point format should be chosen to meet them. Because we did not find a modulation accuracy requirement in the 802.11n specification, we refer to the error vector magnitude (EVM) requirement of the 802.16 specification. Table 09 shows that 8-bit quantization does not meet the requirement, so we choose a 16-bit fixed-point format to represent fractional numbers.
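A common way to represent fractions in 16 bits is the Q15 format, sketched below. Q15 is an assumption here, since the text only states that a 16-bit fixed-point format is used, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Converts a value in [-1, 1) to Q15 with rounding and saturation. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)
        return 32767;
    if (scaled <= -32768.0f)
        return -32768;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Converts a Q15 value back to floating point. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Multiplies two Q15 values with rounding; saturates the one
   overflow case (-1 * -1). Assumes an arithmetic right shift for
   negative values, as provided by common compilers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p = (p + (1 << 14)) >> 15;
    if (p > 32767)
        p = 32767;
    return (int16_t)p;
}
```

The quantization step of this format is 2^-15, compared with 2^-7 for the equivalent 8-bit format, which is consistent with the much lower EVM reported for 16 bits.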

Table 09 Simulated EVM results

| | At the output of QAM mapping | At the input of IFFT | At the output of IFFT |
|---|---|---|---|
| EVM (%) when using 8 bits to quantize | 6.6 | 6.6 | 6.98 |
| EVM (%) when using 16 bits to quantize | 1.462 × 10^-3 | 1.462 × 10^-3 | 5.854 × 10^-2 |

5.3 Code performance on the DSP board

Table 10 compares the bit rate and execution time on the PC and on the DSP. Figure 17 shows the percentage of the execution time taken by each block in the simulation.

Table 10 Execution time and bit rate

Bit rate(bit/s) 89435.43 105346.27 310077.59

execution time

Figure 17 Percentage of the execution time taken by each block

5.4 Using the digital I/O

Quixote provides 40 bits of bidirectional digital I/O. The digital I/O port allows the baseboard to exchange digital handshaking and information signals with other hardware and to control and signal other devices, and it may also be used for software troubleshooting. The user DIO (UD) port has separate control and data registers that allow byte-wide control of the direction. The UD digital I/O port is on connector JP5 (an MDR50 connector). See the appendix for the connector pinouts.

1. Settings: make sure that the correct DSP/BIOS configuration is included and that it contains the DIO (choose Quixote on the board category page, or copy the settings of an existing project directly).

2. Make sure to include UserData.h header.

The digital IO class

Example:

5.5 Estimating the speed-up obtained by using the FPGA

To speed up the program, analyze the application to identify which operations are high speed (above 1 MHz) and which are lower speed. Higher speed signal processing operations should be targeted at the FPGA, provided that they are of manageable complexity. Typical FPGA operations include FIR filters, down conversion, specialized high-speed triggering and data sampling, and FFTs. All of these are deterministic mathematical functions that are suitable for the FPGA. Data formatting, protocols and control functions are typically more easily implemented on the DSP.

The Quixote has two FPGAs: a Xilinx Spartan2 (200K gates) and a Xilinx Virtex2 (2M or 6M gates). The Virtex2 is used for the analog interfacing and as the computational logic on the board. The major elements inside the Spartan2 logic are the PCI interface, interrupt control, timebase selection and message mailboxes.

Quixote also provides several clocks. To develop custom logic, we use refclk, which has a frequency of about 20 MHz.

Block FFT:

An N-point FFT structure has log₂(N) stages. If each FFT stage can be completed within one period of the 20 MHz clock, the maximum data rate is 20M × N / log₂(N). For the 802.11n system (64-pt FFT), the maximum data rate is 20M × 64 / log₂(64) = 213.33 Mbit/s.
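The arithmetic of this estimate can be checked with a short C sketch. The function names are hypothetical, and the calculation simply restates the formula clock × N / log₂(N) under the same assumption that one FFT stage completes per clock period.

```c
/* Integer base-2 logarithm of n (n is assumed to be a power of two). */
int ilog2(unsigned n)
{
    int stages = 0;
    while (n > 1) {
        n >>= 1;
        stages++;
    }
    return stages;
}

/* Estimated maximum data rate of an n-point FFT when each of the
   log2(n) stages takes one clock period. Returns 0.0 for n < 2. */
double fft_max_rate(double clock_hz, unsigned n)
{
    int stages = ilog2(n);
    if (stages == 0)
        return 0.0;
    return clock_hz * (double)n / (double)stages;
}
```

With `clock_hz = 20e6` and `n = 64` the function returns 20e6 × 64 / 6, which is about 213.33 × 10^6, matching the figure quoted above.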

Block convolutional encoder:

The convolutional encoder in the 802.11n specification contains six registers. As with the FFT block, the maximum data rate depends on whether all of the logic fits within one period of the 20 MHz clock. If it does, the encoder accepts about 20 Mbit/s of input data, which for the rate-1/2 mother code corresponds to the 40 Mbit/s of coded output listed in Table 11.
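The structure of such an encoder is small enough to sketch in C. The generator polynomials 133 and 171 (octal) used below are the ones commonly quoted for the 802.11 rate-1/2, constraint length 7 code; they are not taken from this document, and the function names are hypothetical.

```c
/* Generator polynomials of the rate-1/2, constraint length 7 code. */
#define CONV_G0 0133u
#define CONV_G1 0171u

/* Parity (XOR of all bits) of the low seven bits of x. */
unsigned parity7(unsigned x)
{
    x &= 0x7Fu;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Encodes one input bit. *state holds the six previous input bits
   (most recent in bit 5). Returns the two output bits as (A << 1) | B. */
unsigned conv_encode_bit(unsigned *state, unsigned bit)
{
    unsigned reg = ((bit & 1u) << 6) | (*state & 0x3Fu);
    unsigned a = parity7(reg & CONV_G0);
    unsigned b = parity7(reg & CONV_G1);

    *state = reg >> 1; /* shift the new bit into the six registers */
    return (a << 1) | b;
}
```

Each call consumes one input bit and produces two output bits, which is why one input bit per 20 MHz clock period gives 40 Mbit/s at the encoder output.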

Table 11 Speed-up estimate using the FPGA

| Function block | Estimated maximum data rate by FPGA | Data rate by DSP |
|---|---|---|
| IFFT | 213.33 Mbit/s | 3.508 Mbit/s |
| Convolutional encoder | 40 Mbit/s | 0.55026 Mbit/s |

[Figure: the estimated data rate using FPGA. x-axis: function block (FFT, encoder); y-axis: data rate (Mbit/s), 0 to 250. Series: DSP, FPGA.]

Figure 18 estimated data rate using FPGA

Chapter 6 Conclusion

SLM using the conversion matrix greatly lowers the complexity compared with the original IFFT bank. Because MIMO-OFDM uses multiple transmit antennas, the hardware that deals with the PAPR on every transmit antenna needs to have low complexity. In addition, the 802.11n specification uses a 64-point FFT, which is not long enough to cause a very serious PAPR problem. We therefore use the SLM structure with the conversion matrix, rather than the PTS structure, to reduce the PAPR in the 802.11n system.

SLM performs reasonably well at a much lower complexity than other PAPR reduction methods. The side information is inserted in a low-complexity way, and we add a power level parameter to trade off the PAPR reduction capability against the power spent on the side information. This thesis also summarizes how to use the DSP board to develop an efficient program.

