DSP Software Implementation and Integration of IEEE 802.16a TDD OFDMA Downlink Transceiver System


National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

DSP Software Implementation and Integration of IEEE 802.16a TDD OFDMA Downlink Transceiver System

Student: Yu-Sheng Chen
Advisor: Dr. David W. Lin
June 2005

DSP Software Implementation and Integration of IEEE 802.16a TDD OFDMA Downlink Transceiver System

Student: Yu-Sheng Chen
Advisor: Dr. David W. Lin

A Thesis Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering

June 2005
Hsinchu, Taiwan, Republic of China

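DSP Software Implementation and Integration of IEEE 802.16a TDD OFDMA Downlink Transceiver System

Student: Yu-Sheng Chen
Advisor: Dr. David W. Lin
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract (Chinese)

In this thesis we present an IEEE 802.16a TDD OFDMA downlink transceiver system. The system comprises the transmitter, synchronizer, channel estimator, and other receiver functions implemented on a digital signal processor, together with a channel simulator implemented on the host PC that models channel effects such as multipath fading, additive white Gaussian noise, and frequency offset. The downlink synchronization techniques cover the estimation of the symbol start time, the frequency offset, and the data frame. We use a digital signal processor made by Texas Instruments (TI). The processor is hosted on a cPCI card named Quixote, made by Innovative Integration. The programs are mainly written in 16-bit fixed-point format. We improve the execution performance by changing the coding style and by using instructions native to the C6416, and we compare and analyze the resulting performance against the real-time requirement. In addition, we built a graphical interface on the host PC for monitoring the synchronizer and the channel estimator on screen. We find that, for the whole system to meet the real-time requirement, the functions need to be partitioned across multiple digital signal processors.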

DSP Software Implementation and Integration of IEEE 802.16a TDD OFDMA Downlink Transceiver System

Student: Yu-Sheng Chen
Advisor: Dr. David W. Lin
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract

This thesis presents an implementation of an IEEE 802.16a TDD OFDMA DL transceiver system. It includes the implementation of the transmitter, synchronizer, channel estimator, and other receiver functions on the DSP baseboard, and of a channel simulator on the host PC that simulates multipath fading, AWGN, and frequency offset. The DL synchronization includes the estimation of symbol timing, frequency offset, and frame lock status. The implementation employs Texas Instruments' TMS320C6416 DSP chip housed on Innovative Integration's Quixote cPCI card. The programs are mainly implemented in 16-bit fixed-point data format. The performance of the programs is analyzed and improved by changing the coding style and applying the intrinsic functions of the C6416 DSP. The execution performance is compared with the real-time requirement. In addition, we implement a host graphical interface that can monitor the synchronization and channel estimation results on the screen. We find that we may need to distribute the functions over multiple DSPs for the overall system to achieve real-time operation.

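Acknowledgments

I sincerely thank my advisor, Dr. David W. Lin, for his guidance over the past two years and more. His guidance went beyond academic knowledge: what I gained from him in research methods and in attitude toward learning is beyond measure. In the field of communications, what Dr. Lin gave me was only a beginning, showing me that there are many more directions worth studying. I feel very honored to have been his student, and I thank him wholeheartedly for his guidance.

I also thank our laboratory, which is like one big family. Its rich resources gave us the best possible learning environment. I thank the Ph.D. students 崑健 and 俊榮 for their many suggestions and their help along the way, and my classmates 景中, 汝芩, 志凱, and 鎮宇 for our mutual encouragement and help. This thesis exists because everyone worked hard together.

Finally, I thank my beloved family. Your long-standing support has been the greatest driving force behind my learning and growth, and your company and help freed me from worry throughout my studies.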

Table of Contents

Table of Contents
List of Tables
List of Figures
1 Introduction
2 IEEE 802.16a Transmission Techniques
  2.1 Overview of the IEEE 802.16a TDD OFDMA Downlink System [3]
    2.1.1 Transceiver System Structure [2]
    2.1.2 Downlink Carrier Allocation [3]
    2.1.3 OFDMA TDD Frame Structure [3]
    2.1.4 Modulation [3]
  2.2 Approach to Downlink Synchronization
    2.2.1 Downlink Synchronization Requirements
    2.2.2 Procedure of Initial Downlink Synchronization
    2.2.3 Normal Synchronization
  2.3 Sparse DFT
    2.3.1 Pruning Algorithm
    2.3.2 Transform Decomposition [11]
    2.3.3 Transform Decomposition with Filtering Approach [11]
    2.3.4 Complexity Analysis
    2.3.5 Discussion
3 Introduction to the DSP Implementation Platform
  3.1 The Quixote Baseboard [15]
  3.2 Quixote's Transfer Mechanisms [15]
    3.2.1 DSP Streaming Interface
    3.2.2 CPU Busmastering Interface
    3.2.3 Packetized Message Interface
  3.3 The TMS320C6416 DSP Chip [23]
    3.3.1 TMS320C6416 Features
    3.3.2 Central Processing Unit Features [20]

    3.3.3 Cache Memory Architecture Overview [19]
  3.4 TI's Code Development Environment [16], [26]
  3.5 Code Development Flow [21]
    3.5.1 Compiler Optimization Options [21]
4 DSP Implementation
  4.1 System Structure
    4.1.1 Memory Arrangement
    4.1.2 Fixed-Point Data Formats
  4.2 System Performance
    4.2.1 Execution Cycles of the Original Programs
    4.2.2 Efficiency Enhancement
  4.3 Overall Performance
  4.4 Graphical User Interface
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Potential Future Work

List of Tables

2.1 System Parameters Used in Our Study
2.2 OFDMA Carrier Allocation
2.3 Possible Pilot Structures in Frame Synchronization
3.1 Message Packet Formatting (from [15])
3.2 Execution Stage Length Description for Each Instruction Type (from [20])
3.3 Functional Units and Operations Performed (from [20])
4.1 System Memory Arrangement
4.2 Performance Comparison of Frequency Lock Between Floating-Point and Fixed-Point Implementation (from [2])
4.3 Performance Comparison of Frame Lock Between Floating-Point and Fixed-Point Implementation (from [2])
4.4 Characteristics of the ETSI "Vehicular A" Channel Environment [14]
4.5 Relations Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz and Subcarrier Spacing 5.58 kHz
4.6 Profile of the Original 802.16a DL Transmitter Function Blocks (based on [2])
4.7 Profile of the Original 802.16a DL Receiver Function Blocks (based on [2])
4.8 Comparison of the Modulation Function Before and After Optimization
4.9 Comparison of Framing/De-framing Functions Before and After Optimization
4.10 Comparison of Performance of FFT Functions in DSPLIB for N = 2048
4.11 Comparison of Computational Complexity of Different FFT Algorithms
4.12 Comparison of FFT/IFFT Before and After Optimization
4.13 Simulation Data for SRRC downsample
4.14 Performance Improvement of SRRC downsample by Using Intrinsics
4.15 Optimized Profile of the 802.16a DL Transmitter Function Blocks
4.16 Optimized Profile of the 802.16a DL Receiver Function Blocks
4.17 Detailed Information of Synchronization Function
5.1 Improvement After Modifications
5.2 Execution Time of the DL Receiver

List of Figures

2.1 DL transmitter structure (from [1]).
2.2 DL receiver structure (modified from [1]).
2.3 Illustration of carrier usage in OFDMA DL (from [1]).
2.4 Pilot allocation in the OFDMA DL (from [3]).
2.5 Frame structure of the TDD OFDMA system (from [3]).
2.6 QPSK, 16-QAM and 64-QAM constellations (from [3]).
2.7 Pseudo Random Binary Sequence (PRBS) generator for pilot modulation (from [3]).
2.8 Structure of the symbol time and frequency estimator (from [1]).
2.9 DL/UL symbol identification (from [2]).
2.10 State diagram of the frame synchronizer.
2.11 Multiple FFTs are needed for a consecutive range of sample locations to ensure finding the true symbol start time. (a) Symbol location detected in stage I, where the gray region is the useful samples to which the FFT is applied. (b), (c) Leftmost and rightmost ranges of correlation, respectively. (From [1].)
2.12 Normal synchronization operations.
2.13 Length 16 pruned FFT for a subset of output points (from [11]).
2.14 Block diagram of the transform decomposition method of DFT for a subset of outputs (from [11]).
2.15 Flow graph of first order network to compute (2.3.10) (from [11]).
2.16 Flow graph of second order network to compute (2.3.14) (from [11]).
2.17 Number of multiplications needed for transform decomposition when P = 512.
2.18 Number of multiplications needed for transform decomposition when P = 1024.
3.1 Picture of the Quixote card [15].
3.2 Block diagram of Quixote (from [23]).
3.3 DSP streaming mode (from [15]).
3.4 The message system (from [15]).
3.5 Block diagram of TMS320C6416 DSP (from [20]).
3.6 Pipeline phases of TMS320C6416 DSP (from [20]).
3.7 TMS320C64x CPU data path (from [20]).
3.8 C64x cache memory architecture (from [19]).
3.9 Code development flow for TI C6000 DSP (from [21]).

4.1 System integration structure.
4.2 Fixed-point data formats used in the transmitter.
4.3 Fixed-point data formats used in the receiver (based on [2]).
4.4 Allocation of bursts in a frame.
4.5 A part of the original modulation program.
4.6 A part of the modified program in the modulation function.
4.7 The other part of the modified program in the modulation function.
4.8 Compiler feedback of the modulation4 function.
4.9 Kernel of the assembly code of the modulation4 function.
4.10 Original C code of the de-framing function.
4.11 Revised C code of the de-framing function.
4.12 Software pipelining information of the revised code for the de-framing function.
4.13 Kernel of the assembly code of the revised de-framing function.
4.14 Kernel of the assembly code of the original de-framing function.
4.15 IFFT implementation using FFT function.
4.16 A part of the assembly code in DSP 16x16r.
4.17 Using intrinsics in SRRC filter.
4.18 Host PC graphical interface.
4.19 Verification structure of the DL transceiver system.

Chapter 1
Introduction

In recent years there has been increasing interest in wireless technologies for subscriber access. For some years much interest has been devoted to fixed wireless access. To provide a standardized approach, the IEEE 802 committee set up the 802.16 working group in 1999 to develop broadband wireless access standards [24]. The IEEE 802.16 standards are concerned with the air interface between a subscriber's transceiver station and a base transceiver station.

One IEEE 802.16 Task Group [24] developed the IEEE Standard 802.16a, which amends IEEE Std 802.16-2001 by enhancing the medium access control (MAC) layer and providing additional physical layer specifications in support of broadband wireless access at frequencies of 2–11 GHz. After 802.16-2001, a new IEEE Std 802.16-2004 (also called 802.16) has been published, and IEEE 802.16e is near completion. In the physical layer, the main differences between 802.16 and 802.16a are as follows:

• The preamble allocation of the TDD (time division duplexing) frame structure.
• The usage of subchannels in the symbol structure.
• Forward error correction code.

Details can be found in [3] and [4]. The IEEE 802.16e adds a mobile extension to the 802.16 standard.

In this thesis, we consider the DSP software implementation of the IEEE 802.16a downlink system. We consider the now defunct IEEE 802.16a rather than the current IEEE 802.16-2004 because this project was started three years ago. We will consider newer 802.16 standards in the future. The synchronization techniques are modified from [2]. The implementation employs Texas Instruments' TMS320C6416 digital signal processor (DSP) housed on Innovative Integration's Quixote cPCI card.

This thesis is organized as follows. In chapter 2, we introduce the 802.16a downlink system specification and the synchronization techniques. Chapter 3 introduces the Quixote baseboard and the TMS320C6416 DSP chip, as well as the program development environment and the host-target communication mechanism. In chapter 4, we describe the DSP implementation and examine the program efficiency. We also introduce the user interface used to control program execution and display numerical results. Finally, chapter 5 gives the conclusions and points out some potential future work.

Chapter 2
IEEE 802.16a Transmission Techniques

The IEEE 802.16a specification enhances the medium access control layer of the IEEE 802.16-2001 standard, and its operating frequencies are between 2 and 11 GHz. There are three physical layer modes in 802.16a: SCa (single carrier a), OFDM (orthogonal frequency-division multiplexing), and OFDMA (orthogonal frequency-division multiple access). We consider OFDMA, as it is a technology of considerable research potential.

In this chapter, we first introduce the OFDMA specifications in 802.16a and then explain the approaches we take to implement the transceiver system. Finally, we introduce the sparse DFT algorithms and discuss the reason that we do not adopt the transform decomposition method.

2.1 Overview of the IEEE 802.16a TDD OFDMA Downlink System [3]

Before a detailed introduction to the IEEE 802.16a standard, we first explain some frequently used terms. The direction of transmission from the base station (BS) to the subscriber station (SS) is called downlink (DL), and the opposite direction from SS to BS is called uplink (UL). The medium access control layer provides grant/request access to the system and links the data between the upper layer and the lower layer (i.e., the physical layer). The physical layer (PHY) handles the data transmission and may include use of multiple transmission technologies, each appropriate to a particular frequency range and application.

Fig. 2.1: DL transmitter structure (from [1]). [Block diagram: burst data scrambler, FEC, and data modulation; pilot (preamble) modulation; framing and carrier allocation controlled by the DL_MAP/UL_MAP parameters (No_OFDM_symbol, No_subchannel, OFDM_symbol_offset, Subchannel_offset); S/P (1702); addition of virtual carriers (zero padding); IFFT; P/S (2048); prefix insertion; interpolator; LPF (SRRC filter); D/A filter; Tx RF; channel (AWGN, fading).]

2.1.1 Transceiver System Structure [2]

The structure of the DL transmitter is shown in Fig. 2.1. The data bursts are fed into the FEC (forward error correction) encoder. Then we apply modulation and framing. Gray-mapped QPSK and 16-QAM are required to be supported in modulation, whereas the support of 64-QAM is optional. The framing arranges the coded data, MAPs, pilots, and preamble according to the specified frame structure and carrier allocation. After framing, the data, together with some null carriers (guard band), are fed into the IFFT to obtain the time-domain signal. The result from the IFFT is output sequentially to the pulse shaping filter. As the ideal lowpass interpolation filter cannot be implemented exactly, the square root raised cosine (SRRC) filter is used instead. The impulse response of the filter is given by

$$\mathrm{SRRC}(t) = \frac{\sin\left(\pi \frac{t}{T_{sample}}(1-\alpha)\right) + 4\alpha \frac{t}{T_{sample}} \cos\left(\pi \frac{t}{T_{sample}}(1+\alpha)\right)}{\pi \frac{t}{T_{sample}} \left[1 - \left(4\alpha \frac{t}{T_{sample}}\right)^2\right]},$$

where α is the roll-off factor. The D/A and RF parts are not addressed in the present study.

Fig. 2.2 shows the downlink receiver structure. The receiver is in some sense the reverse of the transmitter, except for the synchronizer and the channel estimator. The synchronizer is a major focus of this thesis, and it will be discussed in more detail later.

Fig. 2.2: DL receiver structure (modified from [1]).

2.1.2 Downlink Carrier Allocation [3]

In the 802.16a OFDMA system, there are 2048 carriers per symbol. The carriers are divided into three groups: pilot carriers for synchronization and channel estimation purposes, data carriers for data transmission, and null carriers, which comprise the guard bands and the DC carrier and transmit nothing at all. The system parameters employed in this study are shown in Table 2.1.

Table 2.1: System Parameters Used in Our Study
Number of carriers (N)                 2048
Center frequency                       6 GHz
Uplink / Downlink bandwidth (BW)       10 MHz
Carrier spacing (∆f)                   5.58 kHz
Sampling frequency (fs)                11.43 MHz
OFDM symbol time (Ts)                  201.6 µs (2304 samples)
Useful time (Tb)                       179.2 µs (2048 samples)
Cyclic prefix time (Tg)                22.4 µs (256 samples)

As we can see in Fig. 2.3, there are 1702 used subcarriers, composed of 1536 data carriers and 166 pilot carriers. The remaining subcarriers are unused: guard bands distributed on the edges of the symbol, and one DC carrier right in the middle of the OFDMA symbol.

Fig. 2.3: Illustration of carrier usage in OFDMA DL (from [1]). [Diagram: guard band, groups 1 to 48 of 32 data carriers each (no pilots in the group) around the DC carrier, guard band; the 1702 used carriers = 1536 data carriers + 166 pilot carriers; carriers are assigned to subchannel 1, subchannel 2, and so on.]

In the downlink, the pilot subcarriers are allocated first, and then the remainder of the used carriers is divided into 32 subchannels, each consisting of 48 data carriers. The pilot locations change with time according to a permutation formula described below. Table 2.2 shows the OFDMA downlink carrier allocation.

Table 2.2: OFDMA Carrier Allocation
Parameter                                                                      DL Value
Number of DC carriers                                                          1
Number of guard carriers, left                                                 173
Number of guard carriers, right                                                172
N_used, number of used carriers                                                1702
Total number of carriers                                                       2048
N_varLocPilots                                                                 142
Number of fixed-location pilots                                                32
Number of variable-location pilots which coincide with fixed-location pilots   8
Total number of pilots                                                         166
Number of data carriers                                                        1536
N_subchannels                                                                  32
N_subcarriers per subchannel                                                   48
Number of data carriers per subchannel                                         48

There are variable-location pilot carriers and fixed-location pilot carriers. The carrier indices of the fixed-location pilots never change. The variable-location pilots shift their locations every symbol, with a period of 4 symbols, according to the formula

$$\mathrm{varLocPilot}_k = 3L + 12P_k,$$

where varLocPilot_k is the carrier index of a variable-location pilot, L takes the values 0, 2, 1, 3 cyclically over the symbols, and P_k ∈ {0, 1, 2, 3, ..., 141}. A detailed illustration is given in Fig. 2.4.

Fig. 2.4: Pilot allocation in the OFDMA DL (from [3]).

After mapping the pilot carriers, we should also map the data carriers to the correct positions. Note that since the variable-location pilots change their locations from symbol to symbol, the locations of the data carriers change also. The exact partitioning into subchannels is done according to the formula below, called a permutation formula:

$$\mathrm{carrier}(n, s) = N_{subchannels} \cdot n + \left\{ p_s[n \bmod N_{subchannels}] + \mathrm{IDcell} \cdot \mathrm{ceil}[(n+1)/N_{subchannels}] \right\} \bmod N_{subchannels},$$

where
• carrier(n, s) is the carrier index of carrier n in subchannel s,
• s is the index number of a subchannel, from the set [0, 1, ..., N_subchannels − 1],
• n is the carrier-in-subchannel index from the set [0, 1, ..., N_subcarriers − 1],
• N_subchannels is the number of subchannels,
• p_s[j] is the series obtained by rotating PermutationBase cyclically to the left s times,
• ceil[] is the function that rounds its argument up to the next integer,
• IDcell is a positive integer assigned by the MAC to identify this particular BS, and
• X mod(k) is the remainder of the quotient X/k.

The following text in this section is mainly taken from [3], [2] and [1].

2.1.3 OFDMA TDD Frame Structure [3]

According to IEEE 802.16a, the duplexing method in the 2–11 GHz band shall be either FDD (frequency division duplexing) or TDD (time division duplexing) in licensed bands and TDD in license-exempt bands. We consider the TDD mode in this thesis. The advantage of using TDD is that we have the flexibility to control the DL and UL traffic ratio.
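As a rough check of the carrier allocation in Section 2.1.2, the following Python sketch evaluates the variable-location pilot formula and the pilot and data carrier counts of Table 2.2. The function name `var_loc_pilots`, the choice that symbol 0 of the cycle uses L = 0, and the convention that indices are counted from 0 within the 1702 used carriers are assumptions made for illustration. The fixed-location pilot indices are not listed in this section and are therefore not modeled.

```python
# Sketch of the variable-location pilot rule varLocPilot_k = 3L + 12*P_k.
# Assumptions (not stated in the text): symbol 0 uses L = 0, and indices
# are counted from 0 within the N_used = 1702 used carriers.
N_USED = 1702
N_VAR_PILOTS = 142          # P_k = 0, 1, ..., 141
N_FIXED_PILOTS = 32
N_COINCIDING = 8            # variable pilots that fall on fixed pilots
L_SEQUENCE = (0, 2, 1, 3)   # L cycles over the symbols with period 4


def var_loc_pilots(symbol_index):
    """Return the variable-location pilot indices for one OFDMA symbol."""
    L = L_SEQUENCE[symbol_index % 4]
    return [3 * L + 12 * p for p in range(N_VAR_PILOTS)]


# Counts from Table 2.2: pilots that coincide are counted only once.
TOTAL_PILOTS = N_VAR_PILOTS + N_FIXED_PILOTS - N_COINCIDING
N_DATA = N_USED - TOTAL_PILOTS

if __name__ == "__main__":
    for sym in range(4):
        pilots = var_loc_pilots(sym)
        print(sym, pilots[:3], "...", pilots[-1])
    print("pilots:", TOTAL_PILOTS, "data carriers:", N_DATA)
```

The largest index, 3·3 + 12·141 = 1701, is the last of the 1702 used carriers, and the 1536 data carriers split evenly into 32 subchannels of 48 carriers each.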

Fig. 2.5: Frame structure of the TDD OFDMA system (from [3]).

The frame structure of TDD OFDMA is shown in Fig. 2.5. The data are segmented into blocks for FEC (forward error correction) coding. Each FEC block spans one OFDMA subchannel in the subchannel axis and three OFDMA symbols in the time axis. A frame consists of one DL subframe and one UL subframe. The duration of a frame can run from 2 to 20 ms and is specified by the frame duration code. A subframe contains several transmission bursts, which are composed of multiple FEC blocks. The TTG (Tx/Rx transition gap) and the RTG (Rx/Tx transition gap) are inserted between the downlink and uplink transmissions and at the end of each frame, respectively, to allow the BS and SS to turn around. TTG and RTG shall be at least 5 µs and an integer multiple of four samples in duration.

For the DL, the transmitted data from the BS should contain the control message and system parameters, so that the subscribers know when and how to receive and transmit their data. The burst profile defines parameters such as the modulation type, forward error correction type, preamble length, guard times, etc.
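The TTG/RTG constraint can be turned into a sample count with a few lines of arithmetic. The sketch below assumes the sampling rate implied by Table 2.1 (2048 samples in a useful time of 179.2 µs); the function name `min_gap_samples` is hypothetical.

```python
import math

# Sampling rate implied by Table 2.1: 2048 samples span Tb = 179.2 us.
N_FFT = 2048
T_USEFUL = 179.2e-6
FS = N_FFT / T_USEFUL       # about 11.43 MHz


def min_gap_samples(min_gap_s=5e-6, granularity=4):
    """Smallest gap, in samples, that lasts at least min_gap_s seconds
    and is an integer multiple of `granularity` samples."""
    raw = min_gap_s * FS
    return granularity * math.ceil(raw / granularity)


GAP = min_gap_samples()
GAP_US = GAP / FS * 1e6     # duration of that gap in microseconds

if __name__ == "__main__":
    print(GAP, "samples,", round(GAP_US, 3), "us")
```

Under these assumptions 5 µs corresponds to about 57.1 samples, so the shortest admissible gap is 60 samples, or 5.25 µs.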

The first FEC block of each frame is the DL Frame Prefix, which is always transmitted in the most robust burst profile, QPSK-1/2. The DL Frame Prefix contains the parameters of the FCH (Frame Control Header), which includes the DL-MAPs and UL-MAPs and may additionally include DCD (Downlink Channel Descriptor) and UCD (Uplink Channel Descriptor) messages. The DL-MAP/UL-MAP messages define the access to the DL/UL information, including the burst profiles and the distribution of the bursts over the subchannel and time axes. The DCD and UCD shall be transmitted by the BS at a periodic interval to define the characteristics of the DL and UL physical channels.

The pilots of the first OFDM symbols serve as the DL preamble in the sense that they indicate where the OFDMA frame starts. Note that the DL preamble is not composed of an all-pilot symbol, so no additional OFDM symbol is transmitted. As a result, the number of OFDM symbols of the DL is 3N, where N is a positive integer. The number of UL OFDM symbols is 3N + 1, including one preamble and the subsequent data symbols.

2.1.4 Modulation [3]

There are three types of information to be modulated: data, pilot, and preamble. The modulation of the pilot and the preamble is explained in detail because they are useful in synchronization.

Data Modulation

The data modulation in 802.16a is shown in Fig. 2.6. The data bits are entered serially to the constellation mapper. Gray-mapped QPSK and 16-QAM must be supported, whereas the support of 64-QAM is optional.

Fig. 2.6: QPSK, 16-QAM and 64-QAM constellations (from [3]).

Pilot Modulation

Pilot carriers shall be inserted into each data burst in order to constitute the symbol, and they shall be modulated according to their carrier locations within the OFDMA symbol. The PRBS generator is used to produce a sequence w_k, where k corresponds to the carrier index. The value of the pilot modulation on carrier k is then derived from w_k. The polynomial for the PRBS generator is $X^{11} + X^9 + 1$, as Fig. 2.7 shows. The initialization vector of the PRBS in the DL transmission is [11111111111], except for the OFDMA DL PHY preamble. For the UL, the initialization vector of the PRBS is [10101010101]. The PRBS shall be initialized so that its first output bit coincides with the first usable carrier. A new value shall be generated by the PRBS on every usable carrier. Each pilot shall be transmitted with a boost of 2.5 dB over the average power of each data tone. The pilot carriers shall be modulated according to the following formulas:

$$\mathrm{Re}\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \quad \mathrm{Im}\{c_k\} = 0.$$

Preamble Modulation

The first three symbols of a frame serve as the OFDMA DL preamble. For the DL preamble, the initialization vector of the pilot modulation PRBS is [01010101010].
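The pilot and preamble modulation rules can be illustrated with a short Python sketch of the PRBS generator for $X^{11} + X^9 + 1$. The register arrangement is an assumption: the sketch takes the output from the last (11th) stage and feeds the XOR of stages 9 and 11 back into stage 1, and it reads the initialization vector left to right as stages 1 to 11. The figure in [3] should be consulted for the exact tap arrangement. The function names `prbs` and `pilot_value` are hypothetical.

```python
import math


def prbs(init_bits, n):
    """Generate n bits w_0..w_{n-1} from an 11-stage shift register for
    X^11 + X^9 + 1.  Assumed arrangement: output taken from stage 11,
    XOR of stages 9 and 11 fed back into stage 1."""
    state = list(init_bits)             # state[0] = stage 1, state[10] = stage 11
    out = []
    for _ in range(n):
        out.append(state[10])
        feedback = state[8] ^ state[10]
        state = [feedback] + state[:10]
    return out


def pilot_value(w, boosted=True):
    """Real part of the pilot symbol: (8/3)(1/2 - w) when boosted,
    2(1/2 - w) otherwise.  The imaginary part is always 0."""
    scale = 8.0 / 3.0 if boosted else 2.0
    return scale * (0.5 - w)


DL_INIT = [1] * 11                                # normal DL symbols
DL_PREAMBLE_INIT = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Power of a boosted pilot relative to a unit-power data tone, in dB.
BOOST_DB = 10 * math.log10(pilot_value(0) ** 2)

if __name__ == "__main__":
    w = prbs(DL_INIT, 23)
    print("".join(str(b) for b in w))
    print([round(pilot_value(b), 3) for b in w[:4]])
    print("boost in dB:", round(BOOST_DB, 2))
```

With the boosted mapping each pilot takes the value ±4/3, so its power is 16/9 of a unit-power data tone, which is the 2.5 dB boost quoted in the text. Changing only the initialization vector, as the DL preamble does, changes the pilot values without changing the pilot locations.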

Fig. 2.7: Pseudo Random Binary Sequence (PRBS) generator for pilot modulation (from [3]).

Hence, the preamble and other symbols may have the same pilot locations, but they can be distinguished by their different modulation values. The pilots shall be boosted and shall be modulated according to the following formulas:

$$\mathrm{Re}\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \quad \mathrm{Im}\{c_k\} = 0.$$

For the UL preamble, all the used carriers are pilots. The initialization vector of the PRBS is the same as in normal UL pilot modulation. The pilots shall not be boosted and are modulated as

$$\mathrm{Re}\{c_k\} = 2\left(\frac{1}{2} - w_k\right), \quad \mathrm{Im}\{c_k\} = 0.$$

2.2 Approach to Downlink Synchronization

Synchronization errors in OFDM can cause intersymbol and intercarrier interference. Accurate demodulation and detection of an OFDM signal requires carrier orthogonality. One way to suppress these interferences in OFDM systems is to track the carrier frequency of the received signal and the start time of each OFDM symbol.

A blind joint maximum likelihood estimator of symbol time and carrier frequency offset for OFDM symbols using the cyclic prefix is presented in [7]. The estimator exploits the redundancy introduced by the prefix and is independent of how the subcarriers are modulated. Therefore, it does not require extra pilot information to complete the timing and fractional frequency synchronization.

Variations of the carrier oscillator, the sample clocks, or the symbol time affect the orthogonality of the OFDM system. In this thesis, we do not consider sample clock synchronization. The sample clocks of the users and the base station are assumed to be fully synchronized.

The timing requirement is relaxed by using the cyclic prefix (CP). If the time offset is smaller than the length of the guard interval minus the length of the channel impulse response, then the orthogonality among carriers is maintained. In this case, the time offset will appear as a phase shift of the demodulated data symbols across the carriers but will not result in intersymbol interference (ISI) or intercarrier interference (ICI).

In practical OFDM systems, frequency offsets due to oscillator mismatch usually exist between transmitters and receivers. All subcarriers can be assumed to be equally affected by a center carrier frequency shift, because the system bandwidth is small compared to the center carrier frequency. The frequency offset has three effects: reducing the amplitude of the FFT output, introducing ICI from other carriers, and introducing a common phase rotation of the subcarriers [9].

2.2.1 Downlink Synchronization Requirements

The DL synchronization can be divided into two conditions. One is for the establishment of the initial connection, called the initial synchronization. The other is the tracking of the synchronization, called the normal synchronization. The main reason for using a normal synchronization that differs from the initial synchronization is to reduce the computational complexity in normal operation. In fact, we use a simplified version of the initial synchronization procedure for normal synchronization (tracking).
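The cyclic-prefix argument made earlier in Section 2.2, that a timing offset within the guard interval only rotates the phase of each carrier, can be checked numerically. The Python sketch below is a toy case with 16 carriers, a 4-sample prefix, and an ideal channel (no multipath and no noise); these sizes, the test symbols, and the function names are all assumptions chosen to keep the example small.

```python
import cmath


def dft(x):
    """Direct DFT, adequate for a 16-point illustration."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(X):
    """Direct inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


N, CP = 16, 4                        # toy sizes, not the 2048/256 of Table 2.1
X = [1, -1, 1j, -1j] * 4             # arbitrary unit-magnitude carrier values
x = idft(X)
tx = x[-CP:] + x                     # prepend the cyclic prefix


def demodulate(early):
    """Take the FFT window `early` samples before the true start of the
    useful part (0 <= early <= CP) and return the carrier values."""
    start = CP - early
    return dft(tx[start:start + N])


d = 2
Y = demodulate(d)
ramp = [cmath.exp(-2j * cmath.pi * k * d / N) for k in range(N)]
max_err = max(abs(Y[k] - X[k] * ramp[k]) for k in range(N))

if __name__ == "__main__":
    print("largest deviation from a pure phase ramp:", max_err)
```

Every demodulated carrier equals the transmitted value multiplied by exp(−j2πkd/N), so the magnitudes are untouched and no energy leaks between carriers. With a dispersive channel the same holds only while the offset stays below the guard interval minus the channel length, as stated above.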

If a subscriber wants to join the transmission network for the first time, it has no idea about the timing of the network or its frequency offset relative to the BS. In this case, after detecting the symbol start time, frequency estimation and correction are needed. According to 802.16a, the center frequency of the SS shall be synchronized to the BS with a tolerance of at most 2% of the inter-carrier spacing. Then, the SS has to check whether the received OFDM symbol is from the BS or from other SSs. If the symbol is from the BS, a further check is required to determine whether this symbol is the start of a frame.

After initial synchronization, the subscriber is able to extract the transmission parameters from the DL MAPs and UL MAPs. With these parameters, the SS can roughly predict the next symbol and frame start times, so normal timing synchronization can be simplified. The frequency offset is tracked during normal operation. If the OFDM symbol start time falls outside the predicted range, the initial synchronization must be performed again.

There are three kinds of usable information for synchronization: the guard interval, the pilot carriers (including the preamble), and the guard bands. We employ the method proposed in [1] and divide the initial DL synchronization into 4 stages. In the first two stages, the OFDM symbol start time and the fractional frequency offset are detected using the guard interval. The third stage exploits the guard bands to correct the integer frequency offset. Then, the final stage checks the pilot and preamble information to determine when a frame starts. For normal synchronization, only two stages are needed, where stage I is the same as that in initial DL synchronization and stage II is used to track the frequency. A more detailed description of the synchronization technique is given below.
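Stages I and II can be previewed with a small numerical experiment: a sliding correlation between samples one FFT length apart, summed over a window of CP length, peaks at the symbol start, and the phase of that peak gives the fractional frequency offset. The Python sketch below is only an illustration under strong assumptions: toy sizes (N = 64, CP = 16), a single noise-free symbol surrounded by silence so that the outcome is deterministic, a constant-modulus chirp standing in for the OFDM waveform, and a hypothetical offset of 0.2 carrier spacings. The function name `cp_correlation` is also hypothetical.

```python
import cmath

N, L = 64, 16        # FFT size and CP length (toy values)
EPS = 0.2            # frequency offset in carrier spacings (hypothetical)
PAD = 10             # silent samples before the symbol, i.e. the true start

# Constant-modulus chirp standing in for the useful part of one symbol.
x = [cmath.exp(1j * cmath.pi * m * m / N) for m in range(N)]
tx = [0j] * PAD + x[-L:] + x + [0j] * (N + L)

# Apply the carrier frequency offset sample by sample.
r = [s * cmath.exp(2j * cmath.pi * EPS * k / N) for k, s in enumerate(tx)]


def cp_correlation(r, n_fft, cp_len):
    """Sliding sum of r(k) * conj(r(k + n_fft)) over cp_len samples,
    computed for every valid start position with a running sum."""
    prod = [r[k] * r[k + n_fft].conjugate() for k in range(len(r) - n_fft)]
    gamma = [sum(prod[:cp_len])]
    for theta in range(len(prod) - cp_len):
        # Drop the oldest product and add the newest one.
        gamma.append(gamma[-1] - prod[theta] + prod[theta + cp_len])
    return gamma


gamma = cp_correlation(r, N, L)
theta_hat = max(range(len(gamma)), key=lambda t: abs(gamma[t]))
eps_hat = -cmath.phase(gamma[theta_hat]) / (2 * cmath.pi)

if __name__ == "__main__":
    print("estimated symbol start:", theta_hat)
    print("estimated fractional offset:", round(eps_hat, 6))
```

In this idealized setting the peak falls exactly on the first CP sample (index 10) and the offset estimate equals 0.2. The phase is only known modulo 2π, so offsets beyond half a carrier spacing in either direction wrap around, which is why a separate stage is needed for the integer part. With noise, multipath, and neighboring symbols the peak is less sharp.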

(26) 2.2.2. Procedure of Initial Downlink Synchronization. 2.2.2.1 Stage I: Symbol Timing Synchronization In [1], two methods of symbol timing estimation have been considered, both using the cyclic prefix: ML estimation and CP correlation. The method of ML estimation is proposed in [7], which uses the maximum likelihood criterion to estimate time and frequency offsets. Under the assumption that the received samples are jointly Gaussian, the estimated symbol time offset θˆ is given by θˆ = arg max {|Γ(θ)| − ρΦ(θ)} , where Γ(θ) =. θ+L−1 X. r(k)r∗ (k + N ),. (2.2.1). (2.2.2). k=θ θ+L−1 1 X |r(k)|2 + |r(k + N )|2 , Φ(θ) = 2 k=θ. and ρ =. SN R SN R+1. (2.2.3). with SNR being the signal to noise ratio. It is a one-shot estimator. in the sense that the estimates are based on the observation of one OFDM symbol. To roduce the complexity, the CP correlation method uses only the correlation part to estimate the symbol time, ignoring the part that compensates for the difference in energy in the correlated samples. As the samples of different OFDM symbols are uncorrelated, the peak of the sliding sum of r(k)r∗ (k + N ) would occur when the samples r(θ), · · · , r(θ + N + L − 1) are all within the same OFDM symbol. Then, the symbol time offset estimator becomes ¯ ¯θ+L−1 ¯ ¯ X ¯ ¯ r(k)r∗ (k + N )¯ . θˆ = arg max ¯ ¯ ¯. (2.2.4). k=θ. A comparison of the complexity difference between the two methods is given in [2]. For further reduction of the CP correlation complexity, we can compute the CP correlation at sample time θ by (2.2.2), then the CP correlation at sample time θ+1. 16.

is given by

$$\Gamma(\theta+1) = \sum_{k=\theta+1}^{\theta+L} r(k)\, r^*(k+N) = \Gamma(\theta) - r(\theta)\, r^*(\theta+N) + r(\theta+L)\, r^*(\theta+L+N). \tag{2.2.5}$$

Fig. 2.8: Structure of the symbol time and frequency estimator (from [1]). Blocks: delay of 2048 samples, complex conjugate, sliding sum of length L (the CP length), magnitude and argmax yielding $\hat{\theta}$, and phase scaled by $-1/(2\pi)$ yielding $\hat{\epsilon}$.

Reference [1] shows that although the ML estimator performs better than the CP correlation algorithm in AWGN channels, neither algorithm can estimate the exact symbol time with 100% accuracy. In addition, for fading multipath channels the CP correlation algorithm can outperform the ML estimator. To estimate the exact symbol time, both algorithms should be assisted by other means. Here pilot correlation is used as the auxiliary operation, which is combined in stage IV with frame synchronization. Since the complexity of ML estimation is much higher than that of CP correlation while its benefit is questionable [1], [2], we use CP correlation to estimate the symbol time in stage I. The algorithm structure is shown in Fig. 2.8.

2.2.2.2 Stage II: Fractional Frequency Synchronization

The ML estimator of the fractional frequency offset $\hat{\epsilon}$ is given by [7], [8]

$$\hat{\epsilon} = -\frac{1}{2\pi} \angle \Gamma(\hat{\theta}),$$

whose structure is already shown in Fig. 2.8. It is easy to understand why $\epsilon$ can be estimated by this method. The frequency offset $\epsilon$ results in an exponential

modulation in the time domain, in that the received samples are multiplied by $\{1, e^{j2\pi\epsilon/N}, e^{j2\pi\epsilon 2/N}, \ldots\}$. In an AWGN channel, the received sample in the guard time is

$$r(k) = s(k)\, e^{j2\pi\epsilon k/N} + n(k),$$

and the sample in the last part of the useful time is

$$r(k+N) = s(k+N)\, e^{j2\pi\epsilon(k+N)/N} + n(k+N),$$

where $s(k)$ is the transmitted signal, $N$ is the FFT size, and $n(k)$ is the noise. Then the multiplication of $r(k)$ and $r^*(k+N)$ yields

$$r(k)\, r^*(k+N) = s(k)\, s^*(k+N)\, e^{-j2\pi\epsilon N/N} + \text{noise}.$$

Note that $e^{-j2\pi\epsilon N/N} = e^{-j2\pi\epsilon}$ is the common factor of all the pairwise sample products for $r(k)$ in the guard interval. Hence summing these products should reduce the noise effect. The frequency offset $\epsilon$ can be estimated from the phase of the sum of $r(k) r^*(k+N)$ taken at the symbol start position. Note that the phase contribution of any integer frequency offset is an integer multiple of $2\pi$. Thus this estimator can only detect the fractional frequency offset.

2.2.2.3 Stage III: Integer Frequency Synchronization

The integer frequency synchronization stage is performed after the FFT. It uses the guard band and the two fixed pilot carriers at the edges of the used carriers to correct the frequency offset. There are two reasons for using the guard band for integer frequency synchronization. First, guard carriers suffer less degradation from ICI than pilot carriers. Second, the complexity of using the guard carriers is much lower than that of using the pilot carriers, as no multiplication is required.

The first step in integer frequency offset estimation is for the SS to check whether the received OFDM symbol is from the BS rather than another SS. In 802.16a [3], the definitions of the guard bands and pilots are different for DL and UL. The indices

of the DL guard carriers are from −1024 to −852 and from 852 to 1023, while for UL they are from −1024 to −849 and from 849 to 1023. A threshold can be set, and if any of the carriers {−849, −850, −851, 849, 850, 851} is larger than the threshold, the SS will regard the symbol as a DL symbol, as shown in Fig. 2.9.

Fig. 2.9: DL/UL symbol identification (from [2]).

For the DL, the standard defines carriers −851 and 851 as fixed-location pilots, which are modulated to $\pm\frac{4}{3}$ in amplitude. If there is no integer frequency offset, the FFT outputs of all the guard carriers will be small. So, all the guard carriers are checked to see if any of them exceeds the threshold. The direction of checking is from 1023 to 852, and then from −1024 to −852. If a carrier k is detected to be larger than the threshold, the ±851st fixed pilots are assumed to have shifted by k − 851 carrier spacings due to the frequency offset. The checking is then stopped and the frequency is corrected by k − 851 carrier spacings.

In a fading channel, ICI may cause serious distortion. Thus, if the ±851st pilots

are distorted to be less than the threshold, the frequency offset will not be detected by the method. An additional check is added to see whether both of the ±851st pilot carriers are larger than the threshold. After these three checks, the integer synchronization finishes. The threshold is chosen to be 0.55 in our simulation. This value is derived from the simulation results in [1].

Fig. 2.10: State diagram of the frame synchronizer.

2.2.2.4 Stage IV: Frame Synchronization

In stage I, the OFDMA symbol start time has been roughly estimated, but the SS has to know exactly where the frame starts. The frame start time estimation proposed in [1] uses the pilot correlation method. In the 802.16a standard [3], the variable-location pilots change their locations from symbol to symbol depending on the symbol index L. The modulation of the pilots is decided by the PRBS generator, and the initialization vector of the PRBS generator is different in the preamble

symbol than in a non-preamble symbol. Therefore, there are 7 possible kinds of pilot structure, as shown in Table 2.3. If the received symbol has the same pilot locations and the same PRBS initialization vector as the reference data, their correlation will be larger than in the other 6 cases. A frame is determined to start if there are three successive DL symbols with the maximum correlation corresponding to the preamble.

Table 2.3: Possible Pilot Structures in Frame Synchronization

DL preamble                  DL normal symbol
L = 0, PRBS = 01010101010    L = 0, PRBS = 11111111111
L = 2, PRBS = 01010101010    L = 2, PRBS = 11111111111
L = 1, PRBS = 01010101010    L = 1, PRBS = 11111111111
                             L = 3, PRBS = 11111111111

The proposed frame synchronization algorithm is illustrated in Fig. 2.10. In order to build a connection, we have to find the starting point of a frame in initial synchronization. After finding the third preamble symbol, we can switch to normal synchronization, as shown in Fig. 2.10. The method presented in [2] declares frame synchronization failure when there is one unexpected symbol in pilot correlation. However, we find that one unexpected symbol does not mean that the correct pilot correlation cannot be found in the next symbol. So we modify the method to declare frame synchronization failure upon detecting 6 unexpected symbols within one DL subframe.

From [2], because of the use of pilot correlation, we may need to do an FFT at each sample location over a range of 65 samples (from −32 to +32, as shown in Fig. 2.11(b) and (c) [1]) in order not to miss the true symbol start time. To reduce the computational complexity, the conventional FFT is only applied at location −32. At the subsequent sample locations, the FFT may be computed recursively as

$$X_n(k) = \left[ X_{n-1}(k) - x_{n-N} + x_n \right] e^{j2\pi k/N}, \tag{2.2.6}$$

where $N$ is the FFT size, $k$ is the carrier index, $n$ is the sample number, and $x_n$ is the sample at the new location.

Fig. 2.11: Multiple FFTs are needed for a consecutive range of sample locations to ensure finding the true symbol start time. (a) Symbol location detected in stage I, where the gray region marks the useful samples to which the FFT is applied. (b), (c) Leftmost and rightmost ranges of correlation, respectively. (From [1].)

2.2.3 Normal Synchronization

After initial synchronization, the SS can find the frame duration from the frame duration code in the MAPs. Thus the next frame start time can be predicted and there is no need to repeat the complicated initial synchronization. The timing synchronization stage should still be used to track the exact symbol time, because the received symbol time may shift with time due to channel variation and sampling clock offset. The CP correlation can estimate the rough symbol time. In normal synchronization, pilot correlation can still help to find a new accurate symbol time.

As shown in Fig. 2.12, we track the symbol timing and frequency offset in stages I and II, respectively. We then use pilot correlation to search for a more accurate symbol time and frame start time within a smaller search range. The simulation in [1] sets the search range in initial synchronization to ±32 samples around the estimated

symbol time from CP correlation. For normal synchronization, the range is reduced to within ±5 samples. In this thesis, we set the pilot search range of normal synchronization to ±16 samples to get more reliable symbol timing estimates.

Fig. 2.12: Normal synchronization operations.

Concerning carrier frequency synchronization, according to 802.16a, the SS shall track the frequency changes and shall defer any transmission if synchronization is lost. Small frequency changes can be tracked by the fractional frequency part (stage II) of initial or normal synchronization. If a larger frequency variation occurs, we may detect it by monitoring the received guard carriers and then try to correct it.

2.3 Sparse DFT

In some multiple access communication systems, the transmitter and the receivers may have different cost and capacity requirements. For instance, in a downlink scenario, one transmitter sends the same composite signal to multiple receivers. Each receiver may only be interested in a small fraction of the transmitted data. The transmitter may have high cost, provided the receivers have low cost. Partial transforms offer the possibility of cost reductions in OFDM systems.

In this section, we will introduce two kinds of methods. One is called the pruning

algorithm and the other is called the transform decomposition [11] algorithm. The following introduction is mainly taken from [11].

2.3.1 Pruning Algorithm

The pruning method was first devised by Markel [12]. Pruning is a modification of the standard one-butterfly radix-2 FFT. Fig. 2.13 shows how this pruning scheme works. Assuming that X(0) and X(1) are of interest, only the solid edges in the flow graph need to be computed, while the grey edges can be "pruned" away.

Fig. 2.13: Length 16 pruned FFT for a subset of output points (from [11]).

By also shifting the twiddle factors in the program, it is possible to get a band that does not start at X(0) but can start anywhere. Multiplying all the twiddle factors by $W_N^J$, the L output values will be $X(J), X(J+1), \ldots, X(J+L-1)$ instead of $X(0), X(1), \ldots, X(L-1)$.
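The band shift can be checked numerically with a small Python sketch. It is not a pruned FFT: it uses a direct DFT and applies the rotation $W_N^{Jn}$ to the input samples, which is an equivalent way of viewing the twiddle-factor shift. The convention $W_N = e^{-j2\pi/N}$ and the function names are assumptions made for this illustration.

```python
import cmath


def dft_band(x, start, count):
    """Direct DFT of x evaluated only at outputs start .. start+count-1."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * ((m * k) % n) / n) for m in range(n))
        for k in range(start, start + count)
    ]


def shifted_band(x, shift, count):
    """Outputs X(shift) .. X(shift+count-1), obtained by rotating x(n) by
    W_N^(shift*n) and then computing only outputs 0 .. count-1."""
    n = len(x)
    rotated = [x[m] * cmath.exp(-2j * cmath.pi * ((shift * m) % n) / n) for m in range(n)]
    return dft_band(rotated, 0, count)
```

Comparing `dft_band(x, J, L)` with `shifted_band(x, J, L)` for any test sequence should give the same values up to rounding, which is the property a pruned FFT with shifted twiddle factors relies on.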

To compute L out of N DFT points, the regular pruning program requires

$$\#MUL_{PRUNE} = 2N \lfloor \log_2 L \rfloor + 2N - 4L + \frac{2NL}{2^{\lfloor \log_2 L \rfloor}}$$

real multiplications and

$$\#ADD_{PRUNE} = 3N \lfloor \log_2 L \rfloor + 3N - 6L + \frac{3NL}{2^{\lfloor \log_2 L \rfloor}}$$

real additions. More discussion about the pruning algorithm can be found in [11]. The pruning algorithm can only compute consecutive output points. It cannot compute output points with arbitrary indices. For this reason, the pruning algorithm is not suitable for 802.16a implementation.

2.3.2 Transform Decomposition [11]

A method for computing only a subset of output points, called transform decomposition, will now be introduced. It is shown to be more efficient and more flexible than the pruning algorithm. The DFT is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \tag{2.3.1}$$

where $k = 0, 1, \ldots, N-1$. Assume that only L output points are needed and that there exists a P such that P divides N, and define $Q = N/P$. Using the variable substitution

$$n = Q n_1 + n_2, \tag{2.3.2}$$

where $n_1 = 0, 1, \ldots, P-1$ and $n_2 = 0, 1, \ldots, Q-1$, we can rewrite the DFT as follows:

$$X(k) = \sum_{n_2=0}^{Q-1} \sum_{n_1=0}^{P-1} x(n_1 Q + n_2)\, W_N^{(n_1 Q + n_2)k} \tag{2.3.3}$$

$$= \sum_{n_2=0}^{Q-1} \left[ \sum_{n_1=0}^{P-1} x(n_1 Q + n_2)\, W_P^{n_1 \langle k \rangle_P} \right] W_N^{n_2 k}, \tag{2.3.4}$$

where $\langle \cdot \rangle_P$ denotes reduction modulo P, and k takes on any L consecutive values between 0 and N − 1. Breaking this up into two equations gives

$$X(k) = \sum_{n_2=0}^{Q-1} X_{n_2}(\langle k \rangle_P)\, W_N^{n_2 k}, \tag{2.3.5}$$

where

$$X_{n_2}(j) = \sum_{n_1=0}^{P-1} x(n_1 Q + n_2)\, W_P^{n_1 j} \tag{2.3.6}$$

$$= \sum_{n_1=0}^{P-1} x_{n_2}(n_1)\, W_P^{n_1 j}, \tag{2.3.7}$$

with $j = 0, 1, \ldots, P-1$ and $x_{n_2}(n_1) = x(n_1 Q + n_2)$. The sum in (2.3.7) can be recognized as a length-P DFT, and it can be computed efficiently using any FFT algorithm. This is a great advantage of the transform decomposition method. Inspecting (2.3.7), it can be seen that the sequence over which the DFT has to be computed is two dimensional and hence depends on $n_2$. Thus a DFT has to be computed for each different value of $n_2$, and hence there are Q such length-P DFTs.

The outputs of the DFTs are recombined using (2.3.5), which can be computed directly using Q complex multiplications and Q − 1 complex additions per output point. In total this takes QL complex multiplications, each requiring 4 real multiplications and 2 real additions, and L(Q − 1) complex additions, each requiring 2 real additions. The advantage of transform decomposition is that we can compute any output point with index k in (2.3.5), which shows that the transform decomposition algorithm is more flexible than the pruning algorithm. Fig. 2.14 shows how this method works to compute the first L out of N DFT points.

2.3.3 Transform Decomposition with Filtering Approach [11]

It is possible to lower the number of operations required to compute (2.3.5) even further using a technique similar to the Goertzel algorithm [13]. To see this, rewrite

Fig. 2.14: Block diagram of the transform decomposition method of DFT for a subset of outputs (from [11]).

(2.3.5) as follows:

$$X(k) = \sum_{n_2=0}^{Q-1} X_{n_2}(\langle k \rangle_P)\, (W_N^k)^{n_2} \tag{2.3.8}$$

$$= \sum_{m=0}^{Q-1} X_{Q-m-1}(\langle k \rangle_P)\, (W_N^k)^{Q-m-1}, \tag{2.3.9}$$

with the variable substitution $m = Q - n_2 - 1$. Now define

$$y_k(j) = \sum_{m=0}^{j-1} X_{Q-m-1}(\langle k \rangle_P)\, (W_N^k)^{j-m-1}, \tag{2.3.10}$$

from which we can find X(k) as

$$X(k) = y_k(j)\big|_{j=Q}. \tag{2.3.11}$$

Equation (2.3.10) can be recognized as a shifted cyclic convolution between the sequence $X_{Q-j-1}(\langle k \rangle_P)$ and $(W_N^k)^{j-1}$ in the variable j, and hence $y_k(j)$ can be viewed as the output of a system with impulse response $(W_N^k)^{j-1}$ driven by the input $X_{Q-j-1}(\langle k \rangle_P)$. Fig. 2.15 shows a flow graph that implements (2.3.10), but a quick analysis will show that this implementation requires 4 real multiplications per iteration assuming the input is complex, and hence requires the same number of operations as a direct implementation of (2.3.5).

Fig. 2.15: Flow graph of first order network to compute (2.3.10) (from [11]).
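The decomposition and the first-order recursion implied by (2.3.10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sub-transforms of (2.3.7) are computed with a direct DFT as a stand-in for an FFT, the convention $W_N = e^{-j2\pi/N}$ is assumed, and the function names are hypothetical. The recursion $y_k(j) = W_N^k\, y_k(j-1) + X_{Q-j}(\langle k \rangle_P)$ with $y_k(0) = 0$ follows directly from (2.3.10).

```python
import cmath


def dft(x):
    """Direct DFT, used here as a stand-in for a length-P FFT."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * ((m * j) % n) / n) for m in range(n))
        for j in range(n)
    ]


def transform_decomposition(x, p, wanted):
    """Compute X(k) for each k in `wanted` (any subset of 0..N-1).

    x : input sequence of length N, with p dividing N
    p : length P of the sub-transforms; Q = N / P of them are needed
    """
    n = len(x)
    q = n // p
    # Q length-P DFTs of the decimated sequences x(n1*Q + n2), as in (2.3.7).
    sub = [dft(x[n2::q]) for n2 in range(q)]
    result = {}
    for k in wanted:
        w = cmath.exp(-2j * cmath.pi * k / n)  # W_N^k
        y = 0j
        # First-order recursion: after Q steps y equals X(k), see (2.3.11).
        for j in range(1, q + 1):
            y = w * y + sub[q - j][k % p]
        result[k] = y
    return result
```

Because each requested index is handled independently in the loop over `wanted`, the indices need not be consecutive, which is the flexibility the text attributes to transform decomposition.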

Fig. 2.16: Flow graph of second order network to compute (2.3.14) (from [11]).

The transfer function of the system in Fig. 2.15 is

$$H_k(z) = \frac{z^{-1}}{1 - z^{-1} W_N^k}, \tag{2.3.12}$$

which can be rewritten as

$$H_k(z) = \frac{z^{-1}(1 - z^{-1} W_N^{-k})}{(1 - z^{-1} W_N^k)(1 - z^{-1} W_N^{-k})} \tag{2.3.13}$$

$$= \frac{z^{-1}(1 - z^{-1} W_N^{-k})}{1 - 2\cos\left(\frac{2\pi k}{N}\right) z^{-1} + z^{-2}}. \tag{2.3.14}$$

This last equation can be implemented using the flow graph in Fig. 2.16. Assume that the input is complex. Then each iteration only takes two real multiplications, since the multiplication by −1 need not be counted. This is half of what was needed in the first order case. Because we are only interested in $y_k(Q)$, but not the intermediate values, it can be seen that the zero of the system is only needed once.

The derivations of (2.3.10) and (2.3.11) are not based on the actual values of the indices of the computed output values, i.e., they do not rely on the specific values of k. Unlike the standard FFTs, efficient computation of (2.3.10) and (2.3.11) by the flow graph in Fig. 2.16 does not depend on combining computations for several

different output points (several different k). Hence the set of output points to be computed can be any length-L subset of the N possible output points. This is a very powerful result that shows that transform decomposition is not just more efficient than pruning, but also more flexible. Where pruning restricts you to L consecutive output values, transform decomposition allows any length-L subset to be computed.

2.3.4 Complexity Analysis

For the transform decomposition method, the computational complexity is discussed in [11]. Given that N is a power of two, we need

$$\#MUL_{TD} = N \log_2 P - 3N + 4(L+1)\frac{N}{P} - 4L \tag{2.3.15}$$

real multiplications and

$$\#ADD_{TD} = 3N \log_2 P - 3N + 4(L+1)\frac{N}{P} - 4L \tag{2.3.16}$$

real additions. The computational complexity for transform decomposition with filtering is

$$\#MUL_{TD-FILT} = N \log_2 P - 3N + 2(L+2)\frac{N}{P} + 2L \tag{2.3.17}$$

real multiplications and

$$\#ADD_{TD-FILT} = 3N \log_2 P - 3N + 4(L+1)\frac{N}{P} - 4L + 4P \tag{2.3.18}$$

real additions.

It still needs to be determined what values to use for the factor P. For most applications the number of output points L is given and the optimum P has to be found. To minimize the total number of operations, P should be chosen as

$$P_{TOT-MIN-TD} = [2(L+1)\log_e 2]_{close} \tag{2.3.19}$$

for the transform decomposition method, where $[\,\cdot\,]_{close}$ indicates the closest power of two. Unfortunately, the problem is nonlinear, and hence it is not "closest" in any easily determined sense, so both the larger and smaller possible values of P should be examined. If instead the lowest possible number of multiplications is required, P should be chosen as

$$P_{MUL-MIN-TD} = [4(L+1)\log_e 2]_{close}. \tag{2.3.20}$$

Minimizing the number of multiplications may be more useful for us, because the DSP can execute fewer multiplications than additions in parallel. For the transform decomposition with filtering method, P can be chosen as

$$P_{TOT-MIN-TD-FILT} = \left[ \frac{\sqrt{\left(\frac{N}{\log_e 2}\right)^2 + 6LN + 8N} - \frac{N}{\log_e 2}}{2} \right]_{close} \tag{2.3.21}$$

to minimize the total number of operations, and

$$P_{MUL-MIN-TD-FILT} = [2(L+2)\log_e 2]_{close} \tag{2.3.22}$$

to minimize the number of multiplications. Hence if the total number of operations is to be minimized, P should be chosen slightly larger than L, while if the number of multiplications is to be minimized, P should be chosen about three times the size of L (from the simulation results in [11]). This result will become the major reason that we do not adopt the transform decomposition algorithm for our implementation. There is more discussion about these methods in [11].

2.3.5 Discussion

Because the TMS320C6416 DSP chip can perform 6 additions but only 2 multiplications at the same time, we consider the multiplication complexity in this section. In downlink transmission, the carriers we need to use are the 166 pilot carriers plus the user data carriers. So the number of output points we need to compute is L = 166 + 48 × k, where k is the number of subchannels assigned to the users (SSs).
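The multiplication counts discussed in this section can be tabulated with a short Python sketch. It evaluates (2.3.15) and (2.3.17) for L = 166 + 48k with N = 2048. The full-FFT reference count N log2 N − 3N + 4 is an assumption: it is the commonly quoted split-radix figure when each complex multiplication costs three real multiplications, and it matches the FFT term used in (2.3.15). The function and constant names are illustrative only.

```python
N_FFT = 2048         # FFT size of the OFDMA downlink
PILOTS = 166         # pilot carriers that are always needed
PER_SUBCHANNEL = 48  # data carriers per assigned subchannel


def outputs_needed(k):
    """Number of DFT outputs L for k assigned subchannels."""
    return PILOTS + PER_SUBCHANNEL * k


def log2_int(v):
    """Exact base-2 logarithm of a power of two."""
    return v.bit_length() - 1


def mul_td(n, p, l):
    """Real multiplications of transform decomposition, (2.3.15)."""
    return n * log2_int(p) - 3 * n + 4 * (l + 1) * n // p - 4 * l


def mul_td_filt(n, p, l):
    """Real multiplications of the filtering variant, (2.3.17)."""
    return n * log2_int(p) - 3 * n + 2 * (l + 2) * n // p + 2 * l


def mul_split_radix(n):
    """Assumed reference count for a complete split-radix FFT."""
    return n * log2_int(n) - 3 * n + 4


def max_subchannels(p):
    """Upper bound on k so that L does not exceed P."""
    return (p - PILOTS) // PER_SUBCHANNEL


if __name__ == "__main__":
    for p in (512, 1024):
        for k in range(1, max_subchannels(p) + 1):
            l = outputs_needed(k)
            print(p, k, mul_td(N_FFT, p, l), mul_td_filt(N_FFT, p, l),
                  mul_split_radix(N_FFT))
```

Under these assumptions, plain transform decomposition with P = 1024 stays below the full-FFT count only up to k = 7.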

Fig. 2.17: Number of multiplications needed for transform decomposition when P = 512. (Horizontal axis: k, the number of subchannels, from 1 to 7. Vertical axis: number of real multiplications, in units of 10^4. Curves: split-radix 2/4, transform decomposition, and transform decomposition with filtering.)

From the simulation results in [11], the value of P should be chosen about three times the size of L to minimize the number of multiplications, so the only proper values of P are 512 and 1024. For these values of P, the numbers of subchannels which can be assigned to the SSs are bounded by $\lfloor (512 - 166)/48 \rfloor = 7$ and $\lfloor (1024 - 166)/48 \rfloor = 17$, respectively.

Figs. 2.17 and 2.18 show the number of multiplications needed at P = 512 and 1024, respectively. In these figures, we also show the multiplication complexity of the split-radix 2/4 algorithm, which is one of the most efficient algorithms for a complete FFT. For P = 512, we find that if the number of subchannels used is larger than 4 or 5, we would be better off using the split-radix algorithm to compute all the points. For P = 1024, it is more efficient to use the transform decomposition algorithm when k ≤ 7. Further, the filtering approach performance is even worse

than transform decomposition. According to our observation, this is because only two filter taps are left when P = 1024, so we cannot gain enough from the cheaper computation of the poles of (2.3.14) while we still have to pay for the computation of the zero.

Fig. 2.18: Number of multiplications needed for transform decomposition when P = 1024. (Horizontal axis: k, the number of subchannels, from 0 to 18. Vertical axis: number of real multiplications, in units of 10^4. Curves: split-radix 2/4, transform decomposition, and transform decomposition with filtering.)

Based on the above, we decide not to adopt the transform decomposition algorithm in our implementation of 802.16a DL transmission. In the 802.16a specification [3], all the subchannels may be assigned to one SS. Besides, Texas Instruments provides high-performance FFT functions in their DSPLIB [22]. The analysis of TI's FFT functions is given in chapter 4.

As a final remark, we note that we have only discussed the "many to few" case of the transform decomposition algorithm above, which means that the number of FFT output points L is smaller than the number of FFT input points N. The case of "few to many" can be applied to the uplink transmission of 802.16a. We refer to

[11] for details of the methods.

Chapter 3
Introduction to the DSP Implementation Platform

We introduced the 802.16a DL transmission system in the last chapter. In this work, we conduct a DSP (digital signal processor) implementation of a DL transmitter-receiver pair. This chapter introduces the Quixote DSP-FPGA baseboard made by Innovative Integration (II) and the on-board DSP, which is Texas Instruments' TMS320C6416. Our discussion will concentrate on the DSP chip and the associated system development environment because our implementation is purely software on the DSP.

3.1 The Quixote Baseboard [15]

The DSP-FPGA embedded card used in our implementation is Innovative Integration's Quixote baseboard, which is illustrated in Fig. 3.1. Quixote is one of Innovative Integration's Velocia-family baseboards for various applications requiring high-speed computation. Fig. 3.2 shows a block diagram of the Quixote board. It combines a 600 MHz C6416 32-bit fixed-point DSP with a Virtex-II FPGA and system-level peripherals. The FPGAs on our boards are the six-million-gate version. The TI C6416 DSP operating at 600 MHz offers a processing power of 4800 MIPS. Some detailed features of the board are as follows:

• TMS320C6416 processor running at a frequency of up to 600 MHz.
• Onboard 32 MB SDRAM for the DSP chip, enhanced cache controllers, 64 DMA channels, 3 McBSP synchronized serial ports, and two 32-bit timers.
• A 32/64-bit PCI bus host interface with direct host memory access capability for busmastering data between the card and the host memory.
• 2 input, 2 output A/D and D/A conversion, 14 bit, DC to 105 MHz.

Fig. 3.1: Picture of the Quixote card [15].

3.2 Quixote's Transfer Mechanisms [15]

Many applications on a DSP baseboard involve communication with the host CPU in some manner. They may have to interact with a host program during the lifetime of the program. Some examples are:
• Passing parameters to the program at start time.
• Receiving progress information and results from the application.
• Passing updated parameters during the run time of the program, such as the frequency and amplitude of a wave to be produced on the target.

Fig. 3.2: Block diagram of Quixote (from [23]).

• Receiving alert information from the target.
• Receiving snapshots of data from the target.
• Sending a sample waveform to be generated to the target.
• Receiving full rate data.
• Sending data to be streamed at full rate.

There are three transfer methods on Quixote: the DSP streaming interface, the CPU busmastering interface, and the packetized message interface. The following text is mainly taken from [15].

3.2.1 DSP Streaming Interface

The DSP streaming interface is a continuous, block-based streaming transfer. It is designed for non-stop operation such as A/D and D/A. The DSP streaming interface is bi-directional. Two streams can run simultaneously. One runs from the analog peripherals through the DSP into the application. This is called the "incoming stream." The other stream runs out to the analog peripherals. This is the "outgoing stream." The mechanism is shown in Fig. 3.3. In both cases, the DSP needs to act as a mediator, since there is no direct access to the analog peripherals from the host. This arrangement allows the DSP to process the streams as they move between the application and the hardware.

3.2.2 CPU Busmastering Interface

This method of target-to-host communication is available on the Velocia baseboards only. The TI 64x baseboard is capable of using PCI busmastering to move data between target and host memories. This additional busmaster channel can be used to transfer data between host and target applications.

Fig. 3.3: DSP streaming mode (from [15]).

The CPU busmastering interface performs packet-based transfers, which move discrete blocks between source and destination. Each data buffer is transferred completely to the destination in a single operation. The data buffers transferred can be of different sizes. Each requested buffer is interrogated for its size and fully transmitted. At the destination, the destination buffer is re-sized to allow the incoming data to fit. Reallocating buffers can take some time, so for best performance buffers should be pre-sized to be large enough for the largest transfer expected.

CPU busmastering uses a simple blocking interface for its sending and receiving functions. The sending function will not return until the transfer has completed and the buffer is ready for reuse. Similarly, the receiving function waits until data have arrived from the data source and have been transferred into the data buffer before returning. Since the transfer functions are blocking, they are best avoided in the main user interface thread of a Windows application, because the GUI will appear to be frozen until the transfer has completed. For best results, the data transfer function should be

placed in separate threads in target and host applications. In fact, each direction of transfer should have its own thread, so that the two directions of transfer can interleave as much as possible.

The CPU busmastering interface allows separate channels of data between the target and the host. Using separate channels allows multiple, independent data streams to be maintained between the target and host. At present, only a single channel is supported. The largest transfer allowed is half of the total size of the DMA buffer allocated by the INF file (a kind of file used for software/firmware installation in Windows systems) when the driver is installed. Half of the memory is dedicated to each direction. The default buffer size in the INF is 0x200000 bytes (2 MB), so the maximum transfer block is 1 MB.

3.2.3 Packetized Message Interface

In addition to the busmastering streaming interface, the DSP and host have a lower bandwidth (limited to about 56 kB/sec) communications link for sending commands or out-of-band information between target and host. Software is provided to build a packet-based message system between the target and host software. These packets can provide a simple yet powerful means of sending commands and information across the link between the two processes.

As shown in Fig. 3.4, the message system's arrangement provides one bi-directional link between the target and the host. "CIIMessage" and "IImessage" are the host-side and target-side message object declarations, respectively. The detailed contents of the packet formatting are shown in Table 3.1. "CIIbaseboard::OnMessage" and "Unsolicited Message Handler" are the message handlers invoked when messages are received on the host and target sides, respectively. The "Post" function is used for sending a message out.

In this study, we use the methods of CPU busmastering and message interface for

Fig. 3.4: The message system (from [15]).

Table 3.1: Message Packet Formatting (from [15])

Function Name      Property
Channel            Message channel
TypeCode           Message or command type
MessageId          Message counter or other user data
IsReplyExpected    Set if reply is needed. Free for use in application
Data[ ]            Access the data region as 32-bit integers (index 0–13)
AsFloat[ ]         Access the data region as floating point data (index 0–13)
Asshort[ ]         Access the data region as 16-bit integers (index 0–27)
AsChar[ ]          Access the data region as 8-bit characters (index 0–55)
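The index ranges in Table 3.1 all describe the same 56-byte data region: 14 words of 32 bits, 28 halfwords of 16 bits, or 56 bytes. The Python sketch below illustrates this overlay with the standard `struct` module. It is not the Innovative Integration API. The function name, the dictionary keys, and the little-endian byte order are assumptions made for the illustration.

```python
import struct

PAYLOAD_BYTES = 56  # 14 words x 4 bytes, from the index ranges in Table 3.1


def payload_views(payload):
    """Return one 56-byte data region interpreted in the four ways
    listed in Table 3.1 (little-endian layout is an assumption)."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be exactly 56 bytes")
    return {
        "Data": struct.unpack("<14i", payload),     # 32-bit integers, 0..13
        "AsFloat": struct.unpack("<14f", payload),  # 32-bit floats, 0..13
        "AsShort": struct.unpack("<28h", payload),  # 16-bit integers, 0..27
        "AsChar": bytes(payload),                   # 8-bit characters, 0..55
    }
```

For instance, packing the integers 0 to 13 with `struct.pack("<14i", *range(14))` and passing the result to `payload_views` returns the same integers under `"Data"`, while `"AsShort"` exposes each of them as a low halfword followed by a zero high halfword.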

communication between the host and the target. The CPU busmastering interface provides higher bandwidth for data transmission, but its disadvantage is that only one channel is supported. The packetized message interface supports sixteen channels in each direction, but its bandwidth is limited to 56 kB/sec.

3.3 The TMS320C6416 DSP Chip [23]

The following text is mainly taken from references [2] and [23].

3.3.1 TMS320C6416 Features

The TMS320C64x DSPs are the highest-performance fixed-point DSP generation on the TMS320C6000 DSP platform. The TMS320C64x device is based on the second-generation high-performance, very-long-instruction-word (VLIW) architecture developed by TI. The C6416 device has two high-performance embedded coprocessors, the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP), that can significantly speed up channel-decoding operations on-chip, but we do not make use of these coprocessors now. The C64x core CPU consists of 64 general-purpose 32-bit registers and 8 functional units. These 8 functional units contain two multipliers and six ALUs. Features of the C6000 devices include:
• Advanced VLIW CPU with eight functional units, including two multipliers and six arithmetic units:
– Executes up to eight instructions per cycle.
– Allows designers to develop highly effective RISC-like code for fast development time.
• Instruction packing:

– Gives code size equivalence for eight instructions executed serially or in parallel.
– Reduces code size, program fetches, and power consumption.
• Conditional execution of all instructions:
– Reduces costly branching.
– Increases parallelism for higher sustained performance.
• Efficient code execution on independent functional units:
– Efficient C compiler on DSP benchmark suite.
– Assembly optimizer for fast development and improved parallelization.
• 8/16/32-bit data support, providing efficient memory support for a variety of applications.
• 40-bit arithmetic options add extra precision for applications requiring it.
• Saturation and normalization provide support for key arithmetic operations.
• Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications.

The C64x additional features include:
• Each multiplier can perform two 16×16-bit or four 8×8-bit multiplies every clock cycle.
• Quad 8-bit and dual 16-bit instruction set extensions with data flow support.
• Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.

• Special communication-specific instructions have been added to address common operations in error-correcting codes.
• Bit count and rotate hardware extends support for bit-level algorithms.

Fig. 3.5: Block diagram of TMS320C6416 DSP (from [20]).

3.3.2 Central Processing Unit Features [20]

The block diagram of the C6416 DSP is shown in Fig. 3.5. The DSP contains: a program fetch unit, an instruction dispatch unit, an instruction decode unit, two data paths which each have four functional units, 64 32-bit registers, control registers, control logic, and test, emulation, and interrupt logic.

The TMS320C64x DSP pipeline provides flexibility to simplify programming and improve performance. The pipeline can dispatch eight parallel instructions every cycle. Two factors provide this flexibility. First, control of the pipeline is simplified by eliminating pipeline interlocks. Second, pipelining is increased to eliminate traditional architectural bottlenecks in program fetch, data access, and multiply operations. This provides single-cycle throughput.

The pipeline phases are divided into three stages: fetch, decode, and execute. All instructions in the C62x/C64x instruction set flow through the fetch, decode, and execute stages of the pipeline. The fetch stage of the pipeline has four phases for all instructions, and the decode stage has two phases for all instructions. The execute stage of the pipeline requires a varying number of phases, depending on the type of instruction. The stages of the C62x/C64x pipeline are shown in Fig. 3.6.

Fig. 3.6: Pipeline phases of TMS320C6416 DSP (from [20]).

Reference [20] contains detailed information regarding the fetch and decode phases. The pipeline operation of the C62x/C64x instructions can be categorized into seven instruction types. Six of these are shown in Table 3.2, which gives a mapping of operations occurring in each execution phase for the different instruction types. The delay slots associated with each instruction type are listed in the bottom row.

The execution of instructions can be defined in terms of delay slots. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction. Results from instructions with delay slots are not available until the end of the last

(56) Table 3.2: Execution Stage Length Description for Each Instruction Type (from [20]).

delay slot. For example, a multiply instruction has one delay slot, which means that one CPU cycle elapses before the result of the multiply is available for use by a subsequent instruction. However, results are available from other instructions that finish execution during the same CPU cycle in which the multiply is in a delay slot.

The eight functional units in the C6000 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The functional units are described in Table 3.3.

Besides being able to perform 32-bit operations, the C64x also contains many 8-bit and 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8×8 unsigned multiplies with a single instruction on a .M unit. The

(57) Table 3.3: Functional Units and Operations Performed (from [20])

Functional unit: operations performed

• .L unit (.L1, .L2)
– 32/40-bit arithmetic and compare operations
– 32-bit logical operations
– Leftmost 1 or 0 counting for 32 bits
– Normalization count for 32 and 40 bits
– Byte shifts
– Data packing/unpacking
– 5-bit constant generation
– Dual 16-bit arithmetic operations
– Quad 8-bit arithmetic operations
– Dual 16-bit min/max operations
– Quad 8-bit min/max operations

• .S unit (.S1, .S2)
– 32-bit arithmetic operations
– 32/40-bit shifts and 32-bit bit-field operations
– 32-bit logical operations
– Branches
– Constant generation
– Register transfers to/from control register file (.S2 only)
– Byte shifts
– Data packing/unpacking
– Dual 16-bit compare operations
– Quad 8-bit compare operations
– Dual 16-bit shift operations
– Dual 16-bit saturated arithmetic operations
– Quad 8-bit saturated arithmetic operations

• .M unit (.M1, .M2)
– 16 x 16 multiply operations
– 16 x 32 multiply operations
– Quad 8 x 8 multiply operations
– Dual 16 x 16 multiply operations
– Dual 16 x 16 multiply with add/subtract operations
– Quad 8 x 8 multiply with add operation
– Bit expansion
– Bit interleaving/de-interleaving
– Variable shift operations and rotation
– Galois Field Multiply

• .D unit (.D1, .D2)
– 32-bit add, subtract, linear and circular address calculation
– Loads and stores with 5-bit constant offset
– Loads and stores with 15-bit constant offset (.D2 only)
– Load and store double words with 5-bit constant
– Load and store non-aligned words and double words
– 5-bit constant generation
– 32-bit logical operations

(58) ADD4 instruction performs four 8-bit additions with a single instruction on a .L unit.

The data paths in the CPU support 32-bit operands as well as long (40-bit) and double-word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (see Fig. 3.7). All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle when performing 32-bit operations.

3.3.3. Cache Memory Architecture Overview [19]

The C64x memory architecture consists of a two-level internal cache-based memory architecture plus external memory, as shown in Fig. 3.8. Level 1 cache is split into program (L1P) and data (L1D) caches. On C64x devices, each L1 cache is 16 KB. All caches and data paths are automatically managed by the cache controller. Level 1 cache is accessed by the CPU without stalls. Level 2 memory is configurable and can be split into L2 SRAM (addressable on-chip memory) and L2 cache for caching external memory locations. On the C6416 DSP, the L2 memory is 1 MB, and the external memory on the Quixote baseboard is 32 MB. A more detailed introduction to the cache system can be found in [19].

3.4. TI’s Code Development Environment [16], [26]

TI provides a GUI development environment, Code Composer Studio (CCS), in which DSP users can develop and debug their projects. The CCS development

(59) Fig. 3.7: TMS320C64x CPU data path (from [20]).

(60) Fig. 3.8: C64x cache memory architecture (from [19]).

tools are a key element of the DSP software and development tools from Texas Instruments. The fully integrated development environment includes real-time analysis capabilities, an easy-to-use debugger, a C/C++ compiler, an assembler, a linker, an editor, a visual project manager, simulators, XDS560 and XDS510 emulation drivers, and DSP/BIOS support.

Some of CCS’s fully integrated host tools include:

• Simulators for full devices, CPU only, and CPU plus memory for optimal performance.
• Integrated visual project manager with source control interface, multi-project support, and the ability to handle thousands of project files.
• Source code debugger with a common interface for both simulator and emulator targets:
– C/C++/assembly language support.
– Simple breakpoints.

(61) – Advanced watch window.
– Symbol browser.
• DSP/BIOS host tooling support (configure, real-time analysis, and debug).
• Data transfer for real-time data exchange between host and target.
• Profiler to understand code performance.

CCS also delivers foundation software consisting of:

• DSP/BIOS kernel for the TMS320C6000 DSPs:
– Pre-emptive multi-threading.
– Interthread communication.
– Interrupt handling.
• TMS320 DSP Algorithm Standard to enable software reuse.
• Chip Support Libraries (CSL) to simplify device configuration. CSL provides C-program functions to configure and control on-chip peripherals.
• DSP libraries for optimum DSP functionality. The DSP Library includes many C-callable, assembly-optimized, general-purpose signal-processing and image/video processing routines. These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical.

TI also provides optimized DSP functions for the TMS320C64x devices in the TMS320C64x digital signal processor library (DSPLIB). The routines included in the DSP library are organized into seven groups:

• Adaptive filtering.