IEEE 802.16a OFDMA TDD Uplink Transceiver System Integration and Optimization on DSP Platform

National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

IEEE 802.16a 分時雙工正交分頻多重進接上行傳收系統在數位訊號處理器平台上之整合及最佳化
IEEE 802.16a OFDMA TDD Uplink Transceiver System Integration and Optimization on DSP Platform

Student: Ching Chung Tung (董景中)
Advisor: Dr. David W. Lin (林大衛)

A Thesis Submitted to Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering

June 2005
Hsinchu, Taiwan, Republic of China

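Abstract (Chinese version)

This thesis presents a software implementation of the IEEE 802.16a TDD OFDMA uplink transmission system. We integrate the forward error correction (FEC) encoder into the transmitter, and add the synchronizer, the channel equalizer, and the FEC decoder to the receiver. We first modify the synchronization algorithm at the receiver and then optimize the programs on a digital signal processing (DSP) platform. The platform consists of a personal computer, Innovative Integration's Quixote board, and the Texas Instruments TMS320C6416 DSP chip mounted on it.

The uplink synchronization mechanism at the receiver exploits the invariance of the uplink preamble by correlating it directly with the received signal. This locates the arrival time of the first user to reach the base station, which reduces inter-symbol interference. To improve DSP computational efficiency, all computations in our system use fixed-point formats. The goal of the optimization is to speed up program execution so as to meet real-time requirements.

We present several code improvement techniques, such as software pipelining and the use of C6416 intrinsics, and we analyze the information provided by the compiler to understand how the programs run. Finally, the interpolation filter at the transmitter and the synchronizer at the receiver are sped up by factors of 85.85 and 1.74, respectively, and their execution efficiencies on the DSP reach 90.94% and 85.87%.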

Abstract

This thesis presents a software implementation of the IEEE 802.16a TDD uplink transceiver system. We integrate the FEC encoder into the transmitter, and the synchronizer, the channel equalizer, and the FEC decoder into the receiver. We first modify the uplink synchronization algorithm and then optimize our programs on a digital signal processing (DSP) platform, which comprises a personal computer (PC), Innovative Integration's Quixote DSP board, and TI's TMS320C6416 DSP chip.

The uplink synchronization mechanism exploits the invariance of the preamble, which is known to the base station. We correlate the preamble directly with the received signal to find the arrival time of the first subscriber station's signal, thereby reducing inter-symbol interference. All data in the system are in fixed-point format to improve computational efficiency on the DSP. Our optimization goal is to accelerate program execution so that it satisfies the requirement of real-time processing.

We present several optimization techniques, such as software pipelining and the use of DSP intrinsics, to deal with the most time-consuming parts of the program. We also analyze the compiler feedback to understand how the program runs on the DSP. As a result, the interpolation filter in the transmitter and the uplink synchronizer in the receiver are sped up by factors of 85.58 and 1.74, respectively, and their computational efficiencies are 90.94% and 85.87%, respectively.

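Acknowledgements

For the successful completion of this thesis, I would first like to thank my advisor, Professor David W. Lin (林大衛). Over my two years of graduate study, his careful guidance and his deep expertise in the field allowed me to keep moving forward in my studies and research. I wish him good health amid his busy schedule.

I also thank all members of the Communication Electronics and Signal Processing Laboratory, including the faculty, my classmates, and my senior and junior labmates. I thank my seniors 吳俊榮 and 洪崑健 for their guidance and advice during my research, and my classmates 簡志凱, 陳昱昇, 陳汝芩, 王盈閔, 陳志楹, and 徐漢光, among others. Being able to discuss with you and share our experiences as students made the laboratory a bright and pleasant place.

Finally, I thank my family and friends for always supporting me, which allowed me to complete my studies without distraction. I dedicate this thesis to everyone who has helped me.

Ching Chung Tung (董景中)
Hsinchu, June 2005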

Table of Contents

List of Tables
List of Figures

1 Introduction

2 The IEEE 802.16a TDD OFDMA Uplink Transmission Scheme
  2.1 Introduction to OFDM
  2.2 Overview of OFDMA
  2.3 Overview of the IEEE 802.16a Standard
    2.3.1 UL Carrier Allocation
    2.3.2 OFDMA Data Mapping
    2.3.3 OFDMA Frame Structure for TDD
  2.4 Transmitter - Receiver System Architecture
    2.4.1 Modulation
    2.4.2 TX/RX SRRC filter
  2.5 UL Synchronization Problems
  2.6 UL Synchronization
  2.7 UL Synchronization Result
    2.7.1 Simulation Parameters and Environments
    2.7.2 UL Synchronization

3 Introduction to the DSP Implementation Platform
  3.1 DSP Board [16]
  3.2 DSP Chip [18]
  3.3 Data Transmission Mechanism [16]
    3.3.1 DSP Streaming Interface
    3.3.2 CPU Busmastering Interface
    3.3.3 Packetized Message Interface

4 Integration and Optimization of the IEEE 802.16a OFDMA TDD Uplink Transmitter-Receiver System
  4.1 Structure of the Implemented System
    4.1.1 CPU Busmastering Interface
  4.2 Fixed-Point Data Formats
  4.3 TI's Code Development Environment [21]
    4.3.1 Code Development Flow [19]
    4.3.2 Compiler Optimization Options [19]
    4.3.3 Software Pipelining [22]
    4.3.4 Intrinsics [19]
  4.4 Performance of the Original Program
  4.5 The Modulation Function
  4.6 The Framing and Deframing Functions
  4.7 The IFFT and FFT Functions
    4.7.1 Analysis of the Output Performance
    4.7.2 Complexity Analysis
  4.8 Transmission Filtering
    4.8.1 Complexity Analysis
  4.9 The Uplink Synchronization Function
    4.9.1 Complexity Analysis
  4.10 Conclusion in Optimization

5 Conclusion and Future work
  5.1 Conclusion
  5.2 Potential Future work

Bibliography

List of Tables

1.1 Comparison of OFDMA Uplink Carrier Allocations in IEEE 802.16-2004 and IEEE 802.16a

2.1 OFDM Advantages and Disadvantages
2.2 OFDMA UL Carrier Allocation
2.3 System Parameters Used in Our Study [14]
2.4 ETSI "Vehicular A" Channel Model in Different Units [23]
2.5 Relation Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz. Subcarrier Spacing is 5.58 kHz

3.1 Functional Units (.L, .S) and Operations Performed [18]
3.2 Functional Units (.M, .D) and Operations Performed [18]
3.3 CIIMessage Header Field [16]
3.4 CIIMessage Data Section Interface [16]

4.1 Q1.14 Bit Fields
4.2 Range of Data Values After Modulation
4.3 Profile of Transmitter Function Blocks
4.4 Profile of Receiver Function Blocks
4.5 Breakdown of Clock Cycles for Three Modulation Functions
4.6 Breakdown of Clock Cycles for framing()
4.7 Breakdown of Clock Cycles for deframing()
4.8 Computational Complexity for FFT algorithm
4.9 Complexity and Efficiency of DSP_fft16x16r and DSP_fft32x32
4.10 Breakdown of Clock Cycles for IFFT()
4.11 Breakdown of Clock Cycles for TX_SRRC()
4.12 Breakdown of Clock Cycles for Modified Code using DSP_fir_gen()
4.13 Breakdown of Clock Cycles for TX_SRRC()
4.14 Complexity and Efficiency of SRRC Filter
4.15 Breakdown of Clock Cycles for sync()
4.16 Complexity and Efficiency of sync()
4.17 Profile of 802.16a UL Transmitter Function Blocks
4.18 Profile of 802.16a UL Receiver Function Blocks

List of Figures

1.1 Frame structures of IEEE 802.16-2004 (top) [5] and IEEE 802.16a (bottom) [1].

2.1 Bandwidth efficiency comparison of traditional FDM and OFDM systems.
2.2 The use of cyclic prefix.
2.3 Carrier allocation of an OFDMA symbol.
2.4 Carrier allocation of an OFDMA symbol (modified from [9]).
2.5 Illustration of carrier usage in OFDMA UL.
2.6 Mapping of FEC blocks to OFDMA subchannels and symbols (from [1]).
2.7 Time plan of one OFDMA frame (from [1]).
2.8 UL transmitter structure.
2.9 UL receiver structure.
2.10 QPSK, 16-QAM, and 64-QAM constellations [1].
2.11 PRBS for generation of data pilots and preamble pilots [1].
2.12 Positioning of the FFT window.
2.13 Three UL signals arrive at different times, and the CP correlation peak may occur between them [11].
2.14 Illustration of UL synchronization in time domain.
2.15 The received samples and the time plan of the UL synchronization.
2.16 Frame structure used in UL synchronization.
2.17 The transition instant for BS to turn around.
2.18 Error distribution under different maximum Doppler shifts.
2.19 Power-delay profile of the multipath channel [14].
2.20 Performance of UL symbol time synchronization: error distribution under different maximum Doppler shifts.

3.1 Quixote-II board [24].
3.2 Block diagram of Quixote-II (from [16]).
3.3 Functional block and CPU (DSP core) diagram [17].
3.4 The C64x CPU block diagram [18].
3.5 DSP streaming mode [16].
3.6 Simple target to host messaging configuration [16].

4.1 Structure of implemented system.
4.2 System structure on transmitter side (modified from [15]).
4.3 System structure on receiver side (modified from [15]).
4.4 Organization of transmitter and receiver using CPU busmastering interface.
4.5 Fixed-point data formats at the transmitter side.
4.6 The fixed-point data formats at the receiver side.
4.7 Code development flow of C6000 (from [19]).
4.8 Software pipeline loop (from [18]).
4.9 Compiler's feedback of the modulation().
4.10 A part of C code in the main function.
4.11 C code for modulation_16QAM().
4.12 A part of the assembly code in the modulation_16QAM().
4.13 Compiler's feedback of the modulation_QPSK().
4.14 Compiler's feedback of the modulation_16QAM().
4.15 Compiler's feedback of the modulation_64QAM().
4.16 C code for original PRBS generator (from [14]).
4.17 C code for generating the carrier locations.
4.18 Two versions of C program for framing of the preamble.
4.19 Two versions of C program for framing of other symbols.
4.20 The resulting assembly code for the revised code.
4.21 Compiler feedback of the optimized code for other symbols.
4.22 Block diagram of the IFFT function.
4.23 Performance of IFFT when the modulation is 16-QAM.
4.24 Performance of IFFT when the modulation is 64-QAM.
4.25 A part of the assembly code in DSP_fft16x16r.
4.26 Implementation of interpolation filter with polyphase decomposition [11].
4.27 Convolution kernel at the boundary of a finite-length sequence.
4.28 C code for convolution with E0(z) and E2(z).
4.29 Compiler's feedback for convolution with E0(z) and E2(z).
4.30 A part of assembly code for convolution with E0(z) and E2(z).
4.31 C code in sync() before optimization.
4.32 C code in sync() after optimization.
4.33 Graphical illustration of c = _dotp2(b,a) [19].
4.34 Compiler's feedback of the code shown in Fig. 4.31.
4.35 Compiler's feedback of the code shown in Fig. 4.32.
4.36 A part of the assembly code in sync().

Chapter 1
Introduction

Orthogonal frequency division multiple access (OFDMA) is a variation of orthogonal frequency division multiplexing (OFDM), which is a special case of multicarrier transmission that transmits one data stream over a number of subchannels. What makes OFDMA different from OFDM is that multiple users can share one OFDM symbol. It combines OFDM with frequency division multiple access (FDMA), but the guard band of each user can be omitted. OFDMA provides a highly flexible and efficient structure for multiuser communication. At present, OFDMA has been proposed for use in wireless broadband multimedia communications systems (WBMCS) in IEEE 802.16a [1] and in cable TV networks [2].

The IEEE 802.16a standard is an extension of the global IEEE 802.16 WirelessMAN standard for 10 to 66 GHz, which was published in April 2002. It provides for fixed broadband wireless access (BWA) between 2 and 11 GHz for non-line-of-sight connections up to 31 miles at speeds up to 70 Mbps. IEEE 802.16a, "Air Interface for Fixed Broadband Wireless Access Systems: Medium Access Control Modifications and Additional Physical Layer Specifications for 2–11 GHz," sets the platform for the extensive deployment of 2 to 11 GHz wireless metropolitan area networks (MANs) as an economical alternative to wireline "first-mile" connections to public networks. "It closes the first-mile gap, giving users an easily installable, wire-free method to access core networks for multimedia applications," states Roger Marks, Chair of the 802.16 Working Group on Broadband Wireless Access [3].

The 802.16d upgrade to the 802.16a standard was approved in June 2004 (and is now named 802.16-2004). It primarily introduces some performance enhancement features in the uplink [4]. It consolidates IEEE Std 802.16, IEEE Std 802.16a, and IEEE Std 802.16c, retaining all modes and major features without adding modes [5]. Table 1.1 compares the OFDMA uplink carrier allocations of IEEE 802.16-2004 and IEEE 802.16a. The number of subchannels is 70 in IEEE 802.16-2004, whereas it is 32 in IEEE 802.16a. The TDD frame structure has also been modified in IEEE 802.16-2004, as shown in Figure 1.1. In IEEE 802.16-2004, each frame begins with a preamble followed by a downlink transmission period and an uplink transmission period. This is quite different from the frame structure in IEEE 802.16a, where a preamble appears only in the uplink subframe. Since the project on which this thesis is based was started in 2002, the algorithms implemented in this work were designed to meet the requirements of IEEE 802.16a. In this thesis, we therefore discuss the DSP implementation and optimization of the IEEE 802.16a uplink system rather than that of IEEE 802.16-2004.

Table 1.1: Comparison of OFDMA Uplink Carrier Allocations in IEEE 802.16-2004 and IEEE 802.16a

Parameters                                IEEE 802.16-2004    IEEE 802.16a
Number of DC subcarriers                  1                   1
N_subchannels                             70                  32
N_used                                    1681                1696
Number of data carriers per subchannel    48                  48
Guard subcarriers: left, right            184, 183            176, 175

Generally speaking, the basic elements of a communication system can be divided into two parts, a transmitter and a receiver. In our case, we implement the uplink transmitter-receiver pair, which consists of the transmitter in the subscriber station (SS) and the receiver in the base station (BS). Our work is mainly based on the simulation program in [14], to which we add the FEC (forward-error-correcting coding) encoder and decoder of [15]. We briefly introduce these two references. Reference [14] develops the uplink synchronization scheme on a digital signal processor; that work also includes the implementation of the framing/deframing structure, the IFFT/FFT block, and the TX/RX SRRC filter. Reference [15] focuses on the implementation of the FEC/FED (forward-error-correcting decoding) scheme of IEEE 802.16a on the Innovative Integration (II) Quixote DSP board.

The hardware environment of our system involves one host personal computer (PC) and a digital signal processing (DSP) chip housed on Innovative Integration's Quixote PC plug-in card. The DSP chip, the Texas Instruments TMS320C6416, "meets the need for today's high processing speed for digital data transmission" [18]. It uses an advanced very long instruction word (VLIW) architecture, VelociTI, which allows the functional units to work in parallel so that the execution time can be greatly reduced. In our system, we use the busmastering interface [16] provided by Quixote for host-to-target communication.

A major issue dealt with in this work, besides system integration, is the efficiency of the DSP software implementation, which we improve with various optimization techniques. Our optimization aims to accelerate the execution speed of the programs.

This thesis is organized as follows. In Chapter 2, we introduce the IEEE 802.16a TDD OFDMA uplink transmission scheme. Chapter 3 introduces the DSP platform. Chapter 4 discusses the DSP optimization methods and presents the optimization results. Finally, in Chapter 5 we give a conclusion and point out some potential future work.

Fig. 1.1: Frame structures of IEEE 802.16-2004 (top) [5] and IEEE 802.16a (bottom) [1].

Chapter 2
The IEEE 802.16a TDD OFDMA Uplink Transmission Scheme

In this chapter, we first introduce some basic concepts regarding OFDM and OFDMA. Then, we give a brief overview of the relevant specifications in IEEE 802.16a and describe our transmission and reception schemes in detail.

2.1 Introduction to OFDM

Orthogonal frequency division multiplexing (OFDM) is a special case of multicarrier transmission, in which a single data stream is transmitted over a number of lower-rate subcarriers [6]. It has been successfully applied in many digital communication systems in recent years. The concept of OFDM is to use parallel data transmission and frequency multiplexing. It divides the available spectrum into several narrow subcarrier bands, and each subcarrier transmits only part of the information. The orthogonality of OFDM constitutes one major difference from the classical parallel data system, and it makes the use of the available spectrum more efficient. Figure 2.1 shows the difference. The subcarriers in an OFDM symbol can be arranged so that their sidebands overlap while the symbols can still be received without interference from adjacent subcarriers. This can be accomplished by using the discrete Fourier transform (DFT), as proposed by Weinstein and Ebert in 1971 [7].
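The orthogonality described above can be checked numerically. The following is a minimal sketch, not code from the thesis: it uses an illustrative symbol length of 64 (the system studied here uses 2048), and the helper names `subcarrier`, `inner_product`, `idft`, and `dft` are hypothetical. It shows that two distinct DFT subcarriers have zero inner product over one symbol, and that data placed on the subcarriers by an inverse DFT are recovered by a forward DFT.

```python
import cmath


def subcarrier(k, n_fft):
    """Samples of the k-th complex exponential subcarrier over one symbol."""
    return [cmath.exp(2j * cmath.pi * k * n / n_fft) for n in range(n_fft)]


def inner_product(x, y):
    """Inner product of two sequences, normalized by their length."""
    return sum(a * b.conjugate() for a, b in zip(x, y)) / len(x)


def idft(freq):
    """Direct inverse DFT (N^2 operations): one data value per subcarrier."""
    n_fft = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * n / n_fft)
                for k in range(n_fft)) / n_fft
            for n in range(n_fft)]


def dft(samples):
    """Direct forward DFT (N^2 operations), as applied at the receiver."""
    n_fft = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)]


N_FFT = 64  # illustrative size only (assumption), not the 2048 of the standard

same = abs(inner_product(subcarrier(5, N_FFT), subcarrier(5, N_FFT)))
adjacent = abs(inner_product(subcarrier(5, N_FFT), subcarrier(6, N_FFT)))
print(f"same subcarrier: {same:.3f}, adjacent subcarriers: {adjacent:.3e}")

# QPSK-like values on 8 subcarriers survive the IDFT/DFT round trip.
data = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
recovered = dft(idft(data))
max_error = max(abs(r - d) for r, d in zip(recovered, data))
print(f"round-trip error: {max_error:.3e}")
```

The direct transforms are written out deliberately so that each subcarrier is visible as a complex exponential; a practical implementation would use an FFT routine instead.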

Fig. 2.1: Bandwidth efficiency comparison of traditional FDM and OFDM systems.

The direct computation of the DFT, however, is expensive. Fortunately, the fast Fourier transform (FFT) provides a more efficient implementation of the DFT, and modern advances in very-large-scale integration (VLSI) make its use practical. The complexity is reduced from $N^2$ for the DFT to $N \log_2 N$ for the FFT.

One of the most important reasons to use OFDM is that it deals with multipath delay spread efficiently. This is achieved by introducing a guard time for every OFDM symbol so that intersymbol interference (ISI) can be eliminated. The guard time is chosen to be larger than the expected delay spread, so that multipath components from one symbol cannot interfere with the next symbol. However, if the guard time is filled with zeros, the orthogonality among subcarriers no longer holds, which causes serious intercarrier interference (ICI). To preserve the orthogonality among subcarriers and eliminate ICI, the OFDM symbol should be cyclically extended in the guard time rather than padded with zeros. Hence the guard time is usually called the cyclic prefix (CP). Figure 2.2 shows how the cyclic prefix is added in front of an OFDM symbol. If the maximum multipath delay is smaller than the guard time, the delayed replicas of the OFDM symbol still have an integer number of cycles within the FFT interval. Consequently, multipath signals whose delay spread is smaller than the guard time cause neither ICI nor ISI.

Fig. 2.2: The use of cyclic prefix.

Table 2.1: OFDM Advantages and Disadvantages

OFDM Advantages                  OFDM Disadvantages
Bandwidth efficiency             Peak power problem
Resistant to multipath effect    SNR loss
Efficient implementation         Sensitive to frequency offset and phase noise

The advantages and disadvantages of OFDM are summarized in Table 2.1. The bandwidth efficiency is shown in Figure 2.1: almost 50% of the bandwidth is saved. The resistance to multipath effects has been discussed above. Efficient implementation means that OFDM can be realized with the FFT and IFFT instead of the many sinusoidal generators and coherent demodulators required in a parallel system.

The disadvantages of OFDM are a high peak-to-average power ratio (PAPR), a loss of signal-to-noise ratio (SNR), and sensitivity to frequency offset and phase noise. The high PAPR of OFDM signals can increase the complexity of the analog-to-digital and digital-to-analog converters and reduce the efficiency of the RF power amplifier. The SNR loss is due to the insertion of the guard time, which reduces the efficiency in bandwidth and power. Because the subcarriers overlap, OFDM is very sensitive to frequency offset and phase noise.

2.2 Overview of OFDMA

For network applications where a base station communicates with multiple subscribers, the system resources must be partitioned among the subscribers to provide "multiple access" services. The spectrum must be shared to achieve high capacity by simultaneously allocating the available bandwidth to multiple users. For high-quality communication, this must be done without severe degradation in the performance of the system. At present, it is often realized with time division multiple access (TDMA), frequency division multiple access (FDMA), or spread spectrum multiple access (SSMA), which is also referred to as code division multiple access (CDMA).

OFDMA, also referred to as multi-user OFDM, is now considered one of the most promising multiple access methods for fourth-generation wireless networks. OFDM or OFDMA is currently the modulation of choice for high-speed data access systems such as IEEE 802.11a/g wireless LAN (WiFi) and IEEE 802.16a/d/e wireless broadband access systems (perhaps more widely known as WiMAX).

In present OFDM systems, multiple access can be supported by time division or frequency division, and only one user at a time is allowed to transmit data on all of the subcarriers. This scheme does not exploit the fact that different users see different wireless channels. OFDMA, in contrast, allows multiple users to transmit on different subcarriers of an OFDM symbol concurrently. Figure 2.3 shows an example with four users, in which the subcarriers carry the information of different users. Because the probability that all users experience a deep fade on the same subcarriers is very low, subcarriers can be assigned to the users who see good channel gains on them.

Fig. 2.3: Carrier allocation of an OFDMA symbol.

Fig. 2.4 shows an example carrier allocation of an OFDMA symbol. The frequency response of a typical broadband wireless channel is also depicted. In this example, the deep-fading condition and narrowband interference are also considered. In the top plot, we see that when the channel is in a deep fade, the affected subcarriers are not sufficiently energy efficient to carry information. These otherwise wasted subcarriers can be utilized in OFDMA, thus achieving higher efficiency and capacity. Very few, if any, subcarriers are wasted in OFDMA, since no particular subcarrier is likely to be bad for all users.

Fig. 2.4: Carrier allocation of an OFDMA symbol (modified from [9]).

2.3 Overview of the IEEE 802.16a Standard

For years, service providers have faced a continuing challenge in satisfying the growing demand for broadband wireless access (BWA) in underserved business and residential markets [8]. They are seeking a solution for building systems that support infrastructure build-outs comparable to cable, digital subscriber lines (DSL), and fiber. Recently, the IEEE 802.11x or WiFi wireless LAN technology has been used in BWA applications; however, it is evidently not well suited to outdoor BWA use because of its limited capacity in terms of bandwidth and subscribers, its range, and other issues [8].

The IEEE conducted a multi-year effort to develop this standard, culminating in final approval of the 802.16a air-interface specification in January 2003. The 802.16a standard delivers carrier-class performance in terms of robustness and QoS and has been designed from the ground up to deliver a suite of services over a scalable, long-range, high-capacity "last mile" wireless communications system for carriers and service providers around the world [8]. The 802.16a standard specifies a protocol that, among other things, supports low-latency applications such as voice and video, provides broadband connectivity without requiring a direct line of sight between subscriber terminals and the base station, and supports hundreds if not thousands of subscribers from a single BS [8].

IEEE 802.16a is an amendment of the 802.16 standard that covers frequency bands in the range between 2 and 11 GHz, and it specifies a metropolitan area networking protocol that enables a wireless alternative to cable, DSL, and T1-level services for last-mile broadband access [8]. The major reason for using the 2–11 GHz bands is that they can support non-line-of-sight (NLOS) operation. The longer wavelengths allow for non-directional NLOS operation with the ability to serve much broader geographic regions, allowing underserved customers to take advantage of this technology. Compared to the higher frequencies, such spectra offer the opportunity to reach many more customers less expensively, although at generally lower data rates [10].

The 2–11 GHz spectrum does not require line of sight or directionality, and therefore requires multiplexing techniques that support multipath propagation. Because residential applications are expected, rooftops may be too low for a clear sight line to a BS antenna, so significant multipath propagation is expected [10]. As a result, 802.16a made major changes to the PHY layer specification, which includes a single-carrier PHY, a 256-point FFT OFDM PHY, and a 2048-point FFT OFDMA PHY, to address the needs of the 2–11 GHz bands. In this thesis, we consider the 2048-point FFT OFDMA PHY.

The terms used frequently in the following sections are introduced here. The direction of transmission from the base station (BS) to the subscriber station (SS) is called the downlink (DL), and the opposite direction is the uplink (UL). The SS is usually known as the mobile station or the user. The BS is a generalized equipment set providing connectivity, management, and control of the SS.

2.3.1 UL Carrier Allocation

The number of subcarriers in one OFDMA symbol is 2048. These carriers are divided into three types: data carriers for data transmission, pilot carriers for various estimation purposes, and null carriers (guard bands and the DC carrier), which transmit nothing at all. The data and pilot carriers together are termed the used carriers because they transmit useful information. The allocation for the UL is shown in Fig. 2.5. Among the 2048 subcarriers, there are 1696 used carriers, composed of 1536 data carriers and 160 pilot carriers. The remaining 352 subcarriers are null carriers: the guard bands at the two edges of the symbol and one DC carrier in the middle of the band of the OFDMA symbol.
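The carrier counts above can be cross-checked with a few lines of arithmetic. This is a minimal sketch rather than part of the thesis software: the function name `ul_carrier_budget` is hypothetical, and the guard-band sizes of 176 and 175 carriers are taken from the IEEE 802.16a column of Table 1.1.

```python
def ul_carrier_budget(n_fft=2048, n_data=1536, n_pilot=160,
                      guard_left=176, guard_right=175, n_dc=1):
    """Check that used and null carriers add up to the FFT size."""
    n_used = n_data + n_pilot                   # carriers that transmit something
    n_null = guard_left + guard_right + n_dc    # guard bands plus DC carrier
    if n_used + n_null != n_fft:
        raise ValueError(
            f"allocation is inconsistent: {n_used} used + {n_null} null "
            f"!= {n_fft}")
    return {"used": n_used, "null": n_null, "used_fraction": n_used / n_fft}


budget = ul_carrier_budget()
print(budget)
```

With the default values the function reports 1696 used carriers and 352 null carriers, so about 83% of the 2048 subcarriers carry data or pilots.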

Fig. 2.5: Illustration of carrier usage in OFDMA UL. (Figure labels: guard bands at both edges, DC carrier, pilots, groups 1 to 53 of 32 used carriers each including pilot carriers, subchannels 1 and 2; the 1696 used carriers = 1536 data carriers + 160 pilot carriers.)

In 802.16a, the used subcarriers are divided into 32 subchannels, where each subchannel contains 48 data carriers, 1 fixed-location pilot carrier, and 4 variable-location pilot carriers. The carrier allocation for the UL is listed in Table 2.2. The carrier indexes of the fixed-location pilots never change from symbol to symbol. The variable-location pilots, however, shift their locations every symbol, with a period of 13 symbols, according to $L_k = 0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11$ for $k = 0$ to $12$. $L_k$ is the number of carrier positions by which the pilots are shifted to the right relative to their locations for $k = 0$. For $k = 0$, the variable-location pilots are positioned at indexes 0, 13, 27, and 40. For other values of $k$, these locations change by adding $L_k$ to each index.

2.3.2 OFDMA Data Mapping

A PHY burst in OFDMA is allocated a group of contiguous subchannels in a group of contiguous OFDMA symbols, using an FEC block as the unit. Note that one FEC block spans one OFDMA subchannel in the subchannel axis and three OFDM symbols in the time axis. Fig. 2.6 illustrates the order in which FEC blocks are mapped to OFDMA subchannels and OFDM symbols [1].
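Returning to the variable-location pilots of Section 2.3.1, the shifting rule can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis code: the indexes are taken to be relative positions within one 53-carrier subchannel, symbols are numbered from 0, and the name `variable_pilot_indexes` is hypothetical.

```python
# Shift sequence L_k for k = 0..12, as listed in Section 2.3.1.
L_SHIFTS = (0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11)
# Variable-location pilot indexes for k = 0.
BASE_PILOT_INDEXES = (0, 13, 27, 40)
# 48 data + 1 fixed pilot + 4 variable pilots (assumed index range 0..52).
CARRIERS_PER_SUBCHANNEL = 53


def variable_pilot_indexes(symbol_number):
    """Variable-location pilot indexes for a symbol (period of 13 symbols)."""
    shift = L_SHIFTS[symbol_number % len(L_SHIFTS)]
    return [base + shift for base in BASE_PILOT_INDEXES]


for symbol_number in range(3):
    print(symbol_number, variable_pilot_indexes(symbol_number))

# Collect every index visited by a variable-location pilot over one period.
visited = {index
           for k in range(len(L_SHIFTS))
           for index in variable_pilot_indexes(k)}
never_visited = sorted(set(range(CARRIERS_PER_SUBCHANNEL)) - visited)
print("indexes never used by a variable-location pilot:", never_visited)
```

Under these assumptions, exactly one index in the subchannel is never occupied by a variable-location pilot over a full period, which is consistent with the entry in Table 2.2 stating that no variable-location pilot coincides with a fixed-location pilot.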

Table 2.2: OFDMA UL Carrier Allocation

Parameter                                                    UL Value
Number of DC carriers                                        1
Number of guard carriers, left                               176
Number of guard carriers, right                              175
N_used, number of used carriers                              1696
Total number of carriers                                     2048
N_varLocPilots                                               128
Number of fixed-location pilots                              32
Number of variable-location pilots which coincide
  with fixed-location pilots                                 0
Total number of pilots                                       160
Number of data carriers                                      1536
N_subchannels                                                32
N_subcarriers per subchannel                                 53
Number of data carriers per subchannel                       48

2.3.3 OFDMA Frame Structure for TDD

802.16a is designed to operate in the frequency band between 2 and 11 GHz. The duplexing method of an OFDMA system in this band shall be either frequency division duplexing (FDD) or time division duplexing (TDD) in licensed bands, and TDD in license-exempt bands. We consider the TDD mode in this thesis, since TDD is better suited to data communications, whose traffic is often highly asymmetric. The flexibility of TDD permits efficient allocation of the available traffic transport capacity, so the ratio of uplink to downlink traffic can vary with time. Fig. 2.7 shows the frame structure of TDD OFDMA. A frame consists of one DL subframe and one UL subframe, which are transmitted by the BS and the SS, respectively. The allowed duration of a frame is from 2 to 20 ms and is specified by the frame duration code. A subframe contains several transmission bursts, which are composed of multiple FEC blocks. In each frame, the Tx/Rx transition gap (TTG) and the Rx/Tx transition gap (RTG) shall be inserted between the downlink and the uplink and at the end of each frame, respectively, to allow the BS and the SS to turn around. TTG and RTG shall be at least

5 µsec and an integer multiple of four samples in duration [1]. From the UL-MAPs, the SSs know their usable subchannels and transmission time. The first symbol is an all-pilot preamble, which the SS should send on all its allocated subchannels. The UL subframe contains 3N + 1 symbols, where N is a positive integer: one symbol for the preamble and 3N symbols for data bursts.

Fig. 2.6: Mapping of FEC blocks to OFDMA subchannels and symbols (from [1]).

2.4 Transmitter-Receiver System Architecture

The UL transmitter is shown in Fig. 2.8. For each SS, the transmitted data are first scrambled, FEC encoded, and then interleaved. The constellation mapper then maps the data to Gray-mapped QPSK, 16-QAM, or 64-QAM, depending on the selected modulation type. The framing is used to arrange the coded data, MAPs, preamble

and pilots to the corresponding subchannels following the specification of used carrier allocation. After framing, the used carriers and null carriers are allocated properly and fed into the 2048-point IFFT block in parallel. The IFFT results are output sequentially and shaped by the interpolator block, which is composed of a 4× upsampler and a low-pass filter (LPF). The receiver is shown in Fig. 2.9. It is in some sense a modified reverse of the transmitter, with a synchronizer and a channel estimator added. In the following subsections, we introduce the modulation and the TX/RX SRRC filter.

Fig. 2.7: Time plan of one OFDMA frame (from [1]).

2.4.1 Modulation

Data Modulation

Gray-mapped QPSK and 16-QAM must be supported by any compliant transceiver, whereas the support of 64-QAM is optional. The constellations shown in Fig. 2.10 shall be normalized by multiplying the constellation points by the indicated factor c (shown in Fig. 2.10) to achieve equal average power. The constellation-mapped data shall be subsequently modulated onto the allocated data carriers.
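The normalization factor c follows directly from the requirement of equal (unit) average power. The sketch below computes it under the assumption that the unnormalized constellation points lie on the usual odd-integer grid (±1, ±3, ...) on each axis; the Gray bit labelling is not modeled, and all names are illustrative.

```python
import math


def axis_levels(bits_per_axis):
    """Odd-integer amplitude levels on one axis, e.g. [-3, -1, 1, 3]."""
    m = 2 ** bits_per_axis
    return [2 * i - (m - 1) for i in range(m)]


def normalization_factor(bits_per_axis):
    """Factor c that scales a square constellation to unit average power.

    bits_per_axis = 1, 2, 3 corresponds to QPSK, 16-QAM, 64-QAM.
    """
    levels = axis_levels(bits_per_axis)
    mean_axis_power = sum(level * level for level in levels) / len(levels)
    # I and Q axes contribute equally to the average symbol power.
    return 1.0 / math.sqrt(2.0 * mean_axis_power)


for name, bits in (("QPSK", 1), ("16-QAM", 2), ("64-QAM", 3)):
    print(name, normalization_factor(bits))
```

Under this assumption the factors evaluate to 1/√2, 1/√10, and 1/√42 for QPSK, 16-QAM, and 64-QAM, respectively.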

Fig. 2.8: UL transmitter structure.

Fig. 2.9: UL receiver structure.

Fig. 2.10: QPSK, 16-QAM, and 64-QAM constellations [1].

Pilot Modulation

There are two types of pilots to be modulated: data pilots and preamble pilots. Both are generated using the PRBS generator in Fig. 2.11 with initialization vector [1 0 1 0 1 0 1 0 1 0 1] for the UL.

Fig. 2.11: PRBS for generation of data pilots and preamble pilots [1].

1. Data Pilot Modulation

Each pilot shall be transmitted with a boosting of 2.5 dB over the average power of each data tone. The pilot carriers shall be modulated according to the following formulas:

$$\Re\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \qquad \Im\{c_k\} = 0, \tag{2.4.1}$$

where w_k is the sequence produced by the PRBS generator, and k corresponds to the carrier index.

2. Preamble Pilot Modulation

The first UL OFDMA symbol shall be an all-pilot preamble. The pilots shall not be boosted and shall be modulated according to the following formulas:

$$\Re\{c_k\} = 2\left(\frac{1}{2} - w_k\right), \qquad \Im\{c_k\} = 0. \tag{2.4.2}$$
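The two pilot mappings can be illustrated with a short sketch that takes a PRBS bit w_k as input; the PRBS generator itself (Fig. 2.11) is not modeled here. The function names are illustrative, and the boost calculation assumes that the data tones have unit average power after constellation normalization.

```python
import math


def data_pilot(w):
    """Data pilot value per (2.4.1): real part (8/3)(1/2 - w), zero imaginary part."""
    return complex((8.0 / 3.0) * (0.5 - w), 0.0)


def preamble_pilot(w):
    """Preamble pilot value per (2.4.2): real part 2(1/2 - w), zero imaginary part."""
    return complex(2.0 * (0.5 - w), 0.0)


def power_db(value):
    """Power of a pilot relative to a unit-power data tone, in dB."""
    return 10.0 * math.log10(abs(value) ** 2)


for w in (0, 1):
    print(w, data_pilot(w), preamble_pilot(w))
print("data pilot boost (dB):", power_db(data_pilot(0)))
```

A bit of 0 maps to +4/3 and a bit of 1 to -4/3 for data pilots, whose power 16/9 corresponds to the stated boost of about 2.5 dB; preamble pilots map to +1 and -1 and are therefore not boosted.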

2.4.2 TX/RX SRRC Filter

We briefly introduce the SRRC filter based on [11] here. To avoid the complexity of an ideal lowpass filter and to simulate path delays at non-integer sample times, an interpolator is added to the transmitter to yield a 4-times oversampled transmitter output. The square root raised cosine (SRRC) filter is used as the lowpass interpolation filter. The impulse response of this filter is given by

$$\mathrm{SRRC}(t) = \frac{\sin\left(\pi \frac{t}{T_{sample}}(1-\alpha)\right) + 4\alpha \frac{t}{T_{sample}} \cos\left(\pi \frac{t}{T_{sample}}(1+\alpha)\right)}{\pi \frac{t}{T_{sample}}\left(1 - \left(4\alpha \frac{t}{T_{sample}}\right)^{2}\right)},$$

where α is the roll-off factor. One reason for adopting the SRRC filter is that the transmitter and receiver filters are then matched to each other and no inter-sample interference is introduced in the receiver when it is fully synchronized. The SRRC filter used here has a roll-off factor of 0.155 and 57 taps, chosen to satisfy the power mask specified in 802.16a [11].

2.5 UL Synchronization Problems

Before the receiver can demodulate the subcarriers, it has to perform the synchronization task, since OFDM systems can be extremely sensitive and vulnerable to synchronization errors. There are three major kinds of synchronization tasks:

1. Symbol synchronization [12]
Its purpose is to find the correct position of the fast Fourier transform (FFT) window. Any misalignment of the FFT window will result in an evolving phase shift in the frequency domain symbols, leading to BER degradation. If the timing errors are so large that the FFT window of the receiver includes samples outside the data and guard intervals of the current OFDMA symbol, then consecutive OFDMA symbols interfere, severely affecting the system's performance. Fig. 2.12(a) shows the correct FFT window. Fig. 2.12(b) shows an early FFT window that includes

samples of the data segment and the guard interval. Fig. 2.12(c) depicts a delayed FFT window that overlaps with the next OFDMA symbol. The second case (early window) will not introduce any interference, but the third (delayed window) is detrimental to the performance.

Fig. 2.12: Positioning of the FFT window.

2. Sampling clock synchronization
Its purpose is to align the receiver sampling clock frequency to that of the transmitter. Sampling clock errors can cause ICI. In addition, a sampling clock frequency error can result in a drift in the symbol timing and can further worsen the symbol synchronization problem. In this thesis, we assume that the sampling clocks of the users and the base station are identical.

3. Carrier synchronization
Carrier frequency offset can give rise to a shift of all the subcarriers and result in not only ICI but also multiple access interference (MAI). It is caused by the difference between the local oscillators of the transmitter and the receiver, or by the Doppler spread introduced by motion. Carrier synchronization is a complex problem in the UL system, since all users share the total set of subcarriers and each user has its own carrier frequency offset.

In our system, the synchronization scheme is subject to the specifications of 802.16a. Thus we assume that after a successful initial synchronization and ranging, the mobile enters the time and frequency grid with a low offset in time and frequency [11]. Hence

no frequency synchronization is done in normal UL transmission. While this assumption may be suitable for fixed BS and SS, it is certainly debatable for multipath fading channels. However, for simplicity we leave its further consideration to future work.

2.6 UL Synchronization

The above discussion of UL synchronization motivates our doing timing synchronization only. We now introduce the technique used in our UL synchronization, namely the detection of the symbol start time. Our synchronization task is to find the first coming symbol. Different users' transmitted signals may not arrive at the same time, and the correlation peak may occur between them, as shown in Fig. 2.13 for an example of three users. If we use the detected peak location as the symbol start time, the corresponding useful time will include a part of the guard interval of the next symbol for the earlier arriving signals. Therefore, we have to find the exact instant of the first arriving signal to avoid ISI.

Fig. 2.13: Three UL signals arrive at different times, and the CP correlation peak may occur between them [11].

Since the subchannels are composed of subcarriers which are orthogonal to one another, we can assume that the orthogonality property still holds among subchannels unless the received signals from different users are subject to significantly different carrier

offsets. After passing through the IFFT, the time domain signals which occupy different subchannels in the frequency domain are uncorrelated if the channel has zero delay spread. Since the first coming symbol is an all-pilot preamble, the BS knows the exact values of each user's signal. Therefore, the signal transmitted by each SS in the UL preamble is deterministic, and the BS can generate the same time domain signals as all SSs by taking the IFFT. Fig. 2.14 shows a block diagram depicting how the synchronization works. The received samples are correlated with the reference data string, which results from passing the preamble through the IFFT block. When the next sample arrives, we recompute the correlation. The start and stop times of the correlation are illustrated in Fig. 2.15. The start time is determined by when the BS turns around to receive signals. Note that this time shall be in the TTG interval. As the user arrival time may vary by as much as 50% of the guard interval, we stop the correlation up to 50% of the guard interval earlier than the corresponding detected useful time. The peak locations of the different SSs are then compared as follows. Each correlator uses a distinct preamble, so its peak location gives the arrival time of the corresponding SS. Finally, we compare all these peaks and take the earliest as the start location of the first coming signal.

Fig. 2.14: Illustration of UL synchronization in time domain.
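The procedure above can be sketched on a toy scale as follows. This is not the thesis implementation: it uses a 32-point transform instead of 2048, two users on interleaved carriers with an all-ones preamble (an assumption made only so the result is easy to verify by hand), no cyclic prefix, no noise, and no multipath. All function names are illustrative.

```python
import cmath


def preamble_reference(n_fft, carriers):
    """Time-domain reference: IDFT of a preamble equal to 1 on `carriers`."""
    return [
        sum(cmath.exp(2j * cmath.pi * k * t / n_fft) for k in carriers) / n_fft
        for t in range(n_fft)
    ]


def correlation_peak(received, reference, start, stop):
    """Lag in [start, stop) that maximizes the correlation magnitude."""
    best_lag, best_mag = start, -1.0
    for lag in range(start, stop):
        acc = sum(received[lag + i] * reference[i].conjugate()
                  for i in range(len(reference)))
        if abs(acc) > best_mag:
            best_lag, best_mag = lag, abs(acc)
    return best_lag


def first_arrival(received, references, start, stop):
    """One correlator per SS; the earliest peak is the first coming signal."""
    peaks = [correlation_peak(received, ref, start, stop) for ref in references]
    return min(peaks), peaks


# Toy scenario: SS1 on even carriers arrives at sample 10, SS2 on odd
# carriers arrives at sample 14.
N = 32
ref1 = preamble_reference(N, range(0, N, 2))
ref2 = preamble_reference(N, range(1, N, 2))
received = [0j] * 80
for i in range(N):
    received[10 + i] += ref1[i]
    received[14 + i] += ref2[i]

start_time, peaks = first_arrival(received, [ref1, ref2], 0, 40)
print(start_time, peaks)
```

The search window `[start, stop)` plays the role of the start and stop times of the correlation; in a real receiver the correlation would be computed in fixed-point arithmetic and updated sample by sample.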

Fig. 2.15: The received samples and the time plan of the UL synchronization.

2.7 UL Synchronization Result

2.7.1 Simulation Parameters and Environments

Table 2.3 specifies the transmission parameters for our simulation. The uplink and downlink use the same frequency bands. The intercarrier spacing is thus 5.58 kHz and the symbol length (without cyclic prefix) is 179.2 µsec. In this section, we select the channel environment defined by ETSI for the evaluation of UMTS radio interface proposals. We employ the multipath ETSI "Vehicular A" channel model given in Table 2.4. The SNR is chosen to be 10 dB in the fading channels. Note that the receiver SNR specified in 802.16a ranges from 9.4 dB to 24.4 dB, so 10 dB, which is almost the worst condition, is a reasonable value for simulation. The maximum Doppler shifts of our simulation are shown in Table 2.5 for speeds from 0 to 100 km/hr.

Table 2.3: System Parameters Used in Our Study [14]

Parameter                            Value
Number of carriers (N)               2048
Center frequency                     6 GHz
Uplink / Downlink bandwidth (BW)     10 MHz
Carrier spacing (∆f)                 5.58 kHz
Sampling frequency (f_s)             11.43 MHz
OFDM symbol time (T_s)               201.6 µsec (2304 samples)
Useful time (T_b)                    179.2 µsec (2048 samples)
Cyclic prefix time (T_g)             22.4 µsec (256 samples)
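The entries of Table 2.3 are mutually consistent, which the following sketch verifies. It assumes that the sampling frequency is 8/7 of the channel bandwidth, an assumption that reproduces the tabulated 11.43 MHz; the function name is illustrative.

```python
from fractions import Fraction


def derived_parameters(bandwidth_hz=10_000_000, n_fft=2048, cp_samples=256):
    """Derive the timing parameters of Table 2.3 from BW, N, and the CP length.

    Assumption: f_s = (8/7) * BW, which rounds to the 11.43 MHz of Table 2.3.
    """
    fs = Fraction(8, 7) * bandwidth_hz
    spacing = fs / n_fft                 # carrier spacing, Hz
    useful = Fraction(n_fft) / fs        # useful time T_b, seconds
    guard = Fraction(cp_samples) / fs    # cyclic prefix time T_g, seconds
    return {
        "fs_mhz": float(fs) / 1e6,
        "spacing_khz": float(spacing) / 1e3,
        "useful_us": float(useful) * 1e6,
        "guard_us": float(guard) * 1e6,
        "symbol_us": float(useful + guard) * 1e6,
    }


for name, value in derived_parameters().items():
    print(f"{name}: {value:.2f}")
```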

Fig. 2.16: Frame structure used in UL synchronization.

Fig. 2.16 shows the frame structure used in the UL synchronization simulation, where SS1 transmits UL burst 1 using 8 subchannels and SS2 transmits UL burst 2 using 16 subchannels. The arriving times of burst 1 and burst 2 differ by 11.25% of the guard time, which is 16 samples. No ranging subchannel is allocated. Note that the start time of the preamble correlation is chosen to be 76 samples earlier than the UL subframe, and the stop time is 128 samples after the starting instant of the UL subframe. Recall that the TTG is used for the BS to turn around (from TX to RX). It is reasonable to assume that the transition instant is approximately at the midpoint of the TTG. The TTG is 136 samples long, and we assume that the transition occurs 60 samples after the start of the TTG, that is, 76 samples before the UL subframe. Figure 2.17 illustrates the transition instant for the BS to turn around. The reason for the stop time is as follows. According to the IEEE 802.16a standard, all SSs shall acquire and adjust their timing such that all uplink OFDM symbols arrive time coincident at the base station to an accuracy of 50% of the minimum guard interval or better. Therefore, we assume that all SSs arrive before the stop time, 50% of the guard

interval after the start time of the uplink subframe.

Fig. 2.17: The transition instant for BS to turn around.

Table 2.4: ETSI "Vehicular A" Channel Model in Different Units [23]

       Relative delay                              Average power
Tap    (nsec)   (samples, 4×    (samples,     (dB)     (normal    (normalized)
                oversampling)   normal)                scale)
1      0        0               0             0        1.0000     0.4850
2      310      14              3 or 4        -1.0     0.7943     0.3852
3      710      32              8             -9.0     0.1259     0.0610
4      1090     50              12 or 13      -10.0    0.1000     0.0485
5      1730     79              20            -15.0    0.0316     0.0153
6      2510     115             29            -20.0    0.0100     0.0049

2.7.2 UL Synchronization

Fig. 2.18 shows the symbol time synchronization errors of the first coming signal under different Doppler spreads. If the Doppler shift is zero (speed = 0 km/hr), we can always detect the correct symbol start time of the first coming signal. Another interesting result is that when the speed increases, the distribution of the time synchronization errors is closely related to the power-delay profile of the multipath channel. Fig. 2.19 depicts the power-delay profile of the simulated channel with normal sample numbers and with normalized average power (see Table 2.4). Fig. 2.20 shows the resulting time synchronization error distribution. Comparing these two figures, we see that the different time offsets obtained at the synchronizer output almost coincide with the sample numbers of the multipath delays. Furthermore, the occurrence probabilities at

the different time offsets are proportional to the relative average power of the paths. Note that the Doppler shift has no obvious effect on this synchronization scheme except when it is very small. As the correlation is done for each SS, we can detect the arriving time of each SS. We find that the timing error distributions of the late arriving SS are almost the same as those of the first coming SS. No matter when the signal arrives, the synchronization performance shows no significant difference. In summary, we can detect the start times of all signals from different SSs, and this information could be helpful to channel estimation.

Table 2.5: Relation Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz. Subcarrier Spacing is 5.58 kHz

Speed (km/hr)    Doppler shift (Hz)    f_d T_s
0                0                     0
20               111                   0.0224
40               222                   0.0448
60               333                   0.0672
80               444                   0.0896

Fig. 2.18: Error distribution under different maximum Doppler shifts.
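The entries of Table 2.5 follow from the usual relation f_d = v f_c / c between speed and maximum Doppler shift. The sketch below reproduces them, assuming c = 3 × 10^8 m/s and the OFDM symbol time T_s = 201.6 µsec of Table 2.3; the function names are illustrative.

```python
SPEED_OF_LIGHT = 3.0e8          # m/s, rounded value (assumption)
CARRIER_HZ = 6.0e9              # center frequency from Table 2.3
SYMBOL_TIME_S = 201.6e-6        # OFDM symbol time T_s from Table 2.3


def max_doppler_hz(speed_kmh, carrier_hz=CARRIER_HZ):
    """Maximum Doppler shift f_d = v * f_c / c, with v given in km/hr."""
    speed_ms = speed_kmh / 3.6
    return speed_ms * carrier_hz / SPEED_OF_LIGHT


def normalized_doppler(speed_kmh, symbol_time_s=SYMBOL_TIME_S):
    """Doppler shift normalized to the OFDM symbol time, f_d * T_s."""
    return max_doppler_hz(speed_kmh) * symbol_time_s


for speed in (0, 20, 40, 60, 80):
    print(speed, round(max_doppler_hz(speed)), round(normalized_doppler(speed), 4))
```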

Fig. 2.19: Power-delay profile of the multipath channel [14].

Fig. 2.20: Performance of UL symbol time synchronization: error distribution under different maximum Doppler shifts.

Chapter 3
Introduction to the DSP Implementation Platform

In this chapter, we introduce the DSP platform used in our implementation. The platform includes a DSP board, a DSP core, and the communication mechanism between the host PC and the DSP target.

3.1 DSP Board [16]

The DSP board, Quixote-II, is a 64-bit cPCI 6U board for advanced signal capture, generation, and co-processing. Figure 3.1 shows a picture of the board. Quixote-II combines one TI (Texas Instruments) TMS320C6416 DSP with a Xilinx Virtex-II FPGA, providing processing flexibility, efficiency, and high performance. The block diagram of Quixote-II is shown in Figure 3.2. On our board, the FPGA is a six-million-gate device. The board's primary features are as follows:

1. A 600 MHz 32-bit fixed-point TMS320C6416 DSP that offers processing power of 4800 MIPS.
2. An onboard 32 MB SDRAM for the DSP chip, with advanced cache controllers.
3. A 64/32-bit 33 MHz PCI interface for busmastering data between the card and the memory.

4. 14-bit 105 MSPS I/Q input channels and output channels for A/D and D/A.

Fig. 3.1: Quixote-II board [24].

3.2 DSP Chip [18]

The DSP chip, TI's TMS320C6416, employs the "VelociTI" architecture, a variant of the traditional VLIW architecture, which consists of multiple execution units running in parallel and executing multiple instructions in one cycle. It is a 32-bit fixed-point DSP with a clock rate of 600 MHz, delivering 4800 MIPS. The C6416 core CPU, which is shown in Fig. 3.3, consists of 64 general-purpose 32-bit registers and eight functional units. These eight functional units comprise two multipliers and six arithmetic units. The architecture allows users to develop highly effective RISC-like code with a short development time. The C6416 uses a two-level cache-based architecture with 16 kB of L1 data cache, 16 kB of L1 program cache, and 1 MB of L2 data/program cache. On-chip peripherals include two multichannel buffered serial ports (McBSPs), two timers, a 16-bit host port interface (HPI), a 32-bit external memory interface (EMIF), a direct memory access (DMA) controller, and an enhanced direct memory access (EDMA) controller. The units just mentioned are sketched below:

• The EDMA controller transfers data between the memory without passing through

the DSP core.
• McBSPs can buffer serial samples in memory automatically with the aid of the DMA/EDMA controller.
• HPI is a parallel port through which a host processor can directly access the CPU's memory space.
• EMIF provides the interface for the DSP core to connect with several external devices, allowing additional data and program space.

Fig. 3.2: Block diagram of Quixote-II (from [16]).

The C6416 has two 64-bit internal ports to access internal data memory. It supports double-word loads and stores. There are four 32-bit paths for loading and storing data between memory and the register file. The C6416 has two register files (A and B), each containing 32 32-bit registers, for a total of 64 general-purpose registers. The general-purpose registers can be used for data, data address pointers, or condition registers. The C6416 register

file supports packed 8-bit types and 64-bit fixed-point data types. Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair. Note that the C6416 does not directly support floating-point data types. The eight functional units in the C6416 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The two sets of functional units, along with the two register files, compose sides A and B of the DSP core.

Fig. 3.3: Functional block and CPU (DSP core) diagram [17].

Figure 3.4 illustrates the C6416 DSP CPU. From this figure, we see that the C6416 CPU contains:

• Program fetch unit

• Instruction dispatch unit, with advanced instruction packing
• Instruction decode unit
• Control registers
• Control logic
• Test, emulation, and interrupt logic

Fig. 3.4: The C64x CPU block diagram [18].

The details of each functional unit are given in Tables 3.1 and 3.2. Most data lines in the CPU support 32-bit operands, and some support long (40-bit) and double-word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file. All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle when performing 32-bit operations.
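Two points from this section lend themselves to a small numerical illustration: the peak rate of 4800 MIPS, which equals eight functional units times 600 MHz, and the idea of packing two 16-bit values into one 32-bit register so that one operation handles both. The following sketch only emulates these ideas in Python; the helper names are hypothetical and are not TI intrinsics or part of the thesis code.

```python
def peak_mips(clock_mhz=600, functional_units=8):
    """Peak instruction rate when all functional units issue every cycle."""
    return clock_mhz * functional_units


def pack2(high, low):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def unpack2(word):
    """Unpack a 32-bit word into two signed 16-bit values (high, low)."""
    def to_signed(value):
        return value - 0x10000 if value & 0x8000 else value
    return to_signed((word >> 16) & 0xFFFF), to_signed(word & 0xFFFF)


def add_packed16(a, b):
    """Add the two 16-bit lanes independently, wrapping without carry across lanes."""
    high = ((a >> 16) + (b >> 16)) & 0xFFFF
    low = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    return (high << 16) | low


print(peak_mips())
print(unpack2(add_packed16(pack2(100, -3), pack2(-50, 7))))
```

A complex fixed-point sample with 16-bit real and imaginary parts fits naturally into such a packed word, which is one reason packed operations matter for this kind of signal processing.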

Table 3.1: Functional Units (.L, .S) and Operations Performed [18].

Table 3.2: Functional Units (.M, .D) and Operations Performed [18].

3.3 Data Transmission Mechanism [16]

In this section, we introduce the data transmission mechanisms that Quixote-II supports. We will make use of one, CPU busmastering, to realize the transmission between the host PC and the DSP target, Quixote-II. From [16], we know that there are three schemes provided by the Quixote baseboard: the DSP streaming interface, the CPU busmastering interface, and the packetized message interface. We introduce them in the following subsections.

3.3.1 DSP Streaming Interface

The Quixote supports using PCI busmastering for the highest data rate streaming between the host and the target. The busmaster streaming interface is fully handshook, so that no data loss can occur in the process of streaming. For example, if the application cannot process blocks fast enough, the buffers will fill, then the busmaster region will fill, then busmastering will stop until the application resumes processing. When the busmaster stops, the DSP will no longer be able to add data to the PCI interface FIFO. The target DSP code can then take any needed action to cover the interruption. When service resumes, the system will move the backed-up data through the system to the application normally. Figure 3.5 shows the block diagram of the DSP streaming mode.

The DSP streaming interface is bi-directional. Two streams can run simultaneously. One runs from the analog peripherals through the DSP into the application; this is called the "incoming stream." The other runs out to the analog peripherals; this is the "outgoing stream." In both cases, the DSP needs to act as a mediator, since there is no direct access to the analog peripherals from the host. This arrangement allows the DSP to process the streams as they move between the application and the hardware.

• Software implementation: DSP streaming is initiated and started on the host, using the Caliente component, which handles bi-directional streaming of data between the host memory and the

target DSP. On the target, the DSP interface uses one pair of DSP/BIOS device drivers, PciIn (on the outgoing stream) and PciOut (on the incoming stream), provided in the Pismo peripheral libraries for the DSP.

• Hardware implementation: The Quixote baseboard has a 32- or 64-bit, 33 MHz PCI interface, compatible with 3 V or 5 V signalling PCI bus systems. This interface supports both busmastering and "slave" interfaces to the baseboard and supports data burst rates of 264 MB/sec for 64-bit systems, or 132 MB/sec for 32-bit systems. The baseboard uses busmastering to host memory as the primary method for moving large amounts of data in the system. From the DSP perspective, the busmastering interface is a bi-directional FIFO that manages the interaction with the host memory. The PCI controller is responsible for moving data to and from the host as required by the DSP. Slave accesses, from the host processor to the target DSP, are used to support configuration, control, and communications.

Fig. 3.5: DSP streaming mode [16].
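The quoted burst rates are simply the bus width in bytes multiplied by the bus clock, as the short sketch below shows. The function name is illustrative, and MB is taken here to mean 10^6 bytes.

```python
def pci_burst_rate_mb_per_s(bus_width_bits, clock_mhz=33):
    """Peak PCI burst rate in MB/sec: one bus-width transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_mhz


print(pci_burst_rate_mb_per_s(64))  # 64-bit PCI
print(pci_burst_rate_mb_per_s(32))  # 32-bit PCI
```

These are peak figures for an uninterrupted burst; sustained throughput on a shared bus is lower.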

3.3.2 CPU Busmastering Interface

The TI 64x baseboard is capable of using PCI busmastering to move data between target and host memory. This additional busmaster channel can be used to transfer data between host and target applications. The primary busmaster interface is based on a streaming model where the data logically constitute an infinite stream between the source and the destination. This model is more efficient because the signalling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. On the other hand, the streaming model can have relatively high latency for a particular piece of data, because a data item may remain in internal buffering until subsequent data accumulate to allow for an efficient transfer.

The CPU busmaster interface uses a different model: it transfers discrete blocks between the source and the destination. Each data buffer is transferred completely to the destination in a single operation. Only if several transfers are requested at once will any delay in beginning transmission occur, as multiple requests have to be serialized through the single hardware system. The data buffers transferred can be of different sizes. Each requested buffer is interrogated for its size and fully transmitted. At the destination, the destination buffer is re-sized to allow the incoming data to fit. If the buffer given is too small for the data, it will be reallocated to allow the transfer. Reallocating buffers can take some time, so for best performance buffers should be pre-sized to be large enough for the largest transfer expected. This makes allocation of buffers at critical times unnecessary.

CPU busmastering uses a simple blocking interface for its sending and receiving functions. The sending function will not return until the transfer has completed and the buffer is ready for reuse. Similarly, the receiving function waits until data have arrived from the data source and have been transferred into the data buffer before returning. At this point the buffer is ready for use. This blocking allows sequences of transfers to be managed by a simple sequence of calls to the transfer functions. Since the transfer functions are blocking, they are

best avoided in the main user interface thread of a Windows application, since the GUI will appear to be frozen until the transfer has completed. For best results, the data transfer functions should be placed in separate threads in the target and host applications. In fact, each direction of transfer should have its own thread, so that the two directions of transfer can interleave as much as possible.

The CPU busmaster interface allows separate channels of data between the target and the host. Using separate channels allows multiple, independent data streams to be maintained between the target and the host. At present, only a single channel is supported. The largest transfer allowed is half the total size of the DMA buffer allocated by the INF file (a type of file used for software and firmware installation on Windows systems) when the driver is installed. Half of the memory is dedicated to each direction. The default buffer size in the INF file is 0x200000 bytes, so the maximum transfer is 1 MB.

PciTransfer::Send() sends the contents of a Buffer-derived object to the target on the channel Channel. All of the data in the buffer are transferred. There is no means of sending a partial buffer. Only channel 0 is currently supported. The function will not return until the block has been transferred to the host. The use of the base buffer class allows any of the IntBuffer, CharBuffer, FloatBuffer, and similar classes to be sent across the interface. The function returns true if the transfer succeeded. It returns false if the transfer failed due to a PCI bus error.

PciTransfer::Recv() waits for data to arrive from the target, then returns the data in the buffer provided. The data must be sent on the same channel as the Channel argument. The Buffer will be re-sized to fit the data transferred from the source. If the buffer is too small, this may involve a reallocation of the data block. The function returns true if the transfer succeeded. It returns false if the transfer failed due to a PCI bus error.
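The transfer size limit described above can be made concrete with a little arithmetic. The sketch below derives the 1 MB limit from the default DMA buffer size and shows one way an application might split a larger payload into transfers that each fit. The helper names are hypothetical and are not part of the Quixote host library.

```python
DEFAULT_DMA_BUFFER_BYTES = 0x200000  # default size set in the INF file


def max_transfer_bytes(dma_buffer_bytes=DEFAULT_DMA_BUFFER_BYTES):
    """Half of the DMA buffer is dedicated to each transfer direction."""
    return dma_buffer_bytes // 2


def split_into_transfers(total_bytes, limit=None):
    """Sizes of consecutive block transfers needed to move `total_bytes`."""
    if limit is None:
        limit = max_transfer_bytes()
    return [min(limit, total_bytes - offset)
            for offset in range(0, total_bytes, limit)]


print(max_transfer_bytes())             # bytes available per direction
print(split_into_transfers(2_500_000))  # three transfers for a 2.5 MB payload
```

Pre-sizing the receive buffer to `max_transfer_bytes()` would avoid the reallocation delay mentioned in the text, at the cost of some memory.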

3.3.3 Packetized Message Interface

The DSP and the host have a lower bandwidth communications link for sending commands or out-of-band information between target and host. These packets give users another way to send commands for many purposes. For example, we can use these packets to tell the target what to do next, or when to do it. These packets provide a very important "bridge" between the host and the DSP. A set of sixteen mailboxes in each direction to and from the host PC is shared with the DSP to allow for an efficient message mechanism that complements the busmastering interface. These mailboxes have a handshake mechanism that signals the recipient when data are available, and a corresponding signal to the sender when the message has been received. The data rate is limited to about 56 kB per second. Higher data rate requirements should use the busmastering interface.

A single bi-directional path can be set up with minimal configuration for applications with simple communication needs. A virtually unlimited number of independent communication channels may be set up to run in parallel, with messages on each channel directed to their own receiver on the other side. Figure 3.6 shows a single bi-directional path between the DSP and the host PC. As we see, this figure is divided into two parts: the host application and the target application. On the host side, the CIIMessage class encapsulates the packet to be transmitted, which may contain up to 14 32-bit data words plus two 32-bit header words. On the target, the corresponding class is called IIMessage. Messages sent by the target are collected into CIIMessage objects for delivery to the event handlers dedicated to responding to the messages. For all practical purposes, we can think of the message system as exchanging IIMessage/CIIMessage objects.

The header portion of the message packet contains some system data and some fields that can be used by the application. Table 3.3 gives the header fields that CIIMessage can read. The 14 words of data are accessible as an array. Table 3.4 lists the access methods, all of which have

an additional argument giving the index into the data section. On the target side, the Pismo library supports a very similar class, IIMessage, to contain the message. The header field access and the data section interface are identical to those of the host-side class, CIIMessage.

Fig. 3.6: Simple target to host messaging configuration [16].

Table 3.3: CIIMessage Header Field [16].

Table 3.4: CIIMessage Data Section Interface [16].

The packetized message system is event driven. When the sender posts a message packet, at the first available opportunity the packet is loaded into the communication registers and an interrupt is generated on the receiving side. On the receiver, the interrupt is

detected and the message is removed and enqueued for later processing. The sender is then notified that the previous packet has been removed and that the hardware is free for another transmission. The receiver then analyzes the message and distributes it to the proper handler for processing. For more information, refer to [16].
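The fixed packet size of two header words plus 14 data words can be illustrated with a sketch that splits a list of 32-bit words into such packets. The header contents, the zero padding, and the little-endian byte order used here are assumptions made purely for illustration; the real field layout is defined by the CIIMessage and IIMessage classes in [16].

```python
import struct

HEADER_WORDS = 2   # 32-bit header words per packet
DATA_WORDS = 14    # 32-bit data words per packet


def build_packets(words, header=(0, 0)):
    """Split 32-bit words into 16-word packets (hypothetical layout).

    Each packet holds the two header words followed by up to 14 data
    words; a short final packet is padded with zeros.
    """
    packets = []
    for offset in range(0, len(words), DATA_WORDS):
        chunk = list(words[offset:offset + DATA_WORDS])
        chunk += [0] * (DATA_WORDS - len(chunk))
        packets.append(struct.pack("<16I", *header, *chunk))
    return packets


packets = build_packets(list(range(30)))
print(len(packets), [len(p) for p in packets])
```

Thirty data words therefore need three packets of 64 bytes each, which shows why bulk data are better sent through the busmastering interface.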

Chapter 4
Integration and Optimization of the IEEE 802.16a OFDMA TDD Uplink Transmitter-Receiver System

In previous chapters, the components of the uplink transceiver system have been introduced and the DSP implementation platform has been described. In this chapter, we discuss the major topic of this thesis: the integration and optimization of the specified uplink transceiver system on II's Quixote DSP baseboard, using the TI TMS320C6416 DSP chip. First, we briefly introduce the overall structure of our system, its transmission mechanism, and the precision of the fixed-point numbers that we use. Second, we introduce the DSP code development environment and some features of the TI C6000 family DSP tools for compiler-level optimization. Then, we discuss optimization of the major blocks in the uplink transceiver, including the TX/RX SRRC filters, the TX IFFT, and others. Finally, we present the improvement achieved by our efforts, using the simulation profile generated by the built-in profiler of TI's Code Composer Studio (CCS).

4.1 Structure of the Implemented System

The structure of the uplink transmitter and receiver system is shown in Fig. 4.1. There are two SSs and one BS. In consequence, the FEC scheme and the channel modulation

(54) Fig. 4.1: Structure of implemented system.
scheme would each ideally run on its own DSP board, with the boards linked through the PCI port of the personal computer (PC), as illustrated in Figs. 4.2 and 4.3, respectively. In our work, however, we merge the two onto a single DSP board, because the data transmission mechanism across DSP boards is too complex to realize. The FEC part consists of the randomizer/de-randomizer, Reed-Solomon encoder/decoder, convolutional encoder/decoder, and interleaver/de-interleaver. The channel modulation part consists of data modulation/demodulation, data framing/deframing, IFFT/FFT, 4-times upsampling/downsampling, and SRRC filtering. Because of an unresolved software bug, we are not yet able to run channel equalizer() on the DSP baseboard; we suspect that a memory allocation problem is the cause. Hence, only some of the receiver functions (namely, RX SRRC() and RX sync) work at present. The implementation of the remaining receiver parts on the DSP is left to future work.

(55) Fig. 4.2: System structure on transmitter side (modified from [15]).
Fig. 4.3: System structure on receiver side (modified from [15]).

4.1.1 CPU Busmastering Interface

The data transmission mechanism between the DSP board and the PC employs the CPU busmastering interface described in Chapter 3, because this mechanism is easier to implement than the data streaming mode. The transmitter’s block size is chosen to be 8200 samples, that is, 2050 samples before the 4-times upsampling. Here one sample means a complex number, consisting of a real part and an imaginary part. The reasons why we do not use 2304 samples (one OFDM symbol) as the block size are given below. One OFDMA frame in our work consists of three downlink symbols, four uplink symbols, one TTG, and one RTG. Figure 2.16 shows the frame structure. After 4-times upsampling, the SS transmits 9216 samples per OFDM

(56) symbol, and 544 samples in each of the TTG and the RTG. Therefore, the SS transmits 65600 samples in one frame time. We assume that the receiver does not know the precise boundaries of the received symbols. To cope with this, we let the transmitting block size be a constant. However, we must ensure that the input sample time is an integer multiple of the output sample time; otherwise we would have to deal with complicated timing. Mathematically, we have
\[ \text{output sample time} = \frac{\text{input sample time}}{N}, \]
where $N$ is a positive integer greater than 1. In our work, the input sample time is 65600 samples. The value of $N$ is chosen to be 8, resulting in an output sample time of 8200 samples. There is no particular theoretical reason for choosing $N = 8$; we simply let the output sample time be close to one OFDM symbol, 9216 samples.

Figure 4.4 shows the organization of our transmitter and receiver. Note that the transmitter does not return until the transfer has completed and the buffer is ready for reuse, and the receiver does not return until data has arrived from the data source and been transferred into the data buffer. The transmitters (2 SSs) and the receiver (BS) are all on the same DSP chip, and the channel simulators for both users are located in the host PC. We exclude the channel simulator from the DSP chip because it relies on floating-point operations, which are time-consuming on the fixed-point C6416. Since there are two SSs on the transmitter side, the buffer size of the transmitting block is 16400 samples, that is, 2 times the 8200 samples. The receiver’s block size is 8200 samples. In the following section, we first introduce the data format we use to represent one sample.
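The block-size arithmetic above can be checked with a few lines of C. This is a minimal sketch, not code from the implementation; the constant and function names (SYMBOL_SAMPLES, frame_samples, block_samples, closest_divisor, and so on) are hypothetical, while the numeric values are the ones quoted in the text.

```c
#include <stdint.h>

enum {
    SYMBOL_SAMPLES = 2304, /* one OFDM symbol before upsampling */
    UPSAMPLING     = 4,
    DL_SYMBOLS     = 3,
    UL_SYMBOLS     = 4,
    GAP_SAMPLES    = 544,  /* TTG or RTG, after upsampling */
    NUM_SS         = 2
};

/* Samples in one frame after 4-times upsampling. */
static int32_t frame_samples(void) {
    return (DL_SYMBOLS + UL_SYMBOLS) * SYMBOL_SAMPLES * UPSAMPLING
           + 2 * GAP_SAMPLES;
}

/* Block size for a given N, or -1 if N does not divide the frame evenly. */
static int32_t block_samples(int32_t frame, int32_t n) {
    if (n <= 1 || frame % n != 0) return -1;
    return frame / n;
}

/* Among the integers N > 1 that divide the frame, return the one whose
   block size is closest to the target (here, one upsampled OFDM symbol). */
static int32_t closest_divisor(int32_t frame, int32_t target) {
    int32_t best_n = -1, best_err = INT32_MAX;
    for (int32_t n = 2; n <= frame; n++) {
        if (frame % n != 0) continue;
        int32_t err = frame / n - target;
        if (err < 0) err = -err;
        if (err < best_err) { best_err = err; best_n = n; }
    }
    return best_n;
}
```

With these values the frame comes to 7 × 9216 + 2 × 544 = 65600 samples and N = 8 gives 8200 samples per block. In this sketch, closest_divisor(65600, 9216) also selects N = 8, which is consistent with the stated aim of keeping the block close to one OFDM symbol.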

(57) Fig. 4.4: Organization of transmitter and receiver using CPU busmastering interface.
Fig. 4.5: Fixed-point data formats at the transmitter side.

4.2. Fixed-Point Data Formats

To improve speed and save memory, we work with fixed-point numbers instead of floating-point numbers. The data formats we use in the transmitter are shown in Fig. 4.5. Since the DSP chip supports 16 × 16 multiply operations, and [19] recommends using the short data type (16 bits) for fixed-point multiplication inputs whenever possible, most of the data types are chosen to be 16-bit in our implementation:
• The data format before the IFFT is Q1.14, which covers the range [−2, 2).
• The data format after the IFFT is Q.15, which covers the range [−1, 1).
The Q1.14 format places the sign bit in the leftmost position, followed by 1 integer bit and 14 fractional bits (Table 4.1). The Q.15 format places the sign bit in the leftmost position, and the remaining 15 bits are fractional. We explain the reasons for adopting these representations below.
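To make the two 16-bit formats concrete, the following sketch converts real values to and from a Q format with a given number of fractional bits. It is an illustration only, not the conversion code used in the implementation; the helper names to_q and from_q are hypothetical, and rounding to nearest with saturation is an assumption of this sketch. The constant 1.0801234497 approximates $7/\sqrt{42}$, the largest 64-QAM component at normalized symbol energy.

```c
#include <stdint.h>

/* Convert a real value to a 16-bit Q format with frac_bits fractional bits.
   Assumptions of this sketch: round half away from zero, saturate on overflow. */
static int16_t to_q(double x, int frac_bits) {
    double scaled = x * (double)(1L << frac_bits);
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    if (scaled >= 32768.0)  return INT16_MAX;
    if (scaled <= -32769.0) return INT16_MIN;
    return (int16_t)scaled;  /* truncation completes the rounding */
}

/* Convert a 16-bit Q-format value back to a real value. */
static double from_q(int16_t q, int frac_bits) {
    return (double)q / (double)(1L << frac_bits);
}
```

For example, to_q(1.0801234497, 14) yields 17697, whereas to_q(1.0801234497, 15) saturates at 32767 because the value lies outside the Q.15 range. One least significant bit corresponds to from_q(1, 14) = 2^−14 in Q1.14 and from_q(1, 15) = 2^−15 in Q.15.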

(58) Table 4.1: Q1.14 Bit Fields
Bits   15  14  13   ...  1   0
Value  S   I0  Q13  ...  Q1  Q0

After the binary sequence passes through the modulator, the data values at normalized symbol energy lie in the ranges shown in Table 4.2. The widest range occurs for 64-QAM, namely $[-7/\sqrt{42},\ 7/\sqrt{42}]$, which Q.15 cannot cover. Thus, Q1.14 is the suitable format to use. Next, the data values at the framing block’s output lie in $[-\tfrac{3}{4}, \tfrac{3}{4}]$, where the values $\pm\tfrac{3}{4}$ occur in the UL preamble carriers [1]. Again, this range can be represented in Q1.14. The fractional part of this format has 14 bits; hence the finest fractional resolution is $2^{-14} = 6.10 \times 10^{-5}$.

Now, we focus on the IFFT’s output range. The 2048-point IFFT is defined as
\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-kn}, \quad n = 0, \ldots, N-1, \tag{4.2.1} \]
where $N = 2048$, $W_N = e^{-j(2\pi/N)}$, $X(k)$ is the input sequence, and $x(n)$ is the resulting output sequence. This function is implemented with the 16-bit FFT function DSP_fft16x16r() provided in the TI TMS320C64x DSP library (DSPLIB) [20]. Its detailed operation is described in later sections. Because of the factor $1/2048$, the range of the output data values will be smaller than 1. For this reason, the data format after the IFFT is set to Q.15. The fractional part of this format has 15 bits; hence the finest fractional resolution is $2^{-15} = 3.05 \times 10^{-5}$. Finally, the data format of the SRRC filter’s output is Q.15, and its SNR is 44.90 dB relative to the floating-point results.

On the receiver side, the main consideration in setting the fixed-point formats is that the multiply operations are always 16 × 16. We show the data formats on the receiver side in Fig. 4.6. The data formats are:
• The data format before the synchronization is Q.15, which covers the range [−1, 1).

(59) • The data format of the FFT input is Q2.13, and that of the output is Q7.8.
• The data format after the channel equalizer is Q1.14.

Table 4.2: Range of Data Values After Modulation
Modulation  Range
QPSK        [−1/√2, 1/√2]
16-QAM      [−3/√10, 3/√10]
64-QAM      [−7/√42, 7/√42]

Fig. 4.6: The fixed-point data formats at the receiver side.

On the receiver side, we must take into account the channel gain introduced by the fading channel. Since the fading coefficients implemented are not normalized, it is better to give some guard bits to the integer part of the FFT input. Thus, we set the data format of the FFT input to Q2.13, which helps prevent overflow in the FFT output. According to [20], the FFT function we use scales the output by 5 bits (i.e., by a factor of $2^{5}$) to prevent output overflow. As a result, the output of the FFT is Q7.8, and the finest fractional resolution is $2^{-8} = 3.91 \times 10^{-3}$. For simplicity, we assume that the receiver knows the frequency response of the channel. We implement a zero-forcing equalizer, which is simply an inverse filter that inverts the frequency response of the channel. After the channel equalizer, the output should lie in the range [−2, 2), the same range used at the transmitter before the IFFT. Hence, we set the data format to Q1.14.
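The zero-forcing step amounts to dividing each received subcarrier value by the known channel response at that subcarrier. The sketch below shows this in double-precision floating point for readability; the implementation described in this thesis is fixed-point, so this is an illustration of the principle rather than the actual routine. The type Cplx and the function zf_equalize are hypothetical names, and setting the output to zero where the channel response is zero is an assumption of this sketch.

```c
#include <stddef.h>

typedef struct { double re, im; } Cplx;

/* Zero-forcing equalization: x_hat[k] = y[k] / h[k] on each subcarrier,
   computed as y[k] * conj(h[k]) / |h[k]|^2.
   Returns the number of subcarriers skipped because h[k] was zero. */
static size_t zf_equalize(const Cplx *y, const Cplx *h, Cplx *x_hat, size_t n) {
    size_t skipped = 0;
    for (size_t k = 0; k < n; k++) {
        double mag2 = h[k].re * h[k].re + h[k].im * h[k].im;
        if (mag2 == 0.0) {           /* inverse filter undefined here */
            x_hat[k].re = 0.0;
            x_hat[k].im = 0.0;
            skipped++;
            continue;
        }
        x_hat[k].re = (y[k].re * h[k].re + y[k].im * h[k].im) / mag2;
        x_hat[k].im = (y[k].im * h[k].re - y[k].re * h[k].im) / mag2;
    }
    return skipped;
}
```

For instance, with h = 0.5 − 0.5j and a transmitted value x = 1 − j, the noiseless received value is y = −j, and the sketch recovers x exactly. Because the equalizer inverts the channel gain, a deep fade (small |h|) amplifies whatever noise is present, which is a known property of zero forcing.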