IEEE 802.16a TDD OFDMA Downlink Pilot-Symbol-Aided Channel Estimation: Techniques and DSP Software Implementation


National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

IEEE 802.16a TDD OFDMA Downlink Pilot-Symbol-Aided Channel Estimation: Techniques and DSP Software Implementation

Student: Ruu-Ching Chen (陳汝芩)
Advisor: Dr. David W. Lin (林大衛)

June 2005

IEEE 802.16a TDD OFDMA Downlink Pilot-Symbol-Aided Channel Estimation: Techniques and DSP Software Implementation

Student: Ruu-Ching Chen (陳汝芩)
Advisor: Dr. David W. Lin (林大衛)

National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

A Thesis Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering

June 2005
Hsinchu, Taiwan, Republic of China

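IEEE 802.16a TDD OFDMA Downlink Pilot-Symbol-Aided Channel Estimation: Techniques and DSP Software Implementation

Student: Ruu-Ching Chen
Advisor: Dr. David W. Lin
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract (Chinese)

Orthogonal frequency division multiplexing (OFDM) has recently attracted wide attention because it supports robust, high-speed transmission in mobile environments. IEEE 802.16a is a standard for wireless local and metropolitan area networks based on orthogonal frequency division multiple access (OFDMA). This thesis discusses downlink channel estimation methods for IEEE 802.16a and their software implementation on a digital signal processor (DSP). We use a least-squares (LS) estimator to estimate the channel frequency response at the pilot carriers because it is computationally convenient in hardware. For interpolation, we study linear and second-order interpolation. Two methods use time-domain data to improve the estimate: two-dimensional interpolation and least mean square (LMS) adaptation. We run simulations on static and Rayleigh fading channels. Combining linear interpolation with two-dimensional interpolation gives better performance at lower computational complexity, so we chose this combination for the DSP software implementation.

We implement the channel estimation techniques in software on the Texas Instruments (TI) TMS320C6416 DSP, which is hosted on the Quixote cPCI card made by Innovative Integration. Because this DSP is designed for fixed-point arithmetic, floating-point operations are very time-consuming. Three techniques speed up execution: changing the data type, improving the coding style, and using intrinsic functions. Changing the data type means converting the original floating-point operations first to 32-bit and then to 16-bit fixed-point operations. Coding-style improvement revises time-consuming constructs such as if-else statements. Intrinsic functions map directly to C64x instructions and improve the performance of our C code. After applying these steps to the original floating-point program, we obtain a large improvement, although the efficiency relative to the theoretical computational complexity reaches 49% at most. For the linear interpolation routine, however, processing completes within 0.52 symbol time.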

IEEE 802.16a TDD OFDMA Downlink Pilot-Symbol-Aided Channel Estimation: Techniques and DSP Software Implementation

Student: Ruu-Ching Chen
Advisor: Dr. David W. Lin
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract

The orthogonal frequency division multiplexing (OFDM) technique has drawn much interest recently for its robustness in mobile transmission environments and its high data rate. IEEE 802.16a is a standard for wireless local and metropolitan area networks based on the orthogonal frequency division multiple access (OFDMA) technique. This work considers two main subjects in downlink channel estimation under the IEEE 802.16a specifications: the interpolation schemes and the DSP implementation. We use the least-squares (LS) estimator at the pilot carriers because of its low computational complexity. We study linear and second-order interpolation in the frequency domain, and the LMS adaptation algorithm and two-dimensional (2-D) interpolation in the time domain. Simulations are run on both static and Rayleigh fading channels. The combination of linear interpolation and 2-D interpolation is chosen for implementation on the DSP board because of its low computational complexity.

Our implementation is software-based, employing Texas Instruments' TMS320C6416 digital signal processor (DSP) housed on Innovative Integration's Quixote cPCI card. Because the DSP is a fixed-point device, floating-point operations are very time-consuming. There are three ways to accelerate DSP execution: changing the data type, optimizing the coding style, and using intrinsic functions. Changing the data type means replacing the original floating-point operations first with 32-bit and then with 16-bit fixed-point operations. Coding-style optimization modifies the time-consuming parts of the code, such as if-else statements. Intrinsic functions are special functions that map directly to C64x instructions and thus improve the performance of our C code. The execution cycle count of each function improves considerably after optimization, although the efficiency relative to the theoretical cycle count is 49% at most. For linear interpolation, processing takes 0.52 of one symbol time, which is within the real-time budget.

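Acknowledgments

There are too many people to thank, above all Professor Lin. I thank him for his guidance and patience over more than two years; being his student is a blessing I must have earned in a previous life. I also thank all members of the Communication Electronics and Signal Processing Laboratory, including the faculty, my classmates, and my senior and junior labmates. I thank my seniors 吳俊榮 and 洪崑健 for their guidance and advice, and my classmates 昱昇, 志凱, 景中, 鎮宇, and others for their help and the joy they brought me during these two years. My family's support and encouragement have been a powerful driving force on my research path, and my gratitude to them is beyond words. Finally, I sincerely thank everyone who has helped and cared for me.

Ruu-Ching Chen
Hsinchu, July 2005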

Contents

1 Introduction
  1.1. Brief Introduction to IEEE 802.16a [1], [2]
  1.2. Motivation of This Thesis
  1.3. Organization of This Thesis
2 Channel Estimation for IEEE 802.16a OFDMA Downlink Transmission
  2.1. Introduction to the IEEE 802.16a TDD OFDMA System
    2.1.1. Generic OFDMA Symbol Description
      2.1.1.1. Time Domain Description
      2.1.1.2. Frequency Domain Description
    2.1.2. Primitive Parameters
    2.1.3. Derived Parameters
    2.1.4. Downlink Carrier Allocation
      2.1.4.1. Pilot Allocation
      2.1.4.2. Data Carrier Allocation
    2.1.5. Data Modulation and Pilot Modulation
      2.1.5.1. Data Modulation
      2.1.5.2. Pilot Modulation
  2.2. DL Channel Estimation Methods
    2.2.1. Pilot-Symbol-Aided Channel Estimation
    2.2.2. Frequency Domain Interpolation Methods
      2.2.2.1. Linear Interpolation
      2.2.2.2. Second-Order Interpolation
    2.2.3. Time Domain Improvement Methods
      2.2.3.1. Two-Dimensional Interpolation [11]
      2.2.3.2. Least Mean Square (LMS) Adaptation [12], [14]
3 DSP Introduction
  3.1. Introduction to TMS320C6416 DSP [16]
    3.1.1. TMS320C6416 Features
    3.1.2. Central Processing Unit
      3.1.2.1. Pipeline
      3.1.2.2. Functional Units
    3.1.3. Memory Architecture
  3.2. Introduction to the Quixote cPCI Board [15]
  3.3. Introduction to the Code Composer Studio Development Tools [17], [18]
  3.4. Code Optimization Methods [21]
    3.4.1. Compiler Optimization Options [17], [18]
    3.4.2. Using Intrinsics
4 Simulation and DSP Implementation
  4.1. Comparison Between 2-D Interpolation and LMS Adaptive Methods
    4.1.1. Simulation Results for AWGN Channel
    4.1.2. Simulation Results for Static Multipath Channel
      4.1.2.1. Two-Dimensional Interpolation
      4.1.2.2. LMS Adaptive Algorithm
    4.1.3. Multipath Rayleigh Fading Channel Simulations
  4.2. DSP Implementation
    4.2.1. Introduction to Program Structure
    4.2.2. Performance of the Original Program
    4.2.3. Choice of the Fixed-Point Data Formats
      4.2.3.1. 32-Bit Fixed-Point Operation
      4.2.3.2. 16-Bit Fixed-Point Operation
    4.2.4. Code Improvement
      4.2.4.1. Coding Style Improvement
      4.2.4.2. Optimization by Using Intrinsic Functions [21]
    4.2.5. Final Version of Fixed-Point 16-Bit Operation
      4.2.5.1. Execution Efficiency
    4.2.6. Summary
5 Conclusion and Future Work
  5.1. Conclusion
  5.2. Future Work
Bibliography

List of Figures

1.1 (a) Frame structure in IEEE 802.16-2004 [1]. (b) Frame structure in IEEE 802.16a-2003 [2].
2.1 Time structure of OFDMA symbol (from [2]).
2.2 Illustration of carrier usage in OFDMA DL (from [3]).
2.3 Pilot allocation in the OFDMA DL (from [2]).
2.4 QPSK, 16-QAM and 64-QAM constellations (from [2]).
2.5 Pseudo random binary sequence (PRBS) generator for pilot modulation (from [2]).
2.6 Illustration of 2-D interpolation.
2.7 Adaptive channel estimation using the LMS algorithm.
3.1 Block diagram of the TMS320C6416 DSP [16].
3.2 Pipeline phases of TMS320C6416 DSP [16].
3.3 TMS320C64x CPU data path [16].
3.4 Block diagram of Quixote [15].
3.5 Block diagram of DSP streaming mode [15].
3.6 Code development flow for TI C6000 DSP [21].
4.1 Block diagram of the simulated system.
4.2 Channel estimation steps.
4.3 MSE of $|\hat{X}_i - X_i|$ for AWGN channel.
4.4 The (a) MSE and (b) SER for AWGN channel simulation.
4.5 (a) Amplitude response and (b) phase response of the channel given in Table 4.2.
4.6 MSE of $|\hat{X}_i - X_i|$ on subcarrier 1.
4.7 The (a) MSE and (b) SER on subcarrier 1 of the 2-D interpolation using formula 1 with linear interpolation in the frequency domain respectively.
4.8 MSE of $|\hat{X}_i - X_i|$ on subcarrier 1700.
4.9 The (a) MSE and (b) SER on subcarrier 1700 of the 2-D interpolation using formula 1 with linear interpolation in the frequency domain respectively.
4.10 The (a) MSE and (b) SER of the 2-D interpolation using formula 1 with linear and 2nd-order interpolation in the frequency domain respectively.
4.11 The (a) MSE and (b) SER of the 2-D interpolation using formula 2 with linear and 2nd-order interpolation in the frequency domain respectively.
4.12 The (a) MSE and (b) SER of using formula 1 and 2 in the 2-D interpolation respectively with linear interpolation in the frequency domain.
4.13 The (a) MSE and (b) SER of using formula 1 and 2 in the 2-D interpolation respectively with 2nd-order interpolation in the frequency domain.
4.14 The (a) MSE and (b) SER for different weighting and different step-size parameters in LMS adaptive method.
4.15 MSE between $\hat{X}$ and $\hat{X}_{\text{after decision}}$ for different weighting and different step-size parameters in LMS adaptive method.
4.16 The (a) MSE of $|\hat{x}_i - x_i|$, MSE and (b) SER for one-path Rayleigh fading channel, where V = 27 km/h, $f_d T$ = 0.01.
4.17 MSE of $|\hat{X}_i - X_i|$ on subcarrier 1 for multipath Rayleigh fading channel.
4.18 The (a) MSE and (b) SER of carrier 1 with 2-D interpolation using formula 2 with linear interpolation in the frequency domain respectively. V = 27 km/h, $f_d T$ = 0.01.
4.19 MSE of $|\hat{X}_i - X_i|$ on subcarrier 1700 for multipath Rayleigh fading channel.
4.20 The (a) MSE and (b) SER of carrier 1700 with 2-D interpolation using formula 2 with linear and 2nd-order interpolation in the frequency domain respectively. V = 27 km/h, $f_d T$ = 0.01.
4.21 The (a) MSE and (b) SER of the 2-D interpolation using formula 2 with linear and 2nd-order interpolation in the frequency domain respectively. V = 27 km/h, $f_d T$ = 0.01.
4.22 The (a) MSE and (b) SER of the 2-D interpolation using formula 2 with linear and 2nd-order interpolation in the frequency domain respectively. V = 54 km/h, $f_d T$ = 0.02.
4.23 Program structure for channel estimation.
4.24 Function Modulation (QPSK).
4.25 Function Complex Mul.
4.26 Function Linear Interp.
4.27 Function Complex Div.
4.28 Function De-modulation (QPSK).
4.29 Function Modulation (QPSK) of 32-bit fixed-point operation.
4.30 Software pipelining information of 32-bit fixed-point Complex Mul.
4.31 The loop kernel of Complex Mul.
4.32 Fixed-point data formats used in DSP implementation.
4.33 Example of different coding styles in C code.
4.34 Result of different coding styles in compiled assembly code.
4.35 Array access in vector sum by LDDW [21].
4.36 Array access in vector sum by STDW [21].
4.37 Illustration of the _dotp2 and the _dotpn2 intrinsics [21].
4.38 Function vec Complex Mul.
4.39 Function vec Complex Div.
4.40 Original interpolation loop.
4.41 Final version of the interpolation loop.
4.42 Loop kernel of modified assembly code in Linear Interp.
4.43 Software pipelining information of the modified loop in Linear Interp.
4.44 Software pipelining information of 16-bit fixed-point Complex Mul.
4.45 (a) MSE and (b) SER comparison between floating-point and 16-bit fixed-point operations with 2-D interpolation using formula 2 (4 sets) with linear interpolation in the frequency domain. V = 27 km/h, $f_d T$ = 0.01.

List of Tables

1.1 Carrier Allocation in the OFDMA DL (from [1])
2.1 Carrier Allocation in the OFDMA DL (from [2])
3.1 Execution Stage Length Description for Each Instruction Type [16]
3.2 Functional Units and Operations Performed [16]
4.1 MSE Ratio Between Formula 1 and Formula 2 for AWGN Channel
4.2 Channel Impulse Response
4.3 MSE Ratio Between Formula 1 and Formula 2 for Multipath Channel
4.4 Relation Between Speed and Maximum Doppler Shift
4.5 Floating-Point Profile of 802.16a DL Channel Estimation Function Blocks
4.6 Q16.15 Bit Fields
4.7 Fixed-Point 32-Bit Operation Profile of 802.16a DL Channel Estimation Function Blocks
4.8 Q1.14 Bit Fields
4.9 Different Ways of Variable Declaration, Where r Stands for Real Part and i Stands for Imaginary Part
4.10 Fixed-Point 16-Bit Operation with Coding Style Modified Profile of 802.16a DL Channel Estimation Function Blocks
4.11 Performance Comparison Between Different Data Types of Complex Mul
4.12 Performance Comparison Between Different Data Types of Complex Div
4.13 Performance Comparison Between Different Data Types of Linear Interp

Chapter 1
Introduction

1.1. Brief Introduction to IEEE 802.16a [1], [2]

In recent years, orthogonal frequency division multiplexing (OFDM) has drawn much attention for its ability to cope with frequency-selective fading in high-speed wireless communication. The IEEE 802.16 standard committee has developed a group of standards for wireless metropolitan area networks (MANs), and Project 802.16a is one of them. The subject of this study is the OFDMA-based air interface option of this project, namely WirelessMAN-OFDMA.

IEEE 802.16-2001 specifies the air interface of fixed (stationary) point-to-multipoint broadband wireless access systems providing multiple services. The medium access control layer can support multiple physical layer specifications optimized for the frequency bands of application. The standard includes a particular physical layer specification applicable to systems operating between 10 and 66 GHz. IEEE 802.16a amends IEEE 802.16-2001 by enhancing the medium access control layer and providing additional physical layer specifications in support of broadband wireless access at frequencies from 2 to 11 GHz.

Because our project started in 2002, we have followed the specifications of these two standards. However, the IEEE 802.16 standard committee completed a new version of the standard in 2004, namely IEEE 802.16-2004. This standard specifies the air interface of fixed broadband wireless access (BWA) systems supporting multimedia services. The medium access control (MAC) layer supports a primarily point-to-multipoint architecture, with an optional mesh topology. The MAC is structured to support multiple physical layer (PHY) specifications, each suited to a particular operational environment. For operational frequencies of 10 to 66 GHz, the PHY is based on single-carrier modulation. For frequencies below 11 GHz, where propagation without a direct line of sight must be accommodated, three alternatives are provided, using OFDM, OFDMA, and single-carrier modulation.

Since pilot allocation is key to the study reported in this thesis, we summarize how the two versions differ in carrier allocation. Table 1.1 shows the pilot allocation of IEEE 802.16-2004. The variable set of pilots embedded within the symbol of each segment obeys the following rule:

$$\mathrm{PilotsLocation} = \mathrm{VariableSet}\#x + 6 \cdot (\mathrm{FUSC\_SymbolNumber} \bmod 2) \tag{1.1}$$

where FUSC_SymbolNumber counts the FUSC (full usage of subchannels) symbols used in the transmission, starting from 0. This arrangement differs slightly from the specification in IEEE 802.16a-2003 (see also Fig. 2.3): 802.16a has four variable-location pilot arrangements, whereas IEEE 802.16-2004 has only two.

IEEE 802.16-2004 also modifies the frame structure. Fig. 1.1(a) shows that in IEEE 802.16-2004 each frame begins with a preamble, followed by a downlink transmission period and an uplink transmission period. This is quite different from the frame structure in IEEE 802.16a-2003, shown in Fig. 1.1(b), where a preamble is used only in the uplink subframe.

Figure 1.1: (a) Frame structure in IEEE 802.16-2004 [1]. (b) Frame structure in IEEE 802.16a-2003 [2].

Table 1.1: Carrier Allocation in the OFDMA DL (from [1]).

1.2. Motivation of This Thesis

In high-data-rate transmission, channel imperfections such as multipath propagation cause more severe difficulty in demodulation than they do in low-rate transmission. When data are transmitted over such a channel, each received symbol is affected to some degree by adjacent symbols, a common form of interference referred to as inter-symbol interference (ISI). ISI is a major source of performance degradation in the data reconstructed at the receiver. In single-carrier transmission, a time-domain adaptive equalizer is usually employed to solve this problem. If the channel has a very long impulse response compared with the symbol
duration, a time-domain equalizer may fail to handle ISI. In an OFDM system, however, ISI can be eliminated by inserting a cyclic prefix that is longer than the maximum delay spread, at the expense of some loss in capacity. In uncoded OFDM, the receiver needs only a one-tap frequency-domain equalizer for each subcarrier. The purpose of channel estimation is to obtain the channel response at each subcarrier, from which the equalizer coefficient follows directly as the inverse of the channel gain. In channel-coded OFDM, such as IEEE 802.16a OFDMA, equalization is not needed, but the estimated channel response is directly useful in channel decoding. In this thesis we therefore investigate channel estimation methods that can be applied to IEEE 802.16a downlink transmission.

1.3. Organization of This Thesis

This thesis is organized as follows. Chapter 2 gives the relevant specifications of the IEEE 802.16a OFDMA downlink system and introduces the channel estimation approaches. Chapter 3 describes the implementation platform, which consists of Texas Instruments' TMS320C6416 digital signal processor (DSP) on the Quixote cPCI board made by Innovative Integration. Chapter 4 discusses the performance of the proposed channel estimation method and its DSP implementation. Finally, Chapter 5 gives the conclusion and potential future work.
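As a concrete illustration of the one-tap frequency-domain equalization mentioned in Section 1.2, the following minimal Python sketch divides each received subcarrier value by its channel gain. The function name, the four-subcarrier QPSK data, and the channel gains are hypothetical and chosen only for illustration; the sketch assumes a noise-free channel and a perfect channel estimate.

```python
def one_tap_equalize(received, channel_estimate):
    """Divide each received subcarrier value by its estimated channel gain."""
    if len(received) != len(channel_estimate):
        raise ValueError("one channel gain is needed per subcarrier")
    # The equalizer coefficient on subcarrier k is the inverse gain 1 / H[k].
    return [y / h for y, h in zip(received, channel_estimate)]


# Hypothetical four-subcarrier example: QPSK symbols, no noise.
sent = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
gains = [0.9 + 0.2j, 0.5 - 0.5j, -0.3 + 0.8j, 1.1 + 0j]
received = [h * x for h, x in zip(gains, sent)]
recovered = one_tap_equalize(received, gains)
```

With noise or estimation error, the recovered values only approximate the transmitted symbols, which is why the quality of the channel estimate matters.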

Chapter 2
Channel Estimation for IEEE 802.16a OFDMA Downlink Transmission

For wideband mobile communication systems, the radio channel is usually frequency selective and time variant. Our estimation schemes therefore combine frequency-domain estimation with time-domain processing. The channel estimation algorithms in this thesis are closely tied to the pilot subcarrier arrangement of the OFDM system.

2.1. Introduction to the IEEE 802.16a TDD OFDMA System

The IEEE 802.16a standard specifies the WirelessMAN air interface for wireless metropolitan area networks. It defines several system modes: SCa (single-carrier modulation), OFDM (orthogonal frequency-division multiplexing), and OFDMA (orthogonal frequency-division multiple access). It also supports two duplex types: TDD (time division duplex) and FDD (frequency division duplex). We consider the TDD OFDMA option. Most of the content in this section is taken from [2].

Figure 2.1: Time structure of OFDMA symbol (from [2]).

2.1.1. Generic OFDMA Symbol Description

2.1.1.1. Time Domain Description

An OFDM symbol consists of the useful symbol part and the cyclic prefix (CP) part. The useful symbol time is denoted $T_b$. The CP is a copy of the last $T_g$ µs of the useful symbol period. The two together form the symbol time $T_s$. The ratios of CP time to useful time ($T_g/T_b$) that should be supported are 1/32, 1/16, 1/8, and 1/4. In this thesis, the ratio is set to 1/8. The time-domain OFDMA symbol structure is shown in Fig. 2.1.

2.1.1.2. Frequency Domain Description

In the frequency domain, there are three carrier types:
• Data carriers: for data transmission.
• Pilot carriers: for various estimation purposes.
• Null carriers: no transmission at all, used for the guard bands and the DC carrier. (The guard bands allow the signal to decay naturally and create the FFT "brick wall" shaping.)
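The cyclic prefix construction of Section 2.1.1.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the integer placeholder samples are hypothetical, and the 2048-sample useful part is assumed to match the FFT size used in this chapter with a CP ratio of 1/8.

```python
def add_cyclic_prefix(useful, cp_num=1, cp_den=8):
    """Prepend a copy of the tail of the useful part (CP ratio cp_num/cp_den)."""
    cp_len = len(useful) * cp_num // cp_den
    if cp_len == 0:
        return list(useful)
    # The CP is the last cp_len samples of the useful symbol, placed in front.
    return list(useful[-cp_len:]) + list(useful)


N_FFT = 2048                       # assumed number of useful samples
useful = list(range(N_FFT))        # placeholder samples, not real IFFT output
symbol = add_cyclic_prefix(useful)
```

Under these assumptions the transmitted symbol has 2048 + 256 = 2304 samples, and its first 256 samples repeat its last 256.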

In the OFDMA mode, the active carriers are divided into subsets of carriers, and each subset is termed a subchannel. In the downlink (DL), a subchannel may be intended for different groups of receivers; similarly, in the uplink (UL) a transmitter may be assigned one or more subchannels, so several transmitters may transmit in parallel. The symbol structure in the frequency domain is described in detail in the following sections.

2.1.2. Primitive Parameters

Four primitive parameters characterize the OFDMA symbol:
• $BW$: the nominal channel bandwidth. It equals 10 MHz in our system simulation.
• $F_s/BW$: the ratio of the "sampling frequency" to the nominal channel bandwidth. This value is set to 8/7.
• $T_g/T_b$: the ratio of CP time to "useful" time. We use 1/8 in our system.
• $N_{FFT}$: the number of points in the FFT. The OFDMA PHY defines this value to be 2048.

2.1.3. Derived Parameters

The following parameters are defined in terms of the primitive parameters (values are computed with the exact ratio 8/7 and then rounded):
• $F_s = (F_s/BW) \cdot BW$ = sampling frequency = 10 × 8/7 ≈ 11.43 MHz.
• $\Delta f = F_s/N_{FFT}$ = carrier spacing ≈ 5.58 kHz.
• $T_b = 1/\Delta f$ = useful time = 179.2 µs.
• $T_g = (T_g/T_b) \cdot T_b$ = CP time = 22.4 µs.
• $T_s = T_b + T_g$ = OFDM symbol time = 201.6 µs.
• $1/F_s$ = sample time = 87.5 ns.

2.1.4. Downlink Carrier Allocation

Since this thesis focuses on downlink pilot-symbol-aided channel estimation, it is necessary to understand how the carriers are allocated.

2.1.4.1. Pilot Allocation

The carrier allocation in a DL OFDM symbol is shown in Fig. 2.2. Null carriers are allocated on the left and right sides as well as at DC. The pilot and data carriers are termed useful carriers since they transmit useful information. The pilot tones are allocated first; the remainder of the used carriers is divided into 32 subchannels, and the data carriers are then allocated within each subchannel.

Figure 2.2: Illustration of carrier usage in OFDMA DL (from [3]). [Figure labels: guard band, DC carrier, guard band; Group 1, Group 2, ..., Group 48, each with 32 data carriers (no pilots in the group); pilot; subchannel 1, subchannel 2; the 1702 used carriers = 1536 data carriers + 166 pilot carriers.]
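As a numerical check of the derived parameters in Section 2.1.3, the following Python sketch recomputes them from the four primitive parameters. It uses exact rational arithmetic so that no rounding occurs at intermediate steps; the function name and dictionary keys are illustrative choices, not part of the standard.

```python
from fractions import Fraction


def derived_parameters(bw_hz, fs_over_bw, tg_over_tb, n_fft):
    """Compute the derived OFDMA parameters from the primitive ones."""
    fs = fs_over_bw * bw_hz        # sampling frequency in Hz
    delta_f = fs / n_fft           # carrier spacing in Hz
    tb = 1 / delta_f               # useful symbol time in seconds
    tg = tg_over_tb * tb           # cyclic prefix time in seconds
    ts = tb + tg                   # OFDM symbol time in seconds
    return {"Fs": fs, "delta_f": delta_f, "Tb": tb,
            "Tg": tg, "Ts": ts, "sample_time": 1 / fs}


# Primitive parameters used in this thesis: 10 MHz, 8/7, 1/8, 2048.
params = derived_parameters(10_000_000, Fraction(8, 7), Fraction(1, 8), 2048)
for name, value in params.items():
    print(f"{name} = {float(value):.6g}")
```

Because the ratio 8/7 is kept exact, the useful time comes out as exactly 2048 sample periods of 87.5 ns each.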

The pilot carriers include fixed-location pilots and variable-location pilots. The carrier indices of the fixed-location pilots never change. The carrier indices of the variable-location pilots vary according to the formula

$$\mathrm{varLocPilot}_k = 3L + 12 P_k$$

where $\mathrm{varLocPilot}_k$ is the carrier index of a variable-location pilot, $L$ is the symbol index, which cycles through the values 0, 2, 1, 3 with a period of four symbols, and $P_k \in \{0, 1, 2, \ldots, 141\}$. The pilot carrier allocation map is shown in Fig. 2.3.

Figure 2.3: Pilot allocation in the OFDMA DL (from [2]).
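The variable-location pilot formula can be evaluated directly. The Python sketch below lists the variable pilot indices for a given symbol number; the function and constant names are hypothetical, and the indices are assumed to be counted relative to the used carriers.

```python
SYMBOL_INDEX_CYCLE = (0, 2, 1, 3)   # values of L for successive symbols
NUM_VARIABLE_PILOTS = 142           # P_k = 0, 1, ..., 141


def variable_pilot_indices(symbol_number):
    """Return varLocPilot_k = 3L + 12*P_k for every k in one OFDMA symbol."""
    big_l = SYMBOL_INDEX_CYCLE[symbol_number % len(SYMBOL_INDEX_CYCLE)]
    return [3 * big_l + 12 * p for p in range(NUM_VARIABLE_PILOTS)]


# Collect the pilot positions visited over one full four-symbol cycle.
covered = sorted({i for n in range(4) for i in variable_pilot_indices(n)})
```

Within one symbol the variable pilots are 12 carriers apart, but over a four-symbol cycle they visit every third carrier. This is what makes it attractive to combine pilots from neighboring symbols in the time-domain methods discussed later.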

2.1.4.2. Data Carrier Allocation

After the pilots are inserted, the remaining used carriers are assigned to the data subchannels. To allocate the data subchannels, we partition the remaining carriers into groups of contiguous carriers. Each subchannel consists of one carrier from each of these groups. The number of groups therefore equals the number of carriers per subchannel, denoted $N_{subcarriers}$. The number of carriers in a group equals the number of subchannels, denoted $N_{subchannels}$. The total number of data carriers is thus $N_{subcarriers} \times N_{subchannels}$. The exact partitioning into subchannels follows the equation below, called a permutation formula:

$$\mathrm{carrier}(n, s) = N_{subchannels} \cdot n + \left\{ p_s[n \bmod N_{subchannels}] + \mathrm{IDcell} \cdot \mathrm{ceil}[(n+1)/N_{subchannels}] \right\} \bmod N_{subchannels} \tag{2.1}$$

where:
• $\mathrm{carrier}(n, s)$ = carrier index of carrier $n$ in subchannel $s$.
• $s$ = index number of a subchannel, from the set $[0, \cdots, N_{subchannels} - 1]$.
• $n$ = carrier-in-subchannel index, from the set $[0, \cdots, N_{subcarriers} - 1]$.
• $N_{subchannels}$ = number of subchannels.
• $p_s[j]$ = the series obtained by rotating {PermutationBase0}, which is given in Table 2.1, cyclically to the left $s$ times.
• $\mathrm{ceil}[\,]$ = ceiling function, which rounds its argument up to the next integer.
• IDcell = a positive integer assigned by the MAC to identify this particular base-station cell.
• $X \bmod k$ = the remainder of the quotient $X/k$, which is at most $k - 1$.
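The permutation formula (2.1) can be sketched in Python as follows. The permutation base of Table 2.1 is not reproduced here, so the sketch uses a made-up four-element base (`TOY_PERMUTATION_BASE`) with 6 groups; these values and all function names are assumptions for illustration only and do not correspond to the 802.16a parameters.

```python
def rotate_left(seq, s):
    """Rotate a list cyclically to the left s times."""
    s %= len(seq)
    return seq[s:] + seq[:s]


def carrier(n, s, perm_base, id_cell):
    """Carrier index of carrier n in subchannel s, following (2.1)."""
    n_sub = len(perm_base)                 # N_subchannels
    p_s = rotate_left(perm_base, s)
    ceil_term = -((-(n + 1)) // n_sub)     # integer ceil[(n + 1) / N_subchannels]
    offset = (p_s[n % n_sub] + id_cell * ceil_term) % n_sub
    return n_sub * n + offset


def subchannel_carriers(s, n_subcarriers, perm_base, id_cell):
    """All carriers of subchannel s: one carrier from each group."""
    return [carrier(n, s, perm_base, id_cell) for n in range(n_subcarriers)]


TOY_PERMUTATION_BASE = [3, 0, 2, 1]        # hypothetical, NOT from Table 2.1
TOY_SUBCARRIERS = 6                        # hypothetical number of groups
allocation = [subchannel_carriers(s, TOY_SUBCARRIERS, TOY_PERMUTATION_BASE, 1)
              for s in range(len(TOY_PERMUTATION_BASE))]
```

In this toy setting every data carrier index from 0 to 23 is assigned to exactly one subchannel, and each subchannel draws one carrier from each group of four contiguous carriers, which is the property the permutation is meant to provide.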

Table 2.1: Carrier Allocation in the OFDMA DL (from [2]).

The numerical parameters are given in Table 2.1.

2.1.5. Data Modulation and Pilot Modulation

2.1.5.1. Data Modulation

The data modulation schemes in 802.16a are shown in Fig. 2.4. The data bits are entered serially into the constellation mapper. Gray-mapped QPSK and 16-QAM must be supported, whereas support of 64-QAM is optional.

Figure 2.4: QPSK, 16-QAM and 64-QAM constellations (from [2]).

2.1.5.2. Pilot Modulation

Pilot carriers are inserted into each data burst to constitute the symbol, and they are modulated according to their carrier locations within the OFDMA symbol. A pseudo-random binary sequence (PRBS) generator produces a sequence $w_k$, where $k$ corresponds to the carrier index. The value of the pilot modulation on carrier $k$ is then derived from $w_k$. The polynomial of the PRBS generator is $X^{11} + X^9 + 1$, as shown in Fig. 2.5.

Symbols in TDD OFDMA DL transmission fall into two types. The first three symbols are preamble symbols, and the others are normal symbols. The initialization vector of the PRBS is [11111111111] for the DL normal symbols and [01010101010] for the DL preamble symbols. The PRBS shall be initialized so that its first output bit coincides with the first usable carrier. A new value shall be generated by the PRBS on every usable carrier. Each pilot shall be transmitted with a boosting of 2.5 dB
over the average power of each data tone. The pilot carriers shall be modulated as

$$\mathrm{Re}\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \quad \mathrm{Im}\{c_k\} = 0. \tag{2.2}$$

Figure 2.5: Pseudo random binary sequence (PRBS) generator for pilot modulation (from [2]).

2.2. DL Channel Estimation Methods

Interpolation plays a significant role in pilot-symbol-aided channel estimation. Our interpolation schemes work in both the frequency and the time domains. Linear and second-order interpolation are applied in the frequency domain, while 2-D interpolation and least mean square (LMS) adaptation improve the estimates in the time domain.

2.2.1. Pilot-Symbol-Aided Channel Estimation

Channel estimators usually need some kind of pilot information as a point of reference. A fading channel requires constant tracking, so pilot information has to be transmitted more or less continuously. Decision-directed channel estimation can also be used, but even in such schemes pilot information has to be transmitted regularly to mitigate error propagation [4].

In general, the fading channel can be viewed as a two-dimensional (2-D) signal
(time and frequency), which is sampled at the pilot positions; the channel coefficients between pilots may then be estimated by interpolation. Based on a priori known data, we can obtain a rough estimate of the channel at the pilot carriers with the least-squares (LS) or the minimum mean square error (MMSE) estimator.

An LS estimator minimizes the following squared error [5]:

$$\|\mathbf{Y} - \hat{\mathbf{H}}_{LS}\mathbf{X}\|^2 \tag{2.3}$$

where $\mathbf{Y}$ is the received signal and $\mathbf{X}$ is the a priori known pilot signal, both in the frequency domain and both $N \times 1$ vectors, with $N$ the OFDM FFT size. $\hat{\mathbf{H}}_{LS}$ is an $N \times N$ matrix whose entries are 0 except at the pilot locations $m_i$, where $i = 0, \cdots, N_p - 1$:

$$\hat{\mathbf{H}}_{LS} = \begin{bmatrix} H_{m_0,m_0} & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & H_{m_1,m_1} & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \cdots & H_{m_2,m_2} & \cdots & 0 \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & H_{m_{N_p-1},m_{N_p-1}} \end{bmatrix}. \tag{2.4}$$

Therefore, (2.3) can be rewritten as

$$|Y(m) - \hat{H}_{LS}(m) X(m)|^2, \quad \text{for all } m = m_i. \tag{2.5}$$

Then the estimate at the pilot positions, based on one observed OFDM symbol, is given by

$$\hat{H}_{LS}(m) = \frac{Y(m)}{X(m)} = \frac{X(m)H(m) + N(m)}{X(m)} = H(m) + \frac{N(m)}{X(m)} \tag{2.6}$$

where $N(m)$ is the complex white Gaussian noise on subcarrier $m$. We collect $\hat{H}_{LS}(m)$ into $\hat{\mathbf{H}}_{p,LS}$, an $N_p \times 1$ vector where $N_p$ is the total number of pilots, as

$$\hat{\mathbf{H}}_{p,LS} = [H_{p,LS}(0) \;\; H_{p,LS}(1) \;\; \cdots \;\; H_{p,LS}(N_p - 1)]^T = \mathbf{X}_p^{-1}\mathbf{Y}_p = \left[\frac{Y_p(0)}{X_p(0)}, \frac{Y_p(1)}{X_p(1)}, \ldots, \frac{Y_p(N_p - 1)}{X_p(N_p - 1)}\right]^T \tag{2.7}$$

where $\mathbf{X}_p$ and $\mathbf{Y}_p$ collect the transmitted and the received signals on the pilot subcarriers, respectively (with $\mathbf{X}_p$ arranged as a diagonal matrix so that its inverse is defined). The LS estimate of $\mathbf{H}_p$ based on one OFDM
symbol only is susceptible to Gaussian noise, and thus an estimator better than the LS estimator is preferable. The minimum mean-square error (MMSE) estimate has been shown to be better than the LS estimate for channel estimation in OFDM systems, but the major drawback of the MMSE estimate is its high complexity. A low-rank approximation results in a linear minimum mean squared error (LMMSE) estimator that uses the frequency-domain correlation of the channel [6]. The mathematical representation for the LMMSE estimator of pilot signals is

$$\begin{aligned}
\hat{H}_{p,lmmse} &= R_{H_p H_{p,LS}} R^{-1}_{H_{p,LS} H_{p,LS}} \hat{H}_{p,LS} \\
&= R_{H_p H_p} \left(R_{H_p H_p} + \sigma_n^2 (X_p X_p^H)^{-1}\right)^{-1} \hat{H}_{p,LS}
\end{aligned} \tag{2.8}$$

where $\hat{H}_{p,LS}$ is the least-square estimate of $H_p$ in (2.7), $\sigma_n^2$ is the variance of the Gaussian white noise, and the covariance matrices are defined by

$$R_{H_p H_{p,LS}} = E\{H_p H_{p,LS}^H\}, \tag{2.9}$$

$$R_{H_{p,LS} H_{p,LS}} = E\{H_{p,LS} H_{p,LS}^H\}, \tag{2.10}$$

$$R_{H_p H_p} = E\{H_p H_p^H\}. \tag{2.11}$$

Note that there is a matrix inverse involved in the MMSE estimator, which must be calculated every time, and the computation of matrix inversion requires $O(N_p^3)$ arithmetic operations [7]. We also need to use the statistical properties of the unknown channel. Therefore, because of the complexity and the unknown channel statistics, we use the LS estimator, which requires only $O(N_p)$ operations, instead of the LMMSE estimator.

2.2.2 Frequency Domain Interpolation Methods

2.2.2.1 Linear Interpolation

Linear interpolation is a commonly used method of interpolation. It does the interpolation simply with two known data, and interpolates those unknown data between

them. It is given by [8]

$$H_e(k) = H_e(m + l) = (H_p(m + 1) - H_p(m)) \frac{l}{L} + H_p(m) \tag{2.12}$$

where $H_p(k)$, $k = 0, 1, \cdots, N_p$, are the channel frequency responses at pilot subcarriers, $L$ is the distance between the two given data, that is, the pilot subcarrier spacing, and $0 \le l < L$.

2.2.2.2 Second-Order Interpolation

Theoretically, using higher-order polynomial interpolation may fit the channel response better than linear interpolation [9]. However, the computational complexity grows as the order is increased. Here we consider second-order polynomial interpolation, which has also been called Gaussian second-order estimation. It is given as a solution to the second-order polynomial with respect to $l/L$ by using three reference signal points. The interpolation is obtained using three successive pilot subcarrier signals as follows [10]:

$$H_e(k) = H_e(m + l) = c_1 H_p(m - 1) + c_0 H_p(m) + c_{-1} H_p(m + 1) \tag{2.13}$$

where

$$\begin{cases}
c_1 = \dfrac{\alpha(\alpha - 1)}{2}, \\
c_0 = -(\alpha - 1)(\alpha + 1), \\
c_{-1} = \dfrac{\alpha(\alpha + 1)}{2}, \\
\alpha = \dfrac{l}{L}.
\end{cases}$$

The notations are the same as they are in linear interpolation.
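The LS estimate (2.7) and the two frequency-domain interpolators (2.12) and (2.13) can be sketched in a few lines. The following is only an illustration under stated assumptions: the function names, the pilot spacing of 4, and the quadratic test channel are hypothetical and are not taken from the thesis implementation; the pilot values ±4/3 follow from (2.2) when $w_k \in \{0, 1\}$.

```python
def ls_pilot_estimate(y_p, x_p):
    """LS estimate at the pilots, H_p,LS(i) = Y_p(i) / X_p(i), as in (2.7)."""
    return [y / x for y, x in zip(y_p, x_p)]


def linear_interp(h_p, m, l, spacing):
    """Linear interpolation (2.12) between pilots m and m+1, 0 <= l < spacing."""
    return (h_p[m + 1] - h_p[m]) * l / spacing + h_p[m]


def second_order_interp(h_p, m, l, spacing):
    """Second-order interpolation (2.13) using pilots m-1, m and m+1."""
    alpha = l / spacing
    c1 = alpha * (alpha - 1) / 2
    c0 = -(alpha - 1) * (alpha + 1)
    c_minus1 = alpha * (alpha + 1) / 2
    return c1 * h_p[m - 1] + c0 * h_p[m] + c_minus1 * h_p[m + 1]


def true_channel(k):
    """Hypothetical channel, quadratic in the carrier index k."""
    return (1 + 0.5j) + 0.01 * k - 0.002j * k * k


# Hypothetical noise-free setup: one pilot every 4 carriers.
spacing = 4
pilot_carriers = [0, 4, 8, 12]
x_p = [4 / 3, -4 / 3, 4 / 3, -4 / 3]
y_p = [true_channel(k) * x for k, x in zip(pilot_carriers, x_p)]
h_p = ls_pilot_estimate(y_p, x_p)

# Estimates at carrier 6, halfway between the pilots at carriers 4 and 8.
h_lin = linear_interp(h_p, 1, 2, spacing)
h_quad = second_order_interp(h_p, 1, 2, spacing)
```

In this noise-free setting the second-order estimate reproduces the quadratic channel exactly, while the linear estimate is simply the midpoint of the two neighbouring pilots; with noise, both inherit the noise of the LS pilot estimates.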

Figure 2.6: Illustration of 2-D interpolation.

2.2.3 Time Domain Improvement Methods

As Table 2.1 shows, we can only use 166 pilots in one symbol to interpolate the channel in the frequency domain. This is not sufficient because the pilot spacings are too wide in our system. Since the channel does not change abruptly over time, we propose two methods here to improve the performance.

2.2.3.1 Two-Dimensional Interpolation [11]

Recall the downlink variable pilot allocation in IEEE 802.16a in Fig. 2.3. The allocation formula is

$$varLocPilot_k = 3L + 12P_k \tag{2.14}$$

where:
• $varLocPilot_k$ = carrier index of a variable-location pilot.
• $L \in \{0, \cdots, 3\}$ is a function of the symbol index, modulo 4.
• $P_k \in \{0, 1, 2, \cdots, N_{varLocPilots} - 1\}$.
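The allocation rule (2.14) can be illustrated with a short sketch. It assumes 142 variable-location pilots per symbol (the value used in this system) and, purely for illustration, takes $L$ equal to the symbol index modulo 4; the actual order in which $L$ cycles is defined by the standard and is not reproduced here. The function name is hypothetical, and fixed-location pilots are not modelled.

```python
N_VAR_LOC_PILOTS = 142  # variable-location pilots per symbol in this system


def var_loc_pilots(symbol_index, n_pilots=N_VAR_LOC_PILOTS):
    """Carrier indices 3L + 12*P_k from (2.14).

    Assumption for illustration only: L = symbol_index mod 4.
    """
    L = symbol_index % 4
    return [3 * L + 12 * p_k for p_k in range(n_pilots)]


# Pilot sets of four consecutive symbols and their union.
sets = [var_loc_pilots(n) for n in range(4)]
union = sorted(set().union(*sets))
```

Within one symbol the variable-location pilots are 12 carriers apart, whereas the union over four consecutive symbols covers every third carrier from 0 to 1701. This denser grid is what the time-domain methods try to exploit, whatever the order in which $L$ cycles.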

Because the positions of the variable-location pilots vary with a period of four symbols, we could make use of the four sets of pilot locations to help channel estimation. The maximum number of pilot locations that we can use is

$$(N_{varLocPilots} - N_{CoincidentPilots}) \times 4 + N_{fixLocPilots} = (142 - 8) \times 4 + 32 = 568 \tag{2.15}$$

where $N_{CoincidentPilots}$ is the number of the variable-location pilots that coincide with the fixed-location pilots. For example, we can use extrapolation in the time domain to estimate the channel frequency response at the pilot locations of other symbols. It should work best when transmitting through a static channel. The method is illustrated in Fig. 2.6. One possible way of interpolation (extrapolation) is

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{4sets}(f) = {} & \tfrac{1}{2}\tilde{h}^p_0(f) + \tfrac{1}{2}\tilde{h}^p_{-4}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-1}(f) + \tfrac{1}{2}\tilde{h}^p_{-5}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-2}(f) + \tfrac{1}{2}\tilde{h}^p_{-6}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-3}(f) + \tfrac{1}{2}\tilde{h}^p_{-7}(f)
\end{aligned} \tag{2.16}$$

where $\tilde{h}^p_{-n}(f)$, $n = 0, 1, \cdots, 7$, are the channel frequency responses at pilot carriers in the $n$th previous symbol. We can use interpolation again in the frequency domain after obtaining $\tilde{h}^{2D\text{-}extrap\text{-}p}(f)$. Since the equivalent number of pilots becomes 568/166 = 3.421 times that of the original case, better estimation is expected. However, seven extra registers are needed to store the channel frequency responses at pilot carriers. Besides the hardware concern, a fast fading channel might seriously affect the accuracy of the extrapolation in the time domain, because we need to use the information from the seven previous symbols. Thus, an alternative is to use fewer previous symbols, say only 3 or 2 sets. Then the extrapolation

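formula becomes

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{3sets}(f) = {} & \tfrac{1}{2}\tilde{h}^p_0(f) + \tfrac{1}{2}\tilde{h}^p_{-4}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-1}(f) + \tfrac{1}{2}\tilde{h}^p_{-5}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-2}(f) + \tfrac{1}{2}\tilde{h}^p_{-6}(f)
\end{aligned} \tag{2.17}$$

and

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{2sets}(f) = {} & \tfrac{1}{2}\tilde{h}^p_0(f) + \tfrac{1}{2}\tilde{h}^p_{-4}(f) \\
& + \tfrac{1}{2}\tilde{h}^p_{-1}(f) + \tfrac{1}{2}\tilde{h}^p_{-5}(f),
\end{aligned} \tag{2.18}$$

respectively. When dealing with fading channels, we consider replacing the formulas above with

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{4sets}(f) = {} & \tilde{h}^p_0(f) \\
& + \tfrac{5}{4}\tilde{h}^p_{-1}(f) - \tfrac{1}{4}\tilde{h}^p_{-5}(f) \\
& + \tfrac{3}{2}\tilde{h}^p_{-2}(f) - \tfrac{1}{2}\tilde{h}^p_{-6}(f) \\
& + \tfrac{7}{4}\tilde{h}^p_{-3}(f) - \tfrac{3}{4}\tilde{h}^p_{-7}(f),
\end{aligned} \tag{2.19}$$

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{3sets}(f) = {} & \tilde{h}^p_0(f) \\
& + \tfrac{5}{4}\tilde{h}^p_{-1}(f) - \tfrac{1}{4}\tilde{h}^p_{-5}(f) \\
& + \tfrac{3}{2}\tilde{h}^p_{-2}(f) - \tfrac{1}{2}\tilde{h}^p_{-6}(f),
\end{aligned} \tag{2.20}$$

and

$$\begin{aligned}
\tilde{h}^{2D\text{-}extrap\text{-}p}_{2sets}(f) = {} & \tilde{h}^p_0(f) \\
& + \tfrac{5}{4}\tilde{h}^p_{-1}(f) - \tfrac{1}{4}\tilde{h}^p_{-5}(f),
\end{aligned} \tag{2.21}$$

where we emphasize the weighting of $\tilde{h}^p_n(f)$, $n = -1, -2, -3$, nearer to $\tilde{h}^p_0(f)$ in a linear fashion, because when the time variation of the channel is not overly fast, the channel coefficients can be modelled to a first-order approximation as varying linearly with time in a short-enough time span.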
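The weights in (2.16) to (2.21) can be read as follows, which is an interpretation assumed here for illustration: each pair of terms belongs to one pilot set, observed $n$ and $n + 4$ symbols ago. Formula (2.16) averages the two observations, whereas (2.19) extrapolates them linearly to the current symbol, which gives the weights $1 + n/4$ and $-n/4$. The function names and the linearly drifting test coefficient below are hypothetical.

```python
def combine_formula1(h_near, h_far):
    """Equal-weight average of two estimates on one pilot set, as in (2.16)-(2.18)."""
    return 0.5 * h_near + 0.5 * h_far


def combine_formula2(h_near, h_far, n):
    """Linear extrapolation to the current symbol, as in (2.19)-(2.21).

    h_near is the estimate from n symbols ago (n = 0, 1, 2, 3) and h_far the
    estimate from n + 4 symbols ago, both on the same pilot set.
    """
    return (1 + n / 4) * h_near - (n / 4) * h_far


def h_at(t):
    """Hypothetical channel coefficient drifting linearly with symbol index t."""
    return (0.9 - 0.3j) + (0.02 + 0.01j) * t


# Formula 2 applied to each of the four pilot sets (n = 0, 1, 2, 3).
extrapolated = [combine_formula2(h_at(-n), h_at(-n - 4), n) for n in range(4)]

# Formula 1 on the oldest pair (3 and 7 symbols ago) lags the current value.
averaged = combine_formula1(h_at(-3), h_at(-7))
```

For a coefficient that varies exactly linearly in time, the extrapolation recovers the current value `h_at(0)` on every pilot set, while the plain average returns the value from five symbols earlier. The price of the extrapolation is larger weights and hence more noise amplification.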

Figure 2.7: Adaptive channel estimation using the LMS algorithm.

2.2.3.2 Least Mean Square (LMS) Adaptation [12], [14]

The LMS algorithm is the most widely used adaptive filtering algorithm in practice because of its simplicity. It is also stable and robust against different channel conditions. The LMS channel estimation process is illustrated in Fig. 2.7, where $X(f)$ is the input signal sent into the channel, $H(f)$ is the channel frequency response, and $Y(f)$ is the channel output. The following equations apply to our work, where $H^n_{LMS}(f)$ is the estimated channel response at the $n$th symbol.

• Filtering by channel:

$$y(n) = h(n) * x(n), \tag{2.22}$$

$$Y(f) = H(f) \cdot X(f). \tag{2.23}$$

• Estimated error:

$$e(f) = \hat{X}_{after\ decision}(f) - \hat{X}(f), \tag{2.24}$$

$$\hat{X}(f) = \frac{Y(f)}{H^n_{LMS}(f)}. \tag{2.25}$$

• Cost function:

$$\hat{\xi}(f) = e^2(f) = |\hat{X}_{after\ decision}(f) - \hat{X}(f)|^2. \tag{2.26}$$

• Channel frequency response adaptation:

$$H^{n+1}_{LMS}(f) = H^n_{LMS}(f) + \mu e^*(f) \hat{X}(f), \tag{2.27}$$

where $\mu$ is the step size, which affects the speed of convergence. With a larger step size, the estimated channel converges more quickly to the real channel response. However, if it is too big, it may lead to an unstable condition.

To minimize the error shown in (2.24), we try to minimize the expected value of (2.26). For this, we can tune the estimated channel weights adaptively. In our simulation, we use the interpolated channel estimate $\tilde{H}^0_{interp}(f)$ as $H^0_{LMS}(f)$, and $H^n_{LMS}(f)$ is obtained by (2.27) when $n > 0$. Following the algorithm, only the first symbol's pilot information is used in the whole flow, so the pilot information in the other symbols is wasted. We therefore try to combine the interpolated channel and $H^n_{LMS}(f)$, the channel estimated by the LMS algorithm when $n > 0$. The combination is given by

$$\tilde{H}^n_{modified\ LMS}(f) = \begin{cases}
\alpha \cdot H^n_{LMS}(f) + (1 - \alpha) \cdot \tilde{H}^n_{interp}(f), & n > 0, \\
\tilde{H}^n_{interp}(f), & n = 0,
\end{cases} \tag{2.28}$$

where $H^n_{LMS}(f)$ is the channel estimated by the LMS adaptation algorithm and $\tilde{H}^n_{interp}(f)$ is the channel estimated by interpolation. The $\alpha$ and the $(1 - \alpha)$ are the weighting factors for $H^n_{LMS}(f)$ and $\tilde{H}^n_{interp}(f)$, respectively. Therefore, $\tilde{H}^n_{modified\ LMS}(f)$ is the combination of these two kinds of estimation outcomes and may be more accurate. Then, we use $\tilde{H}^n_{modified\ LMS}(f)$ in place of $H^n_{LMS}(f)$ on the right-hand side of (2.27) to calculate the estimated channel response for the next symbol.
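The combination rule (2.28) for a single subcarrier can be sketched as below. The function name and the numerical values, including the weighting factor 0.25, are hypothetical and only illustrate how the two estimates are blended; the LMS update (2.27) itself is not reproduced.

```python
def modified_lms(h_lms, h_interp, alpha, n):
    """Blend of the LMS and interpolated estimates for symbol n, as in (2.28)."""
    if n == 0:
        # The first symbol has no LMS history, so the interpolated estimate is used.
        return h_interp
    return alpha * h_lms + (1 - alpha) * h_interp


# Hypothetical estimates on one subcarrier.
h_lms_example = 1 + 1j
h_interp_example = 3 - 1j

first_symbol = modified_lms(h_lms_example, h_interp_example, 0.25, 0)
later_symbol = modified_lms(h_lms_example, h_interp_example, 0.25, 2)
```

With $\alpha$ close to 1 the result follows the decision-directed LMS track, and with $\alpha$ close to 0 it follows the pilot-based interpolation of the current symbol.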

Chapter 3
DSP Introduction

DSP implementation is the final goal of our work. The DSP platform that we use is the Quixote board produced by Innovative Integration. The DSP on the board is the TMS320C6416 made by Texas Instruments. In this chapter, we introduce the architectures of the Quixote board and the DSP chip.

3.1 Introduction to TMS320C6416 DSP [16]

3.1.1 TMS320C6416 Features

The TMS320C64x DSPs are the highest-performance fixed-point DSP generation of the TMS320C6000 DSP devices, with a performance of up to 6000 million instructions per second (MIPS) and an efficient C compiler. The TMS320C64x device is based on the second-generation high-performance, very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI). The C6416 device has two high-performance embedded coprocessors, the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP), that significantly speed up channel-decoding operations on-chip, but they do not apply to the work reported in this thesis. The C64x core CPU consists of 64 general-purpose 32-bit registers and 8 functional units. These 8 functional units contain:

• Two multipliers.
• Six ALUs.

Features of C6000 devices include:
• Advanced VLIW CPU with eight functional units, including two multipliers and six arithmetic units:
  – Executes up to eight instructions per cycle.
  – Allows designers to develop highly effective RISC-like code for fast development time.
• Instruction packing:
  – Gives code size equivalence for eight instructions executed serially or in parallel.
  – Reduces code size, program fetches, and power consumption.
• Conditional execution of all instructions:
  – Reduces costly branching.
  – Increases parallelism for higher sustained performance.
• Efficient code execution on independent functional units:
  – Efficient C compiler on DSP benchmark suite.
  – Assembly optimizer for fast development and improved parallelization.
• 8/16/32-bit data support, providing efficient memory support for a variety of applications.
• 40-bit arithmetic options add extra precision for applications requiring it.

• Saturation and normalization provide support for key arithmetic operations.
• Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications.

The additional features of C64x include:
• Each multiplier can perform two 16×16-bit or four 8×8-bit multiplies every clock cycle.
• Quad 8-bit and dual 16-bit instruction set extensions with data flow support.
• Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.
• Special communication-specific instructions have been added to address common operations in error-correcting codes.
• Bit count and rotate hardware extends support for bit-level algorithms.

3.1.2 Central Processing Unit

The block diagram of the C6416 DSP is shown in Fig. 3.1. The DSP contains:
• Program fetch unit.
• Instruction dispatch unit.
• Instruction decode unit.
• Two data paths, each with four functional units.
• 64 32-bit registers.
• Control registers.

• Control logic.
• Test, emulation, and interrupt logic.

Figure 3.1: Block diagram of the TMS320C6416 DSP [16].

The TMS320C64x DSP pipeline provides flexibility to simplify programming and improve performance. The pipeline can dispatch eight parallel instructions every cycle. These two factors provide this flexibility:
• Control of the pipeline is simplified by eliminating pipeline interlocks.
• Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operations. This provides single-cycle throughput.

3.1.2.1 Pipeline

The pipeline phases are divided into three stages as shown in Fig. 3.2:
• Fetch has four phases:

  – PG (program address generate): The address of the fetch packet is determined.
  – PS (program address send): The address of the fetch packet is sent to memory.
  – PW (program access ready wait): A program memory access is performed.
  – PR (program fetch packet receive): The fetch packet is at the CPU boundary.
• Decode has two phases:
  – DP (instruction dispatch): The next execute packet in the fetch packet is determined and sent to the appropriate functional units to be decoded.
  – DC (instruction decode): Instructions are decoded in functional units.
• Execute has five phases:
  – E1: Execute 1.
  – E2: Execute 2.
  – E3: Execute 3.
  – E4: Execute 4.
  – E5: Execute 5.

Figure 3.2: Pipeline phases of TMS320C6416 DSP [16].

The pipeline operation of the C62x/C64x instructions can be categorized into seven instruction types. Six of these are shown in Table 3.1, which gives a mapping of operations occurring in each execution phase for the different instruction types. The delay slots associated with each instruction type are listed in the bottom row. The execution of instructions can be defined in terms of delay slots. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction. Results from instructions with delay slots are not available until the end of the last delay slot. For example, a multiply instruction has one delay slot, which means that one CPU cycle elapses before the results of the multiply are available for use by a subsequent instruction. However, results are available from other instructions finishing execution during the same CPU cycle in which the multiply is in a delay slot.

Table 3.1: Execution Stage Length Description for Each Instruction Type [16].

3.1.2.2 Functional Units

The eight functional units in the C6000 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The functional units are described in Table 3.2. Besides being able to perform 32-bit operations, the C64x also contains many 8-bit to 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8×8 unsigned multiplies with a single instruction on an .M unit. The ADD4 instruction performs four 8-bit additions with a single instruction on an .L unit.

The data line in the CPU supports 32-bit operands, long (40-bit) operands, and double word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (shown in Fig. 3.3). All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, when performing 32-bit operations all eight units can be used in parallel every cycle.

3.1.3 Memory Architecture

The C64x has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices to a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory and a single internal port to access internal program memory, with an instruction-fetch width of 256 bits. A variety of memory options are available for the C6000 platform. In our system,

Table 3.2: Functional Units and Operations Performed [16]

.L unit (.L1, .L2):
• 32/40-bit arithmetic and compare operations
• 32-bit logical operations
• Leftmost 1 or 0 counting for 32 bits
• Normalization count for 32 and 40 bits
• Byte shifts
• Data packing/unpacking
• 5-bit constant generation
• Dual 16-bit arithmetic operations
• Quad 8-bit arithmetic operations
• Dual 16-bit min/max operations
• Quad 8-bit min/max operations

.S unit (.S1, .S2):
• 32-bit arithmetic operations
• 32/40-bit shifts and 32-bit bit-field operations
• 32-bit logical operations
• Branches
• Constant generation
• Register transfers to/from control register file (.S2 only)
• Byte shifts
• Data packing/unpacking
• Dual 16-bit compare operations
• Quad 8-bit compare operations
• Dual 16-bit shift operations
• Dual 16-bit saturated arithmetic operations
• Quad 8-bit saturated arithmetic operations

.M unit (.M1, .M2):
• 16 x 16 multiply operations
• 16 x 32 multiply operations
• Quad 8 x 8 multiply operations
• Dual 16 x 16 multiply operations
• Dual 16 x 16 multiply with add/subtract operations
• Quad 8 x 8 multiply with add operation
• Bit expansion
• Bit interleaving/de-interleaving
• Variable shift operations
• Rotation
• Galois Field Multiply

.D unit (.D1, .D2):
• 32-bit add, subtract, linear and circular address calculation
• Loads and stores with 5-bit constant offset
• Loads and stores with 15-bit constant offset (.D2 only)
• Load and store double words with 5-bit constant
• Load and store non-aligned words and double words
• 5-bit constant generation
• 32-bit logical operations

Figure 3.3: TMS320C64x CPU data path [16].

the memory types we can use are:
• On-chip RAM, up to 875 KB.
• Program cache.
• 32-bit external memory interface that supports SDRAM, SBSRAM, SRAM, and other asynchronous memories.
• Two-level caches [20]. Level 1 cache is split into program (L1P) and data (L1D) caches. Each L1 cache is 16 KB. Level 2 memory is configurable and can be split into L2 SRAM (addressable on-chip memory) and L2 cache for caching external memory locations. The size of L2 is 1 MB. External memory can be several MB large. Its access speed depends on the memory technology used, with clock rates typically around 100 to 133 MHz. In our system, the external memory usable by the DSP is a 32 MB SDRAM.

3.2 Introduction to the Quixote cPCI Board [15]

The Quixote is one of Innovative Integration's Velocia-family baseboards for applications requiring speed and processing power. Quixote features a processing core built around Texas Instruments' fixed-point TMS320C6416 and a Xilinx Virtex-II FPGA, with 32 MB of DSP RAM and 2 MB of FPGA computation RAM (optional). The TI C6416 DSP operating at 600 MHz offers a processing power of 4800 MIPS. The analog I/O features of the board include dual channels of 105 MHz A/D and D/A (2 in, 2 out). A block diagram of the Quixote board is shown in Fig. 3.4.

The Quixote card has a 32 MB SDRAM for use by the DSP. When used with the advanced cache controller on the 'C6416, the SDRAM provides a large, fast external memory pool for DSP data and code. The Quixote has a serial EEPROM for storing data such as board identification, calibration coefficients, and other data that needs

to be stored permanently on the card. This memory is 16K bits in size. The Pismo Toolset includes functions for the serial EEPROM that allow the software application programmer to easily write to and read from the memory without controlling the low-level interface. The Caliente subsystem handles the details of interacting with the baseboard in streaming mode. There are three ways to transmit data between the host PC and the DSP: data streaming, block mode data streams, and message packet I/O.

Figure 3.4: Block diagram of Quixote [15].

Data Streaming. To address high-bandwidth data transfer applications, Quixote is capable of continuous transmission and reception of data via the PCI bus, using a mechanism called streaming. When streaming, the target DSP, which must be running a downloaded DSP application, transfers data between target DSP memory and host PC memory automatically with no host intervention. Streaming input is independent of streaming output. It is possible to acquire data from any number and mix of input devices at a programmed rate. Simultaneously, data may be streamed out to a variety of output devices at a different programmed rate. Data flow is fully controlled by use of device drivers called from within the DSP target application. During data streaming on baseboards, data flows between peripherals and a dedicated, onboard digital signal processor (DSP) while data simultaneously flows between the DSP and the host application software. The dedicated DSP can extensively process data as it travels between peripherals and the host application. Fig. 3.5 illustrates the data streaming operation.

Figure 3.5: Block diagram of DSP streaming mode [15].

Block Mode Data Streams. An alternate data flow paradigm is supported for non-channelized peripherals. This mode is referred to as block mode streaming. In block mode, the splitter/merger features of Caliente are bypassed, and raw, binary data in peripheral-specific format is consumed and supplied by the application program. Devices that produce data that can be channelized may elect to use block mode because of its higher inherent efficiency. For very high rate applications, any processing done to each point may result in a reduction in the maximum data rate that can be achieved. Since block mode does no implicit processing on a point-by-point basis, the fastest data rates are achievable using this mode.

Message Packet I/O. In many applications, there is a need for additional, low-bandwidth channels in addition to a high-rate data stream. Velocia baseboards feature a means to support the asynchronous interchange of low-bandwidth data in conjunction with high-bandwidth streaming mode I/O. Message packets consist of a command code and channel number plus up to 14 additional 32-bit parametric data values. Messages may be asynchronously transmitted and received from any number of distinct channels by any number of threads running on both the target DSP and the host PC. Message transfers have no deleterious effect on data streaming and consume virtually none of the bandwidth of the DSP, so they may be freely used even in conjunction with full-rate data streaming. In our implementations, we use block mode data streams the most and also use message packet I/O [24].

The Virtex-II FPGA includes 18×18 hardware multipliers and contains up to 12 digital clock managers, each providing 256 subdivisions of phase shifting and frequency synthesis capabilities to deliver flexibility in managing both on-chip and off-chip clock domains and synchronization. On-chip memory blocks in the Virtex-II fabric provide convenient high-speed memory elements for FIFOs, dual-port RAM, and local processing memory that are invaluable in efficient logic design.

3.3 Introduction to the Code Composer Studio Development Tools [17], [18]

TI provides a useful GUI development tool set for DSP users to develop and debug their projects: the Code Composer Studio (CCS). The CCS development tools are a key element of the DSP software and development tools from Texas Instruments. The fully integrated development environment includes real-time analysis capabilities, an easy-to-use debugger, a C/C++ compiler, an assembler, a linker, an editor, a visual project manager, simulators, XDS560 and XDS510 emulation drivers, and DSP/BIOS support. Some of CCS's fully integrated host tools include:
• Simulators for full devices, CPU only, and CPU plus memory for optimal performance.
• Integrated visual project manager with source control interface, multi-project support, and the ability to handle thousands of project files.
• Source code debugger with a common interface for both simulator and emulator targets:
  – C/C++/assembly language support.
  – Simple breakpoints.
  – Advanced watch window.
  – Symbol browser.
• DSP/BIOS host tooling support (configure, real-time analysis, and debug).
• Data transfer for real-time data exchange between host and target.
• Profiler to analyze code performance.

CCS also delivers foundation software consisting of:
• DSP/BIOS kernel for the TMS320C6000 DSPs.
  – Pre-emptive multi-threading.
  – Interthread communication.
  – Interrupt handling.
• TMS320 DSP Algorithm Standard to enable software reuse.
• Chip Support Libraries (CSL) to simplify device configuration. CSL provides C-program functions to configure and control on-chip peripherals.

TI also provides some optimized DSP functions for the TMS320C64x devices: the TMS320C64x digital signal processor library (DSPLIB). This source code library includes C-callable functions (ANSI-C language compatible) for general signal processing mathematical and vector functions [19]. The routines included in the DSP library are organized as follows:
• Adaptive filtering.
• Correlation.
• FFT.
• Filtering and convolution.
• Math.
• Matrix functions.
• Miscellaneous.

3.4 Code Optimization Methods [21]

The recommended code development flow involves utilizing the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. This approach lets the compiler do all the laborious work of instruction selection, parallelizing, pipelining, and register allocation. It also simplifies the maintenance of the code, as everything resides in a C framework that is simple to maintain, support, and upgrade.

The recommended code development flow for the C6000 involves the phases described in Fig. 3.6. The tutorial section of the Programmer's Guide [21] focuses on phases 1 and 2, and the Guide also instructs the programmer about the tuning stage of phase 3. What is learned is the importance of giving the compiler enough information to fully maximize its potential. An added advantage is that this compiler provides direct feedback on the entire program's high-MIPS areas (loops). Based on this feedback, there are some simple steps the programmer can take to pass complete and better information to the compiler to maximize the compiler performance.

The following items list the goal for each phase in the software development flow shown in Fig. 3.6.
• Develop C code (phase 1) without any knowledge of the C6000. Use the C6000 profiling tools to identify any inefficient areas that we might have in the C code. To improve the performance of the code, proceed to phase 2.
• Use techniques described in [21] to improve the C code. Use the C6000 profiling tools to check its performance. If the code is still not as efficient as we would like it to be, proceed to phase 3.
• Extract the time-critical areas from the C code and rewrite the code in linear assembly. We can use the assembly optimizer to optimize this code.

Figure 3.6: Code development flow for TI C6000 DSP [21].

TI provides high-performance C program optimization tools and does not recommend that the programmer code by hand in assembly. In this thesis, the development flow is stopped at phase 2. We do not optimize the code by writing linear assembly. Coding the program in a high-level language keeps the flexibility of porting to other platforms.

3.4.1 Compiler Optimization Options [17], [18]

The compiler supports several options to optimize the code. The compiler options can be used to optimize code size or execution performance. Our primary concern in this work is the execution performance. Hence we do not care very much about the code size. The easiest way to invoke optimization is to use the cl6x shell program, specifying the -on option on the cl6x command line, where n denotes the level of optimization (0, 1, 2, 3), which controls the type and degree of optimization:
• -o0.
  – Performs control-flow-graph simplification.
  – Allocates variables to registers.
  – Performs loop rotation.
  – Eliminates unused code.
  – Simplifies expressions and statements.
  – Expands calls to functions declared inline.
• -o1. Performs all -o0 optimizations, and:
  – Performs local copy/constant propagation.
  – Removes unused assignments.
  – Eliminates local common expressions.

• -o2. Performs all -o1 optimizations, and:
  – Performs software pipelining.
  – Performs loop optimizations.
  – Eliminates global common subexpressions.
  – Eliminates global unused assignments.
  – Converts array references in loops to incremented pointer form.
  – Performs loop unrolling.
• -o3. Performs all -o2 optimizations, and:
  – Removes all functions that are never called.
  – Simplifies functions with return values that are never used.
  – Inlines calls to small functions.
  – Reorders function declarations so that the attributes of called functions are known when the caller is optimized.
  – Propagates arguments into function bodies when all calls pass the same value in the same argument position.
  – Identifies file-level variable characteristics.

The -o2 level is the default if -o is set without an optimization level.

Program-level optimization can be specified by using the -pm option with the -o3 option. With program-level optimization, all of the source files are compiled into one intermediate file called a module. The module moves through the optimization and code generation passes of the compiler. Because the compiler can see the entire program, it performs several optimizations that are rarely applied during file-level optimization:

• If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
• If a return value of a function is never used, the compiler deletes the return code in the function.
• If a function is not called directly or indirectly, the compiler removes the function.

When program-level optimization is selected in Code Composer Studio, options that have been selected to be file-specific are ignored. Program-level optimization is the highest-level optimization option. We use this option to optimize our code.

3.4.2 Using Intrinsics

The C6000 compiler provides intrinsics, special functions that map directly to C64x instructions, to optimize our C code performance. All instructions that are not easily expressed in C code are supported as intrinsics. Intrinsics are specified with a leading underscore (_) and are accessed by calling them as we call a function. A table of TMS320C6000 C/C++ compiler intrinsics can be found in [21]. The intrinsics used in our program are introduced in Chapter 4.

Chapter 4
Simulation and DSP Implementation

Our work and results can be separated into two parts. The first part concerns the performance of each channel estimation approach, measured by symbol error rate (SER), mean square error (MSE), etc. The second part concerns the DSP implementation, which emphasizes the execution efficiency.

4.1 Comparison Between 2-D Interpolation and LMS Adaptive Methods

Fig. 4.1 illustrates the block diagram of the simulated system. We assume perfect synchronization and omit it in the simulation. After channel estimation, we get the MSE between the real channel response and the estimated one. Also, the SER can be calculated after de-mapping, i.e., de-QAM. The channel estimation contains several steps:
• Channel response estimation at each pilot location.
• Interpolation for the whole channel response using the estimated values at pilot locations, which may include use of the LMS algorithm.
• Estimating the transmitted signal using a divider.
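The last step above (division by the channel estimate) and the MSE measurement can be sketched as follows. The function names, the three-carrier example, and the 10% estimation error are hypothetical and are not taken from the simulation program.

```python
def equalize(y, h_est):
    """One-tap equalizer: X_hat(k) = Y(k) / H_est(k) on every subcarrier."""
    return [yk / hk for yk, hk in zip(y, h_est)]


def mean_square_error(estimate, reference):
    """Average of |estimate - reference|^2 over all entries."""
    return sum(abs(e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


# Hypothetical noise-free example on three subcarriers.
x = [1 + 1j, -1 + 1j, 1 - 1j]      # transmitted symbols, |x|^2 = 2
h = [0.5 + 0j, 1j, 2 - 1j]         # true channel
y = [hk * xk for hk, xk in zip(h, x)]

# Perfect channel knowledge recovers the symbols.
mse_perfect = mean_square_error(equalize(y, h), x)

# A channel estimate that is 10% too large scales every symbol by 1/1.1.
h_biased = [1.1 * hk for hk in h]
mse_biased = mean_square_error(equalize(y, h_biased), x)
```

The second case shows how a channel estimation error turns directly into an error on the equalized symbols, which is the quantity $|\hat{X}_i - X_i|^2$ examined in the simulations.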

Figure 4.1: Block diagram of the simulated system.

Figure 4.2: Channel estimation steps.

These steps are illustrated in Fig. 4.2.

4.1.1 Simulation Results for AWGN Channel

Before considering multipath channels, we do simulation with an AWGN channel, which means we transmit the data through a one-path channel with $h[0] = 1$, and then add AWGN to it. The theoretical symbol error rate with Gaussian noise power $N_0$ for $M$-ary QAM can be obtained by [23]

$$P_e = 4\left(1 - \frac{1}{\sqrt{M}}\right) Q\left(\sqrt{\frac{3 N E_b}{(M - 1) N_0}}\right) \tag{4.1}$$

where $N = \log_2 M$; for 64-QAM we have $N = 6$ with $M = 64$. The $E_b$ is $E_s/6$, and $E_s$ is normalized to 1 in our simulation. If we substitute $E[|\hat{X}_i - X_i|^2]$ for $N_0$, we can get a theoretical symbol error rate. The results are shown in Figs. 4.3 and 4.4, where we call (2.16) formula 1 and (2.19) formula 2, and linear interpolation is used. The modulation scheme is 64-QAM. We can see that the

theoretical SERs are close to the simulated ones whichever formula we use. We also see that formula 1 works better than formula 2. We calculate the ratio between the coefficients of formula 1 and formula 2 this way:

\[ \frac{\left(\frac{1}{2}\right)^2 \times 8}{1^2 + \left(\frac{5}{4}\right)^2 + \left(\frac{1}{4}\right)^2 + \left(\frac{3}{2}\right)^2 + \left(\frac{1}{2}\right)^2 + \left(\frac{7}{4}\right)^2 + \left(\frac{3}{4}\right)^2} = 0.2286 \qquad (4.2) \]

and the simulated ratios are listed in Table 4.1. The simulated ratios are close to 0.2286.

Figure 4.3: MSE of $|\hat{X}_i - X_i|$ for AWGN channel.

4.1.2 Simulation Results for Static Multipath Channel

We employ the ATTC (Advanced Television Technology Center) and the Grand Alliance DTV Laboratory's ensemble E mode channel response, assuming the channel is static. The response is given in Table 4.2. The phase in the time domain is π/4. The amplitude and phase responses of this channel are shown in Fig. 4.5.
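Both (4.1) and the coefficient ratio in (4.2) are easy to check numerically. The sketch below is an illustration rather than the simulation code of this work: the function names are hypothetical, the Q-function is expressed through the standard complementary error function, and the noise level is passed as $E_s/N_0$ in dB with $E_b = E_s/N$ as stated for (4.1).

```python
import math
from fractions import Fraction


def qfunc(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def qam_ser(es_over_n0_db, m=64):
    """Symbol error rate of M-ary QAM following (4.1).

    Pe = 4 (1 - 1/sqrt(M)) Q( sqrt( 3 N Eb / ((M - 1) N0) ) ),
    with N = log2(M) and Eb = Es / N.
    """
    n_bits = math.log2(m)
    es_over_n0 = 10.0 ** (es_over_n0_db / 10.0)
    eb_over_n0 = es_over_n0 / n_bits
    arg = math.sqrt(3.0 * n_bits * eb_over_n0 / (m - 1))
    return 4.0 * (1.0 - 1.0 / math.sqrt(m)) * qfunc(arg)


# Exact evaluation of the coefficient ratio in (4.2).
numerator = Fraction(1, 2) ** 2 * 8
denominator = sum(
    Fraction(a, b) ** 2
    for a, b in [(1, 1), (5, 4), (1, 4), (3, 2), (1, 2), (7, 4), (3, 4)]
)
ratio = numerator / denominator  # equals 8/35, about 0.2286

if __name__ == "__main__":
    for snr_db in (15.0, 20.0, 25.0, 30.0):
        print(snr_db, qam_ser(snr_db))
    print(ratio, float(ratio))
```

To reproduce the theoretical curves described in the text, the measured $E[|\hat{X}_i - X_i|^2]$ would take the place of $N_0$, i.e. the argument would be $10\log_{10}(E_s / \mathrm{MSE})$ with $E_s = 1$. Note that (4.1) is an approximation and can exceed 1 at very low $E_s/N_0$.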

Figure 4.4: The (a) MSE and (b) SER for AWGN channel simulation.

Figure 4.5: (a) Amplitude response and (b) phase response of the channel given in Table 4.2.
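A frequency response of the kind plotted in Fig. 4.5 can be computed directly from the tap delays and average powers of Table 4.2. The following is a rough sketch under stated assumptions, not the code used for the figure: each tap amplitude is taken as the square root of its average power, every tap is given the same π/4 phase, and an FFT size of 2048 is assumed. The names `DELAYS`, `POWERS_DB`, `tap_gains` and `freq_response` are hypothetical.

```python
import cmath
import math

# Tap delays (in OFDM samples) and average powers (in dB) from Table 4.2.
DELAYS = [0, 2, 17, 36, 75, 137]
POWERS_DB = [0.0, -5.0, -7.0, -8.87, -10.0, -10.0]
N_FFT = 2048  # assumed FFT size


def tap_gains(powers_db, phase=math.pi / 4):
    """Complex tap gains: amplitude sqrt(linear power), common phase (assumed)."""
    return [
        math.sqrt(10.0 ** (p / 10.0)) * cmath.exp(1j * phase)
        for p in powers_db
    ]


def freq_response(delays, gains, n_fft):
    """H[k] = sum_l g_l * exp(-j 2 pi k d_l / n_fft) for k = 0 .. n_fft - 1."""
    response = []
    for k in range(n_fft):
        h_k = 0j
        for d, g in zip(delays, gains):
            h_k += g * cmath.exp(-2j * math.pi * k * d / n_fft)
        response.append(h_k)
    return response


gains = tap_gains(POWERS_DB)
H = freq_response(DELAYS, gains, N_FFT)
amplitude = [abs(h) for h in H]          # amplitude response
phase = [cmath.phase(h) for h in H]      # phase response in radians
```

Plotting `amplitude` and `phase` against the subcarrier index gives curves comparable in kind to Fig. 4.5, although the exact shape depends on the assumptions listed above.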

Table 4.1: MSE Ratio Between Formula 1 and Formula 2 for AWGN Channel

Es/N0 (dB)                    15       17.5     20       22.5     25       27.5
MSE_formula1 / MSE_formula2   0.24055  0.23916  0.23756  0.23776  0.23769  0.23417

Es/N0 (dB)                    40       37.5     35       32.5     30
MSE_formula1 / MSE_formula2   0.23407  0.22929  0.22764  0.22158  0.21259

Table 4.2: Channel Impulse Response

Tap   Delay (OFDM Samples)   Average Power   Average Power (in dB)
1     0                      1               0
2     2                      0.3162          -5
3     17                     0.1995          -7
4     36                     0.1296          -8.87
5     75                     0.1             -10
6     137                    0.1             -10

4.1.2.1 Two-Dimensional Interpolation

In this section, we compare the two interpolation schemes proposed in Chapter 2. We use different numbers of sets in these two formulas, which means that different amounts of previous symbols' information are employed.

To verify the correspondence between the simulation results and the theory, we calculate the average $|\hat{X}_i - X_i|^2$ on subcarrier 1 (see Fig. 2.7; note that subcarrier indexes run from 0 to 1701) and on subcarrier 1700 by simulating 1000 symbols. The theoretical symbol error rate follows (4.1). Fig. 4.6 shows the MSE of $|\hat{X}_i - X_i|$ on subcarrier 1, where we use formula 1 and linear interpolation. Fig. 4.7 gives the MSE and SER on subcarrier 1. The theoretical values are calculated with $N_0 = |\hat{X}_i - X_i|^2$ in (4.1), at both low and high SNR. We find that the theoretical results are close to the simulated ones, and we conclude that the simulation results seem correct. Figs. 4.8 and 4.9 show the MSE of $|\hat{X}_i - X_i|$ and the MSE and SER on subcarrier 1700. The results are similar.

Figure 4.6: MSE of $|\hat{X}_i - X_i|$ on subcarrier 1.

Fig. 4.10 shows the outcomes of formula 1 with both linear and 2nd-order interpolations. As expected, the more sets of pilot information we use, the better the performance. The MSE and SER of the 2nd-order interpolation method decrease faster than those of the linear one for $E_s/N_0 > 22.5$ dB. The SER of the 4-set interpolation decreases to zero because we have only run 1000 symbols. This shows that 2-D interpolation is useful in the static channel condition. On the whole, the difference between these two interpolation methods is small, but the 2nd-order interpolation is more complex than the linear one. Formula 2 yields results with many similar properties, which are given in Fig. 4.11.

We now compare the performance of formula 1 and formula 2. Fig. 4.12 shows that formula 1 works better than formula 2; here linear interpolation is used. This is because we weight the 2 pilot symbols' information equally in formula 1, which is reasonable in a static channel. In formula 2, we emphasize the

Figure 4.7: The (a) MSE and (b) SER on subcarrier 1 of the 2-D interpolation using formula 1 with linear interpolation in the frequency domain.

Figure 4.8: MSE of $|\hat{X}_i - X_i|$ on subcarrier 1700.

pilot-symbol information closer to the present symbol. This may not be effective in estimating a static channel response, because the information from symbols farther away from the present symbol may sometimes be more accurate owing to the different AWGN. We also calculate the MSE ratio between formula 1 and formula 2 in Table 4.3 and find that the simulated ratios are close to 0.2286 at low $E_s/N_0$. The same comparison is given in Fig. 4.13 with 2nd-order interpolation. Both figures show that, with 4 sets of pilot symbols employed, the SER of formula 1 drops to zero 2.5 dB earlier than that of formula 2. Here, too, the SER drops to zero only because we have run just 1000 symbols.
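The advantage of equal weighting in a static channel can be illustrated with a small calculation. When several estimates of the same channel value carry independent noise of equal variance, a weighted sum with weights $w_i$ has noise variance proportional to $\sum_i w_i^2$, which is smallest for equal weights when the weights sum to 1. The sketch below is only an illustration of that principle: the weight sets are hypothetical and are not the coefficients of formula 1 or formula 2, and the function names are assumptions.

```python
import random


def noise_gain(weights):
    """Noise variance multiplier sum(w_i^2) of a weighted sum of
    independent, equal-variance noisy estimates."""
    return sum(w * w for w in weights)


def empirical_gain(weights, trials=20000, seed=1):
    """Monte Carlo estimate of the same multiplier using unit-variance noise."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        err = sum(w * rng.gauss(0.0, 1.0) for w in weights)
        acc += err * err
    return acc / trials


# Hypothetical weight sets; both sum to 1.
equal_weights = [0.5, 0.5]        # both pilot symbols weighted equally
recent_heavy = [0.75, 0.25]       # more weight on the nearer pilot symbol

gain_equal = noise_gain(equal_weights)    # 0.5
gain_recent = noise_gain(recent_heavy)    # 0.625
```

With these example weights, emphasizing the nearer pilot symbol raises the residual noise power by a factor of 0.625 / 0.5 = 1.25 when the channel does not change between symbols. In a time-varying channel that penalty may be outweighed by the reduced tracking error.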

Figure 4.9: The (a) MSE and (b) SER on subcarrier 1700 of the 2-D interpolation using formula 1 with linear interpolation in the frequency domain.

Figure 4.10: The (a) MSE and (b) SER of the 2-D interpolation using formula 1 with linear and 2nd-order interpolation in the frequency domain.

Figure 4.11: The (a) MSE and (b) SER of the 2-D interpolation using formula 2 with linear and 2nd-order interpolation in the frequency domain.

Figure 4.12: The (a) MSE and (b) SER of using formula 1 and formula 2 in the 2-D interpolation with linear interpolation in the frequency domain.

Figure 4.13: The (a) MSE and (b) SER of using formula 1 and formula 2 in the 2-D interpolation with 2nd-order interpolation in the frequency domain.