Study and DSP Implementation of IEEE 802.16a TDD OFDMA Uplink Synchronization


National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics
Master's Thesis

Study and DSP Implementation of IEEE 802.16a TDD OFDMA Uplink Synchronization

Student: Hsiao Ching Lin
Advisor: Dr. David W. Lin
June 2004

Study and DSP Implementation of IEEE 802.16a TDD OFDMA Uplink Synchronization

Student: Hsiao Ching Lin
Advisor: Dr. David W. Lin

A Thesis Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering

June 2004
Hsinchu, Taiwan, Republic of China

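Study and DSP Implementation of IEEE 802.16a TDD OFDMA Uplink Synchronization

Student: Hsiao Ching Lin. Advisor: Dr. David W. Lin.
Department of Electronics Engineering, Institute of Electronics, National Chiao Tung University

Chinese Abstract

Orthogonal frequency-division multiplexing (OFDM) effectively addresses many problems in communication systems, such as multipath fading and narrowband interference, and multiuser OFDM systems can allocate bandwidth more efficiently according to the needs of the users. In this thesis, we use a digital signal processor (DSP) to implement the uplink synchronization mechanism for a time-division duplex (TDD) OFDMA environment. The DSP environment is Innovative Integration's Quixote PC plug-in card, which carries Texas Instruments' TMS320C6416, a processor with powerful arithmetic capability.

The uplink synchronization structure we consider is as follows. Uplink transmission requires time synchronization to detect the arrival time of the signal. An estimation error reduces the ability of the guard interval to prevent the intersymbol interference (ISI) caused by multipath delay. We divide uplink synchronization into two stages. The first stage uses the guard interval, which is specific to OFDM systems, to estimate the approximate start time of the OFDM symbol. This is possible because the guard interval gives each symbol a high degree of autocorrelation. The second stage uses the uplink preamble to determine the precise start time of the OFDM symbol. We try two methods for the second stage, which correlate the received signal with the uplink preamble in the time domain and in the frequency domain, respectively, and find the time with the maximum correlation.

To reduce the computational complexity on the DSP, we first convert the original floating-point C program to a fixed-point version, and then modify the program further by taking the characteristics of the TMS320C64x DSP into account. In the end, we speed up the uplink synchronization mechanism on the DSP by a factor of 374.

In this thesis, we first briefly introduce the uplink synchronization mechanism for the TDD OFDMA environment. Next, we introduce the operating environment of the DSP. Finally, we describe how the features of the DSP are used to speed up the program, and we provide experimental results on execution speed and synchronization performance.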

Study and DSP Implementation of IEEE 802.16a TDD OFDMA Uplink Synchronization

Student: Hsiao Ching Lin. Advisor: Dr. David W. Lin.
Department of Electronics Engineering, Institute of Electronics, National Chiao Tung University

Abstract

OFDM is an effective transmission scheme for coping with many transmission impairments, such as multipath fading and narrowband interference. Multiuser OFDM offers high flexibility in allocating the bandwidth according to the needs of users. In this thesis, we focus on TDD OFDMA uplink synchronization based on IEEE 802.16a. We use a digital signal processor to implement uplink synchronization schemes. The digital signal processing environment is Innovative Integration's Quixote personal computer card, which hosts Texas Instruments' TMS320C6416, a powerful signal processor with strong arithmetic capability.

Time synchronization is performed to detect the start time of symbols for uplink synchronization. Time synchronization errors decrease the ability of the guard interval to avoid the ISI introduced by a multipath channel. There are two stages in the uplink synchronization. The first stage uses the guard interval to estimate the OFDM symbol start time roughly. The reason for using the guard interval is that it provides strong autocorrelation within an OFDM symbol. The second stage uses the preamble information to detect the symbol start time exactly. We present two schemes for the second stage. One uses the correlation of the received signal with the preamble in the time domain, and the other uses it in the frequency domain. The symbol start time is determined as the location with the maximum correlation value.

In order to decrease the computational complexity on the DSP, we rewrite the original floating-point C programs into a fixed-point version and further refine our code by taking into account the features of the DSP chip, the TMS320C6416, to produce a more efficient program. Overall, the final version of the uplink synchronization schemes is 374 times faster than the original version.

In this thesis, we first introduce the TDD OFDMA uplink synchronization schemes. Second, we describe the environment of the DSP implementation. Finally, we discuss the optimization methods that use the features of the C64x and present experimental results on the speed and the synchronization performance.

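Acknowledgments

This thesis could be completed only thanks to the careful guidance and instruction of my advisor, Professor David W. Lin. During my two years of graduate study, Professor Lin not only guided my academic research but also gave me a great deal of advice on the attitude one should bring to research. I offer him my deepest gratitude.

I also thank all the members of the Communication Electronics and Signal Processing Laboratory, including the teachers, my classmates, and my senior and junior labmates. I am grateful to my seniors 吳俊榮, 洪昆健, and 林郁男 for their guidance and suggestions during my research, and to 宗書, 盈縈, 明哲, 明瑋, 建統, 仰哲, 岳賢, and the other classmates and junior students for the mutual encouragement and discussions that filled these two years of research with joy and memories.

Finally, I thank my family and friends, who constantly encouraged me throughout my studies, gave me emotional support, stayed with me through my unease, hesitation, and worry, and shared my pride, my happiness, and what I learned. I sincerely express my thanks to everyone who has helped me.

Hsiao Ching Lin
Hsinchu, June 2004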

Contents

1 Introduction

2 Techniques for Uplink Synchronization
  2.1 Background
  2.2 Overview of IEEE 802.16a
    2.2.1 OFDMA Carrier Allocation
    2.2.2 OFDMA Frame Structure
    2.2.3 System Architecture
  2.3 UL Synchronization Approach
  2.4 UL Synchronization
    2.4.1 Stage I: Using CP Correlation Property
    2.4.2 Stage II: Using Preamble Correlation Property
  2.5 UL Synchronization Result
    2.5.1 Preamble Correlation in Frequency Domain Approach
    2.5.2 Preamble Correlation in Time Domain Approach
    2.5.3 Comparison of UL Synchronization Using Time Domain Approach and Frequency Domain Approach

3 DSP Introduction
  3.1 DSP Board Introduction [11]
  3.2 DSP Core Introduction [13]
  3.3 Data Transmission Mechanism [15]
  3.4 Code Composer Studio Introduction [16], [17]

4 DSP Implementation
  4.1 Procedure of the Implementation Work
  4.2 Optimization Method
    4.2.1 Configuring the Setting of Compiler Options
    4.2.2 Using Intrinsics [19]
    4.2.3 Software Pipelining
    4.2.4 Data Type Modification
  4.3 Framing/Deframing Structure
    4.3.1 Framing
    4.3.2 Deframing
  4.4 IFFT/FFT Structure
  4.5 Transmission Filtering
    4.5.1 Oversampling and SRRC Filter in the Transmitter
    4.5.2 Downsampling and SRRC Filter in the Receiver
  4.6 Uplink Synchronization Using Time Domain Approach
    4.6.1 CP Correlation
    4.6.2 Preamble Correlation
    4.6.3 Complexity Analysis
  4.7 Conclusion in Optimization

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Potential Future Work

List of Figures

2.1 OFDM symbol structure in time.
2.2 Illustration of carrier usage in OFDMA UL.
2.3 Carrier allocation in the OFDMA UL (from [4]).
2.4 Frame structure of the TDD OFDMA system (from [4]).
2.5 UL transmitter structure.
2.6 UL receiver structure.
2.7 Pseudo Random Binary Sequence (PRBS) generator for pilot modulation.
2.8 Method of UL synchronization.
2.9 The structure of the ML time offset estimator (from [8]).
2.10 The structure of the proposed symbol time estimator.
2.11 Three UL signals arrive at different times, and the CP correlation peak may occur between them (from [5]).
2.12 The received samples and the time plan of the UL synchronization stage II (from [5]).
2.13 Illustration of UL synchronization stage II in frequency domain (from [5]).
2.14 Illustration of UL synchronization stage II in time domain (from [5]).
2.15 Frame structure used in UL synchronization.
2.16 Error distribution under different maximum Doppler shifts using frequency domain approach.
2.17 Error distribution under different maximum Doppler shifts using time domain approach.
2.18 The multipath delay spread and the relative average power (the definition of the Ref in the next figure).
2.19 Performance of UL time synchronization under different Doppler spreads.
2.20 Comparison of UL synchronization using frequency domain and time domain approach at velocity of 60 km/hr.

3.1 Block diagram of Quixote (from [12]).
3.2 Technical specification of Quixote (from [12]).
3.3 Block diagram for C6416 DSP (from [14]).
3.4 TMS320C6416 DSP core data paths (from [14]).
3.5 Block diagram for C62x and C64x DSP core (from [15]).
3.6 Block diagram of DSP streaming mode (from [11]).
3.7 Simplified Code Composer Studio development flow (from [17]).

4.1 Code development flow of C6000 (from [19]).
4.2 C64x fixed-point pipeline phases.
4.3 The fixed-point data formats at the TX side and the RX side.
4.4 Error distribution under different maximum Doppler shifts using time domain approach in fixed-point version.
4.5 C code for PRBS generator.
4.6 Compiler's feedback for PRBS generator loop.
4.7 Two versions of C programs for framing.
4.8 Compiler's feedback for framing loop before and after optimization.
4.9 A part of C code for framing.
4.10 A part of C code for deframing.
4.11 A part of assembly code for DSP_fft32x32.
4.12 C code for mul_sum() in Tx_SRRC().
4.13 C code and compiler's feedback for mul_sum() loop.
4.14 C code and compiler's feedback for Rx_SRRC() loop.
4.15 C code in CP_correlation() before optimization.
4.16 C code in CP_correlation() after optimization.
4.17 Compiler's feedback for CP_correlation() loop before optimization.
4.18 Compiler's feedback for CP_correlation() loop after optimization.
4.19 C code in Preamble_correlation() before optimization.
4.20 C code in Preamble_correlation() after optimization.
4.21 Compiler's feedback for Preamble_correlation() loop before and after optimization.
4.22 Comparison between floating-point version and fixed-point version.

List of Tables

2.1 OFDMA UL Carrier Allocations
2.2 Complexity for ML Estimator and the Proposed Symbol Time Estimator
2.3 Comparisons of Computational Complexity for Different FFT Algorithms
2.4 Complexity for Time Domain Approach and Frequency Domain Approach
2.5 System Parameters Used in Our Study
2.6 Characteristics of the ETSI "Vehicular A" Channel Environment
2.7 Relations Between Spread and Maximum Doppler Shift at Carrier Frequency 6 GHz and Subcarrier Spacing 5.58 kHz

3.1 Characteristics of TI C6416T Processors (from [14])
3.2 Functional Units (.L, .S) and Operations Performed (from [15])
3.3 Functional Units (.M, .D) and Operations Performed (from [15])

4.1 Compiler Options to Avoid on Performance Critical Code (from [19])
4.2 Compiler Options for Performance (from [19])
4.3 Breakdown of Clock Cycles for Framing()
4.4 Breakdown of Clock Cycles for Deframing()
4.5 IFFT/FFT Function
4.6 Comparison of Different IFFT/FFT
4.7 Complexity and Performance of IFFT/FFT Implementation
4.8 Used Compiler Intrinsics in DSP_ifft32x32/DSP_fft32x32
4.9 Breakdown of Clock Cycles for IFFT()
4.10 Breakdown of Clock Cycles for FFT()
4.11 Breakdown of Clock Cycles for TX_SRRC()
4.12 Breakdown of Clock Cycles for RX_SRRC()
4.13 Breakdown of Clock Cycles for CP_correlation()
4.14 Breakdown of Clock Cycles for Preamble_correlation()
4.15 Complexity and Performance of CP Correlation Implementation
4.16 Complexity and Performance of Preamble Correlation Implementation

Chapter 1 Introduction

The orthogonal frequency-division multiple access (OFDMA) technique has attracted serious attention in the last few years and has been proposed for the uplink of wireless communication systems [1], [2] and cable TV (CATV) networks [3]. In this thesis, we focus on uplink synchronization based on the IEEE 802.16a WirelessMAN OFDMA system [4]. Our intent is to implement the uplink synchronization scheme on a digital signal processor (DSP). In order to verify the accuracy of the fixed-point uplink synchronization scheme, the framing/deframing structure, the IFFT/FFT block, and the Tx/Rx SRRC filter have also been implemented.

The environment of our DSP implementation involves a host PC, a DSP board, and the DSP chip on the board. The DSP chip is Texas Instruments (TI)'s TMS320C6416. The TMS320C6416 is a fixed-point DSP with a 1.67 ns instruction cycle time. It adopts the advanced VelociTI Very Long Instruction Word (VLIW) architecture, which enables a sustained throughput of eight instructions in parallel and thus allows the processor to run faster. In addition, the C64x device comes with on-chip program and data memories, which may be configured as cache on some devices. The DSP board we use is Innovative Integration (II)'s Quixote. It is a PCI bus compatible DSP card housing one TI TMS320C6416 processor.

Our work is based on the code from [5]. In order to reduce the computational complexity, we rewrite the original 32-bit floating-point version into a 16-bit fixed-point version. We also apply several optimization methods to facilitate better parallelism after compilation.

The thesis is organized as follows. In Chapter 2, we introduce the techniques for uplink synchronization in detail. Chapter 3 introduces the DSP board and the DSP chip. Chapter 4 discusses the optimization methods based on DSP properties and presents the optimization results. Finally, Chapter 5 gives the conclusion and points out some potential future work.

Chapter 2 Techniques for Uplink Synchronization

2.1 Background

The basic idea of orthogonal frequency-division multiplexing (OFDM) is to divide the available spectrum into a number of subchannels. To obtain high spectral efficiency, the frequency responses of the subchannels overlap but are orthogonal, hence the name OFDM. By introducing a cyclic prefix (CP), the orthogonality can be completely maintained even though the signal passes through a time-dispersive channel. The cyclic prefix is a copy of the last part of the OFDM symbol that is prepended to the transmitted symbol, as shown in Figure 2.1 [6].

Figure 2.1: OFDM symbol structure in time.

Orthogonal frequency-division multiaccess (OFDMA) is a multiplexing technique in which several users simultaneously transmit their own data by modulating an exclusive set of orthogonal subcarriers. Its main advantage is that separating different users through frequency-division multiaccess (FDMA) techniques at the subcarrier level can mitigate multiaccess interference (MAI) within a cell [7]. Also, compared with single-carrier multiaccess, OFDMA offers increased robustness to narrowband interference, allows straightforward dynamic channel assignment, and does not need adaptive time-domain equalizers, since channel equalization is performed in the frequency domain through one-tap multipliers.

For all this to be true, however, proper time and frequency synchronization is necessary to maintain orthogonality among the active users. Frequency offsets due to Doppler shifts and/or oscillator instabilities produce interchannel interference (ICI) that must be counteracted to avoid severe error-rate degradation. Timing errors result in intersymbol interference (ISI) between consecutive OFDM symbols. Using a guard interval (cyclic prefix) provides intrinsic protection against timing errors at the expense of some reduction in the data throughput due to the extra overhead. However, timing accuracy becomes a stringent requirement in practical applications where, to minimize the overhead, the cyclic prefix is made only just longer than the channel impulse response (CIR).

In this thesis, we consider the IEEE 802.16a WirelessMAN OFDMA system [4]. According to the IEEE 802.16a standard, the duplexing method of the OFDMA system in the 2–11 GHz band shall be either frequency-division duplexing (FDD) or time-division duplexing (TDD) in licensed bands and TDD in license-exempt bands. The traffic requirements of the downlink (DL) and uplink (UL) transmissions are usually different. Compared with the FDD mode, the TDD mode offers more flexibility in dividing transport capacity between the two directions. That is why we choose to study the TDD mode in this thesis.

In this work, we focus on IEEE 802.16a TDD OFDMA uplink synchronization techniques. According to the IEEE 802.16a standard, all subscriber stations (SSs) shall acquire and adjust their timing such that all uplink OFDM symbols arrive time coincident at the base station (BS) to an accuracy of 50% of the minimum guard interval or better. For the same reason, both the transmitted center frequency and the symbol clock frequency shall be synchronized to the BS with a tolerance of at most 2% of the carrier spacing, which equals 111.6 Hz in our work. These constraints are very useful for the UL synchronization scheme.

2.2 Overview of IEEE 802.16a

2.2.1 OFDMA Carrier Allocation

The FFT size used in the 802.16a OFDMA system is 2048, so there are 2048 carriers in a channel. These carriers are divided into three types: data carriers, which are used for data transmission; pilot carriers, which serve various estimation purposes; and null carriers (guard bands and the DC carrier), which transmit nothing at all. The data and pilot carriers together are termed the used carriers, for they transmit useful information. The allocation for the UL is shown in Figure 2.2.

Figure 2.2: Illustration of carrier usage in OFDMA UL. (Figure labels: guard bands, DC carrier, Groups 1 to 53 of 32 used carriers each, including pilot carriers, and subchannels 1 and 2. The 1696 used carriers = 1536 data carriers + 160 pilot carriers.)

In the uplink, the set of used carriers is first partitioned into 32 subchannels, and then the pilot carriers are allocated within each subchannel. Each subchannel may be transmitted from a different SS. The used carriers of the UL transmission are partitioned into fixed-location pilots, variable-location pilots, and data subchannels. Within each subchannel, there are 48 data carriers, 1 fixed-location pilot carrier, and 4 variable-location pilot carriers. The allocation of pilot carriers is illustrated in Figure 2.3.
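As a quick consistency check on the numbers above, the following Python sketch recomputes the carrier counts and the 2% frequency tolerance. The guard-band sizes (176 and 175) are taken from Table 2.1 and the 5.58 kHz subcarrier spacing from the title of Table 2.7; the variable names are illustrative and not from the thesis code.

```python
# Carrier bookkeeping for the 802.16a OFDMA uplink, using counts quoted in the thesis.
FFT_SIZE = 2048
DC_CARRIERS = 1
GUARD_LEFT, GUARD_RIGHT = 176, 175          # from Table 2.1
SUBCHANNELS = 32
DATA_PER_SUBCHANNEL = 48
FIXED_PILOTS_PER_SUBCHANNEL = 1
VARIABLE_PILOTS_PER_SUBCHANNEL = 4
SUBCARRIER_SPACING_HZ = 5580.0              # 5.58 kHz, from the title of Table 2.7

carriers_per_subchannel = (DATA_PER_SUBCHANNEL
                           + FIXED_PILOTS_PER_SUBCHANNEL
                           + VARIABLE_PILOTS_PER_SUBCHANNEL)
used_carriers = SUBCHANNELS * carriers_per_subchannel
data_carriers = SUBCHANNELS * DATA_PER_SUBCHANNEL
pilot_carriers = used_carriers - data_carriers
null_carriers = DC_CARRIERS + GUARD_LEFT + GUARD_RIGHT

# Frequency tolerance: 2% of the carrier spacing.
freq_tolerance_hz = 0.02 * SUBCARRIER_SPACING_HZ

print("carriers per subchannel:", carriers_per_subchannel)
print("used / data / pilot:", used_carriers, data_carriers, pilot_carriers)
print("used + null:", used_carriers + null_carriers)
print("frequency tolerance (Hz):", round(freq_tolerance_hz, 1))
```

The used and null carriers together account for every bin of the 2048-point FFT, which is a convenient sanity check when writing the framing code.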

Figure 2.3: Carrier allocation in the OFDMA UL (from [4]).

The fixed-location pilot is always at carrier 26 in the subchannel. The variable-location pilots change locations in each symbol, repeating every 13 symbols, according to an index $L_k$ defined in [4], where $k$ runs from 0 to 12. For $L = 0$ the variable-location pilots are positioned at indices 0, 13, 27, 40. For other $L$ values these locations change by adding $L$ to each index. Thus, due to the motion of the variable-location pilots, the locations of the data carriers also change with each symbol [4]. The parameters of the UL are also shown in Table 2.1.

2.2.2 OFDMA Frame Structure

Figure 2.4 shows the TDD OFDMA frame structure. The frame structure is built from BS and SS transmissions. Each TDD OFDMA frame is composed of a DL subframe and a UL subframe. The duration of a frame may range from 2 ms to 20 ms and is specified by the frame duration code. A subframe contains several transmission bursts, which are composed of multiples of forward error correction (FEC) blocks.
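The pilot placement rule of Section 2.2.1 can be sketched in a few lines of Python. This is a minimal illustration only: the mapping from symbol number to pilot offset is not reproduced here, so the sketch simply enumerates every offset (called `L` in the code) from 0 to 12, and the function name `subchannel_layout` is hypothetical.

```python
FIXED_PILOT = 26                       # fixed-location pilot index within a subchannel
BASE_VARIABLE_PILOTS = (0, 13, 27, 40) # variable-location pilot indices for offset 0
CARRIERS_PER_SUBCHANNEL = 53           # 48 data + 1 fixed pilot + 4 variable pilots


def subchannel_layout(L):
    """Return (pilot_indices, data_indices) within one subchannel for offset L in 0..12."""
    if not 0 <= L <= 12:
        raise ValueError("offset L must be between 0 and 12")
    variable = [p + L for p in BASE_VARIABLE_PILOTS]   # shift every variable pilot by L
    pilots = sorted(variable + [FIXED_PILOT])
    data = [i for i in range(CARRIERS_PER_SUBCHANNEL) if i not in pilots]
    return pilots, data


for L in range(13):
    pilots, data = subchannel_layout(L)
    print(L, pilots, len(data))
```

For every offset the shifted pilots stay inside the 53-carrier subchannel and never land on carrier 26, which agrees with the Table 2.1 entry stating that no variable-location pilot coincides with a fixed-location pilot.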

Table 2.1: OFDMA UL Carrier Allocations

Parameter                                                  UL Value
Number of DC carriers                                      1
Number of guard carriers, left                             176
Number of guard carriers, right                            175
Number of used carriers                                    1696
Number of total carriers                                   2048
Number of variable-location pilots                         128
Number of fixed-location pilots                            32
Number of variable-location pilots which coincide
  with fixed-location pilots                               0
Number of total pilots                                     160
Number of data carriers                                    1536
Number of subchannels                                      32
Number of carriers per subchannel                          53
Number of data carriers per subchannel                     48
Permutation base                                           3, 18, 2, 8, 16, 10, 11, 15,
                                                           26, 22, 6, 9, 27, 20, 25, 1,
                                                           29, 7, 21, 5, 28, 31, 23, 17,
                                                           4, 24, 0, 13, 12, 19, 14, 30

Figure 2.4: Frame structure of the TDD OFDMA system (from [4]).

Figure 2.5: UL transmitter structure.

From the UL-MAPs, the subscribers know their usable subchannels and transmission times. The first symbol of the UL subframe is the all-pilot preamble, in which each SS sends a preamble on all its allocated subchannels. Of the symbols in the UL subframe, one carries the preamble and the others carry the transmitted data bursts. The Tx/Rx transition gap (TTG) and the Rx/Tx transition gap (RTG) shall be inserted between the downlink and the uplink and at the end of each frame, respectively, to allow the BS to turn around. After the TTG, the BS receiver shall look for the first symbols of a UL burst. After the RTG, the SS receivers shall look for the first symbols of QPSK-modulated data in the DL burst. The TTG and the RTG shall each be at least 5 μs and an integer multiple of four samples in duration.

2.2.3 System Architecture

Figure 2.5 shows the system structure of the UL transmitter. The data are scrambled and FEC coded, while the preambles and pilots are not coded. The BS has to receive various bursts from different SSs at the same time. Each SS has to support one kind of coding and modulation type in a frame. The framing block arranges the coded data, MAPs, preamble, or pilots onto the corresponding carriers and symbols, following the specified frame structure and carrier allocation. After framing, the used carriers and null carriers are ordered properly and fed into the 2048-point IFFT block in parallel. The IFFT results are output sequentially and shaped by the pulse-shaping block.

The system structure of the UL receiver is shown in Figure 2.6. The receiver operation is in some sense the reverse of the transmitter. Two blocks are added: the synchronizer and the channel estimator. These two blocks and the FEC decoder are the most sophisticated elements of the receiver.

Figure 2.6: UL receiver structure.

In the framing/deframing structure, we need information such as the carrier allocation and the UL parameters shown in Table 2.1. Pilot carriers shall be inserted into each data burst in order to constitute the symbol, and they shall be modulated according to their carrier location within the OFDMA symbol. The PRBS generator is used to produce a sequence, $w_k$, where $k$ corresponds to the carrier index. The value of the pilot modulation on carrier $k$ is then derived from $w_k$. The polynomial for the PRBS generator is $X^{11} + X^{9} + 1$, as Figure 2.7 shows. For the UL, the initialization vector of the PRBS is [10101010101]. The PRBS shall be initialized so that its first output bit coincides with the first usable carrier. A new value shall be generated by the PRBS on every usable carrier. Each pilot shall be transmitted with a boosting of 2.5 dB over the average power of each data tone.

Figure 2.7: Pseudo Random Binary Sequence (PRBS) generator for pilot modulation.

The pilot carriers shall be modulated according to the following formulas:

$$\Re\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \qquad \Im\{c_k\} = 0.$$

For the UL preamble, all the used carriers are pilots. The initialization vector of the PRBS is the same as in the normal UL pilot modulation. The pilots shall not be boosted and are modulated as

$$\Re\{c_k\} = 2\left(\frac{1}{2} - w_k\right), \qquad \Im\{c_k\} = 0.$$

The details of the Tx/Rx SRRC filter we use are based on [5]. In order to provide the ability to simulate path delays at non-integer sample times, an interpolator is added to the transmitter to yield a 4-times oversampled transmitter output. As the ideal lowpass interpolation filter cannot be implemented exactly, the more easily realized square root raised cosine (SRRC) filter is used instead. The impulse response of the filter is given by

$$h(t) = \frac{\sin\left(\pi \frac{t}{T_s}(1-\alpha)\right) + 4\alpha \frac{t}{T_s}\cos\left(\pi \frac{t}{T_s}(1+\alpha)\right)}{\pi \frac{t}{T_s}\left[1 - \left(4\alpha \frac{t}{T_s}\right)^2\right]}$$

where $\alpha$ is the roll-off factor and $T_s$ is the sample interval before oversampling. The reason for adopting the SRRC filter is that, for this filter, the transmitter and receiver filters are matched to each other and no inter-sample interference is introduced in the receiver. In our work, the pulse-shaping block is regarded as an interpolator with 4-times oversampling followed by an SRRC filter with roll-off factor 0.155.
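The pilot modulation described in Section 2.2.3 can be illustrated with a short Python sketch. It assumes the PRBS polynomial $X^{11} + X^{9} + 1$ and the UL initialization vector [10101010101] specified for 802.16a pilot modulation [4]. The shift direction and the position of the output tap are assumptions that should be checked against Figure 2.7, and the function names are hypothetical. It is not the thesis's C implementation (Figure 4.5).

```python
import math

# Assumed UL initialization vector, stage 1 first.
UL_INIT = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]


def prbs(init, count):
    """Generate `count` bits from an 11-stage shift register with taps at stages 9 and 11."""
    reg = list(init)                 # reg[0] is stage 1, reg[10] is stage 11
    out = []
    for _ in range(count):
        w = reg[8] ^ reg[10]         # XOR of stages 9 and 11
        out.append(w)
        reg = [w] + reg[:-1]         # feed the new bit into stage 1 and shift
    return out


def pilot_value(w, boosted=True):
    """Map a PRBS bit to a BPSK pilot: amplitude 4/3 if boosted, 1 otherwise."""
    scale = 8.0 / 3.0 if boosted else 2.0
    return complex(scale * (0.5 - w), 0.0)


bits = prbs(UL_INIT, 1696)                                  # one bit per used carrier
preamble = [pilot_value(b, boosted=False) for b in bits]    # all-pilot UL preamble
data_symbol_pilots = [pilot_value(b) for b in bits[:8]]     # boosted pilots (first few shown)

# Boost of a data-symbol pilot relative to a unit-power tone, in dB.
boost_db = 20 * math.log10(abs(pilot_value(0)) / abs(pilot_value(0, boosted=False)))
print("pilot boost in dB:", round(boost_db, 2))
```

Because the feedback polynomial is primitive, the generated bit sequence repeats only every 2047 bits, so the 1696 used carriers of one symbol never see the sequence wrap around.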

(32) Figure 2.8: Method of UL synchronization.. 2.3 UL Synchronization Approach After doing DL synchronization, the mobile enters the time and frequency grid with a low offset in time and frequency. The UL synchronization is unlike the DL synchronization which requires complex frame synchronization at initialization. No frequency synchronization is done in UL normal transmission. What the BS has to do is to detect the exact UL symbol arrival time. The BS shall detect the arrival time of the first coming signal to keep the symbol ISI free. There are two stages in UL synchronization, which is shown in Figure 2.8. The first stage uses cyclic prefix information to detect symbol start time roughly. The second stage uses preamble information to detect symbol start time exactly. We present two schemes to do the second stage. One is using the correlation of received signal with preamble in the time domain and the other is in the frequency domain. The symbol start time is determined as the location with maximum correlation value.. 2.4 UL Synchronization 2.4.1 Stage I: Using CP Correlation Property OFDM/OFDMA signals have strong auto-correlation properties of the waveforms. This autocorrelation is a consequence of the cyclic prefix part of the waveform. The algorithm in [2] and [8] uses the maximum likelihood (ML) criterion to estimate the time offset. Under the assumption that received samples are jointly Gaussian distributed and uncorrelated except for the pairs of identical samples contained in the cyclic prefix, symbol time. 11.

(33) Figure 2.9: The structure of the ML time offset estimator (from [8]).. . offset is given by. where. and. "!"! . 7

(34) + + + 6 + . . (2.1). . 6 . with SNR being signal to noise ratio. Estimator (2.1) exploits the correla-. tion introduced by the cyclic prefix to estimate the offsets. The structure of the time offset estimator is shown in Figure 2.9. Its strength is that it is independent of the modulation and it does not need pilot symbols. It is a one-shot estimator in the sense that the estimates are based on the observation of one OFDM symbol. The symbol time offset estimator can be viewed as consisting of two parts: the correlation. . +. which correlates the received sampled baseband signal, , with a delayed. version of itself, and a part that compensates for the difference in energy in the correlated samples. In order to reduce the complexity, we only employ the correlation part in our work. As the samples of different OFDM symbols are uncorrelated, the peak of the sliding sum of. + + . would occur when the samples 12. + $#%#%# + . . . are.

(35) Figure 2.10: The structure of the proposed symbol time estimator.. Table 2.2: Complexity for ML estimator and the Proposed Symbol Time Estimator No. of Real Multiplications No. of Real Additions.

(36)

(37). ML time offset estimator Proposed symbol time estimator.

(38) . all within the same OFDM symbol. Then, the symbol time offset estimator becomes.

(39) . . . . . . + + . . . . . (2.2). . Figure 2.10 shows the structure of this estimator. Table 2.2 shows a comparison of the complexity for ML time offset estimator and the proposed symbol time estimator. In this table, we consider the complexity for the first 256 samples. Different users’ transmitted signals may not arrive at the same time, but the correlation peak may occur between them, as shown in Figure 2.11 for an example of three users. If we use the detected peak location as the symbol start time, the corresponding useful time will include a part of the guard interval of the next symbol for the earlier arriving signals. Therefore, we have to find the exact instant of the first arriving signal to avoid ISI. This is why we use preamble information in stage II. In stage II, we use preamble correlation property to detect the symbol start time exactly.. 2.4.2 Stage II: Using Preamble Correlation Property In stage I, the symbol (frame) start time is roughly detected by using CP correlation peak. We know that the actual arrival time of the first arriving signal is likely before the detected 13.

time. In stage II, we use the preamble to detect the symbol start time exactly. We present two schemes for stage II: one correlates the received signal with the preamble in the frequency domain, and the other does so in the time domain. Figure 2.12 shows the samples received at the BS and the timing of stage II. As the user arrival times may vary by as much as 50% of the guard interval, we apply the FFT and the preamble correlation to the samples starting up to 50% of the guard interval earlier than the corresponding detected useful time.

Figure 2.11: Three UL signals arrive at different times, and the CP correlation peak may occur between them (from [5]).

Figure 2.12: The received samples and the time plan of the UL synchronization stage II (from [5]). (Labels: CP correlation peak location; corresponding detected useful time; stage II start time; stage II stop time.)

2.4.2.1 Frequency Domain Approach

In this section, we describe UL synchronization stage II using the correlation of the received signal with the preamble in the frequency domain. Figure 2.13 illustrates the processing
conducted in stage II. The FFT outputs are correlated with the preamble reference values. As the BS knows the allocation status of the UL subchannels, the frequency-domain correlation is taken over all the subchannels used by each SS. When a new sample is received, the frequency-domain correlation is updated. The correlation peak value and location of each SS are recorded. This procedure continues until the end of the corresponding useful time. Then the peak locations of the different SSs are compared as follows. We start by assuming that SS1 is the first arriving signal. The peak location of SS2 is compared with that of SS1. If the peak of SS2 is earlier than that of SS1, then we check the peak correlation values, each normalized by the number of subchannels the SS uses. If (peak value / number of subchannels) of SS2 is larger than that of SS1, the first arriving signal is set to SS2. After all SSs have been compared, we obtain the start location of the first arriving signal.

Figure 2.13: Illustration of UL synchronization stage II in frequency domain (from [5]). (Blocks: from stage II start time to stop time, the samples within the useful time pass through the FFT; the carriers of each SS are multiplied by the reference carriers, summed together, and fed to a peak detector; at stage II stop time, a peak location and peak value comparator outputs the first arriving signal start time.)

2.4.2.2 Time Domain Approach

In this section, we describe UL synchronization stage II using the correlation of the received signal with the preamble in the time domain. Since the carriers are orthogonal to each other, so are the subchannels. After the IFFT, time-domain signals that occupy different subchannels are uncorrelated if the channel has zero delay spread. For the UL preamble, the transmitted value of each carrier is specified by the BS. Thus the signal transmitted by each SS in the UL preamble is deterministic, and the BS can reproduce the signals of all SSs by taking the IFFT. In this scheme, stage
I is the same as in the previous scheme, and stage II is as shown in Figure 2.14. The received samples are correlated with reference data strings. Each reference data string is the IFFT output corresponding to the subchannels used by one SS. When the next sample arrives, the correlation is calculated again. The start and stop times of the correlation are the same as shown in Figure 2.12.

Figure 2.14: Illustration of UL synchronization stage II in time domain (from [5]). (Blocks: the received samples r(k)~r(k+2047) are multiplied by the reference rk(0)~rk(2047) of each SS, summed over 2048 samples, and passed to a peak detector that yields the start time of SS k.)

The complexity of time domain correlation is less than that of frequency domain correlation, because the latter requires an FFT. To reduce the FFT complexity, the conventional FFT is applied only once. When a new data value is received, the simplified FFT below is used:

\[ X_{n+1}(k) = \left[ X_n(k) - x(n) + x(n+N) \right] e^{j 2\pi k / N} \qquad (2.3) \]

where N is the FFT size, k is the carrier index, n is the sample number, and x(n+N) is the new complex incoming sample. The simplified FFT requires [...] complex additions and [...] complex multiplications. Table 2.3 compares the computational complexity of different FFT algorithms [9]. Table 2.4 compares the complexity of the time domain approach and the frequency domain approach. For time domain correlation, only 2048 complex multiplications and 2047 complex additions are needed. In our simulation, the guard interval is 256 samples and hence stage II is applied to 128 sample locations. For frequency domain correlation, the computational complexity depends on the type of FFT algorithm. After calculation, the multiplications and additions needed by frequency domain correlation are about 2 times those of time domain correlation.

Table 2.3: Comparisons of Computational Complexity for Different FFT Algorithms
  Columns: No. of Real Multiplications; No. of Real Additions
  Rows: Radix-2 FFT; Radix-4 FFT; Radix-8 FFT; Split-radix-4/2 FFT; Simplified FFT
  [Cell values not recoverable.]

Table 2.4: Complexity for Time Domain Approach and Frequency Domain Approach
  Columns: No. of Real Multiplications; No. of Real Additions
  Rows: Time domain approach; Frequency domain approach with Radix-2 + Simplified FFT; Radix-4 + Simplified FFT; Radix-8 + Simplified FFT; Split-radix-4/2 + Simplified FFT
  [Only one cell value survives: 1048576, which equals 2048 x 4 x 128 and is thus consistent with the real multiplications of the time domain approach.]

2.5 UL Synchronization Result

Table 2.5 specifies the transmission parameters for our simulation. The uplink and downlink use the same frequency band. The intercarrier spacing is thus 5.58 kHz and the symbol length (without cyclic prefix) is 179.2 μsec. In this section, we select the channel environment defined by ETSI for the evaluation of UMTS radio interface proposals. The time-varying channel impulse response for these models can be described by

\[ h(\tau, t) = \sum_i \alpha_i(t)\, \delta(\tau - \tau_i) \qquad (2.4) \]

This equation defines the channel impulse response at time t as a function of the lag τ. In this thesis, we will evaluate our synchronization algorithm for the choices of \alpha_i(t) and \tau_i
associated with the “Vehicular A” channel environment [10]. The channel taps \alpha_i(t) are complex independent stochastic variables, fading with Jakes’ Doppler spectrum, with a maximum Doppler frequency of 240 Hz, reflecting a mobile speed of approximately 120 km/hr (and scatterers uniformly distributed around the mobile). The real-valued \tau_i and the variance of the complex-valued \alpha_i(t) are given in [10] and repeated in Table 2.6.

Table 2.5: System Parameters Used in Our Study
  Number of carriers (N)           2048
  Center frequency                 6 GHz
  Uplink / downlink bandwidth      [value lost] MHz
  Carrier spacing                  5.58 kHz
  Sampling frequency               [value lost] MHz
  OFDM symbol time (T_s)           [value lost] (2304 samples)
  Useful time                      179.2 μsec (2048 samples)
  Cyclic prefix time               [value lost] (256 samples)

Table 2.6: Characteristics of the ETSI “Vehicular A” Channel Environment
  Tap   Delay     Delay (samples,       Delay (samples,   Power    Power            Power
        (nsec)    4x oversampling)      normal)           (dB)     (normal scale)   (normalized)
  1     0         0                     0                 0        1.0000           0.4850
  2     310       14                    4                 -1.0     0.7943           0.3852
  3     710       32                    8                 -9.0     0.1259           0.0610
  4     1090      50                    12                -10.0    0.1000           0.0485
  5     1730      79                    20                -15.0    0.0316           0.0153
  6     2510      115                   29                -20.0    0.0100           0.0049

The SNR is chosen to be 10 dB in the fading channels. Note that the receiver SNR specified in 802.16a ranges from 9.4 dB to 24.4 dB, so 10 dB, which is almost the worst condition, is a reasonable value for simulation. The maximum Doppler shifts in our simulation are shown in Table 2.7 for speeds from 0 km/hr to 100 km/hr. The frame structure used in the UL synchronization simulation is shown in Figure 2.15. UL burst 1 is transmitted by SS1 using 8 subchannels, UL burst 2 by SS2 using 16 subchannels, and UL burst 3 by SS3 using 8 subchannels. The TTG and RTG each occupy 136 sample times. No ranging subchannel is provided.

Table 2.7: Relations Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz and Subcarrier Spacing 5.58 kHz
  Speed (km/hr)   Doppler shift (Hz)   Normalized Doppler shift
  0               0                    0
  20              111                  0.0224
  40              222                  0.0448
  60              333                  0.0672
  80              444                  0.0896
  100             556                  0.112

Figure 2.15: Frame structure used in UL synchronization. (Axes: OFDMA symbol number versus subchannel number 0 to 31. The DL part carries a preamble, the DL-MAP, and DL bursts; after the TTG, the UL part carries UL bursts #1, #2, and #3, each led by a preamble, followed by the RTG.)
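The entries of Table 2.7 can be reproduced with the usual relation f_d = v f_c / c. The sketch below is illustrative only: it assumes c = 3 x 10^8 m/s, and it interprets the last column as the Doppler shift multiplied by the OFDM symbol time of 2304 samples (201.6 μsec), which is an inference from the numbers rather than something stated in the table.

```python
# Sketch: maximum Doppler shift versus speed, as tabulated in Table 2.7.
# Assumptions: c = 3e8 m/s; normalization by the 2304-sample symbol time.

C_M_PER_S = 3.0e8
CARRIER_HZ = 6.0e9
SYMBOL_TIME_S = 179.2e-6 * 2304 / 2048  # useful time plus cyclic prefix


def max_doppler_hz(speed_kmh, carrier_hz=CARRIER_HZ):
    """Maximum Doppler shift f_d = v * f_c / c for a speed given in km/hr."""
    speed_m_per_s = speed_kmh * 1000.0 / 3600.0
    return speed_m_per_s * carrier_hz / C_M_PER_S


rows = [
    (v, max_doppler_hz(v), max_doppler_hz(v) * SYMBOL_TIME_S)
    for v in range(0, 101, 20)
]
```

Each element of `rows` holds the speed, the Doppler shift in Hz, and the product of Doppler shift and symbol time.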

The arrival times of burst 1 and burst 2 differ by 25% of the guard interval, which is 64 sample times, while burst 3 lags burst 1 by 50% of the guard interval, which is 128 sample times.

2.5.1 Preamble Correlation in Frequency Domain Approach

The probability of symbol time synchronization error for the first arriving user is shown in Figure 2.16. The reason for using the carrier correlation to find the symbol start time is that a time offset rotates the carrier phases, and the phase rotation reduces the correlation. If there is no Doppler shift, the synchronization is always correct. For larger Doppler shifts, the inter-carrier interference causes serious variation of the post-FFT carrier values.

Figure 2.16: Error distribution under different maximum Doppler shifts using frequency domain approach.
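The comparison rule of Section 2.4.2.1, by which the BS picks the first arriving signal from the per-SS correlation peaks, can be sketched as follows. This is a minimal illustration of the rule as described in the text, not the thesis code; the `Peak` record and the function name are hypothetical.

```python
# Sketch of the first-arrival decision of Section 2.4.2.1 (names are illustrative).
from dataclasses import dataclass


@dataclass
class Peak:
    ss_id: int        # subscriber station index
    location: int     # sample index of the correlation peak
    value: float      # peak correlation magnitude
    subchannels: int  # number of subchannels used by this SS


def first_arriving(peaks):
    """Start with the first SS; switch to a later SS only if its peak is both
    earlier and larger after normalization by its number of subchannels."""
    first = peaks[0]
    for cand in peaks[1:]:
        earlier = cand.location < first.location
        stronger = cand.value / cand.subchannels > first.value / first.subchannels
        if earlier and stronger:
            first = cand
    return first


peaks = [Peak(1, 100, 8.0, 8), Peak(2, 36, 20.0, 16), Peak(3, 20, 4.0, 8)]
chosen = first_arriving(peaks)
```

In this made-up example SS2 replaces SS1 because its peak is earlier and its normalized value is larger, while SS3 is earlier still but too weak to be accepted.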

Moreover, the signals of different SSs pass through different fading channels and affect each other. Thus the synchronization performance degrades as the Doppler spread increases. We can see that the performance drops significantly when the maximum Doppler shift is larger than 0.025/T_s.

2.5.2 Preamble Correlation in Time Domain Approach

Figure 2.17 shows the symbol time synchronization errors of the first arriving signal under different Doppler spreads. If the Doppler shift is zero (speed = 0 km/hr), we can always detect the correct symbol start time of the first arriving signal. When the speed increases, the distribution of the time synchronization errors is closely related to the multipath channel. We have used this channel model to obtain the time synchronization error distribution shown in Figure 2.18.

Figure 2.17: Error distribution under different maximum Doppler shifts using time domain approach.
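The time-domain search of Section 2.4.2.2 amounts to sliding the known preamble reference over the received samples and keeping the offset with the largest correlation magnitude. The sketch below is a toy version under stated assumptions: a 64-sample random-phase reference instead of the 2048-sample IFFT output, a noise-free single-user signal, and a 16-sample search window. The function name and parameters are illustrative.

```python
# Sketch: time-domain preamble correlation with a peak search (toy sizes).
import cmath
import random


def correlation_peak(received, reference, start, stop):
    """Return (offset, magnitude) maximizing |sum_n r[k+n] * conj(ref[n])|
    over the candidate offsets start <= k < stop."""
    best_offset, best_mag = start, -1.0
    for k in range(start, stop):
        acc = 0j
        for n, ref in enumerate(reference):
            acc += received[k + n] * ref.conjugate()
        mag = abs(acc)
        if mag > best_mag:
            best_offset, best_mag = k, mag
    return best_offset, best_mag


random.seed(0)
N = 64  # toy reference length; the thesis uses 2048 samples
reference = [cmath.exp(2j * cmath.pi * random.random()) for _ in range(N)]
true_offset = 5
received = [0j] * true_offset + reference + [0j] * 16
offset, magnitude = correlation_peak(received, reference, 0, 16)
```

With one reference per SS, the same search would be repeated for each reference over the window between the stage II start and stop times.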

Figure 2.18: The multipath delay spread and the relative average power (the definition of the Ref in the next figure).

Comparing the time synchronization error distribution with the model, we see that the time offsets obtained at the synchronizer output almost coincide with the sample numbers of the multipath delays. Furthermore, the occurrence probabilities of the different time offsets are proportional to the relative average power of the paths. The Doppler shift has no obvious effect on this synchronization scheme except when it is very small. As the correlation is done for each SS, we can detect the arrival time of each later arriving signal. The time error distributions of the other SSs are similar to the previous condition. Thus, Figure 2.19 shows that correlation in the time domain is ideal for fixed environments. For mobile environments, the performance depends on how dispersive the multipath channel is. For the different SSs, the errors under different Doppler shifts (excluding the zero shift) are averaged, and the probabilities are shown in Figure 2.19. Now that the estimated time offset is approximately equal to the multipath delay, we
can safely say that the minimum guard interval should be larger than 2 times the channel delay spread plus the part of the guard interval sacrificed to pulse shaping. From Figure 2.17, 8 sample times earlier is reasonable for Doppler shifts smaller than 0.1/T_s. In our simulation, this value is equal to [expression lost].

Figure 2.19: Performance of UL time synchronization under different Doppler spreads.

2.5.3 Comparison of UL Synchronization Using Time Domain Approach and Frequency Domain Approach

Figure 2.20 shows the time synchronization error distribution of UL synchronization using the frequency domain and time domain approaches when the maximum Doppler shift is 0.067 (velocity 60 km/hr).

Figure 2.20: Comparison of UL synchronization using frequency domain and time domain approach at velocity of 60 km/hr.

Although the time offset estimated by correlation in the frequency domain is to some degree related to the channel delay, it is more dispersed than that estimated by correlation in the time domain. The percentage of errors larger than 40 samples cannot be neglected, so the ability of the guard interval to counter the channel impulse response decreases. Moreover, the peak locations of the post-FFT correlation for the later arriving signals cannot be used because of their low accuracy. This is because, for larger Doppler shifts, the intercarrier interference causes serious variation of the post-FFT carrier values. Comparing the two schemes, correlation in the time domain is more accurate and less complex.
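Part of the frequency domain complexity discussed in this chapter comes from updating the FFT output for every new sample. Assuming that the simplified FFT of Eq. (2.3) is the standard sliding-DFT recursion X_{n+1}(k) = [X_n(k) - x(n) + x(n+N)] e^{j2πk/N}, the sketch below checks on a toy 8-point example that the recursion reproduces a direct DFT of the shifted window. The function names and the test signal are illustrative.

```python
# Sketch: sliding-DFT update checked against a direct DFT (toy size N = 8).
import cmath


def dft(x):
    """Direct N-point DFT, used once for the first window and as a reference."""
    n_fft = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n_fft) for m in range(n_fft))
        for k in range(n_fft)
    ]


def slide(prev_bins, oldest, newest):
    """One update step: drop the oldest sample, take in the newest one."""
    n_fft = len(prev_bins)
    return [
        (prev_bins[k] - oldest + newest) * cmath.exp(2j * cmath.pi * k / n_fft)
        for k in range(n_fft)
    ]


N = 8
samples = [complex(i % 5, -(i % 3)) for i in range(N + 4)]
bins = dft(samples[:N])            # conventional transform, applied once
for n in range(4):                 # four new samples arrive
    bins = slide(bins, samples[n], samples[n + N])
direct = dft(samples[4:4 + N])     # reference result for the shifted window
```

Each update costs one complex multiplication per carrier in this form, which is why the conventional FFT only needs to run for the first window.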

Chapter 3 DSP Introduction

In this thesis, we use a digital signal processor (DSP) to implement the framing/deframing operation, the Tx/Rx SRRC filter, and the uplink synchronization scheme. The DSP board we use is Innovative Integration’s Quixote, which is built around the TMS320C6416 DSP from Texas Instruments (TI). In this chapter, we focus on the environment of the DSP implementation, which involves the host PC, the Quixote DSP board, and the C6416 DSP chip on the board. First, we introduce the DSP board and then the DSP core. The communication mechanism between the DSP core and the peripherals is also introduced. Last, we describe code development on the TI DSP.

3.1 DSP Board Introduction [11]

Quixote is Innovative Integration’s Velocia-family baseboard for wireless, RADAR, ultrasound, high energy physics, and other demanding applications requiring speed and processing power. It combines a 600 MHz 32-bit fixed-point Texas Instruments C6416 DSP with a two- or six-million-gate Xilinx Virtex-II FPGA. Figure 3.1 gives a block diagram of Quixote [12]. Quixote has 32 MB of SDRAM for use by the C6416 DSP. When used with the advanced cache controller on the C6416 DSP, the SDRAM provides a large, fast external memory pool for DSP data and code. The C6416 cache controller is said to be effective
to over 80% of on-chip memory performance for most DSP applications.

Figure 3.1: Block diagram of Quixote (from [12]).

The analog interface offers 105 MHz 14-bit I/Q input channels and 105 MHz output channels, all tightly coupled to the FPGA external interface. A 64-bit 33 MHz PCI interface and one PMC site facilitate integration in PCI systems and support the addition of off-the-shelf and custom PMC mezzanine boards. Finally, a PCI-to-StarFabric bridge chip offers two full-duplex 2.5 Gbps ports to the new PICMG 2.17 switched interconnect backplane, for up to 625 MBytes/sec of board-to-board or chassis-to-chassis communication. Figure 3.2 shows the technical specification of the Quixote [12]. In our work, we only focus on the C6416 DSP chip to implement the OFDMA synchronization structure and
some related blocks. However, our goal is to implement the overall OFDMA system, including source coding, channel coding, framing/deframing, the IFFT/FFT block, the channel model, the synchronization scheme, and channel estimation, on several Quixote boards. In future work, we will need to use the PCI-to-StarFabric bridge chip for board-to-board communication.

3.2 DSP Core Introduction [13]

The TMS320C6416T DSP core is the latest architecture of the 32-bit fixed-point DSP generation in the C6000 DSP platform. It has a 600 MHz clock rate and delivers 4800 MIPS. Table 3.1 provides an overview of the C6416 DSP. The table shows significant features of the C6416 devices, including the capacity of the on-chip RAM, the peripherals, the CPU frequency, and the package type with pin count.

The C6416 DSP uses a two-level cache-based architecture. The Level 1 program cache (L1P) is a 16 KByte direct-mapped cache and the Level 1 data cache (L1D) is a 16 KByte 2-way set-associative cache. The Level 2 memory/cache (L2) consists of a 1024 KByte memory space that is shared between program and data space. L2 memory can be configured as mapped memory or as a combination of cache and mapped memory.

The C6416 DSP chip also has two high-performance embedded coprocessors, the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP). The two coprocessors are very useful for channel decoding. Communications between the VCP/TCP and the CPU are carried through the EDMA controller. The enhanced direct memory access (EDMA) controller transfers data between memories without passing through the DSP core.

The external memory interface (EMIF) provides the interface for the DSP core to connect with several external devices, allowing additional data and program space. The C6416 DSP has two EMIFs: the 64-bit EMIF A is interfaced to the SDRAM and the Virtex-II FPGA, while the 16-bit EMIF B is primarily used for the streaming PCI interface.
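Several headline figures quoted in this chapter follow from simple peak-rate arithmetic. The sketch below reproduces three of them under the assumption of ideal peak rates with no protocol overhead: 8 instructions per cycle at 600 MHz, a 64-bit bus at 33 MHz, and two 2.5 Gbps ports. The helper names are illustrative.

```python
# Sketch: peak-rate arithmetic behind the quoted figures (ideal rates, no overhead).

def peak_mips(clock_mhz, instructions_per_cycle=8):
    """Peak instruction rate in MIPS for a VLIW core issuing up to 8 per cycle."""
    return clock_mhz * instructions_per_cycle


def bus_mbytes_per_s(width_bits, clock_mhz):
    """Peak parallel-bus throughput in MBytes/sec."""
    return width_bits / 8 * clock_mhz


def link_mbytes_per_s(ports, gbps_per_port):
    """Aggregate serial-link throughput in MBytes/sec."""
    return ports * gbps_per_port * 1000 / 8


c6416_mips = peak_mips(600)              # 4800 MIPS
pci_rate = bus_mbytes_per_s(64, 33)      # 264 MBytes/sec
starfabric = link_mbytes_per_s(2, 2.5)   # 625 MBytes/sec
```

These are upper bounds; sustained rates on the actual board depend on cache behavior, bus arbitration, and driver overhead.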

Figure 3.2: Technical specification of Quixote (from [12]).

Table 3.1: Characteristics of TI C6416T Processors (from [14]).

Figure 3.3 shows the block diagram of the C6416 DSP chip. The DSP core features two sets of functional units. Each set contains four units and a register file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units .D2, .M2, .S2, and .L2. The two register files each contain 32 32-bit registers, for a total of 64 general-purpose registers. In addition to supporting the packed 16-bit and 32-/40-bit fixed-point data types found in the C62x VelociTI VLIW architecture, the C64x register files also support packed 8-bit data and 64-bit fixed-point data types. The two sets of functional units, along with the two register files, compose sides A and B of the DSP core. The four functional units on each side of the CPU can freely share the 32 registers belonging to that side. Additionally, each side features a “data cross path”, a single data bus connected to all the registers on the other side, by which the two sets of functional units can access data from the register files on the opposite side. The C6416 DSP core pipelines data-cross-path accesses over multiple clock cycles. This allows the same register to be used as a data-cross-path operand by multiple functional units in the same execute packet. All functional units in the C6416 CPU can access operands via the data cross path. Register access by functional units on the same side of the DSP core as the register file can service all the units in a single clock cycle. Figure 3.4 shows the data paths of the C6416 DSP chip. On the DSP core, a delay clock cycle is introduced whenever an instruction attempts to read a register via a data cross path if that register was updated in the previous clock cycle. Another key feature of the C6416 DSP core is the load/store architecture, where all instructions operate on registers. The functional units .L and .S are described in Table 3.2.
Figure 3.3: Block diagram for C6416 DSP (from [14]).

Figure 3.4: TMS320C6416 DSP core data paths (from [14]).

Table 3.2: Functional Units (.L, .S) and Operations Performed (from [15])

  .L unit (.L1, .L2), fixed-point operations:
    32/40-bit arithmetic and compare operations
    32-bit logical operations
    Leftmost 1 or 0 counting for 32 bits
    Normalization count for 32 and 40 bits
    Byte shifts
    Data packing/unpacking
    5-bit constant generation
    Dual 16-bit arithmetic operations
    Quad 8-bit arithmetic operations
    Dual 16-bit min/max operations
    Quad 8-bit min/max operations
    Quad 8-bit subtract with absolute value

  .S unit (.S1, .S2), fixed-point operations:
    32-bit arithmetic operations
    32/40-bit shifts and 32-bit bit-field operations
    32-bit logical operations
    Branches
    Constant generation
    Register transfers to/from control register file (.S2 only)
    Byte shifts
    Data packing/unpacking
    Dual 16-bit compare operations
    Quad 8-bit compare operations
    Dual 16-bit shift operations
    Dual 16-bit saturated arithmetic operations
    Quad 8-bit saturated arithmetic operations

The two .S and .L functional units perform a general set of arithmetic, logical, and branch functions with results available every clock cycle. The arithmetic and logical functions on the C64x CPU include single 32-bit, dual 16-bit, and quad 8-bit operations. Two sets of data-addressing units (.D1 and .D2) are responsible for all data transfers
between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data to or from the other register file. The C6416 .D units can load and store bytes (8 bits), half-words (16 bits), and words (32 bits) with a single instruction. With the new data path extensions, the C6416 .D unit can also load and store doublewords (64 bits) with a single instruction. Furthermore, the non-aligned load and store instructions allow the .D units to access words and doublewords on any byte boundary. The C6416 DSP core supports a variety of indirect addressing modes using either linear or circular addressing with 5- or 15-bit offsets. All instructions are conditional, and most can access any one of the 64 registers. Some registers, however, are singled out to support specific addressing modes or to hold the condition for conditional instructions (if the condition is not automatically true).

The two .M functional units perform all multiplication operations. Each of the C64x .M units can perform two 16 x 16-bit multiplies or four 8 x 8-bit multiplies per clock cycle. The .M unit can also perform 16 x 32-bit multiply operations, dual 16 x 16-bit multiplies with add/subtract operations, and quad 8 x 8-bit multiplies with add operations. In addition to standard multiplies, the C64x .M units include bit-count, rotate, Galois field multiply, and bidirectional variable shift hardware. The functional units .M and .D are described in Table 3.3.

The processing flow begins when a 256-bit-wide instruction fetch packet is fetched from program memory. The 32-bit instructions destined for the individual functional units are “linked” together by “1” bits in the least significant bit (LSB) position of the instructions. The instructions that are “chained” together for simultaneous execution (up to eight in total) compose an execute packet.
A 0 in the LSB of an instruction breaks the chain, effectively placing the instructions that follow it in the next execute packet. A C6416 DSP device enhancement now allows execute packets to cross fetch-packet boundaries. In the TMS320C62x/TMS320C67x DSP devices, if an execute packet crosses the fetch-packet boundary (256 bits wide), the assembler places it in the next fetch packet, while the remainder of the current fetch packet is padded with NOP instructions. In the C64x DSP device, the execute boundary restrictions have been removed, thereby eliminating all of the NOPs added to pad the fetch packet and thus decreasing the overall code size. The number of execute packets within a fetch packet can vary from one to eight. Execute packets are dispatched to their respective functional units at the rate of one per clock cycle, and the next 256-bit fetch packet is not fetched until all the execute packets from the current fetch packet have been dispatched. After decoding, the instructions simultaneously drive all active functional units for a maximum execution rate of eight instructions every clock cycle. While most results are stored in 32-bit registers, they can be subsequently moved to memory as bytes, half-words, words, or doublewords. All load and store instructions are byte-, half-word-, word-, or doubleword-addressable.

Table 3.3: Functional Units (.M, .D) and Operations Performed (from [15])

  .M unit (.M1, .M2), fixed-point operations:
    16 x 16 multiply operations
    16 x 32 multiply operations
    Quad 8 x 8 multiply operations
    Dual 16 x 16 multiply operations
    Dual 16 x 16 multiply with add/subtract operations
    Quad 8 x 8 multiply with add operations
    Bit expansion
    Bit interleaving/de-interleaving
    Galois Field Multiply
    Rotation
    Variable shift operations

  .D unit (.D1, .D2), fixed-point operations:
    32-bit add, subtract, linear and circular address calculation
    Loads and stores with 5-bit constant offset
    Loads and stores with 15-bit constant offset (.D2 only)
    Load and store double words with 5-bit constant offset
    Load and store non-aligned words and double words
    5-bit constant offset generation
    32-bit logical operations
    Dual 16-bit arithmetic operations
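The chaining of instructions into execute packets by the LSB (p-bit) can be illustrated with a small host-side model. The sketch below only mimics the grouping rule stated in the text (a 1 chains the next instruction into the same execute packet, a 0 ends the packet); it is not a decoder for real C64x opcodes, and the instruction words in the example are made up.

```python
# Sketch: group the 32-bit words of a fetch packet into execute packets
# using the p-bit (LSB) rule described in the text. Not a real C64x decoder.

def split_execute_packets(fetch_packet):
    """Return a list of execute packets, each a list of instruction words."""
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:   # p-bit 0: the chain ends with this instruction
            packets.append(current)
            current = []
    if current:             # chain would continue into the next fetch packet
        packets.append(current)
    return packets


# Made-up words: only the LSB matters for the grouping.
fetch_packet = [0x11, 0x21, 0x30, 0x40, 0x51, 0x61, 0x71, 0x80]
execute_packets = split_execute_packets(fetch_packet)
packet_sizes = [len(p) for p in execute_packets]
```

Here the eight words form three execute packets of 3, 1, and 4 instructions, so this fetch packet would take three dispatch cycles.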

Figure 3.5: Block diagram for C62x and C64x DSP core (from [15]).

Figure 3.5 compares the C62x DSP core with the C64x DSP core. By doubling the registers in the register file and doubling the width of the data path, as well as utilizing advanced instruction packing, the C64x lets the C6000 compiler improve performance with even fewer restrictions placed upon it by the architecture. These additions and others make the C64x an even better compiler target than the original C62x architecture, while reducing code size by up to 25%.

3.3 Data Transmission Mechanism [15]

Many applications of the Matador family baseboards involve communication with the host CPU in some manner. All applications at a minimum must be reset and downloaded from the host, even if they are isolated from the host after that. Other applications need to interact with a host program during the lifetime of the program. The interaction may vary from exchanging a small amount of information to acquiring large amounts of data. Some examples:

  Passing parameters to the program at start time.
  Receiving progress information and results from the application.
  Passing updated parameters during the run of the program, such as the frequency and amplitude of a wave to be produced on the target.
  Receiving alert information from the target.
  Receiving snapshots of data from the target.
  Sending a sample waveform to be generated to the target.
  Receiving full rate data.
  Sending data to be streamed at full rate.

These different requirements need different levels of support to be accomplished efficiently. The simplest method supported is a mapping of Standard C++ I/O to the Uniterminal applet, which allows console-type I/O on the host. This allows simple data input and control and the sending of text strings to the user. The next level of support is given by the Packetized Message Interface. This allows more complicated medium-rate transfer of commands and information between the host and the target. It requires more software support on the host than the Standard I/O does. For full rate data transfers, the hardware supports data streaming to the host, for the maximum ability to move data between the target and the host.

On the Quixote baseboard, a second type of busmaster communication between target and host is available, the CPU Busmaster interface. The primary CPU busmaster interface is based on the streaming model, where logically the data is a stream between the source and the destination. This model is more efficient because the signaling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. In addition, the Busmaster streaming interface uses full handshaking, so that no data loss can occur in the process of streaming.

Figure 3.6: Block diagram of DSP streaming mode (from [11]).

For example, if the application cannot process blocks fast enough, the buffers will fill, then the busmaster region will fill, and then busmastering will stop until the application resumes processing. When the busmaster stops, the DSP will no longer be able to add data to the PCI interface FIFO.

The DSP Streaming interface is bi-directional. Two streams can run simultaneously. One runs from the analog peripherals through the DSP into the application; this is called the “Incoming Stream”. The other runs out to the analog peripherals; this is the “Outgoing Stream”. In both cases, the DSP needs to act as a mediator, since there is no direct access to the analog peripherals from the host. Figure 3.6 shows the block diagram of the DSP streaming mode.

DSP Streaming is initiated and started on the host, using the Caliente component. On the target, the DSP interface uses a pair of DSP/BIOS device drivers, PciIn (on the Outgoing Stream) and PciOut (on the Incoming Stream), provided in the Pismo peripheral libraries for the DSP. They use burst mode and are capable of copying blocks of data between target SDRAM and host bus-master memory via the PCI interface at instantaneous rates of up to 264 MB/sec.

In addition to the busmaster streaming interface, the DSP and the host also have a lower-bandwidth communication link, called the packetized message interface, for sending commands or side information between the host PC and the target DSP.

3.4 Code Composer Studio Introduction [16], [17]

TI’s Code Composer Studio (CCS) is a useful GUI tool for developing DSP code. CCS covers the basic development phases (concept/design, code/build, debug, and analyze) and extends the basic code generation tools with a set of debugging and real-time analysis capabilities. The phases of the development cycle are shown in Figure 3.7.

Figure 3.7: Simplified code composer studio development flow (from [17]).

We briefly describe some of the CCS features related to our implementation. The details can be found in [16] and [17].
1. Compile the C code to generate the Common Object File Format (COFF) output file.
2. Choose Run, Halt, Animate, or Run Free to start or stop execution of the program.
3. When the DSP halts, check the memory sections.
4. Probe a PC file stream into or from the target memory locations.
5. Count the instruction cycles with the profiler.

We can divide the software development into three steps.

Step 1: Write the C program as standard ANSI C code. Then use the debugger to profile the C code and identify the inefficient areas in the code.

Step 2: Use optimization techniques and intrinsic functions to improve the performance. Refine the C code with measures such as data type modifiers, compiler options, intrinsics, and so on.

Step 3: Find the most time-critical areas and replace the C code there with linear assembly code. We can use the assembly optimizer to optimize the code.

In our work, we only focus on step 1 and step 2. Details of the optimization methods are given in the next chapter.

Chapter 4 DSP Implementation

In the earlier chapters, we gave the background of the uplink synchronization scheme and its related functions. We also described the environment of the DSP implementation. In this chapter, we discuss the DSP implementation of uplink synchronization and its related work on the C6416 DSP. First, we describe the procedure of our implementation work. Second, we illustrate some optimization methods that exploit the features of the C6416 and are applied in our implementation. Third, we discuss the progress in each part of our system with different methods. Because the compiler translates the C program into assembly code, we can see the degree of parallelism from the assembly code. The profile is used to compare the original floating-point code with the optimized fixed-point code. Finally, at the end of this chapter, we present some experimental results on the speed and the synchronization performance of our implementation.

4.1 Procedure of the Implementation Work

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or UNIX workstation and then painstakingly porting that C code to hand-coded DSP assembly language. The recommended code development flow instead utilizes the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. These advantages allow the compiler to do all the laborious work of instruction selection, parallelizing, pipelining, and register
allocation. Figure 4.1 shows the phases in the 3-step software development flow.

Figure 4.1: Code development flow of C6000 (from [19]).

4.2 Optimization Method

Speeding up the execution of the OFDMA framing/deframing structure, the SRRC filter, and the uplink synchronization scheme is the main task of our implementation. In this section, we introduce the optimization methods supported by the special features of the C64x DSP. The experimental results are discussed in the next section.

4.2.1 Configuring the Setting of Compiler Options

As mentioned in Section 3.4, Code Composer Studio (CCS) is a useful GUI tool for developing DSP code. CCS compiles the C code and assembles it into a Common Object File Format (COFF) file. Compiler options control the operation of both the compiler and the programs it runs. Proper configuration of the compiler options helps the compiler generate efficient assembly code. The compiler tools include a shell program (cl6x), which is used to compile, assembly optimize, assemble, and link programs in a single step.

The options described in Table 4.1 are obsolete or intended for debugging, and could potentially decrease performance and increase code size. These options should be avoided with performance-critical code. The options in Table 4.2 can improve performance but require certain characteristics to be true. Details of all compiler options can be found in [18].

The compiler option we usually use is –o3, which represents the highest level of optimization available. In addition to the optimizations described in Table 4.2, –o3 can perform other code-size-reducing optimizations, such as eliminating unused assignments, eliminating local and global common subexpressions, and removing functions that are never called. In addition, we can specify program-level optimization by using the –pm option with the –o3 option. With program-level optimization, all of the source files are compiled
into one intermediate file, giving the compiler a complete view of the program during compilation. This is a significant advantage for determining the locations referred to by pointers passed into a function. Once the compiler determines that two pointers do not access the same memory location, substantial improvements can be made in software-pipelined loops. Because the compiler has access to the entire program, it performs several additional optimizations rarely applied during file-level optimization. If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument. If a return value of a function is never used, the compiler deletes the return code in the function. If a function is not called, directly or indirectly, the compiler removes the function.

Table 4.1: Compiler Options to Avoid on Performance Critical Code (from [19]).

Also, using the -pm option can lead to better schedules for our loops. If the number of iterations of a loop is determined by a value passed into the function, and the compiler can

determine what that value is from the caller, then the compiler will have more information about the minimum trip count of the loop, leading to a better resulting schedule.

4.2.2 Using Intrinsics [19]

The C6000 compiler provides intrinsics, special functions that map directly to C64x instructions, to optimize our C code quickly. All instructions that are not easily expressed in C code are supported as intrinsics. Intrinsics are specified with a leading underscore (_) and are accessed by calling them as we call a function. The table of TMS320C6000 C/C++ compiler intrinsics can be found in [19].

4.2.3 Software Pipelining

Pipelining is used to parallelize instruction execution. The C64x pipeline has several features that improve performance. Figure 4.2 shows all the phases in each stage of the C64x pipeline in sequential order, from left to right [13]. As shown in Figure 4.2, the C64x pipeline has 11 phases, grouped into 3 stages: program fetch, instruction decode, and execution. In the execution stage, most C64x instructions complete in one phase. However, the load instruction needs five execution phases, the store instruction needs three, the multiplication needs four, and the branch needs six. If subsequent instructions need the result of such a multi-cycle instruction, there is a delay before the result is written to the register file and becomes available. The compiler therefore adds NOP instructions to the program, each representing a one-cycle delay: 4, 2, 3, and 5 NOPs follow the load, store, multiplication, and branch instructions, respectively.

Software pipelining is a technique for scheduling the instructions of a loop so that multiple iterations of the loop execute in parallel. It is an effective way to improve performance.
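As a rough illustration of the idea, the sketch below writes the same 16-bit multiply-accumulate loop twice in portable C: once in its natural form, and once rearranged by hand so that the loads for one iteration overlap the multiply of the previous iteration, with an explicit prolog, kernel, and epilog. On the C64x this overlap is produced by the compiler at the assembly level and typically spans more iterations; the C rearrangement is only an analogy. The function names mac16 and mac16_overlapped are hypothetical and are not taken from our implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: straightforward 16-bit x 16-bit multiply-accumulate.
 * The restrict qualifiers state that x and h do not overlap. */
int32_t mac16(const int16_t *restrict x, const int16_t *restrict h, size_t n)
{
    assert(x != NULL && h != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)h[i];   /* load, multiply, add */
    }
    return acc;
}

/* Hypothetical helper: the same computation with the iterations overlapped
 * by hand, mimicking the prolog / kernel / epilog of a pipelined loop. */
int32_t mac16_overlapped(const int16_t *restrict x,
                         const int16_t *restrict h, size_t n)
{
    assert(x != NULL && h != NULL);
    if (n == 0) {
        return 0;
    }

    int32_t acc = 0;

    /* Prolog: issue the loads of iteration 0 only. */
    int16_t xv = x[0];
    int16_t hv = h[0];

    /* Kernel: multiply the operands of iteration i-1 while
     * loading the operands of iteration i. */
    for (size_t i = 1; i < n; i++) {
        int32_t prod = (int32_t)xv * (int32_t)hv;
        xv = x[i];
        hv = h[i];
        acc += prod;
    }

    /* Epilog: finish the multiply-add of the last iteration. */
    acc += (int32_t)xv * (int32_t)hv;
    return acc;
}
```

Both functions return the same sum; only the order in which loads and multiplies are issued differs, which is what allows the delay slots of one iteration to be filled with useful work from another.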
The concept of software pipelining consists of implementing parallel instructions, filling delay slots with useful instructions, loop unrolling, and maximizing functional unit usage. When we use the -o2 or -o3 compiler options, the compiler attempts to software-pipeline the code with the information that it gathers from the program. The more information the compiler can gather from the program, the better the resulting schedule can be. We can help the compiler's optimization work by providing it with some information, as described below.

Table 4.2: Compiler Options for Performance (from [19]).

Figure 4.2: C64x fixed-point pipeline phases.

Loop unrolling

Loop unrolling expands small loops so that each iteration of the loop appears in the code. It can increase the number of instructions available to execute in parallel. The compiler may unroll the loop automatically, or we can suggest unrolling with the UNROLL pragma. The syntax of the pragma is:

#pragma UNROLL(n)

If possible, the compiler unrolls the loop so there are n copies of the original loop.

Under the conditions listed below, however, the compiler will not perform software pipelining [19]:

1. If a register value lives too long, the code is not software-pipelined.
2. If a loop has complex condition code within the body that requires more than five condition registers, the loop is not software-pipelined.
3. A software-pipelined loop cannot contain function calls, including code that calls the run-time support routines.
4. In a sequence of nested loops, the innermost loop is the only one that can be software-pipelined.
5. If a loop contains a conditional break, it is not software-pipelined.

4.2.4 Data Type Modification

The TMS320C6416 is a fixed-point DSP, so floating-point operations on the C6416 DSP are inefficient. This is the main reason we rewrite the original floating-point C code as a fixed-point version. We should use the 16-bit data type for multiplication inputs whenever

possible because this data type makes the most efficient use of the 16-bit multiplier in the C64x DSP.

Figure 4.3: The fixed-point data formats at the TX side and the RX side.

Figure 4.3 shows the data formats at the TX side and the RX side. In the original floating-point version, the Tx SRRC, Rx SRRC, and SYNC functions need many 32-bit by 32-bit floating-point multiply operations. In the fixed-point version, we use 16-bit by 16-bit fixed-point multiply operations instead. In the UL, the data values before the IFFT and after the FFT lie in the range [-1, 1]. Also, the data values after the IFFT and before the FFT are less than 1. Therefore, we set the input/output data formats for Tx SRRC, Rx SRRC, and SYNC to Q.15, which places the sign bit in the leftmost position and uses the remaining 15 bits for the fraction. Compared with other 16-bit data types, Q.15 provides the best precision for data less than 1.

We use the IFFT/FFT functions from the TI C64x DSP library, which supports two types of IFFT/FFT: one with a 32-bit input/output data type and one with a 16-bit input/output data type. The main reason we choose the 32-bit input/output data type is that the IFFT/FFT input data must be scaled by the length of the IFFT/FFT to prevent overflow. According to IEEE 802.16a, the length of the IFFT/FFT is 2048. If we chose the 16-bit data type before the IFFT, only 4 bits could be used to represent the fixed-point value, because scaling by 2048 = 2^11 consumes 11 of the 15 non-sign bits. In our implementation, the data format before the IFFT is Q16.15, which places the sign bit in the leftmost position, followed by 16 integer bits and 15 fraction bits. Compared with other 32-bit data types, Q16.15 can be easily transformed to the 16-bit Q.15 data type.

In order to evaluate the precision of the fixed-point format, we compare the uplink syn-