Study and DSP Implementation of IEEE 802.16a TDD OFDMA Downlink Synchronization


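National Chiao Tung University, College of Electrical Engineering and Computer Science, Degree Program of Electronics and Electro-Optical Engineering. Master's Thesis. Study and DSP Implementation of IEEE 802.16a TDD OFDM Downlink Synchronization. Student: Tsung-Shu Chiang. Advisor: Dr. David W. Lin. July 2004.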

Study and DSP Implementation of IEEE 802.16a TDD OFDM Downlink Synchronization. Student: Tsung-Shu Chiang. Advisor: Dr. David W. Lin. A Thesis Submitted to the Degree Program of Electrical Engineering and Computer Science, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics and Electro-Optical Engineering. July 2004. Hsinchu, Taiwan, Republic of China.

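Study and DSP Implementation of IEEE 802.16a TDD OFDMA Downlink Synchronization. Student: Tsung-Shu Chiang. Advisor: Dr. David W. Lin. Master's Program, Degree Program of Electronics and Electro-Optical Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University.

Chinese Abstract

In this thesis we present a method of implementing the downlink synchronization techniques of IEEE 802.16a TDD OFDMA. The downlink synchronization comprises synchronization of the OFDM symbol start time and the fractional frequency offset, synchronization of the integer frequency offset, and synchronization of the transmitted data frames. We implement the synchronization in software on the TMS320C6416 digital signal processor (DSP) made by Texas Instruments (TI). The processor is hosted on a cPCI card named Quixote, made by Innovative Integration. To make the synchronization easy to verify, we also implement the complete 802.16a downlink transmission system. To obtain higher DSP computational efficiency, all computation in this system is carried out in fixed-point format. In the synchronization we compute in a 16-bit fixed-point format with 15 fractional bits and 1 sign bit. We use the assembly-optimized FFT routine from the library provided by TI. We raise the execution efficiency by using the instructions native to the C6416 and by unrolling the loops that cannot be software-pipelined. After the synchronization program is refined in this way, its execution efficiency is greatly improved.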

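The thesis also analyzes the execution efficiency. The software implementation of the synchronization, running on a single DSP, cannot meet the real-time requirement. To make the synchronization run in real time, it must be partitioned into several parts, implemented either on more DSPs or partly on an FPGA.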

Study and DSP Implementation of IEEE 802.16a TDD OFDMA Downlink Synchronization. Student: Tsung-Shu Chiang. Advisor: Dr. David W. Lin. Degree Program of Electrical Engineering and Computer Science, National Chiao Tung University.

Abstract

This thesis presents an implementation method for IEEE 802.16a TDD (time division duplex) OFDMA (orthogonal frequency-division multiple access) downlink (DL) synchronization techniques. The DL synchronization includes symbol time synchronization, fractional frequency offset synchronization, integer frequency offset synchronization and frame synchronization. Our implementation is software-based, employing Texas Instruments' TMS320C6416 digital signal processor (DSP) housed on Innovative Integration's Quixote cPCI card. We implement the complete 802.16a DL system to verify the accuracy of the synchronization function. The computation in this system is fixed-point in order to obtain a higher execution efficiency. The data format we use in synchronization is Q.15, a 16-bit fixed-point format that consists of a sign bit and 15 fractional bits. We use the assembly-optimized FFT from TI's DSP library to obtain high execution efficiency. We increase the execution efficiency of synchronization by using the intrinsics of the C6416 DSP and by unrolling the loops that do not qualify for software pipelining, so that the software pipeline is well scheduled. The efficiency increases considerably after we refine the program.

The execution efficiency of the synchronization is analyzed. We find that the synchronization, running in software on a single DSP, does not meet the real-time requirement. If we want the synchronization function to achieve real-time speed, we must partition it into sub-functions and implement these sub-functions either on more DSPs or on an FPGA.

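Acknowledgments

I sincerely thank my advisor, Dr. David W. Lin, for his guidance over the past two years. For a student with a full-time job, finding an advisor is hard from the very start. Dr. Lin did not mind my being a working student and accepted me into his group, which rekindled my enthusiasm for study at a time when my search for an advisor was not going well. During the two years of his guidance, his thorough planning of my master's thesis gave me clear goals to work toward, which is a great help in the studies of a working student. What his profound scholarship has added to my professional knowledge of communications is hard to measure in numbers. I feel greatly honored to be Dr. Lin's student. Here I express my heartfelt gratitude to Dr. Lin and his family.

The Communication Electronics and Signal Processing Laboratory is well equipped, which gave me ample resources while I completed this thesis. I thank the team members who held meetings with me, 俊榮, 筱晴, 明哲, 子瀚 and 盈縈; it is with everyone's help that I could complete this thesis. I also thank 郁男, 崑健, 明偉, 建統, 岳賢 and all the other students in the laboratory for their help in every respect. Because of them, my time in the laboratory is full of happy memories.

My employer, 加爾發半導體, supported me throughout the thesis work and gave me every convenience. I thank Chairman 黃, General Manager 廖, my direct supervisor Manager 呂, and Mr. 唐 of my department. Their support freed me from worries.

I thank my parents and family for their support. Finally, I especially thank my wife. Only she knows how hard this journey has been and the pressure I have borne. Throughout my studies she has constantly given me encouragement and support. Without her support, I could not have accomplished any of this. I dedicate this thesis to my beloved wife.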

Contents

1 Introduction
2 Techniques for Downlink Synchronization
  2.1 Introduction to the 802.16a TDD OFDMA System
    2.1.1 Pilot and Data Carrier Allocation
    2.1.2 Data Modulation and Pilot Modulation [5]
    2.1.3 Frame Structure
  2.2 Downlink Synchronization Techniques
    2.2.1 Initial Synchronization
    2.2.2 Normal Synchronization
  2.3 Summary of Downlink Synchronization Techniques
3 DSP Introduction
  3.1 DSP Board Introduction
  3.2 Introduction to TMS320C6416 DSP [9]
    3.2.1 TMS320C6416 Features
    3.2.2 Central Processing Unit
    3.2.3 Memory Architecture
  3.3 TI's Code Development Environment [16], [17]
  3.4 Code Development Flow to Increase Performance [10]
    3.4.1 Compiler Optimization Options [10]
4 DSP Implementation
  4.1 Efficiency Enhancement of DL Synchronization Code
    4.1.1 Performance of the Original Program
    4.1.2 Fixed-Point Number System Consideration
    4.1.3 Code Refinement
  4.2 Performance Discussion
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Potential Future Work

List of Tables

2.1 Carrier Allocation in the OFDMA DL (from [5])
2.2 Complexity of Symbol Time Synchronization
2.3 Possible Pilot Structures in Frame Synchronization
2.4 System Parameters Used in Our Study
3.1 Execution Stage Length Description for Each Instruction Type (from [9])
3.2 Functional Units and Operations Performed (from [9])
4.1 Floating-Point Profile of 802.16a DL Transmitter Function Blocks
4.2 Floating-Point Profile of 802.16a DL Receiver Function Blocks
4.3 Characteristics of the ETSI "Vehicular A" Channel Environment
4.4 Relations Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz and Subcarrier Spacing 5.58 kHz
4.5 Performance Comparison of Frequency Lock Between Floating-Point and Fixed-Point Implementation
4.6 Performance Comparison of Frame Lock Between Floating-Point and Fixed-Point Implementation
4.7 Q16.15 Bit Fields
4.8 Q.15 Bit Fields
4.9 Comparisons of Computational Complexity for Different FFT Algorithms
4.10 Complexity and Performance of IFFT/FFT Implementation
4.11 Sine/Cosine Look-Up Table
4.12 Fixed-Point Profile of 802.16a DL Transmitter Function Blocks
4.13 Fixed-Point Profile of 802.16a DL Receiver Function Blocks
4.14 Comparison Between FFT and Recursive DFT
4.15 Efficiency of Recursive DFT Implementation
4.16 The Execution Cycles of Pilot Correlation Loop
4.17 Profile of the sync Function
4.18 Profile of CP Correlation Function Loop Using Different Buffer Types
4.19 Multiply-Add Efficiency of CP Correlation Functions
4.20 Profile of Refined Code of 802.16a DL Receiver Function Blocks
4.21 Performances Estimation in Separate Initial and Tracking Condition

List of Figures

2.1 OFDMA symbol time structure (from [5]).
2.2 DL transmitter structure (from [1]).
2.3 DL receiver structure (from [1]).
2.4 Illustration of carrier usage in OFDMA DL (from [1]).
2.5 Pilot allocation in the OFDMA DL (from [5]).
2.6 QPSK, 16-QAM and 64-QAM constellations (from [5]).
2.7 Pseudo-random binary sequence (PRBS) generator for pilot modulation (from [5]).
2.8 Frame structure of the TDD OFDMA system (from [5]).
2.9 The structure of the symbol time and frequency estimator from [1].
2.10 DL/UL symbols identification.
2.11 (a) Symbol location detected in stage I, where the gray region is the useful samples to which the FFT is applied. (b), (c) Leftmost and rightmost ranges of correlation, respectively. (from [1])
2.12 DL transmitter structure (from [1]). The gray regions indicate the implemented functions in our study.
2.13 DL receiver structure (from [1]). The gray regions indicate the implemented functions in our study.
2.14 DL synchronization process block diagram.
2.15 Flow chart of symbol time and fractional frequency offset synchronization.
2.16 Flow chart of integer frequency offset synchronization.
2.17 The state machine of framing synchronization.
3.1 Block diagram of Quixote (from [15]).
3.2 Block diagram of TMS320C6416 DSP (from [9]).
3.3 Pipeline phases of TMS320C6416 DSP (from [9]).
3.4 TMS320C64x CPU data path (from [9]).
3.5 Code development flow for TI C6000 DSP.
4.1 The bursts allocation in a frame.
4.2 A part of assembly code for DSP fft32x32.
4.3 The fixed-point data formats at the TX side.
4.4 The fixed-point data formats at the RX side.
4.5 C code of recursive DFT.
4.6 The software pipeline information of recursive DFT.
4.7 Assembly code of recursive DFT.
4.8 C code of revised pilot correlation loop.
4.9 Partial assembly code of original pilot correlation loop.
4.10 The software pipeline information of pilot correlation loop.
4.11 Partial assembly code of revised pilot correlation loop.
4.12 The abs() function is replaced by intrinsic abs() in C code.
4.13 Shift-register buffer arrangement.
4.14 Code of CP correlation functions using shift-register buffer and circular buffer.
4.15 Software pipeline information of shift-register buffer type CP correlation loop.
4.16 Software pipeline information of circular buffer type CP correlation loop.
4.17 Hand-unrolled code of circular buffer type CP correlation.
4.18 Software pipeline information of hand-unrolled circular buffer type CP correlation loop.
4.19 Execution cycles of synchronization functions.

Chapter 1 Introduction

The 802.16 working group of the IEEE-SA (Institute of Electrical and Electronics Engineers Standards Association) is concerned with the WirelessMAN air interface for wireless metropolitan area networks. The IEEE 802.16 Task Group a developed IEEE Standard 802.16a, which amends IEEE Std 802.16-2001 by enhancing the medium access control layer and providing additional physical layer specifications in support of broadband wireless access at frequencies of 2–11 GHz.

We consider the DSP implementation of an IEEE 802.16a downlink synchronization method. The synchronization includes symbol time synchronization, frequency offset synchronization and frame synchronization. The synchronization techniques are from [1], with some modifications. Our implementation is software-based, employing Texas Instruments' TMS320C6416 digital signal processor (DSP) housed on Innovative Integration's Quixote cPCI card. The TMS320C6416 is a fixed-point DSP with a 1.67 ns instruction cycle time. It adopts the advanced VelociTI very long instruction word (VLIW) architecture, which enables a sustained throughput of eight instructions in parallel. The implemented code is modified from the simulation program of [1]. We rewrite the floating-point version as a 16-bit fixed-point version and refine the code to maximize the execution performance.

The thesis is organized as follows. Chapter 2 introduces the 802.16a downlink synchronization techniques. Chapter 3 introduces the execution environment of the synchronization program, including the Quixote card and the TMS320C6416 DSP chip. Chapter 4 describes the DSP implementation and its performance. Finally, Chapter 5 contains the conclusion.

Chapter 2 Techniques for Downlink Synchronization

The IEEE standard 802.16a [5] specifies the WirelessMAN air interface for wireless metropolitan area networks. There are several system modes in 802.16a: SC (single carrier), OFDM (orthogonal frequency-division multiplexing) and OFDMA (orthogonal frequency-division multiple access). It also supports two duplex types: TDD (time division duplex) and FDD (frequency division duplex). We consider the TDD OFDMA option.

Accurate demodulation and detection of an OFDM signal require carrier orthogonality. Variations of the carrier oscillator, the sampling clock or the symbol time affect the orthogonality of the system. In this thesis, the sampling clocks of the users and the base station are assumed to be identical. Then, before an OFDM receiver can demodulate the carriers, it has to perform two synchronization tasks. First, timing synchronization is needed to detect the proper frame start time. Second, it has to estimate and correct the carrier frequency offset of the received signal.

Before a more detailed technical overview of the IEEE 802.16a standard, we introduce some frequently used terms. The subscriber station (SS) is usually known as the mobile station or the user. The base station (BS) is a generalized equipment set providing connectivity, management, and control of the subscriber station. The direction of transmission from the BS to the SS is called downlink (DL), and the opposite direction is uplink (UL). In this thesis, we only discuss the downlink synchronization techniques.

2.1 Introduction to the 802.16a TDD OFDMA System

The 802.16a WirelessMAN-OFDMA system is based on OFDMA modulation. The inverse Fourier transform creates the OFDMA waveform. This time duration is referred to as the useful symbol time $T_b$. The cyclic prefix (CP) is a copy of the last $T_g$ µs of the useful symbol period. The two together are referred to as the symbol time $T_s$. The ratios of CP time to useful time, $T_g/T_b$, that should be supported are 1/32, 1/16, 1/8 and 1/4. In this thesis, the ratio of CP time to useful time is set to 1/8. The time-domain OFDMA symbol is shown in Fig. 2.1.

Fig. 2.1: OFDMA symbol time structure (from [5]).

In the frequency domain, an OFDMA symbol is made up of carriers. There are several carrier types: data carriers, pilot carriers and null carriers. Data carriers are used for data transmission. Pilot carriers carry pilot data and are used for various estimation purposes. Null carriers carry no transmission at all; they consist of the guard bands and the DC carrier. The total number of carriers in a DL OFDMA symbol is 2048: there are 166 pilot carriers, 1536 data carriers and 346 null carriers. The DL system structures are shown in Figs. 2.2 and 2.3.

This thesis focuses on synchronization techniques. The pilot and data carrier allocation, the pilot and data modulation, and the frame structure, all of which affect the synchronization techniques, are described in the following.
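The symbol dimensions quoted in Section 2.1 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the thesis code (which is fixed-point C on the DSP). It assumes one time-domain sample per FFT point, so that the useful symbol time spans 2048 samples; the function name `symbol_lengths` is hypothetical.

```python
from fractions import Fraction

N_FFT = 2048                      # carriers per DL OFDMA symbol (from the text)
SUPPORTED_CP_RATIOS = [Fraction(1, 32), Fraction(1, 16), Fraction(1, 8), Fraction(1, 4)]
CP_RATIO = Fraction(1, 8)         # ratio of CP time to useful time used in this thesis

N_PILOT, N_DATA, N_NULL = 166, 1536, 346   # carrier counts from the text


def symbol_lengths(n_fft, cp_ratio):
    """Return (CP length, total symbol length) in samples.

    Assumption: one sample per FFT point, i.e. no oversampling.
    """
    cp_len = int(n_fft * cp_ratio)
    return cp_len, n_fft + cp_len


cp_len, symbol_len = symbol_lengths(N_FFT, CP_RATIO)
used_carriers = N_PILOT + N_DATA
```

Under this assumption a CP ratio of 1/8 gives a 256-sample prefix and a 2304-sample symbol, and the three carrier types add up to the 2048 carriers of the symbol.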

Fig. 2.2: DL transmitter structure (from [1]). Block labels: data scrambler, FEC, data modulation, pilot (preamble) modulation, framing and carrier allocation, S/P (1702), add virtual carriers (padding zeros), IFFT, P/S (2048), add prefix, interpolator (4), LPF (SRRC filter), D/A filter, Tx RF, and channel (AWGN or fading channel).

Fig. 2.3: DL receiver structure (from [1]). Block labels: Rx RF, A/D filter, LPF (SRRC filter), symbol time sync., fractional freq. sync., guard interval removal, S/P (2048), FFT, integer freq. sync., frame sync., channel estimation, P/S (1702), equalization, data demodulation, data deframing, FEC decoder, and de-scrambler.

2.1.1 Pilot and Data Carrier Allocation

2.1.1.1 Pilot Allocation

The carrier allocation in a DL OFDM symbol is shown in Fig. 2.4. Null carriers are allocated at the left side, at the right side and at the DC carrier. The pilot and data carriers are termed useful carriers because they transmit useful information. The pilot tones are allocated first, the remainder of the used carriers are divided into 32 subchannels, and then the data carriers are allocated within each subchannel.

Fig. 2.4: Illustration of carrier usage in OFDMA DL (from [1]). Labels: guard bands, DC carrier, groups 1 to 48 of 32 data carriers each (no pilots in a group), pilots, and subchannels 1 and 2. The 1702 used carriers = 1536 data carriers + 166 pilot carriers.

The pilot carriers include fixed-location pilots and variable-location pilots. The carrier indices of the fixed-location pilots never change. The carrier indices of the variable-location pilots vary according to the formula

$$\mathrm{varLocPilot}_k = 3L + 12P_k,$$

where $\mathrm{varLocPilot}_k$ is the carrier index of a variable-location pilot, $L$ is the symbol index that cycles through the values 0, 2, 1, 3, 0, ... periodically every 4 symbols, and $P_k \in \{0, 1, 2, \ldots, 141\}$. The pilot carrier allocation map is shown in Fig. 2.5.

Fig. 2.5: Pilot allocation in the OFDMA DL (from [5]).

2.1.1.2 Carrier Allocation

After the pilots are mapped, the remainder of the useful carriers form the data subchannels. To allocate the data subchannels, partition the remaining carriers into groups of contiguous carriers. Each subchannel consists of one carrier from each of these groups. The number of carriers in a subchannel is therefore equal to the number of groups, and it is denoted $N_{\mathrm{subcarriers}}$. The number of carriers in a group is equal to the number of subchannels, and it is denoted $N_{\mathrm{subchannels}}$. The total number of data carriers is thus equal to $N_{\mathrm{subcarriers}} \cdot N_{\mathrm{subchannels}}$. The exact partitioning into subchannels follows the equation below, called a permutation formula:

$$\mathrm{carrier}(n, s) = N_{\mathrm{subchannels}} \cdot n + \left\{ p_s[n \bmod N_{\mathrm{subchannels}}] + \mathrm{ID}_{\mathrm{cell}} \cdot \mathrm{ceil}\!\left[\frac{n+1}{N_{\mathrm{subchannels}}}\right] \right\} \bmod N_{\mathrm{subchannels}} \tag{2.1}$$

where $\mathrm{carrier}(n, s)$ is the carrier index of carrier $n$ in subchannel $s$; $s \in \{0, \ldots, N_{\mathrm{subchannels}} - 1\}$ is the index of a subchannel; $n \in \{0, \ldots, N_{\mathrm{subcarriers}} - 1\}$ is the index of a subcarrier in the subchannel; $N_{\mathrm{subchannels}}$ is the number of subchannels; $p_s[j]$ is the series obtained by rotating the permutation base series cyclically to the left $s$ times; $\mathrm{ceil}[\cdot]$ is the function that rounds its argument up to the next integer; $\mathrm{ID}_{\mathrm{cell}}$ is a positive integer assigned by the MAC (Medium Access Control) to identify this particular BS sector; and $X \bmod k$ denotes the remainder of the quotient $X/k$ (which is at most $k - 1$). The numerical parameters are given in Table 2.1.
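The two allocation rules of Section 2.1.1 can be sketched in Python as follows. This is an illustration only, not the thesis code. It assumes that the variable-location pilot formula has the form $3L + 12P_k$ with $P_k \in \{0, \ldots, 141\}$, that the permutation formula has the form of (2.1), and that $\mathrm{carrier}(n, s)$ indexes the 1536 carriers that remain once the pilots are removed. The base series `HYPOTHETICAL_PERM_BASE` is a stand-in chosen for this sketch; the actual series is the one listed in Table 2.1.

```python
import math

N_SUBCHANNELS = 32   # carriers per group (= number of subchannels)
N_SUBCARRIERS = 48   # carriers per subchannel (= number of groups)

# Stand-in permutation of 0..31, NOT the series from the standard.
HYPOTHETICAL_PERM_BASE = [(5 * j + 3) % N_SUBCHANNELS for j in range(N_SUBCHANNELS)]


def variable_pilot_indices(symbol_number):
    """Variable-location pilot indices for one symbol (assumed form 3L + 12*P_k)."""
    L = (0, 2, 1, 3)[symbol_number % 4]        # L cycles 0, 2, 1, 3 every 4 symbols
    return [3 * L + 12 * p for p in range(142)]


def data_carrier(n, s, perm_base, id_cell):
    """Index of carrier n of subchannel s among the remaining data carriers."""
    rotated = perm_base[s:] + perm_base[:s]    # p_s: base series rotated left s times
    offset = (rotated[n % N_SUBCHANNELS]
              + id_cell * math.ceil((n + 1) / N_SUBCHANNELS)) % N_SUBCHANNELS
    return N_SUBCHANNELS * n + offset


# Every data carrier should be assigned to exactly one (n, s) pair.
assigned = {data_carrier(n, s, HYPOTHETICAL_PERM_BASE, id_cell=1)
            for n in range(N_SUBCARRIERS) for s in range(N_SUBCHANNELS)}
```

For a fixed $n$, the 32 subchannels pick 32 different carriers of group $n$, because the rotated series runs through every entry of the permutation. Each subchannel therefore takes one carrier from each of the 48 groups, which is the property the text describes.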

Table 2.1: Carrier Allocation in the OFDMA DL (from [5]).

2.1.2 Data Modulation and Pilot Modulation [5]

2.1.2.1 Data Modulation

The data modulation schemes in 802.16a are shown in Fig. 2.6. The data bits are entered serially into the constellation mapper. Gray-mapped QPSK and 16-QAM must be supported, whereas the support of 64-QAM is optional.

Fig. 2.6: QPSK, 16-QAM and 64-QAM constellations (from [5]).

2.1.2.2 Pilot Modulation

Pilot carriers are inserted into each data burst in order to constitute the symbol, and they are modulated according to their carrier locations within the OFDMA symbol. The PRBS (pseudo-random binary sequence) generator is used to produce a sequence $w_k$, where $k$ corresponds to the carrier index. The value of the pilot modulation on carrier $k$ is then derived from $w_k$. The polynomial for the PRBS generator is $X^{11} + X^9 + 1$, as Fig. 2.7 shows.

Fig. 2.7: Pseudo-random binary sequence (PRBS) generator for pilot modulation (from [5]).

The symbols in a TDD OFDMA DL transmission can be separated into two types. The first three symbols are termed preamble symbols, and the other symbols are normal symbols. The initialization vector of the PRBS in the DL normal symbols is [11111111111], while the initialization vector of the PRBS in the DL preamble symbols is [01010101010]. The PRBS shall be initialized so that its first output bit coincides with the first usable carrier. A new value shall be generated by the PRBS on every usable carrier. Each pilot shall be transmitted with a boosting of 2.5 dB over the average power of each data tone. The pilot carriers shall be modulated as

$$\mathrm{Re}\{c_k\} = \frac{8}{3}\left(\frac{1}{2} - w_k\right), \qquad \mathrm{Im}\{c_k\} = 0.$$

2.1.3 Frame Structure

The frame structure of TDD OFDMA is shown in Fig. 2.8. From the coding point of view, the data are segmented into blocks, each of which fits into one FEC (forward error correction) block. Each FEC block spans one OFDMA subchannel in the subchannel axis and three OFDM symbols in the time axis. A frame consists of one DL subframe and one UL subframe. The duration of a frame can be from 2 to 20 ms and is specified by the frame duration code. A subframe contains several transmission bursts, which are composed of multiples of FEC blocks. In each frame, the Tx/Rx transition gap (TTG) and the Rx/Tx transition gap (RTG) shall be inserted between the downlink and the uplink and at the end of each frame, respectively, to allow the BS and the SS to turn around. TTG and RTG shall be at least 5 µs and an integer multiple of four samples in duration [5].

Fig. 2.8: Frame structure of the TDD OFDMA system (from [5]).

For the DL, the data transmitted from the BS should contain the control messages and the system parameters, so that the subscribers know when and how to receive and transmit their data. The burst profile defines parameters such as modulation type, FEC type, preamble length and guard times. The first FEC block of each frame is the DL Frame Prefix, which is always transmitted in the most robust burst profile, QPSK 1/2. The DL Frame Prefix contains the parameters of the FCH (Frame Control Header), which includes the DL-MAPs and UL-MAPs and possibly additional DCD and UCD messages. The DL-MAP/UL-MAP messages define the access to the DL/UL information, including the burst profiles and the allocation of the bursts in the subchannel and time axes. The Downlink Channel Descriptor (DCD) and Uplink Channel Descriptor (UCD) shall be transmitted by the BS at a periodic interval to define the characteristics of the downlink and uplink physical channels. The pilots of the first three OFDM symbols form the DL preamble in the sense that they indicate where the OFDMA frame starts. The number of OFDM symbols in the DL is $3N$, where $N$ is a positive integer.
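The pilot modulation of Section 2.1.2.2 can be sketched in Python as follows. This is an illustration rather than the thesis code. It assumes the generator polynomial $X^{11} + X^9 + 1$, a shift register whose feedback is the XOR of stages 9 and 11 fed into stage 1, and an output taken from stage 11; the exact register wiring is the one drawn in Fig. 2.7. The function names are hypothetical.

```python
PRBS_INIT_NORMAL = [1] * 11                             # DL normal symbols
PRBS_INIT_PREAMBLE = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # DL preamble symbols


def prbs_bits(init_bits, count):
    """Generate count bits w_k of the PRBS (assumed wiring, see lead-in)."""
    state = list(init_bits)          # state[0] is stage 1, state[10] is stage 11
    out = []
    for _ in range(count):
        out.append(state[10])                # assumed: output taken from stage 11
        feedback = state[8] ^ state[10]      # taps at stages 9 and 11
        state = [feedback] + state[:10]      # shift by one stage
    return out


def pilot_value(w_k):
    """Real part of the pilot for PRBS bit w_k; the imaginary part is 0."""
    return (8.0 / 3.0) * (0.5 - w_k)


# One PRBS bit per usable carrier; a pilot takes the bit of its own carrier index.
w = prbs_bits(PRBS_INIT_NORMAL, 1702)
first_pilot = pilot_value(w[0])
```

A bit of 0 maps to $+4/3$ and a bit of 1 maps to $-4/3$. An amplitude of $4/3$ corresponds to $20\log_{10}(4/3) \approx 2.5$ dB, which matches the pilot boosting quoted in the text.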

2.2 Downlink Synchronization Techniques

A time offset gives rise to a phase rotation of the carriers. If the time offset is smaller than the length of the guard interval minus the length of the channel impulse response, then the orthogonality among the carriers is maintained. In this case, the time offset appears as a linear phase shift of the demodulated data symbols across the carriers but does not result in inter-symbol interference (ISI) or inter-carrier interference (ICI). For a larger time offset, ISI and ICI occur. By increasing the length of the guard interval, the timing requirement can be loosened.

A frequency offset due to oscillator mismatch usually exists between the transmitter and the receiver. Each subcarrier can be assumed to be equally affected by the offset of the center carrier frequency, because the system bandwidth is small compared to the center carrier frequency. The frequency offset causes three effects: it reduces the amplitude of the FFT output, introduces ICI from other carriers, and introduces a common phase rotation of the subcarriers [3]. The frequency offset can be separated into an integer part and a fractional part. The former is the frequency offset in integer multiples of the carrier spacing, and the latter is the remaining fraction of the carrier spacing. The integer frequency offset results in the entire spectrum of an OFDMA signal being cyclically shifted, with no ICI [4].

There are two DL synchronization conditions: initial synchronization and normal synchronization. At the beginning, when a subscriber wants to join the network, it has no knowledge of the network timing or of its frequency offset relative to the base station. When the SS receives a DL OFDMA symbol, the OFDMA symbol start time should be found, and the frequency offset between the SS and the BS should be estimated and compensated. According to 802.16a, the center frequency of the SS shall be synchronized to the BS with a tolerance of at most 2% of the inter-carrier spacing. The frame start time should be found after the symbol time and frequency offset synchronization are finished. After the frame synchronization, the SS can get the frame information and use it to enter the normal synchronization condition [1].
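The statement that a small time offset shows up only as a linear phase shift across the carriers can be checked numerically. The sketch below is in Python and uses a tiny 8-point DFT for readability. It assumes an ideal channel and an FFT window that starts $d$ samples early but still inside the cyclic prefix, so that the captured block is a cyclic delay of the useful samples. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Plain O(n^2) DFT, adequate for a small demonstration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def cyclic_delay(x, d):
    """Block seen when the FFT window starts d samples early, inside the CP."""
    n = len(x)
    return [x[(t - d) % n] for t in range(n)]


def phase_ramp(n, d):
    """Expected per-carrier factor exp(-j*2*pi*k*d/n) for a delay of d samples."""
    return [cmath.exp(-2j * cmath.pi * k * d / n) for k in range(n)]


x = [1 + 0j, 2 - 1j, 0.5j, -1, 3, 0, 1j, -2]   # arbitrary useful samples
X = dft(x)
Y = dft(cyclic_delay(x, 3))
ramp = phase_ramp(len(x), 3)
max_error = max(abs(Y[k] - X[k] * ramp[k]) for k in range(len(x)))
```

The carrier magnitudes are unchanged and only the phase rotates in proportion to the carrier index, so no ISI or ICI arises in this idealized case.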

2.2.1 Initial Synchronization

The scheme that we use divides initial synchronization into four stages [1]: symbol time synchronization, fractional frequency synchronization, integer frequency synchronization and frame synchronization.

2.2.1.1 Stage I: Symbol Time Synchronization

The research in [1] suggests estimating the symbol time by using the cyclic prefix. Two algorithms are mentioned in that thesis: ML estimation and CP correlation. The ML estimation algorithm, proposed in [2], uses the maximum likelihood criterion to estimate the time and frequency offsets. Under the assumption that the received samples are jointly Gaussian, the ML estimate of the symbol time offset is given by

$$\hat{\theta}_{ML} = \arg\max_{\theta} \left\{ |\gamma(\theta)| - \rho\,\Phi(\theta) \right\}, \tag{2.2}$$

where

$$\gamma(m) = \sum_{k=m}^{m+L-1} r(k)\, r^*(k+N), \tag{2.3}$$

$$\Phi(m) = \frac{1}{2} \sum_{k=m}^{m+L-1} \left( |r(k)|^2 + |r(k+N)|^2 \right), \tag{2.4}$$

and $\rho = \mathrm{SNR}/(\mathrm{SNR}+1)$, with SNR being the signal-to-noise ratio. Here $r(k)$ is the received sample, $N$ is the FFT size and $L$ is the CP length. It is a one-shot estimator in the sense that the estimates are based on the observation of one OFDM symbol. To reduce the complexity, the CP correlation algorithm [1] uses only the correlation part to estimate the symbol time. As the samples of different OFDM symbols are uncorrelated, the peak of the sliding sum of $r(k)\,r^*(k+N)$ occurs when the samples $r(\theta), \ldots, r(\theta+N+L-1)$ are all within the same OFDM symbol. Then, the symbol time offset estimator becomes

$$\hat{\theta}_{CP} = \arg\max_{\theta} |\gamma(\theta)| = \arg\max_{\theta} \left| \sum_{k=\theta}^{\theta+L-1} r(k)\, r^*(k+N) \right|. \tag{2.5}$$

The complexities of the ML estimation and CP correlation algorithms are shown in Table 2.2. Note that after the CP correlation is computed at sample time $\theta$ by (2.3), the CP correlation at sample time $\theta+1$ simplifies to

$$\gamma(\theta+1) = \gamma(\theta) + r(\theta+L)\, r^*(\theta+L+N) - r(\theta)\, r^*(\theta+N). \tag{2.6}$$

The CP correlation algorithm only calculates $|\gamma(\theta)|$, and the ML estimation algorithm calculates all the entries listed.

Table 2.2: Complexity of Symbol Time Synchronization.
Multiplications (complex) | Additions (complex) | Other Functions
4350 4349 8700 8452 1 absolute value 6913 6909 1 division 1 square root 1 absolute value

The research in [1] shows that although the ML estimation algorithm performs better than the CP correlation algorithm, neither algorithm can estimate the exact symbol time with 100% accuracy. To estimate the exact symbol time, both algorithms need the assistance of auxiliary operations. Here pilot correlation, which is performed in stage IV, is used as the auxiliary operation to estimate the symbol time. The complexity of ML estimation is much higher than that of the CP correlation algorithm, but the benefit is not commensurate. We therefore use CP correlation to estimate the symbol time in this stage.

2.2.1.2 Stage II: Fractional Frequency Synchronization

In our algorithm, the integer frequency offset is estimated in the post-FFT stages. The fractional frequency offset is estimated in this stage. Based on the frequency part of the joint ML estimator in [2] and [8], the fractional frequency offset is given by

$$\hat{\varepsilon} = -\frac{1}{2\pi} \angle \gamma(\hat{\theta}),$$

as shown in Fig. 2.9. It is easy to understand why $\varepsilon$ can be estimated by this method. The frequency offset results in a sinusoidal wave in the time domain, and thus the received samples are multiplied by $e^{j2\pi\varepsilon k/N}$.
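The CP correlation of stage I and the angle-based estimate of stage II can be sketched together in floating-point Python. This is an illustration only; the thesis implementation is Q.15 fixed-point C on the DSP. The sketch assumes the correlation $\gamma(\theta) = \sum_k r(k)\,r^*(k+N)$ over a window of $L$ samples, the sliding update that adds the product entering the window and drops the one leaving it, and the estimate $\hat{\varepsilon} = -\angle\gamma(\hat{\theta})/(2\pi)$. The helper `isolated_symbol` is a hypothetical noiseless test signal.

```python
import cmath


def cp_sync(r, n_fft, cp_len):
    """Return (theta_hat, eps_hat) from the CP correlation of the samples r."""
    # gamma(0): direct sum of cp_len products r(k) * conj(r(k + N))
    gamma = sum(r[k] * r[k + n_fft].conjugate() for k in range(cp_len))
    best_theta, best_gamma = 0, gamma
    last_theta = len(r) - n_fft - cp_len
    for theta in range(last_theta):
        # Sliding update: add the product entering the window, drop the one leaving
        gamma += r[theta + cp_len] * r[theta + cp_len + n_fft].conjugate()
        gamma -= r[theta] * r[theta + n_fft].conjugate()
        if abs(gamma) > abs(best_gamma):
            best_theta, best_gamma = theta + 1, gamma
    # Fractional frequency offset from the angle of the peak correlation
    eps_hat = -cmath.phase(best_gamma) / (2 * cmath.pi)
    return best_theta, eps_hat


def isolated_symbol(n_fft, cp_len, start, tail, eps):
    """Hypothetical test signal: one noiseless symbol between runs of zeros."""
    useful = [cmath.exp(2j * cmath.pi * ((m * m) % 16) / 16) for m in range(n_fft)]
    symbol = useful[-cp_len:] + useful           # cyclic prefix + useful part
    s = [0j] * start + symbol + [0j] * tail
    # Frequency offset eps, in units of the carrier spacing
    return [x * cmath.exp(2j * cmath.pi * eps * k / n_fft) for k, x in enumerate(s)]


r = isolated_symbol(n_fft=64, cp_len=8, start=20, tail=30, eps=0.2)
theta_hat, eps_hat = cp_sync(r, n_fft=64, cp_len=8)
```

The sliding update costs one new complex product per sample position instead of $L$. In this noiseless example the peak falls on the first CP sample and the estimate recovers the offset of 0.2 carrier spacings; with noise or multipath the peak can shift, which is why the text relies on the later pilot correlation stage for the exact timing.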

Fig. 2.9: The structure of the symbol time and frequency estimator (from [1]). Block labels: input $r(k+2048)$, delay of 2048 samples giving $r(k)$, conjugate $(\cdot)^*$, sliding sum (length $L$ = CP length), magnitude $|\cdot|$, arg max yielding $\theta$, and scaling by $-1/(2\pi)$ yielding $\varepsilon$.

In an AWGN channel, with the per-sample factor $e^{j2\pi\varepsilon k/N}$, $k = 0, 1, 2, \ldots$, due to the frequency offset, the received sample in the guard time is

$$r(k) = s(k)\, e^{j2\pi\varepsilon k/N} + n(k),$$

and the sample in the last part of the useful time is

$$r(k+N) = s(k+N)\, e^{j2\pi\varepsilon (k+N)/N} + n(k+N),$$

where $s(k)$ is the transmitted signal, $N$ is the FFT size, and $n(k)$ is the noise. Since the cyclic prefix is a copy of the end of the symbol, $s(k+N) = s(k)$ for samples in the guard interval, and the multiplication of $r(k)$ and $r^*(k+N)$ becomes

$$r(k)\, r^*(k+N) = |s(k)|^2 e^{-j2\pi\varepsilon} + \text{noise}.$$

Note that $e^{-j2\pi\varepsilon}$ is the common factor of all the sample pairs with $r(k)$ in the guard interval. It makes sense that the sum of these sample pairs reduces the noise effect. The frequency offset can be obtained from the angle of the sum of $r(k)\, r^*(k+N)$ taken at the symbol start position. Note that the phase rotation due to an integer frequency offset is an integer multiple of $2\pi$. Thus this estimator is only able to detect the fractional frequency offset. The structure of this estimator, including stages I and II, is shown in Fig. 2.9.

2.2.1.3 Stage III: Integer Frequency Synchronization

After the fractional frequency synchronization, we use the guard band information to estimate the integer frequency offset [1]. To begin, an SS should check whether the received OFDM symbol is from the BS rather than from another SS. In 802.16a [5], the definitions of the guard bands and pilots are different for DL and UL.

Fig. 2.10: DL/UL symbols identification.

The indices of the DL guard carriers are from −1024 to −852 and from 852 to 1023, while those of the UL are from −1024 to −849 and from 849 to 1023. Because a symbol from another SS is subject to the limitation that its frequency offset to the BS must not exceed 2% of the carrier spacing, if the OFDMA symbol is from another SS, the magnitudes at carrier indices {−851, −850, −849, 849, 850, 851} must be small. A threshold can be set such that if any of the carriers {−851, −850, −849, 849, 850, 851} is larger than the threshold, the SS regards the symbol as a DL symbol, as shown in Fig. 2.10.

For the DL, the standard defines the carriers −851 and 851 as fixed-location pilots, which are modulated to $4/3$ in amplitude. If there is no integer frequency offset, the FFT outputs of all the guard carriers will be small. So, all the guard carriers are checked to see if any of them exceeds the threshold.
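The DL/UL identification and the guard-carrier threshold test can be sketched in Python as below. This is an illustration, not the thesis code. It assumes DL guard carriers at indices −1024 to −852 and 852 to 1023, edge carriers ±849 to ±851 that are used in the DL but lie in the UL guard band, and an FFT output in which bin 0 is the DC carrier and negative carrier indices wrap around to the upper bins. The threshold value and the function names are hypothetical.

```python
N_FFT = 2048
EDGE_CARRIERS = (-851, -850, -849, 849, 850, 851)   # used in DL, guard in UL
DL_GUARD = list(range(-1024, -851)) + list(range(852, 1024))


def bin_of(carrier):
    """Map a signed carrier index to an FFT bin (assumes bin 0 is DC)."""
    return carrier % N_FFT


def is_downlink(fft_mag, threshold):
    """Regard the symbol as DL if any edge carrier exceeds the threshold."""
    return any(fft_mag[bin_of(c)] > threshold for c in EDGE_CARRIERS)


def guard_carriers_over_threshold(fft_mag, threshold):
    """DL guard carriers whose magnitude exceeds the threshold."""
    return [c for c in DL_GUARD if fft_mag[bin_of(c)] > threshold]


# Idealized DL symbol: only the two edge pilots carry energy.
mag = [0.0] * N_FFT
mag[bin_of(851)] = mag[bin_of(-851)] = 4.0 / 3.0

# The same edge pilot pushed two carrier spacings up by an integer offset.
shifted = [0.0] * N_FFT
shifted[bin_of(853)] = 4.0 / 3.0
```

With no integer offset the guard list is empty. When the spectrum is shifted, energy from the edge pilots lands on guard carriers, and the index where it is found indicates how far the spectrum has moved.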

(42) see if any of them exceeds the threshold. The checking direction is from $1023$ to $852$, and then from $-1024$ to $-852$. If carrier $k$ is detected to be larger than the threshold in the checking procedure, the $\pm 851$st fixed pilots are supposed to have shifted by $k - 851$ carrier spacings (or $k + 851$ for a negative $k$) due to the frequency offset. Thus the checking is stopped and the frequency is corrected by that number of carrier spacings. The checking and correction take turns until all the guard carriers are checked to be smaller than the threshold.

In fading channels, ICI may cause serious distortion. Thus, if the $\pm 851$st pilots are distorted to be less than the threshold, the frequency offset will not be detected by the previous method. An additional check is added to see whether both of the $\pm 851$st pilot carriers are larger than the threshold. After these three checks, the integer synchronization finishes.

2.2.1.4 Stage IV: Frame Synchronization

By stage I, the OFDMA symbol start time can be roughly estimated, but the SS has to know exactly where the frame starts. The frame start time estimation suggested in [1] uses the pilot correlation method. In the 802.16a standard [5], the variable location pilots change their location from symbol to symbol depending on the symbol index. The modulation of the pilots is decided by the PRBS generator, and the initialization vector of the PRBS generator in preamble symbol generation differs from that in non-preamble symbol generation. Therefore, there are 7 possible kinds of pilot structures, as shown in Table 2.3. If the received symbol has the same pilot locations and the same initialization vector of the modulation PRBS as the reference data, their correlation will be larger than in the other 6 cases. A frame is determined to start if there are three successive DL symbols with the maximum correlation corresponding to the preamble. The simulation result of [1] shows that the accuracy of symbol time estimation is not enough.
There is a serious problem in using the post-FFT pilots or preamble if the symbol time synchronization in stage I does not detect the correct location of the symbol, for then there will be a time offset $\tau$. After the FFT, the time offset causes a phase shift across

(43) Table 2.3: Possible Pilot Structures in Frame Synchronization. Row groups: DL preamble; DL normal symbol.

the carriers by $2\pi k\tau/N$, where $k$ is the carrier index, $\tau$ is the time offset in samples, and $N$ is the FFT size. This phase shift affects the correlation of the received pilots and the reference data. Moreover, if the detected symbol start time is later than the actual time, ISI and ICI may occur. Whether the maximum correlation of the 7 cases indicates the true frame start becomes doubtful.

To solve this problem, a more robust symbol time should be estimated. If there were a time offset, the useful time would be shifted and the pilot correlation would be smaller. The simulation of [1] shows that the symbol time estimation error in stage I has a high probability of being smaller than 30 samples. Assume that the time offset may be from $-32$ to $32$ sample times. Fig. 2.11(a) shows the symbol start location detected in stage I, where the gray region is the corresponding useful samples to which the FFT is applied. We apply the FFT to the gray region from $-32$ to $32$ samples in offset, as shown in Fig. 2.11(b) and (c) [1]. After observing the correlation for 65 sample times, the location with peak correlation is assumed to be the real symbol start time. The maximum correlation of the 7 cases is then robust enough to be used.

In order to reduce the complexity of the FFT, the conventional FFT is only applied to location $-32$. When a new data value is received, the FFT may be computed successively as

$$X_{n+1}(k) = \left[ X_n(k) - x(n) + x(n+N) \right] e^{j2\pi k/N} \qquad (2.7)$$

where $N$ is the FFT size, $k$ is the carrier index, $n$ is the sample number, and $x(n+N)$ is the new incoming sample.

(44) Fig. 2.11: (a) Symbol location detected in stage I, where the gray region is the useful samples to which the FFT is applied. (b), (c) Leftmost and rightmost ranges of correlation, respectively (from [1]). Legend: detected symbol start time; corresponding detected useful time.

2.2.2 Normal Synchronization

After finishing initial synchronization, the SS can find the frame duration from the frame duration code in the MAPs. The timing synchronization stage should still be used to track the exact symbol time, because the received symbol time may shift with time due to channel variation. The CP correlation can estimate the rough symbol time. In the normal synchronization condition, pilot correlation helps to find a robust symbol time. The simulation result in [1] shows that when the Doppler spread is small, the standard deviation of the time synchronization error is about 3–4 sample times. If the channel is compensated, we can reduce the range of possible timing offsets estimated from the CP correlation, which lowers the complexity. The normal synchronization condition should be started after the channel is compensated. In our system, the channel estimator is performed after the synchronization. We assume that the channel is compensated before the frame is synchronized. In this case, the timing synchronization error in the CP correlation stage is assumed to be less than 5 sample times. Just as in the pilot correlation step of the frame synchronization stage, we should take the FFT in the range from 5 sample times before the estimated symbol time to 5 sample

(45) Table 2.4: System Parameters Used in Our Study
Parameter                           Value
Number of carriers                  2048
Center frequency (GHz)
Uplink / Downlink bandwidth (MHz)
Carrier spacing (kHz)               5.58
Sampling frequency (MHz)            11.43
OFDM symbol time                    201.6 µs (2304 samples)
Useful time                         179.2 µs (2048 samples)
Cyclic prefix time                  22.4 µs (256 samples)

(46) time after the estimated symbol time. The FFT output is used to do the pilot correlation with the 7 symbol types listed in Table 2.3. We can track the exact symbol time and check the symbol types. If the symbol type is not as expected, the initial synchronization should be redone.

Besides, the frequency has been synchronized to the BS during normal operation. According to 802.16a, the SS shall track the frequency changes and shall defer any transmission if synchronization is lost. The small frequency changes can be tracked by the frequency part of the joint ML estimation (the same as stage II of initial synchronization). These changes are averaged over a period of time and then compensated, so the frequency offset under the tracking mode will be smaller than in the initial frequency synchronization. If by any chance a larger frequency variation occurs, we may detect it by monitoring the received guard carriers and then try to correct it.

2.3 Summary of Downlink Synchronization Techniques

The system parameters employed in this study are shown in Table 2.4. Our goal in this thesis is to do a software implementation of the synchronization techniques on DSPs. The implemented transmitter and receiver components are as indicated in Figs. 2.12 and 2.13. The gray regions are implemented blocks, and the others such as FEC, channel estimation and equalization are not implemented in this study.

Recall from Section 2.2 that the initial DL synchronization contains 4 stages, which are symbol time synchronization, fractional frequency offset synchronization, integer frequency offset synchronization, and frame synchronization.

(47) Fig. 2.12: DL transmitter structure (from [1]). The gray regions indicate the implemented functions in our study.

Fig. 2.13: DL receiver structure (from [1]). The gray regions indicate the implemented functions in our study.

(48) Fig. 2.14: DL synchronization process block diagram.

At the beginning, a local peak is detected in the CP correlator output. The phase of the correlator output at the peak gives the fractional frequency offset. As shown in Fig. 2.14, this peak location is used to perform the integer frequency estimation. The integer frequency offset estimator estimates the integer frequency offset. The integer and fractional frequency offsets are added, and the result is used to compensate the input data. After some iterations, the integer frequency offset will be fixed; then the search for the frame start by pilot correlation begins.

The flow chart of the symbol time and fractional frequency offset estimation is shown in Fig. 2.15. CP_Max records the maximum value of the CP correlation, CP_corre_location records the start time of a symbol as estimated in the CP correlation stage, and Freq_Off records the estimated fractional frequency offset. A new correlation value is computed and then compared with CP_Max whenever a new data sample is received and shifted into the synchronization buffer. If the new correlation value is larger than CP_Max, we replace the value of CP_Max by the new correlation value, CP_corre_location by the current location,

(49) and Freq_Off by the phase of the correlation value. If the correlation value is not larger than the maximum value, we compute the next CP correlation value from the newly received sample without modifying the contents of the variables that record the CP correlation information. If all of the next 256 successive CP correlation values are not larger than CP_Max, the current CP_corre_location is the estimated symbol time and the current Freq_Off is the estimated fractional frequency offset.

Integer frequency offset estimation is performed after the FFT. The CP correlation peak location is used in this stage as the symbol start time. The flow chart of the integer frequency offset estimation is shown in Fig. 2.16. The lock condition is achieved after the spectrum offset of the received symbol is checked to be zero.

Frame synchronization is started after the frequency offset is compensated. The type of every received symbol is identified by pilot correlation. At the beginning of frame synchronization, the receiver waits for the first preamble symbol, which is the first symbol of a frame. The state machine is started when the first preamble symbol is received and goes to the next state when the predicted symbol is received. The normal synchronization condition is achieved when the third preamble symbol is received. If the received symbol is not the predicted symbol, the synchronization is lost, and the frame synchronization is restarted. Fig. 2.17 shows the state machine for frame synchronization.

(50) Fig. 2.15: Flow chart of symbol time and fractional frequency offset synchronization.

(51) Fig. 2.16: Flow chart of integer frequency offset synchronization.
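The guard-carrier check described for stage III can be sketched in C as below. It assumes an idealized, noise-free magnitude spectrum in which the DL used carriers span $-851$ to $851$, takes the shift as the distance of the first guard carrier above threshold from the edge pilot, and repeats the check-and-correct loop until the guard bands are clean. All identifiers are hypothetical.

```c
#include <assert.h>

#define N_FFT       2048
#define EDGE_PILOT   851              /* outermost DL used carriers: +/-851 */
#define IDX(k) ((k) + N_FFT / 2)      /* carrier -1024..1023 -> index 0..2047 */

/* Scan the DL guard carriers from 1023 down to 852, then from -1024 up to
   -852. Returns the estimated shift in carrier spacings, or 0 if every
   guard carrier is below the threshold. */
int guard_band_shift(const double *mag, double threshold)
{
    for (int k = 1023; k >= 852; --k)
        if (mag[IDX(k)] > threshold)
            return k - EDGE_PILOT;
    for (int k = -1024; k <= -852; ++k)
        if (mag[IDX(k)] > threshold)
            return k + EDGE_PILOT;
    return 0;
}

/* Additional check: both edge pilots must exceed the threshold. */
int edge_pilots_present(const double *mag, double threshold)
{
    return mag[IDX(EDGE_PILOT)] > threshold &&
           mag[IDX(-EDGE_PILOT)] > threshold;
}

/* Demo: build an idealized spectrum shifted by `shift` carriers, then
   alternate checking and correction until the guard bands are clean.
   Returns the accumulated correction, or -9999 if the pilot check fails. */
int integer_offset_demo(int shift)
{
    static double mag[N_FFT];
    int corrected = 0;

    assert(shift > -100 && shift < 100);
    for (;;) {
        int residual = shift - corrected;
        for (int k = -1024; k <= 1023; ++k) {
            int src = k - residual;           /* carrier that lands on k */
            mag[IDX(k)] = (src >= -EDGE_PILOT && src <= EDGE_PILOT) ? 1.0 : 0.0;
        }
        int step = guard_band_shift(mag, 0.5);
        if (step == 0)
            break;
        corrected += step;
    }
    return edge_pilots_present(mag, 0.5) ? corrected : -9999;
}
```

With noise and fading, the threshold choice and the ICI-induced pilot distortion mentioned in the text decide how reliable this check is; the sketch does not model either.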

(52) Fig. 2.17: The state machine of frame synchronization.
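The behaviour described for the frame synchronization state machine (start on the first preamble symbol, advance on each predicted symbol, lock on the third preamble symbol, restart on a mismatch) can be sketched as a pure transition function in C. The state and symbol names are hypothetical, the split of the 7 pilot structures into three preamble types and four normal types is an assumption, and the exact states of Fig. 2.17 are not reproduced here.

```c
#include <assert.h>

/* Symbol types as identified by pilot correlation (assumed 3 + 4 split). */
typedef enum {
    SYM_PREAMBLE_1, SYM_PREAMBLE_2, SYM_PREAMBLE_3,
    SYM_NORMAL_0, SYM_NORMAL_1, SYM_NORMAL_2, SYM_NORMAL_3
} symbol_type;

typedef enum {
    FS_WAIT_PREAMBLE,     /* waiting for the first preamble symbol      */
    FS_GOT_PREAMBLE_1,    /* expecting the second preamble symbol       */
    FS_GOT_PREAMBLE_2,    /* expecting the third preamble symbol        */
    FS_LOCKED             /* normal synchronization condition           */
} frame_state;

/* One transition per received symbol. */
frame_state frame_sync_step(frame_state s, symbol_type sym)
{
    assert((int)sym >= 0 && (int)sym <= (int)SYM_NORMAL_3);
    switch (s) {
    case FS_GOT_PREAMBLE_1:
        if (sym == SYM_PREAMBLE_2) return FS_GOT_PREAMBLE_2;
        break;
    case FS_GOT_PREAMBLE_2:
        if (sym == SYM_PREAMBLE_3) return FS_LOCKED;
        break;
    case FS_LOCKED:
        return FS_LOCKED;  /* symbol tracking after lock is not modelled */
    case FS_WAIT_PREAMBLE:
        break;
    }
    /* Unexpected symbol: restart, but accept it if it is a first preamble. */
    return (sym == SYM_PREAMBLE_1) ? FS_GOT_PREAMBLE_1 : FS_WAIT_PREAMBLE;
}
```

Keeping the transition function free of side effects makes it easy to test each path in isolation before it is attached to the pilot correlator output.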

(53) Chapter 3 DSP Introduction

The 802.16a DL synchronization techniques are implemented on a DSP platform. The platform we use is a DSP card made by Innovative Integration, the Quixote. This chapter introduces the Quixote PC plug-in card and the DSP on this card, which is Texas Instruments’ TMS320C6416. Our discussion will concentrate more on the DSP chip because our implementation is pure software on the DSP.

3.1 DSP Board Introduction

Quixote is Innovative Integration’s Velocia-family baseboard for various applications requiring high-speed computation. Fig. 3.1 shows a block diagram of the Quixote board. It combines a 600 MHz 32-bit fixed-point DSP, an FPGA (Virtex-II), analog acquisition, and system-level peripherals. The TI C6416 DSP operating at 600 MHz offers a processing power of 4800 MIPS. The Virtex-II FPGA includes 18x18 hardware multipliers and contains up to 12 digital clock managers, each providing 256 subdivisions of phase shifting and frequency synthesis capabilities to deliver flexibility in managing both on-chip and off-chip clock domains and synchronization. On-chip memory blocks in the Virtex-II fabric provide convenient high-speed memory elements for FIFOs, dual-port RAM and local process memory that are invaluable in efficient logic design. The Quixote card has a 32MB SDRAM for use by the DSP. When used with the

(54) advanced cache controller on the ’C6416, the SDRAM provides a large, fast external memory pool for DSP data and code. The 6416 cache controller is effective to over 85% of infinite on-chip memory performance for most DSP applications. A flash EEPROM allows configuration data to be saved, and a 512-byte serial EEPROM allows storage of converter correction coefficients which are used by the embedded Viterbi and turbo decoder.

Fig. 3.1: Block diagram of Quixote (from [15]).

(55) 3.2 Introduction to TMS320C6416 DSP [9]

3.2.1 TMS320C6416 Features

The TMS320C64x DSPs are the highest-performance fixed-point DSP generation on the TMS320C6000 DSP platform. The TMS320C64x device is based on the second-generation high-performance, very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI). The C6416 device has two high-performance embedded coprocessors, the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP), that significantly speed up channel-decoding operations on-chip. The C64x core CPU consists of 64 general-purpose 32-bit registers and 8 functional units. These 8 functional units contain two multipliers and six ALUs. Features of the C6000 devices include:

• Advanced VLIW CPU with eight functional units, including two multipliers and six arithmetic units:
  – Executes up to eight instructions per cycle.
  – Allows designers to develop highly effective RISC-like code for fast development time.
• Instruction packing:
  – Gives code size equivalence for eight instructions executed serially or in parallel.
  – Reduces code size, program fetches, and power consumption.
• Conditional execution of all instructions:
  – Reduces costly branching.
  – Increases parallelism for higher sustained performance.
• Efficient code execution on independent functional units:

(56)
  – Efficient C compiler on DSP benchmark suite.
  – Assembly optimizer for fast development and improved parallelization.
• 8/16/32-bit data support, providing efficient memory support for a variety of applications.
• 40-bit arithmetic options add extra precision for applications requiring it.
• Saturation and normalization provide support for key arithmetic operations.
• Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications.

The C64x additional features include:

• Each multiplier can perform two 16 × 16-bit or four 8 × 8-bit multiplies every clock cycle.
• Quad 8-bit and dual 16-bit instruction set extensions with data flow support.
• Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.
• Special communication-specific instructions have been added to address common operations in error-correcting codes.
• Bit count and rotate hardware extends support for bit-level algorithms.

3.2.2 Central Processing Unit

The block diagram of the C6416 DSP is shown in Fig. 3.2. The DSP contains:

• Program fetch unit.
• Instruction dispatch unit.
• Instruction decode unit.

(57) Fig. 3.2: Block diagram of TMS320C6416 DSP (from [9]).

(58)
• Two data paths, each with four functional units.
• 64 32-bit registers.
• Control registers.
• Control logic.
• Test, emulation, and interrupt logic.

The TMS320C64x DSP pipeline provides flexibility to simplify programming and improve performance. The pipeline can dispatch eight parallel instructions every cycle. These two factors provide this flexibility:

• Control of the pipeline is simplified by eliminating pipeline interlocks.
• Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operations. This provides single cycle throughput.

The pipeline phases are divided into three stages:

• Fetch.
• Decode.
• Execute.

All instructions in the C62x/C64x instruction set flow through the fetch, decode, and execute stages of the pipeline. The fetch stage of the pipeline has four phases for all instructions, and the decode stage has two phases for all instructions. The execute stage of the pipeline requires a varying number of phases, depending on the type of instruction. The stages of the C62x/C64x pipeline are shown in Fig. 3.3. Reference [9] contains detailed information on the fetch and decode phases. The pipeline operation of the C62x/C64x instructions can be categorized into seven instruction types. Six of these are shown in Table 3.1, which gives a mapping of operations occurring in

(59) Fig. 3.3: Pipeline phases of TMS320C6416 DSP (from [9]).

each execution phase for the different instruction types. The delay slots associated with each instruction type are listed in the bottom row. The execution of instructions can be defined in terms of delay slots. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction. Results from instructions with delay slots are not available until the end of the last delay slot. For example, a multiply instruction has one delay slot, which means that one CPU cycle elapses before the results of the multiply are available for use by a subsequent instruction. However, results are available from other instructions finishing execution during the same CPU cycle in which the multiply is in a delay slot.

The eight functional units in the C6000 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The functional units are described in Table 3.2. Besides being able to perform 32-bit operations, the C64x also contains many 8-bit and 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on an .M unit. The ADD4 instruction performs four 8-bit additions with a single instruction on an .L unit.

The data line in the CPU supports 32-bit operands, long (40-bit) operands and double word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (refer to Fig. 3.4). All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads.
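To make the packed-data extensions mentioned above concrete, the following portable C sketch emulates what ADD4 and MPYU4 compute on a 32-bit word holding four 8-bit lanes. These are plain-C emulations for illustration only; on the C64x the operations would be issued as single instructions, typically through compiler intrinsics whose exact names and return types should be taken from TI's compiler documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Four independent 8-bit additions; carries do not cross lane boundaries. */
uint32_t add4_emulated(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t s = ((a >> (8 * lane)) & 0xFFu) + ((b >> (8 * lane)) & 0xFFu);
        r |= (s & 0xFFu) << (8 * lane);
    }
    return r;
}

/* Four unsigned 8x8 multiplies; each 16-bit product occupies one lane of
   the 64-bit result. */
uint64_t mpyu4_emulated(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint64_t p = (uint64_t)((a >> (8 * lane)) & 0xFFu) *
                     (uint64_t)((b >> (8 * lane)) & 0xFFu);
        r |= p << (16 * lane);
    }
    return r;
}

/* Small self-check of the lane behaviour. */
void packed_ops_self_check(void)
{
    /* 0xFF + 0x01 wraps to 0x00 inside its lane without disturbing others. */
    assert(add4_emulated(0x01FF7F10u, 0x0101017Fu) == 0x0200808Fu);
    assert(mpyu4_emulated(0x02030405u, 0x0A0B0C0Du) == 0x0014002100300041ULL);
}
```

The emulation loops take several operations per lane, which is exactly the work that the single packed instruction removes from an inner loop.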

(60) Table 3.1: Execution Stage Length Description for Each Instruction Type (from [9]).

(61) Table 3.2: Functional Units and Operations Performed (from [9])

Functional unit: .L unit (.L1, .L2)
Operations: 32/40-bit arithmetic and compare operations; 32-bit logical operations; leftmost 1 or 0 counting for 32 bits; normalization count for 32 and 40 bits; byte shifts; data packing/unpacking; 5-bit constant generation; dual 16-bit arithmetic operations; quad 8-bit arithmetic operations; dual 16-bit min/max operations; quad 8-bit min/max operations.

Functional unit: .S unit (.S1, .S2)
Operations: 32-bit arithmetic operations; 32/40-bit shifts and 32-bit bit-field operations; 32-bit logical operations; branches; constant generation; register transfers to/from control register file (.S2 only); byte shifts; data packing/unpacking; dual 16-bit compare operations; quad 8-bit compare operations; dual 16-bit shift operations; dual 16-bit saturated arithmetic operations; quad 8-bit saturated arithmetic operations.

Functional unit: .M unit (.M1, .M2)
Operations: 16 x 16 multiply operations; 16 x 32 multiply operations; quad 8 x 8 multiply operations; dual 16 x 16 multiply operations; dual 16 x 16 multiply with add/subtract operations; quad 8 x 8 multiply with add operation; bit expansion; bit interleaving/de-interleaving; variable shift operations; rotation; Galois field multiply.

Functional unit: .D unit (.D1, .D2)
Operations: 32-bit add, subtract, linear and circular address calculation; loads and stores with 5-bit constant offset; loads and stores with 15-bit constant offset (.D2 only); load and store double words with 5-bit constant; load and store non-aligned words and double words; 5-bit constant generation; 32-bit logical operations.

(62) Because each unit has its own 32-bit write port, when performing 32-bit operations all eight units can be used in parallel every cycle.

3.2.3 Memory Architecture

The C64x has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices to a single memory space via the external memory interface (EMIF). The C62x/C67x have two 32-bit internal ports to access internal data memory. The C64x has two 64-bit internal ports to access internal data memory. The C62x/C64x/C67x have a single internal port to access internal program memory, with an instruction-fetch width of 256 bits. A variety of memory options are available for the C6000 platform. In our system, the memory types we can use are:

• On-chip RAM, up to 7M bits.
• Program cache.
• Two-level caches.
• 32-bit external memory interface supporting SDRAM, SBSRAM, SRAM, and other asynchronous memories.

In our system, the external memory used by the DSP is a 32MB SDRAM.

3.3 TI’s Code Development Environment [16], [17]

TI provides DSP users with a useful GUI development environment for developing and debugging their projects: Code Composer Studio (CCS). The CCS development tools are a key element of the DSP software and development tools from Texas Instruments. The fully integrated development environment includes real-time analysis capabilities, an easy-to-use

(63) Fig. 3.4: TMS320C64x CPU data path (from [9]).

(64) debugger, C/C++ compiler, assembler, linker, editor, visual project manager, simulators, XDS560 and XDS510 emulation drivers and DSP/BIOS support. Some of CCS’s fully integrated host tools include:

• Simulators for full devices, CPU only and CPU plus memory for optimal performance.
• Integrated Visual Project Manager with source control interface, multi-project support and the ability to handle thousands of project files.
• Source code debugger common interface for both simulator and emulator targets:
  – C/C++/assembly language support.
  – Simple breakpoints.
  – Advanced watch window.
  – Symbol browser.
• DSP/BIOS host tooling support (configure, real-time analysis and debug).
• Data transfer for real-time data exchange between host and target.
• Profiler to understand code performance.

CCS also delivers foundation software consisting of:

• DSP/BIOS kernel for the TMS320C6000 DSPs:
  – Pre-emptive multi-threading.
  – Interthread communication.
  – Interrupt handling.
• TMS320 DSP Algorithm Standard to enable software reuse.

(65)
• Chip Support Libraries (CSL) to simplify device configuration. CSL provides C-program functions to configure and control on-chip peripherals.
• DSP libraries for optimum DSP functionality. The DSP Library includes many C-callable, assembly-optimized, general-purpose signal-processing and image/video processing routines. These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical.

TI also supplies many optimized DSP functions for the TMS320C64x devices: the TMS320C64x digital signal processor library (DSPLIB). This source code library includes C-callable functions (ANSI-C language compatible) for general signal processing mathematical and vector functions [11]. The routines included in the DSP library are organized into seven groups:

• Adaptive filtering.
• Correlation.
• FFT.
• Filtering and convolution.
• Math.
• Matrix functions.
• Miscellaneous.

In our project, the FFT and IFFT functions are from this library.

3.4 Code Development Flow to Increase Performance [10]

The recommended code development flow involves utilizing the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. These advantages allow the compiler to do all the laborious work of instruction

(66) selection, parallelizing, pipelining, and register allocation. These features simplify the maintenance of the code, as everything resides in a C framework that is simple to maintain, support, and upgrade.

The recommended code development flow for the C6000 involves the phases described in Fig. 3.5. The tutorial section of the Programmer’s Guide [10] focuses on phases 1–3. These phases show the programmer when to go to the tuning stage of phase 3. What is learned is the importance of giving the compiler enough information to fully maximize its potential. An added advantage is that the compiler provides direct feedback on all of the program’s high-MIPS areas (loops). Based on this feedback, there are some very simple steps the programmer can take to pass complete and better information to the compiler, allowing the programmer a quicker start in maximizing compiler performance.

The following items list the goal of each phase in the 3-step software development flow shown in Fig. 3.5:

• Phase 1: Develop the C code without any knowledge of the C6000. Use the C6000 profiling tools to identify any inefficient areas that we might have in the C code. To improve the performance of the code, proceed to phase 2.
• Phase 2: Use the techniques described in [10] to improve the C code. Use the C6000 profiling tools to check its performance. If the code is still not as efficient as we would like it to be, proceed to phase 3.
• Phase 3: Extract the time-critical areas from the C code and rewrite the code in linear assembly. We can use the assembly optimizer to optimize this code.

TI provides high-performance C program optimization tools and does not recommend that the programmer code by hand in assembly. In this thesis, the development flow is stopped at phase 2. We do not optimize the code by writing linear assembly. Coding the program in a high-level language keeps the flexibility of porting to other platforms.
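As an illustration of the kind of phase-2 refinement meant by "passing better information to the compiler", the sketch below shows a 16-bit dot product with two common hints: `restrict`-qualified pointers (a promise that the arrays do not overlap) and a trip-count hint. The `MUST_ITERATE` pragma and the `_TMS320C6X` macro are, to the best of my knowledge, provided by TI's C6000 compiler, so they are guarded to keep the code portable; the function itself is a hypothetical example, not code from this thesis.

```c
#include <assert.h>

/* Dot product of two 16-bit arrays with a 32-bit accumulator.
   Preconditions (stated so the compiler hint below is truthful):
   n is a multiple of 4 and 4 <= n <= 2048. */
int dot_product(const short *restrict a, const short *restrict b, int n)
{
    int sum = 0;

    assert(n >= 4 && n <= 2048 && n % 4 == 0);
#ifdef _TMS320C6X
    /* Trip count is at least 4, at most 2048, and a multiple of 4, which
       lets the compiler unroll and software-pipeline without extra checks. */
#pragma MUST_ITERATE(4, 2048, 4)
#endif
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

For example, `dot_product((const short[]){1, 2, 3, 4}, (const short[]){5, 6, 7, 8}, 4)` evaluates to 70. On other compilers the guarded pragma is simply skipped and the function behaves identically.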

(67) Fig. 3.5: Code development flow for TI C6000 DSP.

(68) 3.4.1 Compiler Optimization Options [10]

The compiler supports several options to optimize the code. The compiler options can be used to optimize code size or execution performance. Our primary concern in this work is the execution performance. Hence we do not care very much about the code size. The easiest way to invoke optimization is to use the cl6x shell program, specifying the -on option on the cl6x command line, where n denotes the level of optimization (0, 1, 2, 3), which controls the type and degree of optimization:

• -o0:
  – Performs control-flow-graph simplification.
  – Allocates variables to registers.
  – Performs loop rotation.
  – Eliminates unused code.
  – Simplifies expressions and statements.
  – Expands calls to functions declared inline.
• -o1. Performs all -o0 optimizations, and:
  – Performs local copy/constant propagation.
  – Removes unused assignments.
  – Eliminates local common expressions.
• -o2. Performs all -o1 optimizations, and:
  – Performs software pipelining.
  – Performs loop optimizations.
  – Eliminates global common subexpressions.
  – Eliminates global unused assignments.

(69)
  – Converts array references in loops to incremented pointer form.
  – Performs loop unrolling.
• -o3. Performs all -o2 optimizations, and:
  – Removes all functions that are never called.
  – Simplifies functions with return values that are never used.
  – Inlines calls to small functions.
  – Reorders function declarations so that the attributes of called functions are known when the caller is optimized.
  – Propagates arguments into function bodies when all calls pass the same value in the same argument position.
  – Identifies file-level variable characteristics.

The -o2 level is the default if -o is set without an optimization level.

Program-level optimization can be specified by using the -pm option with the -o3 option. With program-level optimization, all of the source files are compiled into one intermediate file called a module. The module moves to the optimization and code generation passes of the compiler. Because the compiler can see the entire program, it performs several optimizations that are rarely applied during file-level optimization:

• If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
• If a return value of a function is never used, the compiler deletes the return code in the function.
• If a function is not called directly or indirectly, the compiler removes the function.

(70) When program-level optimization is selected in Code Composer Studio, options that have been selected to be file-specific are ignored. Program-level optimization is the highest-level optimization option. We use this option to optimize our code.

(71) Chapter 4 DSP Implementation

Recall that the 802.16a downlink synchronization process is as shown in Fig. 2.14. The process includes symbol timing synchronization, fractional frequency synchronization, integer frequency synchronization and frame synchronization. Our target is to implement the DL synchronization process on the TI TMS320C6416 DSP. Because the memory on our platform is quite large, the most important issue to be optimized on our system is the execution efficiency. This chapter focuses on the performance improvement of the DL synchronization code. The DL synchronization programs developed in [1] employed floating-point computation. The code we implement on the DSP employs fixed-point computation. The precision of the fixed-point numbers that we use is also discussed.

4.1 Efficiency Enhancement of DL Synchronization Code

The original DL synchronization program is written in C. It was initially written without any knowledge of the DSP. This section introduces the process of maximizing the performance.

4.1.1 Performance of the Original Program

The compile option that we use to optimize the original DL synchronization program is the program-level optimization. Tables 4.1 and 4.2 show the code size, maximum execution cycles, and minimum execution cycles of individual function blocks for the transmitter and the receiver, respectively.

(72) Floating-point computation is used in the program. Because the C6416 is a fixed-point DSP, floating-point operations on it are time-consuming.

The transmitter consists of several function blocks that are listed in Table 4.1. Modulation performs the data modulation that IEEE 802.16a supports. The options of data modulation are QPSK, 16-QAM and 64-QAM. In our program, the modulation is fixed at 64-QAM for all burst data. Framing performs the allocation of pilot carriers, guard carriers and burst data. fft_float is the fast Fourier transform from [1]. The IFFT function includes fft_float together with some arrangement of its input data buffer. Tx_mask_satisfaction performs the 4-times oversampling and SRRC filtering (from [1]).

The functions that are executed in the receiver are listed in Table 4.2. SRRC_downsample performs the 4-times downsampling and SRRC filtering (from [1]). The CP_correlation, initial_freq_sync, integer_freq_sync, and pilot_corre functions perform the synchronization techniques, which are CP correlation, fractional frequency synchronization, integer frequency synchronization and pilot correlation, respectively. fft_float in the receiver is the same as that in the transmitter with a different input option. FFT consists of the fft_float function and some arrangement of its input data buffer. In the de_framing function, data bursts are extracted from the received symbols. Finally, demodulation of the burst data is performed in the de_modulation function.

In our system, one symbol duration is 201.6 µs and there are 2304 samples in a symbol. The clock frequency of the DSP is 600 MHz. There are thus 120960 clock cycles in a symbol duration and, on average, 52.5 clock cycles in a sample duration. The average counts of all transmitter functions are measured over a symbol duration. Their target counts are 120960 cycles for real-time operation.
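The cycle budgets quoted above follow directly from the clock frequency and the symbol timing. The short C sketch below reproduces them; the ratio function reflects how the real-time rate figures in the profile tables appear to be computed (target count divided by average count), which is an inference from the tabulated numbers. Function names are hypothetical.

```c
#include <assert.h>

/* Clock cycles available in one OFDMA symbol.
   MHz times ns gives thousandths of a cycle, hence the division by 1000. */
long cycles_per_symbol(long clock_mhz, long symbol_time_ns)
{
    assert(clock_mhz > 0 && symbol_time_ns > 0);
    return clock_mhz * symbol_time_ns / 1000;
}

/* Average clock cycles available per received sample. */
double cycles_per_sample(long symbol_cycles, int samples_per_symbol)
{
    assert(samples_per_symbol > 0);
    return (double)symbol_cycles / samples_per_symbol;
}

/* Fraction of real-time speed achieved: 1.0 or more means real time. */
double real_time_rate(double target_cycles, double measured_cycles)
{
    assert(measured_cycles > 0.0);
    return target_cycles / measured_cycles;
}
```

With a 600 MHz clock and a 201.6 µs symbol, `cycles_per_symbol(600, 201600)` gives 120960 and `cycles_per_sample(120960, 2304)` gives 52.5. A block that averages 188110 cycles per symbol reaches about 64.3% of real-time speed.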
In the receiver, the average counts of the fft_float, FFT, de_framing and de_modulation functions are measured over a symbol duration, and their target counts are 120960 cycles for real-time operation. For the other functions in the receiver, the average counts are measured over one

(73) Table 4.1: Floating-Point Profile of 802.16a DL Transmitter Function Blocks
Block                  Code size (bytes)  Max. count (cycles)  Min. count (cycles)  Avg. count (cycles)  Real time rate
Modulation             460                4294288              1441061              3058185              3.96%
Framing                2212               188125               188091               188110               64.30%
fft_float              1328               23487728             23476418             23481019             0.52%
IFFT                   676                23491380             23480070             23484737             0.52%
Tx_mask_satisfaction   1852               46471084             46460414             46465084             0.26%

Table 4.2: Floating-Point Profile of 802.16a DL Receiver Function Blocks
Block                  Code size (bytes)  Max. count (cycles)  Min. count (cycles)  Avg. count (cycles)  Real time rate
SRRC_downsample        608                23283                16387                21233                0.25%
CP_correlation         1188               185559               43                   645                  8.14%
initial_freq_sync      420                184                  52                   57                   92.11%
integer_freq_sync      1228               23484952             40                   2078                 2.53%
pilot_corre            2972               24057628             48                   167290               0.03%
sync                   1132               47393690             56192                228702               0.02%
fft_float              1328               23258068             23250546             23254032             0.52%
FFT                    420                23456722             23451576             23453957             0.52%
de_framing             948                1626187              1626187              1626187              7.44%
de_modulation          904                1883132              637124               1352664              8.94%

sample duration and their target counts are 52.5 cycles for real-time operation. The real time rate listed in Tables 4.1 and 4.2 is the ratio of the real-time target count to the average count of each function.

In this thesis, we will optimize the synchronization-related functions. They are CP_correlation (CP correlation), initial_freq_sync (initial frequency synchronization), integer_freq_sync (integer frequency synchronization), pilot_corre (pilot correlation) and sync (synchronization). The sync function is the top-level function of synchronization.

4.1.2 Fixed-Point Number System Consideration

The C6416 is a fixed-point DSP. Floating-point operations on it are inefficient. We should realize the transmission system using fixed-point arithmetic to maximize the performance.
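As a concrete picture of fixed-point arithmetic in the Q.15 format used in this work (one sign bit and 15 fractional bits in a 16-bit word), the following C sketch shows conversion, a rounded multiply and a saturating add. It is a portable illustration with hypothetical names; the thesis implementation relies on C6416 instructions and library routines rather than on this code, and the rounding and saturation choices here are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* value = integer / 32768, range [-1, 1 - 2^-15] */

/* Convert with rounding to nearest and saturation at the range limits. */
q15_t q15_from_double(double x)
{
    assert(x >= -2.0 && x <= 2.0);
    double scaled = x * 32768.0;
    long v = (long)(scaled < 0.0 ? scaled - 0.5 : scaled + 0.5);
    if (v > 32767) v = 32767;
    if (v < -32768) v = -32768;
    return (q15_t)v;
}

double q15_to_double(q15_t q)
{
    return q / 32768.0;
}

/* 16 x 16 -> 32-bit product (Q.30), rounded back to Q.15.
   The right shift of a negative value is assumed to be arithmetic, as it
   is on common compilers. Only (-1) * (-1) can overflow, so it saturates. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p = (p + (1 << 14)) >> 15;
    if (p > 32767) p = 32767;
    return (q15_t)p;
}

/* Addition that clips instead of wrapping around. */
q15_t q15_add_sat(q15_t a, q15_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > 32767) s = 32767;
    if (s < -32768) s = -32768;
    return (q15_t)s;
}
```

For instance, 0.5 is stored as 16384, and `q15_mul(16384, 16384)` returns 8192, which represents 0.25. Values must stay below 1 in magnitude, which is why the format suits the normalized quantities used in synchronization.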

TI's programmer's guide [10] recommends using the short data type (16 bits) for fixed-point multiplication inputs whenever possible, because this data type makes the most efficient use of the 16-bit multiplier in the C6416. Besides changing the data type, some sub-functions in this system, such as FFT, IFFT, sine and cosine, should be replaced by fixed-point versions.

4.1.2.1 On the Precision of Fixed-Point Computation

The fixed-point number format that we use in the system for arithmetic operations is Q.15. We choose this format because the most efficient data format for the multiply operation is 16 bits, and the data used in the synchronization process are less than 1 in magnitude. We now evaluate whether this precision is sufficient for the synchronization work.

For this, we allocate 6 bursts (users) in the downlink part of one 802.16a frame. Source data are generated randomly and are modulated to 64-QAM symbols. There are 12 OFDMA symbols in one DL frame and 4 OFDMA symbols in one UL frame. The TTG and RTG are 136 samples. The frame structure and the burst allocation are shown in Fig. 4.1. The frame is repeated several times in transmission.

In the simulation environment, we employ the multipath ETSI "Vehicular A" channel model [1]. The time-varying channel impulse response for these models can be described by

$$h(\tau, t) = \sum_{k} \alpha_k(t)\, \delta(\tau - \tau_k) \tag{4.1}$$

which defines the channel impulse response at time $t$ as a function of the lag $\tau$. The channel taps $\alpha_k(t)$ are independent complex stochastic variables, fading with Jakes' Doppler spectrum, with a maximum Doppler frequency of 240 Hz, reflecting a mobile speed of approximately 120 km/h (and scatterers uniformly distributed around the mobile). The real-valued $\tau_k$ and the variance of the complex-valued $\alpha_k(t)$ are given in [13] and repeated in Table 4.3.
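The tapped-delay-line structure of (4.1) can be sketched for a single frozen time instant as follows. This is a simplified illustration, not the simulator used in the thesis: the tap gains are held constant (the Jakes fading of each tap is omitted), the delays are assumed to be integer sample counts, and the function name `apply_static_channel` is hypothetical.

```python
def apply_static_channel(x, delays, gains):
    """Pass samples x through a static multipath channel.

    Each path k contributes a copy of the input delayed by
    delays[k] samples and scaled by the complex gain gains[k].
    The output is longer than the input by the largest delay.
    """
    out = [0j] * (len(x) + max(delays))
    for d, g in zip(delays, gains):
        for n, s in enumerate(x):
            out[n + d] += g * s
    return out
```

Feeding a unit impulse into the function returns the discrete channel impulse response itself, which is a convenient way to check the delay and gain settings.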

Fig. 4.1: The bursts allocation in a frame.

Table 4.3: Characteristics of the ETSI "Vehicular A" Channel Environment

Tap   Relative delay   Relative delay       Relative delay    Average power   Average power    Average power
      (nsec)           (sample number,      (sample number,   (dB)            (normal scale)   (normalized)
                       4x oversampling)     normal)
1     0                0                    0                 0               1.0000           0.4850
2     310              14                   4                 -1.0            0.7943           0.3852
3     710              32                   8                 -9.0            0.1259           0.0610
4     1090             50                   12                -10.0           0.1000           0.0485
5     1730             79                   20                -15.0           0.0316           0.0153
6     2510             115                  29                -20.0           0.0100           0.0049
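The derived columns of Table 4.3 can be approximately reproduced from the delays in nanoseconds and the powers in dB. The sketch below is illustrative only: the sample period of 87.5 ns is an assumption chosen because it matches the sample-number columns, rounding to the nearest sample is likewise assumed, and the function names are hypothetical.

```python
DELAYS_NS = [0, 310, 710, 1090, 1730, 2510]
POWERS_DB = [0.0, -1.0, -9.0, -10.0, -15.0, -20.0]
SAMPLE_PERIOD_NS = 87.5    # assumed sample period, not stated in this section


def delay_in_samples(delay_ns, oversampling=1):
    """Round a path delay to the nearest sample index."""
    return round(delay_ns * oversampling / SAMPLE_PERIOD_NS)


def normalized_powers(powers_db):
    """Convert dB to linear scale and normalize the total power to 1."""
    linear = [10 ** (p / 10) for p in powers_db]
    total = sum(linear)
    return [p / total for p in linear]


normal_rate = [delay_in_samples(d) for d in DELAYS_NS]
oversampled = [delay_in_samples(d, oversampling=4) for d in DELAYS_NS]
powers = normalized_powers(POWERS_DB)
```

Normalizing the tap powers to unit sum keeps the average received signal power equal to the transmitted power, so the SNR setting of the simulation is not changed by the channel.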

Table 4.4: Relations Between Speed and Maximum Doppler Shift at Carrier Frequency 6 GHz and Subcarrier Spacing 5.58 kHz

Speed (km/hr)   Doppler shift (Hz)   Normalized Doppler shift
0               0                    0
20              111                  0.0224
40              222                  0.0448
60              333                  0.0672
80              444                  0.0896
100             556                  0.112
120             667                  0.134

The SNR is set to 10 dB in the fading channel. The receiver SNR specified in the 802.16a test conditions ranges from 9.4 to 24.4 dB, so 10 dB, which is close to the worst condition, is a reasonable value for simulation. The maximum Doppler shifts of our simulation are shown in Table 4.4 for speeds from 0 to 120 km/hr.

The goals of synchronization are to compensate the frequency offset and to find the frame start time. To evaluate the precision of the fixed-point format, we compare the frequency lock and frame lock performance of the floating-point system and the fixed-point system. The frequency offset is estimated and compensated in the synchronization process. The frequency lock condition is achieved when the frequency offset is compensated. The frame lock condition is achieved when the three successive preamble symbols are identified.

Each simulation run transmits five 802.16a frames. If the frequency lock and frame lock are not obtained within these 5 frames, the synchronization is declared to have failed. The current symbol number is recorded when the frequency is locked, and the current frame number is recorded when the frame is locked. The average symbol number at frequency lock and the frequency lock fail rate are used to measure the performance of frequency lock, and the average frame number at frame lock and the frame lock fail rate are used to measure the performance of frame lock.

Tables 4.5 and 4.6 show the simulation results. The frequency offset is always locked within the 5-frame duration, and it takes on average no more than 6 symbols to achieve frequency lock. The performance is not clearly

Table 4.5: Performance Comparison of Frequency Lock Between Floating-Point and Fixed-Point Implementation

Doppler shift   Lock fail rate                   Average lock symbol number
(normalized)    Floating-point   Fixed-point     Floating-point   Fixed-point
0               0                0               2.99             2.98
0.0224          0                0               2.66             2.69
0.0448          0                0               2.36             2.39
0.0672          0                0               2.30             2.32
0.0896          0                0               2.61             2.57
0.112           0                0               3.23             3.42
0.134           0                0               5.15             5.14

Table 4.6: Performance Comparison of Frame Lock Between Floating-Point and Fixed-Point Implementation

Doppler shift   Lock fail rate                   Average lock frame number
(normalized)    Floating-point   Fixed-point     Floating-point   Fixed-point
0               0.001            0.001           1.00             1.00
0.0224          0.057            0.074           1.98             1.94
0.0448          0.008            0.100           1.26             1.24
0.0672          0.027            0.032           1.65             1.70
0.0896          0.136            0.140           2.59             2.59
0.112           0.107            0.135           2.14             2.19
0.134           0.063            0.069           1.50             1.47
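The Doppler values used as the first column of the tables above can be related to the mobile speed with the usual formula f_d = v f_c / c. The sketch below is a hedged illustration: it assumes that the normalized value is the Doppler shift multiplied by the OFDMA symbol duration including a cyclic prefix of 1/8 of the useful symbol time, an interpretation that matches the tabulated numbers but is not stated explicitly in this section. The function names are hypothetical.

```python
C_LIGHT = 3.0e8            # speed of light in m/s (rounded)


def doppler_shift_hz(speed_kmh, carrier_hz=6.0e9):
    """Maximum Doppler shift for a given speed and carrier frequency."""
    speed_ms = speed_kmh / 3.6
    return speed_ms * carrier_hz / C_LIGHT


def normalized_doppler(speed_kmh, subcarrier_spacing_hz=5.58e3,
                       cp_ratio=1 / 8):
    """Doppler shift times the symbol duration (assumed CP ratio 1/8)."""
    symbol_duration = (1 + cp_ratio) / subcarrier_spacing_hz
    return doppler_shift_hz(speed_kmh) * symbol_duration


speeds = [0, 20, 40, 60, 80, 100, 120]
shifts = [round(doppler_shift_hz(v)) for v in speeds]
ratios = [normalized_doppler(v) for v in speeds]
```

Because the Doppler shift is proportional to speed, both lists grow linearly with the entries of `speeds`.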