(1) National Chiao Tung University, Department of Electronics Engineering, Institute of Electronics, Master's Thesis. IEEE 802.16a 標準之前向誤差改正編碼於數位訊號處理器平台上之實現與最佳化 (DSP Implementation and Optimization of the Forward Error Correction Scheme in IEEE 802.16a Standard). Student: 李仰哲 (Young-Tse Lee). Advisor: Dr. 杭學鳴 (Hsueh-Ming Hang). June 2004.

(2) IEEE 802.16a 標準之前向誤差改正編碼於數位訊號處理器平台上之實現與最佳化. DSP Implementation and Optimization of the Forward Error Correction Scheme in IEEE 802.16a Standard. Student: Young-Tse Lee (李仰哲). Advisor: Dr. Hsueh-Ming Hang (杭學鳴). A Thesis Submitted to the Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering. June 2004, Hsinchu, Taiwan, Republic of China.

(3) DSP Implementation and Optimization of the Forward Error Correction Scheme in IEEE 802.16a Standard. Student: Young-Tse Lee. Advisor: Dr. Hsueh-Ming Hang. Institute of Electronics, Department of Electronics Engineering, National Chiao Tung University. Abstract. In the IEEE 802.16a wireless communication standard, a forward error correction (FEC) coding mechanism is defined at both the transmitter and the receiver to reduce the impact of channel noise and distortion. The focus of this thesis is to implement the FEC scheme defined by the standard on a digital signal processor (DSP) platform, and to improve the programs with respect to both the characteristics of the platform (which contains a DSP and an FPGA) and the FEC algorithms. We implement the four mandatory FEC schemes defined in the standard on a platform whose core is a DSP developed by Texas Instruments. Since program execution efficiency is our main concern, after a brief introduction to the FEC algorithms, the DSP platform architecture, and its software optimization techniques, we explain step by step how we optimize our programs on the DSP platform. In the end we obtain a clear improvement in execution efficiency: after optimization, the FEC encoder reaches a processing rate of 7984 Kbits per second on the DSP simulator, and the decoder reaches 750 Kbits per second. In addition, for the Xilinx FPGA built into our DSP platform, we perform two simulations to evaluate the processing efficiency on the FPGA of the bottleneck of Viterbi decoder optimization: the add-compare-select (ACS) unit. Limited by the transmission bandwidth between the DSP and the FPGA, the ACS unit, whose native processing rate on the FPGA is 45M 64-state stages per second, can actually achieve only 32M 64-state stages per second. I.

(4) DSP Implementation and Optimization of the Forward Error Correction Scheme in IEEE 802.16a Standard. Student: Young-Tse Lee. Advisor: Dr. Hsueh-Ming Hang. Department of Electronics & Institute of Electronics, National Chiao Tung University. Abstract. In the IEEE 802.16a wireless communication standard, a Forward Error Correction (FEC) mechanism is specified at both the transmitter and the receiver to mitigate the effect of channel noise. The focus of this thesis is the DSP implementation of the FEC scheme defined in the IEEE 802.16a standard and the modification of the FEC algorithms to match the architecture of the DSP and FPGA platform. We have implemented the four required FEC schemes defined in the standard on the Texas Instruments (TI) TMS320C6416 digital signal processor (DSP). After a brief review of the algorithms, we describe the DSP hardware architecture and its software optimization techniques. We then explain step by step how we optimize the FEC programs on the DSP platform, since speed performance is our major concern. These optimizations yield a significant improvement: the improved FEC encoder achieves a data processing rate of 7984 Kbits/sec and the improved FEC decoder a rate of 750 Kbits/sec on the TI C64xx DSP simulator. Furthermore, we have done two simulations to evaluate the. II.

(5) data processing rate of the Add-Compare-Select (ACS) unit implemented on the Xilinx FPGA, since the ACS unit is the speed bottleneck of the Viterbi decoder. Although the ACS unit itself runs at 45M (64 states/sec) on the FPGA, the constraint on the transmission bandwidth between the DSP and the FPGA limits the achievable processing rate to 32M (64 states/sec). III.

(6) Acknowledgements. First, I thank my advisor, Dr. Hsueh-Ming Hang, for his careful guidance over the past two years, without which this thesis could not have been completed. Whenever my research stalled, he offered timely encouragement and valuable advice that helped me overcome the bottlenecks I encountered. Beyond my own research topic, he constantly encouraged us to explore related fields, laying a solid foundation for further study, and he cared about and accommodated our problems in life and in school, so that I felt no extra pressure during my research. I also thank the Communication Electronics and Signal Processing Laboratory for the ample hardware and software resources it provided for our research, and all of its members for creating a lively, harmonious atmosphere that added much interest to an otherwise somewhat dull research life. I thank 陳繼大, 楊政翰, and 蔡家揚 for generously sharing their experience and assistance throughout my research, saving me much time of groping in the dark, and I thank 曾建統 and 劉明瑋 for the frequent discussions from which I benefited greatly. Finally, I thank my family, who let me devote myself wholly to research without worry; without their support and understanding, this thesis would not exist. I dedicate this thesis to all the teachers, peers, and family who helped me and accompanied me through these years. Thank you!

(7) Contents

1 Introduction 1

2 Overview of IEEE 802.16a FEC Scheme 4
   2.1 Introduction to IEEE 802.16a Standard 4
   2.2 IEEE 802.16a FEC Specifications 5
      2.2.1 Randomizer 6
      2.2.2 Forward Error Correction Coding 7
         2.2.2.1 Reed-Solomon Code Specification 9
         2.2.2.2 Convolutional Code Specification 9
         2.2.2.3 Interleaver 11
   2.3 Implementation Issues of the FEC Scheme 12
      2.3.1 Reed-Solomon Code 12
         2.3.1.1 Encoding of Shortened and Punctured Reed-Solomon Codes 12
         2.3.1.2 Decoding of Shortened and Punctured Reed-Solomon Codes 15
         2.3.1.3 Galois Field Arithmetic 18
      2.3.2 Convolutional Code 19
         2.3.2.1 Encoding of Punctured Convolutional Code 19
         2.3.2.2 Viterbi Decoding of Punctured Convolutional Code 20
         2.3.2.3 Bit Interleaved Soft Decision Viterbi Decoding 24
         2.3.2.4 Viterbi Decoding of Tail-Biting Convolutional Code 26
         2.3.2.5 The Butterfly Structure in the Trellis Diagram 26

3 DSP Implementation Environment 28
   3.1 The DSP Chip 29
      3.1.1 Central Processing Unit 31
      3.1.2 Memory 32
      3.1.3 Peripherals 33
   3.2 The DSP Baseboard 34
   3.3 DSP Transmission Mechanism 35

4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform 38
   4.1 System Structure of the FEC Implementation 39
   4.2 Compiler Level Optimization Techniques 40
      4.2.1 Pipeline Structure of the TI C6000 Family 41
      4.2.2 Code Development Flow 42
      4.2.3 Software Pipelining 43
   4.3 Optimization on Reed-Solomon Code 47
      4.3.1 Optimization on RS Encoder 47
         4.3.1.1 Choose Appropriate Data Types 47
         4.3.1.2 Galois Field Multiplication 49
         4.3.1.3 Compiler Level Improvements 56
      4.3.2 Optimization on RS Decoder 57
         4.3.2.1 Galois Field Inversion 57
         4.3.2.2 Data Type Modification 59
         4.3.2.3 Chien Search Improvement – I 60
         4.3.2.4 Chien Search Improvement – II 63
         4.3.2.5 Inverse-Free Berlekamp-Massey Algorithm 64
         4.3.2.6 Compiler Level Improvements 65
   4.4 Optimization on Convolutional Code 67
      4.4.1 Optimization on Viterbi Decoder 68
         4.4.1.1 Choose Appropriate Data Type for Branch Metric 68
         4.4.1.2 Modified Path Recording – I 69
         4.4.1.3 Modified Path Recording – II 70
         4.4.1.4 Counter Splitting 71
         4.4.1.5 Removal of Replicated Metrics 73
   4.5 Simulation Results 74
      4.5.1 Simulation Profile for RS Encoder 74
      4.5.2 Simulation Profile for RS Decoder 76
      4.5.3 Simulation Profile for CC Encoder 77
      4.5.4 Simulation Profile for CC Decoder 78
      4.5.5 Simulation Profile for FEC Encoder 79
      4.5.6 Simulation Profile for FEC Decoder 80

5 ACS Unit Acceleration by Employing Xilinx FPGA as an Assistant 81
   5.1 ACS Design - I 82
      5.1.1 Original ACS Structure 82
      5.1.2 Improved ACS Structure 84
   5.2 ACS Design - II 87

6 Conclusion and Future Work 93
   6.1 Conclusion 93
   6.2 Future Work 94

Bibliography 96

(10) List of Tables

Table 2.1 Mandatory Channel Coding per Modulation 9
Table 2.2 The Inner Convolutional Code with Puncturing Configuration 10
Table 2.3 Bit Interleaved Block Sizes and Modulo 11
Table 4.1 Completing Phase of Different Type Instructions 42
Table 4.2 Original Profile of RS Encoder 48
Table 4.3 Profile of Revised RS Encoder (Data Type Modification) 49
Table 4.4 Comparison of the Five Different Galois Field Multipliers 55
Table 4.5 Comparison of the Five Different Galois Field Inverters 59
Table 4.6 Original Profile of RS Decoder 59
Table 4.7 Profile of Revised RS Decoder (Data Type Modification) 60
Table 4.8 Profile of the Worst Case and Best Case of Early Terminated Chien Search 62
Table 4.9 Profile of the Worst Case and Best Case of Early Terminated Chien Search (Modified) 63
Table 4.10 Comparison between the Original and the Inverse-Free BM Algorithm 65
Table 4.11 Original Profile of Viterbi Decoder 68
Table 4.12 Profile of Viterbi Decoder Using Fixed-Point Value as Branch Metric 69
Table 4.13 Profile of VD_Decode Function 69
Table 4.14 Profile of Reed-Solomon Encoder (I/O Included) 75
Table 4.15 Profile of Reed-Solomon Encoder (I/O Excluded) 76
Table 4.16 Profile of Reed-Solomon Decoder (I/O Included) 77
Table 4.17 Profile of Reed-Solomon Decoder (I/O Excluded) 77
Table 4.18 Profile of Convolutional Encoder (I/O Included) 77
Table 4.19 Profile of Convolutional Encoder (I/O Excluded) 78
Table 4.20 Profile of Soft Decision Decoding Viterbi Decoder (I/O Included) 78
Table 4.21 Profile of Soft Decision Decoding Viterbi Decoder (I/O Excluded) 79
Table 4.22 Profile of Forward Error Correction Encoder 79
Table 4.23 Profile of Forward Error Correction Decoder 80

(12) List of Figures

Figure 2.1 IEEE local and metropolitan area networks standards family 5
Figure 2.2 Channel coding structure in transmitter side (top) and receiver side (bottom) 6
Figure 2.3 PRBS for Data Randomization 6
Figure 2.4 Creation of OFDMA randomizer initialization vector 7
Figure 2.5 Forward Error Correction structure in transmitter side (left) and receiver side (right) 8
Figure 2.6 Convolutional Encoder of Rate 1/2 10
Figure 2.7 Block Diagram of the RS Encoder Program 14
Figure 2.8 The Linear Feedback Shift Register Structure of RS Encoder 14
Figure 2.9 Block Diagram of a Conventional RS Encoder 15
Figure 2.10 Flowchart of the Berlekamp-Massey Algorithm 17
Figure 2.11 Block Diagram of the RS Decoder Program 18
Figure 2.12 Syndrome Computation Circuit 18
Figure 2.13 Block Diagram of the Convolutional Encoder Program 20
Figure 2.14 State Transition Diagram Example 21
Figure 2.15 Trellis Diagram Example for a Viterbi Decoder 22
Figure 2.16 Survivor Path of the Trellis Diagram 23
Figure 2.17 Block Diagram of the Viterbi Decoder Program 23
Figure 2.18 Structure of the Viterbi Algorithm 24
Figure 2.19 Partition of the 16-QAM Constellation 26
Figure 2.20 Block Diagram of the Suboptimal Tail-Biting Viterbi Decoder 26
Figure 2.21 Butterfly Structure Showing Branch Cost Symmetry 27
Figure 3.1 The Block Diagram of TMS320C6x DSP Chip 30
Figure 3.2 The TMS320C64x DSP Chip Architecture and Comparison with the Earlier TMS320C62x/C67x Chips 30
Figure 3.3 Innovative Integration's Quixote DSP Baseboard Card 34
Figure 3.4 The Architecture of Quixote Baseboard 35
Figure 3.5 Block Diagram of DSP Streaming Mode 37
Figure 4.1 System Structure of Transmitter Side 40
Figure 4.2 System Structure of Receiver Side 40
Figure 4.3 Code Development Flow 43
Figure 4.4 (a) The Original Loop. (b) The Loop After Applying Software Pipelining 44
Figure 4.5 (a) Execution Record of the Original Loop. (b) Execution Record of the Software Pipelined Loop 44
Figure 4.6 Pseudo Code for Variables Using Long and Int Data Types 49
Figure 4.7 Algorithm for Serial Multiplier 52
Figure 4.8 Serial Multiplier in GF(2^8) 53
Figure 4.9 Compiler's Feedback for RS_Encode Loop 56
Figure 4.10 Pseudo Code for RS_Encode Loop 56
Figure 4.11 Compiler's Feedback for RS_Encode Loop (After Build Option Change) 57
Figure 4.12 Pseudo Code for Chien Search (a) w/o Criterion (b) w/ Criterion 61
Figure 4.13 Flowchart of Early Terminated Chien Search 62
Figure 4.14 Compiler's Feedback for Syndrome Calculator Loop 66
Figure 4.15 Pseudo Code for Syndrome Calculator 66
Figure 4.16 Compiler's Feedback for Syndrome Calculator Loop (After Build Option Change) 67
Figure 4.17 Pseudo Code for Recording Path in Internal Data Memory and Register 70
Figure 4.18 Compiler's Feedback for ACS Loop 72
Figure 4.19 Pseudo Code for Counter Splitting 72
Figure 4.20 Compiler's Feedback for ACS Loop (After Counter Splitting) 73
Figure 4.21 Pseudo Code for Removing Replicated Metrics 74
Figure 5.1 Block Diagram of Original ACS Design 82
Figure 5.2 FPGA Synthesis Report for Original ACS Design 83
Figure 5.3 Schematic of Modified ACS Design 85
Figure 5.4 FPGA Synthesis Report for Modified ACS Design 86
Figure 5.5 Place and Route Report for Modified ACS Design 87
Figure 5.6 Waveform of Modified ACS Design 87
Figure 5.7 Two Equivalent Trellises 88
Figure 5.8 (a) Original ACS Architecture. (b) ACS Architecture Based on Double State Trellis 89
Figure 5.9 FPGA Synthesis Report for Double State ACS Design 90
Figure 5.10 Place and Route Report for Double State ACS Design 91
Figure 5.11 Macro Statistics of the (a) Original ACS Design. (b) Double State ACS Design 91
Figure 5.12 Waveform of Double State ACS Design 92

(15) Chapter 1 Introduction. Digital wireless transmission of multimedia content is a clear trend in consumer electronics. Because wireless multimedia services demand both high data rates and mobility, OFDM modulation has become the mainstream technique for wireless communication in recent years. IEEE has completed several OFDM-based standards, such as the IEEE 802.11 series for the LAN (Local Area Network) and the IEEE 802.16 series for the MAN (Metropolitan Area Network). Our study is based on the IEEE 802.16a standard, which specifies the air interface of fixed broadband wireless access systems providing multiple services. The appeal of digital wireless communication is that consumers can receive or transmit digital content without being attached to transmission lines. However, problems remain in wireless communication systems. One major problem is that the transmission channel is not noiseless: the transmitted signals are easily interfered with and distorted by noise sources such as heavy traffic, bad weather, obstruction by buildings, and so on. Multimedia services cover a broad range of content, including audio, video, still images, and traditional speech, and their quality becomes unacceptable if the errors introduced by the noisy channel cannot be detected and corrected. To improve the robustness of wireless communication under such noisy channel. 1.

(16) conditions, the FEC (forward error correction coding) mechanism and the FED (forward error correction decoding) mechanism are a must to combat channel errors. Therefore, they exist in almost every commercial communication standard, including the IEEE 802.16a standard mentioned earlier. In this thesis, we focus on the implementation of the FEC/FED scheme of the IEEE 802.16a standard on the Innovative Integration (II) Quixote DSP/FPGA board. We first review the FEC/FED algorithms used in 802.16a to understand the encoding and decoding procedures. Then, we write C programs to check the correctness of our algorithms. Finally, we implement the FEC/FED algorithms on the DSP and improve their speed by optimizing the DSP programs. Furthermore, to increase the processing speed of the FED scheme further, we use the Xilinx FPGA embedded in the Quixote board as an extra hardware accelerator. In Chapter 2, we briefly introduce the forward error correction scheme of the IEEE 802.16a standard, discuss the major algorithm blocks, and describe how we implement these algorithms in C to reduce the computational complexity. In Chapter 3, we give a brief description of our implementation environment, including II's Quixote DSP baseboard and the communication mechanism between the host PC and the target DSP. In Chapter 4, we first briefly introduce the architecture of the C6x DSP and explain the impact of data and instruction types on program execution. Then, we describe the software pipelining technique used by the compiler, which is helpful for writing efficient high-level programs. We then describe the optimizations applied to the Reed-Solomon decoder and the Viterbi decoder implemented on the TI C64 DSP, together with the techniques used to improve the overall processing speed, and we compare the processing speed before and after optimization. Finally, the simulation profiles of the improved FEC encoder and decoder are given to show the processing rate achieved after optimization. 2.

(17) In Chapter 5, we introduce the Xilinx FPGA as an extra hardware accelerator; we discuss the implementation issues on the FPGA platform and evaluate how much improvement may be achieved with the assistance of the FPGA. At the end, we give some observations and conclusions. Possible subjects for future work are also included. 3.

(18) Chapter 2 Overview of IEEE 802.16a FEC Scheme. 2.1 Introduction to IEEE 802.16a Standard. The IEEE 802.16a standard amends IEEE Standard 802.16 by enhancing the medium access control layer and providing additional physical layer specifications in support of broadband wireless access at frequencies from 2 to 11 GHz. The resulting standard specifies the air interface of fixed (stationary) broadband wireless access systems providing multiple services. The medium access control layer is capable of supporting multiple physical layer specifications optimized for the frequency bands of application. The standard includes particular physical layer specifications applicable to systems operating between 2 and 66 GHz, and it supports point-to-multipoint and optional mesh topologies [1]. This standard is part of a family of standards for local and metropolitan area networks. The relationship between this standard and the other members of the family is shown in Fig. 2.1 (the numbers in the figure refer to IEEE standard designations). The family of standards deals with the Physical and Data Link Layers as defined by the International Organization for Standardization (ISO) Open Systems Interconnection Basic Reference Model. The access standards define several types of medium access technologies and associated physical media, each appropriate for particular applications or system objectives. Other types are under investigation [1]. 4.

(19) This thesis focuses on the DSP/FPGA joint implementation and optimization issues of the IEEE 802.16a Forward Error Correction (FEC) coding/decoding scheme. Therefore, we concentrate on introducing the FEC specifications defined in the IEEE 802.16a physical layer in the next section. In the last part of this chapter, we show the block diagrams of our simulation programs and explain what we have done to modify the implementation structure and hence reduce the computational complexity. Figure 2.1: IEEE local and metropolitan area networks standards family. 2.2 IEEE 802.16a FEC Specifications. The overall physical layer structure of the channel coding scheme is shown in Fig. 2.2. Whereas the Reed-Solomon code and the convolutional code are the major parts of the FEC scheme, the randomizer and the interleaver are additional modules that further improve its error performance. The detailed specifications of each block are introduced in the following subsections, excluding the modulator, which is not implemented in our research subproject. 5.

(20) Figure 2.2: Channel coding structure in transmitter side (top) and receiver side (bottom). 2.2.1 Randomizer. Data randomization is performed on data transmitted on the DL and UL. The randomization is performed on each allocation (DL or UL), which means that for each allocation of a data block (subchannels on the frequency domain and OFDM symbols on the time domain) the randomizer shall be used independently. If the amount of data to transmit does not exactly fit the amount of data allocated, padding of 0xFF ("1" only) shall be added to the end of the transmission block, up to the amount of data allocated. Figure 2.3: PRBS for Data Randomization. The randomizer is a Pseudo Random Binary Sequence (PRBS) generator depicted in Fig. 2.3. As shown in the figure, the generator polynomial of the randomizer is 1 + x^14 + x^15. Each data byte to be transmitted shall enter the randomizer sequentially, msb first, to make the "0" and "1" bits in the input data stream well-distributed and. 6.

(21) hence improve the coding performance. The randomizer sequence is applied only to information bits. Preambles are not randomized. The shift register of the randomizer shall be re-initialized after every 1250 bytes passed through (if the allocation is larger than 1250 bytes). In the downlink, the randomizer shall be re-initialized at the start of each frame with the sequence (msb) 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 (lsb). In the uplink, the randomizer is initialized with the vector created as shown in Fig. 2.4. Figure 2.4: Creation of OFDMA Randomizer Initialization Vector. 2.2.2 Forward Error Correction Coding. Forward error correction is used to decrease the bit error rate (BER) on noisy communication channels. This is achieved by a method known as channel coding, which adds redundant information to the transmitted data. With forward error correction, transmission errors are corrected at the decoder without requesting a retransmission. Convolutional coding and block coding are the two major forms of channel coding [2]. In our IEEE 802.16a OFDMA project, both a convolutional code and a block code (Reed-Solomon code) are employed. The Forward Error Correction scheme used in the IEEE 802.16a standard, shown in Fig. 2.5, consists of the concatenation of a Reed-Solomon outer code and a rate-compatible convolutional inner code, and is supported on both UL and DL. The input data streams are first divided into RS (Reed-Solomon) blocks; the block size is. 7.

(22) determined by the parameter K defined in the RS code specification. Each block is then encoded by an RS encoder, and each RS-coded block is in turn encoded by a convolutional encoder. The convolutional code is a sequential code, whereas the RS code is a block code; overall, the concatenation forms a block-based coding scheme. Figure 2.5: Forward Error Correction structure in transmitter side (left) and receiver side (right). In order to make the system more flexible and adaptable to the channel condition, six coding-modulation schemes are provided in the standard, as shown in Table 2.1 (note that 64-QAM is an optional mode). The different coding rates are obtained by shortening and puncturing the original RS code and by puncturing the original convolutional code. The shortening-and-puncturing mechanism in the RS code provides different block sizes and hence different error-correcting capabilities through the same RS codec (coder/decoder). Similarly, the convolutional code provides variable code rates through the same codec by applying the puncturing rule, so it can suit the variable block size of the shortened-and-punctured RS code to achieve a desired overall coding rate. 8.
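Before moving to the code specifications, the randomizer of Section 2.2.1 can be sketched in C. This is a minimal illustration rather than the thesis implementation: the 15-bit state packing (stage 1 in the most significant of the 15 bits) and all function names are our own choices.

```c
#include <stdint.h>

/* 15-bit LFSR for the PRBS with generator 1 + x^14 + x^15.
   Bit 14 of `state` holds register stage 1; bit 0 holds stage 15. */
typedef struct { uint16_t state; } prbs_t;

/* DL frame-start seed: (msb) 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 (lsb) */
static void prbs_init_dl(prbs_t *p) { p->state = 0x4A80; }

static int prbs_next_bit(prbs_t *p) {
    /* output/feedback = stage 14 XOR stage 15 (taps x^14 and x^15) */
    int fb = ((p->state >> 1) ^ p->state) & 1;
    p->state = (uint16_t)((p->state >> 1) | (fb << 14));
    return fb;
}

/* Each data byte enters msb first and is XORed with the PRBS output.
   The same routine de-randomizes, since XOR is self-inverse. */
static uint8_t randomize_byte(prbs_t *p, uint8_t b) {
    uint8_t out = 0;
    for (int i = 7; i >= 0; --i)
        out |= (uint8_t)((((b >> i) & 1) ^ prbs_next_bit(p)) << i);
    return out;
}
```

Running a byte through two generators initialized with the same seed recovers the original byte, which is exactly how the receiver de-randomizes the data.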

(23) Table 2.1: Mandatory Channel Coding per Modulation. 2.2.2.1 Reed-Solomon Code Specification. The Reed-Solomon encoding is derived from a systematic RS (N = 255, K = 239, T = 8) code over GF(2^8), where N is the number of overall bytes after encoding, K is the number of data bytes before encoding, and T is the number of data bytes that can be corrected. The Galois field used in this code is generated by the field generator polynomial p(x) = x^8 + x^4 + x^3 + x^2 + 1, and the codeword is generated by the code generator polynomial g(x) = (x + λ^0)(x + λ^1)(x + λ^2)…(x + λ^15), with λ = 02_hex. This code is shortened and punctured to enable variable block sizes and variable error-correction capability. When a block is shortened to K' data bytes, the first 239 − K' bytes of the encoder block are filled with "0"s. When a codeword is punctured to permit T' bytes to be corrected, only the first 2T' of the total 16 parity bytes are employed. 2.2.2.2 Convolutional Code Specification. After the RS encoding process, each RS block is encoded by the binary convolutional encoder, which has a native rate of 1/2 and a constraint length K = 7, and uses the following generator polynomials to derive its two code bits: G1 = 171_OCT for X. 9.

(24) G2 = 133_OCT for Y. The generator is depicted in Fig. 2.6. Figure 2.6: Convolutional Encoder of Rate 1/2. The puncturing patterns and the serialization order used to realize the different code rates are defined in Table 2.2. In the table, "1" denotes a transmitted bit and "0" denotes a removed bit, whereas X and Y refer to Fig. 2.6. The serialization orders are X1Y1Y2 for rate 2/3, X1Y1Y2X3 for rate 3/4, and X1Y1Y2X3Y4X5 for rate 5/6; the table also lists the free distance dfree of each punctured code. Table 2.2: The Inner Convolutional Code with Puncturing Configuration. Furthermore, a tail-biting mechanism is adopted in our convolutional code: the encoder's memory is initialized with the last data bits of the RS block being encoded. 10.
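The native rate-1/2 tail-biting encoder just described can be sketched in C as follows (puncturing omitted). This is an illustrative sketch, not the thesis code: the one-bit-per-byte I/O layout and the function names are our own, and we read each generator msb-first, so 171 octal taps the current bit and delays 1, 2, 3, and 6.

```c
#include <stdint.h>
#include <stddef.h>

#define G1 0171  /* X output: 1 + D + D^2 + D^3 + D^6 (msb = current bit) */
#define G2 0133  /* Y output: 1 + D^2 + D^3 + D^5 + D^6 */

static int parity7(unsigned v) {        /* parity of the low 7 bits */
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return (int)(v & 1);
}

/* Rate-1/2, K = 7, tail-biting: the 6-bit memory is preloaded with the
   last 6 data bits of the block. One input/output bit per byte. */
static void cc_encode(const uint8_t *in, size_t n, uint8_t *x, uint8_t *y) {
    unsigned sr = 0;                     /* bit 6 = newest bit */
    for (size_t i = n - 6; i < n; ++i)   /* tail-biting preload */
        sr = (sr >> 1) | ((unsigned)(in[i] & 1) << 6);
    for (size_t i = 0; i < n; ++i) {
        sr = (sr >> 1) | ((unsigned)(in[i] & 1) << 6);
        x[i] = (uint8_t)parity7(sr & G1);
        y[i] = (uint8_t)parity7(sr & G2);
    }
}
```

Encoding an impulse reproduces the generator taps on the X and Y streams; the puncturing step would then keep only the bits named by the serialization order (e.g., X1Y1Y2 for rate 2/3).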

(25) 2.2.2.3 Interleaver. All encoded data bits are interleaved by a block interleaver with a block size corresponding to the number of coded bits per specified allocation, Ncbps (see Table 2.3), to protect the convolutional code from the severe impact of burst errors and therefore increase the coding performance. The interleaver is defined by a two-step permutation. The first permutation ensures that adjacent coded bits are mapped onto nonadjacent carriers. The second permutation ensures that adjacent coded bits are mapped alternately onto less and more significant bits of the constellation, thus avoiding long runs of low-reliability bits. Table 2.3: Bit Interleaved Block Sizes and Modulo. Now let Ncpc be the number of coded bits per carrier, i.e., 2, 4, or 6 for QPSK, 16-QAM, or 64-QAM, respectively, and let s = Ncpc/2. Let k be the index of a coded bit before the first permutation at transmission, m the index after the first and before the second permutation, j the index after the second permutation, just prior to modulation mapping, and d the modulo used for the permutation. The first permutation is defined by the rule: m = (Ncbps/d) · (k mod d) + floor(k/d), k = 0, 1, …, Ncbps − 1. The second permutation is defined by the rule: j = s · floor(m/s) + (m + Ncbps − floor(d·m/Ncbps)) mod s, m = 0, 1, …, Ncbps − 1. 11.
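The two permutation rules above translate directly into C index mappings. A sketch under assumptions: the helper names are ours, the inverse mapping implements the de-interleaver rules of the standard, and d = 16 is used in the test purely as an example value (the actual modulo d comes from Table 2.3).

```c
#include <stddef.h>

/* Transmit interleaver: coded bit k goes to position j.
   ncbps = coded bits per allocation, s = Ncpc/2, d = modulo (Table 2.3). */
static size_t il_index(size_t k, size_t ncbps, size_t s, size_t d) {
    size_t m = (ncbps / d) * (k % d) + k / d;                  /* 1st permutation */
    return s * (m / s) + (m + ncbps - (d * m) / ncbps) % s;    /* 2nd permutation */
}

/* Receive de-interleaver: received bit j returns to position k. */
static size_t dil_index(size_t j, size_t ncbps, size_t s, size_t d) {
    size_t m = s * (j / s) + (j + (d * j) / ncbps) % s;        /* 1st permutation */
    return d * m - (ncbps - 1) * ((d * m) / ncbps);            /* 2nd permutation */
}
```

Since each de-interleaver permutation inverts the corresponding interleaver permutation, composing the two mappings must return every index to itself, which makes a convenient self-check.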

The de-interleaver, which performs the inverse operation, is also defined by two permutations. Let j be the index of the received bit before the first permutation, m be the index after the first and before the second permutation, and k be the index after the second permutation, just prior to delivering the coded bits to the convolutional decoder.

The first permutation is defined by the rule:

m = s * floor(j/s) + (j + floor(d*j/Ncbps)) mod s,   j = 0, 1, …, Ncbps − 1

The second permutation is defined by the rule:

k = d * m − (Ncbps − 1) * floor(d*m/Ncbps),   m = 0, 1, …, Ncbps − 1

The first permutation in the de-interleaver is the inverse of the second permutation in the interleaver, and conversely.

2.3 Implementation Issues of the FEC Scheme

A detailed explanation of the FEC encoding and decoding algorithms is given in this section, together with the block diagrams of our simulation programs. We also describe how we reduce the computational complexity on PCs.

2.3.1 Reed-Solomon Code

2.3.1.1 Encoding of Shortened and Punctured Reed-Solomon Codes

The Reed-Solomon code defined in the IEEE 802.16a standard is a modified RS code derived from the standard systematic (255, 239, 8) RS code, as mentioned in Section 2.2.2. In this section, we first give an example to illustrate how the encoding

process is done; then the block diagram of our RS encoder program is given.

The (48, 36, 6) RS code is chosen from Table 2.1 as an example to show the details of the encoding process. Before discussing the encoding process, note that the Galois field defined in the IEEE 802.16a standard is GF(2^8); that is, each element mentioned below (I238 … I0, R15 … R0) denotes a byte (8 bits). First, let the information data bytes that are input to the systematic (255, 239, 8) RS code be represented in polynomial form:

I(x) = I238*x^238 + I237*x^237 + … + I36*x^36 + I35*x^35 + … + I1*x + I0
     = (I238, I237, …, I36, I35, …, I1, I0).

Then the resulting systematic (255, 239, 8) RS codeword is given by

C(x) = I(x)*x^16 + R(x)
     = (I238, I237, …, I36, I35, …, I1, I0, R15, R14, …, R3, R2, R1, R0),

where the remainder polynomial R(x) is

R(x) = I(x)*x^16 mod g(x) = (R15, R14, …, R3, R2, R1, R0),

and the exponent 16 of x is derived from N − K = 16.

The encoding process shown above is for the standard (255, 239, 8) RS code. In order to meet the (48, 36, 6) code requirement, shortening and puncturing are needed; in other words, we have to modify the codeword further. Initially we set the first 239 − 36 = 203 input data bytes to zero and append the 36 information data bytes, so the input becomes:

I(x) = (0, 0, 0, …, 0, I35, I34, I33, …, I2, I1, I0),  with 203 zeros in the beginning.

Then the 239 data bytes are encoded by the standard (255, 239, 8) RS encoder, and afterwards we discard the last 4 bytes of the codeword. Finally we obtain the 48-byte codeword shown below:

C(x) = (I35, I34, I33, …, I2, I1, I0, R15, R14, …, R7, R6, R5, R4).

Similarly, the other types of shortened-and-punctured RS codes listed in Table 2.1 can be obtained by the same procedure, except for the (81, 72, 4) RS code, which is derived from the (80, 72, 4) shortened-and-punctured RS code by inserting a zero byte at the beginning of the codeword.

The block diagram of our RS encoder is shown in Fig. 2.7, where the shortening-and-puncturing block discards the first 203 zero bytes (shortening) and the last 4 bytes (puncturing) of the RS codeword. The details of the LFSR block are shown in Fig. 2.8; we employ the linear feedback shift register (LFSR) structure of the conventional RS encoder shown in Fig. 2.9 [3].

Figure 2.7: Block Diagram of the RS Encoder Program.

Figure 2.8: The Linear Feedback Shift Register Structure of RS Encoder.
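The LFSR of Fig. 2.8 can be sketched in C as follows (an illustrative sketch, not our DSP code; the GF(2^8) multiply uses a plain shift-and-XOR loop here, whereas Section 2.3.1.3 replaces it with table lookups):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GF(2^8) multiply, field polynomial p(x) = x^8+x^4+x^3+x^2+1 (0x11D). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
        b >>= 1;
    }
    return p;
}

uint8_t g[17];  /* coefficients of g(x) = (x+a^0)(x+a^1)...(x+a^15) */

void build_gpoly(void)
{
    uint8_t ai = 1;   /* alpha^i, alpha = 0x02 */
    int i, j;
    memset(g, 0, sizeof g);
    g[0] = 1;
    for (i = 0; i < 16; i++) {       /* multiply g(x) by (x + alpha^i) */
        for (j = 16; j > 0; j--)
            g[j] = g[j - 1] ^ gf_mul(g[j], ai);
        g[0] = gf_mul(g[0], ai);
        ai = gf_mul(ai, 0x02);
    }
}

/* Systematic (255,239) RS parity via the LFSR of Fig. 2.8: after feeding
   all data bytes, parity[15..0] holds R15..R0. */
void rs_parity(const uint8_t *data, int len, uint8_t *parity)
{
    int i, j;
    memset(parity, 0, 16);
    for (i = 0; i < len; i++) {
        uint8_t fb = data[i] ^ parity[15];   /* feedback byte */
        for (j = 15; j > 0; j--)
            parity[j] = parity[j - 1] ^ gf_mul(g[j], fb);
        parity[0] = gf_mul(g[0], fb);
    }
}
```

Shortening then reduces to prepending 203 zero bytes, which can in fact be skipped entirely, since a zero input byte with an all-zero register state produces zero feedback and leaves the LFSR unchanged; puncturing simply discards the trailing parity bytes.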

Figure 2.9: Block Diagram of a Conventional RS Encoder.

2.3.1.2 Decoding of Shortened and Punctured Reed-Solomon Codes

To understand how to decode a shortened-and-punctured RS code, we again take the (48, 36, 6) RS code as an example. First we acquire 48 data bytes from the receiver side, prepend 203 zero bytes, and append 4 zero bytes at the end; we then have a data block of 255 bytes. Afterwards we can employ a standard (255, 239, 8) RS decoder to decode it, with the last 4 zero bytes of the codeword marked as erasures.

A (48, 36, 6) RS decoder consists of the following main steps:

1. Syndrome computation: insert 203 zero bytes before the 48 received data bytes and insert zero bytes in the locations marked as erasures, then compute the syndromes

   S_k = Σ_{i=0}^{254} r_i α^{ik},  for 1 ≤ k ≤ 16,

   where r_i is the received data after zero insertion.

2. Erasure locator polynomial computation:

   Λ(x) = Π_{j=1}^{s} (1 − Z_j x) = Σ_{j=0}^{s} Λ_j x^j,

   where Z_j is the j-th erasure location and s is the number of erasures.

3. Find the error locator polynomial coefficients by solving

   | S1  S2  …  S8  |   | Λ8 |   | −S9  |
   | S2  S3  …  S9  |   | Λ7 |   | −S10 |
   |  …             | * | …  | = |  …   |      (1)
   | S8  S9  …  S15 |   | Λ1 |   | −S16 |

   Then find the error locations by finding the roots of Λ(x). (When performing erasure-and-error decoding, the syndromes shown in (1) shall be replaced by the Forney syndromes: T_k = Σ_{j=0}^{s} Λ_j S_{k+s−j}, for 1 ≤ k ≤ d − 1 − s.)

4. Find the error and erasure magnitudes by solving

   | X1    X2    …  Xv   |   | Y1 |   | S1 |
   | X1^2  X2^2  …  Xv^2 | * | Y2 | = | S2 |      (2)
   |  …                  |   | …  |   | …  |
   | X1^v  X2^v  …  Xv^v |   | Yv |   | Sv |

5. Let t denote the number of errors and s the number of erasures. If 2t + s > 2T (T = 6 in the case of the (48, 36, 6) RS code), the number of errors and erasures exceeds the amount that can be recovered by this RS code, and the received data bytes are left unchanged.

For solving (1) and (2), two well-known conventional algorithms exist: the Euclidean algorithm, which can be used to compute both (1) and (2), and the Berlekamp-Massey (BM) algorithm, which computes (1). In our case, we choose the BM algorithm to solve (1) and employ the Forney algorithm to solve (2). A flowchart of the BM algorithm for computing the error/erasure locator polynomial in the RS decoder is shown in Fig. 2.10 [3]. At the end of the iterations, we obtain the error/erasure locator polynomial as the polynomial Λ^(n−k)(x) produced by the final iteration.

Figure 2.10: Flowchart of the Berlekamp-Massey Algorithm.

In addition, the RS decoding procedure introduced previously can be further simplified based on an improved time-domain RS decoder proposed in [19]. The major difference between the new decoder and the previous one is that the new decoder does not require pre-computing the Forney syndromes or post-computing the errata locator polynomial; it simply initializes the BM algorithm with the erasure locator polynomial, and the errata locator polynomial is then obtained at the end of the BM iterations.
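The BM iteration itself is compact. The following C sketch (our own illustration of the errors-only case, without the erasure initialization of [19]) computes the error locator polynomial from the syndromes, using log/antilog tables for the field arithmetic as discussed in Section 2.3.1.3:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GF(2^8) log/antilog tables, field polynomial 0x11D. */
static uint8_t gf_exp[512], gf_log[256];

void gf_init(void)
{
    int i, x = 1;
    for (i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;      /* reduce modulo p(x) */
    }
    for (i = 255; i < 512; i++)          /* doubled table avoids a mod 255 */
        gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b)
{
    return (a && b) ? gf_exp[gf_log[a] + gf_log[b]] : 0;
}

uint8_t gf_inv(uint8_t a) { return gf_exp[255 - gf_log[a]]; }

/* Berlekamp-Massey: from syndromes S[0..n2t-1] (S[i] holds S_{i+1})
   compute the error locator polynomial lam[0..n2t]. */
void berlekamp_massey(const uint8_t *S, int n2t, uint8_t *lam)
{
    uint8_t C[33] = {0}, B[33] = {0}, T[33];
    uint8_t b = 1, d, coef;
    int L = 0, m = 1, i, j;
    C[0] = B[0] = 1;
    for (i = 0; i < n2t; i++) {
        d = S[i];                            /* discrepancy */
        for (j = 1; j <= L; j++) d ^= gf_mul(C[j], S[i - j]);
        if (d == 0) {
            m++;
        } else if (2 * L <= i) {             /* update with length change */
            memcpy(T, C, sizeof T);
            coef = gf_mul(d, gf_inv(b));
            for (j = 0; j + m <= n2t; j++) C[j + m] ^= gf_mul(coef, B[j]);
            L = i + 1 - L;
            memcpy(B, T, sizeof B);
            b = d;
            m = 1;
        } else {                             /* update, length unchanged */
            coef = gf_mul(d, gf_inv(b));
            for (j = 0; j + m <= n2t; j++) C[j + m] ^= gf_mul(coef, B[j]);
            m++;
        }
    }
    memcpy(lam, C, (size_t)(n2t + 1));
}
```

A Chien search then finds the roots Λ(X_j^{-1}) = 0, and the Forney algorithm yields the error magnitudes.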

The block diagram of our RS decoder is shown in Fig. 2.11. The syndrome computation is done by the circuit shown in Fig. 2.12, and the syndromes are then fed to the BM algorithm; the Chien search performed after the BM algorithm finds the roots of the error/erasure locator polynomial, and the Forney algorithm computes the magnitudes of the errors/erasures.

Figure 2.11: Block Diagram of the RS Decoder Program.

Figure 2.12: Syndrome Computation Circuit.

2.3.1.3 Galois Field Arithmetic

The major computational complexity of the RS code results from the Galois field arithmetic. In the case of GF(2^8), 8-bit by 8-bit operations are needed for field element addition, multiplication or

inversion. The addition operation only requires an 8-bit XOR, but multiplication and inversion are much more complicated than addition and hence consume a lot of computational time [3].

For the purpose of reducing the complexity of the Galois field arithmetic (especially the multiplication and inversion operations), several methods have been proposed. For instance, the Mastrovito algorithm [4], which introduces the concept of a "product matrix", avoids degree-8 polynomial multiplication/division (since the considered field is GF(2^8)); alternatively, the serial multiplier structure [5] commonly used in VLSI reduces the polynomial multiplication/division to 64 bit-level multiplications and 7 polynomial reduction operations. However, due to the special architecture of the DSP, the computational complexity of these two methods, which are mainly developed for VLSI architectures, is still high. In our case we finally employ the logarithmic table-lookup algorithm [6] to handle the Galois field arithmetic. Further explanation of the table-lookup method, its speed profile, and its comparison against the two methods above for DSP optimization is given in Chapter 4.

2.3.2 Convolutional Code

2.3.2.1 Encoding of Punctured Convolutional Code

The convolutional code encoding structure is shown in Fig. 2.6. It consists of one input bit, six memory elements (shift registers) and two output bits; each output bit is generated by first ANDing the generator polynomial coefficients with the contents of the memory elements and the input bit, and then XORing (modulo-2 adding) the bits produced by the AND operations. To reduce the computational complexity, we avoid performing the XOR operations directly but

employ a table-lookup method instead. That is, we build a table that contains all possible 7-bit (6 memory-element bits plus 1 input bit) XOR results and store it in memory. Since the XOR operation is used frequently during the encoding process, we can simply look up the XOR results in the table and avoid the computations, thus slightly speeding up the encoding process.

According to the puncturing rule shown in Table 2.2, a "1" means a transmitted bit and a "0" means a skipped bit; the X and Y in the table denote the two output bits shown in Fig. 2.6. Note that the dfree is reduced from that of the original rate-1/2 convolutional code, which is 10.

The operations stated above are represented by the block diagram shown in Fig. 2.13. The input and output buffers shown in this figure reduce the number of memory accesses in the DSP implementation. Since the convolutional encoder processes one input bit per time step, without input and output buffers we would have to access memory frequently during the encoding period, which decreases the processing rate on the TI DSP platform.

Figure 2.13: Block Diagram of the Convolutional Encoder Program.

2.3.2.2 Viterbi Decoding of Punctured Convolutional Code

The Viterbi algorithm is the best-known technique for convolutional decoding. Its operation can be explained easily using the trellis diagram, which is generated by considering the encoder with all possible inputs. As we know, the

convolutional encoder consists of the memory elements, one input bit and two output bits. The output bits are decided by suitable combinations (AND and XOR) of the past input bits. The changes of the values in the memory elements can be viewed as transitions from one state to another, so we can model the encoder as a finite state machine, which is useful in the analysis of the trellis diagram. An example of a finite state machine is shown in Fig. 2.14, where x(n-1) and x(n-2) denote the previous input and the input prior to the previous input, respectively. When a new input bit arrives, the state of the memory elements changes and the finite state machine generates the corresponding output bits.

Figure 2.14: State Transition Diagram Example.

The trellis diagram can be derived from the state transition diagram. First, the finite state machine output is determined by the given input and the current state. We then expand the finite state machine into a trellis diagram by introducing the concept of time: the trellis diagram contains all the features of the finite state machine and can be viewed as its expansion along the time axis. A simple trellis diagram is shown in Fig. 2.15 as an example, in which we can easily see all the state transitions for any possible input at every propagation time instance. In this trellis diagram, the upper

outgoing branch of each state corresponds to an input of 0, and the lower outgoing branch corresponds to an input of 1. Each state has two incoming and two outgoing branches. Each information sequence, uniquely encoded into an encoded sequence, corresponds to a unique path in the trellis. Equivalently, for a given path through the trellis, we can obtain the corresponding information sequence by reading off the input labels on all the branches that make up the path; this procedure is called "traceback". The Viterbi algorithm is used to find the optimal path in the trellis diagram, i.e., the one that results in the minimum number of errors. Then we perform the traceback procedure to retrieve the information sequence that was input to the encoder; the details are discussed below.

Figure 2.15: Trellis Diagram Example for a Viterbi Decoder.

The Viterbi algorithm computes the branch metric of each path at each stage of the trellis. The metric is first calculated and stored as a partial metric for each branch as the trellis is traversed. Since two paths merge at each node, the path with the smaller metric is retained while the other is discarded. This is based on the principle that the optimum path must contain the optimum survivor path, like the one shown in Fig. 2.16 [7]. The survivor path for a given state at time instance n is the sequence of symbols closest to the received sequence up to time n. For the punctured convolutional code, the metrics associated with the punctured bits are simply disregarded

in the metric calculation stage. The overall operation discussed above constitutes the computational core of the Viterbi algorithm and is the so-called Add-Compare-Select (ACS) operation.

Figure 2.16: Survivor Path of the Trellis Diagram.

In conclusion, the Viterbi algorithm can be divided into four major steps: the first step is branch metric calculation and state metric loading, the second is the ACS, the third is state metric storing and path recording, and the last is the traceback. The block diagram of our Viterbi decoder program is shown in Fig. 2.17, and the structure of the Viterbi algorithm is shown in Fig. 2.18. The "extend received sequence" block shown in Fig. 2.17 is included for decoding the punctured, tail-biting convolutional code and will be discussed later in this subsection.

Figure 2.17: Block Diagram of the Viterbi Decoder Program.
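The four steps can be seen end to end in the following toy example (our own illustration using a 4-state K = 3 code with generators 7/5 octal, not the standard's K = 7 code; puncturing and tail-biting are omitted for clarity):

```c
#include <assert.h>
#include <string.h>

enum { NSTATES = 4, MAXLEN = 64 };  /* 2^(K-1) states for K = 3 */

static int parity3(int v) { return (v ^ (v >> 1) ^ (v >> 2)) & 1; }

/* Toy rate-1/2, K = 3 encoder, generators 7 (111b) and 5 (101b). */
void toy_encode(const int *bits, int n, int *sym)  /* 2 symbols per bit */
{
    int state = 0, i;
    for (i = 0; i < n; i++) {
        int reg = (bits[i] << 2) | state;
        sym[2*i]     = parity3(reg & 7);
        sym[2*i + 1] = parity3(reg & 5);
        state = (reg >> 1) & 3;
    }
}

/* Soft-decision Viterbi: r holds 2n soft values (+1 <-> bit 0). */
void toy_viterbi(const double *r, int n, int *bits)
{
    double pm[NSTATES], npm[NSTATES];
    int ps[MAXLEN][NSTATES], pb[MAXLEN][NSTATES]; /* survivor history */
    int s, t, in, best;
    for (s = 0; s < NSTATES; s++) pm[s] = (s == 0) ? 0.0 : 1e9;
    for (t = 0; t < n; t++) {
        for (s = 0; s < NSTATES; s++) npm[s] = 1e30;
        for (s = 0; s < NSTATES; s++) {            /* previous state */
            for (in = 0; in < 2; in++) {
                int reg = (in << 2) | s, ns = (reg >> 1) & 3;
                double e0 = parity3(reg & 7) ? -1.0 : 1.0;
                double e1 = parity3(reg & 5) ? -1.0 : 1.0;
                double d0 = r[2*t] - e0, d1 = r[2*t + 1] - e1;
                double m = pm[s] + d0*d0 + d1*d1;  /* branch metric, add */
                if (m < npm[ns]) {                 /* compare-select      */
                    npm[ns] = m; ps[t][ns] = s; pb[t][ns] = in;
                }
            }
        }
        memcpy(pm, npm, sizeof pm);                /* state metric store  */
    }
    best = 0;                                      /* traceback           */
    for (s = 1; s < NSTATES; s++) if (pm[s] < pm[best]) best = s;
    for (t = n - 1; t >= 0; t--) { bits[t] = pb[t][best]; best = ps[t][best]; }
}
```

The standard's K = 7 decoder has exactly the same structure, with 64 states and the branch metrics replaced by the de-interleaved per-bit metrics described in Section 2.3.2.3.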

Figure 2.18: Structure of the Viterbi Algorithm.

Notice that we have named the Viterbi decoder in the block diagram an SDD Viterbi decoder, where SDD stands for Soft-Decision Decoding. In fact, there are two kinds of decision types used in Viterbi decoding: hard decision and soft decision. If hard decision is adopted, the metric used for calculating the branch and state metrics is the Hamming distance, which simply counts the bit errors between each trellis path and the hard-limited output of the demodulator. For soft decision, the metric is the Euclidean distance between each trellis path and the soft output of the demodulator. The major differences in performance between the two decision types are the coding gain and the computational speed. For hard decision, the calculation of the Hamming distance is a simple XOR operation; the soft-decision metric calculation, on the other hand, requires floating-point arithmetic. The hard-decision Viterbi decoder is thus much faster than the soft-decision one, but it loses 2 to 3 dB of coding gain compared to soft-decision decoding and cannot satisfy the requirements of the IEEE 802.16a standard [8]. Hence, soft-decision decoding is adopted to implement our Viterbi decoder.

2.3.2.3 Bit-Interleaved Soft-Decision Viterbi Decoding

In the specific FEC scheme defined by IEEE 802.16a, there is a block interleaver between the convolutional code and the modulator. Therefore, the optimal SDD should take into account the joint trellis structure consisting of the convolutional code, the block interleaver and the modulator; consequently, it leads to a solution too complicated to be realized in practice. To be more practical, we consider a suboptimal solution based on a

bit-by-bit metric mapping and calculation concept, which is proposed in [9]. To begin with, our main problem can be generalized to how to obtain the metric values used in the SDD Viterbi decoder while accounting for the de-interleaving process. We are not going to discuss or prove the detailed algorithm, which is already well defined in [9], but only show the procedure for acquiring the metric values.

According to the suboptimal solution, we first calculate the Euclidean distance between the received symbol and its nearest reference modulated symbol with respect to a decided bit "0" and "1". Let us take 16-QAM modulation as an example. Referring to Fig. 2.19, if a received symbol lies at the coordinates (2.5, 2.7) (represented by a square point in the figure), then the branch metric of its first bit with respect to a decided bit "0" is the Euclidean distance between the received symbol and the rightmost reference symbol, whose in-phase coordinate is 3; the result is |3 − 2.5|^2 = 0.25. The branch metric with respect to a decided bit "1" is |−1 − 2.5|^2 = 12.25. The branch metrics of the second, third, and fourth bits of this received symbol are calculated in a similar way. Consequently, we have four pairs of branch metrics for each received symbol. Before sending them to the SDD Viterbi decoder, these pairs of branch metrics must be mapped to the corresponding bit positions, since the original convolutionally encoded sequence has been interleaved. To be consistent with the newly defined branch metrics, our SDD Viterbi decoder is modified to treat these de-interleaved (or, alternatively, "demapped") branch metrics as the input data sequence instead of the soft-demodulated symbols. Except for the branch metric calculation step, all the other parts of a conventional SDD Viterbi decoder remain the same.
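The per-bit metric computation reduces to taking, for each decided bit value, the minimum squared distance to the reference levels carrying that value on the relevant axis. A sketch (our own illustration; it assumes a labeling in which the first in-phase bit value 0 selects the positive levels {+1, +3} and value 1 the negative levels {−1, −3} — the exact labeling is that of Fig. 2.19):

```c
#include <assert.h>

/* Minimum squared Euclidean distance from received coordinate r to the
   reference levels lv[0..n-1]. */
double min_sqdist(double r, const double *lv, int n)
{
    double best = (r - lv[0]) * (r - lv[0]);
    int i;
    for (i = 1; i < n; i++) {
        double d = (r - lv[i]) * (r - lv[i]);
        if (d < best) best = d;
    }
    return best;
}

/* Branch-metric pair (decided bit "0" / decided bit "1") for one coded
   bit: lv0/lv1 list the axis levels whose label carries that bit value. */
void bit_metric_pair(double r, const double *lv0, int n0,
                     const double *lv1, int n1,
                     double *m0, double *m1)
{
    *m0 = min_sqdist(r, lv0, n0);
    *m1 = min_sqdist(r, lv1, n1);
}
```

With the assumed labeling, the in-phase coordinate 2.5 of the example above yields the metric pair (0.25, 12.25) for the first bit, matching the numbers computed in the text.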

Figure 2.19: Partition of the 16-QAM Constellation.

2.3.2.4 Viterbi Decoding of Tail-Biting Convolutional Code

According to [8] and [10], a practical suboptimal tail-biting Viterbi decoder is shown in Fig. 2.20, where the "SDD Viterbi Decoder" block denotes the Viterbi decoder with the puncturing mechanism and bit-interleaved SDD. The two extension-length parameters are both chosen to be 24 to balance the computational complexity against the error-correction performance, based on the analysis done in [8].

Figure 2.20: Block Diagram of the Suboptimal Tail-Biting Viterbi Decoder.

2.3.2.5 The Butterfly Structure in the Trellis Diagram

In order to reduce the computational complexity of the ACS part, we bring in the concept of the butterfly structure of the trellis diagram. The symmetry in the trellis diagram, which forms the butterfly structure, can be used to reduce the number of branch metric calculations. Fig. 2.21 shows the butterfly structure associated with the Viterbi decoder, pairing new states 2i and 2i+1 with previous states i and i+s/2, where s is the number of possible states; in our case of constraint length K = 7, s equals 2^6 = 64. Even though there are four incoming branches, there are only two different branch costs. The path metric for each new state is calculated as each incoming branch cost plus the previous path cost associated with that branch, and the minimum of the two incoming path metrics is selected as the survivor. The butterfly computations consist of two Add-Compare-Select (ACS) operations and updating the survivor path history. The two ACS operations are:

Sn(2i) = min { Sn-1(i) + b , Sn-1(i + s/2) + a }, and
Sn(2i+1) = min { Sn-1(i) + a , Sn-1(i + s/2) + b }.

After completing N stages of decoding, one of the M survivor paths is selected for traceback. Obviously, the number of branch metric calculations is reduced greatly by introducing the butterfly structure.

Figure 2.21: Butterfly Structure Showing Branch Cost Symmetry.
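The two ACS equations translate directly into code. A sketch of one butterfly update (our own illustration; 64 states, metrics as doubles, `a` and `b` being the two distinct branch costs of Fig. 2.21, survivor-path recording omitted):

```c
#include <assert.h>

#define NSTATES 64   /* 2^(K-1) with K = 7 */

/* One butterfly: previous states i and i + NSTATES/2 feed new states
   2i and 2i+1, with only the two distinct branch costs a and b. */
void butterfly(const double *pm, double *npm, int i, double a, double b)
{
    double p0 = pm[i], p1 = pm[i + NSTATES / 2];
    npm[2*i]     = (p0 + b < p1 + a) ? p0 + b : p1 + a;  /* Sn(2i)   */
    npm[2*i + 1] = (p0 + a < p1 + b) ? p0 + a : p1 + b;  /* Sn(2i+1) */
}
```

One pass over i = 0 … 31, each with its own branch-cost pair, updates all 64 state metrics; the choice made in each comparison is what would be recorded in the survivor path history.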

Chapter 3 DSP Implementation Environment

In our IEEE 802.16a OFDMA project, to ease linking up all the subprojects, we chose a digital signal processor (DSP) platform to implement the whole system. The DSP baseboard we use is Innovative Integration's (II's) product released in 2003 named Quixote, which houses the Texas Instruments TMS320C6416 DSP chip. In this chapter, in addition to an introduction to the DSP chip and the DSP baseboard, the data communication process between the host PC and the target DSP is also described.

3.1 The TI DSP Chip

The DSP chip we adopt belongs to the TMS320C64x series. According to [11], the TMS320C64x series is a member of the TMS320C6000 (C6x) family. A C6000 device is capable of executing up to eight 32-bit instructions per cycle, and its core CPU consists of 64 general-purpose 32-bit registers (for the C64x only) and eight functional units. The detailed features of the C6000 family devices include:

- Advanced VLIW CPU with eight functional units, including two multipliers and six arithmetic units.
- Instruction packing (reduced code size).
- Conditional execution of all instructions.
- Efficient code execution on independent functional units.

- 8/16/32-bit data support, providing efficient memory support for a variety of applications.
- 40-bit arithmetic options, adding extra precision for computationally intensive applications.
- Saturation and normalization support for key arithmetic operations.
- Field manipulation and instruction extract, set, clear, and bit-counting support for common operations found in control and data manipulation applications.

The block diagram of the C6000 family is shown in Fig. 3.1. The C6000 devices come with program memory, which, on some devices, can be used as a program cache. The devices also have varying sizes of data memory. Peripherals such as a direct memory access (DMA) controller, power-down logic, and an external memory interface (EMIF) usually come with the CPU, while peripherals such as serial ports and host ports are available only on certain models. In the following subsections, the TMS320C64x DSP chip is introduced in three major parts: central processing unit (CPU), memory, and peripherals.

Figure 3.1: The Block Diagram of TMS320C6x DSP Chip.

Figure 3.2: The TMS320C64x DSP Chip Architecture and Comparison with the Earlier TMS320C62x/C67x Chips.

3.1.1 Central Processing Unit

Besides the eight independent functional units and sixty-four general-purpose registers mentioned before, the C64x CPU also contains the program fetch unit, the instruction dispatch unit (with advanced instruction packing), the instruction decode unit, two data paths (A and B, each with four functional units), the test unit, the emulation unit, interrupt logic, several control registers, and two register files (A and B, corresponding to the two data paths). The architecture is illustrated in more detail in Fig. 3.2 [12]. Compared with the other C6000 family DSP chips, the C64x provides more hardware resources. The additional features available only on the C64x are:

- Each multiplier can perform two 16 x 16-bit or four 8 x 8-bit multiplies every clock cycle.
- Quad 8-bit and dual 16-bit instruction set extensions with data flow support.
- Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.
- Special communication-specific instructions to address common operations in error-correcting codes.
- Bit count and rotate hardware that extends support for bit-level algorithms.

The program fetch unit shown in the figure can fetch eight 32-bit instructions (which implies a 256-bit wide program data bus) every cycle, and the instruction dispatch and decode units can decode and assign the eight instructions to the eight functional units. The eight functional units in the C64x architecture are divided into two data paths, A and B, as shown in Fig. 3.2. Each path has one unit for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation and arithmetic operations (.D). The .S and .L units are for arithmetic,

logical, and branch instructions. All data transfers make use of the .D units. Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side; there can be at most two cross-path source reads per cycle. Each register file contains 32 general-purpose registers, but some of them are reserved for specific addressing or used for conditional instructions. Most of the buses in the CPU support 32-bit operands, and some support 40-bit operands. Each functional unit has its own 32-bit write port into a general-purpose register file: all functional units ending in 1 (for example, .L1) write to register file A, while all functional units ending in 2 (for example, .L2) write to register file B. There is an extra 8-bit wide port for 40-bit writes, as well as an extra 8-bit wide input port for 40-bit reads, in four specific units (.L1, .L2, .S1 and .S2). Since each unit has its own 32-bit write port, all eight functional units can operate in parallel in every cycle.

Program pipelining is also an important technique for making instructions execute in parallel and hence reducing the overall execution cycles. In order to make pipelining work properly, we should have knowledge of the pipeline stages and instruction execution phases. Since program pipelining is highly related to the optimization of the DSP program, we defer its discussion to the next chapter.

3.1.2 Memory

Internal Memory

The C64x DSP chip has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices into a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory and a single internal port, with an instruction-fetch width of 256 bits, to access internal program memory.

Memory Options

Besides the internal memory, the C64x DSP chip also provides a variety of memory options:

- Large on-chip RAM, up to 7 Mbits.
- Program cache.
- Two-level caches.
- A 32-bit external memory interface supporting SDRAM, SBSRAM, SRAM, and other asynchronous memories, for a broad range of external memory requirements and maximum system performance.

3.1.3 Peripherals

In addition to the on-chip memory, the TMS320C64x DSP chips also contain peripherals for working with off-chip memory, co-processors, host processors, and serial devices. The peripherals include the direct memory access (DMA) controller, the host-port interface (HPI), the EMIF, timers and some other units.

The DMA controller transfers data between regions in the memory map without intervention by the CPU. It can move data from internal memory to external memory, or from internal peripherals to external devices, and is used for communication with other devices. The host-port interface (HPI) is a 16-bit wide parallel port through which a host processor can directly access the CPU's memory space; it is used for communication between the host PC and the target DSP. The C64x has two 32-bit general-purpose timers that are used to time events, count events, generate pulses, interrupt the CPU and send synchronization events to the DMA controller. A timer has two signaling modes and can be clocked by an internal or an external source.

3.2 The DSP Baseboard

The Quixote DSP baseboard is shown in Fig. 3.3 and its architecture is shown in Fig. 3.4 [15]. Quixote integrates a 600 MHz 32-bit fixed-point TMS320C6416 DSP chip and a Xilinx two- or six-million-gate Virtex-II FPGA on a single board, providing processing flexibility, efficiency and high performance. Quixote has 32 MBytes of SDRAM for use by the DSP and 4 or 8 MBytes of zero-bus-turnaround (ZBT) SBSRAM for use by the FPGA. Developers can build complicated signal processing systems by integrating reusable logic designs with their specific application logic.

Figure 3.3: Innovative Integration's Quixote DSP Baseboard Card.

Figure 3.4: The Architecture of Quixote Baseboard.

3.3 Data Communication Mechanism

Many applications of the Quixote baseboards involve communication with the host CPU in some manner. At a minimum, all applications must be reset and downloaded from the host, even if they are isolated from the host after that. The simplest communication method supported is a mapping of standard C++ I/O to the Uniterminal applet, which allows console-type I/O on the host. This allows simple data input and control and sending text strings to the user. The next level of support is provided by the packetized message interface, which allows more complicated medium-rate transfer of commands and information between the host and the target; it requires more software support on the host than the standard I/O. For full-rate data transfers, Quixote supports

the creation of data streaming to the host, for the maximum ability to move data between the target and the host.

On Quixote baseboards, a second type of busmaster communication between target and host is available: the CPU busmaster interface. The primary CPU busmaster interface is based on a streaming model, where logically the data is an infinite stream between the source and the destination. This model is more efficient because the signaling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. In addition, the busmaster streaming interface is fully handshaked, so that no data loss can occur in the process of streaming. For example, if the application cannot process blocks fast enough, the buffers will fill, then the busmaster region will fill, and then busmastering will stop until the application resumes processing. When the busmaster stops, the DSP will no longer be able to add data to the PCI interface FIFO.

However, in the FEC encoder and decoder application, the data sequence is first divided into RS blocks on which the encoding and decoding procedures are performed; hence continuous streaming may not be suitable for the FEC application. Alternatively, a data flow paradigm called block-mode streaming is supported for non-continuous data sequences. For very high-rate applications, any processing done to each point may reduce the maximum achievable data rate; since block mode does no implicit processing on a point-by-point basis, the fastest data rates are achievable using this mode.

The DSP streaming interface is bi-directional. Two streams can run simultaneously: one runs from the analog peripherals through the DSP into the application, and is called the "Incoming Stream"; the other runs out to the analog peripherals, and is the "Outgoing Stream".
In both cases, the DSP needs to act as a mediator, since the host has no direct access to the analog peripherals. The block diagram of the DSP streaming mode is shown in Fig. 3.5 [15].

Figure 3.5: Block Diagram of DSP Streaming Mode.

DSP streaming is initiated and started by the host, using the Caliente component. On the target, the DSP interface uses a pair of DSP/BIOS device drivers, PciIn (on the Outgoing Stream) and PciOut (on the Incoming Stream), provided in the Pismo peripheral libraries for the DSP. They use burst mode and are capable of copying blocks of data between the target SDRAM and host bus-master memory via the PCI interface at instantaneous rates up to 264 MBytes/sec. In addition to the busmaster streaming interface, the DSP and the host also have a lower-bandwidth communication link, the packetized message interface, for sending commands or side information between the host PC and the target DSP.

Chapter 4
Implementation and Optimization of the 802.16a FEC Scheme on the DSP Platform

As mentioned in the last chapter, we adopt the Texas Instruments (TI) digital signal processor (DSP) for implementing the Forward Error Correction (FEC) scheme in the IEEE 802.16a wireless communication standard. In this chapter, we discuss the main themes of this thesis: the implementation and optimization of the specified FEC scheme on the newly released II's Quixote DSP baseboard, which houses a TI TMS320C6416 DSP chip. First, we briefly introduce the overall system structure of our FEC implementation and its communication mechanism. Second, we introduce some special features of the TI C6000 family DSP that are helpful for compiler-level optimization. Then, we propose some simple yet practically useful techniques for improving the computational speed of the Reed-Solomon (RS) code and the convolutional (CC) code (mainly the decoder part) on the TI C64x family DSP. Finally, we present the improvement achieved by our RS code and CC code optimizations, using the simulation profiles generated by the TI Code Composer Studio (CCS) built-in profiler.
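One classic technique of the kind referred to above for speeding up RS arithmetic is to replace GF(2^8) multiplications with log/antilog table lookups. The sketch below uses the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which generates the GF(256) field used by the RS(255,239) code in the standard; the table layout is a common one and not necessarily the exact one used in our implementation:

```c
/* Log/antilog tables for GF(2^8) with primitive polynomial 0x11D.
   gf_exp is doubled in length so gf_mul needs no modulo-255 reduction. */
static unsigned char gf_exp[512];
static unsigned char gf_log[256];

void gf_init(void)
{
    int x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (unsigned char)x;
        gf_log[x] = (unsigned char)i;
        x <<= 1;                    /* multiply by the primitive element */
        if (x & 0x100)
            x ^= 0x11D;             /* reduce modulo the field polynomial */
    }
    for (int i = 255; i < 512; i++) /* duplicate to avoid "mod 255" */
        gf_exp[i] = gf_exp[i - 255];
}

/* One multiply becomes two loads, one add, and one load: much cheaper
   than a bitwise shift-and-XOR multiply on the C64x. */
unsigned char gf_mul(unsigned char a, unsigned char b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}
```

On the C64x, the tables fit comfortably in on-chip memory, so the lookups are single-cycle loads once the loop is pipelined.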

4.1 System Structure of the FEC Implementation

As defined in the IEEE 802.16a standard, the FEC scheme, which consists of an FEC encoder and an FEC decoder, is located between the source encoder/decoder and the channel modulator/demodulator. The FEC scheme requires massive computation in the decoding procedure. Thus, to achieve the real-time processing goal, it is necessary to assign a single DSP board to the FEC alone. In consequence, the FEC scheme, the source coding scheme, and the channel modulation scheme each use an individual DSP board, linked through the PCI bus of the personal computer (PC). One thing to be noted here is the communication mechanism between the DSP board and the host PC. It is supposed to be the data streaming mode described in Chapter 3, but owing to a malfunction of the newly released II's Quixote DSP baseboard, the streaming mode on the DSP board does not work yet. As a substitute, we use the standard I/O (fread and fwrite) to implement the DSP file I/O mechanism on the TMS320C6416 simulator. The drawbacks of using the standard I/O are that it cannot handle too much input data (or the processor may crash during the I/O) and that it takes extra cycles to perform the file I/O. The system structure is shown in Fig. 4.1 and Fig. 4.2 for the transmitter side and the receiver side, respectively.

At the transmitter side, the source-coded data sequence is first multiplexed by an audio/video multiplexer and then transmitted to the randomizer of the FEC encoding scheme through the PCI interface of the host PC. Afterward, the sequence is processed by the randomizer, the RS encoder, the CC encoder, and the block interleaver, and the interleaved coded sequence is transmitted to the channel modulator through the PCI interface. At the receiver side, the procedure is the reverse of that at the transmitter side.
First, the demodulator transmits the soft-decision demodulated metric sequence to the FEC decoding scheme (again through the PCI interface). After FEC decoding, the decoded sequence is passed to the source decoder through the PCI interface, and the source decoding operation is then performed.
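The transmit-side chain can be sketched as follows. The randomizer uses the PRBS generator 1 + x^14 + x^15 defined in the standard (the seed and the bit ordering shown are illustrative, since the standard fixes the initialization per burst), while the RS encoder, CC encoder, and interleaver are left as identity stubs to be filled in with the implementations of Chapter 2:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Randomizer: XOR the data with a PRBS from generator 1 + x^14 + x^15.
   Seed and bit ordering here are placeholders, not the standard's exact
   per-burst initialization. */
void fec_randomize(uint8_t *buf, size_t len, uint16_t seed)
{
    uint16_t lfsr = seed & 0x7FFF;              /* 15-bit shift register */
    for (size_t i = 0; i < len; i++) {
        uint8_t prbs = 0;
        for (int b = 0; b < 8; b++) {
            uint16_t bit = ((lfsr >> 14) ^ (lfsr >> 13)) & 1;
            lfsr = (uint16_t)((lfsr << 1) | bit) & 0x7FFF;
            prbs = (uint8_t)((prbs << 1) | bit);
        }
        buf[i] ^= prbs;
    }
}

/* Identity stubs standing in for the real coding stages. */
size_t rs_encode(const uint8_t *in, size_t n, uint8_t *out)  { memcpy(out, in, n); return n; }
size_t cc_encode(const uint8_t *in, size_t n, uint8_t *out)  { memcpy(out, in, n); return n; }
size_t interleave(const uint8_t *in, size_t n, uint8_t *out) { memcpy(out, in, n); return n; }

/* Transmit-side chain: randomizer -> RS -> CC -> interleaver.
   Assumes len <= 4096 for the scratch buffers of this sketch. */
size_t fec_encode_burst(const uint8_t *in, size_t len,
                        uint8_t *out, uint16_t seed)
{
    uint8_t a[4096], b[4096];
    memcpy(a, in, len);
    fec_randomize(a, len, seed);
    size_t n = rs_encode(a, len, b);
    n = cc_encode(b, n, a);
    n = interleave(a, n, out);
    return n;
}
```

Because the randomizer is a pure XOR with a seed-determined sequence, applying it twice with the same seed restores the data, which is a convenient self-check for the stage.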

Figure 4.1: System Structure of Transmitter Side.

Figure 4.2: System Structure of Receiver Side.

4.2 Compiler Level Optimization Techniques

In this section, the TI C6000 family pipeline structure is first introduced, to show how the processor arranges the pipeline stages and which instructions are more time-consuming and should be avoided if possible. Second, the code development flow is presented, to show how to develop a DSP program efficiently and systematically. Third, an important technique used by the TI CCS compiler to improve program speed, so-called "software pipelining", is introduced, and a simple example is given to

explain how we can improve program efficiency by using the software pipelining technique.

4.2.1 Pipeline Structure of the TI C6000 Family

A few features of the TI C6000 family's pipeline structure provide the advantages of good performance, low cost, and simple programming. The following are several useful features [11]: increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operations; pipeline control is simplified by eliminating pipeline locks; the pipeline can dispatch eight parallel instructions in every cycle; and parallel instructions proceed simultaneously through the same pipeline phase.

The pipeline structure of the C6000 family consists of three basic stages: the Fetch stage (F), the Decode stage (D), and the Execution stage (E). At the F stage, the CPU first generates an address, fetches the opcode of the specified instruction from memory, and then passes it to the program decoder. At the D stage, the program decoder routes the opcode to the specific functional unit determined by the type of instruction (LDW, ADD, SHR, MPY, etc.). Once the instruction reaches the E stage, it is executed by its specified functional unit. Most instructions of the C6000 family fall in the instruction-single-cycle (ISC) category, such as ADD, SHR, AND, OR, and XOR. However, the results of a few instructions are delayed. For example, the multiply instruction MPY (and its varieties) has a delay of one cycle: the execution result is not available until one cycle later (i.e., not available to the next instruction). The results of a load instruction, LDW (and its varieties), are delayed for 4 cycles. Branch instructions

reach their target destination 5 cycles later. Store instructions are viewed as ISC from the CPU's perspective, because no execution phase is required for a store, but a store actually finishes 2 cycles later. Since the maximum delay among all instructions is 5 cycles (6 execution cycles in total), it is natural to split the execution stage (E) into six phases, as shown in Table 4.1.

Table 4.1: Completing Phase of Different Type Instructions (execution phases E1 through E6 versus instruction category).

4.2.2 Code Development Flow

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand-written DSP assembly language. This is both time-consuming and error-prone. The recommended code development flow instead utilizes the C6000 code generation tools to help in optimization, rather than forcing the programmer to code by hand in assembly; the compiler then does all the laborious work of instruction selection, parallelizing, pipelining, and register allocation. Fig. 4.3 illustrates the three phases in the code development flow [13]. Because phase 3 is rather detailed and time-consuming, most of the time we will not go into phase 3 to
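The kind of loop the C6000 compiler software-pipelines well can be sketched in plain C. In this hedged example (function and variable names are illustrative), "restrict" tells the compiler the arrays do not alias, which removes false dependences between the loads and the accumulator and lets the pipelined kernel overlap iterations so that the 4-cycle load and 1-cycle multiply delays are hidden:

```c
#include <stdint.h>

/* A software-pipelining-friendly inner loop. On the TI compiler one
   would additionally supply trip-count information, e.g.
   "#pragma MUST_ITERATE(8, , 8)" just before the loop, so it can be
   unrolled and pipelined without a scalar cleanup copy; the pragma is
   left as a comment so this sketch also compiles with a generic
   C compiler. */
int32_t dot_product(const int16_t * restrict a,
                    const int16_t * restrict b, int n)
{
    int32_t sum = 0;
    /* #pragma MUST_ITERATE(8, , 8)  -- TI-specific hint (assumption) */
    for (int i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}
```

Without the restrict qualifiers, the compiler must assume a store through one pointer could change data read through the other and schedules the loop conservatively; with them, loads from successive iterations can be issued while earlier multiplies and adds are still completing.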
