Research in Efficient Wireless Channel Simulation on Multi-DSP Platform


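National Chiao Tung University, College of Electrical Engineering and Computer Science, Degree Program of Electronics and Electro-Optical Engineering. Master's Thesis. Research in Efficient Wireless Channel Simulation on Multi-DSP Platform. Student: Chien-Hsing Lee (李建興). Advisor: Dr. David W. Lin (林大衛). July 2004 (Republic of China year 93).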

Research in Efficient Wireless Channel Simulation on Multi-DSP Platform

Student: Chien-Hsing Lee (李建興)
Advisor: Dr. David W. Lin (林大衛)

A Thesis Submitted to the Degree Program of Electrical Engineering and Computer Science, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics and Electro-Optical Engineering.

July 2004, Hsinchu, Taiwan, Republic of China.

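Research in Efficient Wireless Channel Simulation on Multi-DSP Platform. Student: Chien-Hsing Lee. Advisor: Dr. David W. Lin. Master's Program, Degree Program of Electronics and Electro-Optical Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University.

Abstract (in Chinese)

A digital signal processor (DSP) is a useful tool that can be programmed to perform different functions. We want to improve a wireless channel simulator that has already been implemented on DSPs. Its platform uses a desktop PC as the control center, with a plug-in Quatro6x DSP board from Innovative Integration; the board carries four TMS320C6x DSPs from Texas Instruments. The simulation system mainly uses three of them: one as a modulator following the 3GPP WCDMA uplink specification, one as a simulator of several kinds of channels, and one as a receive filter. However, three problems remained to be clarified and solved. First, the three DSP programs have to be run in a particular order on this platform and are difficult to start simultaneously. Second, each signal processing component can reach real-time speed on its own, but the system slows down once the components are connected into a multiprocessor system. Third, because only a floating-point library of elementary mathematical functions was available, the channel coefficients were not yet generated entirely in fixed point. In this thesis, we address these three problems. First, we migrate the platform to a new platform with a specific combination of tool versions, which solves the startup problem. Then, we propose a pipelining architecture that applies the double buffering technique, which improves the speed. In addition, we use the CORDIC algorithm to compute four elementary mathematical functions in fixed-point arithmetic, which avoids the floating-point problem. In short, we find some methods for running an efficient wireless channel simulation smoothly on a multi-DSP platform; a detailed study of WCDMA, however, is outside the scope of this thesis.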

Research in Efficient Wireless Channel Simulation on Multi-DSP Platform. Student: Chien-Hsing Lee. Advisor: Dr. David W. Lin. Degree Program of Electronics and Electro-Optical Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University.

Abstract

The DSP is a useful tool that can be programmed to achieve different functionalities. We want to improve a previously developed DSP-based wireless channel simulator. The existing simulator includes a 3GPP WCDMA modulator, several kinds of channel simulators, and a matched filter. A desktop PC acts as the controller, and an Innovative Integration Quatro6x DSP-embedded card is employed in the system. Four Texas Instruments TMS320C6x DSP chips are placed on the board. Three chips are used to implement the system: one for the modulator, one for the channel, and one for the matched filter. However, three problems remained to be addressed. First, synchronized execution of the three programs was not smooth on the platform. Second, real-time performance degraded on the connected multiprocessor DSP system. Third, fixed-point generation of channel coefficients was not yet available for lack of a fixed-point library. These three topics are presented in this thesis. To solve the first problem, we migrate to a specified new platform. To solve the second, we propose a pipelining structure that applies a double buffering scheme. To solve the third, we apply the CORDIC algorithm to evaluate elementary functions in fixed point. In short, we seek out several methods to run an efficient wireless channel simulator smoothly on a multiprocessor platform; a close study of WCDMA is not our concern.

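Acknowledgements

For the completion of this thesis, I first thank my advisor, Dr. David W. Lin, for his patient guidance throughout this period. Whenever my research and study ran into a bottleneck, he always took the trouble to point the way, and I offer him my most sincere gratitude and respect. I also thank the members of the oral examination committee for taking time out of their busy schedules and for their valuable comments and suggestions during the defense.

I thank the many colleagues whose tolerance and help allowed me to balance study and work. The ample resources of the laboratory freed my research from worry. I thank my seniors, classmates, and juniors; our discussions and mutual encouragement benefited me greatly. I also thank my friends and coworkers for their timely care, encouragement, and helping hands. In addition, I thank Innovative Integration, Texas Instruments, and 盛暘科技 for their timely technical support and answers to my questions.

Of course, I offer my deepest thanks to my parents and family. Your quiet support and encouragement have been the sunshine and soil in which I grew and the force that kept me moving forward, so that I could persevere to the end and fulfill my wish.

Along the way there have been more people to thank than there are stars in the sky. I thank everyone who walked with me and cared about me, and I thank Heaven for its blessing. I share the joy of this thesis with all of you.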

Contents

1 Introduction

2 Overview of An Existing DSP-Based Wireless Channel Simulator
  2.1 Introduction to the Existing Simulator [32]
  2.2 WCDMA Uplink Transmission Scheme [32]
    2.2.1 Spreading Modulator
    2.2.2 Pulse Shaping Filter
    2.2.3 Channel Model
    2.2.4 Fixed-Point Simulation
  2.3 DSP Implementation of the Existing Simulator [32]
    2.3.1 Transmission Mechanism
    2.3.2 Memory Arrangement
    2.3.3 Optimization and Profile
    2.3.4 Complexity and Performance
  2.4 Problems with the Existing Simulator
    2.4.1 Synchronizing Execution Problem
    2.4.2 Real-Time Multiprocessing Problem
    2.4.3 Fixed-Point Generation Problem

3 The Quatro6x Multiprocessor Platform
  3.1 Overview
  3.2 Introduction to the Multi-DSP Board
    3.2.1 TMS320C6x DSP Chip
    3.2.2 Quatro6x DSP Board
  3.3 Introduction to the Development Tools
    3.3.1 Code Composer Studio
    3.3.2 Zuma Software Toolset
  3.4 Interprocessor Interaction
    3.4.1 FIFOLink Functions
    3.4.2 DMA Transfer Functions
    3.4.3 Transmission Mechanism [32]
  3.5 Coding Style
    3.5.1 Unlucky Style
    3.5.2 Lucky Style
  3.6 Tools Compatibility
    3.6.1 Two Issues in MPO and CCS
    3.6.2 Migration to the New Version

4 Efficient Multiprocessing on Quatro6x
  4.1 Overview
  4.2 Structure of DSP Partition
    4.2.1 Parallel Structure
    4.2.2 Sequential Structure
    4.2.3 Pipelining Structure
  4.3 Actual Speed Observation
    4.3.1 Observation on Single DSP Chip
    4.3.2 Observation of Overall Connection Speed
    4.3.3 Observation by Changing Segments of Program
    4.3.4 Observation by Removing Data Transfer Commands
    4.3.5 Observation on the Sequential-10240 Structure
    4.3.6 Observation on the Pipelining-512 Structure
  4.4 Zoom in DMA Library
  4.5 Double Buffering Technique
    4.5.1 Data Transfer Flow
    4.5.2 Memory Constraints
    4.5.3 DMA Channels
  4.6 Summary

5 Fixed-Point Arithmetic on TMS320C62
  5.1 Overview
  5.2 Methods of Channel Model Generation
    5.2.1 Fading Channel Coefficients
    5.2.2 Gaussian Random Number
  5.3 Introduction to CORDIC and CCM
    5.3.1 The CORDIC Method
    5.3.2 Convergence Computation Method
  5.4 Function Evaluation in Integer Arithmetic
    5.4.1 Cosine and Sine by CORDIC
    5.4.2 Square Root and Logarithm by CORDIC
    5.4.3 Natural Logarithm by CCM
  5.5 Evaluation Results and Discussion
    5.5.1 DSP Performance of Function Evaluation
    5.5.2 Discussion on Variations and Optimization

6 Conclusion

A Things to Note in DSP Implementation
  A.1 Emails Concerning Two Problems in Using MPO and CCS
  A.2 Emails Concerning Four DMA Channels and One DMA Bus

List of Figures

1.1 Block diagram of WCDMA system for one-way transmission (from [32]).
2.1 Block diagram of the implemented system (from [32]).
2.2 Frame structure of uplink DPDCH/DPCCH (from [2]).
2.3 Configuration of long scrambling sequence generator (from [3]).
2.4 Frequency response of 33-tap RRC filter (4 times oversampled) vs. the emission mask (from [32]).
2.5 Interpolated filtering system (from [26]).
2.6 Implementation of interpolated filter after applying the polyphase decomposition (from [26]).
2.7 Block diagram of channel model (from [32]).
2.8 The scaling situation of each step (from [32]).
2.9 Transmission mechanism in the existing simulator (from [32]).
2.10 Block diagram of the demo subsystem (from [32]).
3.1 Innovative Integration's Quatro6x DSP board (from [18]).
3.2 Block diagram of TMS320C6x01 DSP chip (from [33]).
3.3 Innovative Integration's Quatro6x block diagram (from [18]).
3.4 FIFOLink block diagram (from [19]).
3.5 Code for using DMA through FIFOLink (from [32]).
3.6 Transmission mechanism in the existing simulator (from [32]).
3.7 A simple code suffering from issue 1 in MPO.
3.8 A simple code suffering from issue 2 in CCS.
4.1 Interpolated filtering system with the cascade of pulse shaping filter, channel, and matched filter.
4.2 Polyphase technique applied to the cascade of pulse shaping filter, channel, and matched filter.
4.3 Operations of one computing and four movement used for the channel simulator.
4.4 Case A0 program segment.
4.5 Case A1 program segment.
4.6 Case A2 program segment.
4.7 Case A3 program segment.
4.8 Case A4 program segment.
4.9 Zoom in II supported DMA library.
4.10 Software flow in a simple application (from [34]).
4.11 'C6x data memory controller interconnect to memory banks (from [44]).
4.12 A simple code showing that DMA channels could not move data simultaneously.
5.1 A geometric interpretation of CCM (from [4]).
5.2 CORDIC program for sine and cosine.
5.3 CORDIC program for square root and natural logarithm.
5.4 CCM program for natural logarithm.
A.1 The volatile structure needs to be changed.

List of Tables

2.1 Propagation Conditions for Multipath (Fading) Environments [1]
2.2 Memory Arrangement [32]
2.3 Profiles of Scrambling Operation for Different Versions of the Code [32]
2.4 Profiles of Pulse Shaping Filter Operation for Different Versions of the Code [32]
2.5 Profiles of Matched Filter Operation for Different Versions of the Code [32]
2.6 Complexity and Performance of Implementation [32]
3.1 FIFO Status Definition [19]
3.2 FIFOLink Data Transfer Functions Using CPU [19]
3.3 DMA Data Transfer Functions [19]
3.4 Unlucky Style
3.5 Lucky Style
4.1 Pipelining Structure, Software Pipelining and 'C6x Pipeline Operation
4.2 Scheduling Table of Our Pipelining-512 Structure
4.3 Observation on Single DSP
4.4 Observation on Overall Connection Speed
4.5 Observation on Case B Type of Removing Data Transfer Commands
4.6 Observation on the Sequential-10240 Structure per Slot
4.7 Observation on the Sequential-10240 Structure per Move-in 512
4.8 Observation on the Sequential-10240 Structure per Move-out 512
4.9 Observation on the Pipelining-512 Channel
4.10 Observation on the Pipelining-512 Channel tm14
4.11 Observation on the Pipelining-512 Channel tm35
4.12 Observation on the “while(!(get fifo link status(2)&Tx FIFO EMPTY))”
4.13 Observation on the “while(!(get fifo link status(1)&Rx FIFO FULL))”
4.14 Observation on Commands Starting with “while!(dma done”
4.15 Improved Actual Performance
5.1 Summary of Generalized CORDIC Algorithms [27]
5.2 CORDIC Example for sin(70°) and cos(70°) [29]
5.3 CORDIC Example for tanh^-1(1/3) [29]
5.4 CCM Example for Logarithm in Floating-Point [15]
5.5 CCM Example for Logarithm in Fixed-Point
5.6 DSP Speed of CORDIC/CCM in Fixed-Point vs. TI Floating-Point Library
A.1 Summary of Two Issues in MPO and CCS

Chapter 1 Introduction

Current generations of telecommunications infrastructure require real-time performance. To meet the demands of next-generation equipment, such as third-generation (3G) cellular mobile radio communication systems, designers must seek methods to improve hardware and software efficiency. This thesis investigates such methods. With the advent of applications that require intensive real-time computation, such as 3G wireless system implementation, distributed processing systems consisting of several interconnected DSPs are a useful prototyping tool [38]. The scope of this thesis is to run an efficient 3G wireless channel simulator smoothly on a multiprocessor platform; a close study of 3G itself is not our present concern.

Fig. 1.1 gives a simple block diagram of the overall WCDMA system. A large number of studies have been made in a team project reported in [7], [11], [23], [24], and [32]. Four DSP boards have been used in that work: three fixed-point boards called Quatro62 and one floating-point board called Quatro67. The Quatro6x is made by Innovative Integration (II). It houses four Texas Instruments (TI) TMS320C6x01 processors in a symmetric multiprocessing relationship with interprocessor communication links. Each functional block implemented on the DSP processors can reach real-time computation on its own. However, the interprocessor data transmission was done in a block-based, event-driven fashion. Therefore, a more efficient implementation is desirable.

Fig. 1.1: Block diagram of WCDMA system for one-way transmission (from [32]).

The components shaded in Fig. 1.1 belong to a wireless channel simulator implemented on the Quatro67 platform. Although Tsai [32] put a great deal of effort into the implementation, several issues remained to be addressed. First, synchronized execution of the multiprocessor programs was not smooth on the platform. The individual DSP programs needed to be executed in a particular order; that is, they could not be loaded to the DSPs simultaneously. Second, real-time performance degraded on the connected multiprocessor DSP system. The actual run time increased unexpectedly after the individual DSP programs were connected together on the Quatro6x DSP board. This issue was also faced by the other project team members mentioned previously. Third, fixed-point generation of channel coefficients was not yet available. The generation of channel coefficients still depended on floating-point mathematical functions. These three issues are addressed in this thesis.

This thesis is organized as follows. Chapter 2 briefly reviews the implemented wireless channel simulator. In chapter 3, we introduce the DSP environment and deal with the synchronizing execution problem. Chapter 4 addresses efficient multiprocessing, including the pipelining-512 structure, actual speed observation, and the double buffering technique. In chapter 5, we discuss elementary function evaluation in fixed-point arithmetic, such as by CORDIC. Finally, chapter 6 contains the conclusion and potential future work.

Chapter 2 Overview of An Existing DSP-Based Wireless Channel Simulator

2.1 Introduction to the Existing Simulator [32]

A DSP-based wireless channel simulator has been implemented by Tsai [32]. The implemented system employs a collaborative computing structure composed of a desktop PC and a DSP-embedded plug-in card. There are four DSP chips on the DSP-embedded card, called DSP0 to DSP3. Three chips are used to implement the system. DSP0 acts as the channel simulator. DSP1 performs spreading, scrambling coding, and pulse-shaping filtering. The matched filter is in DSP2.

Fig. 2.1 shows a block diagram of the modulator, channel simulator, and receiver filter. The modulator generates the transmitted signal. The data after the framing operation (not shown in the figure) are input into the system and spread according to the 3GPP standard [3], which includes channelization coding, scrambling coding, and pulse-shaping filtering. Then the data are passed through the channel, which may be a static or a fading channel. Besides the multipath effect, which is defined in the 3GPP standard [1], multi-user interference is also considered. Finally, the noisy, distorted signal is received through the matched filter.

In the existing simulator [32], DSP processors are used to implement 3GPP WCDMA transmission signal processing and to simulate the wireless channel for real-time experiments. However, we want to improve the existing simulator to resolve several remaining issues.

Fig. 2.1: Block diagram of the implemented system (from [32]).

2.2 WCDMA Uplink Transmission Scheme [32]

To begin, we introduce the transport channels. Time durations are defined by start and stop instants, measured in integer multiples of chips. A radio frame is a processing duration that consists of 15 slots. The length of a radio frame corresponds to 38400 chips. A slot is a duration that consists of fields containing bits. The length of a slot corresponds to 2560 chips.

2.2.1 Spreading Modulator

The spreading modulator performs two operations. The first, called channelization, transforms every source data symbol into a number of chips, thus increasing the bandwidth of the signal. The second, called scrambling, allows the receiver to distinguish different users.

Channelization codes

In channelization, data symbols on the I- and Q-branches are independently multiplied with an orthogonal variable spreading factor (OVSF) code. The cross-correlation between orthogonal codes is zero for synchronous transmission.
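The orthogonality property of channelization codes can be illustrated with a small sketch. The code below is not taken from the simulator in [32]; it builds OVSF codes with the usual recursive doubling rule (each code c yields the children (c, c) and (c, -c)), and the function names `ovsf_codes`, `spread`, and `correlate` are hypothetical names chosen for this illustration. It ignores the specific code numbering used by the 3GPP standard.

```python
def ovsf_codes(sf):
    """Return the sf OVSF codes of length sf (sf must be a power of two)."""
    if sf < 1 or sf & (sf - 1):
        raise ValueError("spreading factor must be a power of two")
    codes = [[1]]
    while len(codes[0]) < sf:
        doubled = []
        for c in codes:
            doubled.append(c + c)                 # child (c, c)
            doubled.append(c + [-v for v in c])   # child (c, -c)
        codes = doubled
    return codes


def spread(symbols, code):
    """Replace every symbol by len(code) chips (symbol times code)."""
    return [s * chip for s in symbols for chip in code]


def correlate(a, b):
    """Zero-lag correlation of two equally long chip sequences."""
    return sum(x * y for x, y in zip(a, b))


codes = ovsf_codes(8)
chips = spread([1, -1], codes[3])   # two symbols become 16 chips
```

Correlating the received chips of one symbol with the user's own code recovers the symbol scaled by the spreading factor, while correlating with any other code of the same length gives zero, which is the synchronous-transmission property stated above.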

Fig. 2.2: Frame structure of uplink DPDCH/DPCCH (from [2]).

Scrambling codes

After the channelization operation, the I- and Q-branch signals are multiplied by the complex-valued scrambling code. The code can be either a short or a long code. The required number of codes depends on the expected traffic load and spectrum efficiency. There are $2^{24}$ different uplink scrambling codes, with different initial values for the generating registers. The long scrambling sequences $c_{long,1,n}$ and $c_{long,2,n}$ are constructed from the position-wise modulo-2 sum of 38400-chip segments of two binary m-sequences generated by means of two generator polynomials of degree 25. Fig. 2.3 shows the configuration of the uplink long scrambling sequence generator.

Fig. 2.3: Configuration of long scrambling sequence generator (from [3]).

Control of SNR

To define the power level of the input data, we have to compute the signal energy in the overall system and find how to adjust the power to achieve different SNRs. The SNR at the matched filter output for the DPDCH is

$$SNR_d \triangleq \frac{E_d}{\sigma_v^2} = \frac{4A^2 \cdot SF^2 \cdot \beta_d^2}{2 \cdot SF \cdot \sigma^2} = \frac{2A^2 \cdot SF \cdot \beta_d^2}{\sigma^2},$$

where $A$ is the amplitude of the transmitted signal and $\sigma^2$ is the noise variance on each quadrature branch at the input to the matched filter. For the DPCCH, the SNR is

$$SNR_c \triangleq \frac{E_c}{\sigma_v^2} = \frac{4A^2 \cdot 256^2 \cdot \beta_c^2}{2 \cdot 256 \cdot \sigma^2} = \frac{2A^2 \cdot 256 \cdot \beta_c^2}{\sigma^2}.$$

Hence we can adjust the amplitude factor $A$ of the receiver input signal to achieve any desired SNR.

2.2.2 Pulse Shaping Filter

Because the transmission is bandwidth-limited, the output chip stream from the spreading modulator is filtered using a pulse shaping filter (PSF). 3GPP WCDMA employs root-raised-cosine (RRC) pulse shaping with roll-off α = 0.22 [1]. It is found [32] that a 33-tap RRC filter can comply with the spectrum emission mask [1] up to beyond 8 MHz. Fig. 2.4 shows the comparison. We assume that the subsequent analog filtering can effectively suppress the signal power above 8 MHz.
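As an illustration of the pulse shaping filter described above, the following sketch computes 33 RRC taps at 4 times oversampling with roll-off 0.22 from the standard closed-form RRC impulse response, with time measured in chip periods. It is a floating-point sketch under these assumptions, not the coefficient set used in [32]: the actual taps there may be scaled, windowed, or quantized differently, and the name `rrc_taps` is hypothetical.

```python
import math


def rrc_taps(num_taps=33, oversample=4, alpha=0.22):
    """Root-raised-cosine taps sampled at `oversample` points per chip."""
    if num_taps % 2 == 0:
        raise ValueError("use an odd number of taps for a symmetric filter")
    half = num_taps // 2
    taps = []
    for n in range(-half, half + 1):
        t = n / oversample                      # time in chip periods
        if n == 0:
            # limit of the closed form at t = 0
            h = 1.0 - alpha + 4.0 * alpha / math.pi
        elif abs(abs(4.0 * alpha * t) - 1.0) < 1e-12:
            # limit of the closed form at t = +/- 1/(4*alpha)
            h = (alpha / math.sqrt(2.0)) * (
                (1.0 + 2.0 / math.pi) * math.sin(math.pi / (4.0 * alpha))
                + (1.0 - 2.0 / math.pi) * math.cos(math.pi / (4.0 * alpha))
            )
        else:
            num = (math.sin(math.pi * t * (1.0 - alpha))
                   + 4.0 * alpha * t * math.cos(math.pi * t * (1.0 + alpha)))
            den = math.pi * t * (1.0 - (4.0 * alpha * t) ** 2)
            h = num / den
        taps.append(h)
    return taps


taps = rrc_taps()
```

The resulting impulse response is symmetric about its center tap, so only 17 distinct coefficients need to be stored.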

Fig. 2.4: Frequency response of 33-tap RRC filter (4 times oversampled) vs. the emission mask (from [32]). Axes: magnitude (dB) versus frequency.

At the transmitter: the polyphase technique

In the transmitter, to reduce unnecessary computation, Tsai [32] considers a more efficient way to oversample the data and pass them through the PSF. Since only every Lth sample of the upsampled input is nonzero, a better implementation, shown in Fig. 2.6, applies the filter coefficients only to input values that are nonzero [26]. To illustrate the advantage of Fig. 2.6 over Fig. 2.5, we note that if H(z) is a filter of length N, then we originally need NL multiplications and (NL − 1) additions per unit time. On the other hand, we only need L(N/L) multiplications and L(N/L − 1) additions per unit time for the set of polyphase filters, plus (L − 1) additions to obtain an output datum. Thus we obtain a significant saving in computation. In the DSP implementation, the rearrangement reduces the required run time and allows the system to complete the signal processing within the limited amount of time [26], [32].

At the receiver: the matched filter

At the receiver side, a matched filter is designed to provide the maximum signal-to-noise power ratio at its output in AWGN. The amount of computation is therefore quite large. A 9-tap RRC filter is chosen [32] to replace the original 33-tap one. We need to perform 15.36M·9·2 = 276.48M multiplications per second in one TI TMS320C6701 chip, which is within the real-time computation ability of the chip.

Fig. 2.5: Interpolated filtering system (from [26]).

Fig. 2.6: Implementation of interpolated filter after applying the polyphase decomposition (from [26]). The input X[n] feeds the polyphase branches E0(z) to E3(z), whose upsampled-by-L outputs are combined to form Y[n].

2.2.3 Channel Model

A communication channel transmits the information-bearing signal to the destination. The signal is subject to multipath fading and additive noise, which produce random variations in the signal. The block diagram of the channel simulator is illustrated in Fig. 2.7, where only a single user is considered. The interference from other users is added to the result at the output of the single-user channel. Tsai [32] implements two kinds of channels, static and fading.
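The equivalence between the interpolated filtering system of Fig. 2.5 and the polyphase form of Fig. 2.6 can be checked numerically. The sketch below is an illustration only, not the DSP code of [32]; the function names and the toy integer filter are hypothetical. The direct form inserts L − 1 zeros after each input sample and then filters, while the polyphase form runs L short sub-filters on the original samples and interleaves their outputs.

```python
def upsample_then_filter(x, h, L):
    """Direct form: zero-insert by L, then FIR filter with h."""
    up = []
    for v in x:
        up.append(v)
        up.extend([0] * (L - 1))
    y = [0] * len(up)
    for n in range(len(up)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * up[n - k]   # most of these products are zero
        y[n] = acc
    return y


def polyphase_interpolate(x, h, L):
    """Polyphase form: sub-filter p uses taps h[p], h[p+L], h[p+2L], ..."""
    y = [0] * (len(x) * L)
    for p in range(L):
        sub = h[p::L]
        for m in range(len(x)):
            acc = 0
            for j in range(len(sub)):
                if m - j >= 0:
                    acc += sub[j] * x[m - j]
            y[m * L + p] = acc            # interleave the branch outputs
    return y


x = [1, -2, 3, 4, -1]            # toy input samples
h = list(range(1, 13))           # toy 12-tap filter, L = 4 gives 3 taps per branch
direct = upsample_then_filter(x, h, 4)
fast = polyphase_interpolate(x, h, 4)
```

Both functions return the same sequence, but the polyphase version never multiplies by an inserted zero, which is the source of the saving discussed above.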

Fig. 2.7: Block diagram of channel model (from [32]).

Static channel

In this situation, multipaths exist, but the channel coefficients do not change during the transmission period. The channel coefficient of each path is a complex value given by $\alpha e^{j\theta}$, where $\alpha$ is the power level of the path and can be computed from the definition of SNR, and $e^{j\theta} = \cos\theta + j\sin\theta$ is the phase of the path. In addition to static multipath propagation, white Gaussian noise is added.

Fading channel

Fading refers to rapid fluctuation of the amplitude of the channel gain over a short period. The power of each multipath component is time varying, as a result of the motion of the mobile or of surrounding objects. To approximate Rayleigh fading, Jakes [13] suggests using the phases

$$\theta_{n,j} = \beta_n + \frac{2\pi (j-1)}{N_0 + 1},$$

where $j = 1$ to $N_0$ is the waveform index. The model becomes

$$T(t) = \sqrt{\frac{2}{N_0}} \sum_{n=1}^{N_0} \left[\cos\beta_n + j\sin\beta_n\right] \cos(w_n t + \theta_n),$$

where the Doppler frequency $w_m$ is given by $w_m = 2\pi v/\lambda$, with $v$ being the velocity of the mobile and $\lambda$ being the wavelength of the carrier. In our case, $f = 2$ GHz and $\lambda = 3 \times 10^8 / (2 \times 10^9) = 3/20$ m, $w_n = w_m \cos(2\pi n/N)$, $n = 1, 2, \cdots, N_0$, and $\beta_n = \pi n/N_0$. Further discussion will be presented in chapter 5.
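A floating-point sketch of the fading model above may help make the roles of the symbols concrete. It evaluates T(t) for one waveform index j. Two points are assumptions of this sketch rather than statements from the text: the total oscillator count is taken as N = 4·N0 + 2, as in Jakes' usual formulation, and the phase θ_n in the sum is taken to be θ_{n,j} for the chosen waveform. The names `wavelength`, `max_doppler`, and `jakes_sample` are hypothetical.

```python
import math

C_LIGHT = 3.0e8   # speed of light in m/s, as used in the text


def wavelength(carrier_hz):
    """Carrier wavelength in meters."""
    return C_LIGHT / carrier_hz


def max_doppler(speed_kmh, carrier_hz):
    """Maximum Doppler frequency w_m = 2*pi*v/lambda in rad/s."""
    v = speed_kmh / 3.6                       # km/h to m/s
    return 2.0 * math.pi * v / wavelength(carrier_hz)


def jakes_sample(t, j, n0, w_m):
    """Complex fading coefficient T(t) for waveform index j (1..n0)."""
    n_total = 4 * n0 + 2                      # assumption: N = 4*N0 + 2
    re = 0.0
    im = 0.0
    for n in range(1, n0 + 1):
        beta = math.pi * n / n0
        w_n = w_m * math.cos(2.0 * math.pi * n / n_total)
        theta = beta + 2.0 * math.pi * (j - 1) / (n0 + 1)
        osc = math.cos(w_n * t + theta)
        re += math.cos(beta) * osc
        im += math.sin(beta) * osc
    scale = math.sqrt(2.0 / n0)
    return complex(scale * re, scale * im)


w_m = max_doppler(3.0, 2.0e9)                 # 3 km/h at a 2 GHz carrier
g = jakes_sample(1.0e-3, 1, 8, w_m)
```

Each call needs N0 cosine evaluations plus the sines and cosines of the fixed angles β_n; the latter can be precomputed, but the time-dependent cosines are the elementary function calls that chapter 5 aims to evaluate in fixed point.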

Table 2.1: Propagation Conditions for Multipath (Fading) Environments [1]

Case   Speed      Relative delay [ns]    Average power [dB]
1      3 km/h     0, 976                 0, -10
2      3 km/h     0, 976, 20000          0, 0, 0
3      120 km/h   0, 260, 521, 781       0, -3, -6, -9
4      3 km/h     0, 976                 0, 0

Multipath propagation

Some multipath propagation conditions are defined in [1]. Table 2.1 shows the propagation conditions that are used for the performance measurements in a multipath environment. A chip in our program lasts 1/3.84 MHz ≈ 260 ns. After oversampling by 4 times, each sample lasts 260/4 = 65 ns. A delay listed in Table 2.1 is converted to a number of samples by dividing the delay by 65 ns. For the 976 ns delay, we implement a 15-point shift in our program. Besides the multipath propagation conditions listed in Table 2.1, Tsai [32] also simulates other kinds of propagation channels, including the birth-death and moving propagation conditions [1].

Gaussian noise

White Gaussian noise is added to the signal transmitted through the channel. Tsai [32] generates the noise by transforming two uniformly distributed random sequences. The operation is done on the PC and the noise is stored in files. The stored noise is used repeatedly; after using Matlab functions to compare the power spectral density of stored noise sequences of different lengths with that of the ideal noise, we decide to store 298801 complex-valued noise samples in files. Further discussion will be presented in chapter 5.
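The delay conversion described above for Table 2.1 can be sketched as follows. The sketch uses the exact sample period 1/(4 × 3.84 MHz) ≈ 65.1 ns rather than the rounded 65 ns, which gives the same shifts after rounding. The function names and the tap representation as (delay in ns, average power in dB) pairs are hypothetical choices for this illustration, and the taps are applied as fixed gains, that is, as in the static channel.

```python
CHIP_RATE_HZ = 3.84e6
OVERSAMPLE = 4
SAMPLE_PERIOD_NS = 1e9 / (CHIP_RATE_HZ * OVERSAMPLE)   # about 65.1 ns


def delay_to_samples(delay_ns):
    """Convert a path delay in ns to a shift in oversampled points."""
    return int(round(delay_ns / SAMPLE_PERIOD_NS))


def apply_multipath(samples, taps):
    """Sum delayed, scaled copies of the input.

    taps is a list of (delay_ns, power_db) pairs; the average power in dB
    is converted to an amplitude gain of 10**(power_db / 20).
    """
    out = [0.0] * len(samples)
    for delay_ns, power_db in taps:
        shift = delay_to_samples(delay_ns)
        gain = 10.0 ** (power_db / 20.0)
        for i in range(shift, len(samples)):
            out[i] += gain * samples[i - shift]
    return out


shifts = [delay_to_samples(d) for d in (0, 976, 20000, 260, 521, 781)]
impulse = [1.0] + [0.0] * 19
case1 = apply_multipath(impulse, [(0, 0.0), (976, -10.0)])
```

For Case 1, the impulse response has a unit tap at sample 0 and a second tap, 10 dB weaker, 15 samples later.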

Fig. 2.8: The scaling situation of each step (from [32]).

2.2.4 Fixed-Point Simulation

To improve the speed and save memory, we have to simulate with fixed-point numbers instead of floating-point numbers. The data saved in the internal memory of the DSP board should be 16-bit integers. To maintain precision, each datum is shifted so that it occupies a 16-bit integer. After addition or multiplication operations, Tsai [32] puts the results in a longer temporary register to avoid overflow. Tsai also pays attention to the additions at each step of the simulation and makes sure that the two operands have the same scale. The scale of the data in each step is illustrated in Fig. 2.8. The example in Fig. 2.8 is for a static channel with SNR = 10 dB. When we save the data into memory, we have to measure the maximum value and downshift the data to 16-bit integers. Further discussion will be presented in chapter 5.
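The scaling procedure described above can be sketched with plain integers. This is a minimal illustration under assumed conventions, not the scaling actually used in Fig. 2.8: `to_int16` quantizes with a caller-chosen number of fractional bits, `mac32` stands in for a multiply-accumulate into a wider register, and `block_downshift` measures the block maximum and shifts every value right until the block fits in 16 bits. All three names are hypothetical.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def to_int16(x, frac_bits):
    """Quantize a real value to a saturated 16-bit integer."""
    v = int(round(x * (1 << frac_bits)))
    return max(INT16_MIN, min(INT16_MAX, v))


def mac32(a, b):
    """Multiply-accumulate two 16-bit vectors into a wide accumulator."""
    return sum(x * y for x, y in zip(a, b))


def block_downshift(values):
    """Shift a block right until its largest magnitude fits in 16 bits.

    Returns the shifted block and the shift count, which must be tracked
    so that later additions combine data of the same scale.
    """
    peak = max(abs(v) for v in values)
    shift = 0
    while (peak >> shift) > INT16_MAX:
        shift += 1
    return [v >> shift for v in values], shift


wide = [100000, -200000, 50]          # results held in a longer register
stored, shift = block_downshift(wide)
```

Returning the shift count alongside the data is one way to keep track of the scale of each block, which is needed whenever two blocks are added.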

Fig. 2.9: Transmission mechanism in the existing simulator (from [32]).

2.3 DSP Implementation of the Existing Simulator [32]

A DSP-based simulator has been implemented by Tsai [32]. In this existing simulator, one floating-point Quatro67 DSP board is used to implement the modulator, the wireless channel simulator, and the receiver. This section briefly summarizes its key results.

2.3.1 Transmission Mechanism

Fig. 2.9 shows the transmission mechanism of the existing simulator [32]. The transmission between two chips or two boards is performed by a flexible FIFO-based interprocessor communications network. To process data saved in external memory, we use the DMA controller to move the data rapidly. Further discussion will be presented in the following chapter.

Table 2.2: Memory Arrangement [32]

                   CPU1 (Modulation)               CPU0 (Channel)          CPU2 (Matched Filter)
Internal Memory    56.19 Kbytes                    63.46 Kbytes            43.28 Kbytes
External Memory    150 Kbytes (Scramble codes)     1167 Kbytes (Noise)

Table 2.3: Profiles of Scrambling Operation for Different Versions of the Code [32]

Modified Version                      Total Cycle   Computation Time (per frame)   Memory Usage
If-else statement                     2210565       1.326 × 10^-2 s                9.375 Kbytes
Direct multiplication and addition    117974        7.708 × 10^-4 s                150 Kbytes

2.3.2 Memory Arrangement

The existing simulator [32] processes one slot of information at a time. In each slot, after oversampling by 4 times, there are 10240 complex samples. With the data saved as 16-bit integers, the memory size is 40.96 Kbytes. Table 2.2 shows the memory arrangement for each block in each CPU.

2.3.3 Optimization and Profile

Tsai [32] uses software pipelining, loop unrolling, speculative execution replacement, and loop partition to optimize the performance [43], [39]. The profiles of the different transceiver functions are given in Tables 2.3, 2.4, and 2.5, respectively. The measurement tool used here is the profiler provided by TI's TMS320C6x Code Composer Studio (CCS).

Table 2.4: Profiles of Pulse Shaping Filter Operation for Different Versions of the Code [32]

Modified Version    Total Cycle   Computation Time (per frame)
Original            10763220      6.458 × 10^-2 s
Loop Unrolling      15727890      9.437 × 10^-2 s
Loop Partition      1539691       9.238 × 10^-3 s
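The slot buffer size quoted in Section 2.3.2 and the relation between cycle counts and computation times in the profile tables can be reproduced with a few lines of arithmetic. The cycle time of 6 ns used below is an assumption of this sketch: it is not stated in this chapter, but it is the value that most rows of Tables 2.3 and 2.4 imply when the computation time is divided by the cycle count. The constant and function names are hypothetical.

```python
CHIPS_PER_SLOT = 2560
OVERSAMPLE = 4
BYTES_PER_COMPONENT = 2      # one 16-bit integer per I or Q component
CYCLE_TIME_S = 6e-9          # assumed cycle time, inferred from the tables


def slot_buffer_bytes():
    """Bytes needed to hold one oversampled slot of complex 16-bit data."""
    samples = CHIPS_PER_SLOT * OVERSAMPLE          # 10240 complex samples
    return samples * 2 * BYTES_PER_COMPONENT       # I and Q components


def cycles_to_seconds(cycles):
    """Convert a profiled cycle count to run time under the assumed clock."""
    return cycles * CYCLE_TIME_S


slot_bytes = slot_buffer_bytes()
if_else_time = cycles_to_seconds(2210565)          # Table 2.3, first row
loop_partition_time = cycles_to_seconds(1539691)   # Table 2.4, last row
```

Under this assumption, the loop-partitioned pulse shaping filter takes about 9.2 ms.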

(25) Table 2.5: Profiles of Matched Filter Operation for Different Versions of the Code [32] Modified Version Total Cycle Computation Time (per frame) 33-tap Filter 19340310 1.16 × 10−1 s 9-tap Filter 3032278 1.819 × 10−2 s 9-tap Filter 1391295 8.348 × 10−3 s (After data declaration). Table 2.6: Complexity and Performance of Implementation [32]. 2.3.4 Complexity and Performance When we analyze the complexity, we focus on the multiplications in our program. The amount of data we consider is 38400 chips, equal to a frame. It should be completed in 10 ms. The complexity and final performance are given in Table 2.6. The percentage figures listed in the table reflect the achievement from the effort spent in optimization by Tsai [32]. Two demo systems were constructed: the demo subsystem (shown in Fig. 2.10) and the multi-user system (not shown here).. 2.4 Problems with the Existing Simulator In summary, Tsai [32] implements a 3GPP WCDMA modulator, several kinds of channel simulators, and a matched filter on the Quatro67 multi-DSP card. For single user transmission under static channel with multipath propagation, the processing speed of. 14.

the modulator and channel simulator can achieve the needed 3.84 Mchips per second. However, the existing simulator suffers from three problems, which are addressed in the following three chapters, respectively.

Fig. 2.10: Block diagram of the demo subsystem (from [32]).

2.4.1 Synchronizing Execution Problem
In the existing simulator [32], three DSP chips are used to implement the modulator, the channel, and the receiver, respectively. However, the three programs need to be executed in a particular order. That is, the system fails when they are downloaded simultaneously. We study the synchronizing problem in chapter 3.

2.4.2 Real-Time Multiprocessing Problem
A real-time DSP-based modulator, wireless channel simulator, and receiver filter have been implemented [32]. The three individual parts (channel, modulation, and matched filter) can run in about 7.5, 10.2, and 8.3 ms per frame, respectively, which is close to real time. We

connect the three main parts and download them to the Quatro6x DSP board. However, the actual run time is then about 20 ms per frame [32], whereas each frame in the 3GPP standard is 10 ms long. We study the real-time multiprocessing problem in chapter 4.

2.4.3 Fixed-Point Generation Problem
We use the Quatro67 DSP board to simulate the channel model. The TMS320C67x DSPs on the board perform floating-point operations. Under this restriction, the performance of the fading channel is not very good, because the programs have to call special mathematical functions to generate the channel coefficients. There are several fixed-point methods that generate the channel coefficients without calling special functions that need branching [32]. We study the problem in chapter 5.

Chapter 3
The Quatro6x Multiprocessor Platform

3.1 Overview
Fig. 3.1 shows the DSP board we use. It is Innovative Integration (II)’s Quatro6x DSP board, which houses four Texas Instruments (TI) TMS320C6x DSP chips. A host PC and several development tools work together with the board to provide a complete design environment. The development tools are TI’s Code Composer Studio integrated development environment, a JTAG emulator, and II’s Zuma toolset for the Quatro6x.

In the existing simulator [32], three DSP chips are used to implement the modulator, the channel, and the receiver, respectively. However, the three DSP programs need to execute in a certain specified order; otherwise, troubles arise. Is this due to the platform or to our application? In this chapter, after introducing the DSP environment, we therefore examine the interprocessor interaction mechanism and the problem of synchronizing execution in the multiprocessor environment.

Fig. 3.1: Innovative Integration’s Quatro6x DSP board (from [18]).

3.2 Introduction to the Multi-DSP Board
The DSP board we use is Innovative Integration (II)’s Quatro6x DSP board. It houses four Texas Instruments (TI) TMS320C6x01 DSP chips, which may be fixed-point TMS320C6201 DSPs or floating-point TMS320C6701 DSPs. In the following, Quatro62 and Quatro67 refer to a Quatro6x (or Q6x) that houses TMS320C6201 and TMS320C6701 DSPs, respectively, and Quatro6x (or Q6x) refers to either case. This section introduces the DSP chip and the DSP board.

3.2.1 TMS320C6x DSP Chip
Much of the description given in this subsection is taken from [41] and [42]. TI’s TMS320C6701 is a 167 MHz floating-point DSP, and the TMS320C6201 is a 200 MHz fixed-point DSP. Fig. 3.2 gives their block diagram. The TMS320C6x (’C6x) DSP processor consists of three main parts: the core, memory, and peripherals.

DSP core
The DSP core has two data paths, A and B, in which processing occurs. Each data path has a register file containing sixteen 32-bit registers and four functional units that perform multiplication (.M), addition (.L), branching (.S), and load/store (.D). The functional units of each data path have a data bus that connects them to the registers on the opposite side of the DSP core, so that the units can exchange data.

Internal memory
The ’C6x has 64 Kbytes of internal program memory and 64 Kbytes of internal data memory. The program memory is 256 bits wide, holding one fetch packet per line, and can be configured either as a program cache or as directly addressed program memory. The 64 Kbytes of data memory is organized into two blocks of 32 Kbytes: the TMS320C6701 has eight banks per block, and the TMS320C6201 has four banks per block.

Fig. 3.2: Block diagram of TMS320C6x01 DSP chip (from [33]).

Peripherals
The ’C6x chip contains several peripherals for communication with off-chip memory, coprocessors, host processors, and serial devices. The peripherals include a direct memory access (DMA) controller, power-down logic, an external memory interface (EMIF), serial ports, an expansion bus or host port, and timers. The EMIF provides the interface through which the DSP core connects to external devices, allowing additional data and program memory space. The DMA controller transfers data between regions in the memory map without passing through the DSP core. It allows data movement among internal memory, internal peripherals, and external devices to occur in the background of DSP core operation.

3.2.2 Quatro6x DSP Board
Much of the description given in this subsection is taken from [19] and [7]. Fig. 3.3 shows a block diagram of the Quatro6x board. The four DSP chips and their memories are shown with their connections to peripherals and other interfaces. The Quatro6x is a PCI-bus-compatible DSP board housing four Texas Instruments (TI) TMS320C6x (’C6x) DSP chips in a symmetric multiprocessing relationship with interprocessor communication links. The four chips are called DSP0 to DSP3 (also CPU0 to CPU3, or DSP-A to DSP-D), numbered anticlockwise. The Quatro6x’s features include the following:

Six interprocessor FIFOLinks
The Quatro6x implements a high-speed, flexible FIFO-based interprocessor communication network called FIFOLink. The FIFOLink network allows any on-board processor to transmit to and receive from any other processor on the card via a high-speed, 32-bit wide, FIFO-buffered interface. Each of the six available links implements a 512 × 32 bidirectional buffer, and the maximum transfer rate reaches 160 Mbytes/sec on a 200 MHz Quatro6x board. Further discussion of FIFOLinks will be presented in the following

Fig. 3.3: Innovative Integration’s Quatro6x block diagram (from [18]).

chapter.

Three interboard FIFOPorts
The FIFOPort feature provides a means for interboard communication. It provides three bidirectional, buffered 16-bit interfaces that allow external hardware or other II DSP boards to communicate with the Quatro6x. A 512 × 16 FIFO is provided per FIFOPort, and a writing rate of up to 80 Mbytes/sec and a reading rate of up to 57 Mbytes/sec can be reached. Only DSP1, DSP2, and DSP3 support FIFOPorts; DSP0 (the first processor) does not. More details about applying FIFOPorts can be found in [11].

PCI interface with ASRAM buffer memory
The Quatro6x provides a standard 32-bit PCI bus interface for communication between the PC and the DSP board. Only the first processor (DSP0) can communicate directly with the host PC through PCI. The ASRAM (asynchronous SRAM) is accessible by the PCI bus interface device. The 128K × 32 ASRAM is used as a buffer for busmaster and slave data movement on the PCI bus. More information about PCI can be found in [7].

External SBSRAM and SDRAM memory pools
Optional SBSRAM (synchronous burst SRAM) and 16 Mbytes of SDRAM (synchronous DRAM) provide large areas to store data or program code. The SBSRAM and SDRAM are not accessible through the PCI interface and are private to their associated processor. The processors allow 8-, 16-, and 32-bit wide data movement to and from the off-chip SBSRAM and SDRAM.

3.3 Introduction to the Development Tools
Much of the description given in this section is taken from [19]. The development tools of the Quatro6x are the Code Composer Studio integrated development environment, the JTAG emulator, and the Zuma toolset for the Quatro6x.

3.3.1 Code Composer Studio
TI’s Code Composer Studio (CCS) is an integrated development environment that provides editing, compiling, downloading, and low-level debugging. When used in conjunction with II’s JTAG emulator, CCS allows access to specific DSP registers and functions. The PCI-style JTAG debugger is a separate card that connects to the Quatro6x DSP board via a cable. With the JTAG-based, hardware-assisted C/assembler source debugger, a typical application program consists of one or more C (.c), header (.h), and assembly language (.asm) source files, as needed. Additionally, target program generation requires a linker command file (.cmd), which specifies the memory map (.map) for the target and optionally includes commands defining the libraries to be linked into the final application.

The linker command file serves three main objectives. The first is to describe to the linker the memory map of the system to be used, which is specified by “MEMORY{...}”. The second is to tell the linker how to bind each section of the program to a specific memory area defined in “MEMORY{...}”, which is specified in “SECTIONS{...}”. The third is to supply the linker with the input files, the output files, and the linker options.

3.3.2 Zuma Software Toolset
The Zuma toolset is a comprehensive collection of tools and libraries used to develop application programs for several series of II’s DSP boards. It includes:
1. DSP Peripheral Library: supports on-board peripherals and DSP functions.
2. Dynamic Link Library (DLL): for host PC software application development.
3. Host Support Applets: for terminal emulation and automatic program download.
4. Sample Applications: show host PC as well as target DSP coding techniques.

UniTerminal and MPO are II’s support applets for terminal emulation and automatic program download.

UniTerminal
Each of the development packages is supplied with a terminal emulator application called “UniTerminal,” which can be used either stand-alone or in conjunction with CCS. If the UniTerminal utility is invoked and starts successfully, it displays “Status: Active. DSP DLL Loaded OK” at the bottom of its client window. UniTerminal provides a C-language-compatible, standard I/O terminal emulation facility for interacting with the stdio library running on an Innovative Integration target DSP processor. Since standard I/O uses hardware handshaking, DSP program execution is halted automatically at the first stdio library call if the terminal emulator is not executing when the DSP application is run. UniTerminal supports downloading of either COFF (Common Object File Format) files (.OUT) or multiprocessor out (.MPO) files. An .MPO file provides a means of downloading separate .OUT files to multiple processors simultaneously, which greatly simplifies the task of synchronizing execution in a multiprocessor environment.

MPO
The MPO editor provides a means of editing the special configuration files used on the Quatro6x to allow downloading of multiple COFF object files simultaneously. The UniTerminal applet understands the MPO file format and accepts .MPO files as well as .OUT files as download arguments. Downloading an MPO file from within UniTerminal causes new code to be loaded onto and executed by all processors. This is in contrast to downloading a standard COFF .OUT file, which downloads and executes code on DSP0 (the first processor) only.

TEST example program
The target board directory contains example programs, provided with II’s development package, that illustrate the use of the Quatro6x. These programs are provided as models for custom user software, and II highly recommends that the user examine them before beginning a first development effort for the target DSP. Full source code is provided for user inspection and for reuse in modified or custom applications. TEST and TEST2 are board-level hardware test programs, capable of exercising the major peripherals on the Quatro6x to double-check proper hardware functionality. As such, they contain routines for exercising each of the peripherals on the Quatro6x, including:
1. FIFOLinks,
2. FIFOPorts,
3. Internal timers,
4. Sync serial ports,
5. Busmastering, and
6. Interprocessor interrupts.
The programs aim to be encompassing in that they try to test as much of the board-level functionality as possible. Because the code of TEST and TEST2 is broken down into functional pieces that are called separately for each subsystem to be tested, it is possible to factor out individual tests for use in other programs. For example, the Quatro6x implements an interprocessor interrupt generation architecture that allows any one processor to notify any other processor of an event or condition via an interrupt. The ’C6x has two 32-bit general-purpose timers that are used to time events, count events, generate pulses, interrupt the CPU, and send synchronization events

to the DMA controller. If we want to apply interprocessor interrupts based on an internal timer, one way is to factor out the corresponding routines from TEST and TEST2.

Fig. 3.4: FIFOLink block diagram (from [19]).

3.4 Interprocessor Interaction
Much of the description given in this section is taken from [19], [11], and [32]. On the Quatro6x DSP board, each of the four DSP processors has a FIFOLink connected to every other on-board processor. The FIFOLinks are compatible with the DMA controller for high-performance interprocessor data flow. Both FIFOPort and FIFOLink have several modes that can be used: single word, full words by DMA, and almost-full mode. In this section, we introduce the way to use the FIFOLink and DMA functions for communication between two processors.

3.4.1 FIFOLink Functions
Fig. 3.4 shows the details of a single FIFOLink interface connection and its attendant control and status signals. Each FIFOLink includes a 512-element × 32-bit bidirectional buffer with full-level and interrupt control on data transmission and reception. In this

subsection, we describe some important functions used with FIFOLink. Before using a FIFO (either a FIFOLink or a FIFOPort), two things must be done:
• Include the header file “periph.h.”
• Declare the variables used by the FIFO as global variables.

FIFOLink reset
The receive FIFO may be cleared and its condition reset at any time by using the function
    reset_fifo_link(cpu), cpu ∈ {0, 1, 2, 3}.

FIFOLink status
The current fullness of a given link may be determined by reading the status port. The low-order six bits of the status port show the Full, Empty, and Almost Full status of each device. The FIFO status is defined in Table 3.1. We can use the following function to get the status:
    get_fifo_link_status(cpu), cpu ∈ {0, 1, 2, 3}.

FIFOLink data transfer functions using CPU
Data may be moved between memory blocks and each of the FIFOLinks using the functions listed in Table 3.2. These routines are coded as inline functions for speed. The address of a FIFOLink is defined as Periph->FLink[fifo_link(cpu)]. For example, to transmit a single word “a” to another CPU through the FIFOLink buffer, the instruction is
    Periph->FLink[fifo_link(cpu)]=a; // cpu ∈ {0, 1, 2, 3}
which is the same as fifo_link_spit(cpu,a). If we check the FIFOLink status before the transfer, that is,
    while(!(get_fifo_link_status(cpu)&Tx_FIFO_EMPTY));
    Periph->FLink[fifo_link(cpu)]=a;

This is a single-word transfer with handshaking, and is the same as fifo_link_emit(cpu,a). Similar examples can be found in the example programs recommended by II, TEST and TEST2.

Table 3.1: FIFO Status Definition [19]

C #define        Bit #    Condition
Rx_FIFO_FULL     0        Receive FIFO contains 512 elements
Rx_FIFO_EMPTY    1        Receive FIFO contains 0 elements
Rx_FIFO_AF       2        Receive FIFO contains more elements than pre-programmed threshold
Tx_FIFO_FULL     3        Transmit FIFO contains 512 elements
Tx_FIFO_EMPTY    4        Transmit FIFO contains 0 elements
Tx_FIFO_AF       5        Transmit FIFO contains more elements than pre-programmed threshold

Table 3.2: FIFOLink Data Transfer Functions Using CPU [19]

C Function               Description
fifo_link_spit(cpu,a)    Write a single word to the transmit FIFO using CPU without handshaking
fifo_link_emit(cpu,a)    Write a single word to the transmit FIFO using CPU with handshaking
fifo_link_eat(cpu)       Read a single word from the receive FIFO using CPU without handshaking
fifo_link_key(cpu)       Read a single word from the receive FIFO using CPU with handshaking
fill_fifo_link()         Write up to 512 elements from a memory buffer into the transmit FIFO using CPU
bleed_fifo_link()        Read up to 512 elements from receive FIFO into memory buffer using CPU

3.4.2 DMA Transfer Functions
The DMA controller transfers data between regions in the memory map without intervention by the CPU. It has four independent programmable channels, allowing four different contexts for DMA operation. The DMA channels may be used to transfer data between any of the FIFOLinks and a memory buffer using the inline functions in Table 3.3. The call sequences are:
    dma_mem_to_port(int channel, int* src, int* dest, int count, int block),
    dma_port_to_mem(int channel, int* src, int* dest, int count, int block),
    dma_copy_mem(int channel, int* src, int* dest, int count, int block).
For instance, the function dma_copy_mem(int channel, int* src, int* dest, int count, int block) copies “count” words of memory from the source buffer “src” to the destination

buffer “dest.” This function utilizes the specified DMA channel to perform the move. If “block” is true, the function waits until the move is completed before execution proceeds; otherwise, execution continues immediately after the DMA operation starts. We give an example of using DMA through the FIFOLink buffer at full level in Fig. 3.5. More details will be discussed in chapter 4.

Table 3.3: DMA Data Transfer Functions [19]

C Function           Description
dma_port_to_mem()    Read up to 65536 words from a FIFO at indicated address into a memory buffer using specified DMA channel
dma_mem_to_port()    Write up to 65536 words from a memory buffer into a FIFO using specified DMA channel
dma_copy_mem()       Copies up to 65536 words between internal memory and external memory using specified DMA channel

Fig. 3.5: Code for using DMA through FIFOLink (from [32]).
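The status bits of Table 3.1 and the handshaking single-word transfer of Section 3.4.1 can be illustrated with a small host-side model. This is a sketch, not II’s library: the `mock_` names, the structure, and the non-spinning "try" semantics are all assumptions made so that the example runs on a PC without the board; only the bit positions and the "wait for transmit FIFO empty, then write" idea come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 512u

/* Bit positions follow Table 3.1; the macro names are local to this sketch. */
#define RX_FIFO_FULL   (1u << 0)
#define RX_FIFO_EMPTY  (1u << 1)
#define RX_FIFO_AF     (1u << 2)
#define TX_FIFO_FULL   (1u << 3)
#define TX_FIFO_EMPTY  (1u << 4)
#define TX_FIFO_AF     (1u << 5)

/* Hypothetical software model of one direction of a FIFOLink. */
typedef struct {
    uint32_t data[FIFO_DEPTH];
    uint32_t head;      /* index of the oldest stored word */
    uint32_t count;     /* number of stored words */
    uint32_t af_level;  /* programmed almost-full threshold */
} mock_fifo;

void mock_fifo_reset(mock_fifo *f, uint32_t af_level)
{
    assert(af_level < FIFO_DEPTH);
    f->head = 0;
    f->count = 0;
    f->af_level = af_level;
}

/* Status bits as the sending processor would see them. */
uint32_t mock_tx_status(const mock_fifo *f)
{
    uint32_t s = 0;
    if (f->count == FIFO_DEPTH) s |= TX_FIFO_FULL;
    if (f->count == 0)          s |= TX_FIFO_EMPTY;
    if (f->count > f->af_level) s |= TX_FIFO_AF;
    return s;
}

/* Status bits as the receiving processor would see them. */
uint32_t mock_rx_status(const mock_fifo *f)
{
    uint32_t s = 0;
    if (f->count == FIFO_DEPTH) s |= RX_FIFO_FULL;
    if (f->count == 0)          s |= RX_FIFO_EMPTY;
    if (f->count > f->af_level) s |= RX_FIFO_AF;
    return s;
}

/* Handshaking write: succeeds only when the transmit FIFO is empty.
 * Returns 1 on success and 0 where real code would keep polling. */
int mock_try_emit(mock_fifo *f, uint32_t word)
{
    if (!(mock_tx_status(f) & TX_FIFO_EMPTY))
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return 1;
}

/* Handshaking read: succeeds only when the receive FIFO is not empty. */
int mock_try_key(mock_fifo *f, uint32_t *word)
{
    if (mock_rx_status(f) & RX_FIFO_EMPTY)
        return 0;
    *word = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

On the real board the sender would spin in a while-loop on the status port instead of returning 0; the model returns immediately so that a single-threaded test can exercise both ends of the link.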

Fig. 3.6: Transmission mechanism in the existing simulator (from [32]).

3.4.3 Transmission Mechanism [32]
Fig. 3.6 shows the transmission mechanism of the existing simulator [32]. When we initialize the DSPs, the SNR, the initial value of the scrambling codes, and the channel case are set in the controller PC and downloaded to CPU0. We use FIFOLink single-word mode to transmit this information to the modulator in CPU1. The input to the modulator from the preceding board is received using FIFOPort almost-full mode. We pass the output to the next processor after finishing one slot. In full mode, the FIFOLink can transmit up to 512 32-bit integers at a time. During the process, a stop signal can be transmitted through all processors. The stop command is given by the first processor or by the host PC. We use single-word mode to pass this signal through the FIFO.

Table 3.4: Unlucky Style

Style 1
  noise.h & channel.c:     short noisereal[298801]={...}; short noiseim[298801]={...}; j=(slot*10240+...)%298801;
  data.h & modulation.c:   unsigned short input[330]={...}; index=count*10; j=(input[330+count]...;
  CCS:                     Can Run Order
  MPO:                     Can Not Begin
Style 2
  noise.h & channel.c:     short noisereal[11]={...}; short noiseim[11]={...}; j=(slot*10240+...)%11;
  data.h & modulation.c:   unsigned short input[330]={...}; index=count*10; j=(input[330+count]...;
  CCS:                     Can Run
  MPO:                     Can Not Finish

3.5 Coding Style
In this section, we compare two coding styles with respect to whether the multiprocessor programs can be executed synchronously and smoothly through CCS and MPO. The outcome turns out to be sensitive to the way arrays are declared.

3.5.1 Unlucky Style
In the existing simulator, we found an unlucky coding style, shown in Table 3.4. This style declares arrays of many elements with assigned initial values. It occurs in two header (.h) files, noise.h and data.h. As introduced previously, the existing simulator stores 298801 complex-valued noise samples in noise.h. In addition, 330 data elements for the modulation code are stored in data.h. With this style, the programs fail to run smoothly through MPO, and through CCS they run only when started in a particular order.

3.5.2 Lucky Style
As shown in Table 3.5, we also found a lucky style, which declares arrays of many elements without assigned values, or arrays of fewer elements with assigned values. With this style, the programs run smoothly through MPO. However, adopting it may require large changes to the implementation of the existing simulator. In the next section, we find a solution that does not require changing the code.
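The difference between the two styles can be sketched in C as follows. The declarations mirror those in Tables 3.4 and 3.5, but the loading function, its parameters, and the index helper are hypothetical: the text does not say how an uninitialized table would be filled, and the sample offset `k` stands in for the term that is elided in the original expression. The comment about `.cinit` reflects our understanding of TI’s COFF toolchain and should be read as an assumption rather than a diagnosis of the failure.

```c
#include <assert.h>
#include <stddef.h>

#define NOISE_LEN 298801u

/* "Lucky" style: large arrays declared without initializers, so the program
 * image carries storage only. (Assumption: with initializers, TI's COFF tools
 * place the initial values in the .cinit section, which must then be loaded.) */
short noisereal[NOISE_LEN];
short noiseim[NOISE_LEN];

/* Hypothetical run-time fill: copy n samples, e.g. received from the host,
 * into the tables starting at 'offset'. Returns the number of samples stored. */
size_t load_noise(const short *re, const short *im, size_t n, size_t offset)
{
    size_t i;
    assert(re != NULL && im != NULL);
    if (offset >= NOISE_LEN)
        return 0;
    if (n > NOISE_LEN - offset)
        n = NOISE_LEN - offset;      /* clip at the end of the table */
    for (i = 0; i < n; i++) {
        noisereal[offset + i] = re[i];
        noiseim[offset + i]   = im[i];
    }
    return n;
}

/* Wrap-around index into the noise table, in the spirit of
 * j=(slot*10240+...)%298801; 'k' is a hypothetical per-sample offset. */
size_t noise_index(unsigned long slot, unsigned long k)
{
    return (size_t)((slot * 10240ul + k) % NOISE_LEN);
}
```

The trade-off is that the table contents must now reach the DSP by some other route at start-up, which is why this style may change the existing implementation considerably.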

Table 3.5: Lucky Style

Style 3
  noise.h & channel.c:     short noisereal[11]={...}; short noiseim[11]={...}; j=(slot*10240+...)%11;
  data.h & modulation.c:   unsigned short input[33]={...}; index=count*1; j=(input[3+count]...;
  CCS:                     Can Run
  MPO:                     Can Run
Style 4
  noise.h & channel.c:     short noisereal[298801]; short noiseim[298801]; j=(slot*10240+...)%298801;
  data.h & modulation.c:   unsigned short input[330]; index=count*10; j=(input[330+count]...;
  CCS:                     Can Run
  MPO:                     Can Run

3.6 Tools Compatibility
Besides the coding style discussed in the previous section, we arrive at a solution for synchronized execution by the end of this section. Before reaching it, we ran into problems of tool compatibility.

3.6.1 Two Issues in MPO and CCS
There are two issues related to the Quatro6x platform: one concerns MPO and the other concerns CCS. Fig. 3.7 shows issue 1, in MPO. The .OUT or .MPO file cannot be run through UniTerminal alone. The failure seems sensitive to two factors: whether .cinit is placed in DRAM or SDRAM, and whether the array size is less or more than 500. Fig. 3.8 shows issue 2, in CCS. The message “TP>> internal error: bad type: TYPE::type_qualified()” appears repeatedly during the compile phase in the newer environment, CCS2.0. Although the internal error message appears, the compile phase reports zero errors. We reported the two issues to the vendors by email, and the vendor II suggested that we migrate the platform to a new version. Refer to the appendix for details.

3.6.2 Migration to the New Version
As discussed previously, our application still encountered the two issues on the old version of the platform. When we tried to migrate to the new version CCS2.2, or to several different combinations of the TI and II toolsets, many additional compile errors or link errors often occurred. They occurred also for the vendor’s board-level example program TEST, and

Fig. 3.7: A simple code suffering from issue 1 in MPO.

Fig. 3.8: A simple code suffering from issue 2 in CCS.

even for very simple code such as a summation from 1 to 100. The problem is sensitive to the combination of tool versions. After trying various combinations, we found one that is free of these errors: the II 2.97 installation CD together with TI’s CCS2.1 with the patch from CCS2.0. On this platform, the programs run smoothly through both CCS and MPO. That is, our multi-DSP application runs through CCS without specifying the startup order, and synchronized execution through MPO also works. Therefore, we migrate to this specific platform, which is free of the tool-compatibility issues. Note that the memory may not be large enough to hold both the platform software and the application. Hence, one should take care of memory allocation in the linker command file to avoid the problem that memory cannot be allocated.

Chapter 4
Efficient Multiprocessing on Quatro6x

4.1 Overview
In general, when a system is partitioned over n processors, we hope to obtain n times the speed of a single processor. In practice, however, it is not so simple. For example, the three individual parts (channel, modulation, and matched filter) in the existing implementation run in about 7.5, 10.2, and 8.3 ms per frame, respectively, but the actual run time is about 20 ms per frame after we connect the three parts and download them to the Quatro6x DSP board [32]. Why does the run time increase after they are connected? In the early part of this chapter, we profile the unexpected time consumption to identify the actual performance. In the latter part of the chapter, we point out the care that must be taken in applying the double buffering scheme.

4.2 Structure of DSP Partition
Tsai [32] implements the sequential-10240 structure in the existing simulator. In this section, we discuss several structures of DSP partition: the parallel structure, sequential-512, and pipelining-512. We then concentrate on implementing the pipelining-512 structure.

Fig. 4.1: Interpolated filtering system with the cascade of pulse shaping filter, channel, and matched filter.

4.2.1 Parallel Structure
In this subsection, we discuss a parallel structure that uses the polyphase technique. However, we abandon this idea in order to keep flexibility and because there are not enough FIFOPorts.

Polyphase with three functional blocks?
As described in chapter 2, Figures 2.5 and 2.6 depict the 4-polyphase implementation of the pulse shaping filter. Similarly, let us consider the four functional blocks shown in Fig. 4.1. As shown in Fig. 4.2, the question is: can we combine the pulse shaping filter with the following channel model and matched filter into one block and apply the polyphase technique? The basic idea is correct, but it loses some flexibility. If we do this, we cannot observe the signal at the output of each functional block. For example, we cannot obtain the oversampled, transmit-filtered signal, since the three functional blocks have been combined. In particular, if we combine all filters into one block, we lose the flexibility of simulating different kinds of channels, especially time-varying channels. Moreover, since the transmitter and receiver filters belong to the transmitter and the receiver, respectively, they have to be implemented separately in practical systems anyway. Therefore, we keep these functional blocks separate.

Quatro6x DSPs parallel interconnection?
In the existing implementation, the pulse shaping filter, the channel model, and the matched filter are placed in three different DSPs that are connected in series. If the four DSPs were instead connected in parallel and each DSP included the channel model, the pulse shaping filter, and

the matched filter, then the latency might almost vanish, because no functional block would span an interprocessor link. However, the scheduling may be a problem. We are not sure whether the parallel interconnection would be better; the computational complexity would have to be checked to see whether it is feasible. In any case, only three FIFOPorts exist for the four DSPs on the Quatro6x board (DSP0 has no FIFOPort), so we abandon the idea of a parallel connection.

Fig. 4.2: Polyphase technique applied to the cascade of pulse shaping filter, channel, and matched filter.

4.2.2 Sequential Structure
In this subsection, we discuss the sequential structure, which uses blocking mode DMA.

Blocking mode DMA
As discussed in chapter 3, the DMA functions we use are
    dma_mem_to_port(int channel, int* src, int* dest, int count, int block),
    dma_port_to_mem(int channel, int* src, int* dest, int count, int block),
    dma_copy_mem(int channel, int* src, int* dest, int count, int block).

If “block” is true, the function waits until the move is completed before execution proceeds. This is called blocking mode DMA.

Sequential-10240 structure
In the previous implementation, Tsai [32] employs the sequential-10240 structure to process one slot of information at a time, that is, 10240 samples per slot. As discussed previously, the FIFOLink buffer holds up to 512 samples. In the sequential-10240 structure, the channel model processor repeats a blocking mode DMA through the FIFO 20 times to get one slot of input, then computes, and then repeats a blocking mode DMA through the FIFO 20 times to output the result. As a consequence, the latency is large.

Sequential-512 structure
If we process 512 samples at a time, we hope to cut the latency down to 1/20. In addition, the memory usage is reduced. In the sequential-512 structure, each time the channel model processor uses a blocking mode DMA to input 512 samples through the FIFO, then computes, and then uses a blocking mode DMA to output 512 samples through the FIFO.

4.2.3 Pipelining Structure
We propose a pipelining structure that uses non-blocking mode DMA and double buffering. Part of the idea comes from the ’C6x pipeline operation [40] and from software pipelining [43], [39].

Terms of pipelining
Three basic terms are common in software pipelining: prolog, loop kernel, and epilog. The first stage, the prolog, contains the instructions that build up to the second stage, the loop kernel, and the epilog contains the instructions that finish all loop iterations [8]. In our pipelining-512 structure, we want to pipeline FIFO-buffered blocks of 512 samples. Although

this block size differs from the instruction cycle of software pipelining, the terms prolog, loop kernel, and epilog still apply. We give a comparison in Table 4.1.

Non-blocking mode DMA
As discussed previously, the DMA transfer functions we use are
    dma_mem_to_port(int channel, int* src, int* dest, int count, int block),
    dma_port_to_mem(int channel, int* src, int* dest, int count, int block),
    dma_copy_mem(int channel, int* src, int* dest, int count, int block).
If “block” is false, program execution continues immediately after the DMA operation starts. This is the non-blocking mode. Inside each DMA transfer function there is a while-loop that checks the dma_done status. The dma_done command reports whether the DMA operation has finished; it returns true if the DMA channel is available. In non-blocking mode, this while-loop inside the DMA support library is skipped. We can instead put the while-loop that checks dma_done outside the DMA transfer function. Then we insert the code of the desired pipelining-512 loop body between the non-blocking call of the DMA transfer function and the while-loop that checks the DMA completion status.

Pipelining-512 structure
As shown in Fig. 4.3 and Table 4.2, five operations are used for the channel simulator:
1. dma_copy_mem moves in the real part of the noise from external memory, without FIFO,
2. dma_copy_mem moves in the imaginary part of the noise from external memory, without FIFO,
3. dma_port_to_mem moves in the nth-512 F_out via FIFO from the modulator,
4. the CPU computes the (n − 1)th-512 F_out according to the channel model to form Path, and
5. dma_mem_to_port moves out the (n − 2)th-512 Path via FIFO to the matched filter.

Fig. 4.3: Operations of one computing and four movement used for the channel simulator.

We assign the four independent DMA channels to the above four kinds of DMA transfer, respectively. Based on non-blocking DMA, we apply dual buffers for F_out and for Path, respectively, namely LF, RF, LP, and RP as shown in Table 4.2. One buffer is used for computing while the other is used for moving data. A subtle issue of double buffering that requires attention is discussed in the latter part of this chapter.

Table 4.1: Pipelining Structure, Software Pipelining and ’C6x Pipeline Operation

          Our Pipelining-512 Structure            Software Pipeline and ’C6x Pipeline
Cycle     Block-based 512 samples                 Clock-based instructions
Device    Multiprocessors of three DSPs           Eight functional units of CPU core
Stage     Prolog, loop, and epilog                Prolog, loop, and epilog
Phase     Move in, computing, and move out        Fetch, decode, and execute
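The prolog, loop, and epilog behaviour of this double-buffered pipeline can be sketched as a small scheduling function. This is an illustrative host-side model under stated assumptions, not the DSP code: the names `pipe_stage`, `schedule`, `buffer_of`, and `total_cycles` are invented here, block numbers start at 1, a value of 0 means "no operation", and odd-numbered blocks are assumed to use the L buffers and even-numbered blocks the R buffers.

```c
#include <assert.h>

/* Which 512-sample block each activity handles in one block cycle.
 * A value of 0 means "no operation" in that cycle. */
typedef struct {
    int move_in;   /* block being moved in from the modulator by DMA */
    int compute;   /* block being processed by the channel CPU */
    int move_out;  /* block being moved out to the matched filter by DMA */
} pipe_stage;

/* Buffer used by a block: 'L' for odd-numbered blocks, 'R' for even ones. */
char buffer_of(int block)
{
    assert(block >= 1);
    return (block % 2) ? 'L' : 'R';
}

/* Activities in block cycle 'cycle' (1-based) when nblocks blocks are pipelined.
 * Computation lags the input by one cycle, output lags it by two. */
pipe_stage schedule(int cycle, int nblocks)
{
    pipe_stage s;
    assert(cycle >= 1 && nblocks >= 1);
    s.move_in  = (cycle <= nblocks) ? cycle : 0;
    s.compute  = (cycle >= 2 && cycle <= nblocks + 1) ? cycle - 1 : 0;
    s.move_out = (cycle >= 3 && cycle <= nblocks + 2) ? cycle - 2 : 0;
    return s;
}

/* Two extra cycles are needed to fill (prolog) and drain (epilog) the pipeline. */
int total_cycles(int nblocks)
{
    return nblocks + 2;
}
```

With 10 blocks, `schedule(3, 10)` reports block 3 moving in, block 2 being computed, and block 1 moving out, and the pipeline drains in cycle 12. Because the block being computed and the block being moved always have different parity, they never share a buffer, which is the point of double buffering.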

Table 4.2: Scheduling Table of Our Pipelining-512 Structure

Block Cycle    Move In F_out from Modulator    Channel CPU Processing F_out to form Path    Move Out Path to Matched Filter
1              1st-512 @LF                     No operation                                 No operation
2              2nd-512 @RF                     1st-512 @LF → LP                             No operation
3              3rd-512 @LF                     2nd-512 @RF → RP                             1st-512 @LP
···            ···                             ···                                          ···
n              nth-512 @LF                     (n − 1)th-512 @RF → RP                       (n − 2)th-512 @LP
···            ···                             ···                                          ···
10             10th-512 @RF                    9th-512 @LF → LP                             8th-512 @RP
11             No operation                    10th-512 @RF → RP                            9th-512 @LP
12             No operation                    No operation                                 10th-512 @RP

4.3 Actual Speed Observation
We insert uclock commands into the DSP programs to observe the actual run time in detail. This helps to locate the unexpected time consumption.

4.3.1 Observation on Single DSP Chip
As shown in Table 4.3, we observe one individual program at a time on the single processor DSP0, because the uclock command is available only on the first processor, DSP0. There are several versions of the channel, modulation, and receiver programs. One key piece of information in the table is that the scrambling code generation takes around 29 ms per run; it belongs to the overhead, not to the looping part. Looking into the subroutine, we see some if-else statements inside a for-loop that iterates many times. This may be why the routine is so time-consuming.

4.3.2 Observation of Overall Connection Speed
Table 4.4 shows the overall speed after the three DSP programs (channel, modulation, and receiver) are connected. There are several versions of the programs, corresponding to the different structures, such as sequential and pipelining. We are interested in the time consumption per slot because it belongs to the looping part. In the 3GPP standard, a frame of 15 slots

has a length of 10 ms. As discussed previously, with a FIFO size of up to 512 samples, we need to move 512 samples 20 times for one slot of data. The real-time criterion is therefore about 666 µs per slot, that is, about 33 µs per 512 samples. However, the best measured speed is about 1 ms per slot, which is about 1.5 times the criterion. We examine the actual speed more closely in the following subsections.

Table 4.3: Observation on Single DSP (ms @ cpu0)

Column labels: Channel | Modulate | scramble | 33-tap | 9-tap | Remark
Row labels: TEST SPEED (ms/frame); DEMO (ms/frame); MULTIUSER (ms/frame); SYSTEM (ms/frame); SINGLE USER (ms/frame); CHANGING (ms/frame); Table 2.6
Values:
14.255 48.03 29.282 8.719 (14.255) (10.685) +8.063 (8.719)
16.269 51.379 28.445 237.325 17.458 (8.134) (11.467) (118.662) (8.729)
9250 51.307 17.458 16.268 (8.134) 16.27 (8.135) 15.037 (7.519) (6.67∼11.72)
249.359 (112.03) 51.398 (11.467) 54.19 (≈ 12.6) (10.24)
29.3 22.384 +6.073 ≈ 29
Remark: 1frame/run; case3 2frame/run; noise by log, cos; 2frame/run
17.458 (8.729) 237.325 17.458 20s/200run (118.662) (8.729) with Rake 237.351 17.458 251,92,75 (118.676) (8.729) in Table 4.4 (8.346)

Table 4.4: Observation on Overall Connection Speed

Modulation / Channel | Receiver | Run | Scramble | Slot
pipelining-512 | 33-tap sequential-512 | 251 ms | 28.9 ms | 7421 µs
sequential-10240 | 9-tap sequential-512 | 110 ms | 30.5 ms | 2637 µs
pipelining-512 | 9-tap sequential-512 | 92 ms | 29.9 ms | 2096 µs
pipelining-512 | 9-tap pipelining-512 | 75 ms | 29.9 ms | 1521 µs
pipelining-512 (few uclock calls) | 9-tap pipelining-512 | 64 ms | 29.9 ms | 1170 µs

4.3.3 Observation by Changing Segments of Program

This subsection presents observations made by changing segments of the channel program.
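As a cross-check of the real-time criterion in Section 4.3.2, the arithmetic can be written out in a few lines of Python. This is only an illustrative calculation. The constants come from the text (10 ms frame, 15 slots, 20 moves of 512 samples per slot) and the slot times from Table 4.4, while the variable names are hypothetical.

```python
FRAME_US = 10_000.0        # one 3GPP frame lasts 10 ms
SLOTS_PER_FRAME = 15
BLOCKS_PER_SLOT = 20       # 512-sample moves per slot, as stated in the text

slot_budget_us = FRAME_US / SLOTS_PER_FRAME
block_budget_us = slot_budget_us / BLOCKS_PER_SLOT

# Slot times from Table 4.4 in microseconds, listed in row order.
measured_slot_us = [7421, 2637, 2096, 1521, 1170]
ratios = [t / slot_budget_us for t in measured_slot_us]

print(f"budget: {slot_budget_us:.1f} us/slot, {block_budget_us:.1f} us/block")
for t, r in zip(measured_slot_us, ratios):
    print(f"{t} us/slot is {r:.2f} times the budget")
```

The budget comes to about 666.7 µs per slot and 33.3 µs per 512-sample block. With the rounded figure of 1 ms per slot the ratio is 1.5, and with the 1170 µs entry of Table 4.4 it is closer to 1.75, so every listed configuration is still slower than real time.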

Fig. 4.4: Case A0 program segment.

(A0) When the channel program runs together with the modulation and 9-tap receiver programs

As shown in Fig. 4.4, the measured time difference between t5 and t6 is about 21 µs. Does it come from the computing time of the first for-loop itself? The channel model program performs complex multiplications of the input data by the channel coefficients. For a single path, each point needs four real multiplications: two for the real part and two for the imaginary part. The 167 MHz TMS320C6701 DSP chip has two units that perform multiplication and six units that perform addition. With a clock cycle of about 6 ns, one multiplication therefore takes between 3 ns and 6 ns on average, depending on the utilization of the two multipliers.
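The multiplication count and the per-multiplication time can be made concrete with a small sketch. The Python code below is illustrative only and is not the DSP code. The function name `complex_multiply` is hypothetical, and the timing figures assume that each multiplier unit completes one multiplication per clock cycle on average.

```python
def complex_multiply(x_re, x_im, h_re, h_im):
    """Multiply (x_re + j*x_im) by (h_re + j*h_im) with four real products."""
    y_re = x_re * h_re - x_im * h_im   # two multiplications for the real part
    y_im = x_re * h_im + x_im * h_re   # two multiplications for the imaginary part
    return y_re, y_im


CLOCK_MHZ = 167.0
cycle_ns = 1000.0 / CLOCK_MHZ            # about 6 ns per clock cycle

# Assumption: one multiplication per multiplier unit per cycle on average.
ns_per_mult_one_unit = cycle_ns          # only one multiplier kept busy
ns_per_mult_two_units = cycle_ns / 2.0   # both multipliers kept busy

print(complex_multiply(1, 2, 3, 4))
print(round(ns_per_mult_one_unit, 2), round(ns_per_mult_two_units, 2))
```

The two timing values bracket the 3 ns to 6 ns range quoted in the text. Where the actual code falls in that range depends on how well the compiler schedules both multipliers.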

Fig. 4.5: Case A1 program segment.

The 21 µs between t6 and t7 is reasonable for the computing time of the second for-loop. In the second loop of 497 points, with two paths, we need 4 × 497 × 2 = 3976 multiplications. At 6 ns per multiplication, this gives 3976 × 6 = 23856 ns, which is close to 21 µs. Similarly, the 1 µs between t7 and t8 is reasonable for the third for-loop of 15 points, which computes the second path alone: it needs 4 × 15 × 1 × 6 = 360 ns, which is below the 1 µs resolution of the uclock command. By the same estimate, the first for-loop of 15 points, which computes the first path alone, should take about 1 µs rather than the measured 21 µs. Where does the 21 µs between t5 and t6 come from?
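The estimates above follow one formula, which the sketch below writes out. It is a back-of-the-envelope model in Python, assuming 6 ns per multiplication and counting multiplications only. The helper name `loop_time_ns` is hypothetical.

```python
MULTS_PER_POINT_PER_PATH = 4   # four real multiplications per complex product
NS_PER_MULT = 6                # assumed cost of one multiplication


def loop_time_ns(points, paths, ns_per_mult=NS_PER_MULT):
    """Estimated loop time when only the multiplications are counted."""
    return MULTS_PER_POINT_PER_PATH * points * paths * ns_per_mult


first_loop_ns = loop_time_ns(15, 1)     # prolog: first path only
second_loop_ns = loop_time_ns(497, 2)   # main loop: both paths
third_loop_ns = loop_time_ns(15, 1)     # epilog: second path only

measured_first_loop_ns = 21_000         # t5 to t6 in case A0
unexplained_ns = measured_first_loop_ns - first_loop_ns

print(first_loop_ns, second_loop_ns, third_loop_ns, unexplained_ns)
```

On this model the first loop accounts for only 360 ns of the measured 21 µs, which leaves more than 20 µs that the arithmetic alone does not explain.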

(A1) When the first and second for-loops are removed

In case A1, shown in Fig. 4.5, we remove the first for-loop, the second for-loop, and the t5 and t6 measurements, but keep the third for-loop in the program. The time difference between t7 and t8 then becomes 20 µs instead of the previous 1 µs. Comparing the 21 µs between t5 and t6 in case A0 with the 20 µs between t7 and t8 in case A1 suggests that the unexplained time appears at the command immediately following the non-blocking dma_port_to_mem(0,......,0) call.

(A2) When dma_port_to_mem is changed to blocking mode

As shown in Fig. 4.6, case A2 is modified from case A1 only by changing dma_port_to_mem to blocking mode. The time difference between t4 and t7 then becomes 21 µs in case A2, up from 2 µs in case A1, and the time difference between t7 and t8 becomes 1 µs in case A2, down from 20 µs in case A1. This suggests that the unexplained time, such as the 20 µs in case A1, is correlated with the non-blocking mode of dma_port_to_mem. Moreover, we learn that dma_port_to_mem in blocking mode takes about 21 µs to move 512 32-bit integers through the bursting FIFO buffer.

(A3) When dma_mem_to_port is changed to blocking mode

As shown in Fig. 4.7, case A3 is modified from case A2: the dma_mem_to_port between t2 and t3 is changed to blocking mode. The time difference between t2 and t3 is then 19 µs, and the time difference between t4 and t7 is 21 µs. This confirms that dma_mem_to_port or dma_port_to_mem in blocking mode takes around 20 µs to move 512 32-bit integers through the bursting FIFO.

(A4) When both the first and the second for-loops are restored

As shown in Fig. 4.8, case A4 is modified from case A3: both the first for-loop (prolog, 15 points) and the second for-loop (looping, 497 points) are restored. Then blocking-mode dma_mem_to_port takes 18 µs between t2 and t3, and blocking-mode

dma_port_to_mem takes 21 µs between t4 and t5. The 21 µs between t6 and t7 is reasonable for the second for-loop, which computes two paths of 497 points. Note that the time difference between t5 and t6 is 2 µs after the blocking-mode DMA in case A4, rather than the 21 µs seen after the non-blocking-mode DMA in case A0.

Fig. 4.6: Case A2 program segment.

4.3.4 Observation by Removing Data Transfer Commands

As shown in Table 4.5, from case B0 through case B5, we observe the time consumption while gradually removing the commands used for data transfer.

(B0) When the channel program runs together with the modulation and 9-tap receiver programs

The actual run time is about 63.73 ms per run after we connect the three main parts (channel, modulation, and 9-tap receiver) and download them to the Quatro6x DSP board.
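The intervals measured in cases A2 to A4 can be combined into a rough per-block estimate. The Python sketch below is illustrative arithmetic, not a measurement. It assumes that the two blocking DMA transfers and the three for-loops make up the whole cost of one 512-sample block, ignoring handshaking and any other transfers, and that an ideal pipeline overlaps the transfers completely with the computation. The 1 µs for the third loop is taken from case A2. All names are hypothetical.

```python
# Intervals measured with uclock, in microseconds.
DMA_MEM_TO_PORT_US = 18    # blocking, t2 to t3 in case A4
DMA_PORT_TO_MEM_US = 21    # blocking, t4 to t5 in case A4
LOOP_US = [2, 21, 1]       # first, second, and third for-loop

BLOCKS_PER_SLOT = 20
SLOT_BUDGET_US = 10_000 / 15   # 10 ms frame divided into 15 slots

compute_us = sum(LOOP_US)

# Everything in series, as with blocking DMA.
serial_block_us = DMA_MEM_TO_PORT_US + DMA_PORT_TO_MEM_US + compute_us

# Idealized overlap: the slowest activity sets the pace.
overlapped_block_us = max(DMA_MEM_TO_PORT_US, DMA_PORT_TO_MEM_US, compute_us)

serial_slot_us = serial_block_us * BLOCKS_PER_SLOT
overlapped_slot_us = overlapped_block_us * BLOCKS_PER_SLOT

print(serial_slot_us, overlapped_slot_us, round(SLOT_BUDGET_US, 1))
```

Under these assumptions a fully serialized block costs 63 µs, or 1260 µs per slot, which exceeds the 666.7 µs budget. A perfectly overlapped block would cost 24 µs, or 480 µs per slot. The second figure is an idealized bound rather than an expected result, but it shows why overlapping DMA transfers with computation matters for reaching real-time speed.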

Fig. 4.7: Case A3 program segment.

(B1) When the commands starting with “while(!(get_fifo_link_status” are removed

In case B1, we remove the commands starting with “while(!(get_fifo_link_status”, and only the single DSP with the channel program is run. The run time then drops from 63.73 ms to 31.24 ms per run. Surprisingly, the handshaking waits around 30 ms for the scramble code generation to be ready. However, the unexplained 20 µs between t5 and t6 after the non-blocking-mode DMA remains.

(B2) When the “dma_port_to_mem” commands are removed

In case B2, we remove the “dma_port_to_mem” commands from case B1. Again, only the single DSP with the channel program is run. The run time then drops from 31.24 ms to 27.91 ms per run. Moreover, the time difference between t5 and t6 falls to 2 µs in case B2. Notably, the unexplained time consumption in the first for-loop between t5 and t6 almost vanishes: it drops by 18 µs, since the non-blocking

Fig. 4.8: Case A4 program segment.