Area-Efficient Design and Implementation of Deep-Pipeline Latency Datapath


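National Chiao Tung University, Department of Electronics Engineering and Institute of Electronics, Master's Thesis. Area-Efficient Design and Implementation of Deep-Pipeline Latency Datapath. Student: Chin-Te Lu (呂進德). Advisor: Chih-Wei Liu (劉志尉). November 2008.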


Area-Efficient Design and Implementation of Deep-Pipeline Latency Datapath

Student: Chin-Te Lu
Advisor: Dr. Chih-Wei Liu

A Thesis Submitted to the Department of Electronics Engineering and Institute of Electronics, College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering.

November 2008, Hsinchu, Taiwan, Republic of China


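Area-Efficient Design and Implementation of Deep-Pipeline Latency Datapath. Student: Chin-Te Lu. Advisor: Dr. Chih-Wei Liu. Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University.

ABSTRACT (translated from the Chinese abstract)

The datapath is usually the part of a processor that most strongly affects its performance. Its configuration and design vary with application requirements. In general, for high-performance processors such as the Intel Pentium and the IBM Cell, designers use a variety of VLSI techniques to push the datapath operating frequency as high as possible. For lightweight applications such as embedded systems, on the other hand, the datapath is optimized for low power and small chip area. The same instruction set architecture (ISA) therefore calls for different datapath designs in different applications. To address this, this thesis proposes a design flow that automatically synthesizes an area-efficient datapath for a given performance requirement. The area-efficient datapath generator performs two steps: optimization in the spatial dimension and optimization in the temporal dimension. It can reuse the instruction set of an existing high-performance processor, such as the IBM Cell, together with its development software and application programs, and it systematically optimizes the processor datapath for the performance the application requires. Efficient use of the spatial dimension means sharing datapath resources, and it covers function modeling and cycle-accurate modeling. In the temporal dimension, we analyze instruction latency and systematically build mathematical equations to obtain the minimum-area micro-architecture. Taking the Cell SPU (Synergistic Processor Unit) datapath as a design example, we use the proposed flow to analyze the ISA and find the most area-efficient micro-architecture. Experiments show that, for datapaths of embedded microprocessors targeting 100 MHz to 800 MHz, the proposed flow reduces area by about 20% compared with automated tools. Using this flow, we implemented an SPU digital signal processor in a UMC 90 nm process; the chip area is 2.5 mm × 2.5 mm and the operating frequency is 400 MHz.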


Area-Efficient Design and Implementation of Deep-Pipeline Latency Datapath. Student: Chin-Te Lu. Advisor: Dr. Chih-Wei Liu. Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University.

ABSTRACT

The datapath is usually the most critical element affecting processor performance, and its allocation and design depend on the application requirements. Generally speaking, for high-performance processors such as Intel's Pentium and IBM's Cell, designers raise the operating frequency as far as possible with a broad range of VLSI techniques. For lightweight applications such as embedded systems, in contrast, the goal of datapath design is low power and small chip area. The same instruction set architecture (ISA) can therefore be implemented in different ways for different application requirements. This thesis proposes a design flow that automatically generates an area-efficient datapath for a given application requirement. The area-efficient datapath generator works in two phases, spatial optimization and temporal optimization. It systematically develops and optimizes the processor datapath while reusing the ISA of a high-performance processor such as IBM's Cell, together with its software toolchain and application programs. Spatial optimization means efficient utilization of resources in the spatial domain and covers function modeling and cycle-accurate design. Temporal optimization explores instruction latency and systematically builds a mathematical formulation to obtain the optimal micro-architecture. We take the Cell synergistic processor unit (SPU) as our datapath design example, analyze the optimization space of the SPU ISA implementation, and find the area-efficient micro-architecture with the proposed design flow.

In the experiments, the micro-architecture produced by the proposed design flow improves area by about 15-20% compared with using CAD tools alone, for datapaths of embedded processors targeting 100 MHz to 800 MHz. Finally, we use the design flow to implement the SPU DSP in the UMC 90 nm 1P9M CMOS process. The silicon area is 2.5 mm × 2.5 mm and the clock rate is 400 MHz.


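ACKNOWLEDGEMENTS

My time as a graduate student has passed in the blink of an eye. Over these two years, the help and encouragement of many people made it possible for me to complete my master's degree, and I offer them my deepest gratitude. First, I thank Professor Chih-Wei Liu (劉志尉) for his enthusiastic guidance in both professional knowledge and research attitude, which helped me mature in both respects; his rich learning and scholarly bearing have benefited me greatly. I especially thank Professors 任建葳, 周景揚, and 周世傑 for taking time out of their busy schedules to serve on my oral defense committee and for their valuable comments, which made this thesis more complete. I also thank my senior 林泰吉 for tirelessly guiding my research step by step and for cultivating in me the proper attitude toward research; my senior 歐士豪 for clearing up many detailed questions; and my classmate 林彥呈 and junior 甘禮源 for their comments, discussions, and extensive help with my research. I thank the seniors, classmates, and juniors of the laboratory. Thanks to 陳信凱, 郭羽庭, 林禮圳, 林佑昆, and 張彥中 for their help and encouragement throughout my graduate studies. Thanks to 張國強, 莊明勳, 葉世賢, 吳聲昀, 張雅婷, and 蔡安綺 for all their help with the research work. Thanks to 顏于凱, 洪正堉, 李岳泰, and 張巍瀚, who worked hard alongside me; over these two years we burned the midnight oil together and shared the joy of our research results. Finally, I thank my dearest family. Dad, Mom, and my sister: thank you for your support and encouragement all along the way. Without you I would not be who I am today. I love you. I dedicate this thesis to everyone who has supported and helped me, with heartfelt thanks and best wishes.

Chin-Te Lu, Hsinchu, winter 2008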


CONTENTS

1. INTRODUCTION
   1.1. Motivation
   1.2. Problem Description and Distribution
   1.3. Thesis Organization
2. BACKGROUND
   2.1. Cell Broadband Engine Architecture
   2.2. SPU Instruction Set Architecture
   2.3. SPU Micro-Architecture
3. DESIGN & OPTIMIZATION FLOW OF DEEP-PIPELINE LATENCY DATAPATH
   3.1. Spatial Optimization
        3.1.1 Function Modeling
        3.1.2 Cycle-Accurate Modeling
   3.2. Temporal Optimization
   3.3. Experimental Results
4. SILICON IMPLEMENTATION
   4.1. Implementation Design Flow
   4.2. Implementation Result
5. CONCLUSION & FUTURE WORKS
REFERENCES


LIST OF FIGURES

Figure 1-1 Latency exploration
Figure 2-1 Die photo of Cell Broadband Engine
Figure 2-2 Block diagram of CBE processor
Figure 2-3 PPE block diagram
Figure 2-4 SPE architecture
Figure 2-5 SPU functional units
Figure 2-6 Instruction format
Figure 2-7 Register layout of data types and preferred scalar slot
Figure 2-8 Example of addition operation
Figure 2-9 Example of multiply operation
Figure 2-10 Example of form select mask for bytes operation
Figure 2-11 Example of shuffle bytes operation
Figure 2-12 SPU organization
Figure 2-13 SPU pipeline diagram
Figure 3-1 Overview of our proposed design flow
Figure 3-2 Functional modeling
Figure 3-3 Example of behavioral assignment in RTL
Figure 3-4 Add/Sub functional unit
Figure 3-5 Cycle-accurate modeling
Figure 3-6 Tolerable latencies
Figure 3-7 Forwarding network
Figure 3-8 SPU pipeline diagram
Figure 3-9 Temporal optimization
Figure 3-10 Function unit with 3-cycle latency
Figure 3-11 Piped S/R FU
Figure 3-12 Piped Shuffle FU
Figure 3-13 Piped MUL FU
Figure 3-14 Timing delay of one-stage pipelined datapath
Figure 3-15 Proposed design flow
Figure 3-16 Improvement by our proposed design flow
Figure 4-1 Implementation flow
Figure 4-2 Our SPU interface
Figure 4-3 Pipeline diagram of our SPU
Figure 4-4 Implementation result


LIST OF TABLES

Table 2-1 Binary values in register RC and byte results
Table 2-2 Dual issue unit assignments
Table 2-3 Unit and instruction latency
Table 3-1 Synthesis result of baseline
Table 3-2 Instruction latency
Table 3-3 Forwarding table of our SPU
Table 3-4 Piped S/R FU
Table 3-5 Piped Shuffle FU
Table 3-6 Piped MUL FU
Table 3-7 Number ID of functional unit
Table 3-8 Description of equation's parameter
Table 3-9 Synthesized result of baseline and spatial-optimized
Table 3-10 Comparison between baseline and spatial-optimized
Table 3-11 Improvement by spatial-optimized
Table 3-12 All cases for latency spec
Table 3-13 Temporal optimization
Table 3-14 Area reduction from temporal optimization
Table 4-1 Synthesis result


1. INTRODUCTION

System-on-a-chip (SoC) technology has advanced rapidly, and designers face many considerations, such as time-to-market, production cost, and operating speed. These translate into different performance requirements: low power for portable devices, small area, and high operating frequency for compute-intensive workstations. Meanwhile, software development is becoming increasingly expensive in many embedded systems, and developing software and hardware at the same time lengthens time-to-market. Motivated by this, we develop hardware for different performance requirements on top of existing software support. In this thesis, we focus on developing processors for different performance requirements while keeping the existing software unchanged, in order to shorten time-to-market, and we explore the micro-architecture optimization space of a specific ISA implementation.

1.1 Motivation

SoC applications impose increasingly varied performance requirements, such as low power, small area, and high operating frequency, and developing a separate design for each requirement is time-consuming. Meanwhile, software development is becoming increasingly expensive in embedded systems. We can, however, reuse existing software while developing hardware for the various performance requirements, which reduces the time-to-market (TTM). Using the same ISA with a suitable implementation also helps reduce design cost. The Cell Broadband Engine (CBE) is very popular and provides open and complete software support, so we consider developing hardware under its software support. Its datapath, however, is designed for extremely high performance, and there is a trade-off between performance and area. If we design hardware for lower performance than the Cell processor, for example targeting several hundred MHz, the original Cell datapath is not the most area-efficient choice. In other words, the same instruction set architecture (ISA) can be implemented in different ways for different performance requirements.

1.2 Problem Description and Distribution

With growing computing requirements, DSPs are becoming prevalent solutions in multimedia applications and telecommunications. To shorten time-to-market, we can develop processors for different performance requirements on top of an existing software toolchain, which saves the software development time needed to build DSPs for various applications. For example, the software toolchain of the well-known Cell Broadband Engine (CBE) is readily available, so a processor with its instruction set architecture (ISA) can be developed for various embedded applications. The ISA is the interface between hardware and software, and its implementation depends on the application requirements. That is, implementations of the same ISA have different micro-architecture designs for different target applications; in other words, different applications open up different optimization spaces. An example is a micro-architecture that targets several hundred MHz while implementing the same ISA and running binary-compatible software. The Cell SPU exposes long instruction latencies in order to reach high performance, and these long latencies can be exploited for datapath optimization, as shown in Figure 1-1. There are three ways to design the micro-architecture. We propose a two-phase design flow to design an area-efficient micro-architecture under this constraint.

Figure 1-1 Latency exploration.

In this thesis, we propose a two-phase area-efficient design flow for the ISA implementation of DSPs with binary-compatible software. We take the Cell SPU as our design example. Because the Cell SPU is a data-oriented processor, it clearly offers much more optimization space than control-oriented processors such as ARM processors. The proposed flow consists of spatial optimization and temporal optimization, and it provides a systematic way to design an area-efficient micro-architecture. Compared with an ad hoc method, the proposed design flow saves about 20% of area under timing constraints from 100 MHz to 800 MHz.

1.3 Thesis Organization

This thesis focuses on a two-phase systematic processor design flow consisting of spatial optimization and temporal optimization. It is organized as follows. Chapter 2 introduces the Cell SPU, covering the Cell Broadband Engine Architecture (CBEA), the Synergistic Processor Unit (SPU), the SPU instruction set architecture (ISA), and the SPU micro-architecture. Chapter 3 first describes the first phase of the design flow, spatial optimization, which comprises function modeling and cycle-accurate modeling. It then describes the second phase, temporal optimization, which is based on a mathematical formulation. The chapter ends with the experimental results of the proposed design flow. Chapter 4 presents the silicon implementation results obtained with the proposed design flow for a 400 MHz target. Finally, Chapter 5 concludes the thesis and points out directions for future research.

2. BACKGROUND

Contemporary DSP workloads are multimedia-rich and involve significant amounts of audio and video processing. The Cell Broadband Engine (CBE) processor provides high performance for applications in media-rich consumer-electronics devices. This chapter provides background information related to this thesis. Section 2.1 introduces the Cell Broadband Engine (CBE) and the synergistic processor unit (SPU). Sections 2.2 and 2.3 give an overview of the SPU instruction set architecture (ISA) and the SPU micro-architecture, respectively.

2.1 Cell Broadband Engine Architecture

The Cell Broadband Engine (CBE) is the first implementation of a new multiprocessor family conforming to the Cell Broadband Engine Architecture (CBEA, or informally, "Cell"). The CBEA is a new architecture that extends the 64-bit PowerPC Architecture. The CBEA and the CBE processor were jointly developed by Sony, Toshiba, and IBM, a group known as STI [4]. Figure 2-1 is a die photo of the Cell BE.

Figure 2-1 Die photo of Cell Broadband Engine.

Although the CBE processor was initially intended for multimedia applications in media-rich consumer-electronics devices such as game consoles, the architecture has been designed to enable fundamental advances in processor performance. These advances are expected to support a broad range of applications in both commercial and scientific fields. Figure 2-2 [5] shows the block diagram of the Cell processor. Its most distinguishing feature is that it is a multi-core chip with nine processor elements of two types and a shared coherent memory: one Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). The PPE and the SPEs depend on each other. The PPE is responsible for running the operating system and coordinating the flow of the data-processing threads through the SPEs. This differentiation allows the architectures and implementations of the PPE and SPE to be optimized for their respective workloads and enables significant improvements in performance per transistor.

Figure 2-2 Block diagram of CBE processor.

PowerPC Processor Element

The PowerPC Processor Element (PPE) is a 64-bit PowerPC Architecture core optimized for design frequency and power efficiency. It is a general-purpose, dual-thread, 64-bit RISC processor with vector/SIMD extensions. The PPE is responsible for overall control of a CBE system. It runs the operating system for all applications running on the PPE and the Synergistic Processor Elements (SPEs). The PPE consists of two main units, as shown in Figure 2-3 [6]: the PowerPC processor unit (PPU), which is the computation unit, and the PowerPC processor storage subsystem (PPSS), which handles storage. More detailed information about the PPE can be found in [6].

Figure 2-3 PPE block diagram.

Synergistic Processor Elements

The eight Synergistic Processor Elements (SPEs) execute a new single-instruction multiple-data (SIMD) instruction set, the Synergistic Processor Unit Instruction Set Architecture. They are independent processors, each running an independent application thread. Each SPE is a 128-bit RISC processor for data-rich, compute-intensive applications and includes a private local store for efficient data and instruction access. Figure 2-4 [6] shows the major elements of the SPE architecture and their relationship. The local storage (LS) is a private memory for SPE instructions and data. The synergistic processor unit (SPU) core is a processor that runs instructions from the LS and can read from or write to the LS. The direct memory access (DMA) unit transfers data between the LS and system memory. The channel unit is a message-passing interface that allows the SPU core to communicate with both the DMA unit and other devices in the Cell processor.

Figure 2-4 SPE architecture.

The SPU core is a single-instruction multiple-data (SIMD) reduced instruction set computing (RISC) processor [7]. All instructions are encoded in 32-bit fixed-length instruction formats. The SPU features 128 general-purpose registers (GPRs) that are used by both floating-point and integer instructions. Most instructions operate on 128-bit-wide data and perform integer arithmetic, logical operations, loads, stores, compares, and branches. The main SPU functional units are shown in Figure 2-5 [6]. These include the synergistic execution unit (SXU), the LS, and the SPU register file unit (SRF). The SPU issues up to two instructions at a time, one to each of its two execution pipelines, which are referred to as even (pipeline 0) and odd (pipeline 1). The units execute the following types of operations:

- Odd Pipeline
  - SPU Odd Fixed-Point Unit (SFS): Executes byte shift, rotate, mask, and shuffle operations on quadwords.
  - SPU Load and Store Unit (SLS): Executes load and store instructions and hint-for-branch instructions. It also handles DMA requests to the LS.
  - SPU Control Unit (SCN): Fetches and issues instructions to the two pipelines. It performs control functions such as branch instructions and arbitration of access to the LS and register file.
  - SPU Channel and DMA Unit (SSC): Manages communication, data transfer, and control into and out of the SPU.
- Even Pipeline
  - SPU Even Fixed-Point Unit (SFX): Executes arithmetic instructions, logical instructions, word SIMD shifts and rotates, floating-point comparisons, and floating-point reciprocal and reciprocal square-root estimates.
  - SPU Floating-Point Unit (SFP): Executes single-precision and double-precision floating-point instructions, conversions, and byte operations. 32-bit multiplies are implemented in software using 16-bit multiplies.

Figure 2-5 SPU functional units.

2.2 SPU Instruction Set Architecture

The instruction set architecture (ISA) is the most important design issue that a DSP designer must get right from the start. The ISA serves as an abstraction layer between hardware and software. It should specify the instruction set, instruction formats, data representation, data storage, addressing modes, and exceptional conditions. The rest of this section describes the fixed-point SPU ISA [8].

- Instruction formats
There are six basic instruction formats. All instructions are 32 bits long, and instructions in memory must be aligned on word boundaries. The instruction formats are shown in Figure 2-6.

Figure 2-6 Instruction format.

- Data representation
The SPU hardware supports the following data types: byte (8-bit), halfword (16-bit), word (32-bit), doubleword (64-bit), and quadword (128-bit), as shown in Figure 2-7. All general-purpose registers (GPRs) are 128 bits wide. The leftmost word (bytes 0, 1, 2, and 3) of a register is called the preferred slot. When instructions use or produce scalar operands or addresses, the values are in the preferred slot. Because the SPU accesses its LS a quadword at a time, there is a set of store-assist instructions for inserting bytes, halfwords, words, and doublewords into a quadword for a subsequent store.

Figure 2-7 Register layout of data types and preferred scalar slot.

- Data storage
The SPU architecture defines a private memory, also called local storage, which is byte-addressed. Load and store instructions combine operands from one or two registers and an immediate value to form the effective address of the memory operand. The LS is a 256 KB, single-ported, non-caching memory. It stores all instructions and data used by the SPU. The SPU data-access bandwidth is 16 bytes per cycle, quadword aligned.

- Addressing modes
For all instructions except branches, the address of the next instruction is generated by incrementing a program counter. For load and store instructions that specify a base register, the effective address in memory for a data value is calculated relative to the base register in one of three ways:
  - Register + Displacement: The displacement (D) forms of the load and store instructions form the sum of a displacement, specified by the sign-extended 16-bit immediate field of the instruction, and the contents of the base register.
  - Register + Register: The index (X) forms of the load and store instructions form the sum of the contents of the index register and the contents of the base register.
  - Register: The load string immediate and store string immediate instructions use the unmodified contents of the base register.

- Instruction sets
SPU instructions are 4 bytes long and word-aligned. The instruction set supports 16-byte (128-bit) operand accesses between storage and its 128 registers. The following gives a brief overview of the fixed-point SPU instruction set, covering data transfer, integer and logical, and data transformation instructions.

- Data transfer instructions
To process data held in memory, a load/store machine uses load and store instructions for all memory accesses. Load and store instructions combine operands from one or two registers and an immediate value to form the effective address of the memory operand. Only aligned 16-byte quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and are assumed to be zero.

- Integer and logical instructions
  - Addition/subtraction instructions

The addition and subtraction instructions are SIMD operations on halfwords (16-bit) or words (32-bit). "a" is the word operation that replaces the destination operand with the sum of the two source registers, as shown in Figure 2-8, while "ai" takes one source operand as a sign-extended 10-bit immediate. "sf" and "sfi" perform register and immediate subtraction, respectively.

Figure 2-8 Example of addition operation.

  - Compare instructions
Compare instructions compare the two source operands and store the result in the destination register. The source operands can be registers or immediate data. The comparisons are SIMD operations on bytes (8-bit), halfwords (16-bit), or words (32-bit). For example, "ceqb" (compare equal byte) sets each result byte to 0xFF if the corresponding bytes of source operand 1 and source operand 2 are equal, and to 0x00 otherwise.

  - Multiply instructions
The multiply instructions comprise multiply and multiply-and-accumulate instructions. They support only 16-bit SIMD multiplication: the lower or upper halfword of each word in the source registers is multiplied, and the product may then be shifted, masked to its upper or lower part, or accumulated. For example, multiply-high multiplies the leftmost 16 bits of each word of register RA by the rightmost 16 bits of the corresponding word of register RB; the product is then shifted left by 16 bits, with zeros shifted in at the right, for each of the four word slots, as shown in Figure 2-9.

Figure 2-9 Example of multiply operation.

  - Logical instructions
Logical instructions perform bit-wise Boolean operations. They include AND, OR, XOR, NAND, and NOR instructions, which provide the general logical operations of the processor.

- Data transformation instructions
To support the data alignment required by application processing, the data transformation instructions include shift/rotate, extend, form, gather, and shuffle instructions.
  - Shift/rotate instructions
The shift instructions can shift the source operand arithmetically or logically. The shift amount can be specified either in a register or as an immediate. Shifts are supported on halfwords, words, and quadwords, and quadwords can also be shifted by bytes. The rotate instructions support the same operand types and shift-amount forms.
  - Extend instructions

(32) The extend instructions are used to widen data precision. They extend a byte (8-bit) to a halfword (16-bit), a halfword (16-bit) to a word (32-bit), and a word (32-bit) to a doubleword (64-bit). For example, “xsbh” (extend sign byte to halfword) propagates, for each of the eight halfword slots, the sign of the right byte of the operand in register RA to the left byte.

● Gather instructions
The gather instructions gather bits from bytes, halfwords, or words. They collect the rightmost bit of each byte, halfword, or word. For example, “gbb” (gather bits from bytes) operates as follows: a 16-bit quantity is formed in the right half of the preferred slot of register RT by concatenating the rightmost bit of each byte of register RA. The leftmost 16 bits of register RT are set to zero, as are the remaining slots of register RT.

● Form instructions
The form instructions create a mask for bytes, halfwords, or words by replicating bits of the source operand. For example, “fsmb” (form select mask for bytes) operates as follows: the rightmost 16 bits of the preferred slot of register RA are used to create a mask in register RT by replicating each bit eight times. Bits in the operand are related to bytes in the result in a left-to-right correspondence, as shown in Figure 2-10.
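The extend, gather, and form operations described above can be illustrated with a small behavioral model. The following is a minimal sketch, not the thesis RTL: it assumes a register is modeled as 16 bytes with byte 0 as the leftmost byte and the preferred slot as bytes 0 to 3, and the function names simply reuse the instruction mnemonics.

```python
def xsbh(ra: bytes) -> bytes:
    """Extend sign byte to halfword: copy the sign of each right byte into the left byte."""
    assert len(ra) == 16
    out = bytearray()
    for slot in range(8):                  # eight halfword slots
        right = ra[2 * slot + 1]           # right byte of the halfword
        out.append(0xFF if right & 0x80 else 0x00)
        out.append(right)
    return bytes(out)


def gbb(ra: bytes) -> bytes:
    """Gather bits from bytes: concatenate the rightmost bit of each byte of RA."""
    assert len(ra) == 16
    value = 0
    for byte in ra:                        # left-to-right concatenation
        value = (value << 1) | (byte & 1)
    rt = bytearray(16)                     # all other bits of RT are zero
    rt[2:4] = value.to_bytes(2, "big")     # right half of the preferred slot
    return bytes(rt)


def fsmb(ra: bytes) -> bytes:
    """Form select mask for bytes: replicate each of 16 mask bits eight times."""
    assert len(ra) == 16
    mask_bits = int.from_bytes(ra[2:4], "big")   # rightmost 16 bits of the preferred slot
    return bytes(0xFF if (mask_bits >> (15 - i)) & 1 else 0x00 for i in range(16))
```

In this model, `fsmb(gbb(x))` expands the gathered bits back into a byte mask, which is one way the two instructions are combined for per-byte selection.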

(33) Figure 2-10 Example of form select mask for bytes operation (syntax: FSMB RT,RA).

● Shuffle instructions
The shuffle operation is extremely powerful and is used in many applications that require data reordering, selection, or merging. Registers RA and RB are logically concatenated, with the least-significant bit of RA adjacent to the most-significant bit of RB. The bytes of the resulting value are numbered from 0 to 31. For each byte slot in registers RC and RT, the value in register RC is examined and a result byte is produced as shown in Table 2-1 and Figure 2-11; the result byte is then inserted into register RT. Instructions not covered above are enumerated in [8].

Table 2-1 Binary values in register RC and byte results

| Value in register RC (expressed in binary) | Result byte |
|---|---|
| 10xxxxxx | 0x00 |
| 110xxxxx | 0xFF |
| 111xxxxx | 0x80 |
| otherwise | shown in Figure 2-11 |
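The byte selection of Table 2-1 can be expressed as a short behavioral sketch. It assumes registers are modeled as 16-byte values with byte 0 leftmost, and it assumes, following the SPU ISA definition of shuffle bytes rather than anything stated in the text above, that the "otherwise" case selects the byte of the concatenated RA/RB value numbered by the rightmost five bits of the control byte.

```python
def shufb(ra: bytes, rb: bytes, rc: bytes) -> bytes:
    """Behavioral model of shuffle bytes (SHUFB RT,RA,RB,RC)."""
    assert len(ra) == len(rb) == len(rc) == 16
    source = ra + rb                       # bytes numbered 0..31
    rt = bytearray(16)
    for slot, control in enumerate(rc):
        if control & 0xC0 == 0x80:         # 10xxxxxx
            rt[slot] = 0x00
        elif control & 0xE0 == 0xC0:       # 110xxxxx
            rt[slot] = 0xFF
        elif control & 0xE0 == 0xE0:       # 111xxxxx
            rt[slot] = 0x80
        else:                              # select a byte of RA||RB
            rt[slot] = source[control & 0x1F]
    return bytes(rt)
```

Because every result byte depends only on its own control byte, the hardware can evaluate all 16 slots in parallel; the loop here is only a sequential stand-in for that.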

(34) Figure 2-11 Example of shuffle bytes operation (syntax: SHUFB RT,RA,RB,RC).

■ Exceptional conditions
The SPU supports a single interrupt handler. The entry point for this handler is address 0 in the local store. When a condition is present and interrupts are enabled, the SPU branches to address 0 and disables the interrupt facility. The address of the next instruction to be executed is saved in the SRR0 register. The iret instruction can be used to return from the handler.

2.3 SPU Micro-Architecture
Figure 2-12 [7] shows how the SPU is organized and the key bandwidth (per cycle) between units. Instructions are fetched from the local store (LS) in groups of 32 four-byte instructions when the LS is idle. The fetched lines are sent in two cycles to the instruction line buffer (ILB). Instructions are sent, two at a time, from the ILB to the issue control unit. The SPU issues and completes all instructions in program order and does not reorder or rename its instructions. Although the SPU is not a VLIW processor, it supports dual issue and can issue up to two instructions per cycle to nine execution units organized into two pipelines, as shown in Table 2-2. An instruction pair can be issued if the first instruction (from an even address) is routed to an even-pipe unit and the second instruction to an odd-pipe unit.
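The dual-issue condition can be sketched as a simple predicate. This is an illustrative model only: the unit names are taken from Table 2-2, while the function name, the string encoding of units, and the reading of "even address" as an 8-byte-aligned instruction pair (addresses 0 and 4 in Table 2-2) are assumptions made for the example.

```python
# Unit-to-pipe assignment as listed in Table 2-2.
EVEN_PIPE_UNITS = {"simple fixed", "shift", "single precision", "floating integer", "byte"}
ODD_PIPE_UNITS = {"permute", "local store", "channel", "branch"}


def can_dual_issue(first_addr: int, first_unit: str, second_unit: str) -> bool:
    """Return True if two consecutive instructions can issue in the same cycle.

    Assumption: the first instruction of a pair sits at an 8-byte-aligned
    address, and the second follows it at first_addr + 4.
    """
    pair_aligned = first_addr % 8 == 0
    return (pair_aligned
            and first_unit in EVEN_PIPE_UNITS
            and second_unit in ODD_PIPE_UNITS)
```

For example, an add at address 0 followed by a load at address 4 satisfies the predicate, whereas the same two instructions in the opposite order do not and would issue in separate cycles.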

(35) Table 2-2 Dual issue unit assignments

| Inst. from address 0 | Inst. from address 4 |
|---|---|
| Simple fixed | Permute |
| Shift | Local store |
| Single precision | Channel |
| Floating integer | Branch |
| Byte | |

Operands are fetched either from the register file or from the forward network and sent to the execution pipelines. Each of the two pipelines can consume three 16-byte operands and produce a 16-byte result every cycle. The register file has six read ports, two write ports, and 128 entries of 128 bits each, and is accessed in two cycles. Results produced by functional units are held in the forward macro until they are committed and available from the register file. Loads and stores transfer 16 bytes of data between the register file and the local store. Table 2-3 details the eight execution units. Simple fixed-point [9], floating-point [10], and load results are bypassed directly from the unit output to the input operands to reduce result latency. Other results are sent to the forward macro, from which they are distributed a cycle later. Figure 2-13 [7] is a pipeline diagram for the SPE that shows how flush and fetch are related to other instruction processing.

(36) Figure 2-12 SPU organization.

Table 2-3 Unit and instruction latency

| Unit | Instructions | Latency |
|---|---|---|
| Simple Fixed | word arithmetic, logicals, count leading zeros, selects, and compares | 2 |
| Simple Fixed | word shifts and rotates | 4 |
| Single Precision | multiply-accumulate | 6 |
| Single Precision | integer multiply-accumulate | 7 |
| Byte | pop count, absolute sum of differences, byte average, byte sum | 4 |
| Permute | quadword shifts, rotates, gathers, shuffles, as well as reciprocal estimate | 4 |
| Local Store | load and store | 6 |
| Channel | channel read/write | 6 |
| Branch | branches | 4 |

(37) Figure 2-13 SPU pipeline diagram.

(38)

(39) 3. DESIGN & OPTIMIZATION FLOW OF DEEP-PIPELINE LATENCY DATAPATH

Today’s multimedia applications require a significant amount of digital signal processing, so many contemporary processors are datapath-dominated designs. The Cell processor provides high performance for multimedia applications in embedded systems. One of its key features is the synergistic processor unit (SPU), a data-oriented core for computing-intensive operations. In this chapter, we first give an overview of our proposed two-phase design flow, which designs the SPU datapath systematically. Section 3.1 presents the first phase of the design flow, called spatial optimization, which includes function modeling and cycle-accurate modeling. Section 3.2 presents the second phase of the design flow, and Section 3.3 reports the experimental results.

(40) 3.1 Spatial Optimization
Given the instruction set architecture (ISA) of the synergistic processor unit (SPU), how can the datapath be designed systematically? In this section, we detail the function modeling and cycle-accurate modeling of our proposed two-phase design flow, as shown in Figure 3-1.

Figure 3-1 Overview of our proposed design flow.

3.1.1 Function Modeling
The function modeling of the first phase of our design flow is divided into four steps, performed in order as shown in Figure 3-2: instruction grouping, behavioral modeling in RTL, synthesis (for timing-optimized and area-optimized results), and optimization. The datapath obtained from the synthesis step is called the baseline.

(41) Figure 3-2 Functional modeling.

■ Instruction grouping
To reduce the effort of SPU datapath design, we profile the instructions commonly used by multimedia applications, such as JPEG, FFT, DCT, FIR, and IIR, through the SPU compiler. The first step is to categorize these profiled instructions, mainly by operation. We divide the datapath instructions into seven groups: Add/Sub, Logic, Cmp (compare), Mask, S/R (shifter/rotator), Shuffle, and Mpy (multiply).

■ Behavioral model in RTL
After categorizing the instructions, we write behavioral assignments in RTL that follow the semantics of the SPU instruction set architecture and analyze their synthesis results with a CAD tool. This step shows the degree of optimization achieved by Synopsys Design

(42) Compiler. We take the instruction “ah” (add halfword) as an example, as shown in Figure 3-3. The “add_sub_sel” signal in Figure 3-3 selects which instruction of the Add/Sub group is executed. Other instructions follow the same code format as the description in Figure 3-3.

Figure 3-3 Example of behavioral assignment in RTL.

■ Synthesis (synthesized for timing-optimized and area-optimized)
After finishing the RTL coding, we analyze the initial synthesis results of the seven functional units. These results are defined as the baseline in this thesis. Synthesizing for timing gives the shortest delay of each functional unit, while synthesizing for area gives its hardware complexity. Area and timing are the two quantities of most concern in datapath design. Table 3-1 shows that the “Mpy” functional unit has both the largest area and the longest delay in the baseline. The resource report of the synthesis result also lists the synthesized resources provided by DesignWare, from which we find that the CAD tool does not perform any optimization across our datapath functional units. In the next subsection, we provide a general strategy to optimize the datapath.

(43) Table 3-1 Synthesis result of baseline (the number after “#” is the instruction count of the group)

| Grouping | Metric | Synthesis for timing | Synthesis for area |
|---|---|---|---|
| Add/Sub (#9) | delay (ns) | 0.63 | 4 |
| | area (um^2) | 40468 | 25173 |
| Logic (#9) | delay (ns) | 0.45 | 3 |
| | area (um^2) | 14148 | 7209 |
| Cmp (#18) | delay (ns) | 0.62 | 2.5 |
| | area (um^2) | 15914 | 9977 |
| Mask (#9) | delay (ns) | 0.31 | 1.4 |
| | area (um^2) | 3063 | 1442 |
| S/R (#26) | delay (ns) | 0.8 | 4.8 |
| | area (um^2) | 225086 | 131332 |
| Mpy (#11) | delay (ns) | 2.51 | 7 |
| | area (um^2) | 518969 | 371405 |
| Shuffle (#7) | delay (ns) | 0.66 | 2 |
| | area (um^2) | 42123 | 26657 |

■ Optimization (sharing)
The SPU instruction set supports 128-bit SIMD operations on 8-bit, 16-bit, 32-bit, and 128-bit elements. For example, the “Add/Sub” functional unit supports 16-bit and 32-bit addition and subtraction, and the “Cmp” functional unit supports 8-bit (byte), 16-bit (halfword), and 32-bit (word) comparison. To support this variety of bit lengths, the baseline simply follows the behavioral assignment in RTL given by the SPU ISA. As the synthesis reports of the previous step show, no optimization is applied across these functional units, so translating the SPU ISA directly into a single-cycle execution datapath is not area-efficient. In this step, we describe how to optimize the seven functional units using general optimization strategies, namely resource sharing and the sub-parallel method [11]. With resource sharing, computations of the same bit length within one functional unit use the same hardware, and an encoder decides which instruction is executed. We use this method in the “Logical”, “Mask”, “S/R”, and “Mpy” functional units.

(44) The sub-parallel method uses multiple sub-word-length hardware units to perform a word-length computation. For a subword adder, this is achieved by inserting multiplexers at the subword boundaries to propagate or block the subword carries in the carry chain [12]. For example, the “Add/Sub” functional unit supports 16-bit and 32-bit operations. We build a 32-bit adder from two 16-bit adders by adding an AND gate that controls the carry out of the lower 16-bit adder, as shown in Figure 3-4. If the word control bit is “1”, the hardware executes a 32-bit addition or subtraction; otherwise it executes two 16-bit operations. Other functional units, such as “Cmp”, can follow this method to perform 8-bit, 16-bit, or 32-bit operations.

Figure 3-4 Add/Sub functional unit.

We design the SPU datapath using the optimization methods above. These methods constitute the spatial optimization, as opposed to the temporal optimization of Section 3.2.
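The carry gating of the sub-parallel adder can be illustrated with a bit-accurate behavioral sketch of one 32-bit slot. This is an assumed software model, not the thesis RTL; the function and argument names are hypothetical, and subtraction (operand inversion plus carry-in) is left out for brevity.

```python
def subword_add(a: int, b: int, word_mode: bool) -> int:
    """Add two 32-bit slot values using two 16-bit adders.

    word_mode=True  -> one 32-bit addition (carry propagates across the boundary)
    word_mode=False -> two independent 16-bit additions (carry is blocked)
    """
    low_sum = (a & 0xFFFF) + (b & 0xFFFF)          # lower 16-bit adder
    carry_out = (low_sum >> 16) & 1
    gated_carry = carry_out & int(word_mode)       # the AND gate on the carry
    high_sum = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + gated_carry
    return ((high_sum & 0xFFFF) << 16) | (low_sum & 0xFFFF)
```

For instance, adding 0x0001FFFF and 0x00000001 gives 0x00020000 in word mode and 0x00010000 in halfword mode, because the carry out of the lower halfword is blocked in the second case. The same adders serve both instruction widths, which is where the area saving comes from.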

(45) 3.1.2 Cycle-Accurate Modeling

Figure 3-5 Cycle-accurate modeling.

Because the previous subsection considers only single-cycle execution in the datapath, we must now take the latency specification of the SPU ISA into account, as shown in Figure 3-5. Table 3-2 lists the instruction latency of all seven functional units. Instruction latency is the number of clock cycles an instruction takes to make its result available through the pipeline. For example, “Add/Sub” has an instruction latency of two, which means that its result must be produced within two cycles. In this step, called “cycle-accurate modeling”, we combine the optimized single-cycle execution units with the instruction latency specification of the SPU to design the micro-architecture. We use two main methods: queue sharing and forwarding unit design. Queue sharing means that the single-cycle functional units pass their results through the same pipeline registers to meet the instruction latency, and the forwarding unit uses these pipeline registers to forward data to the operand fetch unit. Next, we will

(46) introduce how to design the forwarding unit.

Table 3-2 Instruction latency

| Grouping | # latency |
|---|---|
| Add/Sub | 2 |
| Logic | 2 |
| Cmp | 2 |
| Mask | 2 |
| S/R | 4 |
| Shuffle | 4 |
| Mpy | 7 |

Data forwarding is a well-known technique for reducing the number of extra execution cycles. However, the complexity of the forwarding network increases rapidly and the network usually lies on the critical path [14]. To design the forwarding unit systematically, we classify each pipeline stage as a producer (it produces data) or a consumer (it consumes data). The analysis of the data forwarding paths covers two domains: the temporal domain and the spatial domain. The temporal domain analysis checks the results that were produced by previous instructions but are still in the execution unit pipeline, while the spatial domain analysis checks all possible paths between every producer stage and every consumer stage. We define the tolerable latency (TL) [14] of the forwarding unit as the latency between data production and data consumption:

TL (tolerable latency) = data consuming time – data producing time

The TL indicates the latency available between the consumer and the producer. If the TL is less than the latency of the forwarding unit, data forwarding is impossible and the pipeline has to stall for several cycles until the TL equals the latency of the forwarding unit. Figure 3-6 shows an example of TL in our SPU datapath design.

(47) Figure 3-6 Tolerable latencies.

For a forwarding unit with one-cycle latency, there are three possible forwarding cases depending on the TL:
1. Non-causal (TL < 0).
2. Timing critical (TL = 0).
3. Normal (TL ≥ 1).

The first case is the non-causal path: the consumer executes earlier than the producer, which results in a non-causal forwarding condition. In the second case, the producer forwards directly to the consumer. That is, the producer's data is forwarded to the consumer in the next instruction cycle, so the forwarding path can tolerate no extra latency. The last case covers the normal paths, which have a tolerable latency of one or more cycles between consumer and producer. In this case, the result of the producer cannot be forwarded to the consumer directly but has to be queued for multiple cycles. In our datapath, we divide the seven functional units into three main groups of equal instruction latency: two-latency, four-latency, and seven-latency. Table 3-3 shows all of our forwarding paths. In this table, all possible paths between each producer and consumer are categorized into the three forwarding cases above. The instruction number indicates the distance, in instruction cycles, between consumer and producer.
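The three forwarding cases can be sketched as a small classifier. The case boundaries come from the list above; the formula used to compute TL from the producer latency and the instruction distance, TL = distance − (latency − 1), is an assumption inferred from the pattern of entries in Table 3-3 and is not stated explicitly in the text. Function names are hypothetical.

```python
def tolerable_latency(producer_latency: int, distance: int) -> int:
    """TL for a consumer issued `distance` instructions after the producer.

    Assumption inferred from Table 3-3: a producer of latency L is
    timing critical for the consumer that is L - 1 instructions behind it.
    """
    return distance - (producer_latency - 1)


def forwarding_case(tl: int) -> str:
    """Classify a forwarding path for a forwarding unit with one-cycle latency."""
    if tl < 0:
        return "non-causal"          # consumer would need the data before it exists
    if tl == 0:
        return "timing critical"     # forwarded directly, no slack on the path
    return f"normal (TL={tl})"       # result waits in the shared queue


# Rebuild the producer columns of the forwarding table for distances 1..6.
forwarding_table = {
    latency: [forwarding_case(tolerable_latency(latency, d)) for d in range(1, 7)]
    for latency in (2, 4, 7)
}
```

Under this assumption, a 4-latency producer is non-causal for the next two instructions and timing critical for the third, which matches the corresponding column of Table 3-3.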

(48) Table 3-3 Forwarding table of our SPU (the same six rows apply to each of the three consumer groups: 2-latency, 4-latency, and 7-latency FUs)

| Consumer | Producer: 2-latency FUs | Producer: 4-latency FUs | Producer: 7-latency FUs |
|---|---|---|---|
| inst. 1 | timing critical | non-causal | non-causal |
| inst. 2 | normal (TL=1) | non-causal | non-causal |
| inst. 3 | normal (TL=2) | timing critical | non-causal |
| inst. 4 | normal (TL=3) | normal (TL=1) | non-causal |
| inst. 5 | normal (TL=4) | normal (TL=2) | non-causal |
| inst. 6 | normal (TL=5) | normal (TL=3) | timing critical |

The forwarding module can be viewed as a pipeline stage in the forwarding paths that isolates the complicated network from the datapath by means of output registers. Once the forwarding table is established, we can design the forwarding micro-architecture as shown in Figure 3-7. Using the shared queue and the forwarding unit, we can design the micro-architecture of the SPU. Because the SPU is dual-issue, we divide the datapath into two pipelines. Figure 3-8 shows the pipeline diagram, which meets the instruction latency of the SPU. One of the fundamental decisions in the design of a processor is the choice of pipeline structure. In the next section, we explore this issue to obtain the most area-efficient number of pipeline stages for each functional unit of the SPU, given the instruction latency of the ISA and a targeted performance requirement. The issue is treated both analytically and by simulation, starting from the spatially optimized datapath. We first present a preliminary analysis for the latency exploration and then use a mathematical formulation to obtain the optimal architecture.

(49) Figure 3-7 Forwarding network.

Figure 3-8 SPU pipeline diagram.

(50) 3.2 Temporal Optimization
As shown in Figure 3-8, the pipelined datapath so far uses only bypassing registers to meet the instruction latency of the functional units, but we can exploit the latency specification to optimize the datapath, as shown in Figure 3-9. This raises the question of the optimum pipeline depth for a processor, given the latency specification of the ISA. Retiming is a structural optimization technique that relocates the registers in a logic circuit with the objective of minimizing their total gate count, maximizing the circuit performance, or achieving both simultaneously [15][16]. We apply retiming so that the functional units meet the required timing constraints with a minimum number of registers.

Figure 3-9 Temporal optimization.

(51) ■ Latency exploration
Retiming [15] is a transformation technique that changes the locations of the delay elements in a circuit without affecting its input/output characteristics. It is a useful method for optimizing performance in synchronous circuit design. Its uses include reducing the clock period of the circuit, reducing the number of registers in the circuit, reducing the power consumption of the circuit, and logic synthesis. In this thesis, we use retiming to reduce the number of registers in our datapath. To pipeline the functional units below, we use the pipeline_design command of Synopsys Design Compiler. For example, given a functional unit with a latency of three, there are three ways to choose the pipeline structure, as shown in Figure 3-10. In the first way, the functional unit is not pipelined and is followed directly by three pipeline registers. In the second way, the functional unit is pipelined into two stages by the CAD tool and followed by two pipeline registers. In the last way, the CAD tool pipelines the functional unit into three stages, followed only by the output register. We find that the trivial stage selection, that is, three stages, is not necessarily the most area-efficient at the targeted frequency, so we analyze the synthesized area trend of the pipelined functional units to help build the mathematical formulation.

Figure 3-10 Function unit with 3-cycle latency.

Next, we’ll analyze the multiple-cycle latency basic module of functional units. That is,

(52) we will analyze basic modules of these functional units, namely the 16-bit “S/R”, the 8-bit “Shuffle”, and the 16-bit “MUL”.

■ Functional unit characterization

● S/R
The “S/R” functional unit has a latency of three, so there are three ways to choose its pipeline structure. We use a 16-bit shifter to estimate the area trend of all three ways, combining the synthesized result of the pipelined functional unit with an estimate of the 16-bit pipeline registers under the 1.25 ns timing constraint. In Table 3-4 and Figure 3-11, the first column is the number of pipelined stages of the “S/R” functional unit. The third column estimates the area of the bypassing registers, where one 16-bit register occupies 288 um^2. For example, in the first case the functional unit is followed directly by three pipeline registers, so the register area is 3 × 288 = 864 um^2. Estimating the synthesized trend in this way, we can clearly see that the area grows with the number of stages, as shown in Figure 3-11.

Table 3-4 Piped S/R FU

| Piped stage | Piped-FU (um^2) | Bypassing register (um^2) | Area (um^2) |
|---|---|---|---|
| 1 | 1514 | 864 | 2378 |
| 2 | 2237 | 576 | 2813 |
| 3 | 2528 | 288 | 2816 |

Figure 3-11 Piped S/R FU (area in um^2 versus number of piped stages).
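The area estimate behind Table 3-4 can be written as a one-line model. This is a minimal sketch under the assumption, read off the table, that a unit with latency L pipelined into P stages is followed by L − P + 1 bypassing registers; the function name is hypothetical, and the numbers are the 16-bit S/R values from Table 3-4.

```python
def pipelined_area(piped_fu_area: int, stages: int, latency: int, register_area: int) -> int:
    """Estimated area (um^2) of a functional unit plus its bypassing registers."""
    bypass_registers = latency - stages + 1        # e.g. 3, 2, 1 for a latency of three
    return piped_fu_area + bypass_registers * register_area


# 16-bit S/R module: synthesized piped-FU area for 1, 2 and 3 stages.
SR_PIPED_FU_AREA = {1: 1514, 2: 2237, 3: 2528}
SR_REGISTER_AREA = 288                              # one 16-bit register

sr_total_area = {
    stages: pipelined_area(fu_area, stages, latency=3, register_area=SR_REGISTER_AREA)
    for stages, fu_area in SR_PIPED_FU_AREA.items()
}
best_stage = min(sr_total_area, key=sr_total_area.get)
```

With these inputs the totals are 2378, 2813, and 2816 um^2, so the non-pipelined unit with three bypassing registers is the smallest option whenever it meets the timing constraint.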

(53) ● Shuffle
The “Shuffle” unit also has a latency of three, so it has three choices of pipelined stages. We use an 8-bit shuffle module to approximate the area trend of the three possible cases under the 1.25 ns timing constraint. In Table 3-5 and Figure 3-12, the area of the pipelined “Shuffle” grows only slightly with the number of pipelined stages. The difference between the 2-stage and 3-stage cases is slight here, but it becomes more distinct once the module is replicated across the whole “Shuffle” unit.

Table 3-5 Piped Shuffle FU

| Piped stage | Piped-FU (um^2) | Bypassing register (um^2) | Area (um^2) |
|---|---|---|---|
| 1 | 1433 | 432 | 1865 |
| 2 | 1580 | 288 | 1868 |
| 3 | 1724 | 144 | 1868 |

Figure 3-12 Piped Shuffle FU (area in um^2 versus number of piped stages).

● MUL
The multiplier is the most timing-critical functional unit, so it typically has many more pipeline stages than the other functional units in order to reach a high frequency. The deeper pipeline of the multiplier gives us a larger design space for choosing the number of pipelined stages. Our “MUL” functional unit has a latency of six, so it has six possible

(54) selections of pipelined stages. Here, we use a 16-bit multiplier synthesized under the 1.25 ns timing constraint. Table 3-6 and Figure 3-13 clearly show the area trend of the pipelined “MUL”. Unlike the previous functional units, no synthesis result is available for the 1-stage case: the critical path of the non-pipelined MUL is longer than 1.25 ns, so at least two pipelined stages are required to reach our operating frequency of 800 MHz. In Table 3-6, the area is nearly flat between the 3-stage and 4-stage cases, whereas it rises steeply between the 4-stage and 5-stage cases. That is because the synthesized area does not grow strictly linearly with the number of pipelined stages. Generally speaking, however, the area of the MUL increases with the number of pipelined stages.

Table 3-6 Piped MUL FU

| Piped stage | Piped-FU (um^2) | Bypassing register (um^2) | Area (um^2) |
|---|---|---|---|
| 1 | NA | NA | NA |
| 2 | 8521 | 2880 | 11401 |
| 3 | 9403 | 2304 | 11707 |
| 4 | 9981 | 1728 | 11709 |
| 5 | 11340 | 1152 | 12492 |
| 6 | 12140 | 576 | 12716 |

Figure 3-13 Piped MUL FU (area in um^2 versus number of piped stages).

In short, we use smaller modules, together with the synthesized trends above, to estimate the area of the real functional units. These results help us to formulate the following

(55) mathematical formulation. Having used register estimation and pipelined functional units to estimate the area trend, we now introduce the mathematical formulation for our datapath design.

■ Formulation
Retiming is a structural optimization technique that relocates the registers in the datapath in order to minimize the total gate count and maximize the circuit performance. We build a mathematical formulation based on the retiming method that minimizes the area under a timing constraint, and we solve it to minimize the total area of all pipelined functional units. The parameters are described in Table 3-8. The first parameter, i, is the identification number of each functional unit, as listed in Table 3-7. The second, A_i, is the total area of each pipelined functional unit; P_i and L_i are the number of pipelined stages and the latency specification of each functional unit, respectively. The remaining parameters concern timing. C_i is the control delay of each functional unit, that is, the delay of the multiplexer in front of the functional unit. t_i and t_p are the delay of the purely combinational (non-pipelined) functional unit and the delay of the pipeline register, respectively. We estimate t_p to be about 0.15 ns to 0.2 ns from the manual of the UMC 90 nm process. The timing constraint t corresponds to our targeted operating frequency of 800 MHz (1.25 ns).

Table 3-7 Number ID of functional unit

| FU | ID number (i) |
|---|---|
| Add/Sub | 1 |
| Logic | 2 |
| Cmp | 3 |
| Mask | 4 |
| S/R | 5 |
| Shuffle | 6 |
| Mpy | 7 |

(56) Table 3-8 Description of equation’s parameter

| Parameter | Description |
|---|---|
| i | ith FU |
| A_i | area of ith FU |
| P_i | # pipelined stage |
| L_i | latency spec. |
| C_i | control delay |
| t_i | timing of non-piped FU |
| t_p | pipelined register |
| t | timing constraints |

\[ \text{Minimize } A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7 \tag{3-1} \]

\[ P_i \le L_i, \qquad i = 1, 2, \dots, 7 \tag{3-2} \]

\[ \frac{t_i}{P_i} + C_i + t_p < t, \qquad i = 1, 2, \dots, 7 \tag{3-3} \]
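Constraints (3-2) and (3-3) can be checked numerically for one functional unit at a time. The sketch below assumes, as the characterization in Tables 3-4 to 3-6 suggests, that area does not decrease when stages are added, so the minimum-area choice is the smallest P_i that satisfies both constraints. The function name is hypothetical, and the control delay of 0.1 ns used in the example is an illustrative placeholder, since the thesis does not list the C_i values.

```python
from typing import Optional


def min_stages(t_i: float, c_i: float, t_p: float, t: float, latency: int) -> Optional[int]:
    """Smallest number of pipelined stages P with P <= L and t_i / P + C_i + t_p < t.

    Returns None if no stage count within the latency specification meets timing.
    """
    for stages in range(1, latency + 1):            # constraint (3-2): P <= L
        if t_i / stages + c_i + t_p < t:            # constraint (3-3)
            return stages
    return None


# Example with the spatial-optimized Mpy delay (2.01 ns), t_p = 0.2 ns,
# t = 1.25 ns (800 MHz), L = 6, and an assumed control delay C = 0.1 ns.
mpy_stages = min_stages(t_i=2.01, c_i=0.1, t_p=0.2, t=1.25, latency=6)
```

Under these assumed values the multiplier needs three stages, since two stages give 1.005 + 0.1 + 0.2 = 1.305 ns, which exceeds the 1.25 ns constraint. Other C_i values would shift the result, so this should be read as an illustration of the constraint check rather than a reproduction of the thesis solution.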

(57) Equation 3-1 seeks the minimum total area of the seven functional units, subject to Equation 3-2 and Equation 3-3. The main idea behind the simultaneous inequalities of Equation 3-3 is that the total delay within one pipeline stage of the datapath must be shorter than the timing constraint (1.25 ns), as shown in Figure 3-14. The parameter t_i is taken from the “Synthesis for timing” column of each functional unit in Table 3-9. Equation 3-2 expresses the latency specification: the combinational circuit of each functional unit can be divided into at most L_i stages.

Figure 3-14 Timing delay of one-stage pipelined datapath (control circuit followed by function unit, total delay < t).

From the constraints of Equations 3-2 and 3-3, we derive the solution given in Equation 3-4.

\[ P_1 = 1,\; P_2 = 1,\; P_3 = 1,\; P_4 = 1,\; P_5 = 3,\; P_6 = 1,\; P_7 = 3 \tag{3-4} \]

3.3 Experimental Results
In this section, we show the experimental results of our proposed design flow, which includes spatial optimization and temporal optimization, as shown in Figure 3-15.

(58) Figure 3-15 Proposed design flow.

■ Spatial optimization
Table 3-9 shows the synthesis results obtained with the optimization methods above, which we call spatial-optimized. It lists the critical-path delay and the hardware complexity of each group. We also set the timing constraint to 2.5 ns (400 MHz) as a typical case, which represents a common operating frequency for DSPs.

(59) Table 3-9 Synthesized result of baseline and spatial-optimized

| Grouping | Metric | Synthesis for timing: Baseline | Synthesis for timing: Spatial-optimized | Synthesis for area: Baseline | Synthesis for area: Spatial-optimized | Target 400 MHz: Baseline | Target 400 MHz: Spatial-optimized |
|---|---|---|---|---|---|---|---|
| Add/Sub | delay (ns) | 0.63 | 0.82 | 4 | 4.6 | 2.5 | 2.5 |
| | area (um^2) | 40468 | 8811 | 25173 | 5635 | 26183 | 9530 |
| Logic | delay (ns) | 0.45 | 0.45 | 3 | 3 | 2.5 | 2.5 |
| | area (um^2) | 14148 | 14282 | 7209 | 7209 | 7215 | 7215 |
| Cmp | delay (ns) | 0.62 | 0.72 | 2.5 | 2.8 | 2.5 | 2.5 |
| | area (um^2) | 15914 | 7040 | 9977 | 5515 | 9977 | 5547 |
| Mask | delay (ns) | 0.31 | 0.31 | 1.4 | 1.4 | 2.5 | 2.5 |
| | area (um^2) | 3063 | 3063 | 1442 | 1442 | 1443 | 1443 |
| S/R | delay (ns) | 0.8 | 1.4 | 4.8 | 9.3 | 2.5 | 2.5 |
| | area (um^2) | 225086 | 94460 | 131332 | 49693 | 132160 | 50341 |
| Mpy | delay (ns) | 2.51 | 2.01 | 7 | 7.6 | 2.5 | 2.5 |
| | area (um^2) | 518969 | 68055 | 371405 | 39317 | NA | 47741 |
| Shuffle | delay (ns) | 0.66 | 0.66 | 2 | 2 | 2.5 | 2.5 |
| | area (um^2) | 42123 | 30393 | 26657 | 24388 | 26657 | 24399 |

The “Area-optimized” column of Table 3-10 shows that the hardware complexity of the spatial-optimized datapath is much lower than that of the baseline. To share hardware resources, we add control hardware such as encoders, which slightly increases the timing delay of the datapath. In the timing-optimized column of Table 3-10, the delay of most functional units increases slightly, with the exception of the “Mpy” functional unit. Because the baseline “Mpy” functional unit, with its larger encoder, is much more complex than the spatial-optimized one with its smaller encoder, the baseline delay is longer than the spatial-optimized delay. The “Logic” and “Mask” units are the same in the baseline and the spatial-optimized versions because they consist of simple logic gates or plain wiring. Adding a controller to share such logic gates would introduce considerable overhead compared with the original control in these two units. Finally, we use a typical target of 400 MHz (2.5 ns) to confirm the area benefit of the optimization methods above. The improvement for each functional unit is shown in Table 3-11.
In the 400 MHz column of the “Mpy” unit, “NA” means that no result is available because the timing delay is longer than 2.5 ns. In the next part, we take the latency specification of the SPU into consideration for the micro-architecture design.
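The improvement percentages reported in Table 3-11 are consistent with a plain relative reduction against the baseline. The sketch below makes that calculation explicit; the function name is hypothetical and the formula is inferred from the tabulated values rather than stated in the text. The sample inputs are the Add/Sub and Mpy entries of Table 3-10.

```python
def improvement(baseline: float, optimized: float) -> float:
    """Relative reduction in percent; negative values mean the metric got worse."""
    return (baseline - optimized) / baseline * 100.0


add_sub_timing = round(improvement(0.63, 0.82), 1)       # delay grows, so negative
add_sub_area = round(improvement(25173, 5635), 1)        # area-optimized synthesis
add_sub_400mhz = round(improvement(26183, 9530), 1)      # 400 MHz target
mpy_area = round(improvement(371405, 39317), 1)
```

The negative timing figures in Table 3-11 therefore indicate a longer critical path after sharing, which is the cost of the added control logic, while the positive area figures are the corresponding savings.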

(60) Table 3-10 Comparison between baseline and spatial-optimized

| Grouping | Timing-optimized (ns): Baseline | Timing-optimized (ns): Spatial-optimized | Area-optimized (um^2): Baseline | Area-optimized (um^2): Spatial-optimized | 400 MHz (um^2): Baseline | 400 MHz (um^2): Spatial-optimized |
|---|---|---|---|---|---|---|
| Add/Sub | 0.63 | 0.82 | 25173 | 5635 | 26183 | 9530 |
| Logic | 0.45 | 0.45 | 7209 | 7209 | 7215 | 7215 |
| Cmp | 0.62 | 0.72 | 9977 | 5515 | 9977 | 5547 |
| Mask | 0.31 | 0.31 | 1442 | 1442 | 1443 | 1443 |
| S/R | 0.8 | 1.4 | 131332 | 49693 | 132160 | 50341 |
| Mpy | 2.51 | 2.01 | 371405 | 39317 | NA | 47741 |
| Shuffle | 0.66 | 0.66 | 26657 | 24388 | 26657 | 24399 |

Table 3-11 Improvement by spatial-optimized

| Grouping | Timing | Area | 400 MHz |
|---|---|---|---|
| Add/Sub | -30.2% | 77.6% | 63.6% |
| Logic | 0.0% | 0.0% | 0.0% |
| Cmp | -16.1% | 44.7% | 44.4% |
| Mask | 0.0% | 0.0% | 0.0% |
| S/R | -75.0% | 62.2% | 61.9% |
| Mpy | 19.9% | 89.4% | NA |
| Shuffle | 0.0% | 8.5% | 8.5% |

■ Temporal optimization
We now show that our proposed temporal optimization yields the optimal micro-architecture. In principle, we could synthesize every case allowed by the latency specification of each functional unit. However, this exhaustive approach consumes too much design time when searching for the optimal micro-architecture of a deeply pipelined datapath. Our proposed temporal optimization not only saves this iteration time but also finds the optimal selection of pipelined stages, as shown in Table 3-12.

(61) Table 3-12 All cases for latency spec. (under the 1.25 ns timing constraint; columns give the number of pipelined stages)

| Grouping | Latency (#) | Metric | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|
| Add/Sub | 1 | delay (ns) | 1.25 | NA | NA | NA | NA | NA |
| | | area (um^2) | 9490 | NA | NA | NA | NA | NA |
| Logic | 1 | delay (ns) | 1.25 | NA | NA | NA | NA | NA |
| | | area (um^2) | 10287 | NA | NA | NA | NA | NA |
| Cmp | 1 | delay (ns) | 1.25 | NA | NA | NA | NA | NA |
| | | area (um^2) | 7690 | NA | NA | NA | NA | NA |
| Mask | 1 | delay (ns) | 1.25 | NA | NA | NA | NA | NA |
| | | area (um^2) | 3710 | NA | NA | NA | NA | NA |
| S/R | 3 | delay (ns) | 1.25 | 1.25 | 1.25 | NA | NA | NA |
| | | area (um^2) | NA | NA | 78639 | NA | NA | NA |
| Shuffle | 3 | delay (ns) | 1.25 | 1.25 | 1.25 | NA | NA | NA |
| | | area (um^2) | 31283 | 32653 | 34888 | NA | NA | NA |
| Mpy | 6 | delay (ns) | 1.25 | 1.25 | 1.25 | 1.25 | 1.25 | 1.25 |
| | | area (um^2) | NA | NA | 81782 | 83195 | 83554 | 85655 |

In Table 3-12, the entries highlighted in blue are derived from our proposed temporal optimization. For each functional unit, the highlighted selection is the best choice for the optimal micro-architecture. We could instead synthesize all cases, but that is time-consuming. For example, the Mpy functional unit has a latency of six, which means that it has six possible pipeline structures. Synthesizing all of them is not time-efficient, and the situation becomes worse as functional units are given longer latencies and deeper pipelines in order to reach a high operating frequency. Finally, we show the improvement of our proposed temporal optimization over the trivial approach in Table 3-13. Table 3-13 compares all pipelined functional units between our proposed temporal optimization and the trivial approach that follows the pipeline diagram of the reference paper [7]. The reference version takes the spatial-optimized datapath and applies the latency directly according to the pipeline diagram of the reference, while the temporal-optimized version applies temporal optimization to the same datapath. The first column lists the functional units, and the second gives the latency specification, that is, the maximum number of pipelined stages of each functional unit.
The reference version applies the latency specification directly, pipelining each functional unit with the

(62) CAD tool. However, the spatial optimization uses our proposed temporal optimization. Seeing the “Improvement” column, we can see improvement by 0% from Add/Sub to S/R functional units, because these functional units have no latency to explore. In other words, these functional units have only one-latency and the S/R must pipeline 3 stages into it in order to target high frequency. In both “Shuffle” and “Mpy”, we improve the area compared to the version 3 by 10% and 4.5% respectively. Table 3-13 Temporal optimization FU. latency(#) (800MHz). Add/Sub. 1. Logic. 1. Cmp. 1. Mask. 1. S/R. 3. Shuffle. 3. Mpy. 6. Temporaloptimized 1 9490 1 10033 1 7690 1 3710 3 78639 1 31283 3 82964. pipelined-stage area pipelined-stage area pipelined-stage area pipelined-stage area pipelined-stage area pipelined-stage area pipelined-stage area. Ref. 1 9490 1 10033 1 7690 1 3710 3 78639 3 34888 6 85655. Improvement (%) 0% 0% 0% 0% 0% 10.33% 3.14%. Finally, we show the area-efficient micro-architecture target to lightweight applications target to 100MHz to 800MHz as shown in Table 3-14. Our proposed design flow improves the area of micro-architecture by approximate 20%. The case “800 MHz” is just improved by 3% because its optimization space is limited by timing-optimized. General speaking, we can see the trend of micro-architecture is area-efficient by using our proposed design flow. In Table 3-14 and Figure 3-16, we can see the other case is area-efficient micro-architecture improved 15% to 20% of area.. 46.
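The per-unit selection reported in Tables 3-12 and 3-13 can be illustrated with a minimal sketch of the exhaustive ("trivial") search that the temporal optimization is meant to avoid. This is not the mathematical formulation of the thesis; it only picks the smallest-area entry among the pipeline depths for which Table 3-12 reports an area. The names AREA_BY_STAGES and smallest_area_depth are hypothetical, and the areas are copied from Table 3-12.

```python
# Area (um^2) per pipeline depth at the 1.25 ns constraint, copied from
# Table 3-12. None stands for an "NA" entry (no area reported at that depth).
AREA_BY_STAGES = {
    "S/R": {1: None, 2: None, 3: 78639},
    "Shuffle": {1: 31283, 2: 32653, 3: 34888},
    "Mpy": {1: None, 2: None, 3: 81782, 4: 83195, 5: 83554, 6: 85655},
}


def smallest_area_depth(areas):
    """Return (depth, area) of the smallest reported area among all depths."""
    feasible = {depth: area for depth, area in areas.items() if area is not None}
    if not feasible:
        raise ValueError("no pipeline depth has a reported area")
    best_depth = min(feasible, key=feasible.get)
    return best_depth, feasible[best_depth]


if __name__ == "__main__":
    for unit, areas in AREA_BY_STAGES.items():
        depth, area = smallest_area_depth(areas)
        print(f"{unit}: {depth} stage(s), {area} um^2")
```

The search needs one synthesis run per pipeline depth and per functional unit, which is the cost that grows with the latency spec. The depths it selects (3 for S/R, 1 for Shuffle, 3 for Mpy) agree with the pipelined-stage column of the temporal-optimized version in Table 3-13.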

(63) Table 3-14 Area reduction from temporal optimization

Freq. (MHz)   Ref (μm2)   Proposed (μm2)   Improvement
800           231180      224884           2.72%
700           227291      192097           15.48%
600           222731      177805           20.17%
500           220023      174614           20.64%
400           211618      169417           19.94%
300           205309      160588           21.78%
200           204331      157005           23.16%
100           200748      154830           22.87%

Figure 3-16 Improvement by our proposed design flow (area in um^2 of the Ref. and Proposed versions versus frequency in MHz)
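As a small worked check, the "Improvement" row of Table 3-14 can be reproduced from the Ref and Proposed areas, assuming the improvement is defined as the area reduction relative to the reference version. The names REF_AREA, PROPOSED_AREA and area_reduction_pct are hypothetical; the areas are copied from Table 3-14.

```python
# Total datapath area (um^2) per target frequency (MHz), copied from Table 3-14.
REF_AREA = {800: 231180, 700: 227291, 600: 222731, 500: 220023,
            400: 211618, 300: 205309, 200: 204331, 100: 200748}
PROPOSED_AREA = {800: 224884, 700: 192097, 600: 177805, 500: 174614,
                 400: 169417, 300: 160588, 200: 157005, 100: 154830}


def area_reduction_pct(ref, proposed):
    """Area reduction of the proposed version relative to the reference, in %."""
    return 100.0 * (ref - proposed) / ref


# Reduction per frequency, rounded to two decimals as in the table.
reductions = {freq: round(area_reduction_pct(REF_AREA[freq], PROPOSED_AREA[freq]), 2)
              for freq in REF_AREA}
mean_reduction = sum(reductions.values()) / len(reductions)

if __name__ == "__main__":
    for freq, pct in reductions.items():
        print(f"{freq} MHz: {pct:.2f}%")
    print(f"mean: {mean_reduction:.2f}%")
```

Under this definition the per-frequency values match the table, and the 800 MHz case stands out as the only one far below the others.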

(64)

(65) 4. SILICON IMPLEMENTATION. This chapter implements the design with a cell-based flow and reports the area and timing after physical implementation. It has two parts: the first describes the implementation flow, and the second presents the implementation result of the SPU and the layout of the SPU chip.

(66) 4.1 Implementation Design Flow

We design the SPU processor in the UMC 90nm 1P9M Low-K process. Figure 4-1 is a flow chart of the design flow for our SPU design.

μ-architecture design → RTL coding → RTL simulation → Synthesis → Gate-level simulation → Physical design
Figure 4-1 Implementation flow

Figure 4-2 illustrates the I/O interface. The SPU has 32KB of on-chip instruction memory and 64KB of on-chip data memory. The datapath is dual-issue, as shown in Figure 4-3. With the design flow proposed in the last chapter, we can systematically decide the number of pipelined stages of every functional unit to design an area-efficient micro-architecture. The S/R and Mul units are each pipelined into two stages to meet the timing constraints. Because the critical path is determined by the memory modules provided in the cell library, we set 400 MHz (2.5ns) as the operating frequency of our SPU core. The LS pipe is pipelined into 5 stages because the data memory has a 32-bit output, so the data must be gathered, and gathering it takes four cycles.

(67) After the result of each functional unit becomes available, we use a shared bypassing register to pass on the result of every functional unit. The micro-architecture proposed in the last chapter defines the pipeline stages, which facilitates the RTL design. The forwarding paths are then taken into consideration to avoid redundant routing paths. On the basis of the pipeline stages and the I/O definition, we define the micro-operations of all instructions in every pipeline stage, which further defines the hardware resources. After this analysis and design, the RTL (register transfer level) model can be built. We run behavioral-level simulation on the RTL model with the NC-Verilog simulator. Once the RTL code is bug-free, we use the synthesis tool (Synopsys Design Compiler) to synthesize it into a gate-level netlist. Gate-level simulation is then performed to ensure that the logic gates work correctly. When the gate-level netlist is ready, we use Cadence SoC Encounter to implement the physical design.

Figure 4-2 Our SPU interface
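The data gather in the LS pipe described above can be pictured with a minimal behavioral sketch. It assumes that one 128-bit operand is assembled from four consecutive 32-bit reads of the data memory, one read per cycle, and that the first word read is the most significant. The 128-bit width follows from four 32-bit reads, while the word ordering and the names gather_quadword and memory are assumptions made for illustration and are not taken from the thesis RTL.

```python
WORD_BITS = 32          # width of the data memory output stated in the text
WORDS_PER_OPERAND = 4   # four cycles of gathering, one 32-bit word per cycle


def gather_quadword(memory, base_index):
    """Assemble one 128-bit value from four consecutive 32-bit memory words.

    Each loop iteration models one MEM cycle. The first word read ends up in
    the most significant position (an assumption of this sketch).
    """
    value = 0
    for cycle in range(WORDS_PER_OPERAND):
        word = memory[base_index + cycle] & 0xFFFFFFFF  # keep 32 bits only
        value = (value << WORD_BITS) | word
    return value


if __name__ == "__main__":
    memory = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
    print(hex(gather_quadword(memory, 0)))
```

In this picture, the five LS stages are one address-generation stage followed by the four memory stages that perform the gather, which corresponds to the AG stage and the four MEM stages drawn in Figure 4-3.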

(68) Figure 4-3 Pipeline diagram of our SPU (forward unit with producers 1 to 10 and one consumer; even pipe and odd pipe; ADD/SUB, Logical, Cmp, Mask, S/R, MUL and Shuffle units; LS pipe with an AG stage and four MEM stages)

4.2 Implementation Result

Table 4-1 gives the area and timing of the synthesis result, which uses the UMC 90nm 1P9M Low-K process under worst-case operating conditions. Figure 4-4 shows the layout of our SPU_CHIP. The processor is pipelined into 10 stages (4 instruction pipeline stages and 6 execution pipeline stages). According to the simulation and APR results, SPU_CHIP can operate at 400MHz with a core size of 2.5mm x 2.5mm.

(69) Table 4-1 Synthesis result

Technology        UMC 90nm 1P9M Process Low-K
Total area        927,326
Die size          2.5mm x 2.5mm
Operating freq.   400 MHz
Power             320 mW

Figure 4-4 Implementation result (layout blocks: 64 KB Data Mem., SPU_DSP, 32 KB Instr. Mem.)

(70)

(71) 5. CONCLUSION & FUTURE WORKS. In this thesis, we proposed a two-phase design flow that explores the optimization space of a specific ISA implementation for embedded applications. The flow obtains the optimal micro-architecture under various timing constraints while keeping the software support. The first phase is spatial optimization, which includes function modeling and cycle-accurate modeling. The second phase is temporal optimization, which explores the latency by building a mathematical formulation. With this formulation, we can obtain the area-efficient micro-architecture systematically. We take the Cell SPU as our design example because it is a data-oriented processor that exposes long latency. Our proposed design flow is more area-efficient than the ad-hoc method of designing a micro-architecture. The experimental results show that, under timing constraints from 100 MHz to 800 MHz, it is 15% to 20% more area-efficient than the micro-architecture taken directly from reference [7]. Besides, our proposed design flow can be applied to the synthesis of ASIPs (Application