The Design of a Parallel Data Path with Dynamic Power Management


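National Chiao Tung University, Department of Communication Engineering, Master's Program. Master's Thesis. The Design of a Parallel Data Path with Dynamic Power Management. Student: Jyh-Shiuan Wang. Advisor: Dr. Herming Chiueh. July 2004.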

The Design of a Parallel Data Path with Dynamic Power Management. Student: Jyh-Shiuan Wang. Advisor: Dr. Herming Chiueh. A Thesis Submitted to the Department of Communication Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Communication Engineering. July 2004, Hsinchu, Taiwan.


The Design of a Parallel Data Path with Dynamic Power Management

Student: Jyhshiuan Wang. Advisor: Herming Chiueh. Institute of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan 30050.

Abstract

A very long instruction word (VLIW) processor is a multi-issue processor with many functional units. Recent research [1-3] shows that many VLIW processors can perform baseband digital signal processing (DSP) computations such as the discrete cosine transform (DCT), finite impulse response (FIR) filtering, and motion estimation. These general-purpose DSP computations are very common in platform-based designs and reconfigurable architectures for wireless communication and multimedia applications. If a VLIW processor serves as the microprocessor in a platform-based design or reconfigurable architecture, it can replace some general-purpose DSP blocks and reduce the hardware design complexity. However, power dissipation in a VLIW processor can be a serious problem because of low code density and low hardware utilization.

In this thesis we apply dynamic power management techniques to a 16-bit parallel data path, similar to the data path of a three-issue VLIW processor, to reduce this power overhead. Two techniques, clock gating and voltage separation, are chosen because they suit the parallel data path. Clock gating reduces the power that low code density wastes in the pipeline registers, and voltage separation reduces the power that low hardware utilization wastes in the functional units. We also explore an appropriate design flow for these two techniques with current electronic design automation (EDA) tools. All design and verification are completed in the UMC 0.18 um process, and power analysis is carried out with five test benches. The power dissipation of the parallel data path is successfully reduced, and its power and performance become scalable as expected. A VLIW processor with such a data path is a good option for the microprocessor in a platform-based design or reconfigurable architecture.


Contents

Chapter 1 Introduction
Chapter 2 Background and Related Research
    2.1 Architecture of the Parallel Data Path
    2.2 Dynamic Power Management
        2.2.1 Clock Gating
        2.2.2 Voltage Separation
Chapter 3 System Level Design
    3.1 Parallel Data Path with Dynamic Power Management
    3.2 Instruction Set Design
    3.3 Clock Gating Implementation
    3.4 Voltage Separation Implementation
Chapter 4 Design Flow and Simulation Result
    4.1 CAD Flow
    4.2 Test Configuration
    4.3 Simulation Result
        4.3.1 Clock Gating Simulation
        4.3.2 Test Bench Simulation
Chapter 5 Power Analysis
    5.1 Benchmark Definition
        5.1.1 Matrix Addition
        5.1.2 Idle Process
        5.1.4 Matrix Calculation with Functional Unit MAC
        5.1.5 Matrix Calculation with Functional Unit ALU
    5.2 Simulation Result
        5.2.1 Matrix Addition
        5.2.2 Idle Process
        5.2.3 Matrix Calculation with All Functional Units
        5.2.4 Matrix Calculation with Functional Unit MAC
        5.2.5 Matrix Calculation with Functional Unit ALU
        5.2.6 Comparison
    5.3 Summary
Chapter 6 Conclusion
Bibliography
Appendix A Instruction Set Summary
Appendix B Instruction Format

List of Figures

Figure 1.1. An example of a platform-based design
Figure 1.2.a. An example of reconfigurable architecture
Figure 1.3. A reconfigurable architecture with VLIW processor
Figure 2.1.1. Block diagram of the parallel data path
Figure 2.1.2. Normal pipeline operation
Figure 2.1.3. Pipeline operation when data dependence occurs
Figure 2.1.4. A not-taken branch instruction operation
Figure 2.2.1. Illustration of clock gating
Figure 2.2.2. Timing-critical voltage islands [19]
Figure 3.1.1. The parallel data path with power management unit
Figure 3.2.1. Normal instruction format
Figure 3.2.2. Instruction with power control bits
Figure 3.3.1. Clock gating implementation
Figure 3.3.2. Clock gating on the pipeline registers of ALU
Figure 3.3.3. Clock gating operation timing diagram
Figure 3.4.1. Voltage Separations
Figure 3.4.2. Implementation of voltage separation on ALU
Figure 3.4.3. Implementation of voltage separation with high vt transistors
Figure 4.1.1. Cell-base design flow
Figure 4.1.2. Basic auto place and route flow
Figure 4.1.3. Design flow of this implementation
Figure 4.1.4. A clock gating example
Figure 4.1.5. Timing diagram of Figure 4.1.4
Figure 4.1.6. Design flow for this implementation of auto P&R
Figure 4.1.7. Layout of the parallel data path with dynamic power management
Figure 4.3.1. Simulation of instruction fetch
Figure 4.3.2. The simulation of pipeline operation with power management units
Figure 4.3.3. Matrix A and matrix B
Figure 4.3.4. The operation flow of this chip
Figure 4.3.5. The operation of WRITE mode
Figure 4.3.6. Insertion of matrix A
Figure 4.3.7. Insertion of matrix B
Figure 4.3.8. EXECUTION mode
Figure 4.3.9. Matrix A after execution
Figure 4.3.10. TEST mode
Figure 5.2.1. Total power dissipation of matrix addition
Figure 5.2.2. The power dissipation of the three functional units
Figure 5.2.3. Total power dissipation of idle process
Figure 5.2.4. Total power dissipation of the three functional units
Figure 5.2.5. Total power dissipation of matrix calculation with all functional units
Figure 5.2.6. Total power dissipation of the three functional units
Figure 5.2.7. Total power dissipation of matrix calculation with MAC
Figure 5.2.8. Total power dissipation of the three functional units
Figure 5.2.9. Total power dissipation of matrix calculation with ALU
Figure 5.2.10. Total power dissipation of the three functional units
Figure 5.2.11. The power dissipation of matrix calculation

List of Tables

Table 2.1.1. The instructions corresponding to the three functional units
Table 2.1.2. Description of pipeline stage
Table 4.1.1. EDA tools used in this implementation
Table 4.1.2. Circuit summaries
Table 4.1.3. The definitions of the IO pads
Table 4.3.1. Definitions of the signals in Figure 4.3.1
Table 4.3.2. Definitions of the signals in Figure 4.3.2
Table 4.3.3. The assembly language of the program A=A+B
Table 5.1.1. The assembly language of matrix addition
Table 5.1.2. The assembly language of matrix calculation with all functional units
Table 5.1.3. The assembly language of matrix calculation with MAC
Table 5.1.4. The assembly language of calculation with ALU
Table 5.2.1. Power distribution of matrix addition (without power management)
Table 5.2.2. Power distribution of matrix addition (with clock gating)
Table 5.2.3. The power analysis result of matrix addition
Table 5.2.4. Power distribution of idle process (without power management)
Table 5.2.5. Power distribution of idle process (with clock gating)
Table 5.2.6. The power analysis result of idle process
Table 5.2.7. Power distribution of matrix calculation (without P.M.)
Table 5.2.8. Power distribution of matrix calculation (with clock gating)
Table 5.2.9. The power analysis of matrix calculation with all functional units
Table 5.2.10. Power distribution of matrix calculation with MAC (without P.M.)
Table 5.2.11. Power distribution of matrix calculation with MAC (with C.G.)
Table 5.2.12. The power analysis of calculation with MAC
Table 5.2.13. Power distribution of matrix calculation with ALU (without P.M.)
Table 5.2.14. Power distribution of matrix calculation with ALU (with C.G.)
Table 5.2.15. The power analysis of matrix calculation with ALU

Chapter 1 Introduction

Platform-based design is a reuse-oriented design approach for embedded systems in SoC [4][5][6]. A platform-based design generally contains a common architecture and supporting technologies (intellectual property libraries and development tools) [5]. The common architecture typically includes a microprocessor, memory, and a communication bus. All the intellectual property (IP) blocks in the IP libraries are designed for the same microprocessor and bus protocol. The initial configuration formed by the common architecture can be extended with functional units (F.U.) for different applications. The functional units are embedded DSP blocks such as the discrete cosine transform (DCT), fast Fourier transform (FFT), finite impulse response (FIR) filter, and motion estimation; they can come from the IP libraries of the platform or be designed by the application developer. For example, the common architecture with DCT and motion estimation can be used for video applications, and with FFT and an FIR filter it can be used for wireless communication.

An example of a platform-based design is shown in Figure 1.1. The microprocessor, memory, and bus form the common architecture of this platform. The other devices include the functional units for application-specific extension and other important hardware components such as the I/O controller and bus bridge. The major benefit of platform-based design is reduced design time: because all the devices are based on the same bus protocol and microprocessor, they can be integrated quickly.

A platform-based design is accomplished by hardware/software co-design. The designer partitions the functionality of the application into software and hardware and chooses an appropriate microprocessor and other hardware blocks. The software part runs on the microprocessor and other coprocessors, and the hardware part is executed on the customized signal processing functional units.

In recent research [5], the microprocessor has become the most important element in a platform-based design, as in ARM [7] and Philips Nexperia [8].

Figure 1.1. An example of a platform-based design

Some general-purpose DSP blocks, such as DCT, motion estimation, motion prediction, FIR filter, and Viterbi decoder, are common to many applications. If a platform-based design includes these general DSP blocks, it can be used to implement a reconfigurable architecture. Figure 1.2 gives an example. In Figure 1.2.a, the platform runs a 3G wireless communication application that requires a Viterbi decoder. If the platform then runs an MPEG decoding application that requires a DCT, the user only has to reconfigure the data path of the hardware blocks and load different software on the microprocessor, as shown in Figure 1.2.b. One advantage of a reconfigurable architecture is that both the hardware in use and the data path are reconfigurable, which provides great flexibility across a wide range of applications. Furthermore, if a reconfigurable architecture includes a power management unit, the system can scale its power and performance. When the required performance is high, more functional units are active and consume more power. When the power consumption has to be reduced, the system can reduce the active hardware and turn the unused blocks off through the power management unit. The power and performance of a reconfigurable architecture are therefore scalable if it contains a power management unit.

Figure 1.2.a. An example of reconfigurable architecture

Figure 1.2.b. An example of reconfigurable architecture

A very long instruction word (VLIW) processor is a kind of multi-issue processor that is suitable for high-performance real-time DSP applications [9]. In recent research [1][2][3][9], many VLIW processors provide the functions of those general DSP blocks. A VLIW processor can therefore replace the microprocessor and some general DSP blocks in a reconfigurable architecture, as shown in Figure 1.3. If the VLIW processor is well defined, the design in Figure 1.3 simplifies the reconfigurable architecture because the interconnection among the hardware blocks becomes less complex.

Figure 1.3. A reconfigurable architecture with VLIW processor

However, general VLIW processors have disadvantages such as low code density and low hardware utilization [1][9]. The very long instruction word is the defining characteristic of a VLIW processor, and it makes the pipeline registers wider. Low code density means that power is wasted in the pipeline registers, and low hardware utilization means that much power is wasted in unused functional units. The power wastage is especially serious when an idle process is executed. When the general DSP functional units are not replaced by a VLIW processor, the unused ones can be turned off by the power management unit of the reconfigurable architecture, so power and performance remain scalable. Once these blocks are replaced by a VLIW processor, however, the reconfigurable architecture loses this scalability because of the power issues in the VLIW processor. To overcome the problem, power management must be applied to the VLIW processor. If power management can be applied to the VLIW processor and overcome these power issues, the VLIW processor becomes a good option for the microprocessor in a reconfigurable architecture.

In recent years, several power management methods for SoC design have been proposed, such as variable threshold voltage CMOS (VTCMOS) [10][11], multi-threshold CMOS (MTCMOS) [12][13][14][15], clock gating [16][17], dynamic voltage scaling (DVS) [18][19][20], voltage islands [21][22], and adaptive supply voltage and body bias (ASB) [23][24][25]. Among these methods, clock gating and voltage separation are suitable for a VLIW processor. Clock gating reduces dynamic power dissipation by removing unnecessary clock switching. Low code density wastes a lot of power in a VLIW processor because of the large amount of unnecessary clock switching on the pipeline registers, so clock gating can reduce the overhead of low code density. Furthermore, the low hardware utilization of a VLIW processor means that many functional units are idle during program operation, and idle functional units cause large static power consumption in advanced processes. If the power supplies of the functional units are separated and managed individually, the system can turn off the supply of the unused functional units and reduce the static power overhead caused by low hardware utilization.

In this thesis we design a 16-bit parallel data path that contains three functional units common in VLIW processors: ALU, load/store, and MAC [2][3][9]. The data path is used to simulate the data flow of a three-issue VLIW processor, and clock gating and voltage separation are applied to it. Clock gating disables the unnecessary clock switching on the pipeline registers, and voltage separation turns off the power supply of idle functional units. Even when the code density and hardware utilization are low, the overhead in power dissipation is therefore small. Furthermore, the performance and power of this data path become scalable after these two power management mechanisms are applied. The power dissipation follows the required performance. If the performance requirement is high, all the functional units in the data path are used and the power dissipation is high. If the required performance is moderate, only some of the functional units are used, and the dynamic power management mechanisms reduce the unnecessary power dissipation of the idle elements. The power and performance of the parallel data path thus depend on the system requirement. This scalability makes the design suitable as the data path of a powerful VLIW processor in a reconfigurable architecture.

This thesis focuses on the design and implementation of power management in the parallel data path, which serves as the test vehicle. The remainder of this thesis is organized as follows. Chapter 2 presents the background and related research, including the architecture of the parallel data path and a brief explanation of clock gating and voltage separation. Chapter 3 presents the detailed system design, including the implementation of power management and its impact on the instruction set architecture of the parallel data path. Chapter 4 presents the detailed electronic design automation (EDA) design flow of the implementation. Chapter 5 presents the power simulation of the parallel data path with clock gating and voltage separation. Chapter 6 concludes the thesis.

Chapter 2 Background and Related Research

This chapter presents the architecture of the parallel data path, which is the test vehicle for dynamic power management, and explains the dynamic power management methods that are applied. Section 2.1 covers the functional units, the instruction set design, and the pipeline operation of the parallel data path. Section 2.2 presents the details of the applied dynamic power management methods, clock gating and voltage separation.

2.1 Architecture of the Parallel Data Path

The very long instruction word (VLIW) processor is one type of multi-issue processor. It decreases the cycles per instruction (CPI) by issuing a fixed number of instructions per cycle. In a VLIW processor, the compiler exploits instruction-level parallelism (ILP) and detects data hazards. Instruction issue and instruction scheduling are therefore static, and the hardware cost is much lower than that of other multi-issue processors such as superscalar processors. In recent years, VLIW processors have been widely adopted in the market; TriMedia TM32 [26] and Transmeta Crusoe [27] are both VLIW processors, and Philips Nexperia, one of the leading platforms for multimedia, is based on a MIPS processor with a VLIW processor. The major advantage of a VLIW processor is its low hardware cost; the trade-off is high compiler complexity.

In this research, the test vehicle for dynamic power management is a 16-bit parallel data path with three functional units. Like a VLIW processor, this data path issues three instructions per clock cycle, and the ILP is exploited by the compiler. Before the data path operates, the program is compiled and the instructions are written into the instruction memory. The data path then fetches an instruction from the instruction memory every clock cycle using a program counter. All instructions are executed in a five-stage pipeline: instruction fetch (IF), instruction decode (ID), execution (EX), memory access (MEM), and write back (WB).

The three functional units are the arithmetic logic unit (ALU), the load/store (L/S) unit, and the multiply-accumulate (MAC) unit. The ALU executes simple arithmetic and control instructions. The load/store unit contains an adder that calculates data addresses in memory; it can also execute add and subtract operations. The MAC unit contains a multiplier and an adder and executes multiply and multiply-accumulate operations. The block diagram of the data path is shown in Figure 2.1.1. The instructions corresponding to the three functional units are summarized in Table 2.1.1. The pipeline stages in Figure 2.1.1 are listed in Table 2.1.2. The following paragraphs introduce normal pipeline operation and analyze two special cases that affect the compiler and the pipeline operation of this data path.

Table 2.1.1. The instructions corresponding to the three functional units

Functional Unit    Instructions
ALU                ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, JR, JUMP, JAL, BEQ, HALT
LOAD / STORE       ADDI, SUBI, LOAD, STORE
MAC                MAC, MUL
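The mapping in Table 2.1.1 can be illustrated with a small sketch. It assumes, hypothetically, that each three-issue bundle has exactly one slot per functional unit and that an empty slot is marked with `None`; the names `UNIT_OF`, `pack_bundle`, and `utilization` are introduced here for illustration and are not taken from the thesis. The utilization figure it computes is the fraction of slots that carry a real instruction, which is one simple way to quantify the low hardware utilization discussed in Chapter 1.

```python
# Instruction-to-unit mapping copied from Table 2.1.1.
UNIT_OF = {
    **{op: "ALU" for op in ("ADD", "SUB", "AND", "OR", "XOR", "SLL", "SRL",
                            "SRA", "SLT", "JR", "JUMP", "JAL", "BEQ", "HALT")},
    **{op: "LS" for op in ("ADDI", "SUBI", "LOAD", "STORE")},
    **{op: "MAC" for op in ("MAC", "MUL")},
}
SLOTS = ("ALU", "LS", "MAC")  # assumed: one issue slot per functional unit


def pack_bundle(opcodes):
    """Place up to three opcodes into their unit slots; None marks an empty slot."""
    bundle = dict.fromkeys(SLOTS)
    for op in opcodes:
        unit = UNIT_OF[op]  # raises KeyError for an opcode outside Table 2.1.1
        if bundle[unit] is not None:
            raise ValueError(f"{unit} already holds {bundle[unit]}, cannot add {op}")
        bundle[unit] = op
    return bundle


def utilization(bundles):
    """Fraction of issue slots that hold a real instruction."""
    used = sum(op is not None for b in bundles for op in b.values())
    return used / (len(SLOTS) * len(bundles))


# Three bundles: fully used, one slot used, completely idle.
program = [pack_bundle(["ADD", "LOAD", "MUL"]), pack_bundle(["LOAD"]), pack_bundle([])]
print(utilization(program))
```

Here four of the nine slots are occupied; the five empty slots correspond to functional units and pipeline register fields that do no useful work in those cycles.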

Figure 2.1.1. Block diagram of the parallel data path

Table 2.1.2. Description of pipeline stage

Component                  Description
Pipeline register – IF     The first pipeline stage; saves the instruction fetched from memory.
Decoder                    Decodes the opcode of the instruction in Pipeline register – IF and decides the register file access.
Control unit1              Controls the register file access of every instruction.
Register file              A 16 x 32 register file that holds data during program execution.
Pipeline register – EX     Saves the decoded instruction and the necessary data from the register file.
Pipeline register – MEM    Saves the execution result for memory access.
Memory                     A 16 x 256 one-port SRAM.
Control unit2              Controls the register file access of every instruction.
Pipeline register – WB     Saves the data after execution and the data from memory for writing back to the register file.

When an instruction is fetched from the instruction memory, it is saved in Pipeline register – IF in the first cycle. In the second cycle, the decoder decodes the opcode of the instruction and reads the necessary data from the register file. In the third cycle, the functional units process the data according to the decoding result. In the fourth cycle, the data path accesses memory if the instruction is a load/store or control instruction. In the final cycle, the data path writes the data generated in the third and fourth cycles back to the register file. Every instruction is executed in five clock cycles, and the timing diagram is shown in Figure 2.1.2.

Clock cycle          1      2      3      4      5      6
Instruction i        IF     ID     EX     MEM    WB
Instruction i+1             IF     ID     EX     MEM    WB
Instruction i+2                    IF     ID     EX     MEM
Instruction i+3                           IF     ID     EX

Figure 2.1.2. Normal pipeline operation

Figure 2.1.2 shows the timing diagram of the data path when there is no data dependence. If there is a data dependence between two consecutive instructions, the two instructions cannot be fetched back to back, or the latter one will read the wrong data. Data dependences are detected when the program is compiled, and the compiler schedules the instructions to avoid this situation. If a data dependence still exists after scheduling, the compiler inserts stall instructions. An example is shown below:

Instruction i   : ADDI R1, R2, R3
Instruction i+1 : ADDI R4, R1, R5

Instruction i produces a result that is used by Instruction i+1. If Instruction i+1 is decoded before Instruction i writes back its result, a data hazard occurs. To avoid this situation, the compiler must insert two stall instructions between these two instructions. The instruction execution timing diagram is depicted in Figure 2.1.3.

Clock cycle          1      2      3      4      5      6
Instruction i        IF     ID     EX     MEM    WB
Stall Instruction           Stall  Stall  Stall  Stall  Stall
Stall Instruction                  Stall  Stall  Stall  Stall
Instruction i+1                           IF     ID     EX

Figure 2.1.3. Pipeline operation when data dependence occurs

Besides data dependence, there is one more case in which the compiler inserts a stall instruction: when a control instruction such as a branch is generated. A control instruction may change the program counter, and whether it does is known only in the ID stage. The next instruction therefore should not be fetched before the control instruction is decoded. Figure 2.1.4 shows the timing diagram of this case; the compiler inserts a stall instruction after the branch instruction.
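The two stall rules described above (two stalls between an adjacent producer and consumer, one stall after a control instruction) can be sketched as a small compiler pass. This is only an illustration under stated assumptions: it treats the program as a single instruction stream rather than three-issue bundles, it generalizes the adjacent-pair rule to "a consumer must issue at least three slots after its producer", and it takes JR, JUMP, JAL, and BEQ from Table 2.1.1 to be the control instructions. The names `insert_stalls`, `MIN_GAP`, and `CONTROL_OPS` are hypothetical.

```python
CONTROL_OPS = {"JR", "JUMP", "JAL", "BEQ"}  # assumed control instructions (Table 2.1.1)
STALL = ("STALL", None, ())
MIN_GAP = 3  # assumed: consumer issues >= 3 slots after its producer (2 stalls if adjacent)


def insert_stalls(program):
    """program: list of (opcode, dest_reg_or_None, source_regs). Returns a padded list."""
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for op, dest, srcs in program:
        # Pad with stalls until every source register has been written back.
        for reg in srcs:
            if reg in last_write:
                while len(out) - last_write[reg] < MIN_GAP:
                    out.append(STALL)
        out.append((op, dest, srcs))
        if dest is not None:
            last_write[dest] = len(out) - 1
        if op in CONTROL_OPS:
            out.append(STALL)  # the next fetch waits until the branch is decoded
    return out


# The data-dependence example from the text: R1 is written, then read.
hazard = [("ADDI", "R1", ("R2", "R3")), ("ADDI", "R4", ("R1", "R5"))]
print([op for op, _, _ in insert_stalls(hazard)])

# A branch followed by an independent instruction.
branch = [("BEQ", None, ("R1", "R2")), ("ADD", "R3", ("R4", "R5"))]
print([op for op, _, _ in insert_stalls(branch)])
```

For the first program the pass yields the instruction order of Figure 2.1.3 (ADDI, two stalls, ADDI); for the second it places a single stall after the branch.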

(28) Clock cycle. Instruction. 1. 2. 3. 4. 5. Branch. IF. ID. EX. MEN. WB. Stall. Stall. Stall. Stall. Stall. IF. ID. EX. MEN. IF. ID. Stall Instruction. Instruction i+1. Instruction i+2. 6. EX. Figure 2.1.4. A not-taken branch instruction operation. 2.2 Dynamic Power Management The concept of dynamic power management in proposed design is to provide a scalable performance according to a real-time system requirement. In such design, some dynamic power management mechanisms such as multi-threshold CMOS (MTCMOS) technology [10], clock gating [14], voltage islands [19] [20], are proposed in recent years. We use two power management methods, clock gating and voltage separation, which are suitable for parallel data path in VLIW processor. Clock gating is an efficient solution to reduce the power dissipation caused by clock and voltage separation is a sub-step of voltage islands. In the following section, we will briefly explain those two power management.. 2.2.1 Clock Gating In the modern high performance VLSI design, the power dissipation due to clock tree is always the domination of dynamic power. This is because the clock tree represents a very large load due to that the clock signal switches all the time. As the incensement in operation frequency, the power consumed by clock can grow ever larger. But in fact, a portion of this power consumption is wasted, and this is caused. 12.

by unnecessary switching activity. For example, if the input data and output data of a flip-flop are identical, switching its clock only wastes power. In practice, not all clocked elements in a chip change their data in every clock cycle, yet the clock switching consumes significant power even when the data in a storage element remains unchanged. If the clock signal in a digital design can be controlled, the power wasted by unnecessary switching can be saved. A popular and effective way to do this is clock gating, a dynamic power management method that eliminates unnecessary switching by gating the clock signal with a qualifying signal. For example, the clock can be gated through an AND gate, as shown in Figure 2.2.1. The qualifying signal is set high only when something useful is computed in a given clock cycle; otherwise it remains low. The clock to a clocked element is thus controlled, and unnecessary clock switching is avoided.

Figure 2.2.1. Illustration of clock gating.
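The AND-gate scheme described above can be illustrated with a tiny bit-level model. This is a minimal sketch under assumptions: the sampled waveforms and the helper names `gated_clock` and `count_rising_edges` are made up for the illustration, and the qualifying signal is assumed to change only while the clock is low so that no glitches appear.

```python
def gated_clock(clk, enable):
    """AND-gate clock gating: the clock passes only while enable is high."""
    return clk & enable

def count_rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the edges a positive-edge register sees."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)

# Hypothetical sampled waveforms, two samples per clock cycle.
clk    = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 0, 1, 1]  # useful work only in the first and last cycle

gated = [gated_clock(c, e) for c, e in zip(clk, enable)]
free_running_edges = count_rising_edges(clk)
gated_edges = count_rising_edges(gated)
```

In this example the gated register sees fewer clock edges than the free-running clock provides, which is where the saving in switching power comes from.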

2.2.2 Voltage Separation

Voltage islands were introduced by IBM in 2002 [19]. They are a dynamic power management method used to optimize the power supply of individual functional units in an SoC design. Voltage separation is a sub-step of voltage islands and can be accomplished with current EDA tools. In this thesis, we apply voltage separation to the three functional units so that the power supply of each unit can be managed individually. This section briefly explains voltage islands and voltage separation.

In earlier years, large functional units were not integrated on a single chip, so different functional units could be supplied at different voltage levels and the power supply could be optimized. In the era of SoC design, however, more and more functional units are integrated on a single chip. If such a chip is still supplied at a single fixed voltage level, it loses the power supply optimization and flexibility that the earlier hybrid solution offered.

The voltage islands technique supplies multiple voltage levels within a single chip. A complex SoC design commonly consists of a number of functional units, not all of which are active at any given time. To optimize the power supply, the functional units in an SoC design are separated into different islands according to their power characteristics, and every island has its own power control unit. Through this unit, the system can turn off the power supply when the corresponding island is idle and turn it on again when the island needs to be active. An example of voltage islands is shown in Figure 2.2.2 [19]. In this chip, the most performance-critical functional units need a 1.2 V supply to meet the required performance, while the remaining elements, including the memory, can meet their timing requirements with a 1.0 V supply.

Figure 2.2.2. Timing-critical voltage islands [19].

Without the voltage islands technique, the entire chip is supplied at 1.2 V because of the requirement of the high-performance elements. This wastes power, especially when those functional units are active for only a small percentage of the operation time. Figure 2.2.2 shows a solution to this problem. The high-performance elements are separated into a voltage island with its own 1.2 V supply, and the rest of the design is supplied at only 1.0 V. Because of the Vdd^2 term in the active power equation, the power of the design can be greatly reduced. Furthermore, the voltage islands technique can also switch the power supply of each island. For example, if an island is active for only 2% of the operation time, the power management unit of the voltage islands can turn off its power supply for the remaining 98% of the operation time, so the power consumption can be optimized.

Voltage islands are a useful technique for reducing the power consumption of an SoC design, but there are some difficulties in implementation. The following paragraphs introduce these issues.
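As a rough illustration of the Vdd^2 dependence mentioned above, assume the usual first-order model of active power. The symbols for switching activity, switched capacitance, and clock frequency are not defined in this thesis and are introduced here only for the illustration. Lowering a block from 1.2 V to 1.0 V at unchanged activity, capacitance, and frequency then gives:

```latex
P_{\mathrm{active}} = \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{1.0\,\mathrm{V}}}{P_{1.2\,\mathrm{V}}}
  = \left(\frac{1.0}{1.2}\right)^{2} \approx 0.69
```

Under these assumptions, the logic moved to the 1.0 V island would dissipate roughly 31% less active power. The saving for the whole chip depends on how much of the design sits in each island.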

A traditional implementation flow for an SoC design includes the following steps:

1. Architecture design
2. Functional implementation (RTL)
3. Synthesis and timing consideration
4. Timing verification and simulation
5. Floor planning and physical design
6. Final timing verification and tape out

When implementing a voltage islands design, additional considerations affect each step of the design flow. They are briefly described below.

Functional partition: The designer should partition the functional components of the design into islands according to their power characteristics, and each voltage island should be written as an individual RTL module. This step should consider the performance requirement and the active and inactive periods of each functional component. For example, if component A and component B require the same supply voltage level, are active in the same periods, and their inactive periods exceed the minimum time needed for on-to-off and off-to-on power switching, then the two components can be placed in the same island.

Synthesis and timing consideration: Once the RTL is complete, the designer can synthesize the design. When synthesizing and optimizing the design, the effect of signals crossing between different voltage levels should be taken into consideration. In other words, the major problem is the delay calculation for signal level shifting. Take the design shown in Figure 2.2.2 as an example. A signal path from the 1.0 V level to the 1.2 V level, or from the 1.2 V level to the 1.0 V level,

should include a signal level shifter, and the designer should make sure that the delay introduced by the level shifter does not affect correct operation. Furthermore, each island has its own working voltage range, so the timing calculation differs for each island, and the voltage difference also changes the calculation of clock skew. The problems described above are mostly caused by the different islands and their different voltage levels, and these effects should be taken into account during synthesis.

Physical planning and implementation: To enable independent power supplies, each island should be placed in isolation from the others, and the designer should carefully arrange the power supply of the power management and power control logic of each island. These two components should not be influenced by the power switching of any island.

Logic simulation: When simulating the logic functionality of a voltage islands design, the outputs of powered-off islands should be observed as unknown. The observed state is important for verifying the functionality of the power management and power control logic during on-to-off and off-to-on power switching.

Voltage islands technology provides an opportunity to significantly reduce power consumption and to optimize power for SoCs. However, implementing voltage islands raises issues that must be taken into consideration and that affect the design flow. Some of them are not yet supported by EDA tools and currently have to be handled through manufacturing test. This remains a big challenge for voltage islands implementation.

In this thesis, we implement the functional partition and voltage separation, which are sub-steps of voltage islands and can be accomplished with current EDA tools. In the parallel data path, the three functional units are separated into three different areas

in the same chip, and their voltage supplies are connected individually. Thus, if any functional unit is idle, the power management units can detect the situation and turn off its power supply. Voltage separation operates with only two states, on and off, at the same supply voltage level, so the design can be completed without considering signal level shifting. The detailed design is presented in the next chapter.

Chapter 3 System Level Design

In this chapter, Section 3.1 presents the power management units for clock gating and voltage separation in this data path. These power management units are part of the pipeline registers and are driven by the instruction. The instruction set architecture therefore differs from a traditional one, as presented in Section 3.2. Finally, the detailed design of clock gating and voltage separation in this parallel data path is presented in Sections 3.3 and 3.4.

3.1 Parallel Data Path with Dynamic Power Management

The power consumption of a chip can be separated into two parts, dynamic power and static power. In this data path, the clock tree dominates the dynamic power, and the clock of the pipeline registers accounts for about 50 percent of the power consumed by the clock tree. Because of the low code density, the power wasted in the pipeline registers can be large. To save this part of the dynamic power, we apply clock gating to control the clock of each pipeline register.

Furthermore, there are three major functional units in this parallel data path. While the data path is operating, not all functional units are active, and some may stay idle for long periods because of the low hardware utilization. Power is wasted if we keep supplying power to an idle functional unit. Clock gating of the pipeline register at the ID stage saves the dynamic power consumed there, but leakage power remains and can be significant in high-performance technologies. Because of the low hardware utilization of the parallel data path, and for better power optimization, we implement voltage separation on the three functional units.

Figure 3.1.1 shows the block diagram of the parallel data path combined with the power management unit for clock gating and voltage separation. The power management unit includes power management registers and gate control logic. The first two power management registers consist of 6 bits each; the others contain only 3 bits. Every power management register uses 3 bits for clock gating. If a stall instruction is fetched, this mechanism disables the clock through the control logic gate. The extra 3 bits in the first two power management registers control the power supply of the three functional units. If any functional unit is standing by, the power management register is used to detect the situation and turn off its power supply. Thus we can save the static power of the functional units when they are idle. The static power of the three functional units accounts for about 20 percent of the total static power of the data path.

Figure 3.1.1. The parallel data path with power management unit.

3.2 Instruction Set Design

This data path is a 3-address machine. The instruction format of a traditional 3-address machine includes only four fields: one that specifies the operation, two that specify the addresses of the source data in the register file, and one that specifies the address where the result is stored. This format is shown in Figure 3.2.1. The instruction format of this data path is slightly different from the traditional one because we take power consumption into consideration. Besides the four fields mentioned above, it contains one field that specifies the power control for clock gating and for the power supply of the functional units. This format is shown in Figure 3.2.2.

OP Code | Source and Destination Address
Figure 3.2.1. Normal instruction format.

OP Code | Power Control | Source and Destination Address
Figure 3.2.2. Instruction with power control bits.

The power control field contains two bits: one controls clock gating and the other controls the power supply of the functional unit. Both bits are decided by the compiler. If Instruction i+1 is not a stall instruction, the clock control bit in Instruction i is set to 1; otherwise it is set to 0. The other bit depends on the distribution of stall instructions after the program is fully compiled. If the number of consecutive stall instructions is large enough that the corresponding functional unit stands by for a sufficiently long time, the compiler sets the power supply control bit of these consecutive stall instructions to 0. Otherwise, if the functional unit does not stand by for long enough, the bit is set to 1. We have to emphasize that the power

control bits of each instruction are all decided by the compiler, and the hardware only operates the power management mechanism according to the data in the power control field. When an instruction is executed, its power control bits are saved in the power management registers mentioned in Section 3.1, and the power management of the data path follows the data in these bits.

3.3 Clock Gating Implementation

In this parallel data path, the pipeline registers of the three functional units account for fifty percent of the clocked elements. When a stall instruction is fetched, clock switching on the pipeline registers only wastes power. Turning off the clock during this period does not influence the execution result, but it saves the wasted power. We therefore apply clock gating to all the pipeline registers, as shown in Figure 3.3.1. Figure 3.3.2 shows an example of this implementation for the functional unit ALU.

Figure 3.3.1. Clock gating implementation
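The compiler rule for the two power control bits described in Section 3.2 can be sketched for the instruction stream of a single functional unit. This is a minimal illustration, not the thesis compiler: the function name `assign_power_bits`, the string-based instruction stream, and the threshold `min_idle_run` are assumptions, and the text does not say which clock bit the last instruction of a program receives (0 is assumed here).

```python
STALL = "STALL"

def assign_power_bits(stream, min_idle_run=4):
    """Return (instruction, clock_bit, power_bit) for one functional unit.

    min_idle_run is a hypothetical threshold: the thesis only says the run
    of stalls must be "large enough" to justify switching the supply off.
    """
    n = len(stream)
    # Clock bit of instruction i is 1 when instruction i+1 is not a stall.
    # Assumption: the last instruction has no successor and gets 0.
    clock_bits = [
        1 if i + 1 < n and stream[i + 1] != STALL else 0
        for i in range(n)
    ]
    power_bits = [1] * n
    i = 0
    while i < n:
        if stream[i] != STALL:
            i += 1
            continue
        j = i
        while j < n and stream[j] == STALL:
            j += 1  # j ends one past the run of consecutive stalls
        if j - i >= min_idle_run:
            for k in range(i, j):
                power_bits[k] = 0  # long idle period: supply may be turned off
        i = j
    return list(zip(stream, clock_bits, power_bits))

bits = assign_power_bits(["ADD", STALL, STALL, STALL, "SUB"], min_idle_run=3)
```

Choosing the threshold is the interesting design decision. It has to cover the charging and discharging overheads of the functional unit, which is why the decision is left to the compiler rather than to the hardware.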

Figure 3.3.2. Clock gating on the pipeline registers of ALU

In Figure 3.3.2, pipeline register-IF and pipeline register-EX are the first two stages of the pipeline. Pipeline register-IF saves the instruction fetched from the instruction memory, and pipeline register-EX saves the result after the instruction is decoded. The power management register supplies the qualifying signal and controls the clock to the next pipeline register. The operation timing diagram of Figure 3.3.2 is shown in Figure 3.3.3.

Figure 3.3.3. Clock gating operation timing diagram

As shown in Figure 3.3.3, all the pipeline registers are triggered by the negative edge of the clock, and all the power management registers are positive-edge triggered. When the fetched instruction is not a stall, the power management register is set low. The clock signal can then pass through the OR gate, and pipeline register-EX reads the value at the next negative edge. If the fetched instruction is a stall instruction, the power management register is set high and the signal that triggers pipeline register-EX remains high. The stall instruction is then not passed down the pipeline, and the data in pipeline register-EX is kept until the next useful instruction is fetched.

3.4 Voltage Separation Implementation

The voltage separation of the three functional units is shown in Figure 3.4.1. In this chip, the three functional units are supplied by separate voltage sources, and the power supply is controlled by the power management unit mentioned in Section 3.1. Figure 3.4.2 shows an example for the functional unit ALU. The data path has five pipeline stages. There are two pipeline stages before data arrives at a functional unit, and we use these two stages to define the control of the power supply. As shown in Figure 3.4.2, the pipeline registers save the data during instruction execution. The two flip-flops of power management register-IF and power management register-EX hold the state that controls the power supply. The power supply of the ALU is turned off only in the state "00".
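The OR-gate gating of the negative-edge pipeline registers and the "00" power-off state described above can be modeled in a few lines. This is a sketch under assumptions: the sampled waveforms are invented for the illustration, and the names `or_gated_clock`, `count_falling_edges`, and `alu_power_on` do not come from the thesis.

```python
def or_gated_clock(clk, stall_flag):
    """OR-gate clock gating: while stall_flag is high the output stays high."""
    return clk | stall_flag

def count_falling_edges(wave):
    """Count 1 -> 0 transitions, i.e. the edges a negative-edge register sees."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 1 and b == 0)

def alu_power_on(pm_if_bit, pm_ex_bit):
    """The supply is cut only when both power management flip-flops hold 0."""
    return (pm_if_bit, pm_ex_bit) != (0, 0)

# Hypothetical waveforms, two samples per clock cycle. The stall flag changes
# on rising clock edges because the power management registers are
# positive-edge triggered.
clk        = [1, 0, 1, 0, 1, 0, 1, 0]
stall_flag = [0, 0, 1, 1, 1, 1, 0, 0]  # two stall cycles in the middle

gated = [or_gated_clock(c, s) for c, s in zip(clk, stall_flag)]
edges_seen = count_falling_edges(gated)
```

Because the flag goes high while the clock is already high, the gated clock shows no extra falling edge when a stall begins, and the pipeline register simply keeps its previous contents.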

Figure 3.4.1. Voltage Separations.

Figure 3.4.2. Implementation of voltage separation on ALU

Figure 3.4.3. Implementation of voltage separation with high-Vt transistors

A complete voltage separation design is shown in Figure 3.4.3. Two transistors with a high Vt value and a larger size are used to control the power supply of the functional units in the parallel data path. This design requires a process that supports multi-Vt transistors. However, the process currently provided by CIC [28] does not support this technology, and implementing the power supply controller in the present process is very difficult. For this reason, we implement the power management inside the chip and connect the power controller externally.

When implementing voltage separation, the overheads of circuit charging and discharging are critical factors. One of these overheads is the charging and discharging time. If the time needed to charge a functional unit to a stable state is so long that the unit is not stable before data arrives, the operation will not be correct. According to recent research [15], the time required to charge a design such as the one in Figure 3.4.3 to a stable state is within 4 ns.

This settling time is reasonable for one or two cycles in a low-power system. The other overhead is the power dissipated by circuit charging and discharging. If this dissipation is too large, the overhead may exceed the static power we save. Because of these two overheads, the control of the power supply is very important. We emphasize again that although the control signal for the power supply is determined by the data in the power management registers, this data is part of the instruction and is generated by the compiler, not by the hardware. The compiler therefore keeps great flexibility in arranging the states that control the power supply of the functional units, but such a compiler should take the impact of charging and discharging into consideration. In the future, we will use experiments on this chip to estimate the time needed to charge and discharge the functional units and the power dissipated per charge and discharge. After these experiments, the compiler will be designed according to the results.
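The two overheads discussed above can be put into a back-of-the-envelope check. Only the 4 ns settling time [15] and the 40 MHz work clock rate of this chip come from the thesis. The switching energy and leakage power arguments are hypothetical placeholders, since the thesis leaves their measurement to future experiments, and the function names are introduced here for illustration.

```python
import math

def settle_cycles(settle_ns, clock_mhz):
    """Clock cycles needed to cover the supply settling time."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(settle_ns / period_ns)

def break_even_idle_cycles(switch_energy_nj, leakage_mw, clock_mhz):
    """Idle cycles after which the saved leakage energy exceeds the energy
    spent on one power-off and power-on sequence (both inputs are assumed)."""
    period_ns = 1000.0 / clock_mhz
    # mW * ns = pJ, so scale by 1e-3 to express the saving in nJ per cycle.
    saved_per_cycle_nj = leakage_mw * period_ns * 1e-3
    return math.ceil(switch_energy_nj / saved_per_cycle_nj)

cycles_to_settle = settle_cycles(4.0, 40.0)
# Placeholder numbers only: 1 nJ per off/on sequence, 0.3 mW leakage.
min_idle = break_even_idle_cycles(1.0, 0.3, 40.0)
```

A compiler could use a value like `min_idle` as the threshold for deciding whether a run of stall instructions is long enough to justify turning off a functional unit.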

Chapter 4 Design Flow and Simulation Result

The parallel data path with clock gating and voltage separation is implemented with the cell-based design method. However, implementing a circuit with voltage separation and clock gating requires some extra steps beyond the traditional design flow. Section 4.1 shows the design flow of this chip and explains why it differs from the traditional one. Section 4.2 introduces how to test this chip. Finally, the functional simulation of this chip is shown in Section 4.3.

4.1 CAD Flow

Figure 4.1.1 shows the cell-based design flow. It contains the following steps:

1. Architecture design: This is the first step in designing an integrated circuit (IC). The designer decides the architecture, which includes the detailed data path and the specification.

2. RTL: After the architecture is decided, the designers write the design in a hardware description language (HDL) and verify its functionality.

3. Synthesis: If the functionality is verified, the designers synthesize the design and apply timing, delay, and area constraints to meet the required performance. After this step, simulation with gate delay can be performed.

4. Physical design: After the gate-level simulation result is correct, physical design can start. This step includes two sub-steps. The first is automatic placement and routing (P&R), which is completed with an EDA tool and creates the layout of the design. In a general automatic P&R EDA tool, the basic flow consists of the following steps: specify global net connection, floor planning setup, timing setup, placement and optimization,

synthesis clock tree, connect global nets, routing and optimization, and stream out. This flow is shown in Figure 4.1.2. The second sub-step is post-layout simulation, which takes both gate delay and wire delay into consideration.

5. Physical verification: If the simulation result is correct, the designer can begin physical verification, which includes design rule check (DRC), layout parameter extraction (LPE), and power analysis. These are also the last three steps.

Figure 4.1.1. Cell-based design flow.

Figure 4.1.2. Basic auto place and route flow.

The traditional design flow described above can handle a large part of digital design. However, implementing a circuit with voltage separation and clock gating requires some extra steps. Figure 4.1.3 shows the design flow of the parallel data path with voltage separation and clock gating. The remainder of this section presents the differences.

Figure 4.1.3. Design flow of this implementation

Functional partition based on power: When implementing a voltage separation design, the designer should partition the architecture into different parts based on the power characteristics of each part before writing the HDL. The partition

consideration includes the active period of each part and its required performance. In this data path, we implement voltage separation on the three functional units, so the data path is partitioned into five parts: the functional unit ALU, the functional unit LS, the functional unit MAC, the memory, and the remaining elements. Each part is written as an individual Verilog module, which helps when separating the parts in the physical design stage.

Set clock gating constraint during synthesis: The delay of the control signal must be constrained if a design contains a gated clock. Figure 4.1.4 gives an example with two flip-flops. FF1 is negative-edge triggered and FF2 is positive-edge triggered. The output of FF1 passes through a combinational logic network and forms the control signal of the gated clock to FF2. Figure 4.1.5 shows the timing diagram of this example. In the ideal case there is no problem, but if the delay of the combinational logic network is too long, the gated clock will not operate correctly. The delay of the control signal must therefore be constrained during synthesis.

Figure 4.1.4. A clock gating example.
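The constraint on the control signal described above can be expressed as a simple budget check. The half-period bound is an assumption made for this sketch: since FF1 launches on the falling edge and FF2 captures on the rising edge, the control signal is assumed to have to settle within half a clock period minus a margin. The thesis does not state the exact bound, and the names `max_control_delay_ns`, `control_path_ok`, and `margin_ns` are hypothetical.

```python
def max_control_delay_ns(clock_mhz, margin_ns=0.0):
    """Assumed budget: half a clock period minus a setup margin."""
    half_period_ns = 500.0 / clock_mhz
    return half_period_ns - margin_ns

def control_path_ok(delay_ns, clock_mhz, margin_ns=0.0):
    """True when the combinational delay of the control path fits the budget."""
    return delay_ns <= max_control_delay_ns(clock_mhz, margin_ns)

# At the 40 MHz work clock rate of this chip the clock period is 25 ns.
budget_ns = max_control_delay_ns(40.0)
fits = control_path_ok(10.0, 40.0, margin_ns=1.0)
too_slow = control_path_ok(12.0, 40.0, margin_ns=1.0)
```

In a real flow this check is delegated to the synthesis tool through a timing constraint on the gating logic; the sketch only shows the arithmetic behind such a constraint.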

Figure 4.1.5. Timing diagram of Figure 4.1.4.

Power line arrangement during P&R: This is a sub-step of floor planning setup. The designer should arrange the power line distribution in the chip. This step decides the power pad placement and affects the placement of the functional units.

Functional units and power management units grouping during P&R: The standard cells belonging to each functional unit should be grouped individually, so that the EDA tool can separate the functional units and place them as the designer wishes. We also group the standard cells of the power management units for clock gating, because these cells should be placed in the center of the data path for better clock tree synthesis.

Functional units placement during P&R: The designer should specify the areas in which the functional units are placed. Each functional unit should be placed close to its power supply to reduce the length of the power nets, so this step is influenced by the power line arrangement.

Power management units placement during P&R: The power management units should not be placed in the areas of the three functional units, and they are better placed in the center of the data path for better clock tree synthesis.

Clock buffer placement constraint during P&R: When the EDA tool synthesizes the clock tree, the designer should make sure that clock buffers are not placed in the three areas of the functional units, because the clock tree should not be affected by the power switching of the three functional units. The area for clock buffer placement must therefore be constrained when the clock tree is synthesized.

The last five differences concern automatic P&R and are shown in Figure 4.1.6.

Figure 4.1.6. Design flow for this implementation of auto P&R

The EDA tools used in this implementation are summarized in Table 4.1.1. The whole design is synthesized with Synopsys Design Analyzer, and the layout is generated with Synopsys Apollo. The characteristics of the chip are summarized in Table 4.1.2, and the layout is shown in Figure 4.1.7.

Table 4.1.1. EDA tools used in this implementation

Step                    EDA tool         Provider
RTL                     Verilog          Cadence
Behavior simulation     Debussy          Novas
Synthesis               Design Analyzer  Synopsys
Gate level simulation   Debussy          Novas
Auto P&R                Apollo           Synopsys
Post-layout simulation  Debussy          Novas
DRC                     Calibre-DRC      Mentor Graphics
LVS                     Apollo           Synopsys
LPE                     Calibre-LPE      Mentor Graphics
Power analysis          NANOSIM          Synopsys

Table 4.1.2. Circuit summaries

Technology         UMC 0.18 um Mixed Signal (1P5M) CMOS
Library            Artisan SAGE-X Standard Cell Library
Pad Core Size      2.2 mm x 1.8 mm
Core Size          1.535 mm x 1.073 mm
On-Chip Memory     1 x 7x64 single-port SRAM
                   4 x 16x64 single-port SRAM
                   1 x 16x256 single-port SRAM
Gate Count         45328
Work Clock Rate    40 MHz
Input Pad          40 pins
Output Pad         22 pins
Power Pad          22 pins
Power dissipation  20 mW

Figure 4.1.7. Layout of the parallel data path with dynamic power management

There are six memories in this chip. The 7x64 and the four 16x64 single-port SRAMs are the instruction memories, which hold the instructions for execution. The 16x256 single-port SRAM is the data memory, which holds the data required for program execution. All memories in this chip are supplied by power supply 1, the functional unit ALU by power supply 2, the L/S by power supply 3, the MAC by power supply 4, and the remaining elements by power supply 5. The chip contains 84 IO pads: 40 input pads, 22 output pads, and 22 power pads. The definitions of the IO pads are summarized in Table 4.1.3.

Table 4.1.3. The definitions of the IO pads

IO pad        IO      Function
clk           Input   The clock signal to this chip.
reset         Input   The reset signal to this chip.
data_in       Input   16-bit input. The user can write data into the memory in this chip through these ports.
IS_sellect    Input   5-bit input. There are five instruction memories in this chip; this input specifies which one is written or tested.
address_mode  Input   8-bit input. The user specifies the address in the memory to be written or tested through this port.
I_RM          Input   1-bit input. "0" means that the write or test target is the instruction memory. "1" means that the write or test target is the data memory or the register file.
R_M           Input   1-bit input. "0" means that the write or test target is the instruction memory. "1" means that the write or test target is the data memory or the register file.
data_out      Output  16-bit output. The user can read the data in the memory or the register file from this port in TEST mode.
VI_ALU        Output  The output of the power management for the ALU; this signal controls the power supply to the ALU.

VI_LS         Output  The output of the power management for the LS; this signal controls the power supply to the LS.
VI_MAC        Output  The output of the power management for the MAC; this signal controls the power supply to the MAC.
TM            Input   Input port for scan chain test.
test_en       Input   Input port for scan chain test.
test_si       Input   Input port for scan chain test.
test_so       Output  Output port for scan chain test.
Power 1       Power   The power supply to the memory in this chip.
Power 2       Power   The power supply to the ALU in this chip.
Power 3       Power   The power supply to the LS in this chip.
Power 4       Power   The power supply to the MAC in this chip.
Power 5       Power   The power supply to the data path in this chip.
Pad power     Power   The power supply to the IO pads.

4.2 Test Configuration

This section presents the test configuration. For testability, this data path can be operated in three different modes: WRITE mode, EXECUTION mode, and TEST mode. When the chip executes a program, it operates in the order WRITE, EXECUTION, TEST. The actions of the three modes and the test method are listed below.

WRITE mode: In this mode, the user loads the instructions into the instruction memory and the data into the data memory through the input port data_in. With the combination of the input ports IS_sellect, address_mode, I_RM, and R_M, the user decides which memory element is the write target.

EXECUTION mode: After the instructions and the required data have been loaded into the chip, the data path can begin to execute the program. In this mode, VI_ALU, VI_LS, and VI_MAC show the state saved in the power management units for voltage separation. The user can analyze these three outputs with a logic analyzer and check whether they are identical to the result of the compiler. While the data path is in EXECUTION mode, the user can also use a power meter to measure the power dissipation of the five power supplies of this chip.

TEST mode: After program execution is finished, the user can check the data in the data memory and the register file of this chip. With the combination of the input ports address_mode, I_RM, and R_M, the user can read the data from the output port data_out and use a logic analyzer to examine whether the execution result is correct.

4.3 Simulation Result

This section shows the functional simulation of this chip. The simulations include the clock gating operation and the execution of a program that adds two matrices. Section 4.3.1 shows the clock gating operation in the five-stage pipeline. In Section 4.3.2, a test bench for the addition of two matrices is executed and the simulation of the three operation modes of this chip is shown. All simulations are post-layout simulations, and the operating frequency is 40 MHz.

4.3.1 Clock Gating Simulation

In this section, we simulate the functionality of clock gating and the control signal for voltage separation in the data path, to make sure that instruction execution and pipeline operation remain correct after these two dynamic power management methods are implemented. First, the following instructions are executed.

(103) 1. SUBI R5 R4 #1
2. ADDI R1 R0 #1
3. LW R28 R4 #0
4. STALL
5. LW R29 R5 #0
6. STALL
7. STALL

Figure 4.3.1. Simulation of instruction fetch

Figure 4.3.1 shows the post-layout simulation result of the instruction fetch and presents the operation of clock gating in this stage. OP_LS_IF, RD_LS_IF, RS_LS_IF, and imm_LS_IF are the pipeline registers for functional unit L/S in the instruction fetch stage. They save the instruction fetched from the instruction memory. The detailed definitions of the signals in Figure 4.3.1 are summarized in Table 4.3.1.

(104) Table 4.3.1. Definitions of the signals in Figure 4.3.1

clk_LS: The clock signal of the pipeline registers for functional unit L/S in the IF stage.
OP_LS_IF: The pipeline register that saves the OP code of the instruction: 4 for SUBI, 5 for ADDI, and 2 for LW.
RD_LS_IF: The pipeline register that saves the destination register address.
RS_LS_IF: The pipeline register that saves the source register address.
imm_LS_IF: The integer used to calculate the data address in memory.

The clock signal switches only when the fetched instruction is not a stall instruction; otherwise it remains in the high state. The simulation result is identical to that shown in Figure 3.3.2. Clock gating therefore operates correctly in a single pipeline stage, and the pipeline register reads only the instructions that are actually executed.

Next, Figure 4.3.2 presents the simulation of pipeline operation with clock gating and the state detection for voltage separation, using the functional unit L/S as an example. The data flow through all the pipeline registers is also shown. The signals in G1 represent the pipeline registers in the IF stage, G2 those in the ID stage, G3 those in the MEM stage, and G4 those in the WB stage. The detailed definitions of the signals in Figure 4.3.2 are summarized in Table 4.3.2.
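The gating rule stated for Figure 4.3.1 (the clock switches only for non-stall instructions, so the pipeline register keeps its previous contents during a stall) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the thesis's circuit: the function name simulate_if_register and the one-instruction-per-cycle abstraction are hypothetical, while the instruction sequence and the OP codes are taken from the listing for Figure 4.3.1 and from Table 4.3.1.

```python
# Behavioral sketch of a clock-gated IF pipeline register (illustrative only).
STALL = "STALL"

# OP codes as listed in Table 4.3.1.
OP_CODES = {"SUBI": 4, "ADDI": 5, "LW": 2}


def simulate_if_register(program):
    """Return one (clock_switches, OP_LS_IF) pair per fetched instruction.

    Assumption: one instruction is fetched per cycle, and the register is
    updated only in cycles where the gated clock switches.
    """
    op_ls_if = None  # register contents before the first fetch
    trace = []
    for instr in program:
        clock_switches = instr != STALL  # gating condition from the text
        if clock_switches:
            op_ls_if = OP_CODES[instr.split()[0]]
        # During a stall the clock does not switch and the value is held.
        trace.append((clock_switches, op_ls_if))
    return trace


program = [
    "SUBI R5 R4 #1",
    "ADDI R1 R0 #1",
    "LW R28 R4 #0",
    "STALL",
    "LW R29 R5 #0",
    "STALL",
    "STALL",
]

trace = simulate_if_register(program)
for cycle, (clock_switches, op_ls_if) in enumerate(trace, start=1):
    state = "switches" if clock_switches else "held high"
    print(f"cycle {cycle}: clk_LS {state}, OP_LS_IF = {op_ls_if}")
```

In this model the register is clocked four times for seven fetch slots, which is the source of the dynamic power saving that clock gating targets.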

(105) Table 4.3.2. Definitions of the signals in Figure 4.3.2

G1
clk_LS: The gated clock signal for the pipeline registers in the IF stage.
RD_LS_IF: The pipeline register that saves the destination register (RD) address.
RS_LS_IF: The pipeline register that saves the source register (RS) address.
imm_LS_IF: The integer used to calculate the data address in memory.
OP_LS_IF: The pipeline register that saves the OP code of the fetched instruction.

G2
clk_LS: The gated clock signal for the pipeline registers in the EX stage.
WB_LS_EX: A part of the instruction decoding result that controls the access to the register file at the WB stage.
MEM_LS_EX: A part of the instruction decoding result that controls the access to MEMORY at the MEM stage.
EX_LS_EX: A part of the instruction decoding result that decides the operation at the EX stage.
RDdata_LS_EX: The pipeline register that saves the data at the destination address. If the instruction is STORE, the data here will be written to MEMORY.
RSdata_LS_EX: The pipeline register that saves the data at the source address, which is used to calculate the data address in memory.
imm_LS_EX: The integer used to calculate the data address in memory,

(106) and the data in this part comes directly from imm_LS_IF.

G3
clk_LS: The gated clock signal for the pipeline registers in the MEM stage.
WB_LS_MEM: The data in this part comes directly from WB_LS_EX.
MEM_LS_MEM: Controls the access to memory. The data in this part comes directly from MEM_LS_EX.
RDdata_LS_MEM: The data in this part comes directly from RDdata_LS_EX. If the instruction is STORE, the data here will be written to MEMORY.
result_LS_MEM: The result of the functional execution. The data in this register may be the data address in memory or the data that will be written back to RD.
RD_LS_MEM: The data in this part comes directly from RD_LS_EX.

G4
clk_LS: The gated clock signal for the pipeline registers in the WB stage.
WB_LS_WB: The data in this part comes directly from WB_LS_MEM and controls the access to the register file.
result_LS_WB: The data in this part comes directly from result_LS_MEM. If the instruction is ADDI or SUBI, the data will be written back to RD.
MEMdata_LS_WB: The data in this part comes from memory. If the instruction is LOAD, the data will be written back to RD.
RD_LS_WB: The data in this part comes directly from RD_LS_MEM and represents the address of RD.

(107) G5
voltage_separation_LS: Represents the condition of the functional unit. "1" means that the functional unit is busy. "0" means that the functional unit is idle and its power supply can be turned off.

In Figure 4.3.2, the following instructions are executed:

1. STALL
2. ADDI R4 R0 #15
3. STALL
4. STALL
5. STALL
6. STALL
7. STALL
8. STALL
9. SUBI R3 R1 #6
10. STALL

The gated clock to each pipeline register switches only when the fetched instruction is not a stall instruction. In this case, all the instructions between ADDI R4 R0 #15 and SUBI R3 R1 #6 are stalls, so the functional unit is idle during this period. Therefore, after ADDI R4 R0 #15 has been executed, the signal voltage_separation_LS is set low, which means that the power supply to this functional unit can be turned off. Before SUBI R3 R1 #6 is fetched, voltage_separation_LS is set high, which means that the functional unit needs to be charged immediately.
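The behavior of voltage_separation_LS described above (low once the last useful instruction has left the pipeline, high again shortly before the next useful instruction is fetched) can be sketched as a simple window-based idle detector. This is a hedged illustration, not the power management unit of the thesis: the function name voltage_separation and the parameters drain_cycles and wake_cycles are hypothetical, and their values are assumptions chosen only to make the example concrete.

```python
# Window-based sketch of the voltage-separation state signal (illustrative only).
STALL = "STALL"


def voltage_separation(program, drain_cycles=3, wake_cycles=1):
    """Return one 0/1 value per fetch slot.

    Assumptions (not taken from the thesis):
      - a non-stall instruction keeps the unit busy for `drain_cycles`
        slots after it is fetched, while it moves through the pipeline;
      - the signal is raised `wake_cycles` slots before the next
        non-stall instruction is fetched, so the unit can be recharged.
    """
    busy_slots = [i for i, instr in enumerate(program) if instr != STALL]
    signal = []
    for slot in range(len(program)):
        busy = any(
            slot - drain_cycles <= b <= slot + wake_cycles for b in busy_slots
        )
        signal.append(1 if busy else 0)
    return signal


program = [
    "STALL",
    "ADDI R4 R0 #15",
    "STALL",
    "STALL",
    "STALL",
    "STALL",
    "STALL",
    "STALL",
    "SUBI R3 R1 #6",
    "STALL",
]

signal = voltage_separation(program)
print(signal)
```

With the assumed parameters, the signal is low only in the stall slots that lie after ADDI has drained and before the wake-up window of SUBI. Longer drain or wake-up windows shrink the period in which the supply can be turned off.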

(108) Figure 4.3.2. The simulation of pipeline operation with power management units.