Chapter 1 Introduction

1.3 Organization

This thesis consists of six chapters, which focus on unified logical effort models and the clock system in the sub/near-threshold region. The clock system includes the clock tree and the programmable clock generator. The content of each chapter is briefly introduced below.

Chapter 2 gives an overview of the clock tree and the clock generator.

Chapter 3 describes the proposed unified logical effort models which cover super-, near- and sub-threshold regions.

Chapter 4 presents the proposed thermal-robust clock tree using logical effort compensation. The unified logical effort models are used for thermal compensation of the clock buffers. The layout and simulation results are shown at the end of the chapter.

Chapter 5 demonstrates the proposed programmable clock generator, which targets sub/near-threshold DVFS systems. The layout implementation, simulation results and performance summary are shown at the end of the chapter.

Chapter 6 concludes this thesis and discusses future work.

Chapter 2

Overview on Clock Distribution Networks and Clock Generator

2.1 An Overview on Clock Distribution Networks [2.1]

Clock distribution networks synchronize the flow of data signals among synchronous data paths. The design of the clock distribution network directly influences system-wide performance and reliability. The characteristics of the clock signal in the distribution network deserve particular attention because they are critical to the synchronous system. Clock signals have some special characteristics: they are loaded with the greatest fanout, travel over the longest distances, and operate at the highest speeds.

Furthermore, the clock waveforms must be clean and sharp to guarantee error-free data movement. However, the high resistance of long global metal lines degrades the clock signals, and this resistance increases further with technology scaling. Thus, the design of the clock distribution network deserves particular attention because of its impact on synchronous performance. This section introduces three topics: synchronous systems, the theoretical background of clock skew, and clock distribution design.

2.1.1 Synchronous Systems

In a synchronous system, the clock signal defines the timing of data transfers. A synchronous system consists of cascaded banks of sequential registers with combinational logic between each set of registers. Timing requirements between each set of registers are satisfied by carefully setting the worst-case timing in the combinational logic. Properly designing the clock distribution network further guarantees that the timing requirements are satisfied.

A digital synchronous system is composed of logic elements and clocked registers. For an ordered pair of registers (R1, R2), R1 => R2 denotes that a signal switching at the output of R1 propagates to the input of R2. Such a pair is called a sequentially-adjacent pair of registers. Figure 2.1 shows the local data path.

Figure 2.1 Local data path

The minimum clock period is decided by the delay between any two registers in a sequential data path:

$$\frac{1}{f_{clkMAX}} = T_{CP(\min)} \ge T_{PD(\max)} + T_{Skew} \tag{2.1}$$

$$T_{PD(\max)} = T_{C\text{-}Q} + T_{Logic} + T_{Int} + T_{Set\text{-}up} = D(i, f) \tag{2.2}$$

where TPD(max) is the maximum data path delay, TC-Q is the time required for the data to leave the initial register, TLogic and TInt are the propagation delays through the logic and the interconnect, and TSet-up is the time required for the data to successfully propagate to and latch within the final register of the data path.
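The following minimal sketch illustrates how the timing constraint can be evaluated numerically. It reads Equations (2.1) and (2.2) as "minimum clock period = sum of the four path delay components + clock skew". The class name, function names and all delay values are hypothetical and chosen only for illustration; they are not taken from any design in this thesis.

```python
from dataclasses import dataclass


@dataclass
class LocalDataPath:
    """Delay components of one local data path, all in nanoseconds (assumed values)."""
    t_cq: float     # time for the data to leave the initial register
    t_logic: float  # propagation delay through the combinational logic
    t_int: float    # propagation delay through the interconnect
    t_setup: float  # set-up time of the final register

    def t_pd_max(self) -> float:
        # Eq. (2.2): maximum data path delay is the sum of the four components.
        return self.t_cq + self.t_logic + self.t_int + self.t_setup


def min_clock_period(path: LocalDataPath, t_skew: float) -> float:
    # Eq. (2.1): the clock period must cover the path delay plus the skew.
    return path.t_pd_max() + t_skew


def max_clock_frequency_mhz(path: LocalDataPath, t_skew: float) -> float:
    # A period in ns corresponds to a frequency of 1000 / period in MHz.
    return 1e3 / min_clock_period(path, t_skew)


path = LocalDataPath(t_cq=0.2, t_logic=1.5, t_int=0.3, t_setup=0.1)
period_ns = min_clock_period(path, t_skew=0.4)
print(f"T_CP(min) = {period_ns:.2f} ns")
print(f"f_clkMAX  = {max_clock_frequency_mhz(path, 0.4):.1f} MHz")
```

With these assumed numbers, the skew accounts for 0.4 ns of the clock period, which is why reducing skew directly raises the achievable clock frequency.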

2.1.2 Theoretical Background of Clock Skew

Figure 2.2 shows the schematic of a generalized synchronous data path. Ci and Cf are the clock signals driving a sequentially-adjacent pair of registers, the initial register Ri and the final register Rf. Both clock signals are generated from the same clock source.

We define TCi and TCj as the propagation delays from the clock source to the ith and jth clocked registers. The clock source is designed to generate a specific clock waveform for synchronizing each register. Equipotential clocking is most commonly used; under ideal conditions, it makes the clocking events occur at all registers simultaneously.

Figure 2.2 Timing diagram of clocked data path

The clock skew is defined as the difference in clock signal arrival time between two sequentially-adjacent registers. The clock skew TSkew is zero if the clock signals Ci and Cf are in complete synchronism, which means that the clock signals arrive at their respective registers at the same time. Otherwise, the skew is the difference between the arrival times of the ith and jth clock signals:

$$T_{Skew_{ij}} = T_{Ci} - T_{Cj} \tag{2.3}$$

where TCi and TCj are the clock delays from the clock source to registers Ri and Rj. Clock skew arises for a variety of reasons. Wann and Franklin [2.2] identify four causes of clock skew: (1) differences in line lengths from the clock source to the clocked registers, (2) differences in the delays of the clock distribution buffers, (3) differences in passive interconnect parameters such as line resistivity and via/contact resistance, and (4) differences in active device parameters such as MOS threshold voltages and channel mobility in the clock buffers. Among them, the distributed clock buffers are the main source of clock skew.
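As a small illustration of the skew definition, the sketch below computes the skew of each sequentially-adjacent register pair as the difference of the two clock arrival times, TCi minus TCj. The register names, arrival times and the list of adjacent pairs are made-up values for illustration only.

```python
def clock_skew(t_ci: float, t_cj: float) -> float:
    """Skew between registers i and j, taken as T_Ci - T_Cj (in ps)."""
    return t_ci - t_cj


# Hypothetical clock arrival times from the clock source to each register (ps).
arrival_ps = {"R1": 120.0, "R2": 135.0, "R3": 128.0}

# Hypothetical sequentially-adjacent pairs (initial register, final register).
adjacent_pairs = [("R1", "R2"), ("R2", "R3")]

skews = {
    (i, j): clock_skew(arrival_ps[i], arrival_ps[j])
    for i, j in adjacent_pairs
}
worst_case_ps = max(abs(s) for s in skews.values())

for (i, j), s in skews.items():
    print(f"T_Skew({i},{j}) = {s:+.1f} ps")
print(f"worst-case |skew| = {worst_case_ps:.1f} ps")
```

Only sequentially-adjacent pairs are evaluated, since the skew between registers that never exchange data does not constrain the timing.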

2.1.3 Clock Distribution Design of Custom VLSI Circuits

Many approaches have been developed for designing clock distribution networks in synchronous digital integrated circuits. The clock distribution network affects the tradeoffs among system speed, physical die area and power dissipation. Thus, the design methodology and structural topology of the clock distribution network should be considered during system development.

Many clock distribution strategies have been developed. The buffered clock tree, presented in Section 2.1.3.1, is the most general approach to equipotential clock distribution. Symmetric trees such as the H-trees in Section 2.1.3.2 are used to distribute high-speed clock signals.

2.1.3.1 Buffered Clock Distribution Trees

Buffered clock distribution trees are the most common way of distributing clock signals within integrated circuits. Buffers are inserted in the clock signal path or at the clock source to drive the long interconnections and the registers at the end nodes. This structure is illustrated in Figure 2.3.

Figure 2.3 Tree structure of clock distribution network

The mesh structure is an extended version of the standard clock tree. In the mesh clock tree structure, shunt paths further down the distribution network are used to minimize the resistance within the clock tree. Since the branch resistances are placed in parallel, the mesh has the advantage of reduced clock skew. Various forms of clock distribution network, including the trunk, tree, mesh, and H-tree, are illustrated in Figure 2.4.

An alternative to distributing clock buffers throughout the clock distribution network is to use only one buffer at the clock source. With a single buffer, the area that distributed buffers would consume is saved. However, this approach is suitable only for clock networks in which the resistance of the interconnect lines is negligible. In addition, the buffer should be strong enough to drive the network capacitance while maintaining high-quality waveform shapes and minimizing the effects of the interconnect resistance.

Compared with a one-buffer clock distribution network, distributed buffers consume more power and area, but they greatly improve the precision of the clock signal waveform. Distributed buffers are therefore necessary when the interconnect lines are long. The distributed buffers not only amplify the clock signals but also isolate the local clock nets from upstream load impedances [2.3]. An example of a three-level buffered clock distribution network is shown in Figure 2.5. In this strategy, a single buffer drives multiple clock paths and buffers. The number of buffer stages between the clock source and the registers depends on (1) the loading of the registers and the interconnect, and (2) the allowable clock skew [2.4]. Note that the clock skew mainly comes from the clock buffers, since the active device characteristics vary much more than the passive device characteristics.

Figure 2.4 Common structures of clock distribution networks including a trunk, tree, mesh and H-tree

Figure 2.5 Three-level buffer clock distribution network

The primary design goal of a clock distribution network is to ensure that the clock signal arrives at every register at the same time. Zero skew enhances system reliability.

2.1.3.2 Symmetric H-Tree Distribution Networks

Figure 2.6 shows the symmetric H-tree and X-tree clock distribution networks, which ensure zero clock skew by making the interconnect lengths and the buffers identical from the clock signal source to every end node. They are a subset of the distributed buffer approach described in Section 2.1.3.1. In the H-tree distribution network, the clock driver is placed at the center of the main “H” structure, and the clock signal is transmitted to the four corners of the H. The distances from the center to these corners are the same, so the clock signal reaches the corners with equal delay. The four corners then provide the clock signal to the smaller “H” structures of the next level. The distribution process continues through several levels of progressively smaller “H” structures, finally driving the registers at the end nodes.
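The equal-path-length property of the H-tree can be checked with a short sketch. The code below recursively places the four corners of each "H", halves the dimension at every level, and accumulates the Manhattan wire length from the clock driver to each end node. The function name, the starting dimension and the number of levels are assumptions made for illustration; buffers and process variations are not modeled.

```python
def h_tree_leaves(x, y, half, levels, length=0.0):
    """Return (x, y, wire_length) for every end node of an idealized H-tree.

    (x, y) is the center of the current "H", `half` is the distance from the
    center to a corner along each axis, and `levels` is the number of nested
    "H" structures still to be expanded.
    """
    if levels == 0:
        return [(x, y, length)]
    leaves = []
    for dx in (-half, half):        # left and right ends of the horizontal bar
        for dy in (-half, half):    # upper and lower ends of the vertical legs
            leaves += h_tree_leaves(
                x + dx, y + dy, half / 2, levels - 1,
                length + abs(dx) + abs(dy),
            )
    return leaves


leaves = h_tree_leaves(0.0, 0.0, half=4.0, levels=3)
wire_lengths = {round(length, 9) for _, _, length in leaves}

print(f"end nodes: {len(leaves)}")
print(f"distinct wire lengths: {sorted(wire_lengths)}")
```

Every end node sees the same wire length, which is the geometric reason an ideal H-tree has zero skew; in a real layout the residual skew comes from the variations discussed next.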

Figure 2.6 Symmetric H-tree and X-tree clock distribution networks

The primary source of clock skew is the difference between the signal paths, including process variations on the metal lines and, in particular, in the active buffers. In an H-tree clock distribution network, the amount of clock skew depends on the physical size, the control of the semiconductor process, and the degree to which active buffers and clocked latches are distributed.

2.1.4 Previous Works on Temperature-Aware Clock Distribution Design

2.1.4.1 Dynamic Thermal Clock Skew Compensation Using Tunable Delay Buffers [2.8]

The temperature gradient in a high-performance chip causes clock skew in the clock distribution network. If the spatial temperature distribution is known beforehand, the thermal non-uniformities can be compensated by properly designing the clock network. However, the temperature distribution also changes over time. A. Chakraborty et al. proposed a technique that compensates for temporal variations of temperature by dynamically modifying the clock tree. It is realized by using tunable delay buffers during clock network generation. The buffer control settings are computed offline and stored in a tuning table that is added to the design. In this way, temperature-induced delay variations are compensated.

The conceptual architecture of the tunable delay buffer is shown in Figure 2.7. Each control signal decides whether the corresponding transmission gate is opened, thus achieving variable delays in discrete steps. Figure 2.8 shows that each additional tap delivers a constant delay of approximately 8 ps. This value is chosen to keep the area and power overheads within reasonable bounds.
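To make the discrete tuning steps concrete, the sketch below picks the number of taps that best cancels a given skew, assuming each tap adds exactly 8 ps. The maximum tap count of 8, the function names and the example skew values are hypothetical; the actual tap settings in [2.8] come from an offline optimization that is not reproduced here.

```python
TAP_DELAY_PS = 8.0  # approximate delay added per tap, as read from Figure 2.8


def taps_for_delay(target_ps: float, max_taps: int,
                   tap_ps: float = TAP_DELAY_PS) -> int:
    """Number of taps whose total delay is closest to target_ps.

    max_taps is an assumed hardware limit, not a value from [2.8].
    """
    if target_ps <= 0:
        return 0
    return min(round(target_ps / tap_ps), max_taps)


def residual_skew_ps(target_ps: float, taps: int,
                     tap_ps: float = TAP_DELAY_PS) -> float:
    """Skew left over after inserting `taps` taps (negative = over-compensated)."""
    return target_ps - taps * tap_ps


taps = taps_for_delay(30.0, max_taps=8)
print(f"taps = {taps}, residual = {residual_skew_ps(30.0, taps):+.1f} ps")

# A skew larger than the tuning range saturates at max_taps.
print(f"taps for 100 ps = {taps_for_delay(100.0, max_taps=8)}")
```

Because the delay is quantized, the residual skew after tuning is bounded by half a tap delay as long as the target lies within the tuning range.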

Figure 2.7 Structure of the tunable delay buffer

Figure 2.8 Delay and normalized power versus number of taps

Figure 2.9 shows the online hardware mechanism that tunes the clock buffers so that the clock skew induced by the thermal gradient is compensated. Two essential elements are required. First, a set of on-chip temperature sensors detects thermal variations. Second, a hardware mechanism, hereafter called the thermal management unit (TMU), translates these variations into the proper tuning of the buffers.

Figure 2.9 Online skew compensation architecture

An algorithm that minimizes the number of inserted tunable buffers is also proposed in this design. The overall flow of the methodology is depicted in Figure 2.10 and consists of three steps. In the first step, physical synthesis, the RTL design is synthesized, and placement, clock tree generation, and global routing are performed. In the second step, TDB identification, characterization and optimization are run on the synthesized designs and their corresponding clock trees, which entails repeated execution of the optimization algorithm for every relevant thermal profile. In the final step, physical redesign, the insertion of the tunable delay buffers (TDBs) requires some amount of physical redesign because TDBs have a larger footprint than regular buffers.

Figure 2.10 Overall flow of the proposed methodology

This design shows that the clock skew is kept within the original bounds with worst-case power and area penalties of 3.5% and 5.5%, respectively.

2.1.4.2 Design of Thermally Robust Clock Trees Using Dynamically Adaptive Clock Buffers [2.9]

Temperature gradients have emerged as a major concern for high-performance integrated circuit design in current and future technology nodes because they cause undesired clock skew in the clock distribution network. The primary purpose of the research in [2.9] is to minimize the temperature-induced clock skew by designing dynamically adaptive circuit elements, particularly the clock buffers.

The effect of the on-chip temperature gradient on the clock skew is investigated for a number of temperature profiles by using an RLC model of the clock tree. To mitigate the variable clock skew, an adaptive circuit technique is proposed, which senses the temperature of different parts of the clock tree and dynamically adjusts the driving strengths of the corresponding clock buffers. Figure 2.11 shows the design technique, in which local temperature sensors sense the ambient temperatures and convert them to voltages. The voltages are used to dynamically change the driving strength of the clock buffers, thereby reducing the overall clock skew. The buffers combine two techniques to compensate the temperature effect: buffer-current control and body-bias control. Figure 2.12 shows the control waveforms coming from the wave-shaping circuits.

Figure 2.11 Thermally adaptive buffer schematic

Figure 2.12 Control waveforms coming from the wave-shaping circuits

To distribute thermal sensors all over the chip, a moderate-accuracy temperature sensor is used in order to reduce area and power. The architecture of the temperature sensor is shown in Figure 2.13. The sensor is accurate to within 10 °C while occupying only 30 μm² in 45-nm technology. The output waveforms are shown in Figure 2.14; they demonstrate the linearity of the output voltage over the temperature range.

Figure 2.13 Temperature-sensor schematic

Figure 2.14 Temperature-sensor output-voltage levels

SPICE simulations were performed to evaluate the performance. The clock skew equals zero when the temperature difference along the clock signal paths is zero. With a difference of 80 °C, the clock skew is 155 ps, which is reduced to 21 ps with the adaptive technique. Simulation results show that the adaptive technique reduces the temperature-induced clock skew by up to 92.4%, and by 70.2% on average.
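As a quick check on the quoted numbers (this is our own arithmetic, not a figure reported in [2.9]), the relative skew reduction in the 80 °C case follows directly from the two skew values:

```latex
\[
\frac{T_{Skew,\text{original}} - T_{Skew,\text{adaptive}}}{T_{Skew,\text{original}}}
= \frac{155\ \text{ps} - 21\ \text{ps}}{155\ \text{ps}}
= \frac{134}{155} \approx 86.5\%
\]
```

This value lies between the reported average reduction of 70.2% and the reported maximum of 92.4%, so the 80 °C case is consistent with both figures.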

2.2 An Overview on Clock Generator

A clock generator is a circuit that produces a timing signal used to synchronize a circuit’s operation. Many kinds of clock generators have been presented in the literature. This section briefly describes several categories of clock generators, including DLL-based, PLL-based, TDC-based and CCM-based (cyclic clock multiplier) clock generators, which are used in different applications.

2.2.1 DLL-Based Clock Generator [2.10]

A low-power programmable DLL-based clock generator for dynamic frequency scaling is developed in [2.10]. The block diagram is shown in Figure 2.15. When the DLL locks, the phase difference between B0 and B8 is one reference clock cycle. The voltage-controlled delay line (VCDL) generates uniformly spaced clocks, which are used for frequency multiplication. The frequency of the multiplied clock is selected by a two-bit control signal. To avoid harmonic lock, an antiharmonic-lock block is included.

Three clock phases, B0, B3 and B8, are selected as inputs to the antiharmonic-lock block. The phases of B0 and B8 are compared by the phase detector (PD), which then sends an UP or DOWN signal to the charge pump (CP). If the DLL locks in a harmonic state, the antiharmonic-lock block has priority in setting the PD output to UP or DOWN. These signals increase or decrease the control voltage of the VCDL, so the phase B9 can be locked.
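The sketch below illustrates the lock condition described above under simplifying assumptions: the delay line is modeled as eight identical stages between B0 and B8, and a harmonic lock is modeled as a total delay equal to an integer multiple (two or more) of the reference period. The function names and the 10 ns reference period are hypothetical, and the detection logic of the actual antiharmonic-lock block in [2.10] is not modeled.

```python
def tap_times(total_delay_ns: float, stages: int = 8) -> list[float]:
    """Delay of each phase B0..B{stages} relative to B0, for uniform stages."""
    return [k * total_delay_ns / stages for k in range(stages + 1)]


def lock_multiple(t_ref_ns: float, total_delay_ns: float) -> int:
    """How many reference periods the delay line spans (1 = correct lock)."""
    return round(total_delay_ns / t_ref_ns)


T_REF_NS = 10.0  # assumed reference clock period

# Correct lock: B0 to B8 spans exactly one reference period.
phases = tap_times(total_delay_ns=T_REF_NS)
print(f"phase spacing = {phases[1] - phases[0]:.2f} ns")
print(f"B8 delay      = {phases[8]:.2f} ns")

# Harmonic lock: B8 is also aligned with B0, but two periods later.
print(f"lock multiple at 20 ns total delay = {lock_multiple(T_REF_NS, 20.0)}")
```

In both cases B8 is edge-aligned with B0, so a phase detector that looks only at B0 and B8 cannot tell them apart; this is why an additional check on an intermediate phase is needed.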

Figure 2.15 Block diagram of the proposed DLL-based frequency multiplier

2.2.2 PLL-Based Clock Generator [2.11]

A triangular-modulated spread-spectrum clock generator using a Δ-Σ modulated fractional-N phase-locked loop is presented in [2.11]. A multiphase divider is employed to implement the modulated fractional counter with increased Δ-Σ operation speed. The phase mismatch error in the phase-interpolated PLL with multiphase clocks can be randomized, and a finer frequency resolution is achievable. Figure 2.16 shows the system architecture, which consists of a PLL, a Δ-Σ modulator, and a triangular modulation profile. The PLL is a digiphase-based fractional-N synthesizer with a multimodulus fractional divider (MMDF). The instantaneous phase error can be canceled by a phase-compensation technique before the phase frequency detector. When the PLL is locked, neglecting the modulation that the Δ-Σ modulator applies to the MMDF, the output frequency of the PLL is (M±k/16)fref, and the synthesizer operates as a modulo-31 fractional-N frequency synthesizer for K = 0~15.
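The locked output frequency (M±k/16)fref can be evaluated with a few lines of code. The sketch below uses exact fractions so that the 1/16 steps are not blurred by rounding. The function name and the example values of M, k and fref are assumptions for illustration, and the Δ-Σ modulation is ignored, as in the expression above.

```python
from fractions import Fraction


def pll_output_mhz(m: int, k: int, f_ref_mhz: int, sign: int = +1) -> Fraction:
    """Locked output frequency (M +/- k/16) * f_ref, with k in 0..15."""
    if not 0 <= k <= 15:
        raise ValueError("k must be in the range 0..15")
    if sign not in (+1, -1):
        raise ValueError("sign must be +1 or -1")
    return (m + sign * Fraction(k, 16)) * f_ref_mhz


F_REF_MHZ = 20  # assumed reference frequency

f_out = pll_output_mhz(m=10, k=4, f_ref_mhz=F_REF_MHZ)
step = pll_output_mhz(10, 1, F_REF_MHZ) - pll_output_mhz(10, 0, F_REF_MHZ)

print(f"f_out = {float(f_out)} MHz")
print(f"frequency step per unit of k = {float(step)} MHz")
```

The frequency step equals fref/16, which is the resolution obtained before the Δ-Σ modulator is taken into account.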

Figure 2.16 System architecture

2.2.3 Multi-Phase Clock Generator Based on a Time-to-Digital Converter [2.12]

An all-digital fast-lock synchronous multi-phase clock generator is presented in [2.12]. It adopts a time-to-digital converter (TDC) to achieve fast lock and to measure delay. It generates four-phase clocks and synchronizes to the reference clock within 45 cycles. Figure 2.17 shows the synchronous multi-phase clock generator, which consists of a TDC, a sampling clock selector, a control pulse generator, a code controller and a de-skewing circuit. The TDC measures the periods of the input clock and the replica delay. The delay codes generated by the TDC are then converted into coarse and fine codes in the code controller. In this way, the clock generator is synchronized and generates the multi-phase output clocks. In addition, the de-skewing circuits improve the phase resolution of the multi-phase clocks. The phase error between the reference and output clocks is 4.6 ps at 1.8 V with a 1.22 GHz input clock.

Figure 2.17 Architecture of the synchronous multi-phase clock generator

2.2.4 Programmable Clock Generator Based on a Cyclic Clock Multiplier [2.13]

An all-digital clock generator using a cyclic clock multiplier (CCM) is presented in [2.13]. It produces the fractional or multiplied output clock within four reference clock cycles. Figure 2.18 shows the all-digital clock generator, which is composed of a CCM, a finite state machine (FSM), a conventional time-to-digital converter (TDC), a counter_K, a programmable divider and two multiplexers (MUXs). It generates an output clock whose frequency is M/N times that of the reference clock, where the ranges of M and N are 1~7 and 1~8, respectively. CCMout is a multiplied clock whose frequency is M times that of the reference clock. The timing diagram of the clock generator is shown in Figure 2.19 for M = 5 and N = 1. The operation has four steps. First, C[4:0] is preset to M and the CCM measures the period of the reference clock. Second, the counted value is stored as K[4:0] = K; in Figure 2.19, K = 3. Third, the clock CCMout generates M pulses using K unit delay cells. Finally, the delay of the unit delay cell in the CCM is adjusted by F[3:0] according to the TDC outputs, so that the phase error between the multiplied clock and the reference clock is reduced.
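The set of output frequencies reachable with M/N programming can be enumerated directly from the stated ranges of M (1~7) and N (1~8). In the sketch below, the function name and the 10 MHz reference frequency are assumptions for illustration; the internal operation of the CCM, the TDC and the delay cell tuning are not modeled.

```python
from fractions import Fraction

M_RANGE = range(1, 8)  # M = 1..7, multiplication factor of the CCM
N_RANGE = range(1, 9)  # N = 1..8, division factor of the programmable divider


def output_frequency_mhz(m: int, n: int, f_ref_mhz: int) -> Fraction:
    """Output frequency M/N times the reference frequency."""
    if m not in M_RANGE or n not in N_RANGE:
        raise ValueError("M must be in 1..7 and N in 1..8")
    return Fraction(m, n) * f_ref_mhz


# All distinct frequency ratios that the M/N programming can produce.
ratios = sorted({Fraction(m, n) for m in M_RANGE for n in N_RANGE})

F_REF_MHZ = 10  # assumed reference frequency

print(f"smallest ratio = {ratios[0]}, largest ratio = {ratios[-1]}")
# The case of Figure 2.19: M = 5, N = 1.
print(f"M=5, N=1 -> {output_frequency_mhz(5, 1, F_REF_MHZ)} MHz")
```

Several (M, N) pairs map to the same ratio, for example 2/4 and 1/2, so the number of distinct output frequencies is smaller than the 56 programmable combinations.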

Figure 2.18 The all-digital clock generator using cyclic clock multiplier

Figure 2.19 The timing diagram of the clock generator

Chapter 3

Unified Logical Effort Models over Wide Supply Voltage and Temperature Range

In this chapter, we present unified logical effort models that cover all operating regions of the MOSFET: the weak-, moderate- and strong-inversion regions.

These models have been established over four nanoscale CMOS generations and over environmental parameter variations, with a wide supply voltage range of 0.1~1 V and a temperature range of -50~125 ºC. The simulations use the UMC 90-nm and 65-nm and the PTM 65-, 45- and 32-nm bulk CMOS technologies, and the average modeling error is no more than 8.40%. The proposed models extend the original logical effort method, intended for high-performance circuit design in the super-threshold region, to low-power design in the near-threshold and sub-threshold regions. They are useful for future ultra-low-voltage designs and applications.

Section 3.1 is the introduction. The classic logical effort model is reviewed in Section 3.2. Section 3.3 derives the physical alpha-power law current equations. The formulas of the unified logical effort models are derived in Section 3.4. Section 3.5 shows the experimental results.

3.1 Introduction

Power has become the dominant design constraint in many emerging applications such as mobile consumer electronics and wireless sensor networks. Ultra-low voltage (ULV) design techniques have been explored continuously. In addition, the
