
Chapter 2 Overview on Clock Distribution Networks and Clock Generator

2.2 An Overview on Clock Generator

2.2.4 Programmable Clock Generator Based on a Cyclic Clock Multiplier [2.13]


An all-digital clock generator using a cyclic clock multiplier (CCM) is presented in [2.13]. It produces the fractional or multiplied output clock within four reference clock cycles. Figure 2.18 shows the all-digital clock generator, which is composed of a CCM, a finite state machine (FSM), a conventional time-to-digital converter (TDC), a counter_K, a programmable divider and two multiplexers (MUXs). It can generate an output clock whose frequency is M/N times that of the reference clock, where the ranges of M and N are 1~7 and 1~8, respectively. CCMout is a multiplied clock whose frequency is M times that of the reference clock. The timing diagram of the clock generator is shown in Figure 2.19 with M = 5 and N = 1. Its operation has four steps. First, C[4:0] is preset to M and the CCM measures the period of the reference clock. Second, the counted value is stored as K[4:0] = K; K = 3 in Figure 2.19. Third, CCMout generates M pulses using K unit delay cells. Finally, the delay of the unit delay cell in the CCM is adjusted by F[3:0] according to the TDC outputs, so that the phase error between the multiplied clock and the reference clock is reduced.
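As a small illustration of the M/N relation described above, the following Python sketch computes the output frequency for a given reference. It only models the frequency ratio and the stated ranges of M and N; the function name and the 100 MHz reference value are assumptions made for the example, not taken from [2.13].

```python
from fractions import Fraction


def output_frequency(f_ref_hz, m, n):
    """Return (M/N) * f_ref, enforcing the ranges of M and N stated in the text."""
    if not 1 <= m <= 7:
        raise ValueError("M must be in 1..7")
    if not 1 <= n <= 8:
        raise ValueError("N must be in 1..8")
    # Fraction keeps the ratio exact, e.g. 7/8 of the reference.
    return Fraction(m, n) * f_ref_hz


if __name__ == "__main__":
    f_ref = 100_000_000  # hypothetical 100 MHz reference clock
    # The case drawn in Figure 2.19: M = 5, N = 1.
    print(float(output_frequency(f_ref, 5, 1)))
```

With M = 5 and N = 1 the divider is bypassed in effect, so the output equals CCMout at five times the reference frequency.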

Figure 2.18 The all-digital clock generator using cyclic clock multiplier

Figure 2.19 The timing diagram of the clock generator


Chapter 3

Unified Logical Effort Models over Wide Supply Voltage and Temperature Range

In this chapter, we present unified logical effort models that cover all operating regions of the MOSFET: weak-, moderate- and strong-inversion.

These models have been established over four nanoscale CMOS generations and over environmental variations spanning a wide supply voltage range of 0.1~1V and a temperature range of -50~125ºC. Simulations in UMC 90-nm and 65-nm and PTM 65-, 45- and 32-nm bulk CMOS technologies show an average modeling error of no more than 8.40%. The proposed models extend logical effort from high-performance circuit design in the super-threshold region to low-power design in the near-threshold and sub-threshold regions. They are useful for future ultra-low voltage designs and applications.

Section 3.1 is the introduction. The classic logical effort model is reviewed in section 3.2. Section 3.3 derives the physical alpha-power law current equations and the formulas of the unified logical effort models. Section 3.4 shows the experimental results.

3.1 Introduction

Power has become the dominant design constraint in many emerging applications such as mobile consumer electronics and wireless sensor networks. Techniques for ultra-low voltage (ULV) design have been explored continuously. In addition, the minimum energy point appears at a voltage where transistors operate in weak inversion (also called the sub-threshold region) [3.1], [3.2]. However, sub-threshold circuits are much more sensitive to environmental variations than super-threshold ones.

Recently, three-dimensional integrated circuit (3D-IC) technology has been developed to overcome the barriers of large interconnections. The high integration density of 3D-ICs introduces a hot-spot problem because of uneven thermal distribution. This temperature non-uniformity makes performance inconsistent in ULV circuit design. Voltage and temperature variations affect the timing behavior of logic gates more strongly at lower voltages and in more advanced CMOS technologies, and they may lead to functional errors in digital circuits. Therefore, novel unified logical effort models are proposed for optimizing combinational logic while considering temperature and voltage variations.

The logical effort model proposed by Sutherland, Sproull, and Harris in 1999 is a method for estimating circuit path delay [3.3]. With logical effort, path delay is easy to estimate from a simple calculation, but the method does not consider environmental conditions. Many papers have been presented to improve the accuracy of the logical effort model under different conditions. The effect of a linear input transition time was introduced in [3.4]. A modified logical effort model accounting for series-connected MOSFET structures, input transition time, and internodal charge was presented in [3.5]. The effects of I/O coupling capacitance and the input ramp on logical effort were considered in [3.6]. The influences of voltage and temperature on logical effort were introduced for the UMC 90-nm bulk CMOS process in [3.7]; the logic gates in that work, however, were operated only in the strong-inversion region.

In this chapter, unified logical effort models for different CMOS operating regions are proposed. They cover the strong-, moderate- and weak-inversion regions (also called the super-threshold, near-threshold and sub-threshold regions, respectively). The models have been established in UMC 90-nm and 65-nm and PTM 65-, 45- and 32-nm bulk CMOS technologies. The next section reviews the classic logical effort model, from which they are derived.

3.2 Classic Logical Effort Model [3.3]

The method of logical effort is built on a simple model of the delay through a single MOS logic gate. The model describes the delay in terms of the gate's drive and its capacitive load. When the load increases, the delay increases; however, the delay also depends on the logic function of the gate. Inverters are the simplest logic gates and are usually chosen as amplifiers to drive large loads. Logic gates with more complex functions often require series transistors, which makes them poorer than inverters at delivering current. Thus a NAND gate has more delay than an inverter with the same transistor sizes driving the same load. The method of logical effort quantifies these effects to simplify delay analysis.

The first step in modeling delays is dividing the absolute delay into two parts: the delay unit $\tau$ and the unitless delay $d$ of the gate. The delay unit is particular to a specific integrated circuit fabrication process. The absolute gate delay can be expressed as:

$d_{abs} = d \, \tau$ (3.1)

The delay is composed of two components, a fixed part called the parasitic delay $p$ and a part proportional to the load on the gate's output called the stage effort $f$. The total delay, measured in units of $\tau$, is the sum of parasitic delay and stage effort:

$d = f + p$ (3.2)


The stage effort depends on the output load and the driving capability of the logic gate. The output load and driving capability are represented by the electrical effort $h$ and the logical effort $g$, respectively. The stage effort $f$ is the product of these two factors:

$f = g h$ (3.3)

The logical effort characterizes the effect of the logic gate's topology on its ability to drive the load. It is independent of the size of the transistors in the circuit. The electrical effort $h$ is defined by:

$h = \dfrac{C_{out}}{C_{in}}$ (3.4)

where $C_{out}$ is the load capacitance on the gate's output and $C_{in}$ is the gate's input capacitance. For a path of $N$ stages, the path effort is

$F = G B H$, with $G = \prod g_i$, $B = \prod b_i$ and $H = C_{out} / C_{in}$ taken over the whole path,

where $b_i$ is the branching effort, and $G$, $B$, $H$, $F$ are the path logical effort, path branching effort, path electrical effort and path effort. The minimum path delay is achieved when the stage effort and the input capacitance of each gate are

$\hat{f} = g_i h_i = F^{1/N}$, $C_{in,i} = \dfrac{g_i \, C_{out,i}}{\hat{f}}$

which yields the optimized path delay.
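The relations of this section can be tried out numerically. The sketch below implements the stage delay d = gh + p, the path effort F = GBH, and the minimum path delay of the classic method [3.3], in which every stage bears the effort F^(1/N). The function names and the three-inverter example are assumptions chosen for illustration.

```python
import math


def stage_delay(g, h, p):
    """Unitless delay of one stage: d = g*h + p."""
    return g * h + p


def path_effort(g_list, b_list, c_out, c_in):
    """Path effort F = G * B * H with H = C_out / C_in of the whole path."""
    return math.prod(g_list) * math.prod(b_list) * (c_out / c_in)


def min_path_delay(g_list, b_list, p_list, c_out, c_in):
    """Return (best stage effort, minimum delay in units of tau)."""
    n = len(g_list)
    f_hat = path_effort(g_list, b_list, c_out, c_in) ** (1.0 / n)
    return f_hat, n * f_hat + sum(p_list)


if __name__ == "__main__":
    # Hypothetical path: three inverters (g = 1, p = 1, no branching)
    # driving a load 64 times the input capacitance.
    f_hat, delay = min_path_delay([1, 1, 1], [1, 1, 1], [1, 1, 1], 64.0, 1.0)
    print(f_hat, delay)
```

For this example the path effort is 64, so each of the three stages should bear an effort of about 4 and the path delay is about 15 in units of τ.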


3.3 Unified Logical Effort Models

The unified logical effort models are derived by combining the current equations of the physical alpha-power law [3.8] with the conventional logical effort model.

In logic gates, the operating region of the MOSFET is determined by the supply voltage. When the supply voltage is less than the threshold voltage (VDD < VT), the weak-inversion (or sub-threshold) current is given by (3.7), where (W/L) is the channel width-to-length ratio, COX is the gate oxide capacitance per unit area, $\mu_0$ is the carrier mobility, and the remaining MOSFET parameters are defined from $q/(kT)$ and $1 + C_{D0}/C_{OX}$ in (3.8).

When the supply voltage is near the threshold voltage (VDD ~ VT), velocity saturation is negligible (ECL >> VDD-VT); this is called the moderate-inversion (near-threshold) region. Simplifying the saturation voltage and IDSAT from [3.8] gives the current in (3.10).

When strong velocity saturation (ECL << VDD-VT) is reached, the transistor is in the strong-inversion (super-threshold) region. Again, simplifying the saturation voltage and IDSAT from [3.8] gives the current in (3.12).


Figure 3.1 Simplified physical alpha-power law current equations

All three regions of MOS current are derived in (3.7), (3.10) and (3.12) and summarized in Figure 3.1. To modify the logical effort model, recall the logical effort g introduced in section 3.2. From equations (3.1) and (3.2), the absolute delay can be written in terms of resistances and capacitances, where Rinv and Cinv are the output resistance and input capacitance of an inverter template, and Rt, Cint, Cpt are the output resistance, input capacitance and output parasitic capacitance of a specific gate. In (3.15), the logical effort is equal to the ratio of the gate RC to the inverter RC:

$g = \dfrac{R_t \, C_{int}}{R_{inv} \, C_{inv}}$ (3.15)

The output resistance is inversely proportional to the drain current, so the inverse of the logical effort is proportional to the drain current:

$\dfrac{1}{g} \propto I_D$ (3.16)


From (3.16), the inverse of logical effort is proportional to ID, so g has the same three regions as ID: strong-, moderate- and weak-inversion. The driving abilities of NMOS and PMOS transistors differ from region to region. The inverter sizing ratio Wp/Wn is therefore set to 2.5, 2.0 and 1.5 in the strong-, moderate- and weak-inversion regions, respectively, to balance the rise and fall delays.

3.3.1 Strong-Inversion (Super-Threshold) Region

In the strong-inversion region, the MOSFET operates with strong carrier velocity saturation. Substituting ID from (3.12) into (3.16) gives 1/g as a function of VDD, and this function is curve fitted to obtain the unified model.

In the fitted model, gu stands for the unified logical effort and A(T) is a second-degree polynomial of T. By measuring logical effort at various VDD and T, A(T) is solved and listed in Table 3.1. In this region, g is set equal to 1 at VDD = 1V, T = 25 ºC, and the VDD range is from 0.5V to 1.0V. Figure 3.3 shows the unified and simulated 1/g for various VDD and T. The average absolute modeling errors are 3.89%, 3.05%, 4.12%, 8.01% and 6.55% in the UMC 90-nm, UMC 65-nm, PTM 65-nm, PTM 45-nm and PTM 32-nm technologies, respectively.

Table 3.1 Function A(T) for strong-inversion

Figure 3.3 1/g in UMC 65-nm technology (strong-inversion)

3.3.2 Moderate-Inversion (Near-Threshold) Region

In the moderate-inversion region, the MOSFET operates with negligible carrier velocity saturation. Substituting ID from (3.10) into (3.16) gives an expression for 1/g in which const2 represents all constant coefficients and VT is a function of T. The unified 1/g is then curve fitted.

In the fitted model, gu stands for the unified logical effort; B(T), C(T), and D(T) are second-degree polynomials of T. By measuring logical effort at various VDD and T, B(T), C(T), and D(T) are solved and listed in Table 3.2. In this region, g is set to 1 at VDD = 0.5V, T = 25 ºC, and the VDD range is from about 0.33V to 0.5V. The dividing point between the moderate- and weak-inversion regions depends on the CMOS technology used. Figure 3.4 shows the unified and simulated 1/g for various VDD and T. The average absolute modeling errors are listed in Table 3.4.

Table 3.2 Functions B(T), C(T) and D(T) for moderate-inversion


Figure 3.4 1/g in UMC 65-nm technology (moderate-inversion)

3.3.3 Weak-Inversion (Sub-Threshold) Region

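In the weak-inversion region, the MOSFET operates in sub-threshold mode. Substituting ID from (3.7) into (3.16) gives an expression for 1/g in which const3 represents all constant coefficients and VT, among other parameters, is a function of T. The unified 1/g is then curve fitted with the functions E(T) and F(T) of Table 3.3; the VDD range of this region depends on the CMOS technology used. Figure 3.5 shows the unified and simulated 1/g for various VDD and T. The average absolute modeling errors, beginning with 6.01% and 8.40%, are listed in Table 3.4.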

Table 3.3 Functions E(T) and F(T) for weak-inversion

Figure 3.5 1/g in UMC 65-nm technology (weak-inversion)

Table 3.4 Logic effort modeling error


Figure 3.6 Unified logical effort models

3.4 Experimental Result

In this section, we test and verify the unified logical effort models by using them to estimate path delays. There are two test vehicles: test vehicle I is a chain of simple logic gates, and test vehicle II is an 8-to-256 decoder. The test vehicles are simulated under various thermal and voltage conditions, and the simulated delays are measured. The delays are estimated by calculation from the delay equation of the logical effort model (3.23), in which g and p are the logical effort and the parasitic delay. The unified logical effort is substituted for g to include the effects of temperature and supply voltage. The values of p were measured beforehand under the various environmental conditions, so ideal values of p are used in equation (3.23).

In the test vehicles, the logical efforts of the logic gates are calculated according to the classic rule. The logical efforts of INV, 2-input NAND and NOR, listed in Table 3.5, can be derived from the different Wp/Wn ratios in the three distinct regions. The next two sections compare the simulated and estimated delays.

              Strong-inversion   Moderate-inversion   Weak-inversion
Wp/Wn         2.5                2.0                  1.5
g (INV)       gu                 gu                   gu
g (2-NAND)    gu×9/7             gu×4/3               gu×7/5
g (2-NOR)     gu×12/7            gu×5/3               gu×8/5

Table 3.5 Ratios of logical effort for logic gates
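The ratios in Table 3.5 can be reproduced from the Wp/Wn values with the classic input-capacitance argument of logical effort. The sketch below assumes the usual sizing convention, in which series transistors are doubled in width so that each gate matches the drive of the unit inverter; that convention and the function name are assumptions for illustration, although the results agree with the table.

```python
from fractions import Fraction


def gate_ratios(wp_over_wn):
    """Logical effort of INV, 2-NAND and 2-NOR relative to the inverter."""
    r = Fraction(wp_over_wn)
    inv_cin = 1 + r        # unit inverter: nMOS width 1, pMOS width r
    nand_cin = 2 + r       # 2-NAND: series nMOS widened to 2, parallel pMOS width r
    nor_cin = 1 + 2 * r    # 2-NOR: parallel nMOS width 1, series pMOS widened to 2r
    return {
        "INV": Fraction(1),
        "2-NAND": nand_cin / inv_cin,
        "2-NOR": nor_cin / inv_cin,
    }


if __name__ == "__main__":
    for region, ratio in (("strong", "2.5"), ("moderate", "2.0"), ("weak", "1.5")):
        print(region, gate_ratios(ratio))
```

Each ratio multiplies the unified logical effort gu of the inverter to give the logical effort of the corresponding gate in that region.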

3.4.1 Test Vehicle I

Test vehicle I is an INV-NAND-NOR-INV path with another INV as the load, shown in Figure 3.7. All of these gates have the same driving ability as a unit-size inverter. They are simulated in UMC 90-nm CMOS technology. The simulated and estimated delays are compared in Figure 3.8, Figure 3.9 and Figure 3.10. The results show that the average absolute errors are 12.6%, 7.96% and 16.8% in the strong-, moderate- and weak-inversion regions, respectively.
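To make the estimation procedure concrete, the following sketch sums g·h + p over the INV-NAND-NOR-INV path with an INV load, using the strong-inversion ratios of Table 3.5. Because every gate has unit drive, the input capacitance of each gate scales with its ratio, so the electrical effort of a stage is the ratio of the next gate to that of the current gate. The parasitic delays are hypothetical placeholders (the thesis uses measured values), and gu = 1 corresponds to the normalization point of the strong-inversion model.

```python
from fractions import Fraction

# Table 3.5, strong-inversion column (Wp/Wn = 2.5).
RATIO = {"INV": Fraction(1), "NAND2": Fraction(9, 7), "NOR2": Fraction(12, 7)}


def path_delay(gates, load, g_u, parasitic):
    """Estimated path delay in units of tau for equal-drive gates."""
    total = 0
    for gate, nxt in zip(gates, gates[1:] + [load]):
        g = g_u * RATIO[gate]            # logical effort of this stage
        h = RATIO[nxt] / RATIO[gate]     # C_in(next) / C_in(this stage)
        total += g * h + parasitic[gate]
    return total


if __name__ == "__main__":
    # Placeholder parasitic delays, not the measured values used in the thesis.
    p = {"INV": 1, "NAND2": 2, "NOR2": 2}
    print(path_delay(["INV", "NAND2", "NOR2", "INV"], "INV", 1, p))
```

Replacing the constant gu with a value evaluated from the unified model at a given VDD and T would give the estimate for that operating condition.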


Figure 3.7 Test vehicle I for proposed logical effort models


Figure 3.8 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (strong-inversion)

Figure 3.9 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (moderate-inversion)

Figure 3.10 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (weak-inversion)


3.4.2 Test Vehicle II

Test vehicle II is an 8-to-256 decoder that is used to control a register file. Figure 3.11 shows the 8-to-256 decoder along with a 32×256 register file. The register file has 256 words, and each word is 32 bits wide. Each bit presents a load of 3 unit-sized inverters, so every decoder output sees a total load of 3×32 unit capacitances. Figure 3.12 shows the circuit diagram of the 8-to-256 decoder. Every stage is designed with a stage effort of 4, following the FO4 rule for fast propagation. In addition, the branch number is 128.
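The sizing numbers above can be combined into a rough stage-count estimate. The sketch below forms the path effort F = G·B·H from the 3×32 unit load and the branch number 128, then picks the number of stages that brings the stage effort closest to 4. The path logical effort G = 1 and the unit input capacitance are assumptions made only to keep the example self-contained; the actual decoder of Figure 3.12 contains gates with larger logical effort.

```python
import math


def path_effort(g_path, branching, c_load, c_in):
    """Path effort F = G * B * H."""
    return g_path * branching * (c_load / c_in)


def stages_for_stage_effort(f_path, stage_effort=4.0):
    """Number of stages N for which F**(1/N) is closest to the target effort."""
    return max(1, round(math.log(f_path) / math.log(stage_effort)))


if __name__ == "__main__":
    load = 3 * 32                        # unit capacitances per decoder output
    f = path_effort(1.0, 128, load, 1.0)  # G = 1 and C_in = 1 are assumptions
    n = stages_for_stage_effort(f)
    print(f, n, f ** (1.0 / n))
```

A larger G would raise the path effort and can add stages, so the result here should be read as a lower-bound illustration rather than the thesis design.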

The 8-to-256 decoder is simulated in UMC 65-nm CMOS technology. The path delays are estimated with the logical effort model. The simulated and estimated delays are compared in Figure 3.13, Figure 3.14 and Figure 3.15. The results show that the average absolute errors are 14.6%, 6.15% and 10.13% in the strong-, moderate- and weak-inversion regions, respectively.


Figure 3.11 8-to-256 decoder for a 32×256 register file


Figure 3.13 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (strong-inversion)

Figure 3.14 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (moderate-inversion)


Figure 3.15 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (weak-inversion)

(Plot axes: delay (ns) on a logarithmic scale versus T (°C); curves: simulated and estimated delays at 0.3V and 0.2V.)

Chapter 4

A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation

Temperature gradients have recently become a major design concern for integrated circuits. In this chapter, an intelligent solution is proposed that mitigates temperature-induced clock skew by logical effort compensation. Logical effort, an index of propagation delay that varies with thermal and supply voltage conditions, is controlled by a tunable-width buffer. As an effective way of mitigating the variable clock skew, this chapter presents an adaptive circuit technique that senses the temperature of different parts of the clock tree and dynamically adjusts the logical effort of the corresponding clock buffers to reduce the clock skew. In UMC 65-nm technology, tunable-width buffers are built together with a clock H-tree routed on the 7th metal layer; post-layout simulation shows that the clock skew is reduced by up to 97.8%, and by 72.2% on average. This leads to much improved clock synchronization and design performance.

Section 4.1 introduces the clock tree and the effect of temperature variation on it. In section 4.2, we create a gate delay that is constant against thermal variation by using a tunable-width inverter to control the logical effort. Section 4.3 presents the thermally robust buffered clock tree, which adopts the technique proposed in section 4.2. Section 4.4 gives the simulation results of the thermally robust buffered clock tree.


4.1 Introduction

With the advancement of integrated circuit technology, temperature gradients have become a significant factor in chip design, and they significantly affect chip performance. They are becoming more acute because activity differs across the parts of a chip. For instance, a processor chip contains an operating part with higher activity and a cache part with lower activity, which causes a temperature gradient. The temperature difference can be as high as 50 ºC [4.1], which affects the performance of the different functional parts and of the interconnections. In this chapter, we focus on the effect of temperature on the clock skew between spatially close and functionally related points of a clocking network. In the H-tree shown in Figure 4.1, a number of terminal locations are physically close, yet the clock signals reach them through completely different paths from the source. As a result, temperature differences between the paths can lead to significant skews. As shown in Figure 4.2 for an H-tree mapped to the 45-nm technology node, the clock skew grows with the temperature difference between different parts of the chip [4.2]. Since increased clock skew poses a serious performance threat to integrated circuits, intelligent solutions are needed to mitigate temperature-dependent clock skew.


Figure 4.1 Buffered Clock Tree

Figure 4.2 Temperature effect on edge skew between two buffers

The effect of temperature on device performance is complicated because two phenomena are mixed. First, carrier mobility decreases as temperature increases. Second, the threshold voltage drops as temperature increases. Depending on the operating point of the transistor, the drain saturation current may therefore increase or decrease with temperature. Figure 4.3, simulated by T. Ragheb [4.2], shows the drain saturation current of nMOS and pMOS devices modeled with the BSIM4 predictive 45-nm CMOS technology [4.3]. There is a zero-temperature-coefficient (ZTC) point at which the transistor current is invariant to temperature variation.
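The competition between falling mobility and falling threshold voltage can be illustrated with a toy alpha-power-law current. In the sketch below every parameter (threshold voltage, its temperature slope, the mobility exponent and alpha) is a hypothetical value chosen only to show the crossover; none is taken from the PTM 45-nm model behind Figure 4.3, and sub-threshold conduction is ignored.

```python
def drain_current(vdd, temp_k, vt0=0.4, kvt=1e-3, alpha=1.3, t0=300.0, mu_exp=1.5):
    """Relative saturation current in arbitrary units (toy model)."""
    mobility = (temp_k / t0) ** (-mu_exp)   # mobility falls as temperature rises
    vt = vt0 - kvt * (temp_k - t0)          # threshold voltage falls as temperature rises
    overdrive = max(vdd - vt, 0.0)
    return mobility * overdrive ** alpha


def ztc_voltage(t_cold=300.0, t_hot=400.0, lo=0.45, hi=1.2, iters=60):
    """Bisection for the voltage where hot and cold currents are equal."""
    def diff(v):
        return drain_current(v, t_hot) - drain_current(v, t_cold)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if diff(lo) * diff(mid) <= 0:
            hi = mid    # sign change lies in the lower half
        else:
            lo = mid    # sign change lies in the upper half
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print("ZTC voltage of the toy model:", ztc_voltage())
    for v in (0.5, 1.0):
        print(v, drain_current(v, 300.0), drain_current(v, 400.0))
```

Below the crossover voltage the threshold shift dominates and the current rises with temperature; above it the mobility loss dominates and the current falls.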

Figure 4.3 Inversion of the temperature dependence of drain saturation current for a PTM 45- nm (a) nMOS transistor and (b) pMOS transistor [4.2]

The ZTC point has been well known to designers for a long time. It is the basis of the method suggested by Shakeri and Meindl in [4.4], which uses a temperature-variable supply voltage of 1V (TVS) to guarantee near-constant delay across a temperature range. However, the ZTC point is also a function of the technology node. Because ZTC points differ between technologies, designers may need to redesign circuits that use the ZTC bias method when the circuits are ported from one technology to another.

Previous solutions use fixed, known temperature profiles [4.5]–[4.7] that are built beforehand. However, this may be too optimistic, especially for processors running different applications. Other techniques try to manage clock skew under thermal variations; nevertheless, they sacrifice performance to achieve immunity against variations [4.8]. Finally, dynamic adjustment techniques for microprocessor pipelines have been proposed, but they incur significant overheads to enable timing violation detection and correction [4.9].


In this chapter, a thermally robust buffered clock tree is proposed. It uses a tunable-width inverter as the clock buffer and adjusts its drive ability by means of logical effort compensation. We consider thermal conditions from -50°C to 125°C.

4.2 Creating Constant Gate Delay against Thermal Variation

In this section we introduce a method of creating a constant delay that is invariant to thermal conditions. Chapter 3 presented the unified logical effort models, in which the logical effort is a function of voltage and temperature. Here the voltage is held constant, so the logical effort of a gate varies only with temperature. By adjusting the logical effort of a gate to a constant value, a constant gate delay can be created. To adjust the logical effort, a tunable-width inverter is adopted, in which the width, and with it the logical effort, can be tuned. The relation between width and logical effort is shown later.

The constant gate delay is used for the buffers of the clock tree. Constant delay means that the buffer delays are invariant to temperature, so the clock skew can be minimized.

4.2.1 Effects of Dynamically Tuning MOSFET Width on Logical Effort

From (3.19), logical effort g is inversely proportional to drain current ID:

$g \propto \dfrac{1}{I_D}$ (4.1)

In the current equation, the current is proportional to the width-to-length ratio:

$I_D \propto \dfrac{W}{L}$ (4.2)

So logical effort is inversely proportional to (W/L):

$g \propto \dfrac{L}{W}$ (4.3)

L is fixed, so logical effort g is inversely proportional to width W. In chapter 3, we demonstrated that logical effort is affected by thermal and supply voltage conditions.

The relation between two logical efforts with different widths W1 and W2 also depends on temperature and supply voltage. The width is tuned by adopting the tunable-width inverter shown in Figure 4.4. In this figure, control signals B0-B7 come from outside control blocks and determine the total width of the tunable-width inverter. The widths of the MOSFETs are binary weighted, 1X, 2X, ..., 128X unit size, corresponding to control signals B0-B7, and the available tuning range of the width is from 1X to 255X. By altering the width, we can tune the logical effort to a specific value.
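A minimal sketch of how a control word could be chosen follows. It uses only the binary weighting of B0-B7 and the proportionality g ∝ 1/W from (4.3): if the logical effort at the present temperature has changed by some factor relative to a reference condition, the width is scaled by the same factor to bring g back to its reference value. The function names and the rounding and clamping policy are assumptions for illustration, not the control scheme of the thesis.

```python
def total_width(code):
    """Total width in unit sizes selected by the 8-bit control word B[7:0]."""
    if not 0 <= code <= 255:
        raise ValueError("control word must fit in 8 bits")
    # Bit i switches in a transistor of width 2**i (1X, 2X, ..., 128X).
    return sum(1 << i for i in range(8) if (code >> i) & 1)


def compensating_code(code_ref, g_ref, g_now):
    """Control word that restores the reference logical effort, using g ~ 1/W."""
    target = total_width(code_ref) * g_now / g_ref
    # Clamp to the available tuning range 1X..255X.
    return min(255, max(1, round(target)))


if __name__ == "__main__":
    # Hypothetical case: logical effort rose by 20% at the present temperature.
    print(compensating_code(100, 1.0, 1.2))
```

Because the widths are binary weighted, the control word is numerically equal to the total width, which keeps the mapping from a target width to B[7:0] trivial.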


Figure 4.4 Tunable-width inverter

4.2.2 Creating Constant Gate Delay

In this section, we will demonstrate how to create constant gate delay by tuning
