
Chapter 3 Unified Logical Effort Models over Wide Supply Voltage and

3.4 Experimental Result

3.4.2 Test Vehicle II

Test vehicle II is an 8-to-256 decoder which is used to control a register file.

Figure 3.11 shows the 8-to-256 decoder along with a 32×256 register file. The register file holds 256 words, each 32 bits wide. Each bit presents a load of three unit-sized inverters, so every decoder output drives a total of 3×32 = 96 unit capacitances. Figure 3.12 shows the circuit diagram of the 8-to-256 decoder. Every stage is sized for a stage effort of 4, following the FO4 rule for fast propagation. The branching factor is 128.
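The load and sizing figures above can be checked with a short logical-effort calculation. The sketch below is illustrative only: the load of 3×32 unit capacitances, the branching factor of 128 and the stage effort of 4 come from the text, while the path logical effort `G` and the input capacitance `c_in` are hypothetical inputs that the text does not state.

```python
import math

# Values taken from the text
UNIT_LOADS_PER_BIT = 3   # each bit loads 3 unit-sized inverters
BITS_PER_WORD = 32       # 32-bit words
BRANCHING = 128          # branching factor of the decoder
STAGE_EFFORT = 4.0       # target stage effort (FO4 rule)


def output_load():
    """Load on each decoder output, in unit capacitances."""
    return UNIT_LOADS_PER_BIT * BITS_PER_WORD


def path_effort(G, c_in):
    """Path effort F = G * B * H, with H = C_out / C_in.

    G and c_in are hypothetical inputs; the text does not state them.
    """
    H = output_load() / c_in
    return G * BRANCHING * H


def stage_count(F, stage_effort=STAGE_EFFORT):
    """Nearest integer number of stages giving roughly `stage_effort` per stage."""
    return max(1, round(math.log(F) / math.log(stage_effort)))
```

With placeholder values `G = 1` and `c_in = 1`, `stage_count(path_effort(1.0, 1.0))` gives an idea of the path depth that a stage effort of 4 implies; the actual decoder in Figure 3.12 may differ.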

The 8-to-256 decoder is simulated in UMC 65nm CMOS technology, and the path delays are estimated with the logical effort model. Figures 3.13, 3.14 and 3.15 compare the simulated and estimated delays. The average absolute errors are 14.6%, 6.15% and 10.13% in the strong-, moderate- and weak-inversion regions, respectively.

Figure 3.11 8-to-256 decoder for a 32×256 register file (decoder input A[7:0]; register file of 256 words × 32 bits)


Figure 3.13 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (strong-inversion)

Figure 3.14 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (moderate-inversion)


Figure 3.15 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (weak-inversion). Axes: delay (ns), logarithmic from 1 to 1000, versus T (°C) from -100 to 150. Curves: simulated and estimated delays at 0.3V and 0.2V.


Chapter 4

A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation

Temperature gradients have recently become a major design concern for integrated circuits. This chapter proposes a solution that mitigates temperature-induced clock skew by using logical effort compensation. Logical effort, an index of propagation delay that varies with thermal and supply voltage conditions, is controlled by a tunable-width buffer. The adaptive circuit technique senses the temperature of different parts of the clock tree and dynamically adjusts the logical effort of the corresponding clock buffers to reduce the clock skew. In UMC 65-nm technology, tunable-width buffers and a clock H-tree routed in 7th-layer metal are constructed and evaluated in post-layout simulation, which shows that the clock skew is reduced by up to 97.8%, and by 72.2% on average. This leads to much improved clock synchronization and design performance.

Section 4.1 introduces the clock tree and the effect of temperature variation on it. Section 4.2 creates a gate delay that is constant against thermal variation by using a tunable-width inverter to control the logical effort. Section 4.3 presents the thermally robust buffered clock tree, which adopts the technique proposed in section 4.2. Section 4.4 gives the simulation results of the thermally robust buffered clock tree.


4.1 Introduction

With the advancement of integrated circuit technology, the temperature gradient has become a significant factor in chip design because it significantly affects chip performance. Temperature gradients are becoming more acute because different parts of a chip have different activity levels. For instance, a processor chip contains an operating part with higher activity and a cache part with lower activity, which causes a temperature gradient. The temperature difference can be as high as 50 ºC [4.1], which affects the performance of the different functional parts and of the interconnect. In this chapter, we focus on the effect of temperature on the clock skew between spatially close and functionally related points of a clocking network. In the H-tree shown in Figure 4.1, a number of terminal locations are physically close, yet their clock signals arrive through completely different paths from the source. As a result, temperature differences along the paths can lead to significant skew. As shown in Figure 4.2 for the H-tree mapped to the 45-nm technology node, the clock skew increases with increasing temperature difference between different parts of the chip [4.2]. Since increased clock skew is a serious threat to the performance of integrated circuits, solutions are needed to mitigate temperature-dependent clock skew.


Figure 4.1 Buffered Clock Tree

Figure 4.2 Temperature effect on edge skew between two buffers

The effect of temperature on device performance is complicated because two competing phenomena are involved. First, carrier mobility decreases as temperature increases. Second, the threshold voltage drops as temperature increases.

Depending on the operating point of the transistor, the drain saturation current may therefore either increase or decrease with temperature. Figure 4.3, simulated by T. Ragheb [4.2], shows the drain saturation current of nMOS and pMOS devices modeled using the BSIM4 predictive 45-nm CMOS technology [4.3]. There is a zero-temperature-coefficient (ZTC) point at which the transistor current is invariant to temperature variation.

Figure 4.3 Inversion of the temperature dependence of drain saturation current for a PTM 45-nm (a) nMOS transistor and (b) pMOS transistor [4.2]

The ZTC point has long been well known to designers. It is the basis of the method suggested by Shakeri and Meindl in [4.4], which uses a temperature-variable supply voltage (TVS) of 1V to guarantee near-constant delay across a temperature range. However, the ZTC point is also a function of the technology node. Because ZTC points differ between technologies, designers may need to redesign circuits that use the ZTC bias method when porting them from one technology to another.

Previous solutions use fixed, known temperature profiles that are built beforehand [4.5]–[4.7]. This assumption may be too optimistic, especially for processors running different applications. Other techniques manage clock skew under thermal variations, but they sacrifice performance to achieve immunity against variations [4.8]. Finally, dynamic adjustment techniques have been proposed for microprocessor pipelines, but they incur significant overhead to enable timing violation detection and correction [4.9].


In this chapter, a thermally robust buffered clock tree is proposed. It uses tunable-width inverters as clock buffers and adjusts their drive strength by means of logical effort compensation. We consider thermal conditions from -50°C to 125°C.

4.2 Creating Constant Gate Delay against Thermal Variation

This section introduces a method of creating a gate delay that is invariant to thermal conditions. Chapter 3 presented the unified logical effort models, in which the logical effort is a function of voltage and temperature. Here the voltage is held constant, so the logical effort of a gate varies only with temperature.

By holding the logical effort of a gate at a constant value, a constant gate delay can be created. To adjust the logical effort, a tunable-width inverter is adopted, in which the width, and with it the logical effort, can be tuned. The relation between width and logical effort is shown later.

The constant gate delay is used for the buffers of the clock tree. Because the buffer delays are then invariant to temperature, the clock skew can be minimized.

4.2.1 Effects of Dynamically Tuning MOSFET Width on Logical Effort

From (3.19), logical effort g is inversely proportional to drain current ID:

$$ g \propto \frac{1}{I_D} \tag{4.1} $$

In the current equation, the current is proportional to the width-to-length ratio:

$$ I_D \propto \frac{W}{L} \tag{4.2} $$

So logical effort is inversely proportional to (W/L):

$$ g \propto \frac{L}{W} \tag{4.3} $$

L is fixed, so logical effort g is inversely proportional to width W. In chapter 2, we demonstrated that logical effort is affected by thermal and supply voltage conditions.

From (4.3), the relation between two logical efforts with different widths W1 and W2 at the same temperature T and supply voltage V is

$$ \frac{g_{W2}(V, T)}{g_{W1}(V, T)} = \frac{W_1}{W_2} \tag{4.4} $$

The width is tuned by adopting the tunable-width inverter shown in Figure 4.4. In this figure, control signals B0-B7 come from external control blocks and determine the total width of the tunable-width inverter. The MOSFET widths are binary weighted (1X, 2X, …, 128X unit size, corresponding to control signals B0-B7), so the available tuning range of the width is 1X to 255X. By altering the width, we can tune the logical effort to a specific value.
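The binary weighting can be made concrete with a small sketch. It assumes that a control bit set to 1 enables the corresponding segment (the polarity is not stated in the text) and uses the proportionality g ∝ 1/W from (4.3); the function names are illustrative.

```python
WEIGHTS = [1 << i for i in range(8)]   # 1X, 2X, 4X, ..., 128X for B0..B7


def total_width(code):
    """Total width in unit sizes for an 8-bit control code B[7:0].

    Assumes a bit value of 1 enables its segment (polarity is an assumption).
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("B[7:0] must be an 8-bit value")
    return sum(w for i, w in enumerate(WEIGHTS) if (code >> i) & 1)


def relative_logical_effort(w_ref, w):
    """g(w) / g(w_ref) under g ∝ 1/W from (4.3), at fixed voltage and temperature."""
    return w_ref / w
```

For example, enabling only B7 gives a width of 128X, and halving the width from 128X to 64X doubles the logical effort under this model.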


Figure 4.4 Tunable-width inverter (binary-weighted segments 1X, 2X, 4X, …, 64X, 128X between IN and OUT, enabled by B0-B7; equivalent symbol: an inverter with control input B[7:0])

4.2.2 Creating Constant Gate Delay

This section demonstrates how to create a constant gate delay by tuning the logical effort to a fixed value. Constant-delay buffers mitigate temperature-induced clock skew, as described in section 4.3. From the delay equation of classic logical effort, the gate delay is dabs = τ(gh + p). The stage effort f = gh is usually much more significant than the parasitic effort p, so we consider the effects of temperature and voltage only on the logical effort g and neglect their effects on the parasitic effort. Under various thermal and supply voltage conditions, we tune the logical effort g to a fixed value in order to create a constant delay.
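As a quick numerical illustration of the delay equation, the sketch below evaluates dabs = τ(gh + p) and the share of the delay contributed by the stage effort. The example values (g = 1, h = 4, p = 1 for a fanout-of-4 inverter) follow the usual logical effort normalization and are assumptions, not figures from this work.

```python
def gate_delay(tau, g, h, p):
    """Absolute gate delay d_abs = tau * (g * h + p)."""
    return tau * (g * h + p)


def stage_effort_share(g, h, p):
    """Fraction of the normalized delay that comes from the stage effort f = g * h."""
    return (g * h) / (g * h + p)


# Assumed example: fanout-of-4 inverter with g = 1, h = 4, p = 1, tau = 1
d = gate_delay(tau=1.0, g=1.0, h=4.0, p=1.0)       # 5.0 (in units of tau)
share = stage_effort_share(g=1.0, h=4.0, p=1.0)    # 0.8
```

Under these assumed values the stage effort accounts for 80% of the delay, which is the sense in which the parasitic term is treated as secondary.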

Assume that supply voltage is a fixed value VSUPPLY. The relation between two logical efforts gW1(VSUPPLY, T) and gW2(VSUPPLY, T) has been shown in equation (4.4).

We define gW1(VSUPPLY, 25°C) = 1 at supply voltage VSUPPLY, temperature 25°C and width W1. Our goal is to keep the logical effort at the fixed value of 1 under various thermal conditions. The width is therefore altered to the target width W2 according to the current temperature T so that gW2(VSUPPLY, T) = 1. Substituting gW2(VSUPPLY, T) = 1 into equation (4.4) gives:

$$ W_2 = W_1 \times g_{W1}(V_{SUPPLY}, T) \tag{4.5} $$

W2 is calculated by multiplying the reference width W1 by gW1(VSUPPLY, T). For example, assume VSUPPLY = 0.5V and W1 = 128X unit size. If the temperature changes from 25°C to -25°C, the logical effort changes from gW1(0.5V, 25°C) = 1 to gW1(0.5V, -25°C) = 1.34, corresponding to procedure I in Figure 4.5. The target width W2 for which gW2(0.5V, -25°C) = 1 is W1 × gW1(0.5V, -25°C) = 128X × 1.34 ≈ 172X unit size. Tuning the width to W2 sets the logical effort back to 1, which corresponds to procedure II in Figure 4.5. Figure 4.6 shows the W2-T curve according to equation (4.5) with VSUPPLY = 0.5V and W1 = 128X.
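The worked example can be reproduced in a few lines. This is a sketch: the rounding to the nearest unit size and the clamping to the 1X to 255X tuning range are assumptions about how the width is quantized, and the function names are illustrative.

```python
def target_width(w1, g_w1_at_t, w_min=1, w_max=255):
    """Target width W2 = W1 * g_W1(V, T), rounded to a whole unit size.

    Rounding and clamping to the tuning range are assumptions.
    """
    w2 = round(w1 * g_w1_at_t)
    return max(w_min, min(w_max, w2))


def residual_effort(w1, g_w1_at_t, w2):
    """Logical effort left after tuning, g_W2 = g_W1 * W1 / W2."""
    return g_w1_at_t * w1 / w2


w2 = target_width(128, 1.34)          # example from the text: 0.5V, -25°C
code = format(w2, "08b")              # control code B[7:0], MSB first
g_after = residual_effort(128, 1.34, w2)
```

Here `w2` is 172 and `code` is `10101100`; because the width is quantized, `g_after` lands close to 1 rather than exactly on it.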

Figure 4.5 Logical effort with two different widths

Figure 4.6 Tuned W2 according to various thermal conditions


To test the method of creating a constant gate delay against thermal variation, we simulate a ring oscillator composed of nine tunable-width inverter stages in UMC 65-nm technology. In Figure 4.7, control signals B0-B7 determine the total width of every tunable-width inverter. The width is tuned according to equation (4.5) to compensate for temperature-induced delay variation. With compensation, the period of the ring oscillator is almost unchanged. Figure 4.8 shows the 0.5V simulation results before and after compensation. With a fixed width of 128X, the maximum normalized period reaches 1.77; with the logical effort tuned to 1, it is lowered to 1.08.

Figure 4.7 A ring oscillator composed of 9-stage tunable-width inverters (control code B0-B7 applied as B[7:0]; output Out)

Figure 4.8 Normalized period before and after compensation


4.3 A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation

To mitigate temperature-induced clock skew in the clock tree, we propose a thermally robust buffered clock tree. The commonly used buffered H-tree is adopted as the clock distribution scheme, with a local temperature sensor placed beside each clock buffer. The logical effort of each clock buffer is tuned according to the digital temperature codes. With the logical effort tuned to 1 as shown in section 4.2, the clock buffer has a nearly constant delay under temperature variation.

A typical H-tree in a UMC 65-nm design, shown in Figure 4.1, is chosen, with a die length of 2 cm. The side length of the first-level H-tree is 10 mm, and that of the next level is 5 mm. The H-tree interconnections use 7th-layer metal with a width of 1 µm. To show the effect of a temperature difference on clock skew, the die is divided into two temperature areas, TL on the left half and TR on the right half. As the clock signal propagates through the H-tree, it passes through regions at different temperatures and accumulates different delays, thereby producing clock skew. In this design, clock skew is measured between points A and B in Figure 4.1.

Beside each clock buffer, there is a temperature sensor and a look-up table. Figure 4.9 shows the control blocks and the tunable-width buffer. The thermal condition is sensed by the temperature sensor, which outputs the temperature code T[9:0]. The total width of the tunable-width buffer is then adjusted according to the look-up table, where B[7:0] is the width control code. In essence, the buffer's logical effort is tuned to a fixed value based on equation (4.5).
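One way to picture the look-up table is as a map from temperature to width code built from equation (4.5). The sketch below is an assumption-laden illustration: it works directly in °C rather than on the raw T[9:0] code (whose encoding is not given here), it picks the nearest tabulated temperature, and it contains only the two logical effort values quoted in section 4.2.2 for 0.5V; a real table would hold many more entries.

```python
def build_width_lut(effort_by_temp, w1, w_max=255):
    """Map each tabulated temperature (°C) to a width code via W2 = W1 * g_W1(T)."""
    return {
        t: max(1, min(w_max, round(w1 * g)))
        for t, g in effort_by_temp.items()
    }


def lookup_width(lut, temp_c):
    """Return the width code of the nearest tabulated temperature (assumed policy)."""
    nearest = min(lut, key=lambda t: abs(t - temp_c))
    return lut[nearest]


# Logical effort values at 0.5V taken from section 4.2.2 (W1 = 128X)
EFFORT_AT_0V5 = {25: 1.00, -25: 1.34}

lut = build_width_lut(EFFORT_AT_0V5, w1=128)
```

With this table, `lookup_width(lut, -20)` selects the -25°C entry and `lookup_width(lut, 30)` selects the 25°C entry.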


Section 4.2 showed that, with the logical effort tuned to a fixed value, the tunable-width inverter has a nearly constant delay. This property still holds even though the clock buffer drives a long, wide metal load. Moreover, this design mainly targets the ultra-low-voltage region, where the buffer rather than the metal resistance dominates the delay. The tunable-width buffer in the clock tree can therefore still have a constant delay with logical effort compensation.


Figure 4.9 Tunable-width buffer with control blocks

Figure 4.10 shows the fully on-chip temperature, process and voltage sensors proposed by Shi-Wen Chen [4.10]. P[3:0], V[4:0] and T[9:0] are the output codes for process, voltage and temperature, respectively. The temperature compensation block is fed with P[3:0] and V[4:0] to calibrate the temperature codes. Unlike conventional temperature sensors that use a voltage/current analog-to-digital converter (ADC) or a bandgap reference, this sensor adopts a frequency-to-digital scheme. The zero-temperature-coefficient (ZTC) bias point is used to remove the temperature effect. The sensor is designed in UMC 65nm bulk CMOS technology and can operate over a wide voltage range of 0.3V~1V. It is therefore suitable for an ultra-low-voltage thermally robust buffered clock tree.

In addition, the power consumption is no more than 3.7W at 0.3V supply voltage. Given this low-power characteristic, temperature sensors can be distributed across the chip. The temperature error is merely -0.8~0.8ºC, so the thermally robust clock tree can tune the logical effort with high precision.

Figure 4.10 Temperature Sensor Proposed by Shi-Wen Chen

4.4 Simulation Results

Hspice simulations with extracted layout parameters were performed to evaluate the performance of the proposed design. The simulations used UMC 65-nm technology and included the layout of the tunable-width buffers and the 7th-layer metal H-tree interconnect. Figure 4.11 shows the tunable-width inverter and the 7th-layer metal with 1-um width. The clock skew is measured between points A and B in Figure 4.1 under various thermal conditions. Table 4.1 lists the improvement in clock skew after applying logical effort compensation at 0.3V (sub-threshold region) and 0.5V (near-threshold region). In Table 4.1, W1 is set to 128X for 0.5V and 64X for 0.3V, considering the logical effort tuning range. Before compensation, the buffer width is fixed at W1 under all thermal conditions. With logical effort compensation, the clock buffers create a constant delay, mitigating temperature-induced clock skew. The clock skew is reduced by up to 97.8%, and by 71.19% on average.

Figure 4.11 Layout of a tunable-width inverter (binary-weighted segments 1X, 2X, 4X, 8X, 16X, 32X, 64X and 128X; Metal 7 interconnect of 1-um width)


Table 4.1 Compensation improvement of clock skew in sub/near-threshold region


Chapter 5

A Programmable Clock Generator for Sub- and Near-Threshold DVFS System

This chapter presents a sub/near-threshold programmable clock generator. It can create an output clock with a frequency of 1/8 to 4 times that of the reference clock. Variation-aware logic design is applied in the clock generator, which improves its reliability under process variation. The adoption of a pulse-circulating scheme reduces process-induced output clock jitter. In addition, we realize a PVT compensation unit for adjusting the locking range of the clock generator. The clock generator has been designed in UMC 65nm CMOS technology. The reference clock frequencies are 625 kHz at 0.2V and 5 MHz at 0.5V.

Section 5.1 gives the introduction. Section 5.2 shows the system architecture of the proposed programmable clock generator for sub/near-threshold DVFS systems. Section 5.3 introduces the variation-aware logic design for sub-threshold operation. Section 5.4 demonstrates the proposed PVT compensation technique, which mainly adjusts the clock generator's locking range. Section 5.5 gives the circuit description of the clock generator. Section 5.6 combines the clock tree proposed in chapter 4 with the programmable clock generator. Section 5.7 shows the design implementation in UMC 65-nm CMOS technology. Finally, section 5.8 presents the post-layout simulation results.


5.1 Introduction

The dynamic voltage and frequency scaling (DVFS) technique has been adopted in many low-power devices, such as wireless body area network (WBAN) communication systems. A WBAN system collects body signals and provides reliable physical monitoring through many wireless sensor nodes (WSNs) attached to or implanted inside the human body [5.1][5.2]. To meet the low-power requirement, near/sub-threshold operation has been introduced to WBAN systems.

Many clock multiplication schemes have been proposed for DVFS systems in the super-threshold region. Phase-locked loops (PLLs) are usually used as clock generators, but their locking period takes hundreds of reference clock cycles. To enhance the flexibility of the clock generator for DVFS systems, an all-digital clock generator was presented in [5.3], which generates the output clock by delaying the reference clock dynamically according to the frequency control code. However, its output frequency can only be a fraction of the reference clock frequency. A delay-locked loop (DLL) [5.4] was presented for DVFS systems, but it cannot generate a fractional clock. A cyclic clock multiplier (CCM) has been presented for DVFS applications [5.5], and it has the advantage of creating either a fractional or a multiplied clock. However, the cyclic clock multiplier uses a time-to-digital converter (TDC) for phase error detection, which consumes considerable area and power.

In this chapter, a programmable clock generator aimed at the sub- and near-threshold regions is proposed. It adopts the pulse-circulating scheme in [5.5] and offers several advantages. First, the pulse always circulates through the same delay line; thus, compared with DLL-based clock multipliers [5.6][5.7], the process-induced phase error is reduced. Second, the proposed clock generator provides PVT compensation for the locking range and takes only one reference clock cycle. Finally, variation-aware logic design is applied for sub- and near-threshold operation.

5.2 System Architecture

The architecture of the proposed clock generator is shown in Figure 5.1. The clock generator consists of the following main blocks: pulse generators (PG), phase detector, counter, lock-in delay line, PVT-Comp. (PVT compensation) delay line, PVT-Comp., control and frequency divider.

Figure 5.1 Proposed clock generator for sub- and near-threshold DVFS system

In the proposed clock generator, the CLKREF signal enters a PG, which produces pulses (PREF) at the same frequency as CLKREF. The pulse multiplier generates pulses (POUT) at 8 times the frequency of the reference pulses (PREF). In addition, the divider can divide the input frequency by 2, 4, 6 or 8. The proposed clock generator can therefore output a clock with a frequency of M/N times that of the reference clock, where M ∈ {1, 8} and N ∈ {2, 4, 6, 8} are controlled by the input frequency selection signal FS[2:0]. Table 5.1 shows the frequency selection range.

FS[2:0]    M    N    fout / fref
000        1    8    0.125
001        1    6    0.167
010        1    4    0.250
011        1    2    0.5
100        8    8    1
101        8    6    1.333
110        8    4    2
111        8    2    4

Table 5.1 Frequency selection range, where fout and fref are the frequencies of the output and reference clocks
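The rows of Table 5.1 can be reproduced programmatically. The sketch below reads the table as FS[2] selecting M and FS[1:0] selecting N; this bit-field interpretation is inferred from the table entries and is not stated in the text, and the function names are illustrative.

```python
# N selected by FS[1:0], as read from Table 5.1 (inferred mapping)
N_BY_LOW_BITS = {0b00: 8, 0b01: 6, 0b10: 4, 0b11: 2}


def decode_fs(fs):
    """Return (M, N) for a 3-bit frequency selection code FS[2:0]."""
    if not 0 <= fs <= 0b111:
        raise ValueError("FS[2:0] must be a 3-bit value")
    m = 8 if fs & 0b100 else 1
    n = N_BY_LOW_BITS[fs & 0b011]
    return m, n


def output_frequency(f_ref_hz, fs):
    """Output clock frequency f_out = f_ref * M / N."""
    m, n = decode_fs(fs)
    return f_ref_hz * m / n
```

For the reference clocks quoted for this design, `output_frequency(625e3, 0b000)` and `output_frequency(5e6, 0b111)` give the lowest setting at 0.2V and the highest setting at 0.5V.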

To produce POUT at 8 times the frequency of PREF, we adopt a circulating scheme. Each pulse of PREF enters the circulating path and circulates 8 times. The path is determined by the path selection signal SEL: when SEL = 1, the pulse from PREF can enter the delay line; otherwise, the circulating path is formed. The counter counts the number of times the pulse has travelled around the circulating path and uses the signal countE8 to inform the phase detector and the control block when the count equals 8. The phase detector compares the phases of POUT and PREF only when the count equals 8. The control block changes the value of C[5:0] according to the comparison results, LEAD and LAG. Figure 5.2 shows the procedure of system operation. After the system is reset, the state machine passes through three steps: PVT compensation, SAR (successive approximation register) control and lock.
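The SAR control step can be sketched as a binary search over the delay code C[5:0]. The model below is hypothetical: it assumes a linear delay line (`d0 + code * step`), assumes that LAG means the eighth circulated pulse arrives after the next reference pulse, and makes one comparison per bit. None of these details are specified in the text above.

```python
def sar_lock(t_ref, d0, step, bits=6, circulations=8):
    """Successive-approximation search for the delay code.

    Hypothetical delay model: one pass through the delay line takes
    d0 + code * step, so `circulations` passes take
    circulations * (d0 + code * step). Returns the largest code whose
    total delay does not exceed the reference period t_ref (0 if none fits).
    """
    code = 0
    for bit in reversed(range(bits)):          # MSB first
        trial = code | (1 << bit)
        total = circulations * (d0 + trial * step)
        lag = total > t_ref                    # assumed meaning of LAG
        if not lag:                            # LEAD or aligned: keep the bit
            code = trial
    return code


# Example with arbitrary integer time units
locked_code = sar_lock(t_ref=1000, d0=50, step=2)
```

In this example eight circulations fit the reference period for codes up to 37, and the search reaches that code after six comparisons, one per bit of C[5:0].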

Figure 5.2 Finite state machine (states: Reset, PVT Comp., SAR, Lock; transition labels: finish SAR, out of locking range)

In the first step, the system performs PVT compensation. In the sub- and near-threshold regions, device behavior is affected more seriously by PVT variations than in the super-threshold region. PVT variations cause the delay of the lock-in delay line to vary widely. To compensate for these delay variations, the clock generator uses the PVT-Comp. technique to provide adequate delay
