
In this chapter, we first present the motivation for moving from synchronous to asynchronous design, together with its advantages and disadvantages. We then introduce the basic theory of asynchronous design. A 2-phase handshake protocol for iterative computation, modified from the MOUSETRAP [21] pipeline, is proposed and tested.

A design flow using commercial CAD tools is also provided, making asynchronous design an option for synchronous designers working with standard cells. Finally, an energy-efficient 16-tap FIR filter is built as an example to compare asynchronous and synchronous design.

3-1 Motivation

Fig. 3-1 The Monte Carlo simulation with intra-die variation of 3 sigma using UMC 90nm process under 1.0V and 0.5V supply voltage respectively

Voltage scaling is a common technique for reducing power consumption in circuit design. However, as the supply voltage is scaled down, circuit propagation delay becomes extremely sensitive to PVT variation. Hence, a synchronous design requires a large delay margin for successful operation. In an asynchronous circuit, by contrast, the operating speed is decided by the handshakes between registers instead of the global worst-case clock cycle. This average-case design can therefore result in faster operation or lower power consumption.

Fig. 3-1 shows the Monte Carlo simulation of the critical-path delay in a multiply-accumulate (MAC) unit under 1.0V and 0.5V supply voltage. The figure shows that the delay variance at 0.5V is approximately three times the variance at 1.0V. Because a synchronous design must use the worst case as its operating condition, a large timing safety margin is wasted.

Besides combating PVT variation under low supply voltage, asynchronous circuits are also attractive for low-speed systems with event-driven functions that are activated only when certain events occur. Asynchronous design provides power management with low latency and removes the need for an additional high-speed clock. The simple handshake interface also eases system integration between the asynchronous and synchronous portions.

3-2 Introduction to Asynchronous Circuit

3-2.1 Moving from Synchronous to Asynchronous Approach

Synchronous designs use a global clock to synchronize the whole design. Besides severe PVT variation under low supply voltage, uncertainty of the clock source is another important issue. Clock skew and jitter, together with the variation in logic propagation delay under low supply voltage, make the synchronous approach inefficient for low-supply systems. Although current CAD tools provide powerful algorithms for generating clock trees, the widely distributed buffers and delay cells still result in a large power overhead.


Unlike synchronous design, asynchronous design replaces the global clock with local handshake circuits. Fig. 3-2 shows a common pipeline structure for both synchronous and asynchronous design. The red part shows the replacement of the global clock by local handshake circuits. There are numerous approaches to designing without clocks, each with its own pros and cons depending on the design style. Some of the major potential benefits include the following.

• Robust operation against PVT variation due to elimination of the fixed clock.

• Modular composition and delay-insensitive interfacing.

• Power management with very low latency.

However, clockless designs still have some significant potential drawbacks, including:

• Complicated design approaches unfamiliar to synchronous designers

• Lack of support from existing EDA tools

• Area and performance overhead for handshake circuits

Therefore, one must be clear about the properties of asynchronous design when applying it to a system, exploiting its advantages and reducing the overhead as much as possible to deliver an efficient design. The next part introduces the basic asynchronous pipeline styles and handshake protocols.

Fig. 3-2 The difference between synchronous design and asynchronous design (red part): (a) synchronous pipeline with clock tree (b) asynchronous pipeline, bundled delay (c) asynchronous pipeline, dual rail


3-2.2 Handshake Protocol

The two most common asynchronous design styles are the bundled-delay asynchronous pipeline (Fig. 3-2 (b)) and the dual rail asynchronous pipeline (Fig. 3-2 (c)).

A brief introduction to the two styles is given here.

• Dual rail asynchronous pipeline

One common type of quasi-delay-insensitive circuit is the dual rail asynchronous design. Dual rail design encodes the arrival information into the data bit itself. Two wires are therefore required to represent the arrival of one bit of information, which makes the design “dual rail”. Table 3-1 shows a dual rail encoding scheme. With this kind of encoding, two parties can communicate reliably regardless of delay variations. However, the additional dual rail datapath and storage result in large area and power penalties. Although customized gates can be designed to reduce the overhead, the design process costs too much effort and time for a synchronous designer.

Table 3-1 One encoding scheme for dual rail encoding

State      True rail   False rail
Null       0           0
Logic 0    0           1
Logic 1    1           0
N/A        1           1

• Bundled-delay asynchronous pipeline

Bundled-delay asynchronous designs are conceptually closest to synchronous design. Each datapath through a combinational block is matched with a delay line that has the same propagation delay. Because of this “bundled” delay line, this type of asynchronous circuit is called “bundled delay” or “matched delay”. Thanks to the similarity with synchronous design, we can use standard cells and commercial CAD tools with customized design constraints to implement the design.
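The dual rail code of Table 3-1 can be illustrated with a small behavioural sketch. This is only a model of the encoding, not of any circuit in this work; the function names and the least-significant-bit-first word order are assumptions made for the example.

```python
# Dual rail code from Table 3-1: each bit travels as a (true_rail, false_rail) pair.
NULL = (0, 0)


def encode_bit(bit):
    """Return the (true_rail, false_rail) pair for logic 0 or logic 1."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for valid data, None for Null; reject the unused (1, 1) state."""
    if pair == NULL:
        return None
    if pair == (1, 1):
        raise ValueError("(1, 1) is not a valid dual rail code word")
    return 1 if pair == (1, 0) else 0


def word_is_complete(pairs):
    """A word has arrived once every bit pair has left the Null state."""
    return all(pair != NULL for pair in pairs)


def encode_word(value, width):
    """Encode an unsigned integer, least significant bit first (assumed order)."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def decode_word(pairs):
    """Decode a complete word; call only after word_is_complete() is true."""
    return sum(decode_pair(pair) << i for i, pair in enumerate(pairs))
```

The completion check is what replaces the clock here: the receiver needs no timing assumption, because the data itself signals when all bits are valid.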


Handshake circuits can also be categorized by the number of phases required for one data transfer. The most common protocols are the 4-phase and 2-phase handshake protocols. The 4-phase handshake protocol, also called “Return-to-Zero (RTZ)” or level-triggered protocol, uses logic “1” to signal valid data, so simple transparent latches can be used as pipeline registers. However, an additional return-to-zero operation is required before the next handshake can start, which costs extra switching and time. To reduce the extra RTZ time, an AND delay line configured as in Fig. 3-3 (c) [19] can be used to force the output to zero as soon as the input goes to zero.

The 2-phase handshake protocol, also called non-return-to-zero (NRTZ) or transition-signaling protocol, uses signal transitions as events. No additional RTZ is required, but more complex handshake circuits or latches may be needed to recognize both positive and negative edges.
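The difference in switching activity between the two protocols can be sketched by listing the Req/Ack events for a number of transfers. The functions below are a hypothetical illustration that ignores data and timing and only counts handshake events.

```python
def four_phase(n_transfers):
    """Req/Ack events for n return-to-zero transfers: both wires rise, then both fall."""
    events = []
    for _ in range(n_transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(n_transfers):
    """Req/Ack events for n transition-signaling transfers: each wire toggles once."""
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        req ^= 1
        events.append(("req", req))
        ack ^= 1
        events.append(("ack", ack))
    return events
```

For the same number of transfers the 2-phase protocol produces half as many handshake transitions, at the price that the receiver must treat rising and falling edges alike.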

Fig. 3-3 (a) 4-phase handshake protocol (b) 2-phase handshake protocol (c) AND delay line


3-2.3 Muller C Element

One of the basic components generally used in asynchronous design is the Muller C-element [20]. The output follows the inputs when all inputs match; otherwise it holds its previous value. Fig. 3-4(a) shows the symbol and truth table of a 2-input C element. This function can be implemented either with standard NAND gates (Fig. 3-4(b)) or with a customized transistor-level construction (Fig. 3-4(c)).
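The behaviour of a 2-input C element can be written as a tiny state-holding model. This is a behavioural sketch only; the class name and reset value are chosen for the example and do not describe the NAND-based or transistor-level implementations.

```python
class CElement:
    """Behavioural model of a 2-input Muller C element."""

    def __init__(self, initial=0):
        self.out = initial  # state kept while the inputs disagree

    def update(self, a, b):
        """Copy the inputs to the output when they agree; otherwise hold."""
        if a == b:
            self.out = a
        return self.out
```

For example, starting from reset, `update(1, 0)` leaves the output at 0, `update(1, 1)` sets it to 1, and `update(0, 1)` holds it at 1 until both inputs have returned to 0.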

The C element is the basic gate of the well-known Muller pipeline [20]. Because of its unique function, it can also be used in fork and join structures (Fig. 3-5). A fork structure is used when one request is sent to two or more recipients; the corresponding acknowledge signals are then combined by a join structure using a C element (Fig. 3-5(a)). Conversely, when two requests are combined for one recipient, the requests are joined and the acknowledge is forked (Fig. 3-5(b)).

Fig. 3-4 (a) Symbol and truth table of a 2 input C element (b) Standard cell based C element (c) Transistor level implementation


Fig. 3-5 Fork and join structures

3-2.4 Asynchronous Pipelines

Here we briefly introduce two existing asynchronous pipelines, the 4-phase Muller pipeline [20] and the 2-phase minimal-overhead ultra-high-speed transition-signaling asynchronous pipeline (MOUSETRAP) [21], including their structure and working mechanism. Both are bundled-delay pipelines.

• 4-phase Muller pipeline [20]

Fig. 3-6 Muller pipeline

The Muller pipeline is the backbone of many other variations and extensions of asynchronous pipelines. Its simple handshake circuit is built using only inverters and C elements. To understand the handshake mechanism, we first assume that all C element outputs are reset to zero. A 0-to-1 transition on the left REQ sets the output of the first C element to 1 and triggers the first latch, L1. The request signal then propagates through the pipeline stages together with the data. When the successor receives the request, it sends an acknowledge signal back to the predecessor, which is then able to receive new data. With no custom cells and simple timing constraints, this pipeline style is suitable for implementation with commercial CAD tools.
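The handshake mechanism described above can be traced with a small behavioural sketch of the control chain alone. It assumes the usual arrangement in which each stage is a C element driven by the request from its left neighbour and the inverted output of its right neighbour; latches, data and gate delays are not modelled, and the function names are hypothetical.

```python
def c_element(a, b, prev):
    """2-input C element: follow the inputs when they agree, else hold prev."""
    return a if a == b else prev


def settle(state, left_req, right_ack):
    """Re-evaluate every stage until nothing changes and return the stable state.

    state[i] is the output of the i-th C element. It serves as the request
    to stage i+1 and as the acknowledge to stage i-1.
    """
    state = list(state)
    changed = True
    while changed:
        changed = False
        for i in range(len(state)):
            req_in = left_req if i == 0 else state[i - 1]
            ack_in = right_ack if i == len(state) - 1 else state[i + 1]
            new = c_element(req_in, 1 - ack_in, state[i])  # ack passes through an inverter
            if new != state[i]:
                state[i] = new
                changed = True
    return state


# A request entering a reset 3-stage pipeline propagates to the last stage.
after_request = settle([0, 0, 0], left_req=1, right_ack=0)
# When the left REQ returns to zero, the last stage keeps the request
# until the right-hand environment acknowledges it.
after_rtz = settle(after_request, left_req=0, right_ack=0)
after_ack = settle(after_rtz, left_req=0, right_ack=1)
```

In this run the state goes from `[0, 0, 0]` to `[1, 1, 1]`, then to `[0, 0, 1]` while the last request is pending, and back to `[0, 0, 0]` once it is acknowledged.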

• 2-phase MOUSETRAP pipeline [21]

Fig. 3-7 MOUSETRAP pipeline

The MOUSETRAP pipeline is a 2-phase pipeline style that can be constructed using simple transparent latches, whereas other 2-phase pipeline styles may require specially designed latches to capture the transitions of the 2-phase protocol. The handshake circuit of each stage consists of a latch and an XNOR gate. With all handshake latches reset to zero, each latch is transparent before the data arrives. On receiving a transition on the request signal, the latch captures the data and becomes opaque. The REQ transition and the data propagate on to the next stage, where they are captured again. An acknowledge signal is then sent back to the predecessor, making it transparent again so that it can capture new data.
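A single MOUSETRAP stage can be modelled behaviourally as follows. The sketch assumes that the latch enable is the XNOR of the stage's own latched request and the acknowledge from the next stage, which is consistent with the description above; the class and signal names are hypothetical and delays are ignored.

```python
def xnor(a, b):
    return 1 if a == b else 0


class MousetrapStage:
    """Behavioural model of one stage: a transparent latch gated by an XNOR."""

    def __init__(self):
        self.done = 0     # latched request; also the acknowledge to the predecessor
        self.data = None  # latched data

    def enable(self, ack_from_next):
        """Latch is transparent (1) when its own request matches the next stage's ack."""
        return xnor(self.done, ack_from_next)

    def offer(self, req_in, data_in, ack_from_next):
        """Pass a new request transition and its data if the latch is transparent.

        Returns True when the stage captured the data.
        """
        if self.enable(ack_from_next) and req_in != self.done:
            self.done = req_in
            self.data = data_in
            return True
        return False
```

After reset the stage is transparent. Once a request transition passes through, `enable` drops to 0 and further data is blocked until the acknowledge from the next stage toggles to match.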


3-3 Design Flow Using Commercial CAD Tool

While asynchronous design can provide substantial benefits, it is still largely limited by its incompatibility with existing CAD tools. Here we present a design flow based on relative timing and post-layout SPICE verification that creates and verifies constraints for the bundled-delay asynchronous pipeline using the mature timing engines of existing CAD tools. These constraints are supported by most CAD tools and allow timing-driven synthesis and automatic place and route.

3-3.1 Design flow

The signals in bundled-delay designs can be separated into two parts, the signal part and the control part. In this flow, we use the following CAD tools together with some .sdc constraints to complete the design of the 4-phase Muller pipeline and the 2-phase MOUSETRAP pipeline. Table 3-2 shows the CAD tools and constraints used in this asynchronous design flow.

Table 3-2 CAD tools and design constraints used in the asynchronous design flow

CAD Tool                                Design Constraints
Tech: UMC 90nm                          set_dont_touch
RTL: Verilog                            set_size_only
Synthesis: Synopsys Design Compiler     set_max_delay
P&R: Cadence SOC Encounter              set_min_delay
Verification: Ultrasim (SPICE Model)    set_disable_timing

To start, we first construct a template model using standard cells at gate level for the handshake circuit (control part). For the Muller pipeline, it would be like the one shown in Fig. 3-8.


The structure may be modified during synthesis by optimizations such as removing buffers or back-to-back inverters, or breaking complex gates into simple gates. Such changes may introduce hazards or substantially alter necessary delay properties.

The constraints “set_dont_touch” and “set_size_only” are used to prevent this. The “set_dont_touch” command forbids the tool from changing the cells in any way, while the “set_size_only” command prevents cells from being replaced but still allows the tool to resize them to optimize drive strength, delay, and power.

Fig. 3-8 A Verilog template for the Muller pipeline (signals: L_REQ, L_ACK, R_REQ, R_ACK, EN)

set_size_only -all_instance {*/c_ele0_nand0}

set_size_only -all_instance {*/c_ele0_nand1}

set_size_only -all_instance {*/c_ele0_nand2}

set_size_only -all_instance {*/c_ele0_nand3}

set_size_only -all_instance {*/hs0_inv}


CAD tools for clocked design use clock domains to optimize power and performance based on the defined frequency. These tools operate on a directed acyclic graph (DAG). If the timing graph contains loops, algorithms in the tool are invoked to break them. In asynchronous design there are plenty of feedback loops, either in the standard-cell-based C element or in the protocol cycles. Cutting these loops at the right place is necessary for correct and efficient timing analysis and optimization. The command “set_disable_timing” does this job. For example, we can use it to break the acknowledge path, which has no timing requirement but ends up in timing loops.

Fig. 3-9 Break timing loops for un-constrained path (the timing loops between HS0 and HS1 are cut at the ack path)

set_disable_timing -from HS1/L_ACK -to HS0/R_ACK

Besides the control part, the registers and datapath can be generated using the standard flow of synchronous design, except that in synchronous design the maximum delay is decided by the required clock cycle. In bundled-delay asynchronous design, no clock is defined. The delay bounds for the datapath and the matched delay line are set manually with “set_max_delay” and “set_min_delay”.


Fig. 3-10 Time constraints should be set manually at the desired start and end points (figure annotations: Set max delay: 10.0ns; Set min delay: 11.0ns)

set_max_delay 10.0 -from L2_reg[*] -to L3_reg[*]

set_min_delay 10.5 -rise_from DELAY/IN -rise_to DELAY/OUT

set_max_delay 11.0 -rise_from DELAY/IN -rise_to DELAY/OUT

In this example the maximum delay from register L2 to register L3 is set to 10.0 ns, and the minimum delay of the matched delay line is set to 10.5 ns, a 5% safety margin. The next section discusses this additional margin in detail, based on Monte Carlo simulation using UMC 90nm technology. After the above .sdc constraints are applied, the design is ready for synthesis. The function is verified at gate level. The same constraints can be applied again for back-end place and route.
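The arithmetic behind these numbers can be made explicit with a short helper that derives the lower bound of the matched delay from the datapath bound and the margin, then prints the three constraint lines used above. The helper name is hypothetical, and the upper bound of the delay line is passed in as a parameter because the text does not state how it is chosen.

```python
def matched_delay_constraints(datapath_max_ns, margin, delay_max_ns):
    """Return the three .sdc lines for one stage.

    datapath_max_ns: maximum datapath delay (10.0 in the example)
    margin:          relative safety margin (0.05 in the example)
    delay_max_ns:    upper bound of the matched delay line (11.0 in the example)
    """
    delay_min_ns = round(datapath_max_ns * (1.0 + margin), 3)
    if delay_max_ns < delay_min_ns:
        raise ValueError("upper bound of the delay line is below its lower bound")
    return [
        f"set_max_delay {datapath_max_ns} -from L2_reg[*] -to L3_reg[*]",
        f"set_min_delay {delay_min_ns} -rise_from DELAY/IN -rise_to DELAY/OUT",
        f"set_max_delay {delay_max_ns} -rise_from DELAY/IN -rise_to DELAY/OUT",
    ]


for line in matched_delay_constraints(10.0, 0.05, 11.0):
    print(line)
```

With a 10.0 ns datapath and a 5% margin, the lower bound of the delay line is 10.0 × 1.05 = 10.5 ns, matching the constraint in the example.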

3-3.2 Delay Margin Tuning

Bundled-delay asynchronous design relies on correct handshaking and on the relative timing between the delay line and the datapath to compute the correct function. In the previous section, we mentioned that an additional margin is required between the matched delay and the datapath. A large margin wastes computation time, but a small margin may produce a false output that cannot be recovered. Here, Monte Carlo simulations are used to estimate the margin range, and a lead-lag detector is proposed for margin tuning.

In this example, we extract the critical path of an 8×8+16bit multiply-accumulate (MAC) unit and build a matched delay line using the worst-case standard cell library. A Monte Carlo simulation of 5000 runs is performed with intra-die variation of 3 sigma at three different corners ({0.45V, -40°C}, {0.5V, 25°C}, {0.55V, 120°C}). In the simulation, we assume that the global voltage and temperature are uniform because of the small design size. Fig. 3-11 shows the simulation result.

Fig. 3-11 The delay distribution of datapath and delay line at 3 different corner cases (a) 0.5V, 25°C (b) 0.45V, -30°C (c) 0.55V, 120°C (d) Difference between datapath and delay line (horizontal axis: delay in ns)


This figure again confirms that the delay variation increases in the critical case (Fig. 3-11 (a)(b)(c)). Fig. 3-11 (d) shows the difference between the matched delay line and the datapath. The difference is always positive, which ensures functional correctness. The delay line tracks the process variation within a margin of 1 to 4 ns, which is significantly smaller than the margin of a synchronous design that always assumes the worst-case condition.

To further minimize this margin, a lead-lag detector and a tunable delay line are proposed. The lead-lag detector checks the relationship between the delay of the critical datapath and that of the delay line and chooses the best delay time. Fig. 3-12 shows the architecture of the lead-lag detector together with the tunable delay line.

The lead-lag relationship between the delay line and the datapath is detected using a simple D flip-flop. The output of the delay line is connected to the clock pin of the flip-flop and the output of the critical path is connected to the D pin, as shown in Fig. 3-12.

Fig. 3-12 Tuning circuit including the tunable delay line and a lead-lag detector (blocks: critical path, matched delay line with tuning step, D flip-flop lead-lag detector, FSM generating the trigger and control code)


The tuning process starts with the FSM setting the control code to the delay line’s minimum delay. A triggering rising edge is then sent to the inputs of the critical path and the delay line at the same time. If the output of the delay line rises before that of the critical path, the D flip-flop captures a zero, indicating that the matched delay is shorter than the critical delay. The FSM then changes the control code for another test iteration, repeating until the sampled value of the flip-flop is one. Fig. 3-13(a) shows the tuning resolution at 3 corner cases for the same example of an 8×8+16bit MAC with a critical path of 8 ns. Fig. 3-13 (b) shows the Monte Carlo simulation after tuning. Compared with Fig. 3-11(d), the difference between delay line and datapath is greatly reduced.
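The tuning loop can be sketched as a simple search over control codes. The model below is idealized: the flip-flop has no setup time, both paths are triggered at time zero, and the eight step delays are made-up values used only to exercise the loop, not the measured tuning steps of Fig. 3-13(a).

```python
def dff_sample(critical_delay_ns, line_delay_ns):
    """Ideal D flip-flop: D is the critical-path output, CK is the delay-line output.

    Returns 1 only if the critical path has already risen when the clock edge arrives.
    """
    return 1 if critical_delay_ns < line_delay_ns else 0


def tune(critical_delay_ns, line_delays_ns):
    """Step through the control codes from the shortest delay upwards.

    Returns the first code whose sample is 1, or None if no step is long enough.
    """
    for code, line_delay_ns in enumerate(line_delays_ns):
        if dff_sample(critical_delay_ns, line_delay_ns) == 1:
            return code
    return None


# Hypothetical 8-step delay line (ns) around an 8 ns critical path.
steps_ns = [7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5]
selected_code = tune(8.0, steps_ns)
```

With these assumed values the loop stops at code 3, the first step whose delay exceeds the critical path, so the remaining margin is bounded by one tuning step.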

Fig. 3-13 (a) The 8 tuning steps at 3 corners (b) Reduced difference between datapath delay and delay line delay after tuning


3-4 Design Example: A 16-tap FIR Filter

In this section, several implementations of a 16-tap FIR filter are built using different kinds of asynchronous pipelines. We focus on the ring structure of asynchronous pipelines, which is commonly used in iterative computation. An improved 2-phase ring structure based on the MOUSETRAP pipeline is proposed. Finally, asynchronous and synchronous designs are compared.

3-4.1 Iterative FIR Architecture and Implementation

Fig. 3-14 shows the block diagram of the 16-tap FIR filter. The FIR filter computes each output in 16 iterations of the MAC operation. It basically contains a 4-bit counter and an 8×8+16bit MAC with a storage buffer.

Fig. 3-14 Block diagram for the 16-tap FIR filter (FIR computation in 16 iterations of MAC operation)
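The iterative computation can be sketched in software as one MAC operation per counter value. This is a functional illustration only: the function name is hypothetical, and bit widths, signedness and overflow of the 8×8+16bit MAC are not modelled.

```python
TAPS = 16


def fir_output(samples, coeffs):
    """Compute one FIR output sample with one multiply-accumulate per iteration.

    samples: the 16 most recent input samples held in the storage buffer
    coeffs:  the 16 filter coefficients
    """
    if len(samples) != TAPS or len(coeffs) != TAPS:
        raise ValueError("a 16-tap filter needs 16 samples and 16 coefficients")
    acc = 0
    for count in range(TAPS):  # plays the role of the 4-bit counter
        acc += samples[count] * coeffs[count]  # one MAC operation per iteration
    return acc
```

Each pass through the loop corresponds to one trip around the ring structure, which is why the handshake overhead per iteration directly sets the throughput of the filter.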

Because of the FIR structure, in-place, iterative operation is required, and a ring structure is needed to realize the iterative computation. Three different asynchronous structures are used. Fig. 3-15 (a) shows an example ring structure for the 4-phase Muller pipeline. Three latches are needed for successful handshake operation. Because of the nature of the 4-phase handshake protocol, one handshake element holds the request signal, the following handshake element holds the RTZ value, and the third handshake element is free for the data to flow inside the ring. Therefore, two data tokens and only one bubble token circulate inside the ring, and the speed is limited by the number of bubbles. Two Muller pipeline designs are implemented using different kinds of delay line cells (BUF delay line and AND delay line); their speed and energy are compared in the next section.

Fig. 3-15 (a) Ring structure for 4-phase Muller pipeline (b) The operating cycle time of the ring structure is decided by the available bubble and data number in the ring

The second structure (Fig. 3-16 (a)) uses the proposed ring structure, a modification of the MOUSETRAP pipeline. In this structure, the XNOR gate in the second stage is replaced by an XOR gate. The XOR gate compares the REQ signal before and after the stage. If the two REQ signals differ, it triggers the latch and propagates the REQ signal to the other side. After the propagation, the two REQ signals are the same and the latch becomes opaque. After the second stage receives the data, an ACK signal is sent back to start the next iteration.

Fig. 3-16 (b) shows the time diagram of the handshake protocol. No extra
