Building Blocks - Deskew CDR: Circuit Implementation

Chapter 4 Deskew CDR: Circuit Implementation

4.4 Building Blocks

4.4.1 Alexander Phase Detector

The Alexander phase detector (APD) [29] was published in Electronics Letters in 1975. Because of the high-gain characteristic of its input-to-output transfer plot, it is also known as a bang-bang PD or a binary-quantized PD. Fig. 4.8(a) shows the conventional APD architecture, and Fig. 4.8(b) and (c) show the timing diagrams. The assumption $T_{CQ} = 0$ has been made.

In Fig. 4.8(b) and (c), the input Di is sampled at both the rising and falling edges of the clock. Node a is sampled at the first rising edge, node b at the falling edge, and node c at the last rising edge. Nodes a, b, and c are then all retimed at the rising edges, which ensures synchronous operation of the XOR gates.

Fig. 4.8: Conventional APD (a) schematic (b) data leads (c) data lags

Consider the clock falling edge as the reference edge. Fig. 4.8(b) is the case in which the data edge leads the reference edge, and Fig. 4.8(c) is the case in which it lags. In the timing diagram, the data lead/lag condition can be observed through logic operations on a, b, and c. If $X \cdot \overline{Y}$ is true, the data edge leads the clock falling edge. If $Y \cdot \overline{X}$ is true, the data edge lags behind the clock falling edge. Fig. 4.8 illustrates the full-rate APD, where the clock rate equals the data rate. Because the clock falling edge also samples the data, the APD conceptually applies an oversampling scheme with an oversampling ratio of 2 to recover the NRZ data.

Fig. 4.9: Proposed APD (a) schematic (b) timing diagram

Fig. 4.9(a) shows the proposed quarter-rate APD, where n = 0, 2, 4, 6. It uses the 2.5-GHz 8-phase clocks shown in Fig. 4.9(b). The Lead/Lag information can be derived from nodes a, b, and c as previously described. In Fig. 4.8(a), $X \cdot \overline{Y}$ is asserted when the data edge leads, and $Y \cdot \overline{X}$ is asserted when the data edge lags. The proposed implementation, however, simply regards X as Lead and Y as Lag to reduce the processing time. This simplification introduces two false events, shown in Table 5, which are later corrected by the majority-vote scheme in the confidence counter.

Table 5: The truth table of the proposed APD
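The two false events can be found by enumerating all sample triples. The following is a minimal sketch, assuming the usual Alexander PD definitions X = a XOR b and Y = b XOR c (these definitions appear only in the figure and are an assumption here); the function name `apd_outputs` is hypothetical.

```python
from itertools import product

def apd_outputs(a, b, c):
    """Return (X, Y) for one sample triple.

    Assumption: X = a XOR b and Y = b XOR c, as in a conventional APD.
    """
    return a ^ b, b ^ c

false_events = []
for a, b, c in product((0, 1), repeat=3):
    x, y = apd_outputs(a, b, c)
    conventional = (x & (1 - y), y & (1 - x))  # (Lead, Lag) with masking
    simplified = (x, y)                        # (Lead, Lag) taken directly from X, Y
    if simplified != conventional:
        false_events.append((a, b, c))

print(false_events)  # [(0, 1, 0), (1, 0, 1)]
```

Under this assumption, the simplified decision differs from the conventional one only when X and Y are both true, that is, when Lead and Lag are asserted at the same time.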

4.4.2 Confidence Counter

Fig. 4.10 shows the confidence counter architecture. It consists of 1) a 3-bit unsigned encoder, 2) a comparator, and 3) an accumulator.


Fig. 4.10: Confidence counter architecture

The 3-bit Encoders

The encoders derive the counts of Lead<0:3> and Lag<0:3>. In the circuit implementation, each encoder counts the zeros of the input-bars. The result is encoded into the 3-bit unsigned integer O<0:2> shown in Table 6.

Table 6: The truth table of the encoder

I3b 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

I2b 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

I1b 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

I0b 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

O2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

O1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0

O0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0

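Denote I3b as a, I2b as b, I1b as c, and I0b as d. The output O can be expressed as below

$$O_2 = \overline{a + b + c + d} \qquad (4.12)$$

$$O_1 = \overline{a}\,\overline{b}\,(c + d) + \overline{a}\,b\,(\overline{c} + \overline{d}) + a\,\overline{b}\,(\overline{c} + \overline{d}) + a\,b\,\overline{c}\,\overline{d}$$

$$O_0 = a \oplus b \oplus c \oplus d$$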
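The encoder equations can be checked against Table 6 by brute force. The sketch below is illustrative only: the function name `encode` is hypothetical, and the sum-of-products form used for O1 is read off the truth table in Table 6 rather than taken from the circuit.

```python
from itertools import product

def encode(a, b, c, d):
    """Return (O2, O1, O0) for the complemented inputs a..d (I3b..I0b)."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    o2 = 1 - (a | b | c | d)                  # all four inputs are zero
    o1 = ((na & nb & (c | d)) | (na & b & (nc | nd))
          | (a & nb & (nc | nd)) | (a & b & nc & nd))  # two or three zeros
    o0 = a ^ b ^ c ^ d                        # parity of the inputs
    return o2, o1, o0

# The encoded value must equal the number of zeros among the inputs.
for bits in product((0, 1), repeat=4):
    o2, o1, o0 = encode(*bits)
    assert 4 * o2 + 2 * o1 + o0 == bits.count(0)
```

The loop covers all 16 columns of Table 6, so a passing run confirms that O<0:2> is the binary count of zeros.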

The Comparator

Fig. 4.11: The 3-bit binary comparator

Fig. 4.11 shows the 3-bit comparator implemented with a ripple adder. The comparison result is derived from (B - A). The actual circuit implementation adopts a carry-look-ahead adder instead of the ripple adder, so that the long carry chain is avoided.


The look-ahead carries can be expressed as

$$C_1 = P_0 \cdot C_{in} + G_0 = P_0 + G_0 \qquad (4.13)$$

$$C_2 = P_1 \cdot (P_0 + G_0) + G_1$$

$$C_3 = P_2 \cdot P_1 \cdot (P_0 + G_0) + P_2 \cdot G_1 + G_2$$

where P indicates the propagation of the carry and G indicates the generation of the carry.

The pseudo-NMOS circuits for the carry generation are given in Fig. 4.12.


Fig. 4.12: Circuits of look-ahead carries
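The look-ahead carries of (4.13) can be exercised with a short behavioural model. This is a sketch under stated assumptions, not the circuit itself: it assumes the subtraction (B - A) is computed as B plus the complement of A with C_in = 1, and that P_i = B_i OR NOT(A_i) and G_i = B_i AND NOT(A_i). The function name `lookahead_carries` is hypothetical.

```python
def lookahead_carries(b, a):
    """Return (C1, C2, C3) of B + NOT(A) + 1 for 3-bit operands.

    Assumptions: P_i = B_i OR NOT(A_i), G_i = B_i AND NOT(A_i), C_in = 1.
    """
    x = [(b >> i) & 1 for i in range(3)]        # B_i
    y = [1 - ((a >> i) & 1) for i in range(3)]  # A_i-bar
    p = [xi | yi for xi, yi in zip(x, y)]
    g = [xi & yi for xi, yi in zip(x, y)]
    c1 = p[0] | g[0]
    c2 = (p[1] & (p[0] | g[0])) | g[1]
    c3 = (p[2] & p[1] & (p[0] | g[0])) | (p[2] & g[1]) | g[2]
    return c1, c2, c3

# Under these assumptions the final carry C3 is 1 exactly when B >= A.
for a in range(8):
    for b in range(8):
        assert lookahead_carries(b, a)[2] == int(b >= a)
```

All three carries are computed directly from P and G, so no carry has to ripple through the earlier stages.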

The Accumulator

The last stage of the confidence counter is an accumulator. It is an up-down counter, as shown in Fig. 4.13. When either Leadb_COMP or Lagb_COMP from the comparator is asserted, the accumulator starts counting. Assertion of Leadb_COMP results in upward counting, while assertion of Lagb_COMP results in downward counting.


Fig. 4.13: The main counter in the accumulator

The critical path in Fig. 4.13 is the T logic. It can be optimized by the equation below

$$T_1 = \overline{A_0 + B_0} = \overline{A_0} \cdot \overline{B_0} \qquad (4.14)$$

Fig. 4.14 shows the T logic in the pseudo-NMOS scheme. The processing time is greatly reduced and approximates one gate delay.

Fig. 4.14: T’s logic in pseudo-NMOS

Fig. 4.15 shows the state transition diagram of the accumulator. Upward counting is denoted as U, and downward counting is denoted as D. N stands for the case in which both U and D are false. The path (1) generates Lag_CC. The path (2) generates Lead_CC.

Fig. 4.15: The state transition diagram of the accumulator

Table 7 shows the state names and their counting values. If the accumulator overflows, the output Lead_CC or Lag_CC is asserted, and the counter is immediately reset to its initial state ‘ST0’.

Table 7: The state table of the accumulator

State Name ST5 ST4 ST3 ST2 ST1 ST0 ST11 ST12 ST13 ST14 ST15

Counting Value 5 4 3 2 1 0 -1 -2 -3 -4 -5

Implementing the dynamic reset function is tricky. In Fig. 4.13, the main counter employs resettable T flip-flops. The inputs Leadb_COMP/Lagb_COMP and the current state are monitored continuously. Once the overflow condition is hit, the default setting⁷ is loaded into the TFFs at the next clock cycle.
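The counting and dynamic-reset behaviour can be summarized in a small behavioural model. This is a sketch only: the class name `Accumulator` is hypothetical, and two points are assumptions rather than statements from the text, namely that overflow occurs when the count would exceed the largest magnitude in Table 7, and that positive overflow maps to Lead_CC and negative overflow to Lag_CC.

```python
class Accumulator:
    """Behavioural up/down counter with dynamic reset (illustrative only)."""

    def __init__(self, limit=5):
        self.limit = limit  # largest counting magnitude in Table 7
        self.count = 0      # initial state ST0

    def step(self, up, down):
        """Advance one clock cycle; return 'Lead_CC', 'Lag_CC', or None."""
        if bool(up) == bool(down):      # case N (or conflicting inputs): hold
            return None
        self.count += 1 if up else -1
        if abs(self.count) > self.limit:            # assumed overflow condition
            flag = "Lead_CC" if self.count > 0 else "Lag_CC"
            self.count = 0                          # dynamic reset to ST0
            return flag
        return None

acc = Accumulator()
print([acc.step(1, 0) for _ in range(6)])
# [None, None, None, None, None, 'Lead_CC']
```

A flag therefore appears only after a sustained majority of decisions in one direction, which is how the counter filters out isolated false events.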

4.4.3 Finite State Machine

The FSM consists of an up-down counter and combinational logic circuits. In Fig. 4.16, the 5-bit counter is implemented with TFFs, just like the 4-bit counter in the confidence counter block. Compared with the dynamic reset of the confidence counter, the reset function here is static and serves to initialize the DCDL: it adds a delay of half the tuning range to the 10-Gb/s input data.


Fig. 4.16: The 5-bit counter in FSM

The coarse/fine control codes are developed from the counting value of the main counter. S<2:4> determines the coarse control C<0:7>, and S<0:2> determines the fine control F<0:3>. Table 8 shows the truth tables for the coarse/fine control codes. S<2:4> = [ 1, 1, 1 ] is an illegal state.

⁷ The default setting can be any counting value in binary format. It is ‘all zero’ in this implementation.

Table 8: The truth tables of FSM (a) coarse tuning function (b) fine tuning function

Table 9: The complete truth tables of FSM

Table 9 shows 28 sets of control codes. In other words, there are 27 phase intervals available for tuning. The initial state S<0:4> = [ 0, 1, 1, 1, 0 ] adds a delay of half the tuning range to the data.
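The count of 28 follows from excluding the illegal coarse state from the 32 counter states. A minimal sketch, assuming each legal S<0:4> state corresponds to one distinct set of control codes:

```python
from itertools import product

# S<0:4> as a tuple (S0, S1, S2, S3, S4); S<2:4> = (1, 1, 1) is illegal.
legal = [s for s in product((0, 1), repeat=5) if s[2:5] != (1, 1, 1)]

num_codes = len(legal)         # sets of control codes
num_intervals = num_codes - 1  # phase intervals between adjacent codes
print(num_codes, num_intervals)  # 28 27
```

The four excluded states are those with S2 = S3 = S4 = 1, one for each combination of S0 and S1.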

The generation of the coarse control C<0:7> is critical in the FSM. Even though the codes come from the synchronous S<2:4>, the combinational logic is asynchronous and may introduce timing differences among C<0:7>. Consider a timing difference of 1 UI, for example: the 4-to-1 multiplexers in the DCDL fail to select the correct data path, so one bit of the full-rate data is lost.

To minimize the timing difference between C<0:7>, the coarse control adopts the two-stage NOR scheme in pseudo-NMOS. The equations are given as below

The NOR3 and NOR2 logic gates are shown in Fig. 4.17. By adopting (4.15), the timing difference is eliminated. However, two stages are required to complete the logic, so an extra stage delay is introduced into the loop.


Fig. 4.17: NOR gates for coarse control (a) NOR3 (b) NOR2

The fine control F<0:3> determines the interpolation. It is relatively robust, because 1) the summation of weighted proportions is inherently linear, and 2) the interpolation is ensured by the thermometer coding style. The equations of F<0:3> are given below

$$F_2 = S_2 \oplus (S_1 \cdot S_0)$$

$$F_1 = S_2 \oplus S_1$$

$$F_0 = S_2 \oplus (S_1 \cdot S_0)$$

Fig. 4.18 shows the circuit implementation of $Y = A \oplus (B \cdot C)$.

Fig. 4.18: The circuit of $Y = A \oplus (B \cdot C)$ for the fine control

4.4.4 Logic Components

This section shows the circuits of the basic logic components that build up the system. In the pseudo-NMOS scheme, the processing time of each functional block is one gate delay.

XOR/XNOR

Fig. 4.19: (a) XOR (b) XNOR

D Flip-Flop


Fig. 4.20: Schematics of (a) a DFF, (b) master & slave latches, (c) a latch, (d) the source-coupling latch.

T Flip-Flop


Fig. 4.21: Schematics of (a) a resetable TFF, (b) master & slave stages of the TFF, (c) the circuit of MUX & latch.

Annotation in Fig. 4.21: Gnd: reset to 0; Vdd: reset to 1.

Chapter 5 A 10-Gb/s Transceiver Test Chip

5.1 Architecture

The test chip realizes a 10-Gb/s transceiver, which consists of a CDR and a transmitter. Fig. 5.1 shows the full-chip architecture. The core circuit is the proposed Deskew CDR. It receives the full-rate data from the tester and recovers it into four quarter-rate data streams D<0:3>, which are synchronized to the global clock P<0>. The parallel data then go to the transmitter, which is composed of the 4-to-1 serializer and the full-rate output buffer, and the serialized full-rate data go back to the tester.

Because this chip has no on-chip channels, the channel-induced jitter and the degradation of signal amplitude are modeled by the CDR’s input from the tester. Functions such as jitter injection, data driving range, and crossing-point adjustment are available on the tester.

This chip offers high flexibility in measurement, based on two facts. First, the full-rate differential data and clock come from the tester synchronously. Second, the on-chip quarter-rate clocks are derived from the full-rate clock. The performance of the chip can thus be fully measured. For example, the original target data rate is 10 Gb/s, but the chip can also be measured at 12 Gb/s for the upgrade case or at 8 Gb/s for the downgrade case.


Fig. 5.1: The full-chip architecture

5.2 Design for Test

Besides the controlled data input and the measurement flexibility described previously, this section describes the test considerations, such as phase resolution measurement and pad/buffer sharing, as well as the circuit operation in different modes.

5.2.1 Phase Resolution Measurement

An additional DCDL block, denoted as DCDL_T, is introduced to the chip. There are two main measurements to be made. The first is to determine the phase resolution of the DCDL. The second is to see how the DCDL-induced jitter affects the output eye diagram. From the eye diagrams at different tapping locations, the jitter induced from stage to stage is observed. The maximum number of stages can then be analyzed and determined from the measurement results.

5.2.2 Output Buffer Driving Capability

The driving capability of the full-rate output buffer is controllable through a variable amount of pre-emphasis. The delayed negative-feedback driving strength of the tri-state buffers constitutes the pre-emphasis mechanism. A large capacitive load requires more strength, but at the cost of eye closure along the amplitude axis. For a small capacitive load, less strength suffices. A capacitive load of 1 pF for each probe pad is considered the typical case in this work.

The negative-feedback strength depends on how many tri-state buffers are activated, and is therefore controllable.

5.2.3 Pads/Buffers Sharing

The control mechanism of DCDL_T and the driving strength of the output buffer each require 5 bits in binary format. To save area, these 10 bits share the 5 control pads, denoted as Ct<0:4>.

Two sets of 5-bit registers are used for the pad sharing. They operate in the low-frequency domain. The low-frequency clock is the global 2.5-GHz clock divided by 512, so as to avoid disturbing the internal circuits.

The 5-bit data Ct<0:4> is loaded continuously into either Register A or Register B. The bypass signal determines which register is loaded.
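The pad-sharing scheme can be pictured with a small behavioural model. This is an illustrative sketch: the class name `PadSharing` and its attributes are hypothetical, and the mapping of bypass = 1 to Register A (DCDL_T) and bypass = 0 to Register B (output buffer) follows the measurement procedure described in Section 5.2.4.

```python
class PadSharing:
    """Two 5-bit registers fed from the shared pads Ct<0:4> (illustrative)."""

    def __init__(self):
        self.reg_a = 0  # DCDL_T setting
        self.reg_b = 0  # output-buffer driving strength

    def clock(self, ct, bypass):
        """One low-frequency clock edge: load Ct<0:4> into the selected register."""
        ct &= 0b11111            # keep 5 bits
        if bypass:
            self.reg_a = ct      # bypass = 1: Register A follows the pads
        else:
            self.reg_b = ct      # bypass = 0: Register B follows the pads

low_freq_clock_hz = 2.5e9 / 512  # about 4.88 MHz
```

The register that is not selected simply holds its last value, so the buffer strength set at bypass = 0 is retained after bypass is raised to configure DCDL_T.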

The 10-Gb/s output buffer is the only output path for the 10-Gb/s data. Two inputs, multiplexed by the bypass signal, share the buffer. Likewise, a 2.5-Gb/s output buffer is shared through a multiplexer, which is controlled by the signal Ct<0>.

5.2.4 Bypass-Mode Operation

The first thing to test is the bypass-mode operation. Fig. 5.2 shows the active blocks and the signal flow in bypass mode. The full-rate data passes through DCDL_T and the 2-to-1 full-rate multiplexer, and is finally output from the buffer. This bypass mode verifies that the full-rate data path behaves well.


Fig. 5.2: Bypass-mode operation


Fig. 5.3: Nominal operation

The phase resolution of DCDL_T is measured in bypass mode. Two settings must be configured before the measurement. First, configure the output buffer by setting Register B at bypass = 0. Then set bypass = 1 to enter the bypass mode, and configure Register A for DCDL_T. The phase resolution of each step can be measured by stepping the setting of Register A.

5.2.5 Nominal Operation

The chip is in nominal mode when the bypass signal is false. Fig. 5.3 shows the corresponding active blocks and the signal flow. The full-rate data is recovered and de-serialized in the Deskew CDR, and is serialized again in the 4-to-1 serializer. A 2-to-1 multiplexer selects the full-rate data at bypass = 0. Finally, the output buffer drives the full-rate data back to the tester. The driving capability of the output buffer is configured by Register B.

5.2.6 Debug-Mode Operation


Fig. 5.4: Debug-mode operation

Tracking and locking behaviors of the CDR can be observed from the internal Lead_CC/Lag_CC signals. Fig. 5.4 shows that two 2.5-Gb/s buffers drive these two signals to an off-chip oscilloscope, and that one of the four parallel data streams from the CDR can be multiplexed to the buffer.

The debug mode is not controlled by any signal, and thus it can accompany the nominal mode. Since the two 2.5-Gb/s output buffers use their own pair of Vdd/Gnd, the supplies can be removed at any time. In that case, the chip is naturally out of debug mode.

5.3 Building Blocks

This section describes three blocks operating at 10 GHz or 10 Gb/s: a multi-phase clock generator, a serializer, and the output buffer.

5.3.1 Multi-Phase Clock Generator


Fig. 5.5: A 3-stage Johnson counter (a) architecture (b) timing diagram

Fig. 5.5(a) shows a 3-stage Johnson counter. The phase difference between Q0 and Q1 is one Ck period. In the timing diagram in Fig. 5.5(b), the logic ‘1’ propagates from Q0, through Q1, and then to Q2. Q0 falls to zero at the 4th clock edge because the first DFF’s input-bar receives the positive Q2. The logic ‘0’ then propagates in the same way as the logic ‘1’. The propagation of logic ‘1’ takes 3 Ck periods and that of logic ‘0’ takes 3 Ck periods, so the data rate of Q is the clock rate divided by 6 for the 3-stage Johnson counter.

Another way to explain the data rate of Q is to follow the data flow in the loop. Q0 propagates to Q1, to Q2, …, to Q5, and finally back to Q0. It takes 6 Ck periods to complete the loop. The data rate of Q0 is therefore the clock rate divided by 6.
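The divide-by-2N behaviour can be reproduced with a few lines of simulation. The sketch below is a behavioural model of an ideal N-stage Johnson counter starting from the all-zero state; the function name `johnson_period` is hypothetical.

```python
def johnson_period(stages):
    """Clock cycles until an N-stage Johnson counter returns to all-zero."""
    q = [0] * stages
    cycles = 0
    while True:
        q = [1 - q[-1]] + q[:-1]  # first DFF takes the inverted last output
        cycles += 1
        if not any(q):
            return cycles

print(johnson_period(3))  # 6
print(johnson_period(2))  # 4
```

For three stages, Q0 rises on the first clock edge and falls on the 4th, in line with the timing description above; for two stages, the output period is 4 clock cycles.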


Fig. 5.6: The multi-phase clock generator

The proposed clock divider consists of the 4 latches shown in Fig. 5.6. It is a 2-stage Johnson counter, since the four latches compose two DFFs. The input clock rate is 10 GHz, and the output clock rate is the input rate divided by 4, following the Johnson counter concept. To achieve high-speed operation, limiters that restrict the swing are introduced into the differential signaling.

The even-phase clocks are the outputs of the DFFs, and the phase difference between adjacent even-phase clocks is one 100-ps clock period. The odd-phase clocks are tapped from the master latches of the DFFs.
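As a quick consistency check on these numbers, the phase spacing can be worked out as follows. The 50-ps spacing between adjacent phases assumes that each master latch switches half an input clock period before its slave latch; this is an assumption about the latch timing, although it agrees with the 50-ps step marked in Fig. 4.9(b).

$$T_{Ck} = \frac{1}{10\,\mathrm{GHz}} = 100\,\mathrm{ps}, \qquad T_{P} = 4\,T_{Ck} = 400\,\mathrm{ps} \;\; (2.5\,\mathrm{GHz}), \qquad \Delta t_{P\langle n \rangle \to P\langle n+1 \rangle} = \frac{T_{P}}{8} = 50\,\mathrm{ps}$$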

Because the latch circuit is differential and fully symmetric, as illustrated previously in Fig. 4.20(d), the phase differences among P<0:7> should be equal. However, the multi-phase outputs of the latch circuits still suffer from an unbalanced duty cycle, due to the mismatch between the PMOS pulling strength and the NMOS sinking strength. To balance the duty cycle, one can adjust the common-mode level of the 10-GHz Ck.

5.3.2 Serializer

Fig. 5.7 shows the 4-to-1 serializer circuit. It consists of three 2-to-1 multiplexers cascaded in two stages. The multiplexer at the output stage employs a 5-GHz clock and generates the full-rate serialized data. The multiplexers at the first stage employ two quadrature-phase clocks and operate in the 2.5-GHz domain.

All input data are retimed with a buffered clock, which is aligned to the global clock P<0>. A DFF implements the 1-cycle delay block, while a DFF with an extra latch implements the 1.5-cycle delay block. The extra latch retimes the data to the clock falling edge, so the 1.5-cycle delay is obtained.

The conventional approach divides the fastest clock by 2 first and then generates the lower-rate clocks sequentially. In this work, all the serializer clocks are derived from the global clocks P<0:7>.


Fig. 5.7: The 10-Gb/s 4-to-1 Serializer
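The two-stage multiplexer tree can be sketched behaviourally as follows. This model ignores the retiming blocks and the quadrature clock offsets, and the input pairing (D0 with D2, D1 with D3) and the output bit order are assumptions made for illustration; the function names are hypothetical.

```python
def mux2(sel, in0, in1):
    """2-to-1 multiplexer."""
    return in1 if sel else in0

def serialize_4to1(words):
    """Serialize quarter-rate words (D0, D1, D2, D3) into one full-rate stream."""
    out = []
    for d0, d1, d2, d3 in words:
        for slot in range(4):       # four full-rate bit slots per word
            half = slot >> 1        # first-stage select (2.5-GHz domain)
            a = mux2(half, d0, d2)  # first-stage mux A: D0, then D2 (assumed pairing)
            b = mux2(half, d1, d3)  # first-stage mux B: D1, then D3 (assumed pairing)
            out.append(mux2(slot & 1, a, b))  # output mux (5-GHz clock)
    return out

print(serialize_4to1([(1, 0, 1, 1), (0, 0, 1, 0)]))
# [1, 0, 1, 1, 0, 0, 1, 0]
```

Each first-stage multiplexer changes its output only once per pair of output bits, which is why only the final multiplexer needs the 5-GHz clock.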

5.3.3 Output Buffer

Fig. 5.8(a) shows the digital-circuit implementation of the 10-Gb/s output buffer. It adopts the conventional one-bit delay scheme. The schematic of the ‘A cell’ is the same as the delay cell in Fig. 4.3(a). The A cells are cascaded as tapered buffers and drive the ‘B cell’, which is a static CMOS inverter, at the final output stage.

At the bottom of the figure, the delayed-forward loop implements the FIR filter. The output signal of the forward loop is delayed and has negative polarity relative to the main output of the B cell.

The ‘Tri’ block is composed of the binary-controlled tri-state buffers shown in Fig. 5.8(b), where the ‘c cell’ is half the size of the capital ‘C cell’. The amount of pre-emphasis is controlled by Ct<0:4>. Turning on more tri-state buffers introduces a larger overshoot in the output signal in the time domain.

In addition, two 50-ohm poly resistors (not shown) are connected in series between the differential output nodes for impedance matching.


Fig. 5.8: The 10-Gb/s output buffer (a) architecture (b) binary-control tri-state buffers
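The effect of the one-bit-delay pre-emphasis can be illustrated with an idealized two-tap FIR model. This is a sketch, not the circuit: the function name `pre_emphasis` and the coefficient `alpha` are hypothetical, and the relation between `alpha` and the number of activated tri-state buffers set by Ct<0:4> is not modeled.

```python
def pre_emphasis(bits, alpha):
    """Two-tap FIR: y[n] = x[n] - alpha * x[n-1], with bits mapped to +/-1."""
    x = [1.0 if b else -1.0 for b in bits]
    prev = 0.0  # no previous symbol before the first bit
    y = []
    for xn in x:
        y.append(xn - alpha * prev)  # delayed tap with negative polarity
        prev = xn
    return y

print(pre_emphasis([1, 1, 0, 0], 0.25))  # [1.0, 0.75, -1.25, -0.75]
```

In this model a transition bit is boosted (magnitude 1.25) while a repeated bit is attenuated (magnitude 0.75), which corresponds to the overshoot at data transitions described above.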

Chapter 6 Layout & Post Simulations

6.1 Layout

6.1.1 Layout Guidelines for High-Speed Circuits

The layout plan and the block arrangement are critical to high-speed operation. Based on an iterative process between layout and post simulation, the layout guidelines for high-speed circuits are derived. They are listed in order of their benefit to high-speed operation.

1) Source Coupling

It is the most important guideline to the proposed differential large-swing digital
