Chapter 4 Deskew CDR: Circuit Implementation
4.4 Building Blocks
4.4.1 Alexander Phase Detector
The Alexander phase detector [29], or APD, was published in Electronics Letters in 1975. Because of the high gain in its input-to-output transfer characteristic, it is also known as a bang-bang PD or a binary-quantized PD. Fig. 4.8(a) shows the conventional APD architecture, and Fig. 4.8(b) shows the timing diagram. The assumption $T_{CQ} = 0$ has been made.
In Fig. 4.8(b) and (c), the input Di is sampled at both the rising and falling edges of the clock. Node a is sampled at the first rising edge, node b at the falling edge, and node c at the last rising edge. Nodes a, b, and c are then all retimed at the rising edges, which ensures synchronous operation of the XOR gates.
Fig. 4.8: Conventional APD (a) schematic (b) data leads (c) data lags
Consider the clock falling edge as the reference edge. Fig. 4.8(b) is the case in which the data edge leads the reference edge, and Fig. 4.8(c) is the case in which it lags. In the timing diagram, the data lead/lag condition can be observed from logic operations on a, b, and c. If $X \cdot \overline{Y}$ is true, the data edge leads the clock falling edge. If $Y \cdot \overline{X}$ is true, the data edge lags behind the clock falling edge. Fig. 4.8 illustrates the full-rate APD, where the clock rate equals the data rate. Because the clock falling edge also samples the data, the APD conceptually applies an oversampling scheme with an oversampling ratio of 2 to recover the NRZ data.
Fig. 4.9: Proposed APD (a) schematic (b) timing diagram
Fig. 4.9(a) shows the proposed quarter-rate APD, where n = 0, 2, 4, 6. It uses the 2.5-GHz 8-phase clocks shown in Fig. 4.9(b). The Lead/Lag information is derived from nodes a, b, and c as previously described. In Fig. 4.8(a), $X \cdot \overline{Y}$ indicates that the data edge leads and $Y \cdot \overline{X}$ indicates that it lags. The proposed implementation instead simply regards X as Lead and Y as Lag to reduce the processing time. This simplification introduces two false events, shown in Table 5. These two events are later corrected by the majority-vote scheme in the confidence counter.
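The effect of the simplification can be illustrated with a small enumeration. The sketch below is not taken from the circuit itself: it assumes that X is the XOR of nodes a and b, that Y is the XOR of nodes b and c, and that the conventional gating is "X AND NOT Y" for Lead and "Y AND NOT X" for Lag. The function names are hypothetical.

```python
from itertools import product

def apd_outputs(a: int, b: int, c: int) -> dict:
    """Return conventional and simplified Lead/Lag flags for samples a, b, c."""
    x = a ^ b  # assumption: X compares node a with node b
    y = b ^ c  # assumption: Y compares node b with node c
    return {
        "lead_conv": x & (1 - y),  # X AND (NOT Y)
        "lag_conv": y & (1 - x),   # Y AND (NOT X)
        "lead_simple": x,          # simplified: X used directly as Lead
        "lag_simple": y,           # simplified: Y used directly as Lag
    }

def false_events() -> list:
    """List the (a, b, c) patterns where the simplified flags differ from the conventional ones."""
    mismatches = []
    for a, b, c in product((0, 1), repeat=3):
        out = apd_outputs(a, b, c)
        simple = (out["lead_simple"], out["lag_simple"])
        conventional = (out["lead_conv"], out["lag_conv"])
        if simple != conventional:
            mismatches.append((a, b, c))
    return mismatches

print(false_events())  # prints [(0, 1, 0), (1, 0, 1)]
```

Under these assumptions, exactly two of the eight sample patterns raise Lead and Lag together, which matches the count of two false events stated above.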
Table 5: The truth table of the proposed APD
4.4.2 Confidence Counter
Fig. 4.10 shows the confidence counter architecture. It consists of 1) a 3-bit unsigned encoder, 2) a comparator, and 3) an accumulator.
Fig. 4.10: Confidence counter architecture
The 3-bit Encoders
The encoders derive the counts of Lead<0:3> and Lag<0:3>. In the circuit implementation, each encoder counts the zeros among the inverted inputs. The result is encoded into the 3-bit unsigned integer O<0:2>, as shown in Table 6.
Table 6: The truth table of the encoder
I3b 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
I2b 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
I1b 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
I0b 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
O2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0
O0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0
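Denote I3b as a, I2b as b, I1b as c, and I0b as d. The output O can be expressed as

$$
\begin{aligned}
O_2 &= \overline{a + b + c + d} \\
O_1 &= (a \cdot \overline{b} + \overline{a} \cdot b)(\overline{c} + \overline{d}) + \overline{a} \cdot \overline{b} \cdot (c + d) + a \cdot b \cdot \overline{c} \cdot \overline{d} \\
O_0 &= (a \oplus b) \oplus (c \oplus d)
\end{aligned}
\tag{4.12}
$$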
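As a cross-check against Table 6, the encoder can be modeled in a few lines. This is only a sketch: the placement of the inversions in O2 and O1 is an assumption chosen so that the output equals the number of zeros among the four inputs, and the function names are hypothetical.

```python
from itertools import product

def encode(a: int, b: int, c: int, d: int) -> tuple:
    """Return (O2, O1, O0) for inverted inputs a=I3b, b=I2b, c=I1b, d=I0b."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    o2 = na & nb & nc & nd                  # NOR of all four inputs
    o1 = ((a & nb) | (na & b)) & (nc | nd)  # exactly one of a, b high; c, d not both high
    o1 |= (na & nb) & (c | d)               # a, b both low; at least one of c, d high
    o1 |= a & b & nc & nd                   # a, b both high; c, d both low
    o0 = a ^ b ^ c ^ d                      # parity of the inputs
    return o2, o1, o0

def check_all() -> bool:
    """Verify that the encoded value equals the number of zero inputs."""
    for bits in product((0, 1), repeat=4):
        o2, o1, o0 = encode(*bits)
        if 4 * o2 + 2 * o1 + o0 != bits.count(0):
            return False
    return True

print(check_all())  # prints True
```

For example, `encode(0, 0, 0, 0)` returns `(1, 0, 0)`, the first column of Table 6.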
The Comparator
Fig. 4.11: The 3-bit binary comparator
Fig. 4.11 shows the 3-bit comparator drawn with a ripple adder. The comparison result is derived from (B - A). The circuit implementation adopts a carry-look-ahead adder instead of the ripple adder, so the long carry chain is avoided.
The look-ahead carries can be expressed as

$$
\begin{aligned}
C_1 &= P_0 \cdot C_{in} + G_0 = P_0 + G_0 \\
C_2 &= P_1 \cdot (P_0 + G_0) + G_1 \\
C_3 &= P_2 \cdot P_1 \cdot (P_0 + G_0) + P_2 \cdot G_1 + G_2
\end{aligned}
\tag{4.13}
$$
where P indicates the propagation of the carry. G indicates the generation of the carry.
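The look-ahead carries can be exercised with a short behavioral model. The sketch below rests on assumptions not spelled out in the text: the subtraction is taken as B plus the inverted A with the carry-in tied to 1, the propagate term is the OR of the two addend bits, and the generate term is their AND. The function names are hypothetical.

```python
def lookahead_carries(a: int, b: int) -> tuple:
    """Return (C1, C2, C3) of B + NOT(A) + 1 for 3-bit unsigned a and b."""
    p, g = [], []
    for i in range(3):
        bi = (b >> i) & 1
        nai = 1 - ((a >> i) & 1)  # inverted bit of A
        p.append(bi | nai)        # propagate (assumed OR form)
        g.append(bi & nai)        # generate
    c1 = p[0] | g[0]              # carry-in is 1, so P0*Cin + G0 = P0 + G0
    c2 = (p[1] & c1) | g[1]
    c3 = (p[2] & p[1] & c1) | (p[2] & g[1]) | g[2]
    return c1, c2, c3

def b_not_less_than_a(a: int, b: int) -> bool:
    """The final carry is 1 exactly when B >= A."""
    return lookahead_carries(a, b)[2] == 1

print(b_not_less_than_a(3, 5), b_not_less_than_a(5, 3))  # prints True False
```

In this model the carry out of the last stage alone decides the comparison, so no sum bits need to be computed.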
The pseudo-NMOS circuits for the carry generation are given in Fig. 4.12.
Fig. 4.12: Circuits of look-ahead carries
The Accumulator
The last stage of the confidence counter is an accumulator. It is an up-down counter, as shown in Fig. 4.13. When either LeadbCOMP or LagbCOMP from the comparator is asserted, the accumulator starts counting. Asserting LeadbCOMP results in upward counting, while asserting LagbCOMP results in downward counting.
Fig. 4.13: The main counter in the accumulator
The critical path in Fig. 4.13 is the T’s logic. It can be optimized by the equation below

$$
T_0 = A + B = \overline{\overline{A} \cdot \overline{B}} \tag{4.14}
$$

Fig. 4.14 shows the T’s logic in the pseudo-NMOS scheme. The processing time is greatly reduced and approximates one gate delay.
Fig. 4.14: T’s logic in pseudo-NMOS
Fig. 4.15 shows the state transition diagram of the accumulator. Upward counting is denoted as U, and downward counting as D. N stands for the case in which both U and D are false. Path (1) generates LagCC, and path (2) generates LeadCC.
Fig. 4.15: The state transition diagram of the accumulator
Table 7 shows the state names and their counting values. If the accumulator overflows, the output LeadCC or LagCC is asserted, and the counter is immediately reset to its initial state ‘ST0’.
Table 7: The state table of the accumulator
State Name ST5 ST4 ST3 ST2 ST1 ST0 ST11 ST12 ST13 ST14 ST15
Counting Value 5 4 3 2 1 0 -1 -2 -3 -4 -5
Implementing the dynamic reset function is tricky. In Fig. 4.13, the main counter employs resettable T flip-flops. The inputs LeadbCOMP/LagbCOMP and the current state are monitored at all times. Once the overflow condition is hit, the default setting (footnote 7) is loaded into the TFFs at the next clock cycle.
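The accumulator behavior can be summarized with a behavioral sketch. It makes several assumptions that the text does not state: overflow is taken to occur one step beyond the largest magnitude listed in Table 7, an upward overflow is mapped to LeadCC and a downward overflow to LagCC, and simultaneous U and D are treated like N. The class and method names are hypothetical.

```python
class ConfidenceAccumulator:
    """Up-down counter that flags LeadCC/LagCC on overflow and resets to ST0."""

    def __init__(self, limit: int = 5):
        self.limit = limit  # largest magnitude listed in Table 7
        self.count = 0      # ST0

    def step(self, up: bool, down: bool) -> str:
        """Advance one clock; return 'LeadCC', 'LagCC', or '' when no flag is raised."""
        if up == down:  # N (both asserted is treated the same way, an assumption)
            return ""
        self.count += 1 if up else -1
        if self.count > self.limit:   # assumed overflow point: one step past +limit
            self.count = 0            # dynamic reset to ST0
            return "LeadCC"
        if self.count < -self.limit:  # assumed overflow point: one step past -limit
            self.count = 0
            return "LagCC"
        return ""

acc = ConfidenceAccumulator()
flags = [acc.step(up=True, down=False) for _ in range(6)]
print(flags)  # prints ['', '', '', '', '', 'LeadCC']
```

A run of mixed U and D inputs moves the count back toward ST0, so only a sustained majority in one direction produces an output flag.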
4.4.3 Finite State Machine
The FSM consists of an up-down counter and combinational logic circuits. In Fig. 4.16, the 5-bit counter is implemented with TFFs, just like the 4-bit counter in the confidence counter block. In contrast to the dynamic reset of the confidence counter, the reset function here is static and serves to initialize the DCDL. It adds a delay of half the tuning range to the 10-Gb/s input data.
Fig. 4.16: The 5-bit counter in FSM
The coarse/fine control codes are developed from the counting value of the main counter. S<2:4> determines the coarse control C<0:7>, and S<0:2> determines the fine control F<0:3>. Table 8 shows the truth tables for the coarse/fine control codes. S<2:4> = [ 1, 1, 1 ] is an illegal state.
7 The default setting can be any counting value in binary format. It is ‘all zero’ in this implementation.
Table 8: The truth tables of FSM (a) coarse tuning function (b) fine tuning function
Table 9: The complete truth tables of FSM
Table 9 shows 28 sets of control codes. In other words, there are 27 phase intervals available for tuning. The initial state S<0:4> = [ 0, 1, 1, 1, 0 ] adds a delay of half the tuning range to the data.
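The counts above can be reproduced by enumerating the counter states. The sketch assumes that S0 is the least significant bit of the counting value, which the text does not state explicitly; the function name is hypothetical.

```python
from itertools import product

def legal_states() -> list:
    """Enumerate counter values whose bits S<2:4> are not all ones."""
    states = []
    for s in product((0, 1), repeat=5):  # s[0] = S0 ... s[4] = S4
        if s[2:5] == (1, 1, 1):          # illegal coarse code
            continue
        # S0 is taken as the least significant bit (assumption)
        states.append(sum(bit << i for i, bit in enumerate(s)))
    return sorted(states)

states = legal_states()
initial = sum(bit << i for i, bit in enumerate((0, 1, 1, 1, 0)))
print(len(states), len(states) - 1, initial)  # prints 28 27 14
```

With this bit ordering the initial state corresponds to the value 14, the midpoint of the 28 legal codes.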
The generation of the coarse control C<0:7> is critical in the FSM. Even though these signals come from the synchronous S<2:4>, the combinational logic is asynchronous and may introduce timing differences among C<0:7>. Consider a timing difference of 1 UI, for example. The 4-to-1 multiplexers in the DCDL fail to select the correct data path, so one bit of the full-rate data is lost.
To minimize the timing difference between C<0:7>, the coarse control adopts the two-stage NOR scheme in pseudo-NMOS. The equations are given as below
The NOR3 and NOR2 logic gates are shown in Fig. 4.17. By adopting (4.15), the timing difference is eliminated. However, two stages are required to finish the logic, so an extra stage delay is introduced into the loop.
Fig. 4.17: NOR gates for coarse control (a) NOR3 (b) NOR2
The fine control F<0:3> determines the interpolation. It is relatively robust, because 1) the summation of weighted proportions is inherently linear, and 2) the interpolation is ensured by the thermometer coding style. The equations of F<0:3> are given as

$$
\begin{aligned}
F_2 &= S_2 \oplus (S_1 \cdot S_0) \\
F_1 &= S_2 \oplus S_1 \\
F_0 &= S_2 \oplus (S_1 \cdot S_0)
\end{aligned}
$$
Fig. 4.18 shows the circuit implementation of $Y = A \oplus (B \cdot C)$.
Fig. 4.18: The circuit of $Y = A \oplus (B \cdot C)$ for the fine control
4.4.4 Logic Components
This section shows the circuits of some basic logic components which build up the system. In the pseudo-NMOS scheme, the process time of each functional block is one-gate delay.
XOR/XNOR
Fig. 4.19: (a) XOR (b) XNOR
D Flip-Flop
Fig. 4.20: Schematics of (a) a DFF, (b) master & slave latches, (c) a latch, (d) the source-coupling latch.
T Flip-Flop
Fig. 4.21: Schematics of (a) a resettable TFF, (b) master & slave stages of the TFF, (c) the circuit of MUX & latch. (Figure annotation: Gnd: reset to 0; Vdd: reset to 1.)
Chapter 5
A 10-Gb/s Transceiver Test Chip
5.1 Architecture
The test chip realizes a 10-Gb/s transceiver, which consists of a CDR and a transmitter. Fig. 5.1 shows the full-chip architecture. The core circuit is the proposed Deskew CDR. It receives the full-rate data from the tester and recovers it into four quarter-rate data streams D<0:3>, which are synchronized to the global clock P<0>. The parallel data then go to the transmitter, which is composed of the 4-to-1 serializer and the full-rate output buffer, and the serialized full-rate data return to the tester.
Because this chip has no on-chip channels, the channel-induced jitter and the degradation of signal amplitude are modeled by the CDR’s input from the tester. The tester provides functions such as jitter injection, data driving range, and crossing-point adjustment.
This chip offers high measurement flexibility, based on two facts. First, the full-rate differential data and clock come from the tester synchronously. Second, the on-chip quarter-rate clocks are derived from the full-rate clock. The performance of the chip can thus be fully measured. For example, the original target data rate is 10 Gb/s, but the chip can also be measured at 12 Gb/s for the up-grade case or at 8 Gb/s for the down-grade case.
Fig. 5.1: The full-chip architecture
5.2 Design for Test
Besides the controlled data input and the measurement flexibility described above, this section describes further test considerations, such as phase resolution measurement and pads/buffers sharing, as well as the circuit operation in different modes.
5.2.1 Phase Resolution Measurement
An additional DCDL block, denoted as DCDL_T, is introduced to the chip. There are two main measurements to be made. The first is to determine the phase resolution of the DCDL. The second is to see how the DCDL-induced jitter affects the output eye diagram. From the eye diagrams at different tapping locations, the jitter induced from stage to stage is observed. The maximum number of stages can then be analyzed and determined from the measurement results.
5.2.2 Output Buffer Driving Capability
The driving capability of the full-rate output buffer is controllable through the variable amount of pre-emphasis. The delayed negative-feedback driving strength of the tri-state buffers constitutes the pre-emphasis mechanism. A large capacitive load requires more strength, at the cost of eye closure along the amplitude axis. A small capacitive load can be handled with less strength. A capacitive load of 1 pF for each probe pad is considered the typical case in this work.
The negative-feedback strength depends on how many tri-state buffers are activated, and is therefore controllable.
5.2.3 Pads/Buffers Sharing
The control of DCDL_T and the driving strength of the output buffer each require 5 bits in binary format. To save area, these 10 bits share the 5 control pads, denoted as Ct<0:4>.
Two sets of 5-bit registers are used for the pad sharing. They operate in the low-frequency domain. The low-frequency clock is the global 2.5-GHz clock divided by 512, so as to eliminate disturbance to the internal circuits.
The 5-bit data Ct<0:4> is loaded continuously into either Register A or Register B. The bypass signal determines which one is loaded.
The 10-Gb/s output buffer is the only output path for 10-Gb/s data. Two inputs, multiplexed by the bypass signal, share the buffer. A 2.5-Gb/s output buffer is likewise shared through a multiplexer, which is controlled by the signal Ct<0>.
5.2.4 Bypass-Mode Operation
The first thing to test is the bypass-mode operation. Fig. 5.2 shows the active blocks and the signal flow in bypass mode. The full-rate data passes through DCDL_T and the 2-to-1 full-rate multiplexer, and is finally driven out by the buffer. The bypass mode verifies that the full-rate data path behaves well.
Fig. 5.2: Bypass-mode operation
Fig. 5.3: Nominal operation
The phase resolution of DCDL_T is measured in bypass mode. Two settings must be configured before the measurement. First, configure the output buffer by setting registers B at bypass = 0. Then set bypass = 1 to enter bypass mode, and configure registers A for DCDL_T. The phase resolution of each step can be measured by stepping the setting of registers A.
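The configuration sequence can be sketched as a small behavioral model of the shared pads. The mapping used here (bypass = 0 loads Register B, bypass = 1 loads Register A) is inferred from the procedure above, and the model loads a register only on an explicit call rather than continuously. The class and function names are hypothetical.

```python
class PadSharing:
    """Two 5-bit registers loaded from the shared pads Ct<0:4>."""

    def __init__(self):
        self.reg_a = 0   # DCDL_T setting
        self.reg_b = 0   # output-buffer driving strength
        self.bypass = 0

    def load(self, ct: int) -> None:
        """Latch the 5-bit pad value into the register selected by bypass."""
        if not 0 <= ct < 32:
            raise ValueError("Ct<0:4> is a 5-bit value")
        if self.bypass:
            self.reg_a = ct  # assumed: bypass = 1 routes the pads to Register A
        else:
            self.reg_b = ct  # assumed: bypass = 0 routes the pads to Register B

def sweep_dcdl_t(buffer_strength: int, codes: range) -> list:
    """Set the output buffer first, then step DCDL_T in bypass mode."""
    chip = PadSharing()
    chip.bypass = 0
    chip.load(buffer_strength)  # step 1: configure registers B
    chip.bypass = 1             # step 2: enter bypass mode
    visited = []
    for code in codes:          # step 3: step registers A
        chip.load(code)
        visited.append((chip.reg_a, chip.reg_b))
    return visited

print(sweep_dcdl_t(9, range(3)))  # prints [(0, 9), (1, 9), (2, 9)]
```

The buffer setting stays fixed while the DCDL_T code is stepped, so each measured phase step is attributable to DCDL_T alone.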
5.2.5 Nominal Operation
The chip is in nominal mode when the bypass signal is false. Fig. 5.3 shows the corresponding active blocks and the signal flow. The full-rate data is recovered and de-serialized in the Deskew CDR, and is serialized again in the 4-to-1 serializer. A 2-to-1 multiplexer selects the full-rate data at bypass = 0. Finally, the output buffer drives the full-rate data back to the tester. The driving capability of the output buffer is configured by registers B.
5.2.6 Debug-Mode Operation
Fig. 5.4: Debug-mode operation
Tracking and locking behaviors of the CDR can be observed from the internal LeadCC/LagCC signals. Fig. 5.4 shows that two 2.5-Gb/s buffers drive these two signals to an external oscilloscope, and that one of the four parallel data streams from the CDR can be multiplexed to the buffer.
The debug mode is not controlled by any signal, and thus it can accompany the nominal mode. Since the two 2.5-Gb/s output buffers use their own pair of Vdd/Gnd, the supplies can be removed at any time. In that case, the chip is naturally out of debug mode.
5.3 Building Blocks
This section describes three blocks that operate at 10 GHz or 10 Gb/s: a multi-phase clock generator, a serializer, and the output buffer.
5.3.1 Multi-Phase Clock Generator
Fig. 5.5: A 3-stage Johnson counter (a) architecture (b) timing diagram
Fig. 5.5(a) shows a 3-stage Johnson counter. The phase difference between Q0 and Q1 is one Ck period. In the timing diagram in Fig. 5.5(b), the logic ‘1’ propagates from Q0, through Q1, and then to Q2. Q0 falls to zero at the 4th clock edge because the first DFF’s input-bar receives the high Q2. The logic ‘0’ then propagates in the same way as the logic ‘1’. The propagation of logic ‘1’ takes 3 Ck periods and that of logic ‘0’ takes 3 Ck periods, so the data rate of Q is the clock rate divided by 6 for the 3-stage Johnson counter.
Another way to explain the data rate of Q is the data flow in the loop. Q0 propagates to Q1, to Q2, …, to Q5, and finally back to Q0. It takes 6 Ck periods to complete the loop. The data rate of Q0 is therefore the clock rate divided by 6.
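The division ratio can be checked with a behavioral sketch of an N-stage Johnson counter. The model is idealized (all DFFs update on the same clock edge, starting from an all-zero state), and the function names are hypothetical.

```python
def johnson_counter(stages: int, clocks: int) -> list:
    """Return the register state (Q0, Q1, ...) after each clock edge, starting from all zeros."""
    q = [0] * stages
    history = []
    for _ in range(clocks):
        q = [1 - q[-1]] + q[:-1]  # first DFF takes the inverted last output
        history.append(tuple(q))
    return history

def period(stages: int) -> int:
    """Number of clock edges before the state sequence repeats."""
    history = johnson_counter(stages, 4 * stages)
    return history.index(history[0], 1)

print(johnson_counter(3, 4))  # prints [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
print(period(3), period(2))   # prints 6 4
```

In this model Q0 falls at the 4th clock edge, and the state sequence repeats every 2N edges, which gives divide-by-6 for three stages and divide-by-4 for two stages.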
Fig. 5.6: The multi-phase clock generator
The proposed clock divider consists of the 4 latches shown in Fig. 5.6. It is a 2-stage Johnson counter, since the four latches compose two DFFs. The input clock rate is 10 GHz, and the output clock rate is the input rate divided by 4, following the Johnson-counter concept. To achieve high-speed operation, limiters that restrict the swing are introduced to the differential signaling.
Even-phase clocks are the outputs of the DFFs, and the phase difference between adjacent even-phase clocks is one 100-ps clock period. The odd-phase clocks are tapped from the master latches of the DFFs.
Because the latch circuit is differential and fully symmetric, as illustrated previously in Fig. 4.20(d), the phase differences among P<0:7> should be equal. However, the multi-phase outputs of the latch circuits still suffer from an unbalanced duty cycle due to the mismatch between the PMOS pulling strength and the NMOS sinking strength. To balance the duty cycle, one can adjust the common-mode level of the 10-GHz Ck.
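The nominal phase spacing follows from simple arithmetic, sketched below. It assumes ideal, uniformly spaced phases (no duty-cycle or mismatch effects), and the function name and parameters are hypothetical.

```python
def phase_offsets_ps(input_clock_ghz: float = 10.0, divide_by: int = 4, phases: int = 8) -> list:
    """Return the nominal offsets of P<0>..P<7> in picoseconds."""
    output_period_ps = 1000.0 * divide_by / input_clock_ghz  # 400 ps at 2.5 GHz
    step_ps = output_period_ps / phases                      # uniform spacing (assumption)
    return [k * step_ps for k in range(phases)]

offsets = phase_offsets_ps()
print(offsets)  # prints [0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
```

Adjacent phases are 50 ps apart, so adjacent even phases are 100 ps apart, which is one period of the 10-GHz input clock.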
5.3.2 Serializer
Fig. 5.7 shows the 4-to-1 serializer circuit. It consists of three 2-to-1 multiplexers cascaded in two stages. The multiplexer at the output stage uses a 5-GHz clock and generates the full-rate serialized data. The multiplexers at the first stage use two quadrature-phase clocks and operate in the 2.5-GHz domain.
All input data are retimed with a buffered clock, which is aligned to the global clock P<0>. A DFF implements the 1-cycle delay block, while a DFF with an extra latch implements the 1.5-cycle delay block. The extra latch retimes the data to the clock falling edge, which yields the 1.5-cycle delay.
The conventional approach is to divide the fastest clock by 2 first and then generate the lower-rate clocks sequentially. In this work, all the serializer clocks are instead developed from the global clocks P<0:7>.
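The two-stage multiplexing can be illustrated with a purely logical sketch. It ignores the retiming delays and clock phases, and it assumes that the first stage pairs D0 with D2 and D1 with D3 so that the output order is D0, D1, D2, D3. That pairing and the function names are hypothetical.

```python
def mux2(select: int, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: in0 when select is 0, in1 when select is 1."""
    return in1 if select else in0

def serialize(words: list) -> list:
    """Serialize quarter-rate words (D0, D1, D2, D3) into a full-rate bit stream."""
    stream = []
    for d0, d1, d2, d3 in words:
        for slot in range(4):              # four full-rate bit slots per word
            slow_sel = slot // 2           # first-stage select, changes every two slots
            fast_sel = slot % 2            # output-stage select, changes every slot
            even = mux2(slow_sel, d0, d2)  # first-stage mux (assumed pairing)
            odd = mux2(slow_sel, d1, d3)   # first-stage mux (assumed pairing)
            stream.append(mux2(fast_sel, even, odd))
    return stream

print(serialize([(1, 0, 1, 1), (0, 0, 1, 0)]))  # prints [1, 0, 1, 1, 0, 0, 1, 0]
```

The first-stage selects change at half the rate of the output-stage select, which mirrors the 2.5-GHz and 5-GHz clock domains described above.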
Fig. 5.7: The 10-Gb/s 4-to-1 Serializer
5.3.3 Output Buffer
Fig. 5.8(a) shows the digital-circuit implementation of the 10-Gb/s output buffer. It adopts the conventional one-bit delay scheme. The schematic of the ‘A cell’ is the same as the delay cell in Fig. 4.3(a). The A cells are cascaded as tapered buffers and drive the ‘B cell’, a static CMOS inverter, at the final output stage.
At the bottom of the figure, the delayed forward loop implements the FIR filter. The output signal of the forward loop is delayed and has negative polarity relative to the main output of the B cell.
The ‘Tri’ block is composed of the binary-controlled tri-state buffers shown in Fig. 5.8(b), where the ‘c cell’ is half the size of the capital ‘C cell’. The amount of pre-emphasis is controlled by Ct<0:4>. The more tri-state buffers are turned on, the larger the overshoot introduced to the output signal in the time domain.
In addition, two 50-ohm poly resistors (not shown) are connected in series between the differential output nodes for impedance matching.
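The one-bit-delay pre-emphasis can be viewed as a two-tap FIR filter, sketched below at the signal level. Several values are assumptions made for illustration only: the weights assigned to Ct<0:4> (Ct0 driving the half-size cell and Ct1 to Ct4 driving cells with m = 1, 2, 4, 8), the normalization by a hypothetical `main_strength`, and the ideal +1/-1 signal levels.

```python
def tri_state_strength(ct: list) -> float:
    """Relative strength of the enabled tri-state buffers for Ct<0:4>."""
    weights = [0.5, 1, 2, 4, 8]  # assumed weights for Ct0..Ct4
    return sum(w for w, bit in zip(weights, ct) if bit)

def pre_emphasis(bits: list, ct: list, main_strength: float = 32.0) -> list:
    """Two-tap FIR: main output minus a one-bit-delayed, scaled copy."""
    alpha = tri_state_strength(ct) / main_strength  # hypothetical normalization
    levels = [1.0 if b else -1.0 for b in bits]
    out = []
    previous = levels[0]  # assume the line idled at the first level
    for level in levels:
        out.append(level - alpha * previous)  # delayed tap has negative polarity
        previous = level
    return out

print(pre_emphasis([0, 0, 1, 1], ct=[0, 0, 0, 0, 1], main_strength=16.0))
# prints [-0.5, -0.5, 1.5, 0.5]
```

In this sketch a transition bit is boosted to 1 + alpha while a repeated bit settles to 1 - alpha, which illustrates both the overshoot and the reduced steady-state amplitude that accompany stronger pre-emphasis.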
Fig. 5.8: The 10-Gb/s output buffer (a) architecture (b) binary-control tri-state buffers
Chapter 6
Layout & Post Simulations
6.1 Layout
6.1.1 Layout Guidelines for High-Speed Circuits
The layout plan and the block arrangement are critical to the high-speed operation. Based on the iterative process between the layout and the post simulation, the layout guidelines for high-speed circuits are derived. They are listed in the order of benefits to the high-speed operation.
1) Source Coupling
It is the most important guideline to the proposed differential large-swing digital