Chapter 1 Introduction
1.3 Organization
This dissertation comprises seven chapters. The motivation and features of the CDR are described in this chapter.
Chapter 2 gives an overview of the world’s CDR architectures, which extend far beyond the PLL-based and oversampling architectures. It describes the operating principles and issues of the various architectures and explores their possibilities.
Chapter 3 analyzes the system-level behavior of the CDR, covering system specifications, design parameters, and simulations. Discussions and simulations of various cases, such as noise profiles, frequency tolerance, loop latency, and frequency error, can be found in this chapter.
Chapter 4 depicts the circuit-level implementation. It describes the high-speed large-swing DCDL design and how the loop-latency constraint is met. To minimize the loop latency, the pseudo-NMOS scheme is adopted; examples can be found in the carry-look-ahead adder of the comparator and in the TFF up-down counters of the confidence counter and the FSM.
Chapter 5 shows the digital implementation of the 10-Gb/s transceiver. It describes the considerations for design and test, as well as different operation modes.
The CDR is verified in nominal mode and/or debug mode, while the phase resolution of the DCDL is measured in bypass mode. The multi-phase clock generator, the serializer, and the 10-Gb/s output buffer are also discussed in this chapter.
In the high-speed domain, the circuit layout is critical and dominates the final eye opening. Chapter 6 shows the chip layout and grid design, and describes the layout guidelines for high-speed circuits. The guideline of source coupling is especially emphasized. Post-layout simulations of the DCDL, the CDR, and the full-chip transceiver are given in this chapter.
The final chapter, Chapter 7, shows the specification table of the Deskew CDR and compares power and area among several CDR systems.
Chapter 2
Overview of the World’s CDR Architectures
Different applications require different CDR systems. The types of CDR systems reflect the modes of the recovery process: continuous vs. burst, closed loop vs. open loop, filter-based vs. oversampling, clock delay vs. data delay, digital vs. analog, and so on.
Conventional CDR systems use PLL-based and oversampling architectures, which are well explored and have their own traditions. But there are more candidates for timing-recovery applications. Following the classification in [1]¹, we categorize CDR systems into 1) PLL-Based, 2) Blind Oversampling, 3) DLL-Based, 4) Gated VCO, and 5) Alternative & Hybrid architectures.
Some of the systems presented in this chapter are historical, while others are state of the art. The chapter mainly focuses on the overall landscape, the variety, and the possibilities of CDR systems.
¹ A presentation by Çobanoğlu, published on the internet in 2006, that introduces the world’s CDR systems. The original classification is 1) PLL-Based, 2) Delay-the-Data, 3) Gated VCO, 4) (Semi-)Blind Oversampling, and 5) FSM-Based.
2.1 PLL-Based CDR
PLL-based CDR systems are suitable for continuous mode operation. They can be characterized as single loop [3]-[5] and dual loop [6]-[9]. In general, a PLL-based CDR system refers to an Nth-order system, where N≥2. It is usually implemented with analog circuits due to the inherently continuous characteristics.
[Block diagram: Di, Retime, Do, P/F Detector, Charge Pump, LPF, VCO]
Fig. 2.1: A generic PLL-based CDR with a charge pump
Fig. 2.1 shows a generic PLL-based CDR architecture. It consists of phase and/or frequency detectors, a charge pump, a loop filter, and a VCO. For the dual-loop architecture, the entire recovery process includes the slow pull-in process by frequency detectors and then the lock-in process by phase detectors in sequence.
Back in 1985, a PLL-based system [3] was proposed for clock and data extraction from NRZ data. It employs an active SAW filter in the loop for band-pass filtering instead of a charge pump with a passive filter.
After the charge pump became popular, a second-order low-pass filter with a ‘C // R-C’ structure also came into common use [5], [8]. The second-order filter yields a third-order system, so that a phase step, a frequency step, and an accelerating frequency variation can all be tracked.
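As a brief check of the order count, assume the ‘C // R-C’ filter is a series branch R-C_1 in parallel with a capacitor C_2 (the component names R, C_1, and C_2 are assumptions introduced here for illustration). Its impedance is then:

```latex
Z(s) = \left(R + \frac{1}{sC_1}\right) \parallel \frac{1}{sC_2}
     = \frac{1 + sRC_1}
            {s\,(C_1 + C_2)\left(1 + sR\,\dfrac{C_1 C_2}{C_1 + C_2}\right)}
```

Under this assumption the filter contributes a pole at the origin, a second pole at s = -(C_1 + C_2)/(R C_1 C_2), and a zero at s = -1/(R C_1). The VCO adds one more integration, so the open-loop response has three poles, which is what makes the overall loop third order.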
A high-order PLL-based system is well known for its high performance. However, it does not suit burst-mode applications because of 1) the slow pull-in process and 2) clock drift when no input is present.
Besides, a jitter peaking phenomenon [2] exists in the high-order system. As the simplest case, consider a second-order system, whose closed-loop transfer function is expressed in (2.1).
The approximation of (2.1) is made by assuming that the damping factor ζ is large (such as 10) and that ω_n² is small enough to be neglected. The approximated loop bandwidth is then derived in (2.2), and the corresponding zero and poles are given in (2.3)-(2.5).
From (2.3) and (2.4), the first pole lies beyond the zero in absolute value, so the zero takes effect at a lower frequency than the first pole. Jitter peaking therefore appears in the closed-loop transfer plot, as shown in Fig. 2.2.
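For reference, the textbook form of a second-order charge-pump loop is sketched below. It is an assumed form, consistent with the description of the transfer function, bandwidth, zero, and poles above but not necessarily identical to the original expressions term by term; ω_n denotes the natural frequency and ζ the damping factor.

```latex
H(s) = \frac{2\zeta\omega_n s + \omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2}
     \approx \frac{2\zeta\omega_n}{s + 2\zeta\omega_n},
\qquad
\omega_{-3\,\mathrm{dB}} \approx 2\zeta\omega_n

\omega_z = -\frac{\omega_n}{2\zeta},
\qquad
\omega_{p1} = -\omega_n\left(\zeta - \sqrt{\zeta^2 - 1}\right)
            \approx -\frac{\omega_n}{2\zeta}\left(1 + \frac{1}{4\zeta^2}\right),
\qquad
\omega_{p2} = -\omega_n\left(\zeta + \sqrt{\zeta^2 - 1}\right)
            \approx -2\zeta\omega_n
```

In this form the first pole is slightly larger in magnitude than the zero, so the magnitude response rises slightly above unity between them before the pole brings it back down.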
Fig. 2.2: Jitter peaking phenomenon
The jitter peaking J_P is quantified in (2.7).
The amount of jitter peaking in (2.7) can be suppressed by over-damping the loop, that is, by applying a large ζ. But this results in a slow lock-acquisition response.
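The trade-off can be illustrated numerically. The sketch below assumes the textbook second-order transfer function H(s) = (2ζω_n s + ω_n²) / (s² + 2ζω_n s + ω_n²), which may differ from the exact expression used in this chapter, and evaluates the peak of |H(jω)| in closed form. The function names are hypothetical.

```python
import math


def jitter_peaking(zeta):
    """Peak of |H(jw)| for the assumed second-order transfer function.

    With x = (w / wn)^2:
        |H|^2 = (1 + 4*zeta^2*x) / ((1 - x)^2 + 4*zeta^2*x)
    Setting d|H|^2/dx = 0 gives 4*zeta^2*x^2 + 2*x - 2 = 0.
    """
    if zeta <= 0:
        raise ValueError("zeta must be positive")
    # Positive root of the quadratic above: location of the peak.
    x = (-1.0 + math.sqrt(1.0 + 8.0 * zeta ** 2)) / (4.0 * zeta ** 2)
    mag_sq = (1.0 + 4.0 * zeta ** 2 * x) / ((1.0 - x) ** 2 + 4.0 * zeta ** 2 * x)
    return math.sqrt(mag_sq)


def jitter_peaking_db(zeta):
    """Same peak expressed in decibels."""
    return 20.0 * math.log10(jitter_peaking(zeta))


if __name__ == "__main__":
    for zeta in (1.0, 2.0, 5.0, 10.0):
        approx = 1.0 + 1.0 / (4.0 * zeta ** 2)  # large-zeta approximation
        print(f"zeta={zeta:5.1f}  peak={jitter_peaking(zeta):.5f}  "
              f"approx={approx:.5f}  ({jitter_peaking_db(zeta):.4f} dB)")
```

In this model the peak shrinks roughly as 1 + 1/(4ζ²) for large ζ, so heavier damping reduces the peaking but never removes it completely.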
Example: Savoj2001 [4]
[Block diagram: 10-Gb/s Di, Half-Rate PD, Charge Pump, LPF, 5-GHz VCO, SER, 10-Gb/s Do]
Fig. 2.3: The PLL-based CDR, Savoj2001
Example: Savoj2003 [6]
[Block diagram: 10-Gb/s Di, Half-Rate FD, Half-Rate PD, two V/I Converters, Loop Filter, VCO (phases 0, 45, 90, 135), Retimed 10-Gb/s Do]
Fig. 2.4: The PLL-based CDR, Savoj2003
2.2 Blind Oversampling CDR
A blind oversampling architecture, shown in Fig. 2.5, is implemented with digital circuits and can handle both continuous and burst-mode timing recovery. It oversamples the data and chooses the optimal clock phase according to the edge information extracted in the decision circuit. The decision scheme can be either majority voting [10] or center picking [11]; the former is inferior [12].
[Block diagram: Di, Multi-phase Clock Generator, Parallel Samplers, Sample Storage, MUX, Decision Circuit, Do]
Fig. 2.5: A generic oversampling CDR, Kim&Jeong2003 [13]
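The phase-selection idea can be sketched in a few lines. The code below is an illustrative model only, not the decision circuit of [10], [11], or [13]: it histograms the positions of detected edges modulo the oversampling ratio and picks the sample phase roughly half a bit away from the most frequent edge position. The function names and the tie-breaking rule are assumptions.

```python
from collections import Counter


def pick_sampling_phase(samples, osr):
    """Return the sample phase (0..osr-1) about half a bit after the dominant edge."""
    edges = Counter()
    for i in range(1, len(samples)):
        if samples[i] != samples[i - 1]:
            edges[i % osr] += 1  # a transition occurred just before sample i
    if not edges:
        return osr // 2  # no edge information: fall back to a fixed phase
    # Most frequent edge position; ties resolve to the lowest index.
    edge_phase = max(range(osr), key=lambda p: edges[p])
    return (edge_phase + osr // 2) % osr


def recover_bits(samples, osr):
    """Keep one sample per bit at the selected phase."""
    phase = pick_sampling_phase(samples, osr)
    return [samples[i] for i in range(phase, len(samples), osr)]


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    # 3x oversampling with an unknown phase offset (first sample dropped).
    samples = [b for b in bits for _ in range(3)][1:]
    print(recover_bits(samples, 3))
```

Because the model only selects among fixed sample positions and never moves the clock or the data, a residual offset between the chosen sample and the true bit center remains.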
A blind oversampling CDR tracks the high-frequency jitter of the input data stream well, while the limited storage size restricts its tracking of low-frequency jitter.
Unlike most CDR systems, this architecture needs no acquisition time, but it requires extra hardware to execute the algorithm and adds processing latency to the data recovery.
The phase-picking scheme carries a static offset error on each sample, because neither the data nor the clock phases are adjusted. The maximum offset error is (0.5 UI / OSR), where OSR denotes the oversampling ratio. Although this offset error can be suppressed by raising the oversampling ratio, in practice this raises two issues: 1) A high OSR implies a fine and accurate phase resolution for each sample, which is always a challenge. 2) The input capacitance of the phase detectors grows with OSR, which is especially critical in high-speed applications. Conventionally, 3× oversampling is widely used.
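As a quick numerical illustration of the (0.5 UI / OSR) bound, the sketch below converts it to picoseconds for a given bit rate. The function names and the example bit rate are assumptions chosen for illustration.

```python
def max_static_offset_ui(osr):
    """Maximum static sampling offset in unit intervals: 0.5 UI / OSR."""
    if osr < 1:
        raise ValueError("oversampling ratio must be at least 1")
    return 0.5 / osr


def max_static_offset_ps(osr, bit_rate_gbps):
    """Same bound in picoseconds; one UI is 1000 / bit_rate_gbps ps."""
    ui_ps = 1000.0 / bit_rate_gbps
    return max_static_offset_ui(osr) * ui_ps


if __name__ == "__main__":
    for osr in (3, 4, 5, 8):
        print(f"OSR={osr}: {max_static_offset_ui(osr):.4f} UI, "
              f"{max_static_offset_ps(osr, 10.0):.2f} ps at 10 Gb/s")
```

Going from 3× to 8× oversampling reduces the bound from about 0.167 UI to 0.0625 UI, at the cost of more than twice as many sampling phases and samplers.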
Example: C.K.Yang98 [14]
[Block diagram: 4-Gb/s Di, 1:8 DEMUX, Samplers ×24, Multi-phase Clock, Delay, Decision Circuit, MUX, Bit shifter FIFO, Over/Under-flow Controls, 512-Mb/s Do<0:7>]
Fig. 2.6: The blind oversampling CDR, C.K.Yang98
In Fig. 2.6, the sample storage is denoted as a delay block, and the decision circuit controls the multiplexer as well as the FIFO at the last stage. The FIFO is implemented with an 8-bit shifter. It handles both overflow and underflow, which occur when the phase error, mainly caused by the frequency error, accumulates to more than one bit time.
2.3 DLL-Based CDR
A DLL-based CDR can be regarded as a simplified version of the PLL-based architecture. It is a closed-loop first-order system without the jitter peaking phenomenon.
In this system, only the phase delay is a variable. Implementations of DLL-based CDRs can be either analog or digital, with the latter being the major recent trend.
According to the subject of delay adjustment, DLL-based CDRs can be divided into 1) clock-interpolation and 2) data-deskew architectures. The clock-interpolation architecture can handle continuous timing recovery by the phase-rotation scheme, but phase rotation needs additional hardware, such as the FIFO stage of the oversampling architecture in Fig. 2.6, to handle the overflow/underflow condition.
The data-deskew architecture follows the straightforward concept of adjusting the data instead of the clock. Synchronization is simple because the global clock is shared and left untouched. But the architecture is mainly limited by the data tuning range and is therefore only suitable for burst-mode applications.
2.3.1 Clock-Interpolation CDR
Fig. 2.7 shows an example of a clock-interpolation CDR by E. Lee. The 8 clock phases are adjusted by interpolation under the control of the phase controller, so that the sampling clock phases finally align to the midpoint of the data bit period.
The receive-amplifiers block consists of amplifiers and phase detectors. Here the full-rate data is inherently de-multiplexed into four quarter-rate data streams. Although digital circuits implement the logic function in the phase controller block, the CDR implementation as a whole also uses analog circuits.
Fig. 2.8 shows a clock-interpolation CDR for multi-channel timing recovery by Kreienkamp. It adopts analog circuitry to achieve high speed and fine phase resolution. A differential charge pump and two capacitors contribute the single pole of the system. The phase interpolator uses the conventional analog current-steering scheme, and, just as in PLL-based CDR systems, the phase resolution is limited by the discrete steps introduced by the charge pump. The chip is fabricated in 0.11-µm CMOS technology, and its power consumption is 220 mW at a 1.5-V supply.
However, the phase rotation of these CDR macro-cells, which continuous recovery requires, is not described.
Example: E.Lee2001 [15]
[Block diagram: 500-MHz 8-phase DLL, Phase Controller]
Fig. 2.7: The clock-interpolation CDR, E.Lee2001
Example: Kreienkamp2005 [16]
[Block diagrams: (a) the CDR, with Phase and Clock signals; (b) Shared PLL, CDR macro-cells, Data Input, Recovered Data, Multi-channel Recovery]
Fig. 2.8: The clock-interpolation CDR, Kreienkamp2005 (a) the CDR, (b) the multi-channel configuration
2.3.2 Data-Deskew CDR
Fig. 2.9(a) shows the 10-Gb/s data-deskew CDR for multi-channel burst-mode applications proposed by Wong. It is a full-rate analog implementation, fabricated in both AlGaAs/GaAs and InGaP/GaAs HBT technologies, where f_t ~ 50 GHz, f_max ~ 60 GHz, and β ~ 40. The voltage-controlled delay line, phase detector, and loop filter compose the delay-locked loop. In addition, an edge detector circuit adjusts the time constant of the loop filter. Fig. 2.9(b) shows the edge detector circuit. The detector’s output is generated from the transition edge of the input and its asynchronously delayed copy.
The achieved tuning range is 2 UI, or 200 ps. The design is claimed to handle a 12.5-kbit data packet, but only under the assumption that the frequency error of all clocks is within 20 ppm. This 20-ppm error is far smaller than the conventional estimate of 200 ppm.
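The link between packet length, frequency error, and tuning range can be checked with simple arithmetic: a fractional frequency error of e accumulates a phase drift of N × e UI over N bits. The sketch below is a simplified model that assumes a constant frequency error and that the whole tuning range is available to absorb drift in one direction; the function names are hypothetical.

```python
def accumulated_drift_ui(packet_bits, freq_error_ppm):
    """Phase drift in UI after packet_bits bits at a constant frequency error."""
    return packet_bits * freq_error_ppm * 1e-6


def max_packet_bits(tuning_range_ui, freq_error_ppm):
    """Longest packet whose drift still fits inside the tuning range."""
    if freq_error_ppm <= 0:
        raise ValueError("frequency error must be positive")
    return tuning_range_ui / (freq_error_ppm * 1e-6)


if __name__ == "__main__":
    for ppm in (20, 200):
        drift = accumulated_drift_ui(12_500, ppm)
        print(f"{ppm:3d} ppm over 12.5 kbit: {drift:.2f} UI of drift, "
              f"max packet for a 2-UI range = {max_packet_bits(2.0, ppm):.0f} bits")
```

Under these assumptions a 12.5-kbit packet drifts 0.25 UI at 20 ppm, which fits within a 2-UI range, but 2.5 UI at 200 ppm, which does not.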
Fig. 2.10 shows a digital implementation of a data-deskew CDR by Lu. A confidence counter replaces the conventional loop filter, and cascaded delay cells compose the DCDL block. Coarse and fine tuning are available: coarse tuning is implemented by the on/off state of tri-state buffers in the chain, and fine tuning by the amount of added capacitive load.
It is fabricated in 0.18-µm CMOS technology, and the achieved tuning range is 1 UI, or 400 ps, for 2.5-Gb/s operation. Because of this insufficient tuning range, the implementation cannot handle any frequency error.
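The role of a confidence counter as a digital loop filter can be sketched behaviorally. The model below is a generic illustration, not the circuit of [18]: it accumulates lead/lag decisions from the phase detector and issues an Up or Dn command only when the running count reaches a threshold. The threshold value and the mapping of lead to Up are assumptions.

```python
class ConfidenceCounter:
    """Behavioral up/down counter that filters phase-detector decisions."""

    def __init__(self, threshold=4):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold  # hypothetical value, not taken from [18]
        self.count = 0

    def update(self, lead, lag):
        """Return +1 (Up), -1 (Dn), or 0 (hold) for one phase-detector result."""
        self.count += int(bool(lead)) - int(bool(lag))
        if self.count >= self.threshold:
            self.count = 0
            return 1
        if self.count <= -self.threshold:
            self.count = 0
            return -1
        return 0


if __name__ == "__main__":
    counter = ConfidenceCounter(threshold=3)
    decisions = [(1, 0), (1, 0), (0, 1), (1, 0), (1, 0)]
    print([counter.update(lead, lag) for lead, lag in decisions])
```

Isolated opposite decisions cancel inside the counter, so the delay line is only stepped when the phase detector shows a consistent trend.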
Example: Wong96 [17]
[Block diagrams: (a) 10-Gb/s Di, Voltage Controlled Delay Line, Phase Detector, Loop Filter, Edge Detector, Data Retime, 10-GHz Clock, 10-Gb/s Do; (b) Envelope Detector, In, Out]
Fig. 2.9: The data-deskew CDR, Wong96 (a) Architecture (b) Edge Detector
Example: Lu2005 [18]
[Block diagram: 2.5-Gb/s Di, Digitally Controlled Delay Line, Phase Detector (Lead/Lag), Confidence Counter (Up/Dn), Delay Control FSM, 5-GHz Clock, 2.5-Gb/s Do]
Fig. 2.10: The data-deskew CDR, Lu2005
2.4 Gated-VCO CDR
Example: Nakamura96 [19]
[Block diagrams: (a) Di, CDR Core (Gating Circuit 1, G-VCO1, Decision 1), Gating Circuit 2, G-VCO2, Decision 2, Burst PLL (PFD, CP & LPF), Vctrl, Reset, Recovered Do, Recovered Ck; (b) Half-bit Delay, In, Out]
Fig. 2.11: Gated-VCO CDR, Nakamura96 (a) architecture (b) gating circuit
A gated-VCO CDR system was first introduced by Nakamura in 1996. It responds quickly to asynchronous burst input data. In Fig. 2.11(a), the CDR core consists of a gating circuit, a gated VCO, and a DFF at the final stage for retiming the data. This DFF is denoted as the Decision 1 block.
The gating circuit in Fig. 2.11(b) adopts the same scheme as that in Fig. 2.9(b): it detects the transition edges of the input data. Suppose the gating signal is logic 0 and the Vctrl signal is ready; the gated VCO then oscillates by default and is ready to restart its oscillation. When the gating signal is asserted, the gated VCO regenerates the gated clock instantaneously. In other words, the gating signal re-synchronizes the gated clock every time a data transition occurs.
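The re-synchronization behavior can be illustrated with a small timing model. The sketch below is an idealized abstraction, not Nakamura's circuit: it assumes the gated clock restarts at every data edge and that the data is sampled half a VCO period after each restart and once per VCO period thereafter. The function name and the half-period sampling offset are assumptions.

```python
def gated_vco_sample_times(edge_times, t_vco, t_end):
    """Sampling instants of a clock whose phase restarts at every data edge.

    edge_times: sorted data transition times (same unit as t_vco)
    t_vco:      gated-VCO period
    t_end:      end of the observation window
    """
    times = []
    for k, edge in enumerate(edge_times):
        # The clock runs freely until the next edge restarts it.
        stop = edge_times[k + 1] if k + 1 < len(edge_times) else t_end
        t = edge + 0.5 * t_vco
        while t < stop:
            times.append(t)
            t += t_vco
    return times


if __name__ == "__main__":
    edges_ps = [0.0, 100.0, 400.0]  # 10-Gb/s data, with one run of three equal bits
    print(gated_vco_sample_times(edges_ps, 100.0, 500.0))  # matched VCO period
    print(gated_vco_sample_times(edges_ps, 101.0, 500.0))  # VCO period 1% too long
```

In this model a VCO frequency error accumulates only across runs of identical bits and is cleared at the next transition, which is why the architecture can respond to a burst within a few bits.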
This prototype of the gated-VCO architecture works together with a burst PLL, which provides the control voltage to the CDR. An additional reset is required after each burst of data is recovered.
Example: Nogawa2005 [20]
[Block diagram: 10-Gb/s Di, Input Amp., CDR Core, Gating Circuit, G-VCO1, DFF, G-VCO2, PLL, PFD, CP & LPF, ÷64, 156-MHz Ref. Ck, Vctrl, Recovered 10-Gb/s Do, Recovered 10-GHz Ck]
Fig. 2.12: Gated-VCO CDR, Nogawa2005
The implementation in Fig. 2.12 demonstrates a high-performance gated-VCO CDR. It is fabricated in 0.13-µm CMOS technology with an overall area of 2.5 × 2.5 mm² and a power consumption of 1.2 W at a 2.5-V supply. It operates at 10 Gb/s and is able to extract the recovered clock within 5 bit times.
A novel element of this design is the input amplifier, which applies AC coupling and edge detection to accomplish the final comparison in a hysteresis comparator.
Nakamura’s earlier prototype employs a burst PLL. In later years, a PLL with an input reference clock became popular for generating Vctrl.
The gated VCO2 follows the reference clock instead of the input data, so the additional reset is no longer needed.
Example: Kaeriyama2003 [21]
[Block diagram: 10-Gb/s Di<0> to Di<n-1>, CDR[0] to CDR[n-1] (each with Edge Detector, Gating Signal, Gated VCO), 5-GHz Recovered Ck<0> to Ck<n-1>, PLL (PFD, CP, LPF, Gated VCO, ÷8), DLL (PFD, CP, LPF), 625-MHz Ref. Clock, 5-GHz System Ck, Vctrl]
Fig. 2.13: Gated-VCO CDR, Kaeriyama2003
Fig. 2.13 shows the configuration of a gated-VCO CDR for multi-channel timing recovery. It is implemented economically: first, the gated VCO inherently has low hardware overhead thanks to the shared control voltage; second, all gated VCOs operate at half rate.
The CDR macrocell consists of 1) an edge detector, 2) a gated VCO, 3) a phase detector, and 4) a reference voltage generator, where 3) and 4) are not shown in the figure.
The implementation is fabricated in 0.15-µm CMOS technology. Each CDR macrocell recovers 10-Gb/s data with a power dissipation of 50 mW at a 1.5-V supply, while its area is 120 × 130 µm². But this area excludes the hardware for data recovery, such as the de-multiplexer for the half-rate data and the retiming circuit.
2.5 Alternative & Hybrid
This section introduces alternative CDR architectures: a new recovery method, called FSM-based, and two hybrid architectures.
2.5.1 Alternative CDR
FSM-Based, Analui2005 [22]
[Block diagrams: (a) combinational logic circuits and one-bit delay; (b) state diagram]
Fig. 2.14: FSM-based CDR, Analui2005 (a) Architecture (b) State Diagram at n=2
The FSM-based architecture is clockless and digital. Fig. 2.14(a) shows the CDR architecture with 1-to-n de-multiplexing, which includes two combinational logic circuits and a one-bit delay circuit. The one-bit delay is implemented with L-C delay cells. The recovered data output depends on the current input and the previous state from the delay line. The system is therefore asynchronous, but it is synchronized to every transition of the incoming data.
The 1-to-n de-multiplexing relaxes the operation rate, since the state information is kept in the memory of the FSM and lasts for n bit times. This system behaves like an open loop and operates without jitter rejection. The 1-to-n de-multiplexing inherently introduces (1/n) of the input jitter to the output.
The implementation operates at 7.5 Gb/s and is fabricated in SiGe technology. It is built with 1-to-2 de-multiplexing. Judging from the data rate and the technology, the digital-circuit approach still encounters a speed limitation in timing recovery.
2.5.2 Hybrid CDR
A hybrid oversampling/PLL architecture, called semi-blind, was proposed by Ierssel in 2006. Fig. 2.15 shows the architecture. The main system is a blind oversampling architecture, while the second feedback loop, shown at the bottom of the figure, emulates the PLL-based system. The second feedback loop is composed of a DAC and a loop filter. The original blind oversampling architecture tracks the high-frequency jitter while the second loop tracks the low-frequency jitter. The jitter tolerance at low frequency is greatly (32×) improved by this hybrid version.
Fig. 2.16 shows a hybrid DLL/PLL CDR architecture by T. Lee. The data-deskew path forms the DLL, and the second loop in dashed line refers to the PLL. The system can be either a simple DLL-based CDR by removing the voltage controlled crystal oscillator (VCXO) path or a hybrid DLL/PLL system.
Both the DLL and the hybrid DLL/PLL architectures provide jitter-peaking-free timing recovery since no zero exists. In summary, the DLL loop determines the acquisition speed, while the PLL loop provides the filtering of low-frequency jitter.
The possibility of the hybrid DLL/PLL architecture can be further explored. Fig. 2.17² shows the weighted control of DLL and PLL by the interpolator. The original design in [26] uses a multiplexer to determine whether the loop of the delay line is configured as open or closed. When the loop is closed, the delay cells form an oscillator.
In Fig. 2.17, the multiplexer is replaced by an interpolator, and through the weighted control, the behavior can be partly DLL and partly PLL. For instance, the hybrid ratio of DLL to PLL can be 50%-50%, 20%-80%, or any other ratio.
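One idealized way to express the weighted control, assuming linear phase interpolation and assuming that the weight w applies to the input clock and (1 - w) to the fed-back delay-line output (both assumptions, since the assignment of the weights is not stated in the text), is:

```latex
\phi_{\mathrm{mix}} = w\,\phi_{\mathrm{in}} + (1 - w)\,\phi_{\mathrm{out}},
\qquad 0 \le w \le 1
```

Under this assumption, w = 1 drives the delay line purely from the input clock (DLL behavior), w = 0 closes the delay line on itself so that it oscillates (PLL behavior), and intermediate weights such as w = 0.5 or w = 0.2 correspond to the 50%-50% and 20%-80% DLL-to-PLL ratios mentioned above.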
Semi-blind Oversampling CDR, Ierssel2006 [23]
[Block diagram: Di, Samplers ×20, 20-phase 800-MHz VCO, DownSample, Decision Circuit, 8×4 FIFO, DAC, LPF, Do]
Fig. 2.15: Semi-blind oversampling CDR, Ierssel2006
² The original work addresses DLL/PLL design rather than CDR.
Hybrid DLL/PLL CDR, T.Lee92 [24]
[Block diagram: Di, Phase Detector, Loop Filter, Voltage Controlled Phase Shifter, VCXO (External), Clock In (for DLL mode), Retiming Module, Recovered Ck, Recovered Do]
Fig. 2.16: DLL & DLL/PLL CDR, T.Lee92
Hybrid DLL/PLL, Bae&Wei2004 [25]
[Block diagram: φ_in, Interpolator (weights w and 1-w, Wctrl, Enable), Voltage Controlled Delay Line, AND gates, φ_out, ÷N, PFD (Up/Dn), CP & LPF, Vctrl, CTRL]
Fig. 2.17: Mixed PLL/DLL, Bae&Wei2004
2.6 Summary
Table 1 summarizes the CDR architectures, where ○ denotes yes, △ partially yes, and X no. A blank cell marks a field that is currently unexplored in this survey. Take the lack of digital implementation of Gated-VCO for