Organization - A 10-Gb/s CMOS Clock and Data Recovery Circuit Using a Data Phase Corrector in a Closed Loop

Chapter 1 Introduction

1.3 Organization

This dissertation comprises seven chapters. The motivation and features of the CDR are described in this chapter.

Chapter 2 gives an overview of the world’s CDR architectures, which extend far beyond the PLL-based and oversampling architectures. It describes the operating principles and issues of the various architectures and explores their possibilities.

Chapter 3 analyzes the system-level behavior of the CDR. It covers system specifications, design parameters, and simulations. Discussions and simulations of various cases, such as noise profiles, frequency tolerance, loop latency, and frequency error, can be found in this chapter.

Chapter 4 depicts the circuit-level implementation. It describes the high-speed, large-swing DCDL design and how the loop-latency constraint is met. To minimize the loop latency, the pseudo-NMOS scheme is adopted. Examples can be found in the carry-look-ahead adder of the comparator and in the TFF up-down counters of the confidence counter and the FSM.

Chapter 5 shows the digital implementation of the 10-Gb/s transceiver. It describes the considerations for design and test, as well as different operation modes.

The CDR is verified in nominal mode and/or debug mode, while the phase resolution of the DCDL is measured in bypass mode. The multi-phase clock generator, the serializer, and the 10-Gb/s output buffer are also discussed in this chapter.

In the high-speed domain, the circuit layout is critical and dominates the final eye opening. Chapter 6 shows the chip layout and grid design, and describes the layout guidelines for high-speed circuits. The guideline on source coupling is especially emphasized. Post-layout simulations of the DCDL, the CDR, and the full-chip transceiver are given in this chapter.

The final chapter, Chapter 7, shows the specification table of the Deskew CDR and compares power and area among several CDR systems.

Chapter 2 Overview of the World’s CDR Architectures

Different applications require different CDR systems. The types of CDR systems reflect the modes of the recovery process: continuous vs. burst, closed loop vs. open loop, filter-based vs. oversampling, clock delay vs. data delay, digital vs. analog, and so on.

For conventional CDR systems, we have the PLL-based and oversampling CDR architectures. They are well explored and have their own traditions. But there are more candidates for timing-recovery applications. Following the classification in [1]¹, we categorize CDR systems into 1) PLL-Based, 2) Blind Oversampling, 3) DLL-Based, 4) Gated VCO, and 5) Alternative & Hybrid architectures.

In this chapter, some of the demonstrated systems are historical, while others are state of the art. The chapter mainly focuses on the world-view, the variety, and the possibilities of CDR systems.

¹ A presentation document on the internet by Çobanoğlu (2006) that introduces the world’s CDR systems. The original classification is 1) PLL-Based, 2) Delay-the-Data, 3) Gated VCO, 4) (Semi-)Blind Oversampling, and 5) FSM-Based.

2.1 PLL-Based CDR

PLL-based CDR systems are suitable for continuous-mode operation. They can be classified as single-loop [3]-[5] or dual-loop [6]-[9]. In general, a PLL-based CDR system is an Nth-order system, where N≥2. It is usually implemented with analog circuits because of its inherently continuous characteristics.

Fig. 2.1: A generic PLL-based CDR with a charge pump [figure; blocks: P/F detector, charge pump, LPF, VCO, retime; input Di, output Do]

Fig. 2.1 shows a generic PLL-based CDR architecture. It consists of phase and/or frequency detectors, a charge pump, a loop filter, and a VCO. For the dual-loop architecture, the entire recovery process includes the slow pull-in process by frequency detectors and then the lock-in process by phase detectors in sequence.

Back in 1985, a PLL-based system [3] was proposed for clock and data extraction from NRZ data. It employs an active SAW filter in the loop for band-pass filtering instead of a charge pump and a passive filter.

After the charge pump became popular, a second-order low-pass filter with the ‘C // R-C’ structure also came into common use [5], [8]. The second-order filter makes the loop a third-order system, so that a phase step, a frequency step, and an accelerating frequency variation can all be tracked.

A high-order PLL-based system is well known for its high performance. However, it does not suit burst-mode applications because of 1) the slow pull-in process, and 2) clock drift when no input is present.

Besides, a jitter peaking phenomenon [2] exists in high-order systems. Take the simplest case, a second-order system, as an example. The closed-loop transfer function is expressed in (2.1).

The approximation in (2.1) is made by assuming that the damping factor ζ is large (such as 10) and that ω_n² is small enough to be neglected. The approximated loop bandwidth is then derived in (2.2), and the corresponding zero and poles are given in (2.3)-(2.5).

From (2.3) and (2.4), the first pole lies just beyond the zero in absolute value. Jitter peaking therefore appears in the closed-loop transfer plot, as shown in Fig. 2.2.
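The pole-zero ordering described above can be checked numerically. The sketch below assumes the common textbook form of a second-order loop, H(s) = (2ζω_n·s + ω_n²) / (s² + 2ζω_n·s + ω_n²), with frequency normalized to ω_n; this form is an assumption for illustration and is not copied from (2.1). The function names are illustrative as well.

```python
import math

def closed_loop_gain(x, zeta):
    """|H(j*w)| for the assumed second-order loop, with x = w / w_n."""
    s = complex(0.0, x)                      # normalized s = j * w / w_n
    num = 2.0 * zeta * s + 1.0
    den = s * s + 2.0 * zeta * s + 1.0
    return abs(num / den)

def jitter_peaking(zeta, points=4000):
    """Peak of |H| over a logarithmic sweep of x from 1e-3 to 1e3."""
    peak = 0.0
    for i in range(points + 1):
        x = 10.0 ** (-3.0 + 6.0 * i / points)
        peak = max(peak, closed_loop_gain(x, zeta))
    return peak

def zero_and_poles(zeta):
    """Zero and the two real poles (normalized to w_n); valid for zeta > 1."""
    root = math.sqrt(zeta * zeta - 1.0)
    zero = -1.0 / (2.0 * zeta)
    return zero, -zeta + root, -zeta - root

if __name__ == "__main__":
    for zeta in (0.5, 1.0, 5.0, 10.0):
        print(f"zeta={zeta:5.1f}  peak |H| = {jitter_peaking(zeta):.4f}")
    z, p1, p2 = zero_and_poles(10.0)
    print(f"zero={z:.5f}  p1={p1:.5f}  p2={p2:.5f}")
```

Under this assumed form, the peak stays above unity for every damping factor and shrinks as ζ grows, and the first pole sits slightly farther from the origin than the zero.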

Fig. 2.2: Jitter peaking phenomenon

The amount of jitter peaking, J_P, is given in (2.7). It can be suppressed by over-damping the loop, that is, by applying a large damping factor ζ. But this results in a slow response during lock acquisition.

Example: Savoj2001 [4]

Fig. 2.3: The PLL-based CDR, Savoj2001 [figure; blocks: half-rate PD, charge pump, LPF, 5-GHz VCO, SER; 10-Gb/s Di in, 10-Gb/s Do out]

Example: Savoj2003 [6]

Fig. 2.4: The PLL-based CDR, Savoj2003 [figure; blocks: half-rate PD, V/I converter, loop filter, VCO with phases 0, 45, 90, 135; 10-Gb/s Di in, retimed 10-Gb/s Do out]

2.2 Blind Oversampling CDR

A blind oversampling architecture, shown in Fig. 2.5, is implemented with digital circuits and can handle both continuous and burst-mode timing recovery. It oversamples the data and chooses the optimal clock phase according to the edge information extracted in the decision circuit. The decision scheme can be either majority voting [10] or center picking [11]; the former is the inferior of the two [12].
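As a rough software illustration of the center-picking idea, the sketch below tallies where transitions fall within the oversampled stream and then selects the sample phase farthest from the dominant edge position. It is a behavioral model under simplifying assumptions (ideal samples, constant phase, an integer oversampling ratio), not the circuit of [11]; the function names are hypothetical.

```python
from collections import Counter

def pick_center_phase(samples, osr):
    """Return the sample phase (0..osr-1) farthest from the dominant edge position."""
    # Record the phase of the first sample after each transition.
    edges = Counter(
        i % osr for i in range(1, len(samples)) if samples[i] != samples[i - 1]
    )
    if not edges:
        return osr // 2  # no transitions seen: fall back to an arbitrary phase
    boundary = edges.most_common(1)[0][0]    # phase where a new bit starts
    return (boundary + osr // 2) % osr       # middle sample of the bit

def recover_bits(samples, osr):
    """Keep one sample per bit, taken at the picked center phase."""
    phase = pick_center_phase(samples, osr)
    return [samples[i] for i in range(phase, len(samples), osr)]

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0]
    # 3x oversampling with the first sample dropped to emulate an unknown phase.
    samples = [b for b in bits for _ in range(3)][1:]
    print(pick_center_phase(samples, 3), recover_bits(samples, 3))
```

Because only the choice of sample changes, neither the data nor the clock is delayed, which is why a residual static offset remains after picking.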

Fig. 2.5: A generic oversampling CDR, Kim&Jeong2003 [13] [figure; blocks: multi-phase clock generator, parallel samplers, sample storage, MUX, decision circuit]

A blind oversampling CDR tracks the high-frequency jitter of the input data stream well, while the limited storage size limits its tracking of low-frequency jitter.

Unlike most CDR systems, this architecture needs no acquisition time, but it requires extra hardware to execute the algorithm and adds processing latency to the data recovery.

The phase-picking scheme carries a static offset error on each sample, because neither the data nor the clock phases are adjusted. The maximum offset error is (0.5 UI / OSR), where OSR denotes the oversampling ratio. Although this offset error can be suppressed by raising the oversampling ratio, practical issues arise: 1) A high OSR demands fine and accurate phase spacing between samples, which is always a challenge. 2) The input capacitance of the phase detectors grows with OSR, which is especially critical in high-speed applications. Conventionally, 3× oversampling is widely used.
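The offset bound quoted above is easy to tabulate. The sketch below evaluates 0.5 UI / OSR and converts it to picoseconds for a given data rate; the helper names and the 10-Gb/s example rate are chosen here for illustration.

```python
def max_static_offset_ui(osr):
    """Worst-case static sampling offset in UI for phase picking: 0.5 UI / OSR."""
    if osr < 1:
        raise ValueError("oversampling ratio must be at least 1")
    return 0.5 / osr

def max_static_offset_ps(osr, data_rate_gbps):
    """Same bound in picoseconds; one UI lasts 1000 / data_rate_gbps ps."""
    return max_static_offset_ui(osr) * 1000.0 / data_rate_gbps

if __name__ == "__main__":
    for osr in (3, 4, 5, 8):
        print(f"OSR={osr}: {max_static_offset_ui(osr):.3f} UI, "
              f"{max_static_offset_ps(osr, 10.0):.2f} ps at 10 Gb/s")
```

With 3× oversampling the bound is one sixth of a UI, which is the price paid for keeping the sampler count and input capacitance low.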

Example: C.K.Yang98 [14]

Fig. 2.6: The blind oversampling CDR, C.K.Yang98 [figure; blocks: multi-phase clock, 1:8 DEMUX, samplers ×24, delay, decision circuit, MUX, bit shifter FIFO, over/under-flow controls; 4-Gb/s Di in, 512-Mb/s Do<0:7> out]

In Fig. 2.6, the sample storage is denoted as a delay block, and the decision circuit controls the multiplexer as well as the FIFO at the last stage. The FIFO is implemented with an 8-bit shifter. It handles both overflow and underflow when the phase error, which is mainly caused by the frequency error, accumulates to more than one bit time.

2.3 DLL-Based CDR

A DLL-based CDR can be regarded as a simplified version of the PLL-based architecture. It is a closed-loop first-order system without the jitter peaking phenomenon.
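One way to see why a first-order loop shows no peaking is to write its closed-loop phase transfer function. The form below assumes an idealized loop whose only dynamic element is a single integration with loop gain K; the symbol K is introduced here for illustration.

```latex
H(s) = \frac{K}{s + K}, \qquad
\lvert H(j\omega) \rvert = \frac{K}{\sqrt{\omega^{2} + K^{2}}} \le 1
\quad \text{for all } \omega .
```

The magnitude never exceeds unity because the function has a single pole and no finite zero, in contrast to the second-order PLL-based loop of Section 2.1.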

In this system, only the phase delay is a variable. A DLL-based CDR can be implemented with either analog or digital circuits; the latter is the major trend in recent days.

According to which signal is delayed, it can be classified as 1) clock-interpolation or 2) data-deskew. The clock-interpolation architecture can handle continuous timing recovery through a phase-rotation scheme, but phase rotation needs additional hardware, such as the FIFO stage of the oversampling architecture in Fig. 2.6, to handle the overflow/underflow condition.

The data-deskew architecture follows the straightforward concept of adjusting the data instead of the clock. Because the global clock is shared and left untouched, synchronization is simple. But the architecture is mainly limited by the data tuning range and is therefore suitable only for burst-mode applications.

2.3.1 Clock-Interpolation CDR

Fig. 2.7 shows an example of a clock-interpolation CDR by E. Lee. The 8 clock phases are adjusted by interpolation under the control of the phase controller, so that the sampling clock phases finally align to the midpoint of the data bit.

The receive amplifiers block consists of amplifiers and phase detectors. Here the full-rate data is inherently de-multiplexed into 4 quarter-rate data streams. Although digital circuits implement the logic functions in the phase controller block, the CDR implementation as a whole also relies on analog circuits.

Fig. 2.8 shows a clock-interpolation CDR for multi-channel timing recovery by Kreienkamp. It adopts analog circuitry to achieve high speed and fine phase resolution. A differential charge pump and two capacitors contribute the single pole of the system. The phase interpolator uses the conventional analog current-steering scheme and, as in PLL-based CDR systems, the phase resolution is limited by the discrete steps introduced by the charge pump. The chip is fabricated in 0.11-µm CMOS technology, and its power consumption is 220 mW at a 1.5-V supply.

For continuous recovery, however, the phase rotation of these CDR macro-cells is not described.

Example: E.Lee2001 [15]

Fig. 2.7: The clock-interpolation CDR, E.Lee2001 [figure; blocks: 500-MHz 8-phase DLL, phase controller]

Example: Kreienkamp2005 [16]

Fig. 2.8: The clock-interpolation CDR, Kreienkamp2005 (a) the CDR, (b) the multi-channel configuration [figure; a shared PLL supplies the clock to multiple CDR macro-cells, each taking input data and producing recovered data]

2.3.2 Data-Deskew CDR

Fig. 2.9(a) shows the 10-Gb/s data-deskew CDR for multi-channel burst-mode applications proposed by Wong. It is a full-rate analog implementation fabricated in both AlGaAs/GaAs and InGaP/GaAs HBT technologies, where f_t ~ 50 GHz, f_max ~ 60 GHz, and β ~ 40. The voltage-controlled delay line, phase detector, and loop filter compose the delay-locked loop. In addition, an edge detector circuit adjusts the time constant of the loop filter. Fig. 2.9(b) shows the phase detector circuit. The detector’s output is generated from the transition edge of the input and its asynchronously delayed copy.

The achieved tuning range is 2 UI, or 200 ps. The design is claimed to handle a 12.5-kbit data packet, but only under the assumption that the frequency error of all clocks is within 20 ppm. This 20-ppm error is far smaller than the conventional estimate of 200 ppm.
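The link between packet length, frequency error, and tuning range can be made concrete with a small calculation. The sketch below assumes a constant frequency error over the packet, so the phase drifts linearly with the number of bits; how much of the tuning range is actually available as a drift budget depends on the initial phase and is left as a parameter. The function names are illustrative.

```python
def accumulated_drift_ui(packet_bits, freq_error_ppm):
    """Phase drift in UI after packet_bits bits at a constant frequency error."""
    return packet_bits * freq_error_ppm * 1e-6

def max_packet_bits(drift_budget_ui, freq_error_ppm):
    """Longest packet (in bits) whose drift stays within drift_budget_ui."""
    return int(drift_budget_ui * 1e6 / freq_error_ppm)

if __name__ == "__main__":
    for ppm in (20, 200):
        drift = accumulated_drift_ui(12_500, ppm)
        print(f"{ppm:3d} ppm over 12.5 kbit -> {drift:.2f} UI of drift")
    print("bits within a 2-UI budget at 200 ppm:", max_packet_bits(2.0, 200))
```

At 20 ppm a 12.5-kbit packet drifts by a quarter of a UI, which fits comfortably inside a 2-UI range, whereas at 200 ppm the same packet drifts by 2.5 UI and exceeds it.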

Fig. 2.10 shows a digital implementation of the data-deskew CDR by Lu. A confidence counter replaces the conventional loop filter, and cascaded delay cells compose the DCDL block. Coarse and fine tuning are available: coarse tuning is implemented by the on/off states of the tri-state buffers in the chain, and fine tuning by the amount of added capacitive load.
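A confidence counter acting as a digital loop filter can be sketched behaviorally as follows. This is a minimal model, not Lu's circuit: the threshold, the 0 to 63 delay-code range, and the polarity (a lead vote increases the delay code) are all assumptions made for illustration.

```python
class ConfidenceCounter:
    """Up/down counter that forwards a decision only after `threshold` net votes."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be positive")
        self.threshold = threshold
        self.count = 0

    def update(self, lead, lag):
        """Feed one phase-detector result; return +1 (up), -1 (down), or 0."""
        self.count += int(bool(lead)) - int(bool(lag))   # assumed polarity
        if self.count >= self.threshold:
            self.count = 0
            return 1
        if self.count <= -self.threshold:
            self.count = 0
            return -1
        return 0

def run_loop(pd_results, threshold, code=0, code_min=0, code_max=63):
    """Apply a stream of (lead, lag) pairs to a delay code, clamped to its range."""
    counter = ConfidenceCounter(threshold)
    for lead, lag in pd_results:
        code = min(code_max, max(code_min, code + counter.update(lead, lag)))
    return code

if __name__ == "__main__":
    print(run_loop([(1, 0)] * 8, threshold=4, code=10))           # steady lead votes
    print(run_loop([(1, 0), (0, 1)] * 10, threshold=4, code=10))  # alternating votes
```

Alternating lead/lag votes cancel inside the counter and never reach the delay line, which is the filtering role the analog loop filter plays in a conventional loop.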

It is fabricated in 0.18-µm CMOS technology, and the achieved tuning range is 1 UI, or 400 ps, at 2.5-Gb/s operation. Because of this limited tuning range, the implementation cannot handle any frequency error.

Example: Wong96 [17]

Fig. 2.9: The data-deskew CDR, Wong96 (a) Architecture (b) Edge Detector [figure; blocks in (a): voltage controlled delay line, phase detector, loop filter, edge detector, data retime; 10-Gb/s Di, 10-GHz clock, 10-Gb/s Do; blocks in (b): envelope detector with In and Out]

Example: Lu2005 [18]

Fig. 2.10: The data-deskew CDR, Lu2005 [figure; blocks: phase detector, confidence counter, delay control FSM, digitally controlled delay line; signals: Lead/Lag, Up/Dn, 5-GHz clock, 2.5-Gb/s Do]

2.4 Gated-VCO CDR

Example: Nakamura96 [19]

Fig. 2.11: Gated-VCO CDR, Nakamura96 (a) architecture (b) gating circuit [figure; blocks in (a): CDR core with gating circuit 1, G-VCO1, and Decision 1; burst PLL with PFD, CP & LPF, G-VCO2, gating circuit 2, Decision 2, and Reset; signals: Di, Vctrl, recovered Do, recovered Ck; blocks in (b): half-bit delay with In and Out]

A gated-VCO CDR system was first introduced by Nakamura in 1996. It responds quickly to asynchronous burst input data. In Fig. 2.11(a), the CDR core consists of a gating circuit, a gated VCO, and a DFF at the final stage for retiming the data. This DFF is denoted as the Decision 1 block.

The gating circuit in Fig. 2.11(b) adopts the same scheme as that in Fig. 2.9(b). It detects the transition edges of the input data. When the gating signal is logic 0 and the Vctrl signal is ready, the gated VCO oscillates by default and is ready to re-initiate an oscillation. When the gating signal is asserted, the gated VCO regenerates the gated clock instantaneously. In other words, the gating signal re-synchronizes the gated clock every time a data transition occurs.

This prototype of the gated-VCO architecture works together with a burst PLL, which provides the control voltage to the CDR. An additional reset action is required after each burst of data recovery.

Example: Nogawa2005 [20]

Fig. 2.12: Gated-VCO CDR, Nogawa2005 [figure; blocks: input amp., gating circuit, G-VCO1, DFF, G-VCO2, PFD, CP & LPF, ÷64; signals: 10-Gb/s Di, recovered 10-Gb/s Do, recovered 10-GHz Ck, Vctrl, 156-MHz ref. Ck]

The implementation in Fig. 2.12 demonstrates a high-performance gated-VCO CDR. It is fabricated in 0.13-µm CMOS technology with an overall area of 2.5 × 2.5 mm² and a power consumption of 1.2 W at a 2.5-V supply. It operates at 10 Gb/s and is able to extract the recovered clock within 5 bit times.

A novelty of this design is the input amplifier, which applies AC coupling and edge detection schemes to accomplish the final comparison in a hysteresis comparator.

Nakamura’s earlier prototype employs a burst PLL. In later years, a PLL with an input reference clock became popular for generating Vctrl. The gated VCO2 follows the reference clock instead of the input data, so the additional reset action is no longer needed.

Example: Kaeriyama2003 [21]

Fig. 2.13: Gated-VCO CDR, Kaeriyama2003 [figure; blocks: CDR[0] to CDR[n-1], each with an edge detector and a gated VCO driven by a gating signal; PLL with PFD, CP, LPF, gated VCO, and ÷8; DLL; signals: 10-Gb/s Di<0> to Di<n-1>, 5-GHz recovered Ck<0> to Ck<n-1>, 5-GHz system Ck, 625-MHz ref. clock, Vctrl]

Fig. 2.13 shows the configuration of a gated-VCO CDR for multi-channel timing recovery. It is implemented economically: first, the gated VCO inherently has a low hardware overhead thanks to the shared control voltage, and second, all gated VCOs operate at half rate.

The CDR macrocell consists of 1) an edge detector, 2) a gated VCO, 3) a phase detector, and 4) a reference voltage generator, where 3) and 4) are not shown in the figure.

The implementation is fabricated in 0.15-µm CMOS technology. Each CDR macrocell recovers 10-Gb/s data with a power dissipation of 50 mW at a 1.5-V supply, and its area is 120 × 130 µm². This area, however, excludes the hardware for data recovery, such as the de-multiplexer for the half-rate data and the retiming circuit.

2.5 Alternative & Hybrid

This section introduces alternative CDR architectures: a new recovery method, called FSM-based, and two hybrid architectures.

2.5.1 Alternative CDR

FSM-Based, Analui2005 [22]

Fig. 2.14: FSM-based CDR, Analui2005 (a) Architecture (b) State Diagram at n=2 [figure; blocks include combinational logic]

The FSM-based architecture is clockless and digital. Fig. 2.14(a) shows the CDR architecture with 1-to-n de-multiplexing, which includes two combinational logic circuits and a one-bit delay circuit. The one-bit delay is implemented with L-C delay cells. The recovered data output depends on the current input and the previous state from the delay line. The system is therefore asynchronous, but it is synchronized to every transition of the incoming data.

The 1-to-n de-multiplexing relaxes the operation rate, since the state information is kept in the memory of the FSM and lasts for n bit times. The system behaves like an open loop and operates without jitter rejection. The 1-to-n de-multiplexing inherently passes (1/n) of the input jitter to the output.

The implementation operates at 7.5 Gb/s and is fabricated in SiGe technology. It is built with 1-to-2 de-multiplexing. Judging from the data rate and technology, the digital-circuit approach still faces a speed limitation in timing recovery.

2.5.2 Hybrid CDR

A hybrid oversampling/PLL architecture, called semi-blind, was proposed by Ierssel in 2006. Fig. 2.15 shows the architecture. The main system is a blind oversampling architecture, while the second feedback loop, shown at the bottom of the figure, emulates a PLL-based system. This second loop is composed of a DAC and a loop filter. The original blind oversampling architecture tracks the high-frequency jitter, while the second loop tracks the low-frequency jitter. The low-frequency jitter tolerance is greatly improved (32×) by this hybrid version.

Fig. 2.16 shows a hybrid DLL/PLL CDR architecture by T. Lee. The data-deskew path forms the DLL, and the second loop in dashed line refers to the PLL. The system can be either a simple DLL-based CDR by removing the voltage controlled crystal oscillator (VCXO) path or a hybrid DLL/PLL system.

Both the DLL and the hybrid DLL/PLL architectures provide jitter-peaking-free timing recovery since no zero exists. In summary, the DLL loop determines the acquisition speed, while the filtering of low-frequency jitter benefits from the PLL loop.

The possibilities of the hybrid DLL/PLL architecture can be explored further. Fig. 2.17² shows the weighted control of the DLL and PLL by an interpolator. The original design in [26] uses a multiplexer to determine whether the loop of the delay line is open or closed. When the loop is closed, the delay cells form an oscillator.

In Fig. 2.17, the multiplexer is replaced by an interpolator, and through the weighted control the behavior can be part DLL and part PLL. For instance, the hybrid ratio of DLL to PLL can be 50%-50%, 20%-80%, or any other split.
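The weighted control can be pictured as a linear mix of two phases at the delay-line input. The sketch below is a deliberately simplified model: it assumes that a weight w selects the external input phase and (1 - w) selects the fed-back delay-line output, so that w = 1 gives a pure DLL and w = 0 a pure PLL. That mapping, the linear mixing, and the neglect of phase wrap-around are assumptions for illustration, not details taken from [25] or [26].

```python
def interpolate_phase(phi_in, phi_out, w):
    """Weighted mix of the input phase and the fed-back delay-line output phase.

    w = 1.0 -> delay line driven only by the input (open loop, DLL-like)
    w = 0.0 -> delay line driven only by its own output (closed ring, PLL-like)
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return w * phi_in + (1.0 - w) * phi_out

if __name__ == "__main__":
    phi_in, phi_out = 0.2, 0.8   # example phases in UI
    for w in (1.0, 0.5, 0.2, 0.0):
        mixed = interpolate_phase(phi_in, phi_out, w)
        print(f"DLL {w:.0%} / PLL {1 - w:.0%}: phase = {mixed:.2f} UI")
```

Replacing a two-position multiplexer with a continuous weight is what allows intermediate ratios such as 50%-50% or 20%-80%.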

Semi-blind Oversampling CDR, Ierssel2006 [23]

Fig. 2.15: Semi-blind oversampling CDR, Ierssel2006 [figure; blocks: 20-phase 800-MHz VCO, samplers ×20, downsample, decision circuit, 8×4 FIFO, DAC, LPF; input Di, output Do]

² The original topic is about DLL/PLL instead of CDR.

Hybrid DLL/PLL CDR, T.Lee92 [24]

Fig. 2.16: DLL & DLL/PLL CDR, T.Lee92 [figure; blocks: voltage controlled phase shifter, phase detector, loop filter, VCXO (external), retiming module; signals: Di, clock in (for DLL mode), recovered Ck, recovered Do]

Hybrid DLL/PLL, Bae&Wei2004 [25]

Fig. 2.17: Mixed PLL/DLL, Bae&Wei2004 [figure; blocks: voltage controlled delay line, interpolator with weights w and 1-w, PFD, CP & LPF, ÷N, AND gates, CTRL; signals: φ_in, φ_out, Vctrl, Wctrl, Enable, Up]

2.6 Summary

Table 1 summarizes the CDR architectures, where ○ denotes yes, △ partially yes, and Ｘ no. A blank entry marks a field not yet explored in this survey. Take the lack of digital implementation of Gated-VCO for
