Design for Test - A 10-Gb/s Transceiver Test Chip

Chapter 5 A 10-Gb/s Transceiver Test Chip

5.2 Design for Test

In addition to the controlled data input and the measurement flexibility described previously, this section covers further test considerations, such as the phase resolution measurement and pad/buffer sharing, as well as the circuit operation in the different modes.

5.2.1 Phase Resolution Measurement

An additional DCDL block, denoted as DCDL_T, is introduced to the chip.

Two measurements are of main interest. The first is the phase resolution of the DCDL. The second is how the DCDL-induced jitter affects the output eye diagram. From the eye diagrams at different tapping locations, the jitter induced from stage to stage can be observed. The maximum number of stages can then be analyzed and determined from the measurement results.

5.2.2 Output Buffer Driving Capability

The driving capability of the full-rate output buffer is controlled through a variable amount of pre-emphasis. The pre-emphasis mechanism is formed by the delayed negative-feedback driving strength of the tri-state buffers. A large capacitive load requires more strength, at the cost of eye closure along the magnitude axis.

A small capacitive load can be handled with less strength. A capacitive load of 1 pF for each probe pad is taken as the typical case in this work.

The negative-feedback strength depends on how many tri-state buffers are activated, and is therefore controllable.

5.2.3 Pads/Buffers Sharing

The control of DCDL_T and the driving strength of the output buffer each require 5 bits in binary format. To save area, these 10 bits share 5 control pads, denoted as Ct<0:4>.

Two sets of 5-bit registers are used for the pad sharing. They operate in the low-frequency domain. The low-frequency clock is the global 2.5-GHz clock divided by 512 (about 4.88 MHz), so as to avoid disturbing the internal circuits.

The 5-bit data Ct<0:4> is loaded continuously into either Register A or Register B. The bypass signal determines which one is loaded.
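The pad sharing can be pictured with a small behavioral model. The sketch below is only an illustration, not the chip's implementation: the class name `SharedControlRegisters` and its method are hypothetical, and the mapping (bypass = 0 loads Register B, bypass = 1 loads Register A) is taken from the configuration sequence described in Section 5.2.4.

```python
from dataclasses import dataclass

GLOBAL_CLOCK_HZ = 2.5e9   # global clock, from the text
DIVIDE_RATIO = 512        # divider ratio, from the text


@dataclass
class SharedControlRegisters:
    """Hypothetical model of two 5-bit registers fed by the pads Ct<0:4>."""
    reg_a: int = 0  # DCDL_T setting
    reg_b: int = 0  # output-buffer driving strength

    def clock(self, ct: int, bypass: int) -> None:
        """One low-frequency clock edge: load Ct<0:4> into the selected register."""
        if not 0 <= ct < 32:
            raise ValueError("Ct<0:4> is a 5-bit value")
        if bypass:
            self.reg_a = ct   # bypass = 1 selects Register A
        else:
            self.reg_b = ct   # bypass = 0 selects Register B


low_freq_clock_hz = GLOBAL_CLOCK_HZ / DIVIDE_RATIO

regs = SharedControlRegisters()
regs.clock(ct=0b10110, bypass=0)  # configure the output buffer first
regs.clock(ct=0b00111, bypass=1)  # then enter bypass mode and set DCDL_T
print(low_freq_clock_hz, regs)
```

The register that is not selected simply holds its last value, which is what allows 10 control bits to be delivered through 5 pads.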

The 10-Gb/s output buffer is the only output path for 10-Gb/s data. Two inputs, multiplexed by the bypass signal, share this buffer. Similarly, a 2.5-Gb/s output buffer is shared through a multiplexer controlled by the signal Ct<0>.

5.2.4 Bypass-Mode Operation

The first thing to test is the bypass-mode operation. Fig. 5.2 shows the active blocks and the signal flow in bypass mode. The full-rate data passes through DCDL_T and the 2-to-1 full-rate multiplexer, and is finally driven out by the output buffer. This mode verifies that the full-rate data path behaves correctly.

Fig. 5.2: Bypass-mode operation

Fig. 5.3: Nominal operation

The phase resolution of DCDL_T is measured in bypass mode. Two settings must be configured before the measurement. First, configure the output buffer by setting registers B while bypass = 0. Then set bypass = 1 to enter bypass mode and configure registers A for DCDL_T. The phase resolution of each step is then measured by stepping the setting of registers A.
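Once the delay has been recorded for each setting of registers A, the per-step phase resolution follows from simple differences. The sketch below assumes the readings are available as a list of delays in picoseconds; the function names and the sample readings are hypothetical and are not measurement results from this work.

```python
def phase_steps(delays_ps):
    """Return the delay increment between consecutive codes."""
    return [b - a for a, b in zip(delays_ps, delays_ps[1:])]


def summarize(delays_ps):
    """Collect the per-step resolution, its average, the range, and monotonicity."""
    steps = phase_steps(delays_ps)
    return {
        "steps_ps": steps,
        "average_ps": sum(steps) / len(steps),
        "tuning_range_ps": delays_ps[-1] - delays_ps[0],
        "monotonic": all(s > 0 for s in steps),
    }


# Hypothetical readings (ps) for five consecutive Register A codes.
measured = [0, 6, 11, 18, 24]
result = summarize(measured)
print(result)
```

The same summary applies to simulated delays, where the average step and the tuning range are the two quantities of interest.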

5.2.5 Nominal Operation

The chip is in nominal mode when the bypass signal is false. Fig. 5.3 shows the corresponding active blocks and the signal flow. The full-rate data is recovered and de-serialized in the Deskew CDR, then serialized again in the 4-to-1 serializer. A 2-to-1 multiplexer selects this full-rate data when bypass = 0. Finally, the output buffer drives the full-rate data back to the tester. The driving capability of the output buffer is configured by registers B.

5.2.6 Debug-Mode Operation

Fig. 5.4: Debug-mode operation

Tracking and locking behaviors of the CDR can be observed from the internal Lead_CC / Lag_CC signals. Fig. 5.4 shows that two 2.5-Gb/s buffers drive these two signals to an external oscilloscope, and that one of the four parallel data outputs of the CDR can be multiplexed to the buffer.

The debug mode is not controlled by any signal, and thus it can accompany the nominal mode. Since the two 2.5-Gb/s output buffers use their own pair of Vdd/Gnd, the supplies can be removed at any time. In that case, the chip is naturally out of debug mode.

5.3 Building Blocks

This section describes three blocks that operate at 10 GHz or 10 Gb/s: a multi-phase clock generator, a serializer, and the output buffer.

5.3.1 Multi-Phase Clock Generator

Fig. 5.5: A 3-stage Johnson counter (a) architecture (b) timing diagram

Fig. 5.5(a) shows a 3-stage Johnson counter. The phase difference between Q0 and Q1 is one Ck period. In the timing diagram of Fig. 5.5(b), the logic ‘1’ propagates from Q0, through Q1, and then to Q2. Q0 falls to zero at the 4th clock edge because the inverting input of the first DFF receives the high Q2. The logic ‘0’ then propagates in the same way as the logic ‘1’. The propagation of logic ‘1’ takes 3 Ck periods and that of logic ‘0’ takes 3 Ck periods, so the data rate of Q is the clock rate divided by 6 for the 3-stage Johnson counter.

Another way to explain the data rate of Q is to follow the data around the loop. Q0 propagates to Q1, to Q2, …, to Q5, and finally back to Q0. It takes 6 Ck periods to complete the loop, so the data rate of Q0 is the clock rate divided by 6.
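The divide-by-6 behavior can be checked with a short behavioral simulation. The sketch below assumes ideal edge-triggered DFFs that all start at zero; the function names are illustrative only.

```python
def johnson_counter(stages, cycles):
    """Simulate an N-stage Johnson counter; return the state after each clock edge."""
    state = [0] * stages
    history = []
    for _ in range(cycles):
        # Each DFF samples its neighbour; the first samples the inverted last output.
        state = [1 - state[-1]] + state[:-1]
        history.append(tuple(state))
    return history


def period(history):
    """Number of clock edges until the first recorded state repeats."""
    first = history[0]
    for k in range(1, len(history)):
        if history[k] == first:
            return k
    return None


hist3 = johnson_counter(3, 12)
for edge, state in enumerate(hist3[:6], start=1):
    print(edge, state)          # (Q0, Q1, Q2) after each clock edge
print(period(hist3))            # 3 stages: period of 6 clock edges
print(period(johnson_counter(2, 12)))  # 2 stages: period of 4 clock edges
```

In general an N-stage Johnson counter repeats every 2N clock periods, which is why a 2-stage counter gives a divide-by-4 output.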

Fig. 5.6: The multi-phase clock generator

The proposed clock divider consists of the 4 latches shown in Fig. 5.6. It is a 2-stage Johnson counter, since the four latches form two DFFs. The input clock rate is 10 GHz, and the output clock rate is the input rate divided by 4 (2.5 GHz), following the Johnson counter concept. To achieve high-speed operation, limiters that restrict the swing are added to the differential signals.

The even-phase clocks are the outputs of the DFFs. The phase difference between adjacent even-phase clocks is one 100-ps clock period. The odd-phase clocks are tapped from the master latches of the DFFs.

Because the latch circuit is differential and fully symmetric, as illustrated previously in Fig. 4.20(d), the phase differences among P<0:7> should be equal. However, the multi-phase outputs of the latch circuits still suffer from an unbalanced duty cycle due to the mismatch between the PMOS pull-up strength and the NMOS pull-down strength. To balance the duty cycle, one can adjust the common-mode level of the 10-GHz Ck.
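The origin of the eight phases can be illustrated with an idealized latch-level model. The sketch below assumes that the master latches are transparent while Ck is low and the slave latches while Ck is high, with zero propagation delay; the signal names `m1`, `s1`, `m2`, `s2` are hypothetical, and no mapping to the P<0:7> indices is implied.

```python
HALF_PERIOD_PS = 50  # half of the 100-ps period of the 10-GHz input clock


def simulate_latch_ring(half_cycles):
    """Return the four latch outputs after each half cycle of Ck."""
    m1 = s1 = m2 = s2 = 0  # master/slave latches of DFF 1 and DFF 2
    trace = []
    for h in range(half_cycles):
        if h % 2 == 0:      # Ck low (assumed): master latches are transparent
            m1, m2 = 1 - s2, s1
        else:               # Ck high (assumed): slave latches are transparent
            s1, s2 = m1, m2
        trace.append((m1, s1, m2, s2))
    return trace


def rising_edge_times(trace):
    """Time (ps) of the first rising edge of each latch output and its complement."""
    names = ["m1", "s1", "m2", "s2"]
    edges = {}
    prev = (0, 0, 0, 0)
    for h, now in enumerate(trace):
        for i, name in enumerate(names):
            if prev[i] == 0 and now[i] == 1:
                edges.setdefault(name, h * HALF_PERIOD_PS)
            if prev[i] == 1 and now[i] == 0:
                # a falling output is a rising edge of its complement
                edges.setdefault(name + "_b", h * HALF_PERIOD_PS)
        prev = now
    return edges


edges = rising_edge_times(simulate_latch_ring(16))
for name, t in sorted(edges.items(), key=lambda kv: kv[1]):
    print(f"{name:5s} rises at {t:3d} ps")
```

In this idealized model the slave (DFF) outputs are spaced by 100 ps and the master outputs fall halfway between them, so the eight differential outputs cover the 400-ps period of the 2.5-GHz clock in 50-ps steps.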

5.3.2 Serializer

Fig. 5.7 shows the 4-to-1 serializer circuit. It consists of three 2-to-1 multiplexers cascaded in two stages. The multiplexer at the output stage uses a 5-GHz clock and generates the full-rate serialized data. The two multiplexers at the first stage use two quadrature-phase clocks and operate in the 2.5-GHz domain.

All input data are retimed with a buffered clock that is aligned to the global clock P<0>. A DFF implements the 1-cycle delay block, while a DFF with an extra latch implements the 1.5-cycle delay block. The extra latch retimes the data to the falling clock edge, which gives the 1.5-cycle delay.

The conventional approach is to divide the fastest clock by 2 first and then generate the lower-rate clocks sequentially. In this work, all the serializer clocks are instead derived from the global clocks P<0:7>.
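The two-stage multiplexer tree can be sketched functionally as follows. This is a bit-level illustration only: the bit ordering (d0 first) and the select encoding are assumptions, and the retiming delays and quadrature clock phases are not modeled.

```python
def mux2(sel, a, b):
    """2-to-1 multiplexer: output a when sel is 0, b when sel is 1."""
    return b if sel else a


def serialize_4to1(words):
    """Serialize 4-bit parallel words (d0, d1, d2, d3) into a full-rate bit list."""
    out = []
    for d0, d1, d2, d3 in words:
        for slot in range(4):            # four full-rate bit slots per word
            sel_first = slot // 2        # first-stage select changes every two slots
            even = mux2(sel_first, d0, d2)
            odd = mux2(sel_first, d1, d3)
            sel_out = slot % 2           # output-stage select changes every slot
            out.append(mux2(sel_out, even, odd))
    return out


stream = serialize_4to1([(1, 0, 1, 1), (0, 0, 1, 0)])
print(stream)
```

The output-stage select toggles twice as fast as the first-stage selects, which mirrors the 5-GHz and 2.5-GHz clock domains described above.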

Fig. 5.7: The 10-Gb/s 4-to-1 Serializer

5.3.3 Output Buffer

Fig. 5.8(a) shows the digital-circuit implementation of the 10-Gb/s output buffer. It adopts the conventional one-bit delay scheme. The schematic of the ‘A cell’ is the same as the delay cell in Fig. 4.3(a). The A cells are cascaded as tapered buffers and drive the ‘B cell’, a static CMOS inverter, at the final output stage.

At the bottom of the figure, the delayed forward loop implements the FIR filter. The output signal of the forward loop is delayed and has negative polarity with respect to the main output of the B cell.

The ‘Tri’ block is composed of the binary-controlled tri-state buffers shown in Fig. 5.8(b), where the ‘c cell’ is half the size of the capital ‘C cell’. The amount of pre-emphasis is controlled by Ct<0:4>. The more tri-state buffers are turned on, the larger the overshoot in the time-domain output signal.

In addition, two 50-ohm poly resistors (not shown) are connected in series between the differential output nodes for impedance matching.
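The effect of the one-bit delay scheme can be illustrated with an idealized two-tap FIR model. In the sketch below the signal levels are normalized to ±1, and the linear mapping from the 5-bit code to the tap weight, as well as the value of `max_tap`, are hypothetical; they are not taken from the circuit.

```python
def pre_emphasis(bits, code, max_tap=0.5):
    """Two-tap FIR: main cursor minus a scaled copy of the previous bit.

    bits    : sequence of 0/1 data bits
    code    : 5-bit setting, standing in for Ct<0:4>
    max_tap : hypothetical tap weight at code = 31 (not from the text)
    """
    if not 0 <= code < 32:
        raise ValueError("code must fit in 5 bits")
    tap = max_tap * code / 31
    levels = [1.0 if b else -1.0 for b in bits]
    out = []
    prev = levels[0]  # assume the line idled at the first bit's level
    for x in levels:
        out.append(x - tap * prev)
        prev = x
    return out


wave = pre_emphasis([0, 1, 1, 1, 0, 0], code=31)
print(wave)
```

In this model a transition bit is boosted to 1 + tap while a repeated bit settles to 1 - tap, which corresponds to the overshoot at transitions and the reduced eye height mentioned in Section 5.2.2.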

Fig. 5.8: The 10-Gb/s output buffer (a) architecture (b) binary-control tri-state buffers

Chapter 6 Layout & Post Simulations

6.1 Layout

6.1.1 Layout Guidelines for High-Speed Circuits

The layout plan and the block arrangement are critical to high-speed operation. The layout guidelines for high-speed circuits are derived from the iterative process between layout and post simulation. They are listed in descending order of their benefit to high-speed operation.

1) Source Coupling

This is the most important guideline for the proposed differential large-swing digital circuits. The sources of the differential input transistors should be placed together, even in the absence of current sources. These sources are connected through the diffusion layer, but contact sharing is not recommended.

The source-coupling layout couples the differential signals. Its effect can be observed in the post simulation with R-C-CC extraction, even though the phenomenon is not modeled by the circuits in pre-simulation.

Fig. 6.1: Source-coupling pairs in the delay cell

Fig. 6.1 shows the delay cell of DCDL. Three inverter pairs adopt the source coupling scheme. It improves the 10-Gb/s differential signaling in the delay cell. For the 2.5-GHz domain, the differential signaling also benefits from the source coupling scheme. Fig. 6.2 shows the latch with 3 pairs of NMOS in the source coupling scheme.

Fig. 6.2: Source-coupling pairs in the latch

2) Reducing Capacitive Load

The drain area sharing between two transistors reduces the drain-to-bulk capacitance. Both the drain area sharing and contacts sharing are recommended.

A capacitive load can be partitioned into three components: 1) the area overlap capacitance C_a, 2) the coupling capacitance C_c, and 3) the fringing capacitance C_f. In most cases, the area overlap capacitance C_a dominates the high-speed performance.


This is observed especially when the drain output in the (N+1)th metal layer runs across a power line in the Nth metal layer. For such crossings it is recommended to use the (N+2)th metal layer, so that the overlapping wires are not in adjacent layers.

3) Common-Centroid Geometry

The common-centroid scheme is a general layout guideline. It mainly addresses process variation across the wafer and does not directly benefit high-speed operation, except that the fully symmetric geometry results in symmetric capacitance, which helps the differential signaling.

For high-speed circuits, the common-centroid guideline is recommended as a global layout guideline rather than as a local optimization. It has been verified that applying a common-centroid scheme to all transistors of the delay cell, instead of the source-coupling scheme, results in poor performance.

Fig. 6.3: Common-centroid scheme (a) m = 2 (b) m = 3 (c) m = 4

Fig. 6.3 shows the common-centroid scheme for differential signaling. The input transistors are denoted as A, and the input-bar transistors are denoted as B. Assume the I/O direction is either top-to-bottom or right-to-left. The input capacitance of A equals that of B, and so does the output capacitance.

6.1.2 Grid Design

Fig. 6.4 shows the proposed grid cells for the supplies: (a) is the power grid cell, and (b) and (c) are the decoupling capacitors that fill the blank area of the chip. Cap1 is a stack-type capacitor of 0.17 pF, and Cap2 is a finger-type capacitor of 0.7 pF.

The grid design is flexible since all the cells behave like tiles. Metals are self-connected, so supplies are easily transported through the cascaded cells.

Common-centroid arrangements in Fig. 6.3: (a) m = 2: A B / B A (b) m = 3: A B A / B A B (c) m = 4: A B B A / B A A B

Fig. 6.4: Grid design (a) power grid (b) Cap1 (c) Cap2

6.1.3 Chip Layout

Fig. 6.5: Chip Layout

Table 10: Pads and power configurations

(a) Pads configuration
Pad type           Number
G-S-G-S-G Probe    5 × 3
Vdd / Gnd          4 × 2
Control            7
2.5Gb/s Output     2 × 2

(b) Power configuration
Supply pair    Power
Vdc, Gdc       60 mW
Vdt, Gdt       60 mW
Vdt2, Gdt2     60 mW
Vdt3, Gdt3     60 mW

(c) Power vs. modes
Mode       Power
Full       240 mW
Nominal    180 mW
Bypass     120 mW
Debug      180 mW

Table 11: Block layout area

Block                        Area (µm × µm)
Full Chip (pads included)    1683 × 994
Transceiver                  700 × 310
Deskew CDR                   183 × 149

The G-S-G-S-G probe pads are used for the high-speed I/O. In Fig. 6.5, the 10-G I/O uses 3 directions: the data enters from the left, the clock enters from the top, and the data exits to the right.

Table 10(b) shows the power pads and the corresponding power consumption. The Vdt/Gdt power pair supplies the serializer, the clock generator, and DCDL_T. Vdt2/Gdt2 supplies the 10-Gb/s output buffer. Vdt3/Gdt3 supplies the two 2.5-Gb/s output buffers, while Vdc/Gdc supplies the CDR. Each power pair provides 60 mW to the internal circuits. The full-chip power consumption depends on the operation mode, as shown in Table 10(c). Table 11 shows the layout area. The transceiver area includes all circuits and 3 pairs of 50-ohm termination resistors.

6.1.4 Core Layout

Fig. 6.6: Deskew CDR (a) layout (b) block layout with I/O ports

Fig. 6.6(a) shows the layout of the CDR, and Fig. 6.6(b) shows the block layout and the signal flow. Table 12 lists the port information. The global reset at port (d) goes to the FSM and the accumulator ACCU. After each burst-data recovery, it resets the two up-down counters, and the DCDL is then re-initialized.

Table 12: The ports information of the CDR (a) internal ports (b) external ports

(a) Internal ports
(1) DCDL's output                  Do, Dob
(2) Recovered data                 Q<0,2,4,6>
(3) PD's output                    Q<0:7>
(4) XOR's output                   Lead<0:3>, Lag<0:3>
(5) Comparator's output            Lead_COMP, Lag_COMP
(6) Confidence counter's output    Lead_CC, Lag_CC
(7) FSM's output                   C<0:7>, F<0:3>

(b) External ports
(a) 10-Gb/s input data
(b) Global clock P<0:7>
(d) Global reset

6.2 Post Simulation

6.2.1 Simulation Setup

Fig. 6.7: The bonding pad model

All the simulations mentioned in this section are post simulations. They all adopt the bonding pad model in Fig. 6.7, where R = 1 mΩ, C = 0.5 pF, and L = 2 nH.

For the simulations of the DCDL and the CDR, the parasitic extraction method is R-C-CC. The full-chip simulation uses C-CC instead, since the HSPICE simulator fails to allocate memory for the extra-large R-C-CC network.

Unless otherwise specified, the simulations refer to the typical case by default: the TT-corner model, at room temperature, under the typical 1.2-V supply.


6.2.2 DCDL Simulation

Phase Resolution & Tuning Range

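Fig. 6.8: DCDL post simulation (a) phase resolution (ps) versus interval index (1 to 27) (b) tuning range, shown as added delay (ps) versus code index (1 to 28) with initial code index = 15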

Table 13: DCDL phase resolution and tuning range

Fig. 6.8(a) shows the post-simulation result of the DCDL phase resolution. The target average phase resolution is 6 ps. Results for all corner models are given. Fig. 6.8(b) shows the DCDL tuning range. Monotonic tuning behavior is observed for all corner models.

The average phase resolution and the tuning range are given in Table 13. In the FF-corner case, the tuning range of 126.7 ps is less than the 140-ps specification. Lowering the supply voltage is a good way to extend the tuning range and to reduce the power at the same time. The FF(2) simulation under a 1-V supply demonstrates this.

Worst Output Eye

The worst case of the DCDL output eye occurs at the maximum number of cascaded stages. Fig. 6.9 shows the setup, in which the longest path is active for the worst-eye measurement. The 10-Gb/s input Di is 0.6 V ± 50 mV, and the jitter accumulation time is 1200 bit times, or 120 ns.

Only in the SS-corner case does the peak-to-peak jitter, 14.83 ps, fail to meet the specification of 10 ps, or ‘0.1 UI’. In the FF(2) simulation case, the common mode of the delay cell is self-adjusted, as shown in Fig. 6.10(f), because of the CMOS architecture.
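The unit-interval arithmetic behind these numbers is easy to verify. The short sketch below only restates values quoted in the text (10 Gb/s, 1200 bit times, the 0.1-UI specification, and the 14.83-ps SS-corner jitter); the variable and function names are illustrative.

```python
DATA_RATE_GBPS = 10
UI_PS = 1000 / DATA_RATE_GBPS          # unit interval: 100 ps at 10 Gb/s
SPEC_UI = 0.1                          # jitter specification from the text

accumulation_ns = 1200 * UI_PS / 1000  # 1200 bit times expressed in ns


def jitter_in_ui(jitter_ps):
    """Express a peak-to-peak jitter in unit intervals."""
    return jitter_ps / UI_PS


ss_jitter_ps = 14.83                   # SS-corner value quoted in the text
ss_meets_spec = jitter_in_ui(ss_jitter_ps) <= SPEC_UI
print(accumulation_ns, jitter_in_ui(ss_jitter_ps), ss_meets_spec)
```

At 10 Gb/s the 0.1-UI limit corresponds to 10 ps, so the SS-corner result of roughly 0.148 UI exceeds it.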

Fig. 6.9: DCDL worst-case setup

Table 14: DCDL output peak-to-peak jitter at the worst-case setup

Jp-p (ps): 4.69, 8.41, 3.88, 3.96, 14.83 (SS), 4.06

* FF(2) simulation is under 1-V supply.

Fig. 6.10: DCDL post simulation - worst eye diagrams (a) TT (b) SS (c) FF (d) SF (e) FS (f) FF(2)

6.2.3 CDR Simulation

Tracking-to-Locking Simulation

Fig. 6.11 shows the post simulation of the CDR. Fig. 6.11(a) demonstrates the tracking behavior for the ‘25-ps’ average case. The locking behavior is observed after t = 15 ns.

Fig. 6.11(b) shows a time interval of Fig. 6.11(a). It illustrates the relationship between the input Di and the output Do of the DCDL. Do is the delayed and amplified version of Di.

Fig. 6.11: Deskew CDR post simulation. Traces in (a) and (b): 10-Gb/s input (600±50 mV), DCDL output Do, cc_Lead, cc_Lag, Idd

FSM Simulation

Fig. 6.12 follows the simulation in Fig. 6.11(a). The FSM updates the coarse control C<0:7> and the fine control F<0:3> according to Leadb_CC / Lagb_CC of the confidence counter.

Fig. 6.12: Deskew CDR post simulation – FSM. Traces: C7 to C0, F3 to F0, LeadbCC, LagbCC

6.2.4 Full Chip Simulation

Nominal Operation

Fig. 6.13 shows the full-chip post simulation in nominal mode. The tracking-to-locking behavior of the CDR is observed. Fig. 6.13(b) is a zoomed-in view of (a).

Fig. 6.13: Full-chip post simulation. Traces in (a) and (b): 10-Gb/s input (600±50 mV), DCDL output Do, MUX2-1 output, OutBuf output To, LeadCC, LagCC

Fig. 6.14: 10-Gb/s output eyes (a) single-ended To (b) differential-ended To-Tob

Fig. 6.14 shows the output eye diagrams of To of the 10-Gb/s output buffer. The simulation assumes a capacitive load of 1 pF for each probe pad. The eye diagrams are measured in the lock state. Fig. 6.14(a) shows the single-ended eye of To, and (b) shows the differential eye of To-Tob. The peak-to-peak jitter is 13.4 ps for case (a) and 12.8 ps for case (b).

6.3 Test Environment

The Agilent N4901B Serial BERT handles the 10-G I/O. It measures the BER and the data output eyes, and derives the phase resolution of the DCDL. The Keithley 2400 source meter provides and measures the power. The HP DC supply provides power to the chip. It also provides the AC ground, which is half the Vdd level, to the probe pads.

In debug mode, Agilent 86100B oscilloscope monitors the 2.5-Gb/s Lead and Lag of the CDR. It also verifies one of the four recovered output data.

Fig. 6.15: Test environment. Instruments shown: Agilent N4901B 13.5-Gbps Serial BERT, Agilent 86100B oscilloscope, two Keithley 2400 source meters, two HP E3610A DC supplies, and the PCB. Signals shown: Di_10G, Dib_10G, To_10G, Tob_10G, Ck_10G, Ckb_10G

Chapter 7 Conclusion

In this dissertation, we adopt a data-deskew CDR with advantages of a simple environment setup, low hardware overhead, and the easy synchronization for multi-channel timing recovery. Analysis and design of the CDR are presented in both system and circuit levels. A digital implementation of the 10-Gb/s transceiver is
