Low-Power All-Digital Clock Generators for SoC Applications


National Chiao Tung University

Department of Electronics Engineering & Institute of Electronics

Doctoral Dissertation

Low-Power All-Digital Clock Generators for SoC Applications

Student: Duo Sheng

Advisor: Dr. Chen-Yi Lee


A Dissertation

Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical and Computer Engineering,

National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in

Electronics Engineering

June 2010

Hsinchu, Taiwan, Republic of China


Abstract

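As process technology advances and electronic products demand more functions, the complexity of system-on-chip (SoC) designs keeps increasing. A complex SoC requires many different kinds of clock signals to meet different functional requirements, so designing clock generators suited to SoC becomes an important issue. Traditionally, clock generators are implemented with analog circuits; however, analog clock generators face severe design challenges at low supply voltages and also suffer from lower system integration and higher area cost. In contrast, an all-digital implementation offers high system integration and low area cost, making it well suited to SoC applications. Moreover, power and performance are the main problems to overcome when designing clock generators for SoC applications. Therefore, this dissertation proposes all-digital design schemes to implement various clock generators for SoC that effectively reduce power consumption and improve circuit performance. The core circuit modules of an all-digital clock generator are the digitally controlled oscillator (DCO) and the delay cell, whose performance and power consumption significantly affect the overall performance of the clock generator. This dissertation therefore first proposes a low-power, high-performance DCO and delay cell, and this DCO design can be applied to many kinds of all-digital clock generators. The DCO uses a cascaded coarse-fine tuning architecture to widen the operating frequency range while maintaining high delay resolution. The coarse-tuning part uses a segmental delay line architecture to save unnecessary power, and the fine-tuning part uses hysteresis delay cells to reduce circuit loading and complexity, which further reduces power consumption. As a result, the overall power consumption of the DCO is greatly reduced while high performance is maintained.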


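The phase-locked loop (PLL) is the most common and fundamental type of clock generator. In systems with power management, the PLL must provide a locked clock signal quickly, so this dissertation next proposes a fast-lock all-digital PLL. The proposed two-level flash time-to-digital converter greatly shortens the lock-in time at a small hardware cost. In addition, the all-digital spread-spectrum clock generator is another circuit commonly used in SoC; it reduces the electromagnetic interference (EMI) that the clock signal causes in the system. This dissertation proposes a rescheduling division triangular modulation algorithm that achieves a programmable spreading ratio while preserving the ability to track the phase of the input clock. Memory is an indispensable basic component of SoC, and double data rate (DDR) memory is widely used because of its high performance. Because a DDR memory controller requires special clock control signals for the DDR memory to work correctly, this dissertation proposes a tunable clock generator based on an all-digital delay-locked loop and a digitally controlled phase shifter, which overcomes the delay mismatch caused by long routing. Memory also requires a synchronous mirror delay circuit to resolve internal clock skew caused by unequal routing lengths. The proposed all-digital synchronous mirror delay circuit uses edge-trigger mirror delay cells and a fine-tuning delay line built from high-resolution delay cells to widen the acceptable duty-cycle range and reduce the static phase error. Besides using the proposed DCO and various design techniques to improve performance and reduce power consumption, all of the proposed all-digital clock generators are implemented with standard library cells. Thanks to this portability, a design can be migrated to different processes as easily as a soft IP. Therefore, the proposed all-digital clock generators are well suited to SoC applications and system-level integration.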


Abstract

As IC technology migrates to the nanoscale era and electronic products demand more functions, system-on-chip (SoC) design becomes more complex. A complex SoC design needs many different clock signals to meet different functional requirements, so designing the various clock generators for SoC applications becomes an important topic. Traditionally, clock generators are designed with an analog approach. However, an analog clock generator not only faces tougher design challenges as the supply voltage decreases but is also hard to integrate into a system because of its large area. In contrast, the all-digital design approach is very suitable for SoC applications owing to its high portability and low design cost. In addition, power consumption and performance are the major design considerations for clock generators in SoC applications. Thus, this work proposes a systematic all-digital design approach to implement various high-performance, low-power clock generators for SoC applications.

The kernel modules of all-digital clock generators are the digitally controlled oscillator (DCO) and the delay cell. Because the DCO and delay cell dominate the overall performance and power consumption of an all-digital clock generator, this work proposes a high-performance, low-power DCO and delay cell that can be applied to all kinds of all-digital clock generators for SoC applications. The proposed DCO employs a cascadable structure with coarse- and fine-tuning stages to achieve high resolution and a wide frequency range at the same time. The coarse-tuning stage uses a segmental delay line (SDL) to eliminate redundant power, and the proposed hysteresis delay cell (HDC) reduces the circuit complexity and loading of the fine-tuning stage to further lower the power consumption. As a result, the power consumption of the proposed DCO is reduced significantly while high performance is maintained.

The phase-locked loop (PLL) is the most essential type of clock generator. In power management applications, a PLL should provide a locked clock signal within a short time. Thus, this work proposes a fast-lock-in all-digital PLL (ADPLL) that employs a novel 2-level flash time-to-digital converter (TDC) to reduce the lock-in time at low hardware cost. Besides, an all-digital spread-spectrum clock generator (ADSSCG) that reduces the electromagnetic interference (EMI) effect is another important design in SoC applications. The proposed rescheduling division triangular modulation (RDTM) scheme enhances the phase tracking capability and provides a wide programmable spreading ratio at the same time.

Memory is an essential component of SoC design. Double data rate (DDR) memories are widely used in high-performance modern SoC designs to meet the required data bandwidth. Because a DDR memory controller needs specific clock and control signals to ensure the functionality and performance of data accesses, this work proposes a tunable phase shift scheme based on an all-digital delay-locked loop (ADDLL) and a digitally controlled phase shifter (DCPS) to solve the delay mismatch issue. In addition, memory designs use a synchronous mirror delay (SMD) to eliminate the clock skew caused by wire-delay mismatch. The proposed all-digital SMD (ADSMD) uses edge-trigger mirror delay cells to enlarge the input duty cycle range and a fine-tuning delay line with high-resolution delay cells to reduce the static phase error.

The proposed all-digital clock generators not only use the proposed DCO/delay cell and several design techniques to enhance performance and reduce power consumption but can also be realized with standard cells in standard CMOS processes, making them easily portable to different processes as soft intellectual property (IP). As a result, the proposed all-digital clock generators are very suitable for SoC applications as well as system-level integration.


Acknowledgments

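For the completion of my doctoral research, I first and foremost thank my advisor, Dr. Chen-Yi Lee. He taught me to keep innovating in research with an optimistic and proactive spirit, to examine problems from a broad system perspective, and to meet research challenges with rigor. In both scholarship and conduct, he will always be my role model. I also thank the members of my oral defense committee, Prof. 王進賢, Prof. 劉深淵, Prof. 李泰成, Prof. 黃錫瑜, Prof. 黃威, and Prof. 許騰尹, for attending my defense despite their busy schedules and giving me many valuable suggestions, which showed me many different aspects of my research and inspired my future research directions. I especially thank Prof. 鍾菁哲, who always generously offered sincere advice and guidance whenever I ran into problems in circuit design or thesis writing, enabling me to keep making breakthroughs and growing in my research; I have benefited greatly. I also thank my partners in the Si2 Lab: my seniors 黎峰, 軒宇, 建青, and 瑞元, as well as 子明, 元哥, 志龍, 曜哥, 義澤, 芳年, and 琇茹, for sharing research insights and offering many valuable suggestions on SoC design and applications. In addition, I thank my supervisors at work, 柯裕豐 and 曾友信, for their encouragement and understanding, which allowed me to balance work and study and gave me the chance to apply my research results to real products. Of course, I cannot forget my colleagues at work: Andy, 志文, 阿寬, 詠松, 世一, 雄哥, 思蔚, Steven, Mark, 江龍, and 小芳, who not only helped and encouraged me at work but also brought much fun to my life and leisure. Finally, I thank my parents: without your nurturing and upbringing, this dissertation would never have been completed. Above all, I thank my wife, who has walked with me all along, seeing me through every difficulty and low point and sharing with me every moment of the deepest emotion and joy. My deepest gratitude is beyond words.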


List of Figures

Fig. 2.1 Power profiling of ADPLL
Fig. 2.2 (a) Proposed HDC (b) Equivalent circuit of HDC for analysis
Fig. 2.3 Hysteresis phenomenon of HDC
Fig. 2.4 The relation among input voltage of TINV, effective driving current, and INV1 delay
Fig. 2.5 Architecture of the proposed DCO
Fig. 2.6 Proposed segmental coarse-tuning stage with SDL
Fig. 2.7 Proposed fine-tuning stage with HDC and DCV
Fig. 2.8 Power comparisons of different coarse-tuning designs
Fig. 2.9 Power and resolution comparisons of different fine-tuning designs
Fig. 2.10 Microphotograph and layout of DCO test chip
Fig. 2.11 Comparisons of measurement and post-layout simulation results
Fig. 2.12 Jitter histogram of DCO at 952MHz
Fig. 3.1 Binary search ADPLL architecture
Fig. 3.2 Binary search algorithm
Fig. 3.3 Flowchart of phase tracking mode
Fig. 3.4 TDC-based ADPLL architecture
Fig. 3.5 Counter-based TDC
Fig. 3.6 (a) Single delay chain flash TDC (b) Operation of single delay chain flash TDC
Fig. 3.7 Vernier delay line TDC
Fig. 3.8 The proposed 2-level flash TDC architecture
Fig. 3.9 Simulation of 2-level flash TDC
Fig. 3.10 Transient response of binary search ADPLL
Fig. 3.11 Transient response of TDC-based ADPLL
Fig. 4.1 Architecture of the proposed ADSSCG
Fig. 4.2 (a) Conventional triangular modulation (b) Division triangular modulation (c) Rescheduling division triangular modulation
Fig. 4.3 (a) Architecture of the proposed DCO (b) Fine-tuning cells of DCO
Fig. 4.4 Flowchart of auto-adjustment algorithm
Fig. 4.5 Comparison between original and adjusted timing
Fig. 4.6 Microphotograph of ADSSCG test chip
Fig. 4.7 Measurement spectrum of 54MHz (a) Without frequency spreading (b) With 1% frequency spreading
Fig. 4.8 Measurement spectrum of 27MHz (a) Without frequency spreading (b) With 10% frequency spreading
Fig. 5.1 (a) Interconnection of DDR memory and core system (b) Waveform of read operation (c) Waveform of write operation
Fig. 5.2 Architecture of the proposed tunable phase shift scheme for DDR controller
Fig. 5.3 Flowchart of the proposed tunable phase shift scheme
Fig. 5.4 Architecture of (a) ADDLL (b) DCPS
Fig. 5.5 (a) Proposed DCDL (b) Coarse-tuning stage (c) Fine-tuning stage
Fig. 5.6 (a) Proposed TDC (b) Waveform of TDC
Fig. 5.7 Layout of ADDLL and DCPS
Fig. 5.8 (a) Transient response of ADDLL (b) ADDLL at steady state
Fig. 5.9 Tunable signal phase scheme in read operation when (a) DQS leads DQ (b) DQS lags DQ
Fig. 5.10 Phase shift between CLOCK1 and CLOCK2 at 400MHz
Fig. 5.11 Jitter and phase shift of ADDLL under different PVT
Fig. 6.1 Architecture of the conventional SMD
Fig. 6.2 (a) Architecture of the proposed SMD (b) Circuit of EMDC
Fig. 6.3 Block diagram and equivalent circuit of DCV
Fig. 6.4 Timing waveform (a) without blocking scheme (b) with blocking scheme
Fig. 6.5 Microphotograph of SMD test chip
Fig. 6.6 (a) Timing diagram of the proposed SMD

List of Tables

Table 2.1 Comparisons of Different DCO Approaches
Table 2.2 Performance Comparisons with Different Fine-Tuning Stages
Table 2.3 Measurement Results of Step/Range of Tuning Stage
Table 2.4 DCO Performance Comparisons
Table 4.1 Jitter and Timing Comparisons of DTM and RDTM
Table 4.2 Simulation Results of Delay of Tuning Stage
Table 4.3 SSCG Performance Comparisons
Table 5.1 ADDLL Performance Comparisons
Table 6.1 Comparisons of Different SMD Approaches
Table 6.2 Phase Error Under Different PVT Conditions
Table 6.3 ADSMD Performance Summary


Chapter 1 Introduction

1.1 Motivation

As IC technology advances rapidly, the focus of modern VLSI design has moved from single functional blocks to system-level integration and single-chip solutions. As the demand for electronic product functionality increases, many different functional blocks are integrated into a single chip, which increases the design complexity of system-on-chip (SoC). A complex SoC design needs various clock signals to meet the requirements of different functional blocks. Hence, how to design clock generators that provide suitable clock signals for SoC applications becomes an important topic.

Clock generator designs can be partitioned into analog and all-digital approaches. Traditionally, clock generators are realized with an analog approach. However, as the supply voltage decreases, gain and frequency range must be traded off in the voltage-controlled oscillator (VCO), the most important block of an analog clock generator. In addition, because of serious leakage current, it is hard to design the charge-pump circuit, an essential block of the analog clock generator, in more advanced process technologies. Thus, more design effort is needed to integrate analog clock generators into SoCs with lower supply voltages and advanced processes. Moreover, because the analog clock generator employs passive components such as resistors and capacitors to form the loop filter, it incurs large area and cost. Furthermore, as technology migrates, the analog blocks of the clock generator must be redesigned, which lengthens the design turnaround time.

In contrast to the analog clock generator, the all-digital design approach uses no passive components and relies on digital design methods, so it can easily be integrated into digital and low-supply-voltage systems. Because an all-digital clock generator is reusable as soft intellectual property (IP), it can radically decrease time-to-market and is very suitable for SoC applications as well as system-level integration. This motivates us to focus on all-digital clock generator design for SoC applications in this dissertation.

Performance and power are always the most important considerations in SoC design. Because an all-digital clock generator controls timing in discrete steps, its minimum controllable delay step must be very fine (high resolution) to achieve low steady-state jitter. In addition, because a large number of clock generators are integrated into a single chip, each clock generator should consume little power to reduce the overall power consumption of the system. Among the functional blocks of an all-digital clock generator, the digitally controlled oscillator (DCO) is the kernel module because it dominates the overall performance and power consumption. For example, the DCO accounts for over 50% of the power consumption of an all-digital clock generator [1], and its delay resolution and operating range affect the jitter performance and output frequency range of the clock generator, respectively. To meet these requirements, all-digital clock generators need a high-performance, low-jitter DCO. Thus, before studying and designing all-digital clock generators, we first propose a high-performance, low-power delay cell and DCO that can be applied to all kinds of all-digital clock generators for SoC applications.

After completing the design of a low-power DCO, the follow-up work focuses on all-digital clock generators. There are four important types of clock generators in SoC applications, namely the phase-locked loop (PLL), spread-spectrum clock generator (SSCG), delay-locked loop (DLL), and synchronous mirror delay (SMD). Their functions and applications are described as follows:

- PLL: It is widely used in microprocessor (µP)-based and digital systems [2]-[4]. It receives a reference clock from an external component, for example a quartz crystal, and generates a set of frequency-multiplied system clock signals for system operation.

- SSCG: In SoC applications, the radiated emissions of a system should be kept below an acceptable level to ensure the functionality and performance of the system and adjacent devices, especially in high-speed serial link and video/display systems [5]. An SSCG significantly reduces the electromagnetic interference (EMI) effect by spreading the clock frequency while maintaining system performance [6].

- DLL: In high-speed serial link and data transmission applications, a DLL-based multiphase clock generator produces multiphase clocks that can be used to find a better sampling point and to process data streams at a bit rate higher than the internal clock frequency, improving overall system performance [7], [8]. In addition, a DLL can eliminate the clock skew caused by large wire loading among different functional blocks in a single chip or among multiple chips.

- SMD: Memory is an essential component of SoC design. To eliminate internal clock skew caused by wire-delay mismatch, memory designs need a synchronous mirror delay (SMD) with low complexity and small area that quickly provides a clock with a small static phase error relative to the external clock [9].
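As a rough illustration of the SSCG entry above, the sketch below generates the instantaneous frequency of a conventional down-spread triangular modulation profile. It is not the RDTM scheme proposed later in this dissertation; the function name, the down-spread choice, and the modulation period are assumptions made for illustration only.

```python
def triangular_spread_hz(f0_hz: float, spread: float, t_s: float, t_mod_s: float) -> float:
    """Instantaneous clock frequency under (assumed) down-spread triangular modulation.

    Within each modulation period t_mod_s the frequency ramps linearly from
    f0_hz down to f0_hz * (1 - spread) and back up again, so the clock energy
    is spread over a band of width f0_hz * spread instead of one spectral line.
    """
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in [0, 1)")
    phase = (t_s % t_mod_s) / t_mod_s    # position inside the current period, 0..1
    tri = 1.0 - abs(2.0 * phase - 1.0)   # triangle wave: 0 -> 1 -> 0
    return f0_hz * (1.0 - spread * tri)


if __name__ == "__main__":
    F0 = 54e6      # 54MHz clock with 1% spreading (values quoted in Sec. 1.2)
    SPREAD = 0.01
    T_MOD = 33e-6  # hypothetical modulation period (about a 30kHz modulation rate)
    for k in range(5):
        t = k * T_MOD / 4
        f = triangular_spread_hz(F0, SPREAD, t, T_MOD)
        print(f"t = {t * 1e6:5.2f} us -> f = {f / 1e6:.3f} MHz")
```

The triangle is used here because a linear sweep dwells equally on every frequency in the band, which flattens the spectrum; a real SSCG also has to keep tracking the input phase, which is what the RDTM scheme addresses.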

A design for SoC applications must not only achieve high performance, low power, and low complexity but also be highly portable, so that it can migrate easily to other processes with a short design turnaround time. Hence, this work implements the proposed all-digital clock generators with standard cells only, making them easily portable to different processes and very suitable for SoC applications.

1.2 Goal and Contribution

The previous section shows that low-power all-digital clock generators are in high demand for SoC applications. On the other hand, the performance requirements of clock generators grow rapidly as design complexity increases, so low-power all-digital clock generator design becomes more challenging than before. In this dissertation, an ultra-low-power DCO for all-digital clock generators is proposed, and with this DCO and delay cell the overall power consumption can be reduced significantly. The proposed all-digital clock generators not only use the proposed DCO and delay cell to raise performance but also gain performance improvements at the algorithmic and architectural levels. Furthermore, the proposed clock generators are truly portable because they are realized with standard cells only. The goals and contributions of this dissertation are summarized as follows.

1. Digitally controlled oscillator (DCO)

- Goal: Propose a low-power, high-resolution DCO for all-digital clock generators.

- Contribution:

Based on the proposed segmental delay line (SDL) and hysteresis delay cell (HDC), the total power consumption of the proposed DCO is reduced to 140µW at 200MHz. Compared with conventional approaches, power consumption is reduced by 70% and 86.2% in the coarse-tuning and fine-tuning stages, respectively.

The proposed DCO employs a cascaded-stage structure to achieve a high resolution of 1.47ps and a wide range at the same time.

2. All-digital phase-locked loop (ADPLL)

- Goal: Propose a fast-lock-in, low-power ADPLL for power management schemes.

- Contribution:

The proposed ADPLL uses a novel 2-level flash time-to-digital converter (TDC) to lock in within 2 reference clock cycles. In contrast to the single-level type, the proposed design takes only 12 D flip-flops, reducing hardware complexity and power consumption.

The proposed ADPLL employs the proposed low-power DCO to save overall power consumption.

3. All-digital spread-spectrum clock generator (ADSSCG)

- Goal: Propose a low-power ADSSCG with a programmable spreading ratio for EMI reduction in liquid crystal display (LCD) systems.

- Contribution:

The proposed ADSSCG employs a novel rescheduling division triangular modulation (RDTM) scheme to enhance the phase tracking capability and provide a wide programmable spreading ratio. The peak power is reduced by 9.5dB at 54MHz with a 1% spreading ratio and by 15dB at 27MHz with a 10% spreading ratio.

The proposed ADSSCG employs the proposed low-power DCO with an auto-adjustment algorithm, which saves power consumption while keeping the delay characteristic monotonic.

The total power consumption is 1.2mW at 54MHz, and the power index is 22.2µW/MHz, the lowest power-to-frequency ratio among the compared state-of-the-art designs, implying that the proposed ADSSCG is more power-efficient for a given operating frequency.

4. All-digital delay-locked loop (ADDLL)

- Goal: Propose a tunable phase shift scheme based on an ADDLL for DDR memory interfaces.

- Contribution:

The proposed phase shift scheme provides a suitable all-digital solution for eliminating the non-ideal effects of data transmission over multi-chip interconnections, especially in high-data-rate applications.

The proposed ADDLL, which employs a high-performance digitally controlled delay line (DCDL) with HDC and TDC, achieves a small phase-shift error of 1.3° at 400MHz and a locking time of less than 13 clock cycles. Compared with conventional ADDLLs, it achieves the fastest phase lock while keeping the smallest phase-shift error.

5. All-digital synchronous mirror delay (ADSMD)

- Goal: Propose an ADSMD with a wide input duty cycle range and a small static phase error for clock synchronization in SoC applications.

- Contribution:

The proposed SMD uses the edge-trigger mirror delay cell to enlarge the input duty cycle range (from 20% to 80%) and a blocking edge-trigger scheme to ensure functionality and performance.

The phase error is reduced to 18ps at 400MHz by the proposed delay-matching structure and a fine-tuning delay line with high-resolution delay cells.
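The headline numbers in the list above can be cross-checked with simple unit arithmetic. The helper functions below are hypothetical names introduced for this sketch; they normalize power to output frequency (µW/MHz, lower is better) and convert a phase error between degrees and picoseconds at a given clock frequency.

```python
def power_index_uw_per_mhz(power_w: float, freq_hz: float) -> float:
    """Power normalized to output frequency in uW/MHz (lower is better)."""
    return (power_w / 1e-6) / (freq_hz / 1e6)


def period_ps(freq_hz: float) -> float:
    """Clock period in picoseconds."""
    return 1e12 / freq_hz


def phase_deg_to_ps(phase_deg: float, freq_hz: float) -> float:
    """Timing error (ps) that corresponds to a phase error in degrees."""
    return phase_deg / 360.0 * period_ps(freq_hz)


def phase_ps_to_deg(error_ps: float, freq_hz: float) -> float:
    """Phase error in degrees that corresponds to a timing error in ps."""
    return error_ps / period_ps(freq_hz) * 360.0


if __name__ == "__main__":
    print(f"ADSSCG, 1.2mW at 54MHz   : {power_index_uw_per_mhz(1.2e-3, 54e6):.1f} uW/MHz")
    print(f"DCO, 140uW at 200MHz     : {power_index_uw_per_mhz(140e-6, 200e6):.2f} uW/MHz")
    print(f"ADDLL, 1.3 deg at 400MHz : {phase_deg_to_ps(1.3, 400e6):.2f} ps")
    print(f"ADSMD, 18ps at 400MHz    : {phase_ps_to_deg(18.0, 400e6):.2f} deg")
```

At 400MHz the period is 2.5ns, so the ADDLL's 1.3° phase-shift error corresponds to about 9ps and the ADSMD's 18ps static phase error to about 2.6°; the DCO figure of 0.7µW/MHz is derived here from the quoted 140µW at 200MHz rather than reported in the text.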

1.3 Dissertation Organization

This dissertation is organized as follows. Chapter 2 describes the proposed architecture and circuits of the high-resolution, ultra-low-power DCO; the proposed DCO and HDC are applied to the clock generators in the following chapters. In Chapter 3, the general binary search-based ADPLL is discussed and the proposed TDC-based ADPLL for fast lock-in is presented. Chapter 4 focuses on the proposed ADSSCG, which employs a novel rescheduling division triangular modulation (RDTM) scheme to enhance the phase tracking capability and provide a wide programmable spreading ratio; an auto-adjustment algorithm for a monotonic delay characteristic is also proposed. In Chapter 5, the proposed tunable phase shift scheme based on an ADDLL for DDR controller applications is presented. Chapter 6 describes the proposed ADSMD, which employs a delay-matching structure and a high-resolution delay cell to achieve a small static phase error and an edge-trigger mirror delay cell to extend the input duty cycle range. Finally, conclusions and future work are given in Chapter 7.


Chapter 2 Low-Power Digitally Controlled Oscillator with Hysteresis Delay Cell

2.1 Introduction

The digitally controlled oscillator (DCO) and the digitally controlled delay line (DCDL) are the most important modules in the ADPLL/ADSSCG and the ADDLL/ADSMD, respectively. Delay cells are used to construct a ring oscillator in the ADPLL/ADSSCG and a delay line in the ADDLL/ADSMD. In this chapter, the high-performance, low-power delay cell is described first, and the follow-up work focuses on the DCO architecture built with the proposed delay cell.

Basically, the DCO dominates the major performance metrics of all-digital clock generators, such as power consumption and jitter, and hence is the most important component of such clocking circuits [1], [10]-[14]. In terms of power, the DCO accounts for over 50% of the power consumption of an all-digital clock generator [1]. For example, the DCO accounts for 59% of the power consumption of an all-digital phase-locked loop (ADPLL), as shown in Fig. 2.1. As a result, the power consumption of the DCO should be reduced further to lower the overall power dissipation and meet the low-power demands of SoC designs. Besides, the resolution of the DCO strongly influences the jitter performance and the frequency or phase error of the output clock. Furthermore, if the DCO provides a wide operating frequency range, it extends the output frequency range of the all-digital clock generator to a wider range of applications.

Recently, different architectural solutions have been proposed to implement the DCO. The current-starved DCO [15] controls the supply current of the delay cell to obtain different delay values. Although it has high resolution, it needs a static current source, which increases static power dissipation. The LC-tank DCO [16] can also achieve high delay resolution; however, it needs an advanced process and requires intensive custom layout. These approaches demand high complexity at the circuit level, resulting in long design cycles and low portability.

To shorten the design cycle when the process or specification changes, many DCOs implemented with standard cells have been proposed to enhance portability [1], [11], [17], [18]. Driving capability modulation (DCM) changes the driving current of each delay cell by controlling the number of enabled tri-state buffers/inverters [1], [17]. The design concept of this approach is straightforward, but its linearity and power consumption are poor, and its resolution is insufficient.
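The linearity problem of DCM can be seen from a first-order model. Assuming that the stage delay is roughly C_L·(V_DD/2)/I_drive and that each enabled tri-state buffer adds a fixed unit current to a base drive current, the delay is hyperbolic rather than linear in the number of enabled buffers, so the delay step shrinks as the control code grows. All parameter values are hypothetical, and the model ignores the extra capacitance that the added buffers contribute.

```python
def dcm_delay_ps(n_enabled: int, c_load_f: float = 10e-15, vdd: float = 1.0,
                 i_base_a: float = 50e-6, i_unit_a: float = 10e-6) -> float:
    """First-order DCM stage delay: t ~ C_L * (V_DD / 2) / (I_base + n * I_unit).

    All default parameter values are hypothetical placeholders.
    """
    i_drive = i_base_a + n_enabled * i_unit_a   # more enabled buffers -> more drive current
    return c_load_f * (vdd / 2) / i_drive * 1e12


if __name__ == "__main__":
    delays = [dcm_delay_ps(n) for n in range(6)]
    steps = [a - b for a, b in zip(delays, delays[1:])]
    print("delay (ps):", [round(d, 2) for d in delays])
    print("step  (ps):", [round(s, 2) for s in steps])  # shrinking steps = nonlinear code-to-delay
```

Because each added buffer removes a smaller slice of delay than the previous one, a uniform code step does not give a uniform delay step, which is the poor linearity noted for DCM.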

Fig. 2.1: Power profiling of ADPLL (DCO 59%, controller 31%, phase/frequency detector (PFD) 10%).

Or-and-inverter (OAI) cells have been proposed to enhance resolution through different input pattern combinations; however, linearity remains an issue [11]. Although the digitally controlled varactor (DCV) offers good resolution and linearity [18], it is hard to provide a wide operating range with only a few cells. As a result, many DCV cells are needed to maintain an acceptable operating range, which leads to large power consumption. A brief summary of the different DCO approaches is listed in Table 2.1.

Thus, we aim to propose a low-power, high-resolution, wide-range DCO with high portability. Because our research targets general µP-based systems and communication baseband processors, the operating frequency range of the proposed DCO should be easy to extend, and its maximum operating frequency need not exceed 1GHz. In addition, the power-saving target is an order-of-magnitude reduction relative to conventional designs while keeping high delay resolution. However, because we want a cell-based DCO design, the key design challenge of our research is to overcome the limitations of standard cells in building such a low-power, high-resolution, wide-range DCO.

Table 2.1: Comparisons of Different DCO Approaches

| Performance index | Driving capability modulation (DCM) [1], [17] | Or-and-inverter (OAI) cell [11] | Digitally controlled varactor (DCV) [18] |
|---|---|---|---|
| Resolution | Poor | High | High |
| Power | High | Medium | High |
| Linearity | Poor | Poor | Good |
| Operation range | Wide | Narrow | Narrow |
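Because the DCO dominates the ADPLL power profile of Fig. 2.1 (59%), reducing only the DCO power has an Amdahl-like limit on the total saving. The sketch below, with a hypothetical helper name, combines that 59% share with the order-of-magnitude DCO reduction targeted above; pairing these two numbers is an illustration, not a result reported in this work.

```python
def remaining_total_power(dco_share: float, dco_reduction: float) -> float:
    """Fraction of the original total power left when only the DCO power is
    divided by dco_reduction and all other blocks are unchanged."""
    if not 0.0 <= dco_share <= 1.0 or dco_reduction <= 0.0:
        raise ValueError("invalid share or reduction factor")
    return (1.0 - dco_share) + dco_share / dco_reduction


if __name__ == "__main__":
    share = 0.59                       # DCO share of ADPLL power (Fig. 2.1)
    for factor in (2.0, 10.0, 1e9):    # 1e9 approximates removing DCO power entirely
        left = remaining_total_power(share, factor)
        print(f"DCO power / {factor:g}: total power -> {left:.1%} of original")
```

Under this simple model, a 10x DCO reduction leaves about 47% of the original total, and even a zero-power DCO would leave the 41% consumed by the controller and PFD.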

This chapter is organized as follows. Section 2.2 describes the proposed hysteresis delay cell. Section 2.3 describes the proposed DCO architecture and circuits, including how the power consumption of the DCO is reduced. Section 2.4 discusses and analyzes performance comparisons of different DCO structures. In Section 2.5, the implementation and measurement results of the fabricated DCO chip are presented, and the overall performance is compared with state-of-the-art DCOs. Finally, a brief summary is given in Section 2.6.

2.2 Hysteresis Delay Cell

Because a DCO/DCDL usually uses many delay cells to generate the desired clock output, designing a low-power delay cell is an important issue in all-digital clock generator design. The delay cell should provide a suitable, controllable delay with low power and hardware penalty. The proposed hysteresis delay cell (HDC), which reduces gate count and loading, is therefore very suitable for all-digital clock generator applications. Fig. 2.2(a) illustrates the proposed HDCs used in the DCO; each HDC contains one inverter (INV2) and one tri-state inverter (TINV). As the state of the TINV control signals (F1ON[0] to F1ON[P-1]) changes, different delays are obtained. The operating concept of the HDC is to control the driving current to obtain different propagation delays. When the TINV of an HDC is enabled, its output exhibits a hysteresis phenomenon during the transition, which produces different delay times along the delay chain. Fig. 2.2(b) illustrates the equivalent circuit of the HDC for analysis. The propagation delay T_p from N1 to N2 is a function of the loading capacitance and the equivalent resistance of the turned-on MOS transistors [19] and is given by

$$T_p = 0.69\,C_L\left(\frac{R_{eqn}+R_{eqp}}{2}\right) \qquad (2.1)$$

where C_L is the loading capacitance of N2, and R_eqn and R_eqp are the equivalent resistances of the NMOS and PMOS transistors in the driving inverter (INV1), respectively. In normal operation, C_L remains constant. However, the equivalent resistance of a turned-on MOS transistor in INV1 varies with its saturation current and drain-source voltage and is expressed by

$$R_{eq} = \frac{1}{V_{DD}/2}\int_{V_{DD}/2}^{V_{DD}} \frac{V}{I_{DSAT}\,(1+\lambda V)}\,dV \qquad (2.2)$$

where I_DSAT is the saturation current of the transistor. When the TINV is enabled, its input (N3) does not follow the input of INV1 (N1) instantaneously, so the TINV sinks an inverse current I2 that reduces the effective driving current from I1 to I3, which enlarges the delay of the delay chain. Fig. 2.3 shows the hysteresis phenomenon of the HDC, with the signal transitions observed in SPICE simulation. Initially, N1 and N3 are high and N2 is low. As N1 changes from high to low, N2 attempts to rise from low to high. However, because N3 stays high for a while (delayed by INV2), the TINV sinks the inverse current and slows down the pull-up of N2. Thus, (2.2) should be rewritten as follows:

$$R_{eq} = \frac{1}{V_{DD}/2}\int_{V_{DD}/2}^{V_{DD}} \frac{V}{\left(I_{1DSAT}-I_{2DSAT}\right)(1+\lambda V)}\,dV \qquad (2.3)$$

That is, the effective driving current changes from I1DSAT to I1DSAT - I2DSAT when the TINV is enabled.

Fig. 2.2: (a) Proposed HDC. (b) Equivalent circuit of HDC for analysis.

Fig. 2.3: Hysteresis phenomenon of HDC (waveforms at N1, N2, and N3 with currents I1, I2, and I3 = I1 - I2).

Fig. 2.4: The relation among input voltage of TINV, effective driving current, and INV1 delay (delay in ps and driving current in µA versus controlled voltage of TINV in V).

Fig. 2.4 shows the relation among the input voltage of the TINV, the effective driving current, and the INV1 delay. As the input voltage of the TINV increases, the effective driving current of INV1 decreases, which enlarges the delay of the inverter chain. In addition, by using tri-state inverters with different driving capabilities from a given cell library, a set of different HDC delay steps can be constructed for a specified DCO requirement.
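To make Eqs. (2.1) to (2.3) concrete, the sketch below integrates the equivalent-resistance expression numerically (midpoint rule) and shows that enabling the TINV, which lowers the effective current from I1DSAT to I1DSAT - I2DSAT, scales R_eq and hence T_p by I1/(I1 - I2). All numeric values (V_DD, currents, λ, C_L) are hypothetical placeholders rather than data from this work, and for simplicity the demo assumes the TINV opposes both output transitions equally.

```python
def r_eq(vdd: float, i_dsat: float, lam: float, n: int = 10_000) -> float:
    """Equivalent resistance over the output swing V_DD/2..V_DD, Eq. (2.2):
    R_eq = 1/(V_DD/2) * integral of V / (I_DSAT (1 + lam V)) dV, evaluated with
    the midpoint rule. Passing i_dsat = I1 - I2 models Eq. (2.3)."""
    a, b = vdd / 2, vdd
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        v = a + (k + 0.5) * h                 # midpoint of the k-th sub-interval
        total += v / (i_dsat * (1 + lam * v))
    return total * h / (vdd / 2)


def t_p(c_load: float, r_eqn: float, r_eqp: float) -> float:
    """Propagation delay, Eq. (2.1): T_p = 0.69 C_L (R_eqn + R_eqp) / 2."""
    return 0.69 * c_load * (r_eqn + r_eqp) / 2


if __name__ == "__main__":
    VDD, LAM, CL = 1.0, 0.1, 10e-15           # hypothetical supply, lambda, load
    I1, I2 = 120e-6, 10e-6                    # hypothetical INV1 drive and TINV inverse current
    r_off = r_eq(VDD, I1, LAM)                # TINV disabled, Eq. (2.2)
    r_on = r_eq(VDD, I1 - I2, LAM)            # TINV enabled, Eq. (2.3)
    print(f"T_p, TINV off: {t_p(CL, r_off, r_off) * 1e12:.1f} ps")
    print(f"T_p, TINV on : {t_p(CL, r_on, r_on) * 1e12:.1f} ps")
```

Choosing tri-state inverters with different drive strengths changes I2, which is how a library of HDC delay steps can be built without adding extra delay stages.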

2.3 The Proposed DCO Architecture

Fig. 2.5 illustrates the architecture of the proposed ultra-low-power DCO. Built from standard cells, it saves power while preserving resolution. To preserve control-code resolution and operating range, the proposed DCO employs a cascaded structure for both the coarse-tuning and fine-tuning stages, which maintains control-code-to-delay linearity and makes the operating range easy to extend. Two low-power circuit design techniques are proposed here. First, the proposed segmental delay line (SDL) disables transitions in the redundant segmental delay cells (each a two-input AND gate) of the coarse-tuning stage at the target operating frequency. Second, the hysteresis delay cell (HDC) is used in the fine-tuning stage to reduce the number of short-delay cells.

Fig. 2.5: Architecture of the proposed DCO (coarse-tuning stage: segmental delay line of segmental delay cells, path-selection MUX, and decoder driven by Coarse[M-1:0]; fine-tuning stage with decoder driven by Fine[N-1:0]; signals F_IN, RESET, and DCO_OUT).

2.3.1 Coarse-Tuning Stage

Fig. 2.6 shows the proposed segmental coarse-tuning stage, which is composed of 2^M - 1 two-input AND gates that form an SDL, and a path-selection multiplexer. It provides 2^M different delay values by selecting different delay paths formed by these 2^M - 1 AND gates. In conventional path-selection delay lines [11], [12], [18], each delay cell is composed of two inverters. When the delay line must provide a higher operating frequency, a shorter delay path is selected and the remaining delay cells are not used; however, these cells are not disabled, so they still switch and waste power. To reduce power consumption as the operating frequency changes, the corresponding enable signals in EN[2^M-2:0] are set low to disable the redundant two-input AND gates.

Fig. 2.6: Proposed segmental coarse-tuning stage with SDL (taps P0 to P_{2^M-1} of the segmental delay line feed the path-selection MUX controlled by Coarse[M-1:0]; cells beyond the selected delay path are disabled through EN[2^M-2:0]).
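The enable pattern of the SDL can be modeled as a thermometer code. The sketch below assumes (our reading of Fig. 2.6, not a documented decoder) that tap P0 is F_IN itself and that tap Pk is taken after k AND gates, so selecting tap P_c needs only AND gates 0 to c-1 enabled; every gate beyond the selected path gets EN = 0 and stops toggling. The function names are hypothetical, and the conventional line is assumed to have the same number of cells, all of which toggle.

```python
def sdl_enables(coarse: int, m: int) -> list[int]:
    """Return EN[0..2**m - 2] for a segmental delay line whose selected tap is P_coarse.

    Assumed model: tap Pk is taken after k AND gates, so gates 0..coarse-1 must
    pass the clock while all later gates are disabled (EN = 0).
    """
    n_gates = 2 ** m - 1
    if not 0 <= coarse <= n_gates:
        raise ValueError("coarse code out of range")
    return [1 if k < coarse else 0 for k in range(n_gates)]


def switching_cells(coarse: int, m: int) -> tuple[int, int]:
    """(conventional, SDL) count of coarse delay cells that toggle every cycle."""
    n_gates = 2 ** m - 1
    return n_gates, sum(sdl_enables(coarse, m))


if __name__ == "__main__":
    M = 3
    for code in (1, 2, 7):
        print(code, sdl_enables(code, M), switching_cells(code, M))
```

The shorter the selected path (the higher the target frequency), the fewer cells toggle in the SDL, whereas the conventional line keeps all of them switching.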

2.3.2 Fine-Tuning Stage

The resolution of the coarse-tuning stage described above is not sufficient for typical DCO applications, so a fine-tuning stage is added. To achieve better resolution with less power consumption, the fine-tuning stage is divided into three sub-stages, as shown in Fig. 2.7. Note that the controllable range of each stage is larger than the delay step of the previous stage. As a result, the cascaded DCO structure has no dead zone larger than the LSB resolution of the DCO. The sub-stages have different delay steps: the delay cells of the 1st and 3rd stages have the largest and smallest delay steps, respectively. Therefore, the delay cell of the 3rd fine-tuning stage determines the DCO LSB resolution, while the controllable range of the 1st fine-tuning stage easily covers the delay step of the coarse-tuning stage. The proposed HDC provides a larger delay step than a DCV, so the 1st fine-tuning stage uses P HDCs in place of many DCV cells, which saves power. Because of their finer resolution, DCVs are used in the 2nd and 3rd fine-tuning stages to improve the overall resolution of the DCO. A DCV adjusts the delay by using the input state of a logic gate to control the gate capacitance that gate presents to the delay path [12], [18]. The 2nd and 3rd fine-tuning stages use Q long-delay DCV cells (two-input NAND gates) and R short-delay DCV cells (tri-state inverters), respectively.

Fig. 2.7: Proposed fine-tuning stage with HDC and DCV (1st fine-tune: HDCs controlled by F1ON[P-1:0]; 2nd fine-tune: long-delay DCVs controlled by F2ON[Q-1:0]; 3rd fine-tune: short-delay DCVs controlled by F3ON[R-1:0]).
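The no-dead-zone condition above can be checked mechanically. The sketch below describes each stage by a (step, range) pair in picoseconds and treats the cascade as ideal, with delays that simply add and no mismatch. The helper names and the coarse-stage range are illustrative assumptions. The 200 ps coarse step, the 261.34 ps fine-stage range, and the 0.97 ps fine step are the 0.13 µm values quoted in Section 2.4.2.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    step_ps: float   # delay change per control-code increment
    range_ps: float  # total controllable delay of the stage


def dead_zone_stages(stages):
    """Hypothetical helper: names of stages violating the condition
    range[i] >= step[i-1], i.e. stages whose range cannot fill the
    delay step of the preceding (coarser) stage."""
    return [cur.name for prev, cur in zip(stages, stages[1:])
            if cur.range_ps < prev.step_ps]


# Coarse range assumes 31 cells x 200 ps (hypothetical); the fine stage
# is lumped into one stage with the 0.13 um values from Table 2.2.
dco = [Stage("coarse", 200.0, 31 * 200.0), Stage("fine", 0.97, 261.34)]
print(dead_zone_stages(dco))                                   # []
print(dead_zone_stages([dco[0], Stage("fine", 0.97, 150.0)]))  # ['fine']
```

The second call shows the failure case: a fine stage spanning only 150 ps could not fill a 200 ps coarse step, so some delays would be unreachable.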

To optimize both power consumption and resolution, a strategy for sizing the sub-stages of the proposed fine-tuning stage is introduced. First, to achieve a high operation frequency, P should be kept small so that the total delay line of the fine-tuning stage does not become too long. A suitable HDC delay step is then determined by P. Second, the delay resolution is determined only by the delay step of the DCV in the 3rd fine-tuning stage, so a short-delay DCV must be selected from the cell library to meet the resolution requirement. Once this delay step is fixed, R is chosen according to the required range of the 3rd fine-tuning stage and its loading capacitance. Finally, with the delay steps of the HDC and the short-delay DCV fixed, the delay step of the long-delay DCV and the value of Q in the 2nd fine-tuning stage can be determined. Note that using the HDC reduces Q significantly, which saves power. For example, if the fine-tuning stage must cover an output delay of 260 ps, 4 HDCs are used to cover this delay range and 8 short-delay DCV cells are used to achieve high resolution. In the final step, 32 long-delay DCV cells are used to form the 2nd fine-tuning stage. As a result, the total power consumption and resolution of the proposed fine-tuning stage are 40.28 µW and 0.97 ps, respectively, at 200 MHz and 0.8 V in a 0.13 µm CMOS process.

2.4 DCO Performance Comparisons

2.4.1 Coarse-Tuning Stage Performance Comparisons

For performance comparison, we rebuild the published approaches with an in-house 0.13 µm CMOS standard cell library and compare them with our proposal. Because a DCO generally consists of coarse-tuning and fine-tuning stages, the comparisons are likewise divided into two parts.

For the coarse-tuning stage, we reconstruct the conventional path-selection delay line with two-inverter delay cells for power comparison. For a fair comparison, the conventional and the proposed segmental coarse-tuning stages have the same operation range. The simulated power consumption at different operation frequencies is shown in Fig. 2.8. Compared with the conventional approach, the proposed segmental coarse-tuning stage reduces power consumption by 70% and 25% at 500 MHz and 200 MHz, respectively. The number of disabled redundant delay cells varies with the operation frequency, so the segmental scheme achieves a different power reduction ratio at each operation frequency.

[Fig. 2.8: Power consumption (µW) vs. operation frequency (MHz) of the conventional and segmental coarse-tuning designs: 44.03 µW vs. 32.83 µW at 200 MHz (25% reduction); 106.82 µW vs. 31.54 µW at 500 MHz (70% reduction)]
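The quoted reduction ratios follow directly from the values labeled in Fig. 2.8: 44.03 µW vs. 32.83 µW at 200 MHz, and 106.82 µW vs. 31.54 µW at 500 MHz. A short arithmetic check follows. The helper name is illustrative.

```python
def reduction(conventional_uw: float, proposed_uw: float) -> float:
    """Fractional power reduction of the proposed design vs. the baseline."""
    return 1.0 - proposed_uw / conventional_uw


# (conventional, segmental) power in microwatts, as labeled in Fig. 2.8
fig_2_8 = {200: (44.03, 32.83), 500: (106.82, 31.54)}

for f_mhz, (conv, seg) in fig_2_8.items():
    print(f"{f_mhz} MHz: {reduction(conv, seg):.1%} reduction")
# 200 MHz: 25.4% reduction
# 500 MHz: 70.5% reduction
```

These values round to the 25% and 70% stated in the text.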

2.4.2 Fine-Tuning Stage Performance Comparisons

The fine-tuning stage determines many major performance indices of a DCO, such as LSB resolution, delay linearity, and power consumption. The fine-tuning comparisons therefore focus on these indices. In the cell-based design approach, many designs use DCMs or DCVs to construct the fine-tuning stage [1], [12], [17], [18]. For a fair comparison, these designs are rebuilt with similar operation range, delay resolution, and number of control bits. To ensure correct functionality, the operation range of the fine-tuning stage in every candidate must be larger than the minimum delay step of a two-input AND gate, which is 200 ps in the in-house 0.13 µm standard cell library. The rebuilt fine-tuning stages are the DCM type (Approach I) [1], [17], the DCV type (Approach II) [18], and a combination of DCM and DCV types (Approach III) [12]. Because the operation ranges must be similar, the different structures need different numbers of delay cells. For example, Approaches I, II, and III use 256, 128, and 80 tri-state inverters, respectively. In contrast, the proposed structure needs only 12 tri-state inverters, 4 inverters, and 32 two-input NAND gates. These counts follow the strategy in subsection 2.3.2, with P, Q, and R set to 4, 32, and 8, respectively.

Table 2.2: Performance Comparisons with Different Fine-Tuning Stages

Design | Resolution (ps) | Total Power (µW) | Partial Power* (µW) | Gate Count | Range (ps)
Proposed | 0.97 | 40.28 | 36.31 | 48 | 261.34
Approach I | 4.28 | 291.59 | - | 256 | 263.66
Approach II | 1.07 | 233.61 | 228.77 | 128 | 266.9
Approach III | 0.97 | 105.29 | 98.89 | 80 | 260.38

* Power consumption of the long-delay stage.
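The cell counts above can be tallied as a consistency check against the gate count of 48 in Table 2.2. The split used here is an inference, not something the text states. The 8 short-delay DCVs account for 8 of the 12 tri-state inverters, which leaves 4 tri-state inverters and 4 inverters. This sketch therefore assumes one inverter plus one tri-state inverter per HDC. The function name is hypothetical.

```python
from collections import Counter


def fine_stage_cells(p: int, q: int, r: int) -> Counter:
    """Hypothetical cell inventory of the fine-tuning stage.

    Assumption (inferred, not stated in the text): each HDC is built from
    one inverter and one tri-state inverter.
    """
    cells = Counter()
    cells["inverter"] += p                # HDC inverter (assumed)
    cells["tri_state_inverter"] += p      # HDC tri-state inverter (assumed)
    cells["nand2"] += q                   # long-delay DCV: two-input NAND
    cells["tri_state_inverter"] += r      # short-delay DCV: tri-state inverter
    return cells


cells = fine_stage_cells(p=4, q=32, r=8)
print(dict(cells), sum(cells.values()))
# {'inverter': 4, 'tri_state_inverter': 12, 'nand2': 32} 48
```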

Table 2.2 summarizes the performance comparisons, simulated at 200 MHz, 0.8 V, and the typical corner. All designs have similar LSB resolution except Approach I, but the proposed design significantly improves power consumption and area. The proposed HDC can replace many DCV cells while providing a wider operation range. This reduces the number of delay cells connected to each driving inverter, and thus the loading capacitance, which saves both power and gate count. The power reduction ratios are 86.2%, 82.8%, and 61.7% compared with Approaches I, II, and III, respectively. Fig. 2.9 also shows that our proposal combines high LSB resolution with low power compared with the other designs.

Fig. 2.9: Power and resolution comparisons of different fine-tuning designs (LSB resolution in ps vs. power consumption in µW for the proposed design and Approaches I to III).

Except for Approach I, all candidates use a short-delay DCV cell as the finest delay cell, but they use different types of long-delay stages. We therefore compare the power of the long-delay stages. The long-delay stage of Approach II uses only long-delay DCV cells, whereas our proposal uses the HDC and hence needs fewer long-delay DCV cells. As a result, the power-to-delay ratio of the long-delay stage is 0.14 µW/ps (36.31 µW/261.34 ps) for our proposal and 0.86 µW/ps (228.77 µW/266.9 ps) for Approach II. This comparison shows that the HDC-based structure provides a better power-to-delay ratio than a pure DCV structure. In other words, the HDC is more power-efficient for a given delay.

Fig. 2.10: Microphotograph and layout of the DCO test chip.

Table 2.3: Measurement Results of Step/Range of Tuning Stage

Tuning Stage | Range (ps)
Coarse-Tuning | 3726.36
1st Fine-Tuning | 296.74
2nd Fine-Tuning | 116.02
3rd Fine-Tuning | 10.26
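The fine-tuning reduction ratios and power-to-delay ratios quoted above can be reproduced from Table 2.2 with a few lines of arithmetic. The dictionary below simply restates the table's total power, long-delay-stage power, and range.

```python
# Values from Table 2.2: total power (uW), long-delay-stage power (uW), range (ps)
TABLE_2_2 = {
    "Proposed":     (40.28, 36.31, 261.34),
    "Approach I":   (291.59, None, 263.66),   # no separate long-delay stage
    "Approach II":  (233.61, 228.77, 266.9),
    "Approach III": (105.29, 98.89, 260.38),
}

proposed_total = TABLE_2_2["Proposed"][0]
for name, (total, _, _) in TABLE_2_2.items():
    if name != "Proposed":
        print(f"vs. {name}: {1 - proposed_total / total:.1%} total-power reduction")

for name in ("Proposed", "Approach II"):
    _, partial, rng = TABLE_2_2[name]
    print(f"{name}: {partial / rng:.2f} uW/ps in the long-delay stage")
```

This prints reductions of 86.2%, 82.8%, and 61.7%, followed by 0.14 and 0.86 µW/ps, matching the text.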

2.5 Experimental Results and Comparisons

Based on the frequency range and resolution required by our application, the design parameters of the proposed DCO are N = 10, M = 5, P = 4, Q = 32, and R = 8. To verify the feasibility and performance of the proposed DCO in an advanced process, a test chip has been fabricated in a 90 nm 1P9M CMOS process. The chip microphotograph and layout are shown in Fig. 2.10. The DCO output is measured with a LeCroy SDA4000A at 1 V and 25 °C (the I/O pad supply is 2.5 V). Because of the speed limitation of the I/O pads, the DCO output is divided by 2 when the DCO operates at high frequency. Table 2.3 shows the delay step and operation range of the different tuning stages of the proposed DCO. It confirms that the controllable range of each stage is larger than the step of the previous stage, and the average DCO resolution is 1.47 ps. Fig. 2.11 compares the measurement results with post-layout simulation to illustrate the linearity of the proposed DCO. The rms and peak-to-peak phase jitter at 417 MHz are 8.18 ps and 49.05 ps, respectively. As shown in Fig. 2.12, the rms and peak-to-peak phase jitter at 952 MHz are 8.24 ps and 49.95 ps, respectively, measured over 150,000 sweeps at 1 V with 60 mV supply noise.

Fig. 2.11: Comparisons of measurement and post-layout simulation results.

Table 2.4 compares the proposed DCO with state-of-the-art DCOs. The proposed DCO has the lowest power consumption of these designs, and this low-power solution incurs no performance loss. Additionally, since the proposed DCO can be implemented with standard cells, it has good portability. As a result, the proposed DCO offers better resolution, operation range, linearity, and portability.

Fig. 2.12: Jitter histogram of DCO at 952MHz.

Table 2.4: DCO Performance Comparisons

Performance Indices | Proposed DCO | JSSC'05 [15] | TCAS2'05 [18] | JSSC'04 [1] | JSSC'03 [11]
Process | 90nm CMOS | 0.18µm CMOS | 0.35µm CMOS | 0.35µm CMOS | 0.35µm CMOS
Supply Voltage (V) | 1 | 1.8 | 3.3 | 3 | 3.3
DCO Control Word Length | 15 | 5 | 15 | 7 | 12
Operation Range (MHz) | 191 ~ 952 | 413 ~ 485 | 18 ~ 214 | 152 ~ 366 | 45 ~ 510
LSB Resolution (ps) | 1.47 | 2 | 1.55 | 10 ~ 150 | 5
Power Consumption | 140µW (@200MHz) | 340µW (static only) | 18mW (@200MHz) | 12mW (@366MHz)* | 50mW (@500MHz)*
Portability | Yes | No | Yes | Yes | Yes

2.6 Summary

In this chapter, we have proposed a hysteresis delay cell (HDC) and an ultra-low-power cell-based DCO for SoC applications. The proposed HDC can be used in low-power DCOs, and it can also reduce the power consumption of digitally controlled delay lines (DCDLs). With the proposed segmental tuning structure and HDC, the power consumption of the coarse-tuning and fine-tuning stages is reduced by up to 70% and 86.2%, respectively, compared with conventional designs. Measurement results show that the proposed DCO achieves 1.47 ps resolution and 140 µW power consumption at 200 MHz. This is more than an order of magnitude lower power than conventional works. Our proposal therefore achieves lower power consumption as well as better LSB resolution and delay linearity. Moreover, because the proposed DCO is portable as a soft intellectual property (IP), it is well suited to SoC applications and system-level integration.


Chapter 3 Fast Lock-In All-Digital Phase-Locked Loop Design

3.1 Introduction

In this chapter, a fast lock-in all-digital phase-locked loop design is presented. As mentioned in Chapter 1, many applications, such as microprocessors, communication baseband processors, and multimedia systems, require a clock synthesizer or clock multiplier. Hence, the PLL has become an essential component in SoC design. To reduce the overall power consumption of SoC designs, especially in portable and mobile applications, systems commonly use power management to avoid redundant power dissipation. To support this low-power technique, the PLL should provide fast entry into and exit from power-management modes [10]. As a result, the locking time of the PLL is a very important design specification for low-power SoC applications. Locking time is also the most critical design issue for fast-locking frequency synthesizer applications, such as frequency-hopping multiple-access systems.

For fast acquisition, a traditional analog PLL must either tune the free-running frequency of the voltage-controlled oscillator (VCO) close to the desired frequency in advance or increase the loop bandwidth. However, accurate VCO tuning is difficult because of process, voltage, and temperature (PVT) variations, and a larger loop bandwidth degrades jitter performance [20]. Many researchers have focused on overcoming this structural limitation. A digital frequency-difference detector (DFDD) is proposed in [20] to convert the frequency difference directly into a digital code and then control the VCO gain adaptively. An adaptive loop-bandwidth scheme is proposed in [21] to reduce the locking time, but this architecture increases circuit complexity.

In contrast to these analog approaches, all-digital phase-locked loops (ADPLLs) using a binary search algorithm have been proposed that lock within 50 [10] and 46 [11] cycles, respectively. A binary search ADPLL achieves fast lock and also performs well compared with analog PLLs. To further reduce locking time, a time-to-digital converter (TDC) based ADPLL is proposed in [22]. This ADPLL uses a TDC to quantize the reference clock period into multiples of the inverter delay. The TDC and the DCO are subject to the same PVT variations, so the code measured by the TDC is more accurate and tracks PVT variations. However, the TDC's digital processing unit increases power consumption and design complexity.

Accordingly, the proposed ADPLL aims to achieve fast lock-in using a TDC with a small hardware penalty. Power consumption is another important design specification of an ADPLL, so the proposed fast lock-in ADPLL also uses the high-resolution, low-power DCO described in Chapter 2 to save overall power and enhance performance.


This chapter is organized as follows. Section 3.2 describes the proposed binary search ADPLL. The proposed TDC-based ADPLL for fast locking is described in Section 3.3. Section 3.4 reviews previous TDC work and presents the proposed low-complexity 2-level flash TDC. In Section 3.5, the simulation results of the proposed ADPLLs are presented and discussed. Finally, a brief summary is given in Section 3.6.

3.2 Binary Search ADPLL Overview

3.2.1 Binary Search ADPLL Architecture

Fig. 3.1 illustrates the proposed binary search ADPLL architecture. It consists of seven major functional blocks: a phase/frequency detector (PFD), two digitally controlled oscillators (DCOs), namely the tracking and average DCOs, an ADPLL controller, and three frequency dividers (pre-divider, DCO divider, and output divider). N, M, and K program the pre-divider, DCO divider, and output divider, respectively. Of the two DCOs, the tracking DCO tracks the reference clock, and the average DCO generates the output clock with small jitter through an averaging mechanism.

Fig. 3.1: Binary search ADPLL architecture (pre-divider N[2:0], DCO divider M[6:0], output divider K[2:0]; tracking DCO driven by DCO code[16:0], average DCO driven by avg_code[16:0]).
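The divider settings determine the output frequency. This sketch assumes the conventional divider topology suggested by Fig. 3.1. The PFD compares Ref_N (the reference divided by N) with DCO_M (the DCO output divided by M), so lock implies f_ref/N = f_DCO/M. The output divider K is assumed to follow the DCO. The function name and the 25 MHz example reference are hypothetical.

```python
from fractions import Fraction


def locked_frequencies(f_ref_mhz, n, m, k):
    """Hypothetical helper: (f_dco, f_out) in MHz once Ref_N and DCO_M lock.

    Lock condition: f_ref / N == f_dco / M, so f_dco = f_ref * M / N.
    The output divider (assumed to follow the DCO) gives f_out = f_dco / K.
    """
    f_dco = Fraction(f_ref_mhz) * m / n
    return f_dco, f_dco / k


f_dco, f_out = locked_frequencies(25, n=1, m=8, k=2)
print(float(f_dco), float(f_out))   # 200.0 100.0
```

Using `Fraction` keeps non-integer ratios such as M/N = 7/3 exact.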

The PFD detects the frequency difference and phase error between the divided reference clock (Ref_N) and the divided DCO output clock (DCO_M). It generates LEAD/LAG signals to slow down or speed up the DCO output frequency. When the controller receives LEAD from the PFD, it increases the DCO control code (DCO code[16:0]) to decrease the output frequency of the tracking DCO. Conversely, when the controller receives LAG, it decreases the DCO control code to increase the output frequency of the tracking DCO. These blocks form a closed loop that achieves the “phase-lock” function. For frequency synthesis applications, the controller filters the variation of the DCO control code and controls the average DCO to provide a low-jitter output clock (OUTPUT CLK). For clock multiplier applications, the in-phase clock is generated directly by the tracking DCO.

[Plot: DCO frequency band (lower bound, middle, target, upper bound) vs. cycle number 0 to 9]

Fig. 3.2: Binary search algorithm.

3.2.2 Binary Search Algorithm

The locking procedure of the binary search ADPLL is divided into two modes: frequency acquisition and phase tracking. Phase locking starts in the frequency acquisition mode, which uses a binary search algorithm to find the target frequency of the input clock. Fig. 3.2 illustrates the binary search algorithm for frequency acquisition. Initially, the DCO oscillates at the middle of the DCO frequency band, and the search step is one fourth of the band. If the output frequency is higher than the target frequency, the ADPLL controller adds the current search step to the DCO control code to lower the output frequency. Conversely, if the output frequency is lower than the target frequency, the controller subtracts the current search step from the DCO control code to raise the output frequency. Whenever the PFD output changes from LAG to LEAD or vice versa, the search step is divided by 2. Once the search step has been reduced to 1, frequency acquisition is complete.

Fig. 3.3: Flowchart of phase tracking mode (decision nodes: polarity change, SPEEDUP_COUNT = 8, STEP = 1; actions: STEP = STEP/2, STEP = STEP×2, SPEEDUP_COUNT = SPEEDUP_COUNT + 1, SPEEDUP_COUNT = 0).
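The acquisition rule above can be modeled behaviorally. This sketch rests on several assumptions not given in the text:

- The DCO is idealized, with a frequency that decreases monotonically with the control code.
- The PFD is noiseless and reports LEAD whenever the DCO frequency exceeds the target.
- The code saturates at the ends of its range.
- On a polarity change, the step is halved before the next move.

The 10-bit code width and the linear DCO model in the usage lines are hypothetical.

```python
def binary_search_acquire(freq_of_code, target_freq, n_bits=10, max_cycles=200):
    """Behavioral sketch of binary-search frequency acquisition.

    freq_of_code: idealized DCO model, assumed to decrease monotonically
    with the control code. Returns (final_code, cycles_used).
    """
    n_codes = 1 << n_bits
    code = n_codes // 2            # start at the middle of the DCO band
    step = n_codes // 4            # initial step: one fourth of the band
    prev_dir = 0
    for cycle in range(1, max_cycles + 1):
        # LEAD (DCO too fast) -> +1: raise the code to lower the frequency
        # LAG  (DCO too slow) -> -1: lower the code to raise the frequency
        direction = 1 if freq_of_code(code) > target_freq else -1
        if prev_dir and direction != prev_dir:
            step = max(step // 2, 1)                              # polarity change: halve
        code = min(max(code + direction * step, 0), n_codes - 1)  # saturate
        prev_dir = direction
        if step == 1:                  # step reduced to 1: acquisition done
            return code, cycle
    return code, max_cycles


def dco(c):
    return 1000.0 - 0.5 * c            # hypothetical linear DCO, MHz per code


for target_code in (5, 300, 512, 700, 1000):
    code, cycles = binary_search_acquire(dco, dco(target_code))
    print(target_code, code, cycles)
```

Under these assumptions, the final code lands within one LSB of the code that matches the target. The real controller's cycle-level timing may differ.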

After frequency acquisition completes, the locking procedure enters the phase tracking mode. Fig. 3.3 shows the flowchart of this mode. At the start of this mode, the speed-up count (SPEEDUP_COUNT) is set to zero. When the PFD output changes from LAG to LEAD or vice versa, the phase polarity has changed and the search step is halved. If the search direction stays the same, the speed-up count is incremented. When the count reaches the boundary value, the search step is doubled to accelerate phase tracking. If the boundary value is too large, the PLL may fail to track the input phase. If it is too small, the loop may become unstable. Based on simulation, the boundary value is set to eight.
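One cycle of this step-update rule can be sketched as a small state machine. Several details the text leaves open are assumptions here:

- The step is not halved below 1.
- The speed-up count is reset to zero on a polarity change and after each doubling.
- No upper limit on the step is applied.

The function name is hypothetical.

```python
def track_update(step: int, count: int, polarity_changed: bool, boundary: int = 8):
    """Hypothetical phase-tracking update: returns (step, speedup_count)."""
    if polarity_changed:
        return max(step // 2, 1), 0      # overshoot: halve step, restart count (assumed)
    count += 1                           # same search direction as last cycle
    if count == boundary:
        return step * 2, 0               # persistent error: double the step
    return step, count


# Eight consecutive same-direction decisions double the step once.
step, count = 1, 0
for _ in range(8):
    step, count = track_update(step, count, polarity_changed=False)
print(step, count)   # 2 0
```

A larger `boundary` makes doubling rarer, so the loop is slower to chase a drifting phase. A smaller one doubles the step more aggressively. This mirrors the trade-off described above.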

Because of the PFD dead zone and reference clock noise, the DCO control code still varies slightly even after frequency and phase are locked. To reduce jitter, the proposed ADPLL uses an averaging mechanism to eliminate such non-ideal effects. The ADPLL controller first detects the maximum and minimum of the DCO control code over 256 reference clock cycles and then takes the average of these two values. This average value becomes the average DCO control code (avg_code[16:0]) for the average DCO. Because it is free of the tracking noise, the average DCO generates a more stable, low-jitter output clock.

Fig. 3.4: TDC-based ADPLL architecture.
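The averaging step can be sketched as follows. The sketch assumes the 256-cycle window described above and uses truncating (floor) division for the midpoint. The rounding rule, the function name, and the example code sequence are assumptions.

```python
def average_code(codes, window=256):
    """Hypothetical helper: midpoint of the max and min control codes
    observed over the last `window` reference cycles (floor rounding)."""
    recent = codes[-window:]
    if len(recent) < window:
        raise ValueError("need a full observation window")
    return (max(recent) + min(recent)) // 2


# Tracking code dithering around 1000 by a few LSBs (hypothetical data).
dither = [998, 1001, 1003, 1000] * 64       # 256 samples
print(average_code(dither))                  # 1000
```

Taking the midpoint of the extremes, rather than a running mean, needs only two registers and one adder. This fits the low-complexity goal, though the text does not say how the hardware implements it.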

3.3 The Proposed TDC-Based ADPLL

Locking time is the most critical design specification for fast-locking applications. To achieve fast locking, a TDC-based ADPLL is proposed and described in this section. In the proposed architecture, the locking procedure is divided into two modes: coarse locking and fine locking. Phase locking starts in the coarse locking mode, in which the TDC quickly calculates the control code that brings the DCO closest to the desired frequency. The TDC converts the input clock period into a multiple of the delay of a delay cell, so the ADPLL controller can use this period information to jump quickly to the desired frequency. After coarse locking is completed, the ADPLL enters the fine locking mode. In this mode, the binary search algorithm described in the previous section removes the residual frequency and phase error. As a result, adding the TDC module significantly reduces the overall lock-in time.

Fig. 3.5: Counter-based TDC (a counter clocked by a high-frequency clock measures the time interval).

Fig. 3.4 illustrates the proposed TDC-based ADPLL architecture. It has several functional blocks: a TDC, a phase/frequency detector (PFD), an ADPLL controller, a DCO, and two frequency dividers (pre-divider and DCO divider). The signal DCO_M is the DCO output divided by M in the DCO divider, and Ref_N is the reference clock divided by N. Once the ADPLL is enabled, the TDC provides the coarse DCO control code (TDC_code[5:0]) to the ADPLL controller after two reference clock cycles. The DCO then generates the desired output frequency from this coarse code. After the TDC operation is completed, the PFD generates a “lead” or “lag” signal depending on the phase and frequency difference between Ref_N and DCO_M. If DCO_M leads Ref_N, the PFD generates a “lead” signal to slow down the DCO. Conversely, if DCO_M lags Ref_N, it generates a “lag” signal to speed up the DCO. When the ADPLL controller receives “lead” or “lag” from the PFD, it changes the DCO control code (DCO_code[13:0]), which in turn controls the DCO output clock (DCO_CLK). These blocks form a closed loop that achieves the “phase-locked” function.

Because the proposed TDC-based ADPLL uses the novel 2-level flash TDC, coarse locking takes only two input clock cycles. In the fine locking mode, the worst-case lock time of the binary search algorithm [11], in input clock cycles, is

$$T_L = 2\log_2\left(2^N\right) - 1 = 2N - 1 \qquad (3.1)$$

where T_L is the lock time of fine tuning and N is the number of bits of the binary search control code. In the proposed ADPLL, the DCO control code is 14 bits. As a result, the entire phase locking procedure takes 29 clock cycles: 2 cycles for the TDC operation and 27 cycles (N = 14) for fine-tuning phase locking.

[Fig. 3.6 panels: (a) n D flip-flops sampling the taps Q[0] to Q[n-1] of a buffer chain with per-stage delay t, clocked by the input clock; (b) eight-stage example in which the outputs 1 1 1 1 0 0 0 0 indicate 4t < T/2 < 5t]

Fig. 3.6: (a) Single delay chain flash TDC. (b) Operation of single delay chain flash TDC.

3.4 Time-to-Digital Converter

3.4.1 TDC Overview

Time-to-digital converters have been widely used in measurement systems, temperature sensors, and communication systems [23]-[25]. Because a TDC converts time information into a digital code, it is an essential component at the interface between analog and digital signals. Many approaches have been proposed to implement a TDC [1], [23]-[25]. The counter-based TDC uses a high-frequency clock or a multi-phase clock to sample the time interval and converts it into a multiple of the sampling-clock period, as shown in Fig. 3.5 [1]. The counter-based design concept is very straightforward, but its power consumption is very high because of the high-frequency counter.

Fig. 3.7: Vernier delay line TDC (a start chain with per-stage delay t1 and a stop chain with per-stage delay t2; flip-flop outputs Q[0] to Q[n-1]).

Another approach is the flash TDC [23], [24]. It is analogous to a flash analog-to-digital converter for voltage amplitude encoding and operates by comparing a signal edge with various reference edges displaced in time. The elements that compare the input signal with the reference are usually flip-flops. In the single delay chain flash TDC shown in Fig. 3.6(a), each buffer produces a delay equal to t. Suppose the period of the input clock is to be measured with the eight-buffer converter in Fig. 3.6(b). Each flip-flop compares a delayed copy of the first rising edge with the first falling edge of the input clock. Assuming the flip-flops are given sufficient time to resolve, the thermometer-coded output expresses the half period of the input clock as a number of buffer delays. The drawback of this implementation is that the resolution cannot be smaller than a single gate delay. In addition, when the input clock frequency is low, many flip-flops and buffers are required to cover the long clock period, which results in large power consumption and hardware cost.

To enhance resolution, the flash converter can be constructed with a Vernier delay line, as shown in Fig. 3.7 [25]. This architecture achieves a resolution of t1 − t2, where t1 > t2. However, power and area remain problems when the sampled clock has a low frequency.
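Both converters can be modeled in a few lines to show where their resolutions come from. This is an idealized sketch with no setup/hold time and no mismatch. A flip-flop outputs 1 whenever the sampled edge arrives strictly before the sampling edge. In the single-chain model, tap i is delayed by (i + 1)·t and compared with the falling edge at T/2. In the Vernier model, the start edge passes through cells of delay t1, and the stop edge, launched Δ later, passes through cells of delay t2. Stage k therefore reports 1 while k·(t1 − t2) < Δ. The function names and time values are illustrative assumptions, and integer picoseconds avoid floating-point boundary effects.

```python
def flash_tdc(half_period_ps: int, t_ps: int, n: int = 8) -> list:
    """Single delay chain flash TDC: Q[i] = 1 if the rising edge delayed by
    (i + 1) * t arrives before the falling edge at T/2."""
    return [1 if (i + 1) * t_ps < half_period_ps else 0 for i in range(n)]


def vernier_tdc(delta_ps: int, t1_ps: int, t2_ps: int, n: int = 8) -> list:
    """Vernier TDC: start passes cells of delay t1, stop (launched delta later)
    passes cells of delay t2 < t1, so stop gains t1 - t2 on start per stage."""
    if t1_ps <= t2_ps:
        raise ValueError("Vernier operation requires t1 > t2")
    return [1 if k * t1_ps < delta_ps + k * t2_ps else 0 for k in range(1, n + 1)]


# Fig. 3.6(b)-style example: T/2 between 4t and 5t gives four ones.
print(flash_tdc(half_period_ps=450, t_ps=100))               # [1, 1, 1, 1, 0, 0, 0, 0]
# A 90 ps interval measured with a 20 ps Vernier step (t1 - t2).
print(sum(vernier_tdc(delta_ps=90, t1_ps=120, t2_ps=100)))   # 4
```

The flash model can only resolve whole buffer delays (100 ps here). The Vernier model resolves steps of t1 − t2 (20 ps), at the cost of two delay chains.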

Because the proposed TDC-based ADPLL uses the TDC only to lock the input clock frequency coarsely, high resolution is not the design target of the TDC. Instead, reducing the power and circuit complexity of the TDC is the more important design issue for the fast lock-in ADPLL application.

3.4.2 The Proposed 2-Level Flash TDC

As mentioned in the previous subsection, a single-level flash TDC needs a large number of flip-flops, which increases power consumption and design cost. In contrast, the proposed 2-level flash TDC uses only 12 D flip-flops (8 + 4), as shown in Fig. 3.8, and thus has lower hardware complexity and power consumption. It consists of several functional blocks: a 1st-level flash TDC, a 2nd-level flash TDC, a delay selection multiplexer, and a period calculator. The 1st-level flash TDC consists of 4 D flip-flops and 4 large delay cells, each with eight times the delay of a small delay cell (8t). The 2nd-level flash TDC has 8 small delay cells and 8 D flip-flops. The small delay cells used in the 1st- and 2nd-level flash TDCs are the same as those used in the DCO coarse-tuning stage.

Fig. 3.8: The proposed 2-level flash TDC architecture (Level 1: four 8t delay cells sampled into Q1[3:0], with a thermometer-to-binary converter producing L1_SEL[2:0]; Level 2: eight t delay cells sampled into Q2[7:0], with a thermometer-to-binary converter producing L2_SEL[3:0]; a period calculator combines these with M[6:0] to produce the TDC code TM[6:0]).

When the TDC is enabled, Ref_N is sent to the 1st-level flash TDC and propagates through the 4 large delay cells. When the first falling edge of Ref_N arrives, the D flip-flops sample the outputs of the large delay cells, and one of these outputs is selected as the input of the 2nd-level flash TDC. The flip-flop outputs (Q1[3:0]) are also sent to a thermometer-to-binary converter to generate the 1st-level flash TDC output (L1_SEL). The 2nd-level flash TDC then generates the delay selection signal (L2_SEL) from its sampled delay outputs (Q2[7:0]). Because the outputs of both levels are thermometer codes, they can easily be converted into selection signals. After both L1_SEL and L2_SEL have been generated, the period calculator estimates the period of Ref_N as

$$T_r = \left(\mathrm{L1\_SEL} \times 8 + \mathrm{L2\_SEL}\right) \times 2 \qquad (3.2)$$

where T_r is the period of Ref_N in units of the small delay-cell delay. For example, as shown in Fig. 3.9, if the period equals 36 delay-cell delays, L1_SEL and L2_SEL are both 2. To reduce lock-in time, the TDC measures only half the period of Ref_N. The calculated value is therefore shifted left by one bit (the factor of 2 in (3.2)) to obtain the full period. The TDC thus needs only two reference clock cycles to complete its operation. In simulation with a 0.13 µm CMOS standard cell library, the TDC resolution equals the delay of one delay cell (165 ps), and the frequency error is 3.3% at 200 MHz in the lock-in state.

Fig. 3.9: Simulation of 2-level flash TDC (conversion completes in 2 clock cycles).
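The two-level conversion and equation (3.2) can be checked with an idealized model. This sketch makes three assumptions that the text does not state:

- Times are expressed in units of the small delay t.
- A flip-flop captures 1 when a tap's delay does not exceed the half period.
- The 2nd level continues from the last large-cell tap that captured 1, or from the input itself if none did.

With these assumptions, the 36t example yields L1_SEL = L2_SEL = 2. The function name is hypothetical.

```python
def two_level_flash_tdc(period_t: float, n_large: int = 4, large: int = 8, n_small: int = 8):
    """Hypothetical idealized 2-level flash TDC.

    Returns (L1_SEL, L2_SEL, Tr), all in units of the small delay t.
    """
    half = period_t / 2                      # only half of Ref_N's period is measured
    # Level 1: thermometer code from the large cells (delay 8t each)
    q1 = [1 if (i + 1) * large <= half else 0 for i in range(n_large)]
    l1_sel = sum(q1)
    base = l1_sel * large                    # delay of the selected large-cell tap
    # Level 2: small cells (delay t each) continue from the selected tap
    q2 = [1 if base + (j + 1) <= half else 0 for j in range(n_small)]
    l2_sel = sum(q2)
    tr = (l1_sel * large + l2_sel) * 2       # equation (3.2): shift left by one bit
    return l1_sel, l2_sel, tr


print(two_level_flash_tdc(36))   # (2, 2, 36)
print(two_level_flash_tdc(37))   # (2, 2, 36): half-period quantization
```

The second call shows the cost of measuring only half the period: the estimate can be off by up to 2t. With 4 large and 8 small cells, this model saturates at a half period of 40t, which bounds the slowest reference it can measure.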

In the proposed TDC-based ADPLL architecture, the frequency of Ref_N equals the frequency of the DCO output divided by M (DCO_M) once frequency is locked. The delay time of the coarse-tuning stage in the DCO therefore equals Tr divided by M. To reduce the hardware complexity of this division, we propose a novel method that approximates the division result in two steps. First, if the division ratio M is a power of two, the division is simply a right shift. If not, we extract MS, the bit position of the MSB of M (so that 2^MS < M), and ML = MS + 1. Second, Tr is shifted right by MS and by ML, and the TDC output equals the average of these two shifted values (TL and TS). For example, if M = 6, MS and ML are 2 and 3, respectively. The average of the shifted

Fig. 3.10: Transient response of binary search ADPLL (tracking DCO control code and average DCO control code).
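The shift-based approximation of Tr/M described above can be sketched as follows. This is an illustration under several assumptions: Tr and the result are unsigned integers, shifts truncate, and the average is itself computed with a right shift. The text does not state how the hardware rounds, and the function name is hypothetical. For M = 6, the sketch returns (Tr/4 + Tr/8)/2 = 3Tr/16, which is slightly larger than the exact Tr/6.

```python
def approx_divide(tr: int, m: int) -> int:
    """Hypothetical shift-only approximation of tr / m."""
    if m <= 0:
        raise ValueError("division ratio must be positive")
    ms = m.bit_length() - 1            # MSB position: 2**ms <= m
    if m == 1 << ms:                   # power of two: an exact right shift
        return tr >> ms
    ml = ms + 1                        # 2**ms < m < 2**ml
    t_ms = tr >> ms                    # tr / 2**ms (overestimates tr / m)
    t_ml = tr >> ml                    # tr / 2**ml (underestimates tr / m)
    return (t_ms + t_ml) >> 1          # average of the two shifted values


print(approx_divide(96, 8), approx_divide(96, 6), 96 // 6)   # 12 18 16
```

The result needs only two shifters, one adder, and a one-bit shift, instead of a general divider. The price is an approximation error that depends on M, for example 12.5% for M = 6.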

