National Chiao Tung University
Institute of Computer Science and Engineering
Doctoral Dissertation
TLB with Low Miss Rate in Context Switching and
Study of Implementation of Asynchronous Circuit
Student: Wei-Min Cheng (鄭緯民)
Advisor: Prof. Chang-Jiu Chen (陳昌居)
July 2009 (Republic of China year 98)
TLB with Low Miss Rate in Context Switching and
Study of Implementation of Asynchronous Circuit
Student: Wei-Min Cheng (鄭緯民)
Advisor: Chang-Jiu Chen (陳昌居)
National Chiao Tung University
Institute of Computer Science and Engineering
Doctoral Dissertation
A Thesis Submitted to the Institute of Computer Science and Engineering, College of Computer Science,
National Chiao Tung University, in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in
Computer Science
July 2009
Hsinchu, Taiwan, Republic of China
摘要 (Abstract, Chinese version)
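Embedded processors are widely used in embedded systems and handheld devices, so low power, reliability, and robustness have become the most important issues for such processors. Asynchronous circuits are arguably the best answer to these problems and are therefore well suited to implementing these processors.
As is well known, embedded processors are used to execute many different tasks. Recently, many embedded systems and handheld devices have begun to run very complex operating systems, such as embedded Linux or Windows® Mobile. To support the virtual memory mechanism of modern operating systems, translation between virtual and physical addresses must be supported, and this is regarded as one of the key factors affecting overall memory system performance. To improve address translation performance, almost all recent processors include a translation lookaside buffer (TLB). In this project, we therefore propose a TLB architecture with a low context-switch miss rate designed for embedded processors. To separate different address spaces, we adopt separate TLB banks in place of a per-entry address space tag, and we use a simple prefetching mechanism to reduce possible compulsory misses. In addition, because the TLB is designed for asynchronous embedded processors, all operations are kept simple.
Finally, we implemented the controller of this TLB in the Balsa hardware description language. Because we carefully arranged the communication channels, it was comparatively easy to verify correctness with assumed input patterns. Although verifying our implementation in this way is feasible, the approach is neither possible nor reasonable for the asynchronous embedded processor project we are currently working on, so we suggest a hardware/software co-design and cross-verification flow for our future work. A gate-level netlist was generated with the Balsa tools, and the equivalent gate count of the implementation was estimated. The results show that the implementation cost is not low: the equivalent gate count is 688,560. We also explain why we nonetheless implemented the design in this high-level asynchronous hardware description language; most importantly, it is necessary for larger asynchronous designs in the future.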
TLB with Low Miss Rate in Context Switching and
Study of Implementation of Asynchronous Circuit
_______________________________________________________
Student: Wei-Min Cheng
Advisor: Prof. Chang-Jiu Chen
Institute of Computer Science and Engineering
National Chiao Tung University
Abstract
Embedded processors are widely used in embedded systems and handheld devices. Hence, low power, reliability, and robustness have become critical issues for these processors. Asynchronous circuits may be one of the best solutions to these problems, so it may be more suitable to implement these processors with asynchronous circuits.
It is widely known that embedded processors are used to execute a variety of tasks. Recently, many new embedded systems and handheld devices have begun to run very complex operating systems, such as embedded Linux or Windows® Mobile. To support the virtual memory mechanism of modern operating systems, translation from virtual addresses to physical addresses must be supported. However, address translation is widely considered a critical issue for memory system performance. To improve address translation performance, a Translation Lookaside Buffer (TLB) is implemented inside almost all contemporary processors. In this work, we propose an alternative TLB architecture with a low context-switch miss rate for asynchronous embedded processors. We adopt a heuristic TLB banking design in place of a per-entry ASID to identify each address space. In addition, a simple prefetching mechanism is used to reduce some possible compulsory misses. Because the architecture is designed for asynchronous embedded processors, all operations are very simple.
Finally, we implemented the TLB controller for the proposed TLB architecture in Balsa HDL. Because we carefully arranged the communication channels, the implementation can be verified more easily with assumed random patterns. Although it is possible to verify our implementation in such a simple way, it is impossible and unreasonable to verify in this way the whole asynchronous embedded processor that we are currently working on. We therefore also suggest a hardware/software co-design and cross-verification flow for our future work. The gate-level netlist was generated with the Balsa tools, and the equivalent gate count of the implementation was estimated. The result shows that the cost of the implementation modeled in Balsa HDL is not low: the total equivalent gate count is 688,560. However, we also describe why we design asynchronous circuits with such a high-level asynchronous HDL: it is needed for larger designs in the future.
Acknowledgement
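From the day I entered the doctoral program until now, the time has finally come to leave NCTU. Over the years of living here I have accumulated many feelings, and I am grateful for the help and companionship of many teachers and of the many people who have come and gone around me. I have many memories of NCTU that are precious and important in the years of my life, so there is much gratitude I want to express here.
The person I most want to thank is my advisor, Prof. 陳昌居 (Chang-Jiu Chen), who has looked after and helped me all these years. Thanks to his help and care, I have benefited greatly, learned many lessons, and broadened my horizons. I thank Prof. 鄭福炯 of the Department of Computer Science and Engineering, Tatung University, and my senior labmate 黃年畤; because of their introduction, I came to understand that beyond traditional synchronous circuit design there is a wide-open world of asynchronous design. Over the years in the lab I have also always received care and assistance from 黃年畤, and beyond my studies I was able to enjoy more lively outdoor leisure activities, so that I am no longer an "otaku" (宅男) shut indoors all the time; of course, it was also my many lab partners who together changed me. I also took many courses in the NCTU Department of Computer Science, where the professors' earnest guidance let me learn more knowledge and skills, and the teachers' kindness moved me even more; for example, whenever I met Prof. 鍾崇斌, I could always feel his warm concern, for which I am truly grateful. I also thank the examiners of my dissertation proposal, Prof. 陳正 and Prof. 范倫達. Besides what I learned in Prof. 陳正's classes in the past, he gave many valuable suggestions on the direction of my dissertation. Although I never had the chance to take a course from Prof. 范倫達, in the past two or three years, because of our shared field, I regularly met him at the oral defenses of junior members of our lab; he often offered suggestions on our lab's research and gave precious comments on my dissertation, for which I am very grateful. In addition, my junior labmates 張元騰 and 蔡宏岳 gave me much support and help, allowing my project to proceed smoothly, for which I am deeply grateful. There is also 張繼文, who has already graduated; besides now being a close colleague with whom I work closely, he has been my best collaborator in research on TLBs and MMUs and now on ESL, for which I am extremely grateful. There are many other lab partners to thank, whether graduated or still studying. Truly, to the department teachers, junior students, and partners who have helped me, I have so much to say; my heart is full of feelings that words cannot fully express. I can only hold lasting gratitude in my heart for all my teachers, classmates, and junior students. Thank you all.
Contents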
摘要 (Abstract, Chinese version)
Abstract
Acknowledgement
Contents
List of Figures
List of Tables
Chapter 1: Introduction
1-1 Motivation
1-2 Introduction to asynchronous circuits
1-3 Introduction to Translation Lookaside Buffer
Chapter 2: Related Works
2-1 Recent studies of TLB
2-2 Circuit design with asynchronous circuits
2-3 Previous Asynchronous TLB or MMU Design
Chapter 3: Proposed TLB architecture for asynchronous embedded processor
3-1 Relationship between the TLB miss rate and sizes
3-2 The proposed TLB architecture
3-3 Performance evaluation of the proposed architecture
3-4 Discussions of the proposed architecture
Chapter 4: Implementation the TLB Controller with Asynchronous Circuits
4-1 Interface
4-2 The Balsa Framework
4-3 The Design with Balsa
4-4 Implementation
Chapter 5: Conclusions and Future Works
5-1 Conclusions
5-2 Future Works
5-3 Verification Issue for future work
List of Figures
Figure 1-1 Clock distribution domains and generators
Figure 1-2 Conceptual virtual memory
Figure 1-3 Virtual address translation with TLB
Figure 1-4 Page table structure of IA32e mode with 4KB page size
Figure 1-5 Virtual to physical address translation of Alpha AXP
Figure 1-6 Smart phones with Windows® Mobile OS
Figure 2-1 Structure of TLBs and cache memories of Intel® Core i7
Figure 2-2 IA32 linear address translation (4KB page)
Figure 2-3 IA32 linear address translation (4MB page)
Figure 2-4 Complete-subblock TLB with block factor 4
Figure 2-5 Promotion TLB structure & Banked promotion TLB structure
Figure 2-6 MMC example with shadow region
Figure 2-7 Reconfigurable partitioned TLB
Figure 2-8 Share tag design of TLB and cache memory
Figure 2-9 Operations of “Recency Stack”
Figure 2-10 Memory translation table of TLB with “Recency prefetching”
Figure 2-11 TLB with “Distance Prefetching”
Figure 2-12 Schematic of generic TLB prefetching hardware
Figure 2-13 CPD and per-address page tables
Figure 2-14 TLB with per-entry ASID tag
Figure 2-15 Isochronous fork
Figure 2-16 Classifications of asynchronous circuits
Figure 2-17 The 4-phase protocol
Figure 2-18 The 2-phase protocol
Figure 2-19 Bundled-data signaling model
Figure 2-20 Dual-rail data signaling model
Figure 2-21 The Muller C-element: symbol & truth table
Figure 2-22 The Muller pipeline
Figure 2-23 A three-stage 1-bit wide 4-phase dual-rail pipeline
Figure 2-24 Control circuit of micropipeline
Figure 2-25 Micropipeline architecture
Figure 2-26 Q element
Figure 2-27 The architecture of CFPP
Figure 2-28 Concept of GALS
Figure 2-29 Asynchronous pipelined 8051 architecture
Figure 2-31 Dual-rail OR gate symbol & gate-level implementation
Figure 2-32 1-bit dual-rail register
Figure 2-33 Architecture of NCTUAC18 microcontroller core
Figure 2-34 Overview of Mayers and Martin’s asynchronous MMU
Figure 2-35 Architecture of baseline asynchronous MMU
Figure 2-36 Architecture of asynchronous MMU with performance architecture
Figure 3-1 ITLB/DTLB miss rate for gcc with 4KB page
Figure 3-2 ITLB/DTLB miss rate for ijpeg with 4KB page
Figure 3-3 ITLB/DTLB miss rate for compress with different page sizes and TLB sizes
Figure 3-4 The proposed TLB architecture
Figure 3-5 ITLB miss rates for SPEC95 benchmarks
Figure 3-6 DTLB miss rates for SPEC95 benchmarks
Figure 4-1 Block diagram of the TLB interface
Figure 4-2 The Balsa design flow
Figure 4-3 The NC2P element
Figure 4-4 The S element
Figure 4-5 The Fetch component
Figure 4-6 The Sequence component
Figure 4-7 The Concurrent component
Figure 4-8 Architecture of asynchronous TLB modeled with Balsa HDL
Figure 4-9 Handshaking component graph of CU
Figure 4-10 Handshaking component graph of PA Generator
Figure 4-11 Waveform of circuit simulation
Figure 5-1 Simple VLSI design flow
List of Tables
Table 1-1 Comparisons of ARM996HS and ARM968E-S
Table 4-1 Definitions of asynchronous TLB interface
Table 4-2 Structure of each TLB entry
Table 4-3 Structure of each TLB bank tag
Table 4-4 Structure of 4-bit TLB command
Table 4-5 Data signal of TLB_hit channel
Chapter 1: Introduction
1-1 Motivation
Embedded processors and microcontrollers are widely used in a variety of embedded systems and handheld devices. Because of today's complex applications, these processors are now required to run embedded operating systems. It is therefore very important to support the virtual memory mechanism needed by modern operating systems. To provide fast address translation, a translation lookaside buffer (TLB) should now be provided inside these processors. Furthermore, because of the nature of embedded systems and handheld devices, a simple context switching model should also be provided. To reduce the address translation penalty of context switching, these processors need a well-designed TLB with a low context-switching miss ratio [1,2,3].
In addition, keeping these processors operating with high robustness and low power consumption are the two most important issues. Asynchronous circuits are widely regarded as one of the best ways to address both issues at the same time [4,5,6,7]. Thus embedded processors and microcontrollers may be well suited to implementation with asynchronous circuits. However, it is not easy to implement the TLB needed by a modern operating system for an asynchronous processor. In this work, we propose a TLB architecture with a low context-switching miss ratio that is suitable for embedded systems running only a few tasks, and we implement the TLB controller with asynchronous circuits.
1-2 Introduction to asynchronous circuits
Asynchronous chips improve computer performance by letting each circuit run as fast as it can!
By Ivan E. Sutherland and Jo Ebergen, "Scientific American", August 2002 [4]
It is widely known that synchronous circuits have several problems that must be handled carefully: clock skew, difficult clock distribution, worst-case performance, lack of modularity, sensitivity to variations in physical parameters (temperature, voltage, and process), synchronization failure, and noise (EMI). All these problems derive from the clock signal [4,5,6,7]. As VLSI-based systems become larger and more complex and run at higher clock rates, these problems become more serious than ever before.
In addition, reducing power consumption has already become one of the most important issues in large VLSI system design. It is widely known that the dynamic power dissipation is $P \propto f C V^2$ [8], where $f$ is the switching frequency, $C$ is the switched capacitance, and $V$ is the supply voltage. That is, the dynamic power consumption is proportional to the number of switching activities. To improve circuit or system performance, clock frequencies have become higher and higher, so the extra power spent in the clock distribution tree has also grown. The clock signal clearly consumes a very large proportion of the power of the whole chip. For example, the clock distribution network of the DEC (Compaq) Alpha 21064 processor consumes about 40% of the power when it runs at maximum speed [9]. Similarly, the Motorola MCORE micro-RISC processor spends 36% of its power in clock tree distribution [10]. In fact, the clock distribution network is responsible for an increasing fraction of the dynamic power consumed by modern processors and SoCs [11,12,13]. Thus, if the clock signal can be removed, the power consumption can very likely be reduced. Many techniques have been proposed and implemented to reduce power consumption, such as clock gating and dynamic voltage and frequency scaling (DVFS) [14]. Furthermore, a higher clock frequency may also make the temperature of VLSI chips very high, which is harmful for embedded systems and handheld devices. These problems are nightmares for almost all VLSI-based system development today.
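As a rough numerical illustration of the proportionality between dynamic power, frequency, capacitance, and supply voltage stated in this paragraph, the following sketch evaluates the power of a scaled operating point relative to a baseline. The function name, the normalization to a baseline, and the example scaling factors are assumptions made for illustration; they are not measurements from this thesis.

```python
def relative_dynamic_power(f_scale: float, v_scale: float, c_scale: float = 1.0) -> float:
    """Dynamic power relative to a baseline, using P proportional to f * C * V^2."""
    return f_scale * c_scale * v_scale ** 2


# Hypothetical operating point: half the clock frequency at 80% of the supply voltage.
half_speed = relative_dynamic_power(f_scale=0.5, v_scale=0.8)
print(f"relative dynamic power: {half_speed:.2f}")
```

Because the voltage term is squared, lowering the supply voltage together with the frequency reduces dynamic power faster than lowering the frequency alone, which is the idea behind DVFS.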
In contrast, asynchronous circuits can easily reduce power consumption by removing the clock signals that spread across the whole VLSI chip. With the clock replaced by handshaking protocols, asynchronous circuits offer low active power and almost zero standby power [4,5,6]. In fact, because of their data-driven nature, inactive components or parts of asynchronous circuits are automatically "shut off." Thus, asynchronous circuits can offer very good power efficiency. For example, the best-known asynchronous ARM-compatible processors, the Amulet series [15,16,17,18], show better power efficiency than their synchronous ARM counterparts. Another well-known example, the Philips asynchronous 80C51 microprocessor, is 4 times more power efficient than its synchronous counterparts [19]. Most interesting of all is the ARM996HS processor, the first commercially available synthesizable 32-bit CPU built with clockless logic [7,20]. It consumes about 2.8x less power than the clock-gated ARM968E-S core. Table 1-1 compares the ARM996HS and the ARM968E-S [7]. The table also shows that the ARM996HS operates correctly in a variety of operating environments, including at lower voltage and high temperature. In this respect, asynchronous circuits are much more robust than synchronous circuits.
In fact, designing the clock system has become a critical issue in large VLSI system design today. For example, a very complex clocking architecture is implemented in the Intel® 45nm 8-core Xeon® Enterprise processor announced at ISSCC 2009 [21]. Figure 1-1 shows its clocking architecture. The design has a total of 16 PLLs, 8 DLLs, and independent clock domains for each core and the uncore. Unfortunately, such complex designs are common today. Since the first microprocessor, the Intel® 4004, was announced in 1971, VLSI technology has improved greatly, and putting one billion transistors on a single chip has become possible. Designing the clock system for such a large system is a formidable task.
However, for several complex historical and practical reasons, almost all systems today are still implemented as designs based on a fixed clock period. While synchronous design introduces more and more problems as systems grow larger, asynchronous design may overcome these problems by avoiding the use of a clock signal. Furthermore, making IP reuse easier has become one of the most important issues in SoC design, and asynchronous circuits may be one of the best solutions. Without the influence of the clock signal, asynchronous circuits make a "software OOP" style of hardware design possible: all that designers need to know is the handshaking protocol interface [4,5,6]. This also makes each designed component or IP more reusable. With the growing mobile device and embedded system markets, all these issues need to be considered seriously. Thus, it is time to implement these systems with asynchronous circuits.
Table 1-1: Comparisons of ARM996HS and ARM968E-S
                             ARM996HS                                                 ARM968E-S
Frequency [equiv. MHz]       50 (worst, 1.08 V, 125ºC); 77 (nominal, 1.2 V, 25ºC)     100
Performance [DMIPS]          54 (worst, 1.08 V, 125ºC); 83 (nominal, 1.2 V, 25ºC)     107
Power [mW/equiv. MHz]        0.045 (nominal, 1.2 V, 25ºC)                             0.13 (nominal, 1.2 V, 25ºC)
Gate Count [NAND2 equiv.]    89K                                                      88K
1-3 Introduction to Translation Lookaside Buffer
To support the larger memory requirements of modern applications, it is important for modern operating systems (OS) to provide a virtual memory mechanism. Conceptually, with virtual memory, the movement of a program's code and data between main memory and secondary storage is handled automatically, and each program is given a single complete and contiguous "memory space." Thus, only part of a program's code and data needs to be placed in main memory, and programmers do not need to know how the code and data are arranged. Moreover, a program can even be larger than the real physical memory; in fact, the virtual address space is often much larger than the physical memory. Figure 1-2 shows the conceptual virtual memory. The virtual memory is divided into fixed-size blocks called pages, and each page has a specific page number called the Virtual Page Number (VPN). Similarly, the physical memory is divided into page frames of the same size, each with its own unique page frame number called the Physical Page Number (PPN). Through the memory mapping, each page of virtual memory can be mapped to a page frame of physical memory or to secondary storage. With appropriate hardware support, the virtual memory is carefully maintained by the OS [22].
As mentioned before, the OS is responsible for providing the mechanisms that map virtual addresses to physical addresses. However, all these virtual-to-physical address translations are stored in main memory. To reduce the cost of address translation, translation lookaside buffers (TLBs) are widely implemented inside processors [23,24,25,26,27,28]. Figure 1-3 depicts the basic design idea of a TLB. Once the virtual address (VA) is sent to the TLB, its VPN is compared with all the tag fields to find a match. On a hit, the corresponding PPN is sent out, and the physical address is generated by combining the PPN with the offset. On a miss, a page table traversal is performed, and the OS takes care of the TLB miss handling.
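The lookup just described can be sketched in a few lines. This is a minimal software model, assuming 4 KB pages and an unbounded fully associative TLB represented as a dictionary; the class name, method names, and example addresses are hypothetical and only illustrate the VPN/offset split and the hit and miss paths.

```python
from typing import Dict, Optional

PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class SimpleTLB:
    """Fully associative TLB modeled as a VPN -> PPN dictionary (illustrative only)."""

    def __init__(self) -> None:
        self.entries: Dict[int, int] = {}

    def insert(self, vpn: int, ppn: int) -> None:
        self.entries[vpn] = ppn

    def translate(self, va: int) -> Optional[int]:
        """Return the physical address on a hit, or None on a miss."""
        vpn, offset = va >> PAGE_SHIFT, va & PAGE_MASK
        ppn = self.entries.get(vpn)
        if ppn is None:
            return None              # miss: a page table traversal would follow
        return (ppn << PAGE_SHIFT) | offset


tlb = SimpleTLB()
tlb.insert(vpn=0x12345, ppn=0x00ABC)
hit = tlb.translate(0x12345678)      # VPN 0x12345, offset 0x678
miss = tlb.translate(0x54321000)     # no entry for VPN 0x54321
```

A hardware TLB compares all tags in parallel (typically with a CAM); the dictionary lookup here only stands in for that associative search.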
However, the virtual memory mechanism varies with the processor architecture and the OS implementation. The page table organization dominates the page table traversal time, which accounts for most of the TLB miss handling time. Although some architectures, such as the PowerPC architecture [29], use advanced page table organizations such as the inverted page table [22] to reduce the traversal time, the forward-mapped hierarchical page table structure is still widely used, for example in the Compaq/DEC Alpha AXP [28], AMD64, and Intel®64 [30,31,32] architectures. With this structure, several main memory accesses are needed to fetch the correct Page Table Entry (PTE) when a miss occurs; on processors with 64-bit addressing, it may even be necessary to traverse 7 levels of page tables [33]. Figure 1-4 shows the page table structure of the IA32e mode with 4KB page size of the Intel®64 architecture. If no TLBs or address caches were implemented inside these processors, four levels of tables would have to be traversed to obtain the correct PPN. Figure 1-5 shows the page table structure of the Compaq/DEC Alpha AXP [28], which has three levels of page tables. Such traversals affect the overall system performance very seriously. Thus it is important to reduce the TLB miss rate for systems with such page table structures.
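To make the cost of a four-level traversal concrete, the sketch below splits a 48-bit linear address into the four 9-bit table indices and the 12-bit offset used with 4KB pages in IA32e mode. The function and field names are illustrative assumptions; each index selects one entry in one table, so a miss that hits no intermediate cache costs one memory access per level.

```python
from typing import NamedTuple


class WalkIndices(NamedTuple):
    pml4: int    # bits 47..39
    pdpt: int    # bits 38..30
    pd: int      # bits 29..21
    pt: int      # bits 20..12
    offset: int  # bits 11..0


def split_linear_address(va: int) -> WalkIndices:
    """Split a 48-bit linear address into four 9-bit indices and a 12-bit offset."""
    return WalkIndices(
        pml4=(va >> 39) & 0x1FF,
        pdpt=(va >> 30) & 0x1FF,
        pd=(va >> 21) & 0x1FF,
        pt=(va >> 12) & 0x1FF,
        offset=va & 0xFFF,
    )


# Hypothetical address built from known indices so the split is easy to check.
example_va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
idx = split_linear_address(example_va)
```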
In addition, frequent context switching may cause extra TLB misses. Some research even shows that these misses play an important role in TLB performance [1,2,3]. Thus most processors implement some kind of address space identifier (ASID) to distinguish address spaces [25]. For example, the MIPS R10000 processor has an 8-bit ASID for each entry of its 64-entry TLB, which allows context switches without invalidating all entries [34]. An 8-bit ASID is also suggested for the SPARC architecture [35,36]. However, some processors, including the IA32 (x86) architecture, the most popular processor family today, simply flush all the TLB entries when a context switch occurs [31,1]. Unfortunately, this is still the case even for the latest IA32 processors. We treat this model as the worst-case performance. Although much research on TLBs has been done, only a few studies consider the influence of context switching. This may be because it is very hard to model and estimate the context switching activity caused by the OS, and hard to consider the issue without first considering OS behavior. In our work, we try to provide an alternative way to address the context switching issue for TLBs. To support the proposed mechanism, the OS must be modified slightly. In fact, because of architectural differences, such OS modifications for TLBs are needed on all architectures. We hope that this simple mechanism can be implemented inside asynchronous embedded processors or microcontrollers that run only a few tasks simultaneously.
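The difference between flushing on every context switch and tagging entries with an ASID can be seen in a small sketch. It assumes unbounded TLB capacity, an identity mapping in place of a real page table traversal, and a hypothetical reference trace; the class and function names are made up for illustration, and the sketch does not model the banked design proposed in this thesis.

```python
from typing import Dict, List, Optional, Tuple


class FlushingTLB:
    """No ASID: every context switch invalidates all entries (x86-style assumption)."""

    def __init__(self) -> None:
        self.entries: Dict[int, int] = {}

    def context_switch(self, asid: int) -> None:
        self.entries.clear()

    def lookup(self, vpn: int) -> Optional[int]:
        return self.entries.get(vpn)

    def fill(self, vpn: int, ppn: int) -> None:
        self.entries[vpn] = ppn


class AsidTLB:
    """Per-entry ASID tag: entries of other address spaces survive a switch."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[int, int], int] = {}
        self.current_asid = 0

    def context_switch(self, asid: int) -> None:
        self.current_asid = asid

    def lookup(self, vpn: int) -> Optional[int]:
        return self.entries.get((self.current_asid, vpn))

    def fill(self, vpn: int, ppn: int) -> None:
        self.entries[(self.current_asid, vpn)] = ppn


def count_misses(tlb, trace: List[Tuple[int, int]]) -> int:
    """trace holds (asid, vpn) pairs; a change of asid models a context switch."""
    misses, current = 0, None
    for asid, vpn in trace:
        if asid != current:
            tlb.context_switch(asid)
            current = asid
        if tlb.lookup(vpn) is None:
            misses += 1
            tlb.fill(vpn, ppn=vpn)   # identity mapping, placeholder for a page walk
    return misses


trace = [(1, 10), (1, 11), (2, 10), (1, 10), (1, 11), (2, 10)]
flush_misses = count_misses(FlushingTLB(), trace)
asid_misses = count_misses(AsidTLB(), trace)
```

On this trace the flushing model misses on every reference after a switch, while the tagged model misses only on the first touch of each (address space, page) pair.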
To estimate the performance of the proposed architecture, we ran simulations with a modified version of the SimpleScalar Version 3.0d tool suite [37], provided by SimpleScalar LLC, using the SPEC95 benchmarks. In addition to a traditional 1024-entry fully-associative TLB under the x86-style assumption, we also evaluate a 1024-entry fully-associative TLB with ASID, and two different prefetching mechanisms incorporated into our proposed design. The results show that our banked design works very well with sequential prefetching (SP, also called linear prefetching).
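The idea of sequential prefetching can be sketched as follows. This is a generic prefetch-on-miss variant written under stated assumptions: the trigger policy, the unbounded dictionary TLB, and the placeholder `page_table_walk` mapping are all hypothetical, and the exact policy evaluated in this thesis is described in Chapter 3 rather than here.

```python
from typing import Callable, Dict


def access_with_sequential_prefetch(
    tlb: Dict[int, int], vpn: int, walk: Callable[[int], int]
) -> bool:
    """Return True on a hit. On a miss, fill the entry and also prefetch vpn + 1."""
    if vpn in tlb:
        return True
    tlb[vpn] = walk(vpn)
    tlb[vpn + 1] = walk(vpn + 1)     # sequential (linear) prefetch of the next page
    return False


def page_table_walk(vpn: int) -> int:
    """Placeholder for a real page table traversal (assumed mapping)."""
    return vpn + 0x100


tlb: Dict[int, int] = {}
hits = [access_with_sequential_prefetch(tlb, v, page_table_walk) for v in (5, 6, 7, 8)]
```

For a program that walks through consecutive pages, every second first-touch reference becomes a hit, which is how prefetching can remove some compulsory misses.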
Our work attempts to realize TLB controllers for asynchronous embedded processors or microcontrollers with a low TLB miss rate under context switching. Although most processors reduce the miss rate caused by context switching with ASIDs, our work provides an alternative. There are several reasons for the proposed architecture. These embedded systems execute only a few tasks at the same time, so there is no real need to store many ASIDs. This is also why the TLB of the StrongARM SA-1100 processor is designed without ASIDs [2,3]. These processors are not designed for desktop use; Figure 1-6 shows smart phones that run the Windows® Mobile OS. In fact, because we wish to implement such a TLB for asynchronous embedded processors or microcontrollers, using fewer tag bits may be more important than other issues. In addition, we discuss why sequential prefetching is more suitable for the proposed design. Moreover, we will try to realize this design on the asynchronous processor that we are currently working on. It should not be too hard to realize the proposed mechanism on an asynchronous processor, with some extra handshaking protocols, in a bundled-delay or self-timed design.
Figure 1-2: Conceptual virtual memory
Figure 1-4: Page table structure of IA32e mode with 4KB page size
1-4 Summary
As mentioned in the previous sections, new embedded systems and handheld devices have begun to run modern operating systems. It is therefore increasingly important for these processors to provide efficient address translation, and a well-designed TLB has become one of their critical performance issues. In addition, because embedded systems and handheld devices may operate in a variety of environments, robustness and reliability are two of the most important issues for these processors. Asynchronous circuits can address these issues. However, lacking an address translation mechanism, most asynchronous processors do not support virtual memory directly. To support virtual memory on asynchronous processors, an asynchronous TLB controller must be implemented. Thus, in this thesis, we propose a TLB architecture for future asynchronous embedded processors and model it with Balsa HDL. The main contributions of this thesis are as follows.
An extensive survey of TLB studies
An extensive survey of asynchronous circuits, with a detailed introduction to designing with asynchronous circuits
A study of TLB performance issues in context switching
A new alternative TLB architecture with a low context-switching miss rate for asynchronous embedded processors
A study of implementing the proposed TLB architecture with asynchronous circuits
Confirmation that a TLB controller can be designed for asynchronous processors
Chapter 2: Related Works
In this chapter, we discuss related work on both TLB design and asynchronous system and circuit design. Because there is only a little research specifically on TLB design for asynchronous processors, we discuss the two topics separately. Finally, case studies of asynchronous MMU and TLB designs are discussed.
2-1 Recent studies of TLB
As mentioned in Chapter 1, the TLB plays an important role in the overall performance of processors that support virtual memory, so a great deal of research has been done. Moreover, because of differences in architecture and addressing modes, real implementations may differ greatly. The design requirements may even vary with different page modes or new addressing modes on the same processor. Interestingly, however, the TLB designs of most commercial processors are not very complex; most are not implemented with complex algorithms or architectures. The key issue in these designs is to reduce the TLB search time. Related TLB research is described in the following paragraphs, classified into traditional techniques, advanced techniques, and work on reducing the TLB context-switching miss rate.
2-1-1 Traditional Techniques
Because the TLB is in fact part of the memory hierarchy and can be considered a specially designed cache for page table entries, it is intuitive that traditional techniques for improving cache performance can also be applied to the TLB. This also means that the 3Cs model of misses [28] applies to the TLB as well. In fact, these techniques are now widely used in commercial processors in different ways.
To reduce the TLB miss rate, most processors increase the size (total number of entries) of fully or set-associative TLBs. For example, the recent AMD Opteron™ processor has a 512-entry L2 instruction TLB (ITLB) and a 512-entry L2 data TLB (DTLB) [38], and the IBM POWER4 processor has a common 1024-entry TLB for each processor core [39]. Furthermore, some processors provide multi-level TLBs, such as the 2-level ITLB/DTLB design of the recent AMD Opteron™ processor [38] and of each core (Nehalem architecture) of the Intel® Core i7 processor [31]. Figure 2-1 shows the TLB design of the Intel® Core i7 processor: each core has separate instruction and data TLBs with a unified second-level TLB (STLB). In addition, some processors provide larger page sizes to increase the TLB span, such as the 2MB or even 4MB page sizes on all Intel IA32 processors since the Pentium® Pro processor [40]. The Intel IA64 architecture offers page sizes from 4KB to 256MB, as well as 4GB [41]. The AMD64 architecture provides 4KB, 2MB, 4MB, and up to 1GB page sizes [42]. Larger page sizes have several advantages. First, because fewer page table entries are needed, the page tables become smaller. Second, larger physically addressed caches become possible. Third, because each page maps a larger memory region, fewer page tables and TLB entries are needed. Finally, because the number of page table levels can be decreased, fewer main memory accesses are needed to generate the correct physical page number when a TLB miss occurs. Figure 2-2 shows the page table structure of the IA32 architecture with 4KB page size, and Figure 2-3 shows the structure with 4MB page size [31]. It is easy to see that a larger page size decreases the number of page table levels.
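The effect of page size on the TLB span (the amount of memory that can be mapped without a miss) is simple arithmetic: the number of entries multiplied by the page size. The sketch below works this out for a 512-entry TLB; the function name and the choice of entry count and page sizes are illustrative assumptions, not a description of any particular processor configuration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered by a TLB when every entry maps one page of the given size."""
    return entries * page_size_bytes


reach_4k = tlb_reach_bytes(512, 4 * KB)   # 512 entries of 4 KB pages
reach_2m = tlb_reach_bytes(512, 2 * MB)   # 512 entries of 2 MB pages
print(reach_4k // MB, "MB;", reach_2m // GB, "GB")
```

With the same number of entries, moving from 4KB to 2MB pages increases the span by a factor of 512.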
Figure 2-2: IA32 Linear address translation (4-KByte page)
Figure 2-3: IA32 linear address translation (4-MByte page)
2-1-2 Advanced Techniques
As mentioned in the previous section, most contemporary processors provide several page sizes, from 4KB up to very large sizes. Some even allow pages of different sizes to coexist, using an augmented page table entry format; this certainly needs extra OS support. Small pages save memory space, because larger pages waste memory through internal fragmentation. In addition, with a small page size, the startup time of a small program is shorter. To provide several page sizes, some commercial designs put several TLBs inside the processor, one for each size, while others modify the TLB entry format so that one TLB can be shared by different page sizes.
In addition to what we mentioned in previous paragraph, several interesting mechanisms are proposed to support superpaging. Several base pages with both virtual and physical address alignment can be merged into a larger page called superpage at run time [43,44,45,46]. With superpage mechanism, the internal fragmentation problem can be resolved. However, to support a superpage, very complex OS and hardware interactions are needed. Furthermore, the virtual and physical memory space aligned limitation seriously impacts the usage of a superpage. Hence, some studies have focused on overcoming the limitation by dynamically supporting the superpage mechanism. Talluri et al. described an advanced method called the complete-subblock which allows a single TLB block to map to multiple base pages without any special OS support [43,44]. In addition, they also described a much smaller design called the partial-subblock which shares PPN and attribute fields across base page mappings. Figure 2-4 shows a complete-subblock TLB block (entry) with factor 4. Lee et al. proposed a novel banked-promotion TLB structure to support two page sizes dynamically [47]. Four 4KB pages can be promoted to a 16KB superpage. To support such mechanism, an interesting promotion TLB was designed. The heuristic promotion algorithm can promote four consecutive entries from small-page TLB bank to large-page TLB bank. Thus, the four 4KB TLB entries can be reused. Furthermore, in order to reduce the power consumption and TLB reference latency, they even divided the TLB for 4KB page into two banks [48]. Figure 2-5 shows the structures of their promotion TLB and banked-promotion TLB. In addition, Swanson et al. presented a novel memory controller (MMC) which can aggressively create superpages even from non-contiguous and unaligned regions of physical memory space [49,50]. Figure 2-6 depicts this design. 
In this design, they suggested using a portion of the unused physical address range to virtualize physical memory in their proposed MMC. The shadow pages are “shadows” of the accessed pages, and the MMC can remap them to real physical addresses. The TLB reach can be extended via a novel Memory Controller TLB (MTLB). Thus superpages can be aggressively created from non-contiguous and unaligned regions of physical memory. Park et al. proposed a way to integrate the partial-subblock TLB with the MMC to improve TLB performance [51]. They also proposed a method called the Variable-Size Subblock TLB (VS-TLB), which extends the original subblock TLB to support subblocks of multiple sizes. Based on the original subblock TLB design, they added a subblock size (SS) field to each entry. With this extension, the total TLB reach grows with the maximum subblock size. Improving the TLB performance of superpaging remains an active research topic.
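The complete-subblock idea described above can be illustrated with a minimal Python sketch. It is not taken from Talluri et al.; the 4KB base page size, the class and field names, and the unbounded capacity are assumptions made only for illustration. One tag covers four consecutive base pages, and each base page keeps its own PPN and valid state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PAGE_SHIFT = 12        # assumed 4KB base pages
SUBBLOCK_FACTOR = 4    # subblock factor 4, as in Figure 2-4


@dataclass
class SubblockEntry:
    """One complete-subblock entry: a single tag, one PPN slot per base page."""
    tag: int  # VPN with the low subblock-index bits removed
    ppn: List[Optional[int]] = field(
        default_factory=lambda: [None] * SUBBLOCK_FACTOR)  # None = invalid


class CompleteSubblockTLB:
    """Hypothetical model; replacement and capacity limits are ignored."""

    def __init__(self) -> None:
        self.entries: Dict[int, SubblockEntry] = {}

    def insert(self, vpn: int, ppn: int) -> None:
        tag, sub = divmod(vpn, SUBBLOCK_FACTOR)
        entry = self.entries.setdefault(tag, SubblockEntry(tag))
        entry.ppn[sub] = ppn  # each base page keeps its own PPN

    def translate(self, vaddr: int) -> Optional[int]:
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        tag, sub = divmod(vpn, SUBBLOCK_FACTOR)
        entry = self.entries.get(tag)
        if entry is None or entry.ppn[sub] is None:
            return None  # block miss or subblock miss
        return (entry.ppn[sub] << PAGE_SHIFT) | offset
```

In this sketch, two neighbouring virtual pages with unrelated physical frames still share one TLB block, which is why no physical alignment is needed.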
Besides the research described above, some different and interesting studies can also be found. Channon et al. presented reconfigurable partitioned TLBs to improve TLB performance [52]. They claim that the traditional split instruction and data TLB design is not suitable for unpredictable memory reference patterns, and that a reconfigurable partitioned TLB can reduce the misses between distinct reference types. The reconfigurable partitioned TLB can dynamically adjust the position of the partition in real time. Figure 2-7 shows this design. In addition, some research focuses on the low-power issue. Besides architectural improvements that reduce power consumption, such as banking techniques, some studies even redesign the basic circuit element itself. For example, Juan presented low-power CAM and SRAM cell designs [53]. The same work also studied the relationship between the power consumption and the associativity of the TLB, and concluded that a small, fully associative TLB implemented with the modified cells saves more power. Because the TLB is part of the memory hierarchy, some research tries to integrate the TLB with the cache memory. Among these studies, Lee et al. proposed an interesting way to reduce the tag memory of the cache [54,55]. The design shares one tag memory between the TLB and the cache. A CAM is still used as the tag memory of the TLB, but the cache shares the same tag memory: the index tag memory of the cache now stores only the encoded index of an entry in the shared tag memory rather than the PPN. Thus, the total tag memory size can be reduced. Figure 2-8 shows this design. In addition to these hardware efforts, many software efforts can be found. Instead of hardware-managed TLBs, software-managed TLBs are widely used in many newer RISC processors, such as the SPARC, Alpha AXP, PA-RISC and MIPS architectures [23,25]. In fact, there is still a wide variety of other studies of TLBs and virtual memory.
Although many new TLB designs have been proposed, only a few studies have focused on prefetching or preloading TLB entries. Saulsbury introduced an interesting mechanism, called Recency-based TLB Preloading (RP), which prefetches TLB entries according to the “recency” of the referenced pages [56]. The mechanism maintains a “Recency Stack” of the recently referenced pages, using augmented translation table entries in memory together with the TLB inside the processor. Thus the page number that is likely to be referenced next can be prefetched. Figure 2-9 (a) shows how the stack inside the processor changes if the TLB reference is a hit. Because it is a TLB hit, the recency of the translation table entries (TTEs) in the translation table does not change. Figure 2-9 (b) depicts how the “Recency Stacks” of both the TLB and the translation table change if the TLB reference is a miss. After the missed TTE is moved to the top of the TLB stack, the recency of both the TLB entries and the translation table entries changes according to their positions in the recency stack. Finally, the TTEs whose recency is ± 1 relative to the missed TTE can be prefetched into the prefetch buffer inside the processor. It should be noted that, in a real implementation, the positions of all TTEs in the “Recency Stack” are maintained by the previous and next pointers of each TTE. Figure 2-10 shows the implementation of the translation table in memory. However, this mechanism may increase the memory traffic, and the PTE format must be changed to store the stack pointers of the linked list. To solve these possible problems, Kandiraju proposed a new prefetching technique, called Distance Prefetching (DP), which is based on the “distance” (stride) between recently referenced pages [57]. The mechanism maintains a table that keeps track of the differences between successive address references and prefetches according to the predicted distance. Figure 2-11 shows the implementation of a TLB with the DP technique. The paper also presents generic prefetching hardware and compares DP with other possible prefetching techniques that borrow ideas from cache prefetching, such as Sequential Prefetching (SP), Arbitrary Stride Prefetching (ASP) and Markov Prefetching (MP). Figure 2-12 shows the schematic of the generic prefetching hardware. Because of the implementation costs, we focus on studying SP and DP in our work.
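The distance-based prediction described above can be sketched as follows. This is only an illustrative Python model, not the hardware of [57]; the table organisation, the number of predicted distances per row (`slots`) and all names are assumptions. Each row is indexed by the last observed distance and stores the distances that followed it in the past.

```python
from collections import defaultdict, deque
from typing import List, Optional


class DistancePrefetcher:
    """Hypothetical software model of a distance-indexed prediction table."""

    def __init__(self, slots: int = 2) -> None:
        # distance -> the most recent distances that followed it
        self.table = defaultdict(lambda: deque(maxlen=slots))
        self.prev_page: Optional[int] = None
        self.prev_distance: Optional[int] = None

    def access(self, page: int) -> List[int]:
        """Record a reference to `page` and return the pages to prefetch."""
        predictions: List[int] = []
        if self.prev_page is not None:
            distance = page - self.prev_page
            if self.prev_distance is not None:
                row = self.table[self.prev_distance]
                if distance not in row:
                    row.append(distance)  # learn: prev_distance -> distance
            # Predict the next pages from the distances that followed
            # the current distance in the past.
            predictions = [page + d for d in self.table.get(distance, ())]
            self.prev_distance = distance
        self.prev_page = page
        return predictions
```

For a reference stream whose page distances alternate between 2 and 1, the sketch starts predicting correctly after the pattern has been seen once. Sequential Prefetching corresponds to the much simpler rule of always prefetching page + 1.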
Figure 2-5: Promotion TLB structure & Banked-promotion TLB structure (entry attribute fields: valid, access, dirty, read-only, supervisor)
Figure 2-7: Reconfigurable partitioned TLB
Figure 2-10: Memory translation table of TLB with “Recency Prefetching”
Figure 2-12: Schematic of generic TLB prefetching hardware
2-1-3 Reducing TLB Miss Rate in Context Switching
As mentioned in previous sections, handling a TLB miss requires several main memory accesses, which seriously impacts the overall performance. In traditional designs, however, the simplest way for the TLB to deal with context switching (address space switching) is to flush all TLB entries. The situation is therefore even worse when the misses are caused by flushing the TLB at a context switch: after the flush, the TLB needs a long “learning time” to refill its entries. Nevertheless, only a few studies focus on this topic. Liedtke tried to reduce the possibility of TLB flushing at address-space switches by integrating the segmentation mechanism of x86 [1]. Based on L4, Wiggins and Heiser tried to avoid reloading the translation table base register by using a pointer register that points to a caching page directory (CPD) [2,3]. The basic idea of this implementation is as follows. The CPD contains entries from a number of different address spaces, each of which is defined by its own page table. Once a TLB miss occurs, the hardware only needs to reload the TLB by indexing into the CPD, which contains pointers to the LPTs (Leaf Page Tables, each an array of 256 PTEs) of the various address spaces. If the CPD lookup misses, the handler indexes the PD (page directory) of the current thread to find a valid entry and copies that entry into the CPD. The handler then restarts the thread. Finally, the hardware can reload the TLB, and this time a valid page table entry should be found. Figure 2-13 depicts this basic idea. In fact, much other research also tries to reduce the possibility of flushing by modifying the OS or the page table structures. Besides these software solutions, the basic method supported by TLBs is to provide an address space identifier (ASID) in each entry to identify its address space. Figure 2-14 shows a TLB with a per-entry ASID tag.
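The difference between flushing the TLB and tagging entries with an ASID can be illustrated with a small Python model. It is a sketch under simplifying assumptions (unbounded TLB capacity, a flat dictionary as the page table, hypothetical names) and does not model any particular processor.

```python
from typing import Dict, Tuple


class TLB:
    """Hypothetical TLB model: flush on context switch, or per-entry ASID tags."""

    def __init__(self, use_asid: bool) -> None:
        self.use_asid = use_asid
        self.current_asid = 0
        self.entries: Dict[Tuple[int, int], int] = {}  # (asid, vpn) -> ppn
        self.misses = 0

    def context_switch(self, asid: int) -> None:
        if not self.use_asid:
            self.entries.clear()  # traditional design: flush everything
        self.current_asid = asid

    def translate(self, vpn: int, page_table: Dict[int, int]) -> int:
        key = (self.current_asid if self.use_asid else 0, vpn)
        if key not in self.entries:
            self.misses += 1  # miss: refill from the page table
            self.entries[key] = page_table[vpn]
        return self.entries[key]


def count_misses(tlb: TLB) -> int:
    """Run address space 1, then 2, then 1 again, touching four pages each."""
    tables = {1: {v: 100 + v for v in range(4)},
              2: {v: 200 + v for v in range(4)}}
    for asid in (1, 2, 1):
        tlb.context_switch(asid)
        for vpn in range(4):
            assert tlb.translate(vpn, tables[asid]) == tables[asid][vpn]
    return tlb.misses
```

With this workload, the flushing TLB pays the refill misses again when the first address space resumes, whereas the ASID-tagged TLB only misses on the first touch of each page in each address space.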
Figure 2-13: CPD and per-address page tables
2-2 Circuit design with asynchronous circuits
The technological trend is inevitable:
In the coming decades, asynchronous design will become prevalent!
By Ivan E. Sutherland and Jo Ebergen
"Scientific American", August 2002 [4]
Asynchronous circuits have been studied since the early 1950s; however, synchronous circuits still dominate the mainstream of digital circuit design [4,6]. Recently, some academic and commercial research has shown that it is worthwhile to implement real-life systems with asynchronous circuits. However, the absence of the global synchronization signal called the “clock” makes asynchronous circuit design very difficult. To replace the clock signal, handshaking protocols between the parts of an asynchronous circuit are needed, which makes the circuit costs of asynchronous circuits much higher than those of their synchronous counterparts. In addition, because of the lack of tools and of standardized implementation and design models, there is still not much research on asynchronous circuits, and that limits their application in commercial products. In fact, it is very hard to find commercial products that are implemented with asynchronous circuits. In this section, we discuss asynchronous circuits in terms of their classification, handshaking protocols, related research, and a case study of implementation with asynchronous circuits.
2-2-1 Classifications of Asynchronous Circuits
We have discussed many issues of asynchronous circuits, but one may still ask what asynchronous circuits are. In fact, this question is not very hard to answer: asynchronous circuits are circuits without any global synchronization signal called a “clock.” Based on this definition, asynchronous circuits can be classified into four classes depending on the delay model of the gates and wires of the circuit. The four classes are Delay-Insensitive (DI) circuits, Quasi-Delay-Insensitive (QDI) circuits, Speed-Independent (SI) circuits, and Self-Timed (ST) circuits.
Delay-Insensitive (DI) circuits are the most robust and reliable circuits of all. This class of circuits permits arbitrary (unbounded but finite) delays on gates and wires. The basic concept of DI circuits derives from Clark’s “Macromodular computer systems” proposed in 1967 [58]. However, because arbitrary delays are allowed on both gates and wires, only a few circuits belong to this class, as Martin proved in 1990 [59]. Thus, designing DI circuits is subject to enormous limitations.
Because pure DI circuits are too hard to implement, Quasi-Delay-Insensitive (QDI) circuits slightly relax the assumption of arbitrary wire delays. QDI circuits are DI circuits with isochronous forks, which means that all branches of a forked wire are assumed to have exactly the same wire delay [60]. Figure 2-15 shows an isochronous fork. In this example, the signal from A propagates to both B and C with the same wire delay. This assumption makes DI-style circuits more practical. In fact, in order to meet the DI and QDI constraints, the implementation costs of these circuits may be higher. In addition, they must be implemented carefully to avoid violating the constraints, so implementing such circuits is very difficult. However, because they require no extra delay assumptions, DI and QDI circuits may be attractive for asynchronous VLSI circuit synthesis [60].
The concept of Speed-Independent (SI) circuits was first proposed by David Muller in 1959 [59,60]. This class of circuits allows arbitrary (unbounded but finite) delays on gates but assumes zero wire delays. SI circuits can be modeled with Petri nets [63].
Self-Timed (ST) circuits are popular in many asynchronous circuit implementations. They were introduced by Seitz in 1980 [64]. An ST circuit is composed of a group of ST elements, each of which lies inside an “equipotential region.” The wire delays within a region are negligible or well-bounded. The elements can be DI, QDI, or SI circuits, or circuits that operate correctly under some local timing assumptions. There is no timing assumption on the communication between regions, which means that this communication is DI. For example, Chang et al. proposed an ST torus network with 1-of-5 DI encoding in 2009 [65]. The implementation uses DI-encoded communication between the parts of the whole design.
Figure 2-16 shows the relationship among these models of asynchronous circuits. A design that contains both DI components and ST components is classified as an ST circuit.
Figure 2-15: Isochronous fork
Figure 2-16: Classifications of asynchronous circuits
2-2-2 Handshaking Protocols
Without a clock to govern its actions,
an asynchronous system must rely on local coordination circuits instead!
By Ivan E. Sutherland and Jo Ebergen
"Scientific American", August 2002 [4]
Without a clock signal, asynchronous circuits rely on handshaking protocols to ensure the correctness of circuit operations [5,66,67]. The protocols can be divided into control signaling and data encoding; a complete handshaking protocol is a combination of the two. Figure 2-17 shows the 4-phase handshaking protocol. In this protocol, only the rising edge is a valid active transition; thus it is a level-signaling or return-to-zero protocol. In contrast, in the 2-phase handshaking protocol, both the falling and the rising edges of request and acknowledge are active events; thus it is a transition-signaling or non-return-to-zero protocol. However, transition signaling makes the circuits, especially datapath circuits, very complex and hard to implement. Figure 2-18 shows the 2-phase handshaking protocol. In addition to control signaling, there are also choices for how to encode data (the data signaling protocol). The bundled-data scheme, also called single-rail, uses separate request and acknowledge wires that are bundled with the data signals. Thus a total of n+2 wires is required to send n-bit data. Figure 2-19 shows the bundled-data model. Besides the bundled-data model, there are data encoding methods for DI circuits. For implementation reasons, dual-rail encoding is the most popular DI data encoding scheme. To represent 1-bit data in the dual-rail encoding, two physical wires are used. For example, a valid data bit D is represented by two physical data wires, d.0 and d.1, as follows:
(1) D = 0 : (d.0,d.1) = (0,1)
(2) D = 1 : (d.0,d.1) = (1,0)
In particular, (0,0) represents a spacer, which makes it possible to distinguish consecutive 0's or 1's. The (1,1) state is not used. A data transfer starts from the (0,0) state (called “null” or “empty” data). A change of state from (d.0,d.1) = (0,0) to (0,1) or (1,0) signals the arrival of the valid data '0' or '1', respectively. Thus a total of 2*n wires is needed to transfer n-bit data. Figure 2-20 shows the dual-rail model.
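The dual-rail scheme with its (0,0) spacer can be sketched in Python as follows. The code mirrors the encoding table given in the text; the function names and the list-of-tuples representation of the wire states are assumptions made for illustration, and the acknowledge wire is not modeled.

```python
from typing import List, Tuple

Wires = Tuple[int, int]            # (d.0, d.1) as written in the text
SPACER: Wires = (0, 0)             # "null" or "empty" data
ENCODE = {0: (0, 1), 1: (1, 0)}    # encoding (1) and (2) from the text
DECODE = {wires: bit for bit, wires in ENCODE.items()}


def send(bits: List[int]) -> List[Wires]:
    """Return the sequence of wire states: each codeword returns to the spacer."""
    stream = [SPACER]
    for bit in bits:
        stream.append(ENCODE[bit])
        stream.append(SPACER)
    return stream


def receive(stream: List[Wires]) -> List[int]:
    """Recover the bits: a spacer-to-codeword change marks valid data."""
    bits = []
    previous = SPACER
    for state in stream:
        if state == (1, 1):
            raise ValueError("(1,1) is not a legal dual-rail state")
        if previous == SPACER and state != SPACER:
            bits.append(DECODE[state])
        previous = state
    return bits


def wires_needed(n_bits: int, dual_rail: bool) -> int:
    """Wire counts as stated in the text: 2*n for dual-rail, n+2 for bundled data."""
    return 2 * n_bits if dual_rail else n_bits + 2
```

Because every codeword is separated by a spacer, the receiver can tell two consecutive 0's apart without any request wire or timing assumption.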
Figure 2-18: The 2-phase protocol
Figure 2-19: Bundled-data signaling model
Figure 2-20: Dual-rail data signaling model
2-2-3 Research of Asynchronous Circuits
We introduce the asynchronous pipeline models first, because most asynchronous systems are designed or implemented based on these models. David Muller proposed his famous Muller C-element and Muller pipeline (also known as the Muller distributor) in 1959 [68,69]. A Muller pipeline is a naturally simple and elegant handshaking control model. The simplest form of the Muller pipeline consists mainly of C-elements and inverters. Figure 2-21 shows the schematic symbol and truth table of a two-input C-element. If both inputs are high, the output goes high; if both inputs are low, the output goes low; otherwise, the output keeps its previous value. Figure 2-22 shows the original Muller pipeline model. To understand its behavior, consider the ith C-element, Ci. In the initial state, all C-elements are initialized to 0, and handshaking can begin. Ci can propagate a 1 from its previous stage, the (i-1)th C-element, only if the C-element of the next stage (Ci+1) is 0. Thus, the signal propagates from one stage to the next. It should be noted that the original single-rail model is based on the bundled-data model; thus the request signal must be propagated via a matching delay, as shown in Figure 6. In fact, this delay matching must be handled carefully in every bundled-data design. The pipeline can also be constructed as a 4-phase dual-rail model, as shown in Figure 2-23 [66]. This model can be considered as two Muller pipelines connected in parallel with a common acknowledge signal per stage. We implemented a QDI 8-bit NCTUAC18 microcontroller core based on the 4-phase dual-rail pipeline in 2009 [70].
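The behavior of the C-element and of the Muller pipeline control can be checked with a small Python model. This is a behavioral sketch only: gate delays are ignored, the signals are settled by simple iteration, and the class and parameter names (`req_in`, `ack_out`) are assumptions rather than names from the cited work.

```python
from typing import List


def c_element(a: int, b: int, previous: int) -> int:
    """Two-input Muller C-element: follow the inputs when they agree, else hold."""
    return a if a == b else previous


class MullerPipeline:
    """Chain of C-elements; each stage sees the previous stage's output and
    the inverted output of the next stage."""

    def __init__(self, stages: int) -> None:
        self.c: List[int] = [0] * stages  # all C-elements initialized to 0

    def settle(self, req_in: int, ack_out: int) -> List[int]:
        """Apply the environment signals and iterate until nothing changes."""
        changed = True
        while changed:
            changed = False
            for i in range(len(self.c)):
                left = req_in if i == 0 else self.c[i - 1]
                right = ack_out if i == len(self.c) - 1 else self.c[i + 1]
                new = c_element(left, 1 - right, self.c[i])
                if new != self.c[i]:
                    self.c[i] = new
                    changed = True
        return list(self.c)
```

For a three-stage pipeline whose right-hand environment does not acknowledge, raising the request drives all stages to 1; lowering it leaves the last stage holding its 1, and a second request then stops at the first stage because the middle stage cannot copy a 1 while its successor is still 1.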
Besides the Muller pipeline, several other models have been proposed. The most important of all is the micropipeline, which was described by Ivan E. Sutherland in his famous Turing Award lecture “Micropipelines” in 1989 [71]. The approach is based on a two-phase bundled-data model with the micropipeline as the backbone control circuit. Figure 2-24 shows the control circuit of a 4-stage micropipeline. Without the datapath, the micropipeline is a string of Muller C-elements. Each stage has one request input signal, R(n), and one acknowledge output signal, A(n). The request signal propagates from the left-most side, R(in), to the right-most side, R(out), which is the same as the direction of the data flow. The data therefore flows from the left-most side to the right-most side, stage by stage. Once the data has been received at the right-most side, the acknowledge signal is returned from the right-most side, A(out). The acknowledge signal, A(n), then propagates back to the left-most side, A(in), and clears the whole pipeline, so that the pipeline can keep operating. Figure 2-25 depicts how the control circuit of the micropipeline is combined with the datapath. As the micropipeline is the most well-known asynchronous circuit design model, many different asynchronous systems have been implemented based on it. It can be used to implement many kinds of pipelined systems, even processors. For example, the NSR processor is a very simple 16-bit micropipeline-based microprocessor with very simple RISC instructions (fewer than 20 instructions) [72]. The Amulet1 is known as the first ARM-compatible processor implemented with asynchronous circuits [15,73]. It was implemented with a 2-phase micropipeline architecture.
Several other models have also been proposed for asynchronous circuit design. Some modify the original “micropipeline” architecture. For example, Choy et al. proposed a new control circuit for the micropipeline [74], and the “Micronets” architecture decentralizes the control to the functional units [75]. Furthermore, several well-known asynchronous processor implementation models have been proposed. Takashi Nanya et al. presented their QDI 8-bit microprocessor model called “TITAC”, which uses Martin’s Q-element [60] as its control circuitry [76]. Figure 2-26 shows Martin’s Q-element. With the Q-element, the control path can be built easily. In addition, they proposed the Autosweeping Module (ASM), a modification of the Q-element that replaces it to gain better performance. TITAC2 was proposed to demonstrate a new delay model called scalable-delay-insensitive (SDI) [77]. This delay model replaces the unbounded gate and wire delays of the original DI or QDI models with a bounded relative delay ratio between any two components. There are also works that model processors with asynchronous circuits. Martin et al. at Caltech have presented three generations of asynchronous processor models [78]. Chen et al. presented an asynchronous RISC processor model in 2002 [79]. In addition, several asynchronous superscalar processor models have been proposed, for example the Kin architecture [80], the Hades project [81], and, the most famous of all, the counterflow pipeline processor (CFPP) [82]. The design concept of the CFPP is quite different from traditional designs. Figure 2-27(a) shows the architecture of a 5-stage CFPP. The design separates the instruction flow and the result flow, which move in opposite directions. In this figure, the instruction is fetched, decoded, and inserted into the instruction pipeline in stage F. At the same time, the source operands needed by this instruction are inserted into the result pipeline in stage R. Figure 2-27(b) describes the instruction and result bindings. Each binding is composed of a register name, a valid bit, and a data value. Because the instruction flow and the result flow move in opposite directions, an instruction meets the data it needs in one of the stages. Once the needed operands have been obtained, the instruction can be executed correctly. In addition, if the destination binding of the instruction matches one of the result bindings, that result binding is updated, so that the following instructions in the pipeline obtain the correct result value. This may be regarded as a specially designed form of data forwarding. However, these superscalar models are either not easy to implement or are only ideas that cannot be realized, and they are certainly not very suitable for the cores of embedded processors. In fact, because it is very hard to guarantee the instruction execution order in an asynchronous design, only a few research projects on asynchronous superscalar processors are actually in progress.
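The two matching rules described above for the CFPP can be sketched in Python. This is a simplified illustration of what happens when an instruction and a result packet share a stage; it implements only the two rules stated in the text, omits the other rules of the original proposal [82], and uses hypothetical names throughout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Binding:
    """A binding as in Figure 2-27(b): register name, valid bit, data value."""
    register: str
    valid: bool = False
    value: int = 0


def meet(sources: List[Binding], destination: Binding,
         results: List[Binding]) -> None:
    """Apply the matching rules when an instruction meets a result packet."""
    for result in results:
        # Rule 1: a valid result supplies a source operand that is still missing.
        for source in sources:
            if (result.valid and not source.valid
                    and source.register == result.register):
                source.value, source.valid = result.value, True
        # Rule 2: an already computed destination updates a matching result,
        # so that the following instructions see the new value.
        if destination.valid and destination.register == result.register:
            result.value, result.valid = destination.value, True
```

As a usage example, an instruction r3 = r1 + r2 that already holds r2 can pick up r1 from a passing result packet, execute, and then overwrite a stale r3 binding that it meets afterwards.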
Another issue should be pointed out here. As mentioned before, Chen et al. implemented a 4-phase dual-rail pipelined QDI processor [70]. However, in order to implement the QDI processor, all dual-rail components had to be constructed first, including even the basic logic components. That is a lot of extra effort for designing a processor. Synchronous circuits, in contrast, can be implemented easily with many pre-designed cells, components, or even large modules, which is also a key to their success. Some researchers have been trying to offer similar solutions for asynchronous circuit design. Some provide basic building elements. For example, Smith et al. proposed a new DI digital system called NULL Convention Logic (NCL) [83]. With NCL, DI systems can be built more easily. Others offer new pipeline or FIFO control circuits. For example, Molnar introduced a basic control circuit for asynchronous pipelines called the Asynchronous Symmetric Persistent Pulse Protocol, “asP*” [84]. Sutherland and Fairbanks described GasP in 2001 [85]. There is still much other research on new control circuits and new asynchronous elements.
Besides the “pure” asynchronous implementation research, some research looks for applications in other directions. Imagine a large SoC in which the components or IPs are designed by different teams or even different companies. Integrating them on a single die may be a very difficult job, mainly because these different designs may only operate correctly at different clock frequencies. Some research therefore wraps each synchronous circuit with an asynchronous wrapper, so that the whole system communicates through asynchronous channels while each local circuit operates with its own local clock. Thus, several Globally-Asynchronous Locally-Synchronous (GALS) methodologies have been proposed. The concept of GALS was proposed by Chapiro in his PhD thesis in 1984 [86]. Figure 2-28 depicts this idea. In addition, some research focuses on interconnection networks built with asynchronous circuits. MPSoCs and multicore processors have become the major trend of system and processor design, so the design of interconnection networks has become one of the most important issues. Many different problems may arise in network design, and they must be handled carefully. It is widely held that most of these problems can be resolved more easily with asynchronous circuits; hence, it is attractive to replace these networks with asynchronous implementations. For example, Dally and Seitz implemented the first multiprocessor interconnection network based on a torus topology in 1986 [87]. They implemented the self-timed torus routing chip (TRC), which uses bundled-data encoding to perform cut-through routing in k-ary n-cube multiprocessor interconnection networks. In 1997, Natvig presented a high-level simulation model of the TRC written in Verilog [88]. Chen et al. implemented a self-timed torus interconnect with 1-of-5 encoding in 2009 [65]. In fact, because of its asynchronous nature, routing paths with different distances can operate at different speeds.
In addition, we have already pointed out that almost all commercial digital systems are implemented with synchronous circuits. One very important reason is the lack of suitable EDA tools that can be used to implement asynchronous circuits directly. It is also hard to model a design at the behavioral or RTL level with a traditional HDL, so most designs have to be implemented at the gate level. In order to reduce the effort of designing asynchronous systems and circuits, HDLs specific to asynchronous design are needed. Tangram and Balsa are the two most famous of the related frameworks. Philips Research Laboratories started to develop the Tangram tool over 20 years ago [89]. The tool is now offered by Handshake Solutions, and the ARM 996HS was developed with it [7,20]. Handshake Solutions now provides the Haste design language for describing the behavior of asynchronous circuits. In addition, an integrated, easy-to-use tool suite called TiDE™ (Timeless Design Environment) is also offered [90]. In fact, it is the most successful commercial EDA tool for asynchronous circuit design. Balsa, in contrast, is a framework that provides an asynchronous HDL and the synthesis of asynchronous circuits and systems. It is a free, open-source solution developed and offered by the University of Manchester [91,92,93]. In fact, part of Amulet 3 was designed with Balsa [18]. In addition, Chen et al. proposed an asynchronous pipelined 8051 soft-core with Balsa [94,95]. Zhang and Theodoropoulos modeled an asynchronous MIPS core called SAMIPS with Balsa [96]. We also modeled an asynchronous MP3 decoder with Balsa [97]. There are still several other works on developing EDA tools for asynchronous circuits. The asynchronous research group at Caltech provides Communicating Hardware Processes (CHP) and its synthesis tool as an asynchronous circuit design tool [98]. The notation of CHP is inspired by CSP. In fact, three generations of their asynchronous processors were designed with CHP [76], including the very large MiniMIPS processor [99]. The most interesting of all is SoCAD, developed at Tatung University, Taipei, Taiwan [100]. Its developers did not create a special HDL for asynchronous circuit design. Instead, data dependency graphs (DDGs) or the Java language can be used to model the behavior of the design. Via several translation processes proposed by Cheng, the DDG or Java models are translated into VHDL and mapped to many pre-defined, cell-based asynchronous components. With SoCAD, the goal of hardware/software codesign can be achieved easily. They also modeled a very robust asynchronous Java chip with it [101]. Unfortunately, although several EDA tools for asynchronous circuit design can be found, these tools still have a very long way to go.
Figure 2-21: The Muller C-element: symbol & truth table
Figure 2-23: A three-stage 1-bit wide 4-phase dual-rail pipeline
Figure 2-25: Micropipeline architecture (REG and LOGIC datapath stages controlled by C-elements with DELAY elements; request signals R(in), R(1), R(2), R(3), R(out) and acknowledge signals A(in), A(1), A(2), A(3), A(out))
Figure 2-28: Concept of GALS
2-2-4 Case Study of Asynchronous Circuit Design
As mentioned in previous sections, it is difficult to design and implement asynchronous circuits directly. Most designs cannot be implemented by writing RTL in Verilog or VHDL. In our group, some circuits are implemented with the Balsa HDL, and some are implemented by writing gate-level descriptions in Verilog HDL. In this section, the two methods are discussed.
It is widely known that implementing a whole design with gate-level descriptions is hard work, and it is not worthwhile to implement all circuits this way. With higher-level modeling, we can pay more attention to the design itself. The same is true for asynchronous circuit design. Thus, we selected the Balsa framework as our tool. Because the details of the Balsa HDL and framework are described in section 4.1, here we only describe how to model a design with Balsa, using a pipelined asynchronous 8051 core as the example [94,95]. As the first step, the design model and the asynchronous communication channels between the parts of the design must be defined. Figure 2-29 shows the AsyncPA8051 model with the interfaces and channels between its parts. Then, each part of the design can be described with high-level Balsa descriptions. The following segment shows the top module of the AsyncPA8051. It should be noted that the components are connected with communication channels.