Introduction - A Study of a Translation Lookaside Buffer with Low Context-Switching Miss Rate and Its Asynchronous Circuit Implementation

1-1 Motivation

Embedded processors and microcontrollers are widely used in a variety of embedded systems and handheld devices. Because of increasingly complex applications, these processors are now required to run modern embedded operating systems. It is therefore very important for them to support the virtual memory mechanism that a modern operating system needs. To provide fast address translation, a translation lookaside buffer (TLB) should be included inside these processors. Furthermore, because of the nature of embedded systems and handheld devices, a simple and easy context-switching model should also be provided. To reduce the address translation penalty of context switching, these processors need a well-designed TLB with a low context-switching miss ratio [1,2,3].

In addition, high robustness and low power consumption are the two most important requirements for these processors. Asynchronous circuits are widely regarded as one of the best ways to address both issues at the same time [4,5,6,7]. Embedded processors and microcontrollers may therefore be well suited to implementation with asynchronous circuits. However, it is not easy to implement the TLB that a modern operating system needs on an asynchronous processor. In this work, we propose a TLB architecture with a low context-switching miss ratio that is suitable for embedded systems running only a few tasks, and we implement the TLB controllers with asynchronous circuits.

1-2 Introduction to asynchronous circuits

Asynchronous chips improve computer performance by letting each circuit run as fast as it can!

By Ivan E. Sutherland and Jo Ebergen

"Scientific American", August 2002 [4]

It is widely known that synchronous circuits have several problems that must be dealt with carefully, such as clock skew, difficulty of clock distribution, worst-case performance, lack of modularity, sensitivity to variations in physical parameters (temperature, voltage, and process), synchronization failure, and noise (EMI). All of these problems derive from the “clock” signal [4,5,6,7]. As VLSI-based systems become larger and more complex and run at higher clock rates, these problems become more serious than ever.

In addition, reducing power consumption has become one of the most important issues in large VLSI system design. It is widely known that the dynamic power dissipation satisfies $P \propto f C V^2$ [8], where $f$ is the switching frequency, $C$ is the switched capacitance, and $V$ is the supply voltage. In other words, dynamic power consumption is proportional to the amount of switching activity. To improve circuit and system performance, clock frequencies keep rising, so the extra power spent in the clock distribution tree keeps growing as well. The clock signal clearly consumes a very large proportion of the power of the whole chip. For example, the clock distribution network of the DEC (Compaq) Alpha 21064 processor consumes about 40% of the power when it runs at maximum speed [9]. Similarly, the Motorola MCORE micro-RISC processor spends 36% of its power in clock tree distribution [10]. In fact, the clock distribution network is responsible for an increasing fraction of the dynamic power consumed by modern processors and SoCs [11,12,13]. Thus, if the clock signal can be removed, power consumption is very likely to drop. Many techniques have been proposed and implemented to reduce power consumption, such as clock gating and dynamic voltage and frequency scaling (DVFS) [14]. Furthermore, a higher clock frequency may also drive the temperature of a VLSI chip very high, which is harmful for embedded systems and handheld devices. All of these problems are a nightmare for almost all VLSI-based system development today.
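As a small illustration of the proportionality between dynamic power, frequency, capacitance, and the square of the supply voltage, the sketch below compares two operating points. The function name `relative_dynamic_power` and the example numbers are assumptions chosen for illustration only; they are not measurements from any of the processors cited above.

```python
def relative_dynamic_power(f_ratio, v_ratio, c_ratio=1.0):
    """Return P_new / P_old for P proportional to f * C * V**2.

    Each argument is the ratio new/old of the switching frequency,
    the supply voltage, and the switched capacitance respectively.
    """
    return f_ratio * c_ratio * v_ratio ** 2


# Hypothetical operating point: halve the frequency and lower the
# supply voltage to 80% of its original value (a DVFS-style change).
ratio = relative_dynamic_power(f_ratio=0.5, v_ratio=0.8)
print(f"dynamic power scales to {ratio:.2f} of the original")
```

Because the voltage enters quadratically, lowering the supply voltage has a larger effect than lowering the frequency by the same factor.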

In contrast, asynchronous circuits can readily reduce power consumption by removing the “clock” signals that spread across the whole VLSI chip. With the clock replaced by handshaking protocols, asynchronous circuits offer low active power and almost zero standby power [4,5,6]. In fact, because of their data-driven nature, the inactive components or parts of an asynchronous circuit are automatically “shut off.” Asynchronous circuits can therefore offer very good power efficiency. For example, the best-known asynchronous ARM-compatible processors, the Amulet series [15,16,17,18], show better power efficiency than their synchronous ARM counterparts. Another well-known example, the Philips asynchronous 80C51 microprocessor, is 4 times more power efficient than its synchronous counterparts [19]. Most interesting of all is the latest ARM996HS processor, the first commercially available synthesizable 32-bit CPU built with clockless logic [7,20]. It consumes about 2.8x less power than the clock-gated ARM968E-S core. Table 1-1 compares the ARM996HS and the ARM968E-S [7]. The table also shows that the ARM996HS operates correctly in a variety of operating environments; for example, it can operate at a lower voltage in a high-temperature environment. In this sense, asynchronous circuits are much more robust than synchronous circuits.

In fact, designing the “clock” system has become a critical issue in large VLSI system design today. For example, a very complex “clocking architecture” is implemented in the latest Intel® 45nm 8-core Xeon® Enterprise processor announced at ISSCC 2009 [21]. Figure 1-1 shows its clocking architecture. The design has a total of 16 PLLs, 8 DLLs, and independent clock domains for each core and the uncore. It is a very complex design indeed.

Unfortunately, such designs are very common today. Since the first microprocessor, the Intel® 4004, was announced in 1971, VLSI technology has improved greatly, and it has become possible to put one billion transistors on a single chip. Designing the “clock” system for such a large system is a daunting task.

However, for several complex historical and practical reasons, almost all systems today are still implemented with designs based on a fixed clock period. While synchronous design may introduce many problems as systems grow larger and larger, asynchronous design may overcome these problems by avoiding the use of a clock signal. Furthermore, making IP reuse easier has become one of the most important issues in SoC design. Asynchronous circuits may be one of the best solutions to this issue. Without the influence of the “clock” signal, asynchronous circuits make a “software OOP” style of design possible for hardware. All that designers need to know is the handshaking protocol interface [4,5,6], which also makes each designed component or IP more reusable. With the growing mobile device and embedded system markets, all of these issues need to be considered seriously. Thus, it is time to implement these systems with asynchronous circuits.

Table 1-1: Comparisons of ARM996HS and ARM968E-S

Figure 1-1: Clock distribution domains and generators

1-3 Introduction to Translation Lookaside Buffer

To support the larger memory requirements of modern applications, it is important for a modern operating system (OS) to provide a virtual memory mechanism. Conceptually, with virtual memory, the code and data of a program are moved between main memory and secondary storage automatically, and each program is given a single complete and contiguous “memory space.” Thus, only part of the code and data of a program needs to be resident in main memory. Programmers do not need to know how the code and data are arranged. Moreover, a program can even be larger than the physical memory. In fact, the virtual address space is often much larger than the physical memory. Figure 1-2 shows the concept of virtual memory. The virtual memory is divided into fixed-size blocks called pages, and each page has a page number called the Virtual Page Number (VPN). Similarly, the physical memory is divided into page frames of the same size, each with its own unique page frame number called the Physical Page Number (PPN). Through memory mapping, each page of virtual memory is mapped to a page frame of physical memory or to secondary storage. With appropriate hardware support, the virtual memory is carefully maintained by the OS [22].
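The relationship between a virtual address, its VPN, and the offset within the page can be sketched as follows. The 4 KB page size and the helper names `split_virtual_address` and `make_physical_address` are assumptions made for this illustration.

```python
PAGE_SIZE = 4096                          # assumed 4 KB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 offset bits for 4 KB pages


def split_virtual_address(va):
    """Split a virtual address into (VPN, offset within the page)."""
    return va >> OFFSET_BITS, va & (PAGE_SIZE - 1)


def make_physical_address(ppn, offset):
    """Combine a physical page number with the unchanged page offset."""
    return (ppn << OFFSET_BITS) | offset


vpn, offset = split_virtual_address(0x12345678)
print(hex(vpn), hex(offset))
```

Only the page number is translated; the offset is carried over unchanged from the virtual to the physical address.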

As mentioned above, the OS is responsible for providing the mechanisms that map virtual addresses to physical addresses. However, all of these virtual-to-physical address translations are stored in main memory. To reduce the cost of address translation, translation lookaside buffers (TLBs) are widely implemented inside processors [23,24,25,26,27,28].

Figure 1-3 depicts the basic design idea of a TLB. When a virtual address (VA) is sent to the TLB, its VPN is compared with all the tag fields to find a match. On a hit, the corresponding PPN is sent out, and the physical address is generated by combining the PPN with the page offset. On a miss, a page table traversal is performed, and the OS takes care of the TLB miss handling.
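The hit and miss paths described above can be modeled with a toy software TLB. This is a minimal sketch under stated assumptions, not the design proposed in this thesis: the class name `SimpleTLB`, the FIFO replacement policy, the 4 KB page size, and the use of a dictionary in place of the OS page tables are all hypothetical choices for illustration.

```python
OFFSET_BITS = 12  # assumed 4 KB pages


class SimpleTLB:
    """Toy fully-associative TLB mapping VPN -> PPN (FIFO replacement)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # dict standing in for the OS page tables
        self.entries = {}             # dicts keep insertion order (FIFO)
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn = va >> OFFSET_BITS
        offset = va & ((1 << OFFSET_BITS) - 1)
        if vpn in self.entries:       # tag match: TLB hit
            self.hits += 1
        else:                         # TLB miss: consult the page table
            self.misses += 1
            if len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))
                del self.entries[oldest]
            # An unmapped VPN raises KeyError here (a page fault in a real OS).
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << OFFSET_BITS) | offset


tlb = SimpleTLB(capacity=2, page_table={0x1: 0xA, 0x2: 0xB})
print(hex(tlb.translate(0x1234)), hex(tlb.translate(0x1FFF)))
print(tlb.hits, tlb.misses)
```

The second access falls in the same page as the first, so it hits in the TLB and avoids a second page table lookup.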

However, the virtual memory mechanism varies with the processor architecture and the OS implementation. The page table organization dominates the page table traversal time, which accounts for most of the TLB miss handling time. Although some architectures, such as PowerPC [29], use advanced page table organizations such as the inverted page table [22] to reduce the traversal time, the forward-mapped hierarchical page table is still widely used, for example in the Compaq/DEC Alpha AXP [28], the latest AMD64, and Intel® 64 [30,31,32] architectures. With this structure, several main memory accesses are needed to fetch the correct Page Table Entry (PTE) whenever a miss occurs. A processor with 64-bit addressing may even need to traverse 7 levels of page tables [33]. Figure 1-4 shows the page table structure of IA32e mode with a 4KB page size in the Intel® 64 architecture. If no TLBs or address caches were implemented inside such a processor, four levels of tables would have to be traversed to obtain the correct PPN. Figure 1-5 shows the page table structure of the Compaq/DEC Alpha AXP [28], which has three levels of page tables. Such traversals affect overall system performance very seriously. It is therefore important to reduce the TLB miss rate on systems with such page table structures.
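To make the cost of a four-level traversal concrete, the sketch below splits a virtual address into four 9-bit table indices plus a 12-bit offset, as in four-level paging with 4KB pages, and counts one table access per level. The nested dictionaries standing in for the page tables and the names `table_indices` and `walk` are assumptions made for illustration; a real page table walk reads PTEs from main memory.

```python
LEVEL_BITS = 9     # 512 entries per table
OFFSET_BITS = 12   # 4 KB pages
LEVELS = 4


def table_indices(va):
    """Return the per-level table indices of va, top level first."""
    mask = (1 << LEVEL_BITS) - 1
    indices = []
    for level in reversed(range(LEVELS)):
        shift = OFFSET_BITS + level * LEVEL_BITS
        indices.append((va >> shift) & mask)
    return indices


def walk(root, va):
    """Walk nested dicts standing in for page tables.

    Returns (PPN, number of table accesses performed).
    """
    node, accesses = root, 0
    for index in table_indices(va):
        node = node[index]   # one table access per level
        accesses += 1
    return node, accesses


va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x567
root = {1: {2: {3: {4: 0xABCDE}}}}
print(table_indices(va), walk(root, va))
```

Every TLB miss pays for all of these accesses, which is why a lower miss rate matters so much with hierarchical page tables.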

In addition, frequent context switching may cause extra TLB misses. Some research even shows that these misses play an important role in TLB performance [1,2,3]. Most processors have therefore implemented some kind of address space identifier (ASID) to distinguish address spaces [25]. For example, the MIPS R10000 processor has an 8-bit ASID for each entry of its 64-entry TLB so that context switches do not have to invalidate all entries [34]. An 8-bit ASID has also been suggested for the SPARC architecture [35,36]. However, some processors, including the IA32 (x86) architecture, the most popular processor family today, simply flush all the TLB entries when a context switch occurs [31,1]. Unfortunately, this is still the case even for the latest IA32 processors. We treat this model as the worst-case performance.

Although much research on TLBs has been done, only a few studies consider the influence of context switching. This may be because it is very hard to model and estimate the context-switching activity caused by the OS, and hard to consider this issue without first considering OS behavior. In this work, we try to provide an alternative way to address the context-switching issue for TLBs. To support the proposed mechanism, the OS must be modified slightly. In fact, because of architectural differences, this kind of OS modification for the TLB is needed on every architecture. We hope that this simple mechanism can be implemented inside asynchronous embedded processors or microcontrollers that run only a few tasks simultaneously.
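The two conventional policies discussed above, flushing the whole TLB on every context switch and tagging each entry with an ASID, can be contrasted with a small sketch. It does not model the alternative mechanism proposed in this thesis. The function name `count_context_switch_misses`, the unbounded TLB, and the access trace are hypothetical and chosen only to isolate the effect of context switches.

```python
def count_context_switch_misses(trace, use_asid):
    """Count TLB misses for a trace of (task_id, vpn) accesses.

    The TLB is modeled as an unbounded set, so every miss counted here
    is either a first touch or the result of a context-switch flush.
    """
    tlb = set()
    misses = 0
    current_task = None
    for task_id, vpn in trace:
        if task_id != current_task:      # context switch
            if not use_asid:
                tlb.clear()              # flush-all policy
            current_task = task_id
        # With ASIDs, the task identifier is part of the tag.
        key = (task_id, vpn) if use_asid else vpn
        if key not in tlb:
            misses += 1
            tlb.add(key)
    return misses


# Two hypothetical tasks that alternate, each touching the same two pages.
trace = [(0, 1), (0, 2), (1, 1), (1, 2)] * 3
print(count_context_switch_misses(trace, use_asid=False),
      count_context_switch_misses(trace, use_asid=True))
```

With flushing, every access after a switch misses again; with ASID tags, only the first touch of each page by each task misses.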

To estimate the performance of the proposed architecture, we ran a set of simulations. All simulations were performed with a modified version of the SimpleScalar Version 3.0d tool suite [37], provided by SimpleScalar LLC, using SPEC95. In addition to a traditional 1024-entry fully-associative TLB under the x86-style assumption, we also compare a 1024-entry fully-associative TLB with ASID and two different prefetching mechanisms incorporated into our proposed design. The results show that our banked design works very well with sequential prefetching (SP, also called linear prefetching).
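The general idea of sequential prefetching can be illustrated with a toy model in which a miss on page v also loads the translation for page v + 1. This prefetch-on-miss variant, the unbounded TLB, and the function name `misses_with_sequential_prefetch` are assumptions for illustration; the sketch does not reproduce the banked design or the simulation setup of this thesis.

```python
def misses_with_sequential_prefetch(vpn_trace, prefetch):
    """Count misses of an unbounded toy TLB over a trace of VPNs.

    With prefetch=True, a miss on page v also loads the entry for v + 1.
    """
    tlb = set()
    misses = 0
    for vpn in vpn_trace:
        if vpn not in tlb:
            misses += 1
            tlb.add(vpn)
            if prefetch:
                tlb.add(vpn + 1)   # sequential (next-page) prefetch
    return misses


# Hypothetical trace that sweeps through eight consecutive pages.
sweep = list(range(8))
print(misses_with_sequential_prefetch(sweep, prefetch=False),
      misses_with_sequential_prefetch(sweep, prefetch=True))
```

On a purely sequential sweep, every second miss is avoided in this model; traces with less spatial locality benefit less.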

Our work aims to realize TLB controllers for asynchronous embedded processors or microcontrollers with a low TLB miss rate under context switching. Although most processors use an ASID to reduce the misses caused by context switching, our work provides an alternative. There are several reasons for the proposed architecture. These embedded systems execute only a few tasks at the same time, so they do not really need to store many ASIDs. This is why the TLB of the StrongARM SA-1100 processor is designed without ASIDs [2,3]. These processors are not designed for desktop use. Figure 1-6 shows smart phones that run the Windows® Mobile OS. In fact, because we wish to implement such a TLB for asynchronous embedded processors or microcontrollers, using fewer tag bits may be more important than other issues. In addition, we discuss why sequential prefetching is more suitable for the proposed design. Moreover, we will try to realize this design on the asynchronous processor that we are currently working on. It should not be too hard to realize the proposed mechanism on an asynchronous processor with some extra handshaking protocols in a bundled-delay or self-timed design.

Figure 1-2: Conceptual virtual memory

Figure 1-3: Virtual address translation with TLB

Figure 1-4: Page table structure of IA32e mode with 4KB page size

Figure 1-5: Virtual to physical address translation of Alpha AXP

Figure 1-6: Smart phones with Windows® Mobile OS

1-4 Summary

As mentioned in the previous sections, new embedded systems and handheld devices are beginning to run modern operating systems. It is therefore increasingly important for their processors to provide efficient address translation, and a well-designed TLB has become critical to their performance. In addition, because embedded systems and handheld devices may operate in a variety of environments, robustness and reliability are two of the most important issues for these processors. Asynchronous circuits can readily address these issues. However, lacking an address translation mechanism, most asynchronous processors do not support virtual memory directly. To support virtual memory on asynchronous processors, an asynchronous TLB controller must be implemented.

Thus, in this thesis, we propose a TLB architecture for future asynchronous embedded processors and model it with the Balsa HDL. The main contributions of this thesis are as follows.

An extensive survey of TLB studies

An extensive survey of asynchronous circuits, with a detailed introduction to asynchronous circuit design

A study of TLB performance under context switching

A new alternative TLB architecture with a low miss rate under context switching for asynchronous embedded processors

A study of the implementation of the proposed TLB architecture with asynchronous circuits

Confirmation that a TLB controller can be designed for asynchronous processors
