Related Works - A Study of a Translation Lookaside Buffer with a Low Context-Switch Miss Rate and Its Asynchronous Circuit Implementation

In this chapter, we discuss related work on both TLB design and asynchronous system and circuit design. Because only a few studies specifically address TLB design for asynchronous processors, we discuss the two topics separately. Finally, case studies of asynchronous MMU and TLB designs are presented.

2-1 Recent studies of TLB

As mentioned in Chapter 1, the TLB plays an important role in the overall performance of processors that support virtual memory. Thus, a great deal of research has been done on it. Moreover, because architectures and addressing modes differ, real implementations may differ greatly. The design requirements may even vary with the page modes or new addressing modes supported by the same processor. Interestingly, however, the TLB designs of most commercial processors are not very complex; most of them do not use elaborate algorithms or architectures.

The key issue in these designs is reducing the TLB search time. Related work on TLBs is described in the following paragraphs, classified into traditional techniques, advanced techniques, and techniques for reducing the TLB miss rate caused by context switching.

2-1-1 Traditional Techniques

Because the TLB is in fact part of the memory hierarchy and can be regarded as a specially designed cache memory that caches page table entries, it is intuitive that the traditional techniques for improving cache performance can also be applied to the TLB. This also means that the 3Cs model of misses [28] applies to the TLB as well. In fact, these techniques are now widely used in commercial processors in different ways.

In order to reduce the TLB miss rate, most processors increase the size (total number of entries) of their fully associative or set-associative TLBs. For example, the recent AMD Opteron™ processor has both a 512-entry L2 instruction TLB (ITLB) and a 512-entry L2 data TLB (DTLB) [38], and the IBM POWER4 processor has a common 1024-entry TLB for each processor core [39]. Furthermore, some processors provide multi-level TLBs, such as the 2-level ITLB/DTLB design of the recent AMD Opteron™ processor [38] and of each core (Nehalem architecture) of the latest Intel® Core i7 processor [31]. Figure 2-1 shows the TLB design of the Intel® Core i7 processor: each core has separate instruction and data TLBs backed by a unified second-level TLB (STLB). In addition, some processors provide larger page sizes to increase the TLB span, such as the 2MB or 4MB page sizes on all new Intel IA32 processors after the Pentium® Pro processor [40]. The Intel IA64 architecture offers page sizes from 4KB to 256MB, as well as 4GB [41]. The AMD64 architecture also provides 4KB, 2MB, 4MB, and up to 1GB page sizes [42]. Larger page sizes have several advantages. First, because fewer page table entries are needed, the page tables become smaller. Second, larger pages allow for larger physically addressed caches. Third, because each page maps a larger memory region, fewer page tables and TLB entries are needed. Finally, because the number of page table levels can be reduced, fewer main memory accesses are needed to generate the correct physical page number when a TLB miss occurs. Figure 2-2 shows the page table structure of the IA32 architecture with 4KB pages, and Figure 2-3 shows the structure with 4MB pages [31]. With the larger page size, the number of page table levels is clearly reduced.
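As a worked illustration of the last two points, the following sketch splits a 32-bit IA32 linear address into its translation indices for both page sizes and computes the TLB reach. It assumes the standard non-PAE IA32 layout (10/10/12 bits for 4KB pages, 10/22 bits for 4MB pages); the function names and the 64-entry TLB size are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def split_4k(linear: int) -> Tuple[int, int, int]:
    """Return (directory index, table index, offset) for a 4KB page.

    Two indices mean two page table levels must be walked on a TLB miss.
    """
    return (linear >> 22) & 0x3FF, (linear >> 12) & 0x3FF, linear & 0xFFF


def split_4m(linear: int) -> Tuple[int, int]:
    """Return (directory index, offset) for a 4MB page.

    Only one index remains, so only one level is walked on a TLB miss.
    """
    return (linear >> 22) & 0x3FF, linear & 0x3FFFFF


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


addr = 0x12345678                      # arbitrary example address
print(split_4k(addr))                  # (72, 837, 1656)
print(split_4m(addr))                  # (72, 3430008)
print(tlb_reach(64, 4 * 1024))         # 262144 bytes = 256KB (hypothetical 64 entries)
print(tlb_reach(64, 4 * 1024 * 1024))  # 268435456 bytes = 256MB
```

With the same number of entries, the 4MB page size covers 1024 times more address space and removes one page table level from the miss path.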

Figure 2-1: Structure of TLBs and cache memories of Intel® Core i7

Figure 2-2: IA32 linear address translation (4-KByte page)

Figure 2-3: IA32 linear address translation (4-MByte page)

2-1-2 Advanced Techniques

As mentioned in the previous section, most contemporary processors provide several page sizes, from 4KB up to very large sizes. Some even allow pages of different sizes to coexist by using an augmented page table entry format, which certainly requires extra OS support. Small pages save memory space, because larger pages waste memory through internal fragmentation. In addition, with a small page size, the startup time of small programs is shorter. To provide several page sizes, some commercial designs put a separate TLB inside the processor for each page size. Others modify the TLB entry format so that one TLB can be shared among different page sizes.

In addition, several interesting mechanisms have been proposed to support superpaging. Several base pages that are aligned in both the virtual and physical address spaces can be merged at run time into a larger page called a superpage [43,44,45,46]. With the superpage mechanism, the internal fragmentation problem can be resolved. However, supporting superpages requires very complex interactions between the OS and the hardware. Furthermore, the requirement that the virtual and physical memory regions be aligned seriously limits the use of superpages. Hence, some studies have focused on overcoming this limitation by supporting superpages dynamically. Talluri et al. described a method called the complete-subblock TLB, which allows a single TLB block to map multiple base pages without any special OS support [43,44]. They also described a much smaller design called the partial-subblock TLB, which shares the PPN and attribute fields across base page mappings. Figure 2-4 shows a complete-subblock TLB block (entry) with a subblock factor of 4.
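The complete-subblock idea can be sketched as a data structure: one tag covers a block of four base pages, while each base page keeps its own PPN. This is a minimal model under stated assumptions, not the design of [43,44]: the class and field names are hypothetical, attribute bits are omitted, and an invalid subblock is represented by `None` instead of a valid bit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUBBLOCK_FACTOR = 4  # base pages mapped by one TLB block (factor 4)


@dataclass
class CompleteSubblockEntry:
    """One TLB block: a single tag plus one PPN slot per base page."""
    block_tag: int  # virtual page number divided by the subblock factor
    ppns: List[Optional[int]] = field(
        default_factory=lambda: [None] * SUBBLOCK_FACTOR
    )

    def fill(self, vpn: int, ppn: int) -> None:
        """Install the mapping of one base page that belongs to this block."""
        if vpn // SUBBLOCK_FACTOR != self.block_tag:
            raise ValueError("vpn does not belong to this block")
        self.ppns[vpn % SUBBLOCK_FACTOR] = ppn

    def lookup(self, vpn: int) -> Optional[int]:
        """Return the PPN for vpn, or None on a block miss or subblock miss."""
        if vpn // SUBBLOCK_FACTOR != self.block_tag:
            return None  # tag mismatch: block miss
        return self.ppns[vpn % SUBBLOCK_FACTOR]  # None means subblock miss


entry = CompleteSubblockEntry(block_tag=0x40 // SUBBLOCK_FACTOR)
entry.fill(0x40, 0x9A)   # the physical pages need not be contiguous
entry.fill(0x42, 0x13)
print(entry.lookup(0x40))  # 154
print(entry.lookup(0x41))  # None (subblock not yet filled)
print(entry.lookup(0x44))  # None (different block)
```

Because every subblock stores a full PPN, no physical alignment is required. A partial-subblock entry would instead keep one shared PPN, which is smaller but reintroduces an alignment constraint.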

Lee et al. proposed a novel banked-promotion TLB structure that supports two page sizes dynamically [47]. Four 4KB pages can be promoted to a 16KB superpage. To support this mechanism, they designed a promotion TLB whose heuristic promotion algorithm promotes four consecutive entries from the small-page TLB bank to the large-page TLB bank. Thus, the four 4KB TLB entries can be reused. Furthermore, to reduce power consumption and TLB reference latency, they divided the TLB for 4KB pages into two banks [48]. Figure 2-5 shows the structures of their promotion TLB and banked-promotion TLB. In addition, Swanson et al. presented a novel memory controller (MMC) that can aggressively create superpages even from non-contiguous and unaligned regions of physical memory [49,50]. Figure 2-6 depicts this design. They suggested using a portion of the unused physical address range to virtualize physical memory in the proposed MMC. Shadow pages are “shadows” of accessed pages that the MMC remaps to real physical addresses. The TLB reach is extended through a Memory Controller TLB (MTLB), so superpages can be created aggressively from non-contiguous and unaligned regions of physical memory. Park et al. proposed a way to combine the partial-subblock TLB with the MMC to improve TLB performance [51]. They also proposed the Variable-Size Subblock TLB (VS-TLB), an extension of the original subblock TLB that supports subblocks of multiple sizes. Based on the original subblock TLB design, they added a subblock size (SS) field to each entry. With this extension, the total TLB reach grows with the maximum subblock size. Much other research on improving TLB performance through superpaging also exists.

Besides the research above, other interesting work can also be found. Channon et al. presented reconfigurable partitioned TLBs to improve TLB performance [52]. They claim that the traditional split instruction and data TLB design is not suitable for unpredictable memory reference patterns. The reconfigurable partitioned TLB reduces misses between distinct reference types by dynamically adjusting the position of the partition in real time. Figure 2-7 shows this design. In addition, some research focuses on low power consumption. Besides architectural improvements that reduce power consumption, such as banking techniques, some studies redesign the basic circuit elements themselves. For example, Juan presented low-power CAM and SRAM cell designs [53]. The study also examined the relationship between power consumption and TLB associativity, and concluded that a small, fully associative TLB implemented with the modified cells saves more power. Because the TLB is part of the memory hierarchy, some research tries to integrate the TLB with the cache memory.

Among these studies, Lee et al. proposed an interesting way to reduce the tag memory of the cache [54,55]. The design uses a tag memory shared by the TLB and the cache. The TLB still uses a CAM as its tag memory, and the cache shares the same tag memory. The index tag memory of the cache now stores only the encoded index of an entry in the shared tag memory rather than the PPN. Thus, the total tag memory size is reduced. Figure 2-8 shows this design. In addition to these hardware efforts, many software efforts can be found. Instead of hardware-managed TLBs, software-managed TLBs are widely used in many RISC processors, such as the SPARC, Alpha AXP, PA-RISC, and MIPS architectures [23,25]. A wide variety of other studies of TLBs and virtual memory also exists.

Although many new TLB designs have been proposed, only a few studies have focused on prefetching or preloading TLB entries. Saulsbury introduced an interesting mechanism called Recency-based TLB Preloading (RP), which prefetches TLB entries according to the “recency” of the referenced pages [56]. The mechanism maintains a “recency stack” across the TLB inside the processor and the augmented translation table entries in memory, ordered by how recently each page was referenced, so that the page likely to be referenced next can be prefetched. Figure 2-9 (a) shows how the stack inside the processor changes when the TLB reference is a hit. Because it is a TLB hit, the recency of the translation table entries (TTEs) in the translation table does not change. Figure 2-9 (b) depicts how the recency stacks of both the TLB and the translation table change when the TLB reference is a miss. After the missed TTE is moved to the top of the TLB stack, the recency of both the TLB entries and the translation table entries changes according to their positions in the recency stack. Finally, the TTEs with “recency ± 1” relative to the missed TTE are prefetched into the prefetch buffer inside the processor.
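The recency-based behavior described above can be approximated by a short simulation. This is a simplified sketch, not the mechanism of [56]: a single Python list stands in for the whole recency stack, its first `tlb_size` elements represent the TLB, and the remainder represents the translation table in memory. The class name, the method name, and the choice to prefetch neighbors on every TLB miss (even when the prefetch buffer supplies the entry) are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


class RecencyPreloader:
    def __init__(self, tlb_size: int, buffer_size: int) -> None:
        self.tlb_size = tlb_size
        self.stack: List[int] = []  # recency stack, most recent page first
        self.buffer: Deque[int] = deque(maxlen=buffer_size)  # prefetch buffer

    def reference(self, page: int) -> str:
        """Reference a page; return 'hit', 'prefetch_hit', or 'miss'."""
        if page in self.stack[: self.tlb_size]:
            # TLB hit: only the order inside the TLB part changes.
            self.stack.remove(page)
            self.stack.insert(0, page)
            return "hit"

        outcome = "prefetch_hit" if page in self.buffer else "miss"
        if page in self.buffer:
            self.buffer.remove(page)

        if page in self.stack:
            i = self.stack.index(page)
            # Prefetch the in-memory entries with recency i-1 and i+1.
            for j in (i - 1, i + 1):
                if self.tlb_size <= j < len(self.stack):
                    neighbour = self.stack[j]
                    if neighbour not in self.buffer:
                        self.buffer.append(neighbour)
            self.stack.remove(page)

        # The missed entry becomes the most recent one; the least recent TLB
        # entry implicitly slides to the top of the in-memory part.
        self.stack.insert(0, page)
        return outcome


rp = RecencyPreloader(tlb_size=2, buffer_size=2)
print([rp.reference(p) for p in [1, 2, 3, 4, 1, 2, 3]])
# ['miss', 'miss', 'miss', 'miss', 'miss', 'prefetch_hit', 'prefetch_hit']
```

In this cyclic reference pattern, which defeats a 2-entry LRU TLB, the neighbor in the recency stack is exactly the page referenced next, so the prefetch buffer covers the misses from the second cycle onward.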

Note that in a real implementation the positions of the TTEs in the recency stack are maintained by previous and next pointers in each TTE. Figure 2-10 shows the implementation of the translation table in memory. However, the mechanism may increase memory traffic, and the PTE format must be changed to store the stack pointers of the linked list. To address these problems, Kandiraju proposed a prefetching technique called Distance Prefetching (DP), which is based on the “distance” (stride) between recently referenced pages [57]. The mechanism maintains a table that tracks the differences between successive address references and prefetches according to the predicted distance. Figure 2-11 shows the implementation of a TLB with the DP technique. The paper also presents generic prefetching hardware and compares DP with other prefetching techniques borrowed from cache prefetching, such as Sequential Prefetching (SP), Arbitrary Stride Prefetching (ASP), and Markov Prefetching (MP). Figure 2-12 shows the schematic of the generic prefetching hardware. Because of implementation costs, we focus on SP and DP in our work.
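The distance-based prediction can be illustrated with a small sketch. It is a simplified model under stated assumptions rather than the hardware of [57]: the table is an unbounded dictionary indexed by the current distance, each row keeps at most `slots` distances that followed it most recently, and every reference passed to `access` is treated as part of the tracked stream. All names are hypothetical.

```python
from typing import Dict, List, Optional


class DistancePrefetcher:
    def __init__(self, slots: int = 2) -> None:
        self.slots = slots                     # predicted distances kept per row
        self.table: Dict[int, List[int]] = {}  # distance -> distances that followed it
        self.prev_page: Optional[int] = None
        self.prev_distance: Optional[int] = None

    def access(self, page: int) -> List[int]:
        """Record a page reference and return the pages to prefetch."""
        if self.prev_page is None:
            self.prev_page = page
            return []

        distance = page - self.prev_page

        # Learn: the previous distance was followed by the current one.
        if self.prev_distance is not None:
            row = self.table.setdefault(self.prev_distance, [])
            if distance in row:
                row.remove(distance)
            row.insert(0, distance)
            del row[self.slots:]

        # Predict: distances that followed the current distance in the past.
        predictions = [page + d for d in self.table.get(distance, [])]

        self.prev_page, self.prev_distance = page, distance
        return predictions


dp = DistancePrefetcher()
print([dp.access(p) for p in [1, 2, 4, 5, 7, 8]])
# [[], [], [], [7], [8], [10]]
```

The example stream alternates between distances 1 and 2. A pure sequential prefetcher would not capture this pattern, whereas the distance table learns after one repetition that distance 1 is followed by distance 2 and vice versa.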

Figure 2-4: Complete-subblock TLB with block factor 4

Figure 2-5: Promotion TLB structure & Banked-promotion TLB structure


Figure 2-6: MMC example with shadow region

Figure 2-7: Reconfigurable partitioned TLB

Figure 2-8: Share tag design of TLB and cache memory

Figure 2-9: Operations of “Recency Stack”: (a) TLB hit, (b) TLB miss

Figure 2-10: Memory translation table of TLB with “Recency Prefetching”

Figure 2-11: TLB with “Distance Prefetching”

Figure 2-12: Schematic of generic TLB prefetching hardware

2-1-3 Reducing TLB Miss Rate in Context Switching

As mentioned in previous sections, TLB miss handling requires several main memory accesses, which seriously impacts overall performance. In traditional designs, however, the simplest way to handle a context switch (address-space switch) is to flush all TLB entries. Misses caused by this flushing make the situation even worse: after the flush, the TLB needs a long “learning time” to refill its entries. Nevertheless, only a few studies focus on this topic. Liedtke tried to reduce the likelihood of TLB flushes on address-space switches by exploiting the segmentation mechanism of the x86 [1]. Based on L4, Wiggins and Heiser tried to avoid reloading the translation table base register by using a pointer register that points to a caching page directory (CPD) [2,3]. The basic idea is as follows. The CPD contains entries from a number of different address spaces, each of which is defined by its own page table. When a TLB miss occurs, the hardware only needs to reload the TLB by indexing the CPD, which contains pointers to the LPTs (leaf page tables, each an array of 256 PTEs) of the various address spaces. If the CPD lookup misses, the handler indexes the page directory (PD) of the current thread to find a valid entry, copies that entry into the CPD, and restarts the thread. The hardware can then reload the TLB, and only a valid page table entry has to be found. Figure 2-13 depicts this basic idea. Much other research also tries to reduce the likelihood of flushing by modifying the OS or the page table structures. Besides these software solutions, the basic method supported by TLBs is to provide an address space identifier (ASID) in each entry to identify the address space it belongs to. Figure 2-14 shows a TLB with a per-entry ASID tag.
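The difference between flushing on every context switch and tagging each entry with an ASID can be sketched as follows. This is a minimal behavioral model with hypothetical names; it assumes unlimited TLB capacity and ignores replacement, ASID exhaustion, and page attributes.

```python
from typing import Dict, Optional, Tuple


class TaggedTLB:
    """Toy TLB that either flushes on a context switch or tags entries by ASID."""

    def __init__(self, use_asid: bool) -> None:
        self.use_asid = use_asid
        self.current_asid = 0
        self.entries: Dict[Tuple[int, int], int] = {}  # (asid, vpn) -> ppn

    def _key(self, vpn: int) -> Tuple[int, int]:
        # Without ASID support every entry carries the same dummy tag 0.
        return (self.current_asid if self.use_asid else 0, vpn)

    def context_switch(self, asid: int) -> None:
        if not self.use_asid:
            self.entries.clear()  # traditional design: flush all entries
        self.current_asid = asid

    def insert(self, vpn: int, ppn: int) -> None:
        self.entries[self._key(vpn)] = ppn

    def lookup(self, vpn: int) -> Optional[int]:
        return self.entries.get(self._key(vpn))  # None means TLB miss


def switch_away_and_back(tlb: TaggedTLB) -> Optional[int]:
    """Process 1 maps page 5, process 2 runs, then process 1 resumes."""
    tlb.context_switch(1)
    tlb.insert(5, 100)
    tlb.context_switch(2)
    tlb.insert(5, 200)     # same virtual page, different address space
    tlb.context_switch(1)
    return tlb.lookup(5)


print(switch_away_and_back(TaggedTLB(use_asid=False)))  # None: entry was flushed
print(switch_away_and_back(TaggedTLB(use_asid=True)))   # 100: entry survived
```

With ASID tags, the entry of process 1 survives the intervening execution of process 2, so the “learning time” after the switch is avoided as long as the entry has not been replaced.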

Figure 2-13: CPD and per-address page tables

Figure 2-14: TLB with per-entry ASID tag

2-2 Circuit design with asynchronous circuits

The technological trend is inevitable:

In the coming decades, asynchronous design will become prevalent!

By Ivan E. Sutherland and Jo Ebergen

"Scientific American", August 2002 [4]

Asynchronous circuits have been studied since the early 1950s; however, synchronous circuits still dominate mainstream digital circuit design [4,6]. Recently, some academic and commercial research has shown that it is worthwhile to implement real-life systems with asynchronous circuits. However, the absence of the global synchronization signal called the “clock” makes asynchronous circuit design very difficult. To replace the clock signal, handshaking protocols are needed between the parts of an asynchronous circuit, which makes the circuit cost much higher than that of synchronous counterparts. In addition, because tools and standardized implementation and design models are lacking, research on asynchronous circuits is still limited, which restricts their application in commercial products. In fact, it is very hard to find commercial products implemented with asynchronous circuits. In this section, we discuss the classification of asynchronous circuits, handshaking protocols, research on asynchronous circuits, and case studies of implementations with asynchronous circuits.

2-2-1 Classifications of Asynchronous Circuits

We have discussed many issues of asynchronous circuits, but what exactly is an asynchronous circuit? The question is not hard to answer: an asynchronous circuit is a circuit without any global synchronization signal, that is, without a “clock.” Based on this definition, asynchronous circuits can be classified into four classes according to the delay model assumed for the gates and wires of the circuit: Delay-Insensitive (DI) circuits, Quasi-Delay-Insensitive (QDI) circuits, Speed-Independent (SI) circuits, and Self-Timed (ST) circuits [5,6].

Delay-Insensitive (DI) circuits are the most robust and reliable of all. This class of circuits permits arbitrary (unbounded but finite) delays on gates and wires. The basic concept of DI circuits derives from Clark’s “Macromodular computer systems,” proposed in 1967 [58]. However, because arbitrary delays on both gates and wires must be tolerated, only a few circuits belong to this class, as Martin proved in 1990 [59]. Thus, designing DI circuits is subject to severe limitations.

Because pure DI circuits are so hard to implement, Quasi-Delay-Insensitive (QDI) circuits slightly relax the assumption of arbitrary wire delays. QDI circuits are DI circuits with isochronous forks, meaning that all branches of a forked wire have exactly the same wire delay [60]. Figure 2-15 shows an isochronous fork: the signal from A propagates to both B and C with the same wire delay. This assumption makes the DI class of circuits more practical. Even so, meeting the DI and QDI constraints may raise the implementation cost, and the circuits must be implemented carefully to avoid violating the constraints, so they remain very difficult to implement. However, because they require no additional delay assumptions, DI and QDI circuits may be attractive for asynchronous VLSI circuit synthesis [60].

The concept of Speed-Independent (SI) circuits was first proposed by David Muller in 1959 [59,60]. This class of circuits allows arbitrary (unbounded but finite) delays on gates but assumes zero wire delays. SI circuits can be modeled with Petri nets [63].

Self-Timed (ST) circuits are popular in many asynchronous circuit implementations. They were introduced by Seitz in 1980 [64]. An ST circuit is composed of a group of ST elements, each of which lies inside an “equipotential region” whose wire delays are negligible or well bounded. The elements can be DI, QDI, or SI circuits, or circuits that operate correctly under some local timing assumptions. No timing assumption is made about communication between regions, which means that this communication is delay-insensitive. For example, Chang et al. proposed an ST torus network with 1-of-5 DI encoding in 2009 [65]; the implementation uses DI-encoded communication between the parts of the whole design. Figure 2-16 shows the relationship among these models of asynchronous circuits. A design that contains both DI components and ST components is classified as an ST circuit.

Figure 2-15: Isochronous fork

Figure 2-16: Classifications of asynchronous circuits

2-2-2 Handshaking Protocols

Without a clock to govern its actions,

an asynchronous system must rely on local coordination circuits instead!

By Ivan E. Sutherland and Jo Ebergen

"Scientific American", August 2002 [4]

Without a clock signal, asynchronous circuits rely on handshaking protocols to ensure correct circuit operation [5,66,67]. A complete handshaking protocol combines a control signaling scheme with a data encoding scheme. Figure 2-17 shows the 4-phase handshaking protocol. In this protocol, only the rising edge is an active transition; thus it is a level-signaling, or return-to-zero, protocol. In contrast, in the 2-phase handshaking protocol, both the falling and rising edges of request and acknowledge are active; thus it is a transition-signaling, or non-return-to-zero, protocol. However, 2-phase signaling makes the circuits, especially datapath circuits, very complex and hard to implement. Figure 2-18 shows the 2-phase handshaking protocol. In addition to control signaling, there are choices for how to encode data (the data signaling protocol). Bundled data, also called single rail, uses separate request and acknowledge wires that are bundled with the data signals, so a total of n+2 wires is required to send n-bit data. Figure 2-19 shows the bundled-data model. Besides the bundled-data model, there are data encoding methods for DI circuits; because of implementation issues, dual-rail encoding is the most popular DI data encoding scheme. To represent 1-bit data in
