降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

(1)

國

立

交

通

大

學

資訊工程系

碩

士

論

文

降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

Translation Look-aside Buffer with Low Context Switch Penalty

研究生：張繼文

指導教授：陳昌居教授

(2)

降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

Translation Look-aside Buffer with Low Context Switch Penalty

研究生：張繼文 Student：Chi-Wen Chang

指導教授：陳昌居 Advisor：Chang-Jiu Chen

國立交通大學

資訊工程學系

碩士論文

A Thesis

Submitted to Department of Computer Science and Information Engineering College of Electrical Engineering and Computer Science

National Chiao Tung University in

partial Fulfillment of the Requirements for the Degree of Master

in

Computer Science and Information Engineering

June 2004

Hsinchu, Taiwan, Republic of China

(3)

降低行程環境切換導致效能損失之

轉換搜尋緩衝器設計

研究生：張繼文

指導教授：陳昌居

國立交通大學資訊工程學研究所

摘要

一般人所熟悉的轉換搜尋緩衝器是用來將虛擬記憶體轉換成實體記憶體的一種機制，在整個電腦系統運作中扮演極重要的角色。如果有任何的失誤發生時，都會造成處理器的效能嚴重的下降。為了減低任何失誤的可能性發生，有不少的研究方法和理論被提出。有些理論是利用改進轉換搜尋緩衝器的關聯性或增加其容量大小來降低其因衝突產生的失誤和容量不足而導致的失誤，有些研究者則提出使用超級分頁的概念來涵蓋更多的記憶體空間。這些眾多的方法當中，特別是超級分頁對於大部分的應用程式能夠有效率的降低失誤的可能性。可惜的是對於行程切換導致轉換搜尋緩衝器效能降低方面的研究卻是少之又少。為了支援近代的作業系統當中都有多重程式操作的特色，作業系統必須要有行程環境切換機制以方便將正在處理的行程切換到下一個行程，此時須清除目前轉換搜尋緩衝器所暫存的位址轉換資訊。而清除轉換搜尋緩衝器資訊的這樣一個行為，將會造成處理器效能嚴重的下降，特別是對於近代高效能的處理器。這篇論文提出了一種新穎且容易實作的轉換搜尋緩衝器架構來降低行程環境切換機制所帶來的損失同時我們更整合多重分頁的機制來降低失誤的機會。我們修改 SimpleScalar 3.0d 模擬器及使用 SPEC2000 作為我們的模擬平台環境，並比較其它轉換搜尋緩衝器的架構，模擬結果顯示出所提出的轉換搜尋緩衝器架構在傳統的 4KB 分頁大小底下可以比傳統轉換搜尋緩衝器的失誤率小上 1.3 倍，在此可見所提出之架構非常適合用於多重程式環境底下。

(4)

Translation Look-aside Buffer with Low

Context Switch Penalty

Student : Ji-Wen Zhang Advisor : Dr. Chang-Jiu Chen

Department of Computer Science and Information Engineering

National Chiao-Tung University

Abstract

It is widely known that the Translation Look-aside Buffer (TLB) plays an

important role in the address translation mechanism from virtual addresses to physical

addresses. If any miss occur, the performance of the processor will seriously degrade.

There are many methods for improving TLB performance, such as increasing the

associativity, the number of entries, or page sizes, and using superpages to cover more

memory spaces. These methodologies, especially superpage, can effectively reduce

lots of misses for most applications. However, very few designs really focused on the

context switching issue. In order to support the multiprogramming characteristics in

all modern OS, the context switching mechanism is needed and it will cause all TLB

entries be flushed and will impact on the performance very seriously, especially on

today’s high performance processors. This thesis presents a novel and easy

implemented TLB architecture to reduce the misses in context switching with

complete-subblock mechanism. All simulations were done with modified

SimpleScalar 3.0d tool suite and SPEC2000 benchmarks. The thesis also compares

several designs, including the conventional TLB, the complete-subblock TLB, and the

promotion TLB. The simulations show that the new design can achieve about 1.3

times of relative improvement of miss rate in average with 4KB page size and reveal

(5)

Acknowledgment

I wish to express my gratitude to my prime professor, Dr. Chang-Jiu Chen, and

Wei-Min Cheng for their helpful suggestions, detailed comments, and constant

support. I am also indebted to a number of my laboratory colleagues at National

Chiao-Tung University, especially to Nian-Zhi Huang, Yuan-Teng Chang, Ming-Feng

Shen and Shi-Wei Chen, for their encouragement and help. I am most grateful to my

(6)

List of Figures

FIGURE 1: STRUCTURE OF A SINGLE-PAGE-SIZE TLB BLOCK ... 5

FIGURE 2: STRUCTURE OF CONVENTIONAL TLB. ... 5

FIGURE 3: SUPERPAGE TLB ENTRY ... 8

FIGURE 4: SCHEMATIC OF HARDWARE FOR PREFETCHING... 10

FIGURE 5: COMPLETE-SUBBLOCK TLB BLOCK (SUBBLOCK FACTOR N). ... 10

FIGURE 6: DUAL TLB STRUCTURE ... 11

FIGURE 7: THE RELATIONSHIP BETWEEN THE TLB SIZE AND THE MISS RATE WITH CONTEXT SWITCHING DIVIDED BY THE MISS RATE WITHOUT CONTEXT SWITCHING. ... 17

FIGURE 8: THE RELATIONSHIP BETWEEN THE PAGE SIZE, TLB SIZE AND THE MISS RATE WITH CONTEXT SWITCHING DIVIDED BY THE MISS RATE WITHOUT CONTEXT SWITCHING. ... 19

FIGURE 9: THE RELATIONSHIP BETWEEN THE TLB MISS RATE AND TLB SIZES WITH TRADITIONAL 4KB PAGE SIZES. ... 23

FIGURE 10: THE RELATIONSHIP BETWEEN THE TLB MISS RATE AND PAGE SIZE.... 26

FIGURE 11: A LOW CONTEXT SWITCHING MISS RATE TLB ARCHITECTURE. ... 28

FIGURE 12: DTLB/ITLB MISS RATE COMPARISON WITH 1MB PAGE SIZE... 32

FIGURE 13: TLB MISS RATE COMPARISON WITH 4KB PAGE SIZE. ... 35

FIGURE 14: NEW PROPOSED TLB ARCHITECTURE WITH LOW CONTEXT SWITCHING PENALTY. ... 37

FIGURE 15: SIMULATOR STRUCTURE. ... 43

FIGURE 16: SIMPLESCALAR ARCHITECTURE INSTRUCTION FORMATS. ... 45

FIGURE 17: THE COMPARISON OF RELATIVE IMPROVEMENT WITH DIFFERENT PAGE SIZES... 53

FIGURE 18: THE MEAN OF THE AVERAGE RELATIVE IMPROVEMENT FOR THE TWELVE SPEC2000 BENCHMARKS. ... 54

FIGURE 19: THE RELATIVE IMPROVEMENT OF MISS RATE IN GZIP. ... 59

(9)

FIGURE 21: THE RELATIVE IMPROVEMENT OF MISS RATE IN GCC. ... 61

FIGURE 22: THE RELATIVE IMPROVEMENT OF MISS RATE IN CRAFTY. ... 62

FIGURE 23: THE RELATIVE IMPROVEMENT OF MISS RATE IN VORTEX. ... 63

FIGURE 24: THE RELATIVE IMPROVEMENT OF MISS RATE IN LUCAS... 64

FIGURE 25: THE RELATIVE IMPROVEMENT OF MISS RATE IN TWOLF. ... 65

FIGURE 26: THE RELATIVE IMPROVEMENT OF MISS RATE IN SWIM... 66

FIGURE 27: THE RELATIVE IMPROVEMENT OF MISS RATE IN PERLBMK. ... 67

FIGURE 28: THE RELATIVE IMPROVEMENT OF MISS RATE IN APPLU... 68

FIGURE 29: THE RELATIVE IMPROVEMENT OF MISS RATE IN EQUAKE. ... 69

(10)

List of Tables

TABLE 1: SUMMARY OF THE SIMULATED SPEC2000 BENCHMARKS ALONG WITH THE INPUT SETS. ... 48 TABLE 2: SPEC 2000 BENCHMARKS DESCRIPTION... 48 TABLE 3: MEMORY SPACE COVERAGE COMPARISON OF FOUR TLBS... 50

(11)

Chapter 1 Introduction

To support large memory requirements for modern applications, all new

advanced general-purpose processors will support the virtual memory. The virtual

memory is one of the few interfaces through which the architecture and operating

system interact directly and is developed to automate the movement of program code

and data between main memory and secondary storage to give the appearance of a

single large memory system.

In order to support virtual memory, the address translation mechanism is needed.

It is well known that all the address translations are stored in main memory and

maintained by the operating system; to reduce the cost of address translation, the

translation look-aside buffers (TLBs) [10] are implemented inside the processor. If

there’s any TLB miss occurring, at least two or three memory accesses are needed to

fetch the translation from main memory by the memory management unit (MMU).

A case study on TLB miss handling [16]: It has been shown to constitute as

much as 40% of execution time and up to 90% of a kernel’s computation. Studies

with specific applications have also shown that the TLB miss rate can account for

over 10% of execution time even with an optimistic 30-50 cycles miss overhead.

With the VLSI technology improving rapidly, the new microprocessors become

much faster than ever before and it causes the gap between memories and the

processor core larger and larger. We can easily find that the TLB is in the critical path

(12)

1.1 Problem Definition

To enhance the TLB performance, several studies have been made in this field.

However, little attention has been given to the context switching problem under

multiprogramming environment. The context switches are wiping out locality from

one application to another and cause the flush operations for whole of the TLB entries.

It affects the TLB performance very seriously.

In [6], we propose a novel and easy implement TLB structure to reduce the miss

rate caused by the context switching. We divide the 256 entries into 32 banks storing

translation information of different tasks to avoid flushing all TLB when context

switching occurs. The structure is so easy to implement and reducing miss rate

effectively. However, it needs large page size as base page in our previous structure,

for example, 512KB, 1MB or larger page size. Large page sizes will waste memory

space seriously by increasing internal fragmentation, although it can provide the

advantage of increasing the overall coverage of memory mapping.

To overcome and improve the limitation in our original structure, our research is

intended as a study of how to reduce the miss rate under context switching but using

smaller base page size, such as 4KB, 8KB, or 16KB, to implement our new novel

TLB architecture. In the study of banked-promotion TLB structure proposed by Lee et

al. [17] [18], they promoted four consecutive 4KB pages from one banked-TLB into a

16KB page stored in another banked-TLB dynamically via simple hardware control

without any O/S support. We improve and develop this idea a little further to design

our novel TLB structure. We combine both the features of the dual TLB and our

original TLB with many TLB banks in our new structure. Our TLB architecture not

(13)

switching occurs. With the new design, we can obtain the advantages of low power

consumption by decreasing the amount of fully associative TLB entries to be accessed

at one time and less internal fragment problem compared with our original TLB

design.

The simulations were be done by the modified SimpleScalar version 3.0d tool

suite with SPEC 2000 benchmark. We modified the original SimpleScalar version

3.0d tool suite to accommodate our requirements.

1.2 Roadmap

The rest of the thesis is organized as follows. In Chapter 2, we begin with

reviewing several hardware enhancements for TLB. Then we show that the context

switching will be the performance bottleneck in TLB and discuss the relationships

between the miss rates, page sizes and TLB sizes. In Chapter 3, we will first review

our recently TLB architecture with low context switching penalty. Then, we will

develop our new novel TLB architecture to reduce miss rate in context switching. The

expected performance is demonstrated in Chapter 4. Finally, in Chapter 5 we

(14)

Chapter 2 Related Works

In chapter 1 we discussed that virtual to physical address translation is one of the

most critical operations in computer systems since this is invoked on every instruction

fetch and data reference. To speed up the address translation, computer systems that

use page based virtual memory provide a cache of recent translations called the

Translation Look-aside Buffer (TLB). With the instruction level parallelism, clock

frequencies, and the working sets increasing, the amount of research about

enhancement for TLB increases. In Section 2.1 we will begin with reviewing the

conventional TLB structure. Then we classify all the represented method according to

their research purposes, and give a survey of these mechanisms. In Section 2.2, we

will study the context switching penalty in TLB. In Section 2.3, we will discuss the

relationships between the miss rates, page sizes and TLB sizes.

2.1 TLB mechanisms

2.1.1 Conventional TLB Structure

The address translation acceleration mechanism using translation look-aside

buffer (TLB) is based on the principle of temporal locality. A TLB can be considered

to be a hardware cache used to contain recently used virtual-to-real address

translations. If the TLB has a matching translation — a TLB hit, it outputs the physical address and memory access attributes. If the TLB has no matching

translation — a TLB miss, special hardware or software fetches the missing translation and loads it into TLB.

(15)

The tag contains the virtual page number (VPN) of the translation and a valid bit (V).

The data part stores the corresponding physical page number (PPN) bits and page

attributes (ATTR), e.g., protection, cache-ability, referenced/ modified bits, as shown

in Figure 1.

Figure 1: Structure of a single-page-size TLB block

Many such TLB blocks can be combined in either fully-associative or

set-associative. In either case, a tag array stores all the tags and includes comparators

to compare them with the input VA. A random-access-memory (RAM) stores the data

parts of the blocks. During a TLB lookup, the input VA is split into two parts that are

the VPN and offset. The offset field, without any translation, appends to the PPN

output from the TLB. The TLB compares the VPN stored in the tag with the input

VPN. Only TLB blocks that contain a valid translation participate in the comparison.

The result of tag comparison outputs the correct PPN and attributes from the RAM. If

no block has matching tag, the TLB generate a TLB miss signal, as

shown in Figure 2.

Figure 2: Structure of conventional TLB.

VPN V PPN ATTR

(16)

2.1.2 Several TLB implementations

The TLB is a cache used to speeding up access to entries in the page tables,

where complete information on virtual memory to physical memory mapping are

maintained. We consider two general possibilities for the TLB implementation:

z A single TLB shared between instruction and data caches.

To reduce the contention miss, we can implement dual-ported TLB. This

introduces complex circuitry, doubling the size of the TLB without increasing its

capacity.

z Independent TLBs for instruction and data caches.

In general, the instruction reference streams exhibit greater locality than

data reference streams. So, the instruction TLB should be mad smaller than data

TLB. Furthermore, the instruction TLB and data TLB could be implemented

independently to get best TLB performance.

2.2 Enhancements for the TLB

2.2.1 Reducing TLB Access Time

z Multi-level TLB

Cache has a property that the smaller hardware is faster. Many processors,

such as the Itanium IA-64 [11] (32-entry L1, 96-entry L2), AMD Athlon

(32-entry L1, 256-entry L2) etc. provide multi-level TLB structures, instead of a

single large TLB. The larger L2 TLB will be accessed only after the smaller L1

TLB miss occurs. With a smaller first level TLB, the average TLB access time

(17)

Assume that the access time for a single monolithic TLB is a. Let the access

time for the first and second level TLBs in the hierarchical alternative be a1 and

a2 respectively. Let us denote the miss fraction of the monolithic TLB, the first

and second level of the hierarchical TLB to be m, m1, and m2 respectively. Also,

let the cost of fetching a translation that is not in the TLB be denoted by C.

Then, the cost of translating an address in the monolithic structure (Cm) is

given by

C m a

C_m = + × (2.1) The cost of translating an address in the 2-level TLB (Cs) is given by

) ( ₂ ₂ 1 1 m a m C a C_s = + × + × (2.2) The 2-level TLB is a better alternative when

C m a C m a m a₁+ ₁×( ₂ + ₂× )_p + × (2.3)

2.2.2 Reducing TLB Miss Rate

The classical approach to improve TLB performance is to reduce the miss rates,

and we present several techniques here to accomplish this goal.

z

Making the TLB hold more entries

This is a conventional and easy method to improve TLB performance.

Processors, such as Intel Pentium !!! Processor [11] use 512 entries 4K page

fully-associative or set-associative TLB to reduce the miss rate. But the side effect

is that

Longer memory reference latency can be occurred.

(18)

z

Using large page size

This is a method with less hardware support to improve the overall coverage

of memory mapping and to reduce the miss rate effectively; however, the

disadvantages is that

It wastes memory space seriously because of increasing internal fragment.

z

Using Superpage to improve TLB coverage

Applications with larger working sets can incur many TLB misses and

suffer from TLB penalty. To alleviate the problem of wasting memory coverage

without increasing the number of TLB entries or page size, most modern

general-purpose CPUs, such as the new Intel Processors from Pentium Pro begin

to provide larger page with sizes of 2MB and 4MB [13]. TLBs that support

superpage use a single entry for translating a set of consecutive virtual pages as

long as these pages are located physically contiguous.

Figure 3 shows the format of a superpage TLB entry. The MASK field

prevents certain tag bits from participating in tag comparison for superpage

mappings and the SZ attribute controls a multiplexer during physical address

generation.

Figure 3: Superpage TLB entry

Log2(n)

n=number of supported page sizes Log2(s) 1

max superpage size s=

base page size

VPN MASK V PPN ATTR SZ

(19)

The restrictions for superpage are:

The superpage must be mapped only to contiguous and aligned

physical pages.

Using superpage requires large operating system support and causes

significant overhead. For example, it increases the amount of I/O, page

utilization overhead, and page fault penalty.

z Prefetching TLB Mechanisms

Although there are many literatures on prefetching techniques for memory

hierarchy, it is only recently [15] [19] that the issue of prefetching TLB entries to

hide all or some of the miss costs has started drawing interest. The reason is that

the TLB is more important than other levels of memory hierarchy and we fear of

slowing down the critical path of TLB accesses due to memory traffic.

Generally speaking, the prefteching mechanisms can be viewed in two

classes: Arbitrary Stride prefetching (ASP) and Stride prefetching (SP) capture

the strided reference pattern, and Markov prefetching (MP), Recency prefetching

(RP) and Distance prefetching (DP) exploit history information about relative

recent usage of pages to predict future TLB misses. All of these techniques bring

the prefetched entry into a prefetch buffer that is concurrently looked up with the

TLB; the entry is moved to the TLB entry only after an actual reference to that

entry by the application. In the Figure 4 we show the schematic of generic

(20)

Figure 4: schematic of hardware for prefetching.

2.2.3 Reducing TLB Hardware Cost

z

Complete-Subblock TLB

Complete-subblock TLB structure is a TLB that have the same TLB reach

advantages of medium sized superpages and exploit spatial locality to improve

TLB performance without any operating system support [25] [26]. The main idea

of complete-subblock is to allow a single TLB block to map multiple base pages

to increase the coveraged memory space, as shown in Figure 5.

Figure 5: Complete-subblock TLB block (subblock factor n).

Take a complete-subblock TLB with subblock facor 4 and with 4KB base VPN BV V0 PPN0 ATTR0

Tag Data

V1 PPN1 ATTR1

(21)

page for example, the VPN in tag RAM represents the tag of a 16KB page and

each PPN in data RAM represents one of four sequential 4KB pages. The

disadvantages of complete-subblock TLB are:

The unused slots at each TLB block may occur and seriously waste

hardware cost.

If one small page entry is to be updated, all four 4KB pages could be

invalidated, resulting in performance degradation.

2.2.4 Low Power TLB

z

Banked-promotion TLB structure

The proposed dual TLB is a new structure which counteracts the defects

of the complete-subblock TLB and supports two page sizes via a hardware

approach. The proposed dual TLB is organized as two parts of a conventional

small page (4KB size) TLB and a large page (16KB size) TLB. Both are

designed with fully associative structures. Figure 6 shows the promotion TLB.

Figure 6: Dual TLB structure

16K byte TLB hit 4K byte TLB hit

Promotion hit (old PPN)

Promotion hit (new PPN) Promotion miss 16Kbyte VPN tag 4Kbyte VPN tag Tag (CAM) VPN (4K) Tag (CAM) VPN (16K) Data (SRAM) PPN (4K byte)

SRAM SRAM SRAM SRAM

PPN (4K byte) VPN To cache tag New PPN

(22)

When one virtual address is generated, the 4KB page and 16KB page TLBs

will be searched at the same time. The behavioral principle of the dual TLB is as

follows:

Case 1: Hit in small page TLB

If a small page is founded in the small page TLB, the actions are not

different at all from any conventional TLB hit. The requested physical address is

sent to the cache and compared with tag bit of cache.

Case 2: Hit in large page TLB

If a hit occurs at the large TLB, only one of four PPNs in the large TLB is

enabled at the same time. Thus the power consumption is decreased. As in the

case of a small page TLB hit, the requested physical address is sent to the cache

and compared with tag bit of cache.

Case 3: Miss in both places

When one virtual address turns out to be a TLB miss in both small page and

large page TLB, O/S performs miss handling. When a TLB miss occurs and if its

corresponding three sequential VPN exists in the small page TLB, those four

sequential VPNs belonging to a 16KB page boundary are chosen to be promoted,

and the three sequential VPNs in the small page TLB are invalidated at the same

(23)

2.3 The Context Switching Penalty in TLB

In the chapter 1 we discuss the context switching problem which causes the TLB

in the MMU to be flushed, and the miss penalty will impact on TLB performance

seriously. However, very few attempts have been made for this issue. In this section

we will show some simulations to confirm that flushing TLB entries when context

switching would cause the miss rate increase.

Before the first step in our analysis of TLB misses, we must know what category

of the TLB misses caused by context switching belongs to. General speaking, we can

classify all TLB misses into three simple categories:

z Compulsory misses— The first access to a block cannot be found in the TLB,

so the block must be brought from main memory into the TLB. These are also

called cold-start misses or first-reference misses.

z Capacity misses — If the TLB cannot contain all the blocks need during

execution of a program, capacity misses will occur because of blocks being

discarded and later retrieved.

z Conflict misses — If the block placement is set associative or direct mapped,

conflict misses will occur because a block may be discarded and later retrieved

if too many blocks map to its set. These are also called collision misses or

interference misses.

Obviously, the TLB misses caused by context switching belong to the

compulsory misses due to flushing TLB when context switching. Now, we will take a

close look at some simulations of TLB misses under context switching in comparison

with TLB misses without context switching. We assume the context switching would

(24)

We want to see how many times the miss rate with context switching are as large

as with no context switching. Figure 7 shows the times of the miss rate with context

switching divided by the miss rate without context switching for the twelve

SPEC2000 benchmarks using from 4-entry to 1024-entry fully-associative TLB with a

page size of 4KB. In addition, we make a comparison between different page size,

4KB, 16KB and 64KB for the times in Figure 8.

The Figure 7 and Figure 8 tell us that:

z The times of the miss rate with context switching divided by the miss rate

without context switching would increases if the number of TLB blocks

increases. In other words, the context switching would impact on the TLB

performance more seriously if we have more TLB entries. However, the most

simple way of reducing the TLB miss rate is to increase TLB entries; for

example, the lease AMD OpteronTM processor has both 512-entry L2

instruction TLB (ITLB) and L2 data TLB (DTLB) [1] and the IBM POWER

processor has a common 1024-entry TLB for each processor core [28].

z If the application, such as vortex, need only short computation time and the

computation can be finished before the first context switching, the TLB could

keep performance from decreasing.

z The context switching would impact on the TLB performance more seriously if

we increase page size.

GZIP 0 5 10 15 20 25 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

Miss Rate (with cnotext switches)

0 20 40 60 80 100 120 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

(25)

VPR GCC CRAFTY PERLBMK VORTEX 0 100 200 300 400 500 600 700 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 50 100 150 200 250 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

0 10 20 30 40 50 60 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 10 20 30 40 50 60 70 80 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

0 20 40 60 80 100 120 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 10 20 30 40 50 60 70 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 0.5 1 1.5 2 2.5 3 3.5 4 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.91 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.91 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

(26)

TWOLF SWIM LUCAS APPLU EQUAKE 0 20 40 60 80 100 120 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0.98 0.99 1 1.01 1.02 1.03 1.04 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 1 2 3 4 5 6 7 8 9 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 5 10 15 20 25 30 4 8 16 32 64 128 256 512 1024 TLB Size ITLB Miss Rate (no cnotext switches)

0 0.2 0.4 0.6 0.8 1 1.2 1.4 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

0 0.2 0.4 0.6 0.81 1.2 1.4 1.6 1.82 4 8 16 32 64 128 256 512 1024 DTLB Miss Rate (no cnotext switches)

0 50 100 150 200 250 300 350 400 4 8 16 32 64 128 256 512 1024 ITLB Miss Rate (no cnotext switches)

(27)

MGRID

Figure 7: The relationship between the TLB size and the miss rate with context switching divided by the miss rate without context switching.

GZIP VPR GCC DTLB 0 10 20 30 40 50 60 4 8 16 32 64 128 256 512 1024 TLB Size

4K page size 8K page size 16K page size

Miss Rate (no cnotext switches)

ITLB 0 50 100 150 200 250 300 350 400 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 200 400 600 800 1000 1200 1400 1600 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 200 400 600 800 1000 1200 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 50 100 150 200 250 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 50 100 150 200 250 4 8 16 32 64 128 256 512 1024 TLB Size

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 4 8 16 32 64 128 256 512 1024 TLB Size DTLB Miss Rate (no cnotext switches)

(28)

CRAFTY PERLBMK VORTEX TWOLF SWIM DTLB 0 20 40 60 80 100 120 140 160 180 200 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 20 40 60 80 100 120 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 1 2 3 4 5 6 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 1 2 3 4 5 6 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 0.2 0.4 0.6 0.8 1 1.2 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 0.2 0.4 0.6 0.8 1 1.2 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 50 100 150 200 250 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 50 100 150 200 250 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 4 8 16 32 64 128 256 512 1024 TLB Size Miss Rate (with cnotext switches)

ITLB 0 5 10 15 20 25 30 35 40 45 50 4 8 16 32 64 128 256 512 1024 TLB Size Miss Rate (with cnotext switches)

(29)

LUCAS

APPLU

EQUAKE

MGRID

Figure 8: The relationship between the page size, TLB size and the miss rate with context switching divided by the miss rate without context switching.

DTLB 0 5 10 15 20 25 30 35 40 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 10 20 30 40 50 60 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 50 100 150 200 250 4 8 16 32 64 128 256 512 TLB Size

DTLB 0 10 20 30 40 50 60 70 80 90 100 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 100 200 300 400 500 600 700 800 4 8 16 32 64 128 256 512 1024 TLB Size

DTLB 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 4 8 16 32 64 128 256 512 1024 TLB Size

ITLB 0 20 40 60 80 100 120 140 160 180 4 8 16 32 64 128 256 512 1024 TLB Size

(30)

2.4 Relationships between the Miss Rates,

Page Sizes and TLB Sizes

In this section we will limit the discussion to the relationships between the miss

rates, page sizes and TLB sizes, and will not be concerned with the issue of context

switching. It is well-known that the most important two issues for cache system

performance are to reduce the miss rate and miss penalty. It is almost the same for the

TLB performance. In fact, the most important of all is the miss rate issue. That’s why

we focused on the miss rate in our research. In order to select the suitable page size,

we did some study on the relationships between the miss rates, page sizes and TLB

sizes.

First, we consider the relationship between the miss rate and TLB sizes with

traditional 4KB page sizes. Figure 9 illustrates the relationship between TLB sizes and

the miss rate for twelve SPEC2000 benchmarks.

The Figure 9 tells us that:

z Some applications such as gzip, gcc, crafty, perlbmk, vortex, swim, applu, equake

and mgrid have better performance with sizes over 128 entries.

z For some applications such as the vpr, twolf and lucas benchmarks, it is very

clear that only 64 or less entries are enough and it is helpless to increase the

number of TLB entries.

z These benchmarks seeks to capture the fact that it does not really need to provide

TLB with over 128 or 256 entries with 4KB page size and we can consider what

(31)

GZIP VPR GCC CRAFTY PERLBMK 26.051412% 9.414523% 2.605881% 0.000457%0.000020%0.000014%0.000014%0.000014%0.000014% 0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.264745% 0.000439%0.000119%0.000005%0.000003%0.000003% 0.000003% 0.000003% 0.000003% 0.0% 0.1% 0.1% 0.2% 0.2% 0.3% 0.3% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 12.522859% 2.355942% 0.561549% 0.022578%0.008764%0.006546%0.006422%0.006241%0.001011% 0.0% 2.0% 4.0% 6.0% 8.0% 10.0% 12.0% 14.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.000191% 0.000081% 0.000028% 0.000005%0.000005%0.000005%0.000005%0.000005%0.000005% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 23.643239% 10.299283% 4.611126% 2.284842% 0.854234%0.179235%0.034337%0.000708%0.000708% 0.0% 5.0% 10.0% 15.0% 20.0% 25.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.583282% 0.318779% 0.014422% 0.001330%0.000104%0.000040%0.000040%0.000040%0.000040% 0.0% 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 9.997447% 3.727393% 1.252109% 0.308511%0.055602%_0.019287%0.014659%_0.003731%0.000507% 0.0% 1.0% 2.0% 3.0% 4.0% 5.0% 6.0% 7.0% 8.0% 9.0% 10.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.492205% 0.320007% 0.186461% 0.048003% 0.005662% 0.001399% 0.000167% 0.000058% 0.000058% 0.0% 0.1% 0.1% 0.2% 0.2% 0.3% 0.3% 0.4% 0.4% 0.5% 0.5% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 16.715474% 6.974124% 2.802387%_0.834551% 0.036460%0.006401%0.006309%0.006309%0.006309% 0.0% 2.0% 4.0% 6.0% 8.0% 10.0% 12.0% 14.0% 16.0% 18.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 1.525389% 1.048359% 0.596435% 0.219500% 0.013633%_0.002772%0.002734% 0.002734%0.002734% 0.0% 0.2% 0.4% 0.6% 0.8% 1.0% 1.2% 1.4% 1.6% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB

(32)

VORTEX TWOLF SWIM LUCAS APPLU 48.089902% 35.231291% 0.348408%0.306035%0.294403%0.294398%0.294393%0.294384%0.287964% 0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0% 50.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.000492% 0.000361% 0.000143% 0.000036%0.000021%0.000021%0.000020%0.000020%0.000020% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 12.732166% 3.202485% 0.587290%0.001947%0.000075%0.000074%0.000074%0.000074%0.000074% 0.0% 2.0% 4.0% 6.0% 8.0% 10.0% 12.0% 14.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.536178% 0.346814% 0.003395% 0.001617%0.000037%0.000034%0.000034%0.000034%0.000034% 0.0% 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 2.654981% 1.001725% 0.002871%0.002320%0.001234%0.000807%0.000733%0.000733%0.000733% 0.0% 0.5% 1.0% 1.5% 2.0% 2.5% 3.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.770339% 0.513528% 0.354544% 0.000128%0.000098%0.000094%0.000094%0.000094%0.000094% 0.0% 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% 0.7% 0.8% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 2.116731% 0.466686% 0.093479%0.072942%0.067040%0.062791%0.059014%0.059014%0.059014% 0.0% 0.5% 1.0% 1.5% 2.0% 2.5% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.459231% 0.237901% 0.026081% 0.010087%0.007349%0.007349%0.007349%0.007349%0.007349% 0.0% 0.1% 0.1% 0.2% 0.2% 0.3% 0.3% 0.4% 0.4% 0.5% 0.5% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 19.608754% 2.121957%_0.963909%0.683002% 0.196602%0.196050%0.185343%0.175293%0.123493% 0.0% 2.0% 4.0% 6.0% 8.0% 10.0% 12.0% 14.0% 16.0% 18.0% 20.0% Miss rate 4 8 16 32 64 128 256 512 1024 DTLB 0.022996% 0.000805%0.000506%_0.000202%0.000006%_0.000004%0.000004% 0.000004%0.000004% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% Miss rate 4 8 16 32 64 128 256 512 1024 ITLB

(33)

EQUAKE

MGRID

Figure 9: The relationship between the TLB miss rate and TLB sizes with traditional 4KB page sizes.

Next, we consider the relationships between the miss rates and page sizes.

Another solution to improve the performance of TLB is to extend the page size into

larger one. It is easy to find that some modern processors begin to provide multiple

page sizes, such as 4KB, 2MB and 4MB sizes, on all Intel advanced x86 processors

after the Pentium Pro processor [13]. The advantages of larger page sizes are not only

obtaining better performance but saving the implementation cost with less tags

(virtual page number, VPN) and translations (physical page number, PPN) needed to

be stored. It is also a good method to reduce the cost on TLB implementation of

processors with larger addressing space, such as processors with 64-bit addressing

space. Certainly, it is suitable to implement on the processors core of SoC or

embedded systems.

What would happen if we extend the page size to 16KB, 64KB, 256KB and 1MB. 16.821638% 4.793879% 3.116490% 0.114468% 0.045552%0.030129%0.023264%0.019313%0.016903% 0.0% 2.0% 4.0% 6.0% 8.0% 10.0% 12.0% 14.0% 16.0% 18.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.551231% 0.354006% 0.017179% 0.000003%0.000003%0.000003%0.000003%0.000003%0.000003% 0.0% 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB 27.939944% 1.389092%_0.067141%0.066813%_0.064956%0.062915%_0.046036%0.045661%_0.043267% 0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size DTLB 0.000331% 0.000259% 0.000141% 0.000005%0.000002%0.000002%0.000002%0.000002%0.000002% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% Miss rate 4 8 16 32 64 128 256 512 1024 TLB Size ITLB

(34)

Figure 10 below shows the miss rate for twelve SPEC2000 benchmarks of 4KB,

16KB, 64KB, 256KB and 1MB page sizes with eight TLB sizes – 8, 16, 32, 64, 128,

256, 512 and 1024. Observing the results, the figure indicates that for all benchmarks:

z The ITLB/DTLB performance of 1MB (or 256KB) page with eight entries can

greatly outperform that of 4KB page with 256 entries.

z It is helpless to increase the number of TLB entries when we using large page

size for all benchmarks.

z Benchmarks which scatter references across a sparse address have little needs

from large pages without significantly increased memory usage.

GZIP DTLB Miss Rate 0.0000% 0.5000% 1.0000% 1.5000% 2.0000% 2.5000% 8 16 32 64 128 256 512 1024 TLB Size M is s R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0000% 0.0000% 0.0001% 0.0001% 0.0001% 8 16 32 64 128 256 512 1024 TLB Size M is s R at e 4KB 16KB 64KB 256KB 1MB VPR DTLB Miss Rate 0.0000% 2.0000% 4.0000% 6.0000% 8.0000% 10.0000% 8 16 32 64 128 256 512 1024 TLB Size M iss R ate 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0001% 0.0002% 0.0003% 0.0004% 0.0005% 8 16 32 64 128 256 512 1024 TLB Size M iss R ate 4KB 16KB 64KB 256KB 1MB GCC DTLB Miss Rate 0.0000% 0.5000% 1.0000% 1.5000% 2.0000% 2.5000% 3.0000% 3.5000% 4.0000% 8 16 32 64 128 256 512 1024 TLB Size M iss R te 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 8 16 32 64 128 256 512 1024 TLB Size M iss R ate 4KB 16KB 64KB 256KB 1MB

(35)

CRAFTY DTLB Miss Rate 0.0000% 2.0000% 4.0000% 6.0000% 8.0000% 10.0000% 12.0000% 8 16 32 64 128 256 512 1024 TLB Size M is s R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 8 16 32 64 128 256 512 1024 TLB Size M is s R at e 4KB 16KB 64KB 256KB 1MB PERLBMK DTLB Miss Rate 0.0000% 1.0000% 2.0000% 3.0000% 4.0000% 5.0000% 6.0000% 7.0000% 8.0000% 8 16 32 64 128 256 512 1024 TLB Sizse M iss R ate 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.2000% 0.4000% 0.6000% 0.8000% 1.0000% 1.2000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB VORTEX DTLB Miss Rate 0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 0.4000% 0.4500% 0.5000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB TWOLF DTLB Miss Rate 0.0000% 0.5000% 1.0000% 1.5000% 2.0000% 2.5000% 3.0000% 3.5000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 0.4000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB SWIM DTLB Miss Rate 0.0000% 5.0000% 10.0000% 15.0000% 20.0000% 25.0000% 30.0000% 35.0000% 40.0000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0001% 0.0001% 0.0002% 0.0002% 0.0003% 0.0003% 0.0004% 0.0004% 8 16 32 64 128 256 512 1024 TLB Size M iss R ate 4KB 16KB 64KB 256KB 1MB

(36)

LUCAS DTLB Miss Rate 0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB MIss Rate

0.0000% 0.0200% 0.0400% 0.0600% 0.0800% 0.1000% 0.1200% 0.1400% 0.1600% 0.1800% 8 16 32 64 128 256 512 1024 TLB Size M iss R ate 4KB 16KB 64KB 256KB 1MB APPLU DTLB Miss Rate 0.0000% 0.5000% 1.0000% 1.5000% 2.0000% 2.5000% 8 16 32 64 128 256 512 1024 TLB Sizse M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0001% 0.0002% 0.0003% 0.0004% 0.0005% 0.0006% 0.0007% 0.0008% 0.0009% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB EQUAKE DTLB Miss Rate 0.0000% 1.0000% 2.0000% 3.0000% 4.0000% 5.0000% 6.0000% 8 16 32 64 128 256 512 1024 TLB Sizse M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0500% 0.1000% 0.1500% 0.2000% 0.2500% 0.3000% 0.3500% 0.4000% 8 16 32 64 128 256 512 1024 TLB Size M iss R at e 4KB 16KB 64KB 256KB 1MB MGRID DTLB Miss Rate 0.0000% 0.2000% 0.4000% 0.6000% 0.8000% 1.0000% 1.2000% 1.4000% 1.6000% 8 16 32 64 128 256 512 1024 TLB Sizse M iss R at e 4KB 16KB 64KB 256KB 1MB

ITLB Miss Rate

0.0000% 0.0001% 0.0001% 0.0002% 0.0002% 0.0003% 0.0003% 8 16 32 64 128 256 512 1024 TLB Sizse M iss R at e 4KB 16KB 64KB 256KB 1MB

(37)

Chapter 3 Proposed Mechanisms

Chapter 3 describes in detail with the two TLB structures we proposed for low

context switching penalty. Our TLB structures can be implemented not only in

contemporary processors but future processors comprised with one billion of

transistors. Furthermore, they are especially suitable to be implemented on processors

with larger addressing space than current processors with just 32-bit addressing ability.

We will review our original TLB architecture for processors with larger page size

support in Section 3.1. Then, we propose our new TLB architecture with general page

size, such as 4KB, 8KB, or 16KB, in the Section 3.2. Last, we will discuss the

mechanisms of our new novel TLB in Section 3.3.

3.1 The Original TLB Structure with Low

Context Switch Penalty

To reduce the miss rate, most designs just try to increase the TLB size to reduce

the capacity misses; however, we have showed in previous chapter that it is also

helpful if the page sizes can be enlarged. Furthermore, with large page size, we can

make use of more redundant TLB entries to store translation information of other

tasks and the size of tags and translations needed to be stored can be much smaller.

Thus, we used 1MB page size in our design.

3.1.1 Structure

Figure 11 shows in detail our original TLB structure to reduce the miss rate in

(38)

TLB banks with group tags, and a multiplexer to select a specific TLB banks.

Figure 11: A low context switching miss rate TLB architecture.

Each TLB bank contains eight entries, and the tag can be implemented with

CAM (content addressable memory) which is the same as that being implemented on

conventional TLB. In addition, each TLB bank is implemented with fully associative

with LRU replacement policy. There are total 32 or more TLB banks. Though there

are 32 banks, compared with 256-entry conventional TLB the total cost is not

increased very much. In fact, there are also total 256 (32×8) entries in our original proposed structure. Furthermore, because of larger page size, the cost of each entry is

decreased. Thus the increased cost can be counteracted.

Except the 32 TLB banks, there are also 32 extra registers to store the bank tag

as shown in Figure 3.1. The register contains

z Task tag: Identify each task.

z Current bit: Identify current working task.

(39)

We have to point out that the task tag can be PID (process ID) or the PPN

(physical page number) of the executing instruction when context switching occurs.

The PID is selected as task tag on systems that the PID will be sent into the processor;

otherwise, the PPN of the executing instruction when context switch occurs from the

PPN field (or last translation) is selected. Considering the general case, the PPN is

selected; however, the PID can be more easily selected and implemented under the

previous situation. The discussion will be ignored in this thesis.

3.1.2 OS Support and Implementation

In order to implement our original TLB mechanism, the OS is need to do a little

modification. Except larger page size, the OS needs to send ‘the clear TLB signal’ to

the processor only when swapping pages with disks occurs or page frames release.

Fortunately, it is very easy to realize. Most modern processors provide some ways to

flush TLB entries, such as using STA instruction with alternative address on Sparc

processors [22].

3.1.3 Expected Performance

We simulated the twelve SPEC2000 benchmarks to demonstrate the expected

performance. We assumed that the context switching would happen after executing

one million instructions. We compared the miss rates of conventional 256-entry TLB

with flushing all entries after context switching and our novel TLB structure with

8-entry each bank after correctly keeping entries. The page size is 1MB. Figure 12

shows the simulation results of the SPEC2000 benchmarks.

We can find that for most benchmarks we can deliver better performance than

(40)

applu and equake benchmarks because the working set of the three benchmarks need

more coverage space of memory mapping than other benchmarks. It is noteworthy

that the vortex benchmark needs only shorter computation time than others, and

therefore has the same performance as conventional TLB.

GZIP VPR GCC CRAFTY DTLB Comparison 0.00000514% 0.00088424% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000015% 0.00010008% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.00000053% 0.00082991% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000006% 0.00010002% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.00000230% 0.00124329% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% 0.00120000% 0.00140000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000030% 0.00019796% 0.00000000% 0.00005000% 0.00010000% 0.00015000% 0.00020000% 0.00025000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.00000562% 0.00109438% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% 0.00120000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000051% 0.00010032% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M

iss ra te

(41)

PERLBMK VORTEX TWOLF SWIM LUCAS DTLB Comparison 0.00013916% 0.00083497% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00001912% 0.00011472% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000% 0.00014000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 12.40346491% 0.00414461% 0.00000000% 2.00000000% 4.00000000% 6.00000000% 8.00000000% 10.00000000% 12.00000000% 14.00000000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000023% 0.00010012% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.00070817% 0.00070817% 0.00000000% 0.00010000% 0.00020000% 0.00030000% 0.00040000% 0.00050000% 0.00060000% 0.00070000% 0.00080000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00014410% 0.00014410% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000% 0.00014000% 0.00016000%

iss ra te

(42)

APPLU

EQUAKE

MGRID

Figure 12: DTLB/ITLB miss rate comparison with 1MB page size.

DTLB Comparison 0.00414659% 0.07041661% 0.00000000% 0.01000000% 0.02000000% 0.03000000% 0.04000000% 0.05000000% 0.06000000% 0.07000000% 0.08000000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000006% 0.00010008% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.01383976% 0.00157919% 0.00000000% 0.00200000% 0.00400000% 0.00600000% 0.00800000% 0.01000000% 0.01200000% 0.01400000% 0.01600000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000007% 0.00010005% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.00023415% 0.00123320% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000% 0.00120000% 0.00140000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00000003% 0.00010002% 0.00000000% 0.00002000% 0.00004000% 0.00006000% 0.00008000% 0.00010000% 0.00012000%

iss ra te

(43)

3.2 The New TLB Structure with Low

Context Switch Penalty

We have surveyed our original TLB mechanism with low context switching

penalty in section 3.1. It works well and has good performance when page size is

large. On the contrary, it works badly with small page size, such as 4KB, 8KB or

16KB page size. We represent the results when using 4KB page size in our original

TLB with 8-entry per bank in Figure 13. As the diagram indicates, the performance

obviously degrades and our original structure is not suitable for small page size.

GZIP VPR GCC DTLB Comparison 0.023559417 0.00020937 0 0.005 0.01 0.015 0.02 0.025 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 8.06162E-07 5.65191E-06 0 0.000001 0.000002 0.000003 0.000004 0.000005 0.000006

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.094145235 8.98897E-05 0 0.02 0.04 0.06 0.08 0.1 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 4.38947E-06 8.11576E-06 0 0.000002 0.000004 0.000006 0.000008 0.00001

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.037273927 0.000322012 0 0.005 0.01 0.015 0.02 0.0250.03 0.035 0.04 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00320007 4.28911E-05 0 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035

iss ra te

(44)

CRAFTY PERLBMK VORTEX TWOLF SWIM DTLB Comparison 0.102992835 0.000825084 0 0.02 0.04 0.06 0.08 0.1 0.12 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.003187791 2.57409E-05 0 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.069741239 0.000255129 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.010483593 0.000102675 0 0.002 0.004 0.006 0.008 0.01 0.012

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.004666863 0.000590145 0 0.001 0.002 0.003 0.004 0.005 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.002379011 7.34885E-05 0 0.0005 0.001 0.0015 0.002 0.0025

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.032024853 8.43307E-05 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.003468143 2.68246E-05 0 0.0005 0.001 0.0015 0.002 0.00250.003 0.0035 0.004

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 0.352312907 0.002983162 0 0.050.1 0.15 0.2 0.250.3 0.35 0.4 M iss ra te ITLB Comparison 3.60843E-06 2.67446E-06 0 0.00000050.000001 0.0000015 0.000002 0.00000250.000003 0.0000035 0.000004 M iss ra te

(45)

LUCAS

APPLU

EQUAKE

MGRID

Figure 13: TLB miss rate comparison with 4KB page size.

This section presents our second novel TLB structure designed which is targeted

toward using small page improving the limitation in the original design. Although

using large page size can have better coverage of memory space and the translation

needed to be stored can be smaller, it also waste memory due to internal

DTLB Comparison 0.010017251 6.17372E-05 0 0.002 0.004 0.006 0.008 0.01 0.012 Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.005135275 2.56594E-05 0 0.001 0.002 0.003 0.004 0.005 0.006

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 2.12195705% 0.19153414% 0.00000000% 0.50000000% 1.00000000% 1.50000000% 2.00000000% 2.50000000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00080513% 0.00052773% 0.00000000% 0.00020000% 0.00040000% 0.00060000% 0.00080000% 0.00100000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 4.79387851% 0.03246184% 0.00000000% 1.00000000% 2.00000000% 3.00000000% 4.00000000% 5.00000000% 6.00000000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.35400634% 0.00099221% 0.00000000% 0.05000000% 0.10000000% 0.15000000% 0.20000000% 0.25000000% 0.30000000% 0.35000000% 0.40000000%

Novel ITLB Conventional ITLB-256 M iss ra te DTLB Comparison 1.38909159% 0.06294093% 0.00000000% 0.20000000% 0.40000000% 0.60000000% 0.80000000% 1.00000000% 1.20000000% 1.40000000% 1.60000000% Novel DTLB Conventional DTLB-256 M iss ra te ITLB Comparison 0.00025921% 0.00028948% 0.00024000% 0.00025000% 0.00026000% 0.00027000% 0.00028000% 0.00029000% 0.00030000%

iss ra te

(46)

fragmentation and most modern OS only support small page size under

multiprogramming environment. Our new proposed TLB is not only capable of

reducing miss rate in context switches with small page size but supporting multiple

page sizes. Further, the hardware cost in our new design is almost as same as the

conventional TLB.

3.2.1 Overview

Our new TLB structure combines both the features of Lee’s dual TLB [17] [18]

and our original TLB [6] with many TLB banks. We use a shared conventional small

page (4KB size) TLB and many large page (16KB size) promotion-TLB banks. The

difference compared to our original design is that only the shared TLB need to be

flushed when context switching. The shared TLB works together with only one of the

promotion-TLB banks at a time. The TLB banks can keep from flushing when context

switching. As a result, we can reduce the miss rate in context switches with small

page size because the shared TLB can effectively reduce the miss rate. The remainder

of this section we will present our new TLB structure, implementations and

(47)

3.2.2 Proposed TLB Structure

Figure 14 shows in detail our new novel TLB structure to reduce miss rate in

context switch. The proposed structure can be broken down into four parts to discuss:

Figure 14: New proposed TLB architecture with low context switching penalty.

1. The Shared TLB and the promotion-TLB banks

Only one of the promotion-TLB banks can work together with the shared TLB at

the same time. The shared TLB and all the promotion-TLB banks are designed as

fully associative structures. The shared TLB is the same as that being implemented on

conventional TLB while the promotion-TLB banks are all implemented as the

complete-subblock TLBs.

The shared TLB is constructed as a set of m page entries, where the page size is

(48)

page entries, i.e., 16KB. Thus the total number of 4KB page entries is

m+n×(16KB/4KB)×the number of banks in this example.

In our novel TLB structure we assume that the m is 128, n is 2 and total 4KB

page entries is 256. In other words, our structure has 16 promotion-TLB banks. In

comparison with the conventional 256-entry TLB, the area cost does not increased

very much although we add 16 bank tag registers, multiplexers and de-multiplexers.

The reason is that we use complete-subblock TLB structure in our promotion-TLB

banks which several based pages are managed with one TLB tag such that the total

increased cost can be ignored.

In addition, it should also be added that the shared TLB can also work as the

victim cache. When the least recently used entry would be evicted from the current

TLB bank, the large page entry would be broken down many small page entries and

then sent back to the shared TLB through control logic. We do this action while one

virtual address misses in both the shared TLB and the current bank TLB. We have

sufficient time to do it because fetching the translation from main memory needs more

time.

2. Control logic

The control logic has two functions. The one is that it requests the memory

system for the translation when both shared TLB and current bank TLB miss. Then,

the control logic will send the translation to shared TLB when getting it. The other is

that it sends the evicted entry from current bank TLB back to the shared TLB.

3. Multiplexers and de-multiplexers

The multiplexers and de-multiplexers are used to select right current bank TLB.

(49)

4. Bank tag registers

Each bank TLB has its own bank tag register. The behavior of the bank tag

registers are the same as of our original TLB structure as described in Section 3.1. It

consists of four parts:

z Task tag: Identify each task.

z Current bit: Identify current working task.

z Valid bit: Validate/Invalidate a bank.

z LRU bits: Replace the victim bank.

3.2.3 Implementation of the novel TLB

As our original TLB design, in order to realize our new proposed mechanism, the

OS is just needed to do a little modification. The OS needs to send ‘the clear TLB

signal’ to the processor only when swapping page with disk occurs or page frames

release.

3.2.4 Mechanisms of the novel TLB

The mechanisms of the novel TLB can be divided into four situations to

consider:

1) Task matching in one of the bank tag registers.

Once the virtual address is generated from the CPU, the virtual page number is

sent to the shared TLB and all TLB banks at the same time. The shared TLB works

the same as conventional TLB and each bank works the same as the

complete-subblock TLB. In addition, the select signals are obtained from the current

bit of all group tags in order to select right bank. The possible three cases are:

(50)

are all 16KB page size in this example.)

z Hit in the shared TLB.

If the VPN is found in the shared page TLB, the actions are the same as any

conventional TLB hit. Then, the requested physical address would be sent to the

cache.

z Hit in a bank TLB which is current working.

If a hit occurs at a bank TLB which is current bank, only one of four PPNs in the

bank TLB is enabled at the same time. The actions are the same as any

complete-subblock TLB hit. Then, as in the case of the shared TLB hit, the requested

physical address would be sent to the cache.

z Miss in both places.

When one virtual address misses in both the shared TLB and the current bank

TLB, O/S perform miss handling. When a TLB miss occurs and if its corresponding

three sequential VPNs exist in the shared TLB, those four sequential VPNs belonging

to a 16KB page boundary are chosen to be promoted. The 16KB page is stored into

the current bank TLB as a new single entry. And also the three sequential VPNs in the

shared TLB are invalidated at the same time, causing to increase the effective entry

space in the shared TLB.

What has to be noticed is that we did some difference from Lee et al. When the

space of the current working bank is not enough, the least recently used entry has to

be discarded in Lee et al. But we divide the large page into four small page entries and

send back to the shared TLB through control logic to reuse them. The reason is that

the bank TLB usually have higher hit rate than the shared TLB and we can get better

降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

國

立

交

通

大

學

資訊工程系

碩

士

論

文

降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

Translation Look-aside Buffer with Low Context Switch Penalty

研 究 生：張繼文

指導教授：陳昌居 教授

降低行程環境切換導致效能損失之轉換搜尋緩衝器設計

Translation Look-aside Buffer with Low Context Switch Penalty

研 究 生：張繼文 Student：Chi-Wen Chang

指導教授：陳昌居 Advisor：Chang-Jiu Chen

國 立 交 通 大 學

資 訊 工 程 學 系

碩 士 論 文

降低行程環境切換導致效能損失之

轉換搜尋緩衝器設計

研究生：張繼文

指導教授：陳昌居

國立交通大學 資訊工程學研究所

摘要

Translation Look-aside Buffer with Low

Context Switch Penalty

Student : Ji-Wen Zhang Advisor : Dr. Chang-Jiu Chen

Department of Computer Science and Information Engineering

National Chiao-Tung University

Abstract

Acknowledgment

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1 Problem Definition

1.2 Roadmap

Chapter 2

Related Works

2.1 TLB mechanisms

2.1.1

Conventional TLB Structure

2.1.2

Several TLB implementations

2.2 Enhancements for the TLB

2.2.1 Reducing TLB Access Time

2.2.2 Reducing TLB Miss Rate

z

Making the TLB hold more entries

z

Using large page size

z

Using Superpage to improve TLB coverage

2.2.3 Reducing TLB Hardware Cost

z

Complete-Subblock TLB

2.2.4 Low Power TLB

z

Banked-promotion TLB structure

2.3 The Context Switching Penalty in TLB

2.4 Relationships between the Miss Rates,

Page Sizes and TLB Sizes

Chapter 3

Proposed Mechanisms

3.1 The Original TLB Structure with Low

Context Switch Penalty

3.1.1

Structure

3.1.2

OS Support and Implementation

3.1.3 Expected Performance

3.2 The New TLB Structure with Low

Context Switch Penalty

3.2.1

研究生：張繼文

指導教授：陳昌居教授

研究生：張繼文 Student：Chi-Wen Chang

國立交通大學

資訊工程學系

碩士論文

國立交通大學資訊工程學研究所