
CHAPTER 1 INTRODUCTION

1.3 Scope and Organization

[Figure 1.1 diagram. Stage labels: Pre-Processing Stage (object class-dependent); Processing Stage (object class-independent); Post-Processing Stage (object class-dependent). Box labels: "Use profiler to capture trace information of code objects"; "Classify code objects in programs using a compiler"; "structure of a Java virtual machine"; "Using packing and placement techniques to generate object layout"; "For Fully Associative"; "Java Programs"; "Generate a refined Java virtual machine".]

Figure 1.1. The framework of manipulating packing and placement for cache memory in different problem domains.

This dissertation focuses on modeling object packing and placement for the three major kinds of cache organization. It also covers how the approach is applied to different fields of application. Figure 1.1 illustrates the overall framework of the object packing and placement process.

The top part of the framework prepares the parameters used by the packing and placement algorithms. Its task is to mark out the scope and usage of objects for profile information. The techniques used to collect the profile information vary with the field of application. Dealing with generic data items can be straightforward. Techniques for program code arrangement may involve the study of compilers. The arrangement of a virtual machine, such as the Java Virtual Machine, is a class of its own: developing the technique requires insight into the design of the virtual machine, so it deserves a detailed discussion in this dissertation. All these techniques are presented in Chapter 5.

The block in the middle of the framework can be regarded as a black box. Its inputs are parameters that describe object characteristics and profile information. Its task is to generate an object layout for a specific type of cache memory. The design of the black box is the core of our research. To characterize the nature of the problem, this dissertation formulates the problem model in Chapter 3. A thorough understanding of the problem model helps us propose, in Chapter 4, solutions to the packing and placement problems that are practical enough to be used in real compilers or applications.

Chapter 6 presents a series of experiments that apply the proposed techniques to real applications. The experiments demonstrate that the proposed techniques work well for program code arrangement on different cache organizations.

Before turning to the main body of this dissertation, Chapter 2 surveys topics related to our research and explains why earlier work does not cover our research subject.

CHAPTER 2 BACKGROUND

2.1 Memory Hierarchy

A computer system may require a large memory for storing programs and data. Because of the principle of locality ([5]), not all of this content is accessed at any given moment. A computational process typically accesses program code and data items in memory in a clustered manner. Locality has two dimensions. Temporal locality models access activity along the time axis: the objects in a temporal locality set are likely to be referenced repeatedly within a given period. Spatial locality means that, over its lifetime, a process is likely to access objects that lie in a few neighboring regions of the storage device.

[Figure 2.1 diagram. Layers from top to bottom: CPU; Level-1 Cache; Level-2 Cache; Main Memory; Hard Drive / CDROM. Annotation: "within the chip".]

Figure 2.1. The memory hierarchy.

The memory hierarchy is a compromise that manages massive amounts of code and data objects efficiently. As shown in Figure 2.1, memory devices are stacked by access speed. The fastest memory, such as on-chip static RAM, is attached directly to the CPU. The slowest memory devices, such as a hard drive or CDROM, are placed in the bottom layer. Objects are loaded into an upper layer before being used. Because only a small portion of the objects is in use at a time, the capacity of an upper layer is usually smaller than that of the layer below it. The concept applies to many places in a computer system, such as the CPU cache in a processor, the TLB in paged memory management, and virtual memory in an operating system [6][7]. Technically, the system designer is free to devise the scheme for exchanging objects between the upper and lower memories; in practice, cache memory plays an important role for this purpose.

2.1.1 Cache Organization

Cache memory is a mechanism that uses a small, fast memory to hold and manipulate the contents of a large, slow main memory. Functionally, it is a set of protocols for managing buffers in memory. A cache memory consists of cache blocks (cache lines), and the main memory is correspondingly divided into blocks. When a processor is about to access data in the main memory, the data are transferred from main memory to a cache block one block at a time. Modified data are likewise written back from a cache block to main memory one block at a time. Selecting the cache block that will hold a specific memory block is very important, because this mapping is the origin of cache misses. According to the method of mapping memory blocks to cache blocks, cache memories can be classified into the following three types.

• Direct Mapped Cache

The cache blocks are separated into isolated sets, and each cache set has exactly one cache block. For a direct mapped cache with K cache sets, there are therefore K cache blocks available. For a given memory address x, formula (2.1) is used to calculate the corresponding cache set k.



In other words, all the memory blocks are divided into K sets, and each memory block is mapped to a fixed cache set. Memory blocks belonging to the same cache set have to contend for its single cache block. If the cache set holds an unwanted memory block, that block is invalidated and the demanded memory block is loaded into the cache block. This leads to a conflict miss. The direct mapped cache is popular because cache block management is simple. However, the number of conflict misses can be severe in the worst case, as discussed in Hill’s work [8].

• Fully Associative Cache

There is no restriction on mapping memory blocks to cache blocks: a memory block can be placed in any cache block. If no cache block contains the wanted memory block, the cache system has to invalidate a victim cache block and load the desired memory block into it. The victim cache block is chosen by a replacement algorithm. This kind of cache miss is called a capacity miss.

• Set Associative Cache

It can be regarded as a combination of the above two organizations. The cache blocks are grouped into K sets, as in a direct mapped cache. Each cache set has N cache blocks, where N > 1; the term N-way describes the capacity of each cache set. When the processor is about to access a memory block that is absent from the k-th cache set, the cache memory uses the replacement algorithm to choose and invalidate a victim cache block in this set. The reclaimed cache block then holds the wanted memory block. The activity within a cache set is identical to that of a fully associative cache.
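The three organizations can be illustrated with one small simulator. This is a minimal sketch under stated assumptions, not the model used later in this dissertation: the block size B and the mapping k = (x div B) mod K are the conventional choices and may differ from formula (2.1), LRU is assumed as the replacement algorithm, and all names are hypothetical. Setting N = 1 gives a direct mapped cache, and setting K = 1 gives a fully associative cache.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """K sets of N blocks each, with LRU replacement inside every set."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets      # K
        self.ways = ways              # N
        self.block_size = block_size  # B (assumed parameter)
        # One OrderedDict per set; keys are memory block numbers,
        # ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        k = block % self.num_sets     # assumed mapping to cache set k
        lines = self.sets[k]
        if block in lines:
            lines.move_to_end(block)  # hit: mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # invalidate the LRU victim
        lines[block] = True
        self.misses += 1
        return False


def count_misses(addresses, num_sets, ways, block_size=16):
    cache = SetAssociativeCache(num_sets, ways, block_size)
    for address in addresses:
        cache.access(address)
    return cache.misses


# Addresses 0 and 64 fall in memory blocks 0 and 4 (B = 16).
trace = [0, 64, 0, 64, 0, 64]
direct = count_misses(trace, num_sets=4, ways=1)   # 4 blocks, direct mapped
two_way = count_misses(trace, num_sets=2, ways=2)  # 4 blocks, 2-way
fully = count_misses(trace, num_sets=1, ways=4)    # 4 blocks, fully associative
```

All three configurations have four cache blocks. In the direct mapped case, blocks 0 and 4 contend for the same cache set, so every access is a conflict miss; with two or more ways the two blocks coexist and only the first access to each block misses.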

It is worthwhile to briefly survey the replacement algorithms. Belady studied these algorithms intensively ([9]). Smith [10] categorizes replacement algorithms into three classes.

• Class 1 – Non-usage-based algorithms. They assume that all blocks share the same usage frequency, so the choice of the victim does not depend on the activity of the accessed items. FIFO and random replacement (RAND) are in this class.

• Class 2 – Usage-based algorithms. They make decisions based on history or other statistics; LRU is an example.

• Class 3 – The algorithm knows everything, past and future. This is the optimal algorithm, denoted OPT in the relevant literature.
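One representative of each class can be compared on a small reference string. The sketch below is illustrative only: it models a single fully associative set of `capacity` blocks, the function names are hypothetical, and OPT is implemented as "evict the block whose next use lies farthest in the future".

```python
from collections import OrderedDict, deque


def fifo_misses(refs, capacity):
    """Class 1: evict the block that has been resident the longest."""
    resident, order, misses = set(), deque(), 0
    for block in refs:
        if block in resident:
            continue
        misses += 1
        if len(resident) >= capacity:
            resident.discard(order.popleft())
        resident.add(block)
        order.append(block)
    return misses


def lru_misses(refs, capacity):
    """Class 2: evict the least recently used block."""
    resident, misses = OrderedDict(), 0
    for block in refs:
        if block in resident:
            resident.move_to_end(block)
            continue
        misses += 1
        if len(resident) >= capacity:
            resident.popitem(last=False)
        resident[block] = True
    return misses


def opt_misses(refs, capacity):
    """Class 3: evict the block whose next use is farthest in the future."""
    resident, misses = set(), 0
    for i, block in enumerate(refs):
        if block in resident:
            continue
        misses += 1
        if len(resident) >= capacity:
            def next_use(candidate):
                try:
                    return refs.index(candidate, i + 1)
                except ValueError:
                    return float("inf")  # never referenced again
            resident.discard(max(resident, key=next_use))
        resident.add(block)
    return misses


refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
results = (fifo_misses(refs, 3), lru_misses(refs, 3), opt_misses(refs, 3))
```

For this reference string and three blocks, FIFO incurs 9 misses, LRU 10, and OPT 7. OPT is never worse than the other two, while the ordering of FIFO and LRU depends on the reference string.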

The OPT algorithm serves analytic purposes; it is not used in real cache memory systems.

LRU usually outperforms FIFO and the others, but it is too costly to implement in a real system. Pseudo-LRU algorithms ([6][11]) approximate LRU, such as the one used in the Intel Pentium processor [12]. FIFO and RAND are the simplest in

The performance of a cache memory can be evaluated in terms of the average memory access time, given by Equation (2.2) as defined in [5].

Average memory access time = Hit time + Miss rate × Miss penalty (2.2)

Equation (2.2) shows that the performance of the cache memory depends on the cache miss rate: a lower miss rate leads to higher performance. Hennessy and Patterson [5] enumerate techniques for reducing cache misses. Two of them are related to our research. The first is to enlarge the cache block size, and the second is to use the compiler to generate code and data optimized for the cache memory.
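Equation (2.2) can be evaluated directly. The numbers below are hypothetical and chosen only to show how strongly the miss rate drives the result; times are in cycles.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Equation (2.2): hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical values: 1-cycle hit time, 100-cycle miss penalty.
baseline = average_memory_access_time(1.0, 0.05, 100.0)
improved = average_memory_access_time(1.0, 0.02, 100.0)
```

With these assumed values, lowering the miss rate from 5% to 2% halves the average access time, from about 6 cycles to about 3 cycles.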

The size of a cache block bears on the fundamental assumption of our proposed packing and placement problem, because a larger block can gather more objects. Smith [10] discusses the pros and cons of small and large cache blocks (see also [13][14][15][16][17][18][19][20][21]). The advantages of the former are the disadvantages of the latter. Naturally, transferring data from main memory to a small cache block takes less time, which reduces the miss penalty. Conversely, the overall miss count for transferring a fixed amount of data is higher than with a large cache block. A large cache block also allows a simpler hardware circuit because the tag memory is smaller; the search cost is therefore reduced, which can shorten the access time for “hits”. On the other hand, one disadvantage for typical applications is that a cache block may contain much unused data when the locality is small. Nonetheless, this disadvantage can be mitigated by placing more of the information actually in use into a cache block, so that loading it in a single transfer is more efficient.

The choice between small and large cache blocks depends on several factors. The first is the geometry of the main memory: its readable/writable unit usually bounds the minimum size of a cache block. In addition, for main memory with high transfer latency (transmission overhead) and high bandwidth, the choice favors large cache blocks, because enlarging the block causes only a minor increase in miss penalty compared with a small cache block. Since increasing bandwidth is a technology trend, larger cache block sizes can be expected to be a trend as well.
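The latency and bandwidth argument can be made concrete with a simple cost model. The model, miss penalty = latency + block size / bandwidth, is an assumption introduced here for illustration and does not come from the cited works; the latency and bandwidth figures are hypothetical.

```python
def miss_penalty(block_size, latency, bandwidth):
    """Assumed model: fixed latency plus transfer time for one block."""
    return latency + block_size / bandwidth


def penalty_growth(small, large, latency, bandwidth):
    """Factor by which the miss penalty grows when the block is enlarged."""
    return miss_penalty(large, latency, bandwidth) / miss_penalty(small, latency, bandwidth)


# Hypothetical memory: 100-cycle latency; block sizes in bytes,
# bandwidth in bytes per cycle.
high_bandwidth = penalty_growth(32, 128, latency=100, bandwidth=8)
low_bandwidth = penalty_growth(32, 128, latency=100, bandwidth=1)
```

Under this model, quadrupling the block size on the high-bandwidth memory raises the miss penalty from 104 to 116 cycles, whereas on the low-bandwidth memory it rises from 132 to 228 cycles. When the fixed latency dominates, a large block costs little extra.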

Programmers and compilers can help arrange the code and data items in a program. This is the origin of our research. Skilled programmers have several aggressive ways to increase the locality of their programs, such as rewriting loops, changing the direction in which arrays are iterated (such as [22][23]), or incorporating cache-aware algorithms (for example, the cache-optimal graph algorithms of Park, Penner, and Prasanna [24]).
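The effect of changing the iteration direction of an array can be sketched by counting misses for two traversal orders over the same row-major array. The cache model (fully associative, LRU, 8 blocks of 32 bytes) and the array dimensions are hypothetical choices for illustration.

```python
from collections import OrderedDict


def lru_block_misses(addresses, num_blocks, block_size):
    """Miss count of a fully associative LRU cache (assumed model)."""
    cache, misses = OrderedDict(), 0
    for address in addresses:
        block = address // block_size
        if block in cache:
            cache.move_to_end(block)
            continue
        misses += 1
        if len(cache) >= num_blocks:
            cache.popitem(last=False)
        cache[block] = True
    return misses


ROWS, COLS, ELEM = 64, 64, 4  # row-major array of 4-byte elements

# Inner loop over columns: consecutive addresses.
by_rows = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
# Inner loop over rows: a stride of one full row between accesses.
by_cols = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

row_misses = lru_block_misses(by_rows, num_blocks=8, block_size=32)
col_misses = lru_block_misses(by_cols, num_blocks=8, block_size=32)
```

Both loops touch the same 4096 elements. The row-wise traversal misses once per 32-byte block (512 misses), while the column-wise traversal touches 64 different blocks before returning to any of them and therefore misses on every access (4096 misses).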

There is another kind of approach to refining locality. By altering the placement of code or data in memory or storage devices, it is possible to improve spatial locality [1]. The intuition is to gather frequently used objects into one area, which changes the spatial locality of the process. The cache memory loads the concentrated area and satisfies most accesses. A further step is to consider the cache organization in addition to locality while creating the placement, so that the placement is more effective at increasing cache hits for the given application.
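The intuition of gathering frequently used objects can be sketched by counting how many memory blocks hold the hot objects under two layouts. The object sizes, the hot set, and the function name are hypothetical; this is an illustration of the idea, not the packing algorithm proposed in this dissertation.

```python
def blocks_touched(layout, sizes, hot, block_size):
    """Return the set of memory blocks that hold any part of a hot object.

    layout: object ids in placement order, laid out back to back.
    sizes:  dict mapping object id to size in bytes.
    hot:    set of frequently used object ids.
    """
    touched, offset = set(), 0
    for obj in layout:
        start, end = offset, offset + sizes[obj] - 1
        if obj in hot:
            touched.update(range(start // block_size, end // block_size + 1))
        offset += sizes[obj]
    return touched


sizes = {obj: 16 for obj in range(8)}  # eight 16-byte objects
hot = {0, 2, 4, 6}                     # frequently used objects

original = [0, 1, 2, 3, 4, 5, 6, 7]    # hot and cold objects interleaved
packed = [0, 2, 4, 6, 1, 3, 5, 7]      # hot objects gathered first

original_blocks = blocks_touched(original, sizes, hot, block_size=32)
packed_blocks = blocks_touched(packed, sizes, hot, block_size=32)
```

With 32-byte blocks, the interleaved layout spreads the hot objects over four blocks, whereas the packed layout concentrates them in two, so half as many block loads satisfy the same accesses.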

2.1.2 XIP and NAND Flash

In a regular computer system, RAM is the major addressable component of the main memory space. The operating system loads a program from a storage device into RAM before execution. The CPU fetches machine code from RAM and carries out the instructions. Since a program should not modify itself, the RAM that holds program code (called code memory) is treated as ROM.

However, a low-end embedded system seldom has as much RAM as a desktop PC. Under such circumstances, it becomes expensive to use RAM as code memory. Using ROM as code memory is a classical approach, but ROM is not rewritable, so programs cannot be updated. NOR flash memory is therefore a popular alternative, because its physical interface is identical to that of ROM. A NOR flash chip can be connected to the processor’s host bus, and it lets programs execute in place (XIP) without extra hardware ([25][26]). Its programming interface (erasing and writing) is quite straightforward, and designers do not have to worry about bad block management. However, NOR flash memory is small in capacity, so the trend is to migrate code memory to NAND flash memory ([27]).

NAND flash memory has some important characteristics. Its storage space consists of blocks, and each block consists of pages. Erase operations are performed on a block basis, and read operations are performed on a page basis. NAND flash memory does not allow random byte access: the CPU must read out a whole page at a time, which is slow compared with access to RAM. Table 2.1 lists typical combinations of blocks and pages.

Table 2.1. Typical combinations of NAND flash blocks and pages

Block Size (bytes)    # Pages / Block    Page Size (bytes)
16K                   32                 512
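The geometry in Table 2.1 determines how a byte address maps to a block and a page, and how many page reads a code fragment costs. The sketch below assumes a flat linear byte address space over the flash array, which is a simplification; the function names are hypothetical.

```python
PAGE_SIZE = 512                           # bytes, from Table 2.1
PAGES_PER_BLOCK = 32                      # from Table 2.1
BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK  # 16K bytes


def locate(address):
    """Split a linear byte address into (block, page in block, offset in page)."""
    block, rest = divmod(address, BLOCK_SIZE)
    page, offset = divmod(rest, PAGE_SIZE)
    return block, page, offset


def pages_read(start, length):
    """Number of whole-page reads needed to fetch `length` bytes at `start`."""
    first = start // PAGE_SIZE
    last = (start + length - 1) // PAGE_SIZE
    return last - first + 1


where = locate(20000)
straddling = pages_read(500, 30)  # fragment crosses a page boundary
aligned = pages_read(512, 30)     # same fragment placed inside one page
```

A 30-byte fragment that straddles a page boundary costs two page reads, while the same fragment placed entirely inside one page costs one. This is the kind of difference that object placement can influence.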

Figure 2.2. Execute programs stored in a NAND flash memory by using a shadow RAM.

These properties make it hard for a processor to execute programs stored in NAND flash memory using the “execute-in-place” (XIP) technique. Nowadays, most implementations treat NAND flash memory as a secondary storage device, like a hard drive: the system copies the entire content, both program code and data, from NAND flash memory to a shadow RAM (the configuration in Figure 2.2). Although this implementation is straightforward, it has several drawbacks. First, it requires RAM large enough to hold everything, useful or not, sometimes up to 1 GB. After system boot, the NAND flash memory is useless. The run-time performance is good because everything is already in RAM, but the approach is uneconomical for a small-scale embedded system. Second, the system suffers from a long boot delay because of the time spent reading everything from NAND flash memory into RAM; it could take 15 seconds to download the entire content of a 512M NAND. Third, if the program code grows beyond the original design, both the NAND flash memory and the RAM must be upgraded.

[Figure 2.3 diagram. Components: NAND Flash Memory; Flash Memory Interface; Cache RAM; Optional ROM, NOR Flash; CPU; Address/Data Bus.]

Figure 2.3. Execute programs stored in a NAND flash memory by using a cache.

Yet another approach is to adopt a memory management unit (MMU) and a small cache memory. Program code always resides in NAND flash memory, and the CPU fetches instructions from the cache memory. When the CPU is about to run a code fragment that is absent from the cache memory, the MMU loads the fragment from NAND flash pages into the cache memory. A system may implement such an MMU either in hardware (the configuration in Figure 2.3), as Park et al. do in [28], or through the operating system’s virtual memory mechanism. This is known as “execute-in-place”; it uses NAND flash memory efficiently instead of leaving it idle after boot, and it reserves precious RAM for applications.
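The difference between the shadow-RAM configuration and demand loading through a cache can be sketched by counting NAND page reads. This is a rough illustration under assumptions: the cache holds whole 512-byte pages, replacement is LRU, and the image size and fetch trace are hypothetical. It does not model the hardware of [28].

```python
from collections import OrderedDict

PAGE_SIZE = 512  # bytes per NAND page (Table 2.1)


def demand_page_reads(fetch_addresses, cache_pages):
    """Count NAND page reads when pages are loaded only on a cache miss."""
    cache, reads = OrderedDict(), 0
    for address in fetch_addresses:
        page = address // PAGE_SIZE
        if page in cache:
            cache.move_to_end(page)    # already cached: no NAND access
            continue
        reads += 1                     # load the missing page from NAND
        if len(cache) >= cache_pages:
            cache.popitem(last=False)  # evict the LRU page (assumed policy)
        cache[page] = True
    return reads


image_pages = 1024  # hypothetical program image of 512 KB

# Hypothetical trace: a loop whose code lies in pages 3 and 4, run 100 times.
trace = [3 * PAGE_SIZE, 4 * PAGE_SIZE] * 100

shadow_reads = image_pages  # shadow RAM copies every page at boot
demand_reads = demand_page_reads(trace, cache_pages=4)
```

In this trace the demand-loaded configuration reads only the two pages that are actually executed, while the shadow configuration reads all 1024 pages before execution starts. A trace with poor locality would narrow or reverse the gap, which is why the placement of code within pages matters.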

2.2 Graph and Combinatorial Algorithms

In this dissertation, we transform the modeled problems into well-known graph problems. Because these well-known problems have been studied extensively, our modeled problems can be handled with the results of that earlier research. Two well-known graph problems are adopted in our research: the graph partitioning problem and the MAX k-CUT problem.

Definition 2.1 GRAPH-PARTITIONING. Given a graph G=(V,E), weights $w(v) \in Z^+$ for each $v \in V$, lengths $l(e) \in Z^+$ for each $e \in E$, and $K, J \in Z^+$, find a partition of V into disjoint sets {V1, V2, ..., Vm} such that $\sum_{v \in V_i} w(v) \le K$ for each $V_i$, and such that if $E' \subseteq E$ is the set of edges whose two endpoints lie in two different sets $V_i$, then $\sum_{e \in E'} l(e) \le J$.
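The two conditions of Definition 2.1 can be checked mechanically for a candidate partition. The sketch below is a hypothetical helper that assumes the graph is given as dictionaries of vertex weights and edge lengths; it verifies a solution and does not search for one.

```python
def check_partition(weights, edges, parts, K, J):
    """Check a candidate partition against Definition 2.1.

    weights: dict vertex -> w(v)
    edges:   dict (u, v) -> l(e)
    parts:   list of vertex sets V1..Vm
    Returns (feasible, per-set weights, total length of cut edges).
    """
    part_of = {v: i for i, part in enumerate(parts) for v in part}
    covers_all = set(part_of) == set(weights)
    disjoint = sum(len(part) for part in parts) == len(part_of)
    if not (covers_all and disjoint):
        raise ValueError("parts must be a partition of V")
    loads = [sum(weights[v] for v in part) for part in parts]
    # E' is the set of edges whose endpoints lie in different sets.
    cut = sum(length for (u, v), length in edges.items()
              if part_of[u] != part_of[v])
    feasible = all(load <= K for load in loads) and cut <= J
    return feasible, loads, cut


weights = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 2}
result = check_partition(weights, edges, [{"a", "b"}, {"c", "d"}], K=5, J=3)
```

For this small graph, the partition {a, b}, {c, d} has set weights 5 and 3 and cuts the edges (b, c) and (a, d) with total length 3, so it is feasible for K = 5 and J = 3.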

The graph partitioning problem is known to be NP-complete, as discussed in the book by Garey and Johnson [29]. It has been widely studied, so we review only the key developments. MIN-BISECTION is a simplified version that breaks a weighted graph into two parts and minimizes the total weight of the inter-partition edges.

Some graph partitioning heuristics invoke MIN-BISECTION recursively until the desired number of partitions is generated. These methods are surveyed by Wang et al. [30]. Furthermore, the local-refinement technique exchanges some elements between given partitions to obtain better results. Kernighan and Lin [31] first proposed a local-refinement method for bisection partitions, and many improved heuristics are based on their approach.

Alternatively, Hendrickson and Leland [32] propose a multi-level scheme for the graph partitioning problem. The whole process contains three major steps. The first step constructs a coarse graph by using a maximal matching, which merges vertices into coarser vertices while preserving the properties of the original graph. The second step uses global partitioning algorithms to generate unrefined partitions and then uses local-refinement algorithms (i.e., the method of Kernighan and Lin) to generate the desired number of partitions. The third step uncoarsens each partition and restores the vertices within it.
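The first step, coarsening by a maximal matching, can be sketched as follows. The matching rule used here (greedy, visiting heavy edges first) is an assumption made for illustration and is not necessarily the rule used in [32]; the data representation and function name are hypothetical.

```python
def coarsen(weights, edges):
    """One coarsening step: contract the edges of a greedy maximal matching.

    weights: dict vertex -> weight
    edges:   dict (u, v) -> length
    Returns (coarse_weights, coarse_edges, mapping), where mapping sends each
    original vertex to its coarse vertex (a tuple of the merged vertices).
    """
    matched, mapping = set(), {}
    # Assumed rule: visit heavy edges first so tightly coupled pairs merge.
    for (u, v), _ in sorted(edges.items(), key=lambda item: -item[1]):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            mapping[u] = mapping[v] = (u, v)
    for v in weights:
        mapping.setdefault(v, (v,))  # unmatched vertices stay single

    # A coarse vertex carries the summed weight of the vertices it merges.
    coarse_weights = {}
    for v, w in weights.items():
        coarse_weights[mapping[v]] = coarse_weights.get(mapping[v], 0) + w

    # Parallel edges between two coarse vertices are merged by summing lengths.
    coarse_edges = {}
    for (u, v), length in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:
            key = (cu, cv) if cu <= cv else (cv, cu)
            coarse_edges[key] = coarse_edges.get(key, 0) + length
    return coarse_weights, coarse_edges, mapping


weights = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 2}
coarse_weights, coarse_edges, mapping = coarsen(weights, edges)
```

Summing vertex weights and edge lengths is what lets the coarse graph preserve the properties of the original: any partition of the coarse graph has the same set weights and the same cut length as the corresponding partition of the original graph.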

Definition 2.2 MAX k-CUT. Given a weighted graph G=(V,E), let $w_{i,j}$ denote the weight of edge $e_{i,j}$. The aim is to partition V into K subsets, P={P1,P2,...,PK}, where K ≥ 2, so as to maximize the total weight of the inter-partition edges, that is, to maximize

\[ \sum_{1 \le r < s \le K} \; \sum_{i \in P_r,\; j \in P_s} w_{i,j} \]

MAX k-CUT is known to be NP-complete, as discussed in [33][34]. It generalizes two other well-known problems. In the case of K=2, it becomes the MAXCUT problem, which is NP-hard, as discussed in [29][35]. Applied to an unweighted graph, that is, with $w_{i,j}$=1 for all i and j, it becomes the k-COLORING problem. k-COLORING can be used to resolve resource conflicts. For example, it is used to assign registers to variables during the code generation stage of a compiler; Aho et al. explain the use of a k-COLORING heuristic for register allocation in their book [36]. It is no wonder that some prior research on code/data placement adopts k-COLORING (discussed in Section 2.3.1), since it aims to resolve conflicts in assigning cache sets (colors) to code/data fragments (vertices).

Since MAX k-CUT is NP-hard, it cannot be solved in polynomial time unless P=NP, so researchers have sought polynomial-time approximation algorithms. A simple randomized method that distributes vertices to partitions at random is a $\frac{k}{k-1}$-approximation algorithm ([33]); that is, its expected cut weight is at least $\frac{k-1}{k}$ of the optimum. The technique of semidefinite programming (SDP) is widely used for combinatorial optimization problems. Goemans and Williamson [37][38] use SDP to provide an approximation algorithm for the MAXCUT problem. The techniques for solving MAXCUT inspired the development of algorithms for MAX k-CUT.
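The guarantee of the simple randomized method follows because each edge is cut with probability 1 − 1/k when every vertex picks its partition uniformly at random. The sketch below checks this on a tiny graph by enumerating all assignments; the graph and function names are hypothetical, and the enumeration is only feasible for very small inputs.

```python
import random
from fractions import Fraction
from itertools import product


def cut_weight(edges, part_of):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (i, j), w in edges.items() if part_of[i] != part_of[j])


def random_k_cut(vertices, k, rng=random):
    """Assign every vertex to one of k partitions uniformly at random."""
    return {v: rng.randrange(k) for v in vertices}


def expected_random_cut(vertices, edges, k):
    """Exact expectation, by enumerating all k**n assignments (tiny graphs only)."""
    total = 0
    for labels in product(range(k), repeat=len(vertices)):
        total += cut_weight(edges, dict(zip(vertices, labels)))
    return Fraction(total, k ** len(vertices))


vertices = ["a", "b", "c", "d"]
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 2}
total_weight = sum(edges.values())

expected = expected_random_cut(vertices, edges, k=3)
sample = cut_weight(edges, random_k_cut(vertices, 3, random.Random(0)))
```

For k = 3 the expected cut weight equals exactly two thirds of the total edge weight. Since the optimum cannot exceed the total edge weight, the expected cut is at least (k − 1)/k of the optimum.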

Frieze and Jerrum [39] generalize the work of Goemans and Williamson, using SDP and a randomized algorithm ([40]) to provide an approximation algorithm for the MAX k-CUT problem. We briefly restate their approach here. The original problem can be formulated as follows:

Using the technique of SDP relaxation, the constraint on $X_{ij}$ is changed as follows:
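For reference, the formulation of Frieze and Jerrum is commonly written in the following matrix form. The notation here is an assumption and may differ from the formulas originally given at this point: $X$ is an $n \times n$ symmetric matrix with one row per vertex, and $K$ is the number of partitions from Definition 2.2.

```latex
% Exact formulation: X_{ij} = 1 if vertices i and j share a partition,
% and X_{ij} = -1/(K-1) otherwise.
\begin{align*}
\text{maximize}\quad   & \frac{K-1}{K} \sum_{i<j} w_{i,j}\,\bigl(1 - X_{ij}\bigr) \\
\text{subject to}\quad & X_{ii} = 1 \quad \text{for all } i, \\
                       & X_{ij} \in \Bigl\{ -\tfrac{1}{K-1},\; 1 \Bigr\} \quad \text{for all } i \neq j, \\
                       & X \succeq 0 .
\end{align*}

% SDP relaxation: the discrete constraint on X_{ij} becomes a bound.
\begin{align*}
X_{ij} \in \Bigl\{ -\tfrac{1}{K-1},\; 1 \Bigr\}
\quad\longrightarrow\quad
X_{ij} \ge -\tfrac{1}{K-1} \quad \text{for all } i \neq j .
\end{align*}
```

In the exact formulation, an edge inside a partition contributes $1 - 1 = 0$, and an edge between partitions contributes $\frac{K-1}{K}\bigl(1 + \frac{1}{K-1}\bigr) w_{i,j} = w_{i,j}$, so the objective equals the cut weight of Definition 2.2.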

Subsequent research improves on the work of Frieze and Jerrum, including that of Klerk, Pasechnik, and Warners [41], Kann et al. [42][43], Coja-Oghlan, Moore, and Sanwalani [44], and Ghaddar, Anjos, and Liers [45].

The above SDP-based approaches provide good approximations, but solving the SDP can take a long time (as discussed in [46]) in real applications such as VLSI layout. Therefore, Kahruman et al. [47] propose a greedy heuristic for MAXCUT. Their algorithm iteratively separates the endpoints of heavy edges into two partitions. Our algorithm devised in this dissertation (Section 4.2) shares the
