
CHAPTER 2 BACKGROUND

2.3 Related Works

2.3.4 Other Related Topics

The work of Rubin, Bodik, and Chilimbi [82] focuses on a framework for evaluating the cache performance of a given data placement. Since it is difficult to manipulate a large amount of trace data, their framework introduces a technique to “compact the trace”. The approach discovers the “grammar” of the data access traces. They use Nevill-Manning’s SEQUITUR algorithm [83] to represent the trace as a context-free grammar (derived from the previous work [84]), and the grammar is used to distribute data objects into pages. Chilimbi and Shaham [85] extend a similar approach to place data items over direct-mapped and set-associative caches.

The idea of packing programs to fit virtual memory pages is a classic topic. In the early 1970s, Ryder [86] proposed packing small programs into a single virtual memory page in order to reduce paging. The approach was applied to early multiprogramming operating systems such as IBM OS/VS2. Hatfield and Gerald [87] discuss a similar problem, aimed at arranging relocatable sectors (which are smaller than pages) within a program.

Nevertheless, modern processor architectures still face similar challenges. As stated, the placement problem involves not only the characteristics of the storage media but also the processor architecture. Rong Xu and Zhiyuan Li discuss the cache mapping problem for processors with a partitioned cache, e.g., the Intel StrongARM SA-1110 and the Intel XScale [88]. In a processor with a partitioned cache, software can control the cache zone to which a memory page maps. For example, a memory page can be mapped to the main cache, the mini-cache, or a non-cacheable area. Since capacity is limited, choosing which data items to map and how to map them is a combinatorial problem. Their research proves that the problem is NP-hard. Therefore, they propose a greedy algorithm to fit the most frequently accessed pages into the caches. The algorithm enumerates every memory page and decides whether it maps to the main cache, the mini-cache, or none. The iteration order is controlled by the conflict weights of the memory pages. The conflict weight is the number of interleaved accesses between the undecided and the decided memory pages. This approach offers a 1% to 2% improvement in cache misses, but the heuristic takes O(m³n) time, where m is the number of pages and n is the number of memory accesses.

A program written in an object-oriented language may contain a large number of data objects. The layout of these data objects affects memory performance. One issue is that accessing objects scattered across memory can cause more cache misses. Stamos [89] surveyed the relationship between the Smalltalk runtime environment and virtual memory. Because a virtual memory page can hold several Smalltalk objects, grouping objects to fit virtual pages can reduce page faults. The approach statically traverses the object forest in a certain order (DFS, BFS, or by object type) and lays out the data in memory accordingly. It improves the spatial locality of data objects in virtual memory.

Modern object-oriented languages like Java support garbage collection. Garbage-collected systems still face memory performance issues. For example, Hirzel [90] demonstrates a garbage collector for Java that can incorporate several data layout strategies. The approach sorts objects in memory by a selectable layout rule while reclaiming and compacting objects. The layout rules include traversing the object forest in DFS or BFS order, sorting by thread, and some other static rules. The experimental results show that the approach can help a Java application reduce cache or TLB misses.

Similar techniques can be applied to heap memory management, as in the works [91][92][93][94][95].

CHAPTER 3 PROBLEM MODELING

3.1 Object Access Trace

We now discuss the packing and placement problem formally. Consider a set of objects O = {o₁, o₂, o₃, ...}. These elements are the relocatable units to be placed in memory. Since one of the assumptions of the problem is that object sizes are irregular and not necessarily identical, the function size(o_i) denotes the size of a given object o_i. In addition, the function addr(o_i) denotes the starting address of the object in memory.

The problem assumes that one of the three cache organizations is configured to mediate between the processor/program and the main memory. For either the direct mapped cache or the set associative cache, the cache is assumed to have K sets. A cache block is M bytes in size. Because the cache exchanges raw data with the main memory in units of cache blocks, the main memory space is segmented into memory blocks. A memory block is M bytes, identical to the size of a cache block, so that it fits into a cache block. The collection of memory blocks is defined as the set B = {b₀, b₁, b₂, ...}. From the program’s perspective, it can access (load/store) arbitrary objects in the main memory. The bottom layer undertakes the data access activities. When object o_i is accessed, the cache system loads the memory block containing o_i from the main memory into a cache block. The index j of the loaded memory block b_j is given by (3.1).

j = ⌊addr(o_i) / M⌋ (3.1)

After that, the program accesses the object in the cache block. Since a direct mapped cache divides the memory space into K sets, the block b_j is located in set B_k, where k is calculated by (3.2).

k ≡ j (mod K) (3.2)
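Equations (3.1) and (3.2) can be illustrated with a short Python sketch. The function names and the sample values of M, K, and the address are assumptions chosen only for illustration.

```python
def block_index(addr: int, M: int) -> int:
    """Equation (3.1): index j of the memory block containing byte address addr."""
    return addr // M


def set_index(j: int, K: int) -> int:
    """Equation (3.2): index k of the cache set that memory block b_j maps to."""
    return j % K


# Hypothetical configuration: 16-byte blocks and 4 cache sets.
M, K = 16, 4
addr = 100
j = block_index(addr, M)  # 100 // 16 = 6
k = set_index(j, K)       # 6 % 4 = 2
print(f"address {addr} -> block b_{j} -> set B_{k}")
```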

As the program constantly accesses objects in the main memory, the activities can be recorded as a trace of the accessed objects, called the object access trace (OT). It represents the accessed objects arranged in temporal order. Figure 3.1 explains the conversion flow of the object access trace. It contains three traces. The first, the object access trace (OT), is composed of letters that denote objects. The entire trace can be converted to an address trace (AT) by writing down the address of each object using the function addr(). Similarly, applying Equation (3.1) to the elements of AT yields the block access trace (BT). In the figure, the horizontal line that divides each address number into two parts denotes this step. The BT is the sequence of blocks swapped into the cache. A cache conflict miss arises when the requested block does not match the block held in the cache; the system then pays a penalty for loading the missing block into the cache.

Figure 3.1. The conversion of object access trace to block access traces.
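The conversion flow from OT to AT to BT can be sketched as follows. The object addresses, the block size, and the sample trace are hypothetical values, not those of the figure.

```python
# Hypothetical layout: object name -> starting address addr(o_i).
addr_of = {"a": 0, "b": 6, "c": 16, "d": 24, "e": 32, "f": 40}
M = 16  # assumed block size in bytes


def to_address_trace(ot, addr_of):
    """OT -> AT: replace every object by its starting address."""
    return [addr_of[o] for o in ot]


def to_block_trace(at, M):
    """AT -> BT: apply Equation (3.1) to every address."""
    return [a // M for a in at]


ot = "abcadebf"  # hypothetical object access trace
at = to_address_trace(ot, addr_of)
bt = to_block_trace(at, M)
print(at)  # [0, 6, 16, 0, 24, 32, 6, 40]
print(bt)  # [0, 0, 1, 0, 1, 2, 0, 2]
```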

Consider the object access trace shown in the first row of Figure 3.2(a). It is converted to a block access trace (BT) under the mapping shown in Figure 3.2(b); the second row of Figure 3.2(a) is the resulting block access trace. When the system is about to access b_j, it checks whether the cache block of set B_{j mod K} currently holds b_j.

Figure 3.2. (a) An example of object access trace, block access trace, and compressed block access trace in three rows. (b) A legal packing mapping that injects six objects to three memory blocks.

The goal of this problem is to find a layout scheme that assigns objects to the memory space. The layout scheme maps objects to blocks and, consequently, the object access trace to a block access trace. After the new layout scheme is deployed, the resulting block access trace on the K-set direct mapped cache is expected to cause fewer cache misses.

The problem has two preconditions. First, an object must not be larger than a memory block, i.e., ∀i, size(o_i) ≤ M. Consequently, a memory block can hold several objects. Assigning an address to an object is equivalent to determining both the memory block and the cache set to which the object belongs. Moreover, as the cache block gets larger (M increases), the horizontal line in Figure 3.1 moves progressively to the left, with the side effect that more objects fall into the same memory block. In other words, this problem considers “packing” objects into memory blocks and “placing” objects into cache sets simultaneously. This is the major difference between our study and related research that deals with the placement problem alone.

The second precondition disallows any object from being placed across memory blocks. Since an object is assumed to be no larger than a memory block, the entire object is restricted to lie within one memory block rather than crossing two of them. This condition prevents extra cache loads. Making such an assumption is reasonable. For instance, real compilers have a code/data alignment optimization pass [96]. The optimization pass aligns instruction blocks or data items, prevents them from lying across a cache block boundary, and reduces extra fetches (as also suggested by Intel [11]).

The proposed approach uses information from the object access trace to construct the layout scheme by the packing and placement technique. The object access trace can be obtained by capturing the activities of benchmark or real programs during execution. Our study distinguishes several scopes for measuring the trace information. The scopes differ in the connectivity of objects in the trace. Distinguishing them is important because the scope affects the choice of methods for the packing and placement problem. The scopes are listed as follows.

• Degree-1 trace information

This counts the number of occurrences of each object in the entire object access trace, which indicates the popularity of each object. It is called “Degree-1” because the measuring scope is limited to one object, regardless of the objects that precede or follow it in temporal order. For example, the profile information used in the Path Flow Analyzer for PA-RISC (mentioned in [52]), the research of Steinke et al. [98], and that of Raman and August [1] fall into this category.

• Degree-2 trace information

Degree-2 trace information captures the pair-wise relation between two objects in the trace. In other words, it counts the occurrences of object pairs in the access trace. The symbol w_i,j denotes the number of occurrences of the segment o_i, o_j in the object access trace. The relation is undirected: o_i, o_j is equivalent to o_j, o_i. For example, consider the object access trace shown in the first row of Figure 3.2(a). Its Degree-2 trace information is expressed as the adjacency matrix in Figure 3.3(a).

Degree-2 trace information is used in several related studies, such as [54][57][58]. There are variations that incorporate different metrics to express the affinity between two objects, such as that of Gloy et al. [2].

• Degree-k trace information (k > 2)

By extending the idea of Degree-2 trace information, Degree-k trace information relates an object to the (k-1)-th object after it. Whether an object enters or leaves the cache is not decided merely by the preceding object; more than one object together composes the complete cache activity history. For example, the analysis technique shown in Section 3.4 uses both Degree-2 and Degree-3 trace information to reflect the relations of objects entering and leaving. The importance is stressed by Petrank and Rawitz in [68][69]. They suggest that pair-wise information is insufficient for solving the placement problem perfectly. In fact, no prior research uses Degree-k information to resolve placement problems, because manipulating such deep levels of affinity is difficult. One obvious issue is that k is a free choice. In our research, it serves as an auxiliary analysis tool. Because it cannot be formed into a graph model, it cannot be used to solve the problem directly.
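The three scopes listed above can be sketched in Python. The function names and the sample trace are hypothetical. Reading Degree-k information as an undirected count of each object together with the object k-1 positions after it is an assumption of this sketch.

```python
from collections import Counter


def degree1(trace):
    """Degree-1: number of occurrences of each object in the trace."""
    return Counter(trace)


def degree_k(trace, k):
    """Degree-k (assumed reading): undirected counts of each object paired
    with the object k-1 positions after it. k=2 counts adjacent pairs."""
    gap = k - 1
    return Counter(
        tuple(sorted((trace[t], trace[t + gap])))
        for t in range(len(trace) - gap)
    )


ot = "abcabca"  # hypothetical object access trace
print(degree1(ot))      # a: 3, b: 2, c: 2
print(degree_k(ot, 2))  # ('a','b'): 2, ('b','c'): 2, ('a','c'): 2
print(degree_k(ot, 3))  # ('a','c'): 2, ('a','b'): 2, ('b','c'): 1
```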

Degree-2 trace information is especially useful because it can be transformed to graphs. An object access graph OG = (V, E) is constructed by the following instructions:

(i) The vertex set V is equivalent to the object set O, that is, ∀i, v_i = o_i. The value s_i = […] e_i,j can be added to the graph OG to connect vertices o_i and o_j. The value w_i,j is given as the length of the edge e_i,j. Figure 3.3(b) is the object access graph of the sample trace listed above. The edges are labeled with the Degree-2 trace information.

Figure 3.3. (a) The adjacency matrix. (b) The object access graph. (c) Grouping the original object access graph into partitions.

The sum of the edge lengths of OG = (V, E) is the length of the object access trace as well as the length of the block access trace, that is,

∑_{e_i,j ∈ E} w_i,j = |OT| = |BT|

This is no coincidence, because summing up all w_i,j amounts to counting the occurrences of all segments in the trace. The object access graph is useful for handling the packing and placement problem in the following discussions.
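A minimal sketch of building the object access graph from Degree-2 information follows. The function name and the sample trace are hypothetical. The sketch counts only the adjacent pairs actually present in a finite trace, so for n accesses its edge lengths sum to n − 1; whether the first access is also counted in the model above is not stated in the text, so the sketch checks the identity only up to that constant.

```python
from collections import Counter


def object_access_graph(trace):
    """Return (V, E): V is the set of objects, E maps an undirected
    pair (o_i, o_j) to its length w_i,j (adjacent co-occurrences)."""
    vertices = set(trace)
    edges = Counter(tuple(sorted(pair)) for pair in zip(trace, trace[1:]))
    return vertices, edges


ot = "abcabdecf"  # hypothetical object access trace
V, E = object_access_graph(ot)
total = sum(E.values())
print(sorted(V))
print(dict(E))
print(total, len(ot))  # 8 9
```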

3.2 One Page Cache Model

A K-set direct mapped cache divides the memory space into K separate memory regions. The cache can hold one memory block from each region at a time. Therefore, we begin to construct the problem model from the simplest case, the 1-set direct mapped cache, which this dissertation calls the one-page cache. In this simplified model, the memory space is a monolithic region. The cache memory has only one cache block and thereby holds one memory block at a time. Because there is only one cache set, “placing” objects into cache sets need not be considered. The only task is to pack objects into memory blocks. “Packing” can be considered as a mapping function with conditions.

Definition 3.1. A legal packing is an onto-mapping f_pack: O → B, such that for each b_j,

∑_{o_i : f_pack(o_i) = b_j} size(o_i) ≤ M.

That means the total size of the objects within a memory block must be less than or equal to the cache block size.

For example, consider six objects of the object access trace in Figure 3.2(a). When the size of every object is 1 unit, and the capacity of a memory block is 2 units, the mapping shown in Figure 3.2(b) is a legal packing by definition.
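The capacity condition of Definition 3.1 can be checked with a few lines of Python. The function name is hypothetical, and so is the concrete mapping: the text only states that a and b share block X, so the assignments of c to f are assumptions.

```python
from collections import defaultdict


def is_legal_packing(f_pack, size, M):
    """True if the total size of the objects mapped to every block is <= M."""
    load = defaultdict(int)
    for obj, block in f_pack.items():
        load[block] += size[obj]
    return all(total <= M for total in load.values())


size = {o: 1 for o in "abcdef"}  # every object is 1 unit
# Assumed mapping of six objects to three blocks of capacity 2.
f_pack = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Z", "f": "Z"}
print(is_legal_packing(f_pack, size, M=2))  # True

overfull = dict(f_pack, c="X")  # block X would now hold 3 units
print(is_legal_packing(overfull, size, M=2))  # False
```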

Assume that object size is the only factor needed for constructing a mapping function. The goal is then to find a mapping function that assigns objects to memory blocks efficiently, filling memory blocks as full as possible and producing as few memory blocks as possible. This is exactly the purpose of the BIN PACKING problem [29]. Object sizes vary. A “bin” (container) corresponds to a memory block, whose capacity is a given constant. The goal is to minimize the number of bins used, which matches the purpose of reducing memory usage.
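To make the BIN PACKING view concrete, the sketch below uses first-fit decreasing, a standard textbook heuristic that is swapped in here purely for illustration; the text does not say which bin packing method, if any, the dissertation uses. The object sizes and the capacity M are hypothetical, and every object is assumed to be no larger than M.

```python
def first_fit_decreasing(size, M):
    """Pack objects into blocks of capacity M with first-fit decreasing.
    Assumes size[o] <= M for every object o."""
    bins = []  # each entry: [remaining capacity, list of objects]
    for obj in sorted(size, key=size.get, reverse=True):
        for b in bins:
            if size[obj] <= b[0]:      # first block with enough room
                b[0] -= size[obj]
                b[1].append(obj)
                break
        else:                          # no block fits: open a new one
            bins.append([M - size[obj], [obj]])
    return [objs for _, objs in bins]


size = {"a": 5, "b": 3, "c": 4, "d": 2, "e": 2}  # hypothetical sizes
blocks = first_fit_decreasing(size, M=8)
print(blocks)  # [['a', 'b'], ['c', 'd', 'e']]
```

The heuristic considers sizes only, which is exactly the limitation the following discussion addresses.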

However, if the temporal relations among objects are introduced to the construction of mapping functions, the one-page cache problem is no longer a BIN PACKING problem.

A memory block may contain several objects. Consequently, a block can appear in the block access trace consecutively and repeatedly. In the example shown in Figure 3.2, objects a and b are assigned to block X, and XX appears at the beginning of the block access trace. A trace segment consisting of one block repeated many times leads to cache hits. To deal with this situation, we define a compressed block access trace (CBT) derived from the original block access trace. It is a shorter block access trace obtained by merging repeated symbols in the block access trace, as shown in the third row of Figure 3.2(a).

Because adjacent blocks are always different in a compressed block access trace, the one-page cache has to load every memory block in it. Consequently, for an object access trace, the length of the corresponding compressed block access trace is the number of cache misses that occur in the one-page cache.
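The relation between the compressed block access trace and the misses of a one-page cache can be checked with a small simulation. The trace and the mapping are hypothetical; only the assignment of a and b to block X follows the example in the text.

```python
from itertools import groupby


def compress(bt):
    """CBT: merge runs of repeated symbols in the block access trace."""
    return [block for block, _ in groupby(bt)]


def one_page_misses(bt):
    """Simulate a cache that holds exactly one memory block."""
    cached, misses = None, 0
    for block in bt:
        if block != cached:  # mismatch: load the block
            misses += 1
            cached = block
    return misses


f_pack = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Z", "f": "Z"}
ot = "abcabdecf"                 # hypothetical object access trace
bt = [f_pack[o] for o in ot]     # X X Y X X Y Z Y Z
cbt = compress(bt)               # X Y X Y Z Y Z
print("".join(bt), "".join(cbt), one_page_misses(bt))
```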

From the viewpoint of the object access graph, the packing mapping is equivalent to grouping vertices into partitions. A vertex denotes an object, and a partition corresponds to one memory block that encloses several objects. The packing mapping thus partitions all objects into disjoint subsets. Using the packing mapping in Figure 3.2(b), the original graph is divided into three partitions, as shown in Figure 3.3(c).

This mapping is an application of BIN PACKING as mentioned above. Its purpose is to fill memory blocks with objects as fully as possible. Next, we analyze all types of temporal relations and link those types to cache misses. As shown in Figure 3.4, there are two kinds of edges in the partitioned object access graph.

• Type-I Edges: the Interior edges within partitions.

• Type-B Edges: the edges across different partitions (Blocks).

Figure 3.4. Define the type of edges in the access graph.

The sum of the lengths of the Type-B edges is the length of the compressed block access trace. The reason follows from the equation below.

∑w_i,j= |OT| = |BT| = Length( Type-I Edges ) + Length( Type-B Edges ),

where Length(Type-I Edges) is the sum of the lengths of all Type-I edges, and likewise for Length(Type-B Edges). The two objects joined by a Type-I edge are mapped to the same memory block, and they will be “compressed” in the compressed block access trace. Therefore, generating a compressed block access trace amounts to eliminating repeated symbols in the block access trace, which equals removing the Type-I edges and keeping the Type-B edges in the equation. The result is:

|CBT| = Length( Type-B Edges )

That proves the claim. The finding leads to the next claim that minimizing the cache misses is equivalent to minimizing the sum of length of Type-B edges. All these together define the following packing problem for the one-page cache model.
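The split into Type-I and Type-B edge lengths can be verified numerically. The trace, the mapping, and the function name are hypothetical. Because the sketch counts only the n − 1 adjacent pairs of an n-access trace, its Type-B total equals the number of block changes, |CBT| − 1; the remaining one is the initial load, so the identity above is checked up to that constant.

```python
from collections import Counter
from itertools import groupby


def edge_lengths(ot, f_pack):
    """Return (Length(Type-I Edges), Length(Type-B Edges))."""
    w = Counter(tuple(sorted(pair)) for pair in zip(ot, ot[1:]))
    type_i = sum(c for (u, v), c in w.items() if f_pack[u] == f_pack[v])
    type_b = sum(c for (u, v), c in w.items() if f_pack[u] != f_pack[v])
    return type_i, type_b


f_pack = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Z", "f": "Z"}
ot = "abcabdecf"  # hypothetical object access trace
cbt = [b for b, _ in groupby(f_pack[o] for o in ot)]

type_i, type_b = edge_lengths(ot, f_pack)
print(type_i, type_b, len(cbt))  # 2 6 7
```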

Definition 3.2. Construct a legal packing f_pack. Use that mapping to separate the vertices of the object access graph OG = (V, E) into disjoint partitions. Each partition corresponds to a memory block b_i. The goal is to find an optimal f_pack that minimizes the sum of the lengths of the Type-B edges (defined as Equation (3.4)), thereby minimizing the cache misses caused by reproducing the same object access trace.

minimize ∑_{e_i,j ∈ E, f_pack(o_i) ≠ f_pack(o_j)} w_i,j (3.4)

Proposition 1. The packing problem for the one-page cache is equivalent to the graph partitioning problem.

Graph partitioning is a well-known NP-complete problem [29], as introduced in Section 2.2. That means looking for an optimal mapping for the one-page cache is NP-hard as well.
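For a tiny instance, an optimal packing can still be found by exhaustive search, which also shows why the approach does not scale: the number of candidate mappings grows exponentially with the number of objects. The sketch below is only an illustration under assumed inputs (hypothetical trace, unit sizes, three blocks of capacity 2) and is not the method proposed in this dissertation.

```python
from collections import Counter
from itertools import product


def type_b_length(ot, f_pack):
    """Sum of the lengths of edges whose endpoints lie in different blocks."""
    return sum(1 for u, v in zip(ot, ot[1:]) if f_pack[u] != f_pack[v])


def brute_force_packing(ot, size, M, n_blocks):
    """Try every assignment of objects to blocks; keep the legal one
    with the smallest Type-B length."""
    objs = sorted(set(ot))
    best, best_cost = None, None
    for assign in product(range(n_blocks), repeat=len(objs)):
        f_pack = dict(zip(objs, assign))
        load = Counter()
        for o, b in f_pack.items():
            load[b] += size[o]
        if any(total > M for total in load.values()):
            continue  # violates the block capacity
        cost = type_b_length(ot, f_pack)
        if best_cost is None or cost < best_cost:
            best, best_cost = f_pack, cost
    return best, best_cost


ot = "abcabdecf"  # hypothetical object access trace
size = {o: 1 for o in set(ot)}
best, best_cost = brute_force_packing(ot, size, M=2, n_blocks=3)
print(best, best_cost)
```

For this trace the search groups {a, b}, {d, e}, and {c, f}, leaving a Type-B length of 4.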

3.3 Direct Mapped Cache

Arranging objects for a general K-set direct mapped cache (K > 1) involves not only packing but also placement. Because the main memory is divided into K regions, there are K memory block sets. Each set B_k = {b_k, b_{k+1×K}, b_{k+2×K}, …} contains more than one memory block, where 0 ≤ k < K. The combination of the two operations creates a two-dimensional mapping that maps every object to a (set, block) pair, defined as follows.

Definition 3.3. f_pp : O → S × B_k, where O is the object set, S represents the cache sets, and B_k represents the blocks in the k-th cache set.

The mapping transforms an object access trace OT into a block access trace BT, and each element of the BT is an ordered pair of set and block. According to the mapped cache set index k, the BT can be decomposed into K disjoint block access sub-traces, denoted as BT_k, where 0 ≤ k < K. Meanwhile, the mapping of the one-page cache can be regarded as a special case, a one-dimensional mapping working on the subspace f_pp : O → 1 × B₀. As a result, in that case the trace is no longer decomposable.

OT   abhecfafgbhcgdefegfcdbhfdahegdaf
BT   WWZYXYWYZWZXZXYYYZYXXWZYXWZYZXWY
BT₀  WW  X W  W X X     XXW  XW   XW
BT₁    ZY Y YZ Z Z YYYZY   ZY  ZYZ  Y
CBT₀ W   X W    X         W  XW   XW
CBT₁   ZY    Z     Y  ZY   ZY  ZYZ  Y

(a)

o_i           a      b      c      d      e      f      g      h
b_j           W      W      X      X      Y      Y      Z      Z
(set, block)  (0,0)  (0,0)  (0,1)  (0,1)  (1,0)  (1,0)  (1,1)  (1,1)

(b)

Figure 3.5. (a) An example of object access trace, block access trace, block access sub-traces, and compressed block access sub-traces. (b) A legal f_pp injects eight objects to four memory blocks.

Consider accessing eight objects on a 2-set direct mapped cache. The OT in Figure 3.5(a) is an object access trace that consists of eight objects. Figure 3.5(b) shows an f_pp that maps these objects to memory blocks. A memory block can be numbered as a (set, block) pair. Figure 3.5(a) also shows the BT, which is converted from the OT by the mapping f_pp, and the two decomposed sub-traces, BT₀ and BT₁.
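The example of Figure 3.5 can be reproduced with a short script. The trace and the mapping are taken from the figure; the variable names are hypothetical, and counting one miss per entry of each compressed sub-trace follows the one-page cache model of Section 3.2.

```python
from itertools import groupby

OT = "abhecfafgbhcgdefegfcdbhfdahegdaf"  # object access trace, Figure 3.5(a)
# Figure 3.5(b): object -> block name, and block name -> (set, block) pair.
BLOCK_OF = {"a": "W", "b": "W", "c": "X", "d": "X",
            "e": "Y", "f": "Y", "g": "Z", "h": "Z"}
SET_BLOCK = {"W": (0, 0), "X": (0, 1), "Y": (1, 0), "Z": (1, 1)}
K = 2  # number of cache sets

# OT -> BT through the mapping f_pp.
bt = "".join(BLOCK_OF[o] for o in OT)

# Decompose BT into one sub-trace BT_k per cache set.
sub = {k: "".join(b for b in bt if SET_BLOCK[b][0] == k) for k in range(K)}

# Compress every sub-trace by merging runs of repeated blocks.
cbt = {k: "".join(b for b, _ in groupby(s)) for k, s in sub.items()}

misses = sum(len(c) for c in cbt.values())
print(bt)
print(sub)
print(cbt)
print(misses)  # 9 + 12 = 21
```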

Because memory blocks belonging to the same cache set contend for a single cache block, each block access sub-trace can be regarded as a standalone block access trace working on a one-page cache. In this respect, the number of cache misses
