
CHAPTER 3 PROBLEM MODELING

3.4 Fully Associative Cache

A fully associative cache consists of a number of cache blocks, and each memory block in main memory can be bound to any cache block. The mapping from memory blocks to cache blocks is therefore a many-to-many relation, in contrast to the many-to-one (onto) relation of the direct mapped cache. The address of an object or of a memory block no longer determines its location in the cache memory. In other words, there is only one set in this organization. Therefore, generating object layouts for the fully associative cache consists solely of the “packing” movement.

The “placement” movement is meaningless in this case. This property is similar to what we have discussed for the one-page cache model. In fact, the one-page cache model can be regarded as a special case of the fully associative cache: it is a fully associative cache with only one cache block.

Can the optimal packing for the one-page cache be applied to the n-page fully associative cache? Consider the object access trace in Figure 3.2(a). The mapping in Figure 3.2(b) is optimal in the sense of graph partitioning, and thereby generates the shortest compressed block access trace (CBT), with 15 elements. When this CBT is applied to a 2-page fully associative cache, it causes 8 cache misses on a FIFO cache, 9 cache misses on an LRU cache, and 6 cache misses on an OPT cache. However, Figure 3.7 shows another packing mapping for this 2-page cache. Its CBT has 18 elements, longer than the previous one, but it causes 7 cache misses on an LRU cache, 9 cache misses on a FIFO cache, and 7 cache misses on an OPT cache. This packing mapping is a counterexample that answers the question in the negative.

OT:  a b e f a f b c d e f e c d b d a e d a f
BT:  X X Z Y X Y X Y Z Z Y Z Y Z X Z X Z Z X Y
CBT: X   Z Y X Y X Y Z   Y Z Y Z X Z X Z   X Y

(a)

o_i: a b c d e f
b_j: X X Y Z Z Y

(b)

Figure 3.7. (a) An example of an object access trace (OT), block access trace (BT), and compressed block access trace (CBT), in three rows. (b) A legal packing mapping that assigns six objects to three memory blocks.
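The trace conversions in Figure 3.7 and the FIFO and LRU miss counts quoted for it can be replayed with a short simulation. The following Python sketch is illustrative only: the helper names (`to_block_trace`, `compress`, `lru_misses`, `fifo_misses`) are hypothetical and not part of the original work, and the cache is assumed to start empty, so every first reference counts as a miss.

```python
from collections import OrderedDict, deque


def to_block_trace(object_trace, packing):
    """Map each accessed object to the memory block that holds it."""
    return [packing[o] for o in object_trace]


def compress(block_trace):
    """Drop consecutive repeats of the same block to obtain the CBT."""
    cbt = []
    for block in block_trace:
        if not cbt or cbt[-1] != block:
            cbt.append(block)
    return cbt


def lru_misses(trace, pages):
    """Count misses of a fully associative LRU cache with `pages` blocks."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            if len(cache) == pages:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses


def fifo_misses(trace, pages):
    """Count misses of a fully associative FIFO cache with `pages` blocks."""
    queue = deque()  # oldest loaded block on the left
    misses = 0
    for block in trace:
        if block not in queue:
            misses += 1
            if len(queue) == pages:
                queue.popleft()  # evict the block loaded first
            queue.append(block)
    return misses


# Object trace and packing mapping taken from Figure 3.7.
OT = "abefafbcdefecdbdaedaf"
PACKING = {"a": "X", "b": "X", "c": "Y", "d": "Z", "e": "Z", "f": "Y"}

BT = to_block_trace(OT, PACKING)
CBT = compress(BT)
```

Under these assumptions the packing of Figure 3.7(b) yields a CBT of 18 elements, with 7 misses on a 2-page LRU cache and 9 misses on a 2-page FIFO cache, in agreement with the counts given in the text.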

The counterexample also shows that a placement that is optimal for the 2-page cache is not necessarily optimal for the one-page cache. This implies that finding a universally optimal placement for all sizes of fully associative cache is impossible: once a placement is tailored to a k-page cache, it is not guaranteed to be optimal for an r-page cache with k ≠ r. The reason is that the OG keeps only pair-wise information, and the same OG can be constructed from many different object access traces. In other words, the transformation from access traces to OGs is an “onto” (many-to-one) mapping, so an OG cannot express the precise temporal order of every object access trace from which it can be derived. The following discussion explores object relations using higher degrees of trace information.

There are intrinsic differences between the one-page cache and the n-page fully associative cache (n > 1). Since several memory blocks can stay in the cache concurrently, mapping objects to blocks must consider the inter-block relationship. In other words, the inter-block relationship affects how the block access trace is loaded. Because its capacity is limited, the n-page cache must employ a replacement algorithm. When the processor cannot find the memory block it is about to access in the cache memory, the cache uses the replacement algorithm to choose a victim cache block and reclaim its storage space: the victim is committed to main memory if it is dirty, and its entry is invalidated. The reclaimed cache block is then used to swap in the desired memory block. Belady [9] and Smith and Goodman [97] have studied replacement algorithms intensively. The goal of every replacement algorithm is to decide which element should leave. If the replacement algorithm is optimal (OPT or MIN in the literature), it picks the memory block currently in the cache whose next access lies farthest in the future. Conversely, the memory blocks remaining in

Figure 3.8 illustrates the concept of the OPT algorithm. The string in the figure is an object access trace. Assume the capacity of the cache memory is four elements. The processor has already accessed the symbols from the beginning of the string up to the dashed cut line (left to right). Each arrow connects an accessed object to its nearest next occurrence in the string. At the given moment, the cache memory contains the four symbols {a,b,e,f}. Since the next symbol c is absent from the cache memory, a capacity miss arises. Under OPT replacement, symbol a is chosen as the victim because its next occurrence is farther away than those of the others.

abefafbcdefecdbda

abef symbol set in the cache at the given moment

Figure 3.8. Choosing the least used element by OPT replacement.
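The victim choice described for Figure 3.8 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: `opt_victim` is a hypothetical helper, and the position of the dashed cut (index 7, where symbol c is about to be accessed) is read off the trace as an assumption.

```python
def opt_victim(trace, position, cached):
    """Return the cached symbol whose next use at or after `position` is farthest away."""
    def next_use(symbol):
        try:
            return trace.index(symbol, position)
        except ValueError:
            return float("inf")  # never used again, so it is the ideal victim
    return max(cached, key=next_use)


TRACE = "abefafbcdefecdbda"
# The cache holds {a, b, e, f}; the access to c at index 7 misses.
victim = opt_victim(TRACE, 7, ["a", "b", "e", "f"])
```

Here the next uses of a, b, e, and f are at indices 16, 14, 9, and 10 respectively, so `victim` is `"a"`.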

The goal of the “packing” method is opposite to that of replacement algorithms. It identifies objects that tend to be accessed together in the near future and puts them into the same memory block. Consider the same object access trace in Figure 3.9. The set {a,b,e,f} consists of the objects existing in the cache memory at moment t1, and it is termed the lived object set in this article. The next four symbols to be accessed, {c,d,e,f}, constitute a neighboring lived object set. Evidently {a,b,e,f} ∩ {c,d,e,f} = {e,f} and {a,b,e,f} \ {e,f} = {a,b}. That means that when the clock shifts to t2, {e,f} will persist in the cache memory and {a,b} will no longer be used. Therefore, if a memory block can hold 2 objects in total and the capacity of the cache memory is 2 blocks, the best policy (valid only at this moment) from the beginning to t2 is to group {a,b} in one block and {e,f} in the other.

abefafbcdefecdbda

abef symbol set in the cache at the given moment

t1 t2

Figure 3.9. Compare the two locality sets along the object access trace.
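The comparison of the two lived object sets is plain set arithmetic. A small Python sketch, assuming t1 falls after the seventh access of the trace and t2 four accesses later (the slice bounds are an assumption read off the trace, not given in the text):

```python
TRACE = "abefafbcdefecdbda"

lived_t1 = set(TRACE[:7])     # objects accessed up to t1: a, b, e, f
lived_t2 = set(TRACE[7:11])   # the next four accesses: c, d, e, f

persisting = lived_t1 & lived_t2   # objects still needed at t2
retired = lived_t1 - lived_t2      # objects no longer used at t2
```

With a block capacity of two objects, `persisting` and `retired` are exactly the two groups the text suggests packing into separate blocks.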

As discussed in Section 3.1, the Degree-2 trace information is collected from the pair-wise relations between two objects. In other words, its predictive scope is limited to one successive object. However, as the previous discussion shows, the predictive scope should extend beyond one object. Therefore, the Degree-k trace information must be used to group objects that are accessed “in the near future”.
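One way to picture Degree-k trace information is as a sliding window over the trace. The Python sketch below rests on an assumption, since the definition from Section 3.1 is not reproduced here: it takes Degree-k information to be the count of unordered pairs of distinct objects that occur at most k-1 accesses apart. The function name `degree_k_weights` is hypothetical.

```python
from collections import Counter


def degree_k_weights(trace, k):
    """Count unordered pairs of distinct objects at most k-1 accesses apart.

    Assumed reading of "Degree-k": k = 2 counts only adjacent accesses,
    k = 3 also counts accesses separated by one other access, and so on.
    """
    weights = Counter()
    for i, first in enumerate(trace):
        for second in trace[i + 1 : i + k]:
            if second != first:
                weights[frozenset((first, second))] += 1
    return weights


TRACE = "abefafbcdefecdbda"
degree2 = degree_k_weights(TRACE, 2)
degree3 = degree_k_weights(TRACE, 3)
```

Under this reading, widening the window from k = 2 to k = 3 raises the weight between a and b from 1 to 3, which shows how a larger k exposes relations that adjacency alone misses.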

a b e f a f b c d e f e c d b d a

(Labels in the figure: “The candidate of the victim block”, “F”.)

Figure 3.10. The object locality set held by the cache contributes lengths to the edges of the object access graph.

Combining the discussions above, Figure 3.10 illustrates the relations between objects in the lived object set during the interval from t0 to t1. Assume that the first memory block holds objects {a,b}, the second holds {e,f}, and both are loaded in the 2-page cache memory. The Degree-2 and Degree-3 trace information extracted from the interval t0 to t1 contributes edges to the object access graph (OG) in the figure. The edges are classified into the following types.

• Type-I Edges: Interior edges within partitions. They connect objects within a block in the lived object set.

• Type-B Edges: Edges across different partitions (Blocks) in the lived object set.

• Type-F Edges: Interior edges within partitions. One endpoint is an object in the lived object set, and the other is an object that comes after the lived object set in the trace (in the post trace). After a shift in time, they become Type-I edges.

• Type-P Edges: Edges that connect objects in two different partitions, one of which is a partition in the lived object set and the other is not. After a shift in time, they become Type-B edges.

• Type-R Edges: Defined similarly to Type-P and Type-F edges, except that one endpoint belongs to the partition (block) in the lived object set that will later be discarded by the replacement algorithm.

A transition along a Type-P or Type-R edge implies that the object and the block it belongs to appear in the next lived object set. Since the victim block is no longer in the cache, a Type-R transition causes a capacity miss. Therefore, a good packing mapping should reduce the number of Type-R edges and increase the number of Type-B, Type-F, and Type-I edges over all lived object sets.

This model applies to the one-page cache model as well. There are no Type-B or Type-P edges in the one-page cache model because the cache memory can hold only one memory block. As a result, the only goal is to increase the number of Type-I edges.

abefafbcdefecdbda

(a)

abefafbcdefecdbda

(b)

Figure 3.11. Using Degree-2 and Degree-3 trace information to find the closest objects to objects a and e.

The example in Figure 3.11 is small enough that the mapping of objects can be derived by observation. Figure 3.11(a) shows the Degree-2 and Degree-3 trace information with respect to object a; by simple counting, objects {b, e} appear to be the closest. Figure 3.11(b) shows the trace information with respect to object e, for which objects {d, f} are the closest. Observing the trace information in this way leads to the mapping in Figure 3.2(b).

How such layouts can be realized needs further discussion. Type-R and Type-P edges are similar in that both connect forward to objects. When the OPT replacement algorithm selects the victim blocks, and hence the Type-R edges, Length(Type-R Edges) < Length(Type-P Edges) holds and the cache miss rate is minimized.

However, the classes of replacement algorithms that can be realized have no knowledge of future accesses. For example, a RANDOM replacement algorithm invalidates arbitrary cache blocks upon misses. Such algorithms can spoil the scheme created by the Degree-k trace information: the associations made through Type-P and Type-R edges are in vain, and the effectiveness of the Degree-k trace information is suppressed as well. This is why the mapping in Figure 3.2(b) outperforms the mapping in Figure 3.7(b) on an OPT cache, while the order is reversed when both are applied to an LRU cache. Only Type-I (and Type-F) edges preserve their effectiveness across different classes of replacement algorithms. Based on these observations, we propose the approaches in Section 4.3.

Chapter 4

Practical Approaches

4.1 Hardness of Packing and Placement for Direct Mapped Cache

Section 3.3 analyzes the properties of the packing and placement problem for the k-set direct mapped cache. It proposes a method that transforms the object access trace into a graph by using the Degree-2 trace information. The graph expresses the relations between objects, memory blocks, and sets. The temporal relations among these entities classify the edges in the graph into three types (Type-I, Type-B, and Type-S). The goal of the packing and placement problem is to create a memory layout that minimizes cache misses when the same object access trace is replayed. Given the nature of the pair-wise trace information, we derive the following formula to estimate cache misses:

|BT| - ( Length(Type-I-Edges) + Length(Type-S-Edges) ) (4.1)

The length of the block access trace |BT| is a constant in this formula, whereas the lengths of Type-I and Type-S edges are determined by the object layout. In other words, maximizing the sum of these lengths minimizes cache misses. The following equation shows that minimizing the total length of all Type-B edges is a dual problem to Equation (4.1).

|BT| - (Length(Type-I-Edges) + Length(Type-S-Edges))

= (Length(Type-I-Edges) + Length(Type-B-Edges) + Length(Type-S-Edges)) - (Length(Type-I-Edges) + Length(Type-S-Edges))

= Length(Type-B-Edges)

(4.2)
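The identity in Equation (4.2) can be checked numerically on a small layout. The Python sketch below makes several assumptions that are not spelled out in this section: an edge is Type-I when both accesses fall in the same block, Type-B when they fall in different blocks of the same set, and Type-S when they fall in different sets; an edge "length" is the number of adjacent access pairs of that type; and the number of adjacent pairs stands in for |BT|. The layout and the helper `edge_lengths` are hypothetical.

```python
def edge_lengths(object_trace, layout):
    """Tally adjacent accesses by edge type for a direct mapped cache.

    `layout` maps each object to a (set_index, block_id) pair.
    """
    lengths = {"I": 0, "B": 0, "S": 0}
    for prev, cur in zip(object_trace, object_trace[1:]):
        prev_set, prev_block = layout[prev]
        cur_set, cur_block = layout[cur]
        if (prev_set, prev_block) == (cur_set, cur_block):
            lengths["I"] += 1   # same block
        elif prev_set == cur_set:
            lengths["B"] += 1   # different blocks competing for one set
        else:
            lengths["S"] += 1   # blocks that live in different sets
    return lengths


TRACE = "abefafbcdefecdbda"
# Hypothetical layout: blocks X and Z share set 0, block Y sits in set 1.
LAYOUT = {"a": (0, "X"), "b": (0, "X"), "c": (1, "Y"),
          "d": (0, "Z"), "e": (0, "Z"), "f": (1, "Y")}

lengths = edge_lengths(TRACE, LAYOUT)
total = sum(lengths.values())
estimated_misses = total - (lengths["I"] + lengths["S"])
```

Because every adjacent pair falls into exactly one of the three types, `estimated_misses` always equals `lengths["B"]`, which is the duality the equation states.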

Therefore, the packing and placement problem can be defined as follows.

Definition 4.1. Consider a K-set direct mapped cache and a set of objects to be allocated in memory, defined as $O = \{o_1, o_2, o_3, \ldots\}$. The memory space is partitioned into K disjoint sets of memory blocks. A set, denoted $s_i = \{b_{i,1}, b_{i,2}, b_{i,3}, \ldots\}$, represents a collection of memory blocks, where each $b_{i,j}$ denotes a memory block belonging to the i-th set $s_i$. The size of each memory block $b_{i,j}$ is M. The purpose is to find a legal mapping function $f_{pp}(o_i) \to b_{r,t}$ that assigns each object to a memory block in a specific set. Meanwhile, the mapping must minimize the following cost function:

 

In the last equation, $w(o_i, o_j)$ is the value from the Degree-2 trace information or, equivalently, the length of the Type-B edges.

Next, we show that finding an optimal solution to this problem is as hard as solving the MIN k-PARTITION problem. MIN k-PARTITION is the dual problem of MAX k-CUT [43].

Consider a graph $G_1 = (V, E)$ with K partitions, where $|V(G_1)| = Q$ and each vertex is associated with the value K. Since the vertex set V is divided into K partitions, the numbers of vertexes in the partitions are denoted $(n_1, n_2, \ldots, n_K)$, and $v_{r,j}$ denotes the j-th vertex of the r-th partition. In other words, the vertex subset $\{v_{r,1}, \ldots, v_{r,n_r}\}$ contains the vertexes belonging to the r-th partition. Figure 4.1 shows an example of $G_1$, in which the dashed lines divide the graph into partitions. We use different notations to distinguish edges within and across partitions: $p(v_{r,h}, v_{r,s})$ denotes the length of an edge inside the r-th partition, and $w(v_{r,h}, v_{q,s})$ denotes the length of an edge across two distinct partitions.

Since the graph $G_1$ is assumed to satisfy the conditions of MIN k-PARTITION, the total length of the edges within partitions, $\sum p(u,v)$, is minimal compared with every other partitioning of $G_1$. Meanwhile, the condition $\sum w(x,y) > \sum p(u,v)$ holds.

Next, we create a mapping F(v) that transforms $G_1 = (V, E)$ into $G_2 = (V', E')$, where $G_2$ is a restricted instance of the packing and placement problem for the direct mapped cache. The mapping F(v) works in the following way.

$\forall v_{i,j} \in V,\ F(v_{i,j}) = \{v'_{i,j,1}, \ldots, v'_{i,j,K}\}$, where $v' \in V'$. (4.4)

The mapping evenly splits every vertex $v_{i,j}$ into K fractional vertexes. The value is evenly distributed among the fractions as well; that is, the value of each fraction $v'_{i,j,t}$ is $K/K = 1$. As a result, we get the transformed vertex set $V' = \bigcup_{v_{i,j} \in V} F(v_{i,j})$. The fractional vertexes $\{v'_{i,j,1}, \ldots, v'_{i,j,K}\}$ of each original vertex are connected to each other and form a complete graph on K vertexes. Therefore, $\frac{K(K-1)}{2}$ edges per original vertex are appended to the edge set $E'(G_2)$. The edge length h is given to all edges of this kind, with $h = \sum w(x,y) + \sum p(u,v)$, that is, the sum of the lengths of all intra-partition and inter-partition edges of $G_1$. This ensures that no edge in $E(G_1)$ is longer than h. The fractional vertex $v'_{i,j,1}$ takes over the role of $v_{i,j}$, and the edges connected to $v_{i,j}$ are re-attached to $v'_{i,j,1}$ correspondingly. Therefore, the edge set $E'(G_2)$ is expressed as follows.

For example, the graph $G_2$ in Figure 4.2 is constructed from the graph $G_1$ in Figure 4.1 using the method discussed above. The vertexes and the sub-graph enclosed by each shaded area in $G_2$ are expanded from a single vertex in $G_1$.
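The vertex-splitting construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: `split_graph` is a hypothetical helper, a fraction $v'_{i,j,t}$ is represented as the tuple `(vertex, t)`, vertexes of $G_1$ are taken from the edge list (so isolated vertexes are ignored), and the sample edge lengths are made up.

```python
from itertools import combinations


def split_graph(edges, K):
    """Build G2 from G1 by splitting every vertex into K unit-value fractions.

    `edges` maps a pair (u, v) of G1 vertexes to the edge length.
    Returns (vertexes of G2, edges of G2, h).
    """
    h = sum(edges.values())  # total length of all G1 edges
    g1_vertexes = sorted({v for pair in edges for v in pair})
    g2_vertexes, g2_edges = [], {}
    for v in g1_vertexes:
        fractions = [(v, t) for t in range(1, K + 1)]
        g2_vertexes.extend(fractions)
        # The K fractions of one vertex form a complete graph with length h.
        for a, b in combinations(fractions, 2):
            g2_edges[(a, b)] = h
    # Original edges are re-attached to the first fraction of each endpoint.
    for (u, v), length in edges.items():
        g2_edges[((u, 1), (v, 1))] = length
    return g2_vertexes, g2_edges, h


# Hypothetical G1 with Q = 4 vertexes and made-up edge lengths.
G1_EDGES = {("v11", "v12"): 3, ("v21", "v22"): 2, ("v12", "v21"): 5}
K = 2
vertexes, g2_edges, h = split_graph(G1_EDGES, K)
```

For this sample, $G_2$ has $K \cdot Q = 8$ vertexes, and $Q \cdot K(K-1)/2 = 4$ edges of length h are added on top of the three original edges.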

Next, we show that the optimal partition layout of $G_1$ that satisfies MIN k-PARTITION can be transformed into an optimal layout of $G_2$ for the K-set packing and placement problem. $G_2$ can be used to represent an object access graph: each vertex $v'_{i,j,t}$ represents an object, its value corresponds to the size of the object, and the length of each edge is given by the Degree-2 trace information.

Besides, the block size constraint is assumed to be K. Since each vertex subset $\{v_{r,1}, v_{r,2}, \ldots, v_{r,n_r}\}$ belongs to the same partition in $G_1$, the vertex subset $\{\{v'_{r,1,1}, \ldots, v'_{r,1,K}\}, \ldots, \{v'_{r,n_r,1}, \ldots, v'_{r,n_r,K}\}\}$ is grouped into the same r-th partition in $G_2$. Consider the sub-graph enclosed within the r-th partition. The length of an edge that connects $v'_{r,x,1}$ and $v'_{r,y,1}$ is $p(u,v)$, which is smaller than h; $p(u,v) < h$ holds by our scheme. Therefore, every subset $\{v'_{r,t,1}, v'_{r,t,2}, \ldots, v'_{r,t,K}\}$ can be consolidated into a memory block, and the total size of the objects in a memory block fulfills the problem requirement. The total length of all Type-B edges is exactly $\sum p(u,v)$, and $\sum p(u,v) < \sum w(x,y) + h \cdot \frac{K(K-1)}{2} \cdot Q$ holds.

Therefore, the layout satisfies the problem requirements.

Consequently, the K-set packing and placement problem is as hard as MIN k-PARTITION, and hence as MAX k-CUT. Since no polynomial-time algorithm is known that finds an optimal layout satisfying MIN k-PARTITION, none is known for the K-set packing and placement problem either.

(Graph with vertexes $v_{1,1}$ to $v_{4,3}$ and edges labeled $p_1$ to $p_8$ and $w_1$ to $w_9$.)

Figure 4.1. A partitioned graph that satisfies MIN k-PARTITION. The symbols $w_i$ and $p_i$ denote edge lengths.

Figure 4.2. A sample graph transformed from Figure 4.1.

4.2 Approaches for Direct Mapped Cache

The previous section has shown that it is hard to find an optimal solution that minimizes the total length of Type-B edges, which implies that the K-set packing and placement problem is hard by nature. In practice, a solution is found by decomposing the main problem into smaller sub-problems after constructing the object access graph and finding a feasible solution for each of them. According to the heuristic goal in Section 3.3, the objective of the packing and placement problem is to maximize (Length(Type-I-Edges) + Length(Type-S-Edges)). This suggests a two-stage approach that seeks feasible answers by dealing with each of the two terms individually. One method is to maximize Length(Type-I-Edges) first and Length(Type-S-Edges) afterwards. The reverse order, maximizing Length(Type-S-Edges) first and Length(Type-I-Edges) afterwards, is an alternative method for comparison. The two orders represent different ways of solving the same problem. According to the findings in Section 3.3, maximizing Length(Type-I-Edges) is sure to increase cache hits. Therefore, we predict that the first method should be better, since it favors longer edges becoming Type-I edges, which directly affects cache miss counts. Still, both approaches are discussed in the following sub-sections, and the experiment in Chapter 6 implements both to verify our prediction.

4.2.1 Packing Followed by Placement

The first step of the approach is to maximize Length(Type-I-Edges) in the object access graph. We call this step the “packing” stage. It examines the temporal relations in the object access trace and “packs” objects into memory blocks. The packing stage can be regarded as an application of the one-page cache problem stated in Definition 3.2: both begin with the object access trace and the corresponding graph OG. The purpose is to assign objects O to memory blocks B under the condition that the capacity of a memory block is M.

Maximizing the sum of weighted edges that lie in blocks (Type-I edges) is a dual

stated, constructing such a mapping function fpack is equivalent to finding answers for a graph partitioning problem. Therefore, we have to use a heuristic method to assign objects to memory blocks in practice. Once this sub-problem is solved, the original object access trace can be converted to a block access trace.

Many studies provide algorithms for graph partitioning, and most of them are based on the work of Kernighan and Lin [31]. However, their method seems unsuitable for the packing problem because it cannot separate a graph into an arbitrary number of partitions.

Therefore, our implementation adopts the greedy algorithm in Figure 4.3 to perform the partitioning (similar to the approach in [99]). The algorithm iteratively merges the two vertexes (objects) connected by the heaviest edge into a larger piece. A merged piece cannot be larger than a memory block. The collection of objects in a memory block grows as vertexes are progressively merged, and the procedure continues until no vertexes qualify for merging. Given a random graph, the average time complexity of the algorithm is O(|V|^2). When the algorithm is applied to the OG of a typical program, the average complexity becomes O(|V|).
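The greedy merging idea can be sketched in Python. This is a naive illustration written from the prose description only, not the algorithm of Figure 4.3: the function name `greedy_pack`, the choice to sum edge weights between merged pieces, and the sample sizes and weights are all assumptions. It recomputes the candidate merges in every round, so it does not reflect the complexity figures quoted in the text.

```python
def greedy_pack(sizes, weights, block_size):
    """Greedily merge the two pieces joined by the heaviest edge.

    `sizes` maps object -> size; `weights` maps frozenset({u, v}) -> weight.
    A merge is allowed only if the merged piece fits in `block_size`.
    Returns a sorted list of object groups, one per memory block.
    """
    group_of = {o: frozenset([o]) for o in sizes}
    group_size = {group_of[o]: sizes[o] for o in sizes}
    while True:
        # Sum the edge weights between every pair of distinct pieces.
        between = {}
        for pair, w in weights.items():
            u, v = tuple(pair)
            gu, gv = group_of[u], group_of[v]
            if gu != gv:
                key = frozenset([gu, gv])
                between[key] = between.get(key, 0) + w
        # Keep only merges that still fit in one memory block.
        fits = {k: w for k, w in between.items()
                if sum(group_size[g] for g in k) <= block_size}
        if not fits:
            break
        ga, gb = tuple(max(fits, key=fits.get))
        merged = ga | gb
        group_size[merged] = group_size.pop(ga) + group_size.pop(gb)
        for o in merged:
            group_of[o] = merged
    return sorted(sorted(g) for g in set(group_of.values()))


# Hypothetical unit-size objects and edge weights; blocks hold two objects.
SIZES = {o: 1 for o in "abcdef"}
WEIGHTS = {frozenset("ab"): 5, frozenset("ef"): 4, frozenset("cd"): 3,
           frozenset("be"): 2, frozenset("cf"): 1}
blocks = greedy_pack(SIZES, WEIGHTS, block_size=2)
```

With these sample weights the heaviest edges are merged first, and the remaining edges are rejected because a merged piece would exceed the block size, leaving the groups {a,b}, {c,d}, and {e,f}.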

There are well-known packages for graph partitioning, such as METIS [100]. In theory, we do not need to propose a graph-partitioning solver, because we did not define a variant of the graph partitioning problem; a package like METIS should be able to handle the problem well.

In fact, we did adopt METIS while developing the experiment, but it proved insufficient for our application for two reasons. First, the number of partitions to generate must be given as a parameter, but it is not a constant in our experiment.

Second, the errors in individual partition sizes are too large for our application. For example, if the layout calls for 512-byte partitions, METIS generates some partitions of 400 bytes, which means the layout eventually occupies more memory space. Since our partitions can be small, these errors can lead to very different experimental results. In other words, it wastes
