
Chapter 3 Hint-aided Cache Contention Avoidance Technique

3.3 Hint generation

Figure 3.3(a) illustrates the conventional processing flow, which is widely used in existing systems [25]. As shown in Figure 3.3(b), the hint generation stage resides after the linking stage of the high-level language processing flow.

Analyzing the binary image takes considerable time. In order to speed up the prediction, we first extract the information required by the prediction in this phase. Without this phase, the prediction process would take an unacceptably long time.

As shown in Figure 3.2, this phase includes two methods. We describe the binary image dissection in Section 3.3.1 and the heap usage profiling in Section 3.3.2.

Figure 3.3 The high-level programming language processing flow.

(a) The conventional processing flow. (b) The proposed processing flow.

3.3.1 Binary image dissection

This mechanism is designed to extract the necessary information from the binary image. The extracted information can be used to predict memory accesses. First, we divide the code portion of the binary image into basic blocks. Then, we extract the basic block characteristics that can assist the prediction. These characteristics are described in the following.
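To make this step concrete, the following sketch shows the classic leader-based partitioning that such a dissector could use; the simplified Insn record and all names in it are our illustrative assumptions, not part of the proposed system.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified decoded instruction (hypothetical layout). */
typedef struct {
    uint32_t addr;      /* instruction address                */
    bool     is_branch; /* transfers control (branch or call) */
    uint32_t target;    /* branch/call target when is_branch  */
} Insn;

/* Leader-based partitioning: the first instruction, every branch
 * target, and every instruction following a branch starts a new
 * basic block.  The O(n^2) target lookup keeps the sketch short. */
void mark_leaders(const Insn *code, size_t n, bool *leader)
{
    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (i + 1 < n)
            leader[i + 1] = true;          /* fall-through leader  */
        for (size_t j = 0; j < n; j++)     /* branch-target leader */
            if (code[j].addr == code[i].target)
                leader[j] = true;
    }
}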

Consider a basic block Bi. We denote by code(Bi) the address set of the instructions included in Bi. exp_time(Bi) is the expected execution time of Bi, which can be computed by summing the clock cycles consumed by all the instructions belonging to Bi.

Figure 3.4 An example of the right-sibling left-child binary tree representation of basic blocks.

static(Bi) is the address set of the accessed data that resides in the static partition. We can obtain these addresses from the binary image, because instructions that access the static partition usually use immediate values to indicate their destinations. stack(Bi) is the memory size that must be allocated from the stack partition. The call stack is used to store local data structures, such as local variables and call parameters. Therefore, stack(Bi) can be obtained by counting the local data structures that are allocated and used in Bi. next_bb(Bi) indicates the basic block executed next after Bi. If Bi is the last basic block of a procedure, next_bb(Bi) is set to an empty value. For example, as shown in Figure 3.4, next_bb(B3) is B4 and next_bb(B5) is empty. If Bi contains a procedure call instruction, then call_bb(Bi) indicates the first basic block of the call target. Otherwise, call_bb(Bi) is set to an empty value. For example, in Figure 3.4, call_bb(B3) is B6, and call_bb(B4) is empty.
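One possible in-memory layout for these per-block characteristics is sketched below; the struct and field names are illustrative only, not taken from the thesis.

#include <stddef.h>
#include <stdint.h>

/* Per-basic-block record for the extracted characteristics
 * (struct and field names are illustrative only).           */
typedef struct BasicBlock {
    uint32_t          *code;        /* code(Bi): instruction addresses      */
    size_t             code_len;
    uint64_t           exp_time;    /* exp_time(Bi): expected clock cycles  */
    uint32_t          *static_set;  /* static(Bi): static-partition targets */
    size_t             static_len;
    size_t             stack_size;  /* stack(Bi): stack bytes required      */
    struct BasicBlock *next_bb;     /* next_bb(Bi); NULL when Bi is last    */
    struct BasicBlock *call_bb;     /* call_bb(Bi); NULL when no call       */
} BasicBlock;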

In the following, we use a right-sibling left-child binary tree to represent the execution flow of the task. In this binary tree, a node represents a basic block, and an edge indicates the control dependency between the two connected nodes. For every basic block Bi, we let call_bb(Bi) and next_bb(Bi) be the child and the sibling node of Bi, respectively. However, if the given task contains a recursive call, the recursion creates a cycle in the right-sibling left-child binary tree. Hence, for a basic block Br that performs a recursive call, we set call_bb(Br) to an empty value and merge Br into next_bb(Br). The corresponding right-sibling left-child binary tree of Figure 3.4(b) is shown in Figure 3.4(c).
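Since call_bb(Bi) and next_bb(Bi) already serve as the child and sibling pointers, no extra structure is needed to hold the tree. A minimal traversal sketch over the BasicBlock record above, assuming recursive calls have already been cut as just described, could look like this:

/* Depth-first walk of the right-sibling left-child tree: call_bb is
 * the left child, next_bb the right sibling.  Recursive calls are
 * assumed to have been cut already (call_bb set to NULL), so the
 * walk terminates.                                                  */
void walk(const BasicBlock *b, void (*visit)(const BasicBlock *))
{
    while (b != NULL) {
        visit(b);
        walk(b->call_bb, visit);  /* descend into the called procedure */
        b = b->next_bb;           /* continue with the sibling block   */
    }
}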

3.3.2 Heap usage profiling

The execution of the same binary image with different input data may result in different execution traces. In order to discover the memory-referencing instructions that have predictable behavior, we propose a mechanism that analyzes the execution trace collected during profiling. Before describing this analysis mechanism in detail, we introduce the following terminology.

Consider a memory access ai. task(ai) is the task that performs ai; inst(ai) is the number of instructions that task(ai) has executed before ai; clkc(ai) is the number of clock cycles elapsed since the start of the execution of task(ai); addr(ai) is the memory address that ai accesses; and instruction(ai) is the memory-referencing instruction that performs ai.
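Under this terminology, a single trace record might be laid out as follows (a sketch; the field types and names are our assumptions):

#include <stdint.h>

/* One record of the collected execution trace (illustrative layout). */
typedef struct {
    uint32_t task;      /* task(ai): the task performing the access            */
    uint64_t inst;      /* inst(ai): instructions executed by task(ai) so far  */
    uint64_t clkc;      /* clkc(ai): cycles since task(ai) started             */
    uint32_t addr;      /* addr(ai): the accessed memory address               */
    uint32_t inst_addr; /* instruction(ai): the referencing instruction        */
} Access;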

Definition 3.1 For a sequence of memory accesses A = {a1, a2, ..., an}, where a1, a2, ..., an denote the individual accesses of A and all of them are performed by the same instruction, A is a sequential access if A satisfies the following conditions:

inst(ai) − inst(ai−1) = inst(ai+1) − inst(ai)    ...(1)
addr(ai) − addr(ai−1) = addr(ai+1) − addr(ai)    ...(2)
where 2 ≤ i ≤ n−1 and n ≥ 3.

An example of such an instruction is a memory-referencing instruction within a loop block. If A satisfies equation (1), it indicates that the number of instructions executed between any two contiguous memory accesses of A is the same. We denote the number of instructions between two contiguous memory accesses by

∆inst(A) if A satisfies equation (1). If A satisfies equation (2), it indicates that the address distance between any two contiguous memory accesses of A is the same. We denote the distance between two access targets by ∆addr(A) if A satisfies equation (2). ∆clkc(A) denotes the average number of clock cycles between two contiguous accesses of A and is calculated by formula (3.1):

∆clkc(A) = ( Σi=1..n−1 ( clkc(ai+1) − clkc(ai) ) ) / (n − 1)    ...(3.1)
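Combining Definition 3.1 with formula (3.1), a sketch of the sequentiality test over the Access records above could be written as follows; note that the sum in (3.1) telescopes, so the average reduces to the two end points:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Test Definition 3.1 on accesses a[0..n-1] of one instruction.  On
 * success, report ∆inst(A), ∆addr(A), and the average ∆clkc(A) of
 * formula (3.1); the sum in (3.1) telescopes to the end points.     */
bool is_sequential(const Access *a, size_t n,
                   int64_t *d_inst, int64_t *d_addr, double *d_clkc)
{
    if (n < 3)                       /* Definition 3.1 requires n >= 3 */
        return false;
    int64_t di = (int64_t)(a[1].inst - a[0].inst);
    int64_t da = (int64_t)a[1].addr - (int64_t)a[0].addr;
    for (size_t i = 2; i < n; i++) {
        if ((int64_t)(a[i].inst - a[i-1].inst) != di ||
            (int64_t)a[i].addr - (int64_t)a[i-1].addr != da)
            return false;            /* violates equation (1) or (2)   */
    }
    *d_inst = di;
    *d_addr = da;
    *d_clkc = (double)(a[n-1].clkc - a[0].clkc) / (double)(n - 1);
    return true;
}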

Now we describe our mechanism in detail. The proposed mechanism has two stages. The first stage of the analysis finds the memory-referencing instructions that perform sequential accesses in the execution trace. We predict that these instructions will continue to perform sequential accesses. Other types of access patterns are simply ignored, because most of them do not follow a deterministic pattern. This stage includes the following two steps. First, we extract all sequential accesses from the execution trace. Then, considering a memory-referencing instruction R that performs sequential accesses Ki, we predict that R will perform sequential accesses in the future if all ∆inst(Ki) have the same value and all ∆addr(Ki) have the same value. For convenience, we denote the value of ∆inst(Ki) for R as inst_step(R) and the value of ∆addr(Ki) for R as addr_step(R). For R, we also predict that the distance between two future accessed addresses will be addr_step(R), and that the number of clock cycles between two future accesses will be the average value of ∆clkc(Ki). For convenience, we denote this average value of ∆clkc(Ki) as clkc_step(R). We store the instruction address of R, clkc_step(R), and addr_step(R) as part of the hint.
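A sketch of this per-instruction check and of the resulting hint entry is given below; the record layout and function names are ours, not the thesis's:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hint entry stored for a sequential-access instruction R
 * (field names are illustrative).                          */
typedef struct {
    uint32_t inst_addr; /* instruction address of R */
    double   clkc_step; /* clkc_step(R)             */
    int64_t  addr_step; /* addr_step(R)             */
} SeqHint;

/* Given the per-sequence deltas of R's k sequential accesses Ki,
 * emit a hint only when all ∆inst(Ki) agree and all ∆addr(Ki) agree;
 * clkc_step(R) is the average of the ∆clkc(Ki).                     */
bool make_hint(uint32_t inst_addr,
               const int64_t *d_inst, const int64_t *d_addr,
               const double *d_clkc, size_t k, SeqHint *out)
{
    if (k == 0)
        return false;
    double clkc_sum = d_clkc[0];
    for (size_t i = 1; i < k; i++) {
        if (d_inst[i] != d_inst[0] || d_addr[i] != d_addr[0])
            return false;           /* R's behavior is not predictable */
        clkc_sum += d_clkc[i];
    }
    out->inst_addr = inst_addr;
    out->addr_step = d_addr[0];
    out->clkc_step = clkc_sum / (double)k;
    return true;
}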

In the second stage, we attempt to find the memory-allocating instructions that allocate memory blocks for the memory-referencing instructions that perform sequential accesses. We predict that these memory-allocating instructions will still perform memory allocation for those memory-referencing instructions. This is done by comparing the accessed targets of the memory-referencing instructions with the memory blocks returned by the memory-allocating instructions. If such a correspondence is found between a memory-referencing instruction R and a memory-allocating instruction L, we predict that L will still allocate memory for R in the future. These referencing-allocating relations are stored as part of the hint. The algorithm of this mechanism is shown in Figure 3.5.
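As an illustration of this matching, assuming the profiler also logs each allocation as an (instruction, base, size) record, the referencing-allocating relation can be recovered with a simple interval lookup; the AllocRec layout is our assumption, since the thesis does not specify it:

#include <stddef.h>
#include <stdint.h>

/* One logged allocation event (assumed to be available from profiling;
 * the record layout is illustrative).                                  */
typedef struct {
    uint32_t alloc_inst; /* address of the allocating instruction L */
    uint32_t base;       /* start address of the allocated block    */
    uint32_t size;       /* block size in bytes                     */
} AllocRec;

/* Return the allocating instruction whose block contains addr,
 * or 0 when no logged allocation covers the address.           */
uint32_t find_allocator(const AllocRec *blocks, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= blocks[i].base &&
            addr - blocks[i].base < blocks[i].size)
            return blocks[i].alloc_inst;
    return 0;
}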
