
Chapter 3 Hint-aided Cache Contention Avoidance Technique

3.4 Hint evaluation

In the previous phase, we collected the hint, which includes information about how a task may use memory. In this phase, we predict the future memory usage of tasks when they leave cores by combining the hint with the task execution status. We first define some symbols used in this phase. Considering a task Ti, we use GoingAccess(Ti) to represent the set of memory addresses which may be accessed by the task Ti in the next time slice. StackTop(Ti) is the address of the top of the call stack.

HeapUsageProfiling()
  SeqAccess ← NIL // here we store the result of the 1st stage
  // 1st stage
  for each memory-accessing event e // collect all memory accesses
    do SeqAccess[instruction(e)] ← SeqAccess[instruction(e)] ∪ {e}
  for each instruction R where SeqAccess[R] exists // remove those that are not sequential accesses
    do B ← SeqAccess[R]
       if ∃L: L = {K0, K1, ..., Kn} ⊆ B where the accesses in L have a constant address step and a constant clock-cycle step
         then addr_step[R] ← addr(K1) − addr(K0)
              clkc_step[R] ← clock(K1) − clock(K0)
         else remove SeqAccess[R]
  // 2nd stage
  for each memory-allocating event e // check all memory allocations
    do if ∃a, R: a ∈ SeqAccess[R] where allocation_start(e) ≤ addr(a) and allocation_end(e) ≥ addr(a)
         then allocating[R] ← allocating[R] ∪ {instruction(e)}
  // finished, store hint
  Store Hint_H as the following set of vectors:
    for each R where SeqAccess[R] exists:
      <R, address[R], allocating[R], clkc_step[R], addr_step[R]>

Figure 3.5 The algorithm of the heap usage profiling.
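The profiling procedure of Figure 3.5 can be sketched in Python. This is a minimal sketch, not the thesis implementation: event tuples and field names are illustrative assumptions.

```python
from collections import defaultdict

def heap_usage_profiling(access_events, alloc_events):
    """Sketch of Figure 3.5: find memory-referencing instructions whose
    accesses form a regular sequence, and link them to allocation sites.
    access_events: (instruction, address, clock); alloc_events: (instruction, start, end)."""
    # 1st stage: group memory-access events by the instruction that issued them
    seq_access = defaultdict(list)
    for e in access_events:
        seq_access[e[0]].append(e)

    hint = {}
    for instr, events in seq_access.items():
        events.sort(key=lambda e: e[2])  # order by clock cycle
        if len(events) < 2:
            continue
        addr_steps = {b[1] - a[1] for a, b in zip(events, events[1:])}
        clkc_steps = {b[2] - a[2] for a, b in zip(events, events[1:])}
        # keep only instructions whose accesses are sequential:
        # constant address step and constant clock-cycle step
        if len(addr_steps) == 1 and len(clkc_steps) == 1:
            hint[instr] = {"addr_step": addr_steps.pop(),
                           "clkc_step": clkc_steps.pop(),
                           "allocating": set()}

    # 2nd stage: link each sequential instruction to the allocating
    # instructions whose blocks it touched
    for alloc in alloc_events:
        for instr, entry in hint.items():
            if any(alloc[1] <= e[1] <= alloc[2] for e in seq_access[instr]):
                entry["allocating"].add(alloc[0])
    return hint
```

Run on the A0Ti example of Figure 3.6 (instruction 0x268 striding by 4 bytes every 96 cycles over a block allocated at 0xEC), this reproduces the stored hint entry.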

BjTi    code(BjTi)                      exp_time(BjTi)  static(BjTi)                    stack(BjTi)  next_bb(BjTi)  call_bb(BjTi)
B0Ti    {0xC0, 0xC1, ..., 0xDB}        7               -                               32           B1Ti           -
B1Ti    {0xDC, 0xDD, ..., 0xDF}        1               -                               0            B2Ti           B5Ti
B2Ti    {0xE0, 0xE1, ..., 0xFB}        7               -                               28           B3Ti           -
B3Ti    {0xFC, 0xFD, ..., 0x278}       96              {0x1000, 0x1001, ..., 0x101F}   32           B4Ti           -
B4Ti    {0x279, 0x27A, ..., 0x297}     8               -                               32           -              -
B5Ti    {0x60, 0x61, ..., 0x87}        10              -                               20           B6Ti           -
B6Ti    {0x88, 0x89, ..., 0x8B}        1               -                               0            B7Ti           B8Ti
B7Ti    {0x8C, 0x8D, ..., 0xBB}        12              -                               20           -              -
B8Ti    {0x2B1, 0x2B2, ..., 0x2FC}     19              -                               28           -              -

(a)

[Binary tree over the basic blocks B0Ti through B8Ti, with next_bb edges as siblings and call_bb edges as children.]

(b)

AjTi    address(AjTi)   allocating(AjTi)   addr_step(AjTi)   clkc_step(AjTi)
A0Ti    0x268           0xEC               4                 96

(c)

Figure 3.6 An example of the hint. (a) Hint_B(Ti). (b) The binary tree representation of Hint_B(Ti). (c) Hint_H(Ti).

Hint_B(Ti) is the hint of Ti which is generated by the basic block profiling. BjTi is one of the basic blocks included in Hint_B(Ti), where 1 ≤ j ≤ M and M is the number of basic blocks included in Hint_B(Ti). Hint_H(Ti) is the hint of Ti which is generated by the heap usage profiling. AkTi is one of the hint entries included in Hint_H(Ti), where 1 ≤ k ≤ Q and Q is the number of entries included in Hint_H(Ti). Each hint entry represents one of the memory-referencing instructions which is predicted to perform sequential accesses. For convenience, we call such instructions hint-covered memory-referencing instructions. address(AkTi) represents the address of AkTi. allocating(AkTi) is the address of the memory-allocating instruction which allocates memory blocks for AkTi. addr_step(AkTi) is the distance between two consecutively accessed addresses. clkc_step(AkTi) is the number of clock cycles between two consecutive accesses. Figure 3.6 shows an example of the hint. For convenience, we denote the memory-referencing instruction located at address(AkTi) as instruction(AkTi).
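The two hint structures can be mirrored by simple record types. This is an illustrative sketch only; the field names follow the thesis notation, but the types are assumptions.

```python
from dataclasses import dataclass

@dataclass
class HintB:
    """One basic-block entry of Hint_B(Ti), as in Figure 3.6(a)."""
    code: range                       # instruction addresses of the block
    exp_time: int                     # expected execution time in clock cycles
    static: "range | None"            # statically allocated data the block touches
    stack: int                        # call-stack bytes used by the block
    next_bb: "HintB | None" = None    # sibling in the traversal order
    call_bb: "HintB | None" = None    # child entered by a call

@dataclass
class HintH:
    """One entry of Hint_H(Ti), as in Figure 3.6(c)."""
    address: int       # address of the memory-referencing instruction
    allocating: int    # address of its memory-allocating instruction
    addr_step: int     # address distance between two consecutive accesses
    clkc_step: int     # clock cycles between two consecutive accesses

# the A0Ti entry and its enclosing basic block B3Ti from Figure 3.6
a0 = HintH(address=0x268, allocating=0xEC, addr_step=4, clkc_step=96)
b3 = HintB(code=range(0xFC, 0x279), exp_time=96,
           static=range(0x1000, 0x1020), stack=32)
```

Note that address(A0Ti) = 0x268 falls inside code(B3Ti), which is how hint entries are mapped to basic blocks in the first stage below.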

The hint evaluation has three stages. In the first stage, we predict the number of clock cycles that will be used by the hint-covered memory-referencing instructions to access the dynamically allocated memory blocks. The prediction result is used to adjust the expected execution time of a basic block, which was estimated in the hint generation phase. In the hint generation phase, we do not know how large the dynamically allocated memory blocks will be; therefore, it is impossible to predict at that point how many clock cycles will be required for accessing them.

However, in this phase, we can retrieve the memory allocation result from the memory allocation information maintained by the operating system. Therefore, we can now estimate the number of clock cycles required for accessing the dynamically allocated memory blocks. Considering a task Ti and a hint entry AkTi included in Hint_H(Ti), the prediction is done in two steps. First, we estimate how many times instruction(AkTi) will be executed. This estimation is made by dividing the size of the allocated memory block by addr_step(AkTi), because we predict that instruction(AkTi) will perform sequential accesses in which the address distance between two contiguous accesses is addr_step(AkTi). If multiple memory blocks are allocated for instruction(AkTi) to access, we perform the estimation with the size of the largest one for safety, because we do not know which one will be accessed. Then, we multiply the estimation result from the previous step by clkc_step(AkTi) to make the prediction. We implement this two-step prediction mechanism with formula 3.2. In this formula, MaxAllocSize(allocating(AkTi)) denotes the maximum size among the memory blocks allocated by allocating(AkTi). If there is no memory block allocated by allocating(AkTi), the value of MaxAllocSize(allocating(AkTi)) is zero.

dynAccessClk(AkTi) = (MaxAllocSize(allocating(AkTi)) / addr_step(AkTi)) × clkc_step(AkTi) ...(3.2)

Then, the predicted number of clock cycles is used to adjust the expected execution time of basic blocks. Considering AkTi as one of the entries in Hint_H(Ti) and its corresponding basic block BjTi, the value of exp_time(BjTi) is adjusted by adding dynAccessClk(AkTi) to it. If multiple entries in Hint_H(Ti) are mapped to a single basic block, we only add the maximum of their predicted clock cycle counts. The value of exp_time(BjTi) will be restored after this phase finishes. An example of this adjustment is shown in Figure 3.7, which is the adjustment result of the example in Figure 3.6. The memory allocation result for A0Ti is shown in Figure 3.7(a). The corresponding basic block of A0Ti is B3Ti because address(A0Ti) is included in code(B3Ti). The adjustment result of the expected execution time is shown in Figure 3.7(b), where exp_time(B3Ti) is adjusted by adding the calculation result of formula 3.2: (16 / 4) × 96 = 384 clock cycles, raising exp_time(B3Ti) from 96 to 480.
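Formula 3.2 and the adjustment can be checked numerically against the values of Figures 3.6 and 3.7. A minimal sketch (the function name follows the thesis notation; integer division is an assumption for partial strides):

```python
def dyn_access_clk(max_alloc_size, addr_step, clkc_step):
    """Formula 3.2: predicted clock cycles spent by a hint-covered
    instruction accessing its dynamically allocated block.
    Returns 0 when nothing was allocated."""
    if max_alloc_size == 0:
        return 0
    # estimated executions = block size / address step,
    # then multiply by the cycles between two consecutive accesses
    return (max_alloc_size // addr_step) * clkc_step

# A0Ti: allocating instruction 0xEC allocated blocks of 16 and 12 bytes;
# the larger one (16) is used for safety
extra = dyn_access_clk(max_alloc_size=16, addr_step=4, clkc_step=96)
adjusted = 96 + extra  # exp_time(B3Ti) after the adjustment of Figure 3.7(b)
```

With these inputs the prediction is 4 executions × 96 cycles = 384, so exp_time(B3Ti) grows from 96 to 480, matching Figure 3.7(b).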

In the second stage, we predict the memory addresses that will be accessed by the task Ti in the upcoming allocated time slice. These predicted addresses are converted into the predicted cache set usage in the next stage. In this stage, we first find the basic block that corresponds to the current execution point of the task. Then, we start a depth-first traversal from that basic block. That is, for every visited basic block BjTi, we first visit its child node call_bb(BjTi) and then its sibling node next_bb(BjTi). For every visited basic block BjTi, we copy the values in code(BjTi) and static(BjTi) into GoingAccess(Ti). We also add the addresses between StackTop(Ti) and StackTop(Ti) + stack(BjTi) into GoingAccess(Ti); these added addresses predict the usage of the stack partition. For hint-covered memory-referencing instructions AkTi whose corresponding basic blocks are traversed, the addresses of the memory blocks allocated by allocating(AkTi) are also added into GoingAccess(Ti) to predict the usage of the heap part. The traversal stops when the sum of the expected execution times of all traversed basic blocks is larger than the time slice.

allocating(AjTi)   start        end          size
0xEC               0x010000F0   0x010000FF   16
0xEC               0x01000100   0x0100010A   12

(a)

BjTi    code(BjTi)                      exp_time(BjTi)  static(BjTi)                    stack(BjTi)  next_bb(BjTi)  call_bb(BjTi)
B0Ti    {0xC0, 0xC1, ..., 0xDB}        7               -                               32           B1Ti           -
B1Ti    {0xDC, 0xDD, ..., 0xDF}        1               -                               0            B2Ti           B5Ti
B2Ti    {0xE0, 0xE1, ..., 0xFB}        7               -                               28           B3Ti           -
B3Ti    {0xFC, 0xFD, ..., 0x278}       480             {0x1000, 0x1001, ..., 0x101F}   32           B4Ti           -
B4Ti    {0x279, 0x27A, ..., 0x297}     8               -                               32           -              -
B5Ti    {0x60, 0x61, ..., 0x87}        10              -                               20           B6Ti           -
B6Ti    {0x88, 0x89, ..., 0x8B}        1               -                               0            B7Ti           B8Ti
B7Ti    {0x8C, 0x8D, ..., 0xBB}        12              -                               20           -              -
B8Ti    {0x2B1, 0x2B2, ..., 0x2FC}     19              -                               28           -              -

(b)

Figure 3.7 An example of the expected execution time adjustment of basic blocks. (a) The allocation result of the memory-allocating instruction at 0xEC. (b) The adjusted Hint_B(Ti).

HintEvaluation(Ti)
  GoingAccess ← Ø
  EstimatedDynAccessClk ← Ø
  // 1st stage
  for each entry Ak in Hint_H(Ti)
    do tmp ← dynAccessClk(Ak)
       Locate hint entry Bj from Hint_B(Ti) such that address(Ak) ∈ code(Bj)
       if EstimatedDynAccessClk[Bj] does not exist or EstimatedDynAccessClk[Bj] < tmp
         then EstimatedDynAccessClk[Bj] ← tmp
  for each Bj where EstimatedDynAccessClk[Bj] exists
    do exp_time(Bj) ← exp_time(Bj) + EstimatedDynAccessClk[Bj]
  // 2nd stage
  Locate hint entry Bpc from Hint_B(Ti) such that ProgramCounter ∈ code(Bpc)
  HintEvaluation_stage2(Bpc, 0, StackTop(Ti)) // predict target addresses of memory accesses
  // 3rd stage
  Convert the memory addresses included in GoingAccess into the cache set usage according to the cache configuration of the system.
  Output the converted result as the predicted cache set usage of Ti.

Figure 3.8 The algorithm of the hint evaluation phase.
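The first-stage adjustment in Figure 3.8 (keep only the maximum predicted cycle count per basic block, then add it to exp_time) can be sketched as follows. The dictionary layout is an illustrative assumption; alloc_clk stands for a precomputed dynAccessClk value.

```python
def adjust_exp_time(hint_b, hint_h):
    """First stage of Figure 3.8.
    hint_b: {block_id: {"code": range, "exp_time": int}}
    hint_h: list of {"address": int, "alloc_clk": int}, where alloc_clk
    is the dynAccessClk prediction of formula 3.2 for that entry."""
    estimated = {}
    for entry in hint_h:
        # locate the basic block whose code range contains the instruction
        bj = next(b for b, v in hint_b.items() if entry["address"] in v["code"])
        # when several entries map to one block, keep only the maximum
        estimated[bj] = max(estimated.get(bj, 0), entry["alloc_clk"])
    for bj, extra in estimated.items():
        hint_b[bj]["exp_time"] += extra
    return hint_b
```

Applied to the running example (A0Ti at 0x268 with a 384-cycle prediction, B3Ti covering 0xFC through 0x278), exp_time(B3Ti) becomes 480 as in Figure 3.7(b).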

In the third stage, the predicted memory addresses are converted into the predicted cache set usage. Therefore, in the task scheduling phase, we can attempt to avoid cache contention by not concurrently scheduling tasks which use the same cache sets on cores. Considering an m-set cache and a task Ti, the predicted cache accesses of Ti are represented as a bit vector <C1Ti, C2Ti, ..., CmTi>, where CbTi represents the predicted usage of the bth cache set. The bth cache set is predicted to be used if a memory address included in GoingAccess(Ti) maps to it. If the bth cache set is predicted to be used, CbTi is set to 1; otherwise, CbTi is set to 0. The algorithm of this phase is shown in Figure 3.8 and Figure 3.9.

HintEvaluation_stage2(Bx, used_time, stack_top)
  max_stack_offset ← 0
  basic_block_pointer ← Bx
  // compute the maximum call stack offset
  while NIL ≠ next_bb(basic_block_pointer)
    do if max_stack_offset < stack(basic_block_pointer)
         then max_stack_offset ← stack(basic_block_pointer)
       basic_block_pointer ← next_bb(basic_block_pointer)
  basic_block_pointer ← Bx
  // traverse the basic blocks and make the prediction
  while used_time < TimeSlice and NIL ≠ next_bb(basic_block_pointer)
    do if NIL ≠ call_bb(basic_block_pointer)
         then used_time ← HintEvaluation_stage2(call_bb(basic_block_pointer), used_time, stack_top + max_stack_offset)
       used_time ← used_time + exp_time(basic_block_pointer)
       GoingAccess ← GoingAccess ∪ code(basic_block_pointer) ∪ static(basic_block_pointer) // code and static parts
       if ∃Ak in Hint_H(Ti): address(Ak) ∈ code(basic_block_pointer) // heap part
         then GoingAccess ← GoingAccess ∪ allocated_memories(allocating(Ak))
       for ∀s: s ≥ stack_top and s < stack_top + max_stack_offset // stack part
         do GoingAccess ← GoingAccess ∪ {s}
       basic_block_pointer ← next_bb(basic_block_pointer)
  return used_time

Figure 3.9 The algorithm of the hint evaluation phase. (cont.)
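The third-stage conversion depends only on the cache geometry. Assuming a physically indexed cache with the usual set-index function (the 64-set, 32-byte-line configuration here is illustrative, not from the thesis):

```python
def to_cache_set_usage(going_access, num_sets=64, line_size=32):
    """Map each predicted address in GoingAccess to its cache set and
    emit the bit vector <C1, ..., Cm> of the third stage."""
    usage = [0] * num_sets
    for addr in going_access:
        # standard set index: drop the line offset, then take modulo m
        usage[(addr // line_size) % num_sets] = 1
    return usage

def contend(usage_a, usage_b):
    """Two tasks are predicted to contend only if their cache set
    usage vectors overlap in at least one set."""
    return any(a and b for a, b in zip(usage_a, usage_b))
```

The scheduler sketched in the next phase would prefer to co-schedule tasks for which contend() is False, since their predicted working sets map to disjoint cache sets.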
