Task scheduling - Hint-aided Cache Contention Avoidance Technique

Chapter 3 Hint-aided Cache Contention Avoidance Technique

3.5 Task scheduling

In the previous phase, the cache set usage of each task was predicted. In this phase, we group the tasks in the global dispatch queue into small gangs according to their predicted cache usage and store the gangs in the gang queue for future scheduling. When idle cores activate the scheduler, it randomly picks one gang from the gang queue and assigns the tasks in that gang to the cores. Each gang contains no more tasks than there are cores in the system. The number of gangs is given by the following formula, in which TaskCount denotes the number of tasks in the system and CoreCount denotes the number of cores in the system.

$$\text{GangCount} = \left\lceil \frac{\text{TaskCount}}{\text{CoreCount}} \right\rceil \tag{3.3}$$

Before describing the detailed mechanism of this phase, we first introduce the formulas and terminology used in it. Formula 3.4 predicts the number of cache contentions between two tasks.

PredictedCacheContention T_i, T_j=

∑

k=1 m

C^T_kⁱ×C^T_k^j ... (3.4)

In this formula, $T_i$ and $T_j$ are two tasks and $m$ is the number of cache sets. $C_k^{T_i}$ and $C_k^{T_j}$ indicate whether $T_i$ and $T_j$, respectively, are predicted to use the $k$-th cache set, as described in the previous section. For two tasks $T_i$ and $T_j$, we multiply $C_k^{T_i}$ by $C_k^{T_j}$ to check whether both tasks are predicted to use the $k$-th cache set. If they are, both $C_k^{T_i}$ and $C_k^{T_j}$ are one, so the product is one, which indicates one predicted cache contention. If neither task, or only one of them, is predicted to use the $k$-th cache set, at least one of $C_k^{T_i}$ and $C_k^{T_j}$ is zero, so the product is zero, which indicates no cache contention. Summing the products over all $m$ cache sets gives the number of predicted cache contentions between $T_i$ and $T_j$. Furthermore, formula 3.5 predicts the number of cache contentions between a task and the tasks of a gang.

TaskGangCacheContentionT_i,G_x=

∑

∀ T_j∈G_x

PredictedCacheContention T_i, T_j ... (3.5) In this formula, Ti is a task and Gx is a gang. The number of cache contentions between Ti and tasks of Gx is predicted by summing the number of predicted cache contentions between Ti and each task included in Gx. The number of predicted cache contentions between two tasks can be got by applying formula 3.4. We use formula 3.5 to see if a task and a gang are perfect matching or not. Considering a task Ti and a gang Gx, if Ti and Gx are perfect matching, we can assign Ti into Gx without introducing any predicted cache contentions with other tasks within Gx. We say that Ti and Gx are perfect matching if the value of TaskGangCacheContention(Ti, Gx) is zero. Otherwise, we say that Ti and Gx are not perfect matching.

There are two stages in our gang grouping mechanism. In the first stage, the tasks with the largest numbers of predicted used cache sets are distributed into different gangs, which reduces the likelihood of cache contention. We first sort the tasks in decreasing order of the number of predicted used cache sets and then distribute them into gangs. The first gang is created by assigning the task with the most predicted used cache sets to an empty gang. The remaining tasks are then considered one by one in sorted order. A task $T_i$ is assigned to a gang $G_x$ if $T_i$ and $G_x$ are a perfect match. If no such gang exists and the number of created gangs is less than GangCount, a new gang is created and $T_i$ is assigned to it. If no such gang exists and the number of created gangs equals GangCount, the assignment of $T_i$ is left to the next stage. Figure 3.10 shows an example of this stage, assuming a system with an 8-set L2 cache and two cores. There are 8 tasks in the system; therefore, the value of GangCount is 2.

Figure 3.10 An example of the first stage of gang grouping. (a) The predicted cache usage of tasks. (b) The sorted tasks.
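The first stage can be sketched in Python as follows. This is an illustrative reading of the prose description, not the algorithm of Figure 3.12 verbatim: it scans every created gang for a perfect match, and it skips gangs that already hold CoreCount tasks, which is an assumption made here to honor the gang size limit. The function names and the four-task data set are hypothetical.

```python
import math


def contention(c_i, c_j):
    """Formula 3.4 on 0/1 usage vectors."""
    return sum(a * b for a, b in zip(c_i, c_j))


def gang_contention(task, gang, usage):
    """Formula 3.5: contention between `task` and all tasks in `gang`."""
    return sum(contention(usage[task], usage[other]) for other in gang)


def first_stage(usage, core_count):
    """Return (gangs, deferred) after the first grouping stage.

    usage: dict mapping a task name to its 0/1 cache set usage vector.
    """
    gang_count = math.ceil(len(usage) / core_count)  # formula 3.3
    # Sort by number of predicted used sets, largest first (stable on ties).
    order = sorted(usage, key=lambda t: sum(usage[t]), reverse=True)
    gangs, deferred = [], []
    for task in order:
        for gang in gangs:
            # Assumption: a full gang is not considered a candidate.
            if len(gang) < core_count and gang_contention(task, gang, usage) == 0:
                gang.append(task)  # perfect match found
                break
        else:  # no perfect match among the created gangs
            if len(gangs) < gang_count:
                gangs.append([task])   # open a new gang
            else:
                deferred.append(task)  # left to the second stage
    return gangs, deferred


# Hypothetical example: 4 cache sets, 2 cores, 4 tasks (GangCount = 2).
usage = {
    "A": [1, 1, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 0, 0, 1],
    "D": [0, 1, 1, 0],
}
gangs, deferred = first_stage(usage, core_count=2)
print(gangs)     # [['A', 'C'], ['B']]
print(deferred)  # ['D']
```

In this example, D overlaps with both A and B, so it has no perfect match and, because both gangs already exist, it is deferred to the second stage.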

After the first stage, a task may remain unassigned if it is not a perfect match with any created gang and the number of created gangs equals GangCount. We distribute these remaining tasks into gangs in the second stage, where they are assigned one by one according to the number of predicted used cache sets. Each remaining task is greedily assigned to the gang in which it creates the lowest number of predicted cache contentions with the tasks already there. We expect the overall assignment to cause the least number of cache contentions, because each assignment introduces the least number of predicted cache contentions.

Figure 3.11 An example of the second stage of gang grouping. (a) The assignment of $T_6$, where TaskGangCacheContention($T_6$, $G_1$) = 3 and TaskGangCacheContention($T_6$, $G_2$) = 3. (b) The final result of gang grouping.

TaskScheduling()
            then G[created_gang_id] ← G[created_gang_id] ∪ {Ti}
                 if CoreCount = |G[created_gang_id]|
                     then created_gang_id ← created_gang_id + 1
                          G[created_gang_id] ← Ø
                 assigned[Ti] ← TRUE
            else if (GangCount - 1) = created_gang_id
                     then assigned[Ti] ← FALSE
                     else created_gang_id ← created_gang_id + 1
                          G[created_gang_id] ← {Ti}
                          assigned[Ti] ← TRUE
    // 2nd stage
    for i ← 1 to (TaskCount - 1)
        do if FALSE = assigned[Ti]    // assign remaining tasks only
            then candidate_gang ← 0
                 current_contention ← TaskGangCacheContention(Ti, G[0])
                 for j ← 1 to (created_gang_id - 1)    // look for gang w/ least contention
                     do tmp ← TaskGangCacheContention(Ti, G[j])
                        if tmp < current_contention
                            then candidate_gang ← j
                                 current_contention ← tmp
                 G[candidate_gang] ← G[candidate_gang] ∪ {Ti}

Figure 3.12 The algorithm of the task scheduling phase.

For each remaining task $T_i$, we first calculate the number of predicted cache contentions between $T_i$ and every existing gang. For a gang $G_x$, this number is calculated by formula 3.5. $T_i$ is then assigned to the gang with the smallest number of predicted cache contentions with $T_i$. If several gangs have the same number of predicted cache contentions with $T_i$, the gang with fewer tasks is selected. Figure 3.11 shows the second stage of gang grouping for the example illustrated in Figure 3.10. Figure 3.11(a) shows the assignment of $T_6$: the number of predicted cache contentions between $T_6$ and each of the two gangs is the same, but $G_2$ has fewer tasks, so $T_6$ is assigned to $G_2$. Figure 3.11(b) shows the final result of the gang grouping. The algorithm of this phase is shown in Figure 3.12.
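The greedy second stage, including the tie-break on gang size, can be sketched in Python as below. The sketch makes two assumptions that are not spelled out in the text: the remaining tasks are processed in decreasing order of predicted used cache sets, as in the first stage, and gangs that already hold CoreCount tasks are excluded from the candidates. The function names and the data are hypothetical and do not reproduce Figure 3.11.

```python
def contention(c_i, c_j):
    """Formula 3.4 on 0/1 usage vectors."""
    return sum(a * b for a, b in zip(c_i, c_j))


def gang_contention(task, gang, usage):
    """Formula 3.5: contention between `task` and all tasks in `gang`."""
    return sum(contention(usage[task], usage[other]) for other in gang)


def second_stage(gangs, remaining, usage, core_count):
    """Greedily place each remaining task; `gangs` is modified in place."""
    # Assumption: same decreasing order as in the first stage.
    for task in sorted(remaining, key=lambda t: sum(usage[t]), reverse=True):
        # Assumption: only gangs with a free slot are candidates. With
        # GangCount gangs from formula 3.3, at least one always has room.
        candidates = [g for g in gangs if len(g) < core_count]
        # Lowest predicted contention wins; ties go to the gang with fewer tasks.
        best = min(candidates, key=lambda g: (gang_contention(task, g, usage), len(g)))
        best.append(task)
    return gangs


# Hypothetical example: 4 cache sets, 3 cores, two gangs left by the first stage.
usage = {
    "A": [1, 0, 0, 0],
    "B": [0, 1, 0, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 1, 1, 1],
}
gangs = [["A", "B"], ["C"]]
result = second_stage(gangs, ["D"], usage, core_count=3)
print(result)  # D conflicts twice with either gang; the smaller gang wins
```

Here D has two predicted contentions with each gang, so the tie is broken in favor of the gang with fewer tasks, which mirrors the way $T_6$ is placed in the example of Figure 3.11(a).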

So far, we have introduced the essence of our mechanism. In the next chapter, we evaluate its performance and compare it with that of other mechanisms.

Chapter 4 Preliminary Performance

From the document: A Dynamic Task Scheduling Method for Reducing Cache Contention on Chip Multiprocessor Systems (pages 39-45)