
The advances in integrated circuit processing technologies increase transistor density and allow more microprocessor design options[1-3]. The chip multiprocessing (CMP) architecture is one of the microprocessor designs that attempt to utilize this increased integration[4, 5]. A typical chip multiprocessor consists of a set of identical cores, and each core has its own execution resources such as the ALU, FPU, L1 caches, register file, and control logic. The L2 cache and the lower levels of the memory hierarchy are shared by these cores[6-8]. By taking advantage of thread-level parallelism, chip multiprocessors can achieve better performance-per-watt scalability with advances in integrated circuit technologies than single-core processors. This makes chip multiprocessors a promising microprocessor design for emerging high-performance and power-efficient computing. Besides, sharing the L2 cache allows high cache utilization and avoids duplicating cache hardware resources. However, cache sharing may cause cache contentions among executed tasks[9]. Because the L2 cache is shared among all executed tasks, a data block loaded by one task may be replaced by a data block loaded by another task. The task which loses its data block will experience a cache miss if it accesses the evicted data again, whereas this miss would not occur in a single-core processor environment. Such an extra L2 miss is called a cache contention, and it causes the processor to fetch data from the lower memory hierarchy. Fetching data from the lower memory hierarchy usually takes more time than fetching directly from the higher memory hierarchy, hence it lengthens the task execution time[10-12]. In short, cache contentions degrade the performance of chip multiprocessor systems by causing extra L2 cache misses and lengthening the execution time of tasks.
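To make the notion of a cache contention concrete, the following minimal sketch (purely illustrative, and unrelated to the simulator described later) models a single 2-way set-associative cache set with LRU replacement. It shows a miss that arises only because two tasks share the set: running alone, task A would have hit.

```python
# Minimal sketch of a contention miss in a shared cache set.
# One set, 2-way associative, LRU replacement.
from collections import OrderedDict

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, ordered LRU to MRU

    def access(self, tag):
        """Return True on a hit, False on a miss (block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # refresh LRU position
            return True
        if len(self.blocks) >= self.ways:  # evict the least recently used
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return False

s = CacheSet(ways=2)
s.access("A1")                   # task A: cold miss
s.access("B1"); s.access("B2")   # task B fills the set, evicting A1
print(s.access("A1"))            # False: a contention miss -- task A
                                 # would have hit if it ran alone
```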

In order to reduce cache misses caused by cache contentions, many techniques have been proposed in recent years[10-12, 14, 15]. We classify these techniques into two categories: one is cache partitioning[10, 11] and the other is operating system scheduling[12, 14, 15]. The key idea of the cache partitioning approach is to partition cache blocks into groups, each of which is allocated to an executed task. During execution, the number of blocks in a group may be changed to fit the cache needs of tasks. Cache contentions can be completely avoided if all groups are disjoint; however, groups may be allowed to overlap to increase flexibility.
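As an illustration of the partitioning idea, the sketch below models way-based partitioning, one common realization: each task may only replace blocks in the ways allocated to it, so disjoint allocations eliminate contentions entirely. The task identifiers and the allocation table are hypothetical, not taken from the cited schemes.

```python
# Sketch of way-based cache partitioning for one cache set.
# A task may hit in any way but may only *replace* blocks in its own ways.

class PartitionedSet:
    def __init__(self, ways, alloc):
        # alloc: task id -> list of way indices the task may replace
        self.lines = [None] * ways        # each entry: (task, tag) or None
        self.alloc = alloc
        self.age = [0] * ways             # smaller age = older (for LRU)
        self.clock = 0

    def access(self, task, tag):
        self.clock += 1
        for w, line in enumerate(self.lines):
            if line is not None and line[1] == tag:
                self.age[w] = self.clock  # hit: refresh recency
                return True
        # miss: pick the least recently used way among this task's ways
        victim = min(self.alloc[task], key=lambda w: self.age[w])
        self.lines[victim] = (task, tag)
        self.age[victim] = self.clock
        return False

s = PartitionedSet(ways=4, alloc={"A": [0, 1], "B": [2, 3]})
s.access("A", "x")
s.access("B", "y"); s.access("B", "z"); s.access("B", "w")  # B cannot evict x
print(s.access("A", "x"))  # True: A's block survived B's traffic
```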

The operating system scheduling approach attempts to avoid cache contentions by separately scheduling tasks which may use the same cache sets. A mechanism to predict cache set usage is required for this approach because the scheduling decisions have to be made before the tasks actually execute on the cores. With the cache partitioning approach, tasks may still suffer from cache contentions if all concurrently executed tasks frequently access memory and cause large overlaps among groups. The operating system scheduling approach can resolve this by separately scheduling the tasks which are predicted to use the same cache sets.
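The following sketch illustrates the scheduling idea: each task's predicted cache set usage is represented as a bit vector with one bit per cache set, and tasks are paired so that concurrently scheduled tasks overlap as little as possible. The greedy pairing policy and the example bitmasks are assumptions for illustration, not the policy proposed in this thesis.

```python
# Sketch: co-schedule tasks whose predicted cache set usage overlaps least.
from itertools import combinations

def overlap(u, v):
    return bin(u & v).count("1")   # number of cache sets both tasks use

def pair_tasks(usage):
    """usage: task id -> int bitmask of predicted cache sets."""
    tasks, pairs = set(usage), []
    while len(tasks) > 1:
        # greedily pick the remaining pair with the smallest set overlap
        a, b = min(combinations(tasks, 2),
                   key=lambda p: overlap(usage[p[0]], usage[p[1]]))
        pairs.append((a, b))
        tasks -= {a, b}
    return pairs

# four tasks on a dual-core CMP; an 8-set cache for readability
usage = {"t0": 0b11110000, "t1": 0b11000011,
         "t2": 0b00001111, "t3": 0b00111100}
print(pair_tasks(usage))   # pairs tasks with disjoint predicted sets
```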

In this thesis, we propose an effective task scheduling method, called Hint-aided Cache Contention Avoidance (HCCA), to reduce the number of cache contentions in chip multiprocessor systems. HCCA consists of three phases: hint generation, hint evaluation, and task scheduling. Like previous methods, HCCA first predicts the cache set usage of tasks. Then, it attempts to minimize the cache contentions among concurrently executed tasks by separately scheduling tasks that use the same cache sets.

Previous task scheduling methods usually predict the cache set usage of a task according to its previous usage. However, because cache set usage may change during the execution of a task, predictions based on previous usage may fail to capture these changes. Instead, we make cache set usage predictions according to information extracted from the corresponding binary image of the task. The binary image contains an ordered set of machine instruction codes which instruct the processor to accomplish the task; therefore, it directly affects the behavior of the task. However, analyzing the binary image takes an unacceptably long time for task scheduling. We resolve this by first generating an abstract of a binary image, which we call a hint, before running the task; this phase is called hint generation. While executing the tasks, we make the cache set usage predictions according to the hints; this phase is called hint evaluation. Then, we make the scheduling decisions according to the cache set usage predictions; this phase is called task scheduling. In summary, the hint generation phase generates hints from binary images, the hint evaluation phase makes cache set usage predictions according to the hints, and the task scheduling phase makes scheduling decisions according to the predicted cache set usage.
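The sketch below outlines how the three phases could fit together, under the simplifying assumption that a hint is a bitmask of the cache sets that a binary's memory-reference instructions may touch. The toy instruction list, the cache geometry, and the shared-set metric are all illustrative; the actual hint format and evaluation mechanism are described in Chapter 3.

```python
# Highly simplified sketch of the hint pipeline. Real binaries would need a
# disassembler; here "instructions" is a toy list of (opcode, address) pairs
# standing in for the decoded binary image.

SETS, LINE = 1024, 64          # assumed L2 geometry: 1024 sets, 64 B lines

def generate_hint(instructions):
    """Hint generation (offline): abstract the binary into a set-usage mask."""
    hint = 0
    for opcode, addr in instructions:
        if opcode in ("load", "store") and addr is not None:
            hint |= 1 << ((addr // LINE) % SETS)   # address -> L2 set index
    return hint

def evaluate_hint(hint_a, hint_b):
    """Hint evaluation (online): estimate contention between two tasks."""
    return bin(hint_a & hint_b).count("1")         # shared-set count

img_a = [("load", 0x1000), ("store", 0x1040), ("add", None)]
img_b = [("load", 0x9000), ("load", 0x1000)]
print(evaluate_hint(generate_hint(img_a), generate_hint(img_b)))  # 1 shared set
```

The task scheduling phase would then feed these shared-set counts into a pairing policy such as the one sketched earlier, co-scheduling the tasks with the fewest predicted shared sets.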

To evaluate the performance, we construct a simulator and compare our method with previous work. We form workloads with benchmark programs and input data sets from SPEC 2000[16], and these workloads are used to test the scheduling mechanisms. The simulation results show that HCCA achieves a lower L2 cache miss rate than the other methods and also some improvement in overall IPC (instructions per cycle).

This thesis is organized as follows. Chapter 2 introduces the system model and reviews some related work. Chapter 3 describes our HCCA technique in some detail. Performance evaluations are presented in Chapter 4. Finally, conclusions and future work are given in Chapter 5.
