
Before executing the simulation, we first use a modified SimpleScalar[26] to collect the execution traces of tasks. The trace collecting process is diagrammed in Figure 4.1(a). The hints are also generated before the simulation is executed. As shown in Figure 4.1(b), our hint generator contains a binary image dissector and a heap usage profiler, which together simulate the hint generation phase of HCCA (Hint-aided Cache Contention Avoidance). The execution traces and hints are then fed to our simulator.

Figure 4.1 The trace generator and the hint generator.

Figure 4.2 shows the architecture of our simulator, which contains three modules: a trace parser, a scheduling method simulator, and a memory simulator. The trace parser extracts memory access events from the execution trace and sends them to the scheduling method simulator. In addition to memory access events, the trace parser also extracts register values and the results of memory allocation operations for HCCA. The scheduling method simulator makes the scheduling decisions. For tasks that the scheduling method simulator selects as active, the corresponding memory access events received from the trace parser are forwarded to the memory simulator; the memory access events of the remaining tasks are queued at the scheduling method simulator. To compare different task scheduling methods, we implement each of them in the scheduling method simulator.

Figure 4.2 The architecture of our simulator.

The following methods are implemented: Round-Robin[18], Active-Set[14], TOS (Throughput-oriented Scheduling)[15], and HCCA. The implementation of HCCA contains a hint evaluator and a task scheduler, which simulate the hint evaluation phase and the task scheduling phase, respectively. The hints of the tasks and all trace events from the trace parser except memory access events are sent to the hint evaluator; the memory access events are sent to the task scheduler. The memory simulator simulates the memory hierarchy.
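The forwarding and queueing behavior of the scheduling method simulator can be sketched as follows. This is an illustrative Python sketch only; the class and method names are our own, not taken from the actual simulator.

```python
from collections import defaultdict, deque

class SchedulingMethodSimulator:
    """Illustrative sketch: forwards memory access events of active tasks
    to the memory simulator, and queues events of inactive tasks until
    they are scheduled."""

    def __init__(self, memory_simulator, num_cores=4):
        self.memory_simulator = memory_simulator
        self.num_cores = num_cores
        self.active_tasks = set()          # tasks currently running on cores
        self.pending = defaultdict(deque)  # queued events of inactive tasks

    def set_active_tasks(self, tasks):
        """Called by the scheduling policy (Round-Robin, Active-Set, TOS, HCCA)."""
        assert len(tasks) <= self.num_cores
        self.active_tasks = set(tasks)
        # Replay events that were queued while these tasks were inactive.
        for task in tasks:
            while self.pending[task]:
                self.memory_simulator.access(task, self.pending[task].popleft())

    def on_memory_event(self, task, event):
        if task in self.active_tasks:
            self.memory_simulator.access(task, event)  # forward immediately
        else:
            self.pending[task].append(event)           # queue until scheduled
```

Here `memory_simulator` stands for any object exposing an `access(task, event)` method; queued events are replayed in arrival order when their task becomes active.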

For Active-Set and TOS, the L2 cache hit and miss events are sent back to the scheduling method simulator. In our simulation, we simulate a four-core chip multiprocessor system. Figure 4.3 shows the cache configuration used in our simulation.
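For policies that consume this feedback, the coupling between the memory simulator and the scheduler can be sketched as a simple event callback. The bookkeeping below is illustrative only; it is not the actual Active-Set or TOS heuristic.

```python
from collections import defaultdict

class L2FeedbackCollector:
    """Illustrative sketch: accumulates per-task L2 hit/miss events
    reported by the memory simulator, so that a feedback-driven policy
    (e.g. Active-Set or TOS) can inspect recent miss rates."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def on_l2_event(self, task, is_miss):
        """Called by the memory simulator for every L2 access."""
        if is_miss:
            self.misses[task] += 1
        else:
            self.hits[task] += 1

    def miss_rate(self, task):
        """Fraction of this task's observed L2 accesses that missed."""
        total = self.hits[task] + self.misses[task]
        return self.misses[task] / total if total else 0.0
```

A scheduling policy would periodically read `miss_rate` to classify tasks before making its next scheduling decision.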

Figure 4.3 The configuration of the memory simulator.

The cache configuration is based on that of the SGI Origin200 workstation[27, 28]. To evaluate how the associativity of the L2 cache may affect our method, we simulate three different L2 cache associativity configurations. We simulate eight hundred million instructions for each task. The length of the time slice is set to ten million cycles for all evaluated task scheduling methods[14].
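A minimal sketch of a time-sliced simulation loop with these parameters is shown below. The Round-Robin policy here is a plain rotation over the task set, and the per-task instruction accounting via a fixed `ipc` value is a simplification of ours, not part of the original simulator.

```python
TIME_SLICE_CYCLES = 10_000_000        # time slice length (cycles)
INSTRUCTIONS_PER_TASK = 800_000_000   # instructions simulated per task
NUM_CORES = 4

def round_robin(tasks, ipc=1.0):
    """Simplified Round-Robin loop: rotate tasks onto the cores each
    time slice until every task has retired its instruction budget.
    `ipc` (instructions per cycle) is an illustrative simplification."""
    remaining = {t: INSTRUCTIONS_PER_TASK for t in tasks}
    order = list(tasks)
    slices = 0
    while remaining:
        # Schedule up to NUM_CORES unfinished tasks for this slice.
        active = [t for t in order if t in remaining][:NUM_CORES]
        for t in active:
            remaining[t] -= int(TIME_SLICE_CYCLES * ipc)
            if remaining[t] <= 0:
                del remaining[t]
        order = order[1:] + order[:1]   # rotate for the next slice
        slices += 1
    return slices
```

With four tasks on four cores and an IPC of 1, every task runs in every slice, so the workload finishes in 800M / 10M = 80 slices.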

The simulation workload is formed by a set of tasks. We use the benchmark programs and corresponding input data sets of SPEC CPU2000[16] to form our workloads. A task is formed by a benchmark program and one of its input data sets, and tasks are named by hyphenating the name of the benchmark program with the name of the input data set. For example, the benchmark program gzip has four input data sets: graphic, program, source, and random. We therefore form four tasks with gzip: gzip-graphic, gzip-program, gzip-source, and gzip-random. Two input data sets may have the same name if they belong to different benchmark programs. The list of tasks used in our simulation is shown in Figure 4.4. For each benchmark program, a separate training data set is used as the input of the hint generator to generate the hints. In our simulation, each workload includes twelve tasks, and we form three workloads to evaluate the performance of our mechanism.

ammp, art-110, art-470, bzip2-graphic, bzip2-program, bzip2-random, bzip2-source, equake, gcc-166, gcc-200, gcc-expr, gcc-integrate, gcc-scilab, gzip-graphic, gzip-program, gzip-random, gzip-source, mcf, mesa, vortex-lendian1, vortex-lendian2, vortex-lendian3, vpr-place, vpr-route

Figure 4.4 The list of tasks used in our simulation.

We want to evaluate the performance of our mechanism under different likelihoods of cache contention. Tasks that frequently cause cache misses are likely to cause more cache contention[11, 15]. Therefore, we form our workloads according to the number of cache misses caused by the individual tasks. We first execute each task once for eight hundred million cycles and sort the tasks by their number of cache misses in decreasing order. The sorting result is shown in Figure 4.5.

Tasks as they appear in Figure 4.5: vpr-route, vpr-place, equake, mesa, gcc-scilab, gcc-integrate, gcc-expr, gcc-200, gcc-166, vortex-lendian2, vortex-lendian3, vortex-lendian1, bzip2-program, bzip2-random, bzip2-graphic, gzip-program, bzip2-source, gzip-random, gzip-graphic, gzip-source, art-110, art-470, ammp, mcf (horizontal axis: number of cache misses, 0 to 5,500,000)

Figure 4.5 The sorted task list.

We then form the workloads according to the sorting result. Figure 4.6 shows our three workloads. Workload 1 is formed by selecting the first twelve tasks of the sorted task list, which cause more cache misses. Workload 2 is formed by randomly selecting six tasks from the first half of the sorted task list and another six tasks from the second half. Workload 3 is formed by selecting the last twelve tasks of the sorted task list, which cause fewer cache misses.
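These workload-formation rules can be expressed as a short script. The sorted task list would come from the profiling run described above, and `random.sample` here merely stands in for whatever random selection was actually used.

```python
import random

def form_workloads(sorted_tasks, seed=0):
    """Form the three 12-task workloads from a 24-task list that is
    sorted by number of cache misses in decreasing order.
    Illustrative sketch; the seed and RNG choice are ours."""
    assert len(sorted_tasks) == 24
    rng = random.Random(seed)
    first_half, second_half = sorted_tasks[:12], sorted_tasks[12:]
    w1 = first_half                                # most cache misses
    w2 = rng.sample(first_half, 6) + rng.sample(second_half, 6)  # mixed
    w3 = second_half                               # fewest cache misses
    return w1, w2, w3
```

By construction, Workload 2 always contains exactly six miss-heavy and six miss-light tasks, so the three workloads span high, mixed, and low contention likelihoods.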
