Chapter 3 The Proposed Techniques
3.4 An Example of The Proposed Policy
In the proposed scheduling policy, the rank level read-write reordering is combined with the read-write aware throttling. To show how the proposed policy works, a complete example is given as follows. Assume that there are four ranks in the DRAM and the throttle delay is reached at cycle k. Suppose that at cycle k, the command pattern in the RQ is as shown in the left part of Fig. 11. The commands are clustered into command sets according to their target ranks, and the result is shown in the right part of Fig. 11. Notice that the command set 𝑆4 contains no commands and is therefore omitted in Fig. 11.
Fig. 11 An example of how the read-write aware throttling clusters commands in the RQ and determines which ranks should be turned on when the throttle delay is reached.
When the throttle delay is reached at cycle k, the read-write aware throttling is performed first to check each command set for the existence of read commands. As a result, r1 and r3, whose command sets contain read commands, are considered urgent, while r2 and r4 are considered trivial. After the read-write aware throttling, the rank level read-write reordering is performed on command sets 𝑆1 and 𝑆3 before they are sent to CQ1 and CQ3. Using 𝑆1 as an example, Fig. 12 shows how the commands in 𝑆1 are reordered and sent to the CQ. The command at the front of the command set has the highest priority, while the command at the end has the lowest priority. The commands are sent to the CQ in descending order of priority, one per cycle.
Fig. 12 An example of how the commands in a given command set 𝑆1 are reordered by the rank level read-write reordering and sent to the CQ.
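The clustering and the urgent/trivial decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Command` record, its fields, and the sample request queue are hypothetical (they are not the contents of Fig. 11), and a rank is taken to be urgent exactly when its command set holds at least one read command.

```python
from typing import NamedTuple


class Command(NamedTuple):
    name: str       # illustrative label such as "R5" or "W1"
    is_read: bool   # True for a read command, False for a write command
    rank: int       # target rank, numbered from 1
    address: int    # target address (hypothetical encoding)


def cluster_by_rank(rq, num_ranks):
    """Split the RQ into one command set per rank, keeping arrival order."""
    command_sets = {rank: [] for rank in range(1, num_ranks + 1)}
    for cmd in rq:
        command_sets[cmd.rank].append(cmd)
    return command_sets


def urgent_ranks(command_sets):
    """Assumption: a rank is urgent if its command set contains a read."""
    return [rank for rank, cmds in command_sets.items()
            if any(cmd.is_read for cmd in cmds)]


# Hypothetical RQ contents when the throttle delay is reached.
rq = [
    Command("W1", False, 1, 0x10),
    Command("W2", False, 2, 0x20),
    Command("R3", True, 3, 0x30),
    Command("W4", False, 1, 0x40),
    Command("R5", True, 1, 0x40),
]
command_sets = cluster_by_rank(rq, num_ranks=4)
print(urgent_ranks(command_sets))  # [1, 3]
```

In this sample, rank 2 holds only a write and rank 4 holds nothing, so both stay in the low power mode, matching the outcome of the example above.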
As shown in Fig. 12, rank 1 is initially in the low power mode after all the commands from the last throttle period are completed. When the throttle delay is reached and the command set is formed, the rank level read-write reordering scans the command set and finds that W4 and W7 target the same address as R8. These three commands are then combined to form a command group and moved to the front of 𝑆1. Once the reordering completes, the commands are sent to 𝐶𝑄1 and rank 1 is turned on. Rank 1 is turned off again when all the commands finish. The example shows that the rank level read-write reordering forces the DRAM to process read requests as early as possible.
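A minimal Python sketch of this reordering step follows. It is an interpretation, not the thesis implementation: the `Command` record and the addresses are hypothetical, the arrival order W1, W4, W6, W7, R8 is assumed from the command labels, and a read with no matching write is assumed to form a group on its own. Writes to the same address stay ahead of the read inside the group so that the read observes the written data.

```python
from typing import NamedTuple


class Command(NamedTuple):
    name: str       # illustrative label such as "R8" or "W4"
    is_read: bool   # True for a read command, False for a write command
    address: int    # target address (hypothetical values)


def reorder_command_set(cmds):
    """Move each read, with earlier writes to its address, to the front."""
    grouped = set()   # indices already placed in a command group
    front = []
    for i, cmd in enumerate(cmds):
        if not cmd.is_read:
            continue
        # Earlier writes to the same address join the read's command group.
        group = [j for j in range(i)
                 if j not in grouped
                 and not cmds[j].is_read
                 and cmds[j].address == cmd.address]
        group.append(i)
        grouped.update(group)
        front.extend(cmds[j] for j in group)
    # Remaining commands keep their arrival order behind the groups.
    rest = [c for j, c in enumerate(cmds) if j not in grouped]
    return front + rest


s1 = [
    Command("W1", False, 0xA0),
    Command("W4", False, 0xB0),
    Command("W6", False, 0xC0),
    Command("W7", False, 0xB0),
    Command("R8", True, 0xB0),
]
print([c.name for c in reorder_command_set(s1)])
# ['W4', 'W7', 'R8', 'W1', 'W6']
```

W1 and W6 remain in the returned list because, for an urgent rank, the whole command set is sent to the CQ.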
Note that for an urgent rank ri, all the commands in the command set 𝑆𝑖 are allowed to be sent to 𝐶𝑄𝑖. Even write requests whose target addresses do not match any read request are sent to the corresponding CQ instead of being kept blocked in the RQ. For example, the target addresses of W1 and W6 in Fig. 12 differ from that of R8, yet they are still sent to 𝐶𝑄1. The reason is that once a rank is activated, keeping these write requests in the RQ does not yield further DRAM power reduction, because they will eventually be processed and consume the same amount of power. Moreover, if these write requests were kept blocked in the RQ until a read request with the same target address entered the command set, that read request would have to wait for them to complete. This would lengthen the latency of the critical read request and could worsen the system performance degradation.
Chapter 4 Experimental Results
This chapter presents and analyzes the experimental results to examine the proposed techniques. First, the simulation environment is described. Second, the evaluation results showing how each technique proposed in this thesis works at a given throttle delay are presented and analyzed. Finally, the power and performance trade-off is evaluated, and the results are analyzed and compared with other works.
4.1 Simulation Environment
The performance of our work is evaluated with Multi2Sim [28], a widely used cycle-accurate system simulator. Multi2Sim provides detailed simulation of single core or multicore processors and gives us the statistics of the system performance. Our evaluation integrates DRAMSim2 [29] into Multi2Sim to obtain more accurate statistics of DRAM, such as the latency of each memory command, DRAM power consumption and power mode transition delays. DRAMSim2 is a cycle-accurate, JEDEC DDRx memory system simulator, which models the memory controller, memory channels, DRAM ranks, and banks. In an evaluation, Multi2Sim runs the benchmark and generates memory commands accordingly. The memory commands are sent to DRAMSim2 and processed by the DRAM that DRAMSim2 models. The evaluation results of Multi2Sim provide the system throughput statistics, while the evaluation results of DRAMSim2 provide the DRAM power consumption.
The baseline system in the simulations uses the ARM Cortex-A9 MPCore [30]. The configuration parameters of the ARM Cortex-A9 MPCore are listed in Table III. The baseline system has two levels of cache. In order to compare our work with the previous work [21], our simulations use the same cache sizes as the previous work. The detailed parameters of the memory system are presented in Table IV. The main memory used in the evaluation is a DDR2 SDRAM, which is one of the JEDEC standard memories available on the market [12].
Table III
Configuration parameters of ARM Cortex A9 [30]
Parameter Value
Number of cores 4
Number of threads per core 1
Technology node 40 nm
Operating frequency 2.132 GHz
Supply voltage 0.66 V
Threshold voltage 0.23 V
Decode width 2
Issue width 4
Commit width 4
Number of arithmetic logic units per core 3
Number of multipliers per core 1
Number of floating-point units per core 1
Branch predictor 2 level, 1024-set 2-way BTB
Table IV
Memory system parameters
Parameter Value
Size of level 1 data cache per core 32 KB
Set associativity of level 1 data cache per core 4-way
Size of level 1 instruction cache 64 KB
Set associativity of level 1 instruction cache per core 2-way
Size of level 2 cache 2 MB
Set associativity of level 2 cache 8-way
DRAM frequency 533 MHz
Number of DRAM ports 2
DRAM device width 8
Number of DRAM ranks 4
Number of DRAM banks per rank 4
Number of DRAM rows per bank 8192
Number of DRAM columns per bank 4096
Table V
Benchmark combinations of floating-point benchmarks in SPEC CPU2006 [31]
Combination Benchmarks
fp1 410.bwaves 416.gamess 433.milc 434.zeusmp
fp2 435.gromacs 436.cactusADM 437.leslie3d 444.namd
fp3 447.dealII 450.soplex 453.povray 454.calculix
fp4 459.GemsFDTD 465.tonto 470.lbm 481.wrf
The workload for our simulations is the SPEC CPU2006 [31] benchmark suite. The benchmarks in the SPEC CPU2006 suite can be separated into integer benchmarks and floating point benchmarks. The floating point benchmarks have higher memory pressure than integer benchmarks and need more sophisticated power management policies [1][21]. Therefore, the benchmarks used in our simulations are randomly chosen from the floating point benchmarks.
For each simulation, a benchmark combination containing four benchmarks is used. Every benchmark in the benchmark combination is assigned to a certain core. The benchmark combinations are listed in Table V. Each benchmark combination is run for five million CPU cycles in our evaluation.
Another workload used in the evaluation is the SPLASH-2 benchmarks [32], which are collected from real applications. Using the dynamic context scheduler provided by Multi2Sim [28], each program in the SPLASH-2 benchmarks forks at most four parallel contexts during runtime. The benchmarks in SPLASH-2 used in the evaluation are listed in Table VI.
Table VI
SPLASH-2 [32] benchmarks used in the evaluation
Benchmark Problem size
Barnes 2048 particles
Cholesky tk14.O
FFT 65536 points
FMM 2048 particles
Radix 256k keys, max-value 524288, radix 4096
The SPEC CPU2006 benchmarks are used in sections 4.2 and 4.3, and the SPLASH-2 benchmarks are used in section 4.3. In the evaluation, the power consumption is measured in Watts. The system performance is measured in million instructions per second (MIPS), which represents the throughput of the system. All the results are normalized to the native DRAM, which refers to the DRAM with no power management policy.
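The normalization to the native DRAM can be written out as a short Python sketch. The function names and the sample numbers are hypothetical, and the formula for the power reduction percentage (one minus the ratio of policy power to native power) is an assumption about how the reported percentages are derived.

```python
def power_reduction_pct(policy_watts, native_watts):
    """Power saved relative to the native DRAM, in percent (assumed formula)."""
    return 100.0 * (1.0 - policy_watts / native_watts)


def normalized_performance(policy_mips, native_mips):
    """Throughput relative to the native DRAM; 1.0 means no degradation."""
    return policy_mips / native_mips


# Hypothetical measurements, not taken from the evaluation.
print(power_reduction_pct(policy_watts=0.5, native_watts=2.0))    # 75.0
print(normalized_performance(policy_mips=980.0, native_mips=1000.0))  # 0.98
```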
4.2 Analysis on Different Techniques
Although our work employs both the read-write aware throttling and the rank level read-write reordering, the two techniques can be employed individually. Therefore, the techniques proposed in this thesis are evaluated both jointly and separately to see how they affect the DRAM power consumption and the system performance.
In the evaluation, our work is compared to the previous work and an oracle policy. In the oracle policy, the order of the memory accesses is visible to the policy, so the DRAM ranks can be ideally turned on and off when needed. Furthermore, the oracle policy incurs no transition delay and no transition power. The power reduction of the oracle policy is the maximum power reduction possible at zero system performance degradation. The oracle policy employs neither a throttling-based mechanism nor reordering. Therefore, the oracle policy can be viewed as a time-out-zero policy, which turns off a rank at the instant it becomes idle, with a perfect pre-wakeup capability that turns a rank back on whenever it receives a command.
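As a rough illustration of the time-out-zero view, the Python sketch below computes how long a rank stays powered under such an oracle and the resulting background energy. Everything here is an assumption made for illustration: busy periods are given as half-open cycle intervals, and `p_on` and `p_off` are hypothetical per-cycle background power values in arbitrary units. Transition delay and transition power are left out, as in the oracle policy.

```python
def oracle_on_cycles(busy_intervals):
    """Cycles a rank is powered: the union of its busy intervals [start, end)."""
    total = 0
    covered_until = None
    for start, end in sorted(busy_intervals):
        if covered_until is None or start > covered_until:
            total += end - start            # disjoint from earlier intervals
            covered_until = end
        elif end > covered_until:
            total += end - covered_until    # count only the uncovered tail
            covered_until = end
    return total


def oracle_background_energy(busy_intervals, window, p_on, p_off):
    """Background energy over `window` cycles with free, instant transitions."""
    on = oracle_on_cycles(busy_intervals)
    return p_on * on + p_off * (window - on)


busy = [(0, 10), (5, 20), (30, 40)]   # hypothetical busy periods of one rank
print(oracle_on_cycles(busy))                                          # 30
print(oracle_background_energy(busy, window=100, p_on=1.0, p_off=0.25))  # 47.5
```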
The benchmark combinations listed in Table V are used in this evaluation. The throttle delay for our work and the previous work is set to 400 CPU cycles, at which both our work and the previous work achieve good power reduction with acceptable system performance degradation. The effect of different throttle delays is analyzed later. The evaluation results of all four benchmark combinations are shown in Fig. 13.
(a) DRAM power reduction percentage
(b) Normalized system performance
Fig. 13 Power and performance of different policies on different benchmark combinations.
The evaluation results show that when the read-write aware throttling is employed alone, it reduces the DRAM power consumption by 10%~15% more than the previous work but causes around 1% more system performance degradation. The reason is that the read-write aware throttling keeps a rank in the low power mode until it receives read requests. However, when the rank is turned back on to handle the read requests, there are many write requests waiting to be processed. Without the rank level read-write reordering, the read requests have to wait until all the write requests that entered the queue before them are completed. The system performance is degraded since the critical read requests have to wait for a long time.
On the other hand, when the rank level read-write reordering is used alone with the basic throttling mechanism of the previous work [21], it improves the system performance by around 1%, but the DRAM power consumption remains the same as that of the previous work [21]. This is because the rank level read-write reordering only forces the DRAM to process read requests as early as possible and does not create extra power-down opportunities.
By combining these two techniques, our work saves 10%~15% more power than the previous work with the same, or even slightly less, system performance degradation. More importantly, our work reduces the DRAM power consumption to below that of the oracle policy with 2% system performance degradation on average.
The evaluation results show that each technique reacts differently to different benchmark combinations. Since the proposed techniques take into consideration that read requests and write requests are not equally critical to the system performance, the mix of read and write requests in a benchmark is essential to the effect of the proposed techniques. Therefore, the read requests percentage of each benchmark combination is listed in Table VII. The read requests percentage is obtained by evaluating each benchmark combination with the native DRAM and is calculated by dividing the number of read requests by the total number of memory requests.
Table VII
Read requests percentage of each benchmark combination
Benchmark combination Read requests percentage
fp1 6.76%
fp2 17.74%
fp3 55.27%
fp4 20.14%
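For reference, the read requests percentage is the share of read requests among all memory requests seen by the native DRAM. A minimal Python sketch, assuming that every memory request is either a read or a write and using hypothetical counts rather than the measured ones:

```python
def read_request_percentage(num_reads, num_writes):
    """Share of read requests among all memory requests, in percent."""
    total = num_reads + num_writes
    if total == 0:
        return 0.0   # no memory traffic at all
    return 100.0 * num_reads / total


# Hypothetical counts: one read for every three writes.
print(read_request_percentage(num_reads=1000, num_writes=3000))  # 25.0
```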
Table VII shows that most of the read requests are completed in the cache, and in three of the four benchmark combinations fewer read requests than write requests are sent down to the DRAM. Among the combinations, fp3 is the most read intensive, while fp1 is the least read intensive. The read intensity of the different benchmark combinations is reflected in the power reduction in the evaluation results. The read-write aware throttling works very well on fp1, which has a weak read intensity, because most of the time there are only write requests blocked in the RQ and the DRAM ranks can be turned off. On the other hand, the strong read intensity of fp3 limits the effect of the read-write aware throttling, since a rank is less likely to receive only write requests in a throttle period and therefore cannot be turned off. Nevertheless, our work still saves 5% more DRAM power with slightly better system performance than the previous work [21].
As mentioned in section 1.2, the DRAM power can be partitioned into several parts: background power, active power, precharge power, read power, write power, and refresh power. To further analyze the evaluation results, Fig. 14 shows these detailed power consumptions obtained from the evaluation and compares them to the power consumption of the native DRAM. In Fig. 14, the ACT/PRE power consumption represents the sum of active power and precharge power, while the read/write burst power represents the sum of read power and write power. The refresh power is omitted: because the DRAM is refreshed periodically and the evaluation runs each benchmark for a fixed number of CPU cycles, the refresh power consumption is the same for all techniques.
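The grouping used in Fig. 14 amounts to summing the six power components into three reported categories. The Python sketch below shows that bookkeeping; the dictionary keys and the sample values are hypothetical and are not figures from the evaluation.

```python
def group_power(components):
    """Fold per-component DRAM power (Watts) into the categories of Fig. 14."""
    return {
        "background": components["background"],
        "act_pre": components["active"] + components["precharge"],
        "rw_burst": components["read"] + components["write"],
        # Refresh power is left out: it is the same for all techniques.
    }


# Hypothetical per-component power in Watts.
sample = {
    "background": 1.5, "active": 0.25, "precharge": 0.25,
    "read": 0.125, "write": 0.375, "refresh": 0.5,
}
print(group_power(sample))
# {'background': 1.5, 'act_pre': 0.5, 'rw_burst': 0.5}
```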
(a) Background power consumptions of different techniques
(b) ACT/PRE power consumptions of different techniques
(c) Read/write burst power consumptions of different techniques
Fig. 14 The background power, ACT/PRE power, and the read/write power consumptions of different techniques.
The results in Fig. 14 show that the DRAM power consumption is dominated by the background power since the main memory accesses are sparse in the evaluation. Therefore, when the throttling-based mechanism is used to turn off idle ranks, the background power consumption of the DRAM is greatly reduced. In addition, the read-write aware throttling creates longer idle periods during which a rank can be turned off and thus reduces the background power by 10%~45% more than the previous work [21]. The ACT/PRE power consumptions and the read/write burst power consumptions show that the read-write aware throttling cuts down the number of returned commands because the DRAM ranks are in the low power mode for a long period. However, when the read-write aware throttling releases the blocked commands from the RQ to the CQ, it forces the DRAM to focus on accessing the active rank.
Note that the read/write burst power consumption of our work on fp1 is low. This is because fp1 has a weak read intensity. When the memory controller finally receives a read request and turns on a rank, there are many pending write requests targeting that rank, which were blocked by the read-write aware throttling. As a result, in a command group formed by the rank level read-write reordering, there are many write requests in front of the read request. Once the rank is activated, it takes a long time to process the pending write requests before it becomes available to process the critical read request. The system performance is thus harmed, and fewer main memory commands are completed within the same simulation period. Therefore, the read/write burst power consumption is lower than that of the other techniques.
4.3 Power and Performance Trade-Off
In the proposed techniques, the read-write aware throttling mechanism is the main contributor to the DRAM power reduction. For throttling-based power reduction mechanisms, the throttle delay is critical to both the DRAM power consumption and the system performance.
A long throttle delay leads to better power reduction since the DRAM ranks stay in the low power mode for a long period. However, a long throttle delay also has a worse impact on the system performance because all the commands have to wait for a long period before they are processed by the DRAM. On the other hand, a short throttle delay allows the DRAM to process memory commands more frequently, but it also limits the effectiveness of the power reduction.
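The direction of this trade-off can be seen in a toy Python model. It is a deliberately simplified sketch and not the simulated policy: it assumes release points fall at fixed multiples of the throttle delay, treats reads and writes alike (basic throttling), ignores DRAM service time, and uses hypothetical arrival cycles.

```python
def throttle_stats(arrival_cycles, throttle_delay):
    """Return (mean wait in cycles, number of release points that send commands)."""
    waits = []
    release_points = set()
    for t in arrival_cycles:
        # A command is held until the next release point after its arrival.
        release = (t // throttle_delay + 1) * throttle_delay
        waits.append(release - t)
        release_points.add(release)
    mean_wait = sum(waits) / len(waits) if waits else 0.0
    return mean_wait, len(release_points)


arrivals = [10, 50, 120, 390]   # hypothetical command arrival cycles
print(throttle_stats(arrivals, 100))   # (57.5, 3)
print(throttle_stats(arrivals, 400))   # (257.5, 1)
```

With the longer delay, the same four commands are released at a single point instead of three, so ranks need to wake up less often, while the mean wait grows from 57.5 to 257.5 cycles.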
In order to see how different throttle delays affect the performance of our work, an evaluation on different throttle delays is carried out. The evaluation uses all benchmark combinations fp1, fp2, fp3 and fp4. Since every benchmark combination has a different memory command pattern, we averaged the evaluation results of all four benchmark combinations. The average simulation results are shown in Table VIII. All the throttle delays are in CPU cycles.
The improvements are the differences between our work and the previous work [21].
Table VIII
Effect of different throttle delays
Columns: throttle delay (CPU cycles); power reduction percentage of the previous work [21], of our work, and the improvement; system performance overhead of the previous work [21], of our work, and the improvement
100 64.67% 75.46% 10.79% 1.03% 0.61% 0.42%
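The improvement columns can be reproduced from the other columns of the row for a throttle delay of 100 CPU cycles. In the Python sketch below, the dictionary layout and key names are hypothetical; the values are those of the table row. For power reduction a higher value is better, so the improvement is our work minus the previous work; for performance overhead a lower value is better, so the subtraction is reversed.

```python
def improvements(row):
    """Return (power reduction improvement, overhead improvement) in percentage points."""
    power = round(row["power_ours"] - row["power_prev"], 2)
    overhead = round(row["overhead_prev"] - row["overhead_ours"], 2)
    return power, overhead


row_100 = {
    "throttle_delay": 100,
    "power_prev": 64.67, "power_ours": 75.46,
    "overhead_prev": 1.03, "overhead_ours": 0.61,
}
print(improvements(row_100))  # (10.79, 0.42)
```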
The evaluation results show that our work is stable across different throttle delays. It steadily reduces around 75% of the DRAM power consumption, which is an upper bound of power reduction, and is around 10% better than the previous work. Moreover, the system performance degradation of our work is slightly lower than that of the previous work. This shows that the rank level read-write reordering mechanism effectively relieves the impact on the system performance. The evaluation results also show that as the throttle delay increases, the difference in power reduction between our work and the previous work gets smaller. This is because when the throttle delay is too large, all the DRAM ranks are in the low power mode most of the time. Therefore, the power consumption is low and the performance degradation is dramatic.
With the results in Table VIII, we can further illustrate the trade-off characteristic between power and performance. By varying the throttle delay, the evaluation shows how our work reacts to different system performance degradations. The average results of all benchmark combinations are shown in Fig. 15, where both the power and performance are normalized to the DRAM with no power management policy. The result shows that our work has a better power and performance trade-off characteristic. Under the same system performance degradation, our work reduces around 10% more DRAM power than the previous work.
Fig. 15 Average power and performance trade-off characteristics on SPEC CPU2006 [31].
Note that our work, unlike the previous work [21], is sensitive to the memory command patterns of benchmarks. To show the difference, the evaluation results of benchmark combinations fp1 and fp3 are shown in Fig. 16 and Fig. 17, respectively. Read requests outnumber write requests in fp3 (55.27% of the memory requests in Table VII), while write requests dominate over read requests in fp1 (6.76% read requests).
Fig. 16 Power and performance trade-off characteristics for fp1.
Fig. 17 Power and performance trade-off characteristics for fp3.
The trade-off curves of the previous work [21] in Fig. 16 and Fig. 17 have similar slopes.