Chapter 1 Introduction
1.4 Our Contributions
To reduce DRAM power consumption, this thesis proposes read-write aware DRAM scheduling, which exploits the difference in criticality between read accesses and write accesses.
In the remainder of this thesis, DRAM scheduling refers to reordering the sequence of memory access commands inside the memory controller. Internal DRAM commands such as ACT and PRE keep their default order and are not considered by the proposed techniques.
This thesis proposes two techniques: the read-write aware throttling mechanism and the rank level read-write reordering technique. The read-write aware throttling mechanism effectively cuts down DRAM power consumption. The rank level read-write reordering significantly reduces the system performance degradation caused by DRAM power management while maintaining the power saving. Our work reduces DRAM power by 75.34% relative to a DRAM with no power management. Compared to the existing work, the proposed techniques save 10% more DRAM power with less performance degradation. Moreover, a comparison against an oracle policy shows that our work can reduce more power at the expense of a slight system performance degradation.
The remainder of this thesis is organized as follows. The next chapter describes the problem formulation. Chapter 3 presents the proposed techniques. The experimental results are presented in Chapter 4. Finally, Chapter 5 concludes this thesis.
Chapter 2
Problem Description
2.1 System Model
As shown in Fig. 4, the system has one or more processor cores. These cores are connected to a multi-level cache and are equipped with write buffers. A JEDEC standard DRAM is used as a shared main memory for all cores. The DRAM main memory is connected to the last level cache. The DRAM communicates with the processors through a memory controller, which lies between the last level cache and the DRAM circuit. The memory controller receives memory commands from the processors and issues these commands to the DIMMs inside the DRAM.
Each memory command contains information including its target address and access type. The memory controller keeps track of the time it receives each command.
Fig. 4 System hierarchy diagram and the architecture of queues inside the memory controller.
Within the memory controller, there are two kinds of queues: the reorder queue (RQ) and the command queues (CQs). Memory access commands from the last level cache are first stored in the RQ. These commands are mapped to DRAM ranks and banks according to their target addresses. A scheduler inside the RQ can reorder the commands in the RQ while keeping them free of data hazards. After mapping and reordering, the memory controller sends the memory commands from the RQ to the CQs one per cycle. Each CQ corresponds to one rank. Therefore, for a DRAM with n ranks, there are n CQs inside the memory controller.
Memory commands destined for rank 1, rank 2… rank n are sent to CQ1, CQ2… and CQn, respectively. A CQ handles all kinds of commands destined for its corresponding rank. These commands not only include access commands from the RQ, but also include other internal commands such as ACT, PRE, refresh, power-up, and power-down commands generated by the memory controller. The CQs do not change the order of the memory access commands since they are already reordered by the scheduler inside the RQ. Hence, the CQs issue the commands to the DRAM ranks in a first-in-first-out (FIFO) order.
Normally, the memory controller sends commands from the RQ to the CQs whenever the RQ is not empty. When a throttling mechanism is employed, the memory controller instead blocks commands in the RQ for the throttle delay 𝑡𝑇𝐷. No command is sent to the CQs before the throttle delay is reached. When the throttle delay is reached, the memory controller starts sending the blocked commands to the CQs one per cycle until the RQ is empty. The memory controller repeatedly blocks and releases commands in this way to realize the throttling mechanism.
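The block-and-release behavior described above can be sketched as a small cycle-level simulation. This is a minimal illustration and not the thesis's implementation: the function name simulate_throttle, the arrivals input format, and the choice to start a new blocking phase as soon as the RQ drains are all assumptions of this sketch.

```python
from collections import deque


def simulate_throttle(arrivals, t_td, total_cycles):
    """Return (cycle, name, rank) for every command moved from the RQ to a CQ.

    arrivals: dict mapping a cycle number to the list of (name, rank)
    commands that reach the RQ in that cycle (hypothetical input format).
    """
    rq = deque()
    transfers = []
    blocked_cycles = 0   # cycles spent in the current blocking phase
    releasing = False    # True once the throttle delay has been reached

    for cycle in range(total_cycles):
        rq.extend(arrivals.get(cycle, []))
        if releasing:
            if rq:
                name, rank = rq.popleft()      # one command per cycle
                transfers.append((cycle, name, rank))
            if not rq:
                releasing = False              # RQ drained: block again
                blocked_cycles = 0
        else:
            blocked_cycles += 1
            if blocked_cycles >= t_td:
                releasing = True               # release starts next cycle
    return transfers


# Three commands arrive while the RQ is blocked for t_td = 4 cycles.
arrivals = {0: [("W1", 1)], 1: [("R2", 3)], 2: [("W3", 3)]}
transfers = simulate_throttle(arrivals, t_td=4, total_cycles=10)
print(transfers)
```

With these inputs nothing leaves the RQ during cycles 0 to 3, and the three commands are then transferred in arrival order in cycles 4, 5, and 6.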
Fig. 5 gives an example of how the memory commands are transferred from the RQ to the CQs when the throttling mechanism is employed. In Fig. 5, each rectangle represents a memory command. The access type of each command is denoted by R (read) or W (write), followed by an index number and the target rank of the command. The index numbers are assigned to each memory command according to the time they entered the RQ. The command that enters the RQ
earlier is represented by a smaller index number. For the example shown in Fig. 5, the DRAM is assumed to have four ranks, which are denoted as r1, r2, r3, and r4. Correspondingly, there are four CQs in the memory controller. Suppose that the throttle delay is reached at cycle k. For simplicity, we denote the ith rank as ri in Fig. 5 as well as in the remainder of this thesis.
As shown in Fig. 5, three commands W1, R2, and W3 are blocked in the RQ from cycle k − 𝑡𝑇𝐷 to cycle k. When the throttle delay is reached, the RQ starts sending commands from its front to its end. Each memory command is sent to the CQ of its target rank. Hence, W1 is sent to CQ1 first, then R2 is sent to CQ3, and finally W3 is sent to CQ3. Each command in a CQ is issued to the corresponding DRAM rank. CQ1 issues W1 to r1 after receiving W1 from the RQ. Since each CQ issues commands in FIFO order, CQ3 issues R2 to r3 before W3.
Fig. 5 An example of how the blocked memory commands transfer from the RQ to the CQs when the throttle delay is reached in a throttling mechanism.
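The routing in the Fig. 5 example can be reproduced with a few lines of code. The sketch below assumes each command is a (name, rank) pair and models every CQ as a FIFO; the function name dispatch_to_cqs is hypothetical and the sketch ignores timing.

```python
from collections import deque


def dispatch_to_cqs(rq, n_ranks):
    """Move commands from the front of the RQ to the CQ of their target rank."""
    cqs = {rank: deque() for rank in range(1, n_ranks + 1)}
    while rq:
        name, rank = rq.popleft()   # front of the RQ first
        cqs[rank].append(name)      # each CQ keeps arrival order (FIFO)
    return cqs


# Commands blocked in the RQ in the Fig. 5 example: W1 -> r1, R2 -> r3, W3 -> r3.
rq = deque([("W1", 1), ("R2", 3), ("W3", 3)])
cqs = dispatch_to_cqs(rq, n_ranks=4)
cq1_contents = list(cqs[1])

# CQ3 issues its commands to r3 in FIFO order.
issue_order_r3 = [cqs[3].popleft() for _ in range(len(cqs[3]))]
print(cq1_contents, issue_order_r3)
```

CQ2 and CQ4 stay empty in this example, so under a rank level power-down policy r2 and r4 would have nothing to serve.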
The DRAM supports rank level power-mode control. That is, each rank has two power modes: the active mode and the low power mode. A rank has to be switched to the active mode before it can process the commands it receives. When a rank is idle, the memory controller can switch it to the low power mode. The memory controller puts power-up and power-down commands into a CQ to switch the power mode of the corresponding DRAM rank. Each power mode switch incurs a transition delay.
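To make the cost of mode switching concrete, the following sketch models one rank as a two-state machine with a power-up delay and a power-down delay. The class name Rank, the ready_at bookkeeping, and the delay values in the example are assumptions for illustration and are not taken from the thesis or from any DRAM datasheet.

```python
class Rank:
    """Two-mode rank model: 'active' or 'low_power', with transition delays."""

    def __init__(self, t_pup, t_pdn):
        self.t_pup = t_pup          # power-up transition delay (cycles)
        self.t_pdn = t_pdn          # power-down transition delay (cycles)
        self.mode = "low_power"
        self.ready_at = 0           # cycle at which the last transition ends

    def power_up(self, now):
        if self.mode == "low_power":
            start = max(now, self.ready_at)   # wait for a pending transition
            self.mode = "active"
            self.ready_at = start + self.t_pup
        return self.ready_at

    def power_down(self, now):
        if self.mode == "active":
            start = max(now, self.ready_at)
            self.mode = "low_power"
            self.ready_at = start + self.t_pdn
        return self.ready_at

    def can_serve(self, now):
        """A rank serves commands only when active and done transitioning."""
        return self.mode == "active" and now >= self.ready_at


rank = Rank(t_pup=3, t_pdn=2)       # hypothetical delays
ready = rank.power_up(now=10)
print(ready, rank.can_serve(12), rank.can_serve(13))
```

Every command that finds its rank in the low power mode pays the power-up delay before it can be served, which is why frequent mode switching hurts performance.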
The abbreviations and notations used throughout this thesis are listed in the following tables.
PCRAM  phase change random access memory
RQ     reorder queue
CQ     command queue
CQi    the command queue assigned to rank i
FIFO   first-in-first-out
ri     rank i
RAW    read after write
MIPS   million instructions per second
Table II Table of notations
Notation Definition
𝑃𝑃𝐷𝑁  background power of a DRAM chip in the off mode
𝑃𝐴𝐶𝑇_𝑆𝑇𝐵𝑌  background power of a DRAM chip in the on mode
𝑃𝑅𝐸𝐹  power consumption of refresh operation
𝑃𝐴𝐶𝑇  power consumption of an ACT command
𝑃𝑃𝑅𝐸  power consumption of a precharge command
𝑃𝑅𝐷  average power consumption of read accesses
𝑃𝑊𝑅  average power consumption of write accesses
𝑃𝑜𝑓𝑓  total power consumption of a DRAM chip in the off mode
𝑃𝑜𝑛  total power consumption of a DRAM chip in the on mode
𝑡𝑃𝐷𝑁  power down transition delay
𝑡𝑃𝑈𝑃  power up transition delay
𝑡𝑇𝐷  throttle delay
n  number of ranks in the DRAM
𝐶𝑅𝑄  number of commands in the RQ
𝑆𝑖  the ith command set
𝐶𝑆𝑖  number of commands in the ith command set
𝑅𝑖  number of read commands in the ith command set
𝑊𝑖  number of write commands in the ith command set
2.2 Problem Statement
With the system model described in the previous section, the goal of this thesis is to find a DRAM scheduling scheme for the memory controller that reduces DRAM power with only a small system performance degradation. The scheduling scheme includes a throttling mechanism, which controls when the RQ starts sending commands to the CQs. The scheme also contains a scheduling policy for the scheduler inside the RQ, which can reorder the sequence of memory request commands in the RQ. Internal commands such as ACT and PRE are not considered by the scheduler inside the RQ, since the memory controller generates them and puts them into the CQs directly. The scheduling policy must guarantee that the reordered sequence of memory commands is hazard-free. Finally, the scheme is in charge of determining when to turn the DRAM ranks on and off.
By using the throttling mechanism, reordering the memory commands and controlling the power mode of ranks, the goal of the proposed scheme is to reduce the power consumption of the DRAM with minor system performance degradation.
Chapter 3
The Proposed Techniques
3.1 Overview
The proposed DRAM power reduction techniques aim at lowering DRAM power consumption with only a slight system performance degradation. Since the proposed techniques build on existing policies, this section starts with the baseline and shows the flow chart of the greedy memory controller in Fig. 6.
Fig. 6 Flow chart of a greedy memory controller, which employs the greedy power-down policy.
The greedy memory controller employs the greedy power-down policy, which turns off a DRAM rank whenever it is idle. In the greedy power-down policy, an idle rank is turned off even when there are pending commands destined for it in the RQ. The greedy memory controller does not employ the throttling mechanism. Therefore, whenever the RQ is not empty, a single memory command is sent from the RQ to the corresponding CQ every cycle.
As shown in Fig. 6, in every cycle, the memory commands arriving from the last level cache are pushed to the end of the RQ. The memory controller checks the state of each rank and turns off the ranks that are idle. After turning off idle ranks, the memory controller sends the command at the front of the RQ to its corresponding CQ. Finally, the memory controller checks whether each CQ is empty. If CQi is not empty, the memory controller turns on ri and issues the commands to it.
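The per-cycle flow of the greedy memory controller can be sketched as a single function. This is a simplified model under stated assumptions: a rank counts as idle when its CQ is empty, each non-empty CQ issues one command per cycle, and transition delays are ignored. The name greedy_cycle and the data layout are hypothetical.

```python
from collections import deque


def greedy_cycle(rq, cqs, rank_on, new_cmds):
    """Run one cycle of the greedy controller; return the commands issued."""
    rq.extend(new_cmds)                  # 1. push arrivals to the end of the RQ
    for rank, cq in cqs.items():         # 2. turn off every idle rank
        if not cq:
            rank_on[rank] = False
    if rq:                               # 3. front of the RQ goes to its CQ
        name, rank = rq.popleft()
        cqs[rank].append(name)
    issued = []
    for rank, cq in cqs.items():         # 4. non-empty CQ: turn on and issue
        if cq:
            rank_on[rank] = True
            issued.append((cq.popleft(), rank))
    return issued


rq = deque()
cqs = {1: deque(), 2: deque()}
rank_on = {1: False, 2: False}

out1 = greedy_cycle(rq, cqs, rank_on, [("R1", 1)])   # r1 turned on for R1
on_after_1 = rank_on[1]
out2 = greedy_cycle(rq, cqs, rank_on, [])            # one idle cycle: r1 off
on_after_2 = rank_on[1]
out3 = greedy_cycle(rq, cqs, rank_on, [("W2", 1)])   # r1 turned on again
on_after_3 = rank_on[1]
print(on_after_1, on_after_2, on_after_3)
```

In this trace a single idle cycle is enough to switch r1 off and on again, which illustrates how the greedy policy produces frequent power mode transitions.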
Since the greedy memory controller turns off an idle rank regardless of the upcoming memory commands in the RQ, the idle rank is turned off even during a short idle period. This results in frequent power mode transitions, and the resulting transition delays lead to dramatic system performance degradation [20].
To improve the power reduction and the system performance of the greedy memory controller, the previous work [21] adds the throttling mechanism, the queue-aware power-down policy, and the power-aware memory scheduler to the greedy memory controller. The flow chart of the previous work [21] is shown in Fig. 7.
Fig. 7 Flow chart of the memory controller proposed in the previous work [21].
The light-gray rectangles and decision boxes in Fig. 7 are power reduction techniques added by the previous work [21]. The throttling mechanism, which is shown as the first light-gray decision box in the flow chart, blocks memory commands after they are pushed to the RQ until the throttle delay is reached. The blocked commands are not allowed to be sent to the CQ.
When the throttle delay is reached, the memory controller clusters the blocked commands into command sets by their target ranks. Commands destined for ri are clustered into command set
𝑆𝑖. The command sets are then ordered inside the RQ: the set of commands destined for the same rank as the command that entered the RQ first is moved to the front of the RQ, followed by the set belonging to the next earliest command, and so on. The memory controller then checks each command set; if 𝑆𝑖 is not empty, all the commands in 𝑆𝑖 are allowed to be sent to the CQ. The memory controller then sends the command at the front of the RQ to the corresponding CQ. At the end of each cycle, the memory controller checks each CQ to see whether it contains any commands. If CQi is empty, the memory controller sends a power-down command to ri, which turns ri off.
To sum up, the techniques proposed in the previous work [21] improve both the power reduction and the system performance of the greedy memory controller. The power-aware memory scheduler in the previous work clusters commands according to their target ranks. This forces the DRAM to concentrate on accessing a certain rank for a period of time, allowing other ranks to be turned off. The queue-aware power-down policy checks the upcoming commands in the CQ before turning off a rank, which prevents a rank from being turned off during a short idle period. When a rank is turned off, the throttling mechanism ensures that the rank stays in the low power mode for a long period of time.
However, the previous work [21] does not maximize the capability of the memory controller. Moreover, it does not take into consideration that read requests are more critical to system performance than write requests. Utilizing this fact, this thesis proposes the read-write aware throttling and the rank level read-write reordering techniques to improve on the previous work [21]. The flow chart of the memory controller that employs the proposed techniques is shown in Fig. 8.
Fig. 8 Flow chart of the memory controller employing the proposed techniques.
Building on the previous work [21], the dark-gray decision box and rectangle in Fig. 8 mark the techniques proposed in this thesis: the decision box is the read-write aware throttling and the rectangle is the rank level read-write reordering.
The read-write aware throttling mechanism, which is depicted by the dark-gray decision box, checks for the existence of read commands in each command set. Instead of allowing all the nonempty command sets to be sent to the CQs, only the command sets containing read
requests are allowed to be sent to the CQs and only their target ranks are turned on. The other ranks, including ranks with write requests pending in the RQ, stay in the low power mode for another throttle delay to reduce the DRAM power consumption.
The dark-gray rectangle shows that the command sets are reordered by the rank level read-write reordering before they are sent to the CQs. The read requests in each command set get higher priority than the write requests. The commands with higher priority enter the CQs earlier. Since each CQ issues commands to the DIMM in FIFO order, the read requests reach their target rank as soon as possible. As a result, read requests, which are critical to system performance, are served by the DIMMs earlier, and the system performance degradation caused by the throttling mechanism is relieved.
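The idea of giving reads priority inside one command set can be sketched as a stable partition. The hazard rule used here is an assumption of this sketch and not the thesis's stated policy: a read that follows an earlier write to the same address (a RAW dependency) is kept behind that write instead of being moved forward. The function name and the (kind, address) command format are hypothetical.

```python
def reorder_reads_first(cmd_set):
    """Move independent reads ahead of writes within one command set.

    cmd_set: list of (kind, addr) in arrival order, kind is 'R' or 'W'.
    Relative order is preserved inside each group.
    """
    promoted = []            # reads with no earlier write to the same address
    rest = []                # writes, plus reads that depend on them
    written = set()          # addresses touched by an earlier write
    for kind, addr in cmd_set:
        if kind == "W":
            written.add(addr)
            rest.append((kind, addr))
        elif addr in written:
            rest.append((kind, addr))      # RAW: stay behind the write
        else:
            promoted.append((kind, addr))
    return promoted + rest


cmd_set = [("W", 0x10), ("R", 0x20), ("W", 0x30), ("R", 0x10)]
reordered = reorder_reads_first(cmd_set)
print(reordered)
```

In the example the read of 0x20 moves to the front, while the read of 0x10 stays behind the earlier write to 0x10 so that it still observes the written value.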
For simplicity, the following notations are used in this thesis. Suppose that there are n ranks in the DRAM. There is one RQ and n CQs, 𝐶𝑄1, 𝐶𝑄2, ⋯, and 𝐶𝑄𝑛, inside the memory controller. The number of commands blocked inside the RQ is denoted as 𝐶𝑅𝑄. Inside the RQ, commands destined for r1, r2, …, and rn are clustered into command sets 𝑆1, 𝑆2, ⋯, and 𝑆𝑛, respectively. The notation 𝐶𝑆𝑖 represents the number of commands in the command set 𝑆𝑖. For each command set 𝑆𝑖, 𝑅𝑖 denotes the number of read commands, while 𝑊𝑖 denotes the number of write commands in it. Since every command in a command set is either a read or a write, and every command blocked in the RQ belongs to exactly one command set, the relations between these notations can be written as:
𝐶𝑆𝑖 = 𝑅𝑖 + 𝑊𝑖, for 𝑖 = 1, 2, ⋯, 𝑛
𝐶𝑅𝑄 = 𝐶𝑆1 + 𝐶𝑆2 + ⋯ + 𝐶𝑆𝑛
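The counts defined above can be computed directly from the contents of the RQ. The sketch below assumes each command is a (kind, rank) pair; the function name and the example RQ contents are hypothetical, and the example is chosen only so that rank 1 has one read and four writes.

```python
def count_command_sets(rq, n_ranks):
    """Return, per rank i, the counts C_S (all), R (reads), and W (writes)."""
    stats = {i: {"C_S": 0, "R": 0, "W": 0} for i in range(1, n_ranks + 1)}
    for kind, rank in rq:
        stats[rank]["C_S"] += 1      # every command belongs to one set
        stats[rank][kind] += 1       # kind is 'R' or 'W'
    return stats


# Hypothetical RQ contents for a four-rank DRAM.
rq = [("W", 1), ("W", 2), ("R", 1), ("W", 3),
      ("W", 1), ("R", 3), ("W", 1), ("W", 1)]
stats = count_command_sets(rq, n_ranks=4)

# Each set splits into reads and writes, and the sets together cover the RQ.
per_set_ok = all(s["C_S"] == s["R"] + s["W"] for s in stats.values())
total = sum(s["C_S"] for s in stats.values())
print(stats[1], per_set_ok, total == len(rq))
```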
Using these notations, the details of the read-write aware throttling mechanism and the rank level read-write reordering are described in the following sections.
3.2 Read-Write Aware Throttling
The read-write aware throttling mechanism determines if a rank should be turned on whenever the throttle delay is reached. It checks on each command set 𝑆𝑖, which is composed of memory commands destined for ri in the RQ, to see whether the condition 𝑅𝑖 = 0 is true. If the condition is satisfied, ri is turned off and all the commands in 𝑆𝑖 remain in the RQ for another throttle delay. On the other hand, if the condition is not satisfied, ri is turned on and all the commands in 𝑆𝑖 are allowed to be sent to CQi.
The read-write aware throttling utilizes the fact that read requests affect system performance more than write requests [26]. It operates at the rank level and checks every command set for critical read requests whenever the throttle delay is reached.
If a read request appears in a command set, the memory controller sets the target rank of this command set to urgent. Ranks with no pending read requests are set to trivial. All the commands destined for an urgent rank are allowed to be sent to the corresponding CQ, while other commands remain in the RQ for another throttle delay. This allows the memory controller to only turn on the urgent ranks and keep the trivial ranks in the low power mode, contributing to a large DRAM power saving.
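The urgent and trivial classification can be sketched as a check on each command set. The function name and the dictionary layout are hypothetical. The per-rank read and write counts in the example match those of the Fig. 9 example discussed later in this section, but the order of commands inside each set is assumed.

```python
def read_write_aware_release(command_sets):
    """Split command sets into released (urgent rank) and held (trivial rank).

    command_sets maps rank i to the list of command kinds ('R' or 'W') in S_i.
    """
    released, held = {}, {}
    for rank, cmds in command_sets.items():
        if "R" in cmds:              # R_i > 0: the rank is urgent
            released[rank] = cmds
        else:                        # R_i = 0: the rank is trivial
            held[rank] = cmds
    return released, held


command_sets = {
    1: ["W", "R", "W", "W", "W"],    # one read, four writes
    2: ["W"],                        # writes only
    3: ["W", "R"],                   # one read, one write
    4: [],                           # empty set
}
released, held = read_write_aware_release(command_sets)
urgent_ranks = sorted(released)
trivial_ranks = sorted(held)
print(urgent_ranks, trivial_ranks)
```

Only r1 and r3 are turned on in this example. The write waiting for r2 stays in the RQ for another throttle delay, and r2 and r4 remain in the low power mode.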
To better understand how the read-write aware throttling mechanism works, we give a simple example. Suppose that there are four ranks in the DRAM. At a certain point, the throttle delay is reached, and the memory commands blocked inside the RQ are as shown in the left part of Fig. 9. Before the commands are sent to the CQs, they are first clustered into command sets according to their target ranks. The command set destined for the same rank as the command that entered the RQ first is moved to the front. The command set destined for the same rank as the command that sits right after the command set at the front is placed second, and so on. The order of the commands within the same command set remains the same: the command that entered the RQ earlier is closer to the front of the command
set. The request command sequence after clustering is shown in the right part of Fig. 9, and the pseudo code of clustering command sets is given below Fig. 9. In the pseudo code, the DRAM is assumed to have n ranks and 𝑐𝑚𝑑𝑗 represents the command in the jth slot in the RQ. The action insert in line 8 reorders a command to the 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖]th slot and pushes all the commands behind it one slot towards the end of the RQ.
Fig. 9 An example of how commands blocked in the RQ are clustered into command sets.
function ClusterCommandSets( ):
1.  for 𝑙 ← 1 to 𝑛 do
2.      𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑙] ← −1
3.  for 𝑗 ← 1 to 𝐶𝑅𝑄 do
4.      // Let the target rank of 𝑐𝑚𝑑𝑗 be 𝑖
5.      if 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] = −1 then
6.          𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] ← 𝑗 + 1
7.      else
8.          insert 𝑐𝑚𝑑𝑗 to 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖]
9.          for 𝑘 ← 1 to 𝑛 do
10.             if 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] > 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] then
11.                 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] ← 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] + 1
12.         𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] ← 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] + 1
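The pseudo code can be translated into a runnable sketch as follows. It uses 0-based slots instead of the 1-based slots of the pseudo code, represents each command as a hypothetical (name, rank) pair, and works on a copy of the RQ. The input sequence is made up for illustration and is not the one drawn in Fig. 9.

```python
def cluster_command_sets(rq, n_ranks):
    """Cluster commands by target rank, sets ordered by first arrival."""
    rq = list(rq)                                    # work on a copy
    # mark[i] is the slot just past the last command of S_i (-1: no set yet)
    mark = {rank: -1 for rank in range(1, n_ranks + 1)}
    for j in range(len(rq)):
        cmd = rq[j]
        i = cmd[1]                                   # target rank of cmd_j
        if mark[i] == -1:
            mark[i] = j + 1                          # first command of S_i
        else:
            del rq[j]                                # "insert" of line 8:
            rq.insert(mark[i], cmd)                  # move cmd to the end of S_i
            for k in mark:
                if mark[k] > mark[i]:
                    mark[k] += 1                     # later sets shift by one
            mark[i] += 1
    return rq


rq = [("W1", 1), ("R2", 3), ("W3", 1), ("R4", 2), ("W5", 3), ("W6", 1)]
clustered = cluster_command_sets(rq, n_ranks=4)
names = [name for name, _ in clustered]
print(names)
```

The commands for rank 1 end up at the front because W1 entered the RQ first, followed by the commands for rank 3 and then rank 2, and the arrival order inside each set is unchanged.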
From Fig. 9, we can see that:
𝐶𝑆1 = 5, 𝑅1 = 1, 𝑊1 = 4
𝐶𝑆2 = 1, 𝑅2 = 0, 𝑊2 = 1
𝐶𝑆3 = 2, 𝑅3 = 1, 𝑊3 = 1
𝐶𝑆4 = 0, 𝑅4 = 0, 𝑊4 = 0
The read-write aware throttling then checks for the command sets containing no read request commands. Since the command set 𝑆4 has no commands in it, its target rank r4 is considered to be trivial and is turned off to save power. The command set 𝑆2 contains no read requests and thus its target rank r2 is also considered trivial and is turned off. All the commands inside 𝑆2 are kept blocked in the RQ for another throttle delay. The ranks r1 and r3 are