Chapter 2 Problem Description
2.2 Problem Statement
With the system model described in the previous section, the goal of this thesis is to find a DRAM scheduling scheme for the memory controller that reduces DRAM power with only a small degradation in system performance. The scheduling scheme includes a throttling mechanism, which controls when the RQ starts sending commands to the CQs. The scheme also contains a scheduling policy for the scheduler inside the RQ, which is able to reorder the sequence of memory request commands in the RQ. Internal commands such as ACT and PRE are not considered by the scheduler inside the RQ, since they are generated by the memory controller and placed directly in the CQs. The scheduling policy should guarantee that the reordered sequence of memory commands is hazard-free. Finally, the scheme is in charge of determining when to turn the DRAM ranks on and off.
By using the throttling mechanism, reordering the memory commands and controlling the power mode of ranks, the goal of the proposed scheme is to reduce the power consumption of the DRAM with minor system performance degradation.
Chapter 3 The Proposed Techniques
3.1 Overview
The proposed DRAM power reduction techniques aim at lowering DRAM power consumption with only slight system performance degradation. Since the proposed techniques are based on existing policies, this section starts with the basics and shows the flow chart of the greedy memory controller in Fig. 6.
Fig. 6 Flow chart of a greedy memory controller, which employs the greedy power-down policy.
The greedy memory controller employs the greedy power-down policy, which turns off a DRAM rank whenever it is idle. In the greedy power-down policy, an idle rank is turned off even when there are pending commands destined for it in the RQ. The greedy memory controller does not employ the throttling mechanism. Therefore, whenever the RQ is not empty, a single memory command is sent from the RQ to the corresponding CQ every cycle.
As shown in Fig. 6, at any given cycle, the memory commands from the last level cache are pushed to the end of the RQ. The memory controller checks the state of each rank and turns off the ranks that are idle. After turning off the idle ranks, the memory controller sends the command at the front of the RQ to its corresponding CQ. Finally, the memory controller checks each CQ to see whether it is empty. If CQi is not empty, the memory controller turns on ri and issues the commands to it.
Since the greedy memory controller turns off an idle rank regardless of the upcoming memory commands in the RQ, the idle rank is turned off even during a short idle period. This results in frequent power mode transitions and leads to dramatic system performance degradation caused by the power mode transition delays [20].
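The per-cycle behavior of the greedy memory controller can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this thesis. It assumes that a rank is idle exactly when its CQ is empty, that each CQ issues one command per cycle, and that transition delays are not modeled; the function name greedy_cycle and the data layout are hypothetical.

```python
from collections import deque

def greedy_cycle(rq, cqs, rank_on, incoming):
    """Run one cycle of a greedy controller; return the number of power-mode transitions."""
    transitions = 0
    rq.extend(incoming)                    # new commands enter at the end of the RQ
    for r, cq in enumerate(cqs):           # greedy power-down: turn off every idle rank
        if rank_on[r] and not cq:          # assumption: idle means the CQ is empty
            rank_on[r] = False
            transitions += 1
    if rq:                                 # no throttling: one command per cycle leaves the RQ
        rank, cmd = rq.popleft()
        cqs[rank].append(cmd)
    for r, cq in enumerate(cqs):           # turn on ranks with pending commands and issue
        if cq:
            if not rank_on[r]:
                rank_on[r] = True
                transitions += 1
            cq.popleft()                   # assumption: one command issued per CQ per cycle
    return transitions

rq = deque()
cqs = [deque(), deque()]
rank_on = [False, False]
# Three back-to-back commands for rank 0, one arriving per cycle.
stream = [[(0, "R")], [(0, "W")], [(0, "R")]]
total = sum(greedy_cycle(rq, cqs, rank_on, cmds) for cmds in stream)
print(total)  # 5 transitions for 3 commands under these assumptions
```

Under these assumptions the rank is powered down and up again in every cycle after the first, even though a command for it is already waiting in the RQ, which is the frequent-transition behavior described above.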
To improve the power reduction and the system performance of the greedy memory controller, the previous work [21] adds the throttling mechanism, the queue-aware power-down policy, and the power-aware memory scheduler to the greedy memory controller. The flow chart of the previous work [21] is shown in Fig. 7.
Fig. 7 Flow chart of the memory controller proposed in the previous work [21].
The light-gray rectangles and decision boxes in Fig. 7 are power reduction techniques added by the previous work [21]. The throttling mechanism, which is shown as the first light-gray decision box in the flow chart, blocks memory commands after they are pushed to the RQ until the throttle delay is reached. The blocked commands are not allowed to be sent to the CQ.
When the throttle delay is reached, the memory controller clusters the blocked commands into command sets by their target ranks. Commands destined for ri are clustered into command set 𝑆𝑖. The set of commands destined for the same rank as the command that first entered the RQ is moved to the front of the RQ, and so on. The memory controller then checks each command set: if there is any command in 𝑆𝑖, all the commands in 𝑆𝑖 are allowed to be sent to the CQ. The memory controller then sends the command at the front of the RQ to the corresponding CQ. At the end of each cycle, the memory controller checks each CQ to see whether there are commands in it. If CQi is empty, the memory controller sends a power-down command to ri, which turns ri off.
To sum up, the techniques proposed in the previous work [21] improve both the power reduction and the system performance of the greedy memory controller. The power-aware memory scheduler in the previous work clusters commands according to their target ranks. This forces the DRAM to concentrate on accessing a certain rank for a period of time, allowing other ranks to be turned off. The queue-aware power-down policy checks the upcoming commands in the CQ before turning off a rank, which prevents a rank from being turned off during a short idle period. When a rank is turned off, the throttling mechanism ensures that the rank stays in the low power mode for a long period of time.
However, the previous work [21] does not maximize the capability of the memory controller. Moreover, it does not take into consideration that read requests are more critical to system performance than write requests. Utilizing this fact, this thesis proposes the read-write aware throttling and the rank level read-write reordering techniques to improve on the previous work [21]. The flow chart of the memory controller that employs the proposed techniques is shown in Fig. 8.
Fig. 8 Flow chart of the memory controller employing the proposed techniques.
The dark-gray rectangle and decision box in Fig. 8 are the techniques that this thesis adds to the previous work [21]: the decision box is the read-write aware throttling and the rectangle is the rank level read-write reordering.
The read-write aware throttling mechanism, which is depicted by the dark-gray decision box, checks for the existence of read commands in each command set. Instead of allowing all the nonempty command sets to be sent to the CQs, only the command sets containing read requests are allowed to be sent to the CQs and only their target ranks are turned on. The other ranks, including ranks with write requests pending in the RQ, stay in the low power mode for another throttle delay to reduce the DRAM power consumption.
The dark-gray rectangle shows that the command sets are reordered by the rank level read-write reordering before they are sent to the CQs. The read requests in each command set get higher priorities than the write requests. The commands with higher priorities enter the CQs earlier. Since the CQ issues commands to the DIMM in FIFO order, the read requests reach their target rank as soon as possible. As a result, read requests, which are critical to system performance, are served by the DIMMs earlier, and the system performance degradation caused by the throttling mechanism is relieved.
For simplicity, the following notations are used in this thesis. Suppose that there are n ranks in the DRAM. There is one RQ and n CQs, 𝐶𝑄1, 𝐶𝑄2, ⋯, and 𝐶𝑄𝑛, inside the memory controller. The number of commands blocked inside the RQ is denoted as 𝐶𝑅𝑄. Inside the RQ, commands destined for r1, r2, ⋯, and rn are clustered into command sets 𝑆1, 𝑆2, ⋯, and 𝑆𝑛, respectively. The notation 𝐶𝑆𝑖 represents the number of commands in the command set 𝑆𝑖. For each command set 𝑆𝑖, 𝑅𝑖 denotes the number of read commands, while 𝑊𝑖 denotes the number of write commands in it. Therefore, the relation between these notations can be written as:
\[ C_{RQ} = \sum_{i=1}^{n} C_{S_i} = \sum_{i=1}^{n} \left( R_i + W_i \right) \]
Using these notations, the details of the read-write aware throttling mechanism and the rank level read-write reordering are described in the following sections.
3.2 Read-Write Aware Throttling
The read-write aware throttling mechanism determines whether a rank should be turned on whenever the throttle delay is reached. It checks each command set 𝑆𝑖, which is composed of the memory commands destined for ri in the RQ, to see whether the condition 𝑅𝑖 = 0 is true. If the condition is satisfied, ri is turned off and all the commands in 𝑆𝑖 remain in the RQ for another throttle delay. On the other hand, if the condition is not satisfied, ri is turned on and all the commands in 𝑆𝑖 are allowed to be sent to CQi.
The read-write aware throttling utilizes the fact that read requests affect system performance more than write requests [26]. It is performed on rank level and checks the existence of critical read requests in every command set whenever the throttle delay is reached.
If a read request appears in a command set, the memory controller sets the target rank of this command set to urgent. Ranks with no pending read requests are set to trivial. All the commands destined for an urgent rank are allowed to be sent to the corresponding CQ, while other commands remain in the RQ for another throttle delay. This allows the memory controller to only turn on the urgent ranks and keep the trivial ranks in the low power mode, contributing to a large DRAM power saving.
To better understand how the read-write aware throttling mechanism works, we give a simple example. Suppose that there are four ranks in the DRAM. At a certain point, the throttle delay is reached and the memory commands are blocked inside the RQ, as shown in the left part of Fig. 9. Before the commands are sent to the CQs, they are first clustered into command sets according to their target ranks. The set of commands destined for the same rank as the command that entered the RQ first is reordered to the front. The set of commands destined for the same rank as the command that sits right after the command set at the front is reordered to second from the front, and so on. The order of the commands within the same command set remains the same: a command that entered the RQ earlier is closer to the front of the command set. The request command sequence after clustering is shown in the right part of Fig. 9, and the pseudo code of clustering command sets is given below Fig. 9. In the pseudo code, the DRAM is assumed to have n ranks and 𝑐𝑚𝑑𝑗 represents the command in the jth slot in the RQ. The action insert in line 8 moves a command to the 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖]th slot and pushes all the commands behind it one slot towards the end of the RQ.
Fig. 9 An example of how commands blocked in the RQ are clustered into command sets.
function ClusterCommandSets( ):
1. for 𝑙 ← 1 to 𝑛 do
2.     𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑙] ← −1
3. for 𝑗 ← 1 to 𝐶𝑅𝑄 do
4.     // Let the target rank of 𝑐𝑚𝑑𝑗 be 𝑖
5.     if 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] = −1 then
6.         𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] ← 𝑗 + 1
7.     else
8.         insert 𝑐𝑚𝑑𝑗 to 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖]
9.         for 𝑘 ← 1 to 𝑛 do
10.            if 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] > 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] then
11.                𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] ← 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑘] + 1
12.        𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] ← 𝑐𝑚𝑑𝑆𝑒𝑡𝑀𝑎𝑟𝑘[𝑖] + 1
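For readers who prefer executable code, the clustering pseudo code can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: indices are 0-based instead of 1-based, each command is a hypothetical (rank, label) tuple, and the RQ is modeled as a Python list rather than a hardware queue.

```python
def cluster_command_sets(rq, n):
    """Cluster RQ commands by target rank, keeping FIFO order inside each command set.

    rq is a list of (rank, label) tuples with rank in 0..n-1; a new list is returned.
    """
    rq = list(rq)
    mark = [-1] * n                      # mark[i]: slot just past the last command of S_i
    for j in range(len(rq)):
        i = rq[j][0]                     # target rank of the command in slot j
        if mark[i] == -1:
            mark[i] = j + 1              # first command of S_i stays where it is
        else:
            rq.insert(mark[i], rq.pop(j))    # move the command to the end of S_i
            for k in range(n):
                if mark[k] > mark[i]:
                    mark[k] += 1         # command sets behind S_i shift by one slot
            mark[i] += 1
    return rq

# Hypothetical RQ contents for a four-rank DRAM (rank 3 has no commands).
rq = [(0, "a"), (1, "b"), (0, "c"), (2, "d"), (1, "e"), (0, "f")]
clustered = cluster_command_sets(rq, 4)
print([label for _, label in clustered])  # ['a', 'c', 'f', 'b', 'e', 'd']
```

The command sets appear in the order in which their first commands entered the RQ, and the arrival order inside each set is preserved.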
From Fig. 9, we can see that:
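\[
\left\{
\begin{array}{lll}
C_{S_1} = 5, & R_1 = 1, & W_1 = 4 \\
C_{S_2} = 1, & R_2 = 0, & W_2 = 1 \\
C_{S_3} = 2, & R_3 = 1, & W_3 = 1 \\
C_{S_4} = 0, & R_4 = 0, & W_4 = 0
\end{array}
\right.
\]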
The read-write aware throttling then checks for the command sets containing no read request commands. Since the command set 𝑆4 has no commands in it, its target rank r4 is considered to be trivial and is turned off to save power. The command set 𝑆2 contains no read requests and thus its target rank r2 is also considered trivial and is turned off. All the commands inside 𝑆2 are kept blocked in the RQ for another throttle delay. The ranks r1 and r3 are considered urgent because there are read requests in 𝑆1 and 𝑆3. Therefore, commands in 𝑆1 and 𝑆3 are allowed to be sent to 𝐶𝑄1 and 𝐶𝑄3, respectively. The memory controller turns on r1 and r3 to process the commands in 𝑆1 and 𝑆3.
The pseudo code of the read-write aware throttling is given below. The function ReadWriteReorder in line 8 will be explained in the next section.
In the implementation, a one-bit register is used for each command set to record the existence of read commands, and it is updated whenever a new command enters the RQ. Therefore, the complexity of this procedure is lower than that of the pseudo code, since the if condition in line 5 of the pseudo code is replaced by the registers.
10. send the command at the front of the RQ to the corresponding CQ if it is allowed to be sent
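The urgent/trivial decision described in this section can be sketched in Python as follows. This is only an illustration: the function name read_write_aware_throttle and the dictionary layout are hypothetical, and the order of reads and writes inside each command set is made up so that only the counts match the example of Fig. 9.

```python
def read_write_aware_throttle(command_sets):
    """Split ranks into urgent (R_i > 0) and trivial (R_i = 0) when the throttle delay is reached.

    command_sets maps a rank number to the list of 'R'/'W' commands blocked for it.
    """
    urgent, trivial = [], []
    for rank in sorted(command_sets):
        reads = sum(1 for cmd in command_sets[rank] if cmd == "R")
        if reads > 0:
            urgent.append(rank)      # turned on; its commands may go to the CQ
        else:
            trivial.append(rank)     # stays in low power mode for another throttle delay
    return urgent, trivial

# Counts taken from the example: (C_S, R, W) = (5,1,4), (1,0,1), (2,1,1), (0,0,0).
command_sets = {
    1: ["W", "W", "R", "W", "W"],   # internal order is an assumption
    2: ["W"],
    3: ["W", "R"],                  # internal order is an assumption
    4: [],
}
urgent, trivial = read_write_aware_throttle(command_sets)
print(urgent, trivial)  # [1, 3] [2, 4]
```

In hardware, the loop over each command set would be replaced by the one-bit register per command set mentioned above.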
3.3 Rank Level Read-Write Reordering
The rank level read-write reordering is a scheduling policy for the command sets containing read requests in the RQ. It gives read requests higher priority than write requests.
The commands in a command set are sent to the CQ in descending order of priority. Since the CQ issues commands to the DIMM in FIFO order, read requests are issued to the DIMM prior to write requests. This forces the DIMM inside the DRAM to process read requests, which are critical to system performance, as soon as possible. The rank level read-write reordering effectively relieves the system performance degradation caused by the DRAM power management policy.
The system performance degradation is greatly relieved if the read requests are sent to the CQ prior to all the write requests. However, reordering memory commands blindly can lead to a data hazard, since the memory commands are no longer handled by the DIMM in the same order as the processors sent them out. If a read request enters the RQ after a write request and both are destined for the same address, a read-after-write (RAW) data hazard occurs if the DIMM returns data for this read request before the write request is served. To avoid RAW hazards, the rank level read-write reordering performs a check before reordering. For each read request in a command set, the rank level read-write reordering checks all the write requests in the same command set that entered the RQ earlier than this read request. If one or more write requests target the same address as the read request, they are combined with the read request in their original order to form a command group. All the command groups are then reordered so that the FIFO order of the read requests across command groups is preserved.
An example is given in Fig. 10, where each rectangle represents a memory command with its access type denoted by W (Write) or R (Read) followed by an index number and the target address of the command. The index numbers are assigned to the memory commands according to the time they entered the RQ. A command that enters the RQ earlier gets a smaller index number. Fig. 10 illustrates how the commands in a command set 𝑆1 are combined into command groups and reordered.
Fig. 10 An example of how commands in a given command set S1 are combined into command groups and then reordered.
As shown in Fig. 10, the rank level read-write reordering checks 𝑆1 for read requests. Command R5 is found first, and the rank level read-write reordering searches through W1 to W4 and finds that W2 and W4 have the same target address as R5. Therefore, W2, W4, and R5 are combined into a command group, and the order of these three commands is preserved inside the command group. The rank level read-write reordering then finds R6 and W3, which has the same target address as R6. W3 and R6 are thus combined into a command group. Since R5 entered 𝑆1 before R6, the command group containing R5 is placed in front of the command group containing R6. Commands that are not in any command group are placed in FIFO order behind the last command group.
After combining command groups and reordering, the reordered command sets are sent to the CQs. The target ranks of these command sets are turned on to process these commands. Each rank is switched back to the low power mode once all of its commands are finished and is kept in the low power mode until the throttle delay is once again reached.
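The grouping and reordering of the Fig. 10 example can be reproduced with a short Python sketch. It builds a new list instead of using the in-place insert action of the memory controller, and it treats a read request with no matching earlier write as a command group of its own; both are assumptions of this sketch. The concrete address values are hypothetical and only reflect which commands share an address in the example.

```python
from typing import List, NamedTuple

class Cmd(NamedTuple):
    kind: str    # 'R' for read, 'W' for write
    index: int   # arrival order in the RQ
    addr: int    # target address (hypothetical values below)

def read_write_reorder(cmd_set: List[Cmd]) -> List[Cmd]:
    """Move reads forward while keeping earlier same-address writes ahead of them."""
    grouped = [False] * len(cmd_set)
    front: List[Cmd] = []
    for j, cmd in enumerate(cmd_set):
        if cmd.kind != "R":
            continue
        # Collect ungrouped earlier writes to the same address, in original order.
        for k in range(j):
            prev = cmd_set[k]
            if not grouped[k] and prev.kind == "W" and prev.addr == cmd.addr:
                front.append(prev)
                grouped[k] = True
        front.append(cmd)            # the read closes its command group
        grouped[j] = True
    # Commands outside any command group follow in FIFO order.
    rest = [c for k, c in enumerate(cmd_set) if not grouped[k]]
    return front + rest

s1 = [Cmd("W", 1, 0x10), Cmd("W", 2, 0x20), Cmd("W", 3, 0x30),
      Cmd("W", 4, 0x20), Cmd("R", 5, 0x20), Cmd("R", 6, 0x30)]
order = [f"{c.kind}{c.index}" for c in read_write_reorder(s1)]
print(order)  # ['W2', 'W4', 'R5', 'W3', 'R6', 'W1']
```

The output matches the order described for Fig. 10: the group of R5 first, the group of R6 second, and the ungrouped W1 last.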
The pseudo code of the rank level read-write reordering is given below. In the pseudo code, the command set 𝑆𝑖 is assumed to have 𝐶𝑆𝑖 commands, and the notations 𝑐𝑚𝑑𝑗 and 𝑐𝑚𝑑𝑘 represent the jth and kth command in the command set, respectively. Notice that in line 6 and line 9, the action insert moves the command to the markth slot in the command set. For instance, line 6 moves 𝑐𝑚𝑑𝑘 to the markth slot, and all the commands originally sitting in the markth through (𝑘 − 1)th slots are shifted to the (𝑚𝑎𝑟𝑘 + 1)th through kth slots, in order.
As shown in the pseudo code, the rank level read-write reordering checks for read requests among all the commands in a command set from the front to the end of the command set. For each read request found, the rank level read-write reordering searches through the commands that are in front of the read request and do not belong to any command group. All the commands with the same target address as the read request are combined into a command group.
function ReadWriteReorder(𝑆𝑖):
In the implementation, the action insert is provided by the modern memory controller and no extra hardware is required. The address comparison in line 5 uses a comparator, whose width can range from 1 bit to the full length of the memory address. When a small comparator is used, the rank level read-write reordering is more conservative and tends to insert more write requests in front of the read requests. The effect on relieving system performance degradation may be slightly weakened if a small comparator is used.
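The effect of the comparator width can be illustrated with a small Python sketch. It assumes that a narrow comparator compares only the low-order bits of the two addresses; which bits are actually compared is not specified here, and the function name addresses_match is hypothetical.

```python
def addresses_match(a: int, b: int, bits: int) -> bool:
    """Compare only the low-order `bits` bits of two addresses (assumed comparator model)."""
    mask = (1 << bits) - 1
    return (a & mask) == (b & mask)

# Two different addresses that share their low 8 bits.
write_addr, read_addr = 0x120, 0x220
print(addresses_match(write_addr, read_addr, 8))   # True: the write is kept ahead of the read
print(addresses_match(write_addr, read_addr, 12))  # False: the read may pass the write
```

Equal addresses match at every width, so a narrow comparator never misses a true RAW dependence; it only produces extra matches, which is why the reordering becomes more conservative.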
3.4 An Example of The Proposed Policy
In the proposed scheduling policy, the rank level read-write reordering is combined with the read-write aware throttling. To understand how the proposed policy works, a complete example is given as follows. Assume that there are four ranks in the DRAM and the throttle delay is reached at cycle k. Suppose that at cycle k, the command pattern in the RQ is as shown in the left part of Fig. 11. The commands are clustered into command sets according to their target ranks, and the result is shown in the right part of Fig. 11. Notice that the command set 𝑆4 contains no commands and is therefore omitted in Fig. 11.
Fig. 11 An example of how the read-write aware throttling clusters commands in the RQ and determines which ranks should be turned on when the throttle delay is reached.
When the throttle delay is reached at cycle k, the read-write aware throttling is first performed to check each command set for the existence of read commands. As a result, r1 and r3 are considered urgent, while r2 and r4 are considered trivial. After the read-write aware throttling, the rank level read-write reordering is performed on command sets 𝑆1 and 𝑆3 before they are sent to CQ1 and CQ3. Using 𝑆1 as an example, Fig. 12 shows how the commands in 𝑆1 are reordered and sent to the CQ. The command at the front of the command set has the highest priority while the command at the end has the lowest priority. The commands