Experimental Results - NVM Duet - Cross-Layer Design and Optimization of Non-Volatile Memory for Servers

4.3 NVM Duet

4.4.2 Experimental Results

The semantics and correctness of persistent data handling were described in Section 4.3.1.

Our evaluation thus focuses on the performance gain achieved by NVM Duet.

[Figure 4.11: Benefits of Duet Scheduler for load latency. Y-axis: normalized load latency; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Rule 1, Rule 1+2, No Consistency.]

Duet Scheduler

We first focus on evaluating the benefits of Duet Scheduler only (i.e., Dual-Retention PCM is not adopted). Figure 4.11 shows the average load latency in the simulated system.

We compare the baseline with a system that adopts Rule 1 and a system that adopts both Rule 1 and Rule 2. We also evaluate a system in which the memory scheduler ignores all barriers and offers no consistency guarantee. This no-consistency system represents the ideal case that fully exploits the bank-level parallelism of write requests. The results are normalized to the baseline. The load latency of the no-consistency system is reduced by up to 42% (30% on average) compared with the baseline. The proposed Duet Scheduler captures much of this reduction. With Rule 1 alone, the load latency is reduced by up to 33% (19% on average). If Rule 1 and Rule 2 are both adopted, the load latency is reduced by up to 38% (24% on average).

[Figure 4.12: Benefits of Duet Scheduler for IPC. Y-axis: IPC speedup; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Rule 1, Rule 1+2, No Consistency.]

Figure 4.13: Overhead analysis vs. refresh interval

Figure 4.12 plots the IPC speedup achieved by Duet Scheduler. With Rule 1, IPC is improved by up to 1.44× (1.2× on average). Combining Rule 1 and Rule 2 improves IPC by up to 1.53× (1.23× on average). We observe that Duet Scheduler achieves an IPC speedup that is close to that of the no-consistency case. The difference is within 7% on average, which means that the proposed scheduling rules successfully maximize the utilization of the available bank-level parallelism.

Note that in Figure 4.12, the speedup increases along the horizontal axis, reaches a peak, and then gradually decreases. This is because we arrange the workload mixes in ascending order of WM% (the percentage of working-memory writes among all writes). Duet Scheduler is beneficial to write streams containing a mix of working-memory and persistent-store accesses. It neither helps nor hurts workloads with only working-memory or only persistent-store writes. Therefore, the speedup increases as WM% increases and reaches a peak for bwav, which consists of a mix of 35% persistent-store writes and 65% working-memory writes. Subsequently, the speedup gradually drops as WM% further increases to 76% for bzip.

Deciding on the Appropriate Refresh Interval

Before we analyze Dual-Retention PCM, we must decide on the appropriate refresh interval (which is the same as the retention capability of the working memory). A shorter retention time requires more frequent refreshes. Therefore, the refresh interval is a key design parameter in NVM Duet. The refresh interval can span several orders of magnitude, from 10⁷ seconds down to a few seconds or even shorter. Thus, it would be time-consuming to perform a full exploration of such a large design space using a detailed simulation. Therefore, in this section, we first present analytical models that enable quick evaluation of the refresh overhead in terms of power, lifetime, and performance. We identify two refresh-interval candidates of 1000 s and 100 s. Full-system simulations are subsequently performed to evaluate the overall performance of the proposed architecture.

Refreshing PCM not only causes power and performance overhead but also degrades the PCM lifetime because of extra writes. Estimating the extra power incurred by refreshing is straightforward. Because each cell must be refreshed once per refresh interval $T_{\text{refresh}}$, the refresh operations consume an energy of (N · E) during each interval, where N is the total number of PCM cells and E is the average energy consumption required to refresh a cell. Therefore, the extra power P is the total energy consumed by refreshing all PCM cells divided by the refresh interval:

$$P = \frac{N \cdot E}{T_{\text{refresh}}} \qquad (4.5)$$
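As a rough illustration of Eq. (4.5), the following sketch evaluates the extra refresh power using the N and E expressions listed in Table 4.6. The speedup factor S is defined in a table that is not reproduced in this section, so the write-iteration count (8.5/S) is passed in as a parameter, and the value 4.0 used below is a hypothetical placeholder rather than a figure from the source. The function names are likewise illustrative.

```python
# First-order refresh power model, Eq. (4.5): P = N * E / T_refresh.

N_CELLS = (8 * 2**30 * 8) // 2   # Table 4.6: 8 GB of PCM, 2 bits per cell
READ_PJ = 2.5                    # Table 4.6: read energy per cell (pJ)
WRITE_PJ = 17.0                  # Table 4.6: write energy per cell (pJ)
REFRESH_PROB = 0.5               # Table 4.6: 50% chance a cell is rewritten


def refresh_energy_pj(write_iterations: float) -> float:
    """Average energy E (pJ) to refresh one cell.

    write_iterations stands for the (8.5/S) term of Table 4.6.
    """
    return READ_PJ + WRITE_PJ * write_iterations * REFRESH_PROB


def refresh_power_w(t_refresh_s: float, write_iterations: float) -> float:
    """Extra power P (W) when every cell is refreshed once per t_refresh_s."""
    energy_j = N_CELLS * refresh_energy_pj(write_iterations) * 1e-12
    return energy_j / t_refresh_s


# ASSUMPTION: placeholder for (8.5/S); not a value taken from the source.
ASSUMED_ITERATIONS = 4.0

for t in (100.0, 1000.0):
    p = refresh_power_w(t, ASSUMED_ITERATIONS)
    print(f"T_refresh = {t:6.0f} s -> extra power = {p:.4f} W")
```

Because P is inversely proportional to the refresh interval, shortening the interval by a factor of ten raises the refresh power by the same factor for a fixed E.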

Compared with power, it is more challenging to estimate the performance impact of refreshes on a memory system without resorting to a detailed simulation. To perform a first-order estimation, we use a simple metric known as bank usage, i.e., the average percentage of time that a PCM bank spends accessing its array. The extra bank usage, U%, due to refreshes can be modeled as follows:

$$U\% = \frac{N \cdot R}{B \cdot C} \cdot \frac{1}{T_{\text{refresh}}} \qquad (4.6)$$

where N is the total number of PCM cells, R is the latency of each refresh, B is the number of memory banks, and C is the number of cells that are refreshed at a time (per bank). Then, $\frac{N \cdot R}{B \cdot C}$ is the total time that a bank spends in refresh operations during one $T_{\text{refresh}}$ interval.
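The bank-usage model of Eq. (4.6) can be sketched in the same first-order spirit. The values of N, B, C, and the 250-ns read and write-iteration latencies follow Table 4.6. The number of write iterations (8.5/S) depends on a speedup factor S that is not given in this section, so the 4.0 used here is only an assumed placeholder, and the function names are illustrative.

```python
# First-order extra bank usage, Eq. (4.6): U = (N * R) / (B * C) / T_refresh.

N_CELLS = (8 * 2**30 * 8) // 2      # Table 4.6: 8 GB of PCM, 2 bits per cell
N_BANKS = 8 * 2                     # Table 4.6: 2 ranks of 8 banks
CELLS_PER_REFRESH = (64 * 8) // 2   # Table 4.6: one 64 B line, 2 bits per cell
READ_NS = 250.0                     # Table 4.6: read latency
WRITE_ITER_NS = 250.0               # Table 4.6: latency of one write iteration


def refresh_latency_ns(write_iterations: float) -> float:
    """Latency R (ns) of one refresh: a read followed by the write iterations."""
    return READ_NS + write_iterations * WRITE_ITER_NS


def extra_bank_usage(t_refresh_s: float, write_iterations: float) -> float:
    """Fraction of time a bank is busy refreshing (0.01 means 1%)."""
    refreshes_per_bank = N_CELLS / (N_BANKS * CELLS_PER_REFRESH)
    busy_s = refreshes_per_bank * refresh_latency_ns(write_iterations) * 1e-9
    return busy_s / t_refresh_s


# ASSUMPTION: placeholder for (8.5/S); not a value taken from the source.
ASSUMED_ITERATIONS = 4.0

for t in (100.0, 1000.0):
    u = extra_bank_usage(t, ASSUMED_ITERATIONS)
    print(f"T_refresh = {t:6.0f} s -> extra bank usage = {u:.2%}")
```

The model treats refreshes as evenly spread over the banks and ignores any overlap with regular requests, which is why the text uses it only as a first-order estimate before the full-system simulation.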

The lifetime can be estimated by dividing the endurance of a cell (denoted as D) by the worn-out rate, assuming that perfect wear leveling is adopted. The worn-out rate is composed of two components: one due to normal writes and one due to refreshes. The former is $\frac{D}{5 \times 365 \times 24 \times 60 \times 60}$, assuming that the original lifetime is five years. The latter is $\frac{Y}{T_{\text{refresh}}}$, where Y is the average number of iterations for refreshing a cell and $T_{\text{refresh}}$ is the refresh interval. Therefore, the resulting lifetime with refresh (denoted as L, in seconds) can be evaluated as follows:

$$L = \frac{D}{\dfrac{D}{5 \times 365 \times 24 \times 60 \times 60} + \dfrac{Y}{T_{\text{refresh}}}}$$
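The lifetime estimate can be checked numerically with a small sketch. It assumes the endurance D = 10⁷ and the 50% refresh probability from Table 4.6 together with the five-year baseline lifetime stated above. The write-iteration count (8.5/S) depends on a speedup factor S defined elsewhere, so the value 4.0 below is a hypothetical placeholder, and the function name is illustrative.

```python
# First-order lifetime model: L = D / (normal write rate + refresh write rate).

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
ENDURANCE = 1e7              # Table 4.6: D, write cycles per cell
BASE_LIFETIME_YEARS = 5.0    # lifetime without refresh, as assumed in the text
REFRESH_PROB = 0.5           # Table 4.6: 50% chance a cell is rewritten


def lifetime_years(t_refresh_s: float, write_iterations: float) -> float:
    """Resulting lifetime L (years) under perfect wear leveling."""
    # Wear per second from normal writes: D spread over the baseline lifetime.
    normal_rate = ENDURANCE / (BASE_LIFETIME_YEARS * SECONDS_PER_YEAR)
    # Wear per second from refreshes: Y iterations every t_refresh_s seconds.
    y = write_iterations * REFRESH_PROB
    refresh_rate = y / t_refresh_s
    return ENDURANCE / (normal_rate + refresh_rate) / SECONDS_PER_YEAR


# ASSUMPTION: placeholder for (8.5/S); not a value taken from the source.
ASSUMED_ITERATIONS = 4.0

for t in (100.0, 1000.0, 1e7):
    print(f"T_refresh = {t:10.0f} s -> lifetime = "
          f"{lifetime_years(t, ASSUMED_ITERATIONS):.2f} years")
```

As the refresh interval grows, the refresh term vanishes and the estimate converges to the five-year baseline, which is a quick sanity check on the formula.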

Figure 4.13 shows the estimated extra power, extra bank usage, and resulting lifetime for a system with 8 GB of PCM. The setting of each parameter is detailed in Table 4.6 (the read and write energies are from [82]). We can observe that for power overhead, refreshes incur at most an extra 0.013 W if the refresh interval is no shorter than 100 s. For the performance aspect, with a 1000-s refresh interval, the bank usage is increased by 1.1%, and the overhead increases to 10% if the refresh interval is shortened to 100 s. For the lifetime analysis, the resulting lifetime is 4.8 years with 1000-s refreshes and decreases to 3.8 years with 100-s refreshes.

Table 4.6: Parameters for deciding on the appropriate refresh interval

| Parameter | Value | Unit | Notes |
|---|---|---|---|
| N | (8×2³⁰×8)/2 | Cell | 8 GB, 2 bits per cell |
| B | 8×2 | Bank | 2 ranks of 8 banks |
| C | (64×8)/2 | Cell | 64 B line, 2 bits per cell |
| D | 10⁷ | Cycle | |
| S | Please see Table 2 | | Speedup factor due to wider target bands |
| R | 250 + (8.5/S)×250 | ns | Read: 250 ns. Write: (8.5/S) iterations of 250 ns |
| E | 2.5 + 17×(8.5/S)×50% | pJ | Read: 2.5 pJ/cell (peripheral is included). Write: 17 pJ/cell (peripheral is included). Assume a 50% chance a cell is refreshed |
| Y | (8.5/S)×50% | Cycle | Write iterations for refreshing: (8.5/S). Assume a 50% chance a cell is refreshed |

From the above analysis, we can note that a 1000-s refresh interval leads to small overheads: 0.0014 W of extra power, 1.1% of extra bank usage, and a small degradation in lifetime from 5 to 4.8 years. In contrast, the overhead caused by 100-s refreshes begins to increase dramatically. Therefore, a refresh interval shorter than 100 s should not be considered any further. Although a 100-s refresh interval causes significantly higher overhead than the 1000-s interval in terms of performance and lifetime in the preliminary analysis, a 100-s refresh also offers additional benefits from relaxing the retention guarantee for the PCM. To evaluate the overall effects, these two intervals are both considered in the detailed simulation. Below, we present the results from full-system simulations in terms of performance. Because the lifetime impact and extra power due to refresh have been estimated using the analytical models, further analysis of the lifetime and power is omitted in this work.

Dual-Retention PCM

In this section, we evaluate the benefits of Dual-Retention PCM alone (i.e., Duet Scheduler is not adopted). Figure 4.14 and Figure 4.15 show the load latency of the baseline and Dual-Retention PCM. We compare two candidates of Dual-Retention PCM: one whose working memory adopts a 1000-s retention capability and the other whose working memory adopts a 100-s retention capability. We also compare different refresh schemes: the basic refresh scheme, the proposed Smart Refresh scheme, and a hypothetical ideal case that relaxes the working memory's retention requirement but requires no refreshes.

We first observe that for the ideal case, lowering the retention of the working memory to 1000 s and 100 s can reduce the load latency by up to 21% and 23% (14% and 16% on average), respectively. If the overheads of basic refresh are considered, load latency is reduced by up to 17% (10% on average) for the 1000-s configuration. However, the 100-s configuration performs worse than the baseline: its load latency increases by up to 65% (41% on average). The proposed Smart Refresh reduces the refreshing overhead. With Smart Refresh, the load latency for the 1000-s configuration is within 1% of the ideal case on average, but the 100-s configuration still performs worse than the baseline. We can observe that the 1000-s configuration consistently outperforms the 100-s version, and therefore, we only evaluate the 1000-s case from now on.

[Figure 4.14: Benefits of Dual-Retention PCM (1000-s retention guarantee) for load latency. Y-axis: normalized load latency; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Basic Refresh, Smart Refresh, Ideal.]

[Figure 4.15: Benefits of Dual-Retention PCM (100-s retention guarantee) for load latency. Y-axis: normalized load latency; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Basic Refresh, Smart Refresh, Ideal.]

[Figure 4.16: Benefits of Dual-Retention PCM for IPC. Y-axis: IPC speedup; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Basic Refresh, Smart Refresh, Ideal.]

Figure 4.16 shows the IPC speedup achieved by Dual-Retention PCM. There is potential for a speedup of up to 1.26× (1.14× on average) for the ideal case. With the refreshing overheads considered, the basic refresh design achieves a speedup of up to 1.18× (1.08× on average). With the proposed Smart Refresh optimization, the speedup reaches up to 1.22× (1.11× on average), which is within 3% of the ideal case. This result implies that the proposed optimization can effectively reduce and hide the refresh overhead.

We can observe that in Figure 4.16, the speedup increases as WM% increases because the performance gain of Dual-Retention PCM comes from relaxing the retention capability of working-memory writes. A higher WM% thus implies that a larger portion of the workload is optimized.

[Figure 4.17: Benefits of NVM Duet for IPC. Y-axis: IPC speedup; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: Baseline, Duet Scheduler, Dual-Retention PCM, NVM Duet.]

NVM Duet — Putting It All Together

Duet Scheduler and Dual-Retention PCM improve the unified architecture from different aspects. The former exploits bank-level parallelism, and the latter reduces the average write latency. Therefore, combining the two mechanisms can lead to greater performance gains. Figure 4.17 compares the IPC speedup achieved by Duet Scheduler, Dual-Retention PCM, and NVM Duet, in which all of the proposed techniques are simultaneously adopted. The results demonstrate that NVM Duet significantly improves performance by up to 1.68× (1.32× on average). The IPC speedup of NVM Duet is within 5% of the product of the IPC speedup numbers of the two individual mechanisms.

This result suggests that the two mechanisms can cooperate well with each other. We also observe that NVM Duet is especially effective for write-intensive workloads, such as bwav and bzip, with WM% ranging from 65% to 76%.
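The multiplicative comparison can be illustrated with the average speedups quoted in this section. The sketch below is only a back-of-the-envelope check on averages: the per-workload values behind Figure 4.17 are not reproduced here, and it assumes that the Dual-Retention PCM series corresponds to the 1000-s Smart Refresh configuration (1.11× on average). The function name is illustrative.

```python
# Compare a combined speedup against the product of individual speedups.

def composition_gap(combined: float, *individual: float) -> float:
    """Relative shortfall of the combined speedup versus the product."""
    product = 1.0
    for speedup in individual:
        product *= speedup
    return (product - combined) / product


# Average IPC speedups quoted in the text.
DUET_SCHEDULER_AVG = 1.23   # Rule 1 + Rule 2
DUAL_RETENTION_AVG = 1.11   # ASSUMPTION: 1000-s retention with Smart Refresh
NVM_DUET_AVG = 1.32         # both techniques combined

gap = composition_gap(NVM_DUET_AVG, DUET_SCHEDULER_AVG, DUAL_RETENTION_AVG)
print(f"product of individual speedups: {DUET_SCHEDULER_AVG * DUAL_RETENTION_AVG:.3f}")
print(f"relative gap to NVM Duet:       {gap:.1%}")
```

A small gap indicates that the two techniques address largely independent bottlenecks, so their gains compose nearly multiplicatively.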

Figure 4.18 shows the IPC speedup achieved by NVM Duet with different ratios of barriers to persistent-store writes. With a higher ratio of barriers, NVM Duet is slightly more effective. The speedup reaches up to 1.68× (1.34× on average) when the ratio is 1:10 and is close to 1.55× (1.27× on average) when the ratio is 1:50.

[Figure 4.18: Benefits of NVM Duet for IPC with different ratios of barriers to persistent-store writes. Y-axis: IPC speedup; x-axis: workloads zeus, libq, mcf, sopl, gcc, lesl, milc, asta, lbm, cact, bwav, perl, gems, bzip; series: 1:50, 1:20, 1:10.]

4.5 Discussion
