System Design - Cross-Layer Design and Optimization of Non-Volatile Memory for Servers

2.5.1 Retention-Aware FTL (Flash Translation Layer)

In this section, we present our flash storage design, which leverages Retention Relaxation to improve either write speed or ECC cost-performance. Specifically, in the proposed flash storage architecture, data can be written into flash with variable write latencies or protected by ECCs of different strengths to obtain different levels of retention guarantees. We refer to data written by these different methods as being in different “modes”. In our design, all data in a physical flash block are in the same mode. We need to record the mode of each flash block to retrieve data from flash correctly. In addition, to avoid data loss due to insufficient retention capability, we have to monitor the remaining retention capability of each flash block. We implement the proposed retention-aware strategy in the flash translation layer (FTL) of flash storage rather than in the OS layer because an FTL-based implementation requires minimal OS/application modification, which we think is important for easy deployment and wide adoption of the proposed scheme.

Figure 2.12: Proposed retention-aware FTL

Figure 2.12 shows the block diagram of the proposed FTL. The proposed FTL is based on the page-level FTL [11] with two additional components: a mode selector (MS) and a retention tracker (RT). For writes, MS sends different write commands to flash chips or invokes different ECC procedures. As discussed in Section 2.3.2, write speed can be improved by adopting a larger ∆VP. Current flash chips only support write commands with a single predefined retention level. We propose that flash chips provide a command interface that exposes their internal control of ∆VP values to support two or more retention levels. MS maintains the mode of each flash block so that, during reads, it can issue the corresponding read commands or invoke the right ECC decoding procedure to retrieve data. RT is responsible for ensuring that no flash block in the flash storage runs out of retention capability. RT uses one counter per flash block to keep track of the remaining retention capability of the block. When the first page of a block is written, the retention capability of this write is stored in the counter. These retention counters are periodically updated. If RT finds that a block is approaching its retention limit, RT schedules background operations to move the valid data in the block to a new block and then invalidates the old one.
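The bookkeeping performed by MS and RT can be illustrated with a small sketch. This is not the implementation evaluated in this work: the class name, the method names, the two-period budget, and the "one period left" migration threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    NORMAL = 0             # normal write, long retention guarantee
    RETENTION_RELAXED = 1  # relaxed write, short retention guarantee


@dataclass
class RetentionAwareFTL:
    # Hypothetical retention budget per mode, counted in checking periods.
    # None means the block is not tracked in this sketch.
    periods_per_mode: dict = field(
        default_factory=lambda: {Mode.RETENTION_RELAXED: 2, Mode.NORMAL: None}
    )
    block_mode: dict = field(default_factory=dict)  # MS state: block -> Mode
    remaining: dict = field(default_factory=dict)   # RT state: block -> periods left

    def write_first_page(self, block: int, mode: Mode) -> None:
        """Record the block's mode and initialize its retention counter."""
        self.block_mode[block] = mode
        budget = self.periods_per_mode[mode]
        if budget is not None:
            self.remaining[block] = budget

    def read_mode(self, block: int) -> Mode:
        """MS lookup that selects the read command or ECC decoder."""
        return self.block_mode[block]

    def periodic_update(self) -> list:
        """Age every tracked block by one period; return blocks to migrate."""
        due = []
        for block in list(self.remaining):
            self.remaining[block] -= 1
            if self.remaining[block] <= 1:  # assumed "approaching limit" rule
                due.append(block)
        return sorted(due)

    def migrate(self, old: int, new: int) -> None:
        """Move valid data to a new normal-mode block; invalidate the old one."""
        self.block_mode.pop(old)
        self.remaining.pop(old, None)
        self.write_first_page(new, Mode.NORMAL)
```

In this sketch, a relaxed block written in one period is reported by the next call to `periodic_update()`, after which `migrate()` rewrites its data in normal mode and drops the old block's state.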

Figure 2.13: BCH-LDPC codes in flash storage

Figure 2.14: Proposed ECC architecture leveraging Retention Relaxation in flash storage

One key parameter in the proposed flash storage design is how many write modes we should employ in flash storage. The optimal setting depends on retention variations in workloads and the cost for supporting multiple write modes. In this work, we present a coarse-grained management method: a dual-retention scheme. There are two kinds of flash writes in flash storage: host writes and background writes. Host writes correspond to writes issued by workloads from hosts to flash storage; background writes comprise cleaning, wear-leveling, and data movement internal to flash storage. Host writes are usually sensitive to the write performance of flash storage, and they usually require short retention as analyzed in Section 2.4. In contrast, background writes are less sensitive to performance, and they usually involve data that have been stored in flash storage without changes for a long time (i.e., cold data) and thus are expected to require long retention.

Based on these observations, we propose employing two levels of retention guarantees for the two kinds of writes. Retention-relaxed writes are used for host writes to exploit the high probability of short retention requirements and to gain performance benefits; normal writes are employed for background writes to preserve a long retention guarantee.
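The dual-retention policy reduces to a decision on the origin of each write. A minimal sketch follows; the `WriteSource` and `WriteMode` names are illustrative and not part of the proposed design.

```python
from enum import Enum


class WriteSource(Enum):
    HOST = "host"                    # issued by the workload
    CLEANING = "cleaning"            # background
    WEAR_LEVELING = "wear_leveling"  # background
    DATA_MOVEMENT = "data_movement"  # background


class WriteMode(Enum):
    RETENTION_RELAXED = "retention_relaxed"
    NORMAL = "normal"


def select_write_mode(source: WriteSource) -> WriteMode:
    """Host writes are retention-relaxed; all background writes are normal."""
    if source is WriteSource.HOST:
        return WriteMode.RETENTION_RELAXED
    return WriteMode.NORMAL
```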

In this study, we target either optimizing write performance or optimizing ECCs for future flash storage. To optimize write performance, host writes in the proposed dual-retention framework are performed at a fast write speed with a reduced retention guarantee. Long-lived data are handled by background writes with normal write speed and a normal retention guarantee. To optimize ECCs for future flash storage, we propose a new ECC architecture. Concatenating an inner LDPC code and an outer BCH code (Figure 2.13) is a typical ECC solution for future flash storage. Concatenating LDPC and BCH exploits the advantages of both codes [159]: LDPC greatly improves the maximum correcting capability, and BCH complements LDPC by eliminating LDPC’s error floor. The main issue with this architecture is that, since every write needs to be encoded with LDPC, a high-throughput LDPC encoder is required to keep it from becoming the throughput bottleneck. Figure 2.14 depicts our proposed ECC architecture, in which host writes are encoded only by the BCH encoder, and LDPC encoding is performed only in the background. In this way, the LDPC encoder is kept off the critical performance path. The benefits of the proposed ECC architecture are threefold. First, write performance is improved since host writes do not go through time-consuming LDPC encoding. Second, since BCH protection suffices for written data that are short-lived, the amount of data that the LDPC encoder needs to process is reduced. Third, LDPC encoding can be performed in a staggered manner in the background. All these advantages lead to an optimized cost-performance design point for the ECC architecture of flash storage.
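The first two benefits can be made concrete with a short sketch. It assumes that background writes use the concatenated code (outer BCH, then inner LDPC) and counts only host data that survives long enough to be reprogrammed; the function names and the example fraction are hypothetical.

```python
def encoder_stages(is_host_write: bool) -> list[str]:
    """Encoders applied to a write, in order of application."""
    if is_host_write:
        return ["BCH"]          # host path skips LDPC entirely
    return ["BCH", "LDPC"]      # assumed: outer BCH first, then inner LDPC


def ldpc_encoder_load(host_bytes: float, short_lived_fraction: float) -> float:
    """Host-written bytes that later reach the LDPC encoder in the background.

    short_lived_fraction is the share of host data that is overwritten or
    invalidated before it needs reprogramming (a workload-dependent input).
    """
    if not 0.0 <= short_lived_fraction <= 1.0:
        raise ValueError("short_lived_fraction must be within [0, 1]")
    return host_bytes * (1.0 - short_lived_fraction)
```

For instance, if three quarters of the host data were short-lived, `ldpc_encoder_load(100.0, 0.75)` would leave only a quarter of the written volume for the LDPC encoder.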

We present two specific Retention Relaxation implementations, RR-10week and RR-2week, which offer two retention capabilities to host writes. The first relaxes the retention capability of host writes to 10 weeks and updates and checks the remaining retention counters at the end of every five-week checking period. This scheme ensures that the FTL always has at least five weeks to reprogram live data in a staggered manner in the background without incurring bursts of reprogramming overhead. We set the staggered interval between reprogramming tasks to 100 ms. The second implementation is similar to the first, except that the retention capability of host writes and the checking period are two weeks and one week, respectively.
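The two configurations and the timing argument behind them can be checked with a few lines. The sketch below assumes that data written at the very start of one checking period may be reprogrammed as late as the very end of the next period; `RRConfig` and its methods are illustrative names.

```python
from dataclasses import dataclass

SECONDS_PER_WEEK = 7 * 24 * 3600
STAGGER_INTERVAL_MS = 100  # interval between reprogramming tasks


@dataclass(frozen=True)
class RRConfig:
    name: str
    retention_weeks: int
    checking_period_weeks: int

    def worst_case_age_weeks(self) -> int:
        """Oldest that relaxed data can be when it is finally reprogrammed."""
        return 2 * self.checking_period_weeks

    def guarantee_holds(self) -> bool:
        """True if data is always reprogrammed within its retention time."""
        return self.worst_case_age_weeks() <= self.retention_weeks

    def reprogram_slots_per_period(self) -> int:
        """Number of 100 ms task slots available in one checking period."""
        period_ms = self.checking_period_weeks * SECONDS_PER_WEEK * 1000
        return period_ms // STAGGER_INTERVAL_MS


RR_10WEEK = RRConfig("RR-10week", retention_weeks=10, checking_period_weeks=5)
RR_2WEEK = RRConfig("RR-2week", retention_weeks=2, checking_period_weeks=1)
```

Under this reading, both configurations meet the retention limit exactly in the worst case, which is why the checking period is set to half the retention capability.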

2.5.2 Overhead Analysis

Memory Resource Overhead

The proposed mechanism requires extra memory resources to store the write mode and retention capability information of each block. Since we only have two write modes, i.e., the normal mode and the retention-relaxed one, each block requires only a one-bit flag to record its write mode. As for the counter that keeps track of the remaining retention capability, both RR-2week and RR-10week require only a one-bit counter per block because all retention-relaxed blocks written in the n-th checking period are reprogrammed during the (n + 1)-th checking period. For flash storage with 128 GB of flash and a 2 MB block size, there are 65,536 blocks at two bits each, so the memory overhead is 16 KB in total.
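The 16 KB figure follows from simple arithmetic, reproduced here under the assumption that GB, MB, and KB are binary units (powers of 1024).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

capacity_bytes = 128 * GIB
block_bytes = 2 * MIB

num_blocks = capacity_bytes // block_bytes   # one entry per flash block
bits_per_block = 1 + 1                       # mode flag + retention counter
overhead_bytes = num_blocks * bits_per_block // 8
overhead_kib = overhead_bytes // KIB

print(num_blocks, overhead_kib)
```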

Flash Reprogramming Overhead

In the proposed schemes, long-lived data that are in the retention-relaxed mode need to be reprogrammed. These extra activities affect the performance and lifetime of flash storage. To analyze the performance impact, we estimate the reprogramming amount per unit of time based on the projection method described in Section 2.4.3, with T₂ equal to the checking period. For example, for RR-10week, T₂ equals five weeks. The reprogramming amount per unit of time is therefore as follows:

\[
\frac{(1 - S_{T_2,T_2}\%) \times k \times N}{T_2} \tag{2.18}
\]

where k × N is the total write amount at the end of a T₂ period, and (1 − S_{T₂,T₂}%) is the percentage of written data that require reprogramming.

The results show that the reprogramming load ranges from 1.13 kB/s to 1.25 MB/s for RR-2week and from 1.13 kB/s to 0.26 MB/s for RR-10week. Since each flash plane can offer 6.2 MB/s of write throughput (i.e., writing an 8 kB page in 1.3 ms), we anticipate that reprogramming does not lead to high performance overhead. In Section 2.6, the actual performance impact is considered in simulations.
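To put these numbers in proportion, the following sketch recomputes the per-plane write throughput and expresses the peak reprogramming loads as a share of a single plane. It assumes decimal units (1 MB = 1000 kB); the variable names are illustrative.

```python
PAGE_KB = 8.0
PROGRAM_TIME_MS = 1.3

# kB per ms is numerically equal to MB per s when 1 MB = 1000 kB.
plane_mb_per_s = PAGE_KB / PROGRAM_TIME_MS

QUOTED_PLANE_MB_PER_S = 6.2
peak_load_mb_per_s = {"RR-2week": 1.25, "RR-10week": 0.26}

# Share of one plane's write bandwidth consumed at the peak load.
plane_share_percent = {
    name: round(100.0 * load / QUOTED_PLANE_MB_PER_S, 1)
    for name, load in peak_load_mb_per_s.items()
}

print(round(plane_mb_per_s, 1), plane_share_percent)
```

Even the heaviest load stays at roughly a fifth of one plane's bandwidth, and flash storage typically spreads writes over many planes.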

To quantify the wearout overhead caused by reprogramming, we calculate the number of extra writes per cell per year, assuming perfect wear-leveling. We first give the upper bound on this metric, taking RR-2week as an example. In the extreme case, RR-2week reprograms the entire flash storage every week, resulting in 52.1 extra writes per cell per year. Similarly, RR-10week causes at most 10.4 extra writes per cell per year. These extra writes are not significant compared to flash’s endurance, which is usually a few thousand P/E cycles. Therefore, even in the worst case, the proposed mechanism does not cause significant wearout overhead. For real workloads, the wearout overhead is usually smaller than the worst case, as shown in Figure 2.15. The wearout overhead for each workload is evaluated based on the storage capacity and the reprogramming amounts per unit of time presented earlier.

Figure 2.15: Wearout overhead of Retention Relaxation (annual average number of extra writes, axis range 0 to 60, per workload for RR-10week and RR-2week; workloads: prn_0, prn_1, proj_0, proj_2, prxy_0, prxy_1, src1_0, src1_1, src1_2, src2_2, usr_1, usr_2, hd1, hd2, tpcc1, tpcc2)
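Both the upper bounds and the per-workload metric are short calculations. The sketch below assumes a 365-day year and perfect wear-leveling; the per-workload function is our reading of "based on the storage capacity and the reprogramming amounts per unit of time", and its example inputs are hypothetical.

```python
import math

DAYS_PER_YEAR = 365
SECONDS_PER_YEAR = DAYS_PER_YEAR * 24 * 3600


def worst_case_extra_writes(checking_period_weeks: int) -> float:
    """Upper bound: the whole device is reprogrammed once per checking period."""
    return DAYS_PER_YEAR / (7 * checking_period_weeks)


def extra_writes_per_year(rate_bytes_per_s: float, capacity_bytes: float) -> float:
    """Extra writes per cell per year for a given reprogramming rate."""
    return rate_bytes_per_s * SECONDS_PER_YEAR / capacity_bytes


rr_2week_bound = round(worst_case_extra_writes(1), 1)
rr_10week_bound = round(worst_case_extra_writes(5), 1)

# Hypothetical workload: 100 kB/s of reprogramming on a 512 GB device.
example = extra_writes_per_year(100e3, 512e9)

print(rr_2week_bound, rr_10week_bound, round(example, 2))
```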
