
An Adaptive, Low-Cost Wear-Leveling Algorithm for Multichannel Solid-State Disks


LI-PIN CHANG, TUNG-YANG CHOU, and LI-CHUN HUANG, National Chiao-Tung University

Multilevel flash memory cells double or even triple storage density, producing affordable solid-state disks for end users. As flash memory endures only limited program-erase cycles, solid-state disks employ wear-leveling methods to prevent any portions of flash memory from being retired prematurely. Modern solid-state disks must consider wear evenness at both block and channel levels. This study first presents a block-level wear-leveling method whose design has two new ideas. First, the proposed method reuses the intelligence available in flash-translation layers so it does not require any new data structures. Second, it adaptively tunes the threshold of block-level wear leveling according to the runtime write pattern. This study further introduces a new channel-level wear-leveling strategy, because block-level wear leveling is confined to a channel, but realistic workloads do not evenly write all channels. The proposed method swaps logical blocks among channels for achieving an eventually-even state of channel lifetimes. A series of trace-driven simulations show that our wear-leveling method outperforms existing approaches in terms of wear evenness and overhead reduction.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—Garbage collection; B.3.2 [Memory Structures]: Design Styles—Mass storage

General Terms: Design, Performance, Algorithm

Additional Key Words and Phrases: Flash memory, wear leveling, solid-state disks

ACM Reference Format:

Chang, L.-P., Chou, T.-Y., and Huang, L.-C. 2013. An adaptive, low-cost wear-leveling algorithm for multi-channel solid-state disks. ACM Trans. Embedd. Comput. Syst. 13, 3, Article 55 (December 2013), 26 pages. DOI: http://dx.doi.org/10.1145/2539036.2539051

1. INTRODUCTION

Solid-state disks employ flash memory as their storage medium. The physical characteristics of flash memory differ from those of hard drives, necessitating new methods for data accessing. Solid-state disks hide flash memory from host systems by emulating a collection of logical sectors, allowing systems to switch from a hard drive to a solid-state disk without modifying any existing software and hardware. Solid-state disks are superior to traditional hard drives in terms of shock resistance, energy conservation, random-access performance, and heat dissipation, attracting vendors to deploy such storage devices in laptops, smart phones, and portable media players.

Flash memory is a kind of erase-before-write memory. Because any one part of flash memory can only withstand a limited number of write-erase cycles, approximately 100K cycles under the current technology [Samsung Electronics 2006], frequent erase operations can prematurely retire a region in flash memory.

This work is in part supported by research grant 98-2221-E-009-157-MY3 from the National Science Council, Taiwan, ROC, and a joint research project with ADATA Technology Co., Ltd.

Authors’ addresses: L.-P. Chang, T.-Y. Chou, and L.-C. Huang, Department of Computer Science, Na-tional Chiao-Tung University, 1001 University Road, Hsinchu, Taiwan 300, ROC; corresponding author’s email: lpchang@cs.nctu.edu.tw.


This limitation affects the lifetime of solid-state disks in applications such as laptops and desktop PCs, which write disks at very high frequencies. Even worse, recent advances in flash manufacturing technologies exaggerate this lifetime issue. In an attempt to break the entry-cost barrier, modern flash devices now use multilevel cells for double or even triple density. Compared to standard single-level-cell flash, multilevel-cell flash degrades the erase endurance by one or two orders of magnitude [Samsung Electronics 2008].

Without wear leveling, localities of data access inevitably degrade wear evenness of flash memory in solid-state disks. Partially wearing out a piece of flash memory not only decreases its total effective capacity, but also increases the frequency of flash erase for free-space management, which further speeds up the wearing out of the rest of the flash memory. A solid-state drive ceases to function when the amount of its worn-out space in flash exceeds what the drive can manage. Wear-leveling techniques ensure that the entire flash wears evenly, postponing the first appearance of a worn-out memory region. However, wear leveling is not free, as it moves data around in flash to prevent solid-state disks from excessively wearing any one part of the memory. As reported in Chang et al. [2010], these extra data movements can increase the total number of erase operations by ten percent.

Wear-leveling algorithms include rules defining when data movement is necessary and where the data should move to and from. These rules monitor wear in the entire flash and intervene when the flash wear becomes unbalanced. Wear-leveling algorithms are part of the firmware of solid-state disks, and thus they are subject to crucial resource constraints of RAM space and execution speed of solid-state disks' microcontrollers (or simply controllers); for example, the GP5086 SSD controller from Global Unichip was rated at 150 MHz and has 64 KB of RAM. Prior research explores various wear-leveling designs under such tight resource budgets, revealing three major design challenges. First, monitoring the entire flash's wear requires considerable time and space overheads, which many controllers in present solid-state disks cannot afford. Second, algorithm tuning for host-workload adaptation and performance definition requires prior knowledge of flash access patterns, online human intervention, or both. Third, high implementation complexity discourages firmware programmers from adopting sophisticated algorithms.

Prior methods sort flash erase units in terms of their wear information. This requires efficient access to the wear information of arbitrary erase units, and thus these methods copy the wear information of the entire flash from flash to the RAM of the disk controllers. However, many controllers at the present time cannot afford this RAM space overhead. Chang and Du [2009] proposed caching only portions of wear information in RAM. However, the miss penalty and write-back overhead of the cache can scale up the volume of flash-write traffic by up to 10%. Instead of storing the wear information of all flash erase units in RAM, Jung et al. [2007] proposed using the average wear of large flash regions. Nevertheless, the low-resolution wear information suffers from distortion whenever flash wearing is severely biased. Chang et al. [2010] introduced a bitmap that indicates whether a flash erase unit is recently erased or not. However, using the recent erase history could blind wear-leveling algorithms, because the recency and frequency of erase operations on flash erase units are mutually independent.

Existing wear-leveling designs subject wear evenness to tunable threshold parameters [Chang et al. 2010; Chang and Du 2009; Jung et al. 2007; Agrawal et al. 2008]. The system environment in which wear leveling takes place includes many conditions, such as flash-translation-layer designs, flash geometry, and host disk workloads.


Existing approaches require human intervention or prior knowledge of the system environment for threshold setting. However, manually tuned thresholds are problematic: a wear-leveling algorithm may perform well with a particular threshold in one system environment, but the same threshold could cause unexpectedly high wear-leveling overhead or unsatisfactory wear evenness in a different system environment.

From a firmware point of view, implementation complexity primarily involves the applicability of wear-leveling algorithms. The dual-pool algorithm [Chang and Du 2009] uses five priority queues of wear information and a caching method to reduce the RAM footprints of these queues. The group-based algorithm [Jung et al. 2007] and the static wear-leveling algorithm [Chang et al. 2010] add extra data structures to maintain coarse-grained wear information and the recent history of flash wear, respectively. These approaches ignore the information already available in the disk-emulation algorithm, which is a firmware module accompanying wear leveling, and unnecessarily increase their design complexity.

This study presents a new wear-leveling design, called the lazy wear-leveling algorithm, to tackle the three design challenges previously mentioned. First, this design stores only a RAM-resident counter indicating the average wear of the entire flash, achieving a tiny RAM footprint. Second, even though this algorithm uses a threshold parameter, it adopts an analytical model to estimate the overhead increase ratio with respect to different threshold settings, and then automatically selects a threshold for a good balance between wear evenness and overhead. Third, the proposed algorithm utilizes the address-mapping information available in the disk-emulation algorithm, eliminating the need for adding extra data structures for wear leveling.

Modern solid-state disks are equipped with multiple channels for parallel flash operations. In this study, a channel refers to a logical unit that independently processes flash commands and transfers data. Multichannel designs boost the write throughput but introduce unbalanced wear of flash erase units among channels. Prior work addresses this issue by dispatching write requests to channels on a page-by-page basis [Chang and Kuo 2002; Dirik and Jacob 2009] (a page is the smallest read/write unit of flash). Dispatching data at the page level requires page-level mapping, whose implementation requires considerable RAM space for large flash. Additionally, this approach could map logically consecutive data to the same channel and degrade the channel-level parallelism in sequential read requests. This study introduces a novel channel-level wear-leveling strategy based on the concept of reaching eventually-even channel lifetimes. The basic idea is to align channels' lifetime expectancies by remapping data among channels. The proposed approach has many benefits, including that (1) it does not require a channel-level threshold for wear leveling, (2) it incurs very limited overhead, and (3) it requires only a small RAM-resident data structure.

In summary, this study has the following contributions.

(1) An efficient block wear-leveling algorithm with a tiny RAM footprint.
(2) A dynamic threshold-adjusting strategy for block wear leveling.
(3) An algorithm for wear leveling at the channel level.

The rest of this article is organized as follows. Section 2 reviews flash characteristics and prior work on flash translation and wear leveling. Section 3 presents a block-level wear-leveling algorithm, and Section 4 describes an adaptive tuning strategy for this algorithm. Section 5 introduces a strategy for wear leveling at the channel level. Section 6 reports our simulation results. Section 7 concludes.


2. PROBLEM FORMULATION

2.1. Flash Management

2.1.1. Flash-Memory Characteristics. Solid-state disks use NAND flash memory (flash memory for short) as their storage medium. A piece of flash memory is a physical array of blocks, and each block contains the same number of pages. Typically, a flash page consists of 2,048 plus 64 bytes. The 2,048-byte portion stores user data, while the 64 bytes are a spare area for mapping information, block aging information, error-correcting code, etc. Flash memory reads and writes in terms of pages, and overwriting a page requires erasing it first. Flash erases in terms of blocks, each of which consists of 64 pages. Under the current technology, a flash block can only sustain a limited number of write-erase cycles before it becomes unreliable. A single-level-cell flash block endures 100 K cycles [Samsung Electronics 2006], while this limit is 10 K or less in multilevel-cell flash [Samsung Electronics 2008].
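For concreteness, the following C fragment sketches the page/block arithmetic implied by this geometry. It is only a sketch: the constants mirror the figures above, while the names are illustrative rather than taken from any particular firmware.

#include <stdint.h>

/* Geometry from the text: a 2,048-byte data area plus a 64-byte spare area
 * per page, and 64 pages per block (hence 128 KB of data per block). */
#define PAGE_DATA_BYTES   2048u
#define PAGE_SPARE_BYTES  64u
#define PAGES_PER_BLOCK   64u
#define BLOCK_DATA_BYTES  (PAGE_DATA_BYTES * PAGES_PER_BLOCK)   /* 131,072 B = 128 KB */

/* Map a flat physical page number to its block number and in-block offset. */
static inline uint32_t page_to_block(uint32_t ppn) { return ppn / PAGES_PER_BLOCK; }
static inline uint32_t page_offset(uint32_t ppn)   { return ppn % PAGES_PER_BLOCK; }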

Solid-state disks emulate disk geometry using a firmware layer called the flash-translation layer (FTL). FTLs update existing data out of place and invalidate old copies of the data to avoid erasing a flash block every time before rewriting a piece of data. Thus, FTLs require a mapping scheme to translate disk sector numbers into physical flash addresses. Updating data out of place consumes free space in flash, and FTLs must recycle flash space occupied by invalid data with erase operations. Before erasing a block, FTLs copy all valid data from this block to other free space. Garbage collection refers to a series of copy and erase operations for reclaiming free space.

2.1.2. Flash-Translation Layers (FTLs). Flash-translation layers are part of the firmware in solid-state disks. They use RAM-resident index structures to translate logical page numbers into physical flash locations. Mapping resolutions have a direct impact on RAM-space requirements and write performance. Many entry-level flash-storage devices, like USB thumb drives, adopt block-level mapping, which requires only small mapping structures. However, low-resolution mapping suffers from slow response when servicing small write requests. Page-level mapping [Gupta et al. 2009] better handles random write requests, but requires large mapping structures, making its implementation difficult when flash capacity is high. This article considers logical pages as the smallest mapping unit, each as large as a flash page.

Hybrid mapping combines both page and block mapping. This method groups consecutive logical pages into logical blocks as large as physical blocks. It maps logical blocks to physical blocks on a one-to-one basis using a block-mapping table. If a physical block is mapped to a logical block, then this physical block is called the data block of this logical block. Initially, physical blocks other than data blocks are spare blocks. Hybrid mapping uses spare blocks as log blocks to serve page updates, and uses a page-mapping table to redirect read requests to the latest versions of data in the log blocks.

Figures 1(a) and 1(b) show two different FTL designs using hybrid mapping. Hybrid mapping creates groups of logical blocks and allocates (flash) spare blocks as log blocks for these logical-block groups. Let lbn and pbn stand for a logical-block number and a physical-block number, respectively. Let lpn represent a logical-page number, and let disp be the block offset in terms of pages. The bold boxes stand for physical blocks, each of which has four pages. The numbers in the pages indicate the lpns of their storage data. The BMT and the PMT are the block-mapping table and the page-mapping table, respectively. In Figure 1(a), every group has two logical blocks, while a group can be allocated up to two log blocks. This mapping scheme, developed by Park et al. [2008], is called set-associative mapping (SAST). This scheme uses two parameters N and K to specify the group size and the largest number of log blocks that a group can have, respectively. Figure 1(b) depicts another mapping scheme, developed by Lee et al. [2007], called fully-associative mapping (FAST).


Fig. 1. Two flash-translation layer designs based on hybrid mapping. (a) The set-associative mapping scheme with N = 2 and K = 2. Every group has two logical blocks, and a group is allocated up to two log blocks. (b) The fully-associative mapping scheme. All logical blocks are in one big group, and all the log blocks are shared by the logical blocks in this big group.

This method puts all logical blocks in one big group and has all the logical blocks in this big group share all the log blocks. The FTL consumes spare blocks for serving incoming write requests. When the number of spare blocks becomes low, the FTL starts erasing log blocks. Before erasing a log block, the FTL finds all logical blocks related to the valid data in this log block. For each of the found logical blocks, the FTL collects valid data from the log block and the data block of this logical block, copies these valid data to a new spare block, and remaps the logical block to the copy-destination spare block. Finally, the FTL erases all the involved data blocks and log blocks into spare blocks. This procedure is referred to as merge operations or garbage collection. For example, in Figure 1(a), for garbage collection, the FTL collects the valid data scattered in the data blocks at pbns 0 and 2 and in the log blocks at pbns 6 and 3, writes them to the spare blocks at pbns 7 and 8, and then erases the four old flash blocks at pbns 0, 2, 6, and 3 into spare blocks.

Hybrid-mapping FTLs exhibit a common behavior in the garbage-collection process regardless of their designs: garbage collection never involves a data block if none of its page data have been updated. In Figure 1(a), erasing the data block at pbn 5 cannot reclaim any free space. Similarly, in Figure 1(b), erasing any of the log blocks does not involve the data block at pbn 5. This is a potential cause of uneven flash wear.

2.2. The Need for Wear Leveling

This section first introduces prior methods, discusses their drawbacks, and then points out how the proposed method improves upon these shortcomings.

2.2.1. Block-Level Wear Leveling. Block-level wear leveling considers the wear evenness of a collection of flash blocks. Let the erase count of a flash block denote how many write-erase cycles this block has undergone. There have been three representative techniques for this problem: static wear leveling, hot-cold swapping, and cold-data migration. Static wear leveling moves static/immutable data away from lesser-worn flash blocks, encouraging the flash-translation layer to start erasing these blocks. Flash vendors, including Micron [2008] and Spansion [2008], recommend using this approach. Chang et al. [2010] described a design of static wear leveling. However, Chang and Du [2009] found that static wear leveling failed to achieve even block wear in the long term, because static wear leveling could (1) move static/immutable data back and forth among lesser-worn blocks and (2) erase a flash block even if its erase count is relatively large.


Table I. Comparison of Existing Algorithms for Block-Level Wear Leveling

Algorithm | Principle | RAM-resident data structures required | Threshold tuning
Static wear leveling [Chang et al. 2010] | Static wear leveling | A block erase bitmap | Manual
Group wear leveling [Jung et al. 2007] | Hot-cold swapping | Average erase counts of block groups | Manual
Dual-pool wear leveling [Chang and Du 2009] | Cold-data migration | All blocks' erase counts and their recent erase counts | Manual
Remaining-lifetime leveling [Agrawal et al. 2008] | Cold-data migration | All blocks' age information (remaining lifetimes) and block-data temperature (update frequencies) | Manual
Lazy wear leveling (this study) | Cold-data migration | An average erase count of all blocks | Automatic

Hot-cold swapping exchanges data in a lesser-worn block with data from a badly-worn block. Jung et al. [2007] presented a hot-cold swapping design. However, because the oldest block has a very large (and perhaps still the largest) erase count, Chang and Du [2009] found that hot-cold swapping risks pathologically erasing the most-worn flash block.

Cold-data migration relocates infrequently updated data (i.e., cold data) to excessively worn blocks to protect these blocks against garbage collection. Preventing badly worn blocks from aging further is not the same as increasing the wear of lesser-worn blocks (as static wear leveling does). This is because frequently updated data occupy only a small portion of the disk space. Prior work reported that the disk fullness of productive systems was only about forty percent [Agrawal et al. 2007]. In other words, stopping the aging of the small number of badly worn flash blocks mapped to frequently updated data is more efficient than starting to wear the large number of lesser-worn flash blocks. Cold-data migration has been proven more effective than static wear leveling and hot-cold swapping [Agrawal et al. 2008; Chang and Du 2009]. Based on cold-data migration, Agrawal et al. [2008] proposed storing the remaining lifetimes and data temperatures of all flash blocks in RAM, and Chang and Du [2009] proposed storing all blocks' erase counts and their recent erase counts in RAM. These designs, however, impose large RAM-space requirements on disk controllers. Consider a 32 GB flash-storage device with 512 KB flash blocks: storing four bytes of wear information for every block costs the disk controller 256 KB of RAM. This figure is higher than what a typical disk controller can afford (64 KB, as mentioned in the Introduction). Reducing the RAM footprint is always beneficial, no matter how much RAM the controller can afford, because the saved RAM space can be used by the mapping tables and the disk write buffer. Table I summarizes the comparison between prior methods and our algorithm. Our design stores only an average erase count in RAM, achieving a tiny RAM footprint. However, our design does not sacrifice wear-leveling performance for this footprint reduction. Our experimental results will show that it outperforms existing methods in almost all cases.
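As a quick check of the 256 KB figure (binary units assumed): 32 GB / 512 KB per block = 65,536 blocks, and 65,536 blocks × 4 bytes per block = 262,144 bytes = 256 KB of RAM.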

Block-level wear leveling keeps the wear variance of all flash blocks within an acceptable threshold. Existing approaches have different definitions of this variance: Chang et al. [2010] adopted the ratio of the total erase count to the total number of recently erased blocks, Jung et al. [2007] and Chang and Du [2009] used the difference among blocks' erase counts, and Agrawal et al. [2008] employed the difference among blocks' remaining lifetimes. With a smaller threshold, wear leveling aims at a more level wear in flash blocks, but inevitably introduces more frequent data movement.


Wear-leveling overhead can be affected by many conditions of flash management, including the host workload, flash-translation layer, flash geometry, and flash capacity.

Unfortunately, it is almost impossible to find a universally applicable threshold setting for various applications of flash storage. For example, in our two tests of the dual-pool algorithm [Chang and Du 2009] with a threshold of 14, under the workloads of a multimedia appliance and a Windows desktop, it increased the total erase count by 0.8% and 3.9%, while the resultant standard deviations of all blocks' erase counts were 5.4 and 10.5, respectively (these two disk workloads are used in our experiments; see Section 6.1). The latter case shows that the same threshold setting resulted in more data movement but did not achieve better wear evenness. This study identifies that the overhead of wear leveling is not linearly related to the threshold value, and the overhead will increase significantly when the threshold becomes smaller than a certain critical value. This critical threshold value will be different for various conditions of flash management. Thus, we propose subjecting the threshold value to the overhead increase ratio and introduce a runtime strategy that dynamically sets the threshold value to the critical value.

2.2.2. Channel-Level Wear Leveling. In this study, a channel refers to a logical unit that independently processes flash commands and transfers data. Channel-level wear leveling is concerned with the wear evenness of flash blocks from different channels. This issue is closely related to channel binding of logical pages, that is, the allocation of free flash pages to host data. Dynamic channel binding globally manages free pages across all channels. Chang and Kuo [2002] proposed dispatching page write requests to channels based on the update frequencies of these page data. Dirik and Jacob [2009] proposed allocating channels to incoming page write requests using the round-robin policy. Even though dynamic channel binding has better flexibility of balancing the block wear across all channels, it has two drawbacks: (1) it adds extra channel-level mapping information to every logical page, resulting in larger mapping tables, and (2) it could map consecutive logical pages to the same channel, severely degrading the channel-level parallelism in sequential-read requests.

Instead of dynamic channel binding, this study considers static channel binding. Static channel binding uses a fixed mapping between logical pages and channels. With static mapping, effectively every channel manages its free flash pages with its own instance of the flash-translation layer. The most common strategy for static channel binding is RAID-0-style striping [Agrawal et al. 2008; Park et al. 2010; Seong et al. 2010]. RAID-0 striping achieves the maximum channel-level parallelism in sequential reads because it maps a collection of consecutive logical pages to the largest number of channels. We must point out that RAID-0 striping cannot automatically achieve wear leveling at the channel level. This is because, as reported in Chang [2010], hot data (i.e., frequently updated data) are small, usually between 4 KB and 16 KB. RAID-0 striping statically binds small and hot data to some particular channels, resulting in imbalanced write traffic among channels. We found that, under the disk workload of a Windows desktop, the channels of a four-channel architecture received the largest and smallest fractions of write traffic of 28% and 23%, respectively. Thus, flash blocks from different channels wear at different rates. Extending the scope of block-level wear leveling to the entire storage device is not a feasible solution here, because it requires dynamic channel binding.

3. BLOCK-LEVEL WEAR LEVELING

This section presents an algorithm for wear leveling at the block level. This algorithm does not deal with channels, so logically, all flash blocks are in the same channel.


Fig. 2. Physical blocks and their erase recency and erase counts. An upward arrow indicates that a block is recently increasing its erase count.

3.1. Observations

This section defines some key terms for the purpose of presenting our wear-leveling algorithm in later sections. Let the update recency of a logical block denote the time length between the current time and the latest update to this logical block. The update recency of a logical block is high if its latest update is more recent than the average update recency. Otherwise, its update recency is low. Analogously, let the erase recency of a physical block be the time length since the latest erase operation on this block. Thus, immediately after garbage collection erases a physical block, this block has the highest erase recency. A physical block is a senior block if its erase count is larger than the average erase count. Otherwise, it is a junior block.

Temporal localities of updating logical blocks affect the wear of physical blocks. As previously mentioned, if a physical block is mapped to an unmodified logical block, then garbage collection will avoid erasing this physical block. On the other hand, updates to logical blocks produce invalid data in flash blocks, and thus physical blocks mapped to recently modified logical blocks are good candidates for garbage collection. After a physical block is erased by garbage collection, it either serves a data block or a log block. Either way, this physical block is again related to recently modified logical blocks. So if a physical block has a high erase recency, then it will quickly accumulate many erase counts. Conversely, physical blocks lose momentum in increasing their erase counts if they are mapped to logical blocks having low update recency.

Figure 2 provides an example of eight physical blocks’ erase recency and erase counts. Upward arrows mark physical blocks recently increasing their erase counts, while an equal sign indicates otherwise. Block a is a senior block with a high erase recency, while block d is a senior block but with a low erase recency. The junior block h has a high erase recency, while the erase recency of the junior block e is low. Blocks should keep their erase counts close to the average. Two kinds of block wear can require intervention from wear leveling. First, the junior blocks e and f have not recently increased their erase counts. As their erase counts fall below the average, wear leveling has them start participating in garbage collection. Second, the senior blocks a and b are still increasing their erase counts. Wear leveling has garbage collection stop further wear in these two senior blocks.

3.2. The Lazy Wear-Leveling Algorithm

This study proposes a new wear-leveling algorithm based on a simple principle: whenever a senior block's erase recency becomes high, relocate (i.e., remap) a logical block having a low update recency to this senior block. This algorithm is called the lazy wear-leveling algorithm.


Lazy wear leveling must be aware of the recent wear of all senior blocks, because senior blocks retire before junior blocks. However, physical blocks boost their erase recency only via garbage collection. The flash-translation layer can notify lazy wear leveling of its decision on victim selection. This way, lazy wear leveling captures senior blocks whenever their erase recency becomes high without repeatedly checking all senior blocks' wear information.

How to prevent senior blocks from further aging is closely related to the behaviors of garbage collection. As previously mentioned in Section 2.2, if a logical block has a low update recency, then garbage collection has no interest in erasing the flash block(s) mapped to it. Therefore, remapping logical blocks of low update recency is a key to preventing senior blocks from aging further. Lazy wear leveling considers logical blocks not related to any page-mapping information as having low update recency, because recent updates to logical blocks leave mapping information in the page-mapping table. The logical blocks at lbn 3 in Figures 1(a) and 1(b) are such examples.

To remap a logical block from one physical block to another, lazy wear leveling moves all valid data from the source physical block to the destination physical block. Junior blocks are the most common kind of source blocks, for example, blocks e and f in Figure 2, because storing immutable data keeps them away from garbage collection. As moving all valid data out of the source blocks makes them good candidates for garbage collection, selecting logical blocks for remapping is related to the wear of junior blocks. To give junior blocks even chances of wear, it is important to uniformly visit every logical block when selecting logical blocks for remapping.

Temporal localities of writes change occasionally. New updates to a logical block can neutralize the latest remapping effort involving this logical block. In this case, lazy wear leveling will be notified that a senior block is again selected as a victim of garbage collection and will perform another remapping operation for this senior block.

3.3. Interacting with Flash-Translation Layers

This section describes how lazy wear leveling interacts with its accompanying firmware module, the flash-translation layer. Algorithm 1 shows the pseudocode of lazy wear leveling. The flash-translation layer calls Algorithm 1 after it moves all valid data out of a garbage-collection victim block and before it erases this block. The input of Algorithm 1 is v, the pbn of the victim block. This algorithm performs remapping whenever necessary and then returns a pbn. Note that this output pbn may be different from the input pbn.

ALGORITHM 1: Lazy Wear-Leveling Algorithm
Input: v: the victim block for garbage collection
Output: p: a substitute for the original victim block v

1: e_v ← eraseCount(v)
2: if (e_v − e_avg) > Δ then
3:   repeat
4:     l ← lbnNext()
5:   until lbnHasPageMapping(l) = FALSE
6:   erase(v)
7:   p ← pbn(l)
8:   copy(v, p); map(v, l)
9:   e_v ← e_v + 1
10:  e_avg ← updateAverage(e_avg, e_v)
11: else
12:  p ← v
13: end if
14: RETURN p


The flash-translation layer erases the flash block at the pbn returned by Algorithm 1. The discussion in this section is based on hybrid mapping. See later sections for using lazy wear leveling with page-level mapping.

For the example of SAST in Figure 1(a), suppose that the flash-translation layer decides to merge the data of the logical blocks at lbns 0 and 1. The flash-translation layer calls Algorithm 1 before erasing each of the four physical blocks at pbns 0, 2, 6, and 3. For the example of FAST in Figure 1(b), because FAST recycles one log block (the oldest) at a time, the flash-translation layer calls Algorithm 1 before erasing the log block at pbn 6 and the two related data blocks at pbns 0 and 2. The rest of this section is a detailed explanation of Algorithm 1.

In Algorithm 1, the flash-translation layer provides the subroutines with leading underscores, and wear leveling implements the rest. In Step 1, eraseCount() obtains the erase count e_v of the victim block v by reading the victim block's page spare area, in which the flash-translation layer stores the erase count. Step 2 compares e_v against the average erase count e_avg. If e_v is larger than e_avg by a predefined threshold Δ, then Steps 3 through 10 carry out a remapping operation. Otherwise, Steps 12 and 14 return the victim block v intact. The loop of Steps 3 through 5 finds a logical block whose update recency is low. Step 4 uses the subroutine lbnNext() to obtain l, the next logical block number to visit, and Step 5 calls the subroutine lbnHasPageMapping() to check if the logical block l has any related mapping information in the page-mapping table. As mentioned previously, to give junior blocks equal chances of getting erased, the subroutine lbnNext() must evenly visit all logical blocks. At this point, it is reasonable to assume that lbnNext() produces a linear enumeration of all lbns.

Steps 6 through 8 remap the previously found logical block l. Step 6 erases the original victim block v. Step 7 uses the subroutine pbn() to identify the physical block p that the logical block l currently maps to. Step 8 copies the data of the logical block l from the physical block p to the original victim block v, and then remaps the logical block l to the former victim block v using the subroutine map(). After this remapping, Step 9 increases e_v since the former victim block v has been erased, and Step 10 updates the average erase count. Step 14 returns the physical block p, which the logical block l previously mapped to, to the flash-translation layer as a substitute for the original victim block v. Apart from the average erase count e_avg, Algorithm 1 is concerned only with the erase count of the victim block. Thus, this algorithm need not store all blocks' erase counts in RAM. Instead, it reads the spare area of a victim block before garbage collection erases it.
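For readers who prefer C over pseudocode, the following is a minimal sketch of Algorithm 1. It assumes hypothetical FTL callbacks (erase_count(), lbn_next(), lbn_has_page_mapping(), pbn_of(), copy_block(), remap(), update_average()); it mirrors the steps above and is not the authors' firmware implementation.

#include <stdint.h>
#include <stdbool.h>

/* FTL-provided and wear-leveling helpers (hypothetical names). */
extern uint32_t erase_count(uint32_t pbn);            /* read from the block's spare area       */
extern uint32_t lbn_next(void);                       /* next logical block number to visit     */
extern bool     lbn_has_page_mapping(uint32_t lbn);   /* any entries in the page-mapping table? */
extern uint32_t pbn_of(uint32_t lbn);                 /* data block currently mapped to lbn     */
extern void     erase_block(uint32_t pbn);
extern void     copy_block(uint32_t dst_pbn, uint32_t src_pbn);
extern void     remap(uint32_t lbn, uint32_t pbn);
extern uint32_t update_average(uint32_t avg, uint32_t ev);

static uint32_t e_avg;   /* the only RAM-resident wear counter */
static uint32_t delta;   /* threshold, tuned as in Section 4   */

/* Called by the FTL on a garbage-collection victim v; returns the block the
 * FTL should actually erase (a substitute block when a remapping happened). */
uint32_t lazy_wear_level(uint32_t v)
{
    uint32_t e_v = erase_count(v);                    /* Step 1             */
    if (e_v <= e_avg + delta)                         /* Step 2 (negated)   */
        return v;                                     /* Steps 12 and 14    */

    uint32_t l;
    do { l = lbn_next(); }                            /* Steps 3-5: find a low-update-recency lbn */
    while (lbn_has_page_mapping(l));

    erase_block(v);                                   /* Step 6             */
    uint32_t p = pbn_of(l);                           /* Step 7             */
    copy_block(v, p);                                 /* Step 8: move l's data onto v,           */
    remap(l, v);                                      /*         then remap l to v               */
    e_v += 1;                                         /* Step 9             */
    e_avg = update_average(e_avg, e_v);               /* Step 10            */
    return p;                                         /* Step 14: p is the substitute victim     */
}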

3.4. Wear-Leveling Enhancements

This section presents two enhancements that lazy wear leveling can use. The first is specific to sequential-write workloads, and the second is particularly useful if the flash-translation layer is FAST.

3.4.1. Workload-Specific Enhancement. Algorithm 1 calls lbnNext() to select logical blocks for remapping. This function can linearly visit all logical blocks. However, this simple strategy could result in many ineffective remapping operations if the host workload consists of a lot of long write bursts. This is because file systems try to allocate contiguous disk space when writing large files. This behavior coincides with linearly enumerating logical blocks and can neutralize prior remapping operations on a set of consecutive logical blocks.

To solve this problem, this study proposes using a linear congruential generator [Rosen 2003] for logical-block selection. Let the total number of logical blocks be n_l. Let p be the smallest prime number larger than n_l. Let s be an integer with 0 < s < p, and let l_0 be an arbitrary number in [0, n_l). Lazy wear leveling selects logical blocks using the following recurrence relation:

l_{i+1} = (l_i + s) % p,

where % is the modulo operator. Notice that any l_i ≥ n_l are not used. Because s and p are prime to each other, the period of selecting the same logical-block number is exactly n_l. Here, s is the skip factor, which should be larger than the total number of logical blocks that typical large files can have. This prevents lazy wear leveling from successively visiting two logical blocks belonging to the same large file. Our current implementation adopts s = 1000 when the logical block size is 128 KB.
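A sketch of this enumeration in C is shown below; n_l, the prime p, and the skip factor s follow the text, while the variable and function names are illustrative.

#include <stdint.h>

static uint32_t n_l;          /* total number of logical blocks               */
static uint32_t p_prime;      /* smallest prime larger than n_l (precomputed) */
static uint32_t skip = 1000;  /* skip factor s for 128 KB logical blocks      */
static uint32_t cur;          /* l_i, seeded with any value in [0, n_l)       */

/* Return the next logical block to visit: l_{i+1} = (l_i + s) % p, discarding
 * any value >= n_l.  Because gcd(s, p) = 1, every lbn appears exactly once
 * per period of n_l selections. */
uint32_t lbn_next(void)
{
    do {
        cur = (cur + skip) % p_prime;
    } while (cur >= n_l);
    return cur;
}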

The loop in Algorithm 1 (i.e., Steps 3 to 5) checks whether a logical block has related mapping information in the page-mapping table. This check becomes difficult if the flash-translation layer caches a partial mapping table. To address this problem, Algorithm 1 can adopt an optional bitmap lbMod[] of logical blocks. For any logical block at lbn l, lbMod[l] = 0 initially, and the flash-translation layer sets lbMod[l] = 1 if a write request modifies any of its logical pages. For example, in Figure 1(a), all bits of this bitmap are 1's except lbMod[3]. Garbage collection clears lbMod[l] after erasing the flash blocks related to the logical block at lbn l, because merging this logical block removes all its mapping information from the page-mapping table. With this bitmap, lbnHasPageMapping(l) at Step 5 reports TRUE if lbMod[l] = 1, or else reports FALSE.
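Packed one bit per logical block, the lbMod[] bitmap costs roughly n_l/8 bytes of RAM. A minimal sketch follows; MAX_LBN and the helper names are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define MAX_LBN 65536u                                /* illustrative capacity */

static uint8_t lb_mod[(MAX_LBN + 7) / 8];             /* all bits 0 initially  */

static inline void lb_mod_set(uint32_t l)   { lb_mod[l >> 3] |=  (uint8_t)(1u << (l & 7)); }  /* on a host write to lbn l  */
static inline void lb_mod_clear(uint32_t l) { lb_mod[l >> 3] &= (uint8_t)~(1u << (l & 7)); }  /* after merging lbn l       */
static inline bool lb_mod_test(uint32_t l)  { return (lb_mod[l >> 3] >> (l & 7)) & 1u; }      /* backs lbnHasPageMapping() */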

3.4.2. FTL-Specific Enhancement. On garbage collection, FAST erases one log block at a time, that is, the oldest log block. Thus, FAST can delay merging a logical block until a valid logical page of this logical block appears in the oldest log block. Consider that FAST has a very large number of log blocks and the host frequently modifies a logical block. On the one hand, FAST can indefinitely postpone merging this logical block. On the other hand, lazy wear leveling does not use this logical block for remapping because its page updates keep leaving information in the page-mapping table. As a result, the (flash) data blocks mapped to this logical block can never attract attention from either garbage collection or wear leveling.

A simple enhancement based on the bitmap lbMod[] deals with this problem. When FAST erases the oldest log block, for every piece of page data in this log block, regardless of whether it is valid or not, FAST finds the logical block number of this logical page and clears the corresponding bit in lbMod[], as if FAST did not delay merging logical blocks. Note that SAST does not require this enhancement, because, to improve log-block space utilization, SAST will not indefinitely delay merging logical blocks.

3.5. Lazy Wear Leveling and Page-Level Mapping

Although lazy wear leveling is primarily designed for hybrid mapping, its concept is applicable to page-level mapping. As in hybrid mapping, in page-level mapping, lazy wear leveling copies data having low update recency to senior blocks to prevent these blocks from aging further. However, unlike hybrid mapping, page-level mapping does not use logical blocks [Gupta et al. 2009], so lazy wear leveling needs a different strategy to find data having low update recency.

This study proposes using an invalidation bitmap. In this bitmap, there is one bit for each flash block, and each bit indicates whether a flash block recently received a page invalidation (i.e., 1) or not (i.e., 0). All the bits are 0 initially, and there is a pointer referring to the first bit. The bit of a flash block switches to 1 if any page in this block is updated (i.e., invalidated). Whenever lazy wear leveling finds the erase count of a victim block larger than the average by Δ, it advances the pointer and scans the bitmap. As the pointer advances, it clears bits of 1's until it encounters a bit of 0. Lazy wear leveling then copies valid data from the flash block owning this zero bit to the victim block.


Fig. 3. Erase counts of flash blocks right before the lazy wear-leveling algorithm performs (a) the first remapping operation and (b) the (n_bh + 1)-th remapping operation.

This scan-and-copy procedure repeats until it writes to all pages of the victim block. Notice that garbage-collection activities do not alter any bits in the bitmap.

The rationale behind the design is that in the presence of temporal localities of write, if a flash block does not receive page invalidations recently, then this block is unlikely to receive more page invalidations in the near future. The invalidation bitmap resides in RAM, and it requires one bit per flash block. Compared to the page-level mapping table, the space overhead of this bitmap is very limited.
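A sketch of this scan-and-copy procedure in C appears below; the bit-manipulation and page-copy helpers are hypothetical, and the code is meant only to illustrate the advancing pointer described above, not a production implementation.

#include <stdint.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 64u

extern uint32_t num_blocks;                       /* total number of flash blocks             */
extern uint8_t  inval_bm[];                       /* invalidation bitmap: 1 bit per block     */
static uint32_t hand;                             /* pointer that advances through the bitmap */

extern bool     bm_test(const uint8_t *bm, uint32_t i);
extern void     bm_clear(uint8_t *bm, uint32_t i);
/* Copy up to 'room' valid pages from src to dst; returns the number copied. */
extern uint32_t copy_valid_pages(uint32_t dst_pbn, uint32_t src_pbn, uint32_t room);

/* Fill the victim block with cold data: advance the pointer, clearing 1-bits
 * on the way, and copy valid pages from blocks whose bit is 0 until full. */
void migrate_cold_data(uint32_t victim_pbn)
{
    uint32_t room = PAGES_PER_BLOCK;
    while (room > 0) {
        hand = (hand + 1) % num_blocks;
        if (bm_test(inval_bm, hand)) {            /* recently invalidated: clear and skip */
            bm_clear(inval_bm, hand);
            continue;
        }
        room -= copy_valid_pages(victim_pbn, hand, room);
    }
}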

4. SELF TUNING FOR BLOCK-LEVEL WEAR LEVELING

Lazy wear leveling subjects the evenness of block wear to a threshold parameter Δ. A small value of Δ targets even wear in flash blocks but increases the frequency of data movement. This section presents a dynamic tuning strategy for Δ for achieving a good balance between wear evenness and overhead.

4.1. Overhead Analysis

Consider a piece of flash memory consisting of n_b physical blocks. Let immutable logical blocks map to n_bc out of these n_b physical blocks. Let the sizes of write requests be multiples of the block size, and let write requests be aligned to block boundaries. Suppose that the disk workload uniformly writes the mutable logical blocks. In other words, the flash-translation layer evenly increases the erase counts of the n_bh = n_b − n_bc physical blocks.

Let the function f(x) denote how many blocks garbage collection erases to process a workload that writes x logical blocks. Consider the case x = i × n_bh × Δ, where i is a nonnegative integer. As all request sizes are multiples of the block size and requests are block-aligned, erasing victim blocks does not cost garbage collection any overhead in copying valid data. Therefore, without wear leveling, we have

f(x) = x.

Now, consider wear leveling enabled. For ease of presentation, this analysis revises the lazy wear-leveling algorithm slightly: the revised algorithm compares the victim block's erase count against the smallest erase count instead of the average erase count. Figure 3(a) shows that right before lazy wear leveling performs the first remapping, garbage collection has uniformly accumulated n_bh × Δ erase counts in n_bh physical blocks. In the subsequent n_bh erase operations, garbage collection erases each of these n_bh physical blocks one more time and increases their erase counts to Δ + 1. Thus, lazy wear leveling conducts n_bh remapping operations for these physical blocks at the cost of erasing n_bh blocks. These remapping operations redirect garbage-collection activities to another n_bh physical blocks.


After these remapping operations, lazy wear leveling does not intervene again until garbage collection accumulates roughly another n_bh × Δ erase counts in these physical blocks. Figure 3(b) shows that lazy wear leveling is about to spend n_bh erase operations on remapping operations. Now let the function f′(x) be analogous to f(x), but with wear leveling enabled. We have

f′(x) = x + x/Δ = x + i × n_bh.

Under real-life workloads, the frequencies of erasing these n_bh blocks may not be uniform. Thus, f′(x) adopts a real-number coefficient K to take this into account:

f′(x) = x + i × n_bh × K.

The coefficient K depends on various conditions of flash management, such as flash geometry, host workloads, and flash-translation layer designs. For example, dynamic changes in temporal localities of write can increase K because the write pattern might start updating new logical blocks and neutralize the prior remapping operations on these blocks. Notice that the value of K can be measured at runtime, as will be explained in the next section.

Let the overhead function g(Δ) denote the overhead ratio with respect to Δ:

g(Δ) = (f′(x) − f(x)) / f(x) = (i × n_bh × K) / (i × n_bh × Δ) = K/Δ.

It shows that the overhead of wear leveling is inversely proportional to Δ. Now recall that lazy wear leveling compares victim blocks' erase counts against the average erase count rather than the smallest erase count. Thus, we use 2Δ as an approximation of the original Δ. Because both n_b and n_bh are constant, the difference between using the average and the smallest can be accounted for by a constant ratio, which is further included in the runtime-measurable coefficient K. Thus, we have

g(Δ) = K / (2Δ).    (1)

When Δ is small, a further decrease in Δ rapidly increases the overhead ratio. For example, decreasing Δ from 4 to 2 doubles the overhead ratio.

4.2. A Strategy of Tuning

Small Δ values are always preferred in terms of wear evenness. However, decreasing the Δ value could cause an unexpectedly large increase in overhead. The rest of this section introduces a Δ-tuning strategy based on the overhead growth rates.

Under realistic disk workloads, the coefficient K in g(Δ) may vary over time. Thus, wear leveling must first determine the coefficient K before using g(Δ) for Δ-tuning. This study proposes tuning Δ on a session-by-session basis. A session refers to a time interval in which lazy wear leveling contributed a predefined number of erase counts. Refer to this number as the session length. The basic idea is to find K_cur of the current session and use this value to find Δ_next for the next session.

The first session begins with Δ = 16 (in theory it can be any number). Let Δ_cur be the Δ value of the current session. Figure 4 illustrates the concept of the Δ-tuning procedure. During a runtime session, lazy wear leveling separately records the erase counts contributed by garbage collection and wear leveling. At the end of the current session, the first step (in Figure 4) computes the overhead ratio (f′(x) − f(x))/f(x), that is, g(Δ_cur), and solves K_cur of the current session using Equation (1), that is, K_cur = 2Δ_cur × g(Δ_cur). The second step uses g(Δ_next) = K_cur/(2Δ_next) to find Δ_next for the next session. Basically, lazy wear leveling tries to decrease Δ until the growth rate of the overhead ratio becomes equal to a user-defined limit λ. In other words, we are to find the Δ value at which the tangent slope to g(Δ_next) is λ. Let the unit of the overhead ratio be one percent.


Fig. 4. Computing Δ_next subject to the overhead growth limit λ for the next session according to Δ_cur and the overhead ratio g(Δ_cur) of the current session.

Fig. 5. Handling three write requests w1, w2, and w3 using (a) synchronized channels and (b) independent channels. In this example, using synchronized channels doubles the flash wear, while using independent channels results in unbalanced flash wear among channels.

Therefore, λ = −0.1 means that the overhead ratio increases from x% to (x + 0.1)% when Δ decreases from y to (y − 1). Now solve dg(Δ)/dΔ = λ/100 at Δ = Δ_next for the smallest Δ value subject to λ. Rewriting this equation, we have

Δ_next = sqrt((100 / −λ) × g(Δ_cur) × Δ_cur).

For example, when λ = −0.1, if the overhead ratio g(Δ_cur) and Δ_cur of the current session are 2.1% and 16, respectively, then Δ_next for the next session is

sqrt((100 / 0.1) × 2.1% × 16) = 18.3.

The Δ-tuning procedure uses the limit λ on the overhead-ratio growth rate and the session length. Because g(Δ) is very large when Δ is small, λ can be set to the boundary between near-linear and super-linear growth rates. Our experiments will show that −0.1 is a good choice of λ, and wear-leveling results are not sensitive to the lengths of sessions because workloads have temporal localities of write.
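The end-of-session computation can be written in a few lines of C. The sketch below assumes the per-session erase counts are tracked as described above; the function name is illustrative.

#include <math.h>

/* End-of-session tuning (Section 4.2).  gc_erases and wl_erases are the erase
 * counts contributed during the session by garbage collection and by wear
 * leveling, respectively; lambda is the growth-rate limit (e.g., -0.1). */
double tune_delta(double delta_cur, double gc_erases, double wl_erases, double lambda)
{
    double g_cur = wl_erases / gc_erases;   /* g(delta_cur) = (f'(x) - f(x)) / f(x)   */
    /* Closed form with K_cur = 2 * delta_cur * g(delta_cur) substituted:
     * delta_next = sqrt((100 / -lambda) * g(delta_cur) * delta_cur).       */
    return sqrt((100.0 / -lambda) * g_cur * delta_cur);
}

/* Example from the text: g(delta_cur) = 2.1% and delta_cur = 16 with
 * lambda = -0.1 give tune_delta(16, 1000, 21, -0.1) ~= 18.3. */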

5. CHANNEL-LEVEL WEAR LEVELING

5.1. Multichannel Architectures

Advanced solid-state disks use multichannel architectures for high data transfer rates [Agrawal et al. 2008; Kang et al. 2007; Seong et al. 2010; Park et al. 2010]. In this study, a channel stands for a logical unit which can individually handle flash commands and perform data transfer. Parallel hardware structures, such as gangs, interleaving groups, and flash planes, are part of channels because flash chips in these structures might not be individually programmable.

From the point of view of wear leveling, channels can be synchronized or independent. Figure 5 is an example. Let the mapping between logical pages and channels use the RAID-0-style striping.


Fig. 6. Aligning the lifetime expectancies of two channels C_i and C_j for channel-level wear leveling. (a) These two channels reach their end of life at different times. (b) Change the channel utilizations u_ci and u_cj to u′_ci and u′_cj, respectively, such that the lifetime difference becomes zero (i.e., d = 0).

Figure 5(a) depicts that all the channels write synchronously, even if a write request does not access all the channels. Lazy wear leveling directly applies to a set of synchronized channels because these channels are logically equivalent to a single channel. A major drawback of synchronizing channel operations is the reduced device lifetime. As Figure 5(a) shows, the channels write 16 flash pages to modify only eight logical pages. Independent channels need not copy unmodified data for synchronizing channel operations, as shown in Figure 5(b). However, using independent channels inevitably introduces unbalanced flash wear among channels.

This study focuses on independent channels because they alleviate the pressure of garbage collection and reduce flash wear compared to synchronized channels. Let every independent channel adopt an instance of flash-translation layer, and let every channel perform wear leveling on its own flash blocks. Provided that the block-level wear leveling is effective, the problem of channel-level wear leveling refers to how to balance the total block erase counts of all channels.

Our design of channel-level wear leveling respects the property of maximum parallelism [Shang et al. 2011] for the highest parallelism among page reads. A data layout satisfies maximum parallelism if and only if a set of consecutive logical pages is mapped to the largest number of channels. This study uses RAID-0-style striping as the initial mapping between logical pages and channels, and data updates and garbage collection do not change this mapping [Park et al. 2010].

5.2. Aligning Channel Lifetime Expectancies

Provided that block wear leveling is effective, the erase counts of blocks in the same channel will be close, and the wear of a channel can be indicated by the sum of all block erase counts in this channel. Recall that the utilization of a channel stands for the fraction of host data arriving at this channel. Even though data updates are out of place at the block level, they do not change the mapping between logical pages and channels, so temporal localities have affinity with channels. Thus, channel utilizations do not abruptly change, and the wear of channels increase at steady (but different) rates.

This study proposes adjusting channel utilizations to control the wear of channels for an eventually-even state of channel lifetimes. In other words, the idea is to project channels’ lifetime expectancies to the same time point. Figure 6 is an example of two channels Ci and Cj. Let every channel have the same total number of flash blocks nb.

Let a flash block endure ē write-erase cycles, and let the erase count of the channel C_i, denoted by e_ci, be the sum of all block erase counts in this channel. Let a channel reach its end of life when its erase count becomes ē × n_b. Let t be the current time, let t′ be an earlier time point, and let w be the total amount of data written to the flash storage during [t′, t). Let u_ci be the utilization of the channel C_i.


Table II. Symbol Definitions

Symbol | Description
w | The total amount of data written to the flash storage during [t′, t)
ē | The write-erase cycle limit of flash blocks
n_b | The total number of flash blocks in a channel
y | The total number of channels
C_i | The ith channel
e_ci | The sum of all block erase counts in the channel C_i
u_ci | The utilization of the channel C_i. Note that Σ_i u_ci = 1
u′_ci | The expected utilization of the channel C_i
r_i | The erase ratio of the channel C_i
x | The total number of stripes
S_i | The ith stripe
u_si | The utilization of the stripe S_i. Note that Σ_i u_si = 1
u_{i,j} | The utilization of the logical block at the stripe S_i and the channel C_j. Note that Σ_{i=0}^{x−1} u_{i,j} = u_cj and Σ_{j=0}^{y−1} u_{i,j} = u_si

Thus, in this time interval, the total amount of host data arriving at the channel C_i is u_ci × w. Let the erase counts of the channel C_i at time t′ and t be e_ci^t′ and e_ci^t, respectively. Let the erase ratio of C_i during [t′, t) be r_i, defined as r_i = (e_ci^t − e_ci^t′) / (u_ci × w). As Figure 6(a) shows, e_ci increases by r_i × u_ci × w = e_ci^t − e_ci^t′ in this time period. Table II is a summary of symbols.

Provided that channels' erase ratios and utilizations remain steady, the lifetime expectancies of the channels C_i and C_j will be t + (ē × n_b − e_ci^t) × (t − t′)/(r_i × u_ci × w) and t + (ē × n_b − e_cj^t) × (t − t′)/(r_j × u_cj × w), respectively. The lifetime difference d will be

d = (ē × n_b − e_ci^t) × (t − t′)/(r_i × u_ci × w) − (ē × n_b − e_cj^t) × (t − t′)/(r_j × u_cj × w).

To align these two channels' lifetime expectancies (i.e., d = 0), the channel wear-leveling algorithm computes the utilizations u′_ci and u′_cj which the channels C_i and C_j are expected to have after time t, respectively. Replacing u_ci, u_cj, and d in the preceding equation with u′_ci, u′_cj, and 0, respectively, produces

u′_cj = (r_i × (ē × n_b − e_cj^t)) / (r_j × (ē × n_b − e_ci^t)) × u′_ci.

Because the total utilization is 100%, we have u′_ci + u′_cj = 1. Now solve these two equations to obtain u′_ci and u′_cj. Figure 6(b) shows that, with these new expected utilizations u′_ci and u′_cj, the lifetime expectancies of these two channels will be the same. In the general case of y channels, solving the following system obtains the expected utilizations u′_c0 ... u′_c(y−1):

u′_ck = (r_0 × (ē × n_b − e_ck^t)) / (r_k × (ē × n_b − e_c0^t)) × u′_c0 for every k ∈ {0, 1, 2, ..., y − 1}, and Σ_{k=0}^{y−1} u′_ck = 1.

The next section will present a method that swaps logical blocks among channels to adjust channel utilizations for channel wear leveling.
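The system above has a simple closed form: each u′_ck is a fixed multiple of u′_c0, so normalizing the multiples yields the solution. A minimal C sketch follows; the parameter names are illustrative.

/* Solve the system in Section 5.2 for the expected channel utilizations.
 * e_bar_nb is the channel lifetime limit (e-bar * n_b), r[] the erase ratios,
 * ec[] the current channel erase counts; the result is written to u_exp[]. */
void expected_utilizations(int y, double e_bar_nb,
                           const double r[], const double ec[], double u_exp[])
{
    double sum = 0.0;
    for (int k = 0; k < y; k++) {
        /* coefficient of u'_c0 in u'_ck = r_0 (e_bar*n_b - e_ck) / (r_k (e_bar*n_b - e_c0)) */
        u_exp[k] = (r[0] * (e_bar_nb - ec[k])) / (r[k] * (e_bar_nb - ec[0]));
        sum += u_exp[k];
    }
    for (int k = 0; k < y; k++)
        u_exp[k] /= sum;                    /* enforce the sum of u'_ck = 1 */
}

With the erase ratios and erase counts of the numeric example in Section 5.3, the coefficient for the last channel evaluates to 1.63, matching the value worked out there.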

5.3. Adjusting Channel Utilizations

Independent channels adopt their own instances of flash-translation layer to manage their flash blocks. Suppose that the flash-translation layer is based on hybrid mapping.


Recall that the initial mapping between logical pages and channels is the RAID-0-style striping. Let logical blocks be numbered in the channel-major order. For example, if there are four channels and a logical block is as large as four pages, then the logical block at lbn 0 is in the first channel, and this logical block contains the logical pages at lpns 0, 4, 8, and 12. The logical block at lbn 2 is in the third channel, and it contains the logical pages at lpns 2, 6, 10, and 14. Let a stripe be a set of consecutive logical blocks starting from the first channel and ending at the last channel. For example, the first stripe contains the four logical blocks at lbns 0, 1, 2, and 3. Notice that these definitions of logical blocks and stripes are also applicable to page-level mapping because they are not related to space allocation in flash.

Because real workloads have temporal localities of write, swapping logical blocks among channels can manipulate channels' future utilizations. To retain the property of maximum parallelism, this swapping is confined to logical blocks of the same stripe. Let x be the total number of stripes. Let u_sj be the utilization of the stripe S_j. Thus, we have Σ_j u_sj = 1. Let u_{i,j} be the utilization of the logical block at stripe i and channel j. Therefore, we have Σ_{i=0}^{x−1} u_{i,j} = u_cj and Σ_{j=0}^{y−1} u_{i,j} = u_si.

This study proposes invoking channel wear leveling periodically. On each invocation, channel wear leveling computes the expected utilizations of the channels and then starts swapping logical blocks to minimize Σ_i |u_ci − u′_ci| over all channels. This problem of block swapping is intractable, even for each invocation of channel wear leveling. We can reduce any instance of the bin packing problem to this block-swapping problem. A key step of this reduction is to let an item of size s in the bin packing problem be a stripe which has only one logical block having a nonzero utilization s.

Channel wear leveling should also keep the total number of logical blocks swapped small. We found that, in real workloads, a stripe of high utilization usually has two logical blocks whose utilization difference is large, because frequently updated data are small and do not write to all channels [Chang 2010]. Thus, the swapping begins with the stripe whose utilization is the highest. The following procedure finds and swaps a pair of logical blocks (a code sketch appears after the list).

—Step 1. Find the two channels $C_m$ and $C_n$ which have the largest positive value of $(u_{c_m} - u'_{c_m})$ and the smallest negative value of $(u_{c_n} - u'_{c_n})$, respectively.
—Step 2. Find the stripe $S_i$ subject to the following constraints:
  (a) $S_i$ has the largest utilization among all stripes.
  (b) In this stripe $S_i$, the two logical blocks at $C_m$ and $C_n$ have not yet been swapped in the current invocation of channel-level wear leveling.
  (c) $u_{i,m} > u_{i,n}$ and $(u_{i,m} - u_{i,n}) \le \min(u_{c_m} - u'_{c_m}, |u_{c_n} - u'_{c_n}|)$.
—Step 3. Exchange the channel mapping of the two logical blocks found in Step 2.
—Step 4. Change $u_{c_m}$ and $u_{c_n}$ to $(u_{c_m} - (u_{i,m} - u_{i,n}))$ and $(u_{c_n} + (u_{i,m} - u_{i,n}))$, respectively.
—Step 5. Swap $u_{i,m}$ and $u_{i,n}$.
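The sketch below is our own Python rendering of one invocation of this procedure, not the authors' implementation. It assumes the per-block utilizations u[i][j], the current channel utilizations, and the expected channel utilizations computed from the expected-lifetime equations above are available; the swap limit is a hypothetical parameter.

```python
def channel_wear_leveling(u, u_chan, u_expected, swap_limit=16, eps=1e-9):
    """One invocation of the swap procedure (Steps 1 through 5, repeated).

    u          -- u[i][j]: utilization of the logical block at stripe i, channel j
    u_chan     -- current channel utilizations u_{c_j} (updated in place)
    u_expected -- expected channel utilizations u'_{c_j}
    Returns the list of (stripe, from_channel, to_channel) swaps to apply.
    """
    swaps, swapped = [], set()
    while len(swaps) < swap_limit:
        # Step 1: most over-utilized channel C_m and most under-utilized channel C_n.
        diff = [u_chan[j] - u_expected[j] for j in range(len(u_chan))]
        m = max(range(len(diff)), key=lambda j: diff[j])
        n = min(range(len(diff)), key=lambda j: diff[j])
        if diff[m] <= eps or diff[n] >= -eps:
            break                      # u_{c_i} already matches u'_{c_i} for every i
        # Step 2: highest-utilization stripe satisfying constraints (a)-(c).
        candidates = [i for i in range(len(u))
                      if (i, m, n) not in swapped
                      and u[i][m] > u[i][n]
                      and (u[i][m] - u[i][n]) <= min(diff[m], -diff[n])]
        if not candidates:
            break
        i = max(candidates, key=lambda s: sum(u[s]))
        # Steps 3-5: remap the two blocks and update the bookkeeping.
        delta = u[i][m] - u[i][n]
        swaps.append((i, m, n))
        swapped.add((i, m, n))
        u_chan[m] -= delta
        u_chan[n] += delta
        u[i][m], u[i][n] = u[i][n], u[i][m]
    return swaps
```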

In each invocation, channel wear leveling repeats Steps 1 through 5 until (1) $u_{c_i} = u'_{c_i}$ for every $i$, or (2) the total number of logical blocks swapped exceeds a predefined limit. Figure 7 is a numeric example of channel wear leveling. In this example, the channel lifetime limit $\bar{e}_{nb}$ is 10,000. Figure 7(a) shows the initial data layout and the utilizations of logical blocks, channels, and stripes. Channel wear leveling solves the expected channel utilizations using $u'_{c_3} = \frac{1.4 \times (10000-3000)}{1.0 \times (10000-4000)} = 1.63\,u'_{c_0}$, $u'_{c_2} = 1.07\,u'_{c_0}$, $u'_{c_1} = 1.27\,u'_{c_0}$, and $u'_{c_3} + u'_{c_2} + u'_{c_1} + u'_{c_0} = 1$. It then selects the stripe $S_0$, whose utilization is the highest, and swaps its two logical blocks at channels $C_2$ and $C_3$. This swap changes $u_{c_2}$ from 0.25 to 0.22 and $u_{c_3}$ from 0.30 to 0.33. Next, channel wear leveling selects


Fig. 7. Swapping logical blocks among channels for channel wear leveling.

Table III. Characteristics of the Experimental Disk Workloads

Workload  Operating System  File System  Logical Disk Size  Total Written  Avg. Req. Size  Disk Coverage†  Disk Coverage‡
PC        Windows XP        NTFS         40 GB              81.2 GB        11.5 KB         41.57%          48.54%
PM        Windows 7         NTFS         40 GB              43.3 GB        11.8 KB          2.93%           6.31%
MM        Windows CE        FAT32        20 GB              19.8 GB        59.6 KB         87.25%          87.26%
RND       Ubuntu 9          Ext4         16 GB              18.6 GB         4 KB           68.56%          99.61%

† Fraction of disk space written during workload generation (in terms of 512 B sectors).
‡ Fraction of disk space written during workload generation (in terms of 512 KB logical blocks).

Figure 7(b) shows the results after these swaps. The adjusted channel utilizations match their expected utilizations.

This study proposes caching the utilization information of a small collection of most-frequently written stripes. Our experiments will show that a small cache is sufficient for effective channel wear leveling.
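One way to realize such a cache is to keep utilization details only for the stripes with the highest write counts. The sketch below is our own illustration, not the paper's data structure; the class name and the capacity value are hypothetical.

```python
from collections import Counter

class StripeUtilCache:
    """Keep per-block utilization details only for the most-written stripes."""

    def __init__(self, capacity=64):            # capacity is a hypothetical setting
        self.capacity = capacity
        self.write_counts = Counter()            # stripe id -> writes observed
        self.details = {}                        # stripe id -> per-channel utilizations

    def record_write(self, stripe, per_channel_util):
        self.write_counts[stripe] += 1
        self.details[stripe] = per_channel_util
        if len(self.details) > self.capacity:
            # Evict the coldest cached stripe.
            coldest = min(self.details, key=lambda s: self.write_counts[s])
            del self.details[coldest]

    def hottest(self):
        """Stripes whose utilization details are available to channel wear leveling."""
        return sorted(self.details, key=lambda s: -self.write_counts[s])
```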

6. PERFORMANCE EVALUATION

6.1. Experimental Setup and Performance Metrics

We built a simulator and implemented various wear-leveling algorithms and flash-translation layers for evaluation. The simulator provides three options for the flash-translation layer: SAST [Park et al. 2008], FAST [Lee et al. 2007], and DFTL [Gupta et al. 2009]. The former two are representative designs of hybrid mapping, while the last uses page-level mapping. The simulator also implements the proposed lazy wear leveling, static wear leveling [Chang et al. 2010], and dual-pool wear leveling [Chang and Du 2009]. Static wear leveling is widely used in the industry [Micron 2008; Spansion 2008], while dual-pool wear leveling delivers better performance [Chang and Du 2009].

Our experiments adopted four types of disk workloads: PC, PM, MM, and RND (see Table III). The PC workload was collected from a 40GB hard drive in a Windows desktop over three months. The disk drive was formatted in NTFS. The user activities of this workload include Web surfing, word processing, video playback, and gaming, and its write pattern exhibits strong temporal locality. The PM workload was produced by a Windows desktop running the Postmark 1.5 benchmark [Katcher 1997] with the default settings, except that the total number of transactions was set to 2,800,000. This workload has intensive activities of creating, writing, and deleting small files. The MM workload was captured from a memory card of a Windows Mobile device. This workload repeatedly copied and deleted MP3 and video files to and from a 20GB memory card formatted in FAT32.


Table IV. Evaluation Results of Lazy Wear Leveling (LWL), Static Wear Leveling (SWL), and Dual-Pool Wear Leveling (DP) under the PC, PM, MM, and RND Workloads. "No WL" Stands for Not Using Wear Leveling

Workload  Algorithm  Largest EC  Smallest EC  Mean     STDDEV  Threshold  Stable
PC        no WL       939           0          270.1    283.1     —        no
          LWL         298         151          278.4     11.4    16        yes
          SWL         586          50          278.7     64.3    14        no
          DP          470         244          279.3     19.3    16        yes
PM        no WL      3960           0          270.3    885.5     —        no
          LWL         297         253          277.2     10.1    16        yes
          SWL         973          42          278.2     38.4    30        no
          DP          814         243          278.7     13.4    28        yes
MM        no WL       388           0          252.7     96.6     —        no
          LWL         299         198          260.3     11.4    16        yes
          SWL         338         195          259.8     17       4        no
          DP          338         227          254.8      6      14        no
RND       no WL      6746           0         6639.5    408.8     —        no
          LWL        6729        6108         6717.7     31.4    16        yes
          SWL        6743        6316         6663.7     38.4     2        no
          DP         6757        6661         6668.3      8.6     6        no

This workload has many long write bursts. The RND workload was collected from a Linux box running Iometer [Open Source Development Lab 2003] on a 16GB hard drive formatted in ext4. The settings of Iometer were 100% random write with 4KB write requests.

This study uses the standard deviation of all flash blocks' erase counts to indicate the evenness of flash wear: the smaller the standard deviation, the more even the flash wear. This study also considers the mean (i.e., the arithmetic average) of all erase counts. The difference between the means with and without wear leveling reveals the overhead of wear leveling. A good wear-leveling algorithm should achieve both a small standard deviation and a small mean.
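Both metrics are direct computations over the per-block erase counts. A minimal sketch follows; whether the paper uses the population or sample form of the standard deviation is not stated, so the choice below is an assumption, and the example counts are hypothetical.

```python
from statistics import mean, pstdev

def wear_metrics(erase_counts):
    """Return (mean, standard deviation) of the flash blocks' erase counts."""
    return mean(erase_counts), pstdev(erase_counts)

# The overhead of wear leveling is the increase of the mean erase count.
m_none, sd_none = wear_metrics([0, 100, 900, 80])      # hypothetical counts, no WL
m_wl, sd_wl = wear_metrics([260, 270, 280, 290])       # hypothetical counts, with WL
overhead = m_wl - m_none
```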

Unless explicitly specified, all experiments adopted the following default settings. The flash page size and block size were 4 KB and 512 KB, respectively, which is a typical MLC-flash geometry [Samsung Electronics 2008]. The input workload was the PC workload, and the FTL algorithm was FAST. The over-provisioning ratio was 2.5%, so the flash size under the PC workload was 40 GB × 1.025 = 41 GB. Each run of the experiments replayed the input workload until 4 TB of host data had been written. These replays help to differentiate the performance of the wear-leveling algorithms without otherwise altering the experiments.

6.2. Experimental Results: Block-Level Wear Leveling

6.2.1. Lazy Wear Leveling vs. Existing Approaches. This part of the experiment compares lazy wear leveling against static wear leveling and dual-pool wear leveling under the disk workloads. These three wear-leveling algorithms have different definitions of their thresholds. For a fair comparison, this experiment fixed the threshold Δ of lazy wear leveling at 16 and adjusted the other two algorithms' thresholds to align their final erase-count means with that of lazy wear leveling. This experiment also adopts stability as a metric. Let the stable interval of a wear-leveling algorithm be the longest time interval [t1, t2] in which the standard deviations at t1 and t2 are the same. A wear-leveling algorithm is stable in an experiment if its stable interval length keeps increasing during the experiment; otherwise, it is unstable.
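A direct way to evaluate this metric is to scan the sampled standard deviations and track the longest interval whose endpoints match. The sketch below is our own; the sampling procedure and the equality tolerance are assumptions.

```python
def longest_stable_interval(samples):
    """samples: list of (time, stddev) pairs recorded during the experiment.

    Returns the length of the longest interval [t1, t2] whose endpoint
    standard deviations are (approximately) the same.
    """
    best = 0.0
    first_seen = {}                   # rounded stddev -> earliest time observed
    for t, sd in samples:
        key = round(sd, 1)            # tolerance for "the same"; an assumption
        if key in first_seen:
            best = max(best, t - first_seen[key])
        else:
            first_seen[key] = t
    return best
```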

Table IV shows the experimental results. First, compare the results of using lazy wear leveling against those of not using wear leveling at all.


Fig. 8. (a) The final erase-count distributions of lazy wear leveling and static wear leveling under the PC workload (after writing 4 TB of data). (b) The runtime standard deviations of lazy wear leveling and dual-pool wear leveling under the MM workload.

The standard deviation of the PC workload is very large without wear leveling, and lazy wear leveling reduced the standard deviation by 96% (from 283 to 11) while increasing the mean by only 2.9% (from 270 to 278). This is because lazy wear leveling is very effective in the presence of temporal localities of write. Lazy wear leveling was even more successful under the PM workload, reducing the standard deviation by 99% (from 886 to 10), because the PM workload confines the write traffic to only 6.3% of the entire disk space and thus has better temporal locality than the PC workload. Compared to the PC and PM workloads, the MM workload has a relatively small standard deviation without wear leveling because it has many sequential, long write bursts. Lazy wear leveling is still useful in this case, reducing the standard deviation from 96 to 11. The RND workload has the largest standard deviation without wear leveling: even though its write pattern is uniformly random, the extremely high garbage-collection overhead under this workload exaggerated the imbalance in flash wear. With lazy wear leveling, the standard deviation decreased from 408 to 31.

Next, focus on the comparison among the wear-leveling algorithms. Lazy wear leveling outperformed static wear leveling in terms of wear evenness in all cases. Interestingly, static wear leveling was unstable under all workloads. Figure 8(a) shows that under the PC workload, the final erase-count distribution of static wear leveling is more imbalanced than that of lazy wear leveling. A closer inspection of static wear leveling's behavior revealed two causes of this performance difference. First, static wear leveling moves static data from one block to another, regardless of whether the target block is junior or senior. Under the PC workload, there was a 70% probability that static wear leveling would move data from a static block to a junior block. Second, static wear leveling does not prevent the flash-translation layer from writing new data to senior blocks, so senior blocks could repeatedly participate in garbage collection. In contrast, lazy wear leveling neither remaps data to a junior block nor allows the flash-translation layer to write new data to senior blocks.

Results in Table IV indicate that dual-pool wear leveling was unstable under the MM and RND workloads, while lazy wear leveling was stable. Figure 8(b) shows that the standard deviation of dual-pool wear leveling became worse than that of lazy wear leveling after the total amount of data written reached 13 TB. This is because flash blocks with the same wear information (either the same erase count or the same recent erase count) appear in first-in first-out order in the priority queues of dual-pool wear leveling. Thus, under the MM workload, writing large files can neutralize the prior wear-leveling efforts on a number of flash blocks (as mentioned in Section 3.4.1).


Fig. 9. Runtime values of Δ and standard deviations using the proposed dynamic Δ-tuning method under (a) the PC workload, (b) the MM workload, and (c) the RND workload. The X-axes indicate the total amounts of host data written to the flash-translation layer.

Under the RND workload, a not-recently-updated logical block has a better chance of being updated, and this behavior coincides with the first-in first-out order in the priority queues. Lazy wear leveling avoids this problem by using a nonlinear block-selection policy. Even though dual-pool wear leveling is unstable, Table IV shows that its overhead is smaller than that of lazy wear leveling. This is because, after performing data movement among blocks, dual-pool wear leveling hides these blocks for a while to see whether the data movement is effective in terms of wear leveling. This protection decreases the frequency of wear-leveling operations and avoids some unnecessary data movement. In contrast, lazy wear leveling selects not-recently-updated logical blocks for remapping, but under a random write pattern these logical blocks have a good chance of being updated in the near future.

Now focus on the space overhead of the three algorithms in terms of RAM footprints and flash space requirements. Let $n_b$ and $n_{lb}$ be the total numbers of physical blocks and logical blocks, respectively; note that $n_b > n_{lb}$. Suppose that storing an erase count uses $k$ bits. For RAM footprints, static wear leveling requires a block-erase bitmap of $n_b$ bits. Dual-pool wear leveling uses $k n_b$ bits to store all blocks' erase counts in RAM; it also requires five bit pyramids, each of which uses $n_b - 1$ bits, so its entire RAM footprint is $k n_b + 5(n_b - 1)$ bits. Lazy wear leveling uses only $k$ bits to store an average erase count in RAM; adopting the optional bitmap lbMod[] requires an extra $n_{lb}$ bits. Consider the experimental settings under the PC workload: we have $n_{lb} = 81{,}920$ and $n_b = 83{,}968$, and let $k$ be 16. From the previous discussion, the RAM footprints of dual-pool wear leveling, static wear leveling, and lazy wear leveling are 215 KB, 10.25 KB, and 16 bytes (plus 10 KB for the optional bitmap lbMod[]), respectively. For flash space requirements, dual-pool wear leveling requires dedicated flash blocks for storing erase counts, whereas lazy wear leveling stores erase counts in page spare areas and thus effectively costs no extra flash pages. We have successfully implemented lazy wear leveling in a real solid-state disk; interested readers are referred to Chang and Huang [2011].
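The quoted footprints of the block-erase bitmap, the dual-pool structures, and the optional lbMod[] bitmap follow directly from the formulas above; the short sketch below reproduces the arithmetic for the PC-workload settings (the bit-to-KB conversion helper is ours).

```python
def kib(bits):
    """Convert a bit count to kibibytes."""
    return bits / 8 / 1024

n_b, n_lb, k = 83968, 81920, 16           # PC-workload settings from the text

static_bits = n_b                          # block-erase bitmap: one bit per physical block
dual_pool_bits = k * n_b + 5 * (n_b - 1)   # erase counts plus five bit pyramids
lbmod_bits = n_lb                          # optional lbMod[] bitmap of lazy wear leveling

print(f"static:    {kib(static_bits):.2f} KB")      # 10.25 KB
print(f"dual-pool: {kib(dual_pool_bits):.2f} KB")   # about 215 KB
print(f"lbMod[]:   {kib(lbmod_bits):.2f} KB")       # 10 KB
```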

6.2.2. Dynamic Δ-Tuning for Lazy Wear Leveling. This experiment tested the proposed Δ-tuning method under the three workloads. The session length for Δ-tuning was 200, so Δ was adjusted after every 200 block erases performed by lazy wear leveling. The value of λ was −0.1. Figure 9 depicts the runtime values of Δ and the standard deviations during this experiment. The X-axes denote the total amounts of host data written to the flash-translation layer. These results offer useful insights into how different types of workloads require wear leveling. Figure 9(a) shows that under the PC workload, Δ and the standard deviation became stable after the workload had produced about 1.2 TB of data. At this point, the last flash block whose erase count was zero started contributing erase cycles. Afterward, every flash block had been involved in wear leveling, and Δ and the standard deviation steadily remained at around 80 and 50, respectively.
