Data Management and Access Policies for High-Performance Solid-State Disks


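Appendix 1

National Science Council, Executive Yuan: Subsidized Research Project

■ Final Report

□ Interim Progress Report

Data Management and Access Policies for High-Performance Solid-State Disks

Project type: ■ Individual project □ Integrated project

Project number: 98-2221-E-009-157-MY3

Execution period: August 1, 2011 to July 30, 2012

Executing institution and department: Department of Computer Science, National Chiao Tung University

Principal investigator: Li-Pin Chang (張立平)

Co-principal investigator:

Project participants: 毛超遠, 黃鼎傑, 黃聖閔, 洪政猷

Report type (submitted according to the approved budget list): □ Brief report ■ Full report

In addition to the final report, this project must also submit the following overseas travel report:

□ Report on an overseas business trip or study

□ Report on a business trip or study in mainland China

■ Report on attending an international academic conference

□ Overseas research report for an international cooperative research project

Handling: Except for controlled projects and the following cases, the report may be made publicly searchable immediately.

□ Involves patents or other intellectual property rights; publicly searchable after □ one year □ two years

October 24, 2012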


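Development of Performance Benchmarks and Tools for Storage Devices in Embedded Network Communication Devices

Project number: NSC 98-2220-E-009-048-

Execution period: August 1, 2009 to July 31, 2010

Principal investigator: Li-Pin Chang (張立平)

Project participants: 郭晉廷, 黃莉君, 黃偉杰, 黃義勛

Department of Computer Science, National Chiao Tung University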

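Summary

After nearly three years of rigorous market validation, the feasibility and advantages of solid-state disks have become increasingly clear. Solid-state disks are now clearly moving toward two extremes: memory cards in consumer electronics at one end, and high-performance or enterprise-grade solid-state disks at the other. Over its three years of work, this project studied the performance and lifetime issues of solid-state disks in depth and developed (1) a write-buffer management algorithm, (2) a multichannel management algorithm, and (3) a multichannel wear-leveling algorithm. This final report focuses on the third-year results (the wear-leveling algorithm); for the results of the first two years, refer to the interim reports of the previous years.

Keywords: solid-state disks, flash memory, write buffer, multichannel architectures, wear leveling.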

Abstract

Solid-state disks have succeeded over the past three years in the markets of consumer electronics, personal computing, and enterprise computing. There are two future directions for flash storage devices: consumer-level storage cards and high-performance solid-state disks. This project addresses the performance and lifetime issues of solid-state disks. Over the course of this three-year work, we have successfully developed 1) a write-buffer management algorithm, 2) a channel management algorithm, and 3) a wear-leveling algorithm for high-end flash storage. This final report focuses on the third-year result, i.e., the wear-leveling algorithm. For the results of the first two years of work, please refer to the mid-term reports.

Keywords: Solid-state disks, flash memory, write buffer, multichannel architectures, wear leveling.


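Self-Evaluation Form for the Final Report of an NSC-Subsidized Research Project

Please give an overall assessment of the degree to which the research content matches the original plan, the extent to which the expected goals were achieved, the academic or practical value of the research results (briefly describe the significance, value, impact, or potential for further development of the results), whether the results are suitable for publication in an academic journal or for a patent application, the main findings, and any other relevant value.

1. Please give an overall assessment of how well the research content matches the original plan and how well the expected goals were achieved.

■ Goals achieved

□ Goals not achieved (please explain; 100-character limit)

□ Experiment failed

□ Experiment interrupted for some reason

□ Other reasons

Explanation:

2. Publication of the research results in academic journals, patent applications, and so on:

Papers: ■ Published □ Unpublished manuscript □ In preparation □ None

Patents: □ Granted □ Pending □ None

Technology transfer: □ Transferred □ Under negotiation □ None

Other: (100-character limit)

The academic results of this project include three international conference papers and two international journal papers. These include the top-tier international conference Design Automation Conference and the top-tier international journal ACM Transactions on Embedded Computing Systems.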


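... value (briefly describe the significance, value, impact, or potential for further development of the results) (500-character limit)

Over its three years, this project followed the direction laid out in the proposal closely: it studied the key issues in the internal design of solid-state disks, including wear leveling, write buffering, and multichannel management. In terms of academic research, the results of this project were published as three international conference papers and two international journal papers (listed below). These include the top-tier international conference Design Automation Conference and the top-tier international journal ACM Transactions on Embedded Computing Systems. In terms of practical value, the results of this project also led to related industry-academia cooperation topics and results, such as the cooperation projects with Global Unichip (創意電子) in 2010 and with Lite-On IT (建興電子) in 2011. Both projects successfully improved the design of the flash-memory management algorithms in the industry partners' solid-state disk products and achieved further performance gains.

The results of this project have been published in the following papers: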

1. Li-Pin Chang, Tung-Yang Chou, and Li-Chun Huang, “An Adaptive, Low-Cost Wear-Leveling Algorithm for Multichannel Solid-State Disks,” ACM Transactions on Embedded Computing Systems, accepted for publication.

2. Li-Pin Chang and Chen-Yi Wen, “Reducing Asynchrony in Channel Garbage-Collection for Improving Internal Parallelism of SSDs,” ACM Transactions on Embedded Computing Systems, accepted for publication.

3. Li-Pin Chang, Yi-Hsun Huang, and Chen-Yi Wen, “On the Management of Multichannel Architectures of Solid-State Disks,” the 9th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), 2011.

4. Li-Pin Chang and Yo-Chuan Su, “Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks,” the 48th Design Automation Conference (DAC), 2011.

5. Li-Pin Chang and Li-Chun Huang, “A Low-Cost Wear-Leveling Algorithm for Block-Mapping Solid-State Disks,” ACM Conference on Languages, Compilers, Tools and Theory for Embedded Systems (ACM LCTES), 2011.


Final Report of 98-2221-E-009-157-MY3:

Data management and access policies for high-performance solid-state disks

Principal Investigator: Li-Pin Chang, National Chiao-Tung University

Solid-state disks have succeeded over the past three years in the markets of consumer electronics, personal computing, and enterprise computing. There are two future directions for flash storage devices: consumer-level storage cards and high-performance solid-state disks. This project addresses the performance and lifetime issues of solid-state disks. Over the course of this three-year work, we have successfully developed 1) a write-buffer management algorithm, 2) a channel management algorithm, and 3) a wear-leveling algorithm for high-end flash storage. This final report focuses on the third-year result, i.e., the wear-leveling algorithm. For the results of the first two years of work, please refer to the mid-term reports.

Additional Key Words and Phrases: Solid-state disks, flash memory, write buffer, multichannel architectures, wear leveling.

1. INTRODUCTION

Solid-state disks employ flash memory as their storage medium. The physical characteristics of flash memory differ from those of hard drives, necessitating new methods for data access. Solid-state disks hide flash memory from host systems by emulating a collection of logical sectors, allowing systems to switch from a hard drive to a solid-state disk without modifying any existing software or hardware. Solid-state disks are superior to traditional hard drives in terms of shock resistance, energy conservation, random-access performance, and heat dissipation, attracting vendors to deploy such storage devices in laptops, smart phones, and portable media players.

Flash memory is a kind of erase-before-write memory. Because any one part of flash memory can only withstand a limited number of write-erase cycles, approximately 100K cycles under the current technology [Samsung Electronics 2006], frequent erase operations can prematurely retire a region in flash memory. This limitation affects the lifetime of solid-state disks in applications such as laptops and desktop PCs, which write disks at very high frequencies. Even worse, recent advances in flash manufacturing technologies exacerbate this lifetime issue. In an attempt to break the entry-cost barrier, modern flash devices now use multilevel cells for double or even triple density. Compared to standard single-level-cell flash, multilevel-cell flash degrades the erase endurance by one or two orders of magnitude [Samsung Electronics 2008].

Without wear leveling, localities of data access inevitably degrade wear evenness of flash memory in solid-state disks. Partially wearing out a piece of flash memory not only decreases its total effective capacity, but also increases the frequency of flash erase for free-space management, which further speeds up the wearing out of the rest of the flash memory. A solid-state drive ceases to function when the amount of its worn-out space in flash exceeds what the drive can manage. Wear-leveling techniques ensure that the entire flash wears evenly, postponing the first appearance of a worn-out memory region. However, wear leveling is not free, as it moves data around in flash to prevent solid-state disks from excessively wearing any one part of the memory. As reported in [Chang et al. 2010], these extra data movements can increase the total number of erase operations by ten percent.

Wear-leveling algorithms include rules defining when data movement is necessary and where to move data to and from. These rules monitor wear in the entire flash, and intervene when the flash wear becomes unbalanced. Wear-leveling algorithms are part of the firmware of solid-state disks, and thus they are subject to crucial resource constraints on the RAM space and execution speeds of solid-state disks’ microcontrollers (or simply controllers)¹. Prior research explores various wear-leveling designs under such tight resource budgets, revealing three major design challenges: First, monitoring the entire flash’s wear requires considerable time and space overheads, which many controllers in present solid-state disks cannot afford. Second, algorithm tuning for host-workload adaptation and performance definition requires prior knowledge of flash access patterns, on-line human intervention, or both. Third, high implementation complexity discourages firmware programmers from adopting sophisticated algorithms.

Prior methods sort flash erase units in terms of their wear information. This requires efficient access to the wear information of arbitrary erase units, and thus these methods copy the wear information of the entire flash from flash to the RAM of the disk controllers. However, many controllers at the present time cannot afford this RAM space overhead. Chang and Du [Chang and Du 2009] proposed caching only portions of wear information in RAM. However, the miss penalty and write-back overhead of the cache can scale up the volume of flash-write traffic by up to 10%. Instead of storing the wear information of all flash erase units in RAM, Jung et al. [Jung et al. 2007] proposed using the average wear of large flash regions. Nevertheless, the low-resolution wear information suffers from distortion whenever flash wearing is severely biased. Chang et al. [Chang et al. 2010] introduced a bitmap that indicates whether a flash erase unit has been recently erased or not. However, using the recent erase history can blind wear-leveling algorithms because the recency and frequency of erase operations on flash erase units are mutually independent.

Existing wear-leveling designs subject wear evenness to tunable threshold parameters [Chang et al. 2010; Chang and Du 2009; Jung et al. 2007; Agrawal et al. 2008]. The system environment in which wear leveling takes place includes many conditions, such as flash-translation layer designs, flash geometry, and host disk workloads. Existing approaches require human intervention or prior knowledge of the system environment for threshold setting. However, manually tuned thresholds are problematic. A wear-leveling algorithm may perform well with a given threshold in one system environment, but with the same threshold it can cause unexpectedly high wear-leveling overhead or unsatisfactory wear evenness in a different system environment.

From a firmware point of view, implementation complexity primarily affects the applicability of wear-leveling algorithms. The dual-pool algorithm [Chang and Du 2009] uses five priority queues of wear information and a caching method to reduce the RAM footprints of these queues. The group-based algorithm [Jung et al. 2007] and the static wear-leveling algorithm [Chang et al. 2010] add extra data structures to maintain coarse-grained wear information and the recent history of flash wear, respectively. These approaches ignore the information already available in the disk-emulation algorithm, which is a firmware module accompanying wear leveling, and unnecessarily increase their design complexity.

This study presents a new wear-leveling design, called the lazy wear-leveling algorithm, to tackle the three design challenges mentioned above. First, this design stores only a RAM-resident counter indicating the average wear of the entire flash, achieving a tiny RAM footprint. Second, even though this algorithm uses a threshold parameter, it adopts an analytical model to estimate the overhead increase ratio with respect to different threshold settings, and then automatically selects a threshold for good balance between wear evenness and overhead. Third, the proposed algorithm utilizes the address-mapping information available in the disk-emulation algorithm, eliminating the need for adding extra data structures for wear leveling. Our approach is called lazy because it does not perform proactive data movement, in contrast to the static wear-leveling algorithm, and it tries not to intervene in flash wear unless wear leveling is cost effective.

¹ For example, the GP5086 SSD controller from Global Unichip was rated at 150 MHz and has 64 KB of RAM.

Modern solid-state disks are equipped with multiple channels for parallel flash operations. In this study, a channel refers to a logical unit that independently processes flash commands and transfers data. Multichannel designs boost write throughput but introduce unbalanced wear of flash erase units among channels. Prior work addresses this issue by dispatching write requests to channels on a page-by-page basis [Chang and Kuo 2002; Dirik and Jacob 2009] (a page is the smallest read/write unit of flash). Dispatching data at the page level requires page-level mapping, whose implementation requires considerable RAM space for large flash. Additionally, this approach could map logically consecutive data to the same channel and degrade the channel-level parallelism of sequential read requests. This study introduces a novel channel-level wear-leveling strategy based on the concept of reaching “eventually even” channel lifetimes. The basic idea is to align channels’ lifetime expectancies by re-mapping data among channels. The proposed approach has many benefits: 1) it does not require a channel-level threshold for wear leveling, 2) it incurs very limited overhead, and 3) it requires only a small RAM-resident data structure.

In summary, this study has the following contributions:

1. An efficient block wear-leveling algorithm with a tiny RAM footprint.

2. A dynamic threshold-adjusting strategy for block wear leveling.

3. An algorithm for wear leveling at the channel level.

The rest of this paper is organized as follows: Section 2 reviews flash characteristics and prior work on flash translation and wear leveling. Section 3 presents a block-level wear-leveling algorithm, and Section 4 describes an adaptive tuning strategy for this algorithm. Section 5 introduces a strategy for wear leveling at the channel level. Section 7 concludes this paper.

2. PROBLEM FORMULATION

2.1. Flash Management

2.1.1. Flash-Memory Characteristics. Solid-state disks use NAND flash memory (flash memory for short) as their storage medium. A piece of flash memory is a physical array of blocks, and each block contains the same number of pages. A flash page typically consists of 2048 plus 64 bytes. The 2048-byte portion stores user data, while the 64-byte portion is a spare area for mapping information, block aging information, error-correcting code, etc. Flash memory reads and writes in terms of pages, and overwriting a page requires erasing. Flash erases in terms of blocks, each of which consists of 64 pages. Under the current technology, a flash block can only sustain a limited number of write-erase cycles before it becomes unreliable. A single-level-cell flash block endures 100K cycles [Samsung Electronics 2006], while this limit is 10K or less in multilevel-cell flash [Samsung Electronics 2008].

Solid-state disks emulate disk geometry using a firmware layer called the flash-translation layer (FTL). FTLs update existing data out of place and invalidate old copies of the data to avoid erasing a flash block every time before rewriting a piece of data. Thus, FTLs require a mapping scheme to translate disk sector numbers into physical flash addresses. Updating data out of place consumes free space in flash, and FTLs must recycle flash space occupied by invalid data with erase operations. Before erasing a block, FTLs copy all valid data from this block to other free space. Garbage


Fig. 1. Two flash-translation layer designs based on hybrid mapping. (a) The set-associative mapping scheme with N=2 and K=2. Every group has two logical blocks and a group is allocated to up to two log blocks. (b) The fully-associative mapping scheme. All logical blocks are in one big group and all the log blocks are shared by the logical blocks in this big group.

2.1.2. Flash Translation Layers (FTLs). Flash-translation layers are part of the firmware in solid-state disks. They use RAM-resident index structures to translate logical page numbers into physical flash locations. Mapping resolutions have a direct impact on RAM-space requirements and write performance. Many entry-level flash-storage devices like USB thumb drives adopt block-level mapping, which requires only small mapping structures. However, low-resolution mapping suffers from slow response when servicing small write requests. Page-level mapping [Gupta et al. 2009] better handles random write requests, but requires large mapping structures, making its implementation difficult when flash capacity is high. This paper treats a logical page, which is as large as a flash page, as the smallest mapping unit.

Hybrid mapping combines both page and block mapping. This method groups consecutive logical pages into logical blocks as large as physical blocks. It maps logical blocks to physical blocks on a one-to-one basis using a block-mapping table. If a physical block is mapped to a logical block, then this physical block is called the data block of this logical block. Initially, physical blocks other than data blocks are spare blocks. Hybrid mapping uses spare blocks as log blocks to serve page updates, and uses a page-mapping table to redirect read requests to the latest versions of data in the log blocks.

Figures 1(a) and 1(b) show two different FTL designs using hybrid mapping. Hybrid mapping creates groups of logical blocks and allocates (flash) spare blocks as log blocks for these logical-block groups. Let lbn and pbn stand for a logical-block number and a physical-block number, respectively. Let lpn represent a logical-page number, and let disp be the block offset in terms of pages. The bold boxes stand for physical blocks, each of which has four pages. The numbers in the pages indicate the lpns of their stored data. The BMT and the PMT are the block-mapping table and the page-mapping table, respectively. In Fig. 1(a), every group has two logical blocks, while a group can be allocated up to two log blocks. This mapping scheme, developed by Park et al. [Park et al. 2008], is called set-associative mapping (SAST). This scheme uses two parameters N and K to specify the group size and the largest number of log blocks that a group can have, respectively. Figure 1(b) depicts another mapping scheme, developed by Lee et al. [Lee et al. 2007], called fully-associative mapping (FAST). This method puts all logical blocks in one big group, and has all the logical blocks in this big group share all the log blocks.
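The read path of hybrid mapping described above can be illustrated with a minimal sketch. It assumes the BMT is a dictionary from lbn to pbn and the PMT is a dictionary from lpn to a (pbn, page offset) pair inside a log block; the function name `translate` and these table layouts are illustrative assumptions, not the data structures of SAST or FAST.

```python
PAGES_PER_BLOCK = 4  # four pages per block, as drawn in Fig. 1


def translate(lpn, bmt, pmt):
    """Return the (pbn, page offset) holding the newest copy of lpn.

    bmt: dict lbn -> pbn of the data block (block-mapping table)
    pmt: dict lpn -> (pbn, offset) of the newest copy in a log block
    Both layouts are assumptions made for this sketch.
    """
    if lpn in pmt:
        # The page was updated out of place: its newest copy is in a log block.
        return pmt[lpn]
    # Unmodified page: split lpn into the logical block number and the
    # in-block offset (disp), then follow the block-mapping table.
    lbn, disp = divmod(lpn, PAGES_PER_BLOCK)
    return bmt[lbn], disp
```

A logical block with no PMT entries is served entirely from its data block, which is the property that the wear-leveling discussion later relies on.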

Table I. Comparison of existing algorithms for block-level wear leveling.

Algorithm | Principle | RAM-resident data structures required | Threshold tuning
Static wear leveling [Chang et al. 2010] | Static wear leveling | A block erase bitmap | Manual
Group wear leveling [Jung et al. 2007] | Hot-cold swapping | Average erase counts of block groups | Manual
Dual-pool wear leveling [Chang and Du 2009] | Cold-data migration | All blocks’ erase counts and their recent erase counts | Manual
Remaining-lifetime leveling [Agrawal et al. 2008] | Cold-data migration | All blocks’ age information (remaining lifetimes) and block-data temperature (update frequencies) | Manual
Lazy wear leveling (this study) | Cold-data migration | An average erase count of all blocks | Automatic

The FTL consumes spare blocks for serving incoming write requests. When the number of spare blocks becomes low, the FTL starts erasing log blocks. Before erasing a log block, the FTL finds all logical blocks related to the valid data in this log block. For each of these logical blocks, the FTL collects valid data from the log block and the data block of this logical block, copies these valid data to a new spare block, and re-maps the logical block to the copy-destination spare block. Finally, the FTL erases all the involved data blocks and log blocks into spare blocks. This procedure is referred to as a merge operation or garbage collection. For example, in Fig. 1(a), for garbage collection the FTL collects the valid data scattered in the data blocks at pbns 0 and 2 and in the log blocks at pbns 6 and 3, writes them to the spare blocks at pbns 7 and 8, and then erases the four old flash blocks at pbns 0, 2, 6, and 3 into spare blocks.

Hybrid-mapping FTLs exhibit some common behaviors in the garbage-collection process regardless of their designs; in particular, garbage collection never involves a data block if none of its page data have been updated. In Fig. 1(a), erasing the data block at pbn 5 cannot reclaim any free space. Similarly, in Fig. 1(b), erasing any of the log blocks does not involve the data block at pbn 5. This is a potential cause of uneven flash wear.
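The merge operation described above can be sketched as follows. This is a simplified illustration rather than the FTL of any cited design: it assumes that a single log block is merged at a time and that every PMT entry of the affected logical blocks points into that log block. The function name `merge` and the dictionary-based tables are hypothetical.

```python
PAGES_PER_BLOCK = 4  # four pages per block, as drawn in Fig. 1


def merge(log_pbn, flash, bmt, pmt, spare):
    """Merge one log block and return the list of erased pbns.

    flash: dict pbn -> list of PAGES_PER_BLOCK page payloads (None if free)
    bmt:   dict lbn -> pbn of the data block
    pmt:   dict lpn -> (pbn, offset) of the newest copy in a log block
    spare: list of erased pbns
    Assumption: all PMT entries of the affected logical blocks point
    into log_pbn, so only one log block is involved.
    """
    # Logical blocks that still have valid pages in the victim log block.
    lbns = sorted({lpn // PAGES_PER_BLOCK
                   for lpn, (pbn, _) in pmt.items() if pbn == log_pbn})
    erased = []
    for lbn in lbns:
        old_pbn = bmt[lbn]
        new_pbn = spare.pop(0)
        for disp in range(PAGES_PER_BLOCK):
            lpn = lbn * PAGES_PER_BLOCK + disp
            if lpn in pmt:
                # Newest copy lives in the log block.
                src_pbn, src_off = pmt.pop(lpn)
                flash[new_pbn][disp] = flash[src_pbn][src_off]
            else:
                # The data block still holds the valid copy.
                flash[new_pbn][disp] = flash[old_pbn][disp]
        bmt[lbn] = new_pbn  # re-map the logical block
        erased.append(old_pbn)
    erased.append(log_pbn)
    for pbn in erased:
        # Erase the old data blocks and the log block into spare blocks.
        flash[pbn] = [None] * PAGES_PER_BLOCK
        spare.append(pbn)
    return erased
```

In this sketch a data block whose logical block has no PMT entry never appears in `erased`, which mirrors the observation that garbage collection leaves unmodified data blocks alone.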

2.2. The Need for Wear Leveling

This section first introduces prior methods, discusses their drawbacks, and then points out how the proposed method addresses these shortcomings.

2.2.1. Block-Level Wear Leveling. Block-level wear leveling considers the wear evenness of a collection of flash blocks. Let the erase count of a flash block denote how many write-erase cycles this block has undergone. There have been three representative techniques for this problem: Static wear leveling, Hot-cold swapping, and Cold-data migration. Static wear leveling moves static/immutable data away from lesser worn flash blocks, encouraging the flash-translation layer to start erasing these blocks. Flash vendors including Micron [Micron 2008] and Spansion [Spansion 2008] recommend using this approach. Chang et al. [Chang et al. 2010] described a design of Static wear leveling. However, Chang and Du [Chang and Du 2009] found that Static wear leveling failed to achieve even block wear in the long term, because Static wear leveling could 1) move static/immutable data back and forth among lesser worn blocks and 2) erase a flash block even if its erase count is relatively large. Hot-cold swapping exchanges data in a lesser worn block with data from a badly worn block. Jung et al. [Jung et al. 2007] presented a hot-cold swapping design. However, because the oldest block has a very large (and perhaps still the largest) erase count, Chang and Du [Chang and Du 2009] found that Hot-cold swapping risks erasing the most worn flash block pathologically.

Cold-data migration relocates infrequently updated data (i.e., cold data) to excessively worn blocks to protect these blocks against garbage collection. Preventing badly-worn blocks from aging further is not equal to increasing the wear of lesser-worn blocks (as Static wear leveling does). This is because frequently updated data occupy only a small portion of the disk space. Prior work reported that the disk fullness of productive systems was only about forty percent [Agrawal et al. 2007]. In other words, to stop aging the small number of badly-worn flash blocks mapped to frequently updated data is more efficient than to start wearing the large number of lesser-worn flash blocks. Cold-data migration has been proven more effective than Static wear leveling and Hot-cold swapping [Agrawal et al. 2008; Chang and Du 2009]. Based on Cold-data migration, Agrawal et al. [Agrawal et al. 2008] proposed storing the remaining lifetimes and data temperatures of all flash blocks in RAM, and Chang and Du [Chang and Du 2009] proposed storing all blocks’ erase counts and their recent erase counts in RAM. These designs, however, impose large RAM-space requirements on disk controllers. Consider a 32 GB flash-storage device with 512 KB flash blocks: storing four bytes of wear information for every block costs the disk controller 256 KB of RAM. This figure is higher than what a typical disk controller can afford (64 KB, mentioned in the Introduction section). Reducing the RAM footprint is always beneficial no matter how much RAM the controller can afford, because the saved RAM space can be used by the mapping tables and the disk write buffer. Table I summarizes the comparison between prior methods and our algorithm. Our design stores only an average erase count in RAM, achieving a tiny RAM footprint. However, our design does not sacrifice wear-leveling performance for footprint reduction. Our experimental results will show that it outperforms existing methods in almost all cases.
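The 256 KB figure above follows from simple arithmetic, reproduced in the short sketch below. Binary units (1 GB = 2^30 bytes, 1 KB = 2^10 bytes) are assumed, and the helper name `wear_table_bytes` is hypothetical.

```python
GB, KB = 2**30, 2**10  # binary units are assumed here


def wear_table_bytes(capacity_bytes, block_bytes, bytes_per_entry):
    """RAM needed to keep one wear record per flash block."""
    num_blocks = capacity_bytes // block_bytes
    return num_blocks * bytes_per_entry


# 32 GB of flash in 512 KB blocks gives 65536 blocks; four bytes of wear
# information per block then occupy 256 KB of controller RAM.
per_block_table = wear_table_bytes(32 * GB, 512 * KB, 4)
print(per_block_table // KB, "KB")
```

By comparison, a design that keeps only one average erase count in RAM needs a single counter regardless of the number of blocks.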

Block-level wear leveling controls the wear variance in all flash blocks within an acceptable threshold. Existing approaches have different definitions of this variance: Chang et al. [Chang et al. 2010] adopted the ratio of the total erase count to the total number of the recently erased blocks, Jung et al. [Jung et al. 2007] and Chang and Du [Chang and Du 2009] used the difference among blocks’ erase counts, and Agrawal et al. [Agrawal et al. 2008] employed the difference among blocks’ remaining lifetimes. With a smaller threshold, wear leveling aims at more level wear in flash blocks, but inevitably introduces more frequent data movement. Wear-leveling overhead can be affected by many conditions of flash management, including the host workload, flash-translation layer, flash geometry, and flash capacity.

Unfortunately, it is almost impossible to find a universally applicable threshold setting for various applications of flash storage. For example, in our two tests of the dual-pool algorithm [Chang and Du 2009] with a threshold of 14, under the workloads of a multimedia appliance and a Windows desktop, it increased the total erase count by 0.8% and 3.9% while the resultant standard deviations of all blocks’ erase counts were 5.4 and 10.5, respectively. The latter case shows that the same threshold setting resulted in more data movement but did not achieve better wear evenness. This study identifies that the overhead of wear leveling is not linearly related to the threshold value, and that the overhead increases significantly when the threshold becomes smaller than a certain critical value. This critical threshold value differs across conditions of flash management. Thus, we propose subjecting the threshold value to the overhead increase ratio, and introduce a runtime strategy that dynamically sets the threshold value to the critical value.

2.2.2. Channel-Level Wear Leveling. In this study, a channel refers to a logical unit that independently processes flash commands and transfers data. Channel-level wear leveling is concerned with the wear evenness of flash blocks from different channels. This issue is closely related to the channel binding of logical pages, i.e., the allocation of free flash pages to host data. Dynamic channel binding globally manages free pages across all channels. Chang and Kuo [Chang and Kuo 2002] proposed dispatching page write requests to channels based on the update frequencies of these page data. Dirik and Jacob [Dirik and Jacob 2009] proposed allocating channels to incoming page write requests using the round-robin policy. Even though dynamic channel binding offers better flexibility in balancing the block wear across all channels, it has two drawbacks: 1) it adds extra channel-level mapping information to every logical page, resulting in larger mapping tables, and 2) it could map consecutive logical pages to the same channel, severely degrading the channel-level parallelism of sequential-read requests.

Instead of dynamic channel binding, this study considers static channel binding. Static channel binding uses fixed mapping between logical pages and channels. With static mapping, effectively every channel manages its free flash pages with its own instance of the flash-translation layer. The most common strategy for static channel binding is RAID-0-style striping [Agrawal et al. 2008; Park et al. 2010; Seong et al. 2010]. RAID-0 striping achieves the maximum channel-level parallelism in sequential reads because it maps a collection of consecutive logical pages to the largest number of channels. We must point out that RAID-0 striping cannot automatically achieve wear leveling at the channel level. This is because, as reported in [Chang 2010], hot data (frequently updated data) are small, usually between 4 KB and 16 KB. RAID-0 striping statically binds small and hot data to some particular channels, resulting in imbalanced write traffic among channels. We found that, under the disk workload of a Windows desktop, the largest and smallest per-channel shares of write traffic in a four-channel architecture were 28% and 23%, respectively. Thus, flash blocks from different channels wear at different rates. Extending the scope of block-level wear leveling to the entire storage device is not a feasible solution here, because it requires dynamic channel binding.
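The following sketch illustrates why static RAID-0-style binding can skew write traffic. The modulo rule is the usual form of page-level striping and is assumed here; the trace is a made-up example with one hot page, not the Windows desktop workload measured in this study, and the function names are hypothetical.

```python
from collections import Counter


def channel_of(lpn, num_channels):
    """Static RAID-0-style binding: consecutive pages rotate over channels."""
    return lpn % num_channels


def write_shares(write_trace, num_channels):
    """Fraction of page writes that each channel receives."""
    counts = Counter(channel_of(lpn, num_channels) for lpn in write_trace)
    total = len(write_trace)
    return [counts[c] / total for c in range(num_channels)]


# Hypothetical trace: one sequential pass over 16 pages, then 16 updates
# to the single hot page 5, which is statically bound to channel 1.
trace = list(range(16)) + [5] * 16
print(write_shares(trace, 4))  # [0.125, 0.625, 0.125, 0.125]
```

The sequential part of the trace is spread evenly, so any four consecutive pages can be read in parallel, while the hot page keeps loading the same channel.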

3. BLOCK-LEVEL WEAR LEVELING

This section presents an algorithm for wear leveling at the block level. This algorithm does not deal with channels so logically all flash blocks are in the same channel.

3.1. Observations

This section defines some key terms for the purpose of presenting our wear-leveling algorithm in later sections. Let the update recency of a logical block denote the time length between the current time and the latest update to this logical block. The update recency of a logical block is high if its latest update is more recent than the average update recency. Otherwise, its update recency is low. Analogously, let the erase recency of a physical block be the time length since the latest erase operation on this block. Thus, immediately after garbage collection erases a physical block, this block has the highest erase recency. A physical block is a senior block if its erase count is larger than the average erase count. Otherwise, it is a junior block.

Temporal localities of updating logical blocks affect the wear of physical blocks. As previously mentioned, if a physical block is mapped to an unmodified logical block, then garbage collection will avoid erasing this physical block. On the other hand, updates to logical blocks produce invalid data in flash blocks, and thus physical blocks mapped to recently modified logical blocks are good candidates for garbage collection. After a physical block is erased by garbage collection, it serves as either a data block or a log block. Either way, this physical block is again related to recently modified logical blocks. So if a physical block has a high erase recency, then it will quickly accumulate many erase counts. Conversely, physical blocks lose momentum in increasing their erase counts if they are mapped to logical blocks having low update recency.

Fig. 2. Physical blocks and their erase recency and erase counts. An upward arrow indicates that a block is recently increasing its erase count.

Figure 2 provides an example of eight physical blocks’ erase recency and erase counts. Upward arrows mark physical blocks recently increasing their erase counts, while an equal sign indicates otherwise. Block a is a senior block with a high erase recency, while block d is a senior block but has a low erase recency. The junior block h has a high erase recency, while the erase recency of the junior block e is low. Blocks should keep their erase counts close to the average. Two kinds of block wear can require intervention from wear leveling. First, the junior blocks e and f have not recently increased their erase counts. As their erase counts fall below the average, wear leveling has them start participating in garbage collection. Second, the senior blocks a and b are still increasing their erase counts. Wear leveling has garbage collection stop further wear in these two senior blocks.
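The two intervention cases above can be expressed compactly. The sketch below uses made-up erase counts for the six blocks named in the text (the actual values in Fig. 2 are not reproduced here) and reduces erase recency to a set of recently erased blocks, which is a simplification of the definition in Section 3.1. The function name `classify_blocks` is hypothetical.

```python
def classify_blocks(erase_counts, recently_erased):
    """Find the blocks whose wear calls for intervention.

    erase_counts:    dict block -> erase count
    recently_erased: set of blocks with a high erase recency
    """
    avg = sum(erase_counts.values()) / len(erase_counts)
    seniors = {b for b, c in erase_counts.items() if c > avg}
    juniors = set(erase_counts) - seniors
    return {
        # Senior blocks that are still being erased should stop wearing.
        "stop_wear": sorted(seniors & recently_erased),
        # Junior blocks that are not being erased should start wearing.
        "start_wear": sorted(juniors - recently_erased),
    }


# Hypothetical erase counts, chosen only to match the roles in the text.
counts = {"a": 12, "b": 11, "d": 10, "e": 3, "f": 4, "h": 5}
print(classify_blocks(counts, {"a", "b", "h"}))
# {'stop_wear': ['a', 'b'], 'start_wear': ['e', 'f']}
```

Blocks d and h need no intervention in this example: d is senior but no longer being erased, and h is junior but already accumulating erase counts.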

3.2. The Lazy Wear-Leveling Algorithm

This study proposes a new wear-leveling algorithm based on a simple principle: whenever a senior block’s erase recency becomes high, re-locate (i.e., re-map) a logical block having a low update recency to this senior block. This algorithm, called the lazy wear-leveling algorithm, is named after its passive reaction to excessive flash wear.

Lazy wear leveling must be aware of the recent wear of all senior blocks, because senior blocks retire before junior blocks. However, physical blocks boost their erase recency only via garbage collection. The flash-translation layer can notify Lazy wear leveling of its decision on victim selection. This way, Lazy wear leveling captures senior blocks whenever their erase recency becomes high without repeatedly checking all senior blocks’ wear information.

How to prevent senior blocks from further aging is closely related to the behaviors of garbage collection. As previously mentioned in Section 2.2, if a logical block has a low update recency, then garbage collection has no interest in erasing the flash block(s) mapped to it. Therefore, re-mapping logical blocks of low update recency is key to preventing senior blocks from aging further. Lazy wear leveling considers logical blocks not related to any page-mapping information as having low update recency, because recent updates to logical blocks leave mapping information in the page-mapping table. The logical blocks at lbn 3 in Fig. 1(a) and 1(b) are such examples.

To re-map a logical block from one physical block to another, Lazy wear leveling moves all valid data from the source physical block to the destination physical block. Junior blocks are the most common kind of source blocks, e.g., blocks e and f in Fig. 2, because storing immutable data keeps them away from garbage collection. As moving all valid data out of the source blocks makes them good candidates for garbage collection, selecting logical blocks for re-mapping is related to the wear of junior blocks. To give junior blocks even chances of wear, it is important to uniformly visit every logical block when selecting logical blocks for re-mapping.

Temporal localities of write change occasionally. New updates to a logical block can neutralize the latest re-mapping effort involving this logical block. In this case, Lazy wear leveling will be notified that a senior block is again selected as a victim of garbage collection, and will perform another re-mapping operation for this senior block.

Algorithm 1 The lazy wear-leveling algorithm

Input: v: the victim block for garbage collection
Output: p: a substitute for the original victim block v

1: e_v ← eraseCount(v)
2: if (e_v − e_avg) > ∆ then
3:     repeat
4:         l ← lbnNext()
5:     until lbnHasPageMapping(l) = FALSE
6:     erase(v)
7:     p ← pbn(l)
8:     copy(v, p); map(v, l)
9:     e_v ← e_v + 1
10:    e_avg ← updateAverage(e_avg, e_v)
11: else
12:    p ← v
13: end if
14: return p

3.3. Interacting with Flash-Translation Layers

This section describes how Lazy wear leveling interacts with its accompanying firmware module, the flash-translation layer. Algorithm 1 shows the pseudo code of Lazy wear leveling. The flash-translation layer calls Algorithm 1 after it moves all valid data out of a garbage-collection victim block and before it erases this block. The input of Algorithm 1 is v, the pbn of the victim block. This algorithm performs re-mapping whenever necessary, and then returns a pbn. Note that this output pbn may be different from the input pbn. The flash-translation layer erases the flash block at the pbn returned by Algorithm 1. The discussion in this section is based on hybrid mapping. See later sections for using Lazy wear leveling with page-level mapping.

For the example of SAST in Fig. 1(a), suppose that the flash-translation layer decides to merge data of the logical blocks at lbns 0 and 1. The flash-translation layer calls Algorithm 1 before erasing each of the four physical blocks at pbns 0, 2, 6, and 3. For the example of FAST in Fig. 1(b), because FAST recycles the oldest log block at a time, the flash-translation layer calls Algorithm 1 before erasing the log block at pbn 6 and the two related data blocks at pbns 0 and 2. The rest of this section is a detailed explanation of Algorithm 1.

In Algorithm 1, the flash-translation layer provides the subroutines with leading underscores, and wear leveling implements the rest. In Step 1, eraseCount() obtains the erase count $e_v$ of the victim block v by reading the victim block’s page spare area, in which the flash-translation layer stores the erase count. Step 2 compares $e_v$ against the average erase count $e_{avg}$. If $e_v$ is larger than $e_{avg}$ by a predefined threshold ∆, then Steps 3 through 10 will carry out a re-mapping operation. Otherwise, Steps 12 and 14 return the victim block v intact. The loop of Steps 3 through 5 finds a logical block whose update recency is low. Step 4 uses the subroutine lbnNext() to obtain l, the next logical block number to visit, and Step 5 calls the subroutine lbnHasPageMapping() to check if the logical block l has any related mapping information in the page-mapping table. As mentioned previously, to give junior blocks equal chances of getting erased, the subroutine lbnNext() must evenly visit all logical blocks. At this point, it is reasonable to assume that lbnNext() produces a linear enumeration of all lbns.


Steps 6 through 8 re-map the previously found logical block l. Step 6 erases the original victim block v. Step 7 uses the subroutine pbn() to identify the physical block p that the logical block l currently maps to. Step 8 copies the data of the logical block l from the physical block p to the original victim block v, and then re-maps the logical block l to the former victim block v using the subroutine map(). After this re-mapping, Step 9 increases $e_v$ since the former victim block v has been erased, and Step 10 updates the average erase count. Step 14 returns the physical block p, which the logical block l previously mapped to, to the flash-translation layer as a substitute for the original victim block v. Apart from the average erase count $e_{avg}$, Algorithm 1 is concerned only with the erase count of the victim block. Thus, this algorithm need not store all blocks’ erase counts in RAM. Instead, it reads the spare area of a victim block before garbage collection erases it.
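The steps of Algorithm 1 can be illustrated with a minimal Python sketch. The class `ToyFlash`, its fields, and the function name `lazy_wear_leveling` are hypothetical stand-ins for the flash-translation layer and are not part of the original design. The sketch also assumes that updateAverage() raises the average by 1/n for each extra erase over n physical blocks, and it gives up (returning v) if no logical block without page-mapping information exists, a case the pseudo code does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFlash:
    """Hypothetical stand-in for the flash-translation layer state."""
    erase_counts: list                 # erase count of each physical block (by pbn)
    lbn_to_pbn: dict                   # block-level mapping: lbn -> pbn
    page_mapped: set = field(default_factory=set)  # lbns with page-mapping info
    cursor: int = -1                   # state of the linear lbn enumeration

    def lbn_next(self):
        """Linear enumeration of all logical block numbers (Step 4)."""
        self.cursor = (self.cursor + 1) % len(self.lbn_to_pbn)
        return self.cursor


def lazy_wear_leveling(ftl, v, e_avg, delta):
    """Return (p, e_avg): p is the block the FTL should erase instead of v."""
    e_v = ftl.erase_counts[v]                        # Step 1
    if e_v - e_avg <= delta:                         # Step 2, else branch
        return v, e_avg                              # Steps 12 and 14
    for _ in range(len(ftl.lbn_to_pbn)):             # Steps 3 to 5, bounded (assumption)
        l = ftl.lbn_next()
        if l not in ftl.page_mapped:
            break
    else:
        return v, e_avg                              # no cold logical block found
    ftl.erase_counts[v] += 1                         # Steps 6 and 9: erase v
    p = ftl.lbn_to_pbn[l]                            # Step 7
    ftl.lbn_to_pbn[l] = v                            # Step 8: copy p -> v, re-map l
    e_avg += 1 / len(ftl.erase_counts)               # Step 10 (assumed update rule)
    return p, e_avg                                  # Step 14


ftl = ToyFlash(erase_counts=[20, 3, 4, 5],
               lbn_to_pbn={0: 1, 1: 2, 2: 3},
               page_mapped={0})
p, avg = lazy_wear_leveling(ftl, v=0, e_avg=8.0, delta=4)
print(p, ftl.lbn_to_pbn[1], avg)
```

In this toy run the senior block at pbn 0 receives the cold logical block at lbn 1, and the block at pbn 2 is handed back to the flash-translation layer for erasure.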

3.4. Wear-Leveling Enhancements

This section presents two enhancements that Lazy wear leveling can use. The first is specific to sequential-write workloads, and the second is particularly useful if the flash-translation layer is FAST.

3.4.1. Workload-Specific Enhancement. Algorithm 1 calls lbnNext() to select logical blocks for re-mapping. This function can linearly visit all logical blocks. However, this simple strategy could result in many ineffective re-mapping operations if the host workload consists of many long write bursts. This is because file systems try to allocate contiguous disk space when writing large files. This behavior coincides with linearly enumerating logical blocks, and can neutralize prior re-mapping operations on a set of consecutive logical blocks.

To solve this problem, this study proposes using a linear congruential generator [Rosen 2003] for logical-block selection. Let the total number of logical blocks be $n_l$. Let $p$ be the smallest prime number larger than $n_l$. Let $s$ be an integer with $0 < s < n_l$. Let $l_i$ be the logical-block number produced by the $i$-th selection, and let $l_0$ be an arbitrary number in $[0, n_l)$. Lazy wear leveling selects logical blocks using the following recurrence relation:

$$l_{i+1} = (l_i + s) \,\%\, p,$$

where % is the modulo operator. Notice that any $l_i \geq n_l$ are not used. Because $s$ and $p$ are prime to each other, the period of selecting the same logical-block number is exactly $n_l$. Here, $s$ is the skip factor, which should be larger than the total number of logical blocks that typical large files can have. This prevents Lazy wear leveling from successively visiting two logical blocks belonging to the same large file. Our current implementation adopts $s$=1000 when the logical block size is 128 KB.
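The recurrence above can be sketched in a few lines of Python. The helper names `smallest_prime_above` and `lbn_sequence` are illustrative only, and the trial-division prime search is an assumption chosen for brevity rather than the method of the original implementation.

```python
from itertools import islice


def smallest_prime_above(n):
    """Smallest prime strictly larger than n (trial division, fine for a sketch)."""
    def is_prime(k):
        if k < 2:
            return False
        d = 2
        while d * d <= k:
            if k % d == 0:
                return False
            d += 1
        return True

    p = n + 1
    while not is_prime(p):
        p += 1
    return p


def lbn_sequence(n_l, s, l0=0):
    """Yield lbns following l_{i+1} = (l_i + s) % p, skipping values >= n_l."""
    p = smallest_prime_above(n_l)
    l = l0
    while True:
        if l < n_l:          # values in [n_l, p) are not valid lbns and are skipped
            yield l
        l = (l + s) % p


print(list(islice(lbn_sequence(10, 3), 10)))  # [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
```

With ten logical blocks and a skip factor of 3, every block is visited exactly once in ten selections, and consecutive selections are never adjacent lbns.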

The loop in Algorithm 1 (i.e., Steps 3 to 5) checks whether a logical block has related mapping information in the page-mapping table. This check becomes difficult if the flash-translation layer caches a partial mapping table. To address this problem, Algorithm 1 can adopt an optional bitmap lbMod[] of logical blocks. For any logical block at lbn l, lbMod[l]=0 initially, and the flash-translation layer sets lbMod[l]=1 if a write request modifies any of its logical pages. For example, in Fig. 1(a) all bits of this bitmap are 1’s except lbMod[3]. Garbage collection clears lbMod[l] after erasing the flash blocks related to the logical block at lbn l, because merging this logical block removes all its mapping information from the page-mapping table. With this bitmap, lbnHasPageMapping(l) at Step 5 reports TRUE if lbMod[l]=1, or else reports FALSE.

3.4.2. FTL-Specific Enhancement. On garbage collection, FAST erases one log block at a time, i.e., the oldest log block. Thus FAST can delay merging a logical block until a valid logical page of this logical block appears in the oldest log block. Consider that FAST has a very large number of log blocks and the host frequently modifies a logical block. On the one hand, FAST can indefinitely postpone merging this logical block. On the other hand, Lazy wear leveling does not use this logical block for re-mapping because its page updates keep leaving information in the page-mapping table. As a result, the (flash) data blocks mapped to this logical block can never attract attention from either garbage collection or wear leveling.

A simple enhancement based on the bitmap lbMod[] deals with this problem. When FAST erases the oldest log block, for every piece of page data in this log block, regardless of whether it is valid or not, FAST finds the logical block number of this logical page and clears the corresponding bit in lbMod[], as if FAST did not delay merging logical blocks. Note that SAST does not require this enhancement, because to improve log-block space utilization SAST will not indefinitely delay merging logical blocks.
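The bookkeeping of the bitmap lbMod[], including the FAST-specific clearing rule, can be sketched as follows. The class and method names are hypothetical, and one byte per logical block is used instead of one bit purely for readability.

```python
class LbModBitmap:
    """One flag per logical block: 1 if the block may have page-mapping information."""

    def __init__(self, n_logical_blocks):
        self.bits = bytearray(n_logical_blocks)   # all flags are 0 initially

    def on_host_write(self, lbn):
        """A write request modified a logical page of this logical block."""
        self.bits[lbn] = 1

    def on_merge(self, lbn):
        """Garbage collection merged this logical block; its mapping info is gone."""
        self.bits[lbn] = 0

    def on_fast_log_block_erase(self, lbns_in_log_block):
        """FAST enhancement: clear the flag of every lbn that has page data,
        valid or not, in the log block being erased."""
        for lbn in lbns_in_log_block:
            self.bits[lbn] = 0

    def lbn_has_page_mapping(self, lbn):
        """Answer for Step 5 of Algorithm 1."""
        return self.bits[lbn] == 1


bm = LbModBitmap(4)          # four logical blocks, a made-up size
for lbn in (0, 1, 2):
    bm.on_host_write(lbn)
print(list(bm.bits))         # [1, 1, 1, 0]
```

Under this sketch a frequently modified logical block becomes eligible for re-mapping again as soon as the log block holding its pages is recycled, which is the effect the enhancement aims for.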

3.5. Lazy Wear Leveling and Page-Level Mapping

Although Lazy wear leveling is primarily designed for hybrid mapping, its concept is applicable to page-level mapping. As in hybrid mapping, in page-level mapping Lazy wear leveling copies data having low update recency to senior blocks to prevent these blocks from aging further. However, unlike hybrid mapping, page-level mapping does not use logical blocks [Gupta et al. 2009], so Lazy wear leveling needs a different strategy to find data having low update recency.

This study proposes using an invalidation bitmap. In this bitmap, one bit is for a flash block, and each bit indicates whether a flash block has recently received a page invalidation (i.e., 1) or not (i.e., 0). All the bits are 0 initially, and there is a pointer referring to the first bit. The bit of a flash block switches to 1 if any page in this block is updated (i.e., invalidated). Whenever Lazy wear leveling finds the erase count of a victim block larger than the average by ∆, it advances the pointer and scans the bitmap. As the pointer advances, it clears bits of 1’s until it encounters a bit of 0. Lazy wear leveling then copies valid data from the flash block owning this zero bit to the victim block. This scan-and-copy procedure repeats until it writes to all pages of the victim block. Notice that garbage-collection activities do not alter any bits in the bitmap.

The rationale behind the design is that, in the presence of temporal localities of write, if a flash block has not recently received page invalidations, then this block is unlikely to receive more page invalidations in the near future. The invalidation bitmap resides in RAM, and it requires one bit per flash block. Compared to the page-level mapping table, the space overhead of this bitmap is very limited.
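The pointer scan over the invalidation bitmap resembles a clock-style sweep and can be sketched as below. The class name and method names are hypothetical. The sketch examines the bit under the pointer before advancing, which is one possible reading of the description, and it omits the copying of valid data as well as any check that the selected block is not the victim itself.

```python
class InvalidationBitmap:
    """One flag per flash block: 1 if the block recently received a page invalidation."""

    def __init__(self, n_blocks):
        self.bits = [0] * n_blocks
        self.ptr = 0                      # the pointer starts at the first bit

    def on_page_invalidated(self, pbn):
        self.bits[pbn] = 1

    def next_cold_block(self):
        """Sweep forward, clearing 1 bits, and return the first block whose bit is 0."""
        n = len(self.bits)
        while True:
            pbn = self.ptr
            self.ptr = (self.ptr + 1) % n
            if self.bits[pbn] == 0:
                return pbn
            self.bits[pbn] = 0            # this block gets a second chance next time


bitmap = InvalidationBitmap(4)
bitmap.on_page_invalidated(0)
bitmap.on_page_invalidated(1)
first = bitmap.next_cold_block()
second = bitmap.next_cold_block()
print(first, second)
```

The sweep always terminates because every 1 bit it passes is cleared, so after at most one full lap a 0 bit must appear.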

4. SELF TUNING FOR BLOCK-LEVEL WEAR LEVELING

Lazy wear leveling subjects the evenness of block wear to a threshold parameter ∆. A small value of ∆ targets even wear in flash blocks but increases the frequency of data movement. This section presents a dynamic tuning strategy for ∆ to achieve a good balance between wear evenness and overhead.

4.1. Overhead Analysis

Consider a piece of flash memory consisting of $n_b$ physical blocks. Let immutable logical blocks map to $n_{bc}$ out of these $n_b$ physical blocks. Let the sizes of write requests be multiples of the block size, and let write requests be aligned to block boundaries. Suppose that the disk workload uniformly writes the mutable logical blocks. In other words, the flash-translation layer evenly increases the erase counts of the $n_{bh} = n_b - n_{bc}$ physical blocks.

Let the function $f(x)$ denote how many blocks garbage collection erases to process a workload that writes $x$ logical blocks. Consider the case $x = i \times n_{bh} \times \Delta$, where $i$ is a positive integer. Because write requests are block-aligned, erasing victim blocks does not cost garbage collection any overhead in copying valid data. Therefore, without wear leveling, we have

$$f(x) = x.$$

[Figure 3: two panels, (a) and (b), plotting erase counts (vertical axis) against physical block numbers (horizontal axis), with $n_{bh}$ marked on the horizontal axis.]

Fig. 3. Erase counts of flash blocks right before the lazy wear-leveling algorithm performs (a) the first re-mapping operation and (b) the $(n_{bh}+1)$-th re-mapping operation.

Now, consider wear leveling enabled. For ease of presentation, this simulation revises the lazy wear leveling algorithm slightly: the revised algorithm compares the victim block’s erase count against the smallest erase count instead of the average erase count. Figure 3(a) shows that, right before Lazy wear leveling performs the first re-mapping, garbage collection has uniformly accumulated $n_{bh} \times \Delta$ erase counts in $n_{bh}$ physical blocks. In the subsequent $n_{bh}$ erase operations, garbage collection erases each of these $n_{bh}$ physical blocks one more time, and increases their erase counts to $\Delta + 1$. Thus, Lazy wear leveling conducts $n_{bh}$ re-mapping operations for these physical blocks at the cost of erasing $n_{bh}$ blocks. These re-mapping operations re-direct garbage-collection activities to another $n_{bh}$ physical blocks. After these re-mapping operations, Lazy wear leveling stops until garbage collection accumulates another $n_{bh} \times \Delta$ erase counts in the new $n_{bh}$ physical blocks. Figure 3(b) shows that Lazy wear leveling is about to spend $n_{bh}$ erase operations for re-mapping operations. Now let the function $f'(x)$ be analogous to $f(x)$, but with wear leveling enabled. We have

$$f'(x) = x + \left\lfloor \frac{x}{\Delta} \right\rfloor = x + i \times n_{bh}.$$

Under real-life workloads, the frequencies of erasing these $n_{bh}$ blocks may not be uniform. Thus, $f'(x)$ adopts a real-number coefficient $K$ to take this into account:

$$f'(x) = x + i \times n_{bh} \times K.$$

The coefficient K depends on various conditions of flash management, such as flash geometry, host workloads, and flash-translation layer designs. For example, dynamic changes in temporal localities of write can increase K because the write pattern might start updating new logical blocks and neutralize the prior re-mapping operations on these blocks. Notice that the value of K can be measured at runtime, as will be explained in the next section.

Let the overhead function g(∆) denote the overhead ratio with respect to ∆:

$$g(\Delta) = \frac{f'(x) - f(x)}{f(x)} = \frac{i \times n_{bh} \times K}{i \times n_{bh} \times \Delta} = \frac{K}{\Delta}.$$

It shows that the overhead of wear leveling is inversely proportional to ∆. Now recall that Lazy wear leveling compares victim blocks’ erase counts against the average erase count rather than the smallest erase count. Thus, we use 2∆ as an approximation of the original ∆. Because both $n_b$ and $n_{bh}$ are constant, the difference between using the average and the smallest can be accounted for by a constant ratio, which is further included in the runtime-measurable coefficient K. Thus, we have

$$g(\Delta) = \frac{K}{2\Delta}. \qquad (1)$$

When ∆ is small, a further decrease in ∆ rapidly increases the overhead ratio. For example, decreasing ∆ from 4 to 2 doubles the overhead ratio.

[Figure 4: overhead ratios (vertical axis) versus values of ∆ (horizontal axis) along the curve g(∆) = K/(2∆), with $\Delta_{cur}$ and $\Delta_{next}$ marked. Step 1: solve $K_{cur}$ using $\Delta_{cur}$ and $g(\Delta_{cur})$. Step 2: find $\Delta_{next}$ at which the tangent slope to $g(\Delta_{next}) = K_{cur}/(2\Delta_{next})$ is λ.]

Fig. 4. Computing $\Delta_{next}$ subject to the overhead growth limit λ for the next session, according to $\Delta_{cur}$ and the overhead ratio $g(\Delta_{cur})$ of the current session.
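The idealized erase counts of the overhead analysis can be checked numerically with a short sketch. The function names are hypothetical, and the model below is the simplified one that compares against the smallest erase count, so it yields K/∆ rather than the form of Equation 1.

```python
def erases_without_wl(x):
    """f(x): block-aligned writes cost exactly one erase per logical block written."""
    return x


def erases_with_wl(x, delta, k=1.0):
    """f'(x): one extra erase per delta erases, scaled by the coefficient K."""
    return x + k * (x // delta)


def overhead_ratio(x, delta, k=1.0):
    """g(delta) = (f'(x) - f(x)) / f(x) for the simplified model."""
    base = erases_without_wl(x)
    return (erases_with_wl(x, delta, k) - base) / base


n_bh, i = 100, 3                       # made-up block count and number of rounds
for delta in (4, 2):
    x = i * n_bh * delta               # the case considered in the analysis
    print(delta, overhead_ratio(x, delta))
```

Halving ∆ from 4 to 2 doubles the ratio from 0.25 to 0.5 in this model, in line with the example in the text.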

4.2. A Strategy of Tuning ∆

Small ∆ values are always preferred in terms of wear evenness. However, decreasing the ∆ value can cause an unexpectedly large increase in overhead. The rest of this section introduces a ∆-tuning strategy based on the overhead growth rates.

Under realistic disk workloads, the coefficient K in g(∆) may vary over time. Thus, wear leveling must first determine the coefficient K before using g(∆) for ∆-tuning. This study proposes tuning ∆ on a session-by-session basis. A session refers to a time interval in which Lazy wear leveling contributes a pre-defined number of erase counts. Refer to this number as the session length. The basic idea is to find $K_{cur}$ of the current session and use this value to find $\Delta_{next}$ for the next session.

The first session begins with ∆=16 (in theory it can be any number). Let $\Delta_{cur}$ be the ∆ value of the current session. Figure 4 illustrates the concept of the ∆-tuning procedure. During a runtime session, Lazy wear leveling separately records the erase counts contributed by garbage collection and wear leveling. At the end of the current session, the first step (in Fig. 4) computes the overhead ratio $\frac{f'(x) - f(x)}{f(x)}$, i.e., $g(\Delta_{cur})$, and solves $K_{cur}$ of the current session using Equation 1, i.e., $K_{cur} = 2\Delta_{cur} \times g(\Delta_{cur})$. The second step uses $g(\Delta_{next}) = K_{cur}/(2\Delta_{next})$ to find $\Delta_{next}$ for the next session.

Basically, Lazy wear leveling tries to decrease ∆ until the growth rate of the overhead ratio becomes equal to a user-defined limit λ. In other words, we are to find the ∆ value at which the tangent slope to $g(\Delta_{next})$ is λ. Let the unit of the overhead ratio be one percent. Therefore, λ=-0.1 means that the overhead ratio increases from x% to (x+0.1)% when decreasing ∆ from y to (y-1). Now solve

$$\frac{d}{d\Delta} g(\Delta_{next}) = \frac{\lambda}{100}$$

for the smallest ∆ value subject to λ. Because $\frac{d}{d\Delta} g(\Delta) = -\frac{K_{cur}}{2\Delta^2}$ and $K_{cur} = 2\Delta_{cur} \times g(\Delta_{cur})$, rewriting this equation gives

$$\Delta_{next} = \sqrt{\frac{100}{-\lambda}} \sqrt{g(\Delta_{cur})\,\Delta_{cur}}.$$

For example, when λ=-0.1, if the overhead ratio $g(\Delta_{cur})$ and $\Delta_{cur}$ of the current session are 2.1% and 16, respectively, then $\Delta_{next}$ for the next session is

$$\sqrt{\frac{100}{0.1}} \sqrt{0.021 \times 16} \approx 18.3.$$

[Figure 5: three write requests w1={1,2}, w2={82,83}, and w3={34,35,36,37} on logical pages striped over four channels C1 to C4, handled with (a) synchronized channels and (b) independent channels.]

Fig. 5. Handling three write requests w1, w2, and w3 using (a) synchronized channels and (b) independent channels. In this example, using synchronized channels doubles the flash wear, while using independent channels results in unbalanced flash wear among channels.

The ∆-tuning procedure uses the limit on the overhead-ratio growth rates and the session length. Because g(∆) is very large when ∆ is small, λ can be set to the boundary between near-linear and super-linear growth rates. Our experiments will show that −0.1 is a good choice of λ, and wear-leveling results are not sensitive to the lengths of sessions because workloads have temporal localities of write.
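The per-session tuning step of Section 4.2 can be sketched as a single function. The name `next_delta` and its parameters are hypothetical. The sketch assumes that the two erase counters recorded during a session correspond to f(x) (garbage collection) and f′(x) − f(x) (wear leveling), and that the overhead ratio is handled as a fraction while λ is expressed in percent, as in the derivation above.

```python
import math


def next_delta(delta_cur, gc_erases, wl_erases, lam=-0.1):
    """Compute the threshold for the next session from this session's counters."""
    g_cur = wl_erases / gc_erases            # overhead ratio g(delta_cur), as a fraction
    k_cur = 2 * delta_cur * g_cur            # Equation 1 solved for K
    # Solve d/d(delta) [K / (2 * delta)] = lam / 100 for delta.
    return math.sqrt(100 / -lam) * math.sqrt(k_cur / 2)


# The example from the text: overhead ratio 2.1% at delta = 16, lambda = -0.1.
delta_next = next_delta(16, gc_erases=1000, wl_erases=21)
print(round(delta_next, 1))  # 18.3
```

At the returned threshold the slope of K/(2∆) equals λ/100, so a further decrease of ∆ would push the overhead growth rate past the user-defined limit.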

5. CHANNEL-LEVEL WEAR LEVELING

5.1. Multichannel Architectures

Advanced solid-state disks use multichannel architectures for high data transfer rates [Agrawal et al. 2008; Kang et al. 2007; Seong et al. 2010; Park et al. 2010]. In this study, a channel stands for a logical unit which can individually handle flash commands and perform data transfer. Parallel hardware structures such as gangs, interleaving groups, and flash planes are part of channels because flash chips in these structures might not be individually programmable.

From the point of view of wear leveling, channels can be synchronized or independent. Figure 5 is an example. Let the mapping between logical pages and channels use RAID-0-style striping. Figure 5(a) depicts that all the channels write synchronously even if a write request does not access all the channels. Lazy wear leveling directly applies to a set of synchronized channels because these channels are logically equivalent to a single channel. A major drawback of synchronizing channel operations is the reduced device lifetime. As Figure 5(a) shows, the channels write sixteen flash pages to modify only eight logical pages. Independent channels need not copy unmodified data for synchronizing channel operations, as shown in Figure 5(b). However, using independent channels inevitably introduces unbalanced flash wear among channels.

This study focuses on independent channels because they alleviate the pressure of garbage collection and reduce flash wear compared to synchronized channels. Let every independent channel adopt an instance of flash-translation layer, and let every channel perform wear leveling on its own flash blocks. Provided that the block-level wear leveling is effective, the problem of channel-level wear leveling refers to how to balance the total block erase counts of all channels.

Our design of channel-level wear leveling respects the property of maximum parallelism [Shang et al. 2011] for the highest parallelism among page reads. A data layout satisfies maximum parallelism if and only if a set of consecutive logical pages are mapped to the largest number of channels. This study uses RAID-0-style striping as the initial mapping between logical pages and channels, and data updates and garbage collection do not change this mapping [Park et al. 2010].


Table II. Symbol definitions.

Symbol         Description
$w$            The total amount of data written to the flash storage during $[t^-, t)$
$\bar{e}$      The write-erase cycle limit of flash blocks
$n_b$          The total number of flash blocks in a channel
$y$            The total number of channels
$C_i$          The $i$-th channel
$e_{c_i}$      The sum of all block erase counts in the channel $C_i$
$u_{c_i}$      The utilization of the channel $C_i$. Note that $\sum u_{c_i} = 1$
$u'_{c_i}$     The expected utilization of the channel $C_i$
$r_i$          The erase ratio of the channel $C_i$
$x$            The total number of stripes
$S_i$          The $i$-th stripe
$u_{s_i}$      The utilization of the stripe $S_i$. Note that $\sum u_{s_i} = 1$
$u_{i,j}$      The utilization of the logical block at the stripe $S_i$ and the channel $C_j$. Note that $\sum_{i=0}^{x-1} u_{i,j} = u_{c_j}$ and $\sum_{j=0}^{y-1} u_{i,j} = u_{s_i}$

Fig. 6. Aligning the lifetime expectancies of two channels $C_i$ and $C_j$ for channel-level wear leveling. (a) These two channels reach their end-of-life at different times. (b) Change channel utilizations $u_{c_i}$ and $u_{c_j}$ to $u'_{c_i}$ and $u'_{c_j}$, respectively, such that the lifetime difference becomes zero (i.e., d=0).

5.2. Aligning Channel Lifetime Expectancies

Provided that block wear leveling is effective, the erase counts of blocks in the same channel will be close, and the wear of a channel can be indicated by the sum of all block erase counts in this channel. Recall that the utilization of a channel stands for the fraction of host data arriving at this channel. Even though data updates are out of place at the block level, they do not change the mapping between logical pages and channels, so temporal localities have affinity with channels. Thus, channel utilizations do not change abruptly and the wear of channels increases at steady (but different) rates.

This study proposes adjusting channel utilizations to control the wear of channels for an “eventually even” state of channel lifetimes. In other words, the idea is to project channels’ lifetime expectancies to the same time point. Figure 6 is an example of two channels $C_i$ and $C_j$. Let every channel have the same total number of flash blocks $n_b$. Let a flash block endure $\bar{e}$ write-erase cycles, and let the erase count of the channel $C_i$, denoted by $e_{c_i}$, be the sum of all block erase counts in this channel. Let a channel reach its end of life when its erase count becomes $\bar{e} \times n_b$. Let $t$ be the current time, and let $w$ be the total amount of host data written in the time interval $[t^-, t)$. Let $u_{c_i} \leq 1$ be the utilization of the channel $C_i$. Thus, in this time interval the total amount of host data arriving at the channel $C_i$ is $u_{c_i} w$. Let the erase counts of the channel $C_i$ at time $t^-$ and $t$ be $e^{t^-}_{c_i}$ and $e^{t}_{c_i}$, respectively. Let the erase ratio of $C_i$ during $[t^-, t)$ be $r_i$, defined as

$$r_i = \frac{e^{t}_{c_i} - e^{t^-}_{c_i}}{u_{c_i} w}.$$

As Fig. 6(a) shows, $e_{c_i}$ increases by $r_i u_{c_i} w = e^{t}_{c_i} - e^{t^-}_{c_i}$ in this time period.

Provided that channels’ erase ratios and utilizations remain steady, the lifetime expectancies of the channels $C_i$ and $C_j$ will be $t + (\bar{e} n_b - e^{t}_{c_i}) \frac{t - t^-}{r_i u_{c_i} w}$ and $t + (\bar{e} n_b - e^{t}_{c_j}) \frac{t - t^-}{r_j u_{c_j} w}$, respectively. The lifetime difference $d$ will be

$$d = (\bar{e} n_b - e^{t}_{c_i}) \frac{t - t^-}{r_i u_{c_i} w} - (\bar{e} n_b - e^{t}_{c_j}) \frac{t - t^-}{r_j u_{c_j} w}.$$

To align these two channels’ lifetime expectancies (i.e., d=0), the channel wear-leveling algorithm computes the utilizations $u'_{c_i}$ and $u'_{c_j}$ which the channels $C_i$ and $C_j$ are expected to have after the time $t$, respectively. Replacing $u_{c_i}$, $u_{c_j}$, and $d$ in the equation above with $u'_{c_i}$, $u'_{c_j}$, and 0, respectively, produces

$$u'_{c_j} = \frac{r_i (\bar{e} n_b - e^{t}_{c_j})}{r_j (\bar{e} n_b - e^{t}_{c_i})} u'_{c_i}.$$

Because the total utilization is 100%, we have $u'_{c_i} + u'_{c_j} = 1$. Now solve these two equations to obtain $u'_{c_i}$ and $u'_{c_j}$. Figure 6(b) shows that, with these new expected utilizations $u'_{c_i}$ and $u'_{c_j}$, the lifetime expectancies of these two channels will be the same. In the general case of $y$ channels, solving the following system obtains the expected utilizations $u'_{c_0}, \ldots, u'_{c_{y-1}}$:

$$\begin{cases} u'_{c_k} = \dfrac{r_0 (\bar{e} n_b - e^{t}_{c_k})}{r_k (\bar{e} n_b - e^{t}_{c_0})} u'_{c_0}, & \forall k \in \{0, 1, 2, \ldots, y-1\} \\[2ex] \sum_{k=0}^{y-1} u'_{c_k} = 1 \end{cases}$$

The next section will present a method that swaps logical blocks among channels to adjust channel utilizations for channel wear leveling.
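The system for the expected utilizations has a closed-form solution, which the following sketch computes. The function name and argument names are hypothetical. The input values are the ones used in the numeric example of Fig. 7 later in this section (channel lifetime limit 10,000, the erase count and erase ratio of each channel as read from the text and the figure).

```python
def expected_utilizations(erase_counts, erase_ratios, lifetime_limit):
    """Solve u'_ck = r_0 (L - e_ck) / (r_k (L - e_c0)) * u'_c0 with sum(u'_ck) = 1."""
    remaining = [lifetime_limit - e for e in erase_counts]
    r0, rem0 = erase_ratios[0], remaining[0]
    # Coefficient of u'_c0 for every channel k (it is 1 for k = 0).
    coeffs = [(r0 * rem_k) / (r_k * rem0)
              for r_k, rem_k in zip(erase_ratios, remaining)]
    u0 = 1.0 / sum(coeffs)
    return [c * u0 for c in coeffs]


u_expected = expected_utilizations(
    erase_counts=[4000, 4000, 4500, 3000],   # e_c0 .. e_c3
    erase_ratios=[1.40, 1.10, 1.20, 1.00],   # r_0 .. r_3
    lifetime_limit=10000,                    # channel lifetime limit
)
print([round(u, 2) for u in u_expected])
```

The result rounds to 0.20, 0.26, 0.21, and 0.33 for the four channels, and the intermediate coefficients are approximately 1.27, 1.07, and 1.63, matching the example in the text.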

5.3. Adjusting Channel Utilizations

Independent channels adopt their own instances of flash-translation layer to manage their flash blocks. Suppose that the flash-translation layer is based on hybrid mapping. Recall that the initial mapping between logical pages and channels is the RAID-0-style striping. Let logical blocks be numbered in the channel-major order. For example, if there are four channels and a logical block is as large as four pages, then the logical block at lbn 0 is in the first channel and this logical block contains the logical pages at lpns 0, 4, 8, and 12. The logical block at lbn 2 is in the third channel and it contains the logical pages at lpns 2, 6, 10, and 14. Let a stripe be a set of consecutive logical blocks starting from the first channel and ending at the last channel. For example, the first stripe contains the four logical blocks at lbns 0, 1, 2, and 3. Notice that these definitions of logical blocks and stripes are also applicable to page-level mapping because they are not related to space allocation in flash.
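The channel-major numbering can be made concrete with two small helpers. Their names are hypothetical, and the extension to stripes beyond the first one is an assumption that follows from the definitions above rather than something the text states explicitly.

```python
def locate(lpn, n_channels, pages_per_block):
    """Map a logical page number to (channel, stripe, lbn) under RAID-0-style striping."""
    channel = lpn % n_channels
    stripe = lpn // (n_channels * pages_per_block)
    lbn = stripe * n_channels + channel        # channel-major numbering
    return channel, stripe, lbn


def pages_of(lbn, n_channels, pages_per_block):
    """List the logical page numbers contained in a logical block."""
    stripe, channel = divmod(lbn, n_channels)
    base = stripe * n_channels * pages_per_block + channel
    return [base + k * n_channels for k in range(pages_per_block)]


print(pages_of(0, 4, 4))   # [0, 4, 8, 12]
print(pages_of(2, 4, 4))   # [2, 6, 10, 14]
print(locate(6, 4, 4))     # (2, 0, 2)
```

Because the channel of a page depends only on its page number modulo the channel count, consecutive logical pages always fall on different channels, which is what the property of maximum parallelism requires.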

Because real workloads have temporal localities of write, swapping logical blocks among channels can manipulate channels’ future utilizations. To retain the property of maximum parallelism, this swapping is confined to logical blocks of the same stripe. Let $x$ be the total number of stripes. Let $u_{s_j}$ be the utilization of the stripe $S_j$. Thus, we have $\sum u_{s_j} = 1$. Let $u_{i,j}$ be the utilization of the logical block at the stripe $i$ and the channel $j$. Therefore, we have $\sum_{i=0}^{x-1} u_{i,j} = u_{c_j}$ and $\sum_{j=0}^{y-1} u_{i,j} = u_{s_i}$.

This study proposes invoking channel wear leveling periodically. On each invocation, channel wear leveling computes the expected utilizations of channels, and then starts swapping logical blocks for minimizing $\sum_{i=0}^{y-1} |u_{c_i} - u'_{c_i}|$. This problem of block swapping is intractable even for each invocation of channel wear leveling. We can reduce any instance of the bin packing problem to this block-swapping problem. A key step of this reduction is to let an item of size $s$ in the bin packing problem be a stripe which has only one logical block having a non-zero utilization $s$.

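[Figure 7: data layout and utilizations of logical blocks, channels, and stripes for four channels. Values recoverable from the figure for the channels C0 to C3: erase counts 4000, 4000, 4500, 3000; erase ratios 1.40, 1.10, 1.20, 1.00; channel utilizations before the swap 0.20, 0.25, 0.25, 0.30; expected utilizations 0.20, 0.26, 0.21, 0.33.]

Fig. 7. Swapping logical blocks among channels for channel wear leveling. (a) Before the swap and (b) after the swap.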

Channel wear leveling should reduce the total number of logical blocks swapped. We found that, in real workloads a stripe of a high utilization usually has two logical blocks whose utilization difference is large. This is because frequently updated data are small and they do not write to all channels [Chang 2010]. Thus, the swapping begins with the stripe whose utilization is the highest. The following is a procedure to find and swap a pair of logical blocks:

Step 1: Find the two channels $C_m$ and $C_n$ which have the largest positive value of $(u_{c_m} - u'_{c_m})$ and the smallest negative value of $(u_{c_n} - u'_{c_n})$, respectively.

Step 2: Find the stripe $S_i$ subject to the following constraints:

(a) $S_i$ has the largest utilization among all stripes.

(b) In this stripe $S_i$, the two logical blocks at $C_m$ and $C_n$ have not yet been swapped in the current invocation of channel-level wear leveling.

(c) $u_{i,m} > u_{i,n}$ and $(u_{i,m} - u_{i,n}) \leq \min(u_{c_m} - u'_{c_m}, |u_{c_n} - u'_{c_n}|)$.

Step 3: Exchange the channel mapping of the two logical blocks found in Step 2.

Step 4: Change $u_{c_m}$ and $u_{c_n}$ to $(u_{c_m} - (u_{i,m} - u_{i,n}))$ and $(u_{c_n} + (u_{i,m} - u_{i,n}))$, respectively.

Step 5: Swap $u_{i,m}$ and $u_{i,n}$.

In each invocation, channel wear leveling repeats Steps 1 through 5 until 1) $u_{c_i} = u'_{c_i}$ for every $i$ or 2) the total number of logical blocks swapped is larger than a pre-defined limit. Figure 7 is a numeric example of channel wear leveling. In this example, the channel lifetime limit $\bar{e} n_b$ is 10,000. Figure 7(a) shows the initial data layout and utilizations of logical blocks, channels, and stripes. Channel wear leveling solves the expected channel utilizations using $u'_{c_3} = \frac{1.4 \times (10000 - 3000)}{1.0 \times (10000 - 4000)} u'_{c_0} = 1.63 u'_{c_0}$, $u'_{c_2} = 1.07 u'_{c_0}$, $u'_{c_1} = 1.27 u'_{c_0}$, and $u'_{c_3} + u'_{c_2} + u'_{c_1} + u'_{c_0} = 1$. It then selects the stripe $S_0$ whose utilization is the highest, and swaps its two logical blocks at the channels $C_2$ and $C_3$. This swap changes $u_{c_2}$ from 0.25 to 0.22 and $u_{c_3}$ from 0.3 to 0.33. Next, channel wear leveling selects the stripe $S_3$ whose utilization is the second highest and swaps two more logical blocks. Figure 7(b) shows the results after these swaps. The adjusted channel utilizations match their expected utilizations.

This study proposes caching the utilization information of a small collection of most-frequently written stripes. Our experiments will show that a small cache is sufficient for effective channel wear leveling.
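Steps 1 through 5 of the swapping procedure can be sketched as a greedy loop. The function name, the tolerance `eps`, and the stripe utilizations in the example are all made up for illustration. The sketch reads constraint (b) as "neither of the two blocks has been swapped before in this invocation", and it stops when no stripe satisfies the constraints, a case the procedure leaves open.

```python
def channel_wear_leveling(u, expected, max_swaps=8, eps=1e-9):
    """Greedily swap logical blocks within stripes; u[i][j] is modified in place."""
    x, y = len(u), len(u[0])
    chan = [sum(u[i][j] for i in range(x)) for j in range(y)]   # u_cj
    stripes = sorted(range(x), key=lambda i: -sum(u[i]))        # highest u_si first
    swapped, swaps = set(), []
    while len(swaps) < max_swaps:
        diff = [chan[j] - expected[j] for j in range(y)]
        m = max(range(y), key=lambda j: diff[j])                # Step 1: most over-used
        n = min(range(y), key=lambda j: diff[j])                # Step 1: most under-used
        if diff[m] <= eps or diff[n] >= -eps:
            break                                               # utilizations match
        budget = min(diff[m], -diff[n])
        for i in stripes:                                       # Step 2
            if (i, m) in swapped or (i, n) in swapped:
                continue                                        # constraint (b)
            gap = u[i][m] - u[i][n]
            if 0 < gap <= budget + eps:                         # constraint (c)
                break
        else:
            break                                               # no stripe qualifies
        chan[m] -= gap                                          # Step 4
        chan[n] += gap
        u[i][m], u[i][n] = u[i][n], u[i][m]                     # Steps 3 and 5
        swapped.update({(i, m), (i, n)})
        swaps.append((i, m, n))
    return swaps, chan


u = [[0.30, 0.10], [0.22, 0.12], [0.06, 0.20]]   # three stripes, two channels
swaps, chan = channel_wear_leveling(u, expected=[0.48, 0.52])
print(swaps, [round(c, 2) for c in chan])
```

In this toy run the most-utilized stripe is skipped because its gap of 0.20 exceeds the remaining imbalance of 0.10, and a single swap in the second stripe brings both channels to their expected utilizations.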

6. CONCLUSION

This study tackles three problems of wear leveling: block-level wear leveling, adaptive tuning for block wear leveling, and channel-level wear leveling. Block-level wear leveling monitors the wear of all flash blocks and intervenes when block wear becomes imbalanced. The tuning of block-level wear leveling seeks a good balance between wear evenness and overhead under various workloads. Channel-level wear leveling aims at even channel lifetimes for maximizing the device-level lifespan.

This study presents Lazy wear leveling for block-level wear leveling. Lazy wear leveling prevents senior blocks from further aging by moving infrequently updated data to these senior blocks. We found its implementation can be very simple based on two observations: First, flash blocks increase their erase counts via garbage collection only. Thus Lazy wear leveling can identify senior blocks whenever garbage collection is about to erase a victim. Second, frequently updated logical blocks will leave mapping information in the page-mapping table, so Lazy wear leveling can find infrequently updated data by checking the mapping table. Lazy wear leveling subjects block-wear evenness to a threshold, and using the same threshold value may produce different costs and wear evenness under various workloads. This study derives the overhead as a function of the threshold, and proposes decreasing the threshold until the overhead would increase significantly. Our results show that wear leveling should refrain from using small thresholds for sequential and random workloads.

Multichannel architectures have become mandatory in the design of solid-state disks. Real workloads do not evenly write to all channels, and inevitably introduce imbalanced flash wear in different channels. For wear leveling at the channel level, we propose a strategy that swaps logical blocks among channels. The goal of this swapping is to reach an “eventually even” state of channel lifetimes. Results show that this strategy is very successful and its overhead is nearly negligible.

A recent study [Balakrishnan et al. 2010] suggests that SSDs in RAIDs should reach their end-of-life at different times for the convenience of drive replacement. Our future work is directed to optimizing the drive-replacement periods using the proposed lifetime projection technique.

The following papers are related to the results of this project:

— Li-Pin Chang, Tung-Yang Chou, and Li-Chun Huang, "An Adaptive, Low-Cost Wear-Leveling Algorithm for Multichannel Solid-State Disks," ACM Transactions on Embedded Computing Systems, accepted for publication.

— Li-Pin Chang and Chen-Yi Wen, "Reducing Asynchrony in Channel Garbage-Collection for Improving Internal Parallelism of SSDs," ACM Transactions on Embedded Computing Systems, accepted for publication.

— Li-Pin Chang, Yi-Hsun Huang, Chen-Yi Wen, "On the Management of Multichannel Architectures of Solid-State Disks," the 9th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), 2011.

— Li-Pin Chang and Yo-Chuan Su, "Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks," The 48th Design Automation Conference (DAC), 2011.

— Li-Pin Chang and Li-Chun Huang, "A Low-Cost Wear-Leveling Algorithm for Block-Mapping Solid-State Disks," ACM Conference on Languages, Compilers, Tools and Theory for Embedded Systems (ACM LCTES), 2011.

REFERENCES

AGRAWAL, N., BOLOSKY, W. J., DOUCEUR, J. R., AND LORCH, J. R. 2007. A five-year study of file-system metadata. Trans. Storage 3.

AGRAWAL, N., PRABHAKARAN, V., WOBBER, T., DAVIS, J. D., MANASSE, M., AND PANIGRAHY, R. 2008. Design tradeoffs for SSD performance. In ATC’08: USENIX 2008 Annual Technical Conference on Annual Technical Conference. USENIX Association, 57–70.

BALAKRISHNAN, M., KADAV, A., PRABHAKARAN, V., AND MALKHI, D. 2010. Differential raid: Rethinking raid for ssd reliability. Trans. Storage 6, 2, 4:1–4:22.