Energy-Efficient and Performance-Enhanced Disks Using Flash-Memory Cache ∗

Jen-Wei Hsieh

Department of Computer Science and Information Engineering

National Chiayi University, Chiayi, Taiwan 60004, ROC

Tei-Wei Kuo

Department of Computer Science and Information Engineering

Graduate Institute of Networking and Multimedia

National Taiwan University, Taipei, Taiwan 106, ROC

Po-Liang Wu

Department of Computer Science and Information Engineering

National Taiwan University, Taipei, Taiwan 106, ROC

Yu-Chung Huang

Genesys Logic, Inc. Taipei, Taiwan 231, R.O.C.

This work explores the unique characteristics of flash memory in serving as a cache layer for disks. The experiments show that the proposed management scheme could save up to 20% of the energy consumption while reducing the read response time by two thirds and the write response time by five sixths relative to their counterparts. The estimated lifetime of the flash-memory cache is significantly improved as well.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General; B.3.2 [Memory Structure]: Design Styles—Cache memory

General Terms

Management

Keywords

Flash memory, cache, energy efficient, performance

1. INTRODUCTION

Flash memory has recently gained a lot of attention as a storage-system alternative (e.g., [1, 3, 4, 8]) or as a cache for hard disks. In particular, Windows ReadyBoost [6] lets users use a removable flash-memory device to improve system performance, while ReadyDrive [6] enables Windows Vista PCs equipped with a hybrid hard disk (a new type of disk with integrated non-volatile flash memory) to boot up faster, resume from hibernation in less time, preserve battery power, and improve disk reliability.

∗ Supported in part by research grants from Taiwan, ROC National Science Council under Grants NSC95-2219-E-002-014 and NSC 95R0062-AE00-07.

ISLPED’07, August 27–29, 2007, Portland, Oregon, USA.

However, flash memory does have several unique characteristics that introduce management challenges. A NAND flash memory is organized in terms of blocks, where each block consists of a fixed number of pages. Data must be written to the free space of flash memory. Once a flash-memory page is written, the space is no longer available until it is erased. As a result, out-place update is usually adopted in the management. A block is the basic unit for erase operations, while reads and writes are processed in terms of pages.¹ The typical block size and page size of a NAND flash memory are 16KB and 512B, respectively.² After a large number of page writes have been processed, the number of free pages on flash memory would be low. Garbage collection is needed to reclaim invalid pages scattered over blocks (due to out-place update) so that they can become free pages again. A flash-memory block also has a limit on erasures: a block erased more than 10⁶ times might suffer from frequent write errors. “Wear-levelling” is usually adopted to erase blocks evenly so that a longer overall lifetime is achieved.

One of the pioneering works in adopting flash memory as a disk cache was done by Marsh et al. [5]. Due to the state of the art at that time, the study used 20MB of NOR flash memory as a cache for a 40MB hard disk, which implies that an efficient lookup mechanism to locate the cache space of given Logical Block Addresses (LBA’s) for large-capacity flash memory was not considered. Another issue in adopting flash memory as a disk cache is its robustness, since flash memory suffers from wear-out. Different from the past work, this work is motivated by the needs of management in caching data for disks, especially when the characteristics of flash memory are considered. Note that well-known caching strategies, such as direct mapped cache and set associative cache [2], would suffer from significant deterioration in read/write performance (due to the write-once and wear-levelling features of flash memory) if they were implemented without considering the characteristics of flash memory. This paper presents an efficient lookup mechanism to locate the cached data of given LBA’s over flash memory and integrates it with an LRU-based caching strategy. It also considers read and write requests jointly with respect to energy efficiency and performance.

¹ Note that the terms “page” and “block” used here are different from those used for disks. As can be seen, the term “page” refers to a unit that is smaller than the unit referred to by the term “block.”

² Some flash memory adopts 128KB blocks and 2KB pages.

A garbage collection strategy is proposed in an integrated way to consider the hotness of data and the system performance. The capability of the proposed strategies is evaluated by a series of experiments based on realistic workloads.

The rest of this paper is organized as follows: Section 2 presents our management schemes, including a joint lookup and caching mechanism, a garbage-collection strategy, and a replacement policy, for a flash-memory cache. Other implementation remarks are also presented. The capability of the proposed management schemes is evaluated by a series of experiments in Section 3. Section 4 is the conclusion.

2. MANAGEMENT SCHEMES

2.1 Overview

The management of a flash-memory cache should consider the characteristics of flash memory and the access patterns of users over disks. Three potential situations in caching are considered: (1) When a read request arrives, the LBA of the request must be checked to see if the corresponding data are in the cache. If the answer is “yes,” the read request can be satisfied without accessing any hard disk. (2) If the answer is “no,” the data are retrieved from the corresponding hard disk and then cached in the flash memory for future access. (3) When a write request arrives, the data are cached in the flash memory. No extra action is taken unless a data write-back is required.

The three potential situations introduce several design and implementation issues. One critical issue is an efficient lookup strategy for a given LBA. Such a strategy is needed to look for any data corresponding to a given LBA on the flash memory, regardless of whether it is for a read or a write. When a write request is considered, we must invalidate the existing copy if the corresponding data exist in the cache. Another critical issue is the replacement strategy, which is needed when the cache is full or the flash memory needs garbage collection. A good replacement strategy should reduce the chance of cache misses. Other important issues include an energy-efficient strategy for flushing written data to disks, cache robustness, and cache utilization. In Section 2.2, we shall present data structures and strategies for the management of the flash-memory cache, especially for efficient data lookup when the user access pattern changes dynamically. Section 2.3 proposes our garbage collection and replacement strategies. Section 2.4 discusses a rebuilding procedure for the entry table.

2.2 Data Lookup and Caching

2.2.1 Management Information

The management of the flash-memory cache is based on the idea of set associativity [2]. An entry table is used to do bookkeeping for data in the cache. Each given LBA is hashed by a hash function to an entry in the table, where an example hash function is H(LBA) = (LBA/(K × NPB)) mod EN. Here the stretch factor K is any constant no less than 1, and NPB and EN are the number of pages in a block and the number of entries in the table, respectively. A link of caching buffers is attached to each entry, and the length of each link might change with the access patterns. Each caching buffer, which corresponds to a range of LBA’s (LBA_i, LBA_i + R), consists of a primary block and an overflow block if it exists, where R is any fixed multiple of the number of pages in a block, e.g., K × NPB. (Note that each primary/overflow block maps to a physical flash-memory block.) The lookup of a given LBA fails if it is not in the LBA range of any caching buffer associated with the hash entry.

Figure 1: The Organization of Management Information for Cached Data.

The lookup of an LBA starts by hashing to a specific entry of the table, followed by a search of the caching buffers associated with the entry, as shown in Figure 1. Within the corresponding caching buffer, the lookup continues by hashing again with a pre-defined hash function to a specific page in the primary block. An example hash function for locating the target page is PageIndex = LBA mod NPB. An overflow block is attached to a caching buffer if there is an attempt to overwrite the data in the hashed page of the primary block and no overflow block has been allocated yet. Free pages in an overflow block are written sequentially.
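The two-level lookup described above can be illustrated with a minimal Python sketch. It uses the example hash functions from the text; the integer division, the alignment of each buffer's LBA range to a multiple of R, and the names `entry_index`, `page_index`, `CachingBuffer`, and `lookup` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

NPB = 32       # pages per block (16KB block / 512B page, the typical sizes)
K = 1          # stretch factor, any constant no less than 1
EN = 16384     # number of entries in the entry table
R = K * NPB    # number of LBAs covered by one caching buffer


def entry_index(lba: int) -> int:
    """Example hash H(LBA) = (LBA / (K x NPB)) mod EN, with integer division."""
    return (lba // (K * NPB)) % EN


def page_index(lba: int) -> int:
    """Example hash PageIndex = LBA mod NPB into the primary block."""
    return lba % NPB


@dataclass
class CachingBuffer:
    start: int  # first LBA of the covered range [start, start + R) (assumed aligned)


def lookup(table: List[List[CachingBuffer]], lba: int) -> Optional[CachingBuffer]:
    """Search the buffers linked to the hashed entry; None means the lookup fails."""
    for buf in table[entry_index(lba)]:
        if buf.start <= lba < buf.start + R:
            return buf
    return None


table: List[List[CachingBuffer]] = [[] for _ in range(EN)]
table[entry_index(100)].append(CachingBuffer(start=96))
hit = lookup(table, 100)              # LBA 100 lies in [96, 128)
miss = lookup(table, 96 + R * EN)     # same entry, but a different LBA range
```

Two LBAs that are R × EN apart hash to the same entry, which is why each entry carries a link of buffers rather than a single one.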

2.2.2 Read Requests and Write Requests

When a read request arrives, the LBA of the request is checked to see if the required data is cached in the flash memory. The corresponding entry of the given LBA is first derived by hashing. The corresponding caching buffer of the LBA is then derived by searching over the buffers associated with the entry. If the target caching buffer is not found, the data must be retrieved from the corresponding disk and cached for any future access by allocating a new caching buffer to the entry. If such a caching buffer is found, then the given LBA is searched over the primary block and the overflow block to locate the data. If the data is available in the cache, the read request can be satisfied immediately without accessing any disk. If it is not found in any of the blocks, then a read operation to the proper disk is needed to retrieve the data, and the retrieved data must be cached. When the data is retrieved from a disk, such information might be useful in preventing disks from being disturbed (from spin-down status), because the system then knows the device status and might activate the writing of dirty data back to the corresponding disk.


When a write request arrives, it is checked to see if its LBA exists in any corresponding caching buffer. The corresponding caching buffer of the LBA is derived by searching over the buffers associated with the hashed entry. If no such caching buffer exists, a new caching buffer is allocated and attached to the corresponding entry of the entry table. We always try to cache the data in the primary block first. If the corresponding page of the primary block is occupied by old-version data of the LBA, the page is invalidated. We then try to cache the data in the first available page of the overflow block. An overflow block is allocated for the caching buffer if it does not exist. If there exists an overflow block, it must be checked to see if any free page is available. Garbage collection is invoked to reclaim invalid pages of the primary and overflow blocks if there is no free page left in the overflow block. When the data is cached in the overflow block, any page holding the old version of the data is invalidated. After garbage collection, the data is written to the primary or overflow block, as described above. Note that the corresponding page of the data in the primary block might still be occupied because of a hash collision. In other words, an overflow block might still be needed, in which case the data are cached in the overflow block.
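The write path within a single caching buffer can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes K = 1 (so no two LBAs of a buffer share a primary page), uses a tiny NPB, and reduces garbage collection to copying the valid pages into a fresh primary block. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

NPB = 4  # pages per block, kept tiny for illustration


@dataclass
class Page:
    lba: int
    data: bytes
    valid: bool = True


class CachingBuffer:
    """One caching buffer covering NPB consecutive LBAs (K = 1 assumed)."""

    def __init__(self) -> None:
        self.primary: List[Optional[Page]] = [None] * NPB      # None = free page
        self.overflow: Optional[List[Optional[Page]]] = None   # allocated lazily
        self.erased_blocks = 0  # blocks handed over to the erase queue

    def _invalidate(self, lba: int) -> None:
        # A written page cannot be updated in place, so old versions are only marked dead.
        for block in (self.primary, self.overflow or []):
            for page in block:
                if page is not None and page.lba == lba:
                    page.valid = False

    def read(self, lba: int) -> Optional[bytes]:
        for block in (self.primary, self.overflow or []):
            for page in block:
                if page is not None and page.valid and page.lba == lba:
                    return page.data
        return None

    def _garbage_collect(self) -> None:
        # Simplified: copy valid pages to a new primary block, queue old blocks for erasure.
        live = [p for p in self.primary + (self.overflow or []) if p is not None and p.valid]
        self.erased_blocks += 1 + (self.overflow is not None)
        self.primary, self.overflow = [None] * NPB, None
        for page in live:
            self.primary[page.lba % NPB] = page

    def write(self, lba: int, data: bytes) -> str:
        slot = lba % NPB
        self._invalidate(lba)
        if self.primary[slot] is None:          # hashed page is still free
            self.primary[slot] = Page(lba, data)
            return "primary"
        if self.overflow is None:               # allocate the overflow block on demand
            self.overflow = [None] * NPB
        if None not in self.overflow:           # overflow block full: reclaim dead pages
            self._garbage_collect()
            return self.write(lba, data)
        self.overflow[self.overflow.index(None)] = Page(lba, data)  # sequential fill
        return "overflow"


buf = CachingBuffer()
placements = [buf.write(9, b"x")] + [buf.write(8, bytes([v])) for v in b"abcdef"]
```

Repeated updates of LBA 8 first use the primary page, then fill the overflow block sequentially, and the sixth update triggers garbage collection before landing in the new primary block.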

2.3 Garbage Collection and Data Replacement

2.3.1 Garbage Collection

When there is no free page in an overflow block, garbage collection should start to reclaim the invalid pages of the overflow block and its corresponding primary block. If the disk is not spun down or idle during the garbage collection, data in the blocks that correspond to write requests should be written back to the disk. The strategy of the proposed garbage collection is based on two major ideas: (1) If the disk is spun down or idle, the system should avoid writing data cached in the blocks back to the disk whenever possible. (2) When valid pages of the two blocks are written back to the caching buffer (with new primary and overflow blocks), they are written back in an LRU fashion. That is, valid pages in the overflow block are written back to the buffer earlier than those in the primary block, and valid pages in the overflow block are written back to the buffer from the bottom to the top of the overflow block.

A new primary block is allocated and associated with the caching buffer, and an overflow block is not allocated until necessary. If the disk is spun down, then all of the data that correspond to writes must be kept in the cache whenever possible. Otherwise, the data should be written to the corresponding disks, and the remaining valid pages of the previous primary and overflow blocks (which correspond to reads) are written back to the new primary and overflow blocks of the caching buffer in an LRU fashion. The previous primary and overflow blocks are then inserted into a queue for erasure.
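The ordering rules of the garbage collection can be summarized in a short sketch. It assumes that "bottom" means the most recently written page of the overflow block and that each page carries a `dirty` flag marking data that came from a write request; the function name `plan_garbage_collection` and the `Page` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    lba: int
    dirty: bool         # True if the page holds data from a write request
    valid: bool = True


def plan_garbage_collection(
    primary: List[Optional[Page]],
    overflow: List[Optional[Page]],
    disk_active: bool,
) -> Tuple[List[Page], List[Page]]:
    """Return (pages to flush to disk, pages to copy into the new blocks), in order."""
    # Overflow pages go first, from the bottom (latest write) to the top,
    # followed by the valid pages of the primary block.
    live_overflow = [p for p in reversed(overflow) if p is not None and p.valid]
    live_primary = [p for p in primary if p is not None and p.valid]
    ordered = live_overflow + live_primary
    if not disk_active:
        # Disk spun down or idle: keep everything in the cache, flush nothing.
        return [], ordered
    # Disk active: dirty data goes to disk, read-cached data stays in the cache.
    flush = [p for p in ordered if p.dirty]
    keep = [p for p in ordered if not p.dirty]
    return flush, keep


primary = [Page(0, True), None, Page(2, False), Page(3, True, valid=False)]
overflow = [Page(3, True, valid=False), Page(3, True), Page(1, False), None]
idle_plan = plan_garbage_collection(primary, overflow, disk_active=False)
active_plan = plan_garbage_collection(primary, overflow, disk_active=True)
```

With the disk asleep, all four valid pages are kept in the order 1, 3, 0, 2; with the disk active, the dirty pages 3 and 0 are flushed and only pages 1 and 2 are copied.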

2.3.2 Replacement Strategy

The entry table of caching buffers, which changes over time, is used to do bookkeeping for data in the cache. Whenever there is a problem in allocating a new block, we must execute a replacement strategy to recycle one or more caching buffers and their associated blocks. The basic idea is to pick the LRU caching buffer for replacement in order to reduce cache misses. Blocks of the flash memory are considered as a circular array, and a free pointer always points to a free block, as shown in Figure 2(a). Whenever a free block is needed, the free block pointed to by the pointer is returned, and the pointer moves to the next free block, one by one along the circular array. Examples of the allocation of free blocks are shown in Figure 2(b).
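A minimal sketch of allocation over the circular array is given below. It assumes a per-block in-use flag and advances the free pointer lazily at allocation time; the class name `FreeBlockAllocator` and its fields are hypothetical and not taken from the paper.

```python
from typing import List, Optional


class FreeBlockAllocator:
    """Circular array of flash blocks with a free pointer (illustrative only)."""

    def __init__(self, in_use: List[bool]) -> None:
        self.in_use = list(in_use)
        self.free_ptr = 0

    def allocate(self) -> Optional[int]:
        n = len(self.in_use)
        # Walk the circular array at most once, starting at the free pointer.
        for step in range(n):
            idx = (self.free_ptr + step) % n
            if not self.in_use[idx]:
                self.in_use[idx] = True
                self.free_ptr = (idx + 1) % n
                return idx
        return None  # no free block left: the replacement strategy must run


allocator = FreeBlockAllocator([False, True, False, False])
order = [allocator.allocate() for _ in range(4)]
```

Block 1 is skipped because it is in use, and the final request returns `None`, which is the situation in which a caching buffer has to be replaced.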

Figure 2: Allocations of Free Blocks. (a) Before Allocation. (b) After Allocation.

To speed up the search for a free block and to help locate the LRU caching buffer, an access map, which is an array of bits, is introduced to keep the access record. Each bit in the access map corresponds to a unique block in the circular array. When any block of a caching buffer is accessed, the corresponding bit is set to 1. A replacement pointer that is initially equal to the free pointer moves along the circular array whenever there is a need to locate an LRU caching buffer or to recycle used blocks. When the replacement pointer moves, it stops at a bit with value 0. The caching buffer corresponding to that block is considered the LRU buffer and is recycled. If the replacement pointer moves onto a bit with value 1, the bit is set to 0, and the pointer moves to the next bit, as shown in Figure 3.
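The behaviour of the access map and the replacement pointer resembles a clock (second-chance) scan, which can be sketched as follows. The class name `AccessMap` and its methods are hypothetical, and the sketch leaves the pointer on the victim block, since the text does not say where it rests afterwards.

```python
from typing import List


class AccessMap:
    """One access bit per block of the circular array (illustrative only)."""

    def __init__(self, bits: List[int], start: int = 0) -> None:
        self.bits = list(bits)
        self.replacement_ptr = start

    def touch(self, block: int) -> None:
        """Record an access to a block of some caching buffer."""
        self.bits[block] = 1

    def pick_victim(self) -> int:
        """Clear 1-bits while advancing; stop at the first 0-bit and return its block."""
        # Terminates within one full round, because every visited 1-bit is cleared.
        while self.bits[self.replacement_ptr] == 1:
            self.bits[self.replacement_ptr] = 0
            self.replacement_ptr = (self.replacement_ptr + 1) % len(self.bits)
        return self.replacement_ptr


amap = AccessMap([1, 1, 0, 1])
victim = amap.pick_victim()
```

Blocks 0 and 1 receive a second chance (their bits are cleared), and block 2 is selected; the caching buffer that owns it would then be recycled.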

Figure 3: The Access Map and Replacement. (a) Before Replacement. (b) After Replacement.

2.4 Rebuilding Procedure of the Entry Table

When a computer shuts down normally, there exist many strategies for accelerating the rebuilding of the entry table. This section illustrates a simple procedure for rebuilding the entry table by scanning blocks on the flash memory, without any auxiliary information, after the system has crashed.


To create the entry table, we examine blocks with valid pages. If the written pages in a block are scattered, then the block must be a primary block. We restore the information of the primary block for its corresponding caching buffer and then associate the buffer with the corresponding entry. On the other hand, if all of the written pages in a block are written in a sequential order, then the block might be either a primary block or an overflow block. Each written page in the block must be checked to see if its page index is consistent with the one derived from the page-index hashing of its LBA. If there exists any inconsistency, then the block must be an overflow block, and the information of the overflow block must be restored for the corresponding caching buffer; otherwise, the block can be either a primary block or an overflow block, depending on whether another block is found to be associated with its corresponding caching buffer.
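The classification rule used during rebuilding can be sketched as below. The sketch assumes that each written page stores its LBA (here a list with `None` for unwritten pages) and that "sequential" means the written pages occupy indices 0 to n-1 without gaps; the function name `classify_block` and the returned labels are hypothetical.

```python
from typing import List, Optional


def classify_block(pages: List[Optional[int]], npb: int) -> str:
    """Classify a scanned block from the LBAs stored in its written pages."""
    written = [i for i, lba in enumerate(pages) if lba is not None]
    if not written:
        return "free"
    # Scattered pages can only result from hashed placement in a primary block.
    if written != list(range(len(written))):
        return "primary"
    # Sequentially written: compare each page index with LBA mod NPB.
    if any(pages[i] % npb != i for i in written):
        return "overflow"
    # Consistent with both layouts; resolved by the other block of the same buffer.
    return "ambiguous"


scattered = classify_block([None, 5, None, 7], npb=4)
mismatch = classify_block([6, 6, 5, None], npb=4)
undecided = classify_block([8, 9, None, None], npb=4)
```

The last case is the one the text leaves open: the block is sequential and hash-consistent, so its role depends on whether another block is found for the same caching buffer.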

3. PERFORMANCE EVALUATION

3.1 Experiment Setup

This section evaluates the performance of the proposed implementation strategies in terms of energy efficiency, read/write response time, and the number of block erasures. Four different capacities of the flash-memory cache were simulated for the performance evaluation, and the impacts of the stretch factor K were explored. The size of a flash-memory block was 16KB, and the number of entries M in the entry table was set to 16384. In addition to comparisons between different cache sizes and stretch factors, two well-known cache-management mechanisms, a direct mapped cache and a set associative cache, were simulated for comparison.

The trace of data accesses for the performance evaluation was collected over an 80GB hard disk of a personal computer with 1GB of RAM and an AMD Athlon64 K8-3000+ 939 CPU. The operating system was Windows XP SP2, and the hard disk was formatted as NTFS. Traces were collected by DiskMon³, and the duration of trace collection was one month. The workload of the personal computer in accessing the hard disk corresponds to the daily use of most people, i.e., web surfing, movie playing, peer-to-peer file sharing, e-mail sending/receiving, and document typesetting/reading/editing. To evaluate the flash-memory cache in a steady state, we used the first week of the trace to fill up the flash-memory cache and collected statistics for the rest of the trace so that the effect of garbage collection could be observed.

3.2 Experiment Results

3.2.1 The Total Idle Time

Before we demonstrate the energy efficiency under various caching implementation strategies, the distribution of disk idle times, which affects the energy consumption, is worth noting. Figure 4 shows the impact of different implementation strategies on disk idle times. Time intervals between any two consecutive disk accesses were compiled and classified into six ranges according to their lengths. Note that idle-time intervals shorter than two seconds were filtered out, since spinning the disk down and then up again within two seconds does not help save power.

As the cache size became larger, more data could be retained in the flash-memory cache. As a result, many access requests to the disk could be fulfilled by accessing the flash-memory cache, and time intervals between two consecutive disk accesses could be prolonged. Different values of the stretch factor K resulted in different data placements. As K became larger, the disk idle time improved. It can be observed that the total idle time achieved by setting K = 8 for a 1024MB flash-memory cache was almost comparable to that achieved by a 4096MB flash-memory cache with K = 1. This was because a large K can prevent a huge but infrequently accessed file, e.g., a movie clip, from spreading over numerous caching buffers. In other words, the chances of swapping out frequently accessed data when sequentially accessing such a huge file were reduced when the stretch factor was set large. Due to the flexible management of data placement, the proposed implementation strategy outperformed a direct mapped cache and a set associative cache.

³ http://www.sysinternals.com/Utilities/Diskmon.html

Figure 4: The Distribution of Idle Times. (Total idle time in seconds, from 0 to 1,600,000, for each implementation strategy: No Cache; 512MB, 1024MB, 2048MB, and 4096MB with K = 1; 1024MB with K = 2, 4, and 8; 1024MB Direct Mapped; and 1024MB Set Associative. Idle intervals are grouped into six ranges: 2sec~15sec, 16sec~60sec, 61sec~240sec, 241sec~1200sec, 1201sec~2400sec, and 2401sec~.)

3.2.2 The Energy Efficiency

In our simulation, energy consumptions under various implementation strategies were derived from the statistics of disk idle times, the number of disk spin-ups/spin-downs, and the number of flash-memory read/write/erase operations. To simplify the estimation, we assume that the disk has only two modes, namely active and standby. Regardless of which action (seek/rotation/transfer) the disk was taking, we assume that the consumed power was the same. When no action was taken for 30 seconds, the disk turned from the active mode into the standby mode. Note that a mode transition of the disk requires extra energy. Detailed parameters of power consumption are listed in Table 1.

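Device                     Parameter    Value
IBM Ultrastar 36Z15 [9]    Spin-down    13J
IBM Ultrastar 36Z15 [9]    Spin-up      135J
IBM Ultrastar 36Z15 [9]    Active       13.5W
IBM Ultrastar 36Z15 [9]    Standby      2.5W
Flash Memory [7]           Read         30mW
Flash Memory [7]           Write        60mW
Flash Memory [7]           Erase        60mW

Table 1: Power Consumption Parameters.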

Figure 5 compares the energy efficiencies under various implementation strategies for the 23-day trace. Suppose the energy consumed by the disk without any flash-memory cache was x, and the energy consumed by the disk with some implementation strategy was y. The saved energy in the figure was x − y, and we accordingly derived the saved energy ratio, which is (x − y)/x. The energy efficiency was dominated by idle times. A long idle-time interval was superior to several short ones due to fewer spin-up and spin-down overheads, even though the total idle times were the same. The longer the idle-time interval in which a disk can stay, the better the energy efficiency it can achieve. As shown in Figure 5, we could save about 20% of the energy consumption when adopting a 4GB flash-memory cache for an 80GB disk.

Figure 5: Comparison of Energy Efficiencies. (Saved energy in joules, from 0 to 3,500,000, for each implementation strategy, labelled with the saved energy ratio: 512MB, K = 1: 7.9%; 1024MB, K = 1: 8.97%; 2048MB, K = 1: 14.65%; 4096MB, K = 1: 19.94%; 1024MB, K = 2: 10.84%; 1024MB, K = 4: 12.34%; 1024MB, K = 8: 15.7%; 1024MB, Direct Mapped: 5.54%; 1024MB, Set Associative: 8.42%.)
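The two-mode disk model of this subsection can be illustrated with a small sketch based on the parameters of Table 1. It treats each mode transition as an instantaneous lump of energy and ignores the flash-memory operations, both of which are simplifying assumptions; the function names `disk_energy` and `saved_ratio` are hypothetical.

```python
from typing import Iterable

SPIN_DOWN_J = 13.0   # energy of one spin-down (Table 1)
SPIN_UP_J = 135.0    # energy of one spin-up (Table 1)
ACTIVE_W = 13.5      # power in the active mode (Table 1)
STANDBY_W = 2.5      # power in the standby mode (Table 1)
TIMEOUT_S = 30.0     # inactivity time before the disk leaves the active mode


def disk_energy(busy_s: float, idle_intervals_s: Iterable[float]) -> float:
    """Disk energy in joules for a given busy time and list of idle intervals."""
    energy = ACTIVE_W * busy_s
    for t in idle_intervals_s:
        if t <= TIMEOUT_S:
            energy += ACTIVE_W * t  # too short: the disk never leaves the active mode
        else:
            # Active until the timeout, spin down, standby, then spin up again.
            energy += ACTIVE_W * TIMEOUT_S + SPIN_DOWN_J
            energy += STANDBY_W * (t - TIMEOUT_S) + SPIN_UP_J
    return energy


def saved_ratio(x: float, y: float) -> float:
    """Saved energy ratio (x - y) / x, with x the energy without any cache."""
    return (x - y) / x


one_long = disk_energy(0.0, [600.0])         # a single 600-second idle interval
many_short = disk_energy(0.0, [60.0] * 10)   # the same idle time in ten pieces
```

Under these assumptions the single long interval costs 1,978 J while the ten short ones cost 6,280 J, which matches the observation that one long idle interval is superior to several short ones of the same total length.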

3.2.3 The Number of Block Erasures

Since flash memory has a limit on the block-erasure count, the distribution of erase counts over flash-memory blocks was a major evaluation metric. The number of erasures of each flash-memory block was accumulated separately. Flash-memory blocks were then sorted into groups according to their erase counts. The number of groups and the range of erase counts covered by each group indicated the quality of the achieved wear-levelling, from which the life cycle of a flash-memory cache can be estimated.

A large cache size improved not only the idle time but also the quality of wear-levelling. Erasures over flash-memory blocks were amortized when the cache size became large. Different from idle times, the impact of the cache size on the distribution of erase counts was more predictable. When the size of the flash-memory cache was doubled, the peak in the distribution roughly doubled, and the range of erase counts was roughly halved. On the other hand, although a large stretch factor was beneficial to idle times, it greatly deteriorated the quality of wear-levelling. When K became larger, the range of erase counts over flash-memory blocks expanded. In addition, the total number of erasures over flash-memory blocks increased as well. Table 2 lists the total erasures for the 23-day trace in the experiment. Since neither a direct mapped cache nor a set associative cache takes the out-place-update nature of flash memory into consideration, deviations of erase counts among flash-memory blocks were large. In addition, their erasure overheads were also enormous, as shown in Table 2. Note that when K was set large in the proposed strategy, the performance gap in idle times could widen even further, while the erasure overhead was still lower than that of a direct mapped cache or a set associative cache.

Configuration              Total Erasure
512MB, K = 1               15,662,090
1024MB, K = 1              14,995,330
2048MB, K = 1              14,265,300
4096MB, K = 1              14,317,930
1024MB, K = 2              73,866,860
1024MB, K = 4              272,317,700
1024MB, K = 8              406,628,000
1024MB, Direct Mapped      418,747,600
1024MB, Set Associative    296,467,000

Table 2: Comparison of Total Erasures.

In the simulation over the 23-day trace, the maximum erase counts among all flash-memory blocks for various implementation strategies are listed in Table 3. Based on this information, the life cycle of the flash-memory cache under different implementation strategies could be estimated. The flash-memory cache under the proposed implementation strategy (for cache size = 1024MB and K = 1) could last over 203 years, while a direct mapped cache could only work for 2.4 months and a set associative cache did not function well after 7 months.

Configuration              Maximum Erasure Counts (23-day)    Estimated Product Lifetime
512MB, K = 1               570                                110.6 years
1024MB, K = 1              310                                203.3 years
2048MB, K = 1              180                                350 years
4096MB, K = 1              150                                420 years
1024MB, K = 2              2,780                              22.67 years
1024MB, K = 4              26,600                             28.4 months
1024MB, K = 8              59,000                             12.8 months
1024MB, Direct Mapped      314,500                            2.4 months
1024MB, Set Associative    121,000                            6.25 months

Table 3: The Estimated Product Lifetime.
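The paper does not spell out how the lifetimes in Table 3 were derived, but the listed values are consistent with a simple extrapolation: the most-worn block keeps its 23-day erase rate until it reaches the 10⁶-erasure limit mentioned in the introduction. The sketch below reproduces that arithmetic under this assumption; the function name `lifetime_days` and the 365-day year are illustrative choices.

```python
ENDURANCE = 10**6        # erasure limit of a flash-memory block
TRACE_DAYS = 23          # length of the measured trace
DAYS_PER_YEAR = 365.0    # assumed calendar convention
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def lifetime_days(max_erase_count: int) -> float:
    """Days until the most-worn block reaches the limit at the observed rate."""
    return TRACE_DAYS * ENDURANCE / max_erase_count


years_1024mb_k1 = lifetime_days(310) / DAYS_PER_YEAR          # about 203.3 years
years_512mb_k1 = lifetime_days(570) / DAYS_PER_YEAR           # about 110.6 years
months_direct = lifetime_days(314_500) / DAYS_PER_MONTH       # about 2.4 months
months_set_assoc = lifetime_days(121_000) / DAYS_PER_MONTH    # about 6.25 months
```

For example, 310 erasures in 23 days give 23 × 10⁶ / 310 ≈ 74,194 days, or roughly 203.3 years, in agreement with the table entry for the 1024MB cache with K = 1.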

3.2.4 The Read/Write Response Time

Since flash memory is a kind of EEPROM, the flash-memory cache has an intrinsic limitation in improving the performance of data accesses. In addition to the penalty of disk access during a cache miss, the flash-memory cache suffered from the erasure overhead when the utilization of the cache space was high. Without proper space management, read/write requests could suffer from a series of page reads, page writes, block erasures, and even disk accesses. In the proposed implementation strategy, the garbage collection was designed such that block erasures could be postponed until a system idle time. To illustrate the read/write performance, the simulation adopted the access parameters of the Samsung K9F6408U0A 8MB NAND Flash Memory and the Western Digital Caviar WD800JB 80GB 7200RPM 8MB IDE Ultra ATA-100 Hard Drive. Their performance characteristics are listed in Table 4.

Device             Read       Write       Erase
K9F6408U0A         36.55μs    226.65μs    2ms
Caviar WD800JB     13.1ms     13.1ms      N/A

Table 4: Performance Characteristics.

Figure 6 (a) and (b) compare the average read/write response times among different cache sizes on a per-day basis. As the cache size got larger, a better read/write response time could be achieved. Figure 6 (c) and (d) show the impacts of the stretch factor on the average read/write response time. When the stretch factor became larger, the average read/write response time deteriorated quickly. As shown in the figure, when K = 8, the average write response time even got worse than that of the disk without any flash-memory cache. This was because a great many erase operations were introduced.

Figure 6 (e) compares average read response times among different implementation strategies on a per-day basis. As shown in the figure, the proposed strategy could save up to two thirds of the read response time and one third of the read response time on average. On the other hand, a direct