
Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded Systems

Li-Pin Chang and Tei-Wei Kuo
{d6526009,ktw}@csie.ntu.edu.tw
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan, 106

Fax: +886-2-23628167

Abstract

Flash memory technology is becoming critical in building embedded systems applications because of its shock-resistant, power-economic, and non-volatile nature. With the recent technology breakthroughs in both capacity and reliability, flash-memory storage systems are now very popular in many types of embedded systems. However, because flash memory is a write-once and bulk-erase medium, we need a translation layer and a garbage collection mechanism to provide applications a transparent storage service. In the past work, various techniques were introduced to improve the garbage collection mechanism.

These techniques aimed at both performance and endurance issues, but they all failed to provide applications a guaranteed performance. In this paper, we propose a real-time garbage collection mechanism, which provides a guaranteed performance, for hard real-time systems. The proposed mechanism also supports non-real-time tasks so that the potential bandwidth of the storage system can be fully utilized. A wear-levelling method, which is executed as a non-real-time service, is presented to resolve the endurance problem of flash memory. The capability of the proposed mechanism is demonstrated by a series of experiments over our system prototype.

Keywords: Flash memory, real-time garbage collection, storage systems, embedded systems, portable devices, real-time systems.

This paper is an extended version of a paper that appeared in the 8th International Conference on Real-Time Computing Systems and Applications, 2002.


1 Introduction

Flash memory is not only shock-resistant and power-economic but also non-volatile. With the recent technology breakthroughs in both capacity and reliability, more and more (embedded) system applications now deploy flash memory for their storage systems. For example, the manufacturing systems in factories must be able to tolerate severe vibration, which may cause damage to hard-disks.

As a result, flash memory is a good choice for such systems. Flash memory is also suitable for portable devices, which have a limited energy source from batteries, to lengthen their operating time.

A complicated management method is needed for flash-memory storage systems.

There are two major issues for the system implementation: (1) the write-once and bulk-erasing nature of flash memory, and (2) the endurance problem. Because flash memory is write-once, the flash memory management software cannot overwrite existing data. Instead, the newer version of the data is written to available space elsewhere. The old version of the data is then invalidated and considered "dead"; the latest version of the data is considered "live".

A Flash memory Translation Layer (FTL) is introduced to emulate a block device for flash memory so that applications can have transparent access to data which might dynamically move around different locations. Bulk erasing, which involves significant live data copying, could be initiated when flash-memory storage systems have a large number of live and dead data mixed together. This is the so-called garbage collection, with the intention of recycling the space occupied by dead data. No matter what garbage collection or data-writing policies are adopted, the flash-memory storage system should consider the limit on the number of possible erasings of each erasable unit of flash memory (the endurance problem).

In the past work, various techniques were proposed to improve the performance of garbage collection for flash memory, e.g., [1, 8, 13, 10]. In particular, Kawaguchi, et al. proposed the cost-benefit policy [1], which uses a value-driven heuristic function based on the cost and the benefit of recycling a specific block.

The policy picks the block with the largest value of (a ∗ (1 − u)/(2u)) to recycle, where u is the capacity utilization (percentage of fullness) of the block, and a is the time elapsed so far since the last data invalidation on the block. The cost-benefit policy avoids recycling a block which contains recently invalidated data because the policy surmises that more live data on the block might be invalidated soon. Chiang, et al. [10] refined the above work by considering a fine-grained hot-cold identification mechanism. They proposed to keep track of the hotness of each LBA, where LBA stands for the Logical Block Address of a block device. The hotness of an LBA denotes how often the LBA is written.

They proposed to avoid recycling a block which contains many live and hot data, because any copying of the live and hot data is usually considered inefficient.

Kwoun, et al. [8] proposed to periodically move live data among blocks so that blocks have more even lifetimes.


Although researchers have proposed excellent garbage-collection policies, little work has been done in providing a deterministic performance guarantee for flash-memory storage systems. For a time-critical system, such as a manufacturing system, it is highly important to predict the number of free pages reclaimed after each block recycling so that the system will never be blocked for an unpredictable duration of time because of garbage collection. It has been shown that garbage collection could impose almost 40 seconds of blocking time on real-time tasks without proper management [15, 16]. This paper is motivated by the need of a predictable block-recycling policy to provide real-time performance to many time-critical systems. We shall not only propose a block-recycling policy with guaranteed performance but also resolve the endurance problem. A free-page replenishment mechanism is proposed for garbage collection to control the consumption rate of free pages so that bulk erasings only occur whenever necessary. Because of the real-time garbage collection support, each time-critical task can be guaranteed a specified number of reads and/or writes within its period. Non-real-time tasks are also serviced with the objective to fully utilize the potential bandwidth of the flash-memory storage system. The design of the proposed mechanism is independent of the implementation of the flash memory management software and the adoption of real-time scheduling algorithms. We demonstrate the performance of the system with a system prototype.

The rest of this paper is organized as follows: Section 2 introduces the system architecture of a real-time flash-memory storage system. Section 3 presents our real-time block-recycling policy, the free-page replenishment mechanism, and the support for non-real-time tasks. We provide the admission control strategy and the justification of the proposed mechanism in Section 4. Section 5 summarizes the experimental results. Section 6 is the conclusion and future research.

2 System Architecture

This section proposes the system architecture of a real-time flash-memory storage system, as shown in Figure 1. The system architecture of a typical flash-memory storage system is similar to that in Figure 1, except that a typical flash-memory storage system does not have or consider real-time tasks, a real-time scheduler, and real-time garbage collectors.

We selected NAND flash to realize our storage system. There are two major architectures in flash memory design: NOR flash and NAND flash. NOR flash is a kind of EEPROM, while NAND flash is designed for data storage. We study NAND flash in this paper because it has a better price/capacity ratio compared to NOR flash. A NAND flash memory chip is partitioned into blocks, where each block has a fixed number of pages, and each page is a fixed-size byte array.

Due to the hardware architecture, data on flash are written in units of one page, and erases are performed in units of one block. A page can be either


Figure 1: System Architecture. (RT tasks and RT garbage collectors run under the real-time scheduler; non-RT tasks run under the time-sharing scheduler. Tasks access the file system (fread, fwrite), which sits on the block device emulation (FTL), the flash memory driver, and the flash memory; RT garbage collectors invoke the FTL directly through ioctl_copy and ioctl_erase.)

programmable or un-programmable, and any page is initially programmable. Programmable pages are called "free pages". A programmable page becomes un-programmable once it is written (programmed). A written page that contains live data is called a "live page"; it is called a "dead page" if it contains dead (invalidated) data. A block erase erases all of its pages and resets their status back to programmable (free). Furthermore, each block has an individual limit (theoretically identical in the beginning) on the number of erasings. A worn-out block will suffer from frequent write errors. The block size of a typical NAND flash is 16KB, and the page size is 512B. The endurance of each block is usually 1,000,000 erasings under the current technology.

Over a real-time flash-memory storage system, user tasks might read and/or write the FTL-emulated block device through the file system1. We propose to support real-time and reasonable services to real-time and non-real-time tasks through a real-time scheduler and a time-sharing scheduler, respectively. A real-time garbage collector is initiated for each real-time task which might write data to flash memory, to reclaim free pages for the task. Real-time garbage collectors interact directly with the FTL through specially designed control services.

1. Note that there are several different approaches to managing flash memory for storage systems. We must point out that NAND flash prefers the block device emulation approach because NAND flash is a block-oriented medium (note that a NAND flash page fits a disk sector in size). There is an alternative approach which builds a native flash memory file system over NOR flash; we refer interested readers to [6] for more details.


Operation     Unit Size    Performance (µs)    Symbol
Page Read     512 bytes    348                 tr
Page Write    512 bytes    909                 tw
Block Erase   16K bytes    1,881               te

Table 1: Performance of NAND Flash Memory

The control services include ioctl_erase, which performs a block erase, and ioctl_copy, which performs atomic copying. Each atomic copy, which consists of a page read and then a page write, is designed to realize the live-page copying in garbage collection. In order to ensure the integrity of each atomic copy, its read-then-write operation is non-preemptible to prevent the possibility of any race condition. The garbage collection for non-real-time tasks is handled inside the FTL. The FTL must reclaim a sufficient number of free pages for non-real-time tasks, similar to typical flash-memory storage systems, so that reasonable performance is provided.

3 Real-Time Garbage Collection Mechanism

3.1 Characteristics of Flash Memory Operations

A typical flash memory chip supports three kinds of operations: page read, page write, and block erase. The performance of the three operations measured on a real prototype is listed in Table 1. Block erases take a much longer time compared to the others. Because the garbage collection facility in the FTL might impose unbounded blocking time on tasks, we propose to adopt two FTL-supported control services to process real-time garbage collection: the block erase service (ioctl_erase) and the atomic copying service (ioctl_copy). The atomic copying operation copies data from one page to another by the specified page addresses, and the copy operation is non-preemptible to avoid any possibility of race conditions.2 The main idea in this paper is to have a real-time garbage collector created for each real-time task to handle block erase requests (through ioctl_erase) and live-page copyings (through ioctl_copy) for real-time garbage collection. We shall illustrate the idea in more detail in later sections.

Flash memory is a programmed-I/O device. In other words, flash memory operations are very CPU-consuming. For example, to handle a page write request, the CPU first downloads the data to the flash memory, issues the address and the command, and then monitors the status of the flash memory until the command is completed. Page reads and block erases are handled in a similar fashion.

2. Some advanced NAND flash memories have native support for atomic copying. The overhead of atomic copying is significantly reduced since the operation can be done internally in flash memory without the redundant data transfer between RAM and flash memory.


As a result, the CPU is fully occupied during the execution of flash memory operations. On the other hand, flash memory operations are non-preemptible since the operations cannot be interleaved with one another. We can treat flash memory operations as non-preemptible portions of tasks. As shown in Table 1, the longest non-preemptible operation among them is a block erase, which takes 1,881 µs to complete.

3.2 Real-Time Task Model

Figure 2: A task Ti which reads and writes flash memory. (Within each period, from t to t + pi, preemptible CPU computation regions are interleaved with non-preemptible page reads (R) and non-preemptible page writes (W).)

Each periodic real-time task Ti is defined as a triple (cTi, pTi, wTi), where cTi, pTi, and wTi denote the CPU requirements, the period, and the maximum number of its page writes per period, respectively. The CPU requirements cTi consist of the CPU computation time and the flash memory operating time: Suppose that task Ti wishes to use the CPU for ccpuTi µs, read i pages from flash memory, and write j pages to flash memory in each period. The CPU requirements cTi can then be calculated as ccpuTi + i ∗ tr + j ∗ tw (tr and tw can be found in Table 1). If Ti does not wish to write to flash memory, it may set wTi = 0. Figure 2 illustrates a periodic real-time task Ti that issues one page read and two page writes in each period (note that the order of the read and writes does not need to be fixed in this paper).

For example, in a manufacturing system, a task T1 might periodically fetch the control codes from files, drive the mechanical gadgets, sample the readings from the sensors, and then update the status to some specified files. Suppose that ccpuT1 = 1ms and pT1 = 20ms. Let T1 wish to read a 2K-sized fragment from a data file and write 256 bytes of machine status back to a status file. Assume that the status file already exists, and that we have one page of spontaneous file-system meta-data to read and one page of meta-data to write. According to this information, task T1 can be defined as (1000 + (1 + ⌈2048/512⌉) ∗ tr + (1 + ⌈256/512⌉) ∗ tw, 20 ∗ 1000, 1 + ⌈256/512⌉) = (4558, 20000, 2). Note that the time granularity is 1µs. As shown in the example, it is very straightforward to describe a real-time task in our system.
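The arithmetic above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the helper name and the one-page meta-data defaults are ours, while the timings come from Table 1.

```python
# Sketch of computing a task triple (c_Ti, p_Ti, w_Ti); names are ours.
from math import ceil

T_R, T_W = 348, 909  # page read / page write times from Table 1 (us)
PAGE = 512           # page size in bytes

def task_triple(c_cpu_us, period_ms, data_read_bytes, data_write_bytes,
                meta_reads=1, meta_writes=1):
    """Return (c_Ti, p_Ti, w_Ti) for a periodic real-time task."""
    reads = meta_reads + ceil(data_read_bytes / PAGE)     # i pages read
    writes = meta_writes + ceil(data_write_bytes / PAGE)  # j pages written
    c = c_cpu_us + reads * T_R + writes * T_W             # CPU demand (us)
    return (c, period_ms * 1000, writes)
```

Applying it to the example task T1 (1 ms of computation, a 20 ms period, a 2K read, and a 256-byte write) reproduces the triple (4558, 20000, 2) derived above.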


The meta-data of the file system, which contain information such as the file-access permissions and data blocks' locations, are usually read or written spontaneously with user data. (Note that a file system must determine data blocks' locations for every accessed datum.) For example, a FAT file system determines data blocks' locations by tracing the cluster chains in the FAT table. A UNIX file system de-references the direct/indirect blocks to get the data block locations. These data structures might scatter over blocks so that unpredictable latencies are introduced. Molano, et al. [2] proposed a meta-data pre-fetch scheme which loads necessary meta-data into RAM to provide a deterministic behavior in accessing meta-data. In this paper, we assume that each real-time task has a deterministic behavior in accessing meta-data so that we can focus on the real-time support issue for flash-memory storage systems.

3.3 Real-Time Garbage Collection

This section proposes a real-time garbage collection mechanism to prevent each real-time task from being blocked because of an insufficient number of free pages. We shall first propose the idea of real-time garbage collectors and the free-page replenishment mechanism for garbage collection. We will then present a block-recycling policy for the free-page replenishment mechanism to choose appropriate blocks to recycle.

3.3.1 Real-Time Garbage Collectors

Figure 3: Creation of the Corresponding Real-Time Garbage Collectors. (In (a), wT1 = 30: G1 supplies 16 free pages twice within each period of T1. In (b), wT2 = 3: T2 consumes 3 pages per period, and G2 supplies 16 free pages once every five periods of T2.)

For each real-time task Ti which may write to flash memory (wTi > 0), we propose to create a corresponding real-time garbage collector Gi. Gi should reclaim and supply Ti with enough free pages. Let a constant α denote a lower bound on the number of free pages that can be reclaimed by each block recycling (we will show how to guarantee this in Section 4.1). Note that the number of reclaimed free pages for a block recycling is identical to the number of dead pages on the block before the block recycling. Let π denote the number of pages per block. Given a real-time task Ti = (cTi, pTi, wTi) with wTi > 0, the corresponding real-time garbage collector is created as follows:


Symbol   Description                                                         Value
Λ        the total number of live pages currently on flash
∆        the total number of dead pages currently on flash
Φ        the total number of free pages currently on flash
π        the number of pages in each block                                   32
Θ        the total number of pages on flash                                  Λ + ∆ + Φ
         the block size                                                      16KB
         the page size                                                       512B
α        the pre-defined lower bound of the number of reclaimed
         free pages after each block recycling
ρ        the total number of tokens in the system
ρfree    the number of unallocated tokens in the system
ρTi      the number of tokens given to Ti
ρGi      the number of tokens given to Gi
ρnr      the number of tokens given to non-real-time tasks

Table 2: Symbol Definitions.

cGi = (π − α) ∗ (tr + tw) + te + ccpuGi

pGi = pTi / ⌈wTi/α⌉ , if wTi > α
pGi = pTi ∗ ⌊α/wTi⌋ , otherwise.    (1)

The CPU demand cGi consists of at most (π − α) live-page copyings, a block erase, and the computation requirements ccpuGi. All real-time garbage collectors have the same worst-case CPU requirements since α is a constant lower bound. Obviously, the estimation is based on a worst-case analysis, and Gi might not consume the entire CPU requirements in each period. The period pGi is set under the guideline of supplying Ti with enough free pages; the length of the period depends on how fast Ti consumes free pages. We let Gi and Ti arrive in the system at the same time.

Figure 3 provides two examples for real-time garbage collectors, with the system parameter α = 16. In Figure 3.(a), because wT1 = 30, pG1 is set to one half of pT1 so that G1 can reclaim 32 free pages for T1 in each pT1. As astute readers may point out, more free pages may be unnecessarily reclaimed; we will address this issue in the next section. In Figure 3.(b), because wT2 = 3, pG2 is set to five times pT2 so that 16 pages are reclaimed for T2 in each pG2. The reclaimed free pages are enough for T1 and T2 to consume in each pT1 and pG2, respectively.

We define the meta-period σi of Ti and Gi as pTi if pTi ≥ pGi; otherwise, σi is equal to pGi. In the above examples, σ1 = pT1 and σ2 = pG2. The meta-period will be used in later sections.
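Equation (1) and the meta-period can be sketched as follows. The function names are ours; the paper gives only the formulas, and we use integer division on the assumption that the periods divide evenly.

```python
# Sketch of Equation (1) and the meta-period sigma_i; names are ours.
from math import ceil, floor

def collector_period(p_t, w_t, alpha):
    """Period p_Gi of the real-time garbage collector G_i (Equation 1)."""
    if w_t > alpha:
        return p_t // ceil(w_t / alpha)   # collect several times per p_Ti
    return p_t * floor(alpha / w_t)       # collect once every few p_Ti

def meta_period(p_t, w_t, alpha):
    """sigma_i is the longer of p_Ti and p_Gi."""
    return max(p_t, collector_period(p_t, w_t, alpha))
```

With α = 16 and pT1 = 20000 µs, wT1 = 30 yields pG1 = 10000 (half of pT1), while wT2 = 3 yields pG2 = 100000 (five times pT2), matching Figure 3.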


Figure 4: The Free-Page Replenishment Mechanism. (Based on the example in Figure 3.(a): G1 creates tokens in each of its periods and supplies T1 with 16 tokens; both T1 and G1 give up extra tokens beyond their needs.)

3.3.2 Free-Page Replenishment Mechanism

The free-page replenishment mechanism proposed in this section resolves the over-reclaiming issue for free pages during real-time garbage collection. The over-reclaiming occurs because the real-time garbage collection is based on the worst-case assumption of the number of reclaimed free pages per erasing. Furthermore, in many cases, real-time tasks might consume free pages more slowly than what they declared. A coordinating mechanism should be adopted to manage the free pages and their reclaiming more efficiently and flexibly.

In this section, we propose a token-based mechanism to coordinate free-page usage and reclaiming. In order not to let a real-time garbage collector reclaim too many unnecessary free pages, a token-based "free-page replenishment mechanism" is presented:

Consider a real-time task Ti = (cTi, pTi, wTi) with wTi > 0. Initially, Ti is given (wTi ∗ σi/pTi) tokens, and one token is good for executing one page write (please see Section 3.3.1 for the definition of σi). We require that each (real-time) task cannot write any page if it does not own any token. Note that a token does not correspond to any specific free page in the system. Several counters of tokens are maintained: ρinit denotes the number of available tokens when the system starts; ρ denotes the total number of tokens currently in the system (regardless of whether they are allocated or not); ρfree denotes the number of unallocated tokens currently in the system; ρTi and ρGi denote the numbers of tokens currently given to task Ti and Gi, respectively. (Gi also needs tokens for live-page copyings, which will be addressed later.) The symbols for token counters are summarized in Table 2.

Initially, Ti and Gi are given (wTi ∗ σi/pTi) and (π − α) tokens, respectively. This is to prevent Ti and Gi from being blocked in their first meta-period. During


Gi() {
    if( (Φ − ρ) ≥ α ) {
        // Create tokens from the existing free pages
        ρGi = ρGi + α; ρ = ρ + α;
    } else {
        // Recycle a block
        recycleBlock();
        // Note that recycleBlock() changes Φ, ρGi, and ρ accordingly
    }
    // Supply Ti with tokens
    ρTi = ρTi + α; ρGi = ρGi − α;
    // Give up the residual tokens
    z = ρGi − (π − α);
    ρGi = ρGi − z; ρ = ρ − z;
}

Ti() {
    if( beginningOfMetaPeriod() ) {
        // Give up the extra tokens
        x = ρTi − (wTi ∗ σi / pTi);
        ρTi = ρTi − x; ρ = ρ − x;
    }
    ... /* job of Ti */
}

Figure 5: The algorithm of the Free-Page Replenishment Mechanism.

their executions, Ti and Gi act like a pair of consumer and producer of tokens. Gi creates and provides tokens to Ti in each of its periods pGi because of its reclaimed free pages in garbage collection. The replenishment of tokens is enough for Ti to write pages in each of its meta-periods σi. When Ti consumes a token, both ρTi and ρ are decreased by one. When Gi reclaims a free page and, thus, creates a token, ρGi and ρ are both increased by one. Basically, by the end of each period (of Gi), Gi provides Ti the tokens created in the period. When Ti terminates, its tokens (and those of Gi) must be returned to the system.

The above replenishment mechanism has some problems: First, real-time tasks might consume free pages more slowly than what they declared. Secondly, the real-time garbage collectors might reclaim more free pages than what we estimate in the worst case. Since Gi always replenishes Ti with all its created tokens, tokens would gradually accumulate at Ti. The other problem is that Gi might unnecessarily reclaim free pages even though there are already sufficient free pages in the system. Here we propose to refine the basic replenishment mechanism as follows:

In the beginning of each meta-period σi, Ti gives up ρTi − (wTi ∗ σi/pTi) tokens and decreases ρTi and ρ by the same number because those tokens are beyond the needs of Ti. In the beginning of each period of Gi, Gi also checks whether (Φ − ρ) ≥ α, where the variable Φ denotes the total number of free pages currently in the system. If the condition holds, then Gi takes α free pages from the system, and ρGi and ρ are incremented by α, instead of actually performing a block recycling. Otherwise (i.e., (Φ − ρ) < α), Gi initiates a block recycling to reclaim free pages and create tokens. Suppose that Gi now has y tokens (regardless of whether they come from a block erasing or a gift from the system); it then gives α tokens to Ti. Gi might give up y − α − (π − α) = y − π tokens because they are beyond its needs, where (π − α) is the number of tokens needed for live-page copyings (done by Gi). Figure 4 illustrates the consuming and supplying of tokens between Ti and Gi (based on the example in Figure 3.(a)), and Figure 5 provides the algorithm of the free-page replenishment mechanism. We shall justify that Gi and Ti always have free pages to write or perform live-page copyings in Section 4.4. All symbols used in this section are summarized in Table 2.
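The token bookkeeping described above can be sketched as follows. This is a minimal simulation, not the paper's implementation: class and variable names are ours, and we assume each block recycling reclaims a known number of free pages passed in by the caller.

```python
# Sketch of the token bookkeeping for one pair (T_i, G_i); names are ours.
class Pair:
    def __init__(self, pi, alpha, quota):
        self.pi, self.alpha, self.quota = pi, alpha, quota
        self.rho_t = quota        # tokens of T_i: w_Ti * sigma_i / p_Ti
        self.rho_g = pi - alpha   # copying reserve of G_i
        self.rho = self.rho_t + self.rho_g  # total tokens in the system

    def g_period(self, phi, reclaimed):
        """One period of G_i; phi is the current number of free pages."""
        # Gain tokens: take existing free pages if plentiful, else recycle.
        gained = self.alpha if phi - self.rho >= self.alpha else reclaimed
        self.rho_g += gained
        self.rho += gained
        # Supply T_i with alpha tokens.
        self.rho_g -= self.alpha
        self.rho_t += self.alpha
        # Give up residual tokens beyond the (pi - alpha) copying reserve.
        extra = self.rho_g - (self.pi - self.alpha)
        if extra > 0:
            self.rho_g -= extra
            self.rho -= extra

    def t_meta_period_start(self):
        # T_i returns tokens beyond its per-meta-period quota.
        extra = self.rho_t - self.quota
        if extra > 0:
            self.rho_t -= extra
            self.rho -= extra

    def t_write(self, n=1):
        # A task cannot write a page without owning a token.
        assert self.rho_t >= n, "T_i owns too few tokens"
        self.rho_t -= n
        self.rho -= n
```

For instance, with π = 32, α = 16, and a quota of 30 (as in Figure 3.(a)), T1 starts with 30 tokens and G1 with 16; after each period of G1, the copying reserve ρG1 returns to exactly (π − α) tokens.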

3.3.3 A Block-Recycling Policy

A block-recycling policy decides which blocks should be erased during garbage collection. The previous two sections proposed the idea of real-time garbage collectors and the free-page replenishment mechanism for garbage collection. The only thing missing is a policy for the free-page replenishment mechanism to choose appropriate blocks for recycling.

The block-recycling policies proposed in the past work [1, 8, 13, 10] usually adopted sophisticated heuristics with the objectives of low garbage collection overheads and a longer overall flash memory lifetime. With unpredictable performance on garbage collection, these block-recycling policies cannot be used for time-critical applications. The purpose of this section is to propose a greedy policy that delivers a predictable performance on garbage collection:

We propose to recycle the block which has the largest number of dead pages, with the objective of predicting the worst-case performance. Obviously, the worst case in the number of free pages reclaimed happens when all dead pages are evenly distributed among all blocks in the system. The number of free pages reclaimed in the worst case after a block recycling can be denoted as:

⌈π ∗ ∆ / Θ⌉,    (2)

where π, ∆, and Θ denote the number of pages per block, the total number of dead pages on flash, and the total number of pages on flash, respectively. We refer readers to Table 2 for the definitions of all symbols used in this paper.

Formula 2 denotes the worst-case performance of the greedy policy, which is proportional to the ratio of the number of dead pages to the number of pages on flash memory. This is an interesting observation because we cannot obtain a high garbage collection performance by simply increasing the flash memory capacity. We must emphasize that π and Θ are constants in a system. A greedy policy which properly manages ∆ would result in a better lower bound on the number of free pages reclaimed in a block recycling.

Because Θ = ∆ + Φ + Λ, Equation 2 can be re-written as follows:

⌈π ∗ (Θ − Φ − Λ) / Θ⌉,    (3)


where π and Θ are constants, and Φ and Λ denote the numbers of free pages and live pages, respectively, when the block-recycling policy is activated. As shown in Equation 3, the worst-case performance of the block-recycling policy can be controlled by bounding Φ and Λ:

It is not necessary to perform garbage collection if there are already sufficient free pages in the system. The block-recycling policy can therefore be activated only when the number of free pages is less than a threshold value (a bound for Φ). Furthermore, in order to bound the number of live pages in the system (i.e., Λ), we propose to reduce the total number of the FTL-emulated LBA's.

For example, we can emulate a 48MB block device by using a 64MB NAND flash memory. We argue that this intuitive approach is affordable since the capacity of flash memory chips is increasing very rapidly.3 For example, suppose that the block-recycling policy is always activated when Φ ≤ 100, and there are 32 pages in a block; the worst-case performance of the greedy policy is ⌈32 ∗ (131,072 − 100 − 98,304) / 131,072⌉ = 8.
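The worked example can be reproduced directly from Equation 3; the function name is ours.

```python
# Worst-case pages reclaimed per block recycling (Equation 3), assuming
# dead pages are spread evenly over all blocks; the name is ours.
from math import ceil

def worst_case_reclaimed(pi, theta, phi, lam):
    """ceil(pi * (theta - phi - lam) / theta), with delta = theta-phi-lam."""
    return ceil(pi * (theta - phi - lam) / theta)

# 64MB flash emulating a 48MB device, recycling activated when Phi <= 100:
# theta = 64MB/512B = 131,072 pages, lam <= 48MB/512B = 98,304 pages.
```

With these numbers, worst_case_reclaimed(32, 131072, 100, 98304) gives the bound of 8 pages derived in the text.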

The major challenge in guaranteeing the worst-case performance is how to properly set a bound for Φ for each block recycling, since the number of free pages in the system might grow and shrink from time to time. We shall further discuss this issue in Section 4.1. As astute readers may point out, the above greedy policy does not consider wear-levelling because the consideration of wear-levelling could result in an unpredictable behavior in block recycling. We shall address this issue in the next section.

3.4 Supports for Non-Real-Time Tasks

The objective of the system design aims at simultaneous support for both real-time and non-real-time tasks. In this section, we shall extend the token-based free-page replenishment mechanism of the previous section to supply free pages for non-real-time tasks. A non-real-time wear-leveller will then be proposed to resolve the endurance problem.

3.4.1 Free-Page Replenishment for Non-Real-Time Tasks

We propose to extend the free-page replenishment mechanism to support non-real-time tasks as follows: Let all non-real-time tasks be first scheduled by a time-sharing scheduler. When a non-real-time task is scheduled by the time-sharing scheduler, it will execute as a background task under the real-time scheduler (please see Figure 1).

Different from real-time tasks, all non-real-time tasks share a collection of tokens, denoted by ρnr. Initially, π tokens are given to non-real-time tasks (ρnr = π) for the purpose of live-page copying during garbage collection for non-real-time tasks. Before a non-real-time task issues a page write, it should check whether ρnr > π. If the condition holds, the page write is executed, and one token is consumed (ρnr and ρ are decreased by one). Otherwise (i.e., ρnr ≤ π), the system must replenish itself with tokens for non-real-time tasks. The token creation for non-real-time tasks is similar to the strategy adopted by real-time garbage collectors: If Φ ≤ ρ, a block recycling is initiated to reclaim free pages, and tokens are created. If Φ > ρ, then there might not be any need for a block recycling.

3. The capacity of a single NAND flash memory chip had grown to 256MB when we wrote this paper.

3.4.2 A Non-Real-Time Wear-Leveller

In the past work, researchers tended to resolve the wear-levelling issue in the block-recycling policy. A typical wear-levelling-aware block-recycling policy might sometimes recycle the block which has the least number of erasings, regardless of how many free pages can be reclaimed. This approach is not suitable for a real-time application, where predictability is an important issue.

We propose to use non-real-time tasks for wear-levelling and to separate the wear-levelling policy from the block-recycling policy. We create a non-real-time wear-leveller, which sequentially scans a given number of blocks to see if any block has a relatively small erase count, where the erase count of a block denotes the number of erases that have been performed on the block so far. We say that a block has a relatively small erase count if the count is less than the average by a given number, e.g., 2. When the wear-leveller finds a block with a relatively small erase count, the wear-leveller first copies live pages from the block to some other blocks and then sleeps for a while. As the wear-leveller repeats the scanning and live-page copying, dead pages on the blocks with relatively small erase counts gradually increase. As a result, those blocks will be selected by the block-recycling policy sooner or later.
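The scanning step can be sketched as follows; the function name is ours, and the default slack of 2 follows the example given in the text.

```python
# Sketch of the wear-leveller's scan for "defrost" candidates; the
# function name and the slack default are ours.
def find_defrost_candidates(erase_counts, slack=2):
    """Return indices of blocks whose erase count is relatively small,
    i.e., less than the average erase count by at least `slack`."""
    avg = sum(erase_counts) / len(erase_counts)
    return [i for i, c in enumerate(erase_counts) if c < avg - slack]
```

A real wear-leveller would then copy the live pages off each candidate block and sleep, letting the greedy block-recycling policy pick the block up once its dead-page count grows.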

Note that one of the main reasons why a block has a relatively small erase count is that the block contains much live-cold data (which have not been invalidated/updated for a long period of time). The wear-levelling is done by "defrosting" blocks with relatively small erase counts, i.e., by moving cold data away.

The capability of the proposed wear-leveller is evaluated in Section 5.3.

4 System Analysis

In the previous sections, we proposed the idea of the real-time garbage collectors, the free-page replenishment mechanism, and the greedy block-recycling policy. The purpose of this section is to provide a sufficient condition to guarantee the worst-case performance of the block-recycling policy. As a result, we can justify the performance of the free-page replenishment mechanism. An admission control strategy is also introduced.


4.1 The Performance of the Block-Recycling Policy

The performance guarantees of real-time garbage collectors and the free-page replenishment mechanism are based on a constant α, i.e., a lower bound on the number of free pages that can be reclaimed for each block recycling. As shown in Section 3.3.3, it is not trivial to guarantee the performance of the block-recycling policy since the number of free pages on flash, i.e., Φ, may grow or shrink from time to time. In this section, we will derive the relationship between α and Λ (the number of live pages on the flash) as a sufficient condition for engineers to guarantee the performance of the block-recycling policy for a specified value of α.

As indicated by Equation 2, the system must satisfy the following condition:

⌈(∆ / Θ) ∗ π⌉ ≥ α. (4)

Since ∆ = (Θ − Λ − Φ), and (⌈x⌉ ≥ y) implies (x > y − 1) when y is an integer, Equation 4 can be re-written as follows:

Φ < (1 − (α − 1)/π) ∗ Θ − Λ. (5)

The formula shows that the number of free pages, i.e., Φ, must be kept under the bound in Equation 5 if the lower bound α on the number of free pages reclaimed per block recycling is to be guaranteed (under the flash utilization Λ). Because real-time garbage collectors initiate block recyclings only if Φ − ρ < α (please see Section 3.3.2), the largest possible value of Φ on each block recycling is (α − 1) + ρ. Letting Φ = (α − 1) + ρ, Equation 5 can be re-written as follows:

ρ < (1 − (α − 1)/π) ∗ Θ − Λ − α + 1. (6)
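The chain from Equation 4 to Equation 5 can be checked numerically. The sketch below is illustrative only; the parameter values used in the assertions (π = 32, Θ = 32768, Λ = 16384) are taken from the experimental setup in Section 5.

```python
import math

def alpha_guarantee_holds(theta, pi, lam, phi, alpha):
    """Equation 4: the greedy block-recycling policy reclaims at least
    alpha free pages per recycling when ceil((dead/theta) * pi) >= alpha.
    theta = total pages, pi = pages per block, lam = live, phi = free."""
    delta = theta - lam - phi          # number of dead pages
    return math.ceil(delta / theta * pi) >= alpha

def phi_upper_bound(theta, pi, lam, alpha):
    """Equation 5: Phi must stay strictly below this bound."""
    return (1 - (alpha - 1) / pi) * theta - lam
```

With the Section 5 parameters and α = 16, the bound of Equation 5 evaluates to Φ < 1024; the guarantee indeed holds at Φ = 1023 and fails at Φ = 1024.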

Note that given a specified bound α and the (even conservatively estimated) flash utilization Λ, the total number of tokens ρ in the system should be kept under the bound in Equation 6. In the next paragraphs, we will first derive ρ_max, the largest possible number of tokens in the system under the proposed garbage collection mechanism. Because ρ_max must also satisfy Equation 6, we will then derive the relationship between α and Λ (the number of live pages on the flash) as a sufficient condition for engineers to guarantee the performance of the block-recycling policy for a specified value of α. Note that ρ may go up and down, as shown in Section 3.3.1.

We shall consider the case which has the maximum number of tokens accumulated: the case occurs when non-real-time garbage collection recycled a block


Figure 6: An Instant of Having the Largest Number of Tokens in System (Based on the Example in Figure 3.)

without any live-page copying. As a result, non-real-time tasks could hold up to 2 ∗ π tokens, where π tokens are reclaimed by the block recycling and the other π tokens are reserved for live-page copying. Now consider real-time garbage collection: within a meta-period σ_i of T_i (and G_i), assume that T_i consumes none of its reserved (w_Ti ∗ σ_i / p_Ti) tokens. On the other hand, G_i replenishes T_i with (α ∗ σ_i / p_Gi) tokens. In the best case, G_i recycles a block without any live-page copying in the last period of G_i within the meta-period. As a result, T_i could hold up to (w_Ti ∗ σ_i / p_Ti) + (α ∗ σ_i / p_Gi) tokens, and G_i could hold up to 2 ∗ (π − α) tokens, where (π − α) tokens are the extra tokens created and the other (π − α) tokens are reserved for live-page copying. Suppose that all real-time tasks T_i and their corresponding real-time garbage collectors G_i behave in the same way as described above; Figure 6 illustrates this scenario based on the task set in Figure 3. The largest potential number of tokens in the system, ρ_max, could then be as follows, where ρ_free is the number of un-allocated tokens:

ρ_max = ρ_free + 2π + Σ_{i=1..n} ( w_Ti ∗ σ_i / p_Ti + α ∗ σ_i / p_Gi + 2(π − α) ). (7)

When ρ = ρ_max, we have the following equation by combining Equation 6 and Equation 7:

ρ_free + 2π + Σ_{i=1..n} ( w_Ti ∗ σ_i / p_Ti + α ∗ σ_i / p_Gi + 2(π − α) ) < (1 − (α − 1)/π) ∗ Θ − Λ − α + 1. (8)

Equation 8 shows the relationship between α and Λ. We must emphasize that this equation serves as a sufficient condition for engineers to guarantee the performance of the block-recycling policy for a specified value of α, given the number of live pages in the system, i.e., the flash utilization Λ. We shall address the admission control of real-time tasks in the next section, based on α.
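As a quick sanity check, the sufficient condition of Equation 8 can be evaluated directly. The helper below is a sketch; any task parameters passed to it are hypothetical rather than taken from the paper's tables.

```python
def equation8_holds(rho_free, pi, alpha, theta, lam, tasks):
    """Equation 8: sufficient condition relating alpha and the flash
    utilization Lambda. `tasks` is a list of (w_Ti, p_Ti, p_Gi, sigma_i)
    tuples; each contributes its worst-case token holdings."""
    lhs = rho_free + 2 * pi + sum(
        w * sigma / p_t + alpha * sigma / p_g + 2 * (pi - alpha)
        for (w, p_t, p_g, sigma) in tasks)
    rhs = (1 - (alpha - 1) / pi) * theta - lam - alpha + 1
    return lhs < rhs
```

A larger Λ (higher utilization) shrinks the right-hand side, so a promised α that holds at 50% utilization may fail when the flash fills up.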

4.2 Initial System Configuration

In the previous section, we derived a sufficient condition to verify whether the promised block-recycling performance (i.e., α) could be delivered. However, the setting of α highly depends on the physical parameters of the selected flash memory and the target capacity utilization. In this section, we shall provide guidelines for the initial system configuration of the proposed flash-memory storage systems: the values of α and ρ_init (i.e., the number of initially available tokens).

According to Equation 4, the value of α could be set in-between (0, ⌈π ∗ ∆/Θ⌉]. The objective is to set α as large as possible, because a large value for α means a better block-recycling performance. With a better block-recycling performance, the system could accept tasks which are highly demanding on page writes. For example, if the capacity utilization is 50%, we could set the value of α as ⌈32 ∗ 0.5⌉ = 16.

ρ_init could be set as follows: a large value for ρ_init could help the system accept more tasks, because tokens must be provided to real-time tasks and real-time garbage collectors on their arrival. Equation 6 in Section 4.1 suggests an upper bound on ρ, the total number of tokens currently in the system. Obviously, ρ = ρ_init when there is no task in the system, and ρ_init must satisfy Equation 6. We have shown that ρ_max (which is derived from Equation 7) must satisfy Equation 6 because the value of ρ might grow up to ρ_max. It becomes interesting if we could find a relationship between ρ_init and ρ_max, so that we could derive an upper bound for ρ_init based on Equation 6. First, Equation 7 could be re-written as follows:

ρ_max = ρ_free + 2π + Σ_{i=1..n} ( (w_Ti ∗ σ_i / p_Ti + (π − α)) + (α ∗ σ_i / p_Gi + (π − α)) ). (9)

Obviously the largest possible value of ρ_max occurs when all available tokens are given to real-time tasks and real-time garbage collectors. In other words, we only need to consider the case where ρ_free = 0. Additionally, we have ρ_Ti = (w_Ti ∗ σ_i / p_Ti), ρ_Gi = (π − α), and (α ∗ σ_i / p_Gi) ≥ (w_Ti ∗ σ_i / p_Ti) by the creation rules in Equation 1. Therefore, Equation 9 can be further re-written as follows:


ρ_max ≥ 2π + 2 Σ_{i=1..n} (ρ_Ti + ρ_Gi). (10)

Since we let ρ_free = 0, we have ρ_init = Σ_{i=1..n} (ρ_Ti + ρ_Gi). The relationship between ρ_max and ρ_init could be represented as follows:

ρ_max ≥ 2π + 2 ∗ ρ_init. (11)

This equation reveals the relationship between ρ_max and ρ_init, and it is independent of the characteristics (e.g., the periods) of the real-time tasks and real-time garbage collectors in the system. In other words, we could bound the value of ρ_max by managing the value of ρ_init. When we consider both Equation 6 and Equation 11, we have the following equation:

ρ_init < ((1 − (α − 1)/π) ∗ Θ − Λ − α + 1 − 2π) / 2. (12)

Equation 12 suggests an upper bound for ρ_init. If ρ_init is always under this bound, then the total number of tokens in the system is bounded as required by Equation 6 (because ρ_max is also bounded). As a result, we do not need to verify Equation 8 on every arrival of a new real-time task and its real-time garbage collector; we only need to ensure that Equation 12 holds when the system starts. This further simplifies the admission control procedure proposed in the next section.
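The start-up check implied by Equation 12 is a one-line computation. The sketch below uses the parameters later quoted in Section 5.1.2 (π = 32, Θ = 32768, Λ = 16384, α = 16), for which it reproduces the bound ρ_init < 472.5.

```python
def rho_init_bound(theta, pi, lam, alpha):
    """Equation 12: upper bound on the number of initially available
    tokens. Keeping rho_init below this value at start-up replaces
    per-arrival verification of Equation 8."""
    return ((1 - (alpha - 1) / pi) * theta - lam - alpha + 1 - 2 * pi) / 2
```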

4.3 Admission Control

The admission control of a real-time task must consider both its resource requirements and its impact on other tasks. The purpose of this section is to provide the procedures for admission control:

Given a set of real-time tasks {T_1, T_2, ..., T_n} and their corresponding real-time garbage collectors {G_1, G_2, ..., G_n}, let c_Ti, p_Ti, and w_Ti denote the CPU requirement, the period, and the maximum number of page writes per period of task T_i, respectively. Suppose that c_Gi = p_Gi = 0 when w_Ti = 0 (because no real-time garbage collector is needed for T_i).

Since the focus of this paper is on flash-memory storage systems, the admission control of the entire real-time task set must consider whether the system can initially give enough tokens to all tasks. The verification could be done by evaluating the following formula:


Σ_{i=1..n} ( w_Ti ∗ σ_i / p_Ti + (π − α) ) ≤ ρ_free. (13)

Note that ρ_free = ρ_init when the system starts (and there are no tasks in the system), where ρ_init is the number of initially reserved tokens. The above test can be evaluated in linear time. Besides this test, engineers are supposed to verify the relationship between α and Λ by means of Equation 8, which serves as a sufficient condition to guarantee α. However, in the previous section, we have shown that if the value of ρ_init is initially configured to satisfy Equation 12, the relation in Equation 8 holds automatically. As a result, it is not necessary to verify Equation 8 at each arrival of a real-time task.
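The token test of Equation 13 is a single linear pass over the task set. A sketch follows; the task tuples in the assertions are hypothetical, not the values of the paper's Table 3/4.

```python
def enough_initial_tokens(tasks, pi, alpha, rho_free):
    """Equation 13: each admitted task T_i needs (w_Ti * sigma_i / p_Ti)
    tokens and its collector G_i needs (pi - alpha) tokens up front.
    `tasks` is a list of (w_Ti, p_Ti, sigma_i) tuples."""
    demand = sum(w * sigma / p_t + (pi - alpha)
                 for (w, p_t, sigma) in tasks)
    return demand <= rho_free
```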

Other than the above tests, engineers should verify the schedulability of the real-time tasks in terms of CPU utilization. Suppose that the earliest deadline first (EDF) algorithm [3] is adopted to schedule all real-time tasks and their corresponding real-time garbage collectors. Since all flash-memory operations are non-preemptive, and block erasing is the most time-consuming operation, the schedulability of the real-time task set can be verified by the following formula, provided that other system overheads are negligible:

t_e / min_p() + Σ_{i=1..n} ( c_Ti / p_Ti + c_Gi / p_Gi ) ≤ 1, (14)

where min_p() denotes the minimum period over all T_i and G_i, and t_e is the duration of a block erase. Because every real-time task might be blocked by a block erase issued by either real-time or non-real-time garbage collection, the longest blocking time of every task is the same, i.e., t_e. Equation 14 can be derived from Theorem 2 in T. P. Baker's work [14], which presents a sufficient condition for the schedulability of tasks with non-preemptible portions. We must emphasize that the above formula only intends to convey the idea of blocking time due to flash-memory operations.
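Equation 14 can likewise be packaged as a one-line admission test. The task/collector pairs in the assertions below are hypothetical, and t_e is whatever block-erase duration the target flash chip specifies (all in the same time unit).

```python
def edf_schedulable(tasks, collectors, t_erase):
    """Equation 14: EDF schedulability test with one non-preemptible
    block erase as the worst-case blocking term. `tasks` and
    `collectors` are lists of (cost, period) pairs."""
    all_pairs = tasks + collectors
    min_p = min(p for (_, p) in all_pairs)          # min_p()
    util = sum(c / p for (c, p) in all_pairs)       # CPU utilization
    return t_erase / min_p + util <= 1
```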

4.4 The Performance Justification of the Free-Page Replenishment Mechanism

The purpose of this section is to justify that all real-time tasks always have enough tokens to run under the free-page replenishment mechanism, and no real-time garbage collector will be blocked because of an insufficient number of free pages.

Because each real-time task T_i consumes no more than (w_Ti ∗ σ_i / p_Ti) tokens in a meta-period, the tokens initially reserved are enough for its first meta-period.

Because the corresponding real-time garbage collector G_i replenishes T_i with (α ∗ σ_i / p_Gi) tokens in each subsequent meta-period, the number of replenished tokens is larger than or equal to the number that T_i needs, according to Equation 1. We conclude that all real-time tasks always have enough tokens to run under the free-page replenishment mechanism.

Because real-time garbage collectors also consume tokens for live-page copying, we shall justify that no real-time garbage collector will be blocked because of an insufficient number of free pages. Initially, (π − α) tokens are given to each real-time garbage collector G_i, which is enough for its first block recycling. Suppose that G_i decides to erase a block which has x dead pages (note that x ≥ α). The remaining (π − x) pages in the block could be live pages and/or free pages. For each non-dead page in the block, G_i might need to consume one token to handle the page, regardless of whether the page is live or free. After the copying of live pages, a block erase is executed to wipe out all pages on the block, and π tokens (free pages) are created. The π created tokens could be used as follows: G_i replenishes itself with (π − x) tokens, replenishes T_i with α tokens, and finally gives up the residual tokens. Because x ≥ α, we have (π − x) + α ≤ π, so G_i will not be blocked because of an insufficient number of free pages.
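The token-conservation argument above can be traced per block recycling. The function below is a sketch of the accounting only, not of the collector itself; it shows how the π created tokens split between G_i, T_i, and the surplus.

```python
def recycle_tokens(pi, alpha, dead_pages):
    """Token flow of one block recycling by G_i (sketch of Section 4.4).

    G_i spends up to (pi - dead_pages) tokens handling non-dead pages,
    then the block erase creates pi fresh tokens. Returns how those pi
    tokens are split: (kept_by_Gi, given_to_Ti, surplus)."""
    x = dead_pages
    assert x >= alpha, "the block-recycling policy guarantees x >= alpha"
    kept = pi - x                   # reserved for G_i's next live-page copying
    to_task = alpha                 # replenishment handed to T_i
    surplus = pi - kept - to_task   # residual tokens; >= 0 because x >= alpha
    return kept, to_task, surplus
```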

5 Performance Evaluation

5.1 Simulation Workloads

5.1.1 Overview

The real-time garbage collection mechanism proposed in Section 3 provides a deterministic performance guarantee to real-time tasks. The mechanism also targets simultaneous service to both real-time and non-real-time tasks. The purpose of this section is to show the usefulness and effectiveness of the proposed mechanism. We compared the behaviors of a system prototype with and without the proposed real-time garbage collection mechanism. We also show the effectiveness of the proposed mechanism's wear-levelling service, which is executed as a non-real-time service.

A series of experiments were done over a real system prototype. A set of tasks was created to emulate the workload of a manufacturing system, and all files were stored on flash memory to prevent vibration from damaging the system. The basic workload consisted of two real-time tasks T_1 and T_2 and one non-real-time task. T_1 and T_2 sequentially read the control files, did some computations, and updated their own (small) status files. The non-real-time task emulated a file downloader, which downloaded the control files continuously over a local-area network and then wrote the downloaded file contents onto flash memory.

The flash memory used in the experiments was a 16MB NAND flash [7]. The block size was 16KB, and the page size was 512B (π = 32). The traces of T_1, T_2,


and the non-real-time file downloader were synthesized to emulate the necessary requests for the file system. The basic simulation workload is detailed in Table 3. We varied the capacity utilization of the flash memory in some experiments to observe the system behavior under different capacity utilizations. For example, a 50% capacity utilization stands for the case where one half of the total pages were live pages. An 8MB block device was emulated over the 16MB flash memory to generate a 50% capacity utilization. Before each experiment started, the emulated block device was filled so that most of the pages on the flash memory were live pages. The duration of each experiment was 20 minutes.

5.1.2 System Configuration and Admission Control for Simulation Experiments

In this section, we used the basic simulation workload to explain how to configure the system and how to perform the admission control:

In order to configure the system, system engineers need to first decide the expected block-recycling performance α. According to Equation 4, the value of α could be set as a value between (0, ⌈π ∗ ∆/Θ⌉]. Under a 50% capacity utilization, the value of α could be set as ⌈32 ∗ 0.5⌉ = 16, since a large value is preferred.

Secondly, we use Equation 12 to configure the number of tokens initially available in the system (i.e., ρ_init). In this example, we have α = 16, π = 32, Λ = 16384, and Θ = 32768. As a result, we have ρ_init < 472.5. Intuitively, ρ_init should be set as large as possible, since we must give tokens to tasks on their arrival, and a larger value of ρ_init might help the system accept more tasks. However, if we know the token requirements of all tasks a priori, a smaller ρ_init could help reduce the possibility of recycling a block which still has free pages inside.

In this example, we set ρ_init = 256, which is enough for the basic simulation workload. Once α and ρ_init are determined, the corresponding real-time garbage collectors G_1 and G_2 could be created under the creation rules (Equation 1). The CPU requirements, the flash memory requirements, and the periods of G_1 and G_2 are shown in Table 4.

The admission control procedure is as follows: first, we verify whether the system has enough tokens for the real-time garbage collection mechanism under Equation 13. The token requirements of T_1, T_2, G_1, G_2, and the non-real-time garbage collection are 16, 15, 16, 16, and 32, respectively. Because ρ_init = 256, the condition is verified by the formula (15 + 16 + 16 + 16 + 32) ≤ 256 (Equation 13). Since the system satisfies Equation 12, the condition in Equation 8 also holds, and we need not check it again. Secondly, the CPU utilization is verified by Equation 14: the CPU utilization of each real-time task is shown in Table 4, and the condition is verified by the formula (0.3177 + 0.0437 + 0.1375 + 0.0367 + 1,881/16,354) ≤ 1 (Equation 14).
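The two admission formulas above can be re-checked in a few lines. The numbers are those reported in this section; reading the last term as the fraction 1,881/16,354 (i.e., t_e / min_p()) is an interpretation of the garbled original.

```python
# Equation 13: total token demand of T1, T2, G1, G2, and the
# non-real-time garbage collection versus rho_init = 256.
token_demand = 15 + 16 + 16 + 16 + 32
equation13_ok = token_demand <= 256

# Equation 14: per-task CPU utilizations plus the blocking term
# t_e / min_p(), taken here as 1,881 / 16,354.
cpu = 0.3177 + 0.0437 + 0.1375 + 0.0367 + 1881 / 16354
equation14_ok = cpu <= 1
```

Both conditions hold with comfortable margins, so the basic workload is admitted.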


5.2 The Evaluation of the Real-time Garbage Collection Mechanism

This section is meant to evaluate the proposed real-time garbage collection mechanism. We shall show that a non-real-time garbage collection mechanism might impose an unpredictable blocking time on time-critical applications, and the proposed real-time garbage collection mechanism could prevent real-time tasks from being blocked due to an insufficient number of tokens.

We evaluated a system which adopted a non-real-time garbage collection mechanism under the basic workload and a 50% flash-memory capacity utilization.

We compared our real-time garbage collection mechanism with the well-known cost-benefit block-recycling policy [1] in the non-real-time garbage collection mechanism. The non-real-time garbage collection mechanism was configured as follows: a page write could only be processed if the total number of free pages on flash was more than 100. If there were not enough free pages, garbage collection was activated to reclaim free pages. The cost-benefit block-recycling policy picked the block which had the largest value of a ∗ (1 − u) / 2u, as summarized in Section 1. But in order to perform wear-levelling, the block-recycling policy might recycle a block whose erase count was less than the average erase count by 2, regardless of how many free pages could be reclaimed. Since garbage collection was activated on demand, the garbage collection activities directly imposed blocking time on page writes. We define an instance of garbage collection as all of the activities starting when garbage collection is invoked and ending when it returns control to the blocked page write.

An instance of garbage collection might consist of recycling several blocks consecutively, because the non-real-time garbage collection policy might recycle a block without dead pages, due to wear-levelling.

We measured the blocking time imposed on a page write by each instance of garbage collection, because a non-real-time garbage collection instance blocks the whole system until the instance finishes. Figure 7 shows the blocking times of page writes, where the X-axis denotes the number of garbage collection instances invoked so far in units of 1,000, and the Y-axis denotes the blocking time in ms. The results show that the non-real-time garbage collection activities indeed imposed lengthy and unpredictable blocking times on page writes. The blocking time could even reach 21.209 seconds in the worst case. The lengthy blocking time was mainly caused by wear-levelling: the block-recycling policy might recycle a block which had a low erase count, regardless of how many dead pages were on the block. As a result, the block-recycling policy might consecutively recycle many blocks until at least one free page was reclaimed. Although the maximum blocking time could be reduced by decreasing the wear-leveller activities or by performing garbage collection in an opportunistic fashion, the non-real-time garbage collection did fail to provide a deterministic service to time-critical tasks.


The next experiment intended to observe whether the proposed real-time garbage collection mechanism could prevent each real-time task from being blocked due to an insufficient number of tokens. To compare with the non-real-time garbage collection mechanism, a system which adopted the real-time garbage collection mechanism was evaluated under the same basic configurations. Additionally, according to the basic workload, the real-time garbage collection mechanism created two (corresponding) real-time garbage collectors and a non-real-time wear-leveller. In particular, the non-real-time wear-leveller performed live-page copyings whenever a block had an erase count less than the average erase count by 2. The wear-leveller slept for 50ms between every two consecutive live-page copyings. The workload is summarized in Table 4. Note that the garbage collection activities (real-time and non-real-time) did not directly block a real-time task's page write request. A page write request of a real-time task would be blocked only if its corresponding real-time garbage collector did not supply it with enough tokens. Distinct from the non-real-time garbage collection experiment, we measured the blocking time of each real-time task's page write requests to observe if any blocking time ever existed. The results are shown in Figure 8, where the X-axis denotes the number of page writes processed so far in units of 1,000, and the Y-axis denotes the blocking time imposed on each page write (of real-time tasks) by waiting for tokens. Note that T_1 and T_2 generated 12,000 and 3,000 page writes, respectively, in the 20-minute experiment. Figure 8 shows that the proposed real-time garbage collection mechanism successfully prevented T_1 and T_2 from waiting for tokens, as expected.

5.3 Effectiveness of the Wear-Levelling Method

The second part of the experiments evaluated the effectiveness of wear-levelling.

The effectiveness of wear-levelling was evaluated in terms of the standard deviation of the erase counts of all blocks. A lower value indicates that all blocks were erased more evenly, while a higher value indicates that some blocks were erased frequently and others rarely. The objective of wear-levelling is to achieve a very small standard deviation.

The non-real-time wear-leveller was configured as follows: it performed live-page copyings whenever a block had an erase count less than the average erase count by 2, and it slept for 50ms between every two consecutive live-page copyings. Since past work showed that the overheads of garbage collection highly depend on the flash-memory capacity utilization [1, 13, 10, 5], we ran the experiments under different capacity utilizations: 50%, 60%, and 70%. The system parameters under each capacity utilization are summarized in Table 5. Additionally, we disabled the wear-leveller to observe the usage of blocks without wear-levelling. We also compared the effectiveness of the proposed wear-levelling method with that of the non-real-time garbage collection mechanism.
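For reference, the evenness metric of this section is just the standard deviation of the per-block erase counts. Whether the population or sample form was used is not stated in the paper, so the sketch below assumes the population form.

```python
import statistics

def wear_evenness(erase_counts):
    """Effectiveness metric of Section 5.3: standard deviation of the
    per-block erase counts. Smaller means the blocks wore more evenly;
    0 means every block was erased exactly the same number of times."""
    return statistics.pstdev(erase_counts)
```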
