在基於非揮發性隨機存取記憶體系統上提供改善作業系統效能的機制

(1)

國立交通大學

資訊科學與工程研究所

碩士論文

在基於非揮發性隨機存取記憶體系統上提供改

善作業系統效能的機制

Operating System Support on NVRAM-Based Systems

研究生：蕭智文

指導教授：張瑞川教授

(2)

在基於非揮發性隨機存取記憶體系統上提供改

善作業系統效能的機制

學生：蕭智文指導教授：張瑞川教授

國立交通大學資訊科學與工程研究所

論文摘要

硬碟效能長久以來一直是電腦系統中一個主要瓶頸，根據 Moore’s Law，處理器的效能以每年 60%的速度增進；而硬碟的效能僅以每年 10%的速度增進，因此在處理器和硬碟之間產生了極大的效能差距。非揮發性記憶體(Non-volatile RAM)是最近幾年新興的一項記憶體技術，其具有非揮發性的特性可以有效改善系統效能。由於傳統緩衝快取以及確保檔案系統資料一致性的設計都是建立在揮發性記憶體上，因此我們在非揮發性記憶體性統上提出三種機制，來進一步增加緩衝快取的效能以及減少確保資料一致性的負擔。第一，我們在 VFS 和一般檔案系統之間加入一層 Ram-based 檔案系統來暫時的存放新建立檔案，其目的為延遲檔案分配給檔案系統的時間以避免檔案零碎的情形以及減少不必要的 IO。第二，我們修改原本以檔案為單位的寫回策略，而進一步的考慮寫回的區塊在磁碟上的相對應位置，讓較連續且較鄰近的資料一起寫回，以減少寫回的時間；另外，我們還確保不要寫回最近剛被更新過的頁面。最後，我們提供了一個簡單機制可同時確保檔案系統的一致性而不需要額外的磁碟 IO 負擔。實驗結果顯示我們的機制之效能優於 Ext2 大約 76%；優於 Ext3 大約 94%。

(3)

Operating System Support on NVRAM-Based

Systems

Student: Chih-Wen Hsiao Advisor: Prof. Ruei-Chuan Chang

Computer Science and Engineering College of Computer

Science

National Chiao Tung University

Abstract

Disk performance has become the major bottleneck of most computer systems for a

long time. Following the Moore’s Law, the performance of processors improves by

about 60% per year. However, the performance improvement of disks is only about

10%, resulting in an increasing performance gap between processors and disks.

Non-volatile memory is an emerging technology in recent years and the characteristic

of non-volatile can improve the performance of system. Because the traditional buffer

cache management and guaranteeing file system consistent are based on volatile

memory, we propose three mechanisms on non-volatile system to improve further the

performance of buffer cache and reduce the overhead of consistency. First, we put

Ram-based file system between VFS and file system to temporarily store all new files.

Its main purpose is delay allocation for all new files to avoid the file fragmentation

and reduce the redundant IO traffic. Second, we modify the original file-by-file write

back policy and write back the contiguous or neighbor dirty blocks to reduce the seek

(4)

guarantee the file system consistent without any overhead of disk IO. The

experimental results show that the performance of our mechanisms is superior to Ext2

(5)

致謝

感謝我的指導老師張瑞川教授以及張大緯教授，兩位老師對我的論文所做的

指導，感謝在系統實驗室上所有的學長和同學對我的幫助和建議，另外還要感謝

張大緯教授兩年來不遲辛苦的和我們遠端視訊做討論，真的很辛苦他。

(6)

論文摘要………...…..…i Abstract……….…ii 致謝………...………iv Table of Contents………...……….v List of Figures……….vii List of Tables……….………...ix Chapter 1 Introduction……….………...1 1.1 Motivation………...……….1 1.2 Our Mechanisms……….……….2

1.3 Structure of the Thesis……….…………4

Chapter 2 Related Work……….……….5

2.1 System Recovery……….5

2.2 NVRAM as Storage Device……….6

2.3 NVRAM as Buffer………...7

2.4 Transaction Support……….9

Chapter 3 Design and Implementation……… 11

3.1 Background……….11

3.1.1 Temporary-File File System………..11

3.1.2 Intelligent Write-Back Policy………15

3.1.3 Transaction Support on File System Operations………...18

3.2 Implementation and Integration of the Approaches………...22

3.2.1 Implementation of Three Mechanisms………..22

3.2.2 Integration of Three Mechanisms………..28

(7)

4.1 Experimental Environment and Configurations……….29

4.2 The Performance Results of TempFFS………...32

4.3 The Performance Results of Intelligent Write-Back Policy………...37

4.4 The Performance Results of Transaction Support on Ext2………41

4.5 Put it all Together………45

Chapter 5 Conclusions………...51

(8)

LIST OF FIGURES

Figure 3.1 Architecture of the Temporary-File File System……….13

Figure 3.2 Transformation Steps………..14

Figure 3.3 the Zone Information Table……….17

Figure 3.4 Zone with Different Average Segment Length………...18

Figure 3.5 the Main Data Structure of Journaling………20

Figure 3.6 the Core Functions of Journaling ………...20

Figure 3.7 Pseudo Code of Segment/Zone Algorithm……….25

Figure 3.8 Implementation of the Transaction API (Pseudo Code) ………27

Figure 3.9Integration of the TempFFS, Intelligent Write Back and File Operation Transaction Support ………28

Figure 4.1 Performance Improvements of TempFFS (Bonnie++)……..………32

Figure 4.2 Performance Improvements of TempFFS (Untar and Compile Linux Kernel)………33

Figure 4.3 Performance Improvements of TempFFS (Postmark)………34

Figure 4.4 Rations of Files Deleted in TempFFS (Postmark)………..34

Figure 4.5 Reduction of IO Traffic with TempFFS……….35

Figure 4.6 Average File Distance (Postmark)……….35

Figure 4.7 Performance Comparison of the Write Back Polices (Bonnie++)……….37

(9)

Linux Kernel)………..38

Figure 4.9 Performance Comparison of the Different Locative Write Back Polices...39

Figure 4.10 Performance Comparison of the Write Back Polices (Postmark)……….39

Figure 4.11 Accumulate Inter-requests Distance (Postmark)………...40

Figure 4.12 Repeated Write Back Blocks (Postmark)………...………….…..41

Figure 4.13 Performance of the Transactions Support Mechanism (Bonnie++)…..…41

Figure 4.14 Performance of the Transactions Support Mechanism (Untar and Compile

Linux Kernel)………..43

Figure 4.15 Performance of the Transactions Support Mechanism (Postmark)…….44

Figure 4.16 Performance Results of Different Combinations (Bonnie++)………….46

Figure 4.17 Performance Results of Different Combinations (Linux Kernel Untarring

and Compilation)………47

Figure 4.18 Performance Results of Different Combinations (Postmark, 512-10K

bytes)……….……..48

Figure 4.19 Performance Results of Different Combinations (Postmark, 512-2M

(10)

LIST OF TABLES

Table 1.1 MRAM and DRAM Characteristic………3

Table 4.1 Evaluation Environment………...30

Table 4.2 Experimental Configurations………31

Table 4.3 Memory Overhead of the Transaction Support Under Different

(11)

Chapter 1 Introduction

1.1 Motivation

The speed of processor is double every eighteen months by Moore’s Law, but the

performance of the computer system grows slowly due to the slow I/O. The disk I/O

time is mainly dominated by seek time and rotation delay. Because data in volatile

memory (DRAM) is not persistent, system must periodically write back dirty data to

the disk for avoiding data loss due to the power failure. Therefore, performance of

disk IO is worse.

With the well advances in non-volatile memory (NVRAM) technologies, many

kinds of non-volatile RAM such as MRAM [9] (Magnetoresistive RAM), FeRAM

(Ferro Electric RAM), PRAM (Phase-change RAM) and OUM (Ovonics Unified

Memory) [10] have been proposed. NVRAM is an emerging technique to solve the

problem to the slow disk IO. As semiconductor technology makes progress, we can

anticipate NVRAM to become a common component of computer systems. Since

MRAM among them is comparable with DRAM in terms of capacity, speed and cost,

MRAM is considered as the potential replacement of DRAM as the main memory for

computer systems. Therefore, we can regard data in memory as persistent and exploit

some approaches about NVRAM to improve performance of disk IO.

There are some problems in traditional DRAM-based system if the main memory is

NVRAM. Firstly, according to some research [37][38][46], many small files are

short-lived. Once the file is created in file system, file system must do some disk IO

operations, such as reading metadata of the file. Even many files are deleted soon

after created, they also need perform disk IO. Besides, after these files are deleted, it

(12)

when power outages. But there are two disadvantages: first, the dirty pages per file

maybe distributed far in the disk. It must spend many seek time and rotation delay on

writing back these dirty pages. Second, if it writes back all dirty pages of a file,

system can not make sure whether it does not write back recently-updated pages.

According to time locality, recently-updated pages may be re-accessed and

re-modified recently. Therefore, if it writes back recently-updated pages, it wastes the

disk IO operations. Lastly, the file system consistency has been an important issue

recently. Many file systems use the technique to support data consistency such as

journaling, but it needs extra journaling IO to write logs into the disk earlier.

Therefore, on the basis of three problems, we propose three mechanisms

corresponding three problems to improve performance of the file system.

1.2 Our Three Mechanisms

In this thesis, we propose the Buffer Cache management and transaction support on

file system operations in NVRAM systems. We have two mechanisms temporary-file

file system (TempFFS) and intelligent write back policy (WB) in Buffer Cache

management and one mechanism for transaction support on file system operations

(trans) for maintaining file system consistency.

Firstly, we add TempFFS between VFS and file system to apply delayed allocation

simultaneously on all existing file systems. Different from file system specific

implementations that maintain newly-created files on their own, such as XFS[47], and

Ext4, TempFFS maintains newly-created files for all the file systems. Upon memory

pressure or sync operations, the files are transferred to their original file systems and

block allocation of these files takes place. Therefore, an existing file system can enjoy

(13)

Secondly, due to the data in NVRAM is persistent; we modify original write back

policy which is file-by-file and does not consider recency. We consider the location of

dirty pages in the disk and write back contiguous or neighbor dirty pages to reduce the

seek time and rotation delay of the disk. Besides, we consider recency that we do not

write back the recently-updated dirty pages.

Lastly, our transaction support mechanism can ensure both file system consistency

without inducing any extra disk I/O. Since the data in NVRAM is persistent, it needs

not write journaling IO before. We only make sure the file operation is atomic. In

order to achieve atomic, we duplicate all data and metadata before they are modified

in the file operation into undo logs. Once the file operation has finished successfully,

the undo logs can be removed immediately. If the crash happens in the progress of the

file operation, the undo log can be used to restore to the consistent state. Since our

undo logs are placed in NVRAM and deleted later, it needs not any extra disk I/O.

We implement our three mechanisms in Linux 2.6.12. Since large capacity MRAM

is not generally available in the market, and the performance characteristics of DRAM

and MRAM are comparable and shown in Table 1.1, we use DRAM to emulate

MRAM.

Table 1.1 MRAM and DRAM Characteristic

Device Type MRAM DRAM

Volatility No Yes Characteristic Erase Needed No No Access Time 50ns ~5ns Read Time 50ns 50ns Performance Write Time 50ns 50ns

(14)

Operation Power Supply 1.8V 1.8-5V According to our experimental results, the performance improvement of our

TempFFS is about 35% compared to Ext2, the performance improvement of our

intelligent write-back is about 65% compared to Ext2 and the performance

improvement of our transaction support is about 80% compared to Ext3. Lastly, the

performance improvement of the combinations of three proposed mechanisms is

about 90% compared to Ext3.

1.3 Structure of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 describes the related

work about NVRAM. Chapter 3 presents the design and implementation details of the

proposed mechanisms. The performance results are shown in Chapter 4. Finally, we

(15)

Chapter 2 Related Work

In this chapter, we introduce some researches about NVRAM. In Section 2.1, we

introduce some researches exploiting NVRAM to recover system when system

crashes. In Section 2.2, some researches use NVRAM as storage device to improve

the performance of file system. In Section 2.3, some researches use NVRAM as

buffer to reduce disk IO, especially write operation. In Section 2.4, we introduce

researches about providing file system consistency.

2.1 System Recovery

Ren Ohmura [33] in Keio University exploits the characteristic of NVRAM in

system recovery. They propose a scheme to recovery the state of peripheral devices in

NVRAM systems so that the system can resume its execution after an unpredictable

power failure. They record all messages between CPU and devices in NVRAM and

system re-sends messages recorded in memory to recovery devices into previous state

when power failure.

Harp [25] records all updates of files in server nodes. Files in individual node can

survive after the failure because file operations are logged in memory at several nodes.

The Recovery Box [2] stores the state of system in NVRAM and protects the region

of storing the system state to not overwrite when system crashes. After the system

crashes, it uses the protected system state to recover system.

Rio [8] enables the data in memory to survive operating system crashes and power

outages. It uses write protections to protect files in file cache and does not

accidentally overwrite the file cache while system is crashing. Therefore, the files in

(16)

2.2 NVRAM as Storage Device

Because the flash is cheap and has the characteristic of non-volatile, the more and

more file systems which are designed for flash memory are proposed, such as JFFS2 [44] and Microsoft Flash [24] and so on. However, the flash memory has some limits:

firstly, flash must erase the block before writing it. Secondly, the block in flash has

finite number of erase-write cycles. Therefore, the flash system usually writes data by

using non-in-place update and it makes the number of erase-write cycles in each block

are similar by using wear leveling technique.

In order to speed up the writing in flash system, eNVy [45] uses a small amount of

battery-backed SRAM as write buffer and uses copy-on-write technique to copy

corresponding data in flash into SRAM, then modifies data in SRAM. Lastly, it writes

back data into flash memory when the amount of SRAM is full. Hwan Doh [11] also

exploits non-volatile memory to enhance the performance of flash file system. They

propose a flash-based file system that stores all metadata in NVRAM and stores all

file data in flash memory. The advantages of using NRAM as a metadata store are the

mount time of flash is reduce to the minimum and access all metadata is speeder than

before.

MRAMFS [12] is a prototype in-memory file system to put all data/metadata in

NVRAM. However, the amount of NVRAM may be not enough containing of a large

number of files. Therefore, they use the compression method to reduce occupied

space and use the different compression method to compress metadata and data

because metadata often has the fixed format. In metadata they can save about 60%

space and save about 40%~60% space for file data.

(17)

HeRMES file system [31] and Conquest file system [42]. HeRMES considers that the

metadata is frequently modified in the file system requests. Therefore, they suggest

that use of compression techniques in order to minimize the amount of memory

required for metadata and place all metadata in NVRAM to improve the performance

of file system requests. Conquest assumes that the system is in the sufficient amount

of NVRAM. Therefore, it stores all small files and metadata in NVRAM and disk

holds only the data content of remaining large files. The advantages are that it can

avoid the overhead of accessing small file and metadata because metadata and small

files are placed in NVRAM and it can optimize the arrangements of large files to

reduce the fragmentation in disk because there are only large files in disk.

The above works have some disadvantages. Firstly, they almost place all metadata

in NVRAM but the occupied space of metadata/data is constantly increasing as users

create files at all times. Secondly, although the metadata is frequently accessed in file

system, it is not that all metadata are frequently accessed. Therefore, they place all

metadata in NVRAM such that there is some non-recently-used metadata occupied

the NVRAM space resulting in performance decreases.

2.3 NVRAM as Buffer

In addition to storage device, the general purpose of NVRAM is as the write buffer.

eNVy [45] mentioned in Section 2.2 uses a small amount of battery-backed SRAM as

write buffer to improve the performance of write operations in flash. Mark Baker [1]

proposes that if they provide a NVRAM as write buffer, it can reduce disk access by

about 20% on most of file systems, and by about 90% on one frequently-accessed file

system.

(18)

provides two benefits: some writes will be avoided because dirty blocks will be

overwritten in the cache, and physically contiguous dirty blocks can be grouped into a

single I/O operation. They also present some write back strategies, such as least

recently used (LRU), shortest access time first (STF) and largest segment per track

(LST) to manage non-volatile write buffer and find that write buffer can reduce a

large number of write requests to improve the performance of system.

Robert Y. Hou [20] exploits non-volatile memory to improve the performance of

RAID5. In each write request, RAID5 needs to execute “read-modify-writes” which

means that single-block writes require the old data block and old parity block to be

read, modify them to generate the new parity block, and then the new data and new

parity can be written to their respective locations. Read-modify-writes can reduce the

performance of RAID5 arrays because it needs four disk accesses in each write

request. Therefore, they use non-volatile memory as the write buffer of RAID5 to

improve the performance of write operations.

Above researches are also about using write buffer to improve write operations,

Alex Batsakis [3] mentions read operations may depend upon write operations

because buffering dirty pages will occupy the memory for read caching. They address

this problem by separately allocating memory between write buffering and read

caching and by writing dirty pages to disk opportunistically before the operation

system submits them for write-back. They also write back dirty pages which are

almost adjacent, but they do not consider whether the dirty pages are not

recently-updated.

Due to the capacity of MRAM is increasing continuously, it maybe replace DRAM

as the main memory of computing system in the future. We not only use the technique

(19)

improve the performance of file operations.

2.4 Transaction Supporting

Traditionally, file system consistency has been maintained by using synchronous

writes to restrict the proper ordering of metadata updates, but this approach degrades

the performance of file system because the proceeding of metadata updates is

dominated by the disk speed. Soft updates [30] eliminates the need for synchronous

disk I/O. Soft updates is an implementation mechanism that enforces the

dependencies of metadata updates and allows the metadata caching for write back.

Log-structured file system [39] proposed by Mendel Rosenblum treats the file

system as a segmented log and always writes all modified data blocks and metadata

into the end of the log. File system changes are buffered in the cache and then written

into the disk sequentially in single disk IO operation. Therefore, it can improve the

performance of write operation but it can not write all related metadata in single write

operation since if crashes happen in the progress of disk operation, the file system

remains an inconsistent state.

Journaling [35][44][47] is nowadays a widely-used technique for file system

consistency. It logs metadata and data updates into a stable storage before the updates

are performed on the disk. Hence, it produces the extra journaling IO traffic that is

critical impact on the system performance.

Kevin M. Greenan [17] introduces two approaches to reliably storing file system

structures in NVRAM. Firstly, they strengthen memory consistency by using

page-level write protection and error correcting codes. Secondly, it periodically calls

online consistency checker to replay all transaction logs for checking file system

(20)

state of file system. However, it needs to periodically replay all transaction logs even

if the file system is normal and does not have any failures.

Henry Mashburn [40] proposes recoverable virtual memory (RVM) that is simple

user-lever library to handle atomic file operation and data persistence. Firstly, it

copies the range of memory which will be updated to the undo log in memory, then

updates data, and lastly writes the updated data to the redo log in disk. Therefore, it

needs three copy operations for each file operation.

Vista [27] proposed by David Loweel is simple user-library runs on Rio mentioned

in Section 4.1. Because Rio protects the files in memory to be persistent, Vista can

eliminate the redo log to speed up disk operations and it only uses undo log to make

sure the file operation is atomic. However, it must be based on Rio and because it is

user-level library, Vista is not user-transparent.

We propose a simple lightweight transaction support on file system operations in

NVRAM environment and it only needs to add only about 40 line-codes in kernel and

about 300 line-codes in implementation. It also provides the same strength of

(21)

Chapter 3 Design and Implementation

In this chapter, we describe the design and implementation of the proposed

mechanisms. In Section 3.1, we first introduce the three mechanisms for improving

the performance and ensuring the consistency of file systems on NVRAM based

computer systems, namely Temporary-File File System (TempFFS), intelligent

write-back policy, and transaction support on file system operations. In Section 3.2,

we show the details of implementing and integrating the mechanisms and provide an

analysis on the integration of the mechanisms.

3.1 Background

In this section, we describe the proposed NVRAM-based buffer cache management

mechanisms, which include Temporary-File File System and intelligent write-back

policy. Both mechanisms aim at improving the file system performance based on the

non-volatility feature of main memory. Moreover, we also describe a lightweight

transaction support mechanism on file system operations, which takes advantage of

the non-volatility feature of main memory for ensuring the consistency and data

integrity of the file system.

3.1.1 Temporary-File File System (TempFFS)

The first goal of TempFFS is to reduce the fragmentation of the underlying file

systems. With numerous and concurrent file creation/deletion/appending activities, a

file system is easy to become fragmented, which leads to performance degradation.

Moreover, according to the previous studies [37][38][46], many files are short-lived,

meaning that they are deleted soon after their creation. Allocating disk space for these

(22)

To reduce the file system fragmentation and the unnecessary disk IO operations,

some advanced file systems such as XFS [47]and ext4 support delayed allocation,

which delays the disk block allocation of a newly-created file until the data is needed

to be flushed back to the disk due to memory pressure or sync operations. However,

the delayed allocation feature is not shared among all file systems. Only the file

systems that implement the feature can benefit from it.

Instead of integrating the delayed allocation feature into a specific file system, we

implement a RAM-based file system named TempFFS in order to apply the feature

simultaneously on existing file systems such as ext3 and NTFS. Based on the concept

of stackable file systems, TempFFS sits between VFS (virtual files system) and file

system implementations and is transparent to the latter, as shown in Figure 3.1. All

new files are initially written to TempFFS and associated with their original file

systems when they are created. TempFFS uses page cache as the file store, and the

files are transferred into their corresponding file systems upon memory pressure or

sync operations. In this way, existing file systems can benefit from delayed allocation

without code modifications. Note that a file can stay for a long time in TempFFS. This

raises the risk of data loss if the main memory is volatile. On systems with

non-volatile main memory, however, memory data can survive power failures. The

implementation of TempFFS was achieved by modify the code of an existing RAM

(23)

Figure 3.1 Architecture of the Temporary-File File System

TempFFS stores files in kernel memory, which cannot be paged out in traditional

UNIX operating systems (including Linux). Upon memory pressure, an OS usually

writes back the dirty pages that belong to the buffer cache or user processes to the

storage device so as to release more memory space. In this situation, TempFFS checks

if its size is larger than a specific threshold. If it is, TempFFS shrinks its size by

evicting pages of the least recently used files. All the evicted files are transformed into

their original file systems so that the corresponding data can be written back. In

addition, we transform files whose sizes are larger than a specific threshold (currently,

1MB) due to the following two reasons. First, according to previous research

[37][38][46], most short-lived files are small ones, if it puts short-lived files in

TempFFS, it can reduce some IO traffics. Second, creating a huge file may cause the

transform of a large number of short-lived small files before they are deleted, Transform Transform Create

General File System Operation

VFS

TempFFS

User Process VFAT Page Cache Read / Write EXT2

(24)

reducing the benefit of delay allocation.

We manage the files in TempFFS in a LRU list. The number of pages that should be

evicted from TempFFS, say N, is proportional to the number of pages in TempFFS.

Specifically, N is calculated according to the following equation:

N = NR_WB * NR_TempFFS / NR_Dirty,

where NR_WB represents the target number of pages that need to be written back,

NR_TempFFS represents the number of (dirty) pages in TempFFS, and NR_Dirty

represent the number of dirty pages in the system. As shown in Figure 3.2,

transforming a file involves the following three steps.

Figure 3.2 Transformation Steps

First, the file create operation of the original file system is invoked to produce the

metadata (inode) of the file. Second, several inode fields such as timing information,

access rights and file size, are copied to the new inode. Third, a sequence of disk

block allocation operations of the original file system are invoked for allocating the Ram FS Block 5 Block 6 Block 7 Original FS (2)Copy Some Inode Fields (3)Allocate Blocks Data Block Pointers

(1)Create a File System Inode

NULL Inode fields

(25)

disk space for the file. Because the operations are invoked consecutively, the resulting

data blocks tend to be contiguous. After the allocation, the data is associated with the

allocated blocks and the metadata in the TempFFS is deleted.

3.1.2 Intelligent Write-Back Policy

Modern operating systems write back dirty pages periodically or when the number of free pages is below a specific threshold (i.e., memory pressure). On systems with

non-volatile main memory, dirty pages are already persistent and thus need not to be

written back into the disk periodically. Instead, they need to be written back only

under memory pressure or sync operations. Currently, Linux utilizes a file-by-file

write back policy, which scans the list of dirty inodes and submits the dirty pages of

each inode to the IO subsystem. The rationale behind this policy is to reduce the

numbers of non-up-to-date files when power outages or system crashes. Assume that

100 files are updated and each file has 10 dirty pages in memory. If the system

crashes after 500 dirty pages are written back to disk, it would be better to write all

the dirty pages of 50 files than write 5 dirty pages of all the files.

However, this policy may write back recently-updated pages, which has two

drawbacks. First, writing back such pages can not help to release the situation of

memory pressure since these pages will not be reclaimed by the page replacement

policy. One purpose of writing back dirty pages is to reclaim the page so as to

maintain a reasonable number of free pages in the system. In Linux, all pages

belonging to user processes and page cache are grouped into two lists, the active list

and the inactive list. The former includes pages which have been accessed recently

while the latter contains pages that have not been accessed for a period of time. The

(26)

are used recently. Second, according to time locality, these pages will be marked dirty

soon after their write back. Thus, writing back such pages is of little use. The pages

may need to be written back again soon. Some UNIX systems like Solaris do not have

such problem. They only write back dirty pages that are not used recently.

The common problem of the write back policies of the existing UNIX operating

systems (including Linux) is that they ignore the disk location of the dirty pages when

submitting the pages to their IO subsystems. Although an IO subsystem can sort the

requests submitted to it, there may still a significant amount of seek and rotation delay

among the dirty pages.

In this paper, we propose an intelligent write-back policy, which considers the

recency as well as the disk locations of the dirty blocks to reduce the IO traffic, seek

time and rotation delay. To reduce the IO traffic, the proposed policy recency only

writes back dirty pages in the inactive list.

To reduce the seek time, we divide a disk into a number of zones, which is a set of

continuous blocks on the disk, and write back dirty pages in a zone-by-zone manner.

The dirty page information is recorded in a set of identical data structures called zone

information tables, each of which correspond to a zone. When a page becomes dirty

and inactive, we record the page in the corresponding zone information table

according to the disk block number of the page.

Each time the write-back procedure is invoked, the proposed policy selects a zone

and writes back dirty pages in that zone. This reduces the seek time because the disk

blocks of the written-back dirty pages are close. In order to further reducing the

rotation delay, the policy selects a zone with the maximum Average Segment Length

(27)

blocks in a zone, and there is generally no rotation delay between two continuous

blocks. Therefore, this policy tries to select a zone which contains more continuous

dirty pages to reduce both the seek time and the rotation delay of the IO traffic caused

by dirty page write back.

Average Segment Length (ASL) = Number of Dirty Pages in the Zone/Number of Segments in the Zone _________________________________________Equation 1

The zone information table, which is shown in Figure 3.3, it records some

information such as, dirty pages numbers, segment numbers, segment list which

contains of all segment in the zone, page list which includes all dirty pages of the

inactive list in the zone, and length (Average Segment Length).

Figure 3.3 the Zone Information Table

Zone1 Zone2 Zone3 Zone4 ZoneN Disk Layout Pages NO Segment NO Segment List 6 3

100

155

188

Seg List. Page List. Length: 3 Seg List. Page List. Length: 2 Seg List. Page List. Length: 1 100 155 188

101

102

156

(28)

Figure 3.4 Zone with Different Average Segment Length

Figure 3.4 shows an example of the zone selection. The dirty pages of zone 4 and

zone 6 are both 7, but the dirty pages of zone 6 are more continuous (i.e., with a larger

value of ASL) than zone 4. Therefore, zone 6 is selected to be written back.

As mentioned before, this policy only writes back pages in the inactive list in order

to reduce the write back IO traffic. Therefore, only the dirty pages in the inactive list

are recorded in the zone information tables. To accomplish this, we need to insert or

remove the information about a dirty page when it becomes inactive or active.

Specifically, when a dirty page becomes inactive (i.e., moves from the active list to

the inactive list), we record it in the corresponding zone information table. When the

page becomes active again or clean, the recorded information is removed. This allows

us to write back only inactive dirty pages.

3.1.3 Transaction Support on File System Operations

Journaling is a widely-used technique for guaranteeing file system consistently. It

logs metadata and data updates that are completed in memory into a stable storage

Zone 5

Zone 4

Zone 6

Dirty Pages

ASL = 1.75

ASL = 3.5

(29)

before the updates are performed on the disk. When the system crashes, the log is

replayed to restore the status of the file system. Therefore, journaling ensures file

system consistency by using a redo log. The overhead of journaling is that it requires

additional disk I/O operations for logging. On non-volatile memory based computing

systems, one straightforward approach for eliminating such I/O operations is to place

the log in the memory instead of disk. However, the drawback of simply placing logs

in memory is that it occupies a large memory space. For example, it typically needs

256MB memory space as journaling space. Additionally, to minimize the IO overhead,

a journaling file system usually writes the updates to the log in batches with size

about 16 to 64 MB. A significant amount of updates might be lost if the system fails

before the updates are written to the log.

To address the problems mentioned above, we design a lightweight transaction

mechanism on systems with non-volatile main memory. The mechanism not only

ensures file system consistently but also eliminates the need of large memory space

and extra disk I/O.

The basic idea of the mechanism is undo log. Because the data in NVRAM does

not lose, we need not write journaling logs into the disk before. We only need make

sure that each file operation is atomic. To ensure the atomicity of each file operation,

we duplicate the data and metadata in the undo log before they are modified. Once a

file operation has finished successfully, the duplicated data and metadata can be

removed immediately. If the system crashes, the content in the undo log (i.e., the

original values of the metadata and data of the uncompleted operations) is used to

recover the file system state. The main data structure of our lightweight transaction

(30)

Figure 3.5 Main Data Structure of Transaction Support

In order to make sure the atomicity of the updates involved in a file operation, all

metadata and data modified by the file operation are collected into a transaction and

recorded in a data structure called trans. When metadata or data is going to be

modified by a file operation, the original content is copied to a memory area pointed

by a data structure called replica, which is then attached to trans. The replica also

records the addresses of the metadata/data that are under update. This allows the

metadata/data to be recovered by the original content if necessary.

Figure 3.6 Core Functions of Transaction Support Read block bitmap & group descriptor

Modify metadata

Duplicate metadata Transaction_start

Transaction_stop

Access Metadata _{Modify data}

Duplicate data

Write

trans replica replica

Replica list trans trans

Transaction list

…

Metadata / Data Duplicated version

Process

(31)

We implemented the transaction support in a file system independent manner and

provide a transaction API by which file systems can leverage to achieve file operation

atomicity. Figure 3.6 shows an example usage of the transaction API. Before each file

operation, the file system firstly invokes the transaction_start() function, which

initializes the trans data structure for the current process and inserts the trans data

structure into the global transaction list. Then, the file system calls duplicate() which

duplicates the data and metadata before they are going to be updated. It initializes a

replica data structure and copies the original values of the metadata/data into a temporally-allocated area, then maps the area to its replica. The replica data structure

will be inserted into the replica list of the corresponding trans. Lastly, after the file

operation, the file system calls transaction_stop() which terminates the transaction. It

removes all duplicated data and metadata of corresponding this trans without

affecting data integrity and file system consistency because all metadata/data in the

file operation have already finished upgrading. Therefore, the transaction support

mechanism can have the same strength as the journal mode of ext3 because it

duplicates data and metadata before they are updated. Moreover, it causes little

overhead in file system because it deletes all replicas of data and metadata once file

(32)

3.2 Implementation and Integration of the Approaches

In this section, we describe the detailed implementation and integration of the three

approaches.

3.2.1 Implementation of Three Mechanisms

As mentioned before, we implement TempFFS by modifying an existing RAM file

system called Ram-FS (Resizable simple ram File System), and then insert it between

VFS and file system implementations. We intercept the invocation of the VFS file

create function (i.e., vfs_create()) and direct the invocation to the file create function

in RamFS. After the creation, operations on the file will use the file operations in

RamFS because the file now is placed in RamFS not in file system, such as Ext2.

Upon memory pressure or the size of file is over the threshold, we transform files in

Ram-FS into the file system. The detail steps of transform are shown in Section 3.1.1.

After transforming, the file is belong to the file system, we only use the original file

operations in file system to access it.

To implement the intelligent write-back policy relo (recency and location), we

record the information of a page in a zone information table, which is shown in Figure

3.3. When the page becomes dirty and inactive, we record this dirty page into the

corresponding zone information table. To achieve this, we invoke a function

add_to_zone() in two situation. First, when it calls the function that marks the page

dirty (i.e., set_page_dirty()), we check the active flag (PG_active) of the page. If this

dirty page is in the inactive list (i.e., PG_active flag is not set), we invoke a function

add_to_zone() for set_page_dirty() function. Second, when it calls the function that

moves the page form the active list into the inactive list (i.e.,

add_page_to_inactive_list()), we check whether the page is dirty or not. If this page is

(33)

The pseudo code of the add_to_zone() function is shown in Figure 3.7(a). First, we

get the block number of the page, and calculate the zone corresponding to this page

(i.e., block number of page divides block number per zone). Second, numbers of dirty

pages in zone information table increases by one. If the former block and latter block

of this page do not record in zone information table, it means that this page stands

alone. If this page is recorded in the zone information table, it produces a new

segment (contiguous dirty pages). Therefore, segment numbers of zone information

table increases by one. Lastly, we calculate the ASL (average segment length) as a

basis of selecting the zone to write back.

When the dirty page is clean or active, we also need to remove the information of

the page from the zone information table. To achieve this, we invoke a function

remove_from_zone() in two situation. First, when it calls the function that clears dirty

of the page (i.e., clear_page_dirty_for_io()), we invoke a function

remove_from_zone() for each call of the clear_page_dirty_for_io() function. Second,

when it calls the function that moves the page form the inactive list into the active list

(i.e., add_page_to_active_list()), we invoke a function remove_from_zone() for each

call of the add_page_to_active_list() function. The pseudo code of the

remove_from_zone() function is shown in Figure 3.7(b). First, we also get the block

number of this page to calculate the corresponding zone. Second, the dirty page

numbers decreases by one. If the former block and latter block of this page do not

record in zone information table, when it removes this page, it reduces a segment.

Therefore, segment numbers of zone information table decreases by one. If the former

block and latter block of this page both record in zone information table, when it

removes this page, the original segment divide into two segments. Therefore, segment

(34)

In Linux, when the system writes back the data in memory into the disk, the system

wakes up the Pdflush thread to call background_writeout(). In background_writeout(),

we change the original function (writeback_inodes()) into writeback_segment_zone()

which selects a zone to write back. It is shown in Figure 3.7(c). First, we select the

zone with largest average segment length. Second, we traverse all page lists recorded

in zone information table to write back all pages. Lastly, if the number of written-back

pages is greater than or equal to the number of demand for write-back pages, it

finishes. If not, it selects the next zone to write back.

(35)

Figure 3.7 Pseudo Code of Segment/Zone Algorithm

/* Adding a page to the zone info. table */ add_to_zone( page ){

get page’s block number;

zone_number = page_block_number / pages_per_zone; zone_information_table[zone number].dirty_pages++;

if (a new segment is created for this page) zone.segment++;

ASL = zone. dirty_pages / zone.segment }

(a)

/* removing a page from the zone info. table */ remove_from_zone( page ){

get page’s block number;

zone_number = page_block_number / pages_per_zone; zone_information_table[zone number]. dirty_pages --; if (a segment is deleted due to the removal of the page) zone.segment--;

if (a new segment is produced due to the removal of the page) zone.segment++;

ASL = zone. dirty_pages / zone.segment }

(b)

/* Segment-zone writeback algorithm*/

writeback_segment_zone(writeback_control wbc){ begin : select the zone with largest average segment length;

traverse the all segment’s page list of the zone to writeback all pages; if (number of pages written back >= wbc.nr_to_write)

finish; else writeback_segment_zone( wbc ); goto begin; } (c)

(36)

As mentioned in Section 3.1.3, we provide a transaction API for file systems the

require transaction support. Figure 3.8 shows the pseudo code of the major function

implementations, transaction_start(), transaction_stop() and duplicate(), in the API.

Transaction_start() firstly creates a transaction data structure trans, links this trans

into current process. Lastly, it inserts this trans into the global transaction list.

Transaction_stop() firstly gets a transaction data structure trans from current process,

clears the pointer of current process that points to this tans. Lastly, it frees all

duplicated data (replica) of this trans, and removes this trans form global transaction

list. Duplicate() firstly also gets a transaction data structure trans from current process,

creates a replica data structure to store the duplicated data, and inserts this replica into

corresponding trans. Lastly, it duplicates the data into this replica.

To demonstrate the effectiveness of the API, we augmented the ext2 file system to

leverage the API. We inserted the function pair transaction_start() and

transaction_stop() in all ext2 file system operations such as ext2_create(), ext2_link(), ext2_mkdir(), ext2_unlink(), ext2_rmdir(), etc. Moreover, we inserted the invocation

of duplicate() in functions that modify metadata and data such as ext2_new_inode(),

ext2_free_inode(), ext2_new_block() and ext2_free_blocks(), etc. We ensure the

invocation of the duplicate() function is right before the modification of metadata or

(37)

Figure 3.8 Implementation of the Transaction API (Pseudo Code)

/* Creating a transaction and inserting it to the transaction list */ transaction_start( ){

create a transaction data structure; link this trans into current process; insert this trans into transaction list; }

(a)

/* removing a transaction from the transaction list */ transaction_stop( ){

get a trans from current->journal_info; // journalling filesystem info set current->journal_info as NULL;

free all replicas in this trans; free this trans from transaction list; }

(b)

/* duplicate metadata or data*/

duplicate(buffer_head *bh, size_t size){ get a trans from current process; create a replica data structure;

insert this replica into replica list of trans; copy data from buffer_head *bh to this replica; }

(38)

3.2.2 Integration of Three Mechanisms

Three mechanisms can operate independently, and also can combine the

corresponding both of three mechanisms, even can integrate three mechanisms

together. We integrate the proposed three approaches in Linux 2.6.12. The overall

structure of three approaches is shown in Figure 3.9.

Figure 3.9Integration of the TempFFS, Intelligent Write Back and File Operation Transaction Support

We create all files in TempFFS, the other file operations, such as read, write, and

delete use the file operations of Ramfs if this file is placed in TempFFS and use file

operations of original file system if this file is transformed into the file system. When

it needs flush the dirty pages into the disk, it uses intelligent write back policy (relo).

Moreover, we inserted the invocation of the transaction API into both ext2 file system

and TempFFS to maintain their consistency.

The performance results of different combinations of the three approaches are

shown in Chapter 4.

TempFFS

Disk

Intelligent write-back pages to disk

Transform pages to file system Other file operations

Create file

VFS

File system

Transactions

(39)

Chapter 4 Performance Evaluation

In this chapter, we evaluate the performance of the three proposed mechanisms.

Section 4.1 describes the experimental environment and all the configurations under

performance comparison. Section 4.2 presents the performance improvements of

TempFFS. In Section 4.3, we compare the performance of various write-back policies

mentioned in Section3.1.2. Section 4.4 shows the performance and memory overhead

of the lightweight transaction support mechanism on file system operations. Finally,

we present the performance results of all combinations of the three proposed

mechanisms in Section 4.5.

4.1 Experimental Environment and Configurations

Table 4.1 shows the experimental environment. Since large capacity MRAM is not

generally available in the market, and the performance characteristics of DRAM and

MRAM are comparable, we use DRAM to emulate MRAM. We evaluate the

performance of the proposed mechanisms under two popular benchmarks.

Bonnie++ [5] is a micro-benchmark that measures the performance of single file

access. Three kinds of tests in Bonnie++ are performed, character_write, block_write,

and rewrite. The character write test writes a 2GByte file sequentially in a

character-by-character manner. The block write test writes a 2GByte file in a (several

bytes per block) block-based way. The rewrite test reads the existed block of file and

modifies it, then writes file by block-based. Postmark [22] is a macro-benchmark that

emulates the access pattern of an email server. It creates many files whose size

between Max. Size and Min. Size which are defined by users, and then operates the

assigned number of transactions which may be create/delete or read/append, lastly

(40)

files from 5k to 30k and the file size ranging form 512 bytes to 10 Kbytes. For the

other parameters, we use the default settings of Postmark.

Moreover, we also measure the performance under the execution of real application

such as untarring and compiling Linux kernel. We untar a package that contains the

source code and object files of Linux 2.6.12, and then compiling the kernel.

Table 4.1 Evaluation Environment

CPU AMD Athlon 64 3000+

Memory 1 GB DDR 400

Hardware

Disk Maxtor 80G 7200 RPM

OS Linux 2.6.12

Software

Workloads Bonnie++ 1.03a, Untar,

Make, Postmark 1.5

Table 4.2 shows all the experimental configurations under performance comparison.

In the first two configurations, the original ext2 and ext3 file systems are used. For

ext3, we use the Journal mode since it is the only one that provides data integrity. The

mext3 configuration is the same as ext3 except that it improves the performance of

ext3 by placing the logs in a ramdisk residing on MRAM. The last seven

configurations represent various ways of combinations of the three proposed

(41)

Table 4.2 Experimental Configurations

Configurations Description

Ext2 Ext2 file system

Ext3 Journal mode of ext3 file system

Mext3 Ext3 with logs on MRAM

Ext2_Trans Lightweight transaction support on file system operations

TempFFS Temporary-File file system

WB Intelligent write-back policy

TempFFS_Ext2_Trans Temporary-File file system + transaction support

WB_Ext2_Trans Intelligent write-back + transaction support

TempFFS_WB Temporary-File file system + intelligent write-back

TempFFS_WB_Ext2_Trans Temporary-File file system + intelligent write-back + transaction support

(42)

4.2 The Performance Results of TempFFS

0 5000

10000

15000

20000

25000

30000

35000

character

write

block write

rewrite

Thr

oughput

(

K

byt

es

/s

ec

)

Ext2

Ext2 with

TempFFS

Figure 4.1 Performance Improvements of TempFFS (Bonnie++)

In this section, we present the performance improvements achieved from TempFFS

by comparing the performance of the ext2 file system with and without TempFFS

under different workloads. In the first experiment, we compare the performance under

Bonnie++. The values of the parameters, including the file size and the block size, are

the same as those in Section 4.1.

Figure 4.1 shows the results. From the figure, we can see that TempFFS does not

result in noticeable performance improvements. This is mainly because the size

limitation of a file in TempFFS. As mentioned in Section 3.1.1, a file in TempFFS is

transformed to it original file system if its size exceeds the size limitation (currently,

1Mbytes). Thus, the 2GByte file is transformed to ext2 soon after its creation and

(43)

0

20

40

60

80

100

120

140 untar

make

E

x

ecu

ti

o

n

t

im

e (secs)

Ext2

Ext2 with

TempFFS

Figure 4.2 Performance Improvements of TempFFS (Untar and Compile Linux Kernel)

In the second experiment, we measure the performance of TempFFS under Linux

kernel untarring and compilation. Figure 4.2 shows the results. From the figure, we

can see that TempFFS degrades the performance of ext2 by 18% under the untar

workload. Although untaring Linux kernel produces a significant number of small

files, they are never be deleted. Therefore, the files are just first placed in TempFFS,

and then transformed to their original file systems. No IO traffic can be saved.

Moreover, the file creation does not result in a large degree of fragmentation, and thus

TempFFS can seldom help in this workload. Instead, it degrades the performance due

to the file transformation overhead.

For the make workload, the presence of TempFFS does not have a noticeable

impact on the performance of ext2. This is because the workload is CPU-bound.

Therefore, make needs not produce I/O operations to create object files because our

(44)

0

50

100

150

200

250 5000

10000

15000

20000

25000

30000

Files

Exe

cut

io

n t

im

e (

se

cs

)

0%

10%

20%

30%

40%

50%

60%

70%

80%

Pe

rf

or

m

anc

e I

m

pr

ove

m

en

t

Ext2

Ext2 with TempFFS

Figure 4.3 Performance Improvements of TempFFS (Postmark)

0% 20% 40% 60% 80% 100% P er ce n ta ge s of Fi le s ( % ) 5000 10000 15000 20000 25000 30000 Files

Deleted in TempFFS Deleted in Ext2

(45)

0 10000 20000 30000 40000 50000 60000 70000 80000 90000 5000 10000 15000 20000 25000 30000 Files IO T ra ffi c (re qu es ts ) 0% 20% 40% 60% 80% 100% 120% R educ tion o f IO T ra ff ic

Ext2 Ext2 with TempFFS Reduction of IO traffic

Figure 4.5 Reduction of IO Traffic with TempFFS

0 2000

4000

6000

8000

10000

12000

14000

5000 10000 15000 20000 25000 300

00 Files

A

v

er

ag

e F

il

e S

p

an

(

b

lo

ck

s)

Ext2

Ext2 with

TempFFS

Figure 4.6 Average File Span (Postmark)

In this experiment, we measure the performance improvements of TempFFS under

Postmark. In this experiment, 200k transactions were performed and the file size

ranges from 512 bytes to 10 Kbytes. We measured the performance under various

(46)

improves the system performance. Specifically, the performance improvement ranges

from 34% to 69%. This is because a number of files have been deleted before they are

transformed to the file system, reducing both the I/O traffic and the degree of file

fragmentation. We demonstrate this in the following experiments.

As mentioned before, Postmark deletes all the files at the end of its execution.

Figure 4.4 shows the percentage of the number of files deleted in TempFFS and ext2,

with the presence of TempFFS. As shown in the figure, at least 44% of the files are

deleted in TempFFS. In the cases of 5000 and 10000 files, all files are deleted in

TempFFS because the capacity of TempFFS is enough to contain all the files. When

the number of files increases further, a number of files are transformed to ext2,

because of memory pressure, and finally deleted in ext2. For each file deleted in

TempFFS, all its file operations are done in memory and involve no disk IO. Figure

4.5 shows the reduction of IO traffic with the presence of TempFFS. In the cases of

5000 and 10000 files, nearly 100% of the IO traffic can be eliminated since almost all

file operations are done in TempFFS. For the other cases, about 31% of the IO traffic

can be eliminated. Moreover, we show that TempFFS can reduce the degree of file

fragmentation, which is evaluated by using the file span, the distance between the first

block and the last block of a file. During the execution of Postmark, we record the file

span of each file upon the deletion of the file. Figure 4.6 shows the average file span

of all the files. As shown in the figure, the degree of file fragmentation is largely

reduced. Especially, in the cases of 5000 and 10000 files, the average file span is zero

because all the files are deleted in TempFFS and do not have corresponding blocks on

(47)

4.3 The Performance Results of Intelligent Write-Back Policy

In this section, we compare the performance of the four write-back policies, file,

recency, location, and relo. The file policy is the file based policy used in Linux. It

scans the list of dirty files and writes back the dirty pages of each dirty file. The

recency policy only writes back dirty pages in the inactive list of the Linux page

cache. The location policy selects the zone with the maximum ASL value and writes

back the dirty pages within the selected zone. Finally, the proposed relo policy

combines the location and the recency policies and writes back the inactive dirty

pages in the selected zone.

0 5000

10000

15000

20000

25000

30000

35000

40000

45000

character

write

block write

rewrite

T

hr

oughput

(

K

byt

es

/s

ec

)

file

recency

location

relo

Figure 4.7 Performance Comparison of the Write Back Polices (Bonnie++)

From Figure 4.7 shows the performance comparison of the four policies under the

Bonnie++ benchmark. The values of the parameters, including the file size and the

block size, are the same as those in Section 4.1. As shown in the figure, the

(48)

the original write back policy in Linux. As mentioned in Section 3.1.2, this is because

the Linux write back policy does not consider the recency and disk location

information of the dirty pages, resulting in more redundant page write traffic and

longer seek and rotation delay. The proposed relo policy considers both the recency

and the disk location information and outperforms the original Linux policy by 31%

to 34% under the Bonnie++ benchmark.

0

20

40

60

80

100

120 untar

make

Exe

cut

ion t

im

e (

se

cs

)

file

recency

location

relo

Figure 4.8 Performance Comparison of the Write Back Polices (Untar and Compile Linux Kernel)

Figure 4.8 shows the performance comparison of the four write back policies under

Linux kernel untarring and compilation. As shown in figure, all the latter three

policies perform better than the original one in Linux, and the proposed relo policy

achieve the best performance among the four. Specifically, relo outperforms the file

policy by 23% in the untar workload. In the make workload, the performance

(49)

0 20 40 60 80 100 120 140 160 5000 10000 15000 20000 25000 30000 Files Ex ec ut io n t im e (s ec ) 0% 10% 20% 30% 40% 50% 60% 70% Pe rf or m anc e Im pr ov em en t

file zone segment zone_segment

zone/file segment/file zone_segment/file

Figure 4.9 Performance Comparison of the Different Locative Write Back Polices

Figure 4.9 shows three different locative write back policies, zone-based which writes back a region of disk and segment-based which writes back a longest

contiguous blocks of disk and zone/segment-based which mentioned in Section 3.2.

The performance improvement of zone/segment-based has about 58% and is best. The

zone-based is better than segment-based because zone-based can save the seek time of

disk and segment-based can save the rotation delay.

0 20 40 60 80 100 120 140 160 5000 10000 15000 20000 25000 Files Exe cut io n t im e ( se cs ) 0% 10% 20% 30% 40% 50% 60% 70% 80% P erfo rm an ce Im p ro v em en t

file recency location relo recency/file location/file relo/file

(50)

Figure 4.10 shows the performance comparison of the four write back policies

under Postmark. In this experiment, 200k transactions were performed and the file

size ranges from 512bytes to 10Kbytes. We measured the performance under various

numbers of files. In this figure, the bars denote the execution time of the Postmark

and the curves denotes the performance improvements of the latter three policies over

the file policy.

As shown in the figure, the performance improvements of the recency, location,

and relo policies are about 62%, 58% and 67%, respectively, as the file numbers are

lager than 15000. In the case of 5000 files and 10000 files, there is no obvious

performance difference among the four policies. This is because the working set is

smaller than the memory size, and hence the write back procedure is seldom

triggered. 0 200 400 600 800 1000 1200 1400 15000 20000 25000 30000 Files A cc u m u la te In te rre q u es ts D is ta nc e ( m il li on b loc ks )

file zone segment zone_segment(location) recency relo

Figure 4.11 Accumulate Inter-requests Distance (Postmark)

Figure 4.11 shows the accumulate inter-requests distance which is the accumulate

blocks of all inter-requests. The less of accumulate inter-requests distance means the

more contiguous write-back blocks. The zone/segment-based has the least accumulate

(51)

0 10000 20000 30000 40000 50000 60000 70000 15000 20000 25000 30000 Fils R epe at ed W ri te ba ck B loc ks

file location recency relo

Figure 4.12 Repeated Write Back Blocks (Postmark)

Figure 4.12 shows the total number of repeated write back blocks. In this

experiment, the recency and relo has the less repeated write back blocks because they

avoid write back the blocks which used recently.

4.4 The Performance Results of Transaction Support on Ext2

0 5000

10000

15000

20000

25000

30000

35000

character

write

block write

rewrite

T

hr

oughput

(

K

byt

es

/s

ec

)

Ext3

Mext3

Ext2

Ext2_Trans

(52)

transaction API to support atomic file operations. In this Section, we compare the

performance of the augmented ext2 with that of ext3. For ext3, we use the journal

mode since the augmented ext2 can ensure both file system consistency and data

integrity. Moreover, we use two versions of ext3. One places the log in a 256 MB disk

partition (ext3), while the other places the log in a 256 MB ramdisk (mext3). Finally,

the performance of the original ext2 is also presented for the evaluation of the runtime

overhead of the transaction API. Note that ext2 supports neither file system

consistency nor data integrity.

Figure 4.13 shows the performance comparison under the Bonnie++ benchmark. As

shown in the figure, ext2 with transaction support results in the best performance

among the three file systems that ensure file system consistency. This is because it

only duplicates data and metadata in memory during the file operations and does not

involve any disk I/O. It outperforms the ext3 by 63% and the mext3 by 31%. The

performance of ext3 is the worst since the journal mode of Ext3 logs both metadata

and data updates on the disk, it requires a significant number of extra disk IO. Mext3

eliminates some journaling IO traffic. However, the journaling IO traffic is still

required once the journal space is full. Besides, the in-memory journal space of mext3

also occupies the 256MB capacity of the main memory such that the physical memory

decreases a lot. The performance results between ext2 and ext2 with transaction

support are almost the same because our transaction support does not produce I/O

(53)

0

20

40

60

80

100

120

140 untar

make

E

x

ecu

ti

o

n

ti

m

e (secs)

Ext3

Mext3

Ext2

Ext2_Trans

Figure 4.14 Untar and Compile Linux Kernel in transaction support

Figure 4.14 shows the performance comparison under Linux kernel untarring and

compilation. Similar to the results in Figure 4.10, ext2 with transaction support

always results in the better performance than ext3 and mext3. For the untar case, the

performance difference is smaller than that under Bonnie++. This is because untarring

the Linux kernel creates many files but does not modify and delete them later, and

then it produces journal data less than Bonnie++ or Postmark. Therefore, ext2 with

transaction support outperforms the ext3 by 17% and the mext3 by 9%. Besides, the

execution time of ext2 transaction is greater than the ext2 a little. For the make case,

since make application is CPU-bound, the performance results of them are almost the

在基於非揮發性隨機存取記憶體系統上提供改善作業系統效能的機制

國立交通大學

資訊科學與工程研究所

碩 士 論 文

在基於非揮發性隨機存取記憶體系統上提供改

善作業系統效能的機制

Operating System Support on NVRAM-Based Systems

研 究 生：蕭智文

指導教授：張瑞川 教授

在基於非揮發性隨機存取記憶體系統上提供改

善作業系統效能的機制

學生：蕭智文 指導教授：張瑞川教授

國立交通大學資訊科學與工程研究所

論 文 摘 要

Operating System Support on NVRAM-Based

Systems

Student: Chih-Wen Hsiao Advisor: Prof. Ruei-Chuan Chang

Computer Science and Engineering College of Computer

Science

National Chiao Tung University

Abstract

致謝

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

Chapter 1 Introduction

1.1 Motivation

1.2 Our Three Mechanisms

1.3 Structure of the Thesis

Chapter 2 Related Work

2.1 System Recovery

2.2 NVRAM as Storage Device

2.3 NVRAM as Buffer

2.4 Transaction Supporting

Chapter 3 Design and Implementation

3.1 Background

VFS

TempFFS

100

155

188

101

102

156

Zone 5

Zone 4

Zone 6

ASL = 1.75

ASL = 3.5

Transaction list

…

Process

3.2 Implementation and Integration of the Approaches

TempFFS

Disk

VFS

File system

Chapter 4 Performance Evaluation

4.1 Experimental Environment and Configurations

Configurations Description

4.2 The Performance Results of TempFFS

0

5000

10000

15000

20000

25000

30000

35000

character

write

block write

rewrite

Thr

oughput

(

K

byt

es

/s

碩士論文

研究生：蕭智文

指導教授：張瑞川教授

學生：蕭智文指導教授：張瑞川教授

論文摘要