Chapter 1 Introduction
1.3 Structure of the Thesis
The remainder of this thesis is organized as follows. Chapter 2 reviews related work on NVRAM. Chapter 3 presents the design and implementation details of the proposed mechanisms. Chapter 4 shows the performance results. Finally, Chapter 5 gives our conclusions.
Chapter 2 Related Work
In this chapter, we review research on NVRAM. Section 2.1 introduces work that exploits NVRAM to recover a system after a crash. Section 2.2 covers work that uses NVRAM as a storage device to improve file system performance. Section 2.3 covers work that uses NVRAM as a buffer to reduce disk IO, especially write operations. Section 2.4 introduces work on providing file system consistency.
2.1 System Recovery
Ren Ohmura [33] at Keio University exploits the characteristics of NVRAM for system recovery. They propose a scheme to recover the state of peripheral devices in NVRAM systems so that the system can resume execution after an unpredictable power failure. They record all messages between the CPU and the devices in NVRAM, and after a power failure the system re-sends the recorded messages to restore the devices to their previous state.
Harp [25] records all file updates in server nodes. Files on an individual node can survive a failure because file operations are logged in memory at several nodes.
The Recovery Box [2] stores the system state in NVRAM and protects the region holding that state from being overwritten when the system crashes. After a crash, it uses the protected state to recover the system.
Rio [8] enables data in memory to survive operating system crashes and power outages. It uses write protection to protect files in the file cache so that the file cache is not accidentally overwritten while the system is crashing. Therefore, the files in Rio are persistent and safe when the system crashes.
2.2 NVRAM as Storage Device
Because flash memory is cheap and non-volatile, more and more file systems designed for flash memory have been proposed, such as JFFS2 [44] and Microsoft Flash [24]. However, flash memory has some limitations. First, a block must be erased before it can be written. Second, each block endures only a finite number of erase-write cycles. Therefore, flash file systems usually write data with out-of-place updates and use wear leveling to keep the number of erase-write cycles similar across blocks.
To speed up writes in flash systems, eNVy [45] uses a small amount of battery-backed SRAM as a write buffer. It uses copy-on-write to copy the corresponding data from flash into SRAM and then modifies the data in SRAM. It writes the data back to flash memory when the SRAM is full. Hwan Doh [11] also exploits non-volatile memory to enhance the performance of a flash file system. They propose a flash-based file system that stores all metadata in NVRAM and all file data in flash memory. The advantages of using NVRAM as a metadata store are that the mount time of the flash file system is reduced to a minimum and that metadata access is faster.
MRAMFS [12] is a prototype in-memory file system that puts all data and metadata in NVRAM. However, the amount of NVRAM may not be enough to hold a large number of files. Therefore, they use compression to reduce the occupied space, with different compression methods for metadata and data because metadata often has a fixed format. They save about 60% of the space for metadata and about 40%~60% for file data.
There are some recent works on hybrid disk/NVRAM file systems, such as the HeRMES file system [31] and the Conquest file system [42]. HeRMES observes that metadata is frequently modified by file system requests. Therefore, it suggests using compression techniques to minimize the amount of memory required for metadata and placing all metadata in NVRAM to improve the performance of file system requests. Conquest assumes that the system has a sufficient amount of NVRAM. Therefore, it stores all small files and metadata in NVRAM, and the disk holds only the data content of the remaining large files. This has two advantages. It avoids the overhead of accessing small files and metadata on disk, and, because only large files reside on disk, it can optimize their layout to reduce disk fragmentation.
The above works have some disadvantages. First, they place almost all metadata in NVRAM, but the space occupied by metadata and data keeps growing as users create files. Second, although metadata is frequently accessed in a file system, not all metadata is frequently accessed. Placing all metadata in NVRAM therefore leaves metadata that has not been used recently occupying NVRAM space, which degrades performance.
2.3 NVRAM as Buffer
In addition to serving as a storage device, NVRAM is commonly used as a write buffer.
eNVy [45], mentioned in Section 2.2, uses a small amount of battery-backed SRAM as a write buffer to improve the performance of write operations in flash. Mary Baker [1] reports that an NVRAM write buffer can reduce disk accesses by about 20% on most file systems, and by about 90% on one frequently accessed file system.
Theodore R. Haining [19] mentions that non-volatile write caches provide two benefits: some writes are avoided because dirty blocks are overwritten in the cache, and physically contiguous dirty blocks can be grouped into a single I/O operation. They also present several write-back strategies for managing a non-volatile write buffer, such as least recently used (LRU), shortest access time first (STF), and largest segment per track (LST), and find that a write buffer can eliminate a large number of write requests and thus improve system performance.
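The two benefits of a non-volatile write cache can be illustrated with a minimal sketch. The class `WriteBuffer` and its methods are hypothetical names chosen for this illustration and are not taken from [19]; the sketch only assumes that the buffer keeps the latest copy of each dirty block and groups consecutive block numbers into one request at flush time.

```python
class WriteBuffer:
    """Toy non-volatile write buffer keyed by disk block number (illustrative only)."""

    def __init__(self):
        self.dirty = {}           # block number -> latest buffered data
        self.writes_received = 0  # number of write requests seen so far

    def write(self, block, data):
        # A later write to the same block overwrites the buffered copy,
        # so only one disk write is eventually needed for that block.
        self.writes_received += 1
        self.dirty[block] = data

    def flush(self):
        """Return the I/O requests, each covering a run of contiguous blocks."""
        requests = []
        for block in sorted(self.dirty):
            if requests and block == requests[-1][-1] + 1:
                requests[-1].append(block)  # extend the current contiguous run
            else:
                requests.append([block])    # start a new I/O request
        self.dirty.clear()
        return requests


buf = WriteBuffer()
for block in [7, 3, 4, 7, 5, 3, 20]:
    buf.write(block, b"x")
io_requests = buf.flush()
print(buf.writes_received, "writes ->", len(io_requests), "I/O requests:", io_requests)
# prints: 7 writes -> 3 I/O requests: [[3, 4, 5], [7], [20]]
```

In this example, two of the seven writes are absorbed by overwriting, and the five remaining dirty blocks are submitted as three requests.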
Robert Y. Hou [20] exploits non-volatile memory to improve the performance of RAID5. For each write request, RAID5 needs to execute a “read-modify-write”: a single-block write requires the old data block and the old parity block to be read and combined with the new data to generate the new parity block, and then the new data and new parity are written to their respective locations. Read-modify-writes reduce the performance of RAID5 arrays because each write request needs four disk accesses. Therefore, they use non-volatile memory as the write buffer of RAID5 to improve the performance of write operations.
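The parity update behind a read-modify-write can be sketched as follows. This is a generic illustration of XOR parity and not code from [20]; the function names and the two-byte blocks are assumptions made for the example. The new parity equals the old parity XOR the old data XOR the new data, which is why two reads and two writes are needed.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_modify_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Compute the new parity from the two blocks that were read and the new data."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# A stripe with three data blocks and one parity block.
stripe = [b"\x0f\x10", b"\xa0\x01", b"\x33\x44"]
parity = xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])

# Overwrite the second data block: read old data and old parity (2 reads),
# then write new data and new parity (2 writes).
new_block = b"\xff\x00"
new_parity = read_modify_write(stripe[1], parity, new_block)

# Cross-check against recomputing the parity over the whole updated stripe.
recomputed = xor_blocks(xor_blocks(stripe[0], new_block), stripe[2])
print(new_parity == recomputed)
```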
The above research focuses on using a write buffer to improve write operations. Alex Batsakis [3] points out that read operations may depend on write operations, because buffered dirty pages occupy memory that could be used for read caching. They address this problem by partitioning memory between write buffering and read caching and by writing dirty pages to disk opportunistically before the operating system submits them for write-back. They also write back dirty pages that are nearly adjacent, but they do not consider whether the dirty pages have been recently updated.
Because the capacity of MRAM is increasing continuously, it may replace DRAM as the main memory of computing systems in the future. We not only use a non-volatile write buffer to delay writes, but also use a better write-back policy to improve the performance of file operations.
2.4 Transaction Support
Traditionally, file system consistency has been maintained by using synchronous writes to enforce the proper ordering of metadata updates, but this approach degrades file system performance because the progress of metadata updates is bounded by the disk speed. Soft updates [30] eliminates the need for synchronous disk I/O. It is an implementation mechanism that enforces the dependencies among metadata updates and allows metadata to be cached and written back later.
The log-structured file system [39] proposed by Mendel Rosenblum treats the file system as a segmented log and always writes all modified data blocks and metadata to the end of the log. File system changes are buffered in the cache and then written to the disk sequentially in a single disk IO operation. This improves the performance of write operations, but it cannot guarantee that all related metadata is written in a single write operation, so if a crash happens while a disk operation is in progress, the file system is left in an inconsistent state.
Journaling [35][44][47] is nowadays a widely used technique for file system consistency. It logs metadata and data updates to stable storage before the updates are performed on the disk. Hence, it produces extra journaling IO traffic, which has a critical impact on system performance.
Kevin M. Greenan [17] introduces two approaches to reliably storing file system structures in NVRAM. First, they strengthen memory consistency by using page-level write protection and error-correcting codes. Second, they periodically invoke an online consistency checker that replays all transaction logs to check for file system inconsistency. If an inconsistency is found, the state of the file system is recovered immediately. However, this approach needs to replay all transaction logs periodically even if the file system is healthy and has not experienced any failure.
Henry Mashburn [40] proposes recoverable virtual memory (RVM), a simple user-level library that handles atomic file operations and data persistence. First, it copies the range of memory to be updated to the undo log in memory, then it updates the data, and lastly it writes the updated data to the redo log on disk. Therefore, it needs three copy operations for each file operation.
Vista [27], proposed by David Lowell, is a simple user-level library that runs on Rio, mentioned in Section 2.1. Because Rio makes files in memory persistent, Vista can eliminate the redo log to avoid the associated disk operations and uses only an undo log to ensure that file operations are atomic. However, it depends on Rio, and because it is a user-level library, Vista is not transparent to users.
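The undo-log idea used by RVM and Vista can be sketched as follows, under the assumption that the memory being updated is itself persistent (as with Rio), so no redo log is needed. The class `UndoLogTransaction` and its method names are hypothetical and do not reproduce the actual RVM or Vista interfaces.

```python
class UndoLogTransaction:
    """Toy undo-log transaction over a persistent memory region (illustrative only)."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.undo_log = []  # list of (offset, before-image) records

    def save_range(self, offset: int, length: int) -> None:
        # Copy the bytes that are about to be modified into the undo log.
        self.undo_log.append((offset, bytes(self.memory[offset:offset + length])))

    def commit(self) -> None:
        # The in-place updates are already persistent, so discard the undo log.
        self.undo_log.clear()

    def abort(self) -> None:
        # Restore the before-images in reverse order to undo the updates.
        for offset, old in reversed(self.undo_log):
            self.memory[offset:offset + len(old)] = old
        self.undo_log.clear()


mem = bytearray(b"hello world")

tx = UndoLogTransaction(mem)
tx.save_range(0, 5)
mem[0:5] = b"HELLO"
tx.abort()                 # the update is rolled back
after_abort = bytes(mem)

tx = UndoLogTransaction(mem)
tx.save_range(6, 5)
mem[6:11] = b"WORLD"
tx.commit()                # the update is kept
after_commit = bytes(mem)

print(after_abort, after_commit)
```

Each update costs one extra copy (the before-image), compared with the three copies described for RVM above.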
We propose a simple, lightweight transaction support mechanism for file system operations in an NVRAM environment. It requires adding only about 40 lines of code to the kernel and about 300 lines of code for the implementation. It also provides the same level of consistency as the journaling mode of Ext3.
Chapter 3 Design and Implementation
In this chapter, we describe the design and implementation of the proposed mechanisms. In Section 3.1, we first introduce the three mechanisms for improving the performance and ensuring the consistency of file systems on NVRAM based computer systems, namely Temporary-File File System (TempFFS), intelligent write-back policy, and transaction support on file system operations. In Section 3.2, we show the details of implementing and integrating the mechanisms and provide an analysis on the integration of the mechanisms.
3.1 Background
In this section, we describe the proposed NVRAM-based buffer cache management mechanisms, which include Temporary-File File System and intelligent write-back policy. Both mechanisms aim at improving the file system performance based on the non-volatility feature of main memory. Moreover, we also describe a lightweight transaction support mechanism on file system operations, which takes advantage of the non-volatility feature of main memory for ensuring the consistency and data integrity of the file system.
3.1.1 Temporary-File File System (TempFFS)
The first goal of TempFFS is to reduce the fragmentation of the underlying file systems. With numerous concurrent file creation, deletion, and appending activities, a file system easily becomes fragmented, which leads to performance degradation.
Moreover, according to the previous studies [37][38][46], many files are short-lived, meaning that they are deleted soon after their creation. Allocating disk space for these files, which involves disk IO operations for reading the file system metadata (e.g.
To reduce file system fragmentation and unnecessary disk IO operations, some advanced file systems such as XFS [47] and ext4 support delayed allocation, which delays the disk block allocation of a newly created file until the data needs to be flushed to the disk due to memory pressure or sync operations. However, the delayed allocation feature is not shared among all file systems. Only the file systems that implement the feature can benefit from it.
Instead of integrating the delayed allocation feature into a specific file system, we implement a RAM-based file system named TempFFS in order to apply the feature to existing file systems such as ext3 and NTFS at the same time. Based on the concept of stackable file systems, TempFFS sits between the VFS (virtual file system) and the file system implementations and is transparent to the latter, as shown in Figure 3.1. All new files are initially written to TempFFS and are associated with their original file systems when they are created. TempFFS uses the page cache as the file store, and the files are transferred to their corresponding file systems upon memory pressure or sync operations. In this way, existing file systems can benefit from delayed allocation without code modifications. Note that a file can stay in TempFFS for a long time. This raises the risk of data loss if the main memory is volatile. On systems with non-volatile main memory, however, memory data can survive power failures. For ease of implementation, TempFFS was implemented by modifying the code of an existing RAM file system (i.e., RamFS [34]).
Figure 3.1 Architecture of the Temporary-File File System
TempFFS stores files in kernel memory, which cannot be paged out in traditional UNIX operating systems (including Linux). Upon memory pressure, an OS usually writes back the dirty pages that belong to the buffer cache or user processes to the storage device so as to release more memory space. In this situation, TempFFS checks whether its size is larger than a specific threshold. If it is, TempFFS shrinks its size by evicting pages of the least recently used files. All the evicted files are transformed into their original file systems so that the corresponding data can be written back. In addition, we transform files whose sizes are larger than a specific threshold (currently, 1MB) for the following two reasons. First, according to previous research [37][38][46], most short-lived files are small, so keeping small files in TempFFS is enough to save the IO traffic of short-lived files. Second, creating a huge file may cause a large number of short-lived small files to be transformed before they are deleted, reducing the benefit of delayed allocation.
We manage the files in TempFFS in an LRU list. The number of pages that should be evicted from TempFFS, say N, is proportional to the number of pages in TempFFS. Specifically, N is calculated according to the following equation:
N = NR_WB * NR_TempFFS / NR_Dirty,
where NR_WB represents the target number of pages that need to be written back, NR_TempFFS represents the number of (dirty) pages in TempFFS, and NR_Dirty represents the number of dirty pages in the system. As shown in Figure 3.2, transforming a file involves the following three steps.
Figure 3.2 Transformation Steps
First, the file create operation of the original file system is invoked to produce the metadata (inode) of the file. Second, several inode fields, such as timing information, access rights, and file size, are copied to the new inode. Third, a sequence of disk block allocation operations of the original file system is invoked to allocate the disk space for the file. Because the operations are invoked consecutively, the resulting data blocks tend to be contiguous. After the allocation, the data is associated with the allocated blocks and the metadata in TempFFS is deleted.
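The eviction count N = NR_WB * NR_TempFFS / NR_Dirty defined earlier in this section, together with the LRU selection of files to transform, can be sketched as follows. This is an illustration only and not the kernel implementation: the function names are hypothetical, rounding N down to an integer is an assumption, and so is evicting whole files until at least N pages have been selected.

```python
from collections import OrderedDict


def pages_to_evict(nr_wb: int, nr_tempffs: int, nr_dirty: int) -> int:
    """N = NR_WB * NR_TempFFS / NR_Dirty, rounded down (rounding is an assumption)."""
    if nr_dirty == 0:
        return 0
    return nr_wb * nr_tempffs // nr_dirty


def select_victims(lru_files: "OrderedDict[str, int]", n_pages: int) -> list:
    """Pick least recently used files until at least n_pages pages are covered.

    lru_files maps a file name to its page count, least recently used first.
    """
    victims, evicted = [], 0
    for name, pages in lru_files.items():
        if evicted >= n_pages:
            break
        victims.append(name)
        evicted += pages
    return victims


# Example: the system wants 1024 pages written back, TempFFS holds 2000 of
# the 8000 dirty pages, so TempFFS contributes a quarter of the target.
n = pages_to_evict(nr_wb=1024, nr_tempffs=2000, nr_dirty=8000)

lru = OrderedDict([("a.tmp", 100), ("b.tmp", 120), ("c.tmp", 80), ("d.tmp", 50)])
victims = select_victims(lru, n)
print(n, victims)
# prints: 256 ['a.tmp', 'b.tmp', 'c.tmp']
```

Each selected file would then go through the three transformation steps described above.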
3.1.2 Intelligent Write-Back Policy
Modern operating systems write back dirty pages periodically or when the number of free pages falls below a specific threshold (i.e., memory pressure). On systems with non-volatile main memory, dirty pages are already persistent and thus need not be written back to the disk periodically. Instead, they need to be written back only under memory pressure or upon sync operations. Currently, Linux uses a file-by-file write-back policy, which scans the list of dirty inodes and submits the dirty pages of each inode to the IO subsystem. The rationale behind this policy is to reduce the number of files that are not up to date on disk when a power outage or system crash occurs. Assume that 100 files are updated and each file has 10 dirty pages in memory. If the system crashes after 500 dirty pages are written back to disk, it would be better to have written all the dirty pages of 50 files than 5 dirty pages of every file.
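The 100-file example can be checked with a small simulation. The function below is a hypothetical illustration, not Linux code; it compares a file-by-file order with a page-interleaved order (one page per file per pass, an assumption made for contrast) and counts how many files are fully up to date on disk when the crash occurs.

```python
def up_to_date_files(num_files, pages_per_file, pages_written, file_by_file):
    """Count files whose dirty pages were all written before the crash."""
    remaining = [pages_per_file] * num_files
    budget = pages_written
    if file_by_file:
        # Write all dirty pages of one file before moving to the next file.
        for i in range(num_files):
            written = min(remaining[i], budget)
            remaining[i] -= written
            budget -= written
    else:
        # Interleaved order: write one page of each file per pass.
        while budget > 0 and any(remaining):
            for i in range(num_files):
                if budget == 0:
                    break
                if remaining[i] > 0:
                    remaining[i] -= 1
                    budget -= 1
    return sum(1 for r in remaining if r == 0)


by_file = up_to_date_files(100, 10, 500, file_by_file=True)
interleaved = up_to_date_files(100, 10, 500, file_by_file=False)
print(by_file, interleaved)
# prints: 50 0
```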
However, this policy may write back recently updated pages, which has two drawbacks. First, writing back such pages does not help to relieve memory pressure, since these pages will not be reclaimed by the page replacement policy. One purpose of writing back dirty pages is to reclaim the pages so as to maintain a reasonable number of free pages in the system. In Linux, all pages belonging to user processes and the page cache are grouped into two lists, the active list and the inactive list. The former includes pages that have been accessed recently, while the latter contains pages that have not been accessed for a period of time. The file-by-file policy may write back dirty pages in the active list. However, most pages in the active list will not be reclaimed, because they are used recently. Second, according to temporal locality, these pages will be marked dirty again soon after their write-back. Thus, writing back such pages is of little use, since they may need to be written back again soon. Some UNIX systems such as Solaris do not have this problem. They only write back dirty pages that have not been used recently.
A common problem of the write-back policies of existing UNIX operating systems (including Linux) is that they ignore the disk locations of the dirty pages when submitting the pages to their IO subsystems. Although an IO subsystem can sort the requests submitted to it, there may still be a significant amount of seek and rotational delay among the dirty pages.
In this thesis, we propose an intelligent write-back policy, which considers the recency as well as the disk locations of the dirty blocks to reduce the IO traffic, seek time, and rotational delay. To reduce the IO traffic, the proposed policy only writes back dirty pages in the inactive list.
To reduce the seek time, we divide a disk into a number of zones, each of which is a set of contiguous blocks on the disk, and write back dirty pages in a zone-by-zone manner.
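The combination of recency filtering and zone-by-zone write-back can be sketched as follows. This is only an illustration under stated assumptions and not the data structure used in the implementation: zones are assumed to have a fixed size, zones are visited in ascending order, and the names `zone_of` and `plan_write_back` are hypothetical.

```python
from collections import defaultdict


def zone_of(block: int, zone_size: int) -> int:
    """Map a disk block number to its zone (fixed-size zones are an assumption)."""
    return block // zone_size


def plan_write_back(dirty_pages, zone_size):
    """Group inactive dirty pages by zone.

    dirty_pages is an iterable of (disk_block, is_inactive) pairs.
    Returns a list of (zone, sorted block numbers), one entry per zone.
    """
    zones = defaultdict(list)
    for block, is_inactive in dirty_pages:
        if is_inactive:  # recently used (active) pages are skipped
            zones[zone_of(block, zone_size)].append(block)
    # Visiting zones in ascending order is an assumption of this sketch.
    return [(zone, sorted(blocks)) for zone, blocks in sorted(zones.items())]


pages = [(5, True), (1030, True), (7, False), (12, True), (1025, True), (2050, False)]
plan = plan_write_back(pages, zone_size=1024)
print(plan)
# prints: [(0, [5, 12]), (1, [1025, 1030])]
```

Blocks 7 and 2050 belong to active pages and are left in memory, while the remaining blocks are submitted one zone at a time so that consecutive requests stay close together on disk.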
The dirty page information is recorded in a set of identical data structures called zone