Chapter 2 Background
2.4 Linux Ext3 file system
The Linux Ext3 file system extends the Ext2 file system with metadata journaling. With metadata journaling, updated metadata is appended to a write-ahead log before the metadata and then the regular data are flushed to disk. After an unclean system shutdown, the information in the write-ahead log shortens the time needed to recover the file system to a consistent state.
For compatibility, the on-disk data structures of Ext3 are largely the same as those of Ext2. Thus an Ext2 file system can easily be mounted as an Ext3 file system, and vice versa. Also for compatibility, and unlike other journaling file systems, Ext3 saves the metadata itself as journal log records rather than the metadata operations.
Ext3 logs metadata (and optionally data) in a fixed-size file organized as a circular buffer, and it uses the journaling block device (JBD) layer to handle journaling. In addition, Ext3 provides three modes that let the administrator choose how metadata and data are journaled: (1) journal mode, (2) writeback mode, and (3) ordered mode.
In journal mode, all updated data and metadata are written twice: once to the journal and once to their final locations. This is the slowest but safest mode, because full data logging reduces the chance of data loss.
In ordered mode, only updated metadata is written to the journal file. Ordered means that regular data must be written to disk before the corresponding metadata, which provides stronger consistency. Ordered mode guarantees that both regular data and metadata are consistent after recovery from an unclean system shutdown.
In writeback mode, only updated metadata is written to the journal file, and metadata and data writes need not be ordered. This is the fastest of the three modes. Writeback mode guarantees only the consistency of file system metadata.
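The differences among the three modes can be summarized in a small table of policies. The following is a minimal sketch in Python, not Ext3 code: the names ModePolicy, POLICIES, and journal_blocks are hypothetical and introduced only for illustration, and the block counts in the usage loop are made-up inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePolicy:
    logs_data: bool             # is regular data written to the journal?
    data_before_metadata: bool  # must data reach disk before the metadata?


# All three modes log metadata; they differ only in the two fields above.
POLICIES = {
    "journal": ModePolicy(logs_data=True, data_before_metadata=False),
    "ordered": ModePolicy(logs_data=False, data_before_metadata=True),
    "writeback": ModePolicy(logs_data=False, data_before_metadata=False),
}


def journal_blocks(mode, data_blocks, metadata_blocks):
    """Number of blocks appended to the journal for one update."""
    policy = POLICIES[mode]
    return metadata_blocks + (data_blocks if policy.logs_data else 0)


# Hypothetical update touching 8 data blocks and 2 metadata blocks.
for mode in POLICIES:
    print(mode, journal_blocks(mode, data_blocks=8, metadata_blocks=2))
```

In this model only journal mode adds the data blocks to the journal traffic, which is why it is the slowest of the three.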
Ext3 uses a kernel daemon, kjournald, to flush journal data periodically. The interval between successive flushes must be chosen carefully to balance consistency against performance. A larger interval gives better performance but weaker consistency, while a smaller interval gives stronger consistency but worse performance (because of the extra small I/O operations).
Chapter 3
Related work
In this chapter, we discuss other schemes for improving file system consistency and performance: soft update, NVRAM, and the log-structured file system.
3.1 Soft update
Besides metadata journaling, another well-known scheme for improving file systems that use asynchronous writes is soft update [4] [5] [6] [13]. Soft update was originally used to improve the consistency of FFS [14] by guaranteeing the dependency ordering of metadata writes. It maintains fine-grained (i.e., field-based) dependency information. Like metadata journaling, soft update uses delayed writes for performance. When an updated block needs to be flushed to disk, soft update checks the dependency information of the block. If flushing the block would violate a dependency constraint, soft update rolls the affected contents of the block back to a safe state before the flush, which is possible because it maintains dependency information for each block.
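The roll-back step can be illustrated with a small model. This is a minimal sketch under simplifying assumptions, not the FFS implementation: the function flush_image, the dictionary layout, and the block names are hypothetical. Each field of a block carries its old value, its new value, and the name of the block that must be on disk first.

```python
def flush_image(block, on_disk):
    """Return the copy of `block` that is safe to write to disk now.

    block:   dict mapping field -> (old_value, new_value, depends_on), where
             depends_on names a block that must be written first, or None.
    on_disk: set of block names that have already been written.
    """
    image = {}
    for field, (old, new, depends_on) in block.items():
        satisfied = depends_on is None or depends_on in on_disk
        # Roll the field back to its old value if its dependency is unmet.
        image[field] = new if satisfied else old
    return image


# Hypothetical case: a directory entry must not reference an inode before
# the block holding that inode has been written.
dir_block = {"entry0": (None, "inode 12", "inode_block_3")}
print(flush_image(dir_block, on_disk=set()))              # {'entry0': None}
print(flush_image(dir_block, on_disk={"inode_block_3"}))  # {'entry0': 'inode 12'}
```

The in-memory block keeps the new value in both cases; only the image written to disk is rolled back, so the update is written in a later flush once the dependency is satisfied.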
3.2 Non-volatile RAM (NVRAM)
As mentioned above, synchronous metadata writes give stronger consistency but cause many time-consuming seeks, which results in unacceptable throughput. In contrast, asynchronous metadata writes give better performance but weaker consistency. To achieve both good performance and strong consistency, NVRAM [1] has been proposed to store the metadata, which is then flushed to disk in any order. Therefore, its performance is similar to that of asynchronous metadata updates. In case of a file system crash, the metadata in the NVRAM is not lost, and thus the file system can still remain in a consistent state. The drawbacks of this approach are its high cost and the extra memory copies of metadata that it requires.
3.3 Log-structured file system (LFS)
Currently, a file system can use logging in two ways: as the file system layout or as a file system enhancement. Unlike journaling file systems, which journal data and metadata as an enhancement, a log-structured file system [12] [15] [16] [18] uses logging as the file system layout.
A log-structured file system treats the whole disk as a circular log and always appends written data to the end of the log. This method is optimized for write operations because no time-consuming seeks are needed. The basic data structures in LFS, such as the inode, are similar to those of FFS. Thus, the read performance is also similar to that of traditional file systems. However, in read-after-write or write-after-read cases, the cost of writing a single disk block becomes unacceptable. LFS therefore divides the disk space into fixed-size chunks called segments. Data updates are delayed and collected until their total size reaches a full segment. At that time, the segment is flushed to the end of the log. Using segments amortizes the cost in read-after-write or write-after-read cases. Nevertheless, this approach requires a cleaner that collects the live (still valid) data in old segments when few empty segments remain for flushing. The jobs of the cleaner are: (1) select a candidate segment for cleaning according to the cleaning policy, (2) read the live data in the candidate segment, and (3) collect live data until its size reaches a full segment, and then append the segment to the log. These activities require a large number of I/O operations and thus degrade performance. The higher the disk utilization is, the more cleaning activities are needed.
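The three cleaner jobs can be sketched as follows. This is an illustrative model only: the function clean, the constant SEGMENT_BLOCKS, and the choice of a "fewest live blocks first" policy are assumptions made for the example, since the text does not fix a particular cleaning policy.

```python
SEGMENT_BLOCKS = 4  # hypothetical segment size, in blocks


def clean(segments, log):
    """Copy live blocks out of old segments into full segments on the log.

    segments: list of segments, each a list of (block_id, is_live) pairs.
    log:      list that receives each newly written full segment.
    Returns the live blocks still waiting to fill a segment.
    """
    pending = []
    # (1) Select candidates: assumed policy, fewest live blocks first.
    for seg in sorted(segments, key=lambda s: sum(live for _, live in s)):
        # (2) Read the live data of the candidate segment.
        pending.extend(block for block, live in seg if live)
        # (3) Once a full segment has been collected, append it to the log.
        while len(pending) >= SEGMENT_BLOCKS:
            log.append(pending[:SEGMENT_BLOCKS])
            pending = pending[SEGMENT_BLOCKS:]
    return pending


old = [
    [("a", True), ("b", False), ("c", False), ("d", False)],
    [("e", True), ("f", True), ("g", True), ("h", False)],
    [("i", True), ("j", True), ("k", False), ("l", False)],
]
log = []
print(clean(old, log), log)
```

In this example three old segments are read in order to write one new one, which illustrates why cleaning becomes more expensive as more of each segment is still live.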
Log-appended writing implies strong consistency semantics. Because LFS employs delayed writes and segment-based writes, it can control the order in which metadata and regular data are written. Thus the update dependency constraints mentioned in Section 2.1 can be satisfied.
Chapter 4
Journaling analyses
We analyze two aspects of metadata journaling: the effect of journal I/O traffic (Section 4.3) and the effect of the commit interval (Section 4.4). Based on the observed results, a remote journal scheme for improving file system performance is proposed in the next chapter.
4.1 Experimental Environment
Table 2 shows the detailed hardware and software configuration. On the SCSI disk, we use a partition that resides on the middle tracks of the disk, so that the zone effect does not distort the performance results.
CPU: Intel Pentium 4 1.6 GHz
Memory: 256 MB DDR RAM
Disk: Seagate ST336753LW/P, 15000 RPM, 3.9 ms average seek time, 49 ~ 75 Mbytes/s transfer rate
OS: Linux kernel 2.6.5
NIC: Accton EN-1216 10/100 Mb/s
Table 2 Experimental setup
The benchmark we chose is PostMark [10] version 1.5. PostMark is suitable for simulating environments dominated by small files, such as mail servers, news servers, and OLTP systems.
In each experiment, we execute the benchmark 10 times. Each run uses 150000 files and 20000 file system transactions, which include file create, delete, read, and append. The type of each transaction is chosen randomly. The size of the benchmark files ranges from 500 bytes to 1000 bytes, and we start each run with a cold cache (i.e., we reboot the system before each run).
4.2 Variants of the Ext3 file system
In the experiments, we evaluate three variants of the Ext3 file system: (1) the normal Ext3 file system, (2) the non-journal (NJ) Ext3 file system, and (3) the remote-journal (RJ) Ext3 file system.
The normal Ext3 file system serves as the baseline for comparison with the NJ-Ext3 and RJ-Ext3 file systems.
To observe how metadata journaling affects performance, we modified the JBD layer of the Ext3 file system. That is, we want to estimate the effect of the journal superblock and metadata writes (and data writes in Ext3 journal mode), which make up most of the journal I/O issued by Ext3 journaling. After JBD groups the journal data and commits it to the buffer cache layer, we intercept and drop the data and call the journal_end_buffer_io_sync function. We call the result the NJ-Ext3 file system in our experiments.
4.3 The Effect of Journaling
Although metadata journaling successfully reduces the recovery time after a system crash, it introduces performance overhead. Adding metadata journaling to a file system can cause a 20% ~ 25% performance degradation compared to a typical asynchronous-write file system [11], especially for metadata-bound workloads.
To estimate how the I/O activity of metadata journaling affects the performance of the Ext3 file system, we run the PostMark benchmark on Ext3 and on non-journal Ext3 (NJ-Ext3). The results are shown in Figure 1.
Figure 1 Performance comparisons of journal and non-journal Ext3 file system [In this figure, journaling brings 16.42% overhead in the ordered mode, 15.97% overhead in the writeback mode, and 27.51% overhead in the journal mode]
From Figure 1 we can see that journaling I/O causes a performance degradation that ranges from 15.97% (in writeback mode) to 27.51% (in journal mode). The degradation comes from extra disk traffic and seeks. In journal mode, for example, the disk traffic grows to almost twice the original data size. The results of the NJ file system can be viewed as performance upper bounds for the Ext3 file system.
Note that disk performance also affects the results. Generally speaking, the performance degradation is larger for a disk with slower seek and rotation times.
4.4 The Effect of Commit Interval
In this section, we observe the effect of the journal commit interval. The kjournald daemon periodically groups journal data and commits it to the buffer cache layer; the commit interval is defined as the time between two successive commits (5 seconds by default). Generally speaking, a larger commit interval has a higher chance of data loss, and a smaller commit interval leads to lower performance because of extra seeks.
Figure 2 Performance comparisons of different commit intervals
The x-axis indicates the commit interval in seconds. The y-axis indicates the throughput.
Performance of ordered and writeback mode decreases with higher commit interval. Journal
Figure 2 shows the performance comparison among different commit intervals. From the figure we can see that the throughput of writeback mode and ordered mode increases as the interval becomes larger. A smaller interval results in worse performance because it causes additional disk head seeks. A larger interval leads to better performance, which benefits from the delayed write effect. Note that journal mode shows a different trend from the other modes. Journal mode handles heavy traffic that includes both regular data and metadata, so its performance is the worst of the three modes.
4.5 Observation
In Sections 4.3 and 4.4, we observed the factors that affect a journaling file system. Journal I/O is necessary for consistency recovery but harms performance. A larger commit interval brings higher performance, which results from the delayed write effect, but more data is lost if a crash happens.
We propose remote journaling in the next chapter. Remote journaling moves journal I/O from the disk to the network and uses a small commit interval, which implies strong consistency semantics. The network overhead and throughput benefit of remote journaling are evaluated in our experiments.
Chapter 5
Remote Journaling
In this chapter, we explain the concept of remote journaling and how it can be used. The experiments are presented in the next chapter.
5.1 Concept of Remote Journaling
We propose a remote journaling architecture, which journals data to a remote journal server instead of the local disk. In addition to guaranteeing consistency, a remote journaling file system achieves performance similar to that of a non-journal file system. This is because the journal data can be sent out as soon as it is generated, and thus the journal traffic does not harm file system performance.
Unlike disk I/O, network transmission requires no positioning time. Positioning time, which consists of seek time and rotation time, is time-consuming and harms performance. The remote journal scheme avoids the positioning time that would otherwise be needed to flush the journal to the local disk, and thereby improves performance.
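A rough calculation shows why positioning time dominates small journal writes. The sketch below uses the disk figures from Table 2 (15000 RPM, 3.9 ms average seek, 49 Mbytes/s minimum transfer rate); the 4096-byte block size and the function names are assumptions made for illustration, and the average rotational latency is taken as half a rotation.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one full rotation."""
    return 0.5 * 60_000 / rpm


def transfer_ms(num_bytes, mbytes_per_s):
    """Time to transfer num_bytes at the given sustained rate."""
    return num_bytes / (mbytes_per_s * 1_000_000) * 1_000


SEEK_MS = 3.9       # average seek time from Table 2
RPM = 15_000        # spindle speed from Table 2
BLOCK_BYTES = 4096  # assumed journal block size (not stated in the text)

positioning = SEEK_MS + avg_rotational_latency_ms(RPM)
transfer = transfer_ms(BLOCK_BYTES, 49)  # slowest transfer rate from Table 2
print(f"positioning: {positioning:.1f} ms, transfer: {transfer:.2f} ms")
```

Under these assumptions the positioning time is about 5.9 ms, while transferring one block takes less than 0.1 ms, so a journal write that requires a seek is dominated by head movement rather than by the data itself.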
Moreover, remote journaling is an inexpensive solution. Many hosts can share a single journal server at the same time. Since the workload of the journal server is write-dominated, the disk layout of the journal can be designed to optimize write performance.
5.2 Applications
Remote journaling can be applied to any journaling file system. The mechanism is especially useful for metadata-bound workloads, such as online transaction processing, web servers, or news servers. The network bandwidth taken by remote journaling is acceptable when it is used with network applications. File system consistency recovery, which is the same as the traditional process except that the journal is read from the network, reduces downtime, which is important for a commercial service. Moreover, remote journaling can be used in a storage cluster, as in Figure 3, where the storage servers share the cost of the journal server.
Figure 3 The example application
5.3 Implementations
We modify the kjournald daemon for remote journaling. When the user specifies a remote journal mode in the file system mount table, the modified kjournald tries to connect to the remote journal server. Metadata is transferred over TCP/IP, which guarantees that the transfer completes without loss. However, if a transmission error happens, which may be caused by network congestion or a server failure, the modified kjournald switches to the local journal mode to preserve file system consistency.
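The fallback behavior can be sketched in user-space Python. This is only a model of the control flow described above, not the kernel modification: commit_journal, the two callables, and the state dictionary are hypothetical names, and a Python OSError stands in for a transmission error.

```python
def commit_journal(records, send_remote, write_local, state):
    """Commit journal records remotely, falling back to the local journal.

    send_remote, write_local: callables that take the list of records
                              (hypothetical hooks for network and disk).
    state: dict whose "mode" entry is either "remote" or "local".
    Returns the mode that actually handled the commit.
    """
    if state["mode"] == "remote":
        try:
            send_remote(records)
            return "remote"
        except OSError:
            # Network congestion or server failure: switch to local mode
            # so that the journal is still written somewhere.
            state["mode"] = "local"
    write_local(records)
    return "local"


def failing_send(records):
    raise ConnectionError("journal server unreachable")


local_journal = []
state = {"mode": "remote"}
print(commit_journal(["tx1"], failing_send, local_journal.append, state))
print(state, local_journal)
```

In this sketch the switch is one-way: once a transmission error occurs, later commits go to the local journal.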
The main function for journal flushing in kjournald is journal_commit_transaction. It first updates the journal superblock, which holds the journal information; we modify the journal_update_superblock function to commit to the network instead of the disk. Then journal_commit_transaction tries to commit the data buffers in ordered mode. After all data that must be flushed before the metadata has been flushed, we collect the buffers that hold journal data and commit them from the buffer cache layer to the network layer. If all commits succeed, we insert a checkpoint and release the file system transaction.
When the recovery process starts, we try to connect to the remote server and read the journal superblock and the other information needed for recovery from it. If any error occurs, we have to scan the whole disk, because we have no journal information to recover from.
The best choice of file system for the remote journal server is a log-structured file system. Because the workload on the remote journal server is write-oriented most of the time, a log-structured file system suits it for performance reasons, and it also provides strong consistency for the server's file system. Fast recovery of the remote server's file system is important, because any failure of the remote server forces its clients to scan their whole disks.
Chapter 6
Experimental results
In this chapter, we evaluate the performance and the overhead brought by remote journaling. The remote journal scheme moves journal data traffic from the disk to the network, so we evaluate the performance gain, the network bandwidth usage, and other factors.
6.1 Performance comparisons
This section compares the throughput of the three journal modes of the Ext3 file system. We add the non-journal series as the upper bound, which helps us understand the overhead of remote journaling. Figure 4 shows that remote journaling is about 5% ~ 7% below the upper bound and still outperforms normal Ext3 by about 10% (in writeback and ordered mode) to 21% (in journal mode).
Generally speaking, Ext3 journal mode, which has to journal both data and metadata, has higher overhead, so its performance is only 78% of that of writeback mode. With remote journaling, however, the gap between writeback and journal mode narrows. Figure 4 indicates that the performance of journal mode rises to 86% of that of writeback mode with remote journaling. The rise is significant because it makes journal mode more practical to use.
Figure 4 Performance of remote journal
6.2 CPU utilization of remote journal
In this section, we record and compare the CPU utilization of the three modes. Figures 5, 6, and 7 show the curves of the normal series, the non-journal series, and the remote journal series.
Figure 5 CPU utilization of writeback mode
Figure 6 CPU utilization of ordered mode
Figure 7 CPU utilization of journal mode
The CPU utilization of the non-journal and remote journal series is about 5% to 20% higher than that of the normal series. However, the total time needed to complete the benchmark is shorter.
To understand how much CPU overhead remote journaling brings, we integrate the area under the curves in Figures 5 ~ 7 and show the result in Tables 3, 4, and 5. Although remote journaling has higher CPU utilization, its total CPU time is close to that of the normal series (the difference is lower than 10%).
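Integrating a utilization curve over time yields CPU time. The following minimal sketch uses the trapezoidal rule; the function name cpu_time and the sample values are hypothetical and are not the measured data behind Figures 5 ~ 7.

```python
def cpu_time(samples):
    """Integrate a CPU utilization curve with the trapezoidal rule.

    samples: list of (time_in_seconds, utilization between 0 and 1),
             sorted by time. Returns the CPU time in seconds.
    """
    total = 0.0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (u0 + u1) / 2
    return total


# Hypothetical curves: the remote journal run has higher utilization but
# finishes sooner, so the areas under the two curves are comparable.
normal_run = [(0, 0.40), (100, 0.40)]
remote_run = [(0, 0.50), (80, 0.50)]
print(cpu_time(normal_run), cpu_time(remote_run))
```

This illustrates how a series with higher instantaneous utilization can still consume a similar total CPU time when the benchmark completes earlier.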
Figure 8 CPU time comparison (series: Writeback, RJ-Writeback, Non-Writeback, Ordered, RJ-Ordered, Non-Ordered, Journal, RJ-Journal, Non-Journal)
6.3 Network bandwidth taken by remote journal
Although file system performance benefits from moving journal I/O from the disk to the network in the remote journal scheme, the journal traffic may reduce the network bandwidth available to network applications. Thus we measure the network bandwidth taken by remote journaling in the three modes.
The results are shown in Figure 9. The y-axis shows the percentage of gigabit Ethernet bandwidth used by remote journaling, and the x-axis shows the three modes of the Ext3 file system.
Because writeback and ordered mode log only metadata while journal mode logs both metadata and data, journal mode places a heavier burden on the network than the other two modes. In our workload, writeback and ordered mode each use less than 2% of the gigabit Ethernet bandwidth, and journal mode uses 6.6%. Note that the network burden of journal mode varies with the workload: the network overhead of remote journaling becomes larger when the workload includes larger data.
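The percentages can be converted into absolute rates for comparison with disk bandwidth. The sketch below assumes a nominal gigabit link of 10^9 bits per second and ignores protocol overhead; the function name is hypothetical.

```python
GIGABIT_BPS = 1_000_000_000  # nominal gigabit Ethernet rate, bits per second


def journal_traffic_mbytes_per_s(link_fraction, link_bps=GIGABIT_BPS):
    """Convert a fraction of link bandwidth into Mbytes per second."""
    return link_fraction * link_bps / 8 / 1_000_000


# 6.6% for journal mode; 2% is the upper bound for writeback and ordered mode.
print(f"journal mode: {journal_traffic_mbytes_per_s(0.066):.2f} Mbytes/s")
print(f"writeback/ordered bound: {journal_traffic_mbytes_per_s(0.02):.2f} Mbytes/s")
```

Under these assumptions journal mode sends about 8.25 Mbytes/s and the metadata-only modes send less than 2.5 Mbytes/s.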
Figure 9 Network usage comparison (y-axis: network usage per second, 0.00% to 7.00%; x-axis: writeback, ordered, journal)
Chapter 7
Conclusion and future work
7.1 Conclusion
In this paper, we proposed a scheme named remote journal, which improves the performance of journaling file systems. Remote journal improves file system performance by moving journal I/O from the local disk to a remote journal disk over the network. If an error occurs during remote journaling, we switch from the remote journal to the local journal in order to guarantee fast recovery. The consistency semantics of the original file system are not harmed, because we log the same journal data. We implemented the remote journal scheme in the Ext3 file system, a popular journaling file system in Linux, by modifying the JBD layer and the kjournald daemon.
The main advantage of remote journal is a clear performance improvement, which mainly results from removing journal I/O from the local disk. Another advantage is hardware cost: a remote journal server can support many clients, and thus its cost can be shared.
According to the experiments in this paper, remote journal improves performance by about 10% (in Ext3 writeback and ordered mode) to 21% (in Ext3 journal mode), and the penalty is light. Although remote journal needs more CPU time for network transfer, the overhead is less than 10%. We also considered the effect of the journal traffic on network bandwidth. In our experimental results, the overhead in writeback and ordered mode is light because only metadata is logged to the journal over the network; in our metadata-bound workload, it is less than 2%. Journal mode, however, has higher overhead because its journal traffic depends on the workload. With a workload of large files, the journal traffic is heavy and the overhead is higher.
To sum up, the remote journal scheme is an easy and inexpensive solution for improving the performance of journaling file systems. It can be used easily after patching the kernel. Most of the time it brings better performance and keeps the same file system consistency semantics at a low overhead penalty.
7.2 Future work
When the remote journal scheme is adopted in mobile storage, it can be used for disk power management. To save the power spent spinning an unused disk, most disk power-saving approaches try to make the disk sleep for as long as possible. However, journal flush activities update the disk frequently for consistency, so the sleep time in mobile storage suffers from them. This makes the disk waste more energy on mode switches, that is, on spinning up and down.
Remote journal can solve this problem under this condition. The remote journal server can be an FTP server or a free mail space. If general Ethernet is used, remote journal can save