
5. RELATED WORK

5.4 OTHERS

Autonomic Computing (Kephart and Chess, 2003), proposed by IBM, enables systems to manage themselves according to the administrator’s goals. Self-management comprises self-configuring, self-healing, self-protecting, and self-optimizing. In particular, self-healing techniques automatically detect, diagnose, and repair software and hardware problems. Efforts related to self-healing include SRIRAM (Verma et al., 2003), a method that facilitates instantiating mirroring and replication of services in a network of servers; K42 (Appavoo et al., 2003), which allows code such as system monitoring and diagnosis functions to be inserted and removed dynamically without shutting down the running system; and Dynamic CPU Sparing (Jann et al., 2003), which predicts that a CPU is about to fail and replaces it with a spare one.

Recovery-Oriented Computing (Patterson et al., 2002), proposed by U.C. Berkeley and Stanford University, is an effort related to autonomic computing. It proposes new techniques to deal with hardware faults, software bugs, and operator errors. These techniques include Pinpoint (Chen et al., 2002), which finds the root cause of a system failure efficiently; System Undo (Brown and Patterson, 2003), which recovers the system from operator errors; and Recursive Restart (Candea et al., 2002), which reduces service downtime. In addition, they proposed on-line fault injection and system diagnosis to improve the robustness of the system.

Checkpointing [15][2][25][20][26] is a common technique for system recovery. It saves system state periodically or before entering critical regions. If the system fails, it can be recovered by restoring the last checkpointed state. The major problem of checkpointing is that it cannot make the system survive faults caused by driver bugs, because it restores an old state and then re-executes the same code after recovery. Moreover, many checkpointing implementations incur high overhead due to the large amount of state that needs to be saved.
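As a rough illustration of the idea only (not of the cited systems), the following user-level C sketch periodically saves a hypothetical application state (struct app_state) to a checkpoint file and restores it on restart; the state layout and file name are assumptions made for the example.

```c
/*
 * Minimal user-level sketch of the checkpoint/restore idea, assuming a
 * hypothetical application state and checkpoint file name; it does not
 * reflect how the cited checkpointing systems are implemented.
 */
#include <stdio.h>

struct app_state {                 /* hypothetical application state */
    long   iterations_done;
    double partial_result;
};

static int save_checkpoint(const char *path, const struct app_state *s)
{
    FILE *fp = fopen(path, "wb");
    if (!fp)
        return -1;
    size_t n = fwrite(s, sizeof(*s), 1, fp);
    fclose(fp);
    return n == 1 ? 0 : -1;
}

static int restore_checkpoint(const char *path, struct app_state *s)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return -1;                 /* no checkpoint yet: start from scratch */
    size_t n = fread(s, sizeof(*s), 1, fp);
    fclose(fp);
    return n == 1 ? 0 : -1;
}

int main(void)
{
    struct app_state st = { 0, 0.0 };

    /* On startup, roll back to the last checkpointed state if one exists. */
    if (restore_checkpoint("app.ckpt", &st) == 0)
        printf("resumed at iteration %ld\n", st.iterations_done);

    for (; st.iterations_done < 1000000; st.iterations_done++) {
        /* Periodic checkpoint: after a crash, at most 100000 iterations of
         * work are lost and the same code is re-executed from this point. */
        if (st.iterations_done % 100000 == 0)
            save_checkpoint("app.ckpt", &st);

        st.partial_result += 1.0 / (st.iterations_done + 1);
    }

    printf("result = %f\n", st.partial_result);
    return 0;
}
```

The sketch also shows the limitation noted above: if the failure comes from a deterministic bug in the re-executed code, the restored run simply hits the same fault again.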

Lakamraju [14] introduced a low-overhead fault tolerance technique that recovers only from network processor hangs in Myrinet. When the network processor hangs, the technique resets the NIC and rebuilds the hardware state from scratch to avoid duplicated and lost messages. The limitation of this work is that it focuses only on hardware failures rather than software errors. The former are easier to handle because they do not involve complex software-state maintenance problems such as undoing kernel state changes, reconfiguring the new driver, and resolving dangling references.

6. CONCLUSION

In this thesis, we propose the nDriver framework, which uses multiple implementations of a device driver to survive driver faults. It can detect two major types of driver faults: exception faults and blocking faults. With the help of nDriver, driver faults no longer always result in kernel panics or system hangs. Instead, when a fault is detected, nDriver substitutes another driver implementation for the faulty one so that the system can continue working. To achieve seamless driver swapping, nDriver undoes the kernel state changes made by the faulty driver, keeps the unfinished driver requests, and redirects the external references. In addition, nDriver blocks the driver removal and installation events so that the other kernel subsystems are not aware of the driver swapping.
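The user-space sketch below only illustrates the order of the recovery steps summarized above; every structure, function, and driver name in it is a hypothetical stand-in and does not correspond to the actual nDriver implementation, which runs inside the kernel.

```c
/*
 * Illustrative sketch of the recovery sequence described above.
 * All names are hypothetical; the stubs merely print the step they
 * stand for.
 */
#include <stdio.h>

struct driver_impl { const char *name; };

/* Stubs standing in for the individual recovery steps. */
static void block_hotplug_events(void)   { puts("block driver install/remove events"); }
static void unblock_hotplug_events(void) { puts("re-enable driver install/remove events"); }
static void hold_pending_requests(struct driver_impl *d)   { printf("hold unfinished requests of %s\n", d->name); }
static void undo_kernel_state(struct driver_impl *d)       { printf("undo kernel state changes of %s\n", d->name); }
static void unload_impl(struct driver_impl *d)             { printf("unload %s\n", d->name); }
static void load_impl(struct driver_impl *d)               { printf("load %s\n", d->name); }
static void redirect_references(struct driver_impl *from,
                                struct driver_impl *to)    { printf("redirect references %s -> %s\n", from->name, to->name); }
static void replay_pending_requests(struct driver_impl *d) { printf("resubmit held requests to %s\n", d->name); }

struct ndriver_ctx {
    struct driver_impl *active;          /* currently running implementation */
    struct driver_impl *alternates[4];   /* other implementations on standby */
    int n_alternates;
};

/* Invoked when an exception or blocking fault is detected in the active driver. */
static int ndriver_recover(struct ndriver_ctx *ctx)
{
    if (ctx->n_alternates == 0)
        return -1;                       /* no alternate left: give up */
    struct driver_impl *next = ctx->alternates[--ctx->n_alternates];

    block_hotplug_events();              /* keep other subsystems unaware      */
    hold_pending_requests(ctx->active);  /* keep the unfinished driver requests */
    undo_kernel_state(ctx->active);      /* roll back its kernel state changes */
    unload_impl(ctx->active);

    load_impl(next);                     /* bring up the alternate implementation */
    redirect_references(ctx->active, next); /* fix up external references      */
    replay_pending_requests(next);       /* resubmit the held requests         */
    unblock_hotplug_events();

    ctx->active = next;
    return 0;
}

int main(void)
{
    struct driver_impl faulty = { "driver A (faulty)" };
    struct driver_impl spare  = { "driver B (alternate)" };
    struct ndriver_ctx ctx    = { &faulty, { &spare }, 1 };
    return ndriver_recover(&ctx) == 0 ? 0 : 1;
}
```

The ordering shown here (undo, unload, load, redirect, replay) is only one plausible arrangement of the steps listed in the text above.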

The major contribution of our work is that nDriver realizes the concept of recovery blocks at the device driver layer and achieves the goal of seamless driver swapping. Moreover, it improves operating system availability without modifying the existing operating system or driver code.
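For reference, the minimal C sketch below shows the classic recovery block pattern [21]: execute an alternate, check an acceptance test, and on failure restore the saved state and try the next alternate. The alternates, the acceptance test, and all names here are illustrative only.

```c
/*
 * Minimal illustration of the recovery block pattern [21]: try the primary
 * alternate, check an acceptance test, and fall back to the next alternate
 * after restoring the saved state.  All names and checks are illustrative.
 */
#include <stdio.h>

typedef int (*alternate_fn)(int input, int *output);

/* Primary alternate: deliberately buggy, so its result fails the test. */
static int primary(int input, int *output)   { *output = input * 2 + 1; return 0; }

/* Secondary alternate: an independent implementation used as fallback. */
static int secondary(int input, int *output) { *output = input + input; return 0; }

/* Acceptance test: decides whether an alternate's result is usable. */
static int acceptable(int input, int output) { return output == 2 * input; }

static int recovery_block(int input, int *output)
{
    alternate_fn alternates[] = { primary, secondary };
    int saved_input = input;                 /* state saved on entry */

    for (size_t i = 0; i < sizeof(alternates) / sizeof(alternates[0]); i++) {
        input = saved_input;                 /* restore state before each try */
        if (alternates[i](input, output) == 0 && acceptable(input, *output))
            return 0;                        /* acceptance test passed */
    }
    return -1;                               /* all alternates failed */
}

int main(void)
{
    int result;
    if (recovery_block(21, &result) == 0)
        printf("result = %d\n", result);     /* falls back to the secondary */
    return 0;
}
```

In nDriver, the alternates correspond to the multiple implementations of a device driver rather than to application-level routines.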

We implement nDriver as a kernel module in Linux. Currently, it can recover from faults in network and block device drivers. According to the performance evaluation, the overhead of nDriver is no more than 5% and the recovery time is very small. This indicates that nDriver is an efficient mechanism to increase the availability of operating systems.

References

[1] [Surge] Barford, P., and Crovella, M. E. “Generating Representative Web Workloads for Network and Server Performance Evaluation”. In Proceedings of the ACM SIGMETRICS '98, pp. 151-160.

[2] [CP2] Subhachandra Chandra, Peter M. Chen. “Whither Generic Recovery From Application Faults? A Fault Study using Open-Source Software”. In Proceedings of the 2000 International Conference on Dependable Systems and Networks / Symposium on Fault-Tolerant Computing (DSN/FTCS), June 2000.

[3] [sw-aging] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. “Proactive management of software aging”. IBM Journal of Research and Development, Vol. 45, No. 2, March 2001.

[4] [error_distr_0] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, Dawson Engler. “An Empirical Study of Operating System Errors”. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pp. 73-88, Banff, Alberta, Canada, 2001.

[5] [bonding] Davis, T. “Linux Channel Bonding”. Available at http://www.sourceforge.net/projects/bonding/usr/src/linux/Documentation/networking/bonding.txt.

[6] [error_distr_4] Gray, J. and Siewiorek, D. P. “High-availability computer systems”. IEEE Computer, Vol. 24, No. 9, pp. 39-48, Sept. 1991.

[7] [LLKM_HOWTO] Bryan Henderson. “Linux Loadable Kernel Module HOWTO”. Available at http://www.tldp.org/HOWTO/Module-HOWTO/.

[8] [design_diversity] Chris Inacio. “Software Fault Tolerance”. Available at http://www.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/index.html.

[9] [Harden] Intel Corporation and IBM Corporation. “Device Driver Hardening”. Available at http://hardeneddrivers.sourceforge.net/.

[10] [Intel] Intel Corporation, 2003. “Intel Networking Technology – Load Balancing”. Available at http://www.intel.com/network/connectivity/resources/technologies/load_balancing.htm.

[11] [pSeries] Jann, J., Browning, L. M., and Burugula, R. S. “Dynamic Reconfiguration: Basic Building Blocks for Autonomic Computing on IBM pSeries Servers”. IBM Systems Journal, 42(1): 29-37, 2003.

[12] [netperf] Rick Jones. “Netperf benchmark”. Available at http://www.netperf.org/netperf/NetperfPage.html.

[13] [error_distr_2] “Kernel Summit 2003: High Availability”. Available at http://lwn.net/Articles/40620/.

[14] [Myrinet] Vijay Lakamraju, Israel Koren, C.M. Krishna. “Low Overhead Fault Tolerant Networking in Myrinet”. 2003 International Conference on Dependable Systems and Networks (DSN'03), San Francisco, California. June 22-25, 2003.

[15] [CP1] David E. Lowell, Subhachandra Chandra, and Peter M. Chen. “Exploring Failure Transparency and the Limits of Generic Recovery”. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000), October 2000.

[16] [Devil] Fabrice Merillon, Laurent Reveillere, Charles Consel, Renaud Marlet, Gilles Muller. “Devil: An IDL for Hardware Programming”. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI 2000), San Diego, California, October 2000.

[17] [WinHEC] Microsoft Corporation. “Writing Drivers for Reliability, Robustness and Fault Tolerant Systems”. Microsoft Windows Hardware Engineering Conference (WinHEC), 2002.

[18] [WHY] David Oppenheimer, Archana Ganapathi, and David A. Patterson. “Why Do Internet Services Fail, and What Can be Done about It?” In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), 2003.

[19] [RAID] Patterson, D. A., Chen, P., Gibson, G., and Katz, R. H. “Introduction to Redundant Arrays of Inexpensive Disks (RAID)”. In Digest of Papers for the 34th IEEE Computer Society International Conference (COMPCON Spring '89), pp. 112-117.

[20] [CP4] James S. Plank, Micah Beck, Gerry Kingsley, Kai Li. “Libckpt: Transparent Checkpointing under Unix”. In Proceedings of the USENIX Winter 1995 Technical Conference, pp. 213-223, New Orleans, LA, January 1995.

[21] [recovery_blocks] B. Randell and J. Xu. “The Evolution of the Recovery Block Concept”. Software Fault Tolerance, John Wiley & Sons, pages 1-21, New York, 1995.

[22] [TimeStamp] Rubini, A. “Making System Calls from Kernel Space”. Linux Magazine, Nov. 2000. Available at http://www.linux-mag.com/2000-11/gear_01.html.

[23] [Online] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Robert W. Wisniewski, Dilma Da Silva, Gregory R. Ganger, Orran Krieger, Michael Stumm, Marc A. Auslander, Michal Ostrowski, Bryan S. Rosenburg, Jimi Xenidis. “System Support for Online Reconfiguration”. In Proceedings of the USENIX 2003 Annual Technical Conference, pp. 141-154, San Antonio, June 9-14, 2003.

[24] [Nook_SOSP] Michael Swift, Brian N. Bershad, and Henry M. Levy. “Improving the Reliability of Commodity Operating Systems”. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 2003.

[25] [CP3] Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung and Chandra Kintala. “Checkpointing and Its Applications”. In Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, pp. 22, 1995.

[26] [CP5] Avi Ziv, Jehoshua Bruck. “An On-Line Algorithm for Checkpoint Placement”. IEEE Transactions on Computers, Vol. 46, No. 9, pp. 976-985, Sept. 1997.
