
CHAPTER 5 RECOVERY TIME REDUCTION

5.2 HTTP Proxy

Figure 14 Relay Server

As shown in Figure 14, some Internet servers, such as proxies and email servers, act as both clients and servers; we call them relay servers. The other servers play only the server role; we call them end servers.

FT-TCP can recover a system in a client-transparent way, but the recovery cannot be done in a server-transparent way, so it is not suitable for relay servers. Specifically, applying FT-TCP to a relay server would force the server to establish a number of new connections with the end servers whenever the service on the relay server crashes. This results in a long recovery time and may cause data inconsistency for dynamic-object requests or transaction-based connections. Therefore, we extend the FT-TCP implementation to achieve server transparency and avoid such connection establishment. A fault-tolerant relay service can use the API shown in Table 5 to achieve server-side transparency.

Table 5 Server Side APIs

This API is a counterpart to the client-side API. As shown in Table 5, the API can be divided into two categories. The first is the Server-side South-side Wrapper (SSW) API, with which developers can log and recover TCP state. The second is the Server-side North-side Wrapper (SNW) API, with which developers can record the interactions between TCP and the service application so as to recover the service state. Their functionality is similar to that of FT-TCP.
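To make the division of labor between the two wrapper categories concrete, the following minimal Python sketch models them as two in-memory logs. All names (`SouthWrapperLog`, `NorthWrapperLog`, `record_segment`, `record_ack`, `record_read`, `replay`) are hypothetical and are not the functions listed in Table 5; the sketch only illustrates that one side captures TCP-level state while the other captures what the service application exchanged with TCP.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SouthWrapperLog:
    """Hypothetical stand-in for the SSW side: tracks TCP-level state."""
    next_seq: int = 0  # next byte offset expected from the peer
    acked: int = 0     # highest byte offset acknowledged by the peer

    def record_segment(self, seq: int, length: int) -> None:
        # Only in-order data advances the expected sequence number.
        if seq == self.next_seq:
            self.next_seq += length

    def record_ack(self, ack: int) -> None:
        # Acknowledgements never move backwards.
        self.acked = max(self.acked, ack)

    def snapshot(self) -> Tuple[int, int]:
        # State a recovering service would need to resume the connection.
        return (self.next_seq, self.acked)


@dataclass
class NorthWrapperLog:
    """Hypothetical stand-in for the SNW side: tracks application I/O."""
    reads: List[bytes] = field(default_factory=list)

    def record_read(self, data: bytes) -> None:
        # Remember every chunk the application read from the socket.
        self.reads.append(data)

    def replay(self) -> bytes:
        # Feed the same bytes to a restarted service, in the same order.
        return b"".join(self.reads)
```

In this sketch a restarted service would first restore the `snapshot()` of the south-side log and then consume `replay()` from the north-side log, so that neither the client nor the end server observes a new connection.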

CHAPTER 6 PERFORMANCE EVALUATION

In this chapter, we evaluate the effectiveness and efficiency of our framework. We implemented a fault-tolerant proxy (ft_proxy) and a fault-tolerant FTP server (ft_ftp) on top of our framework, and we measure their performance both with and without faults.

6.1 Experimental Environment

Table 6 Experimental Environment

As shown in Table 6, we run the experiments on three machines: one for the clients, one for the fault-tolerant system (the FT machine), and one for the end server. The machines are connected via 100 Mbps Fast Ethernet links. The client machine runs Linux kernel 2.4.18 and the benchmark applications Webstone 2.5, dkftpbench 0.45, and wget. The end server machine runs Linux kernel 2.4.18 and Apache 2.0.40. The FT machine runs Xen 1.2 as the virtual machine monitor, xenolinux 2.4.26 as the guest operating system, and Squid-2.5.STABLE4 and Proftpd-1.2.8 as the applications. Our proxy server is an Intel Pentium 4 2.0 GHz PC with 1GB of memory; the web server runs on a Pentium 4 2.0 GHz PC with 256MB of memory, and the client runs on a Pentium 4 1.6 GHz PC with 256MB of memory. We give 64MB, 256MB, and 256MB of memory to the control domain, the primary domain, and the backup domain, respectively.

6.2 Overhead

6.2.1 FT-TCP Overhead

Figure 15 Performance of FT-TCP

To show that the FT-TCP architecture is not well suited to a VMM environment, we implemented the FT-TCP architecture in Squid. We use two Xen domains: one runs as the primary server and the other runs as the backup server, and the primary sends log data to the backup. We download a 5MB file twenty times with wget, once through the original Squid and once through Squid with the FT-TCP architecture. In Figure 15, the y-axis represents throughput, computed as the mean of the twenty runs. Squid with the FT-TCP architecture suffers a throughput degradation of 49%.
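The degradation figure is the relative drop in mean throughput between the two configurations. A minimal Python sketch of that calculation follows; the function name `degradation_percent` and the sample throughput values are illustrative assumptions, not measurements from this experiment.

```python
from statistics import mean
from typing import Sequence


def degradation_percent(baseline: Sequence[float],
                        measured: Sequence[float]) -> float:
    """Relative drop of the mean measured throughput, in percent."""
    base = mean(baseline)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return (base - mean(measured)) / base * 100.0


if __name__ == "__main__":
    # Hypothetical per-download throughputs in MB/s (twenty runs each).
    original_squid = [10.0] * 20
    ft_tcp_squid = [5.1] * 20
    print(f"{degradation_percent(original_squid, ft_tcp_squid):.1f}%")
```

Averaging each configuration before taking the ratio matches the description of the y-axis as the mean over twenty runs.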

6.2.2 Squid Performance Overhead

Figure 16 Performance of Squid (Connection Rate)

Figure 17 Performance of Squid (Connection Throughput)

Figure 18 Performance of Squid (Response Time)

We use the standard workload of the WebStone benchmark, version 2.5, to measure the impact of state logging on Squid performance. The benchmark simulates many clients that connect simultaneously and issue requests to a web server within a defined period. WebStone reports the average connection rate, average connection throughput, and average response time observed by the simulated clients. We increase the number of clients configured in WebStone from run to run, and each client count runs for five minutes to measure the overhead of our framework.

Figure 16, Figure 17, and Figure 18 compare the performance of the original Squid with that of Squid running on our framework. The x-axis represents the number of clients simulated by WebStone; each client establishes a large number of connections with the server during the experiment. The y-axis indicates the server connection rate, connection throughput, and response time, respectively. The dark bars show the performance of Squid running on the normal operating system, and the light bars show the performance of Squid running on our framework. The figures show that our framework causes only a small throughput degradation, ranging from 1% to 4%, which indicates that the overhead of our framework is low.

6.2.3 Proftpd Performance Overhead

Figure 19 Performance of Proftpd (Connection Throughput)

In addition to Squid, we also measure the performance overhead of Proftpd. We use dkftpbench 0.45 to measure the impact of running Proftpd on our framework. The benchmark is run repeatedly with different numbers of simulated users to determine the maximum number of users that can be supported. We define the minimum quality of service as a bandwidth of 100KB/s per simulated user. Each simulated user that reaches this throughput earns the FTP server one point, and dkftpbench reports the total score by counting the clients whose bandwidth reaches 100KB/s. We report the average total score over 10 runs.
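The scoring rule described above can be sketched in a few lines of Python. This is not dkftpbench code: the function names, the inclusive comparison at exactly 100KB/s, and the sample bandwidths are assumptions made for illustration.

```python
from typing import Sequence

MIN_KB_PER_S = 100.0  # minimum quality of service per simulated user


def run_score(user_kb_per_s: Sequence[float]) -> int:
    """Count the simulated users whose bandwidth reaches the minimum."""
    return sum(1 for bw in user_kb_per_s if bw >= MIN_KB_PER_S)


def average_score(runs: Sequence[Sequence[float]]) -> float:
    """Average the per-run scores over all benchmark runs."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(run_score(run) for run in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical per-user bandwidths (KB/s) for two short runs.
    runs = [[120.0, 80.0, 101.5], [150.0, 130.0, 99.0]]
    print(average_score(runs))
```

Comparing the average score of the original server with that of the server running on the framework then gives the relative degradation.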

Figure 19 compares the scores of the original Proftpd and Proftpd running on our framework. The y-axis indicates the score reported by dkftpbench. The figure shows that our framework causes about 1.12% throughput degradation.

6.3 Recovery Time

6.3.1 Squid Recovery Time

Figure 20 Fault Occurs When the First 10KB of Data is Sent

Figure 21 Fault Occurs When the First Half of Data is Sent

In this section, we measure the performance of Squid when it experiences a failure and recovers. In each run, the client requests one file from the server machine through the proxy. The Squid process is terminated intentionally when the client has received either the first 10KB of data or the first half of the data. The transmission time is measured in the Fast Ethernet environment. Figure 20 and Figure 21 show the transmission time when the fault occurs after the first 10KB of data and after the first half of the data has been sent, respectively. The x-axis stands for the size of the file sent from the server to the client, and the y-axis indicates the transmission time.

The three lines show the transmission time for different file sizes under different conditions. The blue line represents the fault-free condition. The red line represents the case in which a fault occurs while a file is being sent and the service is recovered in a different domain. The green line represents the case in which a fault occurs and the service is recovered in the same domain. According to the results, the recovery latency is about 250 ms when the service is recovered in a different domain and about 600 ms when it is recovered in the same domain. Recovery in a different domain is therefore more efficient, because recovery in the same domain must wait for Squid to restart, which takes about 300 ms. Consequently, when a transient fault or a software aging problem occurs, we should recover the service in a different domain rather than in the same domain for performance reasons.
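Recovery latency can be read off such curves as the extra transmission time of a faulty run over the fault-free run for the same file size. The Python sketch below illustrates that reading; the function name `recovery_latency_ms` and the sample timings are hypothetical and are not taken from Figure 20 or Figure 21.

```python
from statistics import mean
from typing import Dict


def recovery_latency_ms(fault_free_ms: Dict[int, float],
                        faulty_ms: Dict[int, float]) -> float:
    """Mean extra transmission time over the file sizes present in both runs.

    Keys are file sizes (e.g. in MB); values are transmission times in ms.
    """
    sizes = sorted(set(fault_free_ms) & set(faulty_ms))
    if not sizes:
        raise ValueError("no common file sizes to compare")
    return mean(faulty_ms[s] - fault_free_ms[s] for s in sizes)


if __name__ == "__main__":
    # Hypothetical transmission times for 1MB and 5MB files.
    no_fault = {1: 100.0, 5: 500.0}
    with_fault = {1: 350.0, 5: 750.0}
    print(recovery_latency_ms(no_fault, with_fault))
```

Averaging over file sizes assumes that the recovery cost does not depend on the file size, which is a simplification made for this sketch.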

6.3.2 Proftpd Recovery Time

Figure 22 Relation between Number of Connections and Recovery Time

Figure 22 shows the relation between the number of connections that need to be recovered and the total recovery time. Each client connection sends six control command requests (USER, PASS, SYST, PWD, TYPE I, and CWD) and two data command requests (PASV and RETR) to request a 20MB file. We inject a fault when the last data connection has sent the first 1MB of the file. Recovering 70 connections takes only 3.79 seconds, so the recovery latency is acceptable.
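To put the 3.79-second figure in perspective, the following Python sketch counts the command requests issued across all connections and computes the mean recovery time per connection. The function names are hypothetical, and spreading the total evenly over the connections is our simplifying assumption rather than something the measurement establishes.

```python
CONTROL_COMMANDS = ["USER", "PASS", "SYST", "PWD", "TYPE I", "CWD"]
DATA_COMMANDS = ["PASV", "RETR"]


def requests_to_recover(connections: int) -> int:
    """Total command requests issued by the given number of connections."""
    return connections * (len(CONTROL_COMMANDS) + len(DATA_COMMANDS))


def mean_recovery_seconds(total_seconds: float, connections: int) -> float:
    """Average recovery time per connection, assuming an even spread."""
    if connections <= 0:
        raise ValueError("connections must be positive")
    return total_seconds / connections


if __name__ == "__main__":
    print(requests_to_recover(70))
    print(f"{mean_recovery_seconds(3.79, 70) * 1000:.0f} ms per connection")
```

Under that assumption, each of the 70 connections accounts for roughly 54 ms of the total recovery time.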

CHAPTER 7 CONCLUSION

In this thesis, we propose a framework that achieves zero-loss Internet service recovery and upgrade. We make Internet services fault tolerant on a single node. Our framework detects faults and recovers the faulty Internet service automatically. It also supports online maintenance while the Internet service is running on a single node. In addition, we provide techniques and APIs that enhance FT-TCP and reduce the recovery time of some Internet services. Our framework is divided into two parts, the OS layer Zero-loss Framework (OZS) and the VMM Zero-loss Framework (VZS), which provide the functionality needed to reach our goal and are implemented in the kernel and the VMM layer. The experimental results show low state logging overhead and acceptable performance during recovery.

