CHAPTER 3 RELATED WORK
3.4 Others
3.4.3 Xen
Xen [4] is an x86-based virtual machine monitor that allows multiple commodity operating systems, such as Linux, BSD, and Windows XP, to be hosted simultaneously on a single physical machine. As mentioned before, we implemented our framework on Xen because it is open source and incurs little overhead.
CHAPTER 4
DESIGN AND IMPLEMENTATION
In general, the functionality exported by current operating systems has the following limitations with respect to the goal of zero-loss service restart.
• Existing systems usually do not provide any mechanisms to recover the state of a service application when it crashes due to software faults. Instead, the operating system usually kills the faulty processes, which causes the internal state of the processes (including the connections being served) to be lost.
• The connection state in TCP layer will be lost when a fault crashes the service application or the operating system. For the former, the operating system will clean the connection state of the service process. For the latter, the system will be rebooted and all the system information will be lost.
• When upgrading a system, the administrator has to turn off the service, which causes the service to become unavailable for a period of time.
In this thesis, we propose a framework to overcome the above limitations.
The rest of this chapter is organized as follows. Section 4.1 explains the system components in the framework. Section 4.2 describes the proposed fault recovery technique, explaining how to recover a service when a fault occurs. Section 4.3 describes the proposed online maintenance technique, explaining how to avoid the downtime caused by system maintenance.
4.1 System Components
Figure 5 Our System Components
As shown in Figure 5, in order to achieve the goals mentioned above, we implement our framework in both the operating system kernel (i.e., Linux) and the virtual machine monitor (i.e., Xen). The former part is called OS layer Zero-loss Subsystem (OZS) while the latter part is called VMM layer Zero-loss Subsystem (VZS).
The major components of the framework are: protocol manager, health monitor and recovery manager. In addition to the components, we also enhance FT-TCP to reduce the recovery time and provide an API for the service designers to develop their fault tolerant service. Moreover, the framework provides system calls for the administrators to control the backup server and the service migration.
4.2 Fault Recovery
Briefly speaking, we use four techniques to achieve the goal of fault recovery. First, we develop a protocol to create, suspend, and resume the backup server. During normal operation, the backup server is suspended so that it does not contend for CPU resources with the primary server. Once the primary fails, the backup server is resumed to take over the job of the primary. Second, we provide a log buffer in the VMM, which allows us to store the connection state without communicating with the backup server. Third, we provide a fault detection mechanism, which can detect application or operating system faults and then trigger the recovery job. Finally, we provide a recovery mechanism to recover the service state.
Figure 6 An Overview of Fault Recovery
Before describing our fault recovery techniques in detail, we give a brief overview of the recovery flow, which is shown in Figure 6. Before starting an Internet service, the administrator starts a backup server, including the backup OS and the service application. Then, so that the primary server can use all of the system resources, the backup server releases the resources it holds, such as CPU time. The primary server then performs its normal operations and logs the connection information in the Virtual Machine Monitor (VMM). When a fault is detected, the VMM wakes up the backup server and recovers the service state so that the system can continue to provide the service.
In the following, we will describe the details of the techniques. Section 4.2.1 describes the way to boot a second OS instance and release the system resource used by the second OS instance. The flow of logging connection state is presented in Section 4.2.2. Section 4.2.3 describes the fault detection mechanism, and the recovery flow is presented in Section 4.2.4.
4.2.1 Backup Server Boot-up
Figure 7 The Flow of Booting a Backup Server
We define a protocol to manage the boot-up of the backup server. The protocol involves the VMM and three domains (control, primary, and backup), which implement the protocol using the API provided by the framework, as shown in Table 1.
Table 1 System Calls Provided by OZS
Figure 7 shows the flow of booting up a backup server. By default, Xen only allows the control domain to boot up other domains. To enable an authorized primary server to boot up its backup, we allow the administrator to register, in advance, an entry in the backup-grant table for each primary server that has the right to boot up its backup. The table is stored in the VMM and managed by the protocol manager, and the registration is done by calling the sys_ins_auth() system call in the control domain. When a primary server boots a backup server, the protocol manager checks whether the primary server has the grant.
The primary server calls the sys_boot_backup_server() system call to ask Xen to create the backup. As mentioned above, the protocol manager checks whether the primary server is authorized to boot its backup. If it is, the protocol manager asks Xen to create the backup domain.
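The authorization check described above can be illustrated with a small user-space model. This is only a sketch: the class name, the argument lists, and the domain ID allocator are assumptions made for illustration, and only the names sys_ins_auth() and sys_boot_backup_server() come from the text.

```python
class ProtocolManager:
    """Toy model of the protocol manager that lives in the VMM."""

    def __init__(self):
        self.backup_grant_table = set()  # primary domain IDs allowed to boot a backup
        self.backups = {}                # primary domain ID -> backup domain ID
        self._next_domid = 100           # hypothetical domain ID allocator

    def ins_auth(self, caller_is_control, primary_id):
        """Models sys_ins_auth(): only the control domain may register a grant."""
        if not caller_is_control:
            raise PermissionError("only the control domain can register grants")
        self.backup_grant_table.add(primary_id)

    def boot_backup_server(self, primary_id):
        """Models sys_boot_backup_server(): create the backup only if granted."""
        if primary_id not in self.backup_grant_table:
            raise PermissionError(f"domain {primary_id} has no backup grant")
        backup_id = self._next_domid
        self._next_domid += 1
        self.backups[primary_id] = backup_id
        return backup_id
```

In this model, a primary domain that was never registered through ins_auth() is refused when it tries to boot a backup, which mirrors the check performed by the protocol manager.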
By default, Xen gives a unique IP address to each guest OS so that each domain can communicate with external machines. This results in a longer recovery time, since the backup server has to take over the IP address of the primary server when the latter crashes. Thus, we provide a sys_change_backup_ip() system call to allow the primary and backup servers to share a single IP address. When the system call is invoked by the primary server, a signal is sent to the backup server through the VZS, and the backup server obtains the primary IP address from the VZS and changes its own IP address accordingly. The IP address change is performed by a user-level task that invokes the ifconfig shell command.
After the IP address is changed, the backup server should release its CPU time so that it does not affect the performance of the primary server. The primary server achieves this by calling the sys_suspend_backup() system call. When the system call is invoked, Xen removes the backup server from its run queue.
From the above description we can see that, although the system calls are implemented in the OZS, most of them require cooperation from the VZS. The communication between OZS and VZS is through hypercalls and events.
4.2.2 Connection State Logging
FT-TCP provides a log buffer to record the connection state of the primary server. When the primary server crashes, the backup server uses the data in the log buffer to recover the system. In our design, we also provide a log buffer that does not lose data even when the primary server crashes. We use a memory area of the primary server as the log buffer. During the recovery period, the backup server remaps the log buffer into its virtual address space and recovers the service state accordingly. In the following, we describe how the log buffer is implemented in our framework.
To let guest operating systems manage memory conveniently, Xen presents each guest OS with the illusion of a contiguous range of physical addresses. However, these physical addresses are not real machine addresses, which raises two problems. First, as mentioned above, the backup server has to map the log buffer into its virtual address space. This mapping requires the starting machine address of the log buffer. However, a guest OS does not manage machine addresses directly. Thus, we look up the page table of the guest OS, which is updated by Xen, to get the machine address of the log buffer. Once the address is obtained, the OZS issues a hypercall to Xen to register the address. As a result, the backup server can get the machine address of the log buffer during the recovery period.
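The address lookup and registration can be sketched as follows. This is a simplified model under stated assumptions: the page table is a single-level dictionary from virtual page number to machine frame number, the page size is 4 KB, and the Vmm class with its two methods is a hypothetical stand-in for the hypercall interface, not the actual Xen API.

```python
PAGE_SIZE = 4096  # assumed page size


def virt_to_machine(page_table, vaddr):
    """Translate a guest virtual address using a toy single-level page table.

    page_table maps a virtual page number to a machine frame number.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset


class Vmm:
    """Hypothetical stand-in for the registration hypercall in the VZS."""

    def __init__(self):
        self._log_buffer_maddr = {}

    def register_log_buffer(self, primary_id, machine_addr):
        # Called by the primary (OZS) once the machine address is known.
        self._log_buffer_maddr[primary_id] = machine_addr

    def lookup_log_buffer(self, primary_id):
        # Called by the backup during the recovery period.
        return self._log_buffer_maddr[primary_id]
```

The point of the registration step is that the address survives in the VMM, so the backup can still find the log buffer after the primary has crashed.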
Second, if a primary server crashes, its memory area (including the log buffer) will be released by Xen. To avoid releasing the memory before recovering the service, we increase the reference count that corresponds to the primary server by 1 after booting the primary.
After the service recovery, the reference count is decreased by 1 and the resources held by the primary server can be released.
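The effect of the extra reference can be seen in a minimal model. The class below is an illustrative assumption and does not reflect how Xen actually tracks domain memory; it only shows why one additional reference keeps the log buffer alive across a crash.

```python
class DomainMemory:
    """Toy reference-counted memory area of the primary domain."""

    def __init__(self):
        self.refcount = 1      # reference held by the running domain itself
        self.released = False

    def get(self):
        """Take an additional reference."""
        self.refcount += 1

    def put(self):
        """Drop a reference; the memory is released when none remain."""
        self.refcount -= 1
        if self.refcount == 0:
            self.released = True


memory = DomainMemory()
memory.get()   # extra reference taken after booting the primary
memory.put()   # primary crashes: its own reference is dropped
crashed_but_kept = not memory.released
memory.put()   # service recovery finished: extra reference is dropped
```

After the crash the count is still one, so the memory (and the log buffer inside it) remains available until recovery completes.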
4.2.3 Fault Detection
Software faults, which cause the system to become unavailable, can occur in service applications and in the operating system. In the following, we describe how to detect these faults.
Figure 8 Detecting Application Faults
When a fault occurs in an application, the kernel usually invokes the do_exit() function to kill the application process. As shown in Figure 8(a), two paths lead to the invocation of do_exit(). In the first, the application detects the fault itself and calls the sys_exit() system call, which in turn calls do_exit(). In the second, the kernel detects the application fault and sends a signal to kill the application process; in this case, the kernel calls do_exit() through sig_exit(). In principle, we could detect the faults by intercepting do_exit() through kernel binary instrumentation. However, such callee-based instrumentation requires more effort. Therefore, we use the caller-based instrumentation approach instead. As shown in Figure 8(b), the health monitor intercepts the exit() system call and the sig_exit() function, which only requires modifying the destination addresses of two jump instructions.
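The idea of caller-based instrumentation can be illustrated with a user-space analogy in which a dictionary of jump targets plays the role of the two patched jump instructions. The entry names, the install_health_monitor() helper, and the callback signature are assumptions for illustration; the real mechanism rewrites jump destinations in kernel code.

```python
notified = []  # faults reported to the (modelled) health monitor


def do_exit(pid):
    return f"process {pid} exited"


def sys_exit(pid):   # path 1: the application calls exit itself
    return do_exit(pid)


def sig_exit(pid):   # path 2: the kernel kills the process with a signal
    return do_exit(pid)


# The two "jump instructions" whose destinations are rewritten.
jump_targets = {"exit_syscall": sys_exit, "fatal_signal": sig_exit}


def install_health_monitor(targets, on_fault):
    """Redirect each jump to a wrapper that reports, then runs the original."""
    for site, original in list(targets.items()):
        def hooked(pid, _original=original, _site=site):
            on_fault(_site, pid)
            return _original(pid)
        targets[site] = hooked


install_health_monitor(jump_targets, lambda site, pid: notified.append((site, pid)))
```

Only the two entries in jump_targets change; do_exit() itself is left untouched, which is what distinguishes this from the callee-based approach.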
Figure 9 Detecting Kernel Faults
In addition to application faults, operating system faults may also occur. To detect such faults, we inserted a heartbeat generator in the primary server domain and a heartbeat checker in Xen. At each timer interrupt, the generator sends a heartbeat to Xen by incrementing a heartbeat counter variable that is shared by the primary server domain and Xen. The checker examines the variable at each timer interrupt to detect operating system faults. If the value remains the same for two timer interrupt periods, the operating system is regarded as failed, and the checker notifies the recovery manager to recover the system. It is worth noting that the heartbeat mechanism is based on shared memory instead of hypercalls, and thus avoids the overhead of frequent privilege mode crossings.
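The heartbeat check can be sketched as follows. This is a simulation under stated assumptions: the shared counter is modelled as an attribute of a Python object rather than a page shared between the guest and Xen, and the class and function names are hypothetical.

```python
class SharedPage:
    """Stands in for the memory shared by the primary domain and Xen."""

    def __init__(self):
        self.heartbeat = 0


def guest_timer_interrupt(shared):
    """Heartbeat generator: runs in the primary domain at each timer tick."""
    shared.heartbeat += 1


class HeartbeatChecker:
    """Heartbeat checker: runs in the VMM at each timer tick."""

    def __init__(self, shared, max_missed=2):
        self.shared = shared
        self.last_seen = shared.heartbeat
        self.missed = 0
        self.max_missed = max_missed  # two unchanged periods mean failure

    def on_timer_interrupt(self):
        """Return True when the guest OS should be regarded as failed."""
        if self.shared.heartbeat == self.last_seen:
            self.missed += 1
        else:
            self.last_seen = self.shared.heartbeat
            self.missed = 0
        return self.missed >= self.max_missed
```

Both sides only read or write the shared counter, so no call into the VMM is needed to deliver a heartbeat.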
4.2.4 Recovery Flow
Figure 10 Recovery Protocol
When a fault is detected, the recovery manager follows the recovery protocol to recover the system. Figure 10 illustrates the recovery protocol, which is divided into three steps. First, the recovery manager must change the network path so that incoming packets that were originally delivered to the primary server are now delivered to the backup server. Xen stores IP-to-domain mappings for each domain (i.e., in the net_schedule_list list) in order to perform packet delivery, and thus the network path can be changed simply by updating the mapping that corresponds to the IP address of the backup server. Second, the recovery manager must wake up the backup server so that it can take over the job of the primary server. Third, the recovery manager must send a signal to notify the backup server to recover the system. On receiving the signal, the kernel subsystem in the backup server obtains the machine address of the log buffer through a hypercall, remaps the log buffer, and then executes the FT-TCP recovery flow.
It is worth mentioning that, if the fault does not crash the kernel of the primary domain, we can change the IP address and the packet delivery path (in Xen) so that a system administrator can connect to the faulty server to diagnose the cause of the fault.
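The three steps of the recovery protocol in this section can be summarized in a short model. It is a sketch only: the VMM state is represented as a plain dictionary whose keys are assumptions, and the signal name "RECOVER" is hypothetical.

```python
def recover(vmm, primary_id, backup_id):
    """Run the three recovery steps on a dictionary-based model of VMM state."""
    # Step 1: change the network path so that packets addressed to the
    # primary are delivered to the backup instead.
    for ip, domain in vmm["ip_to_domain"].items():
        if domain == primary_id:
            vmm["ip_to_domain"][ip] = backup_id
    # Step 2: wake up the suspended backup by putting it on the run queue.
    if backup_id not in vmm["run_queue"]:
        vmm["run_queue"].append(backup_id)
    # Step 3: signal the backup so that its kernel subsystem starts recovery.
    vmm["signals"].append((backup_id, "RECOVER"))


def handle_recover_signal(vmm, primary_id):
    """Backup side: fetch the log buffer address, then start FT-TCP recovery."""
    machine_addr = vmm["log_buffer_maddr"][primary_id]
    return {"mapped_log_buffer": machine_addr, "next_step": "ft_tcp_recovery"}
```

The ordering matters in this model as it does in the text: the path is redirected before the backup runs, so no packet is delivered to the failed primary once recovery starts.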
4.3 Online Maintenance
To allow online maintenance, we use some mechanisms that are the same as those we use for fault recovery. For example, we also provide a backup server and a log buffer.
In addition, we add functionality that is specific to online maintenance. Specifically, we provide a sys_migrate_service() system call (as shown in Table 1), which is used to migrate an Internet service once the administrator has completed system maintenance. We describe it in the next section.
Figure 11 An Overview of the Online Maintenance Flow
Figure 11 shows a brief overview of the online maintenance flow. First, as shown in Figure 11(a) and (b), the OZS boots a backup server and then suspends it. When the system needs maintenance, the kernel subsystem wakes up the backup server, and the administrator can upgrade the system on the backup server, as shown in Figure 11(c). When the system maintenance is completed, the administrator can use the sys_migrate_service() system call to migrate the service state from the primary server to the backup server, as shown in Figure 11(d). Finally, as shown in Figure 11(e), the backup server takes over the job, and the clients continue to be served without any interruption caused by system maintenance.
4.3.1 Maintenance Flow
Figure 12 The Flow of Service Migration
Figure 12 shows the flow for achieving online maintenance. As mentioned above, the backup server suspends itself after it has initialized, and the IP address of the backup server has been changed to the IP address of the primary server in order to allow fast fault recovery.
Therefore, to allow online maintenance, the administrator first wakes up the backup server and restores its IP address by calling the sys_wakup_backup_server() and sys_change_backup_ip() system calls, respectively. When the system maintenance on the backup server is finished, the administrator changes the IP address of the backup server to that of the primary server and calls the sys_migrate_service() system call (as shown in Table 1) to migrate the service. This system call notifies the recovery manager, through the protocol manager, to migrate the service. The recovery manager uses the strategy described in Section 4.2.4 to migrate the service.
CHAPTER 5
RECOVERY TIME REDUCTION
For some Internet services, FT-TCP may take a long time to recover the service; we describe these conditions in the following sections. To address this, we provide an API that Internet services can use to reduce the recovery time. In the rest of this chapter, we take two Internet services (FTP and HTTP proxy) as examples to demonstrate how to reduce the recovery time by using the API.
5.1 FTP
Figure 13 An Example of Using FTP for Sending Files
Many Internet service protocols, such as FTP, SAMBA, and HTTP 1.1, use long-lived connections or sessions for object transmission. FT-TCP may take a long time to recover such services, since it replays the connection reestablishment and object transmission from the beginning. We illustrate this condition with an FTP example. As shown in Figure 13, the FTP session contains control commands and a sequence of data commands. If a fault happens during the transmission of the crtt.tar.gz file (i.e., the second data connection), FT-TCP will replay all the control and data commands. However, as shown in Figure 13, the first data command does not need to be recovered, since it completed successfully before the fault happened. Removing such data commands from the recovery job could reduce the service recovery time.
Table 2 Wrapper Registration APIs
Table 3 Application-Specific Hook APIs
Table 4 Client-Side APIs
Applying such an optimization to FT-TCP requires digging into the FT-TCP code and modifying it. Moreover, such an optimization may only be suitable for a special class of services, such as FTP and HTTP 1.1. Therefore, in order to reduce the effort of the developers, we decompose FT-TCP into several basic operations and export a function for each operation. Thus, when implementing a fault tolerant service, the developers can invoke the operations provided by FT-TCP according to their needs. Table 2, Table 3, and Table 4 show the exported functions. The functions that correspond to the original FT-TCP implementation can be divided into two categories. The first category is the Wrapper Registration API, by which designers can install wrapper functions. The second category consists of the Client-side South-side Wrapper (CSW) API, by which designers can log and recover TCP state, and the Client-side North-side Wrapper (CNW) API, by which designers can record the interactions between TCP and the service application so as to recover the service state. We added a third category (i.e., the Application-Specific API), which provides add-on functionality for some Internet services. For example, we implemented the function as_remove_request() so that developers of a fault tolerant FTP server can remove requests that do not need to be replayed during the fault recovery period.
To conclude this section, we describe how to use the API to implement a fault tolerant FTP system. First, the developers should register the handlers that will be invoked when a packet reaches the NSW or the SSW (i.e., by calling wr_ins_nsw() and wr_ins_ssw()). Then, they should invoke the API shown in Table 2. For example, if a SYN packet is received, they can invoke cs_rec_syn() to log the packet. Lastly, the FTP developers should record the data connection commands that the server has received, such as PASV, LIST, and RETR, and remove the commands that correspond to a file whose transmission has completed. After transmitting a file, the FTP server sends a 226 reply to the client, at which point the FTP developers can call as_remove_request() to remove the corresponding data connection commands.
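The pruning of completed data commands can be illustrated with a small model of the request log. This is a sketch under stated assumptions: the RequestLog class, its methods, and the sequence-number argument are hypothetical, and the method named as_remove_request() only mirrors the role of the API function in the text, whose real signature is not given here.

```python
DATA_COMMANDS = {"PASV", "PORT", "LIST", "RETR", "STOR"}  # assumed set


class RequestLog:
    """Toy log of FTP requests that would be replayed during recovery."""

    def __init__(self):
        self.entries = []     # (sequence number, command line)
        self._in_flight = []  # data commands of the transfer in progress
        self._next_seq = 0

    def record(self, command_line):
        seq = self._next_seq
        self._next_seq += 1
        self.entries.append((seq, command_line))
        if command_line.split()[0].upper() in DATA_COMMANDS:
            self._in_flight.append(seq)

    def as_remove_request(self, seq):
        """Drop one logged request so that it is not replayed."""
        self.entries = [entry for entry in self.entries if entry[0] != seq]

    def on_server_reply(self, code):
        """A 226 reply marks the end of a transfer; prune its data commands."""
        if code == 226:
            for seq in self._in_flight:
                self.as_remove_request(seq)
            self._in_flight.clear()

    def replay_list(self):
        return [command for _, command in self.entries]
```

With this model, a fault during the second transfer leaves only the control commands and the data commands of the unfinished transfer in the replay list.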
5.2 HTTP Proxy
Figure 14 Relay Server
As shown in Figure 14, some Internet servers, such as proxies and email servers, act as both clients and servers; these are called relay servers. The other servers play only the server role and are called end servers.
FT-TCP can recover a system in a client-transparent way. However, the recovery cannot be done in a server-transparent way, and FT-TCP is therefore not suitable for relay servers. Specifically, applying FT-TCP to relay servers would cause the relay server to establish a number of new connections with the end servers once the service on the relay server crashes. This results in a long recovery time and may cause data inconsistency for dynamic-object requests or transaction-based connections. Therefore, we extend the FT-TCP implementation to achieve server transparency and avoid such connection establishment. A fault tolerant relay service can use the API shown in Table 3 to achieve server-side transparency.
Table 5 Server Side APIs
This API is a counterpart to the client-side API. As shown in Table 5, the API can be divided into two categories. The first is the Server-side South-side Wrapper (SSW) API, by which developers can log and recover TCP state. The second is the Server-side North-side Wrapper (SNW) API, by which developers can record the interactions between TCP and the service application so as to recover the service state. Their functionality is similar to that of FT-TCP.
CHAPTER 6
PERFORMANCE EVALUATION
In this chapter, we evaluate the effectiveness and efficiency of our framework. We implemented a fault tolerant proxy (i.e., ft_proxy) and a fault tolerant FTP server (i.e., ft_ftp) based on our framework, and we measure their performance both with and without the presence of faults.
6.1 Experimental Environment
Table 6 Experimental Environment