CHAPTER 2 BACKGROUND
2.2 The Recovery Flow
The flow of connection recovery is presented in Figure 2. To recover a connection in a client-transparent way, FT-TCP re-establishes the connection on behalf of the client and replays the corresponding socket read/write operations.
For connection re-establishment, the SSW sends a fake SYN packet to its local TCP (i.e., the TCP of the backup machine). On receiving the SYN packet, the TCP sends back a SYN/ACK packet, which the SSW intercepts and discards. The SSW then calculates delta_seq, the difference between the initial server-side sequence numbers of the re-established and the original connections. To maintain client-side transparency, delta_seq is used to adjust the sequence numbers of subsequent outgoing (i.e., server-to-client) TCP packets and the ACK sequence numbers of subsequent incoming packets. Note that the sequence numbers of incoming packets and the ACK sequence numbers of outgoing packets need no adjustment, because the SSW uses a special initial sequence number in the SYN packet it fakes: the last ACK sequence number sent by the primary server, minus one. Finally, the SSW spoofs an ACK packet to complete the TCP three-way handshake.
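The sequence-number adjustment described above can be illustrated with a minimal sketch. The function names are hypothetical, and the sign convention (delta_seq is the new initial sequence number minus the original one) is an assumption; the text only states that delta_seq is the difference between the two. All arithmetic is modulo 2^32 because TCP sequence numbers are 32-bit values that wrap around.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def compute_delta_seq(original_isn: int, reestablished_isn: int) -> int:
    """Assumed convention: new server-side ISN minus the original server-side ISN."""
    return (reestablished_isn - original_isn) % SEQ_SPACE


def adjust_outgoing_seq(seq: int, delta_seq: int) -> int:
    """Translate a sequence number chosen by the backup TCP into the client's numbering."""
    return (seq - delta_seq) % SEQ_SPACE


def adjust_incoming_ack(ack: int, delta_seq: int) -> int:
    """Translate an ACK number sent by the client into the backup TCP's numbering."""
    return (ack + delta_seq) % SEQ_SPACE


def fake_syn_seq(last_ack_sent_by_primary: int) -> int:
    """Sequence number of the fake SYN: the last ACK sent by the primary, minus one.

    A SYN consumes one sequence number, so after the handshake the backup TCP
    expects the next client byte at exactly the last ACK value.
    """
    return (last_ack_sent_by_primary - 1) % SEQ_SPACE
```

With an original initial sequence number of 1000 and a re-established one of 5000, an outgoing packet numbered 5001 by the backup TCP leaves the machine as 1001, and a client ACK of 1001 is delivered to the backup TCP as 5001.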
After the connection is re-established, the service application replays the request-processing flow (i.e., accept the connection, read the request, process the request, and write the response) under the control of the NSW. For example, the NSW supplies the logged request to the service application when the latter issues socket read operations. The NSW is also responsible for dropping duplicated response data.
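The two NSW duties mentioned above can be sketched as follows. This is only an illustration: the class name is hypothetical, and it assumes that the log records the request bytes and the number of response bytes the primary had already sent, which the text does not spell out.

```python
class ReplaySketch:
    """Toy model of replay control: feed logged input, suppress duplicate output."""

    def __init__(self, logged_request: bytes, response_bytes_already_sent: int):
        self._request = logged_request
        self._read_pos = 0
        # Assumption: the log tells us how much of the response already left the primary.
        self._to_drop = response_bytes_already_sent

    def read(self, n: int) -> bytes:
        """Serve a socket read from the logged request instead of the network."""
        chunk = self._request[self._read_pos:self._read_pos + n]
        self._read_pos += len(chunk)
        return chunk

    def write(self, data: bytes) -> bytes:
        """Return only the part of data that the client has not been sent yet."""
        dropped = min(self._to_drop, len(data))
        self._to_drop -= dropped
        return data[dropped:]
```

If the primary had already sent five response bytes before the fault, the first five bytes written by the replaying application are discarded and everything after them is forwarded to the client.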
Figure 2 Connection Recovery
CHAPTER 3
RELATED WORK
Related efforts can be classified into four categories: fault tolerance in the application layer, fault tolerance in the transport layer, software maintenance, and others. We describe these efforts in the remainder of this chapter.
3.1 Fault Tolerance in Application Layer
3.1.1 Recursive Restart (RR)
Recursive Restart (RR) [8][9] allows a fine-grained, component-based service to restart a single component (instead of the whole service) when that component fails, thus reducing the service restart time. However, RR is not suitable for all Internet services, for two reasons. First, the inter-component communication degrades system performance, which is unacceptable for many Internet services. Second, RR requires an Internet service to be composed of fine-grained components, which means that legacy Internet service programs must be redesigned.
3.2 Fault Tolerance in Transport Layer
3.2.1 Fine-Grained Failover System
This approach [7] aims to increase the availability of a web service that is based on a server cluster. The mechanism can be divided into two parts. First, an HTTP-aware module is inserted between the application and transport layers to log the interactions between these two layers. Second, the approach uses TCP migrate options to record TCP state for connection resumption. The advantage of this approach is that it requires no modification to the server application. However, the TCP migrate options on which the approach is based require modifications to the TCP implementations of both the server and the clients, which reduces the feasibility of the approach.
3.2.2 M-TCP
Migratory TCP (M-TCP) [23] allows live connections to be migrated from one server to another cooperating server. When a server becomes overloaded or fails, it triggers the migration process, which makes the client reconnect to a better-performing server replica. A set of APIs is provided for the server application to support state transfer between server replicas.
However, like the Fine-Grained Failover System, M-TCP extends both the client-side and the server-side TCP implementations to accomplish dynamic connection migration. It is difficult to deploy the TCP extension to all the clients interacting with a service.
3.2.3 TCP Splicing
Figure 3 TCP Splicing
In this approach [17], a virtual TCP connection is composed of two physical connections: a client-proxy connection and a proxy-server connection. As shown in Figure 3, all the clients connect to a proxy, which receives the client requests and dispatches them to the back-end servers. The proxy records the IP addresses, port numbers, and TCP sequence numbers of both the client-proxy and proxy-server connections. If the server crashes, the proxy can establish a new connection with another server and ensure that the new connection is consistent with the client’s state. This approach is transparent to both the client and the server.
However, it requires multiple server replicas and an extra proxy machine. In contrast, our approach does not require a proxy machine; because we address software fault tolerance, we perform recovery on a single node.
3.2.4 FT-TCP
We have described this approach in Chapter 2. FT-TCP [1] requires a primary server, a backup server, and a logger which can be co-located with the backup machine. If the primary crashes, the backup can take over the job of serving clients and replay the requests that are pending when the fault happens. As mentioned before, our approach is based on FT-TCP.
However, we aim at software fault tolerance and thus perform recovery on a single node instead.
We use a virtual machine monitor, Xen [4], to consolidate the primary and the backup servers into a single physical node. However, as shown in Figure 4, simply putting the servers into virtual machines is not practical, since the frequent communication between the two virtual machines degrades system performance significantly. Our framework solves this problem in two ways. First, we eliminate the primary-backup communication: all connection state is logged in the memory of the VMM. As a result, the backup server VM can be suspended and system performance improves. Second, we propose techniques to reduce the recovery time of FT-TCP.
Figure 4 FT-TCP on Xen
3.3 Software Maintenance
3.3.1 Devirtualization
This approach [12] provides a strategy to boot a virtual machine monitor and a new operating system dynamically when the system needs software maintenance. Once the software maintenance operation is finished on the newly booted operating system, the state of the old system is migrated to the new operating system. Then, the old operating system and the virtual machine monitor can be shut down. This approach performs well in normal operation since the VMM is present only when maintenance is needed. However, it focuses only on decreasing planned downtime and cannot reduce unplanned downtime.
3.4 Others
3.4.1 Checkpointing
Checkpointing [16][21][22][24][25] is a common technique for system recovery. It periodically saves the state of a running program to stable storage. When the system crashes, the last checkpointed state can be reloaded to recover the system. This approach has two drawbacks. First, it cannot solve the software aging problem, since the checkpointed state is aged rather than fresh. Even if the system can be recovered, the software may fail again immediately. Second, checkpointing usually results in a large performance overhead due to the large volume of state that needs to be stored.
3.4.2 Recovery-Oriented Computing (ROC)
The Recovery-Oriented Computing project [20][3] differs from the mainstream of previous fault management approaches in that it concentrates on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF). ROC provides tools that help administrators locate faults.
After a fault occurs, it can rewind the system to a previous correct state, repair the fault, and replay the service. The purpose of ROC is different from ours. We focus on avoiding connection/service state loss, which is not addressed by ROC.
3.4.3 Xen
Xen [4] is an x86-based virtual machine monitor. It allows multiple commodity operating systems, such as Linux, BSD, and Windows XP, to be hosted on a physical machine simultaneously. As mentioned before, we implemented our framework on Xen. We chose it because it is open source and incurs little overhead.
CHAPTER 4
DESIGN AND IMPLEMENTATION
In general, the functionality exported by current operating systems has the following limitations with respect to the goal of zero-loss service restart.
• Existing systems usually do not provide any mechanisms to recover the state of a service application when it crashes due to software faults. Instead, the operating system usually kills the faulty processes, which causes the internal state of the processes (including the connections being served) to be lost.
• The connection state in TCP layer will be lost when a fault crashes the service application or the operating system. For the former, the operating system will clean the connection state of the service process. For the latter, the system will be rebooted and all the system information will be lost.
• When upgrading a system, the administrator has to turn off the service, which causes the service to become unavailable for a period of time.
In this thesis, we propose a framework to overcome the above limitations.
The rest of this chapter is organized as follows. Section 4.1 explains the system components in the framework. Section 4.2 describes the proposed fault recovery technique and explains how to recover a service when a fault occurs. Section 4.3 describes the proposed online maintenance technique and explains how to avoid the downtime caused by system maintenance.
4.1 System Components
Figure 5 Our System Components
As shown in Figure 5, in order to achieve the goals mentioned above, we implement our framework in both the operating system kernel (i.e., Linux) and the virtual machine monitor (i.e., Xen). The former part is called OS layer Zero-loss Subsystem (OZS) while the latter part is called VMM layer Zero-loss Subsystem (VZS).
The major components of the framework are the protocol manager, the health monitor, and the recovery manager. In addition to these components, we enhance FT-TCP to reduce the recovery time and provide an API for service designers to develop fault-tolerant services. Moreover, the framework provides system calls that let administrators control the backup server and the service migration.
4.2 Fault Recovery
Briefly speaking, we use four techniques to achieve the goal of fault recovery. First, we develop a protocol to create, suspend, and resume the backup server. During normal operation, the backup server is suspended so that it does not contend for CPU resources with the primary server. Once the primary fails, the backup server is resumed to take over the job of the primary. Second, we provide a log buffer in the VMM, which allows us to store the connection state without communicating with the backup server. Third, we provide a fault detection mechanism, which detects application or operating system faults and then triggers the recovery job. Finally, we provide a recovery mechanism to recover the service state.
Figure 6. An Overview of Fault Recovery
Before describing our fault recovery techniques in detail, we give a brief overview of the recovery flow, which is shown in Figure 6. Before starting an Internet service, the administrator starts a backup server, including the backup OS and the service application. Then, in order to leave all system resources to the primary server, the backup server releases the resources it holds, such as CPU time. The primary server then operates normally and logs the connection information in the Virtual Machine Monitor (VMM). When a fault is detected, the VMM wakes up the backup server and recovers the service state so that the system can continue to provide the service.
In the following, we will describe the details of the techniques. Section 4.2.1 describes the way to boot a second OS instance and release the system resource used by the second OS instance. The flow of logging connection state is presented in Section 4.2.2. Section 4.2.3 describes the fault detection mechanism, and the recovery flow is presented in Section 4.2.4.
4.2.1 Backup Server Boot-up
Figure 7 The Flow of Booting a Backup Server
We define a protocol to manage the boot-up of the backup server. The protocol involves the VMM and three domains (control, primary, and backup), which implement the protocol using the API provided by the framework, as listed in Table 1.
Table 1 System Calls Provided by OZS
Figure 7 shows the flow of booting up a backup server. Originally, Xen allows only the control domain to boot up other domains. To enable an authorized primary server to boot up its backup, we allow the administrator to register the primary servers that have the right to do so. Specifically, the administrator registers, in advance, an entry in the backup-grant table for each primary server that has this right. The table is stored in the VMM and managed by the protocol manager, and the registration is done by calling the sys_ins_auth() system call in the control domain. When a primary server boots a backup server, the protocol manager checks whether the primary server holds the grant.
The primary server calls the sys_boot_backup_server() system call to ask Xen to create the backup. As mentioned above, the protocol manager checks whether the primary server is authorized to boot its backup. If it is, the protocol manager asks Xen to create the backup domain.
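The grant check can be pictured with the following minimal sketch. The class, method signatures, and domain names are hypothetical stand-ins that only mirror the roles of sys_ins_auth() and sys_boot_backup_server(); the real arguments of those system calls are not specified here.

```python
CONTROL_DOMAIN = "control"  # hypothetical name for the control domain


class ProtocolManagerSketch:
    """Toy model of the backup-grant table kept in the VMM."""

    def __init__(self):
        self._backup_grant_table = set()
        self.created_backups = []

    def ins_auth(self, caller_domain: str, primary_domain: str) -> None:
        """Register a primary server; only the control domain may do this."""
        if caller_domain != CONTROL_DOMAIN:
            raise PermissionError("only the control domain may register grants")
        self._backup_grant_table.add(primary_domain)

    def boot_backup_server(self, primary_domain: str) -> bool:
        """Create a backup only if the calling primary holds a grant."""
        if primary_domain not in self._backup_grant_table:
            return False
        # Stand-in for asking the VMM to create the backup domain.
        self.created_backups.append(primary_domain + "-backup")
        return True
```

A primary server that was never registered is refused, while a registered one gets its backup domain created.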
Originally, Xen gives a unique IP address to each guest OS so that each domain can communicate with external machines. This results in a longer recovery time, since the backup server has to take over the IP address of the primary server when the latter crashes. Thus, we provide a sys_change_backup_ip() system call to allow the primary and backup servers to share a single IP address. When the system call is invoked by the primary server, a signal is sent to the backup server through the VZS, and the backup server obtains the primary's IP address from the VZS and changes its own IP address accordingly. The IP address change is performed by a user-level task that invokes the ifconfig shell command.
After the IP address is changed, the backup server should release its CPU time so that it does not affect the performance of the primary server. To do so, the primary server calls the sys_suspend_backup() system call. When the system call is invoked, Xen removes the backup server from its run queue.
From the above description we can see that, although the system calls are implemented in the OZS, most of them require cooperation from the VZS. The communication between OZS and VZS is through hypercalls and events.
4.2.2 Connection State Logging
FT-TCP provides a log buffer to record the connection state of the primary server. When the primary server crashes, the backup server uses the data in the log buffer to recover the system. In our design, we also provide a log buffer that does not lose data even when the primary server crashes. We use a memory area of the primary server as the log buffer. During the recovery period, the backup server remaps the log buffer into its virtual address space and recovers the service state accordingly. In the following, we describe how the log buffer is implemented in our framework.
To let guest operating systems manage memory conveniently, Xen gives each guest OS the illusion of a contiguous range of physical addresses. However, these physical addresses are not real machine addresses, which raises two problems. First, as mentioned above, the backup server has to map the log buffer into its virtual address space. This mapping requires the starting machine address of the log buffer. However, a guest OS does not manage machine addresses directly. Thus, we look up the page table of the guest OS, which is updated by Xen, to get the machine address of the log buffer. Once the address is obtained, the OZS issues a hypercall to Xen to register the address. As a result, the backup server can get the machine address of the log buffer during the recovery period.
Second, if a primary server crashes, its memory area (including the log buffer) is released by Xen. To avoid releasing the memory before the service is recovered, we increase the reference count that corresponds to the primary server by one after booting the primary. After the service is recovered, the reference count is decreased by one, and the resources held by the primary server can be released.
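The interplay between the registered machine address and the extra reference can be sketched as below. This is a simplified model rather than Xen's actual bookkeeping: the classes and methods are hypothetical, and it assumes that a domain's memory is freed when its reference count drops to zero.

```python
class DomainMemorySketch:
    """Toy model: a domain's memory is released when its refcount reaches zero."""

    def __init__(self, name: str):
        self.name = name
        self.refcount = 1       # reference held by the running domain itself
        self.released = False

    def get(self) -> None:
        self.refcount += 1

    def put(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self.released = True


class LogBufferRegistrySketch:
    """Toy model of the VMM-side record of each primary's log buffer address."""

    def __init__(self):
        self._machine_addr = {}

    def pin_primary(self, primary: DomainMemorySketch) -> None:
        primary.get()           # extra reference taken after booting the primary

    def register(self, primary_name: str, machine_addr: int) -> None:
        self._machine_addr[primary_name] = machine_addr  # stand-in for the hypercall

    def lookup(self, primary_name: str) -> int:
        return self._machine_addr[primary_name]

    def recovery_done(self, primary: DomainMemorySketch) -> None:
        self._machine_addr.pop(primary.name, None)
        primary.put()           # drop the extra reference; memory may now be freed
```

In this model a crash of the primary drops only the domain's own reference, so the log buffer survives until recovery_done() releases the extra one.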
4.2.3 Fault Detection
Software faults, which cause the system to become unavailable, can occur in service applications and in the operating system. In the following, we describe how to detect these faults.
Figure 8 Detecting Application Faults
When a fault occurs in an application, the kernel usually invokes the do_exit() function to kill the application process. As shown in Figure 8(a), two paths lead to the invocation of do_exit(). In the first, the application detects the fault itself and calls the sys_exit() system call, which in turn calls do_exit(). In the second, the kernel detects the application fault and sends a signal to kill the application process. In this case, the kernel calls do_exit() through sig_exit(). In principle, we could detect the faults by intercepting do_exit() itself through kernel binary instrumentation. However, such callee-based instrumentation requires more effort. Therefore, we use a caller-based instrumentation approach instead. As shown in Figure 8(b), the health monitor intercepts the sys_exit() system call and the sig_exit() function, which only requires modifying the destination addresses of two jump instructions.
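The idea of redirecting two jump destinations can be mimicked in a few lines. The sketch below is purely illustrative and does not touch a real kernel: the jump slots are modelled as dictionary entries, and the slot names and the notify callback are hypothetical.

```python
class KernelSketch:
    """Toy kernel whose two exit paths are reached through patchable jump slots."""

    def __init__(self):
        self.trace = []
        # Each slot models the destination address of one jump instruction.
        self.jump_slots = {
            "exit_syscall": self.sys_exit,
            "signal_kill": self.sig_exit,
        }

    def do_exit(self, pid: int) -> None:
        self.trace.append(("do_exit", pid))

    def sys_exit(self, pid: int) -> None:
        self.do_exit(pid)

    def sig_exit(self, pid: int) -> None:
        self.do_exit(pid)

    def jump(self, slot: str, pid: int) -> None:
        self.jump_slots[slot](pid)


def install_health_monitor(kernel: KernelSketch, notify) -> None:
    """Redirect both jump slots to a hook that reports the fault, then continues."""
    for slot, original in list(kernel.jump_slots.items()):
        def hooked(pid, _original=original, _slot=slot):
            notify(_slot, pid)   # the health monitor learns about the exit
            _original(pid)       # then the original exit path runs unchanged
        kernel.jump_slots[slot] = hooked
```

Only the two slots are modified; do_exit() and the bodies of sys_exit() and sig_exit() stay as they were, which is the appeal of the caller-based approach.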
Figure 9 Detect Kernel Fault
In addition to application faults, operating system faults may also occur. To detect such faults, we insert a heartbeat generator in the primary server domain and a heartbeat checker in Xen. At each timer interrupt, the generator sends a heartbeat to Xen by incrementing a heartbeat counter variable that is shared by the primary server domain and Xen. The checker examines the variable at each timer interrupt to detect operating system faults. If the value remains the same for two timer interrupt periods, the operating system is regarded as failed, and the checker notifies the recovery manager to recover the system. It is worth noting that the heartbeat mechanism is based on shared memory rather than hypercalls, and thus avoids the overhead of frequent privilege mode crossings.
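The heartbeat scheme can be summarized by the following sketch. The class names and the callback are hypothetical, and a plain Python object stands in for the memory shared between the primary server domain and Xen.

```python
class SharedHeartbeat:
    """Stand-in for the counter variable shared by the primary domain and the VMM."""

    def __init__(self):
        self.counter = 0


class HeartbeatGenerator:
    def __init__(self, shared: SharedHeartbeat):
        self._shared = shared

    def on_timer_interrupt(self) -> None:
        self._shared.counter += 1  # a heartbeat is just an increment, no hypercall


class HeartbeatChecker:
    def __init__(self, shared: SharedHeartbeat, notify_recovery_manager):
        self._shared = shared
        self._notify = notify_recovery_manager
        self._last_seen = shared.counter
        self._stale_periods = 0
        self._reported = False

    def on_timer_interrupt(self) -> None:
        current = self._shared.counter
        if current != self._last_seen:
            self._last_seen = current
            self._stale_periods = 0
            return
        self._stale_periods += 1
        # Unchanged for two timer periods: treat the guest OS as failed.
        if self._stale_periods >= 2 and not self._reported:
            self._reported = True
            self._notify()
```

As long as the generator keeps incrementing the counter, the checker stays silent; once the counter stops moving for two consecutive checks, the recovery manager is notified exactly once.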
4.2.4 Recovery Flow
Figure 10 Recovery Protocol
When a fault is detected, the recovery manager follows the recovery protocol to recover the system. Figure 10 illustrates the recovery protocol, which is divided into three steps. First, the recovery manager must change the network path so that incoming packets that were originally delivered to the primary server are now delivered to the backup server. Xen stores IP-to-domain mappings for each domain (i.e., in the net_schedule_list list) in order to perform packet delivery, so the network path can be changed simply by updating the mapping that corresponds to the IP address of the backup server. Second, the recovery manager must wake up the backup server so that it can take over the job of the primary server. Third, the recovery manager must send a signal to notify the backup server to recover the system. On receiving the signal, the kernel subsystem in the backup server obtains the machine address of the log buffer through a hypercall, remaps the log buffer, and then executes the FT-TCP recovery flow.
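The three recovery steps can be sketched as a short sequence. Everything here is a simplified stand-in: a dictionary replaces Xen's IP-to-domain mapping, the class and function names are hypothetical, and the log buffer lookup is a callback instead of a hypercall.

```python
class BackupDomainSketch:
    """Toy model of the suspended backup server domain."""

    def __init__(self, name: str, lookup_log_buffer):
        self.name = name
        self.suspended = True
        self.recovered_from = None
        self._lookup_log_buffer = lookup_log_buffer

    def resume(self) -> None:
        self.suspended = False

    def on_recover_signal(self) -> None:
        if self.suspended:
            raise RuntimeError("backup must be resumed before it can handle the signal")
        # Stand-in for: obtain the machine address, remap the buffer, run FT-TCP recovery.
        self.recovered_from = self._lookup_log_buffer()


def run_recovery(ip_to_domain: dict, service_ip: str, backup: BackupDomainSketch) -> list:
    """Execute the three protocol steps in order and return the steps taken."""
    steps = []
    ip_to_domain[service_ip] = backup.name   # step 1: switch the network path
    steps.append("switch_network_path")
    backup.resume()                          # step 2: wake up the backup server
    steps.append("wake_backup")
    backup.on_recover_signal()               # step 3: signal the backup to recover
    steps.append("signal_backup")
    return steps
```

The ordering matters in this model: the signal handler refuses to run while the backup is still suspended, which is why the wake-up precedes the signal.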