DESIGN - Dynamically Replacing Device Drivers to Improve Operating System Availability

In this section, we elaborate on how the system survives driver faults.

When a fault in the driver is detected, the recovery mechanism is triggered. Figure 1 gives an overview of the recovery process. In brief, we remove the faulty driver, undo the changes it caused, insert the new driver, reconfigure it, and retry the original function in the new driver.

Figure 1. Overview of the Recovery Process

Figure 2. Architecture Overview

Figure 2 illustrates the components of nDriver. If a fault occurs, it is detected by the fault detector. Since the faulty driver may have changed the system state, we must remove the faulty driver and undo its changes. This is performed by the undo manager, which records all the kernel functions invoked by the driver and undoes them when the driver is removed. The undo manager is also responsible for inserting the new driver and asking the configuration manager to reconfigure it. After the reconfiguration, all external references to the removed driver must be redirected to the new driver to avoid dangling references.

In the following sections, we describe the design of the nDriver framework. First, we present the fault detection approaches, followed by a description of how to keep the system state correct and consistent after a fault occurs. Then, we present the approach for solving the problem of dangling references. Finally, we describe the details of the recovery process.

2.1 Fault Detection

The fault detector is responsible for detecting exception faults and blocking faults. An exception fault occurs for reasons such as accessing the NULL page (i.e., the first page of the virtual address space), dividing an operand by zero, or executing an invalid opcode. To detect such faults, we replace the kernel exception handlers (such as the page-fault and divide-by-zero handlers) with our own. Therefore, a CPU exception triggers our exception handler, which then invokes the undo manager to recover from the fault.

Besides exception faults, a faulty driver may cause system hangs (i.e., blocking faults), which make the system unresponsive. Blocking faults usually result from careless driver design, such as entering an infinite loop or trying to acquire a spinlock that is held by another blocked kernel thread. To recover from such faults, we use a timeout-based approach. Before executing a driver function, we set up a software timer to measure the time it takes to execute the driver function. If the driver function occupies the CPU for longer than its timeout value, it is regarded as a faulty function, and the timeout handler is triggered to recover from the fault. The execution time is accounted for through timer interrupts, which occur every 10 ms. Although the timeout-based approach is straightforward, two issues must be addressed to make it an effective technique for preventing driver hangs.

The first issue is how to determine the timeout value of a driver function. Because the execution time of different driver functions varies, we cannot use a fixed timeout value for all driver functions. Instead, the timeout value of a driver function should be set to its average execution time plus a guard time. Note that the timeout values are not required to be highly accurate. The 10-ms granularity is accurate enough for detecting blocking faults.
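The timeout rule above (average execution time plus a guard time, counted in timer ticks) can be sketched as follows. This is a minimal user-space illustration, not the nDriver implementation: the structure fn_timer and the functions fn_enter, fn_exit, fn_tick, and fn_timeout_ticks are hypothetical names, and the running average is computed as a plain total divided by the number of runs, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-function bookkeeping; all names are illustrative.
 * One tick corresponds to one timer interrupt (10 ms in the text). */
struct fn_timer {
    unsigned long total_ticks; /* sum of observed execution times, in ticks */
    unsigned long runs;        /* number of completed invocations */
    unsigned long guard_ticks; /* guard time added on top of the average */
    unsigned long elapsed;     /* ticks consumed by the current invocation */
    bool running;              /* true while the driver function executes */
};

/* Timeout = average execution time + guard time, in ticks. */
unsigned long fn_timeout_ticks(const struct fn_timer *t)
{
    unsigned long avg = t->runs ? t->total_ticks / t->runs : 0;
    return avg + t->guard_ticks;
}

/* Called just before the driver function is entered. */
void fn_enter(struct fn_timer *t)
{
    t->elapsed = 0;
    t->running = true;
}

/* Called when the driver function returns normally; updates the average. */
void fn_exit(struct fn_timer *t)
{
    assert(t->running);
    t->total_ticks += t->elapsed;
    t->runs++;
    t->running = false;
}

/* Called on every timer interrupt. Returns true when the running
 * function has exceeded its timeout and should be treated as faulty. */
bool fn_tick(struct fn_timer *t)
{
    if (!t->running)
        return false;
    t->elapsed++;
    return t->elapsed > fn_timeout_ticks(t);
}
```

Because the check runs only at tick boundaries, the detection latency is at most one tick beyond the timeout, which matches the observation that 10-ms granularity is sufficient.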

The second issue is how to keep the software timer effective when a driver function disables interrupts. This situation is common, since many existing drivers disable interrupts for synchronization. To solve this problem, we replace the original interrupt-disabling and interrupt-enabling functions, namely cli() and sti(), as well as the timer interrupt handler. Instead of disabling interrupts at the CPU, the new cli() masks all interrupts except the timer interrupt. In this way, our software timer still works after cli() is called. Note that our timer interrupt handler does not invoke the original timer interrupt handler while interrupts are disabled. This preserves the interrupt-disabled semantics.
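The replacement of cli() and sti() can be illustrated with a small simulation. The sketch below models the interrupt controller as a 16-bit mask and is only an assumption-laden illustration: irq_state, soft_cli, soft_sti, timer_irq, and the choice of IRQ 0 for the timer are hypothetical, and a real kernel would program the interrupt controller instead of updating a structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TIMER_IRQ 0u /* assumed IRQ line of the timer */

/* Simulated interrupt state; all names are illustrative. */
struct irq_state {
    uint16_t mask;            /* bit n set = IRQ n is masked */
    uint16_t saved_mask;      /* mask in effect before soft_cli() */
    bool logically_disabled;  /* what the driver believes after cli() */
    unsigned watchdog_ticks;  /* ticks seen by the fault detector */
    unsigned kernel_ticks;    /* ticks forwarded to the original handler */
};

/* Replacement for cli(): mask every line except the timer. */
void soft_cli(struct irq_state *s)
{
    if (!s->logically_disabled) {
        s->saved_mask = s->mask;
        s->logically_disabled = true;
    }
    s->mask = (uint16_t)~(1u << TIMER_IRQ);
}

/* Replacement for sti(): restore the mask saved by soft_cli(). */
void soft_sti(struct irq_state *s)
{
    if (s->logically_disabled) {
        s->mask = s->saved_mask;
        s->logically_disabled = false;
    }
}

/* Replacement timer handler: always feeds the fault detector, but
 * forwards to the original handler only when interrupts are logically
 * enabled, so the driver still observes interrupt-disabled semantics. */
void timer_irq(struct irq_state *s)
{
    if (s->mask & (1u << TIMER_IRQ))
        return; /* timer line masked: the interrupt is not delivered */
    s->watchdog_ticks++;
    if (!s->logically_disabled)
        s->kernel_ticks++;
}
```

The key design point is the split between the two counters: the fault detector keeps counting during a critical section, while the rest of the kernel sees no timer activity until soft_sti() is called.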

2.2 State Maintenance

We divide the system state that the driver may modify during its execution into driver state, kernel state, and driver requests. The driver state is the local state of the device driver. The kernel state is the global kernel state that may be changed by the driver. The driver requests are the requests currently being processed by the driver and the corresponding device. Because a fault may happen at any time during the execution of the driver code, we must keep the state correct and consistent after recovery. During the recovery period, we undo the changes the driver made to the kernel state. We discard the driver state and rebuild it from scratch. We record the driver requests so that they can be re-issued to the new driver implementation after the recovery.

Generally, a driver changes the kernel state only through a few functions provided by the kernel. Such functions may request kernel-managed resources, register a new driver, or exchange information with the kernel. For example, the driver may request IRQs and I/O regions from the kernel. In order to undo the changes, we intercept the kernel functions called from the driver (i.e., callout functions) and record them in an action list. Each callout function in the list has a corresponding undo routine, which is invoked during the recovery process to undo the changes caused by the function.
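A minimal sketch of such an action list is given below. It is illustrative only: the names action_list, action_record, action_undo_all, and wrapped_request_irq are hypothetical, the IRQ allocation is simulated with a bitmask, and undoing the entries in reverse order of recording is our assumption (the text does not specify the order).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIONS 32

typedef void (*undo_fn)(void *arg);

/* One recorded callout: the undo routine and the argument it needs. */
struct action {
    undo_fn undo;
    void *arg;
};

struct action_list {
    struct action entries[MAX_ACTIONS];
    size_t count;
};

/* Record a callout. Returns 0 on success, -1 if the list is full. */
int action_record(struct action_list *l, undo_fn undo, void *arg)
{
    assert(undo != NULL);
    if (l->count == MAX_ACTIONS)
        return -1;
    l->entries[l->count].undo = undo;
    l->entries[l->count].arg = arg;
    l->count++;
    return 0;
}

/* Invoke the undo routine of every entry, newest first (assumed order). */
void action_undo_all(struct action_list *l)
{
    while (l->count > 0) {
        struct action *a = &l->entries[--l->count];
        a->undo(a->arg);
    }
}

/* Simulated kernel resource: bitmask of allocated IRQ lines. */
unsigned irq_in_use;

/* Undo routine paired with the wrapped IRQ request below. */
void undo_request_irq(void *arg)
{
    irq_in_use &= ~(1u << *(unsigned *)arg);
}

/* Stand-in for an intercepted callout: performs the (simulated)
 * allocation, then records how to undo it. */
int wrapped_request_irq(struct action_list *l, unsigned *irq)
{
    irq_in_use |= 1u << *irq;
    return action_record(l, undo_request_irq, irq);
}
```

Pairing each wrapper with exactly one undo routine keeps the recovery path simple: the undo manager only has to walk the list and never needs to know what each callout did.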

It is worth noting that a device driver invokes only a small subset of the kernel-provided functions, because the main purpose of a device driver is simply to drive the device. For example, a driver usually does not perform IPC operations, which are difficult to roll back¹. Thus, we focus on the set of functions that may be invoked by the driver, and implement their undo functions manually.

As mentioned above, we discard the driver state and rebuild it from scratch during the recovery period. The reasons are as follows. First, the driver state may be corrupted once a fault has occurred in the driver code. Second, different driver implementations may use different data structures, so the old driver state cannot be used directly by the new driver implementation. The new driver would therefore have to implement a state transfer function in order to reuse the old state. This implies that all driver implementations would need to be modified, which is not feasible. Moreover, it is impractical to implement a state transfer function for each pair of driver implementations.

For the driver requests, we back up all unfinished requests so that they are not lost when the driver fails. Each time the kernel sends a request to the driver, we make a copy of the request and insert the copy into a per-driver unfinished request list. When the request is finished, the copy is removed from the list. If a driver fails, all the requests in the list are re-issued to the new driver.
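The per-driver unfinished request list can be sketched as follows. This is a simplified illustration under stated assumptions: the request layout, the fixed-size array, and the names pending_add, pending_complete, pending_reissue, and log_request are all hypothetical. The sketch keeps copies in arrival order, which is our choice and not something the text requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16
#define REQ_PAYLOAD 32

/* Illustrative request; a real request would carry driver-specific data. */
struct request {
    int id;
    char payload[REQ_PAYLOAD];
};

/* Per-driver list of copies of unfinished requests, in arrival order. */
struct pending_list {
    struct request copies[MAX_PENDING];
    size_t count;
};

/* Called when the kernel sends a request to the driver: store a copy.
 * Returns 0 on success, -1 if the list is full. */
int pending_add(struct pending_list *p, const struct request *r)
{
    if (p->count == MAX_PENDING)
        return -1;
    p->copies[p->count++] = *r;
    return 0;
}

/* Called when a request finishes: drop its copy, keeping the order. */
void pending_complete(struct pending_list *p, int id)
{
    for (size_t i = 0; i < p->count; i++) {
        if (p->copies[i].id == id) {
            memmove(&p->copies[i], &p->copies[i + 1],
                    (p->count - i - 1) * sizeof p->copies[0]);
            p->count--;
            return;
        }
    }
}

typedef void (*issue_fn)(const struct request *r, void *ctx);

/* Called after recovery: re-issue every unfinished request to the new
 * driver through the given entry point. Returns the number re-issued. */
size_t pending_reissue(const struct pending_list *p, issue_fn issue, void *ctx)
{
    for (size_t i = 0; i < p->count; i++)
        issue(&p->copies[i], ctx);
    return p->count;
}

/* Stand-in for the new driver's request entry point: logs received ids. */
struct id_log {
    int ids[MAX_PENDING];
    size_t n;
};

void log_request(const struct request *r, void *ctx)
{
    struct id_log *log = ctx;
    log->ids[log->n++] = r->id;
}
```

The copies stay in the list after being re-issued, so they remain protected if the new driver fails as well; they are removed only when the corresponding request completes.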

¹ It is not enough to roll back an IPC operation by canceling or undoing it. The sent message may already have triggered the receiver to take corresponding actions, which are usually difficult to roll back.

2.3 External References

After the faulty driver is replaced with the new one, some external references (such as data or function pointers) still point to the data or functions of the original faulty driver. Therefore, we must update all the external references to point to the new implementation. Figure 3 illustrates an example. The structure net_device is used to represent a NIC device driver in Linux. During the recovery process, for instance, the faulty driver Faulty is removed and the new driver New is inserted and initialized. All external references to Faulty then become dangling pointers.

Soules et al. [23] proposed two approaches to solve this problem, backward reference and indirection, shown in Figure 3(a) and 3(b). In brief, the backward reference approach keeps track of all external references to Faulty and then updates all of them to point to New. The drawback of this approach is that the operating system must be modified to record all the external references. The indirection approach, shown in Figure 3(b), lets all the external references point to a single indirection pointer. If the target changes because of driver swapping, only the indirection pointer needs to be updated. This approach also requires modifying the existing operating system code, since the data type of all the external references must be changed (e.g., from net_device* to net_device**). In addition, it needs an extra dereference to access the target.

In nDriver, we take another approach to avoid modifying the existing operating system code. Figure 3(c) shows the approach. We add a placeholder that contains the target data. The placeholder is of the same type as the target data, and all the external references point to the placeholder. In the figure, the placeholder is initialized by copying the content of Faulty to it. During the recovery process, Faulty is removed and the placeholder is updated by copying the content of New to it. In this way, neither the maintenance of backward references nor the modification of the data type of the external references is needed.
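The placeholder approach can be illustrated with a short sketch. It is a simplified model, not the nDriver code: net_device_lite is a hypothetical stand-in for the Linux net_device structure with only two illustrative fields, and placeholder_load, faulty_xmit, and new_xmit are assumed names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for Linux's struct net_device; fields are
 * illustrative only. */
struct net_device_lite {
    const char *name;
    int (*xmit)(const void *buf, int len);
};

/* Transmit routine of the faulty driver: always fails. */
int faulty_xmit(const void *buf, int len)
{
    (void)buf;
    (void)len;
    return -1;
}

/* Transmit routine of the new driver: reports the bytes "sent". */
int new_xmit(const void *buf, int len)
{
    (void)buf;
    return len;
}

/* The placeholder has the same type as the target data. External
 * references point here, never at a driver's own structure. */
struct net_device_lite placeholder;

/* Copy a driver's structure into the placeholder. Used once at
 * initialization and again during recovery. */
void placeholder_load(const struct net_device_lite *driver)
{
    assert(driver != NULL);
    memcpy(&placeholder, driver, sizeof placeholder);
}
```

An external reference such as `struct net_device_lite *ref = &placeholder;` keeps its type and its value across a swap; only the content behind it changes when placeholder_load() is called with the new driver's structure.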

Figure 3. External Reference Redirection

2.4 Detailed Process of Recovery

Before executing a driver function, we initialize the fault detector and save the current system context. During the execution of the function, our recovery mechanism is triggered if an exception fault or a blocking fault is detected. Figure 4 shows the detailed recovery process. First, we undo the changes the driver has made to the global kernel state. Specifically, we call the undo routine of each entry in the action list. Second, we remove the code and the local state of the faulty driver. Third, we insert the new driver into the kernel and reset the hardware. Fourth, the previously issued configuration operations are issued again to the new driver in order to rebuild the driver state. This is achievable because all the configuration operations previously issued to the driver were intercepted and logged by the configuration manager. Fifth, we update the external references to point to the new driver by copying the content of the new driver state to the corresponding placeholder. Finally, we restore the system context and retry the originally failed function in the new driver.
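The save, restore, and retry steps can be sketched in user space with setjmp and longjmp. This is only an analogy for the kernel mechanism, under stated assumptions: guarded_call, report_fault, faulty_driver, and new_driver are hypothetical names, the five recovery steps are reduced to a comment, and the limit of one recovery per call is our addition to keep the example terminating.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef int (*driver_fn)(int arg);

jmp_buf saved_context; /* context saved before entering the driver */
int recoveries;        /* number of times the recovery path has run */

/* Hook called by the fault detector: abandon the driver function and
 * return to the context saved in guarded_call(). */
void report_fault(void)
{
    longjmp(saved_context, 1);
}

/* Stand-in for a faulty driver function: always triggers a fault. */
int faulty_driver(int arg)
{
    (void)arg;
    report_fault();
    return -1; /* not reached */
}

/* Stand-in for the same function in the new driver. */
int new_driver(int arg)
{
    return arg * 2;
}

/* Run a driver function under fault protection. On a fault, switch to
 * the replacement and retry the original call once; return -1 if the
 * replacement faults as well. */
int guarded_call(driver_fn current, driver_fn replacement, int arg)
{
    driver_fn volatile active = current;
    volatile int attempts = 0; /* volatile: modified across longjmp */

    assert(current != NULL && replacement != NULL);

    if (setjmp(saved_context) != 0) {
        /* Fault path. The real recovery would undo the action list,
         * remove the faulty driver, insert and reconfigure the new
         * driver, and refresh the placeholders at this point. */
        if (++attempts > 1)
            return -1;
        recoveries++;
        active = replacement;
    }
    return active(arg);
}
```

With these definitions, `guarded_call(faulty_driver, new_driver, 21)` takes the fault path once and then returns the result of new_driver for the same argument, which mirrors retrying the originally failed function in the new driver.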

It is worth noting that the new driver may be the same implementation as the old one. In this case, the new driver is just a fresh instance of that implementation. This kind of driver swapping can solve the problems of transient errors and driver aging [3]. The latter problem is solved because we discard the driver state and rebuild it from scratch. If there are multiple driver implementations for the device, the system can choose another implementation when one fails. This allows the system to survive not only the above two kinds of faults but also faults caused by driver bugs.

Figure 4. Detailed Process of Recovery
