RESTART FLOW - Design and Implementation of a Zero-Data-Loss Restart Architecture for Internet Services

The fully automatic and zero-data-loss restart is controlled by the restart manager.

Figure 3.8 shows the restart flow.


First of all, a fault of the service process is detected by the fault detection routine.

Then, the kernel performs the I/O channel keeping operation to prevent the I/O channels of the service process from being closed. For all the I/O channels, the kernel rewinds the read_pt in the kernel temporary space to the value of the destroy_pt.

A kernel thread is then created, which will eventually become the next generation of the failed service process. At this point, the faulty process can be terminated. The kernel thread then invokes the exec_usermode_helper() function to turn itself into a user-level process and execute the binary image of the service. As a result, a new generation of the service is started. Finally, the I/O channels are handed over to the new service generation by copying the kept pointer, which references the open file table, into the task control block of the new service generation.

3.4 EXPERIMENTAL RESULTS OF THE KERNEL SUPPORT

In this section, we present the experimental results of the kernel support. Since buffers in the kernel temporary space are allocated and deallocated on demand, a small buffer size increases the number of allocations and deallocations and degrades the performance of hread(). In this experiment, we measure the impact of the buffer size (i.e., DHR_BUFFER_SIZE) on the performance of hread(). We use a small test program that reads a file through hread() with different buffer sizes and records the resulting times in CPU ticks. Figure 3.9 shows the result. As shown in the figure, the performance improves as the buffer size increases. However, the improvement becomes marginal once the buffer is larger than 4096 bytes. As a result, we choose 4096 bytes as the buffer size in the current implementation.

[Figure 3.9: plot of hread() execution time in CPU ticks against buffer size in bytes (512, 1024, 2048, 4096, 6144, 8192, 10240), measured with a user buffer size of 512 bytes.]

Figure 3.9: Performance of Hread() under Different DHR_ONE_BUFFER_SIZE

Figure 3.10 compares the performance of the original read() and the hread() system calls under different user buffer sizes specified by the test program. The figure shows that when the user buffer size is smaller than 512 bytes, the overhead of hread() ranges from 36% to 46%, and that the overhead grows as the user buffer size increases. The largest part of the overhead is incurred in the copy_from_user() function, a standard function that the Linux kernel uses to copy data from a user-mode buffer. We use this function to copy the read data from the user buffer to the kernel temporary space. According to our measurements, its execution time grows rapidly as the data size increases. Although the overhead seems large, we still consider it acceptable because, for many Internet services such as web, FTP, and streaming services, non-storage-based files are usually read far less frequently than they are written. We will verify this in Section 5.2.1.

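[Figure 3.10: plot comparing the original read and the holding read, together with the overhead (%), against user buffer size in bytes (128, 256, 512, 1024, 2048, 4096, 8192), measured with DHR_ONE_BUFFER_SIZE = 4096. Data labels surviving from the plot: 724, 842, 1050, 1462, 2649, 5596.]

Figure 3.10: Performance Comparison of Read() and Hread() under Different User Buffer Sizes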

Besides hread(), we also measure the execution time of the other two system calls we added (i.e., reregist() and getreginfo()). Figure 3.11 shows the result. Both system calls require less than 2 us.

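[Figure 3.11: chart of the execution times of the two system calls; one data label, 1.873, survives from the plot.]

Figure 3.11: Time Complexity of Reregist() and Getreginfo() System Calls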

Finally, Table 3.1 gives a breakdown of the kernel execution time spent on starting a new service generation. The measured interval begins when the fault is detected and ends when the I/O channels are restored. The table shows that keeping and restoring the I/O channels is quite efficient (only about 7.4 us in total). Most of the time is spent on creating the kernel thread for the new service generation and executing the exec_usermode_helper() function.

Operation                      Execution Time
I/O channel keeping            7.294 us
Kernel thread creating         15.264 us
User-mode process executing    185.78 us
I/O channel restoring          0.125 us

Table 3.1: Kernel-level Execution Time for Restarting a New Service Generation
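As a quick arithmetic check of Table 3.1, the four components can be summed as follows. The symbols are labels introduced here for the four rows of the table; they do not appear in the original text.

```latex
\begin{align*}
t_{\text{keep}} + t_{\text{restore}} &= 7.294 + 0.125 = 7.419\ \mu\text{s} \\
t_{\text{total}} &= 7.294 + 15.264 + 185.78 + 0.125 = 208.463\ \mu\text{s} \\
\frac{t_{\text{thread}} + t_{\text{exec}}}{t_{\text{total}}} &= \frac{15.264 + 185.78}{208.463} \approx 0.964
\end{align*}
```

Under this reading, keeping and restoring the I/O channels account for about 7.4 us of roughly 208.5 us, while kernel thread creation and user-mode process execution together account for about 96% of the total.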

CHAPTER 4

PROGRAMMING GUIDELINES OF ZERO-LOSS RESTARTABLE SERVICE

The kernel support described in Chapter 3 provides a basic building block for a restartable environment. However, cooperation from the service developers is also needed to achieve the goal of zero-loss service restart. This cooperation involves three tasks. First, the service developers have to use the system calls described in Chapter 3 to register and retrieve the service information and to hold useful data in the kernel. Second, the developers should use a dedicated state storage to store the service state. Third, the service developers should follow the programming guidelines given in Section 4.2 to facilitate the recovery procedure during the service recovery phase.

In Section 4.1, we illustrate how to use the kernel-supported system calls described in the previous chapter. We also propose a model that allows two successive service generations to communicate through a dedicated storage. In Section 4.2, we describe the programming guidelines.

4.1 PROGRAMMING STYLE OF RESTARTABLE SERVICE

4.1.1 Using Hread() System Call

    void handle_new_connection()
    {
        int read_fd;
        char *buf = malloc(…);
        char *request = malloc(…);

        read_fd = socket(PF_INET, SOCK_STREAM, 0);
        bind(read_fd, …);
        listen(read_fd, …);
        ...
        int sz, count = 0;

        /* Read a full request; dlen = 0 keeps the data in the kernel temporary space */
        while (!read_full_request(request)) {
            if ((sz = hread(read_fd, buf, 10, 0)) > 0) {
                memcpy(request + count, buf, sz);   /* copy only the sz bytes actually read */
                count += sz;
            }
        }

        /* Really process the received data */
        process_rcv_data(request);

        /* After processing, delete what has been processed from the kernel temporary space */
        hread(read_fd, buf, 0, count);
        ...
        close(read_fd);   /* Close the socket */
    }

Figure 4.1: Example pseudo code of using data-holding read system call

In this section we describe how to use the hread() system call to avoid data loss when a fault crashes the service. Figure 4.1 shows the pseudo code for handling a request using the hread() system call. In the while loop, the program reads a full request from the socket read_fd. Since the data has not been processed yet, the fourth parameter (i.e., dlen) is set to 0 in order to keep the data in the kernel temporary space.

After the request is received, process_rcv_data() is invoked to process the request data and generate the result. Finally, the hread() system call is invoked again with the rlen parameter set to 0 and the dlen parameter set to the length of the request data. This invocation deletes the request from the kernel temporary space.

If a fault occurs before the end of the process_rcv_data() function, the request data still remains in the kernel temporary space. As we described in Section 3.3, the kernel rewinds the read pointer to the position of the destroy pointer when the service is restarted. Therefore, after the new generation starts, it can read the request from the kernel temporary space and process it. On the other hand, if the fault occurs between process_rcv_data() and the second hread() call, the new generation is able to learn that the result was already generated and therefore deletes the request from the kernel temporary space. To know that the result was generated, there must be a communication channel between successive generations, which is described in Section 4.1.4.
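The hold, delete, and rewind behaviour described above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the names hold_space, model_hread(), model_restart(), and model_receive() are hypothetical, the capacity is arbitrary, and the model assumes that the dlen deletion is applied before the rlen read (the text only uses calls in which one of the two is 0, so the order does not matter there).

```c
#include <stddef.h>
#include <string.h>

#define HOLD_CAPACITY 4096   /* arbitrary size for this model */

/* Hypothetical user-space model of the kernel temporary space of one channel. */
struct hold_space {
    char   data[HOLD_CAPACITY];
    size_t len;         /* bytes received from the peer so far */
    size_t read_pt;     /* next byte to hand to the service */
    size_t destroy_pt;  /* bytes before this offset have been deleted */
};

/* Model helper: simulate data arriving from the peer. */
static void model_receive(struct hold_space *hs, const char *msg)
{
    size_t n = strlen(msg);
    if (n > HOLD_CAPACITY - hs->len)
        n = HOLD_CAPACITY - hs->len;
    memcpy(hs->data + hs->len, msg, n);
    hs->len += n;
}

/* Model of hread(): delete up to dlen already-read bytes, then read up to
 * rlen bytes while keeping them held. Returns the number of bytes read. */
static size_t model_hread(struct hold_space *hs, char *buf,
                          size_t rlen, size_t dlen)
{
    size_t held = hs->read_pt - hs->destroy_pt;
    if (dlen > held)
        dlen = held;
    hs->destroy_pt += dlen;

    size_t avail = hs->len - hs->read_pt;
    if (rlen > avail)
        rlen = avail;
    memcpy(buf, hs->data + hs->read_pt, rlen);
    hs->read_pt += rlen;
    return rlen;
}

/* Model of the rewind performed when the service is restarted. */
static void model_restart(struct hold_space *hs)
{
    hs->read_pt = hs->destroy_pt;
}
```

In this model, a request that was read with dlen = 0 is returned again after model_restart(), whereas a request that was deleted with rlen = 0 and dlen equal to its length is not.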

4.1.2 Using Reregist() System Call

    #define CHILD_ID_INIT_VALUE 1

    int main(int argc, char *argv[])
    {
        int cid;

        /* Register the main process of the service (child_id = 0) */
        reregist("/usr/local/sbin/service", 0, argv, NULL);
        ...
        cid = fork();
        if (cid == 0) {   /* child process starts */
            /* Register the child process as child_id = 1 */
            reregist("", CHILD_ID_INIT_VALUE, argv, NULL);
            ...
        }
        else {            /* parent process */
            ...
        }
        ...
    }

Figure 4.2: Example pseudo code of using restart registration system call

The reregist() system call is used to register a restartable service. In most cases, the developer invokes reregist() during program initialization. Figure 4.2 shows the typical usage of the reregist() system call. In the figure, the first reregist() invocation means that the developer wants the main process of the service to be restartable, and the second invocation registers the child process as a restartable process.

Note that the child_id parameter differs between these two invocations. This allows the restarted process to know which child process (or main process) it is.

4.1.3 Using Getreginfo() System Call

    #include <stdlib.h>
    #include <registration.h>

    int is_restarted = 0;

    int main(int argc, char *argv[])
    {
        struct reg_info *reg_info = (struct reg_info *)malloc(sizeof(struct reg_info));

        /* Ask the kernel whether this process is an original or a restarted one */
        if (getreginfo(reg_info)) {
            is_restarted = reg_info->is_restarted;
        }

        if (is_restarted) {   /* recovery path */
            switch (reg_info->child_id) {
            case 0: break;
            case 1: goto child_process_1;
            ...
            }
            ...
        }
        else {                /* normal execution path */
            ...
        }
    }

Figure 4.3: Example pseudo code of using get registration information system call

Figure 4.3 shows an example of using the getreginfo() system call. The most important function of this system call is to tell whether the current process is an original or a restarted one. If it is restarted, the process should execute the recovery path; otherwise, it executes the normal path. Note that both paths must be programmed by the service developer, and that this system call should be called at the beginning of the program to determine the execution path of this generation. In addition, this system call returns the child_id, from which the current process learns the child identifier of its previous generation. The developer can therefore divide the recovery path into a number of recovery procedures, each responsible for recovering a child process or the main process. As a result, the new generation can execute the corresponding recovery procedure to recover its previous generation.
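The division of the recovery path into per-process recovery procedures can also be expressed with a dispatch table instead of goto labels. The following is a minimal sketch under assumptions: reg_info_model stands in for the real registration structure (the text names only its is_restarted and child_id fields), the three path functions are placeholders, and falling back to the normal path for an unknown child_id is a policy chosen for this sketch only.

```c
#include <stddef.h>

/* Stand-in for the registration information; only the two field names
 * come from the text, the layout is assumed. */
struct reg_info_model {
    int is_restarted;
    int child_id;
};

typedef const char *(*path_fn)(void);

/* Placeholder paths; a real service would do its work here. */
static const char *normal_path(void)    { return "normal"; }
static const char *recover_main(void)   { return "recover main"; }
static const char *recover_child1(void) { return "recover child 1"; }

/* One recovery procedure per registered child_id; index 0 is the main process. */
static const path_fn recovery_table[] = { recover_main, recover_child1 };

/* Choose the execution path of this generation. */
static path_fn select_path(const struct reg_info_model *info)
{
    size_t n = sizeof recovery_table / sizeof recovery_table[0];

    if (!info->is_restarted)
        return normal_path;
    if (info->child_id < 0 || (size_t)info->child_id >= n)
        return normal_path;   /* unknown id: assumption of this sketch */
    return recovery_table[info->child_id];
}
```

A caller would fill the structure from getreginfo() at program start and then invoke the returned function, for example select_path(&info)().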

4.1.4 Using Shared Memory for State Handover

To achieve the goal of zero-loss service restart, the developers should separate state from logic when developing the service. The service state has to be stored in a dedicated storage that stays alive across successive service generations. This allows the state to be handed over to the new generation. In our system, the dedicated storage is implemented with shared memory.

We use shared memory for the following reasons. First, the shared memory attached to a process is available until the process detaches it. Therefore, if a service does not detach the shared memory before it terminates, the next generation will be able to attach the same shared memory area. Second, shared memory is efficient, so storing the service state in it has little performance impact.

However, there is a problem with using shared memory as the state store: the shared memory should preferably be attached at the same address in two successive generations. This is because the service state stored in the shared memory may contain pointers that point to data in the shared memory. If the shared memory area cannot be attached at the same address, the new generation must adjust the pointers in the shared memory. For example, if we store a linked list in the shared memory, the values of all the links must be adjusted when the new generation attaches the shared memory at a different address.

Therefore, the developer should reduce the use of pointers for maintaining the service state. If some pointers are still necessary, the developers should write a procedure to adjust them. To make the adjustment possible, the application can store the attach address in a fixed location of the shared memory when it attaches the shared memory. The new generation can then calculate the difference between the old and the new attach addresses and update all the pointers in the shared memory accordingly.
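The pointer adjustment described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the header layout, the names shm_header, rebase(), adjust_pointers(), and demo_relocate(), and the area size are all hypothetical, and two heap buffers stand in for one shared memory area attached at two different addresses.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define AREA_SIZE 256   /* arbitrary size for this sketch */

struct node {
    int          value;
    struct node *next;      /* absolute pointer into the shared area */
};

/* Hypothetical header kept at a fixed location (offset 0) of the area. */
struct shm_header {
    void        *attached_at;  /* address used by the previous generation */
    struct node *head;         /* first list node, stored in the same area */
};

/* Shift one stored pointer by the difference between new and old base. */
static struct node *rebase(struct node *p, uintptr_t delta)
{
    return p ? (struct node *)((uintptr_t)p + delta) : NULL;
}

/* Called by the new generation after attaching the area at new_base. */
static void adjust_pointers(void *new_base)
{
    struct shm_header *hdr = new_base;
    uintptr_t delta = (uintptr_t)new_base - (uintptr_t)hdr->attached_at;

    hdr->head = rebase(hdr->head, delta);
    for (struct node *n = hdr->head; n != NULL; n = n->next)
        n->next = rebase(n->next, delta);
    hdr->attached_at = new_base;   /* remember the address for next time */
}

/* Demo: build a two-node list, move the raw bytes to another address,
 * adjust the pointers, and return the sum of the list values. */
static int demo_relocate(void)
{
    unsigned char *old_area = calloc(1, AREA_SIZE);
    unsigned char *new_area = calloc(1, AREA_SIZE);
    int sum = -1;

    if (old_area != NULL && new_area != NULL) {
        struct shm_header *hdr = (struct shm_header *)old_area;
        struct node *n1 = (struct node *)(old_area + sizeof *hdr);
        struct node *n2 = n1 + 1;

        n1->value = 10; n1->next = n2;
        n2->value = 32; n2->next = NULL;
        hdr->attached_at = old_area;
        hdr->head = n1;

        memcpy(new_area, old_area, AREA_SIZE);  /* "attach" at a new address */
        memset(old_area, 0, AREA_SIZE);         /* the old mapping is gone */
        adjust_pointers(new_area);

        sum = 0;
        hdr = (struct shm_header *)new_area;
        for (struct node *n = hdr->head; n != NULL; n = n->next)
            sum += n->value;
    }
    free(old_area);
    free(new_area);
    return sum;
}
```

An alternative design that avoids the adjustment pass altogether is to store offsets from the start of the area instead of absolute pointers; that is a different technique from the one described in the text.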

4.2 PROGRAMMING GUIDELINES FOR ZERO-LOSS RESTART

In this section, we propose several programming guidelines that make the service operate at its logical level and hence become zero-loss between generations.

Avoid registering handlers for the signals that cause abnormal termination of the process, such as SIGSEGV, SIGTRAP, and SIGABRT. This ensures that the abnormal termination of the faulty service is caught by our fault detector instead of a programmer-specified function.

Abstract the state variables of the service. The state variables contain all the necessary information that represents the service state during the service execution. A piece of state information should be included in the state variables if the restarted service cannot reconstruct the whole service state without it. The state variables need to be stored in the dedicated storage and updated when necessary. This allows the service to be executed as a stateless client of the state storage.

Design recovery procedures for service recovery. The recovery procedures are executed during the service recovery phase. In a recovery procedure, the service must reconstruct its state from the state variables. It then tries to finish the jobs of the previous generation that were ongoing when the fault occurred.

Divide the execution into several stages. This can reduce the recovery time. When a stage is finished, the service records the progress and starts the next stage. When a service restarts, the next generation can go through the unfinished stage as in its normal path. The recovery time is reduced since the jobs in the already finished stages do not need to be performed again. For example, the page request processing in a web server can be divided into four stages: request reading, request parsing, response header sending, and response body sending. For large responses, the last stage can be further divided into sub-stages. When the service restarts, the new generation can obtain the processing progress of the request. If, for example, the first two stages are finished, the new generation can start by sending the response header.
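The staged execution in the last guideline can be sketched as follows, using the four web-server stages from the example. The names req_stage, req_progress, and run_stages() are hypothetical, the fault is only simulated, and in a real service the progress record would live in the dedicated state storage so that it survives the restart.

```c
/* The four stages from the web-server example, plus an end marker. */
enum req_stage {
    STAGE_READ, STAGE_PARSE, STAGE_SEND_HEADER, STAGE_SEND_BODY, STAGE_DONE
};

/* Hypothetical progress record for one request. */
struct req_progress {
    enum req_stage next;     /* first stage that has not finished yet */
    int runs[STAGE_DONE];    /* how often each stage ran (demo only) */
};

/* Run stages starting at p->next. A fault is simulated by stopping before
 * stage fault_at completes; pass STAGE_DONE for a fault-free run.
 * Returns 1 if the request finished, 0 if the simulated fault hit. */
static int run_stages(struct req_progress *p, enum req_stage fault_at)
{
    while (p->next != STAGE_DONE) {
        enum req_stage s = p->next;

        p->runs[s]++;                        /* the stage's work happens here */
        if (s == fault_at)
            return 0;                        /* crash: progress not recorded */
        p->next = (enum req_stage)(s + 1);   /* record progress, move on */
    }
    return 1;
}
```

If the first call is interrupted in STAGE_SEND_HEADER, a second call on the same record resumes at that stage, so reading and parsing run only once while the unfinished stage is executed again.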

CHAPTER 5

CASE STUDY: ZERO-LOSS RESTARTABLE THTTPD WEB SERVER

In this chapter, we present a case study that applies all the operations and programming guidelines described in Chapters 3 and 4 to a well-known tiny web server, thttpd-2.25b [1], in order to make it zero-loss restartable. We chose a web server as the target because of its popularity on the Internet. Thttpd is an open source web server with simple and well-organized code.

In the following, we briefly describe the design and execution flow of the original thttpd in Section 5.1.1. In Section 5.1.2, we present how we modified the original thttpd to achieve the goal of zero-loss restart. Finally, we analyze the experimental results in Section 5.2.

5.1 ZERO-LOSS RESTARTABLE THTTPD

5.1.1 Original Thttpd

In this section, we describe the execution flow of thttpd. Thttpd is a single-process implementation of the HTTP 1.1 protocol [8], and it divides the procedure of handling a request into two stages, namely reading and sending. Figure 5.1 shows the stage flow of request handling in thttpd.

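[Figure 5.1: stage diagram. From Start, an incoming connection enters the Reading stage; once the response is generated, the connection moves to the Sending stage; when sending finishes, it reaches End.]

Figure 5.1: Stage Flow of Request Handling in Thttpd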

When thttpd starts, it creates, binds, and listens on a TCP socket. Then, it probes for incoming requests by performing select() on the TCP socket. If there is no request, the server keeps on probing. When the server gets a request, it creates a connection entity structure for this request, and the connection enters the reading stage. The connection entity structure contains almost all the information needed to construct the complete set of state variables of thttpd. This will be discussed in the next section. In the reading stage, the server reads the request from the client, parses the request, and generates the corresponding response. After the response is generated, the connection enters the sending stage. In this stage, the server writes the response out to the client.

Note that connections are processed concurrently, and different connections may be in different stages. Moreover, request probing and processing are also handled concurrently. It is worth mentioning that the reading and sending stages in thttpd are divided into more fine-grained sub-stages. In the reading stage, the server does not perform blocking read operations. Therefore, several read operations may be needed to get the full request, and the server tries to handle other requests between two successive read operations. Similarly, in the sending stage, the server writes a part of the response and then handles other requests or accepts new requests.
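The point that several non-blocking reads may be needed to obtain a full request can be illustrated with standard POSIX calls. The sketch below is not thttpd's code: partial_request, read_step(), and demo_partial_reads() are hypothetical names, a pipe stands in for the client socket, and the end of the request is detected simply by looking for the blank line that terminates an HTTP header.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical accumulator for a request that arrives in pieces. */
struct partial_request {
    char   data[1024];
    size_t len;
};

/* One non-blocking read step. Returns 1 when the blank line ending the
 * request header has arrived, 0 when more data is needed (the caller can
 * serve other connections meanwhile), -1 on error, EOF, or overflow. */
static int read_step(int fd, struct partial_request *r)
{
    size_t room = sizeof r->data - 1 - r->len;
    ssize_t n;

    if (room == 0)
        return -1;
    n = read(fd, r->data + r->len, room);
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
    if (n == 0)
        return -1;
    r->len += (size_t)n;
    r->data[r->len] = '\0';
    return strstr(r->data, "\r\n\r\n") != NULL;
}

/* Demo: the request arrives in two pieces with an idle poll in between.
 * Returns 1 if the three steps behave as described, 0 or -1 otherwise. */
static int demo_partial_reads(void)
{
    struct partial_request r = { .len = 0 };
    int fds[2], s1, s2, s3;

    if (pipe(fds) != 0)
        return -1;
    if (fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL) | O_NONBLOCK) != 0)
        return -1;

    if (write(fds[1], "GET / HT", 8) != 8)
        return -1;
    s1 = read_step(fds[0], &r);   /* partial request */
    s2 = read_step(fds[0], &r);   /* nothing to read yet */
    if (write(fds[1], "TP/1.1\r\n\r\n", 10) != 10)
        return -1;
    s3 = read_step(fds[0], &r);   /* request complete */

    close(fds[0]);
    close(fds[1]);
    return s1 == 0 && s2 == 0 && s3 == 1;
}
```

In a select()-based server, a return value of 0 from such a step would simply send the server back to its main loop until the descriptor becomes readable again.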

5.1.2 Zero-Loss Restartable Thttpd

In this section, we describe how we modified thttpd to make it zero-loss restartable. The modified version is called ZLR_thttpd. First of all, we replaced all socket read operations in thttpd with hread() and applied reregist() and getreginfo() to thttpd. We do not describe these modifications in detail because they simply follow what we described in Chapter 4.

The other modifications are as follows. First, we identified the state variables of thttpd. In the previous section, we mentioned that the connection entity structure is used to represent a connection in thttpd and that it contains all the information of an in-progress connection. Therefore, we can obtain the state variables of a connection by extracting the fields of this structure that are necessary for the recovery procedures. Instead of separating the original data structures in thttpd into a state variable part and a non-state variable part, we make a copy of all state variables and store the copy in the shared memory. We update the variable in shared memory before the corresponding variable in thttpd is modified. This avoids large modifications to the original thttpd.

[Figure 5.2: listings of (a) the per-connection httpd_state_var structure and (b) the global_state_var structure.]

Figure 5.2: State Variables of Thttpd

Figure 5.2(a) shows the per-connection state variables, the httpd_state_var structure. The fields in this structure can be divided into three parts. The first part contains the most important fields of a connection, conn_stage and conn_fd. The conn_stage field represents the connection stage, and the conn_fd field represents the data socket used for communicating with the client. The second part is updated during the reading stage of a connection. The expnfilename field represents the requested file name on the server, and the method field indicates the HTTP method. These two fields are generated after the request is parsed. The last part is updated during the sending stage. The bytes_sent field represents how many bytes of the response have been written to the data socket.

In addition to the per-connection state variables, the global variables are maintained in the global_state_var structure, as shown in Figure 5.2(b). The listen_fd field represents the socket that the server uses to receive requests from the clients. The num_connects field indicates how many connections are currently being processed in the server, and the max_connect field indicates the maximum number of connections that the server can process simultaneously.
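The fields named above can be collected into a rough sketch of the two structures, together with the rule that the shared-memory copy is updated before the server's own variable. Only the structure and field names come from the text; every field type, the field order, and the helper set_bytes_sent() are assumptions made for illustration.

```c
#include <stddef.h>

/* Per-connection state variables; all types are assumed. */
struct httpd_state_var {
    /* part 1: core fields of a connection */
    int    conn_stage;     /* current stage of the connection */
    int    conn_fd;        /* data socket used to talk to the client */
    /* part 2: updated during the reading stage */
    char  *expnfilename;   /* requested file name on the server */
    int    method;         /* HTTP method */
    /* part 3: updated during the sending stage */
    size_t bytes_sent;     /* response bytes already written */
};

/* Global state variables; all types are assumed. */
struct global_state_var {
    int listen_fd;      /* socket on which requests are received */
    int num_connects;   /* connections currently being processed */
    int max_connect;    /* maximum number of simultaneous connections */
};

/* Hypothetical helper: update the shared-memory copy first, then the
 * server's own variable, following the ordering described in the text. */
static void set_bytes_sent(struct httpd_state_var *shadow,
                           size_t *live, size_t value)
{
    shadow->bytes_sent = value;
    *live = value;
}
```

With this ordering, a fault between the two assignments leaves the shared-memory copy at least as recent as the variable inside the server.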

The second modification we made was to use shared memory areas to store the state variables of thttpd. In total, four shared memory areas are used. As Figure 5.3 shows, these four areas are pointed to by four pointers: shm_pointers, httpd_state_vars, char_area, and global_vars.

[Figure 5.3: diagram of the four shared memory areas reached through shm_pointers, httpd_state_vars, chars_area, and global_vars, with the expnfilename data stored in the chars area. Legend: (1) sizeof(httpd_state_var); (2) fixed-size chars area; (3) sizeof(global_state_var).]

Figure 5.3: Four Shared Memory Areas in Modified Thttpd

The area pointed to by httpd_state_vars is used to store the httpd_state_var structures of all the connections. The char_area pointer points to a fixed-size shared memory area that is used to store the expnfilename fields of all the httpd_state_var structures. These two pointers are stored in another shared memory area, which is pointed to by the shm_pointers pointer. In addition, global_vars points to a shared memory area that stores the global_state_var structure of the server.

The final modification was to add a recovery path that is executed during the service recovery phase. Since thttpd is a single-process application, the path contains only a single recovery procedure. To facilitate the recovery, we added a new stage, processing, between the reading and sending stages. As shown in Figure 5.4, this stage is entered when a request has been completely read and parsed. In this stage, the server uses the parsing result of the request to generate the response. The stage is added to prevent the service from going back to the reading stage once the request has been parsed. The recovery procedure performs two jobs. First, it restores all the variables from the shared memory area. Second, it handles the recovery of each in-progress connection according to its connection stage. For the reading stage, it tries to finish the reading and parsing job. For the processing stage, it tries to read the requested file again into the memory space of thttpd. For the sending stage, the recovery procedure reads the requested file again
