Steady-state Availability Definitions - Design and Implementation of a High-Availability Router

Chapter 2 Preliminary

2.3. Steady-state Availability Definitions

Based on the instantaneous availability, the steady state availability, A, can be defined as

$$A = \lim_{t \to \infty} A(t) \tag{6}$$


The steady-state availability is the probability of a system that is still available over a long period. The steady-state availability (A) can be expressed as [22][23][24]:

$$A = \frac{MTTF}{MTTF + MTTR} \tag{7}$$

where MTTF (mean time to failure) is the arithmetic mean time between failures of a component or system and MTTR (mean time to repair) is the amount of time required to perform corrective maintenance and restore a component or system to operational status. MTTR includes total time required to detect that there is a failure, to repair it, and to place the system back into an operational status.
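As a small worked example of equation (7), the sketch below computes the steady-state availability from MTTF and MTTR, and again from the corresponding rates λ = 1/MTTF and μ = 1/MTTR. The function names are illustrative only; the inputs (MTTF = 61320 hours, MTTR = 4 hours) are the values used later in Chapter 4.

```python
def availability_from_times(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR), as in equation (7)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def availability_from_rates(failure_rate: float, repair_rate: float) -> float:
    """The same quantity written with rates: A = mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


# Inputs taken from Chapter 4: MTTF = 61320 hours (7 years), MTTR = 4 hours.
a_times = availability_from_times(61320.0, 4.0)
a_rates = availability_from_rates(1.0 / 61320.0, 1.0 / 4.0)

print(f"{100 * a_times:.8f}%")  # prints 99.99347727%
```

Both forms give the same value, which is the availability of a single router without any standby (the N = 0 case).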

If the system lifetime is exponential with failure rate λ (so that MTTF = 1/λ), and the time-to-repair distribution of the system is exponential with repair rate μ (so that MTTR = 1/μ), then equation (7) can be rewritten as [22][23][24]

$$A = \frac{\mu}{\lambda + \mu} \tag{8}$$

Chapter 3 HA Router Model Description and Analysis

Mettas used a cost function to show that, given design complexity and technology limitations, improving the availability of a single router becomes increasingly difficult and costly [4]. Thus, a feasible way to increase router availability is to add standby routers to the HA router [5][6][7][9][10]. In this section, we propose a 5-tuple availability function, A(M, N, λ, μ, δ), to determine the minimal number of standby routers (N) in an HA router needed to achieve the desired availability, given the failure rate (λ), repair rate (μ), failure detection and recovery rate (δ), and number of active routers (M). A continuous-time Markov chain (CTMC) [22][25][26] is used to determine the steady-state availability of an HA router with various numbers of active and standby routers.

3.1. Continuous-Time Markov Chain for 1+N Redundancy Model

In this section, the continuous-time Markov chain (CTMC) of an HA router with 1+N redundancy model (i.e., one active router and N standby routers) is considered.

Figure 3.1 is the state-transition diagram of a CTMC [22][25][26] modeling the failure and repair behavior of an HA router with the 1+N redundancy model (i.e., one active and N standby routers). A failure of the active router would cause the network to recalculate routing path information. To avoid this undesirable situation, each standby router monitors the status of the active router. If a failure occurs in the active router, the standby routers automatically hold an election, and one of them takes over the role of the active router.

Figure 3.1: CTMC for an HA router with 1+N redundancy model.

As shown in Figure 3.1, state (i, j) represents the status of the HA router, where i and j represent the status of the active and standby routers, respectively. A value of 1 for i (or j) means that the active (or standby) router is working, and 0 means that it has failed. Thus, state (1, 1) means that both the active and standby routers are working, state (0, 1) represents the failure of the active router, state (1, 0) represents the failure of the standby router, and state (0, 0) means that both routers of the HA router have failed.

The state diagram of the CTMC modeling the failure and repair behavior of an HA router with the 1+N redundancy model is depicted in Figure 3.1. The active router works properly in state (1, p), where 0 ≤ p ≤ N. State (0, q) represents that the active router has failed, so the HA router is down (i.e., it cannot forward packets). From state (0, q), where 1 ≤ q ≤ N, the system detects and recovers from the failure with rate δ and moves to state (1, q-1). State (0, 0) represents that all the router modules of the HA router have failed.

In this dissertation, the time to failure and the time to repair of a router module are assumed to be exponentially distributed with means 1/λ and 1/μ, respectively. In Figure 3.1, a transition from state (0, p) to state (1, p-1), 1 ≤ p ≤ N, indicates that a failure has been detected and recovered from, and that a standby router has taken over the role of the active router. The associated failure detection and recovery rate (δ) is the multiplicative inverse of the mean time from the failure of the active router until a standby router has detected the failure and recovered from it.

Note that in this dissertation, all failure events are assumed to be mutually independent.

Let π(i, j) denote the proportion of time that the system is in state (i, j). Note that in the steady state, the rate at which the system enters state (i, j) must equal the rate at which it leaves state (i, j). Thus, from Figure 3.1, we obtain the following equations for the steady-state probabilities:

(N 1)  (1, )N   (1,N 1) (9)

(N   ) (0, )N   (1, )N (10)

(0, 0) (0,1) (1, 0)

        (11)

(   ) (1, 0)  (1,1)  (0, 0)  (0,1) (12) By solving the preceding set of equations, along with this equation



 

The CTMC for an HA router with 1+N redundancy can be reduced to a Markov chain with two states and two transitions [27], as shown in Figure 3.2. One state is Up, with the reward rate λHA; the other state is Down, with the reward rate μHA [27]. λHA and μHA are the equivalent failure rate and the equivalent repair rate of the HA router with 1+N redundancy, which can be determined by applying the aggregation techniques described in [27].

Figure 3.2: Equivalent Markov chain.

Therefore, λHA and μHA of an HA router for the CTMC in Figure 3.1 can be expressed as follows:



Therefore, from equation (8), the equivalent availability of an HA router (AHA) can be expressed as follows:

$$A_{HA} = \frac{\mu_{HA}}{\lambda_{HA} + \mu_{HA}} \tag{18}$$

Solving equations (16) and (17), we can get an equivalent availability of an HA router (AHA) based on equation (18) under failure rate (λ), failure detection and recovery rate (δ), and repair rate (μ).

3.2. Continuous-Time Markov Chain for M+N Redundancy Model

In this section, the continuous-time Markov chain (CTMC) of an HA router with M+N redundancy (i.e., M active routers and N standby routers) is considered. Each standby router monitors the status of all active routers. If one of the active routers fails, the standby routers automatically hold an election, and one of them takes over the role of the failed active router. Figure 3.3 (a) shows the logical structure of an HA router with M+N redundancy. The CTMC for an HA router with M+N redundancy is depicted in Figure 3.3 (b). All active routers work properly in state (M, p), where 0 ≤ p ≤ N. A transition from state (i, j) to state (i+1, j-1), where 0 ≤ i ≤ M-1 and 1 ≤ j ≤ N, represents that an active router has failed and the system detects and recovers from the failure with rate δ. State (0, 0) represents that all routers of the HA router, both active and standby, have failed.

Figure 3.3: Logical structure and CTMC for an HA Router with M + N redundancy.

After writing the steady-state equations and solving them, we obtain the following equations under the steady state:

 

Solving equations (26) and (27), we can also obtain the equivalent availability (AHA) of an HA router with the M+N redundancy model from equation (18), given the failure rate (λ), failure detection and recovery rate (δ), and repair rate (μ).

3.3. Formalizing a 5-tuple Availability Function

Based on the above discussion, we propose a 5-tuple availability function, A(M, N, λ, μ, δ), to determine the minimum number of standby routers (N) that need to be allocated in an HA router to achieve the desired availability (ρ). In addition, as shown in equation (28), the equivalent availability of an HA router (AHA) is equal to the derived value of the 5-tuple availability function.

$$A_{HA} = A(M, N, \lambda, \mu, \delta) \tag{28}$$

Therefore, problem P1 can be formally defined as follows:

Problem P1:

Minimize N subject to

$$A_{HA} = \frac{\mu_{HA}}{\lambda_{HA} + \mu_{HA}} \ge \rho, \quad \text{where } 0 \le N \le M \tag{29}$$

where μHA and λHA are the equivalent repair rate and equivalent failure rate of an HA router.

Chapter 4 Analytical Results

In this section, we want to find the most cost-effective redundancy model for the HA router such that its availability meets the requirement of the carrier-grade availability (ρ = 99.999%). The parameter settings of μ, λ, and δ are given as follows.

Based on the data from Cisco, we set μ = 0.25 times/hour (i.e., the MTTR (1/μ) is equal to 4 hours). The MTTR of a router is assumed to be the time it takes for a spare part and a knowledgeable person to arrive and carry out the repair. Three MTTFs are considered: a low MTTF (1/λ = 10000 hours), a high MTTF (1/λ = 100000 hours), and the MTTF of a Cisco carrier-grade router (1/λ = 61320 hours).

4.1. Numerical Analysis of Minimal Required Standby Routers for 1+N Redundancy Model

Solving equations (16) and (17), we can obtain the availability of an HA router from equation (18) under various failure detection and recovery rates and different numbers N of standby routers, as shown in Table 4.1. From Table 4.1, an HA router with 1+1 redundancy (i.e., N = 1) meets the five-nine availability requirement when δ is 10 times/hour or more. In general, δ is much larger than 10 times/hour. For example, in Table 6.3, the δ for the VRRP router is at least 248 times/hour and the δ for the proposed HA-OSPF router is at least 2903 times/hour. For a commercial router, such as a Cisco ASR 1000 Series router, δ is 1800 times/hour [17]. Thus, we conclude that an HA router with 1+1 redundancy is preferred, since it meets the five-nine availability requirement.

In addition, we found that the failure detection and recovery rate (δ) is a key parameter for improving the availability of an HA router: the larger δ is, the higher the availability. Note that, for an HA router with 1+1 redundancy, the minimum δ needed to obtain five-nine availability is 1.632 times/hour for 1/λ = 7 years and 1/μ = 4 hours [28]-[30]. In Sections 5 and 6, we will show that the experimental values of δ for a PC-based and an ATCA-based HA router with 1+1 redundancy are 2903 times/hour and 3377 times/hour for hardware failures, respectively, which are much higher than this minimum. For software failures, the experimental values of δ are even larger.

Table 4.1: The availability of an HA router (AHA) for different numbers of standby routers (N) and various failure detection and recovery rates (δ), under 1/λ = 7 years and 1/μ = 4 hours [28]-[30].

         Failure detection and recovery rate δ (times/hour)
         δ = 1           δ = 10          δ = 100         δ = 1000
N = 0    99.99347727%    99.99347727%    99.99347727%    99.99347727%
N = 1    99.99836852%    99.99983608%    99.99998284%    99.99999752%
N = 2    99.99836921%    99.99983692%    99.99998369%    99.99999837%
N = 4    99.99836921%    99.99983692%    99.99998369%    99.99999837%
N = 8    99.99836921%    99.99983692%    99.99998369%    99.99999837%
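The 1+1 case (N = 1) has only four states, so its steady-state probabilities can be written down directly from the balance equations. The sketch below is an illustration under stated assumptions, not the dissertation's own solver: it assumes that each working router fails at rate λ, that one failed router is repaired at a time at rate μ, that a takeover from state (0, 1) to state (1, 0) happens at rate δ, and that the HA router is up exactly in states (1, 1) and (1, 0). The function names are hypothetical. Evaluated by hand under these assumptions, it appears to agree with the N = 1 row of Table 4.1 and with the minimum δ of about 1.632 times/hour quoted in Section 4.1.

```python
def availability_1_plus_1(lam: float, mu: float, delta: float) -> float:
    """Steady-state availability of the assumed four-state 1+1 chain."""
    p11 = 1.0                          # unnormalised pi(1,1)
    p10 = 2.0 * lam / mu * p11         # balance at (1,1): 2*lam*pi(1,1) = mu*pi(1,0)
    p01 = lam / (lam + delta) * p11    # balance at (0,1): (lam+delta)*pi(0,1) = lam*pi(1,1)
    p00 = lam * (p01 + p10) / mu       # balance at (0,0): mu*pi(0,0) = lam*(pi(0,1)+pi(1,0))
    total = p11 + p10 + p01 + p00      # normalisation: probabilities sum to 1
    return (p11 + p10) / total         # up states: active router working


def min_delta_for_target(lam: float, mu: float, target: float,
                         lo: float = 1e-3, hi: float = 1e6,
                         iterations: int = 200) -> float:
    """Bisection for the smallest delta with availability >= target.

    Relies on availability increasing with delta, and assumes the target
    is missed at `lo` and met at `hi`.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if availability_1_plus_1(lam, mu, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


LAM = 1.0 / 61320.0  # failure rate: 1/lambda = 7 years = 61320 hours
MU = 0.25            # repair rate: 1/mu = 4 hours

for delta in (1.0, 10.0, 100.0, 1000.0):
    a = availability_1_plus_1(LAM, MU, delta)
    print(f"delta = {delta:6.0f} times/hour: A = {100 * a:.8f}%")

print(f"minimum delta for 99.999%: {min_delta_for_target(LAM, MU, 0.99999):.3f}")
```

The closed-form substitution is used here only because the chain is tiny; for larger N the general state equations would have to be solved as a linear system.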

4.2. Numerical Analysis of Minimal Required Standby Routers for M+N Redundancy Model

The failure detection and recovery rate (δ) is set to 100, 1000, 10000, and 100000 times/hour. In addition, three failure detection and recovery rates measured on the proposed HA router are also considered: δ = 11429 times/hour for hardware failures only, δ = 58065 times/hour for software failures only, and δ = 34747 times/hour for hardware and software failures (see section 7). The number of active routers M takes the values 1, 2, 4, …, 128. Table 4.2 shows the analytical results for the minimum required number of standby routers (N) for the proposed HA router under various μ, λ, δ, and M.

Table 4.2: The minimum required standby routers (N) for an HA router to achieve the goal of carrier-grade availability (ρ = 99.999%).

[Table 4.2 body not recoverable from the source. Layout: μ = 0.25 times/hour; column groups by number of active routers (M = 1, 2, 4, 8 in the first block; the group labels of the second block are missing), each with 1/λ = 10000, 61320, and 100000 hours; rows indexed by δ (times/hour).]

Χ: no feasible solution

From the analytical results, we also found that the minimum required number of standby routers (N) decreases when the failure rate (λ) of the router decreases or when its failure detection and recovery rate (δ) increases. The results also show that the failure detection and recovery rate (δ) of a router is a key parameter for reducing the minimum required number of standby routers in an HA router.

Figure 4.1 shows the relationship between the minimum required number of standby routers and the number of active routers for an HA router with 1/λ, 1/μ, and ρ set to 61320 hours, 4 hours (from Cisco [28][29][30]), and 99.999%, respectively. Based on Figure 4.1, service providers or network administrators can determine the appropriate number of standby routers for constructing an HA router under various numbers of active routers and the desired availability (ρ). For instance, an HA router needs only one standby router to meet the requirement of carrier-grade availability (ρ = 99.999%) when the number of active routers is not greater than 47, as shown in Figure 4.1.

Figure 4.1: The minimum required number of standby routers for an HA router under various numbers of active routers and failure detection and recovery rates (with ρ = 99.999%).

4.3. Computational Complexity

To solve Problem P1, we can apply the binary search method on N (0 ≤ N ≤ M). For a given N, we evaluate A(M, N, λ, μ, δ) and check whether A(M, N, λ, μ, δ) ≥ ρ or not. In this way, the minimum value of N such that A(M, N, λ, μ, δ) ≥ ρ can be found. In each iteration, we have to solve equations (19) ~ (25) to evaluate A(M, N, λ, μ, δ). Note that equations (19) ~ (25) can be rewritten as a system Ax = b of linear equations, where A is an n×n matrix. The system Ax = b can be solved by Gaussian elimination with time complexity O(n³). Thus, we can apply Gaussian elimination to equations (19) ~ (25) with n = (M+1)(N+1). That is, it takes O([(M+1)(N+1)]³) = O((MN)³) time to evaluate A(M, N, λ, μ, δ) in each iteration. The number of iterations needed for the binary search is O(log M). Therefore, the total time for solving Problem P1 is O(M³N³ log M).
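The binary search over N can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function `min_standby_routers` and its `availability` callback are hypothetical names, and the search is only valid under the assumption that the availability does not decrease as N grows. In practice the callback would evaluate the 5-tuple availability function for a fixed M, λ, μ, and δ; here a lookup of the δ = 10 column of Table 4.1 stands in for it.

```python
from typing import Callable, Optional


def min_standby_routers(availability: Callable[[int], float],
                        m: int, rho: float) -> Optional[int]:
    """Smallest N in [0, m] with availability(N) >= rho, or None if infeasible.

    Assumes availability(N) is non-decreasing in N, so that binary search
    applies. Uses O(log m) evaluations of the availability function.
    """
    lo, hi = 0, m
    if availability(hi) < rho:
        return None  # even N = m standby routers cannot reach rho
    while lo < hi:
        mid = (lo + hi) // 2
        if availability(mid) >= rho:
            hi = mid          # mid is feasible; look for a smaller N
        else:
            lo = mid + 1      # mid is infeasible; N must be larger
    return lo


# Stand-in availability values: Table 4.1, column delta = 10 times/hour.
table_4_1_delta_10 = {0: 0.9999347727, 1: 0.9999983608}

n_min = min_standby_routers(lambda n: table_4_1_delta_10[n], m=1, rho=0.99999)
print(n_min)  # prints 1
```

Returning `None` corresponds to the "no feasible solution" entries of Table 4.2.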

Chapter 5 Proposed HA Router Design

The proposed 5-tuple availability function shows that the failure detection and recovery rate (δ) is a key parameter for increasing the availability of an HA router. In order to increase the failure detection and recovery rate, a High Availability Management (HAM) middleware was designed, which decreases the takeover delay (1/δ) and meets the requirement of carrier-grade (five-nine) availability. In this section, we discuss the function of each component in the proposed HAM middleware design.

5.1. HAM Middleware Design

As shown in Figure 5.1, the HAM middleware (within the two-dot chain square) includes two different entities, the OpenAIS middleware and the Failure Manager. The OpenAIS middleware is a cluster middleware defined in the Service Availability Forum (SAF) Application Interface Specification [19]. In this dissertation, two services, the AMF service and the Checkpoint service, were used to construct the HA-OSPF router. The processes in the router can communicate with the AMF service and the Checkpoint service through the interface of the OpenAIS middleware, which is a set of APIs (Application Programming Interfaces) and callback functions. The functions of the AMF service and the Checkpoint service are described as follows:

• AMF service: It provides role assignment and health checks. The AMF service can provide three kinds of redundancy models: 2N redundancy, M+N redundancy, and N-way redundancy. When a router first starts, the AMF service assigns a role, active or standby, to the router. The AMF service of the active router periodically sends a heartbeat message to the standby router(s) to report its health status. If a standby router does not receive a heartbeat message from the active router within a down check interval (e.g., 1 second, which is the default value), it assumes that the active router has failed, and the AMF service selects one of the standby routers to take over the role of the active router.

• Checkpoint service: It provides an exchange service for routing process status and link state information between the active and standby routers. Through this service, the active router can replicate its routing process status and link state information to the standby router(s). This information helps a standby router reduce the takeover delay and improve the availability when it takes over.
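The heartbeat and down check interval behavior described for the AMF service can be illustrated with a minimal sketch. This is not the OpenAIS API: the class `HeartbeatMonitor` and its methods are hypothetical names introduced here, and the current time is passed in explicitly so that the logic can be followed without real timers. Only the 1-second default interval comes from the text; treating the state before the first heartbeat as "not failed" is an assumption.

```python
from typing import Optional


class HeartbeatMonitor:
    """Hypothetical standby-side monitor; not part of OpenAIS."""

    def __init__(self, down_check_interval_s: float = 1.0) -> None:
        self.down_check_interval_s = down_check_interval_s
        self.last_heartbeat_s: Optional[float] = None  # no heartbeat seen yet

    def on_heartbeat(self, now_s: float) -> None:
        """Record the arrival time of a heartbeat from the active router."""
        self.last_heartbeat_s = now_s

    def active_failed(self, now_s: float) -> bool:
        """True if no heartbeat arrived within the down check interval."""
        if self.last_heartbeat_s is None:
            return False  # assumption: no failure is declared before the first heartbeat
        return now_s - self.last_heartbeat_s > self.down_check_interval_s


monitor = HeartbeatMonitor()
monitor.on_heartbeat(0.0)
print(monitor.active_failed(0.5))  # prints False
print(monitor.active_failed(1.5))  # prints True
```

A shorter down check interval lets the standby router react sooner, which shortens the takeover delay, at the cost of more frequent heartbeat traffic.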


Figure 5.1: The components of an HA router module.

Moreover, the proposed Failure Manager is designed to monitor the status of the NICs and the routing process and to back up the routing process status and link state information. The Failure Manager registers itself with the OpenAIS middleware to obtain permission to use the AMF service and the Checkpoint service. The Failure Manager consists of the following three modules:

• The Routing Process Failure Manager takes care of the routing process operations, informs the AMF service if a failure in the routing process is detected, and replicates the routing process status and link state information to the Checkpoint service.

• The Interface Monitor checks the health status of the network interface cards (NICs) and informs the AMF service if any NIC failure occurs.

• The Failure Handler has a set of callback functions. When the AMF service notifies the Failure Handler that a failure has occurred, it executes a predefined callback function to handle the failure. For instance, the callback function reinitializes the failed process or device if the failure can be determined by the Failure Manager (e.g., the routing process or an NIC has failed). However, if the failure cannot be determined by the Failure Manager (e.g., the AMF service or the HA router has failed), the failed router is restarted by the callback function after a down check interval, and the standby router sends a report to the network administrator.

5.2. HAM Middleware Operation Procedures

The operation procedures of the HAM middleware can be divided into three parts:

• Role assignment: We use M = 2 and N = 1 as an example to illustrate an HA router with M+N redundancy; it can easily be extended to the general case. As shown in Figure 5.2, there exist two protection groups (e.g., protection groups (RA, RC) and (RB, RC)) in an HA router. A protection group [19] is defined as a pair of routers, one active and one standby. When a router in an HA router is started, it is first assigned a role, active or standby. The standby router monitors the health status of the active router in each protection group. If an active router fails, the standby router takes over the role of the active router. Note that at this moment all protection groups are lost. After a failed router has been repaired, it re-initializes and executes the role assignment operation to form a protection group again.

Like VRRP, the active router and the standby router in the same protection group use private IP addresses to communicate with each other. Moreover, the active router uses the real IP address to communicate with its adjacent routers. As soon as the standby router takes over, it changes its IP addresses to the real IP addresses. For a broadcast network (e.g., Ethernet), the standby router sends a gratuitous ARP [31] message to the network. The gratuitous ARP message asks its neighbors to bind the MAC address of the standby router to the real IP address. Thus, the standby router can continue to receive and forward packets when it takes over.

Figure 5.2: The logical structure of an HA router with 2+1 redundancy.

• Routing process status and link state information backup: Figure 5.3 shows how the routing process status and link state information flow from the active router to the standby router. The Routing Process Failure Manager of the active router obtains the routing process status and link state information and replicates them to the standby router through the Checkpoint service. The standby router then receives and saves the routing process status and the link state information. When the standby router takes over, this information helps it decrease the takeover delay and improve the availability of the HA router.

Figure 5.3: Link state information backup for a protection group.

Chapter 6 Experiments

In Figure 4.1, we have shown that an HA router with M+1 redundancy (for M ≤ 47) is the recommended scheme to meet carrier-grade availability (ρ = 99.999%) under an appropriate failure rate (λ), failure detection and recovery rate (δ), and repair rate (μ). In this section, we measure the failure detection and recovery rate (δ) of the proposed HA-OSPF router with M+1 redundancy on an OSPF network (M = 2 in our experiments, for illustration). We will show that the takeover delay of the proposed HA-OSPF router with the HAM middleware is smaller than those of an industry-standard approach, the Cisco ASR-1000 router [17], and a VRRP router [5]. The takeover delay (the multiplicative inverse of the failure detection and recovery rate) is defined as the latency from the failure of the active router of the HA-OSPF router until the standby router of the HA-OSPF router has taken over and recovered from the failure.
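Since δ is expressed in times/hour and the takeover delay is its multiplicative inverse, converting between the two is a one-line calculation. The sketch below uses hypothetical function names and applies the conversion to the δ values quoted in Section 4.1 (248 times/hour for VRRP, 1800 times/hour for the Cisco ASR 1000 Series, and 2903 times/hour for the proposed HA-OSPF router).

```python
SECONDS_PER_HOUR = 3600.0


def takeover_delay_seconds(delta_per_hour: float) -> float:
    """Takeover delay in seconds for a rate delta given in times/hour."""
    return SECONDS_PER_HOUR / delta_per_hour


def delta_per_hour(takeover_delay_s: float) -> float:
    """Failure detection and recovery rate (times/hour) for a delay in seconds."""
    return SECONDS_PER_HOUR / takeover_delay_s


for name, delta in (("VRRP", 248.0), ("Cisco ASR 1000", 1800.0), ("HA-OSPF", 2903.0)):
    print(f"{name}: {takeover_delay_seconds(delta):.2f} s")
# prints:
# VRRP: 14.52 s
# Cisco ASR 1000: 2.00 s
# HA-OSPF: 1.24 s
```

For example, a takeover delay of exactly 1 second would correspond to δ = 3600 times/hour.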

6.1. Experimental Setup

We have implemented an HA-OSPF router in a PC-based environment. We used the 2+1 redundancy model as an example to construct the HA router and verify the correctness of the proposed HA-OSPF router. To implement the HA-OSPF router with 2+1 redundancy, three desktop PCs with Intel Pentium 4 3.0 GHz processors and 512 MB of memory, connected via Ethernet, were used to emulate an HA-OSPF router. That is, the HA-OSPF router consists of three routers, RA, RB, and RC, as shown in Figure 6.1. A Linux operating system and GNU Zebra [32] were selected as the development platform for the PC-based HA-OSPF router. GNU Zebra is well-known open-source software that manages TCP/IP-based routing protocols. Suppose that RA and RB are active routers and RC is a standby router when the HA-OSPF router is first started. We then used two PCs running IMUNES (Integrated Multiprotocol Network Emulator Simulator) [33], which can send OSPF control messages to the HA-OSPF router, to emulate OSPF networks 1 and 2. There were two clients (S1 and S2) and one log server in our experimental network, as shown in Figure 6.1.


Figure 6.1: Experimental environment.

In the experiment, S1 sent UDP data packets with specific sequence numbers to S2 to examine the network connectivity (see Figure 6.1). The log server was used to record the sequence number and timestamp of each packet that it received. If S1 sends
