Computational Complexity - Analytical Results

Chapter 4 Analytical Results

4.3. Computational Complexity

To solve Problem P1, we can apply binary search method on N (0N M).

For a given N, we evaluate A M N( , , , , )   and check to see if ( , , , , )

A M N     or not. By this way, the minimum value of N such that ( , , , , )

A M N    can be found. In each iteration, we have to solve the equations (19) ~ (25) for evaluatingA M N( , , , , )   . Note that the equations (19) ~ (25) can be rewritten as a system Ax = b of linear equations where A is n×n matrix. The system Ax

= b can be solved by Gaussian elimination with time complexityO n . Thus, we can ( )³ apply Gaussian elimination to the equations (19) ~ (25) with n = (M+1)(N+1). That is, it takes O M([( 1)(N1)] )³ O MN(( ) )³ time to evaluate ^{A M N}⁽ ^{, , , , )}   in each iteration. The number of iterations needed for the binary search isO(logM). Therefore, the total time for solving Problem P1 isO M N( ³ ³logM . )

Chapter 5 Proposed HA Router Design

The proposed 5-tuple availability function shows that the failure detection and recovery rate (δ) is a key parameter to increase the availability of an HA router. In order to increase the failure detection and recovery rate, a High Availability Management (HAM) middleware was designed which can decrease the takeover delay (1/λ) and meet the requirement of carrier-grade availability with five-nine. In this section, we are going to discuss the function of each component in the proposed HAM middleware design.

5.1. HAM Middleware Design

As shown in Figure 5.1, the HAM middleware (within the two-dot chain square) includes two different entities, OpenAIS middleware and Failure Manager. The OpenAIS middleware is a cluster middleware defined in the Service Availability Forum (SAF) Application Interface Specification [19]. In this dissertation, two services, AMF service and Checkpoint service, were used to construct the HA-OSPF router. The processes in the router can communicate with AMF service and

Checkpoint service through the interface, which is a set of APIs (Application Programming Interface) and callback functions, of OpenAIS middleware. The functions of AMF service and Checkpoint service are described as follows:

 AMF service: It provides role assignment and health check. The AMF service can provide three kinds of redundancy model, 2N redundancy, M+N redundancy, and N-way redundancy. When a router first starts, the AMF service will assign a role, active or standby, to the router. The AMF service of the active router sends a heartbeat message to the standby router(s) periodically to report its health status. If the standby router does not hear the heartbeat message from the active router within a down check interval (e.g., 1 second, which is a default value), it will assume the active router has failed and the AMF service will find a router from the standby router(s) to take over the role of the active router.

 Checkpoint service: It provides routing process status and link state information exchange service between active and standby routers. Through this service, the active router can replicate its routing process status and link state information to the standby router(s). The information can help a standby router reduce the takeover delay and improve the availability when it takes over.

HA Router Module

Figure 5.1: The components of an HA router module.

Moreover, the proposed Failure Manager is designed to monitor the status of NICs and routing process and to backup the routing process status and link state information. The Failure Manager will register itself to the OpenAIS middleware and get the permission for using the AMF service and Checkpoint service. The Failure Manager consists of following three modules:

 The Routing Process Failure Manager takes care of the routing process operations, informs the AMF service if a failure in the routing process is detected, and replicates the routing process status and link state information to the Checkpoint service.

 The Interface Monitor checks the health status of the network interface cards (NICs) and informs the AMF service if any NIC failure occurs.

 The Failure Handler has a set of callback functions. When the AMF service notifies the Failure Handler that a failure has occurred it will execute a

predefined callback function to handle the failure. For instances, the callback function will reinitialize the failed process or device if the failure can be determined by the Failure Manager (e.g., the routing process or an NIC failed). However, if the failure (e.g., AMF service failed or HA router failed) cannot be determined by the Failure Manager, the failed router will be restarted by the callback function after a down check interval and the standby router will send a report to the network administrator.

5.2. HAM Middleware Operation Procedures

The operation procedures of the HAM middleware can divide into three parts:

 Role assignment: We use M = 2 and N = 1 as an example to illustrate an HA router with M+N redundancy and it can be easily extended to the general case. As shown in Figure 5.2, there exist two protection groups (e.g., protection groups (RA, RC) and (RB, RC)) in an HA router. A protection group [19] is defined as a pair of routers, one active and one standby. When the router in an HA router is started, it will get the role, active or standby, firstly. The standby router monitors the active router’s health status in each protection group. If an active router fails, the standby router will take over the role of the active router. Note that at this moment all protection groups are lost. After a failed router having been repaired, it will re-initiate and execute the role assignment operation to form a protection group again.

Like VRRP, the active router and the standby router in the same protection group use the private IP addresses to communicate with each other.

Moreover, the active router uses the real IP address to communicate with its

adjacent routers. As soon as the standby router takes over, the standby router changes its IP addresses to the real IP addresses. For a broadcast network (e.g., Ethernet), the standby router will send a gratuitous ARP [31]

message to the network. The gratuitous ARP message is used to ask its neighbors to bind the MAC address of the standby router to the real IP address. Thus, the standby router can receive and forward the packets continuously when it takes over.

Figure 5.2: The logical structure of an HA router with 2+1 redundancy.

 Routing process status and link state information backup: Figure 5.3 shows how routing process status and link state information flow from the active router to standby router. The Routing Process Failure Manager of active router gets the routing process status and link state information and replicates those to the standby router through the Checkpoint service. Then, the standby router receives and saves the routing process status and the link state information. When the standby router takes over, the information can

help the standby router to decrease the takeover delay and improve the availability of the HA router.

Figure 5.3: Link state information backup for a protection group.

Chapter 6 Experiments

In Figure 4.1, we have shown that an HA router with M+1 redundancy (for M ≦ 47) is the recommended scheme to meet the carrier-grade (ρ = 99.999%) availability under an appropriate failure rate (λ), failure detection and recovery rate (δ), and repair rate (μ). In this section, we will actually measure the failure detection and recovery rate (δ) of the proposed HA-OSPF router with M+1 redundancy on an OSPF network (M = 2 in our experiments for illustration). We will show the takeover delay of the proposed HA-OSPF router with HAM middleware is smaller than those of an industry standard approach, Cisco ASR-1000 router [17], and a VRRP router [5]. The takeover delay (the multiplicative inverse of the failure detection and recovery rate) is defined as the latency from the active router of the HA-OSPF router failed to the standby router of the HA-OSPF router taking over and recovering from the failure.

6.1. Experimental Setup

We have implemented an HA-OSPF router on a PC-based environment. We used the 2+1 redundancy model as an example to construct the HA router to verify the

correctness of the proposed HA-OSPF router. To implement the HA-OSPF router with 2+1 redundancy, three desktop PCs with Intel Pentium 4 3.0 GHz processors and 512 MB memories connected via Ethernet were used to emulate an HA-OSPF router. That is, the HA-OSPF router consists of three routers RA, RB and RC, as shown in Figure 6.1. A Linux operating system and GNU Zebra [32] were selected as the developing platform for the PC-based HA-OSPF router. The GNU Zebra is a well-known open source software that manages the TCP/IP based routing protocol. Suppose that RA and RB are active routers and RC is a standby router when the HA-OSPF router is first started. Then, we used two PCs which run IMUNES (Integrated Multiprotocol Network Emulator Simulator) [33], which could send OSPF control messages to the HA-OSPF router, to emulate OSPF networks 1 and 2. There were two clients (S1 and S2) and one log server in our experimental network, as shown in Figure 6.1.

2+1 (M+N)

OSPF Network 1 OSPF Network 2

R3 R4

Figure 6.1: Experimental environment.

In the experiment, S1 sent UDP data packets with specific sequence numbers to S2 to examine the network connectivity (see Figure 6.1). The log server was used to record the sequence number and timestamp of each packet that it received. If S1 sends a packet to S2, it also has to send a copy of the packet to the log server. Then, S2 will forward the packet it received from S1 to the log server. During the takeover period, the network will be disrupted. The log server will not receive any packets transferred from S2. After the standby router takes over the role of the active router, the log server will continue to receive packets from S2. In this way, the takeover delay can be determined. The default parameter values for the OSPF routing protocol and HAM middleware are listed in Table 6.1 [19][20][28][29][30]. The Hello interval is the number of seconds this router waits before sending out the next Hello packet [19][20].

If a router does not receive a Hello packet from a neighbor router within a fixed amount of time, the router modifies its topological database to indicate that the neighbor router is not operational. The time that the router waits is called the router dead interval. By default, this interval is 40 seconds (four times the default Hello interval) [19][20]. Based on Cisco data, the MTTF (1/λ) and MTTR (1/μ) of a commercial router need at least 7 years (i.e., 61320 hours) and not exceed 4 hours, respectively [28][29][30]. The default values for the down check interval of AMF service and polling interval of the Failure Manager are 1000 ms and 100 ms, respectively [34]. The down check interval is a period of time in which the standby router has to hear at least one heartbeat from the active router; otherwise, the standby router assumes it has failed. The polling interval is a period of time in which the Routing Process Failure Manager and the Interface Monitor check the status of routing process and the NICs, respectively.

Table 6.1: Default parameter values [19][20][28][29][30][34].

Router dead interval of OSPF 40 sec

Hello interval 10 sec

Down check interval of AMF service 1000 ms Polling interval of Failure Manager 100 ms

MTTF(1/λ) 7 years (61320 hours)

MTTR(1/μ) 4 hours

6.2. Experimental Results

First, we will show that the failure detection and recovery time (i.e., takeover delay) is not affected too much by the redundancy model used in the HA router. The takeover delays for the proposed HA-OSPF router under various redundancy models are shown in Table 6.2 with the down check interval of 1000 ms and the polling interval of 100 ms for a hardware failure and software failure, respectively. From Table 6.2, the takeover delay for a hardware failure (a software failure) of the proposed HA-OSPF router with 1+1, 2+1, and 2+2 redundancy are 565 ± 3 ms, 569 ± 3 ms, and 576 ± 4 ms (110 ± 2 ms, 112 ± 3 ms, and 118 ± 4 ms), respectively. The experimental results show that the redundancy model of the HA-OSPF router does not affect too much the takeover delay. Therefore, the 2+1 redundancy model, which a more cost-effective configuration, was used to measure takeover delays of the proposed HA-OSPF router in the subsequent experiments.

Table 6.2: Takeover delay (ms) of the proposed HA-OSPF router under various redundancy models.

Redundancy Model

1+1 2+1 2+2 Hardware failure 565 ± 3 569 ± 3 576 ± 4

Software failure 110 ± 2 112 ± 3 118 ± 4

Then, we investigate how the takeover delay is affected by the state information backup of the standby router. We did not measure the takeover delay of Cisco ASR-1000 series router due to lack of facilities. However, in [17], it describes that if an active router of Cisco ASR-1000 series router experiences a hardware or software failure that makes it unable to forward traffic and a standby router of Cisco ASR-1000 series router is configured, the standby router becomes the active router within 200 ms [17]. Therefore, only the following two cases were implemented and evaluated as follows:

 VRRP-based router with 2+1 redundancy: The active routers do not save any state information in the standby router.

 Proposed HA-OSPF router with 2+1 redundancy: Each active router backs up its full state information, including its link states, LSDB (link state database), and routing table to the standby router.

In addition, two types of failures were considered. One is when R2 halts by an unexpected power down (referred as a hardware failure), and the other is when an OSPF process failed (referred to as a software failure). First, in Figure 6.1, UDP packets traveled along path S1, R4, RA, R12, S2 until the active router failed. After

R12 and R4 reestablished their routing tables, the UDP packets could go through the path S1, R4, RC, R12, S2.

The takeover delays for the proposed HA-OSPF router with 2+1 redundancy and VRRP-based router with 2+1 redundancy are shown in Table 6.3. The takeover delay for a hardware failure (a software failure) of the VRRP-based router and the proposed HA-OSPF router were 14511 ± 36 ms and 569 ± 3 ms (13383 ± 3 ms and 112 ± 3 ms), respectively. Experimental results show that the takeover delays of the proposed HA-OSPF router were reduced by 96.08% and 99.16% compared to those of VRRP for a hardware failure and a software failure, respectively. The proposed HA-OSPF router with full state information backup demonstrates its benefits.

Table 6.3: Takeover delays (ms) and failure detection and recovery rates (times/hour) for a HA-OSPF router and a VRRP-based router.

Emulation Scenario

VRRP HA-OSPF router

Hardware failure

Takeover delay (ms) 14511 ± 36 569 ± 3 Failure detection and

recovery rate (times/hour) 248 6327

Software failure

Takeover delay (ms) 13383 ± 3 112 ± 3 Failure detection and

recovery rate (times/hour) 269 32143

Next, we measured the takeover delay for the PC-based HA-OSPF router due to a software failure under various polling intervals. Table 6.4 shows that the takeover delays (failure detection and recovery rates) due to a software failure were, 62 ± 1 ms (δ = 58065 times/hour), 112 ± 3 ms (δ = 32143 times/hour), and 170 ± 2 ms (δ = 21176 times/hour) for three polling intervals, 50 ms, 100 ms, and 200 ms, respectively.

Experimental results show that the takeover delay depends on the polling interval. We found that the shorter the polling interval, the faster the takeover delay (i.e., failure detection and recovery time) is.

Table 6.4: Takeover delays (ms) and failure detection and recovery rates (times/hour) due to a software failure (OSPF process down) under various polling intervals.

Polling interval

50 ms 100 ms 200 ms

Takeover delay (ms) 62 ± 1 112 ± 3 170 ± 2 Failure detection and

recovery rate (times/hour) 58065 32143 21176

We then investigated the takeover delay of the proposed HA-OSPF router due to a hardware failure under different down check intervals. In Table 6.5, the takeover delays (failure detection and recovery rates) due to a hardware failure under down check intervals of 500 ms, 1000 ms, and 2000 ms were 315 ± 2 ms, 569 ± 3 ms, and 1087 ± 9 ms (11429 times/hour, 6327 times/hour, and 3312 times/hour), respectively.

That is, the smaller down check intervals result in the shorter takeover delays.

Table 6.5: Takeover delays (ms) and failure detection and recovery rates (times/hour) due to a hardware failure under various down check intervals.

Down check interval

500 ms 1000 ms 2000 ms Takeover delay (ms) 315 ± 2 569 ± 3 1087 ± 9 Failure detection and

recovery rate (times/hour) 11429 6327 3312

Table 6.6 summarized the comparisons of the proposed HA-OSPF router, VRRP router, Cisco ASR-1000 series router, and Juniper MX series router in terms of cost, takeover delay, implementation flexibility, flexible redundancy model, stateful backup, open specification and open source, storage overhead, and bandwidth overhead. The router which supports stateful backup needs the additional bandwidth and storage to transfer and save the routing process status and link state information, respectively. As shown in Table 6.6, the bandwidth overhead is the amount of bandwidth (in bps) used by the active router transmitting the heartbeat and replicating its routing process status and the link state information to the standby router. The storage overhead is the number of bytes used by standby router saving the routing process status and link state information of active router. Moreover, since the proposed HA-OSPF router is constructed based an open source and open architecture specification, OpenAIS, and it does not need the specific chassis and hardware to achieve the goal of carrier-grade availability, the cost and implementation difficulty for constructing the proposed HA-OSPF router are less than those of the Cisco ASR-1000 series router and Juniper MX series router. Furthermore, from experimental results, we found that the takeover delay of the proposed HA-OSPF router were reduced 6%, 37.3%, and 98.6%

compared to those of the Cisco-ASR 1000 series router, the Juniper MX series router, and the VRRP router, respectively. Therefore, we concluded that the proposed HA-OSPF router is more feasible than VRRP-based router, Cisco ASR-1000 series router, and Juniper MX series router to construct a high availability network.

Table 6.6: The comparisons of the proposed HA-OSPF router, VRRP router, Cisco ASR-1000 series router, and Juniper MX series router.

Scheme HA-OSPF router

(proposed)

VRRP router

Cisco ASR-1000

series router Juniper MX series router

Cost Medium Low Very High Very High

Takeover delay 189 ms ^*1 13383 ms about 200 ms 300 ms ^*2 Implementation

flexibility ^Easy ^Easy Hard (Cisco IOS) Hard (Juniper JUNOS)

Flexible redundancy

*2 The takeover delay of the Juniper MX series router is three times of Hello intervals (Hello interval is 100 ms ~ 65535 ms).

*3 P is the number of routers in the network and Q is the number of bits of process status and link state information for each router.

*4 TC and TH are the checkpoint interval and Hello interval, respectively and K is the number of bits of heartbeat for each router.

Chapter 7 Field Trial Results

This section describes how to implement the HA-OSPF router on an ATCA (Advanced Telecom Computing Architecture) platform and experimental results of the field trial is given. ATCA technology [34][35] allows new communication equipment to be constructed with great attributes such as high performance, high availability, adaptability for adding new features, and lower cost of ownership. An open architecture solution using the ATCA technology can improve service availability.

Thus, industries often use ATCA open architecture combined with their own software solutions to quickly deploy competitive services.

Three types of ATCA cards (i.e., line card, control card, and switch card) were used to build an ATCA-based HA-OSPF router, as shown in Figure 7.1 [34][35].

Based on the operating function of ATCA cards and the concepts of ForCES (Forwarding and Control Element Separation) [36][37][38], the router can be separated into two parts: control plane and forwarding plane. The control plane service was designed to send control messages and to manage routing information.

The forwarding plane service is to decide the outgoing interface for each incoming packet. In general, the forwarding plan looks up the destination address of an

incoming packet, refers to a routing table (or forwarding table), finds an outgoing interface for the incoming packet, and then sends the incoming packet through the outgoing interface.

Figure 7.1: An ATCA-based HA-OSPF router consisting of LC, CC and SC [34][35].

The details of each ATCA card are described as below [34][35]:

 Line Card (LC): The LC belongs to the forwarding plane and was designed for the basic packet forwarding function. When the LC receives OSPF control

packets from its neighbor router, the LC will forward the packets to the control card. Then, if the LC receives the packets, it will forward the packets to correct destinations according to the routing table.

 Control Card (CC): It belongs to the control plane. The CC performs the OSPF routing protocol based on a received OSPF control packet. When the CC receives an OSPF control packet from its neighbors, the CC resets the waiting timer (e.g., the Hello message timer of its neighbor). If the network topology has changed, the CC recalculates the routing table. After that, the CC updates the LC’s

在文檔中高可用度路由器設計與實作 (頁 35-0)