Experimental Results - Design and Implementation of a High-Availability Router

Chapter 6 Experiments

6.2. Experimental Results

First, we show that the failure detection and recovery time (i.e., the takeover delay) is largely insensitive to the redundancy model used in the HA router. Table 6.2 lists the takeover delays of the proposed HA-OSPF router under various redundancy models, with a down check interval of 1000 ms for hardware failures and a polling interval of 100 ms for software failures. For a hardware failure, the takeover delays with 1+1, 2+1, and 2+2 redundancy are 565 ± 3 ms, 569 ± 3 ms, and 576 ± 4 ms, respectively; for a software failure, they are 110 ± 2 ms, 112 ± 3 ms, and 118 ± 4 ms. Across the three redundancy models, the mean takeover delay thus varies by at most 11 ms for a hardware failure and 8 ms for a software failure. Therefore, the 2+1 redundancy model, which is a more cost-effective configuration, was used to measure the takeover delays of the proposed HA-OSPF router in the subsequent experiments.

Table 6.2: Takeover delay (ms) of the proposed HA-OSPF router under various redundancy models.

Redundancy model      1+1        2+1        2+2
Hardware failure      565 ± 3    569 ± 3    576 ± 4
Software failure      110 ± 2    112 ± 3    118 ± 4

Next, we investigate how the takeover delay is affected by backing up state information on the standby router. We did not measure the takeover delay of the Cisco ASR-1000 series router because the equipment was not available to us. According to [17], however, if the active router of a Cisco ASR-1000 series system experiences a hardware or software failure that prevents it from forwarding traffic, and a standby router is configured, the standby router becomes the active router within 200 ms. Therefore, only the following two cases were implemented and evaluated:

• VRRP-based router with 2+1 redundancy: The active routers do not save any state information in the standby router.

• Proposed HA-OSPF router with 2+1 redundancy: Each active router backs up its full state information, including its link states, LSDB (link state database), and routing table, to the standby router.

In addition, two types of failures were considered. One is when R2 halts because of an unexpected power down (referred to as a hardware failure), and the other is when the OSPF process fails (referred to as a software failure). As shown in Figure 6.1, UDP packets traveled along the path S1, R4, RA, R12, S2 until the active router failed. After R12 and R4 reestablished their routing tables, the UDP packets traveled along the path S1, R4, RC, R12, S2.

The takeover delays of the proposed HA-OSPF router and the VRRP-based router, both with 2+1 redundancy, are shown in Table 6.3. For a hardware failure, the takeover delays of the VRRP-based router and the proposed HA-OSPF router were 14511 ± 36 ms and 569 ± 3 ms, respectively; for a software failure, they were 13383 ± 3 ms and 112 ± 3 ms. Compared with VRRP, the proposed HA-OSPF router thus reduced the takeover delay by 96.08% for a hardware failure and by 99.16% for a software failure, which demonstrates the benefit of backing up the full state information.

Table 6.3: Takeover delays (ms) and failure detection and recovery rates (times/hour) for an HA-OSPF router and a VRRP-based router.

Emulation scenario                                      VRRP          HA-OSPF router
Hardware failure
  Takeover delay (ms)                                   14511 ± 36    569 ± 3
  Failure detection and recovery rate (times/hour)      248           6327
Software failure
  Takeover delay (ms)                                   13383 ± 3     112 ± 3
  Failure detection and recovery rate (times/hour)      269           32143
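The reduction percentages quoted for Table 6.3 follow directly from the mean takeover delays. The short Python sketch below reproduces that arithmetic; the function name `reduction_percent` is illustrative, and only the mean delays are used (the ± spreads are ignored).

```python
def reduction_percent(baseline_ms: float, improved_ms: float) -> float:
    """Relative reduction of improved_ms with respect to baseline_ms, in percent."""
    return (baseline_ms - improved_ms) / baseline_ms * 100.0


# Mean takeover delays (ms) from Table 6.3, as (VRRP, HA-OSPF) pairs.
delays = {"hardware failure": (14511, 569), "software failure": (13383, 112)}

for failure, (vrrp_ms, ha_ospf_ms) in delays.items():
    print(f"{failure}: {reduction_percent(vrrp_ms, ha_ospf_ms):.2f}% shorter")
```

Run as written, this gives 96.08% for a hardware failure and 99.16% for a software failure, matching the figures in the text.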

Next, we measured the takeover delay of the PC-based HA-OSPF router due to a software failure under various polling intervals. Table 6.4 shows that, for polling intervals of 50 ms, 100 ms, and 200 ms, the takeover delays were 62 ± 1 ms (δ = 58065 times/hour), 112 ± 3 ms (δ = 32143 times/hour), and 170 ± 2 ms (δ = 21176 times/hour), respectively. The takeover delay therefore depends on the polling interval: the shorter the polling interval, the shorter the takeover delay (i.e., the failure detection and recovery time).

Table 6.4: Takeover delays (ms) and failure detection and recovery rates (times/hour) due to a software failure (OSPF process down) under various polling intervals.

Polling interval                                        50 ms      100 ms     200 ms
Takeover delay (ms)                                     62 ± 1     112 ± 3    170 ± 2
Failure detection and recovery rate (times/hour)        58065      32143      21176
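The failure detection and recovery rates in Table 6.4 appear to be the reciprocal of the mean takeover delay, expressed per hour. That relationship is inferred here from the tabulated numbers and is not stated in this section, so the following Python sketch should be read as a consistency check under that assumption, not as the authors' definition of δ.

```python
MS_PER_HOUR = 3_600_000


def recovery_rate_per_hour(takeover_delay_ms: float) -> int:
    """Assumed relation: delta = one hour divided by the mean takeover delay."""
    return round(MS_PER_HOUR / takeover_delay_ms)


# Table 6.4: polling interval (ms) -> (mean takeover delay in ms, reported delta).
table_6_4 = {50: (62, 58065), 100: (112, 32143), 200: (170, 21176)}

for interval_ms, (delay_ms, reported) in table_6_4.items():
    computed = recovery_rate_per_hour(delay_ms)
    print(f"polling {interval_ms} ms: computed {computed}, reported {reported}")
```

Under this assumption, the computed and reported rates agree for all three polling intervals.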

We then investigated the takeover delay of the proposed HA-OSPF router due to a hardware failure under different down check intervals. As shown in Table 6.5, for down check intervals of 500 ms, 1000 ms, and 2000 ms, the takeover delays were 315 ± 2 ms (δ = 11429 times/hour), 569 ± 3 ms (δ = 6327 times/hour), and 1087 ± 9 ms (δ = 3312 times/hour), respectively. That is, a smaller down check interval results in a shorter takeover delay.

Table 6.5: Takeover delays (ms) and failure detection and recovery rates (times/hour) due to a hardware failure under various down check intervals.

Down check interval                                     500 ms     1000 ms    2000 ms
Takeover delay (ms)                                     315 ± 2    569 ± 3    1087 ± 9
Failure detection and recovery rate (times/hour)        11429      6327       3312
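To quantify how the takeover delay grows with the down check interval, one can fit a straight line to the three mean values in Table 6.5. The sketch below uses an ordinary least-squares fit written out by hand; the helper name `fit_line` is illustrative, and with only three data points the result is a rough illustration, not a validated model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Mean values from Table 6.5 (hardware failure).
down_check_intervals_ms = [500, 1000, 2000]
takeover_delays_ms = [315, 569, 1087]

slope, intercept = fit_line(down_check_intervals_ms, takeover_delays_ms)
print(f"takeover delay ~= {slope:.3f} * interval + {intercept:.0f} ms")
```

The fit gives a slope of about 0.515 and an intercept of about 56 ms. A slope near 0.5 would be consistent with a failure occurring, on average, halfway through a check interval, although the text does not state this and three points cannot establish it.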

Table 6.6 summarizes the comparison of the proposed HA-OSPF router, the VRRP router, the Cisco ASR-1000 series router, and the Juniper MX series router in terms of cost, takeover delay, implementation flexibility, flexible redundancy model, stateful backup, open specification and open source, storage overhead, and bandwidth overhead. A router that supports stateful backup needs additional bandwidth to transfer, and additional storage to save, the routing process status and link state information. In Table 6.6, the bandwidth overhead is the amount of bandwidth (in bps) that the active router uses to transmit the heartbeat and to replicate its routing process status and link state information to the standby router. The storage overhead is the number of bytes that the standby router uses to save the routing process status and link state information of the active router. Moreover, because the proposed HA-OSPF router is built on OpenAIS, an open source and open architecture specification, and does not need a specific chassis or hardware to achieve carrier-grade availability, its cost and implementation difficulty are lower than those of the Cisco ASR-1000 series router and the Juniper MX series router. Furthermore, the experimental results show that the takeover delay of the proposed HA-OSPF router is 6%, 37.3%, and 98.6% shorter than those of the Cisco ASR-1000 series router, the Juniper MX series router, and the VRRP router, respectively. Therefore, we conclude that the proposed HA-OSPF router is more feasible than the VRRP-based router, the Cisco ASR-1000 series router, and the Juniper MX series router for constructing a high-availability network.

Table 6.6: The comparisons of the proposed HA-OSPF router, VRRP router, Cisco ASR-1000 series router, and Juniper MX series router.

Scheme                        HA-OSPF router (proposed)    VRRP router    Cisco ASR-1000 series router    Juniper MX series router
Cost                          Medium                       Low            Very High                       Very High
Takeover delay                189 ms *1                    13383 ms       about 200 ms                    300 ms *2
Implementation flexibility    Easy                         Easy           Hard (Cisco IOS)                Hard (Juniper JUNOS)

Flexible redundancy

*2 The takeover delay of the Juniper MX series router is three times the Hello interval (the Hello interval ranges from 100 ms to 65535 ms).

*3 P is the number of routers in the network, and Q is the number of bits of process status and link state information for each router.

*4 TC and TH are the checkpoint interval and the Hello interval, respectively, and K is the number of bits of heartbeat for each router.

Chapter 7 Field Trial Results

This chapter describes how the HA-OSPF router was implemented on an ATCA (Advanced Telecom Computing Architecture) platform and presents the experimental results of the field trial. ATCA technology [34][35] allows new communication equipment to be built with high performance, high availability, adaptability for adding new features, and a lower cost of ownership. An open architecture solution based on ATCA technology can improve service availability. Thus, vendors often combine the ATCA open architecture with their own software solutions to deploy competitive services quickly.

Three types of ATCA cards (i.e., line card, control card, and switch card) were used to build an ATCA-based HA-OSPF router, as shown in Figure 7.1 [34][35]. Based on the operating functions of the ATCA cards and the concepts of ForCES (Forwarding and Control Element Separation) [36][37][38], the router can be separated into two parts: the control plane and the forwarding plane. The control plane sends control messages and manages routing information. The forwarding plane decides the outgoing interface for each incoming packet. In general, the forwarding plane looks up the destination address of an incoming packet in a routing table (or forwarding table), finds the outgoing interface for that packet, and then sends the packet through that interface.
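The forwarding-plane lookup described above can be sketched in a few lines of Python. The sketch assumes longest-prefix matching, which is the usual rule for IP forwarding but is not spelled out in the text; the routing table contents, interface names, and the function name `lookup` are all hypothetical.

```python
import ipaddress


def lookup(routing_table, destination):
    """Return the outgoing interface for destination (longest-prefix match), or None."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, interface) of the most specific matching route so far
    for prefix, interface in routing_table:
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)
    return best[1] if best else None


# Hypothetical routing table: (destination prefix, outgoing interface).
table = [("10.0.0.0/8", "eth0"), ("10.1.0.0/16", "eth1"), ("0.0.0.0/0", "eth2")]

print(lookup(table, "10.1.2.3"))   # matches /8, /16 and the default route; /16 wins
print(lookup(table, "192.0.2.7"))  # only the default route matches
```

A linear scan keeps the example short; a production forwarding table would typically use a trie or hardware lookup instead.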

Figure 7.1: An ATCA-based HA-OSPF router consisting of LC, CC and SC [34][35].

The details of each ATCA card are described below [34][35]:

• Line Card (LC): The LC belongs to the forwarding plane and was designed for the basic packet forwarding function. When the LC receives OSPF control packets from a neighbor router, it forwards them to the control card. When the LC receives data packets, it forwards them to the correct destinations according to the routing table.

• Control Card (CC): The CC belongs to the control plane. It runs the OSPF routing protocol in response to the OSPF control packets it receives. When the CC receives an OSPF control packet from a neighbor, it resets the corresponding waiting timer (e.g., the Hello message timer of that neighbor). If the network topology has changed, the CC recalculates the routing table and then updates the LC’s forwarding table. In addition to the OSPF process, the HAM middleware and the OpenAIS middleware are installed on the CC to perform state information backup and failure detection and recovery.

• Switch Card (SC): The SC belongs to the forwarding plane. It switches packets to the correct card (LC or CC) through the backplane. For example, as shown in Figure 7.1, control packets received by the LC are forwarded to the SC via the base interface, and the SC then switches them to the CC. Data packets received by the LC are forwarded to the SC via the fabric interface, and the SC then switches them to the LC.
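The division of labor among the three cards can be summarized as a small dispatch rule. The Python sketch below is only an illustration: it assumes that control traffic is recognized by the OSPF IP protocol number (89), which is a standard value but is not stated in the text as the classification rule used on the cards, and the function name `switch_card_route` is hypothetical.

```python
from typing import Tuple

OSPF_IP_PROTOCOL = 89  # IP protocol number assigned to OSPF


def switch_card_route(ip_protocol: int) -> Tuple[str, str]:
    """Return (backplane interface, destination card) for a packet the LC hands to the SC.

    Assumption: OSPF packets are control traffic; everything else is data traffic.
    """
    if ip_protocol == OSPF_IP_PROTOCOL:
        return ("base", "CC")    # control packets go to the control card
    return ("fabric", "LC")      # data packets go back out through a line card


print(switch_card_route(89))  # OSPF control packet
print(switch_card_route(17))  # UDP data packet
```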

In our system, an AdvancedTCA-compliant processor card, the aTCA-6890 [35], was used as the control card to build the router. The aTCA-6890 is available in a dual-processor configuration with Low Voltage Intel 3.2 GHz Xeon processors and an 800 MHz system bus. It also features the Intel E7520 chipset and 4 GB of DDR-400 memory. Peripherals include six Gigabit Ethernet ports and two 10/100/1000 Mbps Ethernet maintenance ports.

Recall that we used two PCs connected via Ethernet to emulate a PC-based HA-OSPF router in the previous experiment; our HA-OSPF router can be easily implemented on an ATCA platform as well. We deployed the OSPF process and the HAM middleware on the ATCA control card and then integrated the card into an ATCA chassis to build an ATCA-based HA-OSPF router. In the ATCA chassis, both control cards have two Ethernet interfaces connected to the backplane [34][35]. Therefore, heartbeat and checkpoint messages can be exchanged between the control cards over the backplane. In this experiment, the PCs R2 and R3 were replaced by control cards P1 and P2 (see Figure 7.2). The configuration of the control cards on the ATCA platform is the same as that on the PC-based system.

[Diagram labels: Network 2; Power Tray Module 1; Power Tray Module 2; Fan Tray; SAM A; SAM B; two Line Cards; two Switch Cards; two Control Cards (P1 and P2)]

Figure 7.2: ATCA-based experimental environment.

Based on the default parameter values in Table 6.1, we measured the takeover delays of the ATCA-based HA-OSPF router with 1+1 redundancy; the experimental results are shown in Table 7.1. The takeover delays of the PC-based HA-OSPF router from Table 6.3 are also included in Table 7.1 for easy reference. The takeover delays of the ATCA-based HA-OSPF router with 1+1 redundancy due to a hardware failure and a software failure are 1066 ± 54 ms (δ = 3377 times/hour) and 217 ± 17 ms (δ = 16590 times/hour), respectively. For a hardware failure, the takeover delay of the ATCA-based HA-OSPF router is 14% shorter than that of the PC-based HA-OSPF router. The availabilities (AHA) of the proposed ATCA-based HA-OSPF router with 1+1 redundancy are 99.99999867% and 99.99999905% for a hardware failure and a software failure, respectively, under 1/λ = 7 years and 1/μ = 4 hours [28][29][30]. That is, the proposed ATCA-based HA-OSPF router with 1+1 redundancy can easily meet the carrier-grade requirement of five-nine availability.

Table 7.1: Takeover delays (ms), failure detection and recovery rates (times/hour), and availabilities for ATCA-based and PC-based HA-OSPF routers.

(1/λ=7 years, 1/μ = 4 hours)

Emulation scenario                                      ATCA-based       PC-based
Hardware failure
  Takeover delay (ms)                                   1066 ± 54        1240 ± 12
  Failure detection and recovery rate (times/hour)      3377             2903
  Availability (AHA)                                    99.99999867%     99.99999859%
Software failure
  Takeover delay (ms)                                   217 ± 17         166 ± 9
  Failure detection and recovery rate (times/hour)      16590            21687
  Availability (AHA)                                    99.99999905%     99.99999907%
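The availabilities in Table 7.1 come from the continuous-time Markov chain model developed earlier in the thesis, which is not reproduced in this chapter. As a rough cross-check only, the Python sketch below uses a first-order approximation for 1+1 redundancy, unavailability ≈ λ/δ + 2(λ/μ)². This expression is an assumption inferred from the tabulated values, not the thesis's formula, and it takes one year to be 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a year is counted as 365 days


def approx_availability(delta_per_hour, mttf_years=7.0, mttr_hours=4.0):
    """Approximate availability for 1+1 redundancy (inferred, not the thesis's formula)."""
    lam = 1.0 / (mttf_years * HOURS_PER_YEAR)  # failure rate, per hour
    mu = 1.0 / mttr_hours                      # repair rate, per hour
    unavailability = lam / delta_per_hour + 2.0 * (lam / mu) ** 2
    return 1.0 - unavailability


# Failure detection and recovery rates (times/hour) from Table 7.1.
rates = {
    "ATCA, hardware failure": 3377,
    "ATCA, software failure": 16590,
    "PC, hardware failure": 2903,
    "PC, software failure": 21687,
}

for label, delta in rates.items():
    print(f"{label}: {approx_availability(delta) * 100:.8f}%")
```

With these inputs, the approximation reproduces the four availabilities listed in Table 7.1 to the eight decimal places shown there, which suggests that the λ/δ term (time spent in takeover) and the double-failure term dominate the unavailability.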

According to Table 7.2, the CPU usage of the ATCA-based HA-OSPF router is much lower than that of the PC-based router (0.11% vs. 4.47%). This indicates that the processing capability of an ATCA control card is much greater than that of an ordinary PC.

Table 7.2: CPU usages of HAM middleware and OSPF process for ATCA-based and PC-based HA-OSPF routers.

Emulation architecture    ATCA-based       PC-based
CPU usage                 0.11 ± 0.01 %    4.47 ± 0.73 %

Table 7.3 shows the takeover delays under various polling intervals when a software failure occurred. For polling intervals of 50 ms, 100 ms, and 200 ms, the takeover delays of the ATCA-based HA-OSPF router with 1+1 redundancy were 188 ± 9 ms (δ = 19149 times/hour), 217 ± 17 ms (δ = 16590 times/hour), and 242 ± 26 ms (δ = 14876 times/hour), respectively. Because the control card of the standby router needs several seconds to recover the routing information and send the up-to-date routing table information to the line card [39], the average failure recovery time of the ATCA-based HA-OSPF router (about 150 ms) is greater than that of the PC-based HA-OSPF router (about 100 ms). However, the difference in takeover delay between the two routers shrinks as the polling interval increases (67 ms, 51 ms, and 19 ms at polling intervals of 50 ms, 100 ms, and 200 ms, respectively).

Table 7.3: Takeover delays (ms), failure detection and recovery rates (times/hour), and availabilities for ATCA-based and PC-based HA-OSPF routers with 1+1 redundancy under a software failure (OSPF process failed) and various polling intervals. (1/λ = 7 years, 1/μ = 4 hours)

Polling interval                                        50 ms            100 ms           200 ms
ATCA-based
  Takeover delay (ms)                                   188 ± 9          217 ± 17         242 ± 26
  Failure detection and recovery rate (times/hour)      19149            16590            14876
  Availability (AHA)                                    99.99999906%     99.99999905%     99.99999904%
PC-based
  Takeover delay (ms)                                   121 ± 5          166 ± 9          223 ± 23
  Failure detection and recovery rate (times/hour)      29752            21687            16216
  Availability (AHA)                                    99.99999909%     99.99999907%     99.99999905%
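The average failure recovery times quoted before Table 7.3 (about 150 ms for the ATCA-based router and about 100 ms for the PC-based router) can be roughly recovered from the table if the takeover delay is split into a detection part and a recovery part. The Python sketch below assumes that the mean detection delay is half the polling interval, i.e., that failures occur uniformly at random within a polling period. That assumption is not stated in the text, and the function names are illustrative.

```python
def recovery_estimate_ms(takeover_ms: float, polling_ms: float) -> float:
    """Takeover delay minus the assumed mean detection delay (half the polling interval)."""
    return takeover_ms - polling_ms / 2.0


polling_intervals_ms = [50, 100, 200]
atca_takeover_ms = [188, 217, 242]  # mean values from Table 7.3
pc_takeover_ms = [121, 166, 223]    # mean values from Table 7.3


def mean_recovery(takeovers_ms):
    """Average recovery-time estimate over the three polling intervals."""
    estimates = [
        recovery_estimate_ms(t, p) for t, p in zip(takeovers_ms, polling_intervals_ms)
    ]
    return sum(estimates) / len(estimates)


print(f"ATCA-based: about {mean_recovery(atca_takeover_ms):.0f} ms")
print(f"PC-based:   about {mean_recovery(pc_takeover_ms):.0f} ms")
```

Under this assumption the estimates are about 157 ms and 112 ms, the same order as the figures quoted in the text; how the authors obtained their figures is not stated.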

Table 7.4 shows the takeover delays and failure detection and recovery rates due to a hardware failure (power down) for different down check intervals. When the down check interval is 1000 ms, the takeover delay of the ATCA-based HA-OSPF router is 14% shorter than that of the PC-based HA-OSPF router (1066 ms vs. 1240 ms). At 500 ms the two are nearly identical (743 ms vs. 740 ms), and at 200 ms the ATCA-based router is again faster (331 ms vs. 360 ms). Overall, the ATCA-based HA-OSPF router performed as well as or better than the PC-based HA-OSPF router under a hardware failure.

Table 7.4: Takeover delays (ms), failure detection and recovery rates (times/hour), and availabilities for ATCA-based and PC-based HA-OSPF routers with 1+1 redundancy under a hardware failure (power down) and various down check intervals. (1/λ = 7 years, 1/μ = 4 hours)

Down check interval                                     1000 ms          500 ms           200 ms
ATCA-based
  Takeover delay (ms)                                   1066 ± 54        743 ± 36         331 ± 28
  Failure detection and recovery rate (times/hour)      3377             4845             10876
  Availability (AHA)                                    99.99999867%     99.99999881%     99.99999900%
PC-based
  Takeover delay (ms)                                   1240 ± 12        740 ± 15         360 ± 6
  Failure detection and recovery rate (times/hour)      2903             4865             10000
  Availability (AHA)                                    99.99999859%     99.99999881%     99.99999899%

From Table 7.3 and Table 7.4, the failure detection and recovery rates (δ) of the ATCA-based HA-OSPF router with 1+1 redundancy are at least 3377 times/hour for a hardware failure and at least 14876 times/hour for a software failure. These rates are far higher than 1.632 times/hour, the minimum δ required to obtain five-nine availability. Therefore, we conclude that the proposed ATCA-based HA-OSPF router with 1+1 redundancy can easily achieve carrier-grade five-nine availability.
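The threshold of 1.632 times/hour comes from the availability model derived earlier in the thesis, which is not reproduced here. As an illustrative cross-check, the Python sketch below inverts a first-order approximation for 1+1 redundancy, unavailability ≈ λ/δ + 2(λ/μ)², to find the smallest δ that meets a target availability. The approximation is an assumption inferred from the reported numbers, not the thesis's formula, and the function name is hypothetical.

```python
def min_delta_per_hour(target_availability, mttf_years=7.0, mttr_hours=4.0):
    """Smallest delta meeting the target under the assumed first-order approximation."""
    lam = 1.0 / (mttf_years * 365 * 24)  # failure rate per hour (365-day year assumed)
    mu = 1.0 / mttr_hours                # repair rate per hour
    # Unavailability budget left for takeover time after the double-failure term.
    budget = (1.0 - target_availability) - 2.0 * (lam / mu) ** 2
    if budget <= 0:
        raise ValueError("target is not reachable under this approximation")
    return lam / budget


print(f"{min_delta_per_hour(0.99999):.3f} times/hour")
```

With 1/λ = 7 years and 1/μ = 4 hours, the sketch yields about 1.632 times/hour for a 99.999% target, in line with the value quoted above; the measured rates exceed it by more than three orders of magnitude.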

Chapter 8 Conclusion

We have presented a 5-tuple availability function, A(M, N, λ, μ, δ), to relate the system parameters to the desired availability (ρ), where M, N, λ, μ, and δ are the number of active routers, the number of standby routers, the failure rate, the repair rate, and the failure detection and recovery rate, respectively. By applying this 5-tuple availability function, service providers can determine the minimum number of standby routers required for an HA router to meet the carrier-grade availability requirement (ρ = 99.999%). A continuous-time Markov chain has been used to estimate the steady-state availability of an HA router for different combinations of the numbers of active and standby routers.

The analytical results have shown that the failure detection and recovery rate (δ) is a key parameter for reducing the minimum required number of standby routers. To increase the failure detection and recovery rate, the active router needs to replicate its routing process status and link state information to the standby routers. We have also proposed the HAM (High Availability Management) middleware, which includes the AMF (Availability Management Framework) service, the Checkpoint service, and the Failure Manager. It has been integrated into the proposed HA router to reduce the takeover delay through stateful backup. In addition, we have implemented the proposed HA-OSPF router on a PC-based platform using the N+1 redundancy model (N = 2 in our experiments). Experimental results have shown that the takeover delay of the proposed PC-based HA-OSPF router is slightly shorter than that of the Cisco ASR-1000 series router under the same redundancy model (189 ms vs. 200 ms for 2+1 redundancy). However, unlike the Cisco ASR-1000 series router, our HA-OSPF router does not need specific hardware, and its redundancy model can be adjusted flexibly. We have also implemented the HA-OSPF router on an ATCA platform, which provides an industry-standard modular architecture for efficient, flexible, and reliable router design. The availabilities of the proposed ATCA-based HA-OSPF router with 1+1 redundancy are 99.99999905% for a software failure and 99.99999867% for a hardware failure, under failure detection and recovery rates of δ = 16590 times/hour and δ = 3377 times/hour, respectively, and with the router module data 1/λ = 7 years and 1/μ = 4 hours obtained from Cisco. The experimental results have shown that both the ATCA-based and the PC-based HA-OSPF routers can easily achieve carrier-grade five-nine availability. From the analytical and experimental results, we conclude that the proposed 5-tuple availability function can be used to determine the minimum required number of standby routers, and that the HAM middleware can decrease the takeover delay while meeting carrier-grade availability cost-effectively.

References

[1] N. Budhiraja, K. Marzullo, F. B. Schneider, and S. Toueg, Distributed Systems, 2nd Edition, ACM Press/Addison-Wesley Publishing Co., New York, 1993, pp. 199-216.

[2] W. Kuo and R. Wan, “Recent Advances in Optimal Reliability Allocation,” Studies in Computational Intelligence, Vol. 39, 2007, pp. 1-36.

[3] S. Srivastava, “Redundancy Management for Network Devices,” The 9th Asia-Pacific Conference on Communications, Vol. 3, Sept. 2003, pp. 1157-1162.

[4] A. Mettas, “Reliability Allocation and Optimization for Complex Systems,” Proceedings of the Annual Reliability and Maintainability Symposium, Jan. 2000, pp. 216-221.

[5] R. Hinden, “Virtual Router Redundancy Protocol (VRRP),” RFC 3768, Internet Engineering Task Force (IETF), Apr. 2004.

[6] T. Li, B. Cole, P. Morton, and D. Li, “Cisco Hot Standby Router Protocol (HSRP),” RFC 2281, Internet Engineering Task Force (IETF), Mar. 1998.

[7] J. Li and B. Cole, “Standby Router Protocol,” 5473599, United States Patent, Dec. 1995.

[8] N. Dennis, H. Michael, D. Peter, and M. John, “Method and System for Router Redundancy in a Wide Area Network,” 7554903, United States Patent, Jun. 2009.

[9] J. Ranta, “Router Redundancy and Scalability Using Clustering,” Seminar on Internetworking, Spring 2004.

[10] S. Bommareddy, M. Kale, and S. Chaganty, “System and Method for Routing Message Traffic Using a Cluster of Routers Sharing a Single Logical IP Address Distinct from Unique IP Addresses of the Routers,” 6779039, United States Patent, Aug. 2004.

[11] C.T. Tsai, R.H. Jan, C. Chen, and C.Y. Huang, “Implementation of Highly Available OSPF router on ATCA,” The 13th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'07), Dec. 2007.

[12] C.F. Ho, A. Gupta, M. Grandhi, and A. Bachmutsky, “Router and Routing Protocol Redundancy,” 6910148, United States Patent, June 2005.

[13] T. Bourke, Server Load Balancing, 1st Edition, O'Reilly Media, Aug. 2001.

[14] N. Milanovic and B. Milic, “Automatic Generation of Service Availability Models,” IEEE Transactions on Services Computing, Vol. 4, No. 1, Jan. 2011, pp. 56 – 29.

[15] E.A.P. Alchieri, A.N. Bessani, and J.D.S. Fraga, “A Dependable Infrastructure for Cooperative Web Services Coordination,” IEEE International Conference on Web Services, Sept. 2008.

[16] V. Ermagan, I. Kruger, and M. Menarini, “A Fault Tolerance Approach for Enterprise Applications,” IEEE International Conference on Services Computing, July 2008.

[17] Cisco ASR 1000 Series Aggregation Services Router High Availability: Delivering Carrier-Class Services to Midrange Router, Cisco, http://www.cisco.com/.

[18] Juniper Networks, http://www.juniper.com/.
