Chapter 2 Related work
2.5 Qualitative comparison
We compare the design of DNS SRV [3], redundant hop of proxy server design [3], BIG-IP [11], redundant hop of load balancer design [3], parallel SIP proxy servers using direct routing approach [14], and the proposed FFF qualitatively, as shown in Table 1. The following eight metrics are considered: Configuration (in the terms of “dispatchers * proxy servers”), failover mechanism, load balancing support, redundancy model (n (active) + k (standby), client and/or DNS involved, failover mechanism (failover time), single-point-of-failure problem, scalability (with DNS SRV). We have the following observations:
Firstly, the configuration of each design is listed, only the proposed FFF uses n * m.
Secondly, we found that the designs of redundancy at the server side, such the designs in[3][11][14], all adopt 1+1 redundant SIP servers (dispatchers or proxy servers) by using VRRP. Thirdly, those using the redundant SIP proxy server model, such as the designs in [3]
do not have the load balancing support. Fourthly, the redundancy model of each approach is listed and some of them are not involve with the SIP dispatcher, such as the designs in [3] [11].
Fifthly, only DNS SRV is the client side solution. Sixthly, based on the evaluation results, the proposed FFF has the fastest failover. Seventhly, only BIG-IP would face the single-point-of-failure problem. Eighthly, The FFF and redundant hop of SIP load balancer design can make the SIP dispatchers become all active by using DNS SRV. However, the parallel SIP proxy servers design can not take this advantage due to the limitation of LVS design.
Approach DNS SRV
Table 1. Comparison of different SIP high availability networks
Chapter 3
System design approach
In order to achieve “five nines” availability in SIP high availability networks, a fast failure detection and failover (FFF) scheme is proposed in this sector. The cluster of SIP proxy servers is controlled by centralized job controllers, which are SIP dispatchers. We use an n + k redundancy model for SIP dispatchers. The reason of using a clustered dispatchers design is not only protecting the system from crashing caused by the single-point-of-failure problem but also improving the overall system availability. With the proposed FFF, the carrier class
reliability of SIP networks can be achieved.
Fig. 5 shows the designed architecture of FFF for SIP high availability networks. As shown in Fig. 5, an SAP (service access point) is a virtual IP that is associated with a SIP dispatcher, which is active and is responsible for handling incoming SIP messages. With an SAP, SIP user
clients can access the service correctly without knowing the high availability design at the server side. To have high availability, when an active SIP dispatcher fails, the SAP will be reassociated with the one of the backup SIP dispatchers. The newly associated SIP dispatcher will process SIP requests that are sent to the SAP to avoid service interruption. Clients at different regions (for example, regions A and B) can access respective SAP to enhance scalability.
Fig. 5 . Design architecture of the proposed FFF scheme for SIP high availability networks.
The mechanism of how a backup SIP dispatcher takes over the SAP is maintained by OpenAIS AMF service. Each SIP dispatcher has its own real IP address. As to which SIP dispatcher is given the virtual IP (SAP) to become active is assigned by OpenAIS AMF service. After owing the virtual IP address, the “active” SIP dispatcher would make an announcement and map the virtual IP address to its own MAC address of the network interface. According to the group and redundant settings set in OpenAIS, AMF services running on each SIP dispatcher would keep sending control messages as heartbeats to the
heartbeats of others to detect failures. When the “active” SIP dispatcher fails, the AMF would choose one of the backup (standby) SIP dispatchers to be the “active” SIP dispatcher and inform it to claim that it owns the virtual IP (SAP). In this operation, a gratuitous ARP message is sent by the new active SIP dispatcher in order to refresh the ARP table of the other network devices in the same LAN. This is an important step during the failover because all network devices on the same LAN will send SIP requests and responses to the new serving node after this operation.
In clustered SIP proxy servers, when one of the SIP proxy servers becomes unavailable, the AMF would inform all the SIP dispatchers, and they would stop forwarding SIP messages to the failed SIP proxy server. In order to monitor multiple SIP proxy servers and to achieve load balancing, the OpenAIS checkpoint service was used. Each SIP proxy server uses OpenAIS checkpoint service to keep sending a counting number as a heartbeat to SIP dispatchers. The SIP dispatchers would check the number every 100 ms. Note that this interval of 100 ms is the as to the AMF’s heartbeat interval for the health check of SIP dispatchers. If the number did not increase as expected, the SIP dispatchers would consider the associate SIP proxy server is failed.
Fig. 6 shows the system state diagram, which includes failure detection, failover and recovery operations. Server failures caused by hardware failures in this design can be detected.
The failover operations will then be initiated automatically. The changing of system state and corresponding operation are listed.
Fig. 6 . System state diagram with failure detection, failover and recovery operations.
(1) and (11): After active SIP dispatcher failure is detected, a standby SIP dispatcher claims the SAP to become active, and start to serve incoming messages. (3) and (9): After one of the SIP proxy servers failed, all active SIP dispatchers will stop delivering SIP messages to it. (5), (7) and (14): When all of the SIP dispatchers or SIP proxy servers fail, the system is considered as unavailable. (2) and (12): If one of the failed SIP dispatchers becomes available,
the system will consider it as a standby. (4), (6) and (8): If one of the failed SIP proxy server becomes available, all SIP dispatchers will add it back to the proxy server list.
The use of OpenAIS AMF service for failover is a fast failover scheme because it was designed to check the health condition of a node in a very short period of time. If the “active”
SIP dispatcher fails, the AMF can detect the health condition quickly and informs one of the backups take over its job immediately. In the proposed architecture, the health condition of each SIP proxy server is monitored by all active SIP dispatchers via the checkpoint service of OpenAIS. It also has fast failover time between SIP proxy servers by .using a short heartbeat interval.
The “token” parameter in the AMF was adjusted to reduce the failover time in our design. It does not cause apparently increasing of control messages because it is a timer to decide whether the monitored server is dead if no heartbeat has been received within the “token”
period. Adjusting the token parameter value will not change the generating frequency of heartbeats.
The proposed redundancy model of SIP dispatchers is n +k where n is the number of active SIP dispatchers (or SAPs), since each active SIP dispatcher has one virtual IP address (SAP).
SIP proxy servers is an all-active redundancy model because they all can be accessed by any SIP dispatcher. If SIP user agents know all the IP addresses of SIP dispatchers or have the DNS SRV support, the design of the SIP dispatchers cluster can be enhanced as an all-active redundancy model. This is because all SIP dispatchers can serve incoming messages with its real IP address.
Chapter 4
Evaluation and discussion
SIPp [17] was used as a SIP traffic generator and analyzer in the performance evaluation.
The usages of SIPp include generating SIP requests and responses, checking the operation with implemented SIPStone [18] scenarios, and reporting testing results. The default UAC (user agent client) and UAS (user agent server) scenarios which have already been implemented in SIPp were used to test the proposed SIP high availability network architecture.
Fig. 7 shows the call flow of SIPp UAC and UAS scenario. Each SIP request or response is sent immediately when the corresponding message arrived.
Fig. 8 shows the evaluation environment for FFF. It contains three SIP dispatchers with 2 + 1 redundancy model and three SIP proxy servers with 3 + 0 (all active) redundancy model.
SIPp UAC and UAS were placed at different sides of the proposed SIP network to test the functions and failover operations of dispatchers and proxy servers. All SIP servers (dispatchers and proxy servers) were implemented by OpenSER on Linux operating systems.
Fig. 7. SIPp UAC and UAS scenario.
Fig. 8 . Evaluation environment for FFF.
All SIP servers are running on the virtual machines of one PC. The SIPp user agent client and user agent server were running on the other machine. The testing environment parameters settings are shown in Table 2.
Environment parameter value
CPU (for virtual machines) Pentium 4 3.0G Memory (for virtual machines) 3 Gigabytes
CPU (for SIPp) Pentium 4 m 1.8G
Memory (for SIPp) 512 Megabytes
Call rate (test for failover time) in SIPp 10 calls/sec (default in SIPp) Call rate (test for number of failed calls) in
SIPp
50, 100, 150, 200 calls/sec
Call limit 10000 calls
Token (OpenAIS AMF) 300, 1000 (default in AMF) ms
The call generating speed of SIPp was set as default, 10 packets per second. The default setting of OpenAIS AMF “token” value (1000 ms) and a smaller value (300 ms) were used to test if this parameter would affect the failover time or not. 300 ms is the smallest value allowed in the OpenAIS configuration.
The failover time between the SIP dispatchers is defined as the elapsed time that the standby SIP dispatcher detects the failure of the active SIP dispatcher, takes over IP and serves a new incoming SIP request correctly. In the test of failover time, we induced a service failure by power down the active SIP dispatcher. We compared the SIP dispatcher failover time of ours with that of VRRP, because it is a common used failover mechanism by IP takeover. The VRRP was setup by using an open source demon VRRPd [19] with default settings.
The effect of CPU loading on failure time is first evaluated. Low CPU loading is defined as there is no extra loading of other traffic. Running a heavy CPU program in the background is to simulate the high CPU loading condition that all SIP servers have 100% CPU loading. Fig.
9 shows the effect of CPU loading on failover time. Firstly, the FFF scheme had much shorter failover time than VRRPd. Secondly, the smaller value (300 ms) of the token parameter could reduce the failover time to almost half compared to the default (1000 ms). Thirdly, under the high CPU loading condition, all schemes would have longer failover time, but the FFF still has shorter failover time than the other two. The failover time of the proposed FFF was not affected too much by the increase of CPU loading (0.651 vs. 0.769). The results also show that the FFF scheme reduces the dispatcher failover time by 80% (0.651 vs. 3.163) compared
to VRRP.
Fig. 9. Effect of CPU loading on failover time.
Normally, SIPp will not show the number of failed calls during the test of failover time because it would retransmit the SIP requests when no response is received and the overall retransmission time is much longer than the failover time. Therefore, we turned off SIPp’s retransmission mechanism and computed the number of failed calls caused by the SIP dispatcher failover under different call rates. The retransmission mechanism is only turned off at the UAC side. Fig. shows failed calls caused by a dispatcher failure under different call rates. Using VRRP, it resulted in a much higher number of failed calls because its failover time was much longer than that of FFF. In addition, adjusting the token value would also reduce the number of failed calls. This result shows that the FFF scheme would have fewer failed calls compared to the VRRPd scheme when a SIP dispatcher failed.
Fig. 10. Failed calls caused by dispatcher failure under different call rates.
The SIP proxy server failover time is defined as how long the SIP dispatchers can detect a proxy server failure and stops forwarding the SIP messages to it when the SIP proxy server failed. The OpenSER with default settings was used to compare with the FFF scheme because the OpenSER itself has a SIP proxy failover scheme. Fig. 11 shows the SIP proxy server failover time comparison. The result shows the proposed FFF scheme can detect a failure and react much faster than OpenSER when a SIP proxy server failed. If SIP requests were forwarded to the failed SIP proxy server, the packets would be lost and the call setup procedure would halt until the SIP dispatcher or SIP user agent retransmits the packets to the serving SIP proxy server. That is, using the FFF scheme, the SIP client will encounter less probability of service errors during the SIP proxy server failover time. The results show that the FFF scheme shortens the proxy failover time by 85% (3.89 vs. 24.79) compared to the mechanism provided by OpenSER itself.
Fig. 11. SIP proxy server failover time comparison.
Chapter 5
Analysis and discussion
In this section, based on our proposed high availability SIP network architecture, we analyze theoretical service availability of using multiple SIP dispatchers and multiple SIP proxy servers. The availability of a SIP dispatcher (Adispatcher) can be calculated by the mean time to failure of a SIP dispatcher (MTTFdispatcher) and the mean time to restore of a SIP dispatcher (MTTRdispatcher). The availability of a SIP proxy server (Aproxy) can be calculated similarly. The overall service availability (Aservice) of the network architecture can be calculated by the availability of clustered SIP dispatchers (Adispatcher_cluster) and the availability of clustered SIP proxy servers (Aproxy_cluster). The equations are expressed as follows:
dispatcher
cluster proxy
cluster dispatcher
service A A
A = _ × _
(5)We analyzed the overall service availability based on MTTFdispatcher = MTTFproxy = 5000 hours and MTTFdispatcher = MTTRproxy = 72 [20]. By this setting, the service availability of a single dispatcher Adispatcher or proxy server Aproxy, is about 98.58%. This value is close to the average service availability of a server implemented for real services (98.46%),[21].
Fig. 12 shows the overall service availability in terms of how many nines with respect to various number of dispatchers and proxy servers. The analysis results show the overall service availability will reach a limit if we only increase the number of SIP proxy servers. To increase the overall service availability, we have to increase the number of dispatchers as well as the number of proxy servers. From Fig. 12, we found that to achieve a carrier class reliability of five nines service availability with a minimal cost, we need to install three SIP dispatchers and three SIP proxy servers. This figure also shows that to enhance service availability with a minimal cost, the number of dispatcher should be equal to the number of proxy servers.
Service Availability (Nines)
Fig. 12 Analysis of service availability under different numbers of dispatchers and proxy servers.
The service availability under different MTTFs was also analyzed. We used the same numbers of SIP dispatchers and SIP proxy servers. That is, it is a n * n SIP network. Fig. 13 shows that when the availability (MTTF) of a single server is low (high), it is necessary to use more servers to build a SIP high availability network.
Availability (Nines)
Fig. 13 Analysis of service availability under different MTTFs.
Chapter 6
Conclusion and future work
6.1 Conclusion
The traditional designs of SIP high availability networks have a potential problem of limited overall service availability since they used limited (one or two) dispatchers and multiple SIP proxy servers. This was verified by our analysis results that has the following conclusion.
Increasing the number of SIP proxy servers will increase the service availability initially;
however, the enhancement of the service availability will encounters a bottleneck if SIP dispatchers were not increased proportionally as well. The proposed fast failure detection and failover (FFF) scheme uses multiple SIP dispatchers and multiple SIP proxy servers to reach carrier class reliability. In this thesis, the OpenAIS was used as a high availability middleware to handle the failure detection and failover among SIP dispatchers and SIP proxy servers.
Experimental results have shown that the FFF scheme reduces the SIP dispatcher failover time by 80% compared to the VRRP solution and also shortens the SIP proxy server failover time by 85% compared to the mechanism provided by OpenSER itself. Analysis also shows that using multiple SIP dispatchers is key to raise the service availability in SIP high availability networks.
6.2 Future work
In this thesis, we only consider the hardware failure of servers, either dispatchers or proxy
servers. Failures that occurred on the SIP server software (for example, failed OpenSER processes) will not initiate the failover operation in our design. If an OpenSER process can be registered as an OpenAIS component, then this problem can be solved as well. In addition, a high availability middleware (for example, OpenAIS) can perform faster failover in the condition of software failures by directly monitoring the server demons. These points deserve further study to verify.
Bibliography
[1] A.B. Johnston, SIP: Understanding the Session Initiation Protocol, 2nd Edition, Artech House, 2004.
[2] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: a transport protocol for real time applications RFC 1889, Internet Engineering Force, January 1996.
[3] Cisco Inc., “Overview of High Availability in SIP-Based Voice Networks,” [Online].
Available:
http://www.cisco.com/univercd/cc/td/doc/product/software/ios123/123cgcr/vvfax_c/callc_
c/sip_c/sipha_c/hachap1.htm
[4] CR. Johnson, Y. Kogan, Y. Levy, F. Saheban, P. Tarapore, “VoIP Reliability: A Service Provider’s Perspective,” IEEE Communications Magazine, July, 2004, pp. 48-54.
[5] OpenAIS at Open Source Development Labs, Beaverton, OR, USA. [Online]. Available:
http://developer.osdl.org/dev/openais/.
[6] Service Availability Forum. [Online]. Available: http://www.saforum.org.
[7] A. Kamalvanshi, T. Jokiaho, “Using Open AIS for Building Highly Available Session Initiation Protocol (SIP) Registrar,” Nokia Corporation, SA Forum, in Proc. the Third Annual International Service Availability Symposium, May 2006, pp. 217-228.
[8] K. Singh, H. Schulzrinne, “Failover, load sharing and server architecture in SIP telephony,” Computer Communications, Volume 30, Issue 5, March, 2007, pp. 927-942.
[9] M. Mealling, R. W. Daniel, “The Naming Authority Pointer (NAPTR) DNS Resource Record,” RFC 2915, Internet Engineering Task Force, Sept. 2000.
[10] J. Rosenberg and H. Schulzrinne, “Session initiation protocol (SIP): locating SIP servers,” RFC 3263, Internet Engineering Task Force, June 2002.
[11] F5 Networks Inc., “A New Paradigm for SIP High Availability and Reliability,” [Online].
Available: http://www.f5.com/solutions/technology/pdfs/sip_wp.pdf.
[12] “OpenSER,” [Online]. Available: http://www.openser.org/.
[13] S. Knight, D. Weaver, D. Whipple, R. Hinden, D. Mitzel, P. Hunt, P. Higginson, M.
Shand, and A. Lindem. RFC 2338: Virtual Router Redundancy Protocol, April 1998.
[14] V. Matic, I. Franicevic, D. Sekalec, “Parallel SIP Proxy Servers Using Direct Routing Approach,“ in Proc. International Conference on Software in Telecommunications and Computer Networks, Sept. 2006, pp. 218-222.
[15] “Linux Virtual Server,” [Online]. Available: http://www.linuxvirtualserver.org/.
[16] “Keepalived,” [Online]. Available: http://www.keepalived.org/.
[17] “SIPp - Test Tool/Traffic Generator for the SIP Protocol,” Mar 2006, [Online].
Available: http://sipp.sourceforge.net.
[18] H. Schulzrinne, S. Narayanan, J. Lennox, and M. Doyle, “SIPstone - Benchmarking SIP Server Performance,” Apr 2002, [Online]. Available: http://www.sipstone.com/.
[19] “VRRPD,” [Online]. Available: http://off.net/~jme/vrrpd/.
[20] K. Uhlemann, C. Engelmann, and S. L. Scott, “JOSHUA: Symmetric Active/active Replication for Highly Available HPC Job and Resource Management,” in Proc. IEEE International Conference on Cluster Computing, Sept. 2006, pp. 1-10.
[21] J. Neises, “Benefit Evaluation of High-Availability Middleware,” Service Availability, in Proc. First International Service Availability Symposium, May, 2004, pp. 73-85.