
National Taiwan University of Science and Technology, Taiwan, Republic of China

3 Reliability assessments

The information system built using the services described has many favourable features, and it can be deployed flexibly. This flexibility, however, causes difficulties in system configuration and deployment. How many DS should one use to handle 10,000 sensors? How many FS should be assigned to one DS? These questions cannot be answered easily without some quantitative advice.

Therefore, we suggest using reliability analyses to help answer these questions. In the following, we discuss four assessments. They are chosen because they either represent the most critical operations in the system or describe its most important characteristics.

3.1 Service invocation reliability

CS, DS, and FS collaborate with each other to build a functional information system for storing and managing monitoring data. The interactions between them are based on web-service invocations.

Assume the client computer tries to invoke a service offered by the server computer. First, the client sends a service request over its Internet connection. The request is received through the Internet connection of the server. The server then processes the request and sends the response over its Internet connection. Finally, the response is received by the client over its Internet connection. In this process, failure can occur on 1) the client when sending requests, 2) the Internet connection of the client when sending requests, 3) the Internet connection of the server when receiving requests, 4) the server, 5) the Internet connection of the server when sending responses, 6) the Internet connection of the client when receiving responses, and 7) the client when receiving responses.

We assume 1) failures are owing to hardware and networking issues, 2) the software is implemented perfectly, and 3) all processes involved are independent. Under these assumptions, the reliability of service invocation can be estimated using Eq. 1.

FSI = 1 − (1 − FC)^2 (1 − FIC)^2 (1 − FIS)^2 (1 − FS) (1)

In Eq. 1, FSI, FC, FIC, FIS, and FS represent the failure probabilities of the service invocation, the client computer, the Internet connection of the client computer, the Internet connection of the server computer, and the server computer. If these probabilities are all 3%, Eq. 1 suggests FSI is 19.2%. In other words, the service invocation is more likely to fail than any individual component or process involved. This result seems to argue against using distributed computing for building information systems, but it holds only if the software is implemented poorly. In network programming, retry mechanisms are often implemented to fight against unreliable networking. Eq. 2 and Eq. 3 estimate the failure probability of service invocation with the retry mechanism taken into account.
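As a quick numerical check of Eq. 1, a minimal sketch in Python, using the 3% failure probabilities assumed above:

```python
def f_service_invocation(f_c, f_ic, f_is, f_s):
    # Eq. 1: the client and both Internet connections each act twice
    # (once sending, once receiving); the server acts once.
    return 1 - (1 - f_c)**2 * (1 - f_ic)**2 * (1 - f_is)**2 * (1 - f_s)

# With every failure probability at 3%:
f_si = f_service_invocation(0.03, 0.03, 0.03, 0.03)
print(round(f_si, 3))  # 0.192, i.e. the 19.2% quoted above
```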

FSIretry = 1 − (1 − FC) (1 − Fretry^NT) (2)

Fretry = 1 − (1 − FC) (1 − FS) (1 − FIC)^2 (1 − FIS)^2 (3)

In Eq. 2, FSIretry estimates the failure probability of service invocation after NT attempts. Eq. 3 computes the failure probability of one retrial under all the above assumptions. In addition, Eq. 3 assumes that if the client computer spots a failure in the former service invocation, it will retry the invocation without failing itself. Fig. 4 shows how FSIretry decreases with an increasing number of retrials. It can be seen that the retry mechanism is very effective at making service invocation reliable. It also shows that FSIretry ultimately approaches FC. In other words, given a sufficient number of retrials, the reliability of service invocation is determined solely by the reliability of the client computer. This result also suggests service invocation can be as reliable as a local function call: both failure probabilities are determined solely by the computer making the function call or service invocation.
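The behaviour of Eqs. 2 and 3 can be sketched the same way; with NT = 1 the result collapses back to Eq. 1, and larger NT approaches FC:

```python
def f_retry(f_c, f_s, f_ic, f_is):
    # Eq. 3: one retrial; the retrying client is assumed not to fail
    # while detecting the previous failure, so F_C appears only once.
    return 1 - (1 - f_c) * (1 - f_s) * (1 - f_ic)**2 * (1 - f_is)**2

def f_si_retry(nt, f_c, f_s, f_ic, f_is):
    # Eq. 2: failure probability of service invocation after NT attempts.
    return 1 - (1 - f_c) * (1 - f_retry(f_c, f_s, f_ic, f_is) ** nt)

for nt in (1, 3, 9):
    print(nt, round(f_si_retry(nt, 0.03, 0.03, 0.03, 0.03), 4))
# NT = 1 reproduces the 19.2% of Eq. 1; NT = 3 gives 3.45%;
# larger NT approaches F_C = 3%.
```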

Figure 4. Reliability of service invocation affected by number of retrials

3.2 Data transmission reliability

Once the reliability of service invocation is established as in Eq. 2, we can use it to evaluate various processes in the information system. Such evaluation helps us quantify the reliability of functions in the system and, more importantly, helps us design reliable processes. The following uses data transmission, one of the most critical functions of monitoring systems, as an example of evaluating process reliability.

Data transmission refers to sending monitoring data from FS to DS. This process, in the design, starts from FS making inquiries to CS to find its associated DS. Then, FS sends monitoring data to designated DS to store them. Therefore, there are two service invocations involved in this process.

Both invocations must be successful for the process to succeed. Assume all previous assumptions hold and at most three invocation attempts (NT = 3) are allowed for each service invocation. The failure probability of such a service invocation, according to Eq. 2 or Fig. 4, is 3.45%. Since one data transmission invokes two consecutive services, the failure probability according to Eq. 4 is 6.78%.

FTR = 1 − (1 − FSIretry)^2 = 0.0678 (4)

Data transmission is a critical function of the monitoring system, yet it has a somewhat high probability of failure. Therefore, it is necessary to improve this process. One solution is to cache the designated DS information. The reason the process described needs two service invocations is that FS needs to fetch DS information from CS. By caching DS information on FS, we can eliminate one service invocation for each data transmission. The failure probability for data transmission then drops from 6.78% to 3.45%:

FTRcaching = 1 − (1 − FSIretry)^1 = 0.0345 (5)

The distributed architecture described in Section 2 requires many processes to make it complete. These processes consist of service invocations, so assessments similar to the one described can be applied to them. Such assessments help evaluate and redesign processes to make the distributed system more reliable.
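A minimal sketch of Eqs. 4 and 5, treating the number of invocations per transmission as a parameter (two without caching, one with the cached DS information):

```python
def f_transmission(f_si_retry, n_invocations):
    # Eqs. 4 and 5: every service invocation in the process must succeed.
    return 1 - (1 - f_si_retry) ** n_invocations

print(round(f_transmission(0.0345, 2), 4))  # 0.0678: FS asks CS, then sends to DS
print(round(f_transmission(0.0345, 1), 4))  # 0.0345: DS info cached on FS
```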

[Figure 4 here: failure probability of service invocation with retrial versus the number of invocation attempts (1–9), plotted from 0% to 20% together with the failure probability of client computers.]

3.3 Server-level storage reliability

Data storage is also an important function of monitoring systems. A typical client-server architecture stores all monitoring data on one computer, with the data kept on hard disks. Hard drives, compared with other components in computer systems, break easily owing to their mechanical nature. According to a large-scale survey done by Google (Pinheiro et al. 2007) in their data centers, the AFR (annual failure rate) of hard drives is 2% in their first service year and becomes 7% once the service life exceeds two years. Therefore, if data must be kept long term, it is necessary to evaluate and develop solutions that compensate for this relatively unreliable storage medium.

The information system design described in Section 2 can provide two levels of protection. At the first level, one may use RAID (redundant array of independent disks) on servers to keep data safe. This is the server-level storage reliability. The distributed information system can additionally provide system-level data protection, which is described in Section 3.4.

RAID has been protecting data on servers for many years. Modern computer mainboards now have this capability built in. There are many RAID levels. Different levels provide different characteristics. Some RAID levels provide speed, and some provide data safety. In the following, we discuss two commonly used RAID levels and their reliabilities.

RAID-1, also known as mirroring, mirrors the content of one hard drive onto others. In this configuration, data are lost only when all drives constituting the disk array fail. Assuming the failure probability of one hard drive is Fhd, the failure probability FRAID1 of a RAID-1 array built from Nhd drives is:

FRAID1 = Fhd^Nhd (6)

If Fhd is 2%, then a RAID-1 array with two drives has a failure probability of 0.04%. A RAID-1 array with four drives would have a very low failure probability of 1.6 × 10^-5 %. Therefore, RAID-1 is very effective at reducing the risk of losing data.
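A minimal sketch of Eq. 6, reproducing the two figures above:

```python
def f_raid1(f_hd, n_hd):
    # Eq. 6: a RAID-1 array loses data only when every mirror fails.
    return f_hd ** n_hd

print(round(f_raid1(0.02, 2), 6))  # 0.0004, i.e. 0.04 %
print(round(f_raid1(0.02, 4), 9))  # 1.6e-07, i.e. 1.6 x 10^-5 %
```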

Another commonly used RAID configuration is RAID-5. This configuration needs at least three disks and allows one disk failure without losing data. The failure probability of RAID-5 arrays is:

FRAID5 = 1 − C(Nhd, 0) (1 − Fhd)^Nhd − C(Nhd, 1) Fhd (1 − Fhd)^(Nhd − 1) (7)

Thus, a RAID-5 array with four drives has a failure probability of 0.23%, much higher than that of a RAID-1 array. By using Eq. 6 or Eq. 7, we can assess the reliability of data storage on servers running DS. It must be noted, however, that these equations assume hard-disk failures are independent, whereas disk failures are sometimes correlated through improper ventilation, power fluctuations, etc. The assumption is therefore optimistic, and the true failure probability could be higher than the values calculated using Eqs. 6 and 7.
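Eq. 7 can be checked the same way; `math.comb` supplies the binomial coefficients C(Nhd, k):

```python
from math import comb

def f_raid5(f_hd, n_hd):
    # Eq. 7: the array survives zero or exactly one drive failure.
    survive = comb(n_hd, 0) * (1 - f_hd) ** n_hd \
            + comb(n_hd, 1) * f_hd * (1 - f_hd) ** (n_hd - 1)
    return 1 - survive

print(round(f_raid5(0.02, 4), 4))  # 0.0023, the 0.23% quoted above
```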

3.4 System-level storage reliability

As suggested earlier, the system designed can associate many DS with one FS to keep data at different servers. We refer to this as the system-level storage reliability. System-level redundancy is essentially the same as the RAID-1 configuration. In other words, assuming one FS is associated with NDS DS, the probability of data loss, FSYS, is:

FSYS = FDS^NDS (8)

FDS is the failure probability of the disks hosting a DS. It could be the simple disk failure probability, or it could be calculated from Eq. 6 or Eq. 7, depending on the disk configuration of the server.

Eq. 8 also bears the assumption of independent server disk failures, as in the previous discussions. It must be emphasized that the independence assumption at the system level is more likely to hold than the assumption of independent disks within one machine. Further, system-level data protection can provide data redundancy across geographical locations. Therefore, it can protect against extreme events, such as flooding or fire hazards at data centres in one location.

Eqs. 6–8 help system administrators configure the information system. This is done by setting a target level of data-storage reliability; once the target is set, one can easily solve these equations to find the number of disks or servers needed.
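A minimal sketch of that sizing step, inverting Eq. 8; the per-server failure probability and the reliability target used below are illustrative assumptions, not values from the text:

```python
from math import ceil, log

def servers_needed(f_ds, target):
    # Invert Eq. 8: smallest N_DS such that F_DS ** N_DS <= target.
    return ceil(log(target) / log(f_ds))

# Hypothetical example: each DS kept on a four-drive RAID-5 array
# (F_DS = 0.23% from Eq. 7) and a data-loss target of one in a million.
print(servers_needed(0.0023, 1e-6))  # 3
```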
