
Security, in this case, is less likely to be an issue, as the pump will not maintain confidential information and is not networked, so it cannot be maliciously attacked.

3.3 Availability and reliability

System availability and reliability are closely related properties that can both be expressed as numerical probabilities. The reliability of a system is the probability that the system's services will be correctly delivered as specified. The availability of a system is the probability that the system will be up and running to deliver these services to users when they request them.

Although they are closely related, you cannot assume that reliable systems will always be available and vice versa. For example, some systems can have a high availability requirement but a much lower reliability requirement. If users expect continuous service, then the availability requirements are high. However, if the consequences of a failure are minimal and the system can recover quickly from these failures, then the same system can have low reliability requirements.

An example of a system where availability is more critical than reliability is a telephone exchange switch. Users expect a dial tone when they pick up a telephone, so the system has high availability requirements. However, if a system fault causes a connection to fail, this is often recoverable. Exchange switches usually include repair facilities that can reset the system and retry the connection attempt. This can be done very quickly, and the phone user may not even notice that a failure has occurred. Therefore, availability rather than reliability is the key dependability requirement for these systems.

A further distinction between these characteristics is that availability does not simply depend on the system itself but also on the time needed to repair the faults that make the system unavailable. Therefore, if system A fails once per year, and system B fails once per month, then A is clearly more reliable than B. However, assume that system A takes three days to restart after a failure, whereas system B takes 10 minutes to restart. The availability of system B over the year (120 minutes of down time) is much better than that of system A (4,320 minutes of down time).
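The downtime comparison above can be checked with a short calculation. The sketch below (Python; the availability helper is illustrative and not from the text, although the numbers plugged into it are the ones used in the example) treats availability as the fraction of the year that each system is up:

    MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes

    def availability(failures_per_year, repair_minutes):
        """Fraction of the year the system is up, ignoring planned downtime."""
        downtime = failures_per_year * repair_minutes
        return 1 - downtime / MINUTES_PER_YEAR

    print(availability(1, 3 * 24 * 60))   # System A: ~0.992 (4,320 minutes down)
    print(availability(12, 10))           # System B: ~0.9998 (120 minutes down)

So although system A fails far less often, its availability over the year is roughly 99.2%, compared with about 99.98% for system B.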

System reliability and availability may be defined more precisely as follows:

1. Reliability: The probability of failure-free operation over a specified time in a given environment for a specific purpose.

2. Availability: The probability that a system, at a point in time, will be operational and able to deliver the requested services. (A conventional probabilistic formulation of both definitions is sketched below.)
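These definitions are often written probabilistically. The notation below is a conventional formulation rather than one given in this chapter; the steady-state expression in terms of mean time to failure (MTTF) and mean time to repair (MTTR) is a widely used approximation that links the two attributes:

    R(t) = \Pr(\text{no failure occurs in the interval } [0, t])
    A(t) = \Pr(\text{the system is operational at time } t)
    A_{ss} \approx \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}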

One of the practical problems in developing reliable systems is that our intuitive notions of reliability and availability are sometimes broader than these limited definitions. The definition of reliability states that the environment in which the system is used and the purpose that it is used for must be taken into account.

If you measure system reliability in one environment, you can't assume that the reliability will be the same in another environment where the system is used in a different way.


For example, let's say that you measure the reliability of a word processor in an office environment where most users are uninterested in the operation of the software. They follow the instructions for its use and do not try to experiment with the system. If you measure the reliability of the same system in a university environment, then the reliability may be quite different. Here, students may explore the boundaries of the system and use the system in unexpected ways. These may result in system failures that did not occur in the more constrained office environment.

Human perceptions and patterns of use are also significant. For example, say a car has a fault in its windscreen wiper system that results in intermittent failures of the wipers to operate correctly in heavy rain. The reliability of that system as perceived by a driver depends on where they live and use the car. A driver in Seattle (wet climate) will probably be more affected by this failure than a driver in Las Vegas (dry climate). The Seattle driver's perception will be that the system is unreliable, whereas the driver in Las Vegas may never notice the problem.

A further difficulty with these definitions is that they do not take into account the severity of failure or the consequences of unavailability. People, naturally, are more concerned about system failures that have serious consequences, and their perception of system reliability is influenced by these consequences. For example, say a failure of initialisation in the engine management software causes a car engine to cut out immediately after starting, but it operates correctly after a restart that corrects the initialisation problem. This does not affect the normal operation of the car, and many drivers would not think that a repair was needed. By contrast, most drivers will think that an engine that cuts out while they are driving at high speed once per month (say) is both unreliable and unsafe and must be repaired.

A strict definition of reliability relates the system implementation to its specification. That is, the system is behaving reliably if its behaviour is consistent with that defined in the specification. However, a common cause of perceived unreliability is that the system specification does not match the expectations of the system users. Unfortunately, many specifications are incomplete or incorrect and it is left to software engineers to interpret how the system should behave. As they are not domain experts, they may not, therefore, implement the behaviour that users expect.

Reliability and availability are compromised by system failures. These may be a failure to provide a service, a failure to deliver a service as specified, or the delivery of a service in such a way that is unsafe or insecure. Some of these failures are a consequence of specification errors or failures in associated systems such as a telecommunications system. However, many failures are a consequence of erroneous system behaviour that derives from faults in the system. When discussing reliability, it is helpful to distinguish between the terms fault, error and failure. I have defined these terms in Figure 3.5.

Human errors do not inevitably lead to system failures. The faults introduced may be in parts of the system that are never used. Faults do not necessarily result in system errors, as the faulty state may be transient and may be corrected before erroneous behaviour occurs. System errors may not result in system failures, as the behaviour may also be transient and have no observable effects, or the system may include protection that ensures that the erroneous behaviour is discovered and corrected before the system services are affected.

Figure 3.5 Reliability terminology

Term                     Description

System failure           An event that occurs at some point in time when the system does not deliver a service as expected by its users.

System error             An erroneous system state that can lead to system behaviour that is unexpected by system users.

System fault             A characteristic of a software system that can lead to a system error. For example, failure to initialise a variable could lead to that variable having the wrong value when it is used.

Human error or mistake   Human behaviour that results in the introduction of faults into a system.

This distinction between the terms shown in Figure 3.5 helps us identify three complementary approaches that are used to improve the reliability of a system:

1. Fault avoidance: Development techniques are used that either minimise the possibility of mistakes and/or trap mistakes before they result in the introduction of system faults. Examples of such techniques include avoiding error-prone programming language constructs such as pointers and the use of static analysis to detect program anomalies.

2. Fault detection and removal: The use of verification and validation techniques that increase the chances that faults will be detected and removed before the system is used. Systematic system testing and debugging is an example of a fault-detection technique.

3. Fault tolerance: Techniques that ensure that faults in a system do not result in system errors or that ensure that system errors do not result in system failures.

The incorporation of self-checking facilities in a system and the use of redundant system modules are examples of fault tolerance techniques; a small sketch of the redundancy idea follows below.
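To make the redundant-modules idea in item 3 concrete, here is a minimal sketch (Python; the voter function and the deliberately faulty version are invented for illustration, not taken from the text) of triple modular redundancy: three independently written versions compute the same result, and a majority vote masks a single faulty module.

    def tmr_vote(versions, x):
        """Run the redundant versions on the same input and return the majority result."""
        results = [version(x) for version in versions]
        for result in results:
            if results.count(result) >= 2:
                return result
        raise RuntimeError("No majority: more than one version has failed")

    version_a = lambda x: abs(x)
    version_b = lambda x: x              # faulty for negative inputs
    version_c = lambda x: (x * x) ** 0.5

    print(tmr_vote([version_a, version_b, version_c], -4))   # 4 -- the fault is masked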

I cover the development of fault tolerant systems in Chapter 20, where I also discuss some techniques for fault avoidance. I discuss process-based approaches to fault avoidance in Chapter 27 and fault detection in Chapters 22 and 23.

Software faults cause software failures when the faulty code is executed with a set of inputs that expose the software fault. The code works properly for most inputs.

Figure 3.6, derived from Littlewood (Littlewood, 1990), shows a software system as a mapping of an input set to an output set. Given an input or input sequence, the program responds by producing a corresponding output. For example, given an input of a URL, a web browser produces an output that is the display of the requested web page.

Figure 3.6 A system as an input/output mapping

Figure 3.7 Software usage patterns

Some of these inputs or input combinations, shown in the shaded ellipse in Figure 3.6, cause erroneous outputs to be generated. The software reliability is related to the probability that, in a particular execution of the program, the system input will be a member of the set of inputs that cause an erroneous output to occur. If an input causing an erroneous output is associated with a frequently used part of the program, then failures will be frequent. However, if it is associated with rarely used code, then users will hardly ever see failures.

Each user of a system uses it in different ways. Faults that affect the reliability of the system for one user may never be revealed under someone else's mode of working (Figure 3.7). In Figure 3.7, the set of erroneous inputs corresponds to the shaded ellipse in Figure 3.6. The set of inputs produced by User 2 intersects with this erroneous input set. User 2 will therefore experience some system failures. User 1 and User 3, however, never use inputs from the erroneous set. For them, the software will always be reliable.
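The effect of usage patterns on perceived reliability can be illustrated with a small simulation. In the sketch below (Python; the erroneous input set and the two user profiles are invented for illustration), the same software appears completely reliable to one user and frequently faulty to another, depending on how often that user's inputs fall inside the erroneous set:

    import random

    ERRONEOUS_INPUTS = {17, 23, 42}          # inputs that expose a fault

    def observed_reliability(input_profile, trials=100_000):
        """Fraction of demands that do not hit the erroneous input set."""
        failures = sum(1 for _ in range(trials)
                       if random.choice(input_profile) in ERRONEOUS_INPUTS)
        return 1 - failures / trials

    user_1 = list(range(0, 10))              # never selects an erroneous input
    user_2 = list(range(15, 25))             # profile overlaps the erroneous set

    print(observed_reliability(user_1))      # ~1.0: the software appears reliable
    print(observed_reliability(user_2))      # ~0.8: the same software fails often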


The overall reliability of a program, therefore, mostly depends on the number of inputs causing erroneous outputs during normal use of the system by most users. Software faults that occur only in exceptional situations have little effect on the system's reliability. Removing software faults from parts of the system that are rarely used makes little real difference to the reliability as seen by system users. Mills et al. (Mills et al., 1987) found that removing 60% of known errors in their software led to only a 3% reliability improvement. Adams (Adams, 1984), in a study of IBM software products, noted that many defects in the products were only likely to cause failures after hundreds or thousands of months of product usage.

Users in a socio-technical system may adapt to software with known faults, and may share information about how to get around these problems. They may avoid using inputs that are known to cause problems, so program failures never arise. Furthermore, experienced users often 'work around' software faults that are known to cause failures. They deliberately avoid using system features that they know can cause problems for them. For example, I avoid certain features, such as automatic numbering, in the word processing system that I used to write this book. Repairing the faults in these features may make no practical difference to the reliability as seen by these users.

3.4 Safety

Safety-critical systems are systems where it is essential that system operation is always safe. That is, the system should never damage people or the system's environment even if the system fails. Examples of safety-critical systems are control and monitoring systems in aircraft, process control systems in chemical and pharmaceutical plants and automobile control systems.

Hardware control of safety-critical systems is simpler to implement and analyse than software control. However, we now build systems of such complexity that they cannot be controlled by hardware alone. Some software control is essential because of the need to manage large numbers of sensors and actuators with complex control laws. An example of such complexity is found in advanced, aerodynamically unstable military aircraft. They require continual software-controlled adjustment of their flight surfaces to ensure that they do not crash.

Safety-critical software falls into two classes:

1. Primary safety-critical software: This is software that is embedded as a controller in a system. Malfunctioning of such software can cause a hardware malfunction, which results in human injury or environmental damage. I focus on this type of software.

2. Secondary safety-critical software: This is software that can indirectly result in injury. Examples of such systems are computer-aided engineering design systems whose malfunctioning might result in a design fault in the object being designed. This fault may cause injury to people if the designed system malfunctions. Another example of a secondary safety-critical system is a medical database holding details of drugs administered to patients. Errors in this system might result in an incorrect drug dosage being administered.

System reliability and system safety are related but separate dependability attributes. Of course, a safety-critical system should be reliable in that it should conform to its specification and operate without failures. It may incorporate fault-tolerant features so that it can provide continuous service even if faults occur. However, fault-tolerant systems are not necessarily safe. The software may still malfunction and cause system behaviour that results in an accident.

Apart from the fact that we can never be 100% certain that a software system is fault-free and fault-tolerant, there are several other reasons why software systems that are reliable are not necessarily safe:

1. The specification may be incomplete in that it does not describe the required behaviour of the system in some critical situations. A high percentage of system malfunctions (Nakajo and Kume, 1991; Lutz, 1993) are the result of specification rather than design errors. In a study of errors in embedded systems, Lutz concludes:

... difficulties with requirements are the key root cause of the safety-related software errors which have persisted until integration and system testing.

2. Hardware malfunctions may cause the system to behave in an unpredictable way and may present the software with an unanticipated environment. When components are close to failure, they may behave erratically and generate signals that are outside the ranges that can be handled by the software.

3. The system operators may generate inputs that are not individually incorrect but which, in some situations, can lead to a system malfunction. An anecdotal example of this is when a mechanic instructed the utility management software on an aircraft to raise the undercarriage. The software carried out the mechanic's instruction perfectly. Unfortunately, the plane was on the ground at the time; clearly, the system should have disallowed the command unless the plane was in the air. A small sketch of such a check follows this list.
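As a concrete illustration of item 3, the sketch below (Python; the function name and the weight-on-wheels signal are invented for illustration, not taken from any real avionics system) shows the kind of state check that would have rejected the mechanic's command:

    def raise_undercarriage(weight_on_wheels: bool) -> str:
        """Refuse to retract the landing gear while the aircraft is on the ground."""
        if weight_on_wheels:
            return "Command rejected: aircraft is on the ground"
        return "Undercarriage raised"

    print(raise_undercarriage(weight_on_wheels=True))    # rejected on the ground
    print(raise_undercarriage(weight_on_wheels=False))   # permitted in the air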

A specialised vocabulary has evolved to discuss safety-critical systems, and it is important to understand the specific terms used. In Figure 3.8, I show some definitions that I have adapted from terms initially defined by Leveson (Leveson, 1985).

Figure 3.8 Safety terminology

Term                   Description

Accident (or mishap)   An unplanned event or sequence of events which results in human death or injury, damage to property or to the environment. A computer-controlled machine injuring its operator is an example of an accident.

Hazard                 A condition with the potential for causing or contributing to an accident. A failure of the sensor that detects an obstacle in front of a machine is an example of a hazard.

Damage                 A measure of the loss resulting from a mishap. Damage can range from many people killed as a result of an accident to minor injury or property damage.

Hazard severity        An assessment of the worst possible damage that could result from a particular hazard. Hazard severity can range from catastrophic, where many people are killed, to minor, where only minor damage results.

Hazard probability     The probability of the events occurring which create a hazard. Probability values tend to be arbitrary but range from probable (say a 1/100 chance of a hazard occurring) to implausible (no conceivable situations are likely where the hazard could occur).

Risk                   This is a measure of the probability that the system will cause an accident. The risk is assessed by considering the hazard probability, the hazard severity and the probability that a hazard will result in an accident.

The key to assuring safety is to ensure either that accidents do not occur or that the consequences of an accident are minimal. This can be achieved in three complementary ways:

1. Hazard avoidance: The system is designed so that hazards are avoided. For example, a cutting system that requires the operator to press two separate buttons at the same time to operate the machine avoids the hazard of the operator's hands being in the blade pathway.

2. Hazard detection and removal: The system is designed so that hazards are detected and removed before they result in an accident. For example, a chemical plant system may detect excessive pressure and open a relief valve to reduce the pressure before an explosion occurs (a small control-loop sketch of this idea follows the list).

3. Damage limitation: The system may include protection features that minimise the damage that may result from an accident. For example, an aircraft engine normally includes automatic fire extinguishers. If a fire occurs, it can often be controlled before it poses a threat to the aircraft.
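The hazard detection and removal approach in item 2 can be sketched as a simple control loop. The example below (Python; the pressure threshold and the valve interface are invented for illustration) opens a relief valve whenever the measured pressure exceeds a safe limit and closes it again once the pressure has fallen well below that limit:

    SAFE_PRESSURE_LIMIT = 8.0   # bar; illustrative threshold

    def control_step(pressure, relief_valve_open):
        """Decide the relief valve state for one control cycle."""
        if pressure > SAFE_PRESSURE_LIMIT:
            return True                  # hazard detected: vent before an explosion
        if pressure < 0.5 * SAFE_PRESSURE_LIMIT:
            return False                 # pressure back to normal: close the valve
        return relief_valve_open         # otherwise hold the current state

    print(control_step(9.2, relief_valve_open=False))   # True  -- valve opens
    print(control_step(3.0, relief_valve_open=True))    # False -- valve closes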

Accidents generally occur when several things go wrong at the same time. An analysis of serious accidents (Perrow, 1984) suggests that they were almost all due to a combination of malfunctions rather than single failures. The unanticipated combination led to interactions that resulted in system failure. Perrow also suggests that it is impossible to anticipate all possible combinations of system malfunction, and that accidents are an inevitable part of using complex systems. Software tends to increase system complexity, so using software control may increase the probability of system accidents.

However, software control and monitoring can also improve the safety of systems. Software-controlled systems can monitor a wider range of conditions than electromechanical systems. They can be adapted relatively easily. They involve the use of computer hardware, which has very high inherent reliability and which is physically small and lightweight. Software-controlled systems can provide sophisticated safety interlocks. They can support control strategies that reduce the amount of time people need to spend in hazardous environments. Therefore, although software control may introduce more ways in which a system can go wrong, it also allows better monitoring and protection and hence may improve the safety of the system.

In all cases, it is important to maintain a sense of proportion about system safety.

It is impossible to make a system 100% safe, and society has to decide whether or
