Computing System Failure Frequencies and Reliability Importance Measures using OBDD


Yung-Ruei Chang, Student Member, IEEE, Suprasad V. Amari, Member, IEEE, and

Sy-Yen Kuo, Fellow, IEEE

Abstract—The recent literature showed that, in many cases, Ordered Binary Decision Diagram (OBDD)-based algorithms are more efficient in reliability evaluation compared to other methods such as the Inclusion-Exclusion (I-E) method and the sum of disjoint products (SDP) method. This paper presents algorithms based on OBDD to compute system failure frequencies and reliability importance measures. Methods are presented to calculate both steady-state and time-specific frequencies of system-failure as well as system-success. The reliability importance measures discussed in this paper include the Birnbaum importance, the Criticality importance, and other indices for the risk evaluation of a system. In addition, we propose an efficient approach based on OBDD to evaluate the reliability of a nonrepairable system and the availability of a repairable system with imperfect fault-coverage mechanisms. The powerful capability of OBDD for reliability evaluation is fully exploited in this paper. Further, we extend all of the proposed algorithms in this paper to analyze systems with imperfect fault-coverage.

Index Terms—Failure frequency, reliability importance measure, BDD, imperfect coverage, system availability, fault tolerance.


1 INTRODUCTION

The reliability analysis of a system includes not only the reliability and availability calculations, but also the calculations of other indices related to reliability. Some of the important indices include the Mean Time To Failure (MTTF), the Mean Time Between Failures (MTBF), the failure frequency, and the reliability importance measures. Recent papers [1], [2], [3], [4], [5], [6], [7] showed that, in many cases, Ordered Binary Decision Diagram (OBDD)-based algorithms are more efficient in reliability evaluation compared to other methods such as the Inclusion-Exclusion (I-E) method [8] and the sum of disjoint products (SDP) method [9], [10], [11]. However, the advantages of OBDD have not been explored to compute other reliability indices such as system failure frequency and reliability importance measures.

Many researchers have evaluated the system availability (reliability) using various methods, such as the I-E method and the SDP method. However, recent literature on the reliability or availability evaluation of a system with s-independent components showed that OBDD is efficient in terms of computational time and accuracy. Although there has been considerable interest in methods for calculating the system failure frequencies [4], [8], [11], [12], [13], [14], there is no convenient method to compute the frequencies using OBDD. Moreover, most of the reported works assume that the process of failure and repair has reached steady state. Techniques for calculating the time-specific and interval frequencies are described in [13], [14], [15]; however, they are limited to availability expressions obtained using the I-E method and the SDP method.

In [16], Schneeweiss showed that it is difficult to calculate failure frequencies using OBDD. Later, in [17], Schneeweiss showed a way to compute failure frequency using OBDD. This method is similar to the method proposed in [18] for calculating failure frequencies using Shannon decomposition. The method in [17] did not take full advantage of the OBDD structure and is limited to the calculation of steady-state frequencies only. In [19], Sinnamon and Andrews presented an OBDD-based algorithm to calculate system failure frequency using the Birnbaum importance measures of system components. This method does not take full advantage of the relationships between various reliability characteristics; therefore, it needs to traverse the OBDD n times, where n is the total number of components in the system. In this paper, we propose new algorithms to compute steady-state as well as time-specific failure and success frequencies of a system using OBDD. In addition to the calculation of system failure frequencies, we also propose an efficient algorithm based on OBDD for the evaluation of the reliability importance measures of a system. Moreover, these proposed algorithms can be extended to systems with imperfect fault-coverage. Further, the reliability analysis of a nonrepairable system or a repairable system with imperfect fault-coverage is discussed in this paper. We propose an approach to incorporating the imperfect coverage model for a repairable system into a combinatorial model. The model uses Markov chains for a repairable system with imperfect fault-coverage. In addition, an OBDD-based algorithm is devised based on this approach. Due to the use of conditional probabilities and Markov chains, this OBDD-based algorithm is very efficient for the availability evaluation of a repairable system with imperfect fault-coverage.

. Y.R. Chang and S.Y. Kuo are with the Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan. E-mail: [email protected], [email protected].

. S.V. Amari is with Relex Software Corp., 540 Pellis Rd., Greensburg, PA 15601. E-mail: [email protected].

Manuscript received 31 Dec. 2002; accepted 1 May 2003.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 117884.


In Section 2, we propose several OBDD-based algorithms to calculate the system failure frequencies. These algorithms are efficient under both steady-state and time-specific conditions. Section 3 illustrates how we apply the OBDD-based algorithm to the imperfect fault-coverage model for evaluating the reliability of a nonrepairable system and the availability of a repairable system. The failure frequencies of imperfect coverage models are discussed in Section 4, where an algorithm based on OBDD is proposed to calculate the failure frequency of a system with imperfect fault-coverage. In Section 5, efficient algorithms for the evaluation of reliability importance measures, including the Birnbaum importance, the Criticality importance, and other indices for risk evaluation of the system, are proposed. These algorithms are also extended to systems with imperfect fault-coverage. The last section gives the conclusions and future work.

2 FAILURE FREQUENCY

The frequency of failure (v), i.e., the mean number of system failures per unit time, is a key parameter in reliability and risk analysis. In general, v can be obtained from the system availability (A) or unavailability (U) expression as well as the failure and repair rates of components. Schneeweiss [20] indicated that the failure frequency is more important than the system availability. Indeed, the expected system cost depends upon not only the mean down-time (or availability) but also the number of failures during a specific time interval since each repair incurs a certain cost, irrespective of the down-time. Other performance measures of a system, such as the mean up-time and the mean down-time, can be determined easily from the steady-state availability and the steady-state system failure frequency [4], [8].

$$\mathrm{MTBF} = 1/v, \qquad \mathrm{MTTF} = A/v, \qquad \mathrm{MTTR} = U/v.$$

When a system is in steady-state condition, the expected number of failures is equivalent to the expected number of successes (from failed states). Therefore, under steady-state conditions, the system failure frequency is equivalent to the system success frequency. Further, MTTF is the expected up-time during a failure and repair cycle. Hence, the expected up-time in a unit time is MTTF/MTBF = A. Therefore, for a time interval T, we have
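As a small numerical illustration of the relations above, the following Python sketch computes the three steady-state measures from A and v. The function name and the single-component values (failure rate 0.2 and availability 0.9, so that v = 0.9 × 0.2) are assumptions chosen for illustration only.

```python
def steady_state_measures(A, v):
    """Return MTBF, MTTF and MTTR from steady-state availability A
    and steady-state failure frequency v (failures per unit time)."""
    U = 1.0 - A  # steady-state unavailability
    return {"MTBF": 1.0 / v, "MTTF": A / v, "MTTR": U / v}


# Illustrative (assumed) values: a single component with A = 0.9 and failure rate 0.2.
measures = steady_state_measures(A=0.9, v=0.9 * 0.2)
print(measures)
```

Because A + U = 1, the sketch also reflects that one failure and repair cycle splits into up-time and down-time, i.e., MTTF + MTTR = MTBF.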

$$
\begin{aligned}
\text{Expected number of failures during } T &= \mathrm{ENF} = vT \\
\text{Expected up-time during } T &= \mathrm{EUT} = AT \\
\text{Expected down-time during } T &= \mathrm{EDT} = UT.
\end{aligned}
$$

For short duration missions, the time-specific (time-dependent) availability $A(t)$ and the failure frequency $v(t)$ are important. In this case, the system failure frequency ($v_f$) is not equivalent to the system success frequency ($v_s$). Therefore, $v_f$ and $v_s$ should be calculated separately to find other importance measures. Under a transient condition, we have

$$
\begin{aligned}
\text{Expected number of successes during } [0, T] &= \int_0^T v_s(t)\,dt \\
\text{Expected number of failures during } [0, T] &= \int_0^T v_f(t)\,dt \\
\text{Expected up-time during } [0, T] &= \mathrm{EUT} = \int_0^T A(t)\,dt \\
\text{Expected down-time during } [0, T] &= \mathrm{EDT} = \int_0^T U(t)\,dt.
\end{aligned}
$$

Therefore, in order to compute the reliability indices of computer systems or networks for reliability analysis, efficient algorithms are needed for evaluating both the system failure frequency and the system availability. In this paper, we propose new and fast algorithms to compute the steady-state as well as time-specific failure and success frequencies of a system using OBDD.

2.1 Steady-State Failure Frequency

2.1.1 Assumptions

1. The system is composed of n s-independent components. Its structure function (describing redundancy) is s-coherent [21], [22].

2. The system and its components have two states: up (working) and down (failed).

3. The failure and repair rates of the components are constant.

4. All components are repairable. Repaired components are as good as new.

5. The system is in steady-state.

2.1.2 Calculating Probability from OBDD

OBDD [23] is based on a disjoint decomposition of a Boolean function called the Shannon expansion. Given a Boolean function $f(x_1, \ldots, x_n)$, then, for any $i \in \{1, \ldots, n\}$, with $\bar{x}_i \equiv \neg x_i = 1 - x_i$,

$$f = x_i \cdot f_{x_i=1} + \bar{x}_i \cdot f_{x_i=0}. \tag{1}$$

In order to express Shannon decomposition concisely, the if-then-else (ite) format [24], [25] is defined as:

$$f = \mathrm{ite}(x_i, f_{x_i=1}, f_{x_i=0}).$$

The manipulation of OBDD to represent logical operations is simple. In practice, the OBDD is generated by using logical operations on variables. Let Boolean expressions $f$ and $g$ be:

$$
\begin{aligned}
f &= \mathrm{ite}(x_i, f_{x_i=1}, f_{x_i=0}) = \mathrm{ite}(x_i, F_1, F_0) \\
g &= \mathrm{ite}(x_j, g_{x_j=1}, g_{x_j=0}) = \mathrm{ite}(x_j, G_1, G_0).
\end{aligned}
$$

A logic operation between $f$ and $g$ can be represented by OBDD manipulations as:

$$
\mathrm{ite}(x_i, F_1, F_0) \diamond \mathrm{ite}(x_j, G_1, G_0) =
\begin{cases}
\mathrm{ite}(x_i, F_1 \diamond G_1, F_0 \diamond G_0), & \mathrm{ordering}(x_i) = \mathrm{ordering}(x_j) \\
\mathrm{ite}(x_i, F_1 \diamond g, F_0 \diamond g), & \mathrm{ordering}(x_i) < \mathrm{ordering}(x_j) \\
\mathrm{ite}(x_j, f \diamond G_1, f \diamond G_0), & \mathrm{ordering}(x_i) > \mathrm{ordering}(x_j),
\end{cases}
$$

where $\diamond$ represents a logic operation such as AND and OR. Fig. 1 illustrates the construction and manipulation steps of a Boolean function. For more details on using the operations of OBDD, please refer to [23].
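The three-case rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [23]: the tuple encoding `(variable index, high child, low child)`, the use of the variable index as its ordering, and the names `var`, `ite`, `apply_op`, and `evaluate` are all assumptions made here. A production package would additionally share isomorphic subgraphs through a unique table and memoize `apply_op`.

```python
import operator


def var(i):
    """OBDD of the single variable x_i (assumed encoding: (index, high, low))."""
    return (i, True, False)


def ite(i, high, low):
    """Build ite(x_i, high, low), dropping the node when both branches agree."""
    return high if high == low else (i, high, low)


def top_var(node):
    """Ordering of the root variable; terminals are ordered after every variable."""
    return float("inf") if isinstance(node, bool) else node[0]


def apply_op(op, f, g):
    """Combine two OBDDs with a binary Boolean operation (the three-case rule)."""
    if isinstance(f, bool) and isinstance(g, bool):
        return op(f, g)
    i = min(top_var(f), top_var(g))
    # Expand only the operand(s) whose root variable is x_i.
    f1, f0 = (f[1], f[2]) if top_var(f) == i else (f, f)
    g1, g0 = (g[1], g[2]) if top_var(g) == i else (g, g)
    return ite(i, apply_op(op, f1, g1), apply_op(op, f0, g0))


def evaluate(node, assignment):
    """Follow one root-to-leaf path for a given assignment {index: bool}."""
    while not isinstance(node, bool):
        i, high, low = node
        node = high if assignment[i] else low
    return node


# Example: f = (x1 AND x2) OR x3
f = apply_op(operator.or_, apply_op(operator.and_, var(1), var(2)), var(3))
print(f)
```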

A useful property of OBDD is that all the paths from the root to the leaves are mutually disjoint. If $f$ represents the system availability expression, based on the property of the disjoint decomposition of OBDD, the reliability (or availability) of the system can be recursively evaluated by (1):

$$\Pr\{f\} = \Pr\{x_i\} \cdot \Pr\{f_{x_i=1}\} + \Pr\{\bar{x}_i\} \cdot \Pr\{f_{x_i=0}\},$$

where $\Pr\{f\}$ means $\Pr\{f = 1\}$ for simplification. For example, if $\Pr\{x_i\}$ is the availability $A_i$ of component $i$ and $U_i$ is the unavailability of component $i$, then

$$A = \Pr\{f\} = A_i \cdot A_{x_i=1} + U_i \cdot A_{x_i=0} = A_i \cdot A_{x_i=1} + (1 - A_i) \cdot A_{x_i=0}, \tag{2}$$

where $A_{x_i=1}$ ($A_{x_i=0}$) is equal to $\Pr\{f_{x_i=1}\}$ ($\Pr\{f_{x_i=0}\}$). Similarly, when the system unavailability expression is used, the unavailability of a system can be calculated as:

$$U = \Pr\{g\} = U_i \cdot U_{y_i=1} + A_i \cdot U_{y_i=0}, \tag{3}$$

where $y_i = 1$ means component $i$ has failed, $U_{y_i=1}$ ($U_{y_i=0}$) is equal to $\Pr\{g_{y_i=1}\}$ ($\Pr\{g_{y_i=0}\}$), and $g$ is the dual of $f$, i.e.,

$$f(x_1, x_2, \ldots, x_n) \equiv 1 - g(1 - x_1, 1 - x_2, \ldots, 1 - x_n) \equiv 1 - g(y_1, y_2, \ldots, y_n).$$
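The recursion in (2) translates directly into code. The sketch below is illustrative only: it assumes an OBDD stored as nested tuples `(variable index, high child, low child)` with `True`/`False` leaves, and the small function f = x1 x2 ∨ x3 and the component availabilities are hypothetical values, not taken from the paper.

```python
def availability(node, A):
    """Evaluate Pr{f = 1} by the recursion (2): A = A_i * A_{x_i=1} + U_i * A_{x_i=0}."""
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    i, high, low = node
    return A[i] * availability(high, A) + (1.0 - A[i]) * availability(low, A)


# Hypothetical example: f = x1 x2 OR x3, ordering x1 < x2 < x3.
x3 = (3, True, False)
f = (1, (2, True, x3), x3)
A = {1: 0.9, 2: 0.8, 3: 0.7}  # assumed component availabilities

system_A = availability(f, A)
print(system_A)
```

For this example the result can be checked by hand, since Pr{f} = A1 A2 + (1 − A1 A2) A3.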

2.1.3 Calculating Failure Frequency Using OBDD (Method 1)

In this section, we describe an algorithm based on OBDD to calculate the frequency of system failure at steady state. From rule I in [11], $v$ can be derived from the system availability expression $f$, which is expressed as a sum of disjoint product terms $(x_i x_j \cdots \bar{x}_k \bar{x}_l \cdots)$, by multiplying the probability $(A_i A_j \cdots U_k U_l \cdots)$ of every product term $(x_i x_j \cdots \bar{x}_k \bar{x}_l \cdots)$ by $(\lambda_i + \lambda_j + \cdots - \mu_k - \mu_l - \cdots)$, where $\lambda_i$ ($\mu_i$) is the failure (repair) rate of component $i$. Therefore, from (2), the system failure frequency becomes

$$v = A_i \cdot A_{x_i=1} \cdot [\lambda_i + \lambda_{x_i=1}] + U_i \cdot A_{x_i=0} \cdot [-\mu_i + \lambda_{x_i=0}], \tag{4}$$

where $\lambda_{x_i=1}$ ($\lambda_{x_i=0}$) is the effective failure rate of subterm $f_{x_i=1}$ ($f_{x_i=0}$). Similarly, if the system unavailability expression is used, from rule II in [11], $v$ can be obtained from the system unavailability expression $g$ by multiplying the probability $(A_i A_j \cdots U_k U_l \cdots)$ of every product term $(x_i x_j \cdots \bar{x}_k \bar{x}_l \cdots)$ by $(-\lambda_i - \lambda_j - \cdots + \mu_k + \mu_l + \cdots)$. Therefore, from (3), the system failure frequency is obtained as

$$v = U_i \cdot U_{y_i=1} \cdot [\mu_i + \mu_{y_i=1}] + A_i \cdot U_{y_i=0} \cdot [-\lambda_i + \mu_{y_i=0}],$$

where $\mu_{y_i=1}$ ($\mu_{y_i=0}$) is the effective repair rate of subterm $g_{y_i=1}$ ($g_{y_i=0}$).

If the OBDD representing a system has been constructed as shown in Fig. 2, based on the property of disjoint decomposition of OBDD, every path from the root to the leaf node "1" means a disjoint product term of $f$ (i.e., $x_i x_j \cdots \bar{x}_k \bar{x}_l \cdots$). Therefore, we get $(\lambda_i + \lambda_j + \cdots - \mu_k - \mu_l - \cdots)$ by summing up the rates of nodes along the path. For instance, in Fig. 2b, we get the path rates $(\lambda_1 + \lambda_2 + \lambda_4)$ and $(\lambda_1 + \lambda_2 - \mu_4 + \lambda_5)$ from the paths $\{x_1, x_2, x_4, \text{BDD one}\}$ and $\{x_1, x_2, \bar{x}_4, x_5, \text{BDD one}\}$, respectively. Therefore, we modify the general algorithm of computing the probability of OBDD by recording on each node the path rate from the root to it and multiplying the probability $(A_i A_j \cdots U_k U_l \cdots)$ by the recorded path rate $(\lambda_i + \lambda_j + \cdots - \mu_k - \mu_l - \cdots)$ when the calculation reaches the leaf node "1." The OBDD-based recursive algorithm for calculating the system failure frequency from the availability expression of the system is shown in Fig. 3. If the calculation is based on unavailability, then the sign of each rate should be flipped.

Example 1. For a terminal-pair network system, Kuo et al. [3] proposed an efficient approach to determine the terminal-pair (from source node s to target node t) reliability based on edge expansion diagrams using OBDD. The main idea, which avoids redundant calculations and makes their approach much more efficient than other methods, is that the OBDD of a given network is automatically constructed, with isomorphic subproblems merged, while traversing the network from source to sink. In [3], a good variable ordering is also obtained by a breadth-first traversal of the network from source to sink. Therefore, the system reliability is efficiently derived from the OBDD.

Fig. 1. The OBDD generated from a Boolean equation.

Fig. 2. (a) A bridge network with failure rate $\lambda_i$ and repair rate $\mu_i$ for component $x_i$ at steady state. (b) The OBDD of (a).

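Considering a bridge network as shown in Fig. 2a, the path function of this terminal-pair network system is

$$f = [x_1 (x_4 \vee x_3 x_5)] \vee [x_2 (x_5 \vee x_3 x_4)].$$

Fig. 2b shows the OBDD of this network system constructed by [3] during traversing the network from source to sink. The derivation is as follows:

By (1), with $i = 1$:

$$f = x_1 [(x_4 \vee x_3 x_5) \vee x_2 (x_5 \vee x_3 x_4)] + \bar{x}_1 [x_2 (x_5 \vee x_3 x_4)].$$

By (1), with $i = 2$ (for a common sub-OBDD):

$$f = x_1 [x_2 (x_4 \vee x_3 x_5 \vee x_5 \vee x_3 x_4) + \bar{x}_2 (x_4 \vee x_3 x_5)] + \bar{x}_1 [x_2 (x_5 \vee x_3 x_4) + \bar{x}_2 (0)].$$

Repeat until $i = 6$ and we get

$$f = x_1 [x_2 (x_4 + \bar{x}_4 x_5) + \bar{x}_2 (x_3 (x_4 + \bar{x}_4 x_5) + \bar{x}_3 x_4)] + \bar{x}_1 [x_2 (x_3 (x_4 + \bar{x}_4 x_5) + \bar{x}_3 x_5)].$$

Then, the system availability can be obtained by (2), i.e., from the set of mutually disjoint paths from the root to the leaf node "1" in the OBDD:

$$
\begin{aligned}
A ={}& A_1 A_2 A_4 + A_1 A_2 U_4 A_5 + A_1 U_2 A_3 A_4 + A_1 U_2 A_3 U_4 A_5 \\
&+ A_1 U_2 U_3 A_4 + U_1 A_2 A_3 A_4 + U_1 A_2 A_3 U_4 A_5 + U_1 A_2 U_3 A_5.
\end{aligned}
$$

From (4), the system failure frequency can also be obtained as follows:

$$
\begin{aligned}
v ={}& A_1 A_2 A_4 (\lambda_1 + \lambda_2 + \lambda_4) \\
&+ A_1 A_2 U_4 A_5 (\lambda_1 + \lambda_2 - \mu_4 + \lambda_5) \\
&+ A_1 U_2 A_3 A_4 (\lambda_1 - \mu_2 + \lambda_3 + \lambda_4) \\
&+ A_1 U_2 A_3 U_4 A_5 (\lambda_1 - \mu_2 + \lambda_3 - \mu_4 + \lambda_5) \\
&+ A_1 U_2 U_3 A_4 (\lambda_1 - \mu_2 - \mu_3 + \lambda_4) \\
&+ U_1 A_2 A_3 A_4 (-\mu_1 + \lambda_2 + \lambda_3 + \lambda_4) \\
&+ U_1 A_2 A_3 U_4 A_5 (-\mu_1 + \lambda_2 + \lambda_3 - \mu_4 + \lambda_5) \\
&+ U_1 A_2 U_3 A_5 (-\mu_1 + \lambda_2 - \mu_3 + \lambda_5).
\end{aligned}
\tag{5}
$$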
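The path-based calculation (Method 1) can be sketched as follows. This is an illustrative reading of the procedure, not a transcription of the algorithm in Fig. 3: the tuple encoding of the bridge OBDD and the function name are assumptions, and the numerical values (availability 0.9 and failure rate 0.2 for every component) are those quoted in the caption of Fig. 4, with the repair rate derived from the steady-state relation between failure and repair rates.

```python
# Assumed encoding: (variable index, high child, low child), leaves True/False.
n5 = (5, True, False)                 # x5
n4 = (4, True, False)                 # x4
n45 = (4, True, n5)                   # x4 + ~x4 x5
n3a = (3, n45, n4)                    # x3 (x4 + ~x4 x5) + ~x3 x4
n3b = (3, n45, n5)                    # x3 (x4 + ~x4 x5) + ~x3 x5
BRIDGE = (1, (2, n45, n3a), (2, n3b, False))


def path_failure_frequency(node, A, lam, mu, prob=1.0, rate=0.0):
    """Sum, over all paths to leaf "1", of (path probability) * (path rate).

    A high edge of x_i contributes A_i to the probability and +lam_i to the rate;
    a low edge contributes U_i = 1 - A_i and -mu_i.
    """
    if node is True:
        return prob * rate
    if node is False:
        return 0.0
    i, high, low = node
    return (path_failure_frequency(high, A, lam, mu, prob * A[i], rate + lam[i])
            + path_failure_frequency(low, A, lam, mu, prob * (1.0 - A[i]), rate - mu[i]))


A = {i: 0.9 for i in range(1, 6)}
lam = {i: 0.2 for i in range(1, 6)}
# Steady state: A_i * lam_i = U_i * mu_i, so mu_i follows from A_i and lam_i.
mu = {i: lam[i] * A[i] / (1.0 - A[i]) for i in A}

v = path_failure_frequency(BRIDGE, A, lam, mu)
print(v)
```

The recursion visits every root-to-leaf path once, so its cost grows with the number of paths rather than the number of nodes.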

It should be noted that, in order to get the system failure frequency, the previous methods need to find the disjoint terms of the availability expression from the minimal path/cut sets of a network system or assume they are given. In our method, the OBDD is based on the set of disjoint decomposed functions and is automatically constructed during traversing the network [3]. Combining our method with [3] is therefore very efficient for calculating the failure frequency of a network system.

2.1.4 Calculating Failure Frequency Using OBDD (Method 2)

In the above section, once the OBDD is obtained, the time complexity of that algorithm depends on the number of paths in the OBDD. This section describes another algorithm whose time complexity is proportional to the number of nodes in the OBDD.

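Theorem 1.

$$v = v_i \cdot [A_{x_i=1} - A_{x_i=0}] + A_i \cdot v_{x_i=1} + U_i \cdot v_{x_i=0}, \tag{6}$$

where

$$v_i = A_i \cdot \lambda_i, \qquad v_{x_i=1} = A_{x_i=1} \cdot \lambda_{x_i=1}, \qquad v_{x_i=0} = A_{x_i=0} \cdot \lambda_{x_i=0}.$$

Proof. The proof is based on mathematical induction. From [4], [26], we have

$$v = \sum_{i=1}^{n} v_i \frac{\partial A}{\partial A_i}. \tag{7}$$

From (2), since $A_{x_i=1}$ and $A_{x_i=0}$ do not depend on $A_i$, we have

$$\frac{\partial A}{\partial A_i} = A_{x_i=1} - A_{x_i=0}. \tag{8}$$

Similarly, if $j \neq i$, we have, by (2),

$$\frac{\partial A}{\partial A_j} = A_i \frac{\partial A_{x_i=1}}{\partial A_j} + U_i \frac{\partial A_{x_i=0}}{\partial A_j}.$$

Therefore, for any $1 \le i \le n$,

$$
\begin{aligned}
v &= v_i [A_{x_i=1} - A_{x_i=0}] + \sum_{j=1, j \neq i}^{n} v_j \left( A_i \frac{\partial A_{x_i=1}}{\partial A_j} + U_i \frac{\partial A_{x_i=0}}{\partial A_j} \right) \\
&= v_i [A_{x_i=1} - A_{x_i=0}] + A_i \sum_{j=1, j \neq i}^{n} v_j \frac{\partial A_{x_i=1}}{\partial A_j} + U_i \sum_{j=1, j \neq i}^{n} v_j \frac{\partial A_{x_i=0}}{\partial A_j}.
\end{aligned}
$$

From the definition of failure frequency (see (7)), we know that $\sum_{j=1, j \neq i}^{n} v_j \, \partial A_{x_i=1} / \partial A_j$ is the failure frequency of the system when $x_i = 1$. Similarly, $\sum_{j=1, j \neq i}^{n} v_j \, \partial A_{x_i=0} / \partial A_j$ is the failure frequency of the system when $x_i = 0$. Therefore, (6) holds. Equation (6) is equivalent to (4) if we use the relation $v_i = A_i \lambda_i = U_i \mu_i$ under the steady-state condition. □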

Equations (2) and (6) form the basic recursion to compute the failure frequency. We can calculate the frequency at each node by first finding the corresponding availability at that node from (2) and then calculating $v$ at that node using (6). These two steps can be applied recursively.

Fig. 3. OBDD-based algorithm using accumulative path rate to calculate the system failure frequency at steady state.

If the OBDD calculations are based on system unavailability, the calculation of failure frequency can also be based on:

Theorem 2.

$$v = v_i \cdot [U_{y_i=1} - U_{y_i=0}] + U_i \cdot v_{y_i=1} + A_i \cdot v_{y_i=0},$$

where

$$v_i = U_i \cdot \mu_i, \qquad v_{y_i=1} = U_{y_i=1} \cdot \mu_{y_i=1}, \qquad v_{y_i=0} = U_{y_i=0} \cdot \mu_{y_i=0}.$$

Proof. The proof is similar to the proof of Theorem 1. □

Applying Method 2 to Example 1, the calculation of the system failure frequency is as shown in Fig. 4. The result is the same as that derived by (5). Fig. 5 illustrates the proposed algorithm. Once the OBDD is obtained, the time complexity of the proposed algorithm is linearly proportional to the number of nodes. It should be noted that a hash table is used in this algorithm to avoid repeated calculations at the nodes of the OBDD. If we get a hit in the hash table, we do not need to recalculate the information of this node; we can retrieve it from the hash table. The hash table used here only records the availability and the failure frequency of each node. An ordinary table could be used instead of a hash table if only a few data need to be stored. However, in [3], the network topology of each node during traversing a path to construct the OBDD is recorded in a hash table. A proper hash table can reduce the comparison time of network topologies. Therefore, our method can be combined with [3] to compute the reliability indices of a network system. This scheme saves significant computational time.
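The node-based recursion of (2) and (6) with a table of already-visited nodes can be sketched as below. It is a minimal illustration rather than the algorithm of Fig. 5: the nested-tuple OBDD of the bridge network, the use of a Python dict keyed by `id(node)` as the hash table, and the function name are assumptions; the component values are those of Fig. 4.

```python
# Assumed encoding: (variable index, high child, low child), leaves True/False.
n5 = (5, True, False)
n4 = (4, True, False)
n45 = (4, True, n5)
n3a = (3, n45, n4)
n3b = (3, n45, n5)
BRIDGE = (1, (2, n45, n3a), (2, n3b, False))


def avail_and_freq(node, A, lam, cache):
    """Return (availability, failure frequency) of the sub-OBDD rooted at node.

    Availability follows (2); frequency follows (6) with v_i = A_i * lam_i.
    Each distinct node is evaluated once thanks to the cache.
    """
    if node is True:
        return 1.0, 0.0   # constant 1: always up, never fails
    if node is False:
        return 0.0, 0.0   # constant 0: never up
    key = id(node)        # shared sub-OBDDs are the same Python object
    if key not in cache:
        i, high, low = node
        A1, v1 = avail_and_freq(high, A, lam, cache)
        A0, v0 = avail_and_freq(low, A, lam, cache)
        Ai, Ui = A[i], 1.0 - A[i]
        cache[key] = (Ai * A1 + Ui * A0,
                      Ai * lam[i] * (A1 - A0) + Ai * v1 + Ui * v0)
    return cache[key]


A = {i: 0.9 for i in range(1, 6)}
lam = {i: 0.2 for i in range(1, 6)}
cache = {}
A_sys, v = avail_and_freq(BRIDGE, A, lam, cache)
print(A_sys, v, len(cache))
```

Note that only the failure rates are needed here; the repair rates enter implicitly through the steady-state availabilities.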

2.2 Time-Specific Failure/Success Frequency

2.2.1 Assumptions

1. See assumptions in Section 2.1.1.

2. The failure and repair rates of the components are constant.¹

3. Repaired components are as good as new.²

In most of the literature [4], [20], the time-specific frequencies are calculated using the following basic equations:

$$
\begin{aligned}
v_f(t) &= \lim_{\Delta t \to 0} \frac{1}{\Delta t} E[X_s(t) \cdot \bar{X}_s(t + \Delta t)] \\
v_s(t) &= \lim_{\Delta t \to 0} \frac{1}{\Delta t} E[\bar{X}_s(t) \cdot X_s(t + \Delta t)],
\end{aligned}
$$

where $v_f(t)$/$v_s(t)$ are the time-specific failure/success frequencies of a system, i.e., the "mean number of system failures/failure-recoveries" per time-unit; $X_s(t)$ is a random variable, where $X_s(t) = 1$ means the system is working, and $\bar{X}_s(t) = \neg X_s(t)$. $E(\cdot)$ is the expected value of a random variable.

Under the steady-state condition, we have $v_i = A_i \lambda_i = U_i \mu_i$, and $\mu_i$ is applied directly in (4) to get the system failure frequency. However, under transient conditions (time-specific conditions), the frequency of system failure is not the same as the frequency of system success [13], i.e.,

$$v_f(t) \neq v_s(t) \iff A(t)\lambda \neq U(t)\mu.$$

We cannot just apply $\mu_i$ in (4). Therefore, we need to modify our algorithm for the time-specific condition.

2.2.2 Method 1

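From [13], [14], the time-specific system failure and success frequencies are respectively given by

$$v_f(t) = \sum_{i=1}^{n} v_{if}(t), \qquad v_s(t) = \sum_{i=1}^{n} v_{is}(t), \tag{9}$$

where $v_{if}(t)$/$v_{is}(t)$ is the contribution of component $i$ to the system time-specific frequency of failure/success. Further, since component $i$ is s-independent of the rest of the system and after changing the notation appropriately, the contribution frequencies [14] are

$$
\begin{aligned}
v_{if}(t) &= [A_{x_i=1}(t) - A_{x_i=0}(t)] \cdot A_i(t) \cdot \lambda_i \\
&= [A_{y_i=0}(t) - A_{y_i=1}(t)] \cdot A_i(t) \cdot \lambda_i \\
&= [U_{x_i=0}(t) - U_{x_i=1}(t)] \cdot A_i(t) \cdot \lambda_i \\
&= [U_{y_i=1}(t) - U_{y_i=0}(t)] \cdot A_i(t) \cdot \lambda_i
\end{aligned}
\tag{10}
$$

and

$$
\begin{aligned}
v_{is}(t) &= [U_{y_i=1}(t) - U_{y_i=0}(t)] \cdot U_i(t) \cdot \mu_i \\
&= [U_{x_i=0}(t) - U_{x_i=1}(t)] \cdot U_i(t) \cdot \mu_i \\
&= [A_{y_i=0}(t) - A_{y_i=1}(t)] \cdot U_i(t) \cdot \mu_i \\
&= [A_{x_i=1}(t) - A_{x_i=0}(t)] \cdot U_i(t) \cdot \mu_i,
\end{aligned}
\tag{11}
$$

where $x_i = \neg y_i$. By combining (9), (10), and (11) and applying rule A and rule B in [14], we get

$$
\begin{aligned}
v_f(t) &= \sum_{j \in I} g_j(t) \left[ \sum_{i \in Z_j} \lambda_i - \sum_{i \in \bar{Z}_j} \left( \lambda_i A_i(t) / U_i(t) \right) \right] \\
v_s(t) &= \sum_{j \in I} g_j(t) \left[ \sum_{i \in Z_j} \left( \mu_i U_i(t) / A_i(t) \right) - \sum_{i \in \bar{Z}_j} \mu_i \right],
\end{aligned}
\tag{12}
$$

where $I$ is the set of subterms of the expression of $A(t)$, $g_j(t)$ is the subterm of index $j$ in the expression of $A(t)$, and $Z_j$ and $\bar{Z}_j$ are index subsets such that the members of $\{A_i(t)\}_{i \in Z_j}$ and $\{U_i(t)\}_{i \in \bar{Z}_j}$ comprise $g_j(t)$. These are the basic equations for the time-specific case. In fact, for all subsystems, if we define

$$
\begin{aligned}
A(t)\,\lambda' = U(t)\,\mu &\;\Rightarrow\; \lambda' = \frac{U(t)}{A(t)}\,\mu \\
U(t)\,\mu' = A(t)\,\lambda &\;\Rightarrow\; \mu' = \frac{A(t)}{U(t)}\,\lambda
\end{aligned}
$$

and make some modifications to (4), then we get

$$v_f(t) = A_i(t) \cdot A_{x_i=1}(t) \cdot [\lambda_i + \lambda_{x_i=1}] + U_i(t) \cdot A_{x_i=0}(t) \cdot [-\mu'_i + \lambda_{x_i=0}]. \tag{13}$$

Equation (13) is equivalent to (12). Similarly, we can get the following equations:

$$
\begin{aligned}
v_f(t) &= U_i(t) \cdot U_{y_i=1}(t) \cdot [\mu'_i + \mu'_{y_i=1}] + A_i(t) \cdot U_{y_i=0}(t) \cdot [-\lambda_i + \mu'_{y_i=0}] \\
v_s(t) &= A_i(t) \cdot A_{x_i=1}(t) \cdot [\lambda'_i + \lambda'_{x_i=1}] + U_i(t) \cdot A_{x_i=0}(t) \cdot [-\mu_i + \lambda'_{x_i=0}] \\
v_s(t) &= U_i(t) \cdot U_{y_i=1}(t) \cdot [\mu_i + \mu_{y_i=1}] + A_i(t) \cdot U_{y_i=0}(t) \cdot [-\lambda'_i + \mu_{y_i=0}].
\end{aligned}
$$

Fig. 4. OBDD of Example 1 with availability = 0.9 and failure rate = 0.2 in each component at steady state.

1. The proposed algorithm can also be applied to the case of global time-dependent failure and repair rates. In this case, the individual component availabilities should be calculated using appropriate techniques.

2. This assumption would not be applicable if global time-varying failure and repair rates are considered.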

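Therefore, the modified algorithm is similar to the steady-state algorithm. The only difference is that we should use $\mu'_i$ instead of $\mu_i$ to compute the time-specific frequency of system failure based on the availability expression, where $\mu'_i = \lambda_i A_i(t) / U_i(t)$. To simplify the representation, we use $A_i$ ($U_i$) to represent $A_i(t)$ ($U_i(t)$), i.e., the time-specific availability (unavailability) of component $i$, and (5) becomes

$$
\begin{aligned}
v_f(t) ={}& A_1 A_2 A_4 (\lambda_1 + \lambda_2 + \lambda_4) \\
&+ A_1 A_2 U_4 A_5 (\lambda_1 + \lambda_2 - \lambda_4 A_4 / U_4 + \lambda_5) \\
&+ A_1 U_2 A_3 A_4 (\lambda_1 - \lambda_2 A_2 / U_2 + \lambda_3 + \lambda_4) \\
&+ A_1 U_2 A_3 U_4 A_5 (\lambda_1 - \lambda_2 A_2 / U_2 + \lambda_3 - \lambda_4 A_4 / U_4 + \lambda_5) \\
&+ A_1 U_2 U_3 A_4 (\lambda_1 - \lambda_2 A_2 / U_2 - \lambda_3 A_3 / U_3 + \lambda_4) \\
&+ U_1 A_2 A_3 A_4 (-\lambda_1 A_1 / U_1 + \lambda_2 + \lambda_3 + \lambda_4) \\
&+ U_1 A_2 A_3 U_4 A_5 (-\lambda_1 A_1 / U_1 + \lambda_2 + \lambda_3 - \lambda_4 A_4 / U_4 + \lambda_5) \\
&+ U_1 A_2 U_3 A_5 (-\lambda_1 A_1 / U_1 + \lambda_2 - \lambda_3 A_3 / U_3 + \lambda_5).
\end{aligned}
$$

If the calculations are based on unreliability, then the signs should be flipped. The modified algorithm for the calculation of the time-specific failure frequency is shown in Fig. 6. Also, for the computation of the time-specific frequency of system success ($v_s(t)$), we should use $\lambda'_i$ instead of $\lambda_i$, where $\lambda'_i = \mu_i U_i(t) / A_i(t)$.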

2.2.3 Method 2

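In this section, we describe another method with time complexity proportional to the number of nodes in the OBDD. Let $v_{fi}(t)$/$v_{si}(t)$ be the time-specific frequency of failure/success of component $i$. By (8), we have

$$
\begin{aligned}
v_{fi}(t) &= A_i(t) \cdot \lambda_i, & v_{si}(t) &= U_i(t) \cdot \mu_i, \\
\frac{\partial A(t)}{\partial A_i(t)} &= A_{x_i=1}(t) - A_{x_i=0}(t), & \frac{\partial U(t)}{\partial U_i(t)} &= U_{y_i=1}(t) - U_{y_i=0}(t).
\end{aligned}
$$

Therefore, (9) can be expressed as below:

$$
\begin{aligned}
v_f(t) &= \sum_{i=1}^{n} \frac{\partial A(t)}{\partial A_i(t)} v_{fi}(t) = \sum_{i=1}^{n} \frac{\partial U(t)}{\partial U_i(t)} v_{fi}(t) \\
v_s(t) &= \sum_{i=1}^{n} \frac{\partial A(t)}{\partial A_i(t)} v_{si}(t) = \sum_{i=1}^{n} \frac{\partial U(t)}{\partial U_i(t)} v_{si}(t).
\end{aligned}
\tag{14}
$$

Equation (14) is similar to (7). This method for calculating the time-specific frequencies is very similar to the steady-state case, except that we should use $v_{fi}(t)$ instead of $v_i(t)$ to calculate $v_f(t)$ and use $v_{si}(t)$ instead of $v_i(t)$ to calculate $v_s(t)$. Therefore, we get:

$$
\begin{aligned}
v_f(t) &= v_{fi}(t) \cdot [A_{x_i=1}(t) - A_{x_i=0}(t)] + A_i(t) \cdot v_{f|x_i=1}(t) + U_i(t) \cdot v_{f|x_i=0}(t) \\
&= v_{fi}(t) \cdot [U_{y_i=1}(t) - U_{y_i=0}(t)] + U_i(t) \cdot v_{f|y_i=1}(t) + A_i(t) \cdot v_{f|y_i=0}(t),
\end{aligned}
\tag{15}
$$

$$
\begin{aligned}
v_s(t) &= v_{si}(t) \cdot [A_{x_i=1}(t) - A_{x_i=0}(t)] + A_i(t) \cdot v_{s|x_i=1}(t) + U_i(t) \cdot v_{s|x_i=0}(t) \\
&= v_{si}(t) \cdot [U_{y_i=1}(t) - U_{y_i=0}(t)] + U_i(t) \cdot v_{s|y_i=1}(t) + A_i(t) \cdot v_{s|y_i=0}(t).
\end{aligned}
\tag{16}
$$

The proof is similar to the proof of (6). In fact, from (15), we have

$$
\begin{aligned}
v_f(t) &= A_i(t) \cdot \lambda_i \cdot [A_{x_i=1}(t) - A_{x_i=0}(t)] + A_i(t) \cdot A_{x_i=1}(t) \cdot \lambda_{x_i=1} + U_i(t) \cdot A_{x_i=0}(t) \cdot \lambda_{x_i=0} \\
&= A_i(t) \cdot A_{x_i=1}(t) \cdot [\lambda_i + \lambda_{x_i=1}] + U_i(t) \cdot A_{x_i=0}(t) \cdot \left[ -\lambda_i \frac{A_i(t)}{U_i(t)} + \lambda_{x_i=0} \right].
\end{aligned}
$$

This equation is equivalent to (13). It should be noted that, by using this method, we do not need to find the modified instantaneous rates $\lambda'_i$ or $\mu'_i$ as in Method 1. Therefore, we should use (15) and (16) to compute both $v_f(t)$ and $v_s(t)$, either from the availability or the unavailability based on OBDD. The time complexity of this method is also proportional to the number of nodes in the OBDD.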
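A sketch of the time-specific recursion (15) and (16) on the bridge OBDD of Example 1 is given below. Several things here are assumptions made for illustration: the nested-tuple OBDD encoding, the function names, the repair rate of 1.8 (chosen so that the steady-state availability is 0.9 with failure rate 0.2), and in particular the component availability formula, which is the standard two-state Markov result for a component that is up at t = 0 and is not given in the text.

```python
import math

n5 = (5, True, False)
n4 = (4, True, False)
n45 = (4, True, n5)
BRIDGE = (1, (2, n45, (3, n45, n4)), (2, (3, n45, n5), False))


def component_availability(lam, mu, t):
    """Assumed model: two-state Markov component, working at t = 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def avail_and_freq(node, A, comp_freq, cache):
    """Recursion (15)/(16): comp_freq[i] is v_fi(t) or v_si(t) of component i."""
    if isinstance(node, bool):
        return (1.0 if node else 0.0), 0.0
    if id(node) not in cache:
        i, high, low = node
        A1, v1 = avail_and_freq(high, A, comp_freq, cache)
        A0, v0 = avail_and_freq(low, A, comp_freq, cache)
        cache[id(node)] = (A[i] * A1 + (1.0 - A[i]) * A0,
                           comp_freq[i] * (A1 - A0) + A[i] * v1 + (1.0 - A[i]) * v0)
    return cache[id(node)]


def time_specific(t, lam, mu):
    """Return (A(t), v_f(t), v_s(t)) for the bridge network."""
    A = {i: component_availability(lam[i], mu[i], t) for i in lam}
    v_fi = {i: A[i] * lam[i] for i in lam}           # component failure frequency
    v_si = {i: (1.0 - A[i]) * mu[i] for i in lam}    # component success frequency
    A_sys, v_f = avail_and_freq(BRIDGE, A, v_fi, {})
    _, v_s = avail_and_freq(BRIDGE, A, v_si, {})
    return A_sys, v_f, v_s


lam = {i: 0.2 for i in range(1, 6)}
mu = {i: 1.8 for i in range(1, 6)}
for t in (0.1, 1.0, 50.0):
    print(t, time_specific(t, lam, mu))
```

Under these assumptions the two frequencies differ at small t and approach the same steady-state value as t grows, which is the behavior described in Section 2.2.1.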

3 AVAILABILITY UNDER IMPERFECT COVERAGE MODEL

Computer systems that are used in life-critical applications, such as flight control, nuclear power plant monitoring, space missions, etc., are designed with sufficient redundancy to be tolerant of errors that may occur. However, if a system cannot detect, locate, and recover from faults and errors in a system, then system failures can result even when there is enough redundancy [26]. Therefore, an accurate analysis must account for not only the complex system structure, but also the system fault and error recovery behavior. This helps in determining the optimal level of redundancy. Otherwise, the increase in redundancy may decrease the system reliability (availability) due to imperfect coverage, as shown by Pham [28] for the reliability of Triple Modular Redundant Systems and for more general cases in [29].

Assumptions

1. Component failures are s-independent.

2. An s-coherent combinatorial model (fault-tree or digraph or network) can be used to represent the combinations of covered component faults that lead to system fault.

3. Uncovered component failures cause immediate system failure, even in the presence of adequate redundancy.

4. Fault-occurrence probabilities are given: a) as fixed probabilities (for a given mission time) or b) in terms of a lifetime distribution.

5. Coverage = Pr{system can recover | a fault occurs} is given either as a fixed probability or in terms of a model from which coverage factors are derived.

3.1 Coverage Model

Fig. 7 shows the general structure of a fault-coverage model representing a recovery process [21], [30] initiated when a fault occurs. The entry point to the model signifies the occurrence of a failure and the three exits (R, S, C) represent the three possible outcomes.

. If the offending fault is transient and can be handled without discarding any components, then the transient restoration exit (R) is taken.

. If the fault is determined to be permanent, and the offending component is discarded, then the permanent fault-coverage exit (C) is taken.

. If the fault by itself causes a system to fail, then the single-point failure exit (S) is taken.

The exit probabilities $r_0$, $c_0$, $s_0$ are required for the analysis of system reliability. The exits are a partitioning of the event space; thus, the three exit probabilities sum to one, i.e., $(c_0 + s_0) = (1 - r_0)$. The values of $r_0$, $c_0$, $s_0$ can be determined using an appropriate fault-coverage model [30]; for more details, see [22], [26].

Fig. 6. The modified algorithm for calculating the time-specific frequency of system failure.

For the fault-coverage model, each component is always in one of three states: $x[i]$, $y[i]$, $z[i]$. To determine the system reliability (unreliability), it is required to have $a[i]$, $b[i]$, $c[i]$, which represent the probabilities of component $i$ associated with the exits of the fault-coverage model. Fig. 8 shows the event space (and corresponding probability) representation of a component. Therefore,

$$
\begin{aligned}
a[i] &= \exp[-(1 - r_{i0}) \cdot \lambda_{i0} \cdot t] \\
b[i] &= \frac{c_{i0}}{c_{i0} + s_{i0}} \cdot \left[ 1 - \exp[-(1 - r_{i0}) \cdot \lambda_{i0} \cdot t] \right] \\
c[i] &= \frac{s_{i0}}{c_{i0} + s_{i0}} \cdot \left[ 1 - \exp[-(1 - r_{i0}) \cdot \lambda_{i0} \cdot t] \right],
\end{aligned}
\tag{17}
$$

where $(r_{i0}, c_{i0}, s_{i0})$ are the probabilities of taking the (transient restoration, permanent coverage, single-point failure) exit, respectively, in the coverage model and $\lambda_{i0}$ is the rate of failure occurrence of component $i$. It should be noted that the effective failure rate $\lambda_i$ and the effective coverage factor $c_i$ of component $i$ are

$$
\begin{aligned}
\lambda_i &\equiv (c_{i0} + s_{i0}) \cdot \lambda_{i0} = (1 - r_{i0}) \cdot \lambda_{i0} \\
c_i &\equiv c_{i0} / (c_{i0} + s_{i0}).
\end{aligned}
\tag{18}
$$

Amari et al. [21] proposed an efficient algorithm, named SEA, to calculate the reliability of a system under the imperfect coverage model (IPCM). SEA separates the fault-coverage modeling of failures into two terms that are multiplied to compute the system reliability. The first term, a simple product, represents the probability that no uncovered fault occurs. The second term comes from a combinatorial model which includes the covered faults that can lead to system failure. This second term can be computed from any common approach (e.g., fault tree, block diagram, digraph) which ignores the fault-coverage concept by slightly altering the component-failure probabilities. The basic idea is shown in the following equation and can be easily proven [21] by using conditional probabilities.

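$$
\begin{aligned}
\text{System Unavailability} ={}& \Pr\{\text{an uncovered failure}\} \cdot \Pr\{\text{system failure} \mid \text{an uncovered failure}\} \\
&+ \Pr\{\text{no uncovered failure}\} \cdot \Pr\{\text{system failure} \mid \text{no uncovered failure}\}.
\end{aligned}
$$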

The SEA algorithm [21] focused on nonrepairable systems and works very well. However, we found that the use of conditional probabilities in the SEA algorithm could be extended to repairable systems. Also, the powerful capability of OBDD in computing system reliability could be included in this algorithm.
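The component probabilities in (17) and the effective parameters in (18) can be evaluated with a few lines of Python. The sketch is illustrative only; the function name and the numerical values of the exit probabilities, failure-occurrence rate, and mission time are hypothetical.

```python
import math


def exit_probabilities(r0, c0, s0, lam0, t):
    """Return (a, b, c) of (17) for one component.

    r0, c0, s0: probabilities of the transient-restoration, permanent-coverage
    and single-point-failure exits (they must sum to one);
    lam0: rate of failure occurrence; t: mission time.
    """
    if abs(r0 + c0 + s0 - 1.0) > 1e-12:
        raise ValueError("exit probabilities must sum to one")
    lam_eff = (1.0 - r0) * lam0          # effective failure rate, (18)
    cov_eff = c0 / (c0 + s0)             # effective coverage factor, (18)
    p_fault = 1.0 - math.exp(-lam_eff * t)
    a = 1.0 - p_fault                    # no permanent fault by time t
    b = cov_eff * p_fault                # covered failure
    c = (1.0 - cov_eff) * p_fault        # uncovered failure
    return a, b, c


# Hypothetical values chosen for illustration.
a, b, c = exit_probabilities(r0=0.5, c0=0.45, s0=0.05, lam0=0.01, t=100.0)
print(a, b, c)
```

Since the three outcomes partition the event space of a component, a + b + c = 1 for any valid input.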

3.2 Imperfect Fault Coverage Model for Repairable Systems

For a repairable system, the availability (unavailability) is an important index for system performance. Dugan and Trivedi [30] used imperfect-coverage modeling for repairable systems. Akhtar [31] attempted to model the imperfect fault-coverage of a repairable system. Yin et al. [32] considered the components of a system to be repairable and subject to imperfect fault-coverage. They modeled the components using Markov chains. If the components can be repaired, each component will be in one of three states at any time: the good (working) state, the failed and covered state, and the failed and uncovered state. These three states are mutually exclusive in a repairable system. The behavior of a component is shown in Fig. 9, where $\lambda_i$ is the failure rate of component $i$, $\mu_i$ is the repair rate, and $c_i$ is the coverage factor.

Let $S_{ji}(t) = \Pr\{\text{state of component } i \text{ is } j \text{ at time } t\}$, where $j = [1, 2, 3]$ for [good, failed & covered, failed & uncovered], respectively; then

$$
\begin{aligned}
S_{1i}(t) &= \frac{\mu_i - \alpha_{1i}}{\alpha_{2i} - \alpha_{1i}} e^{-\alpha_{1i} t} + \frac{\mu_i - \alpha_{2i}}{\alpha_{1i} - \alpha_{2i}} e^{-\alpha_{2i} t} \\
S_{2i}(t) &= \frac{\lambda_i c_i}{\alpha_{2i} - \alpha_{1i}} e^{-\alpha_{1i} t} + \frac{\lambda_i c_i}{\alpha_{1i} - \alpha_{2i}} e^{-\alpha_{2i} t} \\
S_{3i}(t) &= 1 - S_{1i}(t) - S_{2i}(t) \\
&= 1 - \frac{\mu_i + \lambda_i c_i - \alpha_{1i}}{\alpha_{2i} - \alpha_{1i}} e^{-\alpha_{1i} t} - \frac{\mu_i + \lambda_i c_i - \alpha_{2i}}{\alpha_{1i} - \alpha_{2i}} e^{-\alpha_{2i} t},
\end{aligned}
\tag{19}
$$

where

$$\alpha_{1i}, \alpha_{2i} = \frac{(\lambda_i + \mu_i) \pm \sqrt{(\lambda_i + \mu_i)^2 - 4 (1 - c_i) \lambda_i \mu_i}}{2}.$$

Therefore, the conditional probabilities (given no uncovered failures) of component $i$ are as follows:

$$
P_i(t) = \frac{S_{1i}(t)}{S_{1i}(t) + S_{2i}(t)}, \qquad
Q_i(t) = 1 - P_i(t) = \frac{S_{2i}(t)}{S_{1i}(t) + S_{2i}(t)}. \tag{20}
$$
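The closed-form state probabilities in (19) and the conditional probabilities in (20) can be checked numerically. The sketch below is illustrative: the function names and parameter values are assumptions, and the two exponents are written as `a1` and `a2` for the two roots of the quadratic given after (19). It assumes the roots are distinct, which fails only in the degenerate case of zero coverage with equal failure and repair rates.

```python
import math


def state_probabilities(lam, mu, c, t):
    """Return (S1, S2, S3) of (19) for one component at time t."""
    disc = math.sqrt((lam + mu) ** 2 - 4.0 * (1.0 - c) * lam * mu)
    a1 = (lam + mu + disc) / 2.0
    a2 = (lam + mu - disc) / 2.0
    if a1 == a2:
        raise ValueError("repeated root: closed form (19) does not apply")
    e1, e2 = math.exp(-a1 * t), math.exp(-a2 * t)
    S1 = (mu - a1) / (a2 - a1) * e1 + (mu - a2) / (a1 - a2) * e2
    S2 = lam * c / (a2 - a1) * e1 + lam * c / (a1 - a2) * e2
    return S1, S2, 1.0 - S1 - S2


def conditional_probabilities(S1, S2):
    """Return (P, Q) of (20): state probabilities given no uncovered failure."""
    P = S1 / (S1 + S2)
    return P, 1.0 - P


# Hypothetical component: failure rate 0.2, repair rate 1.8, coverage 0.95.
S1, S2, S3 = state_probabilities(lam=0.2, mu=1.8, c=0.95, t=1.0)
P, Q = conditional_probabilities(S1, S2)
print(S1, S2, S3, P, Q)
```

Setting the repair rate to zero in this function gives a good-state probability of exp(−λt) and a covered-failure probability of c(1 − exp(−λt)), matching the nonrepairable expressions in (17) with the effective parameters of (18).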

Fig. 7. General structure of a fault-coverage model.

Fig. 8. Event and probability space of component i.

The probability of no uncovered failure of the system, $P_u(t)$, becomes

$$P_u(t) = \prod_{i \in S} \left( 1 - S_{3i}(t) \right) = \prod_{i \in S} \left( S_{1i}(t) + S_{2i}(t) \right), \tag{21}$$

where $S$ is the set of components ($n$ elements).

The SEA algorithm based on OBDD for a repairable system with imperfect coverage is as follows:

1. Find $\Pr\{\text{no uncovered failure}\} = P_u(t) = \prod_{i \in S} \bigl(S_{1i}(t) + S_{2i}(t)\bigr)$.

2. Find the conditional probabilities (given no uncovered failures) of component $i$, $P_i(t)$ and $Q_i(t)$, where $P_i(t) = S_{1i}(t) / (S_{1i}(t) + S_{2i}(t))$ and $Q_i(t) = 1 - P_i(t) = S_{2i}(t) / (S_{1i}(t) + S_{2i}(t))$.

3. Using $P_i(t)$ and $Q_i(t)$ from Step 2 in place of the availability and unavailability of each component $i$, respectively, find the conditional unavailability $U_c(t)$ or availability $A_c(t)$ of a system with perfect coverage by using OBDD.

4. Find the unavailability $U_s(t)$ for a system with imperfect coverage using

$$
\begin{aligned}
\Pr\{\text{an uncovered failure}\} &= 1 - P_u(t) \\
\Pr\{\text{system failure} \mid \text{an uncovered failure}\} &= 1 \\
\Pr\{\text{no uncovered failure}\} &= P_u(t) \\
\Pr\{\text{system failure} \mid \text{no uncovered failure}\} &= U_c(t).
\end{aligned}
$$

Therefore, we get the following results:

$$
\begin{aligned}
U_s(t) &= [1 - P_u(t)] \cdot 1 + P_u(t) \cdot U_c(t) = 1 - P_u(t) \cdot A_c(t) \\
A_s(t) &= 1 - U_s(t) = P_u(t) \cdot A_c(t).
\end{aligned}
\tag{22}
$$
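The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: brute-force enumeration of component states stands in for the OBDD evaluation of Step 3, the function names are hypothetical, and the `bridge` structure function assumes a standard five-link bridge topology because Fig. 2a is not reproduced here.

```python
from itertools import product
from math import prod


def sea_availability(states, structure):
    """Apply Steps 1-4 for a system of independent components.

    states: list of (S1, S2) pairs, one per component, at a fixed time t.
    structure: function mapping a tuple of 0/1 component states to a truthy
               value when the system works (perfect-coverage logic).
    Returns (P_u, A_c, A_s, U_s).
    """
    # Step 1: probability of no uncovered failure.
    p_u = prod(s1 + s2 for s1, s2 in states)
    # Step 2: conditional component availabilities.
    p = [s1 / (s1 + s2) for s1, s2 in states]
    # Step 3: conditional system availability under perfect coverage.
    # Enumeration is used here instead of an OBDD (illustrative only).
    a_c = 0.0
    for x in product((0, 1), repeat=len(p)):
        if structure(x):
            a_c += prod(pi if xi else 1.0 - pi for pi, xi in zip(p, x))
    # Step 4: combine the two factors.
    a_s = p_u * a_c
    return p_u, a_c, a_s, 1.0 - a_s


def bridge(x):
    """Assumed bridge network: minimal paths {1,2}, {3,4}, {1,5,4}, {3,5,2}."""
    x1, x2, x3, x4, x5 = x
    return (x1 and x2) or (x3 and x4) or (x1 and x5 and x4) or (x3 and x5 and x2)
```

For a single component with `states = [(0.9, 0.08)]` and `structure = lambda x: x[0]`, the result `a_s` equals 0.9, the good-state probability itself, which is a quick sanity check on the separation into $P_u$ and $A_c$.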

For a nonrepairable system, the repair rate $\mu_i$ of component $i$ is zero. Therefore, letting $\mu_i = 0$ in (19), we get the same result as (17) and (18) for a nonrepairable system: $S_{1i}(t) = a[i]$, $S_{2i}(t) = b[i]$, and $S_{3i}(t) = c[i]$. This method is more efficient than the method in [7], which splits one node into two subnodes to represent imperfect coverage. That splitting enlarges the OBDD and increases the computational complexity. In contrast, because our method uses conditional probabilities, its computational complexity is the same as that of the method for solving perfect coverage problems. Moreover, our method can also be applied to modular structures using the concept of conditional probabilities. Fig. 10 illustrates the OBDD-based algorithm, modified from the SEA algorithm, for repairable systems with imperfect coverage.

Example 2. Consider the bridge network of Example 1, shown in Fig. 2a. Assume redundancy techniques are used such that each link has a fault tolerance scheme and is repairable. We can get the system conditional availability $A_c(t)$ by using $P_i(t)$ and $Q_i(t)$ instead of the availability and unavailability of each component $i$, as shown in Fig. 11. The availability of this repairable network system subject to imperfect coverage is then easily obtained from (22). With this efficient integration, we do not need to solve the whole problem (in this case, the total number of states is $5^3 = 125$, including the imperfect coverage state) using a Markov chain, which matters most when the network is large and complex. Moreover, using the conditional probabilities, this method can be applied to modular structures, and its computational complexity is the same as that of the method for solving perfect coverage problems.

3.3 Semi-Markov Model

The algorithm proposed above could also be applied to the case where time-to-failure distribution is not exponential. This is a more general case. The behavior of an individual component is shown in Fig. 12.

Fig. 10. The SEA algorithm based on OBDD with IPCM in a repairable system.

Fig. 11. The availability evaluation of a bridge network subject to imperfect coverage.

Assumptions

1. The probability density function (pdf), cumulative distribution function (cdf), and survival function (sf) of the times-to-failure of component $i$ are $f_i(t)$, $F_i(t)$, and $\bar{F}_i(t)$, respectively.

2. The pdf, cdf, and sf of the repair times of component $i$ are $g_i(t)$, $G_i(t)$, and $\bar{G}_i(t)$, respectively.

3. The coverage factor of component $i$ is $c_i$.

4. Initially, all components are in good condition.

Using the semi-Markovian approach, the probabilities $S_{ji}(t)$ are calculated as follows:

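$$
\begin{aligned}
S_{1i}(t) &= \bar{F}_i(t) + c_i \bigl[F_i * G_i * \bar{F}_i\bigr](t) + (c_i)^2 \bigl[F_i^{(2)} * G_i^{(2)} * \bar{F}_i\bigr](t) + \cdots \\
&= \sum_{m=0}^{\infty} (c_i)^m \bigl[F_i^{(m)} * G_i^{(m)} * \bar{F}_i\bigr](t), \qquad F^{(0)} = G^{(0)} \equiv 1 \\
&= \sum_{m=0}^{\infty} (c_i)^m \bigl[F_i^{(m)} * G_i^{(m)}(t) - F_i^{(m+1)} * G_i^{(m)}(t)\bigr],
\end{aligned}
\tag{23}
$$

where

$$
F * G(t) = \int_0^t F(t - u)\, dG(u) = \int_0^t G(t - u)\, dF(u),
$$

i.e., the convolution of $F(t)$ and $G(t)$, and $F^{(n)}(t)$ is the $n$-fold convolution of $F$ with itself. Similarly,

$$
\begin{aligned}
S_{2i}(t) &= c_i \bigl[F_i * \bar{G}_i\bigr](t) + (c_i)^2 \bigl[F_i^{(2)} * G_i^{(1)} * \bar{G}_i\bigr](t) + \cdots \\
&= \sum_{m=1}^{\infty} (c_i)^m \bigl[F_i^{(m)} * G_i^{(m-1)} * \bar{G}_i\bigr](t) \\
&= \sum_{m=1}^{\infty} (c_i)^m \bigl[F_i^{(m)} * G_i^{(m-1)}(t) - F_i^{(m)} * G_i^{(m)}(t)\bigr]
\end{aligned}
\tag{24}
$$

and $S_{3i}(t) = 1 - S_{1i}(t) - S_{2i}(t)$. Evaluating (23) and (24) is equivalent to evaluating convolutions and renewal functions [33], [34].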
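When the convolutions in (23) and (24) are inconvenient to evaluate, the state probabilities can be estimated by simulating the alternating failure and repair process directly. The Python sketch below uses plain Monte Carlo sampling, which is a substitute technique and not the convolution method referred to above. The function names, the sampler interface, and the default sample size are assumptions made for illustration.

```python
import random


def simulate_state(t, sample_ttf, sample_ttr, c, rng):
    """Simulate one component and return its state at time t.

    Returns 1 (good), 2 (failed and covered), or 3 (failed and uncovered).
    sample_ttf / sample_ttr draw a time-to-failure / repair time from rng.
    """
    clock = 0.0
    while True:
        clock += sample_ttf(rng)
        if clock > t:
            return 1                # still good at time t
        if rng.random() >= c:
            return 3                # uncovered failure is absorbing
        clock += sample_ttr(rng)
        if clock > t:
            return 2                # covered failure, repair not finished


def estimate_state_probabilities(t, sample_ttf, sample_ttr, c, n=20000, seed=0):
    """Estimate (S1, S2, S3) at time t from n independent replications."""
    rng = random.Random(seed)
    counts = {1: 0, 2: 0, 3: 0}
    for _ in range(n):
        counts[simulate_state(t, sample_ttf, sample_ttr, c, rng)] += 1
    return tuple(counts[j] / n for j in (1, 2, 3))
```

A nonexponential case could be run as `estimate_state_probabilities(5.0, lambda g: g.weibullvariate(10.0, 1.5), lambda g: g.lognormvariate(0.0, 0.5), 0.95)`, where the Weibull and lognormal parameters are arbitrary example values. With exponential samplers, the estimates should agree with the closed-form Markov result up to sampling error.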

3.4 Time-Dependent Coverage Factor

In the above case, it is assumed that the coverage factor is independent of the occurrence time of a fault. However, in some cases, it might depend upon the occurrence time. In other words, the occurrence time distributions of covered faults and uncovered faults might be different. Therefore, different distributions for the time to occurrence of covered and uncovered faults are considered. Further, if the occurrence rates are constant, the coverage factor becomes constant with respect to time. The behavior of an individual module is shown in Fig. 13.

Assumptions

1. The pdf, cdf, and sf of the time to occurrence of covered faults of component $i$ are $f_i(t)$, $F_i(t)$, and $\bar{F}_i(t)$, respectively.

2. The pdf, cdf, and sf of the time to occurrence of uncovered faults of component $i$ are $l_i(t)$, $L_i(t)$, and $\bar{L}_i(t)$, respectively.

3. The pdf, cdf, and sf of the repair times of component $i$ are $g_i(t)$, $G_i(t)$, and $\bar{G}_i(t)$, respectively.

4. Initially, all components are in good condition.

Using the semi-Markovian approach, the probabilities $S_{ji}(t)$ are calculated as follows:

$$
\begin{aligned}
S_{1i}(t) &= \sum_{m=0}^{\infty} \bigl[T_i^{(m)} * G_i^{(m)} * N_i\bigr](t) \\
S_{2i}(t) &= \sum_{m=0}^{\infty} \bigl[T_i^{(m+1)} * G_i^{(m)} * \bar{G}_i\bigr](t) \\
S_{3i}(t) &= 1 - S_{1i}(t) - S_{2i}(t),
\end{aligned}
$$

where $T_i(t) = \int_0^t \bar{L}_i(x)\, dF_i(x)$ and $N_i(t) = \bar{L}_i(t) \cdot \bar{F}_i(t)$.

4 FAILURE FREQUENCY UNDER IMPERFECT COVERAGE MODEL

This section describes the calculation of failure frequency under imperfect fault-coverage and proposes an algorithm based on OBDD. For simplification, we consider the nonrepairable system and use $R$ to represent the time-specific reliability $R(t)$ of a system and $r_i$ the time-specific reliability $r_i(t)$ of component $i$. Then, (17) becomes

$$
\begin{aligned}
a[i] &= r_i \\
b[i] &= (1 - r_i)\, c_i \\
c[i] &= (1 - r_i)(1 - c_i).
\end{aligned}
$$

Since component $i$ is s-independent of the rest of the system, from (10), the contribution to failure frequency is

$$
v^{i}_{f\,\mathrm{IPCM}}(t) = \bigl[R_{x_i=1}(t) - R_{x_i=0}(t)\bigr]\, r_i(t)\, \lambda_i = \frac{\partial R}{\partial r_i}\, r_i \lambda_i.
$$

Therefore, the failure frequency of the system subject to imperfect coverage is

$$
v_{f\,\mathrm{IPCM}}(t) = \sum_{i=1}^{n} v^{i}_{f\,\mathrm{IPCM}}(t) = \sum_{i=1}^{n} \frac{\partial R}{\partial r_i}\, r_i \lambda_i.
$$

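Fig. 12. Semi-Markov model with nonexponential time-to-failure distribution.

As shown in (22) as well as in [21], we can specify the reliability of a system subject to imperfect coverage as follows:

$$
R = P_u \cdot R_c,
\tag{25}
$$

where $P_u$ is the probability of no uncovered failure in the system and $R_c$ is the conditional reliability of the system given that no uncovered failure has occurred in the system. From (21), we have

$$
P_u = \prod_{i=1}^{n} \bigl[r_i + (1 - r_i)\, c_i\bigr],
$$

where $r_i$ is the reliability of component $i$ and $c_i$ is the coverage factor of component $i$. Let the conditional reliability of component $i$ (given no uncovered failure), as shown in (20), be

$$
R_i = \frac{r_i}{r_i + (1 - r_i)\, c_i}.
$$

By (25), we have

$$
\frac{\partial R}{\partial r_i} = R_c\, \frac{\partial P_u}{\partial r_i} + P_u\, \frac{\partial R_c}{\partial r_i},
$$

where

$$
\frac{\partial P_u}{\partial r_i} = \prod_{j=1,\, j \neq i}^{n} \bigl[r_j + (1 - r_j)\, c_j\bigr] \cdot (1 - c_i) = P_u\, \frac{1 - c_i}{r_i + (1 - r_i)\, c_i} = P_u (1 - c_i)\, \frac{R_i}{r_i}
$$

and, by the chain rule of differentiation,

$$
\frac{\partial R_c}{\partial r_i} = \frac{\partial R_c}{\partial R_i} \cdot \frac{\partial R_i}{\partial r_i}, \qquad
\frac{\partial R_i}{\partial r_i} = \frac{r_i + (1 - r_i)\, c_i - r_i (1 - c_i)}{\bigl(r_i + (1 - r_i)\, c_i\bigr)^2} = \frac{c_i}{\bigl(r_i + (1 - r_i)\, c_i\bigr)^2} = c_i \left(\frac{R_i}{r_i}\right)^2.
$$

Therefore, the failure frequency of the system subject to imperfect coverage is

$$
\begin{aligned}
v_{f\,\mathrm{IPCM}}(t) &= \sum_{i=1}^{n} v^{i}_{f\,\mathrm{IPCM}}(t) = \sum_{i=1}^{n} \frac{\partial R}{\partial r_i}\, r_i \lambda_i \\
&= \sum_{i=1}^{n} \left[ R_c P_u (1 - c_i)\, \frac{R_i}{r_i} + P_u\, \frac{\partial R_c}{\partial R_i}\, c_i \left(\frac{R_i}{r_i}\right)^2 \right] r_i \lambda_i \\
&= P_u \left( R_c \sum_{i=1}^{n} w_{1i} + \sum_{i=1}^{n} w_{2i}\, \frac{\partial R_c}{\partial R_i} \right) \\
&= P_u \bigl[ v_{f\,\mathrm{IPCM}1}(t) + v_{f\,\mathrm{IPCM}2}(t) \bigr],
\end{aligned}
$$

where

$$
\begin{aligned}
w_{1i} &= (1 - c_i)\, \lambda_i R_i \\
w_{2i} &= \frac{c_i}{r_i}\, \lambda_i R_i^2 \\
v_{f\,\mathrm{IPCM}1}(t) &= R_c \sum_{i=1}^{n} w_{1i} \\
v_{f\,\mathrm{IPCM}2}(t) &= \sum_{i=1}^{n} w_{2i}\, \frac{\partial R_c}{\partial R_i}.
\end{aligned}
$$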

In order to obtain $v_{f\,\mathrm{IPCM}1}(t)$, we can apply the OBDD method depicted in Section 3 to get $R_c$, i.e., the conditional reliability of a system given that no uncovered failure has occurred, and then $v_{f\,\mathrm{IPCM}1}(t)$ can be easily obtained. Moreover, since $v_{f\,\mathrm{IPCM}2}(t)$ is similar to $v$ in (7), we can compute frequencies as shown below.

$$
v_{f\,\mathrm{IPCM}2}(t) = w_{2i} \bigl[ R_c|_{x_i=1}(t) - R_c|_{x_i=0}(t) \bigr] + R_i(t) \cdot v_{f\,\mathrm{IPCM}2}|_{x_i=1}(t) + \tilde{R}_i(t) \cdot v_{f\,\mathrm{IPCM}2}|_{x_i=0}(t).
$$

The proof is similar to that for (6). An OBDD-based algorithm similar to that in Fig. 5 can be applied to this equation.
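The decomposition of the failure frequency into $P_u$ times the sum of the two partial terms can be checked numerically on a small system. The Python sketch below is an illustration only: state enumeration replaces the OBDD traversal, all function names are hypothetical, and it assumes every coverage factor and component reliability is strictly positive so that no division by zero occurs.

```python
from itertools import product
from math import prod


def perfect_reliability(p, structure):
    """System reliability under perfect coverage, by state enumeration."""
    return sum(
        prod(pi if xi else 1.0 - pi for pi, xi in zip(p, x))
        for x in product((0, 1), repeat=len(p))
        if structure(x)
    )


def ipcm_failure_frequency(r, c, lam, structure):
    """Failure frequency via P_u * (v1 + v2) with weights w1 and w2."""
    n = len(r)
    p_u = prod(ri + (1 - ri) * ci for ri, ci in zip(r, c))
    cond = [ri / (ri + (1 - ri) * ci) for ri, ci in zip(r, c)]  # R_i
    r_c = perfect_reliability(cond, structure)
    v1 = r_c * sum((1 - c[i]) * lam[i] * cond[i] for i in range(n))
    v2 = 0.0
    for i in range(n):
        # dRc/dR_i as a difference of two conditional reliabilities.
        hi = perfect_reliability(cond[:i] + [1.0] + cond[i + 1:], structure)
        lo = perfect_reliability(cond[:i] + [0.0] + cond[i + 1:], structure)
        v2 += (c[i] / r[i]) * lam[i] * cond[i] ** 2 * (hi - lo)
    return p_u * (v1 + v2)


def direct_failure_frequency(r, c, lam, structure):
    """Cross-check: sum over i of dR/dr_i * r_i * lam_i with R = P_u * R_c."""
    def system_r(rr):
        p_u = prod(ri + (1 - ri) * ci for ri, ci in zip(rr, c))
        cond = [ri / (ri + (1 - ri) * ci) for ri, ci in zip(rr, c)]
        return p_u * perfect_reliability(cond, structure)

    total = 0.0
    for i in range(len(r)):
        # R is linear in each r_i, so the derivative is an exact difference.
        hi = system_r(r[:i] + [1.0] + r[i + 1:])
        lo = system_r(r[:i] + [0.0] + r[i + 1:])
        total += (hi - lo) * r[i] * lam[i]
    return total
```

For a two-component parallel system with `r = [0.9, 0.8]`, `c = [0.95, 0.9]`, and `lam = [0.01, 0.02]`, both functions should return the same value, which gives a simple consistency check on the weights $w_{1i}$ and $w_{2i}$.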

5 RELIABILITY IMPORTANCE MEASURES

Some components are more critical than others to the functioning of a system as a result of the system's structural arrangement. In addition, a component's failure probability also has a great influence on its importance with respect to the overall state of a system. Therefore, the purpose of evaluating reliability importance measures (also known as sensitivity analysis) is to obtain information concerning a component's contribution to the system reliability, which can be very useful in system design, failure diagnosis, and system failure probability minimization. For more details on the definitions, please refer to [9].

In [19], two algorithms are proposed to compute the Birnbaum importance measure using OBDD. The first algorithm is a two-pass traversal algorithm, which computes the system reliability in each traversal by setting the reliability of the corresponding component to one in the first traversal and to zero in the second traversal. The second algorithm is a single-pass traversal algorithm, but it has two phases of computation. In the first phase, the probability of path selection from the root node to a specific node i is calculated. In the second phase, the probabilities of its left and right child nodes are calculated. Finally, these probabilities are used to compute the importance measure. Therefore, the algorithms presented in [19] take more computational time. Moreover, the second algorithm can only be applied to cases where there is only one path between the root and each node of the OBDD; otherwise, some calculations have to be modified.

Later, Ou and Dugan [35] proposed an equation that can produce the importance measure of a component during the reliability evaluation traversal. Therefore, in a single-pass traversal, both the reliability and the importance measure of a single component can be evaluated. However, they did not present the implementation of their single-pass traversal. Further, a modular approach to calculate the system-level importance measure using module-level importance measures is presented in [35], which can be used to reduce the computational time. However, the approaches presented in both [19] and [35] calculate only one component's importance measure at a time. They need to be run again to obtain the importance measure of another component.

In this section, we propose an algorithm to compute multiple components’ importance measures with only a single OBDD traversal. First, we present the importance measures for systems with perfect fault-coverage. Then, we extend our algorithms to evaluate the importance measures of systems with imperfect fault-coverage. Our algorithm can be combined with the modular approach presented in [35] to find the importance of modular fault trees or networks in an efficient way.

5.1 Reliability Importance Measures under Perfect Coverage Model

5.1.1 Birnbaum Importance Measure

The Birnbaum importance measure of a component (say component k) represents the probability that a system is in a critical state with respect to that component, i.e., the probability that the system is initially in a good state and the failure of component k causes the system to fail. It is defined as the partial derivative of system unreliability with respect to the failure probability of component k:

$$
I_k^{B}(t) \equiv \frac{\partial F(t)}{\partial F_k(t)} \equiv \frac{\partial R(t)}{\partial R_k(t)} = \Pr\{g_{y_k=1}\} - \Pr\{g_{y_k=0}\} = \Pr\{f_{x_k=1}\} - \Pr\{f_{x_k=0}\},
\tag{26}
$$

where $F(t)$ is the system failure probability at time $t$, $F_k(t)$ is the failure probability of component $k$ at time $t$, $F_k(t) = \Pr\{y_k = 1\}$, $g$ is the system structure function, and $f$ is the dual of $g$, i.e.,

$$
f(x_1, x_2, \ldots, x_n) \equiv 1 - g(1 - x_1, 1 - x_2, \ldots, 1 - x_n) \equiv 1 - g(y_1, y_2, \ldots, y_n),
$$

where $y_i = 1 - x_i$ and $x_i = 1$ (0) means component $i$ is good (faulty). Here, $\Pr\{g_{y_k=1}\}$ is the unreliability of the system given that component $k$ has failed. Similarly, $\Pr\{f_{x_k=1}\}$ is the reliability of the system given that component $k$ is working.
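To make the definition concrete, the following Python sketch evaluates the Birnbaum importance as the difference between the system reliability with component $k$ forced good and with component $k$ forced failed. It enumerates component states rather than traversing an OBDD, and the function names and the example structure function are illustrative assumptions.

```python
from itertools import product
from math import prod


def reliability(p, f):
    """Pr{f = 1} for independent components with reliabilities p (a list)."""
    return sum(
        prod(pi if xi else 1.0 - pi for pi, xi in zip(p, x))
        for x in product((0, 1), repeat=len(p))
        if f(x)
    )


def birnbaum(p, f, k):
    """Birnbaum importance of component k: Pr{f | x_k = 1} - Pr{f | x_k = 0}."""
    up = reliability(p[:k] + [1.0] + p[k + 1:], f)
    down = reliability(p[:k] + [0.0] + p[k + 1:], f)
    return up - down


def two_out_of_three(x):
    """Example structure function: system works if at least two components work."""
    return sum(x) >= 2
```

For the 2-out-of-3 example with `p = [0.9, 0.8, 0.7]`, `birnbaum(p, two_out_of_three, 0)` evaluates $p_1 + p_2 - 2 p_1 p_2$ for the other two components, which is the probability that exactly one of them works and component 0 is therefore critical.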

5.1.2 Two-Pass Traversal

This method needs to traverse the OBDD twice to obtain the importance measure of component k [19].

• Find $\Pr\{f_{x_k=1}\}$ using OBDD, i.e., find the system reliability by assuming the reliability of component $k$ to be 1.

• Find $\Pr\{f_{x_k=0}\}$ using OBDD, i.e., find the system reliability by assuming the reliability of component $k$ to be 0.

• $\Pr\{f_{x_k=1}\} - \Pr\{f_{x_k=0}\}$ gives the Birnbaum importance measure of component $k$.

5.1.3 Modified Single-Pass Traversal

This method needs to traverse the OBDD only once to get the importance measure of component $k$. There are two steps. In the first step, from (26), the importance measure depends on the probability of state transition of component $k$. Therefore, a disjoint path that goes to the terminal one and does not include component $k$ will not contribute to the importance measure of component $k$. We should delete such paths or set their probabilities to 0 while traversing the OBDD.

The second step is similar to the procedure in the above section for finding $\Pr\{f_{x_k=1}\}$ and $\Pr\{f_{x_k=0}\}$, except at the node corresponding to component $k$. At that node, since the reliability of component $k$ in finding $\Pr\{f_{x_k=1}\}$ is 1, the probability of this node is equivalent to the probability of the right subtree. Similarly, since the reliability of component $k$ in finding $\Pr\{f_{x_k=0}\}$ is 0, the probability of this node is equivalent to the probability of the left subtree. Therefore, we combine the two calculations at the node corresponding to component $k$ and compute the probability of each node (say $i$) in the OBDD using the following rules:

• If node $i$ corresponds to component $k$, then $\Pr\{f_k\} = \Pr\{f_{x_k=1}\} - \Pr\{f_{x_k=0}\}$.

• If node $i$ does not correspond to component $k$ and ordering($i$) > ordering($k$), then

$$
\Pr\{f_i\} = \Pr\{x_i\} \cdot \Pr\{f_{x_i=1}\} + (1 - \Pr\{x_i\}) \cdot \Pr\{f_{x_i=0}\}.
\tag{27}
$$

• If node $i$ does not correspond to component $k$ and ordering($i$) < ordering($k$), then also use (27) to calculate $\Pr\{f_i\}$, except that if the right (left) subtree is independent of component $k$, then let $\Pr\{f_{x_i=1}\}$ ($\Pr\{f_{x_i=0}\}$) be 0 in (27). Checking whether the subtree of node $i$ is independent of component $k$ is simple. Let node $j$ be the subnode of node $i$. If ordering($j$) > ordering($k$), then the subtree is independent of component $k$.

Finally, when we have finished traversing the OBDD, we get the probability of the root, $\Pr\{f\}$, which gives the Birnbaum importance measure of component $k$.

5.1.4 Single-Pass Traversal for Multiple Components

This method needs to traverse the OBDD only once to get the importance measures of multiple components. It extends the method in Section 5.1.3. If $\Omega$ is the set of components whose Birnbaum importance measures are to be calculated, $\Pr\{f(0)\}$ is the system reliability, and $\Pr\{f(k)\}$ is the Birnbaum importance measure of component $k$, then we have the following at each node (say $i$):

• For each node $i$,

$$
\Pr\{f_i(0)\} = \Pr\{x_i\} \cdot \Pr\{f_{x_i=1}(0)\} + (1 - \Pr\{x_i\}) \cdot \Pr\{f_{x_i=0}(0)\}.
$$

• For each $k \in \Omega$,

- If node $i$ corresponds to component $k$, then $\Pr\{f_i(k)\} = \Pr\{f_{x_i=1}(0)\} - \Pr\{f_{x_i=0}(0)\}$.

- If node $i$ does not correspond to component $k$ and ordering($k$) > ordering($i$) (i.e., $\Pr\{f_{x_i=1}(k)\}$ and $\Pr\{f_{x_i=0}(k)\}$ have been calculated), then let $\Pr\{f_{x_i=1}(k)\}$ ($\Pr\{f_{x_i=0}(k)\}$) be 0 in (28) if the right (left) subtree is independent of component $k$. Then, calculate $\Pr\{f_i(k)\}$ using (28):

$$
\Pr\{f_i(k)\} = \Pr\{x_i\} \cdot \Pr\{f_{x_i=1}(k)\} + (1 - \Pr\{x_i\}) \cdot \Pr\{f_{x_i=0}(k)\}.
\tag{28}
$$

- Otherwise, do nothing since $\Pr\{f_i(k)\}$ is equivalent to $\Pr\{f_i(0)\}$.
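The rules above can be sketched in Python as a memoized bottom-up evaluation that returns the system reliability together with the Birnbaum importance of every requested component. This is a sketch under assumptions and not the algorithm of Fig. 14: the dictionary-based node encoding is hypothetical, the variable index is taken to be the ordering, and a subtree whose top variable comes after component $k$ in the ordering contributes 0 for $k$, which matches the independence check described above.

```python
def birnbaum_single_pass(nodes, root, p, targets):
    """Evaluate reliability and Birnbaum importances in one traversal.

    nodes: {node_id: (var, low, high)}; terminal nodes are the ids 0 and 1.
           'high' is the x_var = 1 branch, 'low' the x_var = 0 branch.
    p: list of component reliabilities indexed by var (var index = ordering).
    targets: iterable of component indices whose importance is wanted.
    Returns (reliability, {k: importance of component k}).
    """
    targets = list(targets)
    memo = {}

    def visit(node):
        if node in (0, 1):
            # A terminal is independent of every component.
            return float(node), {k: 0.0 for k in targets}
        if node in memo:
            return memo[node]
        var, low, high = nodes[node]
        r_low, d_low = visit(low)
        r_high, d_high = visit(high)
        rel = p[var] * r_high + (1.0 - p[var]) * r_low
        deriv = {}
        for k in targets:
            if var == k:
                deriv[k] = r_high - r_low
            elif var < k:
                deriv[k] = p[var] * d_high[k] + (1.0 - p[var]) * d_low[k]
            else:
                deriv[k] = 0.0  # node lies below k in the ordering
        memo[node] = (rel, deriv)
        return memo[node]

    return visit(root)


# Hypothetical OBDD of a 2-out-of-3 system with ordering x0 < x1 < x2.
TWO_OF_THREE = {
    2: (0, 3, 4),  # root tests x0
    3: (1, 0, 5),  # x0 failed: both x1 and x2 are needed
    4: (1, 5, 1),  # x0 good: x1 or x2 is enough
    5: (2, 0, 1),  # shared node testing x2
}
```

Calling `birnbaum_single_pass(TWO_OF_THREE, 2, [0.9, 0.8, 0.7], [0, 1, 2])` visits the shared node 5 only once because of the memo table, which reflects the benefit of node sharing in an OBDD.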

Finally, the probabilities $\Pr\{f(k)\}$ at the root of the OBDD, one for each component $k$ of interest, give the Birnbaum importance measures of those components. Fig. 14 illustrates the OBDD-based algorithm for the calculation of Birnbaum importance measures of multiple components by traversing the OBDD only once.

5.1.5 Criticality Importance Measure

The criticality importance measure of component $k$ of a system is defined as the probability that component $k$ becomes faulty at time $t$ and, at the same time, is critical to the occurrence of system failure, given that the system has failed:

$$
I_k^{Cr}(t) \equiv \frac{\partial F(t)}{\partial F_k(t)} \cdot \frac{F_k(t)}{F(t)} \equiv I_k^{B}(t) \cdot \frac{F_k(t)}{F(t)}.
\tag{29}
$$

It should be noted that the criticality importance measure of component $k$ can be calculated directly from (29) once we have the Birnbaum importance measure of component $k$. The Birnbaum importance measure can be calculated using the algorithm presented in Section 5.1.3. If we want to evaluate this measure for all components, then an algorithm similar to the one used for the Birnbaum importance measure in Section 5.1.4 can be used.

5.1.6 Risk Reduction Ratio

This ratio indicates the decrease in system unreliability when the unreliability of a given component, say component $k$, is zero (i.e., component $k$ never fails). This ratio is defined as

$$
RRR_k(t) \equiv \frac{F(t)}{F_{x_k=1}(t)} \equiv \frac{1 - R(t)}{1 - R_{x_k=1}(t)} = \frac{1 - \Pr\{f\}}{1 - \Pr\{f_{x_k=1}\}}.
$$

The evaluation of this measure is similar to the evaluation of the Birnbaum importance measure. We first find $\Pr\{f\}$ and then find $\Pr\{f_{x_k=1}\}$. In fact, using the concept of Section 5.1.4, we can get the result by traversing the OBDD only once.

5.1.7 Risk Reduction Interval

This index is similar to the Risk Reduction Ratio except that it uses the actual unreliability difference instead of the unreliability ratio. It is defined as

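$$
RRI_k(t) \equiv F(t) - F_{x_k=1}(t) \equiv R_{x_k=1}(t) - R(t) = \Pr\{f_{x_k=1}\} - \Pr\{f\}.
$$

However, expanding $\Pr\{f\}$ with respect to component $k$ gives

$$
\begin{aligned}
RRI_k(t) &\equiv \Pr\{f_{x_k=1}\} - \bigl( \Pr\{x_k\} \cdot \Pr\{f_{x_k=1}\} + (1 - \Pr\{x_k\}) \cdot \Pr\{f_{x_k=0}\} \bigr) \\
&= (1 - \Pr\{x_k\}) \cdot \Pr\{f_{x_k=1}\} - (1 - \Pr\{x_k\}) \cdot \Pr\{f_{x_k=0}\} \\
&= (1 - \Pr\{x_k\}) \cdot \bigl( \Pr\{f_{x_k=1}\} - \Pr\{f_{x_k=0}\} \bigr).
\end{aligned}
$$

Therefore, the evaluation of this measure is similar to the evaluation of the Birnbaum importance measure by traversing the OBDD once, as discussed in Section 5.1.4.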

5.1.8 Risk Increase Ratio

This ratio indicates the increase in the system unreliability when the unreliability of a given component, say component $k$, is one (i.e., component $k$ always fails). This ratio is defined as

$$
RIR_k(t) \equiv \frac{F_{x_k=0}(t)}{F(t)} \equiv \frac{1 - R_{x_k=0}(t)}{1 - R(t)} = \frac{1 - \Pr\{f_{x_k=0}\}}{1 - \Pr\{f\}}.
$$

Evaluation of this measure is similar to the evaluation of the Risk Reduction Ratio. Using the concept of Section 5.1.4, we can get the result by traversing the OBDD only once.

5.1.9 Risk Increase Interval

This index is similar to the Risk Increase Ratio; however, it uses the actual unreliability difference instead of the unreliability ratio. It is defined as

$$
\begin{aligned}
RII_k(t) &\equiv F_{x_k=0}(t) - F(t) \equiv R(t) - R_{x_k=0}(t) \\
&= \Pr\{f\} - \Pr\{f_{x_k=0}\} \\
&= \Pr\{x_k\} \cdot \bigl( \Pr\{f_{x_k=1}\} - \Pr\{f_{x_k=0}\} \bigr).
\end{aligned}
$$

The evaluation of this measure is similar to the evaluation of the Risk Reduction Interval.
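Since all of the measures in Sections 5.1.5 through 5.1.9 depend only on the system reliability, the two conditional reliabilities, and the reliability of component $k$, they can be computed together once those quantities are known. The Python sketch below shows this; the function name and the dictionary keys are illustrative choices, and the inputs are assumed to come from a prior OBDD evaluation with the system unreliability and the conditional unreliability both nonzero.

```python
def risk_measures(r, r_up, r_down, p_k):
    """Importance and risk measures of component k under perfect coverage.

    r:      system reliability Pr{f}
    r_up:   Pr{f | x_k = 1} (component k never fails)
    r_down: Pr{f | x_k = 0} (component k always fails)
    p_k:    reliability of component k, Pr{x_k}
    Assumes r < 1 and r_up < 1 so that both ratios are defined.
    """
    f, f_up, f_down = 1.0 - r, 1.0 - r_up, 1.0 - r_down
    birnbaum = r_up - r_down
    return {
        "birnbaum": birnbaum,
        "criticality": birnbaum * (1.0 - p_k) / f,  # I_B * F_k / F
        "RRR": f / f_up,                            # risk reduction ratio
        "RRI": f - f_up,                            # risk reduction interval
        "RIR": f_down / f,                          # risk increase ratio
        "RII": f_down - f,                          # risk increase interval
    }
```

With the consistent inputs `r = 0.902`, `r_up = 0.94`, `r_down = 0.56`, and `p_k = 0.9`, the two interval measures satisfy the identities derived above: the Risk Reduction Interval equals $(1 - \Pr\{x_k\})$ times the Birnbaum measure and the Risk Increase Interval equals $\Pr\{x_k\}$ times the Birnbaum measure.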

5.2 Reliability Importance Measures under Imperfect Coverage Model

This section presents several algorithms to compute the importance measures for a system with imperfect fault-coverage. For simplification, we consider the nonrepairable system, and the importance measures are calculated assuming no change in the coverage factor with respect to the change in the component failure probability (i.e., for fixed coverage factors).

Fig. 14. The OBDD-based algorithm for the calculation of the Birnbaum importance measure.

5.2.1 Birnbaum Importance Measure

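From (25) as well as [21], the reliability of a system with imperfect coverage is as follows:

$$
R = P_u \cdot R_c,
\tag{30}
$$

where $P_u$ is the probability of no uncovered failure in the system and $R_c$ is the conditional reliability of the system given that no uncovered failure has occurred in this system:

$$
P_u = \prod_{i=1}^{n} \bigl[r_i + (1 - r_i)\, c_i\bigr],
$$

where $r_i$ is the reliability of component $i$ and $c_i$ is the coverage factor of component $i$.

The conditional reliability of component $i$, $R_i$, given that no uncovered failure has occurred, is

$$
R_i = \frac{r_i}{r_i + (1 - r_i)\, c_i}.
$$

By (30), we have

$$
\frac{\partial R}{\partial r_k} = R_c\, \frac{\partial P_u}{\partial r_k} + P_u\, \frac{\partial R_c}{\partial r_k},
$$

where

$$
\frac{\partial P_u}{\partial r_k} = \prod_{i=1,\, i \neq k}^{n} \bigl[r_i + (1 - r_i)\, c_i\bigr] \cdot (1 - c_k) = P_u\, \frac{1 - c_k}{r_k + (1 - r_k)\, c_k} = P_u (1 - c_k)\, \frac{R_k}{r_k},
$$

and, by the chain rule of differentiation,

$$
\frac{\partial R_c}{\partial r_k} = \frac{\partial R_c}{\partial R_k} \cdot \frac{\partial R_k}{\partial r_k}, \qquad
\frac{\partial R_k}{\partial r_k} = \frac{r_k + (1 - r_k)\, c_k - r_k (1 - c_k)}{\bigl(r_k + (1 - r_k)\, c_k\bigr)^2} = c_k \left(\frac{R_k}{r_k}\right)^2.
$$

Therefore,

$$
\begin{aligned}
\frac{\partial R}{\partial r_k} &= R_c P_u (1 - c_k)\, \frac{R_k}{r_k} + P_u\, \frac{\partial R_c}{\partial R_k}\, c_k \left(\frac{R_k}{r_k}\right)^2 \\
&= P_u\, \frac{R_k}{r_k} \left[ (1 - c_k) R_c + c_k\, \frac{\partial R_c}{\partial R_k}\, \frac{R_k}{r_k} \right].
\end{aligned}
\tag{31}
$$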

In order to derive $\partial R / \partial r_k$, we need to find only $\partial R_c / \partial R_k$ because all other parameters are known already. Finding $\partial R_c / \partial R_k$ is equivalent to finding the Birnbaum importance measure of the system with perfect coverage, but with the modified values of the component reliabilities, that is, the conditional reliabilities $R_i$ of the components. This Birnbaum importance measure can be evaluated using the method in Section 5.1.3.

Therefore, from (26), the Birnbaum importance measure under the imperfect coverage model is

$$
I^{B}_{k\,\mathrm{IPCM}}(t) \equiv \frac{\partial R(t)}{\partial r_k(t)}.
$$

The calculation of this measure is straightforward from (31).
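As a hedged illustration of how (31) can be evaluated, the Python sketch below computes the Birnbaum importance under imperfect coverage from the perfect-coverage quantities $R_c$ and $\partial R_c / \partial R_k$. State enumeration is used in place of an OBDD, the function names are hypothetical, and the sketch assumes $r_k > 0$ and $c_i > 0$ for all components.

```python
from itertools import product
from math import prod


def perfect_reliability(p, structure):
    """System reliability under perfect coverage, by state enumeration."""
    return sum(
        prod(pi if xi else 1.0 - pi for pi, xi in zip(p, x))
        for x in product((0, 1), repeat=len(p))
        if structure(x)
    )


def ipcm_birnbaum(r, c, structure, k):
    """Evaluate (31): dR/dr_k for a system with imperfect coverage.

    r: list of component reliabilities, c: list of coverage factors.
    """
    p_u = prod(ri + (1 - ri) * ci for ri, ci in zip(r, c))
    cond = [ri / (ri + (1 - ri) * ci) for ri, ci in zip(r, c)]  # R_i
    r_c = perfect_reliability(cond, structure)
    # Perfect-coverage Birnbaum importance evaluated at the R_i values.
    d_rc = (perfect_reliability(cond[:k] + [1.0] + cond[k + 1:], structure)
            - perfect_reliability(cond[:k] + [0.0] + cond[k + 1:], structure))
    ratio = cond[k] / r[k]
    return p_u * ratio * ((1 - c[k]) * r_c + c[k] * d_rc * ratio)
```

For a two-component parallel system the system reliability can be written out directly as $r_0 r_1 + r_0 (1 - r_1) c_1 + (1 - r_0) c_0 r_1$, so the derivative with respect to $r_0$ is $r_1 + (1 - r_1) c_1 - c_0 r_1$. With `r = [0.9, 0.8]` and `c = [0.95, 0.9]` this gives 0.22, which the sketch should reproduce for `k = 0`.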

5.2.2 Criticality Importance Measure

From the definition and (29), the criticality importance measure under the imperfect coverage model is

$$
I^{Cr}_{k\,\mathrm{IPCM}}(t) \equiv I^{B}_{k\,\mathrm{IPCM}}(t) \cdot \frac{F_k(t)}{F(t)}.
$$

Therefore, the calculation of this measure is simple given that we have the Birnbaum importance measure of component k.

5.2.3 Risk-Based Measures

The evaluation of the risk-based measures for the imperfect coverage case is similar to the case of perfect coverage except that the effect of imperfect coverage should be taken into account in evaluating the system unreliability.

6 CONCLUSIONS

We have presented a new efficient algorithm based on OBDD for the calculation of the time-specific as well as the steady-state failure frequency of a system. This algorithm can also be applied to the case of global time-dependent failure and repair rates if the individual component availability is calculated using appropriate techniques. In addition, we have proposed an OBDD-based algorithm for the evaluation of importance measures, including the Birnbaum importance, the Criticality importance, and other indices for risk evaluation of systems. Moreover, all of the proposed OBDD-based algorithms in this paper can also be extended to analyze a system with imperfect fault-coverage. The imperfect coverage model (IPCM) is important for the accurate reliability assessment of a fault-tolerant computer system. We have also proposed an approach for incorporating the IPCM of a repairable system into a combinatorial model, in which the individual repairable components with imperfect fault-coverage are modeled using Markov chains. Due to the use of conditional probabilities and Markov chains, our OBDD-based algorithm is very efficient for the reliability evaluation of a nonrepairable system and the availability evaluation of a repairable system with imperfect fault-coverage.

In this paper, the powerful capability of OBDD for reliability evaluation has been fully exploited with the proposed algorithms. These OBDD-based algorithms are very efficient in both computational time and storage demand for reliability analysis and also make it possible to study practical, large distributed systems. Building on these approaches, our future work will focus on sensitivity analysis, importance measures, failure frequency analysis, and optimal design issues of distributed systems and multistate systems.

ACKNOWLEDGMENTS

This research was supported by the National Science Council, Taiwan, Republic of China, under Grant NSC 92-2623-7-002-004-NU.

REFERENCES

[1] A. Rauzy, “A New Methodology to Handle Boolean Models with Loops,” IEEE Trans. Reliability, vol. 52, no. 1, pp. 96-105, Mar. 2003.

[2] J.D. Andrews and S.J. Dunnett, “Event-Tree Analysis Using Binary Decision Diagrams,” IEEE Trans. Reliability, vol. 49, pp. 230-338, June 2000.

[3] S.Y. Kuo, S.K. Lu, and F.M. Yeh, “Determining Terminal-Pair Reliability Based on Edge Expansion Diagrams Using OBDD,” IEEE Trans. Reliability, vol. 48, pp. 234-246, Sept. 1999.

[4] W.G. Schneeweiss, The Fault Tree Method. Hagen: LiLoLe-Verlag, 1999.

[5] L. Xing and J.B. Dugan, “Reliability Analysis of Phased-Mission Systems with Combinatorial Phase Requirements,” Proc. Ann. Reliability and Maintainability Symp. (RAMS ’01), pp. 344-351, 2001.

[6] L. Xing and J.B. Dugan, “Analysis of Generalized Phased-Mission Systems Reliability, Performance, and Sensitivity,” IEEE Trans. Reliability, pp. 199-211, June 2002.

[7] X. Zang, H. Sun, and K.S. Trivedi, “Dependability Analysis of Distributed Computer Systems with Imperfect Coverage,” Proc. 29th Ann. Int’l Symp. Fault-Tolerant Computing (FTCS-29), pp. 330-337, 1999.

[8] H. Kumamoto and E.J. Henley, Probabilistic Risk Assessment and Management for Engineers and Scientists, second ed. IEEE Press, 1996.

[9] K.B. Misra, Reliability Analysis and Prediction: A Methodology Oriented Treatment. Elsevier, 1992.

[10] M. Veeraraghavan and K.S. Trivedi, “Multiple Variable Inversion Techniques,” New Trends in System Reliability Evaluation, K.B. Misra, ed., pp. 39-74, Elsevier, 1993.

[11] S.V. Amari, “Generic Rules to Evaluate System-Failure Frequency,” IEEE Trans. Reliability, vol. 49, pp. 85-87, Mar. 2000.

[12] R. Billinton and S. Jonnavithula, “Calculation of Frequency, Duration, and Availability Indexes in Complex Network,” IEEE Trans. Reliability, vol. 48, pp. 25-30, Mar. 1999.

[13] C. Singh, “Calculating the Time-Specific Frequency of System Failure,” IEEE Trans. Reliability, vol. 28, pp. 124-126, June 1979.

[14] C. Singh, “Rules for Calculating the Time-Specific Frequency of System Failure,” IEEE Trans. Reliability, vol. 30, pp. 364-366, Oct. 1981.

[15] C. Singh and R. Billinton, System Reliability Modelling and Evaluation. Hutchinson & Co., 1977.

[16] W.G. Schneeweiss, “Limited Usefulness of BDDs for Mean Failure Frequency Calculation,” J. Automatic Control Production Systems, pp. 1131-1136, 1996.

[17] W.G. Schneeweiss, “Advanced Fault Tree Modeling,” J. Universal Computer Science, vol. 5, pp. 633-643, 1999.

[18] W.G. Schneeweiss, “Fast Fault-Tree Evaluation for Many Sets of Input Data,” IEEE Trans. Reliability, vol. 39, pp. 296-300, Aug. 1990.

[19] R. Sinnamon and J.D. Andrews, “Fault Tree Analysis and Binary Decision Diagrams,” Proc. Ann. Reliability and Maintainability Symp. (RAMS ’96), pp. 215-222, Jan. 1996.

[20] W.G. Schneeweiss, “The Failure Frequency of Systems with Dependent Components,” IEEE Trans. Reliability, vol. 35, pp. 512-517, Dec. 1986.

[21] S.V. Amari, J.B. Dugan, and R.B. Misra, “A Separable Method for Incorporating Imperfect Fault-Coverage into Combinatorial Models,” IEEE Trans. Reliability, pp. 267-274, Sept. 1999.

[22] S.A. Doyle, J.B. Dugan, and F.A. Patterson-Hine, “A Combinatorial Approach to Modeling Imperfect Coverage,” IEEE Trans. Reliability, vol. 44, pp. 87-94, Mar. 1995.

[23] R.E. Bryant, “Graph-Based Algorithms for Boolean Function Manipulation,” IEEE Trans. Computers, vol. 35, no. 8, pp. 677-691, Aug. 1986.

[24] A. Rauzy, “New Algorithms for Fault Tree Analysis,” Reliability Eng. and System Safety, vol. 40, pp. 203-211, 1993.

[25] R.M. Sinnamon and J.D. Andrews, “Improved Efficiency in Qualitative Fault Tree Analysis,” Quality and Reliability Eng. Int’l, vol. 13, pp. 293-298, 1997.

[26] J.M. Nahman, “Failure-Frequency Evaluation of Complex Systems Using Cut-Set Approach,” IEEE Trans. Reliability, vol. 30, pp. 353-355, Oct. 1981.

[27] J.B. Dugan, “Fault Trees and Imperfect Coverage,” IEEE Trans. Reliability, vol. 38, pp. 177-185, June 1989.

[28] H. Pham, “Optimal Cost-Effective Design of Triple-Modular-Redundancy-with-Spares Systems,” IEEE Trans. Reliability, vol. 42, pp. 369-374, Sept. 1993.

[29] S.V. Amari, J.B. Dugan, and R.B. Misra, “Optimal Reliability Design of Systems Subject to Imperfect Coverage,” IEEE Trans. Reliability, pp. 275-284, Sept. 1999.

[30] J.B. Dugan and K.S. Trivedi, “Coverage Modeling for Dependability Analysis of Fault-Tolerant Systems,” IEEE Trans. Computers, vol. 38, no. 6, pp. 755-787, June 1989.

[31] S. Akhtar, “Reliability of k-out-of-n:G Systems with Imperfect Fault-Coverage,” IEEE Trans. Reliability, vol. 43, pp. 101-106, Mar. 1994.

[32] L. Yin, M.A.J. Smith, and K.S. Trivedi, “Uncertainty Analysis in Reliability Modeling,” Proc. Ann. Reliability and Maintainability Symp. (RAMS ’01), pp. 229-234, 2001.

[33] H. Ayhan, J. Limon-Robles, and M. Wortman, “An Approach for Computing Tight Numerical Bounds on Renewal Functions,” IEEE Trans. Reliability, vol. 39, pp. 182-188, June 1999.

[34] E. Smeitink and R. Dekker, “A Simple Approximation to Renewal Function,” IEEE Trans. Reliability, vol. 39, pp. 71-75, Apr. 1990.

[35] Y. Ou and J.B. Dugan, “Sensitivity Analysis of Modular Dynamic Fault Trees,” Proc. Computer Performance and Dependability Symp. (IPDS ’00), pp. 35-43, 2000.

Yung-Ruei Chang received the MS degree (1995) in electrical engineering from National Taiwan University. Since 1996, he has been working as an assistant researcher at the Institute of Nuclear Energy Research, Atomic Energy Council, Taiwan. He has been pursuing the PhD degree in electrical engineering at National Taiwan University since 2000. His research interests include fault-tolerant system reliability, distributed dependable system reliability, network reliability, and dependable computing. He is a student member of the IEEE.

Suprasad V. Amari received the BS degree (1990) in mechanical engineering from Sri Venkateswara University, Tirupati, the MS degree in reliability engineering (1992) and the PhD degree in reliability, risk, and fault tolerance of complex systems (1998) from the Indian Institute of Technology, Kharagpur. He is a reliability software design engineer at Relex Software Corporation. In his work with Relex Software Corporation, he is involved in designing the Relex Markov module, the Relex System Optimization and Simulation (OpSim) module, Optimal Scheduling and Replacement Policies, Test Planning, Reliability Growth Models, Dynamic Fault Trees, and algorithms to compute the reliability of systems with repairable components. He was with Tata Consultancy Services from 1996 to 2000, where he was involved in software design and development using Object-Oriented Methodologies and Formal Methods, and worked as a consultant and technical lead for data mining and data warehousing projects. His research interests include hardware and software reliability, risk assessment, fault-tolerant computing, and optimization. He is a member of the IEEE, ACM, and ASQ.

Sy-Yen Kuo received the BS degree (1979) in electrical engineering from National Taiwan University, the MS degree (1982) in electrical and computer engineering from the University of California at Santa Barbara, and the PhD degree (1987) in computer science from the University of Illinois at Urbana-Champaign. Since 1991, he has been with National Taiwan University, where he is currently a professor and the chairman of the Department of Electrical Engineering. He spent his sabbatical year as a visiting researcher at AT&T Labs-Research, New Jersey, from 1999 to 2000. He was the chairman of the Department of Computer Science and Information Engineering, National Dong Hwa University, Taiwan, from 1995 to 1998, a faculty member in the Department of Electrical and Computer Engineering at the University of Arizona from 1988 to 1991, and an engineer at Fairchild Semiconductor and Silvar-Lisco, both in California, from 1982 to 1984. In 1989, he also worked as a summer faculty fellow at the Jet Propulsion Laboratory of the California Institute of Technology. His current research interests include software reliability engineering, mobile computing, dependable systems and networks, and optical WDM networks. He is an IEEE fellow and a member of the IEEE Computer Society. He has published more than 180 papers in journals and conferences.