Efficient algorithms for reliability analysis of distributed computing systems
Min-Sheng Lin a,*, Ming-Sang Chang b, Deng-Jyi Chen b
a Department of Information Management, Tamsui Oxford University College, 32, Chen-Li Rd., Tamsui, Taipei, 25103, Taiwan, ROC
b Institute of Computer Science and Information Engineering, National Chiao-Tung University, Hsin Chu, 30050, Taiwan, ROC
Received 12 March 1998; received in revised form 23 October 1998; accepted 1 January 1999
Abstract
A distributed computing system is modeled as a collection of resources (e.g. processing elements, data files and programs) interconnected via an arbitrary communication network and controlled by a distributed operating system. The distributed program reliability in a distributed computing system is the probability of successful execution of a program that runs on multiple processing elements and needs to retrieve data files from other processing elements. This reliability varies according to (1) the topology of the distributed computing system, (2) the reliability of the communication edges, (3) the distribution of data files and programs among processing elements and (4) the data files required to execute a program. In addition, computing the reliability of distributed computing systems is #P-complete even when the distributed computing system is restricted to a series-parallel, a 2-tree, a tree, or a star structure. This paper presents efficient algorithms for computing the reliability of a distributed program running on other restricted classes of networks. © 1999 Elsevier Science Inc. All rights reserved.
Keywords: Distributed computing systems; Distributed program reliability; Computational complexity; Algorithms
www.elsevier.com/locate/ins
*Corresponding author. E-mail: [email protected]
0020-0255/99/$ - see front matter © 1999 Elsevier Science Inc. All rights reserved. PII: S0020-0255(99)00003-1
1. Introduction
A typical distributed computing system (DCS) consists of processing elements (nodes), communication links (links), memory units, data files, and programs [1,2]. These resources are interconnected via a communication network that dictates how information flows between nodes. Programs residing on some nodes can run using data files at other nodes.
A previous investigation [3] introduced distributed program reliability (DPR) to evaluate the reliability of DCSs. Consider a DCS in which the nodes are perfectly reliable but the links can fail, s-independently of each other, with known probabilities. Successfully executing a distributed program depends on the node containing the program, the other nodes that hold the required data files, and the links between them being operational. DPR is thus defined as the probability that a program with distributed files can run successfully despite some faults in the links. For example, consider the DCS in Fig. 1, which consists of four nodes (processing elements) and five edges (communication links). This figure also includes the available files at each processing element. Assume that program $f_1$ requires data files $f_2$, $f_3$, and $f_4$ to complete its execution, and that it is running at node $v_1$, which holds data files $f_2$ and $f_3$. Hence, it must access data file $f_4$, which is stored in both nodes $v_2$ and $v_4$. Therefore, the DPR of the DCS in Fig. 1 can be formulated as: DPR = Prob[($v_1$ and $v_2$ are connected) or ($v_1$ and $v_4$ are connected)].
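The formulation above can be checked by exhaustive enumeration on small networks. The following Python sketch sums the probabilities of all edge states in which the program node reaches at least one node holding the missing file. The edge list, the function name `dpr_bruteforce` and the uniform edge reliability of 0.9 are assumptions made for illustration only; they are not taken from Fig. 1.

```python
from itertools import product

def dpr_bruteforce(n_nodes, edges, p, source, targets):
    """Exact DPR by enumerating all 2^|E| edge states (exponential time).

    edges   : list of (u, v) node pairs, nodes numbered 0..n_nodes-1
    p       : edge operating probabilities, in the same order as edges
    source  : node running the program
    targets : nodes holding the file the program still needs
    """
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        prob = 1.0
        adj = {v: [] for v in range(n_nodes)}
        for (u, v), up, pe in zip(edges, state, p):
            prob *= pe if up else 1.0 - pe
            if up:
                adj[u].append(v)
                adj[v].append(u)
        # Depth-first search from the program node over working edges.
        seen, stack = {source}, [source]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if seen & set(targets):
            total += prob
    return total

# Hypothetical topology (an assumption, not Fig. 1): v1..v4 are nodes 0..3.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
dpr = dpr_bruteforce(4, EDGES, [0.9] * 5, source=0, targets=[1, 3])
```

The running time doubles with every added edge, which is why the remainder of the paper looks for restricted classes that admit polynomial-time algorithms.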
Although several algorithms have been proposed for evaluating DPR [4,5], none satisfy our desire for more efficient algorithms. We hypothesize that either the approaches examined are ineffective, or that no efficient algorithms exist for our reliability problems. Lin and Chen [6] demonstrated, for the first time, that computing DPR is #P-hard even when the distributed computing system is restricted to a series-parallel, a 2-tree, a tree, or a star structure. The class of #P-complete problems was introduced by Valiant [7]. The class #P contains those problems that involve counting the accepting computations for problems in NP; the class of #P-complete problems contains the hardest problems in #P. As widely recognized, all known exact algorithms for these problems have exponential time complexity, thereby making it unlikely that efficient (polynomial-time) algorithms can be developed for this class of problems. This complexity can be averted by considering only a restricted class of DCS's. In light of the above discussion, this paper presents a polynomially solvable case of the DPR problem for star topologies in which data files are restricted to a certain type of distribution. A linear-time algorithm is also proposed to verify whether or not a star DCS has this restricted class of file distribution. Also proposed herein are two polynomial-time algorithms for computing the DPR of a DCS with a linear and a circular structure, respectively.
2. Assumptions, definitions and notation
Assumptions
· The nodes are perfect.
· The edges are s-independent and either function or fail with known probabilities.
Definitions
· A star DCS $D_s$ has the consecutive file distribution property if and only if its nodes can be linearly ordered such that, for each distinct file $f_d$, the nodes containing file $f_d$ occur consecutively. More formally, a star DCS $D_s$ has the consecutive file distribution property if and only if there exists a permutation $\Pi = [\pi(1), \pi(2), \ldots, \pi(n)]$ of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$.
· A set $C$ of edges of $D_s$ is referred to as a file cut set if and only if the failure of all edges in $C$ implies system failure.
· A file cut set $C$ is referred to as minimal if there is no other file cut set $C'$ such that $C' \subset C$.
· A set $I$ of edges of a linear DCS $D_l$ is referred to as a file path set if and only if the functioning of all edges in $I$ implies that the system functions.
· A file path set $I$ is referred to as minimal if there is no other file path set $I'$ such that $I' \subset I$.
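The consecutive file distribution property can be tested directly for a given ordering of the nodes. The Python sketch below checks, for every file, that the positions of the nodes holding it form one unbroken run. The function name and the three-node toy distribution are assumptions chosen for illustration.

```python
def has_consecutive_property(perm, holders):
    """Check the consecutive file distribution property for one ordering.

    perm    : list of node indices, read as [pi(1), ..., pi(n)]
    holders : dict mapping each file to the set of nodes that hold it
    """
    pos = {node: k for k, node in enumerate(perm)}
    for nodes in holders.values():
        ks = [pos[v] for v in nodes]
        # The holders are consecutive exactly when the span of their
        # positions equals the number of holders.
        if ks and max(ks) - min(ks) + 1 != len(ks):
            return False
    return True

# Toy distribution (assumed): f1 is held by nodes 1 and 2, f2 by nodes 2 and 3.
TOY = {"f1": {1, 2}, "f2": {2, 3}}
ok_order = has_consecutive_property([1, 2, 3], TOY)   # node 2 sits between 1 and 3
bad_order = has_consecutive_property([2, 1, 3], TOY)  # f2 holders are split by node 1
```

Checking one ordering is easy; the harder question of finding a valid ordering, or proving that none exists, is the subject of Section 3.2.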
Notation (general)
$D$  a distributed computing system (DCS)
$e_i$  edge $i$ in $D$
$v_i$  node $i$ in $D$
$f_i$  data file $i$
$m$  number of distinct files in $D$
$t$  total number of files in $D$
$A_i$  the set of files available at node $v_i$
$p_i$  probability that edge $e_i$ functions
$q_i$  probability that edge $e_i$ fails; $q_i \equiv 1 - p_i$
$\bar{E}$  complement of event $E$
for star topology
$D_s$  a star DCS with $n + 1$ nodes $\{s, v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{e_1 = (s, v_1), e_2 = (s, v_2), \ldots, e_n = (s, v_n)\}$
$\Pi$  $\equiv [\pi(1), \pi(2), \ldots, \pi(n)]$, a permutation of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$
$C_d$  the file cut set for file $f_d$, consisting of all edges $(s, v_i)$ such that node $v_i$ contains file $f_d$, i.e. $C_d = \{(s, v_i) \mid f_d \in A_i\}$. (Without loss of generality, we reorder the minimal file cut sets, if necessary, by their minimal component, i.e. for two distinct minimal file cut sets $C_i$ and $C_j$, $i < j$ if and only if $\min\{k \mid (s, v_{\pi(k)}) \in C_i\} < \min\{k \mid (s, v_{\pi(k)}) \in C_j\}$.)
$U$  ordered set of all minimal file cut sets according to their minimal components
$r$  number of minimal file cut sets in $U$
$a_i$  $\equiv \min\{k \mid e_{\pi(k)} \in C_i\}$, i.e. the index of the minimal component in $C_i$
$b_i$  $\equiv \max\{k \mid e_{\pi(k)} \in C_i\}$, i.e. the index of the maximal component in $C_i$
$H(i, j)$  $\equiv \{e_{\pi(i)}, e_{\pi(i+1)}, \ldots, e_{\pi(j)}\}$; $1 \le i \le j \le n$ (note that $C_i \equiv H(a_i, b_i)$)
$X(i, j)$  event: all edges in $H(i, j)$ fail
$W_i$  $\equiv \bigcup_{j=1}^{i} X(a_j, b_j)$ (note that the DPR of $D_s$ can be expressed as $1 - \Pr(W_r)$)
$F_i$  event: the star DCS $D'_s$ fails, where $D'_s$ consists of the $i + 1$ nodes $s, v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(i)}$ and the $i$ edges $e_{\pi(1)}, e_{\pi(2)}, \ldots, e_{\pi(i)}$
for linear topology
$D_l$  a linear DCS with $n + 1$ nodes $\{v_0, v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{e_1 = (v_0, v_1), e_2 = (v_1, v_2), \ldots, e_n = (v_{n-1}, v_n)\}$
$I_i$  the minimal file path set which starts at edge $e_i$
$b_i$  $\equiv \max\{k \mid e_k \in I_i\}$, i.e. the index of the maximal component in $I_i$
$Y_i$  event: all edges in $I_i$ function
$U_i$  $\equiv \bigcup_{j=1}^{i} Y_j$ (notably, the DPR of $D_l$ can be expressed as $\Pr(U_n)$)
$R_j$  event: there exists an operating event $Y_i$ between edges $e_1$ and $e_j$
for ring topology
$D_r$  a ring DCS with $n$ nodes $\{v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{e_1 = (v_1, v_2), e_2 = (v_2, v_3), \ldots, e_{n-1} = (v_{n-1}, v_n), e_n = (v_n, v_1)\}$
$D_r * e_i$  the DCS $D_r$ with edge $e_i = (v_i, v_{i+1})$ contracted so that nodes $v_i$ and $v_{i+1}$ are merged into a single node. This newly merged node contains all data files that were previously in nodes $v_i$ and $v_{i+1}$
$D_r - e_i$  the DCS $D_r$ with edge $e_i$ deleted
3. Efficient algorithms for computing DPR of DCS's
According to a previous investigation [6], computing DPR over a star DCS is #P-complete, implying that a polynomial-time algorithm for it is unlikely to exist. However, efficient algorithms may exist for computing DPR over some restricted classes.
3.1. Star DCS's with a consecutive file distribution
In this section, we present a polynomial-time algorithm for computing the DPR of a star DCS with a consecutive file distribution. Let $D_s$ be a star DCS that has the consecutive file distribution property. Then, the minimal file cut sets can be ordered by their minimal component, i.e. for two distinct minimal file cut sets $C_i$ and $C_j$, $i < j$ if and only if $\min\{k \mid (s, v_{\pi(k)}) \in C_i\} < \min\{k \mid (s, v_{\pi(k)}) \in C_j\}$. By definition, $D_s$ fails if and only if at least one event $X(a_i, b_i)$, $1 \le i \le r$, occurs, where $a_i$ and $b_i$ are the indexes of the minimal and maximal components in $C_i$, respectively. Clearly, if $r = 1$, the unreliability of $D_s$ can be easily obtained as $\Pr[W_1] = \Pr[X(a_1, b_1)]$. Next consider the case with $r \ge 2$. The unreliability of $D_s$ with the first $i$ file cut sets is
$$\Pr(W_i) = \Pr[W_{i-1} \cup X(a_i, b_i)].$$
This expression can be decomposed using conditional probability as
$$\Pr(W_i) = \Pr(W_{i-1}) + \Pr[\bar{W}_{i-1} \cap X(a_i, b_i)]. \tag{1}$$
Consider the event $\bar{W}_{i-1} \cap X(a_i, b_i)$, which implies
· $E_1$: For each $k$, $1 \le k \le i - 1$, at least one edge $e \in H(a_k, b_k) = C_k$ functions
and
· $E_2$: All edges $\in H(a_i, b_i) = C_i$ fail.
By event $E_2$, event $E_1$ can be rewritten as
· $E'_1$: For each $k$, $1 \le k \le i - 1$, at least one edge $e \in \{H(a_k, b_k) - H(a_i, b_i)\}$ functions.
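A fundamental difficulty in calculating $\Pr(E'_1)$ is that the events in $E'_1$ are not, in general, disjoint. However, we can define events $S_j$ that are disjoint by
$$S_j = \{E'_1 \text{ occurs and edge } e_{\pi(j)} \text{ is the last good one}\}, \quad \text{for } a_{i-1} \le j \le a_i - 1.$$
Thus,
$$E'_1 \cap E_2 = \bigcup_{j=a_{i-1}}^{a_i - 1} (S_j \cap E_2)$$
and
$$\Pr[\bar{W}_{i-1} \cap X(a_i, b_i)] = \Pr\left[\bigcup_{j=a_{i-1}}^{a_i - 1} (S_j \cap E_2)\right]. \tag{2}$$
Since the $S_j$ are disjoint events, we have
$$\Pr\left[\bigcup_{j=a_{i-1}}^{a_i - 1} (S_j \cap E_2)\right] = \sum_{j=a_{i-1}}^{a_i - 1} \Pr(S_j \cap E_2). \tag{3}$$
The event $S_j \cap E_2$, $a_{i-1} \le j \le a_i - 1$, can be decomposed into three independent events: {no file cut set fails between edges $e_{\pi(1)}$ and $e_{\pi(j-1)}$}, {edge $e_{\pi(j)}$ functions}, and {all edges between $e_{\pi(j+1)}$ and $e_{\pi(b_i)}$ fail}. So
$$\Pr(S_j \cap E_2) = [1 - \Pr(F_{j-1})] \, p_{\pi(j)} \, \Pr[X(j + 1, b_i)]. \tag{4}$$
Therefore, according to Eqs. (1)-(4), we have
$$\Pr(W_i) = \Pr(W_{i-1}) + \sum_{j=a_{i-1}}^{a_i - 1} [1 - \Pr(F_{j-1})] \, p_{\pi(j)} \, \Pr[X(j + 1, b_i)].$$
The following theorem can now be easily established.
Theorem 1. For $2 \le i \le r$:
$$\Pr(W_i) = \Pr(W_{i-1}) + \sum_{j=a_{i-1}}^{a_i - 1} [1 - \Pr(F_{j-1})] \, p_{\pi(j)} \, \Pr[X(j + 1, b_i)], \tag{5}$$
with the boundary conditions: $\Pr(W_1) = \Pr[X(a_1, b_1)]$, and $\Pr(F_k) = 0$ for $0 \le k < b_1$.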
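Before applying Theorem 1, we first compute the values of $\Pr[X(j + 1, b_i)]$ and $\Pr(F_{j-1})$ for $2 \le i \le r$ and $a_{i-1} \le j \le a_i - 1$. By noting that $a_g < a_h$ whenever $g < h$, we have
$$\Pr[X(j + 1, b_i)] = \begin{cases} \dfrac{1}{q_{\pi(a_{i-1})}} \Pr[X(a_{i-1}, b_{i-1})] \displaystyle\prod_{k=b_{i-1}+1}^{b_i} q_{\pi(k)} & \text{for } j = a_{i-1}, \\[2ex] \dfrac{1}{q_{\pi(j)}} \Pr[X(j, b_i)] & \text{for } a_{i-1} < j \le a_i - 1. \end{cases} \tag{6}$$
By starting with $\Pr[X(a_1, b_1)] = \prod_{k=a_1}^{b_1} q_{\pi(k)}$, we successively determine
$\Pr[X(a_1 + 1, b_2)], \Pr[X(a_1 + 2, b_2)], \ldots, \Pr[X(a_2, b_2)]$;
$\Pr[X(a_2 + 1, b_3)], \Pr[X(a_2 + 2, b_3)], \ldots, \Pr[X(a_3, b_3)]$;
$\ldots$
$\Pr[X(a_{r-1} + 1, b_r)], \Pr[X(a_{r-1} + 2, b_r)], \ldots$, and $\Pr[X(a_r, b_r)]$.
To obtain the values of $\Pr(F_{j-1})$ in Theorem 1, by definition, we have
$$\Pr(F_k) = \begin{cases} 0 & \text{for } k \le b_1 - 1, \\ \Pr(W_{i-1}) & \text{for } b_{i-1} \le k \le b_i - 1. \end{cases} \tag{7}$$
Hence, while computing $\Pr(W_i)$ by Theorem 1, we can also obtain $\Pr(F_k)$, for $b_{i-1} \le k \le b_i - 1$.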
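The incremental scheme of Eq. (6) can be sketched in Python as follows. The sketch assumes that `q[k]` already stores $q_{\pi(k)}$ (index 0 unused), that every $q_{\pi(k)}$ is strictly positive so the divisions are defined, and that `intervals` lists the pairs $(a_i, b_i)$ of the ordered minimal file cut sets. The function name and the numeric values are illustrative assumptions.

```python
import math

def x_table(q, intervals):
    """Tabulate Pr[X(j+1, b_i)] with the incremental rule of Eq. (6).

    q         : q[k] = failure probability of the edge at position k (q[0] unused)
    intervals : [(a_1, b_1), ..., (a_r, b_r)] sorted by a_i
    Returns a dict keyed by (first position, last position).
    """
    a1, b1 = intervals[0]
    prev = math.prod(q[a1:b1 + 1])            # Pr[X(a_1, b_1)]
    table = {(a1, b1): prev}
    for (a_prev, b_prev), (a_i, b_i) in zip(intervals, intervals[1:]):
        # Case j = a_{i-1}: drop position a_{i-1}, extend the right end to b_i.
        cur = prev / q[a_prev] * math.prod(q[b_prev + 1:b_i + 1])
        table[(a_prev + 1, b_i)] = cur
        # Case a_{i-1} < j <= a_i - 1: drop one more position on the left.
        for j in range(a_prev + 1, a_i):
            cur = cur / q[j]
            table[(j + 1, b_i)] = cur
        prev = cur                            # this is now Pr[X(a_i, b_i)]
    return table

# Illustrative values (assumed): seven positions, three minimal cut intervals.
Q = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
INTERVALS = [(1, 2), (4, 5), (5, 7)]
TABLE = x_table(Q, INTERVALS)
```

Each table entry costs a constant number of operations apart from the products that extend the right end, which is what makes the total cost linear in $n + r$. A direct product over positions $j+1, \ldots, b_i$ gives the same values and avoids the division when some $q$ may be zero.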
Next, the main steps for computing the DPR of star DCS's are outlined. We are given a star DCS $D_s$ and the file distribution $A_i$ for each node. Assuming that $D_s$ has the consecutive file distribution property, let $\Pi$ be a permutation of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$. All file cut sets can be easily enumerated from the $A_i$ in the following manner: if node $v_i$ contains file $f_d$, then file cut set $C_d$ contains edge $e_i$. Subsequently, the values $a_i$ and $b_i$ of $C_i$ can be determined from the permutation $\Pi$ such that $a_i = \min\{k \mid e_{\pi(k)} \in C_i\}$ and $b_i = \max\{k \mid e_{\pi(k)} \in C_i\}$. Then, remove the file cut sets which are not minimal and rearrange the remaining minimal file cut sets according to their $a_i$ values. Finally, use Theorem 1 and Eqs. (6) and (7) to compute the DPR ($= 1 - \Pr[W_r]$). The algorithm is formally described as follows.
Algorithm Reliability_Star_DCS
Input: A star DCS $D_s$ with $n + 1$ nodes $\{s, v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{(s, v_1), (s, v_2), \ldots, (s, v_n)\}$.
A permutation $\Pi = [\pi(1), \pi(2), \ldots, \pi(n)]$ of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$, where $A_i$ represents the set of files available at node $v_i$.
begin
Step 1: // find all file cut sets //
for $i \leftarrow 1$ to $m$ do $C_i \leftarrow \emptyset$; // initialization step; $m$ is the number of distinct files //
for $i \leftarrow 1$ to $n$ do
  for each $f_d \in A_i$ do $C_d \leftarrow C_d \cup \{e_i\}$; // For convenience, let $e_i$ denote edge $(s, v_i)$ //
Step 2: // set the values of $a_i$ and $b_i$ for $1 \le i \le m$ //
for $i \leftarrow 1$ to $m$ do
begin
  $a_i \leftarrow \min\{k \mid e_{\pi(k)} \in C_i\}$;
  $b_i \leftarrow \max\{k \mid e_{\pi(k)} \in C_i\}$;
end
Step 3: // find all minimal file cut sets //
$U \leftarrow \emptyset$;
for $i \leftarrow 1$ to $m$ do $U \leftarrow U \cup \{C_i\}$;
for $1 \le i, j \le m$, $i \ne j$ do
  if ($a_i \ge a_j$ and $b_i \le b_j$) then remove $C_j$ from $U$; // which implies $C_i \subseteq C_j$ //
Step 4: reorder the minimal file cut sets in $U$ such that for two distinct minimal file cut sets $C_i$ and $C_j$, $i < j$ if and only if $a_i < a_j$;
Step 5: // compute $\Pr[X(j + 1, b_i)]$, for $2 \le i \le r$ and $a_{i-1} \le j \le a_i - 1$, by Eq. (6) //
$\Pr[X(a_1, b_1)] \leftarrow \prod_{k=a_1}^{b_1} q_{\pi(k)}$;
for $i \leftarrow 2$ to $r$ do // $r$ is the number of minimal file cut sets in $U$ //
begin
  $\Pr[X(a_{i-1} + 1, b_i)] \leftarrow (1 / q_{\pi(a_{i-1})}) \Pr[X(a_{i-1}, b_{i-1})] \prod_{k=b_{i-1}+1}^{b_i} q_{\pi(k)}$;
  for $j \leftarrow a_{i-1} + 1$ to $a_i - 1$ do $\Pr[X(j + 1, b_i)] \leftarrow (1 / q_{\pi(j)}) \Pr[X(j, b_i)]$;
end
Step 6: // apply Theorem 1 and Eq. (7) to compute $\Pr(W_i)$ and $\Pr(F_j)$ //
$\Pr(W_1) \leftarrow \Pr[X(a_1, b_1)]$; // boundary condition //
for $k \leftarrow 0$ to $b_1 - 1$ do $\Pr(F_k) \leftarrow 0$; // boundary condition //
for $i \leftarrow 2$ to $r$ do
begin
  for $k \leftarrow b_{i-1}$ to $b_i - 1$ do $\Pr(F_k) \leftarrow \Pr(W_{i-1})$;
  $\Pr(W_i) \leftarrow \Pr(W_{i-1}) + \sum_{j=a_{i-1}}^{a_i - 1} [1 - \Pr(F_{j-1})] \, p_{\pi(j)} \, \Pr[X(j + 1, b_i)]$;
end
Step 7: DPR $\leftarrow 1 - \Pr(W_r)$; Output(DPR);
end Reliability_Star_DCS
Complexity analysis
The time complexity of Algorithm Reliability_Star_DCS is analyzed as follows. Step 1 takes $O(m) + \sum_{i=1}^{n} O(|A_{\pi(i)}|) = O(m + t) = O(t)$ time (since $m \le t$) to identify all file cut sets, where $t$ denotes the total number of files in $D_s$. Step 2 requires $O(2 \sum_{i=1}^{m} |C_i|) = O(t)$ time to set $a_i$ and $b_i$, $1 \le i \le m$, and step 3 takes $O(m^2)$ time to obtain all minimal file cut sets. Step 4 requires the reordering of all minimal file cut sets in nondecreasing order of the index of their minimal component. This ordering can be executed in $O(r \log r)$ time using an efficient sorting algorithm, where $r$ denotes the number of minimal file cut sets. In step 5, evaluating $\Pr[X(j + 1, b_i)]$ by making use of Eq. (6) requires
$$\begin{cases} O\left(\sum_{i=2}^{r} [(b_i - b_{i-1}) + 2]\right) = O(b_r - b_1 + r) = O(n + r) & \text{for } j = a_{i-1}, \\[1ex] O\left(\sum_{i=2}^{r} 1\right) = O(r - 1) = O(r) & \text{for } a_{i-1} < j \le a_i - 1. \end{cases}$$
Hence, the total time to evaluate all $\Pr[X(j + 1, b_i)]$ is $O(n + r)$. In step 6, computing all $\Pr(F_k)$ takes $O(\sum_{i=2}^{r} (b_i - b_{i-1})) = O(b_r - b_1) = O(n)$ time and computing all $\Pr(W_i)$ takes $O(\sum_{i=2}^{r} [1 + 3 (a_i - a_{i-1})]) = O(3 (a_r - a_1)) = O(n)$ time. Therefore, the total time in step 6 is $O(n)$. Clearly, step 7 runs in constant time. Finally, the entire algorithm has time complexity $O[t + t + m^2 + r \log r + (n + r) + n]$. Since $t \le m \cdot n$ and $r \le m$, the complexity of Algorithm Reliability_Star_DCS can be obtained as $O(m^2 + m \cdot n)$.
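Steps 1 to 4 of Algorithm Reliability_Star_DCS can be sketched in Python as follows. Each file cut set is represented only by its pair $(a_i, b_i)$, which suffices under the consecutive file distribution property. The function name is an assumption, and the sample distribution is reconstructed from the node sets listed in the illustrative example of Section 3.2 rather than read from Fig. 2 itself.

```python
def minimal_cut_intervals(avail, perm):
    """Return the ordered (a_i, b_i) pairs of the minimal file cut sets.

    avail : dict node -> set of files stored at that node (nodes 1..n)
    perm  : list [pi(1), ..., pi(n)] with the consecutive property
    """
    pos = {node: k for k, node in enumerate(perm, start=1)}
    # Step 1: the cut set of file f_d is the set of edges to its holders.
    holders = {}
    for node, files in avail.items():
        for f in files:
            holders.setdefault(f, set()).add(node)
    # Step 2: a_i and b_i are the extreme positions of each cut set.
    intervals = set()
    for nodes in holders.values():
        ks = sorted(pos[v] for v in nodes)
        if ks[-1] - ks[0] + 1 != len(ks):
            raise ValueError("perm does not give a consecutive file distribution")
        intervals.add((ks[0], ks[-1]))
    # Step 3: drop every interval that contains another, different interval.
    minimal = [
        (a, b) for (a, b) in intervals
        if not any((c, d) != (a, b) and a <= c and d <= b for (c, d) in intervals)
    ]
    # Step 4: order by the minimal component a_i.
    return sorted(minimal)

# File distribution consistent with the example of Section 3.2 (files as integers).
AVAIL = {1: {2, 3}, 2: {1, 3, 5}, 3: {4}, 4: {5}, 5: {1, 2, 3, 5}, 6: {4}, 7: {2}}
PERM = [3, 6, 4, 2, 5, 1, 7]
CUTS = minimal_cut_intervals(AVAIL, PERM)
```

Storing the intervals in a set also merges files with identical holder sets, a case the pairwise test of step 3 would otherwise need to treat separately.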
An illustrative example
To illustrate Algorithm Reliability_Star_DCS, consider the star DCS in Fig. 2, which has the consecutive file distribution property with the associated permutation $\Pi = [3, 6, 4, 2, 5, 1, 7]$. (Section 3.2 shows how to identify the associated permutation when the star DCS has the consecutive file distribution property.) The overall procedure is as follows:
Step 1: The file cut sets are found to be
$C_1 = \{e_2, e_5\}$, $C_2 = \{e_1, e_5, e_7\}$, $C_3 = \{e_1, e_2, e_5\}$, $C_4 = \{e_3, e_6\}$, $C_5 = \{e_2, e_4, e_5\}$.
Step 2: According to the permutation $\pi(1) = 3$, $\pi(2) = 6$, $\pi(3) = 4$, $\pi(4) = 2$, $\pi(5) = 5$, $\pi(6) = 1$, $\pi(7) = 7$ and the results of Step 1, we have
$a_1 = 4$, $b_1 = 5$; $a_2 = 5$, $b_2 = 7$; $a_3 = 4$, $b_3 = 6$; $a_4 = 1$, $b_4 = 2$; $a_5 = 3$, $b_5 = 5$.
Step 3: Since $C_1 \subset C_3$ and $C_1 \subset C_5$, remove $C_3$ and $C_5$. Thus, the set of minimal file cut sets is $U = \{C_1, C_2, C_4\}$.
Step 4: Reorder the minimal file cut sets in such a manner that for $C_i$ and $C_j$, $i < j$ if and only if $a_i < a_j$, and we obtain
$C_1 = \{e_3, e_6\}$, $a_1 = 1$, $b_1 = 2$;
$C_2 = \{e_2, e_5\}$, $a_2 = 4$, $b_2 = 5$;
$C_3 = \{e_1, e_5, e_7\}$, $a_3 = 5$, $b_3 = 7$.
Step 5: By using Eq. (6), we have
$\Pr[X(1, 2)] = q_3 q_6$, $\Pr[X(2, 5)] = q_6 q_4 q_2 q_5$, $\Pr[X(3, 5)] = q_4 q_2 q_5$, $\Pr[X(4, 5)] = q_2 q_5$, and $\Pr[X(5, 7)] = q_5 q_1 q_7$.
Step 6: We use Theorem 1 and Eq. (7) to compute $\Pr(W_i)$ and $\Pr(F_k)$ for $2 \le i \le 3$ and $b_{i-1} \le k \le b_i - 1$, and obtain
$\Pr(W_1) = q_3 q_6$; $\Pr(F_0) = \Pr(F_1) = 0$ (boundary condition).
$i = 2$: $\Pr(F_2) = \Pr(F_3) = \Pr(F_4) = \Pr(W_1) = q_3 q_6$,
$\Pr(W_2) = \Pr(W_1) + [1 - \Pr(F_0)] \cdot p_3 \cdot \Pr[X(2, 5)]$ ($j = 1$) $+ [1 - \Pr(F_1)] \cdot p_6 \cdot \Pr[X(3, 5)]$ ($j = 2$) $+ [1 - \Pr(F_2)] \cdot p_4 \cdot \Pr[X(4, 5)]$ ($j = 3$)
$= q_3 q_6 + p_3 q_6 q_4 q_2 q_5 + p_6 q_4 q_2 q_5 + (1 - q_3 q_6) \cdot p_4 q_2 q_5$.
$i = 3$: $\Pr(F_5) = \Pr(W_2)$,
$\Pr(W_3) = \Pr(W_2) + [1 - \Pr(F_3)] \cdot p_2 \cdot \Pr[X(5, 7)]$ ($j = 4$)
$= q_3 q_6 + p_3 q_6 q_4 q_2 q_5 + p_6 q_4 q_2 q_5 + (1 - q_3 q_6) \cdot p_4 q_2 q_5 + (1 - q_3 q_6) \cdot p_2 q_5 q_1 q_7$.
Step 7: Therefore, DPR is
$$\text{DPR} = 1 - \Pr(W_3) = 1 - \{q_3 q_6 + p_3 q_6 q_4 q_2 q_5 + p_6 q_4 q_2 q_5 + (1 - q_3 q_6) p_4 q_2 q_5 + (1 - q_3 q_6) p_2 q_5 q_1 q_7\}.$$
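The recurrence of Theorem 1 can be sketched in Python and compared with the closed form obtained in the example. The sketch takes the ordered pairs $(a_i, b_i)$ as input and evaluates $\Pr[X(i, j)]$ as a direct product, a simplification that avoids the divisions of Eq. (6) at the cost of extra multiplications. It requires Python 3.8 or later for `math.prod`. The edge failure probabilities in `Q_EDGE` and all function names are assumptions for illustration.

```python
import math
from itertools import product

def star_unreliability(q, intervals):
    """Pr(W_r) by Theorem 1; q[k] = q_{pi(k)} (q[0] unused)."""
    def pr_x(i, j):                      # Pr[X(i, j)]: positions i..j all fail
        return math.prod(q[i:j + 1])

    r = len(intervals)
    w = [0.0] * (r + 1)                  # w[i] = Pr(W_i), with w[0] = 0
    w[1] = pr_x(*intervals[0])

    def pr_f(k):
        # Eq. (7): Pr(F_k) = Pr(W_c), c = number of cut sets with b <= k.
        return w[sum(1 for (_, b) in intervals if b <= k)]

    for i in range(2, r + 1):
        a_prev = intervals[i - 2][0]
        a_i, b_i = intervals[i - 1]
        w[i] = w[i - 1] + sum(
            (1.0 - pr_f(j - 1)) * (1.0 - q[j]) * pr_x(j + 1, b_i)
            for j in range(a_prev, a_i)
        )
    return w[r]

def brute_force_unreliability(q, intervals):
    """Reference value by enumerating all edge states (exponential)."""
    n = len(q) - 1
    total = 0.0
    for failed in product([True, False], repeat=n):
        prob = math.prod(q[k] if failed[k - 1] else 1.0 - q[k] for k in range(1, n + 1))
        if any(all(failed[k - 1] for k in range(a, b + 1)) for a, b in intervals):
            total += prob
    return total

# Example data: permutation and intervals from Steps 2-4; q values are assumed.
PERM = [3, 6, 4, 2, 5, 1, 7]
Q_EDGE = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.15, 5: 0.25, 6: 0.35, 7: 0.05}
Q = [0.0] + [Q_EDGE[e] for e in PERM]
INTERVALS = [(1, 2), (4, 5), (5, 7)]
DPR = 1.0 - star_unreliability(Q, INTERVALS)
```

The lookup in `pr_f` scans all intervals on each call; the paper's Step 6 fills the $\Pr(F_k)$ table once instead, which is what keeps that step linear in $n$.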
3.2. A linear-time algorithm for testing the consecutive file distribution property in a star DCS
The previous section has presented a polynomial-time algorithm for computing the DPR of a star DCS when it has the consecutive file distribution property. In this section, we determine whether or not a star DCS has the consecutive file distribution property. The problem statement is:
Input: A star DCS $D_s$ with $n + 1$ nodes $s, v_1, v_2, \ldots, v_n$ and file distributions $A_i$, $1 \le i \le n$.
Output: A permutation $\Pi = [\pi(1), \pi(2), \ldots, \pi(n)]$ of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$.
Notably, a solution does not always exist. To facilitate the search for a correct ordering $\Pi$, we use the PQ-tree data structure proposed by Booth and Lueker [8]. A PQ-tree is a rooted tree that has nodes of two varieties: P-nodes and Q-nodes. A P-node is a node whose children can be arbitrarily permuted. A Q-node is a node whose children are ordered or reverse ordered. The frontier of a PQ-tree is the permutation of leaves from left to right. Two PQ-trees are equivalent if and only if one can be transformed into the other by applying a sequence of the following transformation rules.
· arbitrarily permute the children of a P-node,
· reverse the children of a Q-node.
By using the PQ-tree data structure, we have the following algorithm.
Algorithm Check_Consecutive_File_Distribution
Input: A star DCS $D_s$ with $n + 1$ nodes $s, v_1, v_2, \ldots, v_n$, $n$ edges $e_1, e_2, \ldots, e_n$, where $e_i = (s, v_i)$ for $1 \le i \le n$, and file available sets $A_i = \{f_j \mid f_j$ is stored in node $v_i\}$ for $1 \le i \le n$.
Output: A permutation $\Pi = [\pi(1), \pi(2), \ldots, \pi(n)]$ of the numbers $\{1, 2, \ldots, n\}$ such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$.
begin
T $\leftarrow$ universal tree; // a single P-node connected to all the leaf nodes of $\{1, 2, \ldots, n\}$ //
for $j \leftarrow 1$ to $m$ do $A_j^{-1} \leftarrow \emptyset$; // $m$ denotes the number of distinct files in $D_s$; $A_j^{-1}$ is the set of indexes of nodes which contain the file $f_j$ //
for $i \leftarrow 1$ to $n$ do
  for each $f_j \in A_i$ do $A_j^{-1} \leftarrow A_j^{-1} \cup \{i\}$;
for $j \leftarrow 1$ to $m$ do T $\leftarrow$ REDUCE(T, $A_j^{-1}$);
if T is a null tree
then
  print out "$D_s$ has no consecutive file distribution property";
else
  print out the frontier of T;
end Check_Consecutive_File_Distribution
The routine REDUCE attempts to apply a set of eleven templates. Each template consists of a pattern to be matched against the current PQ-tree and the set $A_j^{-1}$, and a replacement to be substituted for the pattern. The templates are applied from the bottom to the top of the tree. Notably, the null tree may be returned when no template applies. For brevity, the details are omitted herein; they can be found in Booth and Lueker [8].
Complexity analysis
The sets $A_j^{-1}$, $1 \le j \le m$, can be obtained in $O(m + \sum_{i=1}^{n} |A_i|)$ steps. According to [8], the loop of the REDUCE routine can be computed in $O(m + n + \sum_{j=1}^{m} |A_j^{-1}|)$ steps. Furthermore, it is easy to verify that $\sum_{i=1}^{n} |A_i| = \sum_{j=1}^{m} |A_j^{-1}| = t$ (the total number of files in $D_s$). Therefore, the time complexity of the above algorithm is $O(m + t) + O(m + n + t) = O(m + n + t)$.
An illustrative example
Consider the star DCS $D_s$ shown in Fig. 2. Applying the above algorithm leads to
$A_1^{-1} = \{2, 5\}$, $A_2^{-1} = \{1, 5, 7\}$, $A_3^{-1} = \{1, 2, 5\}$, $A_4^{-1} = \{3, 6\}$, $A_5^{-1} = \{2, 4, 5\}$.
Fig. 3 displays the reduction steps. In an illustration of a PQ-tree, a P-node is drawn as a circle and a Q-node as a rectangle. From this figure, we can conclude that the star DCS $D_s$ of Fig. 2 has the consecutive file distribution property and that one of the associated permutations is $\Pi = [3, 6, 4, 2, 5, 1, 7]$.
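For small instances the result can be cross-checked without a PQ-tree. The Python sketch below is a brute-force stand-in, not the Booth and Lueker reduction: it tries every permutation and therefore takes factorial time, so it is useful only as a reference for tiny $n$. The function names are assumptions; the node sets are the $A_j^{-1}$ listed above.

```python
from itertools import permutations

def is_consecutive(perm, inverse_sets):
    """True if every set of node indexes occupies consecutive positions in perm."""
    pos = {node: k for k, node in enumerate(perm)}
    return all(
        max(pos[v] for v in s) - min(pos[v] for v in s) + 1 == len(s)
        for s in inverse_sets
    )

def find_consecutive_order(n, inverse_sets):
    """Exhaustive search over all n! orderings; returns one valid order or None."""
    for perm in permutations(range(1, n + 1)):
        if is_consecutive(perm, inverse_sets):
            return list(perm)
    return None

# The sets A_j^{-1} of the example (n = 7 nodes, m = 5 files).
INVERSE = [{2, 5}, {1, 5, 7}, {1, 2, 5}, {3, 6}, {2, 4, 5}]
FOUND = find_consecutive_order(7, INVERSE)

# A distribution with no valid ordering: three nodes, every pair shares a file.
NONE_FOUND = find_consecutive_order(3, [{1, 2}, {2, 3}, {1, 3}])
```

Several orderings are valid for the example, so the search may return one that differs from the permutation read off Fig. 3; both pass `is_consecutive`.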
3.3. Linear DCS's
In this section, we extend the results in Section 3.1 to compute the DPR of linear DCS's. Consider a linear DCS $D_l$ with $n + 1$ nodes $\{v_0, v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{e_1 = (v_0, v_1), e_2 = (v_1, v_2), \ldots, e_n = (v_{n-1}, v_n)\}$. Let $I_i$ be the minimal file path set which starts at edge $e_i$. Notably, a linear DCS has a consecutive property resembling that of a star DCS with the consecutive file distribution property: for each minimal file path set $I$, if $e_i \in I$ and $e_j \in I$ then $e_k \in I$ for all $k$, $i < k < j$.
The reliability of a linear DCS can be expressed as Prob{at least one minimal file path set $I$ whose edges all function}, and the unreliability of a star DCS with the consecutive file distribution property can be expressed as Prob{at least one minimal file cut set $C$ whose edges all fail}. Owing to this duality, a simple relationship exists between a linear DCS and a star DCS with the consecutive file distribution property. The relationship is summarized in Table 1.
According to the mirror image described in Table 1, if we let $W_i = U_i$, $a_i = \pi(i) = i$, $p_i = q_i$, $\Pr(F_i) = \Pr(R_i)$, and $X(i, b_i) = Y_i$ in Theorem 1, then the sum in Eq. (5) reduces to the single term $j = i - 1$ and the following theorem can be readily obtained to compute the reliability of a linear DCS $D_l$.
Theorem 2. For $2 \le i \le n$:
$$\Pr(U_i) = \Pr(U_{i-1}) + [1 - \Pr(R_{i-2})] \, q_{i-1} \, \Pr(Y_i)$$
with the boundary conditions $\Pr(U_1) = \Pr(Y_1)$ and $\Pr(R_j) = 0$ for $j < b_1$.
In addition, $\Pr(Y_i)$ and $\Pr(R_j)$ can be easily obtained from Eq. (6) as follows.
$$\Pr(Y_i) = \begin{cases} \dfrac{1}{p_{i-1}} \Pr(Y_{i-1}) \displaystyle\prod_{j=b_{i-1}+1}^{b_i} p_j & \text{for } b_i \le n, \\[2ex] 0 & \text{for } b_i = \infty, \end{cases} \tag{8}$$
with the boundary condition $\Pr(Y_1) = \prod_{j=1}^{b_1} p_j$, and
$$\Pr(R_j) = \begin{cases} 0 & \text{for } 0 \le j \le b_1 - 1, \\ \Pr(U_i) & \text{for } b_i \le j \le b_{i+1} - 1. \end{cases} \tag{9}$$
Next, the complete algorithm for computing the reliability of a linear DCS is presented as follows.
Algorithm Reliability_Linear_DCS
Input: A linear DCS $D_l$ with $n + 1$ nodes $\{v_0, v_1, v_2, \ldots, v_n\}$ and $n$ edges $\{e_1 = (v_0, v_1), e_2 = (v_1, v_2), \ldots, e_n = (v_{n-1}, v_n)\}$.
$A_i$: the set of files available at node $v_i$.
Output: the DPR of $D_l$
begin
Step 1: // find all $b_i$'s //
for $i \leftarrow 1$ to $m$ do $NF_i \leftarrow 0$; // $NF_i$ is the number of copies of file $f_i$ between $v_h$ and $v_k$ //
for each $f_i \in A_0$ do $NF_i \leftarrow 1$;
$h \leftarrow 0$; // $h$ and $k$ are two indexes moving among nodes //
for $k \leftarrow 1$ to $n$ do
begin
  for each file $f_i \in A_k$ do $NF_i \leftarrow NF_i + 1$; // update the count of file $f_i$ for node $v_k$ //
  MFPS $\leftarrow$ true; // MFPS = true if there is a minimal file path set between $v_h$ and $v_k$ //
  while MFPS do
  begin
    for $i \leftarrow 1$ to $m$ do if $NF_i = 0$ then MFPS $\leftarrow$ false; // check if there exists a minimal file path set //
    if MFPS then
    begin
      for each file $f_i \in A_h$ do $NF_i \leftarrow NF_i - 1$;
      $h \leftarrow h + 1$; $b_h \leftarrow k$;
    end
  end
end
for $i \leftarrow h + 1$ to $n$ do $b_i \leftarrow \infty$;
Step 2: // compute $\Pr(Y_i)$ by Eq. (8) //
$\Pr(Y_1) \leftarrow \prod_{j=1}^{b_1} p_j$; // boundary condition //
for $i \leftarrow 2$ to $n$ do
begin
  if $b_i \le n$ then $\Pr(Y_i) \leftarrow (1 / p_{i-1}) \Pr(Y_{i-1}) \prod_{j=b_{i-1}+1}^{b_i} p_j$
  else $\Pr(Y_i) \leftarrow 0$
end
Step 3: // apply Theorem 2 and Eq. (9) to compute $\Pr(U_i)$ and $\Pr(R_j)$ //
for $i \leftarrow 0$ to $b_1 - 1$ do $\Pr(R_i) \leftarrow 0$; // boundary condition //
$\Pr(U_1) \leftarrow \Pr(Y_1)$; // boundary condition //
for $i \leftarrow b_1$ to $b_2 - 1$ do $\Pr(R_i) \leftarrow \Pr(U_1)$;
for $i \leftarrow 2$ to $n$ do
begin
  $\Pr(U_i) \leftarrow \Pr(U_{i-1}) + [1 - \Pr(R_{i-2})] \, q_{i-1} \, \Pr(Y_i)$;
  for $j \leftarrow b_i$ to $b_{i+1} - 1$ do $\Pr(R_j) \leftarrow \Pr(U_i)$;
end
Step 4: DPR $\leftarrow \Pr(U_n)$; Output(DPR);
end Reliability_Linear_DCS
Table 1
The relationship between a linear DCS and a star DCS with the consecutive file distribution
Star DCS $D_s$ with the consecutive file distribution ⟷ Linear DCS $D_l$
minimal file cut set $C$ ⟷ minimal file path set $I$
$q_i \equiv$ probability that edge $e_i$ fails ⟷ $p_i \equiv$ probability that edge $e_i$ functions
$[\pi(1), \pi(2), \ldots, \pi(n)]$, a permutation such that if file $f_d \in A_{\pi(i)}$ and $f_d \in A_{\pi(j)}$, then $f_d \in A_{\pi(k)}$ for all $k$, $i < k < j$ ⟷ $[\pi(1), \pi(2), \ldots, \pi(n)] = (1, 2, \ldots, n)$
Complexity analysis
For step 1, the computational complexity of finding the $b_i$'s is $O(nm)$, since the value of $h$ in the inner while loop increases monotonically and does not exceed the value of $k$, i.e. the index of the outer for loop. Computing $\Pr(Y_i)$ in step 2 is the same kind of operation as computing $\Pr[X(j, b_i)]$ in step 5 of Algorithm Reliability_Star_DCS. Thus, the complexity of step 2 is $O(n + n) = O(n)$. Step 3, which is the same as step 6 of Algorithm Reliability_Star_DCS, can be computed in $O(n)$. Therefore, the algorithm Reliability_Linear_DCS takes $O(nm) + O(n) + O(n) = O(nm)$ time.
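A Python sketch of Reliability_Linear_DCS follows. It makes three labeled departures from the pseudocode: the window test keeps a counter of missing files instead of rescanning all $m$ counts, $\Pr(Y_i)$ is evaluated as a direct product rather than through the division in Eq. (8), and it assumes that no single node holds every file. It requires Python 3.8 or later for `math.prod`. The file distribution `AVAIL` is hypothetical; it is not the one shown in Fig. 4.

```python
import math
from itertools import product

def path_set_ends(avail, all_files):
    """b[i] = last edge of the minimal file path set starting at e_i (inf if none)."""
    n = len(avail) - 1
    count = {f: 0 for f in all_files}
    state = {"missing": len(all_files)}
    b = [math.inf] * (n + 1)             # b[0] is unused
    for f in avail[0]:
        state["missing"] -= count[f] == 0
        count[f] += 1
    h = 0
    for k in range(1, n + 1):
        for f in avail[k]:                # node v_k enters the window v_h..v_k
            state["missing"] -= count[f] == 0
            count[f] += 1
        while state["missing"] == 0 and h < k:
            for f in avail[h]:            # node v_h leaves the window
                count[f] -= 1
                state["missing"] += count[f] == 0
            h += 1
            b[h] = k
    return b

def linear_dpr(p, b):
    """Pr(U_n) by Theorem 2; p[i] = reliability of e_i (p[0] unused)."""
    n = len(p) - 1
    def pr_y(i):
        return 0.0 if b[i] == math.inf else math.prod(p[i:b[i] + 1])
    u = [0.0] * (n + 1)
    u[1] = pr_y(1)
    c = 0                                 # Pr(R_{i-2}) = u[c], c = #{k : b_k <= i-2}
    for i in range(2, n + 1):
        while c + 1 <= n and b[c + 1] <= i - 2:
            c += 1
        u[i] = u[i - 1] + (1.0 - u[c]) * (1.0 - p[i - 1]) * pr_y(i)
    return u[n]

def linear_brute_force(p, b):
    """Reference value by enumerating all edge states (exponential)."""
    n = len(p) - 1
    total = 0.0
    for up in product([True, False], repeat=n):
        prob = math.prod(p[k] if up[k - 1] else 1.0 - p[k] for k in range(1, n + 1))
        if any(b[i] != math.inf and all(up[k - 1] for k in range(i, b[i] + 1))
               for i in range(1, n + 1)):
            total += prob
    return total

# Hypothetical distribution on v_0..v_5 with files 1..4 (an assumption, not Fig. 4).
AVAIL = [{2, 4}, {1, 3}, {3}, {4}, {1, 2}, {3}]
B = path_set_ends(AVAIL, {1, 2, 3, 4})
P = [0.0, 0.9, 0.8, 0.7, 0.6, 0.5]
DPR = linear_dpr(P, B)
```

The recurrence relies on the $b_i$ being nondecreasing, which the two-pointer scan guarantees.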
An illustrative example
Consider the linear DCS $D_l$ in Fig. 4. Applying the algorithm Reliability_Linear_DCS yields:
Step 1:
$b_1 = 1$, $b_2 = 4$, $b_3 = 4$, $b_4 = 5$, $b_5 = \infty$.
Step 2:
$\Pr(Y_1) = p_1$, $\Pr(Y_2) = p_2 p_3 p_4$, $\Pr(Y_3) = p_3 p_4$, $\Pr(Y_4) = p_4 p_5$, $\Pr(Y_5) = 0$.
Step 3:
$\Pr(R_0) = 0$;
$\Pr(U_1) = p_1$; $\Pr(R_1) = \Pr(R_2) = \Pr(R_3) = \Pr(U_1) = p_1$;
$i = 2$: $\Pr(U_2) = \Pr(U_1) + [1 - \Pr(R_0)] \cdot q_1 \cdot \Pr(Y_2) = p_1 + q_1 p_2 p_3 p_4$
$i = 3$: $\Pr(U_3) = \Pr(U_2) + [1 - \Pr(R_1)] \cdot q_2 \cdot \Pr(Y_3) = p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4$
$\Pr(R_4) = \Pr(U_3) = p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4$
$i = 4$: $\Pr(U_4) = \Pr(U_3) + [1 - \Pr(R_2)] \cdot q_3 \cdot \Pr(Y_4) = p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4 + q_1 q_3 p_4 p_5$
$\Pr(R_5) = \Pr(U_4) = p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4 + q_1 q_3 p_4 p_5$
$i = 5$: $\Pr(U_5) = \Pr(U_4) + [1 - \Pr(R_3)] \cdot q_4 \cdot \Pr(Y_5) = \Pr(U_4)$ // since $\Pr(Y_5) = 0$ //
$= p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4 + q_1 q_3 p_4 p_5$
Step 4: Therefore, DPR is $\Pr(U_5) = p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4 + q_1 q_3 p_4 p_5$.
3.4. Ring DCS's
A ring DCS is a DCS with a circular communication link, in which each node is joined to its two neighboring nodes by its two incident edges. Assume that $D_r$ is a DCS with a ring structure. According to the well known factoring theorem [7], the DPR of $D_r$ is obtained as follows:
$$\text{DPR}(D_r) = p_i \cdot \text{DPR}(D_r * e_i) + q_i \cdot \text{DPR}(D_r - e_i), \tag{10}$$
where $e_i$ is an arbitrary edge of $D_r$. Since $D_r - e_i$ is a DCS with a linear structure with $n - 1$ edges, its reliability can be computed by the algorithm Reliability_Linear_DCS in $O(nm)$ time. Notably, $D_r * e_i$ remains a DCS with a ring structure with $n - 1$ edges. The same analysis is then applied to $D_r * e_i$. By recursively applying Eq. (10), we decompose the ring DCS $D_r$ with $n$ edges into, in the worst case, $n$ linear DCSs. Therefore, we have an $O(n^2 m)$ time algorithm for computing the reliability of a DCS with a ring structure.
Algorithm Reliability_Ring_DCS($D_r$)
begin
Step 1: if there exists one node that holds all distinct data files then Return(DPR $\leftarrow 1$);
Step 2: Select an arbitrary edge $e_i$ of $D_r$;
Step 3: Rell $\leftarrow$ Reliability_Linear_DCS($D_r - e_i$);
Step 4: Relr $\leftarrow$ Reliability_Ring_DCS($D_r * e_i$);
Step 5: Return(DPR $\leftarrow p_i \cdot$ Relr $+ q_i \cdot$ Rell);
end Reliability_Ring_DCS
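The factoring recursion of Eq. (10) can be sketched in Python as follows. To keep the sketch short and self-contained, the linear sub-problem $D_r - e_i$ is solved here by an exponential enumeration stand-in; in the paper's algorithm that call is Reliability_Linear_DCS. The sketch also assumes that the system functions when some connected group of nodes collectively holds every file. The file distribution and edge reliabilities are hypothetical.

```python
from itertools import product

def brute_force(nodes, edges, p, all_files):
    """Exact reliability by enumerating edge states (stand-in and reference)."""
    total = 0.0
    for up in product([True, False], repeat=len(edges)):
        prob = 1.0
        parent = list(range(len(nodes)))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for (u, v), ok, pe in zip(edges, up, p):
            prob *= pe if ok else 1.0 - pe
            if ok:
                parent[find(u)] = find(v)
        # Collect the files held by each connected group of nodes.
        groups = {}
        for i, files in enumerate(nodes):
            groups.setdefault(find(i), set()).update(files)
        if any(g >= all_files for g in groups.values()):
            total += prob
    return total

def ring_dpr(nodes, p, all_files):
    """Eq. (10): nodes[i] and nodes[(i+1) % n] are joined by the edge with reliability p[i]."""
    n = len(nodes)
    if any(a >= all_files for a in nodes):       # Step 1
        return 1.0
    if n == 1:                                   # nothing left to merge
        return 0.0
    # Step 2: factor on the last edge, which joins nodes[n-1] and nodes[0].
    line_edges = [(i, i + 1) for i in range(n - 1)]
    rel_l = brute_force(nodes, line_edges, p[:-1], all_files)   # D_r - e_n
    merged = [nodes[0] | nodes[-1]] + nodes[1:-1]               # D_r * e_n
    rel_r = ring_dpr(merged, p[:-1], all_files)
    return p[-1] * rel_r + (1.0 - p[-1]) * rel_l

# Hypothetical ring: nodes v_0..v_5, files 1..4, edges e_1..e_6 (assumed values).
NODES = [{2, 4}, {1, 3}, {3}, {4}, {1, 2}, {3}]
P = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95]
ALL = {1, 2, 3, 4}
DPR = ring_dpr(NODES, P, ALL)
```

After contraction the edge that used to end at the removed node now ends at the merged node, so the shortened probability list still describes a ring in the same index convention.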
An illustrative example
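Consider the DCS with a ring topology in Fig. 5. This DCS is obtained from the DCS in Fig. 4 by adding one edge $e_6$ between nodes $v_5$ and $v_0$. Applying algorithm Reliability_Ring_DCS yields
$$\text{DPR}(D_r) = q_6 \cdot \text{DPR}(D_r - e_6) + p_6 \cdot \text{DPR}(D_r * e_6)$$
$$= q_6 \cdot \text{DPR}(D_r - e_6) + p_6 \cdot [q_5 \cdot \text{DPR}(D_r * e_6 - e_5) + p_5 \cdot \text{DPR}(D_r * e_6 * e_5)].$$
Since there exists one node in $D_r * e_6 * e_5$ that holds all distinct data files $\{f_1, f_2, f_3, f_4\}$, we have $\text{DPR}(D_r * e_6 * e_5) = 1$. The example in Section 3.3 reveals that $\text{DPR}(D_r - e_6) = \Pr(U_5)$ and $\text{DPR}(D_r * e_6 - e_5) = \Pr(U_3)$, because the minimal file path set starting at $e_4$ requires the deleted edge $e_5$. Therefore, we have
$$\text{DPR}(D_r) = q_6 \cdot [p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4 + q_1 q_3 p_4 p_5] + p_6 \cdot [q_5 \cdot (p_1 + q_1 p_2 p_3 p_4 + q_1 q_2 p_3 p_4) + p_5].$$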
4. Conclusions
This paper elucidates the distributed program reliability in various classes of distributed computing systems. This reliability is computationally intractable for arbitrary distributed computing systems, even when they are restricted to the class of star distributed computing systems. A particular solvable case for star distributed computing systems is identified, in which data files are distributed with respect to a consecutive property, and a polynomial-time algorithm is developed for this case. Also proposed herein is a linear-time algorithm to verify whether or not an arbitrary star distributed computing system has this consecutive file distribution property. Furthermore, the results for star DCS's are extended to obtain the reliability of linear and ring DCS's in polynomial time. Future work should attempt to construct efficient algorithms for computing lower and upper bounds on the distributed program reliability for arbitrary distributed computing systems.
References
[1] P. Enslow, What is a distributed data processing system, Computer, vol. 11, Jan. 1978.
[2] J. Garcia-Molina, Reliability issues for fully replicated distributed database, IEEE Trans. Computer 16 (1982) 34-42.
[3] A. Satyanarayana, J.N. Hagstrom, A new algorithm for the reliability analysis of multi-terminal networks, IEEE Trans. on Reliability 30 (1981) 325-334.
[4] A. Kumar, S. Rai, D.P. Agrawal, On computer communication network reliability under program execution constraints, IEEE JSAC 6 (1988) 1393-1399.
[5] V.K.P. Kumar, S. Hariri, C.S. Raghavendra, Distributed program reliability analysis, IEEE Trans. Software Eng. 12 (1986) 42-50.
[6] M.S. Lin, D.J. Chen, The computational complexity of the reliability problem on distributed systems, Information Processing Letters 64 (1997) 143-147.
[7] L.G. Valiant, The complexity of enumeration and reliability problems, SIAM J. Computing 8 (1979) 410-421.
[8] K.S. Booth, G.S. Lueker, Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms, Journal of Computer and System Sciences 13 (1976) 335-379.