RELIABILITY OPTIMIZATION OF DISTRIBUTED COMPUTING SYSTEMS SUBJECT TO CAPACITY CONSTRAINTS


Pergamon


RUEY-SHUN CHEN,* DENG-JYI CHEN AND Y. S. YEH

Institute of Computer Science and Information Engineering

National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

rschen@cc.nctu.edu.tw

(Received and accepted August 1993)

Abstract--In this paper, we propose a simple, easily programmed exact method for obtaining the optimal design of a distributed computing system in terms of maximizing reliability subject to memory capacity constraints. We assume that a given amount of resources is available for linking the distributed computing system. The method is based on the partial order relation. To speed up the procedure, some rules are proposed to indicate conditions under which certain vectors in the numerical ordering that do not satisfy the capacity constraints can be skipped over. Simulation results show that the proposed algorithm requires less time and space than the exhaustive method.

Keywords--Distributed computer system, Reliability, Exact approach, Capacity constraint.

1. INTRODUCTION

A distributed computer system (DCS) has been defined as a collection of nodes at which reside computing resources that communicate with each other via a set of links [1]. Large scale distributed computer systems are coming into use primarily because of the economy achieved through resource sharing [2]. The main objective of a DCS is to provide efficient communication among various nodes in order to increase their utility and to make their service available to more users [3]. One of the fundamental considerations in designing such systems is that of system reliability, which strongly depends on the topological layout of the communication links [4]. Reliability is a very good measure of DCS performance if all the needed network users are to be connected with each other, i.e., it is desired to establish a communication path between all the K-available nodes at one time. A DCS may be modelled by a graph in which the nodes correspond to the file servers and the edges to the communication links.

Several heuristic methods [5-7] have been proposed for obtaining an optimal network topology that gives maximum overall reliability of a given computer communication network, but there is no method that provides an exact solution. All of the proposed methods find an approximate solution, because as the number of links increases, the number of possible layouts of the links grows faster than exponentially. However, an exact optimal solution is important where the topology will be used for an extended time. To date, the problem of maximizing the reliability of a DCS under memory constraints through exact methods does not appear to have been considered. In this paper, we propose an exact method for obtaining an optimal DCS using the partial order relation for solving discrete optimization problems [8]. The method is simple, easy to understand, and easy to program. To speed up the procedure, we give rules indicating conditions under which certain vectors in the numerical ordering that do not satisfy the capacity constraints can be skipped over. This reduces the evaluation count and execution time.

*Author to whom all correspondence should be addressed.

This research was partially supported by the National Science Council of the Republic of China under contract NSC81-0301-E009-507.

The organization of the rest of this paper is as follows. In Section 2, assumptions, notation and definitions that will be used throughout this paper are given. Section 3 presents the mathematical formulation of the problem and an algorithm with a flow chart of the solution procedure. Examples are used to illustrate the method and simulation results are obtained in Section 4. Section 5 concludes the paper.

2. ASSUMPTIONS, NOTATION AND DEFINITIONS

ASSUMPTIONS. The method suggested here makes the following assumptions.

1. The locations of the various computers and the possible locations of the connecting links are known.

2. The reliability of every link is known.

3. The capacity of every node installed is specified.

4. Each node is perfectly reliable.

5. Every link is either in the working (ON) state or the failed (OFF) state.

6. The total capacity constraint on the DCS is known.

NOTATION. The notation and definitions used in the rest of the paper are summarized here.

$G = (N, L)$    an undirected DCS graph in which the set of nodes $N$ represents the processing elements (PEs) and the links $L$ represent the communication links.

$N_i$    a node representing a processing element $i$.

$L_i$    an edge representing a communication link $i$.

$L = \{L_1, L_2, \ldots\}$    the set of all allowable links.

$\hat{X}$    denotes the vector which has the current optimal solution $R(G_{\hat{X}})$.

$X_i$    a decision node, $X_i = 1$ if node $i$ is selected, else $X_i = 0$.

$X = \{x_1, x_2, \ldots, x_n\}$    the set of decision nodes, $X_i = 0$ or $1$, $i = 1, 2, \ldots, n$.

$C_i$    capacity of the $i$th node.

$C_s$    memory capacity constraint in the system.

$G_K$    denotes the graph $G$ with $K$-node specified.

$\hat{R}$    denotes the current optimal reliability solution.

$R(G_X)$    the reliability of the X-node solution of the DCS graph $G$.

$X^*$    first vector following $X$ in the numerical ordering that has the property $X \not\le X^*$.

$X < Y$    $X$ is less than $Y$ with numerical ordering.

$X \le Y$    $X$ is less than $Y$ with vector partial ordering.

DEFINITION 1. K-node reliability is defined as the probability of successful communication, i.e., all K-nodes in $G_X$ are connected by working edges within the given memory capacity constraint.

DEFINITION 2. A K-node DCS reliability problem is the problem of computing $R(G_X)$. The problem is a member of the class of #P-complete (number-P-complete) problems, which are at least as hard as the NP-complete problems.

DEFINITION 3. An evaluation count is the number of computations of $R(G_X)$ that are needed to satisfy the capacity constraint.
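Definition 1 can be made concrete by brute-force enumeration of link states. The Python sketch below is not the factoring algorithm [9] used in the paper; it simply adds up the probability of every ON/OFF pattern in which the selected nodes are connected, and leaves the capacity constraint to be checked separately. The function name, the three-node triangle and its link reliabilities are hypothetical and chosen only for illustration.

```python
from itertools import product

def k_node_reliability(selected, edges):
    """Probability that all nodes in `selected` are joined by working links.

    edges: list of (u, v, p), where p is the probability that link u-v is ON.
    Enumerates all 2**len(edges) link states, so it only suits tiny graphs.
    """
    selected = list(selected)
    if len(selected) <= 1:
        return 1.0  # zero or one node is trivially connected
    total = 0.0
    for states in product((False, True), repeat=len(edges)):
        prob = 1.0
        adj = {}
        for on, (u, v, p) in zip(states, edges):
            prob *= p if on else 1.0 - p
            if on:
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        # Depth-first search from one selected node over the working links.
        seen, stack = {selected[0]}, [selected[0]]
        while stack:
            for nxt in adj.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if all(node in seen for node in selected):
            total += prob
    return total

# Hypothetical triangle: every link works with probability 0.9.
triangle = [(1, 2, 0.9), (2, 3, 0.9), (1, 3, 0.9)]
r_pair = k_node_reliability({1, 2}, triangle)     # 0.9 + 0.1 * 0.81
r_all = k_node_reliability({1, 2, 3}, triangle)   # at least two links ON
```

Requiring a third node to be connected cannot increase the value, which is the monotonicity that the skipping rules of Section 3 depend on.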

3. MATHEMATICAL FORMULATION OF THE PROBLEM

The problem considered in this paper may be stated as follows: determine an optimal DCS that gives maximum reliability within the given memory capacity constraint. In other words, we are to find a set of X-nodes from the given set N which constitutes an optimal DCS in that X-node reliability is maximized and the total memory capacity satisfies the capacity constraint. The main problem can be stated mathematically as follows:

$$\text{Maximize } R(G_X), \qquad \text{subject to: } \sum_{X_i \in K} C_i \ge C_s.$$

In this section, we discuss some preliminaries that are essential for the description of the algorithm. We consider vectors $X$ of the form $X = (x_1, x_2, \ldots, x_n)$ which are binary in the sense that each $x_j$ is either 0 or 1. We say that $X \le Y$ if and only if $x_j \le y_j$ for $j = 1, 2, \ldots, n$, e.g., $X \le Y$ where $X = (0, 1, 0)$ and $Y = (0, 1, 1)$. This is the vector partial ordering. Note that the numerical ordering is a refinement of the vector partial ordering, i.e., $X \le Y$ implies $n(X) \le n(Y)$, but $n(X) \le n(Y)$ does not imply $X \le Y$, where $n(X)$ and $n(Y)$ are the numerical values of $X$ and $Y$ read as binary numbers. Suppose all binary $n$-vectors are listed in numerical order, i.e.,

(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1), ..., (1, 1, 1, 1).
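The relation between the two orderings can be checked mechanically. The sketch below stores each binary vector as a Python integer (an encoding assumed here for convenience; the helper name `leq_partial` is hypothetical), so the vector partial ordering becomes a bitwise test and the numerical ordering is ordinary integer comparison.

```python
def leq_partial(x, y):
    """True if x <= y in the vector partial ordering (every 1 of x is a 1 of y)."""
    return (x & y) == x

n = 4
vectors = range(1 << n)

# X <= Y in the partial ordering always implies n(X) <= n(Y).
refinement_holds = all(x <= y for x in vectors for y in vectors if leq_partial(x, y))

# The converse fails: 0011 precedes 0100 numerically, yet the two are incomparable.
converse_counterexample = (0b0011 < 0b0100) and not leq_partial(0b0011, 0b0100)

# The example from the text: (0, 1, 0) <= (0, 1, 1).
text_example = leq_partial(0b010, 0b011)
```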

Immediately following an arbitrary vector $X$, there may (or may not) be a number of vectors $X'$ with the property that $X \le X'$. Roughly speaking, these are vectors that differ from $X$ only in that they have 1's in place of one or more of the rightmost 0's of $X$. For example, immediately following $X = (0, 1, 0, 0)$ are $(0, 1, 0, 1)$, $(0, 1, 1, 0)$, and $(0, 1, 1, 1)$, each of which is greater than $X$ in the vector partial ordering.

We let $X^*$ denote the first vector following $X$ in the numerical ordering that has the property that $X \not\le X^*$. For any given $X$, the vector $X^*$ is easily calculated on a computer as follows. Treat $X$ as a binary number:

(1) Subtract 1 from $X$,

(2) Logically 'or' $X$ and $X - 1$ to obtain $X^* - 1$,

(3) Add 1 to obtain $X^*$.

An example: Let $X = 0101100$,

(1) $X - 1 = 0101011$,

(2) $X^* - 1 = 0101111$,

(3) $X^* = 0110000$.
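The three steps translate directly into integer arithmetic. The following Python sketch assumes vectors are stored as integers of a fixed width `n`; the mask is an addition made here so that $X = 0$ wraps around as it would in an $n$-bit register and yields the overflow value $2^n$. The function name `x_star` is hypothetical.

```python
def x_star(x, n):
    """First n-bit vector after x in numerical order that is not >= x bitwise.

    Returns 2**n (one past the last vector) when no such vector exists.
    """
    mask = (1 << n) - 1
    x_minus_1 = (x - 1) & mask      # step (1); the mask emulates n-bit wraparound for x == 0
    x_star_minus_1 = x | x_minus_1  # step (2)
    return x_star_minus_1 + 1       # step (3)

worked_example = x_star(0b0101100, 7)   # the example from the text
overflow_case = x_star(0, 4)            # every 4-bit vector is >= 0000
```

With these inputs `worked_example` equals `0b0110000`, matching the worked example, and `overflow_case` equals 16.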

Note that $X^* - 1$ is greater than each of $X, X + 1, \ldots, X^* - 2$ in the vector partial ordering. Consider the following simple optimal DCS problem:

$$\text{Maximize } R(G_X), \qquad \text{subject to: } \sum_{X_i \in K} C_i \ge C_s.$$

We can solve this problem by examining each of the $2^n$ possible solution vectors in numerical order, beginning with $X = (0, 0, \ldots, 0)$ and ending with $(1, 1, \ldots, 1)$. However, this process can be shortened considerably by invoking certain rules, which are stated below.

As we proceed through the list of vectors, we keep a record of the best feasible solution found to date. Let $\hat{X}$ denote the current optimal solution vector and $\hat{R}$ its reliability. Let $X$ denote the vector that is currently being examined. The following rules indicate conditions under which certain vectors in the numerical ordering can be skipped over.


RULE 1. If $\sum_{X_i \in K} C_i (X^* - 1)_i < C_s$, then skip to $X^*$.

JUSTIFICATION. The total capacity $\sum_i C_i X_i$ is monotone nondecreasing in the vector partial ordering, and each of $X, X + 1, \ldots, X^* - 2$ is less than or equal to $X^* - 1$ in that ordering. Thus, the capacity of each of these vectors is at most that of $X^* - 1$, which is less than $C_s$, so none of them is feasible.

RULE 2. If $X \ge \hat{X}$ in the vector partial ordering, then skip to $X^*$.

JUSTIFICATION. The reliability is monotone nonincreasing in the vector partial ordering, and $\hat{X} \le X \le X + j$ for each $X + j$ among $X + 1, \ldots, X^* - 1$. Thus, $R(G_{\hat{X}}) \ge R(G_X) \ge R(G_{X+j})$, so none of $R(G_X), R(G_{X+1}), \ldots, R(G_{X^*-1})$ exceeds $R(G_{\hat{X}})$.

RULE 3. If $X$ is a feasible solution and $R(G_X) \le \hat{R}$, then skip to $X^*$.

JUSTIFICATION. The reliability is monotone nonincreasing in the vector partial ordering, and $X \le X + j$ for each $X + j$ among $X + 1, \ldots, X^* - 1$, so $R(G_X) \ge R(G_{X+j})$ for each of them. Of course, if $X$ is feasible and $R(G_X) > \hat{R}$, then $X$ replaces $\hat{X}$ and $R(G_X)$ replaces $\hat{R}$.
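All three justifications rest on one structural fact: every vector from $X$ up to $X^* - 1$ in the numerical ordering lies between $X$ and $X^* - 1$ in the vector partial ordering. The sketch below checks this exhaustively for small $n$, again assuming the integer encoding of vectors; the helper names are hypothetical. It also shows that consecutive vectors inside a skipped block need not be comparable with each other, so only the two bounds should be relied on.

```python
def x_star(x, n):
    """(x OR (x - 1)) + 1 on n-bit integers; returns 2**n on overflow."""
    mask = (1 << n) - 1
    return (x | ((x - 1) & mask)) + 1

def leq_partial(x, y):
    """x <= y in the vector partial ordering."""
    return (x & y) == x

def skipped_block_is_bounded(n):
    """Check X <= Y <= X* - 1 (partial ordering) for every Y in [X, X* - 1]."""
    for x in range(1 << n):
        top = x_star(x, n) - 1
        for y in range(x, top + 1):
            if not (leq_partial(x, y) and leq_partial(y, top)):
                return False
    return True

bounded_for_small_n = all(skipped_block_is_bounded(n) for n in range(1, 9))

# X + 1 and X + 2 for X = 0101100 are both above X, but not comparable to each other.
x = 0b0101100
neighbours_comparable = leq_partial(x + 1, x + 2)
```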

To find an optimal solution, we do not consider an exhaustive method, since it is too time-consuming. Instead, we apply the Lawler-Bell algorithm [8] to find an optimal solution by skipping unsatisfactory conditions and executing only a portion of all the combinations. The problem of DCS can then be transformed into the following formulation:

$$\text{Maximize } R(G_X), \qquad \text{subject to: } \sum_{X_i \in K} C_i \ge C_s.$$

The exact solution for our problem using the Lawler-Bell technique is described as follows:

1. Node vector:

$X = (X_n, X_{n-1}, \ldots, X_1)$, $X_i = 0$ or $1$, $i = 1, 2, \ldots, n$.

2. Numerical ordering:

(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1), ..., (1, 1, 1, 1).

3. Vector partial ordering:

$X \le Y$ if and only if $X_i \le Y_i$, for $i = 1, 2, \ldots, n$,

e.g., if $X = (0, 1, 0, 0, 1)$ and $Y = (0, 1, 1, 0, 1)$, then $X \le Y$, which implies $R(G_X) \ge R(G_Y)$ and $\sum_i C_i X_i \le \sum_i C_i Y_i$.

4. $X^*$: the first vector following $X$ in the numerical ordering that has the property that $X \not\le X^*$. The formulation is $X^* = (X \text{ bitwise-or } (X - 1)) + 1$, e.g.,

$X = 0101100$, $X^* = 0110000$.

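Because the capacity is monotone nondecreasing and the reliability is monotone nonincreasing in the vector partial ordering, the rules rely on the fact that every vector $X + j$ between $X$ and $X^* - 1$ in the numerical ordering satisfies

$$X \le X + j \le X^* - 1, \qquad j = 0, 1, \ldots, X^* - 1 - X,$$

in the vector partial ordering.

ALGORITHM. The algorithm based on the above rules is given below.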

STEP 1. Determine whether the count number has overflowed. If $X >$ overflow_X, i.e., $X$ has passed $(1, 1, \ldots, 1)$, then stop; otherwise, go to the next step.

STEP 2. Compare the capacity $\sum_{i=1}^{n} C_i (X^* - 1)_i$ with the capacity constraint $C_s$. If $\sum_{i=1}^{n} C_i (X^* - 1)_i < C_s$, then replace $X$ by $X^*$ and go back to Step 1, else go to the next step.

STEP 3. Compare the capacity $\sum_{i=1}^{n} C_i X_i$ with the capacity constraint $C_s$. If $\sum_{i=1}^{n} C_i X_i < C_s$, then set $X = X + 1$ and go back to Step 1, else go to the next step.

STEP 4. Compare $X$ with $\hat{X}$ in the vector partial ordering. If $X \ge \hat{X}$, then replace $X$ by $X^*$ and go back to Step 1, else go to the next step.

STEP 5. Compute the reliability $R(G_X)$ and compare it with the current optimal solution $\hat{R}$. If $R(G_X) > \hat{R}$, then replace $\hat{R}$ by $R(G_X)$ and $\hat{X}$ by $X$, else replace $X$ by $X^*$ and go back to Step 1.

STEP 6. Continue the loop until $X$ overflows. Then, the last reliability $R(G_{\hat{X}})$, computed with the factoring algorithm [9], is the optimal K-node reliability, and $\hat{X}$ is the optimal node vector.
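The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C program: vectors are stored as integers with bit $i$ standing for node $i + 1$, the initial state (no incumbent, $\hat{R} = 0$) is assumed, the search advances to $X^*$ after every evaluation, and the reliability is supplied as a callable. The toy reliability used in the demo (0.95 per selected node) is hypothetical and only needs to be nonincreasing in the vector partial ordering; the capacities are those listed for the six-node example with $C_s = 70$.

```python
def x_star(x, n):
    """(x OR (x - 1)) + 1 on n-bit integers; returns 2**n on overflow."""
    mask = (1 << n) - 1
    return (x | ((x - 1) & mask)) + 1

def capacity(x, caps):
    """Total capacity of the nodes whose bit is set in x (bit i -> caps[i])."""
    return sum(c for i, c in enumerate(caps) if (x >> i) & 1)

def search(caps, c_s, reliability):
    """Return (best_x, best_r, evaluation_count) following Steps 1-6."""
    n = len(caps)
    best_x, best_r, evals = None, 0.0, 0      # assumed initial values
    x = 0
    while x < (1 << n):                       # Step 1: stop on overflow
        nxt = x_star(x, n)
        if capacity(nxt - 1, caps) < c_s:     # Step 2 (Rule 1)
            x = nxt
        elif capacity(x, caps) < c_s:         # Step 3: x itself is infeasible
            x += 1
        elif best_x is not None and (x & best_x) == best_x:
            x = nxt                           # Step 4 (Rule 2): x >= incumbent
        else:                                 # Step 5: evaluate a feasible x
            r = reliability(x)
            evals += 1
            if r > best_r:
                best_x, best_r = x, r
            x = nxt                           # Rule 3: x dominates x+1 .. x*-1
    return best_x, best_r, evals

def exhaustive(caps, c_s, reliability):
    """Evaluate every feasible vector; returns (best reliability, count)."""
    feasible = [x for x in range(1 << len(caps)) if capacity(x, caps) >= c_s]
    return max(reliability(x) for x in feasible), len(feasible)

def toy_reliability(x):
    """Hypothetical stand-in: each selected node costs a factor of 0.95."""
    return 0.95 ** bin(x).count("1")

caps = [20, 15, 10, 30, 25, 25]               # C1..C6 of the six-node example
best_x, best_r, evals = search(caps, 70, toy_reliability)
opt_r, n_feasible = exhaustive(caps, 70, toy_reliability)
```

Comparing `evals` with `n_feasible` gives a feel for how much work the rules avoid; with a real reliability routine each avoided evaluation saves one call to the expensive K-node reliability computation.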

[Flow chart: initial values, followed by the tests for Rule 1, Rule 2 and Rule 3 with Yes/No branches.]

Figure 1. Flow chart of the actual solution procedure.

The actual solution algorithm followed is a variation of that given by Lawler and Bell [8], modified for the sake of computational simplicity. The algorithm is illustrated with the flow chart in Figure 1. We determine the X-node reliability of a DCS iff the corresponding node capacity satisfies the given capacity constraints.


4. EXAMPLES AND RESULTS

The exact method is illustrated below by means of two examples.

4.1. Example 1

Consider the six node DCS with eight links depicted in Figure 2. Here, our problem is to determine an optimal DCS which includes some of the nodes X1, X2, ..., X6, whose total capacity is at least the memory capacity constraint of 70 units. The optimization can be formulated as the following mathematical problem:

$$\text{Maximize } R(G_X), \qquad \text{subject to: } \sum_{X_i \in K} C_i \ge 70.$$

Using a C computer program based on the proposed algorithm and the flow chart in Figure 1, an optimal DCS was obtained on an Intel-486 personal computer in 2 clock cycles of execution time. The optimal K-DCS topology with node vector (X1, X2, X3, X5) was found, with maximum reliability of 0.7628 and memory capacity of 70 units. The evaluation count was only 17, compared with a count of 32 for the exhaustive method. This is a rather modest reduction in computing, but a much greater saving will be made in problems with a larger DCS.

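[Figure 2: graph of nodes X1 to X6 with link reliability labels including 0.6, 0.8 and 0.9. Node capacities: C1 = 20, C2 = 15, C3 = 10, C4 = 30, C5 = 25, C6 = 25.]

Figure 2. A six node distributed computing system.

[Figure 3: graph of nodes X1 to X8. Node capacities: C1 = 20, C2 = 30, C3 = 18, C4 = 26, C5 = 40, C6 = 20, C7 = 10, C8 = 35.]

Figure 3. An ARPA-net distributed computing system.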

4.2. Example 2

Consider a simplified version of the well-known ARPA network having eight nodes and 12 links, as depicted in Figure 3. Here, our problem is to determine an optimal DCS which includes some of the nodes X1, X2, ..., X8, the total capacity of which is at least the memory capacity constraint of 100 units. The problem can be stated mathematically as follows:

$$\text{Maximize } R(G_X), \qquad \text{subject to: } \sum_{X_i \in K} C_i \ge 100.$$

Using a C computer program based on the proposed algorithm and the flow chart in Figure 1, an optimal DCS was obtained on an Intel-486 personal computer in 54 clock cycles of execution time. The optimal K-DCS with node vector (X1, X2, X3, X4, X7) was found, with maximum reliability of 0.8245 and memory capacity of 104 units. The evaluation count was only 78, as compared with a count of 128 for the exhaustive method. Although 78 evaluation counts need to be examined, only a small portion of the 78 need to use the factoring algorithm [9] to compute the reliability $R(G_X)$. Therefore, the rules can reduce execution time effectively.
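The capacity totals reported for the two examples can be checked by simple arithmetic. The sketch below assumes the node capacities are those printed beside Figures 2 and 3 (first set for the six-node system, second set for the ARPA-net system) and reads the reported node vectors as (X1, X2, X3, X5) and (X1, X2, X3, X4, X7); it does not recompute the reliabilities, since the link layout is not reproduced here.

```python
# Node capacities as listed with Figure 2 (Example 1) and Figure 3 (Example 2).
caps_ex1 = {1: 20, 2: 15, 3: 10, 4: 30, 5: 25, 6: 25}
caps_ex2 = {1: 20, 2: 30, 3: 18, 4: 26, 5: 40, 6: 20, 7: 10, 8: 35}

def total_capacity(nodes, caps):
    """Sum of the capacities C_i over the selected node indices."""
    return sum(caps[i] for i in nodes)

total_ex1 = total_capacity([1, 2, 3, 5], caps_ex1)       # Example 1 solution
total_ex2 = total_capacity([1, 2, 3, 4, 7], caps_ex2)    # Example 2 solution

print(total_ex1, total_ex1 >= 70)    # prints: 70 True
print(total_ex2, total_ex2 >= 100)   # prints: 104 True
```

Both totals agree with the capacities stated in the text (70 and 104 units), so each reported solution satisfies its constraint.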

5. CONCLUSION

The present method for determining the optimal X-node in a DCS so as to maximize the K-node reliability is based on the partial order relation. The proposed algorithm requires less execution time and space than the exhaustive method, because we apply several rules to discard most of the infeasible vectors before computing the reliability $R(G_X)$, and thereby reduce the computation time. Thus, the rules are very effective. The algorithm presented is straightforward and easily transformed into a program.

REFERENCES

1. J.A. Stankovic, A perspective on distributed computer systems, IEEE Trans. Comput. 33, 1102-1115 (1984).

2. D.J. Chen and T.H. Huang, Reliability analysis of distributed systems based on a fast reliability algorithm, IEEE Trans. on Parallel and Distributed Systems 3 (2) (1992).

3. L. Fratta and U.G. Montanari, Synthesis of available networks, IEEE Trans. Reliab. R-25, 81-87 (1976).

4. R.S. Wilkov, Design of computer networks based on a reliability measure, In Proceedings of the Symposium on Computer Communication, Networks and Teletraffics, pp. 371-384, (1986).

5. K.K. Aggarwal, Y.C. Chopra and J.S. Bajwa, Topological layout of links for optimizing the S-T reliability in a computer communication system, Microelectron. Reliab. 22, 341-345 (1982).

6. K.K. Aggarwal, Y.C. Chopra and J.S. Bajwa, Topological layout of links for optimizing the overall reliability in a computer communication system, Microelectron. Reliab. 22, 347-351 (1982).

7. Y.C. Chopra, B.S. Sohi and K.K. Aggarwal, Network topological for maximizing the terminal 1 reliability in a computer communication system, Microelectron. Reliab. 24 (5), 911-913 (1984).

8. E.L. Lawler and M.D. Bell, A method for solving discrete optimization problems, Ops. Res. 14, 1098-1111 (1966).

9. K.R. Wood, Factoring algorithms for computing K-terminal network reliability, IEEE Trans. Reliability 35, 269-278 (1989).