A Fault-Tolerant Routing Strategy in Hypercube Multicomputers


Ge-Ming Chiu, Member, IEEE Computer Society, and Shui-Pao Wu

Abstract-We investigate fault-tolerant routing which aims at finding feasible minimum paths in a faulty hypercube. The concept of unsafe node and its extension are used in our scheme. A set of stringent criteria is proposed to identify the possibly bad candidates for forwarding a message. As a result, the number of such undesirable nodes is reduced without sacrificing the functionality of the mechanism. Furthermore, the notion of degree of unsafeness for classifying the unsafe nodes is introduced to facilitate the design of efficient routing algorithms which rely on having each node keep the states of its nearest neighbors. We show that a feasible path of length no more than the Hamming distance between the source and the destination plus four can always be established by the routing algorithm as long as the hypercube is not fully unsafe. The issue of deadlock freeness is also addressed in this research. More importantly, another fault-tolerant routing algorithm, which requires only a constant of five virtual networks in wormhole routing to ensure the property of deadlock freeness for a hypercube of any size, is presented in this paper.

Index Terms-Deadlock, fault tolerance, hypercubes, routing, virtual channels, wormhole routing.

1 INTRODUCTION

RECENTLY, hypercube multicomputers have been drawing considerable attention from many researchers. A large amount of research effort has been directed towards hypercube systems [3], [8], [17], [18]. Due to their regular structure and low diameter, hypercube multicomputers are well suited for parallel processing [11]. Several research and commercial hypercube machines have been built in recent years [10].

In a multicomputer system, efficient communication among the processors is critical to the performance of the system. Processors (or nodes) in a hypercube communicate with each other by passing messages. Hence, routing of messages is an important issue that needs to be addressed.

In particular, it is essential to design routing strategies that can route messages efficiently in the presence of faulty components. The large number of nodes and communication links in a hypercube intrinsically provides fault-tolerant capability for message routing. If a message path is blocked due to component failures, fault-free alternative paths can be used to forward the message. In this paper, the issue of fault-tolerant routing in hypercube multicomputers is investigated.

A variety of routing algorithms have been proposed for hypercube systems in the past [2], [9], [12]. However, most of these algorithms are not suitable for routing messages when nodes fail in a hypercube. There have also been a number of

G.-M. Chiu is with the Department of Electrical Engineering and Technology, National Taiwan Institute of Technology, Taipei, Taiwan, ROC.

E-mail: [email protected].

S.-P. Wu was with the Department of Electrical Engineering and Technology, National Taiwan Institute of Technology, Taipei, Taiwan, ROC. He is now with AT&T Taiwan, Hsinchu, Taiwan, ROC.

Manuscript received July 18, 1994; revised Dec. 22, 1994. A preliminary version of this paper appeared in the Proceedings of the 24th International Symposium on Fault-Tolerant Computing, June 1994.

For information on obtaining reprints of this article, please send e-mail to:

[email protected], and reference IEEECS Log Number C95180.

fault-tolerant routing strategies proposed in previous research [4], [5], [14], [15], [18], [19], [20]. Gordon and Stout [14] proposed an approach called "sidetracking" to route messages in a hypercube with faulty nodes. A message is derouted to a randomly chosen fault-free neighboring node when no link is available for advancing the message. However, routing failures may occur, although the probability of such failures is low, and excessive message delay may arise.

Chen and Shin [5] developed and analyzed a routing scheme based on depth-first search in which backtracking is required when all forward links are blocked by faulty components. Every message must carry information indicating the dimensions already traversed to avoid routing to the same node except when backtracking is enforced. The time overhead required is relatively excessive when the number of faults is small. A simplified version of the method that tolerates fewer faults was presented in [4]. Most of the foregoing strategies are designed for packet-switching systems only. Furthermore, the issue of deadlock freeness is not specifically considered in these schemes.

Lee and Hayes [15] proposed a different fault-tolerant routing strategy which is based on the concept of unsafe nodes. Message routing avoids unsafe nodes, which could possibly lead to communication difficulties or excessive delay. It does not require each message to carry information about its past routing history. Instead, each node only has to keep simple information about the states of its neighbors. The computing requirement for the routing algorithm is simplified; the time delay incurred at each node is small. They showed that the algorithm can route a message via a path of length no greater than two plus the Hamming distance between the source and the destination of the message as long as the hypercube is not fully unsafe, which is guaranteed provided that the number of faults is no more than ⌈n/2⌉ in an n-cube.

In [19], the concept of safety level, which is an enhancement of the safe node concept, is presented for implementing reliable broadcasting in hypercubes. The safety level is an approximate measure of the number as well as the distribution of faulty nodes. An efficient broadcasting scheme using safety level is proposed. Extensive surveys of routing algorithms for hypercube systems can be found in [13], [16].

0018-9340/96$05.00 © 1996 IEEE

IEEE TRANSACTIONS ON COMPUTERS, VOL. 45, NO. 2, FEBRUARY 1996

In this paper, we develop a routing strategy which adopts a philosophy similar in nature to the one proposed in [15]. The concept of unsafe nodes and its extension are used in our scheme. A set of stringent criteria is presented to identify the possibly bad candidates, called unsafe nodes, for forwarding a message. As a result, the number of such undesirable nodes is reduced without sacrificing the functionality of the mechanism. The same definition of unsafe nodes has also been proposed independently in [20], which specifically addresses the issue of reliable broadcasting in a faulty hypercube. Our paper instead mainly focuses on node-to-node routing. In this paper, we study the properties of the set of unsafe nodes in detail. In particular, rigorous proofs are presented. Furthermore, we introduce the notion of degree of unsafeness for classifying the unsafe nodes to facilitate the design of efficient routing algorithms. We show that a feasible path of length no more than the Hamming distance between the source and the destination plus four can always be established by our algorithm as long as the hypercube is not fully unsafe. According to the proposed criteria, this condition is guaranteed if the number of faulty nodes is no more than n - 1 in an n-cube.

While the issue of deadlock freeness was addressed in some of the previous routing schemes, the property of deadlock freeness is basically achieved by assigning a set of priority classes to communication resources, such as message buffers, and restricting the use of these resources according to a certain order of the classes. One of the main disadvantages of this scheme is that the number of priority classes required for the communication resources increases as the size of the hypercube grows, thereby rendering the design unscalable.

In this paper, we propose another fault-tolerant routing algorithm which achieves deadlock freeness with only a constant of five virtual networks for a wormhole-routing hypercube of any size. Deadlock freeness is ensured by restricting the use of virtual channels in increasing order of network numbers, and having up-transition traversals be considered first over down-transition ones. The notions of up-transition and down-transition links are used in an earlier work for routing in a circuit-switched hypercube [6]. However, they are used in a very different form in this paper.

Extension of the results to hypercube systems with link failures is also briefly discussed.

In the next section, necessary notations and definitions are given. The routing strategy and the associated properties are presented in Section 3. The issue of deadlock freeness is addressed in Section 4. In Section 5, we briefly discuss the issue of tolerating link failures. Section 6 draws the conclusion.

2 PRELIMINARIES

An n-dimensional hypercube, abbreviated as n-cube, has N = 2^n processors (or nodes). Each processor T, 0 ≤ T ≤ N - 1, can be represented by a sequence of n binary digits (t_{n-1}, ..., t_0), where t_i ∈ {0, 1} for 0 ≤ i ≤ n - 1. The bit t_i is called the i-dimensional bit of node T, and t_0 is the least significant bit.

Two processors are connected by a bidirectional link if and only if the binary representations of their labels differ in exactly one bit. A link is said to be an i-dimensional link, 0 ≤ i ≤ n - 1, if it connects two nodes that differ in their i-dimensional bits. A path between two nodes can be represented as an ordered sequence of nodes in which these two nodes are the two end nodes and every two consecutive nodes are physically adjacent to each other. Equivalently, this path can be represented as a sequence of communication links connecting the consecutive nodes. The number of links on a path is called the length of the path. The Hamming distance between two nodes X and Y, denoted by H(X, Y), is the number of bits in which the labels of X and Y differ. It is assumed that a faulty node cannot be used for routing messages in a hypercube. A path is called a feasible path if it traverses no faulty nodes. A feasible path is called minimum if it has the shortest length among all feasible paths between the source and destination nodes.
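The definitions above map directly onto bit operations on integer node labels. The following Python sketch is only an illustration, not part of the original scheme; the function names `hamming`, `neighbors`, and `is_feasible_path` are hypothetical, and nodes are assumed to be encoded as integers whose bit i is the i-dimensional bit.

```python
from __future__ import annotations


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which labels x and y differ, i.e., H(X, Y)."""
    return bin(x ^ y).count("1")


def neighbors(node: int, n: int) -> list[int]:
    """Nodes adjacent to `node` in an n-cube; entry i is reached over the i-dimensional link."""
    return [node ^ (1 << i) for i in range(n)]


def is_feasible_path(path: list[int], faulty: set[int]) -> bool:
    """True if consecutive nodes are adjacent and no node on the path is faulty."""
    adjacent = all(hamming(a, b) == 1 for a, b in zip(path, path[1:]))
    return adjacent and not any(node in faulty for node in path)


# Example with an assumed fault set in a 4-cube.
faults = {0b0010, 0b0100, 0b1000, 0b1111}
print(hamming(0b1101, 0b0000))
print(is_feasible_path([0b1101, 0b1001, 0b0001, 0b0000], faults))
```

A path whose length equals the Hamming distance between its end nodes is a shortest path, so a feasible path of that length is necessarily a feasible minimum path.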

3 FAULT-TOLERANT ROUTING STRATEGY

Failed components may block or delay the routing of messages in a hypercube system, possibly degrading system performance. Fault-tolerant routing capability should be exploited by taking advantage of the large number of nodes and communication links in a hypercube. For instance, there are r! shortest paths between two nodes that are separated by a Hamming distance of r. To illustrate the routing problem, Fig. 1 shows a 4-dimensional hypercube with four faulty nodes, whose addresses are (0010), (0100), (1000), and (1111). Suppose that a message is to be sent from node (1101) to node (0000). In this case, a feasible minimum path of length 3, such as (1101, 1001, 0001, 0000), exists in the system. However, if node (1101) has no knowledge of the failures of nodes (0100) and (1000), the message may first be sent to node (1100). It turns out that there exists no feasible path of length 2 from this intermediate node to the destination, so the message cannot be forwarded to its destination through a feasible minimum path of length 3 via node (1100). Now consider another case where a message is to be sent from (0110) to (0000). It is obvious that no feasible minimum path of length 2 exists; thus the message must be derouted. If the message is first derouted

Fig. 1. A 4-dimensional hypercube with four faulty (dark) nodes.

to node (1110), it will next be forwarded to either of nodes (1010) and (1100), from which no feasible minimum paths of length 2 are possible and another derouting is necessary.

This not only increases routing latency but also makes the routing function complex. On the other hand, if the message is first forwarded to node (0111) from (0110), a feasible minimum path of length 4, as opposed to 6, can be readily established.

It would be desirable to route a message through a feasible minimum path in a faulty hypercube. From the previous example, it is apparent that the amount of information, known to each individual node, about the states of other nodes can dictate the likelihood for a distributed routing mechanism to route messages through feasible minimum paths. A natural way of defining the amount of information known to each node is in terms of the range of neighborhood of a node whereby the node has information about the states of all other nodes within this neighborhood [1], [15]. For example, if each node is aware of the state of every other node in an n-cube, i.e., the range of neighborhood associated with the knowledge is n, one can search through all possible routes to locate a feasible minimum path when it sends a message to a given destination. This, however, may incur a significant amount of computation time, and require a considerable amount of storage to hold the information.

In the following, we investigate the design of a fault-tolerant routing scheme for hypercubes in which nodes may fail. Each node stores information about the states of its nearest neighbors, which is a common and natural situation. To facilitate our discussion, the consideration of deadlock freeness will not be addressed until Section 4.

3.1 The Unsafe Nodes

In the following, we present a fault-tolerant routing strategy which is based on the idea of avoiding routing a message via a node which may lead to no feasible minimum paths. Both the source and destination nodes of a message are assumed to be fault-free. The notion of an unsafe node is introduced to indicate a potentially bad choice for a routing algorithm [15]. Consider a node, say I, which currently has a message destined for a node D, and the Hamming distance H(I, D) is k. If k is equal to 2, there are two node-disjoint minimum paths between these two nodes, and they are the only such paths. It is easy to determine whether a minimum path from node I to node D exists by simply examining whether the two nearest neighbors of I which are also nearest neighbors of node D are both faulty. However, if k ≥ 3, the mere faulty/nonfaulty state information of node I's nearest neighbors is not informative enough to prevent a message from being forwarded to a "bad" next node, as illustrated in the previous example. This lays the basis for the following definition.

DEFINITION 1. A fault-free node is defined as an unsafe node if it has either two or more faulty nearest neighbors, or three or more faulty or unsafe nearest neighbors.

A nonfaulty node that is not unsafe is called a safe node.

With the definition, the unsafe nodes in a hypercube can be identified in a recursive manner. At each iteration, each node exchanges its current state with all of its fault-free neighbors, and then checks whether it should be marked unsafe given the latest state information. For ease of discussion, let us assume that the identification function is performed among all fault-free nodes in a synchronous fashion. A simple algorithm for identifying unsafe nodes is shown in Fig. 2. Initially, each node contains a list FAULT identifying its faulty neighbors, and a list SAFE containing all fault-free neighbors. Firstly, all unsafe nodes that are due to two or more faulty neighbors are identified. It takes three or more unsafe or faulty neighbors to make a nonfaulty node unsafe afterwards. Finally, the unsafe and safe neighbors of a node are kept in the UNSAFE and SAFE lists, respectively. Apparently, the lengths of FAULT, UNSAFE, and SAFE should add up to n for an n-cube. This algorithm is initiated at system setup time. It is also initiated whenever the status of the system changes, e.g., when a new fault is detected. The status of a node can be changed only when its neighbors change their status. For example, Fig. 3 shows the 4-cube of Fig. 1 with unsafe nodes identified. Node (0110) is unsafe as it has two faulty neighbors, namely (0010) and (0100). By the same token, nodes (1010) and (1100) are also unsafe. Node (1110) will subsequently be marked unsafe due to the aforementioned three unsafe nodes. In Fig. 3, the unsafe nodes are shown with gray filling. Routing strategies should avoid forwarding messages to unsafe nodes as much as possible. Consider the previous example where a message is sent from node (1101) to (0000). By not forwarding the message to node (1100), which is unsafe, a routing function can avoid the routing difficulty

Algorithm Iden-Unsafe

{Each node keeps a list FAULT containing its faulty neighbors, a list UNSAFE containing its unsafe neighbors, and a list SAFE containing its safe neighbors. Notations l_f and l_u are the lengths of FAULT and UNSAFE, respectively. Initially, SAFE contains all fault-free neighbors and l_f is set accordingly.}

begin
    l_u := 0;
    while (system status is not stable)
    begin
        for j = 1 to n
        begin
            Receive status S_X from neighbor X connected by dimension j;
            if (S_X = unsafe and X is not in UNSAFE) then
            begin
                Add X to UNSAFE;
                Delete X from SAFE;
                l_u := l_u + 1;
            end
        end
        if (l_f ≥ 2 or l_f + l_u ≥ 3) then Mark current node as unsafe;
    end
end

Fig. 2. The algorithm used to find all unsafe nodes in an n-cube.

Fig. 3. The unsafe nodes for the 4-cube of Fig. 1.

described earlier. Instead, it may choose to route via a feasible minimum path such as (1101, 1001, 0001, 0000).
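The identification procedure can be simulated centrally as a fixed-point computation over the whole cube. The Python sketch below is a minimal illustration under that assumption; it is not the distributed, per-node algorithm of Fig. 2. The function name `identify_unsafe` is hypothetical, each pass of the loop stands for one synchronous round of status exchange, and nodes are encoded as integers.

```python
from __future__ import annotations


def identify_unsafe(n: int, faulty: set[int]) -> set[int]:
    """Return the set of unsafe nodes of an n-cube under Definition 1."""
    unsafe: set[int] = set()
    while True:
        newly_unsafe = set()
        for v in range(1 << n):
            if v in faulty or v in unsafe:
                continue
            nbrs = [v ^ (1 << i) for i in range(n)]
            l_f = sum(u in faulty for u in nbrs)  # faulty neighbors
            l_u = sum(u in unsafe for u in nbrs)  # neighbors already marked unsafe
            if l_f >= 2 or l_f + l_u >= 3:
                newly_unsafe.add(v)
        if not newly_unsafe:  # status is stable
            return unsafe
        unsafe |= newly_unsafe


# Fault pattern of Fig. 1.
faulty = {0b0010, 0b0100, 0b1000, 0b1111}
unsafe = identify_unsafe(4, faulty)
print(sorted(format(v, "04b") for v in unsafe))
```

For this fault pattern the sketch marks (0110), (1010), (1100), and (1110) as discussed in the text, and also (0000), which satisfies Definition 1 because three of its neighbors are faulty.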

Obviously, a large number of unsafe nodes will reduce the number of alternative paths a message can traverse, thus affecting the flexibility that a routing algorithm can possibly offer. Therefore, a nonfaulty node should be labeled as unsafe only if that is necessary. This should be accompanied by the property that there always exists, between a safe node and any other fault-free node, a feasible path of length equal to the Hamming distance between them.

THEOREM 1. In a faulty hypercube, if node A is a safe node, then for any nonfaulty node B there always exists a feasible minimum path of length equal to H(A, B) between A and B.

PROOF. Let h = H(A, B). If h is 1, nodes A and B are nearest neighbors and the link exists. Suppose h is equal to 2. There are two node-disjoint shortest paths from A to B. By definition, node A, which is safe, has at most one faulty neighbor. Consequently, at least one feasible shortest path of length 2 exists. Consider the case where h ≥ 3. Node A has h nearest neighbors that are on some shortest paths of length h to node B. As node A is safe, the number of unsafe or faulty neighbors of A must be less than three, which means that there exists at least one nearest neighbor, among h of them, of A that is safe. Take any one of these safe neighboring nodes, say node I, as the next node. Obviously, the Hamming distance between I and B will be h - 1. Iteratively apply the former procedure to node I and the succeeding nodes as long as the associated distance is greater than or equal to 3. Eventually, one will reach a safe node which is 2 hops from node B. Borrowing the argument for the case of h = 2 as described earlier, one has that there exists a feasible shortest partial path from this safe node to node B. Therefore, we have that the full path is feasible and is of length h. Hence, the theorem is proved.

Theorem 1 provides the ground for the usefulness of safe nodes in a faulty hypercube. It also illustrates the importance of reducing the number of unsafe nodes for routing schemes.
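The proof of Theorem 1 is constructive, and its steps can be sketched as a greedy procedure. The Python code below is an illustration only, assuming global knowledge of the faulty and unsafe sets; it is not the distributed routing algorithm developed later in the paper. The function name `minimum_path_from_safe` is hypothetical, and the unsafe set is hard-coded as the result of applying Definition 1 to the fault pattern of Fig. 1.

```python
def minimum_path_from_safe(src, dst, n, faulty, unsafe):
    """Build a feasible path of length H(src, dst) from a safe node src."""
    bad = faulty | unsafe
    assert src not in bad and dst not in faulty
    path, cur = [src], src
    while cur != dst:
        diff = cur ^ dst
        h = bin(diff).count("1")
        # Neighbors of cur that lie on some shortest path to dst.
        preferred = [cur ^ (1 << i) for i in range(n) if (diff >> i) & 1]
        if h >= 3:
            # cur is safe, so at most two of its neighbors are faulty or unsafe.
            nxt = next(v for v in preferred if v not in bad)
        elif h == 2:
            # cur is safe, so at most one of the two common neighbors is faulty.
            nxt = next(v for v in preferred if v not in faulty)
        else:
            nxt = dst
        path.append(nxt)
        cur = nxt
    return path


# Assumed inputs: fault pattern of Fig. 1 and the unsafe set Definition 1 yields for it.
faulty = {0b0010, 0b0100, 0b1000, 0b1111}
unsafe = {0b0000, 0b0110, 0b1010, 0b1100, 0b1110}
path = minimum_path_from_safe(0b1101, 0b0000, 4, faulty, unsafe)
print([format(v, "04b") for v in path])
```

With dimensions scanned from 0 upward, the sketch skips the unsafe node (1100) and produces the path (1101, 1001, 0001, 0000) mentioned in the text.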

3.2 Properties of the Unsafe Area

Before proceeding, let us study some properties of the structure formed of the safe nodes, or alternatively of the faulty and unsafe nodes. These properties are useful for the design of fault-tolerant routing algorithms.

Let us define safe set as the set of safe nodes in a hypercube. It would be interesting to see whether the safe set can be partitioned into disconnected components. A component in this case is a maximal nonempty subset of safe nodes that are connected among themselves.

LEMMA 1. Suppose that the safe set of an n-dimensional faulty hypercube is not connected. Let S_A and S_B be any two disconnected components of the safe set. Then there exist two nodes A ∈ S_A and B ∈ S_B such that H(A, B) = 2.

PROOF. Consider a pair of nodes A ∈ S_A and B ∈ S_B such that H(A, B) is minimum among all pairs of nodes in S_A and S_B. Since S_A and S_B are disconnected, H(A, B) > 1, i.e., A and B cannot be nearest neighbors. Now assume that l = H(A, B) ≥ 3. There are l nearest neighbors of A on some shortest paths from A to B. As node A is a safe node and l ≥ 3, at least one of these l nearest neighbors of A is safe, i.e., there exists a safe neighbor, say A', of A such that H(A', B) = l - 1. Since nodes A and A' are connected, S_A should contain node A' as well. However, this contradicts the previous assumption that l = H(A, B) is the shortest distance between the nodes in S_A and S_B. Hence we show that H(A, B) = 2, that is, the minimum distance between two nodes, one each in S_A and S_B, equals 2. The existence of such nodes is obvious, and the lemma follows.

Fig. 4 shows an example of a faulty 4-cube in which there are four faulty nodes, whose addresses are (0010), (0100), (1001), and (1111). The safe set of the hypercube contains two disconnected subcubes of dimension 2, namely (0**1) and (1**0), where * is a don't care symbol representing either 0 or 1.

Fig. 4. Faulty 4-cube with a disconnected safe set consisting of two subcubes of dimension 2.
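The example of Fig. 4 can be checked against Lemma 1 with a short Python sketch. It takes the safe set as given in the text, (0**1) and (1**0), finds its connected components, and measures the smallest Hamming distance between them. The helper names `expand` and `components` are hypothetical and serve only this illustration.

```python
from __future__ import annotations

from itertools import product


def expand(pattern: str) -> set[int]:
    """All node labels matching a pattern such as '0**1' ('*' is a don't care bit)."""
    slots = ["01" if c == "*" else c for c in pattern]
    return {int("".join(bits), 2) for bits in product(*slots)}


def components(nodes: set[int], n: int) -> list[set[int]]:
    """Connected components of the subgraph of the n-cube induced by `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for i in range(n):
                u = v ^ (1 << i)
                if u in remaining:
                    remaining.discard(u)
                    comp.add(u)
                    stack.append(u)
        comps.append(comp)
    return comps


safe = expand("0**1") | expand("1**0")
comps = components(safe, 4)
min_dist = min(bin(a ^ b).count("1") for a in comps[0] for b in comps[1])
print(len(comps), min_dist)
```

The two components differ in dimensions 0 and 3, so no safe node of one is adjacent to a safe node of the other, and the closest pairs are exactly two hops apart.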

THEOREM 2. For an n-dimensional hypercube with a given set of faulty nodes, if the safe set is not connected, it must be composed of exactly two distinct subsets of safe nodes, where each of them exactly forms a complete (n - 2)-dimensional subcube, and they are located in opposite quarters.

PROOF. Consider any two distinct subsets S_A and S_B of safe nodes which correspond to two different components. From Lemma 1, there exist safe nodes A ∈ S_A and B ∈ S_B such that H(A, B) = 2. Assume nodes A and B differ in dimensions i and j. Divide the hypercube into four (n - 2)-dimensional subcubes along dimensions i and j. Let these four subcubes be denoted by SQ_00, SQ_01, SQ_10, and SQ_11, respectively, where the two digits of the subscripts indicate the binary values of the ith and jth bits of the corresponding subcubes. Nodes A and B must be in two opposite subcubes; without loss of generality, assume A is in SQ_00 and B is in SQ_11. Since S_A and S_B are disconnected, neither of the two common neighbors, A' and B', of A and B is safe; otherwise S_A and S_B would be connected through the safe one(s). In addition, A' and B' must be contained, separately, in the other two (n - 2)-dimensional subcubes. Next we want to prove that S_A and S_B exactly form the complete SQ_00 and SQ_11, respectively. Now consider a neighboring node A'' of A such that A'' differs from A in dimension k, where k ≠ i and k ≠ j. Apparently, A'' is in SQ_00. As node A is safe and it already has two neighbors A' and B' that are both not safe, A'' must be a safe node; thus A'' ∈ S_A. Similarly, the neighbor B'' of B along dimension k is also a safe node in SQ_11 and B'' ∈ S_B. Note that the nodes A'' and B'' differ exactly in dimensions i and j. In addition, neither of the two common neighbors of A'' and B'' is safe, and these nodes are contained in SQ_10 and SQ_01 separately.

For any node X ∈ SQ_00, the addresses of nodes A and X are identical in dimensions i and j. Any shortest path from A to X would contain no link of either dimension i or dimension j, i.e., all the intermediate nodes on the path are in SQ_00. Applying the aforementioned argument sequentially to the nodes on any such path, one can conclude that node X ∈ SQ_00 is safe. In other words, all the nodes in SQ_00 are safe.

The same is true for SQ_11. On the other hand, none of the nodes in SQ_10 and SQ_01 can be safe; otherwise the assumption that the safe set is disconnected is violated. Obviously, SQ_00 and SQ_11 are located in opposite quarters of the hypercube. The above argument also rules out the possibility that a safe set can be composed of more than two disconnected components. Hence we prove the theorem.

Fig. 4 also shows a typical example for Theorem 2. Note that a connected safe set in a hypercube does not necessarily appear in the form of a complete subcube, as illustrated in Fig. 3.

A faulty hypercube is called fully unsafe if the corresponding safe set is empty, i.e., every node in the hypercube is either faulty or unsafe. A fully unsafe hypercube may encounter difficulties in routing messages.

Apparently, the structure of the safe set is dependent on the fault pattern existing in the system. Consider an n-dimensional hypercube in which there are n faulty nodes such that these faults happen to occur at the n neighbors of a nonfaulty node A. Obviously, node A is unsafe. If node B is of distance 2 from A, it has two common neighbors with node A, and these neighbors are both faulty according to the given fault pattern. Hence node B is unsafe. In other words, all the nodes that are of distance 2 from A are unsafe. As to those nodes that are of distance 3 from A, every one of them has three neighbors that are of distance 2 from node A. Since these neighbors are all unsafe, we have that all the nodes that are of distance 3 from A must be unsafe as well. Applying the same argument to other nodes that are of longer distances from A, one can easily conclude that the n-cube is fully unsafe. The following theorem summarizes this observation in a general way.

THEOREM 3. Given an n-dimensional hypercube, if a node A has k, 2 ≤ k ≤ n, faulty neighbors, then every nonfaulty node in the associated k-subcube, which includes node A and the k faulty neighbors, is unsafe.

In the following we further address the issue of the number of faulty nodes it takes to make a hypercube fully unsafe.

LEMMA 2. Consider a k-dimensional hypercube with some given set of faulty nodes. If this k-cube is fully unsafe with the fault pattern, the following statement is always true: Divide the k-cube into two (k - 1)-subcubes, say SQ_0 and SQ_1, along any dimension i, 0 ≤ i ≤ k - 1. Now construct another independent (k - 1)-dimensional hypercube such that if either of the two neighboring nodes across dimension i, one each in SQ_0 and SQ_1, is faulty, the corresponding node in the (k - 1)-cube is also faulty. Then the newly constructed (k - 1)-cube is fully unsafe by itself.

PROOF. Let us call the newly constructed (k - 1)-cube Q. Unsafe nodes in a hypercube can be classified into two categories. Those unsafe nodes having two or more faulty neighbors are called type 1. The rest of the unsafe nodes belong to type 2. Q can be visualized as simply overlapping SQ_0 and SQ_1 and marking a node faulty if either of its corresponding nodes in SQ_0 and SQ_1 is faulty. Now compare Q with SQ_0. If a node is faulty in SQ_0, its corresponding node in Q is faulty too. Consider an unsafe node of type 1 in SQ_0. If the opposite node in SQ_1 is faulty, the corresponding node in Q is obviously faulty. Otherwise, it must have two or more faulty neighbors in SQ_0; hence the corresponding node in Q is unsafe as well. The same argument applies to the comparison between Q and SQ_1. Consequently, the set of faulty or type-1 unsafe nodes in Q contains the union of the corresponding sets of such nodes in SQ_0 and SQ_1. Note that all the unsafe nodes of type 2 are generated iteratively out of the aforementioned set of nodes. Now consider an unsafe node X of type 2 in SQ_0 which is generated immediately as a result of the set of faulty or type-1 unsafe nodes and whose corresponding node in Q is neither faulty nor type-1 unsafe. The opposite node of X, in SQ_1, can be neither faulty nor type-1 unsafe. Hence, there are at least three faulty or type-1 unsafe neighbors of X in SQ_0; thus the corresponding node in Q must also be unsafe. This argument leads to the conclusion that the set of faulty or unsafe nodes in Q, regardless of the type, will after each iteration include the union of the corresponding sets of nodes in SQ_0 and SQ_1. In other words, if a node in SQ_0 or SQ_1 is unsafe as of type 2, its corresponding node in Q is also unsafe. Consequently, the newly constructed (k - 1)-cube Q is fully unsafe if the original k-cube is fully unsafe.
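The overlapping construction used in Lemma 2 amounts to deleting bit i from every faulty label, so that the two nodes facing each other across dimension i collapse onto one node of Q. The Python sketch below illustrates only this construction; the function name `overlap_faults` and the integer encoding of nodes are assumptions made for the example.

```python
from __future__ import annotations


def overlap_faults(faulty: set[int], i: int) -> set[int]:
    """Fault set of the (k - 1)-cube Q obtained by overlapping SQ_0 and SQ_1 along dimension i.

    A node of Q is faulty if either of the two nodes that map onto it is faulty.
    """
    low_mask = (1 << i) - 1  # bits below dimension i stay in place
    return {((v >> (i + 1)) << i) | (v & low_mask) for v in faulty}


# A 3-cube whose faults are the three neighbors of node (000), overlapped along dimension 0.
q_faults = overlap_faults({0b001, 0b010, 0b100}, 0)
print(sorted(format(v, "02b") for v in q_faults))
```

In this example the 3-cube is fully unsafe, and the resulting 2-cube Q has faulty nodes (00), (01), and (10); its only remaining node (11) has two faulty neighbors and is therefore unsafe, in agreement with the lemma.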

The following lemma is also required for the upcoming discussion.

LEMMA 3. If an n-cube with n faulty nodes is fully unsafe, then any two of the n faulty nodes cannot be nearest neighbors.

PROOF. We prove by induction. For n = 3 it is easy to see that the lemma is true. Assume the lemma is true for n = k. We want to prove that it is also true for n = k + 1. Now, by contradiction, let us assume that the lemma is untrue for n = k + 1, i.e., there exists a certain case where two of the k + 1 faults, say nodes A and B, in a fully unsafe (k + 1)-cube are nearest neighbors. Partition the (k + 1)-cube, along the dimension in which A and B differ, into two k-subcubes, called SQ_0 and SQ_1, where A is in SQ_0 and B is in SQ_1. Let F_0 and F_1 represent respectively the sets of faulty nodes in SQ_0 and SQ_1, including A and B. Construct a new k-cube Q by overlapping SQ_0 and SQ_1 along with their fault patterns. According to Lemma 2, Q is also fully unsafe.

Furthermore, the number of faulty nodes in Q is less than or equal to k as nodes A and B would merge into one single faulty node in Q. Note that, for any pair of nodes X ∈ F_0 and Y ∈ F_1, it must be true that H(X, Y) ≠ 2. This is because, if H(X, Y) = 2, the two faulty nodes in Q which correspond to X and Y would become nearest neighbors, thus contradicting the previous assumption for the case of n = k. Next, the cardinalities of F_0 and F_1 can be characterized as follows:

|F_0| + |F_1| = k + 1.

Now assume, without loss of generality, |F_0| = min(|F_0|, |F_1|). Construct another k-cube Q' in which the fault pattern is exactly identical to that of SQ_0.

Note that the number of faults in Q' is no more than ⌈k/2⌉. Hence, if, instead, the criterion used in [15] for labeling a nonfaulty node as unsafe is used in Q', Q' is not fully unsafe as per our proof in [7]. Now let us return to SQ_0. Because the distance between any pair of nodes X ∈ F_0 and Y ∈ F_1 cannot be 2, there is no type-1 unsafe node in SQ_0 which is due to two faulty neighbors, one in F_0 and another in F_1. As a result, the unsafe nodes of type 1 in SQ_0 must be identical to those in Q'. Now consider type-2 unsafe nodes in SQ_0.

According to our definition, it takes one more unsafe or faulty neighbor to mark a nonfaulty node type-2 unsafe in SQ_0 than it does in Q'. Therefore, it is easy to see that, if a node is labeled as type-2 unsafe in SQ_0, the corresponding node in Q' must also be marked unsafe after each iteration of labeling. In other words, if a node is safe in Q', the corresponding node in SQ_0 must also be safe. Since Q' is not fully unsafe, the nodes in SQ_0 cannot possibly be fully unsafe. However, this contradicts the assumption that the (k + 1)-cube is fully unsafe. Hence we prove the lemma.

Now we are set to present the following theorem.

THEOREM 4. If the number of faulty nodes in an n-cube is no greater than n - 1, then the n-cube cannot become fully unsafe.

PROOF. We prove by induction. For n = 3 it is easy to see that the theorem is true. Assume the theorem is true for n = k. We are to prove that it is also true for n = k + 1. Now, by contradiction, let us assume that the theorem is untrue for the case of n = k + 1, i.e., some fault pattern with k faults can make the (k + 1)-cube fully unsafe. Partition the (k + 1)-cube along an arbitrary dimension into two k-subcubes, SQ_0 and SQ_1. The sets of faulty nodes F_0 and F_1 are defined identically to those in the previous lemma. Note that neither F_0 nor F_1 can be empty. For an arbitrary pair of nodes X ∈ F_0 and Y ∈ F_1, X and Y cannot be nearest neighbors, i.e., H(X, Y) ≠ 1; otherwise one can construct a fully unsafe k-cube with k - 1 faulty nodes by overlapping SQ_0 and SQ_1, as X and Y would merge into one single faulty node, which contradicts the case of n = k. In addition, if H(X, Y) = 2, the two corresponding faulty nodes in the aforementioned fully unsafe k-cube would be nearest neighbors, and this violates Lemma 3. In fact, the foregoing arguments can be applied to any pair of the k faulty nodes. Hence, we have H(A, B) ≥ 3 for any pair of faulty nodes A and B. Consequently, there is no type-1 unsafe node generated in the (k + 1)-cube; therefore the (k + 1)-cube cannot be fully unsafe. However, this contradicts the previous assumption that the (k + 1)-cube is fully unsafe. Thus the theorem is proved.

Theorem 4 provides a lower bound on the number of faulty nodes for which a hypercube can become fully unsafe as per our criteria. Although n faults can make an n-cube fully unsafe, this occurs only in very limited cases even when the size of the hypercube is small. Our analysis of many cases shows that a fully unsafe situation can happen only when the n faults occur at the n neighbors of some nonfaulty node for n ≥ 6. In this case, the given nonfaulty node is isolated anyway; routing among other nodes is hardly affected.
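For small cubes, both the fully unsafe pattern described before Theorem 3 and the bound of Theorem 4 can be checked by brute force. The Python sketch below is an illustrative check under Definition 1, not a substitute for the proof; the helper names `unsafe_set`, `fully_unsafe`, and `any_fully_unsafe` are hypothetical, and the exhaustive search is restricted to n = 3 and n = 4 to keep the run time negligible.

```python
from itertools import combinations


def unsafe_set(n, faulty):
    """Unsafe nodes of an n-cube under Definition 1, computed as a fixed point."""
    unsafe, changed = set(), True
    while changed:
        changed = False
        for v in range(1 << n):
            if v in faulty or v in unsafe:
                continue
            nbrs = [v ^ (1 << i) for i in range(n)]
            l_f = sum(u in faulty for u in nbrs)
            l_u = sum(u in unsafe for u in nbrs)
            if l_f >= 2 or l_f + l_u >= 3:
                unsafe.add(v)
                changed = True
    return unsafe


def fully_unsafe(n, faulty):
    """True if every node of the n-cube is either faulty or unsafe."""
    return len(faulty) + len(unsafe_set(n, faulty)) == 1 << n


def any_fully_unsafe(n, max_faults):
    """True if some pattern of 1..max_faults faulty nodes makes the n-cube fully unsafe."""
    return any(
        fully_unsafe(n, set(pattern))
        for f in range(1, max_faults + 1)
        for pattern in combinations(range(1 << n), f)
    )


for n in (3, 4, 5):
    around_zero = {1 << i for i in range(n)}  # the n neighbors of node 0
    print(n, fully_unsafe(n, around_zero))

for n in (3, 4):
    print(n, any_fully_unsafe(n, n - 1))
```

The first loop places the n faults on the n neighbors of node 0, the pattern discussed in the text; the second loop searches every pattern with at most n - 1 faults, for which Theorem 4 predicts that no fully unsafe cube exists.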

3.3 Fault-Tolerant Routing Algorithm

We now use state information of the nodes in a hypercube to assist routing of messages. The routing strategy is based on each node having knowledge of the states of its nearest neighbors.

Before proceeding, we further classify the unsafe nodes according to the degree of "unsafeness" to facilitate the design of a fault-tolerant routing algorithm. The following definition follows.

DEFINITION 2. An unsafe node is defined as strongly unsafe if none of its nearest neighbors is safe. An unsafe, but not strongly unsafe, node is called an ordinarily unsafe node.

An ordinarily unsafe node has at least one safe nearest neighbor. Apparently, it is more difficult to find a "good" next node for a message to advance at a strongly unsafe node. Consequently, a node can be in one of the four states: safe, ordinarily unsafe, strongly unsafe, and faulty. The fundamental idea of the routing strategy is to avoid forwarding messages to a more severely unsafe node, if possible, to avoid potential pitfalls. The algorithm Iden-Unsafe can be easily modified to identify strongly unsafe nodes.
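The four node states of Definition 2 can be derived once the faulty and unsafe sets are known. The Python sketch below is a centralized illustration rather than the modified Iden-Unsafe algorithm itself; the function name `node_states` is hypothetical, and the unsafe set is hard-coded as the result of applying Definition 1 to the fault pattern of Fig. 1.

```python
def node_states(n, faulty, unsafe):
    """Map every node of an n-cube to one of the four states of Definition 2."""
    states = {}
    for v in range(1 << n):
        if v in faulty:
            states[v] = "faulty"
        elif v not in unsafe:
            states[v] = "safe"
        else:
            nbrs = [v ^ (1 << i) for i in range(n)]
            has_safe = any(u not in faulty and u not in unsafe for u in nbrs)
            states[v] = "ordinarily unsafe" if has_safe else "strongly unsafe"
    return states


# Assumed inputs: fault pattern of Fig. 1 and the unsafe set Definition 1 yields for it.
faulty = {0b0010, 0b0100, 0b1000, 0b1111}
unsafe = {0b0000, 0b0110, 0b1010, 0b1100, 0b1110}
states = node_states(4, faulty, unsafe)
for v in sorted(unsafe):
    print(format(v, "04b"), states[v])
```

In this example only (1110) is strongly unsafe, since its four neighbors are (1111), which is faulty, and the unsafe nodes (0110), (1010), and (1100); every other unsafe node has at least one safe neighbor.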

Each node in the hypercube contains four lists, namely FAULT, SAFE, UNSAFE, and S-UNSAFE, which maintain its faulty, safe, ordinarily unsafe, and strongly unsafe nearest neighbors, respectively. Let sour and dest denote the source and destination nodes of a message respectively.