An Algorithm for Detecting Z-cycles in Distributed Computing System

全文

(1)Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. An Algorithm for Detecting Z-cycles in Distributed Computing System Chin-Lin Kuo and Yuo-Ming Yeh Fault Tolerance Lab. of National Taiwan Normal University { gene, ymyeh }@ice.ntnu.edu.tw. Abstract- The checkpointing approach of rollbackrecovery has been widely used for fault-tolerance in distributed computing system. There are many communication messages resulting in much dependency during the time of program running. Once a process generates faults, many processes that are directly or indirectly related with the faulting process will be influenced. These processes in turn rollback to some previously stored state, respectively. What’s worse, the rollback action may repeatedly trigger another rollback action of other dependent processes. This is what we know as the domino effect[11]. The main cause of generating domino effect is Z-cycles[2]. So far there is no effective method to detect Z-cycles with length more than two. In this paper, we propose a distributed algorithm to detect Z-cycles with long length. Keywords : fault tolerance , checkpoints , domino effect , Z-cycles , rollback-recovery.. 1. Introduction In distributed computing system, checkpointing and rollback-recovery[17] is an important mechanism for fault tolerance. A checkpoint is a stable memory record of a process state. Each process could take a checkpoint whenever process favors. The simplest solution for a process to achieve this is to take a checkpoint periodically and it will work efficiently in only one processor. But in messaging passing system with many processors, such an action are likely to generate domino effect and waste much time and computation for rollback-recovery. Every process takes checkpoints independently without considering other processes. Although this uncoordinated checkpoint method is easily implemented and allows each process to flexibly take checkpoints, it must pay much overhead, such as rollback extent, complex recovery and garbage collection. A consistent global recovery line is a set of checkpoints, one per process, which form a recovery line. When there are faults happening on a process or processes, the process or processes in question immediately launch the rollback-recovery mechanism. If there is no valid recovery line, this action may repeatedly trigger another rollback action of other dependent processes, and the rollback distance may be unbounded. and unpredictable. Many processes may have to rollback to their own initial state. This is what we call ”domino effect”, the worst case we would not like to encounter. In order to determine a consistent global checkpoint, the processes have to record the dependencies relation among their checkpoints during failurefree operation. However, processes cannot determine whether or not specific checkpoints are part of a consistent state. One of the most serious problems in uncoordinated checkpointing is useless checkpoints. The processes may easily take useless checkpoints which are never part of any global consistent recovery line. Useless checkpoints are undesirable and waste much stable storage space. So applications with frequent output commits are not suitable since they could easily form many orphan messages between two checkpoints taken by two different processes and dependency relation between the states of different processes. Dependency between many processes may be occurred by message communication and there have been many papers[9,12,13] discussed about it. Another disadvantage is that determining a consistent state may be laborious and the rollback mechanism will become more complicated. Therefore most research is concentrated on coordinated checkpointing[14,15] and communication-induced checkpointing[4] schemes. Communication-induced protocols reserves Zcycle-free property by inserting forced checkpoints based on communication events. Hence, minimizing the number of forced checkpoints is becoming the most important topic. The main cause of generating domino effect is attributed to Z-cycles. So far, detecting Z-cycles with long length in distributed computing system is still a difficult problem. In Taesoon Park and Heon Y.Yeon’s paper[3], they propose an scheme of detecting Z-cycles with length two and of taking forced checkpoints to break them under many special communication patterns. In this paper, we propose an distributed algorithm to detect all Z-cycles with long length and their involved checkpoints.. 2. System Model and Background A distributed computation consists of a finite set P of n processes {P1 , P2 , · · · , Pn } that interact by. 1124.

(2) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. means of messages sent over channels which transmission times are unpredictable but finite. Processes do not share any common memory and a common clock value, that is, they are asynchronous. The communication pattern among these processes in P could be arbitrary and the communication channel between two processes is reliable, FIFO(first-in-first-out) and bidirectional(undirectional). Execution of a process produces a sequence of events which can be classified as: send events, receive events, and internal events. An internal statement does not involve communication. The casual ordering of events in a distributed execution is based on Lamport’s hb happened-before relation[1] denoted by ” →”. A process may fail, lose its volatile state and stop execution according to the fail-stop model[16]. A local checkpoint records the current process state on stable storage. The k-th checkpoint in process P i is denoted as Ci,k , where k is an non-negative integer and we assume that each process P i takes an initial checkpoint Ci,0 immediately before execution begins. Let Ii,α denote the interval between the consecutive checkpoints Ci,α−1 and Ci,α where α = 1, 2, 3, · · · . In this paper, we assume each process only take local checkpoints at its own pace (for example, using a periodic algorithm) without taking forced checkpoints. A message m sent by Pi to Pj is called an orphan with respect to a pair (Ci,xi , Cj,xj ) iff its receive event happened before C j,xj while its send event happened after Ci,xi . A global consistent checkpoint C is a set of local checkpoints (C 1,x1 , C2,x2 , . . . , Cn,xn ) which no orphan messages exists in any pair of local checkpoints belonging to C. The processes are said to rollback to the consistent recovery line if there is no orphan interval after the rollback-recovery. Sometimes, the processes have to rollback recursively to reach a consistent recovery line due to the domino effect and the rollback distance may be unbounded. In the worst case, the only consistent recovery line consists of a set of the initial checkpoints, that is, the total loss of the computation in spite of checkpointing efforts. So there are many papers talking about how to prevent dominoeffect[5] or useless checkpoints[6,7].. 3. Z-cycle Definition and Properties First, we recall the Z-path definition introduced by Netzer and Xu[2]. Definition 1: A Z-path exists from Ci,x to Cj,y iff there are messages m1 , m2 , · · · , m , ( ≥ 1) such that : 1. m1 is sent by process Pi after Ci,x 2. if mk (1 ≤ k < ) is received by process Pr , then mk+1 is sent by Pr in the same or a later checkpoint interval (although m k+1 may be sent before or after mk is received).. 3. m is received by process Pj before Cj,y . Definition 2: If there is a Z-path from C i,x to itself , then this is a Z-cycle which the checkpoint C i,x is involved. Assertion 1: The length of a Z-cycle(or Z-path) is if the Z-cycle(or Z-path) is formed by messages m1 , m2 , · · · , m . Consider some process P i in a Z-cycle. Suppose that message m and m are consecutive two messages contained in this Z-cycle, and message m is received by P i and message m is sent by the same hb process Pi . If receive(m) → send(m ), we say the interval between receive(m) and send(m ) on Pi in this Z-cycle is casual. On the other hand, if hb send(m ) → receive(m), we say the interval between them is non-casual and they must occur in the same checkpoint interval to satisfy the definition of Z-cycle. For example, consider figure1. The Zcycle is consisted of 4 messages m 1 , m2 , m3 , m4 . On hb P1 , receive(m4 ) → send(m1 ) so the interval is cahb. sual. But on P3 , send(m3 ) → receive(m2 ) and the two events occur at the same checkpoint interval so it’s a non-casual situation. For a Z-cycle, associated with a sequence of messages m 1 , m2 , . . . , m , its length is and has intervals( ≥ 2). By definition of Z-cycle, we can obtain that the interval between any two events receive(m i ) and send(mi+1 ) for 1 ≤ i ≤ − 1 has to be either a casual or non-casual interval. But there must be at least one of these intervals to be a non-casual interval[10]. In addition, the interval between receive(m ) and send(m1 ) must be a casual interval and the checkpoints between them are involved in this Z-cycle. C1,0 P1 P2 P3 P4. C1,1. m4. m1 C 2,2. ~. m3. m2. ^. ^. Figure 1. Assertion 2: For a Z-cycle, there may be more than one checkpoint involved in this Z-cycle and these checkpoints may be distributed in one or more processes. Obviously, the length of a Z-cycle must be at least two. In this condition, Z-cycles with length two are easy to be detected and destroied[3]. Figure 1 illustrates an example of Z-cycle with length 4 and the checkpoints C1,1 and C2,2 are involved in it. Intuitively, the longer Z-cycle is, the more difficult it can be detected and broken. According to Netzer and Xu’s theorem, a checkpoint is said to be useless if it is involved in a Z-cycle[2], that is, it can not be included in any consistent recovery line.. 1125.

(3) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. 4. Detecting Z-cycles Algorithm. a Z-path, represented by [ i , j , k ]. 4.1. The notation and data structures. proof : These two messages m 1 , m2 , denoted by [ i , j ] and [ j , k ] respectively ,where. ,α β,γ θ,. ,α β,. A Z-cycle is formed by a Z-path while starting with a checkpoint and terminating at the same checkpoint. From the global view of all processes, Wang[8,9] defines a graph called the rollback − dependency graph (or R − graph) which shows Z-paths in a distributed computation that has terminated or stopped execution. It is easy to find Z-paths from such a graph. In distributed algorithm, each process only has its local memory and knows the (send and receive) events relative to itself but does not know other messages’ transmission in other processes. Hence a process may not have ability to accumulate sufficient information of message transmission to concatenate them into Zpaths without piggybacked information. So the most critical problem to detect Z-cycles is how to collect necessary messages m1 , m2 , · · · , m which may have any possibility of forming a Z-cycle. First, we have to conceptualize an appropriate data structure to express Z-path and Z-cycle. For a single message m, its important four characteristics are the two processes which send, receive m and the two checkpoint intervals while the sending, receiving events occurring. There are totally four natural numbers, send P id, c out on process send-Pid, receive P id, and cin on process receive-Pid to describe the message m. For example, if there is a message m which was sent by process P i in checkpoint interval Ii,α and received by process P j in Ij,β , then send P id = i, receive P id = j, cout in Pi is α and cin in Pj is β. We use the symbol [ i , j ] to ex,α β,. press m. The lower-left of i and lower-right of j mean a checkpoint interval number of a message delivery event in P i and a checkpoint interval number of another message sending event occurring in P j respectively. These two s are written out for the purpose of connecting messages to form a Z-path. Notation : A message m which is sent by P i in Ii,α and received by P j in Ij,β is denoted by [ i , j ]. ,α β,. The symbol means unknown or not occurred yet and α, β are natural numbers. This notation of a single message can completely express relative information in a Z-path and from that we can only pay attention to the notation instead of R − graph. Lemma 1: For a process Pj , if there are two messages m1 , m2 , which are denoted by [ i , j ] and [ j , k ] ,α β,. ,γ θ,. respectively ,where α, β, γ, θ ∈ N (natural number), then we check whether β ≤ γ. If β ≤ γ holds, then the second condition of Z-path’s definition is satisfied and so we can merge(connect) these two messages into. ,γ θ,. α, β, γ, θ ∈ N , mean that Pi sends m1 in Ii,α to Pj in Ij,β and Pj sends m2 in Ij,γ to Pk in Ik,θ . When β = γ , it means m1 is received by Pj and m2 is sent by Pj in the same checkpoint interval no hb. hb. matter receive(m1 ) → send(m2 ) or send(m2 ) → receive(m1 ). The interval between the two events probably could be casual or non-casual. When β < γ, hb it means receive(m1 ) → send(m2 ) and send(m2 ) occurs in a later checkpoint interval. So by the second condition of Z-path’s definition, if one of the above two conditions(β = γ or β < γ) holds, then m 1 and m2 could be merged into a Z-path [ i , j , k ]. But ,α β,γ θ,. if β > γ, m1 and m2 could not be merged since these two checkpoint intervals I j,β and Ij,γ contradict the definition 2 of Z-path. From above discussion, the length of a Z-path can gradually increase by merging messages one by one or merging other Z-paths. Contrarily, a Z-path [· · · , i , j , k , · · · ] could be decomposed into two ···,α β,γ θ,···. Z-paths, [· · · , i , j ] and [ j , k , · · · ]. The rules ···,α β,. ,γ θ,···. of merging two Z-paths path1 and path2 are to check (1) whether the last P id of path1 is equal to the first P id of path2 and (2) whether the c in of the last P id of path1 is equal to or smaller than the c out of the first P id of path2. If satisfied, then these two Z-paths could be merged into a single Z-path [· · · , i , j , k ,· · · ]. We use notation [ 1 , 2 , 3 , ···,α β,γ θ,···. ,b1 a2 ,b2 a3 ,b3. · · · , n , k ] to express a Z-path from checkpoint an ,bn ak ,. C1,b1 −1 to Ck,ak . Certainly ai , bi are natural numbers and the relation a i ≤ bi must holds for every process. If k = 1 and a k ≤ b1 then Z-cycle ( 1 , 2 , 3 , · · · , n ) forms. a1 ,b1 a2 ,b2 a3 ,b3. an ,bn. The length of a Z-path is not fixed, so for data structure representation, the way of utilizing queue can appropriately express the meaning of Z-path. Each element of the queue has three integers P id, c in and c out, where P id ∈ {1, 2, . . . , n} means process ID and c in,c out means the checkpoint interval IP id,c in of the receive event and the checkpoint interval IP id,c out of the send event on the same process P id respectively. Assertion 3:The data structure ”queue of Z-path” we define can appropriately express the meaning of Zpath. Assertion 4:For a Z-path [· · · , i , · · · ], where α ≤ α,β. β, α and β means the checkpoint interval I i,α of event receive(ms ) and Ii,β of event send(mt ) respectively for some s, t ∈ N . If α = β, then these. 1126.

(4) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. two events, receive(ms ) and send(mt ), occur in the same checkpoint interval. If α < β, then there are β − α checkpoints Ci,α , Ci,α+1 , · · · , Ci,β−1 between receive(ms ) and send(mt ). For the case α < β, if the Z-path can form a Z-cycle in the future, then the checkpoints Ci,α , Ci,α+1 , · · · , Ci,β−1 are involved in this Z-cycle. For example, in figure 1 there is a Z-cycle ( 4 , 1 , 2 , 3 ) in which checkpoints {C 1,1 , C2,2 } are 1,1 1,2 2,3 1,1. involved. The following paragraph lists the notations and data structures used in our algorithm. There are n processes and for each process P i it has • ci : an integer and a logical counter which means current checkpoint interval index between two consecutive checkpoints and its initial value is 1. Pi. lci = 1. lci = 2 Ci,1. Ci,0. lci = 3 Ci,2. Figure 2. • Z Queuei : A queue which each element of it is still a queue z path containing Z-path information, for example [ 4 , 2 , 1 ]. In a node of z path, ,2 2,2 1,. there are three integers which mean process’s id P id and its two subscripts below, c in and c out. If one of them are , it means unknown, which could only appear at the c in of the first P id and the c out of the last P id in a Z-path. The [ 1 , 2 , 3 , . . .] means Z-path from process P 1 ,··· ···,··· ···,···. Pj that Pi knows. The value of csn i [i] is always equal to (ci − 1). Its initial value is [0, 0, · · · , 0]. • Z cyclei : An Z-cycle list which each element stores a Z-cycle. Initial values are none 4.2. The algorithm We distinguish two kinds of messages: computation messages and system messages. Computation messages are sent for their application purposes. In our protocol there are two kinds of system message, ”z-path request ” and ”z-path reply ”. This algorithm mainly adopt piggyback approach and request Z-paths from other processes to accumulate sufficient information. Then process merges its own Z-paths with them to check whether Z-cycles form or not. Not every time P i has to send z-path request to collect another process’s Z-paths. When there were sending events occurred after the latest checkpoint in P i and the Pi receives a computation message (non-casual), P i needs to do so. By the definition of Z-cycle formed by m 1 , m2 , · · · , m , the checkpoint interval between m 1 and m must be casual and there must exist at least one non-casual interval[10] in a Z-cycle. For our algorithm, the less number of non-casual intervals, the more efficient performance we have. So there are briefly three different cases of Z-cycles(best, worst, average). The figures 4,5 and 1 illustrate the three situations.. to P2 , P3 , · · · · · · . The c in is smaller or equal to the c out. Maybe there are many Z-paths included in the Z Queue i. Its structure is as the following figure and its initial value is null.. P1 P2. ?. 1. ,3. ?. 5. ,1. .. .. i - 1,2. m1. ^. m2. N. P3 m5 P4. Z Quenei. ? 4 i - 2,2 z path- ,2. C1,x. . m3. N. 1 - 1, 1 - 2,3. 5 - 3,. i - 2,. Figure 3. • Z Queue buf f er1i : A Z-path queue buffer which stores the Z-path queue piggybacked from other processes and is used to merge them with its own Z Queuei . • Z Queue buf f er2i : A Z-path queue buffer which also stores a queue of Z-paths. If P i needs to send z-path request message to other processes, then Pi must wait to receive for replying z-path s from them and store these z paths into Z Queue buf f er2i . • csni : checkpoint line which is an array of n checkpoint sequence numbers(csn) and csn i [j] represents the largest checkpoint sequence number of. m4. U. P5. Figure 4 : best case For the best case like the above figure 4, there is only one non-casual interval(between m 5 and m4 ) in the Z-cycle. When P 5 receives m4 , it checks there is a computation message m 5 sent to P1 in the current checkpoint interval. So, P 5 must send a z-path request message to P 1 for obtaining [ 5 , 1 , · · · ]. In the best case, most of these in,··· ···,···. tervals are casual and only few processes need to send z-path request for more Z-path information.. 1127. C1,x P1 P2 P3 m5 P4 P5. . m1. ^. m2. N. m3. N. m4. U. Figure 5 : worst case.

(5) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. But in the worst case like figure 5, most of the intervals are non-casual. Hence most processes(P 5 , P4 , P3 , P2 ) have to send z-path request to other processes for more Z-path information. The performance would be decreased. Each time of computation message-passing occurring the message must bring many Z-paths data, which may be tremendous, to target process and then the target process connects these received Z-paths data with its own. There will generate many new Z-paths in the connecting action and Z-cycle(s) will be detected. The following part is the explanation of our algorithm. Sending a computation message: P i sends a computation message to P j . Let the computation message be denoted by [ i , j ]. For each Z-path in ,ci ,. Z Queuei we only duplicate the front part of the Zpath, [· · · · · · , i ], for some α, to merge with [ i , ,ci. α,. j ].Then Pi obtains a new Z-path [· · · · · · , i , j ], α,ci ,. ,. where α ≤ ci . There probably are many such new Z-paths and all of them piggyback the computation message forwarding to P j . Reception of a computation message and piggybacked information: When P i receives a computation message M and piggybacked information(Zpaths) from Pk , each of them as [· · · , k , i ],. is larger than csni [P id]. When Pi receives a z-path request ([ q , i ]) ,cq ,. from Pq : If Pi receives such z-path request and its parameter [ q , i ], it means that there was a com,cq ,. putation message sent by P q to Pi . But Pq doesn’t know the checkpoint interval index of the computation message arrived at Pi . For Pi there must be a Z-path [· · · , q , i , · · · ] in Z Queuei , for some α, β. We ···,cq α,β. duplicate the back part, [ q , i , · · · ] and reply them ,cq α,β. for Pq . After collecting such Z-paths, [ q , i , · · · ], ,cq α,β. Pq can connect them with its own Z-paths, [· · · , q ]. ···,. So Pq can check whether Z-cycles form or not. We demonstrate our algorithm by an example. Example : In this example figure 6, there are totally two Z-cycles,{m3, m5 , m1 } and {m4 , m3 , m5 , m2 }. The checkpoints involved are {C 2,1 , C3,2 } and {C1,2 , C3,2 } respectively. So we can observe that messages m3 and m5 are associated with these two Z-cycles simultaneously. C1,1. P1 P2. ···,··· ,. P3. the first step Pi must do is to write ci into them, [· · · , k , i ]. Pi can update csn i by these piggy-. ···,ci ,. computation message sending from P i to Pj in the current checkpoint interval of index c i , then Pi has to send a z-path request for P j in order to obtain sufficient information of Z-path as [ i , j , · · · ]. Then Pi can connect [· · · , k , · · · ] into [· · · , k ,. ,ci ···,···. i , j ] with [ i , j ,. ···,··· ···,ci ,. ,ci ···,···. i , j , · · · ]. If there is any. ···,··· ···,ci ···,···. Z-cycle formed due to the message [ k ,. i ], then. ,··· ci ,. we can detect the Z-cycle containing it. Procedure PruneZ-path(csni ,Z Queuei ):The data of csni in Pi means the checkpoint line that P i already knows. When the csn i is updated, Pi checks each Z-path in Z Queue i whether its c out of first P id is equal to or smaller than csn i [P id]. That is, the event send(m) of the first message m in the Zpath occurred before checkpoint C P id,csni [P id] , the left side of the checkpoint line csn i . If so, it implies that there could not be any messages received by P P id at that checkpoint interval in the future. Then the first message of the Z-path should be deleted. Repeat such pruning action till the c out of first P id in this Z-path. . N m4. C2,1 m3. m1 m2. W. P4. ···,··· ci ,. backed Z-paths. That is, P i can move checkpoint line forward to the latest checkpoint index which P i can know. After updating csn i , Pi can also prune these piggybacked Z-paths. In Z Queue i if there exists Z-paths like [· · · , i , j ], which means there is a. C3,1. C1,2. C3,2. m5. U. Figure 6. We illustrate this example by the order of messages occurring time and present the csn and Z Queue data of Z-paths for all processes at the time of sending, receiving and checkpointing. The concatenation of two Z-paths is expressed by path 1 + path2 ⇒ · · · . send(m1 ) : P1 : csn1 : (0000) ; empty P2 : csn2 : (0000) ; empty P3 : csn3 : (0000) ; empty P4 : csn4 : (0000) ; [ 4 , 2 ] ,1 ,. receive(m1 ): [ 4 , 2 ] piggybacked to P 2 ,1 1,. P1 : csn1 : (0000) ; empty P2 : csn2 : (0000) ; [ 4 , 2 ] ,1 1,. P3 : csn3 : (0000) ; empty P4 : csn4 : (0000) ; [ 4 , 2 ] ,1 ,. P3 takes C3,1 , csn3 : (0010); empty P1 takes C1,1 , csn1 : (1000); empty send(m2 ): P1 : csn1 : (1000) ; empty P2 : csn2 : (0000) ; [ 4 , 2 ] ,1 1,. P3 : csn3 : (0010) ; empty P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ] ,1 ,. ,1 ,. receive(m2 ): [ 4 , 1 ] piggybacked to P 1. 1128. ,1 2,.

(6) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. P1 : csn1 : (1000) ; [ 4 , 1 ]. receive(m5 ): [ 4 , 2 , 3 , 4 ] piggybacked to P 4. P2 : csn2 : (0000) ; [ 4 , 2 ]. P1 :csn1 : (2000);[ 4 , 1 , 2 ]. P3 : csn3 : (0010) ; empty P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ]. P2 :csn2 : (2100);[ 4 , 1 , 2 , 3 ] and [ 4 , 2 , 3 ]. ,1 1,2 2,3 1,. ,1 2,. ,1 2,3 ,. ,1 1,. ,1 ,. ,1 2,3 2,2 2,. P3 : csn3 : (0120) ; [ 4 , 2 , 3 , 4 ]. ,1 ,. ,1 1,2 2,3 ,. P2 takes C2,1 , csn2 : (0100) ; [ 4 , 2 ] send(m3 ): P1 : csn1 : (1000) ; [ 4 , 1 ]. ,1 1,2 2,. P4 : Update csn4 : (0120) ; [ 4 , 2 , 3 , 4 ]+[ 4 , 2 ]. ,1 1,. ,1 1,2 2,3 1,. ⇒[ 4 , 2 , 3 , 4 , 2 ]. ,1 ,. ,1 1,2 2,3 1,1 ,. ,1 2,. P2 : csn2 : (0100) ; [ 4 , 2 ] + [ 2 , 3 ] ⇒ ,1 1,. [ 4 , 2 , 3 ] piggybacked to P 3. ,2 ,. ,1 1,2 2,3 1,. ,1 ,. ,1 1,2 2,3 1,1 ,. Since there are [· · · , 2 ] and [· · · , 1 ], P4 sends ,. ,1 1,2 ,. ,. request([ 4 , 2 ]) to P2 to get [ 4 , 2 , 3 ].. P3 : csn3 : (0010) ; empty P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ] ,1 ,. [ 4 , 2 , 3 , 4 ]+[ 4 , 1 ]⇒[ 4 , 2 , 3 , 4 , 1 ]. ,1 1,2 2,. ,1 ,. P4. ,1 ,. P1 takes C1,2 , csn1 : (2000) ; [ 4 , 1 ]. sends. request([ 4 , 1 ]) ,1 ,. to. P1 .When. P1 receives the request,P1 plans to reply [ 4 , 1 , 2 ].But there is [· · · , 2 ], P1 has to send. ,1 2,. receive(m3 ): [ 4 , 2 , 3 ] piggybacked to P 3. ,1 2,3 ,. ,1 1,2 2,. ,. request([ 1 , 2 ]) to P2 to get [ 1 , 2 , 3 ]. In P1 ,. P1 : csn1 : (2000) ; [ 4 , 1 ]. ,3 2,2 2,. ,3 ,. ,1 2,. [ 4 , 1 , 2 ]+[ 1 , 2 , 3 ]⇒[ 4 , 1 , 2 , 3 ]. So P1. ,1 1,2 ,. replies [ 4 , 1 , 2 , 3 ] for P4 ’s request.. P2 : csn2 : (0100) ; [ 4 , 2 , 3 ]. ,1 2,3 ,. P3 : Update csn3 : (0110) ; [ 4 , 2 , 3 ]. ,3 2,2 2,. ,1 2,3 2,2 2,. ,1 2,3 2,2 2,. ,1 1,2 2,. P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ]. Then P4 has the following action : [ 4 , 2 , 3 , 4 , 2 ]+[ 4 , 2 , 3 ]. send(m4 ): P1 : csn1 : (2000) ; [ 4 , 1 ] + [ 1 , 2 ] ⇒. ⇒[ 4 , 2, 3, 4, 2, 3 ]. ,1 ,. ,1 ,. ,1 2,. [ 4 , 1 , 2 ] which piggybacks m 4. ,3 ,. ,1 1,2 2,3 1,1 ,. ,1 1,2 2,3 1,1 1,2 2,. [ 4 , 2 , 3 , 4 , 1 ]+[ 4 , 1 , 2 , 3 ] ,1 1,2 2,3 1,1 ,. ,1 2,3 ,. ⇒. P2 : csn2 : (0100) ; [ 4 , 2 , 3 ]. ,1 1,2 2,. ,1 2,3 2,2 2,. [ 4 , 2 , 3 , 4 , 1 , 2 , 3 ].. So Z-cycles. ,1 1,2 2,3 1,1 2,3 2,2 2,. ,1 1,2 ,. ( 4 , 2 , 3 ) ,( 2 , 3 , 4 , 1 ) will be detected and in-. ,1 1,2 2,. volved checkpoints are {C 2,1 , C3,2 }, {C3,2 , C1,2 } respectively.. P3 : csn3 : (0110) ; [ 4 , 2 , 3 ]. 1,1 1,2 2,3. P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ] ,1 ,. ,1 ,. 2,2 2,3 1,1 2,3. receive(m4 ): [ 4 , 1 , 2 ] piggybacked to P 2 ,1 2,3 2,. 5. Proof of correctness. P1 : csn1 : (2000) ; [ 4 , 1 , 2 ] ,1 2,3 ,. P2 : Update csn2 : (2100) ; sends request([ 2 , 3 ]) ,2 ,. to P3 to get [ 2 , 3 ]. ,2 2,. For Z-cycle detection algorithm, the crucial question is that a process should accumulate necessary and sufficient information of messages passing and merge these data to check Z-cycle. proof : Without losing generality, we assume there is a Z-cycle associated with a sequence of messages m1 , m2 , . . . , m and the representation of the Z-cycle is [ 1 , 2 ,· · · , , 1 ], where ≥ 2. We prove. So [ 4 , 1 , 2 ] + [ 2 , 3 ] ⇒ [ 4 , 1 , 2 , 3 ] ,1 2,3 2,. ,1 2,3 2,2 2,. ,2 2,. and [ 4 , 2 , 3 ] + [ 2 , 3 ] ⇒ [ 4 , 2 , 3 ] ,1 1,2 ,. ,2 2,. P3 : csn3 : (0110) ; [ 4 , 2 , 3 ]. ,1 1,2 2,. ,1 1,2 2,. P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ] ,1 ,. ,1 ,. P3 takes C3,2 , csn3 : (0120). [ 4 , 2 , 3 ] ,1 1,2 2,. ,b1 a2 ,b2. send(m5 ): P1 : csn1 : (2000) ; [ 4 , 1 , 2 ] ,1 2,3 ,. ,1 2,3 2,2 2,. P1. ,1 1,2 2,. P2. P3 : csn3 : (0120) ; [ 4 , 2 , 3 ] + [ 3 , 4 ] ⇒ ,1 1,2 2,. [ 4 , 2 , 3 , 4 ] which piggybacks m 5. ,3 ,. ,1 1,2 2,3 ,. P4 : csn4 : (0000) ; [ 4 , 2 ] and [ 4 , 1 ] ,1 ,. a ,b a1 ,. this theorem by induction on the length of Z-cycle. When = 2, the figure of such Z-cycle is as figure 7.. P2 : csn2 : (2100) ; [ 4 , 1 , 2 , 3 ] and [ 4 , 2 , 3 ]. 5.1. Theorem : Our algorithm can detect all Z-cycles in distributed computing system.. c1 = qC1,q c = q + 1 1. . m2. c2 = α. m1. ^. Figure 7. For P2 , when m2 is sent to P1 at I2,α , m2 = [ 2 , ,α. ,1 ,. 1129.

(7) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. 1 ] is placed in Z Queue2. Till P1 receives m2. ,. from P2 , m2 is piggybacked to P 1 and P1 can fill c1 = q value into m2 , that is, m2 = [ 2 , 1 ] in ,α q,. Z Queue1. After P1 taking a checkpoint C 1,q , P1 sends m1 = [ 1 , 2 ] to P2 . Before the send,q+1 ,. Z Queuer . So by our algorithm, the message m k+1 could be completely inserted into the Z-cycle which could be detected. hb case II : On Ps , receive(mk ) → send(mk+1 ) and hb. on Pt , send(m1 ) → receive(mk+1 ). That is, Ps is casual and Pt is non-casual.. ing event, P1 merges m1 with Z Queue1 and then there will be a Z-path [ 2 , 1 , 2 ] generated in ,α q,q+1 ,. Z Queue1. When m1 arrives P2 , it piggybacks the Z-path to P2 and so P2 can fill c2 = α into the lowerleft of 2 . Then there is a Z-path [ 2 , 1 , 2 ] ,. ,α q,q+1 α,. α,α q,q+1. we can also induct that the checkpoint C 1,q is involved in this Z-cycle. Suppose when = k, the theorem is true. That is, a Z-cycle associated with k messages m 1 , m2 , · · · , mk denoted by [ 1 , 2 , · · · , k , 1 ] can be de,b1 a2 ,b2. ak ,bk a1 ,. tected at process Pi , for some i. Then when = k + 1, we must show a Z-cycle associated with k + 1 messages m1 , m2 , · · · , mk , mk+1 could be detected at some process. Let m 1 and mk be the neighbor messages of m k+1 and the Z-cycle is {· · · , mk , mk+1 , m1 ,· · · }. According to the time of events mk , mk+1 , m1 occurring, there are four cases. Assume Ps receives mk and sends mk+1 , and Pt receives mk+1 and sends m1 to Pr . hb case I : For Ps , receive(mk ) → send(mk+1 ) and. mk+1. ^. Pt. m1. ^. Pr. ,α q,q+1 α,. contained in Z Queue 2. So P2 can detect the Z-cycle [ 2 , 1 , 2 ],that is ( 2 , 1 ). By this notation. mk. w. Ps. Figure 9. By case I, when Pt receives mk+1 , Z Queuet contains [ 1 , · · · , s , t ], where at = ct . But ,b1. as ,bs at ,. m1 has already been sent, so Z Queue t contains [ t , r ] and after merge action, Z Queue t will ,at ,. generate [ 1 , · · · , s , t , r ], in which there are ,b1. as ,bs at ,at ,. two symbols at process r. So P t will send a request message for Z Queuer to obtain [ t , r , · · · ] from ,at ar ,···. P r. And then P t merges again, Z Queue t will get [· · · , s , t , r , · · · ]. So the message mk+1 as ,bs at ,at ar ,br. could be also inserted into the Z-cycle. hb case III : For Ps , send(mk+1 ) → send(mk ) and for hb Pt , receive(mk+1 ) → send(m1 ). That is, Ps is noncasual and Pt is casual.. hb. on Pt , receive(mk+1 ) → send(m1 ). That is, Ps is casual and Pt is also casual.. Ps Pt. w. Ps. Figure 8. For Ps , when Ps receives mk , it contains hb [ 1 , · · · , s ] in Z Queues . Since receive(mk ) → ,b1. ,bs ,. becomes [ 1 , · · · , s , t ] which will be piggyas ,bs ,. backed to Pt . For Pt , it receives [ 1 , · · · , s , t ] ,b1. as ,bs ,. and it can fill at = ct into the lower-left hb of Pt . Since receive(mk+1 ) → send(m1 ), so when Pt sends m1 ,denoted by [ t , r ], ,ct ,. [ 1 , · · · , s , t , r ], where bt = ct , will be ,b1. Pt. as ,bs at ,bt ,. piggybacked to its target process P r . For Pr , when Pr receives m1 , it can have [ 1 , · · · , s , t , r ] in ,b1. as ,bs at ,bt ar ,. mk. cs = α. w. mk+1. ^. m1. ^. Pr. as ,. send(mk+1 ), so when the event send(m k+1 ) occurs, the Z-path will be merged with [ s , t ] and then ,b1. m1. ^ Figure 10.a. m1. ^. Pr. ^. w. mk+1. ^. Pt. mk+1. Pr. mk. Ps. mk. cs = α. Figure 10.b hb For Ps , since send(mk+1 ) → receive(mk ), so Z Queues contains [· · · , s , t ]. When Ps re,α ,. ceives mk , Z Queues will contains [· · · , s ]. So α,. they will be merged into [· · · , s , t ]. Because there α,α ,. are two symbols in Pt , Ps will send Pt a request and then merges with Z Queue s to obtain [· · · , s , t ]. α,α β,. For Pt , when Pt receives mk+1 , Z Queuet contains [ s , t ], for some beta. Later when P t sends m1 to ,α β,. Pr , Z Queuet contains [ t , r ], where β ≥ β. So ,β. 1130. ,.

(8) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. Pt could have [ s , t , r ]. When Pr receives m1 ,. tain [ u ,· · · , s , t , r ]. For the two conditions,. The task of detecting Z-cycles has never been implemented before, so there is not any evaluation of such a scheme as this. In this paper, we innovate an appropriate data structure expressing Z-path and detecting algorithm in distributed computing system. Although the algorithm demands much piggybacked Z-paths information, we can detect Z-cycles and involved checkpoints accurately. By Netzer,Xu’s theorem and this algorithm we can distinguish useless checkpoints(involved in a Z-cycle) from other checkpoints. Hence the objective of breaking Z-cycles could be accessible by inserting minimal number of forced checkpoints. In the future, we can eliminate useless checkpoints or rearrange their position to make Zcycle free for decreasing the number of forced checkpoints to destroy Z-cycles is still an important issue.. [· · · , s , t , r ,· · · ] could be obtained in P s (figure. 7. References. ,α β,β. ,. [ s , t , r ], for some θ, will be obtained. As figure ,α β,β. θ,. 10, there are two distinct situations. hb If receive(m1 ) → receivr(mk ), as figure 10.a, then Z Queues has [· · · , s ] and [ s , t , r ,· · · ] ,α β,β θ,···. α,. after requesting Pt . So Ps could obtain [· · · , s , t , α,α β,β. r ,· · · ].. θ,···. hb. If receive(mk ) → receivr(m1 ), as figure 10.b, then Z Queuer has [ s , t , r ]. When Pr receives ,α β,β θ,. m1 , it sends a request for some process P u to get [ u ,· · · , s ]. So after connection, P r could ob,···. α,. ,···. α,α β,β. α,α β,β. θ,. θ,. 10.a) or Pr (figure 10.b). So m k+1 could also be inserted into the Z-cycle. hb case IV : For Ps , send(mk+1 ) → receive(mk ) and hb. for Pt , send(m1 ) → send(mk+1 ). That is, Ps and Pt are non-casual. mk cs = β w mk+1 ct = α ^ Pt m1. Ps. Pr. ^. Figure 11. For Pt , when Pt sends m1 to Pr . Z Queuet contains [ t , r ].When Pt receives mk+1 , Z Queuet con,α ,. tains [ s , t ]. So after merge action, Z Queue t ,β α,. could contain [ s , t , r ] and then Pt request s ,β α,α ,. Pr and merges again to obtain [ s , t , r ,· · · ], ,β. α,α. γ,γ. for some γ . For Ps , when Ps receives mk , Z Queues could have [· · · , s ]. And Z Queues already contains [ s , t ]. ,β ,. β,. So after merge action. Z Queues would contains [· · · , s , t ]. β,β ,. Since. there are two symbols, P s request s Z Queuet , which already contains [ s , t , r ,· · · ], to merge ,β α,α γ,γ. again. So in Z Queue s there will be a Z-cycle [· · · , s , t , r ,· · · ]. Hence mk+1 could also be inβ,β α,α γ,γ. serted into the Z-cycle. From above discussion of four distinct cases, the theorem still holds when the length of Z-cycle is k + 1. By induction the proof is completed. . 6. Conclusions. [1] L.Lamport. Time, Clocks and the Ordering of Evevts in a Distributed System. Comm. ACM, vol.21, no.7, pp.558-565, 1978 [2] R.H.B Netzer and J. Xu. Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Trans. on Parallel and Distributed Systems, vol.6, no.2, pp.165-169, 1995 [3] Taesoon Park. Heon Y. Yeon. Application Controlled Checkpointing Coordination for FaultTolerant Distributed Computing Systems. Dept of Computer Engineer Sejong University. Parallel Computing, vol.26, no.4, pp.467-482, 2000 [4] D. Manivannan and M. Singhal. QuasiSynchronous Checkpointing: Models, Characterization, and Classification. IEEE Trans. on Parallel and Distributed Systems, vol.10, no.7, pp.703713, 1999 [5] D. Briatico, A. Ciuffoletti and L. Simoncini, A distributed domino-effect free recovery algorithm. In Proc. of the IEEE 4th Symp. on Reliability in Distributed Software and Database Systems, pp. 207-215 , 1984 [6] J.M. Helary et al. Communication-based prevention of useless checkpoints in distributed computations. Distributed Computing, vol.13, no.1, pp.2943, 2000 [7] R.Baldoni, F. Quaglia, and B. Ciciani. A VPaccordant checkpointing protocol preventing useless checkpoints. In the 17th IEEE Symposium on Reliable Distributed Systems, pp.61-67. 20-23 Oct. 1998 [8] Yi-Min Wang. Maximum and Minimum Consistent Global Checkpoints and their Applications. In the 14th IEEE Symposium on Reliable Distributed Systems , pp.86-95. 13-15 September, 1995 [9] Yi-Min Wang. Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints.. 1131.

(9) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. IEEE Transactions on Computers, vol.46, no.4, pp.456-468, 1997. [10] Jane-Feng Chiu and Ge-Ming Chiu. Placing Forced Checkpoints in Distributed Real-Time Embedded Systems. IEEE Computing & Control Engineering Journal, vol.13, issue 4, pp.197-205 Aug 2002 [11] B. Randell. System structures for software fault-tolerance. IEEE Transactions on Software Eng.,vol.1 no.2, pp.220-232, June, 1975 [12] R. Baldoni, J. M. Helary, and M. Raynal. Rollback-dependency trackability: Visible characterizations. In 18th ACM Symposium on the Principles of Distributed Computing(PODC’99), Atlanta(USA), pp.33-42, May 1999 [13] I. C. Garcia and L. E. Buzato. On the minimal characterization of rollback-dependency trackability property. In Proceedings of the 21th IEEE Int.. Conf. on Distributed Computing Systems, pp.03420349. 16-19 April 2001 [14] K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Trans. on Computer Systems,vol.3, no.1, pp.63-75, Feb, 1985 [15] R. Koo and S. Toueg. Checkpointing and Rollback-recovery for distributed systems. IEEE Trans. on Software Eng., vol.13, no.1, pp.23-31, Jan, 1987 [16] R.D. Schlichting and F.B. Schneider. FailStop Processors: an Approach to Designing FaultTolerant Computing Systems. ACM Trans. on Computer Systems, vol.1, no.3, pp.222-238, 1983 [17] E.N. Elnozahy, D.B. Johnson and Y.M. Wang. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys(CSUR), vol.34, issue 3, pp.375-408, Sep. 2002. Appendix : The section illustrate our algorithms detailed and we typeset them with one column. Actions taken when Pi sends a message M to Pj 1: for each Z-path in Z Queuei do 2: Duplicate the front part [· · · · · · , i ], where α ≤ ci to merge [ i , j ] into a new Z-path [· · · · · · , α,. ,ci ,. i , j ] and copy them. α,ci ,. into Z Queue buf f er1i ; 3: end for 4: Send (Z Queue buf f er1i and M ) to Pj ; 5: Clear Z Queue buf f er1i ; // end Actions taken when Pi receives a message (M , Z Queue buf f er1k ) from Pk 1: Store Z Queue buf f er1k into Z Queue buf f er1i ; 2: for each Z-path [· · · · · · , k , i ] in Z Queue buf f er1i do ·,α ,. 3: 4: 5: 6: 7:. Write ci into it, [· · · · · · , k ,. i ]. ·,α ci ,. end for Update(csni , Z Queue buf f er1i ); PruneZ-path(csni , Z Queue buf f er1i ); for each [· · · , i , j ] appears in Z-path of Z Queuei do ···,ci ,. 8:. Send Z-path request([ i , j ]) to Pj to obtain the back part [ i , j , · · · ] of Z-paths in their Z Queuej ;. 9:. Obtain Z-paths [ i , j , · · · ] from other processe js and connect them with [· · · ,. ,ci ,. ,ci ···,···. ,ci ···,···. [· · · ,. i , j , · · · ];. i , j ] into. ···,ci ,. ···,ci ···,···. 10: PruneZ-path(csni , Z Queuei ); 11: end for 12: for each z − path in Z Queue buf f er1i do 13: Take the front part [· · · , k , i ], where α = ci ···,··· α,. 14:. for each z − path containing [· · · ,. i , · · · ] in Z Queuei do. ···,ci. 15:. Connect [· · · , k , i ] with [ i , · · · ] and then generate a new Z-path [· · · , k ,. 16:. CheckZ-cycle(this new Z-path, [ k ,. ···,··· α,. ,ci. i , · · · ];. ···,··· α,ci. i ]);. ,··· ci ,. 17: end for 18: end for 19: Clear Z Queue buf f er1i and Z Queue buf f er2i ; 20: Processing M ; // end. 1132.

(10) Int. Computer Symposium, Dec. 15-17, 2004, Taipei, Taiwan.. Actions taken when Pi takes a basic checkpoint 1: Pi takes a checkpoint Ci,ci ; 2: ci := ci + 1; 3: csni [i] := ci ; 4: PruneZ-path(csni , Z Queuei ); // end Actions taken when Pi receives a Z-path request([ q , i ]) from Pq ,cq ,. 1: for each Z-path of Z Queuei do 2: Cut the back part [ q , i , · · · · · · ] of the Z-path; ,cq ···,···. 3:. if the Z-path is as [ q , i , · · · , r , s ] then ,cq ···,···. ...,α ,. 4:. Send Z-path request([ r , s ]) to all processes s to obtain back part [ r , s , · · · · · · ] of z-paths in their Z Queue and wait. 5:. for reply; Collect Z-paths [ r , s , · · · · · · ] from other processes and connect with [ q , i , · · · , r , s ] , then update this Z-path ,α ···,···. [· · · , 6: 7: 8: 9: 10: 11: 12: 13: 14:. ,α ···,···. ,α ,. ,cq ···,···. q , i ,··· , r , s ,···] ;. ,cq ···,···. ...,α ,. ...,α ···,···. Store [ q , i , · · · , r , s , · · · ] into Z Queue buf f er1i ,cq ···,···. ...,α ···,···. else Store the Z-path [ q , i , · · · · · · ] into Z Queue buf f er1i ; ,cq ···,···. end if end for Send Z Queue buf f er1i back to Pq for reply. Update(csni ;Z Queue buf f er1i ); PruneZ-path(csni , Z Queuei ); Clear Z Queue buf f er1i , Z Queue buf f er2i ; // end. Procedure Update(csn , Z Queue) 1: for each z − path in Z Queue do 2: for each P id.c out do 3: csn[P id] = max(csn[P id], P id.c out − 1); 4: end for 5: end for // end Procedure PruneZ-path(csn , Z Queue) 1: for each Z-path in Z Queue do 2: while first P id.c out ≤ csn[P id] do 3: Delete the first element of the z-path; // The sending of the first message occurred at the left side of checkpoint line ,so first element(message) is useless. 4: end while 5: end for // end Procedure CheckZ-cycle(z − path, [ k , i ]) ,α β,. 1: if there exists m in z − path [· · · , m , · · · , k , i , · · · , m , · · · ] such that in ≤ out then ···,α β,···. ···,out. 2: 3:. in,···. if there exists at least one P id such that c in < c out in the cycle [ m , · · · , k , i , · · · , m ] then Z-cycle [ m , · · · , k , i , · · · , m ] forms and save it; ,out. ···,α β,···. in,. 4: end if 5: end if // end. 1133. ,out. ···,α β,···. in,.

(11)