• 沒有找到結果。

So far, we have not considered any ordering guarantee among messages delivered by different processes. In particular, when we consider a reliable broadcast abstraction, messages can be delivered in any order by distinct processes.

In this section, we introduce reliable broadcast abstractions that deliver messages according to first-in first-out (FIFO) order and according to causal order. FIFO order ensures that messages broadcast by the same sender process are delivered in the order in which they were sent. Causal order is a generalization of FIFO order that additionally preserves the potential causality among messages from multiple senders. These orderings are orthogonal to the reliability guarantees.

3.9.1 Overview

Consider the case of a distributed message board that manages two types of information: proposals and comments on previous proposals. To make the inter-face user-friendly, comments are depicted attached to the proposal they are referring to. In order to make it highly available, it is natural to implement the board appli-cation by replicating all the information to all participants. This can be achieved through the use of a reliable broadcast primitive to disseminate both proposals and comments.

With reliable broadcast, however, the following sequence of events is possible:

participant p broadcasts a message m1 containing a new proposal; then p changes its mind and broadcasts a message m2with a modification to its previous proposal;

because of message delays, another participant q delivers m2before m1. It may be difficult for participant q to understand m2 without the context of m1. Imposing a FIFO order on the delivery of messages solves this problem because it requires q delivers m1 before m2 as they are from the same sender. FIFO order can be implemented by delaying the delivery of a message m from a given sender until all messages that the sender has broadcast before m have been delivered.

Even when reliable broadcast implements FIFO order, the following execution is still possible: participant p broadcasts a message m1 containing its new pro-posal; participant q delivers m1and disseminates a comment on m1in message m2; because of message delays, another participant r delivers m2before m1. When par-ticipant r delivers m2, it lacks the context of message m1to properly interpret m2. Message delivery in causal order prevents this. It could be implemented by delay-ing the delivery of a message m2until every message m1that may have potentially caused m2(i.e., where m1→ m2) has been delivered.

3.9.2 FIFO-Order Specification

The specification of reliable broadcast does not state anything about the order in which multiple messages are delivered. A FIFO-order is one of the simplest possi-ble orderings and guarantees that messages from the same sender are delivered in the same sequence as they were broadcast by the sender. Note, this does not affect messages from different senders.

The FIFO-order (reliable) broadcast abstraction shown in Module 3.8 is obtained from the (regular) reliable broadcast abstraction (Module3.2) by extend-ing it with the FIFO delivery property. A uniform variation of FIFO-order (reliable) broadcast with causal order can be obtained in the same way. For brevity, we usually skip the term “reliable” refer to a FIFO-order broadcast abstraction.

3.9.3 Fail-Silent Algorithm: Broadcast with Sequence Number

Algorithm3.12, “Broadcast with Sequence Number,” implements FIFO-order reli-able broadcast. Every process maintains a sequence number lsn for the frb-broadcast messages, and rb-broadcasts the value lsn together with the message. The process

Module 3.8: Interface and properties of FIFO-order (reliable) broadcast Module:

Name: FIFOReliableBroadcast, instance frb.

Events:

Request: ⟨ frb, Broadcast | m ⟩: Broadcasts a message m to all processes.

Indication: ⟨ frb, Deliver | p, m ⟩: Delivers a message m broadcast by process p.

Properties:

FRB1–FRB4: Same as properties RB1–RB4 in (regular) reliable broadcast (Mod-ule3.2).

FRB5: FIFO delivery: If some process broadcasts message m1before it broadcasts message m2, then no correct process delivers m2unless it has already delivered m1.

Algorithm 3.12: Broadcast with Sequence Number Implements:

FIFOReliableBroadcast, instance frb.

Uses:

ReliableBroadcast, instance rb.

upon eventfrb,Init do lsn:=0;

pending:=; next:=[1]N;

upon eventfrb,Broadcast | m ⟩do lsn:= lsn+ 1;

triggerrb,Broadcast | [DATA, self, m,lsn]; upon eventrb,Deliver| p,[DATA,s, m, sn]do

pending:= pending∪ {(s, m, sn)};

while exists(s, m, sn)pendingsuch thatsn=next[s]do next[s]:= next[s] + 1;

pending:= pending\ {(s, m, sn)}; triggerfrb,Deliver| s,m;

also maintains an array next, which contains an entry for every process p with the sequence number of the next message to be frb-delivered from sender p. The process buffers all messages received via the reliable broadcast primitive in a set pending and frb-delivers them according to the sequence number assigned per the sender. (The same mechanism is also found in the “Lazy Probabilistic Broadcast” algorithm.) Correctness. Because the FIFO-order broadcast abstraction is an extension of reliable broadcast, and because the algorithm uses a reliable broadcast primitive

directly, the algorithm satisfies the four basic properties (FRB1–FRB4) that also define reliable broadcast.

The FIFO delivery property follows directly from the assignment of sequence numbers to messages by every sender and from way that receivers buffer and frb-deliver messages according to the sequence number assigned by the sender.

Performance. The algorithm does not add any messages to the reliable broadcast primitive and only increases the message size by a marginal amount.

3.9.4 Causal-Order Specification

The causal order property for a broadcast abstraction ensures that messages are delivered such that they respect all cause–effect relations. The happened-before relation described earlier in this book (Sect. 2.5.1) expresses all such dependen-cies. This relation is also called the causal order relation, when applied to messages exchanged among processes and expressed by broadcast and delivery events. In this case, we say that a message m1may have potentially caused another message m2, denoted as m1 → m2, if any of the following relations apply (see Fig.3.8):

(a) some process p broadcasts m1before it broadcasts m2;

(b) some process p delivers m1and subsequently broadcasts m2; or (c) there exists some message msuch that m1 → mand m → m2.

Using the causal order relation, one can define a broadcast abstraction with a causal deliveryproperty, which states that all messages delivered by the broadcast abstraction are delivered according to the causal order relation. There must be no

“holes” in the causal past, such that when a message is delivered, all messages that causally precede it have already been delivered.

The causal-order (reliable) broadcast abstraction shown in Module 3.9 is obtained from the (regular) reliable broadcast abstraction (Module3.2) by extend-ing it with a causal delivery property. A uniform variation of reliable broadcast with causal order can be stated analogously. The causal-order uniform (reliable) broad-castabstraction is shown in Module3.10and extends the uniform reliable broadcast abstraction (Module3.3) with the causal delivery property. For brevity, we usually

m2

q

r

p m1 m1

m’

m2 m1

m2

(a) (b) (c)

Figure 3.8:Causal order of messages

Module 3.9: Interface and properties of causal-order (reliable) broadcast Module:

Name: CausalOrderReliableBroadcast, instance crb.

Events:

Request: ⟨ crb, Broadcast | m ⟩: Broadcasts a message m to all processes.

Indication: ⟨ crb, Deliver | p, m ⟩: Delivers a message m broadcast by process p.

Properties:

CRB1–CRB4: Same as properties RB1–RB4 in (regular) reliable broadcast (Mod-ule3.2).

CRB5: Causal delivery: For any message m1that potentially caused a message m2, i.e., m1→ m2, no process delivers m2unless it has already delivered m1.

Module 3.10: Interface and properties of causal-order uniform (reliable) broadcast Module:

Name: CausalOrderUniformReliableBroadcast, instance curb.

Events:

Request: ⟨ curb, Broadcast | m ⟩: Broadcasts a message m to all processes.

Indication: ⟨ curb, Deliver | p, m ⟩: Delivers a message m broadcast by process p.

Properties:

CURB1–CURB4: Same as properties URB1–URB4 in uniform reliable broadcast (Module3.3).

CURB5: Same as property CRB5 in causal-order broadcast (Module3.9).

skip the term “reliable” and call the first one causal-order broadcast and the second one causal-order uniform broadcast.

As is evident from the first condition of causal order, the causal delivery property implies the FIFO order property in Module3.8. Hence, a causal-order broadcast primitive provides also FIFO-order reliable broadcast.

The reader might wonder at this point whether it also makes sense to consider a causal-order effort broadcast abstraction, combining the properties of best-effort broadcast with the causal delivery property. As we show through an exercise at the end of the chapter, this would inherently be also reliable.

3.9.5 Fail-Silent Algorithm: No-Waiting Causal Broadcast

Algorithm3.13, called “No-Waiting Causal Broadcast,” uses an underlying reliable broadcast communication abstraction rb, accessed through an rb-broadcast request

Algorithm 3.13: No-Waiting Causal Broadcast Implements:

CausalOrderReliableBroadcast, instance crb.

Uses:

ReliableBroadcast, instance rb.

upon eventcrb,Initdo delivered:=; past:=[];

upon eventcrb,Broadcast| m ⟩do

triggerrb,Broadcast | [DATA, past, m]; append(past, (self, m));

upon eventrb,Deliver| p,[DATA, mpast, m]do ifm̸∈deliveredthen

forall(s, n)mpastdo // by the order in the list ifn̸∈deliveredthen

triggercrb,Deliver| s, n ⟩; delivered:= delivered∪ {n}; if(s, n)̸∈pastthen

append(past, (s, n)); triggercrb,Deliver| p,m; delivered:= delivered∪ {m}; if(p, m)̸∈pastthen

append(past, (p, m));

and an rb-deliver indication. The same algorithm could be used to implement a uniform causal broadcast abstraction, simply by replacing the underlying reliable broadcast module by a uniform reliable broadcast module.

We call the algorithm no-waiting in the following sense: whenever a process rb-delivers a message m, it crb-delivers m without waiting for further messages to be rb-delivered. Each message m arrives as part of a DATAmessage together with a control field mpast, containing a list of all messages that causally precede m in the order these messages were delivered by the sender of m. When a pair (mpast, m) is rb-delivered, the receiver first inspects mpast and crb-delivers all messages in mpast that have not yet been crb-delivered; only afterward it crb-delivers m. In order to disseminate its own causal past, each process p stores all the messages it has crb-broadcast or crb-delivered in a local variable past, and rb-crb-broadcasts past together with every crb-broadcast message.

Variables past and mpast in the algorithm are lists of process/message tuples. An empty list is denoted by[], the operation append(L, x) adds an element x at the end of list L, and the operation remove(L, x) removes element x from L. The guard that appears before eachappend operation after the process rb-delivers a DATAmessage prevents that a message occurs multiple times in past.

p

q

r

s

crb−broadcast(m1)

[m1]

crb−broadcast(m2)

crb−deliver(m2) crb−deliver(m1)

Figure 3.9:Sample execution of no-waiting causal broadcast

An important feature of Algorithm 3.13is that the crb-delivery of a message is never delayed in order to enforce causal order. Figure3.9illustrates this behavior.

Consider, for instance, process s that rb-delivers message m2. As m2carries m1in its variable mpast, messages m1 and m2 are crb-delivered immediately, in causal order. Finally, when s rb-delivers m1from p, then m1is discarded.

Correctness. The first four properties (CRB1–CRB4), which are also properties of reliable broadcast, follow directly from the use of an underlying reliable broad-cast primitive in the implementation of the algorithm, which crb-delivers a message immediately upon rb-delivering it. The causal delivery property is enforced by having every message carry its causal past and every process making sure that it crb-delivers the causal past of a message before crb-delivering the message itself.

Performance. The algorithm does not add additional communication steps or send extra messages with respect to the underlying reliable broadcast algorithm. How-ever, the size of the messages grows linearly with time. In particular, the list past may become extremely large in long-running executions, because it includes the complete causal past of the process.

In the next subsection, we present a simple scheme to reduce the size of past. In the exercises, we describe an alternative implementation based on FIFO-broadcast with shorter causal past lists. We will later discuss an algorithm (“Waiting Causal Broadcast”) that completely eliminates the need for exchanging past messages.

3.9.6 Fail-Stop Algorithm: Garbage-Collection of Causal Past

We now present a very simple optimization of the “No-Waiting Causal Broadcast”

algorithm, depicted in Algorithm3.14, to delete messages from the past variable.

Algorithm3.14assumes a fail-stop model: it uses a perfect failure detector. The alg-orithm uses a distributed garbage-collection scheme and works as follows: when a process rb-delivers a message m, the process rb-broadcasts an ACKmessage to all processes; when an ACK message for message m has been rb-delivered from all correct processes, then m is purged from past.

Algorithm 3.14: Garbage-Collection of Causal Past (extends Algorithm3.13) Implements:

CausalOrderReliableBroadcast, instance crb.

Uses:

ReliableBroadcast, instance rb;

PerfectFailureDetector, instanceP.

// Except for itsInit event handler, the pseudo code of Algorithm3.13is also // part of this algorithm.

upon eventcrb,Initdo delivered:=; past:=[]; correct:=Π;

forallmdo ack[m]:=; upon event⟨ P,Crash| p ⟩do

correct:= correct\ {p};

upon existsmdeliveredsuch that self̸∈ack[m]do ack[m]:= ack[m]∪ {self};

triggerrb,Broadcast | [ACK,m]; upon eventrb,Deliver| p,[ACK,m]do

ack[m]:= ack[m]∪ {p}; upon correctack[m]do

forall(s, m)pastsuch thatm= mdo remove(past, (s, m));

This distributed garbage-collection scheme does not affect the correctness of the

“No-Waiting Causal Broadcast” algorithm, provided the strong accuracy property of the failure detector holds. The algorithm purges a message only if this message has been rb-delivered by all correct processes. If the completeness property of the failure detector is violated then the only risk is to keep messages around that could have been purged, but correctness is not affected.

In terms of performance, every acknowledgment message disseminated through reliable broadcast adds O(N2) point-to-point messages to the network traffic. How-ever, such acknowledgments can be grouped and disseminated in batch mode; as they are not in main path of crb-delivering a message, the acknowledgments do not slow down the causal-order broadcast algorithm.

Even with this optimization, the no-waiting approach might be considered too expensive in terms of bandwidth. In the following, we present an approach that tackles the problem at the expense of waiting.

Algorithm 3.15: Waiting Causal Broadcast Implements:

CausalOrderReliableBroadcast, instance crb.

Uses:

ReliableBroadcast, instance rb.

upon eventcrb,Initdo V :=[0]N;

lsn:=0; pending:=;

upon eventcrb,Broadcast| m ⟩do W:=V;

W [rank(self)]:= lsn;

lsn:= lsn+ 1;

triggerrb,Broadcast | [DATA,W, m]; upon eventrb,Deliver| p,[DATA,W, m]do

pending:= pending∪ {(p, W, m)};

while exists(p, W, m)pendingsuch thatW≤ V do pending:= pending\ {(p, W, m)};

V [rank(p)]:=V [rank(p)] + 1; triggercrb,Deliver| p,m;

3.9.7 Fail-Silent Algorithm: Waiting Causal Broadcast

Algorithm3.15, called “Waiting Causal Broadcast,” implements causal-order broad-cast without storing and disseminating any extra, causally related messages. Like Algorithm3.13(“No-Waiting Causal Broadcast”), it relies on an underlying reliable broadcast abstraction rb for communication, accessed through an rb-broadcast request and an rb-deliver indication.

An instance crb of Algorithm3.15does not keep a record of all past messages as in Algorithm3.13. Instead, it represents the past with a vector of sequence numbers, more precisely, with an array V of N integers called a vector clock. The vector captures the causal precedence between messages. In particular, entry V [rank(q)]

corresponds to process q, using the rank() function from Sect. 2.1 that maps the processes to the integers between1 and N.

Every process p maintains a vector clock V such that entry V [rank(q)] repre-sents the number of messages that p has crb-delivered from process q; additionally, it maintains a local sequence number lsn, denoting the number of messages that p has itself crb-broadcast. Process p then adds V , with V [rank(p)] replaced by lsn, to every crb-broadcast message m. When a process q rb-delivers some vector W and message m, where m was sent by process s, it compares W to its own vec-tor clock. The difference at indexrank(s) between the vectors tells process p how

p

q

r

s

crb−deliver(m1) crb−deliver(m2) [0,0,0,0]

crb−broadcast(m1)

[1,0,0,0]

crb−broadcast(m2)

Figure 3.10:Sample execution of waiting causal broadcast

many messages are missing from process s. Process p needs to crb-deliver all these messages before it can crb-deliver m.

As the name “Waiting Causal Broadcast” indicates a process may have to wait sometimes before crb-delivering a message that it has already rb-delivered. This is the price to pay for limiting the size of the messages. Using the above description, it is indeed possible that process p cannot immediately crb-deliver message m after it is rb-delivered because W [rank(p)] > V [rank(p)] for some process p (which might be s, if rb-delivered messages from s were reordered). Hence, process p waits until the messages from p that precede m in causal order have been rb-delivered and crb-delivered. On the other hand, it is possible that the rb-delivery of a single message triggers the crb-delivery of several messages that were already waiting to be crb-delivered.

As before, we use the notation[x]N for any symbol x as an abbreviation for the N -vector [x, . . . , x]. For comparing two N -vectors of integers v and w, we say that v≤ w whenever it holds for every i = 1, . . . , N that v[i] ≤ w[i].

Figure3.10shows an example of how a process has to wait. Process s rb-delivers message m2before it rb-delivers message m1. But, it cannot crb-deliver m2 imme-diately and has to wait until m1is rb-delivered. Messages m1and m2are only then crb-delivered in causal order. The figure shows the vector clock values broadcast together with the message.

Correctness. For the validity property, consider a message m that is crb-broadcast by some correct process p. According to the validity property of the underly-ing reliable broadcast, p directly rb-delivers m. Consider the vector V that is rb-delivered together with m, which is taken from the vector clock V of p when it has rb-broadcast m. Since V may only have increased meanwhile, it holds V ≥ V and m is crb-delivered immediately.

The no duplication and no creation properties follow directly from the underlying reliable broadcast abstraction.

To show agreement, consider a message m that is crb-delivered by some correct process p. Because of the agreement property of the underlying reliable broadcast,

every correct process eventually rb-delivers m. According to the algorithm, and again relying on the agreement property of the reliable broadcast, every correct process also eventually rb-delivers every message that causally precedes m. Hence, every correct process eventually crb-delivers m.

Consider now the causal order property. Recall that the vector clock V at a process p stores the number of crb-delivered messages with sender q in entry V [rank(q)]. Furthermore, process p assigns a sequence number (starting at 0) to every message that it rb-broadcast in entry rank(p) of the attached vector. When p rb-broadcasts a message m with attached vector W computed like this, then W [rank(q)] messages from sender q causally precede m. But every receiver of m also counts the number of messages that it has crb-delivered from sender q and waits until V [rank(q)] such messages have been crb-delivered before crb-delivering m.

Performance. The algorithm does not add any additional communication steps or messages to the underlying reliable broadcast algorithm. The size of the message header is linear in the number of processes in the system.

Variant. Algorithm “Waiting Causal Broadcast” also implements causal-order uni-form broadcast, when the underlying reliable broadcast primitive is replaced by a uniform reliable broadcast primitive.