4.3.2 Transformation: From (1, N) Regular to (1, N) Atomic Registers
This section describes how to transform any (1, N) regular register abstraction into a (1, N) atomic register abstraction. For pedagogical reasons, we divide the transformation into two parts. We first explain how to transform any (1, N) regular register abstraction into a (1, 1) atomic register abstraction and then how to transform any (1, 1) atomic register abstraction into a (1, N) atomic register abstraction. These transformations do not use any means of communication between processes other than the underlying registers.
From (1, N) Regular to (1, 1) Atomic Registers. The first transformation is given in Algorithm 4.3 and realizes the following simple idea. To build a (1, 1) atomic register with process p as writer and process q as reader, we use one (1, N) regular register, also with writer p and reader q. Furthermore, the writer maintains a timestamp that it increments and associates with every new value to be written. The reader also maintains a timestamp, together with the value associated with the highest timestamp that it has read from the regular register so far. Intuitively, the reader stores these items in order to always return the value with the highest timestamp and to avoid returning an old value once it has read a newer value from the regular register.

Algorithm 4.3: From (1, N) Regular to (1, 1) Atomic Registers

Implements:
    (1, 1)-AtomicRegister, instance ooar.

Uses:
    (1, N)-RegularRegister, instance onrr.

upon event ⟨ ooar, Init ⟩ do
    (ts, val) := (0, ⊥);
    wts := 0;

upon event ⟨ ooar, Write | v ⟩ do
    wts := wts + 1;
    trigger ⟨ onrr, Write | (wts, v) ⟩;

upon event ⟨ onrr, WriteReturn ⟩ do
    trigger ⟨ ooar, WriteReturn ⟩;

upon event ⟨ ooar, Read ⟩ do
    trigger ⟨ onrr, Read ⟩;

upon event ⟨ onrr, ReadReturn | (ts′, v′) ⟩ do
    if ts′ > ts then
        (ts, val) := (ts′, v′);
    trigger ⟨ ooar, ReadReturn | val ⟩;

To implement a (1, 1) atomic register instance ooar, Algorithm 4.3 maintains one instance onrr of a (1, N) regular register. The writer maintains a writer-timestamp wts, and the reader maintains a reader-timestamp ts, both initialized to 0.
In addition, the reader stores the most recently read value in a variable val. The algorithm proceeds as follows:
• To ooar-write a value v to the atomic register, the writer p increments its timestamp wts and onrr-writes the pair (wts, v) into the underlying regular register.
• To ooar-read a value from the atomic register, the reader q first onrr-reads a timestamp/value pair (ts′, v′) from the underlying regular register. If the returned timestamp ts′ is larger than the local timestamp ts, then q stores ts′ together with the returned value v′ in its local variables and returns v′. Otherwise, the reader simply returns the value from val, which it has already stored locally.
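The two steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the book's implementation: the class names `RegularRegister` and `AtomicFromRegular`, the `begin_write`/`end_write` split, and the use of a random choice to model a read that overlaps a write are all hypothetical devices introduced here. Writer and reader state are kept in one object only for brevity.

```python
import random

BOTTOM = None  # stands in for the initial value ⊥


class RegularRegister:
    """Toy (1, N) regular register (assumption): a read that overlaps a
    write may return either the previous or the newly written pair."""

    def __init__(self):
        self.prev = (0, BOTTOM)
        self.curr = (0, BOTTOM)
        self.write_in_progress = False

    def begin_write(self, pair):
        self.prev, self.curr = self.curr, pair
        self.write_in_progress = True

    def end_write(self):
        self.write_in_progress = False

    def read(self, rng):
        if self.write_in_progress:
            return rng.choice([self.prev, self.curr])
        return self.curr


class AtomicFromRegular:
    """Writer and reader state of the transformation, in one object."""

    def __init__(self, onrr):
        self.onrr = onrr
        self.wts = 0                    # writer-timestamp
        self.ts, self.val = 0, BOTTOM   # reader-timestamp and stored value

    def begin_write(self, v):
        self.wts += 1
        self.onrr.begin_write((self.wts, v))

    def end_write(self):
        self.onrr.end_write()

    def read(self, rng):
        ts2, v2 = self.onrr.read(rng)
        if ts2 > self.ts:               # adopt only strictly newer pairs
            self.ts, self.val = ts2, v2
        return self.val


rng = random.Random(1)                  # fixed seed only for repeatability
reg = AtomicFromRegular(RegularRegister())
reg.begin_write("x")
reg.end_write()
reg.begin_write("y")                    # leave this write in progress
seen = [reg.read(rng) for _ in range(10)]
# Once "y" appears in `seen`, no later entry is "x".
```

The underlying toy register may flip between the old and the new pair while the write of "y" is in progress; the timestamp check in `read` is what keeps the sequence of returned values from going backward.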
Correctness. The termination property of the atomic register follows from the same property of the underlying regular register.
Consider validity and assume first that a read is not concurrent with any write, and the last value written by p is v and associated with timestamp ts′. The reader-timestamp stored by the reader q is either ts′, if q has already read v in some previous read, or a strictly smaller value. In both cases, because of the validity property of the regular register, a read by q will return v. Consider now a read that is concurrent with some write of value v and timestamp ts′, and the previous write was for value v′ and timestamp ts′ − 1. The reader-timestamp stored by q cannot be larger than ts′. Hence, because of the validity property of the underlying regular register, q will return either v or v′; both are valid replies.
Consider now ordering and assume that p writes v and subsequently writes w. Suppose that q returns w for some read and consider any subsequent read of q. The reader-timestamp stored by q is either the one associated with w or a larger one. Hence, the timestamp check that the algorithm performs before returning from a read ensures that the returned value was not written before w, and there is no way for the algorithm to return v.
Performance. The transformation requires only local computation, such as maintaining timestamps and performing some checks, in addition to writing to and reading from the regular register.
From (1, 1) Atomic to (1, N) Atomic Registers. We describe here an algorithm that implements the abstraction of a (1, N) atomic register out of (1, 1) atomic registers. To get an intuition of the transformation, think of a teacher (the writer), who needs to communicate some information to a set of students (the readers), through the abstraction of a traditional blackboard. The board is a good match for the abstraction of a (1, N) register, as long as only the teacher writes on it. Furthermore, the board is a single physical entity and is atomic.
Assume now that the teacher cannot physically gather all students within the same classroom, and hence cannot use one physical board for all. Instead, this global board needs to be emulated with one or several individual boards (i-boards), each of which is still written by one person but may be read by only one person. For example, every student can have one or several such electronic i-boards at home, which only he or she can read.
It makes sense to have the teacher write each new piece of information to at least one i-board per student. This is intuitively necessary for the students to eventually read the information provided by the teacher, i.e., to ensure the validity property of the register. However, this is not enough to guarantee the ordering property of an atomic register. Indeed, assume that the teacher writes two pieces of information consecutively, first x and then y. It might happen that a student reads y and later on, some other student still reads x, say, because the information flow from the teacher to the first student is faster than the flow to the second student. This ordering violation is similar to the situation of Fig. 4.5.
One way to cope with this issue is for every student, before terminating the reading of some information, to transmit this information to all other students, through other i-boards. That is, every student would use, besides the i-board devoted to the teacher to provide new information, another one for writing new information to the other students. Whenever a student reads some information from the teacher, the student first writes this information to the i-board that is read by the other students, before returning the information. Of course, the student must in addition also read the i-boards on which the other students might have written newer information. The teacher adds a timestamp to the written information to distinguish new information from old.

Algorithm 4.4: From (1, 1) Atomic to (1, N) Atomic Registers

Implements:
    (1, N)-AtomicRegister, instance onar.

Uses:
    (1, 1)-AtomicRegister (multiple instances).

upon event ⟨ onar, Init ⟩ do
    ts := 0;
    acks := 0;
    writing := FALSE;
    readval := ⊥;
    readlist := [⊥]^N;
    forall q ∈ Π, r ∈ Π do
        Initialize a new instance ooar.q.r of (1, 1)-AtomicRegister with writer r and reader q;

upon event ⟨ onar, Write | v ⟩ do
    ts := ts + 1;
    writing := TRUE;
    forall q ∈ Π do
        trigger ⟨ ooar.q.self, Write | (ts, v) ⟩;

upon event ⟨ ooar.q.self, WriteReturn ⟩ do
    acks := acks + 1;
    if acks = N then
        acks := 0;
        if writing = TRUE then
            trigger ⟨ onar, WriteReturn ⟩;
            writing := FALSE;
        else
            trigger ⟨ onar, ReadReturn | readval ⟩;

upon event ⟨ onar, Read ⟩ do
    forall r ∈ Π do
        trigger ⟨ ooar.self.r, Read ⟩;

upon event ⟨ ooar.self.r, ReadReturn | (ts′, v′) ⟩ do
    readlist[r] := (ts′, v′);
    if #(readlist) = N then
        (maxts, readval) := highest(readlist);
        readlist := [⊥]^N;
        forall q ∈ Π do
            trigger ⟨ ooar.q.self, Write | (maxts, readval) ⟩;

The transformation in Algorithm 4.4 implements one (1, N) atomic register instance onar from N² underlying (1, 1) atomic register instances. Suppose the writer of the (1, N) atomic register onar is process p (note that the writer is also a reader here, in contrast to the teacher in the story). The (1, 1) registers are organized in an N × N matrix, with register instances called ooar.q.r for q ∈ Π and r ∈ Π. They are used to communicate among all processes, from the writer p to all N readers and among the readers. In particular, register instance ooar.q.r is used to inform process q about the last value read by reader r; that is, process r writes to this register and process q reads from it. The register instances ooar.q.p, which are written by the writer p, are also used to store the written value in the first place; as process p may also operate as a reader, these instances have dual roles.
Note that both write and read operations require N registers to be updated; the acks counter keeps track of the number of updated registers in the write and read operation, respectively. As this is a local variable of the process that executes the operation, and as a process executes only one operation at a time, using the same variable in both operations does not create any interference between reading and writing. A variable writing keeps track of whether the process is writing on behalf of a write operation, or whether the process is engaged in a read operation and writing the value to be returned.
Algorithm 4.4 also relies on a timestamp ts maintained by the writer, which indicates the version of the current value of the register. For presentation simplicity, we use a function highest(·) that returns the timestamp/value pair with the largest timestamp from a list or a set of such pairs (this is similar to the highestval function introduced before, except that the timestamp/value pair is returned whereas highestval only returns the value). More formally, highest(S) with a set or a list of timestamp/value pairs S is defined as the pair (ts, v) ∈ S such that

    ∀ (ts′, v′) ∈ S : ts′ < ts ∨ (ts′, v′) = (ts, v).

The variable readlist is a length-N list of timestamp/value pairs; in the algorithm for reading, we convert it implicitly to the set of its entries. Recall that the function #(S) denotes the cardinality of a set S or the number of non-⊥ entries in a list S.
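As a small illustration, the two helper functions can be written in Python as follows. This is a sketch under assumptions: ⊥ is represented by `None`, pairs are Python tuples, and the name `count` is a hypothetical stand-in for #(·) restricted to lists.

```python
BOTTOM = None  # stands in for ⊥


def highest(pairs):
    """Return the (ts, v) pair with the largest timestamp, ignoring ⊥ entries."""
    present = [p for p in pairs if p is not BOTTOM]
    return max(present, key=lambda p: p[0])


def count(entries):
    """#(S) for a list S: the number of non-⊥ entries."""
    return sum(1 for e in entries if e is not BOTTOM)


readlist = [(2, "b"), BOTTOM, (3, "c"), (1, "a")]
# highest(readlist) is (3, "c") and count(readlist) is 3.
```

With a single writer, two pairs that carry the same timestamp are identical, so the choice that `max` makes among equal timestamps does not matter.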
Correctness. Because of the termination property of the underlying (1, 1) atomic registers, it is easy to see that every operation in the transformation algorithm eventually returns.
Similarly, because of the validity property of the underlying (1, 1) atomic registers, and due to the choice of the value with the largest timestamp as the return value, we also derive the validity of the (1, N) atomic register.
For the ordering property, consider an onar-write operation of a value v with associated timestamp ts_v that precedes an onar-write of value w with timestamp ts_w; this means that ts_v < ts_w. Assume that a process r onar-reads w. According to the algorithm, process r has written (ts_w, w) to N underlying registers, with identifiers ooar.q.r for q ∈ Π. Because of the ordering property of the (1, 1) atomic registers, every subsequent read operation from instance onar reads at least one of the underlying registers that contains (ts_w, w), or a pair containing a higher timestamp. Hence, the read operation returns a value associated with a timestamp that is at least ts_w, and there is no way for the algorithm to return v.
Performance. Every write operation into the (1, N) register requires N writes into (1, 1) registers. Every read from the (1, N) register requires one read from N (1, 1) registers and one write into N (1, 1) registers.
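The register matrix and the operation counts above can be checked with a short sequential Python sketch. It makes several simplifying assumptions: processes are numbered 0 to N − 1, every underlying register access completes immediately, and the class names `OneOneRegister` and `OneNAtomic` as well as the access counters are hypothetical additions for illustration.

```python
BOTTOM = None  # stands in for ⊥


class OneOneRegister:
    """Stand-in for a (1, 1) atomic register that counts its accesses."""

    def __init__(self):
        self.pair = (0, BOTTOM)
        self.reads = 0
        self.writes = 0

    def write(self, pair):
        self.writes += 1
        self.pair = pair

    def read(self):
        self.reads += 1
        return self.pair


class OneNAtomic:
    """Sequential sketch of the transformation for processes 0..n-1."""

    def __init__(self, n, writer=0):
        self.n = n
        self.writer = writer
        self.ts = 0
        # reg[q][r] is written by process r and read by process q
        self.reg = [[OneOneRegister() for _ in range(n)] for _ in range(n)]

    def write(self, v):
        self.ts += 1
        for q in range(self.n):
            self.reg[q][self.writer].write((self.ts, v))

    def read(self, me):
        readlist = [self.reg[me][r].read() for r in range(self.n)]
        maxts, readval = max(readlist, key=lambda p: p[0])
        for q in range(self.n):          # write back before returning
            self.reg[q][me].write((maxts, readval))
        return readval

    def totals(self):
        regs = [r for row in self.reg for r in row]
        return sum(r.reads for r in regs), sum(r.writes for r in regs)


onar = OneNAtomic(4)
onar.write("x")                 # N = 4 underlying writes
onar.read(2)                    # 4 underlying reads and 4 more writes
onar.reg[1][0].write((2, "y"))  # a write of "y" that has reached only process 1
first = onar.read(1)            # process 1 returns "y" and writes it back
second = onar.read(3)           # process 3 now also returns "y", not "x"
```

The last three lines replay the teacher scenario: because process 1 writes "y" to the registers read by the others before returning, a later read by process 3 cannot fall back to "x".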
We give, in the following, two direct implementations of (1, N) atomic register abstractions from distributed communication abstractions. The first algorithm is in the fail-stop system model and the second one uses the fail-silent model. These are adaptations of the “Read-One Write-All” and “Majority Voting” (1, N) regular register algorithms, respectively. Both algorithms use the same approach as the transformation presented above, but require fewer messages than if the transformation were applied automatically.
4.3.3 Fail-Stop Algorithm: Read-Impose Write-All (1, N) Atomic Register
If the goal is to implement a (1, N) atomic register with one writer and multiple readers, the “Read-One Write-All” regular register algorithm (Algorithm 4.1) clearly does not work: the scenario depicted in Fig. 4.5 illustrates how it fails.
To cope with this case, we define an extension to the “Read-One Write-All” regular register algorithm that circumvents the problem by having the reader also impose the value it is about to return on all other processes. In other words, the read operation also writes back the value that it is about to return. This modification is described as Algorithm 4.5, called “Read-Impose Write-All.” The writer uses a timestamp to distinguish the values it is writing, which ensures the ordering property of every execution. A process that is asked by another process to store an older value than the currently stored value does not modify its memory. We discuss the need for this test, as well as the need for the timestamp, through an exercise (at the end of this chapter).
The algorithm uses a request identifier rid in the same way as in Algorithm 4.2. Here, the request identifier field distinguishes among WRITE messages that belong to different reads or writes. A flag reading used during the writing part distinguishes between the write operations and the write-back part of the read operations.
Correctness. The termination and validity properties are ensured in the same way as in the “Read-One Write-All” algorithm (Algorithm 4.1). Consider now ordering and assume process p writes a value v, which is associated with some timestamp ts_v, and subsequently writes a value w, associated with some timestamp ts_w > ts_v. Assume, furthermore, that some process q reads w and, later on, some other process r invokes another read operation. At the time when q completes its read, all processes that did not crash have a timestamp variable ts that is at least ts_w. According to the algorithm, there is no way for r to change its value to v after this time because ts_v < ts_w.
Performance. Every write or read operation requires two communication steps, corresponding to the roundtrip communication between the writer or the reader and all processes. At most O(N) messages are needed in both cases.
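The read-impose idea can be illustrated with a synchronous Python sketch. It is a simplification under assumptions, not the event-driven algorithm itself: message delivery is modeled as direct method calls, the perfect failure detector is replaced by a `crashed` flag, and the names `Process`, `impose`, and the `reach` parameter (which models a sender that crashes in the middle of its broadcast) are hypothetical.

```python
BOTTOM = None  # stands in for ⊥


class Process:
    def __init__(self):
        self.ts, self.val = 0, BOTTOM
        self.crashed = False

    def deliver_write(self, ts, val):
        """Adopt the pair only if it is newer; report whether an ACK is sent."""
        if self.crashed:
            return False
        if ts > self.ts:
            self.ts, self.val = ts, val
        return True


def impose(procs, ts, val, reach=None):
    """Deliver [WRITE, ts, val] to all processes, or only to those in `reach`."""
    targets = range(len(procs)) if reach is None else reach
    return {i for i in targets if procs[i].deliver_write(ts, val)}


def correct(procs):
    return {i for i, p in enumerate(procs) if not p.crashed}


def write(procs, writer, v):
    acked = impose(procs, procs[writer].ts + 1, v)
    assert correct(procs) <= acked      # corresponds to "correct ⊆ writeset"


def read(procs, reader):
    r = procs[reader]
    readval = r.val
    acked = impose(procs, r.ts, readval)  # write back the value to be returned
    assert correct(procs) <= acked
    return readval


procs = [Process() for _ in range(3)]
write(procs, 0, "v")
# The writer starts writing "w" but crashes after reaching only process 1.
impose(procs, procs[0].ts + 1, "w", reach=[1])
procs[0].crashed = True
a = read(procs, 1)   # returns "w" and imposes it on process 2
b = read(procs, 2)   # returns "w" as well, never the older "v"
```

Without the write-back in `read`, process 2 would still hold "v" after process 1 has returned "w", which is the ordering violation that the algorithm is designed to rule out.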
Algorithm 4.5: Read-Impose Write-All

Implements:
    (1, N)-AtomicRegister, instance onar.

Uses:
    BestEffortBroadcast, instance beb;
    PerfectPointToPointLinks, instance pl;
    PerfectFailureDetector, instance P.

upon event ⟨ onar, Init ⟩ do
    (ts, val) := (0, ⊥);
    correct := Π;
    writeset := ∅;
    readval := ⊥;
    reading := FALSE;

upon event ⟨ P, Crash | p ⟩ do
    correct := correct \ {p};

upon event ⟨ onar, Read ⟩ do
    reading := TRUE;
    readval := val;
    trigger ⟨ beb, Broadcast | [WRITE, ts, val] ⟩;

upon event ⟨ onar, Write | v ⟩ do
    trigger ⟨ beb, Broadcast | [WRITE, ts + 1, v] ⟩;

upon event ⟨ beb, Deliver | p, [WRITE, ts′, v′] ⟩ do
    if ts′ > ts then
        (ts, val) := (ts′, v′);
    trigger ⟨ pl, Send | p, [ACK] ⟩;

upon event ⟨ pl, Deliver | p, [ACK] ⟩ do
    writeset := writeset ∪ {p};

upon correct ⊆ writeset do
    writeset := ∅;
    if reading = TRUE then
        reading := FALSE;
        trigger ⟨ onar, ReadReturn | readval ⟩;
    else
        trigger ⟨ onar, WriteReturn ⟩;

4.3.4 Fail-Silent Algorithm: Read-Impose Write-Majority (1, N) Atomic Register
In this section, we consider a fail-silent model. We describe an extension of our “Majority Voting” (1, N) regular register algorithm (Algorithm 4.2) to implement a (1, N) atomic register.
Algorithm 4.6: Read-Impose Write-Majority (part 1, read)

Implements:
    (1, N)-AtomicRegister, instance onar.

Uses:
    BestEffortBroadcast, instance beb;
    PerfectPointToPointLinks, instance pl.

upon event ⟨ onar, Init ⟩ do
    (ts, val) := (0, ⊥);
    wts := 0;
    acks := 0;
    rid := 0;
    readlist := [⊥]^N;
    readval := ⊥;
    reading := FALSE;

upon event ⟨ onar, Read ⟩ do
    rid := rid + 1;
    acks := 0;
    readlist := [⊥]^N;
    reading := TRUE;
    trigger ⟨ beb, Broadcast | [READ, rid] ⟩;

upon event ⟨ beb, Deliver | p, [READ, r] ⟩ do
    trigger ⟨ pl, Send | p, [VALUE, r, ts, val] ⟩;

upon event ⟨ pl, Deliver | q, [VALUE, r, ts′, v′] ⟩ such that r = rid do
    readlist[q] := (ts′, v′);
    if #(readlist) > N/2 then
        (maxts, readval) := highest(readlist);
        readlist := [⊥]^N;
        trigger ⟨ beb, Broadcast | [WRITE, rid, maxts, readval] ⟩;

The algorithm is called “Read-Impose Write-Majority” and shown in Algorithms 4.6 and 4.7. The implementation of the write operation is similar to that of the “Majority Voting” algorithm: the writer simply makes sure a majority adopts its value. The implementation of the read operation is different, however. A reader selects the value with the largest timestamp from a majority, as in the “Majority Voting” algorithm, but now also imposes this value and makes sure a majority adopts it before completing the read operation: this is the key to ensuring the ordering property of an atomic register.
The “Read-Impose Write-Majority” algorithm can be seen as the combination of the “Majority Voting” algorithm with the two ideas that are found in the two-step transformation from (1, N) regular registers to (1, N) atomic registers (Algorithms 4.3 and 4.4): first, the mechanism to store the value with the highest timestamp that was read so far, as in Algorithm 4.3; and, second, the approach of the read implementation to write the value to all other processes before it is returned, as in Algorithm 4.4.
Algorithm 4.7: Read-Impose Write-Majority (part 2, write and write-back)

upon event ⟨ onar, Write | v ⟩ do
    rid := rid + 1;
    wts := wts + 1;
    acks := 0;
    trigger ⟨ beb, Broadcast | [WRITE, rid, wts, v] ⟩;

upon event ⟨ beb, Deliver | p, [WRITE, r, ts′, v′] ⟩ do
    if ts′ > ts then
        (ts, val) := (ts′, v′);
    trigger ⟨ pl, Send | p, [ACK, r] ⟩;

upon event ⟨ pl, Deliver | q, [ACK, r] ⟩ such that r = rid do
    acks := acks + 1;
    if acks > N/2 then
        acks := 0;
        if reading = TRUE then
            reading := FALSE;
            trigger ⟨ onar, ReadReturn | readval ⟩;
        else
            trigger ⟨ onar, WriteReturn ⟩;

Correctness. The termination and validity properties are ensured in the same way as in Algorithm 4.2 (“Majority Voting”). Consider now the ordering property. Suppose that a read operation o_r by process r reads a value v from a write operation o_w of process p (the only writer), that a read operation o_r′ by process r′ reads a different value v′ from a write operation o_w′, also by process p, and that o_r precedes o_r′. Assume by contradiction that o_w′ precedes o_w. According to the algorithm, the timestamp ts_v that p associated with v is strictly larger than the timestamp ts_v′ that p associated with v′. Given that the operation o_r precedes o_r′, at the time when o_r′ was invoked, a majority of the processes has stored a timestamp value in ts that is at least ts_v, the timestamp associated with v, according to the write-back part of the algorithm for reading v. Hence, process r′ cannot read v′, because the timestamp associated with v′ is strictly smaller than ts_v. A contradiction.
Performance. Every write operation requires two communication steps corresponding to one roundtrip exchange between p and a majority of the processes, and O(N) messages are exchanged. Every read requires four communication steps corresponding to two roundtrip exchanges between the reader and a majority of the processes, or O(N) messages in total.
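The role of the two majorities in a read can be sketched in Python. This is a simplified model under assumptions: the quorums that answer are passed in explicitly instead of being determined by message arrival, request identifiers are omitted, and the names `Replica`, `is_majority`, `read_quorum`, and `write_quorum` are hypothetical.

```python
BOTTOM = None  # stands in for ⊥


class Replica:
    def __init__(self):
        self.ts, self.val = 0, BOTTOM

    def store(self, ts, val):
        if ts > self.ts:                 # ignore pairs that are not newer
            self.ts, self.val = ts, val


def is_majority(quorum, n):
    return len(set(quorum)) > n / 2


def write(replicas, wts, v, quorum):
    """Writer side: make a majority adopt (wts, v)."""
    assert is_majority(quorum, len(replicas))
    for i in quorum:
        replicas[i].store(wts, v)


def read(replicas, read_quorum, write_quorum):
    """Reader side: query a majority, then impose the result on a majority."""
    n = len(replicas)
    assert is_majority(read_quorum, n) and is_majority(write_quorum, n)
    answers = [(replicas[i].ts, replicas[i].val) for i in read_quorum]
    maxts, readval = max(answers, key=lambda p: p[0])
    for i in write_quorum:               # write-back before returning
        replicas[i].store(maxts, readval)
    return readval


reps = [Replica() for _ in range(5)]
write(reps, 1, "v", quorum=[0, 1, 2])
reps[0].store(2, "w")                        # a write of "w" still in progress
first = read(reps, [0, 1, 2], [0, 1, 2])     # sees "w" and imposes it
second = read(reps, [2, 3, 4], [2, 3, 4])    # intersects at replica 2, sees "w"
```

Because any two majorities intersect, the quorum queried by the second read contains at least one replica that stored the pair imposed by the first read, so the second read cannot return the older value "v".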