
2.2.1 Process Failures

A process executes the distributed algorithm assigned to it through the set of components implementing the algorithm within that process. A failure occurs whenever the process does not behave according to the algorithm. Our unit of failure is the process. When the process fails, all its components fail at the same time.

Process abstractions differ according to the nature of the faults that cause them to fail. Possible failures range from a crash, where a process simply stops executing steps, over an omission to take some steps and a crash with subsequent recovery, to arbitrary and even adversarial behavior. We discuss these kinds of failures in the subsequent sections. Figure 2.3 summarizes the types of failures.

2.2.2 Crashes

The simplest way of failing for a process is when the process stops executing steps. The process executes its algorithm correctly, including the exchange of messages with other processes, until some time t, after which it stops executing any local computation and does not send any message to other processes. In other words, the process crashes at time t and never recovers after that time. We call this a crash fault (Fig. 2.3), and talk about a crash-stop process abstraction. With this abstraction, a process is said to be faulty if it crashes at some time during the execution. It is said to be correct if it never crashes and executes an infinite number of steps. We discuss two ramifications of the crash-stop abstraction.

It is usual to devise algorithms that implement a given distributed programming abstraction, say, some form of agreement, provided that only a limited number f of processes are faulty, which might be a minority of the processes or all processes up to one. Assuming a bound on the number of faulty processes in the form of a parameter f means that any number of processes up to f may fail, but not that f processes actually exhibit such faults in every execution. The relation between the number f of potentially faulty processes and the total number N of processes in the system is generally called resilience.
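The two bounds mentioned above (a minority of the processes, or all processes up to one) can be written down as a small sketch. The function names are hypothetical and chosen only for illustration; they do not come from the book.

```python
def max_faulty_minority(n: int) -> int:
    """Largest f such that the faulty processes remain a strict minority (f < n/2)."""
    if n < 1:
        raise ValueError("a system needs at least one process")
    return (n - 1) // 2


def max_faulty_all_but_one(n: int) -> int:
    """Largest f such that at least one process stays correct (f = n - 1)."""
    if n < 1:
        raise ValueError("a system needs at least one process")
    return n - 1


def within_bound(crashed: int, f: int) -> bool:
    """The assumption holds in an execution if at most f processes crash."""
    return 0 <= crashed <= f
```

Note that `within_bound(0, f)` is true for any f of at least zero: the parameter f is an upper bound on the number of faults, not a number of faults that must occur in every execution.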

It is important to understand here that such an assumption does not mean that the hardware underlying these processes is supposed to operate correctly forever. In fact, the assumption means that in every execution of an algorithm that relies on that abstraction, it is very unlikely that more than f processes crash during the lifetime of that very execution. An engineer picking such an algorithm for a given application should be confident that the chosen elements underlying the software and hardware architecture make that assumption plausible. In general, it is also a good practice, when devising algorithms that implement a given distributed abstraction under certain assumptions, to determine precisely which properties of the abstraction are preserved and which can be violated when a specific subset of the assumptions is not satisfied, e.g., when more than f processes crash.

By considering the crash-stop process abstraction, one assumes that a process executes its algorithm correctly, but may crash at some time; after a process has crashed, it never recovers. That is, once it has crashed, the process does not ever perform any step again. Obviously, in practice, processes that crash can be restarted and hence may recover. In fact, it is usually desirable that they do. But with the crash-stop abstraction, a recovered process is no longer part of the system.

It is also important to notice that, in practice, the crash-stop process abstraction neither precludes the possibility of recovery nor does it mean that recovery should be prevented for a given algorithm (assuming a crash-stop process abstraction) to behave correctly. It simply means that the algorithm should not rely on some of the processes to recover in order to pursue its execution. These processes might not recover, or might recover only after a long period encompassing the crash detection and then the restarting delay. In some sense, an algorithm that is not relying on crashed processes to recover would typically be faster than an algorithm relying on some of the processes to recover (we will discuss this issue in the next section).

Nothing prevents recovered processes from getting informed about the outcome of the computation, however, and from participating again in subsequent instances of the distributed algorithm.

Unless explicitly stated otherwise, we will assume the crash-stop process abstraction throughout this book.

2.2.3 Omissions

A more general kind of fault is an omission fault (Fig. 2.3). An omission fault occurs when a process does not send (or receive) a message that it is supposed to send (or receive) according to its algorithm. In general, omission faults are due to buffer overflows or network congestion that cause messages to be lost. With an omission, the process deviates from the algorithm assigned to it by dropping some messages that should have been exchanged with other processes.

Omission faults are not discussed further in this book, except through the related notion of crash-recovery faults, introduced next.

2.2.4 Crashes with Recoveries

Sometimes, the assumption that particular processes never crash is simply not plausible for certain distributed environments. For instance, assuming that a majority of the processes do not crash might simply be too strong, even if the processes are only required to stay up during the period until an algorithm execution terminates.

An interesting alternative in this case is the crash-recovery process abstraction; we also talk about a crash-recovery fault (Fig. 2.3). In this case, we say that a process is faulty if either the process crashes and never recovers or the process keeps crashing and recovering infinitely often. Otherwise, the process is said to be correct.

Basically, such a process is eventually always up and running (as far as the lifetime of the algorithm execution is concerned). A process that crashes and recovers a finite number of times is correct in this model.

According to the crash-recovery abstraction, a process can crash and stop sending messages, but might recover later. This can be viewed as an omission fault, with one exception, however: a process might suffer amnesia when it crashes and lose its internal state. This significantly complicates the design of algorithms because, upon recovery, the process might send new messages that contradict messages that the process might have sent prior to the crash. To cope with this issue, we sometimes assume that every process has, in addition to its regular volatile memory, a stable storage (also called a log), which can be accessed through store and retrieve operations.
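As a minimal sketch of what a stable storage with store and retrieve operations could look like, the class below keeps the process state in a file and replaces that file atomically. The class name, the JSON encoding, and the file layout are assumptions made for illustration; the text does not prescribe any particular implementation, and real durability guarantees also depend on the file system and hardware.

```python
import json
import os
import tempfile


class StableStorage:
    """Hypothetical file-backed log offering store and retrieve operations."""

    def __init__(self, path: str) -> None:
        self.path = path

    def store(self, state: dict) -> None:
        # Write to a temporary file first, then atomically rename it, so that a
        # crash during store leaves either the old or the new state on disk.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, self.path)

    def retrieve(self) -> dict:
        # Return the last stored state, or an empty state if nothing was stored.
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
```

A process would call `store` before sending a message whose content it must not contradict later, and `retrieve` when it recovers.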

Upon recovery, we assume that a process is aware that it has crashed and recovered. In particular, a specific ⟨ Recovery ⟩ event is assumed to be automatically generated by the runtime environment whenever the process recovers, in a similar manner to the ⟨ Init ⟩ event that is generated whenever a process starts executing some algorithm. The processing of the ⟨ Recovery ⟩ event should, for instance, retrieve the relevant state of the process from stable storage before the processing of other events is resumed. The process might, however, have lost all the remaining data that was preserved in volatile memory. This data should thus be properly reinitialized. The ⟨ Init ⟩ event is considered atomic with respect to recovery.

More precisely, if a process crashes in the middle of its initialization procedure and recovers, say, without having finished the procedure properly, the process resumes again with processing the initialization procedure and then continues to process the ⟨ Recovery ⟩ event.
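The handling of the two events can be illustrated with a toy simulation. This is a sketch under stated assumptions: the `Process` class, its method names, and the use of a plain dictionary in place of stable storage are all hypothetical, and the crash is simulated by wiping the volatile state.

```python
class Process:
    """Toy process with volatile state and a dict standing in for stable storage."""

    def __init__(self, stable: dict) -> None:
        self.stable = stable      # survives crashes (assumption: provided by the runtime)
        self.volatile = None      # lost on every crash
        self.on_init()

    def on_init(self) -> None:
        # <Init>: set up both kinds of state when the algorithm starts.
        self.stable.setdefault("delivered", [])
        self.volatile = {"pending": []}

    def deliver(self, msg: str) -> None:
        # Record the message in stable storage so that it survives a crash.
        self.stable["delivered"].append(msg)

    def crash(self) -> None:
        # Simulated crash: volatile memory is wiped, stable storage is untouched.
        self.volatile = None

    def on_recovery(self) -> None:
        # <Recovery>: reinitialize the volatile data; the logged state is
        # still available in self.stable.
        self.volatile = {"pending": []}


log = {}
p = Process(log)
p.deliver("m1")
p.volatile["pending"].append("m2")
p.crash()
p.on_recovery()
```

After the recovery, `p.stable["delivered"]` still contains `"m1"`, whereas the pending message `"m2"`, which was held only in volatile memory, is gone.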

In some sense, a crash-recovery kind of failure matches an omission fault if we consider that every process stores every update to any of its variables in stable storage. This is not very practical because access to stable storage is usually expensive (as there is a significant delay in accessing it). Therefore, a crucial issue in devising algorithms with the crash-recovery abstraction is to minimize the access to stable storage.

One way to alleviate the need for accessing any form of stable storage is to assume that some of the processes never crash (during the lifetime of an algorithm execution). This might look contradictory with the actual motivation for introducing the crash-recovery process abstraction in the first place. In fact, there is no contradiction, as we explain later. As discussed earlier, with crash-stop faults, some distributed-programming abstractions can be implemented only under the assumption that a certain number of processes never crash, say, a majority of the processes participating in the computation, e.g., four out of seven processes. This assumption might be considered unrealistic in certain environments. Instead, one might consider it more reasonable to assume that at least two processes do not crash during the execution of an algorithm. (The rest of the processes would indeed crash and recover.) As we will discuss later in the book, such an assumption makes it sometimes possible to devise algorithms assuming the crash-recovery process abstraction without any access to a stable storage. In fact, the processes that do not crash implement a virtual stable storage abstraction, and the algorithm can exploit this without knowing in advance which of the processes will not crash in a given execution.

At first glance, one might believe that the crash-stop abstraction can also capture situations where processes crash and recover, by simply having the processes change their identities upon recovery. That is, a process that recovers after a crash would behave with respect to the other processes as if it were a different process that was simply not performing any action. This could easily be implemented by a recovery procedure which initializes the process state as if it just started its execution and also changes the identity of the process. Of course, this process should be updated with any information it might have missed from others, as if it did not receive that information yet. Unfortunately, this view is misleading, as we explain next. Again, consider an algorithm devised using the crash-stop process abstraction, and assuming that a majority of the processes never crash, say at least four out of a total of seven processes composing the system. Consider, furthermore, a scenario where four processes do indeed crash, and one process recovers. Pretending that the latter process is a different one (upon recovery) would mean that the system is actually composed of eight processes, five of which should not crash. The same reasoning can then be made for this larger number of processes. However, a fundamental assumption that we build upon is that the set of processes involved in any given computation is static, and the processes know of each other in advance.
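The arithmetic of this scenario can be checked with a short sketch. The helper `majority` and the variable names are hypothetical. The last step additionally assumes that the process with the new identity never crashes again, which is the most favorable case for the renaming argument.

```python
def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n."""
    return n // 2 + 1


n = 7
required_correct = majority(n)                # 4 of 7 must never crash
crashed = 4
recovered = 1

# Crash-stop view with renaming: the recovered process counts as a new one.
n_renamed = n + recovered                     # 8 processes in total
required_renamed = majority(n_renamed)        # 5 of 8 must never crash
never_crashed = n - crashed                   # 3 original processes
correct_renamed = never_crashed + recovered   # 4, if the new identity never crashes

assumption_holds = correct_renamed >= required_renamed
```

Here `assumption_holds` evaluates to `False`: even in the most favorable case, only four of the eight identities never crash, one short of the five that a majority requires.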

A tricky issue with the crash-recovery process abstraction is the interface between software modules. Assume that some module of a process, involved in the implementation of some specific distributed abstraction, delivers some message or decision to the upper layer (say, the application layer), and subsequently the process hosting the module crashes. Upon recovery, the module cannot determine if the upper layer (i.e., the application) has processed the message or decision before crashing or not. There are at least two ways to deal with this issue:

1. One solution is to change the interface between modules. Instead of delivering a message or a decision to the upper layer (e.g., the application layer), the module may instead store the message or the decision in stable storage, which can also be accessed by the upper layer. The upper layer should subsequently access the stable storage and consume the delivered information.

2. A different approach consists in having the module periodically deliver a message or a decision to the upper layer until the latter explicitly asks for the stopping of the delivery. That is, the distributed programming abstraction implemented by the module is responsible for making sure the application will make use of the delivered information. Of course, the application layer needs to filter out duplicates in this case.

For the algorithms in this book that address crash-recovery faults, we generally adopt the first solution (see the logged perfect links abstraction in Sect. 2.4.5 for an example).
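The two interface styles can be contrasted in a small sketch. Both classes and their method names are hypothetical, a plain list stands in for stable storage, and message identifiers are assumed to be unique. In a real crash-recovery setting, the set of identifiers already seen would itself have to survive crashes.

```python
class LoggedDelivery:
    """First approach: the module stores deliveries in a log shared with the upper layer."""

    def __init__(self, log: list) -> None:
        self.log = log            # stands in for stable storage (assumption)

    def deliver(self, msg_id: str, payload: str) -> None:
        # The upper layer later reads and consumes entries from the same log.
        self.log.append((msg_id, payload))


class DeduplicatingApplication:
    """Second approach: the upper layer filters repeated deliveries by message id."""

    def __init__(self) -> None:
        self.seen = set()
        self.processed = []

    def on_deliver(self, msg_id: str, payload: str) -> bool:
        # Process each message id once; the return value True serves as the
        # explicit request to stop redelivering msg_id.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(payload)
        return True


log = []
LoggedDelivery(log).deliver("m1", "decide 42")

app = DeduplicatingApplication()
for _ in range(3):                # the module redelivers the same message three times
    app.on_deliver("m1", "decide 42")
```

With the second approach, the three deliveries of `"m1"` result in a single entry in `app.processed`.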

2.2.5 Eavesdropping Faults

When a distributed system operates in an untrusted environment, some of its components may become exposed to an adversary or even fall under its control. A relatively benign form of adversarial action occurs when a process leaks information obtained in an algorithm to an outside entity. The outsider may eavesdrop on multiple processes in this way and correlate all leaked pieces of information with each other.

Faults of this kind threaten the confidentiality of the data handled by an algorithm, such as the privacy of messages that are disseminated by a broadcast algorithm or the secrecy of data written to a storage abstraction. We call this an eavesdropping fault of a process.

As the example of attacks mounted by remote adversaries against machines connected to the Internet shows, such eavesdropping faults occur in practice. An eavesdropping fault cannot be detected by observing how an affected process behaves in an algorithm, as the process continues to perform all actions according to its instructions. The adversary merely reads the internal state of all faulty processes.

In practice, however, the eavesdropper must run some code on the physical machine that hosts the faulty process, in order to mount the attack, and the presence of such code can be detected and will raise suspicion. Eavesdropping faults typically affect communication links before they affect the processes; hence, one usually assumes that if any process is susceptible to eavesdropping faults then all communication links are also affected by eavesdropping and leak all messages to the adversary.

Eavesdropping can be prevented by cryptography, in particular by encrypting communication messages and stored data. Data encryption is generally orthogonal to the problems considered in this book, and confidentiality plays no significant role in implementing our distributed programming abstractions. Therefore, we will not consider eavesdropping faults any further here, although confidentiality and privacy are important for many secure distributed programs in practice.

2.2.6 Arbitrary Faults

A process is said to fail in an arbitrary manner if it may deviate in any conceivable way from the algorithm assigned to it. The arbitrary-fault behavior is the most general one. When we use it, we make no assumptions on the behavior of faulty processes, which are allowed any kind of output and, therefore, can send any kind of message. Such failures are also called Byzantine for historical reasons (see the notes at the end of this chapter) or malicious failures. The terms “arbitrary faulty” and “Byzantine” are synonyms throughout this book. We model a process that may behave arbitrarily as an arbitrary-fault process abstraction or a Byzantine process abstraction.

Not surprisingly, arbitrary faults are the most expensive to tolerate, but this is the only acceptable option when unknown or unpredictable faults may occur. One also considers them when the system is vulnerable to attacks, where some of its processes may become controlled by malicious users that deliberately try to prevent correct system operation.

Similar to the case of eavesdropping faults, one can simplify reasoning about arbitrary faults by assuming the existence of one determined adversary that coordinates the actions of all faulty processes. Whenever we consider algorithms with Byzantine processes, we also allow this adversary to access the messages exchanged over any communication link, to read messages, modify them, and insert messages of its own. In practice, a remote attacker may take over control of the physical machine that hosts the faulty process and not only read the state of a process but also completely determine the process’ behavior.

An arbitrary fault is not necessarily intentional and malicious: it can simply be caused by a bug in the implementation, the programming language, or the compiler. This bug can thus cause the process to deviate from the algorithm it was supposed to execute. Faults that are triggered by benign bugs can sometimes be detected, and their effects eliminated, by the process itself or by other processes, through double-checking of results and added redundancy. As arbitrary but nonmalicious events of this kind often appear to be random and follow a uniform distribution over all errors, verification of the data can use simple verification methods (such as cyclic redundancy checks). Against a determined adversary, these methods are completely ineffective, however. On the other hand, a system that protects against arbitrary faults with a malicious intention also defends against nonmalicious faults.
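The difference between random and adversarial corruption can be made concrete with a cyclic redundancy check. The sketch below uses the CRC-32 function from the Python standard library. The helper names `protect` and `verify` and the message contents are hypothetical.

```python
import zlib
from typing import Tuple


def protect(data: bytes) -> Tuple[bytes, int]:
    """Attach a CRC-32 checksum to the data."""
    return data, zlib.crc32(data)


def verify(data: bytes, checksum: int) -> bool:
    """Accept the data only if its CRC-32 matches the attached checksum."""
    return zlib.crc32(data) == checksum


data, crc = protect(b"transfer 10 units to p3")

# Nonmalicious fault: a single flipped bit in the payload is detected.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
random_fault_detected = not verify(corrupted, crc)

# Malicious fault: the adversary replaces the payload and recomputes the CRC.
forged, forged_crc = protect(b"transfer 99 units to p6")
forgery_accepted = verify(forged, forged_crc)
```

Both `random_fault_detected` and `forgery_accepted` are `True`: the checksum catches the accidental bit flip, but anyone can recompute it for a forged message because no secret is involved.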

Throughout this book, we consider only arbitrary faults of intentional and malicious nature. This gives the algorithms where processes are subject to arbitrary faults a robust notion of protection, because the given guarantees do not depend on the nature of and the intention behind an arbitrary fault. The added protection usually relies on cryptographic primitives, whose security properties may not be broken even by a determined adversary. Cryptographic abstractions are introduced in the next section.