In shared memory systems, since all processes communicate through the shared memory, each competing process may test certain shared variable(s) repeatedly while it is waiting to enter its critical region. Such repeated testing may produce a large amount of processor-to-memory traffic, heavily degrading system performance. This problem can be avoided in two architectural paradigms of shared memory systems: distributed shared memory (DSM) systems, in which each process has a local portion of shared memory, and cache coherent (CC) systems, in which each process has a local cache [37]. In DSM systems, a memory reference to a shared variable will not cause interconnect traffic if the variable is stored in the local portion of shared memory. In CC systems, whether a memory reference causes interconnect traffic depends on the caching protocol. Generally speaking, the first reference (be it read, write, or both) to a shared variable will cause interconnect traffic and establish a cached copy. Subsequent references, however, will not cause traffic until the cached copy of the shared variable is updated or invalidated. In general, a memory reference is regarded as local if it does not cause any interconnect traffic; otherwise, it is remote.
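The CC accounting described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the class name `CCSystem`, the variable name `"flag"`, and the write-invalidate policy are illustrative choices, not taken from any cited protocol, and a write to an already cached copy is counted as local, following the simplified accounting in the text.

```python
class CCSystem:
    """Toy cache-coherent model that counts remote memory references (RMRs).

    Assumption: write-invalidate caching, with a reference counted as remote
    only when the issuing process holds no valid cached copy.
    """

    def __init__(self, num_processes):
        # cached[p] is the set of variables for which process p holds a valid copy.
        self.cached = [set() for _ in range(num_processes)]
        self.memory = {}
        self.rmr = [0] * num_processes

    def _touch(self, p, var):
        # The first reference after a miss or invalidation is remote
        # and establishes a cached copy.
        if var not in self.cached[p]:
            self.rmr[p] += 1
            self.cached[p].add(var)

    def read(self, p, var):
        self._touch(p, var)
        return self.memory.get(var, 0)

    def write(self, p, var, value):
        self._touch(p, var)
        self.memory[var] = value
        # Invalidate the copies held by all other processes.
        for q, copies in enumerate(self.cached):
            if q != p:
                copies.discard(var)


cc = CCSystem(2)
for _ in range(1000):
    cc.read(0, "flag")      # process 0 spins on its cached copy
cc.write(1, "flag", 1)      # process 1 writes, invalidating that copy
cc.read(0, "flag")          # process 0 must fetch the variable again
print(cc.rmr)               # [2, 1]
```

Process 0 performs 1001 reads, but only two of them are remote: the first one and the one following the invalidation.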
Much work on the mutual exclusion problem has focused on the design of local-spin algorithms, which reduce the number of remote memory reference (RMR) steps by busy waiting only on locally accessible shared variables. A number of performance studies [6, 8, 26, 31, 37, 41] have shown that synchronization algorithms minimizing the number of RMR steps have the best performance.
To evaluate mutual exclusion algorithms, the conventional time complexity, which counts all steps for one process in the worst case, might be inappropriate. This is because in any algorithm in which a process enters a busy-waiting loop when its critical region is unavailable, the worst case number of steps taken by one waiting process is unbounded. In other words, the conventional time complexity yields no useful information concerning the performance of such algorithms. Since the number of RMR steps significantly reflects the performance of an algorithm, Anderson and Yang [7] were the first to propose the number of RMR steps as a time complexity metric. To be more specific, the RMR time complexity of a mutual exclusion algorithm is the worst case number of RMR steps taken by any single process to enter and exit its critical region once. One may consider the amortized number of RMR steps instead of the worst case number as the RMR time complexity of an algorithm. But, as Anderson and Yang did, we adopt the worst case number rather than the amortized one for the following reasons.
1. The worst case RMR time complexity of an algorithm can be easily analyzed by just inspecting the algorithm.
2. To achieve low amortized RMR time complexity, an algorithm may assign some process to service other processes. However, such a process is not treated equally with the others. This unfairness is revealed if we consider the worst case number.
Throughout the rest of this dissertation, the RMR time complexity means the worst case RMR time complexity.
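The contrast between counting all steps and counting only RMR steps can be made concrete with a toy DSM simulation. This is a sketch under assumptions: `DSMSystem`, `wait_then_enter`, the variable name `"go"`, and the process names are hypothetical, and the releasing process is simulated inline rather than scheduled concurrently.

```python
class DSMSystem:
    """Toy DSM model: each shared variable lives in one process's local memory."""

    def __init__(self, home):
        self.home = dict(home)                 # variable name -> owning process
        self.memory = {var: 0 for var in home}
        self.steps = {}                        # process -> all memory references
        self.rmr = {}                          # process -> remote references only

    def access(self, p, var, new_value=None):
        """Read var (and optionally write new_value); return the old value."""
        self.steps[p] = self.steps.get(p, 0) + 1
        if self.home[var] != p:
            self.rmr[p] = self.rmr.get(p, 0) + 1
        old = self.memory[var]
        if new_value is not None:
            self.memory[var] = new_value
        return old


def wait_then_enter(dsm, p, var, release_after):
    """Process p busy-waits on var; a hypothetical releaser sets var after
    release_after failed tests (release_after must be at least 1)."""
    failed = 0
    while dsm.access(p, var) == 0:
        failed += 1
        if failed == release_after:
            dsm.access("releaser", var, 1)


results = {}
for wait in (10, 1000):
    local = DSMSystem({"go": "p0"})    # spin variable stored at the waiter
    remote = DSMSystem({"go": "p1"})   # spin variable stored elsewhere
    wait_then_enter(local, "p0", "go", wait)
    wait_then_enter(remote, "p0", "go", wait)
    results[wait] = (local.steps["p0"], local.rmr.get("p0", 0),
                     remote.steps["p0"], remote.rmr.get("p0", 0))
print(results)
```

In this model the total step count of the waiting process grows with the waiting time in both placements, whereas its RMR count stays at zero when it spins on a locally stored variable.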
Known constant RMR time algorithms. Many mutual exclusion algorithms of constant RMR time complexity have been proposed in the literature; they use some read-modify-write primitives in addition to atomic read and write:
• Anderson [8] proposed a constant RMR time algorithm for CC systems using fetch&increment and fetch&add.
• Graunke and Thakkar [26] proposed a constant RMR time algorithm for CC systems using fetch&store.
• Mellor-Crummey and Scott [37] first proposed an algorithm (referred to as the MCS lock in the literature) for both CC and DSM systems using fetch&store and compare&swap.
• Craig [14], Magnusson et al. [36], and Huang and Lin [29] independently proposed the same constant RMR time algorithm using fetch&store. Craig presented variants of the algorithm for both CC and DSM systems, whereas the other two considered only CC systems.
• In recent work, Anderson and Kim [4] presented a generic constant RMR time algorithm for both CC and DSM systems using fetch&φ.
For more details of these algorithms, see the recent survey paper [5] of Anderson et al.
Because of these constant RMR time algorithms, the asymptotic tight bound on RMR time complexity is Θ(1). From a theoretical point of view, constant time is the best an algorithm can achieve in terms of RMR time complexity. Nevertheless, some researchers such as Fu and Tzeng [24, 30] continue to strive to minimize the number of RMR steps. We consider it worthwhile to reduce the number as much as possible. In practice, remote memory references are orders of magnitude slower than references to the local memory. Moreover, mutual exclusion is a basic synchronization mechanism frequently used in multiprocessing systems at both the operating system kernel level and the user application level [37]. Consequently, minimizing the number of RMR steps yields considerable performance improvement.
Our result for this direction of research is a tight bound on the number of RMR steps needed to solve the mutual exclusion problem in DSM systems. We prove that three is a lower bound on RMR time complexity. The lower bound is tight because it matches the upper bound of the algorithm proposed by Huang in ICDCS’99 [28]. (The algorithm is referred to as Huang’s algorithm throughout the rest of the dissertation.) We sketch a proof of the correctness of Huang’s algorithm in Section 3.3.2.
Huang’s algorithm is related to the MCS lock [37] and the CL algorithm by Fu and Tzeng [24, 30]. Fu and Tzeng tried to improve the MCS lock, whose RMR time complexity is four, and obtained a better algorithm in terms of the amortized RMR time complexity. But, in the CL algorithm, some process in its exit region (i.e., the code fragment after its critical region) may take an unbounded number of RMR steps for the purpose of scheduling other competing processes. Thus, the worst case number of RMR steps taken by some process is unbounded, i.e., the RMR time complexity is unbounded. Huang follows the line of their algorithm but eliminates the above drawback.
We prove the time bound in an asynchronous distributed shared memory model where processes communicate by means of a general RMW primitive. The general RMW primitive atomically accesses one shared variable, reading the value of the variable and writing back a new value according to the submitted function. Let V be the set of all possible values for the variable. The submitted function can be any function f : V → V. Hence, the general RMW primitive is a generalization of all atomic primitives that access at most one shared variable, and therefore the lower bound holds for any set of such primitives. In practice, almost all commonly available primitives implemented in multiprocessor systems (such as read/write, test&set, compare&swap, fetch&add, fetch&increment, fetch&store, and fetch&φ) access one shared variable. Thus, the general RMW primitive can be used to model these primitives. For instance, a read primitive is equivalent to the general RMW primitive with the identity function (write the same value as that returned by the read), and a write primitive is equivalent to the general RMW primitive with the constant function that always maps to the new value (write the new value and discard the returned value).
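To make the modeling concrete, the following sketch expresses several common primitives as instances of a general RMW operation. It is illustrative only: the names `SharedVar`, `rmw`, and the `prim_*` helpers are assumptions introduced here, and atomicity is simply assumed because the sketch runs in a single thread.

```python
class SharedVar:
    """One shared variable accessed only through a general RMW primitive."""

    def __init__(self, value):
        self.value = value

    def rmw(self, f):
        """Return the old value and store f(old value).

        Assumed atomic: this single-threaded sketch does no real synchronization.
        """
        old = self.value
        self.value = f(old)
        return old


def prim_read(x):
    return x.rmw(lambda v: v)                # identity function

def prim_write(x, new):
    x.rmw(lambda v: new)                     # constant function, result discarded

def prim_test_and_set(x):
    return x.rmw(lambda v: 1)                # set to 1, return the old value

def prim_fetch_and_add(x, delta):
    return x.rmw(lambda v: v + delta)

def prim_fetch_and_increment(x):
    return prim_fetch_and_add(x, 1)

def prim_fetch_and_store(x, new):
    return x.rmw(lambda v: new)              # like write, but the old value is kept

def prim_compare_and_swap(x, expected, new):
    old = x.rmw(lambda v: new if v == expected else v)
    return old == expected                   # True if the swap took place


x = SharedVar(0)
prim_write(x, 5)
old_add = prim_fetch_and_add(x, 3)           # returns 5, x becomes 8
swapped = prim_compare_and_swap(x, 8, 0)     # succeeds, x becomes 0
failed = prim_compare_and_swap(x, 8, 42)     # fails, x stays 0
old_tas = prim_test_and_set(x)               # returns 0, x becomes 1
print(old_add, swapped, failed, old_tas, prim_read(x))   # 5 True False 0 1
```

Each helper differs only in the function it submits, which is why a lower bound proved for the general RMW primitive carries over to any set of single-variable primitives.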
Known lower bounds on RMR time complexity. Several related lower bounds have been proved in the literature. All of these bounds are asymptotic.
Anderson and Yang [7] initiated a series of studies of lower bounds on RMR time complexity. They established a trade-off between the amount of contention, which was defined by Dwork et al. [19], and the RMR time complexity. The amount of contention of an algorithm is the maximum number of processes that are enabled to access the same shared variable simultaneously. Since our aim is minimizing the number of RMR steps, we focus on the RMR time complexity when contention may equal the number of all processes. Applying their result to the model with the general RMW primitive, we have that Ω(log_c n) RMR steps are required in both DSM and CC systems, where c is the amount of contention and n is the number of processes. Thus, the lower bound on RMR time complexity is Ω(1), a trivial bound, when contention is n. Then, Cypher [15] showed a lower bound of Ω(log log n / log log log n) on RMR time complexity in DSM and CC systems with only atomic read and write primitives. This result implies that there is no constant time mutual exclusion algorithm if only read and write are available. He went on to show that the lower bound holds even if conditional RMW primitives are available in addition to read and write. In a later work, Anderson and Kim [2] improved Cypher’s lower bound to Ω(log n / log log n). Cypher’s lower bound and the improved bound by Anderson and Kim hold for read, write, and conditional RMW primitives, whereas ours holds for all commonly available primitives that access at most one shared variable in an instruction.
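As a rough numerical illustration of how these bounds compare, the sketch below evaluates the growth functions inside the Ω(·) expressions. Hidden constants are ignored and base-2 logarithms are an assumption, so the values indicate growth rates only, not actual RMR counts; the function names are hypothetical.

```python
import math

def contention_tradeoff(n, c):
    """log_c n, the growth function of the contention trade-off."""
    return math.log(n) / math.log(c)

def cypher_growth(n):
    """log log n / log log log n with base-2 logarithms (requires n > 4)."""
    loglog = math.log2(math.log2(n))
    return loglog / math.log2(loglog)

def anderson_kim_growth(n):
    """log n / log log n with base-2 logarithms (requires n > 2)."""
    log_n = math.log2(n)
    return log_n / math.log2(log_n)

rows = []
for exponent in (16, 64, 256):
    n = 2 ** exponent
    rows.append((exponent,
                 contention_tradeoff(n, n),   # contention c = n gives 1
                 cypher_growth(n),
                 anderson_kim_growth(n)))

for row in rows:
    print("n = 2^%d: log_n n = %.1f, Cypher = %.2f, Anderson-Kim = %.2f" % row)
```

With c = n the trade-off function collapses to 1, matching the trivial Ω(1) bound, while the read/write bounds keep growing with n.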
In addition, Kim and Anderson [32] provided an RMR time complexity lower bound for adaptive mutual exclusion algorithms, in which the RMR time complexity is a function of the number of contending processes. They showed that for any k, there exists some n such that, for any n-process mutual exclusion algorithm based on read, write, or conditional RMW primitives, there exists an execution involving Θ(k) processes in which some process performs Ω(k) RMR steps to enter and exit its critical region. The result applies to both DSM and CC systems. In another paper [3], Anderson and Kim showed that for any n-process mutual exclusion algorithm based on non-atomic read and write, there exists an execution involving only one process in which that process performs Ω(log n / log log n) RMR steps in DSM systems to enter its critical region. Moreover, these RMR steps must access Ω(√(log n / log log n)) distinct remote shared variables, which implies that the process performs Ω(√(log n / log log n)) RMR steps in CC systems to enter its critical region.
Unlike the researchers who provided related lower bounds on the RMR time complexity, we establish a lower bound only for DSM systems; the lower bound proof herein is not applicable to CC systems. Future work is needed to establish the exact lower bound in CC systems.