Tight Bounds on Space and Remote Reference Time Complexity of Mutual Exclusion

National Chiao Tung University

Department of Computer Science

Doctoral Dissertation

Tight Bounds on Space and Remote Reference Time Complexity of Mutual Exclusion

Student: Sheng-Hsiung Chen

Advisor: Prof. Ting-Lu Huang

Tight Bounds on Space and Remote Reference Time Complexity of Mutual Exclusion

Student: Sheng-Hsiung Chen

Advisor: Ting-Lu Huang

National Chiao Tung University

Department of Computer Science

Doctoral Dissertation

A Dissertation

Submitted to the Department of Computer Science, College of Computer Science,

National Chiao Tung University, in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Computer Science

February 2008

Hsinchu, Taiwan, Republic of China

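Tight Bounds on Space and Remote Reference Time Complexity of Mutual Exclusion

Student: Sheng-Hsiung Chen

Advisor: Ting-Lu Huang

Ph.D. Program, Department of Computer Science, National Chiao Tung University

Abstract

Mutual exclusion is a fundamental problem in asynchronous shared memory systems, where it is used to manage system resources. This dissertation presents optimal solutions to the problem with respect to space usage and the number of remote memory references. In environments with time and resource constraints, such as embedded real-time systems, a mutual exclusion algorithm should be fair and should reduce memory usage. Several algorithms in the literature are fair and use only one shared variable. However, these algorithms are designed with hypothetical instructions that have never appeared in any system. Without using such instructions, we first present two fair algorithms that need only one additional shared variable. The instructions used are fetch&store and read/write, which are common in general systems. The first algorithm satisfies the bounded bypass condition. The second improves on the first so that it achieves FCFS fairness. The cost of the improved fairness is a larger shared variable: the shared variable has $2\log_2(n+1)$ bits in the first algorithm and $1 + 3\log_2(n+1)$ bits in the second, where $n$ is the number of processes. Furthermore, we prove that, with the same instructions, at least two shared variables are needed to achieve bounded-bypass fairness. The proposed algorithms are therefore optimal in the number of shared variables. For distributed shared memory systems, recent research has mainly focused on designing mutual exclusion algorithms that reduce the number of remote memory references. Frequent remote references generate heavy traffic between memory and processors and thereby degrade system performance. In this direction, we present a lower bound on the number of remote memory references. The assumed system is a distributed shared memory system that uses a general read-modify-write instruction. This general read-modify-write instruction is a generalized model of the atomic instructions, commonly found in systems, that access one shared variable at a time, so the proposed lower bound applies to all systems that use such instructions. Moreover, by the algorithm that Dr. Ting-Lu Huang proposed at ICDCS’99, this lower bound is tight.

Keywords: mutual exclusion, shared memory systems, embedded real-time systems, fairness, space complexity, time complexity, tight bounds.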

Tight Bounds on Space and Remote Memory Reference Time Complexity of Mutual Exclusion

Student: Sheng-Hsiung Chen

Advisor: Dr. Ting-Lu Huang

Department of Computer Science

National Chiao Tung University

ABSTRACT

The mutual exclusion problem is fundamental to resource allocation in asynchronous shared memory systems. In this dissertation we present mutual exclusion algorithms with fairness and the minimum number of shared variables, and then show a tight bound on remote reference time complexity.

For shared memory systems under time and memory constraints such as embedded real-time systems, a mutual exclusion mechanism that is both fair and space-efficient can be highly valuable. Several algorithms that utilize only one shared variable and guarantee a certain level of fairness have been proposed. However, these use hypothetical read-modify-write primitives that have never been implemented in any system. We present two fair algorithms that do not use such primitives, but each algorithm has one additional shared variable. The proposed algorithms employ commonly available primitives, fetch&store and read/write, on two shared variables. The first algorithm satisfies the bounded bypass condition. The second is an improvement on the first that satisfies the FCFS condition, which is the most stringent fairness condition. The improvement is at the cost of increasing the size of a shared variable from $2\log_2(n+1)$ bits to $1 + 3\log_2(n+1)$ bits, where $n$ is the number of processes. In addition, it is shown that achieving the bounded bypass condition using the same set of primitives requires two shared variables. Both of the algorithms are thus space-optimal in terms of the number of shared variables.

For distributed shared memory (DSM) systems, recent work on this problem has focused on the design of mutual exclusion algorithms that minimize the number of remote memory references, which generate processor-to-memory traffic and therefore may result in a bottleneck. We establish a lower bound of three on remote reference time complexity for mutual exclusion algorithms in a DSM model where processes communicate by means of a general read-modify-write primitive that accesses at most one shared variable in one instruction. Since the general read-modify-write primitive is a generalization of a variety of atomic primitives that have been implemented in multiprocessor systems, the lower bound holds for all mutual exclusion algorithms that use such primitives. Additionally, the lower bound is tight because it matches the upper bound of Huang’s algorithm proposed in ICDCS’99.

Key words: Mutual exclusion, shared memory systems, embedded real-time systems, fairness, space complexity, time complexity, tight bounds

Acknowledgment

I would like to express my sincere thanks to my advisor, Prof. Ting-Lu Huang, for his supervision and perspicacious advice. Special thanks are due to my committee members: Prof. Shih-Kun Huang, Prof. Chung-Ta King, Prof. Ce-Kuen Shieh, Prof. Shi-Chun Tsai, Prof. Yih-Kuen Tsay and Dr. Da-Wei Wang for their valuable comments and encouragement.

List of Figures

3.1 fetch&store, compare&swap and swap&compare primitives.

3.2 The MCS lock.

3.3 An execution of the MCS lock. An arrow from node p to node q indicates that process q has updated process p’s Next variable so that p is aware of the identity of its successor.

3.4 The CL algorithm.

3.5 An execution of the CL algorithm. A gray node indicates a process that has finished one life cycle. An upward arrow from a process points to the process’s predecessor, and a downward arrow from a process, which must be a controller, points to the tail of a waiting list for which the process is responsible.

3.6 Huang’s algorithm.

3.7 An execution of Huang’s algorithm in Fig. 3.6. A gray node indicates a process that has finished one life cycle. An upward arrow from a process points to the process’s predecessor, and a downward arrow from a process, which must be a controller, points to the tail of the waiting list for which the process is responsible. The label of a downward arrow from a process represents the permission word conveyed to the tail by the process.

4.1 An execution of the 2-bounded-bypass algorithm. A gray node indicates a process that has finished one life cycle. The symbol $\triangle$ denotes an arbitrary value. An arrow from process a to b represents that a has the identity of b.

4.2 The 2-bounded-bypass algorithm.

4.3 An execution of the FCFS algorithm. The notation is the same as that in Fig. 4.1.

4.4 The FCFS algorithm.

4.5 The execution for the proof of Theorem 4.7.

5.1 A goal execution extended from $\alpha_{ij}$ in which $\mathit{time}(i, \alpha_{ij}) \ge 2$.

5.2 A goal execution extended from either $\alpha_{ij}$ or $\alpha_{ik}$. We write $e$ to denote the RMR step from $i$ to $j$.

5.3 Shared variables for the proof of Claim 5.7.1.

5.4 Executions $\alpha_{ij}$ and $\alpha_{kj}$. Execution fragment $\alpha''$ ends with the first RMR step from $j$ to $i$.


Chapter 1 Introduction

The mutual exclusion problem [18] is fundamental in multiprocessing systems for managing access to a single indivisible resource. In mutual exclusion, a process accesses the resource within a distinct part of code known as the critical region. A process executes trying and exit regions, respectively, before and after executing the critical region, to guarantee the following basic requirements.

Mutual Exclusion: At most one process at a time is permitted to enter the critical region.

Progress: If at least one process is in the trying region and no process is in the critical region, then at some later point some process enters the critical region. In addition, if at least one process is in the exit region, then at some later point some process enters the rest of the code, called the remainder region.

The progress condition is necessary for the system to make any progress at all. However, an algorithm satisfying the condition does not guarantee that the critical region is granted fairly to different processes; for example, it allows one process to be repeatedly granted access to its critical region while other processes trying to gain access are forever prevented from doing so. This situation is known as lockout, or starvation. Stronger fairness conditions on granting the critical region have therefore been defined, several of which are enumerated in the following.

Lockout Freedom: A mutual exclusion algorithm is said to be lockout-free if no process can be kept waiting indefinitely either for the critical region or for the remainder region.

The next two conditions constrain the number of times that other processes may bypass a requesting process. To define such conditions, we need to specify when a process has made a request in its trying region. We adopt a widely used definition that assumes the trying region is composed of a doorway part and a waiting part. Only the order of entry to the waiting part of the trying region constrains the possible orders of entry to the critical region.

Bounded Bypass: A mutual exclusion algorithm is said to be bounded-bypass if it is b-bounded-bypass for some constant b. A mutual exclusion algorithm is defined to satisfy the b-bounded bypass property if no process that has finished its doorway can be bypassed more than b times by any other process when competing for a resource.

FCFS: The most stringent fairness condition is the first-come-first-served (FCFS) property: if a process i passes through its doorway before a process j performs a step in its doorway, then j cannot enter its critical region before i does so. Clearly, an FCFS algorithm is also bounded-bypass.
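The bounded bypass condition can be made concrete by checking it on a recorded execution. The following is a minimal sketch, not part of the dissertation: the event names `doorway_done` and `enter`, the function `max_bypass`, and the trace format are assumptions made for illustration. It counts, for every waiting process, how often each single other process enters the critical region ahead of it.

```python
from collections import defaultdict

def max_bypass(trace):
    """Return the largest number of times one process overtakes a waiting process.

    `trace` is a list of (event, process) pairs, where event is
    "doorway_done" (the process finished its doorway) or
    "enter" (the process entered its critical region).
    This event vocabulary is an assumption of the sketch.
    """
    waiting = set()               # processes past the doorway, not yet in the critical region
    bypass = defaultdict(int)     # (waiter, overtaker) -> count during the current wait
    worst = 0
    for event, p in trace:
        if event == "doorway_done":
            waiting.add(p)
        elif event == "enter":
            waiting.discard(p)
            # p's own wait is over; forget the counts recorded against p
            for key in [k for k in bypass if k[0] == p]:
                del bypass[key]
            # every process still waiting has just been bypassed by p
            for w in waiting:
                bypass[(w, p)] += 1
                worst = max(worst, bypass[(w, p)])
    return worst

trace = [
    ("doorway_done", "A"),
    ("doorway_done", "B"),
    ("enter", "B"),
    ("doorway_done", "B"),
    ("enter", "B"),
    ("enter", "A"),
]
print(max_bypass(trace))  # prints 2
```

In this trace, process B enters the critical region twice while A is waiting, so the execution is consistent with a b-bounded-bypass algorithm only for b ≥ 2.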

Starting with an algorithm by Dijkstra [18], early work on this problem focused on improving Dijkstra’s algorithm by guaranteeing the fairness conditions described above or by weakening the type of shared memory that is used [33, 17, 20, 34, 9, 38]. Given the increasing interest in embedded real-time systems, we point out that none of the previous algorithms is feasible for such systems, and propose suitable algorithms in Chapter 4.

In contrast, recent work on the mutual exclusion problem has focused on the design of algorithms that reduce the number of remote memory references, which may produce a large amount of processor-to-memory traffic in shared memory systems. For this direction of research, we show a tight bound on the number of remote memory references in Chapter 5.

1.1 Algorithms for Systems under Time and Memory Constraints

Embedded real-time systems, e.g., automotive control systems, mobile computing devices and home electronics, have received increasing interest in recent years. An algorithm for such systems should consider time and memory constraints. The time constraint imposes a deadline for each process in executing a particular job because the process often interacts with users or a dynamic environment. Additionally, embedded systems often have small memory (about 32–64 kBytes) since minimizing production costs, weight and power consumption are primary concerns in their designs [25, 42, 43]. As shown below, a mutual exclusion algorithm, in particular, should consider fairness and space efficiency.

Since a process can remain in the critical region for an arbitrarily long time, no algorithm can ensure that each waiting process will gain the permission to enter the critical region before its deadline. This creates an inherent difficulty in the mutual exclusion problem, especially for systems under the time constraint. Thus, algorithm designers attempt to improve the feasibility of mutual exclusion algorithms by designing them to grant the critical region fairly to each process. A mutual exclusion algorithm that satisfies the basic requirements may not guarantee such fairness. That is, a process may be indefinitely denied access to the critical region. Hence, the worst-case waiting time may be infinite even when each process always returns the resource quickly. A fair mutual exclusion algorithm tries to reduce the worst-case waiting time by scheduling requests fairly, and thereby improves the feasibility of the algorithm.

A space-efficient mutual exclusion algorithm largely focuses on reducing the memory consumption. This requirement is crucial for systems under the memory constraint. In terms of the space complexity, most n-process mutual exclusion algorithms in previous literature use at least n shared variables, as shown in surveys by Anderson et al. [5] and Raynal [39]. For systems with limited memory, an algorithm using a constant number of shared variables would be more suitable.

For systems under time and memory constraints, we provide two fair and space-efficient mutual exclusion algorithms in Chapter 4. A 2-bounded-bypass algorithm with two shared variables is first presented to show the basic idea. An FCFS algorithm, which is based on the first algorithm and uses the same number of shared variables, is then presented. The cost of improving the fairness from bounded bypass to FCFS is that the size of a shared variable increases from $2\log_2(n+1)$ bits to $1 + 3\log_2(n+1)$ bits, where $n$ denotes the number of processes.
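To give a feel for the two sizes, the short calculation below evaluates both formulas for a few values of n. It is an illustrative sketch only: the function name `shared_variable_bits` is hypothetical, and rounding $\log_2(n+1)$ up to a whole number of bits is an assumption that the text does not state.

```python
def shared_variable_bits(n):
    """Return (bounded_bypass_bits, fcfs_bits) for n processes.

    Assumption of this sketch: log2(n + 1) is rounded up to a whole
    number of bits, which equals n.bit_length() for n >= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    log_term = n.bit_length()     # ceil(log2(n + 1))
    return 2 * log_term, 1 + 3 * log_term

for n in (3, 15, 255):
    print(n, shared_variable_bits(n))
# prints:
# 3 (4, 7)
# 15 (8, 13)
# 255 (16, 25)
```

Under this assumption, even for 255 processes the larger variable fits in a single 32-bit word.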

In terms of fairness, both of the proposed algorithms satisfy bounded bypass, so that a process in either algorithm can roughly estimate its waiting time. (Note that an FCFS algorithm is also bounded-bypass.) For instance, in the 2-bounded-bypass algorithm, each of the other n − 1 processes can bypass a waiting process at most twice, so a process cannot be bypassed more than 2(n − 1) times in total after it requests the critical region. By contrast, a process might be bypassed without limit in an algorithm that does not satisfy bounded bypass, easily violating the deadline for executing a particular job.

In terms of space complexity, only two shared variables are used in each of the algorithms. Moreover, no dynamic memory allocation is needed when executing the algorithms, so the system overhead is reduced. Since mutual exclusion is a basic synchronization mechanism frequently used in multiprocessing systems at both the operating system kernel level and the user application level [37], the system performance can be significantly improved.

In addition to atomic read and write primitives, both of the algorithms use fetch&store, which atomically writes a value into a shared variable and returns the old value of the same variable. Burns and Lynch [11] showed that n shared variables are necessary to solve the n-process mutual exclusion problem if only read and write are available. Fich et al. [21] recently extended the linear lower bound to systems that support conditional read-modify-write (RMW) primitives, such as compare&swap. A primitive is said to be RMW provided that it reads the value of a shared variable and changes the value of the shared variable in a single step. An RMW primitive is said to be conditional provided that it changes the value of a variable only if the variable has a particular value. Hence, some primitives other than read/write and conditional RMW primitives are needed to decrease the space requirement. The fetch&store primitive is adopted to implement the algorithms since it is commonly supported in modern microprocessors, such as Intel and AMD processors, the Motorola 88000, and SPARC [40], and is also available in the ARM processor family [1]¹, which is arguably the most popular embedded architecture today. Thus, fetch&store improves the portability of the algorithms.

Several algorithms that use only a single shared variable and guarantee a certain level of fairness have been presented. For instance, Fischer et al. [23] devised an FCFS algorithm, and Burns et al. [10] devised a bounded-bypass algorithm and a lockout-free algorithm. Unfortunately, all of these algorithms use hypothetical RMW primitives that have never been implemented in any system. In contrast, none of the algorithms we propose uses a hypothetical RMW operation, and each of them requires only one more shared variable than these algorithms.

The proposed algorithms are inspired by the circular list-based mutual exclusion algorithm presented by Fu and Tzeng [24, 30]. (Fu and Tzeng’s algorithm is referred to as the CL algorithm throughout the rest of the dissertation.) The proposed algorithms, like the CL algorithm, organize waiting processes into lists, but pass the permission within and among lists very differently. The CL algorithm may block a process in the exit region; the proposed algorithms eliminate this drawback. Whereas Fu and Tzeng reduced the number of remote memory references, our algorithms target the space complexity and guarantee a certain level of fairness.

Furthermore, we prove that two shared variables are necessary to solve the mutual exclusion problem with b-bounded bypass for any constant b using only fetch&store and read/write. This impossibility result is proven by showing a more general result, that two object instances are required to implement a bounded-bypass mutual exclusion algorithm when using only historyless objects, regardless of the size of the objects. The definition of a historyless object is given by Fich et al. [22] and is restated in Section 4.3. According to the definition, shared variables associated with fetch&store and read/write belong to the class of historyless objects, so the more general result implies the proposed algorithms are space-optimal. Informally, an object is historyless if applying a sequence of operations yields the same value in the object as applying just the last nontrivial operation in the sequence. A nontrivial operation is one that writes a value to the object.

¹ The ARM processor provides the SWP instruction, which performs the same functionality as fetch&store.
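The informal definition of a historyless object can be checked on small operation sequences. The sketch below is an illustration under stated assumptions rather than the formal definition restated in Section 4.3: the helper names `apply_ops` and `last_nontrivial` and the tuple encoding of operations are hypothetical. It shows that a variable accessed by read, write and fetch&store behaves historylessly, whereas one accessed by fetch&add does not.

```python
def apply_ops(initial, ops):
    """Apply (name, arg) operations in order and return the final value."""
    value = initial
    for name, arg in ops:
        if name == "read":
            pass                      # trivial: never changes the value
        elif name in ("write", "fetch&store"):
            value = arg               # overwrites regardless of the old value
        elif name == "fetch&add":
            value = value + arg       # the result depends on the old value
        else:
            raise ValueError(name)
    return value

def last_nontrivial(ops):
    """Keep only the last operation that is not a read."""
    nontrivial = [op for op in ops if op[0] != "read"]
    return nontrivial[-1:]

# read, write and fetch&store: only the last nontrivial operation matters.
historyless_ops = [("write", 5), ("read", None), ("fetch&store", 9), ("read", None)]
print(apply_ops(0, historyless_ops), apply_ops(0, last_nontrivial(historyless_ops)))  # prints 9 9

# fetch&add: earlier operations leave a trace, so the object is not historyless.
counter_ops = [("fetch&add", 1), ("fetch&add", 1)]
print(apply_ops(0, counter_ops), apply_ops(0, last_nontrivial(counter_ops)))  # prints 2 1
```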

The lower bound proof technique is related to an elegant method introduced by Burns and Lynch in proving the lower bound of n on the number of read/write objects required to solve the n-process mutual exclusion problem [11]. Their method, called the covering argument, targets read/write objects and is generalized herein to historyless objects.

1.2 Algorithms for Systems Whose Memory Has Locality

In shared memory systems, since all processes communicate through the shared memory, each competing process may test certain shared variable(s) repeatedly while it is waiting to enter its critical region. Such repeated testing may produce a large amount of processor-to-memory traffic in shared memory systems, heavily degrading the system performance. This problem can be avoided in two architectural paradigms of shared memory systems: distributed shared memory (DSM) systems, in which each process has a local portion of shared memory, and cache coherent (CC) systems, in which each process has a local cache [37]. In DSM systems, a memory reference to a shared variable will not cause interconnect traffic if the variable is stored in the local portion of shared memory. In CC systems, whether a memory reference causes interconnect traffic depends on the caching protocol. Generally speaking, the first reference (be it read, write, or both) to a shared variable will cause interconnect traffic and establish a cached copy. Subsequent references, however, will not cause traffic until the cached copy of the shared variable is updated or invalidated. In general, a memory reference is regarded as local if it does not cause any interconnect traffic; otherwise, it is remote.

Much work on the mutual exclusion problem has focused on the design of local-spin algorithms, which reduce the number of remote memory reference (RMR) steps by busy waiting only on locally accessible shared variables. A number of performance studies [6, 8, 26, 31, 37, 41] have shown that synchronization algorithms minimizing the number of RMR steps have the best performance.
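The difference between local and remote spinning in a DSM system can be illustrated by simple accounting. The following sketch is an assumption-laden toy model, not an algorithm from the dissertation: the class `DSMMemory`, the helper `spin`, and the variable names are hypothetical, and busy waiting is replaced by a fixed number of polls. A reference is charged as remote exactly when the variable is stored in another process's portion of shared memory.

```python
class DSMMemory:
    """Shared variables, each stored in the local memory of one home process."""

    def __init__(self, home):
        self.home = dict(home)                 # variable name -> home process
        self.value = {v: 0 for v in home}
        self.rmr = {}                          # process -> number of remote references

    def _account(self, proc, var):
        if self.home[var] != proc:             # remote exactly when the variable lives elsewhere
            self.rmr[proc] = self.rmr.get(proc, 0) + 1

    def read(self, proc, var):
        self._account(proc, var)
        return self.value[var]

    def write(self, proc, var, new):
        self._account(proc, var)
        self.value[var] = new

def spin(mem, proc, var, polls):
    """Poll `var` a fixed number of times, standing in for a busy-wait loop."""
    for _ in range(polls):
        mem.read(proc, var)

mem = DSMMemory({"flag_p": "p", "flag_q": "q"})
spin(mem, "p", "flag_p", polls=1000)   # p spins on its own variable: local
spin(mem, "q", "flag_p", polls=1000)   # q spins on p's variable: remote
mem.write("q", "flag_p", 1)            # one remote write, e.g. to release p
print(mem.rmr.get("p", 0), mem.rmr.get("q", 0))  # prints 0 1001
```

The process that spins locally incurs no remote references no matter how long it waits, which is why the RMR count stays bounded for local-spin algorithms even though the number of steps does not.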

To evaluate mutual exclusion algorithms, the conventional time complexity, which counts all steps for one process in the worst case, might be inappropriate. This is because in any algorithm in which a process enters a busy-waiting loop when its critical region is unavailable, the worst-case number of steps taken by one waiting process is unbounded. In other words, the conventional time complexity yields no useful information concerning the performance of such algorithms. Since the number of RMR steps significantly reflects the performance of an algorithm, Anderson and Yang [7] were the first to propose the number of RMR steps as a time complexity metric. To be more specific, the RMR time complexity of a mutual exclusion algorithm is the worst-case number of RMR steps taken by any single process to enter and exit its critical region once. One may consider the amortized number of RMR steps instead of the worst-case number as the RMR time complexity of an algorithm. But, as Anderson and Yang did, we adopt the worst-case number rather than the amortized one for the following reasons.

1. The worst-case RMR time complexity of an algorithm can be easily analyzed by just inspecting the algorithm.

2. To achieve low amortized RMR time complexity, an algorithm may assign some process to service other processes. However, such a process is not treated equally. This unfairness is revealed if we consider the worst-case number.

Throughout the rest of this dissertation, the RMR time complexity means the worst-case RMR time complexity.

Known constant RMR time algorithms. In the literature, many mutual exclusion algorithms of constant RMR time complexity have been proposed that use some read-modify-write primitives in addition to atomic read and write:

• Anderson [8] proposed a constant RMR time algorithm for CC systems using fetch&increment and fetch&add.

• Graunke and Thakkar [26] proposed a constant RMR time algorithm for CC systems using fetch&store.

• Mellor-Crummey and Scott [37] first proposed an algorithm (referred to as the MCS lock in literature) for both CC and DSM systems using fetch&store and compare&swap.

• Craig [14], Magnusson et al. [36], and Huang and Lin [29] independently proposed the same constant time algorithm with fetch&store. Craig presented variants of the algorithm for both CC and DSM systems, while the other two considered only CC systems.

• In recent work, Anderson and Kim [4] presented a generic constant RMR time algorithm for both CC and DSM systems using fetch&φ.

For more details of these algorithms, see the recent survey paper [5] of Anderson et al.

Because of these constant RMR time algorithms, the asymptotic tight bound on RMR time complexity is Θ(1). From a theoretical point of view, constant time is the best an algorithm can achieve in the RMR time complexity. Nevertheless, some researchers such as Fu and Tzeng [24, 30] continue to strive for minimizing the number of RMR steps. We consider it worthwhile to reduce the number as much as possible. In practice, remote memory references are orders of magnitude slower than references to the local memory. And mutual exclusion is a basic synchronization mechanism frequently used in multiprocessing systems both at the operating system kernel level and the users’ application level [37]. Consequently, minimizing the number of RMR steps yields considerable performance improvement.

Our result for this direction of research is a tight bound on the number of RMR steps needed to solve the mutual exclusion problem in DSM systems. We prove that three is a lower bound on RMR time complexity. The lower bound is tight because it matches the upper bound of the algorithm proposed by Huang in ICDCS’99 [28]. (The algorithm is referred to as Huang’s algorithm throughout the rest of the dissertation.) We sketch a proof of the correctness of Huang’s algorithm in Section 3.3.2.

Huang’s algorithm is related to the MCS lock [37] and the CL algorithm by Fu and Tzeng [24, 30]. Fu and Tzeng tried to improve the MCS lock, whose RMR time complexity is four, and obtained a better algorithm in terms of the amortized RMR time complexity. But, in the CL algorithm, some process in its exit region (i.e., the code fragment after executing its critical region) may take an unbounded number of RMR steps for the purpose of scheduling other competing processes. Thus, the worst-case number of RMR steps taken by some process is unbounded, i.e., the RMR time complexity is unbounded. Huang follows the line of their algorithm but eliminates the above drawback.

We prove the time bound in an asynchronous distributed shared memory model where processes communicate by means of a general RMW primitive. The general RMW primitive atomically accesses one shared variable, reading the value of the variable and writing back a new value according to the submitted function. Let V be the set of all possible values for the variable. The submitted function can be any function f : V → V. Hence, the general RMW primitive is a generalization of all atomic primitives that access at most one shared variable, and therefore the lower bound holds for any set of such primitives. In practice, almost all commonly available primitives implemented in multiprocessor systems, such as read/write, test&set, compare&swap, fetch&add, fetch&increment, fetch&store and fetch&φ, access one shared variable. Thus, the general RMW primitive can be used to model these primitives. For instance, a read primitive is equivalent to the general RMW primitive with the identity function (write the same value as that returned by the read), and a write primitive is equivalent to the general RMW primitive with the constant function that always maps to the new value (write the new value and discard the returned value).

Known lower bounds on RMR time complexity. Several related lower bounds have been proved in the literature. All of these bounds are asymptotic. Anderson and Yang [7] initiated a series of studies of lower bounds on RMR time complexity. They established a trade-off between the amount of contention, which was defined by Dwork et al. [19], and the RMR time complexity. The amount of contention of an algorithm is the maximum number of processes that are enabled to access the same shared variable simultaneously. Since our aim is minimizing the number of RMR steps, we focus on the RMR time complexity when contention may equal the number of all processes. Applying their result to the model with the general RMW primitive, we have that $\Omega(\log_c n)$ RMR steps are required in both DSM and CC systems, where $c$ is the amount of contention and $n$ is the number of processes. Thus, the lower bound on RMR time complexity is $\Omega(1)$, a trivial bound, when contention is $n$. Then, Cypher [15] showed a lower bound of $\Omega(\log\log n / \log\log\log n)$ on RMR time complexity in DSM and CC systems with only atomic read and write primitives. This result implies that there is no constant time mutual exclusion algorithm if only read and write are available. He went on to show that the lower bound holds even if conditional RMW primitives are available in addition to read and write. In a later work, Anderson and Kim [2] improved Cypher’s lower bound to $\Omega(\log n / \log\log n)$. Cypher’s lower bound and the improved bound by Anderson and Kim hold for read, write and conditional RMW primitives, whereas ours holds for all commonly available primitives that access at most one shared variable in an instruction.
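The step from the contention trade-off to the trivial bound is a one-line substitution, written out here for the case where the contention equals the number of processes:

```latex
\[
  c = n \;\Longrightarrow\; \Omega(\log_c n) = \Omega(\log_n n) = \Omega(1).
\]
```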

In addition, Kim and Anderson [32] provided an RMR time complexity lower bound for adaptive mutual exclusion algorithms, in which the RMR time complexity is a function of the number of contending processes. They showed that for any k, there exists some n such that, for any n-process mutual exclusion algorithm based on read, write or conditional RMW primitives, there exists an execution involving Θ(k) processes in which some process performs Ω(k) RMR steps to enter and exit its critical region. The result applies to both DSM and CC systems. In another paper [3], Anderson and Kim showed that for any n-process mutual exclusion algorithm based on non-atomic read and write, there exists an execution involving only one process in which that process performs $\Omega(\log n / \log\log n)$ RMR steps in DSM systems to enter its critical region. Moreover, these RMR steps must access $\Omega(\sqrt{\log n / \log\log n})$ distinct remote shared variables, which implies that the process performs $\Omega(\sqrt{\log n / \log\log n})$ RMR steps in CC systems to enter its critical region.

Unlike the researchers who provided related lower bounds on the RMR time complexity, we establish a lower bound only for DSM systems; the lower bound proof herein is not applicable to CC systems. Future work is needed to establish the exact lower bound in CC systems.

1.3 Contributions

In summary, we first provide two fair and space-efficient algorithms for shared memory systems without resorting to any hypothetical primitive, and also show that the proposed algorithms are space-optimal in terms of the number of shared variables, making them highly valuable for systems under time and memory constraints.

We then sharpen the tight bound on the RMR time complexity of mutual exclusion algorithms in DSM systems from the asymptotic Θ(1) to exactly three. From the complexity-theoretic point of view, this may not be so surprising, but the result is important for algorithm designers. Over the last 15 years, work on mutual exclusion algorithms for shared memory systems has focused on minimizing the number of remote memory references [14, 24, 28, 30, 37]. The tight bound shows that no algorithm can improve on Huang’s algorithm in terms of this number.

1.4 Organization

The rest of this dissertation is organized as follows. Chapter 2 provides the system models and definitions of the mutual exclusion problem. Chapter 3 reviews the MCS lock, the CL algorithm and Huang’s algorithm, which inspire our algorithms. Chapter 4 presents the space-optimal mutual exclusion algorithms for systems under time and memory constraints. Chapter 5 presents the tight bound on the RMR time complexity in DSM systems. Finally, Chapter 6 draws conclusions and outlines future directions for this research.

Chapter 2 System Models and Definitions

The purpose of this chapter is to introduce the formal models that are adopted. We first describe the shared memory model and then extend it to the distributed shared memory model. The only difference between these two models is that the shared memory of the latter has locality. The shared memory model is adopted in Chap. 4, where a tight bound on the number of shared variables is provided, while the distributed shared memory model is used in Chap. 5 to present a tight bound on the number of remote memory references.

In addition, the mutual exclusion problem is formally defined, and an indistinguishability relation is defined in order to prove the impossibility results in this dissertation.

2.1 Shared Memory Model

The model of an asynchronous shared memory system is based on the model described by Lynch in [35].

An algorithm in a shared memory system is modelled as a triple (P, V, δ), where P is a nonempty finite set of processes, V is a nonempty finite set of shared variables, and δ is a transition relation for the entire system.

Each shared variable $v \in V$ has an associated set of values, among which some are designated as the initial values, $I_v$.

Each process $i \in P$ is associated with a kind of state machine consisting of the following components:

• $\Sigma_i$: a (possibly infinite) set of states;

• $I_i$: a subset of $\Sigma_i$, indicating the initial states;

• $\Pi_i$: the set $\{(v, f)_i \mid v \in V$ and $f$ is a function mapping from the value set of $v$ to the same set$\}$. Informally, $\Pi_i$ specifies the steps that $i$ may execute. Each step $(v, f)_i$ is a read-modify-write operation that atomically reads the current value of $v$, say $\mathit{old}$, and writes back $f(\mathit{old})$ to the same variable $v$. That is, step $(v, f)_i$ means that process $i$ accesses $v$ by executing RMW($v$, $f$).

The system is asynchronous. That is, process steps do not necessarily take place in lock-step synchrony; rather, they may happen in an arbitrary order.

A system state is a tuple consisting of the state of each process in P and the value of each shared variable in V. System states will be denoted by s and t with subscripts and superscripts. For a system state s, we write s(i), i ∈ P, to denote the state of process i at s, and s(v), v ∈ V, to denote the value of shared variable v at s. An initial system state is a system state s at which s(i) ∈ Ii for each process i ∈ P and s(v) ∈ Iv for each shared variable v ∈ V.

The transition relation δ is a set of triples $(s, e, s')$, where $s$ and $s'$ are system states and $e$ is a step of some process. We assume that δ satisfies the following assumptions.

Localized update: Suppose $(s, (v, f)_i, s')$ is a transition in δ, where $(v, f)_i$ is a step of process i.

1. Suppose $(t, (v, f)_i, t')$ is an arbitrary transition in δ, with the same step of i. If $s(i) = t(i)$ and $s(v) = t(v)$, then $s'(i) = t'(i)$.

Informally, the present state of i and the present value of v uniquely determine the state of i after i takes step $(v, f)_i$.

2. $s'(v) = f(s(v))$.

The new value of v is determined by the function f and the current value of v.

3. $s'(j) = s(j)$ for all $j \in P \setminus \{i\}$, and $s'(u) = s(u)$ for all $u \in V \setminus \{v\}$.

Only the state of process i and the value of variable v can be affected.

Localized enabling: If $(s, (v, f)_i, s') \in \delta$, then for any system state t at which $t(i) = s(i)$ holds, there exists a system state $t'$ such that $(t, (v, f)_i, t') \in \delta$.

We say that a step $e = (v, f)_i$ is enabled at system state s if there exists a system state $s'$ such that $(s, e, s') \in \delta$. “Localized enabling” means that whether or not a step of a process is enabled at a system state depends only on the state of the process. Namely, if a step of process i is enabled at system state s, then the step is also enabled at any system state t at which t(i) = s(i) holds.

Determinism: For any process at any system state, there is at most one step of that process enabled.

If a step e = (v, f)i is enabled at system state s, the resulting system state after i takes the step is unique since the new state of i and the new value of v are uniquely determined in the model. Therefore, we write e(s) to denote the resulting system state.

An execution fragment is a finite or infinite sequence of steps. Several notations regarding execution fragments will be used in the sequel. Let $\alpha$ and $\alpha'$ be execution fragments.

• |α|: the length of α (if α is a finite fragment).

• α|i: the subsequence of α consisting of all steps of process i in α.

• Pro(α): the set of processes that take at least one step in α.

• Var(α): the set of shared variables accessed by any step in α.

• $\alpha \circ \alpha'$: the execution fragment obtained by concatenating $\alpha$ and $\alpha'$, provided that $\alpha$ is finite.

In addition, we say that α is a P-execution fragment if all processes involved in α are included in P (i.e., Pro(α) ⊆ P), where P is a subset of the set of all processes $\mathcal{P}$. When P = {i} we write i-execution fragment instead of {i}-execution fragment.

A finite execution fragment $e_1 e_2 \ldots e_n$ is executable from a system state s if for all i, $1 \le i \le n$, $e_i$ is enabled at $s_{i-1}$, where $s_0 = s$ and $s_i = e_i(s_{i-1})$. Likewise, an infinite execution fragment $e_1 e_2 \ldots$ is executable from a system state s if for all $i \ge 1$, $e_i$ is enabled at $s_{i-1}$, where $s_0 = s$ and $s_i = e_i(s_{i-1})$. If α is a finite execution fragment executable from s, we write α(s) to denote the system state after performing α from s. An execution is an execution fragment that is executable from an initial system state. A system state s is said to be reachable if there exists a finite execution such that the resulting system state is s.
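The notation for execution fragments can be made concrete with a small sketch. It assumes, purely for illustration, that a step is encoded as a tuple (i, v, f) and that only the shared-variable part of a system state is tracked; process-local states and the enabling condition are deliberately left out.

```python
def pro(alpha):
    """Pro(alpha): the processes that take at least one step in alpha."""
    return {i for (i, _v, _f) in alpha}


def var(alpha):
    """Var(alpha): the shared variables accessed by some step in alpha."""
    return {v for (_i, v, _f) in alpha}


def restrict(alpha, i):
    """alpha|i: the subsequence of alpha consisting of the steps of process i."""
    return [step for step in alpha if step[0] == i]


def apply_fragment(alpha, values):
    """Shared-variable part of alpha(s), given the variable values at s."""
    values = dict(values)  # leave the caller's state untouched
    for (_i, v, f) in alpha:
        values[v] = f(values[v])
    return values


alpha = [
    (0, "L", lambda old: 0),
    (1, "L", lambda old: 1),
    (0, "X", lambda old: old + 1),
]
final = apply_fragment(alpha, {"L": None, "X": 0})
```

Concatenation of fragments corresponds to list concatenation in this encoding.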

2.2 Distributed Shared Memory Model

The distributed shared memory (DSM) model is the same as the shared memory model proposed in the previous section, except that in the DSM model, each process has a segment of shared memory that is local to it. We adopt the definition of a remote memory reference step proposed by Anderson and Yang [7], and thus use the number of remote memory reference steps as the RMR time complexity metric.

In the DSM model, V is partitioned into disjoint nonempty subsets Vi, one for each i ∈ P. In other words, each variable belongs to a segment of shared memory that is local to a single process. This captures the essence of distributed shared memory systems. Vi denotes the set of all shared variables located at process i. To a process i, a shared variable v is remote if v ∉ Vi; otherwise, it is local.

For a step (v, f)i ∈ Πi, we say that this step of process i accesses the shared variable v. It is a remote memory reference (RMR) step from i if v ∉ Vi. That is, the step accesses a shared variable located at some other process. An RMR step to j is an RMR step from some process i ≠ j that accesses a shared variable v ∈ Vj.
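The classification of steps can be sketched in a few lines. The encoding below is an assumption for illustration: a step is reduced to a pair (process, variable), and the partition of V is given as a map from each variable to the process at which it is located.

```python
def is_rmr(step, location):
    """True if the step accesses a variable that is not local to its process."""
    process, variable = step
    return location[variable] != process


def rmr_steps_from(fragment, location, i):
    """Number of RMR steps from process i in the fragment."""
    return sum(1 for step in fragment if step[0] == i and is_rmr(step, location))


def rmr_steps_to(fragment, location, j):
    """Number of RMR steps to process j in the fragment."""
    return sum(
        1 for (p, v) in fragment if location[v] == j and is_rmr((p, v), location)
    )


# Hypothetical layout: L and Spin0 are located at process 0, Spin1 at process 1.
location = {"L": 0, "Spin0": 0, "Spin1": 1}
fragment = [(1, "L"), (1, "Spin1"), (0, "Spin1"), (0, "L")]
```

In this fragment, process 1 takes one RMR step (on L) and process 0 takes one RMR step (on Spin1).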

2.3 The Mutual Exclusion Problem

The shared memory model and the distributed shared memory model have been described so far. A formal definition of the mutual exclusion problem, which is similar to the one proposed by Burns et al. in [10], is given below for both models.

Informally, the mutual exclusion problem is to devise algorithms for each process to access a designated region of code called the critical region. A process can only occupy its critical region while no other process is in its critical region. In order to gain admission to the critical region, a process executes its trying region code, and when a process leaves its critical region, it executes the exit region code for synchronization purposes and then returns to the rest of its code, called the remainder region.

For each process i, Σi is partitioned into nonempty disjoint subsets Ri, Ti, Ci and Ei. We say that a process i is in its remainder (R) region, trying (T) region, critical (C) region or exit (E) region at system state s if s(i) belongs to Ri, Ti, Ci or Ei, respectively. A system state is said to be idle if all processes are in R. Each initial system state is assumed to be idle. In addition, we assume that the transition relation δ for a mutual exclusion algorithm satisfies the following well-formedness conditions.

• If $(s, (v, f)_i, s') \in \delta$ and $s(i) \in R_i$, then $s'(i) \in R_i \cup T_i$.

• If $(s, (v, f)_i, s') \in \delta$ and $s(i) \in T_i$, then $s'(i) \in T_i \cup C_i$.

• If $(s, (v, f)_i, s') \in \delta$ and $s(i) \in C_i$, then $s'(i) \in C_i \cup E_i$.

• If $(s, (v, f)_i, s') \in \delta$ and $s(i) \in E_i$, then $s'(i) \in E_i \cup R_i$.

That is, each process cycles through its remainder, trying, critical and exit regions, in that order.
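The well-formedness conditions can be checked mechanically on the sequence of regions that a single process occupies. The sketch below assumes that the sequence starts at an initial (idle) system state and records the region of the process after each of its own steps; the one-letter region labels are an encoding chosen for this example.

```python
# Regions a process may be in after one of its own steps, keyed by the
# region it was in before the step.
ALLOWED = {
    "R": {"R", "T"},
    "T": {"T", "C"},
    "C": {"C", "E"},
    "E": {"E", "R"},
}


def well_formed(regions):
    """Check that a region sequence of one process respects the R, T, C, E cycle."""
    if regions and regions[0] != "R":
        return False  # every initial system state is idle
    return all(after in ALLOWED[before] for before, after in zip(regions, regions[1:]))
```

For example, the sequence R, R, T, T, C, E, R is accepted, whereas R, T, R is rejected because a process cannot return to its remainder region directly from its trying region.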

For all steps, we assume that a step enabled in R or C never accesses a shared variable that may be accessed by a step enabled in T or E. Thus, a step taken in R or C will not affect the processes in T and E.

In addition, an algorithm that solves the mutual exclusion problem must meet the two basic conditions below.

Mutual Exclusion: There is no reachable system state at which more than one process is in C.

The next condition depends on an assumption about the scheduling of processes in executions: no process “halts” anywhere except possibly in R. Executions with this property are said to be admissible. Let α be an execution executable from an initial system state s. Formally, α is admissible from s if for every process i ∈ P that takes only finitely many steps in α, i’s final state belongs to Ri.

Progress: Let α be an admissible execution executable from an initial system state s and α1 be any finite prefix of α. At system state α1(s),

• if at least one process is in T and no process is in C, then there exists a finite prefix α2 of α, |α2| > |α1|, such that some process enters C at α2(s);

• if at least one process is in E, then there exists a finite prefix α2 of α, |α2| > |α1|, such that some process enters R at α2(s).

An algorithm satisfying the progress condition does not necessarily guarantee that the critical region is granted fairly to each individual process. To avoid a situation in which some process is indefinitely denied access to the critical region, it is often desirable to have some level of fairness beyond the progress condition.

An algorithm is lockout-free provided that it guarantees that, assuming that no process stays in C indefinitely and the execution is admissible, no process is kept waiting indefinitely either for C or for R. It is intuitively clear that a lockout-free algorithm also satisfies the progress condition.

To define the fairness properties below, which guarantee a bound on the number of bypasses, we assume that the trying region of each process consists of two parts: a doorway followed by a waiting part. The doorway part is wait-free: its execution requires only a bounded number of steps. The following properties prevent any process that has finished its doorway from being bypassed an arbitrary number of times by any other process.

A mutual exclusion algorithm is said to be bounded-bypass if it guarantees a b-bounded bypass for some constant b. The b-bounded bypass condition is defined as follows.

b-bounded bypass: Once a process i has passed through its doorway, no process can enter its C more than b times before i does so.

A mutual exclusion algorithm is said to be first-come-first-served (FCFS) if, whenever process i completes its doorway before process j performs a step in its doorway, j cannot enter C before i does so. It is intuitively clear that an FCFS algorithm also satisfies the bounded bypass condition.
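To make the b-bounded bypass condition concrete, the following sketch measures, on a recorded finite trace, the largest number of times that any single process enters C while some other process has finished its doorway and is still waiting. The event encoding ("doorway" for completing the doorway, "enter" for entering C) is an assumption made for this example.

```python
from collections import defaultdict


def max_bypass(trace):
    """Largest number of times one process overtakes a waiting process in the trace."""
    waiting = {}  # waiting process -> per-process count of entries since its doorway
    worst = 0
    for kind, p in trace:
        if kind == "doorway":
            waiting[p] = defaultdict(int)
        elif kind == "enter":
            waiting.pop(p, None)  # p itself is no longer waiting
            for counts in waiting.values():
                counts[p] += 1
                worst = max(worst, counts[p])
    return worst


# Process 2 enters C twice while process 1 waits after its doorway.
trace = [
    ("doorway", 1), ("doorway", 2), ("enter", 2),
    ("doorway", 2), ("enter", 2), ("enter", 1),
]
```

A trace of an algorithm that guarantees b-bounded bypass never yields a value greater than b under this measure.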

RMR Time Complexity in the DSM model. In the DSM model, the RMR time complexity of a mutual exclusion algorithm is the worst case number of RMR steps taken by any single process in T and the following E if the process enters and then leaves C, i.e., the worst case number of RMR steps for any single process to enter and then exit C once.

Then, a local-spin mutual exclusion algorithm can be formally defined as follows. This definition has been used implicitly or explicitly in related work about local-spin algorithms [5].

Definition 2.1 A mutual exclusion algorithm is local-spin if its RMR time complexity is bounded, that is, a constant c exists such that its RMR time complexity is less than or equal to c.

2.4 An Indistinguishability Relation

Variants of the notion of indistinguishability are frequently used to prove impossibility results in distributed systems [35]. Here, we first define an equivalence relation among system states, and then propose several ways to manipulate execution fragments.

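Definition 2.2 Let P be a subset of the set of all processes $\mathcal{P}$ and V a subset of the set of all shared variables $\mathcal{V}$. System states s and t are said to be indistinguishable to P with respect to V, denoted by $s \overset{P}{\underset{V}{\sim}} t$, if

1. s(i) = t(i) for each i ∈ P, and

2. s(v) = t(v) for each v ∈ V.

Informally, for system states s and t with $s \overset{P}{\underset{V}{\sim}} t$, s and t are indistinguishable to those processes in P consulting only shared variables in V. When P = {i}, we write $s \overset{i}{\underset{V}{\sim}} t$ instead of $s \overset{\{i\}}{\underset{V}{\sim}} t$; when $V = \mathcal{V}$, we write $s \overset{P}{\sim} t$ instead of $s \overset{P}{\underset{V}{\sim}} t$.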

Our definition is a generalization of the indistinguishability relation defined by Lynch [35]: when V is the set of all shared variables $\mathcal{V}$, the two indistinguishability relations coincide. The generalized version of indistinguishability makes it easier to define a weaker condition imposed on two system states such that an execution fragment executable from one system state is also executable from the other. Intuitively, it is enough to consider the set of all shared variables accessed in the execution fragment rather than the whole set $\mathcal{V}$. Furthermore, for a shared memory model whose memory has locality, this definition is useful in characterizing properties related to local shared memory, as we will see in Lemma 2.2 below and Lemma 5.2 in Section 5.2.1.
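The relation is straightforward to check on explicit states. In the sketch below, a system state is assumed to be a dictionary with a "proc" map (process to local state) and a "var" map (shared variable to value); this representation is hypothetical and chosen only for the example.

```python
def indistinguishable(s, t, procs, variables):
    """True if s and t agree on every process in procs and every variable in variables."""
    return all(s["proc"][i] == t["proc"][i] for i in procs) and all(
        s["var"][v] == t["var"][v] for v in variables
    )


s = {"proc": {0: "T", 1: "R"}, "var": {"L": 0, "X": 5}}
t = {"proc": {0: "T", 1: "C"}, "var": {"L": 0, "X": 9}}

same_for_0_on_L = indistinguishable(s, t, {0}, {"L"})      # agree on process 0 and on L
same_for_all = indistinguishable(s, t, {0, 1}, {"L", "X"})  # differ on process 1 and on X
```

Restricting attention to fewer processes or fewer variables can only make more pairs of states indistinguishable, which is what makes the generalized relation a weaker condition.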

Now, we present two lemmas about ways to manipulate execution fragments based on the indistinguishability relation defined above. The first holds for both of the proposed models; in contrast, the second holds only for the DSM model. These lemmas follow easily from the localized update and localized enabling assumptions.

Suppose that execution fragment α is executable from system state s. Let P = Pro(α) and V = Var(α). If $s \overset{P}{\underset{V}{\sim}} t$, Lemma 2.1 says that α is also executable from system state t. This is because each process and each shared variable involved in α have the same state and the same value, respectively, at s and t. By the localized update and localized enabling assumptions, an induction on each prefix of α can show that α is also executable from system state t. If, in addition, α is finite, the resulting system states α(s) and α(t) will also be indistinguishable to P with respect to V, i.e., $\alpha(s) \overset{P}{\underset{V}{\sim}} \alpha(t)$.


Lemma 2.1 Let s and t be system states. Suppose that α is an execution fragment executable from s. Let P = Pro(α) and V = Var(α). If $s \overset{P}{\underset{V}{\sim}} t$, then α is also executable from t. If, in addition, α is finite, then $\alpha(s) \overset{P}{\underset{V}{\sim}} \alpha(t)$.

Proof. Suppose that $s \overset{P}{\underset{V}{\sim}} t$, that is, each process and each shared variable involved in α have the same state and the same value, respectively, at s and t. According to the localized update and localized enabling assumptions, a straightforward induction proves that for each finite prefix $\alpha'$ of α, $\alpha'$ is also executable from t and furthermore at the resulting system states $\alpha'(s)$ and $\alpha'(t)$, the states of all processes in P and the values of all shared variables in V are the same. □

The above lemma applies to both models, whereas the next lemma, Lemma 2.2, holds only for the DSM model. Lemma 2.2 is for system states s and t that are indistinguishable to a process i consulting only shared variables in Vi. Informally, if an execution fragment α executable from system state s contains neither RMR steps from i nor RMR steps to i, then no communication between i and any other process can occur in α. Lemma 2.2 says that α|i is also executable from all system states t at which $s \overset{i}{\underset{V_i}{\sim}} t$ holds. If, in addition, α is finite, then the resulting system states α(s) and (α|i)(t) will also be indistinguishable to process i with respect to Vi.

Lemma 2.2 Let s and t be system states and i a process. Suppose α is an execution fragment that is executable from s and contains neither RMR steps from i nor RMR steps to i. If $s \overset{i}{\underset{V_i}{\sim}} t$, then α|i is also executable from t. If, in addition, α is finite, then $\alpha(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(t)$.

Proof. Since α contains neither RMR steps from i nor RMR steps to i, i does not access any remote shared variable and no other process accesses any shared variable located at i in α. Thus, when α is executed from s, the state of i and the values of all shared variables located at i depend only on α|i. Therefore, α|i is also executable from s and if, in addition, α is finite, $\alpha(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(s)$.

Suppose that $s \overset{i}{\underset{V_i}{\sim}} t$. We show that α|i is also executable from t. Since α|i is an i-execution fragment and i does not access any remote shared variable in α|i (i.e., Var(α|i) ⊆ Vi), $s \overset{i}{\underset{V_i}{\sim}} t$ implies $s \overset{P}{\underset{V}{\sim}} t$ where P = Pro(α|i) = {i} and V = Var(α|i). Hence, by Lemma 2.1, α|i is also executable from t and if, in addition, α is finite, $(\alpha|i)(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(t)$.

If α is finite, since $\alpha(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(s)$ and $(\alpha|i)(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(t)$, we have $\alpha(s) \overset{i}{\underset{V_i}{\sim}} (\alpha|i)(t)$. □

When α ends with an RMR step from i and satisfies the assumptions on α in Lemma 2.2 except for this last step, the following corollary says that α|i is also executable from t. Let $\alpha'$ be the prefix of α that excludes only the last step of α. By Lemma 2.2, $\alpha'|i$ is also executable from t and the states of i at $\alpha'(s)$ and $(\alpha'|i)(t)$ are the same. Thus, by the localized enabling assumption, the RMR step from i at the end of α is also enabled at $(\alpha'|i)(t)$. Namely, the execution fragment α|i (where α|i is $\alpha'|i$ followed by the RMR step from i) is also executable from t. However, since the last step from i is an RMR step, the state of i at α(s) might be different from that at (α|i)(t).

Corollary 2.3 Let s and t be system states and i a process. Suppose α is a finite execution fragment that is executable from s, ends with an RMR step from i, and contains neither RMR steps from i nor RMR steps to i except the last step. If $s \overset{i}{\underset{V_i}{\sim}} t$, then α|i is also executable from t.

Chapter 3 Related Algorithms

Before presenting our results, we review three algorithms that aim at reducing the number of RMR steps. They demonstrate how to order requests using RMW primitives. These algorithms also inspire the proposed algorithms in Chapter 4. The first is the MCS lock, proposed by Mellor-Crummey and Scott [37]; the second is the CL algorithm, proposed by Fu and Tzeng [24]; the last is Huang’s algorithm, proposed by Huang [28]. Huang’s algorithm shows that the lower bound on RMR time complexity in Chapter 5 is tight. Notably, the original version of the CL algorithm suffers from a deadlock in the trying region, and the version herein is the one corrected by Huang and Shann [30].

Both the MCS lock and Huang’s algorithm employ fetch&store and compare&swap to order requests to the critical region in a list-based way, while the CL algorithm employs fetch&store and swap&compare to do so in a circular-list-based way. The primitive swap&compare is a hypothetical RMW primitive defined by Fu and Tzeng. Definitions of these RMW primitives are given in Fig. 3.1.
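For readers who prefer executable notation, the three primitives can be sketched as follows. This is only an illustration under stated assumptions: the Shared class and the function names are introduced here, atomicity is trivial because the sketch is single-threaded, and swap_and_compare returns the updated value of the private variable old because Python cannot update the caller's variable in place.

```python
class Shared:
    """A shared variable holding a single value."""

    def __init__(self, value=None):
        self.value = value


def fetch_and_store(v, new):
    previous = v.value
    v.value = new
    return previous


def compare_and_swap(v, old, new):
    previous = v.value
    if previous == old:
        v.value = new
    return previous


def swap_and_compare(v, old, new):
    previous = v.value
    v.value = old       # v receives the private value
    old = previous      # the private variable receives the previous value of v
    if v.value == old:  # both were equal: replace v by new
        v.value = new
    return old


L = Shared(None)
first = fetch_and_store(L, 3)           # L was nil (None), now holds 3
failed = compare_and_swap(L, 4, None)   # L is 3, not 4: unchanged
tail = swap_and_compare(L, 3, None)     # L already equals 3: L is reset to nil
```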

3.1 The MCS Lock

As shown in Fig. 3.2, the MCS lock uses a fetch&store on a lock to chain competing processes as a list. Each process in the doorway, which is composed of line T1 in Fig. 3.2, executes fetch&store on the shared variable L (i.e., the lock), announcing its identity and obtaining the identity of its predecessor if there is one. It then enters the waiting part of its trying region, which is composed of lines T2–T4. If the returned value is nil, i.e., the requesting process is the head of the list, then it immediately enters its critical region. Otherwise, if it has a predecessor, it first writes a value to its predecessor’s Next variable, notifying its predecessor to refer back to its identity (T3). It then starts to spin on a locally-accessible shared variable until it is awakened (T4).

fetch&store (shared variable v, value new)
    previous := v
    v := new
    return previous

compare&swap (shared variable v, value old, value new)
    previous := v
    if previous = old then
        v := new
    fi
    return previous

swap&compare (shared variable v, private variable old, value new)
    previous := v
    v := old
    old := previous
    if v = old then
        v := new
    fi

Figure 3.1: fetch&store, compare&swap and swap&compare primitives.

In the exit region, a process i passes the permission to its successor if there is one. If Next(i) ≠ ⊥, i.e., i’s successor has updated Next(i), then i updates its successor’s spin variable (E8). Otherwise, two cases are possible: (1) i has no successor, or (2) i does have a successor, but the successor has not yet updated Next(i). Primitive compare&swap in E2 enables i to determine which case is true. If the returned value of compare&swap is not i, i.e., i indeed has a successor, i waits until its successor updates Next(i) (E3), and then wakes up its successor (E5). Otherwise, if the returned value of compare&swap is i, i.e., i has no successor, then compare&swap has modified L’s value to nil, setting the system state to the starting state.

Shared variables:
    L ∈ {nil, 0, 1, . . . , n − 1}, initially nil        /* L can be located at any process */
    for every i ∈ {0, . . . , n − 1}:
        Spin(i) ∈ {true, false}, initially true
        Next(i) ∈ {⊥, 0, . . . , n − 1}, initially ⊥      /* Spin(i) and Next(i) are located at process i */

Process i: (i ∈ {0, . . . , n − 1})
Private variables of i:
    pred, suc ∈ {nil, 0, 1, . . . , n − 1}, initially arbitrary

while true do
R:      Remainder region
T1:     pred := fetch&store(L, i);
T2:     if pred ≠ nil then
T3:         Next(pred) := i;
T4:         await ¬Spin(i); fi              /* locally spin until Spin(i) = false */
C:      Critical region
E1:     if Next(i) = ⊥ then
E2:         if compare&swap(L, i, nil) ≠ i then
E3:             await Next(i) ≠ ⊥;          /* locally spin until Next(i) is updated */
E4:             suc := Next(i);
E5:             Spin(suc) := false; fi      /* wake up its successor */
E6:     else
E7:         suc := Next(i);
E8:         Spin(suc) := false;             /* wake up its successor */
E9:     fi
E10:    Spin(i) := true;                    /* set Spin(i) to true */
E11:    Next(i) := ⊥;                       /* set Next(i) to ⊥ */
od

Figure 3.2: The MCS lock.

Figure 3.3: An execution of the MCS lock. An arrow from node p to node q indicates that process q has updated process p’s Next variable so that p is aware of the identity of its successor.

Figure 3.3 illustrates a simple execution of the MCS lock. Process 3 first executes fetch&store in T1 and gets nil from L, so it enters C immediately. While process 3 is in C, processes 1, 5 and 4 execute T1 in turn. Each of processes 1, 5 and 4 updates its predecessor’s Next variable and then starts to wait. The permission is conveyed from 3 to 1, then from 1 to 5, and then from 5 to 4. After process 4 leaves C, if there is no other request, process 4 modifies L’s value to nil; otherwise, it passes the permission to its successor.

The MCS lock satisfies mutual exclusion, progress and the FCFS condition. Inspecting the algorithm, the worst case number of RMR steps taken by any single process in T and E is four (Steps T1, T3, E2 and E5).
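The example execution with processes 3, 1, 5 and 4 can be replayed with a small deterministic simulation. This is a sketch under several assumptions rather than an implementation of the lock for real hardware: each process is a Python generator that runs one life cycle, a round-robin scheduler (introduced here) interleaves the processes, the code between two yields is treated as atomic, and None stands for both nil and ⊥.

```python
NIL = None  # stands for both nil and ⊥ in this sketch


class MCSLock:
    def __init__(self, n):
        self.L = NIL
        self.spin = [True] * n
        self.next = [NIL] * n
        self.entered = []  # order in which processes enter C

    def fetch_and_store(self, new):
        previous, self.L = self.L, new
        return previous

    def compare_and_swap(self, old, new):
        previous = self.L
        if previous == old:
            self.L = new
        return previous

    def process(self, i):
        """One life cycle of process i; each yield hands control to the scheduler."""
        pred = self.fetch_and_store(i)              # T1
        yield
        if pred is not NIL:                         # T2
            self.next[pred] = i                     # T3
            while self.spin[i]:                     # T4: local spinning
                yield
        self.entered.append(i)                      # C
        yield
        if self.next[i] is NIL:                     # E1
            if self.compare_and_swap(i, NIL) != i:  # E2
                while self.next[i] is NIL:          # E3
                    yield
                self.spin[self.next[i]] = False     # E4-E5
        else:
            self.spin[self.next[i]] = False         # E7-E8
        self.spin[i] = True                         # E10
        self.next[i] = NIL                          # E11


def run(lock, ids):
    """Round-robin scheduler: give each unfinished process one turn per round."""
    procs = [lock.process(i) for i in ids]
    while procs:
        for p in list(procs):
            try:
                next(p)
            except StopIteration:
                procs.remove(p)
    return lock.entered


lock = MCSLock(6)
order = run(lock, [3, 1, 5, 4])
```

Under this schedule the processes enter C in the order in which they executed T1, and L is nil again once process 4 has left its exit region.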

3.2 The CL Algorithm

Fu and Tzeng tried to improve the MCS lock and proposed the CL algorithm, which is better in terms of the amortized RMR time complexity. However, it does not satisfy the FCFS condition. Furthermore, although the CL algorithm is bounded-bypass in the trying region, some process may be blocked in the exit region. Figure 3.4 shows the CL algorithm; an explanation of the algorithm follows.

Shared variables:
    L ∈ {nil, 0, 1, . . . , n − 1}, initially nil        /* L can be located at any process */
    for every i ∈ {0, . . . , n − 1}:
        Spin(i) ∈ {true, false}, initially true          /* Spin(i) is located at process i */

Process i: (i ∈ {0, . . . , n − 1})
Private variables of i:
    pred ∈ {nil, 0, 1, . . . , n − 1}, initially arbitrary

while true do
R:      Remainder region
T1:     pred := fetch&store(L, i);
T2:     if pred ≠ nil then
T3:         await ¬Spin(i); fi              /* locally spin until Spin(i) = false */
C:      Critical region
E1:     if pred = nil then                  /* as a controller */
E2:         while true do
E3:             pred := i;
E4:             swap&compare(L, pred, nil);
E5:             if pred = i then
E6:                 break;                  /* leave the inner while loop */
E7:             else
E8:                 Spin(i) := true;
E9:                 Spin(pred) := false;    /* wake up the tail of the waiting list */
E10:                await ¬Spin(i);         /* locally spin until Spin(i) = false */
E11:            fi
E12:        od
E13:    else
E14:        Spin(pred) := false;            /* wake up its predecessor */
E15:    fi
E16:    Spin(i) := true;                    /* set Spin(i) to true */
od

Figure 3.4: The CL algorithm.

As in the MCS lock, each process in its doorway, which is composed of line T1 in Fig. 3.4, executes fetch&store on the shared variable L to make public its identity and obtain the identity of its predecessor if there is one. The process then enters its waiting part, which is composed of lines T2 and T3, and starts to check whether it is the first process that references L (T2), either since system start-up or since the last step at which the value nil was written back to L. If so, the process enters C; otherwise it starts to spin on its spin variable (T3). Unlike the MCS lock, the CL algorithm eliminates the remote memory reference that notifies a requesting process’s predecessor to refer back to the process’s identity (i.e., step T3 in the MCS lock). As a result, the MCS lock orders processes in a list according to when they make requests, whereas the CL algorithm orders processes in the opposite order. For instance, suppose that processes 3, 1, 5 and 4 make requests in turn. In the MCS lock, they are linked into a list as shown in Fig. 3.3, while in the CL algorithm, they are linked in the opposite order as shown in Fig. 3.5(a).

When a process i leaves C, if it did not get nil from L in T (i.e., pred ≠ nil), i just passes the permission to its predecessor (E14), and then enters R after setting its spin variable to true (E16). Otherwise, if pred = nil, it is selected as a controller and has additional responsibility for servicing other requesting processes, which it carries out by executing steps E2–E12. Two possibilities exist. If L is still equal to i, no other processes are interested in entering C. Process i writes nil to L when performing step E4 and moves to R. Otherwise, if L has some other process’s identity, there is a list of waiting processes. Process i stores the value of L, which is the identity of the tail of the current waiting list, to pred (as a result of E4) and passes the permission to the tail (E9). The permission will be conveyed along the list from the tail to the head. While the permission is being transmitted, process i, the head of the list, is blocked at E10.

After i passes E10, all processes in the waiting list have finished C but more processes may have arrived and have been kept waiting. Process i should go back to E2 to prepare for the next run of playing controller. It will be kept in this potentially unbounded number of runs of playing controller as long as there are processes interested in entering C.

Figure 3.5 depicts an example execution. Process 3 first takes step T1, gets nil from L and thus enters C immediately. At about the same time, processes 1, 5 and 4 execute T1 in turn. A waiting list, called list 1, is formed as shown in Fig. 3.5(a). In Fig. 3.5(b), process 3 leaves C, sets L to its identity and obtains the identity of the tail. It then passes the permission to the tail of the list (i.e., process 4). The permission will be conveyed from 4 to 5, then from 5 to 1, and then from 1 to 3. Process 3 will be blocked until the permission is passed back to itself. As Fig. 3.5(c) shows, while the permission is transmitted along list 1, subsequent requesting processes form another waiting list, called list 2. In Fig. 3.5(d), the permission is conveyed back to process 3, which takes the role of the controller again and redirects the permission to process 4, which is the tail of list 2.

Figure 3.5: An execution of the CL algorithm. A gray node indicates a process that has finished one life cycle. An upward arrow from a process points to the process’s predecessor, and a downward arrow from a process, which must be a controller, points to the tail of a waiting list for which the process is responsible.

The concept of using a controller to convey the permission to the tail of a waiting list also appears in our algorithms in Chap. 4 and Chap. 5. The differences are which process is selected as a controller and how to pass the responsibility of controller to the next one.

The CL algorithm satisfies mutual exclusion, progress and bounded bypass in the trying region. However, since a process may be kept in the while loop in E for an unbounded number of iterations, the RMR time complexity of the CL algorithm is unbounded.
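The behaviour described above can be replayed with a small deterministic simulation for requests from processes 3, 1, 5 and 4. The sketch follows the prose description of the exit region: it assumes that a process acts as a controller exactly when it obtained nil in T1, and that the controller passes its own identity to swap&compare. The generator-based processes, the round-robin scheduler and the use of None for nil are assumptions of this example.

```python
NIL = None


class CLLock:
    def __init__(self, n):
        self.L = NIL
        self.spin = [True] * n
        self.entered = []  # order in which processes enter C

    def fetch_and_store(self, new):
        previous, self.L = self.L, new
        return previous

    def swap_and_compare(self, old, new):
        previous = self.L
        self.L = old
        old = previous
        if self.L == old:
            self.L = new
        return old  # updated value of the private variable

    def process(self, i):
        """One life cycle of process i; each yield hands control to the scheduler."""
        pred = self.fetch_and_store(i)                 # T1
        yield
        if pred is not NIL:                            # T2
            while self.spin[i]:                        # T3: local spinning
                yield
        self.entered.append(i)                         # C
        yield
        if pred is NIL:                                # controller
            while True:
                pred = self.swap_and_compare(i, NIL)   # announce i, obtain the tail
                if pred == i:                          # nobody arrived: L is nil again
                    break
                self.spin[i] = True
                self.spin[pred] = False                # wake up the tail of the list
                while self.spin[i]:                    # blocked until permission returns
                    yield
        else:
            self.spin[pred] = False                    # wake up the predecessor
        self.spin[i] = True


def run(lock, ids):
    """Round-robin scheduler: give each unfinished process one turn per round."""
    procs = [lock.process(i) for i in ids]
    while procs:
        for p in list(procs):
            try:
                next(p)
            except StopIteration:
                procs.remove(p)
    return lock.entered


lock = CLLock(6)
order = run(lock, [3, 1, 5, 4])
```

Process 3 enters first, and the waiting list is then served from its tail to its head, which is the reverse of the request order.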

3.3 Huang’s Algorithm

This section presents Huang’s algorithm, whose RMR time complexity is three. The key to minimizing the number of RMR steps is encoding different messages into an RMR step. Based on the algorithm, the lower bound result on RMR time complexity in Chapter 5 is tight.

Besides the basic requirements, the algorithm also satisfies bounded bypass and lockout-freedom. To argue its correctness, we sketch a proof at the end of the section.

3.3.1 The Algorithm

The algorithm is shown in Fig. 3.6. Figure 3.7 illustrates an example to help explain the working of the algorithm.

As in the MCS lock, Huang’s algorithm uses a fetch&store on a lock to link competing processes, but, as in the CL algorithm, it eliminates the remote memory references needed in the MCS lock to notify its predecessor to re-direct the link for each process in a list. With this modification, the CL algorithm proposed a way to pass the lock among processes. However, this way suffers from blocking in the exit region. To eliminate this drawback, the algorithm provides a new way to convey the lock.

We first give an informal description of the algorithm and then describe it in more detail. In the algorithm for n processes, each process i ∈ P = {0, . . . , n − 1} has two identities, i and n + i. For brevity, let ¯i denote n + i. Each process uses different identities in any two consecutive life cycles to avoid a subtle situation. We defer the explanation of the subtlety until we have presented the algorithm.

We now explain the key idea of the algorithm. Each requesting process executes fetch&store on the shared variable L (i.e., the lock) to announce its identity and obtain its predecessor’s identity if there is one. If the returned value is nil, the critical region is available and the requesting process enters the critical region immediately; otherwise, it waits by repeatedly testing its local spin variable. Since each process makes a request by executing fetch&store on the same variable L, a waiting list will be formed if some process has been in C. For instance, in Fig. 3.7(a), as process 3 is in C, all competing processes (1, 5, and 4) form a waiting list.

Shared variables:
    L ∈ {nil, 0, 1, . . . , 2n − 1}, initially nil       /* L can be located at any process */
    for every i ∈ {0, . . . , n − 1}:
        Spin(i) ∈ {(Head, Tail) | Head, Tail ∈ {nil, 0, 1, . . . , 2n − 1}}, initially (nil, nil)
                                                         /* Spin(i) is located at process i */

Process i: (i ∈ {0, . . . , n − 1})
Private variables of i:
    id ∈ {i, n + i}, initially i
    pred ∈ {nil, 0, 1, . . . , 2n − 1}, initially arbitrary
    head, tail ∈ {nil, 0, 1, . . . , 2n − 1}, initially arbitrary

while true do
R:      Remainder region
T1:     pred := fetch&store(L, id);
T2:     if pred ≠ nil then
T3:         await Spin(i) ≠ (nil, nil); fi
C:      Critical region
E1:     (head, tail) := Spin(i);
E2:     if pred = nil or pred = head then       /* as a controller */
E3:         if pred = nil then                  /* E3–E8 encode the permission word */
E4:             head := id;
E5:         else
E6:             head := tail;
E7:         fi
E8:         tail := compare&swap(L, head, nil);
E9:         if tail ≠ head then                 /* wake up the tail of the waiting list */
E10:            Spin(tail mod n) := (head, tail); fi
E11:    else                                    /* as a non-controller */
E12:        Spin(pred mod n) := (head, tail);   /* wake up its predecessor */
E13:    fi
E14:    Spin(i) := (nil, nil);                  /* set the spin variable to (nil, nil) */
E15:    id := (id + n) mod 2n;                  /* change the identity */
od

Figure 3.6: Huang’s algorithm.

When a process leaves C, it takes an RMR step to write a value, called the permission word, to the spin variable of some waiting process. Since the waiting process is testing its spin variable repeatedly, the permission word in effect serves as a wake-up signal. In order to minimize the number of remote memory references, the permission word not only serves as permission to enter C, but also carries enough information for processes to arrange among themselves the order to enter C, without using any other control word.

Figure 3.7: An execution of Huang’s algorithm in Fig. 3.6. A gray node indicates a process that has finished one life cycle. An upward arrow from a process points to the process’s predecessor, and a downward arrow from a process, which must be a controller, points to the tail of the waiting list for which the process is responsible. The label of a downward arrow from a process represents the permission word conveyed to the tail by the process.

The permission will be conveyed in the following way. First, any process that succeeded in acquiring nil from L enters C. When such a process leaves C, it conveys the permission to the tail of the current waiting list. Then, the permission will be transmitted along the list from the tail to the head, allowing every process in the list to enter C in an orderly way. While the permission is being transmitted, all subsequent requesting processes form a new waiting list appending to the tail of the old list. Once the head of the old list leaves C, i.e., all processes in the list have finished their critical regions, the permission will be redirected to the tail of the new waiting list. Similarly, the permission will be conveyed along the new list. We call a process that redirects the permission to the tail of a new waiting list a controller. Namely, a process is a controller if it gets nil from L or it is the head of a waiting list. In addition, a controller has the responsibility to encode some information into the permission so that each process in a new list can check whether it is the head of the list and if so, it should take the role of a new controller. If there is no new waiting list when a controller tries to redirect the permission, the controller modifies L’s value to nil, thus properly setting the system to the starting state. Using compare&swap, a controller can atomically check whether there is a new waiting list and if not, modify L’s value to nil, avoiding any interleaving with processes that make requests about the same time.

For example, in Fig. 3.7(a), when process 3 (the controller at the time) leaves C, it conveys the permission to process 4, the tail of the current waiting list, called list 1. Pair (3,4) serves as the permission, where 3 is used for each process receiving the permission to check whether it is the head of list 1, and 4 indicates the tail of the list and will be used to encode the next permission. The permission will be transmitted along list 1. In Fig. 3.7(b), when process 1 in list 1 leaves C, i.e., all processes in the list have finished their critical regions, process 1 knows that it is the head of list 1 by checking whether its predecessor is 3. Process 1 encodes new information into the permission and redirects it to the tail of the current waiting list, called list 2.
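The flow of the permission in this example can be traced with a small sequential sketch. It is a reduced version of the example, assuming that list 1 contains only processes 1 and 4 and that list 2 contains only process 2. The dictionary layout, the function `exit_region`, and the comparand used in place of compare&swap are assumptions of the sketch. Since the trace is single-threaded, plain reads and writes stand in for the atomic primitives, and each process runs only one life cycle.

```python
NIL = None  # stands for nil


def exit_region(state, me, pred):
    """Run one exit step for `me`; return the process that receives the word."""
    head, tail = state["spin"][me]            # permission word received, if any
    is_controller = pred is NIL or pred == head
    if not is_controller:
        state["spin"][pred] = (head, tail)    # pass the word toward the head
        return pred
    # Controller: encode the word for the next waiting list.
    head = me if pred is NIL else tail
    seen = state["L"]                         # sequential stand-in for compare&swap
    if seen == head:
        state["L"] = NIL                      # no new list: back to the start state
        return NIL
    state["spin"][seen] = (head, seen)        # redirect to the tail of the new list
    return seen


# State after processes 3, 1, and 4 have requested in that order.
state = {"L": 4, "spin": {i: (NIL, NIL) for i in (1, 2, 3, 4)}}
pred = {3: NIL, 1: 3, 4: 1}

order = [exit_region(state, 3, pred[3])]      # controller 3 conveys (3, 4) to 4
word_for_list_1 = state["spin"][4]
order.append(exit_region(state, 4, pred[4]))  # 4 passes the word to 1
pred[2], state["L"] = state["L"], 2           # process 2 requests: list 2 forms
order.append(exit_region(state, 1, pred[1]))  # 1 is the head of list 1
word_for_list_2 = state["spin"][2]
order.append(exit_region(state, 2, pred[2]))  # 2 finds no new list and resets L
```

In this trace the word conveyed to list 1 is (3, 4), as in Fig. 3.7(a), and process 1 recognizes itself as the head of list 1 because its predecessor equals the first component, 3.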


We now describe the algorithm in more detail. The algorithm uses n + 1 shared variables: L and Spin(i) for each i ∈ P. L can be located at any process; in contrast, Spin(i) must be located at process i. Spin(i) is the spin variable of process i. Whenever busy-waiting is necessary, process i repeatedly checks its spin variable without causing any remote memory reference. Each spin variable consists of two parts, (Head, Tail), each being the identity of a process or nil. Initially, L is set to nil and each spin variable is set to (nil, nil).

In the trying region, a process in its doorway, which is composed of line T1 in Fig. 3.6, executes fetch&store on L. It then enters the waiting part, which is composed of lines T2 and T3. If the returned value of the primitive is nil, the requesting process enters its critical region immediately; otherwise, it waits by repeatedly testing its spin variable until the value is not equal to (nil, nil) (T3).
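The shared state and the trying region can be sketched as follows. The sketch is sequential, so a plain read followed by a write stands in for fetch&store. The names `make_state`, `doorway`, and `may_enter` are hypothetical, and the busy-wait of T3 is replaced by a predicate that a caller would poll.

```python
NIL = None  # stands for nil


def make_state(process_ids):
    """Shared word L plus one spin variable (Head, Tail) per process."""
    return {"L": NIL, "spin": {i: (NIL, NIL) for i in process_ids}}


def doorway(state, me):
    """T1: fetch&store on L; returns the predecessor's identity or NIL."""
    pred, state["L"] = state["L"], me
    return pred


def may_enter(state, me, pred):
    """T2-T3: enter at once on NIL, else only after the spin variable changes."""
    return pred is NIL or state["spin"][me] != (NIL, NIL)


state = make_state([1, 2, 3])
pred3 = doorway(state, 3)                  # process 3 finds nil in L
pred1 = doorway(state, 1)                  # process 1 finds 3 in L
first_enters = may_enter(state, 3, pred3)
second_waits = not may_enter(state, 1, pred1)
state["spin"][1] = (3, 1)                  # a permission word arrives at process 1
second_enters = may_enter(state, 1, pred1)
```

Polling `state["spin"][me]` touches only the process's own spin variable, which mirrors the requirement that Spin(i) be located at process i so that waiting causes no remote memory reference.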

In the exit region, each process reads its spin variable and stores the permission word into its private variables head and tail (E1). A process identifies itself as a controller if the result of checking E2 is “yes”; that is, pred is equal to nil or to head. If the process is not a controller, it simply transmits the permission to its predecessor by executing E12. Otherwise, it first encodes new control information into a new permission word by executing steps E3–E8. Steps E3–E7 set the new value of head: if the controller got nil from L, then head is set to its current identity; otherwise, head is set to the value of tail in the old permission word. This is because the value of head will be used by each process in the new waiting list to check whether it is the head of the list. Step E8 sets tail to the returned value of compare&swap on L, which is the identity of the tail of the new waiting list if there is one. If there is no new waiting list, E8 atomically modifies L’s value to nil. Otherwise, the controller redirects the modified permission word to the tail of the new list by executing E10.

It remains to explain why each process uses different identities in any two consecutive life cycles. Each process alternately uses one of its identities to avoid a subtle situation. Although a process cannot appear more than once in a waiting list, it may appear in two neighboring lists. A process’s identity in one life cycle is different from that in the next cycle since


