The Lock-Free and Cache-Friendly Properties

4.1 The Lock-Free Properties

Since DSWP exploits fine-grained pipeline parallelism in applications [4], fast synchronization and communication mechanisms are necessary. Using a lock-based approach, such as a pthread mutex lock, to achieve synchronization is infeasible, because such coarse-grained locking can significantly offset the parallelism gained from DSWP. Besides, lock-based approaches come with many pitfalls, from deadlocks and livelocks to priority inversion and convoying [13]. With lock-free approaches, synchronization is achieved through lower-level tools rather than mutex locks [14].

Lamport [15, 7] proved that a single-producer single-consumer (SPSC) queue can be accessed concurrently without an explicit lock only if the multiprocessor system is sequentially consistent. Although Lamport’s conclusion seems heartening, most hardware and compilers used today do not provide the necessary sequential consistency [9, 16]. We will show how ordering and atomicity might not hold on modern multiprocessor systems and give an example demonstrating how ordering issues break Dekker’s mutual exclusion algorithm.
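To make the SPSC setting concrete, the following is a minimal sketch of a Lamport-style ring buffer. It is not the queue proposed in this work. The class name `SpscQueue` and its members are hypothetical, and the sketch assumes C++11 `std::atomic` with its default sequentially consistent ordering, which supplies the ordering guarantee that Lamport's proof relies on.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical Lamport-style SPSC ring buffer (illustration only).
// One slot is always left empty so that "full" and "empty" can be
// told apart; the usable capacity is therefore N - 1.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2, "need at least one usable slot");

public:
    // Must be called by the single producer thread only.
    bool push(const T& value) {
        const std::size_t tail = tail_.load();      // seq_cst by default
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load()) {
            return false;                           // queue is full
        }
        buffer_[tail] = value;                      // write the slot first
        tail_.store(next);                          // then publish it
        return true;
    }

    // Must be called by the single consumer thread only.
    bool pop(T& out) {
        const std::size_t head = head_.load();
        if (head == tail_.load()) {
            return false;                           // queue is empty
        }
        out = buffer_[head];                        // read the slot first
        head_.store((head + 1) % N);                // then release it
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // written only by the consumer
    std::atomic<std::size_t> tail_{0};  // written only by the producer
};
```

No lock is needed because each index has exactly one writer. The correctness argument depends on the slot write becoming visible before the index update, which is the ordering requirement examined in the rest of this section.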

We discuss the ordering issue first. From the programmer’s point of view, a computer system consists of a processor, memory, and the I/O subsystem. In practice, there is a multi-level memory hierarchy because of the significant speed gap between processor and memory. Each level in the memory hierarchy is smaller and faster than the next lower level. Memory accesses are expensive compared to other CPU operations. To improve the performance of sequential programs, compilers, microprocessors, and caches put much emphasis on optimizing memory reads and writes. They may reorder, insert, or remove memory reads and writes in order to avoid or delay memory accesses.

Here we explain how the ordering issue could break Dekker’s mutual exclusion algorithm on multiprocessor systems. Figure 4.1 gives the part of Dekker’s algorithm that achieves mutual exclusion. In Figure 4.1, X and Y represent different memory locations; r1 and r2 are registers of P1 and P2, respectively. Figure 4.2 gives three possible execution orders that illustrate the possible final values of r1 and r2 on a sequentially consistent multiprocessor system. Clearly, it is not possible that both r1 and r2 are zero at the end of execution. This fact ensures mutual exclusion.

Compilers are allowed to reorder memory operations involving different memory locations. Since store operations cost much more time than load operations, compilers will try to schedule memory loads early, so that the instructions that depend on those loads can execute as soon as possible. Taking Figure 4.1 as an example, a compiler might reorder the independent memory operations in each thread so that the memory loads can be executed early.

Figure 4.1: The part of Dekker’s algorithm achieving mutual exclusion

Figure 4.2: Valid executions of Fig. 4.1 on a sequentially consistent multiprocessor system

In addition, modern processors nearly always use a (hardware) store buffer to avoid waiting for a store instruction to complete. This means that a later read operation might reach memory before an earlier store operation. Both compiler and hardware optimizations make the outcome r1 == 0 and r2 == 0 possible, and hence may break Dekker’s algorithm.

Next we discuss atomicity. On modern multiprocessor systems, atomicity is not always guaranteed. Consider what may happen if one thread assigns 1000000 to a 32-bit integer variable X on a 16-bit processor while another thread reads X. The assignment is translated into two hardware store instructions, one for each 16-bit half-word of the constant 1000000. Without an appropriate lock mechanism, the other thread might see an “intermediate” value. Note that common hardware does not guarantee that bit-, byte-, or word-stores are atomic. Any shared data that could be modified has to be protected in some way.
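The intermediate values in this example can be worked out explicitly. The sketch below is a single-threaded simulation, not real 16-bit hardware. The helper names are hypothetical, and it assumes X initially holds 0 and that the two half-word stores may become visible in either order.

```cpp
#include <cstdint>

// Value of X visible after only the low 16-bit half of `next`
// has been stored over `old_value`.
std::uint32_t after_low_half_store(std::uint32_t old_value,
                                   std::uint32_t next) {
    return (old_value & 0xFFFF0000u) | (next & 0x0000FFFFu);
}

// Value of X visible after only the high 16-bit half of `next`
// has been stored over `old_value`.
std::uint32_t after_high_half_store(std::uint32_t old_value,
                                    std::uint32_t next) {
    return (old_value & 0x0000FFFFu) | (next & 0xFFFF0000u);
}
```

Since 1000000 is 0x000F4240, a reader that runs between the two stores could observe 0x00004240 (16960) or 0x000F0000 (983040), depending on which half is written first. Neither is the old value 0 nor the intended value 1000000.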

Existing low-level tools used to realize lock-free operations include explicit memory fences (e.g., mb() in Linux), special API calls (e.g., InterlockedExchange in Windows), and various special atomic types. Many of them are tedious or difficult to use. Worse still, their variety implies that lock-free code is not portable.

In recent years, the computer industry has gradually adopted ordered atomic variables as the main tool for writing lock-free code in major programming languages and OS platforms. In short, ordered atomic variables are safe to read and write by several threads simultaneously without any explicit locking. Ordered atomic variables guarantee the following properties:

• The read and write operations are guaranteed to be executed under some ordering rules defined by programming languages and libraries.

• Each read or write operation on an ordered atomic variable is guaranteed to be atomic, all-or-nothing.
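These two properties are what the Dekker fragment discussed earlier needs. The sketch below assumes the fragment has the usual form (P1 executes X = 1; r1 = Y, and P2 executes Y = 1; r2 = X), since Figure 4.1 is not reproduced here. It uses C++11 `std::atomic` with the default sequentially consistent ordering; the function name `count_both_zero` is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Runs the two-thread fragment `iterations` times and counts the runs
// that end with r1 == 0 and r2 == 0. With sequentially consistent
// atomics, the C++ memory model forbids that outcome.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> X{0};
        std::atomic<int> Y{0};
        int r1 = -1;
        int r2 = -1;

        // P1: store to X, then load Y (both seq_cst by default).
        std::thread p1([&] {
            X.store(1);
            r1 = Y.load();
        });
        // P2: store to Y, then load X.
        std::thread p2([&] {
            Y.store(1);
            r2 = X.load();
        });

        p1.join();
        p2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

If the stores and loads were given `std::memory_order_relaxed`, or if X and Y were plain `int` variables (a data race), the store-load reordering described earlier would again be permitted and the count could become nonzero on some machines.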

Many programming languages and libraries now support ordered atomic types that assure ordering and atomicity:

• Java provides ordered atomic types under the volatile keyword (e.g., volatile int), and solidified this support in Java 5 (2004).

• .NET added them in Visual Studio 2005, also under the volatile keyword (e.g., volatile int).

• ISO C++ added them to the C++0x Draft Standard in 2007, under the templated name atomic<T>.

• The Intel Threading Building Blocks library provides the template class atomic<T>, which implements atomic operations in the C++ style.

Programmers need to know the ordering rules defined by the programming languages and libraries they use, because some tools might not define an ordering rule. For example, the functions atomic_inc() and atomic_add() provided by the Linux kernel guarantee only atomicity, not ordering [9]. Using them to write lock-free code without any ordering enforcement (e.g., a memory fence) is incorrect.
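The difference between atomicity alone and atomicity with ordering can be sketched in C++11 terms. This is an analogy, not Linux kernel code: an increment with `std::memory_order_relaxed` is atomic but unordered, while release and acquire operations add the ordering enforcement. The function name `publish_and_read` and the payload value 42 are illustrative assumptions.

```cpp
#include <atomic>
#include <thread>

// One thread writes plain data and then signals through an atomic
// counter; the other waits for the signal and reads the data.
int publish_and_read() {
    int payload = 0;              // ordinary, non-atomic shared data
    std::atomic<int> ready{0};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Atomic increment WITH ordering: the release ensures the write
        // to `payload` is visible before the increment is observed.
        ready.fetch_add(1, std::memory_order_release);
    });

    std::thread consumer([&] {
        // The acquire load pairs with the release increment above.
        while (ready.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        seen = payload;           // safe: ordered after the signal
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Replacing both orderings with `std::memory_order_relaxed` would keep the increment atomic but would leave the read of `payload` unordered with respect to its write. That is a data race in C++, and the consumer would no longer be guaranteed to see 42.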