In this section, we present our implementation of the QueueBuffer class, and explain its correctness both in informal and formal ways.
5.1 Implementation
We present the C++ template class QueueBuffer, which is a lock-free, cache-friendly, STL-compliant queue. Figure 5.1 shows the data members of class QueueBuffer. The most im-portant data members are m front and m back, which are shared, mutable variables. Compil-ers do not yet support ordered atomic types defined by the forthcoming C++0x Draft Standard.
So we declare them as ordered atomic variables by using the template class atomic<T> pro-vided by Intel Threading Building Blocks library. An atomic<T> class supports atomicR read, write, fetch-and-add, fetch-and-store, and compare-swap operations. For reads and writes, their default memory fences are acquire and release, respectively.
Hence, the ordered atomic read is also known as a acquiring read, and the ordered atomic write is also called a releasing write.
An acquire fence prevents memory operations after the fence moves over it. On the other
Figure 5.1: Class QueueBuffer data members
show how the ordered atomic variables allow us to access the queue concurrently without locks later.
Since false sharing will cause unnecessary memory traffic, we take some steps to avoid false sharing. First, the software interfaces emulating the produce and consume instructions have been designed to take false sharing into consideration. Therefore, PROD (CONS) will enqueue (dequeue) multiple (m cds) data elements whose total size is a multiple of the cache-line size at a time. Second, according to the locality as illustrated in section 4.2, we group class QueueBufferdata members into different chunks that are multiples of the cache line size and aligned on cache line boundaries by using alignment and padding. Data members are aligned to the cache-line boundary. Padding is added between different groups of data members. Most compilers do not yet support attribute defined by the forthcoming C++0x Draft Standard.
Therefore, we have to use compiler-specific syntax extensions to control a compiler’s alignment operations.
Figure 5.2a and Figure 5.2b show the class QueueBuffer member functions called by
(a) push (b) front
Figure 5.2: Class QueueBuffer member functions
PROD and CONS, respectively. We use m back and m front to check if the queue is empty or full. The queue is empty when m back == m front and is full when (m back + m cds) % m cap== m front. We now show how atomic<T> allows us to access the queue concur-rently without locks. First, although m back and m front are shared, mutable variables, they are safe to be read and written by PROD and CONS simultaneously without locks as atomicity is guaranteed by atomic<T>.
Second, as pointed out by [19, 20, 21], there is an ordering problem between PROD and CONS. CONS might see that m back has been incremented by PROD before it sees the change to the corresponding m buf slot if the memory consistency model is very relaxed. The solution proposed by [19] is to use mutexes to flush the cache. We do not need mutexes here since the ordered atomic write associated with a release fence which guarantees the change of m buf slot occurs before it. We show the queue can be accessed concurrently without locks.
When writing parallel programs, we need to understand the ordering rules provided by pro-gramming languages and libraries. For the problem in Dekker’s algorithm as demonstrated in Section 4.1, acquiring reads and releasing writes do not fix the problem. Instead, the fix needs to stop reads from floating backwards over writes, but acquiring reads can nonetheless float
As shown in Figure 5.2a and Figure 5.2b, m back and m front are copied to local vari-ables local back and local front before checking if the queue is empty or full. Since m backand m front are only written by PROD and CONS, respectively, we are free to do so. Because local variables can be cached in registers or cache, doing so can avoid reloading the ordered atomic variables unnecessarily.
There is one last thing that needs to be considered. In DSWP, both PROD and CONS are in a continuous loop of accessing the queue. We need to terminate CONS when PROD produces no more data. When writing a sequential program or a parallel program with locks, we can associate an EXIT flag with the queue. When PROD produces no more data, the queue’s EXIT flag is set to true. When CONS tries to dequeue data, if the queue is empty and the EXIT flag is true, then CONS exits the loop.
The above approach might not work for a lock-free parallel program. A typical pattern for lock-free programs is do the work off to the side, then publish each change to the shared data with a single ordered atomic write or compare-and-swap [22]. The difficulty of writing lock-free code is that we are only allowed to use a single ordered atomic write or compare-and-swap operation to update the shared object. In our case, PROD will update m back after inserting data elements into the m buf slot. And CONS will update m front after retrieving data elements from the m buf slot. We cannot update m back and the EXIT flag at the same time.
There is another approach, however. Stopping CONS can be accomplished by having PROD pass an EOF through the queue. When CONS receives an EOF, CONS exits the loop.
The member function push back n in the QueueBuffer class requires a parameter, last iter, set by PROD when there is no more data. Then push back n will insert an m eof whose value is a template argument provided by programmers. The m eof has to be inserted with the last chunk of data. If not, since the front n member function will return m cdsdata unconditionally, it will return some garbage data.
5.2 Verification
Program 1: SPIN model for class QueueBuffer
As indicated in [23], the non-deterministic nature of a parallel system makes it hard to use traditional testing techniques to verify the parallel system. Our lock-free queue belongs to this case. Hence, it is insufficient to run test cases to verify the correctness of our lock-free queue.
Instead, the implementation should be modelled and verified in a formal way. In this paper, we use the SPIN model checker [24] to verify our design.
The tool we use to check our design is called SPIN, and the specification language that it accepts is call Promela (Process Meta-Language). Promela is a process modelling language whose intended use is to verify the logic of parallel systems. It is not to be an implementa-tion language like C/C++, but a system descripimplementa-tion language for building verificaimplementa-tion models.
not on computations.
A Promela program is composed of processes, values and message channels. Processes are global objects that represent concurrent entities of the parallel system. Variables and mes-sage channels can be defined either globally or locally within a process. Processes specify their behavior; global variables and message channels together define the environment in which processes run.
SPIN simulates the operation of a parallel system by running its Promela program. This is the SPIN simulation mode. SPIN also has verification mode. SPIN can convert a Promela program into a C source code which is a verifier. Once the verifier discovers flaws in the parallel system, we can rely on the SPIN simulator to display the error traces.
Since Promela models the synchronization and coordination between processes, we can ignore the activities occurring within a thread and focus on the interactions among concurrent processes. Program 1 shows the Promela model for the protocol used by the producer and the consumer.
Chapter 6 Experiment
This section describes the platforms and benchmarks we used in our implementation of the QueueBufferclass.
6.1 Platform and Benchmark
Experiments were conducted on an IBM System x3400 Server [25] and a Sun SPARC Enterprise T5120 Server [26]. The IBM System x3400 Server is equipped with two quad-core Intel Xeon E5335 processor. Each pair of cores in a quad-core Intel Xeon E5335 processor share a 4 MB L2 cache [27]. The Sun SPARC Enterprise T5120 Server equipped with one UltraSPARC T2 processor which has eight cores. All cores in a UltraSPARC T2 processor share a 4 MB L2 cache divided into eight banks [28]. Table 6.1 shows the detailed parameters of our experimental platforms.
We manually applied DSWP on three linked-list traversal loops in three SPEC CPU2006 benchmarks [29]: 429.mcf, 450.soplex, and 453.povray. We also experimented on a micro-benchmark, called llubench, which focuses on linked-list manipulation [30]. Llubench was modified so that CONS does more computation1. More detailed information of the benchmarks