Parallel Memory Optimized
Multithreaded software requires an understanding of subtle aspects of memory access
Parallel programming is gaining favor as multicore architecture becomes ubiquitous. But it comes with complications. On x86-based processors, running threads in parallel presents memory problems beyond the typical concerns of keeping threads from interfering with one another's data. There are subtle memory-access issues that developers must consider as they work with multithreaded software.
In a sequential program at any given instant, memory has a well-defined state, called sequential consistency. In parallel programs, consistency depends on who's observing processor activity. Two writes to memory by a hardware thread may be seen in a different order by another thread. When a hardware thread writes to memory, the written data goes through a series of buffers and caches before reaching main memory (that is, RAM). Because of these stops along the way, a later write operation may reach main memory sooner than an earlier one. Similar effects apply to reads. If one read requires a fetch from main memory and a later one finds its data in cache, the processor may allow the later read to "pass" the earlier one. Likewise, reads and writes may pass each other.
White PapersMore >>
A processor must see its reads and writes in the order it issues them--otherwise, programs can break. But the processor doesn't have to guarantee that other processors see those reads and writes in the original order. Systems that allow reordering are said to exhibit relaxed consistency.
Relaxed consistency isn't a problem for programs running on a single hardware thread. However, ignoring consistency on multithreaded hardware can result in programs failing, even for programs that run correctly on single-threaded hardware.
Compilers can also reorder instructions and cause relaxed consistency. For instance, they can move loop-invariant reads out of a loop, so the read is done once per loop instead of once per loop iteration. Language rules typically let the compiler presume the code is single threaded, even if it isn't. This is particularly true for older languages such as Fortran, C, and C++. Languages based on C syntax, such as Java, C#, and, of course, C/C++, can make the compiler more circumspect about these assumptions by use of the keyword volatile. Unlike hardware reordering, compiler reordering can affect code even when it's running on a single hardware thread. Thus, programmers must be on the lookout for reordering by the hardware and the compiler.
Current x86 Architecture
The 32-bit x86 architecture approximates sequential consistency, because it evolved in the single-core age. By approximating sequential consistency, the x86 architecture preserves the ability to run legacy software. Extreme departures from sequential consistency will break old code.
However, adhering strictly to sequential consistency yields poor performance. For the most part, developers can strike a balance between consistency and performance with two options in typical programming:
>> Relaxation for performance. A thread sees other threads' reads and writes in the original order, except that a read may pass a write going to a different location.
>> Strictness for correctness. An instruction with the LOCK prefix acts as a memory fence. No read or write may cross the fence. This rule stops relaxations from breaking typical synchronization idioms based on the LOCK instructions.
Relaxed memory consistency is called processor order. For efficiency, the x86 architecture allows loads (moving data items to registers) to pass other loads, but it hides this from the programmer. If the processor detects that the reordering might have a visible effect, it quashes the results of the affected instructions and reruns them. Thus the only visible relaxation is that reads can pass writes.
The architecture preserves most idioms but ironically breaks the textbook algorithm for mutual exclusion called Dekker's Algorithm. This algorithm enables mutual exclusion for processors without the use of special atomic instructions. Section A of the figure on p. 35 demonstrates the key sequence in Dekker's Algorithm. Two variables, x and y, are initially 0. Thread 1 writes x and reads y. Thread 2 writes y and reads x. On a sequentially consistent machine, no matter how the reads and writes are interleaved, no more than one thread reads a 0. That thread is the one allowed into the exclusion region. On just about every other modern processor, both threads might read 0, because the reads might pass the writes. The code behaves as if written in section B of the figure.
Section C shows how to make the sequence work by inserting explicit memory fence instructions. Fences keep the reads from passing the writes.
Fences also tighten memory ordering when necessary for correctness. The order of writes can be loosened with select store instructions, which aren't necessarily seen by other processors in the order they were issued. Some x86 string operations, such as MOVS and STOS, are among these so-called nontemporal instructions. The looser memory ordering lets the processor combine write operations. However, the processor consuming such data might not be expecting to see the writes out of order, so the producer should issue an sfence instruction, which will keep the writes from crossing, before signaling the consumer that the data is ready.
The x86 architecture lets memory consistency rules be varied for specific memory ranges. For instance, a range with "write combining" lets the processor temporarily record write in a buffer and commit the results to cache or main memory later in a different order.
Avoid Pipeline Stalls
When writing a parallel program for performance, first get the decomposition right. Then tune for cache usage, including avoidance of false sharing. Then concern yourself with the processor's pipeline. Many x86 processors have deep pipelines that permit high execution rates of instructions typical of single-threaded code. The execution units reorder instructions so those waiting on memory accesses don't block others. Deep pipelines and out-of-order execution are usually a good thing, but they make some operations relatively expensive in terms of performance.
Particularly expensive are serializing instructions that force all prior instructions to complete before any subsequent instructions. Common serializing instructions include those with the LOCK prefix, memory fences, and the CPUID instruction. The XCHG instruction on memory is likewise serializing, even without the LOCK prefix. These instructions are essential when serving their purpose, but it can pay to avoid or minimize them when alternatives exist.
On processors with HyperThreading Technology, spin waits--a thread that loops, constantly checking a variable to see whether it's changed value--can be a problem because it might consume excessive hardware resources. In the worst case, it might starve the thread on which the spinner is waiting. On x86 processors, the solution is to insert a PAUSE instruction. On x86, spinning on a read can consume bus bandwidth.
Data Organization For High Performance
The interactions of memory, cache, and pipeline can be daunting. It may help to think of a program's locations as divided into four types:
>> Thread private. These locations are private to a given thread and never shared. A hardware thread tends to keep this memory in cache, as long as it fits. Accesses to these locations are fast and don't consume bus bandwidth.
>> Thread shared read only. These locations are shared by multiple threads but never written to by those threads. Look-up tables are a common example. Because there are no writes, a hardware thread tends to keep its own copy in cache, as long as it fits.
>> Exclusive access. These locations are read and written to but protected by a lock. Once a thread acquires the lock and starts operating on the data, the locations migrate to cache. Once a thread releases the lock, the locations migrate back to memory or to the next hardware thread that acquires the lock.
>> Wild West. These are unsynchronized threads read and written to the other locations. Depending on the lock implementation, they may include the lock objects.
The type of a given location may change as the program runs. For instance, a thread may create a lookup table privately, and then publish its location to other threads so it becomes a read-only table.
Good decomposition of tasks across threads favors thread-private storage and thread-shared read-only storage, because they don't need synchronization. Different types of locations shouldn't be mixed on the same cache line, because false sharing problems arise. That is, a block of memory that contains a Wild West data item might be updated frequently. If it also contains a thread-private data item, the owning thread must reread the block every time the Wild West data item is updated, causing unnecessary reads, as the thread has no interest in the Wild West data item.
In sum, to make the most of parallel software, you must understand how the processor executes certain tasks, such as reading and writing to memory. In general, keep memory consistent and consider the costs of too much serialization of operations; avoid spin-locks; and group same-type variables. Do all this, and keep locking mechanisms correct and optimized, and you'll realize multicore processors' full benefits.
This article was adapted and updated by Dr. Dobb's editor in chief Andrew Binstock from the Intel Press book "Multi-Core Programming" by Shameem Akhter, an Intel platform architect, and Jason Roberts, an Intel senior software engineer. Write to us at firstname.lastname@example.org.