Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues because of memory consistency model disputes. In this blog post I will discuss some of the rationale of why memory models were created and give some specific examples of how that affects you.
First things first, lets define what a memory model is. The memory consistency model defines what values a read operation can return. The simplest memory model is sequential consistency, in which the execution behaves as if there were a single global interleaving of memory operations and the operations of a given thread appear in the same order as they appear in the program. It is the most natural model for normal humans to think about because the execution behaves as a multitasking uniprocessor. For example, consider the following example:

The question is, what value the read from data in P2 can return? The most obvious answer here is 42. Now what would happen if P2 observed the writes to data and flag in the opposite order? P2 could actually read data as “0″, which is surprising and not allowed by the sequential consistency memory model.
The main problem with sequential consistency is that systems like to reorder memory operations to hide long latency operations and consequently improve performance. For example, when a cache miss is being serviced, the processor may execute another memory access that comes after that in program order, which may hit in the cache and therefore complete earlier than the missing access. However, processors are not the only source of memory operation reordering. Many compiler optimizations effectively reorder code, e.g., loop-invariant code motion, common sub-expression elimination, etc. Furthermore, the memory models of languages and the hardware they run on need not be the same. The compiler and synchronization libraries need to insert fences in the code to map the language memory model to the hardware model. For example, Java and C++0x (the upcoming C++ standard) support memory models that guarantee sequential consistency for programs free of data races.
Due the difficulty of improving performance under sequential consistency, a variety of “relaxed” memory models were conceived. For example, in the Weak Ordering memory model, there is no guarantee that a processor will observe another processor’s memory operations in program order. This is where a “memory fence” (a.k.a. “memory barrier”) comes into play. When a fence instruction is executed, it guarantees that all memory operations prior to it in program order are completed (and visible to other processors) before any operation after the fence in program order is allowed to proceed. You would be bored and stop reading if I described the multitude of consistency models in this post. However, I do encourage you to read more about memory models in this very nice tutorial by Sarita Adve and Kourosh Gharachorloo. Also, Paul McKenney’s paper has a nice table summarizing the ordering relaxations in modern microprocessors.
Now let’s talk about some of highlights of the x86 memory model. A big disclaimer first. This can change and probably does change between models, so it is always a good idea to check the manuals before endeavoring in sensitive code (8-8 Vol. 3 in this manual for Intel and Section 7.2, page 164 in this manual for AMD).
In a nutshell, recent implementations of the X86 ISA (P6 and on), implement, roughly, what is normally termed processor consistency. Its key ordering properties are:
- reads are not reordered with respect to reads;
- writes are not reordered with respect to reads that come earlier in the program order;
- writes are not reordered with respect to most writes (e.g., it excludes multiple writes implicitly caused by string operations);
- reads may be reordered with respect to writes that come earlier in program order as long as those writes are to a different memory location;
- reads are not reordered with respect to I/O instructions, locked instructions and other serializing instructions.
There are no guarantees whatsoever of ordering between writes of different processors, the outcome of concurrent writes to the same memory location is non-deterministic. Increment instructions have no atomicity guarantees, moreover, even some write operations that update multiple bytes are not guaranteed to be atomic. For example, if a write operation to multiple bytes happen to cross a cache line boundary, the operation is not guaranteed to be atomic.
Here is an example of how the x86 memory model can get you in trouble:

t1 == 0 and t2 == 0 is allowed. Such an outcome is unintuitive (therefore non-sequentially consistent) because there is no serialized execution that leads to this state. In any serialized execution, there will be an assignment in one processor (A = 1 or B = 1) prior to a read (t1 = B or t2 = A) in the other processor. Another way to look at the problem is to build a happens-before graph of the execution. In this representation, a node is an executed instruction and a directed edge exists from instruction P to instruction Q if Q has observed the effects of P, and P has not observed the effects of Q. Here is the happens before graph for the example above when the outcome is t1 == 0 and t2 == 0:
t1 = B in P1 did not observe the write B = 1 in P2. Same applies to edge (2). Edges (3) and (4) are there because of program order. Since there is a cycle in the happens-before graph, there is no serialized order that would satisfy the happens-before relationship, therefore, the execution is non-sequentially consistent. This happened in this example because the read operation t1 = B in P1 can proceed before the write operation in A = 1 is completed and visible to P2.
t2 == t4 == 0. Lets look at this from the perspective of P1. This can happen because the processor can forward the value from the pending write to A (A = 1) to the read of A (t1 = A), which can complete and then allow the read of B to proceed (t2 = B). Note that this does not characterize a reordering of t2 = B and t1 = A! Intuitive, huh?
flag1 and flag2. P1 sets flag1 when it is attempting to enter the critical section, it then checks if flag2 is set; if it is not set, it means P2 has not attempted to enter the critical section, so P1 can safely enter it. Because the x86 memory model allows reordering loads with respect to earlier stores, the read of flag2 can proceed before setting flag1 is completed, which can lead to both processors entering critical sections, since P2 might have just set flag2!