There is a nice example of thread semantics in a paper by David Goldblatt (P1916R0) which illustrates a couple of things about multi-core processing, including how interesting it is and how poorly defined the C11 atomic/thread extension is. A simplified version looks like this:

//Example1
int x = 0; int y = 0; int r1 = 0; int r2 = 0; // globals
void t1() { r1 = x; int local = r1; if (local == 0) { y = 1; } } // thread 1 code
void t2() { r2 = y; x = r2; } // thread 2 code

When both threads finish, r1 == 0. How do we know that? Suppose r1 is not zero. It starts as zero, so it must have been changed somewhere, and the only place it is assigned a value is the first statement of thread 1, "r1 = x". So we must have "x != 0" and, again, initially "x == 0", so x must change before r1 is assigned, and the only place x is assigned a value is the final statement of thread 2, "x = r2". Since r2 is only ever assigned from y, that requires "y != 0" before that assignment, and since initially "y == 0" we look for a place where y is changed. The only one is the final statement in thread 1, "y = 1", which is executed only if the conditional permits, which is only if "local == 0", which is only if "r1 == 0", contradicting our premise. So "r1 == 0" is required when these two threads complete. Now suppose that the programmer (or the compiler) is smart enough to optimize this code and decides that since we know "r1 == 0" is an invariant of the code, we can leave out the test:


//Example2
int x = 0; int y = 0; int r1 = 0; int r2 = 0; // globals
void t1() { r1 = x; y = 1; } // thread 1 code
void t2() { r2 = y; x = r2; } // thread 2 code

It may appear that things are still good. The assignment to y happens only after x is read into r1. If "r1 != 0" then "x != 0", which means "r2 != 0", which means "y != 0", but the assignment to y in thread 1 happens AFTER the assignment to r1! But as Goldblatt notes, it is possible that running the code leaves "r1 == 1" on an ARM processor. ARM has what is called "relaxed" memory order. Loads and stores on a single core (and we assume our threads are running on single cores, perhaps incorrectly) all become visible sequentially, following code order, so "y = 1" happens AFTER "r1 = x" within thread 1, but that nice ordering is not guaranteed between cores. Memory operations have been a bottleneck for a long time, so processors look for ways to short-circuit them. On ARM the two threads compile down to load and store operations:

//Example3
void t1() { 1: load register1a from Memory[x]; 2: store register1a to Memory[r1]; 3: store 1 to Memory[y]; } // thread 1 code
void t2() { 4: load register1b from Memory[y]; 5: store register1b to Memory[r2]; 6: store register1b to Memory[x]; } // thread 2 code

The compiler has figured out, in this code, that in thread 2 it doesn't need to load Memory[r2] back into a register, since the value is still in register1b, and in thread 1 it can keep "local" in a register. But the ARM processor is free to reorder the completion of instructions, as long as the result does not change the computation within its own core. The processor may notice that instruction 3 does not depend on any results of 1 or 2, so it can complete memory operations in the order 3, 4, 5, 6, 1, 2, which lets the 1 be stored into y before y is loaded into register1b, which allows r1 to end up with the value 1.

Why does the original code work on ARM, then? Because the test "if (local == 0)" creates a dependency: the processor cannot evaluate local and complete the test until x has been loaded into a register in thread 1. In the code below, (2a) must complete before 3 can happen, and 2a depends on the value loaded from x, so that load has to complete before the assignment to y.


//Example4
void t1() { 1: load register1a from Memory[x]; 2: store register1a to Memory[r1]; 2a: if register1a != 0, skip the next instruction; 3: store 1 to Memory[y]; } // thread 1 code
void t2() { 4: load register1b from Memory[y]; 5: store register1b to Memory[r2]; 6: store register1b to Memory[x]; } // thread 2 code

If the programmer or the compiler deleted the test as an optimization, it changed the semantics of the code. From my point of view, the first is ok and the programmer's responsibility, but the second would be a miscompilation. In ISO-C and the open-source C compilers, however, preserving semantics through optimization is not always required. In this case, the mushy C dependency rules probably prevent the compiler from making this transformation in the original code, but Goldblatt shows a case where it can and does happen if the programmer adds an "else unreachable()" clause. In theory, "unreachable" is a clue to the compiler that the branch cannot be taken. The compiler then decides to eliminate the test, since the programmer told it that the test always produces true, and produces code like that in the second example.

//Example5
int x = 0; int y = 0; int r1 = 0; int r2 = 0; // globals
void t1() { r1 = x; int local = r1; if (local == 0) { y = 1; } else unreachable(); } // thread 1 code
void t2() { r2 = y; x = r2; } // thread 2 code

In the original paper, instead of the simple assignments to x and y, the code uses C11 "atomic" operations, which are now becoming common in the standard open-source C and C++ compilers, even though what they mean is not clear even to the people who wrote the specification. The original example uses atomic operations with the so-called "relaxed" memory order. Translated from C++ into C:

//Example6
atomic_int x = 0; atomic_int y = 0; int r1 = 0; int r2 = 0;
void t1() { r1 = __atomic_load_n(&x, __ATOMIC_RELAXED); int local = (r1 & 1); if (local == 0) { __atomic_store_n(&y, 1, __ATOMIC_RELAXED); } }
void t2() { r2 = __atomic_load_n(&y, __ATOMIC_RELAXED); __atomic_store_n(&x, r2, __ATOMIC_RELAXED); }

In every modern processor I know about (including ARM, x86, MIPS, RISC-V, Power, ...), ordinary loads and stores of words (ints) are already atomic, so, currently, the compilers emit the same machine code with or without the C11 atomic operators. You can see this example, with conditional compilation using and not using the atomic operations, here. One of the big supposed advantages of adding atomics to C is that they are portable, but the example of P1916R0 shows that even for the simplest use we can see different behaviors on different processors. I am ok with that because, to me, C is a low-level language that exposes architectural differences like word size and memory rules. But C11 atomics/threads is built on the theory that C and C++ should not be low-level languages. The compilers are supposed to "portably" figure out how to do atomic assignments to larger data structures too, somehow. That they cannot do so is a problem for mere practice, not theory.

The program of Example1, in my version, has "maybe undefined behavior" according to the C11 Standard, and the compilers are then maybe entitled, at any moment, to change the code in any arbitrary way: by deleting one or both threads entirely, by deleting the reference to y in the first and to x in the second, or, well, really anything. The reason the code (maybe) has undefined behavior is that any two threads that refer to the same variables without using atomic operations, and that do not have a "before" relationship as far as the C/C++ compiler is concerned, have a "data race" and are undefined according to C11:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

As far as C17 is concerned, there is no definition of "conflicting action", but perhaps what is meant is "conflicting evaluations":

Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.

The "maybe" is because the current compilers don't do that kind of analysis (which is really hard) and because C11 atomics are supposed to be optional (after strong user pushback), but the standard text was not edited to make that clear. For example, the definition of "data race" depends on "threads", which are defined in a different, supposedly optional, part of C11. Before C11, C had no notion of "data race" because it knew nothing about threads. Suppose the underlying OS implements threads as run-to-exit co-routines. In that case, threads 1 and 2 would not have anything like a data race, and r1 would have to be 0 when both complete. Does the C compiler know what the OS or run-time is doing with threads? No. Not at all. Perhaps threads 1 and 2 will run on the same core with "memory fences" (to force completion of any delayed memory operations) between them. The compiler has no practical way to find out. Perhaps the threads themselves can switch cores between instructions (as in some architectures, or maybe in a smart future OS). Or they could be in separate processes with separate address spaces (see: "fork"). This is why the designers of C left threads out of the language entirely: so that the language did not incorporate a complex but limited view of threading. Prior to C11, C had a successful history of implementing threads and concurrency control in libraries. The requirement for adding atomics/threads to C11 was kind of nebulous. The result is, well:

Here's one way of thinking about why the out-of-thin-air problem is tricky to resolve: we want additional constraints on the concurrent part of the memory model, but the desire for those constraints is motivated by the dependency ordering established during execution of the non-concurrent parts of the program on real hardware. So we want sequential code to maintain a(n implicit) memory ordering constraint on concurrent code. The problem of outlawing OOTA, then, is about defining which sequential dependencies the compiler is not allowed to optimize away.

This is the main topic of P1619R0: that C11 atomics have a multiplicative effect on ISO-C's "undefined behavior" problem by increasing the surface area of undefined behavior. You might ask: "Out of thin air? That sounds bad." or "Why is the compiler 'optimizing away' sequential dependencies that some programmer expressed in the code in the first place?" The C standard text pretty clearly says that the compiler can reorder expressions within a statement, but cannot reorder statements. This is disputed within the standards body, where the argument is that reordering can happen if it does not change "observable state". That's a reasonable argument in the abstract, because if the compiler reorders statements without changing the semantics, who could object? Imagine:

//Example7
x = f(); // x is in a register
... do a bunch of stuff which does not involve y ...
y = x;
---------Reorder----
x = f(); y = x; // while x is still in a register
... do a bunch of stuff

Unfortunately, figuring out what is observable in practice, and what dependencies exist, is not simple. One might imagine that adding concurrency would mean that more state is "observable", since a concurrent thread might observe changes that would otherwise be invisible between different points in sequential code. Suppose a second thread accesses y while "do a bunch of stuff" happens in the example above. Now moving the assignment to y changes the operation of the program, because the second thread might see a different value. Reordering statements is much more dangerous in a multi-thread environment, like an operating system. In practice, statement reordering is not common in current compilers and, critically, the programmer can reorder statements in response to profile data if she wants, hopefully with some understanding of the threading semantics of her code. The C11 atomics are, in many ways, an invitation to compiler developers to become more aggressive with reordering. The current attitude of WG14 appears to be that negative effects are just too bad: the programmer has absolutely no assurances about what the compiler will do to program order without using atomic operations. The reordering example above is conformant without threads, and undefined behavior with a data race if there is a second thread that accesses y, so all bets are off. Even with atomic operations things are not clear. If that program used atomic operations to access y, would the statement order be preserved? Maybe. That's essentially the main topic of P1619R0. The net result is that it is essentially impossible to tell what C programs do in the presence of concurrent threads, partly because the rules for what compilers can do to statement order are totally unclear.
If this unclarity produced profound optimizations, there would be at least some grounds for accepting it, but there is no reason to believe that compiler statement reordering produces anything beyond micro-benchmark optimizations or optimizations of bad code. It is bizarre to read P1619R0, written in 2019, eight years after C11 was published, and to see a discussion of unanticipated interactions between atomic operations and undefined behavior.

Although I very much dislike the addition of atomics in C11, I am sympathetic to the argument that unstructured sharing of variables in a multi-thread environment is not a good methodology. However, creating yet more compiler-introduced failures in production code that was already designed to work with existing compilers is irresponsible.

Threads in C