Category Archives: computer architecture

undefined behavior and the purpose of C

C undefined behavior. From one of the LLVM developers, Chris Lattner:

This behavior enables an analysis known as “Type-Based Alias Analysis” (TBAA) which is used by a broad range of memory access optimizations in the compiler, and can significantly improve performance of the generated code. For example, this rule allows clang to optimize this function:

float *P;
void zero_array() {
  int i;
  for (i = 0; i < 10000; ++i)
    P[i] = 0.0f;
}

into “memset(P, 0, 40000)”. This optimization also allows many loads to be hoisted out of loops, common subexpressions to be eliminated, etc. This class of undefined behavior can be disabled by passing the -fno-strict-aliasing flag, which disallows this analysis. When this flag is passed, Clang is required to compile this loop into 10000 4-byte stores (which is several times slower), because it has to assume that it is possible for any of the stores to change the value of P, as in something like this:

int main() {
  P = (float*)&P;  // cast causes TBAA violation in zero_array.
  zero_array();
}

This sort of type abuse is pretty uncommon, which is why the standard committee decided that the significant performance wins were worth the unexpected result for “reasonable” type casts.

The permitted optimization is very effective: the code is 80 (EIGHTY!) times faster when clang at -O3 replaces the loop with a memset (some of that comes from other optimizations, but still). But the programmer already has the option of calling memset directly, which produces exactly the same fast code, because fundamentally C wants to expose the underlying memory to the programmer.
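For comparison, the explicit version is a one-liner (a minimal sketch, keeping the global P and the 10000-element size from the example above):

#include <string.h>

float *P;

/* The explicit version: the programmer states the intent directly,
   and no aliasing analysis is needed to get the fast code. */
void zero_array(void) {
  memset(P, 0, 10000 * sizeof(float));
}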

The original motivation for leaving some C behavior undefined was that different processor architectures produce different behaviors, and the expert C programmer was expected to know about them. Now compiler writers and standards developers claim it is good to introduce “unexpected results” (results that surprise even the experienced programmer) because doing so permits a certain kind of optimization.

Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the “int” as if it were a “float”). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results. The rules for this are quite nuanced and I don’t want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc). This behavior enables an analysis known as “Type-Based Alias Analysis” (TBAA) which is used by a broad range of memory access optimizations in the compiler, and can significantly improve performance of the generated code.
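For concreteness, a minimal sketch of the two forms described in that quote (the function names are mine, and it assumes int and float have the same size, as they do on common platforms): the pointer-cast version is the undefined one, the memcpy version is the sanctioned way to reinterpret the bytes.

#include <string.h>

/* Undefined behavior: accessing an int object through a float lvalue,
   exactly the cast-and-dereference pattern described above. */
float bits_via_cast(int x) {
  return *(float *)&x;
}

/* Defined behavior: copy the bytes with memcpy. Compilers typically
   lower this to a single register move, so nothing is lost.
   (Assumes sizeof(float) == sizeof(int).) */
float bits_via_memcpy(int x) {
  float f;
  memcpy(&f, &x, sizeof f);
  return f;
}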

To me, “The rules for this are quite nuanced and I don’t want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc)” means, “we mucked up the standard and we are going to cause many systems to fail as these nuanced rules confuse and surprise otherwise careful and highly expert programmers”. Compiler writers like undefined behavior because, in their interpretation of the standard, it permits any arbitrary code transformation. Anything. The well-known controversial results include removing checks for null pointers on the strength of an unreliable compiler inference about dereference behavior. These uses of “undefined”, and the limitations they place on reasonable type coercion, are based on an incorrect idea of the purpose of the C language. Unexpected results are a catastrophe for a C programmer. Limitations on compiler optimizations based on second-guessing the programmer are not catastrophes (and nothing prevents compiler writers from adding suggestions about optimizations). There are two cases for this loop:

  1. The C programmer is an expert who used a loop instead of memset for a good reason, or because this is not a performance-critical part of the code.
  2. The C programmer is not an expert, and the program almost certainly contains algorithmic weaknesses that are more significant than the loop.

Neither case benefits from the optimization. Programmers who want the compiler to optimize their algorithms with clever transformations should use programming languages that are better suited to large-scale compiler transformations, where type information is a clear indication of purpose. As Chris Lattner notes:

It is worth pointing out that Java gets the benefits of type-based optimizations without these drawbacks because it doesn’t have unsafe pointer casting in the language at all.

Java hides the memory model from the programmer, both to make programming safer and to permit compilers to do clever transformations, because the programmer is not permitted to interact with the low-level system. Optimization strategies for the C compiler should take into account that C does not put the programmer inside a nice abstract machine. The C compiler doesn’t need to be a fifth-rate Mathematica or APL or FORTRAN or Haskell. Unexpected results are a far more serious problem for C than missed minor optimizations.

Some reading:

DJB on boring C.

Regehr’s guide for the perplexed. 

Synchronous processors

Imagine a processor with no interrupts. We can do a lot better and get rid of most exceptions (e.g. system calls, page faults, etc.), most peripheral devices/buses, and even cache misses, but let’s start with interrupts. Modern microprocessors are bloated with circuitry designed to allow the precious CPU to switch back and forth between streams of code, because CPU cycles were once the scarcest resource. That was a great idea in the 1950s. But interrupts are expensive in compute time, circuit complexity, and chip area. Interrupts take microseconds to start processing, which is an eternity on a GHz processor. And they solve a problem that does not exist anymore: we have a lot of processor cores. In fact, one problem faced by computer architects is that it’s not easy to exploit parallel processing if you keep thinking in terms of 1950s computer architecture.

Suppose you have 32 cores and you make one or two of them poll I/O devices and the other 30 never get I/O interrupts, never run interrupt handling code, never have to save and restore state or context switch due to I/O interrupts.  What they do is run application code to completion or for a specified number of cycles or clock ticks. The cores reserved for operating system use manage all devices and even they don’t need interrupts because they can run a real-time OS that polls devices.
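A minimal sketch of what one of the dedicated I/O cores might run in this scheme, assuming memory-mapped device registers (the register layout, device list, and enqueue_for_app hand-off are invented for illustration, not taken from any real device):

#include <stdint.h>

/* Hypothetical memory-mapped device registers. */
typedef struct {
  volatile uint32_t status;   /* bit 0: data ready */
  volatile uint32_t data;     /* word to read when ready */
} device_t;

extern device_t *devices[];                       /* devices owned by this I/O core */
extern int ndevices;
extern void enqueue_for_app(int dev, uint32_t w); /* hand result to an application core */

/* The dedicated I/O core runs this loop forever: no interrupts, no
   context switches, just predictable polling with bounded latency. */
void io_core_main(void) {
  for (;;) {
    for (int i = 0; i < ndevices; i++) {
      if (devices[i]->status & 1u)
        enqueue_for_app(i, devices[i]->data);
    }
  }
}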


Processor architecture

[Image: Bill Gates and Paul Allen as infants]

I wish that processor architecture were not so committed to an obsolete model of software. For example, it has become clear over the last few years that the sloppy shared-memory thread model of programming has multiple drawbacks. Sharing of data, in particular, should be tightly controlled. Yet processor architectures are “optimized” for unstructured sharing of data between threads, with enormously complex “snooping” caches running at high speed to inspect every transaction on what have become essentially cache buses, and initiating complex transactions when conflicts are detected. I put “optimized” in quotes because the performance is terrible, for the reason that violations of locality cannot easily be cured. That is, caches are designed to permit totally random sharing of memory without software action, and performance still gets hammered by poor cache locality anyway. Essentially what we have is an expensive (both in transistors and power) fix for poorly designed software that really does not get much advantage out of it, but gets protected by the hardware from the logical (though not the temporal) consequences of sloppy data sharing. In point of fact, most threads do not share much memory, and the ones that do could be rewritten to explicitly pass control of memory to each other.
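As a sketch of that last point, assuming POSIX threads: instead of two threads mutating a shared buffer, a producer fills a buffer and then hands ownership of the pointer to a consumer through a small mailbox, so the payload itself is never touched by two threads at once (the mailbox type and function names here are my own illustration):

#include <pthread.h>

/* One-slot mailbox for a single producer and single consumer: the
   producer fills a buffer, then transfers the pointer; after the
   hand-off it never touches that buffer again, so the payload itself
   is never shared between running threads. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  changed;
  void           *buf;        /* NULL means the slot is empty */
} mailbox_t;

void mailbox_send(mailbox_t *m, void *buf) {
  pthread_mutex_lock(&m->lock);
  while (m->buf != NULL)                  /* wait until the slot drains */
    pthread_cond_wait(&m->changed, &m->lock);
  m->buf = buf;                           /* ownership passes to the receiver here */
  pthread_cond_signal(&m->changed);
  pthread_mutex_unlock(&m->lock);
}

void *mailbox_receive(mailbox_t *m) {
  pthread_mutex_lock(&m->lock);
  while (m->buf == NULL)
    pthread_cond_wait(&m->changed, &m->lock);
  void *buf = m->buf;                     /* this thread now owns buf exclusively */
  m->buf = NULL;
  pthread_cond_signal(&m->changed);
  pthread_mutex_unlock(&m->lock);
  return buf;
}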

Sadly, we are in a situation where processor architectures are constrained to run obsolete software, and then software is constrained to try to take advantage of obsolete hardware. All of this is especially good only for electric power utilities and manufacturers of air conditioning equipment.

A second, related area of obsolete design is memory mapping and paging. This is treated as a problem that was solved for all time by the hierarchical paging model finalized in the 1980s. Is a paging model designed for machines with 10MB of memory and a 100MB swap partition on a slow disk a good model for machines with 1GB or 10GB of memory and a couple of hundred spare gigabytes on disk, or maybe 40 spare gigabytes on solid-state drives? Well, if you give it a little thought, you can see a lot of potential problems. In the good old days, paging 512-byte or 4KB blocks to and from a disk drive organized around such blocks, into a highly constrained memory where you really wanted to avoid wasting even a couple of kilobytes, required one design. But what happens when the entire concept of disk blocks disappears on the drive, and a system with 10MB of unused memory is almost certainly stalled? What kinds of memory use patterns are good for LAMP-type systems where jobs do not appear and disappear? You can look at processor designs and operating systems sometimes and think that, just maybe, 1970s computer science departments are not the ultimate and final inspiration for all architecture.


See also a synchronous processor