Synchronous processors

Imagine a processor with no interrupts. We can do a lot better and get rid of most exceptions (e.g. system calls, page faults etc.), most peripheral devices/buses, and even cache misses, but let’s start with interrupts. Modern micro-processors are bloated with circuitry that is designed to allow the precious CPU to switch back and forth between streams of code because cpu cycles were the scarcest resource. That was a great idea in the 1950s. But interrupts are expensive in compute time circuit complexity, and chip area. Interrupts  take microseconds to start processing – which is an eternity on a Ghz processor.  And they solve a problem that does not exist anymore: we have a lot of processor cores. In fact, one problem faced by computer architects is that it’s not easy to exploit parallel processing if you keep thinking in terms of 1950s computer architecture.

Suppose you have 32 cores and you make one or two of them poll I/O devices and the other 30 never get I/O interrupts, never run interrupt handling code, never have to save and restore state or context switch due to I/O interrupts.  What they do is run application code to completion or for a specified number of cycles or clock ticks. The cores reserved for operating system use manage all devices and even they don’t need interrupts because they can run a real-time OS that polls devices.

Continue reading Synchronous processors

IEEE 1588 PTP is a mess

IEEE 1588 was not designed for modern enterprise computer networks and contains many hacks to make it sort of work. The standard also suffers from being overly explicit on some things and overly unspecific  on others.  One marker of the flawed process is that IEEE 1588 transparent clocks don’t really comply with Ethernet standards because they modify packets without changing the MAC address. So in 2012 the 802.1 and 1588 standards groups started discussing what could be done. The 1588 committee notes that the “intent” (and practice) violates OSI layering but that 1588 doesn’t “mandate” that intent! Oy vey.

Questions have been raised concerning an IEEE 1588-2008 Transparent Clock layer 2 bridge modifying the CorrectionField of Ethernet transported PTP frames without changing the Ethernet source MAC address.  The question is if this operation is permitted by IEEE 802.1Q [1].  The original intent of the IEEE 1588-2008 standard was that a Transparent Clock will forward PTP event frames with no modifications except for the CorrectionField and FCS updates, however IEEE 1588-2008 does not mandate that.

Operations and maps on finite sequences

A lot of what I’m trying to do with mathematical models of computer systems involves operations on finite sequences.

Define a “finite sequence of length n>0” to be any total map f: {1 … n} → X for some set X. The 0-length sequence is the null map  “nulls”. If f is a finite sequence of length n, then g = f affix c is the map g: {1 … n+1}   → X ∪ {c}  so that g(i)= f(i) for  i ≤ n and g(n+1) = c.  Also, if g is length n>0 then there is an f of length n-1  and some c so that g = f affix c .

A primitive recursive function on finite sequences is given by the rules  F(nulls) = k and  F( f affix c) = G(c,F(f)). 

For example we can define prefix by  (c prefix nulls) = (nulls affix c) and ( c prefix (f affix d)) = (c prefix f) affix d.

I use affix instead of append because prepend is ugly.

Two observations about common usage in computer science that differs from standard mathematical practice. First, note how I’m defining sequences as maps but being imprecise about the function images, about the set X. In CS we really only have one type of object – a sequence of binary digits- so X is always a set of finite sequences of binary digits. This is a fundamental property of computing. So if I wanted to be pedantic, I’d first define the set B of finite sequences of binary digits as the set of all maps g:{1 … n} → {0,1}  plus the null map, and then define general finite sequences as maps  {1 … n} → B.  But we use those binary sequences as representations of other mathematical objects and even objects that are usually not considered mathematical objects such as MPEG encodings of video.  So it’s convenient to speak of a sequence of integers or strings or a sequence of assorted objects without making the translation between representation and denotation  (connotation? ). The second  observation is that, for the same reason, second order functions are not unusual in computer science – e.g. defining prefix as a function that operates on sequences which are also functions. Note however that in non CS applied math, second order functions are also common.

Painting is The Mathematician by Diego Rivera.

The difference between unspecified, undefined, and non-deterministic

There is too much confusion in the “formal methods” computer science literature between these three different terms. Let me start with what this means for a state machine and then move on to engineering objects such as threads. Suppose we have a map α: X → Y  where X is the set of finite sequences over X. In an earlier post I explain how these maps are equivalent to state machines.  For a finite sequence z we can leave  α(z) unspecified – meaning, we don’t make any assurances about the value other than that  α(z)∈ X.  Alternatively, we can say that α(z) is undefined which is a definite assertion that α does not associate any element of X with z. If α is considered as a set of pairs (s,x) with the function property that (s,x)∈α and (s,x’)∈α implies x=x’, then specifying that  α(z) is undefined specifies that (z,x) ∉ α for any x.  So  “α(z) is undefined ” is, in fact, a quite precise specification. Finally, if we say “α(z) is non-deterministic ” we mean that α is not a function at all, it is a relation on E* cross X so that (s,x)∈α and (s,x’)∈α  and  x=x’ is true for at least one s, x and x’. These are all different, but it’s common to mix them up in “formal methods” which has bad mathematical effects and worse effects when trying to understand how computer systems work.

Programs are purely deterministic systems. Given the same inputs, they always compute the same outputs. Without this property, they would not be useful at all. Generally, when researchers say a program is non-deterministic, they are discussing a program that has unspecified or undefined behavior or inputs. It’s that last part that really introduces confusion. For example, it is common to say threaded programs are non-deterministic. But that’s not correct. The apparent non-determinism is due to differences in i/o and scheduling – both of which are inputs. If the operating system schedules a multi-threaded program the same way twice and the input data is the same, the computations will be identical.  The same can be said about i/o.

In fact, the underlying error is sloppy definition of system state. We can see this most clearly in Robin Milner’s CCS book where he speaks of the non-deterministic characteristic  of a device where he has somehow included the choices of the external operator or the weather conditions as part of system state. The consequence is that his state machines are non-deterministic, and this makes the math really complicated and slippery – for no good reason.



one way queues

Freight_train_in_Tucson_Arizona_2 Here’s some code for lock free queues for a single producer and single consumer. The code is designed for Intel multiprocessors with strong memory model. I don’t know what ARM offers these days. But the strong memory model for x86 means that the program doesn’t need any special synchronizing operations at all. All it needs are a couple of volatile declarations to keep the compiler from caching values that are changed by the other process/thread.

There’s some use of #defines to make it easier to use static type checking but the core method is to keep data in an array (I also have a non-array based lock free linked list I may get organized) and have the producer increment a tail index and the consumer increment a head index. Increments are mod n where n is the number of elements in the array. The only complexity is full and empty conditions. When the array is empty  head== tail but if the producer then fills the array using the last slot should roll the tail back to equal head again. One fix is to just never let the array fill up completely – reserve one element as a buffer. But that would be too easy. So I use one of the bits in the head and tail pointers to indicate condition. That bit it never used to calculate the index. When the producer fills the array it sets the bit in tail to be the complement of the value of that bit in head. When the consumer empties, it sets the bit in head to be the same as the value in tail.

*i = ((OWQ_ELEMENT_T *)q->v)[h];
next = (h + 1) % q->z;
if(next == (t & OWQ_OFFBIT) ) q->h = t; //empty
else q->h = next;


The main code uses the high order bit, but I also have code using the low order bit and shifting so you can see the  comparison.

There’s an example program called test_owq.c and the main code is a header file q.h

Photo is by Simeon87Own work, CC BY-SA 3.0,

The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy on write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24. DOI=


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system.3 An early version of the file system organization described here is discussed in Arnow [ 11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on disk root block, Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIXstyle file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.

Paxos and other consensus algorithms – draft. updated plus chang/maxemchuk

I’ve been looking at Paxos and Raft and Zab and other algorithms that can loosely be called “consensus” algorithms because we want to see where we can improve distributed system operation with precision time synchronization or where we can offer novel features because of precision timing. Google’s Spanner database is an interesting example in this area but there is also nice work in a number of other places.  One obvious question is how to account for the complexity of Paxos, which seems like it should be simple, but as several people have pointed out, is remarkably elusive and complex in practice.

The original Paxos paper is unreadable, but later versions tried, without success, to make it clear. I believe part of the problem is that  time and timeouts are fundamental to operation of distributed algorithms, but particularly in Paxos, there has been a laborious attempt to sweep these things under the table so it looks as much as possible like a purely “asynchronous” algorithm.

Suppose A and B can only communicate by sending/receiving messages over some communications medium that can lose or delay messages, but never corrupt them. A sends a message M to B and waits for a reply message R. B can fail. The original transmit might have failed. It may be that A can deduce B has failed or is unreachable if A sends K messages to B and has not received any reply by the time the Kth message has been transmitted.  But it’s more likely that A will use a combination of the count of messages sent and the time since a reply in order to conclude that B is dead or unreachable (at least for the moment). This obvious fact of life in distributed systems is something, for some reason, that academic researchers in distributed systems don’t like but it’s actually really interesting. The current time is data that is shared between nodes on a network with no communication delay as long as clocks are synchronized properly (a whole different topic).  Paxos, like any other consensus algorithm that can tolerate failures, has to rely on timeouts, but these have been marginalized and pushed into leader election and caveats about liveness in the Paxos papers. And that, I believe, accounts for a great deal of the obscurity of the presentations.

Curiously, when it comes time to build a working Paxos implementation, the necessity of time based algorithms becomes clear. The Google developers note:

In our implementation, all replicas implicitly grant a lease to the master of the previous Paxos instance and refuse to process Paxos messages from any other replica while the lease is held. The master maintains a shorter timeout for the lease than the replicas – this protects the system against clock drift. The master periodically submits a dummy “heartbeat” value to Paxos to refresh its lease

Which naturally brings up the question: if you are going to use time, e.g. as in Raft, and on top of that, select a single coordinator, why go to all the trouble of a generic, mostly asynchronous algorithm?

Another interesting question is the relationship of Paxos to the well-known (but not in distributed systems) Chang-Maxemchuk protocol (1983). CM is basically a reliable broadcast – designed to have a set of receiver sites commit messages from a single transmitter in order.  The reformation phase essentially solves the same problem Paxos is attempting to solve – forcing a consensus on a new list of receivers after some failure.

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list. To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test. Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed. Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers. The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.