one way queues

Freight_train_in_Tucson_Arizona_2 Here’s some code for lock free queues for a single producer and single consumer. The code is designed for Intel multiprocessors with strong memory model. I don’t know what ARM offers these days. But the strong memory model for x86 means that the program doesn’t need any special synchronizing operations at all. All it needs are a couple of volatile declarations to keep the compiler from caching values that are changed by the other process/thread.

There’s some use of #defines to make it easier to use static type checking but the core method is to keep data in an array (I also have a non-array based lock free linked list I may get organized) and have the producer increment a tail index and the consumer increment a head index. Increments are mod n where n is the number of elements in the array. The only complexity is full and empty conditions. When the array is empty  head== tail but if the producer then fills the array using the last slot should roll the tail back to equal head again. One fix is to just never let the array fill up completely – reserve one element as a buffer. But that would be too easy. So I use one of the bits in the head and tail pointers to indicate condition. That bit it never used to calculate the index. When the producer fills the array it sets the bit in tail to be the complement of the value of that bit in head. When the consumer empties, it sets the bit in head to be the same as the value in tail.

*i = ((OWQ_ELEMENT_T *)q->v)[h];
next = (h + 1) % q->z;
if(next == (t & OWQ_OFFBIT) ) q->h = t; //empty
else q->h = next;

The main code uses the high order bit, but I also have code using the low order bit and shifting so you can see the  comparison.

There’s an example program called test_owq.c and the main code is a header file q.h

Photo is by Simeon87Own work, CC BY-SA 3.0,

The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy on write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24. DOI=


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system.3 An early version of the file system organization described here is discussed in Arnow [ 11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on disk root block, Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIXstyle file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.

Paxos and other consensus algorithms – draft. updated plus chang/maxemchuk

I’ve been looking at Paxos and Raft and Zab and other algorithms that can loosely be called “consensus” algorithms because we want to see where we can improve distributed system operation with precision time synchronization or where we can offer novel features because of precision timing. Google’s Spanner database is an interesting example in this area but there is also nice work in a number of other places.  One obvious question is how to account for the complexity of Paxos, which seems like it should be simple, but as several people have pointed out, is remarkably elusive and complex in practice.

The original Paxos paper is unreadable, but later versions tried, without success, to make it clear. I believe part of the problem is that  time and timeouts are fundamental to operation of distributed algorithms, but particularly in Paxos, there has been a laborious attempt to sweep these things under the table so it looks as much as possible like a purely “asynchronous” algorithm.

Suppose A and B can only communicate by sending/receiving messages over some communications medium that can lose or delay messages, but never corrupt them. A sends a message M to B and waits for a reply message R. B can fail. The original transmit might have failed. It may be that A can deduce B has failed or is unreachable if A sends K messages to B and has not received any reply by the time the Kth message has been transmitted.  But it’s more likely that A will use a combination of the count of messages sent and the time since a reply in order to conclude that B is dead or unreachable (at least for the moment). This obvious fact of life in distributed systems is something, for some reason, that academic researchers in distributed systems don’t like but it’s actually really interesting. The current time is data that is shared between nodes on a network with no communication delay as long as clocks are synchronized properly (a whole different topic).  Paxos, like any other consensus algorithm that can tolerate failures, has to rely on timeouts, but these have been marginalized and pushed into leader election and caveats about liveness in the Paxos papers. And that, I believe, accounts for a great deal of the obscurity of the presentations.

Curiously, when it comes time to build a working Paxos implementation, the necessity of time based algorithms becomes clear. The Google developers note:

In our implementation, all replicas implicitly grant a lease to the master of the previous Paxos instance and refuse to process Paxos messages from any other replica while the lease is held. The master maintains a shorter timeout for the lease than the replicas – this protects the system against clock drift. The master periodically submits a dummy “heartbeat” value to Paxos to refresh its lease

Which naturally brings up the question: if you are going to use time, e.g. as in Raft, and on top of that, select a single coordinator, why go to all the trouble of a generic, mostly asynchronous algorithm?

Another interesting question is the relationship of Paxos to the well-known (but not in distributed systems) Chang-Maxemchuk protocol (1983). CM is basically a reliable broadcast – designed to have a set of receiver sites commit messages from a single transmitter in order.  The reformation phase essentially solves the same problem Paxos is attempting to solve – forcing a consensus on a new list of receivers after some failure.

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list. To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test. Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed. Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers. The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.


Keynes apology

The composition of this book has been for the author a long struggle of escape, and so must the reading of it be for most readers if the author’s assault upon them is to be successful,—a struggle of escape from habitual modes of thought and expression. The ideas which are here expressed so laboriously are extremely simple and should be obvious. The difficulty lies, not in the new ideas, but in escaping from the old ones, which ramify, for those brought up as most of us have been, into every corner of our minds.

From the introduction to the General Theory. J.M. Keynes

Process algebra versus state machines part 1

Robin Milner’s influential book Communication and Concurrency involves a take on state machines that has always puzzled me. “Now in standard automata theory, an automaton is interpreted as a language,” Milner writes “i.e. as a set of strings over the alphabet.” That’s not at all correct, but let’s accept the claim for now and follow the argument. Consider two state machines A and B with an alphabet of events  {a,b,c,d}  and A has states {A,A1,A2, A3} and B has states {B,B’,B1,B2,B3}. The state machine transitions can be given by ordered triplets (state1,input, state2) that show the input label  on a directed edge between state1 and state2.  For Milner’s example:

state machine A has transitions { (A,a,A1), (A1,b,A2), (A1,c,A3), (A3,d,A) },

state machine B has transitions: { (B,a,B1) (B,a,B’), (B1,b, B2), (B’,c,B3), (B3,d,B)}.

B is non-deterministic because there are 2 “a” transitions from state B. Milner points out that if we consider A2 and B2 to be accept states, both machines accept the same language (acd)*ab. So far so good. At this point Milner asks us to think of {a,b,c,d} as “ports” or maybe buttons that can be pushed. The button “is unlocked if the agent can perform the associated action and can be depressed to make it do the action, otherwise the button is locked and cannot be depressed.”  Then: “after the a-button is depressed a difference emerges between A and B. For A – which is deterministic – b and c will be unlocked, while for B – which is non-deterministic – sometimes only b will be unlocked and sometimes only c will be unlocked.” If you don’t look carefully, you’ll miss a subtle change of conditions that has significant ramifications.

An experimenter or external agent trying to push these buttons will discover a difference between the two machines eventually because some times after an initial “a” input on the second state machine a “b” is possible and sometimes not, although on the first state machine after an “a” the “b” is always possible.  But how does the external agent determine that B will not perform a “b” action sometimes? The external agent “attempt[s] to depress b” and fails – the locked/unlocked state of each button is visible to the external agent. So Milner has changed definitions in the middle of the argument.  At the start, finite state machines were language recognizers with, as the classical text on automata theory explains: “output limited to a binary signal: ‘accept/don’t accept’ ” [Hopcroft and Ullman].  Those automata will not tell us  anything else about a word other than that binary condition – is it in the language or not.  But Milner’s button state machines tell us also what buttons are locked and what are unlocked in the terminal state reached by the word.  So Milner’s state machines distinguish words that a recognizer state machines does not. In fact, these Milner state machines have 5 binary outputs in each state – indicating the locked/unlocked status of each button plus accept/don’t accept. State machines with more than a binary output alphabet are called Moore or Mealy machines in poor old standard automata theory.

Standard automata theory does not “interpret” state machines  “as a language” but there is a theorem that the class of languages recognized by those finite state binary output deterministic state machines is the same as the class of languages recognized by finite state non-deterministic state machines. Two  machines that recognize the same language may be distinct in many other ways.   And state machines that have additional outputs (sometimes called “transducers”) are essentially descriptions of maps from input strings to output strings or from input strings to output value in the terminal state. Standard automata theory would say Milner’s two machines accept the same language of strings, but produce different languages of strings.

Standard automata theory, as far as I know, has never really considered the case of non-deterministic Moore machines but the extension is trivial. Milner’s transition systems are just labeled directed graphs with a root vertex. Consider a labeled directed graph G with labels A, a distinguished root vertex (start state)  s0, the set of triples R= { (s1,a,s2) if there is an edge labeled a from s1 to s2}. The set of vertices V is the set of states. We can define a relation R* subset A* x V so that R* is the largest set containing only (null,s0) and (wa,s’) whenever (w,s) is in R* and (s,a,s’) is in R – where wa is the string obtained by appending “a” to “w” on the right.  For any vertex “s” define Q(s) to be the subset of A containing every “a” so that (s,a,s’) is in R for some s’ in V. Then let G* be the set of pairs (w,Q(s)) for every (w,s) in R*.  As far as I can tell on a quick look, Milner’s bisimulation between G1 and G2 is simply equality of G1* and G2*.


More on Fischer, Lynch, Patterson and the parrot theorem.

I’m thinking about distributed consensus algorithms, timestamping, and databases, and am repeatedly seeing references to the Fischer, Lynch, Paterson “theorem”. Here is the problem statement

The problem is for all the data manager processes that have participated in the processing of a particular transaction to agree on whether to install the transaction’s results in the database or to discard them. The latter action might be necessary, for example, if some data managers were, for any reason, unable to carry out the required transaction processing. Whatever decision is made, all data managers must make the same decision in order to preserve the consistency of the database.

We have a set of data manager processes that must come to a consensus about where to commit or to discard. The problem statement requires that ALL of the processes must agree on a result. The processes communicate in some way – they can send each other information or request information from each other.  Implicitly, if a process fails, the others can ignore its opinion. But there is no way for any process to detect whether another process has failed or whether it is just thinking about its response to a request – there is no upper bound on how long a non-failed process can take to send a response (maybe it’s just very busy working on some other problem).  A single failed process is then an insurmountable problem because it is impossible to tell whether it is dead or just resting.

In this paper, we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable: it delivers all messages correctly and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at an inopportune time can cause any distributed commit protocol to fail to reach agreement.

I cannot imagine what was ever surprising about this result. The problem statement says: you cannot distinguish between Dead and Resting. And “surprising result” is that – you cannot distinguish between Dead and Resting. Surely there is something more here?

We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used. (In particular, the solutions in [6] are not applicable.) Finally, we do not postulate the ability to detect the death of a process, so it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly.

There is no way to tell if a process is Dead or if it is Resting (running very slowly). Therefore, no algorithm can determine the set of live processes. There is, then,  no algorithm for obtaining a consensus among all live processes. QED. Maybe someone can enlighten me on what I missed.



ESMA clarifies time sources for MiFID II


ESMA just released guidelines that reinforce what was already clear in the MiFIDII regulation – that GPS time is a perfectly acceptable source of “traceable” time. There is a lot else that is of interest in this report, but it’s a good reminder to not be panicked by marketing scare-tactics.

As per Article 1 of the MiFIR RTS 25, systems that provide direct traceability to the UTC time issued and maintained by a timing centre listed in the BIPM Annual Report on Time Activities are considered as acceptable to record reportable events. The use of the time source of the U.S. Global Positioning System (GPS) or any other global navigation satellite system such as the Russian GLONASS or European Galileo satellite system when it becomes operational is also acceptable to record reportable events. GPS time is different to UTC. However, the GPS time message also includes an offset from UTC (the leap seconds) and this offset should be combined with the GPS timestamp to provide a UTC timestamp.

Theories of abstract automata – Arbib

We have said that automata theory deals with the realization of partial functions F: X* —» Y* by some finitely specifiable substrate. Before we specify in more detail the forms (of which the Turing machine is one) of substrate which have figured most prominently in automata theory, it is useful to distinguish on-line machines from off-line machines. An on-line machine is one that may be thought of as processing data in an interactive situation—in processing a string it must yield a continual flow of outputs, processing each symbol completely (albeit in a way dependent on prior inputs) before it reads in the next symbol. This means that the corresponding function F: X* —> F* must have the following special property:

1 For each nonempty string u of X* there exists a function Fu: X* —» Y* such that for every nonempty v in X*
F(uv) = F(u) • Fu(v) that is, the input string u causes the machine to put out the string F(u) and to “change state” in such a way that it henceforth processes inputs according to a function Fu determined solely by F and u. We call a function sequential if it satisfies property 1.

From: Arbib, M.A., 1969,Theories of Abstract Automata, Prentice-Hall: Englewood Cliffs, N. J., 412 + xiii pages.

MiFID II, GPS and UTC time

I have a post up on FSMLabs web site about the use of GPS and other satellite time for MiFID II timestamp compliance.  It’s fascinating how much effort has recently gone into trying to convince people that MiFID II will require direct time from a national lab or certified via a national lab despite the clear wording in MiFID II proposed regulations. To me, the deal is sealed in the Cost Benefit Analysis in which the ESMA regulators write

“The final draft RTS also reduces costs of the initial draft RTS proposed in the CP by allowing UTC disseminated via satellite systems (i.e. GPS receiver or the use of other satellite systems when available)”

That is not a promise one can easily walk away from. ESMA justifies the regulations with a cost/benefit analysis in which the costs for time stamping are limited by license to use GPS time.  Of course, legal reasoning and logic are not always the same, but I’m trying to figure out how ESMA regulators could claim that they didn’t mean it, or why they would have such a motivation.


 “South Sea Bubble” by Edward Matthew Ward via href=”” Wikimedia