Category Archives: systems

undefined behavior and the purpose of C

C undefined behavior. From one of the LLVM developers:

This behavior enables an analysis known as “Type-Based Alias Analysis” (TBAA) which is used by a broad range of memory access optimizations in the compiler, and can significantly improve performance of the generated code. For example, this rule allows clang to optimize this function:

float *P;
 void zero_array() {
   int i;
   for (i = 0; i < 10000; ++i)
     P[i] = 0.0f;
 }

into “memset(P, 0, 40000)“. This optimization also allows many loads to be hoisted out of loops, common subexpressions to be eliminated, etc. This class of undefined behavior can be disabled by passing the -fno-strict-aliasing flag, which disallows this analysis. When this flag is passed, Clang is required to compile this loop into 10000 4-byte stores (which is several times slower), because it has to assume that it is possible for any of the stores to change the value of P, as in something like this:

int main() {
  P = (float*)&P;  // cast causes TBAA violation in zero_array.
  zero_array();
}

This sort of type abuse is pretty uncommon, which is why the standard committee decided that the significant performance wins were worth the unexpected result for “reasonable” type casts.

The permitted optimization is very effective: the code is 80 (EIGHTY!) times faster when clang O3 optimization replaces the loop with a memset (some of that is other optimization, but still).  But the programmer has the option of using memset directly – which produces exactly the same optimization. In fact, the programmer has the option of using memset directly because fundamentally C wants to expose the underlying memory to the programmer.

The original motivation for leaving some C behavior undefined was that different processor architectures would produce different behaviors and the expert C programmer was supposed to know about those.  Now compiler writers and standards developers claim it is good to introduce “unexpected results” (that surprise the experienced programmer) because these permit a certain kind of optimization.

Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the “int” as if it were a “float”). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results. The rules for this are quite nuanced and I don’t want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc). This behavior enables an analysis known as “Type-Based Alias Analysis” (TBAA) which is used by a broad range of memory access optimizations in the compiler, and can significantly improve performance of the generated code.

To me: “The rules for this are quite nuanced and I don’t want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc)” means, “we mucked up the standard and we are going to cause many systems to fail as these nuanced rules confuse and surprise otherwise careful and highly expert programmers”. Compiler writers like undefined behavior because, in their interpretation of the standard, these permit any arbitrary code transformation. Anything. The well known controversial results include removing checks for null pointers due to an unreliable compiler inference about dereference behavior.  These uses of “undefined” and limitations on reasonable type coercion are based on an incorrect idea of the purpose of the C language.  Unexpected results are a catastrophe for a C programmer. Limitations on compiler optimizations  based on second guessing the programmer are not catastrophes ( and nothing prevents compiler writers from adding suggestions about optimizations).  There are two cases for this loop:

  1. The C programmer is an expert who used a loop instead of memset for a good reason or because this is not a performance critical part of the code.
  2. The C is programmer is not an expert and program almost certainly contains algorithmic weaknesses that are more significant than the loop –

Neither case benefits from the optimization. Programmers who want the compiler to optimize their algorithms using clever transformations should use programming languages that are better suited to large scale compiler transformations where type information is clear indication of purpose. As Chris Lattner notes:

It is worth pointing out that Java gets the benefits of type-based optimizations without these drawbacks because it doesn’t have unsafe pointer casting in the language at all.

Java hides the memory model from the programmer to make programming safer and also to permit compilers to do clever transformations because the programmer is not permitted to interact with the low level system.  Optimization strategies for the C compiler should take into account that C does not put the programmer inside a nice abstract machine.  The  C compiler doesn’t need to be a 5th rate Mathematica or APL or FORTRAN or  Haskell.  Unexpected results are far more serious a problem for C than missed minor optimizations.

Some reading:

DJT on boring C.

Regehr’s guide for the perplexed. 

Understanding Paxos and Distributed Consensus

vase

(minor wording correction and more complaining added 10/2/2016, minor edits 10/5/2016)

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit values of the most likely previous proposal.

  1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
  2. When an Acceptor approves a sequence number it:
    1. Promises to not approve any smaller sequence numbers
    2. Promises to not accept any proposals with smaller sequence numbers
    3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
  3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
  4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
  5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
  6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the  code for what this looks like in a simulation)

Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:

Continue reading

Chang-Maxemchuk atomic broadcast

The Chang-Maxemchuk algorithm (US Patent 4,725,834 ) solves atomic broadcast (and in-order broadcast) problems for distributed networks in a far simpler and more efficient way than some popular alternatives. In fact, the obscurity of this method is hard to understand given the current interest in distributed consensus.

The basic idea is simple algebra. A source site or process broadcasts “data messages” to a list of sites n sites. Data messages are tagged with sequence numbers and each sequence number is associated with exactly one “responsible”destination site so that  n consecutive sequence numbers map to n sites (the entire list).  For example, if the list sites are numbered 0 … n-1, then sequence number q could be mapped to responsible site q mod n.  Sites on the list broadcast numbered acknowledgment messages to all sites on the list and the source. Only the responsible site for sequence number can create an acknowledgment message numbered  i and the responsible site will only create the acknowledgment if it has received data message i and all lower numbered data messages and acknowledgment messages.  As a result, when the source sees acknowledgment message n+i it is assured that all sites have received the data message numbered  and the acknowledgment.

That’s normal operation mode. There is a reformation mode which is used to create  a list after a failure.  Reading the reformation mode description in the original paper is a good education in how to describe standard “leader election” clearly:

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list.

To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test.

Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed.

Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers.

The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.

This paper was published 1984 and the first Paxos paper was from 1988. In my opinion Paxos is a big step backwards from CM.

 

Data base design criteria: ease of use

Regarding ease-of-use, it’s often struck me when reviewing data systems papers that the evaluation sections are full of performance and correctness criteria, but only rarely is there any discussion of how well a system helps its target users achieve their goals: how easy is it to build, maintain, and debug applications?; how easy is it to operate and troubleshoot? Yet in an industry setting, systems that focus on ease of use (even at the expense of some of the other criteria) have tended to do very well. What would happen if a research program put ease of use (how easy is it to achieve the outcome the user desires) as its top evaluation criteria?  Adrian Colyer

Distributed consensus and network reliability

All of the distributed consensus algorithms I have been reviewing recently (Paxos, Raft, Zab, Chang Maxemchuck, Viewstamped, … ) are based on a number of assumptions about the network environment, including the assumption that messages may be lost but are not silently corrupted.  Is that a good assumption? Perhaps:

Real data 1995

Checksums disagree 2000 

Koopman

 

 

The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

circularity problems in distributed consensus

Distributed consensus involves organizing a collection of independent agents – processes or network sites – to agree on some value or sequence of values.  Many distributed consensus methods depend on a leader-follower scheme in which the leader is an agent that essentially tells the followers what the values are. The challenges in such methods are to determine when enough of the followers have accepted the value and how to recover from failures of agents. In particular, failures of the leader trigger some procedure to select a new leader.  Leader election, however, is a distributed consensus problem. In fact, leader election is the harder problem. Once there is a leader, consensus in the followers can be produced by a dead simple protocol (see the second part of this ).  Oddly, leader election is generally treated as a minor issue. For example, in “Paxos made simple” we read:

The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response.  So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios).  I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.

Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler.  Ok.

 

Making Paxos face facts

Lamport’s  “Paxos Made Simple” paper is notoriously hard to understand but at least part of the difficulty is that the algorithm  changes radically in the middle of the presentation.  The first part of the paper presents a subtle (maybe too subtle) method to permit multiple processes or network sites to agree on a consensus value.  The second part of the paper switches to a second, much simpler algorithm without much notice.

The paper begins with a problem statement:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been

The Paxos algorithm that is  presented first is liable to what we used to call livelock even in the absence of failure:

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.

It’s argued that this original Paxos works in the sense that it never can reach an inconsistent state (it is “safe” ), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven page paper, Paxos is redefined to restrict the set of  proposers to a single member – a distinguished proposer.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals

Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem.

 

A brute force commit solution to the single proposer problem.

What, exactly do the bolded words in the problem statement mean?  “Chosen” and “Learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that says ” I , process A, propose value V with supporting documentation x“.

A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.

Proposal is a simple action: the proposer sends a proposal message.

There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted or spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message.   A process can propose a value, but that proposal may never arrive at any of the other process or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmit of an accept message by some process  that received the proposal message.  Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes  – a majority of the processes. Learning is also simple:

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.

Suppose that there is a distinguished proposer process D. Suppose that process sends value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer.  Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too. All properties are now satisfied:

  • Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
  • Only a single value is chosen: follows from the first.
  • A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform users and Acceptors that have received notification from D can also inform Learners reliably.

If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some  of the Acceptors fail before any Acceptor has accepted, a poll of Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail,  since a majority of Acceptors must have accepted and a minority have failed, there is at least one Acceptor that knows the accepted value and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If  no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm no foul.

Notice, we have not had to worry about reliable store. All we need is the absence of spurious or corrupted messages and the survival of at least 1/2 of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, the single point of failure but an election mechanism can select new proposers.

(revised)

IEEE 1588 PTP is a mess

IEEE 1588 was not designed for modern enterprise computer networks and contains many hacks to make it sort of work. The standard also suffers from being overly explicit on some things and overly unspecific  on others.  One marker of the flawed process is that IEEE 1588 transparent clocks don’t really comply with Ethernet standards because they modify packets without changing the MAC address. So in 2012 the 802.1 and 1588 standards groups started discussing what could be done. The 1588 committee notes that the “intent” (and practice) violates OSI layering but that 1588 doesn’t “mandate” that intent! Oy vey.

Questions have been raised concerning an IEEE 1588-2008 Transparent Clock layer 2 bridge modifying the CorrectionField of Ethernet transported PTP frames without changing the Ethernet source MAC address.  The question is if this operation is permitted by IEEE 802.1Q [1].  The original intent of the IEEE 1588-2008 standard was that a Transparent Clock will forward PTP event frames with no modifications except for the CorrectionField and FCS updates, however IEEE 1588-2008 does not mandate that.