Distributed consensus, paxos, raft, etc.

vase

(minor wording correction and more complaining added 10/2/2016, minor edits 10/5/2016)

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit values of the most likely previous proposal.

  1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
  2. When an Acceptor approves a sequence number it:
    1. Promises to not approve any smaller sequence numbers
    2. Promises to not accept any proposals with smaller sequence numbers
    3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
  3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
  4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
  5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
  6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the  code for what this looks like in a simulation)

Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:

Time goes left to right⇒

Accept (i,v)

More stuff

Approve j

….

 But if an Acceptor approved j after it accepted (i,v) then it had to send (i,v) or possibly a better proposal back to the Proposer of j. In consequence, the Proposer of j must inherit either (i,v) or a proposal with a higher sequence number.

Accept (i,v)

More stuff

Approve j and return (i,v) or better

….

The requirements for majorities ensures that we have a web of these kinds of ordering requirements and inherited proposals. If a proposal (j,z) is accepted by even a single Acceptor,  a majority of Acceptors had to approve j first – otherwise the proposer of (j,z) could not have sent a proposal.

Time increases from top to bottom

0

Approve j

Approve j

Approve j

n

Accept (j,z)

Suppose (i,v) has been accepted by a majority of Acceptors and j>i and (j,z) has been accepted by at least one Acceptor. Since a majority of Acceptors had to approve j and since a majority of Acceptors accepted (i,v), at least one Acceptor did both. And as we observed above, we know that the Acceptor that did both must have approved j after it accepted (i,v) so the Proposer with sequence number j must have either inherited (i,v) or some proposal with a higher sequence number. 

If any proposal has been accepted by a majority of Acceptors one of those proposals, say (s,v), must have the smallest sequence number. From the previous paragraph, we know that for j>s where j has been accepted by even a single Acceptor, the Proposer of j either inherited (s,v) or some (k,z) where k>s. But then any such k has the same problem as j – the Proposer of k cannot be an originator  and must have inherited either (s,v) or some other proposal with a sequence number higher than s. This process can’t continue indefinitely, eventually we run out of Proposers and so at least one of the proposals (j,z) with sequence number greater than s has to be inherited directly from (s,v) so z=v. Now look at the remaining proposals with sequence numbers greater than j and using the same logic they must eventually inherit from j or s. The wall of accepted (s,v) proposals does not allow any proposal with even one Acceptor and a sequence number greater than s to dissent from the v consensus.

And then we are done. If (j,z) has been accepted by a majority, we know that j>s so z=v. QED.

More detailed proof.

  1. If i<j, any Acceptor that has both accepted proposal (i,v) and approved sequence number j must have accepted (i,v) before it approved j (Rule 2.2)
  2. If i<j any Acceptor that has both accepted (i,v) and approved j, must have sent (i,v) or some higher numbered proposal back to the Proposer with sequence number j. (Rule 2.3 and 1 above)
  3. If the Proposer with sequence number j inherited (i,v), then any proposal (j,z) sent to any Acceptor must have v=z. (Rule 4 says the Proposer had to select v as the value, Rule 5 limits it to proposing (j,v) and rule 6 says no other Proposer can use that sequence and Rule 1 says the Acceptor had to get that proposal from a Proposer. )
  4. Say (j,z) is descendent of (i,v) if either j inherited (i,v) or j inherited some (k,x) where (k,x) is a descendent of (i,v). If (j,z) is a descendent of (i,v) then z=v. By induction on number of steps: If there is one step, j inherited (i,v) and (3) above gives the result. If (4) is true for any descent of n steps and (j,z) descends from (i,v) in n+1 steps, j inherited some value (k,x) so that (k,x) is a descendent of (i,v) in n steps.
  5. Any two sets of Acceptors where both contain majorities must have elements in common.
  6. If any Acceptor has accepted proposal (i,v) then a majority of acceptors must have approved i first. (Rule 3).
  7. If i < j and a majority of Acceptors have accepted (i,v) and at least one Acceptor has accepted (j,z) then the Proposer of (j,z) must have inherited either (i,v) or some proposal (k,x) where k>i. By (4) we know a majority of acceptors approved j and by (3) we know that there is at least one Acceptor that must be on both sets, and so by (2) the Proposer of j must have inherited either (i,v) or some proposal with a higher sequence number. This is the key result – the majority that accepted (i,v) is a barrier that forces any higher numbered proposal to inherit either (i,v) or some higher numbered proposal.
  8. In the situation of step (7) above, (j,z) must be a descendent of (i,v). By (7) we know that no Proposer with sequence number greater than i can be an originator, so if j inherits from some k>i, k must descend from some other proposal. But the chain of inheritance must eventually terminate, so it cannot cycle around Proposers with sequence numbers greater than i, and by (7) the only alternative is to inherit from a proposal with sequence number i, but there can only be one of those (i,v) and so by (4) we conclude z=v.
  9. If s is the smallest sequence number so that for some v, (s,v) has been accepted by a majority of Acceptors, then for every sequence number j where for z, (j,z) has been accepted by a majority of Acceptors, z=v (by 8 and s being minimal). 

(Note that Rule 2.2 was not used above – either I missed something or it is not necessary. It does prevent some work that cannot ever get to consensus so … )

Random Discussion

The Multi-Proposer Paxos algorithm is not practical because it can spin without ever getting to a result. Supaxos-ptpx-ppxtppose there are 3 Proposers and 5 Acceptors,  Proposer 1 gets 2 approvals, Proposer 3 gets 2 approvals and Proposer 3 gets 2 approvals. Only 3 can advance, so suppose it continues and gets 3 approvals. Then Proposer 4 (which could be a frustrated ex-Proposer 1 with a new sequence number) gets 3 approvals. Proposer 3 now cannot proceed to consensus because a majority of Acceptors have approved 4. And so on. We may never get a single proposal accepted by a single Acceptor. A simulation (code is here)   illustrates this process. The more Acceptors, the higher the probability that the algorithm does not come to consensus in 500 steps  (which is a lot).

The way to fix this is – time outs. In fact, the weird thing about this field is how hard people try to avoid using timeouts, despite the fact that there is no way to avoid depending on timeouts or some other clock dependent methods.  One of the more confusing aspects of “Paxos Made Simple” is that multi-proposer asynchronous Paxos discussed here is the topic of the first 8 pages and then is suddenly dropped. On page 9, without warning, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

The solution: a single Proposer is elected Leader using timeouts!

If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts

But, but, but … Since the first part of the paper is entirely taken up with describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem  because leader election is the hard part. Once we have a leader, we can get consensus without any smarts at all by letting the leader choose a value and get accepts from a majority of the Acceptors. And the multi-proposer version of Paxos is the most interesting version with the most clever approach.

While I’m at it, “Paxos Made Simple” is written in a style that is almost as opaque as the original Paxos paper  and that is couched in terms of a fictional Greek polity and is way too annoying to read (the first sentence! – “Early in this millennium, the Aegean island of Paxos was a thriving mercantile center.1 Wealth led to political sophistication, and the Paxons replaced their ancient theocracy with a parliamentary form of government.”).   “Paxos Made Simple” is an improvement, but is hard to read because the algorithm is introduced backwards via what it claimed to be a proof and the explication is murky and contradictory. For the last consider page 5:

A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request

and then on page 6, summarizing:

(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors

The first is correct, although the second is also correct and easier to prove.

General state machine method

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

Chang Maxemchuk

The Chang-Maxemchuk algorithm (US Patent 4,725,834 ) solves atomic broadcast (and in-order broadcast) problems for distributed networks in a far simpler and more efficient way than some popular alternatives. In fact, the obscurity of this method is hard to understand given the current interest in distributed consensus.

The basic idea is simple algebra. A source site or process broadcasts “data messages” to a list of sites n sites. Data messages are tagged with sequence numbers and each sequence number is associated with exactly one “responsible”destination site so that  n consecutive sequence numbers map to n sites (the entire list).  For example, if the list sites are numbered 0 … n-1, then sequence number q could be mapped to responsible site q mod n.  Sites on the list broadcast numbered acknowledgment messages to all sites on the list and the source. Only the responsible site for sequence number can create an acknowledgment message numbered  i and the responsible site will only create the acknowledgment if it has received data message i and all lower numbered data messages and acknowledgment messages.  As a result, when the source sees acknowledgment message n+i it is assured that all sites have received the data message numbered  and the acknowledgment.

That’s normal operation mode. There is a reformation mode which is used to create  a list after a failure.  Reading the reformation mode description in the original paper is a good education in how to describe standard “leader election” clearly:

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list.

To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test.

Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed.

Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers.

The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.

This paper was published 1984 and the first Paxos paper was from 1988. In my opinion Paxos is a big step backwards from CM.