## Understanding Paxos and Distributed Consensus

`(minor wording correction and more complaining added 10/2/2016, minor edits 10/5/2016)`

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

## Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit values of the most likely previous proposal.

1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
2. When an Acceptor approves a sequence number it:
1. Promises to not approve any smaller sequence numbers
2. Promises to not accept any proposals with smaller sequence numbers
3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the  code for what this looks like in a simulation)

## Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:

 Time goes left to right⇒ … Accept (i,v) More stuff Approve j ….

But if an Acceptor approved j after it accepted (i,v) then it had to send (i,v) or possibly a better proposal back to the Proposer of j. In consequence, the Proposer of j must inherit either (i,v) or a proposal with a higher sequence number.

 … Accept (i,v) More stuff Approve j and return (i,v) or better ….

The requirements for majorities ensures that we have a web of these kinds of ordering requirements and inherited proposals. If a proposal (j,z) is accepted by even a single Acceptor,  a majority of Acceptors had to approve j first – otherwise the proposer of (j,z) could not have sent a proposal.

 Time increases from top to bottom 0 Approve j Approve j … … Approve j … … … n Accept (j,z)

Suppose (i,v) has been accepted by a majority of Acceptors and j>i and (j,z) has been accepted by at least one Acceptor. Since a majority of Acceptors had to approve j and since a majority of Acceptors accepted (i,v), at least one Acceptor did both. And as we observed above, we know that the Acceptor that did both must have approved j after it accepted (i,v) so the Proposer with sequence number j must have either inherited (i,v) or some proposal with a higher sequence number.

If any proposal has been accepted by a majority of Acceptors one of those proposals, say (s,v), must have the smallest sequence number. From the previous paragraph, we know that for j>s where j has been accepted by even a single Acceptor, the Proposer of j either inherited (s,v) or some (k,z) where k>s. But then any such k has the same problem as j – the Proposer of k cannot be an originator  and must have inherited either (s,v) or some other proposal with a sequence number higher than s. This process can’t continue indefinitely, eventually we run out of Proposers and so at least one of the proposals (j,z) with sequence number greater than s has to be inherited directly from (s,v) so z=v. Now look at the remaining proposals with sequence numbers greater than j and using the same logic they must eventually inherit from j or s. The wall of accepted (s,v) proposals does not allow any proposal with even one Acceptor and a sequence number greater than s to dissent from the v consensus.

And then we are done. If (j,z) has been accepted by a majority, we know that j>s so z=v. QED.

# More detailed proof.

1. If i<j, any Acceptor that has both accepted proposal (i,v) and approved sequence number j must have accepted (i,v) before it approved j (Rule 2.2)
2. If i<j any Acceptor that has both accepted (i,v) and approved j, must have sent (i,v) or some higher numbered proposal back to the Proposer with sequence number j. (Rule 2.3 and 1 above)
3. If the Proposer with sequence number j inherited (i,v), then any proposal (j,z) sent to any Acceptor must have v=z. (Rule 4 says the Proposer had to select v as the value, Rule 5 limits it to proposing (j,v) and rule 6 says no other Proposer can use that sequence and Rule 1 says the Acceptor had to get that proposal from a Proposer. )
4. Say (j,z) is descendent of (i,v) if either j inherited (i,v) or j inherited some (k,x) where (k,x) is a descendent of (i,v). If (j,z) is a descendent of (i,v) then z=v. By induction on number of steps: If there is one step, j inherited (i,v) and (3) above gives the result. If (4) is true for any descent of n steps and (j,z) descends from (i,v) in n+1 steps, j inherited some value (k,x) so that (k,x) is a descendent of (i,v) in n steps.
5. Any two sets of Acceptors where both contain majorities must have elements in common.
6. If any Acceptor has accepted proposal (i,v) then a majority of acceptors must have approved i first. (Rule 3).
7. If i < j and a majority of Acceptors have accepted (i,v) and at least one Acceptor has accepted (j,z) then the Proposer of (j,z) must have inherited either (i,v) or some proposal (k,x) where k>i. By (4) we know a majority of acceptors approved j and by (3) we know that there is at least one Acceptor that must be on both sets, and so by (2) the Proposer of j must have inherited either (i,v) or some proposal with a higher sequence number. This is the key result – the majority that accepted (i,v) is a barrier that forces any higher numbered proposal to inherit either (i,v) or some higher numbered proposal.
8. In the situation of step (7) above, (j,z) must be a descendent of (i,v). By (7) we know that no Proposer with sequence number greater than i can be an originator, so if j inherits from some k>i, k must descend from some other proposal. But the chain of inheritance must eventually terminate, so it cannot cycle around Proposers with sequence numbers greater than i, and by (7) the only alternative is to inherit from a proposal with sequence number i, but there can only be one of those (i,v) and so by (4) we conclude z=v.
9. If s is the smallest sequence number so that for some v, (s,v) has been accepted by a majority of Acceptors, then for every sequence number j where for z, (j,z) has been accepted by a majority of Acceptors, z=v (by 8 and s being minimal).

(Note that Rule 2.2 was not used above – either I missed something or it is not necessary. It does prevent some work that cannot ever get to consensus so … )

# Random Discussion

The Multi-Proposer Paxos algorithm is not practical because it can spin without ever getting to a result. Suppose there are 3 Proposers and 5 Acceptors,  Proposer 1 gets 2 approvals, Proposer 3 gets 2 approvals and Proposer 3 gets 2 approvals. Only 3 can advance, so suppose it continues and gets 3 approvals. Then Proposer 4 (which could be a frustrated ex-Proposer 1 with a new sequence number) gets 3 approvals. Proposer 3 now cannot proceed to consensus because a majority of Acceptors have approved 4. And so on. We may never get a single proposal accepted by a single Acceptor. A simulation (code is here)   illustrates this process. The more Acceptors, the higher the probability that the algorithm does not come to consensus in 500 steps  (which is a lot).

The way to fix this is – time outs. In fact, the weird thing about this field is how hard people try to avoid using timeouts, despite the fact that there is no way to avoid depending on timeouts or some other clock dependent methods.  One of the more confusing aspects of “Paxos Made Simple” is that multi-proposer asynchronous Paxos discussed here is the topic of the first 8 pages and then is suddenly dropped. On page 9, without warning, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

The solution: a single Proposer is elected Leader using timeouts!

If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts

But, but, but … Since the first part of the paper is entirely taken up with describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem  because leader election is the hard part. Once we have a leader, we can get consensus without any smarts at all by letting the leader choose a value and get accepts from a majority of the Acceptors. And the multi-proposer version of Paxos is the most interesting version with the most clever approach.

While I’m at it, “Paxos Made Simple” is written in a style that is almost as opaque as the original Paxos paper  and that is couched in terms of a fictional Greek polity and is way too annoying to read (the first sentence! – “Early in this millennium, the Aegean island of Paxos was a thriving mercantile center.1 Wealth led to political sophistication, and the Paxons replaced their ancient theocracy with a parliamentary form of government.”).   “Paxos Made Simple” is an improvement, but is hard to read because the algorithm is introduced backwards via what it claimed to be a proof and the explication is murky and contradictory. For the last consider page 5:

A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request

and then on page 6, summarizing:

(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors

The first is correct, although the second is also correct and easier to prove.

more to come