Understanding Paxos and Distributed Consensus


(minor wording correction and more complaining added 10/2/2016, minor edits 10/5/2016)

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit values of the most likely previous proposal.

  1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
  2. When an Acceptor approves a sequence number it:
    1. Promises to not approve any smaller sequence numbers
    2. Promises to not accept any proposals with smaller sequence numbers
    3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
  3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
  4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
  5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
  6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the  code for what this looks like in a simulation)

Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:

Continue reading “Understanding Paxos and Distributed Consensus”

Chang-Maxemchuk atomic broadcast

The Chang-Maxemchuk algorithm (US Patent 4,725,834 ) solves atomic broadcast (and in-order broadcast) problems for distributed networks in a far simpler and more efficient way than some popular alternatives. In fact, the obscurity of this method is hard to understand given the current interest in distributed consensus.

The basic idea is simple algebra. A source site or process broadcasts “data messages” to a list of sites n sites. Data messages are tagged with sequence numbers and each sequence number is associated with exactly one “responsible”destination site so that  n consecutive sequence numbers map to n sites (the entire list).  For example, if the list sites are numbered 0 … n-1, then sequence number q could be mapped to responsible site q mod n.  Sites on the list broadcast numbered acknowledgment messages to all sites on the list and the source. Only the responsible site for sequence number can create an acknowledgment message numbered  i and the responsible site will only create the acknowledgment if it has received data message i and all lower numbered data messages and acknowledgment messages.  As a result, when the source sees acknowledgment message n+i it is assured that all sites have received the data message numbered  and the acknowledgment.

That’s normal operation mode. There is a reformation mode which is used to create  a list after a failure.  Reading the reformation mode description in the original paper is a good education in how to describe standard “leader election” clearly:

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list.

To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test.

Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed.

Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers.

The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.

This paper was published 1984 and the first Paxos paper was from 1988. In my opinion Paxos is a big step backwards from CM.


Time out of joint

Financial trading venues and trading systems operate so quickly and rely on clocks so deeply that events like the one noted in this FINRA report are more common than many understand

The findings stated that the firm transmitted to OATS New Order Reports and related subsequent reports where the timestamp for the related subsequent report occurred prior to the receipt of the order,

In electronic trading such errors are easy to make. Two computer servers split the work in some data center and the clock on one is 10 milliseconds faster than the clock on the second. The faster device sends an order to a market and stamps it with the time. The slower device gets the response from the market and stamps it with the time.

Real time Server One Server Two
12:00 Send order clock=12:00.010 Clock=12:00
12:00.05  Clock=12:00.15 Get confirmation. Clock=12:00.05

In fact, for many trading organizations this is scenario does not even require two servers because their clocks can jump backward.


Data base design criteria: ease of use

Regarding ease-of-use, it’s often struck me when reviewing data systems papers that the evaluation sections are full of performance and correctness criteria, but only rarely is there any discussion of how well a system helps its target users achieve their goals: how easy is it to build, maintain, and debug applications?; how easy is it to operate and troubleshoot? Yet in an industry setting, systems that focus on ease of use (even at the expense of some of the other criteria) have tended to do very well. What would happen if a research program put ease of use (how easy is it to achieve the outcome the user desires) as its top evaluation criteria?  Adrian Colyer

Distributed consensus and network reliability

All of the distributed consensus algorithms I have been reviewing recently (Paxos, Raft, Zab, Chang Maxemchuck, Viewstamped, … ) are based on a number of assumptions about the network environment, including the assumption that messages may be lost but are not silently corrupted.  Is that a good assumption? Perhaps:

Real data 1995

Checksums disagree 2000 




The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

circularity problems in distributed consensus

Distributed consensus involves organizing a collection of independent agents – processes or network sites – to agree on some value or sequence of values.  Many distributed consensus methods depend on a leader-follower scheme in which the leader is an agent that essentially tells the followers what the values are. The challenges in such methods are to determine when enough of the followers have accepted the value and how to recover from failures of agents. In particular, failures of the leader trigger some procedure to select a new leader.  Leader election, however, is a distributed consensus problem. In fact, leader election is the harder problem. Once there is a leader, consensus in the followers can be produced by a dead simple protocol (see the second part of this ).  Oddly, leader election is generally treated as a minor issue. For example, in “Paxos made simple” we read:

The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response.  So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios).  I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.

Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler.  Ok.


Making Paxos face facts

Lamport’s  “Paxos Made Simple” paper is notoriously hard to understand but at least part of the difficulty is that the algorithm  changes radically in the middle of the presentation.  The first part of the paper presents a subtle (maybe too subtle) method to permit multiple processes or network sites to agree on a consensus value.  The second part of the paper switches to a second, much simpler algorithm without much notice.

The paper begins with a problem statement:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been

The Paxos algorithm that is  presented first is liable to what we used to call livelock even in the absence of failure:

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.

It’s argued that this original Paxos works in the sense that it never can reach an inconsistent state (it is “safe” ), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven page paper, Paxos is redefined to restrict the set of  proposers to a single member – a distinguished proposer.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals

Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem.


A brute force commit solution to the single proposer problem.

What, exactly do the bolded words in the problem statement mean?  “Chosen” and “Learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that says ” I , process A, propose value V with supporting documentation x“.

A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.

Proposal is a simple action: the proposer sends a proposal message.

There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted or spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message.   A process can propose a value, but that proposal may never arrive at any of the other process or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmit of an accept message by some process  that received the proposal message.  Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes  – a majority of the processes. Learning is also simple:

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.

Suppose that there is a distinguished proposer process D. Suppose that process sends value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer.  Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too. All properties are now satisfied:

  • Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
  • Only a single value is chosen: follows from the first.
  • A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform users and Acceptors that have received notification from D can also inform Learners reliably.

If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some  of the Acceptors fail before any Acceptor has accepted, a poll of Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail,  since a majority of Acceptors must have accepted and a minority have failed, there is at least one Acceptor that knows the accepted value and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If  no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm no foul.

Notice, we have not had to worry about reliable store. All we need is the absence of spurious or corrupted messages and the survival of at least 1/2 of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, the single point of failure but an election mechanism can select new proposers.


state equations in practice


Let’s step back and think about what a discrete state system looks like.


  1. We have a set E of discrete “events” called the event alphabet, The event alphabet defines all the possible events and inputs that can cause state to change.
  2. A finite sequence of events over E then can be thought of as a system history, describing what series of events drove the system from its initial state (which is the point at which no events at all had happened.)  You could complain at this point that deterministic, single thread of execution (the sequence) models can’t describe non-determinism, parallelism, concurrency, and abstract properties, but you’d be wrong – keep reading.
  3. Now we are going to take an odd step – I don’t want to start with a state machine or other object that represents the whole system. I want to start with state variables that express some system property and how it changes in response to events. For example, a state variable Executing might tell us the process identifier of the currently running (executing) process or it could be set valued in a multi-processor environment. And priorityp might be the priority of the process with identifier, p.  As the system advances state, the values of the state variable change but often we want to have properties that are true in any state. For example,  (Executing=p AND priorityq  >  priorityp implies TimeSinceSwitch < t)  would be a good property for some operating system that needs to schedule by priority but also let lower priority processes advance.
  4. These state functions can be thought of as dependent variables that depend on the event sequence and some map that extracts  information from the event sequence. The event sequence is the free variable and the state variable is a dependent variable that embodies some aspect of system state. Everything that determines current state is in the event sequence but state is not a thing, not an object, it’s the cumulative information of multiple state variables depending on that sequence.
  5.  We might have many state variables to specify how an operating system works or even how a simple gate works. Let’s think of all of them at the same level of abstraction as depending on the same event sequence (finite, of course).
  6. The key to this approach is to be able to compactly define state variables and abstract properties of state variables even though the event alphabets and maps on sequences of events will often be enormously complicated.
  7. More precisely, say a variable y is a state variable for an event alphabet E if  y=f(σ) for some function f where σ is a free variable in the set of finite sequences over E which includes the empty sequence Nil.  The map f is a solution for y  that extracts values from the sequence as state advances (as events are appended to the event sequence). Essentially y depends on two parameters: the sequence  and the map. These parameters are often implicit.
  8. By the definition of state variable,  y(Nil) = f(Nil) and represents the value of y in the initial state of the system.
  9. And y(z) = f(z) is the value of y in the state determined by sequence z.
  10. The convention here is that σ is always a free variable over E* (the set of finite sequences over alphabet E). Since y=f(σ) for any w, then y = y(σ), trivially. And I mean this is a trivial, tautological equation, but it’s going to be useful. We can think of y as the value of the state variable in the current state and σ  as the current sequence.
  11. If is a sequence and e is an event or a variable over events, write ze for the sequence obtained by appending e to z on the right.  So y(σe) is the value of y in the “next state” – the state after e. If  y = y(σe) then the event “e” leaves the value of unchanged. If y(σe) = y +2  then the event “e” increases the value of  by 2. Note that if y=f(σ)  then y(σe) = f(σe) so  (y(σe)=y+2) can be rewritten as f(σe)=f(σ) + 2.  I’m sorry to have to explain basic familiar algebra, but this work is an attempt to express properties of systems in, mostly, basic familiar algebra.
  12. Suppose we define y by a pair of equations: y(Nil)= k and y(σe) = h(e, y). If k is some constant and h is a defined map, then we have implicitly defined f on every finite sequence and so implicitly defined y in every state. Note y(Nil) = f(Nil)=k. And y(σe)= h(e,y) = h(e,y(σ)) = h(e,f(σ)). Suppose sequence w=<a,b.c>, then y(w) = f(<a,b,c>) = h(c,f(<a,b>)) = h(c,h(b,f(<a>)) = h(c,h(b,h(a,k)))).  But in many cases we don’t want to specify f completely – f is a solution to the constraints on y and there may be many or infinitely many or no solutions.
  13. For example, suppose C is a state variable and C(Nil)=0 and C(σe) = 1+C. Then we have defined C to be the length of the event sequence. Define L(σ e) = e so L is the most recent event. Define SinceReset(Nil) = 0 and SinceReset(σ e) = 0 if e=RESET and SinceReset(σ e) = 1+ SinceReset otherwise.
  14. Suppose we have a state variable Clock that somehow determines time passed since initial state from the event sequence. Then (Clock(σ e)- Clock) is the duration of event e in this state (and may be different for different values of  σ ). Define  waitingp(Nil)=0 and waitingp(σ e)=0 if Executing(σ e) =p and   waitingp(σ e)= waitingp+(Clock(σ e)- Clock)  otherwise. So  waitingp  is the time that process has been waiting to become the executing process. We might want a property (waiting> k only if priorityp prioritywhere q = Executing).
  15. Ok, now we can consider component composition – which is just function composition. Consider the state variable L defined above and suppose we have a new alphabet B consisting of Leftx and Rightx for x in some set X of data values. The goal is to construct a shift register from instances of L.  Define Li =  L(ui) where ui is also a state variable that has values that are sequences of elements of X. Usually we want components to state in the initial state so ui(Nil) = Nil.  This means Li (Nil)= L(Nil) which is something we didn’t define. Now define ui(σ e) where “e” is an element of the event alphabet of the composite system.  Remember I’m using juxtaposition to mean “append on the right” so, for example (uiLi-1 ) means “append the current value of Li-1 to the current value of ui
    1. ui(σ e)=  uix if i=1 and e=Rightxor if i=n and e=Leftx
    2. ui(σ e)=  ui Li-1  if i>1 and e=Leftx
    3. ui(σ e) =  ui   Li+1  if i<n and e=Rightx
  16. The definition above causes all n components to change state in parallel when an event is appended to the event sequence of the composite system – but it’s just composition: Li =  L(ui)=  L(ui(σ)) and Li (σ e)= L(ui(σ e) ) =L(ui(σ) e’) ) = e’ where e’ is calculated as above.
  17. If  a device D that is constructed from components  C …  Cn  that are interconnected. D is a function of σ but the components don’t “see” the same events. For each component, we need a state variable  ui   that is sequence valued and a function of  σ . Usually these are defined recursively general   ui(Nil) = Nil  and ui(σe) =  concatenate(ui, hi(e,C1(u1 )…  Cn(un ))) or something like that. That is, each event e for D causes event or events   hi(e,C1(u1 )…  Cn(un ))  to be appended to the sequence of events ui,.