# Understanding Paxos and Distributed Consensus

`(minor wording correction and more complaining added 10/2/2016, minor edits 10/5/2016)`

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

## Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each Proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit the value of the highest-numbered proposal already accepted.

1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
2. When an Acceptor approves a sequence number it:
    1. Promises to not approve any smaller sequence numbers
    2. Promises to not accept any proposals with smaller sequence numbers
    3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the code for what this looks like in a simulation)
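The simulation code isn’t reproduced here, so as a stand-in, here is a minimal Python sketch of the two phases described by the rules above – approval, then acceptance – with names of my own choosing and no modeling of message loss or concurrency:

```python
# A minimal sketch of multi-proposer Paxos phases (illustrative names, not the
# original simulation). Acceptors track the highest approved sequence number
# and the highest-numbered proposal they have accepted.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest sequence number approved (rule 2)
        self.accepted = None    # (seq, value) with highest seq accepted, if any

    def approve(self, seq):
        """Phase 1: approve a sequence number, returning any accepted proposal."""
        if seq > self.promised:
            self.promised = seq
            return True, self.accepted
        return False, None

    def accept(self, seq, value):
        """Phase 2: accept a proposal unless a higher number was promised."""
        if seq >= self.promised:
            self.promised = seq
            self.accepted = (seq, value)
            return True
        return False


def propose(acceptors, seq, value):
    """One round for a Proposer with unique number `seq` and a preferred value."""
    # Rule 3: gather approvals from a majority before proposing anything.
    replies = [a.approve(seq) for a in acceptors]
    approvals = [acc for ok, acc in replies if ok]
    if len(approvals) * 2 <= len(acceptors):
        return None                          # no majority; must restart higher
    # Rule 4: inherit the value of the highest-numbered accepted proposal...
    prior = max((p for p in approvals if p), default=None)
    chosen = prior[1] if prior else value    # ...otherwise originate our own
    # Rule 5: propose only (chosen, seq) from here on.
    accepted = sum(a.accept(seq, chosen) for a in acceptors)
    return chosen if accepted * 2 > len(acceptors) else None
```

A second Proposer with a higher sequence number is forced to inherit the first Proposer’s value once a majority has accepted it, which is the whole point of rule 4.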

## Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:
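One way to make the ordering concrete is as a check on the Acceptor’s operation log; a minimal Python sketch, with invented names and sequence numbers only:

```python
# Sketch: an acceptor's log is consistent only if approvals and accepts are
# monotone in sequence number, so approve(j) with j > i must come after any
# accept(i, v). Names here are illustrative.

def consistent(log):
    promised = -1
    for op in log:
        kind, n = op[0], op[1]
        if kind == "approve":
            if n < promised:
                return False    # would break the promise of rule 2.1
            promised = n
        elif n < promised:
            return False        # rule 2.2: accepted number below the promise
    return True

# accept(1, v) then approve(3) is a legal history; the reverse order is not:
assert consistent([("accept", 1), ("approve", 3)])
assert not consistent([("approve", 3), ("accept", 1)])
```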

# circularity problems in distributed consensus

From “Paxos Made Simple”:

> The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response.  So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios).  I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.

Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler.  Ok.

# state equations in practice

When people think of mathematical models of state for programs and other computer systems, it’s natural and conventional to consider state as a map from symbolic names of state variables to values. This is an error for a number of reasons including, most critically, the lack of any compositional structure. Let’s step back and think about what a discrete state system looks like.

Notation warning: I’m using _ for subscripts: x_p or f_r or one_(two).

1. We have a set E of discrete “events” called the event alphabet. The event alphabet defines all the possible events and inputs that can cause state to change.
2. A finite sequence of events over E can be thought of as a system history, describing what series of events drove the system from its initial state (which is the point at which no events at all had happened). You could complain at this point that deterministic, single thread of execution (the sequence) models can’t describe non-determinism, parallelism, concurrency, and abstract properties, but you’d be wrong – keep reading.
3. Now we are going to take an odd step – I don’t want to start with a state machine or other object that represents the whole system. I want to start with state variables that express some system property and how it changes in response to events. For example, a state variable Executing might tell us the process identifier of the currently running (executing) process or it could be set valued in a multi-processor environment. And priority_p might be the priority of the process with identifier p. As the system advances state, the values of the state variables change but often we want to have properties that are true in any state. For example, (Executing=p and q in Waiting and priority_p > priority_q implies TimeSinceSwitch < t) would be a good property for some operating system that needs to schedule by priority but also let lower priority processes advance.
4. State variables are dependent variables that depend on the event sequence and some map that extracts information from the event sequence. The event sequence is the free variable and the state variable is a dependent variable that embodies some aspect of system state. Everything that determines current state is in the event sequence but state is not a completed thing, not a discrete object; it is the cumulative information of multiple state variables depending on that event sequence. The same event sequence may correspond to completely different collections of state variables for different systems.
5.  We might have many state variables to specify how an operating system works or even how a simple gate works. Let’s think of all of them at the same level of abstraction as depending on the same event sequence (finite, of course).
6. The key to this approach is to be able to compactly define state variables and abstract properties of state variables even though the event alphabets and maps on sequences of events will often be enormously complicated.
7. Say a variable y is a state variable for an event alphabet E if  y=f(σ) for some function f where σ is a free variable in the set of finite sequences over E which includes the empty sequence Nil.  The map f is a solution for y  that extracts values from the sequence as state advances (as events are appended to the event sequence). Essentially y depends on two parameters: the sequence  and the map. These parameters are often implicit.
8. By the definition of state variable, if y is a state variable, then y=f(σ) for some f, so y(z) = f(z) by the usual convention.
9. If y is a state variable then y(Nil) is the value of y in the initial state since y(Nil) = f(Nil).
10. The convention here is that σ is always a free variable over E* (the set of finite sequences over alphabet E).
11. If z is a sequence and e is an event or a variable over events, write Append(z,e) or just ze for the sequence obtained by appending e to z on the right. So y(σe) is the value of y in the “next state” – the state after e – since y(σe) = f(σe).
12. If y = y(σe) then the event “e” leaves the value of y unchanged.
13. If y(σe) = y+2 then the event “e” increases the value of y by 2. The equation y(σe)=y+2 can be rewritten as f(σe)=f(σ)+2 if we know f.
14. A primitive recursive definition on the sequence will completely define the solution f if the recursive map is defined. So [y(Nil) = k and y(σe) = h(e, y)] defines f if h and k are defined. Note y(Nil) = f(Nil) = k. And y(σe) = h(e,y) = h(e,y(σ)) = h(e,f(σ)). Suppose sequence w=<a,b,c>, then y(w) = f(<a,b,c>) = h(c,f(<a,b>)) = h(c,h(b,f(<a>))) = h(c,h(b,h(a,k))). In many cases we don’t want to or can’t specify f completely – f is a solution to the constraints on y and there may be many or infinitely many or no solutions.
15. For example, suppose C is a state variable and C(Nil)=0 and C(σe) = 1+C. Then we have defined C to be the length of the event sequence.
16. Define L(σ e) = e so L is the most recent event. Note we do not need to define L(Nil).
17. Define [SinceReset(Nil) = 0 and SinceReset(σe) = (0 if e=RESET and 1+SinceReset otherwise)].
18. Suppose we have a state variable Clock that somehow determines time passed since the initial state from the event sequence. Then (Clock(σe) - Clock) is the duration of event e in this state (and may be different for different values of σ). Define waiting_p(Nil)=0 and waiting_p(σe)=0 if Executing(σe)=p and waiting_p(σe) = waiting_p + (Clock(σe) - Clock) otherwise. So waiting_p is the time that process p has been waiting to become the executing process. We might want a property (waiting_p > k only if priority_p < priority_q where q = Executing).
19. Ok, now we can consider component composition – which is just function composition. Consider the state variable L defined above and suppose we have a new alphabet B consisting of Left_x and Right_x for x in some set X of data values. The goal is to construct a shift register from instances of L. Define L_i = L(u_i) where u_i is also a state variable that has values that are sequences of elements of X. Usually we want components to start in the initial state so u_i(Nil) = Nil. This means L_i(Nil) = L(Nil) which is something we didn’t define. Now define u_i(σe) where “e” is an element of the event alphabet of the composite system. Remember I’m using juxtaposition to mean “append on the right” so, for example, (u_i L_(i-1)) means “append the current value of L_(i-1) to the current value of u_i”.
    1. u_i(σe) = u_i x if i=1 and e=Right_x or if i=n and e=Left_x
    2. u_i(σe) = u_i L_(i-1) if i>1 and e=Left_x
    3. u_i(σe) = u_i L_(i+1) if i<n and e=Right_x
• The definition above causes all n components to change state in parallel when an event is appended to the event sequence of the composite system – but it’s just composition: L_i = L(u_i) = L(u_i(σ)) and L_i(σe) = L(u_i(σe)) = L(u_i(σ)e’) = e’ where e’ is calculated as above.
• If a device D is constructed from components C_1, … C_n that are interconnected, then D is a function of σ but the components don’t “see” the same events. For each component, we need a state variable u_i that is sequence valued and a function of σ. Usually these are defined recursively: u_i(Nil) = Nil and u_i(σe) = concatenate(u_i, h_i(e, C_1(u_1), …, C_n(u_n))) or something like that. That is, each event e for D causes event or events h_i(e, C_1(u_1), …, C_n(u_n)) to be appended to the sequence of events u_i.
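The recursive definitions above can be executed directly if we represent a finite event sequence as a Python tuple; here is a sketch of C, L, and SinceReset from items 15–17 (the representation is my choice, not part of the theory):

```python
# State variables as maps from finite event sequences (tuples) to values.
# The empty tuple () plays the role of Nil.

def C(sigma):
    """C(Nil) = 0 and C(sigma e) = 1 + C(sigma): the length of the sequence."""
    if sigma == ():
        return 0
    return 1 + C(sigma[:-1])

def L(sigma):
    """L(sigma e) = e: the most recent event (undefined on Nil, as in the text)."""
    return sigma[-1]

def SinceReset(sigma):
    """Events since the last RESET, per the primitive recursion of item 17."""
    if sigma == ():
        return 0
    return 0 if sigma[-1] == "RESET" else 1 + SinceReset(sigma[:-1])

w = ("a", "b", "RESET", "c", "d")
assert C(w) == 5
assert L(w) == "d"
assert SinceReset(w) == 2
```

Each function here is one solution f for its state variable; the event sequence stays the single free variable that all of them share.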

# Operations and maps on finite sequences

A lot of what I’m trying to do with mathematical models of computer systems involves operations on finite sequences.

Define a “finite sequence of length n>0” to be any total map f: {1 … n} → X for some set X. The 0-length sequence is the null map  “nulls”. If f is a finite sequence of length n, then g = f affix c is the map g: {1 … n+1}   → X ∪ {c}  so that g(i)= f(i) for  i ≤ n and g(n+1) = c.  Also, if g is length n>0 then there is an f of length n-1  and some c so that g = f affix c .

A primitive recursive function on finite sequences is given by the rules  F(nulls) = k and  F( f affix c) = G(c,F(f)).

For example we can define prefix by  (c prefix nulls) = (nulls affix c) and ( c prefix (f affix d)) = (c prefix f) affix d.

I use affix instead of append because prepend is ugly.
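These definitions run as written if we take the maps {1 … n} → X to be Python dicts; a sketch, with `nulls` as the empty dict (the encoding is my choice):

```python
# Finite sequences as total maps {1..n} -> X, represented by dicts.

nulls = {}                                   # the 0-length sequence (null map)

def affix(f, c):
    """g = f affix c: g(i) = f(i) for i <= n and g(n+1) = c."""
    g = dict(f)
    g[len(f) + 1] = c
    return g

def prefix(c, f):
    """(c prefix nulls) = (nulls affix c);
    (c prefix (f affix d)) = (c prefix f) affix d."""
    if f == nulls:
        return affix(nulls, c)
    d = f[len(f)]                            # peel off the last element d
    shorter = {i: f[i] for i in range(1, len(f))}
    return affix(prefix(c, shorter), d)      # recurse on the shorter sequence

s = affix(affix(nulls, "a"), "b")            # the sequence <a, b>
assert prefix("z", s) == {1: "z", 2: "a", 3: "b"}
```

The `prefix` function is exactly the primitive recursion F(nulls) = k, F(f affix c) = G(c, F(f)) with k = (nulls affix c) and G being affix.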

Two observations about common usage in computer science that differs from standard mathematical practice. First, note how I’m defining sequences as maps but being imprecise about the function images, about the set X. In CS we really only have one type of object – a sequence of binary digits – so X is always a set of finite sequences of binary digits. This is a fundamental property of computing. So if I wanted to be pedantic, I’d first define the set B of finite sequences of binary digits as the set of all maps g:{1 … n} → {0,1} plus the null map, and then define general finite sequences as maps {1 … n} → B. But we use those binary sequences as representations of other mathematical objects and even objects that are usually not considered mathematical objects such as MPEG encodings of video. So it’s convenient to speak of a sequence of integers or strings or a sequence of assorted objects without making the translation between representation and denotation (connotation?). The second observation is that, for the same reason, second order functions are not unusual in computer science – e.g. defining prefix as a function that operates on sequences which are also functions. Note however that in non CS applied math, second order functions are also common.

Painting is The Mathematician by Diego Rivera.

# Process algebra versus state machines part 1

Robin Milner’s influential book Communication and Concurrency involves a take on state machines that has always puzzled me. “Now in standard automata theory, an automaton is interpreted as a language,” Milner writes “i.e. as a set of strings over the alphabet.” That’s not at all correct, but let’s accept the claim for now and follow the argument. Consider two state machines A and B with an alphabet of events  {a,b,c,d}  and A has states {A,A1,A2, A3} and B has states {B,B’,B1,B2,B3}. The state machine transitions can be given by ordered triplets (state1,input, state2) that show the input label  on a directed edge between state1 and state2.  For Milner’s example:

state machine A has transitions { (A,a,A1), (A1,b,A2), (A1,c,A3), (A3,d,A) },

state machine B has transitions: { (B,a,B1) (B,a,B’), (B1,b, B2), (B’,c,B3), (B3,d,B)}.

B is non-deterministic because there are 2 “a” transitions from state B. Milner points out that if we consider A2 and B2 to be accept states, both machines accept the same language (acd)*ab. So far so good. At this point Milner asks us to think of {a,b,c,d} as “ports” or maybe buttons that can be pushed. The button “is unlocked if the agent can perform the associated action and can be depressed to make it do the action, otherwise the button is locked and cannot be depressed.”  Then: “after the a-button is depressed a difference emerges between A and B. For A – which is deterministic – b and c will be unlocked, while for B – which is non-deterministic – sometimes only b will be unlocked and sometimes only c will be unlocked.” If you don’t look carefully, you’ll miss a subtle change of conditions that has significant ramifications.

An experimenter or external agent trying to push these buttons will discover a difference between the two machines eventually because sometimes after an initial “a” input on the second state machine a “b” is possible and sometimes not, although on the first state machine after an “a” the “b” is always possible. But how does the external agent determine that B will not perform a “b” action sometimes? The external agent “attempt[s] to depress b” and fails – the locked/unlocked state of each button is visible to the external agent. So Milner has changed definitions in the middle of the argument. At the start, finite state machines were language recognizers with, as the classical text on automata theory explains: “output limited to a binary signal: ‘accept/don’t accept’ ” [Hopcroft and Ullman]. Those automata will not tell us anything else about a word other than that binary condition – is it in the language or not. But Milner’s button state machines tell us also what buttons are locked and what are unlocked in the terminal state reached by the word. So Milner’s state machines distinguish words that a recognizer state machine does not. In fact, these Milner state machines have 5 binary outputs in each state – indicating the locked/unlocked status of each button plus accept/don’t accept. State machines with more than a binary output alphabet are called Moore or Mealy machines in poor old standard automata theory.

Standard automata theory does not “interpret” state machines  “as a language” but there is a theorem that the class of languages recognized by those finite state binary output deterministic state machines is the same as the class of languages recognized by finite state non-deterministic state machines. Two  machines that recognize the same language may be distinct in many other ways.   And state machines that have additional outputs (sometimes called “transducers”) are essentially descriptions of maps from input strings to output strings or from input strings to output value in the terminal state. Standard automata theory would say Milner’s two machines accept the same language of strings, but produce different languages of strings.

Standard automata theory, as far as I know, has never really considered the case of non-deterministic Moore machines but the extension is trivial. Milner’s transition systems are just labeled directed graphs with a root vertex. Consider a labeled directed graph G with labels A, a distinguished root vertex (start state)  s0, the set of triples R= { (s1,a,s2) if there is an edge labeled a from s1 to s2}. The set of vertices V is the set of states. We can define a relation R* subset A* x V so that R* is the largest set containing only (null,s0) and (wa,s’) whenever (w,s) is in R* and (s,a,s’) is in R – where wa is the string obtained by appending “a” to “w” on the right.  For any vertex “s” define Q(s) to be the subset of A containing every “a” so that (s,a,s’) is in R for some s’ in V. Then let G* be the set of pairs (w,Q(s)) for every (w,s) in R*.  As far as I can tell on a quick look, Milner’s bisimulation between G1 and G2 is simply equality of G1* and G2*.
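The G* construction can be checked mechanically. Since R* is infinite when the graph has cycles, the sketch below truncates at a bounded word length (the bound is mine, not part of the definition):

```python
from collections import defaultdict

def g_star(transitions, start, depth):
    """Pairs (w, Q(s)) for all (w, s) in R* with len(w) <= depth."""
    succ = defaultdict(list)
    for s1, a, s2 in transitions:
        succ[s1].append((a, s2))
    # Q(s): the set of labels "a" with some edge (s, a, s') in R.
    Q = {s: frozenset(a for a, _ in succ[s]) for s in succ}
    out, frontier = set(), [("", start)]
    for _ in range(depth + 1):
        nxt = []
        for w, s in frontier:
            out.add((w, Q.get(s, frozenset())))
            nxt += [(w + a, s2) for a, s2 in succ.get(s, [])]
        frontier = nxt
    return out

# Milner's two machines from the text:
A = [("A", "a", "A1"), ("A1", "b", "A2"), ("A1", "c", "A3"), ("A3", "d", "A")]
B = [("B", "a", "B1"), ("B", "a", "B'"),
     ("B1", "b", "B2"), ("B'", "c", "B3"), ("B3", "d", "B")]

# After "a", machine A unlocks {b, c} while B unlocks {b} or {c} - so the
# truncated G* sets already differ at words of length 1.
assert ("a", frozenset({"b", "c"})) in g_star(A, "A", 2)
assert g_star(A, "A", 2) != g_star(B, "B", 2)
```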

# Theories of abstract automata – Arbib

We have said that automata theory deals with the realization of partial functions F: X* → Y* by some finitely specifiable substrate. Before we specify in more detail the forms (of which the Turing machine is one) of substrate which have figured most prominently in automata theory, it is useful to distinguish on-line machines from off-line machines. An on-line machine is one that may be thought of as processing data in an interactive situation—in processing a string it must yield a continual flow of outputs, processing each symbol completely (albeit in a way dependent on prior inputs) before it reads in the next symbol. This means that the corresponding function F: X* → Y* must have the following special property:

1. For each nonempty string u of X* there exists a function F_u: X* → Y* such that for every nonempty v in X*, F(uv) = F(u) · F_u(v); that is, the input string u causes the machine to put out the string F(u) and to “change state” in such a way that it henceforth processes inputs according to a function F_u determined solely by F and u. We call a function sequential if it satisfies property 1.
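A sketch of property 1 in Python, using a made-up on-line function (the running parity of the bits read so far) and its residual F_u:

```python
# An on-line, length-preserving function: each input bit produces one output
# bit, the running XOR of everything seen so far.

def F(xs, state=0):
    out = []
    for x in xs:
        state ^= x          # the "state" after each symbol
        out.append(state)
    return out

def residual(u):
    """F_u: the function the machine computes after consuming u."""
    s = 0
    for x in u:
        s ^= x
    return lambda v: F(v, s)

# Property 1: F(uv) = F(u) . F_u(v), with "." as concatenation.
u, v = [1, 0, 1], [1, 1]
assert F(u + v) == F(u) + residual(u)(v)
```

The residual depends on u only through the state it leaves behind, which is exactly the sense in which F_u is “determined solely by F and u”.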

# Algebra for computer science and function nesting

The fundamental mathematical basis of computer science is that code and data are the same things. A sequence of bits can represent real numbers, integers, vectors, groups, matrices, video, audio, or programs, algorithms or even proofs. When we try to describe these systems mathematically, it should not be surprising when we encounter a value that is also a map and a map with an image that consists of more maps. For example, in a UNIX/MULTICS type file system, if the file system is represented by a map F:Paths → Data, the data for a directory could also be considered to be or to represent a map. If F(home/snowden/passwords) is a block of text, then F(home/snowden) is or encodes a map from strings to some information about where to find the file. In the original UNIX file systems F(home/snowden) is a map from strings to inode numbers. So (F(home/snowden))(passwords) is the inode number of the file that contains the passwords. UNIX style tree structured file systems are generally embedded in simpler file systems that map inode numbers to file contents S:Inode → Data. So the resolution of file names to data involves something like this
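A sketch of that resolution with a made-up inode table: S maps inode numbers to contents, and the content of a directory inode is itself a map from names to inode numbers:

```python
# Hypothetical inode table S: Inode -> Data. Directory data is a name -> inode
# map; file data is just bytes (a string here, for simplicity).
S = {
    0: {"home": 1},          # root directory
    1: {"snowden": 2},
    2: {"passwords": 3},
    3: "s3cret\n",
}

def resolve(path, root=0):
    """Walk the path one component at a time:
    F(c1/c2/.../ck) = S(...S(S(S(root))(c1))(c2)...)(ck)."""
    inode = root
    for component in path.split("/"):
        inode = S[inode][component]
    return S[inode]

assert resolve("home/snowden/passwords") == "s3cret\n"
assert resolve("home/snowden") == {"passwords": 3}
```

Each step applies the map S to get data, then applies that data as a map to the next name component, which is the nesting of functions the paragraph describes.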