Making Paxos face facts

Lamport’s “Paxos Made Simple” paper is notoriously hard to understand, and at least part of the difficulty is that the algorithm changes radically in the middle of the presentation. The first part of the paper presents a subtle (maybe too subtle) method that permits multiple processes or network sites to agree on a consensus value. The second part of the paper switches to a second, much simpler algorithm without much notice.

The paper begins with a problem statement:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been

The Paxos algorithm that is  presented first is liable to what we used to call livelock even in the absence of failure:

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.

It’s argued that this original Paxos works in the sense that it can never reach an inconsistent state (it is “safe”), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven-page paper, Paxos is redefined to restrict the set of proposers to a single member – a distinguished proposer.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals

Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that the multi-proposer and single-proposer algorithms are called by the same name, which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live” paper appears to discuss an implementation based on the first (multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”.

In any case, the single proposer problem is much simpler than the original problem.


A brute-force commit solution to the single-proposer problem

What, exactly, do the bolded words in the problem statement mean? “Chosen” and “learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that say “I, process A, propose value V with supporting documentation x“.

A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.

Proposal is a simple action: the proposer sends a proposal message.

There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted nor spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message. A process can propose a value, but that proposal may never arrive at any of the other processes, or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmission of an accept message by some process that received the proposal message. Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes – a majority of the processes. Learning is also simple:

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.

Suppose that there is a distinguished proposer process D. Suppose that D sends a value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer. Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too (a small code sketch follows the list of properties below). All properties are now satisfied:

  • Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
  • Only a single value is chosen: follows from the first.
  • A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform Learners, and Acceptors that have received notification from D can also inform Learners reliably.
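Here is a minimal, runnable sketch in Python of the brute-force protocol just described. It is only an illustration under the assumptions above (one distinguished proposer, lossy but uncorrupted messages, a fixed set of Acceptors); the class names, the loss-rate simulation, and the message formats are invented for this sketch, not taken from any real Paxos implementation.

```python
import random

class Acceptor:
    """Accepts a value only when it arrives from the distinguished proposer."""
    def __init__(self):
        self.accepted = None   # value this acceptor has accepted, if any
        self.chosen = None     # set once the proposer notifies us a value is chosen

    def on_propose(self, value):
        self.accepted = value
        return ("accept", value)          # accept message back to the proposer

    def on_chosen(self, value):
        self.chosen = value               # lets this acceptor answer Learners too


class DistinguishedProposer:
    """The single proposer D: proposes one value, marks it chosen on a majority."""
    def __init__(self, acceptors):
        self.acceptors = acceptors
        self.chosen = None

    def propose(self, value, loss_rate=0.3):
        accepts = 0
        for a in self.acceptors:
            if random.random() < loss_rate:     # proposal message lost
                continue
            reply = a.on_propose(value)
            if random.random() < loss_rate:     # accept message lost
                continue
            if reply == ("accept", value):
                accepts += 1
        if accepts > len(self.acceptors) // 2:  # majority accepted: value is chosen
            self.chosen = value
            for a in self.acceptors:
                a.on_chosen(value)              # notifications may also be lost in reality
        return self.chosen


def learn(proposer, acceptors):
    """A Learner asks the proposer, or any notified Acceptor, for the chosen value."""
    if proposer.chosen is not None:
        return proposer.chosen
    return next((a.chosen for a in acceptors if a.chosen is not None), None)


acceptors = [Acceptor() for _ in range(5)]
d = DistinguishedProposer(acceptors)
d.propose("V")              # a real proposer would simply retry until a majority accepts
print(learn(d, acceptors))  # "V" once chosen, otherwise None
```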

If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some of the Acceptors fail before any Acceptor has accepted, a poll of the Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail, then since a majority of Acceptors must have accepted and only a minority have failed, there is at least one Acceptor that knows the accepted value, and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm, no foul.

Notice that we have not had to worry about reliable storage. All we need is the absence of spurious or corrupted messages and the survival of at least half of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, a single point of failure, but an election mechanism can select new proposers.

(revised)

state equations in practice


When people think of mathematical models of state for programs and other computer systems, it’s natural and conventional to consider state as a map from symbolic names of state variables to values. This is an error for a number of reasons, including, most critically, the lack of any compositional structure. Let’s step back and think about what a discrete state system looks like.

Notation warning: I’m using _ for subscripts: x_p or f_r or one_(two).

  1. We have a set E of discrete “events” called the event alphabet. The event alphabet defines all the possible events and inputs that can cause the state to change.
  2. A finite sequence of events over E can be thought of as a system history, describing what series of events drove the system from its initial state (the point at which no events at all have happened). You could complain at this point that deterministic, single-thread-of-execution (the sequence) models can’t describe non-determinism, parallelism, concurrency, or abstract properties, but you’d be wrong – keep reading.
  3. Now we are going to take an odd step – I don’t want to start with a state machine or other object that represents the whole system. I want to start with state variables that express some system property and how it changes in response to events. For example, a state variable Executing might tell us the process identifier of the currently running (executing) process, or it could be set valued in a multi-processor environment. And priority_p might be the priority of the process with identifier p. As the system advances state, the values of the state variables change, but often we want to have properties that are true in any state. For example, (Executing=p and q in Waiting and priority_p > priority_q implies TimeSinceSwitch < t) would be a good property for some operating system that needs to schedule by priority but also let lower priority processes advance.
  4. State variables are dependent variables that depend on the event sequence and some map that extracts information from the event sequence. The event sequence is the free variable and the state variable is a dependent variable that embodies some aspect of system state. Everything that determines current state is in the event sequence, but state is not a completed thing, not a discrete object; it is the cumulative information of multiple state variables depending on that event sequence. The same event sequence may correspond to completely different collections of state variables for different systems.
  5.  We might have many state variables to specify how an operating system works or even how a simple gate works. Let’s think of all of them at the same level of abstraction as depending on the same event sequence (finite, of course).
  6. The key to this approach is to be able to compactly define state variables and abstract properties of state variables even though the event alphabets and maps on sequences of events will often be enormously complicated.
  7. Say a variable y is a state variable for an event alphabet E if  y=f(σ) for some function f where σ is a free variable in the set of finite sequences over E which includes the empty sequence Nil.  The map f is a solution for y  that extracts values from the sequence as state advances (as events are appended to the event sequence). Essentially y depends on two parameters: the sequence  and the map. These parameters are often implicit.
  8. By the definition of state variable, if y is a state variable, then y = f(σ) for some f, so y(z) = f(z) by the usual convention.
  9. If y is a state variable then y(Nil) is the value of y in the initial state, since y(Nil) = f(Nil).
  10. The convention here is that σ is always a free variable over E* (the set of finite sequences over alphabet E).
  11. If z is a sequence and e is an event or a variable over events, write Append(z,e) or just ze for the sequence obtained by appending e to z on the right. So y(σe) is the value of y in the “next state” – the state after e – since y(σe) = f(σe).
  12. If y = y(σe) then the event “e” leaves the value of y unchanged.
  13. If y(σe) = y + 2 then the event “e” increases the value of y by 2. The equation y(σe) = y + 2 can be rewritten as f(σe) = f(σ) + 2 if we know f.
  14. A primitive recursive definition on the sequence will completely define the solution f if the recursive map is defined. So [y(Nil) = k and y(σe) = h(e, y)] defines f if h and k are defined. Note y(Nil) = f(Nil) = k. And y(σe) = h(e, y) = h(e, y(σ)) = h(e, f(σ)). Suppose the sequence w = <a,b,c>; then y(w) = f(<a,b,c>) = h(c, f(<a,b>)) = h(c, h(b, f(<a>))) = h(c, h(b, h(a, k))). In many cases we don’t want to or can’t specify f completely – f is a solution to the constraints on y and there may be many, infinitely many, or no solutions.
  15. For example, suppose C is a state variable and C(Nil)=0 and C(σe) = 1+C. Then we have defined C to be the length of the event sequence. (A small runnable sketch of these example definitions appears after this list.)
  16. Define L(σ e) = e so L is the most recent event. Note we do not need to define L(Nil).
  17. Define SinceReset(Nil) = 0 and SinceReset(σe) = (0 if e = RESET, and 1 + SinceReset otherwise).
  18. Suppose we have a state variable Clock that somehow determines, from the event sequence, the time passed since the initial state. Then (Clock(σe) - Clock) is the duration of event e in this state (and may be different for different values of σ). Define waiting_p(Nil) = 0, waiting_p(σe) = 0 if Executing(σe) = p, and waiting_p(σe) = waiting_p + (Clock(σe) - Clock) otherwise. So waiting_p is the time that process p has been waiting to become the executing process. We might want a property like (waiting_p > k only if priority_p < priority_q where q = Executing).
  19. Ok, now we can consider component composition – which is just function composition. Consider the state variable L defined above and suppose we have a new alphabet B consisting of Left_x and Right_x for x in some set X of data values. The goal is to construct a shift register from instances of L. Define L_i = L(u_i) where u_i is also a state variable whose values are sequences of elements of X. Usually we want components to start in the initial state, so u_i(Nil) = Nil. This means L_i(Nil) = L(Nil), which is something we didn’t define. Now define u_i(σe) where “e” is an element of the event alphabet of the composite system. Remember I’m using juxtaposition to mean “append on the right” so, for example, (u_i L_(i-1)) means “append the current value of L_(i-1) to the current value of u_i”:
    1. u_i(σe) = u_i x if i=1 and e=Right_x, or if i=n and e=Left_x
    2. u_i(σe) = u_i L_(i-1) if i>1 and e=Left_x
    3. u_i(σe) = u_i L_(i+1) if i<n and e=Right_x
  • The definition above causes all n components to change state in parallel when an event is appended to the event sequence of the composite system – but it’s just composition: L_i = L(u_i) = L(u_i(σ)) and L_i(σe) = L(u_i(σe)) = L(u_i(σ)e’) = e’ where e’ is calculated as above.
  • Suppose a device D is constructed from interconnected components C_1, … C_n. D is a function of σ but the components don’t “see” the same events. For each component, we need a state variable u_i that is sequence valued and a function of σ. Usually these are defined recursively: u_i(Nil) = Nil and u_i(σe) = concatenate(u_i, h_i(e, C_1(u_1), …, C_n(u_n))) or something like that. That is, each event e for D causes the event or events h_i(e, C_1(u_1), …, C_n(u_n)) to be appended to the sequence of events u_i.
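As a concrete illustration, here is a small Python sketch (my own, not part of any formal framework) of the C, L, and SinceReset variables defined above. Each state variable is literally a function of the finite event sequence (a tuple here), defined by primitive recursion over that sequence.

```python
# A state variable is a function of the finite event sequence.
# Each is defined by primitive recursion: a value at Nil and a rule for (sigma, e).

Nil = ()

def C(sigma):
    """C(Nil) = 0, C(sigma e) = 1 + C(sigma): the length of the event sequence."""
    if sigma == Nil:
        return 0
    return 1 + C(sigma[:-1])

def L(sigma):
    """L(sigma e) = e: the most recent event. Deliberately undefined at Nil."""
    return sigma[-1]

def SinceReset(sigma):
    """SinceReset(Nil) = 0; reset to 0 on a RESET event, otherwise add 1."""
    if sigma == Nil:
        return 0
    *rest, e = sigma
    return 0 if e == "RESET" else 1 + SinceReset(tuple(rest))

sigma = ("a", "b", "RESET", "c", "d")          # a system history
print(C(sigma), L(sigma), SinceReset(sigma))   # 5 d 2
```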

Cutting and pasting about Kodak’s demise

This graph from Peter Diamandis about how Kodak entered the sinkhole is kind of amazing. Diamandis explains Kodak’s failure to swerve in what is, I think, the orthodox Silicon Valley analysis:

[in 1996] Kodak had a $28 billion market cap and 140,000 employees.

In 1976, 20 years earlier, Kodak had invented the digital camera. They owned the IP and had the first mover advantage. This is a company that should have owned it all.

Instead, in 2012, Kodak filed for bankruptcy, put out of business by the very technology they had invented.

What happened? Kodak was married to the “paper and chemicals” (film development) business… their most profitable division, while the R&D on digital cameras was a cost center. They saw the digital world coming on, but were convinced that digital cameras wouldn’t have traction outside of the professional market. They certainly had the expertise to design and build consumer digital cameras — Kodak actually built the Apple QuickTake (see photo), generally considered the world’s first consumer digital camera. Amazingly, Kodak decided they didn’t even want to put their name on the camera.

There is more of the same (2012 and before) in the MIT Technology Review. This is a totally convincing story (it had me convinced), but it leaves out three things:

  1. the boring old chemicals division of Eastman Kodak, which was spun off in 1993 (three years earlier), is still around, profitable ($10B/year revenue), and dwarfs what’s left of the original company,
  2. and Fujifilm, Kodak’s also-ran in the chemical film space, managed to leap over the sinkhole and prosper, but not by relying on digital cameras.
  3. Kodak did briefly become a market leader in digital cameras but ran into a more fundamental problem.

Back in 1993, the business press and analysts thought Kodak was doing the right thing by divesting its chemical business.

NEW YORK — Eastman Kodak Co., struggling against poor profit and high debt, Tuesday took a big step in its corporate restructuring, announcing that it will divest Eastman Chemical Co. and in one fell swoop wipe out $2 billion of debt.

Such a spinoff would not have occurred just a few years ago, analysts said, and the move signals that Chief Executive Kay R. Whitmore is responding to new, tougher markets and stockholder pressure to improve financial results quickly.

“They are now recognizing that they are not a growth company, that they must go through this downsizing,” analyst Eugene Glazer of Dean Witter Reynolds said in an interview on CNBC.

Kodak’s shares, up sharply Monday in anticipation of the announcement, ended down $1.375 to $52.375 on the New York Stock Exchange.

[…] “We determined that there was little strategic reason related to our core imaging and health business for Kodak to continue to own Eastman,” Whitmore said at a news conference.

Kodak, best known for photography products but also a major pharmaceutical and chemicals group, has endured slow growth for years. Its photography business has been hit by changing demographics, foreign rivals and new technologies such as camcorders. Whitmore said Kodak sales, especially in photography and imaging, were weak.

[…] Costs will be reduced elsewhere in the company, Whitmore said. Other executives also have said Kodak will cut spending on research and development of new products.

In retrospect, dumping the cash generating parts of the business and cutting R&D was  not the best plan even if Wall St. analysts loved the idea. But it’s easy to be a genius after the fact as Willy Shih points out:

Responding to recommendations from management experts, from the mid-1990s to 2003 the company set up a separate division (which I ran) charged with tackling the digital opportunity. Not constrained by any legacy assets or practices, the new division was able to build a leading market share position in digital cameras — a position that was essentially decimated soon thereafter when smartphones with built-in cameras overtook the market.

Yes, those camera phones – which not too many people saw coming in the 1990s. Not only that, but Kodak’s path in digital imaging was not obvious.

The transition from analog to digital imaging brought several challenges. First, digital imaging was based on a general-purpose semiconductor technology platform that had nothing to do with film manufacturing — it had its own scale and learning curves. The broad applicability of the technology platform meant that it could be scaled up in numerous high-volume markets (such as microprocessors, logic circuits, and communications chips) apart from digital imaging. Suppliers selling components offered the technology to anyone who would pay, and there were few entry barriers. What’s more, digital technology is modular. A good engineer could buy all the building blocks and put together a camera. These building blocks abstracted almost all the technology required, so you no longer needed a lot of experience and specialized skills.

Semiconductor technology was well outside of Kodak’s core know-how and organizational capabilities. Even though the company invested lots of money in the basic research and manufacturing of solid-state semiconductor image sensors and developed some notable inventions (including the color filter array that is used on virtually every color image sensor), it had little hope of being a competitive volume supplier of image sensor components, and it was difficult for Kodak to offer something distinctive.

And Shih, perhaps unintentionally, reinforces Diamandis’s point that the top company managers failed to face up to the problem.

For many managers of legacy businesses, the survival instinct kicked in. Some who had worked at Kodak for decades felt they were entitled to be reassigned to the new businesses, or wished to control sales channels for digital products. But that just fueled internal strife. Kodak ended up merging the consumer digital, professional, and legacy consumer film divisions in 2003. Kodak then tried to make inroads in the inkjet printing business, spending heavily to compete with fortified incumbents such as HP, Canon, and Epson. But the effort failed, and Kodak exited the printer business after it filed for Chapter 11 bankruptcy reorganization in 2012.

Management chaos and “spending heavily to compete with fortified incumbents”.

Yep.

With the benefit of hindsight, it’s interesting to ask how Kodak might have been able to achieve a different outcome. One argument is that the company could have tried to compete on capabilities rather than on the markets it was in. This would have meant directing its skills in complex organic chemistry and high-speed coating toward other products involving complex materials — a path followed successfully by Fuji. However, this would have meant walking away from a great consumer franchise. That’s not the logic that managers learn at business schools, and it would have been a hard pill for Kodak leaders to swallow.

it would have been a hard pill for Kodak leaders to swallow.

But wasn’t that their job? So to conclude this exercise in cut and paste, what about Fuji? The Economist had an interesting take:

 the digital imaging sector accounts for only about one-fifth of Fujifilm’s revenue, down from more than half a decade ago.

How Fujifilm succeeded serves as a warning to American firms about the danger of trying to take the easy way out: competing through one’s marketing rather than taking the harder route of developing new products and new businesses. […]

Like Kodak, Fujifilm realised in the 1980s that photography would be going digital. Like Kodak, it continued to milk profits from film sales, invested in digital technologies, and tried to diversify into new areas. Like Kodak, the folks in the wildly profitable film division were in control and late to admit that the film business was a lost cause. As late as 2000 Fujifilm counted on a gentle 15 or 20-year decline of film—not the sudden free-fall that took place. Within a decade, film went from 60% of Fujifilm’s profits to basically nothing.

If the market forecast, strategy and internal politics were the same, why the divergent outcomes? The big difference was execution.

Fujifilm realised it needed to develop in-house expertise in the new businesses. In contrast, Kodak seemed to believe that its core strength lay in brand and marketing, and that it could simply partner or buy its way into new industries, such as drugs or chemicals. The problem with this approach was that without in-house expertise, Kodak lacked some key skills: the ability to vet acquisition candidates well, to integrate the companies it had purchased and to negotiate profitable partnerships. “Kodak was so confident about their marketing capability and their brand, that they tried to take the easy way out,” says Mr Komori.

Fujifilm realised it needed to develop in-house expertise in the new businesses.

ok.


Synchronous processors

Imagine a processor with no interrupts. We can do a lot better and get rid of most exceptions (e.g. system calls, page faults, etc.), most peripheral devices/buses, and even cache misses, but let’s start with interrupts. Modern micro-processors are bloated with circuitry that is designed to allow the precious CPU to switch back and forth between streams of code, because CPU cycles were the scarcest resource. That was a great idea in the 1950s. But interrupts are expensive in compute time, circuit complexity, and chip area. Interrupts take microseconds to start processing – which is an eternity on a GHz processor. And they solve a problem that does not exist anymore: we have a lot of processor cores. In fact, one problem faced by computer architects is that it’s not easy to exploit parallel processing if you keep thinking in terms of 1950s computer architecture.

Suppose you have 32 cores and you make one or two of them poll I/O devices and the other 30 never get I/O interrupts, never run interrupt handling code, never have to save and restore state or context switch due to I/O interrupts.  What they do is run application code to completion or for a specified number of cycles or clock ticks. The cores reserved for operating system use manage all devices and even they don’t need interrupts because they can run a real-time OS that polls devices.
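A toy sketch of that division of labor, in Python purely for illustration (the device objects, queue layout, and core counts are invented here, not any real kernel API): one dedicated core polls devices and routes completed I/O to per-core queues, while the application cores run work items to completion and never take an interrupt.

```python
from collections import deque

# Hypothetical devices with a non-blocking poll() that may return a completed I/O.
class Device:
    def __init__(self, name, events):
        self.name, self.events = name, deque(events)
    def poll(self):
        return self.events.popleft() if self.events else None

devices = [Device("nic", ["pkt1", "pkt2"]), Device("disk", ["blk7"])]
app_queues = [deque() for _ in range(3)]      # one inbox per application core

def os_core_step():
    """The dedicated I/O core: poll every device, route completions to app cores."""
    for d in devices:
        ev = d.poll()
        if ev is not None:
            app_queues[hash(ev) % len(app_queues)].append((d.name, ev))

def app_core_step(core_id):
    """An application core: run one work item to completion, no preemption."""
    if app_queues[core_id]:
        dev, ev = app_queues[core_id].popleft()
        print(f"core {core_id}: handled {ev} from {dev}")

for _ in range(4):            # a few rounds of the whole machine, no interrupts anywhere
    os_core_step()
    for core in range(len(app_queues)):
        app_core_step(core)
```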


IEEE 1588 PTP is a mess

IEEE 1588 was not designed for modern enterprise computer networks and contains many hacks to make it sort of work. The standard also suffers from being overly explicit on some things and overly unspecific  on others.  One marker of the flawed process is that IEEE 1588 transparent clocks don’t really comply with Ethernet standards because they modify packets without changing the MAC address. So in 2012 the 802.1 and 1588 standards groups started discussing what could be done. The 1588 committee notes that the “intent” (and practice) violates OSI layering but that 1588 doesn’t “mandate” that intent! Oy vey.

Questions have been raised concerning an IEEE 1588-2008 Transparent Clock layer 2 bridge modifying the CorrectionField of Ethernet transported PTP frames without changing the Ethernet source MAC address.  The question is if this operation is permitted by IEEE 802.1Q [1].  The original intent of the IEEE 1588-2008 standard was that a Transparent Clock will forward PTP event frames with no modifications except for the CorrectionField and FCS updates, however IEEE 1588-2008 does not mandate that.

Operations and maps on finite sequences

A lot of what I’m trying to do with mathematical models of computer systems involves operations on finite sequences.

Define a “finite sequence of length n>0” to be any total map f: {1 … n} → X for some set X. The 0-length sequence is the null map “nulls”. If f is a finite sequence of length n, then g = f affix c is the map g: {1 … n+1} → X ∪ {c} so that g(i) = f(i) for i ≤ n and g(n+1) = c. Also, if g is of length n>0 then there is an f of length n-1 and some c so that g = f affix c.

A primitive recursive function on finite sequences is given by the rules  F(nulls) = k and  F( f affix c) = G(c,F(f)). 

For example we can define prefix by  (c prefix nulls) = (nulls affix c) and ( c prefix (f affix d)) = (c prefix f) affix d.

I use affix instead of append because prepend is ugly.
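For concreteness, here is a small Python sketch of these definitions (tuples stand in for the maps {1 … n} → X; the helper names are mine): affix, the primitive-recursive scheme F(nulls) = k, F(f affix c) = G(c, F(f)), and prefix defined by exactly the two equations above.

```python
nulls = ()                       # the 0-length sequence (the null map)

def affix(f, c):
    """g = f affix c: extend a length-n sequence with g(n+1) = c."""
    return f + (c,)

def primrec(k, G):
    """Build F with F(nulls) = k and F(f affix c) = G(c, F(f))."""
    def F(f):
        if f == nulls:
            return k
        *rest, c = f
        return G(c, F(tuple(rest)))
    return F

length = primrec(0, lambda c, acc: acc + 1)       # F that counts elements

def prefix(c, f):
    """(c prefix nulls) = nulls affix c; (c prefix (f affix d)) = (c prefix f) affix d."""
    if f == nulls:
        return affix(nulls, c)
    *rest, d = f
    return affix(prefix(c, tuple(rest)), d)

s = affix(affix(nulls, "a"), "b")
print(length(s), prefix("x", s))                  # 2 ('x', 'a', 'b')
```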

Two observations about common usage in computer science that differs from standard mathematical practice. First, note how I’m defining sequences as maps but being imprecise about the function images, about the set X. In CS we really only have one type of object – a sequence of binary digits – so X is always a set of finite sequences of binary digits. This is a fundamental property of computing. So if I wanted to be pedantic, I’d first define the set B of finite sequences of binary digits as the set of all maps g: {1 … n} → {0,1} plus the null map, and then define general finite sequences as maps {1 … n} → B. But we use those binary sequences as representations of other mathematical objects and even objects that are usually not considered mathematical objects, such as MPEG encodings of video. So it’s convenient to speak of a sequence of integers or strings or a sequence of assorted objects without making the translation between representation and denotation (connotation?). The second observation is that, for the same reason, second order functions are not unusual in computer science – e.g. defining prefix as a function that operates on sequences which are also functions. Note however that in non-CS applied math, second order functions are also common.

Painting is The Mathematician by Diego Rivera.

The difference between unspecified, undefined, and non-deterministic

There is too much confusion in the “formal methods” computer science literature between these three different terms. Let me start with what this means for a state machine and then move on to engineering objects such as threads. Suppose we have a map α: E* → X, where E* is the set of finite sequences over an event alphabet E. In an earlier post I explain how these maps are equivalent to state machines. For a finite sequence z we can leave α(z) unspecified – meaning we don’t make any assurances about the value other than that α(z) ∈ X. Alternatively, we can say that α(z) is undefined, which is a definite assertion that α does not associate any element of X with z. If α is considered as a set of pairs (s,x) with the function property that (s,x)∈α and (s,x’)∈α implies x=x’, then specifying that α(z) is undefined specifies that (z,x) ∉ α for any x. So “α(z) is undefined” is, in fact, a quite precise specification. Finally, if we say “α(z) is non-deterministic” we mean that α is not a function at all; it is a relation on E* × X such that (s,x)∈α and (s,x’)∈α and x≠x’ for at least one s, x, and x’. These are all different, but it’s common to mix them up in “formal methods”, which has bad mathematical effects and worse effects when trying to understand how computer systems work.
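The three notions can be made concrete with a small Python sketch (mine, purely illustrative): α represented as a set of (sequence, value) pairs, once undefined at a sequence, once merely unspecified there, and once genuinely non-deterministic.

```python
# alpha represented as a set of (sequence, value) pairs, i.e. a relation on E* x X.

def is_function(rel):
    """A relation is a function iff no sequence is paired with two different values."""
    seen = {}
    for s, x in rel:
        if s in seen and seen[s] != x:
            return False
        seen[s] = x
    return True

# Undefined at ("a","b"): the relation simply contains no pair for that sequence.
alpha_undefined = {((), 0), (("a",), 1)}

# Unspecified at ("a","b"): some value from X is associated, but the specification
# promises nothing about which one (42 here is an arbitrary stand-in).
alpha_unspecified = {((), 0), (("a",), 1), (("a", "b"), 42)}

# Non-deterministic: not a function at all -- two different values for one sequence.
alpha_nondet = {((), 0), (("a",), 1), (("a", "b"), 1), (("a", "b"), 2)}

print(any(s == ("a", "b") for s, _ in alpha_undefined))   # False: truly undefined
print(is_function(alpha_unspecified))                     # True: still a function
print(is_function(alpha_nondet))                          # False: a proper relation
```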

Programs are purely deterministic systems. Given the same inputs, they always compute the same outputs. Without this property, they would not be useful at all. Generally, when researchers say a program is non-deterministic, they are discussing a program that has unspecified or undefined behavior or inputs. It’s that last part that really introduces confusion. For example, it is common to say threaded programs are non-deterministic. But that’s not correct. The apparent non-determinism is due to differences in i/o and scheduling – both of which are inputs. If the operating system schedules a multi-threaded program the same way twice and the input data is the same, the computations will be identical.  The same can be said about i/o.
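A tiny sketch of that claim in Python (the thread bodies and schedules are invented for illustration): if the schedule is treated as an input and replayed, the “multi-threaded” computation is a deterministic function of its inputs, even though different schedules can give different results.

```python
# Threads as generators that yield at each "context switch" point.

def thread_a(state):
    state["x"] = state.get("x", 0) + 1
    yield
    state["x"] *= 2
    yield

def thread_b(state):
    state["x"] = state.get("x", 0) + 10
    yield

def run(schedule):
    """Replay the threads under a fixed schedule: same schedule and inputs, same result."""
    state = {}
    threads = {"A": thread_a(state), "B": thread_b(state)}
    for tid in schedule:
        next(threads[tid], None)     # run thread tid up to its next yield
    return state["x"]

print(run(["A", "B", "A"]))   # 22 on every run with this schedule
print(run(["A", "A", "B"]))   # 12 on every run with this schedule
```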

In fact, the underlying error is a sloppy definition of system state. We can see this most clearly in Robin Milner’s CCS book, where he speaks of the non-deterministic character of a device after somehow including the choices of the external operator or the weather conditions as part of system state. The consequence is that his state machines are non-deterministic, and this makes the math really complicated and slippery – for no good reason.