The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

circularity problems in distributed consensus

Distributed consensus involves organizing a collection of independent agents – processes or network sites – to agree on some value or sequence of values.  Many distributed consensus methods depend on a leader-follower scheme in which the leader is an agent that essentially tells the followers what the values are. The challenges in such methods are to determine when enough of the followers have accepted the value and how to recover from failures of agents. In particular, failures of the leader trigger some procedure to select a new leader.  Leader election, however, is a distributed consensus problem. In fact, leader election is the harder problem. Once there is a leader, consensus in the followers can be produced by a dead simple protocol (see the second part of this ).  Oddly, leader election is generally treated as a minor issue. For example, in “Paxos made simple” we read:

The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response.  So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios).  I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.

Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler.  Ok.

 

Making Paxos face facts

Lamport’s  “Paxos Made Simple” paper is notoriously hard to understand but at least part of the difficulty is that the algorithm  changes radically in the middle of the presentation.  The first part of the paper presents a subtle (maybe too subtle) method to permit multiple processes or network sites to agree on a consensus value.  The second part of the paper switches to a second, much simpler algorithm without much notice.

The paper begins with a problem statement:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been

The Paxos algorithm that is  presented first is liable to what we used to call livelock even in the absence of failure:

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.

It’s argued that this original Paxos works in the sense that it never can reach an inconsistent state (it is “safe” ), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven page paper, Paxos is redefined to restrict the set of  proposers to a single member – a distinguished proposer.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals

Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem.

 

A brute force commit solution to the single proposer problem.

What, exactly do the bolded words in the problem statement mean?  “Chosen” and “Learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that says ” I , process A, propose value V with supporting documentation x“.

A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.

Proposal is a simple action: the proposer sends a proposal message.

There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted or spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message.   A process can propose a value, but that proposal may never arrive at any of the other process or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmit of an accept message by some process  that received the proposal message.  Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes  – a majority of the processes. Learning is also simple:

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.

Suppose that there is a distinguished proposer process D. Suppose that process sends value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer.  Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too. All properties are now satisfied:

  • Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
  • Only a single value is chosen: follows from the first.
  • A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform users and Acceptors that have received notification from D can also inform Learners reliably.

If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some  of the Acceptors fail before any Acceptor has accepted, a poll of Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail,  since a majority of Acceptors must have accepted and a minority have failed, there is at least one Acceptor that knows the accepted value and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If  no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm no foul.

Notice, we have not had to worry about reliable store. All we need is the absence of spurious or corrupted messages and the survival of at least 1/2 of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, the single point of failure but an election mechanism can select new proposers.

(revised)

state equations in practice

 

Let’s step back and think about what a discrete state system looks like.

 

  1. We have a set E of discrete “events” called the event alphabet, The event alphabet defines all the possible events and inputs that can cause state to change.
  2. A finite sequence of events over E then can be thought of as a system history, describing what series of events drove the system from its initial state (which is the point at which no events at all had happened.)  You could complain at this point that deterministic, single thread of execution (the sequence) models can’t describe non-determinism, parallelism, concurrency, and abstract properties, but you’d be wrong – keep reading.
  3. Now we are going to take an odd step – I don’t want to start with a state machine or other object that represents the whole system. I want to start with state variables that express some system property and how it changes in response to events. For example, a state variable Executing might tell us the process identifier of the currently running (executing) process or it could be set valued in a multi-processor environment. And priorityp might be the priority of the process with identifier, p.  As the system advances state, the values of the state variable change but often we want to have properties that are true in any state. For example,  (Executing=p AND priorityq  >  priorityp implies TimeSinceSwitch < t)  would be a good property for some operating system that needs to schedule by priority but also let lower priority processes advance.
  4. These state functions can be thought of as dependent variables that depend on the event sequence and some map that extracts  information from the event sequence. The event sequence is the free variable and the state variable is a dependent variable that embodies some aspect of system state. Everything that determines current state is in the event sequence but state is not a thing, not an object, it’s the cumulative information of multiple state variables depending on that sequence.
  5.  We might have many state variables to specify how an operating system works or even how a simple gate works. Let’s think of all of them at the same level of abstraction as depending on the same event sequence (finite, of course).
  6. The key to this approach is to be able to compactly define state variables and abstract properties of state variables even though the event alphabets and maps on sequences of events will often be enormously complicated.
  7. More precisely, say a variable y is a state variable for an event alphabet E if  y=f(σ) for some function f where σ is a free variable in the set of finite sequences over E which includes the empty sequence Nil.  The map f is a solution for y  that extracts values from the sequence as state advances (as events are appended to the event sequence). Essentially y depends on two parameters: the sequence  and the map. These parameters are often implicit.
  8. By the definition of state variable,  y(Nil) = f(Nil) and represents the value of y in the initial state of the system.
  9. And y(z) = f(z) is the value of y in the state determined by sequence z.
  10. The convention here is that σ is always a free variable over E* (the set of finite sequences over alphabet E). Since y=f(σ) for any w, then y = y(σ), trivially. And I mean this is a trivial, tautological equation, but it’s going to be useful. We can think of y as the value of the state variable in the current state and σ  as the current sequence.
  11. If is a sequence and e is an event or a variable over events, write ze for the sequence obtained by appending e to z on the right.  So y(σe) is the value of y in the “next state” – the state after e. If  y = y(σe) then the event “e” leaves the value of unchanged. If y(σe) = y +2  then the event “e” increases the value of  by 2. Note that if y=f(σ)  then y(σe) = f(σe) so  (y(σe)=y+2) can be rewritten as f(σe)=f(σ) + 2.  I’m sorry to have to explain basic familiar algebra, but this work is an attempt to express properties of systems in, mostly, basic familiar algebra.
  12. Suppose we define y by a pair of equations: y(Nil)= k and y(σe) = h(e, y). If k is some constant and h is a defined map, then we have implicitly defined f on every finite sequence and so implicitly defined y in every state. Note y(Nil) = f(Nil)=k. And y(σe)= h(e,y) = h(e,y(σ)) = h(e,f(σ)). Suppose sequence w=<a,b.c>, then y(w) = f(<a,b,c>) = h(c,f(<a,b>)) = h(c,h(b,f(<a>)) = h(c,h(b,h(a,k)))).  But in many cases we don’t want to specify f completely – f is a solution to the constraints on y and there may be many or infinitely many or no solutions.
  13. For example, suppose C is a state variable and C(Nil)=0 and C(σe) = 1+C. Then we have defined C to be the length of the event sequence. Define L(σ e) = e so L is the most recent event. Define SinceReset(Nil) = 0 and SinceReset(σ e) = 0 if e=RESET and SinceReset(σ e) = 1+ SinceReset otherwise.
  14. Suppose we have a state variable Clock that somehow determines time passed since initial state from the event sequence. Then (Clock(σ e)- Clock) is the duration of event e in this state (and may be different for different values of  σ ). Define  waitingp(Nil)=0 and waitingp(σ e)=0 if Executing(σ e) =p and   waitingp(σ e)= waitingp+(Clock(σ e)- Clock)  otherwise. So  waitingp  is the time that process has been waiting to become the executing process. We might want a property (waiting> k only if priorityp prioritywhere q = Executing).
  15. Ok, now we can consider component composition – which is just function composition. Consider the state variable L defined above and suppose we have a new alphabet B consisting of Leftx and Rightx for x in some set X of data values. The goal is to construct a shift register from instances of L.  Define Li =  L(ui) where ui is also a state variable that has values that are sequences of elements of X. Usually we want components to state in the initial state so ui(Nil) = Nil.  This means Li (Nil)= L(Nil) which is something we didn’t define. Now define ui(σ e) where “e” is an element of the event alphabet of the composite system.  Remember I’m using juxtaposition to mean “append on the right” so, for example (uiLi-1 ) means “append the current value of Li-1 to the current value of ui
    1. ui(σ e)=  uix if i=1 and e=Rightxor if i=n and e=Leftx
    2. ui(σ e)=  ui Li-1  if i>1 and e=Leftx
    3. ui(σ e) =  ui   Li+1  if i<n and e=Rightx
  16. The definition above causes all n components to change state in parallel when an event is appended to the event sequence of the composite system – but it’s just composition: Li =  L(ui)=  L(ui(σ)) and Li (σ e)= L(ui(σ e) ) =L(ui(σ) e’) ) = e’ where e’ is calculated as above.
  17. If  a device D that is constructed from components  C …  Cn  that are interconnected. D is a function of σ but the components don’t “see” the same events. For each component, we need a state variable  ui   that is sequence valued and a function of  σ . Usually these are defined recursively general   ui(Nil) = Nil  and ui(σe) =  concatenate(ui, hi(e,C1(u1 )…  Cn(un ))) or something like that. That is, each event e for D causes event or events   hi(e,C1(u1 )…  Cn(un ))  to be appended to the sequence of events ui,.

Cutting and pasting about Kodak’s demise

kodakThis graph from Peter Diamandis about how Kodak entered the sinkhole is kind of amazing. Diamandis explains Kodak’s failure to swerve in what is I think the orthodox Silicon Valley analysis:

[in 1996] Kodak had a $28 billion market cap and 140,000 employees.

In 1976, 20 years earlier, Kodak had invented the digital camera. They owned the IP and had the first mover advantage. This is a company that should have owned it all.

Instead, in 2012, Kodak filed for bankruptcy, put out of business by the very technology they had invented.

What happened? Kodak was married to the “paper and chemicals” (film development) business… their most profitable division, while the R&D on digital cameras was a cost center. They saw the digital world coming on, but were convinced that digital cameras wouldn’t have traction outside of the professional market. They certainly had the expertise to design and build consumer digital cameras — Kodak actually built the Apple QuickTake (see photo), generally considered the world’s first consumer digital camera. Amazingly, Kodak decided they didn’t even want to put their name on the camera.

There is more of the same (2012 and before) in the MIT Technology review. This is a totally convincing story (it had me convinced), but it leaves out three things:

  1. the boring old chemicals division of Eastman Kodak which was spun off in 1993 (three years earlier) is still around, profitable ( $10B/year revenue) and dwarfs what’s left of the original company,
  2. and  Fuji films, Kodak’s also ran in chemical film space managed to leap over the sinkhole and prosper, but not by relying on digital cameras.
  3. Kodak did briefly become a market leader in digital cameras but ran into a more fundamental problem.

Back in 1996, the business press and analysts thought Kodak was doing the right thing by divesting its chemical business.

NEW YORK — Eastman Kodak Co., struggling against poor profit and high debt, Tuesday took a big step in its corporate restructuring, announcing that it will divest Eastman Chemical Co. and in one fell swoop wipe out $2 billion of debt.

Such a spinoff would not have occurred just a few years ago, analysts said, and the move signals that Chief Executive Kay R. Whitmore is responding to new, tougher markets and stockholder pressure to improve financial results quickly.

“They are now recognizing that they are not a growth company, that they must go through this downsizing,” analyst Eugene Glazer of Dean Witter Reynolds said in an interview on CNBC.

Kodak’s shares, up sharply Monday in anticipation of the announcement, ended down $1.375 to $52.375 on the New York Stock Exchange.

[…] “We determined that there was little strategic reason related to our core imaging and health business for Kodak to continue to own Eastman,” Whitmore said at a news conference.

Kodak, best known for photography products but also a major pharmaceutical and chemicals group, has endured slow growth for years. Its photography business has been hit by changing demographics, foreign rivals and new technologies such as camcorders. Whitmore said Kodak sales, especially in photography and imaging, were weak.

[…] Costs will be reduced elsewhere in the company, Whitmore said. Other executives also have said Kodak will cut spending on research and development of new products.

In retrospect, dumping the cash generating parts of the business and cutting R&D was  not the best plan even if Wall St. analysts loved the idea. But it’s easy to be a genius after the fact as Willy Shih points out:

Responding to recommendations from management experts, from the mid-1990s to 2003 the company set up a separate division (which I ran) charged with tackling the digital opportunity. Not constrained by any legacy assets or practices, the new division was able to build a leading market share position in digital cameras — a position that was essentially decimated soon thereafter when smartphones with built-in cameras overtook the market.

Yes, those camera phones – which not too many people saw coming in the 1990s. Not only that, but Kodak’s path in digital imaging was not obvious.

The transition from analog to digital imaging brought several challenges. First, digital imaging was based on a general-purpose semiconductor technology platform that had nothing to do with film manufacturing — it had its own scale and learning curves. The broad applicability of the technology platform meant that it could be scaled up in numerous high-volume markets (such as microprocessors, logic circuits, and communications chips) apart from digital imaging. Suppliers selling components offered the technology to anyone who would pay, and there were few entry barriers. What’s more, digital technology is modular. A good engineer could buy all the building blocks and put together a camera. These building blocks abstracted almost all the technology required, so you no longer needed a lot of experience and specialized skills.

Semiconductor technology was well outside of Kodak’s core know-how and organizational capabilities. Even though the company invested lots of money in the basic research and manufacturing of solid-state semiconductor image sensors and developed some notable inventions (including the color filter array that is used on virtually every color image sensor), it had little hope of being a competitive volume supplier of image sensor components, and it was difficult for Kodak to offer something distinctive.

And Shih, perhaps unintentionally, reinforces Diamandis’s point that the top company managers failed to face up to the problem.

For many managers of legacy businesses, the survival instinct kicked in. Some who had worked at Kodak for decades felt they were entitled to be reassigned to the new businesses, or wished to control sales channels for digital products. But that just fueled internal strife. Kodak ended up merging the consumer digital, professional, and legacy consumer film divisions in 2003. Kodak then tried to make inroads in the inkjet printing business, spending heavily to compete with fortified incumbents such as HP, Canon, and Epson. But the effort failed, and Kodak exited the printer business after it filed for Chapter 11 bankruptcy reorganization in 2012.

Management chaos and “spending heavily to compete with fortified incumbents”.

Yep.

With the benefit of hindsight, it’s interesting to ask how Kodak might have been able to achieve a different outcome. One argument is that the company could have tried to compete on capabilities rather than on the markets it was in. This would have meant directing its skills in complex organic chemistry and high-speed coating toward other products involving complex materials — a path followed successfully by Fuji. However, this would have meant walking away from a great consumer franchise. That’s not the logic that managers learn at business schools, and it would have been a hard pill for Kodak leaders to swallow.

it would have been a hard pill for Kodak leaders to swallow.

But wasn’t that their job? So to conclude this exercise in cut and paste, what about Fuji? The Economist had an interesting take:

 the digital imaging sector accounts for only about one-fifth of Fujifilm’s revenue, down from more than half a decade ago.

How Fujifilm succeeded serves as a warning to American firms about the danger of trying to take the easy way out: competing through one’s marketing rather than taking the harder route of developing new products and new businesses. […]

Like Kodak, Fujifilm realised in the 1980s that photography would be going digital. Like Kodak, it continued to milk profits from film sales, invested in digital technologies, and tried to diversify into new areas. Like Kodak, the folks in the wildly profitable film division were in control and late to admit that the film business was a lost cause. As late as 2000 Fujifilm counted on a gentle 15 or 20-year decline of film—not the sudden free-fall that took place. Within a decade, film went from 60% of Fujifilm’s profits to basically nothing.

If the market forecast, strategy and internal politics were the same, why the divergent outcomes? The big difference was execution.

Fujifilm realised it needed to develop in-house expertise in the new businesses. In contrast, Kodak seemed to believe that its core strength lay in brand and marketing, and that it could simply partner or buy its way into new industries, such as drugs or chemicals. The problem with this approach was that without in-house expertise, Kodak lacked some key skills: the ability to vet acquisition candidates well, to integrate the companies it had purchased and to negotiate profitable partnerships. “Kodak was so confident about their marketing capability and their brand, that they tried to take the easy way out,” says Mr Komori.

Fujifilm realised it needed to develop in-house expertise in the new businesses.

ok.

 

Synchronous processors

Imagine a processor with no interrupts. We can do a lot better and get rid of most exceptions (e.g. system calls, page faults etc.), most peripheral devices/buses, and even cache misses, but let’s start with interrupts. Modern micro-processors are bloated with circuitry that is designed to allow the precious CPU to switch back and forth between streams of code because cpu cycles were the scarcest resource. That was a great idea in the 1950s. But interrupts are expensive in compute time circuit complexity, and chip area. Interrupts  take microseconds to start processing – which is an eternity on a Ghz processor.  And they solve a problem that does not exist anymore: we have a lot of processor cores. In fact, one problem faced by computer architects is that it’s not easy to exploit parallel processing if you keep thinking in terms of 1950s computer architecture.

Suppose you have 32 cores and you make one or two of them poll I/O devices and the other 30 never get I/O interrupts, never run interrupt handling code, never have to save and restore state or context switch due to I/O interrupts.  What they do is run application code to completion or for a specified number of cycles or clock ticks. The cores reserved for operating system use manage all devices and even they don’t need interrupts because they can run a real-time OS that polls devices.

Continue reading Synchronous processors

IEEE 1588 PTP is a mess

IEEE 1588 was not designed for modern enterprise computer networks and contains many hacks to make it sort of work. The standard also suffers from being overly explicit on some things and overly unspecific  on others.  One marker of the flawed process is that IEEE 1588 transparent clocks don’t really comply with Ethernet standards because they modify packets without changing the MAC address. So in 2012 the 802.1 and 1588 standards groups started discussing what could be done. The 1588 committee notes that the “intent” (and practice) violates OSI layering but that 1588 doesn’t “mandate” that intent! Oy vey.

Questions have been raised concerning an IEEE 1588-2008 Transparent Clock layer 2 bridge modifying the CorrectionField of Ethernet transported PTP frames without changing the Ethernet source MAC address.  The question is if this operation is permitted by IEEE 802.1Q [1].  The original intent of the IEEE 1588-2008 standard was that a Transparent Clock will forward PTP event frames with no modifications except for the CorrectionField and FCS updates, however IEEE 1588-2008 does not mandate that.